# 🕵️ Track 3 (Technical)
This track focuses on the actual capture of at-risk data in a variety of formats. As these tasks require the most technical knowledge, skills, and equipment, volunteers are encouraged to take this track when they are able to dedicate more time.

**Tech Skill Level:** Advanced\
**Time Commitment:** \~2-3 hours

**Tools Required (vary across tasks):**
* Web capture tools ([Conifer](https://guide.conifer.rhizome.org/), [Archive-It](https://archive-it.org/), [Webrecorder](https://webrecorder.io/), [wget](https://www.gnu.org/software/wget/); [more information on web archiving](https://bits.ashleyblewer.com/blog/2017/09/20/how-do-web-archiving-frameworks-work/))
* Data quality check system
* Spreadsheet editor (Excel, Google Sheets)
* Web monitoring tool
* Storage (available internal memory, external hard drive)

**Tasks Include:**
1. Set up website monitoring systems
2. Capture website content
3. Harvest public datasets
4. Review data authenticity and quality
5. Program or conduct a comprehensive data/website crawl

**Breakdown of Task Sections**\
🚁 _(helicopter emoji)_ gives a summary of the task\
🗂️ _(index dividers emoji)_ outlines the specific steps needed to complete the task\
🛠️ _(hammer & wrench emoji)_ details the skills & tools needed for the task
### TASKS BREAKDOWN
#### <mark style="background-color:purple;">1. Set up monitoring API tracker to document changes to government websites</mark>
🚁**Summary:** Given the previous removal of content from, and subtle revisions to, federal government environmental websites, these pages need to be actively monitored so that changes can be detected and documented.

🗂️**Workflow**
1. Read or skim the following report on website monitoring by EDGI
   1. Report link: [https://envirodatagov.org/publication/changing-digital-climate/](https://envirodatagov.org/publication/changing-digital-climate/)
2. Download a monitoring tool, for example:
   1. HTTP API tracker: [https://github.com/edgi-govdata-archiving/web-monitoring-db](https://github.com/edgi-govdata-archiving/web-monitoring-db)
   2. A comprehensive list of other tools: [https://github.com/edgi-govdata-archiving/awesome-website-change-monitoring](https://github.com/edgi-govdata-archiving/awesome-website-change-monitoring)
3. Identify a website to track using [this Data Tracking List](https://docs.google.com/spreadsheets/d/1tOS7B3lgK-8wdgyhY81ntfICMIkGwAiHfeV63hi3UzU/edit?usp=drive_link)
4. Deploy the tracker for the selected website (a minimal change-detection sketch follows this list)
5. Submit information about the tracked website to [the Data Tracking form](https://docs.google.com/forms/d/e/1FAIpQLSfII-rl4yUcGPJlPWk9knWMhC_qBueJLEPcC7vphPeVisLhHA/viewform?usp=sf_link)
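
Before deploying a full tracker, it can help to see the core idea. The Python sketch below is a minimal illustration, not the EDGI web-monitoring-db system: it fetches a page with the `requests` library, hashes the response, and compares it against the last hash recorded in a local JSON file. The example URL and the `page_hashes.json` file name are hypothetical placeholders.

```python
# Minimal illustration of change detection: fetch a page, hash it, and compare
# with the last recorded hash. This is NOT the EDGI web-monitoring-db stack,
# just the underlying idea; the URL and state-file name are hypothetical.
import hashlib
import json
import pathlib

import requests  # third-party: pip install requests

STATE_FILE = pathlib.Path("page_hashes.json")  # local record of last-seen hashes


def page_changed(url: str) -> bool:
    """Return True if the page content differs from the last recorded hash."""
    response = requests.get(url, timeout=30)
    response.raise_for_status()
    digest = hashlib.sha256(response.content).hexdigest()

    state = json.loads(STATE_FILE.read_text()) if STATE_FILE.exists() else {}
    changed = state.get(url) != digest

    state[url] = digest
    STATE_FILE.write_text(json.dumps(state, indent=2))
    return changed


if __name__ == "__main__":
    # Hypothetical example URL; substitute a page from the Data Tracking List.
    if page_changed("https://www.example.gov/environmental-data"):
        print("Changed (or first capture) -- note it in the Data Tracking form.")
```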
🛠️**Skills Needed:** Advanced understanding of software deployment, APIs, and working with git repositories.
#### <mark style="background-color:purple;">2. Capture web files/data</mark>
🚁**Summary:** Collecting web archives (meaning webpages and the content within them) can be complex, but it is necessary. Using more user-friendly software, volunteers who are not digital preservationists can help capture selected content from websites without worrying about collecting a website's entire structure.

🗂️**Workflow**

1. Identify a web file [ready to be captured](https://docs.google.com/spreadsheets/d/1tOS7B3lgK-8wdgyhY81ntfICMIkGwAiHfeV63hi3UzU/edit?usp=sharing).
2. Comment (using the comment function) on that row's "Status" cell to note that you are working on it.
3. Using web capture software (like [Conifer](https://guide.conifer.rhizome.org/)), capture the at-risk website and the at-risk data it includes (a minimal wget-based sketch follows this list).
4. Update the same "Status" cell to indicate that the web file/data has been archived.
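
For volunteers comfortable with the command line, wget (from the tools list above) can also capture a page and write a WARC file; Conifer and Webrecorder remain the friendlier route for dynamic or interactive sites. The Python sketch below is a rough illustration, not a replacement for those tools: it shells out to wget with a handful of common options and assumes wget is installed and on the PATH. The URL and output name are hypothetical placeholders.

```python
# Hedged sketch: drive wget (from the tools list) from Python to capture one
# page plus its requisites, and emit a WARC file for preservation.
# Requires wget on the PATH; the URL and output name are hypothetical.
import subprocess


def capture_page(url: str, warc_name: str) -> None:
    """Capture a single page; wget writes <warc_name>.warc.gz next to the HTML."""
    subprocess.run(
        [
            "wget",
            "--page-requisites",         # also fetch images, CSS, JS the page needs
            "--convert-links",           # rewrite links so the local copy is browsable
            f"--warc-file={warc_name}",  # write a WARC alongside the plain files
            url,
        ],
        check=True,  # raise if wget exits with an error
    )


if __name__ == "__main__":
    capture_page("https://www.example.gov/at-risk-report", "at-risk-report-capture")
```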
🛠️**Skills Needed:** Intermediate understanding of software deployment and website navigation.
#### <mark style="background-color:purple;">3. Harvest public datasets available online</mark>
🚁**Summary:** Some state and federal agencies are required by law to publish data, publications, and basic information about publicly funded projects (think grants and contracts). Given changes in agency personnel and systems, as well as in the financial support that pays for database services and storage, the data stored in these repositories may not always remain available to the public. Saving copies helps ensure future access to this information and to the record of past government activities and areas of interest.

🗂️**Workflow**
1. Search for publicly funded project repositories (examples include: NIH [RePORTER](https://reporter.nih.gov/), US Government Awards [USASpending](https://www.usaspending.gov/search), Federal Audit Clearinghouse [FAC](https://app.fac.gov/dissemination/search/))
2. Verify that downloadable datasets contain enough descriptive information (data files, interactive maps, etc.)
3. Capture the dataset(s) to internal storage as a temporary location (a minimal download sketch follows this list)
4. Submit and upload the dataset(s) to [this Data Tracking Form](https://docs.google.com/forms/d/e/1FAIpQLSfII-rl4yUcGPJlPWk9knWMhC_qBueJLEPcC7vphPeVisLhHA/viewform?usp=sf_link)
5. You can delete the dataset(s) after successful submission via the form
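
A minimal sketch of step 3, assuming the dataset is exposed as a direct download link: it streams the file to a temporary local folder so large downloads do not need to fit in memory. The example URL and the `rescued_datasets` folder are hypothetical placeholders.

```python
# Minimal sketch: stream a dataset export to a temporary local folder before
# uploading it through the Data Tracking Form. The URL and folder name are
# hypothetical placeholders; use a real export link from RePORTER,
# USASpending, the FAC, or a similar repository.
import pathlib

import requests  # third-party: pip install requests


def download_dataset(url: str, dest_dir: str = "rescued_datasets") -> pathlib.Path:
    """Stream a (possibly large) file to disk and return the local path."""
    dest = pathlib.Path(dest_dir)
    dest.mkdir(parents=True, exist_ok=True)
    local_path = dest / url.rsplit("/", 1)[-1]

    with requests.get(url, stream=True, timeout=60) as response:
        response.raise_for_status()
        with open(local_path, "wb") as fh:
            for chunk in response.iter_content(chunk_size=1024 * 1024):
                fh.write(chunk)
    return local_path


if __name__ == "__main__":
    saved = download_dataset("https://www.example.gov/downloads/awards_export.csv")
    print(f"Saved to {saved}; delete it locally after the form submission succeeds.")
```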
🛠️**Skills Needed:** Intermediate understanding of different dataset types and file formats. Comfort with downloading and saving larger files.
#### <mark style="background-color:purple;">4. Create Bag/Create checksum (save for Data Rescue Day 2 - Jan 22)</mark>
🚁**Summary:** This task helps short- and long-term preservation efforts verify the integrity (fixity) of stored files and datasets. Creating checksums, or reviewing existing ones, helps detect transfer or creation errors as well as signs of tampering; packaging the files and their checksums into a "bag" is a common way to keep them together (see the sketch at the end of this task).

🗂️**Workflow**
1. Read through the [digital preservation handbook chapter on fixity and checksums by the Digital Preservation Coalition](https://www.dpconline.org/handbook/technical-solutions-and-tools/fixity-and-checksums)
2. Download a fixity or checksum verification tool, such as:
   1. [Md5summer](https://md5summer.org/): an application for Windows machines that will generate and verify MD5 checksums.
3. Identify a dataset to checksum using this [Data Tracking List - Data Rescue 2025 (Responses)](https://docs.google.com/spreadsheets/d/1tOS7B3lgK-8wdgyhY81ntfICMIkGwAiHfeV63hi3UzU/edit?usp=drive_link)
4. Run the tool on the selected data to create the supplemental checksum value (a minimal sketch follows this list)
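
A minimal sketch of the last two steps, assuming the rescued files sit in a local folder: it hashes each file with SHA-256 from Python's standard library and writes the values to a simple manifest. Md5summer produces MD5 values instead; either algorithm works for a basic fixity check. The folder and manifest names are hypothetical placeholders.

```python
# Minimal fixity sketch: hash every file in a rescued-dataset folder and write
# a manifest of "<hash>  <relative path>" lines (a layout similar to a BagIt
# manifest). Folder and manifest names are hypothetical placeholders.
import hashlib
import pathlib


def checksum_file(path: pathlib.Path, algorithm: str = "sha256") -> str:
    """Hash a file in chunks so large datasets do not need to fit in memory."""
    digest = hashlib.new(algorithm)
    with open(path, "rb") as fh:
        for chunk in iter(lambda: fh.read(1024 * 1024), b""):
            digest.update(chunk)
    return digest.hexdigest()


def write_manifest(dataset_dir: str, manifest_name: str = "manifest-sha256.txt") -> None:
    """Checksum every file under dataset_dir and record the values in one manifest."""
    root = pathlib.Path(dataset_dir)
    lines = [
        f"{checksum_file(p)}  {p.relative_to(root)}"
        for p in sorted(root.rglob("*"))
        if p.is_file() and p.name != manifest_name  # skip the manifest itself
    ]
    (root / manifest_name).write_text("\n".join(lines) + "\n")


if __name__ == "__main__":
    write_manifest("rescued_datasets")  # hypothetical folder from the harvesting task
```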
🛠️**Skills Needed:** Best for those with basic data or web archiving experience, or those with both strong tech skills and attention to detail.
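
For the "Create Bag" half of this task, the Library of Congress's bagit-python library can package a folder into a BagIt bag, generating the checksum manifests in the process. The sketch below assumes that library is installed (`pip install bagit`); the folder name and contact metadata are hypothetical placeholders.

```python
# Sketch of bag creation with the Library of Congress bagit-python library
# (pip install bagit). The folder name and contact metadata are hypothetical.
import bagit

# Converts the folder in place: files move into a data/ subdirectory, and
# bagit.txt, bag-info.txt, and checksum manifests are written at the top level.
bag = bagit.make_bag("rescued_datasets", {"Contact-Name": "Data Rescue volunteer"})

# Later (or on another machine), re-open the bag and verify its fixity.
bag = bagit.Bag("rescued_datasets")
print("Bag is valid" if bag.is_valid() else "Bag failed validation")
```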