
🕵️ Track 3 (Technical)

This track focuses on the actual capture of at-risk data in a variety of formats. Because these tasks require the most technical knowledge, skills, and equipment, volunteers are encouraged to choose this track when they can dedicate more time.

Tech Skill Level: Advanced

Time Commitment: ~2-3 hours

Tools Required (vary across tasks):

Tasks Include:

  1. Set up website monitoring systems
  2. Capture website content
  3. Harvest public datasets
  4. Review data authenticity and quality
  5. Program or conduct a comprehensive data/website crawl

TASKS BREAKDOWN

1. Set up a monitoring API tracker to document changes to government websites

Summary: Given the previous removal of content from, and subtle revisions to, federal government environmental websites, many of these pages need ongoing monitoring so that future changes can be detected and documented.

Workflow

  1. Read or skim the following report on website monitoring by EDGI
    1. Report link: https://envirodatagov.org/publication/changing-digital-climate/
  2. Download a monitoring tool, such as:
    1. EDGI's web-monitoring-db (HTTP API tracker): https://github.com/edgi-govdata-archiving/web-monitoring-db
    2. A comprehensive list of other tools: https://github.com/edgi-govdata-archiving/awesome-website-change-monitoring
  3. Identify a website to track using this Data Tracking List
  4. Deploy the tracker for the selected website (a simple change-detection sketch is included after this task)
  5. Submit information about the tracked website to the Data Tracking form

Skills Needed: Advanced understanding of software deployment, APIs, and working with git repositories.
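
If you are not ready to deploy the full EDGI web-monitoring stack, the sketch below illustrates the core idea behind change monitoring: fetch a page, hash the response, and compare it with the hash recorded on the previous run. The URL and state-file name are placeholder assumptions, not part of any official workflow, and a real deployment would also run on a schedule (e.g., cron) and keep full snapshots.

```python
"""Minimal change-detection sketch (not the EDGI web-monitoring-db stack).

Fetches a page, hashes the body, and compares it to the last recorded hash.
The URL and state-file name below are placeholders for illustration only.
"""
import hashlib
import json
import urllib.request
from datetime import datetime, timezone
from pathlib import Path

STATE_FILE = Path("monitor_state.json")   # assumed local state file
URL = "https://www.example.gov/climate"   # placeholder; pick from the Data Tracking List


def fetch_hash(url: str) -> str:
    """Return a SHA-256 hash of the raw page body."""
    with urllib.request.urlopen(url, timeout=30) as resp:
        return hashlib.sha256(resp.read()).hexdigest()


def check_for_change(url: str) -> None:
    """Compare the current page hash against the stored one and report changes."""
    state = json.loads(STATE_FILE.read_text()) if STATE_FILE.exists() else {}
    new_hash = fetch_hash(url)
    old_hash = state.get(url)
    if old_hash and old_hash != new_hash:
        print(f"CHANGE DETECTED at {url} ({datetime.now(timezone.utc).isoformat()})")
    state[url] = new_hash
    STATE_FILE.write_text(json.dumps(state, indent=2))


if __name__ == "__main__":
    check_for_change(URL)
```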

2. Capture web files/data

Summary: Collecting web archives (meaning webpages and the content within them) can be complex, but it is necessary. Using more user-friendly software, volunteers who are not digital preservationists can help capture selected website content without worrying about collecting the entire structure of a site.

Workflow

  1. Identify a web file ready to be archived
  2. Comment on the Status cell that you are working on that row
  3. Using web capture software (such as Conifer), capture an at-risk website that includes at-risk data (a single-page snapshot sketch is included after this task)
  4. Comment on the same Status cell that the web file/data has been archived

Skills Needed: Intermediate understanding of software deployment and website navigation.
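
Conifer and similar tools do the heavy lifting of recording a full browsing session, but for a quick snapshot of a single page a short script can save a timestamped copy. The URL and output folder below are placeholder assumptions; note this captures only the HTML of one page, not linked images, scripts, or data files, which is why WARC-producing tools remain the preferred option.

```python
"""Quick single-page snapshot sketch (a supplement to, not a replacement for,
browser-based capture tools such as Conifer). URL and folder are placeholders."""
import urllib.request
from datetime import datetime, timezone
from pathlib import Path

URL = "https://www.example.gov/data-page"   # placeholder at-risk page
OUT_DIR = Path("snapshots")                 # assumed local output folder


def snapshot(url: str) -> Path:
    """Save the raw response body with a UTC timestamp in the filename."""
    OUT_DIR.mkdir(exist_ok=True)
    stamp = datetime.now(timezone.utc).strftime("%Y%m%dT%H%M%SZ")
    out_path = OUT_DIR / f"snapshot-{stamp}.html"
    with urllib.request.urlopen(url, timeout=60) as resp:
        out_path.write_bytes(resp.read())
    return out_path


if __name__ == "__main__":
    print(f"Saved {snapshot(URL)}")
```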

3. Harvest public datasets available online

Summary: Many public datasets are available for direct download from federal repositories; capturing copies now preserves them in case they are removed later.

Workflow

  1. Search public funding project repositories (e.g., NIH RePORTER, USASpending for US government awards, the Federal Audit Clearinghouse (FAC))
  2. Verify that the downloadable dataset(s) contain enough descriptive information (data files, interactive maps, etc.)
  3. Capture the dataset(s) to internal storage (a temporary location); a download sketch is included after this task
  4. Submit and upload the dataset(s) to this Data Tracking Form
  5. After a successful submission via the form, you can delete your local copy of the dataset(s)

Skills Needed: Intermediate understanding of different dataset types and file formats. Comfort with downloading and saving larger files.
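
When capturing a dataset to temporary storage (step 3), it helps to keep a small metadata sidecar so the tracking form can be filled out accurately later. The sketch below downloads one file and records its source URL, size, and retrieval time; the dataset URL, folder name, and sidecar fields are illustrative assumptions, not requirements of the form.

```python
"""Dataset download sketch with a metadata sidecar. The dataset URL and local
folder are placeholders; use the repository links you actually identified."""
import json
import urllib.request
from datetime import datetime, timezone
from pathlib import Path

DATASET_URL = "https://www.example.gov/downloads/awards.csv"  # placeholder
STAGING_DIR = Path("staging")                                 # temporary storage


def harvest(url: str) -> Path:
    """Download the file and write a small JSON sidecar next to it."""
    STAGING_DIR.mkdir(exist_ok=True)
    filename = url.rsplit("/", 1)[-1] or "dataset.bin"
    data_path = STAGING_DIR / filename
    with urllib.request.urlopen(url, timeout=120) as resp:
        data_path.write_bytes(resp.read())  # fine for moderate sizes; stream very large files
    sidecar = {  # details that are likely useful when submitting the tracking form
        "source_url": url,
        "size_bytes": data_path.stat().st_size,
        "retrieved_at": datetime.now(timezone.utc).isoformat(),
    }
    meta_path = data_path.with_name(data_path.name + ".meta.json")
    meta_path.write_text(json.dumps(sidecar, indent=2))
    return data_path


if __name__ == "__main__":
    print(f"Captured {harvest(DATASET_URL)}")
```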

4. Create bags/checksums (save for Data Rescue Day 2 - Jan 22)

Summary: Bagging files and creating checksums helps short- and long-term preservation efforts verify the integrity (fixity) of stored files and datasets. Creating checksums, or reviewing existing ones, helps detect errors or signs of tampering (a checksum sketch is included at the end of this task).

Workflow

Skills Needed: Best for those with basic data or web archiving experience, or those who have both strong tech skills and attention to detail.
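
Checksums can be produced with nothing more than the Python standard library, while full BagIt packaging is usually handled with the Library of Congress bagit-python package, which generates the bag structure and manifests in one step. The sketch below covers the checksum half only; the folder name is a placeholder assumption. Re-running it later and comparing the output is one simple way to confirm that files have not changed or become corrupted.

```python
"""Fixity sketch: compute a SHA-256 checksum for every file in a folder.
For full BagIt packaging, the bagit-python package (pip install bagit) offers
bagit.make_bag(), which builds the bag structure and manifests for you.
The folder name below is a placeholder."""
import hashlib
from pathlib import Path

DATA_DIR = Path("staging")  # placeholder folder of rescued files


def sha256_of(path: Path, chunk_size: int = 1 << 20) -> str:
    """Stream the file in chunks so large datasets need not fit in memory."""
    digest = hashlib.sha256()
    with path.open("rb") as fh:
        for chunk in iter(lambda: fh.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


if __name__ == "__main__":
    for file_path in sorted(DATA_DIR.rglob("*")):
        if file_path.is_file():
            print(f"{sha256_of(file_path)}  {file_path}")
```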