{"id":21153,"date":"2021-05-13T11:30:44","date_gmt":"2021-05-13T15:30:44","guid":{"rendered":"https:\/\/circulatingnow.nlm.nih.gov\/?p=21153"},"modified":"2024-10-09T12:49:14","modified_gmt":"2024-10-09T16:49:14","slug":"exploring-the-data-of-web-archives-as-part-of-data-science-nlm","status":"publish","type":"post","link":"https:\/\/circulatingnow.nlm.nih.gov\/2021\/05\/13\/exploring-the-data-of-web-archives-as-part-of-data-science-nlm\/","title":{"rendered":"Exploring the Data of Web Archives as Part of Data Science @ NLM"},"content":{"rendered":"

By Christie Moffatt ~<\/em><\/p>\n

Over the past year (May 2020\u2013May 2021) I participated in a National Library of Medicine (NLM) Data Science Mentorship program, which is part of a broader Data Science @ NLM<\/a> training program designed to prepare staff to engage with and participate in the Library’s developing data science efforts. My mentor was NLM Computer Scientist Marie Gallagher, and our project was to gain a better understanding of and practical experience with tools and techniques for the exploration and study of the data of web archives.<\/p>\n

NLM has been actively involved in building collections of web archives<\/a> through the work of our Web Collecting and Archiving Group.\u00a0 A team of archivists, librarians, and historians is primarily using Archive-It<\/a>, a service of the Internet Archive, to collect web content on a broad range of topics in line with NLM collection development policies, including HIV\/AIDS<\/a>, the Opioid Epidemic<\/a>, the 2014 Ebola Outbreak<\/a>, and, currently, the COVID-19 Pandemic<\/a>.\u00a0 As a member of this working group, I was interested in learning more about tools such as Archives Unleashed<\/a> and the GLAM Workbench<\/a> to better understand the work and needs of researchers, and in exploring the possibility of using these tools to support ongoing collection development and curation.\u00a0 I had participated in an Archives Unleashed datathon<\/a> in 2019 and recognized that I needed much more hands-on experience to better understand the nature of the tools, how to use them, and the broader picture of web archives data and research.\u00a0 The NLM Data Science Mentorship program provided a wonderful opportunity to collaborate and learn more about the data of web archives, as well as project design, experimental thinking, science communication, and data storytelling.<\/p>\n

The Archives Unleashed project, supported by The Andrew W. Mellon Foundation, provides a set of tools designed to lower barriers for researchers to explore web archives.\u00a0 The Archives Unleashed tools<\/a> are designed for different levels of experience including the Archives Unleashed Cloud (for beginners), Archives Unleashed Notebooks (for beginner\/intermediate users), and Archives Unleashed Toolkit (for advanced users).\u00a0 My mentor and I reviewed each of these tools and focused on the Archives Unleashed Cloud<\/a> (migrating soon to Archive-It) to query the data of individual NLM web archive collections and obtain derivative data files for further analysis.<\/p>\n

We uploaded the resulting derivative data files into a variety of data visualization and text analysis tools and learned several lessons: the value of a flexible computing environment in which to install software, the need for advanced data cleaning skills, and, more generally, the need for patience and flexibility.\u00a0 I also gained an appreciation for the complexity of the analysis tools and for the time needed to understand how they interpret and present the data.<\/p>\n

In one experiment, following learning guides from Archives Unleashed, we loaded one of the derivative files (the GEXF<\/a> file) into an open-source graph visualization program called Gephi<\/a> to create a visualization of the network of nodes (domains) and edges (hyperlinks between them) for a small collection<\/a> of sites related to the NLM exhibition Confronting Violence, Improving Women\u2019s Lives<\/a><\/em>.<\/p>\n