{"id":21153,"date":"2021-05-13T11:30:44","date_gmt":"2021-05-13T15:30:44","guid":{"rendered":"https:\/\/circulatingnow.nlm.nih.gov\/?p=21153"},"modified":"2024-10-09T12:49:14","modified_gmt":"2024-10-09T16:49:14","slug":"exploring-the-data-of-web-archives-as-part-of-data-science-nlm","status":"publish","type":"post","link":"https:\/\/circulatingnow.nlm.nih.gov\/2021\/05\/13\/exploring-the-data-of-web-archives-as-part-of-data-science-nlm\/","title":{"rendered":"Exploring the Data of Web Archives as Part of Data Science @ NLM"},"content":{"rendered":"
By Christie Moffatt ~<\/em><\/p>\n Over the past year (May 2020\u2013May 2021) I participated in a National Library of Medicine (NLM) Data Science Mentorship program, which is part of a broader Data Science @ NLM<\/a> training program designed to prepare staff to engage with and participate in the Library’s developing data science efforts. My mentor was NLM Computer Scientist Marie Gallagher, and our project was to gain a better understanding of and practical experience with tools and techniques for the exploration and study of the data of web archives.<\/p>\n NLM has been actively involved in building collections of web archives<\/a> through the work of our Web Collecting and Archiving Group.\u00a0 A team of archivists, librarians, and historians is primarily using Archive-It<\/a>, a service of the Internet Archive, to collect on a broad range of topics in line with NLM collection development policies, including HIV\/AIDS<\/a>, the Opioid Epidemic<\/a>, the 2014 Ebola Outbreak<\/a>, and currently around the COVID-19 Pandemic<\/a>. \u00a0\u00a0As a member of this working group, I was interested to learn more about tools such as Archives Unleashed<\/a> and the GLAM Workbench<\/a> to better understand the work and needs of researchers, as well as explore the possibility of using these tools to support ongoing collection development and curation. \u00a0\u00a0I had participated in an Archives Unleashed datathon<\/a> in 2019, and recognized that I needed much more hands-on experience to better understand the nature of the tools, how to use them, and the broader picture of web archives data and research.\u00a0 The NLM Data Science Mentorship program provided a wonderful opportunity to collaborate and learn more about the data of web archives, as well as project design, experimental thinking, science communication, and data storytelling.<\/p>\n The Archives Unleashed project, supported by The Andrew W. 
Mellon Foundation, provides a set of tools designed to lower barriers for researchers to explore web archives.\u00a0 The Archives Unleashed tools<\/a> are designed for different levels of experience, including the Archives Unleashed Cloud (for beginners), Archives Unleashed Notebooks (for beginner\/intermediate users), and Archives Unleashed Toolkit (for advanced users).\u00a0 My mentor and I reviewed each of these tools and focused on the Archives Unleashed Cloud<\/a> (migrating soon to Archive-It) to query the data of individual NLM web archive collections and obtain derivative data files for further analysis.<\/p>\n We uploaded the resulting derivative data files into a variety of data visualization and text analysis tools and learned a number of lessons on the value of a flexible computer environment in which to install programs and software, the need for advanced data cleaning skills, and, generally, the need for patience and flexibility.\u00a0 I also gained an appreciation for the complexity of the analysis tools and the need for more time to understand how the data is interpreted and presented.<\/p>\n In one experiment, following learning guides from Archives Unleashed, we loaded one of the derivative files (the GEXF<\/a> file) into an open-source graph visualization program called Gephi<\/a> to create a visualization of the network of nodes (domains) and edges (hyperlinks between them) for a small collection<\/a> of sites related to the NLM exhibition Confronting Violence, Improving Women\u2019s Lives<\/a><\/em>.<\/p>\n If we look closely, we can see that there are arrows between the domains, indicating hyperlink connections.\u00a0 The size of each label and node is significant: it represents how many times that domain is linked to. 
Researchers can use this visualization to see who is linking to whom, and which domains are the most popular in the collection.\u00a0 In this case, we found that forensicnurses.org, Twitter, Facebook, and YouTube are domains frequently linked to in the collection. We can look specifically at forensicnurses.org and focus attention on the links to and from this particular domain, with safeta.org<\/span>, then\u00a0<\/span>community.iafn.org,<\/span> having the largest number of links.<\/p>\n In another experiment, we used a derivative file containing the text extracted from HTML documents within the web archive (a CSV file).\u00a0 We explored this data using a web-based set of text analysis tools called Voyant Tools<\/a>.\u00a0 Data cleanup of the text in our larger collections proved challenging, and we ended up creating a very small sample data set for this project. The text, with its unrecognizable characters, was still a challenge, though a bit easier to manage. We removed content in languages other than English or Spanish (which made sense for this data set), and removed file formats that were not text. You can see a big difference in the visualizations using the \u201cbefore\u201d and \u201cafter\u201d versions of the derivative text file.<\/p>\n We also tested out a set of Jupyter notebooks, released in 2020 as part of the GLAM Workbench<\/a> (Galleries, Libraries, Archives, and Museums) with funding from the International Internet Preservation Consortium (IIPC)<\/a>. 
Like Archives Unleashed Cloud, the notebooks (there are 16) are intended to be a starting point specifically for researchers who want to make use of web archives.\u00a0 The notebooks offer a range of options for examining content in the Internet Archive (and other archives) and, even easier on the researcher, run using Binder<\/a>, a virtual machine that requires no software to be installed on your own computer.<\/p>\n Researchers can use a notebook, “Get full page screenshots from archived web pages<\/a>,” for example, to examine visual changes in a website over time.\u00a0 In the visual below, we reviewed the CDC coronavirus homepage as it changed throughout 2020.<\/p>\n Other notebooks in the collection allow researchers to discover changes to text on a webpage over time, for example, or to discover when a piece of text first appears in an archived webpage. In the example below, we used the notebook “Find when a piece of text appears in an archived web page<\/a>” to discover the first time \u201csocial distancing\u201d was used on the CDC coronavirus homepage.<\/p>\n These notebooks are not without challenges themselves.\u00a0 While the notebooks are really easy to use, querying the entire Internet Archive for results takes time (sometimes hours), and the notebooks can time out.\u00a0 This work provided the opportunity to compare approaches to querying the data of web archives, as well as more lessons on patience and persistence.<\/p>\n The work we started during this mentorship is ongoing, and the landscape of tools is evolving.\u00a0 There is definitely room for further exploration of analysis tools to better understand how researchers can use web archives and how web collecting organizations like NLM can support their work. 
We learned midway through the project that Archives Unleashed Cloud will be decommissioned at the end of June 2021 and migrated to Archive-It.\u00a0 I look forward to learning more about what opportunities this will bring for making NLM web archive collections available as data, whether through providing tidy derivative data sets for our researchers, or sharing notebooks querying the web archive data.\u00a0 Supporting researchers through description and transparency about the scope of a collection is also important, as well as helping them understand the nature of web content as historical materials (on this topic I read, and consulted many times throughout this project, Ian Milligan\u2019s History in the Age of Abundance?\u00a0 How the Web is Transforming Historical Research<\/em><\/a>). There is much exciting and important work possible ahead.<\/p>\n I\u2019m grateful for this opportunity to work with a mentor to explore and learn about the bigger picture of working with web archives as data over this past year.\u00a0 Many, many thanks to Marie Gallagher, the entire Data Science @ NLM training program team, and all those at NLM supporting this work.<\/p>\n <\/a>