{"id":21153,"date":"2021-05-13T11:30:44","date_gmt":"2021-05-13T15:30:44","guid":{"rendered":"https:\/\/circulatingnow.nlm.nih.gov\/?p=21153"},"modified":"2024-10-09T12:49:14","modified_gmt":"2024-10-09T16:49:14","slug":"exploring-the-data-of-web-archives-as-part-of-data-science-nlm","status":"publish","type":"post","link":"https:\/\/circulatingnow.nlm.nih.gov\/2021\/05\/13\/exploring-the-data-of-web-archives-as-part-of-data-science-nlm\/","title":{"rendered":"Exploring the Data of Web Archives as Part of Data Science @ NLM"},"content":{"rendered":"<p><em>By Christie Moffatt ~<\/em><\/p>\n<p>Over the past year (May 2020\u2013May 2021) I participated in a National Library of Medicine (NLM) Data Science Mentorship program, which is part of a broader <a href=\"https:\/\/nlmdirector.nlm.nih.gov\/2021\/04\/28\/data-science-nlm-journey-continues-and-what-we-have-learned\/\">Data Science @ NLM<\/a> training program designed to prepare staff to engage with and participate in the Library’s developing data science efforts. 
My mentor was NLM Computer Scientist Marie Gallagher, and our project was to gain a better understanding of and practical experience with tools and techniques for the exploration and study of the data of web archives.<\/p>\n<p>NLM has been actively involved in building <a href=\"https:\/\/archive-it.org\/organizations\/350\">collections of web archives<\/a> through the work of our Web Collecting and Archiving Group.\u00a0 A team of archivists, librarians, and historians is primarily using <a href=\"https:\/\/archive-it.org\/\" target=\"_blank\" rel=\"noopener\">Archive-It<\/a>, a service of the Internet Archive, to collect on a broad range of topics in line with NLM collection development policies, including <a href=\"https:\/\/circulatingnow.nlm.nih.gov\/2017\/12\/01\/archiving-hiv-aids-on-the-web\/\">HIV\/AIDS<\/a>, the <a href=\"https:\/\/circulatingnow.nlm.nih.gov\/2018\/09\/27\/the-opioid-epidemic-collecting-now-for-future-research\/\">Opioid Epidemic<\/a>, the <a href=\"https:\/\/circulatingnow.nlm.nih.gov\/2014\/11\/19\/future-historical-collections-archiving-the-2014-ebola-outbreak\/\">2014 Ebola Outbreak<\/a>, and currently around the <a href=\"https:\/\/circulatingnow.nlm.nih.gov\/2021\/01\/28\/covid-19-web-collecting-reflections-at-one-year\/\">COVID-19 Pandemic<\/a>. \u00a0\u00a0As a member of this working group, I was interested to learn more about tools such as <a href=\"https:\/\/archivesunleashed.org\/\" target=\"_blank\" rel=\"noopener\">Archives Unleashed<\/a> and the <a href=\"https:\/\/glam-workbench.net\/web-archives\/\" target=\"_blank\" rel=\"noopener\">GLAM Workbench<\/a> to better understand the work and needs of researchers, as well as explore the possibility of using these tools to support ongoing collection development and curation. 
\u00a0\u00a0I had participated in an Archives Unleashed <a href=\"https:\/\/archivesunleashed.org\/events\/\">datathon<\/a> in 2019, and recognized that I needed much more hands-on experience to better understand the nature of the tools, how to use them, and the broader picture of web archives data and research.\u00a0 The NLM Data Science Mentorship program provided a wonderful opportunity to collaborate and learn more about the data of web archives, as well as project design, experimental thinking, science communication, and data storytelling.<\/p>\n<p>The Archives Unleashed project, supported by The Andrew W. Mellon Foundation, provides a set of tools designed to lower barriers for researchers to explore web archives.\u00a0 The <a href=\"https:\/\/archivesunleashed.org\/getting-started\/\" target=\"_blank\" rel=\"noopener\">Archives Unleashed tools<\/a> are designed for different levels of experience including the Archives Unleashed Cloud (for beginners), Archives Unleashed Notebooks (for beginner\/intermediate users), and Archives Unleashed Toolkit (for advanced users).\u00a0 My mentor and I reviewed each of these tools and focused on the <a href=\"https:\/\/archivesunleashed.org\/cloud\/\">Archives Unleashed Cloud<\/a> (migrating soon to Archive-It) to query the data of individual NLM web archive collections and obtain derivative data files for further analysis.<\/p>\n<p>We uploaded the resulting derivative data files into a variety of data visualization and text analysis tools and learned a number of lessons on the value of a flexible computer environment to install programs and software, the need for advanced data cleaning skills, and generally, the need for patience and flexibility.\u00a0 I also gained an appreciation for the complexity of the analysis tools and the need for more time to understand how the data is interpreted and presented.<\/p>\n<p>In one experiment, following learning guides from Archives Unleashed, we loaded one of the derivative files 
(the <a href=\"https:\/\/gexf.net\/\">GEXF<\/a> file) into an open-source graph visualization program called <a href=\"https:\/\/gephi.org\/\">Gephi<\/a> to create a visualization of the network of nodes (domains) and edges (hyperlinks between them) for a small <a href=\"https:\/\/archive-it.org\/collections\/8370\">collection<\/a> of sites related to the NLM exhibition <em><a href=\"https:\/\/www.nlm.nih.gov\/exhibition\/confrontingviolence\/index.html\">Confronting Violence, Improving Women\u2019s Lives<\/a><\/em>.<\/p>\n<div class=\"tiled-gallery type-rectangular tiled-gallery-unresized\" data-original-width=\"840\" data-carousel-extra='{"blog_id":1,"permalink":"https:\\\/\\\/circulatingnow.nlm.nih.gov\\\/2021\\\/05\\\/13\\\/exploring-the-data-of-web-archives-as-part-of-data-science-nlm\\\/","likes_blog_id":"52242398"}' itemscope itemtype=\"http:\/\/schema.org\/ImageGallery\" > <div class=\"gallery-row\" style=\"width: 840px; height: 196px;\" data-original-width=\"840\" data-original-height=\"196\" > <div class=\"gallery-group images-1\" style=\"width: 285px; height: 196px;\" data-original-width=\"285\" data-original-height=\"196\" > <div class=\"tiled-gallery-item tiled-gallery-item-large\" itemprop=\"associatedMedia\" itemscope itemtype=\"http:\/\/schema.org\/ImageObject\"> <a href=\"https:\/\/circulatingnow.nlm.nih.gov\/gephi_fullcollection\/\" border=\"0\" itemprop=\"url\"> <meta itemprop=\"width\" content=\"281\"> <meta itemprop=\"height\" content=\"192\"> <img decoding=\"async\" class=\"\" data-attachment-id=\"21113\" data-orig-file=\"https:\/\/circulatingnow.nlm.nih.gov\/wp-content\/uploads\/2021\/05\/Gephi_fullCollection.jpg\" data-orig-size=\"1124,769\" data-comments-opened=\"1\" data-image-meta=\"{"aperture":"0","credit":"","camera":"","caption":"","created_timestamp":"0","copyright":"","focal_length":"0","iso":"0","shutter_speed":"0","title":"","orientation":"0"}\" data-image-title=\"Full Gephi Visualization\" data-image-description=\"\" 
data-medium-file=\"https:\/\/i0.wp.com\/circulatingnow.nlm.nih.gov\/wp-content\/uploads\/2021\/05\/Gephi_fullCollection.jpg?fit=300%2C205&ssl=1\" data-large-file=\"https:\/\/i0.wp.com\/circulatingnow.nlm.nih.gov\/wp-content\/uploads\/2021\/05\/Gephi_fullCollection.jpg?fit=840%2C575&ssl=1\" src=\"https:\/\/i0.wp.com\/circulatingnow.nlm.nih.gov\/wp-content\/uploads\/2021\/05\/Gephi_fullCollection.jpg?w=281&h=192&ssl=1\" srcset=\"https:\/\/i0.wp.com\/circulatingnow.nlm.nih.gov\/wp-content\/uploads\/2021\/05\/Gephi_fullCollection.jpg?w=1124&ssl=1 1124w, https:\/\/i0.wp.com\/circulatingnow.nlm.nih.gov\/wp-content\/uploads\/2021\/05\/Gephi_fullCollection.jpg?resize=300%2C205&ssl=1 300w, https:\/\/i0.wp.com\/circulatingnow.nlm.nih.gov\/wp-content\/uploads\/2021\/05\/Gephi_fullCollection.jpg?resize=1024%2C701&ssl=1 1024w, https:\/\/i0.wp.com\/circulatingnow.nlm.nih.gov\/wp-content\/uploads\/2021\/05\/Gephi_fullCollection.jpg?resize=768%2C525&ssl=1 768w, https:\/\/i0.wp.com\/circulatingnow.nlm.nih.gov\/wp-content\/uploads\/2021\/05\/Gephi_fullCollection.jpg?resize=840%2C575&ssl=1 840w\" width=\"281\" height=\"192\" loading=\"lazy\" data-original-width=\"281\" data-original-height=\"192\" itemprop=\"http:\/\/schema.org\/image\" title=\"Full Gephi Visualization\" alt=\"A large network of web domains (nodes) and hyperlinks between them (edges) in various colors.\" style=\"width: 281px; height: 192px;\" \/> <\/a> <div class=\"tiled-gallery-caption\" itemprop=\"caption description\"> Graphic visualization in Gephi, of network connections between domains in an NLM web archive collection related to NLM exhibition Confronting Violence, Improving Women\u2019s Lives. 
<\/div> <\/div> <\/div> <!-- close group --> <div class=\"gallery-group images-1\" style=\"width: 277px; height: 196px;\" data-original-width=\"277\" data-original-height=\"196\" > <div class=\"tiled-gallery-item tiled-gallery-item-large\" itemprop=\"associatedMedia\" itemscope itemtype=\"http:\/\/schema.org\/ImageObject\"> <a href=\"https:\/\/circulatingnow.nlm.nih.gov\/gephi_zoom2\/\" border=\"0\" itemprop=\"url\"> <meta itemprop=\"width\" content=\"273\"> <meta itemprop=\"height\" content=\"192\"> <img decoding=\"async\" class=\"\" data-attachment-id=\"21114\" data-orig-file=\"https:\/\/circulatingnow.nlm.nih.gov\/wp-content\/uploads\/2021\/05\/Gephi_Zoom2.jpg\" data-orig-size=\"1139,801\" data-comments-opened=\"1\" data-image-meta=\"{"aperture":"0","credit":"","camera":"","caption":"","created_timestamp":"0","copyright":"","focal_length":"0","iso":"0","shutter_speed":"0","title":"","orientation":"0"}\" data-image-title=\"Close-up of Major Nodes\" data-image-description=\"\" data-medium-file=\"https:\/\/i0.wp.com\/circulatingnow.nlm.nih.gov\/wp-content\/uploads\/2021\/05\/Gephi_Zoom2.jpg?fit=300%2C211&ssl=1\" data-large-file=\"https:\/\/i0.wp.com\/circulatingnow.nlm.nih.gov\/wp-content\/uploads\/2021\/05\/Gephi_Zoom2.jpg?fit=840%2C591&ssl=1\" src=\"https:\/\/i0.wp.com\/circulatingnow.nlm.nih.gov\/wp-content\/uploads\/2021\/05\/Gephi_Zoom2.jpg?w=273&h=192&ssl=1\" srcset=\"https:\/\/i0.wp.com\/circulatingnow.nlm.nih.gov\/wp-content\/uploads\/2021\/05\/Gephi_Zoom2.jpg?w=1139&ssl=1 1139w, https:\/\/i0.wp.com\/circulatingnow.nlm.nih.gov\/wp-content\/uploads\/2021\/05\/Gephi_Zoom2.jpg?resize=300%2C211&ssl=1 300w, https:\/\/i0.wp.com\/circulatingnow.nlm.nih.gov\/wp-content\/uploads\/2021\/05\/Gephi_Zoom2.jpg?resize=1024%2C720&ssl=1 1024w, https:\/\/i0.wp.com\/circulatingnow.nlm.nih.gov\/wp-content\/uploads\/2021\/05\/Gephi_Zoom2.jpg?resize=768%2C540&ssl=1 768w, 
https:\/\/i0.wp.com\/circulatingnow.nlm.nih.gov\/wp-content\/uploads\/2021\/05\/Gephi_Zoom2.jpg?resize=840%2C591&ssl=1 840w\" width=\"273\" height=\"192\" loading=\"lazy\" data-original-width=\"273\" data-original-height=\"192\" itemprop=\"http:\/\/schema.org\/image\" title=\"Close-up of Major Nodes\" alt=\"A large network of web domains (nodes) and hyperlinks between them (edges) in various colors.\" style=\"width: 273px; height: 192px;\" \/> <\/a> <div class=\"tiled-gallery-caption\" itemprop=\"caption description\"> Graphic visualization in Gephi, of network connections between domains in an NLM web archive collection related to NLM exhibition Confronting Violence, Improving Women\u2019s Lives. <\/div> <\/div> <\/div> <!-- close group --> <div class=\"gallery-group images-1\" style=\"width: 278px; height: 196px;\" data-original-width=\"278\" data-original-height=\"196\" > <div class=\"tiled-gallery-item tiled-gallery-item-large\" itemprop=\"associatedMedia\" itemscope itemtype=\"http:\/\/schema.org\/ImageObject\"> <a href=\"https:\/\/circulatingnow.nlm.nih.gov\/gephi_zoom3\/\" border=\"0\" itemprop=\"url\"> <meta itemprop=\"width\" content=\"274\"> <meta itemprop=\"height\" content=\"192\"> <img decoding=\"async\" class=\"\" data-attachment-id=\"21115\" data-orig-file=\"https:\/\/circulatingnow.nlm.nih.gov\/wp-content\/uploads\/2021\/05\/Gephi_zoom3.jpg\" data-orig-size=\"1188,831\" data-comments-opened=\"1\" data-image-meta=\"{"aperture":"0","credit":"","camera":"","caption":"","created_timestamp":"0","copyright":"","focal_length":"0","iso":"0","shutter_speed":"0","title":"","orientation":"0"}\" data-image-title=\"Detail of Edges Around forensicnurses.org\" data-image-description=\"\" data-medium-file=\"https:\/\/i0.wp.com\/circulatingnow.nlm.nih.gov\/wp-content\/uploads\/2021\/05\/Gephi_zoom3.jpg?fit=300%2C210&ssl=1\" data-large-file=\"https:\/\/i0.wp.com\/circulatingnow.nlm.nih.gov\/wp-content\/uploads\/2021\/05\/Gephi_zoom3.jpg?fit=840%2C587&ssl=1\" 
src=\"https:\/\/i0.wp.com\/circulatingnow.nlm.nih.gov\/wp-content\/uploads\/2021\/05\/Gephi_zoom3.jpg?w=274&h=192&ssl=1\" srcset=\"https:\/\/i0.wp.com\/circulatingnow.nlm.nih.gov\/wp-content\/uploads\/2021\/05\/Gephi_zoom3.jpg?w=1188&ssl=1 1188w, https:\/\/i0.wp.com\/circulatingnow.nlm.nih.gov\/wp-content\/uploads\/2021\/05\/Gephi_zoom3.jpg?resize=300%2C210&ssl=1 300w, https:\/\/i0.wp.com\/circulatingnow.nlm.nih.gov\/wp-content\/uploads\/2021\/05\/Gephi_zoom3.jpg?resize=1024%2C716&ssl=1 1024w, https:\/\/i0.wp.com\/circulatingnow.nlm.nih.gov\/wp-content\/uploads\/2021\/05\/Gephi_zoom3.jpg?resize=768%2C537&ssl=1 768w, https:\/\/i0.wp.com\/circulatingnow.nlm.nih.gov\/wp-content\/uploads\/2021\/05\/Gephi_zoom3.jpg?resize=840%2C588&ssl=1 840w\" width=\"274\" height=\"192\" loading=\"lazy\" data-original-width=\"274\" data-original-height=\"192\" itemprop=\"http:\/\/schema.org\/image\" title=\"Detail of Edges Around forensicnurses.org\" alt=\"A large network of web domains (nodes) and hyperlinks between them (edges) in various colors.\" style=\"width: 274px; height: 192px;\" \/> <\/a> <div class=\"tiled-gallery-caption\" itemprop=\"caption description\"> Graphic visualization in Gephi, of network connections between domains in an NLM web archive collection related to NLM exhibition Confronting Violence, Improving Women\u2019s Lives. <\/div> <\/div> <\/div> <!-- close group --> <\/div> <!-- close row --> <\/div>\n<figure id=\"attachment_11953\" class=\"wp-caption aligncenter\"><figcaption class=\"wp-caption-text\" style=\"width: 800px;\">Graphic visualization in Gephi, of network connections between domains in an NLM web archive collection related to NLM exhibition <em>Confronting Violence, Improving Women\u2019s Lives.<\/em><\/figcaption><\/figure>\n<p>If we look closely, we can see that there are arrows between the domains, indicating hyperlink connections.\u00a0 The size of the labels and nodes is significant, representing how many times the source is linked to. 
Researchers can use this visualization to see who is linking to whom, and which domains are most popular in the collection.\u00a0 In this case, we found that forensicnurses.org, Twitter, Facebook, and YouTube are domains frequently linked to in the collection. We can look specifically at forensicnurses.org and focus attention on the links to and from this particular domain, with <span data-ccp-charstyle=\"Hyperlink\">safeta.org<\/span><span data-contrast=\"auto\">, then\u00a0<\/span><span data-ccp-charstyle=\"Hyperlink\">community.iafn.org,<\/span> having the largest number of links.<\/p>\n<p>In another experiment we used a derivative file containing the text extracted from HTML documents\u00a0 within the web archive (a CSV file).\u00a0 We explored this data using a web-based set of text analysis tools called <a href=\"https:\/\/voyant-tools.org\/\" target=\"_blank\" rel=\"noopener\">Voyant Tools<\/a>.\u00a0\u00a0 Data cleanup of the text in our larger collections proved challenging, and we ended up creating a very small sample data set for this project. The text still contained unrecognizable characters, though it was a bit easier to manage. We removed content in languages other than English or Spanish (which made sense for this data set), and removed file formats that were not text. 
You can see a big difference in the visualizations using the \u201cbefore\u201d and \u201cafter\u201d version of the derivative text file.<\/p>\n<div class=\"tiled-gallery type-rectangular tiled-gallery-unresized\" data-original-width=\"840\" data-carousel-extra='{"blog_id":1,"permalink":"https:\\\/\\\/circulatingnow.nlm.nih.gov\\\/2021\\\/05\\\/13\\\/exploring-the-data-of-web-archives-as-part-of-data-science-nlm\\\/","likes_blog_id":"52242398"}' itemscope itemtype=\"http:\/\/schema.org\/ImageGallery\" > <div class=\"gallery-row\" style=\"width: 840px; height: 209px;\" data-original-width=\"840\" data-original-height=\"209\" > <div class=\"gallery-group images-1\" style=\"width: 420px; height: 209px;\" data-original-width=\"420\" data-original-height=\"209\" > <div class=\"tiled-gallery-item tiled-gallery-item-large\" itemprop=\"associatedMedia\" itemscope itemtype=\"http:\/\/schema.org\/ImageObject\"> <a href=\"https:\/\/circulatingnow.nlm.nih.gov\/voyant1\/\" border=\"0\" itemprop=\"url\"> <meta itemprop=\"width\" content=\"416\"> <meta itemprop=\"height\" content=\"205\"> <img decoding=\"async\" class=\"\" data-attachment-id=\"21119\" data-orig-file=\"https:\/\/circulatingnow.nlm.nih.gov\/wp-content\/uploads\/2021\/05\/Voyant1.jpg\" data-orig-size=\"1655,814\" data-comments-opened=\"1\" data-image-meta=\"{"aperture":"0","credit":"","camera":"","caption":"","created_timestamp":"0","copyright":"","focal_length":"0","iso":"0","shutter_speed":"0","title":"","orientation":"0"}\" data-image-title=\"Before\" data-image-description=\"\" data-medium-file=\"https:\/\/i0.wp.com\/circulatingnow.nlm.nih.gov\/wp-content\/uploads\/2021\/05\/Voyant1.jpg?fit=300%2C148&ssl=1\" data-large-file=\"https:\/\/i0.wp.com\/circulatingnow.nlm.nih.gov\/wp-content\/uploads\/2021\/05\/Voyant1.jpg?fit=840%2C413&ssl=1\" src=\"https:\/\/i0.wp.com\/circulatingnow.nlm.nih.gov\/wp-content\/uploads\/2021\/05\/Voyant1.jpg?w=416&h=205&ssl=1\" 
srcset=\"https:\/\/i0.wp.com\/circulatingnow.nlm.nih.gov\/wp-content\/uploads\/2021\/05\/Voyant1.jpg?w=1655&ssl=1 1655w, https:\/\/i0.wp.com\/circulatingnow.nlm.nih.gov\/wp-content\/uploads\/2021\/05\/Voyant1.jpg?resize=300%2C148&ssl=1 300w, https:\/\/i0.wp.com\/circulatingnow.nlm.nih.gov\/wp-content\/uploads\/2021\/05\/Voyant1.jpg?resize=1024%2C504&ssl=1 1024w, https:\/\/i0.wp.com\/circulatingnow.nlm.nih.gov\/wp-content\/uploads\/2021\/05\/Voyant1.jpg?resize=768%2C378&ssl=1 768w, https:\/\/i0.wp.com\/circulatingnow.nlm.nih.gov\/wp-content\/uploads\/2021\/05\/Voyant1.jpg?resize=1536%2C755&ssl=1 1536w, https:\/\/i0.wp.com\/circulatingnow.nlm.nih.gov\/wp-content\/uploads\/2021\/05\/Voyant1.jpg?resize=1200%2C590&ssl=1 1200w, https:\/\/i0.wp.com\/circulatingnow.nlm.nih.gov\/wp-content\/uploads\/2021\/05\/Voyant1.jpg?resize=840%2C413&ssl=1 840w\" width=\"416\" height=\"205\" loading=\"lazy\" data-original-width=\"416\" data-original-height=\"205\" itemprop=\"http:\/\/schema.org\/image\" title=\"Before\" alt=\"A screenshot of a dashboard showing unmeaningful displays of information.\" style=\"width: 416px; height: 205px;\" \/> <\/a> <div class=\"tiled-gallery-caption\" itemprop=\"caption description\"> A Voyant display of data before it has been cleaned. 
<\/div> <\/div> <\/div> <!-- close group --> <div class=\"gallery-group images-1\" style=\"width: 420px; height: 209px;\" data-original-width=\"420\" data-original-height=\"209\" > <div class=\"tiled-gallery-item tiled-gallery-item-large\" itemprop=\"associatedMedia\" itemscope itemtype=\"http:\/\/schema.org\/ImageObject\"> <a href=\"https:\/\/circulatingnow.nlm.nih.gov\/voyant2\/\" border=\"0\" itemprop=\"url\"> <meta itemprop=\"width\" content=\"416\"> <meta itemprop=\"height\" content=\"205\"> <img decoding=\"async\" class=\"\" data-attachment-id=\"21120\" data-orig-file=\"https:\/\/circulatingnow.nlm.nih.gov\/wp-content\/uploads\/2021\/05\/Voyant2.jpg\" data-orig-size=\"1655,814\" data-comments-opened=\"1\" data-image-meta=\"{"aperture":"0","credit":"","camera":"","caption":"","created_timestamp":"0","copyright":"","focal_length":"0","iso":"0","shutter_speed":"0","title":"","orientation":"0"}\" data-image-title=\"After\" data-image-description=\"\" data-medium-file=\"https:\/\/i0.wp.com\/circulatingnow.nlm.nih.gov\/wp-content\/uploads\/2021\/05\/Voyant2.jpg?fit=300%2C148&ssl=1\" data-large-file=\"https:\/\/i0.wp.com\/circulatingnow.nlm.nih.gov\/wp-content\/uploads\/2021\/05\/Voyant2.jpg?fit=840%2C413&ssl=1\" src=\"https:\/\/i0.wp.com\/circulatingnow.nlm.nih.gov\/wp-content\/uploads\/2021\/05\/Voyant2.jpg?w=416&h=205&ssl=1\" srcset=\"https:\/\/i0.wp.com\/circulatingnow.nlm.nih.gov\/wp-content\/uploads\/2021\/05\/Voyant2.jpg?w=1655&ssl=1 1655w, https:\/\/i0.wp.com\/circulatingnow.nlm.nih.gov\/wp-content\/uploads\/2021\/05\/Voyant2.jpg?resize=300%2C148&ssl=1 300w, https:\/\/i0.wp.com\/circulatingnow.nlm.nih.gov\/wp-content\/uploads\/2021\/05\/Voyant2.jpg?resize=1024%2C504&ssl=1 1024w, https:\/\/i0.wp.com\/circulatingnow.nlm.nih.gov\/wp-content\/uploads\/2021\/05\/Voyant2.jpg?resize=768%2C378&ssl=1 768w, https:\/\/i0.wp.com\/circulatingnow.nlm.nih.gov\/wp-content\/uploads\/2021\/05\/Voyant2.jpg?resize=1536%2C755&ssl=1 1536w, 
https:\/\/i0.wp.com\/circulatingnow.nlm.nih.gov\/wp-content\/uploads\/2021\/05\/Voyant2.jpg?resize=1200%2C590&ssl=1 1200w, https:\/\/i0.wp.com\/circulatingnow.nlm.nih.gov\/wp-content\/uploads\/2021\/05\/Voyant2.jpg?resize=840%2C413&ssl=1 840w\" width=\"416\" height=\"205\" loading=\"lazy\" data-original-width=\"416\" data-original-height=\"205\" itemprop=\"http:\/\/schema.org\/image\" title=\"After\" alt=\"A program dashboard displaying a word cloud, text associations, graph and frequency data drawn from a text analysis\" style=\"width: 416px; height: 205px;\" \/> <\/a> <div class=\"tiled-gallery-caption\" itemprop=\"caption description\"> A Voyant display using cleaned data. <\/div> <\/div> <\/div> <!-- close group --> <\/div> <!-- close row --> <\/div>\n<figure id=\"attachment_11953\" class=\"wp-caption aligncenter\"><figcaption class=\"wp-caption-text\" style=\"width: 800px;\">A “before” and “after” data cleaning text visualization in Voyant, using data from a test NLM web archive collection related to HHS efforts to address hesitancy around COVID-19 vaccines.<\/figcaption>With Voyant Tools, researchers can visualize the text in multiple ways: a word cloud showing the most frequent words used in the collection, the context of the word or words used in a collection (for example, what text comes before or after the word \u201cvaccine\u201d), where in the text the terms of interest are most concentrated, and the terms highlighted in the text itself.\u00a0 There are all kinds of ways to filter this data, and researchers can swap out the visualizations (there are 28 available) depending on what is most useful to their research.<\/figure>\n<p>We also tested out a set of Jupyter notebooks, released in 2020 as part of the <a href=\"https:\/\/glam-workbench.net\/web-archives\/\" target=\"_blank\" rel=\"noopener\">GLAM Workbench<\/a> (Galleries, Libraries, Archives, and Museums) with funding from the <a href=\"https:\/\/netpreserve.org\/\">International Internet 
Preservation Consortium (IIPC)<\/a>. Like Archives Unleashed Cloud, the notebooks (there are 16) are intended to be a starting point specifically for researchers who want to make use of web archives.\u00a0 The notebooks offer a range of options for examining content in the Internet Archive (and other archives), and they are even easier on the researcher because they run using <a href=\"https:\/\/jupyter.org\/binder\" target=\"_blank\" rel=\"noopener\">Binder<\/a>, a hosted service that lets you use them without installing any software on your own computer.<\/p>\n<p>Researchers can use a notebook, “<a href=\"https:\/\/glam-workbench.net\/web-archives\/#create-and-compare-full-page-screenshots-from-archived-web-pages\" target=\"_blank\" rel=\"noopener\">Get full page screenshots from archived web pages<\/a>,” for example, to examine visual changes in a website over time.\u00a0 In the visual below, we reviewed the CDC coronavirus homepage as it changed throughout 2020.<\/p>\n<figure id=\"attachment_21118\" aria-describedby=\"caption-attachment-21118\" style=\"width: 996px\" class=\"wp-caption aligncenter\"><a href=\"https:\/\/i0.wp.com\/circulatingnow.nlm.nih.gov\/wp-content\/uploads\/2021\/05\/GLAMWorkbench_GetFullPageScreenshots.jpg?ssl=1\"><img data-recalc-dims=\"1\" loading=\"lazy\" decoding=\"async\" data-attachment-id=\"21118\" data-permalink=\"https:\/\/circulatingnow.nlm.nih.gov\/2021\/05\/13\/exploring-the-data-of-web-archives-as-part-of-data-science-nlm\/glamworkbench_getfullpagescreenshots\/\" data-orig-file=\"https:\/\/i0.wp.com\/circulatingnow.nlm.nih.gov\/wp-content\/uploads\/2021\/05\/GLAMWorkbench_GetFullPageScreenshots.jpg?fit=996%2C820&ssl=1\" data-orig-size=\"996,820\" data-comments-opened=\"1\" data-image-meta=\"{"aperture":"0","credit":"","camera":"","caption":"","created_timestamp":"0","copyright":"","focal_length":"0","iso":"0","shutter_speed":"0","title":"","orientation":"0"}\" data-image-title=\"GLAMWorkbench_GetFullPageScreenshots\" 
data-image-description=\"\" data-image-caption=\"<p>A GLAM Workbench tool displaying archives of a webpage on three different dates.<\/p>\n\" data-medium-file=\"https:\/\/i0.wp.com\/circulatingnow.nlm.nih.gov\/wp-content\/uploads\/2021\/05\/GLAMWorkbench_GetFullPageScreenshots.jpg?fit=300%2C247&ssl=1\" data-large-file=\"https:\/\/i0.wp.com\/circulatingnow.nlm.nih.gov\/wp-content\/uploads\/2021\/05\/GLAMWorkbench_GetFullPageScreenshots.jpg?fit=840%2C692&ssl=1\" class=\"wp-image-21118 size-full\" title=\"CDC Page on COVID-19 in January, February, and March of 2020\" src=\"https:\/\/i0.wp.com\/circulatingnow.nlm.nih.gov\/wp-content\/uploads\/2021\/05\/GLAMWorkbench_GetFullPageScreenshots.jpg?resize=840%2C692&ssl=1\" alt=\"A display of three versions of a CDC webpage from January, February and March of 2020 showing how it got longer and more complex.\" width=\"840\" height=\"692\" srcset=\"https:\/\/i0.wp.com\/circulatingnow.nlm.nih.gov\/wp-content\/uploads\/2021\/05\/GLAMWorkbench_GetFullPageScreenshots.jpg?w=996&ssl=1 996w, https:\/\/i0.wp.com\/circulatingnow.nlm.nih.gov\/wp-content\/uploads\/2021\/05\/GLAMWorkbench_GetFullPageScreenshots.jpg?resize=300%2C247&ssl=1 300w, https:\/\/i0.wp.com\/circulatingnow.nlm.nih.gov\/wp-content\/uploads\/2021\/05\/GLAMWorkbench_GetFullPageScreenshots.jpg?resize=768%2C632&ssl=1 768w, https:\/\/i0.wp.com\/circulatingnow.nlm.nih.gov\/wp-content\/uploads\/2021\/05\/GLAMWorkbench_GetFullPageScreenshots.jpg?resize=840%2C692&ssl=1 840w\" sizes=\"auto, (max-width: 840px) 100vw, 840px\" \/><\/a><figcaption id=\"caption-attachment-21118\" class=\"wp-caption-text\">Screenshot of the target URL <a href=\"https:\/\/www.cdc.gov\/coronavirus\/2019-ncov\/\" rel=\"nofollow\">https:\/\/www.cdc.gov\/coronavirus\/2019-ncov\/<\/a> over time, using the “Get full page screenshots from archived web pages” GLAM Workbench notebook.<\/figcaption><\/figure>\n<p>Other notebooks in the collection allow researchers to discover changes to text on a webpage over time, 
for example, or to discover when a piece of text first appears in an archived webpage. In the example below, we used the notebook “<a href=\"https:\/\/glam-workbench.net\/web-archives\/#find-when-a-piece-of-text-appears-in-an-archived-web-page\" target=\"_blank\" rel=\"noopener\">Find when a piece of text appears in an archived web page<\/a>” to discover the first time \u201csocial distancing\u201d was used on the CDC coronavirus homepage.<\/p>\n<figure id=\"attachment_21117\" aria-describedby=\"caption-attachment-21117\" style=\"width: 970px\" class=\"wp-caption aligncenter\"><a href=\"https:\/\/i0.wp.com\/circulatingnow.nlm.nih.gov\/wp-content\/uploads\/2021\/05\/GLAMWorkbench_FindWhenAPieceOfTextAppears.jpg?ssl=1\"><img data-recalc-dims=\"1\" loading=\"lazy\" decoding=\"async\" data-attachment-id=\"21117\" data-permalink=\"https:\/\/circulatingnow.nlm.nih.gov\/2021\/05\/13\/exploring-the-data-of-web-archives-as-part-of-data-science-nlm\/glamworkbench_findwhenapieceoftextappears\/\" data-orig-file=\"https:\/\/i0.wp.com\/circulatingnow.nlm.nih.gov\/wp-content\/uploads\/2021\/05\/GLAMWorkbench_FindWhenAPieceOfTextAppears.jpg?fit=970%2C800&ssl=1\" data-orig-size=\"970,800\" data-comments-opened=\"1\" data-image-meta=\"{"aperture":"0","credit":"","camera":"","caption":"","created_timestamp":"0","copyright":"","focal_length":"0","iso":"0","shutter_speed":"0","title":"","orientation":"0"}\" data-image-title=\"GLAMWorkbench_FindWhenAPieceOfTextAppears\" data-image-description=\"\" data-image-caption=\"<p>A GLAM Workbench tool displaying the first use of “social distancing” on a cdc webpage on June 15, 2020.<\/p>\n\" data-medium-file=\"https:\/\/i0.wp.com\/circulatingnow.nlm.nih.gov\/wp-content\/uploads\/2021\/05\/GLAMWorkbench_FindWhenAPieceOfTextAppears.jpg?fit=300%2C247&ssl=1\" data-large-file=\"https:\/\/i0.wp.com\/circulatingnow.nlm.nih.gov\/wp-content\/uploads\/2021\/05\/GLAMWorkbench_FindWhenAPieceOfTextAppears.jpg?fit=840%2C693&ssl=1\" class=\"wp-image-21117 
size-full\" title=\""Social Distancing" first appeared on the CDC coronavirus homepage on June 15, 2020\" src=\"https:\/\/i0.wp.com\/circulatingnow.nlm.nih.gov\/wp-content\/uploads\/2021\/05\/GLAMWorkbench_FindWhenAPieceOfTextAppears.jpg?resize=840%2C693&ssl=1\" alt=\"A screenshot of a tool that locates specific phrases in webpage text across time.\" width=\"840\" height=\"693\" srcset=\"https:\/\/i0.wp.com\/circulatingnow.nlm.nih.gov\/wp-content\/uploads\/2021\/05\/GLAMWorkbench_FindWhenAPieceOfTextAppears.jpg?w=970&ssl=1 970w, https:\/\/i0.wp.com\/circulatingnow.nlm.nih.gov\/wp-content\/uploads\/2021\/05\/GLAMWorkbench_FindWhenAPieceOfTextAppears.jpg?resize=300%2C247&ssl=1 300w, https:\/\/i0.wp.com\/circulatingnow.nlm.nih.gov\/wp-content\/uploads\/2021\/05\/GLAMWorkbench_FindWhenAPieceOfTextAppears.jpg?resize=768%2C633&ssl=1 768w, https:\/\/i0.wp.com\/circulatingnow.nlm.nih.gov\/wp-content\/uploads\/2021\/05\/GLAMWorkbench_FindWhenAPieceOfTextAppears.jpg?resize=840%2C693&ssl=1 840w\" sizes=\"auto, (max-width: 840px) 100vw, 840px\" \/><\/a><figcaption id=\"caption-attachment-21117\" class=\"wp-caption-text\">Screenshot of results from searching for the first occurrence of the text “Social Distancing” on <a href=\"https:\/\/www.cdc.gov\/coronavirus\/2019-ncov\/\" rel=\"nofollow\">https:\/\/www.cdc.gov\/coronavirus\/2019-ncov\/<\/a> using the “Find when a piece of text appears in an archived web page” GLAM Workbench notebook.<\/figcaption><\/figure>\n<p>These notebooks are not without challenges of their own.\u00a0 While they are really easy to use, querying the entire Internet Archive for results takes time (sometimes hours), and the notebooks can time out.\u00a0 This work provided the opportunity to compare approaches to querying the data of web archives, as well as more lessons on patience and persistence.<\/p>\n<p>The work we started during this mentorship is ongoing and the landscape of tools is evolving.\u00a0 There is definitely more room for further exploration of analysis 
tools to better understand how researchers can use web archives and how web collecting organizations like NLM can support their work. We learned midway through the project that Archives Unleashed Cloud will be decommissioned at the end of June 2021 and migrated to Archive-It.\u00a0 I look forward to learning more about what opportunities this will bring for making NLM web archive collections available as data, whether through providing tidy derivative data sets for our researchers, or sharing notebooks querying the web archive data.\u00a0 Supporting researchers through description and transparency about the scope of a collection is also important, as well as helping them understand the nature of web content as historical materials (on this topic I read, and consulted many times throughout this project, Ian Milligan\u2019s <a href=\"https:\/\/www.ianmilligan.ca\/publication\/history-in-the-age-of-abundance\/\"><em>History in the Age of Abundance?\u00a0 How the Web is Transforming Historical Research<\/em><\/a>). 
There is much exciting and important work possible ahead.<\/p>\n<figure id=\"attachment_21182\" aria-describedby=\"caption-attachment-21182\" style=\"width: 350px\" class=\"wp-caption alignright\"><a href=\"https:\/\/i0.wp.com\/circulatingnow.nlm.nih.gov\/wp-content\/uploads\/2021\/05\/NLMstaffnetwork.jpg?ssl=1\"><img data-recalc-dims=\"1\" loading=\"lazy\" decoding=\"async\" data-attachment-id=\"21182\" data-permalink=\"https:\/\/circulatingnow.nlm.nih.gov\/2021\/05\/13\/exploring-the-data-of-web-archives-as-part-of-data-science-nlm\/nlmstaffnetwork\/\" data-orig-file=\"https:\/\/i0.wp.com\/circulatingnow.nlm.nih.gov\/wp-content\/uploads\/2021\/05\/NLMstaffnetwork.jpg?fit=1253%2C769&ssl=1\" data-orig-size=\"1253,769\" data-comments-opened=\"1\" data-image-meta=\"{"aperture":"0","credit":"","camera":"","caption":"","created_timestamp":"0","copyright":"","focal_length":"0","iso":"0","shutter_speed":"0","title":"","orientation":"0"}\" data-image-title=\"Thanks to NLM Staff\" data-image-description=\"<p>Patti Brennan, Dianne Babski, Maryam Zaringhalam, Maria Collins, Peter Cooper, Mike Davidson, Lisa Federer, Anna Ripple, Nicole Sroka, Emma Write, LO Management, Jennifer Marill, HMD Management, Rebecca Warlow, Mentees, Mentors, LHNCBC Management, Rachel Tohn, Jim Mork, Marie Gallagher, OCCS Management<\/p>\n\" data-image-caption=\"\" data-medium-file=\"https:\/\/i0.wp.com\/circulatingnow.nlm.nih.gov\/wp-content\/uploads\/2021\/05\/NLMstaffnetwork.jpg?fit=300%2C184&ssl=1\" data-large-file=\"https:\/\/i0.wp.com\/circulatingnow.nlm.nih.gov\/wp-content\/uploads\/2021\/05\/NLMstaffnetwork.jpg?fit=840%2C515&ssl=1\" class=\"wp-image-21182\" title=\"Thank you NLM!\" src=\"https:\/\/i0.wp.com\/circulatingnow.nlm.nih.gov\/wp-content\/uploads\/2021\/05\/NLMstaffnetwork.jpg?resize=350%2C215&ssl=1\" alt=\"Network graphic naming NLM staff and offices that supported the project.\" width=\"350\" height=\"215\" longdesc=\"Patti Brennan, Dianne Babski, Maryam Zaringhalam, Maria 
Collins, Peter Cooper, Mike Davidson, Lisa Federer, Anna Ripple, Nicole Sroka, Emma Write, LO Management, Jennifer Marill, HMD Management, Rebecca Warlow, Mentees, Mentors, LHNCBC Management, Rachel Tohn, Jim Mork, Marie Gallagher, OCCS Management\" srcset=\"https:\/\/i0.wp.com\/circulatingnow.nlm.nih.gov\/wp-content\/uploads\/2021\/05\/NLMstaffnetwork.jpg?resize=300%2C184&ssl=1 300w, https:\/\/i0.wp.com\/circulatingnow.nlm.nih.gov\/wp-content\/uploads\/2021\/05\/NLMstaffnetwork.jpg?resize=1024%2C628&ssl=1 1024w, https:\/\/i0.wp.com\/circulatingnow.nlm.nih.gov\/wp-content\/uploads\/2021\/05\/NLMstaffnetwork.jpg?resize=768%2C471&ssl=1 768w, https:\/\/i0.wp.com\/circulatingnow.nlm.nih.gov\/wp-content\/uploads\/2021\/05\/NLMstaffnetwork.jpg?resize=1200%2C736&ssl=1 1200w, https:\/\/i0.wp.com\/circulatingnow.nlm.nih.gov\/wp-content\/uploads\/2021\/05\/NLMstaffnetwork.jpg?resize=840%2C516&ssl=1 840w, https:\/\/i0.wp.com\/circulatingnow.nlm.nih.gov\/wp-content\/uploads\/2021\/05\/NLMstaffnetwork.jpg?w=1253&ssl=1 1253w\" sizes=\"auto, (max-width: 350px) 100vw, 350px\" \/><\/a><figcaption id=\"caption-attachment-21182\" class=\"wp-caption-text\">Thanks to everyone across NLM for support on this project!<\/figcaption><\/figure>\n<p>I\u2019m grateful for this opportunity to work with a mentor to explore and learn about the bigger picture of working with web archives as data over this past year.\u00a0 Many, many thanks to Marie Gallagher, the entire Data Science @ NLM training program team, and all those at NLM supporting this work.<\/p>\n<p><em><a href=\"https:\/\/i0.wp.com\/circulatingnow.nlm.nih.gov\/wp-content\/uploads\/2018\/05\/christie-moffatt.jpg?ssl=1\"><img data-recalc-dims=\"1\" loading=\"lazy\" decoding=\"async\" data-attachment-id=\"14591\" data-permalink=\"https:\/\/circulatingnow.nlm.nih.gov\/2019\/10\/15\/nlms-profiles-in-science-exploring-the-stories-of-scientific-discovery\/nlm-staff-head-shots\/\" 
data-orig-file=\"https:\/\/i0.wp.com\/circulatingnow.nlm.nih.gov\/wp-content\/uploads\/2018\/05\/christie-moffatt.jpg?fit=800%2C1200&ssl=1\" data-orig-size=\"800,1200\" data-comments-opened=\"1\" data-image-meta=\"{"aperture":"3.5","credit":"Chia-Chi \\"Charlie\\" Chang 703-431","camera":"Canon EOS 5D Mark IV","caption":"","created_timestamp":"1525258458","copyright":"www.ImageCaffeine.com","focal_length":"100","iso":"800","shutter_speed":"0.00625","title":"NLM Staff Head Shots","orientation":"1"}\" data-image-title=\"Christie Moffatt\" data-image-description=\"\" data-image-caption=\"\" data-medium-file=\"https:\/\/i0.wp.com\/circulatingnow.nlm.nih.gov\/wp-content\/uploads\/2018\/05\/christie-moffatt.jpg?fit=200%2C300&ssl=1\" data-large-file=\"https:\/\/i0.wp.com\/circulatingnow.nlm.nih.gov\/wp-content\/uploads\/2018\/05\/christie-moffatt.jpg?fit=683%2C1024&ssl=1\" class=\"alignleft wp-image-14591\" src=\"https:\/\/i0.wp.com\/circulatingnow.nlm.nih.gov\/wp-content\/uploads\/2018\/05\/christie-moffatt.jpg?resize=67%2C100&ssl=1\" alt=\"An informal portrait of Christie Moffatt.\" width=\"67\" height=\"100\" \/><\/a>Christie Moffatt is Manager of the Digital Manuscripts Program in the History of Medicine Division at the National Library of Medicine and Chair of NLM\u2019s Web Collecting and Archiving Working Group.<\/em><\/p>\n","protected":false},"excerpt":{"rendered":"<p>Over the past year our project was to gain practical experience with tools and techniques for the study of web archives 
data.<\/p>\n","protected":false},"author":53041628,"featured_media":21177,"comment_status":"open","ping_status":"open","sticky":false,"template":"","format":"standard","meta":{"_coblocks_attr":"","_coblocks_dimensions":"","_coblocks_responsive_height":"","_coblocks_accordion_ie_support":"","advanced_seo_description":"","jetpack_seo_html_title":"","jetpack_seo_noindex":false,"jetpack_post_was_ever_published":false,"_jetpack_newsletter_access":"","_jetpack_dont_email_post_to_subs":false,"_jetpack_newsletter_tier_id":0,"_jetpack_memberships_contains_paywalled_content":false,"_jetpack_memberships_contains_paid_content":false,"footnotes":"","jetpack_publicize_message":"Exploring the Data of Web Archives as Part of Data Science @ NLM #webarchives #histmed","jetpack_publicize_feature_enabled":true,"jetpack_social_post_already_shared":true,"jetpack_social_options":{"image_generator_settings":{"template":"highway","enabled":false},"version":2}},"categories":[12763,76943049],"tags":[678875810,22379,541876,668,678875857,238214791],"class_list":["post-21153","post","type-post","status-publish","format-standard","has-post-thumbnail","hentry","category-collections","category-revealing-data","tag-about-us","tag-data","tag-digital-humanities","tag-research","tag-research-tools","tag-web-collecting"],"jetpack_publicize_connections":[],"jetpack_featured_media_url":"https:\/\/i0.wp.com\/circulatingnow.nlm.nih.gov\/wp-content\/uploads\/2021\/05\/Gephi_zoom_feature.jpg?fit=900%2C400&ssl=1","jetpack_likes_enabled":true,"jetpack_sharing_enabled":true,"jetpack_shortlink":"https:\/\/wp.me\/p3xcDk-5vb","jetpack-related-posts":[],"amp_enabled":true,"_links":{"self":[{"href":"https:\/\/circulatingnow.nlm.nih.gov\/wp-json\/wp\/v2\/posts\/21153","targetHints":{"allow":["GET"]}}],"collection":[{"href":"https:\/\/circulatingnow.nlm.nih.gov\/wp-json\/wp\/v2\/posts"}],"about":[{"href":"https:\/\/circulatingnow.nlm.nih.gov\/wp-json\/wp\/v2\/types\/post"}],"author":[{"embeddable":true,"href":"https:\/
\/circulatingnow.nlm.nih.gov\/wp-json\/wp\/v2\/users\/53041628"}],"replies":[{"embeddable":true,"href":"https:\/\/circulatingnow.nlm.nih.gov\/wp-json\/wp\/v2\/comments?post=21153"}],"version-history":[{"count":41,"href":"https:\/\/circulatingnow.nlm.nih.gov\/wp-json\/wp\/v2\/posts\/21153\/revisions"}],"predecessor-version":[{"id":37730,"href":"https:\/\/circulatingnow.nlm.nih.gov\/wp-json\/wp\/v2\/posts\/21153\/revisions\/37730"}],"wp:featuredmedia":[{"embeddable":true,"href":"https:\/\/circulatingnow.nlm.nih.gov\/wp-json\/wp\/v2\/media\/21177"}],"wp:attachment":[{"href":"https:\/\/circulatingnow.nlm.nih.gov\/wp-json\/wp\/v2\/media?parent=21153"}],"wp:term":[{"taxonomy":"category","embeddable":true,"href":"https:\/\/circulatingnow.nlm.nih.gov\/wp-json\/wp\/v2\/categories?post=21153"},{"taxonomy":"post_tag","embeddable":true,"href":"https:\/\/circulatingnow.nlm.nih.gov\/wp-json\/wp\/v2\/tags?post=21153"}],"curies":[{"name":"wp","href":"https:\/\/api.w.org\/{rel}","templated":true}]}}