Adds FDA and NIH HTML URLs for seed list (#25)

* FDA HTML URLs for seed list

This is a file of URLs derived from the FDA's sitemap.xml file. URLs of the form /media/*/download and the warning-letters URLs have been filtered out, so in theory everything in this file is an HTML link. I plan on submitting the PDF content separately.

The format of this CSV file is: <sitemap file the URL is sourced from>,<the URL>
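The filtering described above could be sketched roughly as follows. This is not the script that produced the file; the regex, the `warning-letters` substring check, the `classify` name, and the example URLs are all assumptions made for illustration.

```python
import re
from urllib.parse import urlparse

# Assumed patterns, inferred from the description above
# (not taken from the script that actually built the seed list).
DOWNLOAD_PATH = re.compile(r"^/media/[^/]+/download/?$")
WARNING_LETTERS_MARKER = "warning-letters"


def classify(url: str) -> str:
    """Return 'download', 'warning-letters', or 'html' for a sitemap URL."""
    path = urlparse(url).path
    if DOWNLOAD_PATH.match(path):
        return "download"
    if WARNING_LETTERS_MARKER in path:
        return "warning-letters"
    # Everything else is assumed to be a regular HTML page.
    return "html"


# Hypothetical URLs, for illustration only.
examples = [
    "https://www.fda.gov/media/12345/download",
    "https://www.fda.gov/example-section/warning-letters/example-company",
    "https://www.fda.gov/food",
]
for url in examples:
    print(classify(url), url)
```

Only URLs classified as `html` would end up in this file; the other two buckets correspond to the separate files described below.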

* FDA warning letters from sitemap.xml

This is a file of warning-letter URLs derived from the FDA's sitemap.xml file. I expected these to be PDF content, but they turn out to be HTML content.

The format of this CSV file is: <sitemap file the URL is sourced from>,<the URL>

* FDA download links from sitemaps

This is a file of URLs derived from the FDA's sitemap.xml file, where the link is of the form /media/id/download. These resolve to a PDF file rendered in an HTML wrapper using Mozilla's PDF.js library -- so the "download" in the path name is a little misleading.

The format of this CSV file is: <sitemap file the URL is sourced from>,<the URL>

* NIH URLs from three sitemaps

This is a file of URLs derived from three NIH sitemap files: https://www.nih.gov/sitemap.xml, https://newsinhealth.nih.gov/sitemap.xml and https://nihrecord.nih.gov/sitemap.xml. 

The format of this CSV file is: <sitemap file the URL is sourced from>,<the URL>
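
For reference, rows in this two-column format could be produced from a sitemap along these lines. This is a minimal sketch, not the tooling actually used: `sitemap_to_rows` and `rows_to_csv` are hypothetical helper names, the sample XML is made up, and sitemap index files (sitemaps that point at other sitemaps) are not handled.

```python
import csv
import io
import xml.etree.ElementTree as ET

# Namespace defined by the sitemaps.org protocol.
SITEMAP_NS = {"sm": "http://www.sitemaps.org/schemas/sitemap/0.9"}


def sitemap_to_rows(sitemap_url: str, sitemap_xml: str) -> list[tuple[str, str]]:
    """Return (sitemap URL, page URL) pairs for every <loc> in a <urlset>."""
    root = ET.fromstring(sitemap_xml)
    return [
        (sitemap_url, loc.text.strip())
        for loc in root.findall("sm:url/sm:loc", SITEMAP_NS)
        if loc.text
    ]


def rows_to_csv(rows) -> str:
    """Serialize the pairs as CSV text, one row per URL."""
    buf = io.StringIO()
    csv.writer(buf, lineterminator="\n").writerows(rows)
    return buf.getvalue()


# Made-up sitemap content, for illustration only.
sample = (
    '<urlset xmlns="http://www.sitemaps.org/schemas/sitemap/0.9">'
    "<url><loc>https://www.nih.gov/a</loc></url>"
    "<url><loc>https://www.nih.gov/b</loc></url>"
    "</urlset>"
)
rows = sitemap_to_rows("https://www.nih.gov/sitemap.xml", sample)
print(rows_to_csv(rows), end="")
```

Using the `csv` module rather than joining strings by hand means any URL that happens to contain a comma is quoted correctly.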
YakShaver 2025-01-02 09:33:40 -08:00 committed by GitHub
commit a3d96841db
4 changed files with 100935 additions and 0 deletions

seed-lists/nih-urls.csv: 15044 additions (file diff suppressed because it is too large)