BioC API for PMC Open Access
PubMed Central Open Access in BioC format
All PubMed Central (PMC) Open Access articles are available in the BioC format, which provides a large collection of full-text research articles for text mining and information retrieval research. BioC is a simple format designed for straightforward text processing. Each article is available in BioC XML or BioC JSON, in Unicode or ASCII, and can be requested by PubMed ID or PMC ID.
If you use this resource, please cite:
Articles available from this service are in the PMC Open Access Subset and the PMC Author Manuscript Collection. Information about these collections is available on the following pages.
- PMC Open Access Subset: https://www.ncbi.nlm.nih.gov/pmc/tools/openftlist/
- PMC Author Manuscript Collection: https://www.ncbi.nlm.nih.gov/pmc/about/mscollection/
Not all PMC articles are available in these collections. Lists of articles in the collections are available via FTP.
- Complete Open Access Subset: ftp://ftp.ncbi.nlm.nih.gov/pub/pmc/oa_file_list.txt
- Commercial Use Collection: ftp://ftp.ncbi.nlm.nih.gov/pub/pmc/oa_comm_use_file_list.txt
- PMC Author Manuscript Collection: ftp://ftp.ncbi.nlm.nih.gov/pub/pmc/manuscript/filelist.txt
These file lists are also available in CSV format. The FTP service is described at https://www.ncbi.nlm.nih.gov/pmc/tools/ftp/
Articles in the BioC API for PMC Open Access are usually updated within 24 hours of these files being updated.
Instructions
Request an article with a URL of the following form:
https://www.ncbi.nlm.nih.gov/research/bionlp/RESTful/pmcoa.cgi/BioC_[format]/[ID]/[encoding]
The parameters are:
- format: xml or json
- ID: PubMed ID (such as 17299597) or PMC ID (such as PMC1790863)
- encoding: unicode or ascii
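The URL pattern and parameters above can be assembled programmatically. The following is a minimal Python sketch, not an official client: the function names `build_bioc_url` and `fetch_bioc` are hypothetical, and decoding the response body as UTF-8 is an assumption (ASCII output is a subset of UTF-8, so it decodes the same way).

```python
from urllib.request import urlopen

# Base of the service URL, taken from the pattern documented above.
BASE_URL = "https://www.ncbi.nlm.nih.gov/research/bionlp/RESTful/pmcoa.cgi"

ALLOWED_FORMATS = ("xml", "json")
ALLOWED_ENCODINGS = ("unicode", "ascii")


def build_bioc_url(article_id, fmt="xml", encoding="unicode"):
    """Return the request URL for one article (hypothetical helper).

    article_id is a PubMed ID such as "17299597" or a PMC ID such as
    "PMC1790863"; it is inserted into the URL unchanged.
    """
    if fmt not in ALLOWED_FORMATS:
        raise ValueError(f"format must be one of {ALLOWED_FORMATS}")
    if encoding not in ALLOWED_ENCODINGS:
        raise ValueError(f"encoding must be one of {ALLOWED_ENCODINGS}")
    return f"{BASE_URL}/BioC_{fmt}/{article_id}/{encoding}"


def fetch_bioc(article_id, fmt="xml", encoding="unicode", timeout=30):
    """Download one article and return the body as text.

    Assumption: the response body can be decoded as UTF-8.
    """
    url = build_bioc_url(article_id, fmt, encoding)
    with urlopen(url, timeout=timeout) as response:
        return response.read().decode("utf-8")


if __name__ == "__main__":
    # Only builds and prints the URL; call fetch_bioc() to contact the service.
    print(build_bioc_url("17299597"))
```

Keeping URL construction separate from the network call makes it easy to check the generated address before sending any request.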
Sample URL:
https://www.ncbi.nlm.nih.gov/research/bionlp/RESTful/pmcoa.cgi/BioC_xml/17299597/unicode
Same article in ASCII:
https://www.ncbi.nlm.nih.gov/research/bionlp/RESTful/pmcoa.cgi/BioC_xml/17299597/ascii
No Unicode-to-ASCII translation is perfect, but we have found this one useful.
JSON instead of XML:
https://www.ncbi.nlm.nih.gov/research/bionlp/RESTful/pmcoa.cgi/BioC_json/17299597/unicode
BioC JSON follows the same structure as BioC XML.
Using PMC ID instead of PubMed ID:
https://www.ncbi.nlm.nih.gov/research/bionlp/RESTful/pmcoa.cgi/BioC_xml/PMC1790863/unicode
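Because BioC JSON follows the same structure as BioC XML, a parsed JSON response can be walked as a collection of documents, each holding passages of text. The sketch below is illustrative only and rests on several assumptions: that the collection exposes a `documents` list, that each document has a `passages` list with `text` and `infons` fields, and that a `section_type` infon is present (pmc.key, listed under "More information", is the authoritative description). The helper name `iter_passages` is hypothetical, the sample data is invented for the example, and the helper accepts a list of collections in case a response is wrapped that way.

```python
import json


def iter_passages(data):
    """Yield (section_type, text) for every passage in parsed BioC JSON.

    Assumed layout: collection -> "documents" -> "passages" -> "text",
    with optional "infons". Missing keys are treated as empty.
    """
    # Accept a single collection or a list of collections.
    collections = data if isinstance(data, list) else [data]
    for collection in collections:
        for document in collection.get("documents", []):
            for passage in document.get("passages", []):
                section = passage.get("infons", {}).get("section_type")
                yield section, passage.get("text", "")


# Invented sample that mimics the assumed structure; not real article data.
SAMPLE_JSON = """
{
  "documents": [
    {
      "id": "example-1",
      "passages": [
        {"offset": 0, "infons": {"section_type": "TITLE"},
         "text": "Example title"},
        {"offset": 14, "infons": {"section_type": "ABSTRACT"},
         "text": "Example abstract text."}
      ]
    }
  ]
}
"""

sample = json.loads(SAMPLE_JSON)

if __name__ == "__main__":
    for section, text in iter_passages(sample):
        print(section, text, sep="\t")
```

The same traversal applies to BioC XML, where the corresponding elements are described in BioC.dtd.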
Bulk Download
BioC PMC articles can be downloaded in bulk from the FTP site:
https://ftp.ncbi.nlm.nih.gov/pub/wilbur/BioC-PMC
More information
General information about BioC XML structure:
ftp://ftp.ncbi.nlm.nih.gov/pub/wilbur/BioC-PMC/BioC.dtd
Specific information about BioC-PMC:
ftp://ftp.ncbi.nlm.nih.gov/pub/wilbur/BioC-PMC/pmc.key
Main BioC web page:
http://bioc.sourceforge.net
Caution
If you experience any problems, please share them with us: donald.comeau@nih.gov or zhiyong.lu@nih.gov.