BioC API for PubMed
Accessing PubMed in BioC format (PMC articles are accessed through a separate page)
All PubMed articles are available in the BioC format, providing a large collection of research articles for text mining and information retrieval research. BioC is a simple format designed for straightforward text processing. Each article can be retrieved as BioC XML or BioC JSON, in either Unicode or ASCII.
If you use this resource, please cite:
- Comeau DC, Wei CH, Dogan RI, and Lu Z. PMC text mining subset in BioC: about 3 million full text articles and growing, Bioinformatics, 2019
Instructions
Request an article with a URL of the following form:
https://www.ncbi.nlm.nih.gov/research/bionlp/RESTful/pubmed.cgi/BioC_[format]/[PMID]/[encoding]
The parameters are:
- format: xml or json
- PMID: the PubMed ID of the article, for example 17299597
- encoding: unicode or ascii
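As a minimal sketch, the URL pattern and its three parameters can be assembled in Python as shown here. The function names bioc_pubmed_url and fetch_bioc_pubmed, the BASE_URL constant, the 30-second timeout, and the choice to decode the response as UTF-8 are assumptions made for illustration and are not part of the service.

```python
from urllib.request import urlopen

# Base address taken from the URL pattern above.
BASE_URL = "https://www.ncbi.nlm.nih.gov/research/bionlp/RESTful/pubmed.cgi"


def bioc_pubmed_url(pmid, fmt="xml", encoding="unicode"):
    """Build the request URL for one PubMed article in BioC format."""
    if fmt not in ("xml", "json"):
        raise ValueError("format must be 'xml' or 'json'")
    if encoding not in ("unicode", "ascii"):
        raise ValueError("encoding must be 'unicode' or 'ascii'")
    pmid = str(pmid)
    if not (pmid.isascii() and pmid.isdigit()):
        raise ValueError("PMID must consist of digits only")
    return f"{BASE_URL}/BioC_{fmt}/{pmid}/{encoding}"


def fetch_bioc_pubmed(pmid, fmt="xml", encoding="unicode", timeout=30):
    """Download one article and return the response body as text.

    Assumption: the body is decoded as UTF-8, which also covers
    the ASCII variant.
    """
    url = bioc_pubmed_url(pmid, fmt, encoding)
    with urlopen(url, timeout=timeout) as response:
        return response.read().decode("utf-8")
```

For example, bioc_pubmed_url(17299597, "json") builds the JSON request URL for PMID 17299597, and fetch_bioc_pubmed(17299597) would download the same article as BioC XML in Unicode.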
Sample URL:
https://www.ncbi.nlm.nih.gov/research/bionlp/RESTful/pubmed.cgi/BioC_xml/17299597/unicode
Same article in ASCII:
https://www.ncbi.nlm.nih.gov/research/bionlp/RESTful/pubmed.cgi/BioC_xml/17299597/ascii
No Unicode-to-ASCII translation is perfect, but we have found this one useful.
JSON instead of XML:
https://www.ncbi.nlm.nih.gov/research/bionlp/RESTful/pubmed.cgi/BioC_json/17299597/unicode
BioC JSON follows the same structure as BioC XML.
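Because BioC JSON mirrors the XML structure (a collection holds documents, and each document holds passages), the passage text can be collected with two nested loops. The sketch below runs on a small hand-written sample, not a real response. The key names "documents", "passages", "offset", and "text" are assumed from the BioC element names, and the sample ID and text are hypothetical. The actual response may differ, for example by wrapping the collection in a list, so check a real response before relying on this.

```python
import json

# Hypothetical, hand-written sample shaped like a BioC collection.
# It is not a real response from the service.
SAMPLE_JSON = """
{
  "source": "PubMed",
  "documents": [
    {
      "id": "12345",
      "passages": [
        {"infons": {"type": "title"}, "offset": 0,
         "text": "Example title."},
        {"infons": {"type": "abstract"}, "offset": 15,
         "text": "Example abstract text."}
      ]
    }
  ]
}
"""


def passage_texts(collection):
    """Return (document id, passage offset, passage text) tuples."""
    results = []
    for document in collection.get("documents", []):
        for passage in document.get("passages", []):
            results.append(
                (document.get("id"), passage.get("offset"), passage.get("text", ""))
            )
    return results


sample = json.loads(SAMPLE_JSON)
for doc_id, offset, text in passage_texts(sample):
    print(doc_id, offset, text)
```

Using .get with defaults keeps the loop from failing on documents that lack passages or passages that lack text.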
More information
General information about BioC XML structure:
ftp://ftp.ncbi.nlm.nih.gov/pub/wilbur/BioC-PMC/BioC.dtd
Specific information about BioC PubMed:
ftp://ftp.ncbi.nlm.nih.gov/pub/wilbur/BioC-PMC/pubmed.key
Main BioC web page:
http://bioc.sourceforge.net
Caution
If you experience any problems, please share them with us: donald.comeau@nih.gov or zhiyong.lu@nih.gov.