AI Datasets

Description
The PubMed Computed Authors dataset consists of disambiguated author names from PubMed, freely available via API queries and FTP downloads. Using advanced AI algorithms, PubMed Computed Authors has disambiguated more than 21 million individual authors across nearly 36 million PubMed articles with high accuracy. With regular weekly updates, PubMed Computed Authors continuously provides the most recent disambiguated authors for all PubMed articles.
Description
The NLM-Chem corpus is a manually annotated full-text resource on chemicals in the biomedical literature. The corpus contains 150 full-text journal articles, selected both for being rich in chemical mentions and for being articles where human annotation was expected to be most valuable. The corpus was doubly annotated by ten expert NLM indexers, with high inter-annotator agreement, and contains ~5000 unique chemical name annotations mapped to ~2000 MeSH identifiers.
Description
The dataset contains 705,915 PubMed Phrases (Kim et al., 2018) that are beneficial for information retrieval and human comprehension.
Description
The weakly labeled corpus used in Peng et al. (2016) consists of 18,410 abstracts and 33,224 chemical-induced disease (CID) relations. The raw data were extracted from curated data in the CTD-Pfizer collaboration, which provides document-level annotations of drug-disease and drug-phenotype interactions. We applied tmChem and DNorm to recognize and normalize chemical and disease mentions, respectively. To maximize recall, we also applied a dictionary look-up method with a controlled vocabulary (MeSH). Finally, we filtered out documents with no CID relations in the title or abstract, because some asserted relations appear only in the full text.
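The final filtering step can be illustrated with a minimal Python sketch. It is not the authors' code: the function name, the data shapes, and the placeholder identifiers are all assumptions made for illustration, and the output of tmChem, DNorm, and the MeSH dictionary look-up is represented simply as a set of normalized concept IDs found in each title and abstract. Under one reading of the step, a document-level relation is kept only when both its chemical and its disease were found in the title or abstract, and documents left with no such relation are dropped.

```python
from typing import Dict, List, Set, Tuple

# Hypothetical shape: a document-level relation is a (chemical_id, disease_id) pair.
Relation = Tuple[str, str]


def filter_supported_relations(
    asserted: Dict[str, List[Relation]],
    mentions: Dict[str, Set[str]],
) -> Dict[str, List[Relation]]:
    """Keep relations whose chemical and disease IDs both occur in the
    title/abstract mentions of the same document; drop documents that
    end up with no supported relation."""
    kept: Dict[str, List[Relation]] = {}
    for doc_id, relations in asserted.items():
        found = mentions.get(doc_id, set())
        supported = [(chem, dis) for chem, dis in relations
                     if chem in found and dis in found]
        if supported:
            kept[doc_id] = supported
    return kept


# Placeholder IDs, not real MeSH identifiers or PubMed IDs.
asserted_relations = {
    "doc1": [("CHEM_A", "DIS_X"), ("CHEM_A", "DIS_Y")],
    "doc2": [("CHEM_B", "DIS_Z")],
}
abstract_mentions = {
    "doc1": {"CHEM_A", "DIS_X"},  # DIS_Y is only in the full text
    "doc2": {"CHEM_B"},           # DIS_Z is only in the full text
}

filtered = filter_supported_relations(asserted_relations, abstract_mentions)
print(filtered)
```

In this toy input, doc2 is removed entirely because its only asserted relation has no disease mention in the title or abstract, which mirrors the reason given above for the filter.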
Description
The tmVar corpus contains 500 PubMed articles manually annotated with mutation mentions of various kinds.
Description
The BC5CDR corpus consists of 1500 PubMed articles with 4409 annotated chemicals, 5818 diseases, and 3116 chemical-disease interactions.
Description
The NCBI disease corpus is a collection of 793 PubMed abstracts fully annotated at both the mention and concept levels.