Text Mining Research - NIH


Research

Overview of Recent R&D Projects


Example 1: PubMed 2.0


PubMed, an information system for accessing the biomedical literature, is used billions of times each year by millions of people, both in the US and worldwide. It is built and maintained by NCBI/NLM to serve the scientific and medical communities as well as the public at large. With the rapid growth of the biomedical literature and its associated data, exciting opportunities arise to provide effective and efficient access to pertinent biomedical information across data sources. Our overall goal is to deliver the most relevant results (from 26+ million articles) within a fraction of a second to accelerate discovery and improve health. Through automatic analysis of PubMed search logs, we have identified various kinds of information needs of our users and the gaps in the current system. To close these gaps, our team is currently developing a next-generation intelligent system for literature search, PubMed Labs, with an improved user experience and new search features and capabilities.

Example publications

Fiorini et al., Best Match: New relevance search for PubMed. PLoS Biology, 2018.
Fiorini et al., How User Intelligence Is Improving PubMed. Nature Biotechnology, 2018.
Fiorini et al., Towards PubMed 2.0. eLife, 2017.

Example 2: Medical AI/LLMs


Our recent research has explored the uses and limits of large language models (LLMs) in medical text and image analysis for clinical decision support and knowledge discovery. Our investigations into LLMs cover three primary areas: (1) comprehensive evaluations of their performance, equity, and associated risks; (2) methods to augment LLMs with domain-specific knowledge and tools (e.g., GeneGPT); and (3) novel applications of LLMs in biomedicine (e.g., TrialGPT).

Example publications

Jin Q et al., TrialGPT: Matching Patients to Clinical Trials with Large Language Models. Nature Communications, 2024.
Jin Q et al., Hidden Flaws Behind Expert-Level Accuracy of Multimodal GPT-4 Vision in Medicine. npj Digital Medicine, 2024.
Tian S et al., Opportunities and challenges for ChatGPT and large language models in biomedicine and health. Briefings in Bioinformatics, 2024.

Example 3: Literature mining and information extraction


Biological database curation (biocuration) is a key human activity that provides high-quality structured information that would otherwise be buried in unstructured text, facilitating both human and computer analyses of published biological knowledge. To achieve this, expert human curators must read scholarly publications and extract the relevant information, a highly tedious and time-consuming task. Indeed, this manual process is a considerable bottleneck in terms of curation cost, efficiency, and productivity, making it difficult to keep pace with the rapid growth of the literature. Hence, our overall goal is to meet the practical text-mining needs of biocuration, creating a new paradigm in which manual curation is greatly facilitated by automated computer analysis. To this end, we have developed PubTator (Wei et al., 2013), a web-based application for assisting document triage and gene indexing. Through collaboration, PubTator is now successfully integrated into the production workflows of multiple important biological databases such as SwissProt.
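To make the idea of document triage concrete, the following is a minimal sketch of ranking abstracts by how often they mention terms from a curator-supplied list. It is not how PubTator works; the function names `triage_score` and `rank_documents`, the document ids, and the term list are all hypothetical and chosen only for illustration.

```python
import re
from collections import Counter


def triage_score(text, terms):
    """Count tokens in `text` that match a curated term list (case-insensitive)."""
    tokens = re.findall(r"[a-z0-9]+", text.lower())
    counts = Counter(tokens)
    # Counter returns 0 for terms that never occur in the text.
    return sum(counts[term] for term in terms)


def rank_documents(docs, terms):
    """Return document ids ordered from highest to lowest score (ties by id)."""
    scores = {doc_id: triage_score(text, terms) for doc_id, text in docs.items()}
    return sorted(scores, key=lambda doc_id: (-scores[doc_id], doc_id))


if __name__ == "__main__":
    # Toy abstracts and terms, assumed for illustration only.
    docs = {
        "a": "BRCA1 mutation in breast cancer",
        "b": "Weather report",
        "c": "brca1 and BRCA1 variant",
    }
    terms = {"brca1", "mutation", "variant"}
    print(rank_documents(docs, terms))
```

A curator would read the top-ranked documents first. Production triage systems typically rely on trained classifiers rather than raw term counts.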

Example publications

Wei et al. PubTator: a web-based text mining tool for assisting biocuration. Nucleic Acids Res, 2013.
Poux et al., On expert curation and scalability: UniProtKB/Swiss-Prot as a case study. Bioinformatics, 2017.
Lee et al., Scaling Up Data Curation Using Deep Learning: An Application to Literature Triage in Genomic Variation Resources. PLoS Computational Biology, 2018.

Example 4: Medical Image Analysis (Radiology & Ophthalmology)


Mining electronic medical records (EMRs) and medical images has the potential to improve patient care, as such data contain rich information for large patient populations. We have recently text-mined over 100,000 radiology reports, where our algorithm generated “weak” training labels to enable the development of advanced deep learning methods for automatically reading and classifying chest X-ray images. This work has also resulted in the release of ChestX-ray8, one of the largest chest X-ray datasets publicly available to the scientific community. We have also conducted research to assist in the screening of age-related macular degeneration (AMD), a leading cause of vision loss in Americans 60 and older. By leveraging cutting-edge deep learning techniques and repurposing “big” imaging data from a major AMD clinical trial, we developed a novel data-driven approach (DeepSeeNet) for automated AMD diagnosis whose performance exceeded that of human ophthalmologists (retinal specialists in this case). This result highlights the potential of deep learning systems to assist early disease detection and enhance clinical decision-making.
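As a rough illustration of how "weak" labels can be derived from report text, the sketch below marks a finding as present when one of its keywords appears in a sentence without a preceding negation cue. This is an assumed, simplified rule-based approach and not the pipeline used to build ChestX-ray8; the `FINDINGS` keyword lists, the `NEGATION` pattern, and the function name `weak_labels` are hypothetical.

```python
import re

# Hypothetical keyword lists; a real system would use a curated clinical vocabulary.
FINDINGS = {
    "pneumonia": ["pneumonia"],
    "effusion": ["effusion"],
    "cardiomegaly": ["cardiomegaly", "enlarged heart"],
}

# Hypothetical negation cues, matched as whole words.
NEGATION = re.compile(r"\b(no|without|negative for)\b")


def weak_labels(report):
    """Return the set of findings mentioned in the report and not negated."""
    labels = set()
    # Split the report into rough sentences on periods, semicolons, and newlines.
    for sentence in re.split(r"[.;\n]+", report.lower()):
        for label, keywords in FINDINGS.items():
            for keyword in keywords:
                position = sentence.find(keyword)
                if position == -1:
                    continue
                # Skip the mention if a negation cue precedes it in the same sentence.
                if NEGATION.search(sentence[:position]):
                    continue
                labels.add(label)
    return labels


if __name__ == "__main__":
    print(weak_labels("No evidence of pneumonia. Small left pleural effusion."))
```

Labels produced this way are noisy (for example, "Heart is enlarged" is missed by the keyword "enlarged heart"), which is why they are called weak. They are cheap to generate at scale and can still supervise an image classifier.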

Example publications

Wang X et al., ChestX-ray8: Hospital-scale Chest X-ray Database and Benchmarks on Weakly-Supervised Classification and Localization of Common Thorax Diseases. Proceedings of 2017 IEEE Computer Vision and Pattern Recognition (CVPR), 2017.
Peng et al., DeepSeeNet: A deep learning model for automated classification of patient-based age-related macular degeneration severity from color fundus photographs. Ophthalmology, 2018.
Wang X et al., TieNet: Text-Image Embedding Network for Common Thorax Disease Classification and Reporting in Chest X-rays. Proceedings of 2018 IEEE Computer Vision and Pattern Recognition (CVPR), 2018.

Example 5: BioCreative


Critical Assessment of Information Extraction in Biology (BioCreative) is a community effort for evaluating text mining and information extraction systems applied to the biological domain. Since 2004, the BioCreative evaluation series has included over ten different tasks, such as ranking relevant documents ("document triage"), extracting gene and protein mentions ("gene mention") and linking them to database identifiers ("gene normalization"), creating functional annotations in standard ontologies (e.g., GO), and extracting entity relations (e.g., protein-protein interactions). As members of the BioCreative executive committee, we have led the organization of multiple shared tasks in recent years, such as:

Chemical-Disease Relation Extraction - BioCreative 2015
BioC: The BioCreative Interoperability Initiative - BioCreative 2015 & 2013
Automatic Gene Ontology (GO) Annotation - BioCreative 2013
Multi-species Gene Normalization (GN) - BioCreative 2010
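
Shared tasks like these compare system output against gold-standard annotations. The sketch below assumes a simple set-based scoring of gene normalization output, where each prediction is a (document id, gene identifier) pair, and computes precision, recall, and F1. The official BioCreative metrics differ from task to task, so this illustrates the general idea only; the function name `precision_recall_f1` and the example pairs are hypothetical.

```python
def precision_recall_f1(predicted, gold):
    """Score predicted (document id, gene id) pairs against gold pairs."""
    predicted, gold = set(predicted), set(gold)
    true_positives = len(predicted & gold)
    precision = true_positives / len(predicted) if predicted else 0.0
    recall = true_positives / len(gold) if gold else 0.0
    if precision + recall == 0:
        return precision, recall, 0.0
    f1 = 2 * precision * recall / (precision + recall)
    return precision, recall, f1


if __name__ == "__main__":
    # Illustrative pairs only; identifiers are placeholders, not real task data.
    gold = {("d1", "672"), ("d1", "7157"), ("d2", "1956")}
    predicted = {("d1", "672"), ("d2", "1956"), ("d2", "999")}
    print(precision_recall_f1(predicted, gold))
```

In this toy case two of the three predictions are correct and two of the three gold pairs are found, so precision, recall, and F1 all equal 2/3.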

Example publications

Lu et al. The gene normalization task in BioCreative III. BMC Bioinformatics, 2011.
Comeau et al. BioC: a minimalist approach to interoperability for biomedical text processing. Database (Oxford), 2013.
Mao et al. Overview of the gene ontology task at BioCreative IV. Database (Oxford), 2014.
Wei et al. Assessing the state of the art in biomedical relation extraction: overview of the BioCreative V chemical-disease relation (CDR) task. Database (Oxford), 2016.