Research
Overview of Recent R&D Projects

Example 1: PubMed 2.0
PubMed, an information system for accessing the biomedical literature, is used billions of times each year by millions of people, both in the US and worldwide. It is built and maintained by NCBI/NLM to serve both the scientific and medical communities and the public at large. The rapid growth of the biomedical literature and its associated data creates exciting opportunities to provide effective and efficient access to pertinent biomedical information across data sources. Our overall goal is to deliver the most relevant results (from 26+ million articles) within a fraction of a second to drive accelerated discovery and better health. Through automatic analysis of PubMed search logs, we have identified the various information needs of our users and the gaps in the current system. To close these gaps, our team is currently developing a next-generation intelligent system for literature search, PubMed Labs, with an improved user experience and new search features and capabilities.
Example publications
- Fiorini et al., Best Match: New relevance search for PubMed. PLoS Biology, 2018.
- Fiorini et al., How User Intelligence Is Improving PubMed. Nature Biotechnology, 2018.
- Fiorini et al., Towards PubMed 2.0. eLife, 2017.
Example 2: Medical AI/LLMs
Example publications
- Jin Q et al., TrialGPT: Matching Patients to Clinical Trials with Large Language Models. Nature Communications, 2024.
- Jin Q et al., Hidden Flaws Behind Expert-Level Accuracy of Multimodal GPT-4 Vision in Medicine. npj Digital Medicine, 2024.
- Tian S et al., Opportunities and challenges for ChatGPT and large language models in biomedicine and health. Briefings in Bioinformatics, 2024.
Example 3: Literature mining and information extraction
Biological database curation (biocuration) is a key human activity that provides high-quality structured information that would otherwise be buried in unstructured text, facilitating both human and computer analyses of published biological knowledge. To achieve this, expert human curators must read scholarly publications and extract the relevant information, a highly tedious and time-consuming task. Indeed, this manual process is a considerable bottleneck in terms of curation cost, efficiency and productivity, making it difficult to keep pace with the rapid growth of the literature. Hence, our overall goal is to meet the practical text-mining needs of biocuration, creating a new paradigm in which manual curation is greatly facilitated by automated computer analysis. To this end, we have developed PubTator (Wei et al., 2013), a web-based application for assisting document triage and gene indexing. Through collaboration, PubTator is now successfully integrated into the production workflows of multiple important biological databases such as SwissProt.
Example publications
- Wei et al., PubTator: a web-based text mining tool for assisting biocuration. Nucleic Acids Res, 2013.
- Poux et al., On expert curation and scalability: UniProtKB/Swiss-Prot as a case study. Bioinformatics, 2017.
- Lee et al., Scaling Up Data Curation Using Deep Learning: An Application to Literature Triage in Genomic Variation Resources. PLoS Computational Biology, 2018.
Example 4: Medical Image Analysis (Radiology & Ophthalmology)
Mining electronic medical records (EMRs) and medical images has the potential to improve patient care, as such data contain rich information for large patient populations. We have recently text-mined over 100,000 radiology reports, using our algorithm to generate “weak” training labels that enable the development of advanced deep learning methods for automatically reading and classifying chest X-ray images. This work has also resulted in the release of ChestX-ray8, one of the largest publicly available chest X-ray datasets, to the scientific community. We have also conducted research to assist in the screening of age-related macular degeneration (AMD), a leading cause of vision loss in Americans 60 and older. By leveraging cutting-edge deep learning techniques and repurposing “big” imaging data from a major AMD clinical trial, we developed a novel data-driven approach (DeepSeeNet) for autonomous AMD diagnosis whose performance exceeded that of human ophthalmologists (retinal specialists in this case). This result highlights the potential of deep learning systems to assist early disease detection and enhance clinical decision-making.
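To give a rough sense of what generating “weak” labels from report text can look like, the sketch below matches finding terms in each sentence and skips mentions preceded by a negation cue. It is an illustration under simplifying assumptions, not the algorithm used to build ChestX-ray8: the term lists, the negation cues, and the function name `weak_labels` are all hypothetical.

```python
import re
from typing import Set

# Hypothetical finding vocabulary (label -> trigger terms); the real lexicon
# and label set are not described in this overview.
FINDING_TERMS = {
    "pneumonia": ["pneumonia"],
    "effusion": ["effusion", "pleural fluid"],
    "cardiomegaly": ["cardiomegaly", "enlarged heart"],
}

# Hypothetical, deliberately small set of negation cues.
NEGATION_CUES = re.compile(r"\b(no|without|negative for|free of)\b")


def weak_labels(report: str) -> Set[str]:
    """Return the set of findings asserted (not negated) in a report."""
    labels = set()
    # Treat each sentence separately so a negation does not leak across sentences.
    for sentence in re.split(r"[.;\n]+", report):
        lowered = sentence.lower()
        for label, terms in FINDING_TERMS.items():
            for term in terms:
                pos = lowered.find(term)
                if pos == -1:
                    continue
                # Skip the mention if a negation cue precedes it in the sentence.
                if NEGATION_CUES.search(lowered[:pos]):
                    continue
                labels.add(label)
    return labels


report = "Cardiomegaly is present. No pleural effusion or pneumothorax. Right lower lobe pneumonia."
print(sorted(weak_labels(report)))
```

Labels produced this way are noisy by construction, which is why they are called “weak”: they replace manual image annotation at scale but inherit any errors made by the text matching.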
Example publications
- Wang X et al., ChestX-ray8: Hospital-scale Chest X-ray Database and Benchmarks on Weakly-Supervised Classification and Localization of Common Thorax Diseases. Proceedings of the 2017 IEEE Conference on Computer Vision and Pattern Recognition (CVPR), 2017.
- Peng et al., DeepSeeNet: A deep learning model for automated classification of patient-based age-related macular degeneration severity from color fundus photographs. Ophthalmology, 2018.
- Wang X et al., TieNet: Text-Image Embedding Network for Common Thorax Disease Classification and Reporting in Chest X-rays. Proceedings of the 2018 IEEE Conference on Computer Vision and Pattern Recognition (CVPR), 2018.
Example 5: BioCreative
Critical Assessment of Information Extraction in Biology (BioCreative) is a community effort for evaluating text mining and information extraction systems applied to the biological domain. Since 2004, the BioCreative evaluation series has included over ten different tasks, such as ranking of relevant documents ("document triage"), extraction of genes and proteins ("gene mention") and their linkage to database identifiers ("gene normalization"), as well as creation of functional annotations in standard ontologies (e.g., GO) and extraction of entity relations (e.g., protein-protein interaction). As part of the BioCreative executive committee, we have led the organization of multiple shared tasks in recent years, such as:
- Chemical-Disease Relation Extraction - BioCreative 2015
- BioC: The BioCreative Interoperability Initiative - BioCreative 2015 & 2013
- Automatic Gene Ontology (GO) Annotation - BioCreative 2013
- Multi-species Gene Normalization (GN) - BioCreative 2010
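For readers unfamiliar with the task names, the toy sketch below shows the difference between "gene mention" (finding a gene name in text) and "gene normalization" (linking it to a database identifier). It is a dictionary lookup for illustration only and does not represent any BioCreative system; the lexicon and the function name `find_and_normalize` are assumptions, and real systems must handle ambiguity, spelling variants and multiple species.

```python
import re
from typing import List, Tuple

# Tiny illustrative lexicon: lowercase surface form -> NCBI Gene ID (human).
GENE_LEXICON = {
    "brca1": "672",
    "tp53": "7157",
    "p53": "7157",   # common alias of TP53
    "egfr": "1956",
}


def find_and_normalize(text: str) -> List[Tuple[str, int, str]]:
    """Return (mention, character offset, gene ID) for each lexicon hit."""
    results = []
    for match in re.finditer(r"[A-Za-z0-9]+", text):
        # Gene mention: does this token look like a known gene name?
        gene_id = GENE_LEXICON.get(match.group().lower())
        if gene_id is not None:
            # Gene normalization: attach the database identifier.
            results.append((match.group(), match.start(), gene_id))
    return results


print(find_and_normalize("Mutations in BRCA1 and p53 were observed."))
```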
Example publications
- Lu et al., The gene normalization task in BioCreative III. BMC Bioinformatics, 2011.
- Comeau et al., BioC: a minimalist approach to interoperability for biomedical text processing. Database (Oxford), 2013.
- Mao et al., Overview of the gene ontology task at BioCreative IV. Database (Oxford), 2014.
- Wei et al., Assessing the state of the art in biomedical relation extraction: overview of the BioCreative V chemical-disease relation (CDR) task. Database (Oxford), 2016.