Research

Overview of Recent R&D Projects

Example 1: PubMed 2.0

PubMed, an information system for accessing the biomedical literature, is used billions of times each year by millions of people in the US and worldwide. It is built and maintained by NCBI/NLM to serve both the scientific and medical communities and the public at large. The rapid growth of the biomedical literature and its associated biomedical data creates exciting opportunities to provide effective and efficient access to pertinent biomedical information across data sources. Our overall goal is to deliver the most relevant results (from 26+ million articles) within a fraction of a second to drive accelerated discovery and better health. Through automatic analysis of PubMed search logs, we have identified various kinds of information needs among our users as well as gaps in the current system. To close these gaps, our team is currently developing PubMed Labs, a next-generation intelligent system for literature search with an improved user experience and new search features and capabilities.

Example publications

Example 2: Medical AI/LLMs

Our recent research has explored the uses and limits of large language models (LLMs) in medical text and image analysis for clinical decision support and knowledge discovery. Our investigations into LLMs cover three primary areas: (1) comprehensive evaluations of their performance, equity, and associated risks; (2) methods to augment LLMs with domain-specific knowledge and tools (e.g., GeneGPT); and (3) novel applications of LLMs in biomedicine (e.g., TrialGPT).

Example publications

Example 3: Literature mining and information extraction

Biological database curation (biocuration) is a key human activity that provides high-quality structured information that would otherwise be buried in unstructured text, facilitating both human and computer analyses of published biological knowledge. To achieve this, expert human curators must read scholarly publications and extract the relevant information, a highly tedious and time-consuming task. Indeed, this manual process presents a considerable bottleneck in terms of curation cost, efficiency, and productivity, making it difficult to keep pace with the rapid growth of the literature. Hence, our overall goal is to fulfill the practical text-mining needs of biocuration, creating a new paradigm in which manual curation is greatly facilitated by automated computer analysis. To this end, we have developed PubTator (Wei et al., 2013), a web-based application for assisting document triage and gene indexing. Through collaboration, PubTator is now successfully integrated into the production workflows of multiple important biological databases such as SwissProt.

Example publications

Example 4: Medical Image Analysis (Radiology & Ophthalmology)

Mining electronic medical records (EMRs) and medical images has the potential to improve patient care, as such data contain rich information for large patient populations. We have recently text-mined over 100,000 radiology reports, where our algorithm generated “weak” training labels to enable the development of advanced deep learning methods for automatically reading and classifying chest X-ray images. This work has also resulted in the release of ChestX-ray8, one of the largest publicly available chest X-ray datasets, to the scientific community. We have also conducted research to assist in the screening of age-related macular degeneration (AMD), a leading cause of vision loss in Americans 60 and older. By leveraging cutting-edge deep learning techniques and repurposing “big” imaging data from a major AMD clinical trial, we developed a novel data-driven approach (DeepSeeNet) for autonomous AMD diagnosis whose performance exceeded that of human ophthalmologists (retinal specialists in this case). Such a result highlights the potential of deep learning systems to assist early disease detection and enhance clinical decision-making processes.
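
The idea of deriving “weak” labels from report text can be illustrated with a minimal sketch. This is not the algorithm used in our work: the finding vocabulary, the negation cues, and the function name `weak_labels` are all hypothetical, chosen only to show how free-text reports might be turned into noisy image-level labels by keyword matching with a simple negation check.

```python
import re
from typing import Dict

# Hypothetical finding vocabulary (an assumption for illustration only).
FINDING_TERMS = {
    "pneumonia": ["pneumonia"],
    "effusion": ["pleural effusion", "effusion"],
    "cardiomegaly": ["cardiomegaly", "enlarged heart"],
}

# Simplified negation cues (also an assumption). A cue negates any finding
# that is mentioned after it within the same sentence.
NEGATION_CUES = re.compile(r"\b(no|without|negative for|free of)\b")


def weak_labels(report: str) -> Dict[str, int]:
    """Return a noisy 0/1 label per finding for one radiology report."""
    labels = {finding: 0 for finding in FINDING_TERMS}
    # Split the report into rough sentences on periods, semicolons, newlines.
    for sentence in re.split(r"[.;\n]+", report):
        lowered = sentence.lower()
        for finding, terms in FINDING_TERMS.items():
            for term in terms:
                pos = lowered.find(term)
                if pos == -1:
                    continue  # term not mentioned in this sentence
                if NEGATION_CUES.search(lowered[:pos]):
                    continue  # mention is preceded by a negation cue
                labels[finding] = 1
    return labels


if __name__ == "__main__":
    example = (
        "Heart size is enlarged, consistent with cardiomegaly. "
        "No pleural effusion. "
        "Findings suggest pneumonia in the right lower lobe."
    )
    print(weak_labels(example))
```

Labels produced this way are noisy by construction (for example, uncertain or historical mentions are counted as positive here), which is why they are called “weak”; their value lies in scaling to report collections far too large to annotate by hand.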

Example publications

Example 5: BioCreative

Critical Assessment of Information Extraction in Biology (BioCreative) is a community effort for evaluating text mining and information extraction systems applied to the biological domain. Since 2004, the BioCreative evaluation series has included over ten different tasks, such as ranking of relevant documents ("document triage"), extraction of genes and proteins ("gene mention") and their linkage to database identifiers ("gene normalization"), as well as creation of functional annotations in standard ontologies (e.g., GO) and extraction of entity relations (e.g., protein-protein interaction). As part of the BioCreative executive committee, we have led the organization of multiple shared tasks in recent years.

Example publications