PLoS Comput Biol. 2018 Aug 13;14(8):e1006390.
doi: 10.1371/journal.pcbi.1006390. eCollection 2018 Aug.

Scaling up data curation using deep learning: An application to literature triage in genomic variation resources

Kyubum Lee et al. PLoS Comput Biol.

Abstract

Manually curating biomedical knowledge from publications is necessary to build a knowledge-based service that provides highly precise and organized information to users. The process of retrieving relevant publications for curation, also known as document triage, is usually carried out by querying and reading articles in PubMed. However, this query-based method often yields unsatisfactory precision and recall, and it is difficult to generate optimal queries manually. To address this, we propose a machine learning-assisted triage method. We collect previously curated publications from two databases, UniProtKB/Swiss-Prot and the NHGRI-EBI GWAS Catalog, and use them as a gold-standard dataset for training deep learning models based on convolutional neural networks. We then use the trained models to classify and rank new publications for curation. For evaluation, we apply our method to the real-world manual curation process of UniProtKB/Swiss-Prot and the GWAS Catalog. We demonstrate that our machine-assisted triage method outperforms the current query-based triage methods, improves efficiency, and enriches curated content. Our method achieves a precision 1.81 and 2.99 times higher than that obtained by the current query-based triage methods of UniProtKB/Swiss-Prot and the GWAS Catalog, respectively, without compromising recall. In fact, our method retrieves many additional relevant publications that the query-based method of UniProtKB/Swiss-Prot could not find. As these results show, our machine learning-based method can make the triage process more efficient, and it is being implemented in production so that human curators can focus on more challenging tasks to improve the quality of knowledge bases.
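The precision comparison described in the abstract can be illustrated with a minimal sketch. It is not the authors' implementation: the document identifiers, classifier scores, the 0.3 threshold, and the helper names `precision_recall` and `triage` are all hypothetical, and the scores stand in for the output of a trained model.

```python
def precision_recall(retrieved: set[str], relevant: set[str]) -> tuple[float, float]:
    """Return (precision, recall) of a retrieved set against a gold-standard set."""
    hits = len(retrieved & relevant)
    precision = hits / len(retrieved) if retrieved else 0.0
    recall = hits / len(relevant) if relevant else 0.0
    return precision, recall


def triage(scores: dict[str, float], threshold: float) -> list[str]:
    """Keep documents scoring at or above the threshold, ranked by descending score."""
    kept = [doc for doc, score in scores.items() if score >= threshold]
    return sorted(kept, key=lambda doc: scores[doc], reverse=True)


# Hypothetical toy data: three curated (relevant) papers among six query hits.
relevant = {"p1", "p2", "p3"}
query_hits = {"p1", "p2", "p3", "p4", "p5", "p6"}

# Hypothetical classifier scores for the same six papers.
scores = {"p1": 0.95, "p2": 0.90, "p3": 0.70, "p4": 0.40, "p5": 0.20, "p6": 0.10}

ranked = triage(scores, threshold=0.3)

baseline_precision, baseline_recall = precision_recall(query_hits, relevant)
model_precision, model_recall = precision_recall(set(ranked), relevant)

# Precision gain of the classifier-filtered list over the raw query results.
precision_ratio = model_precision / baseline_precision

print(ranked)
print(baseline_precision, baseline_recall)
print(model_precision, model_recall)
print(precision_ratio)
```

In this toy case the filtered list drops two irrelevant papers while keeping every relevant one, so precision rises and recall is unchanged. The same ratio, computed on real curation data, is the kind of figure the abstract reports as 1.81 and 2.99.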

Conflict of interest statement

The authors have declared that no competing interests exist.

Figures

Fig 1. Literature triage using our deep learning framework.
Fig 2. ROC curves of the classification results on the 2017JanJul group of UniProtKB/Swiss-Prot (blue) and the GWAS Catalog (red): (a) curves for all publications; (b) curves for publications containing mutations at the abstract level.
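
For readers unfamiliar with ROC curves such as those in Fig 2, the following sketch shows how the curve points and the area under the curve can be computed from classifier scores. The labels and scores are hypothetical toy values, not data from the paper, and `roc_points` and `auc` are illustrative helper names.

```python
def roc_points(labels: list[int], scores: list[float]) -> list[tuple[float, float]]:
    """Return (false positive rate, true positive rate) points, sweeping the
    decision threshold from the highest score to the lowest."""
    positives = sum(labels)
    negatives = len(labels) - positives
    if positives == 0 or negatives == 0:
        raise ValueError("both classes must be present to draw a ROC curve")

    pairs = sorted(zip(scores, labels), key=lambda pair: pair[0], reverse=True)
    points = [(0.0, 0.0)]
    true_pos = false_pos = 0
    i = 0
    while i < len(pairs):
        current = pairs[i][0]
        # Documents with tied scores cross the threshold together.
        while i < len(pairs) and pairs[i][0] == current:
            if pairs[i][1] == 1:
                true_pos += 1
            else:
                false_pos += 1
            i += 1
        points.append((false_pos / negatives, true_pos / positives))
    return points


def auc(points: list[tuple[float, float]]) -> float:
    """Area under the curve by the trapezoidal rule."""
    return sum(
        (x1 - x0) * (y0 + y1) / 2
        for (x0, y0), (x1, y1) in zip(points, points[1:])
    )


# Hypothetical toy data: 1 = relevant for curation, 0 = not relevant.
labels = [1, 1, 0, 1, 0, 0]
scores = [0.9, 0.8, 0.7, 0.6, 0.4, 0.2]

curve = roc_points(labels, scores)
area = auc(curve)
print(curve)
print(round(area, 4))
```

An area of 1.0 would mean every relevant publication is ranked above every irrelevant one; 0.5 corresponds to random ranking.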
