Scaling up data curation using deep learning: An application to literature triage in genomic variation resources
- PMID: 30102703
- PMCID: PMC6107285
- DOI: 10.1371/journal.pcbi.1006390
Abstract
Manually curating biomedical knowledge from publications is necessary to build a knowledge-based service that provides highly precise and organized information to users. The process of retrieving relevant publications for curation, also known as document triage, is usually carried out by querying PubMed and reading the retrieved articles. However, this query-based method often yields unsatisfactory precision and recall, and it is difficult to manually generate optimal queries. To address this, we propose a machine learning-assisted triage method. We collect previously curated publications from two databases, UniProtKB/Swiss-Prot and the NHGRI-EBI GWAS Catalog, and use them as a gold-standard dataset for training deep learning models based on convolutional neural networks. We then use the trained models to classify and rank new publications for curation. For evaluation, we apply our method to the real-world manual curation processes of UniProtKB/Swiss-Prot and the GWAS Catalog. We demonstrate that our machine-assisted triage method outperforms the current query-based triage methods, improves efficiency, and enriches curated content. Our method achieves a precision 1.81 and 2.99 times higher than that obtained by the current query-based triage methods of UniProtKB/Swiss-Prot and the GWAS Catalog, respectively, without compromising recall. In fact, our method retrieves many additional relevant publications that the query-based method of UniProtKB/Swiss-Prot could not find. As these results show, our machine learning-based method can make the triage process more efficient, and it is being implemented in production so that human curators can focus on more challenging tasks to improve the quality of knowledge bases.
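The comparison described in the abstract (ranking candidate publications by a classifier score, then measuring precision and recall against a curated gold standard) can be illustrated with a minimal sketch. Everything below is an assumption made for illustration: the document IDs, the scores, the 0.5 threshold, and the helper names rank_by_score and precision_recall are hypothetical and do not come from the paper, and the scores stand in for the output of a trained model rather than reproducing the authors' convolutional neural networks.

```python
from typing import Dict, List, Set, Tuple


def rank_by_score(scores: Dict[str, float]) -> List[str]:
    """Return document IDs sorted from highest to lowest classifier score."""
    return sorted(scores, key=lambda doc_id: scores[doc_id], reverse=True)


def precision_recall(retrieved: Set[str], relevant: Set[str]) -> Tuple[float, float]:
    """Precision and recall of a retrieved set against a gold-standard set."""
    hits = len(retrieved & relevant)
    precision = hits / len(retrieved) if retrieved else 0.0
    recall = hits / len(relevant) if relevant else 0.0
    return precision, recall


# Toy, made-up data (NOT from the paper): three curated (relevant) documents.
relevant = {"d1", "d2", "d3"}

# Hypothetical query-based triage: the query returns every candidate.
query_hits = {"d1", "d2", "d3", "d4", "d5", "d6"}

# Hypothetical classifier scores standing in for a trained model's output.
scores = {"d1": 0.95, "d2": 0.90, "d3": 0.70, "d4": 0.40, "d5": 0.20, "d6": 0.10}

# Machine-assisted triage: rank by score, keep documents above an assumed threshold.
ranked = rank_by_score(scores)
model_hits = {doc_id for doc_id in ranked if scores[doc_id] >= 0.5}

query_precision, query_recall = precision_recall(query_hits, relevant)
model_precision, model_recall = precision_recall(model_hits, relevant)

print("query-based:", query_precision, query_recall)
print("model-based:", model_precision, model_recall)
print("precision ratio:", model_precision / query_precision)
```

In this toy setting the model-based set has twice the precision of the query-based set at the same recall, which is the kind of ratio the abstract reports (1.81 and 2.99); the numbers here are illustrative only.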
Conflict of interest statement
The authors have declared that no competing interests exist.
Similar articles
- On expert curation and scalability: UniProtKB/Swiss-Prot as a case study. Bioinformatics. 2017 Nov 1;33(21):3454-3460. doi: 10.1093/bioinformatics/btx439. PMID: 29036270. Free PMC article.
- An enhanced workflow for variant interpretation in UniProtKB/Swiss-Prot improves consistency and reuse in ClinVar. Database (Oxford). 2019 Jan 1;2019:baz040. doi: 10.1093/database/baz040. PMID: 30937429. Free PMC article.
- Using distant supervised learning to identify protein subcellular localizations from full-text scientific articles. J Biomed Inform. 2015 Oct;57:134-44. doi: 10.1016/j.jbi.2015.07.013. Epub 2015 Jul 26. PMID: 26220461.
- Genetic variations and diseases in UniProtKB/Swiss-Prot: the ins and outs of expert manual curation. Hum Mutat. 2014 Aug;35(8):927-35. doi: 10.1002/humu.22594. Epub 2014 Jun 24. PMID: 24848695. Free PMC article. Review.
- Challenges in the annotation of pseudoenzymes in databases: the UniProtKB approach. FEBS J. 2020 Oct;287(19):4114-4127. doi: 10.1111/febs.15100. Epub 2019 Nov 3. PMID: 31618524. Free PMC article. Review.
Cited by
- PubTator central: automated concept annotation for biomedical full text articles. Nucleic Acids Res. 2019 Jul 2;47(W1):W587-W593. doi: 10.1093/nar/gkz389. PMID: 31114887. Free PMC article.
- Automatic identification of scientific publications describing digital reconstructions of neural morphology. Brain Inform. 2023 Sep 8;10(1):23. doi: 10.1186/s40708-023-00202-x. PMID: 37684527. Free PMC article.
- Integrating image caption information into biomedical document classification in support of biocuration. Database (Oxford). 2020 Jan 1;2020:baaa024. doi: 10.1093/database/baaa024. PMID: 32294192. Free PMC article.
- Automatic identification of scientific publications describing digital reconstructions of neural morphology. bioRxiv [Preprint]. 2023 Feb 15:2023.02.14.527522. doi: 10.1101/2023.02.14.527522. Update in: Brain Inform. 2023 Sep 8;10(1):23. doi: 10.1186/s40708-023-00202-x. PMID: 36824882. Free PMC article. Updated. Preprint.
- Performance of Multiple Pretrained BERT Models to Automate and Accelerate Data Annotation for Large Datasets. Radiol Artif Intell. 2022 Jun 29;4(4):e220007. doi: 10.1148/ryai.220007. eCollection 2022 Jul. PMID: 35923377. Free PMC article.