On expert curation and scalability: UniProtKB/Swiss-Prot as a case study
- PMID: 29036270
- PMCID: PMC5860168
- DOI: 10.1093/bioinformatics/btx439
Abstract
Motivation: Biological knowledgebases, such as UniProtKB/Swiss-Prot, constitute an essential component of daily scientific research by offering distilled, summarized and computable knowledge extracted from the literature by expert curators. While knowledgebases play an increasingly important role in the scientific community, their ability to keep up with the growth of biomedical literature is under scrutiny. Using UniProtKB/Swiss-Prot as a case study, we address this concern via multiple literature triage approaches.
Results: With the assistance of the PubTator text-mining tool, we tagged more than 10 000 articles to assess the proportion of papers relevant for curation. We first show that curators read and evaluate many more papers than they curate, so the number of curated publications alone gives an incomplete picture: 8000-10 000 papers are curated in UniProt each year, while curators evaluate 50 000-70 000 papers per year. We also show that 90% of the papers in PubMed are out of the scope of UniProt, that at most 2-3% of the papers indexed in PubMed each year are relevant for UniProt curation, and that, despite appearances, expert curation in UniProt is scalable.
Availability and implementation: UniProt is freely available at http://www.uniprot.org/.
Contact: sylvain.poux@sib.swiss.
Supplementary information: Supplementary data are available at Bioinformatics online.
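As a rough illustration of the gap between evaluated and curated papers reported in the Results, the short Python sketch below bounds the share of evaluated papers that end up curated, using only the yearly ranges quoted in the abstract. The function name and the choice to pair the extremes of the two ranges are assumptions made for illustration; they are not part of the authors' analysis.

```python
def curated_share(curated_range, evaluated_range):
    """Return (low, high) bounds on the fraction of evaluated papers that are curated.

    Hypothetical helper: the lowest share pairs the fewest curated papers with
    the most evaluated papers, and the highest share does the opposite.
    """
    curated_lo, curated_hi = curated_range
    evaluated_lo, evaluated_hi = evaluated_range
    return curated_lo / evaluated_hi, curated_hi / evaluated_lo


# Yearly ranges quoted in the abstract.
low, high = curated_share((8_000, 10_000), (50_000, 70_000))
print(f"{low:.1%} to {high:.1%}")  # prints "11.4% to 20.0%"
```

Under these assumptions, roughly one in nine to one in five evaluated papers is curated, which is why counting curated publications alone understates the triage work.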
© The Author 2017. Published by Oxford University Press.