Expanded microbial genome coverage and improved protein family annotation in the COG database

Nucleic Acids Res. 2015 Jan;43(Database issue):D261-9.

doi: 10.1093/nar/gku1223. Epub 2014 Nov 26.

Michael Y Galperin¹, Kira S Makarova¹, Yuri I Wolf¹, Eugene V Koonin²

Affiliations

¹ National Center for Biotechnology Information, National Library of Medicine, National Institutes of Health, Bethesda, MD 20894, USA.
² National Center for Biotechnology Information, National Library of Medicine, National Institutes of Health, Bethesda, MD 20894, USA. koonin@ncbi.nlm.nih.gov

PMID: 25428365
PMCID: PMC4383993
DOI: 10.1093/nar/gku1223

Abstract

Microbial genome sequencing projects produce numerous sequences of deduced proteins, only a small fraction of which have been or will ever be studied experimentally. This leaves sequence analysis as the only feasible way to annotate these proteins and assign them tentative functions. The Clusters of Orthologous Groups of proteins (COGs) database (http://www.ncbi.nlm.nih.gov/COG/), first created in 1997, has been a popular tool for functional annotation. Its success was largely based on (i) its reliance on complete microbial genomes, which allowed reliable assignment of orthologs and paralogs for most genes; (ii) an orthology-based approach, which used the function(s) of the characterized member(s) of the protein family (COG) to assign function(s) to the entire set of carefully identified orthologs and to describe the range of potential functions when there was more than one; and (iii) careful manual curation of the annotation of the COGs, aimed at detailed prediction of the biological function(s) for each COG while avoiding annotation errors and overprediction. Here we present an update of the COGs, the first since 2003, together with a comprehensive revision of the COG annotations and an expansion of the genome coverage to include representative complete genomes from all bacterial and archaeal lineages down to the genus level. This re-analysis of the COGs shows that the original COG assignments had an error rate below 0.5% and allows an assessment of the progress in functional genomics in the past 12 years. During this time, functions of many previously uncharacterized COGs have been elucidated and tentative functional assignments of many COGs have been validated, either by targeted experiments or through the use of high-throughput methods. A particularly important development is the assignment of functions to several widespread, conserved proteins, many of which turned out to participate in translation, in particular rRNA maturation and tRNA modification. The new version of the COGs is expected to become an important tool for microbial genomics.

Published by Oxford University Press on behalf of Nucleic Acids Research 2014. This work is written by US Government employees and is in the public domain in the US.

Figures

**Figure 1.**
A protein family (COG) page in the COG database. For each phylum (or for *Firmicutes* and *Proteobacteria*, for each class), the numbers indicate the number of organisms that have a COG member and the total number of organisms from that phylum (class) in the COG database. Each COG member is represented by its gene index (gi) number in the NCBI Protein database (22) and is linked to the respective entry in RefSeq database (28). The phyla (classes) with no COG members are collapsed. The phylum, class and organism names are linked to the respective entries in the NCBI Taxonomy database (23).
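The per-phylum numbers described in the Figure 1 caption can be illustrated with a minimal Python sketch. It is not the database's own code: the function name `phylum_counts`, the input layout (a mapping from organism to phylum, plus a list of `(gi, organism)` pairs for one COG) and the example values are all assumptions made for illustration.

```python
from collections import defaultdict


def phylum_counts(db_organisms, cog_members):
    """Count, per phylum (or class), organisms that have a member of one COG.

    db_organisms: dict mapping every organism in the database to its phylum
                  (hypothetical layout, assumed for this sketch).
    cog_members:  iterable of (gi, organism) pairs for a single COG.

    Returns a dict phylum -> (organisms_with_member, total_organisms).
    Phyla with no COG members are left out, mirroring the collapsed rows.
    """
    # Total number of organisms per phylum in the whole database.
    totals = defaultdict(int)
    for phylum in db_organisms.values():
        totals[phylum] += 1

    # Distinct organisms per phylum that carry at least one COG member;
    # a set ensures that paralogs in one organism are counted once.
    present = defaultdict(set)
    for _gi, organism in cog_members:
        present[db_organisms[organism]].add(organism)

    return {p: (len(orgs), totals[p]) for p, orgs in present.items()}


# Hypothetical example data, not taken from the COG database.
example_db = {
    "Escherichia coli": "Gammaproteobacteria",
    "Vibrio cholerae": "Gammaproteobacteria",
    "Bacillus subtilis": "Bacilli",
    "Thermotoga maritima": "Thermotogae",
}
example_members = [
    (1, "Escherichia coli"),
    (2, "Escherichia coli"),  # a paralog in the same organism
    (3, "Bacillus subtilis"),
]
example_counts = phylum_counts(example_db, example_members)
```

Here two members from the same organism count as a single organism, and the phylum without members does not appear in the result.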

**Figure 2.**
COG coverage of various bacterial phyla. The columns represent the average fraction of proteins from the organisms in the given phylum that are not included in COGs (gray), assigned to the R or S categories in COGs (yellow) or assigned to other COG functional categories (green). For *Firmicutes* and *Proteobacteria*, coverage is shown at the class level.
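The three fractions plotted in Figure 2 can be sketched in Python as follows. This is only an illustration under stated assumptions: the function names, the input layout (one entry per protein, `None` when the protein is in no COG, otherwise a string of COG functional category letters) and the rule that a protein counts as "R or S" only when all of its letters are R or S are choices made here, not details given in the caption.

```python
from collections import defaultdict
from statistics import mean

# R (general function prediction only) and S (function unknown) are the
# poorly characterized COG functional categories.
POORLY_CHARACTERIZED = {"R", "S"}


def organism_fractions(categories):
    """Return (not_in_cogs, r_or_s, other) fractions for one organism.

    categories: non-empty list with one entry per protein; None if the
    protein is not in any COG, else a non-empty string of category letters
    (assumed layout for this sketch).
    """
    total = len(categories)
    not_in_cogs = sum(1 for c in categories if c is None)
    r_or_s = sum(
        1 for c in categories
        if c is not None and set(c) <= POORLY_CHARACTERIZED
    )
    other = total - not_in_cogs - r_or_s
    return (not_in_cogs / total, r_or_s / total, other / total)


def phylum_coverage(organisms):
    """Average the per-organism fractions within each phylum.

    organisms: iterable of (phylum, categories) pairs.
    Returns a dict phylum -> (mean_not_in_cogs, mean_r_or_s, mean_other).
    """
    per_phylum = defaultdict(list)
    for phylum, categories in organisms:
        per_phylum[phylum].append(organism_fractions(categories))
    # zip(*rows) regroups the tuples so each fraction is averaged separately.
    return {
        phylum: tuple(mean(column) for column in zip(*rows))
        for phylum, rows in per_phylum.items()
    }
```

Averaging per organism first, rather than pooling all proteins of a phylum, follows the caption's wording ("average fraction of proteins from the organisms in the given phylum") and keeps large genomes from dominating the phylum value.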

Cited by

Zoledowska S, Motyka-Pomagruk A, Misztak A, Lojkowska E. Comparative Genomics, from the Annotated Genome to Valuable Biological Information: A Case Study. Methods Mol Biol. 2021;2242:91-112. doi: 10.1007/978-1-0716-1099-2_7. PMID: 33961220
Hera MR, Liu S, Wei W, Rodriguez JS, Ma C, Koslicki D. Metagenomic functional profiling: to sketch or not to sketch? Bioinformatics. 2024 Sep 1;40(Suppl 2):ii165-ii173. doi: 10.1093/bioinformatics/btae397. PMID: 39230701
Wang HC, Huang MH, Guo DY, He W, Wang L, Fu ZY, Li WJ, Zhang AH, Zhang DF. Hohaiivirga grylli gen. nov., sp. nov., a New Member of the Family Methylobacteriaceae, Isolated from Cricket (Gryllus chinensis). Curr Microbiol. 2024 Oct 6;81(11):392. doi: 10.1007/s00284-024-03922-3. PMID: 39369359
Xue Y, Xie Y, Cao X, Zhang L. The marine environmental microbiome mediates physiological outcomes in host nematodes. BMC Biol. 2024 Oct 8;22(1):224. doi: 10.1186/s12915-024-02021-w. PMID: 39379910
Bulzu PA, Kavagutti VS, Andrei AS, Ghai R. The Evolutionary Kaleidoscope of Rhodopsins. mSystems. 2022 Oct 26;7(5):e0040522. doi: 10.1128/msystems.00405-22. Epub 2022 Sep 19. PMID: 36121162

References

1. Tatusov R.L., Koonin E.V., Lipman D.J. A genomic perspective on protein families. Science. 1997;278:631–637.
2. Marchler-Bauer A., Anderson J.B., Cherukuri P.F., DeWeese-Scott C., Geer L.Y., Gwadz M., He S., Hurwitz D.I., Jackson J.D., Ke Z., et al. CDD: a Conserved Domain Database for protein classification. Nucleic Acids Res. 2005;33:D192–D196.
3. Marchler-Bauer A., Zheng C., Chitsaz F., Derbyshire M.K., Geer L.Y., Geer R.C., Gonzales N.R., Gwadz M., Hurwitz D.I., Lanczycki C.J., et al. CDD: conserved domains and protein three-dimensional structure. Nucleic Acids Res. 2013;41:D348–D352.
4. Overbeek R., Olson R., Pusch G.D., Olsen G.J., Davis J.J., Disz T., Edwards R.A., Gerdes S., Parrello B., Shukla M., et al. The SEED and the Rapid Annotation of microbial genomes using Subsystems Technology (RAST). Nucleic Acids Res. 2014;42:D206–D214.
5. Markowitz V.M., Chen I.M., Palaniappan K., Chu K., Szeto E., Pillay M., Ratner A., Huang J., Woyke T., Huntemann M., et al. IMG 4 version of the integrated microbial genomes comparative analysis system. Nucleic Acids Res. 2014;42:D560–D567.

Grants and funding

Intramural NIH HHS/United States
