<!DOCTYPE html>
<html xmlns="http://www.w3.org/1999/xhtml" xml:lang="en" class="no-js no-jr">
<head>
<!-- For pinger, set start time and add meta elements. -->
<script type="text/javascript">var ncbi_startTime = new Date();</script>
<!-- Logger begin -->
<meta name="ncbi_db" content="books">
<meta name="ncbi_pdid" content="book-part">
<meta name="ncbi_acc" content="NBK20253">
<meta name="ncbi_domain" content="sef">
<meta name="ncbi_report" content="reader">
<meta name="ncbi_type" content="fulltext">
<meta name="ncbi_objectid" content="">
<meta name="ncbi_pcid" content="/NBK20253/?report=reader">
<meta name="ncbi_pagename" content="Genome Annotation and Analysis - Sequence - Evolution - Function - NCBI Bookshelf">
<meta name="ncbi_bookparttype" content="chapter">
<meta name="ncbi_app" content="bookshelf">
<!-- Logger end -->
<!--component id="Page" label="meta"/-->
<script type="text/javascript" src="/corehtml/pmc/jatsreader/ptpmc_3.22/js/jr.boots.min.js"> </script><title>Genome Annotation and Analysis - Sequence - Evolution - Function - NCBI Bookshelf</title>
<meta charset="utf-8">
<meta name="apple-mobile-web-app-capable" content="no">
<meta name="viewport" content="initial-scale=1,minimum-scale=1,maximum-scale=1,user-scalable=no">
<meta name="jr-col-layout" content="auto">
<meta name="jr-prev-unit" content="/books/n/sef/A166/?report=reader">
<meta name="jr-next-unit" content="/books/n/sef/A298/?report=reader">
<meta name="bk-toc-url" content="/books/n/sef/?report=toc">
<meta name="robots" content="INDEX,NOFOLLOW,NOARCHIVE,NOIMAGEINDEX">
<meta name="citation_inbook_title" content="Sequence - Evolution - Function: Computational Approaches in Comparative Genomics">
<meta name="citation_title" content="Genome Annotation and Analysis">
<meta name="citation_publisher" content="Kluwer Academic">
<meta name="citation_date" content="2003">
<meta name="citation_author" content="Eugene V Koonin">
<meta name="citation_author" content="Michael Y Galperin">
<meta name="citation_fulltext_html_url" content="https://www.ncbi.nlm.nih.gov/books/NBK20253/">
<link rel="schema.DC" href="http://purl.org/DC/elements/1.0/">
<meta name="DC.Title" content="Genome Annotation and Analysis">
<meta name="DC.Type" content="Text">
<meta name="DC.Publisher" content="Kluwer Academic">
<meta name="DC.Contributor" content="Eugene V Koonin">
<meta name="DC.Contributor" content="Michael Y Galperin">
<meta name="DC.Date" content="2003">
<meta name="DC.Identifier" content="https://www.ncbi.nlm.nih.gov/books/NBK20253/">
<meta name="DC.Language" content="en">
<meta name="description" content="In the preceding chapter, we gave a brief overview of the methods that are commonly used for identification of protein-coding genes and analysis of protein sequences. Here, we turn to one of the main subjects of this book, namely, how these methods are applied to the task of primary analysis of genomes, which often goes under the name of “genome annotation”. Many researchers still view genome annotation as a notoriously unreliable and inaccurate process. There are excellent reasons for this opinion: genome annotation produces a considerable number of errors and some outright ridiculous “identifications” (see 3.1.3 and further discussion in this chapter). These errors are highly visible, even when the error rate is quite low: because of the large numbers of genes in most genomes, the errors are also rather numerous. Some of the problems and challenges faced by genome annotation are an issue of quantity turning into quality: an analysis that can be easily and reliably done by a qualified researcher for one or ten protein sequences becomes difficult and error-prone for the same scientist and much more so for an automated tool when the task is scaled up to 10,000 sequences. We discuss here the performance of manual, automated, and mixed approaches in genome annotation and ways to avoid some common pitfalls. Mostly, however, we concentrate in this chapter on the so-called context methods of genome analysis, which are the recent excitement in the annotation field. These approaches go beyond individual genes and explicitly take advantage of genome comparison.">
<meta name="og:title" content="Genome Annotation and Analysis">
<meta name="og:type" content="book">
<meta name="og:description" content="In the preceding chapter, we gave a brief overview of the methods that are commonly used for identification of protein-coding genes and analysis of protein sequences. Here, we turn to one of the main subjects of this book, namely, how these methods are applied to the task of primary analysis of genomes, which often goes under the name of “genome annotation”. Many researchers still view genome annotation as a notoriously unreliable and inaccurate process. There are excellent reasons for this opinion: genome annotation produces a considerable number of errors and some outright ridiculous “identifications” (see 3.1.3 and further discussion in this chapter). These errors are highly visible, even when the error rate is quite low: because of the large numbers of genes in most genomes, the errors are also rather numerous. Some of the problems and challenges faced by genome annotation are an issue of quantity turning into quality: an analysis that can be easily and reliably done by a qualified researcher for one or ten protein sequences becomes difficult and error-prone for the same scientist and much more so for an automated tool when the task is scaled up to 10,000 sequences. We discuss here the performance of manual, automated, and mixed approaches in genome annotation and ways to avoid some common pitfalls. Mostly, however, we concentrate in this chapter on the so-called context methods of genome analysis, which are the recent excitement in the annotation field. These approaches go beyond individual genes and explicitly take advantage of genome comparison.">
<meta name="og:url" content="https://www.ncbi.nlm.nih.gov/books/NBK20253/">
<meta name="og:site_name" content="NCBI Bookshelf">
<meta name="og:image" content="https://www.ncbi.nlm.nih.gov/corehtml/pmc/pmcgifs/bookshelf/thumbs/th-sef-lrg.png">
<meta name="twitter:card" content="summary">
<meta name="twitter:site" content="@ncbibooks">
<meta name="bk-non-canon-loc" content="/books/n/sef/A264/?report=reader">
<link rel="canonical" href="https://www.ncbi.nlm.nih.gov/books/NBK20253/">
<link href="https://fonts.googleapis.com/css?family=Archivo+Narrow:400,700,400italic,700italic&subset=latin" rel="stylesheet" type="text/css">
<link rel="stylesheet" href="/corehtml/pmc/jatsreader/ptpmc_3.22/css/libs.min.css">
<link rel="stylesheet" href="/corehtml/pmc/jatsreader/ptpmc_3.22/css/jr.min.css">
<meta name="format-detection" content="telephone=no">
<link rel="stylesheet" href="/corehtml/pmc/css/bookshelf/2.26/css/books.min.css" type="text/css">
<link rel="stylesheet" href="/corehtml/pmc/css/bookshelf/2.26/css//books_print.min.css" type="text/css" media="print">
<link rel="stylesheet" href="/corehtml/pmc/css/bookshelf/2.26/css/books_reader.min.css" type="text/css">
<style type="text/css">p a.figpopup{display:inline !important} .bk_tt {font-family: monospace} .first-line-outdent .bk_ref {display: inline} .body-content h2, .body-content .h2 {border-bottom: 1px solid #97B0C8} .body-content h2.inline {border-bottom: none} a.page-toc-label , .jig-ncbismoothscroll a {text-decoration:none;border:0 !important} .temp-labeled-list .graphic {display:inline-block !important} .temp-labeled-list img{width:100%}</style>
<link rel="shortcut icon" href="//www.ncbi.nlm.nih.gov/favicon.ico">
<meta name="ncbi_phid" content="CE8D52EE7DB2EAA10000000000CA009B.m_5">
<meta name='referrer' content='origin-when-cross-origin'/><link type="text/css" rel="stylesheet" href="//static.pubmed.gov/portal/portal3rc.fcgi/4216699/css/3852956/3849091.css"></head>
<body>
<!-- Book content! -->
<div id="jr" data-jr-path="/corehtml/pmc/jatsreader/ptpmc_3.22/"><div class="jr-unsupported"><table class="modal"><tr><td><span class="attn inline-block"></span><br />Your browser does not support the NLM PubReader view.<br />Go to <a href="/pmc/about/pr-browsers/">this page</a> to see a list of supported browsers<br />or return to the <br /><a href="/books/NBK20253/?report=classic">regular view</a>.</td></tr></table></div><div id="jr-ui" class="hidden"><nav id="jr-head"><div class="flexh tb"><div id="jr-tb1"><a id="jr-links-sw" class="hidden" title="Links"><svg xmlns="http://www.w3.org/2000/svg" version="1.1" x="0px" y="0px" viewBox="0 0 70.6 85.3" style="enable-background:new 0 0 70.6 85.3;vertical-align:middle" xml:space="preserve" width="24" height="24">
<style type="text/css">.st0{fill:#939598;}</style>
<g>
<path class="st0" d="M36,0C12.8,2.2-22.4,14.6,19.6,32.5C40.7,41.4-30.6,14,35.9,9.8"></path>
<path class="st0" d="M34.5,85.3c23.2-2.2,58.4-14.6,16.4-32.5c-21.1-8.9,50.2,18.5-16.3,22.7"></path>
<path class="st0" d="M34.7,37.1c66.5-4.2-4.8-31.6,16.3-22.7c42.1,17.9,6.9,30.3-16.4,32.5h1.7c-66.2,4.4,4.8,31.6-16.3,22.7 c-42.1-17.9-6.9-30.3,16.4-32.5"></path>
</g>
</svg> Books</a></div><div class="jr-rhead f1 flexh"><div class="head"><a href="/books/n/sef/A166/?report=reader"><svg xmlns="http://www.w3.org/2000/svg" viewBox="0 0 100 100" preserveAspectRatio="none"><path d="M75,30 c-80,60 -80,0 0,60 c-30,-60 -30,0 0,-60"></path><text x="20" y="28" textLength="60" style="font-size:25px">Prev</text></svg></a></div><div class="body"><div class="t">Chapter 5, Genome Annotation and Analysis</div><div class="j">Sequence - Evolution - Function: Computational Approaches in Comparative Genomics</div></div><div class="tail"><a href="/books/n/sef/A298/?report=reader"><svg xmlns="http://www.w3.org/2000/svg" viewBox="0 0 100 100" preserveAspectRatio="none"><path d="M25,30c80,60 80,0 0,60 c30,-60 30,0 0,-60"></path><text x="20" y="28" textLength="60" style="font-size:25px">Next</text></svg></a></div></div><div id="jr-tb2"><a id="jr-bkhelp-sw" class="btn wsprkl hidden" title="Help with NLM PubReader">?</a><a id="jr-help-sw" class="btn wsprkl hidden" title="Settings and typography in NLM PubReader"><svg xmlns="http://www.w3.org/2000/svg" viewBox="0 0 512 512" preserveAspectRatio="none"><path d="M462,283.742v-55.485l-29.981-10.662c-11.431-4.065-20.628-12.794-25.274-24.001 c-0.002-0.004-0.004-0.009-0.006-0.013c-4.659-11.235-4.333-23.918,0.889-34.903l13.653-28.724l-39.234-39.234l-28.72,13.652 c-10.979,5.219-23.68,5.546-34.908,0.889c-0.005-0.002-0.01-0.003-0.014-0.005c-11.215-4.65-19.933-13.834-24-25.273L283.741,50 h-55.484l-10.662,29.981c-4.065,11.431-12.794,20.627-24.001,25.274c-0.005,0.002-0.009,0.004-0.014,0.005 c-11.235,4.66-23.919,4.333-34.905-0.889l-28.723-13.653l-39.234,39.234l13.653,28.721c5.219,10.979,5.545,23.681,0.889,34.91 c-0.002,0.004-0.004,0.009-0.006,0.013c-4.649,11.214-13.834,19.931-25.271,23.998L50,228.257v55.485l29.98,10.661 c11.431,4.065,20.627,12.794,25.274,24c0.002,0.005,0.003,0.01,0.005,0.014c4.66,11.236,4.334,23.921-0.888,34.906l-13.654,28.723 
l39.234,39.234l28.721-13.652c10.979-5.219,23.681-5.546,34.909-0.889c0.005,0.002,0.01,0.004,0.014,0.006 c11.214,4.649,19.93,13.833,23.998,25.271L228.257,462h55.484l10.595-29.79c4.103-11.538,12.908-20.824,24.216-25.525 c0.005-0.002,0.009-0.004,0.014-0.006c11.127-4.628,23.694-4.311,34.578,0.863l28.902,13.738l39.234-39.234l-13.66-28.737 c-5.214-10.969-5.539-23.659-0.886-34.877c0.002-0.005,0.004-0.009,0.006-0.014c4.654-11.225,13.848-19.949,25.297-24.021 L462,283.742z M256,331.546c-41.724,0-75.548-33.823-75.548-75.546s33.824-75.547,75.548-75.547 c41.723,0,75.546,33.824,75.546,75.547S297.723,331.546,256,331.546z"></path></svg></a><a id="jr-fip-sw" class="btn wsprkl hidden" title="Find"><svg xmlns="http://www.w3.org/2000/svg" viewBox="0 0 550 600" preserveAspectRatio="none"><path fill="none" stroke="#000" stroke-width="36" stroke-linecap="round" style="fill:#FFF" d="m320,350a153,153 0 1,0-2,2l170,170m-91-117 110,110-26,26-110-110"></path></svg></a><a id="jr-rtoc-sw" class="btn wsprkl hidden" title="Table of Contents"><svg xmlns="http://www.w3.org/2000/svg" viewBox="0 0 100 100" preserveAspectRatio="none"><path d="M20,20h10v8H20V20zM36,20h44v8H36V20zM20,37.33h10v8H20V37.33zM36,37.33h44v8H36V37.33zM20,54.66h10v8H20V54.66zM36,54.66h44v8H36V54.66zM20,72h10v8 H20V72zM36,72h44v8H36V72z"></path></svg></a></div></div></nav><nav id="jr-dash" class="noselect"><nav id="jr-dash" class="noselect"><div id="jr-pi" class="hidden"><a id="jr-pi-prev" class="hidden" title="Previous page"><svg xmlns="http://www.w3.org/2000/svg" viewBox="0 0 100 100" preserveAspectRatio="none"><path d="M75,30 c-80,60 -80,0 0,60 c-30,-60 -30,0 0,-60"></path><text x="20" y="28" textLength="60" style="font-size:25px">Prev</text></svg></a><div class="pginfo">Page <i class="jr-pg-pn">0</i> of <i class="jr-pg-lp">0</i></div><a id="jr-pi-next" class="hidden" title="Next page"><svg xmlns="http://www.w3.org/2000/svg" viewBox="0 0 100 100" preserveAspectRatio="none"><path d="M25,30c80,60 80,0 0,60 c30,-60 30,0 
0,-60"></path><text x="20" y="28" textLength="60" style="font-size:25px">Next</text></svg></a></div><div id="jr-is-tb"><a id="jr-is-sw" class="btn wsprkl hidden" title="Switch between Figures/Tables strip and Progress bar"><svg xmlns="http://www.w3.org/2000/svg" viewBox="0 0 100 100" preserveAspectRatio="none"><rect x="10" y="40" width="20" height="20"></rect><rect x="40" y="40" width="20" height="20"></rect><rect x="70" y="40" width="20" height="20"></rect></svg></a></div><nav id="jr-istrip" class="istrip hidden"><a id="jr-is-prev" href="#" class="hidden" title="Previous"><svg xmlns="http://www.w3.org/2000/svg" viewBox="0 0 100 100" preserveAspectRatio="none"><path d="M80,40 60,65 80,90 70,90 50,65 70,40z M50,40 30,65 50,90 40,90 20,65 40,40z"></path><text x="35" y="25" textLength="60" style="font-size:25px">Prev</text></svg></a><a id="jr-is-next" href="#" class="hidden" title="Next"><svg xmlns="http://www.w3.org/2000/svg" viewBox="0 0 100 100" preserveAspectRatio="none"><path d="M20,40 40,65 20,90 30,90 50,65 30,40z M50,40 70,65 50,90 60,90 80,65 60,40z"></path><text x="15" y="25" textLength="60" style="font-size:25px">Next</text></svg></a></nav><nav id="jr-progress"></nav></nav></nav><aside id="jr-links-p" class="hidden flexv"><div class="tb sk-htbar flexh"><div><a class="jr-p-close btn wsprkl">Done</a></div><div class="title-text f1">NCBI Bookshelf</div></div><div class="cnt lol f1"><a href="/books/">Home</a><a href="/books/browse/">Browse All Titles</a><a class="btn share" target="_blank" rel="noopener noreferrer" href="https://www.facebook.com/sharer/sharer.php?u=https://www.ncbi.nlm.nih.gov/books/NBK20253/"><svg xmlns="http://www.w3.org/2000/svg" viewBox="0 0 33 33" style="vertical-align:middle" width="24" height="24" preserveAspectRatio="none"><g><path d="M 17.996,32L 12,32 L 12,16 l-4,0 l0-5.514 l 4-0.002l-0.006-3.248C 11.993,2.737, 13.213,0, 18.512,0l 4.412,0 l0,5.515 l-2.757,0 c-2.063,0-2.163,0.77-2.163,2.209l-0.008,2.76l 4.959,0 l-0.585,5.514L 18,16L 
17.996,32z"></path></g></svg> Share on Facebook</a><a class="btn share" target="_blank" rel="noopener noreferrer" href="https://twitter.com/intent/tweet?url=https://www.ncbi.nlm.nih.gov/books/NBK20253/&text=Genome%20Annotation%20and%20Analysis"><svg xmlns="http://www.w3.org/2000/svg" viewBox="0 0 33 33" style="vertical-align:middle" width="24" height="24"><g><path d="M 32,6.076c-1.177,0.522-2.443,0.875-3.771,1.034c 1.355-0.813, 2.396-2.099, 2.887-3.632 c-1.269,0.752-2.674,1.299-4.169,1.593c-1.198-1.276-2.904-2.073-4.792-2.073c-3.626,0-6.565,2.939-6.565,6.565 c0,0.515, 0.058,1.016, 0.17,1.496c-5.456-0.274-10.294-2.888-13.532-6.86c-0.565,0.97-0.889,2.097-0.889,3.301 c0,2.278, 1.159,4.287, 2.921,5.465c-1.076-0.034-2.088-0.329-2.974-0.821c-0.001,0.027-0.001,0.055-0.001,0.083 c0,3.181, 2.263,5.834, 5.266,6.438c-0.551,0.15-1.131,0.23-1.73,0.23c-0.423,0-0.834-0.041-1.235-0.118 c 0.836,2.608, 3.26,4.506, 6.133,4.559c-2.247,1.761-5.078,2.81-8.154,2.81c-0.53,0-1.052-0.031-1.566-0.092 c 2.905,1.863, 6.356,2.95, 10.064,2.95c 12.076,0, 18.679-10.004, 18.679-18.68c0-0.285-0.006-0.568-0.019-0.849 C 30.007,8.548, 31.12,7.392, 32,6.076z"></path></g></svg> Share on Twitter</a></div></aside><aside id="jr-rtoc-p" class="hidden flexv"><div class="tb sk-htbar flexh"><div><a class="jr-p-close btn wsprkl">Done</a></div><div class="title-text f1">Table of Content</div></div><div class="cnt lol f1"><a href="/books/n/sef/?report=reader">Title Information</a><a href="/books/n/sef/toc/?report=reader">Table of Contents Page</a></div></aside><aside id="jr-help-p" class="hidden flexv"><div class="tb sk-htbar flexh"><div><a class="jr-p-close btn wsprkl">Done</a></div><div class="title-text f1">Settings</div></div><div class="cnt f1"><div id="jr-typo-p" class="typo"><div><a class="sf btn wsprkl">A-</a><a class="lf btn wsprkl">A+</a></div><div><a class="bcol-auto btn wsprkl"><svg xmlns="http://www.w3.org/2000/svg" viewBox="0 0 200 100" preserveAspectRatio="none"><text x="10" y="70" 
style="font-size:60px;font-family: Trebuchet MS, ArialMT, Arial, sans-serif" textLength="180">AUTO</text></svg></a><a class="bcol-1 btn wsprkl"><svg xmlns="http://www.w3.org/2000/svg" viewBox="0 0 100 100" preserveAspectRatio="none"><path d="M15,25 85,25zM15,40 85,40zM15,55 85,55zM15,70 85,70z"></path></svg></a><a class="bcol-2 btn wsprkl"><svg xmlns="http://www.w3.org/2000/svg" viewBox="0 0 100 100" preserveAspectRatio="none"><path d="M5,25 45,25z M55,25 95,25zM5,40 45,40z M55,40 95,40zM5,55 45,55z M55,55 95,55zM5,70 45,70z M55,70 95,70z"></path></svg></a></div></div><div class="lol"><a class="" href="/books/NBK20253/?report=classic">Switch to classic view</a><a href="/books/NBK20253/?report=printable">Print View</a></div></div></aside><aside id="jr-bkhelp-p" class="hidden flexv"><div class="tb sk-htbar flexh"><div><a class="jr-p-close btn wsprkl">Done</a></div><div class="title-text f1">Help</div></div><div class="cnt f1 lol"><a id="jr-helpobj-sw" data-path="/corehtml/pmc/jatsreader/ptpmc_3.22/" data-href="/corehtml/pmc/jatsreader/ptpmc_3.22/img/bookshelf/help.xml" href="">Help</a><a href="mailto:info@ncbi.nlm.nih.gov?subject=PubReader%20feedback%20%2F%20NBK20253%20%2F%20sid%3ACE8BC1E97D9F05E1_0182SID%20%2F%20phid%3ACE8D52EE7DB2EAA10000000000CA009B.4">Send us feedback</a><a id="jr-about-sw" data-path="/corehtml/pmc/jatsreader/ptpmc_3.22/" data-href="/corehtml/pmc/jatsreader/ptpmc_3.22/img/bookshelf/about.xml" href="">About PubReader</a></div></aside><aside id="jr-objectbox" class="thidden hidden"><div class="jr-objectbox-close wsprkl">✘</div><div class="jr-objectbox-inner cnt"><div class="jr-objectbox-drawer"></div></div></aside><nav id="jr-pm-left" class="hidden"><svg xmlns="http://www.w3.org/2000/svg" viewBox="0 0 40 800" preserveAspectRatio="none"><text font-stretch="ultra-condensed" x="800" y="-15" text-anchor="end" transform="rotate(90)" font-size="18" letter-spacing=".1em">Previous Page</text></svg></nav><nav id="jr-pm-right" class="hidden"><svg 
xmlns="http://www.w3.org/2000/svg" viewBox="0 0 40 800" preserveAspectRatio="none"><text font-stretch="ultra-condensed" x="800" y="-15" text-anchor="end" transform="rotate(90)" font-size="18" letter-spacing=".1em">Next Page</text></svg></nav><nav id="jr-fip" class="hidden"><nav id="jr-fip-term-p"><input type="search" placeholder="search this page" id="jr-fip-term" autocorrect="off" autocomplete="off" /><a id="jr-fip-mg" class="wsprkl btn" title="Find"><svg xmlns="http://www.w3.org/2000/svg" viewBox="0 0 550 600" preserveAspectRatio="none"><path fill="none" stroke="#000" stroke-width="36" stroke-linecap="round" style="fill:#FFF" d="m320,350a153,153 0 1,0-2,2l170,170m-91-117 110,110-26,26-110-110"></path></svg></a><a id="jr-fip-done" class="wsprkl btn" title="Dismiss find">✘</a></nav><nav id="jr-fip-info-p"><a id="jr-fip-prev" class="wsprkl btn" title="Jump to previuos match">◀</a><button id="jr-fip-matches">no matches yet</button><a id="jr-fip-next" class="wsprkl btn" title="Jump to next match">▶</a></nav></nav></div><div id="jr-epub-interstitial" class="hidden"></div><div id="jr-content"><article data-type="main"><div class="main-content lit-style" itemscope="itemscope" itemtype="http://schema.org/CreativeWork"><div class="meta-content fm-sec"><div class="fm-sec"><h1 id="_NBK20253_"><span class="label">Chapter 5</span><span class="title" itemprop="name">Genome Annotation and Analysis</span></h1><p class="fm-aai"><a href="#_NBK20253_pubdet_">Publication Details</a></p></div></div><div class="jig-ncbiinpagenav body-content whole_rhythm" data-jigconfig="allHeadingLevels: ['h2'],smoothScroll: false" itemprop="text"><p>In the preceding chapter, we gave a brief overview of the methods that are commonly used
for identification of protein-coding genes and analysis of protein sequences. Here, we
turn to one of the main subjects of this book, namely, how these methods are applied to
the task of primary analysis of genomes, which often goes under the name of
“genome annotation”. Many researchers still view genome annotation
as a notoriously unreliable and inaccurate process. There are excellent reasons for this
opinion: genome annotation produces a considerable number of errors and some outright
ridiculous “identifications” (see <a href="/books/n/sef/A55/?report=reader#A64">3.1.3</a> and further discussion in this chapter). These errors are highly
visible even when the error rate is quite low: because most genomes contain large numbers of genes,
even a low error rate yields numerous errors. Some of the problems and challenges
faced by genome annotation are an issue of quantity turning into quality: an analysis
that can be easily and reliably done by a qualified researcher for one or ten protein
sequences becomes difficult and error-prone for the same scientist, and much more so for
an automated tool, when the task is scaled up to 10,000 sequences. We discuss here the
performance of manual, automated, and mixed approaches in genome annotation and ways to
avoid some common pitfalls. Mostly, however, we concentrate in this chapter on the
so-called context methods of genome analysis, which are the recent excitement in the
annotation field. These approaches go beyond individual genes and explicitly take
advantage of genome comparison.</p><div id="A265"><h2 id="_A265_">5.1. Methods, Approaches and Results in Genome Annotation</h2><div id="A266"><h3>5.1.1. Genome annotation: data flow and performance</h3><p>What is genome annotation? Of course, there hardly can be any exact definition
but, for the purpose of this discussion, it might be useful to define annotation
as a subfield of the general field of genome analysis, which includes more or
less anything that can be done with genome sequences by computational means. In
simple, operational terms, annotation may be defined as the part of genome
analysis that is customarily performed before a genome sequence is deposited in
GenBank and described in a published paper. We say
“customarily” because the annotations available through
GenBank, and particularly the types of analysis reported in the literature for
different genomes, vary widely. For instance, the reports on the human genome
sequence [<a href="/books/n/sef/A727/?report=reader#A1216">488</a>,<a href="/books/n/sef/A727/?report=reader#A1598">870</a>] clearly include a considerable amount of information
that goes beyond typical genome annotation. The “unit” of
genome annotation is the description of an individual gene and its protein (or
RNA) product, and the focal point of each such record is the function assigned
to the gene product. The record may also include a brief description of the
evidence for this assigned function, e.g. percent identity with a functionally
characterized homolog or the boundaries of domains detected in a domain database
search, but there is no room for any details of the analysis.</p><p>
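The record described above can be pictured as a small data structure. What follows is only an illustrative sketch in Python: the class names GeneRecord and Evidence, all field names, and the example values are assumptions made for this example and do not correspond to the GenBank format or to any particular annotation tool.</p>
<pre>
```python
from dataclasses import dataclass, field
from typing import List, Optional


@dataclass
class Evidence:
    """One brief note supporting a functional assignment (hypothetical layout)."""
    method: str                                # e.g. "BLAST" or "domain search"
    target: str                                # identifier of the homolog or domain hit
    percent_identity: Optional[float] = None   # filled in for similarity hits
    domain_start: Optional[int] = None         # domain boundaries, 1-based
    domain_end: Optional[int] = None


@dataclass
class GeneRecord:
    """The 'unit' of annotation: one gene, its product, and the assigned function."""
    gene_id: str
    product_type: str                          # "protein" or "RNA"
    function: str = "hypothetical protein"     # default when nothing is predicted
    evidence: List[Evidence] = field(default_factory=list)

    def summary(self) -> str:
        """Render the record as one tab-separated line."""
        notes = []
        for ev in self.evidence:
            if ev.percent_identity is not None:
                notes.append(f"{ev.percent_identity:.0f}% identity to {ev.target} ({ev.method})")
            elif ev.domain_start is not None and ev.domain_end is not None:
                notes.append(f"{ev.target} at {ev.domain_start}-{ev.domain_end} ({ev.method})")
        suffix = "; ".join(notes) if notes else "no evidence recorded"
        return f"{self.gene_id}\t{self.function}\t{suffix}"


# Made-up example values, for illustration only.
record = GeneRecord(
    gene_id="gene_0001",
    product_type="protein",
    function="predicted ATPase",
    evidence=[
        Evidence("BLAST", "homolog_X", percent_identity=42.0),
        Evidence("domain search", "domain_Y", domain_start=35, domain_end=210),
    ],
)
print(record.summary())
```
</pre>
<p>The layout keeps the assigned function as the focal field and reduces each piece of evidence to one short note, which mirrors the lack of room for analysis details in such records.</p>
<p>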
<a class="figpopup" href="/books/NBK20253/figure/A267/?report=objectonly" target="object" rid-figpopup="figA267" rid-ob="figobA267">Figure 5.1</a> shows a rough schematic of the
data flow in genome annotation, starting with the finished sequence; we leave
finishing of the sequence out of this scheme but indicate the possibility of
feedback resulting in correction of sequencing errors. Of these procedures,
which must be integrated for predicting gene functions, statistical gene
prediction and search of general-purpose databases for sequence similarity are
central in the sense that they are performed comprehensively as part of any genome
project. The contribution of the other approaches in the scheme in <a class="figpopup" href="/books/NBK20253/figure/A267/?report=objectonly" target="object" rid-figpopup="figA267" rid-ob="figobA267">Figure 5.1</a>, particularly specialized
database search, including domain databases, such as Pfam, SMART, and CDD (see
<a href="/books/n/sef/A55/?report=reader#A82">3.2.2</a>), and genome-oriented databases,
such as COGs, KEGG, or WIT (see <a href="/books/n/sef/A55/?report=reader#A103">3.4</a>), and
genomic context analysis, varies greatly from project to project. So far, these
relatively new methods and resources remain ancillary to traditional database
search in genome annotation, but we argue further in this chapter that they can
and probably will transform the annotation process in the near future.
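</p><p>The data flow outlined above can be sketched, under heavy simplification, as a set of stages that each contribute evidence for a gene: a core stage that is always run and ancillary stages that a project may or may not enable. Every function name and every toy result in this Python sketch is a hypothetical stand-in; no real search program or database is called, and gene prediction itself is left out.</p>
<pre>
```python
from typing import Callable, Dict, List

# Each stage maps a gene identifier to a list of evidence strings.
# All stage functions below are hypothetical stand-ins for real tools.
Stage = Callable[[str], List[str]]


def general_similarity_search(gene_id: str) -> List[str]:
    """Stand-in for a search of a general-purpose sequence database."""
    toy_hits = {"gene_1": ["similar to a characterized kinase"]}
    return toy_hits.get(gene_id, [])


def domain_database_search(gene_id: str) -> List[str]:
    """Stand-in for a search of a domain database."""
    toy_hits = {"gene_1": ["kinase domain"], "gene_2": ["helix-turn-helix domain"]}
    return toy_hits.get(gene_id, [])


def genomic_context_analysis(gene_id: str) -> List[str]:
    """Stand-in for a context method that uses genome comparison."""
    toy_hits = {"gene_3": ["adjacent to transport genes in several genomes"]}
    return toy_hits.get(gene_id, [])


def annotate(gene_ids: List[str], core: List[Stage], ancillary: List[Stage]) -> Dict[str, List[str]]:
    """Run the core stages on every gene, then whichever ancillary stages are enabled."""
    evidence: Dict[str, List[str]] = {}
    for gene_id in gene_ids:
        collected: List[str] = []
        for stage in core + ancillary:
            collected.extend(stage(gene_id))
        evidence[gene_id] = collected
    return evidence


# Gene identifiers would come from the gene prediction step, which is omitted here.
predicted_genes = ["gene_1", "gene_2", "gene_3", "gene_4"]
result = annotate(
    predicted_genes,
    core=[general_similarity_search],
    ancillary=[domain_database_search, genomic_context_analysis],
)
unannotated = [gene_id for gene_id, ev in result.items() if not ev]
print(unannotated)
```
</pre>
<p>In this toy run, gene_4 receives no evidence from any stage and would remain a hypothetical protein; disabling the ancillary stages would leave gene_2 and gene_3 without evidence as well.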
</p><div class="iconblock whole_rhythm clearfix ten_col fig" id="figA267" co-legend-rid="figlgndA267"><a href="/books/NBK20253/figure/A267/?report=objectonly" target="object" title="Figure 5.1" class="img_link icnblk_img figpopup" rid-figpopup="figA267" rid-ob="figobA267"><img class="small-thumb" src="/books/NBK20253/bin/ch5f1.gif" src-large="/books/NBK20253/bin/ch5f1.jpg" alt="Figure 5.1. A generalized flow chart of genome annotation." /></a><div class="icnblk_cntnt" id="figlgndA267"><h4 id="A267"><a href="/books/NBK20253/figure/A267/?report=objectonly" target="object" rid-ob="figobA267">Figure 5.1</a></h4><p class="float-caption no_bottom_margin">A generalized flow chart of genome annotation. FB: feedback from gene identification for correction of sequencing
errors, primarily frameshifts. General database search: searching
sequence databases (typically, NCBI NR) for sequence similarity,
usually <a href="/books/NBK20253/figure/A267/?report=objectonly" target="object" rid-ob="figobA267">(more...)</a></p></div></div><p>Before we consider several aspects of genome annotation, it may be instructive to
assess its overall performance, i.e. the fraction of the genes in a genome to
which a specific function is assigned. <a class="figpopup" href="/books/NBK20253/table/A268/?report=objectonly" target="object" rid-figpopup="figA268" rid-ob="figobA268">Table 5.1</a> lists such data for several genomes sequenced in 2001 and
annotated using relatively up-to-date methods. This comparison shows notable
differences between the levels of annotation of different genomes. Some genomes
come practically unannotated, for example, <i>Sulfolobus
tokodaii</i>, a crenarchaeon that is closely related to <i>S.
solfataricus</i> and is represented in the COGs to the same extent as the
latter species. In most genomes, however, a functional prediction has been made
for the majority of the genes, from 54% to 79% of the
protein-coding genes. Obviously, these differences depend both on the taxonomic
position of the species in question (e.g. it is likely that for Crenarchaea,
whose biology is in general poorly understood, the fraction of genes for which
functional prediction is feasible will be lower than for bacteria of the
well-characterized <i>Bacillus</i>-<i>Clostridium</i> group,
such as <i>C. acetobutylicum</i> or <i>L. lactis</i>) and on
the methods and practices of genome annotators.</p><div class="iconblock whole_rhythm clearfix ten_col table-wrap" id="figA268"><a href="/books/NBK20253/table/A268/?report=objectonly" target="object" title="Table 5.1" class="img_link icnblk_img figpopup" rid-figpopup="figA268" rid-ob="figobA268"><img class="small-thumb" src="/books/NBK20253/table/A268/?report=thumb" src-large="/books/NBK20253/table/A268/?report=previmg" alt="Table 5.1. Microbial genome annotation 2001." /></a><div class="icnblk_cntnt"><h4 id="A268"><a href="/books/NBK20253/table/A268/?report=objectonly" target="object" rid-ob="figobA268">Table 5.1</a></h4><p class="float-caption no_bottom_margin">Microbial genome annotation 2001. </p></div></div><p>Even in better-characterized genomes, for hundreds of genes (those encoding
“conserved hypothetical” and
“hypothetical” proteins), there is no functional prediction
whatsoever. Furthermore, among those proteins that formally belong to the
annotated category, a substantial fraction of the predictions are only general
and are in need of major refinement. Some of these problems can be solved only
through experiment, but the above numbers show beyond doubt that there is ample
room for improvement in computational annotation itself; further in this
chapter, we touch upon some of the possible directions.</p><p>Genome annotation necessarily involves some level of automation. No one is going
to manually paste each of the several thousand protein sequences encoded in a genome
into the BLAST window, hit the button, and wait for the results to appear on
screen. For annotation to be practicable at all, software is necessary to run
such routine tasks in a batch mode and also to organize the results from
different programs in a convenient form, and each genome project employs one or
another set of tools to achieve this. After that point, however, genome
annotation is still mostly “manual” (or, better,
“expert”) because decisions on how to assign gene functions
are made by humans (supposedly, experts). Several attempts have been made to
push automation beyond straightforward data processing and to allow a program to
actually make all the decisions. We briefly discuss some of the automated
systems for genome annotation in the next section.</p></div><div id="A269"><h3>5.1.2. Automation of genome annotation</h3><p>Terry Gaasterland and Christoph Sensen once estimated that annotating genomic
sequence by hand would require as much as one person-year per megabase
[<a href="/books/n/sef/A727/?report=reader#A981">253</a>]. We now believe, on the basis
of our own experience of genome annotation (e.g. [<a href="/books/n/sef/A727/?report=reader#A1350">622</a>,<a href="/books/n/sef/A727/?report=reader#A1507">779</a>,<a href="/books/n/sef/A727/?report=reader#A1533">805</a>]), that this estimate is exaggerated
perhaps by a factor of 5 or 6. Nevertheless, there is no doubt that genome
annotation has become the limiting step in most genome projects. Besides, humans
are presumed to be inconsistent and error-prone. Hence the incentives for
automating as much of the annotation process as possible.</p><p> The <b>GeneQuiz</b> (<a href="http://www.sander.ebi.ac.uk/genequiz/" ref="pagearea=body&targetsite=external&targetcat=link&targettype=uri">http://www.sander.ebi.ac.uk/genequiz/</a>) project was the first
automatic system for genome analysis; it performed similarity searches
followed by automatic evaluation of the results and generation of functional
annotation by an expert system based on a set of several predefined rules [<a href="/books/n/sef/A727/?report=reader#A1477">749</a>]. Several other similar systems have
been created since then, but GeneQuiz remains the only such tool that is open to
the general public [<a href="/books/n/sef/A727/?report=reader#A1078">350</a>].</p><p>GeneQuiz runs automated database searches and sequence analysis by taking a
|
|
protein sequence and comparing it against a non-redundant protein database,
|
|
generated by automated cross-linking and cross-referencing of PDB, SWISS-PROT,
|
|
PIR, PROSITE, and TrEMBL databases, with the addition of human, mouse, fruit
|
|
fly, zebrafish, and <i>Anopheles gambiae</i> protein sets obtained
|
|
from the Ensembl project (<a href="http://www.ensembl.org" ref="pagearea=body&targetsite=external&targetcat=link&targettype=uri">http://www.ensembl.org</a>) and a <i>C. elegans</i>
|
|
protein set (<a href="http://www.sanger.ac.uk/Projects/C_elegans/wormpep" ref="pagearea=body&targetsite=external&targetcat=link&targettype=uri">http://www.sanger.ac.uk/Projects/C_elegans/wormpep</a>). This
|
|
comparison is done by running BLAST and FASTA programs and is used to identify
|
|
the cases with high similarity, where function can be predicted. Additionally,
|
|
searches for PROSITE patterns are performed. Predictions are also made for
|
|
coiled-coil regions using COILS2 [<a href="/books/n/sef/A727/?report=reader#A1261">533</a>],
|
|
transmembrane segments using PHDhtm [<a href="/books/n/sef/A727/?report=reader#A1443">715</a>], and secondary structure elements using PHDsec [<a href="/books/n/sef/A727/?report=reader#A1446">718</a>]. The system further clusters
|
|
proteins from the analyzed genome by sequence similarity [<a href="/books/n/sef/A727/?report=reader#A1550">822</a>] and constructs multiple alignments. The results are
|
|
presented in a table that contains information on the best hits (including gene
|
|
names, database identifiers, and links to the corresponding databases),
|
|
predictions for secondary structure, coiled-coil regions, etc. and a reliability
|
|
score for each item. The functional assignment is then made automatically on the
|
|
basis of the functions of the homologs found in the database. At this level,
|
|
functional assignments are qualified as clear or as ambiguous.</p><p>The effectiveness and accuracy of such fully automated systems have been the
|
|
subject of a rather heated discussion but still remain uncertain. While the
|
|
authors originally estimated the accuracy of their functional assignments to be
|
|
95% or better [<a href="/books/n/sef/A727/?report=reader#A1366">638</a>,<a href="/books/n/sef/A727/?report=reader#A1477">749</a>], others reported that only 8 of 21
|
|
new functional predictions for <i>M. genitalium</i> proteins made by
|
|
GeneQuiz could be fully corroborated [<a href="/books/n/sef/A727/?report=reader#A1194">466</a>]. A similar discrepancy between the functional predictions made
|
|
by the GeneQuiz team [<a href="/books/n/sef/A727/?report=reader#A759">31</a>] and those
|
|
obtained by mostly manual annotation [<a href="/books/n/sef/A727/?report=reader#A1194">466</a>] was reported for the proteins encoded in the <i>M.
|
|
jannaschii</i> genome ([<a href="/books/n/sef/A727/?report=reader#A992">264</a>],
|
|
see <a href="http://www.bioinfo.de/isb/1998/01/0007" ref="pagearea=body&targetsite=external&targetcat=link&targettype=uri">http://www.bioinfo.de/isb/1998/01/0007</a>). It appeared that
|
|
GeneQuiz analysis suffered from the usual pitfalls of sequence similarity
|
|
searches (see <a href="/books/n/sef/A55/?report=reader#A64">3.1.3</a>, the next section and
|
|
[<a href="/books/n/sef/A727/?report=reader#A827">99</a>,<a href="/books/n/sef/A727/?report=reader#A832">104</a>,<a href="/books/n/sef/A727/?report=reader#A992">264</a>]).</p><div id="A270"><h4>PEDANT, MAGPIE, ERGO, IMAGENE</h4><p>While GeneQuiz seems to be the only fully automated genome annotation tool
|
|
that is open to the public for new genome analysis, there have been reports
|
|
of similar systems developed by other genome annotation groups. These
|
|
include Dmitrij Frishman's PEDANT (<a href="http://pedant.gsf.de" ref="pagearea=body&targetsite=external&targetcat=link&targettype=uri">http://pedant.gsf.de</a>,
|
|
[<a href="/books/n/sef/A727/?report=reader#A973">245</a>,<a href="/books/n/sef/A727/?report=reader#A976">248</a>]), Terry Gaasterland's MAGPIE and its sister
|
|
programs (<a href="http://genomes.rockefeller.edu" ref="pagearea=body&targetsite=external&targetcat=link&targettype=uri">http://genomes.rockefeller.edu</a>, [<a href="/books/n/sef/A727/?report=reader#A980">252</a>,<a href="/books/n/sef/A727/?report=reader#A981">253</a>]),
|
|
Ross Overbeek's ERGO (<a href="http://ergo.integratedgenomics.com/ERGO" ref="pagearea=body&targetsite=external&targetcat=link&targettype=uri">http://ergo.integratedgenomics.com/ERGO</a>, [<a href="/books/n/sef/A727/?report=reader#A1370">642</a>,<a href="/books/n/sef/A727/?report=reader#A1371">643</a>]), Alan Viari's Imagene (<a href="http://wwwabi.snv.jussieu.fr/research" ref="pagearea=body&targetsite=external&targetcat=link&targettype=uri">http://wwwabi.snv.jussieu.fr/research</a>, [<a href="/books/n/sef/A727/?report=reader#A1289">561</a>]), and some others. Although none
|
|
of these systems is freely available to outside users, many of the genome
|
|
annotation results they produced are accessible on the web and can be used
|
|
to judge the performance.</p><p>The PEDANT web site contains by far the most information open to the public
|
|
and can be used as a good reference point for automated genome analyses (see
|
|
also <a href="/books/n/sef/A22/?report=reader#A47">2.4</a>).</p></div><div id="A271"><h4>SEALS</h4><p>In addition to completely automated systems, some tools that greatly
|
|
facilitate and accelerate manual genome annotation are worth a mention.
|
|
System for Easy Analysis of Lots of Sequences (SEALS), developed by Roland
|
|
Walker at the NCBI is, for obvious reasons, the one most familiar to the
|
|
authors of this book (available for downloading at <a href="http://iubio.bio.indiana.edu:7780/archive/00000466/" ref="pagearea=body&targetsite=external&targetcat=link&targettype=uri">http://iubio.bio.indiana.edu:7780/archive/00000466/</a>, [<a href="/books/n/sef/A727/?report=reader#A1606">878</a>]). The SEALS package consists of
|
|
~50 simple, UNIX-based tools (written in PERL), which follow consistent
|
|
syntax and semantics. SEALS combines software for retrieving sequence
|
|
information, scripting database searches with BLAST, viewing and parsing
|
|
search outputs, searching for protein sequence motifs using regular
|
|
expressions, and predicting protein structural features and motifs.
|
|
Typically, using SEALS, a genome analyst first looks for structural features
|
|
of proteins, such as signal peptides (predicted by SignalP), transmembrane
|
|
domains (predicted by PHDhtm), coiled-coil domains (predicted by COILS2),
|
|
and large non-globular domains (predicted using SEG). Once these regions are
|
|
identified and masked, database searches are run in a batch mode using the
|
|
chosen method, e.g. PSI-BLAST. The outputs can be presented in a variety of
|
|
formats, of which filtering with taxonomic queries implemented in the SEALS
|
|
script TAX_COLLECTOR is among the most useful. SEALS has been extensively
|
|
used in the comparative studies of bacterial, archaeal, and eukaryotic
|
|
genomes (e.g. [<a href="/books/n/sef/A727/?report=reader#A780">52</a>,<a href="/books/n/sef/A727/?report=reader#A783">55</a>,<a href="/books/n/sef/A727/?report=reader#A1268">540</a>]).</p></div></div><div id="A272"><h3>5.1.3. Accuracy of genome annotation, sources of errors, and some thoughts on
|
|
possible improvements</h3><p>Benchmarking the accuracy of genome annotation is extremely hard. It has been
|
|
shown on numerous occasions that more advanced methods for sequence comparison,
|
|
such as gapped BLAST and subsequently PSI-BLAST, sometimes used in combination
|
|
with threading, as well as various forms of motif analysis and careful manual
|
|
integration of the results produced by all these approaches, substantially
|
|
improve detection of homologs (e.g. [<a href="/books/n/sef/A727/?report=reader#A896">168</a>,<a href="/books/n/sef/A727/?report=reader#A1129">401</a>,<a href="/books/n/sef/A727/?report=reader#A1162">434</a>,<a href="/books/n/sef/A727/?report=reader#A1194">466</a>,<a href="/books/n/sef/A727/?report=reader#A1313">585</a>]). In the end,
|
|
however, genome annotation is not about detection of homologs but rather about
|
|
functional prediction, and here, the problem of a standard of truth is
|
|
formidable. By definition, functional annotation (more precisely, functional
|
|
prediction) deals with proteins whose functions are unknown, and the rate of
|
|
experimental testing of predictions is extremely slow. We believe that it is
|
|
possible to design an objective test of the accuracy of genome annotation in the
|
|
following manner. The protein set encoded in a newly sequenced genome is
|
|
analyzed, and specific active centers and other functionally important sites are
|
|
predicted for as many proteins as possible. When a new, preferably
|
|
phylogenetically distant genome becomes available, orthologs of the proteins
|
|
from the first genome are identified, and the conservation of the predicted
|
|
functional sites is assessed. Lack of conservation would count as an error; this
|
|
is, of course, a harsh test that would give a lower bound on accuracy because:
|
|
first, functional site prediction may be partly wrong but the function of the
|
|
protein still would be predicted correctly; and second, some active sites might
|
|
be disrupted in the new genome. In this way, the accuracy of the prediction
|
|
could be assessed quantitatively and, in principle, even a
|
|
“tournament” analogous to the CASP competition in protein
|
|
structure prediction [<a href="/books/n/sef/A727/?report=reader#A1597">869</a>] could be
|
|
arranged.</p><p>However, so far, evaluation of the accuracy of genome annotation has been largely
|
|
limited to the assessments of consistency of annotations of the same genome
|
|
generated by different groups and various “sanity checks”
|
|
and expert judgments. Steven Brenner published an interesting comparison of
|
|
three independent annotations [<a href="/books/n/sef/A727/?report=reader#A970">242</a>,<a href="/books/n/sef/A727/?report=reader#A1195">467</a>,<a href="/books/n/sef/A727/?report=reader#A1367">639</a>] of the smallest of the sequenced bacterial genomes,
|
|
<i>Mycoplasma genitalium</i> [<a href="/books/n/sef/A727/?report=reader#A844">116</a>]. Without attempting to determine which annotation was
|
|
“better”, he manually examined all conflicting annotations,
|
|
eliminating trivial semantic differences and counting the apparent
|
|
irreconcilable ones as errors (in at least one of the annotations). His
|
|
conclusion was that the error rate was at least 8% among the 340
|
|
genes annotated by at least two of the three groups. In a similar exercise that
|
|
we have done on the basis of the COG database, we found that of 786 COGs that
|
|
did not include paralogs (the number for the end of 1999), members of 194 had
|
|
conflicting annotations in GenBank [<a href="/books/n/sef/A727/?report=reader#A995">267</a>]. This suggests, more pessimistically, an annotation error rate of
|
|
at least 25% using the same criterion as applied by Brenner. Clearly,
|
|
even the lower of these estimates represents a serious problem for genome
|
|
annotation, bringing up the specter of error catastrophe [<a href="/books/n/sef/A727/?report=reader#A817">89</a>,<a href="/books/n/sef/A727/?report=reader#A832">104</a>]. We first
|
|
briefly discuss the most common sources of errors and then some ideas regarding
|
|
the ways out. Manual and automated genome annotation encounter the same typical
|
|
problems, which we already mentioned in the discussion of the reliability of
|
|
sequence database records (see <a href="/books/n/sef/A55/?report=reader#A64">3.1.3</a>).
|
|
Inevitably, even partial automation of the annotation process tends to increase
|
|
the likelihood of all these types of errors.</p><p>In order to examine various kinds of errors that are common in genome annotation,
|
|
it is convenient to re-examine four cases of discrepancies in the annotation of
|
|
<i>M. genitalium</i> proteins that were specifically highlighted
|
|
in the aforecited article of Steven Brenner (<a class="figpopup" href="/books/NBK20253/table/A273/?report=objectonly" target="object" rid-figpopup="figA273" rid-ob="figobA273">Table 5.2</a>). Although one of the authors was involved in one of the
|
|
compared annotations, we think we can be completely impartial in the spirit of
|
|
Brenner's article, especially since six years have passed, an eternity for
|
|
genomics.</p><div class="iconblock whole_rhythm clearfix ten_col table-wrap" id="figA273"><a href="/books/NBK20253/table/A273/?report=objectonly" target="object" title="Table 5.2" class="img_link icnblk_img figpopup" rid-figpopup="figA273" rid-ob="figobA273"><img class="small-thumb" src="/books/NBK20253/table/A273/?report=thumb" src-large="/books/NBK20253/table/A273/?report=previmg" alt="Table 5.2. Different types of errors in genome annotation." /></a><div class="icnblk_cntnt"><h4 id="A273"><a href="/books/NBK20253/table/A273/?report=objectonly" target="object" rid-ob="figobA273">Table 5.2</a></h4><p class="float-caption no_bottom_margin">Different types of errors in genome annotation. </p></div></div><p>The protein MG302 was not annotated in the original genome publication by Fraser
|
|
and colleagues and was assigned conflicting annotations by the other two groups.
|
|
Ouzounis and coworkers notably characterized this protein as a
|
|
“mitochondrial 60S ribosomal protein L2”, whereas Koonin and
|
|
coworkers annotated it as a permease, perhaps specific for
|
|
glycerol-3-phosphate. A database search performed in 2002 leaves no doubt
|
|
whatsoever that the protein is a permease; this is, of course, readily supported
|
|
by transmembrane segment prediction. However, the glycerol-3-phosphate
|
|
specificity is not supported at all. Instead, these searches, particularly the
|
|
CDD search, unequivocally pointed to a relationship between MG302 and a family
|
|
of cobalt transporters. Nevertheless, since the similarity between MG302 and the
|
|
cobalt transporters is not particularly strong and transporters switch their
|
|
specificity with relative ease during evolution, caution is due, and the
|
|
annotation as “probable Co transporter” seems most
|
|
appropriate. This single case nicely covers several common problems of genome
|
|
annotation. The most benign but also apparently most widespread of these is <b>
|
|
<i>overprediction</i>
|
|
</b> or, more precisely, <b>
|
|
<i>overly specific prediction</i>
|
|
</b>. Even with the methods available in 1996 (ungapped BLAST, FASTA, various
|
|
alignment methods, and transmembrane segment prediction), the conclusion that
|
|
MG302 was a permease was quite firm. However, glycerol-3-phosphate permease
|
|
turned up as the most similar functionally characterized protein just by chance
|
|
(Co<sup>2+</sup> transporters had not been characterized at the
|
|
time). Transferring functional information from this unreliable best hit,
|
|
however tentatively, was a typical error of overprediction; the appropriate
|
|
annotation at the time would have been, simply, “predicted
|
|
permease”. The annotation of MG302 as “mitochondrial 60S
|
|
ribosomal protein L2” is, of course, much more conspicuous. At face
|
|
value, this does not even pass a “reality check”: there
|
|
certainly can be no mitochondria and no 60S ribosomes in mycoplasmas.</p><p>Such semantic snafus are pretty common in genome annotation, especially those
|
|
that are either produced fully automatically or manually but non-critically
|
|
(e.g. the “discovery” of head morphogenesis in bacteria
|
|
mentioned in <a href="/books/n/sef/A55/?report=reader">Chapter 3</a>). However,
|
|
these are probably the least serious annotation errors.</p><p>Let us just assume that the authors of this annotation meant “homolog
|
|
of mitochondrial 60S ribosomal protein L2”. What is worse: the search
|
|
result that presumably gave rise to this annotation is impossible to reproduce
|
|
at this time, at least not without detailed research, which we are not willing
|
|
to undertake. It is most likely that this blatantly wrong annotation was due to <b>
|
|
<i>a spurious database hit</i>
|
|
</b> to a ribosomal protein that was not critically assessed. It is not
|
|
clear, in this particular case, how this spurious hit could pass the
|
|
significance threshold, but in general, this happens most often because of the
|
|
lack of proper filtering for low complexity (or alternative approaches, such as
|
|
composition-based statistics, which are available in 2002 but had not been
|
|
developed in 1996; see <a href="/books/n/sef/A166/?report=reader">Chapter 4</a>).
|
|
Alternatively or additionally, the problem might lie in non-critical transfer of
|
|
annotation from <b>
|
|
<i>an unreliable database record</i>
|
|
</b>, i.e. a low-complexity sequence erroneously labeled as a ribosomal
|
|
protein. Notably, our re-analysis shows that the annotations assigned by each of
|
|
the three groups were not completely correct: one was an outright error; another
|
|
one involved overprediction; and the third one, an underprediction. Although
|
|
less notorious than false predictions (false-positives, in statistical terms),
|
|
lack of prediction, where a confident one is feasible with available methods, is
|
|
still an error (a false-negative).</p><p>The case of the MG225 protein is quite similar except that there was no clear
|
|
false prediction involved. Once again, the original genome project gave no
|
|
annotation (a false-negative), whereas one of the remaining groups annotated the
|
|
protein as “histidine permease”, and the other one stopped
|
|
at an “amino acid permease” annotation without proposing
|
|
specificity. Today's searches support the latter decision because no convincing,
|
|
specific relationship between this protein and transporters for any particular
|
|
amino acid could be detected (in fact, given the small repertoire of
|
|
transporters in mycoplasmas, this one might have a broad specificity). Notably,
|
|
both MG302 and MG225 remain “hypothetical proteins” in
|
|
GenBank to this day, although closely related orthologs from <i>M.
|
|
pneumoniae</i> are correctly annotated as permeases [<a href="/books/n/sef/A727/?report=reader#A896">168</a>].</p><p>The MG085 protein was annotated as an oxidoreductase (of different families) in
|
|
the original genome report and by Ouzounis and coworkers, whereas Koonin and
|
|
coworkers predicted that it was an ATP(GTP?)-utilizing enzyme on the basis of
|
|
the conservation of the P-loop motif in this protein and its homologs. In 2002,
|
|
database searches immediately identify this protein as HPr kinase (this
|
|
annotation is now correctly assigned to MG085 in GenBank), a regulator of the
|
|
sugar phosphotransferase system, which indeed is a P-loop-containing,
|
|
ATP-utilizing enzyme [<a href="/books/n/sef/A727/?report=reader#A1451">723</a>]. Back in
|
|
1996, this was the only informative annotation that could be derived for this
|
|
protein; HPr kinase genes had not been identified at the time. Once again, the
|
|
specific source of the oxidoreductase assignments is hard to determine; spurious
|
|
hits, non-critical use of incorrect database annotations, or a combination
|
|
thereof must have caused this.</p><p>The case of MG448 is of particular interest. This protein was annotated as
|
|
“pilin repressor” or simply PilB protein by Fraser and
|
|
coworkers and Ouzounis and coworkers and, somewhat cryptically, as
|
|
“chaperone-like protein” by Koonin and coworkers. This
|
|
protein remains “hypothetical” in GenBank but became a
|
|
peptide methionine sulfoxide reductase (PMSR) in SWISS-PROT. A database search
|
|
detects highly significant similarity to numerous proteins that are
|
|
annotated primarily as PMSR and, in some cases, as PilB-related repressors. In
|
|
reality, this protein is indeed a recently characterized, distinct form of PMSR,
|
|
MsrB [<a href="/books/n/sef/A727/?report=reader#A1204">476</a>,<a href="/books/n/sef/A727/?report=reader#A1254">526</a>], which is evolutionarily unrelated to, but is often
|
|
associated with, the classic PMSR, MsrA, either as part of a multidomain protein
|
|
or as a separate gene in the same operon [<a href="/books/n/sef/A727/?report=reader#A995">267</a>]. These fusions resulted in the annotation of MG448 as PMSR,
|
|
which, ironically, turned out to be correct, but mostly (except for the recently
|
|
updated SWISS-PROT description), for the wrong reason, because it was the MsrA
|
|
domain that was recognized in the fusion proteins. Furthermore, in several
|
|
bacteria, these two domains are fused to a third, thioredoxin domain. The
|
|
three-domain protein of <i>Neisseria gonorrhoeae</i> has been
|
|
characterized as a regulator of pili operon expression, and this is what caused
|
|
the annotation of MG448 as PilB, which was reproduced by two groups. This
|
|
annotation is outright wrong and does not even pass a “reality
|
|
check” because there are no pili in mycoplasmas (parenthetically,
|
|
latest reports appear to indicate that even the original functional
|
|
characterization of the <i>Neisseria</i> protein was erroneous [<a href="/books/n/sef/A727/?report=reader#A1504">776</a>]).</p><p>
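The fusion problem described above can be illustrated with a small, hypothetical check: before a function is transferred from a multidomain homolog, confirm that the alignment actually covers the domain to which that function is attributed. The sketch below is not part of any tool discussed in this chapter; the Domain class, the 0.8 coverage cutoff, and all coordinates are assumptions invented for illustration.</p>

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Domain:
    name: str      # hypothetical domain label
    start: int     # 1-based, inclusive
    end: int       # 1-based, inclusive
    function: str  # function attributed to this domain


def overlap(a_start, a_end, b_start, b_end):
    """Length of the overlap of two closed intervals (0 if they are disjoint)."""
    return max(0, min(a_end, b_end) - max(a_start, b_start) + 1)


def transferable_functions(hit_domains, aln_start, aln_end, min_cover=0.8):
    """Functions of those domains in the hit that the alignment covers.

    A domain counts as covered when at least `min_cover` of its length
    lies inside the aligned region of the hit (an assumed cutoff).
    """
    covered = []
    for d in hit_domains:
        length = d.end - d.start + 1
        if overlap(d.start, d.end, aln_start, aln_end) / length >= min_cover:
            covered.append(d.function)
    return covered


# Toy two-domain homolog; the coordinates are invented.
fusion = [
    Domain("MsrA", 1, 170, "peptide methionine sulfoxide reductase (MsrA)"),
    Domain("MsrB", 200, 320, "PMSR-associated domain (MsrB)"),
]

# A single-domain query aligns only to positions 205-318 of the fusion protein.
print(transferable_functions(fusion, 205, 318))
```

<p>With these toy coordinates, only the MsrB-type domain is covered by the alignment, so the function attributed to the MsrA domain is not transferred to the query.</p><p>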
|
|
<b>
|
|
<i>Unrecognized multidomain architecture</i>
|
|
</b> of either the analyzed protein or its homologs or both is a common cause
|
|
of erroneous annotation. The “chaperone-like protein”
|
|
annotation was based on the notion that the PMSR function could be interpreted
|
|
as a form of chaperone action, and accordingly, the associated domain was also
|
|
likely to have a chaperone-like activity. In retrospect, this looks like
|
|
overprediction combined with insufficient information included in the
|
|
annotation. A straightforward annotation of MG448 as a PMSR-associated domain,
|
|
perhaps with an extra prediction of redox activity on the basis of conservation
|
|
of cysteines in this domain, the way it has been done in a subsequent
|
|
publication [<a href="/books/n/sef/A727/?report=reader#A995">267</a>], would have been
|
|
appropriate. We revisit this interesting set of proteins when discussing context
|
|
analysis in <a href="#A276">Section 5.2</a>.</p><p>While considering only four proteins with contradictory annotations, we
|
|
encountered all the main sources of systematic error in genome annotation. We
|
|
list them here again, more or less in the order of decreasing severity, as we
|
|
see it: (i) spurious database hits, often caused by low-complexity regions in
|
|
the query or the database sequence; (ii) non-critical transfer of functional
|
|
prediction from an unreliable database record; (iii) incorrect interpretation
|
|
(lack of recognition) of multidomain architecture of the query and/or database
|
|
sequences; (iv) overly specific functional prediction; and (v)
|
|
underprediction.</p><p>We believe that this brief discussion highlights more general problems beyond
|
|
these specific causes of errors. Even the apparently correct database
|
|
annotations are insufficiently informative. Typically, the records do not
|
|
include the evidence behind the prediction or include only minimal data that may
|
|
be hard to interpret, such as E-values of the hits to particular domains. In
|
|
this situation, any complicated case will not be represented adequately (e.g.
|
|
the PMSR-associated domain discussed above). In addition, there is no controlled
|
|
vocabulary for genome annotation, which creates numerous semantic problems,
|
|
although an attempt to correct this situation is being undertaken in the form of
|
|
the Gene Ontology project [<a href="/books/n/sef/A727/?report=reader#A788">60</a>,<a href="/books/n/sef/A727/?report=reader#A1241">513</a>].</p><p>The above discussion shows that the general state of genome annotation is far
|
|
from being satisfactory. What can be done to improve it? In his paper on genome
|
|
annotation errors, Steven Brenner noted that, “to prevent errors from
|
|
spreading out of control, database curation by the scientific community will be
|
|
essential.” [<a href="/books/n/sef/A727/?report=reader#A844">116</a>]. Curation,
|
|
however, implies that databases other than GenBank will have to be employed
|
|
because GenBank, by definition, is an archival database (<a href="/books/n/sef/A55/?report=reader">Chapter 3</a>). It appears that the future
|
|
and, to some degree, already the present of genome annotation lies in
|
|
specialized databases that actually function as annotation tools. The beginnings
|
|
of such tools can be seen in databases like KEGG, WIT, and COGs, complemented by
|
|
tools for domain identification, such as CDD and SMART (see <a href="/books/n/sef/A55/?report=reader">Chapters 3</a> and <a href="/books/n/sef/A166/?report=reader">4</a>).</p><p>Conceptually, the advantage of this approach may be viewed as reduction and
|
|
structuring of the search space for genome annotation. Thus, when using COGs, a
|
|
genome analyst compares each protein sequence not to the unstructured set of
|
|
more than a million proteins (the NR database) but instead to a collection of
|
|
~5,000 mostly well-characterized protein sets classified by orthology, which is
|
|
the appropriate level of granularity for functional assignment. Genome
|
|
annotation is already starting to change through the use of the new generation of
|
|
databases and tools. However, smooth integration of these and development of
|
|
new, richer formats for annotation are things of the future. In the next
|
|
subsection, we turn to a specific example to illustrate how the use of COGs
|
|
helps genome annotation.</p></div><div id="A274"><h3>5.1.4. A case study on genome annotation: the crenarchaeon <i>Aeropyrum
|
|
pernix</i></h3><p>
|
|
<i>Aeropyrum pernix</i> was the first representative of the
|
|
Crenarchaeota (one of the two major branches of archaea; see <a href="/books/n/sef/A298/?report=reader">Chapter 6</a>) and the first aerobic
|
|
archaeon whose genome has been sequenced [<a href="/books/n/sef/A727/?report=reader#A1155">427</a>]. <i>A. pernix</i> was reported to encode 2,694
|
|
putative proteins in a 1.67-Mbase genome. Of these, 633 proteins were assigned a
|
|
specific or general function in the original report on the basis of sequence
|
|
comparison to proteins in the GenBank, SWISS-PROT, EMBL, PIR, and Owl databases.
|
|
Given the intrinsic interest of the first crenarchaeal genome and also because
|
|
of the unexpectedly low fraction of predicted genes that were assigned functions
|
|
in the original report, <i>A. pernix</i> was chosen for a pilot
|
|
annotation project centered around the COG database [<a href="/books/n/sef/A727/?report=reader#A1333">605</a>].</p><p>
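A COG-centered annotation of this kind can exploit phyletic patterns, that is, the sets of genomes represented in each COG (the actual protocol is described below). As a rough sketch, and not the code used in the project, the following Python fragment flags COGs in which a genome is unexpectedly absent or unexpectedly present; the COG identifiers and genome codes are hypothetical placeholders, not real COG data.</p>

```python
def unexpectedly_absent(cog_members, genome, relatives):
    """COGs that contain every relative of `genome` but not `genome` itself."""
    return [cog for cog, genomes in sorted(cog_members.items())
            if relatives.issubset(genomes) and genome not in genomes]


def unexpectedly_present(cog_members, genome, relatives):
    """COGs that contain `genome` but none of its relatives."""
    return [cog for cog, genomes in sorted(cog_members.items())
            if genome in genomes and relatives.isdisjoint(genomes)]


# Hypothetical phyletic patterns: COG label mapped to the genomes represented.
cogs = {
    "COG_A": {"Afu", "Mja", "Mth", "Pho", "Eco"},
    "COG_B": {"Afu", "Mja", "Mth", "Pho", "Ape"},
    "COG_C": {"Ape", "Eco", "Bsu"},
}
archaea = {"Afu", "Mja", "Mth", "Pho"}  # relatives of the query genome "Ape"

print(unexpectedly_absent(cogs, "Ape", archaea))   # candidates for deeper searches
print(unexpectedly_present(cogs, "Ape", archaea))  # candidates for case-by-case review
```

<p>COGs flagged in this way are only candidates for follow-up searches and manual examination, not annotations in themselves.</p><p>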
|
|
<a class="figpopup" href="/books/NBK20253/figure/A1680/?report=objectonly" target="object" rid-figpopup="figA1680" rid-ob="figobA1680">Figure 5.2</a> (see the color plates) shows
|
|
the protocol employed for the COG-based genome annotation. This procedure was
|
|
not limited to straightforward COGNITOR analysis but also explicitly drew from
|
|
the phyletic patterns. Whenever <i>A. pernix</i> was unexpectedly not
|
|
represented in a COG (e.g. a COG that included all other archaeal species),
|
|
additional analysis was undertaken. To identify possible diverged COG members
|
|
from <i>A. pernix,</i> PSI-BLAST searches were run with multiple
|
|
members of the respective COGs, and to detect COG members that could have been
|
|
missed in the original genome annotation, the translated sequence of the
|
|
<i>A. pernix</i> genome was searched using TBLASTN. Conversely,
|
|
unexpected occurrences of <i>A. pernix</i> proteins in COGs that did
|
|
not have any other archaeal members were examined case by case to detect likely
|
|
HGT events and novel functions in the crenarchaeal genome.</p><div class="iconblock whole_rhythm clearfix ten_col fig" id="figA1680" co-legend-rid="figlgndA1680"><a href="/books/NBK20253/figure/A1680/?report=objectonly" target="object" title="Figure 5.2" class="img_link icnblk_img figpopup" rid-figpopup="figA1680" rid-ob="figobA1680"><img class="small-thumb" src="/books/NBK20253/bin/ch5f2.gif" src-large="/books/NBK20253/bin/ch5f2.jpg" alt="Figure 5.2. Protocol of genome annotation using the COG database." /></a><div class="icnblk_cntnt" id="figlgndA1680"><h4 id="A1680"><a href="/books/NBK20253/figure/A1680/?report=objectonly" target="object" rid-ob="figobA1680">Figure 5.2</a></h4><p class="float-caption no_bottom_margin">Protocol of genome annotation using the COG database. </p></div></div><p>Proteins were assigned to COGs through two rounds of automated comparison using
|
|
COGNITOR, each followed by curation, that is, manual checking of the
|
|
assignments. The first round attempts to assign proteins to existing COGs;
|
|
typically, >90% of the assignments are made in this step. The
|
|
second round serves two purposes: first, to assign paralogs that might have
|
|
been missed in the first round to existing COGs; and, second, to create new
|
|
COGs from unassigned proteins.</p><p>The results of COG assignment for <i>A. pernix</i> are shown in <a class="figpopup" href="/books/NBK20253/table/A275/?report=objectonly" target="object" rid-figpopup="figA275" rid-ob="figobA275">Table 5.3</a>. Manual curation of the
|
|
automatic assignments revealed a false-positive rate of about 2%
|
|
(23 of 1123 proteins). Even if the less severe errors, when a protein was
|
|
transferred from one related COG to another, are taken into account, the
|
|
false-positive rate was 4%, which is not negligible but substantially
|
|
lower than the estimates cited above for more standard genome annotation
|
|
methods. The number of identified false-negatives was even lower, but in this
|
|
case, of course, it is not possible to determine how many proteins remain
|
|
unassigned. It is further notable that the great majority of assigned proteins
|
|
belonged to pre-existing COGs, which facilitates a (nearly) automatic
|
|
annotation.</p><div class="iconblock whole_rhythm clearfix ten_col table-wrap" id="figA275"><a href="/books/NBK20253/table/A275/?report=objectonly" target="object" title="Table 5.3" class="img_link icnblk_img figpopup" rid-figpopup="figA275" rid-ob="figobA275"><img class="small-thumb" src="/books/NBK20253/table/A275/?report=thumb" src-large="/books/NBK20253/table/A275/?report=previmg" alt="Table 5.3. Assignment of predicted Aeropyrum pernix proteins to COGs." /></a><div class="icnblk_cntnt"><h4 id="A275"><a href="/books/NBK20253/table/A275/?report=objectonly" target="object" rid-ob="figobA275">Table 5.3</a></h4><p class="float-caption no_bottom_margin">Assignment of predicted <i>Aeropyrum pernix</i> proteins to
|
|
COGs. </p></div></div><p>Altogether, 1,102 <i>A. pernix</i> proteins were assigned to COGs. Some
|
|
of these proteins (154) were members of
|
|
functionally uncharacterized COGs. Subtracting these, annotation has been added
|
|
to 315 proteins, which is an increase of about 50% compared to the
|
|
original annotation. These newly annotated <i>A. pernix</i> proteins
|
|
included, among others, the key glycolytic enzymes glucose-6-phosphate isomerase
|
|
(APE0768, COG0166) and triose phosphate isomerase (APE1538, COG0149), and the
|
|
pyrimidine biosynthetic enzymes orotidine-5′-phosphate decarboxylase
|
|
(APE2348, COG0284), uridylate kinase (APE0401, COG0528), cytidylate kinase
|
|
(APE0978, COG1102), and thymidylate kinase (APE2090, COG0125). Similarly,
|
|
important functions in DNA replication and repair were confidently assigned to a
|
|
considerable number of <i>A. pernix</i> proteins, which, in the
|
|
original annotation, were described as “hypothetical”.
|
|
Examples include the bacterial-type DNA primase (COG0358), the large subunit of
|
|
the archaeal-eukaryotic-type primase (COG2219), a second ATP-dependent DNA
|
|
ligase (COG1423), three paralogous photolyases (COG1533), and several helicases
|
|
and nucleases of different specificities.</p><p>The case of the large subunit of the archaeal-eukaryotic primase is particularly
|
|
illustrative of the contribution of different types of inference to genome
|
|
annotation. COGNITOR failed to assign an <i>A. pernix</i> protein to
|
|
the respective COG (COG2219). However, given the ubiquity of this subunit in
|
|
euryarchaea and eukaryotes and the presence of a readily detectable small
|
|
primase subunit in <i>A. pernix</i> (COG1467), a more detailed
|
|
analysis was undertaken by running PSI-BLAST searches against the NR database
|
|
with all members of COG2219 as queries. When the <i>A. fulgidus</i>
|
|
primase sequence (AF0336) was used to initiate the search, the <i>A.
|
|
pernix</i> counterpart (APE0667) was indeed detected at a statistically
|
|
significant level.</p><p>An interesting case of re-annotation of a protein with a critical function, which
|
|
also led to more general conclusions, is the archaeal uracil DNA glycosylase
|
|
(UDG; COG1573). The members of this COG were originally annotated (and still
|
|
remain so labeled in GenBank) as a “DNA polymerase homologous
|
|
protein” (APE0427 from <i>A. pernix</i>) or as a
|
|
“DNA polymerase, bacteriophage type” (AF2277 from <i>A.
|
|
fulgidus</i>) or as a hypothetical protein. However, UDG activity has
|
|
been experimentally demonstrated for the COG1573 members from <i>T.
|
|
maritima</i> and <i>A. fulgidus</i> [<a href="/books/n/sef/A727/?report=reader#A1468">740</a>,<a href="/books/n/sef/A727/?report=reader#A1469">741</a>]. The
|
|
reason for the erroneous annotation of these proteins as DNA polymerases is
|
|
already well familiar to us: independent fusion of the uracil DNA glycosylase
|
|
with DNA polymerases was detected in bacteriophage SPO1 and in <i>Yersinia
|
|
pestis</i> [<a href="/books/n/sef/A727/?report=reader#A772">44</a>]. Although these
|
|
fusions hampered the correct annotation in the original analysis of the archaeal
|
|
genomes, they seem to be functionally informative, suggesting that this type of
|
|
UDG functions in conjunction with the replicative DNA polymerase.</p><p>The 1,102 COG members from <i>A. pernix</i> comprise 41% of
|
|
the total number of predicted genes. This percentage was significantly lower
|
|
than the average fraction of COG members (72%) for the other archaeal
|
|
species. It seems most likely that this was due to an overestimate of the total
|
|
number of ORFs in the genome. Many of the <i>A. pernix</i> ORFs with
|
|
no similarity to proteins in sequence databases (1,538, or 57.1%)
|
|
overlap with ORFs from conserved families, including COG members. On the basis
|
|
of the average representation of all genomes in the COGs (67%) and
|
|
the average for the other archaea (72%), one could estimate the total
|
|
number of <i>A. pernix</i> proteins to be between 1,550 and 1,700.
|
|
This range is also consistent with the size of the <i>A. pernix</i>
|
|
genome (1.67 Mb), given the gene density of about one gene per kilobase, which
|
|
is typical of bacteria and archaea. More conservatively, 849 ORFs, originally
|
|
annotated as probable protein-coding genes, significantly overlapped with COG
|
|
members and could be confidently eliminated, which brings the total number of
|
|
protein-coding genes in <i>A. pernix</i> to a maximum of 1,873.
|
|
Unfortunately, the spurious ORFs still remain in the NR database, polluting it
|
|
and potentially even leading to the emergence of ghost
|
|
“protein” families once new, related genomes are sequenced.
|
|
Evidence has been presented that spurious “proteins” have
|
|
been produced by other microbial genome projects as well [<a href="/books/n/sef/A727/?report=reader#A1505">777</a>], although probably not on the same scale as
|
|
<i>A. pernix</i>. This regrettable pollution emphasizes the value
|
|
of specialized, curated databases that are free of apparitions.</p><p>Despite this overrepresentation of ORFs in <i>A. pernix</i>, we
|
|
nonetheless added 28 previously unidentified ORFs that were detected by
|
|
searching the genome sequence translated in all six frames for possible members
|
|
of COGs with unexpected phyletic patterns. These newly detected genes represent
|
|
conserved protein families, including functionally indispensable proteins, such
|
|
as chorismate mutase (APE0563a, COG1605), translation initiation factor IF-1
|
|
(APE_IF-1, COG0361), and seven ribosomal proteins (APE_rpl21E, COG2139;
|
|
APE_rps14, COG0199; APE_rpl29, COG0255; APE_rplX, COG2157; APE_rpl39E, COG2167;
|
|
APE_rpl34E, COG2174; APE_rps27AE, COG1998).</p><p>This pilot analysis, while falling far short of the goal of comprehensive genome
|
|
annotation, highlights some advantages of specialized comparative-genomic
|
|
databases as annotation tools. In this particular case, the original annotation
|
|
probably had been overly conservative, which partly accounts for the large
|
|
increase in the functional prediction rate. However, the employed protocol is
|
|
general and, with modifications and addition of some extra procedures, has been
|
|
used in primary genome analysis [<a href="/books/n/sef/A727/?report=reader#A1350">622</a>,<a href="/books/n/sef/A727/?report=reader#A1507">779</a>]. In other genome
|
|
projects, the WIT system has been employed in a conceptually similar manner
|
|
[<a href="/books/n/sef/A727/?report=reader#A907">179</a>,<a href="/books/n/sef/A727/?report=reader#A1146">418</a>]. As shown above, this type of analysis yields
|
|
reasonable accuracy of annotation, even when applied in a fully automated mode
|
|
(<a class="figpopup" href="/books/NBK20253/table/A275/?report=objectonly" target="object" rid-figpopup="figA275" rid-ob="figobA275">Table 5.3</a>). However, additional
|
|
expert contribution, particularly in the form of context analysis discussed in
|
|
the next section, adds substantial value to genome annotation.</p></div></div><div id="A276"><h2 id="_A276_">5.2. Genome Context Analysis and Functional Prediction</h2><p>All the preceding discussion in this chapter centered on prediction of the functions
|
|
of proteins encoded in sequenced genomes by extrapolating from the functions of
|
|
their experimentally characterized homologs. The success of this approach depends on
|
|
the sensitivity and selectivity of the methods that are used for detecting sequence
|
|
similarity (see <a href="/books/n/sef/A166/?report=reader">Chapter 4</a>) and on the
|
|
employed rules of inference (see <a href="#A265">5.1</a>). There
|
|
is no doubt that homology analysis remains the central methodology of genomics, i.e.
|
|
the one that produces the bulk of useful information. However, a group of recently
|
|
developed approaches in comparative genomics goes beyond sequence or structure
|
|
comparison. These methods have become collectively and, we think, aptly known as
|
|
genome context analysis [<a href="/books/n/sef/A727/?report=reader#A995">267</a>,<a href="/books/n/sef/A727/?report=reader#A1096">368</a>,<a href="/books/n/sef/A727/?report=reader#A1097">369</a>,<a href="/books/n/sef/A727/?report=reader#A1100">372</a>]. The notion of
|
|
“context” here includes all types of associations between genes
|
|
and proteins in the same or in different genomes that may point to functional
|
|
interactions and justify a verdict of “guilt by association”
|
|
[<a href="/books/n/sef/A727/?report=reader#A764">36</a>]: if gene A is involved in function
|
|
X and we obtain evidence that gene B functionally associates with A, then B is also
|
|
involved in X. More specifically, context in comparative genomics pertains to
|
|
phyletic profiles of protein families, domain fusions in multidomain proteins, gene
|
|
adjacency in genomes, and expression patterns. Indeed, genes whose products are
|
|
involved in closely related functions (e.g. form different subunits of a
|
|
multisubunit enzyme or participate in the same pathway) should all be either present
|
|
or absent in a certain set of genomes (i.e. have similar if not identical phyletic
|
|
patterns) and should be coordinately expressed (i.e. are expected to be encoded in
|
|
the same operon or at least to have similar expression patterns). This simple logic
|
|
gives us a potentially powerful way to assign genes that have no experimentally
|
|
characterized homologs to particular pathways or cellular systems. Although context
|
|
methods usually provide only rather general predictions, they represent a new and
|
|
important development in genomics that explicitly takes advantage of the rapidly
|
|
growing collection of sequenced genomes.</p><div id="A277"><h3>5.2.1. Phyletic patterns (profiles)</h3><p>Genes coding for proteins that function in the same cellular system or pathway
|
|
tend to have similar phyletic patterns [<a href="/books/n/sef/A727/?report=reader#A987">259</a>,<a href="/books/n/sef/A727/?report=reader#A1556">828</a>]. Numerous examples
|
|
for a variety of metabolic pathways are given in <a href="/books/n/sef/A371/?report=reader">Chapter 7</a>. These observations led to the suggestion that
|
|
this trend could be used in the reverse direction, i.e. to deduce functions of
|
|
uncharacterized genes [<a href="/books/n/sef/A727/?report=reader#A1393">665</a>]. However
|
|
attractive this idea might be, the real-life phyletic patterns are heavily
|
|
affected by such major evolutionary phenomena as partial redundancy in gene
|
|
functions, non-orthologous gene displacement, and lineage-specific gene loss. As
|
|
a result, there are thousands of different phyletic patterns in the COGs, most of
|
|
them represented only once or twice. Moreover, examination of a variety of
|
|
multi-component systems and biochemical pathways (<a href="/cgi-bin/COG/palox?sysall" ref="pagearea=body&targetsite=external&targetcat=link&targettype=uri">http://www.ncbi.nlm.nih.gov/cgi-bin/COG/palox?sys=all</a>)
|
|
shows that, despite the tendency of the components of the same complex or
|
|
pathway to have similar patterns, there is not even one pathway in which <b>
|
|
<i>all</i>
|
|
</b> members show exactly the same pattern. Even the principal metabolic
|
|
pathways, such as glycolysis, TCA cycle, and purine and pyrimidine biosynthesis,
|
|
show considerable variability of phyletic patterns due to non-orthologous gene
|
|
displacement ([<a href="/books/n/sef/A727/?report=reader#A993">265</a>,<a href="/books/n/sef/A727/?report=reader#A998">270</a>,<a href="/books/n/sef/A727/?report=reader#A1098">370</a>], see
|
|
<a href="/books/n/sef/A371/?report=reader">Chapter 7</a>).</p><p>Because of this variability, the predictive power of the observation that two
|
|
genes have the same phyletic pattern is, in and by itself, limited. However,
|
|
when supported by other lines of evidence, such observations prove useful.
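</p><p>As a rough illustration of what comparing phyletic patterns involves, the sketch below encodes each pattern as a string of presence/absence flags and groups families that share exactly the same pattern. The genome codes, family labels, and function names are hypothetical placeholders chosen for this example; they are not taken from the COG database or from the procedure used by the authors.</p>

```python
from collections import defaultdict

# Hypothetical toy data: each family maps to the set of genomes that encode a member.
# Genome codes and family labels are illustrative placeholders, not real COG records.
GENOMES = ["g1", "g2", "g3", "g4", "g5"]

presence = {
    "cogA": {"g1", "g2", "g4"},
    "cogB": {"g1", "g2", "g4"},
    "cogC": {"g1", "g2", "g3", "g4", "g5"},
    "cogD": {"g3", "g5"},
}


def phyletic_pattern(genomes_with_member, genome_order):
    """Encode presence/absence as a string of '1' and '0' in a fixed genome order."""
    return "".join("1" if g in genomes_with_member else "0" for g in genome_order)


def group_by_pattern(presence_by_family, genome_order):
    """Collect families that share exactly the same phyletic pattern."""
    groups = defaultdict(list)
    for family, genomes in sorted(presence_by_family.items()):
        groups[phyletic_pattern(genomes, genome_order)].append(family)
    return dict(groups)


def pattern_distance(p, q):
    """Number of genomes in which two patterns disagree (Hamming distance)."""
    return sum(a != b for a, b in zip(p, q))


groups = group_by_pattern(presence, GENOMES)
for pattern, families in groups.items():
    print(pattern, families)
```

<p>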
|
|
Somewhat counterintuitively, the universal pattern is one of the most strongly
|
|
indicative of gene function: among the 63 universal COGs, at least 56 consist of
|
|
proteins involved in translation. The functions of those few proteins in the
|
|
universal set that remain uncharacterized can be predicted with considerable
|
|
confidence through combination of this phyletic pattern with other lines of
|
|
evidence. For example, the uncharacterized protein YchF, which belongs to the
|
|
universal set (COG0012), is predicted by sequence analysis to be a GTPase; in
|
|
addition, this protein contains a C-terminal RNA-binding TGS domain [<a href="/books/n/sef/A727/?report=reader#A1637">909</a>]. Taken together with the ubiquity of
|
|
this protein and with the fact that, in phylogenetic trees, the archaeal members
|
|
of the COG clearly cluster with eukaryotic ones, this strongly suggests that
|
|
YchF is an uncharacterized, universal translation factor [<a href="/books/n/sef/A727/?report=reader#A995">267</a>]. This is supported by the juxtaposition of the
|
|
<i>ychF</i> gene with the gene for peptidyl-tRNA hydrolase
|
|
(<i>pth</i>) in numerous proteobacteria. The discussion of this
|
|
protein made us run ahead of ourselves and invoke other context methods, which
|
|
are considered in the next subsections, namely, analysis of domain fusions and
|
|
gene juxtaposition. This situation is quite typical: context methods are at
|
|
their best when they complement one another. Although statistical significance
|
|
estimates for a combination of context methods do not currently seem feasible,
|
|
in a case like YchF, the evidence appears to be, for all practical purposes,
|
|
irrefutable.</p><p>Another similar case involves the predicted ATPase or (more likely) kinase YjeE
|
|
from <i>E. coli</i> [<a href="/books/n/sef/A727/?report=reader#A984">256</a>] and
|
|
its orthologs from a majority of bacterial genomes that comprise COG0802. Domain
|
|
analysis identified this protein as a likely P-loop ATPase but failed to give
|
|
any indications as to its cellular role. The phyletic pattern of this COG shows
|
|
that YjeE is encoded in every bacterial genome, with the exception of <i>M.
|
|
genitalium</i>, <i>M. pneumoniae</i>, and <i>U.
|
|
urealyticum</i>, the only three bacterial species in the COG database
|
|
that do not form a cell wall. Since other conserved proteins with the same
|
|
phyletic pattern (MurA, MurB, MurG, FtsI, FtsW, DdlA) are enzymes of cell wall
|
|
biosynthesis, it can be predicted that YjeE is an ATPase or kinase involved in
|
|
the same process. Again, this prediction is supported by the adjacency of the
|
|
<i>yjeE</i> gene with the gene for N-acetylmuramoyl-L-alanine amidase,
|
|
another cell wall biosynthesis enzyme.</p><p>There is more to phyletic pattern analysis than prediction based on identical or
|
|
similar patterns. Guilt by association can be established also through
|
|
identification of sets of genes that are <b>
|
|
<i>co-eliminated</i>
|
|
</b> in a given lineage; this approach exploits the widespread phenomenon of
|
|
lineage-specific gene loss. A systematic analysis of the set of genes that have
|
|
been co-eliminated in the yeast <i>S. cerevisiae</i> after its
|
|
divergence from the common ancestor with <i>S. pombe</i> led to the
|
|
prediction that a particular group of proteins, including one that contained a
|
|
helicase and a duplicated RNase III domain, was involved in post-transcriptional
|
|
gene silencing [<a href="/books/n/sef/A727/?report=reader#A783">55</a>]. This protein turned
|
|
out to be the now famous Dicer nuclease, which indeed has a central role in
|
|
silencing [<a href="/books/n/sef/A727/?report=reader#A1093">365</a>,<a href="/books/n/sef/A727/?report=reader#A1164">436</a>].</p><p>On many occasions, non-orthologous gene displacement manifests in <b>
|
|
<i>complementary</i>
|
|
</b>, rather than identical or similar, phyletic patterns, as we have seen
|
|
for phosphoglycerate mutase in <a href="/books/n/sef/A22/?report=reader#A43">2.2.6</a>. The
|
|
complementarity is rarely perfect because of partial functional redundancy: some
|
|
organisms, particularly those with larger genomes, often encode more than one
|
|
protein to perform the same function. This can be illustrated by the case of the
|
|
recently discovered new type of fructose-1,6-bisphosphate aldolase, referred to
|
|
as FbaB or DhnA [<a href="/books/n/sef/A727/?report=reader#A985">257</a>]. The two
|
|
well-known variants of this enzyme, class I (Schiff-base forming,
|
|
metal-independent) and class II (metal-dependent), have long been considered to
|
|
be unrelated (analogous) enzymes until structural comparisons revealed their
|
|
underlying similarity (see Figure 1.9) [<a href="/books/n/sef/A727/?report=reader#A823">95</a>,<a href="/books/n/sef/A727/?report=reader#A915">187</a>,<a href="/books/n/sef/A727/?report=reader#A985">257</a>,<a href="/books/n/sef/A727/?report=reader#A1277">549</a>]. These
|
|
enzymes are generally limited in their phyletic distribution to eukaryotes
|
|
(class I) and bacteria (class II); some bacteria, however, have both variants
|
|
and yeast has the bacterial (class II) form of the enzyme [<a href="/books/n/sef/A727/?report=reader#A1277">549</a>]:</p><div class="graphic"><img src="/books/NBK20253/bin/ch5e1.jpg" alt="Image ch5e1.jpg" /></div><p>Sequencing of archaeal genomes revealed the absence of either form of the
|
|
fructose-1,6-bisphosphate aldolase. The same was the case with chlamydiae, which
|
|
were predicted to have a third form of this enzyme [<a href="/books/n/sef/A727/?report=reader#A1140">412</a>,<a href="/books/n/sef/A727/?report=reader#A1533">805</a>].
|
|
Indeed, investigation of the metal-independent fructose-1,6-bisphosphate
|
|
aldolase activity in <i>E. coli</i> led to the discovery of another
|
|
metal-independent Schiff-base-forming variant [<a href="/books/n/sef/A727/?report=reader#A1572">844</a>] whose sequence, however, was more closely related to those of
|
|
class II enzymes than to typical class I enzymes [<a href="/books/n/sef/A727/?report=reader#A985">257</a>]. Highly conserved homologs of this new, third form of
|
|
fructose-1,6-bisphosphate aldolase were found in chlamydial and archaeal
|
|
genomes:</p><div class="graphic"><img src="/books/NBK20253/bin/ch5e2.jpg" alt="Image ch5e2.jpg" /></div><p>As with phosphoglycerate mutase, combining these phyletic patterns shows almost
|
|
perfect complementarity, with aldolase missing only in
|
|
<i>Rickettsia</i>, which does not encode any glycolytic enzymes,
|
|
and in <i>Thermoplasma</i>, which appears to rely exclusively on the
|
|
Entner-Doudoroff pathway (see <a href="/books/n/sef/A371/?report=reader#A373">7.1.1</a>):</p><div class="graphic"><img src="/books/NBK20253/bin/ch5e3.jpg" alt="Image ch5e3.jpg" /></div><p>Other interesting examples of complementary phyletic patterns include
|
|
lysyl-tRNA synthetases, pyridoxine biosynthesis proteins PdxA and PdxZ [<a href="/books/n/sef/A727/?report=reader#A984">256</a>], thymidylate synthases [<a href="/books/n/sef/A727/?report=reader#A995">267</a>], and many others. The case of
|
|
thymidylate synthases is particularly remarkable. Thymidylate synthase is a
|
|
strictly essential enzyme of DNA precursor biosynthesis, and its apparent
|
|
absence in several bacterial and archaeal species became a major puzzle as their
|
|
genome sequences were reported.</p><div class="graphic"><img src="/books/NBK20253/bin/ch5e4.jpg" alt="Image ch5e4.jpg" /></div><p>The alternative thymidylate synthase was predicted [<a href="/books/n/sef/A727/?report=reader#A995">267</a>] on the basis of a phyletic pattern that was nearly
|
|
complementary (with just one case of redundancy) to that of the classic
|
|
thymidylate synthase (ThyA) and the report that the homolog of the COG1351
|
|
proteins from <i>Dictyostelium</i> complemented thymidylate synthase
|
|
deficiency [<a href="/books/n/sef/A727/?report=reader#A934">206</a>]. Just before this book
|
|
went to print, a new issue of <i>Science</i> reported the confirmation
|
|
of this prediction: not only was it shown that the COG1351 member from
|
|
<i>H. pylori</i> had thymidylate synthase activity, but also the
|
|
structure of this protein was solved and turned out to be unrelated to
|
|
that of ThyA [<a href="/books/n/sef/A727/?report=reader#A1317">589</a>,<a href="/books/n/sef/A727/?report=reader#A1326">598</a>].</p></div><div id="A278"><h3>5.2.2. Gene (domain) fusions: “guilt by association”</h3><p>It is fairly common that functionally interacting proteins that are encoded by
|
|
separate genes in some organisms are fused in a single polypeptide chain in
|
|
others. This has been confirmed by statistical analysis that demonstrated
|
|
general functional coherence of fused domains [<a href="/books/n/sef/A727/?report=reader#A1658">930</a>]. The advantages of a multidomain architecture are that this
|
|
organization facilitates functional complex assembly and may also allow reaction
|
|
intermediate channeling [<a href="/books/n/sef/A727/?report=reader#A1274">546</a>].</p><p>The basic assumption in the analysis of domain fusions is that a fusion will be
|
|
fixed during evolution only when it provides a selective advantage to the
|
|
organism in the form of improved functional interaction between proteins. Thus,
|
|
finding fused proteins (domains) in one species suggests that they might
|
|
interact, physically or at least functionally, in other species. In and by
|
|
itself, this notion is trivial and has been employed for predicting protein and
|
|
domain functions on an anecdotal basis for years (see [<a href="/books/n/sef/A727/?report=reader#A828">100</a>], just as an example). However, with the rapid growth
|
|
of the sequence information, the applicability of this approach widened and two
|
|
independent groups proposed, in well-publicized papers, that analysis of domain
|
|
fusions could be a general method for systematic and, moreover, automatic,
|
|
prediction of protein functions [<a href="/books/n/sef/A727/?report=reader#A941">213</a>,<a href="/books/n/sef/A727/?report=reader#A1274">546</a>]. In one of these
|
|
studies [<a href="/books/n/sef/A727/?report=reader#A1274">546</a>], domain fusions are
|
|
referred to as “Rosetta Stone” proteins – clues to
|
|
deciphering the functions of their component domains, and this memorable name
|
|
stuck to the whole approach. (The Rosetta Stone metaphor is quite loose: the
|
|
famous stone used by Jean-François Champollion to decipher the Egyptian
|
|
hieroglyphs and now on public display in the British Museum, is a tri-lingua,
|
|
i.e. a monument that has on it the same text in three different languages. There
|
|
is nothing exactly like that about domain fusions; it is just possible to say
|
|
vaguely that the “language” of domain fusions is translated
|
|
into the “language” of functional interactions. The
|
|
“guilt by association” simile [<a href="/books/n/sef/A727/?report=reader#A764">36</a>] seems much more apt, if less glamorous.)</p><p>In his comment on the “Rosetta Stone” excitement, Russell
|
|
Doolittle pointed out that cases that establish a link between two well-known
|
|
domains or those that link two unknown domains are not likely to lead to any
|
|
scientific breakthroughs [<a href="/books/n/sef/A727/?report=reader#A916">188</a>]. Only
|
|
those “Rosetta Stone” proteins, in which an unknown domain
|
|
is linked to a previously characterized one, can be used to infer the
|
|
function(s) of the uncharacterized domain. Analysis of domain fusions in
|
|
complete microbial genomes indicates that they are a complex mixture of
|
|
informative, uninformative and potentially misleading cases, which certainly
|
|
provide many clues to functions of uncharacterized domains. However,
|
|
interpretations stemming from domain fusion seem to require case-by-case
|
|
examination by human experts and, most of the time, become really useful only
|
|
when combined with other lines of evidence.</p><p>One of the advantages of the guilt by association approach is that, at least in
|
|
principle, it allows transitive closure, i.e. expansion of functional
|
|
associations between transitively connected components. In other words,
|
|
detection of domain combinations AB, BC, and CD suggests that domains A, B, C
|
|
and D form a functional network. This approach has been successfully applied to
|
|
the analysis of prokaryotic signal-transduction systems, resulting in the
|
|
prediction of several new signaling domains. Participation of these domains in
|
|
signaling cascades was originally proposed solely on the basis of their
|
|
conserved domain architectures and subsequently confirmed experimentally [<a href="/books/n/sef/A727/?report=reader#A997">269</a>].</p><p>In <a class="figpopup" href="/books/NBK20253/figure/A279/?report=objectonly" target="object" rid-figpopup="figA279" rid-ob="figobA279">Figure 5.3</a>, we illustrate the
|
|
“guilt by association” approach using the peptide methionine
|
|
sulfoxide reductase example discussed in the previous section as a case of
|
|
annotation complicated by domain fusion. As in the examples above, the logic of
|
|
the analysis does not allow us to use domain fusions only; we also have to
|
|
invoke phyletic patterns and organization of genes in the genome.
|
|
|
|
</p><div class="iconblock whole_rhythm clearfix ten_col fig" id="figA279" co-legend-rid="figlgndA279"><a href="/books/NBK20253/figure/A279/?report=objectonly" target="object" title="Figure 5.3" class="img_link icnblk_img figpopup" rid-figpopup="figA279" rid-ob="figobA279"><img class="small-thumb" src="/books/NBK20253/bin/ch5f3.gif" src-large="/books/NBK20253/bin/ch5f3.jpg" alt="Figure 5.3. A Rosetta Stone case: domain fusions and gene clusters that involve peptide methionine sulfoxide reductases." /></a><div class="icnblk_cntnt" id="figlgndA279"><h4 id="A279"><a href="/books/NBK20253/figure/A279/?report=objectonly" target="object" rid-ob="figobA279">Figure 5.3</a></h4><p class="float-caption no_bottom_margin">A Rosetta Stone case: domain fusions and gene clusters that
|
|
involve peptide methionine sulfoxide reductases. </p></div></div><p>In most organisms, peptide methionine sulfoxide reductase A (MsrA) is a small,
|
|
single-domain protein. However, in <i>H. influenzae</i>, <i>H.
|
|
pylori</i> and <i>T. pallidum</i>, it is fused with another,
|
|
highly conserved domain (MsrB) that is found as a distinct protein in all other
|
|
organisms that encode MsrA. In other words, the two fusion components show the
|
|
same phyletic patterns:</p><div class="graphic"><img src="/books/NBK20253/bin/ch5e5.jpg" alt="Image ch5e5.jpg" /></div><p>In <i>B. subtilis</i>, the genes for MsrA and MsrB are not fused, but
|
|
are adjacent and may form an operon. In contrast, in <i>T.
|
|
pallidum</i>, MsrA and MsrB are fused, but in reverse order, compared
|
|
to <i>H. influenzae</i> and <i>H. pylori</i> (<a class="figpopup" href="/books/NBK20253/figure/A279/?report=objectonly" target="object" rid-figpopup="figA279" rid-ob="figobA279">Figure 5.3</a>). The <i>H.
|
|
influenzae</i> and <i>H. pylori</i> “Rosetta
|
|
Stone” proteins are most closely related to each other, but the one
|
|
from <i>T. pallidum</i> does not show particularly strong similarity
|
|
to either of them, suggesting two independent fusion events in these two
|
|
lineages.</p><p>In <i>Neisseria</i> and <i>Fusobacterium</i>, a third,
|
|
thioredoxin-like domain joins the MsrAB fusion (<a class="figpopup" href="/books/NBK20253/figure/A279/?report=objectonly" target="object" rid-figpopup="figA279" rid-ob="figobA279">Figure 5.3</a>). In <i>H. influenzae</i>, the ortholog of this
|
|
predicted thioredoxin is encoded two genes upstream of MsrAB. The gene in
|
|
between encodes a conserved integral membrane protein, designated CcdA for its
|
|
requirement for cytochrome c biogenesis in <i>B. subtilis</i>. Its
|
|
ortholog is encoded next to MsrAB in <i>H. pylori</i> and next to
|
|
thioredoxin in several other genomes (<a class="figpopup" href="/books/NBK20253/figure/A279/?report=objectonly" target="object" rid-figpopup="figA279" rid-ob="figobA279">Figure
|
|
5.3</a>).</p><p>Combining all this evidence from the guilt by association approach, gene
|
|
adjacency data, phyletic profiles, and sequence analysis, it has been predicted
|
|
that MsrA, MsrB, and thioredoxin form an enzymatic complex, which catalyzes a
|
|
cascade of redox reactions and is associated with the bacterial membrane via
|
|
CcdA. However, this is probably not the only complex in which MsrAB is involved,
|
|
because not all genomes that have this gene pair also encode CcdA (<a class="figpopup" href="/books/NBK20253/figure/A279/?report=objectonly" target="object" rid-figpopup="figA279" rid-ob="figobA279">Figure 5.3</a>). Since the publication of this
|
|
prediction, it has been largely confirmed by the demonstration that MsrB is a
|
|
second, distinct, thioredoxin-dependent peptide methionine sulfoxide reductase,
|
|
which cooperates with MsrA in the defense of bacterial cells against reactive
|
|
oxygen species [<a href="/books/n/sef/A727/?report=reader#A1044">316</a>,<a href="/books/n/sef/A727/?report=reader#A1254">526</a>,<a href="/books/n/sef/A727/?report=reader#A1504">776</a>]. However, the CcdA connection remains to be investigated.</p><p>This case study demonstrates both the considerable potential of domain fusion
|
|
analysis as a tool for protein function prediction, particularly when combined
|
|
with other context-based and homology-based approaches, and potential problems.
|
|
One could be tempted to extend the small network of domains shown in <a class="figpopup" href="/books/NBK20253/figure/A279/?report=objectonly" target="object" rid-figpopup="figA279" rid-ob="figobA279">Figure 5.3</a> by including other domains that
|
|
form fusions (or are encoded by adjacent genes) with the thioredoxin domain. It
|
|
appears, however, that such an extension would have been ill-advised. Firstly,
|
|
orthologous relationships among thioredoxins are ambiguous, and secondly,
|
|
although thioredoxins are not among the most “promiscuous”
|
|
domains, the variety of their “guilt by association” links
|
|
still is sufficiently large to make any predictions regarding potential
|
|
functional connections between the respective domains and MsrAB dubious at best.
|
|
These two issues, identification of orthologs and
|
|
“promiscuity” characteristic of certain domains, are the
|
|
principal problems encountered by the “guilt by association”
|
|
approach. Domain fusions often are found only within a specialized, narrow group
|
|
of orthologous protein domains, and translating their functional interaction
|
|
into a general prediction for the respective domains is likely to be grossly
|
|
misleading. A relatively small number of “promiscuous”
|
|
domains, particularly those involved in signal transduction and different forms
|
|
of regulation (e.g. CBS, PAS, GAF domains), combine with a variety of other
|
|
domains that otherwise have nothing in common and therefore significantly
|
|
increase the number of false-positives among the Rosetta Stone predictions.
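</p><p>The transitive linking of fused domains (AB, BC, CD) and the way a single promiscuous domain can inflate the resulting network can be sketched as follows. This is a minimal illustration under stated assumptions: the fusion list, the domain names, the partner-count threshold, and the function names are all hypothetical, and the code is not the procedure used in the Rosetta Stone studies cited here.</p>

```python
from collections import defaultdict

# Hypothetical fusion observations: each tuple lists the domains found fused
# in one protein. "P" plays the role of a promiscuous domain in this toy set.
fusions = [("A", "B"), ("B", "C"), ("C", "D"),
           ("P", "A"), ("P", "X"), ("P", "Y"), ("X", "Z")]


def build_graph(fusion_list, excluded=frozenset()):
    """Link every pair of domains that co-occur in a fusion, skipping excluded domains."""
    graph = defaultdict(set)
    for domains in fusion_list:
        kept = [d for d in domains if d not in excluded]
        for d in kept:
            graph.setdefault(d, set())  # keep the domain even if it has no partners left
        for i, a in enumerate(kept):
            for b in kept[i + 1:]:
                graph[a].add(b)
                graph[b].add(a)
    return graph


def connected_groups(graph):
    """Transitive closure of the links: mutually reachable domains form one group."""
    seen, groups = set(), []
    for start in sorted(graph):
        if start in seen:
            continue
        stack, group = [start], set()
        while stack:
            node = stack.pop()
            if node in group:
                continue
            group.add(node)
            stack.extend(graph[node] - group)
        seen |= group
        groups.append(sorted(group))
    return groups


def promiscuous(graph, min_partners):
    """Domains with at least `min_partners` distinct fusion partners (assumed cutoff)."""
    return {d for d, partners in graph.items() if len(partners) >= min_partners}


all_links = build_graph(fusions)
flagged = promiscuous(all_links, 3)
filtered = build_graph(fusions, excluded=flagged)
print(connected_groups(all_links))  # one large group when "P" is kept
print(connected_groups(filtered))   # separate groups once "P" is excluded
```

<p>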
|
|
Although it is possible to simply exclude the worst known offenders from any
|
|
Rosetta Stone analysis [<a href="/books/n/sef/A727/?report=reader#A1274">546</a>], other
|
|
domains also have the potential of showing “illicit”
|
|
behavior and compromising the results. Manual detection of such cases is
|
|
relatively straightforward, but automation of this process may be
|
|
complicated.</p></div><div id="A280"><h3>5.2.3. Gene clusters and genomic neighborhoods</h3><p>As already mentioned in <a href="/books/n/sef/A22/?report=reader">Chapter 2</a>,
|
|
comparisons of complete bacterial genomes have revealed the lack of large-scale
|
|
conservation of the gene order even between relatively close species, such as
|
|
<i>E. coli</i> and <i>H. influenzae</i> [<a href="/books/n/sef/A727/?report=reader#A1323">595</a>,<a href="/books/n/sef/A727/?report=reader#A1557">829</a>] or <i>E. coli</i> and <i>P. aeruginosa</i>
|
|
(<a href="/books/n/sef/A22/?report=reader#A1679">Figure 2.6B</a>). Although these pairs
|
|
of genomes have numerous similar strings of adjacent genes (most of them
|
|
predicted operons), comparisons of more distantly related bacterial and archaeal
|
|
genomes have shown that, at large phylogenetic distances, even most of the
|
|
operons are extensively rearranged [<a href="/books/n/sef/A727/?report=reader#A1189">461</a>,<a href="/books/n/sef/A727/?report=reader#A1612">884</a>]. The few operons
|
|
that are conserved across distantly related genomes typically encode physically
|
|
interacting proteins, such as ribosomal proteins or subunits of the H<sup>+</sup>-ATPase and
|
|
ABC-type transporter complexes [<a href="/books/n/sef/A727/?report=reader#A897">169</a>,<a href="/books/n/sef/A727/?report=reader#A1113">385</a>,<a href="/books/n/sef/A727/?report=reader#A1189">461</a>,<a href="/books/n/sef/A727/?report=reader#A1323">595</a>].</p><p>It should be noted that only a relatively small number of operons have been
|
|
identified experimentally, primarily in well-characterized bacteria, such as
|
|
<i>E. coli</i> and <i>B. subtilis</i> [<a href="/books/n/sef/A727/?report=reader#A1091">363</a>,<a href="/books/n/sef/A727/?report=reader#A1460">732</a>]. However, analysis of gene strings that are conserved in
|
|
bacterial and archaeal genomes strongly suggested that the great majority of them
|
|
do form operons [<a href="/books/n/sef/A727/?report=reader#A1644">916</a>]. This conclusion
|
|
was based on the following principal arguments: (i) as shown by Monte Carlo
|
|
simulations, the likelihood that identical strings of more than two genes are
|
|
found by chance in more than two genomes is extremely low; (ii) most of those
|
|
conserved strings that include characterized genes either are known operons or
|
|
include functionally linked genes and can be predicted to form operons; (iii)
|
|
typical conserved gene strings include 2 to 4 genes, which is the characteristic
|
|
size of operons; (iv) conserved gene strings that include genes from adjacent,
|
|
independent operons are extremely rare; (v) nearly all conserved gene strings
|
|
consist of genes that are transcribed in the same direction [<a href="/books/n/sef/A727/?report=reader#A1644">916</a>]. As a result, one can usually assume
|
|
that conserved gene strings are co-regulated, i.e. form operons, even if they
|
|
contain additional promoters.</p><p>Pairwise genome comparisons showed that, on average, ~10% of the genes
|
|
in each genome belong to gene strings that are conserved in at least one of the
|
|
other available genomes [<a href="/books/n/sef/A727/?report=reader#A1113">385</a>,<a href="/books/n/sef/A727/?report=reader#A1644">916</a>]. These numbers vary widely from
|
|
<5% for the cyanobacterium <i>Synechocystis</i> sp.
|
|
to 23–24% in <i>T. maritima</i> and <i>M.
|
|
genitalium</i>; the fraction of genes that belonged to predicted
|
|
operons in the archaeal genomes was only slightly lower than that in bacterial
|
|
genomes [<a href="/books/n/sef/A727/?report=reader#A1644">916</a>].</p><p>These observations indicate that conserved gene strings are under stabilizing
|
|
selection that prevents their disruption. For functionally related genes (e.g.
|
|
those encoding proteins that function in the same pathway or multimeric
|
|
complex), this selective pressure probably comes from the necessity to
|
|
synchronize their expression. This conclusion holds even in the face of the
|
|
“selfish operon” hypothesis, which posits that operons
|
|
survive during evolution <b>
|
|
<i>because</i>
|
|
</b> they are disseminated via HGT [<a href="/books/n/sef/A727/?report=reader#A1222">494</a>,<a href="/books/n/sef/A727/?report=reader#A1223">495</a>]. We believe that
|
|
the selfish operon hypothesis puts the cart before the horse: operons
|
|
certainly do spread via HGT, but their transfer leads to fixation more often
|
|
than transfer of individual genes because of the selective advantage conferred
|
|
to the recipient by the acquired operon. In contrast, for functionally unrelated
|
|
genes, there would be no selection towards coexpression. Therefore, an
|
|
observation of similar operons found in phylogenetically distant species can be
|
|
considered an indication of a potential functional relationship between the
|
|
corresponding genes, even if these genes are scattered in other genomes. Because
|
|
of the simplicity and elegance of this approach to functional analysis of
|
|
complete genomes, there are several web sites that offer slightly different
|
|
approaches to delineation of the conserved gene strings.</p><div id="A281"><h4>WIT/ERGO</h4><p>The operon comparison tool in the WIT database (<a href="http://wit.mcs.anl.gov" ref="pagearea=body&targetsite=external&targetcat=link&targettype=uri">http://wit.mcs.anl.gov</a>),
|
|
the first of the genome context-based tools, was developed by Ross Overbeek
|
|
in 1998 [<a href="/books/n/sef/A727/?report=reader#A1368">640</a>,<a href="/books/n/sef/A727/?report=reader#A1369">641</a>]. This tool identifies conserved gene strings by
|
|
searching for pairs of homologous proteins that are encoded by genes located
|
|
no more than 300 bp apart on the same DNA strand in each of the analyzed
|
|
genomes. Each of these pairs is then assigned a score based on the
|
|
evolutionary distance between the respective species on the rRNA-based
|
|
phylogenetic tree. It is expected that chance occurrence of pairs of
|
|
homologous genes in distantly related species is less likely than in closely
|
|
related ones, so such pairs are more likely to be functionally relevant.
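</p><p>The adjacency criterion described above can be sketched in a few lines of Python. This is only an illustration under stated assumptions, not the WIT implementation: the <code>Gene</code> record, both function names, and the toy coordinates are hypothetical; the ortholog table is taken as given; only neighboring genes are compared; and the phylogenetic-distance score is omitted.</p>

```python
from collections import namedtuple

# Hypothetical gene record: name, start, end (bp), strand ("+" or "-").
Gene = namedtuple("Gene", "name start end strand")


def close_same_strand_pairs(genes, max_gap=300):
    """Return neighboring gene pairs on the same strand that are
    separated by no more than max_gap bp."""
    ordered = sorted(genes, key=lambda g: g.start)
    pairs = []
    for a, b in zip(ordered, ordered[1:]):
        if a.strand == b.strand and b.start - a.end <= max_gap:
            pairs.append((a.name, b.name))
    return pairs


def conserved_pairs(genome1, genome2, orthologs, max_gap=300):
    """Return pairs of close genes in genome1 whose orthologs also form
    a close same-strand pair in genome2.

    orthologs maps gene names in genome1 to gene names in genome2.
    """
    pairs2 = {frozenset(p) for p in close_same_strand_pairs(genome2, max_gap)}
    result = []
    for a, b in close_same_strand_pairs(genome1, max_gap):
        if a in orthologs and b in orthologs:
            partner = (orthologs[a], orthologs[b])
            if frozenset(partner) in pairs2:
                result.append(((a, b), partner))
    return result


# Toy data, invented purely for illustration.
genome_a = [
    Gene("a1", 100, 900, "+"),
    Gene("a2", 950, 1800, "+"),   # 50 bp after a1, same strand
    Gene("a3", 2500, 3300, "+"),  # 700 bp after a2: too far
    Gene("a4", 3350, 4000, "-"),  # close to a3 but on the other strand
]
genome_b = [
    Gene("b2", 10, 800, "-"),
    Gene("b1", 820, 1500, "-"),   # 20 bp after b2, same strand
    Gene("b3", 5000, 5600, "+"),
]
orthologs = {"a1": "b1", "a2": "b2", "a3": "b3"}

print(conserved_pairs(genome_a, genome_b, orthologs))
```

<p>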
|
|
Homologous genes are defined as bidirectional best hits in all-against-all
|
|
BLAST comparisons, which is similar to the method used in constructing the
|
|
COG database [<a href="/books/n/sef/A727/?report=reader#A1556">828</a>].</p><p>Because the number of potential gene linkages grows exponentially with the
|
|
number of the analyzed genomes [<a href="/books/n/sef/A727/?report=reader#A1368">640</a>], the sensitivity of methods based on the detection of conserved
|
|
gene strings can be significantly improved by taking into consideration even
|
|
unfinished genome sequences. For this reason, the WIT and ERGO databases include
|
|
many incomplete genome sequences from the DOE Joint Genome Institute and
|
|
other sequencing centers. This approach was used in the successful
|
|
reconstruction of several known metabolic pathways and led to the correct
|
|
prediction of candidate genes for some previously uncharacterized metabolic
|
|
enzymes [<a href="/books/n/sef/A727/?report=reader#A810">82</a>,<a href="/books/n/sef/A727/?report=reader#A899">171</a>,<a href="/books/n/sef/A727/?report=reader#A1369">641</a>].
|
|
Unfortunately, while this book was in preparation, the ERGO database was
|
|
closed to the public, and WIT was still missing some of the useful
|
|
functionality. We will therefore illustrate the use of the method by
|
|
exploiting a somewhat similar tool in the COG database.</p></div><div id="A282"><h4>COGs</h4><p>The COG database (<a href="/COG" ref="pagearea=body&targetsite=external&targetcat=link&targettype=uri">http://www.ncbi.nlm.nih.gov/COG</a>) allows a simple and
|
|
straightforward search for conserved operons. Because all proteins in the
|
|
same COG are presumed to be orthologs, the “Genome
|
|
context” view, available from each COG page, shows the genes that
|
|
encode members of the given COG together with the surrounding genes. Genes
|
|
whose products belong to the same COG are identically colored. This provides
|
|
for easy identification of sets of COGs that tend to be clustered in
|
|
genomes. Of course, this tool only works for the genes whose products belong
|
|
to COGs, so the relationships between genes that are found in only two
|
|
complete genomes and hence do not belong to any COG would be missed. An
|
|
exhaustive matching of the co-localization of genes encoding members of the
|
|
same two COGs allowed new functional predictions for almost 90 COGs, which
|
|
comprised ~4% of the total set [<a href="/books/n/sef/A727/?report=reader#A1197">469</a>,<a href="/books/n/sef/A727/?report=reader#A1644">916</a>].</p><p>For a practical example of the use of this method, let us consider the search
|
|
for the archaeal shikimate kinase, an enzyme that is not homologous to the
|
|
bacterial shikimate kinase (AroK) and hence was not found by traditional
|
|
sequence similarity searches [<a href="/books/n/sef/A727/?report=reader#A899">171</a>].
|
|
Reconstruction of the aromatic amino acid biosynthesis pathway in archaea
|
|
showed that genomes of <i>A. fulgidus</i>, <i>M.
|
|
jannaschii</i>, and <i>M. thermoautotrophicum</i> encoded
|
|
orthologs of bacterial enzymes for all but three reactions of this pathway
|
|
([<a href="/books/n/sef/A727/?report=reader#A1268">540</a>], see <a href="/books/n/sef/A371/?report=reader#A452">Figure 7.6</a>).</p><p>Two of these missing enzymes catalyze the first and second reactions of the
|
|
pathway, indicating that aromatic amino acid biosynthesis in (most) archaea uses
|
|
different precursors than in bacteria, whereas the third reaction,
|
|
phosphorylation of shikimate, was attributed to a non-orthologous kinase,
|
|
encoded only in archaea [<a href="/books/n/sef/A727/?report=reader#A1268">540</a>].
|
|
Daugherty and coworkers made a list of the genes involved in aromatic amino
|
|
acid biosynthesis in archaea and looked for potential neighbors of the
|
|
<i>aroE</i> gene whose product, shikimate dehydrogenase,
|
|
catalyzes the reaction immediately preceding the phosphorylation of
|
|
shikimate (<a href="/books/n/sef/A371/?report=reader#A452">Figure 7.6</a>). In the <i>P.
|
|
abyssi</i> genome, the <i>aroE</i> gene (PAB0300) was
|
|
followed by an uncharacterized gene (PAB0301) encoding a predicted kinase,
|
|
which is distantly related to homoserine kinases. This was also the case in
|
|
the <i>A. pernix</i> and <i>T. acidophilum</i> genomes,
|
|
where the PAB0301-like gene (COG1685, <a class="figpopup" href="/books/NBK20253/figure/A283/?report=objectonly" target="object" rid-figpopup="figA283" rid-ob="figobA283">Figure
|
|
5.4</a>) was found sandwiched between the <i>aroE</i> gene
|
|
and the <i>aroA</i> gene, whose product catalyzes the next step of
|
|
the pathway after shikimate phosphorylation [<a href="/books/n/sef/A727/?report=reader#A899">171</a>]. Genes encoding PAB0301 orthologs (COG1685) were
|
|
also found in other archaeal genomes, but not in any of the bacterial
|
|
genomes that contain the typical <i>aroK</i> gene (<a class="figpopup" href="/books/NBK20253/figure/A283/?report=objectonly" target="object" rid-figpopup="figA283" rid-ob="figobA283">Figure 5.4</a>). Given this connection,
|
|
Daugherty et al. expressed MJ1440, the COG1685 member from <i>M.
|
|
jannaschii</i>, and demonstrated that it indeed had shikimate kinase
|
|
activity [<a href="/books/n/sef/A727/?report=reader#A899">171</a>].
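</p><p>The kind of neighbor search used in this example can be sketched as follows. This is a minimal illustration, not the code behind the COG "Genome context" view: the function name, the window size, the family labels, and the gene orders are all hypothetical. Each genome is represented as an ordered list of family assignments, with <code>None</code> standing for a gene that belongs to no family.</p>

```python
from collections import Counter


def neighbor_counts(genomes, query, window=2):
    """For each family, count the genomes in which at least one member
    lies within `window` genes of a member of the `query` family."""
    counts = Counter()
    for order in genomes.values():
        seen = set()
        for i, fam in enumerate(order):
            if fam != query:
                continue
            lo, hi = max(0, i - window), i + window + 1
            seen.update(f for f in order[lo:hi]
                        if f is not None and f != query)
        # Each family is counted at most once per genome.
        counts.update(seen)
    return counts


# Invented gene orders; the labels do not correspond to real COG data.
toy = {
    "genome1": ["famP", "aroE_like", "famX", "aroA_like", "famQ"],
    "genome2": ["aroE_like", "famX", "famR", None, "famS"],
    "genome3": ["famT", "famX", "aroE_like", "famU"],
    "genome4": ["aroE_like", "famV", "famW"],
}

counts = neighbor_counts(toy, "aroE_like", window=1)
print(counts.most_common())
```

<p>In this toy input, the uncharacterized family <code>famX</code> lies next to the query in three of the four genomes, which is the kind of recurrent association that would single it out for closer examination.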
|
|
|
|
|
|
</p><div class="iconblock whole_rhythm clearfix ten_col fig" id="figA283" co-legend-rid="figlgndA283"><a href="/books/NBK20253/figure/A283/?report=objectonly" target="object" title="Figure 5.4" class="img_link icnblk_img figpopup" rid-figpopup="figA283" rid-ob="figobA283"><img class="small-thumb" src="/books/NBK20253/bin/ch5f4.gif" src-large="/books/NBK20253/bin/ch5f4.jpg" alt="Figure 5.4. Genome context of COG1685 “Archaeal shikimate kinase”." /></a><div class="icnblk_cntnt" id="figlgndA283"><h4 id="A283"><a href="/books/NBK20253/figure/A283/?report=objectonly" target="object" rid-ob="figobA283">Figure 5.4</a></h4><p class="float-caption no_bottom_margin">Genome context of COG1685 “Archaeal shikimate
|
|
kinase”. Each line corresponds to an individual genome: aful,
|
|
<i>Archaeoglobus fulgidus</i>; hbsp,
|
|
<i>Halobacterium</i> sp.; mjan,
|
|
<i>Methanococcus jannaschii</i>; mthe,
|
|
<i>Methanobacterium thermoautotrophicum</i>; pyro,
|
|
<a href="/books/NBK20253/figure/A283/?report=objectonly" target="object" rid-ob="figobA283">(more...)</a></p></div></div></div><div id="A284"><h4>STRING</h4><p>The Search Tool for Recurring Instances of Neighbouring Genes (STRING,
|
|
<a href="http://www.bork.embl-heidelberg.de/STRING" ref="pagearea=body&targetsite=external&targetcat=link&targettype=uri">http://www.bork.embl-heidelberg.de/STRING</a>), developed by
|
|
Peer Bork and colleagues, is based on a similar approach [<a href="/books/n/sef/A727/?report=reader#A1516">788</a>]. Gene clusters are defined by
|
|
STRING the same way as in WIT, namely as strings of genes on the same strand
|
|
located no more than 300 bp from each other. Orthologs are identified as
|
|
bidirectional best hits using Smith-Waterman comparisons. The STRING search
|
|
starts from a single protein sequence that can be entered as a FASTA file or
|
|
just by its gene name in the complete genome. The sequence entered in FASTA
|
|
format is compared against the database of all proteins encoded in complete
|
|
genomes so that the user can choose one of the best hits for further
|
|
examination. Like COGs, STRING contains information only on completely
|
|
sequenced genomes. The default option in STRING further reduces the number
|
|
of analyzed genomes by eliminating closely related ones (this option can be
|
|
switched off by the user). Additionally, STRING features a useful tool that
|
|
allows the user to perform an “iterative” analysis of
|
|
gene neighborhoods. After the nearest neighbors of a gene in question are
|
|
identified, the next “iteration” of STRING would look
|
|
for their neighbors and record if any of these were found previously. If no
|
|
new neighbors are found, STRING reports that the search has
|
|
“converged”. If this does not happen even after five
|
|
consecutive search cycles, the program simply tabulates how many times each
|
|
particular gene was found in the output. Combined with impressive graphics,
|
|
this approach makes STRING a fast and convenient tool to search for
|
|
consistent gene associations in complete genomes.</p></div><div id="A285"><h4>SNAPper</h4><p>The SNAP (Similarity-Neighbourhood APproach) tool at MIPS (<a href="http://mips.gsf.de/cgi-bin/proj/snap/znapit.pl" ref="pagearea=body&targetsite=external&targetcat=link&targettype=uri">http://mips.gsf.de/cgi-bin/proj/snap/znapit.pl</a>, [<a href="/books/n/sef/A727/?report=reader#A1175">447</a>]) is similar to STRING, but
|
|
instead of precomputed pairs of orthologs, it simply looks for BLAST hits
|
|
with user-defined E-values. In addition, SNAP does not require the related
|
|
genes to form conserved gene strings; they only need to be in the vicinity
|
|
of each other. SNAPper looks for the homologs of the given protein, then
|
|
takes neighbors of the corresponding genes, looks for their homologs, and so
|
|
on [<a href="/books/n/sef/A727/?report=reader#A1175">447</a>]. The program then builds a
|
|
similarity-neighborhood graph (SN-graph), which consists of the chains of
|
|
orthologous genes in different genomes and adjacent genes in the same
|
|
genome. The hits that form a closed SN-graph, i.e. recognize the original
|
|
set of homologs, are predicted to be functionally related. The advanced
|
|
version of SNAPper offers the choice of several parameters, which allow
|
|
fine-tuning the performance of the tool depending on the particular query
|
|
protein.</p></div><div id="A286"><h4>KEGG</h4><p>In contrast to the tools described above, identification of gene strings in
|
|
the KEGG database (<a href="http://www.genome.ad.jp/kegg-bin/mk_genome_cmp_html" ref="pagearea=body&targetsite=external&targetcat=link&targettype=uri">http://www.genome.ad.jp/kegg-bin/mk_genome_cmp_html</a>) is
|
|
geared toward an analysis of the operon conservation. It allows one to find
|
|
all genes in any two selected complete genomes whose products are
|
|
sufficiently similar to each other and are separated by no more than five
|
|
genes. The user can specify the desired degree of similarity between the
|
|
proteins in terms of the minimal pairwise BLAST score (or maximal E-value),
|
|
the minimal length of the alignment, and the type of BLAST hits
|
|
(bidirectional or unidirectional hits, or just any hits with the specified
|
|
BLAST score). The user can also specify maximum allowable distances between
|
|
the genes in either organism, limiting it to any number of genes from zero
|
|
to five. This option allows one to retrieve much more distant gene pairs
|
|
than those detected by the ERGO tool. The downside of this richness is that
|
|
unless one uses fairly strict criteria for protein similarity and the
|
|
intergenic distances, he or she will end up with dozens or even hundreds of
|
|
reported gene pairs, few of which would have predictive power. Nonetheless,
|
|
a sensible use of this tool can bring some very interesting results [<a href="/books/n/sef/A727/?report=reader#A996">268</a>].</p></div><div id="A287"><h4>Genome context tools in genome annotation</h4><p>To evaluate the power of gene order-based methods for making functional
|
|
predictions, we have isolated those cases where a substantial functional
|
|
prediction did not appear possible without explicit use of gene adjacency
|
|
information [<a href="/books/n/sef/A727/?report=reader#A1644">916</a>]. In spite of the
|
|
inherent subjectivity of such assessments, the result was instructive: such
|
|
unique predictions were made for ~90 genes (more precisely, COGs) or
|
|
~4% of all COGs analyzed. Given that, as noted above,
|
|
homology-based approaches already allow functional predictions for a
|
|
majority of the genes in each sequenced prokaryotic genome, this places
|
|
gene-string analysis in the position of an important accessory methodology
|
|
in the hierarchy of genome annotation approaches. Other genome context-based
|
|
methods may also be useful but are clearly less powerful. This is, of
|
|
course, a pessimistic assessment because more subtle changes in prediction
|
|
for genes already annotated by homology-based methods were not taken into
|
|
account.</p><p>These limitations notwithstanding, some of the predictions made on the basis
|
|
of gene order conservation combined with homology information seem to be
|
|
exceptionally important. Perhaps the most straightforward case is the
|
|
prediction of the archaeal exosome, a complex of RNases, RNA-binding
|
|
proteins and helicases that mediates processing and
|
|
3′→5′ degradation of a variety of RNA species
|
|
[<a href="/books/n/sef/A727/?report=reader#A1197">469</a>]. This finding was made by
|
|
examination of archaeal genome alignments, which led to the detection of a
|
|
large superoperon, which, in its complete form, consists of 15 genes. This
|
|
full complement of co-localized genes, however, is present in only one
|
|
species, <i>M. thermoautotrophicum</i>, whereas, in all other
|
|
archaea, the superoperon is partially disrupted and, in some cases, certain
|
|
genes have been lost altogether. Remarkably, the predicted exosomal
|
|
superoperon also includes genes for proteasome subunits. According to the
|
|
logic outlined above, this points to a hitherto unknown functional and
|
|
possibly even physical association between the proteasome and the exosome,
|
|
the machines for controlled degradation of RNA and proteins,
|
|
respectively.</p><p>Gene order-based functional prediction seems to be impossible for eukaryotes
|
|
because of the apparent lack of clustering of functionally linked genes.
|
|
However, several operons that have been identified in <i>C.
|
|
elegans</i> [<a href="/books/n/sef/A727/?report=reader#A1373">645</a>,<a href="/books/n/sef/A727/?report=reader#A1622">894</a>,<a href="/books/n/sef/A727/?report=reader#A1672">944</a>] comprise the first exceptions to this rule and suggest that
|
|
gene order analysis could eventually be used for eukaryotes, too. Besides,
|
|
the above prediction of proteasome-exosome association might potentially
|
|
extend to eukaryotes, offering yet another example of the use of prokaryotic
|
|
genome comparisons for understanding the eukaryotic cell.</p><p>Given the fluidity of gene order in prokaryotes, detection of subtle
|
|
conservation patterns requires fairly sophisticated computational procedures
|
|
that search for <b>
|
|
<i>gene neighborhoods</i>
|
|
</b>, sets of genes that tend to cluster together in multiple genomes,
|
|
but do not necessarily show extensive conservation of exact gene order
|
|
[<a href="/books/n/sef/A727/?report=reader#A1175">447</a>,<a href="/books/n/sef/A727/?report=reader#A1219">491</a>,<a href="/books/n/sef/A727/?report=reader#A1368">640</a>,<a href="/books/n/sef/A727/?report=reader#A1369">641</a>,<a href="/books/n/sef/A727/?report=reader#A1437">709</a>]. One of the interesting findings
|
|
that have been made possible through these approaches is the prediction of a
|
|
new DNA repair system in archaeal and bacterial hyperthermophiles [<a href="/books/n/sef/A727/?report=reader#A1269">541</a>]. As shown in <a class="figpopup" href="/books/NBK20253/figure/A1681/?report=objectonly" target="object" rid-figpopup="figA1681" rid-ob="figobA1681">Figure 5.5</a> (see color plates), the
|
|
gene neighborhood predicted to encode this system forms a complex patchwork,
|
|
with very few conserved gene strings. However, the overall conservation of
|
|
the neighborhood is obvious (once the analysis is completed and the results
|
|
are summarized as in <a class="figpopup" href="/books/NBK20253/figure/A1681/?report=objectonly" target="object" rid-figpopup="figA1681" rid-ob="figobA1681">Figure 5.5</a>) and
|
|
statistically significant [<a href="/books/n/sef/A727/?report=reader#A1269">541</a>,<a href="/books/n/sef/A727/?report=reader#A1437">709</a>]. In an already
|
|
familiar theme, prediction of this repair system involved a combination of
|
|
genomic neighborhood detection with fairly complicated protein sequence
|
|
analysis and structure prediction. One of the notable findings was the
|
|
identification of a novel family of predicted DNA polymerases (COG1353).
|
|
Finally, this is where we encounter, once again, COG1518, the protein family
|
|
already discussed in <a href="/books/n/sef/A166/?report=reader#A233">4.5</a>. When we
|
|
first analyzed those proteins, we were inclined to predict that they were
|
|
novel enzymes, perhaps with a hydrolytic activity. Context analysis allows
|
|
us to make a much more specific prediction: these proteins most likely are
|
|
nucleases involved in DNA repair.</p><div class="iconblock whole_rhythm clearfix ten_col fig" id="figA1681" co-legend-rid="figlgndA1681"><a href="/books/NBK20253/figure/A1681/?report=objectonly" target="object" title="Figure 5.5" class="img_link icnblk_img figpopup" rid-figpopup="figA1681" rid-ob="figobA1681"><img class="small-thumb" src="/books/NBK20253/bin/ch5f5.gif" src-large="/books/NBK20253/bin/ch5f5.jpg" alt="Figure 5.5. Predicted DNA repair system in hyperthermophiles." /></a><div class="icnblk_cntnt" id="figlgndA1681"><h4 id="A1681"><a href="/books/NBK20253/figure/A1681/?report=objectonly" target="object" rid-ob="figobA1681">Figure 5.5</a></h4><p class="float-caption no_bottom_margin">Predicted DNA repair system in hyperthermophiles. The pink boxes show optimal growth temperatures for each of the analyzed species (<i>A. aeolicus, T. maritima, A. fulgidus, M. thermoautotrophicum, M. jannaschii</i>). The genes are not drawn to scale; arrows <a href="/books/NBK20253/figure/A1681/?report=objectonly" target="object" rid-ob="figobA1681">(more...)</a></p></div></div></div></div></div><div id="A288"><h2 id="_A288_">5.3. Conclusions and Outlook</h2><p>In this chapter, we discussed both traditional methods for genome annotation based on
|
|
homology detection and newer approaches united under the umbrella of genome context
|
|
analysis. We noted that, although functions can be predicted, at some level of
|
|
precision, for a substantial majority of genes in each sequenced prokaryotic genome,
|
|
current annotations are replete with inaccuracies, inconsistencies and
|
|
incompleteness. This should not be construed as any kind of implicit criticism of
|
|
those researchers who are involved in genome annotation: the task is objectively
|
|
hard and is getting progressively more difficult with the growth of databases (and
|
|
accumulation of inconsistencies). Fortunately, we believe that the remedy is already
|
|
at hand (see <a href="/books/n/sef/A55/?report=reader#A64">3.1.3</a>). Specialized databases,
|
|
designed as genome annotation tools, seem to be capable of dramatically improving
|
|
the situation, if not solving the annotation problem completely. Prototypes of such
|
|
databases already exist and function, and their extensive growth in the near future
|
|
seems assured.</p><p>The context-based methods of genome annotation are quite new: the development of
|
|
these approaches started only after multiple genome sequences became available.
|
|
These approaches have a lot of appeal because they are, indeed, true <b>
|
|
<i>genomic</i>
|
|
</b> methods based on the notion that the genome (and, especially, many compared
|
|
genomes) is much more than the sum of its parts. The results produced by these
|
|
methods are often very intuitive and even visually appealing as in gene string
|
|
analysis. Objectively, however, these methods yield considerably less information on
|
|
gene function than homology-based methods, at least for the foreseeable future.
|
|
Nevertheless, different genome context approaches substantially complement each
|
|
other and homology-based methods. In fact, homology-based and context-based methods
|
|
often produce different and complementary types of functional predictions. The
|
|
former tend to predict <b>
|
|
<i>biochemical</i>
|
|
</b> functions (activities), whereas the latter result in <b>
|
|
<i>biological</i>
|
|
</b> predictions, such as involvement of a gene in a particular cellular process
|
|
(e.g. DNA repair in the example above), even if the exact activity cannot be
|
|
predicted.</p><p>We would like to end this chapter on an upbeat note by stating, in large part on the
|
|
basis of personal experience, that genome annotation is not a routine, mundane
|
|
activity as it might seem to an outside observer. On the contrary, this is exciting
|
|
research, somewhat akin to detective work, which has the potential of teasing out
|
|
deep mysteries of life from genome sequences.</p></div><div id="A289"><h2 id="_A289_">5.4. Further Reading</h2><dl class="temp-labeled-list"><dl class="bkr_refwrap"><dt>1.</dt><dd><div class="bk_ref" id="A291">Brenner S. Errors in genome annotation. <span><span class="ref-journal">Trends in Genetics. </span>1999;<span class="ref-vol">15</span>:132–133.</span> [<a href="https://pubmed.ncbi.nlm.nih.gov/10203816" ref="pagearea=cite-ref&targetsite=entrez&targetcat=link&targettype=pubmed">PubMed<span class="bk_prnt">: 10203816</span></a>]</div></dd></dl><dl class="bkr_refwrap"><dt>2.</dt><dd><div class="bk_ref" id="A292">Galperin MY, Koonin EV. Who's your neighbor? New computational approaches for
|
|
functional genomics. <span><span class="ref-journal">Nature Biotechnology. </span>2000;<span class="ref-vol">18</span>:609–613.</span> [<a href="https://pubmed.ncbi.nlm.nih.gov/10835597" ref="pagearea=cite-ref&targetsite=entrez&targetcat=link&targettype=pubmed">PubMed<span class="bk_prnt">: 10835597</span></a>]</div></dd></dl><dl class="bkr_refwrap"><dt>3.</dt><dd><div class="bk_ref" id="A293">Huynen MA, Snel B. Gene and context: integrative approaches to genome
|
|
analysis. <span><span class="ref-journal">Advances in Protein Chemistry. </span>2000;<span class="ref-vol">54</span>:345–379.</span> [<a href="https://pubmed.ncbi.nlm.nih.gov/10829232" ref="pagearea=cite-ref&targetsite=entrez&targetcat=link&targettype=pubmed">PubMed<span class="bk_prnt">: 10829232</span></a>]</div></dd></dl><dl class="bkr_refwrap"><dt>4.</dt><dd><div class="bk_ref" id="A294">Huynen MA, Snel B, Lathe W, Bork P. Predicting protein function by genomic context:
|
|
quantitative evaluation and qualitative inferences. <span><span class="ref-journal">Genome Research. </span>2000;<span class="ref-vol">10</span>:1204–1210.</span> [<a href="/pmc/articles/PMC310926/" ref="pagearea=cite-ref&targetsite=entrez&targetcat=link&targettype=pmc">PMC free article<span class="bk_prnt">: PMC310926</span></a>] [<a href="https://pubmed.ncbi.nlm.nih.gov/10958638" ref="pagearea=cite-ref&targetsite=entrez&targetcat=link&targettype=pubmed">PubMed<span class="bk_prnt">: 10958638</span></a>]</div></dd></dl><dl class="bkr_refwrap"><dt>5.</dt><dd><div class="bk_ref" id="A295">Wolf YI, Rogozin IB, Kondrashov AS, Koonin EV. Genome alignment, evolution of prokaryotic genome
|
|
organization and prediction of gene function using genomic
|
|
context. <span><span class="ref-journal">Genome Research. </span>2001;<span class="ref-vol">11</span>:356–372.</span> [<a href="https://pubmed.ncbi.nlm.nih.gov/11230160" ref="pagearea=cite-ref&targetsite=entrez&targetcat=link&targettype=pubmed">PubMed<span class="bk_prnt">: 11230160</span></a>]</div></dd></dl><dl class="bkr_refwrap"><dt>6.</dt><dd><div class="bk_ref" id="A296">Makarova KS, Aravind L, Grishin NV, Rogozin IB, Koonin EV. A DNA repair system specific for thermophilic Archaea and
|
|
bacteria predicted by genomic context analysis. <span><span class="ref-journal">Nucleic Acids Research. </span>2002;<span class="ref-vol">30</span>:482–496.</span> [<a href="/pmc/articles/PMC99818/" ref="pagearea=cite-ref&targetsite=entrez&targetcat=link&targettype=pmc">PMC free article<span class="bk_prnt">: PMC99818</span></a>] [<a href="https://pubmed.ncbi.nlm.nih.gov/11788711" ref="pagearea=cite-ref&targetsite=entrez&targetcat=link&targettype=pubmed">PubMed<span class="bk_prnt">: 11788711</span></a>]</div></dd></dl><dl class="bkr_refwrap"><dt>7.</dt><dd><div class="bk_ref" id="A297">Ouzounis CA, Karp PD. The past,
|
|
present and future of genome-wide re-annotation. <span><span class="ref-journal">Genome
|
|
Biology. </span>2002;<span class="ref-vol">3</span>:COMMENT2001.</span> [<a href="/pmc/articles/PMC139008/" ref="pagearea=cite-ref&targetsite=entrez&targetcat=link&targettype=pmc">PMC free article<span class="bk_prnt">: PMC139008</span></a>] [<a href="https://pubmed.ncbi.nlm.nih.gov/11864365" ref="pagearea=cite-ref&targetsite=entrez&targetcat=link&targettype=pubmed">PubMed<span class="bk_prnt">: 11864365</span></a>]</div></dd></dl></dl></div><div style="display:none"><div id="figA1679"><img alt="Image ch2f6" src-large="/books/n/sef/A22/bin/ch2f6.jpg" /></div><div id="figA452"><img alt="Image ch7f6" src-large="/books/n/sef/A371/bin/ch7f6.jpg" /></div><div id="figA468"><img alt="Image ch7f7" src-large="/books/n/sef/A371/bin/ch7f7.jpg" /></div></div><div id="bk_toc_contnr"></div></div></div><div class="fm-sec"><h2 id="_NBK20253_pubdet_">Publication Details</h2><h3>Copyright</h3><div><div class="half_rhythm"><a href="/books/about/copyright/">Copyright</a> © 2003, Kluwer Academic.</div></div><h3>Publisher</h3><p><a href="http://www.springer.com/" ref="pagearea=page-banner&targetsite=external&targetcat=link&targettype=publisher">Kluwer Academic</a>, Boston</p><h3>NLM Citation</h3><p>Koonin EV, Galperin MY. Sequence - Evolution - Function: Computational Approaches in Comparative Genomics. Boston: Kluwer Academic; 2003. 
Chapter 5, Genome Annotation and Analysis.<span class="bk_cite_avail"></span></p></div><div class="small-screen-prev"><a href="/books/n/sef/A166/?report=reader"><svg xmlns="http://www.w3.org/2000/svg" viewBox="0 0 100 100" preserveAspectRatio="none"><path d="M75,30 c-80,60 -80,0 0,60 c-30,-60 -30,0 0,-60"></path><text x="20" y="28" textLength="60" style="font-size:25px">Prev</text></svg></a></div><div class="small-screen-next"><a href="/books/n/sef/A298/?report=reader"><svg xmlns="http://www.w3.org/2000/svg" viewBox="0 0 100 100" preserveAspectRatio="none"><path d="M25,30c80,60 80,0 0,60 c30,-60 30,0 0,-60"></path><text x="20" y="28" textLength="60" style="font-size:25px">Next</text></svg></a></div></article><article data-type="fig" id="figobA267"><div id="A267" class="figure bk_fig"><div class="graphic"><img data-src="/books/NBK20253/bin/ch5f1.jpg" alt="Figure 5.1. A generalized flow chart of genome annotation." /></div><h3><span class="label">Figure 5.1</span><span class="title">A generalized flow chart of genome annotation</span></h3><div class="caption"><p>FB: feedback from gene identification for correction of sequencing
errors, primarily frameshifts. General database search: searching
sequence databases (typically, NCBI NR) for sequence similarity,
usually using BLAST. Specialized database search: searching domain
databases, such as Pfam, SMART, and CDD, for conserved domains,
genome-oriented databases, such as COGs, for identification of
orthologous relationships and refined functional prediction,
metabolic databases, such as KEGG, for metabolic pathway
reconstruction, and, possibly, other database searches. Statistical
gene prediction: use of methods like GeneMark or Glimmer to predict
protein-coding genes. Prediction of structural features: prediction
of signal peptides, transmembrane segments, coiled-coil domains and other
features in putative protein functions.</p></div></div></article><article data-type="fig" id="figobA1680"><div id="A1680" class="figure bk_fig"><div class="graphic"><img data-src="/books/NBK20253/bin/ch5f2.jpg" alt="Figure 5.2. Protocol of genome annotation using the COG database." /></div><h3><span class="label">Figure 5.2</span><span class="title">Protocol of genome annotation using the COG database</span></h3></div></article><article data-type="table-wrap" id="figobA268"><div id="A268" class="table"><h3><span class="label">Table 5.1</span><span class="title">Microbial genome annotation 2001</span></h3><p class="large-table-link" style="display:none"><span class="right"><a href="/books/NBK20253/table/A268/?report=objectonly" target="object">View in own window</a></span></p><div class="large_tbl" id="__A268_lrgtbl__"><table class="no_margin"><thead><tr><th id="hd_h_A268_1_1_1_1" rowspan="1" colspan="1" style="text-align:left;vertical-align:bottom;">
Species
</th><th id="hd_h_A268_1_1_1_2" rowspan="1" colspan="1" style="text-align:left;vertical-align:bottom;">
Total no. of genes<sup>a</sup>
</th><th id="hd_h_A268_1_1_1_3" rowspan="1" colspan="1" style="text-align:center;vertical-align:top;">
Genes with assigned function
</th><th id="hd_h_A268_1_1_1_4" rowspan="1" colspan="1" style="text-align:left;vertical-align:bottom;">
“Conserved hypothetical” proteins
</th><th id="hd_h_A268_1_1_1_5" rowspan="1" colspan="1" style="text-align:center;vertical-align:bottom;">
“Hypothetical” proteins
</th><th id="hd_h_A268_1_1_1_6" rowspan="1" colspan="1" style="text-align:center;vertical-align:bottom;">
Assigned to COGs
</th><th id="hd_h_A268_1_1_1_7" rowspan="1" colspan="1" style="text-align:center;vertical-align:bottom;">
Ref.
</th></tr></thead><tbody><tr><td headers="hd_h_A268_1_1_1_1" rowspan="1" colspan="1" style="text-align:left;vertical-align:top;">
<i>Agrobacterium tumefaciens</i>
</td><td headers="hd_h_A268_1_1_1_2" rowspan="1" colspan="1" style="text-align:center;vertical-align:top;">5,419</td><td headers="hd_h_A268_1_1_1_3" rowspan="1" colspan="1" style="text-align:center;vertical-align:top;">3,475 (64%)</td><td headers="hd_h_A268_1_1_1_4" rowspan="1" colspan="1" style="text-align:center;vertical-align:top;">1,236 (22%)</td><td headers="hd_h_A268_1_1_1_5" rowspan="1" colspan="1" style="text-align:center;vertical-align:top;">708 (13%)</td><td headers="hd_h_A268_1_1_1_6" rowspan="1" colspan="1" style="text-align:center;vertical-align:top;">4,490 (83%)</td><td headers="hd_h_A268_1_1_1_7" rowspan="1" colspan="1" style="text-align:center;vertical-align:top;">[<a href="/books/n/sef/A727/?report=reader#A1645">917</a>]</td></tr><tr><td headers="hd_h_A268_1_1_1_1" rowspan="1" colspan="1" style="text-align:left;vertical-align:top;">
<i>Caulobacter crescentus</i>
</td><td headers="hd_h_A268_1_1_1_2" rowspan="1" colspan="1" style="text-align:center;vertical-align:top;">3,737</td><td headers="hd_h_A268_1_1_1_3" rowspan="1" colspan="1" style="text-align:center;vertical-align:top;">2,030 (54%)</td><td headers="hd_h_A268_1_1_1_4" rowspan="1" colspan="1" style="text-align:center;vertical-align:top;">725 (19%)</td><td headers="hd_h_A268_1_1_1_5" rowspan="1" colspan="1" style="text-align:center;vertical-align:top;">1,012 (27%)</td><td headers="hd_h_A268_1_1_1_6" rowspan="1" colspan="1" style="text-align:center;vertical-align:top;">3,514 (93%)</td><td headers="hd_h_A268_1_1_1_7" rowspan="1" colspan="1" style="text-align:center;vertical-align:top;">[<a href="/books/n/sef/A727/?report=reader#A1346">618</a>]</td></tr><tr><td headers="hd_h_A268_1_1_1_1" rowspan="1" colspan="1" style="text-align:left;vertical-align:top;">
<i>Clostridium acetobutylicum</i>
</td><td headers="hd_h_A268_1_1_1_2" rowspan="1" colspan="1" style="text-align:center;vertical-align:top;">3,672</td><td headers="hd_h_A268_1_1_1_3" rowspan="1" colspan="1" style="text-align:center;vertical-align:top;">2,888 (79%)</td><td headers="hd_h_A268_1_1_1_4" rowspan="1" colspan="1" style="text-align:center;vertical-align:top;">187 (5%)</td><td headers="hd_h_A268_1_1_1_5" rowspan="1" colspan="1" style="text-align:center;vertical-align:top;">597 (16%)</td><td headers="hd_h_A268_1_1_1_6" rowspan="1" colspan="1" style="text-align:center;vertical-align:top;">2,941 (80%)</td><td headers="hd_h_A268_1_1_1_7" rowspan="1" colspan="1" style="text-align:center;vertical-align:top;">[<a href="/books/n/sef/A727/?report=reader#A1350">622</a>]</td></tr><tr><td headers="hd_h_A268_1_1_1_1" rowspan="1" colspan="1" style="text-align:left;vertical-align:top;">
<i>Lactococcus lactis</i>
</td><td headers="hd_h_A268_1_1_1_2" rowspan="1" colspan="1" style="text-align:center;vertical-align:top;">2,310</td><td headers="hd_h_A268_1_1_1_3" rowspan="1" colspan="1" style="text-align:center;vertical-align:top;">1,482 (64%)</td><td headers="hd_h_A268_1_1_1_4" rowspan="1" colspan="1" style="text-align:center;vertical-align:top;">465 (20%)</td><td headers="hd_h_A268_1_1_1_5" rowspan="1" colspan="1" style="text-align:center;vertical-align:top;">363 (16%)</td><td headers="hd_h_A268_1_1_1_6" rowspan="1" colspan="1" style="text-align:center;vertical-align:top;">1,849 (80%)</td><td headers="hd_h_A268_1_1_1_7" rowspan="1" colspan="1" style="text-align:center;vertical-align:top;">[<a href="/books/n/sef/A727/?report=reader#A825">97</a>]</td></tr><tr><td headers="hd_h_A268_1_1_1_1" rowspan="1" colspan="1" style="text-align:left;vertical-align:top;">
<i>Listeria innocua</i>
</td><td headers="hd_h_A268_1_1_1_2" rowspan="1" colspan="1" style="text-align:center;vertical-align:top;">3,052</td><td headers="hd_h_A268_1_1_1_3" rowspan="1" colspan="1" style="text-align:center;vertical-align:top;">1,920 (63%)</td><td headers="hd_h_A268_1_1_1_4" rowspan="1" colspan="1" style="text-align:center;vertical-align:top;">757 (25%)</td><td headers="hd_h_A268_1_1_1_5" rowspan="1" colspan="1" style="text-align:center;vertical-align:top;">375 (12%)</td><td headers="hd_h_A268_1_1_1_6" rowspan="1" colspan="1" style="text-align:center;vertical-align:top;">2,444 (80%)</td><td headers="hd_h_A268_1_1_1_7" rowspan="1" colspan="1" style="text-align:center;vertical-align:top;">[<a href="/books/n/sef/A727/?report=reader#A1014">286</a>]</td></tr><tr><td headers="hd_h_A268_1_1_1_1" rowspan="1" colspan="1" style="text-align:left;vertical-align:top;">
<i>Mycobacterium leprae</i><sup>b</sup>
</td><td headers="hd_h_A268_1_1_1_2" rowspan="1" colspan="1" style="text-align:center;vertical-align:top;">2,720</td><td headers="hd_h_A268_1_1_1_3" rowspan="1" colspan="1" style="text-align:center;vertical-align:top;">1,802 (66%)</td><td headers="hd_h_A268_1_1_1_4" rowspan="1" colspan="1" style="text-align:center;vertical-align:top;">776 (29%)</td><td headers="hd_h_A268_1_1_1_5" rowspan="1" colspan="1" style="text-align:center;vertical-align:top;">142 (5%)</td><td headers="hd_h_A268_1_1_1_6" rowspan="1" colspan="1" style="text-align:center;vertical-align:top;">1,231 (45%)</td><td headers="hd_h_A268_1_1_1_7" rowspan="1" colspan="1" style="text-align:center;vertical-align:top;">[<a href="/books/n/sef/A727/?report=reader#A881">153</a>]</td></tr><tr><td headers="hd_h_A268_1_1_1_1" rowspan="1" colspan="1" style="text-align:left;vertical-align:top;">
<i>Nostoc</i> (<i>Anabaena</i>) sp.
PCC7120</td><td headers="hd_h_A268_1_1_1_2" rowspan="1" colspan="1" style="text-align:center;vertical-align:top;">5,368</td><td headers="hd_h_A268_1_1_1_3" rowspan="1" colspan="1" style="text-align:center;vertical-align:top;">45%</td><td headers="hd_h_A268_1_1_1_4" rowspan="1" colspan="1" style="text-align:center;vertical-align:top;">27%</td><td headers="hd_h_A268_1_1_1_5" rowspan="1" colspan="1" style="text-align:center;vertical-align:top;">28%</td><td headers="hd_h_A268_1_1_1_6" rowspan="1" colspan="1" style="text-align:center;vertical-align:top;">4,002 (75%)</td><td headers="hd_h_A268_1_1_1_7" rowspan="1" colspan="1" style="text-align:center;vertical-align:top;">[<a href="/books/n/sef/A727/?report=reader#A1144">416</a>]</td></tr><tr><td headers="hd_h_A268_1_1_1_1" rowspan="1" colspan="1" style="text-align:left;vertical-align:top;">
<i>Pasteurella multocida</i>
</td><td headers="hd_h_A268_1_1_1_2" rowspan="1" colspan="1" style="text-align:center;vertical-align:top;">2,014</td><td headers="hd_h_A268_1_1_1_3" rowspan="1" colspan="1" style="text-align:center;vertical-align:top;">1,814 (64%)</td><td headers="hd_h_A268_1_1_1_4" rowspan="1" colspan="1" style="text-align:center;vertical-align:top;">531 (26%)</td><td headers="hd_h_A268_1_1_1_5" rowspan="1" colspan="1" style="text-align:center;vertical-align:top;">200 (10%)</td><td headers="hd_h_A268_1_1_1_6" rowspan="1" colspan="1" style="text-align:center;vertical-align:top;">1,881 (93%)</td><td headers="hd_h_A268_1_1_1_7" rowspan="1" colspan="1" style="text-align:center;vertical-align:top;">[<a href="/books/n/sef/A727/?report=reader#A1282">554</a>]</td></tr><tr><td headers="hd_h_A268_1_1_1_1" rowspan="1" colspan="1" style="text-align:left;vertical-align:top;">
<i>Sinorhizobium meliloti</i>
</td><td headers="hd_h_A268_1_1_1_2" rowspan="1" colspan="1" style="text-align:center;vertical-align:top;">6,204</td><td headers="hd_h_A268_1_1_1_3" rowspan="1" colspan="1" style="text-align:center;vertical-align:top;">3,704 (60%)</td><td headers="hd_h_A268_1_1_1_4" rowspan="1" colspan="1" style="text-align:center;vertical-align:top;">1,991 (32%)</td><td headers="hd_h_A268_1_1_1_5" rowspan="1" colspan="1" style="text-align:center;vertical-align:top;">509 (8%)</td><td headers="hd_h_A268_1_1_1_6" rowspan="1" colspan="1" style="text-align:center;vertical-align:top;">5,298 (85%)</td><td headers="hd_h_A268_1_1_1_7" rowspan="1" colspan="1" style="text-align:center;vertical-align:top;">[<a href="/books/n/sef/A727/?report=reader#A983">255</a>]</td></tr><tr><td headers="hd_h_A268_1_1_1_1" rowspan="1" colspan="1" style="text-align:left;vertical-align:top;">
<i>Staphylococcus aureus</i>
</td><td headers="hd_h_A268_1_1_1_2" rowspan="1" colspan="1" style="text-align:center;vertical-align:top;">2,595</td><td headers="hd_h_A268_1_1_1_3" rowspan="1" colspan="1" style="text-align:center;vertical-align:top;">63%</td><td headers="hd_h_A268_1_1_1_4" rowspan="1" colspan="1" style="text-align:center;vertical-align:top;">23%</td><td headers="hd_h_A268_1_1_1_5" rowspan="1" colspan="1" style="text-align:center;vertical-align:top;">14%</td><td headers="hd_h_A268_1_1_1_6" rowspan="1" colspan="1" style="text-align:center;vertical-align:top;">2,126 (82%)</td><td headers="hd_h_A268_1_1_1_7" rowspan="1" colspan="1" style="text-align:center;vertical-align:top;">[<a href="/books/n/sef/A727/?report=reader#A1209">481</a>]</td></tr><tr><td headers="hd_h_A268_1_1_1_1" rowspan="1" colspan="1" style="text-align:left;vertical-align:top;">
<i>Streptococcus pyogenes</i>
</td><td headers="hd_h_A268_1_1_1_2" rowspan="1" colspan="1" style="text-align:center;vertical-align:top;">1,752</td><td headers="hd_h_A268_1_1_1_3" rowspan="1" colspan="1" style="text-align:center;vertical-align:top;">1,137 (65%)</td><td headers="hd_h_A268_1_1_1_4" rowspan="1" colspan="1" style="text-align:center;vertical-align:top;">145 (8.2%)</td><td headers="hd_h_A268_1_1_1_5" rowspan="1" colspan="1" style="text-align:center;vertical-align:top;">470 (27%)</td><td headers="hd_h_A268_1_1_1_6" rowspan="1" colspan="1" style="text-align:center;vertical-align:top;">1,390 (79%)</td><td headers="hd_h_A268_1_1_1_7" rowspan="1" colspan="1" style="text-align:center;vertical-align:top;">[<a href="/books/n/sef/A727/?report=reader#A951">223</a>]</td></tr><tr><td headers="hd_h_A268_1_1_1_1" rowspan="1" colspan="1" style="text-align:left;vertical-align:top;">
<i>Sulfolobus solfataricus</i>
</td><td headers="hd_h_A268_1_1_1_2" rowspan="1" colspan="1" style="text-align:center;vertical-align:top;">2,977</td><td headers="hd_h_A268_1_1_1_3" rowspan="1" colspan="1" style="text-align:center;vertical-align:top;">1,624 (57%)</td><td headers="hd_h_A268_1_1_1_4" rowspan="1" colspan="1" style="text-align:center;vertical-align:top;">619 (21%)</td><td headers="hd_h_A268_1_1_1_5" rowspan="1" colspan="1" style="text-align:center;vertical-align:top;">734 (25%)</td><td headers="hd_h_A268_1_1_1_6" rowspan="1" colspan="1" style="text-align:center;vertical-align:top;">1,910 (64%)</td><td headers="hd_h_A268_1_1_1_7" rowspan="1" colspan="1" style="text-align:center;vertical-align:top;">[<a href="/books/n/sef/A727/?report=reader#A1492">764</a>]</td></tr><tr><td headers="hd_h_A268_1_1_1_1" rowspan="1" colspan="1" style="text-align:left;vertical-align:top;">
<i>Sulfolobus tokodaii</i>
</td><td headers="hd_h_A268_1_1_1_2" rowspan="1" colspan="1" style="text-align:center;vertical-align:top;">2,826</td><td headers="hd_h_A268_1_1_1_3" rowspan="1" colspan="1" style="text-align:center;vertical-align:top;">14 (0.5%)</td><td headers="hd_h_A268_1_1_1_4" rowspan="1" colspan="1" style="text-align:center;vertical-align:top;">920 (33%)</td><td headers="hd_h_A268_1_1_1_5" rowspan="1" colspan="1" style="text-align:center;vertical-align:top;">1,892 (67%)</td><td headers="hd_h_A268_1_1_1_6" rowspan="1" colspan="1" style="text-align:center;vertical-align:top;">1,778 (63%)</td><td headers="hd_h_A268_1_1_1_7" rowspan="1" colspan="1" style="text-align:center;vertical-align:top;">[<a href="/books/n/sef/A727/?report=reader#A1154">426</a>]</td></tr><tr><td headers="hd_h_A268_1_1_1_1" rowspan="1" colspan="1" style="text-align:left;vertical-align:top;">
<i>Yersinia pestis</i>
</td><td headers="hd_h_A268_1_1_1_2" rowspan="1" colspan="1" style="text-align:center;vertical-align:top;">4,012</td><td headers="hd_h_A268_1_1_1_3" rowspan="1" colspan="1" style="text-align:center;vertical-align:top;">76%</td><td headers="hd_h_A268_1_1_1_4" rowspan="1" colspan="1" style="text-align:center;vertical-align:top;">13%</td><td headers="hd_h_A268_1_1_1_5" rowspan="1" colspan="1" style="text-align:center;vertical-align:top;">9%</td><td headers="hd_h_A268_1_1_1_6" rowspan="1" colspan="1" style="text-align:center;vertical-align:top;">3,669 (91%)</td><td headers="hd_h_A268_1_1_1_7" rowspan="1" colspan="1" style="text-align:center;vertical-align:top;">[<a href="/books/n/sef/A727/?report=reader#A1384">656</a>]</td></tr></tbody></table></div><div class="tblwrap-foot"><div><dl class="temp-labeled-list small"><dl class="bkr_refwrap"><dt>a</dt><dd><div id="N0x1cf9150N0x39b9008"><p class="no_margin"> In contrast to <a href="/books/n/sef/A4/?report=reader#A11">Table 1.4</a>,
the total gene numbers, as well as the numbers of genes with
assigned function, “conserved hypothetical” and
“hypothetical” genes, were taken from the
original publications.</p></div></dd></dl><dl class="bkr_refwrap"><dt>b</dt><dd><div id="N0x1cf9150N0x39b9128"><p class="no_margin"> The low fraction of <i>M. leprae</i> genes assigned to
COGs is due to the large number of pseudogenes in this genome
[<a href="/books/n/sef/A727/?report=reader#A881">153</a>].</p></div></dd></dl></dl></div></div></div></article><article data-type="table-wrap" id="figobA273"><div id="A273" class="table"><h3><span class="label">Table 5.2</span><span class="title">Different types of errors in genome annotation</span></h3><p class="large-table-link" style="display:none"><span class="right"><a href="/books/NBK20253/table/A273/?report=objectonly" target="object">View in own window</a></span></p><div class="large_tbl" id="__A273_lrgtbl__"><table class="no_top_margin"><thead><tr><th id="hd_h_A273_1_1_1_1" rowspan="1" colspan="1" style="vertical-align:top;"></th><th id="hd_h_A273_1_1_1_2" colspan="6" content-type="rowsep" rowspan="1" style="text-align:center;vertical-align:top;">
<b>Annotation</b>
<span class="hr"></span>
</th></tr><tr><th headers="hd_h_A273_1_1_1_1" id="hd_h_A273_1_1_2_1" rowspan="1" colspan="1" style="text-align:left;vertical-align:bottom;">
Protein
</th><th headers="hd_h_A273_1_1_1_2" id="hd_h_A273_1_1_2_2" rowspan="1" colspan="1" style="text-align:center;vertical-align:bottom;">
Fraser and coworkers
</th><th headers="hd_h_A273_1_1_1_2" id="hd_h_A273_1_1_2_3" rowspan="1" colspan="1" style="text-align:left;vertical-align:bottom;">
Ouzounis and coworkers
</th><th headers="hd_h_A273_1_1_1_2" id="hd_h_A273_1_1_2_4" rowspan="1" colspan="1" style="text-align:left;vertical-align:top;">
Koonin and coworkers
</th><th headers="hd_h_A273_1_1_1_2" id="hd_h_A273_1_1_2_5" rowspan="1" colspan="1" style="text-align:center;vertical-align:bottom;">
GenBank 2002
</th><th headers="hd_h_A273_1_1_1_2" id="hd_h_A273_1_1_2_6" rowspan="1" colspan="1" style="text-align:left;vertical-align:bottom;">
Conclusion 2002
</th></tr></thead><tbody><tr><td headers="hd_h_A273_1_1_1_1 hd_h_A273_1_1_2_1" rowspan="1" colspan="1" style="text-align:left;vertical-align:top;">MG085</td><td headers="hd_h_A273_1_1_1_2 hd_h_A273_1_1_2_2" rowspan="1" colspan="1" style="text-align:center;vertical-align:top;">Hydroxymethyl-glutaryl-CoA reductase
(NADPH)</td><td headers="hd_h_A273_1_1_1_2 hd_h_A273_1_1_2_3" rowspan="1" colspan="1" style="text-align:left;vertical-align:top;">NADH-ubiquinone oxidoreductase</td><td headers="hd_h_A273_1_1_1_2 hd_h_A273_1_1_2_4" rowspan="1" colspan="1" style="text-align:left;vertical-align:top;">ATP(GTP?)-utilizing enzyme</td><td headers="hd_h_A273_1_1_1_2 hd_h_A273_1_1_2_5" rowspan="1" colspan="1" style="text-align:left;vertical-align:top;">HPr (Ser) kinase, putative</td><td headers="hd_h_A273_1_1_1_2 hd_h_A273_1_1_2_6" rowspan="1" colspan="1" style="text-align:left;vertical-align:top;">HPr kinase</td></tr><tr><td headers="hd_h_A273_1_1_1_1 hd_h_A273_1_1_2_1" rowspan="1" colspan="1" style="text-align:left;vertical-align:top;">MG225</td><td headers="hd_h_A273_1_1_1_2 hd_h_A273_1_1_2_2" rowspan="1" colspan="1" style="text-align:center;vertical-align:top;">Hypothetical</td><td headers="hd_h_A273_1_1_1_2 hd_h_A273_1_1_2_3" rowspan="1" colspan="1" style="text-align:left;vertical-align:top;">Histidine permease</td><td headers="hd_h_A273_1_1_1_2 hd_h_A273_1_1_2_4" rowspan="1" colspan="1" style="text-align:left;vertical-align:top;">Amino acid permease</td><td headers="hd_h_A273_1_1_1_2 hd_h_A273_1_1_2_5" rowspan="1" colspan="1" style="text-align:left;vertical-align:top;">Hypothetical</td><td headers="hd_h_A273_1_1_1_2 hd_h_A273_1_1_2_6" rowspan="1" colspan="1" style="text-align:left;vertical-align:top;">Amino acid permease</td></tr><tr><td headers="hd_h_A273_1_1_1_1 hd_h_A273_1_1_2_1" rowspan="1" colspan="1" style="text-align:left;vertical-align:top;">MG302</td><td headers="hd_h_A273_1_1_1_2 hd_h_A273_1_1_2_2" rowspan="1" colspan="1" style="text-align:center;vertical-align:top;">No database match</td><td headers="hd_h_A273_1_1_1_2 hd_h_A273_1_1_2_3" rowspan="1" colspan="1" style="text-align:left;vertical-align:top;">Mitochondrial 60S ribosomal protein
L2</td><td headers="hd_h_A273_1_1_1_2 hd_h_A273_1_1_2_4" rowspan="1" colspan="1" style="text-align:left;vertical-align:top;">(Glycerol-3-phosphate?) permease</td><td headers="hd_h_A273_1_1_1_2 hd_h_A273_1_1_2_5" rowspan="1" colspan="1" style="text-align:left;vertical-align:top;">Hypothetical</td><td headers="hd_h_A273_1_1_1_2 hd_h_A273_1_1_2_6" rowspan="1" colspan="1" style="text-align:left;vertical-align:top;">Probable cobalt transporter</td></tr><tr><td headers="hd_h_A273_1_1_1_1 hd_h_A273_1_1_2_1" rowspan="1" colspan="1" style="text-align:left;vertical-align:top;">MG448</td><td headers="hd_h_A273_1_1_1_2 hd_h_A273_1_1_2_2" rowspan="1" colspan="1" style="text-align:center;vertical-align:top;">Pilin repressor (pilB)</td><td headers="hd_h_A273_1_1_1_2 hd_h_A273_1_1_2_3" rowspan="1" colspan="1" style="text-align:left;vertical-align:top;">PilB protein</td><td headers="hd_h_A273_1_1_1_2 hd_h_A273_1_1_2_4" rowspan="1" colspan="1" style="text-align:left;vertical-align:top;">Putative chaperone-like protein</td><td headers="hd_h_A273_1_1_1_2 hd_h_A273_1_1_2_5" rowspan="1" colspan="1" style="text-align:left;vertical-align:top;">Hypothetical/Peptide methionine
sulfoxide reductase</td><td headers="hd_h_A273_1_1_1_2 hd_h_A273_1_1_2_6" rowspan="1" colspan="1" style="text-align:left;vertical-align:top;">Peptide methionine sulfoxide reductase
B</td></tr></tbody></table></div></div></article><article data-type="table-wrap" id="figobA275"><div id="A275" class="table"><h3><span class="label">Table 5.3</span><span class="title">Assignment of predicted <i>Aeropyrum pernix</i> proteins to
COGs</span></h3><p class="large-table-link" style="display:none"><span class="right"><a href="/books/NBK20253/table/A275/?report=objectonly" target="object">View in own window</a></span></p><div class="large_tbl" id="__A275_lrgtbl__"><table class="no_top_margin"><thead><tr><th id="hd_h_A275_1_1_1_1" rowspan="1" colspan="1" style="text-align:left;vertical-align:top;">
Protein category
</th><th id="hd_h_A275_1_1_1_2" rowspan="1" colspan="1" style="text-align:center;vertical-align:top;">
No. of proteins
</th></tr></thead><tbody><tr><td headers="hd_h_A275_1_1_1_1" rowspan="1" colspan="1" style="text-align:left;vertical-align:top;">Assigned by COGNITOR
automatically</td><td headers="hd_h_A275_1_1_1_2" rowspan="1" colspan="1" style="text-align:center;vertical-align:top;">1,123</td></tr><tr><td headers="hd_h_A275_1_1_1_1" rowspan="1" colspan="1" style="text-align:left;vertical-align:top;">Included in COGs after validation</td><td headers="hd_h_A275_1_1_1_2" rowspan="1" colspan="1" style="text-align:center;vertical-align:top;">1,102</td></tr><tr><td headers="hd_h_A275_1_1_1_1" rowspan="1" colspan="1" style="text-align:left;vertical-align:top;">True positives</td><td headers="hd_h_A275_1_1_1_2" rowspan="1" colspan="1" style="text-align:center;vertical-align:top;">1,062</td></tr><tr><td headers="hd_h_A275_1_1_1_1" rowspan="1" colspan="1" style="text-align:left;vertical-align:top;"> Preexisting COGs</td><td headers="hd_h_A275_1_1_1_2" rowspan="1" colspan="1" style="text-align:center;vertical-align:top;">1,035</td></tr><tr><td headers="hd_h_A275_1_1_1_1" rowspan="1" colspan="1" style="text-align:left;vertical-align:top;"> New COGs</td><td headers="hd_h_A275_1_1_1_2" rowspan="1" colspan="1" style="text-align:center;vertical-align:top;">27</td></tr><tr><td headers="hd_h_A275_1_1_1_1" rowspan="1" colspan="1" style="text-align:left;vertical-align:top;">False positives</td><td headers="hd_h_A275_1_1_1_2" rowspan="1" colspan="1" style="text-align:center;vertical-align:top;">44</td></tr><tr><td headers="hd_h_A275_1_1_1_1" rowspan="1" colspan="1" style="text-align:left;vertical-align:top;"> Rejected</td><td headers="hd_h_A275_1_1_1_2" rowspan="1" colspan="1" style="text-align:center;vertical-align:top;">21</td></tr><tr><td headers="hd_h_A275_1_1_1_1" rowspan="1" colspan="1" style="text-align:left;vertical-align:top;"> Re-assigned to a related
COG</td><td headers="hd_h_A275_1_1_1_2" rowspan="1" colspan="1" style="text-align:center;vertical-align:top;">21</td></tr><tr><td headers="hd_h_A275_1_1_1_1" rowspan="1" colspan="1" style="text-align:left;vertical-align:top;">Re-assigned to an unrelated COG</td><td headers="hd_h_A275_1_1_1_2" rowspan="1" colspan="1" style="text-align:center;vertical-align:top;">2</td></tr><tr><td headers="hd_h_A275_1_1_1_1" rowspan="1" colspan="1" style="text-align:left;vertical-align:top;">False negatives (added during manual
checking)</td><td headers="hd_h_A275_1_1_1_2" rowspan="1" colspan="1" style="text-align:center;vertical-align:top;">17</td></tr><tr><td headers="hd_h_A275_1_1_1_1" rowspan="1" colspan="1" style="text-align:left;vertical-align:top;">Proteins in COGs:Update 2001</td><td headers="hd_h_A275_1_1_1_2" rowspan="1" colspan="1" style="text-align:center;vertical-align:top;">1,178</td></tr><tr><td headers="hd_h_A275_1_1_1_1" rowspan="1" colspan="1" style="text-align:left;vertical-align:top;">Proteins in COGs:Update 2002</td><td headers="hd_h_A275_1_1_1_2" rowspan="1" colspan="1" style="text-align:center;vertical-align:top;">1,242</td></tr></tbody></table></div></div></article><article data-type="fig" id="figobA279"><div id="A279" class="figure bk_fig"><div class="graphic"><img data-src="/books/NBK20253/bin/ch5f3.jpg" alt="Figure 5.3. A Rosetta Stone case: domain fusions and gene clusters that involve peptide methionine sulfoxide reductases." /></div><h3><span class="label">Figure 5.3</span><span class="title">A Rosetta Stone case: domain fusions and gene clusters that
involve peptide methionine sulfoxide reductases</span></h3></div></article><article data-type="fig" id="figobA283"><div id="A283" class="figure bk_fig"><div class="graphic"><img data-src="/books/NBK20253/bin/ch5f4.jpg" alt="Figure 5.4. Genome context of COG1685 “Archaeal shikimate kinase”." /></div><h3><span class="label">Figure 5.4</span><span class="title">Genome context of COG1685 “Archaeal shikimate
kinase”</span></h3><div class="caption"><p> Each line corresponds to an individual genome: aful,
<i>Archaeoglobus fulgidus</i>; hbsp,
<i>Halobacterium</i> sp.; mjan,
<i>Methanococcus jannaschii</i>; mthe,
<i>Methanobacterium thermoautotrophicum</i>; pyro,
<i>Pyrococcus horikoshii</i>; pabyssi,
<i>Pyrococcus abyssi</i>; tacid,
<i>Thermoplasma acidophilum</i>; tvol,
<i>Thermoplasma volcanium</i>; aero,
<i>Aeropyrum pernix</i>; aquae, <i>Aquifex
aeolicus</i>. The genes encoding members of COG1685 are
shown in the middle. Genes encoding members of the same COG are
indicated by the same color. Genomes that do not encode a member
of COG1685 are indicated by empty lines. The names of all COGs
represented in the picture are listed starting from the most
common ones. Note that in <i>Halobacterium</i> sp.
(second line) and <i>M. thermoautotrophicum</i>
(fourth line), COG1685 genes are followed by the genes encoding
chorismate mutase (<i>tyrA</i>_1, COG1605). In
<i>Thermoplasma</i> spp. and <i>A.
pernix</i> (lines 7-9), COG1685 genes are sandwiched
between the genes encoding shikimate-5-dehydrogenase
(<i>aroE</i>, COG0169), and genes encoding
5-enolpyruvylshikimate-3-phosphate synthase
(<i>aroA</i>, COG0128). See <a href="/books/n/sef/A371/?report=reader#A468">Figure 7.7</a> for the chart of the complete
pathway of phenylalanine and tyrosine biosynthesis.</p></div></div></article><article data-type="fig" id="figobA1681"><div id="A1681" class="figure bk_fig"><div class="graphic"><img data-src="/books/NBK20253/bin/ch5f5.jpg" alt="Figure 5.5. Predicted DNA repair system in hyperthermophiles." /></div><h3><span class="label">Figure 5.5</span><span class="title">Predicted DNA repair system in hyperthermophiles</span></h3><div class="caption"><p>The pink boxes show optimal growth temperatures for each of the analyzed species (<i>A. aeolicus, T. maritima, A. fulgidus, M. thermoautotrophicum, M. jannaschii</i>). The genes are not drawn to scale; arrows indicate the direction of transcription. The upper row shows the COG numbers for the corresponding proteins. Some of the newly predicted COG functions are: COG2452, helix-turn-helix transcriptional regulator; COG1203, helicase; COG1468, RecB family exonuclease; COG2254, nuclease of the HD superfamily; COG1353, novel DNA polymerase.</p></div></div></article></div><div id="jr-scripts"><script src="/corehtml/pmc/jatsreader/ptpmc_3.22/js/libs.min.js"> </script><script src="/corehtml/pmc/jatsreader/ptpmc_3.22/js/jr.min.js"> </script></div></div>
<!-- Book content -->
<script type="text/javascript" src="/portal/portal3rc.fcgi/rlib/js/InstrumentNCBIBaseJS/InstrumentPageStarterJS.js"> </script>
<!-- CE8BC1E97D9F05E1_0182SID /projects/books/PBooks@9.11 portal106 v4.1.r689238 Tue, Oct 22 2024 16:10:51 -->
<span id="portal-csrf-token" style="display:none" data-token="CE8BC1E97D9F05E1_0182SID"></span>
<script type="text/javascript" src="//static.pubmed.gov/portal/portal3rc.fcgi/4216699/js/3968615.js" snapshot="books"></script></body>
</html>