
<!DOCTYPE html>
<html xmlns="http://www.w3.org/1999/xhtml" xml:lang="en" class="no-js no-jr">
    <head>
        <!-- For pinger, set start time and add meta elements. -->
        <script type="text/javascript">var ncbi_startTime = new Date();</script>

        <!-- Logger begin -->
        <meta name="ncbi_db" content="books">
<meta name="ncbi_pdid" content="book-part">
<meta name="ncbi_acc" content="NBK20253">
<meta name="ncbi_domain" content="sef">
<meta name="ncbi_report" content="reader">
<meta name="ncbi_type" content="fulltext">
<meta name="ncbi_objectid" content="">
<meta name="ncbi_pcid" content="/NBK20253/?report=reader">
<meta name="ncbi_pagename" content="Genome Annotation and Analysis - Sequence - Evolution - Function - NCBI Bookshelf">
<meta name="ncbi_bookparttype" content="chapter">
<meta name="ncbi_app" content="bookshelf">
        <!-- Logger end -->

        <!--component id="Page" label="meta"/-->
        <script type="text/javascript" src="/corehtml/pmc/jatsreader/ptpmc_3.22/js/jr.boots.min.js"> </script><title>Genome Annotation and Analysis - Sequence - Evolution - Function - NCBI Bookshelf</title>
<meta charset="utf-8">
<meta name="apple-mobile-web-app-capable" content="no">
<meta name="viewport" content="initial-scale=1,minimum-scale=1,maximum-scale=1,user-scalable=no">
<meta name="jr-col-layout" content="auto">
<meta name="jr-prev-unit" content="/books/n/sef/A166/?report=reader">
<meta name="jr-next-unit" content="/books/n/sef/A298/?report=reader">
<meta name="bk-toc-url" content="/books/n/sef/?report=toc">
<meta name="robots" content="INDEX,NOFOLLOW,NOARCHIVE,NOIMAGEINDEX">
<meta name="citation_inbook_title" content="Sequence - Evolution - Function: Computational Approaches in Comparative Genomics">
<meta name="citation_title" content="Genome Annotation and Analysis">
<meta name="citation_publisher" content="Kluwer Academic">
<meta name="citation_date" content="2003">
<meta name="citation_author" content="Eugene V Koonin">
<meta name="citation_author" content="Michael Y Galperin">
<meta name="citation_fulltext_html_url" content="https://www.ncbi.nlm.nih.gov/books/NBK20253/">
<link rel="schema.DC" href="http://purl.org/DC/elements/1.0/">
<meta name="DC.Title" content="Genome Annotation and Analysis">
<meta name="DC.Type" content="Text">
<meta name="DC.Publisher" content="Kluwer Academic">
<meta name="DC.Contributor" content="Eugene V Koonin">
<meta name="DC.Contributor" content="Michael Y Galperin">
<meta name="DC.Date" content="2003">
<meta name="DC.Identifier" content="https://www.ncbi.nlm.nih.gov/books/NBK20253/">
<meta name="DC.Language" content="en">
<meta name="description" content="In the preceding chapter, we gave a brief overview of the methods that are commonly used for identification of protein-coding genes and analysis of protein sequences. Here, we turn to one of the main subjects of this book, namely, how these methods are applied to the task of primary analysis of genomes, which often goes under the name of &ldquo;genome annotation&rdquo;. Many researchers still view genome annotation as a notoriously unreliable and inaccurate process. There are excellent reasons for this opinion: genome annotation produces a considerable number of errors and some outright ridiculous &ldquo;identifications&rdquo; (see 3.1.3 and further discussion in this chapter). These errors are highly visible, even when the error rate is quite low: because of the large numbers of genes in most genomes, the errors are also rather numerous. Some of the problems and challenges faced by genome annotation are an issue of quantity turning into quality: an analysis that can be easily and reliably done by a qualified researcher for one or ten protein sequences becomes difficult and error-prone for the same scientist and much more so for an automated tool when the task is scaled up to 10,000 sequences. We discuss here the performance of manual, automated, and mixed approaches in genome annotation and ways to avoid some common pitfalls. Mostly, however, we concentrate in this chapter on the so-called context methods of genome analysis, which are the recent excitement in the annotation field. These approaches go beyond individual genes and explicitly take advantage of genome comparison.">
<meta name="og:title" content="Genome Annotation and Analysis">
<meta name="og:type" content="book">
<meta name="og:description" content="In the preceding chapter, we gave a brief overview of the methods that are commonly used for identification of protein-coding genes and analysis of protein sequences. Here, we turn to one of the main subjects of this book, namely, how these methods are applied to the task of primary analysis of genomes, which often goes under the name of &ldquo;genome annotation&rdquo;. Many researchers still view genome annotation as a notoriously unreliable and inaccurate process. There are excellent reasons for this opinion: genome annotation produces a considerable number of errors and some outright ridiculous &ldquo;identifications&rdquo; (see 3.1.3 and further discussion in this chapter). These errors are highly visible, even when the error rate is quite low: because of the large numbers of genes in most genomes, the errors are also rather numerous. Some of the problems and challenges faced by genome annotation are an issue of quantity turning into quality: an analysis that can be easily and reliably done by a qualified researcher for one or ten protein sequences becomes difficult and error-prone for the same scientist and much more so for an automated tool when the task is scaled up to 10,000 sequences. We discuss here the performance of manual, automated, and mixed approaches in genome annotation and ways to avoid some common pitfalls. Mostly, however, we concentrate in this chapter on the so-called context methods of genome analysis, which are the recent excitement in the annotation field. These approaches go beyond individual genes and explicitly take advantage of genome comparison.">
<meta name="og:url" content="https://www.ncbi.nlm.nih.gov/books/NBK20253/">
<meta name="og:site_name" content="NCBI Bookshelf">
<meta name="og:image" content="https://www.ncbi.nlm.nih.gov/corehtml/pmc/pmcgifs/bookshelf/thumbs/th-sef-lrg.png">
<meta name="twitter:card" content="summary">
<meta name="twitter:site" content="@ncbibooks">
<meta name="bk-non-canon-loc" content="/books/n/sef/A264/?report=reader">
<link rel="canonical" href="https://www.ncbi.nlm.nih.gov/books/NBK20253/">
<link href="https://fonts.googleapis.com/css?family=Archivo+Narrow:400,700,400italic,700italic&amp;subset=latin" rel="stylesheet" type="text/css">
<link rel="stylesheet" href="/corehtml/pmc/jatsreader/ptpmc_3.22/css/libs.min.css">
<link rel="stylesheet" href="/corehtml/pmc/jatsreader/ptpmc_3.22/css/jr.min.css">
<meta name="format-detection" content="telephone=no">
<link rel="stylesheet" href="/corehtml/pmc/css/bookshelf/2.26/css/books.min.css" type="text/css">
<link rel="stylesheet" href="/corehtml/pmc/css/bookshelf/2.26/css//books_print.min.css" type="text/css" media="print">
<link rel="stylesheet" href="/corehtml/pmc/css/bookshelf/2.26/css/books_reader.min.css" type="text/css">
<style type="text/css">p a.figpopup{display:inline !important} .bk_tt {font-family: monospace}  .first-line-outdent .bk_ref {display: inline}  .body-content h2, .body-content .h2  {border-bottom: 1px solid #97B0C8} .body-content h2.inline {border-bottom: none} a.page-toc-label , .jig-ncbismoothscroll a {text-decoration:none;border:0 !important} .temp-labeled-list  .graphic {display:inline-block !important} .temp-labeled-list  img{width:100%}</style>

    <link rel="shortcut icon" href="//www.ncbi.nlm.nih.gov/favicon.ico">
<meta name="ncbi_phid" content="CE8D52EE7DB2EAA10000000000CA009B.m_5">
<meta name='referrer' content='origin-when-cross-origin'/><link type="text/css" rel="stylesheet" href="//static.pubmed.gov/portal/portal3rc.fcgi/4216699/css/3852956/3849091.css"></head>
    <body>
        <!-- Book content! -->


<div id="jr" data-jr-path="/corehtml/pmc/jatsreader/ptpmc_3.22/"><div class="jr-unsupported"><table class="modal"><tr><td><span class="attn inline-block"></span><br />Your browser does not support the NLM PubReader view.<br />Go to <a href="/pmc/about/pr-browsers/">this page</a> to see a list of supported browsers<br />or return to the <br /><a href="/books/NBK20253/?report=classic">regular view</a>.</td></tr></table></div><div id="jr-ui" class="hidden"><nav id="jr-head"><div class="flexh tb"><div id="jr-tb1"><a id="jr-links-sw" class="hidden" title="Links"><svg xmlns="http://www.w3.org/2000/svg" version="1.1" x="0px" y="0px" viewBox="0 0 70.6 85.3" style="enable-background:new 0 0 70.6 85.3;vertical-align:middle" xml:space="preserve" width="24" height="24">
								<style type="text/css">.st0{fill:#939598;}</style>
								<g>
									<path class="st0" d="M36,0C12.8,2.2-22.4,14.6,19.6,32.5C40.7,41.4-30.6,14,35.9,9.8"></path>
									<path class="st0" d="M34.5,85.3c23.2-2.2,58.4-14.6,16.4-32.5c-21.1-8.9,50.2,18.5-16.3,22.7"></path>
									<path class="st0" d="M34.7,37.1c66.5-4.2-4.8-31.6,16.3-22.7c42.1,17.9,6.9,30.3-16.4,32.5h1.7c-66.2,4.4,4.8,31.6-16.3,22.7           c-42.1-17.9-6.9-30.3,16.4-32.5"></path>
								</g>
							</svg> Books</a></div><div class="jr-rhead f1 flexh"><div class="head"><a href="/books/n/sef/A166/?report=reader"><svg xmlns="http://www.w3.org/2000/svg" viewBox="0 0 100 100" preserveAspectRatio="none"><path d="M75,30 c-80,60 -80,0 0,60 c-30,-60 -30,0 0,-60"></path><text x="20" y="28" textLength="60" style="font-size:25px">Prev</text></svg></a></div><div class="body"><div class="t">Chapter 5, Genome Annotation and Analysis</div><div class="j">Sequence - Evolution - Function: Computational Approaches in Comparative Genomics</div></div><div class="tail"><a href="/books/n/sef/A298/?report=reader"><svg xmlns="http://www.w3.org/2000/svg" viewBox="0 0 100 100" preserveAspectRatio="none"><path d="M25,30c80,60 80,0 0,60 c30,-60 30,0 0,-60"></path><text x="20" y="28" textLength="60" style="font-size:25px">Next</text></svg></a></div></div><div id="jr-tb2"><a id="jr-bkhelp-sw" class="btn wsprkl hidden" title="Help with NLM PubReader">?</a><a id="jr-help-sw" class="btn wsprkl hidden" title="Settings and typography in NLM PubReader"><svg xmlns="http://www.w3.org/2000/svg" viewBox="0 0 512 512" preserveAspectRatio="none"><path d="M462,283.742v-55.485l-29.981-10.662c-11.431-4.065-20.628-12.794-25.274-24.001  c-0.002-0.004-0.004-0.009-0.006-0.013c-4.659-11.235-4.333-23.918,0.889-34.903l13.653-28.724l-39.234-39.234l-28.72,13.652  c-10.979,5.219-23.68,5.546-34.908,0.889c-0.005-0.002-0.01-0.003-0.014-0.005c-11.215-4.65-19.933-13.834-24-25.273L283.741,50  h-55.484l-10.662,29.981c-4.065,11.431-12.794,20.627-24.001,25.274c-0.005,0.002-0.009,0.004-0.014,0.005  c-11.235,4.66-23.919,4.333-34.905-0.889l-28.723-13.653l-39.234,39.234l13.653,28.721c5.219,10.979,5.545,23.681,0.889,34.91  c-0.002,0.004-0.004,0.009-0.006,0.013c-4.649,11.214-13.834,19.931-25.271,23.998L50,228.257v55.485l29.98,10.661  c11.431,4.065,20.627,12.794,25.274,24c0.002,0.005,0.003,0.01,0.005,0.014c4.66,11.236,4.334,23.921-0.888,34.906l-13.654,28.723  
l39.234,39.234l28.721-13.652c10.979-5.219,23.681-5.546,34.909-0.889c0.005,0.002,0.01,0.004,0.014,0.006  c11.214,4.649,19.93,13.833,23.998,25.271L228.257,462h55.484l10.595-29.79c4.103-11.538,12.908-20.824,24.216-25.525  c0.005-0.002,0.009-0.004,0.014-0.006c11.127-4.628,23.694-4.311,34.578,0.863l28.902,13.738l39.234-39.234l-13.66-28.737  c-5.214-10.969-5.539-23.659-0.886-34.877c0.002-0.005,0.004-0.009,0.006-0.014c4.654-11.225,13.848-19.949,25.297-24.021  L462,283.742z M256,331.546c-41.724,0-75.548-33.823-75.548-75.546s33.824-75.547,75.548-75.547  c41.723,0,75.546,33.824,75.546,75.547S297.723,331.546,256,331.546z"></path></svg></a><a id="jr-fip-sw" class="btn wsprkl hidden" title="Find"><svg xmlns="http://www.w3.org/2000/svg" viewBox="0 0 550 600" preserveAspectRatio="none"><path fill="none" stroke="#000" stroke-width="36" stroke-linecap="round" style="fill:#FFF" d="m320,350a153,153 0 1,0-2,2l170,170m-91-117 110,110-26,26-110-110"></path></svg></a><a id="jr-rtoc-sw" class="btn wsprkl hidden" title="Table of Contents"><svg xmlns="http://www.w3.org/2000/svg" viewBox="0 0 100 100" preserveAspectRatio="none"><path d="M20,20h10v8H20V20zM36,20h44v8H36V20zM20,37.33h10v8H20V37.33zM36,37.33h44v8H36V37.33zM20,54.66h10v8H20V54.66zM36,54.66h44v8H36V54.66zM20,72h10v8 H20V72zM36,72h44v8H36V72z"></path></svg></a></div></div></nav><nav id="jr-dash" class="noselect"><nav id="jr-dash" class="noselect"><div id="jr-pi" class="hidden"><a id="jr-pi-prev" class="hidden" title="Previous page"><svg xmlns="http://www.w3.org/2000/svg" viewBox="0 0 100 100" preserveAspectRatio="none"><path d="M75,30 c-80,60 -80,0 0,60 c-30,-60 -30,0 0,-60"></path><text x="20" y="28" textLength="60" style="font-size:25px">Prev</text></svg></a><div class="pginfo">Page <i class="jr-pg-pn">0</i> of <i class="jr-pg-lp">0</i></div><a id="jr-pi-next" class="hidden" title="Next page"><svg xmlns="http://www.w3.org/2000/svg" viewBox="0 0 100 100" preserveAspectRatio="none"><path d="M25,30c80,60 80,0 0,60 c30,-60 30,0 
0,-60"></path><text x="20" y="28" textLength="60" style="font-size:25px">Next</text></svg></a></div><div id="jr-is-tb"><a id="jr-is-sw" class="btn wsprkl hidden" title="Switch between Figures/Tables strip and Progress bar"><svg xmlns="http://www.w3.org/2000/svg" viewBox="0 0 100 100" preserveAspectRatio="none"><rect x="10" y="40" width="20" height="20"></rect><rect x="40" y="40" width="20" height="20"></rect><rect x="70" y="40" width="20" height="20"></rect></svg></a></div><nav id="jr-istrip" class="istrip hidden"><a id="jr-is-prev" href="#" class="hidden" title="Previous"><svg xmlns="http://www.w3.org/2000/svg" viewBox="0 0 100 100" preserveAspectRatio="none"><path d="M80,40 60,65 80,90 70,90 50,65 70,40z M50,40 30,65 50,90 40,90 20,65 40,40z"></path><text x="35" y="25" textLength="60" style="font-size:25px">Prev</text></svg></a><a id="jr-is-next" href="#" class="hidden" title="Next"><svg xmlns="http://www.w3.org/2000/svg" viewBox="0 0 100 100" preserveAspectRatio="none"><path d="M20,40 40,65 20,90 30,90 50,65 30,40z M50,40 70,65 50,90 60,90 80,65 60,40z"></path><text x="15" y="25" textLength="60" style="font-size:25px">Next</text></svg></a></nav><nav id="jr-progress"></nav></nav></nav><aside id="jr-links-p" class="hidden flexv"><div class="tb sk-htbar flexh"><div><a class="jr-p-close btn wsprkl">Done</a></div><div class="title-text f1">NCBI Bookshelf</div></div><div class="cnt lol f1"><a href="/books/">Home</a><a href="/books/browse/">Browse All Titles</a><a class="btn share" target="_blank" rel="noopener noreferrer" href="https://www.facebook.com/sharer/sharer.php?u=https://www.ncbi.nlm.nih.gov/books/NBK20253/"><svg xmlns="http://www.w3.org/2000/svg" viewBox="0 0 33 33" style="vertical-align:middle" width="24" height="24" preserveAspectRatio="none"><g><path d="M 17.996,32L 12,32 L 12,16 l-4,0 l0-5.514 l 4-0.002l-0.006-3.248C 11.993,2.737, 13.213,0, 18.512,0l 4.412,0 l0,5.515 l-2.757,0 c-2.063,0-2.163,0.77-2.163,2.209l-0.008,2.76l 4.959,0 l-0.585,5.514L 18,16L 
17.996,32z"></path></g></svg> Share on Facebook</a><a class="btn share" target="_blank" rel="noopener noreferrer" href="https://twitter.com/intent/tweet?url=https://www.ncbi.nlm.nih.gov/books/NBK20253/&amp;text=Genome%20Annotation%20and%20Analysis"><svg xmlns="http://www.w3.org/2000/svg" viewBox="0 0 33 33" style="vertical-align:middle" width="24" height="24"><g><path d="M 32,6.076c-1.177,0.522-2.443,0.875-3.771,1.034c 1.355-0.813, 2.396-2.099, 2.887-3.632 c-1.269,0.752-2.674,1.299-4.169,1.593c-1.198-1.276-2.904-2.073-4.792-2.073c-3.626,0-6.565,2.939-6.565,6.565 c0,0.515, 0.058,1.016, 0.17,1.496c-5.456-0.274-10.294-2.888-13.532-6.86c-0.565,0.97-0.889,2.097-0.889,3.301 c0,2.278, 1.159,4.287, 2.921,5.465c-1.076-0.034-2.088-0.329-2.974-0.821c-0.001,0.027-0.001,0.055-0.001,0.083 c0,3.181, 2.263,5.834, 5.266,6.438c-0.551,0.15-1.131,0.23-1.73,0.23c-0.423,0-0.834-0.041-1.235-0.118 c 0.836,2.608, 3.26,4.506, 6.133,4.559c-2.247,1.761-5.078,2.81-8.154,2.81c-0.53,0-1.052-0.031-1.566-0.092 c 2.905,1.863, 6.356,2.95, 10.064,2.95c 12.076,0, 18.679-10.004, 18.679-18.68c0-0.285-0.006-0.568-0.019-0.849 C 30.007,8.548, 31.12,7.392, 32,6.076z"></path></g></svg> Share on Twitter</a></div></aside><aside id="jr-rtoc-p" class="hidden flexv"><div class="tb sk-htbar flexh"><div><a class="jr-p-close btn wsprkl">Done</a></div><div class="title-text f1">Table of Content</div></div><div class="cnt lol f1"><a href="/books/n/sef/?report=reader">Title Information</a><a href="/books/n/sef/toc/?report=reader">Table of Contents Page</a></div></aside><aside id="jr-help-p" class="hidden flexv"><div class="tb sk-htbar flexh"><div><a class="jr-p-close btn wsprkl">Done</a></div><div class="title-text f1">Settings</div></div><div class="cnt f1"><div id="jr-typo-p" class="typo"><div><a class="sf btn wsprkl">A-</a><a class="lf btn wsprkl">A+</a></div><div><a class="bcol-auto btn wsprkl"><svg xmlns="http://www.w3.org/2000/svg" viewBox="0 0 200 100" preserveAspectRatio="none"><text x="10" y="70" 
style="font-size:60px;font-family: Trebuchet MS, ArialMT, Arial, sans-serif" textLength="180">AUTO</text></svg></a><a class="bcol-1 btn wsprkl"><svg xmlns="http://www.w3.org/2000/svg" viewBox="0 0 100 100" preserveAspectRatio="none"><path d="M15,25 85,25zM15,40 85,40zM15,55 85,55zM15,70 85,70z"></path></svg></a><a class="bcol-2 btn wsprkl"><svg xmlns="http://www.w3.org/2000/svg" viewBox="0 0 100 100" preserveAspectRatio="none"><path d="M5,25 45,25z M55,25 95,25zM5,40 45,40z M55,40 95,40zM5,55 45,55z M55,55 95,55zM5,70 45,70z M55,70 95,70z"></path></svg></a></div></div><div class="lol"><a class="" href="/books/NBK20253/?report=classic">Switch to classic view</a><a href="/books/NBK20253/?report=printable">Print View</a></div></div></aside><aside id="jr-bkhelp-p" class="hidden flexv"><div class="tb sk-htbar flexh"><div><a class="jr-p-close btn wsprkl">Done</a></div><div class="title-text f1">Help</div></div><div class="cnt f1 lol"><a id="jr-helpobj-sw" data-path="/corehtml/pmc/jatsreader/ptpmc_3.22/" data-href="/corehtml/pmc/jatsreader/ptpmc_3.22/img/bookshelf/help.xml" href="">Help</a><a href="mailto:info@ncbi.nlm.nih.gov?subject=PubReader%20feedback%20%2F%20NBK20253%20%2F%20sid%3ACE8BC1E97D9F05E1_0182SID%20%2F%20phid%3ACE8D52EE7DB2EAA10000000000CA009B.4">Send us feedback</a><a id="jr-about-sw" data-path="/corehtml/pmc/jatsreader/ptpmc_3.22/" data-href="/corehtml/pmc/jatsreader/ptpmc_3.22/img/bookshelf/about.xml" href="">About PubReader</a></div></aside><aside id="jr-objectbox" class="thidden hidden"><div class="jr-objectbox-close wsprkl">&#10008;</div><div class="jr-objectbox-inner cnt"><div class="jr-objectbox-drawer"></div></div></aside><nav id="jr-pm-left" class="hidden"><svg xmlns="http://www.w3.org/2000/svg" viewBox="0 0 40 800" preserveAspectRatio="none"><text font-stretch="ultra-condensed" x="800" y="-15" text-anchor="end" transform="rotate(90)" font-size="18" letter-spacing=".1em">Previous Page</text></svg></nav><nav id="jr-pm-right" class="hidden"><svg 
xmlns="http://www.w3.org/2000/svg" viewBox="0 0 40 800" preserveAspectRatio="none"><text font-stretch="ultra-condensed" x="800" y="-15" text-anchor="end" transform="rotate(90)" font-size="18" letter-spacing=".1em">Next Page</text></svg></nav><nav id="jr-fip" class="hidden"><nav id="jr-fip-term-p"><input type="search" placeholder="search this page" id="jr-fip-term" autocorrect="off" autocomplete="off" /><a id="jr-fip-mg" class="wsprkl btn" title="Find"><svg xmlns="http://www.w3.org/2000/svg" viewBox="0 0 550 600" preserveAspectRatio="none"><path fill="none" stroke="#000" stroke-width="36" stroke-linecap="round" style="fill:#FFF" d="m320,350a153,153 0 1,0-2,2l170,170m-91-117 110,110-26,26-110-110"></path></svg></a><a id="jr-fip-done" class="wsprkl btn" title="Dismiss find">&#10008;</a></nav><nav id="jr-fip-info-p"><a id="jr-fip-prev" class="wsprkl btn" title="Jump to previuos match">&#9664;</a><button id="jr-fip-matches">no matches yet</button><a id="jr-fip-next" class="wsprkl btn" title="Jump to next match">&#9654;</a></nav></nav></div><div id="jr-epub-interstitial" class="hidden"></div><div id="jr-content"><article data-type="main"><div class="main-content lit-style" itemscope="itemscope" itemtype="http://schema.org/CreativeWork"><div class="meta-content fm-sec"><div class="fm-sec"><h1 id="_NBK20253_"><span class="label">Chapter 5</span><span class="title" itemprop="name">Genome Annotation and Analysis</span></h1><p class="fm-aai"><a href="#_NBK20253_pubdet_">Publication Details</a></p></div></div><div class="jig-ncbiinpagenav body-content whole_rhythm" data-jigconfig="allHeadingLevels: ['h2'],smoothScroll: false" itemprop="text"><p>In the preceding chapter, we gave a brief overview of the methods that are commonly used
for identification of protein-coding genes and analysis of protein sequences. Here, we
turn to one of the main subjects of this book, namely, how these methods are applied to
the task of primary analysis of genomes, which often goes under the name of
&#x0201c;genome annotation&#x0201d;. Many researchers still view genome annotation
as a notoriously unreliable and inaccurate process. There are excellent reasons for this
opinion: genome annotation produces a considerable number of errors and some outright
ridiculous &#x0201c;identifications&#x0201d; (see <a href="/books/n/sef/A55/?report=reader#A64">3.1.3</a> and further discussion in this chapter). These errors are highly
visible, even when the error rate is quite low: because of the large numbers of genes in
most genomes, the errors are also rather numerous. Some of the problems and challenges
faced by genome annotation are an issue of quantity turning into quality: an analysis
that can be easily and reliably done by a qualified researcher for one or ten protein
sequences becomes difficult and error-prone for the same scientist and much more so for
an automated tool when the task is scaled up to 10,000 sequences. We discuss here the
performance of manual, automated, and mixed approaches in genome annotation and ways to
avoid some common pitfalls. Mostly, however, we concentrate in this chapter on the
so-called context methods of genome analysis, which are the recent excitement in the
annotation field. These approaches go beyond individual genes and explicitly take
advantage of genome comparison.</p><div id="A265"><h2 id="_A265_">5.1. Methods, Approaches and Results in Genome Annotation</h2><div id="A266"><h3>5.1.1. Genome annotation: data flow and performance</h3><p>What is genome annotation? Of course, there can hardly be any exact definition
but, for the purpose of this discussion, it might be useful to define annotation
as a subfield in the general field of genome analysis, which includes more or
less anything that can be done with genome sequences by computational means. In
simple, operational terms, annotation may be defined as the part of genome
analysis that is customarily performed before a genome sequence is deposited in
GenBank and described in a published paper. We say
&#x0201c;customarily&#x0201d; because the annotations available through
GenBank and particularly the types of analysis reported in the literature for
different genomes vary widely. For instance, the reports on the human genome
sequence [<a href="/books/n/sef/A727/?report=reader#A1216">488</a>,<a href="/books/n/sef/A727/?report=reader#A1598">870</a>] clearly include a considerable amount of information
that goes beyond typical genome annotation. The &#x0201c;unit&#x0201d; of
genome annotation is the description of an individual gene and its protein (or
RNA) product, and the focal point of each such record is the function assigned
to the gene product. The record may also include a brief description of the
evidence for this assigned function, e.g. percent identity with a functionally
characterized homolog or the boundaries of domains detected in a domain database
search, but there is no room for any details of the analysis.</p><p>
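The annotation record described above can be sketched as a small data structure. This is only a minimal illustration: the names <span class="bk_tt">Evidence</span>, <span class="bk_tt">AnnotationRecord</span> and <span class="bk_tt">summary_line</span>, as well as all field names and example values, are hypothetical and do not correspond to the GenBank format or to any existing annotation tool. The sketch assumes that a gene product labeled "hypothetical" or "conserved hypothetical" counts as having no functional assignment.</p>

```python
from dataclasses import dataclass, field
from typing import List, Optional, Tuple


@dataclass
class Evidence:
    """Brief support for a functional assignment (all fields are hypothetical)."""
    homolog_id: Optional[str] = None          # functionally characterized homolog, if any
    percent_identity: Optional[float] = None  # percent identity with that homolog
    # (domain name, start, end) boundaries reported by a domain database search
    domains: List[Tuple[str, int, int]] = field(default_factory=list)


@dataclass
class AnnotationRecord:
    """One unit of annotation: a gene, its product, and the assigned function."""
    gene_id: str
    product_type: str                        # "protein" or "RNA"
    function: str = "hypothetical protein"   # the focal point of the record
    evidence: Evidence = field(default_factory=Evidence)

    def has_function(self) -> bool:
        # Assumption: any label containing "hypothetical" means that no
        # functional prediction has been made for this gene product.
        return "hypothetical" not in self.function.lower()


def summary_line(rec: AnnotationRecord) -> str:
    """Render the record as one line: there is no room for analysis details."""
    parts = [rec.gene_id, rec.product_type, rec.function]
    ev = rec.evidence
    if ev.homolog_id is not None and ev.percent_identity is not None:
        parts.append(f"{ev.percent_identity:.0f}% identity to {ev.homolog_id}")
    for name, start, end in ev.domains:
        parts.append(f"{name}:{start}-{end}")
    return " | ".join(parts)


if __name__ == "__main__":
    # Invented example values, used only to show the shape of a record.
    rec = AnnotationRecord(
        gene_id="gene_0001",
        product_type="protein",
        function="example kinase",
        evidence=Evidence("HOMOLOG_X", 42.0, [("example_domain", 10, 120)]),
    )
    print(summary_line(rec))
```

<p>Keeping the evidence in a separate, compact structure reflects the point made above: the record carries a short justification for the assigned function, not the analysis itself.</p>
<p>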
<a class="figpopup" href="/books/NBK20253/figure/A267/?report=objectonly" target="object" rid-figpopup="figA267" rid-ob="figobA267">Figure 5.1</a> shows a rough schematic of the
data flow in genome annotation, starting with the finished sequence; we leave
finishing of the sequence out of this scheme but indicate the possibility of
feedback resulting in correction of sequencing errors. Of these procedures,
which must be integrated for predicting gene functions, statistical gene
prediction and search of general-purpose databases for sequence similarity are
central in the sense that this is done comprehensively as part of any genome
project. The contribution of the other approaches in the scheme in <a class="figpopup" href="/books/NBK20253/figure/A267/?report=objectonly" target="object" rid-figpopup="figA267" rid-ob="figobA267">Figure 5.1</a>, particularly specialized
database search, including domain databases, such as Pfam, SMART, and CDD (see
<a href="/books/n/sef/A55/?report=reader#A82">3.2.2</a>), and genome-oriented databases,
such as COGs, KEGG, or WIT (see <a href="/books/n/sef/A55/?report=reader#A103">3.4</a>), and
genomic context analysis, varies greatly from project to project. So far, these
relatively new methods and resources remain ancillary to traditional database
search in genome annotation, but we argue further in this chapter that they can
and probably will transform the annotation process in the near future.
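</p><p>The distinction between central and ancillary procedures can be sketched as a simple pipeline. This is a rough illustration under stated assumptions, not a description of any real annotation system: it starts from an already predicted protein sequence (so statistical gene prediction and the sequencing-error feedback are left out), and the stage functions <span class="bk_tt">general_database_search</span>, <span class="bk_tt">domain_database_search</span> and <span class="bk_tt">genomic_context_analysis</span> are hypothetical placeholders that return fixed strings instead of running actual searches.</p>

```python
from typing import Callable, Dict, Iterable

# Evidence gathered for one predicted protein, keyed by the stage that produced it.
Evidence = Dict[str, str]
# A stage sees the sequence and the evidence so far and returns new evidence.
Stage = Callable[[str, Evidence], Evidence]


def annotate(sequence: str,
             central: Iterable[Stage],
             ancillary: Iterable[Stage] = ()) -> Evidence:
    """Run the central stages, then any ancillary ones, merging their evidence."""
    evidence: Evidence = {}
    for stage in [*central, *ancillary]:
        evidence.update(stage(sequence, evidence))
    return evidence


# Hypothetical placeholder stages: each returns a fixed value instead of
# running a real search, so only the order of the data flow is illustrated.
def general_database_search(sequence: str, evidence: Evidence) -> Evidence:
    return {"general_search": "no hit (placeholder)"}


def domain_database_search(sequence: str, evidence: Evidence) -> Evidence:
    return {"domain_search": "no domain (placeholder)"}


def genomic_context_analysis(sequence: str, evidence: Evidence) -> Evidence:
    return {"context": "not informative (placeholder)"}


if __name__ == "__main__":
    # "MKT" is an invented toy sequence.
    minimal = annotate("MKT", [general_database_search])
    fuller = annotate("MKT", [general_database_search],
                      [domain_database_search, genomic_context_analysis])
    print(list(minimal))
    print(list(fuller))
```

<p>In this sketch, the list of ancillary stages passed to <span class="bk_tt">annotate</span> is the part that would differ from one genome project to another, whereas the central stages are always run.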


</p><div class="iconblock whole_rhythm clearfix ten_col fig" id="figA267" co-legend-rid="figlgndA267"><a href="/books/NBK20253/figure/A267/?report=objectonly" target="object" title="Figure 5.1" class="img_link icnblk_img figpopup" rid-figpopup="figA267" rid-ob="figobA267"><img class="small-thumb" src="/books/NBK20253/bin/ch5f1.gif" src-large="/books/NBK20253/bin/ch5f1.jpg" alt="Figure 5.1. A generalized flow chart of genome annotation." /></a><div class="icnblk_cntnt" id="figlgndA267"><h4 id="A267"><a href="/books/NBK20253/figure/A267/?report=objectonly" target="object" rid-ob="figobA267">Figure 5.1</a></h4><p class="float-caption no_bottom_margin">A generalized flow chart of genome annotation. FB: feedback from gene identification for correction of sequencing
errors, primarily frameshifts. General database search: searching
sequence databases (typically, NCBI NR) for sequence similarity,
usually <a href="/books/NBK20253/figure/A267/?report=objectonly" target="object" rid-ob="figobA267">(more...)</a></p></div></div><p>Before we consider several aspects of genome annotation, it may be instructive to
assess its overall (gross) performance, i.e. the fraction of the genes in a genome to
which a specific function is assigned. <a class="figpopup" href="/books/NBK20253/table/A268/?report=objectonly" target="object" rid-figpopup="figA268" rid-ob="figobA268">Table
5.1</a> lists such data for several genomes sequenced in 2001 and
annotated using relatively up-to-date methods. This comparison shows notable
differences between the levels of annotation of different genomes. Some genomes
come practically unannotated, for example, <i>Sulfolobus
tokodaii</i>, which is a crenarchaeon closely related to <i>S.
solfataricus</i>, and represented in the COGs to the same extent as the
latter species. In most genomes, however, functional prediction has been made
for the majority of the genes, from 54% to 79% of the
protein-coding genes. Obviously, these differences depend both on the taxonomic
position of the species in question (e.g. it is likely that for Crenarchaea,
whose biology is in general poorly understood, the fraction of genes for which
functional prediction is feasible will be lower than for bacteria of the
well-characterized <i>Bacillus</i>-<i>Clostridium</i> group,
such as <i>C. acetobutylicum</i> or <i>L. lactis</i>) and on
the methods and practices of genome annotators.</p><div class="iconblock whole_rhythm clearfix ten_col table-wrap" id="figA268"><a href="/books/NBK20253/table/A268/?report=objectonly" target="object" title="Table 5.1" class="img_link icnblk_img figpopup" rid-figpopup="figA268" rid-ob="figobA268"><img class="small-thumb" src="/books/NBK20253/table/A268/?report=thumb" src-large="/books/NBK20253/table/A268/?report=previmg" alt="Table 5.1. Microbial genome annotation 2001." /></a><div class="icnblk_cntnt"><h4 id="A268"><a href="/books/NBK20253/table/A268/?report=objectonly" target="object" rid-ob="figobA268">Table 5.1</a></h4><p class="float-caption no_bottom_margin">Microbial genome annotation 2001. </p></div></div><p>Even in better-characterized genomes, for hundreds of genes (those encoding
&#x0201c;conserved hypothetical&#x0201d; and
&#x0201c;hypothetical&#x0201d; proteins), there is no functional prediction
whatsoever. Furthermore, among those proteins that formally belong to the
annotated category, a substantial fraction of the predictions are only general
and are in need of major refinement. Some of these problems can be solved only
through experiment, but the above numbers show beyond doubt that there is ample
room for improvement in computational annotation itself; further in this
chapter, we touch upon some of the possible directions.</p><p>Genome annotation necessarily involves some level of automation. No one is going
to manually paste each of several thousand protein sequences encoded in a genome
into the BLAST window, hit the button, and wait for the results to appear on
screen. For annotation to be practicable at all, software is necessary to run
such routine tasks in a batch mode and also to organize the results from
different programs in a convenient form, and each genome project employs one or
another set of tools to achieve this. After that point, however, genome
annotation is still mostly &#x0201c;manual&#x0201d; (or, better,
&#x0201c;expert&#x0201d;) because decisions on how to assign gene functions
are made by humans (supposedly, experts). Several attempts have been made to
push automation beyond straightforward data processing and to allow a program to
actually make all the decisions. We briefly discuss some of the automated
systems for genome annotation in the next section.</p></div><div id="A269"><h3>5.1.2. Automation of genome annotation</h3><p>Terry Gaasterland and Christoph Sensen once estimated that annotating genomic
sequence by hand would require as much as one person-year per megabase
[<a href="/books/n/sef/A727/?report=reader#A981">253</a>]. We now believe, on the basis
of our own experience of genome annotation (e.g. [<a href="/books/n/sef/A727/?report=reader#A1350">622</a>,<a href="/books/n/sef/A727/?report=reader#A1507">779</a>,<a href="/books/n/sef/A727/?report=reader#A1533">805</a>]), that this estimate is exaggerated
perhaps by a factor of 5 or 6. Nevertheless, there is no doubt that genome
annotation has become the limiting step in most genome projects. Besides, humans
are presumed to be inconsistent and error-prone. Hence the incentives for
automating as much of the annotation process as possible.</p><p> The <b>GeneQuiz</b> (<a href="http://www.sander.ebi.ac.uk/genequiz/" ref="pagearea=body&amp;targetsite=external&amp;targetcat=link&amp;targettype=uri">http://www.sander.ebi.ac.uk/genequiz/</a>) project was the first
automatic system for genome analysis, which performed similarity searches
followed by automatic evaluation of results and generation of functional
annotation by an expert system based on a set of several predefined rules [<a href="/books/n/sef/A727/?report=reader#A1477">749</a>]. Several other similar systems have
been created since then, but GeneQuiz remains the only such tool that is open to
the general public [<a href="/books/n/sef/A727/?report=reader#A1078">350</a>].</p><p>GeneQuiz runs automated database searches and sequence analysis by taking a
protein sequence and comparing it against a non-redundant protein database,
generated by automated cross-linking and cross-referencing of PDB, SWISS-PROT,
PIR, PROSITE, and TrEMBL databases, with the addition of human, mouse, fruit
fly, zebrafish, and <i>Anopheles gambiae</i> protein sets obtained
from the Ensembl project (<a href="http://www.ensembl.org" ref="pagearea=body&amp;targetsite=external&amp;targetcat=link&amp;targettype=uri">http://www.ensembl.org</a>) and a <i>C. elegans</i>
protein set (<a href="http://www.sanger.ac.uk/Projects/C_elegans/wormpep" ref="pagearea=body&amp;targetsite=external&amp;targetcat=link&amp;targettype=uri">http://www.sanger.ac.uk/Projects/C_elegans/wormpep</a>). This
comparison is done by running BLAST and FASTA programs and is used to identify
the cases with high similarity, where function can be predicted. Additionally,
searches for PROSITE patterns are performed. Predictions are also made for
coiled-coil regions using COILS2 [<a href="/books/n/sef/A727/?report=reader#A1261">533</a>],
transmembrane segments using PHDhtm [<a href="/books/n/sef/A727/?report=reader#A1443">715</a>], and secondary structure elements using PHDsec [<a href="/books/n/sef/A727/?report=reader#A1446">718</a>]. The system further clusters
proteins from the analyzed genome by sequence similarity [<a href="/books/n/sef/A727/?report=reader#A1550">822</a>] and constructs multiple alignments. The results are
presented in a table that contains information on the best hits (including gene
names, database identifiers, and links to the corresponding databases),
predictions for secondary structure, coiled-coil regions, etc. and a reliability
score for each item. The functional assignment is then made automatically on the
basis of the functions of the homologs found in the database. At this level,
functional assignments are qualified as clear or as ambiguous.</p><p>The effectiveness and accuracy of such fully automated systems have been the
subject of a rather heated discussion but still remain uncertain. While the
authors originally estimated the accuracy of their functional assignments to be
95% or better [<a href="/books/n/sef/A727/?report=reader#A1366">638</a>,<a href="/books/n/sef/A727/?report=reader#A1477">749</a>], others reported that only 8 of 21
new functional predictions for <i>M. genitalium</i> proteins made by
GeneQuiz could be fully corroborated [<a href="/books/n/sef/A727/?report=reader#A1194">466</a>]. A similar discrepancy between the functional predictions made
by the GeneQuiz team [<a href="/books/n/sef/A727/?report=reader#A759">31</a>] and those
obtained by mostly manual annotation [<a href="/books/n/sef/A727/?report=reader#A1194">466</a>] was reported for the proteins encoded in the <i>M.
jannaschii</i> genome ([<a href="/books/n/sef/A727/?report=reader#A992">264</a>],
see <a href="http://www.bioinfo.de/isb/1998/01/0007" ref="pagearea=body&amp;targetsite=external&amp;targetcat=link&amp;targettype=uri">http://www.bioinfo.de/isb/1998/01/0007</a>). It appeared that
GeneQuiz analysis suffered from the usual pitfalls of sequence similarity
searches (see <a href="/books/n/sef/A55/?report=reader#A64">3.1.3</a>, the next section and
[<a href="/books/n/sef/A727/?report=reader#A827">99</a>,<a href="/books/n/sef/A727/?report=reader#A832">104</a>,<a href="/books/n/sef/A727/?report=reader#A992">264</a>]).</p><div id="A270"><h4>PEDANT, MAGPIE, ERGO, IMAGENE</h4><p>While GeneQuiz seems to be the only fully automated genome annotation tool
that is open to the public for new genome analysis, there have been reports
of similar systems developed by other genome annotation groups. These
include Dmitrij Frishman's PEDANT (<a href="http://pedant.gsf.de" ref="pagearea=body&amp;targetsite=external&amp;targetcat=link&amp;targettype=uri">http://pedant.gsf.de</a>,
[<a href="/books/n/sef/A727/?report=reader#A973">245</a>,<a href="/books/n/sef/A727/?report=reader#A976">248</a>]), Terry Gaasterland's MAGPIE and its sister
programs (<a href="http://genomes.rockefeller.edu" ref="pagearea=body&amp;targetsite=external&amp;targetcat=link&amp;targettype=uri">http://genomes.rockefeller.edu</a>, [<a href="/books/n/sef/A727/?report=reader#A980">252</a>,<a href="/books/n/sef/A727/?report=reader#A981">253</a>]),
Ross Overbeek's ERGO (<a href="http://ergo.integratedgenomics.com/ERGO" ref="pagearea=body&amp;targetsite=external&amp;targetcat=link&amp;targettype=uri">http://ergo.integratedgenomics.com/ERGO</a>, [<a href="/books/n/sef/A727/?report=reader#A1370">642</a>,<a href="/books/n/sef/A727/?report=reader#A1371">643</a>]), Alan Viari's Imagene (<a href="http://wwwabi.snv.jussieu.fr/research" ref="pagearea=body&amp;targetsite=external&amp;targetcat=link&amp;targettype=uri">http://wwwabi.snv.jussieu.fr/research</a>, [<a href="/books/n/sef/A727/?report=reader#A1289">561</a>]), and some others. Although none
of these systems is freely available to outside users, many of the genome
annotation results they produced are accessible on the web and can be used
to judge the performance.</p><p>The PEDANT web site contains by far the most information open to the public
and can be used as a good reference point for automated genome analyses (see
also <a href="/books/n/sef/A22/?report=reader#A47">2.4</a>).</p></div><div id="A271"><h4>SEALS</h4><p>In addition to completely automated systems, some tools that greatly
facilitate and accelerate manual genome annotation are worth a mention.
System for Easy Analysis of Lots of Sequences (SEALS), developed by Roland
Walker at the NCBI, is, for obvious reasons, the one most familiar to the
authors of this book (available for downloading at <a href="http://iubio.bio.indiana.edu:7780/archive/00000466/" ref="pagearea=body&amp;targetsite=external&amp;targetcat=link&amp;targettype=uri">http://iubio.bio.indiana.edu:7780/archive/00000466/</a>, [<a href="/books/n/sef/A727/?report=reader#A1606">878</a>]). The SEALS package consists of
~50 simple, UNIX-based tools (written in PERL), which follow consistent
syntax and semantics. SEALS combines software for retrieving sequence
information, scripting database searches with BLAST, viewing and parsing
search outputs, searching for protein sequence motifs using regular
expressions, and predicting protein structural features and motifs.
Typically, using SEALS, a genome analyst first looks for structural features
of proteins, such as signal peptides (predicted by SignalP), transmembrane
domains (predicted by PHDhtm), coiled-coil domains (predicted by COILS2),
and large non-globular domains (predicted using SEG). Once these regions are
identified and masked, database searches are run in a batch mode using the
chosen method, e.g. PSI-BLAST. The outputs can be presented in a variety of
formats, of which filtering with taxonomic queries implemented in the SEALS
script TAX_COLLECTOR is among the most useful. SEALS has been extensively
used in the comparative studies of bacterial, archaeal, and eukaryotic
genomes (e.g. [<a href="/books/n/sef/A727/?report=reader#A780">52</a>,<a href="/books/n/sef/A727/?report=reader#A783">55</a>,<a href="/books/n/sef/A727/?report=reader#A1268">540</a>]).</p></div></div><div id="A272"><h3>5.1.3. Accuracy of genome annotation, sources of errors, and some thoughts on
possible improvements</h3><p>Benchmarking the accuracy of genome annotation is extremely hard. It has been
shown on numerous occasions that more advanced methods for sequence comparison,
such as gapped BLAST and subsequently PSI-BLAST, sometimes used in combination
with threading, as well as various forms of motif analysis and careful manual
integration of the results produced by all these approaches, substantially
improve detection of homologs (e.g. [<a href="/books/n/sef/A727/?report=reader#A896">168</a>,<a href="/books/n/sef/A727/?report=reader#A1129">401</a>,<a href="/books/n/sef/A727/?report=reader#A1162">434</a>,<a href="/books/n/sef/A727/?report=reader#A1194">466</a>,<a href="/books/n/sef/A727/?report=reader#A1313">585</a>]). In the end,
however, genome annotation is not about detection of homologs but rather about
functional prediction, and here, the problem of a standard of truth is
formidable. By definition, functional annotation (more precisely, functional
prediction) deals with proteins whose functions are unknown, and the rate of
experimental testing of predictions is extremely slow. We believe that it is
possible to design an objective test of the accuracy of genome annotation in the
following manner. The protein set encoded in a newly sequenced genome is
analyzed, and specific active centers and other functionally important sites are
predicted for as many proteins as possible. When a new, preferably
phylogenetically distant genome becomes available, orthologs of the proteins
from the first genome are identified, and the conservation of the predicted
functional sites is assessed. Lack of conservation would count as an error; this
is, of course, a harsh test that would give a lower bound on accuracy because:
first, functional site prediction may be partly wrong but the function of the
protein still would be predicted correctly; and second, some active sites might
be disrupted in the new genome. In this way, the accuracy of the prediction
could be assessed quantitatively and, in principle, even a
&#x0201c;tournament&#x0201d; analogous to the CASP competition in protein
structure prediction [<a href="/books/n/sef/A727/?report=reader#A1597">869</a>] could be
arranged.</p><p>However, so far, evaluation of the accuracy of genome annotation has been largely
limited to the assessments of consistency of annotations of the same genome
generated by different groups and various &#x0201c;sanity checks&#x0201d;
and expert judgments. Steven Brenner published an interesting comparison of
three independent annotations [<a href="/books/n/sef/A727/?report=reader#A970">242</a>,<a href="/books/n/sef/A727/?report=reader#A1195">467</a>,<a href="/books/n/sef/A727/?report=reader#A1367">639</a>] of the smallest of the sequenced bacterial genomes,
<i>Mycoplasma genitalium</i> [<a href="/books/n/sef/A727/?report=reader#A844">116</a>]. Without attempting to determine which annotation was
&#x0201c;better&#x0201d;, he manually examined all conflicting annotations,
eliminating trivial semantic differences and counting the apparent
irreconcilable ones as errors (in at least one of the annotations). His
conclusion was that the error rate was at least 8% among the 340
genes annotated by at least two of the three groups. In a similar exercise that
we have done on the basis of the COG database, we found that of 786 COGs that
did not include paralogs (the number for the end of 1999), members of 194 had
conflicting annotations in GenBank [<a href="/books/n/sef/A727/?report=reader#A995">267</a>]. This suggests, more pessimistically, an annotation error rate of
at least 25% using the same criterion as applied by Brenner. Clearly,
even the lower of these estimates represents a serious problem for genome
annotation, bringing up the specter of error catastrophe [<a href="/books/n/sef/A727/?report=reader#A817">89</a>,<a href="/books/n/sef/A727/?report=reader#A832">104</a>]. We first
briefly discuss the most common sources of errors and then some ideas regarding
the ways out. Manual and automated genome annotation encounter the same typical
problems, which we already mentioned in the discussion of the reliability of
sequence database records (see <a href="/books/n/sef/A55/?report=reader#A64">3.1.3</a>).
Inevitably, even partial automation of the annotation process tends to increase
the likelihood of all these types of errors.</p><p>In order to examine various kinds of errors that are common in genome annotation,
it is convenient to re-examine four cases of discrepancies in the annotation of
<i>M. genitalium</i> proteins that were specifically highlighted
in the aforecited article of Steven Brenner (<a class="figpopup" href="/books/NBK20253/table/A273/?report=objectonly" target="object" rid-figpopup="figA273" rid-ob="figobA273">Table 5.2</a>). Although one of the authors was involved in one of the
compared annotations, we think we can be completely impartial in the spirit of
Brenner's article, especially since six years have passed, an eternity for
genomics.</p><div class="iconblock whole_rhythm clearfix ten_col table-wrap" id="figA273"><a href="/books/NBK20253/table/A273/?report=objectonly" target="object" title="Table 5.2" class="img_link icnblk_img figpopup" rid-figpopup="figA273" rid-ob="figobA273"><img class="small-thumb" src="/books/NBK20253/table/A273/?report=thumb" src-large="/books/NBK20253/table/A273/?report=previmg" alt="Table 5.2. Different types of errors in genome annotation." /></a><div class="icnblk_cntnt"><h4 id="A273"><a href="/books/NBK20253/table/A273/?report=objectonly" target="object" rid-ob="figobA273">Table 5.2</a></h4><p class="float-caption no_bottom_margin">Different types of errors in genome annotation. </p></div></div><p>The protein MG302 was not annotated in the original genome publication by Fraser
and colleagues and was assigned conflicting annotations by the other two groups.
Ouzounis and coworkers notably characterized this protein as a
&#x0201c;mitochondrial 60S ribosomal protein L2&#x0201d;, whereas Koonin and
coworkers annotated it as a permease, perhaps specific for
glycerol-3-phosphate. A database search performed in 2002 leaves no doubt
whatsoever that the protein is a permease; this is, of course, readily supported
by transmembrane segment prediction. However, the glycerol-3-phosphate
specificity is not supported at all. Instead, these searches, particularly the
CDD search, unequivocally pointed to a relationship between MG302 and a family
of cobalt transporters. Nevertheless, since the similarity between MG302 and the
cobalt transporters is not particularly strong and transporters switch their
specificity with relative ease during evolution, caution is due, and the
annotation as &#x0201c;probable Co transporter&#x0201d; seems most
appropriate. This single case nicely covers several common problems of genome
annotation. The most benign but also apparently most widespread of these is <b>
<i>overprediction</i>
</b> or, more precisely, <b>
<i>overly specific prediction</i>
</b>. Even with the methods available in 1996 (ungapped BLAST, FASTA, various
alignment methods, and transmembrane segment prediction), the conclusion that
MG302 was a permease was quite firm. However, glycerol-3-phosphate permease
turned up as the most similar functionally characterized protein just by chance
(Co<sup>2+</sup> transporters had not been characterized at the
time). Transferring functional information from this unreliable best hit,
however tentatively, was a typical error of overprediction; the appropriate
annotation at the time would have been, simply, &#x0201c;predicted
permease&#x0201d;. The annotation of MG302 as &#x0201c;mitochondrial 60S
ribosomal protein L2&#x0201d; is, of course, much more conspicuous. At face
value, this does not even pass a &#x0201c;reality check&#x0201d;: there
certainly can be no mitochondria and no 60S ribosomes in mycoplasmas.</p><p>Such semantic snafus are pretty common in genome annotations, especially those
that are produced either fully automatically or manually but non-critically
(e.g. the &#x0201c;discovery&#x0201d; of head morphogenesis in bacteria
mentioned in <a href="/books/n/sef/A55/?report=reader">Chapter 3</a>). However,
these are probably the least serious annotation errors.</p><p>Let us just assume that the authors of this annotation meant &#x0201c;homolog
of mitochondrial 60S ribosomal protein L2&#x0201d;. What is worse: the search
result that presumably gave rise to this annotation is impossible to reproduce
at this time, at least not without detailed research, which we are not willing
to undertake. It is most likely that this blatantly wrong annotation was due to <b>
<i>a spurious database hit</i>
</b> to a ribosomal protein that was not critically assessed. It is not
clear, in this particular case, how this spurious hit could pass the
significance threshold, but in general, this happens most often because of the
lack of proper filtering for low complexity (or alternative approaches, such as
composition-based statistics, which are available in 2002 but had not been
developed in 1996; see <a href="/books/n/sef/A166/?report=reader">Chapter 4</a>).
Alternatively or additionally, the problem might lie in non-critical transfer of
annotation from <b>
<i>an unreliable database record</i>
</b>, i.e. a low-complexity sequence erroneously labeled as a ribosomal
protein. Notably, our re-analysis shows that the annotations assigned by each of
the three groups were not completely correct: one was an outright error; another
one involved overprediction; and the third one, an underprediction. Although
less notorious than false predictions (false-positives, in statistical terms),
lack of prediction, where a confident one is feasible with available methods, is
still an error (a false-negative).</p><p>The case of the MG225 protein is quite similar except that there was no clear
false prediction involved. Once again, the original genome project gave no
annotation (a false-negative), whereas one of the remaining groups annotated the
protein as &#x0201c;histidine permease&#x0201d;, and the other one stopped
at an &#x0201c;amino acid permease&#x0201d; annotation without proposing
specificity. Today's searches support the latter decision because no convincing,
specific relationship between this protein and transporters for any particular
amino acid could be detected (in fact, given the small repertoire of
transporters in mycoplasmas, this one might have a broad specificity). Notably,
both MG302 and MG225 remain &#x0201c;hypothetical proteins&#x0201d; in
GenBank to this day, although closely related orthologs from <i>M.
pneumoniae</i> are correctly annotated as permeases [<a href="/books/n/sef/A727/?report=reader#A896">168</a>].</p><p>The MG085 protein was annotated as an oxidoreductase (of different families) in
the original genome report and by Ouzounis and coworkers, whereas Koonin and
coworkers predicted that it was an ATP(GTP?)-utilizing enzyme on the basis of
the conservation of the P-loop motif in this protein and its homologs. In 2002,
database searches immediately identify this protein as HPr kinase (this
annotation is now correctly assigned to MG085 in GenBank), a regulator of the
sugar phosphotransferase system, which indeed is a P-loop-containing,
ATP-utilizing enzyme [<a href="/books/n/sef/A727/?report=reader#A1451">723</a>]. Back in
1996, this was the only informative annotation that could be derived for this
protein; HPr kinase genes had not been identified at the time. Once again, the
specific source of the oxidoreductase assignments is hard to determine; spurious
hits, non-critical use of incorrect database annotations, or a combination
thereof must have caused this.</p><p>The case of MG448 is of particular interest. This protein was annotated as
&#x0201c;pilin repressor&#x0201d; or simply PilB protein by Fraser and
coworkers and Ouzounis and coworkers and, somewhat cryptically, as
&#x0201c;chaperone-like protein&#x0201d; by Koonin and coworkers. This
protein remains &#x0201c;hypothetical&#x0201d; in GenBank but became a
peptide methionine sulfoxide reductase (PMSR) in SWISS-PROT. A database search
detects highly significant similarity with numerous proteins that are
annotated primarily as PMSR and, in some cases, as PilB-related repressors. In
reality, this protein is indeed a recently characterized, distinct form of PMSR,
MsrB [<a href="/books/n/sef/A727/?report=reader#A1204">476</a>,<a href="/books/n/sef/A727/?report=reader#A1254">526</a>], which is evolutionarily unrelated to, but is often
associated with, the classic PMSR, MsrA, either as part of a multidomain protein
or as a separate gene in the same operon [<a href="/books/n/sef/A727/?report=reader#A995">267</a>]. These fusions resulted in the annotation of MG448 as PMSR,
which, ironically, turned out to be correct, but mostly (except for the recently
updated SWISS-PROT description), for a wrong reason, because it was the MsrA
domain that was recognized in the fusion proteins. Furthermore, in several
bacteria, these two domains are fused to a third, thioredoxin domain. The
three-domain protein of <i>Neisseria gonorrhoeae</i> has been
characterized as a regulator of pili operon expression, and this is what caused
the annotation of MG448 as PilB, which was reproduced by two groups. This
annotation is outright wrong and does not even pass a &#x0201c;reality
check&#x0201d; because there are no pili in mycoplasmas (parenthetically,
the latest reports appear to indicate that even the original functional
characterization of the <i>Neisseria</i> protein was erroneous [<a href="/books/n/sef/A727/?report=reader#A1504">776</a>]).</p><p>
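The way a fusion protein can mislead annotation transfer, as in the MG448 case above, can be illustrated with a small sketch. Everything in it is hypothetical: the domain coordinates, the domain order, the 50% coverage cutoff, and the function names are invented for illustration and do not come from any of the annotation pipelines discussed here. The idea is simply that only those domains of the database protein that are actually covered by the aligned region should contribute to the transferred annotation.</p>

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Domain:
    name: str
    start: int  # 1-based, inclusive
    end: int    # 1-based, inclusive


def overlap(a_start, a_end, b_start, b_end):
    """Number of positions shared by two closed intervals (0 if disjoint)."""
    return max(0, min(a_end, b_end) - max(a_start, b_start) + 1)


def domains_supported_by_hit(hit_start, hit_end, domains, min_cover=0.5):
    """Names of subject domains covered to at least min_cover by the aligned region."""
    supported = []
    for d in domains:
        length = d.end - d.start + 1
        if overlap(hit_start, hit_end, d.start, d.end) / length >= min_cover:
            supported.append(d.name)
    return supported


# Hypothetical three-domain fusion protein; coordinates are invented for illustration.
fusion = [
    Domain("thioredoxin", 1, 150),
    Domain("MsrA", 180, 350),
    Domain("MsrB", 380, 520),
]

# Hypothetical single-domain query that aligns only to residues 385-515 of the fusion.
print(domains_supported_by_hit(385, 515, fusion))
```

<p>In this toy example, a query that aligns only to the MsrB-like portion of the fusion protein would inherit neither the MsrA nor the thioredoxin annotation, let alone a function ascribed to the full-length protein.</p>
<p>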
<b>
<i>Unrecognized multidomain architecture</i>
</b> of either the analyzed protein or its homologs or both is a common cause
of erroneous annotation. The &#x0201c;chaperone-like protein&#x0201d;
annotation was based on the notion that the PMSR function could be interpreted
as a form of chaperone action, and accordingly, the associated domain was also
likely to have a chaperone-like activity. In retrospect, this looks like
overprediction combined with insufficient information included in the
annotation. A straightforward annotation of MG448 as a PMSR-associated domain,
perhaps with an extra prediction of redox activity on the basis of conservation
of cysteines in this domain, the way it has been done in a subsequent
publication [<a href="/books/n/sef/A727/?report=reader#A995">267</a>], would have been
appropriate. We revisit this interesting set of proteins when discussing context
analysis in <a href="#A276">Section 5.2</a>.</p><p>While considering only four proteins with contradictory annotations, we
encountered all the main sources of systematic error in genome annotation. We
list them here again, more or less in the order of decreasing severity, as we
see it: (i) spurious database hits, often caused by low-complexity regions in
the query or the database sequence; (ii) non-critical transfer of functional
prediction from an unreliable database record; (iii) incorrect interpretation
(lack of recognition) of multidomain architecture of the query and/or database
sequences; (iv) overly specific functional prediction; and (v)
underprediction.</p><p>We believe that this brief discussion highlights more general problems beyond
these specific causes of errors. Even the apparently correct database
annotations are insufficiently informative. Typically, the records do not
include the evidence behind the prediction or include only minimal data that may
be hard to interpret, such as E-values of the hits to particular domains. In
this situation, any complicated case will not be represented adequately (e.g.
the PMSR-associated domain discussed above). In addition, there is no controlled
vocabulary for genome annotation, which creates numerous semantic problems,
although an attempt to correct this situation is being undertaken in the form of
the Gene Ontology project [<a href="/books/n/sef/A727/?report=reader#A788">60</a>,<a href="/books/n/sef/A727/?report=reader#A1241">513</a>].</p><p>The above discussion shows that the general state of genome annotation is far
from being satisfactory. What can be done to improve it? In his paper on genome
annotation errors, Steven Brenner noted that, &#x0201c;to prevent errors from
spreading out of control, database curation by the scientific community will be
essential.&#x0201d; [<a href="/books/n/sef/A727/?report=reader#A844">116</a>]. Curation,
however, implies that databases other than GenBank will have to be employed
because GenBank, by definition, is an archival database (<a href="/books/n/sef/A55/?report=reader">Chapter 3</a>). It appears that the future
and, to some degree, already the present of genome annotation lies in
specialized databases that actually function as annotation tools. The beginnings
of such tools can be seen in databases like KEGG, WIT, and COGs, complemented by
tools for domain identification, such as CDD and SMART (see <a href="/books/n/sef/A55/?report=reader">Chapters 3</a> and <a href="/books/n/sef/A166/?report=reader">4</a>).</p><p>Conceptually, the advantage of this approach may be viewed as reduction and
structuring of the search space for genome annotation. Thus, when using COGs, a
genome analyst compares each protein sequence not to the unstructured set of
more than a million proteins (the NR database) but instead to a collection of
~5,000 mostly well-characterized protein sets classified by orthology, which is
the appropriate level of granularity for functional assignment. Genome
annotation is already starting to change through the use of the new generation of
databases and tools. However, smooth integration of these and development of
new, richer formats for annotation are things of the future. In the next
subsection, we turn to a specific example to illustrate how the use of COGs
helps genome annotation.</p></div><div id="A274"><h3>5.1.4. A case study on genome annotation: the crenarchaeon <i>Aeropyrum
pernix</i></h3><p>
<i>Aeropyrum pernix</i> was the first representative of the
Crenarchaeota (one of the two major branches of archaea; see <a href="/books/n/sef/A298/?report=reader">Chapter 6</a>) and the first aerobic
archaeon whose genome has been sequenced [<a href="/books/n/sef/A727/?report=reader#A1155">427</a>]. <i>A. pernix</i> was reported to encode 2,694
putative proteins in a 1.67-Mbase genome. Of these, 633 proteins were assigned a
specific or general function in the original report on the basis of sequence
comparison to proteins in the GenBank, SWISS-PROT, EMBL, PIR, and Owl databases.
Given the intrinsic interest of the first crenarchaeal genome and also because
of the unexpectedly low fraction of predicted genes that were assigned functions
in the original report, <i>A. pernix</i> was chosen for a pilot
annotation project centered around the COG database [<a href="/books/n/sef/A727/?report=reader#A1333">605</a>].</p><p>
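The numbers in the preceding paragraph can be turned into a quick back-of-the-envelope check of why the original annotation looked sparse. The sketch below is only illustrative: the function name and the proteins-per-megabase measure are our own choices, not part of the original analysis, and the inputs are simply the figures quoted above.</p>

```python
def annotation_summary(n_proteins, n_annotated, genome_mbase):
    """Return (fraction of proteins with an assigned function, predicted proteins per Mbase)."""
    fraction_annotated = n_annotated / n_proteins
    proteins_per_mbase = n_proteins / genome_mbase
    return fraction_annotated, proteins_per_mbase


# Figures quoted in the text for the original A. pernix report
fraction, density = annotation_summary(n_proteins=2694, n_annotated=633, genome_mbase=1.67)
print(f"annotated: {fraction:.1%}; predicted proteins per Mbase: {density:.0f}")
```

<p>With these inputs, fewer than a quarter of the predicted proteins carry a functional assignment.</p>
<p>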
<a class="figpopup" href="/books/NBK20253/figure/A1680/?report=objectonly" target="object" rid-figpopup="figA1680" rid-ob="figobA1680">Figure 5.2</a> (see the color plates) shows
the protocol employed for the COG-based genome annotation. This procedure was
not limited to straightforward COGNITOR analysis but also explicitly drew from
the phyletic patterns. Whenever <i>A. pernix</i> was unexpectedly not
represented in a COG (e.g. a COG that included all other archaeal species),
additional analysis was undertaken. To identify possible diverged COG members
from <i>A. pernix,</i> PSI-BLAST searches were run with multiple
members of the respective COGs, and to detect COG members that could have been
missed in the original genome annotation, the translated sequence of the
<i>A. pernix</i> genome was searched using TBLASTN. Conversely,
unexpected occurrences of <i>A. pernix</i> proteins in COGs that did
not have any other archaeal members were examined case by case to detect likely
HGT events and novel functions in the crenarchaeal genome.</p><div class="iconblock whole_rhythm clearfix ten_col fig" id="figA1680" co-legend-rid="figlgndA1680"><a href="/books/NBK20253/figure/A1680/?report=objectonly" target="object" title="Figure 5.2" class="img_link icnblk_img figpopup" rid-figpopup="figA1680" rid-ob="figobA1680"><img class="small-thumb" src="/books/NBK20253/bin/ch5f2.gif" src-large="/books/NBK20253/bin/ch5f2.jpg" alt="Figure 5.2. Protocol of genome annotation using the COG database." /></a><div class="icnblk_cntnt" id="figlgndA1680"><h4 id="A1680"><a href="/books/NBK20253/figure/A1680/?report=objectonly" target="object" rid-ob="figobA1680">Figure 5.2</a></h4><p class="float-caption no_bottom_margin">Protocol of genome annotation using the COG database. </p></div></div><p>Proteins were assigned to COGs through two rounds of automated comparison using
COGNITOR, each followed by curation, that is, manual checking of the
assignments. The first round attempts to assign proteins to existing COGs;
typically, &#x0003e;90% of the assignments are made in this step. The
second round serves two purposes: first, to assign paralogs that might have
been missed in the first round to existing COGs; and, second, to create new
COGs from unassigned proteins.</p><p>The results of COG assignment for <i>A. pernix</i> are shown in <a class="figpopup" href="/books/NBK20253/table/A275/?report=objectonly" target="object" rid-figpopup="figA275" rid-ob="figobA275">Table 5.3</a>. Manual curation of the
automatic assignments revealed a false-positive rate of about 2%
(23 of 1,123 proteins). Even if the less severe errors, in which a protein was
transferred from one related COG to another, are taken into account, the
false-positive rate was 4%, which is not negligible but substantially
lower than the estimates cited above for more standard genome annotation
methods. The number of identified false-negatives was even lower, but in this
case, of course, it is not possible to determine how many proteins remain
unassigned. It is further notable that the great majority of assigned proteins
belonged to pre-existing COGs, which facilitates a (nearly) automatic
annotation.</p><div class="iconblock whole_rhythm clearfix ten_col table-wrap" id="figA275"><a href="/books/NBK20253/table/A275/?report=objectonly" target="object" title="Table 5.3" class="img_link icnblk_img figpopup" rid-figpopup="figA275" rid-ob="figobA275"><img class="small-thumb" src="/books/NBK20253/table/A275/?report=thumb" src-large="/books/NBK20253/table/A275/?report=previmg" alt="Table 5.3. Assignment of predicted Aeropyrum pernix proteins to COGs." /></a><div class="icnblk_cntnt"><h4 id="A275"><a href="/books/NBK20253/table/A275/?report=objectonly" target="object" rid-ob="figobA275">Table 5.3</a></h4><p class="float-caption no_bottom_margin">Assignment of predicted <i>Aeropyrum pernix</i> proteins to
COGs. </p></div></div><p>Altogether, 1,102 <i>A. pernix</i> proteins were assigned to COGs. Some
of these proteins (154) were members of
functionally uncharacterized COGs. Subtracting these, annotation has been added
to 315 proteins, which is an increase of about 50% compared to the
original annotation. These newly annotated <i>A. pernix</i> proteins
included, among others, the key glycolytic enzymes glucose-6-phosphate isomerase
(APE0768, COG0166) and triose phosphate isomerase (APE1538, COG0149), and the
pyrimidine biosynthetic enzymes orotidine-5&#x02032;-phosphate decarboxylase
(APE2348, COG0284), uridylate kinase (APE0401, COG0528), cytidylate kinase
(APE0978, COG1102), and thymidylate kinase (APE2090, COG0125). Similarly,
important functions in DNA replication and repair were confidently assigned to a
considerable number of <i>A. pernix</i> proteins, which, in the
original annotation, were described as &#x0201c;hypothetical&#x0201d;.
Examples include the bacterial-type DNA primase (COG0358), the large subunit of
the archaeal-eukaryotic-type primase (COG2219), a second ATP-dependent DNA
ligase (COG1423), three paralogous photolyases (COG1533), and several helicases
and nucleases of different specificities.</p><p>The case of the large subunit of the archaeal-eukaryotic primase is particularly
illustrative of the contribution of different types of inference to genome
annotation. COGNITOR failed to assign an <i>A. pernix</i> protein to
the respective COG (COG2219). However, given the ubiquity of this subunit in
euryarchaea and eukaryotes and the presence of a readily detectable small
primase subunit in <i>A. pernix</i> (COG1467), a more detailed
analysis was undertaken by running PSI-BLAST searches against the NR database
with all members of COG2219 as queries. When the <i>A. fulgidus</i>
primase sequence (AF0336) was used to initiate the search, the <i>A.
pernix</i> counterpart (APE0667) was indeed detected at a statistically
significant level.</p><p>An interesting case of re-annotation of a protein with a critical function, which
also led to more general conclusions, is the archaeal uracil DNA glycosylase
(UDG; COG1573). The members of this COG were originally annotated (and still
remain so labeled in GenBank) as a &#x0201c;DNA polymerase homologous
protein&#x0201d; (APE0427 from <i>A. pernix</i>) or as a
&#x0201c;DNA polymerase, bacteriophage type&#x0201d; (AF2277 from <i>A.
fulgidus</i>) or as a hypothetical protein. However, UDG activity has
been experimentally demonstrated for the COG1573 members from <i>T.
maritima</i> and <i>A. fulgidus</i> [<a href="/books/n/sef/A727/?report=reader#A1468">740</a>,<a href="/books/n/sef/A727/?report=reader#A1469">741</a>]. The
reason for the erroneous annotation of these proteins as DNA polymerases is
already well familiar to us: independent fusion of the uracil DNA glycosylase
with DNA polymerases was detected in bacteriophage SPO1 and in <i>Yersinia
pestis</i> [<a href="/books/n/sef/A727/?report=reader#A772">44</a>]. Although these
fusions hampered the correct annotation in the original analysis of the archaeal
genomes, they seem to be functionally informative, suggesting that this type of
UDG functions in conjunction with the replicative DNA polymerase.</p><p>The 1,102 COG members from <i>A. pernix</i> comprise 41% of
the total number of predicted genes. This percentage was significantly lower
than the average fraction of COG members (72%) for the other archaeal
species. It seems most likely that this was due to an overestimate of the total
number of ORFs in the genome. Many of the <i>A. pernix</i> ORFs with
no similarity to proteins in sequence databases (1,538, or 57.1%)
overlap with ORFs from conserved families, including COG members. On the basis
of the average representation of all genomes in the COGs (67%) and
the average for the other archaea (72%), one could estimate the total
number of <i>A. pernix</i> proteins to be between 1,550 and 1,700.
This range is also consistent with the size of the <i>A. pernix</i>
genome (1.67 Mb), given the gene density of about one gene per kilobase, which
is typical of bacteria and archaea. More conservatively, 849 ORFs, originally
annotated as probable protein-coding genes, significantly overlapped with COG
members and could be confidently eliminated, which brings the total number of
protein-coding genes in <i>A. pernix</i> to a maximum of 1,873.
Unfortunately, the spurious ORFs still remain in the NR database, polluting it
and potentially even leading to the emergence of ghost
&#x0201c;protein&#x0201d; families once new, related genomes are sequenced.
Evidence has been presented that spurious &#x0201c;proteins&#x0201d; have
been produced by other microbial genome projects as well [<a href="/books/n/sef/A727/?report=reader#A1505">777</a>], although probably not on the same scale as for
<i>A. pernix</i>. This regrettable pollution emphasizes the value
of specialized, curated databases that are free of apparitions.</p><p>Despite this overrepresentation of ORFs in <i>A. pernix</i>, we
nonetheless added 28 previously unidentified ORFs that were detected by
searching the genome sequence translated in all six frames for possible members
of COGs with unexpected phyletic patterns. These newly detected genes represent
conserved protein families, including functionally indispensable proteins, such
as chorismate mutase (APE0563a, COG1605), translation initiation factor IF-1
(APE_IF-1, COG0361), and seven ribosomal proteins (APE_rpl21E, COG2139;
APE_rps14, COG0199; APE_rpl29, COG0255; APE_rplX, COG2157; APE_rpl39E, COG2167;
APE_rpl34E, COG2174; APE_rps27AE, COG1998).</p><p>This pilot analysis, while falling far short of the goal of comprehensive genome
annotation, highlights some advantages of specialized comparative-genomic
databases as annotation tools. In this particular case, the original annotation
probably had been overly conservative, which partly accounts for the large
increase in the functional prediction rate. However, the employed protocol is
general and, with modifications and addition of some extra procedures, has been
used in primary genome analysis [<a href="/books/n/sef/A727/?report=reader#A1350">622</a>,<a href="/books/n/sef/A727/?report=reader#A1507">779</a>]. In other genome
projects, the WIT system has been employed in a conceptually similar manner
[<a href="/books/n/sef/A727/?report=reader#A907">179</a>,<a href="/books/n/sef/A727/?report=reader#A1146">418</a>]. As shown above, this type of analysis yields
reasonable accuracy of annotation, even when applied in a fully automated mode
(<a class="figpopup" href="/books/NBK20253/table/A275/?report=objectonly" target="object" rid-figpopup="figA275" rid-ob="figobA275">Table 5.3</a>). However, additional
expert contribution, particularly in the form of context analysis discussed in
the next section, adds substantial value to genome annotation.</p></div></div><div id="A276"><h2 id="_A276_">5.2. Genome Context Analysis and Functional Prediction</h2><p>All the preceding discussion in this chapter centered on prediction of the functions
of proteins encoded in sequenced genomes by extrapolating from the functions of
their experimentally characterized homologs. The success of this approach depends on
the sensitivity and selectivity of the methods that are used for detecting sequence
similarity (see <a href="/books/n/sef/A166/?report=reader">Chapter 4</a>) and on the
employed rules of inference (see <a href="#A265">5.1</a>). There
is no doubt that homology analysis remains the central methodology of genomics, i.e.
the one that produces the bulk of useful information. However, a group of recently
developed approaches in comparative genomics goes beyond sequence or structure
comparison. These methods have become collectively and, we think, aptly known as
genome context analysis [<a href="/books/n/sef/A727/?report=reader#A995">267</a>,<a href="/books/n/sef/A727/?report=reader#A1096">368</a>,<a href="/books/n/sef/A727/?report=reader#A1097">369</a>,<a href="/books/n/sef/A727/?report=reader#A1100">372</a>]. The notion of
&#x0201c;context&#x0201d; here includes all types of associations between genes
and proteins in the same or in different genomes that may point to functional
interactions and justify a verdict of &#x0201c;guilt by association&#x0201d;
[<a href="/books/n/sef/A727/?report=reader#A764">36</a>]: if gene A is involved in function
X and we obtain evidence that gene B functionally associates with A, then B is also
involved in X. More specifically, context in comparative genomics pertains to
phyletic profiles of protein families, domain fusions in multidomain proteins, gene
adjacency in genomes, and expression patterns. Indeed, genes whose products are
involved in closely related functions (e.g. form different subunits of a
multisubunit enzyme or participate in the same pathway) should all be either present
or absent in a certain set of genomes (i.e. have similar if not identical phyletic
patterns) and should be coordinately expressed (i.e. are expected to be encoded in
the same operon or at least to have similar expression patterns). This simple logic
gives us a potentially powerful way to assign genes that have no experimentally
characterized homologs to particular pathways or cellular systems. Although context
methods usually provide only rather general predictions, they represent a new and
important development in genomics that explicitly takes advantage of the rapidly
growing collection of sequenced genomes.</p><div id="A277"><h3>5.2.1. Phyletic patterns (profiles)</h3><p>Genes coding for proteins that function in the same cellular system or pathway
tend to have similar phyletic patterns [<a href="/books/n/sef/A727/?report=reader#A987">259</a>,<a href="/books/n/sef/A727/?report=reader#A1556">828</a>]. Numerous examples
for a variety of metabolic pathways are given in <a href="/books/n/sef/A371/?report=reader">Chapter 7</a>. These observations led to the suggestion that
this trend could be used in the reverse direction, i.e. to deduce functions of
uncharacterized genes [<a href="/books/n/sef/A727/?report=reader#A1393">665</a>]. However
attractive this idea might be, the real-life phyletic patterns are heavily
affected by such major evolutionary phenomena as partial redundancy in gene
functions, non-orthologous gene displacement, and lineage-specific gene loss. As
a result, there are thousands of different phyletic patterns in the COGs, most of
them represented only once or twice. Moreover, examination of a variety of
multi-component systems and biochemical pathways (<a href="/cgi-bin/COG/palox?sysall" ref="pagearea=body&amp;targetsite=external&amp;targetcat=link&amp;targettype=uri">http://www.ncbi.nlm.nih.gov/cgi-bin/COG/palox?sys=all</a>)
shows that, despite the tendency of the components of the same complex or
pathway to have similar patterns, there is not even one pathway in which <b>
<i>all</i>
</b> members show exactly the same pattern. Even the principal metabolic
pathways, such as glycolysis, TCA cycle, and purine and pyrimidine biosynthesis,
show considerable variability of phyletic patterns due to non-orthologous gene
displacement ([<a href="/books/n/sef/A727/?report=reader#A993">265</a>,<a href="/books/n/sef/A727/?report=reader#A998">270</a>,<a href="/books/n/sef/A727/?report=reader#A1098">370</a>], see
<a href="/books/n/sef/A371/?report=reader">Chapter 7</a>).</p><p>Because of this variability, the predictive power of the observation that two
genes have the same phyletic pattern is, in and of itself, limited. However,
when supported by other lines of evidence, such observations prove useful.
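</p><p>A minimal sketch of how such pattern comparisons might be set up is given here. It treats each phyletic pattern as the set of genomes in which a family is present, groups families with identical patterns, and uses the Jaccard index as one possible (assumed, not prescribed by the COG system) measure of pattern similarity. The family names, genome codes, and the 0.7 threshold are all hypothetical.</p>

```python
from collections import defaultdict

# Hypothetical toy data: family identifier -> set of genomes encoding it.
# Genome codes and memberships are invented for illustration only.
PATTERNS = {
    "COG_A": {"eco", "bsu", "hin", "mja"},
    "COG_B": {"eco", "bsu", "hin", "mja"},
    "COG_C": {"eco", "bsu", "hin"},
    "COG_D": {"mja"},
}


def group_by_pattern(patterns):
    """Group families that share exactly the same phyletic pattern."""
    groups = defaultdict(list)
    for family, genomes in patterns.items():
        groups[frozenset(genomes)].append(family)
    return {pattern: sorted(families) for pattern, families in groups.items()}


def jaccard(a, b):
    """Pattern similarity: genomes shared by both / genomes present in either."""
    union = a.union(b)
    return len(a.intersection(b)) / len(union) if union else 1.0


def similar_families(patterns, query, threshold=0.7):
    """Families whose pattern is at least `threshold` similar to the query's."""
    ref = patterns[query]
    return sorted(
        name for name, genomes in patterns.items()
        if name != query and jaccard(ref, genomes) >= threshold
    )
```

<p>With the toy data, <code>similar_families(PATTERNS, "COG_A")</code> returns <code>["COG_B", "COG_C"]</code>: the first has an identical pattern and the second differs by a single genome. Such hits are only candidates for functional linkage and need support from other evidence.</p><p>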
Somewhat counterintuitively, the universal pattern is one of the most strongly
indicative of gene function: among the 63 universal COGs, at least 56 consist of
proteins involved in translation. The functions of those few proteins in the
universal set that remain uncharacterized can be predicted with considerable
confidence through combination of this phyletic pattern with other lines of
evidence. For example, the uncharacterized protein YchF, which belongs to the
universal set (COG0012), is predicted by sequence analysis to be a GTPase; in
addition, this protein contains a C-terminal RNA-binding TGS domain [<a href="/books/n/sef/A727/?report=reader#A1637">909</a>]. Taken together with the ubiquity of
this protein and with the fact that, in phylogenetic trees, the archaeal members
of the COG clearly cluster with eukaryotic ones, this strongly suggests that
YchF is an uncharacterized, universal translation factor [<a href="/books/n/sef/A727/?report=reader#A995">267</a>]. This is supported by the juxtaposition of the
<i>ychF</i> gene with the gene for peptidyl-tRNA hydrolase
(<i>pth</i>) in numerous proteobacteria. The discussion of this
protein made us run ahead of ourselves and invoke other context methods, which
are considered in the next subsections, namely, analysis of domain fusions and
gene juxtaposition. This situation is quite typical: context methods are at
their best when they complement one another. Although statistical significance
estimates for a combination of context methods do not currently seem feasible,
in a case like YchF, the evidence appears to be, for all practical purposes,
irrefutable.</p><p>Another similar case involves the predicted ATPase or (more likely) kinase YjeE
from <i>E. coli</i> [<a href="/books/n/sef/A727/?report=reader#A984">256</a>] and
its orthologs from a majority of bacterial genomes that comprise COG0802. Domain
analysis identified this protein as a likely P-loop ATPase but failed to give
any indications as to its cellular role. The phyletic pattern of this COG shows
that YjeE is encoded in every bacterial genome, with the exception of <i>M.
genitalium</i>, <i>M. pneumoniae</i>, and <i>U.
urealyticum</i>, the only three bacterial species in the COG database
that do not form a cell wall. Since other conserved proteins with the same
phyletic pattern (MurA, MurB, MurG, FtsI, FtsW, DdlA) are enzymes of cell wall
biosynthesis, it can be predicted that YjeE is an ATPase or kinase involved in
the same process. Again, this prediction is supported by the adjacency of
<i>yjeE</i> to the gene for N-acetylmuramoyl-L-alanine amidase,
another cell wall biosynthesis enzyme.</p><p>There is more to phyletic pattern analysis than prediction based on identical or
similar patterns. Guilt by association can be established also through
identification of sets of genes that are <b>
<i>co-eliminated</i>
</b> in a given lineage; this approach exploits the widespread phenomenon of
lineage-specific gene loss. A systematic analysis of the set of genes that have
been co-eliminated in the yeast <i>S. cerevisiae</i> after its
divergence from the common ancestor with <i>S. pombe</i> led to the
prediction that a particular group of proteins, including one that contained a
helicase and a duplicated RNase III domain, was involved in post-transcriptional
gene silencing [<a href="/books/n/sef/A727/?report=reader#A783">55</a>]. This protein turned
out to be the now famous Dicer nuclease, which indeed has a central role in
silencing [<a href="/books/n/sef/A727/?report=reader#A1093">365</a>,<a href="/books/n/sef/A727/?report=reader#A1164">436</a>].</p><p>On many occasions, non-orthologous gene displacement manifests in <b>
<i>complementary</i>
</b>, rather than identical or similar, phyletic patterns, as we have seen
for phosphoglycerate mutase in <a href="/books/n/sef/A22/?report=reader#A43">2.2.6</a>. The
complementarity is rarely perfect because of partial functional redundancy: some
organisms, particularly those with larger genomes, often encode more than one
protein to perform the same function. This can be illustrated by the case of the
recently discovered new type of fructose-1,6-bisphosphate aldolase, referred to
as FbaB or DhnA [<a href="/books/n/sef/A727/?report=reader#A985">257</a>]. The two
well-known variants of this enzyme, class I (Schiff-base forming,
metal-independent) and class II (metal-dependent), were long considered to
be unrelated (analogous) enzymes until structural comparisons revealed their
underlying similarity (see Figure 1.9) [<a href="/books/n/sef/A727/?report=reader#A823">95</a>,<a href="/books/n/sef/A727/?report=reader#A915">187</a>,<a href="/books/n/sef/A727/?report=reader#A985">257</a>,<a href="/books/n/sef/A727/?report=reader#A1277">549</a>]. These
enzymes are generally limited in their phyletic distribution to eukaryotes
(class I) and bacteria (class II); some bacteria, however, have both variants
and yeast has the bacterial (class II) form of the enzyme [<a href="/books/n/sef/A727/?report=reader#A1277">549</a>]:</p><div class="graphic"><img src="/books/NBK20253/bin/ch5e1.jpg" alt="Image ch5e1.jpg" /></div><p>Sequencing of archaeal genomes revealed the absence of either form of the
fructose-1,6-bisphosphate aldolase. The same was the case with chlamydiae, which
were predicted to have a third form of this enzyme [<a href="/books/n/sef/A727/?report=reader#A1140">412</a>,<a href="/books/n/sef/A727/?report=reader#A1533">805</a>].
Indeed, investigation of the metal-independent fructose-1,6-bisphosphate
aldolase activity in <i>E. coli</i> led to the discovery of another
metal-independent Schiff-base-forming variant [<a href="/books/n/sef/A727/?report=reader#A1572">844</a>] whose sequence, however, was more closely related to those of
class II enzymes than to typical class I enzymes [<a href="/books/n/sef/A727/?report=reader#A985">257</a>]. Highly conserved homologs of this new, third form of
fructose-1,6-bisphosphate aldolase were found in chlamydial and archaeal
genomes:</p><div class="graphic"><img src="/books/NBK20253/bin/ch5e2.jpg" alt="Image ch5e2.jpg" /></div><p>As with phosphoglycerate mutase, combining these phyletic patterns shows almost
perfect complementarity, with aldolase missing only in
<i>Rickettsia</i>, which does not encode any glycolytic enzymes,
and in <i>Thermoplasma</i>, which appears to rely exclusively on the
Entner-Doudoroff pathway (see <a href="/books/n/sef/A371/?report=reader#A373">7.1.1</a>):</p><div class="graphic"><img src="/books/NBK20253/bin/ch5e3.jpg" alt="Image ch5e3.jpg" /></div><p>Other interesting examples of complementary phyletic patterns include
lysyl-tRNA synthetases, pyridoxine biosynthesis proteins PdxA and PdxZ [<a href="/books/n/sef/A727/?report=reader#A984">256</a>], thymidylate synthases [<a href="/books/n/sef/A727/?report=reader#A995">267</a>], and many others. The case of
thymidylate synthases is particularly remarkable. Thymidylate synthase is a
strictly essential enzyme of DNA precursor biosynthesis, and its apparent
absence in several bacterial and archaeal species became a major puzzle as their
genome sequences were reported.</p><div class="graphic"><img src="/books/NBK20253/bin/ch5e4.jpg" alt="Image ch5e4.jpg" /></div><p>The alternative thymidylate synthase was predicted [<a href="/books/n/sef/A727/?report=reader#A995">267</a>] on the basis of a phyletic pattern that was nearly
complementary (with just one case of redundancy) to that of the classic
thymidylate synthase (ThyA) and the report that the homolog of the COG1351
proteins from <i>Dictyostelium</i> complemented thymidylate synthase
deficiency [<a href="/books/n/sef/A727/?report=reader#A934">206</a>]. Just before this book
went to print, a new issue of <i>Science</i> reported the confirmation
of this prediction: not only was it shown that the COG1351 member from
<i>H. pylori</i> had thymidylate synthase activity, but also the
structure of this protein was solved and turned out to be unrelated to
that of ThyA [<a href="/books/n/sef/A727/?report=reader#A1317">589</a>,<a href="/books/n/sef/A727/?report=reader#A1326">598</a>].</p></div><div id="A278"><h3>5.2.2. Gene (domain) fusions: &#x0201c;guilt by association&#x0201d;</h3><p>It is fairly common that functionally interacting proteins that are encoded by
separate genes in some organisms are fused in a single polypeptide chain in
others. This has been confirmed by statistical analysis that demonstrated
general functional coherence of fused domains [<a href="/books/n/sef/A727/?report=reader#A1658">930</a>]. The advantages of a multidomain architecture are that this
organization facilitates functional complex assembly and may also allow reaction
intermediate channeling [<a href="/books/n/sef/A727/?report=reader#A1274">546</a>].</p><p>The basic assumption in the analysis of domain fusions is that a fusion will be
fixed during evolution only when it provides a selective advantage to the
organism in the form of improved functional interaction between proteins. Thus,
finding fused proteins (domains) in one species suggests that they might
interact, physically or at least functionally, in other species. In and of
itself, this notion is trivial and has been employed for predicting protein and
domain functions on an anecdotal basis for years (see [<a href="/books/n/sef/A727/?report=reader#A828">100</a>], just as an example). However, with the rapid growth
of the sequence information, the applicability of this approach widened and two
independent groups proposed, in well-publicized papers, that analysis of domain
fusions could be a general method for systematic and, moreover, automatic,
prediction of protein functions [<a href="/books/n/sef/A727/?report=reader#A941">213</a>,<a href="/books/n/sef/A727/?report=reader#A1274">546</a>]. In one of these
studies [<a href="/books/n/sef/A727/?report=reader#A1274">546</a>], domain fusions are
referred to as &#x0201c;Rosetta Stone&#x0201d; proteins &#x02013; clues to
deciphering the functions of their component domains, and this memorable name
stuck to the whole approach. (The Rosetta Stone metaphor is quite loose: the
famous stone, used by Jean-Fran&#x000e7;ois Champollion to decipher the Egyptian
hieroglyphs and now on public display in the British Museum, is a tri-lingua,
i.e. a monument that has on it the same text in three different languages. There
is nothing exactly like that about domain fusions; one can only say
vaguely that the &#x0201c;language&#x0201d; of domain fusions is translated
into the &#x0201c;language&#x0201d; of functional interactions. The
&#x0201c;guilt by association&#x0201d; simile [<a href="/books/n/sef/A727/?report=reader#A764">36</a>] seems much more apt if less glamorous).</p><p>In his comment on the &#x0201c;Rosetta Stone&#x0201d; excitement, Russell
Doolittle pointed out that cases that establish a link between two well-known
domains or those that link two unknown domains are not likely to lead to any
scientific breakthroughs [<a href="/books/n/sef/A727/?report=reader#A916">188</a>]. Only
those &#x0201c;Rosetta Stone&#x0201d; proteins, in which an unknown domain
is linked to a previously characterized one, can be used to infer the
function(s) of the uncharacterized domain. Analysis of domain fusions in
complete microbial genomes indicates that they are a complex mixture of
informative, uninformative and potentially misleading cases, which certainly
provide many clues to functions of uncharacterized domains. However,
interpretations stemming from domain fusion seem to require case-by-case
examination by human experts and, most of the time, become really useful only
when combined with other lines of evidence.</p><p>One of the advantages of the guilt by association approach is that, at least in
principle, it allows transitive closure, i.e. expansion of functional
associations between transitively connected components. In other words,
detection of domain combinations AB, BC, and CD suggests that domains A, B, C
and D form a functional network. This approach has been successfully applied to
the analysis of prokaryotic signal-transduction systems, resulting in the
prediction of several new signaling domains. Participation of these domains in
signaling cascades was originally proposed solely on the basis of their
conserved domain architectures and subsequently confirmed experimentally [<a href="/books/n/sef/A727/?report=reader#A997">269</a>].</p><p>In <a class="figpopup" href="/books/NBK20253/figure/A279/?report=objectonly" target="object" rid-figpopup="figA279" rid-ob="figobA279">Figure 5.3</a>, we illustrate the
&#x0201c;guilt by association&#x0201d; approach using the peptide methionine
sulfoxide reductase example discussed in the previous section as a case of
annotation complicated by domain fusion. As in the examples above, the logic of
the analysis does not allow us to use domain fusions only; we also have to
invoke phyletic patterns and organization of genes in the genome.
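</p><p>The transitive closure described above (fusions AB, BC, and CD implying a network of A, B, C, and D) amounts to finding connected components in a graph whose nodes are domains and whose edges are observed fusions. The sketch below is one simple way to do this, not the procedure used in the cited studies; the function name and the unrelated X-Y pair are hypothetical.</p>

```python
from collections import defaultdict


def fusion_networks(fusions):
    """Return connected components of the domain co-occurrence graph.

    `fusions` is an iterable of (domain, domain) pairs, each pair meaning
    that the two domains were observed fused in at least one protein.
    """
    graph = defaultdict(set)
    for left, right in fusions:
        graph[left].add(right)
        graph[right].add(left)

    seen = set()
    networks = []
    for start in sorted(graph):
        if start in seen:
            continue
        # Depth-first traversal collects every domain reachable from `start`.
        component = set()
        stack = [start]
        while stack:
            domain = stack.pop()
            if domain in component:
                continue
            component.add(domain)
            stack.extend(graph[domain] - component)
        seen.update(component)
        networks.append(sorted(component))
    return networks


# Abstract example from the text (AB, BC, CD) plus a hypothetical unrelated pair.
observed = [("A", "B"), ("B", "C"), ("C", "D"), ("X", "Y")]
print(fusion_networks(observed))
```

<p>For this input the function groups A, B, C, and D into one network and X and Y into another. Real data call for caution, because a single widely distributed domain can merge otherwise unrelated networks into one.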

</p><div class="iconblock whole_rhythm clearfix ten_col fig" id="figA279" co-legend-rid="figlgndA279"><a href="/books/NBK20253/figure/A279/?report=objectonly" target="object" title="Figure 5.3" class="img_link icnblk_img figpopup" rid-figpopup="figA279" rid-ob="figobA279"><img class="small-thumb" src="/books/NBK20253/bin/ch5f3.gif" src-large="/books/NBK20253/bin/ch5f3.jpg" alt="Figure 5.3. A Rosetta Stone case: domain fusions and gene clusters that involve peptide methionine sulfoxide reductases." /></a><div class="icnblk_cntnt" id="figlgndA279"><h4 id="A279"><a href="/books/NBK20253/figure/A279/?report=objectonly" target="object" rid-ob="figobA279">Figure 5.3</a></h4><p class="float-caption no_bottom_margin">A Rosetta Stone case: domain fusions and gene clusters that
involve peptide methionine sulfoxide reductases. </p></div></div><p>In most organisms, protein methionine sulfoxide reductase A (MsrA) is a small,
single-domain protein. However, in <i>H. influenzae</i>, <i>H.
pylori</i>, and <i>T. pallidum</i>, it is fused with another,
highly conserved domain (MsrB) that is found as a distinct protein in all other
organisms that encode MsrA. In other words, the two fusion components show the
same phyletic patterns:</p><div class="graphic"><img src="/books/NBK20253/bin/ch5e5.jpg" alt="Image ch5e5.jpg" /></div><p>In <i>B. subtilis</i>, the genes for MsrA and MsrB are not fused, but
are adjacent and may form an operon. In contrast, in <i>T.
pallidum</i>, MsrA and MsrB are fused, but in reverse order, compared
to <i>H. influenzae</i> and <i>H. pylori</i> (<a class="figpopup" href="/books/NBK20253/figure/A279/?report=objectonly" target="object" rid-figpopup="figA279" rid-ob="figobA279">Figure 5.3</a>). The <i>H.
influenzae</i> and <i>H. pylori</i> &#x0201c;Rosetta
Stone&#x0201d; proteins are most closely related to each other, but the one
from <i>T. pallidum</i> does not show particularly strong similarity
to either of them, suggesting two independent fusion events in these two
lineages.</p><p>In <i>Neisseria</i> and <i>Fusobacterium</i>, a third,
thioredoxin-like domain joins the MsrAB fusion (<a class="figpopup" href="/books/NBK20253/figure/A279/?report=objectonly" target="object" rid-figpopup="figA279" rid-ob="figobA279">Figure 5.3</a>). In <i>H. influenzae</i>, the ortholog of this
predicted thioredoxin is encoded two genes upstream of MsrAB. The gene in
between encodes a conserved integral membrane protein, designated CcdA for its
requirement for cytochrome c biogenesis in <i>B. subtilis</i>. Its
ortholog is encoded next to MsrAB in <i>H. pylori</i> and next to
thioredoxin in several other genomes (<a class="figpopup" href="/books/NBK20253/figure/A279/?report=objectonly" target="object" rid-figpopup="figA279" rid-ob="figobA279">Figure
5.3</a>).</p><p>Combining all this evidence from the guilt by association approach, gene
adjacency data, phyletic profiles, and sequence analysis, it has been predicted
that MsrA, MsrB, and thioredoxin form an enzymatic complex, which catalyzes a
cascade of redox reactions and is associated with the bacterial membrane via
CcdA. However, this is probably not the only complex in which MsrAB is involved,
because not all genomes that have this gene pair also encode CcdA (<a class="figpopup" href="/books/NBK20253/figure/A279/?report=objectonly" target="object" rid-figpopup="figA279" rid-ob="figobA279">Figure 5.3</a>). Since the publication of this
prediction, it has been largely confirmed by the demonstration that MsrB is a
second, distinct, thioredoxin-dependent peptide methionine sulfoxide reductase,
which cooperates with MsrA in the defense of bacterial cells against reactive
oxygen species [<a href="/books/n/sef/A727/?report=reader#A1044">316</a>,<a href="/books/n/sef/A727/?report=reader#A1254">526</a>,<a href="/books/n/sef/A727/?report=reader#A1504">776</a>]. However, the CcdA connection remains to be investigated.</p><p>This case study demonstrates both the considerable potential of domain fusion
analysis as a tool for protein function prediction, particularly when combined
with other context-based and homology-based approaches, and potential problems.
One could be tempted to extend the small network of domains shown in <a class="figpopup" href="/books/NBK20253/figure/A279/?report=objectonly" target="object" rid-figpopup="figA279" rid-ob="figobA279">Figure 5.3</a> by including other domains that
form fusions (or are encoded by adjacent genes) with the thioredoxin domain. It
appears, however, that such an extension would be ill-advised. Firstly,
orthologous relationships among thioredoxins are ambiguous, and secondly,
although thioredoxins are not among the most &#x0201c;promiscuous&#x0201d;
domains, the variety of their &#x0201c;guilt by association&#x0201d; links
still is sufficiently large to make any predictions regarding potential
functional connections between the respective domains and MsrAB dubious at best.
These two issues, identification of orthologs and
&#x0201c;promiscuity&#x0201d; characteristic of certain domains, are the
principal problems encountered by the &#x0201c;guilt by association&#x0201d;
approach. Domain fusions often are found only within a specialized, narrow group
of orthologous protein domains, and translating their functional interaction
into a general prediction for the respective domains is likely to be grossly
misleading. A relatively small number of &#x0201c;promiscuous&#x0201d;
domains, particularly those involved in signal transduction and different forms
of regulation (e.g. CBS, PAS, GAF domains), combine with a variety of other
domains that otherwise have nothing in common and therefore significantly
increase the number of false-positives among the Rosetta Stone predictions.
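</p><p>One crude way to flag potentially promiscuous domains, offered here only as a sketch under stated assumptions, is to count the distinct partners with which each domain co-occurs and to set aside those above a cutoff. The cutoff of three partners is arbitrary, and the partner names in the toy input are invented; only PAS, MsrA, and MsrB are domain names taken from the text.</p>

```python
from collections import defaultdict


def partner_counts(architectures):
    """Count distinct fusion partners for each domain.

    `architectures` is an iterable of domain-name tuples, one per protein.
    """
    partners = defaultdict(set)
    for domains in architectures:
        unique = set(domains)
        for domain in unique:
            partners[domain].update(unique - {domain})
    return {domain: len(others) for domain, others in partners.items()}


def promiscuous_domains(architectures, max_partners=3):
    """Domains with more than `max_partners` distinct partners (arbitrary cutoff)."""
    counts = partner_counts(architectures)
    return sorted(d for d, n in counts.items() if n > max_partners)


# Invented architectures; the partner names are placeholders, not real data.
toy = [
    ("PAS", "kinase"),
    ("PAS", "cyclase"),
    ("PAS", "HTH"),
    ("PAS", "phosphatase"),
    ("MsrA", "MsrB"),
]
```

<p>Here <code>promiscuous_domains(toy)</code> returns <code>["PAS"]</code>, whereas MsrA and MsrB each have a single partner and are retained. A fixed cutoff of this kind cannot distinguish a genuinely versatile domain from one that is merely well sampled, which is one reason manual inspection remains necessary.</p><p>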
Although it is possible to simply exclude the worst known offenders from any
Rosetta Stone analysis [<a href="/books/n/sef/A727/?report=reader#A1274">546</a>], other
domains also have the potential of showing &#x0201c;illicit&#x0201d;
behavior and compromising the results. Manual detection of such cases is
relatively straightforward, but automation of this process may be
complicated.</p></div><div id="A280"><h3>5.2.3. Gene clusters and genomic neighborhoods</h3><p>As already mentioned in <a href="/books/n/sef/A22/?report=reader">Chapter 2</a>,
comparisons of complete bacterial genomes have revealed the lack of large-scale
conservation of the gene order even between relatively close species, such as
<i>E. coli</i> and <i>H. influenzae</i> [<a href="/books/n/sef/A727/?report=reader#A1323">595</a>,<a href="/books/n/sef/A727/?report=reader#A1557">829</a>] or <i>E. coli</i> and <i>P. aeruginosa</i>
(<a href="/books/n/sef/A22/?report=reader#A1679">Figure 2.6B</a>). Although these pairs
of genomes have numerous similar strings of adjacent genes (most of them
predicted operons), comparisons of more distantly related bacterial and archaeal
genomes have shown that, at large phylogenetic distances, even most of the
operons are extensively rearranged [<a href="/books/n/sef/A727/?report=reader#A1189">461</a>,<a href="/books/n/sef/A727/?report=reader#A1612">884</a>]. The few operons
that are conserved across distantly related genomes typically encode physically
interacting proteins, such as ribosomal proteins or subunits of the H<sup>+</sup>-ATPase and
ABC-type transporter complexes [<a href="/books/n/sef/A727/?report=reader#A897">169</a>,<a href="/books/n/sef/A727/?report=reader#A1113">385</a>,<a href="/books/n/sef/A727/?report=reader#A1189">461</a>,<a href="/books/n/sef/A727/?report=reader#A1323">595</a>].</p><p>It should be noted that only a relatively small number of operons have been
identified experimentally, primarily in well-characterized bacteria, such as
<i>E. coli</i> and <i>B. subtilis</i> [<a href="/books/n/sef/A727/?report=reader#A1091">363</a>,<a href="/books/n/sef/A727/?report=reader#A1460">732</a>]. However, analysis of gene strings that are conserved in
bacterial and archaeal genomes strongly suggested that the great majority of them
do form operons [<a href="/books/n/sef/A727/?report=reader#A1644">916</a>]. This conclusion
was based on the following principal arguments: (i) as shown by Monte Carlo
simulations, the likelihood that identical strings of more than two genes are
found by chance in more than two genomes is extremely low; (ii) most of those
conserved strings that include characterized genes either are known operons or
include functionally linked genes and can be predicted to form operons; (iii)
typical conserved gene strings include 2 to 4 genes, which is the characteristic
size of operons; (iv) conserved gene strings that include genes from adjacent,
independent operons are extremely rare; (v) nearly all conserved gene strings
consist of genes that are transcribed in the same direction [<a href="/books/n/sef/A727/?report=reader#A1644">916</a>]. As a result, one can usually assume
that conserved gene strings are co-regulated, i.e. form operons, even if they
contain additional promoters.</p><p>Pairwise genome comparisons showed that, on average, ~10% of the genes
in each genome belong to gene strings that are conserved in at least one of the
other available genomes [<a href="/books/n/sef/A727/?report=reader#A1113">385</a>,<a href="/books/n/sef/A727/?report=reader#A1644">916</a>]. These numbers vary widely from
&#x0003c;5% for the cyanobacterium <i>Synechocystis</i> sp.
to 23&#x02013;24% in <i>T. maritima</i> and <i>M.
genitalium</i>; the fraction of genes that belonged to predicted
operons in the archaeal genomes was only slightly lower than that in bacterial
genomes [<a href="/books/n/sef/A727/?report=reader#A1644">916</a>].</p><p>These observations indicate that conserved gene strings are under stabilizing
selection that prevents their disruption. For functionally related genes (e.g.
those encoding proteins that function in the same pathway or multimeric
complex), this selective pressure probably comes from the necessity to
synchronize their expression. This conclusion holds even in the face of the
&#x0201c;selfish operon&#x0201d; hypothesis, which posits that operons
survive during evolution <b>
<i>because</i>
</b> they are disseminated via HGT [<a href="/books/n/sef/A727/?report=reader#A1222">494</a>,<a href="/books/n/sef/A727/?report=reader#A1223">495</a>]. We believe that
the selfish operon hypothesis puts the cart before the horse: operons
certainly do spread via HGT, but their transfer leads to fixation more often
than transfer of individual genes because of the selective advantage conferred
to the recipient by the acquired operon. In contrast, for functionally unrelated
genes, there would be no selection towards coexpression. Therefore, an
observation of similar operons found in phylogenetically distant species can be
considered an indication of a potential functional relationship between the
corresponding genes, even if these genes are scattered in other genomes. Because
of the simplicity and elegance of this approach to functional analysis of
complete genomes, there are several web sites that offer slightly different
approaches to delineation of the conserved gene strings.</p><div id="A281"><h4>WIT/ERGO</h4><p>The operon comparison tool in the WIT database (<a href="http://wit.mcs.anl.gov" ref="pagearea=body&amp;targetsite=external&amp;targetcat=link&amp;targettype=uri">http://wit.mcs.anl.gov</a>),
the first of the genome context-based tools, was developed by Ross Overbeek
in 1998 [<a href="/books/n/sef/A727/?report=reader#A1368">640</a>,<a href="/books/n/sef/A727/?report=reader#A1369">641</a>]. This tool identifies conserved gene strings by
searching for pairs of homologous proteins that are encoded by genes located
no more than 300 bp apart on the same DNA strand in each of the analyzed
genomes. Each of these pairs is then assigned a score based on the
evolutionary distance between the respective species on the rRNA-based
phylogenetic tree. It is expected that chance occurrence of pairs of
homologous genes in distantly related species is less likely than in closely
related ones, so such pairs are more likely to be functionally relevant.
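</p><p>As a rough illustration of this procedure, the following Python sketch collects pairs of ortholog groups whose members lie on the same strand no more than 300 bp apart in two or more genomes. The <code>Gene</code> record, the group labels, the toy coordinates and the distance table are all hypothetical, and summing pairwise distances is only one plausible way to reward distantly related genomes; the actual scoring formula used by WIT is not reproduced here.</p>

```python
from collections import defaultdict
from itertools import combinations
from typing import NamedTuple


class Gene(NamedTuple):
    genome: str  # genome label (hypothetical)
    group: str   # ortholog group label, assumed to be precomputed
    start: int   # start coordinate on the chromosome
    end: int     # end coordinate, with end >= start
    strand: str  # "+" or "-"


def close_pairs(genes, max_gap=300):
    """Map each pair of ortholog groups to the genomes in which two of their
    members are consecutive, on the same strand, and at most max_gap bp apart."""
    by_genome = defaultdict(list)
    for gene in genes:
        by_genome[gene.genome].append(gene)
    found = defaultdict(set)
    for genome, members in by_genome.items():
        members.sort(key=lambda g: g.start)
        for left, right in zip(members, members[1:]):
            gap = right.start - left.end - 1
            same_strand = left.strand == right.strand
            if same_strand and max_gap >= gap and left.group != right.group:
                found[frozenset((left.group, right.group))].add(genome)
    return found


def score_pairs(found, distance):
    """Score pairs seen in two or more genomes by summing an assumed
    phylogenetic distance over all pairs of genomes that share them."""
    scores = {}
    for pair, genomes in found.items():
        if len(genomes) >= 2:
            scores[pair] = sum(distance[frozenset(p)]
                               for p in combinations(sorted(genomes), 2))
    return scores


# Toy input: three genomes, invented coordinates.
genes = [
    Gene("A", "aroE", 100, 900, "+"), Gene("A", "kinase", 950, 1800, "+"),
    Gene("A", "other", 2500, 3000, "+"),
    Gene("B", "aroE", 5000, 5800, "-"), Gene("B", "kinase", 5900, 6700, "-"),
    Gene("C", "aroE", 100, 900, "+"), Gene("C", "kinase", 1000, 1800, "-"),
]
found = close_pairs(genes)
scores = score_pairs(found, {frozenset(("A", "B")): 0.8})
```

<p>In this toy input, only the aroE/kinase pair is retained, and only for genomes A and B: in genome C the two genes lie on opposite strands, and the third gene in genome A is more than 300 bp away from its neighbor.</p><p>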
Homologous genes are defined as bidirectional best hits in all-against-all
BLAST comparisons, which is similar to the method used in constructing the
COG database [<a href="/books/n/sef/A727/?report=reader#A1556">828</a>].</p><p>Because the number of potential gene linkages grows exponentially with the
number of analyzed genomes [<a href="/books/n/sef/A727/?report=reader#A1368">640</a>], the sensitivity of methods based on the detection of conserved
gene strings can be significantly improved by taking into consideration even
unfinished genome sequences. For this reason, WIT and ERGO databases include
many incomplete genome sequences from the DOE Joint Genome Institute and
other sequencing centers. This approach was used in the successful
reconstruction of several known metabolic pathways and led to the correct
prediction of candidate genes for some previously uncharacterized metabolic
enzymes [<a href="/books/n/sef/A727/?report=reader#A810">82</a>,<a href="/books/n/sef/A727/?report=reader#A899">171</a>,<a href="/books/n/sef/A727/?report=reader#A1369">641</a>].
Unfortunately, while this book was in preparation, the ERGO database was
closed to the public, and WIT was still missing some useful
functionality. We will therefore illustrate the use of the method by
exploiting a somewhat similar tool in the COG database.</p></div><div id="A282"><h4>COGs</h4><p>The COG database (<a href="/COG" ref="pagearea=body&amp;targetsite=external&amp;targetcat=link&amp;targettype=uri">http://www.ncbi.nlm.nih.gov/COG</a>) allows a simple and
straightforward search for conserved operons. Because all proteins in the
same COG are presumed to be orthologs, the &#x0201c;Genome
context&#x0201d; view, available from each COG page, shows the genes that
encode members of the given COG together with the surrounding genes. Genes
whose products belong to the same COG are identically colored. This provides
for easy identification of sets of COGs that tend to be clustered in
genomes. Of course, this tool only works for the genes whose products belong
to COGs, so the relationships between genes that are found in only two
complete genomes and hence do not belong to any COG would be missed. An
exhaustive matching of the co-localization of genes encoding members of the
same two COGs allowed new functional predictions for almost 90 COGs, which
comprised ~4% of the total set [<a href="/books/n/sef/A727/?report=reader#A1197">469</a>,<a href="/books/n/sef/A727/?report=reader#A1644">916</a>].</p><p>For a practical example of the use of this method, let us consider the search
for the archaeal shikimate kinase, an enzyme that is not homologous to the
bacterial shikimate kinase (AroK) and hence was not found by traditional
sequence similarity searches [<a href="/books/n/sef/A727/?report=reader#A899">171</a>].
Reconstruction of the aromatic amino acid biosynthesis pathway in archaea
showed that genomes of <i>A. fulgidus</i>, <i>M.
jannaschii</i>, and <i>M. thermoautotrophicum</i> encoded
orthologs of bacterial enzymes for all but three reactions of this pathway
([<a href="/books/n/sef/A727/?report=reader#A1268">540</a>], see <a href="/books/n/sef/A371/?report=reader#A452">Figure 7.6</a>).</p><p>Two of these missing enzymes catalyze the first and second reactions of the
pathway, indicating that aromatic amino acid biosynthesis in (most) archaea uses
different precursors than in bacteria, whereas the third reaction,
phosphorylation of shikimate, was attributed to a non-orthologous kinase,
encoded only in archaea [<a href="/books/n/sef/A727/?report=reader#A1268">540</a>].
Daugherty and coworkers made a list of the genes involved in aromatic amino
acid biosynthesis in archaea and looked for potential neighbors of the
<i>aroE</i> gene whose product, shikimate dehydrogenase,
catalyzes the reaction immediately preceding the phosphorylation of
shikimate (<a href="/books/n/sef/A371/?report=reader#A452">Figure 7.6</a>). In the <i>P.
abyssi</i> genome, the <i>aroE</i> gene (PAB0300) was
followed by an uncharacterized gene (PAB0301) encoding a predicted kinase,
which is distantly related to homoserine kinases. This was also the case in the
<i>A. pernix</i> and <i>T. acidophilum</i> genomes,
where the PAB0301-like gene (COG1685, <a class="figpopup" href="/books/NBK20253/figure/A283/?report=objectonly" target="object" rid-figpopup="figA283" rid-ob="figobA283">Figure
5.4</a>) was found sandwiched between the <i>aroE</i> gene
and the <i>aroA</i> gene, whose product catalyzes the next step of
the pathway after shikimate phosphorylation [<a href="/books/n/sef/A727/?report=reader#A899">171</a>]. Genes encoding PAB0301 orthologs (COG1685) were
also found in other archaeal genomes, but not in any of the bacterial
genomes that contain the typical <i>aroK</i> gene (<a class="figpopup" href="/books/NBK20253/figure/A283/?report=objectonly" target="object" rid-figpopup="figA283" rid-ob="figobA283">Figure 5.4</a>). Given this connection,
Daugherty et al. expressed MJ1440, the COG1685 member from <i>M.
jannaschii</i>, and demonstrated that it indeed had shikimate kinase
activity [<a href="/books/n/sef/A727/?report=reader#A899">171</a>].
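</p><p>The kind of neighbor search described above can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the genome labels and gene orders below are invented, each gene is represented solely by a family label (with <code>None</code> for a gene without an assignment), strand is ignored, and the helper <code>neighbor_counts</code> is a hypothetical name rather than part of any COG software.</p>

```python
from collections import Counter


def neighbor_counts(gene_orders, query, window=1):
    """Count, for each family label, the number of genomes in which it lies
    within `window` positions of a member of the query family."""
    counts = Counter()
    for order in gene_orders.values():
        nearby = set()
        for i, label in enumerate(order):
            if label == query:
                lo = max(0, i - window)
                nearby.update(order[lo:i] + order[i + 1:i + 1 + window])
        nearby.discard(query)  # ignore tandem copies of the query family
        nearby.discard(None)   # ignore genes without a family assignment
        counts.update(nearby)  # each genome contributes at most once per family
    return counts


# Toy gene orders, invented for illustration only.
gene_orders = {
    "genome1": ["aroE", "COG1685", None, "aroA"],
    "genome2": [None, "aroE", "COG1685", "aroA"],
    "genome3": ["aroA", None, "aroE", "COG1685"],
    "genome4": ["aroE", None, "aroK"],
}
counts = neighbor_counts(gene_orders, "aroE")
print(counts.most_common(1))  # [('COG1685', 3)]
```

<p>Counting genomes rather than individual occurrences keeps a single genome with several copies of a gene pair from dominating the tally; a family that recurs next to the query in several distantly related genomes is the kind of candidate that merits experimental testing.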


</p><div class="iconblock whole_rhythm clearfix ten_col fig" id="figA283" co-legend-rid="figlgndA283"><a href="/books/NBK20253/figure/A283/?report=objectonly" target="object" title="Figure 5.4" class="img_link icnblk_img figpopup" rid-figpopup="figA283" rid-ob="figobA283"><img class="small-thumb" src="/books/NBK20253/bin/ch5f4.gif" src-large="/books/NBK20253/bin/ch5f4.jpg" alt="Figure 5.4. Genome context of COG1685 &#x0201c;Archaeal shikimate kinase&#x0201d;." /></a><div class="icnblk_cntnt" id="figlgndA283"><h4 id="A283"><a href="/books/NBK20253/figure/A283/?report=objectonly" target="object" rid-ob="figobA283">Figure 5.4</a></h4><p class="float-caption no_bottom_margin">Genome context of COG1685 &#x0201c;Archaeal shikimate
kinase&#x0201d;.  Each line corresponds to an individual genome: aful,
<i>Archaeoglobus fulgidus</i>; hbsp,
<i>Halobacterium</i> sp.; mjan,
<i>Methanococcus jannaschii</i>; mthe,
<i>Methanobacterium thermoautotrophicum</i>; pyro,
 <a href="/books/NBK20253/figure/A283/?report=objectonly" target="object" rid-ob="figobA283">(more...)</a></p></div></div></div><div id="A284"><h4>STRING</h4><p>The Search Tool for Recurring Instances of Neighbouring Genes (STRING,
<a href="http://www.bork.embl-heidelberg.de/STRING" ref="pagearea=body&amp;targetsite=external&amp;targetcat=link&amp;targettype=uri">http://www.bork.embl-heidelberg.de/STRING</a>), developed by
Peer Bork and colleagues, is based on a similar approach [<a href="/books/n/sef/A727/?report=reader#A1516">788</a>]. Gene clusters are defined by
STRING the same way as in WIT, namely as strings of genes on the same strand
located no more than 300 bp from each other. Orthologs are identified as
bidirectional best hits using Smith-Waterman comparisons. The STRING search
starts from a single protein sequence that can be entered as a FASTA file or
just by its gene name in the complete genome. The sequence entered in FASTA
format is compared against the database of all proteins encoded in complete
genomes so that the user can choose one of the best hits for further
examination. Like COGs, STRING contains information only on completely
sequenced genomes. The default option in STRING further reduces the number
of analyzed genomes by eliminating closely related ones (this option can be
switched off by the user). Additionally, STRING features a useful tool that
allows the user to perform an &#x0201c;iterative&#x0201d; analysis of
gene neighborhoods. After the nearest neighbors of a gene in question are
identified, the next &#x0201c;iteration&#x0201d; of STRING looks
for their neighbors and records whether any of these were found previously. If no
new neighbors are found, STRING reports that the search has
&#x0201c;converged&#x0201d;. If this does not happen even after five
consecutive search cycles, the program simply tabulates how many times
each particular gene was found in the output. Combined with impressive graphics,
this approach makes STRING a fast and convenient tool to search for
consistent gene associations in complete genomes.</p></div><div id="A285"><h4>SNAPper</h4><p>The SNAP (Similarity-Neighbourhood APproach) tool at MIPS (<a href="http://mips.gsf.de/cgi-bin/proj/snap/znapit.pl" ref="pagearea=body&amp;targetsite=external&amp;targetcat=link&amp;targettype=uri">http://mips.gsf.de/cgi-bin/proj/snap/znapit.pl</a>, [<a href="/books/n/sef/A727/?report=reader#A1175">447</a>]) is similar to STRING, but
instead of precomputed pairs of orthologs, it simply looks for BLAST hits
with user-defined E-values. In addition, SNAP does not require the related
genes to form conserved gene strings; they only need to be in the vicinity
of each other. SNAPper looks for the homologs of the given protein, then
takes neighbors of the corresponding genes, looks for their homologs, and so
on [<a href="/books/n/sef/A727/?report=reader#A1175">447</a>]. The program then builds a
similarity-neighborhood graph (SN-graph), which consists of the chains of
orthologous genes in different genomes and adjacent genes in the same
genome. The hits that form a closed SN-graph, i.e. recognize the original
set of homologs, are predicted to be functionally related. The advanced
version of SNAPper offers the choice of several parameters, which allow
fine-tuning the performance of the tool depending on the particular query
protein.</p></div><div id="A286"><h4>KEGG</h4><p>In contrast to the tools described above, identification of gene strings in
the KEGG database (<a href="http://www.genome.ad.jp/kegg-bin/mk_genome_cmp_html" ref="pagearea=body&amp;targetsite=external&amp;targetcat=link&amp;targettype=uri">http://www.genome.ad.jp/kegg-bin/mk_genome_cmp_html</a>) is
geared toward analysis of operon conservation. It allows one to find
all genes in any two selected complete genomes whose products are
sufficiently similar to each other and are separated by no more than five
genes. The user can specify the desired degree of similarity between the
proteins in terms of the minimal pairwise BLAST score (or maximal E-value),
the minimal length of the alignment, and the type of BLAST hits
(bidirectional or unidirectional hits, or just any hits with the specified
BLAST score). The user can also specify maximum allowable distances between
the genes in either organism, limiting it to any number of genes from zero
to five. This option allows one to retrieve much more distant gene pairs
than those detected by the ERGO tool. The downside of this richness is that,
unless one uses fairly strict criteria for protein similarity and
intergenic distances, one will end up with dozens or even hundreds of
reported gene pairs, few of which would have predictive power. Nonetheless,
a sensible use of this tool can bring some very interesting results [<a href="/books/n/sef/A727/?report=reader#A996">268</a>].</p></div><div id="A287"><h4>Genome context tools in genome annotation</h4><p>To evaluate the power of gene order-based methods for making functional
predictions, we have isolated those cases where a substantial functional
prediction did not appear possible without explicit use of gene adjacency
information [<a href="/books/n/sef/A727/?report=reader#A1644">916</a>]. In spite of the
inherent subjectivity of such assessments, the result was instructive: such
unique predictions were made for ~90 genes (more precisely, COGs) or
~4% of all COGs analyzed. Given that, as noted above,
homology-based approaches already allow functional predictions for a
majority of the genes in each sequenced prokaryotic genome, this places
gene-string analysis in the position of an important accessory methodology
in the hierarchy of genome annotation approaches. Other genome context-based
methods may also be useful but are clearly less powerful. This is, of
course, a pessimistic assessment because more subtle changes in prediction
for genes already annotated by homology-based methods were not taken into
account.</p><p>These limitations notwithstanding, some of the predictions made on the basis
of gene order conservation combined with homology information seem to be
exceptionally important. Perhaps the most straightforward case is the
prediction of the archaeal exosome, a complex of RNases, RNA-binding
proteins and helicases that mediates processing and
3&#x02019;&#x02192;5&#x02019; degradation of a variety of RNA species
[<a href="/books/n/sef/A727/?report=reader#A1197">469</a>]. This finding was made by
examination of archaeal genome alignments, which led to the detection of a
large superoperon that, in its complete form, consists of 15 genes. This
full complement of co-localized genes, however, is present in only one
species, <i>M. thermoautotrophicum</i>, whereas, in all other
archaea, the superoperon is partially disrupted and, in some cases, certain
genes have been lost altogether. Remarkably, the predicted exosomal
superoperon also includes genes for proteasome subunits. According to the
logic outlined above, this points to a hitherto unknown functional and
possibly even physical association between the proteasome and the exosome,
the machines for controlled degradation of RNA and proteins,
respectively.</p><p>Gene order-based functional prediction seems to be impossible for eukaryotes
because of the apparent lack of clustering of functionally linked genes.
However, several operons that have been identified in <i>C.
elegans</i> [<a href="/books/n/sef/A727/?report=reader#A1373">645</a>,<a href="/books/n/sef/A727/?report=reader#A1622">894</a>,<a href="/books/n/sef/A727/?report=reader#A1672">944</a>] comprise the first exceptions to this rule and suggest that
gene order analysis could eventually be used for eukaryotes, too. Besides,
the above prediction of proteasome-exosome association might potentially
extend to eukaryotes, offering yet another example of the use of prokaryotic
genome comparisons for understanding the eukaryotic cell.</p><p>Given the fluidity of gene order in prokaryotes, detection of subtle
conservation patterns requires fairly sophisticated computational procedures
that search for <b>
<i>gene neighborhoods</i>
</b>, sets of genes that tend to cluster together in multiple genomes,
but do not necessarily show extensive conservation of exact gene order
[<a href="/books/n/sef/A727/?report=reader#A1175">447</a>,<a href="/books/n/sef/A727/?report=reader#A1219">491</a>,<a href="/books/n/sef/A727/?report=reader#A1368">640</a>,<a href="/books/n/sef/A727/?report=reader#A1369">641</a>,<a href="/books/n/sef/A727/?report=reader#A1437">709</a>]. One of the interesting findings
that have been made possible through these approaches is the prediction of a
new DNA repair system in archaeal and bacterial hyperthermophiles [<a href="/books/n/sef/A727/?report=reader#A1269">541</a>]. As shown in <a class="figpopup" href="/books/NBK20253/figure/A1681/?report=objectonly" target="object" rid-figpopup="figA1681" rid-ob="figobA1681">Figure 5.5</a> (see color plates), the
gene neighborhood predicted to encode this system forms a complex patchwork,
with very few conserved gene strings. However, the overall conservation of
the neighborhood is obvious (once the analysis is completed and the results
are summarized as in <a class="figpopup" href="/books/NBK20253/figure/A1681/?report=objectonly" target="object" rid-figpopup="figA1681" rid-ob="figobA1681">Figure 5.5</a>) and
statistically significant [<a href="/books/n/sef/A727/?report=reader#A1269">541</a>,<a href="/books/n/sef/A727/?report=reader#A1437">709</a>]. In an already
familiar theme, prediction of this repair system involved a combination of
genomic neighborhood detection with fairly complicated protein sequence
analysis and structure prediction. One of the notable findings was the
identification of a novel family of predicted DNA polymerases (COG1353).
Finally, this is where we encounter, once again, COG1518, the protein family
already discussed in <a href="/books/n/sef/A166/?report=reader#A233">4.5</a>. When we
first analyzed those proteins, we were inclined to predict that they were
novel enzymes, perhaps with a hydrolytic activity. Context analysis allows
us to make a much more specific prediction: these proteins most likely are
nucleases involved in DNA repair.</p><div class="iconblock whole_rhythm clearfix ten_col fig" id="figA1681" co-legend-rid="figlgndA1681"><a href="/books/NBK20253/figure/A1681/?report=objectonly" target="object" title="Figure 5.5" class="img_link icnblk_img figpopup" rid-figpopup="figA1681" rid-ob="figobA1681"><img class="small-thumb" src="/books/NBK20253/bin/ch5f5.gif" src-large="/books/NBK20253/bin/ch5f5.jpg" alt="Figure 5.5. Predicted DNA repair system in hyperthermophiles." /></a><div class="icnblk_cntnt" id="figlgndA1681"><h4 id="A1681"><a href="/books/NBK20253/figure/A1681/?report=objectonly" target="object" rid-ob="figobA1681">Figure 5.5</a></h4><p class="float-caption no_bottom_margin">Predicted DNA repair system in hyperthermophiles. The pink boxes show optimal growth temperatures for each of the analyzed species (<i>A. aeolicus, T. maritima, A. fulgidus, M. thermoautotrophicum, M. jannaschii</i>). The genes are not drawn to scale; arrows <a href="/books/NBK20253/figure/A1681/?report=objectonly" target="object" rid-ob="figobA1681">(more...)</a></p></div></div></div></div></div><div id="A288"><h2 id="_A288_">5.3. Conclusions and Outlook</h2><p>In this chapter, we discussed both traditional methods for genome annotation based on
homology detection and newer approaches united under the umbrella of genome context
analysis. We noted that, although functions can be predicted, at some level of
precision, for a substantial majority of genes in each sequenced prokaryotic genome,
current annotations are replete with inaccuracies, inconsistencies and
incompleteness. This should not be construed as any kind of implicit criticism of
those researchers who are involved in genome annotation: the task is objectively
hard and is getting progressively more difficult with the growth of databases (and
accumulation of inconsistencies). Fortunately, we believe that the remedy is already
at hand (see <a href="/books/n/sef/A55/?report=reader#A64">3.1.3</a>). Specialized databases,
designed as genome annotation tools, seem to be capable of dramatically improving
the situation, if not solving the annotation problem completely. Prototypes of such
databases already exist and function, and their extensive growth in the near future
seems assured.</p><p>The context-based methods of genome annotation are quite new: the development of
these approaches started only after multiple genome sequences became available.
These approaches have a lot of appeal because they are, indeed, true <b>
<i>genomic</i>
</b> methods based on the notion that the genome (and, especially, many compared
genomes) is much more than the sum of its parts. The results produced by these
methods are often very intuitive and even visually appealing as in gene string
analysis. Objectively, however, these methods yield considerably less information on
gene function than homology-based methods, at least for the foreseeable future.
Nevertheless, different genome context approaches substantially complement each
other and homology-based methods. In fact, homology-based and context-based methods
often produce different and complementary types of functional predictions. The
former tend to predict <b>
<i>biochemical</i>
</b> functions (activities), whereas the latter result in <b>
<i>biological</i>
</b> predictions, such as involvement of a gene in a particular cellular process
(e.g. DNA repair in the example above), even if the exact activity cannot be
predicted.</p><p>We would like to end this chapter on an upbeat note by stating, in large part on the
basis of personal experience, that genome annotation is not a routine, mundane
activity as it might seem to an outside observer. On the contrary, this is exciting
research, somewhat akin to detective work, which has the potential of teasing out
deep mysteries of life from genome sequences.</p></div><div id="A289"><h2 id="_A289_">5.4. Further Reading</h2><dl class="temp-labeled-list"><dl class="bkr_refwrap"><dt>1.</dt><dd><div class="bk_ref" id="A291">Brenner S. Errors in genome annotation. <span><span class="ref-journal">Trends in Genetics. </span>1999;<span class="ref-vol">15</span>:132&ndash;133.</span> [<a href="https://pubmed.ncbi.nlm.nih.gov/10203816" ref="pagearea=cite-ref&amp;targetsite=entrez&amp;targetcat=link&amp;targettype=pubmed">PubMed<span class="bk_prnt">: 10203816</span></a>]</div></dd></dl><dl class="bkr_refwrap"><dt>2.</dt><dd><div class="bk_ref" id="A292">Galperin MY, Koonin EV. Who's your neighbor? New computational approaches for
functional genomics. <span><span class="ref-journal">Nature Biotechnology. </span>2000;<span class="ref-vol">18</span>:609&ndash;613.</span> [<a href="https://pubmed.ncbi.nlm.nih.gov/10835597" ref="pagearea=cite-ref&amp;targetsite=entrez&amp;targetcat=link&amp;targettype=pubmed">PubMed<span class="bk_prnt">: 10835597</span></a>]</div></dd></dl><dl class="bkr_refwrap"><dt>3.</dt><dd><div class="bk_ref" id="A293">Huynen MA, Snel B. Gene and context: integrative approaches to genome
analysis. <span><span class="ref-journal">Advances in Protein Chemistry. </span>2000;<span class="ref-vol">54</span>:345&ndash;379.</span> [<a href="https://pubmed.ncbi.nlm.nih.gov/10829232" ref="pagearea=cite-ref&amp;targetsite=entrez&amp;targetcat=link&amp;targettype=pubmed">PubMed<span class="bk_prnt">: 10829232</span></a>]</div></dd></dl><dl class="bkr_refwrap"><dt>4.</dt><dd><div class="bk_ref" id="A294">Huynen MA, Snel B, Lathe W, Bork P. Predicting protein function by genomic context:
quantitative evaluation and qualitative inferences. <span><span class="ref-journal">Genome Research. </span>2000;<span class="ref-vol">10</span>:1204&ndash;1210.</span> [<a href="/pmc/articles/PMC310926/" ref="pagearea=cite-ref&amp;targetsite=entrez&amp;targetcat=link&amp;targettype=pmc">PMC free article<span class="bk_prnt">: PMC310926</span></a>] [<a href="https://pubmed.ncbi.nlm.nih.gov/10958638" ref="pagearea=cite-ref&amp;targetsite=entrez&amp;targetcat=link&amp;targettype=pubmed">PubMed<span class="bk_prnt">: 10958638</span></a>]</div></dd></dl><dl class="bkr_refwrap"><dt>5.</dt><dd><div class="bk_ref" id="A295">Wolf YI, Rogozin IB, Kondrashov AS, Koonin EV. Genome alignment, evolution of prokaryotic genome
organization and prediction of gene function using genomic
context. <span><span class="ref-journal">Genome Research. </span>2001;<span class="ref-vol">11</span>:356&ndash;372.</span> [<a href="https://pubmed.ncbi.nlm.nih.gov/11230160" ref="pagearea=cite-ref&amp;targetsite=entrez&amp;targetcat=link&amp;targettype=pubmed">PubMed<span class="bk_prnt">: 11230160</span></a>]</div></dd></dl><dl class="bkr_refwrap"><dt>6.</dt><dd><div class="bk_ref" id="A296">Makarova KS, Aravind L, Grishin NV, Rogozin IB, Koonin EV. A DNA repair system specific for thermophilic Archaea and
bacteria predicted by genomic context analysis. <span><span class="ref-journal">Nucleic Acids Research. </span>2002;<span class="ref-vol">30</span>:482&ndash;496.</span> [<a href="/pmc/articles/PMC99818/" ref="pagearea=cite-ref&amp;targetsite=entrez&amp;targetcat=link&amp;targettype=pmc">PMC free article<span class="bk_prnt">: PMC99818</span></a>] [<a href="https://pubmed.ncbi.nlm.nih.gov/11788711" ref="pagearea=cite-ref&amp;targetsite=entrez&amp;targetcat=link&amp;targettype=pubmed">PubMed<span class="bk_prnt">: 11788711</span></a>]</div></dd></dl><dl class="bkr_refwrap"><dt>7.</dt><dd><div class="bk_ref" id="A297"> Ouzounis CA, Karp PD. 2002. The past,
present and future of genome-wide re-annotation. <em>Genome
Biology</em> 3, COMMENT2001. [<a href="/pmc/articles/PMC139008/" ref="pagearea=cite-ref&amp;targetsite=entrez&amp;targetcat=link&amp;targettype=pmc">PMC free article<span class="bk_prnt">: PMC139008</span></a>] [<a href="https://pubmed.ncbi.nlm.nih.gov/11864365" ref="pagearea=cite-ref&amp;targetsite=entrez&amp;targetcat=link&amp;targettype=pubmed">PubMed<span class="bk_prnt">: 11864365</span></a>]</div></dd></dl></dl></div><div style="display:none"><div id="figA1679"><img alt="Image ch2f6" src-large="/books/n/sef/A22/bin/ch2f6.jpg" /></div><div id="figA452"><img alt="Image ch7f6" src-large="/books/n/sef/A371/bin/ch7f6.jpg" /></div><div id="figA468"><img alt="Image ch7f7" src-large="/books/n/sef/A371/bin/ch7f7.jpg" /></div></div><div id="bk_toc_contnr"></div></div></div><div class="fm-sec"><h2 id="_NBK20253_pubdet_">Publication Details</h2><h3>Copyright</h3><div><div class="half_rhythm"><a href="/books/about/copyright/">Copyright</a> &#x000a9; 2003, Kluwer Academic.</div></div><h3>Publisher</h3><p><a href="http://www.springer.com/" ref="pagearea=page-banner&amp;targetsite=external&amp;targetcat=link&amp;targettype=publisher">Kluwer Academic</a>, Boston</p><h3>NLM Citation</h3><p>Koonin EV, Galperin MY. Sequence - Evolution - Function: Computational Approaches in Comparative Genomics. Boston: Kluwer Academic; 2003.  
Chapter 5, Genome Annotation and Analysis.<span class="bk_cite_avail"></span></p></div><div class="small-screen-prev"><a href="/books/n/sef/A166/?report=reader"><svg xmlns="http://www.w3.org/2000/svg" viewBox="0 0 100 100" preserveAspectRatio="none"><path d="M75,30 c-80,60 -80,0 0,60 c-30,-60 -30,0 0,-60"></path><text x="20" y="28" textLength="60" style="font-size:25px">Prev</text></svg></a></div><div class="small-screen-next"><a href="/books/n/sef/A298/?report=reader"><svg xmlns="http://www.w3.org/2000/svg" viewBox="0 0 100 100" preserveAspectRatio="none"><path d="M25,30c80,60 80,0 0,60 c30,-60 30,0 0,-60"></path><text x="20" y="28" textLength="60" style="font-size:25px">Next</text></svg></a></div></article><article data-type="fig" id="figobA267"><div id="A267" class="figure bk_fig"><div class="graphic"><img data-src="/books/NBK20253/bin/ch5f1.jpg" alt="Figure 5.1. A generalized flow chart of genome annotation." /></div><h3><span class="label">Figure 5.1</span><span class="title">A generalized flow chart of genome annotation</span></h3><div class="caption"><p>FB: feedback from gene identification for correction of sequencing
errors, primarily frameshifts. General database search: searching
sequence databases (typically, NCBI NR) for sequence similarity,
usually using BLAST. Specialized database search: searching domain
databases, such as Pfam, SMART, and CDD, for conserved domains,
genome-oriented databases, such as COGs, for identification of
orthologous relationship and refined functional prediction,
metabolic databases, such as KEGG, for metabolic pathway
reconstruction, and possibly, other database searches. Statistical
gene prediction: use of methods like GeneMark or Glimmer to predict
protein-coding genes. Prediction of structural features: prediction
of signal peptides, transmembrane segments, coiled-coil domains and other
features in putative protein functions.</p></div></div></article><article data-type="fig" id="figobA1680"><div id="A1680" class="figure bk_fig"><div class="graphic"><img data-src="/books/NBK20253/bin/ch5f2.jpg" alt="Figure 5.2. Protocol of genome annotation using the COG database." /></div><h3><span class="label">Figure 5.2</span><span class="title">Protocol of genome annotation using the COG database</span></h3></div></article><article data-type="table-wrap" id="figobA268"><div id="A268" class="table"><h3><span class="label">Table 5.1</span><span class="title">Microbial genome annotation 2001</span></h3><p class="large-table-link" style="display:none"><span class="right"><a href="/books/NBK20253/table/A268/?report=objectonly" target="object">View in own window</a></span></p><div class="large_tbl" id="__A268_lrgtbl__"><table class="no_margin"><thead><tr><th id="hd_h_A268_1_1_1_1" rowspan="1" colspan="1" style="text-align:left;vertical-align:bottom;">
Species
</th><th id="hd_h_A268_1_1_1_2" rowspan="1" colspan="1" style="text-align:left;vertical-align:bottom;">
Total no. of genes<sup>a</sup>
</th><th id="hd_h_A268_1_1_1_3" rowspan="1" colspan="1" style="text-align:center;vertical-align:top;">
Genes with assigned function
</th><th id="hd_h_A268_1_1_1_4" rowspan="1" colspan="1" style="text-align:left;vertical-align:bottom;">
&#x0201c;Conserved
hypothetical&#x0201d; proteins
</th><th id="hd_h_A268_1_1_1_5" rowspan="1" colspan="1" style="text-align:center;vertical-align:bottom;">
&#x0201c;Hypothetical&#x0201d; proteins
</th><th id="hd_h_A268_1_1_1_6" rowspan="1" colspan="1" style="text-align:center;vertical-align:bottom;">
Assigned to COGs
</th><th id="hd_h_A268_1_1_1_7" rowspan="1" colspan="1" style="text-align:center;vertical-align:bottom;">
Ref.
</th></tr></thead><tbody><tr><td headers="hd_h_A268_1_1_1_1" rowspan="1" colspan="1" style="text-align:left;vertical-align:top;">
<i>Agrobacterium tumefaciens</i>
</td><td headers="hd_h_A268_1_1_1_2" rowspan="1" colspan="1" style="text-align:center;vertical-align:top;">5,419</td><td headers="hd_h_A268_1_1_1_3" rowspan="1" colspan="1" style="text-align:center;vertical-align:top;">3,475 (64%)</td><td headers="hd_h_A268_1_1_1_4" rowspan="1" colspan="1" style="text-align:center;vertical-align:top;">1,236 (22%)</td><td headers="hd_h_A268_1_1_1_5" rowspan="1" colspan="1" style="text-align:center;vertical-align:top;">708 (13%)</td><td headers="hd_h_A268_1_1_1_6" rowspan="1" colspan="1" style="text-align:center;vertical-align:top;">4,490 (83%)</td><td headers="hd_h_A268_1_1_1_7" rowspan="1" colspan="1" style="text-align:center;vertical-align:top;">[<a href="/books/n/sef/A727/?report=reader#A1645">917</a>]</td></tr><tr><td headers="hd_h_A268_1_1_1_1" rowspan="1" colspan="1" style="text-align:left;vertical-align:top;">
<i>Caulobacter crescentus</i>
</td><td headers="hd_h_A268_1_1_1_2" rowspan="1" colspan="1" style="text-align:center;vertical-align:top;">3,737</td><td headers="hd_h_A268_1_1_1_3" rowspan="1" colspan="1" style="text-align:center;vertical-align:top;">2,030 (54%)</td><td headers="hd_h_A268_1_1_1_4" rowspan="1" colspan="1" style="text-align:center;vertical-align:top;">725 (19%)</td><td headers="hd_h_A268_1_1_1_5" rowspan="1" colspan="1" style="text-align:center;vertical-align:top;">1,012 (27%)</td><td headers="hd_h_A268_1_1_1_6" rowspan="1" colspan="1" style="text-align:center;vertical-align:top;">3,514 (93%)</td><td headers="hd_h_A268_1_1_1_7" rowspan="1" colspan="1" style="text-align:center;vertical-align:top;">[<a href="/books/n/sef/A727/?report=reader#A1346">618</a>]</td></tr><tr><td headers="hd_h_A268_1_1_1_1" rowspan="1" colspan="1" style="text-align:left;vertical-align:top;">
<i>Clostridium acetobutylicum</i>
</td><td headers="hd_h_A268_1_1_1_2" rowspan="1" colspan="1" style="text-align:center;vertical-align:top;">3,672</td><td headers="hd_h_A268_1_1_1_3" rowspan="1" colspan="1" style="text-align:center;vertical-align:top;">2,888 (79%)</td><td headers="hd_h_A268_1_1_1_4" rowspan="1" colspan="1" style="text-align:center;vertical-align:top;">187 (5%)</td><td headers="hd_h_A268_1_1_1_5" rowspan="1" colspan="1" style="text-align:center;vertical-align:top;">597 (16%)</td><td headers="hd_h_A268_1_1_1_6" rowspan="1" colspan="1" style="text-align:center;vertical-align:top;">2,941 (80%)</td><td headers="hd_h_A268_1_1_1_7" rowspan="1" colspan="1" style="text-align:center;vertical-align:top;">[<a href="/books/n/sef/A727/?report=reader#A1350">622</a>]</td></tr><tr><td headers="hd_h_A268_1_1_1_1" rowspan="1" colspan="1" style="text-align:left;vertical-align:top;">
<i>Lactococcus lactis</i>
</td><td headers="hd_h_A268_1_1_1_2" rowspan="1" colspan="1" style="text-align:center;vertical-align:top;">2,310</td><td headers="hd_h_A268_1_1_1_3" rowspan="1" colspan="1" style="text-align:center;vertical-align:top;">1,482 (64%)</td><td headers="hd_h_A268_1_1_1_4" rowspan="1" colspan="1" style="text-align:center;vertical-align:top;">465 (20%)</td><td headers="hd_h_A268_1_1_1_5" rowspan="1" colspan="1" style="text-align:center;vertical-align:top;">363 (16%)</td><td headers="hd_h_A268_1_1_1_6" rowspan="1" colspan="1" style="text-align:center;vertical-align:top;">1,849 (80%)</td><td headers="hd_h_A268_1_1_1_7" rowspan="1" colspan="1" style="text-align:center;vertical-align:top;">[<a href="/books/n/sef/A727/?report=reader#A825">97</a>]</td></tr><tr><td headers="hd_h_A268_1_1_1_1" rowspan="1" colspan="1" style="text-align:left;vertical-align:top;">
<i>Listeria innocua</i>
</td><td headers="hd_h_A268_1_1_1_2" rowspan="1" colspan="1" style="text-align:center;vertical-align:top;">3,052</td><td headers="hd_h_A268_1_1_1_3" rowspan="1" colspan="1" style="text-align:center;vertical-align:top;">1,920 (63%)</td><td headers="hd_h_A268_1_1_1_4" rowspan="1" colspan="1" style="text-align:center;vertical-align:top;">757 (25%)</td><td headers="hd_h_A268_1_1_1_5" rowspan="1" colspan="1" style="text-align:center;vertical-align:top;">375 (12%)</td><td headers="hd_h_A268_1_1_1_6" rowspan="1" colspan="1" style="text-align:center;vertical-align:top;">2,444 (80%)</td><td headers="hd_h_A268_1_1_1_7" rowspan="1" colspan="1" style="text-align:center;vertical-align:top;">[<a href="/books/n/sef/A727/?report=reader#A1014">286</a>]</td></tr><tr><td headers="hd_h_A268_1_1_1_1" rowspan="1" colspan="1" style="text-align:left;vertical-align:top;">
<i>Mycobacterium leprae</i>
<sup>b</sup>
</td><td headers="hd_h_A268_1_1_1_2" rowspan="1" colspan="1" style="text-align:center;vertical-align:top;">2,720</td><td headers="hd_h_A268_1_1_1_3" rowspan="1" colspan="1" style="text-align:center;vertical-align:top;">1,802 (66%)</td><td headers="hd_h_A268_1_1_1_4" rowspan="1" colspan="1" style="text-align:center;vertical-align:top;">776 (29%)</td><td headers="hd_h_A268_1_1_1_5" rowspan="1" colspan="1" style="text-align:center;vertical-align:top;">142 (5%)</td><td headers="hd_h_A268_1_1_1_6" rowspan="1" colspan="1" style="text-align:center;vertical-align:top;">1,231 (45%)</td><td headers="hd_h_A268_1_1_1_7" rowspan="1" colspan="1" style="text-align:center;vertical-align:top;">[<a href="/books/n/sef/A727/?report=reader#A881">153</a>]</td></tr><tr><td headers="hd_h_A268_1_1_1_1" rowspan="1" colspan="1" style="text-align:left;vertical-align:top;">
<i>Nostoc</i> (<i>Anabaena</i>) sp.
PCC7120</td><td headers="hd_h_A268_1_1_1_2" rowspan="1" colspan="1" style="text-align:center;vertical-align:top;">5,368</td><td headers="hd_h_A268_1_1_1_3" rowspan="1" colspan="1" style="text-align:center;vertical-align:top;">45%</td><td headers="hd_h_A268_1_1_1_4" rowspan="1" colspan="1" style="text-align:center;vertical-align:top;">27%</td><td headers="hd_h_A268_1_1_1_5" rowspan="1" colspan="1" style="text-align:center;vertical-align:top;">28%</td><td headers="hd_h_A268_1_1_1_6" rowspan="1" colspan="1" style="text-align:center;vertical-align:top;">4,002 (75%)</td><td headers="hd_h_A268_1_1_1_7" rowspan="1" colspan="1" style="text-align:center;vertical-align:top;">[<a href="/books/n/sef/A727/?report=reader#A1144">416</a>]</td></tr><tr><td headers="hd_h_A268_1_1_1_1" rowspan="1" colspan="1" style="text-align:left;vertical-align:top;">
<i>Pasteurella multocida</i>
</td><td headers="hd_h_A268_1_1_1_2" rowspan="1" colspan="1" style="text-align:center;vertical-align:top;">2,014</td><td headers="hd_h_A268_1_1_1_3" rowspan="1" colspan="1" style="text-align:center;vertical-align:top;">1,814 (64%)</td><td headers="hd_h_A268_1_1_1_4" rowspan="1" colspan="1" style="text-align:center;vertical-align:top;">531 (26%)</td><td headers="hd_h_A268_1_1_1_5" rowspan="1" colspan="1" style="text-align:center;vertical-align:top;">200 (10%)</td><td headers="hd_h_A268_1_1_1_6" rowspan="1" colspan="1" style="text-align:center;vertical-align:top;">1,881 (93%)</td><td headers="hd_h_A268_1_1_1_7" rowspan="1" colspan="1" style="text-align:center;vertical-align:top;">[<a href="/books/n/sef/A727/?report=reader#A1282">554</a>]</td></tr><tr><td headers="hd_h_A268_1_1_1_1" rowspan="1" colspan="1" style="text-align:left;vertical-align:top;">
<i>Sinorhizobium meliloti</i>
</td><td headers="hd_h_A268_1_1_1_2" rowspan="1" colspan="1" style="text-align:center;vertical-align:top;">6,204</td><td headers="hd_h_A268_1_1_1_3" rowspan="1" colspan="1" style="text-align:center;vertical-align:top;">3,704 (60%)</td><td headers="hd_h_A268_1_1_1_4" rowspan="1" colspan="1" style="text-align:center;vertical-align:top;">1,991 (32%)</td><td headers="hd_h_A268_1_1_1_5" rowspan="1" colspan="1" style="text-align:center;vertical-align:top;">509 (8%)</td><td headers="hd_h_A268_1_1_1_6" rowspan="1" colspan="1" style="text-align:center;vertical-align:top;">5,298 (85%)</td><td headers="hd_h_A268_1_1_1_7" rowspan="1" colspan="1" style="text-align:center;vertical-align:top;">[<a href="/books/n/sef/A727/?report=reader#A983">255</a>]</td></tr><tr><td headers="hd_h_A268_1_1_1_1" rowspan="1" colspan="1" style="text-align:left;vertical-align:top;">
<i>Staphylococcus aureus</i>
</td><td headers="hd_h_A268_1_1_1_2" rowspan="1" colspan="1" style="text-align:center;vertical-align:top;">2,595</td><td headers="hd_h_A268_1_1_1_3" rowspan="1" colspan="1" style="text-align:center;vertical-align:top;">63%</td><td headers="hd_h_A268_1_1_1_4" rowspan="1" colspan="1" style="text-align:center;vertical-align:top;">23%</td><td headers="hd_h_A268_1_1_1_5" rowspan="1" colspan="1" style="text-align:center;vertical-align:top;">14%</td><td headers="hd_h_A268_1_1_1_6" rowspan="1" colspan="1" style="text-align:center;vertical-align:top;">2,126 (82%)</td><td headers="hd_h_A268_1_1_1_7" rowspan="1" colspan="1" style="text-align:center;vertical-align:top;">[<a href="/books/n/sef/A727/?report=reader#A1209">481</a>]</td></tr><tr><td headers="hd_h_A268_1_1_1_1" rowspan="1" colspan="1" style="text-align:left;vertical-align:top;">
<i>Streptococcus pyogenes</i>
</td><td headers="hd_h_A268_1_1_1_2" rowspan="1" colspan="1" style="text-align:center;vertical-align:top;">1,752</td><td headers="hd_h_A268_1_1_1_3" rowspan="1" colspan="1" style="text-align:center;vertical-align:top;">1,137 (65%)</td><td headers="hd_h_A268_1_1_1_4" rowspan="1" colspan="1" style="text-align:center;vertical-align:top;">145 (8.2%)</td><td headers="hd_h_A268_1_1_1_5" rowspan="1" colspan="1" style="text-align:center;vertical-align:top;">470 (27%)</td><td headers="hd_h_A268_1_1_1_6" rowspan="1" colspan="1" style="text-align:center;vertical-align:top;">1,390 (79%)</td><td headers="hd_h_A268_1_1_1_7" rowspan="1" colspan="1" style="text-align:center;vertical-align:top;">[<a href="/books/n/sef/A727/?report=reader#A951">223</a>]</td></tr><tr><td headers="hd_h_A268_1_1_1_1" rowspan="1" colspan="1" style="text-align:left;vertical-align:top;">
<i>Sulfolobus solfataricus</i>
</td><td headers="hd_h_A268_1_1_1_2" rowspan="1" colspan="1" style="text-align:center;vertical-align:top;">2,977</td><td headers="hd_h_A268_1_1_1_3" rowspan="1" colspan="1" style="text-align:center;vertical-align:top;">1,624 (57%)</td><td headers="hd_h_A268_1_1_1_4" rowspan="1" colspan="1" style="text-align:center;vertical-align:top;">619 (21%)</td><td headers="hd_h_A268_1_1_1_5" rowspan="1" colspan="1" style="text-align:center;vertical-align:top;">734 (25%)</td><td headers="hd_h_A268_1_1_1_6" rowspan="1" colspan="1" style="text-align:center;vertical-align:top;">1,910 (64%)</td><td headers="hd_h_A268_1_1_1_7" rowspan="1" colspan="1" style="text-align:center;vertical-align:top;">[<a href="/books/n/sef/A727/?report=reader#A1492">764</a>]</td></tr><tr><td headers="hd_h_A268_1_1_1_1" rowspan="1" colspan="1" style="text-align:left;vertical-align:top;">
<i>Sulfolobus tokodaii</i>
</td><td headers="hd_h_A268_1_1_1_2" rowspan="1" colspan="1" style="text-align:center;vertical-align:top;">2,826</td><td headers="hd_h_A268_1_1_1_3" rowspan="1" colspan="1" style="text-align:center;vertical-align:top;">14 (0.5%)</td><td headers="hd_h_A268_1_1_1_4" rowspan="1" colspan="1" style="text-align:center;vertical-align:top;">920 (33%)</td><td headers="hd_h_A268_1_1_1_5" rowspan="1" colspan="1" style="text-align:center;vertical-align:top;">1,892 (67%)</td><td headers="hd_h_A268_1_1_1_6" rowspan="1" colspan="1" style="text-align:center;vertical-align:top;">1,778 (63%)</td><td headers="hd_h_A268_1_1_1_7" rowspan="1" colspan="1" style="text-align:center;vertical-align:top;">[<a href="/books/n/sef/A727/?report=reader#A1154">426</a>]</td></tr><tr><td headers="hd_h_A268_1_1_1_1" rowspan="1" colspan="1" style="text-align:left;vertical-align:top;">
<i>Yersinia pestis</i>
</td><td headers="hd_h_A268_1_1_1_2" rowspan="1" colspan="1" style="text-align:center;vertical-align:top;">4,012</td><td headers="hd_h_A268_1_1_1_3" rowspan="1" colspan="1" style="text-align:center;vertical-align:top;">76%</td><td headers="hd_h_A268_1_1_1_4" rowspan="1" colspan="1" style="text-align:center;vertical-align:top;">13%</td><td headers="hd_h_A268_1_1_1_5" rowspan="1" colspan="1" style="text-align:center;vertical-align:top;">9%</td><td headers="hd_h_A268_1_1_1_6" rowspan="1" colspan="1" style="text-align:center;vertical-align:top;">3,669 (91%)</td><td headers="hd_h_A268_1_1_1_7" rowspan="1" colspan="1" style="text-align:center;vertical-align:top;">[<a href="/books/n/sef/A727/?report=reader#A1384">656</a>]</td></tr></tbody></table></div><div class="tblwrap-foot"><div><dl class="temp-labeled-list small"><dl class="bkr_refwrap"><dt>a</dt><dd><div id="N0x1cf9150N0x39b9008"><p class="no_margin"> In contrast to <a href="/books/n/sef/A4/?report=reader#A11">Table 1.4</a>,
the total gene numbers, as well as the numbers of genes with
assigned function, &#x0201c;conserved hypothetical&#x0201d; and
&#x0201c;hypothetical&#x0201d; genes, were taken from the
original publications.</p></div></dd></dl><dl class="bkr_refwrap"><dt>b</dt><dd><div id="N0x1cf9150N0x39b9128"><p class="no_margin"> The low fraction of <i>M. leprae</i> genes assigned to
COGs is due to the large number of pseudogenes in this genome
[<a href="/books/n/sef/A727/?report=reader#A881">153</a>].</p></div></dd></dl></dl></div></div></div></article><article data-type="table-wrap" id="figobA273"><div id="A273" class="table"><h3><span class="label">Table 5.2</span><span class="title">Different types of errors in genome annotation</span></h3><p class="large-table-link" style="display:none"><span class="right"><a href="/books/NBK20253/table/A273/?report=objectonly" target="object">View in own window</a></span></p><div class="large_tbl" id="__A273_lrgtbl__"><table class="no_top_margin"><thead><tr><th id="hd_h_A273_1_1_1_1" rowspan="1" colspan="1" style="vertical-align:top;"></th><th id="hd_h_A273_1_1_1_2" colspan="6" content-type="rowsep" rowspan="1" style="text-align:center;vertical-align:top;">
<b>Annotation</b>
<span class="hr"></span>
</th></tr><tr><th headers="hd_h_A273_1_1_1_1" id="hd_h_A273_1_1_2_1" rowspan="1" colspan="1" style="text-align:left;vertical-align:bottom;">
Protein
</th><th headers="hd_h_A273_1_1_1_2" id="hd_h_A273_1_1_2_2" rowspan="1" colspan="1" style="text-align:center;vertical-align:bottom;">
Fraser and coworkers
</th><th headers="hd_h_A273_1_1_1_2" id="hd_h_A273_1_1_2_3" rowspan="1" colspan="1" style="text-align:left;vertical-align:bottom;">
Ouzounis and coworkers
</th><th headers="hd_h_A273_1_1_1_2" id="hd_h_A273_1_1_2_4" rowspan="1" colspan="1" style="text-align:left;vertical-align:top;">
Koonin and coworkers
</th><th headers="hd_h_A273_1_1_1_2" id="hd_h_A273_1_1_2_5" rowspan="1" colspan="1" style="text-align:center;vertical-align:bottom;">
GenBank 2002
</th><th headers="hd_h_A273_1_1_1_2" id="hd_h_A273_1_1_2_6" rowspan="1" colspan="1" style="text-align:left;vertical-align:bottom;">
Conclusion 2002
</th></tr></thead><tbody><tr><td headers="hd_h_A273_1_1_1_1 hd_h_A273_1_1_2_1" rowspan="1" colspan="1" style="text-align:left;vertical-align:top;">MG085</td><td headers="hd_h_A273_1_1_1_2 hd_h_A273_1_1_2_2" rowspan="1" colspan="1" style="text-align:center;vertical-align:top;">Hydroxymethyl-glutaryl-CoA reductase
(NADPH)</td><td headers="hd_h_A273_1_1_1_2 hd_h_A273_1_1_2_3" rowspan="1" colspan="1" style="text-align:left;vertical-align:top;">NADH-ubiquinone oxidoreductase</td><td headers="hd_h_A273_1_1_1_2 hd_h_A273_1_1_2_4" rowspan="1" colspan="1" style="text-align:left;vertical-align:top;">ATP(GTP?)-utilizing enzyme</td><td headers="hd_h_A273_1_1_1_2 hd_h_A273_1_1_2_5" rowspan="1" colspan="1" style="text-align:left;vertical-align:top;">HPr (Ser) kinase, putative</td><td headers="hd_h_A273_1_1_1_2 hd_h_A273_1_1_2_6" rowspan="1" colspan="1" style="text-align:left;vertical-align:top;">HPr kinase</td></tr><tr><td headers="hd_h_A273_1_1_1_1 hd_h_A273_1_1_2_1" rowspan="1" colspan="1" style="text-align:left;vertical-align:top;">MG225</td><td headers="hd_h_A273_1_1_1_2 hd_h_A273_1_1_2_2" rowspan="1" colspan="1" style="text-align:center;vertical-align:top;">Hypothetical</td><td headers="hd_h_A273_1_1_1_2 hd_h_A273_1_1_2_3" rowspan="1" colspan="1" style="text-align:left;vertical-align:top;">Histidine permease</td><td headers="hd_h_A273_1_1_1_2 hd_h_A273_1_1_2_4" rowspan="1" colspan="1" style="text-align:left;vertical-align:top;">Amino acid permease</td><td headers="hd_h_A273_1_1_1_2 hd_h_A273_1_1_2_5" rowspan="1" colspan="1" style="text-align:left;vertical-align:top;">Hypothetical</td><td headers="hd_h_A273_1_1_1_2 hd_h_A273_1_1_2_6" rowspan="1" colspan="1" style="text-align:left;vertical-align:top;">Amino acid permease</td></tr><tr><td headers="hd_h_A273_1_1_1_1 hd_h_A273_1_1_2_1" rowspan="1" colspan="1" style="text-align:left;vertical-align:top;">MG302</td><td headers="hd_h_A273_1_1_1_2 hd_h_A273_1_1_2_2" rowspan="1" colspan="1" style="text-align:center;vertical-align:top;">No database match</td><td headers="hd_h_A273_1_1_1_2 hd_h_A273_1_1_2_3" rowspan="1" colspan="1" style="text-align:left;vertical-align:top;">Mitochondrial 60S ribosomal protein
L2</td><td headers="hd_h_A273_1_1_1_2 hd_h_A273_1_1_2_4" rowspan="1" colspan="1" style="text-align:left;vertical-align:top;">(Glycerol-3-phosphate?) permease</td><td headers="hd_h_A273_1_1_1_2 hd_h_A273_1_1_2_5" rowspan="1" colspan="1" style="text-align:left;vertical-align:top;">Hypothetical</td><td headers="hd_h_A273_1_1_1_2 hd_h_A273_1_1_2_6" rowspan="1" colspan="1" style="text-align:left;vertical-align:top;">Probable cobalt transporter</td></tr><tr><td headers="hd_h_A273_1_1_1_1 hd_h_A273_1_1_2_1" rowspan="1" colspan="1" style="text-align:left;vertical-align:top;">MG448</td><td headers="hd_h_A273_1_1_1_2 hd_h_A273_1_1_2_2" rowspan="1" colspan="1" style="text-align:center;vertical-align:top;">Pilin repressor (pilB)</td><td headers="hd_h_A273_1_1_1_2 hd_h_A273_1_1_2_3" rowspan="1" colspan="1" style="text-align:left;vertical-align:top;">PilB protein</td><td headers="hd_h_A273_1_1_1_2 hd_h_A273_1_1_2_4" rowspan="1" colspan="1" style="text-align:left;vertical-align:top;">Putative chaperone-like protein</td><td headers="hd_h_A273_1_1_1_2 hd_h_A273_1_1_2_5" rowspan="1" colspan="1" style="text-align:left;vertical-align:top;">Hypothetical/Peptide methionine
sulfoxide reductase</td><td headers="hd_h_A273_1_1_1_2 hd_h_A273_1_1_2_6" rowspan="1" colspan="1" style="text-align:left;vertical-align:top;">Peptide methionine sulfoxide reductase
B</td></tr></tbody></table></div></div></article><article data-type="table-wrap" id="figobA275"><div id="A275" class="table"><h3><span class="label">Table 5.3</span><span class="title">Assignment of predicted <i>Aeropyrum pernix</i> proteins to
COGs</span></h3><p class="large-table-link" style="display:none"><span class="right"><a href="/books/NBK20253/table/A275/?report=objectonly" target="object">View in own window</a></span></p><div class="large_tbl" id="__A275_lrgtbl__"><table class="no_top_margin"><thead><tr><th id="hd_h_A275_1_1_1_1" rowspan="1" colspan="1" style="text-align:left;vertical-align:top;">
Protein category
</th><th id="hd_h_A275_1_1_1_2" rowspan="1" colspan="1" style="text-align:center;vertical-align:top;">
No. of proteins
</th></tr></thead><tbody><tr><td headers="hd_h_A275_1_1_1_1" rowspan="1" colspan="1" style="text-align:left;vertical-align:top;">Assigned by COGNITOR
automatically</td><td headers="hd_h_A275_1_1_1_2" rowspan="1" colspan="1" style="text-align:center;vertical-align:top;">1,123</td></tr><tr><td headers="hd_h_A275_1_1_1_1" rowspan="1" colspan="1" style="text-align:left;vertical-align:top;">Included in COGs after validation</td><td headers="hd_h_A275_1_1_1_2" rowspan="1" colspan="1" style="text-align:center;vertical-align:top;">1,102</td></tr><tr><td headers="hd_h_A275_1_1_1_1" rowspan="1" colspan="1" style="text-align:left;vertical-align:top;">True positives</td><td headers="hd_h_A275_1_1_1_2" rowspan="1" colspan="1" style="text-align:center;vertical-align:top;">1,062</td></tr><tr><td headers="hd_h_A275_1_1_1_1" rowspan="1" colspan="1" style="text-align:left;vertical-align:top;">&#x02003;Preexisting COGs</td><td headers="hd_h_A275_1_1_1_2" rowspan="1" colspan="1" style="text-align:center;vertical-align:top;">1,035</td></tr><tr><td headers="hd_h_A275_1_1_1_1" rowspan="1" colspan="1" style="text-align:left;vertical-align:top;">&#x02003;New COGs</td><td headers="hd_h_A275_1_1_1_2" rowspan="1" colspan="1" style="text-align:center;vertical-align:top;">27</td></tr><tr><td headers="hd_h_A275_1_1_1_1" rowspan="1" colspan="1" style="text-align:left;vertical-align:top;">False positives</td><td headers="hd_h_A275_1_1_1_2" rowspan="1" colspan="1" style="text-align:center;vertical-align:top;">44</td></tr><tr><td headers="hd_h_A275_1_1_1_1" rowspan="1" colspan="1" style="text-align:left;vertical-align:top;">&#x02003;Rejected</td><td headers="hd_h_A275_1_1_1_2" rowspan="1" colspan="1" style="text-align:center;vertical-align:top;">21</td></tr><tr><td headers="hd_h_A275_1_1_1_1" rowspan="1" colspan="1" style="text-align:left;vertical-align:top;">&#x02003;Re-assigned to a related
COG</td><td headers="hd_h_A275_1_1_1_2" rowspan="1" colspan="1" style="text-align:center;vertical-align:top;">21</td></tr><tr><td headers="hd_h_A275_1_1_1_1" rowspan="1" colspan="1" style="text-align:left;vertical-align:top;">&#x02003;Re-assigned to an unrelated COG</td><td headers="hd_h_A275_1_1_1_2" rowspan="1" colspan="1" style="text-align:center;vertical-align:top;">2</td></tr><tr><td headers="hd_h_A275_1_1_1_1" rowspan="1" colspan="1" style="text-align:left;vertical-align:top;">False negatives (added during manual
checking)</td><td headers="hd_h_A275_1_1_1_2" rowspan="1" colspan="1" style="text-align:center;vertical-align:top;">17</td></tr><tr><td headers="hd_h_A275_1_1_1_1" rowspan="1" colspan="1" style="text-align:left;vertical-align:top;">Proteins in COGs: Update 2001</td><td headers="hd_h_A275_1_1_1_2" rowspan="1" colspan="1" style="text-align:center;vertical-align:top;">1,178</td></tr><tr><td headers="hd_h_A275_1_1_1_1" rowspan="1" colspan="1" style="text-align:left;vertical-align:top;">Proteins in COGs: Update 2002</td><td headers="hd_h_A275_1_1_1_2" rowspan="1" colspan="1" style="text-align:center;vertical-align:top;">1,242</td></tr></tbody></table></div></div></article><article data-type="fig" id="figobA279"><div id="A279" class="figure bk_fig"><div class="graphic"><img data-src="/books/NBK20253/bin/ch5f3.jpg" alt="Figure 5.3. A Rosetta Stone case: domain fusions and gene clusters that involve peptide methionine sulfoxide reductases." /></div><h3><span class="label">Figure 5.3</span><span class="title">A Rosetta Stone case: domain fusions and gene clusters that
involve peptide methionine sulfoxide reductases</span></h3></div></article><article data-type="fig" id="figobA283"><div id="A283" class="figure bk_fig"><div class="graphic"><img data-src="/books/NBK20253/bin/ch5f4.jpg" alt="Figure 5.4. Genome context of COG1685 &#x0201c;Archaeal shikimate kinase&#x0201d;." /></div><h3><span class="label">Figure 5.4</span><span class="title">Genome context of COG1685 &#x0201c;Archaeal shikimate
kinase&#x0201d;</span></h3><div class="caption"><p> Each line corresponds to an individual genome: aful,
<i>Archaeoglobus fulgidus</i>; hbsp,
<i>Halobacterium</i> sp.; mjan,
<i>Methanococcus jannaschii</i>; mthe,
<i>Methanobacterium thermoautotrophicum</i>; pyro,
<i>Pyrococcus horikoshii</i>; pabyssi,
<i>Pyrococcus abyssi</i>; tacid,
<i>Thermoplasma acidophilum</i>; tvol,
<i>Thermoplasma volcanium</i>; aero,
<i>Aeropyrum pernix</i>; aquae, <i>Aquifex
aeolicus</i>. The genes encoding members of COG1685 are
shown in the middle. Genes encoding members of the same COG are
indicated by the same color. Genomes that do not encode a member
of COG1685 are indicated by empty lines. The names of all COGs
represented in the picture are listed starting from the most
common ones. Note that in <i>Halobacterium</i> sp.
(second line) and <i>M. thermoautotrophicum</i>
(fourth line), COG1685 genes are followed by the genes encoding
chorismate mutase (<i>tyrA</i>_1, COG1605). In
<i>Thermoplasma</i> spp. and <i>A.
pernix</i> (lines 7-9), COG1685 genes are sandwiched
between the genes encoding shikimate-5-dehydrogenase
(<i>aroE</i>, COG0169) and genes encoding
5-enolpyruvylshikimate-3-phosphate synthase
(<i>aroA</i>, COG0128). See <a href="/books/n/sef/A371/?report=reader#A468">Figure 7.7</a> for the chart of the complete
pathway of phenylalanine and tyrosine biosynthesis.</p></div></div></article><article data-type="fig" id="figobA1681"><div id="A1681" class="figure bk_fig"><div class="graphic"><img data-src="/books/NBK20253/bin/ch5f5.jpg" alt="Figure 5.5. Predicted DNA repair system in hyperthermophiles." /></div><h3><span class="label">Figure 5.5</span><span class="title">Predicted DNA repair system in hyperthermophiles</span></h3><div class="caption"><p>The pink boxes show the optimal growth temperature of each analyzed species (<i>A. aeolicus, T. maritima, A. fulgidus, M. thermoautotrophicum, M. jannaschii</i>). The genes are not drawn to scale; arrows indicate the direction of transcription. The upper row shows the COG numbers for the corresponding proteins. Newly predicted COG functions include: COG2452, helix-turn-helix transcriptional regulator; COG1203, helicase; COG1468, RecB family exonuclease; COG2254, nuclease of the HD superfamily; COG1353, novel DNA polymerase.</p></div></div></article></div><div id="jr-scripts"><script src="/corehtml/pmc/jatsreader/ptpmc_3.22/js/libs.min.js"> </script><script src="/corehtml/pmc/jatsreader/ptpmc_3.22/js/jr.min.js"> </script></div></div>


        <!-- Book content -->

        <script type="text/javascript" src="/portal/portal3rc.fcgi/rlib/js/InstrumentNCBIBaseJS/InstrumentPageStarterJS.js"> </script>


<span id="portal-csrf-token" style="display:none" data-token="CE8BC1E97D9F05E1_0182SID"></span>

<script type="text/javascript" src="//static.pubmed.gov/portal/portal3rc.fcgi/4216699/js/3968615.js" snapshot="books"></script></body>
</html>