NCBI Bookshelf. A service of the National Library of Medicine, National Institutes of Health.

Koonin EV, Galperin MY. Sequence - Evolution - Function: Computational Approaches in Comparative Genomics. Boston: Kluwer Academic; 2003.

Chapter 1. Genomics: From Phage to Human

1.1. The Humble Beginnings …

The first genome, that of RNA bacteriophage MS2, was sequenced in 1976, in a truly heroic feat of direct determination of an RNA sequence [225]. This was followed by the genome of bacteriophage ϕX174, the first triumph of the new, rapid sequencing methods developed in the laboratories of Walter Gilbert and Fred Sanger [553,743]. These are some of the smallest known genomes with only four and ten genes, respectively. Then, in 1982, the last paper published by Sanger before he retired announced the first relatively large genome to be sequenced, that of bacteriophage λ, probably the most famous model system of classic molecular biology [742]. Phage λ has 48,502 bases of genomic DNA and ~70 known and predicted protein-coding genes and 23 RNA-coding genes. At 70 characters per line and 43 lines per page, this sequence alone would take over 16 pages of this book. However, the listing of the λ protein-coding genes (Table 1.1) fits into just two pages and definitely conveys more information. These days, it may be hard to imagine all the excitement felt by molecular biologists 20 years ago when the λ genome was finally finished. Nevertheless, even in this era of high-throughput methods, it could be instructive to look back and address several questions: (i) is the λ genome a good model of the subsequently sequenced prokaryotic and eukaryotic genomes? (ii) how accurate were the sequence itself and the original gene assignment? and (iii) how much more have we learned about functions of λ genes in the past 20 years?
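The page estimate quoted above is simple arithmetic and can be checked directly. The following is a minimal sketch that uses only the three figures given in the text; the variable names are ours.

```python
import math

GENOME_LENGTH = 48_502   # bases in the phage λ genome (from the text)
CHARS_PER_LINE = 70      # characters per printed line (from the text)
LINES_PER_PAGE = 43      # lines per printed page (from the text)

bases_per_page = CHARS_PER_LINE * LINES_PER_PAGE   # 3,010 bases fit on one page
pages_exact = GENOME_LENGTH / bases_per_page       # a little over 16
pages_needed = math.ceil(pages_exact)              # whole pages actually occupied

print(f"{bases_per_page} bases per page")
print(f"{pages_exact:.2f} pages of sequence, i.e. {pages_needed} printed pages")
```

The exact quotient is slightly above 16, which is what "over 16 pages" refers to; the last page would be mostly empty.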

The answer to the first question is definitely yes: the λ genome has many features common to the genomes of cellular life forms, particularly prokaryotes. Most of the genome consists of protein-coding genes. Adjacent genes are often transcribed in the same direction and encode proteins that have similar functions and/or interact with each other (e.g. cell lysis proteins, tail components). Adjacent genes either slightly overlap or are separated by intergenic regions of varying length, typically much shorter than the genes themselves.
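The pattern of slight overlaps and short intergenic regions can be read directly off the coordinates in Table 1.1. The sketch below does this for the first stretch of plus-strand genes; the coordinates are copied from the table, while the helper name `spacing` and the output layout are our own choices for illustration.

```python
# (gene name, start, end) for consecutive plus-strand genes, copied from Table 1.1.
# nu3 is left out because it lies entirely within gene C.
genes = [
    ("nu1", 191, 736), ("A", 711, 2636), ("W", 2633, 2839), ("B", 2836, 4437),
    ("C", 4418, 5737), ("D", 5747, 6079), ("E", 6135, 7160), ("Fi", 7202, 7600),
    ("Fii", 7612, 7965), ("Z", 7977, 8555), ("U", 8552, 8947),
]

def spacing(upstream, downstream):
    """Number of bases between two adjacent genes; a negative value is an overlap."""
    return downstream[1] - upstream[2] - 1

# Pair every gene with its right-hand neighbor and measure the spacing.
spacings = [(a[0], b[0], spacing(a, b)) for a, b in zip(genes, genes[1:])]

for left, right, gap in spacings:
    kind = "overlap" if gap < 0 else "intergenic gap"
    print(f"{left:>3} -> {right:<3} {kind:14} {abs(gap):3d} nt")
```

In this stretch, half of the neighboring pairs overlap by 4 to 26 nucleotides, and the longest intergenic gap (55 nucleotides, between D and E) is far shorter than either flanking gene.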

As for the second question, both the sequence and the gene assignments turned out to be essentially correct. The latter may not be surprising since the λ genome was annotated by researchers who had studied the phage for years, on the basis of the entire body of knowledge amassed by that time. In contemporary genome sequencing projects, such detailed analysis by highly qualified biologists with intimate knowledge of the biology of the given organism is the exception rather than the norm, partly because biological information on many of the sequenced organisms is simply too scarce.

A comparison of Table 1.1 with the original paper by Sanger et al. [742] shows that there is actually not much to add to the gene annotations. The use of recently developed sophisticated gene prediction programs, such as Glimmer (see 4.1), coupled with the analysis of the regions that are conserved between λ and related bacteriophages, led to the conclusion that certain intergenic regions might contain additional protein-coding genes (marked by asterisks in Table 1.1). Unfortunately, most of these genes remain uncharacterized, and it is not even known whether they are ever expressed. It is worth noting that exactly the same doubts exist about the possible functions and/or expression of a large number of so-called “hypothetical” genes, identified in the genomes of cellular life forms by essentially the same two principal approaches (see 4.1).

When reading the Sanger paper now, 20 years after it appeared, one is struck by the absence of any analysis of protein sequences in this detailed, thorough work. Although the authors did careful computational analysis of open reading frames, particularly the likely translation starts and codon usage, the very word “homolog” is not used in the article, and there is no mention of any search of protein sequence databases, something that these days is, by default, an integral part of any genomic study. Not that protein sequence databases did not exist at the time: the first one, the Protein Identification Resource, was launched by Margaret Dayhoff, one of the great pioneers of computational biology, in 1965, long before genomics had even become conceivable [172,173]. However, reliable and rapid methods for searching this database had not yet been developed, and more generally, database search was not a part of the culture in molecular biology at the time. And for a good reason, too. Had Sanger and his coworkers performed a PIR search, even using the methods available in 2002, they would not have found anything of interest because the sequences available at that time were few and far apart, and there were no homologs of phage λ proteins among them. Clearly, the time was not ripe for comparative genomics and, in a sense, for genomics itself because, as we will see throughout this book, the comparative approach is truly central to the genomic enterprise.

Revisiting the phage λ genome after 20 years, we see a completely different “genomescape”. Using the PSI-BLAST program (see 4.3), the search of the complete non-redundant protein sequence database maintained at the NCBI (National Center for Biotechnology Information, a division of the National Institutes of Health in Bethesda, Maryland, USA) for homologs of the 73 proteins listed as gene products of phage λ took about an hour on a moderately powerful computer. Another hour was spent running selected proteins through the conserved domain search using the CDD option of the NCBI’s BLAST server (see 4.4). Of course, we could have scoured the literature for descriptions of computational analyses of λ proteins instead. However, extracting the relevant information from databases, such as PubMed (see 3.7), is far from trivial because, in most cases, the papers including this information dealt with more general issues and would not have λ, let alone a particular gene, mentioned in the title or abstract. Running the searches anew was much faster and easier. Besides, sequence databases are growing daily, which may substantially affect the results of searches and might even lead to new discoveries. Perusing the results, we should note that, with a few exceptions, there are now homologs readily detectable for the phage proteins. In the majority of cases, these are proteins from other related phages (sometimes integrated as prophages into the bacterial chromosome). However, 12 λ proteins show conservation in bacteria, archaea, and eukaryotes (Table 1.2). For several of these proteins whose functions have not been studied experimentally, non-trivial functional predictions become possible.

It is remarkable that some of the more interesting computational predictions remain without experimental test. Admittedly, the visibility of molecular biology of bacteriophages as a research field has not increased since the 1970’s, and the funds have pretty much tapered off. Good examples are the Ea59 and K genes that are predicted to encode an ATPase and a metal-dependent protease, respectively. Both are clear and readily testable predictions that have been described in print, even if briefly [296,679]. However, to our knowledge, no experimental tests of these predictions have been reported so far. Interestingly, an observation has been made during these searches that actually seems to have a novel aspect to it. The Ea31 protein was shown to contain a metal-dependent nuclease domain [50]. The stop codon of the Ea31 gene overlaps the start codon of Ea59, leading to the intriguing hypothesis that the two proteins interact and form an ATP-dependent nuclease complex. We discuss sequence analysis of Ea31 in greater detail in Chapter 4 to illustrate the process of discovery in database searches. Furthermore, this is a little example of context analysis, an increasingly important direction in genome annotation, which is covered in Chapter 5. This situation is not uncommon: computational analysis of genomes keeps yielding interesting functional predictions, even years after the publication of the sequence; what is most often lacking is systematic experimental testing of these predictions.

We will come back to this dramatic rift between computational and experimental analysis of most, if not all, genomes with more numbers, but first let us step back and have a quick look into the history of genomics, which is short, but dynamic (Table 1.3). By definition, genomics requires genome sequences, and to engage in comparative genomics, one needs at least two genomes to compare. In a close analogy to the history of molecular genetics, which owes most of its early progress to bacteriophages used as model systems, comparative genomics was first practiced with the genomes of viruses. These are several orders of magnitude smaller than even the tiniest bacterial genomes and, in case a virus grows well, sequencing of viral genomes became a relatively straightforward enterprise in the early 1980’s. By 1983, six years after the beginning of the sequencing era, a considerable number of complete genomes of diverse small viruses of plants, animals, and bacteria (bacteriophages) had been amassed, and the time was ripe for the birth of comparative genomics.

Pinpointing the exact beginning of comparative genomics may be difficult. In a sense, one may say that it was born as soon as there were two genomes to compare, i.e. in 1977 when the genome of phage ϕX174 was sequenced and could be compared with the already available sequence of the RNA phage MS2. However, this was a vacuous start because the two phages had virtually nothing in common (a propos, this has not changed in 20 years: for all we know, these phage families are truly unrelated). It seems that comparative genomics had a real head start with two astonishing discoveries that caught most, if not all, virologists utterly by surprise. First, it was shown that RNA-containing retroviruses (causative agents of certain leucoses in animals and humans and, as shown later, of AIDS) shared a conserved replicative enzyme, the reverse transcriptase, with two groups of DNA viruses, the hepadnaviruses (including the medically important hepatitis B virus) and caulimoviruses, infecting plants [847]. Second, it turned out that small RNA viruses infecting animals (picornaviruses, such as polio and foot-and-mouth disease) and those infecting plants (cowpea mosaic virus) shared not only significant sequence similarity that allowed the identification of homologous (orthologous) genes, but also, in part, the order of these genes in their genomes [7,56,335]. Subsequent systematic studies have revealed a complex network of homologous relationships within the vast classes of positive-strand RNA viruses and negative-strand RNA viruses. Although still disputed, the concept emerged that each of these classes was monophyletic, that is, probably evolved from a common ancestral virus [460]. These studies combined two elements that were crucial in defining the identity of the emerging discipline of comparative and evolutionary genomics.

Firstly, the objects of analysis were complete genomes, however small, rather than individual genes, and accordingly, the notions of conservation of gene order and gene shuffling became important. Secondly, the discoveries made through these genome comparisons were completely unexpected; there was no experimental data that would prepare researchers for the startling unity of superficially unrelated viruses.

In retrospect, it is somewhat ironic that comparative genomics had to start with virus genomes (due to the experimental contingency) because viral proteins tend to evolve extremely fast, and detection of conservation between distant viruses may be a non-trivial task, even with advanced methods of computational sequence analysis, let alone with those available in the early 1980’s. This was a challenge and perhaps a blessing in disguise. The difficulty of detecting sequence conservation among viral proteins prompted those who ventured into this area to employ approaches that later proved invaluable in comparative genomics and computational biology in general: (i) compare protein sequences, rather than nucleotide sequences directly, whenever distant relationships are involved and sensitivity is an issue; (ii) rely on multiple, rather than pairwise, comparisons; (iii) search for conserved patterns or motifs in multiple sequences; and, above all (iv) actually look at sequences (and structures whenever these are available) and think about the potential relationships in an effort to synthesize all relevant shreds of information. This practice has been dubbed, more or less pejoratively, “sequence gazing” [341]. Sure enough, sequence and structure comparisons are prone to error and, worse, to fantasy, and these dangers had been particularly grave in the early days, before the statistical foundations of computational biology had been worked out and the rules of thumb had been established through accumulated practices. There is no doubt, however, that success stories of computational prediction of gene functions have been of much greater import and have, to a large extent, determined the very feasibility of the further progress of genomics.

The first comparative-genomic study of a larger scale, investigating the relationships between genomes that contained >100 genes each, came in 1986 [558]. The newly sequenced genome of varicella zoster virus was carefully compared to the previously sequenced Epstein-Barr virus genome (the original Epstein-Barr genome paper [68] resembled the λ work in that no homologs were reported for any of the viral proteins because, indeed, none were to be easily identified among the sequences then available). This work, though little noticed outside virology, already included the principal elements of the comparative-genomic approach, if not the actual methods.

1.2. … and the Astonishing Progress of Genome Sequencing

Comparative genomics of cellular life forms is in a way a “by-product” of the Human Genome Project. Probably the greatest insight of the leaders of the early stages of this project was the realization that, in isolation, the human genome would be a costly but uninterpretable string of three billion or so of A’s, T’s, G’s and C’s. Only through systematic comparisons to other genomes may we hope to make sense of the text of this “Book of Life”. As far as genomics is concerned, Theodosius Dobzhansky’s famous dictum “Nothing in biology makes sense except in the light of evolution” is not some kind of evolutionist propaganda, but an entirely literal and more or less routine description of the situation. And so, in the last decade of the second millennium, the genome sequences started pouring in. Yeast chromosome III, the first respectable chunk of contiguous genome sequence [629] that became available in 1992 (quite modest, by today’s standards, just ~320,000 base pairs), generated major excitement epitomized in the title of a Nature note describing a re-analysis of the ORFs from this chromosome: “What’s in the genome?” [105]. From the analysis of this sequence and other large genome segments that started to appear in the next months, at least two notions were derived that became critical for the subsequent evolution of comparative genomics: (i) there were many more genes in the genome than anyone suspected previously on the basis of genetic or biochemical experiments; and (ii) methods of computational analysis matter—careful analysis employing multiple complementary approaches yields incomparably more information on gene functions and evolutionary relationships than any single automatic procedure.

The appearance in August 1995 of the complete genome sequence of the parasitic bacterium Haemophilus influenzae [232] ushered in the era of “real” genomics, the study of complete genomes of cellular organisms. The acceleration of genome sequencing required for this to happen was greatly facilitated by the whole-genome shotgun approach pioneered by Craig Venter, Hamilton Smith, and Leroy Hood [871]. Systematic comparative approaches were tried immediately, even before the second genome came, by using the largely finished genome of Escherichia coli [829]. Since that point, complete genomes of bacteria and archaea have been arriving at a steady rate, which seems to be accelerating in the 3rd millennium (Figure 1.1). Starting with the second genome sequencing paper [242], reports on new genomes inevitably became comparative-genomic studies because, as we have already mentioned, that is the only way to even start understanding “what’s in the genome”.

By June 1, 2002, genomes of 73 species of unicellular organisms (55 bacterial species, 16 archaea, and 2 eukaryotes) were completely sequenced and available in public databases. In the three parts of Table 1.4, the completely sequenced bacterial, archaeal, and eukaryotic genomes are listed in the order of decreasing size. The largest prokaryotic genomes (Streptomyces coelicolor among bacteria, Methanosarcina acetivorans among the archaea) have been sequenced only recently, which promises many interesting discoveries yet to come.

By the time of this writing (August 2002), the first genomes of multicellular eukaryotes, the nematode worm Caenorhabditis elegans, the fruit fly Drosophila melanogaster, the thale cress Arabidopsis thaliana, the pufferfish Fugu rubripes, and Homo sapiens had been nearly completed (let us note that the very concept of a complete genome sequence for these organisms differs from that for prokaryotes and unicellular eukaryotes). At least 100 more prokaryotic genomes and many eukaryotic genomes, including those of mouse and rat, were at different stages of completion. Beyond doubt, many more finished or nearly finished genome sequences exist in proprietary databases maintained by biotech companies, but since these cannot be freely analyzed, they do not count as far as comparative genomics is concerned.

Any list of completed genomes rapidly becomes outdated and so will Table 1.4, even as this book appears in print. Periodically updated listings of both finished and unfinished publicly funded genome sequencing projects are available at the web sites maintained at the Institute for Genomic Research (TIGR, http://www.tigr.org/tdb/mdb/mdb.html) and at the NCBI (http://www.ncbi.nlm.nih.gov/PMGifs/Genomes/micr.html). The Chicago-based Integrated Genomics Inc. maintains Genomes OnLine Database (http://wit.genomesonline.org), which lists most public as well as some private projects. In addition, web sites of the genome sequencing centers list the projects run or planned in those particular institutions (see Appendix 2).

The relative ease of 6- to 8-fold coverage sequencing as compared to finishing and genome annotation resulted in the availability of a number of incomplete genomes, which are not going to be finalized any time soon (see, for example, the web site of the Department of Energy Joint Genome Institute, http://www.jgi.doe.gov/JGI_microbial/html/index.html). These sequences are a treasure trove for someone who knows what to look for. Most of the data are available for searching through the NCBI BLAST page at http://www.ncbi.nlm.nih.gov/cgi-bin/Entrez/genom_table_cgi or through the web sites of the respective sequencing centers. A partial list of the major genome sequencing centers is available in Appendix 2. Of course, as new genome sequencing centers appear on the map, this listing is going to become obsolete, too. For updated listings of such centers, one could look at the web sites of NCBI (http://www.ncbi.nlm.nih.gov/PMGifs/Genomes/links.html) or the National Human Genome Research Institute (http://www.genome.gov).

In addition to the whole-genome sequencing projects, there are many large-scale expressed sequence tags (EST) sequencing projects, aimed at collecting partial mRNA sequence data from eukaryotic organisms that have not yet made it to the list of priority targets for complete sequencing.

1.3. Basic Questions of Comparative Genomics

In the subsequent chapters of this book, we address many specific problems in comparative and evolutionary genomics. Right now, however, it makes sense to address some basic questions, the answers to which, as we believe, define the status of this research area.

How good is our current collection of genome sequences? Or, more precisely, how representative is it of the actual diversity of life forms? To address this issue, one has to superimpose the sequenced genomes over the taxonomy tree and see how densely populated the main branches are. When this is done with the prokaryotic part of the taxonomy, the result seems to be rather encouraging: the main bacterial and archaeal lineages are already represented by either a complete genome sequence or a genome project that is nearing completion (Table 1.5). However, this needs to be taken with a grain of salt because our knowledge of prokaryotic diversity is itself quite incomplete. Environmental molecular evolutionary studies indicate that the great majority of bacterial and archaeal species is uncultivable with the current methods [644]. Recent techniques aimed at growing these organisms [411] might eventually result in a real revolution in microbial genomics, but it will take years to unfold. Most of those species whose rRNA sequences are produced by environmental cloning fall within known bacterial and archaeal lineages, suggesting that we have already sampled most of the prokaryotic diversity. However, this argument is somewhat circular because we have no idea how many prokaryotes might be not only uncultivable but also unclonable, even with the most non-specific set of PCR primers that have been tried. A case in point is the recent report of a new archaeal phylum, the Nanoarchaea [362]. With these caveats, it is fair to say that, to the best of our knowledge, the diversity of prokaryotes is reasonably well covered by genome sequences, and hence, the stage is set for prokaryotic evolutionary genomics.

The situation with eukaryotes is different in that we seem to have a better grasp of the true eukaryotic diversity and realize that the available set of genome sequences is by no means representative (Table 1.6). While certain groups (ascomycetes, nematodes, insects, mammals) are being tackled by multiple genome projects, most of the early branching eukaryotic lineages are not represented among the sequenced genomes, and neither are most of the animal and plant phyla, including such important groups as sponges, coelenterates, and segmented (annelid) worms. Certainly, this is no reason to postpone detailed comparative-genomic analysis, but this insufficiency of genomic data needs to be taken into account when conclusions are made on eukaryotic evolution.

The next question that we have to address is: Why does comparative genomics work to give us information on gene functions and evolution? The general answer is provided by the neutral theory of molecular evolution [440]. Neutral evolution is fast, as convincingly demonstrated, for example, by the rapid deterioration of pseudogene sequences. Therefore, whenever we detect sequence conservation among proteins or nucleic acids from species separated by a long span of evolution (and this, in practical terms, involves any comparison between two species because these are typically separated by millions of years, time more than sufficient for a pseudogene to change beyond recognition), we can be sure that this conservation is due to the pressure of purifying selection driven by functional constraints. To put it in even simpler terms, what is conserved in a sequence is functionally important. Furthermore, and less trivially, the conserved amino acids and nucleotides almost always perform the same or similar functions, at least in structural and biochemical terms, in homologous protein, RNA, or DNA molecules.
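The link between conservation and function is usually put to work by looking for invariant positions in a multiple alignment of homologs. The following is only a toy sketch of that idea: the four aligned sequences are invented for illustration and do not correspond to real proteins, and the function names and the simple "fraction sharing the most common residue" score are our assumptions rather than a method described in this book.

```python
from collections import Counter

# Hypothetical, hand-made alignment (NOT real proteins); '-' marks a gap.
toy_alignment = [
    "MKHELGHAL",
    "MRHELGHSV",
    "MKHEIGHAI",
    "M-HELGHTL",
]

def column_conservation(alignment):
    """For each column, return the fraction of all sequences that carry
    the most common residue; gaps never count as a shared residue."""
    n_sequences = len(alignment)
    scores = []
    for column in zip(*alignment):
        residues = [c for c in column if c != "-"]
        top_count = Counter(residues).most_common(1)[0][1] if residues else 0
        scores.append(top_count / n_sequences)
    return scores

def conserved_positions(alignment, threshold=1.0):
    """1-based alignment positions whose conservation meets the threshold."""
    scores = column_conservation(alignment)
    return [i + 1 for i, score in enumerate(scores) if score >= threshold]

print(conserved_positions(toy_alignment))
```

With a threshold of 1.0 only strictly invariant columns are reported; under the reasoning above, these are the positions one would examine first as candidates for functionally important residues. Real analyses use substitution matrices and proper statistics rather than raw identity.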

These general concepts of molecular evolution indicate that comparative genomics is likely to be informative in principle, but they tell us nothing about the evolutionary distances at which it is expected to work. The theory would not have been violated in any way if only homologs from closely related species showed significant sequence similarity. However, it had been known already in the pre-genomic era that certain proteins are highly conserved even between vertebrates and bacteria, and the very first genome comparisons revealed deep evolutionary conservation for the majority of proteins. When state of the art methods for sequence comparison are applied, homologs from more than one distantly related species are detectable for 70-80% of the proteins encoded in any prokaryotic genome [827]. At present this fraction seems to be somewhat lower for some of the eukaryotes, but only because the taxonomic density of genome sequencing so far has been insufficient. Indeed, in the genomes of humans and mice, species that diverged from their common ancestor 80-100 million years ago, nearly all genes are conserved. These crucial facts show that genome comparisons are likely to reveal important information on the functions and evolutionary relationships of the great majority of genes in any genome.

We have already stated that genomics would not make any sense at all without the possibility of informative genome comparison. Why is this so? In principle, one could imagine that a combination of theoretical methods for deciphering a protein’s three-dimensional structure from the sequence and experimental studies would allow functional identification without recourse to evolutionary analysis. However, neither of these approaches is up to the task. Some recent progress notwithstanding, there is no hope that, in the foreseeable future, ab initio methods will become capable of correctly predicting the structure of proteins on genome scale (or on any significant scale except, possibly, for some small proteins with simple folds), let alone their functions.

As for genome-wide experimental characterization of protein functions, far-reaching studies have been conducted, such as elucidation of the phenotype of all gene knockout mutants, massive study of subcellular localization, and identification of protein-protein interaction in bulk for yeast S. cerevisiae [714,876]. However, actual determination of the biochemical activity and more so of the biological function of a protein remains a unique task, and even for model organisms such as yeast or E. coli, this goal is not in sight for all gene products.

Indeed, for the great majority of organisms whose genomes have been sequenced, only a few genes have been studied experimentally (Figure 1.2), and there is no hope for substantial progress in the near future.

Even for E. coli, the workhorse of molecular genetics for the last 50 years, less than half of the genes have been experimentally characterized. Prior to the completion of the genome of the archaeon M. jannaschii, only four proteins had been characterized in that organism: two flagellins, RadA recombinase and the adenylate kinase (in Figure 1.2, this sector is just not visible).

The availability of the genome sequence spawned efforts to characterize other genes in these organisms, but so far these studies made only a limited contribution. The level of characterization of eukaryotic genomes is not much higher, although post-genomic efforts are improving the understanding of the yeast and nematode proteomes (see 3.5.2).

Under these circumstances, the theory of molecular evolution and, in particular, the simple connection between evolutionary conservation and function outlined above remain the crucial theoretical underpinning and the main methodology of functional genomics. The comparative approach allows researchers to predict protein functions by transferring information from functionally characterized proteins of model organisms to their uncharacterized homologs and to delineate the functionally critical parts of protein (and RNA) molecules, such as catalytic or binding sites. Naturally, the quality of these inferences depends on the sensitivity and robustness of computational methods employed by comparative genomics. These caveats notwithstanding, we will argue that comprehensive comparative analysis of genomic sequences and the proteins they encode is an absolute prerequisite to further advances in our understanding of cell biology. Actually, we tend to believe that comparative genomics is up to something grander, namely prioritization of targets for systematic experimental studies. This approach has been partially realized in structural genomics, and we see no reason why it cannot be profitably applied in functional genomics as well. We will be quite satisfied if this book makes just a small step in this direction.

1.4. Further Reading

1. Doolittle RF. 1986. Of Urfs and Orfs: A primer on how to analyze derived amino acid sequences. University Science Books, San Diego.
2. Cairns J, Stent GS, Watson JD. 1992. Phage and the Origins of Molecular Biology. Cold Spring Harbor Laboratory Press, Cold Spring Harbor, NY.
3. Mount DW. 2000. Bioinformatics: Sequence and genome analysis. Cold Spring Harbor Laboratory Press, Cold Spring Harbor, NY. Chapter 1.
4. Koonin EV, Dolja VV. Evolution and taxonomy of positive-strand RNA viruses: implications of comparative analysis of amino acid sequences. Critical Reviews in Biochemistry and Molecular Biology. 1993;28:375–430. [PubMed: 8269709]

Figures

Figure 1.1. Growth of the number of completely sequenced genomes.

The data are from Table 1.4. The 2002 figure is extrapolated from the 5-month results.

Figure 1.2. The current state of annotation of some genomes.

The data were derived from the original genome sequencing papers [94,130,290,488]. The information on experimentally characterized genes of E. coli is from the GeneProtEC and E. coli Proteome databases, the corresponding data for yeast and human are from the MIPS and OMIM databases, respectively (see 3.5). The numbers of genes characterized by similarity only and similar to unknown genes are from the COG database (see 3.4); these numbers might be a slight underestimate because each COG is required to include representatives of three sufficiently distant species, and those few proteins that have a homolog in only one other species are lost in this analysis.

Tables

Table 1.1. Protein-coding genes of bacteriophage λ

Chromosomal location, bases | DNA strand | Length, aa | Gene name | Gene product
191..736 | + | 181 | nu1 | DNA packaging protein
711..2636 | + | 641 | A | DNA packaging protein
2633..2839 | + | 68 | W | Head-tail joining protein
2836..4437 | + | 533 | B | Capsid component
4418..5737 | + | 439 | C | Capsid component
5132..5737 | + | 201 | nu3 | Capsid assembly
5747..6079 | + | 110 | D | Head-DNA stabilization
6135..7160 | + | 341 | E | Capsid component
7202..7600 | + | 132 | Fi | DNA packaging
7612..7965 | + | 117 | Fii | Head-tail joining
7977..8555 | + | 192 | Z | Tail component
8552..8947 | + | 131 | U | Tail component
8955..9695 | + | 246 | V | Tail component
9711..10133 | + | 140 | G | Tail component
10115..10549 | + | 144 | T | Tail component
10542..13103 | + | 853 | H | Tail component
13100..13429 | + | 109 | M | Tail component
13429..14127 | + | 232 | L | Tail component
14276..14875 | + | 199 | K | Tail component
14773..15444 | + | 223 | I | Tail component
15505..18903 | + | 1132 | J | Tail: host specificity
18965..19585 | + | 206 | lom | Outer host membrane
19650..20855 | + | 401 | orf401 | Tail fiber protein
20147..20767 | - | 206 | orf206 | Hypothetical protein*
21029..21973 | + | 314 | orf314 | Tail fiber
21973..22557 | + | 194 | orf194 | Fiber assembly protein
22686..23918 | - | 410 | ea47 |
24509..25399 | - | 296 | ea31 |
25396..26973 | - | 525 | ea59 |
27812..28882 | - | 356 | int | Integration protein
28860..29078 | - | 72 | xis | Excisionase
29118..29285 | - | 55 | - | Hypothetical protein*
29374..29655 | - | 93 | ea8.5 |
29847..30395 | - | 182 | ea22 |
30839..31024 | - | 61 | orf61 | Hypothetical protein*
31005..31196 | - | 63 | orf63 | Hypothetical protein*
31169..31351 | - | 60 | orf60a | Hypothetical protein*
31348..32028 | - | 226 | exo | Exonuclease
32025..32810 | - | 261 | bet | Recombination protein
32816..33232 | - | 138 | gam | Host-nuclease inhibitor protein
33187..33330 | - | 47 | kil | Host-killing
33299..33463 | - | 54 | cIII | Antitermination
33536..33904 | - | 122 | ssb | Single-stranded DNA binding protein
34087..34287 | - | 66 | ral | Restriction alleviation
34271..34357 | - | 28 | orf28 | Hypothetical protein*
34482..35036 | + | 184 | imm21 | Superinfection exclusion protein B
35037..35438 | - | 133 | N | Early gene regulator
35825..36259 | - | 144 | rexB | Exclusion
36275..37114 | - | 279 | rexA | Exclusion
37227..37940 | - | 237 | cI | Repressor
38041..38241 | + | 66 | cro | Antirepressor
38360..38653 | + | 97 | cII | Antitermination
38686..39585 | + | 299 | O | DNA replication
39582..40283 | + | 233 | P | DNA replication
40280..40570 | + | 96 | ren | Ren exclusion protein
40644..41084 | + | 146 | Nin146 |
41081..41953 | + | 290 | Nin290 |
41950..42123 | + | 57 | Nin57 |
42090..42272 | + | 60 | Nin60 |
42269..42439 | + | 56 | Nin56 |
42429..43043 | + | 204 | Nin204 |
43040..43246 | + | 68 | Nin68 |
43224..43889 | + | 221 | Nin221 |
43886..44509 | + | 207 | Q | Late gene regulator
44621..44815 | + | 64 | orf64 | Hypothetical protein*
45186..45509 | + | 107 | S | Cell lysis protein
45493..45969 | + | 158 | R | Cell lysis protein
45966..46427 | + | 153 | Rz | Cell lysis protein
46459..46752 | - | 97 | bor | Bor protein precursor
47042..47575 | - | 177 | - | Putative envelope protein
47738..47944 | + | 68 | - | Hypothetical protein*

Based on the data from the NCBI Entrez Genomes web site, http://www.ncbi.nih.gov/Genomes/.
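The location and length columns of Table 1.1 are tied together by simple arithmetic: a gene spanning bases start..end occupies end - start + 1 nucleotides, and the encoded protein has one residue per codon, not counting the stop codon. The following is a minimal sketch of that relationship, assuming an uninterrupted reading frame ending in a single stop codon; the function name `protein_length` and the choice of three sample rows are illustrative only.

```python
def protein_length(start: int, end: int) -> int:
    """Length in amino acids of a protein encoded by bases start..end (inclusive).

    Assumes a contiguous open reading frame that ends in one stop codon.
    """
    n_bases = end - start + 1
    if n_bases % 3 != 0:
        raise ValueError("span is not a whole number of codons")
    return n_bases // 3 - 1  # the stop codon is not translated


# Three rows copied from Table 1.1: (gene, start, end, listed length in aa)
rows = [("nu1", 191, 736, 181), ("A", 711, 2636, 641), ("W", 2633, 2839, 68)]

for gene, start, end, listed in rows:
    print(gene, protein_length(start, end), listed)
```

The same check can be run over any row of the table to confirm that the coordinates and the listed protein length agree.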

Table 1.2. Non-trivial evolutionary connections and functional predictions for bacteriophage λ proteins

Gene product | Evolutionary conservation | Structure, Domain architecture (a) | Predicted function, Reference
A (TerL) | Bacteriophages, herpesviruses | A modified P-loop ATPase domain, distantly related to a vast class of helicases | ATPase subunit of the terminase, involved in DNA packaging in phage head
C | Bacteria and archaea | ClpP protease domain | Minor capsid protein, cleaves the scaffold protein during maturation
K | Bacteria, archaea, and eukaryotes | Consists of an N-terminal JAB/MPN domain (predicted metalloprotease) and a C-terminal CHAP domain (Cys,His-dependent DL-glutamate-specific amidohydrolase) | Tail subunit; predicted protease involved in tail assembly (based on the presence of the JAB/MPN domain) [679] and peptidoglycan lysis (based on the presence of the peptidoglycan amidohydrolase CHAP domain) [948]
Ea31 | Scattered distribution archaea | Endo VII-colicin domain | Predicted nuclease of the McrA (HNH) family [50]
Ea59 | Bacteria, archaea, and eukaryotes | P-loop ATPase domain of the ABC class | Predicted ATPase [296]
Exo (RedX) | Bacteria, archaea, eukaryotes, viruses | λ exonuclease domain, distantly related to a broad variety of nucleases | A nuclease involved in phage recombination and late rolling-circle replication
CI | Bacteria, archaea | N-terminal helix-turn-helix DNA-binding domain fused to a C-terminal serine protease domain of the LexA/UmuD family | Transcription repressor of genes required for lytic development
Cro | Bacteria, archaea | Helix-turn-helix DNA-binding domain | Transcription repressor of early genes
O | Bacteria, archaea | Helix-turn-helix DNA-binding domain | DNA-binding protein involved in the initiation of replication
Ren | Bacteria, archaea | Helix-turn-helix DNA-binding domain | Protein involved in exclusion of replication of heterologous genomes in λ-infected bacteria
Nin290 | Bacteria, archaea, eukaryotes | PP-loop ATPase domain | Predicted ATP pyrophosphatase, role in phage replication unknown [102]
Nin221 | Bacteria, archaea, eukaryotes | Calcineurin-like serine/threonine protein phosphatase domain | Protein phosphatase, role in phage replication unknown [450]
a

Detailed descriptions of these and other domains are available in the Pfam, SMART, and CDD protein domain databases (see 3.2) and in SCOP and CATH protein structure databases (see 3.3).

Table 1.3. A brief timeline of genomics

Year | Event | Ref.
1962 | The first theory of molecular evolution; the Molecular Clock concept (Linus Pauling and Emile Zuckerkandl) | [946]
1965 | Atlas of Protein Sequences, the first protein database (Margaret Dayhoff and coworkers) | [173]
1970 | Needleman-Wunsch algorithm for global protein sequence alignment | [606]
1977 | New DNA sequencing methods (Fred Sanger, Walter Gilbert and coworkers); bacteriophage ϕX174 sequence | [553,743]
1977 | First software for sequence analysis (Roger Staden) | [797]
1977 | Phylogenetic taxonomy; archaea discovered; the notion of the three primary kingdoms of life introduced (Carl Woese and coworkers) | [905]
1981 | Smith-Waterman algorithm for local protein sequence alignment | [784]
1981 | Human mitochondrial genome sequenced | [28]
1981 | The concept of a sequence motif (Russell Doolittle) | [185]
1982 | GenBank Release 3 made public |
1982 | Phage λ genome sequenced (Fred Sanger and coworkers) | [742]
1983 | The first practical sequence database searching algorithm (John Wilbur and David Lipman) | [892]
1985 | FASTP/FASTN: fast sequence similarity searching (William Pearson and David Lipman) | [521]
1986 | Introduction of Markov models for DNA analysis (Mark Borodovsky and coworkers) | [107]
1987 | First profile search algorithm (Michael Gribskov, Andrew McLachlan, David Eisenberg) | [315]
1988 | National Center for Biotechnology Information (NCBI) created at NIH/NLM |
1988 | EMBnet network for database distribution created |
1990 | BLAST: fast sequence similarity searching with rigorous statistics (Stephen Altschul, David Lipman and coworkers) | [20]
1991 | EST: expressed sequence tag sequencing (Craig Venter and coworkers) | [4]
1994 | Hidden Markov Models of multiple alignments (David Haussler and coworkers; Pierre Baldi and coworkers) | [71,72,473]
1994 | SCOP classification of protein structures (Alexei Murzin, Cyrus Chothia and coworkers) | [590]
1995 | First bacterial genomes completely sequenced | [232,242]
1996 | First archaeal genome completely sequenced | [130]
1996 | First eukaryotic genome (yeast) completely sequenced | [290]
1997 | Introduction of gapped BLAST and PSI-BLAST | [22]
1997 | COGs: Evolutionary classification of proteins from complete genomes | [828]
1998 | Worm genome, the first multicellular genome, (nearly) completely sequenced | [840]
1999 | Fly genome (nearly) completely sequenced | [3]
2001 | Human genome (nearly) completely sequenced | [488,870]

Table 1.4. Completely sequenced genomes (as of June 1, 2002)

Species (a) | Genome size, kb | Total no. of proteins | Year finished | Sequencing center (b) | Ref.
Bacteria
Streptomyces coelicolor | 8,668 | 7,567 | 2002 | Sanger Centre | [85]
Mesorhizobium loti | 7,036 | 6,752 | 2000 | Kazusa Institute | [415]
Sinorhizobium meliloti | 6,692 | 6,205 | 2001 | EU Consortium | [137]
Nostoc sp. (Anabaena) | 6,414 | 5,366 | 2001 | Kazusa Institute | [416]
Pseudomonas aeruginosa | 6,264 | 5,565 | 2000 | Pathogenesis Co. | [807]
Agrobacterium tumefaciens | 5,674 | 5,419 | 2001 | U. Washington, Cereon Inc. | [293,917]
Xanthomonas citri | 5,176 | 4,312 | 2002 | U. Sao Paulo | [167]
Xanthomonas campestris | 5,076 | 4,181 | 2002 | U. Sao Paulo | [167]
Salmonella typhimurium | 4,857 | 4,451 | 2001 | Sidney Kimmel Cancer Center | [557]
Salmonella typhi | 4,809 | 4,600 | 2001 | Sanger Centre | [654]
Yersinia pestis | 4,654 | 4,008 | 2001 | Sanger Centre | [656]
Escherichia coli | 4,639 | 4,289 | 1997 | U. Wisconsin | [94]
Mycobacterium tuberculosis | 4,412 | 3,918 | 1998 | Sanger Centre | [152]
Bacillus subtilis | 4,215 | 4,100 | 1997 | Institute Pasteur | [477]
Bacillus halodurans | 4,202 | 4,066 | 2000 | JAMST Center | [820]
Vibrio cholerae | 4,033 | 3,827 | 2000 | TIGR | [337]
Caulobacter crescentus | 4,017 | 3,737 | 2001 | TIGR | [618]
Clostridium acetobutylicum | 3,941 | 3,672 | 2001 | Genome Therapeutics | [622]
Ralstonia solanacearum | 3,716 | 3,442 | 2002 | Genoscope | [731]
Synechocystis sp. | 3,573 | 3,169 | 1996 | Kazusa Institute | [417]
Corynebacterium glutamicum | 3,309 | 3,040 | 2002 | U. Bielefeld | [831]
Mycobacterium leprae | 3,268 | 2,720 | 2001 | Sanger Centre | [153]
Clostridium perfringens | 3,031 | 2,660 | 2002 | U. Tsukuba | [767]
Listeria innocua | 3,011 | 2,981 | 2001 | Institute Pasteur | [286]
Listeria monocytogenes | 2,945 | 2,855 | 2001 | Institute Pasteur | [286]
Staphylococcus aureus | 2,814 | 2,594 | 2001 | Juntendo U. | [481]
Thermoanaerobacter tengcongensis | 2,689 | 2,588 | 2002 | Beijing Genomics Inst. | [73]
Xylella fastidiosa | 2,679 | 2,766 | 2000 | Sao Paulo State | [772]
Deinococcus radiodurans | 2,649 | 2,580 | 1999 | TIGR | [891]
Lactococcus lactis | 2,365 | 2,266 | 2001 | INRA | [97]
Pasteurella multocida | 2,257 | 2,014 | 2001 | U. Minnesota | [554]
Neisseria meningitidis | 2,184 | 2,121 | 2000 | TIGR, Sanger | [653,837]
Fusobacterium nucleatum | 2,174 | 2,067 | 2002 | Integr. Genomics | [418]
Streptococcus pneumoniae | 2,160 | 2,094 | 2001 | TIGR | [357,836]
Brucella melitensis | 2,117 | 2,059 | 2002 | U. Scranton | [179]
Thermotoga maritima | 1,861 | 1,846 | 1999 | TIGR | [610]
Streptococcus pyogenes | 1,852 | 1,697 | 2001 | U. Oklahoma | [223]
Haemophilus influenzae | 1,830 | 1,709 | 1995 | TIGR | [232]
Campylobacter jejuni | 1,641 | 1,654 | 2000 | Sanger Centre | [655]
Helicobacter pylori | 1,668 | 1,566 | 1997 | TIGR | [848]
Aquifex aeolicus | 1,551 | 1,522 | 1998 | Diversa Corp. | [175]
Rickettsia conorii | 1,269 | 1,274 | 2000 | U. Marseille | [626]
Chlamydia pneumoniae | 1,230 | 1,052 | 1999 | UC Berkeley | [412]
Treponema pallidum | 1,138 | 1,031 | 1998 | TIGR | [243]
Rickettsia prowazekii | 1,111 | 834 | 1998 | Uppsala U. | [30]
Chlamydia muridarum | 1,069 | 909 | 2000 | TIGR | [694]
Chlamydia trachomatis | 1,042 | 894 | 1998 | UC Berkeley | [805]
Borrelia burgdorferi | 911 | 850 | 1997 | TIGR | [241]
Mycoplasma pneumoniae | 816 | 677 | 1996 | U. Heidelberg | [347]
Ureaplasma urealyticum | 752 | 611 | 2000 | U. Alabama | [287]
Mycoplasma pulmonis | 964 | 782 | 2001 | U. Bordeaux | [144]
Buchnera sp. APS | 641 | 564 | 2000 | U. Tokyo | [766]
Mycoplasma genitalium | 580 | 467 | 1995 | TIGR | [242]
Archaea
Methanosarcina acetivorans | 5,751 | 4,540 | 2002 | Whitehead Inst. | [254]
Methanosarcina mazei | 4,096 | 3,371 | 2002 | U. Göttingen | [181]
Sulfolobus solfataricus | 2,992 | 2,997 | 2001 | EU/Canada | [764]
Sulfolobus tokodaii | 2,695 | 2,826 | 2001 | NITE | [426]
Halobacterium sp. NRC-1 | 2,380 | 2,446 | 2000 | Inst. Syst. Biol. | [616]
Pyrobaculum aerophilum | 2,222 | 2,605 | 2002 | UCLA | [231]
Archaeoglobus fulgidus | 2,178 | 2,420 | 1997 | TIGR | [444]
Pyrococcus furiosus | 1,908 | 2,065 | 2001 | U. Maryland | [704]
Methanobacterium thermoautotrophicum | 1,751 | 1,869 | 1997 | Genome Therapeutics | [781]
Pyrococcus abyssi | 1,765 | 1,765 | 2000 | Genoscope | [599]
Pyrococcus horikoshii | 1,739 | ~1,750 | 1998 | NITE | [428]
Methanopyrus kandleri | 1,695 | 1,691 | 2002 | Fidelity Systems | [779]
Aeropyrum pernix | 1,670 | ~1,720 | 1999 | NITE | [427]
Methanococcus jannaschii | 1,665 | 1,715 | 1996 | TIGR | [130]
Thermoplasma volcanium | 1,585 | 1,499 | 2000 | NIBHT | [430]
Thermoplasma acidophilum | 1,565 | 1,478 | 2000 | MPI Biochem. | [719]
Eukaryotes
Homo sapiens | ~3,100,000 | ~40,000 | ~2002 | Human Genome Project, Celera | [488,870]
Mus musculus | ~3,100,000 | ~40,000 | ~2002 | Mouse Genome Project, Celera | -
Oryza sativa | ~420,000 | 32,277 | 2002 | Syngenta Corp. | [289]
Anopheles gambiae | ~278,000 | - | 2002 | Celera, Sanger | -
Drosophila melanogaster | ~137,300 | ~13,500 | 2000 | Celera, UC Berkeley | [3]
Arabidopsis thaliana | ~115,400 | 25,498 | 2000 | Arabidopsis Genome Project | [35]
Caenorhabditis elegans | ~96,900 | ~19,000 | 1999 | Sanger Centre, Washington U. | [840]
Saccharomyces cerevisiae | ~11,600 | ~6,000 | 1996 | European Consortium | [290]
Schizosaccharomyces pombe | ~12,600 | 4,824 | 2002 | Sanger Centre | [918]
Encephalitozoon cuniculi | ~2,500 | 1,997 | 2001 | Genoscope | [425]
a

Further in the book, these names are used mostly in the abbreviated form. Shading indicates obligate parasites.

b

For the complete names of the sequencing centers, see the NCBI Entrez Genomes web site http://www.ncbi.nlm.nih.gov/entrez/query.fcgi?db=Genome, Appendix 2, or the original references.
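One quantity that can be read directly off Table 1.4 is the number of predicted proteins per kilobase of genome. The sketch below computes it for four rows copied from the table; the function and variable names are illustrative, and the yeast values are the approximate ("~") figures given in the table, so the result for that row is correspondingly rough.

```python
# (species, genome size in kb, total no. of proteins), copied from Table 1.4
genomes = [
    ("Escherichia coli", 4639, 4289),
    ("Mycoplasma genitalium", 580, 467),
    ("Methanococcus jannaschii", 1665, 1715),
    ("Saccharomyces cerevisiae", 11600, 6000),  # both values approximate (~) in the table
]


def proteins_per_kb(size_kb: float, n_proteins: int) -> float:
    """Predicted protein-coding genes per kilobase of genome."""
    return n_proteins / size_kb


for species, size_kb, n_proteins in genomes:
    print(f"{species}: {proteins_per_kb(size_kb, n_proteins):.2f} proteins per kb")
```

For the three prokaryotes in this sample the ratio comes out close to one protein per kilobase, whereas for yeast it is roughly half that.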

Table 1.5. Coverage of the main prokaryotic phyla by genome projects

Major prokaryotic phyla (a) | Genome sequencing: completed | Genome sequencing: in progress
Archaea
 Crenarchaeota | 4 | 3
 Euryarchaeota | 12 | 2
Bacteria
 Aquificales | 1 | -
 CFB/Chlorobium group | - | 4
 Chlamydiales/Verrucomicrobia group | 3 | -
 Chrysiogenetes | - | -
 Cyanobacteria | 2 | 2
 Deferribacteres | - | -
 Dehalococcoides group | - | 1
 Dictyoglomus group | - | -
 Fibrobacter/Acidobacteria group | - | 1
 Bacillus/Clostridium group (low G+C gram-positive) | 13 | 20
 Actinobacteria (high G+C gram-positive) | 3 | 8
 Fusobacteria | 1 | -
 Green non-sulfur bacteria | - | 1
 Planctomycetales | - | -
 Proteobacteria | 23 | 44
  Alpha subdivision | 7 | 9
  Beta subdivision | 2 | 8
  Gamma subdivision | 12 | 20
  Delta subdivision | - | 5
  Epsilon subdivision | 2 | 2
 Spirochaetales | 2 | 2
 Thermodesulfobacteria | - | -
 Thermomicrobia | - | -
 Thermotogales | 1 | -
 Thermus/Deinococcus group | 1 | 1
a

The taxonomy is from the NCBI Taxonomy database (see 3.6). The data on the finished and ongoing genome sequencing projects are from the Entrez Genomes database (http://www.ncbi.nlm.nih.gov/PMGifs/Genomes/micr.html) and the Genomes OnLine Database (http://genomesonline.org).

Table 1.6. Status of the eukaryotic genome projects

Major eukaryotic phyla (a) | Representatives with ongoing sequencing projects
Acanthamoebidae | -
Acantharea | -
Alveolata |
 Apicomplexa | Babesia bovis, Cryptosporidium parvum, Eimeria tenella, Plasmodium falciparum, P. berghei, P. chabaudi, P. vivax, P. yoelii, Theileria annulata, Toxoplasma gondii
 Ciliophora | Paramecium tetraurelia, Tetrahymena sp.
 Dinophyceae | -
 Haplosporida | -
Apusomonadidae | -
Cercozoa | Chlorarachnion reptans
Core jakobids | Reclinomonas americana
Cryptophyta | Guillardia theta (nucleomorph genome)
Diplomonadida | Giardia intestinalis
Entamoebidae | Entamoeba histolytica
Euglenozoa | Leishmania major, Trypanosoma brucei
Glaucocystophyceae | -
Granuloreticulosea | -
Haptophyceae | -
Heterolobosea | -
Lobosea | -
Malawimonadidae | -
Microsporidia | Encephalitozoon cuniculi, Spraguea lophii
Mycetozoa | Dictyostelium discoideum
Oxymonadida | -
Parabasalidea | -
Paramyxea | -
Pelobiontida | -
Plasmodiophorida | -
Polycystinea | -
Retortamonadidae | -
Rhodophyta | Porphyra yezoensis
Stramenopiles | Thalassiosira pseudonana
Viridiplantae |
 Chlorophyta | Chlamydomonas reinhardtii
 Streptophyta | Alfalfa, barley, bean, coffee, corn, cotton, pine, poplar, potato, rice, sorghum, soybean, sugar cane, tomato, wheat
Fungi/Metazoa group |
 Aconchulinida | -
 Choanoflagellida | -
 Fungi |
  Ascomycota | Saccharomyces cerevisiae, Schizosaccharomyces pombe, Aspergillus nidulans, A. fumigatus, A. niger, Candida albicans, Coccidioides immitis, Debaryomyces hansenii, Fusarium proliferatum, Neurospora crassa, Pneumocystis carinii, Trichoderma reesei
  Basidiomycota | Cryptococcus neoformans, Phanerochaete chrysosporium, Ustilago maydis
  Chytridiomycota | -
  Zygomycota | -
 Metazoa |
  Porifera (sponges) | -
  Cnidaria | -
  Ctenophora | -
  Platyhelminthes | Schistosoma mansoni, S. japonicum
  Nematoda | Caenorhabditis elegans, Ascaris suum, Brugia malayi, C. briggsae, Haemonchus contortus
  Annelida | -
  Mollusca | -
  Arthropoda | Drosophila melanogaster, Anopheles gambiae, Aedes aegypti, A. albopictus, Amblyomma americanum, Glossina morsitans
  Chordata |
   Urochordata | Ciona intestinalis (sea squirt), C. savignyi
   Actinopterygii | Takifugu rubripes (fugu), Danio rerio (zebrafish), Oreochromis niloticus (tilapia)
   Amphibia | Ambystoma mexicanum (axolotl), Xenopus tropicalis (frog), X. laevis
   Crocodylidae | -
   Aves (birds) | Chicken, turkey
   Mammals | Human, mouse, rat, cat, chimpanzee, cow, dog
a

The taxonomy is from the NCBI Taxonomy database (see 3.6). Organisms with finished or almost finished projects are shown in bold; advanced-stage projects are shown in bold and italic; not all sequencing projects for each phylogenetic lineage are listed. Absence of sequencing projects for any representative of a phylogenetic lineage, according to the Entrez Genomes and Genomes OnLine databases, is indicated by a dash.

Copyright © 2003, Kluwer Academic.
Bookshelf ID: NBK20263