Genomics: From Phage to Human

Eugene V Koonin; Michael Y Galperin

NCBI Bookshelf. A service of the National Library of Medicine, National Institutes of Health.

Koonin EV, Galperin MY. Sequence - Evolution - Function: Computational Approaches in Comparative Genomics. Boston: Kluwer Academic; 2003.

Chapter 1. Genomics: From Phage to Human

1.1. The Humble Beginnings …

The first genome, that of RNA bacteriophage MS2, was sequenced in 1976, in a truly heroic feat of direct determination of an RNA sequence [225]. This was followed by the genome of bacteriophage ϕX174, the first triumph of the new, rapid sequencing methods developed in the laboratories of Walter Gilbert and Fred Sanger [553,743]. These are some of the smallest known genomes, with only four and ten genes, respectively. Then, in 1982, the last paper published by Sanger before he retired announced the first relatively large genome to be sequenced, that of bacteriophage λ, probably the most famous model system of classic molecular biology [742]. Phage λ has 48,502 bases of genomic DNA and ~70 known and predicted protein-coding genes and 23 RNA-coding genes. At 70 characters per line and 43 lines per page, this sequence alone would take over 16 pages of this book. However, the listing of the λ protein-coding genes (Table 1.1) fits into just two pages and definitely conveys more information. These days, it may be hard to imagine all the excitement felt by molecular biologists 20 years ago when the λ genome was finally finished. Nevertheless, even in this era of high-throughput methods, it could be instructive to look back and address several questions: (i) is the λ genome a good model of the subsequently sequenced prokaryotic and eukaryotic genomes? (ii) how accurate were the sequence itself and the original gene assignment? and (iii) how much more have we learned about functions of λ genes in the past 20 years?
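The page estimate quoted above is simple arithmetic. A minimal sketch in Python, using only the genome length and the page layout figures given in the text (the variable names are ours):

```python
import math

GENOME_LENGTH = 48_502   # bases of phage lambda genomic DNA, as quoted in the text
CHARS_PER_LINE = 70      # page layout figures quoted in the text
LINES_PER_PAGE = 43

# Number of sequence characters that fit on one printed page.
chars_per_page = CHARS_PER_LINE * LINES_PER_PAGE

# Exact (fractional) number of pages, and the number of physical pages needed.
pages_exact = GENOME_LENGTH / chars_per_page
pages_needed = math.ceil(pages_exact)

print(f"{chars_per_page} characters per page")
print(f"{pages_exact:.2f} pages of sequence, i.e. {pages_needed} printed pages")
```

The fractional value lies just above 16, which is what "over 16 pages" refers to.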

The answer to the first question is definitely yes: the λ genome has many features common to the genomes of cellular life forms, particularly prokaryotes. Most of the genome consists of protein-coding genes. Adjacent genes are often transcribed in the same direction and encode proteins that have similar functions and/or interact with each other (e.g. cell lysis proteins, tail components). Adjacent genes either slightly overlap or are separated by intergenic regions of varying length, typically much shorter than the genes themselves.
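The spacing of adjacent genes can be read directly from the coordinates in Table 1.1. The following sketch, under the assumption that the coordinates are 1-based and inclusive, classifies each pair of neighbors among the first few head genes as overlapping or separated by an intergenic region. The helper names are ours, and nu3, which lies entirely within C in the table, is left out for simplicity.

```python
# (start, end, strand, name) for consecutive genes copied from Table 1.1.
# nu3 is omitted because its coordinates fall entirely inside gene C.
GENES = [
    (191, 736, "+", "nu1"),
    (711, 2636, "+", "A"),
    (2633, 2839, "+", "W"),
    (2836, 4437, "+", "B"),
    (4418, 5737, "+", "C"),
    (5747, 6079, "+", "D"),
    (6135, 7160, "+", "E"),
]


def spacing(upstream, downstream):
    """Bases between two genes; a negative value means the genes overlap."""
    return downstream[0] - upstream[1] - 1


def describe_neighbors(genes):
    """Return (left, right, kind, size) for each pair of adjacent genes."""
    rows = []
    for left, right in zip(genes, genes[1:]):
        gap = spacing(left, right)
        kind = "overlap" if gap < 0 else "intergenic"
        rows.append((left[3], right[3], kind, abs(gap)))
    return rows


for left, right, kind, size in describe_neighbors(GENES):
    print(f"{left:>3} -> {right:<3} {kind:<10} {size:>3} bases")
```

For these rows, the overlaps and intergenic spacers are all a few dozen bases at most, far shorter than the genes they join.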

As for the second question, both the sequence and the gene assignments turned out to be essentially correct. The latter may not be surprising since the λ genome was annotated by researchers who had studied the phage for years, on the basis of the entire body of knowledge amassed by that time. In contemporary genome sequencing projects, such detailed analysis by highly qualified biologists with intimate knowledge of the biology of the given organism is the exception rather than the norm, partly because biological information on many of the sequenced organisms is simply too scarce.

A comparison of Table 1.1 with the original paper by Sanger et al. [742] shows that there is actually not much to add to the gene annotations. The use of recently developed sophisticated gene prediction programs, such as Glimmer (see 4.1), coupled with the analysis of the regions that are conserved between λ and related bacteriophages, led to the conclusion that certain intergenic regions might contain additional protein-coding genes (marked by asterisks in Table 1.1). Unfortunately, most of these genes remain uncharacterized, and it is not even known whether they are ever expressed. It is worth noting that exactly the same doubts exist about the possible functions and/or expression of a large number of so-called “hypothetical” genes, identified in the genomes of cellular life forms by essentially the same two principal approaches (see 4.1).

When reading the Sanger paper now, 20 years after it appeared, one is struck by the absence of any analysis of protein sequences in this detailed, thorough work. Although the authors did a careful computational analysis of open reading frames, particularly the likely translation starts and codon usage, the very word “homolog” is not used in the article, and there is no mention of any search of protein sequence databases, something that these days is, by default, an integral part of any genomic study. Not that protein sequence databases did not exist at the time: the first one, the Protein Identification Resource (PIR), was launched by Margaret Dayhoff, one of the great pioneers of computational biology, in 1965, long before genomics had even become conceivable [172,173]. However, reliable and rapid methods for searching this database had not yet been developed, and more generally, database search was not a part of the culture in molecular biology at the time. And for a good reason, too. Had Sanger and his coworkers performed a PIR search, even using the methods available in 2002, they would not have found anything of interest because the sequences available at that time were few and far apart, and there were no homologs of phage λ proteins among them. Clearly, the time was not ripe for comparative genomics and, in a sense, for genomics itself because, as we will see throughout this book, the comparative approach is truly central to the genomic enterprise.

Revisiting the phage λ genome after 20 years, we see a completely different “genomescape”. Using the PSI-BLAST program (see 4.3), the search of the complete non-redundant protein sequence database maintained at the NCBI (National Center for Biotechnology Information, a division of the National Institutes of Health in Bethesda, Maryland, USA) for homologs of the 73 proteins listed as gene products of phage λ took about an hour on a moderately powerful computer. Another hour was spent running selected proteins through the conserved domain search using the CDD option of the NCBI’s BLAST server (see 4.4). Of course, we could have scoured the literature for descriptions of computational analyses of λ proteins instead. However, extracting the relevant information from databases, such as PubMed (see 3.7), is far from trivial because, in most cases, the papers including this information dealt with more general issues and would not have λ, let alone a particular gene, mentioned in the title or abstract. Running the searches anew was much faster and easier. Besides, sequence databases are growing daily, which may substantially affect the results of searches and might even lead to new discoveries. Perusing the results, we should note that, with a few exceptions, there are now homologs readily detectable for the phage proteins. In the majority of cases, these are proteins from other related phages (sometimes integrated as prophages into the bacterial chromosome). However, 12 λ proteins show conservation in bacteria, archaea, and eukaryotes (Table 1.2). For several of these proteins whose functions have not been studied experimentally, non-trivial functional predictions become possible.

It is remarkable that some of the more interesting computational predictions remain without experimental test. Admittedly, the visibility of molecular biology of bacteriophages as a research field has not increased since the 1970’s, and the funds have pretty much tapered off. Good examples are the Ea59 and K genes that are predicted to encode an ATPase and a metal-dependent protease, respectively. Both are clear and readily testable predictions that have been described in print, even if briefly [296,679]. However, to our knowledge, no experimental tests of these predictions have been reported so far. Interestingly, an observation has been made during these searches that actually seems to have a novel aspect to it. The Ea31 protein was shown to contain a metal-dependent nuclease domain [50]. The stop codon of the Ea31 gene overlaps the start codon of Ea59, leading to the intriguing hypothesis that the two proteins interact and form an ATP-dependent nuclease complex. We discuss sequence analysis of Ea31 in greater detail in Chapter 4 to illustrate the process of discovery in database searches. Furthermore, this is a little example of context analysis, an increasingly important direction in genome annotation, which is covered in Chapter 5. This situation is not uncommon: computational analysis of genomes keeps yielding interesting functional predictions, even years after the publication of the sequence; what is most often lacking is systematic experimental testing of these predictions.

We will come back to this dramatic rift between computational and experimental analysis of most, if not all, genomes with more numbers, but first let us step back and have a quick look into the history of genomics, which is short, but dynamic (Table 1.3). By definition, genomics requires genome sequences, and to engage in comparative genomics, one needs at least two genomes to compare. In a close analogy to the history of molecular genetics, which owes most of its early progress to bacteriophages used as model systems, comparative genomics was first practiced with the genomes of viruses. These are several orders of magnitude smaller than even the tiniest bacterial genomes and, provided the virus grows well, sequencing a viral genome became a relatively straightforward enterprise in the early 1980’s. By 1983, six years after the beginning of the sequencing era, a considerable number of complete genomes of diverse small viruses of plants, animals, and bacteria (bacteriophages) had been amassed, and the time was ripe for the birth of comparative genomics.

Pinpointing the exact beginning of comparative genomics may be difficult. In a sense, one may say that it was born as soon as there were two genomes to compare, i.e. in 1977 when the genome of phage ϕX174 was sequenced and could be compared with the already available sequence of the RNA phage MS2. However, this was a vacuous start because the two phages had virtually nothing in common (incidentally, this has not changed in 20 years: for all we know, these phage families are truly unrelated). It seems that comparative genomics had a real head start with two astonishing discoveries that caught most, if not all, virologists utterly by surprise. First, it was shown that RNA-containing retroviruses (causative agents of certain leucoses in animals and humans and, as shown later, of AIDS) shared a conserved replicative enzyme, the reverse transcriptase, with two groups of DNA viruses, the hepadnaviruses (including the medically important hepatitis B virus) and caulimoviruses, infecting plants [847]. Second, it turned out that small RNA viruses infecting animals (picornaviruses, such as polio and foot-and-mouth disease viruses) and those infecting plants (cowpea mosaic virus) shared not only significant sequence similarity that allowed the identification of homologous (orthologous) genes, but also, in part, the order of these genes in their genomes [7,56,335]. Subsequent systematic studies have revealed a complex network of homologous relationships within the vast classes of positive-strand RNA viruses and negative-strand RNA viruses. Although still disputed, the concept emerged that each of these classes was monophyletic, that is, probably evolved from a common ancestral virus [460]. These studies combined two elements that were crucial in defining the identity of the emerging discipline of comparative and evolutionary genomics.

Firstly, the objects of analysis were complete genomes, however small, rather than individual genes, and accordingly, the notions of conservation of gene order and gene shuffling became important. Secondly, the discoveries made through these genome comparisons were completely unexpected; there was no experimental data that would prepare researchers for the startling unity of superficially unrelated viruses.

In retrospect, it is somewhat ironic that comparative genomics had to start with virus genomes (due to the experimental contingency) because viral proteins tend to evolve extremely fast, and detection of conservation between distant viruses may be a non-trivial task, even with advanced methods of computational sequence analysis, let alone with those available in the early 1980’s. This was a challenge and perhaps a blessing in disguise. The difficulty of detecting sequence conservation among viral proteins prompted those who ventured into this area to employ approaches that later proved invaluable in comparative genomics and computational biology in general: (i) compare protein sequences, rather than nucleotide sequences directly, whenever distant relationships are involved and sensitivity is an issue; (ii) rely on multiple, rather than pairwise, comparisons; (iii) search for conserved patterns or motifs in multiple sequences; and, above all (iv) actually look at sequences (and structures whenever these are available) and think about the potential relationships in an effort to synthesize all relevant shreds of information. This practice has been dubbed, more or less pejoratively, “sequence gazing” [341]. Sure enough, sequence and structure comparisons are prone to error and, worse, to fantasy, and these dangers had been particularly grave in the early days, before the statistical foundations of computational biology had been worked out and the rules of thumb had been established through accumulated practices. There is no doubt, however, that success stories of computational prediction of gene functions have been of much greater import and have, to a large extent, determined the very feasibility of the further progress of genomics.
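Point (i) in the list above can be illustrated with a toy case. The sketch below translates two hypothetical coding sequences (invented for this illustration, not taken from any genome) with the standard genetic code and compares their identity at the nucleotide and protein levels. Because the two sequences differ only by synonymous codon choices, the protein comparison retains a signal that the nucleotide comparison has largely lost. The function names are ours.

```python
from itertools import product

# Standard genetic code, laid out in the conventional TCAG codon order.
BASES = "TCAG"
AMINO_ACIDS = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"
CODON_TABLE = {
    "".join(codon): aa
    for codon, aa in zip(product(BASES, repeat=3), AMINO_ACIDS)
}


def translate(dna):
    """Translate an in-frame DNA string codon by codon (no stop handling)."""
    usable = len(dna) - len(dna) % 3
    return "".join(CODON_TABLE[dna[i:i + 3]] for i in range(0, usable, 3))


def identity(a, b):
    """Fraction of identical positions in two equal-length, ungapped strings."""
    if len(a) != len(b):
        raise ValueError("sequences must have the same length")
    return sum(x == y for x, y in zip(a, b)) / len(a)


# Hypothetical sequences: same peptide, different synonymous codons.
SEQ_A = "ATGAAACTGTCTCGTGGC"
SEQ_B = "ATGAAGTTAAGCAGAGGT"

print("nucleotide identity:", identity(SEQ_A, SEQ_B))
print("protein identity:   ", identity(translate(SEQ_A), translate(SEQ_B)))
```

This is a deliberately extreme example; real distant homologs differ at the protein level as well, and detecting them requires alignment with substitution scores rather than a simple identity count.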

The first comparative-genomic study of a larger scale, investigating the relationships between genomes that contained >100 genes each, came in 1986 [558]. The newly sequenced genome of varicella zoster virus was carefully compared to the previously sequenced Epstein-Barr virus genome (the original Epstein-Barr genome paper [68] resembled the λ work in that no homologs were reported for any of the viral proteins because, indeed, none were to be easily identified among the sequences then available). This work, though little noticed outside virology, already included the principal elements of the comparative-genomic approach, if not the actual methods.

1.2. … and the Astonishing Progress of Genome Sequencing

Comparative genomics of cellular life forms is in a way a “by-product” of the Human Genome Project. Probably the greatest insight of the leaders of the early stages of this project was the realization that, in isolation, the human genome would be a costly but uninterpretable string of three billion or so of A’s, T’s, G’s and C’s. Only through systematic comparisons to other genomes may we hope to make sense of the text of this “Book of Life”. As far as genomics is concerned, Theodosius Dobzhansky’s famous dictum “Nothing in biology makes sense except in the light of evolution” is not some kind of evolutionist propaganda, but an entirely literal and more or less routine description of the situation. And so, in the last decade of the second millennium, the genome sequences started pouring in. Yeast chromosome III, the first respectable chunk of contiguous genome sequence [629] that became available in 1992 (quite modest, by today’s standards, just ~320,000 base pairs), generated major excitement epitomized in the title of a Nature note describing a re-analysis of the ORFs from this chromosome: “What’s in the genome?” [105]. From the analysis of this sequence and other large genome segments that started to appear in the next months, at least two notions were derived that became critical for the subsequent evolution of comparative genomics: (i) there were many more genes in the genome than anyone suspected previously on the basis of genetic or biochemical experiments; and (ii) methods of computational analysis matter—careful analysis employing multiple complementary approaches yields incomparably more information on gene functions and evolutionary relationships than any single automatic procedure.

The appearance in August 1995 of the complete genome sequence of the parasitic bacterium Haemophilus influenzae [232] ushered in the era of “real” genomics, the study of complete genomes of cellular organisms. The acceleration of genome sequencing required for this to happen was greatly facilitated by the whole-genome shotgun approach pioneered by Craig Venter, Hamilton Smith, and Leroy Hood [871]. Systematic comparative approaches were tried immediately, even before the second genome came, by using the largely finished genome of Escherichia coli [829]. Since that point, complete genomes of bacteria and archaea have been arriving at a steady rate, which seems to be accelerating in the third millennium (Figure 1.1). Starting with the second genome sequencing paper [242], reports on new genomes inevitably became comparative-genomic studies because, as we have already mentioned, that is the only way to even start understanding “what’s in the genome”.

By June 1, 2002, genomes of 73 species of unicellular organisms (55 bacterial species, 16 archaea, and 2 eukaryotes) were completely sequenced and available in public databases. In the three parts of Table 1.4, the completely sequenced bacterial, archaeal, and eukaryotic genomes are listed in the order of decreasing size. The largest prokaryotic genomes (Streptomyces coelicolor among bacteria, Methanosarcina acetivorans among the archaea) have been sequenced only recently, which promises many interesting discoveries yet to come.

By the time of this writing (August 2002), the first genomes of multicellular eukaryotes, the nematode worm Caenorhabditis elegans, the fruit fly Drosophila melanogaster, the thale cress Arabidopsis thaliana, the pufferfish Fugu rubripes, and Homo sapiens, had been nearly completed (let us note that the very concept of a complete genome sequence for these organisms differs from that for prokaryotes and unicellular eukaryotes). At least 100 more prokaryotic genomes and many eukaryotic genomes, including those of mouse and rat, were at different stages of completion. Beyond doubt, many more finished or nearly finished genome sequences exist in proprietary databases maintained by biotech companies, but since these cannot be freely analyzed, they do not count as far as comparative genomics is concerned.

Any list of completed genomes rapidly becomes outdated and so will Table 1.4, even as this book appears in print. Periodically updated listings of both finished and unfinished publicly funded genome sequencing projects are available at the web sites maintained at the Institute for Genomic Research (TIGR, http://www.tigr.org/tdb/mdb/mdb.html) and at the NCBI (http://www.ncbi.nlm.nih.gov/PMGifs/Genomes/micr.html). The Chicago-based Integrated Genomics Inc. maintains Genomes OnLine Database (http://wit.genomesonline.org), which lists most public as well as some private projects. In addition, web sites of the genome sequencing centers list the projects run or planned in those particular institutions (see Appendix 2).

The relative ease of 6- to 8-fold coverage sequencing as compared to finishing and genome annotation resulted in the availability of a number of incomplete genomes, which are not going to be finalized any time soon (see, for example, the web site of the Department of Energy Joint Genome Institute, http://www.jgi.doe.gov/JGI_microbial/html/index.html). These sequences are a treasure trove for someone who knows what to look for. Most of the data are available for searching through the NCBI BLAST page at http://www.ncbi.nlm.nih.gov/cgi-bin/Entrez/genom_table_cgi or through the web sites of the respective sequencing centers. A partial list of the major genome sequencing centers is available in Appendix 2. Of course, as new genome sequencing centers appear on the map, this listing is going to become obsolete, too. For updated listings of such centers, one could look at the web sites of NCBI (http://www.ncbi.nlm.nih.gov/PMGifs/Genomes/links.html) or the National Human Genome Research Institute (http://www.genome.gov).

In addition to the whole-genome sequencing projects, there are many large-scale expressed sequence tags (EST) sequencing projects, aimed at collecting partial mRNA sequence data from eukaryotic organisms that have not yet made it to the list of priority targets for complete sequencing.

1.3. Basic Questions of Comparative Genomics

In the subsequent chapters of this book, we address many specific problems in comparative and evolutionary genomics. Right now, however, it makes sense to address some basic questions, the answers to which, as we believe, define the status of this research area.

How good is our current collection of genome sequences? Or, more precisely, how representative is it of the actual diversity of life forms? To address this issue, one has to superimpose the sequenced genomes over the taxonomy tree and see how densely populated the main branches are. When this is done with the prokaryotic part of the taxonomy, the result seems to be rather encouraging: the main bacterial and archaeal lineages are already represented by either a complete genome sequence or a genome project that is nearing completion (Table 1.5). However, this needs to be taken with a grain of salt because our knowledge of prokaryotic diversity is itself quite incomplete. Environmental molecular evolutionary studies indicate that the great majority of bacterial and archaeal species is uncultivable with the current methods [644]. Recent techniques aimed at growing these organisms [411] might eventually result in a real revolution in microbial genomics, but it will take years to unfold. Most of those species whose rRNA sequences are produced by environmental cloning fall within known bacterial and archaeal lineages, suggesting that we have already sampled most of the prokaryotic diversity. However, this argument is somewhat circular because we have no idea how many prokaryotes might be not only uncultivable but also unclonable, even with the most non-specific set of PCR primers that have been tried. A case in point is the recent report of a new archaeal phylum, the Nanoarchaea [362]. With these caveats, it is fair to say that, to the best of our knowledge, the diversity of prokaryotes is reasonably well covered by genome sequences, and hence, the stage is set for prokaryotic evolutionary genomics.

The situation with eukaryotes is different in that we seem to have a better grasp of the true eukaryotic diversity and realize that the available set of genome sequences is by no means representative (Table 1.6). While certain groups (ascomycetes, nematodes, insects, mammals) are being tackled by multiple genome projects, most of the early branching eukaryotic lineages are not represented among the sequenced genomes, and neither are most of the animal and plant phyla, including such important groups as sponges, coelenterates, and segmented (annelid) worms. Certainly, this is no reason to postpone detailed comparative-genomic analysis, but this insufficiency of genomic data needs to be taken into account when conclusions are made on eukaryotic evolution.

The next question that we have to address is: Why does comparative genomics work to give us information on gene functions and evolution? The general answer is provided by the neutral theory of molecular evolution [440]. Neutral evolution is fast, as convincingly demonstrated, for example, by the rapid deterioration of pseudogene sequences. Therefore, whenever we detect sequence conservation among proteins or nucleic acids from species separated by a long span of evolution (and this, in practical terms, involves any comparison between two species because these are typically separated by millions of years, time more than sufficient for a pseudogene to change beyond recognition), we can be sure that this conservation is due to the pressure of purifying selection driven by functional constraints. To put it in even simpler terms, what is conserved in a sequence is functionally important. Furthermore, and less trivially, the conserved amino acids and nucleotides almost always perform the same or similar functions, at least in structural and biochemical terms, in homologous protein, RNA, or DNA molecules.
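In practice, "what is conserved" is read off a multiple alignment, column by column. As a minimal sketch, the following code scores each column of a small, gap-free alignment by the fraction of sequences that share the most common residue and reports the invariant positions. The four aligned fragments are hypothetical, and this simple frequency score is only one of many possible conservation measures.

```python
from collections import Counter

# Hypothetical, gap-free alignment of four protein fragments (one per species).
ALIGNMENT = [
    "MKHLDGSA",
    "MRHLDGTA",
    "MQHIDGSV",
    "MKHLDGAL",
]


def column_conservation(alignment):
    """Fraction of sequences sharing the most common residue in each column."""
    n_sequences = len(alignment)
    scores = []
    for column in zip(*alignment):
        top_count = Counter(column).most_common(1)[0][1]
        scores.append(top_count / n_sequences)
    return scores


def invariant_positions(alignment):
    """1-based positions at which every sequence has the same residue."""
    scores = column_conservation(alignment)
    return [i + 1 for i, score in enumerate(scores) if score == 1.0]


print(column_conservation(ALIGNMENT))
print("invariant positions:", invariant_positions(ALIGNMENT))
```

By the argument above, invariant columns are candidates for functionally constrained positions only when the aligned sequences come from species that diverged long enough ago for unconstrained positions to have changed.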

These general concepts of molecular evolution indicate that comparative genomics is likely to be informative in principle, but they tell us nothing about the evolutionary distances at which it is expected to work. The theory would not have been violated in any way if only homologs from closely related species showed significant sequence similarity. However, it had been known already in the pre-genomic era that certain proteins are highly conserved even between vertebrates and bacteria, and the very first genome comparisons revealed deep evolutionary conservation for the majority of proteins. When state of the art methods for sequence comparison are applied, homologs from more than one distantly related species are detectable for 70-80% of the proteins encoded in any prokaryotic genome [827]. At present this fraction seems to be somewhat lower for some of the eukaryotes, but only because the taxonomic density of genome sequencing so far has been insufficient. Indeed, in the genomes of humans and mice, species that diverged from their common ancestor 80-100 million years ago, nearly all genes are conserved. These crucial facts show that genome comparisons are likely to reveal important information on the functions and evolutionary relationships of the great majority of genes in any genome.

We have already stated that genomics would not make any sense at all without the possibility of informative genome comparison. Why is this so? In principle, one could imagine that a combination of theoretical methods for deciphering a protein’s three-dimensional structure from the sequence and experimental studies would allow functional identification without recourse to evolutionary analysis. However, neither of these approaches is up to the task. Some recent progress notwithstanding, there is no hope that, in the foreseeable future, ab initio methods will become capable of correctly predicting the structure of proteins on genome scale (or on any significant scale except, possibly, for some small proteins with simple folds), let alone their functions.

As for genome-wide experimental characterization of protein functions, far-reaching studies have been conducted, such as elucidation of the phenotype of all gene knockout mutants, massive study of subcellular localization, and identification of protein-protein interaction in bulk for yeast S. cerevisiae [714,876]. However, actual determination of the biochemical activity and more so of the biological function of a protein remains a unique task, and even for model organisms such as yeast or E. coli, this goal is not in sight for all gene products.

Indeed, for the great majority of organisms whose genomes have been sequenced, only a few genes have been studied experimentally (Figure 1.2), and there is no hope for substantial progress in the near future.

Even for E. coli, the workhorse of molecular genetics for the last 50 years, less than half of the genes have been experimentally characterized. Prior to the completion of the genome of the archaeon M. jannaschii, only four proteins had been characterized in that organism: two flagellins, RadA recombinase and the adenylate kinase (in Figure 1.2, this sector is just not visible).

The availability of the genome sequence spawned efforts to characterize other genes in these organisms, but so far these studies made only a limited contribution. The level of characterization of eukaryotic genomes is not much higher, although post-genomic efforts are improving the understanding of the yeast and nematode proteomes (see 3.5.2).

Under these circumstances, the theory of molecular evolution and, in particular, the simple connection between evolutionary conservation and function outlined above remain the crucial theoretical underpinning and the main methodology of functional genomics. The comparative approach allows researchers to predict protein functions by transferring information from functionally characterized proteins of model organisms to their uncharacterized homologs and to delineate the functionally critical parts of protein (and RNA) molecules, such as catalytic or binding sites. Naturally, the quality of these inferences depends on the sensitivity and robustness of computational methods employed by comparative genomics. These caveats notwithstanding, we will argue that comprehensive comparative analysis of genomic sequences and the proteins they encode is an absolute prerequisite to further advances in our understanding of cell biology. Actually, we tend to believe that comparative genomics is up to something grander, namely prioritization of targets for systematic experimental studies. This approach has been partially realized in structural genomics, and we see no reason why it cannot be profitably applied in functional genomics as well. We will be quite satisfied if this book makes just a small step in this direction.

1.4. Further Reading

1. Doolittle RF. 1986. Of Urfs and Orfs: A primer on how to analyze derived amino acid sequences. University Science Books, San Diego.
2. Cairns J, Stent GS, Watson JD. 1992. Phage and the Origins of Molecular Biology. Cold Spring Harbor Laboratory Press, Cold Spring Harbor, NY.
3. Mount DW. 2000. Bioinformatics: Sequence and genome analysis. Cold Spring Harbor Laboratory Press, Cold Spring Harbor, NY. Chapter 1.
4. Koonin EV, Dolja VV. 1993. Evolution and taxonomy of positive-strand RNA viruses: implications of comparative analysis of amino acid sequences. Critical Reviews in Biochemistry and Molecular Biology 28:375–430. [PubMed: 8269709]

Figures

Figure 1.1. Growth of the number of completely sequenced genomes

The data are from Table 1.4. The 2002 figure is extrapolated from the 5-month results.

Figure 1.2. The current state of annotation of some genomes

The data were derived from the original genome sequencing papers [94,130,290,488]. The information on experimentally characterized genes of E. coli is from the GeneProtEC and E. coli Proteome databases, the corresponding data for yeast and human are from the MIPS and OMIM databases, respectively (see 3.5). The numbers of genes characterized by similarity only and similar to unknown genes are from the COG database (see 3.4); these numbers might be a slight underestimate because each COG is required to include representatives of three sufficiently distant species, and those few proteins that have a homolog in only one other species are lost in this analysis.

Tables

Table 1.1. Protein-coding genes of bacteriophage λ

Chromosomal location, bases	DNA strand	Length, aa	Gene name	Gene product
191..736	+	181	nu1	DNA packaging protein
711..2636	+	641	A	DNA packaging protein
2633..2839	+	68	W	Head-tail joining protein
2836..4437	+	533	B	Capsid component
4418..5737	+	439	C	Capsid component
5132..5737	+	201	nu3	Capsid assembly
5747..6079	+	110	D	Head-DNA stabilization
6135..7160	+	341	E	Capsid component
7202..7600	+	132	Fi	DNA packaging
7612..7965	+	117	Fii	Head-tail joining
7977..8555	+	192	Z	Tail component
8552..8947	+	131	U	Tail component
8955..9695	+	246	V	Tail component
9711..10133	+	140	G	Tail component
10115..10549	+	144	T	Tail component
10542..13103	+	853	H	Tail component
13100..13429	+	109	M	Tail component
13429..14127	+	232	L	Tail component
14276..14875	+	199	K	Tail component
14773..15444	+	223	I	Tail component
15505..18903	+	1132	J	Tail:host specificity
18965..19585	+	206	lom	Outer host membrane
19650..20855	+	401	orf401	Tail fiber protein
20147..20767	−	206	orf206	Hypothetical protein*
21029..21973	+	314	orf314	Tail fiber
21973..22557	+	194	orf194	Fiber assembly protein
22686..23918	−	410	ea47
24509..25399	−	296	ea31
25396..26973	−	525	ea59
27812..28882	−	356	int	Integration protein
28860..29078	−	72	xis	Excisionase
29118..29285	−	55	-	Hypothetical protein*
29374..29655	−	93	ea8.5
29847..30395	−	182	ea22
30839..31024	−	61	orf61	Hypothetical protein*
31005..31196	−	63	orf63	Hypothetical protein*
31169..31351	−	60	orf60a	Hypothetical protein*
31348..32028	−	226	exo	Exonuclease
32025..32810	−	261	bet	Recombination protein
32816..33232	−	138	gam	Host-nuclease inhibitor protein
33187..33330	−	47	kil	Host-killing
33299..33463	−	54	cIII	Antitermination
33536..33904	−	122	ssb	Single-stranded DNA binding protein
34087..34287	−	66	ral	Restriction alleviation
34271..34357	−	28	orf28	Hypothetical protein*
34482..35036	+	184	imm21	Superinfection exclusion protein B
35037..35438	−	133	N	Early gene regulator
35825..36259	−	144	rexB	Exclusion
36275..37114	−	279	rexA	Exclusion
37227..37940	−	237	cI	Repressor
38041..38241	+	66	cro	Antirepressor
38360..38653	+	97	cII	Antitermination
38686..39585	+	299	O	DNA replication
39582..40283	+	233	P	DNA replication
40280..40570	+	96	ren	Ren exclusion protein
40644..41084	+	146	Nin146
41081..41953	+	290	Nin290
41950..42123	+	57	Nin57
42090..42272	+	60	Nin60
42269..42439	+	56	Nin56
42429..43043	+	204	Nin204
43040..43246	+	68	Nin68
43224..43889	+	221	Nin221
43886..44509	+	207	Q	Late gene regulator
44621..44815	+	64	orf64	Hypothetical protein*
45186..45509	+	107	S	Cell lysis protein
45493..45969	+	158	R	Cell lysis protein
45966..46427	+	153	Rz	Cell lysis protein
46459..46752	−	97	bor	Bor protein precursor
47042..47575	−	177	-	Putative envelope protein
47738..47944	+	68	-	Hypothetical protein*

Based on the data from the NCBI Entrez Genomes web site, http://www.ncbi.nih.gov/Genomes/.
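The "Length, aa" column can be cross-checked against the coordinates. The rows of Table 1.1 are consistent with inclusive start..end coordinates that span the whole open reading frame including the stop codon, so the protein length is the number of codons minus one. The following is a minimal sketch under that assumption; the function name is hypothetical.

```python
def protein_length_aa(start: int, end: int) -> int:
    """Protein length implied by inclusive gene coordinates.

    Assumption: the interval start..end covers the full ORF,
    stop codon included, so one codon does not encode a residue.
    """
    span = end - start + 1  # inclusive length in bases
    if span <= 0 or span % 3 != 0:
        raise ValueError("coordinates do not span a whole number of codons")
    return span // 3 - 1  # drop the stop codon


# Two rows from Table 1.1: nu1 (191..736) and A (711..2636).
print(protein_length_aa(191, 736))   # 181
print(protein_length_aa(711, 2636))  # 641
```

The strand does not matter for this check, because the coordinates are always listed in ascending order.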

Table 1.2. Non-trivial evolutionary connections and functional predictions for bacteriophage λ proteins

Gene product	Evolutionary conservation	Structure, Domain architecture^a	Predicted function, Reference
A (TerL)	Bacteriophages, herpesviruses	A modified P-loop ATPase domain, distantly related to a vast class of helicases	ATPase subunit of the terminase, involved in DNA packaging in phage head
C	Bacteria and archaea	ClpP protease domain	Minor capsid protein, cleaves the scaffold protein during maturation
K	Bacteria, archaea, and eukaryotes	Consists of an N-terminal JAB/MPN domain (predicted metalloprotease) and a C-terminal CHAP domain (Cys,His-dependent DL-glutamate-specific amidohydrolase)	Tail subunit; predicted protease involved in tail assembly (based on the presence of the JAB/MPN domain) [679] and peptidoglycan lysis (based on the presence of the peptidoglycan amidohydrolase CHAP domain) [948]
Ea31	Scattered distribution in archaea	Endo VII-colicin domain	Predicted nuclease of the McrA (HNH) family [50]
Ea59	Bacteria, archaea, and eukaryotes	P-loop ATPase domain of the ABC class	Predicted ATPase [296]
Exo (RedX)	Bacteria, archaea, eukaryotes, viruses	λ exonuclease domain, distantly related to a broad variety of nucleases	A nuclease involved in phage recombination and late rolling-circle replication
CI	Bacteria, archaea	N-terminal helix-turn-helix DNA-binding domain fused to a C-terminal serine protease domain of the LexA/UmuD family	Transcription repressor of genes required for lytic development
Cro	Bacteria, archaea	Helix-turn-helix DNA-binding domain	Transcription repressor of early genes
O	Bacteria, archaea	Helix-turn-helix DNA-binding domain	DNA-binding protein involved in the initiation of replication
Ren	Bacteria, archaea	Helix-turn-helix DNA-binding domain	Protein involved in exclusion of replication of heterologous genomes in λ-infected bacteria
Nin290	Bacteria, archaea, eukaryotes	PP-loop ATPase domain	Predicted ATP pyrophosphatase, role in phage replication unknown [102]
Nin221	Bacteria, archaea, eukaryotes	Calcineurin-like serine/threonine protein phosphatase domain	Protein phosphatase, role in phage replication unknown [450]

a: Detailed descriptions of these and other domains are available in the Pfam, SMART, and CDD protein domain databases (see 3.2) and in SCOP and CATH protein structure databases (see 3.3).

Table 1.3. A brief timeline of genomics

Year	Event	Ref.
1962	The first theory of molecular evolution; the Molecular Clock concept (Linus Pauling and Emile Zuckerkandl)	[946]
1965	Atlas of Protein Sequences, the first protein database (Margaret Dayhoff and coworkers)	[173]
1970	Needleman-Wunsch algorithm for global protein sequence alignment	[606]
1977	New DNA sequencing methods (Fred Sanger, Walter Gilbert and coworkers); bacteriophage ϕX174 sequence	[553,743]
1977	First software for sequence analysis (Roger Staden)	[797]
1977	Phylogenetic taxonomy; archaea discovered; the notion of the three primary kingdoms of life introduced (Carl Woese and coworkers)	[905]
1981	Smith-Waterman algorithm for local protein sequence alignment	[784]
1981	Human mitochondrial genome sequenced	[28]
1981	The concept of a sequence motif (Russell Doolittle)	[185]
1982	GenBank Release 3 made public
1982	Phage λ genome sequenced (Fred Sanger and coworkers)	[742]
1983	The first practical sequence database searching algorithm (John Wilbur and David Lipman)	[892]
1985	FASTP/FASTN: fast sequence similarity searching (William Pearson and David Lipman)	[521]
1986	Introduction of Markov models for DNA analysis (Mark Borodovsky and coworkers)	[107]
1987	First profile search algorithm (Michael Gribskov, Andrew McLachlan, David Eisenberg)	[315]
1988	National Center for Biotechnology Information (NCBI) created at NIH/NLM
1988	EMBnet network for database distribution created
1990	BLAST: fast sequence similarity searching with rigorous statistics (Stephen Altschul, David Lipman and coworkers)	[20]
1991	EST: expressed sequence tag sequencing (Craig Venter and coworkers)	[4]
1994	Hidden Markov Models of multiple alignments (David Haussler and coworkers; Pierre Baldi and coworkers)	[71,72,473]
1994	SCOP classification of protein structures (Alexei Murzin, Cyrus Chothia and coworkers)	[590]
1995	First bacterial genomes completely sequenced	[232,242]
1996	First archaeal genome completely sequenced	[130]
1996	First eukaryotic genome (yeast) completely sequenced	[290]
1997	Introduction of gapped BLAST and PSI-BLAST	[22]
1997	COGs: Evolutionary classification of proteins from complete genomes	[828]
1998	Worm genome, the first multicellular genome, (nearly) completely sequenced	[840]
1999	Fly genome (nearly) completely sequenced	[3]
2001	Human genome (nearly) completely sequenced	[488,870]

Table 1.4. Completely sequenced genomes (as of June 1, 2002)

Species^a	Genome size, kb	Total no. of proteins	Year finished	Sequencing center^b	Ref.
Bacteria
Streptomyces coelicolor	8,668	7,567	2002	Sanger Centre	[85]
Mesorhizobium loti	7,036	6,752	2000	Kazusa Institute	[415]
Sinorhizobium meliloti	6,692	6,205	2001	EU Consortium	[137]
Nostoc sp. (Anabaena)	6,414	5,366	2001	Kazusa Institute	[416]
Pseudomonas aeruginosa	6,264	5,565	2000	Pathogenesis Co.	[807]
Agrobacterium tumefaciens	5,674	5,419	2001	U. Washington, Cereon Inc.	[293,917]
Xanthomonas citri	5,176	4,312	2002	U. Sao Paulo	[167]
Xanthomonas campestris	5,076	4,181	2002	U. Sao Paulo	[167]
Salmonella typhimurium	4,857	4,451	2001	Sidney Kimmel Cancer Center	[557]
Salmonella typhi	4,809	4,600	2001	Sanger Centre	[654]
Yersinia pestis	4,654	4,008	2001	Sanger Centre	[656]
Escherichia coli	4,639	4,289	1997	U. Wisconsin	[94]
Mycobacterium tuberculosis	4,412	3,918	1998	Sanger Centre	[152]
Bacillus subtilis	4,215	4,100	1997	Institute Pasteur	[477]
Bacillus halodurans	4,202	4,066	2000	JAMST Center	[820]
Vibrio cholerae	4,033	3,827	2000	TIGR	[337]
Caulobacter crescentus	4,017	3,737	2001	TIGR	[618]
Clostridium acetobutylicum	3,941	3,672	2001	Genome Therapeutics	[622]
Ralstonia solanacearum	3,716	3,442	2002	Genoscope	[731]
Synechocystis sp.	3,573	3,169	1996	Kazusa Institute	[417]
Corynebacterium glutamicum	3,309	3,040	2002	U. Bielefeld	[831]
Mycobacterium leprae	3,268	2,720	2001	Sanger Centre	[153]
Clostridium perfringens	3,031	2,660	2002	U. Tsukuba	[767]
Listeria innocua	3,011	2,981	2001	Institute Pasteur	[286]
Listeria monocytogenes	2,945	2,855	2001	Institute Pasteur	[286]
Staphylococcus aureus	2,814	2,594	2001	Juntendo U.	[481]
Thermoanaerobacter tengcongensis	2,689	2,588	2002	Beijing Genomics Inst.	[73]
Xylella fastidiosa	2,679	2,766	2000	Sao Paulo State	[772]
Deinococcus radiodurans	2,649	2,580	1999	TIGR	[891]
Lactococcus lactis	2,365	2,266	2001	INRA	[97]
Pasteurella multocida	2,257	2,014	2001	U. Minnesota	[554]
Neisseria meningitidis	2,184	2,121	2000	TIGR, Sanger	[653,837]
Fusobacterium nucleatum	2,174	2,067	2002	Integr.Genomics	[418]
Streptococcus pneumoniae	2,160	2,094	2001	TIGR	[357,836]
Brucella melitensis	2,117	2,059	2002	U. Scranton	[179]
Thermotoga maritima	1,861	1,846	1999	TIGR	[610]
Streptococcus pyogenes	1,852	1,697	2001	U. Oklahoma	[223]
Haemophilus influenzae	1,830	1,709	1995	TIGR	[232]
Campylobacter jejuni	1,641	1,654	2000	Sanger Centre	[655]
Helicobacter pylori	1,668	1,566	1997	TIGR	[848]
Aquifex aeolicus	1,551	1,522	1998	Diversa Corp.	[175]
Rickettsia conorii	1,269	1,274	2000	U. Marseille	[626]
Chlamydia pneumoniae	1,230	1,052	1999	UC Berkeley	[412]
Treponema pallidum	1,138	1,031	1998	TIGR	[243]
Rickettsia prowazekii	1,111	834	1998	Uppsala U.	[30]
Chlamydia muridarum	1,069	909	2000	TIGR	[694]
Chlamydia trachomatis	1,042	894	1998	UC Berkeley	[805]
Borrelia burgdorferi	911	850	1997	TIGR	[241]
Mycoplasma pneumoniae	816	677	1996	U. Heidelberg	[347]
Ureaplasma urealyticum	752	611	2000	U. Alabama	[287]
Mycoplasma pulmonis	964	782	2001	U. Bordeaux	[144]
Buchnera sp. APS	641	564	2000	U. Tokyo	[766]
Mycoplasma genitalium	580	467	1995	TIGR	[242]
Archaea
Methanosarcina acetivorans	5,751	4,540	2002	Whitehead Inst.	[254]
Methanosarcina mazei	4,096	3,371	2002	U. Göttingen	[181]
Sulfolobus solfataricus	2,992	2,997	2001	EU/Canada	[764]
Sulfolobus tokodaii	2,695	2,826	2001	NITE	[426]
Halobacterium sp. NRC-1	2,380	2,446	2000	Inst. Syst. Biol.	[616]
Pyrobaculum aerophilum	2,222	2,605	2002	UCLA	[231]
Archaeoglobus fulgidus	2,178	2,420	1997	TIGR	[444]
Pyrococcus furiosus	1,908	2,065	2001	U. Maryland	[704]
Methanobacterium thermoautotrophicum	1,751	1,869	1997	Genome Therapeutics	[781]
Pyrococcus abyssi	1,765	1,765	2000	Genoscope	[599]
Pyrococcus horikoshii	1,739	~1,750	1998	NITE	[428]
Methanopyrus kandleri	1,695	1,691	2002	Fidelity Systems	[779]
Aeropyrum pernix	1,670	~1,720	1999	NITE	[427]
Methanococcus jannaschii	1,665	1,715	1996	TIGR	[130]
Thermoplasma volcanium	1,585	1,499	2000	NIBHT	[430]
Thermoplasma acidophilum	1,565	1,478	2000	MPI Biochem.	[719]
Eukaryotes
Homo sapiens	~3,100,000	~40,000	~2002	Human Genome Project, Celera	[488,870]
Mus musculus	~3,100,000	~40,000	~2002	Mouse Genome Project, Celera	-
Oryza sativa	~420,000	32,277	2002	Syngenta Corp.	[289]
Anopheles gambiae	~278,000	-	2002	Celera, Sanger	-
Drosophila melanogaster	~137,300	~13,500	2000	Celera, UC Berkeley	[3]
Arabidopsis thaliana	~115,400	25,498	2000	Arabidopsis Genome Project	[35]
Caenorhabditis elegans	~96,900	~19,000	1999	Sanger Centre, Washington U.	[840]
Saccharomyces cerevisiae	~11,600	~6,000	1996	European Consortium	[290]
Schizosaccharomyces pombe	~12,600	4,824	2002	Sanger Centre	[918]
Encephalitozoon cuniculi	~2,500	1,997	2001	Genoscope	[425]

a: Further in the book, these names are used mostly in the abbreviated form. Shading indicates obligate parasites.
b: For the complete names of the sequencing centers, see the NCBI Entrez Genomes web site http://www.ncbi.nlm.nih.gov/entrez/query.fcgi?db=Genome, Appendix 2, or the original references.

Table 1.5. Coverage of the main prokaryotic phyla by genome projects

	Genome sequencing
Major prokaryotic phyla^a	Completed	In progress
Archaea
Crenarchaeota	4	3
Euryarchaeota	12	2
Bacteria
Aquificales	1	-
CFB/Chlorobium group	-	4
Chlamydiales/Verrucomicrobia group	3	-
Chrysiogenetes	-	-
Cyanobacteria	2	2
Deferribacteres	-	-
Dehalococcoides group	-	1
Dictyoglomus group	-	-
Fibrobacter/Acidobacteria group	-	1
Bacillus/Clostridium group (low G+C gram-positive)	13	20
Actinobacteria (high G+C gram-positive)	3	8
Fusobacteria	1	-
Green non-sulfur bacteria	-	1
Planctomycetales	-	-
Proteobacteria	23	44
Alpha subdivision	7	9
Beta subdivision	2	8
Gamma subdivision	12	20
Delta subdivision	-	5
Epsilon subdivision	2	2
Spirochaetales	2	2
Thermodesulfobacteria	-	-
Thermomicrobia	-	-
Thermotogales	1	-
Thermus/Deinococcus group	1	1

a: The taxonomy is from the NCBI Taxonomy database (see 3.6). The data on the finished and ongoing genome sequencing projects are from the Entrez Genomes database (http://www.ncbi.nlm.nih.gov/PMGifs/Genomes/micr.html) and the Genomes OnLine Database (http://genomesonline.org).

Table 1.6. Status of the eukaryotic genome projects

Major eukaryotic phyla^a	Representatives with ongoing sequencing projects
Acanthamoebidae	-
Acantharea	-
Alveolata
Apicomplexa	*Babesia bovis*, Cryptosporidium parvum, Eimeria tenella, Plasmodium falciparum, *P. berghei*, *P. chabaudi*, P. vivax, P. yoelii, *Theileria annulata*, *Toxoplasma gondii*
Ciliophora	Paramecium tetraurelia, Tetrahymena sp.
Dinophyceae	-
Haplosporida	-
Apusomonadidae	-
Cercozoa	Chlorarachnion reptans
Core jakobids	Reclinomonas americana
Cryptophyta	Guillardia theta (nucleomorph genome)
Diplomonadida	*Giardia intestinalis*
Entamoebidae	*Entamoeba histolytica*
Euglenozoa	*Leishmania major*, *Trypanosoma brucei*
Glaucocystophyceae	-
Granuloreticulosea	-
Haptophyceae	-
Heterolobosea	-
Lobosea	-
Malawimonadidae	-
Microsporidia	Encephalitozoon cuniculi, Spraguea lophii
Mycetozoa	*Dictyostelium discoideum*
Oxymonadida	-
Parabasalidea	-
Paramyxea	-
Pelobiontida	-
Plasmodiophorida	-
Polycystinea	-
Retortamonadidae	-
Rhodophyta	Porphyra yezoensis
Stramenopiles	Thalassiosira pseudonana
Viridiplantae
Chlorophyta	Chlamydomonas reinhardtii
Streptophyta	Alfalfa, barley, bean, coffee, corn, cotton, pine, poplar, potato, rice, sorghum, soybean, sugar cane, tomato, wheat
Fungi/Metazoa group
Aconchulinida	-
Choanoflagellida	-
Fungi
Ascomycota	Saccharomyces cerevisiae, Schizosaccharomyces pombe, *Aspergillus nidulans*, *A. fumigatus*, A. niger, *Candida albicans*, Coccidioides immitis, Debaryomyces hansenii, Fusarium proliferatum, *Neurospora crassa*, Pneumocystis carinii, Trichoderma reesei
Basidiomycota	Cryptococcus neoformans, Phanerochaete chrysosporium, Ustilago maydis
Chytridiomycota	-
Zygomycota	-
Metazoa
Porifera (sponges)	-
Cnidaria	-
Ctenophora	-
Platyhelminthes	Schistosoma mansoni, S. japonicum
Nematoda	Caenorhabditis elegans, Ascaris suum, Brugia malayi, C. briggsae, Haemonchus contortus
Annelida	-
Mollusca	-
Arthropoda	Drosophila melanogaster, Anopheles gambiae, Aedes aegypti, A. albopictus, Amblyomma americanum, Glossina morsitans
Chordata
Urochordata	Ciona intestinalis (sea squirt), C. savignyi
Actinopterygii	Takifugu rubripes (fugu), *Danio rerio (zebrafish)*, Oreochromis niloticus (tilapia)
Amphibia	Ambystoma mexicanum (axolotl), Xenopus tropicalis (frog), X. laevis
Crocodylidae	-
Aves (birds)	Chicken, turkey
Mammals	Human, mouse, rat, cat, chimpanzee, cow, dog

a: The taxonomy is from the NCBI Taxonomy database (see 3.6). Organisms with finished or almost finished projects are shown in bold; advanced-stage projects are shown in bold and italic; not all sequencing projects for each phylogenetic lineage are listed. Absence of sequencing projects for any representative of a phylogenetic lineage, according to the Entrez Genomes and Genomes OnLine databases, is indicated by a dash.

Bookshelf ID: NBK20263
