<!doctype html public "-//IETF//DTD HTML//EN">
|
|
<html>
|
|
|
|
<head>
|
|
<title>February 1998</title>
|
|
<meta name="GENERATOR" content="Microsoft FrontPage 3.0">
|
|
<meta name="AUTHOR" content="RJohnson">
|
|
</head>
|
|
|
|
<body bgcolor="#FFFFFF" text="#000000" vlink="#0000FF" alink="#0000FF">
|
|
|
|
<p><img src="/Gifs/newslogo.gif"> </p>
|
|
|
|
<p> </p>
|
|
|
|
<hr style="margin-top: -2in; margin-bottom: -2in; padding-top: ; padding-bottom:">
|
|
<a name="toc">
|
|
|
|
<p>February 1998</a></p>
|
|
|
|
<hr style="margin-top: -3in; margin-bottom: ; padding-top: -9 in; padding-bottom: 0">
|
|
|
|
<h3>In This Issue</h3>
|
|
|
|
<p><a href="#Speed">Speed and Sensitivity: BLAST Version 2.0</a><br>
|
|
<a href="#Protein">Protein Families and Genome Evolution: COGs</a><br>
|
|
<a href="#GenBank">GenBank Submissions: From Deposit to Release</a><br>
|
|
<a href="#Throughput">High Throughput Sequencing Gives Rise to New GenBank Division</a><br>
|
|
<a href="#One Billion">GenBank Reaches One Billion Bases</a><br>
|
|
<a href="#faq">Frequently Asked Questions</a> <br>
|
|
<a href="#FTP">NCBI Data by FTP</a> <br>
|
|
<a href="#Pubs">Selected Recent Publications by NCBI Staff</a><br>
|
|
<a href="#Masthead">Masthead</a></p>
|
|
|
|
<hr>
|
|
<font size="6" face="Times">
|
|
|
|
<h3><a name="Speed">Speed and Sensitivity: BLAST Version 2.0 </a></font><font face="Times"
|
|
size="7"></h3>
|
|
|
|
<p ALIGN="left"></font><font face="Times" size="3">Michael Crichton, author of <i>Jurassic
|
|
Park</i> (1990, Knopf, New York), discovered the value of BLAST for sequence
similarity searching when NCBI researcher Mark Boguski informed him that the <i>Tyrannosaurus
|
|
rex</i> sequence Crichton published was 100% contaminated with pBR322 vector.</font><font
|
|
FACE="Times" SIZE="1"><sup>1</sup> </font><font face="Times" size="3">While the scientific
|
|
community awaits an authentic dinosaur sequence, new features added to NCBI’s BLAST
|
|
service are offering increased speed and sensitivity, providing scientists with an
|
|
enhanced ability to uncover biologically meaningful relationships among various
|
|
organisms’ sequences.</p>
|
|
|
|
<p ALIGN="left">BLAST (Basic Local Alignment Search Tool) is widely used to perform
|
|
sequence similarity searching because it can produce valuable results swiftly. Currently,
|
|
over 5,100 people around the world use NCBI’s BLAST server on the World Wide Web
|
|
daily, and an additional 1,000 are using a server-client version. Together they perform
|
|
over 38,000 searches each day. The new BLAST version 2.0 programs equip researchers with
|
|
advanced search strategies that are both fast and convenient. BLAST 2.0 combines the
|
|
statistical analysis of the original BLAST with the ability to perform gapped alignments
|
|
(Gapped BLAST) and to construct position-specific score matrices for sequence similarity
|
|
searches (PSI-BLAST).</p>
|
|
<b>
|
|
|
|
<p>Gapped BLAST Is Fast</b></p>
|
|
|
|
<p ALIGN="left">A traditional BLAST search begins by seeking a "word" in a
|
|
database sequence that matches a "word" in the query with at least the
|
|
"threshold" score <i>T</i>. Such a "hit" is extended in both
|
|
directions until the running score drops a certain amount below the best score yet
|
|
achieved. The alignments produced are evaluated for statistical significance, and any
|
|
high-scoring segment pairs (HSPs) that meet a user-definable cutoff are reported.</p>
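<p ALIGN="left">The extension step can be illustrated with a minimal Python sketch. It is not NCBI’s implementation: the function names, the match and mismatch scores, and the x_drop parameter are assumptions chosen only for illustration. A hit is extended in one direction, without gaps, until the running score falls more than x_drop below the best score seen so far.</p>

```python
def extend(query, subject, q_pos, s_pos, step, score_fn, x_drop):
    """Extend a hit without gaps in one direction (step = +1 or -1).

    Stops when the running score drops more than x_drop below the best
    score achieved so far. Returns (best_score, length_at_best_score).
    """
    best = running = 0
    best_len = 0
    n = 0
    q, s = q_pos, s_pos
    while 0 <= q < len(query) and 0 <= s < len(subject):
        running += score_fn(query[q], subject[s])
        n += 1
        if running > best:
            best, best_len = running, n
        elif best - running > x_drop:
            break
        q += step
        s += step
    return best, best_len


def simple_score(a, b):
    # Hypothetical nucleotide scoring: +5 for a match, -4 for a mismatch.
    return 5 if a == b else -4


if __name__ == "__main__":
    # Extend to the right from the first position of both sequences.
    print(extend("ACGTACGT", "ACGTTTTT", 0, 0, 1, simple_score, 10))
```

<p ALIGN="left">Calling the function once with step = 1 and once with step = -1 gives the extension "in both directions" described above.</p>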
|
|
|
|
<p ALIGN="left">The new Gapped BLAST is considerably faster than the original due to two
|
|
refinements. First, the original BLAST needed to be very sensitive in detecting weak HSPs
|
|
because several that involved a single database sequence could, in concert, constitute a
|
|
significant result. By allowing a single HSP of sufficient score to trigger a gapped
|
|
extension step, BLAST 2.0 can afford to miss some very weak HSPs in its initial pass. The
|
|
threshold score <i>T</i> can therefore be raised, with an attendant increase in speed.
|
|
Second, the new program requires the detection of two hits within a short distance of one
|
|
another on the same diagonal before it invokes an ungapped extension. Even after <i>T</i>
|
|
is adjusted to maintain the same sensitivity, this requirement substantially reduces the
number of time-consuming extensions needed. The net result is a program that is not only
|
|
more sensitive but also three times faster than before. The Gapped BLAST programs blastn
|
|
and blastp offer fully gapped alignments; blastx and tblastn have "in-frame"
|
|
gapped alignments and use sum statistics to link alignments from different reading frames.
|
|
Gapped BLAST is not offered for tblastx searches. (For a description of the BLAST family
|
|
of programs, see <a href="http://www.ncbi.nlm.nih.gov/BLAST/blast_program.html">http://www.ncbi.nlm.nih.gov/BLAST/blast_program.html</a>.)</p>
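<p ALIGN="left">The two-hit requirement can be sketched in a few lines of Python. The sketch assumes that word hits are already available as (query position, subject position) pairs; the function name and the window parameter are hypothetical, and details such as the treatment of overlapping hits are left out.</p>

```python
from collections import defaultdict


def two_hit_triggers(hits, window):
    """Return (diagonal, first_q, second_q) for each pair of consecutive
    hits on the same diagonal that lie within `window` positions."""
    by_diagonal = defaultdict(list)
    for q_pos, s_pos in hits:
        # Hits on the same diagonal share the same subject-minus-query offset.
        by_diagonal[s_pos - q_pos].append(q_pos)

    triggers = []
    for diagonal, positions in sorted(by_diagonal.items()):
        positions.sort()
        for first, second in zip(positions, positions[1:]):
            if 0 < second - first <= window:
                triggers.append((diagonal, first, second))
    return triggers


if __name__ == "__main__":
    hits = [(0, 10), (8, 18), (50, 60), (3, 40)]
    print(two_hit_triggers(hits, window=40))
```

<p ALIGN="left">Only pairs that pass this test would go on to the ungapped extension, which is why far fewer extensions are needed.</p>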
|
|
<b>
|
|
|
|
<p>PSI-BLAST for Motif-Style Searching</b></p>
|
|
|
|
<p ALIGN="left">"Motif" searches are potentially much more sensitive to distant
|
|
relationships than are the traditional pairwise similarity searches for which BLAST has
|
|
been tailored. Position-Specific Iterated BLAST (PSI-BLAST) now brings both speed and ease
|
|
of operation to motif searching. It can be used to help delineate diverse protein families
|
|
and to predict function for newly sequenced proteins. PSI-BLAST uses an initial BLAST run
|
|
to generate a gapped multiple alignment. It then constructs from this alignment a
|
|
position-specific score matrix, which is employed as a "query" in a subsequent
|
|
BLAST search. This process can be repeated multiple times to hunt for homologous sequences
|
|
that would not have been retrieved by the original BLAST algorithm. Currently, PSI-BLAST
|
|
is limited to protein-protein queries.</p>
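<p ALIGN="left">The idea of a position-specific score matrix can be illustrated with a simplified Python sketch. The uniform background frequencies, the fixed pseudocount, and the function names are simplifying assumptions made here and do not reproduce the published method. Each column of the alignment is turned into log-odds scores, and a candidate sequence is scored by summing the score of its residue at each position.</p>

```python
import math


def build_pssm(alignment, alphabet, pseudocount=1.0):
    """Build one {residue: log-odds score} dictionary per alignment column.

    Assumes equal-length aligned sequences, a uniform background
    distribution, and a fixed pseudocount (all simplifications).
    """
    background = 1.0 / len(alphabet)
    pssm = []
    for col in range(len(alignment[0])):
        # Ignore gap characters and anything else outside the alphabet.
        column = [seq[col] for seq in alignment if seq[col] in alphabet]
        total = len(column) + pseudocount * len(alphabet)
        scores = {}
        for residue in alphabet:
            freq = (column.count(residue) + pseudocount) / total
            scores[residue] = math.log2(freq / background)
        pssm.append(scores)
    return pssm


def score_sequence(pssm, sequence):
    """Sum the position-specific scores for an ungapped sequence."""
    return sum(column[residue] for column, residue in zip(pssm, sequence))


if __name__ == "__main__":
    # Toy alignment over a toy five-letter alphabet.
    pssm = build_pssm(["ACD", "ACE", "AGD"], "ACDEG")
    print(score_sequence(pssm, "ACD"), score_sequence(pssm, "GGG"))
```

<p ALIGN="left">In an iterated search, the matrix rather than the original sequence would serve as the "query" for the next round.</p>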
|
|
|
|
<p ALIGN="left">NCBI researchers tested the power of PSI-BLAST by applying it to the
|
|
C-terminal 215 amino acids of the BRCA1 sequence.</font><font FACE="Times" SIZE="1"><sup>2</sup>
|
|
</font><font face="Times" size="3">BRCA1 and other members of the BRCT superfamily
|
|
typically are involved in DNA damage-responsive cell cycle checkpoints.</font><sup><font
|
|
FACE="Times" SIZE="1">3</font></sup><font face="Times" size="3"> In multiple iterations,
|
|
PSI-BLAST automatically identified almost all the previously recognized BRCT proteins and
|
|
added seven new ones to the roster (see Table 1 and Altschul et al., 1997</font><sup><font
|
|
FACE="Times" SIZE="1">4</font></sup><font face="Times" size="3">).</p>
|
|
|
|
<p ALIGN="left">For more information, a BLAST help manual is available on-line.</p>
|
|
<div align="left">
|
|
|
|
<table border="1" width="577" height="32" cellpadding="2">
|
|
<tr>
|
|
<td width="100" height="32"><p align="center"><strong>Protein</strong></td>
|
|
<td width="226" height="32"><p align="center"><strong>Species</strong></td>
|
|
<td width="94" height="32" align="right"><p align="center"><strong>GenBank ID Number</strong></td>
|
|
<td width="92" height="32" align="center"><p align="center"><strong>PSI-BLAST iteration</strong></td>
|
|
<td width="65" height="32"><p align="center"><strong>E-value</strong></td>
|
|
</tr>
|
|
<tr>
|
|
<td width="100" height="32">T10M13.12</td>
|
|
<td width="226" height="32"><i>Arabidopsis thaliana </i></td>
|
|
<td width="94" height="32" align="right">2104545</td>
|
|
<td width="92" height="32" align="center">1</td>
|
|
<td width="65" height="32" align="right">4e-06</td>
|
|
</tr>
|
|
<tr>
|
|
<td width="100" height="32">KIAA0259</td>
|
|
<td width="226" height="32"><i>Homo sapiens</i></td>
|
|
<td width="94" height="32" align="right">1665785</td>
|
|
<td width="92" height="32" align="center">1</td>
|
|
<td width="65" height="32" align="right">0.001</td>
|
|
</tr>
|
|
<tr>
|
|
<td width="100" height="32">T13F2.3 </td>
|
|
<td width="226" height="32"><i>Caenorhabditis elegans</i> </td>
|
|
<td width="94" height="32" align="right">1667334</td>
|
|
<td width="92" height="32" align="center">3</td>
|
|
<td width="65" height="32" align="right">2e-07</td>
|
|
</tr>
|
|
<tr>
|
|
<td width="100" height="32">SPAC6G9.12</td>
|
|
<td width="226" height="32"><i>Schizosaccharomyces pombe</i></td>
|
|
<td width="94" height="32" align="right">1644324</td>
|
|
<td width="92" height="32" align="center">7</td>
|
|
<td width="65" height="32" align="right">4e-04</td>
|
|
</tr>
|
|
<tr>
|
|
<td width="100" height="32">C36A4.8</td>
|
|
<td width="226" height="32"><i>Caenorhabditis elegans</i></td>
|
|
<td width="94" height="32" align="right">1657667</td>
|
|
<td width="92" height="32" align="center">7 </td>
|
|
<td width="65" height="32" align="right">0.010 </td>
|
|
</tr>
|
|
<tr>
|
|
<td width="100" height="32">D90904 </td>
|
|
<td width="226" height="32"><i>Synechocystis </i>sp.</td>
|
|
<td width="94" height="32" align="right">1652299</td>
|
|
<td width="92" height="32" align="center">15</td>
|
|
<td width="65" height="32" align="right">0.17</td>
|
|
</tr>
|
|
<tr>
|
|
<td width="100" height="32">Pescadillo</td>
|
|
<td width="226" height="32"><i>Homo sapiens</i></td>
|
|
<td width="94" height="32" align="right">2194203</td>
|
|
<td width="92" height="32" align="center">16</td>
|
|
<td width="65" height="32" align="right">0.017</td>
|
|
</tr>
|
|
</table>
|
|
</div><b>
|
|
|
|
<p>Notes</b></font><font FACE="Times" SIZE="1"></p>
|
|
</font>
|
|
|
|
<p><font FACE="Times" SIZE="1"><sup>1</sup></font><font FACE="Times" SIZE="2"> Boguski MS.
|
|
A molecular biologist visits <i>Jurassic Park</i>. <i>Biotechniques </i>12:668–9,
|
|
1992.</font></p>
|
|
|
|
<p><font FACE="Times" SIZE="1"><sup>2</sup></font><font FACE="Times" SIZE="2"> Altschul
|
|
SF, Madden TL, Schaffer AA, Zhang J, Zhang Z, Miller W, Lipman DJ. Gapped BLAST and
|
|
PSI-BLAST: a new generation of protein database search programs. <i>Nucleic Acids Res</i>
|
|
25:3389–402, 1997.</font></p>
|
|
|
|
<p><font FACE="Times" SIZE="1"><sup>3</sup></font><font FACE="Times" SIZE="2"> Bork
|
|
P, Hofmann K, Bucher P, Neuwald AF, Altschul SF, Koonin EV. A superfamily of conserved
|
|
domains in DNA damage-responsive cell cycle checkpoint proteins.<i> FASEB J</i>
|
|
11:68–76, 1997.</font></p>
|
|
|
|
<p><font FACE="Times" SIZE="1"><sup>4</sup></font><font FACE="Times" SIZE="2"> <i>Op.</i> <i>cit.</i>
|
|
2.</font><font face="Times" size="3"> </font></p>
|
|
|
|
<p><font face="Times" size="3"> <a href="#toc">Return to Table of Contents</a></font></p>
|
|
|
|
<hr>
|
|
<font FACE="Times" SIZE="6">
|
|
|
|
<h3><a name="Protein">Protein Families and Genome Evolution: COGs</a></font><font
|
|
face="Times" size="7"></h3>
|
|
|
|
<p ALIGN="left"></font><font FACE="Times" SIZE="3">Evolutionary biologists assume that the
|
|
genetic constitution of every organism can be traced back to a set of common ancestral
|
|
genes. This assumption has prompted scientists to perform sequence comparisons between
|
|
genes from different species to identify the distant and subtle relationships between
|
|
them. Genes with the same function can often be found in different species. These genes
|
|
are likely to have evolved from a single ancestral gene and are known as
|
|
"orthologs." Alternatively, there may be sequences within the same organism that
|
|
are similar but have different functions; these "paralogs" most likely arose
|
|
from a gene duplication event and then evolved new functions. The growing number of
|
|
completely sequenced genomes makes possible unprecedentedly comprehensive
comparisons between major phylogenetic groups or specific organisms, producing an
|
|
informative outline of these relationships. Such a panoramic perspective will augment our
|
|
knowledge of the course of evolution and identify protein functions conserved in some
|
|
organisms but not in others.</p>
|
|
<b>
|
|
|
|
<p align="left">In Search of Gene Families from Complete Genomes</b></p>
|
|
|
|
<p ALIGN="left">Working with the newly sequenced genomes from seven different organisms,
|
|
three scientists at NCBI, Roman Tatusov, Eugene Koonin, and David Lipman, designed a new
|
|
system for classifying conserved genes and exploring the evolutionary relationships among
|
|
them. Beginning with a single gene, they looked for the best match to that sequence in
|
|
every other genome. They continued to perform pairwise sequence comparisons for each
|
|
protein sequence against every other sequence in all the genomes until nearly 18,000
|
|
sequences had been compared. When two genes from different organisms were each other’s
best match, they were identified as orthologs. Paralogs were identified
when the matches between sequences were not reciprocal. The NCBI team cataloged the
|
|
sequences according to their functional similarities into "Clusters of Orthologous
|
|
Groups," or COGs.</font><sup><font FACE="Times" SIZE="1">1</font></sup><font
|
|
FACE="Times" SIZE="3"> A total of 720 unique COGs were identified. Each COG has at least
|
|
three orthologs from three genomes (Figure 1a) and, in some cases, paralogs from the same
|
|
lineage (Figure 1b).</p>
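<p ALIGN="left">The reciprocal best-match test described above can be sketched in Python. The sketch assumes that the best match of every gene in the other genome has already been computed (for example, from pairwise comparisons); the gene names and the function name are hypothetical.</p>

```python
def reciprocal_best_hits(best_a_to_b, best_b_to_a):
    """Return sorted (gene_a, gene_b) pairs that are each other's best match.

    best_a_to_b maps each gene in genome A to its best match in genome B;
    best_b_to_a is the same mapping in the opposite direction.
    """
    return sorted(
        (gene_a, gene_b)
        for gene_a, gene_b in best_a_to_b.items()
        if best_b_to_a.get(gene_b) == gene_a
    )


if __name__ == "__main__":
    a_to_b = {"a1": "b1", "a2": "b2", "a3": "b2"}
    b_to_a = {"b1": "a1", "b2": "a3"}
    # a2 points to b2, but b2 points back to a3, so a2 is not reported.
    print(reciprocal_best_hits(a_to_b, b_to_a))
```

<p ALIGN="left">Pairs that pass the test are ortholog candidates; a one-way match such as the one from a2 in the example is the kind of non-reciprocal relationship the text associates with paralogs.</p>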
|
|
|
|
<p ALIGN="left">The results of this comparison are available on a new NCBI Web page (<a
|
|
href="http://www.ncbi.nlm.nih.gov/COG/">http://www.ncbi.nlm.nih.gov/COG/</a>). The genomes
|
|
analyzed include five bacterial genomes, <i>Escherichia coli, Haemophilus influenzae,
Mycoplasma genitalium, Mycoplasma pneumoniae,</i> and the cyanobacterium <i>Synechocystis</i> sp.;
one archaebacterial genome, <i>Methanococcus jannaschii</i>; and one eukaryotic
genome, that of the yeast <i>Saccharomyces cerevisiae</i>.</p>
|
|
|
|
<p ALIGN="left"> <img src="cog.gif" alt="cog.gif (20530 bytes)" width="612"
|
|
height="437"></p>
|
|
<b>
|
|
|
|
<p align="left">COGs Predict Functions</b></p>
|
|
|
|
<p ALIGN="left">Since orthologs typically have the same function, COGs allow the functions
|
|
of putative gene products to be predicted from the growing number of newly sequenced
|
|
genomes. Functions were assigned to the majority of the 720 COGs based on known proteins
|
|
within the groups or significant similarities to proteins in organisms not included in
|
|
this study. The COGs were further organized into 15 functional subgroups within 4 major
|
|
divisions: (1) information storage and processing, (2) cellular processes, (3) metabolism,
|
|
and (4) poorly characterized. The distribution of proteins from different organisms in the
|
|
COGs identifies trends in functional diversification. For example, the pathogenic
bacteria (<i>H. influenzae</i> and the mycoplasmas) have no representative proteins in
some metabolic groups.</p>
|
|
</font>
|
|
|
|
<p align="left"><b>Expanding COGs into Superfamilies</b></p>
|
|
|
|
<p ALIGN="left">The COGs represent ancient, conserved protein families with relevant
|
|
cellular functions because they are from organisms representing the major phylogenetic
|
|
groups that are estimated to be over 1 billion years old. Conserved sequence motifs within
|
|
the proteins reflect distinct biochemical activities employed by a variety of proteins to
|
|
perform their designated role in the cell. The NCBI team also employed motif-style
|
|
searching by using PSI-BLAST to identify protein superfamilies. Protein superfamilies
|
|
represent a higher level of protein classification than the COGs alone and can be used to
|
|
classify highly evolved proteins not assigned to any COG. The largest superfamily
|
|
contained ATPase and GTPase motifs that are broadly distributed across a variety of cellular
|
|
mechanisms.</p>
|
|
<b>
|
|
|
|
<p align="left">Phylogenetic Patterns in COGs</b></p>
|
|
|
|
<p ALIGN="left">Like pieces of a mosaic that reveal an image when viewed together, COGs
|
|
can be used to conceptualize genetic evolution. The presence or absence of a
|
|
representative gene from an organism in a COG can be studied to reveal
|
|
"patterns" of gene conservation or loss for that particular COG function.
|
|
Tatusov, Koonin, and Lipman compiled a list of phylogenetic patterns gleaned from the 720
|
|
COGs. A single letter of the alphabet was assigned to represent each genome (e.g.,
|
|
"e" for <i>E. coli</i>), and a dash was indicated when the organism was not
|
|
represented in the COG. A COG that has proteins from all seven genomes has a phylogenetic
|
|
pattern shown as "ehgpcmy." A COG that is missing representative sequences from
|
|
the pathogenic species <i>M. genitalium</i> and <i>M. pneumoniae</i> has the pattern
"eh_ _cmy." These two patterns occur most frequently, in 114 and 119 COGs,
respectively, and therefore represent conserved patterns. The conserved patterns
demonstrate continuity between the genomes, while rare patterns
|
|
suggest unique functions that need investigating. The addition of more genomes to the COG
|
|
analyses is expected to illuminate the functional role behind the rare patterns. </p>
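<p ALIGN="left">Building and tallying such patterns is simple to sketch in Python. Only "e" for <i>E. coli</i> is stated explicitly above; the order of the remaining letters is inferred from the two example patterns and should be read as an assumption, as should the function names. A dash marks a genome that is not represented.</p>

```python
from collections import Counter

# Letter order inferred from the all-genome pattern quoted in the text.
GENOME_ORDER = "ehgpcmy"


def phylogenetic_pattern(genomes_present, order=GENOME_ORDER, absent="-"):
    """Return the pattern string for a COG, given the set of one-letter
    genome codes that contribute at least one protein to it."""
    return "".join(code if code in genomes_present else absent for code in order)


def count_patterns(cogs):
    """Count how many COGs display each pattern.

    cogs is an iterable of sets of genome codes, one set per COG.
    """
    return Counter(phylogenetic_pattern(present) for present in cogs)


if __name__ == "__main__":
    cogs = [set("ehgpcmy"), set("ehcmy"), set("ehcmy")]
    print(count_patterns(cogs).most_common())
```

<p ALIGN="left">Sorting the counts, as in the last line, is one way to separate frequent patterns from the rare ones that merit a closer look.</p>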
|
|
<b>
|
|
|
|
<p align="left">Piece by Piece</b></p>
|
|
|
|
<p ALIGN="left">The NCBI team continues to expand its COG research, analyzing eight more
|
|
genomes: <i>Helicobacter pylori, Bacillus subtilis, Borrelia burgdorferi, Treponema
|
|
pallidum, Chlamydia trachomatis, Methanobacterium thermoautotrophicum, Archaeoglobus
|
|
fulgidus, </i>and <i>Caenorhabditis elegans</i>. These analyses will be incorporated into
|
|
the Web site upon completion. Refinements and new additions are expected to build a COG
|
|
collection that will become a valuable resource for characterizing genomes and
|
|
comprehending life’s blueprint. </p>
|
|
<b>
|
|
|
|
<p>Note</b><font FACE="Times" SIZE="1"></p>
|
|
|
|
<p><sup>1</sup></font><font FACE="Times" SIZE="2"> Tatusov RL, Koonin EV, Lipman DJ. A
|
|
genomic perspective on protein families. <i>Science</i> 278:631–7, 1997. </font></p>
|
|
<font FACE="Times" SIZE="3">
|
|
|
|
<p align="left"> <a href="#toc">Return to Table of Contents</a></p>
|
|
|
|
<hr>
|
|
<font FACE="Times" SIZE="5">
|
|
|
|
<h3><a name="GenBank">GenBank Submissions: From Deposit to Release</a></font><font
|
|
FACE="Times" SIZE="7"></h3>
|
|
|
|
<p ALIGN="left"></font>Ever wonder what happens to your cherished sequence once you send
|
|
it hundreds, even thousands, of electronic miles away to GenBank? Whether you are
|
|
using the sequence submission tool BankIt on the WWW, the stand-alone program Sequin, or
|
|
one of the specialized submission procedures for EST, STS, GSS, and HTG sequences, your
|
|
submission is received by the GenBank staff—a group of highly trained biologists and
|
|
database specialists who manage the collection and distribution of GenBank data.
|
|
Currently, over 5,000 sequences arrive each month at GenBank (excluding the specialized
|
|
submissions). While EST, STS, GSS, and HTG submissions are processed in large numbers
|
|
using semiautomated systems, all other types of sequence records are processed manually to
|
|
ensure biological integrity and internal consistency with annotation rules established by
|
|
the International Nucleotide Sequence Database Collaboration.</p>
|
|
<b>
|
|
|
|
<p align="left">Certificate of Deposit: The Accession Number</b></p>
|
|
|
|
<p ALIGN="left">An NCBI staff member checks that your submission meets minimum
|
|
requirements and then assigns an accession number to the sequence within 24 hours. The
|
|
accession number serves as a confirmation that the sequence has been submitted and is a
|
|
permanent, citable number that will allow your sequence to be referenced in publications
|
|
by yourself and others. This same number is used to retrieve your sequence from GenBank or
|
|
from one of the other International Database Collaborators, EMBL and DDBJ.</p>
|
|
|
|
<p ALIGN="left">Accession numbers consist of one letter and five digits, or two letters
|
|
and six digits, and do not change even if the record or its sequence is updated.
|
|
GenBank also assigns a unique GenBank identifier, or GI number, to every <i>sequence </i>loaded
|
|
into the GenBank database. The GI numbers for nucleotide and protein sequences are
|
|
referred to as NIDs and PIDs, respectively. The GI number changes every time the sequence
|
|
is updated, enabling GenBank to track changes in <i>sequence </i>over time.</p>
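<p ALIGN="left">The two accession number formats described above are easy to check mechanically. The following Python sketch assumes uppercase letters, which the text does not specify, and the function name is hypothetical; it tests only the format, not whether the number has actually been assigned.</p>

```python
import re

# One letter plus five digits, or two letters plus six digits.
ACCESSION_RE = re.compile(r"[A-Z]\d{5}|[A-Z]{2}\d{6}")


def looks_like_accession(text):
    """Return True if text has one of the two accession number formats."""
    return ACCESSION_RE.fullmatch(text) is not None


if __name__ == "__main__":
    # Made-up examples of each format, followed by two malformed strings.
    for candidate in ["U12345", "AF123456", "U1234", "AF12345"]:
        print(candidate, looks_like_accession(candidate))
```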
|
|
<b>
|
|
|
|
<p align="left">Checking Accounts: Indexers and Scientists</b></p>
|
|
|
|
<p ALIGN="left">Under the coordination of Francis Ouellette, a staff of 17 indexers
|
|
trained in molecular biology and skilled in database production operations annotate,
|
|
organize, and maintain the 1.7 million database entries. The indexers ensure that all
|
|
direct submissions receive a systematic quality assurance review. Sequences are screened
|
|
against GenBank by using BLAST to identify full or partial matches to sequences in the
|
|
database and then searched to detect vector, yeast, and mitochondrial contamination.
|
|
Programs that check for internal consistency are used to confirm coding regions, detect
|
|
open reading frames, and verify amino acid translations. Using GenBank content and data
|
|
representation guidelines, annotators then review the descriptive parts of the entry: the
|
|
locus name, definition line, taxonomy classification, and journal references. Staff
|
|
consult with submitters as necessary to add or modify features. Finally, one of 21 senior
|
|
scientists reviews the record for biological integrity and continuity.</p>
|
|
|
|
<p ALIGN="left">At least four people have reviewed your sequence and its annotations
|
|
before a draft of the GenBank record is mailed back to you for review. If the record is
|
|
not to be held confidential, it is loaded into GenBank after a 5-day review period. A
|
|
confidential record will not be released into the public database until you have notified
|
|
GenBank or it is published, whichever comes first. At any time, you may update information
|
|
in your record. We encourage authors to notify GenBank of publication so that confidential
|
|
records may be released and public records can be updated in a timely manner. Use the
|
|
BankIt Update function or send a message with the new information to <a
|
|
href="mailto:update@ncbi.nlm.nih.gov">update@ncbi.nlm.nih.gov</a>; please include your
|
|
accession number with all correspondence.</p>
|
|
<b>
|
|
|
|
<p align="left">Gaining Interest: Release of Records</b></p>
|
|
|
|
<p align="left">The turnaround time from submission to release is anywhere from 1 to 3
|
|
weeks, depending on the number of submissions GenBank is processing. Once the record is
|
|
loaded into the database, the public can see the record the next day by using the Query
|
|
e-mail server (<a href="mailto:query@ncbi.nlm.nih.gov">query@ncbi.nlm.nih.gov</a>),
|
|
Network Entrez, or WWW Entrez. Entrez provides links to additional sequences, graphic
|
|
displays, structures, genome records, and PubMed. Still have questions? Write to us at <a
|
|
href="mailto:info@ncbi.nlm.nih.gov">info@ncbi.nlm.nih.gov</a>. </p>
|
|
|
|
<p><a href="#toc">Return to Table of Contents</a></p>
|
|
|
|
<hr>
|
|
|
|
<h3><a name="Throughput">High Throughput Sequencing Gives Rise to New GenBank Division </a></h3>
|
|
|
|
<p ALIGN="left">Generation-megabase scientists interested in accessing information hot off
|
|
the sequencer will be pleased to know that large-scale sequencing centers involved in the
|
|
eukaryotic genome projects are making copious amounts of sequence data available to the
|
|
public prior to completion. GenBank, in concert with the other members of the
|
|
International Nucleotide Sequence Database Collaboration, DDBJ and EMBL, has created the
|
|
High Throughput Genome (HTG) division to handle the evolving assemblage of genomic data.
|
|
To date, the high throughput sequencing projects include <i>Homo sapiens</i>, <i>Caenorhabditis
|
|
elegans</i>, <i>Drosophila melanogaster</i>, <i>Arabidopsis thaliana</i>, and <i>Mus
|
|
musculus</i>.</p>
|
|
<b>
|
|
|
|
<p align="left">HTG Record Evolution</b></p>
|
|
|
|
<p ALIGN="left">Genome sequencing centers generate preliminary sequence information
|
|
from a single genomic clone and deposit random sequence fragments greater than 2 kb into
|
|
the GenBank HTG division. GenBank assigns a single accession number to the sequence data
|
|
derived from each clone and indicates the status of the HTG record as it passes through
|
|
several stages toward completion. Phase 1 records contain sequences that are unordered
and unoriented and that contain gaps. In Phase 2, the order and orientation of the sequences have
|
|
been determined, but gaps remain. In Phase 3, once the sequencing is complete and the
|
|
error rate is less than 10<sup>-4</sup>, records are considered finished. Phase 3 records
|
|
are transferred to the appropriate organism division of GenBank, such as the Primate (PRI)
|
|
division for human sequences, or the Invertebrate (INV) division for <i>C. elegans</i>.
|
|
Sequences submitted to the HTG division are automatically searched against the various
|
|
databases using the BLAST programs, and the records are annotated to show the significant
|
|
matches. This sequence similarity information is valuable for positional cloning and gene
|
|
hunting.</p>
|
|
<b>
|
|
|
|
<p align="left">Accessing HTG Records</b></p>
|
|
|
|
<p ALIGN="left">HTG records can be retrieved in Entrez by selecting the organism and
|
|
specifying "HTG" in the Keywords field. The Genomes database in Entrez, which
|
|
offers graphical displays of nucleotide and protein sequences, provides a visual framework
|
|
for the HTG sequences with links to additional DNA, protein, and bibliographic records.
|
|
Unfinished HTGs (Phase 1 or 2) are also available for BLAST searching by selecting the
|
|
"htgs" database, or the "month" database for the latest entries;
|
|
finished records (Phase 3) are available in the "nr" and "month" BLAST
|
|
databases.</p>
|
|
<b>
|
|
|
|
<p align="left">HTG Information on the Web</b></p>
|
|
|
|
<p align="left">The new HTG Web site at <a href="http://www.ncbi.nlm.nih.gov/HTGS">http://www.ncbi.nlm.nih.gov/HTGS</a>
|
|
describes the HTG division and gives more detailed instructions for sequencing centers
|
|
interested in submitting HTG sequences. </p>
|
|
<font face="Times" size="3">
|
|
|
|
<p align="left"><img src="htg1.gif" alt="htg1.gif (14191 bytes)" width="612" height="374"></p>
|
|
|
|
<p align="left"><img src="htg3.gif" alt="htg3.gif (81258 bytes)" width="612" height="374"></p>
|
|
</font>
|
|
|
|
<p align="left"><a href="#toc">Return to Table of Contents</a></p>
|
|
|
|
<hr>
|
|
<font FACE="Times" SIZE="6">
|
|
|
|
<h3><a name="One Billion">GenBank Reaches One Billion Bases</a></font><font FACE="Times"
|
|
SIZE="7"></h3>
|
|
|
|
<p ALIGN="left"></font>In 1985, GenBank contained just over 5,700 entries that were
|
|
obtained principally by scanning the biomedical literature for sequence data. GenBank now
|
|
contains almost 2 million entries and recently surpassed 1 billion base pairs of genetic
|
|
information for more than 25,000 organisms. Human sequence data predominate in the
|
|
database, representing 43% of the billion-plus base pairs. Mouse (<i>Mus musculus</i>)
and the nematode <i>Caenorhabditis elegans</i> are second and third, representing 10%
|
|
and 9%, respectively.</p>
|
|
<b>
|
|
|
|
<p align="left">Exponential Growth</b></p>
|
|
|
|
<p ALIGN="left">Doubling in size every 18 months, GenBank is now built primarily from the
|
|
direct submission of sequence data from authors and sequencing centers. Currently, more
|
|
than 70% of the sequence records in the database are ESTs (expressed sequence tags). As
|
|
EST and genomic sequencing efforts are intensified, the GenBank doubling rate is expected
|
|
to accelerate. Additional information about GenBank, its various divisions, and its growth
|
|
statistics can be found in the current release notes (<a
|
|
href="ftp://ncbi.nlm.nih.gov/genbank/gbrel.txt">ftp://ncbi.nlm.nih.gov/genbank/gbrel.txt</a>).</p>
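<p ALIGN="left">The 18-month doubling time translates into a simple projection, sketched here in Python. The function name is hypothetical, and the calculation assumes a constant doubling time, which the expected acceleration noted above would shorten.</p>

```python
def projected_size(current_size, months, doubling_months=18.0):
    """Project database size after `months`, assuming the size doubles
    every `doubling_months` months."""
    return current_size * 2 ** (months / doubling_months)


if __name__ == "__main__":
    one_billion = 1_000_000_000
    for months in (18, 36, 54):
        print(months, "months:", round(projected_size(one_billion, months)), "bases")
```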
|
|
<b>
|
|
|
|
<p align="left">GenBank CD-ROM to Be Discontinued</b></p>
|
|
|
|
<p align="left">The explosive growth of sequence information is reflected in the GenBank
|
|
CD-ROM, which expanded from a single disc in 1992 to 12 discs with Release 104 in December 1997.
Because production costs are escalating and users are opting for the convenience of the
|
|
Internet over the unwieldy discs, the GenBank CD-ROM will be discontinued following the
|
|
April 15, 1998, release. GenBank full releases with cumulative and noncumulative update
|
|
files continue to be available in the genbank/ directory for downloading by Anonymous FTP.
|
|
Consult the README file in this directory for more details. </p>
|
|
|
|
<p><a href="#toc">Return to Table of Contents</a></p>
|
|
|
|
<hr>
|
|
|
|
<h3><a name="faq">Frequently Asked Questions</a> </h3>
|
|
<i>
|
|
|
|
<p ALIGN="left">How do I do a BLAST search with a short DNA sequence?</i></p>
|
|
|
|
<p ALIGN="left">You will probably need to increase the Expect (E) value, since a short
|
|
query is more likely to occur by chance in the database. You may also want to turn off the
|
|
low-complexity filter, since short queries often contain low-complexity sequence. Another
|
|
parameter that becomes important with a short query is Word size, which is used by BLAST
|
|
to nucleate regions of similarity. The default Word size is 11 for nucleotides, so if your
|
|
query sequence falls below this, you may want to decrease Word size (W). For more detail,
|
|
see the FAQ section of the BLAST Web page.</p>
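<p ALIGN="left">A back-of-the-envelope calculation shows why short queries match by chance. The Python sketch below is an idealization rather than BLAST’s own statistics: it assumes a random database with equal base frequencies and counts only exact matches on one strand. The function name is hypothetical.</p>

```python
def expected_chance_matches(word_length, database_bases):
    """Expected number of exact chance matches to one word of the given
    length in a random database with equal base frequencies."""
    return database_bases / 4 ** word_length


if __name__ == "__main__":
    database_bases = 1_000_000_000  # about the size GenBank has now reached
    for word_length in (8, 11, 15, 20):
        print(word_length, expected_chance_matches(word_length, database_bases))
```

<p ALIGN="left">Under these assumptions, an 11-base word is expected to occur roughly 240 times by chance in a billion bases, which is why a higher Expect value is needed before matches to a short query are reported.</p>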
|
|
<i>
|
|
|
|
<p ALIGN="left">Is it possible to perform a BLAST search against just human ESTs?</i></p>
|
|
|
|
<p ALIGN="left">Many separate databases are now available for BLAST searching. Select
|
|
"Human ESTs" in the Database field pull-down menu when using Gapped BLAST.</p>
|
|
<i>
|
|
|
|
<p ALIGN="left">Where can I get more information about the Interactive Digital
|
|
Differential Display (DDD) facility used in the Cancer Genome Anatomy Project (CGAP)?</i></p>
|
|
|
|
<p ALIGN="left">DDD is a computational method for comparing gene frequencies among various
|
|
cDNA libraries or pools of libraries. It is available from the CGAP Web site at <a
|
|
href="http://www.ncbi.nlm.nih.gov/ncicgap/ddd.html">http://www.ncbi.nlm.nih.gov/ncicgap/ddd.html</a>.</p>
|
|
<i>
|
|
|
|
<p ALIGN="left">How can I obtain the EST clones described in my UniGene search?</i></p>
|
|
|
|
<p ALIGN="left">Information on clone availability is located in the dbEST record (<a
|
|
href="http://www.ncbi.nlm.nih.gov/dbEST/index.html">http://www.ncbi.nlm.nih.gov/dbEST/index.html</a>).
|
|
Click on "Search dbEST" and enter the GenBank accession number. Individuals
|
|
interested in obtaining materials can (1) contact the submitter of the sequence, (2) refer
|
|
to the Source field (if present) for sources providing the clone, or (3) refer to the Clone
|
|
ID and library number located under "Clone Info," which can be used to order a
|
|
particular clone through the I.M.A.G.E. Consortium. To see a list of distributors
|
|
participating in the I.M.A.G.E. Consortium, scroll down the dbEST Web page and click on
|
|
"Distributors."</p>
|
|
<i>
|
|
|
|
<p ALIGN="left">How can I search for just review articles or specify a certain time period
|
|
when searching PubMed?</i></p>
|
|
|
|
<p ALIGN="left">In the Advanced mode, set the Search Field pull-down menu to
"Publication Type" and enter the word "review" into the text box. To
display a list of available terms for Publication Type, select "List Terms" from
the Mode menu and enter a term with the Search Field set to "Publication Type." To
search a range of dates, use a colon between the limiting years (e.g., 1966:1976), and set
the Search Field to "Publication Date." Search results can also be limited to
the last 30 days or another period of time by selecting one of the options under the
Publication Date limit menu.</p>
<i>

<p ALIGN="left">In CGAP, is there a way to tell which libraries are made with tissue from
the same donor?</i></p>
<p align="left">For any tissue, including microdissected tissues, clicking on the link for
"Tissue sample" will lead to a list of all libraries made from the same samples.</p>
<p><a href="#toc">Return to Table of Contents</a></p>

<hr>

<h3><a name="FTP">NCBI Data by FTP</a> </h3>
<font FACE="Times" SIZE="2">
<p ALIGN="left"></font>The NCBI FTP site contains a variety of directories with publicly
available databases and software. The available directories include
‘repository,’ ‘genbank,’ ‘entrez,’ ‘toolbox,’
‘pub,’ and ‘sequin.’</font></p>
<p ALIGN="left"><font face="Times" size="3">The <b>repository </b>directory makes a number
of molecular biology databases available to the scientific community. This directory
includes databases such as PIR, SwissProt, CarbBank, AceDB, and FlyBase.</font></p>

<p ALIGN="left"><font face="Times" size="3">The <b>genbank </b>directory contains files
with the latest full release of GenBank, the daily cumulative updates, and the latest
release notes.</font></p>

<p ALIGN="left"><font face="Times" size="3">The <b>entrez </b>directory contains the
client software for Network Entrez.</font></p>
<p ALIGN="left"><font face="Times" size="3">The <b>toolbox </b>directory contains a set of
software and data exchange specifications that are used by NCBI to produce portable
software, and includes ASN.1 tools and specifications for molecular sequence data. </font></p>

<p ALIGN="left"><font face="Times" size="3">The <b>pub </b>directory offers public-domain
software, such as BLAST (a sequence similarity search program). Client software for Network
BLAST and PowerBlast is also included in this directory.</font></p>

<p ALIGN="left"><font face="Times" size="3">The <b>sequin </b>directory contains the new
Sequin submission software for Mac, PC, and UNIX platforms.</font></p>
<p align="left"><font face="Times" size="3">Data in these directories can be transferred
through the Internet by anonymous FTP. To connect, type: <b>ftp
ncbi.nlm.nih.gov</b>. Enter <b>anonymous</b> as the login name, and enter your e-mail
address as the password. Then change to the appropriate directory. For example, change to
the repository directory (<b>cd repository</b>) to download specialized databases.</p>
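<p align="left">For readers who prefer to script the transfer, the same session can be
sketched in Python with the standard <b>ftplib</b> module. This is a minimal sketch rather
than an NCBI-provided tool: the function name <b>list_ncbi_directory</b> and the
<b>ftp_factory</b> parameter are hypothetical, and the default host is simply the name given
in the text, which may differ from the server in current use.</p>

```python
from ftplib import FTP


def list_ncbi_directory(directory, email, host="ncbi.nlm.nih.gov", ftp_factory=FTP):
    """Log in anonymously, change to `directory`, and return its file names.

    `ftp_factory` is an illustrative hook (an assumption, not part of any
    NCBI interface) so the function can be exercised without a network.
    """
    # Equivalent to typing: ftp ncbi.nlm.nih.gov
    ftp = ftp_factory(host)
    try:
        # Login name "anonymous", with the e-mail address as the password.
        ftp.login(user="anonymous", passwd=email)
        # Equivalent to: cd repository
        ftp.cwd(directory)
        # Return the names of the entries in that directory.
        return ftp.nlst()
    finally:
        # End the session politely, even if a step above failed.
        ftp.quit()
```

<p align="left">Calling <b>list_ncbi_directory("repository", "you@example.org")</b> would
mirror the manual steps: connect, log in as anonymous, change to the repository directory,
and list its contents.</p>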

<p><a href="#toc">Return to Table of Contents</a></p>

<hr>

<h3 align="left"><a name="Pubs">Selected Recent Publications by NCBI Staff</a> </font></h3>
<b>

<p ALIGN="left"></b><font size="3"><strong>Benson DA, Boguski MS, Lipman DJ, Ostell J,
Ouellette BFF.</strong> GenBank. <i>Nucleic Acids Res</i> 26:1–7, 1998.</font></p>
<p ALIGN="left"><font size="3"><strong>Galperin MY, Koonin EV.</strong> A diverse
superfamily of enzymes with ATP-dependent carboxylate-amine/thiol ligase activity. <i>Protein
Sci</i> 6:2639–43, 1997.</font></p>

<p ALIGN="left"><font size="3"><strong>Leipe DD, Landsman D. </strong>Histone
deacetylases, acetoin utilization proteins, and acetylpolyamine amidohydrolases are
members of an ancient protein superfamily. <i>Nucleic Acids Res</i> 25:3693–7, 1997.</font></p>

<p ALIGN="left"><font size="3"><strong>Lipman DJ.</strong> Making (anti)sense of
non-coding sequence conservation. <i>Nucleic Acids Res</i> 25:3580–3, 1997.</font></p>
<p ALIGN="left"><font size="3"><strong>Marchler-Bauer A, Bryant SH.</strong> A measure of
success in fold recognition. <i>Trends Biochem Sci</i> 22:236–40, 1997.</font></p>

<p ALIGN="left"><font size="3"><strong>Neuwald AF.</strong> An unexpected structural
relationship between integral membrane phosphatases and soluble haloperoxidases. <i>Protein
Sci </i>6:1764–7, 1997.</font></p>

<p ALIGN="left"><font size="3"><strong>Ouellette BFF, Boguski MS.</strong> Database
divisions and homology search files: a guide for the perplexed. <i>Genome Res </i>7:952–5,
1997.</font></p>
<p ALIGN="left"><font size="3"><strong>Pruitt KD.</strong> WebWise: navigating the Human
Genome Project. <i>Genome Res</i> 7:1038–9, 1997.</font></p>

<p ALIGN="left"><font size="3"><strong>Schuler GD.</strong> Pieces of the puzzle:
expressed sequence tags and the catalog of human genes. <i>J Mol Med</i> 75:694–8,
1997.</font></p>
<b>

<p ALIGN="left"></b><font size="3"><strong>Sonnhammer EL, Wootton JC.</strong> Widespread
eukaryotic sequences, highly similar to bacterial DNA polymerase I, looking for functions.
<i>Curr Biol</i> 7:R463–5, 1997.</font></p>
<p align="left"><font size="3"><strong>Tatusov RL, Koonin EV, Lipman DJ.</strong> A
genomic perspective on protein families. <i>Science </i>278:631–7, 1997.</font><font
face="Times" size="3"></p>
<p><a href="#toc">Return to Table of Contents</a></p>

<hr>

<h3 align="left"><a name="Masthead">Masthead</a></h3>
</font><i>
<p ALIGN="left"><font size="3">NCBI News</i> is distributed two to three times a year. We
welcome communication from users of NCBI databases and software and invite suggestions for
articles in future issues. Send correspondence and suggestions to <i>NCBI News</i> at the
address below.</font></p>
<p ALIGN="left"><font size="3">NCBI News<br>
National Library of Medicine<br>
Bldg. 38A, Room 8N-803<br>
8600 Rockville Pike<br>
Bethesda, MD 20894<br>
Phone: (301) 496-2475<br>
Fax: (301) 480-9241<br>
</font><font face="Times" size="3">E-mail: <a href="mailto:info@ncbi.nlm.nih.gov">info@ncbi.nlm.nih.gov</a></font></p>
<i>
<p></i><font size="3"><em>Editors</em><br>
Dennis Benson<br>
Barbara Rapp</font></p>
<i>

<p align="left"></i><font size="3"><em>NCBI Contributors</em><br>
Renata McCarthy<br>
Ken Katz<br>
Francis Ouellette<br>
Stephen Altschul</font></p>
<p><font size="3"><em>Writer</em><br>
Donna Roscoe</font></p>

<p><font size="3"><em>Managing Editor</em><br>
Roseanne Price </font></p>

<p><font size="3"><em>Graphics and Production</em><br>
Veronica Johnson</font></p>

<p ALIGN="JUSTIFY"><font size="3"><em>Design Consultant</em><br>
Troy M. Hill</font></p>
<p ALIGN="left"><font size="3">In 1988, Congress established the National Center for
Biotechnology Information as part of the National Library of Medicine; its charge is to
create information systems for molecular biology and genetics data, and to perform
research in computational molecular biology.</font></p>

<p ALIGN="left"><font size="3">The contents of this newsletter may be reprinted without
permission. The mention of trade names, commercial products, or organizations does not
imply endorsement by NCBI, NIH, or the U.S. Government. </font></p>
<p ALIGN="JUSTIFY"><font size="3">NIH Publication No. 98-3272<br>
ISSN 1060-8788 </font></p>
<font face="Times" size="3">

<p><a href="#toc">Return to Table of Contents</a> </p>

<hr>
</font>
</body>
</html>