
<!doctype html public "-//IETF//DTD HTML//EN">
<html>

<head>
<title>February 1998</title>
<meta name="GENERATOR" content="Microsoft FrontPage 3.0">
<meta name="AUTHOR" content="RJohnson">
</head>

<body bgcolor="#FFFFFF" text="#000000" vlink="#0000FF" alink="#0000FF">

<p><img src="/Gifs/newslogo.gif"> </p>

<p>&nbsp;</p>

<hr style="margin-top: -2in; margin-bottom: -2in; padding-top:  ; padding-bottom:">
<p><a name="toc">February 1998</a></p>

<hr style="margin-top: -3in; margin-bottom:  ; padding-top: -9 in; padding-bottom: 0">

<h3>In This Issue</h3>

<p><a href="#Speed">Speed and Sensitivity: BLAST Version 2.0</a><br>
<a href="#Protein">Protein Families and Genome Evolution:&nbsp; COGs</a><br>
<a href="#GenBank">GenBank Submissions:&nbsp; From Deposit to Release</a><br>
<a href="#Throughput">High Throughput Sequencing Gives Rise to New GenBank Division</a><br>
<a href="#One Billion">GenBank Reaches One Billion Bases</a><br>
<a href="#faq">Frequently Asked Questions</a> <br>
<a href="#FTP">NCBI Data by FTP</a> <br>
<a href="#Pubs">Selected Recent Publications by NCBI Staff</a><br>
<a href="#Masthead">Masthead</a></p>

<hr>
<font size="6" face="Times">

<h3><a name="Speed">Speed and Sensitivity: BLAST Version 2.0 </a></font><font face="Times"
size="7"></h3>

<p ALIGN="left"></font><font face="Times" size="3">Michael Crichton, author of <i>Jurassic
Park</i> (1990, Knopf, New York), discovered the value of BLAST sequence similarity
searches when NCBI researcher Mark Boguski informed him that the&nbsp;<i>Tyrannosaurus
rex</i> sequence he had published was 100% contaminated with pBR322 vector.</font><font
FACE="Times" SIZE="1"><sup>1</sup> </font><font face="Times" size="3">While the scientific
community awaits an authentic dinosaur sequence, new features added to NCBI&#8217;s BLAST
service offer increased speed and sensitivity, giving scientists an
enhanced ability to uncover biologically meaningful relationships among various
organisms&#8217; sequences.</p>

<p ALIGN="left">BLAST (Basic Local Alignment Search Tool) is widely used to perform
sequence similarity searching because it can produce valuable results swiftly. Currently,
over 5,100 people around the world use NCBI&#8217;s BLAST server on the World Wide Web
daily, and an additional 1,000 use a client-server version. Together they perform
over 38,000 searches each day. The new BLAST version 2.0 programs equip researchers with
advanced search strategies that are both fast and convenient. BLAST 2.0 combines the
statistical analysis of the original BLAST with the ability to perform gapped alignments
(Gapped BLAST) and to construct position-specific score matrices for sequence similarity
searches (PSI-BLAST).</p>

<p><b>Gapped BLAST Is Fast</b></p>

<p ALIGN="left">A traditional BLAST search begins by seeking a &quot;word&quot; in a
database sequence that matches a &quot;word&quot; in the query with at least the
&quot;threshold&quot; score <i>T</i>. Such a &quot;hit&quot; is extended in both
directions until the running score drops a certain amount below the best score yet
achieved. The alignments produced are evaluated for statistical significance, and any
high-scoring segment pairs (HSPs) that meet a user-definable cutoff are reported.</p>
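
<p ALIGN="left">As a rough illustration of the extension step described above, the following Python sketch extends a hit to the right without gaps and stops once the running score falls more than a fixed amount below the best score seen so far. The function name <b>extend_right</b>, the <b>x_drop</b> parameter, and the +1/-1 match and mismatch scores are assumptions made for this example and are not the scoring that BLAST itself uses; extension to the left would mirror it.</p>

```python
def extend_right(query, subject, q_start, s_start, x_drop, match=1, mismatch=-1):
    """Extend an ungapped alignment to the right from a seed position.

    Stops once the running score falls more than x_drop below the best
    score seen so far. Returns (best_score, best_length).
    """
    score = best_score = 0
    best_len = 0
    i = 0
    while q_start + i < len(query) and s_start + i < len(subject):
        # Toy scoring: +1 for identical letters, -1 otherwise (an assumption).
        score += match if query[q_start + i] == subject[s_start + i] else mismatch
        i += 1
        if score > best_score:
            best_score, best_len = score, i
        elif best_score - score > x_drop:
            break  # the score has dropped too far; give up extending
    return best_score, best_len
```

<p ALIGN="left">The returned length marks where the best-scoring extension ends, so the residues examined after that point are discarded.</p>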

<p ALIGN="left">The new Gapped BLAST is considerably faster than the original due to two
refinements. First, the original BLAST needed to be very sensitive in detecting weak HSPs
because several that involved a single database sequence could, in concert, constitute a
significant result. By allowing a single HSP of sufficient score to trigger a gapped
extension step, BLAST 2.0 can afford to miss some very weak HSPs in its initial pass. The
threshold score <i>T</i> can therefore be raised, with an attendant increase in speed.
Second, the new program requires the detection of two hits within a short distance of one
another on the same diagonal before it invokes an ungapped extension. Even after <i>T</i>
is adjusted to maintain the same sensitivity, this requirement substantially reduces the
number of time-consuming extensions needed. The net result is a program that is not only
more sensitive but also three times faster than before. The Gapped BLAST programs blastn
and blastp offer fully gapped alignments; blastx and tblastn have &quot;in-frame&quot;
gapped alignments and use sum statistics to link alignments from different reading frames.
Gapped BLAST is not offered for tblastx searches. (For a description of the BLAST family
of programs, see <a href="http://www.ncbi.nlm.nih.gov/BLAST/blast_program.html">http://www.ncbi.nlm.nih.gov/BLAST/blast_program.html</a>.)</p>
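
<p ALIGN="left">The two-hit requirement can be sketched in Python as follows. This is a simplified illustration, not the BLAST 2.0 implementation: it matches words exactly instead of scoring them against the threshold <i>T</i>, and the names <b>two_hit_seeds</b>, <b>word_size</b>, and <b>window</b> are hypothetical. A diagonal is the difference between the subject and query positions of a hit.</p>

```python
from collections import defaultdict

def two_hit_seeds(query, subject, word_size, window):
    """Return (q_pos, s_pos) of second hits that would trigger an extension."""
    # Index every word of the query by its start positions.
    index = defaultdict(list)
    for q in range(len(query) - word_size + 1):
        index[query[q:q + word_size]].append(q)

    last_hit = {}  # diagonal -> subject position of the most recent hit
    seeds = []
    for s in range(len(subject) - word_size + 1):
        for q in index.get(subject[s:s + word_size], ()):
            diag = s - q
            prev = last_hit.get(diag)
            # A second, non-overlapping hit close enough on the same diagonal.
            if prev is not None and word_size <= s - prev <= window:
                seeds.append((q, s))
            # Hits that overlap the previous one are ignored entirely.
            if prev is None or s - prev >= word_size:
                last_hit[diag] = s
    return seeds
```

<p ALIGN="left">A lone word match produces no seed in this sketch, which is how the two-hit rule cuts down the number of extensions.</p>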

<p><b>PSI-BLAST for Motif-Style Searching</b></p>

<p ALIGN="left">&quot;Motif&quot; searches are potentially much more sensitive to distant
relationships than are the traditional pairwise similarity searches for which BLAST has
been tailored. Position-Specific Iterated BLAST (PSI-BLAST) now brings both speed and ease
of operation to motif searching. It can be used to help delineate diverse protein families
and to predict function for newly sequenced proteins. PSI-BLAST uses an initial BLAST run
to generate a gapped multiple alignment. It then constructs from this alignment a
position-specific score matrix, which is employed as a &quot;query&quot; in a subsequent
BLAST search. This process can be repeated multiple times to hunt for homologous sequences
that would not have been retrieved by the original BLAST algorithm. Currently, PSI-BLAST
is limited to protein-protein queries.</p>
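
<p ALIGN="left">To make the idea of a position-specific score matrix concrete, here is a minimal Python sketch that turns the columns of a multiple alignment into log-odds scores. It assumes a fixed pseudocount and unweighted sequences, which is simpler than what PSI-BLAST actually does; the function names <b>build_pssm</b> and <b>score</b> and the <b>background</b> frequency table are illustrative assumptions.</p>

```python
import math
from collections import Counter

def build_pssm(aligned_seqs, background, pseudocount=1.0):
    """Return a list of {residue: log-odds score} dicts, one per column."""
    alphabet = sorted(background)
    pssm = []
    for column in zip(*aligned_seqs):
        # Count residues in this column, skipping gaps and unknown symbols.
        counts = Counter(r for r in column if r in background)
        total = sum(counts.values()) + pseudocount * len(alphabet)
        pssm.append({
            r: math.log2(((counts[r] + pseudocount) / total) / background[r])
            for r in alphabet
        })
    return pssm

def score(pssm, segment):
    """Score an ungapped segment of the same length as the matrix."""
    return sum(col[r] for col, r in zip(pssm, segment))
```

<p ALIGN="left">Residues that are common in a column receive positive scores and rare ones negative scores, so a matrix built from one round of hits can serve as the query for the next.</p>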

<p ALIGN="left">NCBI researchers tested the power of PSI-BLAST by applying it to the
C-terminal 215 amino acids of the BRCA1 sequence.</font><font FACE="Times" SIZE="1"><sup>2</sup>
</font><font face="Times" size="3">BRCA1 and other members of the BRCT superfamily
typically are involved in DNA damage-responsive cell cycle checkpoints.</font><sup><font
FACE="Times" SIZE="1">3</font></sup><font face="Times" size="3"> In multiple iterations,
PSI-BLAST automatically identified almost all the previously recognized BRCT proteins and
added seven new ones to the roster (see Table 1 and Altschul et al., 1997</font><sup><font
FACE="Times" SIZE="1">4</font></sup><font face="Times" size="3">).</p>

<p ALIGN="left">A BLAST help manual with more information is available online.</p>
<div align="left">

<table border="1" width="577" height="32" cellpadding="2">
  <caption><b>Table 1.</b> New BRCT proteins identified by PSI-BLAST</caption>
  <tr>
    <td width="100" height="32"><p align="center"><strong>Protein</strong></p></td>
    <td width="226" height="32"><p align="center"><strong>Species</strong></p></td>
    <td width="94" height="32" align="right"><p align="center"><strong>GenBank ID Number</strong></p></td>
    <td width="92" height="32" align="center"><p align="center"><strong>PSI-BLAST iteration</strong></p></td>
    <td width="65" height="32"><p align="center"><strong>E-value</strong></p></td>
  </tr>
  <tr>
    <td width="100" height="32">T10M13.12</td>
    <td width="226" height="32"><i>Arabidopsis thaliana </i></td>
    <td width="94" height="32" align="right">2104545</td>
    <td width="92" height="32" align="center">1</td>
    <td width="65" height="32" align="right">4e-06</td>
  </tr>
  <tr>
    <td width="100" height="32">KIAA0259</td>
    <td width="226" height="32"><i>Homo sapiens</i></td>
    <td width="94" height="32" align="right">1665785</td>
    <td width="92" height="32" align="center">1</td>
    <td width="65" height="32" align="right">0.001</td>
  </tr>
  <tr>
    <td width="100" height="32">T13F2.3 </td>
    <td width="226" height="32"><i>Caenorhabditis elegans</i> </td>
    <td width="94" height="32" align="right">1667334</td>
    <td width="92" height="32" align="center">3</td>
    <td width="65" height="32" align="right">2e-07</td>
  </tr>
  <tr>
    <td width="100" height="32">SPAC6G9.12</td>
    <td width="226" height="32"><i>Schizosaccharomyces pombe</i></td>
    <td width="94" height="32" align="right">1644324</td>
    <td width="92" height="32" align="center">7</td>
    <td width="65" height="32" align="right">4e-04</td>
  </tr>
  <tr>
    <td width="100" height="32">C36A4.8</td>
    <td width="226" height="32"><i>Caenorhabditis elegans</i></td>
    <td width="94" height="32" align="right">1657667</td>
    <td width="92" height="32" align="center">7 </td>
    <td width="65" height="32" align="right">0.010 </td>
  </tr>
  <tr>
    <td width="100" height="32">D90904 </td>
    <td width="226" height="32"><i>Synechocystis </i>sp.</td>
    <td width="94" height="32" align="right">1652299</td>
    <td width="92" height="32" align="center">15</td>
    <td width="65" height="32" align="right">0.17</td>
  </tr>
  <tr>
    <td width="100" height="32">Pescadillo</td>
    <td width="226" height="32"><i>Homo sapiens</i></td>
    <td width="94" height="32" align="right">2194203</td>
    <td width="92" height="32" align="center">16</td>
    <td width="65" height="32" align="right">0.017</td>
  </tr>
</table>
</div>

<p><b>Notes</b></font><font FACE="Times" SIZE="1"></p>
</font>

<p><font FACE="Times" SIZE="1"><sup>1</sup></font><font FACE="Times" SIZE="2"> Boguski MS.
A molecular biologist visits <i>Jurassic Park</i>. <i>Biotechniques</i> 12:668&#8211;9,
1992.</font></p>

<p><font FACE="Times" SIZE="1"><sup>2</sup></font><font FACE="Times" SIZE="2"> Altschul
SF, Madden TL, Schaffer AA, Zhang J, Zhang Z, Miller W, Lipman DJ. Gapped BLAST and
PSI-BLAST: a new generation of protein database&nbsp;search programs. <i>Nucleic Acids Res</i>
25:3389&#8211;402, 1997.</font></p>

<p><font FACE="Times" SIZE="1"><sup>3</sup></font><font FACE="Times" SIZE="2">&nbsp;Bork
P, Hofmann K, Bucher P, Neuwald AF, Altschul SF, Koonin EV. A superfamily of conserved
domains in DNA damage-responsive cell cycle checkpoint proteins. <i>FASEB J</i>
11:68&#8211;76, 1997.</font></p>

<p><font FACE="Times" SIZE="1"><sup>4</sup></font><font FACE="Times" SIZE="2"> <i>Op.</i> <i>cit.</i>
2.</font><font face="Times" size="3">&nbsp;</font></p>

<p><font face="Times" size="3">&nbsp;<a href="#toc">Return to Table of Contents</a></font></p>

<hr>
<font FACE="Times" SIZE="6">

<h3><a name="Protein">Protein Families and Genome Evolution: COGs</a></font><font
face="Times" size="7"></h3>

<p ALIGN="left"></font><font FACE="Times" SIZE="3">Evolutionary biologists assume that the
genetic constitution of every organism can be traced back to a set of common ancestral
genes. This assumption has prompted scientists to perform sequence comparisons between
genes from different species to identify the distant and subtle relationships between
them. Genes with the same function can often be found in different species. These genes
are likely to have evolved from a single ancestral gene and are known as
&quot;orthologs.&quot; Alternatively, there may be sequences within the same organism that
are similar but have different functions; these &quot;paralogs&quot; most likely arose
from a gene duplication event and then evolved new functions. The growing number of
completely sequenced genomes allows unprecedented, comprehensive
comparisons between major phylogenetic groups or specific organisms that produce an
informative outline of these relationships. Such a panoramic perspective will augment our
knowledge of the course of evolution and identify protein functions conserved in some
organisms but not in others.</p>

<p align="left"><b>In Search of Gene Families from Complete Genomes</b></p>

<p ALIGN="left">Working with the newly sequenced genomes from seven different organisms,
three scientists at NCBI, Roman Tatusov, Eugene Koonin, and David Lipman, designed a new
system for classifying conserved genes and exploring the evolutionary relationships among
them. Beginning with a single gene, they looked for the best match to that sequence in
every other genome. They continued to perform pairwise sequence comparisons for each
protein sequence against every other sequence in all the genomes until nearly 18,000
sequences had been compared. When two genes from different organisms found each other as
their best match, they were identified as orthologs. Paralogs in genomes were identified
when matches between sequences in genomes were not reciprocal. The NCBI team cataloged the
sequences according to their functional similarities into &quot;Clusters of Orthologous
Groups,&quot; or COGs.</font><sup><font FACE="Times" SIZE="1">1</font></sup><font
FACE="Times" SIZE="3"> A total of 720 unique COGs were identified. Each COG has at least
three orthologs from three genomes (Figure 1a) and, in some cases, paralogs from the same
lineage (Figure 1b).</p>
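
<p ALIGN="left">The reciprocal best-match test for orthologs described above can be sketched in Python as follows. The data layout is an assumption made for illustration: <b>best_hit</b> holds, for each gene, its best-scoring match in every other genome, and <b>genome_of</b> records the genome each gene belongs to. Grouping the resulting pairs into COGs is not shown.</p>

```python
def reciprocal_best_hits(best_hit, genome_of):
    """Return the set of gene pairs that pick each other as best match.

    best_hit[gene][genome] is the best-scoring match of `gene` in `genome`;
    genome_of[gene] is the genome that `gene` belongs to.
    """
    pairs = set()
    for gene, hits in best_hit.items():
        for genome, partner in hits.items():
            # Look up the partner's best match back in the gene's own genome.
            back = best_hit.get(partner, {}).get(genome_of[gene])
            if back == gene:
                pairs.add(frozenset((gene, partner)))
    return pairs
```

<p ALIGN="left">A gene whose best match does not return the favor is left out of the result, which corresponds to the non-reciprocal case mentioned in the text.</p>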

<p ALIGN="left">The results of this comparison are available on a new NCBI Web page (<a
href="http://www.ncbi.nlm.nih.gov/COG/">http://www.ncbi.nlm.nih.gov/COG/</a>). The genomes
analyzed include five bacterial genomes, <i>Escherichia coli, Haemophilus influenzae,
Mycoplasma genitalium, Mycoplasma pneumoniae,</i> and the cyanobacterium <i>Synechocystis</i> sp.;
one archaebacterial genome, <i>Methanococcus jannaschii; </i>and one eukaryotic yeast
genome, <i>Saccharomyces cerevisiae</i>.</p>

<p ALIGN="left">&nbsp;<img src="cog.gif" alt="cog.gif (20530 bytes)" width="612"
height="437"></p>

<p align="left"><b>COGs Predict Functions</b></p>

<p ALIGN="left">Since orthologs typically have the same function, COGs allow the functions
of putative gene products to be predicted from the growing number of newly sequenced
genomes. Functions were assigned to the majority of the 720 COGs based on known proteins
within the groups or significant similarities to proteins in organisms not included in
this study. The COGs were further organized into 15 functional subgroups within 4 major
divisions: (1) information storage and processing, (2) cellular processes, (3) metabolism,
and (4) poorly characterized. The distribution of proteins from different organisms in the
COGs identifies trends in functional diversification. For example, the pathogenic
bacteria (<i>H. influenzae</i> and the mycoplasmas) have no representative proteins
in some metabolic groups.</p>
</font>

<p align="left"><b>Expanding COGs into Superfamilies</b></p>

<p ALIGN="left">The COGs represent ancient, conserved protein families with relevant
cellular functions because they are from organisms representing the major phylogenetic
groups that are estimated to be over 1 billion years old. Conserved sequence motifs within
the proteins reflect distinct biochemical activities employed by a variety of proteins to
perform their designated role in the cell. The NCBI team also employed motif-style
searching by using PSI-BLAST to identify protein superfamilies. Protein superfamilies
represent a higher level of protein classification than the COGs alone and can be used to
classify highly evolved proteins not assigned to any COG. The largest superfamily
contained ATPase and GTPase motifs broadly distributed in a variety of cellular
mechanisms.</p>

<p align="left"><b>Phylogenetic Patterns in COGs</b></p>

<p ALIGN="left">Like pieces of a mosaic that reveal an image when viewed together, COGs
can be used to conceptualize genetic evolution. The presence or absence of a
representative gene from an organism in a COG can be studied to reveal
&quot;patterns&quot; of gene conservation or loss for that particular COG function.
Tatusov, Koonin, and Lipman compiled a list of phylogenetic patterns gleaned from the 720
COGs. A single letter of the alphabet was assigned to represent each genome (e.g.,
&quot;e&quot; for <i>E. coli</i>), and a dash was used when the organism was not
represented in the COG. A COG that has proteins from all seven genomes has the phylogenetic
pattern &quot;ehgpcmy.&quot; A COG that is missing representative sequences from
the pathogenic species <i>M. genitalium</i> and <i>M. pneumoniae</i> has the pattern
&quot;eh--cmy.&quot; These two patterns occur most frequently, in 114 and 119 COGs,
respectively, and therefore represent conserved patterns.
The conserved patterns demonstrate continuity between the genomes, while rare patterns
suggest unique functions that need investigating. The addition of more genomes to the COG
analyses is expected to illuminate the functional role behind the rare patterns. </p>
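
<p ALIGN="left">A phylogenetic pattern of this kind is simple to compute. The Python sketch below builds the pattern string for one COG and tallies patterns across many COGs. Only the letter &quot;e&quot; for <i>E. coli</i> is stated in the text; the other letter assignments and the names <b>phylogenetic_pattern</b> and <b>pattern_counts</b> are assumptions made for this example.</p>

```python
from collections import Counter

# One letter per genome, in a fixed order. Only "e" (E. coli) is given in the
# text; the other letter-to-genome assignments are assumed for illustration.
GENOME_LETTERS = "ehgpcmy"

def phylogenetic_pattern(genomes_present, letters=GENOME_LETTERS):
    """Return the pattern string, with a dash for each absent genome."""
    return "".join(c if c in genomes_present else "-" for c in letters)

def pattern_counts(cogs):
    """Tally how many COGs share each pattern; cogs maps COG id -> set of letters."""
    return Counter(phylogenetic_pattern(present) for present in cogs.values())
```

<p ALIGN="left">Sorting the tally by count would separate the frequent, conserved patterns from the rare ones that call for a closer look.</p>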

<p align="left"><b>Piece by Piece</b></p>

<p ALIGN="left">The NCBI team continues to expand its COG research, analyzing eight more
genomes: <i>Helicobacter pylori, Bacillus subtilis, Borrelia burgdorferi, Treponema
pallidum, Chlamydia trachomatis, Methanobacterium thermoautotrophicum, Archaeoglobus
fulgidus, </i>and <i>Caenorhabditis elegans</i>. These analyses will be incorporated into
the Web site upon completion. Refinements and new additions are expected to build a COG
collection that will become a valuable resource for characterizing genomes and
comprehending life&#8217;s blueprint.</p>

<p><b>Note</b><font FACE="Times" SIZE="1"></p>

<p><sup>1</sup></font><font FACE="Times" SIZE="2"> Tatusov RL, Koonin EV, Lipman DJ. A
genomic perspective on protein families. <i>Science</i> 278:631&#8211;7, 1997.&nbsp;</font></p>
<font FACE="Times" SIZE="3">

<p align="left">&nbsp;<a href="#toc">Return to Table of Contents</a></p>

<hr>
<font FACE="Times" SIZE="5">

<h3><a name="GenBank">GenBank Submissions: From Deposit to Release</a></font><font
FACE="Times" SIZE="7"></h3>

<p ALIGN="left"></font>Ever wonder what happens to your cherished sequence once you send
it&nbsp;hundreds, even thousands, of electronic miles away to GenBank? Whether you are
using the sequence submission tool BankIt on the WWW, the stand-alone program Sequin, or
one of the specialized submission procedures for EST, STS, GSS, and HTG sequences, your
submission is received by the GenBank staff, a group of highly trained biologists and
database specialists who manage the collection and distribution of GenBank data.
Currently, over 5,000 sequences arrive each month at GenBank (excluding the specialized
submissions). While EST, STS, GSS, and HTG submissions are processed in large numbers
using semiautomated systems, all other types of sequence records are processed manually to
ensure biological integrity and internal consistency with annotation rules established by
the International Nucleotide Sequence Database Collaboration.</p>

<p align="left"><b>Certificate of Deposit: The Accession Number</b></p>

<p ALIGN="left">An NCBI staff member checks that your submission meets minimum
requirements and then assigns an accession number to the sequence within 24 hours. The
accession number serves as a confirmation that the sequence has been submitted and is a
permanent, citable number that will allow your sequence to be referenced in publications
by yourself and others. This same number is used to retrieve your sequence from GenBank or
from one of the other International Database Collaborators, EMBL and DDBJ.</p>

<p ALIGN="left">Accession numbers consist of one letter and five digits, or two letters
and six digits, and do not change even if the record or its sequence&nbsp;is updated.
GenBank also assigns a unique GenBank identifier, or GI number, to every <i>sequence </i>loaded
into the GenBank database. The GI numbers for nucleotide and protein sequences are
referred to as NIDs and PIDs, respectively. The GI number changes every time the sequence
is updated, enabling GenBank to track changes in <i>sequence </i>over time.</p>
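
<p ALIGN="left">The two accession number formats named above can be checked with a short regular expression. This Python sketch covers only those two formats and assumes uppercase letters; the names <b>ACCESSION</b> and <b>is_accession</b> are hypothetical.</p>

```python
import re

# One letter + five digits, or two letters + six digits (uppercase assumed).
ACCESSION = re.compile(r"[A-Z]\d{5}|[A-Z]{2}\d{6}")

def is_accession(text):
    """Return True if text matches one of the two accession formats."""
    return ACCESSION.fullmatch(text) is not None
```

<p ALIGN="left">A check like this confirms only the shape of the identifier, not that a record with that accession number exists.</p>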

<p align="left"><b>Checking Accounts: Indexers and Scientists</b></p>

<p ALIGN="left">Under the coordination of Francis Ouellette, a staff of 17 indexers
trained in molecular biology and skilled in database production operations annotate,
organize, and maintain the 1.7 million database entries. The indexers ensure that all
direct submissions receive a systematic quality assurance review. Sequences are screened
against GenBank by using BLAST to identify full or partial matches to sequences in the
database and then searched to detect vector, yeast, and mitochondrial contamination.
Programs that check for internal consistency are used to confirm coding regions, detect
open reading frames, and verify amino acid translations. Using GenBank content and data
representation guidelines, annotators then review the descriptive parts of the entry: the
locus name, definition line, taxonomy classification, and journal references. Staff
consult with submitters as necessary to add or modify features. Finally, one of 21 senior
scientists reviews the record for biological integrity and continuity.</p>

<p ALIGN="left">At least four people have reviewed your sequence and its annotations
before a draft of the GenBank record is mailed back to you for review. If the record is
not to be held confidential, it is loaded into GenBank after a 5-day review period. A
confidential record will not be released into the public database until you have notified
GenBank or it is published, whichever comes first. At any time, you may update information
in your record. We encourage authors to notify GenBank of publication so that confidential
records may be released and public records can be updated in a timely manner. Use the
BankIt Update function or send a message with the new information to <a
href="mailto:update@ncbi.nlm.nih.gov">update@ncbi.nlm.nih.gov</a>; please include your
accession number with all correspondence.</p>

<p align="left"><b>Gaining Interest: Release of Records</b></p>

<p align="left">The turnaround time from submission to release is anywhere from 1 to 3
weeks, depending on the number of submissions GenBank is processing. Once the record is
loaded into the database, the public can see the record the next day by using the Query
e-mail server (<a href="mailto:query@ncbi.nlm.nih.gov">query@ncbi.nlm.nih.gov</a>),
Network Entrez, or WWW Entrez. Entrez provides links to additional sequences, graphic
displays, structures, genome records, and PubMed. Still have questions? Write to us at <a
href="mailto:info@ncbi.nlm.nih.gov">info@ncbi.nlm.nih.gov</a>.&nbsp;</p>

<p><a href="#toc">Return to Table of Contents</a></p>

<hr>

<h3><a name="Throughput">High Throughput Sequencing Gives Rise to New GenBank Division </a></h3>

<p ALIGN="left">Generation-megabase scientists interested in accessing information hot off
the sequencer will be pleased to know that large-scale sequencing centers involved in the
eukaryotic genome projects are making copious amounts of sequence data available to the
public prior to completion. GenBank, in concert with the other members of the
International Nucleotide Sequence Database Collaboration, DDBJ and EMBL, has created the
High Throughput Genome (HTG) division to handle the evolving assemblage of genomic data.
To date, the high throughput sequencing projects include <i>Homo sapiens</i>, <i>Caenorhabditis
elegans</i>, <i>Drosophila melanogaster</i>, <i>Arabidopsis thaliana</i>, and <i>Mus
musculus</i>.</p>

<p align="left"><b>HTG Record Evolution</b></p>

<p ALIGN="left">Genome sequencing&nbsp;centers generate preliminary sequence information
from a single genomic clone and deposit random sequence fragments greater than 2 kb into
the GenBank HTG division. GenBank assigns a single accession number to the sequence data
derived from each clone and indicates the status of the HTG record as it passes through
several stages toward completion. Phase 1 records contain sequences that are unordered
and unoriented and that still contain gaps. In Phase 2, the order and orientation of the sequences have
been determined, but gaps remain. In Phase 3, once the sequencing is complete and the
error rate is less than 10<sup>-4</sup>, records are considered finished. Phase 3 records
are transferred to the appropriate organism division of GenBank, such as the Primate (PRI)
division for human sequences, or the Invertebrate (INV) division for <i>C. elegans</i>.
Sequences submitted to the HTG division are automatically searched against the various
databases using the BLAST programs, and the records are annotated to show the significant
matches. This sequence similarity information is valuable for positional cloning and gene
hunting.</p>
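
<p ALIGN="left">The three phases can be expressed as a small decision rule. In this Python sketch the function name <b>htg_phase</b> and its arguments are hypothetical, and the treatment of a gap-free record whose error rate is still too high (reported here as Phase 2) is an assumption, since the text does not cover that case.</p>

```python
def htg_phase(ordered, oriented, has_gaps, error_rate):
    """Return the HTG phase (1, 2, or 3) for a record, per the rules in the text."""
    # Phase 3: finished sequence, no gaps, error rate below 1 in 10,000.
    if ordered and oriented and not has_gaps and error_rate < 1e-4:
        return 3
    # Phase 2: order and orientation are known, but the record is unfinished.
    if ordered and oriented:
        return 2
    # Phase 1: fragments are still unordered or unoriented.
    return 1
```

<p ALIGN="left">Only a result of 3 would correspond to a record that moves out of the HTG division and into an organism division such as PRI or INV.</p>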

<p align="left"><b>Accessing HTG Records</b></p>

<p ALIGN="left">HTG records can be retrieved in Entrez by selecting the organism and
specifying &quot;HTG&quot; in the Keywords field. The Genomes database in Entrez, which
offers graphical displays of nucleotide and protein sequences, provides a visual framework
for the HTG sequences with links to additional DNA, protein, and bibliographic records.
Unfinished HTGs (Phase 1 or 2) are also available for BLAST searching by selecting the
&quot;htgs&quot; database, or the &quot;month&quot; database for the latest entries;
finished records (Phase 3) are available in the &quot;nr&quot; and &quot;month&quot; BLAST
databases.</p>

<p align="left"><b>HTG Information on the Web</b></p>

<p align="left">The new HTG Web site at <a href="http://www.ncbi.nlm.nih.gov/HTGS">http://www.ncbi.nlm.nih.gov/HTGS</a>
describes the HTG division and gives more detailed instructions for sequencing centers
interested in submitting HTG sequences.&nbsp;</p>
<font face="Times" size="3">

<p align="left"><img src="htg1.gif" alt="htg1.gif (14191 bytes)" width="612" height="374"></p>

<p align="left"><img src="htg3.gif" alt="htg3.gif (81258 bytes)" width="612" height="374"></p>
</font>

<p align="left"><a href="#toc">Return to Table of Contents</a></p>

<hr>
<font FACE="Times" SIZE="6">

<h3><a name="One Billion">GenBank Reaches One Billion Bases</a></font><font FACE="Times"
SIZE="7"></h3>

<p ALIGN="left"></font>In 1985, GenBank contained just over 5,700 entries that were
obtained principally by scanning the biomedical literature for sequence data. GenBank now
contains almost 2 million entries and recently surpassed 1 billion base pairs of genetic
information for more than 25,000 organisms. Human sequence data predominate in the
database, representing 43% of the billion-plus base pairs. Mouse (<i>Mus musculus</i>)
and the nematode <i>Caenorhabditis elegans</i> are second and third, representing 10%
and 9%, respectively.</p>

<p align="left"><b>Exponential Growth</b></p>

<p ALIGN="left">Doubling in size every 18 months, GenBank is now built primarily from the
direct submission of sequence data from authors and sequencing centers. Currently, more
than 70% of the sequence records in the database are ESTs (expressed sequence tags). As
EST and genomic sequencing efforts are intensified, the GenBank doubling rate is expected
to accelerate. Additional information about GenBank, its various divisions, and its growth
statistics can be found in the current release notes (<a
href="ftp://ncbi.nlm.nih.gov/genbank/gbrel.txt">ftp://ncbi.nlm.nih.gov/genbank/gbrel.txt</a>).</p>
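
<p ALIGN="left">For a sense of what an 18-month doubling time implies, the following Python sketch projects database size forward. It assumes the doubling time stays constant, whereas the article expects it to shorten, so the result is a rough lower-bound illustration; the function name <b>projected_size</b> is hypothetical.</p>

```python
def projected_size(current, months_ahead, doubling_months=18):
    """Project size after months_ahead months of steady exponential growth."""
    return current * 2 ** (months_ahead / doubling_months)
```

<p ALIGN="left">Starting from 1 billion bases, this rule gives 2 billion after 18 months and 4 billion after 36 months.</p>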

<p align="left"><b>GenBank CD-ROM to Be Discontinued</b></p>

<p align="left">The explosive growth of sequence information is reflected in the GenBank
CD-ROM, which expanded from a single disc in 1992 to 12 discs with December 1997 Release
104. Since production costs are escalating and users are opting for the convenience of the
Internet over the unwieldy discs, the GenBank CD-ROM will be discontinued following the
April 15, 1998, release. GenBank full releases with cumulative and noncumulative update
files continue to be available in the genbank/ directory for downloading by Anonymous FTP.
Consult the README file in this directory for more details.&nbsp;&nbsp;</p>

<p><a href="#toc">Return to Table of Contents</a></p>

<hr>

<h3><a name="faq">Frequently Asked Questions</a> </h3>

<p ALIGN="left"><i>How do I do a BLAST search with a short DNA sequence?</i></p>

<p ALIGN="left">You will probably need to increase the Expect (E) value, since a short
query is more likely to occur by chance in the database. You may also want to turn off the
low-complexity filter, since short queries often contain low-complexity sequence. Another
parameter that becomes important with a short query is Word size, which is used by BLAST
to nucleate regions of similarity. The default Word size is 11 for nucleotides, so if your
query sequence is shorter than this, you may want to decrease Word size (W). For more detail,
see the FAQ section of the BLAST Web page.</p>
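
<p ALIGN="left">The effect of Word size on a short query comes down to simple arithmetic: a query of length <i>n</i> contains <i>n</i> - <i>W</i> + 1 overlapping words of size <i>W</i>, and none at all if it is shorter than <i>W</i>. The Python function below, with the hypothetical name <b>word_count</b>, shows the calculation.</p>

```python
def word_count(query_length, word_size=11):
    """Number of overlapping words of word_size in a query of query_length."""
    return max(0, query_length - word_size + 1)
```

<p ALIGN="left">A 10-base query therefore yields no words at the default Word size of 11, while lowering W to 7 yields four.</p>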

<p ALIGN="left"><i>Is it possible to perform a BLAST search against just human ESTs?</i></p>

<p ALIGN="left">Many separate databases are now available for BLAST searching. Select
&quot;Human ESTs&quot; in the Database field pull-down menu when using Gapped BLAST.</p>

<p ALIGN="left"><i>Where can I get more information about the Interactive Digital
Differential Display (DDD) facility used in the Cancer Genome Anatomy Project (CGAP)?</i></p>

<p ALIGN="left">DDD is a computational method for comparing gene frequencies among various
cDNA libraries or pools of libraries. It is available from the CGAP Web site at <a
href="http://www.ncbi.nlm.nih.gov/ncicgap/ddd.html">http://www.ncbi.nlm.nih.gov/ncicgap/ddd.html</a>.</p>

<p ALIGN="left"><i>How can I obtain the EST clones described in my UniGene search?</i></p>

<p ALIGN="left">Information on clone availability is located in the dbEST record (<a
href="http://www.ncbi.nlm.nih.gov/dbEST/index.html">http://www.ncbi.nlm.nih.gov/dbEST/index.html</a>).
Click on &quot;Search dbEST&quot; and enter the GenBank accession number. Individuals
interested in obtaining materials can (1) contact the submitter of the sequence, (2) refer
to the Source field (if present) for sources providing the clone, or (3) refer to the Clone
ID and library number located under &quot;Clone Info,&quot; which can be used to order a
particular clone through the&nbsp;I.M.A.G.E. Consortium. To see a list of distributors
participating in the I.M.A.G.E. Consortium, scroll down the dbEST Web page and click on
&quot;Distributors.&quot;</p>

<p ALIGN="left"><i>How can I search for just review articles or specify a certain time period
when searching PubMed?</i></p>

<p ALIGN="left">In the Advanced mode, set the Search Field pull-down menu to
&quot;Publication Type&quot; and enter the word &quot;review&quot; into the text box. To
display a list of available terms for Publication Type, select &quot;List Terms&quot; from
the Mode menu and enter a term using &quot;Publication Type&quot; in the Search Field. To
search a range of dates, use a colon between the limiting years (e.g., 1966:1976), and set
the Search Field to &quot;Publication Date.&quot; Search results can also be limited to
the last 30 days or another period of time by selecting one of the options under the
Publication Date limit menu.</p>

<p ALIGN="left"><i>In CGAP, is there a way to tell which libraries are made with tissue from
the same donor?</i></p>

<p align="left">For any tissue, including microdissected tissues, clicking on the link for
&quot;Tissue sample&quot; will lead to a list of all libraries made from the same samples.</p>

<p><a href="#toc">Return to Table of Contents</a></p>

<hr>

<h3><a name="FTP">NCBI Data by FTP</a> </h3>
<font FACE="Times" SIZE="2">

<p ALIGN="left"></font>The NCBI FTP site contains a variety of directories with publicly
available databases and software. The available directories include
&#8216;repository,&#8217; &#8216;genbank,&#8217; &#8216;entrez,&#8217; &#8216;toolbox,&#8217;
&#8216;pub,&#8217; and &#8216;sequin.&#8217;</font></p>

<p ALIGN="left"><font face="Times" size="3">The <b>repository </b>directory makes a number
of molecular biology databases available to the scientific community. This directory
includes databases such as PIR, SwissProt, CarbBank, AceDB, and FlyBase.</font></p>

<p ALIGN="left"><font face="Times" size="3">The <b>genbank </b>directory contains files
with the latest full release of GenBank, the daily cumulative updates, and the latest
release notes.</font></p>

<p ALIGN="left"><font face="Times" size="3">The <b>entrez </b>directory contains the
client software for Network Entrez.</font></p>

<p ALIGN="left"><font face="Times" size="3">The <b>toolbox </b>directory contains a set of
software and data exchange specifications that are used by NCBI to produce portable
software, and includes ASN.1 tools and specifications for molecular sequence data. </font></p>

<p ALIGN="left"><font face="Times" size="3">The <b>pub </b>directory offers public-domain
software, such as BLAST (a sequence similarity search program). Client software for Network
BLAST and PowerBlast is also included in this directory.</font></p>

<p ALIGN="left"><font face="Times" size="3">The <b>sequin </b>directory contains the new
Sequin submission software for Mac, PC, and UNIX platforms.</font></p>

<p align="left"><font face="Times" size="3">Data in these directories can be transferred
over the Internet by anonymous FTP. To connect, type <b>ftp
ncbi.nlm.nih.gov</b>. Enter <b>anonymous</b> as the login name and your e-mail
address as the password, then change to the appropriate directory. For example, change to
the repository directory (<b>cd repository</b>) to download specialized databases.</p>
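
<p align="left">The same anonymous FTP session can be scripted. The following is a minimal
sketch using Python's standard ftplib module, assuming the host name given above still
accepts anonymous logins (not verified here). The function name list_directory and its
ftp_factory parameter are hypothetical, introduced only for illustration; ftp_factory
lets a stand-in object replace the real connection when trying the code offline.</p>

```python
from ftplib import FTP

HOST = "ncbi.nlm.nih.gov"  # host name as given in the text; availability is an assumption


def list_directory(directory, email, host=HOST, ftp_factory=FTP):
    """Log in anonymously, change to `directory`, and return its file names.

    `ftp_factory` is a hypothetical hook: any callable that takes a host name
    and returns an object with login, cwd, nlst, and quit methods.
    """
    ftp = ftp_factory(host)            # same as typing: ftp ncbi.nlm.nih.gov
    try:
        ftp.login("anonymous", email)  # login name "anonymous", e-mail address as password
        ftp.cwd(directory)             # same as typing: cd repository
        return ftp.nlst()              # names of the files in that directory
    finally:
        ftp.quit()                     # always end the session politely
```

<p align="left">For example, list_directory("repository", "user@example.org") would return
the names found in the repository directory, where the e-mail address shown is a
placeholder.</p>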

<p><a href="#toc">Return to Table of Contents</a></p>

<hr>

<h3 align="left"><a name="Pubs">Selected Recent Publications by NCBI Staff</a>&nbsp; </font></h3>

<p ALIGN="left"><font size="3"><strong>Benson DA, Boguski MS, Lipman DJ, Ostell J,
Ouellette BFF.</strong> GenBank. <i>Nucleic Acids Res</i> 26:1&#8211;7, 1998.</font></p>

<p ALIGN="left"><font size="3"><strong>Galperin MY, Koonin EV.</strong> A diverse
superfamily of enzymes with ATP-dependent carboxylate-amine/thiol ligase activity. <i>Protein
Sci</i> 6:2639&#8211;43, 1997.</font></p>

<p ALIGN="left"><font size="3"><strong>Leipe DD, Landsman D. </strong>Histone
deacetylases, acetoin utilization proteins, and acetylpolyamine amidohydrolases are
members of an ancient protein superfamily. <i>Nucleic Acids Res</i> 25:3693&#8211;7, 1997.</font></p>

<p ALIGN="left"><font size="3"><strong>Lipman DJ.</strong> Making (anti)sense of
non-coding sequence conservation. <i>Nucleic Acids Res</i> 25:3580&#8211;3, 1997.</font></p>

<p ALIGN="left"><font size="3"><strong>Marchler-Bauer A, Bryant SH.</strong> A measure of
success in fold recognition. <i>Trends Biochem Sci</i> 22:236&#8211;40, 1997.</font></p>

<p ALIGN="left"><font size="3"><strong>Neuwald AF.</strong> An unexpected structural
relationship between integral membrane phosphatases and soluble haloperoxidases. <i>Protein
Sci</i> 6:1764&#8211;7, 1997.</font></p>

<p ALIGN="left"><font size="3"><strong>Ouellette BFF, Boguski MS.</strong> Database
divisions and homology search files: a guide for the perplexed. <i>Genome Res</i> 7:952&#8211;5,
1997.</font></p>

<p ALIGN="left"><font size="3"><strong>Pruitt KD.</strong> WebWise: navigating the Human
Genome Project. <i>Genome Res</i> 7:1038&#8211;9, 1997.</font></p>

<p ALIGN="left"><font size="3"><strong>Schuler GD.</strong> Pieces of the puzzle:
expressed sequence tags and the catalog of human genes. <i>J Mol Med</i> 75:694&#8211;8,
1997.</font></p>

<p ALIGN="left"><font size="3"><strong>Sonnhammer EL, Wootton JC.</strong> Widespread
eukaryotic sequences, highly similar to bacterial DNA polymerase I, looking for functions.
<i>Curr Biol</i> 7:R463&#8211;5, 1997.</font></p>

<p align="left"><font size="3"><strong>Tatusov RL, Koonin EV, Lipman DJ.</strong> A
genomic perspective on protein families. <i>Science</i> 278:631&#8211;7, 1997.</font><font
face="Times" size="3"></p>

<p><a href="#toc">Return to Table of Contents</a></p>

<hr>

<h3 align="left"><a name="Masthead">Masthead</a></h3>
</font>

<p ALIGN="left"><font size="3"><i>NCBI News</i> is distributed two to three times a year. We
welcome communication from users of NCBI databases and software and invite suggestions for
articles in future issues. Send correspondence and suggestions to <i>NCBI News</i> at the
address below.</font></p>

<p ALIGN="left"><font size="3">NCBI News<br>
National Library of Medicine<br>
Bldg. 38A, Room 8N-803<br>
8600 Rockville Pike<br>
Bethesda, MD 20894<br>
Phone: (301) 496-2475<br>
Fax: (301) 480-9241<br>
</font><font face="Times" size="3">E-mail: <a href="mailto:info@ncbi.nlm.nih.gov">info@ncbi.nlm.nih.gov</a></font></p>

<p><font size="3"><em>Editors</em><br>
Dennis Benson<br>
Barbara Rapp</font></p>

<p align="left"><font size="3"><em>NCBI Contributors</em><br>
Renata McCarthy<br>
Ken Katz<br>
Francis Ouellette<br>
Stephen Altschul</font></p>

<p><font size="3"><em>Writer</em><br>
Donna Roscoe</font></p>

<p><font size="3"><em>Managing Editor</em><br>
Roseanne Price </font></p>

<p><font size="3"><em>Graphics and Production</em><br>
Veronica Johnson</font></p>

<p ALIGN="JUSTIFY"><font size="3"><em>Design Consultant</em><br>
Troy M. Hill</font></p>

<p ALIGN="left"><font size="3">In 1988, Congress established the National Center for
Biotechnology Information as part of the National Library of Medicine; its charge is to
create information systems for molecular biology and genetics data, and to perform
research in computational molecular biology.</font></p>

<p ALIGN="left"><font size="3">The contents of this newsletter may be reprinted without
permission. The mention of trade names, commercial products, or organizations does not
imply endorsement by NCBI, NIH, or the U.S. Government. </font></p>

<p ALIGN="JUSTIFY"><font size="3">NIH Publication No. 98-3272<br>
ISSN 1060-8788 </font></p>
<font face="Times" size="3">

<p><a href="#toc">Return to Table of Contents</a> </p>

<hr>
</font>
</body>
</html>