<!doctype html public "-//IETF//DTD HTML//EN">
<html>
<head>
<title>Spring 1999</title>
<meta name="GENERATOR" content="Microsoft FrontPage 3.0">
<meta name="AUTHOR" content="RJohnson">
</head>
<body bgcolor="#FFFFFF" text="#000000" vlink="#0000FF" alink="#0000FF">
<p align="left"><img src="newslogo.gif"> </p>
<p>&nbsp;</p>
<hr style="margin-top: -2in; margin-bottom: -2in; padding-top: ; padding-bottom:">
<p><a name="toc">Spring 1999</a></p>
<hr style="margin-top: -3in; margin-bottom: ; padding-top: -9 in; padding-bottom: 0">
<h3>In This Issue</h3>
<p><a href="#New Human Genome Web Resource: A Nexus for Genome Data">New Human Genome Web
Resource:&nbsp; A Nexus for Genome Data</a><br>
<a href="#DNA Sequences">DNA Sequences from Times Past in GenBank</a><br>
<a href="#Sequin 2.90">Sequin 2.90 Offers Simplified Network Access</a><br>
<a href="#SAGEmap">SAGEmap Offers a Versatile Interface to Gene Expression Data</a><br>
<a href="#Profile">Profile:&nbsp; PSI-BLAST's Impact Is &quot;High Profile&quot;</a><br>
<a href="#Lab">BLAST Lab</a><br>
<a href="#EST Submissions">Mandatory Protocol for EST Submissions to Take Effect Soon</a><br>
<a href="#FAQ">Frequently Asked Questions</a><br>
<a href="#Recent Pubs">Selected Recent Publications by NCBI Staff</a><br>
<a href="#Masthead">Masthead</a></p>
<hr>
<h3><a name="New Human Genome Web Resource: A Nexus for Genome Data"><font SIZE="5">New
Human Genome Web Resource: A Nexus for Genome Data</font></a></h3>
<p><font SIZE="5">C</font><font SIZE="3">ompletion of the sequencing and analysis of the
human genome promises to be a complex task that will involve cooperation among researchers
applying diverse tools to the problem. The data generated will be reported in a variety of
forms reflecting this diverse set of tools. Genetic and physical maps, markers, nucleotide
polymorphisms, disease phenotypes, expression profiles, and sequence data must be
integrated and made accessible for analysis. A repository of sequence data represents a
natural site for construction of a nexus into which data in many forms can flow and from
which these data can be accessed. NCBI&#146;s <a
href="http://www.ncbi.nlm.nih.gov/genome/guide/">Human Genome Resources</a> page, designed
to serve as such a nexus, is closely connected to the GenBank sequence database and also
provides centralized access to a full range of human genome resources available within
NCBI and elsewhere. </p>
<p ALIGN="left">From the NCBI home page, the Human Genome Resources link leads to an
organized set of links to human genome data in many forms. A screen clip of the upper half
of the page is shown below. The first collection of links, called The Genome at a Glance,
is an array of 24 chromosome ideograms serving as links to GeneMap &#146;98 (GeneMap
&#146;99 coming soon). Clicking on one of these chromosomes leads to radiation hybrid (RH)
mapping data as well as information on gene distributions and gene-disease associations. </p>
<p ALIGN="left"><img src="hgen3_html.jpg" alt="hgen3_html.jpg (54557 bytes)" border="1" width="660"
height="705"><br>
<small><em><font face="Helvetica">Partial display of recently introduced Human Genome
Resources page on NCBI's Web site.</font></em></small></p>
<p ALIGN="JUSTIFY">&nbsp;</p>
<p align="left"><strong>A Search Box for LocusLink</strong></p>
<p ALIGN="left">A query box spanning the top of the page can be used to conduct a text
search of any of six major NCBI resources. The default target is <a
href="http://www.ncbi.nlm.nih.gov/LocusLink">LocusLink</a>, a new nomenclature
cross-referencing tool developed by Donna Maglott. LocusLink allows searches that begin
with queries as diverse as official gene names, aliases, sequence accession numbers,
protein names, phenotypes, EC numbers, MIM numbers, other database identifiers, UniGene
clusters, or mapping information to converge upon the same data. The target database can
be changed from LocusLink to MEDLINE, OMIM, GenBank, GeneMap &#146;98, or UniGene.</p>
<p align="left"><strong>A Column of Links to the Left</strong></p>
<p ALIGN="left">A blue column along the left of the page contains an array of links to a
variety of human genome data. A link to OMIM (Online Mendelian Inheritance in Man)
provides access to over 10,000 descriptions of genetic diseases and genes, stressing
genotype-phenotype correlations. A pointer to GeneMap &#146;98 provides access to mapping
data on over 35,000 human genes. Gateways to UniGene, dbEST, and the Davis Human/Mouse
Homology Map follow. There are also links for dbSNP, leading to a database of single
nucleotide polymorphisms, and Mutation DBs, pointing to 27 disease-specific databases.</p>
<p ALIGN="left">Of particular interest are links to Human Genome Sequencing, Reference
mRNA Sequences, and SAGEmap. The first two of these resources are described below.
The third, SAGEmap, is a new NCBI database of quantitative gene expression and is the
subject of a separate <a href="#SAGEmap">article</a> in this issue of the <em>NCBI News.</em></p>
<p align="left"><strong>A Look at the Genome Sequencing Page</strong></p>
<p ALIGN="left">The Genome Sequencing site, developed by Greg Schuler, is an important new
resource supporting the human sequencing effort. From the Human Genome Resources page,
follow the <a
href="http://www.ncbi.nlm.nih.gov/genome/seq/page.cgi?F=HsHome.html&amp;ORG=Hs">Human
Genome Sequencing</a> link to reach a colorful graphical display of genome sequencing
progress. Finished sequence is indicated by hot red and orange bands on a set of
chromosome ideograms. These data are also available in numerical form. A link labeled More
Statistics leads to a table giving sequencing progress by chromosome. The table is
extended with more links to the individual contigs involved. A query box at the head of
the page provides access to a database of contigs plus six other NCBI databases. Down the
left side of the page is a column of links pointing to genome sequencing centers, a contig
browser, chromosome-specific BLAST searches, and information on downloading sequences.</p>
<p align="left"><strong>A Quick Look at RefSeq</strong></p>
<p ALIGN="left">RefSeq, a project developed by Kim Pruitt that provides reference
sequences for chromosomes, mRNAs, and proteins, is reached via the <a
href="http://www.ncbi.nlm.nih.gov/LocusLink/refseq.html">Reference mRNA Sequences</a>
link. RefSeq standards provide a foundation for functional annotation of the genome. They
provide a stable reference point for mutation analysis, studies of gene expression, and
polymorphism discovery. In addition, RefSeq-to-LocusLink associations anchor UniGene
clusters and support annotation of genomic contig sequence data generated by the Human
Genome Project. RefSeq records are available through BLAST, Entrez, and LocusLink.</p>
<p align="left"><strong>More Links to the Right</strong></p>
<p ALIGN="left">The right side of the Human Genome Resources page is also lined with many
links. The first is a pointer to the NCBI Genes and Disease page, which gives synopses of
over 60 diseases of genetic origin and provides links to the literature and sequence
databases. Below this is a block of links to resources of the National Human Genome
Research Institute (NHGRI), including its home page, the Human Genome Project page, and a
glossary of genetic terms.</p>
<p ALIGN="left">A National Cancer Institute (NCI) block offers gateways to cGAP, cCAP,
CancerNet, and the NCI home page. The path to cGAP allows access to gene expression data
for normal, precancerous, and cancerous cells derived from the 121 sequence libraries of
the Cancer Genome Anatomy Project. The database currently contains expression data for
over 20,000 human genes. For data on physical chromosomal defects associated with cancer,
follow the cCAP link to the Cancer Chromosomal Aberration Project.</p>
<p align="left"><strong>News Items</strong></p>
<p align="left">The Human Genome Resources page also features a central block of short
newspaper-style announcements. Currently grouped under the headings What&#146;s New and
Human Genome Meetings, these items run the gamut from synopses of new human genome
resources at NCBI to notices of symposia and workshops to recent news bulletins of
relevance to the Human Genome Project.&nbsp;&nbsp;</p>
<p><a href="#toc">Return to Table of Contents</a></p>
<hr>
<h3><font SIZE="5"><a name="DNA Sequences"><i>On the lighter side . . .<br>
</i>DNA Sequences from Times Past in GenBank</a></font></h3>
<p ALIGN="left"><font SIZE="5">A</font><font SIZE="3">lthough GenBank lacks dinosaur DNA,
fragments of genomes past can be found here. A practical limit of about 100,000 years
currently applies to the age of recoverable DNA samples. Beyond this limit, hydrolysis of
the phosphate backbone of the DNA and oxidative damage to the bases that make up the DNA
sequence become too great to allow for efficient PCR amplification. This is why deposition
of significant amounts of dinosaur sequence (age&nbsp;&ge;</font><font
SIZE="3"> 65 million years) in GenBank is unlikely to occur in the near future. However,
many DNA sequences arising from extinct organisms and ancient genomes are in the database
today, and the number is expected to grow as technology for the extraction and
amplification of aged DNA progresses. A selection of sequences from the past now available
in GenBank is given below.</font><b><font FACE="Helvetica" SIZE="1"></p>
</font></b>
<table border="0" width="84%" cellpadding="2">
<tr>
<td width="100%" bgcolor="#C0C0C0"><font face="Helvetica" size="2"><b>AF011222:</b><i>
Mitochondrial DNA from a Neanderthal specimen discovered in 1856 near Dusseldorf, Germany.
Source: Bone (3.5 g of right humerus). Age of Source: 30,000 to 100,000 years. Length: 379
bp.</i><b><p>S69989: </b><i>Mitochondrial DNA from the Late Neolithic &quot;Iceman&quot;
found mummified in the Tyrolean Alps. Source: Soft tissue. Age of Source: 5,000 years.
Length: 354 bp.</i><b></p>
<p>X73306:</b><i> Mitochondrial DNA from Egyptian mummy. Source: Tarsus bone. Age of
Source: 2,000 years. Length: 122 bp.</i><b></p>
<p>K02137: </b><i>Alu-repeat family DNA sequence from Egyptian mummy. Source: Mummified
soft tissue. Age of Source: 2,400 years. Length: 919 bp.</i><b></p>
<p>X88771:</b><i> Ribosomal RNA gene from an Iceman fungal clone. Source: Grass clothing
of the Iceman (cloak, boots). Age of Source: 5,000 years. Length: 495 bp. </i></p>
<p><strong>Z48945:</strong> <i>Ribosomal RNA gene from an extinct giant ground sloth.
Source: Bone, teeth, and coprolites. Age of Source: 13,000 years. Length: 571 bp.</i><b></p>
<p>L08480:</b> <i>Mitochondrial DNA from an extinct legume. Source: Leaf embedded in
Dominican amber. Length: 348 bp.</i><b></p>
<p>D50842:</b><i> Mitochondrial DNA from an extinct woolly mammoth (the baby Magadan
mammoth known as Dima). Source: 1 g of muscle. Age of Source: 40,000 years. Length: 1,137
bp.</i><b></p>
<p>S72502: </b><i>Mitochondrial DNA from the extinct Siberian woolly mammoth. Source:
Humerus cortical bone. Length: 242 bp.</i><b></p>
<p>D83049: </b><i>Mitochondrial DNA from the extinct Steller&#146;s sea cow, a relative of
the manatee. Source: Bone (2 g of bone from a scapula collected on Bering Island,
Kamchatka). Length: 1,005 bp.</i><b></p>
<p>X64307:</b><i> Mitochondrial DNA from a quagga (an extinct relative of the zebra from
southern Africa). Length: 117 bp.</i><b></p>
<p>S46659:</b><i> Mitochondrial ribosomal RNA from </i>Smilodon fatalis,<i> the
saber-toothed tiger found in the Rancho La Brea Tar Pits in Los Angeles, California.
Source: Bone from three specimens. Age of Source: 14,000 years. Length: 132 bp.</i><b></p>
<p>X67636:</b><i> Mitochondrial ribosomal RNA of the Moa, an extinct bird. Source: Bones
and soft tissues of four species of Moa. Length: 386 bp.</i><b></p>
<p>S78028: </b><i>Mitochondrial DNA from Medieval French rabbits. Source: 1 to 4 g of
bone. Age of Source: 400 to 600 years. Length: 233 bp.</i></font></td>
</tr>
</table>
<p><a href="#toc">Return to Table of Contents</a></p>
<hr>
<h3><a name="Sequin 2.90"><font SIZE="5">Sequin 2.90 Offers Simplified Network Access </font></a></h3>
<p align="left"><font SIZE="5">S</font>equin can function in either a stand-alone mode or
a &#147;network-aware&#148; mode. The stand-alone mode provides the functions needed to
prepare most sequence submissions. In its network-aware mode, however, Sequin acquires
additional functionality through online access to GenBank and NCBI&#146;s sequence
analysis tools. For instance, network-aware Sequin can download sequences from GenBank in
order to facilitate submitting multiple sequence alignments that include existing GenBank
sequences. Network-aware Sequin can also conduct PowerBLAST searches, perform Entrez
queries, and screen for the presence of contaminating vector sequences or repeat elements
within a sequence submission. Setting up Sequin to communicate over the network has been
simplified in version 2.90, which is now available at <a
href="ftp://ncbi.nlm.nih.gov/sequin/">ftp://ncbi.nlm.nih.gov/sequin/</a>. </p>
<p align="left">Sequin&#146;s <strong>Network Configuration</strong> option, available on
the initial Welcome to Sequin page as well as in the record viewer, is located under the
Misc menu. To configure Sequin to use the network, most users need only select the <strong>Normal</strong>
connection and click on <strong>Accept</strong> to begin the configuration. Users who are
behind a firewall may need to contact their system administrator in order to fill in the
Proxy and Port fields. Users outside the United States or with an unreliable Internet
connection may need to increase the &#147;timeout&#148; value, which is the length of time
Sequin will wait for a response from the network. Sequin must be restarted in order for
the network configuration changes to take effect. </p>
<p><a href="#toc">Return to Table of Contents</a></p>
<hr>
<h3 align="left"><a name="SAGEmap"><font SIZE="5">SAGEmap Offers a Versatile Interface to
Gene Expression Data</font></a> </h3>
<p ALIGN="left"><font SIZE="5">S</font><font SIZE="3">erial Analysis of Gene Expression
(SAGE) refers to a technique for taking a snapshot of the messenger RNA population of a
cell. The elements of the film, in this case, are oligonucleotides, each consisting of a
concatenated array of short, 20- to 24-bp sequence tag pairs, or ditags. Each 10- to 12-bp
tag of the ditag pair represents one parent messenger RNA. Many of these concatenated
arrays are combined within a SAGE &quot;library.&quot; The number of times a particular
tag is detected in a library gives a digital measure of the abundance of its associated
mRNA and, hence, provides a quantitative measure of gene expression. By using the SAGE
technique, coupled with high-throughput sequencing technology, it is possible to obtain
accurate expression data for thousands of genes within a cell. A major application of SAGE
is in the identification of abnormal gene expression leading to, or diagnostic of, various
disease states, such as cancer. NCBI&#146;s <a href="http://www.ncbi.nlm.nih.gov/SAGE/">SAGEmap</a>
site, developed by Alex Lash and introduced in March 1999, implements many functions
useful in the analysis of SAGE data.</p>
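<p align="left">The counting step described above can be sketched in a few lines of Python. This is only a minimal illustration, not SAGEmap's implementation: the function name tag_abundance and the tag sequences are made up, and the tags are assumed to have been extracted from the ditags already.</p>

```python
from collections import Counter


def tag_abundance(tags):
    """Return {tag: (count, fraction of all tags in the library)}."""
    counts = Counter(tags)
    total = sum(counts.values())
    return {tag: (n, n / total) for tag, n in counts.items()}


# Hypothetical 10-bp tags observed in one SAGE library (invented sequences).
library = ["AAACCCGGGT", "AAACCCGGGT", "TTTGGGCCCA", "AAACCCGGGT"]
abundance = tag_abundance(library)
```

<p align="left">Reporting each count as a fraction of the library total is one simple way to compare libraries of different sizes.</p>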
<p align="left"><b>Mapping Tools</b></p>
<p align="left">SAGEmap provides tag-to-gene as well as gene-to-tag mappings. Both
mappings are updated weekly, immediately following the updating of UniGene.</p>
<p ALIGN="left">The tag-to-gene function maps a SAGE tag to one or more UniGene clusters.
It also produces a table listing the SAGE libraries in which the gene tag occurs, the
number of occurrences per library, and the total number of tags in the library. Following
a link to a SAGE library leads to more information about the library and allows one to
download the SAGE tag data. The full tag-to-gene mapping data for all the SAGE libraries,
or a &quot;reliably mapped&quot; subset of these data, may also be downloaded as a single
file.</p>
<p ALIGN="left">The inverse function, gene-to-tag mapping, maps a UniGene cluster ID to
the SAGE tags found within the cluster. For each SAGE tag found, library information is
given. A link to the UniGene cluster used as the query leads to the UniGene page for this
cluster.</p>
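<p align="left">The relationship between the two mappings can be pictured as inverting a lookup table. The Python sketch below assumes a tag-to-gene mapping held in memory as a dictionary; the function name, tags, and cluster IDs are invented for illustration and do not reflect how SAGEmap stores its data.</p>

```python
from collections import defaultdict


def invert_mapping(tag_to_clusters):
    """Build a gene-to-tag mapping from a tag-to-gene mapping."""
    cluster_to_tags = defaultdict(set)
    for tag, clusters in tag_to_clusters.items():
        for cluster in clusters:
            cluster_to_tags[cluster].add(tag)
    return dict(cluster_to_tags)


# Invented tags and UniGene-style cluster IDs; a tag may map to several clusters.
tag_to_clusters = {
    "AAACCCGGGT": {"Hs.0001"},
    "TTTGGGCCCA": {"Hs.0001", "Hs.0002"},
}
cluster_to_tags = invert_mapping(tag_to_clusters)
```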
<p align="left"><b>SAGE Data Analysis</b></p>
<p ALIGN="left">SAGEmap can construct a user-configurable table of data comparing one
group of SAGE libraries with another. Several CGAP SAGE libraries are currently available
and may be included in the table. Libraries included in the table are assigned to one of
two groups, A or B, between which a comparison can be made. For each group, the user
specifies which tags within the designated libraries should be included in the table. A
logical AND specifies that tags be included only if they are expressed in all libraries
making up the group. A logical OR specifies that tags be included regardless of groupwide
expression. Tags having wide intragroup variations in expression may be excluded from the
comparison.</p>
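<p align="left">The within-group AND and OR options amount to a set intersection and a set union. The Python sketch below is one possible reading of that rule, assuming each library is a dictionary of tag counts and that "expressed" means a count above zero; the function name group_tags and the data are hypothetical.</p>

```python
def group_tags(libraries, mode="AND"):
    """Select the tags to include for one group of SAGE libraries.

    libraries: list of {tag: count} dicts, one per library.
    "AND" keeps tags with a nonzero count in every library of the group.
    "OR" keeps tags with a nonzero count in at least one library.
    """
    expressed = [{t for t, n in lib.items() if n > 0} for lib in libraries]
    if not expressed:
        return set()
    if mode == "AND":
        return set.intersection(*expressed)
    if mode == "OR":
        return set.union(*expressed)
    raise ValueError("mode must be 'AND' or 'OR'")


# Two made-up libraries assigned to group A.
group_a = [{"T1": 5, "T2": 0, "T3": 2}, {"T1": 1, "T2": 4}]
and_tags = group_tags(group_a, "AND")
or_tags = group_tags(group_a, "OR")
```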
<p ALIGN="left">Tags may also be included in the table on the basis of intergroup
expression differences. In this case, three schemes are available: OR, to include all
tags; XOR, to include tags expressed in one group but not in the second group; AND, to
include only tags expressed in both groups. A minimum difference threshold for use between
groups may be defined. When the tabular display parameters are set, a click on the <b>Results
</b>button displays the SAGE expression table.</p>
<p ALIGN="left">For each SAGE tag, the expression table includes columns giving the
associated UniGene cluster ID and the cluster description. A column summarizing the
expression of each tag in groups A and B is colored red if the level is higher in group A
and green if the level is higher in group B. The definition of &quot;higher&quot; depends
on the difference threshold specified by the user.</p>
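<p align="left">The between-group schemes and the red/green coloring can be sketched as follows. This Python example rests on assumptions that the text does not spell out: each group is summarized as one expression level per tag, "expressed" means a level above zero, and a tag counts as "higher" in one group when the difference exceeds the threshold. The function name compare_groups and the values are invented.</p>

```python
def compare_groups(expr_a, expr_b, scheme="OR", min_diff=0):
    """Return {tag: color} for tags selected by the intergroup scheme.

    expr_a, expr_b: {tag: expression level} summaries for groups A and B.
    color is "red" (higher in A), "green" (higher in B), or None.
    """
    in_a = {t for t, v in expr_a.items() if v > 0}
    in_b = {t for t, v in expr_b.items() if v > 0}
    if scheme == "OR":
        tags = in_a | in_b   # all tags
    elif scheme == "XOR":
        tags = in_a ^ in_b   # expressed in exactly one group
    elif scheme == "AND":
        tags = in_a & in_b   # expressed in both groups
    else:
        raise ValueError("scheme must be 'OR', 'XOR', or 'AND'")

    rows = {}
    for tag in sorted(tags):
        diff = expr_a.get(tag, 0) - expr_b.get(tag, 0)
        if diff > min_diff:
            rows[tag] = "red"
        elif -diff > min_diff:
            rows[tag] = "green"
        else:
            rows[tag] = None
    return rows


group_a = {"T1": 10, "T2": 0, "T3": 3}
group_b = {"T1": 2, "T2": 6, "T3": 3}
table = compare_groups(group_a, group_b, scheme="OR", min_diff=2)
```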
<p align="left">SAGEmap may be reached from the NCBI home page, the Human Genome Resources
page, or directly at <a href="http://www.ncbi.nlm.nih.gov/SAGE">http://www.ncbi.nlm.nih.gov/SAGE</a>.
</font></p>
<p ALIGN="left"><a href="#toc">Return to Table of Contents</a></p>
<hr>
<h3><a name="EST Submissions"><font SIZE="5">Mandatory Protocol for EST Submissions to
Take Effect Soon</font></a></h3>
<p ALIGN="left"><font SIZE="5">T</font><font SIZE="3">o facilitate the submission of EST
sequences, which are usually submitted in large batches, GenBank has offered a streamlined
EST submission procedure using a specialized data format. Beginning May 31, 1999, the use
of this specialized format will become mandatory for all EST submissions. EST submissions
made with either BankIt or Sequin will no longer be accepted.</p>
<p ALIGN="left">Expressed Sequence Tags (ESTs) are short (300 to 500 bp) single reads from
cDNA, the DNA complementary to mRNA, and are usually produced in large numbers. ESTs are useful
in providing a snapshot of the mRNA population characteristic of a given tissue or of a
given tissue at a particular developmental stage. EST sequences now represent
approximately 70% of the sequences in GenBank and constitute the most rapidly growing
GenBank division.</p>
<p ALIGN="left">For instructions on using GenBank&#146;s specialized EST submission
format, see <a href="http://www.ncbi.nlm.nih.gov/dbEST/how_to_submit.html">http://www.ncbi.nlm.nih.gov/dbEST/how_to_submit.html</a>.
</p>
<p align="left">Completed EST submissions should be mailed to batch-sub@ncbi.nlm.nih.gov.
After May 31, GenBank will no longer accept EST submissions at the gb-sub@ncbi.nlm.nih.gov
address.</font></p>
<p><a href="#toc">Return to Table of Contents</a></p>
<hr>
<table border="1" width="100%" bgcolor="#0080C0"
style="border-left: medium none rgb(255,255,255); border-right: medium none; border-top: medium none; border-bottom: medium none"
height="54">
<tr>
<td width="100%" height="48"><p align="center"><font SIZE="5"><font color="#FFFFFF"><a
name="Profile"><strong>PROFILE</strong></a></font>&nbsp; </font></td>
</tr>
</table>
<h3 align="left"></font><font
SIZE="5">PSI-BLAST's Impact Is &quot;High Profile&quot;</font><font SIZE="3"></h3>
<p align="left"><font SIZE="5">C</font><font SIZE="3">omparison, whether of morphological
features or protein sequences, lies at the heart of biology. The introduction of BLAST in
1990 made it easier to rapidly scan huge sequence databases for overt homologies and
statistically evaluate the resulting matches. With more than 8,700 citations to date, the
paper describing the original algorithm</font><a href="#Altschul"><font face="Times"
size="1"><sup>1</sup></font></a><font SIZE="3"> has since become the most heavily cited of
the decade.</font><a href="#Russo"><sup><font SIZE="1">2</font></sup></a></p>
<font SIZE="3">
<p ALIGN="left">Not all significant homologies are overt, however. Some of the most
interesting are subtle and do not rise to statistical significance during a standard BLAST
search. NCBI&#146;s Stephen Altschul has extended BLAST and its statistical methodology to
address the problem of detecting weak but significant sequence similarities. With a small
group of other NCBI researchers, he has developed <a
href="http://www.ncbi.nlm.nih.gov/cgi-bin/BLAST/nph-psi_blast">Position-Specific Iterated
BLAST</a> (PSI-BLAST), which searches sequence databases with a profile constructed from
BLAST alignments. </p>
<p ALIGN="left">As Altschul notes, &quot;PSI-BLAST is by no means the first program for
searching sequence databases with protein profiles. However, it is the first program fully
to automate the construction of these profiles from the output of a standard database
search, and the first to apply fast heuristic search techniques to profile comparison. As
such, it has for the first time rendered the powerful profile-search methodology
accessible to the non-expert.&quot; </p>
<p ALIGN="left">Indeed, PSI-BLAST has proven to be popular. Among scientific articles
published in the past 2 years, the 1997 paper describing PSI-BLAST</font><a
href="#Altschul2"><font SIZE="1"><sup>3</sup></font></a><font SIZE="3"> has been the most
heavily cited for the fourth 2-month interval in a row, with over 700 citations to date,
according to statistics compiled by the Institute for Scientific Information. Due to this
strong citation record, PSI-BLAST and its development team were featured in the April 12,
1999, issue of <i>The Scientist</i>.</font><a href="#Russo"><sup><font SIZE="1">2</font></sup></a><font
SIZE="3"></p>
<p ALIGN="left">Just as users of PSI-BLAST are a diverse group, so are its developers,
with backgrounds spanning genetics, physics, and medicine.</p>
<p ALIGN="left">Stephen Altschul, Ph.D., leads the PSI-BLAST development team. He received
his Ph.D. in mathematics from MIT in 1987 and joined the Computational Biology Branch of
NCBI in 1989, shortly after its creation. His research continues to center on measures,
algorithms, and statistics for the comparison of DNA and protein sequences.</p>
<p ALIGN="left">Alejandro Sch&auml;ffer, Ph.D., wrote most of the PSI-BLAST source code. His
current interests include development of software for genetics, including genetic linkage
analysis, sequence analysis, and modeling genetic changes in tumor progression.</p>
<p ALIGN="left">Tom Madden, Ph.D., is the main BLAST programmer at NCBI. He received a
doctorate in physics from the University of California at Santa Cruz. After performing
postdoctoral work in biophysics at Brandeis University, he joined NCBI in 1993, and has
since been involved in all BLAST development.</p>
<p ALIGN="left">David Lipman, M.D., has been a driving force behind improved database
searches for more than a decade. In 1988, he and William Pearson authored a seminal paper
describing a predecessor to the BLAST search, the well-known FASTA search.</font><a
href="#Pearson"><font SIZE="1"><sup>4</sup></font></a><font SIZE="3"> Since his
appointment as Director of NCBI in 1989, he has continued to take a lead role in
development of BLAST and its specialized variants.</p>
<p ALIGN="left">A tutorial based on a recent PSI-BLAST introductory article has been
prepared by Altschul.</font><a href="#Altschul3"><font SIZE="1"><sup>5</sup></font></a><font
SIZE="3"> It can be accessed via the <a
href="http://www.ncbi.nlm.nih.gov/BLAST/tutorial/Altschul-1.html">BLAST Course</a><b> </b>link
on the main BLAST page. </p>
<p ALIGN="left"></font><img src="profile.jpg" alt="NCBI05.jpg (136288 bytes)" width="294"
height="320" vspace="5"><br>
<font size="2" face="Helvetica"><i>(clockwise from left) Tom Madden, David Lipman, <br>
Stephen Altschul, and Alejandro Sch&auml;ffer.</i></font><font SIZE="3"></p>
<p align="left"><b>Notes</b></font><font SIZE="2"></p>
<p ALIGN="left">1. <a name="Altschul">Altschul</a>, SF, et al. <a
href="http://www.ncbi.nlm.nih.gov/htbin-post/Entrez/query?uid=2231712&amp;form=6&amp;db=m&amp;Dopt=b">Basic
local alignment search tool</a>. <i>J Mol Biol</i> 215(3):403&#150;10, 1990.</p>
<p ALIGN="left">2. <a name="Russo">Russo</a>, E, and S Bunk, eds. Hot papers. <i>The
Scientist </i>13(8):15, 1999.</p>
<p>3. <a name="Altschul2">Altschul</a>, SF, et al. <a
href="http://www.ncbi.nlm.nih.gov/htbin-post/Entrez/query?uid=9254694&amp;form=6&amp;db=m&amp;Dopt=b">Gapped
BLAST and PSI-BLAST: A new generation of protein database search programs</a>. <i>Nucleic
Acids Res</i> 25(17):3389&#150;402, 1997.</p>
<p ALIGN="left">4. <a name="Pearson">Pearson</a>, WR, and DJ Lipman. <a
href="http://www.ncbi.nlm.nih.gov/htbin-post/Entrez/query?uid=3162770&amp;form=6&amp;db=m&amp;Dopt=b">Improved
tools for biological sequence comparison</a>. <i>Proc Natl Acad Sci USA</i>
85(8):2444&#150;8, 1988.</p>
<p align="left">5. <a name="Altschul3">Altschul</a>, SF, and EV Koonin. <a
href="http://www.ncbi.nlm.nih.gov/htbin-post/Entrez/query?uid=9852764&amp;form=6&amp;db=m&amp;Dopt=b">Iterated
profile searches with PSI-BLAST&#151;a tool for discovery in protein databases</a>. <i>Trends
Biochem Sci</i> 23(11):444&#150;7, 1998.&nbsp;&nbsp;</font></p>
<p align="left"><a href="#toc">Return to Table of Contents</a></p>
<hr>
<p ALIGN="left">&nbsp;<img src="blast_lab.gif" alt="blast_lab.gif (22614 bytes)"
width="700" height="61"></p>
<h3><a name="Lab"><font color="#000000">How to Write and Load PSI-BLAST Checkpoint Files
and Inspect the PSSMs</font></a></h3>
<p ALIGN="left"><font size="2" face="Helvetica"><a
href="http://www.ncbi.nlm.nih.gov/cgi-bin/BLAST/nph-psi_blast">PSI-BLAST</a> uses the
pairwise alignments it creates between query and database sequences during a preliminary
BLAST search to construct a Position-Specific Scoring Matrix, or PSSM.<a href="#Altschul4"><sup>1</sup></a>
This PSSM is then used in place of the usual combination of query and scoring matrix
during the first PSI-BLAST iteration. The PSSM is updated at the beginning of each
subsequent PSI-BLAST iteration using the data from all significant alignments generated
during the previous iteration. Stand-alone PSI-BLAST can write a &quot;checkpoint&quot;
file at the end of a search; this file contains the information necessary to regenerate
the last PSSM used. PSI-BLAST can also load a previously stored checkpoint file to conduct
a search using a previously computed PSSM. Because the PSSM encapsulates information
derived from PSI-BLAST alignments, the inspection of this matrix may provide insights into
the sequence patterns present within the query. </p>
<p align="left">This BLAST Lab describes how to write and load checkpoint files using the
stand-alone version of PSI-BLAST. Instructions are also given for producing an ASCII
version of the PSI-BLAST PSSM.</font></p>
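<p align="left">To make the idea of a position-specific scoring matrix concrete, the Python sketch below computes column-by-column log-odds scores from a small gapless alignment. It is a deliberately simplified stand-in, not PSI-BLAST's procedure: it assumes a toy four-letter alphabet, uniform background frequencies, and a fixed pseudocount, whereas PSI-BLAST weights the aligned sequences and uses data-dependent pseudocounts. The function name simple_pssm is hypothetical.</p>

```python
import math


def simple_pssm(aligned_seqs, background, pseudocount=1.0):
    """Log-odds score per residue for each column of a gapless alignment."""
    alphabet = sorted(background)
    length = len(aligned_seqs[0])
    pssm = []
    for pos in range(length):
        column = [seq[pos] for seq in aligned_seqs]
        total = len(column) + pseudocount * len(alphabet)
        scores = {}
        for residue in alphabet:
            # Observed frequency, smoothed so unseen residues get a finite score.
            freq = (column.count(residue) + pseudocount) / total
            scores[residue] = math.log2(freq / background[residue])
        pssm.append(scores)
    return pssm


# Toy alignment over an invented 4-letter alphabet with uniform background.
background = {"A": 0.25, "C": 0.25, "G": 0.25, "T": 0.25}
pssm = simple_pssm(["AC", "AC", "AG", "AC"], background)
```

<p align="left">Residues that are common in a column receive positive scores and rare ones negative scores, which is what allows a matrix of this kind to stand in for the usual combination of query and scoring matrix.</p>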
<hr color="#0080C0" noshade>
<p><font face="Helvetica" size="2" color="#0080C0"><strong>Installing Stand-alone
PSI-BLAST&nbsp; </strong></font></p>
<p>Packages for installing stand-alone PSI-BLAST on a variety of computer platforms are
found at <a href="ftp://ncbi.nlm.nih.gov/blast/executables/">ftp://ncbi.nlm.nih.gov/blast/executables/</a>. The PC self-extracting archive is
&quot;blastz.exe.&quot; Unix packages are those ending in &quot;.Z&quot;.</p>
<font SIZE="3">
<p>See BLAST Lab in the <a
href="http://www.ncbi.nlm.nih.gov/Web/Newsltr/Winter99/winter99.htm">Winter 1999 <i>NCBI
News</i></a> for details on archive extraction.</font></p>
<hr color="#0080C0" noshade>
<p><font face="Helvetica" size="2" color="#0080C0"><strong>Writing a Checkpoint File</strong></font></p>
<p>The command line below will run PSI-BLAST using the sequence query contained in a file
named &quot;tf1.fsa&quot;. </p>
<font SIZE="3">
<p ALIGN="JUSTIFY">blastpgp -i tf1.fsa -o tf1.out -d ecoli -j 2 -C tf1.chk</p>
<p>The output will be written to &quot;tf1.out&quot;. The local database to be searched is
named &quot;ecoli&quot;. The switch &quot;-j 2&quot; instructs PSI-BLAST to run one
regular BLAST round followed by one PSI-BLAST iteration. The syntax &quot;-C tf1.chk&quot;
(&quot;C&quot; stands for &quot;checkpoint&quot;) specifies that a checkpoint file called
&quot;tf1.chk&quot; be saved after the search. This checkpoint file can be loaded later to
perform a search of a different database as described below.</font></p>
<hr color="#0080C0" noshade>
<p><font face="Helvetica" size="2" color="#0080C0"><strong>Loading a Checkpoint File</strong></font><font
SIZE="3"></p>
<p>PSI-BLAST can use a previously stored checkpoint file to reconstruct a PSSM generated
by PSI-BLAST during a previous search. This feature enables you to search two separate
databases with the same query and PSSM. In the example above, a checkpoint file,
&quot;tf1.chk&quot;, was saved after a PSI-BLAST search of the local database
&quot;ecoli&quot; using the query contained in &quot;tf1.fsa&quot;. A search of a second
local database, named &quot;saureusi&quot;, can be conducted with the same query and PSSM
by using the command line below. </p>
<p>blastpgp -i tf1.fsa -o tf1.out -d saureusi -j 2 -R tf1.chk</p>
<p>In this case, the checkpoint file, &quot;tf1.chk&quot;, is read using the
&quot;-R&quot; switch (&quot;R&quot; stands for &quot;restart&quot;). When using the
&quot;-R&quot; switch, the query sequence must be identical to that which was used to
generate the checkpoint file to be read.</font></p>
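<p>Because the query must match the one used to create the checkpoint, it can be worth recording a fingerprint of the query when the checkpoint is written and checking it before a restart. The Python sketch below is one way to do that and is not part of BLAST. It assumes that only the residues need to match, so it ignores the FASTA header line, line breaks, and letter case; the function names and sequences are invented.</p>

```python
import hashlib


def sequence_digest(fasta_text):
    """Hash only the residues of a single FASTA record."""
    residues = "".join(
        line.strip()
        for line in fasta_text.splitlines()
        if line and not line.startswith(">")
    )
    return hashlib.sha256(residues.upper().encode("ascii")).hexdigest()


def same_query(fasta_a, fasta_b):
    """True if two FASTA records contain the same residue sequence."""
    return sequence_digest(fasta_a) == sequence_digest(fasta_b)


# Made-up protein records: same residues, different header and line wrapping.
original = ">tf1 example\nMKTAYIAK\nQRQISFVK\n"
rewrapped = ">renamed\nmktayiakqrqisfvk\n"
altered = ">tf1 example\nMKTAYIAK\nQRQISFVR\n"
```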
<hr color="#0080C0" noshade>
<p><font face="Helvetica" size="2" color="#0080C0"><strong>Inspecting a Stored PSSM</strong></font></p>
<p>ASCII versions of the PSI-BLAST PSSM may be written to a file using the new command
line switch &quot;-Q&quot; followed by the name to be given to the file. A syntax such as<br>
<br>
blastpgp -i tf1.fsa -j 3 -Q pssm.dat<br>
<br>
will save the PSSM used in the last PSI-BLAST iteration in an ASCII file named
&quot;pssm.dat&quot;.</p>
<font SIZE="3">
<p ALIGN="JUSTIFY">The syntax below will read a checkpoint file called
&quot;tf1.chk&quot;, run one iteration of PSI-BLAST, and save, in the file
&quot;pssm.dat&quot;, an ASCII version of the PSSM used.<br>
<br>
blastpgp -i tf1.fsa -R tf1.chk -Q pssm.dat</p>
<p ALIGN="JUSTIFY">To use the new &quot;-Q&quot; switch with the blastpgp program,
download the latest stand-alone BLAST distribution (version 2.0.9).</p>
<p><b>Note</b></font><font SIZE="2"></p>
<p>1. <a name="Altschul4">Altschul</a> SF, TL Madden, AA Sch&auml;ffer, J Zhang, Z Zhang, W
Miller, and DJ Lipman. <a
href="http://www.ncbi.nlm.nih.gov/htbin-post/Entrez/query?uid=9254694&amp;form=6&amp;db=m&amp;Dopt=b">Gapped
BLAST and PSI-BLAST: A new generation of protein database search programs</a>.<i> Nucleic
Acids Res</i> 25(17):3389&#150;402, 1997.</font> </p>
<table border="1" width="50%" style="border: 1px solid rgb(0,0,255)" bordercolor="#2154CB"
cellpadding="0" cellspacing="0" bgcolor="#CCACD2">
<tr>
<td width="100%" bgcolor="#BBD2FD" style="border-bottom: medium none"><img
src="blast_standard.jpg" alt="blast_standard.jpg (46679 bytes)" width="494" height="42"></td>
</tr>
<tr>
<td width="100%" bgcolor="#BBD2FD"
style="border-left: 1px solid rgb(0,0,255); border-right: 1px solid rgb(0,0,255); border-top: 1px none rgb(0,0,255); border-bottom: 1px solid rgb(0,0,255); margin-left: 0; margin-right: 0; padding-left: 11px; padding-right: 11px"><i>The
BLAST Lab feature is intended to provide detailed technical information on some of the
more specialized uses of the BLAST family of programs. Topics are selected from the range
of questions received by the BLAST Help Group.</i></td>
</tr>
</table>
<p><a href="#toc">Return to Table of Contents</a></p>
<hr>
<table border="0" width="100%" cellpadding="12">
<tr>
<td width="29%" style="border:none"><img src="qanda.gif" alt="qanda.gif (1368 bytes)"
hspace="17" vspace width="117" height="156"></td>
<td width="71%" style="border: medium none"></font>&nbsp;<p>&nbsp;</p>
<h2><a name="FAQ"><font color="#0080C0" size="5">Frequently Asked Questions</font></a></h2>
</td>
</tr>
<tr>
<td width="100%" style="border: medium none" colspan="2"><hr color="#0080C0" noshade
size="3">
</td>
</tr>
<tr>
<td width="29%" style="border: medium none"><font SIZE="3"><i>Is there any way to
determine the total number of organisms currently represented in GenBank?</i></font></td>
<td width="71%" style="border: medium none"><font SIZE="3">Yes, you can obtain a monthly
updated count of the number of organisms represented in GenBank at <a
href="http://www.ncbi.nlm.nih.gov/Taxonomy/Taxresources/taxaJan0197.html">http://www.ncbi.nlm.nih.gov/Taxonomy/Taxresources/taxaJan0197.html</a>.
This page presents a table giving the total GenBank species count for each year since
1995. The table also gives yearly subtotals for Viruses, Bacteria, Archaea, and Eukaryota.</font></td>
</tr>
<tr>
<td width="29%" style="border: medium none"><font SIZE="3"><i>How can I determine which
portions of my nucleic acid sequence will be considered to be of low complexity during a
filtered blastn search?</i></font></td>
<td width="71%" style="border: medium none"><font SIZE="3">Blastn uses a program called
DUST to filter nucleic acid sequences for low complexity. DUST is available as source code
and in the form of Unix binaries at <a
href="ftp://ncbi.nlm.nih.gov/pub/tatusov/dust/version1/">ftp://ncbi.nlm.nih.gov/pub/tatusov/dust/version1/</a>.
To see the results of filtering for a FASTA-formatted nucleic acid sequence in a file
called, for example, &quot;nuc.fsa&quot;, execute DUST as follows: dust nuc.fsa.</font></td>
</tr>
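<tr>
<td width="100%" style="border: medium none" colspan="2"><font SIZE="3">
<p>Once filtered output is in hand, the masked portions can be listed by comparing it with
the original sequence. The Python sketch below assumes that the filtered output is a FASTA
record of the same length in which masked residues are replaced by &quot;N&quot;; that
output format is an assumption here and should be checked against the DUST version in use.
The function names are hypothetical, and the example sequences are made up for
illustration, not actual DUST output.</p>

```python
def read_fasta_sequence(text):
    """Return the residues of the first FASTA record in text, uppercased."""
    residues = []
    for line in text.splitlines():
        line = line.strip()
        if line.startswith(">"):
            if residues:
                break          # a second record starts here; stop reading
            continue           # skip the definition line
        if line:
            residues.append(line.upper())
    return "".join(residues)


def masked_intervals(original, masked, mask_char="N"):
    """List 1-based inclusive (start, end) ranges masked only in `masked`.

    Assumes masked residues are written as mask_char (an assumption about
    the filter's output format).
    """
    if len(original) != len(masked):
        raise ValueError("sequences differ in length")
    intervals = []
    start = None
    for pos, (a, b) in enumerate(zip(original, masked), start=1):
        is_masked = b == mask_char and a != mask_char
        if is_masked and start is None:
            start = pos
        elif not is_masked and start is not None:
            intervals.append((start, pos - 1))
            start = None
    if start is not None:
        intervals.append((start, len(masked)))
    return intervals


# Made-up example data: a run of nine A residues shown as masked.
original = read_fasta_sequence(">nuc\n" + "ACGT" + "A" * 9 + "CGT" + "\n")
masked = read_fasta_sequence(">nuc\n" + "ACGT" + "N" * 9 + "CGT" + "\n")
print(masked_intervals(original, masked))  # prints [(5, 13)]
```
</font></td>
</tr>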
<tr>
<td width="29%" style="vertical-align: text-top; border: medium none"><font SIZE="3"><i>I
am looking for GeneMap &#146;98, but I no longer see this as a link on the NCBI home page.
Does NCBI have a search engine I can use to find it?</i></font></td>
<td width="71%" style="border: medium none"><font SIZE="3">GeneMap &#146;98 is now
accessible via a link from the new Human Genome Resources page. You can also use
NCBI&#146;s new search engine to find it by entering GeneMap into the search box and
pressing the Search button. A link directly to GeneMap &#146;98 will appear as the first
hit. The NCBI search engine is reached through a link on the NCBI home page.</font></td>
</tr>
<tr>
<td width="29%" style="vertical-align: text-top; border: medium none"><font SIZE="3"><i>I
am interested in an old GenBank record that does not show a CDS feature. How can I
determine where a coding region might be?</i></font></td>
<td width="71%" style="border: medium none"><font SIZE="3" COLOR="#000000">You can do this
easily by pasting the sequence into NCBI&#146;s ORF Finder or by simply specifying a
GenBank accession or gi number. ORF Finder will search for Open Reading Frames over the
entire sequence or over a range of nucleotides within the sequence using any of 15 genetic
codes. A link to ORF Finder is found on the NCBI home page.</font></td>
</tr>
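<tr>
<td width="100%" style="border: medium none" colspan="2"><font SIZE="3">
<p>To illustrate the idea of an open reading frame search, the Python sketch below scans
all six reading frames for stretches that begin with ATG and end at the first in-frame
stop codon. It is a simplified illustration and not the method used by ORF Finder: it
handles only the standard genetic code, and the function names and the
&quot;min_codons&quot; threshold are hypothetical choices made for this example.</p>

```python
STOP_CODONS = {"TAA", "TAG", "TGA"}          # standard genetic code only
COMPLEMENT = str.maketrans("ACGT", "TGCA")


def reverse_complement(seq):
    """Return the reverse complement of an uppercase DNA string."""
    return seq.translate(COMPLEMENT)[::-1]


def find_orfs(seq, min_codons=30):
    """Return (strand, start, end) tuples for ATG-to-stop ORFs.

    Coordinates are 1-based, inclusive, and given on the input sequence;
    the stop codon is included in the range. min_codons counts coding
    codons (the ATG included, the stop excluded).
    """
    seq = seq.upper()
    n = len(seq)
    orfs = []
    for strand, s in (("+", seq), ("-", reverse_complement(seq))):
        for frame in range(3):
            start = None
            for i in range(frame, n - 2, 3):
                codon = s[i:i + 3]
                if start is None and codon == "ATG":
                    start = i                    # first ATG opens the ORF
                elif start is not None and codon in STOP_CODONS:
                    if (i - start) // 3 >= min_codons:
                        lo, hi = start + 1, i + 3
                        if strand == "-":
                            # map back to coordinates on the input strand
                            lo, hi = n - (i + 3) + 1, n - start
                        orfs.append((strand, lo, hi))
                    start = None                 # look for the next ORF
    return orfs


# Made-up toy sequence: ATG, three GCT codons, then a TAA stop.
toy = "CC" + "ATG" + "GCT" * 3 + "TAA" + "G"
print(find_orfs(toy, min_codons=4))
```

<p>Restricting the search to a range of nucleotides, as ORF Finder allows, would
correspond to passing a slice of the sequence and adding the slice offset to the reported
coordinates.</p>
</font></td>
</tr>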
<tr>
<td width="29%" style="vertical-align: text-top; border: medium none"><font SIZE="3"><i>On
your FTP server, I have only seen files containing the entire EST division. Can I download
the three subdivisions of human, murine, and other ESTs separately?</i></font></td>
<td width="71%" style="border: medium none"><font SIZE="3">Yes, you can download the raw
sequence data in this way. The EST data sets are available separated by organism, but only
in FASTA format and not as full GenBank records. These FASTA sequence files are available
at <a href="ftp://ncbi.nlm.nih.gov/blast/db/">ftp://ncbi.nlm.nih.gov/blast/db/</a>. From
the NCBI home page, select <b>Anonymous FTP,</b> then <b>BLAST,</b> and then the <b>db </b>subdirectory.
You will see est_human.z, est_mouse.z, and est_others.z. </font></td>
</tr>
</table>
<p><a href="#toc">Return to Table of Contents</a></p>
<hr>
<font FACE="Times" SIZE="5">
<p ALIGN="left"></font><font SIZE="5"><a name="Recent Pubs">Selected Recent Publications
by NCBI Staff</a></font><b></p>
<p ALIGN="left">Aravind, L, DR Walker,</b> and <b>EV Koonin.</b> <a
href="http://www.ncbi.nlm.nih.gov/htbin-post/Entrez/query?uid=9973609&amp;form=6&amp;db=m&amp;Dopt=b">Conserved
domains in DNA repair proteins and evolution of repair systems</a>. <i>Nucleic Acids Res</i>
27(5):1223&#150;42, 1999.</p>
<b>
<p ALIGN="left">Aravind, L,</b> VM Dixit, and <b>EV Koonin. </b><a
href="http://www.ncbi.nlm.nih.gov/htbin-post/Entrez/query?uid=10098397&amp;form=6&amp;db=m&amp;Dopt=b">The
domains of death: evolution of the apoptosis machinery</a>. <i>Trends Biochem Sci</i>
24(2):47&#150;53, 1999.</p>
<p ALIGN="left">Desper, R, F Jiang, OP Kallioniemi, H Moch, CH Papadimitriou, and <b>AA
Sch&auml;ffer. </b><a
href="http://www.ncbi.nlm.nih.gov/htbin-post/Entrez/query?uid=10223663&amp;form=6&amp;db=m&amp;Dopt=b">Inferring
tree models for oncogenesis from comparative genome hybridization data</a>. <i>J Comput
Biol</i> 6(1):37&#150;51, 1999.</p>
<b>
<p ALIGN="left">Galperin, MY, </b>and D Frishman. Toward automated prediction of protein
function from microbial genomic sequences. <i>Methods in Microbiology </i>28:245&#150;63.
London: Academic Press, 1999.</p>
<b>
<p ALIGN="left">Kuehl, PM,</b> <b>JM Weisemann,</b> JW Touchman, ED Green, and <b>MS
Boguski. </b><a
href="http://www.ncbi.nlm.nih.gov/htbin-post/Entrez/query?uid=10022984&amp;form=6&amp;db=m&amp;Dopt=b">An
effective approach for analyzing &quot;Prefinished&quot; genomic sequence data</a>. <i>Genome
Res</i> 9(2):189&#150;94, 1999.</p>
<p ALIGN="left">Menotti-Raymond, M, VA David, LA Lyons, <b>AA Sch&auml;ffer, </b>JF Tomlin, MK
Hutton, and SJ O&#146;Brien. <a
href="http://www.ncbi.nlm.nih.gov/htbin-post/Entrez/query?uid=10191079&amp;form=6&amp;db=m&amp;Dopt=b">A
genetic linkage map of microsatellites in the domestic cat</a>. <i>Genomics </i>57(1):9&#150;23,
1999. </p>
<p ALIGN="left">Pesole, G, S Liuni, G Grillo, M Ippedico, A Larizza, <b>W Makalowski, </b>and
C Saccone. <a
href="http://www.ncbi.nlm.nih.gov/htbin-post/Entrez/query?uid=9847176&amp;form=6&amp;db=m&amp;Dopt=b">UTRdb:
a specialized database of 5&prime; and 3&prime;
untranslated regions of eukaryotic mRNAs</a>. <i>Nucleic Acids Res</i> 27(1):188&#150;91,
1999. </p>
<b>
<p ALIGN="left">Sch&auml;ffer, AA.</b> <a
href="http://www.ncbi.nlm.nih.gov/htbin-post/Entrez/query?uid=10030397&amp;form=6&amp;db=m&amp;Dopt=b">Computing
probabilities of homozygosity by descent</a>. <i>Genet Epidemiol</i> 16(2):135&#150;49,
1999.</p>
<b>
<p align="left">Wolf, YI,</b> SE Brenner, PA Bash, and <b>EV Koonin.</b> <a
href="http://www.ncbi.nlm.nih.gov/htbin-post/Entrez/query?uid=9927481&amp;form=6&amp;db=m&amp;Dopt=b">Distribution
of protein folds in the three superkingdoms of life</a>. <i>Genome Res</i>
9(1):17&#150;26, 1999.</p>
<p><a href="#toc">Return to Table of Contents</a></p>
<hr>
<h3 align="left"><font SIZE="5"><a name="Masthead">Masthead</a></font><font SIZE="3"></h3>
<p></font><i>NCBI News</i> is distributed four times a year. Beginning in 1999, issues are
dated Winter, Spring, Summer, and Fall. We welcome communication from users of NCBI
databases and software and invite suggestions for articles in future issues. Send
correspondence to <i>NCBI News</i> at the address below.</p>
<p ALIGN="left">NCBI News<br>
National Library of Medicine<br>
Bldg. 38A, Room 8N-803<br>
8600 Rockville Pike<br>
Bethesda, MD 20894<br>
Phone: (301) 496-2475<br>
Fax: (301) 480-9241<br>
E-mail: <a href="mailto:info@ncbi.nlm.nih.gov">info@ncbi.nlm.nih.gov</a></p>
<i>
<p></i><em>Editors</em><br>
Dennis Benson<br>
Barbara Rapp</p>
<i>
<p>Contributors</i><br>
Stephen Altschul&nbsp;&nbsp;&nbsp;&nbsp;&nbsp; Donna Maglott<br>
Jonathan Kans&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp; Kim Pruitt<br>
Alex Lash
&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp; Greg
Schuler</p>
<i>
<p>Writer<br>
</i>David Wheeler</p>
<i>
<p>Editing, Graphics, and Production<br>
</i>Marla Fogelman<br>
Veronica Johnson<br>
Jennifer Vyskocil</p>
<i>
<p>Photography</i><br>
Fran Beckwith</p>
<i>
<p>Design Consultant<br>
</i>Troy M. Hill</p>
<p ALIGN="left">In 1988, Congress established the National Center for Biotechnology
Information as part of the National Library of Medicine; its charge is to create
information systems for molecular biology and genetics data and perform research in
computational molecular biology.</p>
<p ALIGN="left">The contents of this newsletter may be reprinted without permission. The
mention of trade names, commercial products, or organizations does not imply endorsement
by NCBI, NIH, or the U.S. Government. </p>
<p ALIGN="JUSTIFY">NIH Publication No. 99-3272<br>
<br>
ISSN 1060-8788 <br>
ISSN 1098-8408 (Online Version)<font SIZE="3"></p>
<p><a href="#toc">Return to Table of Contents</a> <font face="Times" size="1"></p>
</font></font>
</body>
</html>