<!doctype html public "-//IETF//DTD HTML//EN">
|
|
<HTML>
|
|
|
|
<HEAD>
|
|
|
|
<TITLE>August 1996</TITLE>
|
|
<body bgcolor="#f0f0f0">
|
|
<META NAME="GENERATOR" CONTENT="Internet Assistant for Word 1.0Z">
|
|
<META NAME="AUTHOR" CONTENT="The KEVRIC Company">
|
|
</HEAD>
|
|
|
|
<BODY>
|
|
|
|
<P>
|
|
<IMG SRC="newslogo.gif" ALIGN="BOTTOM">
|
|
<P>
|
|
August 1996<HR>
|
|
|
|
<P>
|
|
<A NAME="toc"></A><A HREF="#c3d">See in 3D: New Entrez Release</A>
|
|
|
|
<P>
|
|
<A HREF="#advance">UniGene Collection</A>
|
|
<P>
|
|
<A HREF="#discont">Entrez CD-ROM Discontinued</A>
|
|
<P>
|
|
<A HREF="#query">QUERY E-Mail Server</A>
|
|
<P>
|
|
<A HREF="#mouse">Human/Mouse Homology Map</A>
|
|
<P>
|
|
<A HREF="#images">Images in OMIM</A>
|
|
<P>
|
|
<A HREF="#genome">Genome Survey Sequences</A>
|
|
<P>
|
|
<A HREF="#sequin">Sequin Quick Guide</A>
|
|
<P>
|
|
<A HREF="#blast">New BLAST Services</A>
|
|
<P>
|
|
<A HREF="#faq">Frequently Asked Questions</A>
|
|
<P>
|
|
<A HREF="#FTP">NCBI Data by FTP</A>
|
|
<P>
|
|
<A HREF="#Pubs">Recent Publications</A><HR>
|
|
|
|
<H3><A NAME="c3d">See in 3D: New Entrez Release 5.0</A></H3>
|
|
|
|
<P>
|
|
Since September 1995, Network Entrez has included 3D structure
|
|
data, based on crystallographic and NMR structure determinations.
|
|
The structure data are contained in NCBI's Molecular Modeling
|
|
DataBase (MMDB), which is derived from the Brookhaven Protein
|
|
DataBank of more than 4,000 biomolecules. MMDB is also referred
|
|
to as the Structure division of Entrez.
|
|
<P>
|
|
With the release of Entrez 5.0 in July 1996, NCBI has added a
|
|
new built-in 3D-structure viewer called Cn3D ("See in 3D").
|
|
Cn3D allows one to visualize and rotate protein structure records
|
|
from Entrez. Structure data can provide a wealth of information
|
|
on the biological function and mechanism of action of macromolecules.
|
|
By fully integrating the structure database into Entrez, we hope
|
|
to make this information easily accessible to biologists.
|
|
<P>
|
|
<B>Searching for Structures </B>
|
|
<P>
|
|
Finding a structure in Entrez is just like any other Entrez search.
|
|
A query can contain specific fields such as author names or text
|
|
terms occurring anywhere in the structure description. In this
|
|
way you may check for structure data on a specific protein or
|
|
nucleic acid. For example, select the "structure" database
|
|
from Entrez's search page, enter a search term like "copper,"
|
|
then press the <B>Retrieve Documents</B> button to bring up the
|
|
list of 3D structure entries matching your query. To see the 3D
|
|
structure, double-click on the 3D icon of any record you want
|
|
to display.
|
|
<P>
|
|
A more powerful search approach, however, is to select the molecule
|
|
of interest in the sequence database, identify its sequence neighbors
|
|
(candidate homologues), and then, by linking to the structure
|
|
database, ask whether structure data are available for any of the
|
|
members of this family. The structure database is smaller than
|
|
the protein or nucleotide databases, but many sequenced proteins
|
|
have homologues in this set, and you may often learn more about
|
|
a protein by examining the 3D structure of its homologues.
|
|
<P>
|
|
<B>Using Cn3D From WWW Entrez</B>
|
|
<P>
|
|
WWW users will need to download and install the Network Entrez
|
|
client software and configure it as a helper application for their
|
|
WWW browser. When a 3D structure is requested from WWW Entrez,
|
|
the browser will automatically launch Cn3D.
|
|
<P>
|
|
Detailed instructions for installing the program, getting started,
|
|
and using the viewing features are provided on the Cn3D Web page
|
|
(http://www.ncbi.nlm.nih.gov/Structure/cn3d.html). If you installed
|
|
your own WWW browser and your Internet connection, you can probably
|
|
install Network Entrez without difficulty. For assistance, first
|
|
check with a systems administrator at your institution before
|
|
contacting NCBI.
|
|
<P>
|
|
<A NAME="cn3d"><IMG SRC="cn3d.gif" ALIGN="BOTTOM"></A>
|
|
<P>
|
|
<I>3D structure of human Sry-DNA complex (PDB accession: 1HRY)</I>
|
|
|
|
<P>
|
|
|
|
<B>Getting the Software</B>
|
|
<P>
|
|
Entrez 5.0 with Cn3D is available for many platforms, including
|
|
Mac, Windows, and UNIX. It can be downloaded from NCBI's FTP site
|
|
(ncbi.nlm.nih.gov) in the 'entrez/network' directory. For installation
|
|
instructions, be sure to download the README document, or see
|
|
the Entrez Overview section from WWW Entrez.
|
|
<P>
|
|
The current version, numbered 5.002, is still considered a "beta"
|
|
release. There will be a series of software updates throughout
|
|
the rest of the year, so check the FTP site periodically to make
|
|
sure you have the most up-to-date version. We are still refining
|
|
the program and welcome comments and suggestions (info@ncbi.nlm.nih.gov).
|
|
<P>
|
|
<A HREF="#toc">Return to Table of Contents</A><HR>
|
|
|
|
<H3><A NAME="advance">Advancing Genomic Research: The UniGene
|
|
Collection</A></H3>
|
|
|
|
<P>
|
|
The UniGene collection, now accessible through NCBI's Home Page,
|
|
contains more than 48,000 clusters of sequences, each representing
|
|
the transcription product of a distinct human gene. With current
|
|
estimates of 80,000 to 100,000 genes in the human genome, this
|
|
is close to the 50% mark. The clusters are largely based on EST
|
|
sequences, so most of the sequences are not complete and most
|
|
of the genes have still not been characterized. But one important
|
|
use of the UniGene clusters is to identify novel, nonredundant
|
|
mapping candidates for generating a transcript map that identifies
|
|
all coding sequences in the genome.
|
|
<P>
|
|
Although a primary goal of the Human Genome Project is to determine
|
|
the complete sequence of the 3 billion base pairs in the human
|
|
genome, only about 3% of the genome actually encodes protein,
|
|
and the biological significance of most of the sequence that will
|
|
be generated is not known. Therefore, a transcript, or expression,
|
|
map is a critical resource for charting the way.
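<P>
The figures quoted above can be checked with a few lines of arithmetic.
The sketch below is only an illustration: the variable and function names
are ours, and the inputs are simply the numbers given in this article.
<PRE>
```python
# Figures quoted in the article; the names are illustrative only.
unigene_clusters = 48_000
gene_estimate_low = 80_000
gene_estimate_high = 100_000
genome_base_pairs = 3_000_000_000
coding_percent = 3


def coverage_range(clusters, low, high):
    """Return (min, max) fraction of genes represented for an estimate range."""
    return clusters / high, clusters / low


low_cov, high_cov = coverage_range(
    unigene_clusters, gene_estimate_low, gene_estimate_high
)
# Integer arithmetic keeps the base-pair count exact.
coding_bp = genome_base_pairs * coding_percent // 100

print(f"coverage: {low_cov:.0%} to {high_cov:.0%}")
print(f"protein-coding base pairs: {coding_bp:,}")
```
</PRE>
<P>
With these inputs, the clusters cover between 48% and 60% of the estimated
gene count, and the protein-coding portion of the genome comes to about
90 million base pairs.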
|
|
<P>
|
|
Until a few years ago, GenBank contained sequences for only 3,000
|
|
unique human genes, and developing a transcript map did not seem
|
|
worthwhile based on such a small sample. But recent advancements
|
|
in EST technology and the increased public availability of EST
|
|
sequences have dramatically increased the numbers of genes in
|
|
GenBank, so that developing a dense transcript map is now feasible.
|
|
The Merck-funded EST project at Washington University alone has
|
|
produced 320,000 EST sequences so far, with new data being submitted
|
|
at the rate of 4,500 sequences per week. Mark Boguski, who leads
|
|
NCBI's EST database project, says, "The transcript map will
|
|
provide needed reality checks for the large-scale sequencing efforts
|
|
ahead," and adds that "the disease gene hunting community
|
|
has long had a desire to develop a transcript map."
|
|
<P>
|
|
<B>Organizing the UniGene Clusters</B>
|
|
<P>
|
|
When EST sequence data started rolling into GenBank by the thousands
|
|
earlier this year, NCBI's Greg Schuler began investigating ways
|
|
to use them to identify unique human genes. The problem was to
|
|
organize the data in such a way that all representations of a
|
|
single gene were collected in a single cluster.
|
|
<P>
|
|
As a comprehensive collection of publicly available sequence data,
|
|
GenBank is also a historical archive with a large degree of internal
|
|
redundancy. A sequence for the same gene may have been submitted
|
|
by multiple labs, and a given gene may have separate entries from
|
|
different types of sequence (e.g., contiguous and noncontiguous
|
|
genomic sequences, mRNA sequences with alternative splicing, and
|
|
EST sequences). For EST sequences, redundancy and overlap are
|
|
especially prevalent. This data redundancy makes it difficult
|
|
to identify unique markers for mapping; hence the need for the
|
|
UniGene project.
|
|
<P>
|
|
In the first phase of the UniGene project, Schuler screened all
|
|
ESTs against existing functionally cloned GenBank entries to eliminate
|
|
redundancies. He then developed techniques to screen the remaining
|
|
ESTs against each other to determine those likely to be derived
|
|
from the same gene. If sequences were found to share statistically
|
|
significant DNA sequence similarity in the 3' UTR, they were assigned
|
|
to the same cluster.
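<P>
The idea of grouping sequences that share similarity in the 3' UTR can be
sketched as single-linkage clustering. This is a toy illustration, not the
method used for UniGene: the similarity measure (shared 8-letter words), the
0.5 threshold, and all function names are assumptions made for the example,
whereas the article describes a test of statistically significant DNA
sequence similarity.
<PRE>
```python
from itertools import combinations


def shared_kmer_fraction(a, b, k=8):
    """Toy similarity: shared k-mers divided by the smaller k-mer set size."""
    kmers_a = {a[i:i + k] for i in range(len(a) - k + 1)}
    kmers_b = {b[i:i + k] for i in range(len(b) - k + 1)}
    if not kmers_a or not kmers_b:
        return 0.0
    shared = kmers_a.intersection(kmers_b)
    return len(shared) / min(len(kmers_a), len(kmers_b))


def cluster_sequences(utrs, threshold=0.5):
    """Single-linkage clustering of {name: sequence} using union-find."""
    parent = {name: name for name in utrs}

    def find(name):
        # Follow parent links to the cluster representative (path halving).
        while parent[name] != name:
            parent[name] = parent[parent[name]]
            name = parent[name]
        return name

    # Merge the clusters of every pair that passes the similarity test.
    for x, y in combinations(utrs, 2):
        if shared_kmer_fraction(utrs[x], utrs[y]) >= threshold:
            parent[find(x)] = find(y)

    clusters = {}
    for name in utrs:
        clusters.setdefault(find(name), []).append(name)
    return sorted(sorted(members) for members in clusters.values())


example = {
    "est1": "ACGTACGTTTGACCAGTAGGCTA",
    "est2": "CGTACGTTTGACCAGTAGGCTAAC",
    "est3": "TTTTGGGGCCCCAAAATTTTGGGG",
}
print(cluster_sequences(example))
```
</PRE>
<P>
In this made-up example, est1 and est2 overlap and fall into one cluster,
while est3 forms a cluster of its own. Comparing every pair is only
practical for small inputs; a collection of hundreds of thousands of ESTs
would need an indexed search instead.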
|
|
<P>
|
|
The first phase of the UniGene project resulted in a set of 3,125
|
|
nonredundant unique human 3' UTRs, referred to as the UniGene
|
|
set. The UniGene set serves as a source of mapping candidates
|
|
and as a standard to compare and screen new EST submissions. New
|
|
EST submissions that do not match any sequences in the UniGene
|
|
set are considered new human genes and are organized into unique
|
|
clusters to provide additional mapping candidates. To date, more
|
|
than 48,000 3'-anchored UniGene clusters have been generated.
|
|
Some clusters contain more than 1,000 ESTs, while others consist
|
|
of as few as 1 EST. As would be expected, the largest clusters
|
|
correspond to well-studied genes, such as the hemoglobin subunits
|
|
and the serum albumin precursor.
|
|
<P>
|
|
<B>Developing the Transcript Map: A Collaborative Effort</B>
|
|
<P>
|
|
Once the UniGene clusters were identified, there was an immediate
|
|
use for them in developing a comprehensive transcript map of
|
|
the human genome. The mapping project is a collaborative effort,
|
|
involving NCBI, several genome mapping centers, and the sequence
|
|
submissions of individual scientists. NCBI distributes nonoverlapping
|
|
cluster sets to the various mapping centers to ensure that redundancy
|
|
does not creep back into the databases and that duplication of
|
|
mapping effort is kept to the minimum necessary for data accuracy
|
|
checks and cross referencing. This collaborative effort has resulted
|
|
in the placement of 15,000-20,000 transcripts on RH and YAC maps.
|
|
<P>
|
|
<B>Using the UniGene Clusters</B>
|
|
<P>
|
|
Aside from their contribution to large-scale mapping efforts and
|
|
to basic research in genome organization, the UniGene collection
|
|
and subsequent transcript maps are an important resource for many
|
|
investigators. For example, of great interest to disease gene
|
|
hunters is that 82% of the positionally cloned genes that are
|
|
currently known to be mutated in human disease states are represented
|
|
by exact matches with one or more ESTs in GenBank. Gene hunters
|
|
can use the transcript maps to gain valuable clues to expected
|
|
gene location and density in an area of interest. UniGene clusters
|
|
are also being studied to find gene polymorphisms. And recently
|
|
developed techniques for assessing gene expression on a genomewide
|
|
scale (e.g., microarray expression systems) take advantage of
|
|
the abundance of unique EST sequences that can be readily retrieved
|
|
from GenBank.
|
|
<P>
|
|
The UniGene data set can be accessed through NCBI's WWW service
|
|
(http://www.ncbi.nlm.nih.gov). From the Home Page, scroll to "Other
|
|
NCBI Resources," and click on UniGene. The UniGene page displays
|
|
icons for each of the 23 chromosomes. To see a list of all the
|
|
UniGene clusters that have been identified for a given chromosome
|
|
and the sequences comprising the cluster, just click on the chromosome.
|
|
To search for clusters containing a specific word or phrase, enter
|
|
the search term in the text box at the top of the UniGene page.
|
|
<P>
|
|
UniGene is updated every 2 months, approximately 1 week after
|
|
a new GenBank release is produced. Files can be downloaded from
|
|
NCBI's FTP site in the 'repository/unigene' directory. No search
|
|
tools are provided other than the Web interface.
|
|
<P>
|
|
<A HREF="#toc">Return to Table of Contents</A><HR>
|
|
|
|
<H3><A NAME="discont">Entrez CD-ROM Discontinued </A></H3>
|
|
|
|
<P>
|
|
Users are reminded that effective August 15, 1996, NCBI is discontinuing
|
|
Entrez on CD-ROM. Two versions of Entrez are available free of
|
|
charge over the Internet. Network Entrez is a client/server program
|
|
that retains the look and feel of Entrez on CD-ROM. Client software
|
|
for PC/Windows, Macintosh, and several Unix workstations can be
|
|
downloaded by FTP from 'ncbi.nlm.nih.gov' in the 'entrez' directory.
|
|
There is also a World Wide Web version of Entrez, accessible from
|
|
NCBI's Home Page (http://www.ncbi.nlm.nih.gov). This version has
|
|
essentially the same functionality as Network Entrez, but with
|
|
a different search and display interface.
|
|
<P>
|
|
<A HREF="#toc">Return to Table of Contents</A><HR>
|
|
|
|
<H3><A NAME="query">QUERY: A New E-Mail Server for Entrez</A>
|
|
</H3>
|
|
|
|
<P>
|
|
NCBI now has an e-mail server specifically designed to do text-based
|
|
searches of the integrated Entrez database. As with the RETRIEVE
|
|
e-mail server that has been in place for several years, users
|
|
specify a data set to search, then the words or ID numbers to
|
|
be used in the search. However, the new server offers a choice
|
|
of output options and provides access to all the information from
|
|
the various databases that make up Entrez. Some of these data,
|
|
such as the molecular biology subset of MEDLINE and protein sequences
|
|
entered directly from the published literature, are not available
|
|
through the older RETRIEVE server.
|
|
<P>
|
|
QUERY uses the Entrez search engine, so important Entrez features,
|
|
such as viewing sequence neighbors or linking to associated information
|
|
such as MEDLINE abstracts, are now also available through an e-mail
|
|
search interface.
|
|
<P>
|
|
To use the QUERY server, send a formatted e-mail message to the
|
|
address: query@ncbi.nlm.nih.gov. Your search results will be returned
|
|
to you as an e-mail message.
|
|
<P>
|
|
To format a search, first specify the database (DB) to be searched:
|
|
<B>n</B> for nucleotide sequences, <B>p</B> for protein sequences,
|
|
<B>s</B> for both nucleotide and protein sequences, <B>t</B> for
|
|
3D structures, or <B>m</B> for the molecular biology subset of
|
|
MEDLINE.
|
|
<P>
|
|
Next, specify your search term, and indicate whether it is a unique
|
|
identifier for a record (UID) or a text term from elsewhere in
|
|
the record (TERM). UIDs include sequence database accession numbers,
|
|
sequence-specific GI numbers, and MEDLINE accession numbers. Search
|
|
terms can also be restricted to specific fields such as organism,
|
|
author, title, journal name, or date. In addition, you can combine
|
|
search terms with Boolean logic operators.
|
|
<P>
|
|
Finally, specify a particular output format if desired, and include
|
|
any other optional search specifications, such as the maximum
|
|
number of records to display. Display options include such formats
|
|
as FASTA or GenBank flat file, but also are used to specify that
|
|
you want to see related information such as sequence neighbors
|
|
or MEDLINE abstracts.
|
|
<P>
|
|
Some sample search queries are shown below. For more detailed
|
|
information on formatting searches and available search options,
|
|
review the QUERY server documentation. To obtain the documentation,
|
|
send the word HELP as your message to the server (query@ncbi.nlm.nih.gov).
|
|
<P>
|
|
Questions or comments about the QUERY server are welcome and
|
|
should be sent to the user support group at info@ncbi.nlm.nih.gov.
|
|
<P>
|
|
<B>Sample Searches for QUERY E-Mail Server </B>
|
|
<P>
|
|
<PRE>
DB n
UID U30150,U30153
DOPT f
</PRE>
<P>
* Retrieve the nucleotide database entries with accession numbers
U30150 and U30153, and display them in FASTA format.
<PRE>
DB m
UID 88055872
</PRE>
<P>
* Display the MEDLINE record 88055872 in the default format.
<PRE>
DB n
UID U30150
DOPT m
</PRE>
<P>
* Retrieve the nucleotide database entry with accession number
U30150, and display any related MEDLINE information.
<PRE>
DB p
TERM ras
</PRE>
<P>
* Search for the term "ras" in all fields of the protein
database, and display in the default format.
<PRE>
DB m
TERM smith ab [auth]
DISPMAX 15
</PRE>
<P>
* Search the author field of the MEDLINE database for papers by
A.B. Smith, and display the most recent 15 documents in the default
report format.
<PRE>
DB n
TERM caenorhabditis elegans [ORGN] & 1996/01/28 [DATM]
DOPT g
</PRE>
<P>
* Retrieve all the C. elegans records added to the nucleotide
database on Jan. 28, 1996, and display in GenBank format.
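<P>
Messages in the format shown in the sample searches can also be assembled
by a short script. The sketch below only builds the message text; it does
not send mail. The keywords (DB, UID, TERM, DOPT, DISPMAX) and the database
codes come from this article, but the function name is hypothetical, and
the rule that a message carries exactly one of UID or TERM is our
assumption. Check the server's HELP documentation for the full syntax.
<PRE>
```python
VALID_DATABASES = {"n", "p", "s", "t", "m"}  # codes listed in the article


def build_query_message(db, uid=None, term=None, dopt=None, dispmax=None):
    """Build the body of a QUERY e-mail message as plain text.

    uid is a list of identifiers; term is a search string. Exactly one of
    the two must be given (an assumption made for this sketch).
    """
    if db not in VALID_DATABASES:
        raise ValueError(f"unknown database code: {db!r}")
    if (uid is None) == (term is None):
        raise ValueError("give exactly one of uid or term")

    lines = [f"DB {db}"]
    if uid is not None:
        lines.append("UID " + ",".join(uid))
    else:
        lines.append(f"TERM {term}")
    if dopt is not None:
        lines.append(f"DOPT {dopt}")  # output format code, e.g. f or g
    if dispmax is not None:
        lines.append(f"DISPMAX {dispmax}")  # maximum records to display
    return "\n".join(lines)


print(build_query_message("n", uid=["U30150", "U30153"], dopt="f"))
```
</PRE>
<P>
The call above reproduces the first sample search. The resulting text would
then be pasted into, or sent as, a message to the QUERY address.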
|
|
<P>
|
|
<A HREF="#toc">Return to Table of Contents</A><HR>
|
|
|
|
<H3><A NAME="mouse"> Human/Mouse Homology Map Added to Web Site</A>
|
|
</H3>
|
|
|
|
<P>
|
|
NCBI now provides access to the Seldin/Debry Human/Mouse Homology
|
|
Map through its WWW Home Page. The homology map is provided and
|
|
maintained by Michael Seldin at Duke University Medical Center
|
|
and Ronald Debry at the University of Cincinnati. To use the homology
|
|
map, select the <B>Human/Mouse Homology Maps</B> option from the
|
|
Home Page, and click on a particular human or mouse chromosome.
|
|
You will then see a table comparing genes in homologous segments
|
|
of DNA from human and mouse sources, sorted by position in each
|
|
genome. More than 1,400 loci are presented, most of which are
|
|
genes. Links to more information on using the map, table construction,
|
|
and underlying assumptions are also provided.
|
|
<P>
|
|
<A HREF="#toc">Return to Table of Contents</A><HR>
|
|
|
|
<H3><A NAME="images">Images Now Accessible Through OMIM</A></H3>
|
|
|
|
<P>
|
|
NCBI's WWW version of the Online Mendelian Inheritance in Man
|
|
(OMIM) database now includes images of clinical phenotypes via
|
|
a link to the Genetics Image Archive of the Cedars-Sinai Medical
|
|
Center. If an image is available for a given OMIM record, an <B>Images</B>
|
|
button is included as one of the available database links. Alternatively,
|
|
from the OMIM Home Page, users can go directly to the Image Archive,
|
|
where the images are organized by OMIM number. Currently more
|
|
than 100 images are available. The URL for direct access to the
|
|
OMIM Home Page is http://www.ncbi.nlm.nih.gov/omim.
|
|
<P>
|
|
<A HREF="#toc">Return to Table of Contents</A><HR>
|
|
|
|
<H3><A NAME="genome">New Genome Survey Sequence Division</A></H3>
|
|
|
|
<P>
|
|
To keep pace with the rapidly increasing output of genomic sequence
|
|
data, NCBI will be creating a new Genome Survey Sequence (GSS)
|
|
division to be included in GenBank Release 96.0 (August 1996).
|
|
<P>
|
|
The GSS division will fill the need for a repository for genomic
|
|
sequence data that are not appropriate for inclusion in the standard
|
|
organism-specific divisions. Submissions to the GSS division can
|
|
include sequence data generated by single pass "reads"
|
|
from random genome surveys, exon trapped products, and cosmid,
|
|
BAC, or YAC end clones. Creation of the new GSS division will
|
|
allow users easy access to these data for use in mapping and sequencing
|
|
of larger contigs, which can then be submitted to the standard
|
|
GenBank divisions, while at the same time segregating this specialized
|
|
type of high-volume data from the more traditional GenBank sequences.
|
|
There are currently more than 7,000 sequences in this division.
|
|
<P>
|
|
There is a special data submission format for these sequences,
|
|
similar to that used for EST and STS submissions. To obtain a
|
|
copy of the format specifications, send a request to info@ncbi.nlm.nih.gov.
|
|
<P>
|
|
<A HREF="#toc">Return to Table of Contents</A><HR>
|
|
|
|
<H3><A NAME="sequin">Sequin for Database Submissions: A Quick
|
|
Guide</A></H3>
|
|
|
|
<P>
|
|
NCBI has recently released a new program called Sequin for submitting
|
|
sequences to the GenBank, EMBL, and DDBJ databases. The advantages
|
|
of Sequin over Authorin include the capacity to handle long sequences
|
|
and segmented entries, easier editing and updating, and complex
|
|
annotation capabilities. In addition, Sequin contains a number
|
|
of built-in validation functions for enhanced quality assurance.
|
|
<P>
|
|
This overview is intended to provide a quick guide to Sequin's
|
|
capabilities, including automatic annotation of coding regions,
|
|
the graphical viewer, quality control features, and editing features.
|
|
More detailed instructions on these and other functions can be
|
|
found in Sequin's on-screen <B>Help</B> file.
|
|
<P>
|
|
<B>Basic Sequin Organization </B>
|
|
<P>
|
|
Sequin is organized into a series of forms for (1) entering submitting
|
|
authors, (2) entering organism and sequences, (3) viewing the
|
|
complete submission, and (4) editing and annotating the submission.
|
|
To advance through the pages making up each form, simply click
|
|
on labeled folder tabs or the <B>Next Page</B> button. After the
|
|
basic information forms have been completed and the sequence data
|
|
imported, Sequin provides a complete view of your submission,
|
|
in your choice of text or graphic format. At this point, any of
|
|
the information fields can be easily modified by double-clicking
|
|
on any area of the record, and additional biological annotations
|
|
can be entered by selecting from a menu.
|
|
<P>
|
|
Sequin has an on-screen <B>Help</B> file that is opened automatically
|
|
when you start the program. Because it is context-sensitive, the
|
|
<B>Help</B> text will change as you progress through the program.
|
|
<P>
|
|
<B>Welcome to Sequin Form</B>
|
|
<P>
|
|
Sequin's first window asks you to indicate the database to which
|
|
the sequence will be submitted, and prompts you to start a new
|
|
project or continue with an existing one. In general, each sequence
|
|
submission should be entered as a separate project. However, an
|
|
important new feature of Sequin is that it also accepts submissions
|
|
of segmented DNA sequences, population studies, and phylogenetic
|
|
studies. These entries would be submitted together as one project.
|
|
<P>
|
|
The sequence data for this example are for the Drosophila eukaryotic initiation
|
|
factors 4E-I and 4E-II (accession number U54469).
|
|
<P>
|
|
<B>Submitting Authors Form</B>
|
|
<P>
|
|
The pages in this form ask you to provide the release date, a
|
|
working title, names and contact information of submitting authors,
|
|
and affiliation information. To create a personal template for
|
|
use in future submissions, use the <B>File/Export</B> option after
|
|
completing each page of the Submitting Authors form. Figure 1 shows
|
|
a partially filled out page for affiliation information.
|
|
<P>
|
|
<A NAME="fig1"><IMG SRC="fig1.gif" ALIGN="BOTTOM"></A>
|
|
<P>
|
|
<I>Figure 1</I>
|
|
<P>
|
|
<B>Organism and Sequences Form </B>
|
|
<P>
|
|
The first page of this form requests information regarding the
|
|
organism from which the sequence was derived. Organism information
|
|
is most easily entered by selecting the appropriate organism from
|
|
the scrollable list. As you begin typing the organism name, the
|
|
list will jump to the right alphabetical location. Once you select
|
|
an organism from the list, the corresponding scientific and common
|
|
name and genetic code are filled out automatically (Figure 2).
|
|
If your organism is not on the list, Sequin will simply accept
|
|
what you have typed.
|
|
<P>
|
|
<A NAME="fig2"><IMG SRC="fig2.gif" ALIGN="BOTTOM"></A>
|
|
<P>
|
|
<I>Figure 2</I>
|
|
<P>
|
|
<B>Importing Nucleotide and Protein FASTA Files</B>
|
|
<P>
|
|
With Sequin, the actual sequence data are imported from an outside
|
|
data file. So before you begin, prepare your sequence data files
|
|
using a word processor or perhaps a text editor associated with
|
|
your laboratory sequence analysis software. One great feature
|
|
of Sequin is that the program can automatically annotate your
|
|
sequence and coding regions if you format the identifying descriptive
|
|
information (known in Sequin as the FASTA definition line) in
|
|
a particular structured manner. See <A HREF="#before">"Before You Begin"</A>
|
|
for format details.
|
|
<P>
|
|
To import the nucleotide sequence data, click on the <B>Nucleotide</B>
|
|
folder tab to advance to the next page (Figure 3).
|
|
Select molecule type and topology, check any additional boxes
|
|
that apply, then click on <B>Import Nucleotide FASTA</B> and select
|
|
the appropriate file. When the sequence file import is complete,
|
|
a box will appear showing the number of nucleotide segments imported,
|
|
the total length in nucleotides of the sequences entered, and
|
|
the local ID you designated, but the actual sequence data are not
|
|
shown. If any of this information is missing or incorrect, check
|
|
the file containing the sequence data for proper FASTA format,
|
|
choose Clear from the <B>Edit</B> menu, then reimport the sequence.
|
|
<P>
|
|
<A NAME="fig3"><IMG SRC="fig3.gif" ALIGN="BOTTOM"></A>
|
|
<P>
|
|
<I>Figure 3</I>
|
|
<P>
|
|
To import the amino acid sequence, click on the <B>Protein</B>
|
|
folder tab and proceed in the same manner as for the nucleotide data.
|
|
In this example, we imported two protein sequences. These are
|
|
the alternative splice products of the same gene. As shown in
|
|
<A HREF="#before">"Before You Begin"</A>, both protein
|
|
sequences are in the same data file, but each has its own definition
|
|
line with local ID.
|
|
<P>
|
|
<B>Viewing Your Submission</B>
|
|
<P>
|
|
After you have completed importing the data files, Sequin will
|
|
display your full submission information in the GenBank text format
|
|
(Figure 4).
|
|
<P>
|
|
<A NAME="fig4"><IMG SRC="fig4.gif" ALIGN="BOTTOM"></A>
|
|
<P>
|
|
<I>Figure 4</I>
|
|
<P>
|
|
Based on information provided in your DNA and amino acid sequence
|
|
files, any coding regions will be automatically identified and
|
|
annotated for you. Figure 4 shows only the top portion of the
|
|
GenBank record, but you can see the first of two coding region
|
|
(CDS) features. There are also two mRNA features (not shown in
|
|
figure) that, with minor editing, can be extended to include the
|
|
5' and 3' UTRs.
|
|
<P>
|
|
To get a graphical view, use the <B>Display Format</B> pop-up
|
|
menu to change from GenBank to Graphic (Figure 5).
|
|
Reviewing your submission in Graphic format allows you to visually
|
|
confirm expected location of exons, introns, and other features
|
|
in multiple interval coding regions. The Graphic view in our eukaryotic
|
|
initiation factor example illustrates how the coding region intervals
|
|
for the two protein products are spatially related to each other.
|
|
This figure shows the record after the initial mRNA intervals
|
|
have been edited to include the 5' and 3' UTRs.
|
|
<P>
|
|
<A NAME="fig5"><IMG SRC="fig5.gif" ALIGN="BOTTOM"></A>
|
|
<P>
|
|
<I>Figure 5</I>
|
|
<P>
|
|
<B>Editing and Annotating Your Submission</B>
|
|
<P>
|
|
At this point, Sequin could process your entry based on what you
|
|
have submitted so far. However, to optimize usefulness of your
|
|
entry for the scientific community, you will probably wish to
|
|
provide additional information to indicate biologically significant
|
|
regions of the sequence. This information may be in the form of
|
|
Descriptors or Features. (Descriptors are annotations that apply
|
|
to an entire sequence or set of sequences. Features are annotations
|
|
that apply to a specific sequence interval.)
|
|
<P>
|
|
Sequin provides two convenient methods to modify your entry: (1)
|
|
to edit existing information, double-click on the text or graphic
|
|
area you wish to modify, and Sequin will display forms requesting
|
|
needed information, or (2) to add new information, use the <B>Misc</B>
|
|
and <B>Feature</B> menus and select from the list of available
|
|
annotations. Additional sequence data can also be added using
|
|
Sequin's powerful sequence editor. Sequin will automatically adjust
feature intervals when you edit the sequence. But first, save the
|
|
entry so that if you make any unwanted changes during the editing
|
|
process you can revert to the original copy.
|
|
<P>
|
|
In this example, there are two RNA sequences transcribed from
|
|
the same region, and we have additional information about their
|
|
5' and 3' UTRs. With minor editing, we can extend the two mRNA
|
|
features to include these untranslated intervals. Just double-click
|
|
on an mRNA feature, then click on the <B>Location</B> tab, and
|
|
you will see a small spreadsheet showing the existing intervals.
|
|
Edit the locations in the spreadsheet to extend the mRNA. The
|
|
interval of the appropriate gene feature will automatically be
|
|
adjusted as well.
|
|
<P>
|
|
Publication information can also be added at this point. To change
|
|
the publication status from Unpublished to published in the <I>Journal
|
|
of Biological Chemistry</I>, just double-click on the Reference
|
|
section, and fill in the citation form that is presented.
|
|
<P>
|
|
<B>Validation</B>
|
|
<P>
|
|
Once you are satisfied that you have entered all the relevant
|
|
information, save your file! Then select <B>Validate</B> under
|
|
the <B>Search</B> menu. You will either receive a message that
|
|
the validation test succeeded or see a screen listing the validation
|
|
errors. Just double-click on an error item to launch the appropriate
|
|
editor for making corrections. See the Sequin <B>Help</B> text for more
|
|
information on correcting errors. The validator includes
|
|
checks for such things as missing organism information, incorrect
|
|
coding region length, internal stop codons in coding regions,
|
|
mismatched amino acids, or nonconsensus splice sites.
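<P>
Two of these checks, coding region length and internal stop codons, are
simple enough to sketch. The code below is not Sequin's validator; it is a
minimal illustration that assumes the standard genetic code (stop codons
TAA, TAG, and TGA) and a coding sequence given 5' to 3' on the coding
strand. The function name is hypothetical.
<PRE>
```python
STOP_CODONS = {"TAA", "TAG", "TGA"}  # standard genetic code only


def check_coding_region(cds):
    """Return a list of problems found in a coding sequence string."""
    problems = []
    cds = cds.upper()
    if len(cds) % 3 != 0:
        problems.append("length is not a multiple of 3")

    # Split into complete codons, ignoring any trailing partial codon.
    usable = len(cds) - len(cds) % 3
    codons = [cds[i:i + 3] for i in range(0, usable, 3)]

    # The last complete codon may be a stop; earlier ones should not be.
    for index, codon in enumerate(codons[:-1]):
        if codon in STOP_CODONS:
            problems.append(f"internal stop codon {codon} at codon {index + 1}")
    return problems


print(check_coding_region("ATGTAGAAATAA"))
```
</PRE>
<P>
An empty list means that neither problem was found. Organisms or organelles
that use a different genetic code would need a different set of stop codons.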
|
|
<P>
|
|
<B>Submitting the Entry</B>
|
|
<P>
|
|
When the entry is properly formatted and error-free, click the
|
|
<B>Done</B> button or select Prepare Submission under the File
|
|
menu. You will be prompted to save your entry and e-mail it to
|
|
the database you selected. The address for GenBank is gb-sub@ncbi.nlm.nih.gov.
|
|
<P>
|
|
<B><A NAME="before">Before You Begin: Preparing Nucleotide and
|
|
Amino Acid Data </A></B>
|
|
<P>
|
|
Prepare your sequence data files using a word processor or some
|
|
other text editor, and save in ASCII text format. The data should
|
|
be arranged in FASTA format, which simply requires that line 1
|
|
begin with a > sign, followed by identifying descriptive text.
|
|
The sequence begins in line 2. Note that many sequence analysis
|
|
software packages include FASTA as one of the available output
|
|
formats.
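<P>
Before importing a file, it can help to confirm that it really follows this
layout. The following is a minimal sketch of a FASTA reader, written for
illustration rather than taken from Sequin; the function name is
hypothetical. It assumes each definition line fits on one line and joins
the sequence lines that follow it.
<PRE>
```python
def read_fasta(text):
    """Parse FASTA text into a list of (definition_line, sequence) pairs."""
    records = []
    definition = None
    chunks = []
    for raw in text.splitlines():
        line = raw.strip()
        if not line:
            continue  # skip blank lines
        if line.startswith(">"):
            # A new definition line closes the previous record, if any.
            if definition is not None:
                records.append((definition, "".join(chunks)))
            definition = line[1:].strip()
            chunks = []
        elif definition is None:
            raise ValueError("sequence data found before a '>' definition line")
        else:
            chunks.append(line)
    if definition is not None:
        records.append((definition, "".join(chunks)))
    return records


sample = ">4E-I [gene=eIF4E]\nMQSD\nFHRM\n>4E-II [gene=eIF4E]\nMVVL\n"
for definition, sequence in read_fasta(sample):
    print(definition, len(sequence))
```
</PRE>
<P>
A file that does not start with a definition line raises an error, which is
the most common formatting mistake this kind of check would catch.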
|
|
<P>
|
|
For the DNA sequence, the definition line should contain your
|
|
own local ID code for the sequence and a working title. During
|
|
the submission process, NCBI staff will change your local ID to
|
|
a GenBank accession number.
|
|
<P>
|
|
If you have an amino acid translation, create a separate sequence
|
|
file in the same manner as above. Multiple amino acid sequences
|
|
can be included in a single file. Our eukaryotic initiation factor
|
|
example has two protein products, which are contained in the same
|
|
file, but with separate definition lines.
|
|
<P>
|
|
In order to take advantage of Sequin's automatic annotation feature,
|
|
the definition line for amino acid sequences must be in the structured
|
|
format illustrated below. Additional information can also be provided
|
|
for other features, but we are only showing the minimum information
|
|
required.
<P>
<B>Segmented Nucleotide Sets</B> -- A segmented nucleotide entry is a set
of noncontiguous genomic DNA sequences, for example, sequences encoding
exons along with fragments of their flanking introns. Segmented
sets apply only to incomplete genomic DNA sequences, not complete
genomic DNA sequences or mRNA sequences. To import the nucleotide
sequences of a segmented set, each individual sequence must be in FASTA
format with an appropriate definition line; all of the sequences
may be in the same file. The file containing the sequences is
imported into Sequin as described.
<P>
<B>Population or Phylogenetic Studies</B> -- For phylogenetic studies,
the scientific or common name of each organism should be encoded
in each FASTA definition line, e.g., [org=mouse]. In this case,
the organism page should not be filled out. For population studies,
you can encode strain, clone, and isolate information in the definition
line, e.g., [strain=BALB/c].
<P>
<I>Format for DNA Sequence Definition Line</I>
<PRE>
>local ID [org=organism] title
</PRE>
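<P>
The definition line format above can be assembled programmatically. The
following Python sketch is one possible way to do so; the helper name
make_dna_defline, the rule that values may not contain brackets, and the
sample values are assumptions for illustration, not Sequin requirements.

```python
def make_dna_defline(local_id, title, **qualifiers):
    """Build a line of the form '>local ID [key=value] ... title'."""
    if not local_id:
        raise ValueError("a local ID is required")
    parts = [">" + local_id]
    for key, value in qualifiers.items():
        # Assumption: brackets inside a value would make the line ambiguous.
        if "[" in value or "]" in value:
            raise ValueError("qualifier values may not contain brackets")
        parts.append("[%s=%s]" % (key, value))
    parts.append(title)
    return " ".join(parts)


# Hypothetical local ID and title; org and strain values come from the text.
print(make_dna_defline("seq1", "example title", org="mouse", strain="BALB/c"))
```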

<P>
<I>DNA Sequence File</I>
<PRE>
>eIF4E Drosophila melanogaster eukaryotic initiation factors 4E-I and 4E-II (eIF4E) gene
CGGTTGCTTGGGTTTTATAACATCAGTCAGTGACAGGCATTTCCAGA
GTTGCCCTGTTCAACAATCGATAGCTGCCTTTGGCCACCAAAATCCC
AAACTTAATTAAAGAATTAAATAATTCGAAT.....
</PRE>

<P>
<I>Format for Protein Sequence Definition Line</I>
<PRE>
>local ID [gene=locus; optional description] [prot=name; optional description] optional title
</PRE>

<P>
<I>Protein Sequence File</I>
<PRE>
>4E-I [gene=eIF4E] [prot=eukaryotic initiation factor 4E-I]
MQSDFHRMKNFANPKSMFKTSAPSTEQGRPEPPTSAAAPAEAKDVKP
KEDPQETGEPAGNTATTTAPAGDDAVRTEHLYKHPLMNVWTLWYLEN
DRSKSWEDMQNEITSFDTVEDFWSLYNHIKP.....
>4E-II [gene=eIF4E] [prot=eukaryotic initiation factor 4E-II]
MVVLETEKTSAPSTEQGRPEPPTSAAAPAEAKDVKPKEDPQETGEPA
GNTATTTAPAGDDAVRTEHLYKHPLMNVWTLWYLENDRSKSWEDMQN
EITSFDTVEDFWSLYNHIKPPSEIKLGSDYS.....
</PRE>
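<P>
To make the structure of these definition lines concrete, the Python sketch
below splits one into its local ID, its bracketed qualifiers, and any
remaining title. It is a rough reading of the format shown above, not the
parser Sequin uses; the name parse_defline is hypothetical, and an optional
description after a semicolon is simply kept as part of the value.

```python
import re

# One bracketed qualifier such as [gene=eIF4E].
QUALIFIER = re.compile(r"\[([^=\]]+)=([^\]]*)\]")


def parse_defline(defline):
    """Return (local_id, qualifiers, title) for a structured definition line."""
    text = defline.lstrip(">").strip()
    qualifiers = {
        match.group(1).strip(): match.group(2).strip()
        for match in QUALIFIER.finditer(text)
    }
    # Whatever is left after removing the qualifiers is the ID plus the title.
    remainder = QUALIFIER.sub(" ", text).split()
    local_id = remainder[0] if remainder else ""
    title = " ".join(remainder[1:])
    return local_id, qualifiers, title


print(parse_defline(">4E-I [gene=eIF4E] [prot=eukaryotic initiation factor 4E-I]"))
```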

<P>
<A HREF="#toc">Return to Table of Contents</A><HR>

<H3><A NAME="blast">New BLAST Services Now Offered</A></H3>

<P>
If you have visited the Web BLAST page recently, you will have
discovered that the service has undergone substantial revision
and that several new features have been added. Users can now
select either the "Basic" BLAST search using default
parameters or the "Advanced" search using customized
BLAST search parameters. In addition, an e-mail option has been
added for convenient delivery of search results. With this
option, your BLAST output is delivered by e-mail, and your
Web browser is not tied up while the BLAST search is being
performed.
<P>
<B>Introducing PowerBlast</B>
<P>
NCBI has released PowerBlast, a new Network BLAST application
for automated analysis of genomic sequences. PowerBlast combines
BLAST searching with additional filtering for low complexity regions
and repeats. In addition, PowerBlast features a one-to-many alignment
output showing the alignment of the query sequence with all the
matching sequences (as opposed to standard BLAST results, which
show the query sequence aligned individually against each matching
sequence). The one-to-many presentation highlights the differences
between the query sequence and the search results, rather than
the similarities, as in standard BLAST results. The multiple alignment
results are displayed in both text and graphical formats. The
graphic view shows the computed optimal alignment gaps, and annotated
features are superimposed on the aligned sequences. PowerBlast
can also generate organism-specific output, for example, searches
restricted to human sequences. Versions of PowerBlast are available
for Macintosh, PC, SunOS, and Solaris platforms and can be downloaded
from NCBI's FTP site in the 'pub/sim2/PowerBlast' directory.
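<P>
The idea of a one-to-many display that emphasizes differences can be
sketched in a few lines of Python. The example below assumes the matching
sequences have already been aligned to the query and padded to the same
length, and it uses the common convention of printing a dot where a residue
is identical to the query. It is not PowerBlast's actual output format; the
function name and the sequences are invented for illustration.

```python
def difference_view(query, aligned_hits):
    """Return display rows: the query, then each hit with identities as '.'."""
    rows = [query]
    for hit in aligned_hits:
        if len(hit) != len(query):
            raise ValueError("aligned sequences must have the same length")
        # Keep only the residues (or gap characters) that differ from the query.
        rows.append("".join("." if h == q else h for q, h in zip(query, hit)))
    return rows


for row in difference_view("ACGTACGT", ["ACGTTCGT", "AC-TACGA"]):
    print(row)
```

In this layout a mismatch or gap stands out at a glance, whereas a pairwise
listing would require comparing each alignment separately.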
<P>
<B>New BLAST E-Mail Server</B>
<P>
All BLAST e-mail queries sent to "blast@ncbi.nlm.nih.gov"
after August 5 are being processed by a new e-mail server at
NCBI. The server address and query format will not change.
<P>
The most important new features of the server are:
<P>
1. Filtering of the query sequence is performed by default.
Low-complexity sequence found by a filter program is replaced
with the letter "N" in nucleotide sequences and the
letter "X" in protein sequences. The program "dust" is used for BLASTN queries;
"seg" is used for all others. For a description of these
filtering programs, the advantages of filtering, and instructions
on how to perform queries without filtering, see section 5 of
the new Help document.
<P>
2. There are two new directives: NCBI_GI, which causes the GI
to be displayed in the output, and HTML, which causes the output
to be in HTML format, suitable for viewing in a Web browser. Both
of these command options are discussed in section 5 of the new
Help document.
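<P>
The masking step described in item 1 can be pictured with the Python sketch
below. It does not implement "dust" or "seg"; it assumes the low-complexity
regions have already been identified and only shows the substitution of "N"
or "X". The function name, the 0-based end-exclusive coordinates, and the
sample sequence are assumptions made for this example.

```python
def mask_regions(sequence, regions, is_protein=False):
    """Replace each (start, end) region with 'N' (nucleotide) or 'X' (protein)."""
    mask_char = "X" if is_protein else "N"
    residues = list(sequence)
    for start, end in regions:
        if not 0 <= start <= end <= len(residues):
            raise ValueError("region out of range: %r" % ((start, end),))
        # Overwrite the region with the same number of mask characters.
        residues[start:end] = mask_char * (end - start)
    return "".join(residues)


# Hypothetical nucleotide query with a run of A's flagged as low complexity.
print(mask_regions("ACGTAAAAAAAAGTCA", [(4, 12)]))
```

The masked sequence keeps its original length, so reported alignment
positions still refer to the coordinates of the submitted query.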
<P>
To receive the documentation for the new BLAST e-mail server,
send a message consisting of only the word HELP to the server
address. Questions and comments on the new service are welcome
at blast-help@ncbi.nlm.nih.gov.
<P>
<A HREF="#toc">Return to Table of Contents</A><HR>

<H3><A NAME="faq">Frequently Asked Questions</A></H3>

<P>
<I>Now that the yeast genome has been completely sequenced, how
can I retrieve these records? Can I search the genome with the BLAST
servers at NCBI?</I>
<P>
Yes, a single copy of the complete <I>Saccharomyces cerevisiae</I>
genome is now available from the Entrez retrieval system (using
the genomes database) and for BLAST searches. NCBI has a searchable
database called "yeast" for either the nucleotide or
protein sequences, which can be searched with the blastn, blastp,
blastx, tblastn, or tblastx programs. The sequences are also
available from the NCBI anonymous FTP site (ncbi.nlm.nih.gov) in
the '/genbank/genomes/S_cerevisiae' directory. See the README file
in the '/genbank/genomes' directory for a description of the files
present in this directory.
<P>
<I>What is the difference between the GenBank accession number
and the GI number?</I>
<P>
The accession number is assigned to every GenBank record when
it is submitted. It applies to the full record and does not change
if parts of the record are modified, such as the publication information,
feature annotations, or even sequence corrections.
<P>
The GI identification numbers are assigned specifically to the
sequence components of the record in order to track changes in
the sequence itself. The nucleotide sequence gets a GI number
(called an NID), and each protein sequence gets an individual
GI number (called a PID). Any time the sequence is modified by
the submitter, a new GI number (NID or PID) is assigned. The
older numbers are retained in the system, however, and can be
retrieved if needed.
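<P>
The relationship between the two identifiers can be modeled with a small
Python class. This is only a conceptual sketch: the class name, the accession
"X00000", and the way GI numbers are drawn from a counter are all invented
for illustration and do not describe how NCBI assigns identifiers internally.

```python
import itertools


class SequenceRecord:
    """Toy model: a stable accession plus a GI that changes with the sequence."""

    def __init__(self, accession, sequence, gi_source):
        self.accession = accession          # never changes after submission
        self.sequence = sequence
        self.annotation = {}
        self._gi_source = gi_source         # hypothetical supply of new GIs
        self.gi_history = [next(gi_source)]

    @property
    def gi(self):
        return self.gi_history[-1]          # the current GI

    def update_annotation(self, key, value):
        self.annotation[key] = value        # annotation edits keep the same GI

    def update_sequence(self, new_sequence):
        if new_sequence != self.sequence:
            self.sequence = new_sequence
            # A changed sequence gets a new GI; older ones stay in the history.
            self.gi_history.append(next(self._gi_source))


record = SequenceRecord("X00000", "ACGT", itertools.count(1))
record.update_annotation("title", "revised title")
record.update_sequence("ACGA")
print(record.accession, record.gi, record.gi_history)
```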
<P>
<I>How does your BLAST queuing system work? How can one get bumped
from position 3 to 7, or from 12 to 13, for example?</I>
<P>
You can fall back in line if other users submit jobs that require
fewer resources. For example, a tblastn job, which is very
computationally intensive, could be bumped back by blastn or blastp
jobs that take only seconds to run. Priority is also given to
queries against small databases. Note that about 8,500 BLAST queries
are performed each day through the Web page, and queues tend to
be shorter in the early morning or at night, Eastern time. Also,
the Web BLAST service now allows for results to be returned by
e-mail (and also in HTML format for viewing in a Web browser).
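<P>
The effect described above, where a waiting job moves back as cheaper jobs
arrive, can be reproduced with a simple priority queue. The Python sketch
below orders jobs by an estimated cost and breaks ties by arrival order. The
class name, the cost figures, and the ordering rule itself are assumptions
for illustration; the actual scheduling policy of the BLAST servers is not
specified here.

```python
import heapq
import itertools


class BlastQueue:
    """Toy queue in which jobs with a lower estimated cost run first."""

    def __init__(self):
        self._heap = []
        self._arrival = itertools.count()   # tie-breaker: earlier jobs first

    def submit(self, name, estimated_cost):
        heapq.heappush(self._heap, (estimated_cost, next(self._arrival), name))

    def position(self, name):
        """Return the 1-based place of a job in the current run order."""
        ordered = [job for _, _, job in sorted(self._heap)]
        return ordered.index(name) + 1


queue = BlastQueue()
queue.submit("tblastn job", 600)             # made-up cost values
print(queue.position("tblastn job"))
queue.submit("blastn job", 5)
queue.submit("blastp job", 8)
print(queue.position("tblastn job"))
```

The expensive job starts at the head of the line and slips back as the two
quicker jobs arrive. A production scheduler would presumably also take
waiting time into account so that large jobs are not postponed indefinitely.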
<P>
<I>When I do a BLAST search, I am only interested in matches to
human sequences. Can I limit my results that way?</I>
<P>
Yes. If you are using Network BLAST (server/client version), a
new client, PowerBlast, is now available; it permits restricting
searches by organism, among several other features. See the
<A HREF="#blast">BLAST</A> article for details.
<P>
<I>Does the nr database already include the sequences for genomes,
such as the E. coli genome or other available genome sequences?</I>

<P>
With the exception of EST and STS sequences, the nr database includes
all the sequences that are in GenBank, including sequences from
complete genomes. (For EST and STS database searches, you need
to explicitly specify those databases.)
<P>
<A HREF="#toc">Return to Table of Contents</A><HR>

<H3><A NAME="FTP">NCBI Data by FTP</A></H3>

<P>
The NCBI FTP site contains a variety of directories with publicly
available databases and software. The available directories include
'repository', 'genbank', 'entrez', 'toolbox', 'pub', and 'sequin'.
<P>
The <B>repository</B> directory makes a number of molecular biology
databases available to the scientific community. This directory
includes databases such as PIR 48.00, Swiss-Prot, CarbBank, AceDB,
and FlyBase.
<P>
The <B>genbank</B> directory contains files with the latest full
release of GenBank, the daily cumulative updates, and the latest
release notes.
<P>
The <B>entrez</B> directory contains the Entrez executable programs
for accessing CD-ROM data on a variety of platforms. It also contains
client software for Network Entrez.
<P>
The <B>toolbox</B> directory contains a set of software and data
exchange specifications that are used by NCBI to produce portable
software, and includes ASN.1 tools and specifications for molecular
sequence data.
<P>
The <B>pub</B> directory offers public-domain software, such as
BLAST (sequence similarity search program) and MACAW (multiple
sequence alignment program). Client software for Network BLAST
and PowerBlast is also included in this directory.
<P>
The <B>sequin</B> directory contains the new Sequin submission
software for Mac, PC, and UNIX platforms.
<P>
Data in these directories can be transferred through the Internet
by using anonymous FTP. To connect, type <B>ftp ncbi.nlm.nih.gov</B>
or <B>ftp 130.14.25.1</B>. Enter <B>anonymous</B> as the login name,
and enter your e-mail address as the password. Then change to
the appropriate directory. For example, change to the repository
directory (cd repository) to download specialized databases.
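<P>
The same steps can be scripted. The Python sketch below uses the standard
ftplib module to log in anonymously, change to a directory, and list its
contents. The host name and directory come from the text above and may not
match the server's current layout; the function name, the ftp_factory
parameter (included so the function can be tried without a network
connection), and the e-mail address in the usage comment are assumptions.

```python
import ftplib


def list_ncbi_directory(directory, email, host="ncbi.nlm.nih.gov",
                        ftp_factory=ftplib.FTP):
    """Log in anonymously, change to `directory`, and return its file names."""
    ftp = ftp_factory(host)                  # connects to the host
    try:
        # Anonymous FTP: login name "anonymous", e-mail address as password.
        ftp.login(user="anonymous", passwd=email)
        ftp.cwd(directory)                   # same as typing: cd repository
        names = ftp.nlst()
        ftp.quit()
        return names
    finally:
        ftp.close()                          # safe to call even after quit()


# Example call (requires network access; the address is a placeholder):
# print(list_ncbi_directory("repository", "you@example.org"))
```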
<P>
<A HREF="#toc">Return to Table of Contents</A><HR>

<H3><A NAME="Pubs">Selected Recent Publications by NCBI Staff</A>
</H3>

<P>
<B>Altschul, SF</B>, and W Gish. Local alignment statistics. <I>Methods
Enzymol</I> 266:460-80, 1996.
<P>
<B>Hogue, CWV,</B> <B>H Ohkawa,</B> and <B>SH Bryant</B>. A dynamic
look at structures: WWW-Entrez and the Molecular Modeling Database.
<I>TIBS</I> 21:226-9, 1996.
<P>
<B>Koonin, EV, RL Tatusov,</B> and <B>KE Rudd</B>. Protein sequence
comparison at genome scale. <I>Methods Enzymol</I> 266:295-322,
1996.
<P>
Madden, TL, RL Tatusov, and J Zhang. Applications of network BLAST
server. <I>Methods Enzymol</I> 266:131-41, 1996.
<P>
<B>Schuler, GD, JA Epstein</B>, <B>H Ohkawa,</B> and <B>JA Kans</B>.
Entrez: molecular biology database and retrieval system. <I>Methods
Enzymol</I> 266:141-62, 1996.
<P>
Silberman, JD, ML Sogin, <B>DD Leipe,</B> and CG Clark. Human
parasite finds taxonomic home. <I>Nature</I> 380:398, 1996.
<P>
<B>Wilbur, WJ</B>, and Y Yang. An analysis of statistical term
strength and its use in the indexing and retrieval of molecular
biology texts. <I>Comput Biol Med</I> 26(3):209-22, 1996.
<P>
<B>Wilbur, WJ,</B> F Major, <B>J Spouge,</B> and <B>S Bryant</B>.
The statistics of unique native states for random peptides. <I>Biopolymers</I>
38:447-59, 1996.
<P>
<B>Wootton, JC</B>, and <B>S Federhen</B>. Analysis of compositionally
biased regions in sequence databases. <I>Methods Enzymol</I> 266:554-71,
1996.
<P>
<A HREF="#toc">Return to Table of Contents</A><HR>

<H4><A NAME="Masthead">Masthead</A></H4>

<P>
<I>NCBI News</I> is distributed three times a year. We welcome
communication from users of NCBI databases and software and invite
suggestions for articles in future issues. Send correspondence
and suggestions to <I>NCBI News</I> at the address below.
<P>
<I>NCBI News</I><BR>
National Library of Medicine<BR>
Bldg. 38A, Room 8N-803<BR>
8600 Rockville Pike<BR>
Bethesda, MD 20894<BR>
Phone: (301) 496-2475<BR>
Fax: (301) 480-9241<BR>
E-mail: info@ncbi.nlm.nih.gov
<P>
<I>Editors</I><BR>
Dennis Benson<BR>
Barbara Rapp
<P>
<I>Design Consultant</I><BR>
Troy M. Hill
<P>
<I>Photography</I><BR>
Karlton Jackson
<P>
<I>Editing, Graphics, and Production</I><BR>
Veronica Johnson<BR>
Deborah Loer-Martin<BR>
Wendy B. Osborne
<P>
In 1988, Congress established the National Center for Biotechnology
Information as part of the National Library of Medicine; its charge
is to create automated systems for storing molecular biology,
biochemistry, and genetics data, and to perform research in computational
molecular biology.
<P>
The contents of this newsletter may be reprinted without permission.
The mention of trade names, commercial products, or organizations
does not imply endorsement by NCBI, NIH, or the U.S. Government.
<P>
NIH Publication No. 96-3272<BR>
ISSN 1060-8788
<P>
<A HREF="#toc">Return to Table of Contents</A>
</BODY>

</HTML>