<html>
<head>
<meta http-equiv="content-type" content="text/html;charset=iso-8859-1">
<meta name="generator">
<title>NCBI News | Fall/Winter 2000</title>
<style type="text/css">
<!--
a:hover { color: #993300; }
-->
</style>
</head>
<body background="images/bckgrnd.gif" bgcolor="white" link="#003399" alink="#003399" vlink="#003399" text="black">
<span class="heads"></span> <span class="subheads"></span>
<table border="0" cellpadding="0" cellspacing="0" width="673" valign="left">
<tr height="176">
<td height="176" colspan="2" valign="left" align="left"><img height="12" width="8" src="images/dotclear.gif"><img height="171" width="173" src="images/logo.gif" alt="NCBI Logo"></td>
<td height="176" valign="top" width="10" align="left"></td>
<td width="475" height="176" valign="top"><img height="80" width="364" src="images/msthd1.gif" border="0" alt="NCBI News" usemap="#E"><map name="E"><area href="http://www.ncbi.nlm.nih.gov/About/newsletter.html" coords="1,16,362,71" shape="rect"></map><br>
<img height="80" width="340" src="images/msthd1a.gif" border="0" alt="National Center for Biotechnology Information" usemap="#NCBI"><map name="NCBI"><area href="http://www.nih.gov" coords="0,63,133,77" shape="rect"><area href="http://www.nlm.nih.gov" coords="0,41,138,53" shape="rect"><area href="http://www.ncbi.nlm.nih.gov" coords="0,14,248,26" shape="rect"></map>
<img height="80" width="114" src="images/edition.gif" alt="Fall/Winter 2000"></td>
</tr>
<tr valign="top">
<td width="13" align="left" valign="top"><img height="1" width="1" src="images/dotclear.gif"></td>
<td width="160" align="left" valign="top"><font size="2" face="Arial,Helvetica,sans-serif"><br>
<br>
<img height="33" width="178" src="images/issue.gif" alt="In this issue"><br>
<br>
<b><font face="Arial, Helvetica, sans-serif" color="003399"><font color="#000000">The
Human <br>
Genome Sequence</font></font></b></font>
<p><b><font face="Arial, Helvetica, sans-serif" size="2" color="003399"><a href="blink.html">BLink
Enhances<br>
Entrez Exploration</a></font></b></p>
<p><font face="Arial, Helvetica, sans-serif" size="2" color="003399"><b><a href="nomenclature.html">Human
Gene<br>
Nomenclature</a></b></font></p>
<p><font face="Arial, Helvetica, sans-serif" size="2" color="003399"><b><a href="faqs.html">FAQs</a></b></font></p>
<p><font face="Arial, Helvetica, sans-serif" size="2" color="003399"><b><a href="pubs.html">Recent
Publications</a></b></font></p>
<p><font face="Arial, Helvetica, sans-serif" size="2" color="003399"><b><a href="standalone.html">Standalone
<br>
BLAST Additions</a></b></font></p>
<p><font face="Arial, Helvetica, sans-serif" size="2" color="003399"><b><a href="blastlab.html">BLAST
Lab</a></b></font></p>
<p><font face="Arial, Helvetica, sans-serif" size="2" color="003399"><b><a href="mirrorftp.html">Mirror
FTP Site<br>
for GenBank</a></b></font></p>
<p><font face="Arial, Helvetica, sans-serif" size="2" color="003399"><b><a href="masthead.html">Masthead</a>
</b></font><b><font face="Arial, Helvetica, sans-serif" size="2" color="003399">
</font></b>
</td>
<td width="10" valign="left">&nbsp;</td>
<td width="475">
<div valign="left">
<p><br>
<br>
<font face="Arial, Helvetica, sans-serif" size="3" color="003399"><b>The
Human Genome Sequence: NCBI&rsquo;s First Annotated Edition</b></font></p>
<p><font face="Times New Roman, Times, serif" size="3" color="#000000"><br>
The NCBI recently released its first assembled and annotated view of
the human genome sequence. The assembly is based not only on the finished
and draft sequence deposited in GenBank by the public sequencing centers,
but also on the thousands of sequences contributed to GenBank over the
years by individual scientists around the world. Hence, this resource
represents a true international public effort to sequence the human
genome.</font></p>
<p><font face="Times New Roman, Times, serif" size="3" color="#000000">Updated
assemblies that incorporate new data, fill existing gaps,
and improve overall accuracy will be released to the public
on a regular basis. The human genome data can be viewed on the Web with
NCBI&rsquo;s human genome Map Viewer or downloaded in bulk via FTP.</font></p>
<p><font face="Times New Roman, Times, serif" size="3" color="#000000"><br>
<b><font face="Arial, Helvetica, sans-serif" color="003399" size="2">Assembly</font></b></font></p>
<p><font face="Times New Roman, Times, serif" size="3" color="#000000">
NCBI&rsquo;s assembly process starts with the entire complement of human
genomic sequence in GenBank, both draft and finished. Assembling and
ordering the individual sequence units is a critical phase of the Human
Genome Project. It involves many different steps, including screening
for vector and other sequence contamination, before merging the
input data into ordered segments of DNA referred to as contigs. This
first build presents more than 6,000 contigs, representing roughly
2.8 billion base pairs. Nearly 700 contigs are longer than 1 Mb. Over
75 percent of the bases in the contigs are in unbroken segments of greater
than 30 kb, the size of a typical human gene.<br>
<br>
<br>
</font></p>
<table width="100%" border="0" cellspacing="0" cellpadding="1" bgcolor="#FFFFFF">
<tr bgcolor="003399">
<td>
<table border="0" cellspacing="0" cellpadding="11" bgcolor="#FFFFFF" width="100%">
<tr bordercolor="003399">
<td><font face="Arial, Helvetica, sans-serif" size="3" color="003399"><b><font size="2">Model
Sequences</font></b></font><font size="2"><b><font face="Arial, Helvetica, sans-serif" color="003399">
Get New Accession Numbers</font></b></font><b><font face="Arial, Helvetica, sans-serif" size="3" color="003399"><br>
<br>
</font></b><font face="Times New Roman, Times, serif" size="3" color="#000000">The
NCBI assembly process produces a new kind of sequence record
termed a &ldquo;model sequence.&rdquo; Model mRNA records are
created <i>de novo</i> from human genomic sequence and aligned
to mRNA reference sequences from RefSeq. Because such alignments
may contain some mismatches, model sequences are assigned
their own accession numbers, in the format XM_12345 for mRNA
and XP_12345 for the corresponding model protein sequence.<br>
<br>
The alignment-based evidence for the model sequences is provided
through AceView, a new service currently accessed from LocusLink
and the Map Viewer. AceView shows a predicted gene, its intron/exon
structure, and its alignment to the corresponding RefSeq mRNA
sequence.</font><font face="Times New Roman, Times, serif" size="3" color="#000000"><br>
</font><font size="2" face="Times New Roman, Times, serif"></font></td>
</tr>
</table>
</td>
</tr>
</table>
<p><font face="Times New Roman, Times, serif" size="3" color="#000000">
<br>
<b><font face="Arial, Helvetica, sans-serif" color="003399" size="2">Annotation</font></b></font></p>
<p><font face="Times New Roman, Times, serif" size="3" color="#000000">NCBI
is also engaged in the essential process of annotating, or labeling
the biologically important areas of, the human genomic sequence.
Human gene annotation falls into two major tasks: the correct placement
of known human genes into their proper genomic context, and the prediction
of new, previously unknown genes from the genomic sequence.
</font></p>
<p><font face="Times New Roman, Times, serif" size="3" color="#000000">For
the first task, the mRNAs from the NCBI RefSeq collection are placed
on the genome primarily by alignment, with compensation for various
problems in both the genomic and mRNA sequences, and reconciliation
of close paralogs and pseudogenes. In this first release on the NCBI
Web site, 8,800 of the 10,500 RefSeq mRNAs were placed
on the genome.</font></p>
<p><font face="Times New Roman, Times, serif" size="3" color="#000000">For
the second task, multiple lines of evidence, including EST alignments,
splice junctions, protein similarities, and other methods,
are combined to predict new genes. The predicted mRNAs and proteins
will be subject to change with improved data and better algorithms.
Nonetheless, NCBI will do its best to keep the same accession numbers
with the same predicted genes from build to build. A new release containing
both known gene placements and predicted gene models was in process
as this article went to press.</font></p>
<p><font face="Times New Roman, Times, serif" size="3" color="#000000">Additional
biological features are also being annotated on the genomic sequence.
This first release includes more than 1.3 million SNPs and 111,851
STS markers.<br>
</font> </p>
<p><font face="Times New Roman, Times, serif" size="3" color="#000000"><b><font face="Arial, Helvetica, sans-serif" size="2" color="003399"><br>
Public Access</font></b></font></p>
<p><font face="Times New Roman, Times, serif" size="3" color="#000000">NCBI&rsquo;s
human genome Map Viewer displays the contigs used to assemble
the sequence when the Contig map is selected. SNP data may be viewed on the
SNP map. The Map Viewer may also be used to explore the human genome
data further by viewing up to 7 parallel maps selected from a palette of nineteen:
6 sequence maps, 5 cytogenetic maps, 2 genetic maps,
and 6 radiation hybrid maps.</font></p>
<p><font face="Times New Roman, Times, serif" size="3" color="#000000">The
data is also available for downloading from the &ldquo;genomes/H_sapiens&rdquo;
directory of the NCBI FTP site.</font></p>
<p><font face="Times New Roman, Times, serif" size="3" color="#000000">The
FTP site includes the contigs produced by the NCBI assembly, RefSeq
and model mRNA sequences annotated on the genome, and information used
by the Map Viewer to generate and display the palette of nineteen maps
mentioned above. </font><font face="Arial, Helvetica, sans-serif" size="2" color="#000000"><i>&#151;DW,
CB, JO<br>
<br>
<br>
</i></font></p>
<table width="100%" border="0" cellspacing="0" cellpadding="1" bgcolor="#FFFFFF">
<tr bgcolor="003399">
<td>
<table border="0" cellspacing="0" cellpadding="11" bgcolor="#99CCFF" width="100%">
<tr bordercolor="003399">
<td width="100%" height="100%"><font face="Arial, Helvetica, sans-serif" size="3" color="003399"><font size="4" face="Times New Roman, Times, serif"><i><font color="#FFFFFF">What
is Draft Sequence?</font></i></font><font face="Times New Roman, Times, serif" color="#FFFFFF" size="4"><br>
</font><font face="Times New Roman, Times, serif" color="#000000">
<font face="Arial, Helvetica, sans-serif" size="2"><br>
Two-thirds of the human genomic sequence in GenBank is termed
&ldquo;draft&rdquo; or &ldquo;unfinished.&rdquo; These sequences
may consist of many unordered pieces and are of lower
quality than a typical</font></font></font> <font face="Arial, Helvetica, sans-serif" size="2" color="#000000">&ldquo;finished&rdquo;
GenBank sequence. The finishing process involves closure of
sequence gaps, determination of proper order and orientation,
and resolution of any sequencing ambiguities and errors. This
is an ongoing process in the sequencing centers of the Human
Genome Project, and NCBI updates draft sequence on a
daily basis.<br>
<br>
Draft sequence is placed in the HTG (High Throughput Genomic)
division of GenBank. A typical HTG record consists of all
sequence data generated from a single cosmid, BAC, YAC, or
P1 clone. A single accession number is assigned to this collection
of HTG sequences. Each record includes a clear indication
of its status (Phase 1 or Phase 2) and a prominent
warning that the sequence data is &ldquo;unfinished&rdquo; and
may contain errors. Phase 1 indicates an unfinished sequence
with gaps and unknown order and orientation of the pieces.
In Phase 2, the order and orientation of the pieces are known,
but the length of the gaps may still be unknown. Finished
sequence data, consisting of one continuous piece of
high-quality DNA sequence, is moved out of the HTG division
and placed in the Mammalian division of GenBank. Contigs from
the NCBI human genome assembly contain finished as well as
draft sequence.<br>
</font><font face="Arial, Helvetica, sans-serif" size="3" color="#000000">
</font></td>
</tr>
</table>
</td>
</tr>
</table>
<p><font face="Times New Roman, Times, serif" size="3" color="#003399"><i>
<br>
</i></font><br>
<font face="Times New Roman, Times, serif" size="3" color="#003399"><i>
</i></font><font face="Times New Roman, Times, serif" size="3" color="#003399"><i>
<br>
</i></font><font face="Times New Roman, Times, serif" size="3" color="#000000">
</font></p>
<p> <font face="Times New Roman, Times, serif" size="3" color="#000000">
</font> </p>
<p align="right"><a href="blink.html"><img height="27" width="69" src="images/continue.gif" border="0" alt="Continue"></a><br>
<div align="right"><font color="#003399"> </font></div>
<font color="#003399">
<hr noshade size="1" align="right">
</font>
<div align="right"><img height="32" width="187" src="images/fallwinter_foot.gif" border="0" alt="NCBI News | Fall/Winter 2000" usemap="#NCBI News foot"><map name="NCBI News foot"><area href="http://www.ncbi.nlm.nih.gov/About/newsletter.html" coords="0,9,196,34" shape="rect"></map><br>
</div>
</div>
</td>
</tr>
</table>
</body>
</html>