
<html>


<head>
<meta http-equiv="content-type" content="text/html;charset=iso-8859-1">
<title>NCBI News | Fall/Winter 2000</title>
<style type="text/css">
<!--
a:hover { color: #993300; }
-->
</style>
</head>


<body background="images/bckgrnd.gif" bgcolor="white" link="#003399" alink="#003399" vlink="#003399" text="black">
<span class="heads"></span> <span class="subheads"></span>
<table border="0" cellpadding="0" cellspacing="0" width="673" valign="left">
  <tr height="176">

    <td height="176" colspan="2" valign="left" align="left"><img height="12" width="8" src="images/dotclear.gif"><img height="171" width="173" src="images/logo.gif" alt="NCBI Logo"></td>

    <td height="176" valign="top" width="10" align="left"></td>

    <td width="475" height="176" valign="top"><img height="80" width="364" src="images/msthd1.gif" border="0" alt="NCBI News" usemap="#E"><map name="E"><area href="http://www.ncbi.nlm.nih.gov/About/newsletter.html" coords="1,16,362,71" shape="rect"></map><br>
      <img height="80" width="340" src="images/msthd1a.gif" border="0" alt="National Center for Biotechnology Information" usemap="#NCBI"><map name="NCBI"><area href="http://www.nih.gov" coords="0,63,133,77" shape="rect"><area href="http://www.nlm.nih.gov" coords="0,41,138,53" shape="rect"><area href="http://www.ncbi.nlm.nih.gov" coords="0,14,248,26" shape="rect"></map>
      <img height="80" width="114" src="images/edition.gif" alt="Fall/Winter 2000"></td>
			</tr>
			<tr valign="top">

    <td width="13" align="left" valign="top"><img height="1" width="1" src="images/dotclear.gif"></td>

    <td width="160" align="left" valign="top"><font size="2" face="Arial,Helvetica,sans-serif"><br>
						<br>
      <img height="33" width="178" src="images/issue.gif" alt="In this issue"><br>
						<br>
      <b><font face="Arial, Helvetica, sans-serif" color="003399"><font color="#000000">The
      Human <br>
      Genome Sequence</font></font></b></font>
      <p><b><font face="Arial, Helvetica, sans-serif" size="2" color="003399"><a href="blink.html">BLink
        Enhances<br>
        Entrez Exploration</a></font></b></p>
      <p><font face="Arial, Helvetica, sans-serif" size="2" color="003399"><b><a href="nomenclature.html">Human
        Gene<br>
        Nomenclature</a></b></font></p>
      <p><font face="Arial, Helvetica, sans-serif" size="2" color="003399"><b><a href="faqs.html">FAQs</a></b></font></p>
      <p><font face="Arial, Helvetica, sans-serif" size="2" color="003399"><b><a href="pubs.html">Recent
        Publications</a></b></font></p>
      <p><font face="Arial, Helvetica, sans-serif" size="2" color="003399"><b><a href="standalone.html">Standalone
        <br>
        BLAST Additions</a></b></font></p>
      <p><font face="Arial, Helvetica, sans-serif" size="2" color="003399"><b><a href="blastlab.html">BLAST
        Lab</a></b></font></p>
      <p><font face="Arial, Helvetica, sans-serif" size="2" color="003399"><b><a href="mirrorftp.html">Mirror
        FTP Site<br>
        for GenBank</a></b></font></p>
      <p><font face="Arial, Helvetica, sans-serif" size="2" color="003399"><b><a href="masthead.html"><span class="subheads"><span class="subheads">Ma<span class="heads">sthead</span></span></span></a>
        </b></font><b><font face="Arial, Helvetica, sans-serif" size="2" color="003399">
        </font></b>
    </td>

    <td width="10" valign="left">&nbsp;</td>
    <td width="475">
      <div valign="left">
        <p><br>
          <br>
          <font face="Arial, Helvetica, sans-serif" size="3" color="003399"><b>The
          Human Genome Sequence: NCBI&#8217;s First Annotated Edition</b></font></p>
        <p><font face="Times New Roman, Times, serif" size="3" color="#000000"><br>
          The NCBI recently released its first assembled and annotated view of
          the human genome sequence. The assembly is based not only on the finished
          and draft sequence deposited in GenBank by the public sequencing centers,
          but also on the thousands of sequences contributed to GenBank over the
          years by individual scientists around the world. Hence, this resource
          represents a true international public effort to sequence the human
          genome.</font></p>
        <p><font face="Times New Roman, Times, serif" size="3" color="#000000">Updated
          assemblies, which incorporate new data, fill in existing gaps,
          and increase overall accuracy, will be released to the public
          on a regular basis. The human genome data can be viewed on the Web with
          NCBI&#8217;s human genome Map Viewer or downloaded in bulk via FTP.</font></p>
        <p><font face="Times New Roman, Times, serif" size="3" color="#000000"><br>
          <b><font face="Arial, Helvetica, sans-serif" color="003399" size="2">Assembly</font></b></font></p>
        <p><font face="Times New Roman, Times, serif" size="3" color="#000000">
          NCBI&#8217;s assembly process starts with the entire complement of human
          genomic sequence in GenBank, both draft and finished. Assembling and
          ordering the individual sequence units is a critical phase of the Human
          Genome Project. It involves many different steps, including screening
          for vector and other sequence contamination, before merging the
          input data into ordered segments of DNA referred to as contigs. This
          first build presents more than 6,000 contigs, representing roughly
          2.8 billion base pairs. Nearly 700 contigs are longer than 1 Mb. Over
          75 percent of the bases in the contigs are in unbroken segments of greater
          than 30 kb, the size of a typical human gene.<br>
          <br>
          <br>
          </font><font face="Times New Roman, Times, serif" size="3" color="#000000">
          </font><font face="Times New Roman, Times, serif" size="3" color="#003399"><i>
          </i></font></p>
        <table width="100%" border="0" cellspacing="0" cellpadding="1" bgcolor="#FFFFFF">
          <tr bgcolor="003399">
            <td>
              <table border="0" cellspacing="0" cellpadding="11" bgcolor="#FFFFFF" width="100%">
                <tr bordercolor="003399">
                  <td><font face="Arial, Helvetica, sans-serif" size="3" color="003399"><b><font size="2">Model
                    Sequences</font></b></font><font size="2"><b><font face="Arial, Helvetica, sans-serif" color="003399">
                    Get New Accession Numbers</font></b></font><b><font face="Arial, Helvetica, sans-serif" size="3" color="003399"><br>
                    <br>
                    </font></b><font face="Times New Roman, Times, serif" size="3" color="#000000">The
                    NCBI assembly process produces a new kind of sequence record
                    termed a &#8220;model sequence.&#8221; Model mRNA records are
                    created <i>de novo</i> from human genomic sequence, and aligned
                    to mRNA reference sequences from RefSeq. Since such alignments
                    may contain some mismatches, model sequences are assigned
                    their own accession numbers, in the format XM_12345 for mRNA
                    and XP_12345 for the corresponding model protein sequence.<br>
                    <br>
                    The alignment-based evidence for the model sequences is provided
                    through AceView, a new service currently accessed from LocusLink
                    and the Map Viewer. AceView shows a predicted gene, its intron/exon
                    structure, and its alignment to the corresponding RefSeq mRNA
                    sequence.</font><font face="Times New Roman, Times, serif" size="3" color="#000000"><br>
                    </font><font size="2" face="Times New Roman, Times, serif"></font></td>
                </tr>
              </table>
            </td>
          </tr>
        </table>
        <p><font face="Times New Roman, Times, serif" size="3" color="#000000">
          <br>
          <b><font face="Arial, Helvetica, sans-serif" color="003399" size="2">Annotation</font></b></font></p>
        <p><font face="Times New Roman, Times, serif" size="3" color="#000000">NCBI
          is also engaged in the essential process of annotating, or labeling
          the biologically important areas of, the human genomic sequence.
          Human gene annotation comprises two major tasks: the correct placement
          of known human genes into their proper genomic context, and the prediction
          of new, previously unknown genes from the genomic sequence.
          </font></p>
        <p><font face="Times New Roman, Times, serif" size="3" color="#000000">For
          the first task, the mRNAs from the NCBI RefSeq collection are placed
          on the genome primarily by alignment, with compensation for various
          problems in both the genomic and mRNA sequences, and reconciliation
          of close paralogs and pseudogenes. In this first release on the NCBI
          Web site, 8,800 of the 10,500 RefSeq mRNAs were placed
          on the genome.</font></p>
        <p><font face="Times New Roman, Times, serif" size="3" color="#000000">For
          the second task, multiple lines of evidence, including EST alignments,
          splice junctions, protein similarities, and other methods,
          are combined to predict new genes. The predicted mRNAs and proteins
          will be subject to change with improved data and better algorithms.
          Nonetheless, NCBI will do its best to keep the same accession numbers
          with the same predicted genes from build to build. A new release containing
          both known gene placements and predicted gene models was in process
          as this article went to press.</font></p>
        <p><font face="Times New Roman, Times, serif" size="3" color="#000000">Additional
          biological features are also being annotated on the genomic sequence.
          This first release includes more than 1.3 million SNPs and 111,851
          STS markers.<br>
          </font> </p>
        <p><font face="Times New Roman, Times, serif" size="3" color="#000000"><b><font face="Arial, Helvetica, sans-serif" size="2" color="003399"><br>
          Public Access</font></b></font></p>
        <p><font face="Times New Roman, Times, serif" size="3" color="#000000">The
          contigs used to assemble the sequence may be viewed in NCBI&#8217;s
          human genome Map Viewer by selecting Contig map. SNP data may be viewed on the
          SNP map. The Map Viewer may be used to further explore the human genome
          data by viewing up to 7 parallel maps selected from a palette of nineteen,
          including 6 sequence maps, 5 cytogenetic maps, 2 genetic maps,
          and 6 radiation hybrid maps.</font></p>
        <p><font face="Times New Roman, Times, serif" size="3" color="#000000">The
          data is also available for downloading from the &#8220;genomes/H_sapiens&#8221;
          directory of the NCBI FTP site.</font></p>
        <p><font face="Times New Roman, Times, serif" size="3" color="#000000">The
          FTP site includes the contigs produced by the NCBI assembly, RefSeq
          and model mRNA sequences annotated on the genome, and information used
          by the Map Viewer to generate and display the palette of nineteen maps
          mentioned above. </font><font face="Arial, Helvetica, sans-serif" size="2" color="#000000"><i>&#151;DW,
          CB, JO<br>
          <br>
          <br>
          </i></font></p>
        <table width="100%" border="0" cellspacing="0" cellpadding="1" bgcolor="#FFFFFF">
          <tr bgcolor="003399">
            <td>
              <table border="0" cellspacing="0" cellpadding="11" bgcolor="#99CCFF" width="100%">
                <tr bordercolor="003399">
                  <td width="100%" height="100%"><font face="Arial, Helvetica, sans-serif" size="3" color="003399"><font size="4" face="Times New Roman, Times, serif"><i><font color="#FFFFFF">What
                    is Draft Sequence?</font></i></font><font face="Times New Roman, Times, serif" color="#FFFFFF" size="4"><br>
                    </font><font face="Times New Roman, Times, serif" color="#000000">
                    <font face="Arial, Helvetica, sans-serif" size="2"><br>
                    Two-thirds of the human genomic sequence in GenBank is termed
                    &#8220;draft&#8221; or &#8220;unfinished.&#8221; These sequences
                    may consist of many unordered pieces and are of lower
                    quality than a typical</font></font></font> <font face="Arial, Helvetica, sans-serif" size="2" color="#000000">&#8220;finished&#8221;
                    GenBank sequence. The finishing process involves closure of
                    sequence gaps, determination of proper order and orientation,
                    and resolution of any sequencing ambiguities and errors. This
                    is an ongoing process in the sequencing centers of the Human
                    Genome Project, and NCBI updates draft sequence on a
                    daily basis.<br>
                    <br>
                    Draft sequence is placed in the HTG (High Throughput Genomic)
                    division of GenBank. A typical HTG record consists of all
                    sequence data generated from a single cosmid, BAC, YAC, or
                    P1 clone. A single accession number is assigned to this collection
                    of HTG sequences. Each record includes a clear indication
                    of its status (Phase 1 or Phase 2) and a prominent
                    warning that the sequence data is &#8220;unfinished&#8221; and
                    may contain errors. Phase 1 indicates an unfinished sequence
                    with gaps and unknown order and orientation of the pieces.
                    In Phase 2, the order and orientation of the pieces are known,
                    but the length of the gaps may still be unknown. Finished
                    sequence data, consisting of one continuous piece of
                    high-quality DNA sequence, is moved out of the HTG division
                    and placed in the Mammalian division of GenBank. Contigs from
                    the NCBI human genome assembly contain finished as well as
                    draft sequence.<br>
                    </font><font face="Arial, Helvetica, sans-serif" size="3" color="#000000">
                    </font></td>
                </tr>
              </table>
            </td>
          </tr>
        </table>
        <p><font face="Times New Roman, Times, serif" size="3" color="#003399"><i>
          <br>
          </i></font><br>
          <font face="Times New Roman, Times, serif" size="3" color="#003399"><i>
          </i></font><font face="Times New Roman, Times, serif" size="3" color="#003399"><i>
          <br>
          </i></font><font face="Times New Roman, Times, serif" size="3" color="#000000">
          </font></p>
        <p> <font face="Times New Roman, Times, serif" size="3" color="#000000">
          </font> </p>
        <p align="right"><a href="blink.html"><img height="27" width="69" src="images/continue.gif" border="0" alt="Continue"></a><br>
        <div align="right"><font color="#003399"> </font></div>
        <font color="#003399">
        <hr noshade size="1" align="right">
        </font>
        <div align="right"><img height="32" width="187" src="images/fallwinter_foot.gif" border="0" alt="NCBI News | Fall/Winter 2000" usemap="#NCBI News foot"><map name="NCBI News foot"><area href="http://www.ncbi.nlm.nih.gov/About/newsletter.html" coords="0,9,196,34" shape="rect"></map><br>
        </div>
        </div>
				</td>
			</tr>
		</table>
	</body>

</html>