NCBI News | Fall/Winter 2000

The Human Genome Sequence: NCBI’s First Annotated Edition

The NCBI recently released its first assembled and annotated view of the human genome sequence. The assembly is based not only on the finished and draft sequence deposited in GenBank by the public sequencing centers, but also on the thousands of sequences contributed to GenBank over the years by individual scientists around the world. Hence, this resource represents a true international public effort to sequence the human genome.

Updated assemblies, which incorporate new data, fill in existing gaps, and increase overall accuracy, will be released to the public on a regular basis. The human genome data can be viewed on the Web with NCBI’s human genome Map Viewer or downloaded in bulk via FTP.

Assembly

NCBI’s assembly process starts with the entire complement of human genomic sequence in GenBank, both draft and finished. Assembling and ordering the individual sequence units is a critical phase of the Human Genome Project. It involves many different steps, including screening for vector and other sequence contamination, before merging the input data into ordered segments of DNA referred to as contigs. This first build presents more than 6,000 contigs, representing roughly 2.8 billion base pairs. Nearly 700 contigs are longer than 1 Mb. Over 75 percent of the bases in the contigs are in unbroken segments of greater than 30 kb, the size of a typical human gene.
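The summary figures quoted for the build (number of contigs, total bases, contigs above a length threshold, and the share of bases in long unbroken segments) can be computed from lists of lengths. The following is a minimal sketch only, not NCBI's assembly code: the function names and the toy lengths are hypothetical, and it assumes contig and segment lengths are already available in base pairs.

```python
def summarize_contigs(lengths_bp, long_contig_bp=1_000_000):
    """Count contigs, total bases, and contigs longer than a threshold (default 1 Mb)."""
    return {
        "contigs": len(lengths_bp),
        "total_bp": sum(lengths_bp),
        "long_contigs": sum(1 for n in lengths_bp if n > long_contig_bp),
    }


def fraction_in_long_segments(segment_lengths_bp, min_segment_bp=30_000):
    """Fraction of all bases that lie in unbroken segments longer than min_segment_bp."""
    total = sum(segment_lengths_bp)
    if total == 0:
        return 0.0
    in_long = sum(n for n in segment_lengths_bp if n > min_segment_bp)
    return in_long / total


# Toy data (hypothetical lengths, not taken from the actual build).
toy_contigs = [2_500_000, 800_000, 1_200_000, 40_000]
toy_segments = [50_000, 20_000, 30_000, 100_000]

summary = summarize_contigs(toy_contigs)
share = fraction_in_long_segments(toy_segments)
```

On the toy data, two of the four contigs exceed 1 Mb, and 150,000 of the 200,000 segment bases fall in segments longer than 30 kb.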

Model Sequences Get New Accession Numbers

The NCBI assembly process produces a new kind of sequence record termed a “model sequence.” Model mRNA records are created de novo from human genomic sequence, and aligned to mRNA reference sequences from RefSeq. Since such alignments may contain some mismatches, model sequences are assigned their own accession numbers, in the format XM_12345 for mRNA and XP_12345 for the corresponding model protein sequence.
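The accession format described above lends itself to a simple pattern check. The sketch below is illustrative only: the function name is hypothetical, and it assumes the prefix is followed by an underscore and one or more digits, with no version suffix.

```python
import re

# Assumed pattern: XM_ or XP_ followed by digits only, e.g. XM_12345.
_MODEL_ACCESSION = re.compile(r"^(XM|XP)_(\d+)$")


def classify_model_accession(accession):
    """Return 'model mRNA', 'model protein', or None if the accession is not a model record."""
    match = _MODEL_ACCESSION.match(accession.strip())
    if match is None:
        return None
    return "model mRNA" if match.group(1) == "XM" else "model protein"


kinds = [classify_model_accession(a) for a in ("XM_12345", "XP_12345", "NM_000546")]
```

Anything that does not carry the XM_ or XP_ prefix, such as a RefSeq mRNA accession, is reported as not being a model sequence.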

The alignment-based evidence for the model sequences is provided through AceView, a new service currently accessed from LocusLink and the Map Viewer. AceView shows a predicted gene, its intron/exon structure, and its alignment to the corresponding RefSeq mRNA sequence.

Annotation

NCBI is also engaged in the essential process of annotating, or labeling, the biologically important areas of the human genomic sequence. Human gene annotation falls into two major tasks: the correct placement of known human genes into their proper genomic context, and the prediction of new, previously unknown genes from the genomic sequence.

For the first task, the mRNAs from the NCBI RefSeq collection are placed on the genome primarily by alignment, with compensation for various problems in both the genomic and mRNA sequences, and reconciliation of close paralogs and pseudogenes. In this first release on the NCBI Web site, 8,800 of the 10,500 RefSeq mRNAs were placed on the genome.

For the second task, multiple lines of evidence including EST alignments, splice junctions, protein similarities, and other methods are combined to predict new genes. The predicted mRNAs and proteins will be subject to change with improved data and better algorithms. Nonetheless, NCBI will do its best to keep the same accession numbers with the same predicted genes from build to build. A new release containing both known gene placements and predicted gene models was in process as this article went to press.

Additional biological features are also being annotated on the genomic sequence. This first release includes more than 1.3 million SNPs and 111,851 STS markers.

Public Access

NCBI’s human genome Map Viewer may be used to view the contigs used to assemble the sequence by selecting Contig map. SNP data may be viewed on the SNP map. The Map Viewer may be used to further explore the human genome data by viewing up to 7 parallel maps selected from a palette of nineteen, including 6 sequence maps, 5 cytogenetic maps, 2 genetic maps, and 6 radiation hybrid maps.

The data is also available for downloading from the “genomes/H_sapiens” directory of the NCBI FTP site.

The FTP site includes the contigs produced by the NCBI assembly, RefSeq and model mRNA sequences annotated on the genome, and information used by the Map Viewer to generate and display the palette of nineteen maps mentioned above. —DW, CB, JO

What is Draft Sequence?

Two-thirds of the human genomic sequence in GenBank is termed “draft” or “unfinished.” These sequences can consist of many unordered pieces and are of lower quality than a typical “finished” GenBank sequence. The finishing process involves closure of sequence gaps, determination of proper order and orientation, and resolution of any sequencing ambiguities and errors. This is an ongoing process in the sequencing centers of the Human Genome Project, and NCBI updates draft sequence on a daily basis.

Draft sequence is placed in the HTG (High Throughput Genomic) division of GenBank. A typical HTG record consists of all sequence data generated from a single cosmid, BAC, YAC, or P1 clone. A single accession number is assigned to this collection of HTG sequences. Each record includes a clear indication of its status (Phase 1 or Phase 2) and a prominent warning that the sequence data is “unfinished” and may contain errors. Phase 1 indicates an unfinished sequence with gaps and unknown order and orientation of the pieces. In Phase 2, the order and orientation of the pieces are known, but the length of the gaps may still be unknown. Finished sequence data, consisting of one continuous piece of high-quality DNA sequence, is moved out of the HTG division and placed in the Mammalian division of GenBank. Contigs from the NCBI human genome assembly contain finished as well as draft sequence.
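The status categories described above can be modeled as a small lookup. This is a sketch for illustration only: the class, function, and field names are hypothetical and do not correspond to any GenBank record format or NCBI software.

```python
from enum import Enum


class SequenceStatus(Enum):
    PHASE_1 = "Phase 1"    # unfinished: gaps, order and orientation unknown
    PHASE_2 = "Phase 2"    # unfinished: order and orientation known, gap lengths may be unknown
    FINISHED = "finished"  # one continuous piece of high-quality sequence


def describe(status):
    """Return the GenBank division and basic properties implied by a status."""
    if status is SequenceStatus.PHASE_1:
        return {"division": "HTG", "order_known": False, "has_gaps": True}
    if status is SequenceStatus.PHASE_2:
        return {"division": "HTG", "order_known": True, "has_gaps": True}
    return {"division": "Mammalian", "order_known": True, "has_gaps": False}


divisions = {s.value: describe(s)["division"] for s in SequenceStatus}
```

Only finished sequence leaves the HTG division; both draft phases remain there and carry the “unfinished” warning.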