Publication Supplemental Data
Below is the supporting information for the alignments described in the GRC publication "Modernizing Reference Genome Assemblies" in PLoS Biology (Jul 5, 2011).
Alignment of YH1 'novel' scaffolds to GRCh37.p2
We started with the 5.1 Mb of sequence defined as 'novel' in Li et al, 2010; specifically, Supplemental Data set 1. We aligned this sequence to both NCBI36 and GRCh37.p2 using an in-house alignment tool called the 'NG-aligner'.
Alignment Details
We identified 112 Kb of sequence that aligned to NCBI36 and 1.49 Mb of sequence that aligned to GRCh37.p2, using a cut-off of 90% coverage and 90% identity (see methods). We then used BLAST to align the remaining sequence to the NCBI NR database. Using an E-value cut-off of 1e-20, we identified 1.56 Mb of sequence that aligned to primate sequences, 1.39 Mb of which was human. An additional 163 Kb of sequence aligned to non-primate sequences, suggesting a low level of contamination in this data set; this number is in general agreement with the 152 Kb of contamination reported in Alkan et al, 2010 (Figure 3). Notably, GRCh37.p2 not only allows for improved identification of 'novel' YH1 sequence, but also enables more than 95% of this sequence to be put into a chromosome context.
The YH1 sequences that do not align to GRCh37.p2, but that do align to human clone sequences, are now being reviewed for inclusion in future reference assembly releases. In many cases, a single clone sequence will capture several of the novel sequences. For example, 16 YH1 sequences, accounting for 24.8 Kb, align to AC161429.3, a fosmid clone assigned to chr20 (Supplemental Table 1). However, inclusion of these sequences in the reference assembly is not automatic, as they must be reviewed for quality and, in some cases, may require additional sequencing. We are also investigating the remaining sequence with no high-quality alignment to GRCh37.p2 or to NR to determine whether any of these data merit inclusion in future assembly releases.
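The 90% coverage and 90% identity cut-off can be illustrated with a minimal sketch. This is not the NG-aligner's code: the record fields and the exact definitions of coverage (aligned query bases over query length) and identity (matching bases over aligned bases) are assumptions made for illustration, and the methods remain the authoritative description.

```python
from dataclasses import dataclass

MIN_COVERAGE = 0.90  # cut-off stated in the text
MIN_IDENTITY = 0.90  # cut-off stated in the text


@dataclass
class Alignment:
    """Hypothetical summary of one scaffold's best alignment."""
    query_id: str
    query_length: int    # total scaffold length, in bases
    aligned_bases: int   # scaffold bases covered by the alignment
    matching_bases: int  # aligned bases identical to the reference


def passes_cutoff(aln: Alignment) -> bool:
    """Apply the coverage and identity thresholds (assumed definitions)."""
    if aln.query_length == 0 or aln.aligned_bases == 0:
        return False
    coverage = aln.aligned_bases / aln.query_length
    identity = aln.matching_bases / aln.aligned_bases
    return coverage >= MIN_COVERAGE and identity >= MIN_IDENTITY


def aligned_total(alignments) -> int:
    """Sum the lengths of scaffolds whose alignment passes the cut-off."""
    return sum(a.query_length for a in alignments if passes_cutoff(a))
```

Summing scaffold lengths over the passing alignments gives a total comparable in kind to the 112 Kb and 1.49 Mb figures above, although the real totals depend on how the tool reports coverage.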
Supplemental Table 1 (excel workbook)

Supplemental Figure 1: YH1 'novel' sequence alignment breakdown. Almost two-thirds of the sequence identified as 'novel' by Li et al can be accounted for. 1.49 Mb of the novel sequence aligns to a sequence in GRCh37.p2, while another 1.56 Mb aligns to some primate sequence, 1.39 Mb of which is human clone-based sequence. These sequences are currently being reviewed for inclusion in future assembly updates. A small fraction of the sequence appears to be contamination, leaving 1.9 Mb of sequence that remains to be characterized. Experiments are ongoing to confirm that these assembled sequences represent human sequence.
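The breakdown in this figure can be checked with simple arithmetic. The sketch below uses only the totals quoted on this page and assumes the categories do not overlap, which follows from BLAST having been run only on the sequence that did not align to GRCh37.p2. The variable names are illustrative.

```python
# Totals quoted on this page, in Kb (integers avoid floating-point noise).
TOTAL_NOVEL_KB = 5100   # 5.1 Mb defined as 'novel' in Li et al, 2010
GRCH37_P2_KB = 1490     # aligned to GRCh37.p2
PRIMATE_NR_KB = 1560    # aligned to primate sequences in NR (1.39 Mb human)
NON_PRIMATE_KB = 163    # aligned to non-primate sequences (likely contamination)

accounted_kb = GRCH37_P2_KB + PRIMATE_NR_KB + NON_PRIMATE_KB
remaining_kb = TOTAL_NOVEL_KB - accounted_kb
accounted_fraction = accounted_kb / TOTAL_NOVEL_KB

print(f"accounted for: {accounted_kb} Kb ({accounted_fraction:.0%})")
print(f"uncharacterized: {remaining_kb} Kb (about {remaining_kb / 1000:.1f} Mb)")
```

Under these assumptions the uncharacterized remainder comes to 1,887 Kb, consistent with the 1.9 Mb stated in the caption, and the accounted-for share is about 63%, consistent with "almost two-thirds".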
Alignment of Next Generation Sequencing (NGS) reads to GRCh37
We selected NA12156 and NA12878 (SRA accessions ERX000125 and ERX000080, respectively) and aligned their reads to GRCh37 using an algorithm called srprism (Agarwala et al., in preparation). This aligner can consider the placement information associated with the assembly and is thus 'alternate locus aware'. Because of this feature, reads with 'multiple' alignments that correspond to hits on both the primary assembly unit and an alternate locus can be recognized as such and are not assigned a depressed mapping score.
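The idea behind 'alternate locus aware' alignment can be sketched as follows. This is not srprism's implementation, which is not described here. It is a simplified illustration that assumes each alternate scaffold has a known placement on the primary assembly, and treats a hit on the scaffold and a hit inside its placed region as the same locus. The scaffold names and coordinates are invented.

```python
# Hypothetical placement table: alternate scaffold -> (chromosome, start, end)
# of the region it represents on the primary assembly. Invented values.
PLACEMENTS = {"ALT_1": ("chrA", 1_000, 5_000)}


def locus_key(scaffold, position):
    """Map a hit to the primary-assembly region it represents."""
    if scaffold in PLACEMENTS:
        # Hit on an alternate scaffold: represent it by its placed region.
        return PLACEMENTS[scaffold]
    for chrom, start, end in PLACEMENTS.values():
        if scaffold == chrom and start <= position <= end:
            # Hit on the primary assembly inside a region that has an alternate.
            return (chrom, start, end)
    # Any other hit stands for itself.
    return (scaffold, position, position)


def distinct_loci(hits):
    """Count hits after collapsing primary/alternate copies of one locus."""
    return len({locus_key(scaffold, pos) for scaffold, pos in hits})
```

A read hitting both `ALT_1` and the placed region on `chrA` counts as one locus here, so a downstream scorer would not need to treat it as ambiguous. Collapsing the whole placed region is a deliberate simplification: two different positions inside the same region would also be merged, which a real aligner would have to handle more carefully.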
For our analyses, we performed two sets of alignments: one against the full GRCh37 assembly and a second using only the GRCh37 primary assembly (thus omitting the alternate loci). Alignment parameters allowed no more than 2 alignment discrepancies (mismatches or insertions/deletions) per read alignment over the entire length of the read. When scoring a set of paired-end reads, the pair retained the score of the lowest-scoring member. Importantly, despite the small number of regions associated with alternate loci in GRCh37, we were able to identify sequence reads that aligned only to the alternate locus scaffolds when these were present (Supplemental Figure 2). Furthermore, even with strict alignment parameters, removal of the alternate locus scaffolds resulted in two-thirds of these sequence reads aligning to another region of the assembly, and thus being mis-mapped (Supplemental Table 2). These data clearly demonstrate that inclusion of alternate representations for genomic loci can improve alignment quality and avoid spurious variation calls.
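The filtering and scoring rules in this paragraph, and the comparison between the two alignment runs, can be expressed as a short sketch. The function names and data shapes are hypothetical. Whether an insertion or deletion counts once per event or once per base is not stated here, so the code simply takes a count as input, and the score scale used by srprism is likewise not assumed.

```python
MAX_DISCREPANCIES = 2  # mismatches plus insertions/deletions allowed per read


def read_passes(mismatches, indels):
    """True if a read alignment stays within the discrepancy limit."""
    return mismatches + indels <= MAX_DISCREPANCIES


def pair_score(score_first, score_second):
    """A read pair keeps the score of its lowest-scoring member."""
    return min(score_first, score_second)


def mismapped_fraction(alt_only_reads, primary_only_hits):
    """Fraction of alternate-locus-specific reads that align elsewhere
    once the alternate loci are removed.

    alt_only_reads: IDs of reads that aligned only to alternate locus
        scaffolds in the full-assembly run.
    primary_only_hits: dict of read ID -> placement from the
        primary-assembly-only run (unaligned reads are absent).
    """
    if not alt_only_reads:
        return 0.0
    relocated = [r for r in alt_only_reads if r in primary_only_hits]
    return len(relocated) / len(alt_only_reads)
```

Applied to the real read sets, the last function would yield the kind of proportion reported above (two-thirds), although the actual counts are those in Supplemental Table 2.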
Supplemental Table 2 (excel workbook)

Supplemental Figure 2: Alignments of short reads to GRCh37. The top panel shows the MAPT alternate locus scaffold NT_167251.1 aligned to chromosome 17 (NC_000017.10). On the alignment track, gray indicates regions of perfect match, red shows regions of mismatches and blue shows regions of insertions/deletions. Below the alignment track is a histogram of the NA12156 alignments to NT_167251.1. Below this is the gene annotation on NT_167251.1. The yellow highlight shows the region that is expanded in the bottom panel. The bottom panel shows a zoomed-in view of the region highlighted in the top panel. This region contains sequence not present on the chromosome as well as an inversion breakpoint. The arrow marks the region of the inversion breakpoint. Reads to the right of the arrow may also align to the chromosome sequence, as the alternate locus scaffold and chromosome sequence are identical in this region. Reads to the left of the arrow align only to the alternate locus scaffold, as this sequence is not represented on the chromosome. Removing the alternate locus scaffold causes two-thirds of the alternate-locus-specific reads to mis-map to a location on the primary assembly.