Publication Supplemental Data

Below is the supporting information for the alignments described in the GRC publication "Modernizing Reference Genome Assemblies" in PLoS Biology (Jul 5, 2011).

YH1 alignments
Next Generation Sequencing Alignments

Alignment of YH1 'novel' scaffolds to GRCh37.p2

We started with the 5.1 Mb of sequence defined as 'novel' in Li et al., 2010 (specifically, Supplemental Data set 1). We aligned this sequence to both NCBI36 and GRCh37.p2 using an in-house alignment tool called the 'NG-aligner'.

Alignment Details

This BLAST-based tool includes a merge function that enables nearby fragmented BLAST hits to be combined into a larger alignment, and results from multiple alignment passes using different filters, parameters and/or targets to be consolidated into a final results set. All sequences were initially aligned to both assemblies using the following set of BLAST parameters:
Pass 1: filter = Window Mask; word size = 32; e-value = 0.0001; soft masking = true; best hit overhang = 0.1; best hit score edge = 0.1.

Any query sequence that did not have an alignment in pass 1 with an ungapped percent identity ≥ 90.0%, percent coverage ≥ 90.0% and gapped alignment length/ungapped alignment length ≤ 3.0 underwent a second round of alignment to the target reference assembly using the following parameters:
Pass 2: filter = DUST; word size = 80; e-value = 0.0001; soft masking = true; best hit overhang = 0.1; best hit score edge = 0.1.

At the conclusion of pass 2, only sequences that had an alignment with ungapped percent identity ≥ 90.0%, percent coverage ≥ 90.0% and gapped alignment length/ungapped alignment length ≤ 3.0 were considered to be represented in the target genome.

Query sequences not represented in GRCh37.p2 (per the aforementioned criteria) were subsequently aligned to the NCBI NR database using the following parameters:
NR pass: filter = DUST; word size = 28; e-value = 0.0001; soft masking = true; best hit overhang = 0.1; best hit score edge = 0.1; max target seqs = 3.

For analysis purposes, the e-value threshold for acceptable alignments to non-human sequences was set at 1e-20.
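The acceptance criteria above can be written down as a small predicate. The following Python sketch is illustrative only and is not the NG-aligner code: the `Alignment` record, its field names and the helper functions are hypothetical, and only the three thresholds come from the text.

```python
from dataclasses import dataclass

# Thresholds taken from the text above.
MIN_IDENTITY = 90.0      # ungapped percent identity
MIN_COVERAGE = 90.0      # percent of the query covered
MAX_LENGTH_RATIO = 3.0   # gapped length / ungapped length


@dataclass
class Alignment:
    """Hypothetical summary of one merged alignment for a query sequence."""
    ungapped_pct_identity: float  # 0-100
    pct_coverage: float           # 0-100
    gapped_length: int            # alignment length including gaps
    ungapped_length: int          # alignment length excluding gaps


def is_represented(aln: Alignment) -> bool:
    """True if the alignment meets all three criteria from the text."""
    if aln.ungapped_length <= 0:
        return False
    ratio = aln.gapped_length / aln.ungapped_length
    return (aln.ungapped_pct_identity >= MIN_IDENTITY
            and aln.pct_coverage >= MIN_COVERAGE
            and ratio <= MAX_LENGTH_RATIO)


def needs_second_pass(pass1_alignments: list) -> bool:
    """A query goes to pass 2 when no pass-1 alignment qualifies."""
    return not any(is_represented(a) for a in pass1_alignments)
```

Under these assumptions, a query with no qualifying alignment after pass 2 would be the one forwarded to the NR search.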

We identified 112 Kb of sequence that aligned to NCBI36 and 1.49 Mb of sequence that aligned to GRCh37.p2, using a cut-off of 90% coverage and 90% identity (see methods). We then used BLAST to align the remaining sequence to the NCBI NR database. Using an e-value cut-off of 1e-20, we identified 1.56 Mb of sequence that aligned to primate sequences, 1.39 Mb of which was human. An additional 163 Kb of sequence aligned to non-primate sequences, suggesting a low level of contamination in this data set; this number is in general agreement with the 152 Kb of contamination reported in Alkan et al., 2010 (Figure 3). Notably, GRCh37.p2 not only allows for improved identification of 'novel' YH1 sequence, but also enables more than 95% of this sequence to be put into a chromosome context. The YH1 sequences that do not align to GRCh37.p2, but that do align to human clone sequences, are now being reviewed for inclusion in future reference assembly releases. In many cases, a single clone sequence will capture several of the novel sequences. For example, 16 YH1 sequences, accounting for 24.8 Kb, align to AC161429.3, a fosmid clone assigned to chr20 (Supplemental Table 1). However, inclusion of these sequences in the reference assembly is not automatic, as they must be reviewed for quality and, in some cases, may require additional sequencing. We are also investigating the remaining sequence with no high-quality alignment to GRCh37.p2 or to NR to determine whether any of these data merit inclusion in future assembly releases.

Supplemental Table 1 (excel workbook)

Supplemental Figure 1: YH1 'novel' sequence alignment breakdown. Almost two-thirds of the sequence identified as 'novel' by Li et al. can be accounted for. 1.49 Mb of the novel sequence aligns to a sequence in GRCh37.p2, while another 1.56 Mb aligns to some primate sequence, 1.39 Mb of which is human clone-based sequence. These sequences are currently being reviewed for inclusion in future assembly updates. A small fraction of the sequence appears to be contamination, leaving 1.9 Mb of sequence that remains to be characterized. Experiments are ongoing to confirm that these assembled sequences represent human sequence.
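The "almost two-thirds" and "1.9 Mb" figures in the caption can be checked from the reported totals. The sketch below assumes the three categories do not overlap, which follows from the NR search having been run only on sequences not represented in GRCh37.p2; the variable names are illustrative.

```python
# Reported values, in Mb (from the text and caption above).
TOTAL_NOVEL_MB = 5.1
ALIGNED_GRCH37P2_MB = 1.49
ALIGNED_PRIMATE_NR_MB = 1.56     # includes the 1.39 Mb of human clone sequence
ALIGNED_NON_PRIMATE_MB = 0.163   # likely contamination

# Assumption: the three categories are disjoint.
accounted_mb = (ALIGNED_GRCH37P2_MB
                + ALIGNED_PRIMATE_NR_MB
                + ALIGNED_NON_PRIMATE_MB)
remaining_mb = TOTAL_NOVEL_MB - accounted_mb
accounted_fraction = accounted_mb / TOTAL_NOVEL_MB

print(f"accounted for: {accounted_mb:.2f} Mb ({accounted_fraction:.0%})")
print(f"uncharacterized: {remaining_mb:.2f} Mb")
```

This gives roughly 3.2 Mb accounted for (about 63% of 5.1 Mb) and about 1.9 Mb uncharacterized, consistent with the caption.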

Alignment of Next Generation Sequencing (NGS) reads to GRCh37

We selected NA12156 and NA12878 (SRA accessions ERX000125 and ERX000080, respectively) and aligned their reads to GRCh37 using an algorithm called srprism (Agarwala et al., in preparation). This aligner can consider the placement information associated with the assembly and is thus 'alternate locus aware'. Because of this feature, reads with 'multiple' alignments that correspond to hits on both the primary assembly unit and alternate loci can be recognized as such and are not assigned a depressed mapping score.
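To illustrate what 'alternate locus aware' can mean in practice, the following Python sketch counts the distinct genomic loci a read hits, treating a hit on an alternate locus scaffold as the same locus as a primary-assembly hit that falls inside that scaffold's placement region. This is not the srprism algorithm, which is not described here; the `Hit` and `Placement` records, the function name and the coarse region-overlap rule are all assumptions made for the example.

```python
from typing import Dict, List, NamedTuple


class Hit(NamedTuple):
    """Hypothetical read alignment: sequence name plus coordinates."""
    seq: str
    start: int
    end: int


class Placement(NamedTuple):
    """Hypothetical placement of an alternate scaffold on a chromosome."""
    chrom: str
    start: int
    end: int


def count_distinct_loci(hits: List[Hit],
                        placements: Dict[str, Placement]) -> int:
    """Count loci, collapsing alt-scaffold hits onto overlapping primary hits.

    `placements` maps alternate scaffold names to their chromosome region.
    An alt-scaffold hit adds a locus only when no primary hit lies in the
    scaffold's placement region (a deliberately coarse rule).
    """
    primary = [h for h in hits if h.seq not in placements]
    count = len(primary)
    for h in hits:
        region = placements.get(h.seq)
        if region is None:
            continue  # primary hits were already counted
        same_locus = any(
            p.seq == region.chrom and p.start < region.end and region.start < p.end
            for p in primary
        )
        if not same_locus:
            count += 1
    return count
```

With this rule, a read that hits both a chromosome and the alternate scaffold placed over that position counts as one locus, so a mapping score derived from the locus count would not be lowered by the alternate representation.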

For our analyses, we performed two sets of alignments: one against the full GRCh37 assembly and a second using only the GRCh37 primary assembly (thus omitting the alternate loci). Alignment parameters allowed no more than 2 alignment discrepancies (mismatches or insertions/deletions) per read alignment over the entire length of the read. When scoring a set of paired-end reads, the pair retained the score of the lowest-scoring member. Importantly, despite the small number of regions associated with alternate loci in GRCh37, we were able to identify sequence reads that aligned only to the alternate locus scaffolds when these scaffolds were present (Supplemental Figure 2). Furthermore, even with strict alignment parameters, removal of the alternate locus scaffolds resulted in two-thirds of these sequence reads aligning to another region of the assembly, and thus being mis-mapped (Supplemental Table 2). These data clearly demonstrate that inclusion of alternate representations for genomic loci can improve alignment quality and avoid spurious variation calls.
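The comparison described above can be sketched in Python as follows. This is a minimal illustration, not the analysis code used for Supplemental Table 2: the function names and the input layout (read ID mapped to the set of sequence names it aligned to in each run) are assumptions.

```python
from typing import Dict, Set

MAX_DISCREPANCIES = 2  # mismatches plus insertions/deletions, per the text


def read_passes(mismatches: int, indels: int) -> bool:
    """True if a read alignment is within the discrepancy limit."""
    return mismatches + indels <= MAX_DISCREPANCIES


def pair_score(score_a: int, score_b: int) -> int:
    """A read pair retains the score of its lowest-scoring member."""
    return min(score_a, score_b)


def alt_specific_reads(full_hits: Dict[str, Set[str]],
                       alt_scaffolds: Set[str]) -> Set[str]:
    """Reads whose full-assembly hits are all on alternate locus scaffolds."""
    return {read for read, seqs in full_hits.items()
            if seqs and seqs <= alt_scaffolds}


def mismapped_after_removal(full_hits: Dict[str, Set[str]],
                            primary_hits: Dict[str, Set[str]],
                            alt_scaffolds: Set[str]) -> Set[str]:
    """Alt-specific reads that still align somewhere in the primary-only run."""
    specific = alt_specific_reads(full_hits, alt_scaffolds)
    return {read for read in specific if primary_hits.get(read)}
```

Dividing the size of the second set by the size of the first would give the fraction of alternate-locus-specific reads that mis-map once the alternate scaffolds are removed.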

Supplemental Table 2 (excel workbook)

Supplemental Figure 2: Alignments of short reads to GRCh37. The top panel shows the MAPT alternate locus scaffold NT_167251.1 aligned to chromosome 17 (NC_000017.10). On the alignment track, gray indicates regions of perfect match, red shows regions of mismatch and blue shows regions of insertion-deletion. Below the alignment track is a histogram of the NA12156 alignments to NT_167251.1. Below this is the gene annotation on NT_167251.1. The yellow highlight shows the region that is expanded in the bottom panel. The bottom panel shows a zoomed-in view of the region highlighted in the top panel. This region contains sequence not present on the chromosome as well as an inversion breakpoint. The arrow marks the inversion breakpoint. Reads to the right of the arrow may also align to the chromosome sequence, as the alternate locus scaffold and chromosome sequence are identical in this region. Reads to the left of the arrow align only to the alternate locus scaffold, as this sequence is not represented on the chromosome. Removing the alternate locus scaffold causes two-thirds of the alternate-locus-specific reads to mis-map to a location on the primary assembly.