How NCBI Remaps ClinVar and dbSNP Variants

Remapping (or lifting over) is the process of translating sequence coordinates from one sequence to another. Variant remapping is a specific form of remapping that determines how a variant defined relative to one reference sequence corresponds to a variant described on a different but related reference sequence. Remapping variants is essential to determining whether two such variants are identical. At NCBI, dbSNP and ClinVar perform variant remapping to group submitted variants into "reference variants". They use two Variation Services SPDI methods, "canonical_representative" and "all_equivalent_contextual", as an integral part of this process.

This page is intended for users who are interested in how NCBI computes variant representations on different reference sequences, especially users of NCBI Variation Services.
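As a rough illustration of how the two SPDI methods named above might be addressed over HTTP, the sketch below only builds request URLs and sends nothing. The base URL and the /spdi/{spdi}/{method} path layout are assumptions to verify against the current Variation Services documentation, spdi_method_url is a hypothetical helper, and Seq1:2:GATAT:CC is this page's toy variant rather than a real accession.

```python
from urllib.parse import quote

# Assumed base URL for NCBI Variation Services; check the service docs.
BASE_URL = "https://api.ncbi.nlm.nih.gov/variation/v0"

# The two SPDI methods named in the text.
SPDI_METHODS = ("canonical_representative", "all_equivalent_contextual")


def spdi_method_url(spdi: str, method: str) -> str:
    """Build (but do not send) a request URL for one SPDI method.

    Hypothetical helper: the path layout is an assumption.
    """
    if method not in SPDI_METHODS:
        raise ValueError(f"unsupported method: {method}")
    # Percent-encode the colons so the SPDI stays a single path segment.
    return f"{BASE_URL}/spdi/{quote(spdi, safe='')}/{method}"


if __name__ == "__main__":
    for name in SPDI_METHODS:
        print(spdi_method_url("Seq1:2:GATAT:CC", name))
```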

Alignment Data Sets

One of the basic concepts that support remapping is Alignment Data Sets (ADS). An ADS is a set of sequence-to-sequence alignments that form a graph of possible connections between reference sequences.
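To make the graph idea concrete, here is a minimal sketch that stores alignments as undirected edges and lists every sequence reachable from a starting sequence. The sequence labels are hypothetical placeholders, and treating alignments as undirected, multi-hop connections is an assumption made for illustration, not a statement of how NCBI traverses an ADS.

```python
from collections import deque

# Hypothetical labels; a real ADS would hold sequence accessions.
ALIGNMENTS = [
    ("old_assembly_chr", "new_assembly_chr"),
    ("patch_scaffold", "new_assembly_chr"),
    ("genomic_region", "new_assembly_chr"),
    ("transcript", "genomic_region"),
    ("unrelated_a", "unrelated_b"),
]


def build_graph(alignments):
    """Turn pairwise alignments into an undirected adjacency map."""
    graph = {}
    for left, right in alignments:
        graph.setdefault(left, set()).add(right)
        graph.setdefault(right, set()).add(left)
    return graph


def reachable(graph, start):
    """Breadth-first search for sequences connected to `start`."""
    seen = {start}
    queue = deque([start])
    while queue:
        node = queue.popleft()
        for neighbour in graph.get(node, ()):
            if neighbour not in seen:
                seen.add(neighbour)
                queue.append(neighbour)
    seen.discard(start)
    return seen


if __name__ == "__main__":
    print(sorted(reachable(build_graph(ALIGNMENTS), "transcript")))
```

In this toy graph the transcript reaches the assemblies only through the genomic region, while the two unrelated sequences stay in their own component.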

NCBI has developed a large-scale automated process to compute alignments between different types of reference sequences, such as reference assemblies, genes and transcripts. The assembly-to-assembly alignments that are used by the NCBI Genome Remapping Service, and that are considered the official alignments for assemblies curated by the Genome Reference Consortium, are built as part of this process. dbSNP and ClinVar combine five types of alignments into a single alignment data set. They are:

Assembly-to-Assembly alignments
Alignments of Patches/Alt-loci/PAR to Primary Assembly (see also NCBI Genome Assembly Model)
Alignments of RefSeq-select GenBank transcripts to RefSeqGene (NG)
Old version of RefSeq genomic regions (NG) and transcripts (NM/NR) to current Assembly
Current RefSeq transcripts (NM/NR/XM/XR), RefSeqGene and other RefSeq genomic regions (NG) to the latest Assembly

The Alignment Data Set used for remapping by NCBI's Variation Services is updated every night. This means that the results returned by the Services may change as the underlying sequence alignments in the refreshed ADS change.

Handling of Gaps in Alignment

When two sequences are aligned, one sequence may contain an insertion or deletion (indel) that the other does not. Consider this theoretical example:

Example of variants in relation to an alignment gap

Fig 1. Illustration of indels in alignments. In the alignment between two sequences, Seq1 and Seq2, there is a gap. Seq2 is missing ATA (either Seq1:3:ATA or Seq1:5:ATA, though the alignment specifies one over the other arbitrarily). See the SPDI and SeqInterval Notation page for an explanation of the colon-delimited notation.
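The alignment in Fig 1 can be written down as data. The sketch below is a reconstruction from the coordinates quoted in the examples on this page, not a copy of the figure: the "N" padding stands for positions the text never shows, the pairing of the leading T bases is inferred, and the names SEQ1, SEQ2 and BLOCKS are illustrative.

```python
# Toy sequences rebuilt from the 0-based coordinates used in the examples.
# "N" is padding for positions that the text does not show.
SEQ1 = "NTGATATAC"  # Seq1:3:ATA is the stretch that Seq2 lacks
SEQ2 = "NNNTGTAC"

# Ungapped aligned blocks as (seq1_start, seq2_start, length).
# The gap on Seq1 (positions 3-5) lies between the two blocks.
BLOCKS = [(1, 3, 2), (6, 5, 3)]


def position_map(blocks):
    """Return {seq1_position: seq2_position} for every aligned base."""
    mapping = {}
    for seq1_start, seq2_start, length in blocks:
        for offset in range(length):
            mapping[seq1_start + offset] = seq2_start + offset
    return mapping


def gap_positions(mapping):
    """Seq1 positions inside the aligned span with no Seq2 partner."""
    return [p for p in range(min(mapping), max(mapping) + 1)
            if p not in mapping]


if __name__ == "__main__":
    pos_map = position_map(BLOCKS)
    print(pos_map)
    print(gap_positions(pos_map))
```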

Let's consider a few variants asserted on Seq1, and how they would remap to Seq2.

Variant Spanning Alignment Gap

The remapping algorithm ignores an alignment gap that lies completely within a variant's deletion interval. Consider Seq1:2:GATAT:CC, which completely spans the gap. Seq1:2:G aligns to Seq2:4:G, and likewise Seq1:6:T aligns to Seq2:5:T. The resulting variant on Seq2 is Seq2:4:GT:CC. Thus a 5-nt deletion/2-nt insertion becomes a 2-nt deletion/2-nt insertion.
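A minimal sketch of the spanning case: when both ends of the deletion interval are aligned, only the endpoints need to be translated and the gap between them drops out. POS_MAP and SEQ2 are a toy reconstruction of the example alignment ("N" is padding for unshown positions), remap_spanning is a hypothetical helper, and carrying the inserted sequence over unchanged assumes a same-strand alignment.

```python
# Seq1 position -> Seq2 position for aligned bases (Seq1 3-5 is the gap).
POS_MAP = {1: 3, 2: 4, 6: 5, 7: 6, 8: 7}
SEQ2 = "NNNTGTAC"  # "N" pads positions the text does not show


def remap_spanning(position, deleted, inserted):
    """Remap a variant whose deletion interval has both ends aligned."""
    first = position
    last = position + len(deleted) - 1
    if first not in POS_MAP or last not in POS_MAP:
        raise ValueError("an end of the deletion interval is in the gap")
    new_first, new_last = POS_MAP[first], POS_MAP[last]
    # The deleted sequence is re-read from the target, not copied over.
    new_deleted = SEQ2[new_first:new_last + 1]
    return f"Seq2:{new_first}:{new_deleted}:{inserted}"


if __name__ == "__main__":
    print(remap_spanning(2, "GATAT", "CC"))  # Seq2:4:GT:CC
```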

Variant Abutting Alignment Gap Terminus

Now let's consider a variant whose deletion interval shares an endpoint with the alignment gap, such as Seq1:3:ATATA:TA, which begins at the same position as the gap. Seq1:7:A aligns to Seq2:6:A. However, Seq1:3:A does not align to any nucleotide on Seq2. Moving upstream from the 3' end of the interval, the last nucleotide on Seq2 before the gap is Seq2:5:T. Thus, the variant description on Seq2 will be Seq2:5:TA:TA, the identity variant. This example is notable because the original variant, Seq1:3:ATATA:TA, describes exactly the difference between Seq1 and Seq2, so on Seq2 it reduces to no change at all. Many such cases exist in dbSNP.
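One way to picture this case is to trim the deletion interval inward until both ends sit on aligned bases, then translate the trimmed ends. The sketch below does that on a toy reconstruction of the example alignment and flags the result as an identity variant when the deleted and inserted sequences match. The helper names are hypothetical, and inward trimming is an interpretation of the worked example, not NCBI's documented algorithm.

```python
# Seq1 position -> Seq2 position for aligned bases (Seq1 3-5 is the gap).
POS_MAP = {1: 3, 2: 4, 6: 5, 7: 6, 8: 7}
SEQ2 = "NNNTGTAC"  # "N" pads positions the text does not show


def trim_to_aligned(first, last):
    """Shrink [first, last] until both ends are aligned, else None."""
    while first <= last and first not in POS_MAP:
        first += 1
    while last >= first and last not in POS_MAP:
        last -= 1
    return (first, last) if first <= last else None


def remap_with_trimming(position, deleted, inserted):
    """Return (position, deleted, inserted) on Seq2, or None."""
    trimmed = trim_to_aligned(position, position + len(deleted) - 1)
    if trimmed is None:
        return None
    new_first, new_last = POS_MAP[trimmed[0]], POS_MAP[trimmed[1]]
    return (new_first, SEQ2[new_first:new_last + 1], inserted)


if __name__ == "__main__":
    pos, deleted, inserted = remap_with_trimming(3, "ATATA", "TA")
    print(f"Seq2:{pos}:{deleted}:{inserted}")
    print("identity variant:", deleted == inserted)
```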

Interval of Variant Deletion Terminates Inside Alignment Gap

Let's consider a variant for which one end of the deletion interval falls inside the alignment gap, such as Seq1:5:ATAC:G. Here again, the 3' end of the deletion interval maps to Seq2:7:C, and the last nucleotide on Seq2 before the gap, headed upstream, is Seq2:5:T. Thus the variant description on Seq2 is Seq2:5:TAC:G. However, unlike the previous examples, applying each variant representation to its reference now yields different variant sequences. For Seq1, the resulting sequence is TGAT(G), with the insertion in parentheses. For Seq2, the resulting sequence is TG(G). The notable difference is the undescribed sequence that aligns to the gap. It is not carried over to Seq2, because it is not in the original variant description.
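The difference between the two outcomes can be checked by applying each variant to its own reference. The sketch below uses toy sequences reconstructed from the example coordinates, with "N" as padding for positions the text does not show; apply_spdi is a hypothetical helper, not part of any NCBI library.

```python
SEQ1 = "NTGATATAC"  # "N" pads positions the text does not show
SEQ2 = "NNNTGTAC"


def apply_spdi(sequence, position, deleted, inserted):
    """Replace `deleted` at 0-based `position` with `inserted`."""
    end = position + len(deleted)
    if sequence[position:end] != deleted:
        raise ValueError("deleted sequence does not match the reference")
    return sequence[:position] + inserted + sequence[end:]


if __name__ == "__main__":
    # Strip the padding so the output matches the window shown in the text.
    on_seq1 = apply_spdi(SEQ1, 5, "ATAC", "G").lstrip("N")
    on_seq2 = apply_spdi(SEQ2, 5, "TAC", "G").lstrip("N")
    print(on_seq1)  # TGATG
    print(on_seq2)  # TGG
```

The extra AT in the Seq1 result is the part of the gap that the original variant did not describe.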

Variant Deletion Interval Enclosed Inside Alignment Gap

Finally, let's consider a variant where the deletion interval is entirely enclosed inside an alignment gap, such as Seq1:4:T. No part of that interval (albeit just a single nucleotide) aligns to Seq2. At this time, NCBI does not remap this variant onto Seq2. Seq2 will be completely absent from the result set.
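Putting the four cases together, a remapping sketch only needs the aligned positions inside the deletion interval: if there are none, nothing is returned for Seq2. This is an illustrative model on a toy reconstruction of the example alignment, not NCBI's implementation; it ignores pure insertions and strand changes, and the inserted base "G" in the last call is made up because the text gives only the interval Seq1:4:T.

```python
from typing import Optional

# Seq1 position -> Seq2 position for aligned bases (Seq1 3-5 is the gap).
POS_MAP = {1: 3, 2: 4, 6: 5, 7: 6, 8: 7}
SEQ2 = "NNNTGTAC"  # "N" pads positions the text does not show


def remap(position: int, deleted: str, inserted: str) -> Optional[str]:
    """Remap a Seq1 variant to Seq2, or return None if nothing aligns."""
    aligned = [p for p in range(position, position + len(deleted))
               if p in POS_MAP]
    if not aligned:
        return None  # interval enclosed in the gap: no Seq2 result
    new_first = POS_MAP[aligned[0]]
    new_last = POS_MAP[aligned[-1]]
    return f"Seq2:{new_first}:{SEQ2[new_first:new_last + 1]}:{inserted}"


if __name__ == "__main__":
    examples = [(2, "GATAT", "CC"), (3, "ATATA", "TA"),
                (5, "ATAC", "G"), (4, "T", "G")]
    for position, deleted, inserted in examples:
        print(f"Seq1:{position}:{deleted}:{inserted} ->",
              remap(position, deleted, inserted))
```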
