<!DOCTYPE html>
<html>
<head>
<title>Issues seen in the mouse genome assembly - Genome Reference Consortium</title>
<meta http-equiv="Content-Type" content="text/html; charset=utf-8" />
<meta name="ncbi_app" content="grc" />
<meta name="ncbi_db" content="none" />
<meta name="ncbi_pdid" content="screen_default" />
<meta name="ncbi_pagename" content="mouse-examples" />
<meta name="ncbi_pagetitle" content="Issues seen in the mouse genome assembly" />
<link rel="stylesheet" type="text/css" href="/grc/static/grc/css/grc.css" />
<style>
#menu ul#primary a#helpMenuItem {
border: 1px solid #666;
border-bottom: none;
background: #333;
padding-bottom: 6px;
margin-top: 0;
color: #ff9933;
text-decoration: none;
}
#menu ul#secondary li a#help-mouse-examplesMenuItem {
color: #EFEFEF;
}
</style>
<link rel="stylesheet" type="text/css" href="/core/jig/1.14.8/css/jig.min.css" />
<script type="text/javascript" src="/core/jig/1.14.8/js/jig.min.js"></script>
<script type="text/javascript" src="/grc/static/grc/js/grc.js"></script>
</head>
<body>
<noscript>
<p>
<b>Warning:</b> this web site requires JavaScript to function. <a href="/guide/browsers#js_settings">more...</a>
</p>
</noscript>
<div id="header-container">
<a href="/grc">
<div id="header">
<img alt="GRC logo" src="/grc/static/grc/img/GRC_logo_reasonably_small.png" /><img src="/grc/static/grc/img/TitleBanner.png" alt="Genome Reference Consortium" /></a>
</div>
</div>
<div id="page">
<a id="skipnav" href="#main">Skip navigation and go to main content</a>
<div id="menu">
<ul id="primary">
<li><a href="/grc" title="GRC home" id="homeMenuItem">GRC Home</a></li>
<li><a href="/grc/data" title="Data" id="dataMenuItem">Data</a>
<ul id="secondary">
<li><a href="/grc/help" title="Help overview" id="help-helpMenuItem">Overview</a></li>
<li><a href="/grc/help/definitions" title="GRC definitions" id="help-definitionsMenuItem">Definitions</a></li>
<li><a href="/grc/help/faq" title="Frequently Asked Questions" id="help-faqMenuItem">FAQ</a></li>
<li><a href="/grc/help/patches" title="Patches tutorial" id="help-patchesMenuItem">Patches</a></li>
<li><a href="/grc/help/human-examples" title="Examples of human regions" id="help-human-examplesMenuItem">Human Region Examples</a></li>
<li><a href="/grc/help/mouse-examples" title="Examples of mouse regions" id="help-mouse-examplesMenuItem">Mouse Region Examples</a></li>
<li><a href="/grc/help/workshops" title="Workshops and presentation slides" id="help-workshopsMenuItem">Workshops</a></li>
</ul>
</li>
<li><a href="/grc/help" title="Information and help" id="helpMenuItem">Help</a></li>
<li><a href="/grc/report-an-issue" title="Report a problem" id="reportAnIssueMenuItem">Report an Issue</a></li>
<li><a href="/grc/contact-us" title="contact us" id="contactUsMenuItem">Contact Us</a></li>
<li><a href="/grc/credits" title="credits" id="creditsMenuItem">Credits</a></li>
<li><a href="/projects/genome/assembly/grc/curation" title="Curators only, authentication required" id="curatorsOnlyMenuItem">Curators Only</a></li>
</ul>
</div><!--end menu-->
<div id="main"><a name="main"></a>
<div id="mouse-examples">
<div id="contents" style="overflow:hidden;">
<div id="grc-cms-content">
<h1 id="issues-seen-in-the-mouse-genome-">Issues seen in the mouse genome assembly</h1>
<p>Potential problems in the genome assembly need to be tracked. We use a centralized tracking system to store information about these issues. Issues fall into the following categories:</p>
<ul>
<li>Unknown: unclear what the problem is without further investigation.</li>
<li>Clone Problem: the issue is contained within a single clone.</li>
<li>Gap: the issue is associated with a known gap in the assembly.</li>
<li>Path Problem: the data supporting the issue suggest there is a problem with the tiling path.</li>
<li>Variation: the data supporting the issue suggest there is no error, but that there is extensive allelic variation within the region.</li>
<li>Missing sequence: sequence has been identified that does not map to the reference assembly.</li>
</ul>
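<p>As a minimal sketch, the categories above could be represented in code as an enumeration. The class names, the <code>Issue</code> record, and the category assigned to each sample issue are hypothetical illustrations, not the schema or contents of the actual tracking system.</p>
<pre>
```python
from dataclasses import dataclass
from enum import Enum


class IssueCategory(Enum):
    """The six issue categories listed above (names are illustrative)."""
    UNKNOWN = "Unknown"
    CLONE_PROBLEM = "Clone Problem"
    GAP = "Gap"
    PATH_PROBLEM = "Path Problem"
    VARIATION = "Variation"
    MISSING_SEQUENCE = "Missing sequence"


@dataclass
class Issue:
    """Hypothetical issue record; a new issue starts as UNKNOWN until investigated."""
    summary: str
    category: IssueCategory = IssueCategory.UNKNOWN


# Sample issues with assumed categories, for illustration only.
issues = [
    Issue("Dhrsx placed on a chr 4 unlocalized scaffold"),
    Issue("Ear3 not found in the reference assembly", IssueCategory.MISSING_SEQUENCE),
]

# Group issue summaries by category.
by_category = {}
for issue in issues:
    by_category.setdefault(issue.category, []).append(issue.summary)

for category, summaries in by_category.items():
    print(category.value, summaries)
```
</pre>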
<p>It is not expected that many problem joins within the mouse assembly will fall into the category of variation. However, we do expect issues where there is substantial allelic variation between strains. Below are some examples of issues currently being tracked for the mouse genome.</p>
<ul>
<li><a href="#EX1">Example 1: Possible chromosome placement error</a></li>
<li><a href="#EX2">Example 2: Sequence missing from reference assembly</a></li>
<li><a href="#EX3">Example 3: Possible missed join in the assembly</a></li>
<li><a href="#EX4">Example 4: Base difference between transcript sequence and genomic sequence</a></li>
</ul>
<div class="examples">
<h3 id="example-1-possible-chromosome-pl"><span id="EX1"/>Example 1: Possible chromosome placement error</h3>
<p>The issue: In the current mouse build (37) the gene Dhrsx (dehydrogenase/reductase (SDR family) X-linked) is placed in a contig on a chr 4 unlocalized scaffold, apparently on the strength of two adjacent markers, not contained within the gene, which map to chr 4. In human, this gene is in the X p-arm pseudo-autosomal region, which is missing from the mouse genome. Do we have any convincing reason to think this gene has moved off X onto 4 in mouse?</p>
<p>Approaches to this issue? <a title="Show details" class="ex_toggle" id="ex1" href="#"><span id="ex1_toggle">(details...)</span></a></p>
<div class="hide_list" id="ex1_data">
<p>A query in the NCBI Map Viewer shows that this mouse gene maps to a chr4 unlocalized scaffold in the reference, as reported by the user. In the Celera assembly, the gene maps to the assembled chromosome (<a href="/projects/mapview/map_search.cgi?taxid=10090&amp;build=current&amp;advsrch=off&amp;query=Dhrsx">view search results</a>). However, this is not sufficient to confirm that this is the actual location of the gene. Clicking on the 'Genes_seq' link in the tabular results (below the graphic) takes you to a graphical view of the unplaced scaffold. The green color indicates that this is a WGS contig; further investigation is needed.</p>
<p>Typically, a mapping experiment would be required to resolve this. Clicking on the MGI link associated with the Dhrsx gene takes you to the MGI page. From here it is clear that at least four mapping experiments have been performed and are available at MGI. All four of these point to a chromosome 4 location, suggesting that the chromosome assignment in Build 37 is correct. Additionally, the mapping data from experiment 2 (as well as the Celera assembly) suggest that this gene maps in the subtelomeric region of chromosome 4. Attempts will be made to connect this sequence with data on distal chromosome 4 in an effort to place this sequence on the chromosome for Build 38.</p>
</div>
</div>
<div class="examples">
<h3 id="example-2-sequence-missing-from-"><span id="EX2"/>Example 2: Sequence missing from reference assembly</h3>
<p>The issue: There is a cluster of Ear* genes on chr14 (and genes/pseudogenes on chr10) and it appears that Ear3 is one of the missing genes in build 36.1. Its best hit is to the Ear2 locus on BAC component AC167013.4 and this is the best genomic hit along with BAC AC156790.2 at this time. The same situation holds in build 37.</p>
<p>Approaches to this issue? <a title="Show details" class="ex_toggle" id="ex2" href="#"><span id="ex2_toggle">(details...)</span></a></p>
<div class="hide_list" id="ex2_data">
<p>Start with Entrez Gene to learn more about this particular gene (<a href="/sites/entrez?Db=gene&amp;Cmd=ShowDetailView&amp;TermToSearch=53876&amp;ordinalpos=2&amp;itool=EntrezSystem2.PEntrez.Gene.Gene_ResultsPanel.Gene_RVDocSum">Ear3</a>). Although the gene is missing from the reference assembly, there is a RefSeq transcript sequence for it (<a href="/entrez/viewer.fcgi?db=nucleotide&amp;qty=1&amp;c_start=1&amp;list_uids=NM_017388.1&amp;uids=&amp;dopt=fasta&amp;dispmax=5&amp;sendto=&amp;fmt_mask=0&amp;from=begin&amp;to=end">NM_017388</a>). We can attempt to align this sequence to the genome using the <a href="/genome/seq/BlastGen/BlastGen.cgi?taxid=10090&amp;db=ref_contig&amp;pgm=mbn&amp;EXPECT=2&amp;DESCRIPTIONS=3&amp;ALIGNMENTS=3">Mouse BLAST page</a>. Copy the accession number into the search box and check that the selected database is 'genome (reference only)'. Once the BLAST search has completed, you can see that there is an alignment to a single sequence. Clicking on the sequence identifier (the line that looks like: ref|NT_039606.7|Mm14_39646_37) will take you to a graphical view of the BLAST result in Map Viewer. A quick glance at the results shows that the entire 471 bp sequence aligns to multiple locations on this sequence, but none of the alignments is perfect. The best hit is at the Ear2 locus (as suggested by the ticket). Use the Maps&amp;Options menu to add the 'Component' map. When this is added, it is clear that the BLAST hit (and the Ear2 locus) is contained within a single clone, and that the path within this region looks complete.</p>
<p>Based on the available data, there are two possibilities: 1) the BAC clone containing this locus is deleted and 2) the Ear2 gene is polymorphic and not found in the C57BL/6J genome. Currently, there is not enough data to resolve these two possibilities. One approach would be to design a genotyping assay for the Ear2 gene and experimentally determine whether this sequence is contained within the C57BL/6J genome.</p>
</div>
</div>
<div class="examples">
<h3 id="example-3-possible-missed-join-i"><span id="EX3"/>Example 3: Possible missed join in the assembly</h3>
<p>The transcript evidence for Atg4a, including the RefSeq, spans a gap in build 36.1, but I believe this may be a missed join rather than an assembly gap. The complete transcript sequence is represented (no exons fall into the gap) on both the reference genome and the Celera assembly. Celera includes some unplaced NW contigs between the two WGS contigs where the transcripts align on either side of the gap, and this may have led to an artifactual assembly gap in C57BL/6, so I've stored this as a missed join in case the gap can be closed in the reference assembly. The problem persists in build 37.1. Related Accessions: NM_174875.3: affected transcript, CR478060.2: borders gap, CT025930.4: borders gap</p>
<p>Approaches to this issue? <a title="Show details" class="ex_toggle" id="ex3" href="#"><span id="ex3_toggle">(details...)</span></a></p>
<div class="hide_list" id="ex3_data">
<p>The first step is to double-check the alignment of NM_174875.3 to the assembly. This can be done using a tool called <a href="/sutils/splign/splign.cgi?textpage=online&amp;level=form">SPLIGN</a>. SPLIGN is an alignment tool specifically tuned to find the discontiguous alignments generated by aligning cDNAs to a genomic sequence. It optimizes the selected alignments for usage of consensus splice sites. In the cDNA text box, enter NM_174875.3 (this specifies the exact version of NM_174875 you wish to use; if you want the latest version, omit the version number and enter only the accession). Below the genomic text box is a pull-down menu labeled 'Whole Genome'; select 'Mus musculus' from this menu and click the 'Align' button. We see that all but 76 base pairs of this transcript are aligned to a single contig. To determine whether these 76 bp align across the gap, you can use the Mouse BLAST page to align the cDNA to the reference assembly.</p>
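<p>The "all but 76 base pairs" observation can be checked from the segment coordinates that an alignment tool reports. The sketch below is one simple way to do that arithmetic; the transcript length and segment coordinates are hypothetical placeholders chosen to leave 76 bases uncovered, not the actual SPLIGN output for NM_174875.3.</p>
<pre>
```python
def uncovered_intervals(length, segments):
    """Return 1-based inclusive transcript intervals not covered by any aligned segment."""
    gaps = []
    next_base = 1  # first transcript base not yet known to be covered
    for start, end in sorted(segments):
        if start - next_base >= 1:
            gaps.append((next_base, start - 1))
        next_base = max(next_base, end + 1)
    if length - next_base >= 0:
        gaps.append((next_base, length))
    return gaps


def total_bases(intervals):
    """Sum the lengths of 1-based inclusive intervals."""
    return sum(end - start + 1 for start, end in intervals)


# Hypothetical values for illustration only (not real alignment output).
transcript_length = 1000
aligned_segments = [(1, 300), (301, 620), (697, 1000)]

gaps = uncovered_intervals(transcript_length, aligned_segments)
print(gaps, total_bases(gaps))
```
</pre>
<p>With these placeholder coordinates the single uncovered interval is 76 bases long; that interval is the part of the cDNA you would then try to align with BLAST.</p>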
<p>To further investigate this issue, you can take the two genomic accessions in the ticket and try to align them to each other using the BLAST 2 Sequences ( <a href="https://blast.ncbi.nlm.nih.gov/bl2seq/wblast2.cgi">BL2Seq</a>) page. This is not the exact algorithm used to align components for the genome assembly, but it should find the correct alignment if it is there. How does the alignment look between these two sequences?</p>
<p>In looking at the GenBank records for these sequences you can see that <a href="/entrez/viewer.fcgi?val=78039376">CT025930.4</a> is the sequence of a BAC clone, but the record is only 6900 nucleotides long, and <a href="/entrez/viewer.fcgi?cmd=Retrieve&amp;db=Nucleotide&amp;list_uids=CR478060.2">CR478060.2</a> is the sequence of a fosmid clone, but the record is only 13139 bases long. In some cases, sequencing centers only finish what they consider to be the unique part of the clone. In these cases, the final sequence record contains only part of the clone insert sequence. If you step back through previous versions of the records, you can see that earlier versions are longer. What happens if you use previous versions of these accessions in the BL2Seq program?</p>
</div>
</div>
<div class="examples">
<h3 id="example-4-base-difference-betwee"><span id="EX4"/>Example 4: Base difference between transcript sequence and genomic sequence</h3>
<p>AC161414.3 has an extra 'G' at nt 31552 compared to NM_001134741.1 and supporting transcripts (AK132674.1, BC099972.1, DV655037.1), the Celera assembly (AAHY01000304.1), and orthologs (e.g. rat BC168218.1 and human AAI30533.1). No other BACs are currently available to improve the reference assembly. The indel occurs at bp 217-218 in the mRNA (158,232,554 on NC_000067.5).</p>
<p>Approaches to this issue? <a title="Show details" class="ex_toggle" id="ex4" href="#"><span id="ex4_toggle">(details...)</span></a></p>
<div class="hide_list" id="ex4_data">
<p>First, check Entrez Gene to determine the source of the supporting cDNAs. Because cDNA sequences can come from sources other than C57BL/6J, it is important to ensure that the differences being observed are not due to valid strain variation. In this case, there is clearly support for this gene being present in C57BL/6J. In fact, the current RefSeq (<a href="/entrez/viewer.fcgi?val=NM_001134741.1">NM_001134741</a>) is derived from this strain.</p>
<p>The next step is to confirm there is a sequence difference. Once again, use <a href="/sutils/splign/splign.cgi?textpage=online&amp;level=form">SPLIGN</a> to align the RefSeq accession to the mouse genome. As you step through the segments you can review the alignment to look for differences. Do you see any?</p>
<p>Most of the traces for the mouse genome are available in the <a href="/Traces/trace.cgi?">Trace Archive</a>. You can interrogate the traces using BLAST from the <a href="/genome/seq/BlastGen/BlastGen.cgi?taxid=10090&amp;db=Mus_musculus_WGS&amp;pgm=mbn&amp;EXPECT=2&amp;DESCRIPTIONS=3&amp;ALIGNMENTS=3">Mouse BLAST page</a>. Check to make sure the 'Traces-WGS' database is selected. The depth of coverage across the mouse genome is high, so you would expect many hits to be returned. To facilitate viewing these, click the 'Formatting Options' link at the top of the BLAST results page, and then select 'Flat query-anchored with dots for identities'. This will place the query sequence as the first line in each section, and all hits for that part of the sequence will line up beneath it in something that may resemble a multiple sequence alignment (although it is not one). Dots will be shown in places where the sequence is identical, and only the differences will be shown as bases. Do the majority of the traces support the sequence in the RefSeq transcript or in the genome? You can also take just this region of the assembly (by selecting the component contributing to the assembly at this location) and use this to interrogate the traces.</p>
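<p>If you copy a block of the query-anchored view into a script, the question of which sequence the majority of traces support can be reduced to counting characters in one column. The following is a simplified sketch under stated assumptions: rows are already trimmed to the aligned sequence text, a dot means identity with the query, a dash means a gap in the hit, and a space means the hit does not cover that column. The query and rows are hypothetical toy data, not real trace hits.</p>
<pre>
```python
from collections import Counter


def expand_row(query, row):
    """Replace '.' (identity) in a hit row with the query base at that column."""
    return "".join(q if r == "." else r for q, r in zip(query, row))


def column_support(query, rows, column):
    """Count the characters seen at one 0-based alignment column across hit rows.

    Rows that do not cover the column (shown as a space) are skipped.
    """
    counts = Counter()
    for row in rows:
        expanded = expand_row(query, row)
        if column in range(len(expanded)) and expanded[column] != " ":
            counts[expanded[column]] += 1
    return counts


# Hypothetical toy data: the query stands in for the genomic component,
# and column 4 stands in for the position of the suspected extra base.
query = "ACGTGGCA"
rows = [
    "........",
    "....-...",
    "....-...",
    "  ..-...",
]

support = column_support(query, rows, 4)
print(support.most_common())
```
</pre>
<p>In this toy input three of the four rows show a gap at the column of interest, which would suggest that most reads do not support the extra base in the query. Real results should still be reviewed by eye, since trace quality varies along each read.</p>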
</div>
</div>
</div>
<script>
if (jQuery) {
jQuery(function(){
(function($){
$('#grc-cms-content table').ncbigrid();
})(jQuery);
})
}
</script>
</div>
</div>
<div class="cleaner"></div>
<div id="footer">
<ul>
<li><a title="Get GRC data via FTP" href="https://ftp.ncbi.nlm.nih.gov/pub/grc/">FTP</a></li>
<li><a target="_blank" href="https://www.genome.gov">NHGRI</a></li>
<li><a target="_blank" href="http://www.wellcome.ac.uk">Wellcome Sanger Institute</a></li>
<li><a target="_blank" href="https://www.hhs.gov">HHS</a></li>
<li><a target="_blank" href="https://www.nih.gov">NIH</a></li>
<li><a target="_blank" href="https://www.nih.gov/web-policies-notices">Accessibility</a></li>
<li class="last"><a target="_blank" href="https://www.hhs.gov/vulnerability-disclosure-policy/index.html">HHS Vulnerability Disclosure</a></li>
</ul>
</div>
</div>
</div><!-- end page -->
<script type="text/javascript" src="/portal/portal3rc.fcgi/rlib/js/InstrumentNCBIBaseJS/InstrumentPageStarterJS.js"></script>
</body>
</html>