<!DOCTYPE HTML PUBLIC "-//W3C//DTD HTML 4.01 Transitional//EN">
<html>
<head>
  <meta http-equiv="Content-Type" content="text/html; charset=utf-8" />
  <meta http-equiv="Content-Style-Type" content="text/css" />
  <title>BankIt Submission Help: Alignment Files</title>
  <style type="text/css">
      code{white-space: pre-wrap;}
  </style>
      <link rel="stylesheet" href="../../css/bankit.13.6.css"  type="text/css">
    <link rel="stylesheet" type="text/css" href="../../css/sp_3_74_ncbi_header.13.6.css">
    <link rel="stylesheet" type="text/css" href="../../css/sp_1_82_layout.13.6.css">
</head>
<body class="help">
    <header id="ncbi_header" class="ncbi-header" role="banner">
        <div class="usa-grid">
            <div class="usa-width-one-whole">
                <div class="ncbi-header__logo">
                    <a href="https://www.ncbi.nlm.nih.gov/" class="logo" aria-label="NCBI Logo"
                       data-ga-action="click_image" data-ga-label="NIH NLM Logo">
                       <img src="https://www.ncbi.nlm.nih.gov/coreutils/nwds/img/logos/AgencyLogo.svg" alt="NIH NLM Logo">
                    </a>
                </div>
            </div>
        </div>
    </header>
<div id="header">
  <h1 class="title">BankIt Submission Help: Alignment Files</h1>
</div>
<div id="TOC">
  <p>Table of Contents:</p>
  <ul>
    <li><a href="#when-and-why">When and Why should I submit an alignment to GenBank?</a></li>
    <li><a href="#format-list">Which alignment formats are accepted?</a></li>
    <li><a href="#how-to-submit">How do I submit an alignment?</a></li>
    <li><a href="#sequence-ids">What is a sequence_ID and how do I format my sequence_IDs?</a></li>
    <li><a href="#format-guidelines">What are the guidelines for each alignment format?</a></li>
  </ul>
</div>
<h1 id="when-and-why">When and Why should I submit an alignment to GenBank?</h1>
<p>If you are submitting multiple sequences from the same locus or region, you may submit the sequences to GenBank as an alignment.</p>
<p>Reasons to submit sequences as an alignment:</p>
<ol style="list-style-type: decimal">
  <li><p>You will be given the option to use Feature Propagate to annotate features in your submission. Feature Propagate lets you annotate just one sequence and then applies those features to the other sequences in the alignment automatically.</p></li>
  <li><p>Alignments that pass quality checks are made available in the <a target="_blank" href="https://www.ncbi.nlm.nih.gov/popset">PopSet database</a> after processing by NCBI staff. Sequences are also retrievable in the <a target="_blank" href="https://www.ncbi.nlm.nih.gov/nuccore/">Nucleotide database</a> by individual Accession numbers.</p></li>
</ol>
<p>Note that individual sequences in the alignment receive Accession numbers after review by NCBI staff. The alignment itself does not receive an Accession number.</p>
<h1 id="format-list">Which alignment formats are accepted?</h1>
<p>BankIt currently accepts the following alignment formats:</p>
<ul>
  <li><a href="#fasta-gap-format">Fasta+Gap</a></li>
  <li><a href="#nexus-format">Nexus</a></li>
  <li><a href="#phylip-format">Phylip</a></li>
  <li><a href="#clustal-format">Clustal(w)</a></li>
</ul>
<p>BankIt does not yet support submission of alignments that include
accessioned sequences already present in the GenBank/ENA/DDBJ databases. Contact
<a target="_blank" href="mailto:gb-admin@ncbi.nlm.nih.gov?subject=BankIt+alignment+help">gb-admin@ncbi.nlm.nih.gov</a>
if you have questions about this.</p>
<h1 id="how-to-submit">How do I submit an alignment?</h1>
<p>Start a submission in BankIt. Proceed through the forms, providing the requested information until you arrive at the Nucleotide page. On the Nucleotide page, specify that you are importing one of the accepted alignment types by selecting the Alignment radio button. Upload your alignment file in one of the <a href="#format-list">acceptable formats</a>.</p>
<h1 id="sequence-ids">What is a sequence_ID and how do I format my sequence_IDs?</h1>
<p>The sequence_ID identifies the same sample throughout all steps of the submission. The sequence_IDs in your alignment must conform to the following rules:</p>
<ul>
  <li>Each sequence must have a sequence_ID that is unique within the alignment file.</li>
  <li>Sequence_IDs must not contain spaces.</li>
  <li>Shorter sequence_IDs are preferred. It is recommended to limit the length of the sequence_ID to fewer than 25 characters. An error will be produced if the sequence_ID is too long (&gt;50 characters).</li>
  <li>It is recommended that you use only alphanumeric characters (letters and digits) in the sequence_ID. The only other characters allowed are hyphens (<code>-</code>), underscores (<code>_</code>), periods (<code>.</code>), colons (<code>:</code>), asterisks (<code>*</code>), and number signs (<code>#</code>). Do not begin a sequence_ID with a <code>#</code>.</li>
</ul>
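<p>The rules above can be checked before upload with a short script. The following is a minimal sketch, not BankIt's own validation: the function name <code>check_sequence_ids</code> and the split into errors and warnings are assumptions made for illustration.</p>

```python
import re

# Characters the rules above list as allowed in a sequence_ID.
ALLOWED = re.compile(r"[A-Za-z0-9\-_.:*#]+")
MAX_LEN = 50          # IDs longer than this produce an error
RECOMMENDED_LEN = 25  # fewer than 25 characters is recommended


def check_sequence_ids(ids):
    """Return (errors, warnings) as lists of messages for a list of IDs."""
    errors, warnings = [], []
    seen = set()
    for seq_id in ids:
        if seq_id in seen:
            errors.append(f"{seq_id}: duplicate sequence_ID")
        seen.add(seq_id)
        if not ALLOWED.fullmatch(seq_id):
            errors.append(f"{seq_id}: contains a space or disallowed character")
        if seq_id.startswith("#"):
            errors.append(f"{seq_id}: must not begin with '#'")
        if len(seq_id) > MAX_LEN:
            errors.append(f"{seq_id}: longer than {MAX_LEN} characters")
        elif len(seq_id) >= RECOMMENDED_LEN:
            warnings.append(f"{seq_id}: {RECOMMENDED_LEN} or more characters")
    return errors, warnings
```

<p>Calling <code>check_sequence_ids</code> with the sequence_IDs from your alignment returns two lists; an empty error list suggests the IDs follow the rules listed here, although BankIt may apply additional checks.</p>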
<h1 id="format-guidelines">What are the guidelines for each alignment format?</h1>
<h2 id="fasta-gap-format">FASTA+GAP Format for Aligned Nucleotide Sequences</h2>
<p>The sequence alignment software that you are using may have an option to output your alignment in the FASTA format. To align the sequences, the software may insert gaps, thereby creating the FASTA+GAP format. The gaps appear only in the alignment, not in the individual sequences in the database. In the examples below, gaps are represented by the <code>-</code> character.</p>
<p>Sequences in FASTA+GAP format resemble FASTA sequences. See the page on <a href="fasta.html">FASTA format help</a> for instructions on formatting FASTA sequences.</p>
<p>The following is an example of FASTA+GAP format without source information:</p>
<pre><code>&gt;A-0V-1-A
TCACTCTTTGGCAACGACCCGTCGTCATAATAAAGATAGAGGGGCAACTAAAGGAAGCTCTA
TTAGATACAGGAGCAGATGATACAGTATTAGAAGAAAT

&gt;A-0V-2-A
TCACTCTTTGGCAAC---GCGTCGTCACAATAAAGATAGAGGGGCAACTAAAGGAAGCTCTA
TTAGATACAGGAGCAGATGATACAGTATTAGAAGAAAT

&gt;A-0V-3-A
TCACTCTTTGGCAAC---GCGTCGTCACAATAAAGATAGAGGGGCAACTAAAGGAAGCTCTA
TTAGATACAGGAGCAGATGATACAGTATTAGAAGAAAT

&gt;A-0V-4-A
TCACTCTTTGGCAACGACCCGTCGTCACAATAAAGATAGAGGGGCAACTAAAGGAAGCTCTA
TTAGATACAGGAGCAGATGATACAGTATTAGAAGAAAT

&gt;A-0V-7-A
TCACTCTTTGGCAACGACCAGTCGTCACAATAAAGATAGAGGGGCAACTAAAGGAAGCTCTA
TTAGATACAGGAGCAGATGATACAGTATTAGAAGAAAT</code></pre>
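<p>To illustrate how the gapped alignment relates to the individual sequences, the following sketch reads FASTA+GAP text and removes the gap characters. It is an illustration only; the function names <code>read_fasta_gap</code> and <code>ungapped</code> are hypothetical and are not part of BankIt.</p>

```python
def read_fasta_gap(text):
    """Parse FASTA+GAP text into {sequence_ID: aligned_sequence}."""
    records = {}
    current = None
    for line in text.splitlines():
        line = line.strip()
        if not line:
            continue
        if line.startswith(">"):
            # The sequence_ID is the first whitespace-delimited token.
            tokens = line[1:].split()
            current = tokens[0] if tokens else ""
            records[current] = ""
        elif current is not None:
            records[current] += line
    return records


def ungapped(records, gap="-"):
    """Remove gap characters to recover each individual sequence."""
    return {seq_id: seq.replace(gap, "") for seq_id, seq in records.items()}
```

<p>In a well-formed alignment every value returned by <code>read_fasta_gap</code> has the same length, which can be confirmed with <code>len({len(s) for s in records.values()}) == 1</code>.</p>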
<p>You may add source information to the definition lines so that BankIt can determine the correct organism and any other modifiers for each sequence, although this is not required. If you do not provide source information in the alignment file, you will be prompted for it, with instructions, on the Organism and Source Modifiers pages in BankIt.</p>
<p>The following is an example of FASTA+GAP format with source information:</p>
<pre><code>&gt;A-0V-1-A [organism=Gallus gallus] [clone=C]
TCACTCTTTGGCAACGACCCGTCGTCATAATAAAGATAGAGGGGCAACTAAAGGAAGCTCTA
TTAGATACAGGAGCAGATGATACAGTATTAGAAGAAAT

&gt;A-0V-2-A [organism=Drosophila melanogaster] [strain=D]
TCACTCTTTGGCAAC---GCGTCGTCACAATAAAGATAGAGGGGCAACTAAAGGAAGCTCTA
TTAGATACAGGAGCAGATGATACAGTATTAGAAGAAAT

&gt;A-0V-3-A [organism=Caenorhabditis elegans] [strain=E]
TCACTCTTTGGCAAC---GCGTCGTCACAATAAAGATAGAGGGGCAACTAAAGGAAGCTCTA
TTAGATACAGGAGCAGATGATACAGTATTAGAAGAAAT

&gt;A-0V-4-A [organism=Rattus norvegicus] [strain=F]
TCACTCTTTGGCAACGACCCGTCGTCACAATAAAGATAGAGGGGCAACTAAAGGAAGCTCTA
TTAGATACAGGAGCAGATGATACAGTATTAGAAGAAAT

&gt;A-0V-7-A [organism=Aspergillus nidulans] [strain=G]
TCACTCTTTGGCAACGACCAGTCGTCACAATAAAGATAGAGGGGCAACTAAAGGAAGCTCTA
TTAGATACAGGAGCAGATGATACAGTATTAGAAGAAAT</code></pre>
<p>If you opt to include source information with your alignment, you must include it for each sequence.</p>
<p>The <a href="#sequence-ids">sequence_ID</a> follows the “&gt;” character. The organism name follows in brackets, and optional modifiers also follow in brackets. BankIt will not be able to correctly interpret the organism name and the source modifiers unless you correctly format them within the square brackets. For each modifier, use the value appropriate for your samples; do not copy the values present in the above example. See the <a href="genbank-source-table.html#modifiers">list of valid source modifiers</a>.</p>
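<p>The bracketed <code>[modifier=value]</code> layout of a definition line can be checked with a small parser. This is a sketch under the assumption that modifier names and values contain no square brackets; <code>parse_definition_line</code> is a hypothetical helper, not a BankIt function.</p>

```python
import re

# One [name=value] pair; names and values are assumed not to contain brackets.
MODIFIER = re.compile(r"\[([^=\[\]]+)=([^\[\]]*)\]")


def parse_definition_line(line):
    """Split '>ID [name=value] ...' into (sequence_ID, modifiers dict)."""
    body = line.lstrip(">").strip()
    bracket = body.find("[")
    # Everything before the first bracket is treated as the sequence_ID.
    seq_id = (body if bracket == -1 else body[:bracket]).strip()
    modifiers = {
        name.strip(): value.strip() for name, value in MODIFIER.findall(body)
    }
    return seq_id, modifiers
```

<p>A definition line that yields no <code>organism</code> key in the returned dictionary is a sign that the brackets are malformed or the organism was omitted.</p>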
<h2 id="nexus-format">NEXUS Format for Aligned Nucleotide Sequences</h2>
<p>The sequence alignment software that you are using may have an option to output your alignment in the NEXUS interleaved format.</p>
<p>NEXUS files can contain <code>?</code> for “missing” at the 5’ and 3’ ends of sequences, as long as this parameter is properly defined within the header of the NEXUS file. BankIt will replace the “?” characters in the sequences with “N”s since they are defined as “missing” data in the header. Gaps in the alignment are represented by the <code>-</code> character, as specified in the header of the NEXUS file. The gaps will only show up in the alignment, not in the individual sequence in the database.</p>
<p>The following is an example of NEXUS Interleaved format.</p>
<pre><code>#NEXUS

begin data;
   dimensions ntax=5 nchar=100;
   format datatype=dna missing=? gap=- interleave;
   matrix

A-0V-1-A   ????TCTTTG GCAACGACCC GTCGTCATAA TAAAGATAGA GGGGCAACTA
A-0V-2-A   TCACTCTTTG GCAAC---GC GTCGTCACAA TAAAGATAGA GGGGCAACTA
A-0V-3-A   TCACTCTTTG GCAAC---GC GTCGTCACAA TAAAGATAGA GGGGCAACTA
A-0V-4-A   TCACTCTTTG GCAACGACCC GTCGTCACAA T????ATAGA GGGGCAACTA
A-0V-7-A   TCACTCTTTG GCAACGACCA GTCGTCACAA TAAAGATAGA GGGGCAACTA


A-0V-1-A   AAGGAAGCTC TATTAGATAC AGGAGCAGAT GATACAGTAT TAGAAGAAAT
A-0V-2-A   AAGGAAGCTC TATTAGATAC AGGAGCAGAT GATACAGTAT TAGAAGAAAT
A-0V-3-A   AAGGAAGCTC TATTAGATAC AGGAGCAGAT GATACAGTAT TAGAAGAAAT
A-0V-4-A   AAGGAAGCTC TATTAGATAC AGGAGCAGAT GATACAGTAT TAGAAGAAAT
A-0V-7-A   AAGGAAGCTC TATTAGATAC AGGAGCAGAT GATACAGTAT TAGAAG????
;
End;</code></pre>
<p>In this example, the first few lines provide information about the data in the sequence alignment. The following five lines contain the <a href="#sequence-ids">sequence_IDs</a>, followed by the sequences. In this example, the sequence_ID for the first sequence is A-0V-1-A. Note that subsequent blocks of sequence also contain the sequence_ID.</p>
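<p>The interleaved layout can be reassembled by concatenating, for each sequence_ID, the pieces found in successive blocks. The sketch below assumes the header declares <code>missing=?</code> and <code>gap=-</code> as in the example and does not read those settings from the file; the function names are hypothetical and the code is not BankIt's parser.</p>

```python
def read_nexus_matrix(text):
    """Collect interleaved blocks into {sequence_ID: aligned_sequence}."""
    records = {}
    in_matrix = False
    for line in text.splitlines():
        line = line.strip()
        if not in_matrix:
            # Sequence rows start after the 'matrix' keyword.
            in_matrix = line.lower() == "matrix"
            continue
        if line.startswith(";"):
            break  # end of the matrix
        if not line or line.startswith("["):
            continue  # blank separator or bracketed comment line
        seq_id, *chunks = line.split()
        records[seq_id] = records.get(seq_id, "") + "".join(chunks)
    return records


def to_individual(seq, missing="?", gap="-"):
    """Mimic the described handling: missing becomes N, gaps are dropped."""
    return seq.replace(missing, "N").replace(gap, "")
```

<p>Comparing the length of each reassembled sequence with the <code>nchar</code> value in the header is a quick way to catch a truncated or misaligned block.</p>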
<p>You may add the organism names and source modifiers to the alignment, although this is not required. If you do not provide source information in the alignment file, you will be prompted for it, with instructions, on the Organism and Source Modifiers pages in BankIt.</p>
<p>The following is an example of NEXUS interleaved format with source information added to the end of the file:</p>
<pre><code>#NEXUS

begin data;
        dimensions  ntax=3 nchar=100;
        format datatype=dna  missing=? gap=-  interleave ;
        matrix

[     1                                                   50]
ABC_1 ???ATTGCGT TATGGAAATT CGAAACTGCC AAATACTATG TCACCATCAT
ABC_2 GATATTGCTT TATGGAAATT CGAAACTGCC AAATACTATG TCACCATCAT
ABC_3 ???ATTGCTT TATGGAAATT CGAAACTGCC AAATACTATG TTA-------

[     51                                                 100]
ABC_1 TGATGCACCT GGACACAGAG ATTTCATCAA GAACATGATC ACTGGTACTT
ABC_2 TGATGCACCT GGACACAGAA ATTTCATCAA GAACATGATC ACTGGTACTT
ABC_3 TGATGCACCT GGACACAGAG ATTTCATCAA AAACATGATC ACTGGTACTT
;
END;

begin ncbi;
sequin
&gt;[organism=Saccharomyces cerevisiae][strain=ABC][clone=1]
&gt;[organism=Saccharomyces cerevisiae][strain=ABC][clone=2]
&gt;[organism=Saccharomyces cerevisiae][strain=ABC][clone=3]
;
end;</code></pre>
<p>If you opt to include source information in a NEXUS file, source information must be included for all sequences in the alignment, and it must be formatted as shown in the example with “begin ncbi;”, “sequin”, source information, “;” and “end;”.</p>
<p>Each line of source information begins with a “&gt;” character. The first line of source information applies to the first sequence (<code>ABC_1</code>), the second line to the second sequence (<code>ABC_2</code>), and so on, so there must be exactly one line of source information for each sequence. These lines contain modifiers formatted as in a FASTA definition line, but they do not begin with the sequence_ID. Instead, the sequence_ID is present at the beginning of the sequence lines as shown in the example.</p>
<p>After the “&gt;” character, the organism name follows in brackets. Optional modifiers also follow in brackets. BankIt will not be able to correctly interpret the organism name and the source modifiers unless you correctly format them within the square brackets. For each modifier, use the value appropriate for your samples; do not copy the values present in the above example. See the <a href="genbank-source-table.html#modifiers">list of valid source modifiers</a>.</p>
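<p>Because source lines are matched to sequences purely by position, a mismatch in count or order silently assigns the wrong source. A minimal sketch of that positional pairing follows; <code>pair_sources</code> is a hypothetical name used only for illustration.</p>

```python
def pair_sources(sequence_ids, source_lines):
    """Pair each sequence_ID with the source line in the same position."""
    if len(sequence_ids) != len(source_lines):
        raise ValueError(
            f"{len(sequence_ids)} sequences but {len(source_lines)} source lines"
        )
    for line in source_lines:
        if not line.startswith(">"):
            raise ValueError(f"source line does not begin with '>': {line}")
    # zip keeps the original order, so line 1 goes with sequence 1, and so on.
    return dict(zip(sequence_ids, source_lines))
```

<p>Printing the returned pairs before uploading makes it easy to confirm by eye that each sequence_ID sits next to the source information intended for it.</p>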
<h2 id="phylip-format">PHYLIP Format for Aligned Nucleotide Sequences</h2>
<p>The sequence alignment software that you are using may have an option to output your alignment in the PHYLIP format.</p>
<p>The following is an example of PHYLIP format:</p>
<pre><code>     5    100
A-0V-1-A   TCACTCTTTG GCAACGACCC GTCGTCATAA TAAAGATAGA GGGGCAACTA
A-0V-2-A   TCACTCTTTG GCAAC---GC GTCGTCACAA TAAAGATAGA GGGGCAACTA
A-0V-3-A   TCACTCTTTG GCAAC---GC GTCGTCACAA TAAAGATAGA GGGGCAACTA
A-0V-4-A   TCACTCTTTG GCAACGACCC GTCGTCACAA TAAAGATAGA GGGGCAACTA
A-0V-7-A   ----TCTTTG GCAACGACCA GTCGTCACAA TAAAGATAGA GGGGCAACTA


           AAGGAAGCTC TATTAGATAC AGGAGCAGAT GATACAGTAT TAGAAGAAAT
           AAGGAAGCTC TATTAGATAC AGGAGCAGAT GATACAGTAT TAGAAGAAAT
           AAGGAAGCTC TATTAGATAC AGGAGCAGAT GATACAGTAT TAGAAGAAAT
           AAGGAAGCTC TATTAGATAC AGGAGCAGAT GATACAGTAT TAGAAGAAAT
           AAGGAAGCTC TATTAGATAC AGGAGCAGAT GATACAGTAT TAGAAGAAAT</code></pre>
<p>In this example, the first line indicates that there are 5 sequences, each with 100 nt of sequence. The following five lines contain the <a href="#sequence-ids">sequence_IDs</a>, followed by the sequences. In this example, the sequence_ID for the first sequence is A-0V-1-A. Note that subsequent blocks of sequence do not contain the sequence_ID. Gaps in the alignment are represented by the <code>-</code> character. The gaps appear only in the alignment, not in the individual sequences in the database.</p>
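<p>The header counts and the block layout can be verified with a short reader. This sketch assumes the layout shown in the example, where whitespace separates the sequence_ID from the sequence (strict PHYLIP files with fixed-width names would need different handling); <code>read_phylip_interleaved</code> is a hypothetical name.</p>

```python
def read_phylip_interleaved(text):
    """Parse interleaved PHYLIP text into {sequence_ID: aligned_sequence}."""
    # Ignore blank lines and any trailing source lines that begin with '>'.
    lines = [
        ln for ln in text.splitlines()
        if ln.strip() and not ln.lstrip().startswith(">")
    ]
    ntax, nchar = (int(x) for x in lines[0].split()[:2])
    ids, seqs = [], []
    for i, line in enumerate(lines[1:]):
        tokens = line.split()
        if i < ntax:
            # First block: sequence_ID followed by sequence chunks.
            ids.append(tokens[0])
            seqs.append("".join(tokens[1:]))
        else:
            # Later blocks: sequence only, in the same order as the first.
            seqs[i % ntax] += "".join(tokens)
    for seq_id, seq in zip(ids, seqs):
        if len(seq) != nchar:
            raise ValueError(f"{seq_id}: expected {nchar} characters, got {len(seq)}")
    return dict(zip(ids, seqs))
```

<p>A <code>ValueError</code> from this reader indicates that a sequence does not match the length declared on the first line, which is worth resolving before upload.</p>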
<p>You may add the organism names and source modifiers to the alignment, although this is not required. If you do not provide source information in the alignment file, you will be prompted for it, with instructions, on the Organism and Source Modifiers pages in BankIt.</p>
<p>The following is an example of PHYLIP format with source information added to the end of the file:</p>
<pre><code>      3  100
ABC-1      ---ATTGCGT TATGGAAATT CGAAACTGCC AAATACTATG TCACCATCAT
ABC-2      GATATTGCTT TATGGAAATT CGAAACTGCC AAATACTATG TCACCATCAT
ABC-3      ---ATTGCTT TATGGAAATT CGAAACTGCC AAATACTATG TTA-------

           TGATGCACCT GGACACAGAG ATTTCATCAA GAACATGATC ACTGGTACTT
           TGATGCACCT GGACACAGAA ATTTCATCAA GAACATGATC ACTGGTACTT
           TGATGCACCT GGACACAGAG ATTTCATCAA AAACATGATC ACTGGTACTT

&gt;[organism=Saccharomyces cerevisiae][strain=ABC][clone=1]
&gt;[organism=Saccharomyces cerevisiae][strain=ABC][clone=2]
&gt;[organism=Saccharomyces cerevisiae][strain=ABC][clone=3]</code></pre>
<p>If you opt to include source information with your alignment, you must include it for each sequence.</p>
<p>Each line of source information begins with a “&gt;” character. The first line of source information applies to the first sequence (<code>ABC-1</code>), the second line to the second sequence (<code>ABC-2</code>), and so on, so there must be exactly one line of source information for each sequence. These lines contain modifiers formatted as in a FASTA definition line, but they do not begin with a sequence_ID. Instead, the sequence_ID is present at the beginning of the sequence lines as shown above.</p>
<p>After the “&gt;” character, the organism name follows in brackets. Optional modifiers also follow in brackets. BankIt will not be able to correctly interpret the organism name and the source modifiers unless you correctly format them within the square brackets. For each modifier, use the value appropriate for your samples; do not copy the values present in the above example. See the <a href="genbank-source-table.html#modifiers">list of valid source modifiers</a>.</p>
<h2 id="clustal-format">Clustal(w) Format for Aligned Nucleotide Sequences</h2>
<p>The sequence alignment software that you are using may have an option to output your alignment in the Clustal(w) format.</p>
<p>The following is an example of Clustal(w) format:</p>
<pre><code>CLUSTAL W

seq1	------ACTAGACTGGGTGCGGTACCTAAGTG-TTACTGGCGGTGTGTTGCTCTATGATT
seq2	CCCCTTTCTGGAAAAGTTCACGGTACTATTCG-TTTCTGCCTGTGAGCTGCTATACGATT
seq3	------ATTAGGCTAG--ACGGTACCAATAGTCGCGCTGACTGTGGGTTGCTCTACGACT
seq4	-----AATTAGGCTGGTGTCAGTATCAAG--GTTCCCTGGATGTTAGTTGCTCTACGACT
seq5	-----AAATAGGCTGGTGTCAGTATCAAGAGGATCCTGGAATGTTAGATGCTCTAAGACT
	        *.*.:: .   ::*:: *        . ::* ::**. * ****:*  **.*

seq1	CGGAATGTTACCAGGATGATATACCTAGTTGCCTAGATGCACACCTTGAATTTGTCGAAA
seq2	CGGCATGTTCCCCGGTTGATGTTGCTAGTTGCAAAGTTGCAAAGACTCAATTTGCTGAAA
seq3	CGGTAAGATCCCTGGTTGGTTGTACTCGTTGCAAATAGGCACCCATTGTAATAGTTGCAA
seq4	AGGTAAGTTCCCTGGGTGGTGTAACTCGTTGCCTATGCGCAACTACTTTAGTGGAAGAAA
seq5	AGGTAAGTTCCCTGGGTGGTATAACTCGTTGCCTAGGTGCACCGAATCCAGTGGATGAAA
	:** *:* *:**  *.**.* .: **.*****: *. .*** :.: *  * * *  *:**

seq1	CACCCTATTTTCGGGGTATGGGTGCAGGCCAGGAAGTAG--
seq2	CACCCTATTTTCGCGGTATGGTTGCAGACCAGGAAGTAGGC
seq3	CTTCCTATTTTCGGGGCATGGTTGCAGACTAGGCAGATGC-
seq4	CTACCTAGTTTCGCGGCAAGTGTGCAGGCTAGGCAGAAG--
seq5	CTACCTAGATTCGGGGCAAGTGTGCAGGCTAGGCAGAAA--
	*::*** . ****:** * *..*****.*:***:**:</code></pre>
<p>In this example, there are 5 sequences in the alignment. Each line contains the <a href="#sequence-ids">sequence_ID</a> followed by the sequence for that sequence_ID. For example, the sequence_ID of the first sequence is seq1. Each subsequent block of sequence also contains the sequence_IDs. Gaps in the alignment are represented by the <code>-</code> character. The gaps appear only in the alignment, not in the individual sequences in the database. After each alignment block are the sequence conservation characters included in the Clustal(w) output.</p>
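<p>The block structure described above can be read back with a few lines of code. This sketch assumes, as in the example, that conservation lines begin with whitespace and that each sequence line holds a sequence_ID and one sequence chunk; <code>read_clustal</code> is a hypothetical name and not a BankIt function.</p>

```python
def read_clustal(text):
    """Parse Clustal(w) text into {sequence_ID: aligned_sequence}."""
    records = {}
    for line in text.splitlines():
        if not line.strip() or line.startswith("CLUSTAL"):
            continue  # blank separator or header line
        if line[0].isspace():
            continue  # conservation line (assumed to start with whitespace)
        tokens = line.split()
        if len(tokens) < 2:
            continue  # no sequence chunk on this line
        seq_id, chunk = tokens[0], tokens[1]
        # Each block repeats the sequence_ID, so append chunk to what we have.
        records[seq_id] = records.get(seq_id, "") + chunk
    return records
```

<p>As with the other formats, all reassembled sequences should have the same length, gaps included.</p>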
<p>At this time, you may not modify the Clustal(w) output to include source information. You will be prompted for source information in the BankIt forms as you continue with your submission.</p>
</body>
</html>