File Format Guide
Introduction
This page reviews the submission file formats currently supported by the Sequence Read Archives (SRA) at NCBI, EBI, and DDBJ, and gives guidance to submitters about current and future file formats and policies regarding SRA submissions.
Some things to keep in mind:
- The SRA is a raw data archive, and requires per-base quality scores for all submitted data. Therefore, FASTA and other sequence-only formats are not sufficient for submission. FASTA can, however, be submitted as the reference sequence(s) for BAM files or as part of a FASTA/QUAL pair (see below).
- SRA accepts binary files such as BAM, SFF, and HDF5 formats and text formats such as FASTQ.
BAM files
Binary Alignment/Map files (BAM) represent one of the preferred SRA submission formats. BAM is a compressed version of the Sequence Alignment/Map (SAM) format (see SAMv1 (.pdf)). BAM files can be decompressed to a human-readable text format (SAM) using SAM/BAM-specific utilities (e.g. samtools) and can contain unaligned sequences as well. SRA recommends aligning to an unmodified known reference, if possible, to enable subsequent users to view the alignments in the Sequence Viewer or to compare the alignments with other alignments on the same reference.
SAM is a tab-delimited format including both the raw read data and information about the alignment of that read to a known reference sequence(s). There are two main sections in a SAM file, the header and the alignment (sequence read) sections, each of which is described below. Note that this documentation focuses on the SAM format with respect to submission of BAM files to the SRA (i.e. SRA does not accept SAM files for submission). A more comprehensive discussion of the format specifications can be found at the samtools website.
SAM Header Example:
@SQ SN:CHROMOSOME_I LN:15072423 UR:ftp://ftp.ncbi.nlm.nih.gov/genbank/genomes/Eukaryotes/invertebrates/Caenorhabditis_elegans/WBcel215/Primary_Assembly/assembled_chromosomes/FASTA/chrI.fa.gz AS:ce10 SP:Caenorhabditis elegans
@SQ SN:CHROMOSOME_II LN:15279345 UR:ftp://ftp.ncbi.nlm.nih.gov/genbank/genomes/Eukaryotes/invertebrates/Caenorhabditis_elegans/WBcel215/Primary_Assembly/assembled_chromosomes/FASTA/chrII.fa.gz AS:ce10 SP:Caenorhabditis elegans
@RG ID:1 PL:ILLUMINA LB:C_ele_05 DS:WGS of C elegans PG:BamIndexDecoder
@PG ID:bwa PN:bwa VN:0.5.10-tpx
Ideally, the SN value should be a versioned accession (e.g., NC_003279.7, rather than CHROMOSOME_I). This will allow the SRA to unambiguously identify the reference sequence(s) and process the BAM file with minimal intervention. Otherwise, submitters are strongly encouraged to include the "URL/URI" that can be used to obtain the reference sequence(s) and AS tags to clearly define which assembly has been used (as above).
If the data are instead aligned to a local or submitter-defined set of references (including any modifications to accessioned assemblies), then the submitter must include a reference FASTA file along with each submitted BAM file. Note: the FASTA header line(s) MUST match the SN names provided in the BAM file exactly.
Deviation from these recommended practices will require manual intervention by SRA staff in order to process a BAM file and can delay completion of a submission and acquisition of accession numbers.
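The name-matching requirement above can be checked before submission. The following is a minimal sketch in Python, not an SRA tool: it assumes the SAM header has already been exported as tab-delimited text, and it assumes that the name of a FASTA record is the first whitespace-delimited token of its header line. The function names are hypothetical.

```python
def sq_names(sam_header):
    """Collect the SN values from the @SQ lines of a tab-delimited SAM header."""
    names = []
    for line in sam_header.splitlines():
        if not line.startswith("@SQ"):
            continue
        for field in line.split("\t")[1:]:
            if field.startswith("SN:"):
                names.append(field[3:])
    return names


def fasta_names(fasta_text):
    """Collect the first token of each FASTA header line (assumed to be the name)."""
    names = []
    for line in fasta_text.splitlines():
        if line.startswith(">"):
            tokens = line[1:].split()
            if tokens:
                names.append(tokens[0])
    return names


def missing_references(sam_header, fasta_text):
    """Return SN names that have no exactly matching FASTA record."""
    available = set(fasta_names(fasta_text))
    return [name for name in sq_names(sam_header) if name not in available]
```

An empty result means every SN name has an exactly matching FASTA record; the comparison is case-sensitive on purpose.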
SAM Alignment Example:
GCCTAAGCCTAAGCCTAAGCCTAAGCCTAAGCCTAAGCCTAAGCCTAAGCCTAAGCCTAAGCCTAAGCCTAAGCCTAAGCCTAAGCCTAAGCCT
AAGCCT
@CCC?:CCCCC@CCCEC>AFDFDBEGHEAHCIGIHHGIGEGJGGIIIHFHIHGF@HGGIGJJJJJIJJJJJJJJJJJJJJJJJJJJJHHHHHFF
FFFCCC RG:Z:1 NH:i:1 NM:i:0
5482659 65 CHROMOSOME_I 1 0 100M CHROMOSOME_II 11954696 0
GCCTAAGCCTAAGCCTAAGCCTAAGCCTAAGCCTAAGCCTAAGCCTAAGCCTAAGCCTAAGCCTAAGCCTAAGCCTAAGCCTAAGCCTAAGCCT
AAGCCT
CCCFFFFFHHGHGJJGIJHIJIJJJJJIJJJJJIJJGIJJJJJIIJIIJFJJJJJFIJJJJIIIIGIIJHHHHDEEFFFEEEEEDDDDCDCCCA
AA?CC: RG:Z:1 NH:i:1 NM:i:0
The header and alignment sections are internally consistent: each aligned read has an RNAME (reference sequence name, 3rd field) that matches an SN tag value from the header (e.g., CHROMOSOME_I), and, if provided, the alignment read group optional field (RG:Z:) is consistent with the read group ID in the header (1). It is also important to ensure that the FLAG fields (2nd field in each line) are correctly set for the data. The SRA pipeline will attempt to resolve incorrect FLAG values, but sufficiently incorrect values can lead to processing errors.
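The consistency rules described here (RNAME against the SN values, RG:Z: against the @RG IDs) could be checked on SAM text along these lines. This is an illustrative sketch with hypothetical function names; it assumes tab-delimited SAM text (for example, the output of a BAM-to-SAM conversion) and treats an RNAME of * as unaligned.

```python
def header_ids(sam_text):
    """Return (set of @SQ SN values, set of @RG ID values) from SAM text."""
    sn_values, rg_ids = set(), set()
    for line in sam_text.splitlines():
        if not line.startswith("@"):
            continue
        fields = line.split("\t")
        for field in fields[1:]:
            if fields[0] == "@SQ" and field.startswith("SN:"):
                sn_values.add(field[3:])
            elif fields[0] == "@RG" and field.startswith("ID:"):
                rg_ids.add(field[3:])
    return sn_values, rg_ids


def inconsistent_records(sam_text):
    """List (QNAME, kind, value) for records that disagree with the header."""
    sn_values, rg_ids = header_ids(sam_text)
    problems = []
    for line in sam_text.splitlines():
        if not line or line.startswith("@"):
            continue
        fields = line.split("\t")
        qname, rname = fields[0], fields[2]
        if rname != "*" and rname not in sn_values:
            problems.append((qname, "RNAME", rname))
        for optional in fields[11:]:  # optional fields follow the 11 mandatory ones
            if optional.startswith("RG:Z:") and optional[5:] not in rg_ids:
                problems.append((qname, "RG", optional[5:]))
    return problems
```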
The SRA does not archive optional and non-standard tags/field values contained in the alignment section. However, the entire header section of the BAM file is retained. Additionally, although the SAM format allows for an equal sign (=) in the sequence field to represent a match to the reference sequence or only an asterisk (*) in both the sequence and quality fields, the SRA processing software does not recognize either of these formats.
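A submitter could scan records for these two notations before upload. The sketch below is an assumption-laden illustration (the function name is hypothetical) that inspects the 10th (SEQ) and 11th (QUAL) fields of one tab-delimited SAM record.

```python
def unsupported_seq_notation(sam_record):
    """Report uses of '=' in SEQ or '*' in SEQ/QUAL for one SAM alignment line."""
    fields = sam_record.rstrip("\n").split("\t")
    seq, qual = fields[9], fields[10]
    issues = []
    if seq == "*":
        issues.append("SEQ is '*'")
    elif "=" in seq:
        issues.append("'=' in SEQ")
    if qual == "*":
        issues.append("QUAL is '*'")
    return issues
```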
Please note that unexpected notations used to indicate paired reads can lead to failure to recognize the pairs and an improper SRA archive (i.e. paired reads are treated like fragments). For example, using :0 and :1 at the end of the read names is atypical and is currently not recognized as an indication of read 1 and 2 in a pair. It would be better to exclude these notations and give the two reads the same name. Expected notations for particular platforms will work; for example, appending /1 or /2 to Illumina read names is an expected notation. Further, neglecting to set the proper bits for paired reads in the SAM/BAM flags (e.g. multi-segment template 1-bit, first segment 64-bit, and last segment 128-bit) or splitting paired reads into separate BAM files can result in an improper SRA archive or failure to generate the SRA archive.
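The three flag bits named above can be tested with simple bit masks. The following sketch uses hypothetical names and applies a deliberately strict rule for two-read pairs: a paired read should carry exactly one of the first-segment and last-segment bits. Templates with more than two segments can legitimately set both bits, so treat the second message as a warning rather than an error.

```python
PAIRED = 0x1           # template has multiple segments (1-bit)
FIRST_SEGMENT = 0x40   # first segment in the template (64-bit)
LAST_SEGMENT = 0x80    # last segment in the template (128-bit)


def pairing_problems(flag):
    """Return a list of pairing inconsistencies found in one SAM FLAG value."""
    paired = bool(flag & PAIRED)
    first = bool(flag & FIRST_SEGMENT)
    last = bool(flag & LAST_SEGMENT)
    problems = []
    if (first or last) and not paired:
        problems.append("segment bit set without the multi-segment bit (1)")
    if paired and first == last:
        problems.append("paired read is not exactly one of first (64) or last (128)")
    return problems
```

For the FLAG value 65 used in the alignment example (1 + 64), the function reports no problems.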
CRAM files
Another acceptable SRA submission format is the CRAM format (see CRAMv3(.pdf)). Files received in this format are converted to the BAM format for processing. The references provided in this format are treated in the same manner as BAM references with the added possibility of a check against the European Nucleotide Archive (ENA) CRAM reference registry.
SFF files
In the absence of a BAM file, Standard Flowgram Format (SFF) is the preferred input format for 454 Life Sciences (now part of Roche) data; IonTorrent data can also be submitted as SFF. Extensive technical details about the format can be obtained here.
HDF5 files
HDF5 is a data model, library, and file format for storing and managing data.
The SRA accepts bas.h5 and bax.h5 file submissions for PacBio-based submissions and .fast5 files for submissions related to MinION Oxford Nanopore.
PacBio
Submission of data from the RS II instrument requires one (1) bas.h5 file and three (3) bax.h5 files. Do not link more than one PacBio RS II to an SRA run and please do not change the bax.h5 file names from those indicated in the bas.h5 file.
Depending on the platform used for your PacBio sequencing project, the following data files with respective extensions are produced and required for SRA submission.
PacBio RS Platform | Data Files Delivered |
---|---|
PacBio RS |
|
PacBio RS II |
|
Please be sure to list the files for each SMRT Cell in a separate Run or on a separate row of your sra_metadata sheet.
PacBio documentation on bax.h5 / bas.h5 format: bas.h5ReferenceGuide.pdf.
MinION Oxford Nanopore
In this case, there are 1-3 sequences per fast5 HDF file (one spot of information) and the entire set of fast5 files should be submitted in a tar.gz file. You must submit the fast5 files generated after base calling.
Learn more about this platform at Oxford Nanopore Technologies website.
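As an illustration of the tar.gz packaging step for fast5 files, the sketch below uses Python's standard tarfile module. The function name and directory layout are assumptions; only files ending in .fast5 are added, with paths stored relative to the source directory.

```python
import tarfile
from pathlib import Path


def bundle_fast5(source_dir, archive_path):
    """Pack every .fast5 file under source_dir into one gzip-compressed tar file.

    Returns the number of files added.
    """
    source = Path(source_dir)
    files = sorted(source.rglob("*.fast5"))
    with tarfile.open(archive_path, "w:gz") as archive:
        for path in files:
            # Store paths relative to source_dir so the archive has no absolute paths.
            archive.add(path, arcname=path.relative_to(source).as_posix())
    return len(files)
```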
HDF5 tools
HDF5 tools: http://www.hdfgroup.org/products/hdf5_tools
FASTQ files
Fastq consists of a defline that contains a read identifier and possibly other information, nucleotide base calls, a second defline, and per-base quality scores, all in text form. There are many variations.
The following terms and formats are defined in general:
- Identifier and other information: text string terminated by white space.
- Bases: fastq sequence should contain standard base calls (ACTGactg) or unknown bases (Nn) and can vary in length.
- Qualities options:
  - Decimal-encoding, space-delimited: [0-9]+ | <quality>\s[0-9]+
  - Phred-33 ASCII: [\!\"\#\$\%\&\'\(\)\*\+,\-\.\/0-9:;<=>\?\@A-I]+
  - Phred-64 ASCII: [\@A-Z\[\\\]\^_`a-h]+
Quality string length should be equal to sequence length.
In a limited set of cases, log odds or non-ASCII numerical quality values will succeed during an SRA submission.
Files from various platforms employing this format are acceptable:
@<identifier and other information>
<sequence>
+<identifier and other information OR empty string>
<quality>
Where each instance of Identifier, Bases, and Qualities is newline-separated.
Extra information added beyond the <identifier and expected information> examples is likely to be discarded/ignored.
As indicated above, the Qualities string can be space-separated numeric Phred scores or an ASCII string of the Phred scores with the ASCII character value = Phred score plus an offset constant used to place the ASCII characters in the printable character range. There are 2 predominant offsets: 33 (0 = !) and 64 (0=@).
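The offset arithmetic can be written out directly. In this sketch (hypothetical function names), `decode_quality` subtracts the offset from each character code, and `guess_offset` applies only the one inference that is always safe: any character below '@' (code 64) cannot be Phred-64. Strings made up entirely of characters at or above '@' are ambiguous on their own, so the function returns None for them.

```python
def decode_quality(quality, offset=33):
    """Convert an ASCII quality string to a list of Phred scores."""
    scores = [ord(char) - offset for char in quality]
    if any(score < 0 for score in scores):
        raise ValueError(f"character below offset {offset} in {quality!r}")
    return scores


def guess_offset(quality):
    """Return 33 if the string can only be Phred-33, else None (undetermined)."""
    if quality and min(ord(char) for char in quality) < 64:
        return 33
    return None
```

For example, `decode_quality("!+5?I")` gives `[0, 10, 20, 30, 40]` with the default offset of 33, and `decode_quality("@h", offset=64)` gives `[0, 40]`.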
Paired-end FASTQ
Paired reads are usually a forward read paired with a reverse read, although there are some instances where this is not the case.
Paired-end data submitted in FASTQ format should be submitted in one of two formats:
- As separate files for forward and reverse reads, in which the reads are in the same order.
- As interleaved, or "8-line", FASTQ, in which forward and reverse reads alternate in the file and are in order (i.e., read "1F", followed by read "1R", then read "2F", then "2R").
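The second option can be produced from the first. The sketch below (hypothetical function names) interleaves two same-order FASTQ inputs into "8-line" form. It assumes each record is exactly four lines, i.e. sequences and qualities are not wrapped, and it stops with an error if the two inputs hold different numbers of reads.

```python
from itertools import islice


def fastq_records(lines):
    """Yield 4-line FASTQ records (newlines stripped) from an iterable of lines."""
    lines = iter(lines)
    while True:
        record = list(islice(lines, 4))
        if not record:
            return
        if len(record) != 4:
            raise ValueError("truncated FASTQ record")
        yield [line.rstrip("\n") for line in record]


def interleave(forward, reverse):
    """Yield lines alternating forward and reverse records: 1F, 1R, 2F, 2R, ..."""
    reverse_records = fastq_records(reverse)
    for forward_record in fastq_records(forward):
        reverse_record = next(reverse_records, None)
        if reverse_record is None:
            raise ValueError("reverse input has fewer reads than forward input")
        yield from forward_record
        yield from reverse_record
    if next(reverse_records, None) is not None:
        raise ValueError("reverse input has more reads than forward input")
```

Both arguments can be open text file handles or any other iterable of lines.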
SRA supports the following forward/reverse read indicators: '/1' and '/2' at the end of the read name or newer Illumina style '1:Y:18:ATCACG' and '2:Y:18:ATCACG'.
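To illustrate the two indicator styles, the sketch below extracts the read name and mate number from a FASTQ identifier line. The regular expression for the newer Illumina style is an assumption modelled on the '1:Y:18:ATCACG' example (mate number, Y or N, a number, then the index); the function name and the instrument-style read name in the usage note are hypothetical.

```python
import re

# Matches the start of a newer Illumina-style comment such as "1:Y:18:ATCACG".
ILLUMINA_COMMENT = re.compile(r"^([12]):[YN]:\d+:")


def split_mate(defline):
    """Return (read name, mate number or None) for one FASTQ identifier line."""
    text = defline.strip()
    if text.startswith("@"):
        text = text[1:]
    parts = text.split(None, 1)
    if not parts:
        return "", None
    name = parts[0]
    if len(parts) == 2:
        match = ILLUMINA_COMMENT.match(parts[1])
        if match:
            return name, int(match.group(1))
    if name.endswith(("/1", "/2")):
        return name[:-2], int(name[-1])
    return name, None
```

With this sketch, `split_mate("@read7/2")` returns `("read7", 2)`, `split_mate("@HWI:1:FC:1:1101:100:200 1:Y:18:ATCACG")` returns the name with mate 1, and an atypical name such as `@read7:0` yields no mate number.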
Platform specific FASTQ files
454 fastq
Under Roche 454, SRA accepts both 'pre-split' and 'post-split' 454 fastq sequences. Paired 'post-split' 454 reads must be provided in separate files or in the interleaved format. 'Split' means the 454 linker has been located/removed and used to split the sequence into biological read pairs (and all other technical reads have been removed).
Ion Torrent fastq
In the same manner as Roche 454, SRA only accepts 'pre-split' Ion Torrent sequences or 'post-split' Ion Torrent single read fragments in a fastq form. Paired 'post-split' Ion Torrent reads will require submission in a BAM file. 'Split' means the Ion Torrent linker has been located/removed and used to split the sequence into biological read pairs (and all other technical reads have been removed).
Recent Illumina fastq
<index> values for Illumina fastq can be barcodes.
Older Illumina fastq
<index> values for Illumina fastq can be barcodes.
QIIME de-multiplexed sequences in fastq
PacBio CCS (Circular Consensus Sequence) or RoI (Read of Insert) read
PacBio CCS subread
Helicos fastq with a fixed ASCII-based Phred value for quality
Characteristic use of a quality '/', which gives a Phred value of 14.
The native format for Helicos is FASTA, so converting to fastq requires creating a default quality score. The default value selected by the SRA team is '14'.
FASTA files
Fasta files adhering to the definition lines described in the fastq section are acceptable, too, although fastq is preferred (a file type of fastq should still be specified). The SRA assigns a default quality value of 30 in this case and expects this format:
<sequence>
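For submitters who prefer to make the default quality explicit, a FASTA file can be converted to FASTQ with a constant score. This is a sketch only: the function names are hypothetical, and the use of Phred-33 encoding is an assumption, since this page does not describe how the SRA applies the default internally. Under Phred-33, a score of 30 is the character '?' and the Helicos default of 14 is '/'.

```python
def read_fasta(lines):
    """Yield (name, sequence) pairs from FASTA lines; wrapped sequences are joined."""
    name, chunks = None, []
    for raw in lines:
        line = raw.strip()
        if line.startswith(">"):
            if name is not None:
                yield name, "".join(chunks)
            name, chunks = line[1:], []
        elif line and name is not None:
            chunks.append(line)
    if name is not None:
        yield name, "".join(chunks)


def fasta_to_fastq(lines, default_quality=30, offset=33):
    """Yield FASTQ records that give every base the same default quality."""
    quality_char = chr(default_quality + offset)
    for name, sequence in read_fasta(lines):
        yield f"@{name}\n{sequence}\n+\n{quality_char * len(sequence)}\n"
```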
FASTA with QUAL file pairs
Fasta files may be submitted with corresponding qual files, too. These are recognized in the SRA data processing pipeline as equivalent to fastq and should be specified as fastq when submitting the data files.
Files from some platforms (mostly older Illumina and Roche 454) employing this format are acceptable and the entries in the pair of files should look like:
File 1
>READNAME
BASES
File 2
>READNAME
QUALITIES
Where READNAME must be identical between files for a given read, and QUALITIES are generally in whitespace-separated decimal values.
Note the following guidelines for FASTA/QUAL pairs of files:
In a given pair of files, there must be the same number of reads in both. For a given read, there must be the same number of BASES and QUALITIES, i.e., if the BASES are trimmed to remove barcodes, then the same scores must be removed from the QUALITIES, etc.
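These guidelines lend themselves to a simple pre-submission check. The sketch below (hypothetical function names) assumes both files list reads in the same order, that each record starts with a '>' line, and that qualities are whitespace-separated decimal values.

```python
def parse_records(lines):
    """Return a list of (name, body lines) for '>'-delimited records."""
    records, name, body = [], None, []
    for raw in lines:
        line = raw.strip()
        if line.startswith(">"):
            if name is not None:
                records.append((name, body))
            name, body = line[1:].strip(), []
        elif line and name is not None:
            body.append(line)
    if name is not None:
        records.append((name, body))
    return records


def check_fasta_qual(fasta_lines, qual_lines):
    """Return a list of problems found in a FASTA/QUAL pair (empty if none)."""
    bases = [(name, len("".join(body))) for name, body in parse_records(fasta_lines)]
    quals = [(name, len(" ".join(body).split())) for name, body in parse_records(qual_lines)]
    problems = []
    if len(bases) != len(quals):
        problems.append(f"read count differs: {len(bases)} vs {len(quals)}")
    for (base_name, n_bases), (qual_name, n_quals) in zip(bases, quals):
        if base_name != qual_name:
            problems.append(f"name mismatch: {base_name} vs {qual_name}")
        elif n_bases != n_quals:
            problems.append(f"{base_name}: {n_bases} bases but {n_quals} qualities")
    return problems
```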
CSFASTA with QUAL Files
The files have an optional header that is identified by lines that begin with the hash/pound/number sign (#). The HEADER can be defined as:
# Cwd: <path>
# Title: <flowcell>
The permissible CSFASTA format is as follows:
>TAGNAME
BASES
The permissible QUAL format is as follows:
>TAGNAME
QUALITIES
As with FASTA/QUAL pairs, there are several rules for pairs of CSFASTA/QUAL files. TAGNAME must be identical between files for a given read, and QUALITIES are generally in whitespace-separated decimal values.
Note the following guidelines for CSFASTA/QUAL pairs of files:
In a given pair of files, there must be the same number of reads in both. For a given read, there must be the same number of color space digits and QUALITIES, i.e., the BASES line is typically 1 character longer than the number of QUALITIES (due to the color space indexing base that begins each BASES string). HEADER must be identical between paired files.
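The off-by-one relationship between the BASES line and the QUALITIES can be expressed as a one-line check. In this sketch the function name is hypothetical, and the test that the first character is a letter is an assumption standing in for "the color space indexing base".

```python
def csfasta_lengths_agree(bases, qualities):
    """True if BASES is one leading base plus one color digit per quality value."""
    bases = bases.strip()
    scores = qualities.split()
    return (
        len(bases) >= 2
        and bases[0].isalpha()            # leading indexing base, e.g. 'T'
        and len(bases) - 1 == len(scores)  # one quality per color space digit
    )
```

For example, `csfasta_lengths_agree("T0120", "30 28 31 25")` is true, while dropping one quality value makes it false.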
Also see SOLiD™ Data Format and File Definitions Guide (.pdf)
Legacy Formats
These formats are still accepted by SRA, but are considered out-of-date and not recommended for submission. If you are able to update your files to a more common format please do so before submitting to SRA.
SRF files
SRF is a generic format for DNA sequence data. This format has sufficient flexibility to store data from current and future DNA sequencing technologies. This is a single input file format for all downstream applications and a read lookup index enabling downstream formats to reference reads without duplication of all of the read specific information.
Sequence Read Format (SRF) homepage: http://srf.sourceforge.net/.
Native Illumina
Submitters may submit native data from the primary analysis output of the Illumina GA.
The filetype is Illumina_native and constituent files for a run should be tarred together into a single tar file.
Illumina GA readname can be defined as follows:
<lane> = 1..8
<tile> = 1..1024
<x> = 1..4096
<y> = 1..4096
<sep> ::= [_\t]
READNAME ::= [<flowcell><sep> | s_]<lane><sep><tile><sep><x><sep><y>
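One possible translation of this grammar into a regular expression is sketched below. The optional prefix is read as either "s_" or a flowcell name followed by a separator; the character set allowed in a flowcell name (anything except the separators) is an assumption, as is the function name.

```python
import re

READNAME = re.compile(
    r"^(?:s_|(?P<flowcell>[^_\t]+)[_\t])?"   # optional "s_" or <flowcell><sep>
    r"(?P<lane>\d+)[_\t](?P<tile>\d+)[_\t](?P<x>\d+)[_\t](?P<y>\d+)$"
)
LIMITS = {"lane": 8, "tile": 1024, "x": 4096, "y": 4096}


def parse_readname(name):
    """Return the parsed fields as a dict, or None if the name does not conform."""
    match = READNAME.match(name)
    if match is None:
        return None
    fields = {key: int(match.group(key)) for key in LIMITS}
    if any(not 1 <= fields[key] <= LIMITS[key] for key in LIMITS):
        return None
    fields["flowcell"] = match.group("flowcell")
    return fields
```

A name such as `s_1_12_100_200` parses to lane 1, tile 12, x 100, y 200 with no flowcell, while `s_9_12_100_200` is rejected because the lane is out of range.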
Within a related set of files, reads are grouped by tile. Reads should be fixed length, and the number of quality scores and bases is the same in each.
Allowed characters:
BASES: AaCcTtGgNn
QUALITIES: [\!\"\#\$\%\&\'\(\)\*\+,\-\.\/0-9:;<=>\?\@A-I]+ or [\@A-Z\[\\\]\^_`a-h]+
QSEQ
The basecalling program Bustard emits a _qseq.txt file for each lane (two files for mate pairs). Paired-end data are presented in the orientation in which they were sequenced (5'-3' & 3'-5').
Each read is contained on a single line with tab separators in the following format:
- Machine name: Unique identifier of the sequencer.
- Run number: Unique number to identify the run on the sequencer.
- Lane number: Positive Integer (currently 1-8).
- Tile number: Positive Integer.
- X coordinate of the spot: Integer (can be negative).
- Y coordinate of the spot: Integer (can be negative).
- Index: Positive Integer (no indexing should have a value of 1).
- Read Number: 1 for single reads; 1 or 2 for paired-ends.
- Sequence (BASES)
- Quality: the calibrated quality string (QUALITIES).
- Filter: Did the read pass filtering? 0 - No, 1 - Yes.
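The eleven tab-separated fields listed above map naturally onto a small parser. This sketch uses hypothetical field and function names; it converts the lane, tile, coordinates, and read number to integers and leaves the machine name, run number, and index as text.

```python
QSEQ_FIELDS = (
    "machine", "run", "lane", "tile", "x", "y",
    "index", "read_number", "sequence", "quality", "passed_filter",
)


def parse_qseq_line(line):
    """Parse one _qseq.txt line into a dict keyed by QSEQ_FIELDS."""
    values = line.rstrip("\n").split("\t")
    if len(values) != len(QSEQ_FIELDS):
        raise ValueError(f"expected {len(QSEQ_FIELDS)} fields, found {len(values)}")
    record = dict(zip(QSEQ_FIELDS, values))
    for key in ("lane", "tile", "x", "y", "read_number"):
        record[key] = int(record[key])  # x and y may be negative
    record["passed_filter"] = record["passed_filter"] == "1"
    if len(record["sequence"]) != len(record["quality"]):
        raise ValueError("sequence and quality lengths differ")
    return record
```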
Machine Specific Information
File types accepted by platform in approximate order of preference (formats that are least desirable marked with '*', those with uncertain outcome marked with '?'):
Illumina
bam, fastq, qseq, fasta+qual*?, native*, srf*?
SOLiD
bam, csfasta + QV.qual, srf*?
Roche 454 (formerly 454 Life Sciences)
bam, sff, fastq, fasta+qual*?
IonTorrent
bam, sff, fastq, fasta+qual*?
PacBio
bam, hdf5, fastq
MinION Oxford Nanopore
hdf5, fastq
Helicos
bam, fastq
Capillary (Sanger)
bam, fastq*?
CompleteGenomics
native, bam*
Complete Genomics format – see CG Data File Formats. This format requires providing tarred versions of the ASM, LIB, and MAP sub-directories for a successful submission to take place. Additionally, processing of reference sequences occurs in the same manner as for BAM and CRAM files. For this format, please contact SRA prior to submission.
Contact SRA
Contact SRA staff for assistance at sra@ncbi.nlm.nih.gov