NCBI Datasets SARS-CoV-2 Data Package
Sequences and metadata for a set of SARS-CoV-2 GenBank genomes or proteins
The NCBI Datasets SARS-CoV-2 Data Package contains sequences and metadata for a set of requested SARS-CoV-2 GenBank genomes or proteins. The data package may include genome, coding sequence (CDS) and protein sequences in FASTA format, and a data report containing metadata in JSON Lines format.
Note: The GBFF, GPFF and PDB files are no longer available as part of the SARS-CoV-2 data packages. If you have any questions or feedback on this change, please use the Feedback button below.
Package Content
NCBI Datasets SARS-CoV-2 Genome Data Package
sars-cov-2/
|-- README.md
`-- ncbi_dataset
    `-- data
        |-- cds.fna
        |-- data_report.jsonl
        |-- dataset_catalog.json
        |-- genomic.fna
        |-- protein.faa
        `-- virus_dataset.md
NCBI Datasets SARS-CoV-2 Protein Data Package
Note: this package does not contain the SARS-CoV-2 genome sequence.
spike-protein/
|-- README.md
`-- ncbi_dataset
    `-- data
        |-- cds.fna
        |-- data_report.jsonl
        |-- dataset_catalog.json
        |-- protein.faa
        `-- virus_dataset.md
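As a quick sanity check after unpacking a package, the layouts above can be compared against what is on disk. The following is a minimal sketch, not an official tool: the function name `missing_files` and the two constants are hypothetical, and it assumes the package has already been extracted to a directory. Because a data package may omit some sequence files depending on what was requested, a non-empty result is not necessarily an error.

```python
from pathlib import Path

# Files shown in the package trees above, relative to the package root.
COMMON_FILES = [
    "README.md",
    "ncbi_dataset/data/cds.fna",
    "ncbi_dataset/data/data_report.jsonl",
    "ncbi_dataset/data/dataset_catalog.json",
    "ncbi_dataset/data/protein.faa",
    "ncbi_dataset/data/virus_dataset.md",
]
# Only the genome data package lists a genomic FASTA file.
GENOME_ONLY_FILES = ["ncbi_dataset/data/genomic.fna"]


def missing_files(package_root, genome_package=True):
    """Return the listed files that are absent under package_root."""
    root = Path(package_root)
    expected = COMMON_FILES + (GENOME_ONLY_FILES if genome_package else [])
    return [rel for rel in expected if not (root / rel).is_file()]
```

For example, `missing_files("sars-cov-2")` would check the genome layout, and `missing_files("spike-protein", genome_package=False)` the protein layout.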
Virus Data Report
The virus data report contains metadata describing the genomes and proteins in the data package. The file is in JSON Lines format: each line is a JSON object holding the metadata for one genome or one protein. Use the dataformat tool to convert selected fields to a tabular format.
- Path: ncbi_dataset/data/data_report.jsonl
- Schema: Virus Data Report
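Because the report is JSON Lines, it can also be read with any JSON parser, one line at a time. The sketch below uses only the Python standard library; the helper names `read_data_report` and `field_names` are hypothetical, and no particular field names are assumed (consult the Virus Data Report schema for those).

```python
import json


def read_data_report(path):
    """Yield one dict per non-empty line of a JSON Lines file."""
    with open(path, encoding="utf-8") as handle:
        for line in handle:
            line = line.strip()
            if line:
                yield json.loads(line)


def field_names(records):
    """Collect the top-level keys seen across all records, sorted."""
    keys = set()
    for record in records:
        keys.update(record)
    return sorted(keys)
```

Calling `field_names(read_data_report("ncbi_dataset/data/data_report.jsonl"))` lists the top-level fields actually present in a given package, which is a reasonable first step before selecting fields to tabulate.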
FASTA Sequence Files
Genomic FASTA
Nucleotide sequence of the viral GenBank genome.
- Path: ncbi_dataset/data/genomic.fna
- Schema: Nucleotide FASTA
Example FASTA Defline:
>MW583405.1 Severe acute respiratory syndrome coronavirus 2 isolate SARS-CoV-2/human/USA/TX-CDC-9N37-8996/2021, complete genome
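A FASTA file like this one can be read without extra dependencies. The following is a minimal sketch with hypothetical helper names (`read_fasta`, `split_defline`); it assumes the usual FASTA convention that the text up to the first space on the defline is the sequence identifier and the remainder is a free-text description.

```python
def read_fasta(path):
    """Yield (defline, sequence) pairs; the defline excludes the leading '>'."""
    defline, chunks = None, []
    with open(path, encoding="utf-8") as handle:
        for line in handle:
            line = line.strip()
            if line.startswith(">"):
                # A new record starts: emit the previous one, if any.
                if defline is not None:
                    yield defline, "".join(chunks)
                defline, chunks = line[1:], []
            elif defline is not None:
                chunks.append(line)
    if defline is not None:
        yield defline, "".join(chunks)


def split_defline(defline):
    """Split a defline into (identifier, description) at the first space."""
    identifier, _, description = defline.partition(" ")
    return identifier, description
```

Applied to the example defline above, `split_defline` would separate the identifier `MW583405.1` from the descriptive text that follows it.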
CDS FASTA
Nucleotide sequence for the coding sequence of each protein and mature peptide.
- Path: ncbi_dataset/data/cds.fna
- Schema: Nucleotide FASTA
Example FASTA Defline:
>NC_045512.2:21563-25384 surface glycoprotein [organism=Severe acute respiratory syndrome coronavirus 2] [isolate=Wuhan-Hu-1]
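The identifier in the example defline combines a genome accession with a coordinate range. A small sketch of how that identifier might be split is shown below. The function name `parse_cds_identifier` is hypothetical, and the code assumes a simple `accession:start-end` form with 1-based, inclusive coordinates (the usual GenBank convention); identifiers in other forms are not handled and return `None`.

```python
import re

# Matches identifiers of the form "ACCESSION:START-END".
_LOCATION = re.compile(r"^(?P<accession>[^\s:]+):(?P<start>\d+)-(?P<end>\d+)$")


def parse_cds_identifier(identifier):
    """Return accession, start, end and length, or None if the form differs."""
    match = _LOCATION.match(identifier)
    if match is None:
        return None
    start, end = int(match["start"]), int(match["end"])
    return {
        "accession": match["accession"],
        "start": start,
        "end": end,
        # Assumes inclusive coordinates, so both endpoints are counted.
        "length": end - start + 1,
    }
```

Under those assumptions, `parse_cds_identifier("NC_045512.2:21563-25384")` gives a length of 3822 nucleotides for the surface glycoprotein coding sequence.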
Protein FASTA
Protein sequences for each protein and mature peptide.
- Path: ncbi_dataset/data/protein.faa
- Schema: Protein FASTA
Example FASTA Defline:
>QMT27626.1:1-180 leader protein [polyprotein=ORF1ab polyprotein] [organism=Severe acute respiratory syndrome coronavirus 2] [isolate=SARS-CoV-2/human/USA/WA-S1488/2020]
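The CDS and protein deflines carry extra attributes in square brackets, such as `[organism=...]` and `[isolate=...]`. The sketch below pulls those into a dictionary. The function name `parse_defline` is hypothetical, and the code assumes each attribute is written as `[key=value]` with no nested brackets, as in the examples above.

```python
import re

# One "[key=value]" attribute; keys and values contain no square brackets.
_MODIFIER = re.compile(r"\[([^=\[\]]+)=([^\[\]]*)\]")


def parse_defline(defline):
    """Return (identifier, name, modifiers) for a bracket-annotated defline."""
    text = defline.lstrip(">").strip()
    identifier, _, rest = text.partition(" ")
    # The product name is the text before the first bracketed attribute.
    name = rest.split("[", 1)[0].strip()
    modifiers = {key.strip(): value.strip() for key, value in _MODIFIER.findall(rest)}
    return identifier, name, modifiers
```

For the protein example above, this would yield the identifier `QMT27626.1:1-180`, the name `leader protein`, and a dictionary with the keys `polyprotein`, `organism` and `isolate`.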
Additional Files
Virus README
The virus README describes the available SARS-CoV-2 data packages, their content and options for querying.
- Path: ncbi_dataset/data/virus_dataset.md
README.md
The README contains a general project description common to all data packages.
- Path: README.md
Dataset catalog
The dataset catalog lists each data file contained within or referenced by this package. Each data file is associated with a content type and location.
- Path: ncbi_dataset/data/dataset_catalog.json
Related information
Get SARS-CoV-2 data using one of these tools:
- Browse and download SARS-CoV-2 genome data at the NCBI Datasets Coronavirus Genomes page
- Browse and download SARS-CoV-2 protein data at the Datasets SARS-CoV-2 protein page
- Learn how to download SARS-CoV-2 genome data using the How-to guide
- Learn how to download SARS-CoV-2 protein data using the How-to guide
- Download via programmatic access through our OpenAPI specification