protein
Name
datasets download virus protein - download a SARS-CoV-2 protein dataset by protein name
Synopsis
datasets download virus protein <protein_name ...> [flags]
Description
Download a SARS-CoV-2 protein data package by protein name. SARS-CoV-2 protein data packages include CDS and protein sequences, annotation, and a detailed data report. Each data package is downloaded as a zip file.
The default SARS-CoV-2 protein data package includes the following files:
- cds.fna (nucleotide coding sequences)
- protein.faa (protein sequences)
- data_report.jsonl (data report with viral metadata)
- virus_dataset.md (README containing details on sequence file data content and other information)
- dataset_catalog.json (a list of files and file types included in the data package)
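As a quick sanity check after a download, the file list above can be compared against the archive contents. The following is a minimal sketch, not part of the datasets tool: missing_default_files is a hypothetical helper name, and because the directory layout inside the zip is not described here, it matches on base file names only.

```shell
# Default file names copied from the list above.
DEFAULT_FILES="cds.fna protein.faa data_report.jsonl virus_dataset.md dataset_catalog.json"

# Hypothetical helper: read archive member paths on stdin (one per line)
# and print each default file whose base name is absent.
# Returns 0 only when nothing is missing.
missing_default_files() {
  local listing f missing=0
  # Strip any leading directories; the internal layout is an unknown here.
  listing="$(sed 's|.*/||')"
  for f in $DEFAULT_FILES; do
    if ! printf '%s\n' "$listing" | grep -qxF "$f"; then
      echo "$f"
      missing=1
    fi
  done
  return "$missing"
}
```

One possible use, assuming unzip is installed, is `unzip -Z1 ncbi_dataset.zip | missing_default_files`. A package downloaded with --exclude-cds or --exclude-protein would be expected to report the corresponding file as missing.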
Refer to NCBI’s download and install documentation for information about getting started with the command-line tools.
Allowed protein names:
- ORF1ab
- ORF1a
- nsp1
- nsp2
- nsp3
- nsp4
- nsp5
- nsp6
- nsp7
- nsp8
- nsp9
- nsp10
- rdrp
- nsp11
- nsp13
- nsp14
- nsp15
- nsp16
- S
- ORF3a
- E
- M
- ORF6
- ORF7a
- ORF7b
- ORF8
- N
- ORF10
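In a script it can help to check a name against the list above before starting a download. The sketch below is illustrative only: is_allowed_protein is a hypothetical helper, and it uses an exact, case-sensitive comparison, which is an assumption (whether the command itself accepts other capitalizations is not stated here).

```shell
# Allowed names copied verbatim from the list above.
ALLOWED_PROTEINS="ORF1ab ORF1a nsp1 nsp2 nsp3 nsp4 nsp5 nsp6 nsp7 nsp8 nsp9 nsp10 rdrp nsp11 nsp13 nsp14 nsp15 nsp16 S ORF3a E M ORF6 ORF7a ORF7b ORF8 N ORF10"

# Hypothetical helper: return 0 if the name appears in the list
# (exact, case-sensitive match; an assumption, see above), 1 otherwise.
is_allowed_protein() {
  local name="$1" p
  for p in $ALLOWED_PROTEINS; do
    if [ "$p" = "$name" ]; then
      return 0
    fi
  done
  return 1
}
```

For example, `is_allowed_protein rdrp && datasets download virus protein rdrp` would only run the download when the name is listed.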
Examples
datasets download virus protein S --host dog --filename SARS2-spike-dog.zip
datasets download virus protein S E M N --refseq --filename SARS2-structural-refseq.zip
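The examples above each pair one filter with --filename. As a further sketch, the function below wraps a download that uses two filters from the Options list together. The function name download_sn_by_location is hypothetical, and combining --complete-only with --geo-location in one call is an assumption that is not shown in the examples above.

```shell
# Hypothetical wrapper: download S and N protein data for complete genomes
# from one geographic location.
#   $1  location (continent, country or U.S. state)
#   $2  output zip file name
# Assumption: --complete-only and --geo-location can be combined.
download_sn_by_location() {
  local location="$1" outfile="$2"
  datasets download virus protein S N \
    --complete-only \
    --geo-location "$location" \
    --filename "$outfile"
}
```

A call might look like `download_sn_by_location "Europe" SARS2-S-N-europe.zip`. Quoting the location keeps multi-word values as a single argument.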
Options
      --annotated                limit to annotated coronavirus genomes
      --api-key string           NCBI Datasets API Key
      --complete-only            limit to complete coronavirus genomes
      --exclude-cds              exclude cds.fna (CDS sequence file)
      --exclude-protein          exclude protein.faa (protein sequence file)
      --filename string          specify a custom file name for the downloaded dataset (default "ncbi_dataset.zip")
      --geo-location string      limit to coronavirus genomes isolated from a specified geographic location (continent, country or U.S. state)
  -h, --help                     help for protein
      --host string              limit to coronavirus genomes isolated from a specified host (NCBI Taxonomy ID, scientific or common name at any taxonomic rank)
      --no-progressbar           hide progress bar
      --refseq                   limit to RefSeq coronavirus genomes
      --released-since string    limit to coronavirus genomes released after a specified date (MM/DD/YYYY)
      --updated-since string     limit to coronavirus genomes updated after a specified date (MM/DD/YYYY)
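The date options above expect MM/DD/YYYY, while scripts often carry dates as YYYY-MM-DD. The following is a minimal conversion sketch; iso_to_mdy is a hypothetical helper name, and it checks only the shape of the input, not whether the date is a real calendar date.

```shell
# Hypothetical helper: convert YYYY-MM-DD to the MM/DD/YYYY form
# expected by --released-since and --updated-since.
# Validates the digit pattern only, not calendar correctness.
iso_to_mdy() {
  local iso="$1" y m d
  case "$iso" in
    [0-9][0-9][0-9][0-9]-[0-9][0-9]-[0-9][0-9]) ;;
    *) echo "expected YYYY-MM-DD, got: $iso" >&2; return 1 ;;
  esac
  y="${iso%%-*}"        # text before the first dash
  d="${iso##*-}"        # text after the last dash
  m="${iso#*-}"         # drop the year and first dash
  m="${m%-*}"           # drop the last dash and day
  printf '%s/%s/%s\n' "$m" "$d" "$y"
}
```

For example, `datasets download virus protein N --released-since "$(iso_to_mdy 2024-01-15)"` passes 01/15/2024 to the option.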
Generated March 11, 2025