FAQs
Answers to common questions about NCBI Datasets
Why are NCBI Datasets CLI v13.x and older and API v1 being deprecated and retired?
NCBI Datasets API v1 and the command-line tools (CLI) v13.x and older are being retired so that we can focus our attention on improved features in the newer versions. Retiring them ensures that our users have access to the latest advancements and an efficient experience.
What are the benefits of migrating to CLI v16+ and API v2?
Migrating to CLI v16+ and API v2 offers several advantages, including access to enhanced functionality, improved performance, and ongoing support, ensuring a better user experience.
When will API v1 and CLI v13 deprecation and retirement occur?
The deprecation is set for June 2024, with retirement planned for December 2024. During this period, users are advised to migrate to the more recent versions, CLI v16+ and API v2. API v2 is expected to reach stability by June 2024.
Will my existing scripts and workflows using API v1 and CLI v13 continue to work after retirement?
No, these older versions will no longer be functional after API v1 and CLI v13 are retired. It is crucial to migrate to CLI v16+ and API v2 to ensure uninterrupted access and functionality.
What will happen to the Python and R libraries with the deprecation and retirement of API v1 and CLI v13?
The NCBI Datasets Python and R libraries that rely on API v1 and CLI v13 will no longer function after their retirement. See programming languages for guidance on interacting with the API using your preferred language, ensuring a seamless transition to the latest versions of NCBI Datasets.
How is Datasets command-line tools version 14+ (CLI v14+) different from version 13 (CLI v13.x) and previous versions?
The new Datasets command-line tools (CLI v14+)…
- Provide easier access to metadata
- Contain smaller data packages (faster downloads)
- Offer expanded content for virus genomes
- Deliver genome sequences as a single file by default
- Use simpler command syntax (data files are now included using the --include flag)
Easier access to metadata
All metadata can now be printed to the screen, redirected to a file, or piped to the dataformat command-line tool to generate a customized table. Additionally, metadata formats have been standardized across services, and all metadata schemas are now documented. Previously, some metadata was only available as part of a downloaded data package.
Smaller data packages
Data packages now include a smaller set of files by default, so downloads are faster and more reliable. For example, the default genome data package now includes only genome sequence and the data report file. You also have the option to include all other sequence, annotation and report files.
Expanded content for virus genomes
All genomes in NCBI Virus are now available through Datasets.
Genome sequences are now delivered as a single file
CLI v14+ delivers genome sequences as a single file by default. You also have the option to request genome sequences as separate files by chromosome using --chromosomes.
Simpler command syntax
CLI v14+ offers a simpler way to request specific data files and data reports (metadata) compared to previous CLI versions.
Data files can be specified using a single --include flag instead of multiple exclude flags.
For example, genome and protein sequences for the current human reference genome can be downloaded using:
datasets download genome taxon human --reference --include genome,protein
You can also add data reports to the data package using the --include flag.
Why does --exclude not work in CLI v14+?
We have removed the multiple --exclude flags from CLI v14+ in favor of a single --include flag. Data package content can be customized by specifying the desired data or data reports (metadata) after the --include flag.
Combined with changes to the contents of our default data packages, requesting the data you want is simpler and more intuitive.
For example, to get genome and protein sequences for the human reference genome, try the following:
datasets download genome taxon human --reference --include genome,protein
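If you call the CLI from a script, the --include syntax above can be assembled programmatically. The following is a minimal Python sketch, not an official client: the helper name build_genome_download is hypothetical, and it uses only the subcommands and flags shown in this FAQ.

```python
import shlex


def build_genome_download(taxon, include, reference=True):
    """Build the argument list for `datasets download genome taxon ...`.

    Hypothetical helper: it only assembles the flags shown in this FAQ
    and does not run the command.
    """
    command = ["datasets", "download", "genome", "taxon", taxon]
    if reference:
        command.append("--reference")
    if include:
        # --include takes a single comma-separated list of data types.
        command += ["--include", ",".join(include)]
    return command


command = build_genome_download("human", ["genome", "protein"])
# Print the command as it would be typed in a shell.
print(shlex.join(command))
```

With the datasets CLI installed, the resulting list could be passed to subprocess.run(command, check=True).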
Which version of the documentation should I use?
Since the release of the Datasets command-line tools (CLI) version 14, we have maintained two documentation versions. The first, CLI v16+ (API v2), describes the latest version of the command-line tools (v16+) and the underlying API (v2). The second, CLI v13.x (API v1), describes the previous version of the command-line tools (v13.x) and the underlying API (v1). You can choose your preferred documentation version or toggle between the two using the drop-down options on the left side of each documentation page.
CLI v16+ (API v2) documentation
CLI v16+ (API v2) describes the latest version of the NCBI Datasets command-line tools and the underlying API. Please refer to this latest documentation if you are using the latest version of the command-line tools, datasets and dataformat v16+, or are using the latest version of the Datasets API (v2).
CLI v13.x (API v1) documentation
CLI v13.x (API v1) describes the previous version of the NCBI Datasets command-line tools and the underlying API. Please refer to this documentation version if you are using previous versions of the command-line tools, datasets and dataformat v13.x or earlier, or are using the previous version of the Datasets API (v1).
We recommend you upgrade to the latest version of the CLI. However, in certain scenarios your workflow or code may stop working if you upgrade to the latest version due to breaking changes in command-line syntax, data report schemas, and/or default data package file contents. In such instances, you may choose to continue using previous versions of the CLI.
Where is the data I requested?
Your data is in the subdirectory ncbi_dataset/data/ within the zip archive you downloaded.
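To check what a downloaded archive contains without extracting it, you can list the members under ncbi_dataset/data/. The sketch below uses Python's standard zipfile module. The function name list_package_data is hypothetical, and the in-memory archive with its member names is a stand-in so the example runs without a real download.

```python
import io
import zipfile

DATA_PREFIX = "ncbi_dataset/data/"


def list_package_data(archive):
    """Return the archive members stored under ncbi_dataset/data/.

    `archive` may be a path to the downloaded zip file or a file-like object.
    """
    with zipfile.ZipFile(archive) as package:
        return [
            name
            for name in package.namelist()
            # Skip directory entries, which end with a slash.
            if name.startswith(DATA_PREFIX) and not name.endswith("/")
        ]


# Build a tiny stand-in archive in memory.
# The member names are placeholders, not a real package listing.
demo = io.BytesIO()
with zipfile.ZipFile(demo, "w") as package:
    package.writestr("README.md", "placeholder")
    package.writestr("ncbi_dataset/data/example_sequences.fna", ">seq1\nACGT\n")

print(list_package_data(demo))
```

For a real download, pass the path to the zip file instead of the in-memory object.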
I still can’t find my data, can you help?
We have identified a bug affecting Mac Safari users. When downloading data from the NCBI Datasets web interface, you may see only a README file after the download has completed (while other files appear to be missing). As a workaround to prevent this issue from recurring, we recommend disabling automatic zip archive extraction in Safari until Apple releases a bug fix. For more information, visit: Mac Safari zip archive bug
What file formats can be downloaded using NCBI Datasets?
Datasets offers the following file formats (if available for the requested query):
- Sequence files in FASTA format: genomic, gene, transcript, and protein sequences
- Annotation files: GTF, GFF3, and GBFF
- Metadata files: JSON and JSON Lines
What is a data package?
A “data package” is an NCBI Datasets zip archive that contains sequence, annotation, metadata and other biological data. For more detailed information about the gene, genome and virus data packages, please visit: Data packages
How do I work with JSON Lines data reports?
Visit our JSON Lines data report documentation page
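As a quick orientation, a JSON Lines file holds one JSON object per line, so it can be read line by line with Python's standard json module. This is a minimal sketch: the helper name read_jsonl is hypothetical, and the field names and values in the stand-in report are invented for illustration and do not reflect the documented report schemas.

```python
import io
import json


def read_jsonl(handle):
    """Yield one parsed record per non-empty line of a JSON Lines stream."""
    for line in handle:
        line = line.strip()
        if line:
            yield json.loads(line)


# Stand-in report; the field names and values are hypothetical.
demo_report = io.StringIO(
    '{"accession": "EXAMPLE_1", "organism": "example organism"}\n'
    '{"accession": "EXAMPLE_2", "organism": "example organism"}\n'
)

records = list(read_jsonl(demo_report))
accessions = [record["accession"] for record in records]
print(accessions)
```

For a real data report, open the .jsonl file with open(path, encoding="utf-8") and pass the file handle to read_jsonl.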
How can I access resources on NCBI Datasets website programmatically?
We have three options for programmatic access. Click on each link for more information and installation options.
Why do gene counts differ when comparing taxonomy and species pages to the gene table?
Gene counts on the taxonomy and species pages are derived from the annotation report. The annotation report and other genome annotation files represent a snapshot of the genome at the time of genome annotation. In contrast, the gene table and gene data obtained from the datasets command-line tool (datasets download gene...) contain current gene data, including unannotated genes, genes created after the last annotation, and any updates made to existing genes after the last annotation. For some model organisms, particularly human, frequent manual curation means that current gene data is likely to differ from the most recent annotation.
What is the difference between a GenBank (GCA) and RefSeq (GCF) genome assembly?
A GenBank (GCA) genome assembly contains assembled genome sequences submitted by investigators or sequencing centers to GenBank or another member of the International Nucleotide Sequence Database Collaboration (INSDC). The GenBank (GCA) assembly is an archival record that is owned by the submitter and may or may not include annotation. A RefSeq (GCF) genome assembly represents an NCBI-derived copy of a submitted GenBank (GCA) assembly. RefSeq (GCF) assembly records are maintained by NCBI. In some cases the RefSeq (GCF) assembly may not be completely identical to the GenBank (GCA) assembly due to assembly improvements made by NCBI staff. All RefSeq (GCF) genome assemblies include annotation.
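When processing lists of assembly accessions in a script, the GCA and GCF prefixes described above are enough to tell the two kinds of record apart. The following Python sketch assumes accessions begin with "GCA_" or "GCF_". The function name assembly_source is hypothetical, and the accession strings serve only as format examples.

```python
def assembly_source(accession):
    """Classify an assembly accession as GenBank or RefSeq by its prefix."""
    prefix = accession.strip()[:4].upper()
    if prefix == "GCA_":
        return "GenBank"
    if prefix == "GCF_":
        return "RefSeq"
    raise ValueError(f"Not a GCA/GCF assembly accession: {accession!r}")


# Accession strings used purely as format examples.
for accession in ["GCA_000001405.29", "GCF_000001405.40"]:
    print(accession, assembly_source(accession))
```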