NCBI Datasets Gene Package
Sequences and metadata for a set of requested genes
The NCBI Datasets Gene Data Package contains sequences and metadata for a set of requested genes. The data package may include gene, transcript, and protein sequences in FASTA format, data reports containing metadata in JSON Lines format, and a subset of metadata in tabular format. There are two types of gene data package: eukaryotic and prokaryotic. The differences between the two types are described below.
Package content
NCBI Datasets Eukaryotic Gene Data Package
This example of Human BRCA1 (GeneID: 672) illustrates a typical eukaryotic gene data package.
human-brca1
|-- README.md
`-- ncbi_dataset
    `-- data
        |-- data_report.jsonl
        |-- data_table.tsv
        |-- dataset_catalog.json
        |-- gene.fna
        |-- protein.faa
        `-- rna.fna
NCBI Datasets Prokaryotic Gene Data Package
This example of E. coli restriction endonuclease (WP_000769114.1) illustrates a typical prokaryotic gene data package.
endonuclease
|-- README.md
`-- ncbi_dataset
    `-- data
        |-- annotation_report.jsonl
        |-- data_report.jsonl
        |-- dataset_catalog.json
        |-- gene.fna
        `-- protein.faa
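The two layouts above differ in which files sit under ncbi_dataset/data. As a minimal sketch, assuming the package has already been extracted to a local directory, the type can be guessed from the files that this page describes as specific to one type (annotation_report.jsonl for prokaryotic, data_table.tsv for eukaryotic). The function name package_kind is hypothetical and is not part of any NCBI tool.

```python
from pathlib import Path

# File names taken from the example layouts on this page.
PROKARYOTIC_ONLY = "annotation_report.jsonl"
EUKARYOTIC_ONLY = "data_table.tsv"


def package_kind(package_root):
    """Guess the package type from the files in <package_root>/ncbi_dataset/data.

    Hypothetical helper: returns "prokaryotic", "eukaryotic", or "unknown".
    """
    data_dir = Path(package_root) / "ncbi_dataset" / "data"
    if not data_dir.is_dir():
        return "unknown"
    names = {entry.name for entry in data_dir.iterdir() if entry.is_file()}
    if PROKARYOTIC_ONLY in names:
        return "prokaryotic"
    if EUKARYOTIC_ONLY in names:
        return "eukaryotic"
    return "unknown"
```

For the first example, package_kind("human-brca1") would be expected to report a eukaryotic package, provided the extracted directory matches the layout shown.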
Gene data report
The gene data report contains metadata describing the genes in the data package. The file is in JSON Lines format, where each line is the metadata for one gene. Use the dataformat tool to convert selected fields to a tabular format. The content of the gene data report differs between the eukaryotic and prokaryotic data packages. For details, see the schemas below.
Eukaryotic gene data report (Gene Report)
- Path:
ncbi_dataset/data/data_report.jsonl
- Schema: Gene Data Report
Prokaryotic gene data report (Prokaryotic gene report)
- Path:
ncbi_dataset/data/data_report.jsonl
- Schema: Gene Data Report
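Because each line of the report is one JSON object, the file can be read line by line without loading it whole. The sketch below uses only the Python standard library and makes no assumption about field names; it simply collects whatever top-level keys are present. The helper names read_jsonl and field_names are illustrative, not part of the NCBI tooling.

```python
import json


def read_jsonl(path):
    """Yield one parsed record per non-empty line of a JSON Lines file."""
    with open(path, encoding="utf-8") as handle:
        for line in handle:
            line = line.strip()
            if line:
                yield json.loads(line)


def field_names(path):
    """Return the sorted union of top-level keys across all records.

    Assumes every line holds a JSON object, as described for the gene
    data report.
    """
    names = set()
    for record in read_jsonl(path):
        names.update(record)
    return sorted(names)


# Example call against the path given on this page:
# field_names("ncbi_dataset/data/data_report.jsonl")
```

Listing the keys first is a cautious way to explore a report before consulting the schema for the meaning of each field.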
Gene annotation report
The gene annotation report contains metadata describing the annotated locations of the genes in the data package and is only provided for prokaryotic genes. The file is in JSON Lines format, where each line is the metadata for one gene. Use the dataformat tool to convert selected fields to a tabular format.
- Path:
ncbi_dataset/data/annotation_report.jsonl
- Schema: Gene Annotation Report
Gene data table
The gene data table is a tabular representation of a subset of metadata in the gene data report and is only provided for eukaryotic genes. Each row of the data table represents one transcript of each gene in the data package.
The columns of the data table are Gene ID, Symbol, Gene name, Gene type, Scientific name, Transcripts, and Query.
- Path:
ncbi_dataset/data/data_table.tsv
- Schema: Gene Data Report Schema
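Since each row is one transcript, a gene with several transcripts spans several rows. A minimal sketch of grouping the rows back by gene follows. It assumes the file is tab-separated with a header row; the exact header text is not stated on this page, so the key column is passed in by the caller rather than hard-coded. The function name rows_by_column is hypothetical.

```python
import csv
from collections import defaultdict


def rows_by_column(path, key_column):
    """Group the rows of a tab-separated file by the value in key_column.

    key_column must match a header in the file exactly; check the first
    line of data_table.tsv for the actual spelling.
    """
    groups = defaultdict(list)
    with open(path, newline="", encoding="utf-8") as handle:
        reader = csv.DictReader(handle, delimiter="\t")
        if key_column not in (reader.fieldnames or []):
            raise KeyError(f"column not found in header: {key_column!r}")
        for row in reader:
            groups[row[key_column]].append(row)
    return dict(groups)


# Example call, assuming the header uses the column name listed above:
# rows_by_column("ncbi_dataset/data/data_table.tsv", "Gene ID")
```

The number of rows in each group then gives the number of transcripts reported for that gene.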
FASTA sequence files
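You can request up to three FASTA sequence files: gene, transcript, and protein. The prokaryotic example above contains only the gene and protein files.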
Gene FASTA
- Path:
ncbi_dataset/data/gene.fna
- Schema: Nucleotide FASTA
Example FASTA Defline:
>NC_000004.12:c122621066-122610108 IL21 [organism=Homo sapiens] [GeneID=59067] [chromosome=4]
Transcript FASTA
- Path:
ncbi_dataset/data/rna.fna
- Schema: Nucleotide FASTA
Example FASTA Defline:
>NM_021803.4 IL21 [organism=Homo sapiens] [GeneID=59067] [transcript=1]
Protein FASTA
- Path:
ncbi_dataset/data/protein.faa
- Schema: Protein FASTA
Example FASTA Defline:
>NP_001193935.1 IL21 [organism=Homo sapiens] [GeneID=59067] [isoform=2 precursor]
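The three example deflines share one pattern: a sequence identifier, the gene symbol, then bracketed key=value attributes. The sketch below parses that pattern as inferred from these examples only. It is not an official defline grammar, and the function name parse_defline is hypothetical.

```python
import re

# Matches one bracketed attribute such as [GeneID=59067].
_ATTRIBUTE = re.compile(r"\[([^=\]]+)=([^\]]*)\]")


def parse_defline(defline):
    """Split a defline into sequence identifier, symbol, and attributes.

    Assumes the layout seen in the examples on this page:
    >SEQ_ID SYMBOL [key=value] [key=value] ...
    """
    text = defline.lstrip(">").strip()
    # Everything before the first "[" holds the identifier and the symbol.
    head = text.split("[", 1)[0].split()
    seq_id = head[0] if head else None
    symbol = head[1] if len(head) > 1 else None
    attributes = dict(_ATTRIBUTE.findall(text))
    return {"seq_id": seq_id, "symbol": symbol, "attributes": attributes}


example = ">NM_021803.4 IL21 [organism=Homo sapiens] [GeneID=59067] [transcript=1]"
parsed = parse_defline(example)
print(parsed["seq_id"], parsed["symbol"], parsed["attributes"]["GeneID"])
# prints: NM_021803.4 IL21 59067
```

Attribute values are kept as strings, so a value with spaces such as "2 precursor" in the protein example is returned unchanged.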
README.md
The README contains a general project description common to all data packages.
- Path:
README.md
Dataset catalog
The dataset catalog lists each data file contained within or referenced by the package. Each data file is associated with a content type and location.
- Path:
ncbi_dataset/data/dataset_catalog.json
Related information
Retrieve a gene package using one of these tools:
- Browse and download at the NCBI Datasets Gene Page by gene-id, symbol or RefSeq sequence accession
- Browse and download with the command line tool
- Download via programmatic access through our OpenAPI specification