Prokaryote gene report
Prokaryote gene record identifiers, protein info, and taxonomic scope
The downloaded prokaryote package contains a prokaryote gene data report in JSON Lines format in the file:
ncbi_dataset/data/data_report.jsonl
Each line of the prokaryote gene data report file is a hierarchical JSON object that represents a single prokaryote gene record. The schema of the prokaryote gene record is defined in the tables below, where each row describes a single field in the report or a sub-structure, which is a collection of fields. The outermost structure of the report is ProkaryoteGene.
Table fields that include a Table Field Mnemonic can be used with the dataformat command-line tool's --fields option.
Sample report
{
  "accession": "WP_001435165.1",
  "geneSymbol": "merC",
  "numberOfGenomeMappings": 15,
  "proteinLength": 137,
  "proteinName": "organomercurial transporter MerC",
  "proteinNameEvidence": {
    "accession": "NF010318.0",
    "category": "HMM",
    "source": "NCBI Protein Cluster (PRK)"
  },
  "taxonomyScope": {
    "organismName": "Gammaproteobacteria",
    "taxId": 1236
  }
}
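Because each line of the report is an independent JSON object, the file can be read one record at a time. The following is a minimal Python sketch, not an official client: the helper name `read_gene_records` is hypothetical, and an in-memory string holding the sample report above stands in for the real file.

```python
import io
import json

# Stand-in for ncbi_dataset/data/data_report.jsonl: one JSON object per line.
# The single record below is the sample report, collapsed onto one line.
SAMPLE_JSONL = (
    '{"accession": "WP_001435165.1", "geneSymbol": "merC", '
    '"numberOfGenomeMappings": 15, "proteinLength": 137, '
    '"proteinName": "organomercurial transporter MerC", '
    '"proteinNameEvidence": {"accession": "NF010318.0", "category": "HMM", '
    '"source": "NCBI Protein Cluster (PRK)"}, '
    '"taxonomyScope": {"organismName": "Gammaproteobacteria", "taxId": 1236}}\n'
)


def read_gene_records(handle):
    """Yield one dict per non-empty line of a JSON Lines stream (hypothetical helper)."""
    for line in handle:
        line = line.strip()
        if line:
            yield json.loads(line)


records = list(read_gene_records(io.StringIO(SAMPLE_JSONL)))
first = records[0]
# Sub-structures such as taxonomyScope are parsed into nested dicts.
print(first["accession"], first["geneSymbol"], first["taxonomyScope"]["taxId"])
```

To read a downloaded package instead, the same helper can be given an open file handle, for example `open("ncbi_dataset/data/data_report.jsonl", encoding="utf-8")`.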
ProkaryoteGene Structure
Field | Table Field Mnemonic | Table Column Name | Type | Description | Examples |
---|---|---|---|---|---|
accession | accession | Accession | string | The RefSeq WP_ prefixed accession for the protein sequence. | WP_000443665.1 |
geneSymbol | gene-symbol | Gene Symbol | string | The gene symbol | ligA |
proteinName | protein-name | Protein Name | string | The protein name | NAD-dependent DNA ligase LigA |
proteinLength | protein-length | Protein Length | uint32 | Length of the protein | 671 |
taxonomyScope | | | Organism | | |
numberOfGenomeMappings | mapping-count | Number of Genome Mappings | uint32 | The number of nucleotide mappings | 7642 |
proteinNameEvidence | name-evidence- | Protein Name Evidence | ProkaryoteGene.ProteinNameEvidence | | |
description | description | Description | string | Description | Catalyzes the formation of a phosphodiester at the site of a single-strand break in duplex DNA |
ecNumber repeated | ec-number | EC Number | string | EC Number | 6.5.1.2 |
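As an illustration of how the scalar fields in this table line up with their Table Column Names, the sketch below flattens one record into a tab-separated row. It is an assumption-laden example, not the output format of the dataformat tool: the `COLUMNS` list and the `to_row` helper are hypothetical, absent fields are written as empty cells, and values of the repeated ecNumber field are joined with commas purely by choice.

```python
import csv
import io

# (JSON field, Table Column Name) pairs copied from the ProkaryoteGene table.
# Only top-level scalar and repeated-scalar fields are included.
COLUMNS = [
    ("accession", "Accession"),
    ("geneSymbol", "Gene Symbol"),
    ("proteinName", "Protein Name"),
    ("proteinLength", "Protein Length"),
    ("numberOfGenomeMappings", "Number of Genome Mappings"),
    ("description", "Description"),
    ("ecNumber", "EC Number"),
]


def to_row(record):
    """Return the record's values in COLUMNS order, as strings (hypothetical helper)."""
    row = []
    for field, _column in COLUMNS:
        value = record.get(field, "")  # absent fields become empty cells
        if isinstance(value, list):    # repeated fields such as ecNumber
            value = ",".join(str(v) for v in value)
        row.append(str(value))
    return row


# Top-level fields of the sample report; it has no description or ecNumber.
record = {
    "accession": "WP_001435165.1",
    "geneSymbol": "merC",
    "numberOfGenomeMappings": 15,
    "proteinLength": 137,
    "proteinName": "organomercurial transporter MerC",
}

buffer = io.StringIO()
writer = csv.writer(buffer, delimiter="\t", lineterminator="\n")
writer.writerow([column for _field, column in COLUMNS])
writer.writerow(to_row(record))
tsv_text = buffer.getvalue()
print(tsv_text, end="")
```

Using `record.get` rather than direct indexing matters here because the sample report shows that a record may omit fields such as description and ecNumber.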
LineageOrganism Structure
Field | Table Field Mnemonic | Table Column Name | Type | Description | Examples |
---|---|---|---|---|---|
taxId | coming soon | coming soon | uint32 | NCBI Taxonomy identifier | 11118 |
name | coming soon | coming soon | string | Scientific name | Coronaviridae |
Organism Structure
Field | Table Field Mnemonic | Table Column Name | Type | Description | Examples |
---|---|---|---|---|---|
taxId | tax-id | Taxonomic ID | uint32 | NCBI Taxonomy identifier | 9606, 2697049 |
organismName | organism-name | Organism Name | string | Scientific name | Homo sapiens, Severe acute respiratory syndrome coronavirus 2 |
commonName | common-name | Common Name | string | Common name | human pangolin MERS SARS2 |
lineage repeated | | | LineageOrganism | Lineage ordered from superkingdom level to increasingly more specific taxonomic entries | |
strain | strain | Strain | string | | SE11 |
pangolinClassification | pangolin | Pangolin Classification | string | | B.1.1.7 |
ProkaryoteGene.ProteinNameEvidence Structure
Field | Table Field Mnemonic | Table Column Name | Type | Description | Examples |
---|---|---|---|---|---|
accession | accession | Accession | string | Accession | NF005932.1 |
category | category | Category | string | Category | HMM |
source | source | Source | string | Source | NCBI Protein Cluster (PRK) |
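The structures above can be mirrored as typed containers when a program needs more than raw dicts. The following Python sketch is one possible modeling, under stated assumptions: the class layout, the snake_case attribute names, and the `gene_from_dict` helper are hypothetical and not part of any NCBI library; only a subset of the Organism fields is modeled, and every field is treated as optional.

```python
from dataclasses import dataclass, field
from typing import List, Optional


@dataclass
class ProteinNameEvidence:
    accession: str = ""
    category: str = ""
    source: str = ""


@dataclass
class Organism:
    # Subset of the Organism structure: only the fields in the sample report.
    tax_id: Optional[int] = None
    organism_name: str = ""


@dataclass
class ProkaryoteGene:
    accession: str = ""
    gene_symbol: str = ""
    protein_name: str = ""
    protein_length: Optional[int] = None
    number_of_genome_mappings: Optional[int] = None
    description: str = ""
    ec_number: List[str] = field(default_factory=list)
    taxonomy_scope: Optional[Organism] = None
    protein_name_evidence: Optional[ProteinNameEvidence] = None


def gene_from_dict(d):
    """Build a ProkaryoteGene from one parsed report line (hypothetical helper)."""
    scope = d.get("taxonomyScope")
    evidence = d.get("proteinNameEvidence")
    return ProkaryoteGene(
        accession=d.get("accession", ""),
        gene_symbol=d.get("geneSymbol", ""),
        protein_name=d.get("proteinName", ""),
        protein_length=d.get("proteinLength"),
        number_of_genome_mappings=d.get("numberOfGenomeMappings"),
        description=d.get("description", ""),
        ec_number=list(d.get("ecNumber", [])),
        taxonomy_scope=(
            Organism(scope.get("taxId"), scope.get("organismName", ""))
            if scope else None
        ),
        protein_name_evidence=(
            ProteinNameEvidence(
                evidence.get("accession", ""),
                evidence.get("category", ""),
                evidence.get("source", ""),
            )
            if evidence else None
        ),
    )


# The sample report from this page, as a parsed dict.
sample = {
    "accession": "WP_001435165.1",
    "geneSymbol": "merC",
    "numberOfGenomeMappings": 15,
    "proteinLength": 137,
    "proteinName": "organomercurial transporter MerC",
    "proteinNameEvidence": {
        "accession": "NF010318.0",
        "category": "HMM",
        "source": "NCBI Protein Cluster (PRK)",
    },
    "taxonomyScope": {"organismName": "Gammaproteobacteria", "taxId": 1236},
}

gene = gene_from_dict(sample)
print(gene.gene_symbol, gene.taxonomy_scope.organism_name, gene.protein_name_evidence.category)
```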
Scalar Value Types
Protocol buffers type | Notes | C++ | Python | Java | Go |
---|---|---|---|---|---|
double | | double | float | double | float64 |
float | | float | float | float | float32 |
int32 | Uses variable-length encoding. Inefficient for encoding negative numbers – if your field is likely to have negative values, use sint32 instead. | int32 | int | int | int32 |
int64 | Uses variable-length encoding. Inefficient for encoding negative numbers – if your field is likely to have negative values, use sint64 instead. | int64 | int/long | long | int64 |
uint32 | Uses variable-length encoding. | uint32 | int/long | int | uint32 |
uint64 | Uses variable-length encoding. | uint64 | int/long | long | uint64 |
sint32 | Uses variable-length encoding. Signed int value. These more efficiently encode negative numbers than regular int32s. | int32 | int | int | int32 |
sint64 | Uses variable-length encoding. Signed int value. These more efficiently encode negative numbers than regular int64s. | int64 | int/long | long | int64 |
fixed32 | Always four bytes. More efficient than uint32 if values are often greater than 2^28. | uint32 | int | int | uint32 |
fixed64 | Always eight bytes. More efficient than uint64 if values are often greater than 2^56. | uint64 | int/long | long | uint64 |
sfixed32 | Always four bytes. | int32 | int | int | int32 |
sfixed64 | Always eight bytes. | int64 | int/long | long | int64 |
bool | | bool | boolean | boolean | bool |
string | A string must always contain UTF-8 encoded or 7-bit ASCII text. | string | str/unicode | String | string |
bytes | May contain any arbitrary sequence of bytes. | string | str | ByteString | []byte |
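Python integers are unbounded, so a value parsed from the report is not automatically constrained to the range of a uint32 field such as proteinLength or taxId. The short sketch below shows one way to check that range explicitly; the helper name `is_uint32` is hypothetical, and whether such validation is needed depends on the application.

```python
UINT32_MAX = 2**32 - 1  # largest value an unsigned 32-bit field can hold


def is_uint32(value):
    """True if value is an int (not a bool) within the unsigned 32-bit range."""
    if isinstance(value, bool) or not isinstance(value, int):
        return False
    return 0 <= value <= UINT32_MAX


print(is_uint32(137), is_uint32(-1), is_uint32(2**32))
```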