Gene report
Gene record metadata
The downloaded gene package contains a gene data report in JSON Lines format in the file:
ncbi_dataset/data/data_report.jsonl
Each line of the gene data report file is a hierarchical JSON object that represents a single gene record.
The schema of the gene record is defined in the tables below, where each row describes either a single field in the report or a sub-structure, which is a collection of fields.
The outermost structure of the report is GeneDescriptor.
Table fields that include a Table Field Mnemonic can be used with the dataformat command-line tool's --fields option.
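As a minimal sketch of how the report file can be read, the Python snippet below parses each non-empty line as one JSON object. The function name `iter_gene_records` is illustrative and not part of any NCBI tool; the default path is the one given above.

```python
import json
from pathlib import Path
from typing import Iterator


def iter_gene_records(path: str = "ncbi_dataset/data/data_report.jsonl") -> Iterator[dict]:
    """Yield one gene record (a dict) per non-empty line of a JSON Lines file."""
    with Path(path).open(encoding="utf-8") as handle:
        for line in handle:
            line = line.strip()
            if line:  # skip blank lines, such as a trailing newline
                yield json.loads(line)
```

Because the function is a generator, records are parsed one at a time, so a large report does not need to fit in memory.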
Sample report
{
"annotations": [
{
"annotationName": "GCF_000001405.40-RS_2024_08",
"annotationReleaseDate": "2024-08-23",
"assemblyAccession": "GCF_000001405.40",
"assemblyName": "GRCh38.p14",
"genomicLocations": [
{
"genomicAccessionVersion": "NC_000019.10",
"genomicRange": {
"begin": "58345183",
"end": "58353492",
"orientation": "minus"
},
"sequenceName": "19"
}
]
},
{
"annotationName": "GCF_009914755.1-RS_2024_08",
"annotationReleaseDate": "2024-08-23",
"assemblyAccession": "GCF_009914755.1",
"assemblyName": "T2T-CHM13v2.0",
"genomicLocations": [
{
"genomicAccessionVersion": "NC_060943.1",
"genomicRange": {
"begin": "61441599",
"end": "61449907",
"orientation": "minus"
},
"sequenceName": "19"
}
]
}
],
"chromosomes": [
"19"
],
"commonName": "human",
"description": "alpha-1-B glycoprotein",
"ensemblGeneIds": [
"ENSG00000121410"
],
"geneGroups": [
{
"id": "1",
"method": "NCBI Ortholog"
}
],
"geneId": "1",
"geneOntology": {
"biologicalProcesses": [
{
"evidenceCode": "IBA",
"goId": "GO:0002764",
"name": "immune response-regulating signaling pathway",
"qualifier": "involved_in"
}
],
"cellularComponents": [
{
"evidenceCode": "HDA",
"goId": "GO:0072562",
"name": "blood microparticle",
"qualifier": "located_in"
},
{
"evidenceCode": "HDA",
"goId": "GO:0062023",
"name": "collagen-containing extracellular matrix",
"qualifier": "located_in"
},
{
"evidenceCode": "HDA",
"goId": "GO:0070062",
"name": "extracellular exosome",
"qualifier": "located_in"
},
{
"evidenceCode": "HDA",
"goId": "GO:0005576",
"name": "extracellular region",
"qualifier": "located_in"
},
{
"evidenceCode": "IDA",
"goId": "GO:0005576",
"name": "extracellular region",
"qualifier": "located_in"
},
{
"evidenceCode": "TAS",
"goId": "GO:0005576",
"name": "extracellular region",
"qualifier": "located_in"
},
{
"evidenceCode": "HDA",
"goId": "GO:0005615",
"name": "extracellular space",
"qualifier": "located_in"
},
{
"evidenceCode": "TAS",
"goId": "GO:1904813",
"name": "ficolin-1-rich granule lumen",
"qualifier": "located_in"
},
{
"evidenceCode": "IBA",
"goId": "GO:0005886",
"name": "plasma membrane",
"qualifier": "is_active_in"
},
{
"evidenceCode": "TAS",
"goId": "GO:0031093",
"name": "platelet alpha granule lumen",
"qualifier": "located_in"
},
{
"evidenceCode": "TAS",
"goId": "GO:0034774",
"name": "secretory granule lumen",
"qualifier": "located_in"
}
]
},
"nomenclatureAuthority": {
"authority": "HGNC",
"identifier": "HGNC:5"
},
"omimIds": [
"138670"
],
"orientation": "minus",
"proteinCount": 1,
"summary": [
{
"description": "The protein encoded by this gene is a plasma glycoprotein of unknown function. The protein shows sequence similarity to the variable regions of some immunoglobulin supergene family member proteins. [provided by RefSeq, Jul 2008]"
}
],
"swissProtAccessions": [
"P04217"
],
"symbol": "A1BG",
"synonyms": [
"A1B",
"ABG",
"GAB",
"HYST2477"
],
"taxId": "9606",
"taxname": "Homo sapiens",
"transcriptCount": 1,
"transcriptTypeCounts": [
{
"count": 1,
"type": "PROTEIN_CODING"
}
],
"type": "PROTEIN_CODING"
}
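In the sample above, fields typed as uint64 in the tables below (geneId, taxId, begin, end) appear as JSON strings, while uint32 fields such as proteinCount appear as JSON numbers. This is consistent with the standard protobuf JSON mapping for 64-bit integers. The sketch below flattens the nested annotations into one row per genomic location and converts the coordinates to integers. The helper name `flatten_locations` and the output keys are hypothetical, chosen only for illustration.

```python
def flatten_locations(record: dict) -> list:
    """Return one flat row per (annotation, genomic location) pair."""
    rows = []
    for annotation in record.get("annotations", []):
        for location in annotation.get("genomicLocations", []):
            genomic_range = location.get("genomicRange", {})
            if "begin" not in genomic_range or "end" not in genomic_range:
                continue  # no coordinates reported for this location
            rows.append({
                # uint64 values appear as JSON strings in the sample, so convert them.
                "gene_id": int(record["geneId"]),
                "assembly": annotation.get("assemblyName"),
                "accession": location.get("genomicAccessionVersion"),
                "begin": int(genomic_range["begin"]),
                "end": int(genomic_range["end"]),
                "orientation": genomic_range.get("orientation"),
            })
    return rows


# Trimmed copy of the sample report above (first annotation only).
sample_record = {
    "geneId": "1",
    "symbol": "A1BG",
    "annotations": [
        {
            "assemblyName": "GRCh38.p14",
            "genomicLocations": [
                {
                    "genomicAccessionVersion": "NC_000019.10",
                    "genomicRange": {
                        "begin": "58345183",
                        "end": "58353492",
                        "orientation": "minus",
                    },
                    "sequenceName": "19",
                }
            ],
        }
    ],
}

for row in flatten_locations(sample_record):
    print(row)
```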
GeneDescriptor Structure
Field | Table Field Mnemonic | Table Column Name | Type | Description | Examples |
---|---|---|---|---|---|
geneId | gene-id | NCBI GeneID | uint64 | NCBI Gene ID | 2778 |
symbol | symbol | Symbol | string | Gene symbol | GNAS |
description | description | Description | string | Gene name | GNAS complex locus |
taxId | tax-id | Taxonomic ID | uint64 | NCBI Taxonomy ID for the organism | 9606 |
taxname | tax-name | Taxonomic Name | string | Taxonomic name of the organism | Homo sapiens |
commonName | common-name | Common Name | string | Common name of the organism | human |
type | gene-type | Gene Type | GeneType | Type of gene | |
rnaType | rna-type | RNA Type | RnaType | ||
orientation | orientation | Orientation | Orientation | Direction of the gene relative to the genome coordinates | |
referenceStandards repeated | ref-standard- | Reference Standard | GenomicRegion | Clinical reference standard NG | |
genomicRegions repeated | genomic-region- | Genomic Region | GenomicRegion | Pseudogene, non-genic regulatory element and other genomic region NG | |
chromosomes repeated | chromosomes | Chromosomes | string | Chromosomes on which the gene is annotated | 1 X,Y |
nomenclatureAuthority | name- | Nomenclature | NomenclatureAuthority | ||
swissProtAccessions repeated | swissprot-accessions | SwissProt Accessions | string | Swiss-Prot accessions matching the protein encoded by the gene | |
ensemblGeneIds repeated | ensembl-geneids | Ensembl GeneIDs | string | Ensembl Gene IDs that match the gene | |
omimIds repeated | omim-ids | OMIM IDs | string | Online Mendelian Inheritance in Man (OMIM) records associated with the gene | |
synonyms repeated | synonyms | Synonyms | string | Alternative names for the gene | |
replacedGeneId | replaced-gene-id | Replaced NCBI GeneID | uint64 | The NCBI Gene ID for the gene that was merged into the current gene record | |
annotations repeated | annotation- | Annotation | Annotation | ||
transcriptCount | transcript-count | Transcripts | uint32 | Number of transcripts encoded by the gene | |
proteinCount | protein-count | Proteins | uint32 | Number of proteins encoded by the gene | |
transcriptTypeCounts repeated | | | TranscriptTypeCount | Number of transcripts by type | |
geneGroups repeated | group- | Gene Group | GeneGroup | ||
summary repeated | summary- | Summary | GeneSummary | ||
geneOntology | go- | Gene Ontology | GeneOntology | ||
locusTag | locus-tag | Locus Tag | string |
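The sample report omits several fields listed in this table (for example rnaType, locusTag and replacedGeneId), so it seems safest to treat every field as optional when reading a record. The following sketch builds a flat summary row using dictionary lookups with defaults. The function name `summarize_gene`, the output keys, and the choice of 0 as the default for missing counts are assumptions made for illustration.

```python
def summarize_gene(record: dict) -> dict:
    """Build a flat summary row, tolerating fields that are absent from a record."""
    nomenclature = record.get("nomenclatureAuthority", {})
    return {
        "gene_id": record.get("geneId"),
        "symbol": record.get("symbol"),
        "gene_type": record.get("type"),
        "locus_tag": record.get("locusTag"),  # None when the field is absent
        "nomenclature_id": nomenclature.get("identifier"),
        # Repeated fields are JSON arrays; join them for a one-line summary.
        "chromosomes": ",".join(record.get("chromosomes", [])),
        "synonyms": ",".join(record.get("synonyms", [])),
        # Assumption: a missing count is treated as zero.
        "transcripts": record.get("transcriptCount", 0),
        "proteins": record.get("proteinCount", 0),
    }


# Values taken from the sample report for GeneID 1.
example = summarize_gene({
    "geneId": "1",
    "symbol": "A1BG",
    "type": "PROTEIN_CODING",
    "chromosomes": ["19"],
    "synonyms": ["A1B", "ABG", "GAB", "HYST2477"],
    "nomenclatureAuthority": {"authority": "HGNC", "identifier": "HGNC:5"},
    "transcriptCount": 1,
    "proteinCount": 1,
})
print(example)
```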
Annotation Structure
Field | Table Field Mnemonic | Table Column Name | Type | Description | Examples |
---|---|---|---|---|---|
assemblyAccession | assembly-accession | Assembly Accession | string | Genome assembly accession | |
assemblyName | assembly-name | Assembly Name | string | Genome assembly name | |
annotationName | release-name | Release Name | string | Genome annotation name | |
annotationReleaseDate | release-date | Release Date | string | Genome annotation release date | |
genomicLocations repeated | genomic-range- | Genomic Range | GenomicLocation |
GeneGroup Structure
Field | Table Field Mnemonic | Table Column Name | Type | Description | Examples |
---|---|---|---|---|---|
id | id | Identifier | string | Gene group identifier. Currently, gene groups only include gene ortholog sets | |
method | method | Method | string | Method used to calculate the gene group. Currently, the only method is “NCBI Ortholog” |
GeneOntology Structure
Field | Table Field Mnemonic | Table Column Name | Type | Description | Examples |
---|---|---|---|---|---|
assignedBy | assigned-by | Assigned By | string | The database that made the annotation | |
molecularFunctions repeated | mf- | Molecular Function | ProcessMetadata | Molecular functions | |
biologicalProcesses repeated | bp- | Biological Process | ProcessMetadata | Biological Processes | |
cellularComponents repeated | cc- | Cellular Component | ProcessMetadata | Cellular components |
GeneSummary Structure
Field | Table Field Mnemonic | Table Column Name | Type | Description | Examples |
---|---|---|---|---|---|
source | source | Source | string | Source of the gene summary | |
description | description | Description | string | Text of the gene summary, which describes the gene | |
date | date | Date | string | Date that the gene summary was last updated |
GenomicLocation Structure
Field | Table Field Mnemonic | Table Column Name | Type | Description | Examples |
---|---|---|---|---|---|
genomicAccessionVersion | accession | Accession | string | ||
sequenceName | seq-name | Seq Name | string | ||
genomicRange | range- | Range | Range | | |
exons repeated | exon- | Exons | Range |
GenomicRegion Structure
Field | Table Field Mnemonic | Table Column Name | Type | Description | Examples |
---|---|---|---|---|---|
geneRange | gene-range- | Gene Range | SeqRangeSet | The range of this Gene record on this genomic region. | |
type | genomic-region-type | Genomic Region Type | GenomicRegion.GenomicRegionType | Type of genomic region |
NomenclatureAuthority Structure
Field | Table Field Mnemonic | Table Column Name | Type | Description | Examples |
---|---|---|---|---|---|
authority | authority | Authority | string | The nomenclature authority for this gene record | HGNC |
identifier | id | ID | string | The nomenclature authority identifier for this gene record | HGNC:4392 |
ProcessMetadata Structure
Field | Table Field Mnemonic | Table Column Name | Type | Description | Examples |
---|---|---|---|---|---|
name | name | Name | string | Gene ontology term name | |
goId | id | Go ID | string | Gene ontology identifier | |
evidenceCode | evidence-code | Evidence Code | string | Indicates how the annotation is supported | |
qualifier | qualifier | Qualifier | string | Relation that explicitly links the gene product to the GO term | |
reference | reference- | Reference | Reference | Source of evidence supporting the GO annotation |
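In the sample report, the same GO term can appear several times within one category with different evidence codes (GO:0005576 is listed with HDA, IDA and TAS). The sketch below groups the entries of one category by goId and collects their evidence codes. The function name `evidence_by_go_term` is hypothetical.

```python
from collections import defaultdict


def evidence_by_go_term(gene_ontology: dict, category: str = "cellularComponents") -> dict:
    """Map each GO ID in one category to (term name, sorted evidence codes)."""
    codes = defaultdict(set)
    names = {}
    for entry in gene_ontology.get(category, []):
        go_id = entry["goId"]
        names[go_id] = entry.get("name", "")
        codes[go_id].add(entry.get("evidenceCode", ""))
    return {go_id: (names[go_id], sorted(codes[go_id])) for go_id in codes}


# A subset of the cellularComponents entries from the sample report.
go_sample = {
    "cellularComponents": [
        {"evidenceCode": "HDA", "goId": "GO:0005576", "name": "extracellular region"},
        {"evidenceCode": "IDA", "goId": "GO:0005576", "name": "extracellular region"},
        {"evidenceCode": "TAS", "goId": "GO:0005576", "name": "extracellular region"},
        {"evidenceCode": "IBA", "goId": "GO:0005886", "name": "plasma membrane"},
    ]
}

grouped = evidence_by_go_term(go_sample)
print(grouped)
```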
Range Structure
A 1-based range on a sequence record.
Field | Table Field Mnemonic | Table Column Name | Type | Description | Examples |
---|---|---|---|---|---|
begin | start | Start | uint64 | Sequence start position | |
end | stop | Stop | uint64 | Sequence stop position | |
orientation | orientation | Orientation | Orientation | Direction relative to the genome | |
order | order | Order | uint32 | The position of this sequence in a group of sequences | |
ribosomalSlippage | coming soon | coming soon | int32 | When ribosomal slippage is desired, fill out slippage amount between this and previous range. |
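As a small worked example of the coordinate convention, the sketch below computes the length of a range. It assumes that both begin and end are inclusive, which is the usual reading of 1-based coordinates but is not stated explicitly in this table. The function name `range_length` is hypothetical.

```python
def range_length(seq_range: dict) -> int:
    """Length in bases of a 1-based range, assuming begin and end are both inclusive."""
    begin = int(seq_range["begin"])  # uint64 values arrive as JSON strings
    end = int(seq_range["end"])
    if end < begin:
        raise ValueError(f"end ({end}) is smaller than begin ({begin})")
    # Assumption: inclusive coordinates, hence the +1.
    return end - begin + 1


# genomicRange of the GRCh38.p14 annotation in the sample report.
grch38_range = {"begin": "58345183", "end": "58353492", "orientation": "minus"}
print(range_length(grch38_range))  # 8310
```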
Reference Structure
Field | Table Field Mnemonic | Table Column Name | Type | Description | Examples |
---|---|---|---|---|---|
pmids repeated | pmid | PMID | uint64 | PubMed identifier |
SeqRangeSet Structure
Field | Table Field Mnemonic | Table Column Name | Type | Description | Examples |
---|---|---|---|---|---|
accessionVersion | accession | Sequence Accession | string | NCBI Accession.version of the sequence | |
range repeated | range- | Range | Range | Series of intervals on above accession_version | |
TranscriptTypeCount Structure
Field | Table Field Mnemonic | Table Column Name | Type | Description | Examples |
---|---|---|---|---|---|
type | | | Transcript.TranscriptType | Type of transcript | |
count | coming soon | coming soon | uint32 | Number of transcripts of a particular type |
GeneType Enumeration
NB: GeneType values match Entrez Gene
Name | Number | Description |
---|---|---|
UNKNOWN | 0 | |
tRNA | 1 | |
rRNA | 2 | |
snRNA | 3 | |
scRNA | 4 | |
snoRNA | 5 | |
PROTEIN_CODING | 6 | |
PSEUDO | 7 | these will have NG or NR |
TRANSPOSON | 8 | |
miscRNA | 9 | |
ncRNA | 10 | |
BIOLOGICAL_REGION | 11 | these will have NG |
OTHER | 255 |
GenomicRegion.GenomicRegionType Enumeration
Name | Number | Description |
---|---|---|
UNKNOWN | 0 | |
REFSEQ_GENE | 1 | |
PSEUDOGENE | 2 | |
BIOLOGICAL_REGION | 3 | |
OTHER | 4 |
Orientation Enumeration
Name | Number | Description |
---|---|---|
none | 0 | |
plus | 1 | |
minus | 2 |
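The sample report carries enumeration values by name ("orientation": "minus") rather than by number. One way to recover the numbers from this table is to mirror the enumeration in code, as in the sketch below. The `parse_orientation` helper is hypothetical, and mapping an absent field to `none` is an assumption rather than documented behavior.

```python
from enum import IntEnum


class Orientation(IntEnum):
    """Mirror of the Orientation enumeration table (names and numbers)."""
    none = 0
    plus = 1
    minus = 2


def parse_orientation(name) -> Orientation:
    """Look up an orientation by its JSON name.

    Assumption: a missing or empty value is treated as Orientation.none.
    """
    return Orientation[name] if name else Orientation.none


print(parse_orientation("minus").value)  # 2
```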
RnaType Enumeration
Name | Number | Description |
---|---|---|
rna_UNKNOWN | 0 | |
premsg | 1 | |
tmRna | 2 |
Transcript.TranscriptType Enumeration
Name | Number | Description |
---|---|---|
UNKNOWN | 0 | |
PROTEIN_CODING | 1 | |
NON_CODING | 2 | |
PROTEIN_CODING_MODEL | 3 | |
NON_CODING_MODEL | 4 |
Scalar Value Types
Protocol buffers type | Notes | C++ | Python | Java | Go |
---|---|---|---|---|---|
double | | double | float | double | float64 |
float | | float | float | float | float32 |
int32 | Uses variable-length encoding. Inefficient for encoding negative numbers – if your field is likely to have negative values, use sint32 instead. | int32 | int | int | int32 |
int64 | Uses variable-length encoding. Inefficient for encoding negative numbers – if your field is likely to have negative values, use sint64 instead. | int64 | int/long | long | int64 |
uint32 | Uses variable-length encoding. | uint32 | int/long | int | uint32 |
uint64 | Uses variable-length encoding. | uint64 | int/long | long | uint64 |
sint32 | Uses variable-length encoding. Signed int value. These more efficiently encode negative numbers than regular int32s. | int32 | int | int | int32 |
sint64 | Uses variable-length encoding. Signed int value. These more efficiently encode negative numbers than regular int64s. | int64 | int/long | long | int64 |
fixed32 | Always four bytes. More efficient than uint32 if values are often greater than 2^28. | uint32 | int | int | uint32 |
fixed64 | Always eight bytes. More efficient than uint64 if values are often greater than 2^56. | uint64 | int/long | long | uint64 |
sfixed32 | Always four bytes. | int32 | int | int | int32 |
sfixed64 | Always eight bytes. | int64 | int/long | long | int64 |
bool | | bool | boolean | boolean | bool |
string | A string must always contain UTF-8 encoded or 7-bit ASCII text. | string | str/unicode | String | string |
bytes | May contain any arbitrary sequence of bytes. | string | str | ByteString | []byte |