Gene report

Gene record metadata

The downloaded gene package contains a gene data report in JSON Lines format in the file:

ncbi_dataset/data/data_report.jsonl

Each line of the gene data report file is a hierarchical JSON object that represents a single gene record. The tables below define the schema of the gene record: each row describes either a single field in the report or a sub-structure, which is a collection of fields. The outermost structure of the report is GeneDescriptor.
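Because every line is an independent JSON object, the report can be streamed one record at a time. The following is a minimal Python sketch, not part of the NCBI tooling. The function names `parse_gene_records` and `read_gene_report` are illustrative, and the default path assumes the package was extracted into the current directory.

```python
import json
from typing import Iterable, Iterator


def parse_gene_records(lines: Iterable[str]) -> Iterator[dict]:
    """Yield one GeneDescriptor dict per non-empty JSON Lines row."""
    for line in lines:
        line = line.strip()
        if line:  # tolerate blank lines such as a trailing newline
            yield json.loads(line)


def read_gene_report(path: str = "ncbi_dataset/data/data_report.jsonl") -> Iterator[dict]:
    """Stream records from the report file without loading it all into memory."""
    with open(path, encoding="utf-8") as handle:
        yield from parse_gene_records(handle)
```

Splitting the parsing from the file handling keeps `parse_gene_records` usable on any iterable of lines, for example a list of strings in a test.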

Table fields that include a Table Field Mnemonic can be used with the dataformat command-line tool's --fields option. Refer to the dataformat CLI tool reference to see how you can use this tool to transform gene data reports from JSON Lines to tabular formats.

Sample report

{
  "annotations": [
    {
      "annotationName": "GCF_000001405.40-RS_2024_08",
      "annotationReleaseDate": "2024-08-23",
      "assemblyAccession": "GCF_000001405.40",
      "assemblyName": "GRCh38.p14",
      "genomicLocations": [
        {
          "genomicAccessionVersion": "NC_000019.10",
          "genomicRange": {
            "begin": "58345183",
            "end": "58353492",
            "orientation": "minus"
          },
          "sequenceName": "19"
        }
      ]
    },
    {
      "annotationName": "GCF_009914755.1-RS_2024_08",
      "annotationReleaseDate": "2024-08-23",
      "assemblyAccession": "GCF_009914755.1",
      "assemblyName": "T2T-CHM13v2.0",
      "genomicLocations": [
        {
          "genomicAccessionVersion": "NC_060943.1",
          "genomicRange": {
            "begin": "61441599",
            "end": "61449907",
            "orientation": "minus"
          },
          "sequenceName": "19"
        }
      ]
    }
  ],
  "chromosomes": [
    "19"
  ],
  "commonName": "human",
  "description": "alpha-1-B glycoprotein",
  "ensemblGeneIds": [
    "ENSG00000121410"
  ],
  "geneGroups": [
    {
      "id": "1",
      "method": "NCBI Ortholog"
    }
  ],
  "geneId": "1",
  "geneOntology": {
    "biologicalProcesses": [
      {
        "evidenceCode": "IBA",
        "goId": "GO:0002764",
        "name": "immune response-regulating signaling pathway",
        "qualifier": "involved_in"
      }
    ],
    "cellularComponents": [
      {
        "evidenceCode": "HDA",
        "goId": "GO:0072562",
        "name": "blood microparticle",
        "qualifier": "located_in"
      },
      {
        "evidenceCode": "HDA",
        "goId": "GO:0062023",
        "name": "collagen-containing extracellular matrix",
        "qualifier": "located_in"
      },
      {
        "evidenceCode": "HDA",
        "goId": "GO:0070062",
        "name": "extracellular exosome",
        "qualifier": "located_in"
      },
      {
        "evidenceCode": "HDA",
        "goId": "GO:0005576",
        "name": "extracellular region",
        "qualifier": "located_in"
      },
      {
        "evidenceCode": "IDA",
        "goId": "GO:0005576",
        "name": "extracellular region",
        "qualifier": "located_in"
      },
      {
        "evidenceCode": "TAS",
        "goId": "GO:0005576",
        "name": "extracellular region",
        "qualifier": "located_in"
      },
      {
        "evidenceCode": "HDA",
        "goId": "GO:0005615",
        "name": "extracellular space",
        "qualifier": "located_in"
      },
      {
        "evidenceCode": "TAS",
        "goId": "GO:1904813",
        "name": "ficolin-1-rich granule lumen",
        "qualifier": "located_in"
      },
      {
        "evidenceCode": "IBA",
        "goId": "GO:0005886",
        "name": "plasma membrane",
        "qualifier": "is_active_in"
      },
      {
        "evidenceCode": "TAS",
        "goId": "GO:0031093",
        "name": "platelet alpha granule lumen",
        "qualifier": "located_in"
      },
      {
        "evidenceCode": "TAS",
        "goId": "GO:0034774",
        "name": "secretory granule lumen",
        "qualifier": "located_in"
      }
    ]
  },
  "nomenclatureAuthority": {
    "authority": "HGNC",
    "identifier": "HGNC:5"
  },
  "omimIds": [
    "138670"
  ],
  "orientation": "minus",
  "proteinCount": 1,
  "summary": [
    {
      "description": "The protein encoded by this gene is a plasma glycoprotein of unknown function. The protein shows sequence similarity to the variable regions of some immunoglobulin supergene family member proteins. [provided by RefSeq, Jul 2008]"
    }
  ],
  "swissProtAccessions": [
    "P04217"
  ],
  "symbol": "A1BG",
  "synonyms": [
    "A1B",
    "ABG",
    "GAB",
    "HYST2477"
  ],
  "taxId": "9606",
  "taxname": "Homo sapiens",
  "transcriptCount": 1,
  "transcriptTypeCounts": [
    {
      "count": 1,
      "type": "PROTEIN_CODING"
    }
  ],
  "type": "PROTEIN_CODING"
}
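In the sample above, geneId, taxId, begin and end appear as quoted strings, while proteinCount and transcriptCount are bare numbers. This is consistent with the standard protocol buffers JSON mapping, which writes 64-bit integers as strings. A minimal sketch for converting the top-level ID fields back to integers follows; the helper name `with_integer_ids` is an assumption chosen for illustration.

```python
# Top-level GeneDescriptor fields that the schema types as uint64.
UINT64_KEYS = ("geneId", "taxId", "replacedGeneId")


def with_integer_ids(record: dict) -> dict:
    """Return a shallow copy of a record with 64-bit ID fields parsed as int."""
    converted = dict(record)
    for key in UINT64_KEYS:
        if key in converted:  # optional fields such as replacedGeneId may be absent
            converted[key] = int(converted[key])
    return converted
```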

GeneDescriptor Structure

| Field | Table Field Mnemonic | Table Column Name | Type | Description | Examples |
|---|---|---|---|---|---|
| geneId | gene-id | NCBI GeneID | uint64 | NCBI Gene ID | 2778 |
| symbol | symbol | Symbol | string | Gene symbol | GNAS |
| description | description | Description | string | Gene name | GNAS complex locus |
| taxId | tax-id | Taxonomic ID | uint64 | NCBI Taxonomy ID for the organism | 9606 |
| taxname | tax-name | Taxonomic Name | string | Taxonomic name of the organism | Homo sapiens |
| commonName | common-name | Common Name | string | Common name of the organism | human |
| type | gene-type | Gene Type | GeneType | Type of gene | |
| rnaType | rna-type | RNA Type | RnaType | | |
| orientation | orientation | Orientation | Orientation | Direction of the gene relative to the genome coordinates | |
| referenceStandards (repeated) | ref-standard- | Reference Standard | GenomicRegion | Clinical reference standard NG | |
| genomicRegions (repeated) | genomic-region- | Genomic Region | GenomicRegion | Pseudogene, non-genic regulatory element and other genomic region NG | |
| chromosomes (repeated) | chromosomes | Chromosomes | string | Chromosomes on which the gene is annotated | 1; X,Y |
| nomenclatureAuthority | name- | Nomenclature | NomenclatureAuthority | | |
| swissProtAccessions (repeated) | swissprot-accessions | SwissProt Accessions | string | Swiss-prot accessions matching the protein encoded by the gene | |
| ensemblGeneIds (repeated) | ensembl-geneids | Ensembl GeneIDs | string | Ensembl Gene IDs that match the gene | |
| omimIds (repeated) | omim-ids | OMIM IDs | string | Online Mendelian Inheritance in Man (OMIM) record associated with the gene | |
| synonyms (repeated) | synonyms | Synonyms | string | Alternative names for the gene | |
| replacedGeneId | replaced-gene-id | Replaced NCBI GeneID | uint64 | The NCBI Gene ID for the gene that was merged into the current gene record | |
| annotations (repeated) | annotation- | Annotation | Annotation | | |
| transcriptCount | transcript-count | Transcripts | uint32 | Number of transcripts encoded by the gene | |
| proteinCount | protein-count | Proteins | uint32 | Number of proteins encoded by the gene | |
| transcriptTypeCounts (repeated) | | | TranscriptTypeCount | Number of transcripts by type | |
| geneGroups (repeated) | group- | Gene Group | GeneGroup | | |
| summary (repeated) | summary- | Summary | GeneSummary | | |
| geneOntology | go- | Gene Ontology | GeneOntology | | |
| locusTag | locus-tag | Locus Tag | string | | |
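To illustrate how a mnemonic relates a JSON field to a table column, the sketch below renders a few scalar GeneDescriptor fields as tab-separated text. It is a hand-written approximation for illustration only and does not reproduce the dataformat tool. The mapping `FIELDS` and the function `records_to_tsv` are hypothetical names; the mnemonics, JSON keys and column names are copied from the table above.

```python
import csv
import io

# mnemonic -> (JSON key, table column name), for four scalar fields only.
FIELDS = {
    "gene-id": ("geneId", "NCBI GeneID"),
    "symbol": ("symbol", "Symbol"),
    "tax-name": ("taxname", "Taxonomic Name"),
    "gene-type": ("type", "Gene Type"),
}


def records_to_tsv(records, mnemonics):
    """Render the chosen scalar fields as tab-separated text with a header row."""
    out = io.StringIO()
    writer = csv.writer(out, delimiter="\t", lineterminator="\n")
    writer.writerow([FIELDS[m][1] for m in mnemonics])
    for record in records:
        # Missing fields become empty cells rather than raising KeyError.
        writer.writerow([record.get(FIELDS[m][0], "") for m in mnemonics])
    return out.getvalue()
```

Nested and repeated fields (for example annotations) are left out on purpose, since flattening them requires a decision about how to expand one record into several rows.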

Annotation Structure

| Field | Table Field Mnemonic | Table Column Name | Type | Description | Examples |
|---|---|---|---|---|---|
| assemblyAccession | assembly-accession | Assembly Accession | string | Genome assembly accession | |
| assemblyName | assembly-name | Assembly Name | string | Genome assembly name | |
| annotationName | release-name | Release Name | string | Genome annotation name | |
| annotationReleaseDate | release-date | Release Date | string | Genome annotation release date | |
| genomicLocations (repeated) | genomic-range- | Genomic Range | GenomicLocation | | |
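A gene can be annotated on more than one assembly, as in the sample report, which lists locations on both GRCh38.p14 and T2T-CHM13v2.0. The following sketch selects the locations for a single assembly accession. The function name `locations_for_assembly` and the tuple layout it returns are assumptions made for this example.

```python
def locations_for_assembly(record: dict, assembly_accession: str) -> list:
    """Return (accession, begin, end, orientation) tuples for one assembly."""
    hits = []
    for annotation in record.get("annotations", []):
        if annotation.get("assemblyAccession") != assembly_accession:
            continue
        for location in annotation.get("genomicLocations", []):
            rng = location.get("genomicRange")
            if not rng:
                continue  # skip locations that carry no range
            hits.append((
                location.get("genomicAccessionVersion"),
                int(rng["begin"]),  # uint64 values arrive as strings in the JSON
                int(rng["end"]),
                rng.get("orientation"),
            ))
    return hits
```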

GeneGroup Structure

| Field | Table Field Mnemonic | Table Column Name | Type | Description | Examples |
|---|---|---|---|---|---|
| id | id | Identifier | string | Gene group identifier, currently these only include gene ortholog sets | |
| method | method | Method | string | Method used to calculate the gene group, currently this only includes “NCBI Ortholog” | |

GeneOntology Structure

| Field | Table Field Mnemonic | Table Column Name | Type | Description | Examples |
|---|---|---|---|---|---|
| assignedBy | assigned-by | Assigned By | string | The database that made the annotation | |
| molecularFunctions (repeated) | mf- | Molecular Function | ProcessMetadata | Molecular functions | |
| biologicalProcesses (repeated) | bp- | Biological Process | ProcessMetadata | Biological Processes | |
| cellularComponents (repeated) | cc- | Cellular Component | ProcessMetadata | Cellular components | |

GeneSummary Structure

| Field | Table Field Mnemonic | Table Column Name | Type | Description | Examples |
|---|---|---|---|---|---|
| source | source | Source | string | Source of the gene summary | |
| description | description | Description | string | Gene summary text itself that describes the gene | |
| date | date | Date | string | Date that the gene summary was last updated | |

GenomicLocation Structure

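| Field | Table Field Mnemonic | Table Column Name | Type | Description | Examples |
|---|---|---|---|---|---|
| genomicAccessionVersion | accession | Accession | string | | |
| sequenceName | seq-name | Seq Name | string | | |
| genomicRange | range- | | Range | | |
| exons (repeated) | exon- | Exons | Range | | |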

GenomicRegion Structure

| Field | Table Field Mnemonic | Table Column Name | Type | Description | Examples |
|---|---|---|---|---|---|
| geneRange | gene-range- | Gene Range | SeqRangeSet | The range of this Gene record on this genomic region. | |
| type | genomic-region-type | Genomic Region Type | GenomicRegion.GenomicRegionType | Type of genomic region | |

NomenclatureAuthority Structure

| Field | Table Field Mnemonic | Table Column Name | Type | Description | Examples |
|---|---|---|---|---|---|
| authority | authority | Authority | string | The nomenclature authority for this gene record | HGNC |
| identifier | id | ID | string | The nomenclature authority identifier for this gene record | HGNC:4392 |

ProcessMetadata Structure

| Field | Table Field Mnemonic | Table Column Name | Type | Description | Examples |
|---|---|---|---|---|---|
| name | name | Name | string | Gene ontology term name | |
| goId | id | Go ID | string | Gene ontology identifier | |
| evidenceCode | evidence-code | Evidence Code | string | Indicates how the annotation is supported | |
| qualifier | qualifier | Qualifier | string | Explicitly link gene products to GO terms | |
| reference | reference- | Reference | Reference | Source of evidence supporting the GO annotation | |
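The same GO term can appear more than once within a category when it is supported by different evidence codes. In the sample report, GO:0005576 is listed three times under cellularComponents, with the codes HDA, IDA and TAS. The sketch below groups evidence codes by goId; the function name `evidence_by_go_term` is hypothetical.

```python
from collections import defaultdict


def evidence_by_go_term(gene_ontology: dict, category: str = "cellularComponents") -> dict:
    """Map each goId in one GeneOntology category to its sorted evidence codes."""
    grouped = defaultdict(set)
    for term in gene_ontology.get(category, []):
        grouped[term["goId"]].add(term["evidenceCode"])
    return {go_id: sorted(codes) for go_id, codes in grouped.items()}
```

Pass "molecularFunctions" or "biologicalProcesses" as `category` to group the other two lists.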

Range Structure

A 1-based range on a sequence record.

| Field | Table Field Mnemonic | Table Column Name | Type | Description | Examples |
|---|---|---|---|---|---|
| begin | start | Start | uint64 | Sequence start position | |
| end | stop | Stop | uint64 | Sequence stop position | |
| orientation | orientation | Orientation | Orientation | Direction relative to the genome | |
| order | order | Order | uint32 | The position of this sequence in a group of sequences | |
| ribosomalSlippage | coming soon | coming soon | int32 | When ribosomal slippage is desired, fill out slippage amount between this and previous range. | |
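As a worked example of 1-based coordinates, the sketch below computes the length of a Range. It assumes that end is inclusive, which this page does not state explicitly, and the function name `range_length` is illustrative.

```python
def range_length(rng: dict) -> int:
    """Length in bases of a 1-based Range, assuming `end` is inclusive."""
    begin = int(rng["begin"])  # uint64 values arrive as strings in the JSON
    end = int(rng["end"])
    if end < begin:
        raise ValueError(f"end ({end}) precedes begin ({begin})")
    return end - begin + 1


# Range of A1BG on NC_000019.10, taken from the sample report.
print(range_length({"begin": "58345183", "end": "58353492", "orientation": "minus"}))  # 8310
```

Under the inclusive-end assumption, the orientation does not affect the length, because begin and end are given in genome coordinates regardless of strand.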

Reference Structure

| Field | Table Field Mnemonic | Table Column Name | Type | Description | Examples |
|---|---|---|---|---|---|
| pmids (repeated) | pmid | PMID | uint64 | PubMed identifier | |

SeqRangeSet Structure

| Field | Table Field Mnemonic | Table Column Name | Type | Description | Examples |
|---|---|---|---|---|---|
| accessionVersion | accession | Sequence Accession | string | NCBI Accession.version of the sequence | |
| range (repeated) | range- | | Range | Series of intervals on above accession_version | |

TranscriptTypeCount Structure

| Field | Table Field Mnemonic | Table Column Name | Type | Description | Examples |
|---|---|---|---|---|---|
| type | | | Transcript.TranscriptType | Type of transcript | |
| count | coming soon | coming soon | uint32 | Number of transcripts of a particular type | |

GeneType Enumeration

NB: GeneType values match Entrez Gene

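| Name | Number | Description |
|---|---|---|
| UNKNOWN | 0 | |
| tRNA | 1 | |
| rRNA | 2 | |
| snRNA | 3 | |
| scRNA | 4 | |
| snoRNA | 5 | |
| PROTEIN_CODING | 6 | |
| PSEUDO | 7 | these will have NG or NR |
| TRANSPOSON | 8 | |
| miscRNA | 9 | |
| ncRNA | 10 | |
| BIOLOGICAL_REGION | 11 | these will have NG |
| OTHER | 255 | |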
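The report JSON carries enumeration values by name (the sample record has "type": "PROTEIN_CODING"), so code that needs the numeric value has to map the name back. One way to do this in Python is an `IntEnum` that mirrors the table above. The class and the helper `gene_type_from_report` are a sketch; falling back to UNKNOWN for unrecognized names is a design choice of this example, not documented behavior.

```python
from enum import IntEnum


class GeneType(IntEnum):
    """Mirror of the GeneType enumeration; names and numbers copied from the table."""
    UNKNOWN = 0
    tRNA = 1
    rRNA = 2
    snRNA = 3
    scRNA = 4
    snoRNA = 5
    PROTEIN_CODING = 6
    PSEUDO = 7
    TRANSPOSON = 8
    miscRNA = 9
    ncRNA = 10
    BIOLOGICAL_REGION = 11
    OTHER = 255


def gene_type_from_report(name: str) -> GeneType:
    """Look up the member for the string found in a record's "type" field."""
    try:
        return GeneType[name]
    except KeyError:
        return GeneType.UNKNOWN  # assumption: treat unrecognized names as UNKNOWN
```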

GenomicRegion.GenomicRegionType Enumeration

| Name | Number | Description |
|---|---|---|
| UNKNOWN | 0 | |
| REFSEQ_GENE | 1 | |
| PSEUDOGENE | 2 | |
| BIOLOGICAL_REGION | 3 | |
| OTHER | 4 | |

Orientation Enumeration

| Name | Number | Description |
|---|---|---|
| none | 0 | |
| plus | 1 | |
| minus | 2 | |

RnaType Enumeration

| Name | Number | Description |
|---|---|---|
| rna_UNKNOWN | 0 | |
| premsg | 1 | |
| tmRna | 2 | |

Transcript.TranscriptType Enumeration

| Name | Number | Description |
|---|---|---|
| UNKNOWN | 0 | |
| PROTEIN_CODING | 1 | |
| NON_CODING | 2 | |
| PROTEIN_CODING_MODEL | 3 | |
| NON_CODING_MODEL | 4 | |

Scalar Value Types

| Protocol buffers type | Notes | C++ | Python | Java | Go |
|---|---|---|---|---|---|
| double | | double | float | double | float64 |
| float | | float | float | float | float32 |
| int32 | Uses variable-length encoding. Inefficient for encoding negative numbers – if your field is likely to have negative values, use sint32 instead. | int32 | int | int | int32 |
| int64 | Uses variable-length encoding. Inefficient for encoding negative numbers – if your field is likely to have negative values, use sint64 instead. | int64 | int/long | long | int64 |
| uint32 | Uses variable-length encoding. | uint32 | int/long | int | uint32 |
| uint64 | Uses variable-length encoding. | uint64 | int/long | long | uint64 |
| sint32 | Uses variable-length encoding. Signed int value. These more efficiently encode negative numbers than regular int32s. | int32 | int | int | int32 |
| sint64 | Uses variable-length encoding. Signed int value. These more efficiently encode negative numbers than regular int64s. | int64 | int/long | long | int64 |
| fixed32 | Always four bytes. More efficient than uint32 if values are often greater than 2^28. | uint32 | int | int | uint32 |
| fixed64 | Always eight bytes. More efficient than uint64 if values are often greater than 2^56. | uint64 | int/long | long | uint64 |
| sfixed32 | Always four bytes. | int32 | int | int | int32 |
| sfixed64 | Always eight bytes. | int64 | int/long | long | int64 |
| bool | | bool | boolean | boolean | bool |
| string | A string must always contain UTF-8 encoded or 7-bit ASCII text. | string | str/unicode | String | string |
| bytes | May contain any arbitrary sequence of bytes. | string | str | ByteString | []byte |

Generated February 26, 2025