jq cheatsheet for genome metadata
jq cheatsheet for parsing genome metadata from the datasets CLI summary command
Try out jq commands on the web: https://jqplay.org/
The examples below were run using the datasets CLI v12.13.2 on 9/24/2021.
Download jq
https://stedolan.github.io/jq/
First, generate a JSON file with metadata for all cow genomes
datasets summary genome taxon cow > cow_genomes.json
Pretty-print the data (and only show the first 10 lines)
Note that the data is hierarchically structured: the busco information is nested within annotation_metadata, and annotation_metadata is nested within the assembly object
jq . cow_genomes.json | head
{
"assemblies": [
{
"assembly": {
"annotation_metadata": {
"busco": {
"busco_lineage": "cetartiodactyla_odb10",
"busco_ver": "4.0.2 ",
"complete": 0.98672664,
"duplicated": 0.005024372,
Show the assembly count
jq '.total_count' cow_genomes.json
7
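As a quick sanity check, the reported count can be compared with the number of records actually present in the file. The sketch below uses a hypothetical two-record stand-in file, sample_count.json (the accessions are copied from this page, the rest is trimmed away), so that it runs on its own; against real data, point the same filters at cow_genomes.json.

```shell
# Hypothetical stand-in for cow_genomes.json, reduced to two records
cat > sample_count.json <<'EOF'
{"assemblies":[{"assembly":{"assembly_accession":"GCF_002263795.1"}},
               {"assembly":{"assembly_accession":"GCA_002263795.2"}}],
 "total_count":2}
EOF

# Count reported in the summary
reported=$(jq '.total_count' sample_count.json)
# Number of assembly records actually in the file
actual=$(jq '.assemblies | length' sample_count.json)

echo "reported=$reported actual=$actual"
```

If the two numbers differ, the file probably does not hold every matching record.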
Only show data for the first assembly in a set of multiple assemblies (and only show the first 10 lines)
Note that assemblies[0] is used to specify the first assembly in the set, assemblies[1] refers to the second assembly, etc.
jq '.assemblies[0]' cow_genomes.json | head
{
"assembly": {
"annotation_metadata": {
"busco": {
"busco_lineage": "cetartiodactyla_odb10",
"busco_ver": "4.0.2 ",
"complete": 0.98672664,
"duplicated": 0.005024372,
"fragmented": 0.0045744283,
"missing": 0.008698912,
Show the BUSCO data for the first assembly in a set
jq '.assemblies[0].assembly.annotation_metadata.busco' cow_genomes.json
{
"busco_lineage": "cetartiodactyla_odb10",
"busco_ver": "4.0.2 ",
"complete": 0.98672664,
"duplicated": 0.005024372,
"fragmented": 0.0045744283,
"missing": 0.008698912,
"single_copy": 0.98170227,
"total_count": "13335"
}
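Some assemblies may have no annotation, in which case the busco path resolves to null. A minimal sketch of one way to handle that, using jq's alternative operator (//) to substitute a placeholder: the file sample_busco.json and the "NA" placeholder are hypothetical, and the second record is deliberately stripped of annotation_metadata for illustration (this is an assumption about the sample, not a statement about the real record).

```shell
# Hypothetical stand-in file: the second record has no annotation_metadata
cat > sample_busco.json <<'EOF'
{"assemblies":[
  {"assembly":{"assembly_accession":"GCF_002263795.1",
               "annotation_metadata":{"busco":{"complete":0.98672664}}}},
  {"assembly":{"assembly_accession":"GCA_002263795.2"}}
]}
EOF

# Accession and BUSCO completeness for every assembly;
# a missing value falls back to the string "NA"
busco_table=$(jq -r '.assemblies[].assembly
  | [.assembly_accession, (.annotation_metadata.busco.complete // "NA")]
  | @tsv' sample_busco.json)

printf '%s\n' "$busco_table"
```

Without the fallback, @tsv would print an empty field for the null value, which is easy to overlook in a table.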
Show the gene counts for the first assembly in a set
jq '.assemblies[0].assembly.annotation_metadata.stats.gene_counts' cow_genomes.json
{
"protein_coding": 21039,
"total": 35143
}
Show the assembly accession, submitter, and submission date for the first assembly in a set and format the output in a new JSON object with custom key names
jq '.assemblies[0].assembly | {accession: .assembly_accession, submitter: .submitter, date: .submission_date}' cow_genomes.json
{
"accession": "GCF_002263795.1",
"submitter": "USDA ARS",
"date": "2018-04-11"
}
Generate a three-column table of assembly accession, submission date, and submitter for every assembly
jq -r '.assemblies[].assembly | [.assembly_accession, .submission_date, .submitter] | @tsv' cow_genomes.json
GCF_002263795.1 2018-04-11 USDA ARS
GCF_000003055.6 2014-11-25 Center for Bioinformatics and Computational Biology, University of Maryland
GCF_000003205.5 2011-11-02 Cattle Genome Sequencing International Consortium
GCF_000003205.7 2015-11-19 Cattle Genome Sequencing International Consortium
GCA_000003055.5 2014-11-25 Center for Bioinformatics and Computational Biology, University of Maryland
GCA_000003205.6 2015-11-19 Cattle Genome Sequencing International Consortium
GCA_002263795.2 2018-04-11 USDA ARS
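The same table can be narrowed and ordered before it is printed. The sketch below keeps only accessions that start with GCF_, sorts them by submission date, and adds a header row. The file sample_table.json is a hypothetical three-record stand-in built from rows of the table above, and the header names are illustrative.

```shell
# Hypothetical stand-in file with three of the records listed above
cat > sample_table.json <<'EOF'
{"assemblies":[
  {"assembly":{"assembly_accession":"GCF_002263795.1","submission_date":"2018-04-11","submitter":"USDA ARS"}},
  {"assembly":{"assembly_accession":"GCF_000003205.5","submission_date":"2011-11-02","submitter":"Cattle Genome Sequencing International Consortium"}},
  {"assembly":{"assembly_accession":"GCA_002263795.2","submission_date":"2018-04-11","submitter":"USDA ARS"}}
]}
EOF

# Header row, then GCF_ records sorted by submission date, all emitted as TSV
refseq_table=$(jq -r '
  ["accession", "date", "submitter"],
  ( [.assemblies[].assembly | select(.assembly_accession | startswith("GCF_"))]
    | sort_by(.submission_date)[]
    | [.assembly_accession, .submission_date, .submitter] )
  | @tsv' sample_table.json)

printf '%s\n' "$refseq_table"
```

The records are collected into an array first because sort_by works on arrays, not on a stream of objects.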
Show the assembly accession and the chromosome count for the first assembly in a set
Note that jq's length function is used to count the chromosomes. The chromosome count includes all assembled chromosomes; the set of unplaced scaffolds counts as 1 chromosome, and each organelle genome counts as 1 chromosome. In this example: 29 autosomes + 1 X chromosome + 1 set of unplaced scaffolds + 1 mitochondrial genome = 32
jq '.assemblies[0].assembly | {accession: .assembly_accession, chromosome_count: (.chromosomes | length)}' cow_genomes.json
{
"accession": "GCF_002263795.1",
"chromosome_count": 32
}
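To get the chromosome count for every assembly instead of only the first, the index can be dropped and the result written as a table. This is a sketch on a hypothetical stand-in file, sample_chr.json, whose chromosome lists are shortened toy values (3 and 2 entries), so the counts printed here are not the real ones.

```shell
# Hypothetical stand-in file with shortened chromosome lists
cat > sample_chr.json <<'EOF'
{"assemblies":[
  {"assembly":{"assembly_accession":"GCF_002263795.1",
               "chromosomes":[{"name":"1"},{"name":"X"},{"name":"MT"}]}},
  {"assembly":{"assembly_accession":"GCA_002263795.2",
               "chromosomes":[{"name":"1"},{"name":"X"}]}}
]}
EOF

# One row per assembly: accession, number of entries in .chromosomes
chr_counts=$(jq -r '.assemblies[].assembly
  | [.assembly_accession, (.chromosomes | length)]
  | @tsv' sample_chr.json)

printf '%s\n' "$chr_counts"
```

Because length returns 0 for null, a record without a chromosomes field is reported as 0 instead of raising an error.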
Generated March 11, 2025