MicroBIGG-E data at Google Cloud Platform

BETA RELEASE -- This is under active development and while we strive to maintain correctness, it is possible results may be unstable, unavailable, or incorrect at times. Please contact us by email at pd-help@ncbi.nlm.nih.gov before relying on this data for production analyses.
- What data is available on the Google Cloud?
- Getting started with BigQuery
- Linking to Isolates Browser data in BigQuery
- Example searches
  - Find all carbapenem resistance genes or point mutations in the database
  - Find all carbapenem resistance genes in the database
  - Find all AMRFinderPlus results from Salmonella genomes for further analysis
  - Find elements on contigs that have both blaKPC-2 and blaTEM-1 genes
  - Find the five most common known parC resistance mutations in Pathogen Detection analyzed isolates
  - Find the five most common AMR genes associated with quinolone resistance
- Contig sequences
- Protein sequences
What data is available on the Google Cloud?

For a list of all resources see Pathogen Detection Resources at Google Cloud Platform.
The Microbial Browser for Identification of Genetic and Genomic Elements (MicroBIGG-E) data is now publicly available in the ncbi-pathogen-detect.pdbrowser.microbigge table at Google BigQuery. This data includes all the fields available in the browser and can be searched using Google Standard SQL instead of the Solr query language, which also permits programmatic access and more complex queries. MicroBIGG-E at BigQuery also allows you to download tables exceeding the 100,000-row limit of the MicroBIGG-E web download. NCBI is piloting this in BigQuery to help users leverage the benefits of elastic scaling and parallel execution of queries. BigQuery has a large collection of client libraries that can be used within your workflow, and you can also interact with it in a web browser as described below.
We are also storing the contig sequences and protein sequences for MicroBIGG-E hits in Google Storage buckets. See Contig sequences and Protein sequences below for more information.
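For example, a quick way to get a feel for the table from the BigQuery console is to pull a handful of rows; this minimal sketch uses only columns that appear in the example searches later on this page:

SELECT target_acc, element_symbol, subtype, subclass, contig_acc, protein_acc
FROM `ncbi-pathogen-detect.pdbrowser.microbigge`
LIMIT 10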
Pathogen Detection Resources available on the Google Cloud
- Pathogen Detection Resources at Google Cloud Platform
- Getting started with BigQuery
- MicroBIGG-E table in BigQuery
- MicroBIGG-E contig sequences in Google Storage buckets
- MicroBIGG-E protein sequences in Google Storage buckets
- AST Browser in BigQuery
- Isolates Browser table in BigQuery
- Isolate Exceptions table in BigQuery
- BioProject Hierarchy in BigQuery
Update Frequency

The microbigge table at Google Cloud BigQuery is updated daily. For this reason, the contents may not agree exactly with those shown in the MicroBIGG-E web browser. If you see unexpected discrepancies, please let us know by emailing pd-help@ncbi.nlm.nih.gov.
Getting started with BigQuery
Our Getting started with BigQuery page has instructions on how to run queries with BigQuery.
Linking to Isolates Browser data in BigQuery
NCBI Pathogen Detection also maintains Isolates Browser data in the BigQuery table ncbi-pathogen-detect.pdbrowser.isolates. There are several fields in common between the two tables, but we generally recommend joining on the target_acc field. See Isolates Browser Data at Google Cloud Platform for examples of joining the two tables.
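As a minimal sketch, assuming only that both tables share the target_acc column as noted above, a join that pulls the isolate record for every blaKPC-2 hit might look like this:

SELECT mb.element_symbol, mb.contig_acc, iso.*
FROM `ncbi-pathogen-detect.pdbrowser.microbigge` mb
JOIN `ncbi-pathogen-detect.pdbrowser.isolates` iso
  ON mb.target_acc = iso.target_acc
WHERE mb.element_symbol = 'blaKPC-2'
LIMIT 100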
Example searches

Find all carbapenem resistance genes or point mutations in the database

SELECT *
FROM `ncbi-pathogen-detect.pdbrowser.microbigge`
WHERE subclass like '%CARBAPENEM%'
ORDER BY element_symbol, closest_reference_acc, target_acc, protein_acc
Find all carbapenem resistance genes in the database

SELECT *
FROM `ncbi-pathogen-detect.pdbrowser.microbigge`
WHERE subclass like '%CARBAPENEM%'
AND subtype = 'AMR'
ORDER BY element_symbol, closest_reference_acc, target_acc, protein_acc
Find all AMRFinderPlus results from Salmonella genomes for further analysis

SELECT *
FROM `ncbi-pathogen-detect.pdbrowser.microbigge`
WHERE taxgroup_name = 'Salmonella enterica'
Find elements on contigs that have both blaKPC-2 and blaTEM-1 genes

SELECT
mb.contig_acc,
mb.element_symbol
FROM
`ncbi-pathogen-detect.pdbrowser.microbigge` mb
JOIN ( SELECT DISTINCT
mb1.contig_acc
FROM
`ncbi-pathogen-detect.pdbrowser.microbigge` mb1
JOIN `ncbi-pathogen-detect.pdbrowser.microbigge` mb2
ON mb1.element_symbol = 'blaTEM-1'
AND mb1.contig_acc = mb2.contig_acc
AND mb2.element_symbol = 'blaKPC-2') contigs
ON contigs.contig_acc = mb.contig_acc
ORDER BY
mb.contig_acc,
mb.start_on_contig
Find the five most common known parC resistance mutations in Pathogen Detection analyzed isolates

SELECT element_symbol, count(*) num_found
FROM `ncbi-pathogen-detect.pdbrowser.microbigge`
WHERE element_symbol like 'parC_%'
GROUP BY element_symbol
ORDER BY num_found DESC
LIMIT 5
Find the five most common AMR genes associated with quinolone resistance

SELECT element_symbol, subclass, count(*) num_found
FROM `ncbi-pathogen-detect.pdbrowser.microbigge`
WHERE subclass like '%QUINOLONE%'
AND subtype = 'AMR'
GROUP BY element_symbol, subclass
ORDER BY num_found DESC
LIMIT 5
Contig sequences

Contig sequences in gzipped FASTA format are stored and accessible in the Google Storage bucket ncbi-pathogen-assemblies, and the paths to those contigs are listed in the ncbi-pathogen-detect.pdbrowser.microbigge field contig_url.
These can be accessed using the gsutil command-line program included with the Google Cloud CLI (Installation instructions) or through the GCP BigQuery web interface. See Getting started with BigQuery for more information on how to use BigQuery.
Example:

Get the contig sequence for a contig with a point mutation in a specific assembly

First find the contig_url using BigQuery

SELECT contig_url
FROM `ncbi-pathogen-detect.pdbrowser.microbigge`
WHERE element_symbol = 'ompK36_D135DGD'
AND biosample_acc = 'SAMN01057611';
The results should be:
contig_url
gs://ncbi-pathogen-assemblies/Klebsiella/9/640/NZ_CP008827.1.fna.gz
Copy the gzipped contig file using the gsutil utility

Enter the following at a unix shell command-line to copy the gzipped contig FASTA file to your computer. See the Google docs for more information on the gsutil program.
gsutil cp gs://ncbi-pathogen-assemblies/Klebsiella/9/640/NZ_CP008827.1.fna.gz .
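To download many contigs at once (analogous to the quinolone protein download at the end of this page), you can first collect the distinct contig_url values matching a query. The blaKPC-2 filter below is just an illustrative sketch:

SELECT DISTINCT contig_url
FROM `ncbi-pathogen-detect.pdbrowser.microbigge`
WHERE element_symbol = 'blaKPC-2'

The resulting gs:// paths can then be copied with gsutil cp or gcloud storage cp as in the protein example below.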
Protein sequences

Protein sequences in gzipped FASTA format are stored and accessible in the Google Storage bucket ncbi-pathogen-proteins, and the paths to those files are listed in the ncbi-pathogen-detect.pdbrowser.microbigge field protein_url.
These can be accessed using the gsutil command-line program included with the Google Cloud CLI (Installation instructions) or through the GCP BigQuery web interface. See Getting started with BigQuery for more information on how to use BigQuery.
Examples:

Get the sequence of a single protein from MicroBIGG-E

Find the protein URL using BigQuery

SELECT protein_url
FROM `ncbi-pathogen-detect.pdbrowser.microbigge`
WHERE element_symbol = 'ompK36_D135DGD'
AND biosample_acc = 'SAMN01057611';
The results should be:
protein_url
gs://ncbi-pathogen-proteins/WP_/004/151/WP_004151112.1.faa.gz
Copy the gzipped protein FASTA file using the gsutil utility

Enter the following at a unix shell command-line to copy the gzipped protein FASTA file to your computer. See the Google docs for more information on the gsutil program.
gsutil cp gs://ncbi-pathogen-proteins/WP_/004/151/WP_004151112.1.faa.gz .
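Before a bulk download like the one in the next section, it can help to check how many distinct protein files a query will touch (for example, to confirm that a --max_rows setting is large enough). A sketch using the same class filter as the quinolone example below:

SELECT COUNT(DISTINCT protein_url) AS num_files
FROM `ncbi-pathogen-detect.pdbrowser.microbigge`
WHERE class LIKE '%QUINOLONE%'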
Download all QUINOLONE resistance genes

This example uses a Linux or macOS command line, the Google Cloud CLI, and the bash shell. See the Install the Google Cloud CLI documentation from Google for instructions on how to install the CLI.
Authenticate the CLI to give it permissions on your Google Cloud project

See Initializing the gcloud CLI for more information.
gcloud auth login
Follow the instructions to authenticate to Google Cloud.
Download a list of URLs using bq

# tail -n +2 strips the CSV header row so only gs:// paths remain
bq query --use_legacy_sql=false --format=csv --max_rows 300000 '
SELECT DISTINCT protein_url
FROM `ncbi-pathogen-detect.pdbrowser.microbigge`
WHERE class LIKE "%QUINOLONE%"
' | tail -n +2 > all_quinolone_urls.csv
Split the list of files into batches

We split the list into batches because file systems tend to have performance issues when one directory contains too many files. Depending on your operating system and configuration, you may want to change the batch size.
split -d -l 5000 all_quinolone_urls.csv batch.
Use a shell loop to download the protein files

for file in batch.*
do
    # create a directory for this batch and copy its protein files into it
    mkdir "$file.asm"
    cat "$file" | gcloud alpha storage cp --read-paths-from-stdin "$file.asm/"
done