Setting up BigQuery

Overview

The Sequence Read Archive (SRA) has moved all of its metadata into BigQuery to give the bioinformatics community programmatic access to these data. You can now search across the entire SRA by sequencing methodology and sample attributes. NCBI is piloting this in BigQuery to help users benefit from elastic scaling and parallel query execution. BigQuery has a large collection of client libraries that you can use within your workflow. You can also interact with it in a web browser.

Get started in BigQuery

Set up account

To access BigQuery, you will need to set up a Google Cloud account:
https://cloud.google.com/

Once you have set up the account, you will need to create your project:
https://cloud.google.com/resource-manager/docs/creating-managing-projects

Record the project ID; you will need it to access BigQuery through the client libraries or the command line.

Payment

You pay for the queries you run against public datasets, so review BigQuery's payment requirements for on-demand queries before you start. BigQuery lets you query 1 TB of data per month for free.
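To gauge how much of the free allowance a query would consume, the bq tool can report the bytes a query would process without actually running it. The sketch below assumes the command line tool is already set up (see "Set up the command line tool" later in this document) and that the free allowance is counted as 1 TiB (2^40 bytes). The helper name percent_of_free_tier is hypothetical, so check Google's current pricing documentation for authoritative figures.

```shell
#!/usr/bin/env bash
# Hypothetical helper: percentage of the monthly free allowance that a
# query processing the given number of bytes would use.
# Assumption: the free allowance is 1 TiB = 2^40 bytes.
percent_of_free_tier() {
  local bytes="$1"
  awk -v b="$bytes" 'BEGIN { printf "%.2f\n", b / (2^40) * 100 }'
}

# Dry run: bq reports how many bytes the query would process without
# running it. This step runs only if bq is installed.
if command -v bq >/dev/null 2>&1; then
  bq query --nouse_legacy_sql --dry_run \
    'SELECT acc FROM `nih-sra-datastore`.sra.metadata WHERE consent = "public"'
fi

# Example: a query that would process 50 GiB
percent_of_free_tier $((50 * 1024 * 1024 * 1024))   # prints 4.88
```

Feeding the byte count from the dry run into the helper gives a rough sense of how many such queries fit in a month.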

Access methods

We recommend first using the BigQuery query editor to become familiar with SQL and writing queries before attempting to use the command line tools or client libraries.

BigQuery can be accessed through a web browser query editor:
https://console.cloud.google.com/bigquery

BigQuery client library documentation is also available for reference if you plan to access BigQuery through one of the supported programming languages:
https://cloud.google.com/bigquery/docs/reference/libraries

BigQuery command line tools can be downloaded and set up from here:
https://cloud.google.com/sdk/docs/quickstarts

Linking the SRA dataset in BigQuery Console

You will want to pin the SRA dataset to your BigQuery Console to make it easier to access and explore the available metadata. Click the Add Data button on the left side of the screen, in the Explorer panel.

Next, select Pin a project, click on Enter project name, paste nih-sra-datastore into the Pin a project box and click Pin.
Now you can proceed to example queries.

Set up the command line tool

First, you should create your account and your project through the Google web interface.
Next, download and install the Cloud SDK from the link above.
Then, use the command line tools to sign in to your account with this command:

gcloud auth login --no-launch-browser

Once you are signed in to your account, you need to set your project ID:

gcloud config set project PROJECT_ID

Here, PROJECT_ID is the ID that was assigned when you created your project (this is different from your project name).
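Before running queries, it can help to confirm which account and project the command line tools will use. The sketch below wraps two gcloud config get-value calls in a helper; the function name show_gcloud_context is hypothetical.

```shell
#!/usr/bin/env bash
# Hypothetical helper: print the account and project that gcloud is
# currently configured to use.
show_gcloud_context() {
  echo "account: $(gcloud config get-value account 2>/dev/null)"
  echo "project: $(gcloud config get-value project 2>/dev/null)"
}

# Runs only if the Cloud SDK is installed.
if command -v gcloud >/dev/null 2>&1; then
  show_gcloud_context
fi
```

If the project line is empty or shows an unexpected ID, repeat the gcloud config set project step.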
Now you will be able to run the following example query:

bq --format=csv query --nouse_legacy_sql --max_rows=10000 'SELECT acc FROM `nih-sra-datastore`.sra.metadata where consent = "public" and acc like "SRR%" LIMIT 1000' | sed '2 d' > accession_list.txt

The accession_list.txt file will contain the list of accessions that you can use with the SRA Toolkit to download the data.
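As a rough sketch of that last step, the loop below passes the first few accessions to the SRA Toolkit's prefetch command. It assumes the SRA Toolkit is installed and that run accessions appear one per line. The function name, the default limit of 5, and the filter that keeps only lines resembling SRR accessions are illustrative choices rather than part of the original workflow.

```shell
#!/usr/bin/env bash
# Hypothetical helper: download the first N runs listed in a file with
# the SRA Toolkit's prefetch command.
download_accessions() {
  local list_file="$1"
  local limit="${2:-5}"
  # Strip carriage returns and keep only lines that look like SRR run
  # accessions, so any header or status line in the file is skipped.
  tr -d '\r' < "$list_file" | grep -E '^SRR[0-9]+$' | head -n "$limit" |
    while IFS= read -r acc; do
      # Redirect stdin so the command cannot consume the rest of the list.
      prefetch "$acc" < /dev/null
    done
}

# Runs only if the toolkit is installed and the list exists.
if command -v prefetch >/dev/null 2>&1 && [ -f accession_list.txt ]; then
  download_accessions accession_list.txt 5
fi
```

Starting with a small limit keeps the first download short; raise it once the workflow behaves as expected.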

More command line query examples can be found here.
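To see which metadata fields are available before writing a query, the bq tool can list the tables in the SRA dataset and print a table's schema. This is a minimal sketch, assuming the command line tool is set up as described above; the function names are hypothetical.

```shell
#!/usr/bin/env bash
# Hypothetical helpers for exploring the SRA dataset from the command line.

# List the tables in the sra dataset of the nih-sra-datastore project.
list_sra_tables() {
  bq ls nih-sra-datastore:sra
}

# Print the schema (column names and types) of the metadata table as JSON.
show_sra_schema() {
  bq show --schema --format=prettyjson nih-sra-datastore:sra.metadata
}

# Runs only if bq is installed.
if command -v bq >/dev/null 2>&1; then
  list_sra_tables
  show_sra_schema
fi
```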

Contact SRA

Contact SRA staff for assistance at sra@ncbi.nlm.nih.gov
