The entire corpus of the Sequence Read Archive (SRA) now live on two cloud platforms!

The National Library of Medicine (NLM) is pleased to announce that all controlled-access and publicly available data in SRA is now available through Google Cloud Platform (GCP) and Amazon Web Services (AWS). To access the data, please visit our SRA in the Cloud webpage, where you will find links to our new SRA Toolkit and other access methods.

The SRA data available in the two clouds currently totals more than 14 petabytes and consists of all data in the SRA format as well as some data in its original submission format. Since May 2019, NCBI has been putting all submitted SRA data on the GCP and AWS clouds in both the submitted format and our converted SRA format. We have also been moving previously submitted original format data to the clouds and expect to complete that process in 2021.

Having SRA data in the cloud lets bioinformaticians access and compute on it faster and more efficiently than downloading it from the datacenter of NLM’s National Center for Biotechnology Information (NCBI). This massive archive of genetic sequence data facilitates discoveries in human genetics, health, and disease, and hosting it on cloud platforms breaks down barriers to access and helps unlock its potential. With a GCP or AWS account, users can now essentially transform a laptop into a high-performance computing (HPC) device when working in the cloud.

Efforts are underway to improve usability and portability of SRA data and to better support data search and retrieval through BigQuery and other interfaces. Please stay tuned to hear more about these efforts and opportunities to participate.

We are listening and want your feedback! We are actively developing tutorials and training materials and want to know if there are aspects of our cloud data and tools that you would like to learn more about. Contact us at suggest@ncbi.nlm.nih.gov and tell us about your experience or let us know if you need help getting started.

This work was made possible by financial support through the Office of Data Science Strategy and technical support from the National Institutes of Health’s Science and Technology Research Infrastructure for Discovery, Experimentation, and Sustainability (STRIDES) Initiative.
