Rail-dbGaP: Analyzing dbGaP-protected data in the cloud with Amazon Elastic MapReduce

Abhinav Nellore, Christopher Wilks, Kasper D. Hansen, Jeffrey T. Leek, Ben Langmead

Research output: Contribution to journal › Article › peer-review



Motivation: Public archives contain thousands of trillions of bases of valuable sequencing data. More than 40% of the Sequence Read Archive is human data protected by provisions such as dbGaP. To analyze dbGaP-protected data, researchers must typically work with IT administrators and signing officials to ensure all levels of security are implemented at their institution. This is a major obstacle, impeding reproducibility and reducing the utility of archived data.

Results: We present a protocol and software tool for analyzing protected data in a commercial cloud. The protocol, Rail-dbGaP, is applicable to any tool running on Amazon Web Services Elastic MapReduce. The tool, Rail-RNA v0.2, is a spliced aligner for RNA-seq data, which we demonstrate by running it on 9662 samples from the dbGaP-protected GTEx consortium dataset. The Rail-dbGaP protocol makes explicit for the first time the steps an investigator must take to develop Elastic MapReduce pipelines that analyze dbGaP-protected data in a manner compliant with NIH guidelines. Rail-RNA automates implementation of the protocol, making it easy for typical biomedical investigators to study protected RNA-seq data, regardless of their local IT resources or expertise.

Original language: English (US)
Pages (from-to): 2551-2553
Number of pages: 3
Issue number: 16
State: Published - Aug 15 2016
Externally published: Yes

ASJC Scopus subject areas

  • Statistics and Probability
  • Biochemistry
  • Molecular Biology
  • Computer Science Applications
  • Computational Theory and Mathematics
  • Computational Mathematics

