Jun 5, 2017  03:06 AM | Azhar uddin - Appmajix Technologies
Use of Hadoop and HBase in Bioinformatics
Use in next-generation sequencing

The CloudBurst software maps next-generation short-read sequencing data to a reference genome for SNP discovery and genotyping. CloudBurst was created by Michael C. Schatz at the University of Maryland (UMD). Schatz's CloudBurst paper, published in May 2009, put Hadoop "on the map" in bioinformatics. Since the release of CloudBurst, Schatz and colleagues at UMD and at Johns Hopkins University (e.g., B. Langmead) have developed a suite of algorithms that employ Hadoop for analysis of next-generation sequencing data:

1) Crossbow uses Hadoop for whole-genome resequencing analysis and SNP genotyping from short reads.

2) Contrail uses Hadoop for de novo assembly from short sequencing reads (without using a reference genome), scaling up de Bruijn graph construction.

3) Myrna uses Bowtie, another UMD tool for ultrafast short read alignment, and R/Bioconductor for calculating differential gene expression from large RNA-seq data sets. When running on a cluster, Myrna uses Hadoop. Also, Myrna can be run in the cloud using Amazon Elastic MapReduce.
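
The de Bruijn graph construction that Contrail scales up can be illustrated on a single machine. The sketch below is not Contrail's code. It is a minimal illustration, assuming error-free reads and an arbitrarily chosen k-mer size, in which a map step emits one edge per k-mer and a reduce step groups the edges by source node, as a MapReduce job would. The function names are hypothetical.

```python
from collections import defaultdict


def map_read(read, k):
    """Map step: emit one (prefix, suffix) edge per k-mer in the read."""
    for i in range(len(read) - k + 1):
        kmer = read[i:i + k]
        # Nodes are (k-1)-mers; each k-mer is an edge from its prefix to its suffix.
        yield kmer[:-1], kmer[1:]


def reduce_edges(pairs):
    """Reduce step: group the emitted edges by source node."""
    graph = defaultdict(set)
    for src, dst in pairs:
        graph[src].add(dst)
    return {src: sorted(dsts) for src, dsts in graph.items()}


def build_de_bruijn(reads, k):
    """Run the map step over all reads, then the reduce step over all edges."""
    pairs = (edge for read in reads for edge in map_read(read, k))
    return reduce_edges(pairs)


# Two overlapping toy reads (made-up data) with k = 3.
graph = build_de_bruijn(["ACGTC", "CGTCA"], k=3)
```

On a cluster, the grouping done here by `reduce_edges` would be carried out by the framework's shuffle phase, so that all edges leaving the same node reach the same reducer.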

Cloud computing results - Amazon Elastic Compute Cloud (EC2) and Amazon Elastic MapReduce are Web services that provide resizable compute capacity in the cloud. Among other batch processing software, they provide Hadoop. Myrna was designed to run in Elastic MapReduce as well as on a local Hadoop-based cluster. Langmead et al. regard cloud computing as a worthwhile computing framework and report results obtained with it. Schatz has also tested Crossbow on EC2 and believes that running on EC2 can be quite cost effective. (Note: non-commercial services such as the IBM/Google Cloud Computing Initiative are also available to researchers.) In addition, Indiana University (IU) researchers have compared MPI, Dryad (Microsoft), Azure (Microsoft), and Hadoop MapReduce, measuring relative performance using three bioinformatics applications. This work was summarized by Judy Qiu of IU at BOSC 2010. The flexibility of clouds and MapReduce comes off quite well in the IU testing, suggesting "they will become preferred approaches".

Use in other bioinformatics domains
 
In addition to next-gen sequencing, Hadoop and HBase have been applied to other areas in bioinformatics. M. Gaggero and colleagues in the Distributed Computing Group at the Center for Advanced Studies, Research and Development in Sardinia have reported on implementing BLAST and Gene Set Enrichment Analysis (GSEA) in Hadoop. BLAST was implemented using a Python wrapper for the NCBI C++ Toolkit and Hadoop Streaming to build an executable mapper for BLAST. For the MapReduce version of GSEA, the functions were rewritten in Python and run with Hadoop Streaming. The group is now developing Biodoop, a suite of parallel bioinformatics applications based upon Hadoop that consists of three qualitatively different algorithms: BLAST, GSEA and GRAMMAR. They deem their results "very promising", with MapReduce being a "versatile framework".
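
Hadoop Streaming lets any executable act as mapper or reducer: the mapper reads input lines and writes tab-separated key/value lines, the framework sorts those lines by key, and the reducer reads the sorted stream. The sketch below simulates that sequence in one Python process. It is not Biodoop code, and the three-column "query subject score" input format is a hypothetical stand-in for alignment hits.

```python
from itertools import groupby


def mapper(lines):
    """Emit 'query<TAB>score' for each hypothetical 'query subject score' line."""
    for line in lines:
        fields = line.split()
        if len(fields) != 3:
            continue  # skip malformed lines
        query, _subject, score = fields
        yield f"{query}\t{score}"


def reducer(sorted_lines):
    """Keep the highest score per query; the input must be sorted by key."""
    parsed = (line.split("\t") for line in sorted_lines)
    for query, group in groupby(parsed, key=lambda kv: kv[0]):
        best = max(float(score) for _, score in group)
        yield f"{query}\t{best}"


def run_locally(lines):
    """Simulate the Streaming map, sort, reduce sequence in one process."""
    return list(reducer(sorted(mapper(lines))))


# Made-up hits: two for query q1, one for query q2.
hits = ["q1 s1 50.0", "q2 s9 12.5", "q1 s2 75.5"]
result = run_locally(hits)
```

In an actual Streaming job the mapper and reducer would be separate scripts that read `sys.stdin` and print to standard output. Keeping them as generator functions here makes the logic testable without a cluster.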

In other work, Andrea Matsunaga and colleagues at the University of Florida have created CloudBLAST, a parallelized version of the NCBI BLAST2 algorithm using Hadoop. Their parallelization approach segmented the input sequences and ran multiple instances of the unmodified NCBI BLAST2 on each segment, using the Hadoop Streaming utility. Results across multiple input sets were compared against the publicly available version of mpiBLAST, a leading parallel version of BLAST. CloudBLAST exhibited better performance while also having advantages in simpler development and sustainability. Matsunaga et al. conclude that for applications that can fit into the MapReduce paradigm, use of Hadoop brings significant advantages in terms of management of failures, data, and jobs.
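
The segmentation step described above can be sketched as follows. This is not CloudBLAST's code. It is a minimal illustration that assumes FASTA input and a round-robin split, neither of which is confirmed by the source beyond "segmented the input sequences". Running BLAST on each segment is deliberately left out, and the function names are hypothetical.

```python
def read_fasta(text):
    """Split FASTA text into (header, sequence) records."""
    records, header, seq = [], None, []
    for line in text.splitlines():
        line = line.strip()
        if not line:
            continue
        if line.startswith(">"):
            # A new header closes the previous record, if there was one.
            if header is not None:
                records.append((header, "".join(seq)))
            header, seq = line[1:], []
        else:
            seq.append(line)
    if header is not None:
        records.append((header, "".join(seq)))
    return records


def segment(records, n_segments):
    """Deal records round-robin into at most n_segments lists, keeping records whole."""
    segments = [[] for _ in range(n_segments)]
    for i, record in enumerate(records):
        segments[i % n_segments].append(record)
    # Drop empty segments when there are fewer records than segments.
    return [s for s in segments if s]


# Made-up input: three query sequences split into two segments.
fasta = ">a\nACGT\nAC\n>b\nGG\n>c\nTTT\n"
segments = segment(read_fasta(fasta), 2)
```

Because each query sequence is searched independently, splitting on record boundaries leaves the per-query results unchanged, which is what allows an unmodified BLAST binary to process each segment.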

Hadoop has also been used for multiple sequence alignment. As for HBase, Brian O'Connor of the University of North Carolina at Chapel Hill described the use of HBase as a scalable backend for the SeqWare Query Engine at the BOSC 2010 meeting. Recent work on the design of the Genome Analysis Toolkit at the Broad Institute has created a framework that supports MapReduce programming in bioinformatics. Hadoop has also emerged as an enabling technology for large-scale graph processing, which is directly relevant to topological analysis of biological networks. Lin & Schatz have recently reported on improving the capabilities of Hadoop-based programs in this area.

As to work not yet reported: since August 2010, A. Tiwari has been maintaining a list of Hadoop/MapReduce applications in bioinformatics on his blog site.

Use in scientific cloud computing, biological data integration and knowledgebase construction
 
The U.S. Department of Energy (DOE) is exploring scientific cloud computing in the Magellan project, a joint research effort of the National Energy Research Scientific Computing Center (NERSC) at Lawrence Berkeley National Laboratory and the Leadership Computing Facility at Argonne National Laboratory (ANL). Hadoop and HBase have been installed on a cluster at NERSC (40 nodes reserved for Hadoop, soon to double), and studies have been run using Hadoop in Streaming mode for BLAST computations. NERSC is evaluating the use of solid state (flash) storage on the Hadoop nodes. Also, the DOE Joint Genome Institute has performed contig extension work using Hadoop on the NERSC cluster. The Hadoop cluster at ANL, now undergoing testing, will be available for researchers in late 2010. Users interested in using clouds for their research may fill out the Magellan Cloud Computing statement of interest form.

At the Environmental Molecular Sciences Laboratory, a national user facility located at DOE's Pacific Northwest National Laboratory (PNNL), we wish to develop a scientific data management system that will scale into the petabyte range, that will accurately and reliably store data acquired from our various instruments, and that will store the output of analysis software and relevant metadata. As a pilot project for such an effort, work started in August 2010 on a prototype data repository, i.e., a workspace for integration of high-throughput transcriptomics and proteomics data. This database will have the capacity to store very large amounts of data from mass spectrometry-based proteomics experiments as well as from next-gen high-throughput sequencing platforms. The author (RCT) is building the pilot database on a 25-node cluster using Hadoop and HBase as the framework. Beyond such data warehousing and data integration work, we envisage using Hadoop and HBase for the design of large knowledgebases operating on a cluster across the distributed file system. The U.S. Department of Energy is funding work on construction of large biological knowledgebases, and Kandinsky, a 68-node, 1088-core Linux cluster (64 GB RAM, 8 TB disk per node) running Hadoop (Cloudera distribution, under CentOS 5) and HBase, was set up in 2010 at Oak Ridge National Laboratory as an exploratory environment. CloudBurst has been installed as a sample Hadoop-based application, and the cluster is open to researchers wishing to conduct preliminary work towards knowledgebase construction and towards support of grant proposals for such work.
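
To make the integration idea concrete, the sketch below models a single wide table in the HBase style, where each row is addressed by a row key and each cell by a "family:qualifier" column name, and rows are scanned in sorted key order. It is an in-memory stand-in, not the HBase API and not the PNNL schema. The class, the row keys, and the column families `tx` and `prot` are all hypothetical.

```python
from collections import defaultdict


class WideTable:
    """In-memory stand-in for one HBase-style table (not the HBase API)."""

    def __init__(self, families):
        self.families = set(families)
        self.rows = defaultdict(dict)

    def put(self, row_key, family, qualifier, value):
        """Store one cell; column families are fixed when the table is created."""
        if family not in self.families:
            raise KeyError(f"unknown column family: {family}")
        self.rows[row_key][f"{family}:{qualifier}"] = value

    def get(self, row_key):
        """Return a copy of all cells stored for one row key."""
        return dict(self.rows.get(row_key, {}))

    def scan(self):
        """Yield rows in sorted row-key order."""
        for key in sorted(self.rows):
            yield key, dict(self.rows[key])


# One table holds both data types, keyed by a hypothetical gene identifier.
table = WideTable(families=["tx", "prot"])
table.put("gene0002", "prot", "sample1_spectral_count", 7)
table.put("gene0001", "tx", "sample1_read_count", 120)
table.put("gene0001", "prot", "sample1_spectral_count", 3)

row = table.get("gene0001")
keys = [key for key, _ in table.scan()]
```

Because rows are sparse, a gene with only proteomics measurements simply has no `tx` cells, so data sets with different coverage can share one table without placeholder values.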

Conclusions

Hadoop and its associated open source projects have a diverse and growing community of both users and developers in bioinformatics, as can be seen from the large number of projects described above. A concluding point, drawn from preliminary work for the Hadoop/HBase-based PNNL project, follows Dean & Ghemawat: for much bioinformatics work, the scalability that Hadoop and HBase permit is important, but so is the ease of integrating and analyzing various large, disparate data sources in one data warehouse under Hadoop, in relatively few HBase tables.
