Using Hadoop to Quickly Create an Index for a Genome

Hadoop Genome Indexing Project

Background

I started this project to gain a better grasp of Map/Reduce on Hadoop. The source code and research for this project is based on the work of researchers Rohith Menon, Goutham Bhat and Michael Schatz (Rapid Parallel Genome Indexing with MapReduce). They developed an algorithm for creating a suffix-index and Burrow-Wheeler Transform for a whole genome using Map/Reduce on Hadoop. Generating a pre-computed index of a genome is an important step to optimize sequence alignment, which is an essential technique in computation biology. Techniques to distribute the indexing workload across multiple nodes existed before their paper. What they introduced however, is a technique to distribute the workload using commodity hardware on Amazon’s EC2 platform and Hadoop’s Map/Reduce.

Project Goals

My goals for this project are:

  1. Understand how the algorithm distributes the work among the Map and Reduce tasks and explain it more plainly.
  2. Understand the benefit behind creating a suffix index and Burrows-Wheeler Transform (BWT)
  3. Re-express the C and C++ portions of the code in Java and analyse any significant performance impact this has on the technique
  4. Implement unit tests for the existing code to support refactoring and extension
  5. Reproduce the results for indexing the genomes published in their paper
  6. Utilize the Gradle build tool to manage dependencies and make testing and building more fluid

Project Source

I am publishing the source code as I adapt it and work through the above goals. This code is available on github

About the Original Source

The original source code is available under MIT license here. Future posts will cover each portion of the algorithm in more depth. Here I am just providing an overview of the source code in the original project. It contains the following components:

  • Sampler: This C module creates a reasonably well distributed partitioning of the suffixes that will be generated.
  • run.sh: This shell scripts generates ranges of integers used in the map phase, calls the Sampler to create the partitions used to distribute the work among the reducers, loads the DNA sequence into HDFS, and then executes the Map/Reduce job.
  • Buck Sort Mapper: This is a Java implementation of the map phase. Each call to map is given a range of integers. The map method creates a key for each integer in the range and assigns it a null value. When all mappers are taken as a whole, a key-value pair is generated for each integer from 1 to G, where G is the size of the genome.
  • Bucket Sort Total Order Partitioner: This Java class determines how the Map/Reduce algorithm will assign the map results to reducers. It distributes subsequences of the genome to the reducers based on the partitions created by the Sampler
  • PSort: This is the Reducer. It is a small amount of Java code that wraps a sorting algorithm implemented in C++ and called through JNI. The reduce method simply collects the indices produced by the map phase in a buffer. The work of sorting the suffixes is actually done in the cleanup method of the reducers.

Posts in this Series

About the Image

The header image on this page is “DNA” by John Goode, licensed under CC BY 2.0.