ABSTRACT
Bioinformatics researchers need efficient means to process large collections of sequence data. One application of interest, genome assembly, has great potential for parallelization; however, most previous attempts at parallelization require uncommon high-end hardware. This paper introduces a scalable modular genome assembler that can achieve significant speedup using large numbers of conventional desktop machines, such as those found in a campus computing grid. The system is based on the Celera open-source assembly toolkit and replaces two independent sequential modules with scalable counterparts: a candidate selector that exploits the distributed memory capacity of a campus grid, and an aligner that exploits its distributed computing capacity. For large problems, these modules provide robust task and data management while achieving speedup with high efficiency at several scales of resources. We show results for several datasets ranging from 738 thousand to over 121 million alignments, using campus grid resources ranging from a small cluster to more than a thousand nodes spanning three institutions. Our largest run so far achieves a 927x speedup with 71.3 percent efficiency.
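As a rough illustration of how the reported speedup and efficiency relate, the sketch below assumes the conventional definition of parallel efficiency (speedup divided by the number of workers). The abstract does not state this definition or the worker count, so the function name `implied_workers` and the resulting figure are illustrative assumptions, not values reported by the authors.

```python
def implied_workers(speedup: float, efficiency: float) -> float:
    """Worker count implied by the ASSUMED relation efficiency = speedup / workers.

    This is the textbook definition of parallel efficiency; the abstract
    does not confirm that the authors used exactly this formula.
    """
    if not 0 < efficiency <= 1:
        raise ValueError("efficiency must be in (0, 1]")
    if speedup <= 0:
        raise ValueError("speedup must be positive")
    return speedup / efficiency


if __name__ == "__main__":
    # Figures from the abstract: 927x speedup at 71.3 percent efficiency.
    workers = implied_workers(927.0, 0.713)
    print(f"Implied worker count: about {round(workers)}")
```

Under that assumption, the largest run would correspond to roughly 1300 workers, which is consistent with the abstract's description of "more than a thousand nodes".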