DOI: 10.1145/1646468.1646480
research-article

Highly scalable genome assembly on campus grids

Published: 16 November 2009

ABSTRACT

Bioinformatics researchers need efficient means to process large collections of sequence data. One application of interest, genome assembly, has great potential for parallelization; however, most previous attempts at parallelization require uncommon high-end hardware. This paper introduces a scalable modular genome assembler that can achieve significant speedup using large numbers of conventional desktop machines, such as those found in a campus computing grid. The system is based on the Celera open-source assembly toolkit and replaces two independent sequential modules with scalable counterparts: a scalable candidate selector exploits the distributed memory capacity of a campus grid, while a scalable aligner exploits its distributed computing capacity. For large problems, these modules provide robust task and data management while also achieving speedup with high efficiency on several scales of resources. We show results for several datasets ranging from 738 thousand to over 121 million alignments, using campus grid resources ranging from a small cluster to more than a thousand nodes spanning three institutions. Our largest run so far achieves a 927x speedup with 71.3 percent efficiency.
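
The abstract reports a speedup and an efficiency for the largest run but not the worker count behind them. As a rough sanity check, the sketch below assumes the common definition of parallel efficiency (speedup divided by number of workers) and inverts it. That definition is an assumption here, since the abstract does not say how efficiency was computed, and the function names are hypothetical.

```python
def parallel_efficiency(speedup: float, workers: int) -> float:
    """Efficiency under the assumed definition: speedup / workers."""
    if workers <= 0:
        raise ValueError("workers must be positive")
    return speedup / workers


def implied_workers(speedup: float, efficiency: float) -> float:
    """Invert the assumed definition: workers = speedup / efficiency."""
    if not 0 < efficiency <= 1:
        raise ValueError("efficiency must be in (0, 1]")
    return speedup / efficiency


if __name__ == "__main__":
    # Figures quoted in the abstract: 927x speedup at 71.3 percent efficiency.
    workers = implied_workers(927, 0.713)
    print(f"implied workers: {workers:.0f}")
    # Round trip: efficiency recomputed from the rounded worker count.
    print(f"efficiency at 1300 workers: {parallel_efficiency(927, 1300):.3f}")
```

Under this assumption the quoted figures imply roughly 1300 workers, which is consistent with the abstract's "more than a thousand nodes".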


Published in

MTAGS '09: Proceedings of the 2nd Workshop on Many-Task Computing on Grids and Supercomputers
November 2009
131 pages
ISBN: 9781605587141
DOI: 10.1145/1646468

Copyright © 2009 ACM

Permission to make digital or hard copies of all or part of this work for personal or classroom use is granted without fee provided that copies are not made or distributed for profit or commercial advantage and that copies bear this notice and the full citation on the first page. Copyrights for components of this work owned by others than ACM must be honored. Abstracting with credit is permitted. To copy otherwise, or republish, to post on servers or to redistribute to lists, requires prior specific permission and/or a fee. Request permissions from [email protected]

Publisher

Association for Computing Machinery
New York, NY, United States
