ABSTRACT
As scientific research becomes more data-intensive, there is an increasing need for scalable, reliable, and high-performance storage systems. Such data repositories must provide both data archival services and rich metadata, and integrate cleanly with large-scale computing resources. ROARS is a hybrid approach to distributed storage that provides both large, robust, scalable storage and efficient rich metadata queries for scientific applications. In this paper, we demonstrate that ROARS is capable of importing and exporting large quantities of data, migrating data to new storage nodes, providing robust fault tolerance, and generating materialized views based on metadata queries. Our experimental results demonstrate that ROARS' aggregate throughput scales with the number of concurrent clients while providing fault-tolerant data access. ROARS is currently being used to store 5.1 TB of data in our local biometrics repository.