ABSTRACT
Scientific instruments, as well as simulations, generate increasingly large datasets, changing the way we do science. We propose a system, which we call the data-intensive computer, for computing with petascale datasets. The data-intensive computer consists of an HPC cluster, a massively parallel database, and a set of computing servers running the data-intensive operating system, which turns the database into a layer in the memory hierarchy of the data-intensive computer.
The data-intensive operating system is data-object-oriented: the abstract programming model of a sequential file, central to traditional operating systems, is replaced with system-level support for high-level data objects such as multi-dimensional arrays, graphs, and sparse arrays. User application programs are compiled into code that executes both on the HPC cluster and inside the database. The data-intensive operating system is, however, non-local, allowing remote applications to execute code inside the database. This model supports collaborative environments, where a large dataset is typically created and processed by a large group of users.
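The data-object model described above can be illustrated with a small sketch. The names here (`DataObjectStore`, `put_region`, `run_in_db`) are hypothetical and not the system's actual API; the sketch only shows the two ideas the abstract describes: the database tier holds regions of a high-level array object rather than flat files, and remote clients ship computation to the data instead of pulling the data out.

```python
class DataObjectStore:
    """Toy stand-in for the database tier: holds named N-d array regions
    and can execute user code next to the data (non-local execution)."""

    def __init__(self):
        self._regions = {}  # (name, origin) -> nested-list block

    def put_region(self, name, origin, block):
        # A simulation rank writes its local slab of a global array.
        self._regions[(name, tuple(origin))] = block

    def get_region(self, name, origin):
        return self._regions[(name, tuple(origin))]

    def run_in_db(self, func, name):
        # Ship the computation to the data: apply func to every stored
        # block of the named object and return the partial results.
        blocks = [b for (n, _), b in self._regions.items() if n == name]
        return [func(b) for b in blocks]


store = DataObjectStore()
# Two ranks of a simulation each deposit their slab of a 2-D field.
store.put_region("velocity", (0, 0), [[1.0, 2.0], [3.0, 4.0]])
store.put_region("velocity", (2, 0), [[5.0, 6.0], [7.0, 8.0]])

# A remote client asks the database tier for per-block maxima
# without transferring the full array.
partial_max = store.run_in_db(lambda b: max(max(r) for r in b), "velocity")
print(max(partial_max))  # 8.0
```

A production system would shard the regions across database nodes and compile `func` for in-database execution; the point of the sketch is only the division of labor between client and data tier.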
We are developing a software library, MPI-DB, which is a prototype of the data-intensive operating system. It is currently used by the Turbulence group at JHU to store simulation output in the database and to run simulations that refine previously stored results.
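The refinement workflow mentioned above can be sketched as follows. This is not the MPI-DB API (which is not shown here); it is a minimal illustration, under assumed names, of one run checkpointing a field into the database layer and a later run refining it onto a finer grid by linear interpolation.

```python
def store_field(db, step, values):
    """Persist one timestep's 1-D field into the database layer
    (here a plain dict stands in for the database)."""
    db[step] = list(values)


def refine_from_store(db, step):
    """Read a previously stored coarse field and return a field at
    twice the resolution, inserting linearly interpolated midpoints."""
    coarse = db[step]
    fine = []
    for a, b in zip(coarse, coarse[1:]):
        fine.extend([a, (a + b) / 2.0])
    fine.append(coarse[-1])
    return fine


db = {}  # stand-in for the database layer
store_field(db, 0, [0.0, 2.0, 4.0])       # first run stores coarse output
print(refine_from_store(db, 0))           # [0.0, 1.0, 2.0, 3.0, 4.0]
```

In the real system the stored field would be a multi-dimensional array region and the refinement would feed a new simulation, but the read-refine-resimulate cycle has the same shape.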
An architecture for a data-intensive computer