ABSTRACT
Scientific instruments, as well as simulations, generate increasingly large datasets, changing the way we do science. We propose a system, which we call the data-intensive computer, for computing with petascale datasets. The data-intensive computer consists of an HPC cluster, a massively parallel database, and a set of computing servers running the data-intensive operating system, which turns the database into a layer in the memory hierarchy of the data-intensive computer.
The data-intensive operating system is data-object-oriented: the abstract programming model of a sequential file, central to traditional operating systems, is replaced with system-level support for high-level data objects such as multi-dimensional arrays, graphs, and sparse arrays. User application programs are compiled into code that executes both on the HPC cluster and inside the database. The data-intensive operating system is, however, non-local, allowing remote applications to execute code inside the database. This model supports collaborative environments, where a large dataset is typically created and processed by a large group of users.
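The data-object model described above can be illustrated with a small sketch. The names here (`DataObjectStore`, `put_region`, `run_in_db`) are hypothetical and not the system's actual API; the sketch only shows the two ideas the abstract describes: the database tier holds regions of a high-level array object rather than flat files, and remote clients ship computation to the data instead of pulling the data out.

```python
class DataObjectStore:
    """Toy stand-in for the database tier: holds named N-d array regions
    and can execute user code next to the data (non-local execution)."""

    def __init__(self):
        self._regions = {}  # (name, origin) -> nested-list block

    def put_region(self, name, origin, block):
        # A simulation rank writes its local slab of a global array.
        self._regions[(name, tuple(origin))] = block

    def get_region(self, name, origin):
        return self._regions[(name, tuple(origin))]

    def run_in_db(self, func, name):
        # Ship the computation to the data: apply func to every stored
        # block of the named object and return the partial results.
        blocks = [b for (n, _), b in self._regions.items() if n == name]
        return [func(b) for b in blocks]


store = DataObjectStore()
# Two ranks of a simulation each deposit their slab of a 2-D field.
store.put_region("velocity", (0, 0), [[1.0, 2.0], [3.0, 4.0]])
store.put_region("velocity", (2, 0), [[5.0, 6.0], [7.0, 8.0]])

# A remote client asks the database tier for per-block maxima
# without transferring the full array.
partial_max = store.run_in_db(lambda b: max(max(r) for r in b), "velocity")
print(max(partial_max))  # 8.0
```

A production system would shard the regions across database nodes and compile `func` for in-database execution; the point of the sketch is only the division of labor between client and data tier.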
We are developing a software library, MPI-DB, which is a prototype of the data-intensive operating system. It is currently used by the Turbulence group at JHU to store simulation output in the database and to run simulations that refine previously stored results.
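The refinement workflow mentioned above can be sketched as follows. This is not the MPI-DB API (which is not shown here); it is a minimal illustration, under assumed names, of one run checkpointing a field into the database layer and a later run refining it onto a finer grid by linear interpolation.

```python
def store_field(db, step, values):
    """Persist one timestep's 1-D field into the database layer
    (here a plain dict stands in for the database)."""
    db[step] = list(values)


def refine_from_store(db, step):
    """Read a previously stored coarse field and return a field at
    twice the resolution, inserting linearly interpolated midpoints."""
    coarse = db[step]
    fine = []
    for a, b in zip(coarse, coarse[1:]):
        fine.extend([a, (a + b) / 2.0])
    fine.append(coarse[-1])
    return fine


db = {}  # stand-in for the database layer
store_field(db, 0, [0.0, 2.0, 4.0])       # first run stores coarse output
print(refine_from_store(db, 0))           # [0.0, 1.0, 2.0, 3.0, 4.0]
```

In the real system the stored field would be a multi-dimensional array region and the refinement would feed a new simulation, but the read-refine-resimulate cycle has the same shape.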
An architecture for a data-intensive computer