ABSTRACT
Simulations of complex scientific phenomena involve the execution of massively parallel computer programs. These simulation programs generate large-scale multidimensional data sets over a spatio-temporal region. Analyzing such massive data sets is an essential step in helping scientists glean new information, and doing so requires efficient and effective data models. In this paper, we present a hybrid approach for constructing data models from large-scale multidimensional scientific data sets. Our models not only provide descriptive information about the data but also allow users to subsequently examine the data by querying the models. Our approach combines a multiresolution-topological model of the data with a multivariate-physical model of the data to generate one hierarchical data model that efficiently captures both the spatio-temporal and the physical aspects of the data. The approach consists of three phases. In the first phase, we build a multiresolution model that encapsulates the data set's spatial information (i.e., topology and spatial connectivity). In the second phase, we build a multivariate model by clustering the physical dimensions of the data set. Physical dimensions are those dimensions that are neither spatial (x, y, z) nor temporal (time). Excluding the spatio-temporal dimensions from this clustering phase is important because "similar" characteristics can be located (spatially) far from each other. Finally, in the third phase, we connect the multivariate-physical model to the multiresolution-topological model by utilizing ideas from information retrieval. The third phase is essential because the multivariate-physical model contains no topological information, and without it the model lacks accurate spatial context. Experimental evaluations on two large-scale multidimensional scientific data sets illustrate the value of our hybrid approach.
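The three phases described above can be illustrated with a minimal sketch. It is not the authors' implementation; every name and algorithm choice in it is an assumption made for illustration. Phase one is stood in for by a recursive median split of the spatial coordinates, phase two by a plain k-means over the physical variables only, and phase three by treating each spatial node as a "document" and each cluster label as a "term", so that every node stores a term-frequency histogram. The abstract does not say which partitioning, clustering, or retrieval techniques the paper actually uses.

```python
from collections import Counter

def build_tree(points, ids, depth=0, max_leaf=2):
    """Phase 1 (assumed): median split on alternating spatial axes."""
    node = {"ids": ids, "children": []}
    if len(ids) > max_leaf:
        axis = depth % len(points[ids[0]])
        ordered = sorted(ids, key=lambda i: points[i][axis])
        mid = len(ordered) // 2
        node["children"] = [
            build_tree(points, ordered[:mid], depth + 1, max_leaf),
            build_tree(points, ordered[mid:], depth + 1, max_leaf),
        ]
    return node

def kmeans(rows, k, iters=20):
    """Phase 2 (assumed): k-means on physical variables, no x/y/z/time."""
    centroids = [list(r) for r in rows[:k]]  # deterministic start
    labels = [0] * len(rows)
    for _ in range(iters):
        # Assign every row to its nearest centroid (squared distance).
        labels = [
            min(range(k), key=lambda c: sum(
                (a - b) ** 2 for a, b in zip(r, centroids[c])))
            for r in rows
        ]
        # Move each centroid to the mean of its members.
        for c in range(k):
            members = [r for r, lab in zip(rows, labels) if lab == c]
            if members:
                centroids[c] = [sum(col) / len(members)
                                for col in zip(*members)]
    return labels

def annotate(node, labels):
    """Phase 3 (assumed): per-node histogram of cluster labels."""
    node["hist"] = Counter(labels[i] for i in node["ids"])
    for child in node["children"]:
        annotate(child, labels)

def query(node, cluster, min_frac=1.0):
    """Return the coarsest regions where `cluster` reaches min_frac."""
    if node["hist"][cluster] / len(node["ids"]) >= min_frac:
        return [node["ids"]]
    hits = []
    for child in node["children"]:
        hits += query(child, cluster, min_frac)
    return hits

# Hypothetical toy data: eight cells on a 2-D grid, one physical variable.
# The two "hot" cells (indices 0 and 7) sit in opposite corners.
points = [(0, 0), (1, 0), (0, 1), (1, 1), (6, 6), (7, 6), (6, 7), (7, 7)]
physical = [(100.0,), (10.0,), (11.0,), (12.0,),
            (9.0,), (10.5,), (11.5,), (101.0,)]

labels = kmeans(physical, k=2)
root = build_tree(points, list(range(len(points))))
annotate(root, labels)
hot_regions = query(root, cluster=labels[0], min_frac=0.5)
```

In this toy setup the two hot cells fall into the same cluster even though they are far apart in space, which is the reason the abstract gives for keeping spatial dimensions out of the clustering step. The annotated tree then supplies the spatial context that the clusters alone lack: `query` descends only until a region is sufficiently dominated by the requested cluster, so coarse answers are returned where they suffice.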