XG: A Data-Driven Computation Grid for Enterprise-Scale Mining

Sion, Radu; Natarajan, Ramesh; Narang, Inderpal; Li, Wen-Syan; Phan, Thomas

doi:10.1007/11546924_81

Part of the book series: Lecture Notes in Computer Science (LNISA, volume 3588)

Included in the following conference series:

International Conference on Database and Expert Systems Applications

Abstract

In this paper we introduce a novel architecture for data processing, based on a functional fusion between a data and a computation layer. We show how such an architecture can be leveraged to offer significant speedups for data processing jobs such as data analysis and mining over large data sets.

One novel contribution of our solution is its data-driven approach: the computation infrastructure is controlled from within the data layer. Grid compute job submission events originate within the query processor on the DBMS side and are in effect controlled by the data processing job to be performed. This allows the early deployment of on-the-fly data aggregation techniques, minimizing the amount of data to be transferred to and from compute nodes, and stands in stark contrast to existing Grid solutions that interact with data layers mainly as external “storage”.
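To make the idea of aggregating data before it leaves the data layer concrete, the following is a minimal sketch, not the mechanism used in the paper. It assumes a simple one-variable least-squares fit and shows that each data partition can be reduced to five running sums, so only those sums (rather than the raw rows) would need to travel to or from a compute node. The names `RegressionStats`, `aggregate`, `merge`, and `fit` are hypothetical and introduced here for illustration only.

```python
from dataclasses import dataclass
from typing import Iterable, Tuple


@dataclass
class RegressionStats:
    """Sufficient statistics for a least-squares fit y = intercept + slope * x.

    Hypothetical helper: the abstract does not say which aggregates XG uses.
    """
    n: int = 0
    sx: float = 0.0
    sy: float = 0.0
    sxx: float = 0.0
    sxy: float = 0.0

    def add(self, x: float, y: float) -> None:
        # Fold one row into the running sums.
        self.n += 1
        self.sx += x
        self.sy += y
        self.sxx += x * x
        self.sxy += x * y

    def merge(self, other: "RegressionStats") -> "RegressionStats":
        # Combining two partitions only requires adding their sums.
        return RegressionStats(
            self.n + other.n,
            self.sx + other.sx,
            self.sy + other.sy,
            self.sxx + other.sxx,
            self.sxy + other.sxy,
        )

    def fit(self) -> Tuple[float, float]:
        # Closed-form ordinary least squares from the aggregated sums.
        denom = self.n * self.sxx - self.sx ** 2
        if denom == 0:
            raise ValueError("need at least two distinct x values")
        slope = (self.n * self.sxy - self.sx * self.sy) / denom
        intercept = (self.sy - slope * self.sx) / self.n
        return intercept, slope


def aggregate(rows: Iterable[Tuple[float, float]]) -> RegressionStats:
    """Reduce one partition of (x, y) rows to its sufficient statistics."""
    stats = RegressionStats()
    for x, y in rows:
        stats.add(x, y)
    return stats


# Two partitions of points on the line y = 2x + 1 (made-up illustration data).
partition_a = [(0.0, 1.0), (1.0, 3.0)]
partition_b = [(2.0, 5.0), (3.0, 7.0)]

merged = aggregate(partition_a).merge(aggregate(partition_b))
print(merged.fit())  # prints (1.0, 2.0)
```

In this sketch the size of what is shipped is fixed at five numbers per partition, regardless of how many rows the partition holds, which is the kind of saving that aggregating early inside the data layer aims for.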

We validate this architecture in a scenario derived from a real business deployment, involving financial customer profiling using common types of data analytics (e.g., linear regression analysis). Experimental results show significant speedups. For example, using a grid of only 12 non-dedicated nodes, we observed a speedup of approximately 1000% in a scenario involving complex linear-regression data mining computations for commercial customer profiling.
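As a rough aid for reading the reported figure, the sketch below converts a percentage speedup into a runtime factor and a per-node efficiency. The abstract does not say whether "1000%" means the job ran 10 times as fast or 10 times faster on top of the baseline, so both readings are computed; the function names `speedup_factor` and `parallel_efficiency` are hypothetical, and the resulting values are arithmetic on the abstract's numbers, not results from the paper.

```python
def speedup_factor(percent: float, as_increase: bool = False) -> float:
    """Convert a percentage speedup into a runtime ratio (baseline / parallel).

    as_increase=False reads 1000% as "10 times as fast";
    as_increase=True reads it as "1000% faster", i.e. 11 times as fast.
    """
    factor = percent / 100.0
    return factor + 1.0 if as_increase else factor


def parallel_efficiency(speedup: float, nodes: int) -> float:
    """Speedup divided by node count; 1.0 would be ideal linear scaling."""
    if nodes <= 0:
        raise ValueError("nodes must be positive")
    return speedup / nodes


for as_increase in (False, True):
    s = speedup_factor(1000.0, as_increase)
    e = parallel_efficiency(s, 12)
    print(f"speedup {s:.0f}x on 12 nodes -> efficiency {e:.2f}")
```

Under either reading the per-node efficiency on 12 nodes comes out between roughly 0.8 and 0.9, which is consistent with the abstract's description of the speedup as significant for a small, non-dedicated grid.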

Author information

Authors and Affiliations

Computer Sciences, Stony Brook University, Stony Brook, NY, 11794, USA
Radu Sion
IBM TJ Watson Research Lab, Yorktown Heights, NY, 10598, USA
Ramesh Natarajan
IBM Almaden Research Lab, 650 Harry Rd, San Jose, CA, 95120, USA
Inderpal Narang, Wen-Syan Li & Thomas Phan

Editor information

Editors and Affiliations

Copenhagen Business School, Centre for Applied ICT, 60 Howitzvej, 2000, Frederiksberg, DK
Kim Viborg Andersen
University Of Technology Sydney, NSW 2007, Australia
John Debenham
University of Linz, Altenbergerstraße 69, 4040, Linz, Austria
Roland Wagner

Cite this paper

Sion, R., Natarajan, R., Narang, I., Li, WS., Phan, T. (2005). XG: A Data-Driven Computation Grid for Enterprise-Scale Mining. In: Andersen, K.V., Debenham, J., Wagner, R. (eds) Database and Expert Systems Applications. DEXA 2005. Lecture Notes in Computer Science, vol 3588. Springer, Berlin, Heidelberg. https://doi.org/10.1007/11546924_81

DOI: https://doi.org/10.1007/11546924_81
Publisher Name: Springer, Berlin, Heidelberg
Print ISBN: 978-3-540-28566-3
Online ISBN: 978-3-540-31729-6
eBook Packages: Computer Science, Computer Science (R0)