CLAMS: Bringing Quality to Data Lakes

ABSTRACT
As enterprises increasingly ingest as much data as they can into what are commonly referred to as "data lakes", and as multiple technologies have recently emerged to support this "load-first" paradigm, the new environment presents serious data management challenges. Among them, assessing data quality and cleaning large volumes of heterogeneous data sources become essential tasks in unveiling the value of big data. Because data lakes hold large volumes of unstructured and semi-structured data, current data cleaning tools, which are designed primarily for relational data, cannot be adopted directly.
We present CLAMS, a system to discover and enforce expressive integrity constraints over large amounts of lake data with very limited schema information (e.g., data represented as RDF triples). This demonstration shows how CLAMS simultaneously discovers constraints and the schemas they are defined on. CLAMS also introduces a scale-out solution to efficiently detect errors in the raw data, and it interacts with human experts both to validate the discovered constraints and to suggest data repairs.
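To make the idea of enforcing constraints over schema-less triple data concrete, here is a minimal sketch (not CLAMS's actual implementation) of checking one functional-dependency-style constraint over RDF-like triples: each subject may carry at most one value for a given predicate. The triple data, the predicate name `hasBirthDate`, and the function `find_violations` are all illustrative assumptions, not artifacts of the system.

```python
# Hypothetical sketch of violation detection over RDF-style triples.
# Assumed constraint: a subject has at most one object for a given
# predicate (e.g., an employee has exactly one birth date).
from collections import defaultdict

def find_violations(triples, predicate):
    """Return subjects holding more than one object for `predicate`."""
    values = defaultdict(set)
    for s, p, o in triples:
        if p == predicate:
            values[s].add(o)
    return {s: objs for s, objs in values.items() if len(objs) > 1}

triples = [
    ("emp:1", "hasBirthDate", "1980-05-01"),
    ("emp:1", "hasBirthDate", "1982-07-14"),  # conflicting value: violation
    ("emp:2", "hasBirthDate", "1975-03-20"),
]

violations = find_violations(triples, "hasBirthDate")
print(violations)  # emp:1 is flagged with its two conflicting dates
```

A scale-out version of this check maps each triple to a `(subject, predicate)` key and flags keys whose value sets exceed size one, which is why detection parallelizes naturally over partitions of the raw triples.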
CLAMS has been deployed in a real large-scale enterprise data lake and evaluated on a real data set of 1.2 billion triples. It spotted multiple obscure data inconsistencies and errors early in the data processing stack, providing significant value to the enterprise.