Draining the Data Swamp: A Similarity-based Approach

Authors:
Will Brackenbury (University of Chicago),
Rui Liu (University of Chicago),
Mainack Mondal (University of Chicago),
Aaron J. Elmore (University of Chicago),
Blase Ur (University of Chicago),
Kyle Chard (University of Chicago),
Michael J. Franklin (University of Chicago)

HILDA '18: Proceedings of the Workshop on Human-In-the-Loop Data Analytics, June 2018, Article No. 13, Pages 1–7. https://doi.org/10.1145/3209900.3209911

Published: 10 June 2018

ABSTRACT

While hierarchical namespaces such as filesystems and repositories have long been used to organize data, the rapid increase in data production places growing strain on users who wish to make use of that data. So-called "data lakes" embrace the storage of data in its natural form, integrating and organizing it in a pay-as-you-go fashion. While this model defers the upfront cost of integration, the result is that data is unusable for discovery or analysis until it is processed. Thus, data scientists are forced to spend significant time and energy on mundane tasks such as data discovery, cleaning, integration, and management. When these tasks are neglected, "data lakes" become "data swamps."

Prior work suggests that purely computational methods for resolving issues with the data discovery and management components are insufficient. Here, we provide evidence to confirm this hypothesis, showing that methods such as automated file clustering are unable to extract the features from repositories that are needed to provide useful information to end-user data scientists or to make effective data management decisions on their behalf. We argue that a combination of frameworks for specifying file similarity and human-in-the-loop interaction is needed to aid automated organization. We propose an initial step here, classifying several dimensions by which items may be considered similar: the data, its origin, and its current characteristics. We initially consider this model in the context of identifying data that can be integrated or managed collectively. We additionally explore how current methods can be used to automate decision making using real-world data repositories and file systems, and suggest how an online user study could be developed to further validate this hypothesis.
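To make the three similarity dimensions concrete, the following is a minimal illustrative sketch and not the authors' method. The `FileRecord` fields, the choice of Jaccard overlap on word tokens for the data dimension, owner equality for origin, and extension plus size ratio for current characteristics are all assumptions introduced here for illustration only.

```python
import os
from dataclasses import dataclass


@dataclass
class FileRecord:
    """Hypothetical description of one file (all fields are assumptions)."""
    path: str        # current location of the file
    owner: str       # who produced it; a stand-in for "origin"
    size_bytes: int  # one example of a current characteristic
    text: str        # file content, assumed to be already decoded


def jaccard(a: set, b: set) -> float:
    """Jaccard similarity of two sets; two empty sets count as identical."""
    if not a and not b:
        return 1.0
    return len(a & b) / len(a | b)


def data_similarity(x: FileRecord, y: FileRecord) -> float:
    """Data dimension: overlap of lowercase whitespace-separated tokens."""
    return jaccard(set(x.text.lower().split()), set(y.text.lower().split()))


def origin_similarity(x: FileRecord, y: FileRecord) -> float:
    """Origin dimension: 1.0 if both files share an owner, otherwise 0.0."""
    return 1.0 if x.owner == y.owner else 0.0


def characteristics_similarity(x: FileRecord, y: FileRecord) -> float:
    """Characteristics dimension: same extension and comparable size."""
    ext_x = os.path.splitext(x.path)[1].lower()
    ext_y = os.path.splitext(y.path)[1].lower()
    larger = max(x.size_bytes, y.size_bytes)
    size_ratio = 1.0 if larger == 0 else min(x.size_bytes, y.size_bytes) / larger
    return 0.5 * float(ext_x == ext_y) + 0.5 * size_ratio


def similarity_profile(x: FileRecord, y: FileRecord) -> dict:
    """Report each dimension separately instead of collapsing to one score."""
    return {
        "data": data_similarity(x, y),
        "origin": origin_similarity(x, y),
        "characteristics": characteristics_similarity(x, y),
    }
```

Keeping the dimensions separate, rather than folding them into a single number, is a design choice in this sketch: it leaves room for a person in the loop to say which dimensions matter for a given integration or management decision.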


Published in

HILDA '18: Proceedings of the Workshop on Human-In-the-Loop Data Analytics
June 2018
87 pages
ISBN: 9781450358279
DOI: 10.1145/3209900

Copyright © 2018 ACM
Permission to make digital or hard copies of all or part of this work for personal or classroom use is granted without fee provided that copies are not made or distributed for profit or commercial advantage and that copies bear this notice and the full citation on the first page. Copyrights for components of this work owned by others than ACM must be honored. Abstracting with credit is permitted. To copy otherwise, or republish, to post on servers or to redistribute to lists, requires prior specific permission and/or a fee. Request permissions from [email protected]
Publisher
Association for Computing Machinery
New York, NY, United States
Qualifiers
- research-article
- Research
- Refereed limited
Acceptance Rates
Overall Acceptance Rate: 28 of 56 submissions, 50%
Article Metrics
- Total Citations: 16
- Total Downloads: 470
- Downloads (Last 12 months): 42
- Downloads (Last 6 weeks): 4