DOI: 10.1145/3209900.3209911
research-article

Draining the Data Swamp: A Similarity-based Approach

Published: 10 June 2018

ABSTRACT

While hierarchical namespaces such as filesystems and repositories have long been used to organize data, the rapid increase in data production places increasing strain on users who wish to make use of the data. So-called "data lakes" embrace the storage of data in its natural form, integrating and organizing in a pay-as-you-go fashion. While this model defers the upfront cost of integration, the result is that data is unusable for discovery or analysis until it is processed. Thus, data scientists are forced to spend significant time and energy on mundane tasks such as data discovery, cleaning, integration, and management. When these tasks are neglected, "data lakes" become "data swamps."

Prior work suggests that purely computational methods for resolving issues with the data discovery and management components are insufficient. Here, we provide evidence to confirm this hypothesis, showing that methods such as automated file clustering are unable to extract the necessary features from repositories to provide useful information to end-user data scientists, or to make effective data management decisions on their behalf. We argue that a combination of frameworks for specifying file similarity and human-in-the-loop interaction is needed to aid automated organization. We propose an initial step here, classifying several dimensions by which items may be considered similar: the data, its origin, and its current characteristics. We initially consider this model in the context of identifying data that can be integrated or managed collectively. We additionally explore how current methods can be used to automate decision making using real-world data repositories and file systems, and suggest how an online user study could be developed to further validate this hypothesis.
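
To make the three similarity dimensions named in the abstract concrete (the data, its origin, and its current characteristics), the following is a minimal illustrative sketch. It is not the authors' method. The `FileRecord` fields, the per-dimension scoring functions, the weights, and the `triage` thresholds are all assumptions chosen for illustration. The sketch scores a pair of files along each dimension, combines the scores, and defers uncertain pairs to a person, which is one simple way to picture human-in-the-loop organization.

```python
from dataclasses import dataclass
from pathlib import PurePosixPath


@dataclass(frozen=True)
class FileRecord:
    """Hypothetical per-file metadata; every field here is an assumption."""
    path: str
    tokens: frozenset  # data dimension, e.g. column names or words in the file
    owner: str         # origin dimension
    size_bytes: int    # current-characteristics dimension


def jaccard(a, b):
    """Jaccard similarity of two sets; two empty sets count as identical."""
    if not a and not b:
        return 1.0
    return len(a & b) / len(a | b)


def data_similarity(x, y):
    """Overlap of content tokens."""
    return jaccard(x.tokens, y.tokens)


def origin_similarity(x, y):
    """Average of 'same owner' and 'same parent directory' indicators."""
    same_owner = 1.0 if x.owner == y.owner else 0.0
    same_dir = 1.0 if PurePosixPath(x.path).parent == PurePosixPath(y.path).parent else 0.0
    return (same_owner + same_dir) / 2


def characteristics_similarity(x, y):
    """Average of 'same extension' and the smaller-to-larger size ratio."""
    same_ext = 1.0 if PurePosixPath(x.path).suffix == PurePosixPath(y.path).suffix else 0.0
    larger = max(x.size_bytes, y.size_bytes)
    size_ratio = 1.0 if larger == 0 else min(x.size_bytes, y.size_bytes) / larger
    return (same_ext + size_ratio) / 2


def similarity(x, y, weights=(0.5, 0.25, 0.25)):
    """Weighted sum over the three dimensions; the weights are arbitrary."""
    w_data, w_origin, w_chars = weights
    return (w_data * data_similarity(x, y)
            + w_origin * origin_similarity(x, y)
            + w_chars * characteristics_similarity(x, y))


def triage(score, low=0.4, high=0.8):
    """Act automatically only on confident scores; otherwise ask a person."""
    if score >= high:
        return "group"
    if score <= low:
        return "separate"
    return "ask_user"
```

A caller would compute `triage(similarity(a, b))` for candidate pairs and send only the `"ask_user"` cases to a person. How wide that uncertain band should be is exactly the kind of question a user study would have to settle.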

Published in

    HILDA '18: Proceedings of the Workshop on Human-In-the-Loop Data Analytics
    June 2018
    87 pages
    ISBN: 9781450358279
    DOI: 10.1145/3209900

    Copyright © 2018 ACM

    Publisher

    Association for Computing Machinery

    New York, NY, United States

    Qualifiers

    • research-article
    • Research
    • Refereed limited

    Acceptance Rates

    Overall Acceptance Rate: 28 of 56 submissions, 50%
