DOI: 10.1145/2806416.2806429
research-article

Approximated Summarization of Data Provenance

Published: 17 October 2015

ABSTRACT

Many modern applications collect large amounts of data from multiple sources and then aggregate and manipulate it in intricate ways. The complexity of such applications, combined with the size of the collected data, makes it difficult to understand how the resulting information was derived. Data provenance has proven helpful in this respect; however, maintaining and presenting the full and exact provenance information may be infeasible due to its size and complexity. We therefore introduce the notion of approximated summarized provenance, which provides a compact representation of the provenance at the possible cost of information loss. Building on this notion, we present a novel provenance summarization algorithm that takes into account the semantics of the underlying data and the intended use of provenance, and outputs a summary of the input provenance. Experiments measure the conciseness and accuracy of the resulting provenance summaries, as well as the improvement in provenance usage time.
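
To make the idea of trading exactness for compactness concrete, the following is a minimal sketch and not the algorithm presented in the paper. It assumes provenance is stored as a polynomial whose monomials are products of annotations, and that a semantic grouping of annotations is already given. The function `summarize`, the mapping `group_of`, and all annotation names are hypothetical.

```python
from collections import Counter


def summarize(provenance, group_of):
    """Rename each annotation to its group and merge monomials that become equal.

    provenance: Counter mapping a tuple of annotations (a monomial) to its coefficient.
    group_of:   dict mapping an annotation to a coarser group label (assumed given).
    """
    summary = Counter()
    for monomial, coeff in provenance.items():
        # Annotations without a group are kept unchanged.
        merged = tuple(sorted(group_of.get(a, a) for a in monomial))
        summary[merged] += coeff
    return summary


# Hypothetical provenance: each key is a product of annotations,
# each value counts the derivations that use that product.
provenance = Counter({
    ("alice", "matrix"): 1,
    ("bob", "matrix"): 1,
    ("carol", "titanic"): 1,
})

# Hypothetical semantic grouping of the annotations.
group_of = {
    "alice": "Student", "bob": "Student", "carol": "Teacher",
    "matrix": "SciFi", "titanic": "Drama",
}

summary = summarize(provenance, group_of)
print(len(provenance), "monomials summarized into", len(summary))
```

The summary is smaller because the two derivations involving `alice` and `bob` collapse into one monomial, but it no longer records which individual contributed. This is the kind of information loss the abstract refers to.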

        Published in
        CIKM '15: Proceedings of the 24th ACM International on Conference on Information and Knowledge Management
        October 2015
        1998 pages
        ISBN: 9781450337946
        DOI: 10.1145/2806416

        Copyright © 2015 ACM

        Permission to make digital or hard copies of all or part of this work for personal or classroom use is granted without fee provided that copies are not made or distributed for profit or commercial advantage and that copies bear this notice and the full citation on the first page. Copyrights for components of this work owned by others than ACM must be honored. Abstracting with credit is permitted. To copy otherwise, or republish, to post on servers or to redistribute to lists, requires prior specific permission and/or a fee. Request permissions from [email protected]

        Publisher

        Association for Computing Machinery

        New York, NY, United States

        Acceptance Rates

        CIKM '15 Paper Acceptance Rate: 165 of 646 submissions, 26%
        Overall Acceptance Rate: 1,861 of 8,427 submissions, 22%
