ABSTRACT
Many modern applications collect large amounts of data from multiple sources and then aggregate and manipulate it in intricate ways. The complexity of such applications, combined with the size of the collected data, makes it difficult to understand how the resulting information was derived. Data provenance has proven helpful in this respect; however, maintaining and presenting the full and exact provenance information may be infeasible due to its size and complexity. We therefore introduce the notion of approximated summarized provenance, which provides a compact representation of the provenance at the possible cost of information loss. Building on this notion, we present a novel provenance summarization algorithm that takes into account the semantics of the underlying data and the intended use of the provenance when producing a summary of the input provenance. Experiments measure the conciseness and accuracy of the resulting provenance summaries, as well as the improvement in provenance usage time.
Index Terms
- Approximated Summarization of Data Provenance