Dissociation and propagation for approximate lifted inference with standard relational database management systems

Gatterbauer, Wolfgang; Suciu, Dan

doi:10.1007/s00778-016-0434-5

Dissociation and propagation for approximate lifted inference with standard relational database management systems

Special Issue Paper
Published: 16 July 2016

Volume 26, pages 5–30, (2017)
Cite this article

The VLDB Journal Aims and scope Submit manuscript

Wolfgang Gatterbauer¹ &
Dan Suciu²

612 Accesses
7 Citations
Explore all metrics

Abstract

Probabilistic inference over large data sets is a challenging data management problem since exact inference is generally #P-hard and is most often solved approximately with sampling-based methods today. This paper proposes an alternative approach for approximate evaluation of conjunctive queries with standard relational databases: In our approach, every query is evaluated entirely in the database engine by evaluating a fixed number of query plans, each providing an upper bound on the true probability, then taking their minimum. We provide an algorithm that takes into account important schema information to enumerate only the minimal necessary plans among all possible plans. Importantly, this algorithm is a strict generalization of all known PTIME self-join-free conjunctive queries: A query is in PTIME if and only if our algorithm returns one single plan. Furthermore, our approach is a generalization of a family of efficient ranking methods from graphs to hypergraphs. We also adapt three relational query optimization techniques to evaluate all necessary plans very fast. We give a detailed experimental evaluation of our approach and, in the process, provide a new way of thinking about the value of probabilistic methods over non-probabilistic methods for ranking query answers. We also note that the techniques developed in this paper apply immediately to lifted inference from statistical relational models since lifted inference corresponds to PTIME plans in probabilistic databases.

This is a preview of subscription content, log in via an institution to check access.

Access this article

Log in via an institution

Price excludes VAT (USA)
Tax calculation will be finalised during checkout.

Instant access to the full article PDF.

Institutional subscriptions

10 Years of Probabilistic Querying – What Next?

Challenges for Efficient Query Evaluation on Structured Probabilistic Data

Querying and Learning in Probabilistic Databases

Notes

Also see [55] for a related discussion of fact finding algorithms, in which the approach of [78] and its use of the iterative propagation Eq. 1 is referred to as “pseudoprobabilistic”.
W.l.o.g. we assume $\varvec{\mathbf {x}}_i$ to be a tuple of only variables and don’t write the constants. Selections can always be directly pushed into the database before executing the query.
Defined formally as ${\textit{ADom}}_{x_j} = \bigcup _{i: x_j \in {\texttt {Var}}(R_i)} \pi _{x_j}(R_i)$.
Non-hierarchical queries can be in PTIME when considering functional dependencies or deterministic tables [12, 53] (see Sect. 5).
Extensional approaches compute the probability of any formula as a function of the probabilities of its subformulas according to syntactic rules, regardless of how those were derived. Intensional approaches reason in terms of possible worlds and keep track of dependencies [56].
Incidence matrices allow us to compactly reason about two types of relationships between variables and relations of sf-free CQs simultaneously: (i) in a column: a variable that is shared across relations, and (ii) in a row: relations that are joined by a variable. They thus allow us to reason about both the “query hypergraph” and the “dual query hypergraph” at the same time, which is helpful also for other types of problems involving sf-free CQs (see, e.g. [24]).
A conjunctive k-chain query is a query q without self-joins in which each relation is binary, all relations are joined together, and there is no single variable common to more than two relations. Furthermore, the first and last variable are head variables and can be replaced by constants: $q(x_1, x_{k+1}) {\,:\!\!-\,}R_1(x_1, x_2), R_2(x_2, x_3), \ldots , R_k(x_k, x_{k+1})$. The fact that relations are binary entails that the query hypergraph is actually a standard graph. Similarly, the fact that a variable is not common to more than two relations also entails the “dual hypergraph” to be a graph as well. The expression chain query derives from the observation that both its hypergraph and dual hypergraph resemble a simple chain.
Notice that dissociating a table on any head variable has no implication on the probability of a query result as it does not change its lineage. We therefore only focus on dissociating existential variables.
Recall that we say a query is connected if all subgoals are connected by considering only existential variables ${\texttt {EVar}}(q)$. In other words, when computing query components we remove head variables from the query: $q - {\texttt {HVar}}(q)$. An alternative way to write this is to first substitute all head variables by constants $q' = q[\varvec{\mathbf {a}} / \varvec{\mathbf {x}}]$ (here $q[\varvec{\mathbf {a}} / \varvec{\mathbf {x}}]$ denotes the query obtained by substituting each head variable $x_i \in \varvec{\mathbf {x}}$ with the constant $a_i \in \varvec{\mathbf {a}}$), then to let $q_1, \ldots , q_k$ be the components of $q'$ connected by any variable. The query is connected if $k=1$, otherwise it is disconnected, and $\forall i \ne j: {\texttt {Var}}(q_i) \cap {\texttt {Var}}(q_j) \subseteq {\texttt {HVar}}(q)$.
This follows from the recursive definition of the unique safe plan of a query in Lemma 5: the top-most projection consists precisely of its separator variables.
Note that if there are no existential variables ($\varvec{\mathbf {z}} = \varvec{\mathbf {x}}_i$), then there is no need for the projection operator and one could instead simplify to $\mathscr {P} \leftarrow \{R_i(\varvec{\mathbf {z}})\}$, instead of $\mathscr {P} \leftarrow \{\pi ^p_{\varvec{\mathbf {z}}} R_i(\varvec{\mathbf {x}}_i)\}$.
A Boolean conjunctive k-star query is a query with k unary relations and one k-ary relation: $q {\,:\!\!-\,}R_1(x_1), \ldots , R_k(x_k), U(x_1, \ldots , x_k)$. The fact that each variable appears in exactly two relations implies that the dual query hypergraph is actually a standard graph (the dual hypergraph of a query is defined by the relations as vertices and variables as the hyperedges). The expression star query derives from the observation that the query’s dual (hyper)graph resembles a star with the table U connected to all other relations.
E.g., if $\varvec{\mathbf {x}} = \{y \}$ and $\varvec{\Gamma } = \{ x \rightarrow y, y \rightarrow z, z \rightarrow u \}$, then ${\varvec{\mathbf {x}}}^+ = \{y, z, u \}$.
The time needed for the lineage query thus serves as minimum benchmark for any probabilistic approximation. The reported times for SampleSearch and MC are the sum of time for retrieving the lineage plus the actual calculations, without the time for reading and writing the input and output files for SampleSearch.
Results for MC with other parameters of $2 are similar. However, the evaluation time for the experiments becomes quickly infeasible.

References

Amarilli, A., Amsterdamer, Y., Milo, T.: Uncertainty in crowd data sourcing under structural constraints. In: DASFAA Workshops, pp. 351–359 (2014)
Antova, L., Jansen, T., Koch, C., Olteanu, D.: Fast and simple relational processing of uncertain data. In: ICDE, pp. 983–992 (2008)
Antova, L., Koch, C., Olteanu, D.: MayBMS: managing incomplete information with probabilistic world-set decompositions. In: ICDE, pp. 1479–1480 (2007)
Beame, P., Li, J., Roy, S., Suciu, D.: Model counting of query expressions: limitations of propositional methods. In: ICDT, pp. 177–188 (2014)
Bhalotia, G., Hulgeri, A., Nakhe, C., Chakrabarti, S., Sudarshan, S.: Keyword searching and browsing in databases using BANKS. In: ICDE, pp. 431–440 (2002)
Brin, S., Page, L.: The anatomy of a large-scale hypertextual web search engine. Comput. Netw. 30(1–7), 107–117 (1998)
Google Scholar
Carlson, A., Betteridge, J., Kisiel, B., Settles, B., Hruschka Jr., E.R., Mitchell, T.M.: Toward an architecture for never-ending language learning. In: AAAI (2010)
Chen, Y., Wang, D.Z.: Knowledge expansion over probabilistic knowledge bases. In: SIGMOD, pp. 649–660 (2014)
Cohen, W.W.: Data integration using similarity joins and a word-based information representation language. ACM Trans. Inf. Syst. 18(3), 288–321 (2000)
Article MathSciNet Google Scholar
Colbourn, C.J.: The Combinatorics of Network Reliability. Oxford University Press, New York (1987)
Google Scholar
Crestani, F.: Application of spreading activation techniques in information retrieval. Artif. Intell. Rev. 11(6), 453–482 (1997)
Article Google Scholar
Dalvi, N.N., Suciu, D.: Efficient query evaluation on probabilistic databases. VLDB J. 16(4), 523–544 (2007)
Article Google Scholar
Dalvi, N.N., Suciu, D.: The dichotomy of probabilistic inference for unions of conjunctive queries. J. ACM 59(6), 30 (2012)
Article MathSciNet MATH Google Scholar
Davis, M., Putnam, H.: A computing procedure for quantification theory. J. ACM 7(3), 201–215 (1960)
Article MathSciNet MATH Google Scholar
DeepDive: http://deepdive.stanford.edu/
Detwiler, L., Gatterbauer, W., Louie, B., Suciu, D., Tarczy-Hornoch, P.: Integrating and ranking uncertain scientific data. In: ICDE, pp. 1235–1238 (2009)
Domingos, Pedro, Lowd, Daniel: Markov Logic: An Interface Layer for Artificial Intelligence. Morgan & Claypool Publishers, San Rafael (2009)
MATH Google Scholar
Dong, X., Gabrilovich, E., Heitz, G., Horn, W., Lao, N., Murphy, K., Strohmann, T., Sun, S., Zhang, W.: Knowledge vault: a web-scale approach to probabilistic knowledge fusion. In: KDD, pp. 601–610 (2014)
Dylla, M., Miliaraki, I., Theobald, M.: Top-k query processing in probabilistic databases with non-materialized views. In: ICDE, pp. 122–133 (2013)
Ermis, B., Bouchard, G.: Iterative splits of quadratic bounds for scalable binary tensor factorization. In: UAI, pp. 192–199 (2014)
Fink, R., Huang, J., Olteanu, D.: Anytime approximation in probabilistic databases. VLDB J. 22(6), 823–848 (2013)
Article Google Scholar
Fink, R., Olteanu, D.: On the optimal approximation of queries using tractable propositional languages. In: ICDT, pp. 174–185 (2011)
Fink, R., Olteanu, D.: A dichotomy for non-repeating queries with negation in probabilistic databases. In: PODS, pp. 144–155 (2014)
Freire, C., Gatterbauer, W., Immerman, N., Meliou, A.: The complexity of resilience and responsibility for self-join-free conjunctive queries. PVLDB 9(3), 180–191 (2015)
Google Scholar
Fuhr, N., Rölleke, T.: A probabilistic relational algebra for the integration of information retrieval and database systems. ACM Trans. Inf. Syst. 15(1), 32–66 (1997)
Article Google Scholar
Gatterbauer, W., Günnemann, S., Koutra, D., Faloutsos, C.: Linearized and single-pass belief propagation. PVLDB 8(5), 581–592 (2015)
Google Scholar
Gatterbauer, W., Jha, A.K., Suciu, D.: Dissociation and propagation for efficient query evaluation over probabilistic databases. In: Proceedings of 4th International VLDB workshop on Management of Uncertain Data (MUD), pp. 83–97 (2010)
Gatterbauer, W., Suciu, V.: Dissociation and propagation for approximate lifted inference with standard relational database management systems (2013). arXiv:1310.6257 [cs.DB]
Gatterbauer, W., Suciu, D.: Oblivious bounds on the probability of Boolean functions. ACM Trans. Database Syst. (TODS) 39(1), 5 (2014)
Article MathSciNet MATH Google Scholar
Gatterbauer, W., Suciu, D.: Approximate lifted inference with probabilistic databases. PVLDB 8(5), 629–640 (2015)
Google Scholar
Gogate, V., Dechter, R.: SampleSearch: importance sampling in presence of determinism. Artif. Intell. 175(2), 694–729 (2011)
Article MathSciNet MATH Google Scholar
Gogate, V., Domingos, P.: Formula-based probabilistic inference. In: UAI, pp. 210–219 (2010)
Gogate, V., Domingos, P.: Probabilistic theorem proving. In: UAI, pp. 256–265 (2011)
Gomes, C.P., Sabharwal, A., Selman, B.: Model counting. In: Handbook of Satisfiability, pp. 633–654 (2009)
Goyal, A., Bonchi, F., Lakshmanan, L.V.S.: Learning influence probabilities in social networks. In: WSDM, pp. 241–250 (2010)
Grädel, E., Gurevich, Y., Hirsch, C.: The complexity of query reliability. In: PODS, pp. 227–234 (1998)
Gribkoff, E., Suciu, D.: Slimshot: in-database probabilistic inference for knowledge bases. PVLDB 9(7), 552–563 (2016)
Google Scholar
Guha, R.V., Kumar, R., Raghavan, P., Tomkins, A.: Propagation of trust and distrust. In: WWW, pp. 403–412 (2004)
Hoffart, J., Suchanek, F.M., Berberich, K., Weikum, G.: Yago2: a spatially and temporally enhanced knowledge base from wikipedia. Artif. Intell. 194, 28–61 (2013)
Article MathSciNet MATH Google Scholar
Jaeger, M., Van den Broeck, G.: Liftability of probabilistic inference: upper and lower bounds. In: StaRAI (2012)
Jampani, R., Xu, F., Wu, M., Perez, L.L., Jermaine, C.M., Haas, P.J.: MCDB: a Monte Carlo approach to managing uncertain data. In: SIGMOD, pp. 687–700 (2008)
Jha, A., Olteanu, D., Suciu, D.: Bridging the gap between intensional and extensional query evaluation in probabilistic databases. In: EDBT, pp. 323–334 (2010)
Jha, A., Suciu, D.: Probabilistic databases with MarkoViews. PVLDB 5(11), 1160–1171 (2012)
Google Scholar
Joshi, S., Jermaine, C.M.: Sampling-based estimators for subset-based queries. VLDB J. 18(1), 181–202 (2009)
Article Google Scholar
Kennedy, O., Koch, C.: PIP: a database system for great and small expectations. In: ICDE, pp. 157–168 (2010)
Manning, C.D., Raghavan, P., Schütze, H.: Introduction to Information Retrieval. Cambridge University Press, New York (2008)
Book MATH Google Scholar
McSherry, F., Najork, M.: Computing information retrieval performance measures efficiently in the presence of tied scores. In: ECIR, pp. 414–421 (2008)
Microsoft SQL Server 2012. http://www.microsoft.com/sqlserver
Moerkotte, G.: Building query compilers. Draft version 03 Mar 2009
Niu, F., Ré, C., Doan, A., Shavlik, J.W.: Tuffy: scaling up statistical inference in markov logic networks using an RDBMS. PVLDB 4(6), 373–384 (2011)
Google Scholar
OEIS: The on-line encyclopedia of integer sequences: http://oeis.org/
Olteanu, D., Huang, J.: Using OBDDs for efficient query evaluation on probabilistic databases. In: SUM, pp. 326–340 (2008)
Olteanu, D., Huang, J., Koch, C.: Sprout: lazy vs. eager query plans for tuple-independent probabilistic databases. In: ICDE, pp. 640–651 (2009)
Olteanu, D., Huang, J., Koch, C.: Approximate confidence computation in probabilistic databases. In: ICDE, pp. 145–156 (2010)
Pasternack, J., Roth, D.: Knowing what to believe (when you already know something). In: COLING, pp. 877–885 (2010)
Pearl, J.: Probabilistic Reasoning in Intelligent Systems: Networks of Plausible Inference. Morgan Kaufmann Publishers, San Mateo (1988)
MATH Google Scholar
Poole, D.: First-order probabilistic inference. In: IJCAI, pp. 985–991 (2003)
PostgreSQL 9.2. http://www.postgresql.org/download/
Quillian, M.R.: Semantic memory. In: Semantic Information Processing, pp. 227–270. MIT Press (1968)
Raghunathan, R., De, S., Kambhampati, S.: Bayesian networks for supporting query processing over incomplete autonomous databases. J. Intell. Inf. Syst. 42(3), 595–618 (2014)
Article Google Scholar
Ré, C., Dalvi, N.N., Suciu, D.: Query evaluation on probabilistic databases. IEEE Data Eng. Bull. 29(1), 25–31 (2006)
Google Scholar
Ré, C., Dalvi, N.N., Suciu, D.: Efficient top-k query evaluation on probabilistic data. In: ICDE, pp. 886–895 (2007)
Ré, C., Suciu, D.: Approximate lineage for probabilistic databases. PVLDB 1(1), 797–808 (2008)
Google Scholar
Roy, S., Perduca, V., Tannen, V.: Faster query answering in probabilistic databases using read-once functions. In: ICDT, pp. 232–243 (2011)
Rumelhart, D.E., Hinton, G.E.,Williams, R.J.: Learning internal representations by error propagation. In: Parallel distributed processing: explorations in the microstructure of cognition, vol. 1, pp 318–362. MIT Press (1986)
Selinger, P.G., Astrahan, M.M., Chamberlin, D.D., Lorie, R.A., Price, T.G.: Access path selection in a relational database management system. In: SIGMOD, pp. 23–34 (1979)
Sen, P., Deshpande, A.: Representing and querying correlated tuples in probabilistic databases. In: ICDE, pp. 596–605 (2007)
Sen, P., Deshpande, A., Getoor, L.: Read-once functions and query evaluation in probabilistic databases. PVLDB 3(1), 1068–1079 (2010)
Google Scholar
Singh, A.P., Gordon, G.J.: Relational learning via collective matrix factorization. In: KDD, pp. 650–658 (2008)
Stoyanovich, J., Davidson, S.B., Milo, T., Tannen, V.: Deriving probabilistic databases with inference ensembles. In: ICDE, pp. 303–314 (2011)
TPC-H Benchmark. http://www.tpc.org/tpch/
Van den Broeck, G., Choi, A., Darwiche, A.: Lifted relax, compensate and then recover: from approximate to exact lifted probabilistic inference. In: UAI, pp. 131–141 (2012)
Van den Broeck, G., Meert, W., Darwiche, A.: Skolemization for weighted first-order model counting. In: KR (2014)
Van den Broeck, G., Suciu, D.: Lifted probabilistic inference in relational models. In: UAI tutorials (2014)
Van den Broeck, G., Taghipour, N., Meert, W., Davis, J., De Raedt, L.: Lifted probabilistic inference by first-order knowledge compilation. In: IJCAI, pp. 2178–2185 (2011)
Vardi, M.Y.: The complexity of relational query languages (extended abstract). In: STOC, pp. 137–146 (1982)
Weston, J., Elisseeff, A., Zhou, D., Leslie, C.S., Noble, W.S.: Protein ranking: from local to global structure in the protein similarity network. Proc Natl Acad Sci USA 101(17), 6559–6563 (2004)
Article Google Scholar
Yin, X., Han, J., Philip, S.Y.: Truth discovery with multiple conflicting information providers on the web. IEEE Trans. Knowl. Data Eng. 20(6), 796–808 (2008)
Article Google Scholar
Zeng, K., Gao, S., Mozafari, B., Zaniolo, C.: The analytical bootstrap: a new method for fast error estimation in approximate query processing. In: SIGMOD, pp. 277–288 (2014)
Zhang, C., Ré, C.: Towards high-throughput Gibbs sampling at scale: a study across storage managers. In: SIGMOD, pp. 397–408 (2013)

Download references

Acknowledgments

This work was supported in part by NSF Grants IIS-0513877, IIS-0713576, IIS-0915054, IIS-1115188, IIS-1247469, and CAREER IIS-1553547. We like to thank Abhay Jha for help with the experiments in the workshop version of this paper, Alexandra Meliou for suggesting the name “dissociation”, and Vibhav Gogate for guidance in using his tool SampleSearch. WG would also like to thank Manfred Hauswirth for a small comment in 2007 that was crucial for the development of the ideas in this paper.

Author information

Authors and Affiliations

Tepper School of Business, Carnegie Mellon University, Pittsburgh, PA, USA
Wolfgang Gatterbauer
Department of Computer Science and Engineering, University of Washington, Seattle, WA, USA
Dan Suciu

Authors

Wolfgang Gatterbauer
View author publications
You can also search for this author in PubMed Google Scholar
Dan Suciu
View author publications
You can also search for this author in PubMed Google Scholar

Corresponding author

Correspondence to Wolfgang Gatterbauer.

Electronic supplementary material

Below is the link to the electronic supplementary material.

Supplementary material 1 (pdf 644 KB)

Rights and permissions

Reprints and permissions

About this article

Cite this article

Gatterbauer, W., Suciu, D. Dissociation and propagation for approximate lifted inference with standard relational database management systems. The VLDB Journal 26, 5–30 (2017). https://doi.org/10.1007/s00778-016-0434-5

Download citation

Received: 31 December 2015
Revised: 25 April 2016
Accepted: 13 June 2016
Published: 16 July 2016
Issue Date: February 2017
DOI: https://doi.org/10.1007/s00778-016-0434-5

Keywords

Access this article

Log in via an institution

Price excludes VAT (USA)
Tax calculation will be finalised during checkout.

Instant access to the full article PDF.

Institutional subscriptions

Dissociation and propagation for approximate lifted inference with standard relational database management systems

Abstract

Access this article

Similar content being viewed by others

10 Years of Probabilistic Querying – What Next?

Challenges for Efficient Query Evaluation on Structured Probabilistic Data

Querying and Learning in Probabilistic Databases

Notes

References

Acknowledgments

Author information

Authors and Affiliations

Corresponding author

Electronic supplementary material

Supplementary material 1 (pdf 644 KB)

Rights and permissions

About this article

Cite this article

Keywords

Navigation

Dissociation and propagation for approximate lifted inference with standard relational database management systems

Abstract

Access this article

Similar content being viewed by others

10 Years of Probabilistic Querying – What Next?

Challenges for Efficient Query Evaluation on Structured Probabilistic Data

Querying and Learning in Probabilistic Databases

Notes

References

Acknowledgments

Author information

Authors and Affiliations

Corresponding author

Electronic supplementary material

Supplementary material 1 (pdf 644 KB)

Rights and permissions

About this article

Cite this article

Share this article

Keywords

Search

Navigation