Dissociation and propagation for approximate lifted inference with standard relational database management systems

  • Special Issue Paper
  • The VLDB Journal

Abstract

Probabilistic inference over large data sets is a challenging data management problem: exact inference is generally #P-hard, and today it is most often approximated with sampling-based methods. This paper proposes an alternative approach for approximate evaluation of conjunctive queries with standard relational databases: In our approach, every query is evaluated entirely in the database engine by evaluating a fixed number of query plans, each providing an upper bound on the true probability, then taking their minimum. We provide an algorithm that takes into account important schema information to enumerate only the minimal necessary plans among all possible plans. Importantly, this algorithm is a strict generalization of all known results on PTIME self-join-free conjunctive queries: A query is in PTIME if and only if our algorithm returns one single plan. Furthermore, our approach is a generalization of a family of efficient ranking methods from graphs to hypergraphs. We also adapt three relational query optimization techniques to evaluate all necessary plans very fast. We give a detailed experimental evaluation of our approach and, in the process, provide a new way of thinking about the value of probabilistic methods over non-probabilistic methods for ranking query answers. We also note that the techniques developed in this paper apply immediately to lifted inference from statistical relational models since lifted inference corresponds to PTIME plans in probabilistic databases.
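
The idea of evaluating several plans and taking their minimum can be illustrated with a minimal sketch. It assumes the Boolean query q :- R(x), S(x, y), T(y) over a tiny tuple-independent database; the relation contents, the probabilities, and all function names are hypothetical and chosen only for illustration. The sketch evaluates the two plans by hand and compares them against a brute-force computation over all possible worlds; it does not implement the plan enumeration algorithm of the paper.

```python
from itertools import product

# Hypothetical tuple-independent relations: key -> marginal probability.
R = {1: 0.9, 2: 0.4}
S = {(1, 1): 0.8, (1, 2): 0.3, (2, 1): 0.6}
T = {1: 0.5, 2: 0.7}


def independent_or(probs):
    """Probability that at least one of several independent events holds."""
    none_holds = 1.0
    for p in probs:
        none_holds *= 1.0 - p
    return 1.0 - none_holds


def plan_project_y_first(R, S, T):
    """Join S with T, project away y per x, join with R, project away x.

    Each T-tuple is treated as an independent copy per x value.
    """
    per_x = []
    for x, r in R.items():
        inner = independent_or(
            s * T[y] for (x2, y), s in S.items() if x2 == x and y in T
        )
        per_x.append(r * inner)
    return independent_or(per_x)


def plan_project_x_first(R, S, T):
    """Symmetric plan: each R-tuple is treated as an independent copy per y."""
    per_y = []
    for y, t in T.items():
        inner = independent_or(
            s * R[x] for (x, y2), s in S.items() if y2 == y and x in R
        )
        per_y.append(t * inner)
    return independent_or(per_y)


def exact_probability(R, S, T):
    """Brute force over all possible worlds (feasible only for tiny inputs)."""
    facts = [("R", k) for k in R] + [("S", k) for k in S] + [("T", k) for k in T]
    probs = [R[k] for k in R] + [S[k] for k in S] + [T[k] for k in T]
    total = 0.0
    for world in product([False, True], repeat=len(facts)):
        present = dict(zip(facts, world))
        weight = 1.0
        for p, here in zip(probs, world):
            weight *= p if here else 1.0 - p
        if any(
            present[("R", x)] and present[("S", (x, y))] and present[("T", y)]
            for (x, y) in S
            if x in R and y in T
        ):
            total += weight
    return total


bounds = [plan_project_y_first(R, S, T), plan_project_x_first(R, S, T)]
print("plan scores:", bounds)
print("min of plans:", min(bounds))
print("exact:", exact_probability(R, S, T))
```

Both plan scores are at least as large as the exact probability, so their minimum is the tighter of the two upper bounds. Each plan uses only joins, grouping, and products, which is why such plans can be pushed into a standard relational engine.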

Notes

  1. Also see [55] for a related discussion of fact finding algorithms, in which the approach of [78] and its use of the iterative propagation Eq. 1 is referred to as “pseudoprobabilistic”.

  2. W.l.o.g. we assume \(\mathbf{x}_i\) to be a tuple of only variables and do not write the constants. Selections can always be pushed directly into the database before executing the query.

  3. Defined formally as \({\textit{ADom}}_{x_j} = \bigcup _{i: x_j \in {\texttt {Var}}(R_i)} \pi _{x_j}(R_i)\).

  4. Non-hierarchical queries can be in PTIME when considering functional dependencies or deterministic tables [12, 53] (see Sect. 5).

  5. Extensional approaches compute the probability of any formula as a function of the probabilities of its subformulas according to syntactic rules, regardless of how those were derived. Intensional approaches reason in terms of possible worlds and keep track of dependencies [56].

  6. Incidence matrices allow us to compactly reason about two types of relationships between variables and relations of sf-free CQs simultaneously: (i) in a column: a variable that is shared across relations, and (ii) in a row: relations that are joined by a variable. They thus allow us to reason about both the “query hypergraph” and the “dual query hypergraph” at the same time, which is helpful also for other types of problems involving sf-free CQs (see, e.g. [24]).

  7. A conjunctive k-chain query is a query q without self-joins in which each relation is binary, all relations are joined together, and there is no single variable common to more than two relations. Furthermore, the first and last variables are head variables and can be replaced by constants: \(q(x_1, x_{k+1}) {\,:\!\!-\,}R_1(x_1, x_2), R_2(x_2, x_3), \ldots , R_k(x_k, x_{k+1})\). The fact that relations are binary entails that the query hypergraph is actually a standard graph. Similarly, the fact that a variable is not common to more than two relations entails that the “dual hypergraph” is a graph as well. The expression chain query derives from the observation that both its hypergraph and dual hypergraph resemble a simple chain.

  8. Notice that dissociating a table on any head variable has no effect on the probability of a query result as it does not change its lineage. We therefore focus only on dissociating existential variables.

  9. Recall that we say a query is connected if all subgoals are connected by considering only existential variables \({\texttt {EVar}}(q)\). In other words, when computing query components we remove head variables from the query: \(q - {\texttt {HVar}}(q)\). An alternative way to write this is to first substitute all head variables by constants \(q' = q[\mathbf{a} / \mathbf{x}]\) (here \(q[\mathbf{a} / \mathbf{x}]\) denotes the query obtained by substituting each head variable \(x_i \in \mathbf{x}\) with the constant \(a_i \in \mathbf{a}\)), then to let \(q_1, \ldots , q_k\) be the components of \(q'\) connected by any variable. The query is connected if \(k=1\), otherwise it is disconnected, and \(\forall i \ne j: {\texttt {Var}}(q_i) \cap {\texttt {Var}}(q_j) \subseteq {\texttt {HVar}}(q)\).

  10. This follows from the recursive definition of the unique safe plan of a query in Lemma 5: the top-most projection consists precisely of its separator variables.

  11. Note that if there are no existential variables (\(\mathbf{z} = \mathbf{x}_i\)), then there is no need for the projection operator and one can simplify \(\mathscr {P} \leftarrow \{\pi ^p_{\mathbf{z}} R_i(\mathbf{x}_i)\}\) to \(\mathscr {P} \leftarrow \{R_i(\mathbf{z})\}\).

  12. A Boolean conjunctive k-star query is a query with k unary relations and one k-ary relation: \(q {\,:\!\!-\,}R_1(x_1), \ldots , R_k(x_k), U(x_1, \ldots , x_k)\). The fact that each variable appears in exactly two relations implies that the dual query hypergraph is actually a standard graph (the dual hypergraph of a query is defined by the relations as vertices and variables as the hyperedges). The expression star query derives from the observation that the query’s dual (hyper)graph resembles a star with the table U connected to all other relations.

  13. E.g., if \(\mathbf{x} = \{y\}\) and \(\boldsymbol{\Gamma } = \{ x \rightarrow y, y \rightarrow z, z \rightarrow u \}\), then \(\mathbf{x}^+ = \{y, z, u\}\).

  14. The time needed for the lineage query thus serves as a minimum benchmark for any probabilistic approximation. The reported times for SampleSearch and MC are the sum of the time for retrieving the lineage plus the actual calculations, without the time for reading and writing the input and output files for SampleSearch.

  15. Results for MC with other parameters of $2 are similar. However, the evaluation time for the experiments quickly becomes infeasible.
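
The closure in Note 13 can be reproduced with a short fixpoint computation. This is a minimal sketch of the textbook attribute-closure procedure under functional dependencies; the function name `closure` and the encoding of dependencies as pairs of sets are assumptions made for illustration, not notation from the paper.

```python
def closure(attributes, dependencies):
    """Return the closure of a set of attributes under functional dependencies.

    `dependencies` is an iterable of (lhs, rhs) pairs, each a set of names.
    """
    closed = set(attributes)
    changed = True
    while changed:
        changed = False
        for lhs, rhs in dependencies:
            # Apply lhs -> rhs when lhs is already covered and rhs adds something.
            if set(lhs) <= closed and not set(rhs) <= closed:
                closed |= set(rhs)
                changed = True
    return closed


# The dependencies from Note 13: x -> y, y -> z, z -> u.
gamma = [({"x"}, {"y"}), ({"y"}, {"z"}), ({"z"}, {"u"})]
print(sorted(closure({"y"}, gamma)))
```

Starting from {y}, the dependency x → y never fires because x is not in the set, which is why x does not appear in the closure.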

References

  1. Amarilli, A., Amsterdamer, Y., Milo, T.: Uncertainty in crowd data sourcing under structural constraints. In: DASFAA Workshops, pp. 351–359 (2014)

  2. Antova, L., Jansen, T., Koch, C., Olteanu, D.: Fast and simple relational processing of uncertain data. In: ICDE, pp. 983–992 (2008)

  3. Antova, L., Koch, C., Olteanu, D.: MayBMS: managing incomplete information with probabilistic world-set decompositions. In: ICDE, pp. 1479–1480 (2007)

  4. Beame, P., Li, J., Roy, S., Suciu, D.: Model counting of query expressions: limitations of propositional methods. In: ICDT, pp. 177–188 (2014)

  5. Bhalotia, G., Hulgeri, A., Nakhe, C., Chakrabarti, S., Sudarshan, S.: Keyword searching and browsing in databases using BANKS. In: ICDE, pp. 431–440 (2002)

  6. Brin, S., Page, L.: The anatomy of a large-scale hypertextual web search engine. Comput. Netw. 30(1–7), 107–117 (1998)

  7. Carlson, A., Betteridge, J., Kisiel, B., Settles, B., Hruschka Jr., E.R., Mitchell, T.M.: Toward an architecture for never-ending language learning. In: AAAI (2010)

  8. Chen, Y., Wang, D.Z.: Knowledge expansion over probabilistic knowledge bases. In: SIGMOD, pp. 649–660 (2014)

  9. Cohen, W.W.: Data integration using similarity joins and a word-based information representation language. ACM Trans. Inf. Syst. 18(3), 288–321 (2000)

  10. Colbourn, C.J.: The Combinatorics of Network Reliability. Oxford University Press, New York (1987)

  11. Crestani, F.: Application of spreading activation techniques in information retrieval. Artif. Intell. Rev. 11(6), 453–482 (1997)

  12. Dalvi, N.N., Suciu, D.: Efficient query evaluation on probabilistic databases. VLDB J. 16(4), 523–544 (2007)

  13. Dalvi, N.N., Suciu, D.: The dichotomy of probabilistic inference for unions of conjunctive queries. J. ACM 59(6), 30 (2012)

  14. Davis, M., Putnam, H.: A computing procedure for quantification theory. J. ACM 7(3), 201–215 (1960)

  15. DeepDive: http://deepdive.stanford.edu/

  16. Detwiler, L., Gatterbauer, W., Louie, B., Suciu, D., Tarczy-Hornoch, P.: Integrating and ranking uncertain scientific data. In: ICDE, pp. 1235–1238 (2009)

  17. Domingos, P., Lowd, D.: Markov Logic: An Interface Layer for Artificial Intelligence. Morgan & Claypool Publishers, San Rafael (2009)

  18. Dong, X., Gabrilovich, E., Heitz, G., Horn, W., Lao, N., Murphy, K., Strohmann, T., Sun, S., Zhang, W.: Knowledge vault: a web-scale approach to probabilistic knowledge fusion. In: KDD, pp. 601–610 (2014)

  19. Dylla, M., Miliaraki, I., Theobald, M.: Top-k query processing in probabilistic databases with non-materialized views. In: ICDE, pp. 122–133 (2013)

  20. Ermis, B., Bouchard, G.: Iterative splits of quadratic bounds for scalable binary tensor factorization. In: UAI, pp. 192–199 (2014)

  21. Fink, R., Huang, J., Olteanu, D.: Anytime approximation in probabilistic databases. VLDB J. 22(6), 823–848 (2013)

  22. Fink, R., Olteanu, D.: On the optimal approximation of queries using tractable propositional languages. In: ICDT, pp. 174–185 (2011)

  23. Fink, R., Olteanu, D.: A dichotomy for non-repeating queries with negation in probabilistic databases. In: PODS, pp. 144–155 (2014)

  24. Freire, C., Gatterbauer, W., Immerman, N., Meliou, A.: The complexity of resilience and responsibility for self-join-free conjunctive queries. PVLDB 9(3), 180–191 (2015)

  25. Fuhr, N., Rölleke, T.: A probabilistic relational algebra for the integration of information retrieval and database systems. ACM Trans. Inf. Syst. 15(1), 32–66 (1997)

  26. Gatterbauer, W., Günnemann, S., Koutra, D., Faloutsos, C.: Linearized and single-pass belief propagation. PVLDB 8(5), 581–592 (2015)

  27. Gatterbauer, W., Jha, A.K., Suciu, D.: Dissociation and propagation for efficient query evaluation over probabilistic databases. In: Proceedings of 4th International VLDB workshop on Management of Uncertain Data (MUD), pp. 83–97 (2010)

  28. Gatterbauer, W., Suciu, D.: Dissociation and propagation for approximate lifted inference with standard relational database management systems (2013). arXiv:1310.6257 [cs.DB]

  29. Gatterbauer, W., Suciu, D.: Oblivious bounds on the probability of Boolean functions. ACM Trans. Database Syst. (TODS) 39(1), 5 (2014)

  30. Gatterbauer, W., Suciu, D.: Approximate lifted inference with probabilistic databases. PVLDB 8(5), 629–640 (2015)

  31. Gogate, V., Dechter, R.: SampleSearch: importance sampling in presence of determinism. Artif. Intell. 175(2), 694–729 (2011)

  32. Gogate, V., Domingos, P.: Formula-based probabilistic inference. In: UAI, pp. 210–219 (2010)

  33. Gogate, V., Domingos, P.: Probabilistic theorem proving. In: UAI, pp. 256–265 (2011)

  34. Gomes, C.P., Sabharwal, A., Selman, B.: Model counting. In: Handbook of Satisfiability, pp. 633–654 (2009)

  35. Goyal, A., Bonchi, F., Lakshmanan, L.V.S.: Learning influence probabilities in social networks. In: WSDM, pp. 241–250 (2010)

  36. Grädel, E., Gurevich, Y., Hirsch, C.: The complexity of query reliability. In: PODS, pp. 227–234 (1998)

  37. Gribkoff, E., Suciu, D.: Slimshot: in-database probabilistic inference for knowledge bases. PVLDB 9(7), 552–563 (2016)

  38. Guha, R.V., Kumar, R., Raghavan, P., Tomkins, A.: Propagation of trust and distrust. In: WWW, pp. 403–412 (2004)

  39. Hoffart, J., Suchanek, F.M., Berberich, K., Weikum, G.: Yago2: a spatially and temporally enhanced knowledge base from wikipedia. Artif. Intell. 194, 28–61 (2013)

  40. Jaeger, M., Van den Broeck, G.: Liftability of probabilistic inference: upper and lower bounds. In: StaRAI (2012)

  41. Jampani, R., Xu, F., Wu, M., Perez, L.L., Jermaine, C.M., Haas, P.J.: MCDB: a Monte Carlo approach to managing uncertain data. In: SIGMOD, pp. 687–700 (2008)

  42. Jha, A., Olteanu, D., Suciu, D.: Bridging the gap between intensional and extensional query evaluation in probabilistic databases. In: EDBT, pp. 323–334 (2010)

  43. Jha, A., Suciu, D.: Probabilistic databases with MarkoViews. PVLDB 5(11), 1160–1171 (2012)

  44. Joshi, S., Jermaine, C.M.: Sampling-based estimators for subset-based queries. VLDB J. 18(1), 181–202 (2009)

  45. Kennedy, O., Koch, C.: PIP: a database system for great and small expectations. In: ICDE, pp. 157–168 (2010)

  46. Manning, C.D., Raghavan, P., Schütze, H.: Introduction to Information Retrieval. Cambridge University Press, New York (2008)

  47. McSherry, F., Najork, M.: Computing information retrieval performance measures efficiently in the presence of tied scores. In: ECIR, pp. 414–421 (2008)

  48. Microsoft SQL Server 2012. http://www.microsoft.com/sqlserver

  49. Moerkotte, G.: Building query compilers. Draft version 03 Mar 2009

  50. Niu, F., Ré, C., Doan, A., Shavlik, J.W.: Tuffy: scaling up statistical inference in markov logic networks using an RDBMS. PVLDB 4(6), 373–384 (2011)

  51. OEIS: The on-line encyclopedia of integer sequences: http://oeis.org/

  52. Olteanu, D., Huang, J.: Using OBDDs for efficient query evaluation on probabilistic databases. In: SUM, pp. 326–340 (2008)

  53. Olteanu, D., Huang, J., Koch, C.: Sprout: lazy vs. eager query plans for tuple-independent probabilistic databases. In: ICDE, pp. 640–651 (2009)

  54. Olteanu, D., Huang, J., Koch, C.: Approximate confidence computation in probabilistic databases. In: ICDE, pp. 145–156 (2010)

  55. Pasternack, J., Roth, D.: Knowing what to believe (when you already know something). In: COLING, pp. 877–885 (2010)

  56. Pearl, J.: Probabilistic Reasoning in Intelligent Systems: Networks of Plausible Inference. Morgan Kaufmann Publishers, San Mateo (1988)

  57. Poole, D.: First-order probabilistic inference. In: IJCAI, pp. 985–991 (2003)

  58. PostgreSQL 9.2. http://www.postgresql.org/download/

  59. Quillian, M.R.: Semantic memory. In: Semantic Information Processing, pp. 227–270. MIT Press (1968)

  60. Raghunathan, R., De, S., Kambhampati, S.: Bayesian networks for supporting query processing over incomplete autonomous databases. J. Intell. Inf. Syst. 42(3), 595–618 (2014)

  61. Ré, C., Dalvi, N.N., Suciu, D.: Query evaluation on probabilistic databases. IEEE Data Eng. Bull. 29(1), 25–31 (2006)

  62. Ré, C., Dalvi, N.N., Suciu, D.: Efficient top-k query evaluation on probabilistic data. In: ICDE, pp. 886–895 (2007)

  63. Ré, C., Suciu, D.: Approximate lineage for probabilistic databases. PVLDB 1(1), 797–808 (2008)

  64. Roy, S., Perduca, V., Tannen, V.: Faster query answering in probabilistic databases using read-once functions. In: ICDT, pp. 232–243 (2011)

  65. Rumelhart, D.E., Hinton, G.E., Williams, R.J.: Learning internal representations by error propagation. In: Parallel distributed processing: explorations in the microstructure of cognition, vol. 1, pp. 318–362. MIT Press (1986)

  66. Selinger, P.G., Astrahan, M.M., Chamberlin, D.D., Lorie, R.A., Price, T.G.: Access path selection in a relational database management system. In: SIGMOD, pp. 23–34 (1979)

  67. Sen, P., Deshpande, A.: Representing and querying correlated tuples in probabilistic databases. In: ICDE, pp. 596–605 (2007)

  68. Sen, P., Deshpande, A., Getoor, L.: Read-once functions and query evaluation in probabilistic databases. PVLDB 3(1), 1068–1079 (2010)

  69. Singh, A.P., Gordon, G.J.: Relational learning via collective matrix factorization. In: KDD, pp. 650–658 (2008)

  70. Stoyanovich, J., Davidson, S.B., Milo, T., Tannen, V.: Deriving probabilistic databases with inference ensembles. In: ICDE, pp. 303–314 (2011)

  71. TPC-H Benchmark. http://www.tpc.org/tpch/

  72. Van den Broeck, G., Choi, A., Darwiche, A.: Lifted relax, compensate and then recover: from approximate to exact lifted probabilistic inference. In: UAI, pp. 131–141 (2012)

  73. Van den Broeck, G., Meert, W., Darwiche, A.: Skolemization for weighted first-order model counting. In: KR (2014)

  74. Van den Broeck, G., Suciu, D.: Lifted probabilistic inference in relational models. In: UAI tutorials (2014)

  75. Van den Broeck, G., Taghipour, N., Meert, W., Davis, J., De Raedt, L.: Lifted probabilistic inference by first-order knowledge compilation. In: IJCAI, pp. 2178–2185 (2011)

  76. Vardi, M.Y.: The complexity of relational query languages (extended abstract). In: STOC, pp. 137–146 (1982)

  77. Weston, J., Elisseeff, A., Zhou, D., Leslie, C.S., Noble, W.S.: Protein ranking: from local to global structure in the protein similarity network. Proc. Natl. Acad. Sci. USA 101(17), 6559–6563 (2004)

  78. Yin, X., Han, J., Yu, P.S.: Truth discovery with multiple conflicting information providers on the web. IEEE Trans. Knowl. Data Eng. 20(6), 796–808 (2008)

  79. Zeng, K., Gao, S., Mozafari, B., Zaniolo, C.: The analytical bootstrap: a new method for fast error estimation in approximate query processing. In: SIGMOD, pp. 277–288 (2014)

  80. Zhang, C., Ré, C.: Towards high-throughput Gibbs sampling at scale: a study across storage managers. In: SIGMOD, pp. 397–408 (2013)

Acknowledgments

This work was supported in part by NSF Grants IIS-0513877, IIS-0713576, IIS-0915054, IIS-1115188, IIS-1247469, and CAREER IIS-1553547. We would like to thank Abhay Jha for help with the experiments in the workshop version of this paper, Alexandra Meliou for suggesting the name “dissociation”, and Vibhav Gogate for guidance in using his tool SampleSearch. WG would also like to thank Manfred Hauswirth for a small comment in 2007 that was crucial for the development of the ideas in this paper.

Author information

Corresponding author

Correspondence to Wolfgang Gatterbauer.

Electronic supplementary material

Below is the link to the electronic supplementary material.

Supplementary material 1 (pdf 644 KB)

About this article

Cite this article

Gatterbauer, W., Suciu, D. Dissociation and propagation for approximate lifted inference with standard relational database management systems. The VLDB Journal 26, 5–30 (2017). https://doi.org/10.1007/s00778-016-0434-5

  • DOI: https://doi.org/10.1007/s00778-016-0434-5
