Skip to main content
Log in

Query reverse engineering

  • Regular Paper
  • Published:
The VLDB Journal Aims and scope Submit manuscript

Abstract

In this paper, we introduce a new problem termed query reverse engineering (QRE). Given a database \(D\) and a result table \(T\)—the output of some known or unknown query \(Q\) on \(D\)—the goal of QRE is to reverse-engineer a query \(Q'\) such that the output of query \(Q'\) on database \(D\) (denoted by \(Q'(D)\)) is equal to \(T\) (i.e., \(Q(D)\)). The QRE problem has useful applications in database usability, data analysis, and data security. In this work, we propose a data-driven approach, TALOS for Tree-based classifier with At Least One Semantics, that is based on a novel dynamic data classification formulation and extend the approach to efficiently support the three key dimensions of the QRE problem: whether the input query is known/unknown, supporting different query fragments, and supporting multiple database versions.

This is a preview of subscription content, log in via an institution to check access.

Access this article

Price excludes VAT (USA)
Tax calculation will be finalised during checkout.

Instant access to the full article PDF.

Institutional subscriptions

Fig. 1
Fig. 2
Fig. 3
Fig. 4
Fig. 5
Fig. 6
Fig. 7
Fig. 8
Fig. 9

Similar content being viewed by others

Notes

  1. http://archive.ics.uci.edu/ml/datasets/Adult.

  2. http://baseball1.com/statistics/.

  3. Note that even though the definition of a weak IEQ \(Q'\) of \(Q\) does not require the queries to share a set of core relations, we find this restriction to be a reasonable and effective way to obtain “good” IEQs.

  4. If the search is for strong IEQs, then the discussion remains the same except that \(L\) is the ordered listing of the key attributes of a set of core relations \(S\) of \(Q\), and we replace \(Q(D)\) by \({Q}_{S}(D)\).

  5. We also experimented with a scheme that randomly labels only one free tuple for each subset as positive, but its performance is worse than NI and RD.

  6. To generate only precise IEQs, \(\tau = 0\). In our experiments, we set \(\tau = 1\) to derive a reasonable number of approximate IEQs.

  7. Note that in the SQL implementation, null count values are first converted to zero values before being added.

  8. http://www.basketballreference.com/.

  9. http://www.ebay.com/.

  10. We use \(gain,\,ms,\,edu,\,loss,\,nc,\,hpw\), and \(rs\), respectively, as abbreviations for capital gain, marital status, number of years of education, capital loss, native country, hours per week, and relationship.

  11. As all the class-labeling schemes are based on the same framework to derive SPJU-IEQs which differ only in how they label free tuples to derive SPJ-IEQs that form the subqueries for SPJU-IEQs, we do not consider SPJU queries here. For SPJA queries, since the free tuples are assigned fixed class labels to satisfy the aggregation conditions before being classified, all the three schemes would have returned the same results for SPJA-IEQs; therefore, SPJA queries are not considered. For the TPC-H queries, it turns out that all the free tuples are bound, and therefore, all the three schemes return the same set of IEQs; therefore, we do not report results for TPC-H data sets.

  12. We remove the \(comment\) attributes from each relation in TPC-H database.

  13. These four relations are chosen because they appeared in the test queries \(T_1\) to \(T_4\).

  14. Two queries \(Q\) and \(Q'\) are semantically equivalent if for every valid database \(D,\,Q(D) = Q'(D)\).

References

  1. Andritsos, P., Miller, R.J., Tsaparas, P.: Information-theoretic tools for mining database structure from large data sets. In: SIGMOD (2004)

  2. Arasu, A., Kaushik, R., Li, J.: Data generation using declarative constraints. In: SIGMOD Conference, pp. 685–696 (2011)

  3. Binnig, C., Kossmann, D., Lo, E.: Reverse query processing. In: ICDE, pp. 506–515 (2007)

  4. Binnig, C., Kossmann, D., Lo, E., Özsu, M.T.: QAGen: Generating query-aware test databases. In: SIGMOD Conference, pp. 341–352 (2007)

  5. Blakeley, J.A., Larson, P.A., Tompa, F.W.: Efficiently updating materialized views. SIGMOD Rec. 15(2), 61–71 (1986)

    Article  Google Scholar 

  6. Brown, P.G., Haas, P.J.: BHUNT: Automatic discovery of fuzzy algebraic constraints in relational data. In: VLDB (2003)

  7. Bruno, N., Chaudhuri, S., Thomas, D.: Generating queries with cardinality constraints for dbms testing. IEEE TKDE 18(12), 1721–1725 (2006)

    Google Scholar 

  8. Cormen, T.H., Stein, C., Rivest, R.L., Leiserson, C.E.: Introduction to Algorithms. McGraw-Hill Science, Boston (2001)

  9. Gaasterland, T., Godfrey, P., Minker, J.: An overview of cooperative answering. J. Intell. Inf. Syst. 1(2), 123–157 (1992)

    Google Scholar 

  10. Getoor, L., Taskar, B., Koller, D.: Selectivity estimation using probabilistic models. In: SIGMOD Conference, pp. 461–472 (2001)

  11. Godfrey, P., Gryz, J., Zuzarte, C.: Exploiting constraint-like data characterizations in query optimization. In: SIGMOD Conference, pp. 582–592 (2001)

  12. Johnson, T., Marathe, A., Dasu, T.: Database exploration and bellman. IEEE Data Eng. Bull. 26(3), 34–39 (2003)

    Google Scholar 

  13. Lenzerini, M.: Data integration: A theoretical perspective. In: PODS, pp. 233–246 (2002)

  14. Lo, E., Cheng, N., Hon, W.K.: Generating databases for query workloads. Proc. VLDB Endow. 3(1), 848–859 (2010)

  15. Malpani, A., Bernstein, P., Melnik, S., Terwilliger, J.: Reverse engineering models from databases to bootstrap application development. In: ICDE (2010)

  16. Mehta, M., Agrawal, R., Rissanen, J.: SLIQ: A fast scalable classifier for data mining. In: EDBT, pp. 18–32 (1996)

  17. Mishra, C., Koudas, N., Zuzarte, C.: Generating targeted queries for database testing. In: SIGMOD Conference, pp. 499–510 (2008)

  18. Motro, A.: Intensional answers to database queries. IEEE TKDE 6(3), 444–454 (1994)

    Google Scholar 

  19. Mumick, I.S., Quass, D., Mumick, B.S.: Maintenance of data cubes and summary tables in a warehouse. SIGMOD Rec. 26(2), 100–111 (1997)

    Google Scholar 

  20. Petit, J.M., Toumani, F., Boulicaut, J.F., Kouloumdjian, J.: Towards the reverse engineering of denormalized relational databases. In: ICDE (1996)

  21. Tan, P.N., Kumar, M.V.: Introduction to Data Mining. Addison-Wesley, Reading, MA (2006)

    Google Scholar 

  22. Ramakrishnan, N., Kumar, D., Mishra, B., Potts, M., Helm, R.F.: Turning cartwheels: An alternating algorithm for mining redescriptions. In: SIGKDD Conference, pp. 266–275 (2004)

  23. van Rijsbergen, C.J.: Information Retrieval. Butterworth, London (1979)

    Google Scholar 

  24. Rissanen, J.: Modeling by shortest data description. Automatica 14, 465–471 (1978)

    Article  MATH  Google Scholar 

  25. Sarma, A.D., Parameswaran, A., Garcia-Molina, H., Widom, J.: Synthesizing view definitions from data. In: ICDT, pp. 89–103 (2010)

  26. Sismanis, Y., Brown, P., Haas, P.J., Reinwald, B.: GORDIAN: Efficient and scalable discovery of composite keys. In: VLDB (2006)

  27. Stonebraker, M.: The design of the postgres storage system. In: VLDB, pp. 289–300 (1987)

  28. Tran, Q.T., Chan, C.Y.: How to ConQueR why-not questions. In: SIGMOD Conference, pp. 15–26 (2010)

  29. Tran, Q.T., Chan, C.Y., Parthasarathy, S.: Query by output. In: SIGMOD Conference, pp. 535–548 (2009)

  30. Valduriez, P.: Join indices. ACM Trans. Database Syst. 12(2), 218–246 (1987)

    Article  Google Scholar 

  31. Wu, W., Reinwald, B., Sismanis, Y., Manjrekar, R.: Discovering topical structures of databases. In: SIGMOD (2008)

  32. Xiao, X., Tao, Y.: Output perturbation with query relaxation. Proc. VLDB Endow. 1(1), 857–869 (2008)

    Article  Google Scholar 

  33. Yin, X., Han, J., Yang, J., Yu, P.S.: Efficient classification across multiple database relations: A crossmine approach. TKDE 18, 770–783 (2006)

    Google Scholar 

Download references

Acknowledgments

We would like to thank the editors and reviewers for their constructive comments and suggestions to improve the paper’s presentation. This research is supported in part by NUS Grant R-252-000-512-112. Parthasarathy would also like to acknowledge the following US NSF grants: IIS-0347662 (CAREER) and CCF-0702587.

Author information

Authors and Affiliations

Authors

Corresponding author

Correspondence to Quoc Trung Tran.

Electronic supplementary material

Below is the link to the electronic supplementary material.

Supplementary material 1 (pdf 205 KB)

Rights and permissions

Reprints and permissions

About this article

Cite this article

Tran, Q.T., Chan, CY. & Parthasarathy, S. Query reverse engineering. The VLDB Journal 23, 721–746 (2014). https://doi.org/10.1007/s00778-013-0349-3

Download citation

  • Received:

  • Revised:

  • Accepted:

  • Published:

  • Issue Date:

  • DOI: https://doi.org/10.1007/s00778-013-0349-3

Keywords

Navigation