Abstract
In this paper, we introduce a new problem termed query reverse engineering (QRE). Given a database \(D\) and a result table \(T\)—the output of some known or unknown query \(Q\) on \(D\)—the goal of QRE is to reverse-engineer a query \(Q'\) such that the output of query \(Q'\) on database \(D\) (denoted by \(Q'(D)\)) is equal to \(T\) (i.e., \(Q(D)\)). The QRE problem has useful applications in database usability, data analysis, and data security. In this work, we propose a data-driven approach, TALOS for Tree-based classifier with At Least One Semantics, that is based on a novel dynamic data classification formulation and extend the approach to efficiently support the three key dimensions of the QRE problem: whether the input query is known/unknown, supporting different query fragments, and supporting multiple database versions.
Similar content being viewed by others
Notes
Note that even though the definition of a weak IEQ \(Q'\) of \(Q\) does not require the queries to share a set of core relations, we find this restriction to be a reasonable and effective way to obtain “good” IEQs.
If the search is for strong IEQs, then the discussion remains the same except that \(L\) is the ordered listing of the key attributes of a set of core relations \(S\) of \(Q\), and we replace \(Q(D)\) by \({Q}_{S}(D)\).
We also experimented with a scheme that randomly labels only one free tuple for each subset as positive, but its performance is worse than NI and RD.
To generate only precise IEQs, \(\tau = 0\). In our experiments, we set \(\tau = 1\) to derive a reasonable number of approximate IEQs.
Note that in the SQL implementation, null count values are first converted to zero values before being added.
We use \(gain,\,ms,\,edu,\,loss,\,nc,\,hpw\), and \(rs\), respectively, as abbreviations for capital gain, marital status, number of years of education, capital loss, native country, hours per week, and relationship.
As all the class-labeling schemes are based on the same framework to derive SPJU-IEQs which differ only in how they label free tuples to derive SPJ-IEQs that form the subqueries for SPJU-IEQs, we do not consider SPJU queries here. For SPJA queries, since the free tuples are assigned fixed class labels to satisfy the aggregation conditions before being classified, all the three schemes would have returned the same results for SPJA-IEQs; therefore, SPJA queries are not considered. For the TPC-H queries, it turns out that all the free tuples are bound, and therefore, all the three schemes return the same set of IEQs; therefore, we do not report results for TPC-H data sets.
We remove the \(comment\) attributes from each relation in TPC-H database.
These four relations are chosen because they appeared in the test queries \(T_1\) to \(T_4\).
Two queries \(Q\) and \(Q'\) are semantically equivalent if for every valid database \(D,\,Q(D) = Q'(D)\).
References
Andritsos, P., Miller, R.J., Tsaparas, P.: Information-theoretic tools for mining database structure from large data sets. In: SIGMOD (2004)
Arasu, A., Kaushik, R., Li, J.: Data generation using declarative constraints. In: SIGMOD Conference, pp. 685–696 (2011)
Binnig, C., Kossmann, D., Lo, E.: Reverse query processing. In: ICDE, pp. 506–515 (2007)
Binnig, C., Kossmann, D., Lo, E., Özsu, M.T.: QAGen: Generating query-aware test databases. In: SIGMOD Conference, pp. 341–352 (2007)
Blakeley, J.A., Larson, P.A., Tompa, F.W.: Efficiently updating materialized views. SIGMOD Rec. 15(2), 61–71 (1986)
Brown, P.G., Haas, P.J.: BHUNT: Automatic discovery of fuzzy algebraic constraints in relational data. In: VLDB (2003)
Bruno, N., Chaudhuri, S., Thomas, D.: Generating queries with cardinality constraints for dbms testing. IEEE TKDE 18(12), 1721–1725 (2006)
Cormen, T.H., Stein, C., Rivest, R.L., Leiserson, C.E.: Introduction to Algorithms. McGraw-Hill Science, Boston (2001)
Gaasterland, T., Godfrey, P., Minker, J.: An overview of cooperative answering. J. Intell. Inf. Syst. 1(2), 123–157 (1992)
Getoor, L., Taskar, B., Koller, D.: Selectivity estimation using probabilistic models. In: SIGMOD Conference, pp. 461–472 (2001)
Godfrey, P., Gryz, J., Zuzarte, C.: Exploiting constraint-like data characterizations in query optimization. In: SIGMOD Conference, pp. 582–592 (2001)
Johnson, T., Marathe, A., Dasu, T.: Database exploration and bellman. IEEE Data Eng. Bull. 26(3), 34–39 (2003)
Lenzerini, M.: Data integration: A theoretical perspective. In: PODS, pp. 233–246 (2002)
Lo, E., Cheng, N., Hon, W.K.: Generating databases for query workloads. Proc. VLDB Endow. 3(1), 848–859 (2010)
Malpani, A., Bernstein, P., Melnik, S., Terwilliger, J.: Reverse engineering models from databases to bootstrap application development. In: ICDE (2010)
Mehta, M., Agrawal, R., Rissanen, J.: SLIQ: A fast scalable classifier for data mining. In: EDBT, pp. 18–32 (1996)
Mishra, C., Koudas, N., Zuzarte, C.: Generating targeted queries for database testing. In: SIGMOD Conference, pp. 499–510 (2008)
Motro, A.: Intensional answers to database queries. IEEE TKDE 6(3), 444–454 (1994)
Mumick, I.S., Quass, D., Mumick, B.S.: Maintenance of data cubes and summary tables in a warehouse. SIGMOD Rec. 26(2), 100–111 (1997)
Petit, J.M., Toumani, F., Boulicaut, J.F., Kouloumdjian, J.: Towards the reverse engineering of denormalized relational databases. In: ICDE (1996)
Tan, P.N., Kumar, M.V.: Introduction to Data Mining. Addison-Wesley, Reading, MA (2006)
Ramakrishnan, N., Kumar, D., Mishra, B., Potts, M., Helm, R.F.: Turning cartwheels: An alternating algorithm for mining redescriptions. In: SIGKDD Conference, pp. 266–275 (2004)
van Rijsbergen, C.J.: Information Retrieval. Butterworth, London (1979)
Rissanen, J.: Modeling by shortest data description. Automatica 14, 465–471 (1978)
Sarma, A.D., Parameswaran, A., Garcia-Molina, H., Widom, J.: Synthesizing view definitions from data. In: ICDT, pp. 89–103 (2010)
Sismanis, Y., Brown, P., Haas, P.J., Reinwald, B.: GORDIAN: Efficient and scalable discovery of composite keys. In: VLDB (2006)
Stonebraker, M.: The design of the postgres storage system. In: VLDB, pp. 289–300 (1987)
Tran, Q.T., Chan, C.Y.: How to ConQueR why-not questions. In: SIGMOD Conference, pp. 15–26 (2010)
Tran, Q.T., Chan, C.Y., Parthasarathy, S.: Query by output. In: SIGMOD Conference, pp. 535–548 (2009)
Valduriez, P.: Join indices. ACM Trans. Database Syst. 12(2), 218–246 (1987)
Wu, W., Reinwald, B., Sismanis, Y., Manjrekar, R.: Discovering topical structures of databases. In: SIGMOD (2008)
Xiao, X., Tao, Y.: Output perturbation with query relaxation. Proc. VLDB Endow. 1(1), 857–869 (2008)
Yin, X., Han, J., Yang, J., Yu, P.S.: Efficient classification across multiple database relations: A crossmine approach. TKDE 18, 770–783 (2006)
Acknowledgments
We would like to thank the editors and reviewers for their constructive comments and suggestions to improve the paper’s presentation. This research is supported in part by NUS Grant R-252-000-512-112. Parthasarathy would also like to acknowledge the following US NSF grants: IIS-0347662 (CAREER) and CCF-0702587.
Author information
Authors and Affiliations
Corresponding author
Electronic supplementary material
Below is the link to the electronic supplementary material.
Rights and permissions
About this article
Cite this article
Tran, Q.T., Chan, CY. & Parthasarathy, S. Query reverse engineering. The VLDB Journal 23, 721–746 (2014). https://doi.org/10.1007/s00778-013-0349-3
Received:
Revised:
Accepted:
Published:
Issue Date:
DOI: https://doi.org/10.1007/s00778-013-0349-3