Abstract
While many applications export data in hierarchical formats like XML and JSON, it is often necessary to convert such hierarchical documents to a relational representation. This paper presents a novel programming-by-example approach, and its implementation in a tool called Mitra, for automatically migrating tree-structured documents to relational tables. We have evaluated the proposed technique using two sets of experiments. In the first experiment, we used Mitra to automate 98 data transformation tasks collected from StackOverflow. Our method can generate the desired program for 94% of these benchmarks with an average synthesis time of 3.8 seconds. In the second experiment, we used Mitra to generate programs that can convert real-world XML and JSON datasets to full-fledged relational databases. Our evaluation shows that Mitra can automate the desired transformation for all datasets.
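To make the task concrete, the sketch below shows the kind of input-output pair a user might supply and a hand-written program consistent with it. It is not Mitra's language or synthesis algorithm; the document layout, the field names (`authors`, `papers`, `title`, `year`), and both helper functions are assumptions made only for illustration.

```python
import json

# Hypothetical input document; the field names are illustrative only.
DOC = json.loads("""
{
  "authors": [
    {"name": "Ada", "papers": [{"title": "Trees", "year": 2016},
                               {"title": "Tables", "year": 2017}]},
    {"name": "Bob", "papers": [{"title": "Joins", "year": 2017}]}
  ]
}
""")

# Example output table a user might provide: one row per (author, paper).
EXAMPLE_OUTPUT = [
    ("Ada", "Trees", 2016),
    ("Ada", "Tables", 2017),
    ("Bob", "Joins", 2017),
]


def flatten_author_papers(doc):
    """Return one (author, title, year) row per paper nested under each author."""
    rows = []
    for author in doc.get("authors", []):
        for paper in author.get("papers", []):
            rows.append((author["name"], paper["title"], paper["year"]))
    return rows


def matches_example(program, doc, expected_rows):
    """Check that a candidate program reproduces the example table (row order ignored)."""
    return sorted(program(doc)) == sorted(expected_rows)


if __name__ == "__main__":
    for row in flatten_author_papers(DOC):
        print(row)
    print(matches_example(flatten_author_papers, DOC, EXAMPLE_OUTPUT))
```

In a programming-by-example setting, the user would supply only `DOC` and `EXAMPLE_OUTPUT`; a synthesizer would then search for a program such as `flatten_author_papers` that passes the `matches_example` check.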