research-article

Automated migration of hierarchical data to relational tables using programming-by-example

Published: 05 October 2018

Abstract

While many applications export data in hierarchical formats like XML and JSON, it is often necessary to convert such hierarchical documents to a relational representation. This paper presents a novel programming-by-example approach, and its implementation in a tool called Mitra, for automatically migrating tree-structured documents to relational tables. We have evaluated the proposed technique using two sets of experiments. In the first experiment, we used Mitra to automate 98 data transformation tasks collected from StackOverflow. Our method can generate the desired program for 94% of these benchmarks with an average synthesis time of 3.8 seconds. In the second experiment, we used Mitra to generate programs that can convert real-world XML and JSON datasets to full-fledged relational databases. Our evaluation shows that Mitra can automate the desired transformation for all datasets.

  • Published in

    Proceedings of the VLDB Endowment, Volume 11, Issue 5
    January 2018
    274 pages
    ISSN: 2150-8097

    Publisher

    VLDB Endowment
