research-article

Mining Abstract XML Data-Types

Published: 04 December 2018

Abstract

Schema integration is a long-standing challenge for the data-engineering community and has received steady attention over the past three decades. General-purpose integration approaches construct unified schemas that encompass all schema elements. In the past decade, schema integration has been revisited in service-oriented computing, since the input/output data-types of service interfaces are heterogeneous XML schemas. However, service integration differs from the traditional integration problem: it should generalize schemas (i.e., mine abstract data-types) instead of unifying all schema elements. To mine well-formed abstract data-types, the fundamental Liskov Substitution Principle (LSP), which generally holds between abstract data-types and their subtypes, should be followed. However, due to the heterogeneity of service data-types, strict application of LSP is usually not feasible. Moreover, XML offers a rich type system in which data-types are defined by combining type patterns (e.g., composition, aggregation). Existing integration approaches have not addressed the challenge of defining a subtyping relation between XML type patterns. To address these challenges, we propose a relaxed version of LSP between XML type patterns and an automated generalization process for mining abstract XML data-types. We evaluate the effectiveness and efficiency of the process on the schemas of two datasets against two representative state-of-the-art approaches.
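The contrast the abstract draws between unifying and generalizing schemas can be illustrated with a small sketch. This is not the paper's algorithm: it assumes a hypothetical flat model in which a schema is just a mapping from element names to simple type names, and the names `unify`, `generalize`, `substitutable`, `flight`, and `train` are invented for illustration. It models neither XML type patterns nor the relaxed LSP relation proposed in the article; it only shows why an abstract type built from shared elements can stand in for its subtypes, while a unified schema cannot.

```python
from typing import Dict

# Hypothetical flat model: element name -> simple type name.
Schema = Dict[str, str]


def unify(a: Schema, b: Schema) -> Schema:
    """Traditional integration: keep every element of both schemas."""
    merged = dict(a)
    for name, typ in b.items():
        merged.setdefault(name, typ)
    return merged


def generalize(a: Schema, b: Schema) -> Schema:
    """Abstraction: keep only elements both schemas share with the same type."""
    return {name: typ for name, typ in a.items() if b.get(name) == typ}


def substitutable(abstract: Schema, concrete: Schema) -> bool:
    """Strict check: every element the abstract type promises exists in the concrete type."""
    return all(concrete.get(name) == typ for name, typ in abstract.items())


# Two invented service data-types with partially overlapping elements.
flight = {"origin": "string", "destination": "string", "seat": "string"}
train = {"origin": "string", "destination": "string", "carriage": "int"}

abstract_trip = generalize(flight, train)
unified_trip = unify(flight, train)
```

Under these assumptions, `abstract_trip` contains only `origin` and `destination`, so both `flight` and `train` can substitute for it. `unified_trip` also carries `seat` and `carriage`, which neither concrete type fully provides, so the substitution check fails for both.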

      • Published in

        ACM Transactions on the Web  Volume 13, Issue 1
        February 2019
        206 pages
        ISSN:1559-1131
        EISSN:1559-114X
        DOI:10.1145/3297729

        Copyright © 2018 ACM

        Permission to make digital or hard copies of all or part of this work for personal or classroom use is granted without fee provided that copies are not made or distributed for profit or commercial advantage and that copies bear this notice and the full citation on the first page. Copyrights for components of this work owned by others than ACM must be honored. Abstracting with credit is permitted. To copy otherwise, or republish, to post on servers or to redistribute to lists, requires prior specific permission and/or a fee. Request permissions from [email protected]

        Publisher

        Association for Computing Machinery

        New York, NY, United States

        Publication History

        • Published: 4 December 2018
        • Accepted: 1 August 2018
        • Revised: 1 April 2018
        • Received: 1 August 2015

        Qualifiers

        • research-article
        • Research
        • Refereed
