Mining Abstract XML Data-Types

Authors:
Dionysis Athanasopoulos, Victoria University of Wellington, New Zealand (ORCID: 0000-0002-0720-1986)
Apostolos Zarras, University of Ioannina, Greece

ACM Transactions on the Web, Volume 13, Issue 1, Article No. 2, pp. 1–37. https://doi.org/10.1145/3267467

Published: 04 December 2018

Abstract

Schema integration is a long-standing challenge for the data-engineering community and has received steady attention over the past three decades. General-purpose integration approaches construct unified schemas that encompass all schema elements. In the past decade, schema integration has been revisited in service-oriented computing, since the input/output data-types of service interfaces are heterogeneous XML schemas. However, service integration differs from the traditional integration problem, since it should generalize schemas (i.e., mine abstract data-types) instead of unifying all schema elements. To mine well-formed abstract data-types, the fundamental Liskov Substitution Principle (LSP), which generally holds between abstract data-types and their subtypes, should be followed. However, due to the heterogeneity of service data-types, strict application of LSP is usually not feasible. On top of that, XML offers a rich type system in which data-types are defined by combining type patterns (e.g., composition, aggregation). Existing integration approaches have not dealt with the challenge of defining a subtyping relation between XML type patterns. To address these challenges, we propose a relaxed version of LSP between XML type patterns and an automated generalization process for mining abstract XML data-types. We evaluate the effectiveness and efficiency of the process on the schemas of two datasets against two representative state-of-the-art approaches.
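
The contrast between unifying schemas and generalizing them can be illustrated with a deliberately simplified sketch. It is not the process proposed in the article: it assumes each data-type is a flat mapping from element names to simple type names, matches elements by exact name and type, and ignores XML type patterns and any relaxation of LSP. The names `unify`, `generalize`, `is_substitutable`, `flight_a`, and `flight_b` are hypothetical and introduced only for illustration.

```python
from typing import Dict

# Toy model (an assumption): a data-type is a flat map from element
# name to simple type name. Real XML schemas are nested trees.
Schema = Dict[str, str]


def unify(a: Schema, b: Schema) -> Schema:
    """Traditional integration: keep every element of both schemas."""
    merged = dict(a)
    for name, typ in b.items():
        # On a name clash, the first schema's type is kept (toy choice).
        merged.setdefault(name, typ)
    return merged


def generalize(a: Schema, b: Schema) -> Schema:
    """Generalization: keep only elements both schemas share with the same type."""
    return {name: typ for name, typ in a.items() if b.get(name) == typ}


def is_substitutable(concrete: Schema, abstract: Schema) -> bool:
    """Strict toy check: the concrete type offers every abstract element."""
    return all(concrete.get(name) == typ for name, typ in abstract.items())


# Two hypothetical input data-types of flight-search services.
flight_a = {"origin": "string", "destination": "string",
            "date": "date", "cabin": "string"}
flight_b = {"origin": "string", "destination": "string",
            "date": "date", "airline": "string"}

unified = unify(flight_a, flight_b)        # all five distinct elements
abstract = generalize(flight_a, flight_b)  # only the three shared elements
```

In this toy setting both concrete types can stand in for `abstract`, whereas neither can stand in for `unified`, since each lacks one of its elements. Heterogeneous real-world schemas rarely match this exactly, which is the situation a relaxed subtyping relation is meant to handle.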

Index Terms

Mining Abstract XML Data-Types
1. Applied computing
  1. Document management and text processing
    1. Document preparation
      1. Markup languages
        Extensible Markup Language (XML)
2. Information systems
  1. World Wide Web
    1. Web services

Published in
ACM Transactions on the Web Volume 13, Issue 1
February 2019
206 pages
ISSN: 1559-1131
EISSN: 1559-114X
DOI: 10.1145/3297729
Editor:
Brian D. Davison
Lehigh University, USA
Copyright © 2018 ACM
Permission to make digital or hard copies of all or part of this work for personal or classroom use is granted without fee provided that copies are not made or distributed for profit or commercial advantage and that copies bear this notice and the full citation on the first page. Copyrights for components of this work owned by others than ACM must be honored. Abstracting with credit is permitted. To copy otherwise, or republish, to post on servers or to redistribute to lists, requires prior specific permission and/or a fee. Request permissions from [email protected]
Publisher
Association for Computing Machinery
New York, NY, United States
Publication History
- Published: 4 December 2018
- Accepted: 1 August 2018
- Revised: 1 April 2018
- Received: 1 August 2015
Author Tags
Type pattern
embedded subtree
pruning
subtyping relation
Qualifiers
- research-article
- Research
- Refereed