Integrating Content and Structure into a Comprehensive Framework for XML Document Similarity Represented in 3D Space

Draken, Eric; Jarada, Tamer N.; Kianmehr, Keivan; Alhajj, Reda

doi:10.1007/978-3-642-22913-8_13

Part of the book series: Studies in Computational Intelligence (SCI, volume 375)

Abstract

XML is attractive for data exchange between different platforms, and the number of XML documents is rapidly increasing. This has raised the need for techniques that measure the similarity between XML documents so that they can be classified and used in a better organized way. The idea of similarity between documents is not new. However, XML documents are richer and more informative than classical documents in the sense that they encapsulate both structure and content, whereas classical documents are characterized only by their content. Accordingly, using both the content and the structure of XML documents to assign a similarity metric is relatively new. Most of the recent algorithms proposed in the literature assign a similarity metric between 0.0 and 1.0 when comparing two XML documents. The similarity measures between multiple XML documents may be arranged in a matrix, to which data mining may be applied to cluster closely related documents. In this chapter, we present a novel way to represent XML document similarity in 3D space. Our approach benefits from the characteristics of XML documents to produce a measure that can be used in clustering and classification techniques and in information retrieval and searching methods for XML documents. We derive a three-dimensional vector per document: two dimensions capture the document's structure and content, while the third combines both structure and content characteristics of the document. The outcome of our research allows users to intuitively visualize document similarity.
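
The abstract does not specify how the three coordinates are computed, so the following is only a minimal sketch of the general idea, not the method of the chapter. It assumes, purely for illustration, that structure is compared as the Jaccard overlap of root-to-element tag paths, that content is compared as the cosine similarity of raw term counts, that the third coordinate is a weighted mean of the two, and that each document is scored against a chosen reference document. The names `tag_paths`, `term_counts`, `similarity_vector`, and the `weight` parameter are hypothetical.

```python
import math
import re
import xml.etree.ElementTree as ET
from collections import Counter


def tag_paths(root):
    """Collect the set of root-to-element tag paths, e.g. 'book/title'."""
    paths = set()

    def walk(node, prefix):
        path = f"{prefix}/{node.tag}" if prefix else node.tag
        paths.add(path)
        for child in node:
            walk(child, path)

    walk(root, "")
    return paths


def term_counts(root):
    """Count lowercase alphanumeric tokens over all text in the document."""
    text = " ".join(root.itertext())
    return Counter(re.findall(r"[a-z0-9]+", text.lower()))


def jaccard(a, b):
    """Jaccard overlap of two sets, in [0, 1]."""
    union = a | b
    return len(a & b) / len(union) if union else 1.0


def cosine(a, b):
    """Cosine similarity of two term-count vectors, in [0, 1]."""
    dot = sum(a[t] * b[t] for t in a.keys() & b.keys())
    norm = math.sqrt(sum(v * v for v in a.values()) * sum(v * v for v in b.values()))
    return dot / norm if norm else 0.0


def similarity_vector(doc_xml, reference_xml, weight=0.5):
    """Return (structure, content, combined) for a document vs. a reference.

    Assumption: the combined coordinate is a weighted mean of the other two;
    the chapter may define it differently.
    """
    doc, ref = ET.fromstring(doc_xml), ET.fromstring(reference_xml)
    structure = jaccard(tag_paths(doc), tag_paths(ref))
    content = cosine(term_counts(doc), term_counts(ref))
    combined = weight * structure + (1 - weight) * content
    return (structure, content, combined)


if __name__ == "__main__":
    a = "<book><title>XML similarity</title><author>Ann</author></book>"
    b = "<book><title>XML clustering</title><year>2011</year></book>"
    print(similarity_vector(a, b))
```

Each coordinate lies between 0.0 and 1.0, so the resulting points can be plotted directly in a unit cube or passed to a clustering routine. Scoring against a single reference document is only one way to obtain a per-document vector; a full pairwise similarity matrix is another.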

Author information

Authors and Affiliations

Computer Science Department, University of Calgary, Calgary, Alberta, Canada
Eric Draken, Tamer N. Jarada & Reda Alhajj
Computer Engineering Department, University of Western Ontario, London, Ontario, Canada
Keivan Kianmehr
Department of Computer Science, Global University, Beirut, Lebanon
Reda Alhajj

Editor information

Editors and Affiliations

University of New York Tirana, Rr. Komuna E Parisit, Tirana, Albania
Marenglen Biba
Technical University of Catalonia, Campus Nord, Ed. Omega, C/Jordi Girona 1-3, 08034, Barcelona, Spain
Fatos Xhafa

About this chapter

Cite this chapter

Draken, E., Jarada, T.N., Kianmehr, K., Alhajj, R. (2011). Integrating Content and Structure into a Comprehensive Framework for XML Document Similarity Represented in 3D Space. In: Biba, M., Xhafa, F. (eds) Learning Structure and Schemas from Documents. Studies in Computational Intelligence, vol 375. Springer, Berlin, Heidelberg. https://doi.org/10.1007/978-3-642-22913-8_13

DOI: https://doi.org/10.1007/978-3-642-22913-8_13
Publisher Name: Springer, Berlin, Heidelberg
Print ISBN: 978-3-642-22912-1
Online ISBN: 978-3-642-22913-8
eBook Packages: Engineering, Engineering (R0)
