ABSTRACT
In this paper, we present an algorithm that finds a minimum-cost sequence of top-down edit operations transforming an XML document so that it conforms to a schema. We show that the algorithm runs in O(p × log p × n) time, where p is the size of the schema (grammar) and n is the size of the XML document (tree). We also show that the edit distance under restricted top-down edit operations can be computed in the same way, and how these edit distances can be used in document classification. Experimental studies show that our methods are effective in structure-oriented classification for both real and synthesized data sets.
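To make the idea of a top-down edit distance and its use in classification concrete, the following is a minimal sketch and not the algorithm of this paper. It computes a Selkow-style top-down distance between two XML trees (rather than between a document and a schema), assumes unit costs for relabeling, inserting, and deleting a node, and classifies a document by the nearest of a few representative trees. The names `subtree_size`, `top_down_distance`, and `classify`, as well as the sample documents, are hypothetical.

```python
import xml.etree.ElementTree as ET


def subtree_size(node):
    """Number of nodes in the subtree rooted at node (unit cost per node)."""
    return 1 + sum(subtree_size(child) for child in node)


def top_down_distance(a, b):
    """Selkow-style top-down edit distance between two element trees.

    Assumed unit costs: relabeling a node costs 1, and inserting or
    deleting a whole subtree costs its number of nodes.
    """
    relabel = 0 if a.tag == b.tag else 1
    xs, ys = list(a), list(b)
    # d[i][j]: cost of turning the first i children of a
    # into the first j children of b.
    d = [[0] * (len(ys) + 1) for _ in range(len(xs) + 1)]
    for i in range(1, len(xs) + 1):
        d[i][0] = d[i - 1][0] + subtree_size(xs[i - 1])
    for j in range(1, len(ys) + 1):
        d[0][j] = d[0][j - 1] + subtree_size(ys[j - 1])
    for i in range(1, len(xs) + 1):
        for j in range(1, len(ys) + 1):
            d[i][j] = min(
                d[i - 1][j] + subtree_size(xs[i - 1]),   # delete subtree
                d[i][j - 1] + subtree_size(ys[j - 1]),   # insert subtree
                d[i - 1][j - 1] + top_down_distance(xs[i - 1], ys[j - 1]),
            )
    return relabel + d[-1][-1]


def classify(doc, representatives):
    """Return the class name whose representative tree is closest to doc."""
    return min(
        representatives,
        key=lambda name: top_down_distance(doc, representatives[name]),
    )


# Hypothetical sample data: one representative tree per class.
representatives = {
    "book": ET.fromstring("<book><title/><author/><year/></book>"),
    "article": ET.fromstring(
        "<article><title/><journal><name/></journal></article>"
    ),
}
doc = ET.fromstring("<book><title/><author/></book>")
label = classify(doc, representatives)
```

In this sketch the document is one subtree insertion away from the `book` representative and three unit operations away from the `article` representative, so it is assigned to `book`. Replacing the representative trees with schemas, as the paper does, requires matching child sequences against the grammar's content models instead of against a fixed list of children.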
Index Terms
- Computing edit distances between an XML document and a schema and its application in document classification