Copyright © 2004 Elsevier Ltd. All rights reserved.
Bayesian network model for semi-structured document classification
Available online 17 June 2004.
Abstract
Recently, a new community has started to emerge around the development of new information research methods for searching and analyzing semi-structured and XML-like documents. The goal is to handle both content and structural information, and to deal with different types of information content (text, image, etc.). We consider here the task of structured document classification. We propose a generative model, based on Bayesian networks, that handles both structure and content. We then show how to transform this generative model into a discriminant classifier using the Fisher kernel method. The model is then extended to deal with different types of content information (here, text and images). The model was tested on three databases: the classical webKB corpus, composed of HTML pages; the new INEX corpus, which has become a reference in the field of ad hoc retrieval for XML documents; and a multimedia corpus of Web pages.
Author Keywords: Statistical learning; Bayesian networks; Categorization; Structured documents; XML; Machine learning
Article Outline
- 1. Introduction
- 2. Previous works
- 3. Structured document
- 4. Modeling documents with Bayesian networks
- 5. Learning
- 6. Improving discriminant abilities using Fisher kernel
- 6.1. Fisher score and Fisher kernel
- 6.2. Fisher score for the Bayesian network model
- 6.3. Computational considerations
- 7. Considering different types of information: text and image
- 8. Experiments
- 8.1. INEX: a large XML corpus
- 8.2. HTML corpus: webKB
- 8.3. Multimedia corpus: NetProtect corpus
- 9. Conclusion
- References















