Information Processing & Management
Volume 40, Issue 5, September 2004, Pages 807-827
Bayesian Networks and Information Retrieval
 
doi:10.1016/j.ipm.2004.04.009
Copyright © 2004 Elsevier Ltd. All rights reserved.

Bayesian network model for semi-structured document classification

Ludovic Denoyer (corresponding author) and Patrick Gallinari

Laboratoire d'Informatique de Paris VI, LIP6, 8 rue du Capitaine Scott, 75015, Paris, France

Available online 17 June 2004.


Abstract

A research community has recently begun to form around new information retrieval methods for searching and analyzing semi-structured and XML-like documents. The goal is to handle both content and structural information and to deal with different types of content (text, images, etc.). We consider here the task of structured document classification. We propose a generative model, based on Bayesian networks, that handles both structure and content. We then show how to transform this generative model into a discriminant classifier using the Fisher kernel method. The model is then extended to handle different types of content information (here, text and images). The model was tested on three databases: the classical webKB corpus, composed of HTML pages; the new INEX corpus, which has become a reference in the field of ad hoc retrieval for XML documents; and a multimedia corpus of Web pages.

Author Keywords: Statistical learning; Bayesian networks; Categorization; Structured documents; XML; Machine learning
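
For readers unfamiliar with the Fisher kernel named in the abstract, its usual textbook definition is sketched below. This is the standard formulation from the kernel-methods literature, not a reproduction of the paper's Section 6, whose notation may differ; the symbols x, y, \theta, U_x, and I are assumed here for illustration only.

```latex
% Fisher score: gradient of the log-likelihood of document x
% under a generative model with parameters \theta
U_x = \nabla_{\theta} \log P(x \mid \theta)

% Fisher information matrix (expectation over documents drawn from the model)
I = \mathbb{E}_{x}\!\left[ U_x U_x^{\top} \right]

% Fisher kernel between two documents x and y
K(x, y) = U_x^{\top} I^{-1} U_y
```

In this formulation the generative model supplies a fixed-length feature vector U_x for each document, and a kernel-based discriminant classifier can then be trained on K. In practice I is often approximated by the identity matrix.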

Article Outline

1. Introduction
2. Previous works
3. Structured document
4. Modeling documents with Bayesian networks
4.1. Tree-like model for structured document classification
4.2. Classifying document parts
5. Learning
6. Improving discriminant abilities using Fisher kernel
6.1. Fisher score and Fisher kernel
6.2. Fisher score for the Bayesian network model
6.3. Computational considerations
7. Considering different types of information: text and image
8. Experiments
8.1. INEX: a large XML corpus
8.1.1. Evaluation measure
8.1.1.1. Multi-class single label categorization
8.1.1.2. Ranking
8.1.1.3. Results
8.2. HTML corpus: webKB
8.3. Multimedia corpus: NetProtect corpus
9. Conclusion
References












