Bayesian network model for semi-structured document classification

https://doi.org/10.1016/j.ipm.2004.04.009

Abstract

Recently, a new community has started to emerge around the development of new information retrieval methods for searching and analyzing semi-structured and XML-like documents. The goal is to handle both content and structural information, and to deal with different types of content (text, image, etc.). We consider here the task of structured document classification. We propose a generative model, based on Bayesian networks, able to handle both structure and content. We then show how to transform this generative model into a discriminant classifier using the Fisher kernel method. The model is then extended to deal with different types of content (here text and images). The model was tested on three databases: the classical WebKB corpus composed of HTML pages, the new INEX corpus, which has become a reference in the field of ad hoc retrieval for XML documents, and a multimedia corpus of Web pages.

Introduction

Document classification is used in many different contexts in information retrieval: document filtering, word sense disambiguation, classification of documents into hierarchies like those of Yahoo!, etc. The field has developed mainly over the last ten years, using techniques originating from the pattern recognition and machine learning communities. Almost all classification techniques proposed in recent years (e.g. neural networks, support vector machines, decision trees, decision lists) have been tested on this problem. All these methods operate on flat text representations and do not consider text structure. Some attempts have recently been made to relax the traditional word independence assumption: Denoyer, Zaragoza, and Gallinari (2001), for example, consider a limited form of sequence information and use hidden Markov models for text and passage classification. Sebastiani (2002) gives a very good survey of the literature on textual document classification.

With the development of structured textual and multimedia documents, and with the increasing importance of structured document formats like XML, the nature of documents is changing. Structured documents usually have a much richer representation than flat ones: they have a logical structure, they allow the incorporation of additional information such as metadata, and they are often composed of heterogeneous information sources (e.g. text, image, video). The development of classifiers for structured content is a new challenge for the machine learning and IR communities. Since this is a new area, there is not yet a consensus on what the main tasks and challenges of structured document classification are. A major change with structured documents compared to flat documents is the possibility of accessing document elements or fragments. Accordingly, a classifier for structured documents should be able to classify both full documents and document parts.
It is also important to be able to make use of the different content information sources present in an XML document. A classifier should then easily adapt to a variety of different sources. A final requirement is that the system be able to scale with large document collections.

We propose here a new model for the classification of structured documents. It is a generative model based on Bayesian networks: each document is modelled by a Bayesian network whose size is proportional to the size of the document, and classification then amounts to performing inference in this network. The model is able to take into account both the structure of the document and different types of content information. It also allows one to perform inference either on whole documents or on document parts taken in their context, which goes beyond the capabilities of classical classifier schemes. In this paper, the elements we consider are defined by the logical structure of the document; they typically correspond to the different components of an XML document. Different types of Bayesian models could be used for the documents. To keep computation at a reasonable level of complexity and to allow robust parameter estimation, we restrict ourselves to simple models exploiting local structural dependencies. We further show how these generative models can be turned into discriminant classifiers using Fisher kernels. In doing so, we lose part of the potential of the Bayesian network model: compared to the latter, the Fisher kernel classifier does not offer a natural framework for classifying document fragments, but it increases the classification accuracy for full documents. It could also be trained to classify predefined fragment types at the price of increased complexity.
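Concretely, with one generative model per class, a whole document d is assigned to the class maximizing the posterior:

```latex
c^{*} = \arg\max_{c \in \mathcal{C}} P(c \mid d)
      = \arg\max_{c \in \mathcal{C}} P(c)\, P(d \mid c),
```

where P(d|c) factorizes over the Bayesian network built for document d.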

We first review previous work in Section 2; we then introduce structured documents in Section 3 and our core Bayesian network model in Section 4. We describe in Section 5 how to learn the network parameters from a document corpus. We introduce the Fisher kernels in Section 6. In Section 7 we show on an example how the model may be used with different types of content information. We then describe tests on three collections (Section 8): a classical benchmark where textual documents correspond to academic Web sites, a large corpus of XML documents, and a large collection of Web sites where both the textual and image content are considered for classification.

Section snippets

Previous work

Handling structured documents for different IR tasks has recently attracted increasing attention. However, it rapidly became apparent that designing new information retrieval systems able to handle structured documents is far from trivial. Many questions remain open, and we are only in the early stages of this development. Most of the work in this new area has concentrated on ad hoc retrieval. Two recent SIGIR workshops (2000 and 2002) were dedicated to …

Structured document

In the following, we will consider that a document is a tree where each node represents a structural entity. This corresponds to the usual representation of XML documents and this is also the classical structured document representation. A node in the tree will contain two types of information:

  • A label, which represents the type of the structural entity. A label could be for example paragraph, section, introduction, title, etc. The set of labels depends on the document corpora we are …

Modeling documents with Bayesian networks

We will now describe the probabilistic structured models used for the documents.

Let us first define the notations:

  • Let C be a discrete random variable which represents a class from the set of classes 𝒞.

  • Let Λ be the set of all the possible labels for a structural node.

  • Let V be the set of all the possible words. V* denotes the set of all possible word sequences, including the empty one.

  • Let d be a structured document consisting of a set of features (s_d^1, …, s_d^{|d|}, t_d^1, …, t_d^{|d|}), where s_d^i is the label …
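Under this notation, a structured document is a tree of labelled nodes, each carrying a word sequence. A minimal sketch of such a representation (the class and field names are illustrative, not from the paper):

```python
from dataclasses import dataclass, field
from typing import List, Optional

@dataclass
class Node:
    """One structural entity of the document tree."""
    label: str                                       # structural label s_d^i, e.g. "section"
    words: List[str] = field(default_factory=list)   # textual content t_d^i as a word sequence
    children: List["Node"] = field(default_factory=list)
    parent: Optional["Node"] = None

def add_child(parent: Node, child: Node) -> Node:
    """Attach child under parent, recording the parent link used by the model."""
    child.parent = parent
    parent.children.append(child)
    return child

# Example: a tiny XML-like document
doc = Node("document")
title = add_child(doc, Node("title", ["bayesian", "networks"]))
sec = add_child(doc, Node("section", ["classification", "of", "documents"]))
```

The parent link is what the local structural dependencies of the model are conditioned on.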

Learning

In order to estimate the joint probability of each document and each class, the model parameters must be learned from a training set of documents. Let us define the θ parameters as

θ = ⋃_c ( {θ_{c,s}^{n,m} : n ∈ Λ, m ∈ Λ} ∪ {θ_{c,w}^{n,m} : n ∈ V, m ∈ Λ} )

where θ_{c,s}^{n,m} is the estimate of P(s_d^i = n | pa(s_d^i) = m, c) and θ_{c,w}^{n,m} is the estimate of P(w_d^{i,k} = n | s_d^i = m, c). The subscript s in θ_{·,s}^{·,·} indicates a structural parameter and w in θ_{·,w}^{·,·} a textual parameter. Note that for the classification task we are dealing with, there is one set of …
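For a fully observed tree, maximum-likelihood estimates of these parameters reduce to (smoothed) relative counts over the training set: how often label n appears under parent label m in class c, and how often word n occurs in nodes labelled m in class c. A sketch under that assumption, with documents given as (label, parent_label, words) triples per node (the input format and Laplace smoothing are illustrative choices, not the paper's exact procedure):

```python
from collections import defaultdict

def estimate_parameters(training_set, alpha=1.0):
    """training_set: iterable of (class_label, nodes) pairs, where each node
    is a (label, parent_label, words) triple. Returns smoothed count-based
    estimators for theta_{c,s}^{n,m} and theta_{c,w}^{n,m}."""
    struct_counts = defaultdict(float)   # (c, n, m) -> occurrences of label n under parent m
    word_counts = defaultdict(float)     # (c, n, m) -> occurrences of word n in nodes labelled m
    struct_totals = defaultdict(float)   # (c, m) -> children of parent label m
    word_totals = defaultdict(float)     # (c, m) -> words in nodes labelled m
    labels, vocab = set(), set()

    for c, nodes in training_set:
        for label, parent, words in nodes:
            labels.add(label)
            struct_counts[(c, label, parent)] += 1
            struct_totals[(c, parent)] += 1
            for w in words:
                vocab.add(w)
                word_counts[(c, w, label)] += 1
                word_totals[(c, label)] += 1

    def theta_s(c, n, m):
        # P(s_d^i = n | pa(s_d^i) = m, c), Laplace-smoothed
        return (struct_counts[(c, n, m)] + alpha) / (struct_totals[(c, m)] + alpha * len(labels))

    def theta_w(c, n, m):
        # P(w_d^{i,k} = n | s_d^i = m, c), Laplace-smoothed
        return (word_counts[(c, n, m)] + alpha) / (word_totals[(c, m)] + alpha * len(vocab))

    return theta_s, theta_w
```

Smoothing matters here because the number of (label, parent, class) combinations grows quickly with the label set, so many counts are zero even on a large corpus.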

Improving discriminant abilities using Fisher kernel

In order to improve the discriminative abilities of generative models, Jaakkola, Diekhans, and Haussler (1999) proposed a new method based on the Fisher scores. They developed this method for classifying sequences modeled with HMMs and this led to a significant performance increase on biological data. We show below how this idea naturally extends to tree generative models.
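The Fisher score of a document is the gradient of its log-likelihood with respect to the model parameters, ∇_θ log P(d|θ); documents are then compared through a kernel on these score vectors and fed to a discriminant classifier. A minimal sketch for a generic parametric log-likelihood, using numerical gradients (the paper derives the gradients analytically for its tree model, and a practical kernel often normalizes by the Fisher information matrix, dropped here):

```python
import numpy as np

def fisher_score(log_likelihood, doc, theta, eps=1e-5):
    """Central-difference approximation of grad_theta log P(doc | theta)."""
    theta = np.asarray(theta, dtype=float)
    grad = np.zeros_like(theta)
    for i in range(theta.size):
        step = np.zeros_like(theta)
        step.flat[i] = eps
        grad.flat[i] = (log_likelihood(doc, theta + step)
                        - log_likelihood(doc, theta - step)) / (2 * eps)
    return grad

def fisher_kernel(doc_a, doc_b, log_likelihood, theta):
    """Plain dot-product Fisher kernel (information matrix approximated
    by the identity, a common simplification in practice)."""
    return float(fisher_score(log_likelihood, doc_a, theta)
                 @ fisher_score(log_likelihood, doc_b, theta))
```

The resulting kernel can be plugged into any kernel classifier (e.g. an SVM), which is how the generative document model is turned into a discriminant one.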

Considering different types of information: text and image

The above models could be used not only with text, but with any other type of content. The only requirement is that we have a generative model for scoring the different content types. We describe below an extension of the structured document model using a generative model for images. It will be used in Section 8.3 for classifying Web pages using both text and image information. We do not aim here to describe a state-of-the-art model for image classification. Instead, we merely want to …
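With a second generative model for images, the per-node scores of the two modalities can be combined additively in log space, assuming text and image content are conditionally independent given the node label and class. A sketch of that combination (the weighting scheme is an illustrative choice, not the paper's):

```python
def combined_log_score(text_logp, image_logp, weight=0.5):
    """Convex combination of per-modality log-likelihoods for one node.
    weight balances text vs. image evidence (illustrative parameter)."""
    return weight * text_logp + (1.0 - weight) * image_logp

def document_log_score(nodes, weight=0.5):
    """nodes: list of (text_logp, image_logp) pairs, one per structural node.
    Total document log-score under the combined model."""
    return sum(combined_log_score(t, i, weight) for t, i in nodes)
```

With weight = 1.0 this degenerates to the text-only model, so the multimedia extension strictly generalizes it.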

Experiments

The structured document model has been tested on three different corpora. We performed extensive experiments on the INEX corpus (Fuhr et al., 2002), which is a large collection of XML documents. We present additional experiments on a corpus of textual HTML pages (webKB, 1999), which has already been used by different authors for comparing flat classifiers. Finally we show how the model behaves on multimedia data. The corpus for this experiment has been gathered inside the European project …

Conclusion

We have presented a new generative model for structured documents. It is based on Bayesian networks and allows one to model both the structure and the content of documents. It has been tested here for the classical task of whole document classification. We have described how this model can be turned into a discriminant model using the Fisher kernel method. We have also shown how our model can easily be extended to take into account different types of information and have presented an example of …

References (29)

  • Journal of the American Society for Information Science and Technology (JASIST) (2002)
  • D.M. Blei et al. Modeling annotated data
  • L. Cai et al. Text categorization by boosting automatically extracted concepts
  • Callan, J. P., Croft, W. B., & Harding, S. M. (1992). The INQUERY retrieval system. In Proceedings of DEXA-92 (pp. …)
  • S. Chakrabarti et al. Enhanced hypertext categorization using hyperlinks
  • Cline, M. (1999). Utilizing HTML structure and linked pages to improve learning for text categorization. Undergraduate …
  • Denoyer, L., Zaragoza, H., & Gallinari, P. (2001). HMM-based passage models for document classification and ranking. In …
  • Diligenti, M., Gori, M., Maggini, M., & Scarselli, F. (2001). Classification of HTML documents by hidden tree-Markov …
  • S.T. Dumais et al. Hierarchical classification of Web content
  • S. Fine et al. The hierarchical hidden Markov model: analysis and applications. Machine Learning (1998)
  • Fuhr, N., Govert, N., Kazai, G., & Lalmas, M. (2002). INEX: Initiative for the evaluation of XML retrieval. In …
  • Hofmann, T. (1999a). The cluster-abstraction model: unsupervised learning of topic hierarchies from text data. In IJCAI …
  • Hofmann, T. (1999b). Probabilistic latent semantic analysis. In Proceedings of uncertainty in artificial intelligence, …
  • Hofmann, T. (2000). Learning the similarity of documents: an information-geometric approach to document retrieval and …