Neural Networks

Volume 12, Issue 9, November 1999, Pages 1229–1252

Contributed article
Improved learning algorithms for mixture of experts in multiclass classification

https://doi.org/10.1016/S0893-6080(99)00043-X

Abstract

Mixture of experts (ME) is a modular neural network architecture for supervised learning. A double-loop Expectation-Maximization (EM) algorithm has been introduced to the ME architecture for adjusting the parameters, with the iteratively reweighted least squares (IRLS) algorithm used to perform maximization in the inner loop [Jordan, M.I., & Jacobs, R.A. (1994). Hierarchical mixtures of experts and the EM algorithm, Neural Computation, 6(2), 181–214]. However, it has been reported in the literature that the IRLS algorithm is unstable, and that the ME architecture trained by the EM algorithm, with the IRLS algorithm in the inner loop, often produces poor performance in multiclass classification. In this paper, the reason for this instability is explored. We find that, owing to an incorrect assumption on parameter independence implicitly imposed in multiclass classification, an incomplete Hessian matrix is used in that IRLS algorithm. Based on this finding, we apply the Newton–Raphson method, which adopts the exact Hessian matrix, to the inner loop of the EM algorithm in the case of multiclass classification. To tackle the expensive computation of the Hessian matrix and its inverse, we propose an approximation to the Newton–Raphson algorithm based on a so-called generalized Bernoulli density. The Newton–Raphson algorithm and its approximation have been applied to synthetic, benchmark, and real-world multiclass classification tasks. For comparison, the IRLS algorithm and a quasi-Newton algorithm called BFGS have also been applied to the same tasks. Simulation results show that the proposed learning algorithms avoid the instability problem and enable the ME architecture to produce good performance in multiclass classification. In particular, our approximation algorithm leads to fast learning. In addition, the limitation of our approximation algorithm is empirically investigated in this paper.

Introduction

There has recently been widespread interest in the use of multiple models for pattern classification and regression in the statistics and neural network communities. The basic idea underlying these methods is the divide-and-conquer principle, which tackles a complex problem by dividing it into simpler problems whose solutions can be combined to yield a final solution. Utilizing this principle, Jacobs, Jordan, Nowlan and Hinton (1991) proposed a modular neural network architecture called mixture of experts (ME). It consists of several expert networks trained on different partitions of the input space. The ME architecture produces its output by weighting the expert networks according to probabilities computed from the input; the outputs of the expert networks are combined by a gating network, trained simultaneously, that stochastically selects the expert performing best on the current input. The gating network is realized by the multinomial logit, or softmax, function (Bridle, 1989). As pointed out by Jordan and Jacobs (1994), the gating network performs a typical multiclass classification task. The ME architecture has been extended to a hierarchical structure called hierarchical mixtures of experts (HME) (Jordan & Jacobs, 1994). Moreover, Jordan and Jacobs (1994) introduced the Expectation-Maximization (EM) algorithm (Dempster, Laird & Rubin, 1977) to both the ME and the HME architectures so that the learning process is decoupled in a manner that fits well with the modular structure. The favorable properties of the EM algorithm have been shown by theoretical analyses (Jordan & Xu, 1995; Xu & Jordan, 1996). In the ME architecture, the EM algorithm decomposes the original complicated maximum likelihood problem into several separate maximum likelihood problems in the E-step and solves these problems in the M-step (Jordan & Jacobs, 1994). Since these optimization problems are not usually analytically solvable, the EM algorithm is in general a double-loop procedure. To tackle these separate optimization problems in the inner loop, Jordan and Jacobs (1994) proposed an iteratively reweighted least squares (IRLS) algorithm for both regression and pattern classification.
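
As a concrete illustration of this combination, the following is a minimal sketch of a single forward pass through an ME model with linear experts and a softmax gating network. It is an illustrative reading of the architecture described above, not the authors' code, and all names and dimensions are assumptions.

    import numpy as np

    def softmax(z):
        # numerically stable softmax: subtract the maximum before exponentiating
        e = np.exp(z - np.max(z))
        return e / e.sum()

    def me_forward(x, expert_weights, gate_weights):
        """One forward pass of a mixture of experts (illustrative only).

        x              : input vector, shape (d,)
        expert_weights : list of J matrices, each (c, d), one per linear expert
        gate_weights   : matrix of shape (J, d) for the softmax gating network
        """
        g = softmax(gate_weights @ x)                     # mixing proportions, sum to 1
        expert_outputs = [W @ x for W in expert_weights]  # each expert's output, shape (c,)
        # blend the expert outputs with the gating probabilities
        return sum(gj * yj for gj, yj in zip(g, expert_outputs))

The gating outputs g form a partition of unity at every point of the input space, which matches the description of the gating network given above.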

The ME architecture with the EM algorithm has already been applied to both regression and pattern classification (for a review see Waterhouse, 1997). However, empirical studies indicate that the use of the IRLS algorithm in the inner loop of the EM algorithm causes the ME architecture to produce unstable performance. In particular, the problem becomes rather serious in multiclass classification. Our earlier studies show that an ME or HME architecture often cannot reach a steady state when the IRLS algorithm is used to solve the maximization problems in the inner loop of the EM algorithm. Further observation shows that, when the IRLS algorithm is used in the inner loop, the log-likelihood corresponding to an ME architecture does not increase monotonically during parameter estimation (Chen et al., 1995, Chen et al., 1996b). Similar problems have also been mentioned in the literature (Ramamurti & Ghosh, 1996; Ramamurti & Ghosh, 1997). As a result, Waterhouse (1993) transformed a multiclass classification task into several binary classification subtasks for speech recognition to implicitly avoid the instability problem. Alternatively, Chen et al. (1996b) used a so-called generalized Bernoulli density as the statistical model of expert networks for multiclass classification and applied such ME and HME classifiers to speaker identification.

Xu and Jordan (1994), and Xu, Jordan and Hinton (1995) proposed an alternative ME model, in which a localized gating network is employed so that parameter estimation in the gating network is analytically solvable. For a regression task, the IRLS algorithm is avoided in the alternative ME model so that the EM algorithm becomes a single-loop procedure, and empirical studies show that the alternative ME model can reach a steady state and yields fast learning (Xu et al., 1995; Ramamurti & Ghosh, 1997). It has also been shown that, as a special case, the alternative ME model covers a class of radial basis function networks, so that these networks can be trained by either a batch or an adaptive EM-like algorithm (Xu, 1996, Xu, 1998). In principle, the alternative ME model is also applicable to a multiclass classification task with a stable solution, as long as each expert has a structure similar to that of the gating network proposed by Xu and others (Xu et al., 1995; Xu & Jordan, 1994).

Since the original ME model with the IRLS algorithm for learning has been widely used in the literature, and the reason behind the instability of the IRLS algorithm for multiclass classification has remained unknown, it is important to investigate the intrinsic cause. In this paper, we find that an incorrect assumption on parameter independence for multiclass classification is implicitly imposed, which results in the use of an incomplete Hessian matrix in the IRLS algorithm and causes the aforementioned instability of learning. On the basis of this investigation, we propose a Newton–Raphson algorithm to replace the original IRLS algorithm (Jordan & Jacobs, 1994) in the inner loop of the EM algorithm for multiclass classification. Using the proposed learning algorithm, we show that the use of the exact Hessian matrix makes the ME architecture perform well in multiclass classification. However, the use of the exact Hessian matrix can lead to expensive computation during learning. To speed up learning, we propose an approximation to the Newton–Raphson algorithm by introducing an approximate statistical model for expert networks in multiclass classification. To demonstrate their effectiveness, we have used the proposed learning algorithms in the inner loop of the EM algorithm on synthetic, benchmark, and real-world multiclass classification tasks. Simulation results show that the proposed learning algorithms make the ME architecture produce satisfactory performance on those tasks; in particular, the proposed approximation algorithm yields significantly faster learning. For comparison, we have also applied the IRLS algorithm and a quasi-Newton algorithm called BFGS, respectively, in the inner loop of the EM algorithm to train the ME architecture on the same tasks. Comparative results show that the ME architecture yields poor performance when the IRLS algorithm is used in the inner loop of the EM algorithm, and that the BFGS algorithm does not yield significantly faster learning than the proposed algorithms.
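
To see why the parameter-independence assumption matters, consider a single multinomial logit (softmax) model with weight rows $w_1,\ldots,w_K$, which is the form taken by the gating network and, in multiclass classification, by each expert. The derivation below is a standard one written in our own notation, not a transcription of the paper's equations; in the EM inner loop each term would additionally carry the posterior weight of the corresponding expert, which we omit for clarity. With class probabilities
$$p_k(x)=\frac{\exp(w_k^{\mathsf T}x)}{\sum_{j=1}^{K}\exp(w_j^{\mathsf T}x)},$$
the log-likelihood $\ell=\sum_n\sum_k y_{nk}\log p_k(x_n)$ has gradient blocks $\partial\ell/\partial w_k=\sum_n\bigl(y_{nk}-p_k(x_n)\bigr)x_n$ and exact Hessian blocks
$$\frac{\partial^2\ell}{\partial w_k\,\partial w_j^{\mathsf T}}=-\sum_n p_k(x_n)\bigl(\delta_{kj}-p_j(x_n)\bigr)\,x_n x_n^{\mathsf T}.$$
The off-diagonal blocks ($k\neq j$) are nonzero in general, so the rows $w_k$ cannot be treated as independent parameter vectors; retaining only the diagonal blocks $-\sum_n p_k(x_n)\bigl(1-p_k(x_n)\bigr)x_n x_n^{\mathsf T}$ yields the incomplete Hessian referred to above.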

The remainder of this paper is organized as follows. Section 2 briefly reviews the mixture of experts architecture and the EM algorithm. Section 3 analyzes why the IRLS algorithm causes the ME architecture to produce poor performance in multiclass classification. Section 4 proposes a Newton–Raphson algorithm for the inner loop of the EM algorithm in multiclass classification and relates it to the IRLS algorithm. Section 5 presents an approximation to the Newton–Raphson algorithm that speeds up learning. Simulation results are reported in Section 6. Further discussion is given in Section 7, and conclusions are drawn in the final section.

Section snippets

Mixtures of experts and EM algorithm

To make this paper self-contained, we briefly review the ME architecture (Jacobs et al., 1991) and the EM algorithm (Jordan & Jacobs, 1994) in this section.

As illustrated in Fig. 1, the ME architecture is composed of a gating network and several expert networks. The gating network receives the vector x as input and produces scalar outputs that form a partition of unity at each point in the input space. Each expert network produces an output vector for a given input vector. The gating network provides

The IRLS algorithm

Apparently, the performance of an EM algorithm depends heavily on the solutions to those separate maximization problems. As pointed out by Jordan and Jacobs (1994), the separate maximization problems belong to the class of IRLS problems (McCullagh & Nelder, 1983). To tackle those separate optimization problems, Jordan and Jacobs (1994) propose an IRLS algorithm for all the generalized linear models used in the ME architecture. To explore the reason for the instability of the IRLS algorithm in multiclass

A Newton–Raphson algorithm and its relation to the IRLS algorithm

In this section, we first propose a learning algorithm based on the Newton–Raphson method for use in the inner loop of the EM algorithm, and then discuss the relation between the IRLS algorithm and the proposed learning algorithm. This is followed by a multiclass classification example demonstrating that the use of the exact Hessian matrix makes the ME architecture perform well, while the use of the IRLS algorithm suggested by Jordan and Jacobs (1994) causes the ME architecture to produce the
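
For reference, the generic Newton–Raphson update applied in such an inner loop takes the familiar textbook form (our notation, not a transcription of the paper's equations), where $\theta$ collects all entries of an expert's weight matrix and $Q$ is the objective maximized in the M-step:
$$\theta^{(s+1)}=\theta^{(s)}-\Bigl[\nabla_{\theta}^{2}Q\bigl(\theta^{(s)}\bigr)\Bigr]^{-1}\nabla_{\theta}Q\bigl(\theta^{(s)}\bigr).$$
The essential point is that $\nabla_{\theta}^{2}Q$ is the full Hessian over all rows of the weight matrix jointly, rather than only its diagonal blocks.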

Fast learning algorithm

It is well known that the Newton–Raphson method generally suffers from a high computational burden, since the second derivatives of the objective function and the inverse of the Hessian matrix are required to update the parameters. Our simulation results presented in Section 4 also indicate that expensive computation can be involved in our learning algorithm. To tackle the expensive computation in the Newton–Raphson algorithm, we propose an approximation algorithm for the ME architecture
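
As a rough illustration of where the cost arises, the sketch below compares a single exact Newton solve with the full (K·d)×(K·d) Hessian of a K-class softmax expert against K independent solves with d×d diagonal blocks. It is an illustrative sketch under our own assumptions, not the paper's generalized-Bernoulli approximation, which obtains its speed-up through an approximate statistical model rather than by simply truncating the Hessian.

    import numpy as np

    def newton_solve_full(H_full, grad_full):
        # exact Newton-Raphson direction: one linear solve with the (K*d, K*d)
        # Hessian, roughly O((K*d)^3) time and O((K*d)^2) memory
        return np.linalg.solve(H_full, grad_full)

    def newton_solve_blockdiag(H_blocks, grad_blocks):
        # block-diagonal approximation: K independent solves with d x d blocks,
        # roughly O(K * d^3) time, i.e. about a factor of K^2 cheaper
        return [np.linalg.solve(Hk, gk) for Hk, gk in zip(H_blocks, grad_blocks)]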

Simulations

In this section, we report simulation results on synthetic, benchmark, and real-world multiclass classification problems using the ME architecture together with the EM algorithm, where the Newton–Raphson algorithm and its approximation were used in the inner loop, respectively. For comparison, the IRLS method (Jordan & Jacobs, 1994) was also used in the inner loop of the EM algorithm to train the ME architecture on the same problems. On the other hand, a class of methods called quasi-Newton

Discussion

As a realization of the multinomial density for multiclass classification, the multinomial logit or softmax function results in mutual dependency among the components of an output vector. Owing to this dependency, the rows of the weight matrix of a component network in the ME architecture cannot be viewed as independent, separable parameter vectors. Therefore, exact evaluation of the Hessian matrix must consider all the parameters in the weight matrix simultaneously. As a

Concluding remarks

We have investigated the reason why the ME architecture produces poor performance in multiclass classification when the IRLS algorithm is used in the inner loop of the EM algorithm to train it. We find that an incorrect assumption on parameter independence is implicitly imposed for multiclass classification, which causes an incomplete Hessian matrix to be used in the second-order IRLS algorithm. The use of the incomplete Hessian matrix is responsible for the

Acknowledgements

The authors are grateful to the anonymous referees for their comments, which improved the presentation of this paper. This work was supported by the HK RGC Earmarked Grant CUHK 250/94E and the National Science Foundation of China.

References (38)

  • J. Bridle

    Probabilistic interpretation of feedforward classification network outputs, with relationships to statistical pattern recognition

  • C.G. Broyden

    The convergence of a class of double rank minimization algorithms

    Journal of the Institute of Mathematics and Its Applications

    (1970)
  • J.P. Campbell

    Speaker recognition: a tutorial

    Proceedings of the IEEE

    (1997)
  • K. Chen et al.

    Speaker identification based on hierarchical mixture of experts

    Proceedings of World Congress on Neural Networks, Washington, DC

    (1995)
  • Chen, K., Xie, D., & Chi, H. (1996). A modified HME architecture for text-dependent speaker identification. IEEE...
  • K. Chen et al.

    Speaker identification using time-delay HMEs

    International Journal of Neural Systems

    (1996)
  • A.P. Dempster et al.

    Maximum likelihood from incomplete data via the EM algorithm

    Journal of the Royal Statistical Society B

    (1977)
  • R. Duda et al.

    Pattern classification and scene analysis

    (1973)
  • R.A. Fisher

    The use of multiple measurements in taxonomic problems

    Annals of Eugenics

    (1936)