Neural Networks

Volume 12, Issue 9, November 1999, Pages 1229–1252

Contributed article
Improved learning algorithms for mixture of experts in multiclass classification

https://doi.org/10.1016/S0893-6080(99)00043-X

Abstract

Mixture of experts (ME) is a modular neural network architecture for supervised learning. A double-loop Expectation-Maximization (EM) algorithm has been introduced to the ME architecture for adjusting the parameters, with the iteratively reweighted least squares (IRLS) algorithm used to perform maximization in the inner loop [Jordan, M.I., & Jacobs, R.A. (1994). Hierarchical mixtures of experts and the EM algorithm, Neural Computation, 6(2), 181–214]. However, it has been reported in the literature that the IRLS algorithm is unstable, and that the ME architecture trained by the EM algorithm, with the IRLS algorithm in the inner loop, often produces poor performance in multiclass classification. In this paper, the reason for this instability is explored. We find that, owing to an incorrect assumption on parameter independence implicitly imposed in multiclass classification, an incomplete Hessian matrix is used in that IRLS algorithm. Based on this finding, we apply the Newton–Raphson method, which adopts the exact Hessian matrix, to the inner loop of the EM algorithm in the case of multiclass classification. To tackle the expensive computation of the Hessian matrix and its inverse, we propose an approximation to the Newton–Raphson algorithm based on a so-called generalized Bernoulli density. The Newton–Raphson algorithm and its approximation have been applied to synthetic, benchmark, and real-world multiclass classification tasks. For comparison, the IRLS algorithm and a quasi-Newton algorithm called BFGS have also been applied to the same tasks. Simulation results show that the proposed learning algorithms avoid the instability problem and enable the ME architecture to produce good performance in multiclass classification. In particular, our approximation algorithm leads to fast learning. In addition, the limitation of our approximation algorithm is empirically investigated in this paper.

Introduction

There has recently been widespread interest in the use of multiple models for pattern classification and regression in the statistics and neural network communities. The basic idea underlying these methods is the divide-and-conquer principle, which tackles a complex problem by dividing it into simpler problems whose solutions can be combined to yield a final solution. Utilizing this principle, Jacobs, Jordan, Nowlan and Hinton (1991) proposed a modular neural network architecture called mixture of experts (ME). It consists of several expert networks trained on different partitions of the input space. The ME architecture produces its output by weighting the expert networks according to probabilities computed from the input; the outputs of the expert networks are combined by a gating network, trained simultaneously, that stochastically selects the expert performing best on the current input. The gating network is realized by the multinomial logit, or softmax, function (Bridle, 1989). As pointed out by Jordan and Jacobs (1994), the gating network performs a typical multiclass classification task. The ME architecture has been extended to a hierarchical structure called hierarchical mixtures of experts (HME) (Jordan & Jacobs, 1994). Moreover, Jordan and Jacobs (1994) introduced the Expectation-Maximization (EM) algorithm (Dempster, Laird & Rubin, 1977) to both the ME and the HME architectures so that the learning process is decoupled in a manner that fits well with the modular structure. The favorable properties of the EM algorithm have been shown by theoretical analyses (Jordan & Xu, 1995; Xu & Jordan, 1996). In the ME architecture, the EM algorithm decomposes the original complicated maximum likelihood problem into several separate maximum likelihood problems in the E-step and solves these problems in the M-step (Jordan & Jacobs, 1994). Since these optimization problems are not usually analytically solvable, the EM algorithm is in general a double-loop procedure. To tackle these separate optimization problems in the inner loop, Jordan and Jacobs (1994) proposed an iteratively reweighted least squares (IRLS) algorithm for both regression and pattern classification.
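
As a concrete illustration of this combination, the following is a minimal sketch of a single forward pass through an ME model with linear experts and a softmax gating network. It is an illustrative reading of the architecture described above, not the authors' code, and all names and dimensions are assumptions.

    import numpy as np

    def softmax(z):
        # numerically stable softmax: subtract the maximum before exponentiating
        e = np.exp(z - np.max(z))
        return e / e.sum()

    def me_forward(x, expert_weights, gate_weights):
        """One forward pass of a mixture of experts (illustrative only).

        x              : input vector, shape (d,)
        expert_weights : list of J matrices, each (c, d), one per linear expert
        gate_weights   : matrix of shape (J, d) for the softmax gating network
        """
        g = softmax(gate_weights @ x)                     # mixing proportions, sum to 1
        expert_outputs = [W @ x for W in expert_weights]  # each expert's output, shape (c,)
        # blend the expert outputs with the gating probabilities
        return sum(gj * yj for gj, yj in zip(g, expert_outputs))

The gating outputs g form a partition of unity at every point of the input space, which matches the description of the gating network given above.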

The ME architecture with the EM algorithm has already been applied to both regression and pattern classification (for a review see Waterhouse, 1997). However, empirical studies indicate that the use of the IRLS algorithm in the inner loop of the EM algorithm causes the ME architecture to produce unstable performance. In particular, the problem becomes rather serious in multiclass classification. Our earlier studies show that an ME or HME architecture often cannot reach a steady state when the IRLS algorithm is used to solve the maximization problems in the inner loop of the EM algorithm. Further observation shows that, when the IRLS algorithm is used in the inner loop, the log-likelihood corresponding to an ME architecture does not increase monotonically during parameter estimation (Chen et al., 1995, Chen et al., 1996b). Similar problems have also been mentioned in the literature (Ramamurti & Ghosh, 1996; Ramamurti & Ghosh, 1997). As a result, Waterhouse (1993) transformed a multiclass classification task into several binary classification subtasks for speech recognition to implicitly avoid the instability problem. Alternatively, Chen et al. (1996b) used a so-called generalized Bernoulli density as the statistical model of expert networks for multiclass classification and applied such ME and HME classifiers to speaker identification.

Xu and Jordan (1994), and Xu, Jordan and Hinton (1995) proposed an alternative ME model, in which a localized gating network is employed so that parameter estimation in the gating network is analytically solvable. For a regression task, the IRLS algorithm is avoided in the alternative ME model so that the EM algorithm becomes a single-loop procedure, and empirical studies show that the alternative ME model can reach a steady state and yields fast learning (Xu et al., 1995; Ramamurti & Ghosh, 1997). It has also been shown that, as a special case, the alternative ME model covers a class of radial basis function networks, so that these networks can be trained by either a batch or an adaptive EM-like algorithm (Xu, 1996, Xu, 1998). In principle, the alternative ME model is also applicable to a multiclass classification task with a stable solution, as long as each expert has a structure similar to that of the gating network proposed by Xu and others (Xu et al., 1995; Xu & Jordan, 1994).

Since the original ME model with the IRLS algorithm for learning has been widely used in the literature, and the reason behind the instability of the IRLS algorithm for multiclass classification has remained unknown, it is important to investigate the intrinsic cause. In this paper, we find that an incorrect assumption on parameter independence for multiclass classification is implicitly imposed, which results in the use of an incomplete Hessian matrix in the IRLS algorithm and causes the aforementioned instability of learning. On the basis of this investigation, we propose a Newton–Raphson algorithm to replace the original IRLS algorithm (Jordan & Jacobs, 1994) in the inner loop of the EM algorithm for multiclass classification. Using the proposed learning algorithm, we show that the use of the exact Hessian matrix makes the ME architecture perform well in multiclass classification. However, the use of the exact Hessian matrix can lead to expensive computation during learning. To speed up learning, we propose an approximation to the Newton–Raphson algorithm by introducing an approximate statistical model for expert networks in multiclass classification. To demonstrate their effectiveness, we have used the proposed learning algorithms in the inner loop of the EM algorithm on synthetic, benchmark, and real-world multiclass classification tasks. Simulation results show that the proposed learning algorithms make the ME architecture produce satisfactory performance on those tasks; in particular, the proposed approximation algorithm yields significantly faster learning. For comparison, we have also applied the IRLS algorithm and a quasi-Newton algorithm called BFGS, respectively, in the inner loop of the EM algorithm to train the ME architecture on the same tasks. Comparative results show that the ME architecture yields poor performance when the IRLS algorithm is used in the inner loop of the EM algorithm, and that the BFGS algorithm does not yield significantly faster learning than the proposed algorithms.
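
To see why the parameter-independence assumption matters, consider a single multinomial logit (softmax) model with weight rows $w_1,\ldots,w_K$, which is the form taken by the gating network and, in multiclass classification, by each expert. The derivation below is a standard one written in our own notation, not a transcription of the paper's equations; in the EM inner loop each term would additionally carry the posterior weight of the corresponding expert, which we omit for clarity. With class probabilities
$$p_k(x)=\frac{\exp(w_k^{\mathsf T}x)}{\sum_{j=1}^{K}\exp(w_j^{\mathsf T}x)},$$
the log-likelihood $\ell=\sum_n\sum_k y_{nk}\log p_k(x_n)$ has gradient blocks $\partial\ell/\partial w_k=\sum_n\bigl(y_{nk}-p_k(x_n)\bigr)x_n$ and exact Hessian blocks
$$\frac{\partial^2\ell}{\partial w_k\,\partial w_j^{\mathsf T}}=-\sum_n p_k(x_n)\bigl(\delta_{kj}-p_j(x_n)\bigr)\,x_n x_n^{\mathsf T}.$$
The off-diagonal blocks ($k\neq j$) are nonzero in general, so the rows $w_k$ cannot be treated as independent parameter vectors; retaining only the diagonal blocks $-\sum_n p_k(x_n)\bigl(1-p_k(x_n)\bigr)x_n x_n^{\mathsf T}$ yields the incomplete Hessian referred to above.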

The remainder of this paper is organized as follows. Section 2 briefly reviews the mixture of experts architecture and the EM algorithm. Section 3 analyzes why the IRLS algorithm causes the ME architecture to produce poor performance in multiclass classification. Section 4 proposes a Newton–Raphson algorithm for the inner loop of the EM algorithm in multiclass classification and relates it to the IRLS algorithm. Section 5 presents an approximation to the Newton–Raphson algorithm that speeds up learning. Simulation results are reported in Section 6. Further discussion is given in Section 7, and conclusions are drawn in the final section.

Section snippets

Mixtures of experts and EM algorithm

To make this paper self-contained, we briefly review the ME architecture (Jacobs et al., 1991) and the EM algorithm (Jordan & Jacobs, 1994) in this section.

As illustrated in Fig. 1, the ME architecture is composed of a gating network and several expert networks. The gating network receives the vector x as input and produces scalar outputs that form a partition of unity at each point in the input space. Each expert network produces an output vector for a given input vector. The gating network provides

The IRLS algorithm

Apparently, the performance of an EM algorithm depends heavily on the solutions to those separate maximization problems. As pointed out by Jordan and Jacobs (1994), the separate maximization problems belong to the class of IRLS problems (McCullagh & Nelder, 1983). To tackle those separate optimization problems, Jordan and Jacobs (1994) propose an IRLS algorithm for all the generalized linear models used in the ME architecture. To explore the reason for the instability of the IRLS algorithm in multiclass

A Newton–Raphson algorithm and its relation to the IRLS algorithm

In this section, we first propose a learning algorithm based on the Newton–Raphson method for use in the inner loop of the EM algorithm, and then discuss the relation between the IRLS algorithm and the proposed learning algorithm. This is followed by a multiclass classification example demonstrating that the use of the exact Hessian matrix makes the ME architecture perform well, while the use of the IRLS algorithm suggested by Jordan and Jacobs (1994) causes the ME architecture to produce the
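
For reference, the generic Newton–Raphson update applied in such an inner loop takes the familiar textbook form (our notation, not a transcription of the paper's equations), where $\theta$ collects all entries of an expert's weight matrix and $Q$ is the objective maximized in the M-step:
$$\theta^{(s+1)}=\theta^{(s)}-\Bigl[\nabla_{\theta}^{2}Q\bigl(\theta^{(s)}\bigr)\Bigr]^{-1}\nabla_{\theta}Q\bigl(\theta^{(s)}\bigr).$$
The essential point is that $\nabla_{\theta}^{2}Q$ is the full Hessian over all rows of the weight matrix jointly, rather than only its diagonal blocks.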

Fast learning algorithm

It is well known that the Newton–Raphson method generally suffers from a high computational burden, since the second derivatives of the objective function and the inverse of the Hessian matrix are required to update the parameters. Our simulation results presented in Section 4 also indicate that expensive computation can be involved in our learning algorithm. To tackle the expensive computation in the Newton–Raphson algorithm, we propose an approximation algorithm for the ME architecture
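
As a rough illustration of where the cost arises, the sketch below compares a single exact Newton solve with the full (K·d)×(K·d) Hessian of a K-class softmax expert against K independent solves with d×d diagonal blocks. It is an illustrative sketch under our own assumptions, not the paper's generalized-Bernoulli approximation, which obtains its speed-up through an approximate statistical model rather than by simply truncating the Hessian.

    import numpy as np

    def newton_solve_full(H_full, grad_full):
        # exact Newton-Raphson direction: one linear solve with the (K*d, K*d)
        # Hessian, roughly O((K*d)^3) time and O((K*d)^2) memory
        return np.linalg.solve(H_full, grad_full)

    def newton_solve_blockdiag(H_blocks, grad_blocks):
        # block-diagonal approximation: K independent solves with d x d blocks,
        # roughly O(K * d^3) time, i.e. about a factor of K^2 cheaper
        return [np.linalg.solve(Hk, gk) for Hk, gk in zip(H_blocks, grad_blocks)]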

Simulations

In this section, we report simulation results on synthetic, benchmark, and real-world multiclass classification problems using the ME architecture together with the EM algorithm, where the Newton–Raphson algorithm and its approximation were used in the inner loop, respectively. For comparison, the IRLS method (Jordan & Jacobs, 1994) was also used in the inner loop of the EM algorithm to train the ME architecture on the same problems. On the other hand, a class of methods called quasi-Newton

Discussion

As a realization of the multinomial density for multiclass classification, the multinomial logit or softmax function results in mutual dependency among the components of an output vector. Owing to this dependency, the rows of the weight matrix of a component network in the ME architecture cannot be viewed as independent, separable parameter vectors. Therefore, exact evaluation of the Hessian matrix must consider all the parameters in the weight matrix simultaneously. As a

Concluding remarks

We have investigated the reason why the ME architecture produces poor performance in multiclass classification when the IRLS algorithm is used in the inner loop of the EM algorithm to train it. We find that an incorrect assumption on parameter independence is implicitly imposed for multiclass classification, which causes an incomplete Hessian matrix to be used in the second-order IRLS algorithm. The use of the incomplete Hessian matrix is responsible for the

Acknowledgements

The authors are grateful to the anonymous referees for their comments, which improved the presentation of this paper. This work was supported by the HK RGC Earmarked Grant CUHK 250/94E and the National Science Foundation of China.

References (38)

  • J. Bridle

    Probabilistic interpretation of feedforward classification network outputs, with relationships to statistical pattern recognition

  • C.G. Broyden

    The convergence of a class of double rank minimization algorithms

    Journal of the Institute of Mathematics and Its Applications

    (1970)
  • J.P. Campbell

    Speaker recognition: a tutorial

    Proceedings of the IEEE

    (1997)
  • K. Chen et al.

    Speaker identification based on hierarchical mixture of experts

    Proceedings of World Congress on Neural Networks, Washington, DC

    (1995)
  • Chen, K., Xie, D., & Chi, H. (1996). A modified HME architecture for text-dependent speaker identification. IEEE...
  • K. Chen et al.

    Speaker identification using time-delay HMEs

    International Journal of Neural Systems

    (1996)
  • A.P. Dempster et al.

    Maximum likelihood from incomplete data via the EM algorithm

    Journal of the Royal Statistical Society B

    (1977)
  • R. Duda et al.

    Pattern classification and scene analysis

    (1973)
  • R.A. Fisher

    The use of multiple measurements in taxonomic problems

    Annals of Eugenics

    (1936)