Pattern Recognition

Volume 81, September 2018, Pages 280-293
Variational inference based Bayes online classifiers with concept drift adaptation

https://doi.org/10.1016/j.patcog.2018.04.007

Highlights

  • A novel online classifier based on Bayes variational inference is proposed.

  • Two new concept drift adaptation techniques are proposed for the online classifier.

  • The proposed methods achieve state-of-the-art performance.

Abstract

We present VIGO, a novel online Bayesian classifier for both binary and multiclass problems. In our model, the variational inference technique for multivariate distributions is exploited to approximate the class-conditional probability density functions of the data in an online manner. To handle the concept drift that can arise in streaming data, we develop two new adaptive methods based on VIGO, which we call VIGOw and VIGOd. While VIGOw naturally adapts to any kind of changing environment, VIGOd maximises the benefit of a static environment as long as it does not detect any change. Extensive experiments on big/medium real-world/synthetic datasets demonstrate the superior performance of our algorithms over many state-of-the-art methods in the literature.

Introduction

Nowadays, data very often arrive in the form of streams. Examples of such data are easily found in many real-world applications such as network traffic, sensor networks, web searches and stock market systems. Storing large volumes of streaming data in the machine's main memory is often infeasible, and the traditional offline approach, in which predictions are made after learning from the entire training dataset at once, becomes impractical. Moreover, offline algorithms are not applicable in real-time learning scenarios where a stream of data is arriving and predictions must be made before all the data have been seen. Therefore, online learning is emerging as an efficient machine learning method for large-scale applications, especially those with streaming data. In an online learning process, predictive models can be updated after the arrival of every new data point (one-by-one) in a sequential fashion, or updates can be deferred until a group of points has arrived (minibatch-by-minibatch) to reduce the effect of noise in the data. Online methods do not require the whole dataset to be stored or loaded into memory; they make use of a single observation (or a small set of observations) and then discard it before the next observations are used.

Online learning can deal with many tasks such as classification, regression and clustering. In this paper, we focus on online classification algorithms with full feedback (i.e. supervised learning). An online classification task usually involves three main steps (a minimal sketch of this loop follows the list):

  • Predict: When a new instance $x_t$ arrives, a prediction $\hat{y}_t$ is made using the current model $L_t$.

  • Calculate the suffered loss: After the prediction is made, the true label $y_t$ is revealed, and the loss $l(y_t, \hat{y}_t)$ can be computed to measure the difference between the learner's prediction and the true label.

  • Update: Based on the suffered loss, the learner can use the sample $(x_t, y_t)$ to update the classification model ($L_t \to L_{t+1}$).
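The following is a minimal sketch of this predict/loss/update loop. The `stream`, `model` and `loss` names are illustrative only, not an API from the paper; `model` stands for the current classifier $L_t$:

```python
# A minimal sketch of the generic online classification loop.
# `model` is any object exposing hypothetical predict/update methods.

def online_loop(stream, model, loss):
    total_loss = 0.0
    for x_t, y_t in stream:             # instances arrive one by one
        y_hat = model.predict(x_t)      # 1) predict with the current model L_t
        total_loss += loss(y_t, y_hat)  # 2) suffer the loss l(y_t, y_hat)
        model.update(x_t, y_t)          # 3) update L_t -> L_{t+1}
    return total_loss
```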

From this framework, we can see that online learning algorithms avoid re-training when adding new data. Besides the requirement of accurate and rapid learning and prediction on-the-fly without storing past instances, another challenge to any online method is that it does not know in advance if the data stream is stationary with stable concepts or evolving over time with changing concepts.

In this paper, we introduce a novel online classifier (VIGO) based on the variational inference (VI) technique. In our framework, two learning phases, learning from past instances (prior information) and from recent instances (through sufficient statistics), flexibly support each other. They are also naturally separated, which offers us the opportunity to focus more on recent information or to detect concept drifts. This results in two new adaptive methods named VIGOw and VIGOd, respectively. Our algorithms are second-order generative models, where distributions are placed not only on the data of each class but also on the model parameters. They do not require storing more than a single instance in the main memory. We evaluate the performance of our proposed methods by comparing them with recent or well-known online methods, including the kernel-based DualSGD (dual space gradient descent) [1], FOGD (Fourier online gradient descent) and NOGD (Nyström online gradient descent) [2]; the ensemble-based BLAST [3]; the state-of-the-art second-order linear AROW (adaptive regularisation of weights) [4]; the widely used first-order linear PA (passive aggressive learning) [5]; and the most used decision tree, HT (Hoeffding tree) [6]. We also compare our proposed methods with ONBG (online Naïve Bayes for Gaussians) [7], a first-order generative method. Furthermore, the experiments are extended to assess the adaptability of VIGOw/d in mining evolving data streams with concept drifts. A number of recent or commonly used adaptive stream learning methods, such as SAM-kNN (kNN classifier with Self Adjusting Memory) [8], [9], kNN-PAW (kNN with probabilistic adaptive windowing) [10], the ensemble-based DACC (dynamic adaptation to concept changes) [11], and HAT (Hoeffding adaptive tree) [12], are used as benchmark algorithms.

The remainder of this paper is organised as follows. In Section 2, we briefly review online methods, especially the ones used as benchmark algorithms in this paper. After that, the background on Bayesian methods and variational inference techniques is summarised in Section 3. Online variational inference for multivariate Gaussians (VIGO) is introduced in Section 4. VIGO with a built-in concept drift detector (VIGOd) is the topic of Section 5. Section 6 introduces online variational inference weighted for multivariate Gaussians (VIGOw). Experimental results are provided in Section 7. The final section contains conclusions and suggestions for future work.

Section snippets

Online classifiers

In this section, we discuss online classifiers in general and, in more detail, the ones we use as benchmark algorithms in our experiments. From the three main steps of online classification, we can see that different online algorithms are mainly distinguished by the type of loss function $l(y_t, \hat{y}_t)$ and the way of updating $L_t \to L_{t+1}$. Given a problem instance to be classified, represented by a vector $x=(x_1,x_2,\ldots,x_D)\in\mathbb{R}^D$, linear methods use the predictive rule $\hat{y}_t=\arg\max_{i\in\{1,\ldots$
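The snippet above is cut off, but the linear predictive rule it introduces is, in the usual multiclass formulation, $\hat{y}_t=\arg\max_i w_i^\top x$ with one weight vector per class. Assuming that standard form (an assumption on our part, since the full rule is truncated), a minimal sketch:

```python
import numpy as np

# Hedged sketch of the standard multiclass linear rule assumed above:
# y_hat = argmax_i  w_i . x, with one weight vector w_i per class.
# W and x are illustrative names, not the paper's notation.
def linear_predict(W, x):
    # W: (K, D) array of per-class weight vectors; x: (D,) instance
    return int(np.argmax(W @ x))
```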

Variational inference for multivariate Gaussian

Given a problem instance to be classified, represented by a vector $x=(x_1,x_2,\ldots,x_D)\in\mathbb{R}^D$, a Bayesian classifier based on Bayes' theorem predicts the label y of x from the label set $\{1,2,\ldots,K\}$ as $$y=\arg\max_{k\in\{1,2,\ldots,K\}}p(y=k\mid x)=\arg\max_{k\in\{1,2,\ldots,K\}}p(y=k)\,p(x\mid y=k),$$ where $p(y=k\mid x)$ is the posterior probability that x belongs to class k, $p(y=k)$ is the prior probability of class k, and $p(x\mid y)$ is the class-conditional probability density function.

The class prior p(y=k) can often be estimated
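As a concrete illustration of the decision rule above, the following sketch scores each class by its log prior plus log class-conditional likelihood, assuming Gaussian class-conditional densities as in Section 4. Function and variable names are ours, not the paper's:

```python
import numpy as np
from scipy.stats import multivariate_normal

# Sketch of the Bayes decision rule y = argmax_k p(y=k) p(x|y=k),
# computed in log space for numerical stability.
def bayes_predict(x, priors, means, covs):
    scores = [
        np.log(priors[k]) + multivariate_normal.logpdf(x, means[k], covs[k])
        for k in range(len(priors))
    ]
    return int(np.argmax(scores))
```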

Online variational inference for multivariate Gaussian (VIGO)

We assume that the class-conditional probability density functions $p(x\mid y=k)$ are multivariate Gaussians $\mathcal{N}(\mu_k,\Lambda_k^{-1})$ for $k=1,\ldots,K$, and use the variational inference technique to update the distributions of $\mu_k$ and $\Lambda_k$. Because of the large amount of data in online applications, the assumption of multivariate Gaussian distributions for the data in each class becomes even more accurate. Since, as in Bayesian methods, the distribution $p(x\mid y=k)$ of each class $k\in\{1,\ldots,K\}$ can be updated independently, we only show the
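The section is truncated here, but for a Gaussian likelihood with unknown mean and precision, the conjugate posterior over $(\mu_k, \Lambda_k)$ is Gauss-Wishart, and the variational posterior takes the same functional form (cf. Bishop, Pattern Recognition and Machine Learning, in the reference list). The sketch below shows one such textbook conjugate-style update from a batch of instances of one class; it is a hedged analogue of VIGO's update, not a transcription of the paper's equations, and the parameter names $(\beta, m, W^{-1}, \nu)$ follow the textbook convention:

```python
import numpy as np

# Standard conjugate Gauss-Wishart update for one class from a batch X.
# Parameters: beta0, m0 = strength and mean of the prior over mu;
# W0inv, nu0 = inverse scale matrix and degrees of freedom over Lambda.
def update_gauss_wishart(beta0, m0, W0inv, nu0, X):
    n, _ = X.shape
    x_bar = X.mean(axis=0)
    S = (X - x_bar).T @ (X - x_bar)        # scatter around the batch mean
    beta = beta0 + n
    m = (beta0 * m0 + n * x_bar) / beta    # updated posterior mean of mu
    d = (x_bar - m0).reshape(-1, 1)
    Winv = W0inv + S + (beta0 * n / beta) * (d @ d.T)
    nu = nu0 + n
    return beta, m, Winv, nu
```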

VIGO with built-in concept drift detector (VIGOd)

It can be seen that VIGO learns its model on the whole data stream, where the past is contained in the prior information and the present in the sufficient statistics. Indeed, for each class, the mean calculated by (2) is the average value of all instances belonging to that class. In stationary data stream learning, this lossless property is desirable, as it shows that the model learned by an online method can be asymptotically arbitrarily close to that of the offline batch method regardless of
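For intuition about this lossless property, a mean updated incrementally equals the exact batch average after any number of instances. A one-line sketch (variable names are ours):

```python
# Incremental ("lossless") mean: after n updates, mu equals the exact
# batch average of all n instances seen so far.
def update_mean(mu, n, x):
    n += 1
    mu = mu + (x - mu) / n
    return mu, n
```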

Online variational inference weighted for multivariate Gaussian (VIGOw)

In this section, we present VIGOw, which also updates the model for class k when $n^{(k)}$ reaches a predefined value |B|, but which does not use a built-in drift detector. VIGOw is based on the idea of using a decay function to weigh the importance of instances according to their age [31]. In a decay function, the weight of the old information is multiplied by a decay factor α (0 < α < 1) to phase it out. However, the variational inference framework allows us to put more emphasis on the recent instances
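As an illustration of the generic decay idea described above (and only of that idea, since the rest of the section is truncated), the sketch below scales the old sufficient statistics by α before adding a new batch, so instances that are t batches old contribute with weight α^t. All names are illustrative:

```python
import numpy as np

# Generic decay-based forgetting over sufficient statistics: old statistics
# are multiplied by alpha (0 < alpha < 1) before the new batch is added,
# so older instances are phased out geometrically.
def decayed_stats(sum_x, sum_xx, count, X, alpha=0.95):
    sum_x = alpha * sum_x + X.sum(axis=0)   # decayed first moment
    sum_xx = alpha * sum_xx + X.T @ X       # decayed second moment
    count = alpha * count + len(X)          # effective sample size
    return sum_x, sum_xx, count
```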

Datasets

To evaluate the performance of the proposed methods in different scenarios (where data may be very scarce or extremely abundant), we conduct experiments on 30 datasets (see Table 1). Two of these (Poker and Airlines) are large-scale datasets with millions of data points, and the others are of medium and small size. They can be downloaded from the LIBSVM [38] and UCI [39] repositories, except Airlines. Information about flights in the year 2008 was obtained from the American Statistical Association's

Conclusion

We have presented a new Bayesian online learning method, VIGO, and its two adaptive versions (VIGOd and VIGOw), which achieve superior performance over many recent or well-known online learning methods. In particular, VIGOd and VIGOw are two fast adaptive methods that run well in both stationary and changing environments. Unlike many online learning algorithms, our algorithms do not require the storage of arriving instances, as these are discarded immediately after use. Furthermore, in the framework

Acknowledgements

Thi Thu Thuy Nguyen is supported by the Australian Government Research Training Program Scholarship to undertake this research.


References (44)

  • G.J. Ross et al., Exponentially weighted moving average charts for detecting concept drift, Pattern Recognit. Lett. (2012)
  • T.T. Nguyen et al., A novel combining classifier method based on variational inference, Pattern Recognit. (2016)
  • T. Le et al., Dual space gradient descent for online learning
  • J. Lu et al., Large scale online kernel learning, J. Mach. Learn. Res. (2016)
  • J.N. van Rijn et al., Having a Blast: meta-learning and heterogeneous ensembles for data streams
  • K. Crammer et al., Adaptive regularization of weight vectors, Mach. Learn. (2013)
  • K. Crammer et al., Online passive aggressive algorithms, J. Mach. Learn. Res. (2006)
  • P. Domingos et al., Mining high-speed data streams
  • T.T. Nguyen et al., An ensemble-based online learning algorithm for streaming data
  • V. Losing et al., Self-Adjusting Memory: how to deal with diverse drift types
  • V. Losing et al., KNN classifier with Self Adjusting Memory for heterogeneous concept drift
  • A. Bifet et al., Efficient data stream classification via probabilistic adaptive windows
  • G. Jaber et al., Online learning: searching for the best forgetting strategy under concept drift
  • A. Bifet et al., Adaptive learning from evolving data streams
  • F. Rosenblatt, The perceptron: a probabilistic model for information storage and organization in the brain, Psychol. Rev. (1958)
  • M. Zinkevich, Online convex programming and generalized infinitesimal gradient ascent
  • N. Cesa-Bianchi et al., A second-order perceptron algorithm, SIAM J. Comput. (2005)
  • J. Wang et al., Exact soft confidence-weighted learning
  • Z. Wang et al., Breaking the curse of kernelization: budgeted stochastic gradient descent for large-scale SVM training, J. Mach. Learn. Res. (2012)
  • X.C. Pham et al., An online machine learning method based on random projection and Hoeffding tree classifier
  • C.M. Bishop, Pattern Recognition and Machine Learning (2006)
  • A.A. Farag et al., Density estimation using modified expectation-maximization algorithm for a linear combination of Gaussians


    Thi Thu Thuy Nguyen is currently a Ph.D. student at the School of Information & Communication Technology, Griffith University, Australia. She graduated from the Faculty of Mathematics, Voronezh State University, Russia in 2008. Her research interest is in the field of applied mathematics, machine learning, pattern recognition.

Tien Thanh Nguyen received his PhD degree in computer science from the School of Information & Communication Technology, Griffith University, Australia in 2017. He is currently a Lecturer in the School of Applied Mathematics and Informatics, Hanoi University of Science and Technology, Vietnam. His research interest is in the field of machine learning, pattern recognition, and evolutionary computation. He has been a member of the IEEE since 2014.

Alan Wee-Chung Liew is currently an Associate Professor at the School of Information & Communication Technology, Griffith University, Australia. His research interest is in the field of medical imaging, bioinformatics, computer vision, pattern recognition, and machine learning. He serves on the technical program committees of many international conferences and is on the editorial boards of several journals, including the IEEE Transactions on Fuzzy Systems. He has been a senior member of the IEEE since 2005.

Shi-Lin Wang received his B.Eng. degree in Electrical and Electronic Engineering from Shanghai Jiaotong University, Shanghai, China in 2001, and his Ph.D. degree from the Department of Computer Engineering and Information Technology, City University of Hong Kong in 2004. Since 2004, he has been with the School of Information Security Engineering, Shanghai Jiaotong University, where he is currently an Associate Professor. His research interests include image processing and pattern recognition. Dr. Wang is a senior member of the Institute of Electrical and Electronics Engineers (IEEE) and his biography is listed in Marquis Who's Who in Science and Engineering.
