Robust Bayesian matrix decomposition with mixture of Gaussian noise
Introduction
Discovering knowledge from raw data is a fundamental problem in machine learning, and matrix decomposition plays an important role here due to its effectiveness. Raw data can often be represented in matrix form: images can be organized in a pixel matrix, and corpora can be represented by a document-item matrix that records the frequency of items appearing in the collection of documents. However, raw data matrices are usually large and noisy, and it is often difficult or infeasible to extract useful information from them directly. Therefore, matrix decomposition can be used to represent the observed matrices more compactly and meaningfully. It has been successfully applied in image analysis [1], [2], text mining [3], [4], genomics data analysis [5], [6], [7], and many other fields.
The basic idea of matrix decomposition is to approximate the observed data matrix by a product of two or more low-rank matrices. Given the observation matrix X ∈ R^{m×n}, the general matrix decomposition method can be summarized as X ≈ UV, where U ∈ R^{m×r}, V ∈ R^{r×n} and r ≪ min(m, n). This can be further expressed as an optimization problem: min_{U,V} ‖X - UV‖_p^p, where ‖·‖_p is the element-wise matrix p-norm, ‖X‖_p = (Σ_{ij} |x_{ij}|^p)^{1/p}. The special case p = 2 is the Frobenius norm. Note that x_{·j} ≈ U v_{·j} = Σ_{k=1}^{r} v_{kj} u_{·k}. Therefore, each column of the observed matrix X is approximated by a linear combination of the columns of U. Thus, U is often referred to as the basis matrix, and V as the coefficient matrix.
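As a concrete illustration (not code from the paper), the Frobenius-norm version of this optimization problem can be solved by alternating least squares; the function and variable names below are our own:

```python
import numpy as np

def factorize(X, r, n_iter=50, seed=0):
    """Approximate X (m x n) by U (m x r) @ V (r x n), reducing the
    Frobenius norm ||X - UV||_F via alternating least squares."""
    rng = np.random.default_rng(seed)
    m, n = X.shape
    U = rng.standard_normal((m, r))
    V = rng.standard_normal((r, n))
    for _ in range(n_iter):
        # Fix V, solve for U: each row of U is a least-squares problem.
        U = np.linalg.lstsq(V.T, X.T, rcond=None)[0].T
        # Fix U, solve for V.
        V = np.linalg.lstsq(U, X, rcond=None)[0]
    return U, V

# Each column of X is then approximated by a linear combination
# of the columns of the basis matrix U.
X = np.random.default_rng(1).standard_normal((30, 20))
U, V = factorize(X, r=5)
err = np.linalg.norm(X - U @ V)
```

Each alternating step has a closed-form solution, which is one reason the Frobenius norm is a convenient default choice.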
The Frobenius norm is the most popular choice because it leads to a smooth optimization problem and admits a closed-form solution [8]. However, the Frobenius norm is only optimal for Gaussian noise, and the resulting model provides a biased estimate in the presence of outliers and non-Gaussian noise. To obtain a more robust model, there are basically two types of approaches.
The first type of approach replaces the Frobenius norm with robust loss functions. Torre and Black proposed Robust Subspace Learning based on robust M-estimation [9]. One can also replace the Frobenius norm with the L1-norm loss [10], [11], which is optimal for the Laplace distribution and is robust to outliers. Unfortunately, many L1-norm methods result in non-smooth optimization problems, so standard optimization tools are not applicable. In the same spirit of using the L1-norm, some authors proposed using the L21-norm as the loss function [12], [13], [14]. In a sense, the L21-norm combines the advantages of the Frobenius norm and the L1-norm: it is robust to outliers and is also smooth. But it lacks a direct probabilistic interpretation compared to the Frobenius and L1-norms. Loss functions using the Frobenius or L1-norm are optimal when the noise follows the Gaussian or Laplace distribution, respectively, which is rarely the case in real problems. For instance, both the light level and the sensor temperature cause a large amount of noise in the generated image, and the image can be further contaminated by interference in the transmission channel [15]. Considering the images of the Extended Yale Face Database B [16], there exist cast shadows, graininess in darker background areas, etc. These different noise types can have different distributions. Thus, using a single Gaussian or Laplace distribution to approximate the noise is unlikely to work well in many real applications. To address this problem, the mixture of Gaussians has been used to model the noise [17]. A mixture of Gaussians can universally approximate any continuous distribution in theory [18], as long as the number of Gaussian components is set large enough. Thus, the mixture of Gaussians subsumes the Frobenius and L1-norm models mentioned before (the Laplace distribution can be equivalently represented as a scaled mixture of Gaussians [19]).
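To make the mixture-of-Gaussians argument tangible, here is a small illustrative sketch (our own implementation, not from the cited works): a two-component EM fit with zero-mean components, applied to noise in which 10% of the entries are large-variance outliers, compared against a single Gaussian fit.

```python
import numpy as np

rng = np.random.default_rng(0)
# Contaminated noise: 90% small Gaussian noise, 10% large-variance outliers.
noise = np.concatenate([rng.normal(0, 0.1, 900), rng.normal(0, 3.0, 100)])

def gauss(x, mu, var):
    return np.exp(-0.5 * (x - mu) ** 2 / var) / np.sqrt(2 * np.pi * var)

def fit_mog(x, K=2, n_iter=100):
    """EM for a K-component 1-D Gaussian mixture with means fixed at zero,
    as is natural for a noise model."""
    pi = np.full(K, 1.0 / K)
    var = np.geomspace(0.01, 10.0, K)
    for _ in range(n_iter):
        # E-step: posterior responsibility of each component for each point.
        dens = np.stack([pi[k] * gauss(x, 0.0, var[k]) for k in range(K)])
        resp = dens / dens.sum(axis=0)
        # M-step: update mixing weights and variances.
        Nk = resp.sum(axis=1)
        pi = Nk / len(x)
        var = (resp * x ** 2).sum(axis=1) / Nk
    return pi, var

pi, var = fit_mog(noise)
# Average log-likelihood: the fitted mixture vs. a single zero-mean Gaussian.
ll_mog = np.log(sum(pi[k] * gauss(noise, 0.0, var[k]) for k in range(2))).mean()
ll_single = np.log(gauss(noise, 0.0, noise.var())).mean()
```

On such contaminated data the mixture attains a clearly higher log-likelihood, because one component can absorb the outliers while the other models the bulk of the noise.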
The second type of approach introduces regularization terms or priors on the basis matrix and/or the coefficient matrix, which reduces the risk of overfitting. For example, Kim and Park [20] explored sparse NMF by imposing regularization terms on the basis matrix and the coefficient matrix. Min et al. introduced group-sparse SVD using L1- and L0-norm penalties [21]. Another commonly used type of regularization is graph-based. The key idea is that data points that are close to each other should also be close in the representation (coefficient matrix). Matrix decomposition methods with such regularization terms have been studied intensively [22], [23], [24], [25], [26]. From a probabilistic perspective, Salakhutdinov and Mnih [27] proposed probabilistic matrix factorization (PMF), which places isotropic Gaussian priors on both U and V. The isotropic Gaussian priors result in Frobenius regularization terms, which penalize large parameter values and thus make the model less likely to overfit the noise. Wang et al. [28] replaced the Frobenius norm loss in PMF with the L1-norm (RPMF), making RPMF more robust to outliers. Saddiki et al. [29] proposed a mixed membership model (GLAD), which puts a Laplace prior on the basis matrix for sparsity and a Dirichlet prior on the coefficient matrix for interpretability. However, the expensive computational cost of GLAD makes it impractical for real-world data. Zhang and Zhang [30] proposed a Bayesian Joint Matrix Decomposition (BJMD) model, which extends GLAD to multi-view data with heterogeneous noise, and developed two efficient algorithms. To address the challenge of big data, Zhang et al. [31] also proposed a Bayesian matrix decomposition framework for heterogeneous noise in distributed data. However, none of these methods is suitable for general noise.
With the rapid accumulation of data, some data inevitably contain complex noise, and using only one type of approach can be insufficient. For example, a matrix decomposition model with an L1-norm loss function (e.g., RPMF) works well in the presence of sparse noise, but its performance degrades under mixture of Gaussian noise. In this work, we combine the merits of the two types of approaches and propose a matrix decomposition model (BMD-MoG) under the Bayesian framework. Specifically, to address the weakness of PMF and RPMF, we use a mixture of Gaussians to model a wide range of noise. Meanwhile, we impose a Laplace prior on the basis matrix to encourage its sparsity, and a Dirichlet prior on the coefficient matrix for better interpretability. Moreover, we adopt an Expectation Maximization (EM) algorithm under the maximum a posteriori (MAP) framework to estimate the model parameters. Extensive experiments on both synthetic and real-world data demonstrate that the hybrid BMD-MoG benefits from both the Bayesian priors and the mixture of Gaussian noise assumption, and thus achieves superior performance compared to competing methods.
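For intuition only, the following is a much-simplified sketch of matrix factorization under mixture-of-Gaussian noise, fit by EM. It omits the Laplace and Dirichlet priors of BMD-MoG, and all names and details are our own, so it should not be read as the authors' algorithm: the E-step assigns each entry's residual to a noise component, and the M-step solves entry-weighted least squares.

```python
import numpy as np

def mog_mf(X, r, K=2, n_iter=30, seed=0):
    """Simplified sketch: matrix factorization X ~ UV with a K-component
    zero-mean Gaussian mixture on the entry-wise noise, fit by EM.
    Priors on U and V are deliberately omitted."""
    rng = np.random.default_rng(seed)
    m, n = X.shape
    U = rng.standard_normal((m, r)) * 0.1
    V = rng.standard_normal((r, n)) * 0.1
    pi = np.full(K, 1.0 / K)
    var = np.geomspace(0.01, 1.0, K)
    ridge = 1e-8 * np.eye(r)  # tiny regularizer for numerical stability
    for _ in range(n_iter):
        E = X - U @ V  # residual matrix
        # E-step: responsibility of each noise component for each entry.
        dens = np.stack([pi[k] / np.sqrt(2 * np.pi * var[k])
                         * np.exp(-0.5 * E**2 / var[k]) for k in range(K)])
        resp = dens / dens.sum(axis=0)
        # M-step (noise): update mixing weights and component variances.
        Nk = resp.sum(axis=(1, 2))
        pi = Nk / (m * n)
        var = (resp * E**2).sum(axis=(1, 2)) / Nk
        # M-step (factors): entry weights w_ij = sum_k resp_kij / var_k,
        # then weighted least squares for each row of U / column of V.
        W = np.tensordot(1.0 / var, resp, axes=1)
        for i in range(m):
            A = V * W[i]
            U[i] = np.linalg.solve(A @ V.T + ridge, A @ X[i])
        for j in range(n):
            B = U.T * W[:, j]
            V[:, j] = np.linalg.solve(B @ U + ridge, B @ X[:, j])
    return U, V

# Demo on hypothetical synthetic data: low-rank signal plus sparse outliers.
rng = np.random.default_rng(1)
U0, V0 = rng.standard_normal((20, 3)), rng.standard_normal((3, 15))
X = U0 @ V0 + rng.normal(0, 0.1, (20, 15))
idx = rng.random((20, 15)) < 0.1
X[idx] += rng.normal(0, 4.0, idx.sum())
U, V = mog_mf(X, r=3)
```

Entries assigned to the large-variance component receive small weights, so outliers influence the factor updates far less than under a plain Frobenius loss.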
Related work
We first introduce the methods that replace the Frobenius norm loss with robust loss functions. The Householder-Young method was used for matrix decomposition under the unweighted least-squares criterion as early as 1969 (see [32]). Gabriel and Zamir [33] presented a solution in which weights are introduced to deal with missing data. The objective function can therefore be written as min_{U,V} Σ_{ij} w_{ij} (x_{ij} - u_i v_j)^2, where u_i and v_j are the i-th row of U and the j-th column of V,
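A minimal sketch of this weighted objective, in the spirit of Gabriel and Zamir (our own illustrative code, with w_ij ∈ {0, 1} marking observed entries):

```python
import numpy as np

def weighted_als(X, W, r, n_iter=100, seed=0):
    """Minimize sum_ij w_ij * (x_ij - u_i v_j)^2 by alternating weighted
    least squares, where u_i is the i-th row of U and v_j the j-th column
    of V; w_ij = 0 marks a missing entry."""
    rng = np.random.default_rng(seed)
    m, n = X.shape
    U = rng.standard_normal((m, r))
    V = rng.standard_normal((r, n))
    ridge = 1e-8 * np.eye(r)  # tiny regularizer for numerical stability
    for _ in range(n_iter):
        for i in range(m):
            A = V * W[i]          # columns of V scaled by the weights w_i.
            U[i] = np.linalg.solve(A @ V.T + ridge, A @ X[i])
        for j in range(n):
            B = U.T * W[:, j]
            V[:, j] = np.linalg.solve(B @ U + ridge, B @ X[:, j])
    return U, V

# Recover a noiseless rank-2 matrix from roughly 70% observed entries.
rng = np.random.default_rng(2)
Xtrue = rng.standard_normal((20, 2)) @ rng.standard_normal((2, 15))
W = (rng.random((20, 15)) < 0.7).astype(float)
U, V = weighted_als(Xtrue * W, W, r=2)
rel_err = np.linalg.norm(U @ V - Xtrue) / np.linalg.norm(Xtrue)
```

Because missing entries carry zero weight, they simply drop out of the normal equations, which is what makes this formulation natural for incomplete data.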
Bayesian matrix decomposition model with mixture of Gaussian noise
In this section, we present a Bayesian matrix decomposition model that approximates the noise with a mixture of Gaussians (BMD-MoG).
Experimental results
In this section, we compare BMD-MoG with BMD, MoG, RPMF and PMF on both synthetic and real-world data. Due to the non-convexity of matrix decomposition, these methods can only find a local minimum. Therefore, the evaluations are based on the average of the best 10 of 20 runs (according to the objective value) in the following experiments. The choice of the hyperparameters requires careful consideration. Basically, the hyperparameters reflect our prior knowledge of the concerned
Conclusion
In this paper, we propose a Bayesian matrix decomposition model, BMD-MoG, which uses a mixture of Gaussians to model complex noise. We develop an effective EM algorithm for the inference of BMD-MoG. Extensive experiments on synthetic and real-world datasets show that BMD-MoG achieves better or comparable performance compared to BMD and MoG in most settings, suggesting that BMD-MoG benefits from both the priors and the mixture of Gaussian noise assumption. A limitation of this
CRediT authorship contribution statement
Haohui Wang: Methodology, Software, Investigation, Writing - original draft. Chihao Zhang: Methodology, Software, Investigation, Writing - original draft. Shihua Zhang: Conceptualization, Methodology, Supervision, Writing - review & editing.
Declaration of Competing Interest
The authors declare that they have no known competing financial interests or personal relationships that could have appeared to influence the work reported in this paper.
Acknowledgment
This work has been supported by the National Key Research and Development Program of China [2019YFA0709501]; the National Natural Science Foundation of China [11661141019, 61621003]; National Ten Thousand Talent Program for Young Top-notch Talents; CAS Frontier Science Research Key Project for Top Young Scientist [QYZDB-SSW-SYS008].
References (39)
- Algorithms and applications for approximate nonnegative matrix factorization, Computational Statistics & Data Analysis (2007)
- Manifold NMF with L21 norm for clustering, Neurocomputing (2018)
- Regularized nonnegative matrix factorization with adaptive local structure learning, Neurocomputing (2020)
- Non-negative matrix factorization with sparseness constraints, Journal of Machine Learning Research (2004)
- Learning the parts of objects by non-negative matrix factorization, Nature (1999)
- Document clustering using nonnegative matrix factorization, Information Processing & Management (2004)
- Discovery of multi-dimensional modules by integrative analysis of cancer genomic data, Nucleic Acids Research (2012)
- Discovery of two-level modular organization from matched genomic data via joint matrix tri-factorization, Nucleic Acids Research (2018)
- Learning common and specific patterns from data of multiple interrelated biological scenarios with matrix factorization, Nucleic Acids Research (2019)
- Damped Newton algorithms for matrix factorization with missing data, IEEE Computer Society Conference on Computer Vision and Pattern Recognition (2005)
- A framework for robust subspace learning, International Journal of Computer Vision
- Robust L1 norm factorization in the presence of outliers and missing data by alternative convex programming, IEEE Computer Society Conference on Computer Vision and Pattern Recognition
- Efficient computation of robust low-rank matrix approximations in the presence of missing data using the L1 norm, IEEE Computer Society Conference on Computer Vision and Pattern Recognition
- Robust nonnegative matrix factorization using L21-norm, Digital Image Processing
- From few to many: Illumination cone models for face recognition under variable lighting and pose, IEEE Transactions on Pattern Analysis and Machine Intelligence
- Robust matrix factorization with unknown noise
- On approximate approximations using Gaussian kernels, IMA Journal of Numerical Analysis
Haohui Wang is currently pursuing her master degree in the School of Mathematical Sciences, Zhejiang University. Her research interests include machine learning and data mining.
Chihao Zhang is a PhD candidate in the Academy of Mathematics and Systems Science, Chinese Academy of Sciences. His research interests include machine learning, data mining and data science.
Shihua Zhang received the PhD degree in applied mathematics and bioinformatics from the Academy of Mathematics and Systems Science, Chinese Academy of Sciences in 2008 with the highest honor. He joined the same institute as an Assistant Professor in 2008, and is currently a Professor. His research interests are mainly in bioinformatics and computational biology, data mining, pattern recognition and machine learning. He has won various awards and honors including Ten Thousand Talent Program—Young top-notch talent (2018), NSFC for Excellent Young Scholars (2014), Outstanding Young Scientist Program of CAS (2014), and the Youth Science and Technology Award of China (2013). He serves as an Editorial Board Member of BMC Genomics, Frontiers in Genetics and others. He is a member of the IEEE, ISCB and SIAM.