
Neurocomputing

Volume 449, 18 August 2021, Pages 108-116

Robust Bayesian matrix decomposition with mixture of Gaussian noise

https://doi.org/10.1016/j.neucom.2021.04.004

Abstract

Matrix decomposition is a popular and fundamental approach in machine learning. Classical matrix decomposition methods with the Frobenius norm loss are optimal only for Gaussian noise and thus are sensitive to outliers and non-Gaussian noise. To address these limitations, existing methods fall into two categories. One type of approach replaces the Frobenius norm loss with robust loss functions. The other imposes Bayesian priors to reduce the risk of overfitting. This paper combines these two approaches. Specifically, we model the noise by a mixture of Gaussian distributions, enabling the model to approximate a wide range of noise distributions. Meanwhile, we place a Laplace prior on the basis matrix to enforce sparsity and a Dirichlet prior on the coefficient matrix to improve interpretability. Extensive experiments on synthetic and real-world data demonstrate that this method outperforms several competing ones. Ablation studies show that the method benefits from both the Bayesian priors and the mixture of Gaussian noise loss, which confirms the necessity of combining the two schemes.

Introduction

Discovering knowledge from raw data is a fundamental problem in machine learning, and matrix decomposition plays an important role in it due to its effectiveness. In general, raw data can be represented in matrix form in many cases. For instance, images can be organized in a pixel matrix, and corpora can be represented by a document-item matrix which records the frequency of items appearing in the collection of documents. However, the raw data are usually large and noisy matrices, and it is often difficult or infeasible to extract useful information directly from them. Therefore, matrix decomposition can be used to represent the observed matrices more compactly and meaningfully. Matrix decomposition has been successfully used in image analysis [1], [2], text mining [3], [4], genomics data analysis [5], [6], [7], and many other fields.

The basic idea of matrix decomposition is to approximate the observed data matrix by a product of two or more low-rank matrices. Given the observation matrix, the general matrix decomposition method can be summarized as $X \approx UV$, where $X \in \mathbb{R}^{d\times n}$, $U \in \mathbb{R}^{d\times r}$, and $V \in \mathbb{R}^{r\times n}$. This can be further expressed as an optimization problem:
$$\min_{U,V}\ \|X - UV\|_p^p$$
where $\|\cdot\|_p$ is the element-wise $L_p$ matrix norm, and $\|X\|_p = \|\mathrm{vec}(X)\|_p = \left(\sum_{i=1}^{d}\sum_{j=1}^{n}|x_{ij}|^p\right)^{1/p}$. The special case $p=2$ is the Frobenius norm. Note that $x_{\cdot j} \approx \sum_{t=1}^{r} u_{\cdot t}\, v_{tj}$, so each column of the observed matrix $X$ is approximated by a linear combination of the columns of $U$. Thus, $U$ is often referred to as the basis matrix, and $V$ as the coefficient matrix.
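
As a concrete illustration of the Frobenius-loss case ($p=2$), the following minimal sketch factorizes a matrix by gradient descent on $U$ and $V$; the rank, step size, and iteration count are illustrative assumptions rather than settings used in the paper.

```python
import numpy as np

def factorize_frobenius(X, r, lr=1e-3, n_iter=2000, seed=0):
    """Approximate X (d x n) by U (d x r) @ V (r x n) under the squared Frobenius loss."""
    rng = np.random.default_rng(seed)
    d, n = X.shape
    U = rng.standard_normal((d, r)) * 0.1
    V = rng.standard_normal((r, n)) * 0.1
    for _ in range(n_iter):
        R = X - U @ V              # residual matrix
        U += lr * (R @ V.T)        # gradient step on 0.5*||X - UV||_F^2 w.r.t. U
        V += lr * (U.T @ R)        # gradient step w.r.t. V
    return U, V
```

The robust variants discussed below keep this alternating structure but change how residuals enter the loss.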

The Frobenius norm is the most popular choice because it yields a smooth optimization problem and admits closed-form solutions [8]. However, the Frobenius norm is only optimal for Gaussian noise, and the resulting model gives biased estimates in the presence of outliers and non-Gaussian noise. To obtain a more robust model, there are basically two types of approaches.

The first type of approach is to replace the Frobenius norm with robust loss functions. De la Torre and Black proposed Robust Subspace Learning based on robust M-estimation [9]. One can also replace the Frobenius norm with an L1-norm loss [10], [11], which is optimal for Laplace-distributed noise and robust to outliers. Unfortunately, many L1-norm methods lead to non-smooth optimization problems, so standard optimization tools are not applicable. In the same spirit, some authors proposed the L2,1-norm as the loss function [12], [13], [14]. In a sense, the L2,1-norm combines the advantages of the Frobenius norm and the L1-norm: it is robust to outliers and is also smooth. However, it lacks a direct probabilistic interpretation compared to the Frobenius and L1-norms. Loss functions based on the Frobenius or L1-norm are optimal only when the noise follows a Gaussian or Laplace distribution, respectively, which is rarely the case in real problems. For instance, both the light level and the sensor temperature introduce a large amount of noise into the generated image, and the image may be further contaminated by interference in the transmission channel [15]. In the images of the Extended Yale Face Database B [16], there exist cast shadows, graininess in darker background areas, etc. These different noise types can have different distributions, so using a single Gaussian or Laplace distribution to approximate the noise is unlikely to work well in many real applications. To address this problem, the mixture of Gaussians has been used to model the noise [17]. A mixture of Gaussians can in theory approximate any continuous distribution [18], as long as the number of Gaussian components is set large enough. Thus, the mixture of Gaussians subsumes the Frobenius and L1-norm models mentioned before (the Laplace distribution can be equivalently represented as a scaled mixture of Gaussians [19]).
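
The parenthetical remark above can be made precise with the standard scale-mixture identity (written here in the usual Bayesian-lasso parameterization with rate $\lambda$):
$$
\frac{\lambda}{2}\,e^{-\lambda|x|} \;=\; \int_0^{\infty} \mathcal{N}\!\left(x \mid 0, \sigma^2\right)\,\frac{\lambda^2}{2}\,e^{-\lambda^2\sigma^2/2}\,d\sigma^2 .
$$
That is, a Laplace density is an infinite mixture of zero-mean Gaussians whose variances are exponentially distributed, which is why a sufficiently rich Gaussian mixture covers both the Frobenius (Gaussian) and L1 (Laplace) noise models.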

The second type of approach is to introduce regularization terms or priors on the basis matrix and/or the coefficient matrix, which can reduce the risk of overfitting. For example, Kim and Park [20] explored sparse NMF by imposing L1 regularization terms on the basis matrix and the coefficient matrix. Min et al. introduced group-sparse SVD using L1 and L0-norm penalties [21]. Another commonly used type of regularization is graph-based. The key idea is that data points that are close to each other should also be close in the representation (coefficient matrix). Matrix decomposition methods with such regularization terms have been studied intensively [22], [23], [24], [25], [26]. From a probabilistic perspective, Salakhutdinov and Mnih [27] proposed probabilistic matrix factorization (PMF), which places isotropic Gaussian priors on both U and V. The isotropic Gaussian priors result in Frobenius regularization terms, which penalize large parameter values and thus make the model less likely to overfit the noise. Wang et al. [28] replaced the Frobenius norm loss in PMF with the L1-norm (RPMF); RPMF is therefore more robust to outliers. Saddiki et al. [29] proposed a mixed membership model (GLAD), which puts a Laplace prior on the basis matrix for sparsity and a Dirichlet prior on the coefficient matrix for interpretability. However, the expensive computational cost of GLAD makes it impractical for real-world data. Zhang and Zhang [30] proposed a Bayesian Joint Matrix Decomposition (BJMD) model, which extends GLAD to multi-view data with heterogeneous noise, and developed two efficient algorithms. To address the challenge of big data, Zhang et al. [31] also proposed a Bayesian matrix decomposition framework for heterogeneous noise in distributed data. However, none of these methods is suitable for general noise.
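
To make the link between priors and regularization terms explicit, recall the standard MAP derivation in the PMF setting [27]. With noise variance $\sigma^2$ and prior variances $\sigma_U^2$ and $\sigma_V^2$, the negative log-posterior is, up to a constant,
$$
-\log p(U,V \mid X) \;=\; \frac{1}{2\sigma^2}\|X-UV\|_F^2 \;+\; \frac{1}{2\sigma_U^2}\|U\|_F^2 \;+\; \frac{1}{2\sigma_V^2}\|V\|_F^2 \;+\; \mathrm{const},
$$
so isotropic Gaussian priors act as Frobenius regularizers, while a Laplace prior on $U$ (as in GLAD and in this paper) analogously contributes an L1 penalty $\lambda\|U\|_1$ that encourages sparsity.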

With the rapid accumulation of data, some data inevitably contain complex noise, and using only one of the two approaches can be insufficient. For example, a matrix decomposition model with an L1-norm loss function (e.g., RPMF) works well when the noise is sparse, but its performance degrades in the presence of mixture-of-Gaussian noise. In this work, we combine the merits of the two types of approaches and propose a matrix decomposition model (BMD-MoG) under the Bayesian framework. Specifically, to address the weakness of PMF and RPMF, we use a mixture of Gaussian distributions to model a wide range of noise. Meanwhile, we impose a Laplace prior on the basis matrix to encourage its sparsity, and a Dirichlet prior on the coefficient matrix for better interpretability. Moreover, we adopt an Expectation Maximization (EM) algorithm under the maximum a posteriori (MAP) framework to estimate the model parameters. Extensive experiments on both synthetic and real-world data demonstrate that the hybrid method BMD-MoG benefits from both the Bayesian priors and the mixture of Gaussian noise assumption, and thus achieves superior performance compared to competing methods.
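
To convey the flavor of such an EM procedure, the following simplified sketch fits a factorization whose residuals follow a zero-mean Gaussian mixture. It is not the authors' algorithm: it omits the Laplace and Dirichlet priors of BMD-MoG, uses plain gradient steps in the M-step, and all hyperparameters (number of components, step size, iteration counts) are illustrative assumptions.

```python
import numpy as np

def fit_mog_factorization(X, r, K=3, n_outer=50, n_inner=100, lr=1e-3, seed=0):
    """Illustrative EM for X ~ UV + noise with a K-component zero-mean Gaussian mixture.
    E-step: responsibilities of each residual. M-step: mixture weights/variances, then
    a few weighted-least-squares gradient steps on U and V."""
    rng = np.random.default_rng(seed)
    d, n = X.shape
    U = rng.standard_normal((d, r)) * 0.1
    V = rng.standard_normal((r, n)) * 0.1
    pi = np.full(K, 1.0 / K)                     # mixing proportions
    var = np.linspace(0.1, 1.0, K)               # component variances
    for _ in range(n_outer):
        R = X - U @ V                            # residuals, shape (d, n)
        # E-step: log responsibility of component k for residual r_ij
        log_p = (np.log(pi)[:, None, None]
                 - 0.5 * np.log(2 * np.pi * var)[:, None, None]
                 - 0.5 * R[None] ** 2 / var[:, None, None])
        log_p -= log_p.max(axis=0, keepdims=True)
        gamma = np.exp(log_p)
        gamma /= gamma.sum(axis=0, keepdims=True)
        # M-step: mixture parameters
        Nk = gamma.sum(axis=(1, 2))
        pi = Nk / (d * n)
        var = (gamma * R[None] ** 2).sum(axis=(1, 2)) / np.maximum(Nk, 1e-12)
        var = np.maximum(var, 1e-8)              # guard against collapsed components
        # M-step: weighted least squares on U, V with mixture precision weights
        W = (gamma / var[:, None, None]).sum(axis=0)
        for _ in range(n_inner):
            Rw = W * (X - U @ V)
            U += lr * (Rw @ V.T)
            V += lr * (U.T @ Rw)
    return U, V, pi, var
```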

Section snippets

Related work

We first introduce the methods that replace the Frobenius norm loss with robust loss functions. The Householder-Young method was used in matrix decomposition under unweighted least squares in 1969 (see [32]). Gabriel and Zamir [33] presented a solution in which weights were introduced to deal with missing data. The objective function can therefore be written as:
$$\min_{U,V}\ \sum_{i=1}^{d}\sum_{j=1}^{n} w_{ij}\,\left(x_{ij} - u_{i\cdot}\, v_{\cdot j}\right)^2 \;=\; \|W \odot (X - UV)\|_F^2$$
where $u_{i\cdot}$ and $v_{\cdot j}$ are the i-th row of U and the j-th column of V,
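
A minimal sketch of this weighted criterion using alternating, ridge-stabilized weighted least squares is shown below. It illustrates the objective above rather than reproducing Gabriel and Zamir's original algorithm; the ridge term and iteration count are assumptions.

```python
import numpy as np

def weighted_als(X, W, r, n_iter=100, ridge=1e-6, seed=0):
    """Illustrative alternating weighted least squares for min ||W * (X - U V)||_F^2,
    where W is a nonnegative weight matrix (e.g., a 0/1 mask marking observed entries)."""
    rng = np.random.default_rng(seed)
    d, n = X.shape
    U = rng.standard_normal((d, r)) * 0.1
    V = rng.standard_normal((r, n)) * 0.1
    I = ridge * np.eye(r)                          # small ridge term for numerical stability
    for _ in range(n_iter):
        for i in range(d):                         # row-wise weighted LS update of U
            Vw = V * W[i]                          # scale columns of V by the i-th row of weights
            U[i] = np.linalg.solve(Vw @ V.T + I, Vw @ X[i])
        for j in range(n):                         # column-wise weighted LS update of V
            Uw = U * W[:, j][:, None]
            V[:, j] = np.linalg.solve(Uw.T @ U + I, Uw.T @ X[:, j])
    return U, V
```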

Bayesian matrix decomposition model with mixture of Gaussian noise

In this section, we present a Bayesian matrix decomposition model that approximates the noise with a mixture of Gaussian distributions (BMD-MoG).

Experimental results

In this section, we compare BMD-MoG with BMD, MoG, RPMF and PMF on both synthetic and real-world data. Due to the non-convexity of matrix decomposition, these methods can only find a local minimum. Therefore, the evaluations in the following experiments are based on the average of the best 10 out of 20 runs (according to the objective value). The choice of the hyperparameters requires careful consideration. Basically, the hyperparameters (λ, α) reflect our prior knowledge of the concerned
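
The restart-and-select protocol can be summarized as follows; fit_fn and score_fn are hypothetical placeholders for a model-fitting routine (returning the fitted model and its final objective value) and an evaluation metric.

```python
import numpy as np

def evaluate_with_restarts(fit_fn, score_fn, n_runs=20, n_keep=10):
    """Run a non-convex factorization n_runs times, keep the n_keep runs with the
    lowest objective value, and report the mean score over those kept runs."""
    results = []
    for seed in range(n_runs):
        model, objective = fit_fn(seed)           # fit_fn(seed) -> (model, final objective)
        results.append((objective, score_fn(model)))
    results.sort(key=lambda t: t[0])              # best (lowest) objective first
    kept = [score for _, score in results[:n_keep]]
    return float(np.mean(kept))
```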

Conclusion

In this paper, we propose a Bayesian matrix decomposition model, BMD-MoG, which uses a mixture of Gaussian distributions to model complex noise. We develop an effective EM algorithm for the inference of BMD-MoG. Extensive experiments on synthetic and real-world datasets show that BMD-MoG achieves better or comparable performance compared to BMD and MoG in most settings, suggesting that BMD-MoG benefits from both the priors and the mixture of Gaussian noise assumption. A limitation of this

CRediT authorship contribution statement

Haohui Wang: Methodology, Software, Investigation, Writing - original draft. Chihao Zhang: Methodology, Software, Investigation, Writing - original draft. Shihua Zhang: Conceptualization, Methodology, Supervision, Writing - review & editing.

Declaration of Competing Interest

The authors declare that they have no known competing financial interests or personal relationships that could have appeared to influence the work reported in this paper.

Acknowledgment

This work has been supported by the National Key Research and Development Program of China [2019YFA0709501]; the National Natural Science Foundation of China [11661141019, 61621003]; National Ten Thousand Talent Program for Young Top-notch Talents; CAS Frontier Science Research Key Project for Top Young Scientist [QYZDB-SSW-SYS008].


References (39)

  • M.W. Berry et al.

    Algorithms and applications for approximate nonnegative matrix factorization

    Computational Statistics & Data Analysis

    (2007)
  • B. Wu et al.

Manifold NMF with L21 norm for clustering

    Neurocomputing

    (2018)
  • S. Huang et al.

    Regularized nonnegative matrix factorization with adaptive local structure learning

    Neurocomputing

    (2020)
  • P.O. Hoyer

    Non-negative matrix factorization with sparseness constraints

    Journal of Machine Learning Research

    (2004)
  • D.D. Lee et al.

    Learning the parts of objects by non-negative matrix factorization

    Nature

    (1999)
  • F. Shahnaz et al.

    Document clustering using nonnegative matrix factorization

    Information Processing & Management

    (2004)
  • S. Zhang et al.

    Discovery of multi-dimensional modules by integrative analysis of cancer genomic data

    Nucleic Acids Research

    (2012)
  • J. Chen et al.

    Discovery of two-level modular organization from matched genomic data via joint matrix tri-factorization

    Nucleic Acids Research

    (2018)
  • L. Zhang et al.

    Learning common and specific patterns from data of multiple interrelated biological scenarios with matrix factorization

    Nucleic Acids Research

    (2019)
  • A.M. Buchanan et al.

    Damped Newton algorithms for matrix factorization with missing data

    IEEE Computer Society Conference on Computer Vision and Pattern Recognition

    (2005)
  • F. De La Torre et al.

    A framework for robust subspace learning

    International Journal of Computer Vision

    (2003)
  • Q. Ke et al.

    Robust L1 norm factorization in the presence of outliers and missing data by alternative convex programming

    IEEE Computer Society Conference on Computer Vision and Pattern Recognition

    (2005)
  • A. Eriksson et al.

    Efficient computation of robust low-rank matrix approximations in the presence of missing data using the L1 norm

    IEEE Computer Society Conference on Computer Vision and Pattern Recognition

    (2010)
  • D. Kong et al.

Robust nonnegative matrix factorization using L21-norm

  • M. Luo, F. Nie, X. Chang, Y. Yang, A. Hauptmann, Q. Zheng, Probabilistic non-negative matrix factorization and its...
  • R.C. Gonzalez et al.

    Digital Image Processing

    (2006)
  • A.S. Georghiades et al.

    From few to many: Illumination cone models for face recognition under variable lighting and pose

    IEEE Transactions on Pattern Analysis and Machine Intelligence

    (2001)
  • D. Meng et al.

    Robust matrix factorization with unknown noise

  • V. Mazya et al.

On approximate approximations using Gaussian kernels

    IMA Journal of Numerical Analysis

    (1996)
Haohui Wang is currently pursuing her master's degree in the School of Mathematical Sciences, Zhejiang University. Her research interests include machine learning and data mining.

    Chihao Zhang is a PhD candidate in the Academy of Mathematics and Systems Science, Chinese Academy of Sciences. His research interests include machine learning, data mining and data science.

    Shihua Zhang received the PhD degree in applied mathematics and bioinformatics from the Academy of Mathematics and Systems Science, Chinese Academy of Sciences in 2008 with the highest honor. He joined the same institute as an Assistant Professor in 2008, and is currently Professor. His research interests are mainly in bioinformatics and computational biology, data mining, pattern recognition and machine learning. He has won various awards and honors including Ten Thousand Talent Program—Young top-notch talent (2018), NSFC for Excellent Young Scholars (2014), Outstanding Young Scientist Program of CAS (2014), Youth Science and Technology Award of China (2013) and so on. Now he serves as an Editorial Board Member of BMC Genomics, Frontiers in Genetics and so on. He is a member of the IEEE, ISCB and SIAM.
