Article

Compressive Sensing via Variational Bayesian Inference under Two Widely Used Priors: Modeling, Comparison and Discussion

by Mohammad Shekaramiz 1,* and Todd K. Moon 2
1 Machine Learning & Drone Lab, Electrical and Computer Engineering Program, Engineering Department, Utah Valley University, 800 West University Parkway, Orem, UT 84058, USA
2 Electrical and Computer Engineering Department, Utah State University, 4120 Old Main Hill, Logan, UT 84322, USA
* Author to whom correspondence should be addressed.
Entropy 2023, 25(3), 511; https://doi.org/10.3390/e25030511
Submission received: 30 January 2023 / Revised: 6 March 2023 / Accepted: 14 March 2023 / Published: 16 March 2023

Abstract: Compressive sensing is a sub-Nyquist sampling technique for efficient signal acquisition and reconstruction of sparse or compressible signals. In order to account for the sparsity of the underlying signal of interest, it is common to use sparsifying priors such as Bernoulli–Gaussian-inverse Gamma (BGiG) and Gaussian-inverse Gamma (GiG) priors on the components of the signal. With the introduction of variational Bayesian inference, the sparse Bayesian learning (SBL) methods for solving the inverse problem of compressive sensing have received significant interest, as the SBL methods become more efficient in terms of execution time. In this paper, we consider the sparse signal recovery problem using compressive sensing and the variational Bayesian (VB) inference framework. More specifically, we consider two widely used Bayesian models, BGiG and GiG, for modeling the underlying sparse signal for this problem. Although these two models have been widely used for sparse recovery problems under various signal structures, the question of which model outperforms the other for sparse signal recovery under no specific structure has yet to be fully addressed under the VB inference setting. Here, we study these two models specifically under VB inference in detail, provide some motivating examples regarding the issues in signal reconstruction that may occur under each model, perform comparisons and provide suggestions on how to improve the performance of each model.

1. Introduction

Compressive sensing (CS) involves efficient signal acquisition and reconstruction techniques in a sub-Nyquist sampling sense. The CS framework can capture the vital information of the underlying signal via a small number of measurements while retaining the ability to reconstruct the signal. CS operates under the assumption that the signal is compressible or sparse and that the number and locations of the dominating components are unknown in most cases [1,2,3]. Compressibility or sparsity means that the signal has few dominating elements under some proper basis. CS has been used in a variety of applications such as the single-pixel camera, image inpainting and missing-pixel recovery, biomedical applications such as heart rate estimation, the internet of things (IoT), geostatistical data analysis, seismic tomography, communications applications such as blind multi-narrowband signal sampling and recovery, direction-of-arrival (DoA) estimation, spectrum sharing of radar and communication signals, wireless networks and many more [4,5,6,7,8,9,10,11,12,13,14,15,16,17,18,19,20,21,22,23,24,25,26,27]. In the linear CS framework, the problem is posed as
$$ \mathbf{y} = \mathbf{A}\mathbf{x}_s + \mathbf{e}, $$
where $\mathbf{y} \in \mathbb{R}^{M}$ contains the measurements, $\mathbf{x}_s \in \mathbb{R}^{N}$ is the sparse signal of interest, $\mathbf{e}$ is the noise representing either the measurement noise or the insignificant coefficients of $\mathbf{x}_s$ and, generally, $M \ll N$ [1,2]. The measurement matrix can be defined as $\mathbf{A} = \Phi\Psi$, where $\Phi$ is the sensing design matrix and $\Psi$ is a proper sparsifying basis. There exist various approaches to solve for $\mathbf{x}_s$ in (1), including greedy-based, convex-based, thresholding-based and sparse Bayesian learning (SBL) algorithms [27,28,29,30,31,32,33,34,35,36,37,38,39,40,41,42,43,44,45,46,47,48,49,50,51,52,53,54,55,56,57,58,59,60,61,62,63,64]. Typically, the performance of CS reconstruction is measured in terms of the mean-squared reconstruction error. In this paper, we are also interested in the more demanding requirements of the probability of detection and the false alarm rate of the nonzero components. This is of particular interest for CS applications such as blind multi-narrowband signal recovery, spectrum-sharing radar, etc. [11,12,13,14,15].
The focus of this paper is on sparse Bayesian learning (SBL) for the CS problem. Bayesian learning models are flexible in incorporating prior knowledge of the characteristics of the underlying signal into the model. Bayesian learning also provides a distribution of the hidden variables, which is more informative than point-estimate approaches. A prior favoring sparsity or compressibility in $\mathbf{x}_s$ can be represented in the SBL framework via Gaussian-inverse Gamma (GiG), Laplace-inverse Gamma (LiG) and Bernoulli–Gaussian-inverse Gamma (BGiG, often referred to as the spike-and-slab prior) priors, among others [27,46,47,48,49,50,51,52,53,54,55,56,57,58,59]. The inference on the parameters and hidden variables in these models is usually made using Markov chain Monte Carlo (MCMC) and variational Bayes (VB) [27,45,46,47,48,49,50,51,52]. In this paper, we focus on the two most commonly used SBL prior models for solving the inverse problem of compressive sensing: the Bernoulli–Gaussian-inverse Gamma (BGiG) and Gaussian-inverse Gamma (GiG) priors. These models have been widely used, along with some additional priors, for sparse recovery of signals or images with block-sparsity/clustering patterns, sparse signals with correlated coefficients or other structured patterns [26,27,48,49,50,51,62,63].
We use VB inference to estimate the variables and parameters of the SBL model. VB is preferred over MCMC because MCMC is computationally expensive, though it can numerically approximate exact posterior distributions with a sufficient amount of computation. The convergence diagnostic of MCMC requires additional work, such as measuring the potential scale reduction factor (PSRF) for all the hidden variables and parameters of the model or monitoring their trace plots [45,50,51,52,65]. In contrast, VB inference can lead to a reasonable approximation of the exact posteriors, using less computation than MCMC and less effort to monitor the convergence [45,51,66,67]. In this paper, we present the derivation of the update rules of the parameters and variables using VB inference for both the BGiG and GiG models. (Portions of this derivation have been previously presented in [68,69]). Although these prior models have been widely used in various applications of compressive sensing, the study of the overall performance of these models under VB inference has yet to be thoroughly investigated. The preference for one model over the other becomes crucial when dealing with moderate or low sampling ratios, which we discuss in this paper. Here, we study the issues associated with each model via some motivational examples. Pre-/postprocessing approaches will then be discussed to tackle the issues. Finally, the overall performance of BGiG and GiG is compared.
The remainder of this work is organized as follows. In Section 2, we present a brief background on VB inference. We study Bernoulli–Gaussian-inverse Gamma modeling for CS using VB in Section 3. Some motivational examples are provided to show the issues with this approach. Section 4 presents Gaussian-inverse Gamma modeling, the associated update rules using VB inference and a motivational example of the issue that may occur using this approach. In Section 5, we study the improvement of the performance of the models after some pre-/postprocessing, along with simulation results and comparisons. Section 6 concludes this work.

2. Variational Bayesian Inference

Variational Bayes (VB) is an effective approach to approximate intractable integrals that may arise in Bayesian inference. The main idea behind variational methods is to use a family of distributions over the latent variables with their own variational parameters. VB is a fast alternative to sampling methods such as Markov chain Monte Carlo (MCMC) and Sequential Monte Carlo (SMC) for performing approximate Bayesian inference [70,71]. For a probabilistic model with unknown parameters $\boldsymbol{\theta}$ and hidden variables $\mathbf{x}$, the posterior distribution of the unknowns, given a set of observations $\mathbf{y}$, can be written as $p(\mathbf{x},\boldsymbol{\theta}\,|\,\mathbf{y}) = p(\mathbf{x},\boldsymbol{\theta},\mathbf{y})/p(\mathbf{y})$. Finding the exact posterior in closed form to perform the inference would be a challenge, as the marginal distribution $p(\mathbf{y}) = \int p(\mathbf{y},\mathbf{x},\boldsymbol{\theta})\, d\mathbf{x}\, d\boldsymbol{\theta}$ is often intractable. As an efficient approximation method for such inference problems, VB provides an analytical approximation to the posterior $p(\mathbf{x},\boldsymbol{\theta}\,|\,\mathbf{y})$. VB approximates the joint density $p(\mathbf{x},\boldsymbol{\theta}\,|\,\mathbf{y})$ via a variational distribution $Q_{\mathbf{x},\boldsymbol{\theta}}(\mathbf{x},\boldsymbol{\theta})$, i.e., $p(\mathbf{x},\boldsymbol{\theta}\,|\,\mathbf{y}) \approx Q_{\mathbf{x},\boldsymbol{\theta}}(\mathbf{x},\boldsymbol{\theta})$. VB assumes that the distribution $Q$ can be fully factorized with respect to the unknown parameters and hidden variables, i.e.,
$$ Q_{\mathbf{x},\boldsymbol{\theta}}(\mathbf{x},\boldsymbol{\theta}) = q_{\mathbf{x}}(\mathbf{x})\, q_{\boldsymbol{\theta}}(\boldsymbol{\theta}) = \prod_{i=1}^{I} q_{x}(x_i) \prod_{j=1}^{J} q_{\theta}(\theta_j), $$
where $I$ and $J$ are the numbers of hidden variables and unknown parameters, respectively. This independence assumption in VB further simplifies the search for a closed-form solution to the approximation of the actual posterior. We wish to select the variational distribution $Q_{\mathbf{x},\boldsymbol{\theta}}(\mathbf{x},\boldsymbol{\theta})$ as close as possible to $p(\mathbf{x},\boldsymbol{\theta}\,|\,\mathbf{y})$, where closeness is formulated as minimizing the Kullback–Leibler (KL) divergence between the approximation $Q_{\mathbf{x},\boldsymbol{\theta}}(\mathbf{x},\boldsymbol{\theta})$ and the true posterior $p(\mathbf{x},\boldsymbol{\theta}\,|\,\mathbf{y})$ as
$$ Q^{\star}_{\mathbf{x},\boldsymbol{\theta}}(\mathbf{x},\boldsymbol{\theta}) = \arg\min_{Q_{\mathbf{x},\boldsymbol{\theta}}(\mathbf{x},\boldsymbol{\theta})} \mathrm{KL}\!\left( Q_{\mathbf{x},\boldsymbol{\theta}}(\mathbf{x},\boldsymbol{\theta}) \,\|\, p(\mathbf{x},\boldsymbol{\theta}\,|\,\mathbf{y}) \right) = \arg\min_{Q_{\mathbf{x},\boldsymbol{\theta}}(\mathbf{x},\boldsymbol{\theta})} \int Q_{\mathbf{x},\boldsymbol{\theta}}(\mathbf{x},\boldsymbol{\theta}) \log\frac{Q_{\mathbf{x},\boldsymbol{\theta}}(\mathbf{x},\boldsymbol{\theta})}{p(\mathbf{x},\boldsymbol{\theta}\,|\,\mathbf{y})}\, d\mathbf{x}\, d\boldsymbol{\theta}. $$
The quantity $\log p(\mathbf{y})$ can be written as $\log p(\mathbf{y}) = \log\left\{ \int p(\mathbf{x},\boldsymbol{\theta},\mathbf{y})\, d\mathbf{x}\, d\boldsymbol{\theta} \right\}$. Then, define
$$ \mathcal{F}\!\left( Q_{\mathbf{x},\boldsymbol{\theta}}(\mathbf{x},\boldsymbol{\theta}) \right) = \int Q_{\mathbf{x},\boldsymbol{\theta}}(\mathbf{x},\boldsymbol{\theta}) \log\frac{p(\mathbf{x},\boldsymbol{\theta},\mathbf{y})}{Q_{\mathbf{x},\boldsymbol{\theta}}(\mathbf{x},\boldsymbol{\theta})}\, d\mathbf{x}\, d\boldsymbol{\theta}. $$
It is straightforward to show that
$$ \log p(\mathbf{y}) = \mathcal{F}\!\left( Q_{\mathbf{x},\boldsymbol{\theta}}(\mathbf{x},\boldsymbol{\theta}) \right) + \mathrm{KL}\!\left( Q_{\mathbf{x},\boldsymbol{\theta}}(\mathbf{x},\boldsymbol{\theta}) \,\|\, p(\mathbf{x},\boldsymbol{\theta}\,|\,\mathbf{y}) \right). $$
Since (by Jensen’s inequality) $\mathrm{KL}\!\left( Q_{\mathbf{x},\boldsymbol{\theta}}(\mathbf{x},\boldsymbol{\theta}) \,\|\, p(\mathbf{x},\boldsymbol{\theta}\,|\,\mathbf{y}) \right) \geq 0$, we have $\log p(\mathbf{y}) \geq \mathcal{F}\!\left( Q_{\mathbf{x},\boldsymbol{\theta}}(\mathbf{x},\boldsymbol{\theta}) \right)$. Since $\log p(\mathbf{y})$ is constant with respect to $Q_{\mathbf{x},\boldsymbol{\theta}}$, minimizing the KL divergence between the actual posterior distribution and the variational distribution is equivalent to maximizing the lower bound $\mathcal{F}(\cdot)$ [66,67]. Since the term $p(\mathbf{y})$ in $p(\mathbf{x},\boldsymbol{\theta}\,|\,\mathbf{y}) = p(\mathbf{x},\boldsymbol{\theta},\mathbf{y})/p(\mathbf{y})$ does not involve the variational distribution $Q_{\mathbf{x},\boldsymbol{\theta}}(\mathbf{x},\boldsymbol{\theta})$, this term can be ignored when maximizing $\mathcal{F}(\cdot)$. The lower bound $\mathcal{F}(\cdot)$ on the model log-marginal likelihood can be iteratively optimized until convergence by the following update rules [66,72].
VB-E step: 
$$ q^{[t+1]}_{\mathbf{x}}(\mathbf{x}) \propto \exp\!\left\{ \mathbb{E}_{q^{[t]}_{\boldsymbol{\theta}}}\!\left[ \log p(\mathbf{x},\mathbf{y}\,|\,\boldsymbol{\theta}) \right] \right\} $$
VB-M step: 
$$ q^{[t+1]}_{\boldsymbol{\theta}}(\boldsymbol{\theta}) \propto p(\boldsymbol{\theta}) \exp\!\left\{ \mathbb{E}_{q^{[t+1]}_{\mathbf{x}}}\!\left[ \log p(\mathbf{x},\mathbf{y}\,|\,\boldsymbol{\theta}) \right] \right\} $$
This results in an iterative algorithm analogous to the expectation-maximization (EM) approach.
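To make the alternating structure of (2) and (3) concrete, the following toy example (our own illustration, not part of the paper) runs mean-field VB for i.i.d. data $x_i \sim \mathcal{N}(\mu, \lambda^{-1})$ with independent priors $\mu \sim \mathcal{N}(m_0, s_0^2)$ and $\lambda \sim \mathrm{Gamma}(a_0, b_0)$ and a factorized $q(\mu)q(\lambda)$; the two closed-form updates play the roles of the VB-E and VB-M steps above.

```python
# Toy mean-field VB example (not from the paper): data x_i ~ N(mu, 1/lambda),
# priors mu ~ N(m0, s0^2) and lambda ~ Gamma(a0, b0), factorized q(mu) q(lambda).
import numpy as np

rng = np.random.default_rng(0)
data = rng.normal(loc=2.0, scale=0.5, size=200)   # true mu = 2, true lambda = 4
N = data.size

m0, s0sq = 0.0, 10.0          # prior on mu
a0, b0 = 1e-3, 1e-3           # prior on lambda
m, ssq = 0.0, 1.0             # q(mu) = N(m, ssq)
a, b = a0, b0                 # q(lambda) = Gamma(a, b)

for _ in range(50):
    E_lambda = a / b
    # "VB-E"-like step: update q(mu) given the current E_q[lambda]
    ssq = 1.0 / (1.0 / s0sq + N * E_lambda)
    m = ssq * (m0 / s0sq + E_lambda * data.sum())
    # "VB-M"-like step: update q(lambda) given E_q[mu] and Var_q[mu]
    a = a0 + N / 2
    b = b0 + 0.5 * (np.sum((data - m) ** 2) + N * ssq)

print(f"posterior mean of mu ~ {m:.3f}, of lambda ~ {a / b:.3f}")
```

Each factor is updated using only the current expectations under the other factor, which is exactly the coordinate-ascent pattern used for the BGiG and GiG models in the following sections.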

3. Bernoulli–Gaussian-Inverse Gamma Modeling and SBL(BGiG) Algorithm

In the inverse problem of CS defined in (1), the goal is to recover the sparse vector $\mathbf{x}_s$. In the Bernoulli–Gaussian-inverse Gamma model, the sparse solution is defined as
$$ \mathbf{x}_s = (\mathbf{s} \circ \mathbf{x}), $$
where $\mathbf{s}$ is a binary support vector indicating the nonzero locations in the solution, $\mathbf{x}$ represents the values of the solution and $\circ$ is the Hadamard (element-by-element) product [47]. We refer to the algorithm associated with this Bayesian modeling based on VB inference as SBL(BGiG). SBL using VB inference for clustered patterns of sparse signals has already been investigated in the recent literature [45,50,51,58]. In this paper, however, we focus on ordinary SBL using VB inference, without promoting any structure on the supports other than sparsity itself. We show that when the sampling ratio is moderate or low (with respect to the sparsity level), the reconstruction performance becomes sensitive to the selection of the support-related hyperparameters.
We define a set of priors as follows [47,68,69]. We model the elements of vector s as
$$ s_n \sim \mathrm{Bernoulli}(\gamma_n), \qquad \gamma_n \sim \mathrm{Beta}(\alpha_0, \beta_0), \quad \forall n, $$
where $\alpha_0$ and $\beta_0$ are the support-related hyperparameters. Setting $\alpha_0$ and $\beta_0$ to small values with $\alpha_0 < \beta_0$ encourages $\mathbf{s}$ to be sparse on average. The prior on the solution value vector is defined as
$$ \mathbf{x} \sim \mathcal{N}(\mathbf{0}, \tau^{-1}\mathbf{I}_N), \qquad \tau \sim \mathrm{Gamma}(a_0, b_0). $$
Here, $\tau$ is the precision value. Finally, the prior on the noise is
$$ \mathbf{e} \sim \mathcal{N}(\mathbf{0}, \epsilon^{-1}\mathbf{I}_M), \qquad \epsilon \sim \mathrm{Gamma}(\theta_0, \theta_1), $$
where $\theta_0$ and $\theta_1$ are set to small positive values.

3.1. Update Rules of SBL(BGiG) Using VB Inference

According to the VB algorithm defined in (2) and (3), the update rule of the variables and parameters of the BGiG model can be simplified as follows [68]. The details of these derivations appear in Appendix A.1.
  • Update rule for the support vector s
$$ q(s_n\,|\,-) \sim \mathrm{Bernoulli}\!\left( \frac{1}{1 + c_n \kappa_n} \right), \quad n = 1, \ldots, N, $$
where conditioning on $-$ denotes conditioning on all relevant variables and observations. Therefore,
$$ \tilde{s}_n = \frac{1}{1 + c_n \kappa_n}, \quad n = 1, \ldots, N, $$
where
$$ c_n := e^{\psi(\beta_{1,n}) - \psi(\alpha_{1,n})}, \qquad \kappa_n := e^{\frac{1}{2}\tilde{\epsilon}\left( \|\mathbf{a}_n\|_2^2 (\tilde{x}_n^2 + \sigma_{\tilde{x}_n}^2) - 2\tilde{x}_n \mathbf{a}_n^T \tilde{\mathbf{y}}_{\langle n \rangle} \right)}, \qquad \tilde{y}_{m\langle n \rangle} := y_m - \sum_{l \neq n}^{N} a_{ml}\, \tilde{s}_l\, \tilde{x}_l. $$
Here, $\tilde{\mathbf{x}} := \langle \mathbf{x} \rangle_{q_{\mathbf{x}}}$, $\psi$ is the digamma function (the logarithmic derivative of the gamma function), and $\tilde{\mathbf{y}}_{\langle n \rangle} = [\tilde{y}_{1\langle n \rangle}, \ldots, \tilde{y}_{M\langle n \rangle}]^T$.
  • Update rule for the solution value vector $\mathbf{x}$
$$ q(\mathbf{x}\,|\,-) \sim \mathcal{N}(\tilde{\mathbf{x}}, \Sigma_{\tilde{\mathbf{x}}}), $$
where
$$ \Sigma_{\tilde{\mathbf{x}}} = \left( \tilde{\tau}\mathbf{I}_N + \tilde{\epsilon}\tilde{\Phi} \right)^{-1} \quad \text{and} \quad \tilde{\mathbf{x}} = \tilde{\epsilon}\, \Sigma_{\tilde{\mathbf{x}}}\, \mathrm{diag}(\tilde{\mathbf{s}})\, \mathbf{A}^T \mathbf{y}, $$
and where $\mathrm{diag}(\mathbf{s})$ denotes a diagonal matrix with the components of $\mathbf{s}$ on its main diagonal, and
$$ \tilde{\Phi} := (\mathbf{A}^T\mathbf{A}) \circ \left( \tilde{\mathbf{s}}\tilde{\mathbf{s}}^T + \mathrm{diag}(\tilde{\mathbf{s}} \circ (\mathbf{1} - \tilde{\mathbf{s}})) \right). $$
  • Update rule for $\gamma_n$
$$ q(\gamma_n\,|\,-) \sim \mathrm{Beta}(\alpha_{1,n}, \beta_{1,n}), \quad n = 1, \ldots, N. $$
Therefore,
$$ \tilde{\gamma}_n = \frac{\alpha_{1,n}}{\alpha_{1,n} + \beta_{1,n}}, \quad n = 1, \ldots, N, $$
where
$$ \alpha_{1,n} := \alpha_0 + \tilde{s}_n \quad \text{and} \quad \beta_{1,n} := \beta_0 + 1 - \tilde{s}_n. $$
  • Update rule for the solution precision $\tau$
$$ q(\tau\,|\,-) \sim \mathrm{Gamma}\!\left( a_0 + \frac{N}{2},\; b_0 + \frac{1}{2}\left( \|\tilde{\mathbf{x}}\|_2^2 + \mathrm{Tr}(\Sigma_{\tilde{\mathbf{x}}}) \right) \right), $$
where $\Sigma_{\tilde{\mathbf{x}}} = \mathrm{diag}(\sigma_{\tilde{x}_1}^2, \ldots, \sigma_{\tilde{x}_N}^2)$ and $\mathrm{Tr}(A)$ is the trace of matrix $A$. Thus,
$$ \tilde{\tau} = \frac{a_0 + \frac{N}{2}}{b_0 + \frac{1}{2}\left( \|\tilde{\mathbf{x}}\|_2^2 + \sum_{n=1}^{N} \sigma_{\tilde{x}_n}^2 \right)}. $$
  • Update rule for the noise precision $\epsilon$
$$ q(\epsilon\,|\,-) \sim \mathrm{Gamma}\!\left( \theta_0 + \frac{M}{2},\; \theta_1 + \frac{1}{2}\tilde{\Psi} \right), $$
where
$$ \tilde{\Psi} := \mathbf{y}^T\mathbf{y} - 2(\tilde{\mathbf{x}} \circ \tilde{\mathbf{s}})^T \mathbf{A}^T\mathbf{y} + \mathrm{Tr}\!\left( (\tilde{\mathbf{x}}\tilde{\mathbf{x}}^T + \Sigma_{\tilde{\mathbf{x}}})\, \tilde{\Phi} \right). $$
This yields the following update rule for the precision of the noise component:
$$ \tilde{\epsilon} = \frac{\theta_0 + \frac{M}{2}}{\theta_1 + \frac{1}{2}\tilde{\Psi}}. $$
The stopping criterion of the algorithm is based on the log-marginalized likelihood. We define the stopping condition in terms of $\mathcal{L} := \log\{ p(\mathbf{y}\,|\,\mathbf{s},\epsilon,\tau) \}$. The marginalized likelihood can be written as
$$ p(\mathbf{y}\,|\,\mathbf{s},\epsilon,\tau) = \int p(\mathbf{y}\,|\,\mathbf{x},\mathbf{s},\epsilon)\, p(\mathbf{x}\,|\,\tau^{-1}\mathbf{I}_N)\, d\mathbf{x}. $$
After some simplification, the negative log-likelihood is proportional to
$$ -\mathcal{L} \propto \log\left| \Sigma_0^{-1} \right| + \mathbf{y}^T \Sigma_0 \mathbf{y}, $$
where
$$ \Sigma_0 = \left( \tilde{\epsilon}^{-1}\mathbf{I}_M + \tilde{\tau}^{-1}\mathbf{A}\tilde{\mathbf{S}}^2\mathbf{A}^T \right)^{-1} $$
and $\tilde{\mathbf{S}} := \mathrm{diag}\{\tilde{\mathbf{s}}\}$. Therefore, the stopping condition can be made as
$$ \Delta\mathcal{L}_{\mathrm{n}}^{[t]} := \left| \Delta\mathcal{L}^{[t]} \right| / \left| \mathcal{L}^{[t-1]} \right| \leq T_0, $$
for some small value of the threshold $T_0$ [50], where
$$ \mathcal{L}^{[t]} := \log\left| \Sigma_0^{[t]} \right| - \mathbf{y}^T \Sigma_0^{[t]} \mathbf{y} $$
and
$$ \Delta\mathcal{L}^{[t]} := \mathcal{L}^{[t]} - \mathcal{L}^{[t-1]} = \log\frac{\left| \Sigma_0^{[t]} \right|}{\left| \Sigma_0^{[t-1]} \right|} + \mathbf{y}^T \left( \Sigma_0^{[t-1]} - \Sigma_0^{[t]} \right) \mathbf{y}. $$
Figure 1 illustrates the graphical Bayesian representation of the BGiG model, which is an undirected graph. The shaded node y shows the observations (measurements), and the small solid nodes represent the hyperparameters. Each unshaded node denotes a random variable (or a group of random variables).
The flowchart representation of the algorithm is shown in Figure 2, motivated by the graphical approach in [47,73]. According to the pseudocode in Algorithm 1 and the flowchart in Figure 2, first, the hyperparameters of the model are set. The support-related hyperparameters $\alpha_0$ and $\beta_0$ are suggested to be set to small values with $\alpha_0 < \beta_0$ to encourage $\mathbf{s}$ to be sparse on average. The hyperparameters $a_0$ and $b_0$ on the precision of the solution-value vector are also initialized and are suggested to be small so as not to bias the estimation when the measurements are incorporated. The hyperparameters $\theta_0$ and $\theta_1$ on the precision of the noise are recommended to be of order $10^{-6}$ for high SNRs. For moderate and low SNRs, higher values are recommended. In the next step, all the main variables of the model are drawn i.i.d. from their corresponding prior distributions defined in (5)–(7). Then, the stopping condition is computed based on the log-marginalized likelihood in (19). In the main loop, all of the main variables of the model are updated via the expected values obtained from the VB inference. Specifically, we first update the support vector and the solution value components; then, the precisions of the solution vector and the noise are updated. Finally, the stopping criterion is computed through the measure of the log-marginalized likelihood of the observations. The pseudocode of the algorithm is provided below, and a NumPy sketch of the same loop follows Algorithm 1.
Algorithm 1: SBL(BGiG) Algorithm
$[\tilde{\mathbf{x}}, \tilde{\mathbf{s}}] = \mathrm{SBL}_{\mathrm{BGiG}}(\mathbf{y}, \mathbf{A})$; output $\hat{\mathbf{x}}_s = \tilde{\mathbf{x}} \circ \tilde{\mathbf{s}}$
Set the hyperparameters, i.e., $(\alpha_0, \beta_0)$, $(a_0, b_0)$ and $(\theta_0, \theta_1)$
% Variables' initialization
Draw $\tilde{\mathbf{s}}$ and $\tilde{\boldsymbol{\gamma}}$ from (5)
Draw $\tilde{\mathbf{x}}$ and $\tilde{\tau}$ from (6)
Draw $\tilde{\epsilon}$ from (7)
$t = 1$     % Iterator
Compute $\mathcal{L}^{[t]}$ from (19) and set $\mathcal{L}^{[0]} = 0$
% Main loop for estimations
While $|\mathcal{L}^{[t]} - \mathcal{L}^{[t-1]}| / |\mathcal{L}^{[t-1]}| \geq T_0$ (e.g., $T_0 = 10^{-6}$):
       Compute $\tilde{s}_n$ from (8), $n = 1, \ldots, N$          % (Support vector components)
       Compute $\Sigma_{\tilde{\mathbf{x}}}$ and $\tilde{\mathbf{x}}$ from (10)          % (Solution-value vector)
       Compute $\alpha_{1,n}$ and $\beta_{1,n}$ from (13), $n = 1, \ldots, N$          % (Parameters of the hyperprior $\boldsymbol{\gamma}$)
       Compute $\tilde{\tau}$ from (14)          % (Precision on the solution)
       Compute $\tilde{\epsilon}$ from (16)          % (Precision on the noise)
       Compute $\mathcal{L}^{[t]}$ from (19) and then $t = t + 1$
End While
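For readers who prefer code to pseudocode, the following is a minimal NumPy sketch of the SBL(BGiG) loop above. It is our own illustration and not the authors' released MATLAB implementation: the default hyperparameters, the mean-based initialization (instead of drawing from the priors) and the leave-one-out residual computation are simplifications, and no numerical safeguards are included.

```python
# Minimal NumPy sketch of the SBL(BGiG) variational updates (Algorithm 1).
import numpy as np
from scipy.special import digamma

def sbl_bgig(y, A, alpha0=1.4, beta0=2.0, a0=1e-3, b0=1e-3,
             theta0=1e-6, theta1=1e-6, T0=1e-6, max_iter=500):
    M, N = A.shape
    col_norm2 = np.sum(A**2, axis=0)               # ||a_n||_2^2
    s = np.full(N, 0.5)                            # E[s_n]
    x = np.zeros(N)                                # E[x_n]
    Sigma_x = np.eye(N)                            # Cov[x]
    tau = a0 / b0                                  # E[tau]
    eps = theta0 / theta1                          # E[epsilon]
    L_old = 0.0
    for _ in range(max_iter):
        # --- support vector s, Eq. (8)-(9) ---
        alpha1 = alpha0 + s
        beta1 = beta0 + 1.0 - s
        for n in range(N):
            y_tilde_n = y - A @ (s * x) + A[:, n] * (s[n] * x[n])  # leave-one-out residual
            c_n = np.exp(digamma(beta1[n]) - digamma(alpha1[n]))
            kappa_n = np.exp(0.5 * eps * (col_norm2[n] * (x[n]**2 + Sigma_x[n, n])
                                          - 2.0 * x[n] * A[:, n] @ y_tilde_n))
            s[n] = 1.0 / (1.0 + c_n * kappa_n)
        # --- solution vector x, Eq. (10) ---
        Phi = (A.T @ A) * (np.outer(s, s) + np.diag(s * (1.0 - s)))
        Sigma_x = np.linalg.inv(tau * np.eye(N) + eps * Phi)
        x = eps * Sigma_x @ (np.diag(s) @ A.T @ y)
        # --- precisions tau and epsilon, Eq. (14), (16) ---
        tau = (a0 + N / 2) / (b0 + 0.5 * (x @ x + np.trace(Sigma_x)))
        Psi = y @ y - 2.0 * (x * s) @ A.T @ y + np.trace((np.outer(x, x) + Sigma_x) @ Phi)
        eps = (theta0 + M / 2) / (theta1 + 0.5 * Psi)
        # --- stopping rule based on the log-marginalized likelihood, Eq. (19) ---
        Sigma0 = np.linalg.inv(np.eye(M) / eps + (A * s**2) @ A.T / tau)
        _, logdet = np.linalg.slogdet(Sigma0)
        L = logdet - y @ Sigma0 @ y
        if abs(L - L_old) <= T0 * abs(L_old):
            break
        L_old = L
    return s, x
```

The returned posterior means can be combined as $\hat{\mathbf{x}}_s = \tilde{\mathbf{s}} \circ \tilde{\mathbf{x}}$, as in Algorithm 1.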

3.2. Issues with SBL(BGiG)

In this section, we show that the estimated solution using the SBL(BGiG) algorithm is sensitive to the support-related hyperparameters, i.e., $\alpha_0$ and $\beta_0$ in (5). We provide an example under three cases to demonstrate this issue. We generated a random scenario where the true solution $\mathbf{x}_s \in \mathbb{R}^{100}$ has a sparsity level of $k = 25$, that is, the true $\mathbf{x}$ (or $\mathbf{s}$) has $k$ active elements. The active elements of $\mathbf{s}$ were drawn randomly. The nonzeros of $\mathbf{x}_s$, corresponding to the active locations of $\mathbf{s}$, were drawn from $\mathcal{N}(0, \sigma_x^2)$ with $\sigma_x^2 = 1$. Each entry of the sensing matrix $\mathbf{A}$ was drawn i.i.d. from the Gaussian distribution $\mathcal{N}(0, 1)$ and then normalized so that each column has a Euclidean norm of 1. The elements of the measurement noise were drawn from $\mathcal{N}(0, \sigma^2)$ with $\mathrm{SNR} = 25$ dB, where $\mathrm{SNR} := 20\log_{10}(\sigma_x/\sigma)$. The hyperparameters of $\tau$ and $\epsilon$ were set to $a_0 = b_0 = 10^{-3}$ and $\theta_0 = \theta_1 = 10^{-6}$, respectively. In Cases 1–3, we set the pair $(\alpha_0, \beta_0)$ with low emphasis on the prior $(0.01, 0.99)$, moderate emphasis $(0.1, 0.9)$ and fairly high emphasis $(1.4, 2)$, respectively.
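For concreteness, the snippet below (our own helper, with an assumed function name generate_trial and random seed) generates one synthetic trial with the settings just described.

```python
# Illustration: generate one synthetic trial with the settings described above
# (N = 100, k = 25, unit-norm Gaussian sensing matrix, SNR = 25 dB).
import numpy as np

def generate_trial(M, N=100, k=25, snr_db=25, seed=None):
    rng = np.random.default_rng(seed)
    A = rng.standard_normal((M, N))
    A /= np.linalg.norm(A, axis=0)                 # unit Euclidean-norm columns
    s_true = np.zeros(N, dtype=bool)
    s_true[rng.choice(N, size=k, replace=False)] = True
    x_true = np.zeros(N)
    x_true[s_true] = rng.standard_normal(k)        # nonzeros ~ N(0, sigma_x^2 = 1)
    sigma = 10 ** (-snr_db / 20)                   # SNR := 20 log10(sigma_x / sigma)
    y = A @ x_true + sigma * rng.standard_normal(M)
    return y, A, x_true, s_true

y, A, x_true, s_true = generate_trial(M=80)        # sampling ratio lambda = 0.8
```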
From the top to the bottom row of Figure 3, Figure 4 and Figure 5, we illustrate the estimated results with the number of measurements set to 80, 60 and 40 (that is, the sampling ratio $\lambda$ is 0.80, 0.60 and 0.40), respectively. In each row of Figure 3, Figure 4 and Figure 5, from left to right, we show the comparison between the measurements $\mathbf{y}$ and the computed measurements $\hat{\mathbf{y}} = \mathbf{A}(\tilde{\mathbf{s}} \circ \tilde{\mathbf{x}})$, the true signal $\mathbf{x}_s = \mathbf{s} \circ \mathbf{x}$ and the reconstructed signal $\hat{\mathbf{x}}_s = \tilde{\mathbf{s}} \circ \tilde{\mathbf{x}}$, the true support vector $\mathbf{s}$ and the estimated support vector $\tilde{\mathbf{s}}$, and the evolution of the estimated supports with respect to the iterations of the SBL(BGiG) algorithm.
According to Figure 3, the setting for $(\alpha_0, \beta_0)$ in Case 1 fails to provide perfect results even for high sampling ratios. Similarly, Figure 4 shows that the setting for $(\alpha_0, \beta_0)$ in Case 2 does not provide encouraging results even for high sampling ratios. Specifically, it turns out that Cases 1 and 2 yield overly sparse solutions for sampling ratios across the range $[0, 1]$, where $\lambda = 1$ means $M = N$.
According to Figure 5, setting $(\alpha_0, \beta_0)$ to $(1.4, 2)$ seems to be a reasonable choice for high sampling ratios (over 70%), while it is not a good choice for lower sampling ratios. This issue can be seen in the support plots in the second and third rows of Figure 5. One may argue that the estimated support vector $\hat{\mathbf{s}}$ can be filtered via some threshold value (such as 0.3) for $\lambda = 0.6$. However, thresholding will adversely affect the detection rate, and setting the threshold depends on our understanding of the signal characteristics. Furthermore, we should account for the effect of the filtered supports, since their corresponding estimated components in $\hat{\mathbf{x}}_s$ contribute to fitting the model to the measurements.
In Table 1, we summarize the performance of the generated example for Cases 1–3, where $P_D$, $P_{FA}$ and NMSE denote the detection rate, the false alarm rate in support recovery and the normalized mean-squared error between the true and the estimated sparse signal, respectively. This also shows that the algorithm fails to provide reasonable results for the sampling ratio of $\lambda = 0.4$.
These experiments suggest that there is no fixed setting for ( α 0 , β 0 ) capable of performing reasonably well for all sampling ratios and thus, selecting the hyperparameters ( α 0 , β 0 ) should be made with care.
Continuing this examination, in Figure 6, Figure 7 and Figure 8, we illustrate the negative log-marginalized likelihood, the noise precision estimation and the estimated precision on the generated true solution in Cases 1–3, respectively. The horizontal axis shows the iterations until the stopping rule is met.
As expected, as the sampling ratio increases, the algorithm requires fewer iterations to meet its stopping condition. This can be seen on the negative log-marginalized likelihood plots in Figure 6, Figure 7 and Figure 8. In these experiments, the actual precision of the solution components was set to τ = 1 , and the actual noise precision was set to ϵ = 316.2 .
For Cases 1 and 2, according to Figure 6, Figure 7 and Figure 8, the estimated precisions on both the noise and solution components were far off from the actual ones even for λ = 0.8 . Thus, it resulted in poor performance in signal recovery for Cases 1 and 2 (see Figure 3 and Figure 4).
For Case 3, the estimated precisions on the noise and the solution components were acceptable for $\lambda = 0.8$ but far off from the actual ones for lower sampling ratios (see Figure 8). The main source of the failures can be found in the update rule of the support learning vector $\tilde{\mathbf{s}}$ defined in (8). It is important to balance the terms $c_n$ and $\kappa_n$: $c_n$ imposes the effect of the hyperprior on $\mathbf{s}$, accompanied by the current estimate of $s_n$, whereas $\kappa_n$ imposes the contribution of the current estimates of the noise precision, the solution and the other supports in fitting the model to the measurements. Therefore, if we impose a substantial weight on the sparsity via $c_n$, the solution tends to neglect the effect of $\kappa_n$, and vice versa. This is why we obtained overly sparse solutions (with poor performance) in Cases 1 and 2 for all the represented sampling ratios, and nonsparse solutions (with poor performance) for moderate and lower sampling ratios in Case 3. These results suggest that the algorithm and its update rules are sensitive to the selection of the hyperparameters of the Beta hyperprior related to the support vector $\mathbf{s}$. The main issue can be seen in (9), where the selection of the hyperparameters $\alpha_0$ and $\beta_0$ results in a large or small value of $c_n$ due to the digamma function.
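To see the scale of the prior-related factor concretely, the short snippet below (our own illustration, assuming a current support estimate $\tilde{s}_n = 0.5$) evaluates $c_n = e^{\psi(\beta_{1,n}) - \psi(\alpha_{1,n})}$ from (9) and (13) for the three hyperparameter settings of Cases 1–3; the resulting spread in $c_n$ shows how strongly the choice of $(\alpha_0, \beta_0)$ tilts the support update before the data-driven term $\kappa_n$ is even considered.

```python
# Illustration (not from the paper): magnitude of the prior factor c_n in (9)
# for the three hyperparameter settings, assuming a current estimate s_n = 0.5.
import numpy as np
from scipy.special import digamma

s_n = 0.5
for alpha0, beta0 in [(0.01, 0.99), (0.1, 0.9), (1.4, 2.0)]:
    alpha1 = alpha0 + s_n          # Eq. (13)
    beta1 = beta0 + 1.0 - s_n
    c_n = np.exp(digamma(beta1) - digamma(alpha1))
    print(f"(alpha0, beta0) = ({alpha0}, {beta0}):  c_n = {c_n:.3f}")
```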

4. Gaussian-Inverse Gamma Modeling and SBL(GiG) Algorithm

In this section, we consider the Gaussian-inverse Gamma (GiG) model. In this model, each component $x_n$ of the solution is modeled by a zero-mean Gaussian with precision $\tau_n$. The main difference between this model and the model defined in Section 3 is that the GiG model does not have the support vector $\mathbf{s}$; instead, different precisions are considered on the components of the solution vector $\mathbf{x}_s$ in (1). A simpler version of GiG can also be used by defining the same precision $\tau$ for all the components of $\mathbf{x}_s$.
Here, however, we use a different precision for each component so that the GiG model has almost the same complexity as the BGiG model in terms of the number of parameters to be learned. The set of priors in this model is defined as follows.
$$ x_n \sim \mathcal{N}(0, \tau_n^{-1}), \qquad \tau_n \sim \mathrm{Gamma}(a_0, b_0), \quad \forall n, $$
where $a_0$ and $b_0$ denote the shape and rate of the Gamma distribution, respectively. The entries of the noise component $\mathbf{e}$ are defined the same as in (7), i.e.,
$$ \mathbf{e} \sim \mathcal{N}(\mathbf{0}, \epsilon^{-1}\mathbf{I}_M), \qquad \epsilon \sim \mathrm{Gamma}(\theta_0, \theta_1), $$
where $\theta_0$ and $\theta_1$ are set to small positive values. The estimation of the parameters in this model is carried out using VB inference, as discussed below.

4.1. Update Rules of SBL(GiG) Using VB Inference

According to the VB algorithm described in (2) and (3), the update rule of the variables and parameters of the GiG model can be simplified as follows. The details of these derivations appear in Appendix A.2.
  • Update rule for the precision $\tau_n$ on $x_n$ using VB
$$ q(\tau_n) \sim \mathrm{Gamma}\!\left( a_0 + \frac{1}{2},\; b_0 + \frac{1}{2}(\tilde{x}_n^2 + \sigma_{\tilde{x}_n}^2) \right), \quad n = 1, \ldots, N. $$
Thus,
$$ \tilde{\tau}_n = \frac{a_0 + \frac{1}{2}}{b_0 + \frac{1}{2}(\tilde{x}_n^2 + \sigma_{\tilde{x}_n}^2)}, \quad n = 1, 2, \ldots, N. $$
  • Update rule for the noise precision $\epsilon$ using VB
$$ q(\epsilon) \sim \mathrm{Gamma}\!\left( \theta_0 + \frac{M}{2},\; \theta_1 + \frac{1}{2}\tilde{\Psi} \right), $$
which yields
$$ \tilde{\epsilon} = \frac{\theta_0 + \frac{M}{2}}{\theta_1 + \frac{1}{2}\tilde{\Psi}}, $$
where
$$ \tilde{\Psi} := \mathbf{y}^T\mathbf{y} - 2\tilde{\mathbf{x}}^T\mathbf{A}^T\mathbf{y} + \mathrm{Tr}\!\left( (\tilde{\mathbf{x}}\tilde{\mathbf{x}}^T + \Sigma_{\tilde{\mathbf{x}}})\, \mathbf{A}^T\mathbf{A} \right). $$
  • Update rule for the solution vector $\mathbf{x}$ using VB
$$ q_{\mathbf{x}}(\mathbf{x}) \sim \mathcal{N}(\tilde{\mathbf{x}}, \Sigma_{\tilde{\mathbf{x}}}), $$
where
$$ \Sigma_{\tilde{\mathbf{x}}} := (\tilde{\mathbf{T}} + \tilde{\epsilon}\mathbf{A}^T\mathbf{A})^{-1} \quad \text{and} \quad \tilde{\mathbf{x}} := \tilde{\epsilon}\, \Sigma_{\tilde{\mathbf{x}}}\, \mathbf{A}^T\mathbf{y}, $$
and
$$ \tilde{\mathbf{T}} := \mathrm{diag}\{ [\tilde{\tau}_1, \ldots, \tilde{\tau}_N] \}. $$
We set the stopping rule of the algorithm using the marginalized likelihood (evidence) defined as
$$ p(\mathbf{y}\,|\,\epsilon, \boldsymbol{\tau}) = \int p(\mathbf{y}\,|\,\mathbf{x},\epsilon,\boldsymbol{\tau})\, p(\mathbf{x}\,|\,\boldsymbol{\tau})\, d\mathbf{x}. $$
After simplification, and for the purpose of comparing $\mathcal{L}^{[t]}$ with $\mathcal{L}^{[t-1]}$ in the updating process, we have
$$ \mathcal{L}^{[t]} \propto \log\left| \Sigma_0^{[t]} \right| - \mathbf{y}^T \Sigma_0^{[t]} \mathbf{y}, $$
where $\Sigma_0$ is defined as
$$ \Sigma_0 := \left( \tilde{\epsilon}^{-1}\mathbf{I}_M + \mathbf{A}\tilde{\mathbf{T}}^{-1}\mathbf{A}^T \right)^{-1}. $$
Therefore, similar to SBL(BGiG), the stopping condition can be made as
$$ \Delta\mathcal{L}_{\mathrm{n}}^{[t]} := \left| \Delta\mathcal{L}^{[t]} \right| / \left| \mathcal{L}^{[t-1]} \right| \leq T_0, $$
for some small value of the threshold $T_0$.
Figure 9 illustrates the graphical Bayesian representation of the GiG model, which is an undirected graph. Similar to Figure 1, the shaded node y shows the observations, the small solid nodes represent the hyperparameters and the unshaded nodes denote the random variables.
The flowchart representation of the algorithm is shown in Figure 10. According to the pseudocode in Algorithm 2 and the flowchart in Figure 10, first, the hyperparameters of the model are set. The hyperparameters $a_0$ and $b_0$ on the precisions of the solution-value vector are initialized and suggested to be small. Similar to SBL(BGiG), the hyperparameters $\theta_0$ and $\theta_1$ on the precision of the noise are recommended to be of order $10^{-6}$ for high SNRs. All the main variables of the model are drawn i.i.d. from their corresponding prior distributions defined in (22)–(26). Then, the stopping condition is computed based on (28). In the main loop, all the main variables of the model are updated via the expected values obtained from the VB inference through (22)–(26). The pseudocode of the algorithm is provided below, and a NumPy sketch of the same loop follows Algorithm 2.
Algorithm 2: SBL(GiG) Algorithm
$\tilde{\mathbf{x}}_s = \mathrm{SBL}_{\mathrm{GiG}}(\mathbf{y}, \mathbf{A})$
Set the hyperparameters, i.e., $(a_0, b_0)$ and $(\theta_0, \theta_1)$
% Variables' initialization
Draw $\tilde{\mathbf{x}}_s$ and $\tilde{\boldsymbol{\tau}}$ from (21)
Draw $\tilde{\epsilon}$ from (7)
$t = 1$     % Iterator
Compute $\mathcal{L}^{[t]}$ from (28) and (27), and set $\mathcal{L}^{[0]} = 0$
% Main loop for estimations
While $|\mathcal{L}^{[t]} - \mathcal{L}^{[t-1]}| / |\mathcal{L}^{[t-1]}| \geq T_0$ (e.g., $T_0 = 10^{-6}$):
       Compute $\Sigma_{\tilde{\mathbf{x}}}$ and $\tilde{\mathbf{x}}_s$ from (26)          % (Solution-value vector)
       Compute $\tilde{\mathbf{T}}$ from (22)          % (Precisions on the solution)
       Compute $\tilde{\epsilon}$ from (23)          % (Precision on the noise)
       Compute $\mathcal{L}^{[t]}$ from (28) and (27), and then $t = t + 1$
End While
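As with SBL(BGiG), the following is a minimal NumPy sketch of the SBL(GiG) loop above (our own illustration, not the authors' released MATLAB code; the initialization at the prior means is a simplification).

```python
# Minimal NumPy sketch of the SBL(GiG) variational updates (Algorithm 2).
import numpy as np

def sbl_gig(y, A, a0=1e-3, b0=1e-3, theta0=1e-6, theta1=1e-6,
            T0=1e-6, max_iter=500):
    M, N = A.shape
    AtA = A.T @ A
    Aty = A.T @ y
    tau = np.full(N, a0 / b0)      # E[tau_n]
    eps = theta0 / theta1          # E[epsilon]
    x = np.zeros(N)
    Sigma_x = np.eye(N)
    L_old = 0.0
    for _ in range(max_iter):
        # --- solution vector x, Eq. (26) ---
        Sigma_x = np.linalg.inv(np.diag(tau) + eps * AtA)
        x = eps * Sigma_x @ Aty
        # --- per-component precisions tau_n, Eq. (22) ---
        tau = (a0 + 0.5) / (b0 + 0.5 * (x**2 + np.diag(Sigma_x)))
        # --- noise precision epsilon, Eq. (23) ---
        Psi = y @ y - 2.0 * x @ Aty + np.trace((np.outer(x, x) + Sigma_x) @ AtA)
        eps = (theta0 + M / 2) / (theta1 + 0.5 * Psi)
        # --- stopping rule, Eq. (27)-(28) ---
        Sigma0 = np.linalg.inv(np.eye(M) / eps + A @ np.diag(1.0 / tau) @ A.T)
        _, logdet = np.linalg.slogdet(Sigma0)
        L = logdet - y @ Sigma0 @ y
        if abs(L - L_old) <= T0 * abs(L_old):
            break
        L_old = L
    return x, tau
```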

4.2. Issues with SBL(GiG)

An issue with the SBL(GiG) algorithm is that the solution becomes nonsparse, since it does not incorporate a binary vector $\mathbf{s}$ (hard thresholding, or soft thresholding if the expected value is used) as we had in SBL(BGiG). This may have no major effect on the signal reconstruction for high sampling ratios. However, the nonsparseness effect appears at low sampling ratios by misleading the algorithm into wrongly activating many components in the estimated signal while still providing a good fit of the model to the measurements. Here, we use the same example as for the SBL(BGiG) model, with the same sensing matrix $\mathbf{A}$, measurement vector $\mathbf{y}$ and noise $\mathbf{e}$. Notice that in the SBL(BGiG) model, we considered the same precision $\tau$ on all the components of the solution value vector $\mathbf{x}$ that is combined with the support vector $\mathbf{s}$. In contrast, the SBL(GiG) model does not have the support learning vector; instead, we assume that each component of the solution vector has its own precision $\tau_n$. It turns out that SBL(GiG) is not as sensitive to the selection of the hyperparameters as SBL(BGiG) is. Thus, here, we show the results for a single setting of the hyperparameters. We use the same setting for the parameters of $\epsilon$ in the hyperprior as before, i.e., $\theta_0 = \theta_1 = 10^{-6}$, and the same parameters for all the precisions $\tau_n$ of the solution components, i.e., $a_0 = b_0 = 10^{-3}$. In Figure 11 and Figure 12, we illustrate the results after applying the SBL(GiG) algorithm. In Figure 11, from left to right, we show the results for sampling ratios of $\lambda$ = 0.8, 0.6 and 0.4, respectively. The first row shows the comparison of $\mathbf{y}$ with $\hat{\mathbf{y}} = \mathbf{A}\tilde{\mathbf{x}}_s$, the second row shows the true solution $\mathbf{x}_s$ and the estimated solution $\tilde{\mathbf{x}}_s$, and the third row demonstrates the estimated precisions on the solution components. In Figure 12, we demonstrate the negative log-marginalized likelihood comparison and the estimated noise precision against the true noise precision for the sampling ratios of $\lambda$ = 0.8, 0.6 and 0.4.
From the results shown in Figure 11 and Figure 12, we observe that the recovered signal tends to become nonsparse. This effect is illustrated in the second row of Figure 11. It can also be observed in the precision estimates of the solution components. More specifically, the true nonzero components in our simulations were drawn from a zero-mean Gaussian with precision $\tau_n = 1$. Thus, the ideal precision estimates would fall into two classes of values: 1, and infinity or very large values. However, the estimated results in our simulation do not show such a separation. As the sampling ratio decreases, the solution estimate performs poorly, due not only to the reduction in the number of measurements but also to the nonsparseness behavior.

5. Preprocessing versus Postprocessing and Simulations

In this section, we show that in order to improve the performance of Bernoulli–Gaussian-inverse Gamma modeling using the SBL(BGiG) algorithm, we need to perform a preprocessing step. The results in Section 4 suggest one can perform some postprocessing for the SBL(GiG) algorithm to improve the reconstruction performance. Below, we provide more details for each of these algorithms.

5.1. Pre-Processing for the SBL(BGiG) Algorithm

Based on the observations made on the performance of SBL(BGiG) in Section 3.2, we showed that the pair of hyperparameters $(\alpha_0, \beta_0)$ should be selected with care. In other words, obtaining good performance with this algorithm requires some preprocessing to assess an appropriate setting of these parameters. For a more rigorous study, here, we perform a grid search on the hyperparameters $(\alpha_0, \beta_0)$ to see whether we can find some common pattern in selecting these parameters for all sampling ratios. The grid search runs the algorithm for different values of $\alpha_0$ and $\beta_0$ over the search range $[0.1, 2]$ with a resolution of 0.1. For each $(\alpha_0, \beta_0)$ within this range, we ran 200 random trials and then averaged the results. The settings of these trials are represented in Table 2.
We generated a random scenario where the true solution $\mathbf{x}_s \in \mathbb{R}^{100}$ has a sparsity level of $k = 25$. The active elements of $\mathbf{s}$ were drawn randomly. The nonzeros of $\mathbf{x}_s$ were drawn from $\mathcal{N}(0, \sigma_x^2)$ with $\sigma_x^2 = 1$. Each entry of the sensing matrix $\mathbf{A}$ was drawn i.i.d. from the Gaussian distribution $\mathcal{N}(0, 1)$ and then normalized. The elements of the measurement noise were drawn from $\mathcal{N}(0, \sigma^2)$ with $\mathrm{SNR} = 25$ dB. The results were examined to see which values of $(\alpha_0, \beta_0)$ provided the highest performance in terms of the detection rate versus the false alarm rate. The simulation was executed for sampling ratios in the range $[0.05, 1]$ with a step size of 0.05. The results are demonstrated in Figure 13. In this figure, we also provide the results of performing a random Sobol search for $(\alpha_0, \beta_0)$. A Sobol sequence is a low-discrepancy quasirandom sequence. The two right plots in Figure 13 show the results for the best setting of $(\alpha_0, \beta_0)$.
It should be clear from Figure 13 that there is no fixed setting of these parameters that gives the best performance for all sampling ratios. The two plots on the right of Figure 13 illustrate the performance based on the values of these hyperparameters that provided the best performance, i.e., tuned hyperparameters. We also examined the grid search results for the top 10 highest performances for each sampling ratio, where performance is measured in terms of $P_D - P_{FA}$ and the normalized mean-squared error (NMSE). In Figure 14a, we demonstrate the top 10 highest performances based on NMSE and $P_D - P_{FA}$ for different sampling ratios. In Figure 14b,c, we illustrate the values of $(\alpha_0, \beta_0)$ which led to the performances shown in Figure 14a for different sampling ratios. Figure 15 details the top 10 values of $(\alpha_0, \beta_0)$ vs. sampling ratio.
According to Figure 14b,c, there is no specific pattern for these hyperparameters. Figure 15 also shows that hyperparameters need to be carefully selected.
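A sketch of the grid search described in this subsection is given below. It is our own illustration: it assumes the generate_trial() helper from Section 3.2 and the sbl_bgig() sketch after Algorithm 1, uses a support threshold of 0.5 to binarize $\tilde{\mathbf{s}}$, and averages over 10 trials per grid point instead of the 200 used in the paper.

```python
# Sketch of the (alpha0, beta0) grid search over [0.1, 2] with resolution 0.1.
import numpy as np

def support_metrics(s_hat, s_true, threshold=0.5):
    est = s_hat > threshold
    pd = np.mean(est[s_true])            # detection rate P_D
    pfa = np.mean(est[~s_true])          # false-alarm rate P_FA
    return pd, pfa

grid = np.round(np.arange(0.1, 2.01, 0.1), 1)
M = 60                                   # sampling ratio lambda = 0.6
scores = {}
for alpha0 in grid:
    for beta0 in grid:
        vals = []
        for t in range(10):              # 200 trials in the paper
            y, A, x_true, s_true = generate_trial(M, seed=t)
            s_hat, x_hat = sbl_bgig(y, A, alpha0=alpha0, beta0=beta0)
            pd, pfa = support_metrics(s_hat, s_true)
            vals.append(pd - pfa)        # score each pair by P_D - P_FA
        scores[(alpha0, beta0)] = np.mean(vals)

best_alpha0, best_beta0 = max(scores, key=scores.get)
```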

5.2. Post-Processing for the SBL(GiG) Algorithm

Since the SBL(GiG) algorithm does not include the binary support vector $\mathbf{s}$, as SBL(BGiG) does, the resulting solution tends to become nonsparse. This leads to a high detection rate for the locations of the active supports, but also a high false alarm rate. Thus, as the sampling ratio decreases, there is a high chance that this algorithm activates many spurious locations in addition to those of the true solution. Therefore, SBL(GiG) requires some postprocessing to discard the components with low amplitudes. This problem becomes of great importance for applications where detecting the correct nonzero components is more crucial than the magnitudes of the nonzeros in the signal. This effect can be seen in Figure 16b. The curves with solid lines in this plot show the detection and false alarm rates in support recovery and the difference between the rates. This issue can be resolved by some postprocessing such as data-driven threshold tuning, so that the components of the reconstructed signal with amplitudes lower than the threshold are discarded. For this purpose, we set up 200 random trials, in the same way as explained for SBL(BGiG), and then evaluated the performance in terms of NMSE while varying the threshold. Figure 16b shows the averaged results of the 200 trials. The settings of these trials are represented in Table 3.
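A sketch of this threshold-tuning step is given below (our own illustration, assuming the generate_trial() helper from Section 3.2 and the sbl_gig() sketch after Algorithm 2; a single trial is shown, whereas the paper averages over 200 trials).

```python
# Sketch of the postprocessing (threshold tuning) for SBL(GiG).
import numpy as np

def nmse_db(x_est, x_true):
    return 10 * np.log10(np.sum((x_est - x_true) ** 2) / np.sum(x_true ** 2))

y, A, x_true, s_true = generate_trial(M=60, seed=0)
x_hat, tau_hat = sbl_gig(y, A)

# Sweep the threshold and keep the value minimizing the reconstruction error.
thresholds = np.linspace(0.0, 1.0, 41)
errors = [nmse_db(np.where(np.abs(x_hat) < t, 0.0, x_hat), x_true)
          for t in thresholds]
best_t = thresholds[int(np.argmin(errors))]

x_post = np.where(np.abs(x_hat) < best_t, 0.0, x_hat)   # discard small amplitudes
s_post = x_post != 0                                     # implied support estimate
```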
In Figure 16a, we observe that the postprocessing does not provide much benefit in terms of the reconstruction error for low and moderate sampling ratios. However, there is a threshold of around 0.25 for which the postprocessing step reduces the reconstruction error by approximately 3 dB. We set the threshold to 0.25 and ran 200 random trials, applying SBL(GiG) and evaluating the performance based on the detection and false alarm rates in support recovery. According to Figure 16b, the additional postprocessing step provides reasonable performance.
Finally, in Figure 17, we compare the performance of the SBL(BGiG) algorithm (with performing the preprocessing step) with the SBL(GiG) algorithm (after performing postprocessing). We see that Bernoulli–Gaussian-inverse Gamma implemented via SBL(BGiG) provides better performance for low and high sampling ratios. In contrast, Gaussian-inverse Gamma modeling implemented via SBL(GiG) performs much better for the moderate sampling ratios.

6. Conclusions

We investigated solving the inverse problem of compressive sensing using VB inference for two sparse Bayesian models, Bernoulli–Gaussian-inverse Gamma (BGiG) and Gaussian-inverse Gamma (GiG). The issues of each approach were discussed and the performance of the two models was compared. Specifically, we showed the behavior of these models and algorithms when the sampling ratio is low or moderate, as well as the importance of selecting the hyperparameters of the BGiG model with care. We further provided some intuition for performing additional pre-/postprocessing steps, depending on the selected model, for better performance.

Based on our study on synthetic data, and considering the overall performance of both algorithms and the complexity of the additional pre-/postprocessing, we observed that for moderate sampling ratios, SBL(GiG) performs better than SBL(BGiG) when using VB for sparse signals with no specific pattern in the supports. In contrast, SBL(BGiG) provided better performance for low and high sampling ratios. Finally, a more rigorous comparison under real-world scenarios and various applications remains to be studied in the future. The MATLAB codes for GiG and BGiG modeling are available at https://github.com/MoShekaramiz/Compressive-Sensing-GiG-versus-BGiG-Modeling.git, accessed on 15 December 2022.

Author Contributions

Methodology, M.S. and T.K.M.; Formal analysis, M.S. and T.K.M.; Investigation, M.S.; Resources, M.S.; Writing—original draft, M.S.; Writing—review & editing, M.S. and T.K.M.; Visualization, M.S.; Supervision, T.K.M. All authors have read and agreed to the published version of the manuscript.

Funding

This research received no external funding.

Data Availability Statement

Conflicts of Interest

The authors declare no conflict of interest.

Appendix A

In this section, we provide details on deriving the update rules of the parameters and variables for both models and the associated algorithms.

Appendix A.1. Bernoulli–Gaussian-Inverse-Gamma Modeling and the SBL(BGiG)

  • Update rule for the precision τ of the solution value vector x
$$ \begin{aligned} q(\tau) &\propto p(\tau; a_0, b_0)\, e^{\langle \log p(\mathbf{x}\,|\,\tau^{-1}\mathbf{I}_N) \rangle_{q_{\mathbf{x}}}} \propto \tau^{a_0 - 1} e^{-b_0 \tau}\, e^{\langle \log \prod_{n=1}^{N} p(x_n; \tau^{-1}) \rangle_{q_{\mathbf{x}}}} \\ &\propto \tau^{a_0 - 1} e^{-b_0 \tau}\, e^{\langle \log \{ \tau^{\frac{N}{2}} e^{-\frac{\tau}{2}\|\mathbf{x}\|_2^2} \} \rangle_{q_{\mathbf{x}}}} \propto \tau^{(a_0 + \frac{N}{2}) - 1} e^{-\left( b_0 + \frac{1}{2}\langle \|\mathbf{x}\|_2^2 \rangle_{q_{\mathbf{x}}} \right) \tau}, \end{aligned} $$
where
$$ \langle \|\mathbf{x}\|_2^2 \rangle_{q_{\mathbf{x}}} = \langle \mathbf{x}^T\mathbf{x} \rangle_{q_{\mathbf{x}}} = \mathrm{Tr}\!\left( \langle \mathbf{x}\mathbf{x}^T \rangle_{q_{\mathbf{x}}} \right) = \|\tilde{\mathbf{x}}\|_2^2 + \sum_{n=1}^{N} \sigma_{\tilde{x}_n}^2, $$
and $\tilde{\mathbf{x}} := \langle \mathbf{x} \rangle_{q_{\mathbf{x}}}$. Therefore,
$$ q(\tau) \sim \mathrm{Gamma}\!\left( a_0 + \frac{N}{2},\; b_0 + \frac{1}{2}\left( \|\tilde{\mathbf{x}}\|_2^2 + \sum_{n=1}^{N} \sigma_{\tilde{x}_n}^2 \right) \right). $$
Finally, considering the point estimate on $\tau$ as the expected value of the Gamma distribution in $q(\tau)$, the update rule for $\tau$ can be defined as
$$ \tilde{\tau} = \frac{a_0 + \frac{N}{2}}{b_0 + \frac{1}{2}\left( \|\tilde{\mathbf{x}}\|_2^2 + \sum_{n=1}^{N} \sigma_{\tilde{x}_n}^2 \right)}. $$
  • Update rule for the noise precision ϵ
$$ \begin{aligned} q(\epsilon) &\propto p(\epsilon; \theta_0, \theta_1)\, e^{\langle \log p(\mathbf{y}\,|\,\mathbf{x},\mathbf{s},\epsilon) \rangle_{q_{\mathbf{x}} q_{\mathbf{s}}}} \propto \epsilon^{\theta_0 - 1} e^{-\theta_1 \epsilon}\, e^{\langle \log \{ \epsilon^{\frac{M}{2}} e^{-\frac{1}{2}\epsilon \| \mathbf{y} - \mathbf{A}(\mathbf{s} \circ \mathbf{x}) \|_2^2} \} \rangle_{q_{\mathbf{x}} q_{\mathbf{s}}}} \\ &\propto \epsilon^{(\theta_0 + \frac{M}{2}) - 1} e^{-\left( \theta_1 + \frac{1}{2}\langle \| \mathbf{y} - \mathbf{A}(\mathbf{s} \circ \mathbf{x}) \|_2^2 \rangle_{q_{\mathbf{x}} q_{\mathbf{s}}} \right) \epsilon}, \end{aligned} $$
where
$$ \begin{aligned} \langle \| \mathbf{y} - \mathbf{A}(\mathbf{s} \circ \mathbf{x}) \|_2^2 \rangle_{q_{\mathbf{x}} q_{\mathbf{s}}} &= \langle \| \mathbf{y} - \mathbf{A}\mathbf{S}\mathbf{x} \|_2^2 \rangle_{q_{\mathbf{x}} q_{\mathbf{s}}} = \mathbf{y}^T\mathbf{y} - 2\langle \mathbf{x}^T\mathbf{S}\mathbf{A}^T\mathbf{y} \rangle_{q_{\mathbf{x}} q_{\mathbf{s}}} + \langle \mathbf{x}^T\mathbf{S}\mathbf{A}^T\mathbf{A}\mathbf{S}\mathbf{x} \rangle_{q_{\mathbf{x}} q_{\mathbf{s}}} \\ &= \mathbf{y}^T\mathbf{y} - 2\langle \mathbf{x} \rangle_{q_{\mathbf{x}}}^T \langle \mathbf{S} \rangle_{q_{\mathbf{s}}} \mathbf{A}^T\mathbf{y} + \langle \mathbf{x}^T\mathbf{S}\mathbf{A}^T\mathbf{A}\mathbf{S}\mathbf{x} \rangle_{q_{\mathbf{x}} q_{\mathbf{s}}} = \mathbf{y}^T\mathbf{y} - 2(\tilde{\mathbf{x}} \circ \tilde{\mathbf{s}})^T \mathbf{A}^T\mathbf{y} + \langle \mathbf{x}^T \mathbf{M}_{\mathbf{s}} \mathbf{x} \rangle_{q_{\mathbf{x}} q_{\mathbf{s}}}, \end{aligned} $$
where $\mathbf{S} = \mathrm{diag}\{\mathbf{s}\}$, $\mathbf{M}_{\mathbf{s}} := \mathbf{S}\mathbf{A}^T\mathbf{A}\mathbf{S}$, and
$$ \langle \mathbf{x}^T \mathbf{M}_{\mathbf{s}} \mathbf{x} \rangle_{q_{\mathbf{x}} q_{\mathbf{s}}} = \mathrm{Tr}\!\left( \langle \mathbf{x}\mathbf{x}^T \rangle_{q_{\mathbf{x}}} \langle \mathbf{M}_{\mathbf{s}} \rangle_{q_{\mathbf{s}}} \right) = \mathrm{Tr}\!\left( (\tilde{\mathbf{x}}\tilde{\mathbf{x}}^T + \Sigma_{\tilde{\mathbf{x}}}) \langle \mathbf{M}_{\mathbf{s}} \rangle_{q_{\mathbf{s}}} \right), $$
where $\Sigma_{\tilde{\mathbf{x}}} = \mathrm{diag}\{ \sigma_{\tilde{x}_1}^2, \ldots, \sigma_{\tilde{x}_N}^2 \}$, and
$$ \langle \mathbf{M}_{\mathbf{s}} \rangle_{q_{\mathbf{s}}} = \langle \mathbf{S}\mathbf{A}^T\mathbf{A}\mathbf{S} \rangle_{q_{\mathbf{s}}} = \langle (\mathbf{A}^T\mathbf{A}) \circ (\mathbf{s}\mathbf{s}^T) \rangle_{q_{\mathbf{s}}} = (\mathbf{A}^T\mathbf{A}) \circ \langle \mathbf{s}\mathbf{s}^T \rangle_{q_{\mathbf{s}}} = (\mathbf{A}^T\mathbf{A}) \circ \left( \tilde{\mathbf{s}}\tilde{\mathbf{s}}^T + \mathrm{diag}\{ \tilde{\mathbf{s}} \circ (\mathbf{1} - \tilde{\mathbf{s}}) \} \right). $$
Therefore,
$$ \langle \mathbf{x}^T \mathbf{M}_{\mathbf{s}} \mathbf{x} \rangle_{q_{\mathbf{x}} q_{\mathbf{s}}} = \mathrm{Tr}\!\left( (\tilde{\mathbf{x}}\tilde{\mathbf{x}}^T + \Sigma_{\tilde{\mathbf{x}}}) \left( (\mathbf{A}^T\mathbf{A}) \circ ( \tilde{\mathbf{s}}\tilde{\mathbf{s}}^T + \mathrm{diag}\{ \tilde{\mathbf{s}} \circ (\mathbf{1} - \tilde{\mathbf{s}}) \} ) \right) \right). $$
As a result,
$$ q(\epsilon) \sim \mathrm{Gamma}\!\left( \theta_0 + \frac{M}{2},\; \theta_1 + \frac{1}{2}\tilde{\Psi} \right), $$
where
$$ \tilde{\Psi} := \langle \| \mathbf{y} - \mathbf{A}(\mathbf{s} \circ \mathbf{x}) \|_2^2 \rangle_{q_{\mathbf{x}} q_{\mathbf{s}}} = \mathbf{y}^T\mathbf{y} - 2(\tilde{\mathbf{x}} \circ \tilde{\mathbf{s}})^T \mathbf{A}^T\mathbf{y} + \mathrm{Tr}\!\left( (\tilde{\mathbf{x}}\tilde{\mathbf{x}}^T + \Sigma_{\tilde{\mathbf{x}}}) \left( (\mathbf{A}^T\mathbf{A}) \circ ( \tilde{\mathbf{s}}\tilde{\mathbf{s}}^T + \mathrm{diag}\{ \tilde{\mathbf{s}} \circ (\mathbf{1} - \tilde{\mathbf{s}}) \} ) \right) \right). $$
Finally, the update rule for the precision of the noise can be written as
$$ \tilde{\epsilon} = \frac{\theta_0 + \frac{M}{2}}{\theta_1 + \frac{1}{2}\tilde{\Psi}}. $$
Remark A1.
Notice that $\mathrm{Tr}(X^T Y) = \sum_{i,j} (X \circ Y)_{ij} = \mathbf{1}^T (X \circ Y) \mathbf{1}$. Therefore,
$$ \mathrm{Tr}\!\left( (\tilde{\mathbf{x}}\tilde{\mathbf{x}}^T + \Sigma_{\tilde{\mathbf{x}}}) \left( (\mathbf{A}^T\mathbf{A}) \circ ( \tilde{\mathbf{s}}\tilde{\mathbf{s}}^T + \mathrm{diag}\{ \tilde{\mathbf{s}} \circ (\mathbf{1} - \tilde{\mathbf{s}}) \} ) \right) \right) = \mathbf{1}^T \left( (\tilde{\mathbf{x}}\tilde{\mathbf{x}}^T + \Sigma_{\tilde{\mathbf{x}}}) \circ (\mathbf{A}^T\mathbf{A}) \circ ( \tilde{\mathbf{s}}\tilde{\mathbf{s}}^T + \mathrm{diag}\{ \tilde{\mathbf{s}} \circ (\mathbf{1} - \tilde{\mathbf{s}}) \} ) \right) \mathbf{1}, $$
where $\mathbf{1} = [1, \ldots, 1]^T$. Thus, $\tilde{\Psi}$ can be written as
$$ \tilde{\Psi} := \mathbf{y}^T\mathbf{y} - 2(\tilde{\mathbf{x}} \circ \tilde{\mathbf{s}})^T \mathbf{A}^T\mathbf{y} + \mathbf{1}^T \left( (\tilde{\mathbf{x}}\tilde{\mathbf{x}}^T + \Sigma_{\tilde{\mathbf{x}}}) \circ (\mathbf{A}^T\mathbf{A}) \circ ( \tilde{\mathbf{s}}\tilde{\mathbf{s}}^T + \mathrm{diag}\{ \tilde{\mathbf{s}} \circ (\mathbf{1} - \tilde{\mathbf{s}}) \} ) \right) \mathbf{1}. $$
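As a quick numerical sanity check of Remark A1 (our own illustration), the identity can be verified for random matrices:

```python
# Numerical check (illustration only) of Tr(X^T Y) = 1^T (X o Y) 1 in Remark A1.
import numpy as np

rng = np.random.default_rng(1)
X, Y = rng.standard_normal((4, 4)), rng.standard_normal((4, 4))
lhs = np.trace(X.T @ Y)
rhs = np.ones(4) @ (X * Y) @ np.ones(4)
assert np.isclose(lhs, rhs)
```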
  • Update rule for γ n
$$ \begin{aligned} q(\gamma_n) &\propto p(\gamma_n; \alpha_0, \beta_0)\, e^{\langle \log\{ p(\mathbf{x},\mathbf{s},\mathbf{y}\,|\,\boldsymbol{\theta}) \} \rangle_{q_{\mathbf{x}} q_{\mathbf{s}}}} \propto \gamma_n^{\alpha_0 - 1} (1 - \gamma_n)^{\beta_0 - 1}\, e^{\langle \log\{ p(s_n\,|\,\gamma_n) \} \rangle_{q_{\mathbf{x}} q_{\mathbf{s}}}} \\ &\propto \gamma_n^{\alpha_0 - 1} (1 - \gamma_n)^{\beta_0 - 1}\, e^{\langle \log\{ \gamma_n^{s_n} (1 - \gamma_n)^{1 - s_n} \} \rangle_{q_{s_n}}} \propto \gamma_n^{\alpha_0 - 1} (1 - \gamma_n)^{\beta_0 - 1}\, e^{\langle s_n \rangle_{q_{s_n}} \log\{\gamma_n\}}\, e^{(1 - \langle s_n \rangle_{q_{s_n}}) \log\{1 - \gamma_n\}} \\ &\propto \gamma_n^{\alpha_0 - 1} (1 - \gamma_n)^{\beta_0 - 1}\, \gamma_n^{\tilde{s}_n} (1 - \gamma_n)^{1 - \tilde{s}_n} \propto \gamma_n^{(\alpha_0 + \tilde{s}_n) - 1} (1 - \gamma_n)^{\beta_0 - \tilde{s}_n}. \end{aligned} $$
Therefore,
$$ q_{\gamma_n}(\gamma_n) \sim \mathrm{Beta}(\alpha_{1,n}, \beta_{1,n}), \quad n = 1, \ldots, N, $$
where $\alpha_{1,n} := \alpha_0 + \tilde{s}_n$ and $\beta_{1,n} := \beta_0 + 1 - \tilde{s}_n$. Finally, the update rule for $\gamma_n$ can be defined as
$$ \tilde{\gamma}_n = \frac{\alpha_{1,n}}{\alpha_{1,n} + \beta_{1,n}}. $$
  • Update rule for the solution vector x
$$ \begin{aligned} q_{\mathbf{x}}(\mathbf{x}) &\propto e^{\langle \log\{ p(\mathbf{x},\mathbf{s},\mathbf{y}\,|\,\boldsymbol{\theta}) \} \rangle_{q_{\boldsymbol{\theta}} q_{\mathbf{s}}}} \propto e^{\langle \log\{ p(\mathbf{x},\mathbf{s}\,|\,\boldsymbol{\theta})\, p(\mathbf{y}\,|\,\mathbf{x},\mathbf{s},\boldsymbol{\theta}) \} \rangle_{q_{\boldsymbol{\theta}} q_{\mathbf{s}}}} \\ &\propto e^{\langle \log\{ p(\mathbf{x}\,|\,\boldsymbol{\theta}) \} \rangle_{q_{\boldsymbol{\theta}}}}\, e^{\langle \log\{ p(\mathbf{y}\,|\,\mathbf{x},\mathbf{s},\boldsymbol{\theta}) \} \rangle_{q_{\boldsymbol{\theta}} q_{\mathbf{s}}}} \propto e^{\langle \log\{ p(\mathbf{x}\,|\,\tau) \} \rangle_{q_{\tau}}}\, e^{\langle \log\{ p(\mathbf{y}\,|\,\mathbf{x},\mathbf{s},\epsilon) \} \rangle_{q_{\epsilon} q_{\mathbf{s}}}}. \end{aligned} $$
To update the elements of $\mathbf{x}$, we have
$$ p(\mathbf{y}\,|\,\mathbf{x},\mathbf{s},\epsilon) \propto e^{-\frac{1}{2}\epsilon \| \mathbf{y} - \mathbf{A}(\mathbf{s} \circ \mathbf{x}) \|_2^2}. $$
Therefore,
$$ \langle \log\{ p(\mathbf{y}\,|\,\mathbf{x},\mathbf{s},\epsilon) \} \rangle_{q_{\epsilon} q_{\mathbf{s}}} \propto -\frac{1}{2} \langle \epsilon \| \mathbf{y} - \mathbf{A}(\mathbf{s} \circ \mathbf{x}) \|_2^2 \rangle_{q_{\epsilon} q_{\mathbf{s}}} = -\frac{1}{2} \langle \epsilon \rangle_{q_{\epsilon}} \langle \| \mathbf{y} - \mathbf{A}(\mathbf{s} \circ \mathbf{x}) \|_2^2 \rangle_{q_{\mathbf{s}}} = -\frac{1}{2} \tilde{\epsilon}\, \langle \| \mathbf{y} - \mathbf{A}(\mathbf{s} \circ \mathbf{x}) \|_2^2 \rangle_{q_{\mathbf{s}}} $$
and
$$ \begin{aligned} \langle \| \mathbf{y} - \mathbf{A}(\mathbf{s} \circ \mathbf{x}) \|_2^2 \rangle_{q_{\mathbf{s}}} &= \langle \mathrm{Tr}\!\left( \mathbf{y}\mathbf{y}^T + (\mathbf{x} \circ \mathbf{s})^T \mathbf{A}^T\mathbf{A} (\mathbf{x} \circ \mathbf{s}) - 2 (\mathbf{x} \circ \mathbf{s})^T \mathbf{A}^T\mathbf{y} \right) \rangle_{q_{\mathbf{s}}} \\ &\propto \langle (\mathbf{x} \circ \mathbf{s})^T \mathbf{A}^T\mathbf{A} (\mathbf{x} \circ \mathbf{s}) - 2 (\mathbf{x} \circ \mathbf{s})^T \mathbf{A}^T\mathbf{y} \rangle_{q_{\mathbf{s}}} \propto \langle \mathbf{x}^T \mathbf{S}^T \mathbf{A}^T\mathbf{A} \mathbf{S} \mathbf{x} - 2 \mathbf{x}^T \mathbf{S} \mathbf{A}^T\mathbf{y} \rangle_{q_{\mathbf{s}}} \\ &\propto \mathbf{x}^T \langle \mathbf{S}\mathbf{A}^T\mathbf{A}\mathbf{S} \rangle_{q_{\mathbf{s}}} \mathbf{x} - 2 \mathbf{x}^T \tilde{\mathbf{S}} \mathbf{A}^T\mathbf{y}. \end{aligned} $$
This yields
$$ \langle \| \mathbf{y} - \mathbf{A}(\mathbf{s} \circ \mathbf{x}) \|_2^2 \rangle_{q_{\mathbf{s}}} \propto \mathbf{x}^T \langle \mathbf{S}\mathbf{A}^T\mathbf{A}\mathbf{S} \rangle_{q_{\mathbf{s}}} \mathbf{x} - 2 \mathbf{x}^T \tilde{\mathbf{S}} \mathbf{A}^T\mathbf{y}, $$
which results in
$$ \langle \log p(\mathbf{y}\,|\,\mathbf{x},\mathbf{s},\epsilon) \rangle_{q_{\epsilon} q_{\mathbf{s}}} \propto -\frac{1}{2}\tilde{\epsilon} \left( \mathbf{x}^T \langle \mathbf{S}\mathbf{A}^T\mathbf{A}\mathbf{S} \rangle_{q_{\mathbf{s}}} \mathbf{x} - 2 \mathbf{x}^T \tilde{\mathbf{S}} \mathbf{A}^T\mathbf{y} \right). $$
Thus, we can write $q_{\mathbf{x}}(\mathbf{x})$ as
$$ q_{\mathbf{x}}(\mathbf{x}) \propto e^{\langle \log\{ p(\mathbf{x}\,|\,\tau) \} \rangle_{q_{\tau}}}\, e^{\langle \log\{ p(\mathbf{y}\,|\,\mathbf{x},\mathbf{s},\epsilon) \} \rangle_{q_{\epsilon} q_{\mathbf{s}}}} \propto e^{-\frac{1}{2}\tilde{\tau} \mathbf{x}^T\mathbf{x}}\, e^{-\frac{1}{2}\tilde{\epsilon} \left( \mathbf{x}^T \langle \mathbf{S}\mathbf{A}^T\mathbf{A}\mathbf{S} \rangle_{q_{\mathbf{s}}} \mathbf{x} - 2 \mathbf{x}^T \tilde{\mathbf{S}} \mathbf{A}^T\mathbf{y} \right)} \propto e^{-\frac{1}{2} \left( \mathbf{x}^T ( \tilde{\tau}\mathbf{I}_N + \tilde{\epsilon} \langle \mathbf{S}\mathbf{A}^T\mathbf{A}\mathbf{S} \rangle_{q_{\mathbf{s}}} ) \mathbf{x} - 2 \tilde{\epsilon} \mathbf{x}^T \tilde{\mathbf{S}} \mathbf{A}^T\mathbf{y} \right)}. $$
Notice that $\mathbf{S}\mathbf{A}^T\mathbf{A}\mathbf{S} = (\mathbf{A}^T\mathbf{A}) \circ (\mathbf{s}\mathbf{s}^T)$. Since $s_n$ is drawn from a Bernoulli distribution, we have $\langle s_n^2 \rangle_{q_{\mathbf{s}}} = \langle s_n \rangle_{q_{\mathbf{s}}} = \tilde{s}_n$, and
$$ \tilde{\mathbf{s}}\tilde{\mathbf{s}}^T = \begin{bmatrix} \tilde{s}_1^2 & \tilde{s}_1\tilde{s}_2 & \cdots & \tilde{s}_1\tilde{s}_N \\ \vdots & \vdots & \ddots & \vdots \\ \tilde{s}_N\tilde{s}_1 & \tilde{s}_N\tilde{s}_2 & \cdots & \tilde{s}_N^2 \end{bmatrix}. $$
Therefore,
$$ \langle \mathbf{S}^T\mathbf{A}^T\mathbf{A}\mathbf{S} \rangle_{q_{\mathbf{s}}} = (\mathbf{A}^T\mathbf{A}) \circ \left( \tilde{\mathbf{s}}\tilde{\mathbf{s}}^T - \mathrm{diag}\{ \tilde{\mathbf{s}} \circ \tilde{\mathbf{s}} \} + \mathrm{diag}\{ \tilde{\mathbf{s}} \} \right) = (\mathbf{A}^T\mathbf{A}) \circ \left( \tilde{\mathbf{s}}\tilde{\mathbf{s}}^T + \mathrm{diag}\{ \tilde{\mathbf{s}} \circ (\mathbf{1} - \tilde{\mathbf{s}}) \} \right), $$
which yields
$$ q_{\mathbf{x}}(\mathbf{x}) \sim \mathcal{N}(\tilde{\mathbf{x}}, \Sigma_{\tilde{\mathbf{x}}}), $$
where
$$ \Sigma_{\tilde{\mathbf{x}}} = \left( \tilde{\tau}\mathbf{I}_N + \tilde{\epsilon} (\mathbf{A}^T\mathbf{A}) \circ \left( \tilde{\mathbf{s}}\tilde{\mathbf{s}}^T + \mathrm{diag}\{ \tilde{\mathbf{s}} \circ (\mathbf{1} - \tilde{\mathbf{s}}) \} \right) \right)^{-1} $$
and
$$ \tilde{\mathbf{x}} = \tilde{\epsilon}\, \Sigma_{\tilde{\mathbf{x}}}\, \tilde{\mathbf{S}} \mathbf{A}^T\mathbf{y}, $$
where $\tilde{\mathbf{x}}$ is the update rule for the solution value vector $\mathbf{x}$.
  • Update rule for the support vector s
$$ \begin{aligned} q_{s_n}(s_n) &\propto e^{\langle \log\{ p(\mathbf{x},\mathbf{s},\mathbf{y}\,|\,\boldsymbol{\theta}) \} \rangle_{q_{\boldsymbol{\theta}} q_{\mathbf{x}}}} \propto e^{\langle \log\{ p(\mathbf{x},\mathbf{s}\,|\,\boldsymbol{\theta})\, p(\mathbf{y}\,|\,\mathbf{x},\mathbf{s},\boldsymbol{\theta}) \} \rangle_{q_{\boldsymbol{\theta}} q_{\mathbf{x}}}} \\ &\propto e^{\langle \log\{ p(s_n; \gamma_n) \} \rangle_{q_{\gamma_n}}}\, e^{\langle \log\{ p(\mathbf{y}\,|\,\mathbf{x},\mathbf{s},\epsilon) \} \rangle_{q_{\mathbf{s}_{\setminus n}} q_{\mathbf{x}} q_{\epsilon}}} \propto e^{\langle \log\{ \gamma_n^{s_n} (1-\gamma_n)^{1-s_n} \} \rangle_{q_{\gamma_n}}}\, e^{\langle \log\{ p(\mathbf{y}\,|\,\mathbf{x},\mathbf{s},\epsilon) \} \rangle_{q_{\mathbf{s}_{\setminus n}} q_{\mathbf{x}} q_{\epsilon}}}, \end{aligned} $$
where
$$ e^{\langle \log\{ \gamma_n^{s_n} (1-\gamma_n)^{1-s_n} \} \rangle_{q_{\gamma_n}}} = e^{s_n \langle \log\{\gamma_n\} \rangle_{q_{\gamma_n}}}\, e^{(1-s_n) \langle \log\{1-\gamma_n\} \rangle_{q_{\gamma_n}}}, $$
for which
$$ \langle \log \gamma_n \rangle_{q_{\gamma_n} \sim \mathrm{Beta}(\alpha_{1,n}, \beta_{1,n})} = \psi(\alpha_{1,n}) - \psi(\alpha_{1,n} + \beta_{1,n}) $$
and
$$ \langle \log\{1 - \gamma_n\} \rangle_{q_{\gamma_n} \sim \mathrm{Beta}(\alpha_{1,n}, \beta_{1,n})} = \psi(\beta_{1,n}) - \psi(\alpha_{1,n} + \beta_{1,n}), $$
where $\psi(\cdot)$ is the digamma function, the logarithmic derivative of the gamma function, i.e., $\psi(x) = \frac{d}{dx}\log\Gamma(x)$. Therefore,
$$ e^{\langle \log\{ \gamma_n^{s_n} (1-\gamma_n)^{1-s_n} \} \rangle_{q_{\gamma_n}}} = e^{s_n \left( \psi(\alpha_{1,n}) - \psi(\alpha_{1,n} + \beta_{1,n}) \right)}\, e^{(1-s_n) \left( \psi(\beta_{1,n}) - \psi(\alpha_{1,n} + \beta_{1,n}) \right)}. $$
Also,
$$ \begin{aligned} e^{\langle \log\{ p(\mathbf{y}|\mathbf{x},\mathbf{s},\epsilon) \} \rangle_{q_{\mathbf{s}_{\setminus n}} q_{\mathbf{x}} q_{\epsilon}}} &\propto e^{-\frac{1}{2} \langle \epsilon \| \mathbf{y} - \mathbf{A}(\mathbf{s} \circ \mathbf{x}) \|_2^2 \rangle_{q_{\mathbf{s}_{\setminus n}} q_{\mathbf{x}} q_{\epsilon}}} \propto e^{-\frac{1}{2}\tilde{\epsilon} \langle \| \mathbf{y} - \mathbf{A}(\mathbf{s} \circ \mathbf{x}) \|_2^2 \rangle_{q_{\mathbf{s}_{\setminus n}} q_{\mathbf{x}}}} \\ &\propto e^{-\frac{1}{2}\tilde{\epsilon} \left\langle \sum_{m=1}^{M} \left( y_m - \sum_{l=1}^{N} a_{ml} s_l x_l \right)^2 \right\rangle_{q_{\mathbf{s}_{\setminus n}} q_{\mathbf{x}}}} \\ &\propto e^{-\frac{1}{2}\tilde{\epsilon} \left\langle \left( (y_1 - \sum_{l \neq n}^{N} a_{1l} s_l x_l) - a_{1n} s_n x_n \right)^2 + \cdots + \left( (y_M - \sum_{l \neq n}^{N} a_{Ml} s_l x_l) - a_{Mn} s_n x_n \right)^2 \right\rangle_{q_{\mathbf{s}_{\setminus n}} q_{\mathbf{x}}}}, \end{aligned} $$
where $y_{m\langle n \rangle} := y_m - \sum_{l \neq n}^{N} a_{ml} s_l x_l$, $m = 1, 2, \ldots, M$. Therefore,
$$ \begin{aligned} e^{\langle \log\{ p(\mathbf{y}|\mathbf{x},\mathbf{s},\epsilon) \} \rangle_{q_{\mathbf{s}_{\setminus n}} q_{\mathbf{x}} q_{\epsilon}}} &\propto e^{-\frac{1}{2}\tilde{\epsilon} \left\langle \sum_{m=1}^{M} ( a_{mn} s_n x_n - y_{m\langle n \rangle} )^2 \right\rangle_{q_{\mathbf{s}_{\setminus n}} q_{\mathbf{x}}}} \propto e^{-\frac{1}{2}\tilde{\epsilon} \sum_{m=1}^{M} \left( a_{mn}^2 s_n^2 \langle x_n^2 \rangle_{q_{\mathbf{x}}} - 2 a_{mn} s_n \langle x_n y_{m\langle n \rangle} \rangle_{q_{\mathbf{s}_{\setminus n}} q_{\mathbf{x}}} \right)} \\ &\propto e^{-\frac{1}{2}\tilde{\epsilon} \left( \|\mathbf{a}_n\|_2^2 s_n^2 (\tilde{x}_n^2 + \sigma_{\tilde{x}_n}^2) - 2 s_n \tilde{x}_n \sum_{m=1}^{M} a_{mn} \langle y_{m\langle n \rangle} \rangle_{q_{\mathbf{s}_{\setminus n}} q_{\mathbf{x}}} \right)} \\ &\propto e^{-\frac{1}{2}\tilde{\epsilon} \left( \|\mathbf{a}_n\|_2^2 (\tilde{x}_n^2 + \sigma_{\tilde{x}_n}^2) s_n^2 - 2 s_n \tilde{x}_n \mathbf{a}_n^T \langle \mathbf{y}_{\langle n \rangle} \rangle_{q_{\mathbf{s}_{\setminus n}} q_{\mathbf{x}}} \right)}, \end{aligned} $$
where $y_{m\langle n \rangle}$ contains no $x_n$ component and
$$ \mathbf{y}_{\langle n \rangle} := [ y_{1\langle n \rangle}, y_{2\langle n \rangle}, \ldots, y_{M\langle n \rangle} ]^T. $$
Thus,
$$ \langle y_{m\langle n \rangle} \rangle_{q_{\mathbf{s}_{\setminus n}} q_{\mathbf{x}}} = \left\langle y_m - \sum_{l \neq n}^{N} a_{ml} s_l x_l \right\rangle_{q_{\mathbf{s}_{\setminus n}} q_{\mathbf{x}}} = y_m - \sum_{l \neq n}^{N} a_{ml} \tilde{s}_l \tilde{x}_l, $$
which yields
$$ \tilde{y}_{m\langle n \rangle} := \langle y_{m\langle n \rangle} \rangle_{q_{\mathbf{s}_{\setminus n}} q_{\mathbf{x}}}, \qquad \tilde{\mathbf{y}}_{\langle n \rangle} = \mathbf{y} - \sum_{l \neq n}^{N} \tilde{s}_l \tilde{x}_l \mathbf{a}_l, $$
and thus
$$ \tilde{\mathbf{y}}_{\langle n \rangle} := \langle \mathbf{y}_{\langle n \rangle} \rangle_{q_{\mathbf{s}_{\setminus n}} q_{\mathbf{x}}}. $$
Therefore,
$$ e^{\langle \log\{ p(\mathbf{y}|\mathbf{x},\mathbf{s},\epsilon) \} \rangle_{q_{\mathbf{s}_{\setminus n}} q_{\mathbf{x}} q_{\epsilon}}} \propto e^{-\frac{1}{2}\tilde{\epsilon} \left( \|\mathbf{a}_n\|_2^2 (\tilde{x}_n^2 + \sigma_{\tilde{x}_n}^2) s_n^2 - 2 ( \tilde{x}_n \mathbf{a}_n^T \tilde{\mathbf{y}}_{\langle n \rangle} ) s_n \right)}. $$
Finally,
$$ q_{s_n}(s_n) \propto e^{ s_n \left( \psi(\alpha_{1,n}) - \psi(\alpha_{1,n} + \beta_{1,n}) \right) + (1 - s_n) \left( \psi(\beta_{1,n}) - \psi(\alpha_{1,n} + \beta_{1,n}) \right) - \frac{1}{2}\tilde{\epsilon} \left( \|\mathbf{a}_n\|_2^2 (\tilde{x}_n^2 + \sigma_{\tilde{x}_n}^2) s_n^2 - 2 \tilde{x}_n \mathbf{a}_n^T \tilde{\mathbf{y}}_{\langle n \rangle} s_n \right) }. $$
Since $s_n$ is an outcome of a Bernoulli random variable,
$$ q_{s_n}(s_n = 0) \propto e^{ \psi(\beta_{1,n}) - \psi(\alpha_{1,n} + \beta_{1,n}) } $$
and
$$ q_{s_n}(s_n = 1) \propto e^{ \psi(\alpha_{1,n}) - \psi(\alpha_{1,n} + \beta_{1,n}) - \frac{1}{2}\tilde{\epsilon} \left( \|\mathbf{a}_n\|_2^2 (\tilde{x}_n^2 + \sigma_{\tilde{x}_n}^2) - 2 \tilde{x}_n \mathbf{a}_n^T \tilde{\mathbf{y}}_{\langle n \rangle} \right) }. $$
Therefore,
$$ q_{s_n}(s_n) \sim \mathrm{Bernoulli}\!\left( \frac{ q_{s_n}(s_n = 1) }{ q_{s_n}(s_n = 0) + q_{s_n}(s_n = 1) } \right) \sim \mathrm{Bernoulli}\!\left( \frac{1}{ 1 + \frac{ q_{s_n}(s_n = 0) }{ q_{s_n}(s_n = 1) } } \right), $$
which yields
$$ q_{s_n}(s_n) \sim \mathrm{Bernoulli}\!\left( \frac{1}{ 1 + e^{ \psi(\beta_{1,n}) - \psi(\alpha_{1,n} + \beta_{1,n}) }\, e^{ -\psi(\alpha_{1,n}) + \psi(\alpha_{1,n} + \beta_{1,n}) + \frac{1}{2}\tilde{\epsilon} \left( \|\mathbf{a}_n\|_2^2 (\tilde{x}_n^2 + \sigma_{\tilde{x}_n}^2) - 2 \tilde{x}_n \mathbf{a}_n^T \tilde{\mathbf{y}}_{\langle n \rangle} \right) } } \right) \sim \mathrm{Bernoulli}\!\left( \frac{1}{ 1 + e^{ \psi(\beta_{1,n}) - \psi(\alpha_{1,n}) + \frac{1}{2}\tilde{\epsilon} \left( \|\mathbf{a}_n\|_2^2 (\tilde{x}_n^2 + \sigma_{\tilde{x}_n}^2) - 2 \tilde{x}_n \mathbf{a}_n^T \tilde{\mathbf{y}}_{\langle n \rangle} \right) } } \right). $$
The update rule for the component $s_n$ can then be written as
$$ \tilde{s}_n = \frac{1}{ 1 + e^{ \psi(\beta_{1,n}) - \psi(\alpha_{1,n}) + \frac{1}{2}\tilde{\epsilon} \left( \|\mathbf{a}_n\|_2^2 (\tilde{x}_n^2 + \sigma_{\tilde{x}_n}^2) - 2 \tilde{x}_n \mathbf{a}_n^T \tilde{\mathbf{y}}_{\langle n \rangle} \right) } } $$
or equivalently,
$$ \tilde{s}_n = \frac{1}{1 + c_n \kappa_n}, \quad n = 1, \ldots, N, $$
where
$$ c_n := e^{ \psi(\beta_{1,n}) - \psi(\alpha_{1,n}) } $$
and
$$ \kappa_n := e^{ \frac{1}{2}\tilde{\epsilon} \left( \|\mathbf{a}_n\|_2^2 (\tilde{x}_n^2 + \sigma_{\tilde{x}_n}^2) - 2 \tilde{x}_n \mathbf{a}_n^T \tilde{\mathbf{y}}_{\langle n \rangle} \right) }. $$
  • Stopping rule
The stopping rule of the algorithm can be set based on the marginalized likelihood (evidence). We follow the effect of $\mathbf{s}$ on the evidence because, once $\mathbf{s}$ is learned, it is easy to compute $\mathbf{x}_s$. Therefore, we marginalize the distribution of $\mathbf{y}$ by integrating $\mathbf{x}$ out. The details are described below.
$$ \begin{aligned} p(\mathbf{y}\,|\,\mathbf{s},\epsilon,\tau) &= \int p(\mathbf{y},\mathbf{x}\,|\,\mathbf{s},\epsilon,\tau)\, d\mathbf{x} = \int p(\mathbf{y}\,|\,\mathbf{x},\mathbf{s},\epsilon,\tau)\, p(\mathbf{x}\,|\,\tau)\, d\mathbf{x} \\ &= \int \frac{1}{(2\pi\epsilon^{-1})^{\frac{M}{2}}} e^{-\frac{1}{2}\epsilon \| \mathbf{y} - \mathbf{A}(\mathbf{s} \circ \mathbf{x}) \|_2^2}\, \frac{1}{(2\pi\tau^{-1})^{\frac{N}{2}}} e^{-\frac{1}{2}\tau \|\mathbf{x}\|_2^2}\, d\mathbf{x} \\ &= \frac{\epsilon^{\frac{M}{2}} \tau^{\frac{N}{2}}}{(2\pi)^{\frac{M}{2}}} \int \frac{1}{(2\pi)^{\frac{N}{2}}} e^{-\frac{1}{2}\left( \epsilon \left( \mathbf{y}^T\mathbf{y} - 2(\mathbf{s} \circ \mathbf{x})^T\mathbf{A}^T\mathbf{y} + (\mathbf{s} \circ \mathbf{x})^T\mathbf{A}^T\mathbf{A}(\mathbf{s} \circ \mathbf{x}) \right) + \tau \mathbf{x}^T\mathbf{x} \right)}\, d\mathbf{x} \\ &= \frac{\epsilon^{\frac{M}{2}} \tau^{\frac{N}{2}}}{(2\pi)^{\frac{M}{2}}} \int \frac{1}{(2\pi)^{\frac{N}{2}}} e^{-\frac{1}{2}\left( \epsilon \left( \mathbf{y}^T\mathbf{y} - 2\mathbf{x}^T\mathbf{S}\mathbf{A}^T\mathbf{y} + \mathbf{x}^T\mathbf{S}\mathbf{A}^T\mathbf{A}\mathbf{S}\mathbf{x} \right) + \tau \mathbf{x}^T\mathbf{x} \right)}\, d\mathbf{x} \\ &= \frac{\epsilon^{\frac{M}{2}} \tau^{\frac{N}{2}}}{(2\pi)^{\frac{M}{2}}} e^{-\frac{1}{2}\epsilon \mathbf{y}^T\mathbf{y}} \int \frac{1}{(2\pi)^{\frac{N}{2}}} e^{-\frac{1}{2}\left( \mathbf{x}^T ( \epsilon\mathbf{S}\mathbf{A}^T\mathbf{A}\mathbf{S} + \tau\mathbf{I}_N ) \mathbf{x} - 2\epsilon \mathbf{x}^T\mathbf{S}\mathbf{A}^T\mathbf{y} \right)}\, d\mathbf{x}. \end{aligned} $$
Completing the square inside the exponent with $\boldsymbol{\mu} := (\tau\mathbf{I}_N + \epsilon\mathbf{S}\mathbf{A}^T\mathbf{A}\mathbf{S})^{-1} \epsilon\mathbf{S}\mathbf{A}^T\mathbf{y}$,
$$ \begin{aligned} p(\mathbf{y}\,|\,\mathbf{s},\epsilon,\tau) &= \frac{\epsilon^{\frac{M}{2}} \tau^{\frac{N}{2}}}{(2\pi)^{\frac{M}{2}}} e^{-\frac{1}{2}\epsilon \mathbf{y}^T\mathbf{y}}\, \left| (\tau\mathbf{I}_N + \epsilon\mathbf{S}\mathbf{A}^T\mathbf{A}\mathbf{S})^{-1} \right|^{\frac{1}{2}} \\ &\quad \times \int \frac{1}{(2\pi)^{\frac{N}{2}} \left| (\tau\mathbf{I}_N + \epsilon\mathbf{S}\mathbf{A}^T\mathbf{A}\mathbf{S})^{-1} \right|^{\frac{1}{2}}} e^{-\frac{1}{2}\left( (\mathbf{x} - \boldsymbol{\mu})^T (\tau\mathbf{I}_N + \epsilon\mathbf{S}\mathbf{A}^T\mathbf{A}\mathbf{S}) (\mathbf{x} - \boldsymbol{\mu}) - \epsilon^2 \mathbf{y}^T\mathbf{A}\mathbf{S}(\tau\mathbf{I}_N + \epsilon\mathbf{S}\mathbf{A}^T\mathbf{A}\mathbf{S})^{-1}\mathbf{S}\mathbf{A}^T\mathbf{y} \right)}\, d\mathbf{x}, \end{aligned} $$
which results in
$$ p(\mathbf{y}\,|\,\mathbf{s},\epsilon,\tau) = \frac{\epsilon^{\frac{M}{2}} \tau^{\frac{N}{2}}}{(2\pi)^{\frac{M}{2}}} e^{-\frac{1}{2}\epsilon \mathbf{y}^T\mathbf{y}}\, \left| (\tau\mathbf{I}_N + \epsilon\mathbf{S}\mathbf{A}^T\mathbf{A}\mathbf{S})^{-1} \right|^{\frac{1}{2}} e^{\frac{1}{2}\epsilon^2 \mathbf{y}^T\mathbf{A}\mathbf{S}(\tau\mathbf{I}_N + \epsilon\mathbf{S}\mathbf{A}^T\mathbf{A}\mathbf{S})^{-1}\mathbf{S}\mathbf{A}^T\mathbf{y}}. $$
Thus,
$$ \log p(\mathbf{y}\,|\,\mathbf{s},\epsilon,\tau) = -\frac{M}{2}\log\{2\pi\} + \frac{M}{2}\log\epsilon + \frac{N}{2}\log\tau - \frac{1}{2}\epsilon\mathbf{y}^T\mathbf{y} + \frac{1}{2}\log\left\{ \left| (\tau\mathbf{I}_N + \epsilon\mathbf{S}\mathbf{A}^T\mathbf{A}\mathbf{S})^{-1} \right| \right\} + \frac{1}{2}\epsilon^2 \mathbf{y}^T\mathbf{A}\mathbf{S}(\tau\mathbf{I}_N + \epsilon\mathbf{S}\mathbf{A}^T\mathbf{A}\mathbf{S})^{-1}\mathbf{S}\mathbf{A}^T\mathbf{y} $$
and
$$ -\frac{1}{2}\epsilon\mathbf{y}^T\mathbf{y} + \frac{1}{2}\epsilon^2 \mathbf{y}^T\mathbf{A}\mathbf{S}(\tau\mathbf{I}_N + \epsilon\mathbf{S}\mathbf{A}^T\mathbf{A}\mathbf{S})^{-1}\mathbf{S}\mathbf{A}^T\mathbf{y} = -\frac{1}{2}\epsilon\, \mathbf{y}^T \left( \mathbf{I}_M - \epsilon\mathbf{A}\mathbf{S}(\tau\mathbf{I}_N + \epsilon\mathbf{S}\mathbf{A}^T\mathbf{A}\mathbf{S})^{-1}\mathbf{S}\mathbf{A}^T \right) \mathbf{y}. $$
Also,
$$ \frac{N}{2}\log\{\tau\} + \frac{1}{2}\log\left\{ \left| (\tau\mathbf{I}_N + \epsilon\mathbf{S}\mathbf{A}^T\mathbf{A}\mathbf{S})^{-1} \right| \right\} = \frac{1}{2}\log\left\{ \left| (\tau\mathbf{I}_N)(\tau\mathbf{I}_N + \epsilon\mathbf{S}\mathbf{A}^T\mathbf{A}\mathbf{S})^{-1} \right| \right\} = -\frac{1}{2}\log\left\{ \left| (\tau^{-1}\mathbf{I}_N)(\tau\mathbf{I}_N + \epsilon\mathbf{S}\mathbf{A}^T\mathbf{A}\mathbf{S}) \right| \right\} = -\frac{1}{2}\log\left\{ \left| \mathbf{I}_N + \frac{\epsilon}{\tau}\mathbf{S}\mathbf{A}^T\mathbf{A}\mathbf{S} \right| \right\} = -\frac{1}{2}\log\left\{ \left| \mathbf{I}_M + \frac{\epsilon}{\tau}\mathbf{A}\mathbf{S}^2\mathbf{A}^T \right| \right\}. $$
Thus,
$$ \mathcal{L} := \log p(\mathbf{y}\,|\,\mathbf{s},\epsilon,\tau) = -\frac{M}{2}\log\{2\pi\} + \frac{M}{2}\log\{\epsilon\} - \frac{1}{2}\log\left| \mathbf{I}_M + \frac{\epsilon}{\tau}\mathbf{A}\mathbf{S}^2\mathbf{A}^T \right| - \frac{1}{2}\epsilon\, \mathbf{y}^T \left( \mathbf{I}_M - \epsilon\mathbf{A}\mathbf{S}(\tau\mathbf{I}_N + \epsilon\mathbf{S}\mathbf{A}^T\mathbf{A}\mathbf{S})^{-1}\mathbf{S}\mathbf{A}^T \right) \mathbf{y}. $$
For comparing the changes of $\mathcal{L}^{[t]}$ with $\mathcal{L}^{[t-1]}$ in the updating process, we have
$$ \begin{aligned} \mathcal{L} &\propto \frac{M}{2}\log\{\epsilon\} + \frac{1}{2}\log\left\{ \left| \left( \mathbf{I}_M + \frac{\epsilon}{\tau}\mathbf{A}\mathbf{S}^2\mathbf{A}^T \right)^{-1} \right| \right\} - \frac{1}{2}\epsilon\, \mathbf{y}^T \left( \mathbf{I}_M - \epsilon\mathbf{A}\mathbf{S}(\tau\mathbf{I}_N + \epsilon\mathbf{S}\mathbf{A}^T\mathbf{A}\mathbf{S})^{-1}\mathbf{S}\mathbf{A}^T \right) \mathbf{y} \\ &\propto \frac{1}{2}\left( \log\{ |\epsilon\mathbf{I}_M| \} + \log\left\{ \left| \left( \mathbf{I}_M + \frac{\epsilon}{\tau}\mathbf{A}\mathbf{S}^2\mathbf{A}^T \right)^{-1} \right| \right\} \right) - \frac{1}{2}\mathbf{y}^T \left( \epsilon^{-1}\mathbf{I}_M + \frac{1}{\tau}\mathbf{A}\mathbf{S}^2\mathbf{A}^T \right)^{-1} \mathbf{y} \\ &\propto \frac{1}{2}\log\left\{ |\epsilon^{-1}\mathbf{I}_M|^{-1} \left| \mathbf{I}_M + \frac{\epsilon}{\tau}\mathbf{A}\mathbf{S}^2\mathbf{A}^T \right|^{-1} \right\} - \frac{1}{2}\mathbf{y}^T \left( \epsilon^{-1}\mathbf{I}_M + \frac{1}{\tau}\mathbf{A}\mathbf{S}^2\mathbf{A}^T \right)^{-1} \mathbf{y} \\ &\propto \frac{1}{2}\log\left\{ \left| \epsilon^{-1}\mathbf{I}_M \left( \mathbf{I}_M + \frac{\epsilon}{\tau}\mathbf{A}\mathbf{S}^2\mathbf{A}^T \right) \right|^{-1} \right\} - \frac{1}{2}\mathbf{y}^T \left( \epsilon^{-1}\mathbf{I}_M + \frac{1}{\tau}\mathbf{A}\mathbf{S}^2\mathbf{A}^T \right)^{-1} \mathbf{y} \\ &\propto \log\left\{ \left| \epsilon^{-1}\mathbf{I}_M + \frac{1}{\tau}\mathbf{A}\mathbf{S}^2\mathbf{A}^T \right|^{-1} \right\} - \mathbf{y}^T \left( \epsilon^{-1}\mathbf{I}_M + \frac{1}{\tau}\mathbf{A}\mathbf{S}^2\mathbf{A}^T \right)^{-1} \mathbf{y}. \end{aligned} $$
Therefore,
$$ \mathcal{L}^{[t]} \propto \log\left| \Sigma_0^{[t]} \right| - \mathbf{y}^T \Sigma_0^{[t]} \mathbf{y}, $$
where
$$ \Sigma_0 := \left( \tilde{\epsilon}^{-1}\mathbf{I}_M + \tilde{\tau}^{-1}\mathbf{A}\tilde{\mathbf{S}}^2\mathbf{A}^T \right)^{-1}, $$
which yields
$$ -\mathcal{L} \propto \log\left\{ \left| \Sigma_0^{-1} \right| \right\} + \mathbf{y}^T \Sigma_0 \mathbf{y}. $$
This means that
$$ p(\mathbf{y}\,|\,\mathbf{s},\epsilon,\tau) = \frac{1}{(2\pi)^{\frac{M}{2}}} \frac{1}{\left| \Sigma_0^{-1} \right|^{\frac{1}{2}}} e^{-\frac{1}{2}\mathbf{y}^T \Sigma_0 \mathbf{y}} $$
or equivalently,
$$ p(\mathbf{y}\,|\,\mathbf{s},\epsilon,\tau) \sim \mathcal{N}(\mathbf{0}, \Sigma_0^{-1}). $$
Therefore, the stopping criterion can be made based on
$$ \Delta\mathcal{L}^{[t]} := \mathcal{L}^{[t]} - \mathcal{L}^{[t-1]} = \log\frac{\left| \Sigma_0^{[t]} \right|}{\left| \Sigma_0^{[t-1]} \right|} + \mathbf{y}^T \left( \Sigma_0^{[t-1]} - \Sigma_0^{[t]} \right) \mathbf{y}. $$

Appendix A.2. Gaussian-Inverse-Gamma Modeling and the SBL(GiG)

  • Update rule for the precision τ n of the nth component of the solution vector x
$$ \begin{aligned} q(\tau_n) &\propto p(\tau_n; a_0, b_0)\, e^{\langle \log p(\mathbf{x}\,|\,\mathbf{T}) \rangle_{q_{x_n}}} \propto \tau_n^{a_0 - 1} e^{-b_0 \tau_n}\, e^{\langle \log \prod_{n=1}^{N} p(x_n; \tau_n^{-1}) \rangle_{q_{x_n}}} \\ &\propto \tau_n^{a_0 - 1} e^{-b_0 \tau_n}\, e^{\langle \log\{ \tau_n^{\frac{1}{2}} e^{-\frac{\tau_n}{2} x_n^2} \} \rangle_{q_{x_n}}} \propto \tau_n^{(a_0 + \frac{1}{2}) - 1} e^{-b_0 \tau_n}\, e^{-\frac{\tau_n}{2} \langle x_n^2 \rangle_{q_{x_n}}} \\ &\propto \tau_n^{(a_0 + \frac{1}{2}) - 1} e^{-b_0 \tau_n}\, e^{-\frac{\tau_n}{2} (\tilde{x}_n^2 + \sigma_{\tilde{x}_n}^2)} \propto \tau_n^{(a_0 + \frac{1}{2}) - 1} e^{-\left( b_0 + \frac{1}{2}(\tilde{x}_n^2 + \sigma_{\tilde{x}_n}^2) \right) \tau_n}, \end{aligned} $$
where $\mathbf{T} := \mathrm{diag}\{ \tau_1, \ldots, \tau_N \}$. Therefore, we can model $\tau_n$ as
$$ q(\tau_n) \sim \mathrm{Gamma}\!\left( a_0 + \frac{1}{2},\; b_0 + \frac{1}{2}(\tilde{x}_n^2 + \sigma_{\tilde{x}_n}^2) \right). $$
The update rule for $\tau_n$ can then be defined as
$$ \tilde{\tau}_n = \frac{a_0 + \frac{1}{2}}{b_0 + \frac{1}{2}(\tilde{x}_n^2 + \sigma_{\tilde{x}_n}^2)}, \quad n = 1, 2, \ldots, N. $$
  • Update rule for the noise precision ϵ
$$ \begin{aligned} q(\epsilon) &\propto p(\epsilon; \theta_0, \theta_1)\, e^{\langle \log p(\mathbf{y}\,|\,\mathbf{x},\epsilon) \rangle_{q_{\mathbf{x}}}} \propto \epsilon^{\theta_0 - 1} e^{-\theta_1 \epsilon}\, e^{\langle \log\{ \epsilon^{\frac{M}{2}} e^{-\frac{1}{2}\epsilon \| \mathbf{y} - \mathbf{A}\mathbf{x} \|_2^2} \} \rangle_{q_{\mathbf{x}}}} \\ &\propto \epsilon^{(\theta_0 + \frac{M}{2}) - 1} e^{-\epsilon \left( \theta_1 + \frac{1}{2} \langle \| \mathbf{y} - \mathbf{A}\mathbf{x} \|_2^2 \rangle_{q_{\mathbf{x}}} \right)}, \end{aligned} $$
where
$$ \langle \| \mathbf{y} - \mathbf{A}\mathbf{x} \|_2^2 \rangle_{q_{\mathbf{x}}} = \mathbf{y}^T\mathbf{y} - 2\langle \mathbf{x} \rangle_{q_{\mathbf{x}}}^T \mathbf{A}^T\mathbf{y} + \langle \mathbf{x}^T\mathbf{A}^T\mathbf{A}\mathbf{x} \rangle_{q_{\mathbf{x}}} = \mathbf{y}^T\mathbf{y} - 2\tilde{\mathbf{x}}^T\mathbf{A}^T\mathbf{y} + \langle \mathbf{x}^T\mathbf{A}^T\mathbf{A}\mathbf{x} \rangle_{q_{\mathbf{x}}} $$
and
$$ \langle \mathbf{x}^T\mathbf{A}^T\mathbf{A}\mathbf{x} \rangle_{q_{\mathbf{x}}} = \mathrm{Tr}\!\left( \langle \mathbf{x}^T\mathbf{A}^T\mathbf{A}\mathbf{x} \rangle_{q_{\mathbf{x}}} \right) = \mathrm{Tr}\!\left( \langle \mathbf{x}\mathbf{x}^T \rangle_{q_{\mathbf{x}}} \mathbf{A}^T\mathbf{A} \right) = \mathrm{Tr}\!\left( (\tilde{\mathbf{x}}\tilde{\mathbf{x}}^T + \Sigma_{\tilde{\mathbf{x}}}) \mathbf{A}^T\mathbf{A} \right). $$
Therefore,
$$ \tilde{\Psi} := \langle \| \mathbf{y} - \mathbf{A}\mathbf{x} \|_2^2 \rangle_{q_{\mathbf{x}}} = \mathbf{y}^T\mathbf{y} - 2\tilde{\mathbf{x}}^T\mathbf{A}^T\mathbf{y} + \mathrm{Tr}\!\left( (\tilde{\mathbf{x}}\tilde{\mathbf{x}}^T + \Sigma_{\tilde{\mathbf{x}}}) \mathbf{A}^T\mathbf{A} \right), $$
and we can model $\epsilon$ as
$$ q(\epsilon) \sim \mathrm{Gamma}\!\left( \theta_0 + \frac{M}{2},\; \theta_1 + \frac{1}{2}\tilde{\Psi} \right). $$
Finally, the update rule for $\epsilon$ can then be written as
$$ \tilde{\epsilon} = \frac{\theta_0 + \frac{M}{2}}{\theta_1 + \frac{1}{2}\tilde{\Psi}}. $$
  • Update rule for the solution vector x
$$ q_{\mathbf{x}}(\mathbf{x}) \propto e^{\langle \log\{ p(\mathbf{x},\mathbf{y}\,|\,\boldsymbol{\theta}) \} \rangle_{q_{\boldsymbol{\theta}}}} \propto e^{\langle \log\{ p(\mathbf{x}\,|\,\boldsymbol{\theta})\, p(\mathbf{y}\,|\,\mathbf{x},\boldsymbol{\theta}) \} \rangle_{q_{\boldsymbol{\theta}}}} \propto e^{\langle \log\{ p(\mathbf{x}\,|\,\mathbf{T}) \} \rangle_{q_{\boldsymbol{\tau}}}}\, e^{\langle \log\{ p(\mathbf{y}\,|\,\mathbf{x},\epsilon) \} \rangle_{q_{\epsilon}}}, \quad \text{with} \quad e^{\langle \log\{ p(\mathbf{x}\,|\,\mathbf{T}) \} \rangle_{q_{\boldsymbol{\tau}}}} \propto e^{-\frac{1}{2}\mathbf{x}^T\tilde{\mathbf{T}}\mathbf{x}}, $$
where $\boldsymbol{\theta}$ contains the information on the parameters $\mathbf{T}$ and $\epsilon$, and $\tilde{\mathbf{T}} := \mathrm{diag}\{ \tilde{\tau}_1, \ldots, \tilde{\tau}_N \}$. To update the elements of $\mathbf{x}$, we have
$$ p(\mathbf{y}\,|\,\mathbf{x},\epsilon) \propto \epsilon^{\frac{M}{2}} e^{-\frac{1}{2}\epsilon \| \mathbf{y} - \mathbf{A}\mathbf{x} \|_2^2} \propto e^{-\frac{1}{2}\epsilon \| \mathbf{y} - \mathbf{A}\mathbf{x} \|_2^2}. $$
Therefore,
$$ \langle \log\{ p(\mathbf{y}\,|\,\mathbf{x},\epsilon) \} \rangle_{q_{\epsilon}} \propto -\frac{1}{2}\langle \epsilon \| \mathbf{y} - \mathbf{A}\mathbf{x} \|_2^2 \rangle_{q_{\epsilon}} = -\frac{1}{2}\langle \epsilon \rangle_{q_{\epsilon}} \| \mathbf{y} - \mathbf{A}\mathbf{x} \|_2^2 = -\frac{1}{2}\tilde{\epsilon} \| \mathbf{y} - \mathbf{A}\mathbf{x} \|_2^2. $$
Thus, we can write $q_{\mathbf{x}}(\mathbf{x})$ as
$$ q_{\mathbf{x}}(\mathbf{x}) \propto e^{\langle \log\{ p(\mathbf{x}\,|\,\mathbf{T}) \} \rangle_{q_{\mathbf{T}}}}\, e^{\langle \log p(\mathbf{y}\,|\,\mathbf{x},\boldsymbol{\theta}) \rangle_{q_{\boldsymbol{\theta}}}} \propto e^{-\frac{1}{2}\mathbf{x}^T\tilde{\mathbf{T}}\mathbf{x}}\, e^{-\frac{1}{2}\tilde{\epsilon}\left( \mathbf{x}^T\mathbf{A}^T\mathbf{A}\mathbf{x} - 2\mathbf{x}^T\mathbf{A}^T\mathbf{y} \right)} \propto e^{-\frac{1}{2}\left( \mathbf{x}^T ( \tilde{\mathbf{T}} + \tilde{\epsilon}\mathbf{A}^T\mathbf{A} ) \mathbf{x} - 2\tilde{\epsilon}\mathbf{x}^T\mathbf{A}^T\mathbf{y} \right)}. $$
Finally,
$$ q_{\mathbf{x}}(\mathbf{x}) \sim \mathcal{N}(\tilde{\mathbf{x}}, \Sigma_{\tilde{\mathbf{x}}}), $$
where
$$ \Sigma_{\tilde{\mathbf{x}}} := ( \tilde{\mathbf{T}} + \tilde{\epsilon}\mathbf{A}^T\mathbf{A} )^{-1} \quad \text{and} \quad \tilde{\mathbf{x}} := \tilde{\epsilon}\, \Sigma_{\tilde{\mathbf{x}}}\, \mathbf{A}^T\mathbf{y}. $$
  • Stopping rule
We set the stopping rule of the algorithm based on the marginalized log-likelihood (evidence) defined as
$$ \begin{aligned} p(\mathbf{y}\,|\,\epsilon,\mathbf{T}) &= \int p(\mathbf{y},\mathbf{x}\,|\,\epsilon,\mathbf{T})\, d\mathbf{x} = \int p(\mathbf{y}\,|\,\mathbf{x},\epsilon,\mathbf{T})\, p(\mathbf{x}\,|\,\mathbf{T})\, d\mathbf{x} \\ &= \int \frac{1}{(2\pi\epsilon^{-1})^{\frac{M}{2}}} e^{-\frac{1}{2}\epsilon \| \mathbf{y} - \mathbf{A}\mathbf{x} \|_2^2}\, \frac{1}{\left( (2\pi)^N |\mathbf{T}^{-1}| \right)^{\frac{1}{2}}} e^{-\frac{1}{2}\mathbf{x}^T\mathbf{T}\mathbf{x}}\, d\mathbf{x} \\ &= \frac{\epsilon^{\frac{M}{2}} |\mathbf{T}|^{\frac{1}{2}}}{(2\pi)^{\frac{M}{2}}} \int \frac{1}{(2\pi)^{\frac{N}{2}}} e^{-\frac{1}{2}\left( \epsilon \left( \mathbf{y}^T\mathbf{y} - 2\mathbf{x}^T\mathbf{A}^T\mathbf{y} + \mathbf{x}^T\mathbf{A}^T\mathbf{A}\mathbf{x} \right) + \mathbf{x}^T\mathbf{T}\mathbf{x} \right)}\, d\mathbf{x} \\ &= \frac{\epsilon^{\frac{M}{2}} |\mathbf{T}|^{\frac{1}{2}}}{(2\pi)^{\frac{M}{2}}} e^{-\frac{1}{2}\epsilon \mathbf{y}^T\mathbf{y}} \int \frac{1}{(2\pi)^{\frac{N}{2}}} e^{-\frac{1}{2}\left( \mathbf{x}^T ( \epsilon\mathbf{A}^T\mathbf{A} + \mathbf{T} ) \mathbf{x} - 2\epsilon \mathbf{x}^T\mathbf{A}^T\mathbf{y} \right)}\, d\mathbf{x} \\ &= \frac{\epsilon^{\frac{M}{2}} |\mathbf{T}|^{\frac{1}{2}}}{(2\pi)^{\frac{M}{2}}} e^{-\frac{1}{2}\epsilon \mathbf{y}^T\mathbf{y}}\, \left| (\mathbf{T} + \epsilon\mathbf{A}^T\mathbf{A})^{-1} \right|^{\frac{1}{2}} \int \frac{1}{(2\pi)^{\frac{N}{2}} \left| (\mathbf{T} + \epsilon\mathbf{A}^T\mathbf{A})^{-1} \right|^{\frac{1}{2}}} \\ &\qquad \times e^{-\frac{1}{2}\left( \left( \mathbf{x} - (\mathbf{T} + \epsilon\mathbf{A}^T\mathbf{A})^{-1}\epsilon\mathbf{A}^T\mathbf{y} \right)^T (\mathbf{T} + \epsilon\mathbf{A}^T\mathbf{A}) \left( \mathbf{x} - (\mathbf{T} + \epsilon\mathbf{A}^T\mathbf{A})^{-1}\epsilon\mathbf{A}^T\mathbf{y} \right) - \epsilon^2 \mathbf{y}^T\mathbf{A}(\mathbf{T} + \epsilon\mathbf{A}^T\mathbf{A})^{-1}\mathbf{A}^T\mathbf{y} \right)}\, d\mathbf{x}. \end{aligned} $$
Thus,
$$ \log p(\mathbf{y}\,|\,\epsilon,\mathbf{T}) = -\frac{M}{2}\log\{2\pi\} + \frac{M}{2}\log\epsilon + \frac{1}{2}\log|\mathbf{T}| - \frac{1}{2}\epsilon\mathbf{y}^T\mathbf{y} + \frac{1}{2}\log\left\{ \left| (\mathbf{T} + \epsilon\mathbf{A}^T\mathbf{A})^{-1} \right| \right\} + \frac{1}{2}\epsilon^2 \mathbf{y}^T\mathbf{A}(\mathbf{T} + \epsilon\mathbf{A}^T\mathbf{A})^{-1}\mathbf{A}^T\mathbf{y}. $$
Notice that
$$ -\frac{1}{2}\epsilon\mathbf{y}^T\mathbf{y} + \frac{1}{2}\epsilon^2 \mathbf{y}^T\mathbf{A}(\mathbf{T} + \epsilon\mathbf{A}^T\mathbf{A})^{-1}\mathbf{A}^T\mathbf{y} = -\frac{1}{2}\mathbf{y}^T \epsilon \left( \mathbf{I}_M - \epsilon\mathbf{A}(\mathbf{T} + \epsilon\mathbf{A}^T\mathbf{A})^{-1}\mathbf{A}^T \right) \mathbf{y} $$
and
$$ \frac{1}{2}\log|\mathbf{T}| + \frac{1}{2}\log\left\{ \left| (\mathbf{T} + \epsilon\mathbf{A}^T\mathbf{A})^{-1} \right| \right\} = \frac{1}{2}\log\left\{ \left| \mathbf{T}(\mathbf{T} + \epsilon\mathbf{A}^T\mathbf{A})^{-1} \right| \right\} = -\frac{1}{2}\log\left\{ \left| \mathbf{T}^{-1}(\mathbf{T} + \epsilon\mathbf{A}^T\mathbf{A}) \right| \right\} = -\frac{1}{2}\log\left\{ \left| \mathbf{I}_N + \epsilon\mathbf{T}^{-1}\mathbf{A}^T\mathbf{A} \right| \right\} = -\frac{1}{2}\log\left\{ \left| \mathbf{I}_M + \epsilon\mathbf{A}\mathbf{T}^{-1}\mathbf{A}^T \right| \right\}. $$
Thus,
$$ \mathcal{L} := \log p(\mathbf{y}\,|\,\epsilon,\mathbf{T}) = -\frac{M}{2}\log\{2\pi\} + \frac{M}{2}\log\epsilon - \frac{1}{2}\log\left\{ \left| \mathbf{I}_M + \epsilon\mathbf{A}\mathbf{T}^{-1}\mathbf{A}^T \right| \right\} - \frac{1}{2}\epsilon\, \mathbf{y}^T \left( \mathbf{I}_M - \epsilon\mathbf{A}(\mathbf{T} + \epsilon\mathbf{A}^T\mathbf{A})^{-1}\mathbf{A}^T \right) \mathbf{y}. $$
For comparing $\mathcal{L}^{[t]}$ with $\mathcal{L}^{[t-1]}$ in the updating process, we have
$$ \mathcal{L} \propto \frac{M}{2}\log\epsilon + \frac{1}{2}\log\left\{ \left| \left( \mathbf{I}_M + \epsilon\mathbf{A}\mathbf{T}^{-1}\mathbf{A}^T \right)^{-1} \right| \right\} - \frac{1}{2}\epsilon\, \mathbf{y}^T \left( \mathbf{I}_M - \epsilon\mathbf{A}(\mathbf{T} + \epsilon\mathbf{A}^T\mathbf{A})^{-1}\mathbf{A}^T \right) \mathbf{y}. $$
By the matrix inversion lemma,
$$ \mathbf{I}_M - \epsilon\mathbf{A}(\mathbf{T} + \epsilon\mathbf{A}^T\mathbf{A})^{-1}\mathbf{A}^T = \left( \mathbf{I}_M + \epsilon\mathbf{A}\mathbf{T}^{-1}\mathbf{A}^T \right)^{-1}. $$
Thus,
$$ \begin{aligned} \mathcal{L} &\propto \frac{M}{2}\log\epsilon + \frac{1}{2}\log\left\{ \left| \left( \mathbf{I}_M + \epsilon\mathbf{A}\mathbf{T}^{-1}\mathbf{A}^T \right)^{-1} \right| \right\} - \frac{1}{2}\epsilon\, \mathbf{y}^T \left( \mathbf{I}_M + \epsilon\mathbf{A}\mathbf{T}^{-1}\mathbf{A}^T \right)^{-1} \mathbf{y} \\ &\propto \frac{1}{2}\log\left\{ |\epsilon^{-1}\mathbf{I}_M|^{-1} \left| \mathbf{I}_M + \epsilon\mathbf{A}\mathbf{T}^{-1}\mathbf{A}^T \right|^{-1} \right\} - \frac{1}{2}\mathbf{y}^T \left( \epsilon^{-1}\mathbf{I}_M + \mathbf{A}\mathbf{T}^{-1}\mathbf{A}^T \right)^{-1} \mathbf{y} \\ &\propto \frac{1}{2}\log\left\{ \left| \epsilon^{-1}\mathbf{I}_M \left( \mathbf{I}_M + \epsilon\mathbf{A}\mathbf{T}^{-1}\mathbf{A}^T \right) \right|^{-1} \right\} - \frac{1}{2}\mathbf{y}^T \left( \epsilon^{-1}\mathbf{I}_M + \mathbf{A}\mathbf{T}^{-1}\mathbf{A}^T \right)^{-1} \mathbf{y} \\ &\propto \log\left\{ \left| \epsilon^{-1}\mathbf{I}_M + \mathbf{A}\mathbf{T}^{-1}\mathbf{A}^T \right|^{-1} \right\} - \mathbf{y}^T \left( \epsilon^{-1}\mathbf{I}_M + \mathbf{A}\mathbf{T}^{-1}\mathbf{A}^T \right)^{-1} \mathbf{y}. \end{aligned} $$
Therefore,
$$ \mathcal{L}^{[t]} \propto \log\left| \Sigma_0^{[t]} \right| - \mathbf{y}^T \Sigma_0^{[t]} \mathbf{y}, $$
where
$$ \Sigma_0 := \left( \tilde{\epsilon}^{-1}\mathbf{I}_M + \mathbf{A}\tilde{\mathbf{T}}^{-1}\mathbf{A}^T \right)^{-1}. $$
This means that
$$ p(\mathbf{y}\,|\,\epsilon,\mathbf{T}) = \frac{1}{(2\pi)^{\frac{M}{2}}} \frac{1}{\left| \Sigma_0^{-1} \right|^{\frac{1}{2}}} e^{-\frac{1}{2}\mathbf{y}^T \Sigma_0 \mathbf{y}} $$
or equivalently,
$$ p(\mathbf{y}\,|\,\epsilon,\mathbf{T}) \sim \mathcal{N}(\mathbf{0}, \Sigma_0^{-1}). $$
Thus, the stopping criterion can be made based on
$$ \Delta\mathcal{L}^{[t]} := \mathcal{L}^{[t]} - \mathcal{L}^{[t-1]} = \log\frac{\left| \Sigma_0^{[t]} \right|}{\left| \Sigma_0^{[t-1]} \right|} + \mathbf{y}^T \left( \Sigma_0^{[t-1]} - \Sigma_0^{[t]} \right) \mathbf{y}. $$

References

1. Candes, E.J.; Romberg, J.; Tao, T. Robust uncertainty principles: Exact signal reconstruction from highly incomplete frequency information. IEEE Trans. Inf. Theory 2006, 52, 489–509.
2. Donoho, D.L. Compressed sensing. IEEE Trans. Inf. Theory 2006, 52, 1289–1306.
3. Candes, E.J.; Wakin, M.B. An introduction to compressive sampling. IEEE Signal Process. Mag. 2008, 25, 21–30.
4. Duarte, M.; Davenport, M.; Takhar, D.; Laska, J.; Sun, T.; Kelly, K.; Baraniuk, R. Single-pixel imaging via compressive sampling. IEEE Signal Process. Mag. 2008, 25, 83–91.
5. Bajwa, W.; Haupt, J.; Sayeed, A.; Nowak, R. Compressed channel sensing: A new approach to estimating sparse multipath channels. Proc. IEEE 2010, 98, 1058–1076.
6. Lustig, M.; Donoho, D.; Pauly, J. Sparse MRI: The application of compressed sensing for rapid MR imaging. Magn. Reson. Med. 2007, 58, 1182–1195.
7. Kutyniok, G. Theory and applications of compressed sensing. GAMM-Mitteilungen 2013, 36, 79–101.
8. Chang, K.; Ding, P.; Li, B. Compressive sensing reconstruction of correlated images using joint regularization. IEEE Signal Process. Lett. 2016, 23, 449–453.
9. Wijewardhana, U.L.; Codreanu, M.; Latva-aho, M. Bayesian method for image recovery from block compressive sensing. In Proceedings of the 2016 50th Asilomar Conference on Signals, Systems and Computers, Pacific Grove, CA, USA, 6–9 November 2016; pp. 379–383.
10. Qaisar, S.; Bilal, R.M.; Iqbal, W.; Naureen, M.; Lee, S. Compressive sensing: From theory to applications, a survey. Commun. Netw. J. 2013, 15, 443–456.
11. Mishali, M.; Eldar, Y.C. Blind multi-band signal reconstruction: Compressed sensing for analog signals. IEEE Trans. Signal Process. 2009, 57, 993–1009.
12. Mishali, M.; Eldar, Y.C. Xampling: Signal acquisition and processing in unions of subspaces. IEEE Trans. Signal Process. 2011, 59, 4719–4734.
13. Cohen, D.; Mishra, K.V.; Eldar, Y.C. Spectrum sharing Radar: Coexistence via xampling. IEEE Trans. Aerosp. Electron. Syst. 2017, 54, 1279–1296.
14. Aubry, A.; Carotenuto, V.; Maio, A.D.; Govoni, M.A.; Farina, A. Experimental analysis of block-sparsity-based spectrum sensing techniques for cognitive Radar. IEEE Trans. Aerosp. Electron. Syst. 2020, 57, 355–370.
15. Hwang, S.; Seo, J.; Park, J.; Kim, H.; Jeong, B.J. Compressive sensing-based Radar imaging and subcarrier allocation for joint MIMO OFDM Radar and communication system. Sensors 2021, 21, 2382.
16. Rani, M.; Dhok, S.B.; Deshmukh, R.B. A systematic review of compressive sensing: Concepts, implementations and applications. IEEE Access 2018, 6, 4875–4894.
17. Zhan, Z.; Li, Q.; Huang, J. Application of wavefield compressive sensing in surface wave tomography. Geophys. J. Int. 2018, 213, 1731–1743.
18. Da Poian, G.; Rozell, C.J.; Bernardini, R.; Rinaldo, R.; Clifford, G.D. Matched filtering for heart rate estimation on compressive sensing ECG measurements. IEEE Trans. Biomed. Eng. 2017, 65, 1349–1358.
19. Djelouat, H.; Zhai, X.; Disi, M.A.; Amira, A.; Bensaali, F. System-on-chip solution for patients biometric: A compressive sensing-based approach. IEEE Sens. J. 2018, 18, 9629–9639.
20. Zhang, P.; Wang, S.; Guo, K.; Wang, J. A secure data collection scheme based on compressive sensing in wireless sensor networks. Ad Hoc Netw. 2018, 70, 73–84.
21. Sharma, S.K.; Chatzinotas, S.; Ottersten, B. Compressive sparsity order estimation for wideband cognitive radio receiver. IEEE Trans. Signal Process. 2014, 62, 4984–4996.
22. Zhao, T.; Wang, Y. Statistical interpolation of spatially varying but sparsely measured 3D geo-data using compressive sensing and variational Bayesian inference. Math. Geosci. 2021, 53, 1171–1199.
23. Han, R.; Bai, L.; Zhang, W.; Liu, J.; Choi, J.; Zhang, W. Variational inference based sparse signal detection for next generation multiple access. IEEE J. Sel. Areas Commun. 2022, 40, 1114–1127.
24. Tang, V.H.; Bouzerdoum, A.; Phung, S.L. Variational Bayesian compressive multipolarization indoor Radar imaging. IEEE Trans. Geosci. Remote Sens. 2021, 59, 7459–7474.
25. Wan, Q.; Fang, J.; Huang, Y.; Duan, H.; Li, H. A variational Bayesian inference-inspired unrolled deep network for MIMO detection. IEEE Trans. Signal Process. 2022, 70, 423–437.
26. Fang, J.; Shen, Y.; Li, H.; Wang, P. Pattern-coupled sparse Bayesian learning for recovery of block-sparse signals. IEEE Trans. Signal Process. 2015, 63, 360–372.
27. Shekaramiz, M.; Moon, T.K.; Gunther, J.H. Bayesian compressive sensing of sparse signals with unknown clustering patterns. Entropy 2019, 21, 247.
28. Wipf, D.P.; Rao, B.D. Sparse Bayesian learning for basis selection. IEEE Trans. Signal Process. 2004, 52, 2153–2164.
29. Lv, F.; Zhang, C.; Tang, Z.; Zhang, P. Block-sparse signal recovery based on adaptive matching pursuit via spike and slab prior. In Proceedings of the 2020 IEEE 11th Sensor Array and Multichannel Signal Processing Workshop (SAM), Hangzhou, China, 8–11 June 2020; pp. 1–5.
30. Worley, B. Scalable mean-field sparse Bayesian learning. IEEE Trans. Signal Process. 2019, 67, 6314–6326.
31. Chen, P.; Zhao, J.; Bai, X. Block inverse-free sparse Bayesian learning for block sparse signal recovery. In Proceedings of the 2019 IEEE International Conference on Signal, Information and Data Processing (ICSIDP), Chongqing, China, 11–13 December 2019; pp. 1–4.
32. Hilli, A.A.; Najafizadeh, L.; Petropulu, A. Weighted sparse Bayesian learning (WSBL) for basis selection in linear underdetermined systems. IEEE Trans. Veh. Technol. 2019, 68, 7353–7367.
33. Wang, D.; Zhang, Z. Variational Bayesian inference based robust multiple measurement sparse signal recovery. Digit. Signal Process. 2019, 89, 131–144.
34. Bayisa, F.L.; Zhou, Z.; Cronie, O.; Yu, J. Adaptive algorithm for sparse signal recovery. Digit. Signal Process. 2019, 87, 10–18.
35. Nayek, R.; Fuentes, R.; Worden, K.; Cross, E.J. On spike-and-slab priors for Bayesian equation discovery of nonlinear dynamical systems via sparse linear regression. Mech. Syst. Signal Process. 2021, 161, 107986.
36. Li, J.; Zhou, W.; Cheng, C. Adaptive support-driven Bayesian reweighted algorithm for sparse signal recovery. Signal Image Video Process. 2021, 15, 1295–1302.
37. Zong-Long, B.; Li-Ming, S.; Jin-Wei, S. Sparse Bayesian learning using adaptive LASSO priors. Acta Autom. Sin. 2021, 45, 1–16.
38. Mallat, S.; Zhang, Z. Matching pursuits with time-frequency dictionaries. IEEE Trans. Signal Process. 1993, 41, 3397–3415.
39. Blumensath, T.; Davies, M.E. Iterative hard thresholding for compressive sensing. Appl. Comput. Harmon. Anal. 2009, 27, 265–274.
40. Stanković, L.; Daković, M.; Vujović, S. Adaptive variable step algorithm for missing samples recovery in sparse signals. IET Signal Process. 2014, 8, 246–256.
41. Chen, S.; Donoho, D. Basis pursuit. In Proceedings of the 1994 28th Asilomar Conference on Signals, Systems and Computers, Pacific Grove, CA, USA, 31 October–2 November 1994; pp. 41–44.
42. Zhou, W.; Zhang, H.T.; Wang, J. An efficient sparse Bayesian learning algorithm based on Gaussian-scale mixtures. IEEE Trans. Neural Netw. Learn. Syst. 2021, 33, 3065–3078.
43. Sant, A.; Leinonen, M.; Rao, B.D. General total variation regularized sparse Bayesian learning for robust block-sparse signal recovery. In Proceedings of the ICASSP 2021—2021 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), Toronto, ON, Canada, 6–11 June 2021; pp. 5604–5608.
44. Liu, J.; Wu, Q.; Amin, M.G. Multi-task Bayesian compressive sensing exploiting signal structures. Signal Process. 2021, 178, 107804.
45. He, L.; Chen, H.; Carin, L. Tree-structured compressive sensing with variational Bayesian analysis. IEEE Signal Process. Lett. 2010, 17, 233–236.
46. Ji, S.; Xue, Y.; Carin, L. Bayesian compressive sensing. IEEE Trans. Signal Process. 2008, 56, 2346–2356.
47. Shekaramiz, M.; Moon, T.K.; Gunther, J.H. Hierarchical Bayesian approach for jointly-sparse solution of multiple-measurement vectors. In Proceedings of the 2014 48th Asilomar Conference on Signals, Systems and Computers, Pacific Grove, CA, USA, 2–5 November 2014; pp. 1962–1966.
48. Oikonomou, V.P.; Nikolopoulos, S.; Kompatsiaris, I. A novel compressive sensing scheme under the variational Bayesian framework. In Proceedings of the 2019 27th European Signal Processing Conference (EUSIPCO), A Coruna, Spain, 2–6 September 2019; pp. 1–5.
49. Wang, L.; Zhao, L.; Yu, L.; Wang, J.; Bi, G. Structured Bayesian learning for recovery of clustered sparse signal. Signal Process. 2020, 166, 107255.
50. Yu, L.; Wei, C.; Jia, J.; Sun, H. Compressive sensing for cluster structured sparse signals: Variational Bayes approach. IET Signal Process. 2016, 10, 770–779.
51. Babacan, S.D.; Nakajima, S.; Do, M.N. Bayesian group-sparse modeling and variational inference. IEEE Trans. Signal Process. 2014, 62, 2906–2921.
52. Yu, L.; Sun, H.; Barbot, J.P.; Zheng, G. Bayesian compressive sensing for cluster structured sparse signals. Signal Process. 2012, 92, 259–269.
53. Anderson, M.R.; Winther, O.; Hansen, L.K. Bayesian inference for structured spike and slab priors. In Proceedings of the Advances in Neural Information Processing Systems 27 (NIPS 2014), Montreal, QC, Canada, 8–13 December 2014; pp. 1745–1753.
54. Babacan, S.; Molina, R.; Katsaggelos, A. Bayesian compressive sensing using Laplace priors. IEEE Trans. Image Process. 2010, 19, 53–63.
55. Hernandez-Lobato, D.; Hernandez-Lobato, J.M.; Dupont, P. Generalized spike-and-slab priors for Bayesian group feature selection using expectation propagation. J. Mach. Learn. Res. 2013, 14, 1891–1945.
56. Ji, S.; Dunson, D.; Carin, L. Multitask compressive sensing. IEEE Trans. Signal Process. 2009, 57, 92–106.
57. Shekaramiz, M.; Moon, T.K.; Gunther, J.H. Sparse Bayesian learning using variational Bayes inference based on a greedy criterion. In Proceedings of the 2017 51st Asilomar Conference on Signals, Systems, and Computers, Pacific Grove, CA, USA, 29 October–1 November 2017; pp. 858–862.
58. Wu, Q.; Fang, S. Structured Bayesian compressive sensing with spatial location dependence via variational Bayesian inference. Digit. Signal Process. 2017, 71, 95–107.
59. Wipf, D.P.; Rao, B.D. An empirical Bayesian strategy for solving the simultaneous sparse approximation problem. IEEE Trans. Signal Process. 2007, 55, 3704–3716.
60. Tibshirani, R.; Saunders, M.; Rosset, S.; Zhu, J.; Knight, K. Sparsity and smoothness via the fused LASSO. J. R. Stat. Soc. Ser. B 2005, 67, 91–108.
61. Blumensath, T.; Davies, M.E. Normalized iterative hard thresholding: Guaranteed stability and performance. IEEE J. Sel. Top. Signal Process. 2010, 4, 298–309.
62. Qin, L.; Tan, J.; Wang, Z.; Wang, G.; Guo, X. Exploiting the tree-structured compressive sensing of Wavelet coefficients via block sparse Bayesian learning. Electron. Lett. 2018, 54, 975–976.
63. Ambat, S.K.; Chatterjee, S.; Hari, K.V. Fusion of greedy pursuits for compressed sensing signal reconstruction. In Proceedings of the 20th European Signal Processing Conference (EUSIPCO), Bucharest, Romania, 27–31 August 2012; pp. 1434–1438.
64. Cao, Z.; Dai, J.; Xu, W.; Chang, C. Fast variational Bayesian inference for temporally correlated sparse signal recovery. IEEE Signal Process. Lett. 2021, 28, 214–218.
65. Gelman, A.; Rubin, D.B. Inference from iterative simulation using multiple sequences. Stat. Sci. 1992, 7, 457–511.
66. Beal, M. Variational Algorithms for Approximate Bayesian Inference. Ph.D. Dissertation, University College London, London, UK, 2003.
67. Tzikas, D.G.; Likas, A.C.; Galatsanos, N.P. The variational approximation for Bayesian inference. IEEE Signal Process. Mag. 2008, 25, 131–142.
68. Shekaramiz, M.; Moon, T.K. Compressive sensing via variational Bayesian inference. In Proceedings of the 2020 Intermountain Engineering, Technology and Computing (IETC), Orem, UT, USA, 2–3 October 2020; pp. 1–6.
69. Shekaramiz, M.; Moon, T.K. Sparse Bayesian learning via variational Bayes fused with orthogonal matching pursuit. In Proceedings of the 2022 Intermountain Engineering, Technology and Computing (IETC), Orem, UT, USA, 13–14 May 2022; pp. 1–5.
70. You, C.; Ormerod, J.T.; Mueller, S. On variational Bayes estimation and variational information criteria for linear regression models. Aust. N. Z. J. Stat. 2014, 56, 73–87.
71. Tran, M.N.; Nguyen, T.N.; Dao, V.H. A practical tutorial on variational Bayes. arXiv 2021, arXiv:2103.01327.
72. Fox, C.; Roberts, S. A tutorial on variational Bayesian inference. Artif. Intell. Rev. 2011, 38, 85–95.
73. Manipur, I.; Manzo, M.; Granata, I.; Giordano, M.; Maddalena, L.; Guarracino, M.R. Netpro2vec: A graph embedding framework for biomedical applications. IEEE/ACM Trans. Comput. Biol. Bioinform. 2021, 19, 729–740.
Figure 1. Graphical Bayesian representation of the BGiG model.
Figure 2. Flowchart of the SBL(BGiG) algorithm.
Figure 3. Case 1: (α_0, β_0) = (0.01, 0.99). From top to bottom, the rows show the results of SBL(BGiG) for the sampling ratios λ = 0.80, 0.60, 0.40, respectively.
Figure 4. Case 2: (α_0, β_0) = (0.1, 0.9). From top to bottom, the rows show the results of SBL(BGiG) for the sampling ratios λ = 0.80, 0.60, 0.40, respectively.
Figure 5. Case 3: (α_0, β_0) = (1.4, 2). From top to bottom, the rows show the results of SBL(BGiG) for the sampling ratios λ = 0.80, 0.60, 0.40, respectively.
Figure 6. Case 1: Performance evaluation of SBL(BGiG).
Figure 7. Case 2: Performance evaluation of SBL(BGiG).
Figure 8. Case 3: Performance evaluation of SBL(BGiG).
Figure 9. Graphical Bayesian representation of the GiG model.
Figure 10. Flowchart of the SBL(GiG) algorithm.
Figure 11. From left to right, we show the results for sampling ratios of λ = 0.80, 0.60, and 0.40, respectively. The first row shows the comparison of y with ŷ = Ax̃_s, the second row shows the true solution x_s and the estimated solution x̃_s, and the third row demonstrates the estimated precisions on the solution components.
Figure 12. The behavior of the negative marginalized log-likelihood and the precision on the noise using SBL(GiG) for the sampling ratios of 0.40, 0.60, and 0.80.
Figure 13. Performance evaluation of SBL(BGiG) using grid and random Sobol search.
Figure 14. (a) Overall performance. (b) Top 10 (α_0, β_0) with the lowest NMSE. (c) Top 10 (α_0, β_0) with the highest P_D − P_FA.
Figure 15. (a) Top 10 (α_0, β_0) with the lowest NMSE vs. sampling ratio. (b) Top 10 (α_0, β_0) with the highest P_D − P_FA vs. sampling ratio.
Figure 16. Performance of SBL(GiG). (a) NMSE of SBL(GiG) vs. threshold. (b) Performance of SBL(GiG) before and after postprocessing.
Figure 17. Performance of SBL(BGiG) and SBL(GiG) after preprocessing and postprocessing, respectively.
Table 1. Performance results of SBL(BGiG) for Cases 1–3.

                                     λ      P_D    P_FA   NMSE (dB)
Case 1: (α_0 = 0.01, β_0 = 0.99)     0.8    0.20   0      −2.367
                                     0.6    0.08   0      −1.326
                                     0.4    0.08   0      −1.181
Case 2: (α_0 = 0.1,  β_0 = 0.9)      0.8    0.24   0      −3.109
                                     0.6    0.16   0      −2.197
                                     0.4    0.08   0      −1.181
Case 3: (α_0 = 1.4,  β_0 = 2.0)      0.8    0.72   0      −16.264
                                     0.6    1      0      −5.226
                                     0.4    1      1      −0.088
Table 2. Settings for preprocessing analysis and simulations on SBL(BGiG).

α_0        β_0        a_0       b_0       θ_0       θ_1       Sparsity   γ     N
[0.1, 2]   [0.1, 2]   10^{-3}   10^{-3}   10^{-6}   10^{-6}   25         (5)   100

s     τ_x   x     x_s            M     ε     e     A                    y
(5)   (6)   (6)   x_s = x ⊙ s    5:N   316   (7)   [A]_mn ~ N(0, 1)     A x_s + e
Table 3. Settings for preprocessing analysis and simulations on SBL(GiG).

a_0       b_0       θ_0       θ_1       Sparsity   N
10^{-3}   10^{-3}   10^{-6}   10^{-6}   25         100

τ_n    x_s    M     e      A                    y
(22)   (25)   5:N   (23)   [A]_mn ~ N(0, 1)     A x_s + e
Disclaimer/Publisher’s Note: The statements, opinions and data contained in all publications are solely those of the individual author(s) and contributor(s) and not of MDPI and/or the editor(s). MDPI and/or the editor(s) disclaim responsibility for any injury to people or property resulting from any ideas, methods, instructions or products referred to in the content.
