
Coupled regularization with multiple data discrepancies


Published 13 June 2018. © 2018 IOP Publishing Ltd.
Special issue on joint reconstruction and multi-modality/multi-spectral imaging. Citation: Martin Holler et al 2018 Inverse Problems 34 084003. DOI: 10.1088/1361-6420/aac539


Abstract

We consider a class of regularization methods for inverse problems where a coupled regularization is employed for the simultaneous reconstruction of data from multiple sources. Applications for such a setting can be found in multi-spectral or multi-modality inverse problems, but also in inverse problems with dynamic data. We consider this setting in a rather general framework and derive stability and convergence results, including convergence rates. In particular, we show how parameter choice strategies adapted to the interplay of the different data channels allow us to improve upon the convergence rates that would be obtained by treating all channels equally. Motivated by concrete applications, our results are obtained under rather general assumptions that allow the Kullback–Leibler divergence to be included as a data discrepancy term. To simplify their application to concrete settings, we further elaborate several practically relevant special cases in detail.

To complement the analytical results, we also provide an algorithmic framework and source code for solving a class of jointly regularized inverse problems with any number of data discrepancies. As concrete applications, we show numerical results for multi-contrast MR and joint MR-PET reconstruction.


Original content from this work may be used under the terms of the Creative Commons Attribution 3.0 licence. Any further distribution of this work must maintain attribution to the author(s) and the title of the work, journal citation and DOI.

1. Introduction

In many classical fields of application of inverse problems, the simultaneous or sequential acquisition of multiple data channels, rather than the measurement of a single channel, has recently become increasingly important. Besides making different sources of information available, an advantage of multiple measurements is that correlations between different data channels can be exploited in the inversion process, often leading to a significant improvement for each channel.

Multi-modality and multi-contrast imaging techniques, for instance, deal with the exploitation of such joint structures for improved reconstruction. Applications of such techniques can be found, for instance, in biomedical imaging [2, 11–13, 21, 26, 36, 38, 42, 46], geosciences [45], electron microscopy [20] and beyond. Spatio-temporal imaging techniques can also be interpreted in this way, regarding the measurement at each time point as a different channel, and we refer, for example, to [17, 32, 41] for applications.

The present work focuses on a general framework for coupled regularization of inverse problems where the data is split into multiple channels. In an abstract setting, we consider the following problem setup.

$$\min_{u}\ \sum_{i=1}^{N} \lambda_i\, D_i(T_i u, f_i) + R(u), \qquad (1)$$

where the Ti denote forward operators, the fi the given data, the Di data discrepancy terms and R a regularization functional. Our main interpretation of this setup is that $u= (u_1, \ldots, u_N)$ denotes some unknown multi-channel quantity of interest, each ui corresponds to some measured data fi such that $T_i u = \widetilde{T}_i u_i \approx f_i $ , and R realizes a coupled regularization of all channels.

A first particular example of such a setting is the joint reconstruction of magnetic resonance (MR) and positron emission tomography (PET) images via solving

as has for instance been considered in [13, 26]. Here, the first data fidelity reflects Gaussian noise in MR, while the second one is the Kullback–Leibler divergence that arises since the noise in PET imaging is typically assumed to be Poisson distributed.
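As a concrete illustration of the joint problem described here (a sketch only, under the assumptions just stated: an L2 fidelity for the Gaussian-noise MR channel and a Kullback–Leibler fidelity for the Poisson-noise PET channel; the precise scaling and notation of the original formulation may differ), such a functional has the form

$$\min_{u=(u_{\rm mr},\, u_{\rm pet})}\ R(u_{\rm mr}, u_{\rm pet}) + \lambda_1 \|T_1 u_{\rm mr} - f_1\|_2^2 + \lambda_2\, D_{\rm KL}(T_2 u_{\rm pet}, f_2),$$

with $T_1$ the MR forward operator, $T_2$ the PET forward operator and $D_{\rm KL}$ the Kullback–Leibler divergence.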

A second example is the situation where the components $(\,f_i)_i$ result from a sequential measurement process in which the quality of the measured data differs between measurement times. A practical application of this is spatio-temporal regularization for dynamic magnetic resonance imaging (dMRI) [32, 41], where the fi correspond to measurements of the same object at different time points. There, one aims to solve

$$\min_{u}\ \sum_{i=1}^{N} \lambda_i \|T_i u_i - f_i\|_2^2 + R(u), \qquad (2)$$

where now $u = (u_i)_{i=1}^N$ is a time series of images and R employs a spatio-temporal regularization. Since in dMRI there is always a trade-off between signal quality and measurement time, it is reasonable to adapt the measurement time to the underlying dynamics of the object, leading to measurements with different resolution and noise properties.

We highlight two particular situations of the setting (1) that are captured by standard theory: the first is the situation when the regularization decouples, i.e. $R(u) = \sum\nolimits_{i=1}^N R_i (u_i)$ , which reduces the reconstruction to N decoupled problems. Application-wise, however, many recent works have shown that, in case the different channels share some structure, a coupled regularization is highly beneficial in terms of reconstruction quality. The second particular case is when all data terms are weighted equally, i.e. $\lambda_i = \lambda_j$ for all $i, \,j$ . In this situation, all data terms can be grouped into a single discrepancy and, again, standard theory applies. We believe, however, that there are various good reasons to consider an individual weighting of the data terms: (i) The noise type of the involved measurements might be different and the different data discrepancies might hence scale differently. (ii) The noise level will in general be different, e.g. the SNR of one channel might be significantly better, allowing a very high value to be chosen for the corresponding $\lambda_i$ . (iii) The degree of ill-posedness might be very different for each channel, as is the case for instance with MR-PET reconstruction, and hence again an appropriate adaptation of the regularization parameters is necessary.

Allowing different parameters for different data discrepancies, the following questions arise.

  • What is the convergence behavior of the method if the data of some channels converges to a ground truth? Is there a limit problem, and if so, what does it look like?
  • What about convergence rates in case the noise on all channels approaches zero? What is the interplay of the different data terms with respect to convergence rates?
  • Taking into account different types of data terms, what is the best parameter choice?

We will see that these questions all have rather natural answers. In particular, we will show that in case the data discrepancies have different continuity properties, as is true for instance for MR-PET reconstruction, an appropriate parameter choice allows us to partially compensate for less regular discrepancies.

To the best knowledge of the authors, a convergence theory for coupled regularization approaches to inverse problems with multiple data discrepancies has not been developed so far. Regarding classical Tikhonov regularization with a single regularization term and a single data term, we refer to the books [15, 29, 40, 43] for convergence results in Hilbert and Banach spaces. In particular, we highlight the seminal work [22], which provides appropriate source conditions and convergence rates in Banach spaces and has also been the basis for this work.

While [22] and classical works consider a setting that allows for powers of norms as data discrepancies, there are far fewer works that also cover discrepancies which do not satisfy a (generalized) triangle inequality, as is the case for the Kullback–Leibler divergence included in the present work. For the latter, we refer to [23, 33, 37, 39].

While multi-discrepancy regularization has not been considered in depth so far, there has been a lot of recent research on multi-penalty regularization and convergence behavior for multiple parameters and functionals. We refer to the works [18, 25, 28] for an additive combination of multiple regularization terms, to [16, 19, 31] for an infimal-convolution-type combination and to [5] for stability and convergence results for multiple-parameter-TGV regularization.

1.1. Outline of the paper

The paper is organized as follows: in section 2 we define the variational problem setting under consideration and provide stability and convergence results in the general setting. This is done in a way that captures both powers of norms and the Kullback–Leibler divergence as data discrepancy terms. In section 3 we consider particular classes of data fidelities. We show how our results can be applied to powers of norms and obtain corresponding convergence rates. After some preliminary results on the Kullback–Leibler divergence, we conclude stability and convergence rates for mixed L2-norm and Kullback–Leibler discrepancies, which is particularly motivated by applications to joint MR-PET reconstruction. In section 4 we work out, by way of example, the application of our results to two choices of coupled regularization. In section 5 we first outline an algorithmic framework that allows us to realize coupled TGV regularization with any combination of L2 and Kullback–Leibler divergence discrepancies. We then show exemplary numerical results for coupled regularization in the context of multi-contrast MR and PET imaging.

2. General stability and convergence results

2.1. Notation and variational problem setting

We will, throughout this work and up to further specification, consider the following problem setting: let $(X, \|\cdot \|_X)$ be a Banach space and $R\colon X \to [0, \infty]$ be a regularization term. Further, for $i = 1, \dots, N$ , let $(Y_i, \|\cdot \|_{Y_i})$ be normed spaces, let $D_i\colon Y_i\times Y_i \to [0, \infty]$ be given data discrepancies and let $T_i:{\rm dom}(T_i) \subset X \rightarrow Y_i$ be operators that model the forward problem. For $u \in {\rm dom}(T):=\bigcap\nolimits_{i=1}^N {\rm dom}(T_i) \subset X$ , $f = (\,f_1, \ldots, f_N) \in Y:=Y_1 \times \ldots \times Y_N$ and $\lambda \in [0, \infty]^N$ we define

$$J_\lambda(u, f) := \sum_{i=1}^{N} \lambda_i\, D_i(T_i u, f_i) + R(u), \qquad (3)$$

where, for $\lambda _i = \infty$ , we define $\lambda_i D_i(T_iu, f_i) := 0$ if $D_i(T_iu, f_i) = 0$ and $\lambda_i D_i(T_iu, f_i) := \infty$ otherwise.

We consider the problem

$$\min_{u \in {\rm dom}(T)} J_\lambda(u, f). \qquad (P(\lambda, f))$$
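To make the role of the individual weights and of the convention for $\lambda_i = \infty$ concrete, the following is a minimal Python sketch of the functional $J_\lambda$ (the names J_lambda, forward_ops, discrepancies and reg are illustrative assumptions and not part of the paper or its accompanying source code):

```python
import math

def J_lambda(u, f, lambdas, forward_ops, discrepancies, reg):
    """Evaluate J_lambda(u, f) = sum_i lambda_i * D_i(T_i u, f_i) + R(u).

    A weight lambdas[i] == math.inf encodes the hard constraint
    D_i(T_i u, f_i) = 0: the corresponding term is 0 if the discrepancy
    vanishes and +inf otherwise.
    """
    value = reg(u)
    for lam, T, D, f_i in zip(lambdas, forward_ops, discrepancies, f):
        d = D(T(u), f_i)
        if math.isinf(lam):
            value += 0.0 if d == 0.0 else math.inf
        else:
            value += lam * d
    return value
```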

We will further use $\tau_X$ and $\tau_{Y_i}$ , $\tau_{D_i}$ , $i=1, \ldots, N$ , to denote different topologies on X and Yi, $i=1, \ldots, N$ , respectively, and assume that $\tau_X$ and $\tau_{Y_i}$ satisfy the T2 separation axiom. For τ being any topology, we say that a sequence $(z_n)_n$ τ-converges to z as $n\rightarrow \infty$ , and write $z_n \mathop{\rightarrow}\limits^{\tau} z$ as $n\rightarrow \infty$ , if the sequence converges with respect to the notion of convergence induced by the topology. Further we write $f^n_i \mathop{\rightarrow}\limits^{D_i\,} f_i$ and say $f^n_i$ Di-converges to fi if $D_i(\,f_i, f^n_i) \rightarrow 0$ as well as $f^n_i \mathop{\rightarrow}\limits^{\tau_{D_i}}f_i$ as $n \rightarrow \infty$ . By $z_n \rightarrow z$ we denote norm convergence and by $z_n \rightharpoonup z$ we denote weak convergence.

Note that the notation $f^n_i \mathop{\rightarrow}\limits^{D_i\,} f_i$ will be used to denote convergence of the data, and the topology $\tau_{D_i}$ allows one to require a stronger convergence than just $D_i(\,f_i, f^n_i) \rightarrow 0$ , which is useful for particular applications. In the applications considered in this paper, however, the topology $\tau_{D_i}$ will be chosen to be the trivial topology, such that $f^n_i \mathop{\rightarrow}\limits^{D_i\,} f_i$ is equivalent to $D_i(\,f_i, f^n_i) \rightarrow 0$ .

We will use the following product-space notation: we denote by $\mathop{\rightarrow}\limits^{\tau_{Y\,}}$ and $\mathop{\rightarrow}\limits^{\tau_{D\,}}$ componentwise convergence for sequences in $Y:=\bigoplus\nolimits_{i=1}^N Y_i$ . By $\|\cdot\|_Y= \left(\sum\nolimits_{i=1}^N \|\cdot \|_{Y_i}^2\right) ^{1/2}$ we denote the standard product norm on Y and we define $T:X \rightarrow Y$ as $(Tu)_i = T_i u$ .

Throughout this work, we will use the following assumptions

Assumption (G). 

  • 1.  
    For each i and $a, b\in Y_i$ , $D_i(a, b) = 0 \Rightarrow a=b$ .
  • 2.  
    Each Di is $\tau_{Y_i} \times {D_i}$ lower semi-continuous, i.e.
    $$D_i(v, f) \leqslant \liminf_{n\rightarrow \infty} D_i(v^n, f^n) \quad \text{whenever}\quad v^n \mathop{\rightarrow}\limits^{\tau_{Y_i}} v \ \text{ and }\ f^n \mathop{\rightarrow}\limits^{D_i\,} f. \qquad (4)$$
  • 3.  
    R is lower semi-continuous w.r.t $\tau_X$ .
  • 4.  
    The operators Ti are continuous from $({\rm dom}(T_i), \tau_X)$ to $(Y_i, \tau_{Y_i})$ and ${\rm dom}(T_i)$ is closed w.r.t convergence in $\tau_X$ .
  • 5.  
    For any $\lambda \in (0, \infty){}^N$ , the functional $J_{\lambda}(\cdot, \cdot)$ is uniformly coercive in the sense that, for sequences $(u^n)_n, (\,f^n)_n$ such that $f^n \mathop{\to}\limits^{D} f$ we have that $J_\lambda(u^n, f^n)< c$ implies that $(u^n)_n$ admits a $\tau_X$ -convergent subsequence.

Remark 2.1. The assumptions above are kept rather general with the aim of allowing data discrepancies beyond powers of norms, in particular the Kullback–Leibler divergence used for Poisson noise. Regarding their applicability, we note the following: assumptions 1 and 2 are rather weak and standard; e.g. they are fulfilled in case the Di are powers of norms and $\tau_Y$ is stronger than the corresponding weak topology. Assumptions 3 and 4 are also rather standard. Assumption 5 essentially requires pre-compactness of sublevel sets, which is a standard condition for existence.

It will also be convenient to use the following notation: for a set $I\subset \{1, \dots, N\}$ we define the reduced energy functional

$$J_{\lambda, I^c}(u, f) := \sum_{i \in I^c} \lambda_i\, D_i(T_i u, f_i) + R(u). \qquad (5)$$

Here, the complement Ic is always taken in $\{1, \ldots, N\}$ , i.e. we define $I^c = \{1, \ldots, N\} \setminus I$ . Further, writing $I = \{i_1, \ldots, i_{|I|}\}$ , we will use the notation $T_I=(T_{i_1}, \dots T_{i_{|I|}})$ , $f_I = (\,f_{i_1}, \dots f_{i_{|I|}})$ and $\lambda_I = (\lambda_{i_1}, \dots \lambda_{i_{|I|}})$ . For two sequences $(x^n)_n$ and $(y^n)_n$ we write $x^n \sim y^n$ if there exist constants $c, C>0$ such that $cy^n \leqslant x^n \leqslant Cy^n$ for all n. Further, we use the standard little $o()$ and big $O()$ terminology, that is, we say $x^n = o(y^n)$ and $x^n = O(y^n)$ if $\lim\nolimits_{n\rightarrow \infty} x^n/y^n = 0$ or $x^n/y^n$ is uniformly bounded in n, respectively.

For the functional R and $\xi \in \partial R(u) \subset X^*$ , an element of the subdifferential of R at $u \in X$ , we define the Bregman distance $D_R^\xi:X \times X \rightarrow [0, \infty]$ as $D_R^\xi (v, u) = R(v) - R(u) - \langle \xi, v-u\rangle_{X^*, X}$ .
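As a simple illustration (an example of our own, not taken from the paper): for the quadratic regularizer $R(u) = \frac{1}{2}\|u\|_X^2$ on a Hilbert space X, the subdifferential at u contains only $\xi = u$ and the Bregman distance reduces to

$$D_R^\xi (v, u) = \tfrac{1}{2}\|v\|_X^2 - \tfrac{1}{2}\|u\|_X^2 - \langle u, v-u\rangle_X = \tfrac{1}{2}\|v-u\|_X^2,$$

so that, in this case, convergence rates in the Bregman distance are simply rates in the squared norm.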

2.2. Results in the general setting

Using assumption (G), the following existence result can be obtained by standard arguments.

Proposition 2.2. Let $f\in Y$ and $\lambda \in (0, \infty){}^ N$ be such that $J_\lambda(\cdot, f)$ is proper, i.e. finite in at least one point, and assume that assumption (G) holds. Then there exists a solution to $P(\lambda, f)$ .

Proof. Take $(u^n)_n$ to be a minimizing sequence for which we can assume that $J_\lambda(u^n, f)$ is bounded. Then by assumption $(u^n)_n$ admits a $\tau_X$ -convergent subsequence and, by the lower semi-continuity assumptions on R and Di and by continuity of Ti we obtain that the limit is a minimizer. □

The following proposition shows existence of what we will refer to as a $J_{\lambda, I^c}$ -minimal solution to $T_Iu=f_I^\dagger$ . This notion will be relevant later on when we study limit problems in multi-discrepancy regularization for vanishing noise level.

Proposition 2.3. Assume that assumption (G) holds. Let $\lambda \in (0, \infty){}^N$ , $I\subset \{1, \dots , N\}$ and $f^\dagger \in Y$ be given. Assume that there exists u0 such that $D_i(T_iu_0, f_i^\dagger) = 0$ for $i \in I$ and $J_{\lambda, I^c}(u_0, f^\dagger)<\infty$ . Then there exists a $u^\dagger$ such that

Equation (6)

Remark 2.4. Note that $J_{\lambda, I^c}$ is again a multi-data Tikhonov functional for indices in Ic, and so $J_{\lambda, I^c}$ -minimal solutions to $T_Iu=f_I^\dagger$ can be understood as solutions to a Tikhonov problem for Ic, using $T_Iu=f_I^\dagger$ as prior information.

Proof. It is obvious that the two minimization problems are equivalent, hence we show existence for the second one. Due to our assumptions, the admissible set is non-empty and the infimum is greater than or equal to zero. Consequently, an infimizing sequence exists, which, due to the coercivity assumption, admits a $\tau_X$ -convergent subsequence with limit $u^\dagger$ . Due to continuity of the Ti and lower semi-continuity of the Di in the first component w.r.t. $\tau_{Y_i}$ , we get that the limit $u^\dagger$ is a solution of (6) and the assertion follows. □

Now we consider the situation of varying data, which is relevant for investigating stability of the solution method with respect to data perturbations. To this aim, we will need the following continuity assumption at a particular pair $(u, f^\dagger) \in X\times Y$ and index set $I\subset \{1, \ldots, N\}$ .

Assumption ${S_1(u, f^\dagger, I^c)}$. For each $i \in I^c$ and every sequence $(\,f_i^n)_n$ with $f^n_i \mathop{\rightarrow}\limits^{D_i\,} f_i^\dagger$ it holds that $D_i(T_iu, f_i^n) \rightarrow D_i(T_iu, f_i^\dagger)$ as $n \rightarrow \infty$ .

We remark that this assumption is in particular always satisfied if the data fidelity is a power of a norm, as can be easily deduced from the inverse triangle inequality for norms.
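A brief verification sketch for this remark (assuming $D_i(v, f) = \|v-f\|_{Y_i}^{\,p_i}$ ): by the inverse triangle inequality,

$$\big| \|T_iu - f_i^n\|_{Y_i} - \|T_iu - f_i^\dagger\|_{Y_i} \big| \leqslant \|\,f_i^n - f_i^\dagger\|_{Y_i} \rightarrow 0,$$

and continuity of $x \mapsto x^{\,p_i}$ on $[0, \infty)$ then yields $D_i(T_iu, f_i^n) \rightarrow D_i(T_iu, f_i^\dagger)$ .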

The next theorem considers the rather general situation that the given data converges and we choose the parameters to either converge to infinity, which corresponds to assuming vanishing noise, or to a fixed positive number. From this, simplified results on stability and convergence for vanishing noise level will then be obtained as corollaries. Note further that we allow for powers pi of the data terms, which will be relevant for the particular case of norms later on.

Theorem 2.5 (General convergence result). Suppose that assumption (G) holds. Let $(\,f^n)_n$ in Y and $f^\dagger \in Y$ be such that $f^n \mathop{{\rightarrow}}\limits^{D} f^\dagger$ as $n\rightarrow \infty$ , and, for $i \in \{1, \ldots, N\}$ , let $p_i \in [1, \infty)$ and define $\delta_i ^n = D_i(\,f^\dagger _i, f_i^n){}^{1/p_i} $ such that, by assumption, $\delta_i^n \rightarrow 0$ as $n \rightarrow \infty$ . Choose the parameters $\lambda^n = (\lambda_1^n, \ldots, \lambda_N^n) \in (0, \infty){}^ N$ such that, for a given subset $I \subset \{1, \ldots, N\}$ ,

and denote $\lambda^\dagger = \lim\nolimits_{n \rightarrow \infty} \lambda ^n \in (0, \infty]^N$ .

Assume that there exists a $J_{\lambda^\dagger, I^c}$ -minimal solution to $T_Iu=f_I^\dagger$ , denoted by u0, such that $S_1(u_0, f^\dagger, I^c)$ holds. Then every sequence $(u^n)_n$ of solutions to the problems

Equation (7)

admits a $\tau_X$ -convergent subsequence again denoted by $(u^n)_n$ with limit $u^\dagger$ such that

and every limit of a $\tau_X$ -convergent subsequence of solutions is a solution to

Equation (8)

Proof. Let $(u^n)_n$ be any sequence of solutions and let u0 be a $J_{\lambda^\dagger, I^c}$ -minimal solution to $T_Iu=f_I^\dagger$ satisfying $S_1(u_0, f^\dagger, I^c)$ . By minimality, one obtains

Equation (9)

for C  >  0, where the convergence $ \newcommand{\bi}{\boldsymbol}D_i\big(T_iu_0, f_i^n\big)\to D_i\big(T_iu_0, f_i^\dagger\big)$ is due to the assumption $S_1(u_0, f^\dagger, I^c)$ . Setting $ \newcommand{\N}{\mathbb{N}} \lambda^0_i = \inf\nolimits_{n\in \N} \lambda_i^n>0$ for all i, this implies boundedness of $J_{\lambda^0}(u^n, f^n)$ and consequently, by coercivity, $(u^n)_n$ admits a convergent subsequence.

Assume now that $(u^n)_n$ is any subsequence of a sequence of solutions converging to some $u^\dagger \in X$ . The estimate in (9) implies for each $i\in I$

Equation (10)

and hence, by lower semi-continuity of Di and since $\lambda_i^n \to \infty$ , we obtain $T_iu^\dagger=f_i^\dagger$ .

Further, we get from lower semi-continuity and convergence of the $\lambda_i^n$ for $i \notin I$ that

Combining this with the estimate in (9) we obtain, since $u_0 \in X$ solves (8), that $u^\dagger$ is also a $J_{\lambda^\dagger, I^c}$ -minimizing solution of $T_Iu=f_I^\dagger$ .

Now assume that there is one $i_0 \in I$ such that $\lambda_{i_0}^ n D_{i_0}(T_{i_0}u^ n, f_{i_0}^ n) $ does not converge to zero. This implies that

which, by (9), yields

and hence a contradiction. Consequently, for all $i \in I$ ,

which means that $D_i(T_iu^n, f_i^n) = o((\lambda_i^n){}^{-1})$ . Finally, we assume that either there is an $i_0 \in I^c$ such that $ \newcommand{\bi}{\boldsymbol}\limsup\nolimits_{n\rightarrow \infty} D_{i_0}\big(T_{i_0}u^n, f_{i_0}^n\big) > D_{i_0} \big(T_{i_0}u^\dagger, f_{i_0}^\dagger\big)$ or that $\limsup\nolimits_{n\rightarrow \infty} R(u^n) > R(u^\dagger)$ . In either case this implies that there is a c  >  0 such that

But as before this implies that

which is a contradiction to u0 being optimal. Together with lower semi-continuity, this implies $ \newcommand{\bi}{\boldsymbol}\lim\nolimits_{n\rightarrow \infty} D_i\big(T_iu^n, f_i^n\big) = D_i \big(T_iu^\dagger, f_i^\dagger\big)$ for all $i \in I^c$ as well as $\lim\nolimits_{n\rightarrow \infty} R(u^n) = R(u^\dagger)$ . □

Remark 2.6. We note that, by choosing $\lambda_i^n$ for $i \in I$ such that

with $ \newcommand{\e}{{\rm e}} \epsilon _i \in (0, p_i) $ , the assumptions of theorem 2.5 concerning the choice of $\lambda_i^n$ hold and we obtain the rate

which comes arbitrarily close to the rate $o(\delta^{\,p_i})$ . The rate $O(\delta^{\,p_i})$ will be obtained under additional source conditions later on.

As a first consequence of theorem 2.5 we obtain a basic stability result. That is, keeping all regularization parameters in theorem 2.5 fixed, it follows that the solutions of $P(\lambda, f^ \dagger)$ are stable with respect to perturbations of the data. Note that this could also have been obtained from standard theory (e.g. [22]) without handling the N data terms individually.

Corollary 2.7 (Stability). Suppose that assumption (G) holds. Let $(\,f^n)_n$ in Y and $f^\dagger \in Y$ be such that $f^n \mathop{{\rightarrow}}\limits^{D} f^\dagger$ as $n\rightarrow \infty$ and take $\lambda^\dagger \in (0, \infty){}^ N$ . Assume that, for u0 a solution to $P(\lambda^\dagger, f^ \dagger)$ , assumption $S_1(u_0, f^\dagger, \{1, \ldots, N\})$ holds. Then every sequence $(u^n)_n$ of solutions to

Equation (11)

admits a $\tau_X$ -convergent subsequence again denoted by $(u^n)_n$ with limit $u^\dagger$ such that

and every limit of a $\tau_X$ -convergent subsequence of solutions is a solution to

Equation (12)

Proof. Choose $ \newcommand{\e}{{\rm e}} I = \emptyset$ , $\lambda^n = \lambda^\dagger$ and apply theorem 2.5. □

The next corollary is specific to the multi-data-discrepancy setting and handles the situation when the noise on some components vanishes while the noise on other components remains the same. It shows that in this situation, using an appropriate parameter choice, the limit problem comprises the minimization of the remaining terms subject to hard constraints for the channels with ground truth data. To deduce this result, we set the regularization parameters corresponding to components with a fixed noise level to be constant and assume an appropriate parameter choice for the others. Remarkably, the result holds without assumption $S_1(u, f^\dagger, I^c)$ .

Corollary 2.8 (Convergence for vanishing noise). Assume that assumption (G) holds. Let $I \subset \{1, \ldots, N\}$ be any index set, $f^ \dagger \in Y$ , and the sequence $(\,f^n)_n$ be such that $f^n_{I^c} = f_{I^c}^\dagger$ and $f^n_I \mathop{\rightarrow}\limits^{{D_I}} f_I^\dagger$ as $n\rightarrow \infty $ . Define $\delta^n_i = D_i(\,f^ \dagger_i, f^n_i){}^ {1/p_i}$ with $p_i \in [1, \infty)$ for $i\in I$ and $\delta_i^ n = 0$ else.

Choose the parameters $\lambda^n = (\lambda_1^n, \ldots, \lambda_N^n) \in (0, \infty){}^ N$ such that $\lambda_{i}^n = \lambda^ \dagger _{i}\in (0, \infty)$ for $i \in I^c$ and

and denote $\lambda^\dagger = \lim\nolimits_{n \rightarrow \infty} \lambda ^n \in (0, \infty]^N$ .

Assume that there exists a $J_{\lambda^\dagger, I^c}$ -minimal solution to $T_Iu=f_I^\dagger$ . Then every sequence $(u^n)_n$ of solutions to

Equation (13)

admits a $\tau_X$ -convergent subsequence again denoted by $(u^n)_n$ with limit $u^\dagger$ such that

and every limit of a $\tau_X$ -convergent subsequence of solutions is a solution to

Equation (14)

Proof. Apply theorem 2.5 and note that assumption $S_1(u_0, f^\dagger, I^c)$ was only needed in its proof to show the convergence $D_i\big(T_iu_0, f_i^n\big)\rightarrow D_i\big(T_iu_0, f_i^\dagger\big)$ for $i \in I^c$ , which holds trivially since $f^n_i = f_i^ \dagger$ for $i \in I^c$ . □

Remark 2.9. A particular consequence of the previous corollary is that fixing $f^n_i = f^\dagger_i$ also for $i \in I$ allows for any parameter choice $\lambda_i^n$ with $\lambda_i^n \rightarrow \infty$ and still yields $D_i(T_iu^n, f^\dagger_i)= D_i(T_iu^n, f^n_i) = o((\lambda_i^n){}^{-1}) $ for $i \in I$ . In the context of joint regularization, this analytically confirms that increasing the parameter $\lambda_i^n$ for particular, better-conditioned modalities or channels approximates the situation in which the corresponding components Tiu are fixed to match the data exactly and enter only as prior information in the joint regularization functional R.

In summary, the previous results answer the first question posed in the introduction: multi-data-fidelity regularization is stable with respect to convergence of the data for some channels and, using an appropriate parameter choice strategy, the limit problem in case the data for some channels converges to a ground truth comprises the minimization of the regularization and data discrepancies for the other channels subject to hard constraints for the channels with ground truth data.

Next we consider convergence rates. To this aim, we will need a slightly stronger version of assumption $S_1(u, f^\dagger, I^c)$ , which we denote by $S_2(U, f^\dagger)$ . Essentially we will require that the continuity of assumption $S_1(u, f^\dagger, I^c)$ also holds for all u in a set U and that the modulus of continuity can be estimated by functions $\psi_i$ . The typical situation we have in mind for applications is that $\psi_i(x) = x^{\nu_i}$ with $\nu_i \in (0, 1]$ .

Assumption ${S_2(U, f^ \dagger)}$

Remark 2.10. We note that assumption $S_2(U, f^ \dagger)$ is satisfied trivially in case the data discrepancy is a norm or, more generally, whenever the data terms satisfy an inverse triangle inequality of the type

The reason for using a slightly weaker assumption is that the weaker version as above is satisfied by norms to the power p with p  >  1, for which an inverse triangle inequality does not hold.

We also remark that the assumption $S_2(U, f^ \dagger)$ could be further relaxed for those indices $i \in I$ where $\lambda^n_i \rightarrow \infty$ , by allowing arbitrary constants as factors for $D_i(T_iv, f_i^\dagger)$ , instead of $(1 \pm\psi_i(D_i(\,f^\dagger_i, f_i)))$ , which goes to one. The fact that we can even get the constants to converge to one, which is true but less trivial for powers of norms, is only needed for indices in Ic where $\lambda_i^n$ converges to a finite value. Hence the latter is a particularity of the multi-data-fidelity setting when the noise in some components goes to zero but not in others.

The next theorem provides the main estimates for obtaining convergence rates for all specifications of the data terms that are considered later on. It partially relies on a rather abstract source condition that will be discussed below and simplified in particular applications of the theorem.

Theorem 2.11. With the assumptions of theorem 2.5, for $i \in I^c$ , define $\phi_i:[0, \infty) \rightarrow [0, \infty)$ continuous and monotonically increasing with $\phi_i(0) = 0$ such that

Let $(u^n)_n$ be a sequence of solutions to $P(\lambda^n, f^n)$ and $u^\dagger$ be a $J_{\lambda^\dagger, f^\dagger}$ -minimizing solution of $T_I u = f_I^\dagger$ . Set $U_{n_0} = \{u^n \, |\, n\geqslant n_0\} \cup \{u^\dagger\}$ and assume that, for some $n_0 \geqslant 0$ , $S_2(U_{n_0}, f^\dagger)$ holds.

Then

Equation (15)

If we assume further that there exists

and constants $0<\beta_1<1$ , $0<\beta_2$ such that

for all u satisfying $ \newcommand{\e}{{\rm e}} J_{\lambda^\dagger, I^c}(u, f^\dagger) + \sum\nolimits_{i \in I} D_i(T_i u, f^\dagger_i) \leqslant J_{\lambda^\dagger}(u^\dagger, f^\dagger)+ \epsilon$ , for some $ \newcommand{\e}{{\rm e}} \epsilon > 0$ , then, after at most finitely many indices n we have

Proof. Take $(u^n)_n$ a sequence of solutions and $u^ \dagger$ a $J_{\lambda^\dagger, f^\dagger}$ -minimizing solution of $T_I u = f_I^\dagger$ and, without loss of generality, assume that $S_2(U_0, f^\dagger)$ holds.

At first we note that, from optimality and the fact that $T_i u^\dagger = f_i^\dagger$ for $i \in I$ , it follows that

Equation (16)

Now we aim to change from fn to $f^\dagger$ and from $\lambda^n$ to $\lambda^\dagger$ in the functionals $J_{\lambda^n, I^c}$ of last line above. To this aim, we first note that, by assumption $S_2(U_0, f^\dagger)$ ,

where the last inequality follows from estimating $\lambda_i^n (D_i(T_iu^\dagger, f_i^\dagger)+1) \leqslant C$ for all $i\in I^c$ . Similarly, we get

where the last inequality is obtained by boundedness of $D_i(T_iu^ n, f^\dagger_i)$ , which follows by estimating it from above by $D_i(T_iu^n, f^n_i)$ (up to constants) using $S_2(U_0, f^\dagger)$ , which in turn is bounded by the assertion of theorem 2.5.

Now in order to change from $\lambda^n$ to $\lambda^\dagger$ we sum the two equations above and further estimate

where the last inequality is again due to boundedness of $D_i(T_iu^n, f^\dagger_i)$ .

Combining the last estimate with (16) we get

Equation (17)

which already shows the estimate on the extended objective functional $J_{\lambda ^\dagger, I^ c}(\cdot, f^ \dagger)$ as in (15). Now by the second inequality in $S_2(U_0, f^\dagger)$ we can estimate $ \newcommand{\bi}{\boldsymbol}D_i (T_i u^n, f^\dagger_i) \leqslant C\big(D_i (T_i u^n, f_i^n) + \psi_i((\delta_i^n){}^{\,p_i})\big)$ for $i \in I$ . This, together with the fact that $C < \lambda _i ^n $ for $i \in I$ and n sufficiently large allows to conclude from (17) that, up to finitely many indices, $ \newcommand{\e}{{\rm e}} J_{\lambda^\dagger, I^c}(u^n, f^\dagger) + \sum\nolimits_{i \in I} D_i(T_i u^n, f_i^\dagger) \leqslant J_{\lambda^\dagger}(u^\dagger, f^\dagger)+ \epsilon$ .

Hence the source condition holds for u  =  un and the difference of the reduced objective functionals as above can be estimated by

Plugging this estimate in equation (17), the assertion follows. □

As mentioned above, the parameters $p_i\geqslant 1$ are introduced for the case that the data discrepancies are powers of norms. Setting all pi  =  1, an abstract estimate on convergence rates follows immediately from the previous theorem.

Proposition 2.12. With the assumptions of theorem 2.11, set pi  =  1 for $i \in I$ . Then, with $(u^n)_n$ a sequence of solutions to $P(\lambda^n, f^n)$ we obtain for $j\in I$

Equation (18)

Equation (19)

Proof. First note that, by assumption $S_2(U_0, f^\dagger)$ we can again estimate, for $i \in I$ ,

It then follows from the assertion of theorem 2.11 that after finitely many indices

Using non-negativity of all involved terms (for sufficiently large n), the assertion follows as a direct consequence. □

Before considering a more concrete result for convergence rates, we discuss the source condition (SC), which is a direct extension of a standard source condition for a single norm discrepancy term in the context of non-linear inverse problems in Banach spaces, see [22]. The reason for introducing the power 1/pi in the condition is to obtain compatibility with existing source conditions in the case that $D_i(v, f) = \|v-f\|^ {q_i}_{Y_i}$ , by choosing $p_i = q_i$ . Obviously, when letting the noise on all data components converge to zero (still potentially at different rates), i.e. choosing $I=\{1, \ldots, N\}$ , $J_{\lambda^\dagger, I^ c}(\cdot, f^\dagger)$ reduces to R and the source condition coincides with the one of [22]. As discussed for instance there or in [40], under some additional assumptions, this source condition is then equivalent to more classical source conditions [15].
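For orientation, a standard special case (an illustration of our own, not spelled out in the paper): for a single channel, a bounded linear operator T between Hilbert spaces and $R(u) = \frac{1}{2}\|u\|_X^2$ , a source element of the form

$$\xi = u^\dagger \in {\rm range}(T^*), \qquad \text{i.e. } u^\dagger = T^*w \ \text{ for some } w,$$

recovers a classical range-type source condition as discussed in [15].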

The following remark, whose proof is direct, allows for an interpretation of the source condition also in the most general setting.

Remark 2.13. Take $u^ \dagger\in X$ , $f^ \dagger \in Y$ , $\lambda^ \dagger \in (0, \infty]^ N$ and $I \subset \{1, \ldots, N\}$ such that $\lambda_i ^ \dagger < \infty $ for $i \in I^ c$ . With $ \xi \in \partial \Big(J_{\lambda^\dagger, I^c}(\cdot, f^\dagger)\Big)(u^\dagger)$ and constants $0<\beta_1<1$ , $0<\beta_2$ , $p_i \in [1, \infty)$ for $i \in I$ , we have, for all $u \in X$ ,

if and only if

The second equation can be interpreted as follows: if u deviates from the equality constraints $T_i u = f^ \dagger_i$ for $i \in I$ , the cost of the respective data term needs to increase at least as fast as the Bregman distance of u to $u^\dagger$ with respect to the reduced objective functional.

We now consider more concrete assertions on convergence rates. It is easy to see that, in the assertion of proposition 2.12, the choice $f_i^n = f_i^\dagger$ and $\lambda_i^n = \lambda_i^\dagger$ for $i\in I^c$ allows us to choose $\phi_i \equiv 0$ and improves the convergence rates accordingly. More particularly, in the case that the noise on every component goes to zero, i.e. $I^c = \emptyset$ , we get the following corollary.

Corollary 2.14. With the assumptions of theorem 2.11, set $I= \{1, \ldots, N\}$ , pi  =  1 for all i and let $(u^n)_n$ be a sequence of solutions to $P(\lambda^n, f^n)$ and $u^\dagger$ be an R-minimizing solution of $T u = f^\dagger$ such that $S_2(U_{n_0}, f^\dagger)$ holds for some n0. Then, after at most finitely many $n\geqslant 0$ ,

Equation (20)

If we assume further that there exists

constants $0<\beta_1<1$ , $0<\beta_2$ such that

for all u satisfying $ \newcommand{\e}{{\rm e}} R(u) + \sum\nolimits_{i=1}^N D_i(T_i u, f^\dagger_i) \leqslant R (u^\dagger)+ \epsilon$ for some $ \newcommand{\e}{{\rm e}} \epsilon > 0$ , then we obtain for any $j \in \{1, \ldots, N\}$ ,

Equation (21)

Equation (22)

Proof. This follows from theorem 2.11 and proposition 2.12 by choosing $I = \{1, \ldots, N\}$ . □

The previous result provides, in a general setting, an answer to the second question posed in the introduction: under appropriate assumptions, convergence rates can be obtained for the multi-discrepancy setting and the estimates above provide explicit information on the interplay of the different channels and discrepancy terms with respect to convergence rates. Next, we will discuss a first example, in the general setting, of how this interplay can be exploited with appropriate parameter choice strategies in order to improve convergence rates. This is a first answer to the third question of the introduction, for which a more detailed discussion with concrete examples of data discrepancies is provided in the subsequent section.

Consider the particular case that the noise level for all components is given as $\delta_i^n \sim (\delta^n){}^{\mu_i}$ for some $\delta^n \rightarrow 0$ and $\mu_i \geqslant 1$ and that $\psi_i (x) \sim x$ for all i. It is immediate that using the same $\delta_i^n$ -dependent parameter choice for all $\lambda_i^n$ , say $ \newcommand{\e}{{\rm e}} \lambda_i^n \sim (\delta_i^n){}^{-(1-\epsilon)} $ for $ \newcommand{\e}{{\rm e}} \epsilon \in (0, 1)$ , yields the rates

That is, the rates for both $D_R^\xi$ and Dj are affected by the worst of all $\mu_i$ . In addition, choosing $\epsilon \rightarrow 1$ gives the best rate for $D_R^\xi$ , but worsens the rate for Dj whenever $\mu_j > \min\nolimits_i \{\mu_i\}$ .

As will be shown in the following corollary, adapting the rate of each parameter to the interplay of the $\mu_i$ allows us to improve upon this. In particular, it allows us to obtain the rate $(\delta^ n){}^{\mu_j}$ for each $\mu_j$ and a slightly improved rate for $D_R^\xi$ . This is achieved in the more general setting that $\psi_i (x) \sim x^{\nu_i}$ with potentially different $\nu_i \in (0, 1]$ , which is relevant for applications where different noise characteristics yield different discrepancy terms.

Corollary 2.15. With the assumptions of corollary 2.14, suppose that the $\psi_i$ in assumption $S_2(U, f^ \dagger)$ are such that

and that

with a sequence $(\delta^n)_n$ in $(0, \infty)$ such that $\delta^n \rightarrow 0$ , $\mu_i \geqslant 1$ and $\nu_i \in (0, 1]$ . Define $ \newcommand{\e}{{\rm e}} \eta_i = \mu_i \nu_i$ and $ \newcommand{\e}{{\rm e}} \eta_{\min} = \min\nolimits_i \{\eta_i\} $ and take one i0 such that $ \newcommand{\e}{{\rm e}} \mu_{i_0} = \min \{\mu_i \, |\, \eta_i = \eta_{\min} \}$ . Assume that $\nu_{i_0} < 1$ . Then, with

for all i it follows that $ \newcommand{\e}{{\rm e}} \epsilon_i \in (0, 1)$ and with the parameter choice

we obtain

Equation (23)

If we again assume that the source condition (SC) of corollary 2.14 holds, then we obtain for any $j \in \{1, \ldots, N\}$ ,

Equation (24)

Equation (25)

This choice of the $ \newcommand{\e}{{\rm e}} \epsilon_i$ is optimal for the estimates on Dj and $D_R^\xi$ given in corollary 2.14 in the sense that they cannot be improved by any other parameter choice of the form $ \newcommand{\e}{{\rm e}} \lambda_i^n = (\delta_i^n){}^{-(1-\tilde{\epsilon}_i)}$ with $ \newcommand{\e}{{\rm e}} \tilde{\epsilon} _i \in (0, 1)$ .

Remark 2.16. Before providing the proof, we note that, for the sake of simplicity, we have left out the case that $\nu_{i_0}=1$ with $\nu_{i_0}$ as above. In this case, one has to choose $\epsilon_i = \epsilon \frac{\eta_{\min}}{\mu_i}$ with $\epsilon \in (0, 1)$ , which allows one to approximate the same rates in the limit $\epsilon \rightarrow 1$ .

Proof. At first we show that indeed $ \newcommand{\e}{{\rm e}} \epsilon_i < 1$ : in case i is such that $ \newcommand{\e}{{\rm e}} \eta_i > \eta_{\min}$ this follows from $ \newcommand{\e}{{\rm e}} \epsilon_i= \frac{\eta_{\min}}{\eta_i}\nu_i \leqslant \frac{\eta_{\min}}{\eta_i} < 1$ . In the other case, we note that necessarily $\nu_i<1$ hence $ \newcommand{\e}{{\rm e}} \epsilon_i < \frac{\eta_{\min}}{\eta_i} \leqslant 1$ .

Plugging in the proposed parameter choice in the rates of corollary 2.14 the claimed rates follow by direct computation. Regarding their optimality, we note that for any choice of $ \newcommand{\e}{{\rm e}} \tilde{\epsilon}_i$ , the rate for $ \newcommand{\D}{\mathcal{D}} \D_R^\xi$ is given as

which cannot be better than $(\delta^n){}^{\eta_{\min}}$ .

Similarly, by considering the jth summand in the rate for Dj, it is immediate that the rate cannot be better than $(\delta^n){}^{\mu_j}$ , which is achieved with the proposed parameter choice. □
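To illustrate the adapted parameter choice of corollary 2.15, the following minimal Python sketch (an illustration of our own, not the paper's accompanying source code) computes the exponents $\epsilon_i = \nu_i\, \eta_{\min}/\eta_i$ and, assuming a parameter choice of the form $\lambda_i^n = (\delta_i^n){}^{-(1-\epsilon_i)}$ as in the optimality statement above, the resulting regularization parameters from given $\mu_i$ , $\nu_i$ and overall noise level $\delta^n$ :

```python
def adapted_parameter_choice(delta, mu, nu):
    """Parameter choice adapted to the interplay of the per-channel noise levels.

    delta : overall noise level delta^n > 0
    mu[i] : exponent of the per-channel noise level, delta_i^n = delta**mu[i]
    nu[i] : exponent of the modulus of continuity, psi_i(x) ~ x**nu[i]
    Returns the list of regularization parameters lambda_i^n.
    """
    eta = [m * n for m, n in zip(mu, nu)]
    eta_min = min(eta)
    eps = [n * eta_min / e for n, e in zip(nu, eta)]  # epsilon_i = nu_i * eta_min / eta_i
    delta_i = [delta ** m for m in mu]                # per-channel noise levels
    return [d ** (-(1.0 - e)) for d, e in zip(delta_i, eps)]

# Example: two channels, delta^n = 1e-3, mu = (1, 2), nu = (1/2, 1)
print(adapted_parameter_choice(1e-3, [1.0, 2.0], [0.5, 1.0]))
```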

3. Specification of the data term

3.1. Power-of-norm discrepancies

In this subsection, we consider the particular case that all discrepancy terms Di are powers of norms, i.e. $D_i(u, f) = \|u-f\|_{Y_i}^{\,p_i}$ with $p_i \in [1, \infty)$ and $J_\lambda(u, f) = R(u) + \sum\nolimits_{i=1}^N\lambda_i \|T_iu - f_i\|_{Y_i}^{\,p_i}$ . That is, we consider the minimization problem

$$\min_{u \in {\rm dom}(T)}\ R(u) + \sum_{i=1}^N \lambda_i \|T_iu - f_i\|_{Y_i}^{\,p_i}. \qquad (P_N(\lambda, f))$$

In this particular case, we choose the topologies $\tau_{D_i}$ to be the trivial topologies and hence Di-convergence is equivalent to convergence in the norm $\|\cdot \|_{Y_i}$ . As we will see, assumption (G) then reduces as follows.

Assumption (N). [Norm discrepancies, $\tau_D$ the trivial topology]

  • 1.  
    $\|\cdot \|_{Y_i}$ is sequentially lower semi-continuous w.r.t. $\tau_{Y_i}$ , and addition and scalar multiplication are continuous in the topology $\tau_{Y_i}$ .
  • 2.  
    R is lower semi-continuous w.r.t $\tau_X$ .
  • 3.  
    The operators Ti are continuous from $(X, \tau_X)$ to $(Y_i, \tau_{Y_i})$ and $ \newcommand{\domain}{{\rm dom}} \domain(T_i)$ is closed w.r.t convergence in $\tau_X$ .
  • 4.  
    For any $\lambda \in (0, \infty){}^N$ , $f \in Y$ and any sequence $(u^n)_n$ we have that $J_\lambda(u^n, f)< c$ implies that $(u^n)_n$ admits a $\tau_X$ -convergent subsequence.

Proposition 3.1. For each $i \in \{1, \ldots, N\}$ , set $D_i(v_i,f_i) := \|v_i-f_i\|_{Y_i}^{\,p_i}$ with $p_i \in [1, \infty)$ , and define $\tau_D$ to be the trivial topology. Then, if assumption (N) holds, assumption (G) is also satisfied.

Proof. We consider the particular points of assumption (G) with the numbering given there. Point 1 obviously holds. For two sequences $(a_i^n)_n, (\,f_i^n)_n$ in Yi converging to $a _i, f_i \in Y_i$ with respect to $\tau_{Y_i}$ and norm convergence, respectively, we get that

and hence, by monotonicity and continuity of the mapping $x \mapsto x^{\,p_i}$ on $[0, \infty)$ , point 2 of assumption (G) holds. Points 3 and 4 remain unchanged. Regarding point 5, take $\lambda >0$ , $(\,f^n)_n$ a sequence D-converging to $f \in Y$ and $(u^n)_n $ a sequence such that $J_\lambda(u^n, f^{n}) < c $ , with c  >  0. By the triangle inequality we get that

which is bounded, hence by assumption $(u^ n)_n $ admits a $\tau_X$ -convergent subsequence. □

By the inverse triangle inequality and continuity of $x \mapsto x^{\,p_i}$ it is also clear that assumption $S_1(u, f^\dagger, I^c)$ holds for any $u \in X$ . Hence, in summary, for the case of power-of-norm discrepancies, we get the following simplified version of the general result in theorem 2.5, covering existence, stability and convergence for vanishing noise. Obviously, the corollaries 2.7 and 2.8 then also hold in simplified versions.

Theorem 3.2 (General convergence result for powers of norms). Suppose that assumption (N) holds. Let $(\,f^n)_n$ in Y and $f^\dagger \in Y$ be such that $\|\,f^\dagger_i - f_i^n \|_{Y_i} \rightarrow 0$ for each i as $n\rightarrow \infty$ , and, for $i \in \{1, \ldots, N\}$ , define $\delta_i ^n = \|\,f^\dagger _i - f_i^n\|_{Y_i} $ such that, by assumption, $\delta_i^n \rightarrow 0$ as $n \rightarrow \infty$ . Choose the parameters $\lambda^n = (\lambda_1^n, \ldots, \lambda_N^n) \in (0, \infty){}^ N$ such that, for a given subset $I \subset \{1, \ldots, N\}$ ,

Further denote $\lambda^\dagger = \lim\nolimits_{n \rightarrow \infty} \lambda ^n \in (0, \infty]^N$ .

If ${\rm dom}(R) \cap \left(\bigcap\nolimits_{i=1}^N {\rm dom}(T_i) \right) \neq \emptyset$ , then there exists a sequence $(u^n)_n$ of solutions to $P_N(\lambda^n, f^n)$ ; every such sequence of solutions possesses a $\tau_X$ -convergent subsequence, again denoted by $(u^n)_n$ , with limit $u^\dagger$ such that

and every limit of a $\tau_X$ -convergent subsequence of solutions is a solution to

Equation (26)

Regarding convergence rates, we recall that in the general case they rely on assumption $S_2(U, f^ \dagger)$ . For power-of-norm discrepancies, we now show that this assumption is also always satisfied. To this aim, we first need the following basic facts from analysis.

Lemma 3.3. Let $a, b, \lambda >0$ , $p, q \in (1, \infty)$ such that $\frac{1}{p}+\frac{1}{q}=1$ . Then

Further, there exist constants $c, C>0$ such that for any $\lambda \in (0, 1]$ ,

Proof. The first inequality follows from the Hölder inequality, since

which implies

Using l'Hôpital's rule we further get

and obviously $ \lim\nolimits_{\mu \rightarrow \infty} \frac{(1+\mu^q){}^{\,p-1}}{\mu^{q(\,p-1)}} = 1$ , hence the other two inequalities hold. □

This allows us to obtain the following estimates, which are a particular case of the inequalities required by assumption $S_2(U, f^ \dagger)$ .

Proposition 3.4. Let $(Y, \|\cdot\|)$ be a normed space, $f, f^\dagger, v \in Y$ and $p, q\in (1, \infty)$ with $\frac{1}{p}+\frac{1}{q}=1$ . Define $\delta = \|f - f^\dagger\|$ . Then, there exist $\alpha>0$ and constants $c, C, d, D>0$ such that, whenever $\delta < \alpha$ ,

Proof. Obviously, the result holds for $\delta=0$ , hence we assume $\delta>0$ . First we use lemma 3.3 to estimate, for a general $\lambda \in (0, 1]$ ,

Equation (27)

Now we set $\lambda = \delta^\frac{1}{q}$ and get for δ sufficiently small

Reordering the inequality (27), we obtain

and since $\lim\nolimits_{\lambda \rightarrow 0} \frac{(1+C \lambda){}^{-1}-1}{\lambda}=-C$ , there exists d  >  0 such that for all $\lambda $ sufficiently small $(1+C\lambda){}^{-1} \geqslant (1-d\lambda)$ and hence

Setting again $\lambda = \delta^\frac{1}{q}$ the result follows. □

The previous proposition shows that, in case all Di terms are powers of norms, assumption $S_2(U, f^ \dagger)$ again holds for any $U \subset X$ and $f^\dagger$ , with $\psi_i(x)=x^\frac{1}{p_i}$ . From this, we obtain the following convergence rates.

Theorem 3.5. With the assumptions of theorem 3.2, decompose the index set I such that $I = J_1 \cup J_p$ where $J_1 = \{i \in I \, |\, p_i = 1\}$ , $J_p = \{i \in I \, |\, p_i > 1\}$ . Let the parameter choice for the indices $i\in I$ be such that

Equation (28)

with $ \newcommand{\e}{{\rm e}} \epsilon_i \in (0, 1)$ . Let $(u^n)_n$ be a sequence of solutions to $P_N(\lambda, f^n)$ and $u^\dagger$ be a $J_{\lambda^\dagger, f^\dagger}$ -minimizing solution of $T_I u = f_I^\dagger$ . Then

Equation (29)

Assume further that there exists

and constants $0<\beta_1<1$ , $0<\beta_2$ such that

for all u satisfying $ \newcommand{\e}{{\rm e}} J_{\lambda^\dagger, I^c}(u, f^\dagger) + \sum\nolimits_{i \in I} \|T_iu-f^\dagger_i\|_{Y_i}^{p_i} \leqslant J_{\lambda^\dagger}(u^\dagger, f^\dagger)+ \epsilon$ for some $ \newcommand{\e}{{\rm e}} \epsilon > 0$ . Then, for $j \in I$ , it holds

Equation (30)

and, in case pj  =  1,

Equation (31)

and, in case pj  >  1,

Equation (32)

Proof. At first we note that, by theorem 2.11, for all but finitely many indices n we get

Using that, by Young's inequality, for $i \in J_p$ ,

we obtain

Equation (33)

Estimating the left-hand side from below by the individual terms, the assertion follows. □

We now consider the particular case that the noise on all components vanishes, i.e. $I= \{1, \ldots, N\}$ , and that pi  =  2 for all i, which is typically chosen when the norms $\|\cdot \|_{Y_i}$ are two-norms and the noise is Gaussian.

Assume again that the noise level of the different components is related as $(\delta_i^n) = (\delta^n){}^{\mu_i}$ with $\delta^n \rightarrow 0$ , $\delta^n \geqslant 0$ and $\mu_i \geqslant 1$ with $\mu_{i_0} = 1$ for one i0 (the slowest rate). As a direct consequence of the previous theorem we get the rate $\delta^n$ for $D^\xi_R$ and $(\delta^n){}^{\mu_j + 1}$ for $ \|T_ju^n - f_j\|_{Y_j}^{2} $ . The following corollary shows that, again, by adapting the parameter choice to the interplay of the different noise levels, we can improve the rate for $ \|T_ju^n - f_j\|_{Y_j}^{2} $ without worsening the rate for $D_R^\xi$ .
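As a small worked example (numbers chosen for illustration only): for two channels with $\mu_1 = 1$ and $\mu_2 = 2$ , the rates just stated give $\|T_2u^n - f_2^n\|_{Y_2}^{2} = O\big((\delta^n){}^{\mu_2+1}\big) = O\big((\delta^n){}^{3}\big)$ , whereas the adapted choice of the corollary below yields

$$\|T_2u^n - f_2^n\|_{Y_2}^{2} = O\big((\delta^n){}^{2\mu_2}\big) = O\big((\delta^n){}^{4}\big), \qquad D_R^\xi(u^n, u^\dagger) = O(\delta^n),$$

i.e. the rate for the better channel improves while the rate for the Bregman distance is unchanged.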

Corollary 3.6. With the assumptions of theorem 3.2, set the index set $I= \{1, \ldots, N\}$ and pi  =  2 for all i. Assume that the noise level for each component is such that

with $(0, \infty) \ni \delta^n \rightarrow 0$ , $\mu_i \geqslant 1$ and $\mu_{i_0} = 1$ for some i0. Let the parameter choice for all i be such that

Equation (34)

Let $(u^n)_n$ be a sequence of solutions to (${P_N(\lambda, f)}$ ) and $u^\dagger$ be an R-minimizing solution of $T u = f^\dagger$ . Then

Equation (35)

Assume further that there exists

and constants $0<\beta_1<1$ , $0<\beta_2$ such that

for all u satisfying $R(u) + \sum\nolimits_{i=1}^N D_i(T_iu, f_i^\dagger) \leqslant R(u^\dagger)+ \epsilon$ for some $\epsilon > 0$ . Then, for any j, it holds that

Equation (36)

and

Equation (37)

This parameter choice is optimal for the estimates given in equation (33) in the sense that they cannot be improved by choosing $ \newcommand{\e}{{\rm e}} \lambda_i^n = (\delta_i^n){}^{-(2-\epsilon_i)}$ with $ \newcommand{\e}{{\rm e}} \epsilon_i \in (0, 2)$ .

Proof. First note that the first estimate in equation (33) in the proof of theorem 3.5 holds true for any admissible parameter choice. It is also easy to see that, for the particular case of this corollary, the rate for $D_{R}^\xi$ cannot be better than claimed. Also for the data discrepancy $\|T_j u^n -f_j^n\|_{Y_j}^2$ we see, by considering the jth summand, that the rate cannot be better than $(\delta^n){}^{2\mu_j}$ , which is the claimed rate for the proposed choice. Plugging in the proposed parameter choice, the claimed rates follow by direct computation. □

3.2. Mixed two-norm and Kullback–Leibler-divergence discrepancies

In inverse problems where the measurement process essentially corresponds to counting a number of events, such as photon measurements in PET or electrons in electron tomography, the noise is typically assumed to follow a Poisson distribution. Interpreting the regularization term as a prior in a Bayesian approach, the data fidelity for computing a maximum a posteriori (MAP) estimate for Poisson-noise-corrupted data is the Kullback–Leibler divergence (see for instance [23, 39]).
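A brief sketch of this connection (in a simplified discrete setting, chosen here for illustration): if the data $f = (\,f_j)_j$ are independent Poisson counts with means $v_j = (Tu)_j$ , then the negative log-likelihood is

$$-\log p(\,f\,|\,v) = \sum_j \big(v_j - f_j \log v_j + \log(\,f_j!)\big),$$

which, up to terms independent of u, equals $\sum_j \big(v_j - f_j - f_j\log(v_j/f_j)\big)$ , i.e. a discrete version of the Kullback–Leibler divergence defined in (38) below.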

Since important applications of joint regularization approaches, such as MR-PET reconstruction or multi-spectral electron tomography, include this type of measurement, we also consider the application of the previous results to the Kullback–Leibler divergence as data fidelity. To this aim, we first establish some basic properties of this type of data fidelity.

3.2.1. Basic properties of the Kullback–Leibler divergence.

We define the Kullback–Leibler divergence as follows. Let $(\Sigma, \mathcal{B}_\mu, \mu)$ with $\Sigma \subset \mathbb{R}^m$ be a finite measure space. We consider functions in $L^1(\Sigma, \mu)$ , where typically, for example, either μ is the m-dimensional Lebesgue measure $\mathcal{L}^ m$ and Σ a bounded subset of $\mathbb{R}^m$ , or $\Sigma \subset \mathcal{S}^{m-1} \times \mathbb{R}$ , with $\mathcal{S}^{m-1}$ being the (m − 1)-dimensional unit sphere and μ the (m − 1)-dimensional Hausdorff measure restricted to $\mathcal{S}^{m-1}$ . The latter is the generic image space for the Radon transform.

Then we define $ \newcommand{\dkl}{D_{{\rm KL}}} \dkl:L^1(\Sigma, \mu) \times L^1(\Sigma, \mu) \rightarrow [0, \infty]$ as

$$D_{\rm KL}(v, f) := \int_\Sigma \Big( v - f - f \log\Big(\frac{v}{f}\Big)\Big) \,{\rm d}\mu, \qquad (38)$$

where we set $0 \log \Big(\frac{a} {0}\Big) = 0$ for $a \geqslant 0$ and $-b \log (\frac{0}{b}) = \infty$ for b  >  0. Note that $D_{\rm KL}$ is well defined since, as an easy computation shows, the integrand is always non-negative, but it potentially takes the value infinity even for non-negative arguments.
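For illustration, a discrete counterpart of (38) with the above conventions could be implemented as follows (a sketch of our own, not the paper's accompanying source code):

```python
import numpy as np

def kl_divergence(v, f):
    """Discrete Kullback-Leibler divergence sum_j (v_j - f_j - f_j*log(v_j/f_j)),
    using the conventions 0*log(a/0) = 0 (for a >= 0) and -b*log(0/b) = inf (for b > 0)."""
    v = np.asarray(v, dtype=float)
    f = np.asarray(f, dtype=float)
    if np.any(v < 0) or np.any(f < 0):
        raise ValueError("arguments must be non-negative")
    out = v - f
    pos = f > 0
    # where f > 0 and v = 0 the term -f*log(v/f) is +inf; elsewhere use the formula
    with np.errstate(divide="ignore"):
        out[pos] -= f[pos] * np.log(v[pos] / f[pos])
    return float(np.sum(out))

print(kl_divergence([1.0, 2.0, 0.5], [1.0, 1.5, 0.5]))  # small sanity check, result >= 0
```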

The following properties of $D_{\rm KL}$ are well known, can for instance be found in [3, 37] for the case of Lebesgue measures, and extend directly to more general measures since they rely exclusively on pointwise assertions.

Lemma 3.7 (General properties of Kullback–Leibler). For $v, f \in L^1(\Sigma, \mu)$ , the Kullback–Leibler divergence satisfies the following.

  • 1.  
    The function $D_{\rm KL}(\cdot, \cdot)$ is non-negative, and $D_{\rm KL}(v, f) = 0$ implies $v = f$ .
  • 2.  
    The function $ \newcommand{\dkl}{D_{{\rm KL}}} (v, f)\mapsto \dkl(v, f)$ is convex.
  • 3.  
    The Kullback–Leibler divergence can be estimated as follows
    Equation (39)
  • 4.  
    For $(v^n, f^n)\in L^1(\Sigma, \mu) \times L^1(\Sigma, \mu)$ uniformly bounded in $L^1(\Sigma, \mu)$ ,

Further, we will require suitable continuity results for the Kullback–Leibler divergence as stated in the following proposition.

Proposition 3.8. (Continuity properties)

  • 1.  
    For sequences $(v^n)_n, (\,f^n)_n \in L^1(\Sigma, \mu)$ and $v,f \in L^1(\Sigma,\mu)$ , the Kullback–Leibler divergence is jointly lower semi-continuous in the sense that
    Equation (40)
  • 2.  
    Let $f, v \in L^1(\Sigma, \mu)$ and the sequence $(\,f^n)_n$ be given such that $D_{\rm KL}(\,f, f^n) \to 0$ as $n \rightarrow \infty$ . Assume that there exist constants $\beta_0, \, \beta_1>0$ such that
    Then
    Equation (41)
  • 3.  
    In the setting of assertion 2, define $\Sigma_{v+} = \{x \in \Sigma \, |\, v(x)>0\}$ . Then there exists C  >  0 such that
    Equation (42)

Proof. 

  • Ad 1.  
    We show that $D_{\rm KL}$ is lower semi-continuous with respect to L1 convergence in both components. The desired statement then follows since convergence in $D_{\rm KL}(\,f, f^n)$ implies convergence in L1 due to lemma 3.7, assertion 3, and convexity of the functional yields lower semi-continuity with respect to weak L1 convergence. Without loss of generality we assume that $(v^n)_n$ and $(\,f^n)_n$ are such that $v^n\geqslant 0$ , $f^n \geqslant 0$ pointwise almost everywhere and $\lim\nolimits_{n\to \infty} D_{\rm KL}(v^n, f^n)=\liminf\nolimits_{n\to \infty} D_{\rm KL}(v^n, f^n)$ . The sequences $(v^n)_n$ , $(\,f^n)_n$ are bounded in L1, and thus contain subsequences $(v^{n_i})_i$ , $(\,f^{n_i})_i$ converging pointwise to $v$ and f, respectively. We show that the non-negative integrand $g(a, b)=a-b-b\log \big(\frac{a}{b}\big)$ of $D_{\rm KL}$ is lower semi-continuous on $[0, \infty)\times[0, \infty)$ . Indeed, for $a^n\to a$ , $b^n\to b$ , the case a  >  0, b  >  0 is trivial since g is continuous on $(0, \infty)\times(0, \infty)$ . If a  =  0 and b  =  0, then $g(a, b)=0\leqslant g(a^n, b^n)$ ; if a  =  0, b  >  0, then $-b^n \log \big(\frac{a^n}{b^n}\big) \to \infty$ , implying $\lim g(a^n, b^n)=g(a, b)=\infty$ . In the case a  >  0, b  =  0, the value $b^n \log \big(\frac{a^n}{b^n}\big) \to 0$ , resulting in $g(a, b)=\lim \limits_{n\rightarrow \infty} g(a^n, b^n)$ . Hence, $g\big(v(x), f(x)\big)\leqslant \liminf\nolimits_{i \to \infty} g\big(v^{n_i}(x), f^{n_i}(x)\big)$ a.e. and application of Fatou's lemma yields
  • Ad 2.  
    We assume $ \newcommand{\dkl}{D_{{\rm KL}}} \dkl(\,f, f^n) < \infty $ and $ \newcommand{\dkl}{D_{{\rm KL}}} \dkl(v, f)< \infty$ , since the case $ \newcommand{\dkl}{D_{{\rm KL}}} \dkl(v, f)=\infty$ is trivial due to lower semi-continuity. Define $\Sigma_{v+} = \{x \in \Sigma \, |\, v(x) >0 \}$ and note that $f(x) > 0 $ for $x \in \Sigma_{v+}$ by assumption. Using the conventions for $a\log(\frac{a}{b})$ for the case that a  =  0 or b  =  0, direct computation shows
    Equation (43)
    Indeed, note that the above computations are also valid in the special cases where one of the occurring functions attains the value 0: since both $D_{\rm KL}(\,f, f^n)$ and $D_{\rm KL}(v, f)$ are finite, we get that, up to sets of measure zero, $f(x) = 0 $ implies fn(x)  =  0, and $v(x)=0$ implies $f(x)=0$ . This means that all log terms in the second line above are finite, and a simple case distinction shows that the estimate above also captures all degenerate cases. The claimed convergence then follows immediately.
  • Ad 3.  
    This follows directly from (43) and lemma 3.7.

Our goal is now to identify when the general assumptions (G) as well as assumptions $S_1(u, f^\dagger, I^c)$ , $S_2(U, f^ \dagger)$ can be verified for data fidelities that include $ \newcommand{\dkl}{D_{{\rm KL}}} \dkl$ . To this aim, we summarize the corresponding assertions in the following proposition, which is an immediate consequence of lemma 3.7 and proposition 3.8.

Proposition 3.9 (Properties of $D_{\rm KL}$ ). Define $\tau_{D_{\rm KL}}$ to be the trivial topology on $L^1(\Sigma, \mu)$ , such that $f^n \mathop{\rightarrow}\limits^{D_{\rm KL}} f$ if and only if $D_{\rm KL}(\,f, f^n) \rightarrow 0$ . Then the following holds for each $f, v \in L^1(\Sigma, \mu)$ and each sequence $(\,f^n)_n $ in $L^1(\Sigma, \mu)$ .

  • 1.  
    $f^n \overset{D_{\rm KL}}{\rightarrow} f$ if and only if $ \newcommand{\dkl}{D_{{\rm KL}}} \dkl(\,f, f^n) \rightarrow 0$ and $\|f^n -f \|_1 \rightarrow 0$ .
  • 2.  
    $ \newcommand{\dkl}{D_{{\rm KL}}} \dkl(u, v) \geqslant 0$ , $ \newcommand{\dkl}{D_{{\rm KL}}} \newcommand{\R}{\mathbb{R}} \dkl(u, v) = 0 \Rightarrow u=v$ .
  • 3.  
    $ \newcommand{\dkl}{D_{{\rm KL}}} \dkl$ is lower semi-continuous w.r.t weak-L1 convergence in the first and $ \newcommand{\dkl}{D_{{\rm KL}}} \dkl$ convergence in the second component.
  • 4.  
    If $v, f^\dagger \in L^1(\Sigma, \mu)$ are such that there exist constants $\beta_0, \beta_1$ with $\beta_0 f^\dagger \geqslant v \geqslant \beta_1 f^\dagger$ , then the mapping $f \mapsto D_{{\rm KL}}(v, f)$ is continuous at $f^\dagger$ with respect to $D_{{\rm KL}}$-convergence.
  • 5.  
    If $U \subset L^1 (\Sigma, \mu)$ and $f^\dagger \in L^1(\Sigma, \mu)$ are such that there exist constants $\beta_0, \beta_1$ with $\beta_0 f^\dagger \geqslant v \geqslant \beta_1 f^\dagger$ for all $v \in U$ , then there exists C  >  0 such that, for each $v \in U$ and each f with $D_{{\rm KL}}(\,f^\dagger, f) \leqslant 1$ , we have
    Equation (44)

3.2.2. Stability and convergence for mixed two-norm and Kullback–Leibler discrepancies.

In the previous section we have derived properties of the Kullback–Leibler divergence that allow us to incorporate it in the general framework of section 2. In this section, motivated by concrete applications, we work out how these results can be applied to the particular case of mixed L2 and Kullback–Leibler data discrepancy terms. That is, we decompose the index set $\{1, \ldots, N\}$ as the disjoint union of $L_{\rm nr}$ and $L_{\rm kl}$ and consider the minimization problem

$\min\limits_{u \in L^{\,p}(\Omega){}^N} \ \sum\limits_{i \in L_{\rm nr}} \lambda_i \|T_i u_i - f_i\|_2^2 + \sum\limits_{i \in L_{\rm kl}} \lambda_i D_{{\rm KL}}(T_i u_i + c_i, f_i) + R(u), \qquad\qquad (P_{{\rm NKL}}(\lambda, f))$
where $p \in (1, 2]$ and $\Omega \subset \mathbb{R}^d$ is a bounded Lipschitz domain, $R:L^{\,p}(\Omega){}^N \rightarrow [0, \infty]$ is a regularization term and $\lambda \in (0, \infty){}^N$ . Note that we have further specified the problem by evaluating the data discrepancies only on the components ui of $u = (u_1, \ldots, u_N)$ , which is a particular case of the previous setting and corresponds to the situation that all components of u are obtained from independent measurements. Also, we now assume the operators Ti to be bounded linear operators, that is, for $i \in L_{\rm nr}$ we assume $T_i\in \mathcal{L}(L^{\,p}(\Omega) , L^{\,p}(\Sigma_i))$ with $\Sigma_i \subset \mathbb{R}^{m_i}$ bounded and Lebesgue measurable, and for $i \in L_{\rm kl}$ we assume $T_i \in \mathcal{L}(L^{\,p}(\Omega) , L^1(\Sigma_i, \mu_i))$ with $\Sigma_i \subset \mathbb{R}^{m_i}$ and $(\Sigma_i, \mathcal{B}_{\mu_i}, \mu_i)$ a finite measure space. The fi denote the given data in the respective spaces and, for indices $i \in L_{\rm kl}$ , we have included additional additive terms ci that we assume to be given and such that $c_i \in L^ 1(\Sigma_i)$ , $c_i \geqslant 0$ pointwise. The reason for including the ci is that in PET imaging they are typically used to model (a priori estimated) measurement contributions resulting from scattering and random events (see for instance [26]).

For $i \in L_{\rm kl}$ , the data discrepancies are given by the Kullback–Leibler divergence $D_{{\rm KL}}$ as defined in equation (38). For the norm discrepancies we extend the 2-norm by duality to $v \in L^{\,p}(\Sigma_i)$ as

$\|v\|_2 := \sup \Big\{ \int_{\Sigma_i} v \varphi \,{\rm d}x \ \Big|\ \varphi \in L^{\infty}(\Sigma_i),\ \|\varphi\|_{2} \leqslant 1 \Big\}.$
Note that, as can easily be seen by duality and density arguments, $\|v\|_2 < \infty$ if and only if $v \in L^2(\Sigma_i)$ , and $\|\cdot \|_2$ is lower semi-continuous w.r.t. weak L1 convergence since it is the pointwise supremum of a set of continuous functions [14]. The reason for using $L^{\,p}(\Sigma_i)$ as image space and extending the 2-norm, instead of requiring Ti to map continuously into L2, is that in typical applications we need to choose $p \leqslant \frac{d}{d-1}$ with $d \in \{2, 3, 4\}$ to obtain continuous embeddings; allowing Ti to map continuously into Lp instead of L2 is therefore less restrictive. On the other hand, the 2-norm is the natural choice for measurements corrupted by Gaussian noise.

For the particular setup of equation ($P_{{\rm NKL}}(\lambda, f)$ ), choosing the involved topologies appropriately, the general assumption (G) can be simplified as follows.

Assumption (NKL). [$\tau_X$ and $\tau_{Y_i}$ the weak topologies in the respective spaces, $\tau_{D_i}$ the trivial topologies]

  • 1.  
    R is lower semi-continuous w.r.t weak convergence in $L^{\,p}(\Omega){}^N$ .
  • 2.  
    $T_i\in \mathcal{L}(L^{\,p}(\Omega) , L^{\,p}(\Sigma_i))$ for $ \newcommand{\nr}{{{\rm nr}}} i \in L_\nr$ and $T_i \in \mathcal{L}(L^{\,p}(\Omega) , L^1(\Sigma_i, \mu_i))$ for $ \newcommand{\kl}{{{\rm kl}}} i \in L_\kl$ .
  • 3.  
    For each C  >  0 the set
    $\big\{u \in L^{\,p}(\Omega){}^N \,\big|\, R(u) + \sum\nolimits_{i \in L_{\rm nr}} \|T_i u_i\|_{p} + \sum\nolimits_{i \in L_{\rm kl}} \|T_i u_i\|_{1} \leqslant C \big\}$
    is sequentially compact w.r.t. weak convergence in $L^{\,p}(\Omega) ^N$ .

Proposition 3.10. If assumption (NKL) holds for the particular choices of regularization and data fidelity used in equation ($P_{{\rm NKL}}(\lambda, f)$ ), and if we choose $\tau_X$ to be the topology corresponding to weak Lp convergence, $\tau_{D_i}$ the trivial topologies and $\tau_{Y_i}$ the topologies corresponding to weak Lp and weak L1 convergence for $i \in L_{\rm nr}$ and $i\in L_{\rm kl}$ , respectively, then assumption (G) holds.

Proof. First note that, when considering assumption (G) we include the additional terms ci appearing in the Kullback–Leibler discrepancies in the forward operators Ti, that is, for $ \newcommand{\kl}{{{\rm kl}}} i \in L_\kl$ we consider the affine operators $\tilde{T}_i u = T_i u_i + c_i$ . Given the properties of $ \newcommand{\dkl}{D_{{\rm KL}}} \dkl$ and the 2-norm, the only thing that is then left to show is that the sequential compactness as in Point 3. above implies sequential compactness as in assumption (G), Point 5. To this aim take $\lambda \in (0, \infty) ^N$ and sequences $(u^n)_n$ , $(\,f^n)_n$ , and assume that $f^n \overset{D}{\rightarrow} f$ and $J_\lambda(u^n, f^n)<c$ . Obviously this implies that R(un) is bounded. For $ \newcommand{\nr}{{{\rm nr}}} i \in L_\nr$ , also, $\|T_i u^n_i - f^n_i\|_2$ is bounded and we can estimate

and consequently $\|T_i u_i\|_{p}$ is bounded. Hence, if we can show that $\|T_iu_i^n\|_1$ is bounded for $ \newcommand{\kl}{{{\rm kl}}} i \in L_{\kl}$ the assertion follows. Assume that (up to subsequences) $\|T_i u_i^n\|_1 \rightarrow \infty$ which implies that $\|T_i u_i^n + c_i\|_1 \rightarrow \infty$ . Using lemma 3.7 we get for large enough n

Now estimating all bounded quantities with constants, this implies that

for $C_1, C_2 >0$ , which contradicts $\|T_iu^n_i+ c_i\|_1 \rightarrow \infty$ . Hence the assertion follows. □

Now the continuity assumption $S_1(u, f^\dagger, I^c)$ that was required for stability translates to the following assumption on the data.

Assumption ${S^{{\rm kl}}_1(u, f^\dagger, I^c \cap L_{{\rm kl}})}$. For each $i \in I^c \cap L_{\rm kl}$ there exist constants $\beta_0, \beta_1 > 0$ such that $\beta_0 f_i^\dagger \geqslant T_i u_i + c_i \geqslant \beta_1 f_i^\dagger$ .
Noting that assumption $S_1(u, f^\dagger, I^c)$ was only used for u such that $D_i(T_iu, f_i^\dagger) < \infty$ , the following lemma allows us to relate the assumptions $S_1(u, f^\dagger, I^c)$ and $S^{{\rm kl}}_1(u, f^\dagger, I^c \cap L_{{\rm kl}})$ .

Lemma 3.11. Assume that (NKL) holds and that u is such that $\|T_i u_i - f_i^\dagger\|_2 < \infty$ for $i \in I^c \cap L_{\rm nr}$ . Then assumption $S^{{\rm kl}}_1(u, f^\dagger, I^c \cap L_{{\rm kl}})$ implies $S_1(u, f^\dagger, I^c)$ .

Proof. For indices $ \newcommand{\nr}{{{\rm nr}}} i \in I^c \cap L_\nr$ we note that $\|T_i u_i - f_i^\dagger\|_2 < \infty$ implies for all fi with $\|\,f_i - f_i^\dagger \|_2$ finite that $\|T_i u_i -f_i \|_2 \leqslant \|T_i u_i -f_i^\dagger \|_2 + \|f_i^\dagger -f_i \|_2< \infty$ . Hence the inverse triangle inequality can be employed to show continuity of $f_i \mapsto \|T_i u_i - f_i\|_2 $ w.r.t Di-convergence. For $ \newcommand{\kl}{{{\rm kl}}} i \in I^c \cap L_\kl$ , this is the assertion of proposition 3.9. □

Remark 3.12. Remember that assumption $S^{{\rm kl}}_1(u, f^\dagger, I^c \cap L_{{\rm kl}})$ will be applied with u being a solution to $P_{{\rm NKL}}(\lambda, f^\dagger)$ and hence it appears not too restrictive. In particular, it is not stronger than standard assumptions in the context of inverse problems with Poisson noise, even with a single data discrepancy [37, 39].

Using assumption $S^{{\rm kl}}_1(u, f^\dagger, I^c \cap L_{{\rm kl}})$ , a basic existence and stability result follows as a direct consequence of proposition 2.2 and corollary 2.7.

Proposition 3.13 (Stability). Assume that assumption (NKL) holds. Let $(\,f^n)_n$ in Y and $f^\dagger \in Y$ be such that $f^n \overset{{D}}{\rightarrow}f^\dagger$ as $n\rightarrow \infty$ and take $\lambda \in (0, \infty){}^ N$ . In case $J_\lambda(\cdot, f^n)$ is proper, there exists a solution to $P_{{\rm NKL}}(\lambda, f^n)$ . If we further assume that, for u0 a solution to ($P_{{\rm NKL}}(\lambda, f^\dagger)$ ), assumption $S^{\rm kl}_1(u_0, f^\dagger, \{1, \ldots, N\} \cap L_{\rm kl})$ holds, then every sequence $(u^n)_n$ of solutions to

Equation (45)

admits a $\tau_X$ -convergent subsequence again denoted by $(u^n)_n$ with limit $u^\dagger$ such that

and every limit of a $\tau_X$ -convergent subsequence of solutions is a solution to

Equation (46)

It is immediate that, under assumption (NKL) and even without the assumption $S^{{\rm kl}}_1(u, f^\dagger, I^c \cap L_{{\rm kl}})$ , the counterpart of the convergence result of corollary 2.8 holds for the particular setting of this section. In order to obtain rates, we need to translate assumption $S_2(U, f^ \dagger)$ appropriately as follows.

Assumption ${S^{\rm kl}_2(U, f^ \dagger)}$. For each $i \in L_{\rm kl}$ there exist constants $\beta_0, \beta_1 > 0$ such that $\beta_0 f_i^\dagger \geqslant T_i u_i + c_i \geqslant \beta_1 f_i^\dagger$ for all $u \in U$ .
Again we note that, when assumption $S_2(U, f^ \dagger)$ is applied in the particular setting of this section, we set $U = \{u^n \, |\, n \geqslant n_0 \} \cup \{u^\dagger \}$ with $(u^n)_n$ being a sequence of minimizers and $u^\dagger$ being a minimizer for data fn and $f^\dagger$ , respectively. Hence, as can easily be deduced from the triangle inequality, it will hold that $\|T_iu_i-f_i\|_2 + \|T_iu_i-f_i^\dagger\|_2< \infty$ for all $u \in U$ , all fi with $\|f_i-f_i^\dagger \|_2 < \infty$ and all $i \in L_{\rm nr}$ . Taking this into account, the relation of assumptions $S^{\rm kl}_2(U, f^ \dagger)$ and $S_2(U, f^ \dagger)$ is given as follows.

Lemma 3.14. Assume that assumption (NKL) holds and that U is such that, for each $u \in U$ , $\|T_i u_i - f_i\|_2 + \|T_i u_i - f_i^\dagger\|_2 < \infty$ for each fi with $\|f_i-f_i^\dagger\|_2 <\infty $ and each $i \in L_{\rm nr}$ . Then assumption $S^{\rm kl}_2(U, f^ \dagger)$ implies $S_2(U, f^ \dagger)$ with $\psi_i(x) = x^{1/2}$ for all $i \in \{1, \ldots, N\}$ .

Proof. Given that all involved quantities are finite, we can transfer the result of proposition 3.4 also to the extended 2-norm, which implies that $S_2(U, f^ \dagger)$ holds with $\psi_i (x) = x^{1/2}$ for $ \newcommand{\nr}{{{\rm nr}}} i \in L_\nr$ . For $ \newcommand{\kl}{{{\rm kl}}} i \in L_\kl$ , it is the assertion of proposition 3.9 that $S_2(U, f^ \dagger)$ holds with $\psi_i(x) = x^{1/2}$ . □

Using assumption $S^{\rm kl}_2(U, f^ \dagger)$ , we get the following convergence rates result for ($P_{{\rm NKL}}(\lambda, f)$ ).

Corollary 3.15. Let a sequence $(\,f^n)_n$ be such that $f^n \mathop{\rightarrow}\limits^{{D}} f^\dagger\in Y$ as $n\rightarrow \infty $ . Define $\delta^n_i = \|f_i^n - f^\dagger_i\|_2$ for $i \in L_{nr}$ and $ \newcommand{\dkl}{D_{{\rm KL}}} \delta^n_i = \dkl(\,f^\dagger_i, f^n_i)$ for $i \in L_{kl}$ . Assume that for each i,

with $(0, \infty) \ni \delta^n \rightarrow 0$ and $\mu_i \geqslant 1$ . Define $ \newcommand{\nr}{{{\rm nr}}} \newcommand{\kl}{{{\rm kl}}} \overline{\mu} = \min (\{\mu_i\, |\, i \in L_\nr \} \cup \{\mu_i/2 \, | \, i \in L_\kl \}) $ , set $ \newcommand{\e}{{\rm e}} \epsilon_i = \overline{\mu}/\mu_i$ for all i, let the parameter choice be such that

and denote $\lambda^\dagger = \lim\nolimits_{n \rightarrow \infty} \lambda ^n \in (0, \infty]^N$ . Let $(u^n)_n $ be a sequence of solutions to $P_{{\rm NKL}}(\lambda^n, f^n)$ and let $u^\dagger $ be an R-minimizing solution of $T_i u_i = f_i^\dagger$ , $i = 1, \ldots, N$ . Set $U_{n_0} = \{u^n \, |\, n \geqslant n_0 \} \cup \{u^\dagger \}$ and assume that $S_2^{\rm kl}(U_{n_0}, f^\dagger)$ holds for some n0. Then

Equation (47)

If we assume further that there exists

constants $0<\beta_1<1$ , $0<\beta_2$ such that

for all u satisfying $ \newcommand{\dkl}{D_{{\rm KL}}} \newcommand{\nr}{{{\rm nr}}} \newcommand{\kl}{{{\rm kl}}} \newcommand{\e}{{\rm e}} R(u)+ \sum\nolimits_{i \in L_\nr} \|T_i u_i - f^\dagger _i\|_2 ^2 + \sum\nolimits_{i \in L_\kl} \dkl(T_iu_i+c_i, f^\dagger_i) \leqslant R(u^\dagger)+ \epsilon$ for some $ \newcommand{\e}{{\rm e}} \epsilon > 0$ , then we obtain for any $j \in \{1, \ldots, N\}$ ,

Equation (48)

Equation (49)

Equation (50)

Proof. The estimate on R is an immediate consequence of theorem 2.11 and the parameter choice. In case the source condition holds, the theorem also implies that

Now, using Young's inequality, we can estimate, for $i \in L_{\rm nr}$ and a generic constant C  >  0,

Further, by $ \newcommand{\kl}{{{\rm kl}}} S^\kl_2(U_{n_0}, f^\dagger)$ we can estimate

Together, this implies that

Plugging in the parameter choice this implies the rates as claimed. □

We note that the rates for the data terms above are optimal in the sense that, even for regularization with only the corresponding single data term, the rate would be the same. Since the Bregman distance involves all components of the unknown, we naturally obtain the worst-case rate amongst all data terms.
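As a brief illustration of the parameter-choice interplay (a worked instance of the definitions in corollary 3.15, not an additional result): consider a setting as in the joint MR-PET experiments of section 5 with N  =  2, $L_{\rm nr} = \{1\}$ , $L_{\rm kl} = \{2\}$ and noise levels decaying at the same speed, i.e. $\mu_1 = \mu_2 = 1$ . Then $\overline{\mu} = \min\{\mu_1, \mu_2/2\} = 1/2$ and $\epsilon_1 = \epsilon_2 = 1/2$ , so the Kullback–Leibler channel determines the joint exponent; only if its noise level decays at least twice as fast as that of the norm channel, i.e. $\mu_2 \geqslant 2\mu_1$ , does the norm channel become the limiting one.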

4. Particular choices for coupled regularization

In this section, we show how the previous results on mixed L2 and Kullback–Leibler discrepancies can be applied to concrete regularization functionals. First we consider a rather simple example of joint wavelet-sparsity regularization, where the regularization term is coercive in L2. As a second application, we consider coupled second order total generalized variation (TGV) [9] regularization, where coercivity can only be established up to a finite-dimensional subspace, and show how our results can also be applied in this situation.

4.1. Joint wavelet sparsity

Wavelets are well-known and heavily used in image processing and beyond for their property of sparse approximation of data. This is exploited for example in image compression and denoising applications [30]. When dealing with the regularization of multi-spectral data where the individual components are correlated, a natural approach for coupled regularization is to enforce joint sparsity of wavelet coefficients, see for instance [27]. This is realized by the regularization functional

$R(u) = \|W_N u\|_{2, 1},$

where $W_Nu = (Wu_1, \ldots, Wu_N)$ with $W:L^2(\Omega) \rightarrow \ell^2$ a biorthogonal wavelet transform, Ω a bounded set and, for $z= (z_1, \ldots, z_N) = ((z_{i, 1})_i, \ldots, (z_{i, N})_i) \in (\ell^2){}^N$ we define

$\|z\|_{2, 1} = \sum\nolimits_{i} \Big(\sum\nolimits_{j=1}^N |z_{i, j}|^2\Big)^{1/2}.$
With the definitions of section 3.2.2, we then consider

Then, with $R(u) = \|W_Nu\|_{2, 1} $ , it is easy to see that R is lower semi-continuous w.r.t. weak L2 convergence and, setting p  =  2, that also the compactness requirement of assumption (NKL) is satisfied, since $c \|u\|_{L^2} \leqslant \|W_Nu\|_2 \leqslant \|W_Nu\|_{2, 1}$ for all $u \in L^2(\Omega){}^N$ and some c  >  0 (see [30]).

Hence if all the forward operators Ti are continuous, assumption (NKL) holds and all results of section 3.2.2 apply.
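For illustration, the following numpy sketch evaluates the joint sparsity norm $\|\cdot\|_{2, 1}$ on an array of stacked coefficients (an $\ell^2$ norm across the N channels followed by an $\ell^1$ sum over the coefficient index); array shapes and names are illustrative and not taken from the reference implementation [24].

```python
import numpy as np

def joint_l21_norm(z):
    """||z||_{2,1} for coefficients z of shape (num_coefficients, N):
    l2 norm across the N channels, then l1 sum over the coefficient index."""
    return np.sqrt((np.abs(z) ** 2).sum(axis=1)).sum()

# coefficients sharing the same support across channels are penalized less
# than coefficients of equal energy with disjoint supports
shared = np.zeros((100, 2)); shared[:5, :] = 1.0
disjoint = np.zeros((100, 2)); disjoint[:5, 0] = 1.0; disjoint[5:10, 1] = 1.0
print(joint_l21_norm(shared), joint_l21_norm(disjoint))   # ~7.07 versus 10.0
```

This is the mechanism by which the functional favours wavelet coefficients that are active in all channels simultaneously.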

4.2. Joint total generalized variation regularization

The Total Generalized Variation functional was introduced in 2010 [9] with the aim of overcoming the well-known staircasing effect of the Total Variation (TV) functional while still allowing for jump discontinuities in the reconstruction. It has been analyzed in [5, 8] in the context of regularization of linear inverse problems and has been used in diverse applications, including the joint reconstruction of magnetic resonance (MR) and positron emission tomography (PET) images [26]. The latter employs an extension of second order TGV for vector-valued data where the coupling of the individual channels is carried out via a nuclear norm and a Frobenius norm at the level of first and second order derivatives, respectively. In this section, we show how the general results of section 3.2.2 can be applied to an extended version of this setting. Numerical examples for concrete applications to multi-contrast MR and PET-MR regularization will then be considered in the next section.

For parameters $\alpha_0, \alpha_1 > 0$ and $ \newcommand{\BV}{{\rm BV}} u=(u_1, \ldots, u_N) \in \BV(\Omega){}^N$ with Ω a bounded Lipschitz domain we define the second order TGV functional as

$\mathrm{TGV}_\alpha^2(u) = \min\limits_{w \in \mathrm{BD}(\Omega, \mathbb{R}^{d})^N} \ \alpha_1 \|\mathrm{D}u - w\|_{\mathcal{M}} + \alpha_0 \|\mathcal{E}w\|_{\mathcal{M}}, \qquad\qquad (51)$

noting that by the results in [5] this is equivalent to the original definition of $ \newcommand{\TGV}{{\rm TGV}} \newcommand{\TGVat}{\TGV_\alpha ^2} \newcommand{\T}{\mathcal{T}} \TGVat$ as in [9]. Here, $ \newcommand{\Wrt}{\mathrm{D}} \Wrt u = (\Wrt u_1, \ldots, \Wrt u_N)$ and $ \newcommand{\symgrad}{\mathcal{E}} \newcommand{\sym}{{\rm Sym}} \symgrad w = (\symgrad w_1, \ldots, \symgrad w_N) $ denote the distributional derivative and symmetrized derivative of u and $w=(w_1, \ldots, w_N)$ , respectively, and $ \newcommand{\BD}{{\rm BD}} \newcommand{\R}{\mathbb{R}} \BD(\Omega, \R^{d}){}^N$ denotes the space of functions of bounded deformation, i.e. the space of all $ \newcommand{\R}{\mathbb{R}} w \in L^1(\Omega, \R^d){}^N$ such that $ \newcommand{\symgrad}{\mathcal{E}} \newcommand{\sym}{{\rm Sym}} \symgrad w$ can be represented by a finite Radon measure. For $ \newcommand{\M}{\mathcal{M}} z \in \M(\Omega, Z){}^N$ , the space of N-tuples of finite Radon measures with values in $ \newcommand{\R}{\mathbb{R}} Z \in \{\R^{d}, S^{d \times d}\} $ where $S^{d \times d}$ is the space of symmetric $d\times d$ matrices, we define

where $|\cdot |$ can be any pointwise norm on ZN. We note that in fact this pointwise norm defines the coupling of the different components of Du and $ \newcommand{\symgrad}{\mathcal{E}} \newcommand{\sym}{{\rm Sym}} \symgrad w$ , e.g. choosing $|\cdot|$ to be a Frobenius norm results in a Frobenius-norm-type coupling of $ \newcommand{\Wrt}{\mathrm{D}} \Wrt u$ and $ \newcommand{\symgrad}{\mathcal{E}} \newcommand{\sym}{{\rm Sym}} \symgrad w$ . In [26], $|\cdot|$ was chosen to be the spectral norm for $ \newcommand{\R}{\mathbb{R}} Z = \R^{d}$ , yielding a nuclear-norm coupling of the components of $ \newcommand{\Wrt}{\mathrm{D}} \Wrt u$ .
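To illustrate the effect of the pointwise coupling norm, the following sketch evaluates, at a single grid point, the Frobenius and nuclear norms of the d × N matrix formed by stacking the gradients of N channels (our own illustrative example; the pointwise SVD implementation actually used is part of [24]).

```python
import numpy as np

def frobenius_norm(A):
    """Frobenius coupling: square root of the sum of squares of all entries."""
    return np.sqrt((np.abs(A) ** 2).sum())

def nuclear_norm(A):
    """Nuclear-norm coupling: sum of the singular values of the stacked
    d x N gradient matrix, computed via a (pointwise) SVD."""
    return np.linalg.svd(A, compute_uv=False).sum()

# gradients of N = 2 channels at one point of a 2D image (d = 2):
parallel = np.array([[1.0, 2.0],     # columns are parallel, i.e. the two
                     [0.5, 1.0]])    # channels have aligned edges here
independent = np.eye(2)              # gradients pointing in different directions
print(frobenius_norm(parallel), nuclear_norm(parallel))        # 2.5, 2.5
print(frobenius_norm(independent), nuclear_norm(independent))  # ~1.41, 2.0
```

For a rank-one stack (aligned gradients) the two norms coincide, while for gradients pointing in different directions the nuclear norm is strictly larger; this is the sense in which the nuclear-norm coupling favours a shared edge set across the channels.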

With the definitions of section 3.2.2 and $p \in (1, d/(d-1)]$ , we now consider the minimization problem

Here, $\mathcal{I}_{{\rm KL}_+}(u)$ is the indicator function of the set ${\rm KL}_+:= \{u \in L^{\,p}(\Omega){}^ N \, |\, u_i \geqslant 0 {\rm ~a.e.~for~all~} i \in L_{\rm kl}\}$ , which constrains all ui for $i\in L_{\rm kl}$ to be non-negative, i.e.

$\mathcal{I}_{{\rm KL}_+}(u) = \begin{cases} 0 & \text{if } u \in {\rm KL}_+, \\ \infty & \text{else.} \end{cases}$
Our goal is to show that assumption (NKL) holds for this setting. However, since $\mathrm{TGV}_\alpha^2$ itself is coercive only up to a finite-dimensional subspace, we have to employ a slight modification of this setup to obtain coercivity (without imposing further assumptions on the Ti), which is equivalent in the sense that solutions to the modified problem are still solutions to the original problem. That is, with $\mathcal{P}^1(\Omega){}^N$ the set of $\mathbb{R}^N$-valued polynomials of order less or equal to one, we define the finite dimensional subspace

$Z := \{ u \in \mathcal{P}^1(\Omega){}^N \, |\, Tu = 0 \},$
where $T = (T_1, \ldots, T_N)$ . Further, we define

$Z^\perp := \Big\{ u \in L^{\,p}(\Omega){}^N \ \Big|\ \int_\Omega \big(u(x), z(x)\big) \,{\rm d}x = 0 {\rm ~for~all~} z \in Z \Big\},$
where $(\cdot, \cdot)$ denotes the standard inner product in $ \newcommand{\R}{\mathbb{R}} \R^N$ , and note that, as can be easily shown, $Z^\perp$ is a complement of Z in $L^{\,p}(\Omega)$ , i.e. $Z^\perp$ is closed, $Z\cap Z^\perp = \{0\}$ and $L^{\,p}(\Omega) = Z + Z^\perp$ .

We then consider the modified problem

which is equivalent to the original one in the sense that any solution of ($P_{{\rm NKL-TGV}}(\lambda, f)$ ) is a solution of the original one and, with $u = u_1 + u_2 \in Z^\perp + Z$ being a solution of the original problem, u1 is a solution of ($P_{{\rm NKL-TGV}}(\lambda, f)$ ).

For this setting, we get the following assertion.

Proposition 4.1. With the setting of section 3.2.2, let $p\in (1, d/(d-1)]$ and assume that $T_i\in \mathcal{L}(L^{\,p}(\Omega) , L^{\,p}(\Sigma_i))$ for $ \newcommand{\nr}{{{\rm nr}}} i \in L_\nr$ and $T_i \in \mathcal{L}(L^{\,p}(\Omega) , L^1(\Sigma_i, \mu_i))$ for $ \newcommand{\kl}{{{\rm kl}}} i \in L_\kl$ . Set $ \newcommand{\TGV}{{\rm TGV}} \newcommand{\TGVat}{\TGV_\alpha ^2} \newcommand{\I}{\mathcal{I}} \newcommand{\T}{\mathcal{T}} R = \TGVat +\I_{{\rm ~KL}_+} +\I_{Z^\perp}$ . Then assumption (NKL) holds for ($P_{{\rm NKL-TGV}}(\lambda, f)$ ) and all results of section 3.2.2 apply.

Proof. First note that $ \newcommand{\TGV}{{\rm TGV}} \newcommand{\TGVat}{\TGV_\alpha ^2} \newcommand{\T}{\mathcal{T}} \TGVat$ , $ \newcommand{\I}{\mathcal{I}} \I_{{\rm KL}_+}$ and $ \newcommand{\I}{\mathcal{I}} \I_{Z^\perp}$ are lower semi-continuous w.r.t weak Lp convergence (see for instance [6] for $ \newcommand{\TGV}{{\rm TGV}} \newcommand{\T}{\mathcal{T}} \TGV$ in the vector-valued case). Hence we are left to show that any sequence $(u^n)_n $ in $L^{\,p}(\Omega){}^N$ such that

admits an Lp-weakly convergent subsequence. To this aim, we first define the operator $P_{\mathcal{P}^1}:L^{\,p}(\Omega) \rightarrow L^{\,p}(\Omega) $ as

It is easy to see that $P_{\mathcal{P}^1}$ is well-defined and a continuous linear projection. Then, by results in [5, 6], there exists C  >  0 such that

Hence, in order to obtain existence of a weakly convergent subsequence in $L^{\,p}(\Omega){}^N$ , we are left to show boundedness of $P_{\mathcal{P}^1}(u^n)$ . For this, note that since $u^n \in Z^\perp$ , also $P_{\mathcal{P}^1}(u^n) \in Z^\perp$ for all n. Since T is injective on the finite dimensional space $\mathcal{P}^1(\Omega){}^N \cap Z^\perp$ , we get from the equivalence of norms on finite dimensional spaces and the boundedness of $u^n_i - P_{\mathcal{P}^1}(u^n)_i$ in Lp that

for some C  >  0 and the proof is complete. □

5. Numerical realization and examples

In this section we sketch how coupled second order TGV regularization with multiple data discrepancies can be realized numerically and provide some examples at the end. Source code that realizes the outlined algorithmic framework for any number of data terms in dimensions two and three is provided online [24].

With $\Omega_h$ being a discretized pixel grid in $\mathbb{R}^d$ and the discrete vector spaces $U:= \{u:\Omega_h \rightarrow \mathbb{R}^N\}$ and $V:= \{v:\Omega_h \rightarrow (\mathbb{R}^d){}^N\}$ , we consider the following minimization problem.

$\min\limits_{u \in U,\, v \in V} \ \sum\limits_{i \in L_{\rm nr}} \lambda_i \|T_i u_i - f_i\|_2^2 + \sum\limits_{i \in L_{\rm kl}} \lambda_i D_{{\rm KL}}(T_i u_i + c_i, f_i) + \alpha_1 \|\nabla u - v\|_1 + \alpha_0 \|\mathcal{E} v\|_1 + \mathcal{I}_{{\rm KL}_+}(u), \qquad\qquad (52)$

where we now simultaneously minimize over the unknown u and the balancing variable $v$ appearing in the definition of $\mathrm{TGV}_\alpha^2$ . We equip the discrete vector spaces with the standard inner products $(u, s):= \sum\nolimits_{x \in \Omega_h} \sum\nolimits_{i=1}^N u_i(x)s_i(x)$ for $u, s \in U$ and similarly for $V$ , and assume that all involved operators and functionals are now appropriately discretized (see for instance [4, 7] for a detailed explanation of a possible discretization of $\mathrm{TGV}_\alpha^2$ ). In particular, $\nabla :U \rightarrow V$ is a discrete gradient operator and $\mathcal{E}: V \rightarrow W:= \{w:\Omega_h \rightarrow (\mathbb{R}^{d \times d}){}^N\}$ a discrete symmetrized gradient operator, and the terms $\|z\|_1$ for $z \in V$ or $z \in W$ denote discrete L1 norms such that $\|z\|_1 = \sum\nolimits_{x \in \Omega_h} |z(x)|$ with $|\cdot|$ appropriate pointwise norms on $(\mathbb{R}^d){}^N$ or $(\mathbb{R}^{d \times d}){}^N$ that define the coupling of the individual components.
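As a minimal sketch of such a discretization, the following numpy code implements forward-difference versions of $\nabla$ and $\mathcal{E}$ for a single channel on a 2D grid; the multi-channel operators act channel-wise. The concrete stencils are our illustrative choice, see [4, 7] and the source code [24] for the discretization actually used.

```python
import numpy as np

def grad(u):
    """Forward-difference gradient of a 2D image u; returns shape u.shape + (2,),
    with zeros at the boundary (a homogeneous Neumann-type choice)."""
    g = np.zeros(u.shape + (2,))
    g[:-1, :, 0] = u[1:, :] - u[:-1, :]   # difference in the first direction
    g[:, :-1, 1] = u[:, 1:] - u[:, :-1]   # difference in the second direction
    return g

def sym_grad(v):
    """Symmetrized gradient E(v) = (J + J^T) / 2 of a 2D vector field v of
    shape (n1, n2, 2); returns the pointwise 2 x 2 matrices, shape (n1, n2, 2, 2)."""
    jac = np.zeros(v.shape[:2] + (2, 2))
    for k in range(2):                    # row k of the Jacobian: gradient of v_k
        jac[..., k, :] = grad(v[..., k])
    return 0.5 * (jac + np.swapaxes(jac, -1, -2))

u = np.random.rand(8, 8)
v = grad(u)
print(v.shape, sym_grad(v).shape)         # (8, 8, 2) (8, 8, 2, 2)
```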

For simplicity, we assume that $T = (T_1, \ldots, T_N)$ does not vanish on affine functions, hence existence can be ensured without the additional constraint on $Z^\perp$ . Re-writing the discrete minimization problem (52) in the form

$\min\limits_{x} \ F(Kx) + G(x)$
with $Kx = K(u, v) = (\nabla u - v, \mathcal{E} v, Tu)$ , $G=\mathcal{I}_{{\rm KL}_+}$ and F chosen appropriately, in the discrete setting it follows from standard arguments from convex analysis ([14, theorem III.4.1 and proposition III.3.1]) that (52) is equivalent to a saddle-point problem of the form

$\min\limits_{x} \max\limits_{y} \ (Kx, y) + G(x) - F^*(y),$
where $F^*(y^*): = \sup\nolimits_y (y, y^*) - F(y)$ denotes the convex conjugate of F. More explicitly, this yields equivalence of (52) to the saddle-point problem

Equation (53)

where the variables $p, q, r$ are defined in the appropriate image spaces of the operators $\nabla$ , $\mathcal{E}$ and T, respectively, and $(\lambda_i D_{{\rm KL}}){}^*(r_i, f_i) := \sup\nolimits_{s} (r_i, s) - \lambda_i D_{{\rm KL}}(s, f_i)$ denotes the convex conjugate of the Kullback–Leibler term with respect to its first argument. Applying the general algorithm of [10] to this saddle-point reformulation, we arrive at algorithm 1 for obtaining a solution of (53).

Algorithm 1. Scheme of implementation for solving (53).

1: function coupled-reconstruction $(\,f_i)_{i}, (c_i)_i$
2:    Initialize $u, \overline{u}, v, \overline{v}, p, q, r, \sigma, \tau$
3:    Repeat
4:       $p \gets {\rm prox}_{\alpha_1} \big(\,p + \sigma (\nabla \overline{u} - \overline{v})\big)$
5:       $ \newcommand{\symgrad}{\mathcal{E}} \newcommand{\sym}{{\rm Sym}} q \gets {\rm prox}_{\alpha_0} (q + \sigma \symgrad \overline{v})$
6:       $ \newcommand{\nr}{{{\rm nr}}} r_i \gets {\rm prox} _{\|\cdot\|_2 ^2, \lambda_i} (r_i + \sigma T_i \overline{u}_i - \sigma f_i) \quad {\rm for}~i \in L_\nr $
7:       $ \newcommand{\dkl}{D_{{\rm KL}}} \newcommand{\kl}{{{\rm kl}}} r_i \gets {\rm prox} _{\dkl, \lambda_i} (r_i + \sigma T_i \overline{u}_i + \sigma c_i) \quad {\rm for}~i \in L_\kl $
8:       $ \newcommand{\dive}{{\rm div}} \newcommand{\bi}{\boldsymbol} u_+ \gets {\rm prox}_{{\rm KL}_+} \big(u - \tau (-\dive p + T^* r) \big)$
9:       $ \newcommand{\dive}{{\rm div}} v_+ \gets v - \tau (-p - \dive q) $
10:       $ (\overline{u}_+, \overline{v}_+) \gets 2 (u_+, v_+) - (u, v) $
11:       $ (\sigma_+, \tau_+) \gets \mathcal{S}(\sigma, \tau, u_+, v_+, u, v) $
12:       $ (u, \overline{u}) \gets(u_+, \overline{u}_+) $
13:       $ (v, \overline{v}) \gets (v_+, \overline{v}_+) $
14:       $ (\sigma, \tau) \gets (\sigma_+, \tau_+) $
15:    Until Stopping criterion fulfilled
16:    Return u
17: end function

Note that there, the operation ${\rm prox}$ corresponds to proximal mappings which are, for a proper, convex function $g\colon H \to (-\infty, \infty]$ on a Hilbert space H and $\alpha>0$ , defined as

${\rm prox}_{\alpha g}(\hat{x}) = \mathop{\rm argmin}\limits_{x \in H} \ \tfrac{1}{2}\|x - \hat{x}\|^2 + \alpha g(x).$
Here, at any point $x \in \Omega_h$ in the discrete spatial grid and for $i \in \{0, 1\}$ , the mapping ${\rm prox}_{\alpha_i}$ can be given explicitly as

${\rm prox}_{\alpha_i}(\hat{z})(x) = {\rm proj}_{\{z \, :\, |z| \leqslant \alpha_i \}}\big(\hat{z}(x)\big),$
that is, it reduces to a pointwise projection with respect to the pointwise norm. In case $|\cdot|$ is the Frobenius norm (i.e. the square root of the sum of squares of all entries of an element of $(\mathbb{R}^d){}^N$ or $(S^{d\times d}){}^N$ ), the projection is given as $ {\rm proj}_{\{z \, :\, |z| \leqslant \alpha_i \}}(\hat{z}) = \frac{\hat{z}}{\max\{1, |\hat{z}|/\alpha_i\}}$ . In the case that $|\cdot|$ is the nuclear norm on $(\mathbb{R}^d){}^N$ , the computation requires a pointwise SVD of $2\times2$ or $3\times3$ matrices (depending on the spatial dimension and the number of channels) at every point, and efficient code for this is provided in the source code [24]. The other proximal mappings can be given explicitly and again pointwise for any $x \in \Omega_h$ as

The operator $\mathcal{S}(\cdot)$ realizes an adaptive stepsize update and we refer to [26] for more details on the discretization for a particular case and to the source code [24] for more details on the general setup.
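As an illustration of the projection formula above, the following numpy sketch implements the pointwise Frobenius-ball projection used in the updates of p and q, together with the non-negativity projection realizing ${\rm prox}_{{\rm KL}_+}$ in line 8 of algorithm 1; function names, array layouts and the channel indexing are our own and may differ from the reference implementation [24].

```python
import numpy as np

def proj_frobenius_ball(z, alpha):
    """Pointwise projection onto {z : |z| <= alpha} for the Frobenius norm:
    z has shape (n1, n2, ...) with the vector/matrix entries in the trailing
    axes; each grid point is scaled by 1 / max(1, |z(x)| / alpha)."""
    axes = tuple(range(2, z.ndim))
    norm = np.sqrt((np.abs(z) ** 2).sum(axis=axes, keepdims=True))
    return z / np.maximum(1.0, norm / alpha)

def prox_kl_plus(u, kl_channels):
    """prox_{KL_+}: project the channels carrying Kullback-Leibler data terms
    onto the non-negative orthant, leave the remaining channels unchanged;
    u has shape (N, n1, n2)."""
    u = u.copy()
    u[kl_channels] = np.maximum(u[kl_channels], 0.0)
    return u

q = np.random.randn(8, 8, 2, 2)                     # a dual variable as in line 5
print(np.sqrt((proj_frobenius_ball(q, 0.1) ** 2).sum(axis=(2, 3))).max())  # <= 0.1

u = np.random.randn(2, 8, 8)                        # N = 2, channel 1 is PET-like
print(prox_kl_plus(u, [1]).min(axis=(1, 2)))        # second entry is >= 0
```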

5.1. Exemplary numerical results

In this section we provide exemplary numerical results that have been obtained with algorithm 1 described in the previous section and the publicly available source code [24].

In the first experiment, we test the reconstruction of a 2D in vivo coronal knee MRI exam from a clinical MR system. This is an interesting test case for our approach because part of the clinical protocol consists of two acquisition sequences that obtain data for the same orientation but with two different contrasts. In the first sequence, the spin signal from fat tissue is suppressed; in the second sequence, fat tissue is not suppressed. We employ the setting of (52) with N  =  2, $L_{\rm nr} = \{1, 2\}$ and $L_{\rm kl} = \emptyset$ .

A reduced number of k-space lines, at a rate of 4 below the Nyquist limit, was acquired to shorten the scan time. Figure 1 shows results from conventional CG-SENSE parallel imaging [34] and from the jointly TGV-regularized reconstruction. Since no ground truth is available, the parameters for both methods were chosen by visual inspection. As can be seen in the figure, the regularized reconstruction exhibits less noise and fewer artifacts, and anatomical details are better visible.


Figure 1. In vivo multi-contrast knee measurement with a data reduction factor of 4. Joint TGV reconstruction is compared to conjugate gradient SENSE.


In the next experiment, we consider the situation of combining data from two different imaging modalities, which is an interesting problem in the context of recent developments in hybrid PET-MRI systems [35]. This case is particularly instructive because the different imaging physics lead to different noise properties: MR data is corrupted by Gaussian noise, while PET data follows a Poisson process. We first performed a numerical simulation study using a human brain phantom [1], where we have a ground truth available to evaluate our results. The PET acquisition was simulated such that the total number of counts corresponded to a 10 min FDG PET head scan. Radial MR acquisitions with 16 projections were simulated for three commonly used contrasts in brain imaging (MPRAGE, FLAIR, T2 weighted). Acquisition with an 8-channel receive coil was simulated, and the MR k-space data were corrupted by adding complex Gaussian noise. The simulated matrix size was 176  ×  176  ×  30.

We reconstructed the combined dataset by solving (52) in a three-dimensional setting with N  =  4, $ \newcommand{\nr}{{{\rm nr}}} L_\nr = \{1, 2, 3\}$ and $ \newcommand{\kl}{{{\rm kl}}} L_\kl = \{4\}$ , and using nuclear norm coupling [26] of the four channels.

Results of the simulation are shown in figure 2. For reference, we also present results obtained with the current standard methods for reconstructing such data, namely expectation maximization [44] for PET and iterative conjugate gradient (CG) SENSE [34] for MR. Since PET is a quantitative imaging modality, it is important that the quantitative signal values are preserved when coupling the different contrasts and modalities. Table 1 presents the PET signal values (in Bq cm−3) of the ground truth, the conventional reconstruction and the joint reconstruction for gray matter, white matter and cerebrospinal fluid. Our results demonstrate that the coupling improves image quality both in terms of RMSE to the ground truth and in terms of the fidelity of the quantitative signal values.


Figure 2. Numerical simulation of multiple MR contrasts and PET. Joint TGV reconstructions show pronounced reduction of streaking artifacts and higher resolution for PET. This is also reflected in notable reductions of RMSE (displayed at bottom left of each image) to the ground truth.


Table 1. PET signal activity (Bq cm−3) for three different brain tissues.

                Gray matter   White matter   Cerebrospinal fluid
Ground truth       22 990          8450                 0
EM                 17 390         11 470              8097
Joint TGV          20 255         10 368              3193

Finally, we performed an experiment with in vivo PET-MR data from a clinical PET-MR system (figure 3). The PET scan duration was 10 min, and the MR exam was performed using an MPRAGE contrast at an acceleration factor of 4 below the Nyquist limit. The variational reconstruction was obtained by solving (52), again in a three-dimensional setting, with N  =  2, $L_{\rm nr} = \{1\}$ and $L_{\rm kl} = \{2\}$ . Reconstructions using EM and CG-SENSE are again shown for reference. While no ground truth is available, one can see that the joint reconstruction method generally reduces noise and yields a sharper reconstruction of the PET image.


Figure 3. In vivo PET-MRI measurement. Joint TGV reconstruction is compared to EM (PET) and conjugate gradient SENSE (MRI).


Acknowledgments

MH acknowledges support by the Austrian Science Fund (FWF) (Grants J 4112 and P 29192). He would also like to thank the Isaac Newton Institute for Mathematical Sciences for support and hospitality during the program 'Variational Methods and Effective Algorithms for Imaging and Vision', which was supported by EPSRC grant number EP/K032208/1, when part of the work on this paper was undertaken. RH acknowledges support by the Austrian Science Fund (FWF) (Grant P 29192). FK acknowledges support by the National Institutes of Health (NIH) under grant P41 EB017183.
