1 Introduction

In this paper, we present a convex optimization framework for total-variation regularized problems of the following form:

$$\begin{aligned} \begin{aligned} \inf _{ u : \Omega \rightarrow \Gamma } ~&\int _{\Omega } c(x, u_1(x), \ldots , u_k(x)) \, \textrm{d}x \\&+ \sum _{i=1}^k \lambda _i \text {TV}(u_i). \end{aligned} \end{aligned}$$
(1)

The set \(\Gamma = \{ (\gamma _1, \ldots , \gamma _k) \in \textbf{R}^N : \gamma _i \in \Gamma _i, i=1\ldots k \}\) is defined by k individual submanifolds \(\Gamma _i \subset \textbf{R}^{N_i}\) with \(N = N_1 + \cdots + N_k\). The individual \(\Gamma _i\) are required to be bounded subsets of \(\textbf{R}^{N_i}\).

Since the focus of this paper is on imaging applications, we assume \(\Omega \subset \textbf{R}^2\) to be a rectangular domain, but the approach is easily generalized to higher-dimensional or non-rectangular domains.

We make no special assumptions on the cost \(c : \Omega \times \Gamma \rightarrow \textbf{R}_{\ge 0}\) in (1) and allow it to be a general nonnegative nonconvex function. This turns (1) into an overall nonconvex optimization problem, which can be challenging to solve using standard gradient-based methods. Moreover, we do not assume that we are able to compute gradients, projections or proximal operators of the cost function \(c(x, u(x))\). Our approach only requires function evaluations. This allows us to consider degenerate costs that are out of reach for gradient-based approaches.

The regularizer in (1) is a separable total variation regularization \(\text {TV}(u_i)\) on the individual components \(u_i : \Omega \rightarrow \textbf{R}^{N_i}\), weighted by parameters \(\lambda _i > 0\). The total variation (TV) (Rudin et al. 1992; Chambolle et al. 2010) encourages a spatially smooth but edge-preserving solution. It can be defined as follows:

$$\begin{aligned} \text {TV}(u_i)&:= \sup _{\begin{array}{c} p : \Omega \rightarrow \textbf{R}^{N_i \times 2}\\ \Vert p(x) \Vert _* \le 1 \end{array}} \int _{\Omega } \langle {{\,\textrm{Div}\,}}_x p(x), u_i(x) \rangle \, \textrm{d}x \nonumber \\&= \int _{\Omega } \Vert \nabla u_i(x) \Vert \, \textrm{d}x, \end{aligned}$$
(2)

where by \(\nabla u_i(x) \in \textbf{R}^{N_i \times 2}\) we denote the Jacobian matrix and by \(\Vert \cdot \Vert _*\) the dual norm of \(\Vert \cdot \Vert \). The last equality in (2) holds for sufficiently smooth \(u_i\).
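For intuition, the following is a minimal NumPy sketch of a forward-difference discretization of the isotropic total variation of a scalar image (illustrative only; the discretization actually used in this paper is the finite-element scheme of Sect. 3.1):

```python
import numpy as np

def tv_isotropic(u, h=1.0):
    """Isotropic TV of a scalar image u via forward differences (Neumann boundary)."""
    dx = np.diff(u, axis=1, append=u[:, -1:])   # horizontal forward differences
    dy = np.diff(u, axis=0, append=u[-1:, :])   # vertical forward differences
    return h * np.sum(np.sqrt(dx**2 + dy**2))

u = np.zeros((64, 64))
u[:, 32:] = 1.0                  # image with a single vertical edge
print(tv_isotropic(u))           # 64.0: jump height times the length of the edge
```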

The convex relaxation approach we use in this paper works for general convex and nonconvex regularizers which depend on the Jacobian \(\nabla u_i\), see Pock et al. (2009, 2010), Möllenhoff and Cremers (2019), Vogt et al. (2020). However, the main focus of this paper is an efficient implementation of the data cost c, and therefore we consider only the separable total variation (2).

Motivation and applications

To motivate Problem (1), let us consider some practical applications in low-level vision and imaging. One example is the variational estimation of optical flow between two RGB images \(I_1, I_2 : \Omega \rightarrow \textbf{R}^3\), see Horn and Schunck (1981). In that case, \(\Gamma _1 = \Gamma _2 = [a, b] \subset \textbf{R}\) models the displacement between the two images and the cost function is given by a photometric error.

Often, \(\Gamma _i\) is a curved manifold, see, e.g., the applications presented by Lellmann et al. (2013), Weinmann et al. (2014). Examples include \(\Gamma _i = \mathbb {S}^2\) for normal field processing (Lellmann et al. 2013), \(\text {SO}(3)\) for motion estimation (Görlitz et al. 2019) or the circle \(\mathbb {S}^1\) for processing of cyclic data (Cremers and Strekalovskiy 2013; Steinke et al. 2010).

Many real-world applications require estimating multiple quantities jointly. This naturally leads to the product-space formulation considered in (1), where \(\Gamma = \Gamma _1 \times \cdots \times \Gamma _k\). In this case, each \(\Gamma _i\) models one quantity of interest that one aims to estimate.

Prominent approaches to joint optimization problems of this form are alternating procedures such as expectation maximization (Dempster et al. 1977), block-coordinate descent and alternating direction-type methods (Boyd et al. 2011). There, the idea is to estimate a single quantity while holding the other ones fixed. However, these approaches usually depend on a good initialization and can easily get stuck in poor local optima. Therefore, the goal of this paper is to instead consider a convex relaxation of Problem (1). The relaxed problem can then be solved to global optimality with standard proximal methods such as the primal-dual algorithm (Pock and Chambolle 2011). These methods are well parallelizable and can thus be efficiently implemented on GPUs, allowing large-scale problems to be solved in reasonable time.

Contributions The main difficulty with convex approaches to (1) is the large memory requirement inherent to the lifted problem formulation which renders the problem convex. To improve the memory efficiency of such relaxations, two disparate ideas have been considered in previous work: sublabel-accurate liftings (Möllenhoff et al. 2016) and product-space relaxations (Goldluecke et al. 2013). In this paper, we combine both approaches and present a sublabel-accurate implementation of Goldluecke et al. (2013). Unlike previous liftings (Möllenhoff et al. 2016; Möllenhoff and Cremers 2017; Vogt et al. 2020), our approach does not require epigraphical projections and can therefore be applied in a black-box fashion, requiring only evaluations of the cost c.

Our main contribution is a simple way to implement the resulting semi-infinite optimization problem with a cutting-plane method. Moreover, we show that using this method, we can achieve a lower energy than the product-space lifting (Goldluecke et al. 2013) on optical flow estimation and manifold-valued denoising problems.

This journal paper is an extended version of the conference paper (Ye et al. 2021). In particular, we offer the following contributions over the conference version:

  • We have added additional background and explanations on the basics of functional lifting to Sects. 2 and 3.

  • In Sect. 4 we added the detailed update equations for the primal-dual algorithm.

  • We added additional figures and explanations to Sect. 5 to illustrate and provide an intuition of our algorithm on a simple example.

  • We added additional experiments on optical flow and manifold-valued image denoising to Sect. 5 and evaluated our method on a larger set of images.

Overview of the paper

After this introduction, we provide in Sect. 2 an introduction to functional lifting methods for Problem (1), review existing works and explain our contributions relative to them. We summarize the convex relaxation for (1) and its discretization in Sect. 3. The proposed cutting-plane method and the sampling strategy which we use to implement the discretized relaxation are presented in Sect. 4. In Sect. 5 we evaluate our method on a toy problem and several real-world imaging applications. Conclusions are drawn in Sect. 6.

Fig. 1

Traditional optimization versus optimization over a space of probability distributions. In the top row, going from left to right, we illustrate a traditional gradient-based local optimization method on a nonconvex problem of the form (3). Given a bad initialization, the algorithm might get stuck in a poor local optimum. The bottom row illustrates an optimization procedure on the relaxed problem (4). Due to convexity of the objective and search space of probability measures, the solution concentrates at a Dirac distribution centered around a global optimum

2 An Introduction to Functional Lifting

Let us first consider a simplified version of Problem (1) where \(\Omega \) consists only of a single point, i.e., the nonconvex minimization of one data term:

$$\begin{aligned} \min _{\gamma \in \Gamma } ~ c(\gamma _1, \ldots , \gamma _k). \end{aligned}$$
(3)

A well-known approach to the global optimization of (3) is a lifting or stochastic relaxation procedure, which has been considered in diverse fields such as polynomial optimization (Lasserre 2000), continuous Markov random fields (Fix and Agarwal 2014; Peng et al. 2011; Bauermeister et al. 2021), variational methods (Pock et al. 2008), and black-box optimization (de Boer et al. 2005; Ollivier et al. 2017; Schaul 2011). The idea is to relax the search space in (3) from \(\gamma \in \Gamma \) to probability distributions \(\textbf{u}\in \mathcal {P}(\Gamma )\) and solve

$$\begin{aligned} \min _{\textbf{u}\in \mathcal {P}(\Gamma )} \int _{\Gamma } c(\gamma _1, \ldots , \gamma _k) \, \textrm{d}\textbf{u}(\gamma _1, \ldots , \gamma _k). \end{aligned}$$
(4)

Due to linearity of the integral wrt. \(\textbf{u}\) and convexity of the relaxed search space, this is a convex problem for arbitrary cost c. Moreover, the minimizers of (4) concentrate at the optima of c and can hence be identified with solutions to (3). However, if \(\Gamma \) is a continuum, this problem is infinite-dimensional and therefore challenging.

We illustrate the conceptual difference between the formulation (3) and (4) on a one-dimensional example in Fig. 1.

Discrete/traditional multilabeling

In the context of Markov random fields (Ishikawa 2003; Kappes et al. 2013) and multilabel optimization (Chambolle et al. 2012; Lellmann et al. 2009; Lellmann and Schnörr 2011; Zach et al. 2008) one typically discretizes \(\Gamma \) into a finite set of points (called the labels) \(\Gamma = \{ \textbf{v}_1, \ldots , \textbf{v}_{\ell } \}\). This turns (4) into a finite-dimensional linear program \(\min _{\textbf{u}\in \Delta ^\ell }\,\langle c', \textbf{u} \rangle \), where \(c' \in \textbf{R}^\ell _{\ge 0}\) denotes the label cost and \(\Delta ^\ell \subset \textbf{R}^\ell \) is the \((\ell -1)\)-dimensional unit simplex. If we evaluate the cost at the labels, this program upper bounds the continuous problem (3), since instead of all possible solutions one considers only a restricted subset determined by the labels. Since the solution will be attained at one of the labels, a fine meshing is typically needed. Similar to black-box and zero-order optimization methods, this strategy suffers from the curse of dimensionality: when each \(\Gamma _i\) is discretized into \(\ell \) labels, the overall number of labels is \(\ell ^k\), which quickly becomes intractable since many labels are required for a smooth solution. This limits the applicability of the method to practical problems. Additionally, for pairwise or regularizing terms, a large number of dual constraints often has to be implemented. In that context, the work of Lellmann et al. (2013) considers a constraint pruning strategy as an offline preprocessing.
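As an illustration of this finite-dimensional linear program, consider a hypothetical one-dimensional cost discretized with \(\ell = 5\) labels; since a linear objective over the simplex attains its minimum at a vertex, the minimizer is a one-hot vector located at one of the labels (the cost and labels below are assumptions for this sketch):

```python
import numpy as np

# Hypothetical nonconvex cost on Gamma = [0, 1], a truncated quadratic around 0.37.
c = lambda gamma: np.minimum((gamma - 0.37)**2, 0.05)

labels = np.linspace(0.0, 1.0, 5)          # ell = 5 labels v_1, ..., v_5
c_label = c(labels)                         # label cost vector c' in R^ell

u = np.zeros_like(c_label)                  # minimizer of <c', u> over the simplex
u[np.argmin(c_label)] = 1.0                 # one-hot at the cheapest label
print(labels[np.argmin(c_label)])           # 0.25: accurate only up to the label grid
```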

Sublabel-accurate multilabeling

The discrete-continuous MRF (Fix and Agarwal 2014; Zach 2013; Zach and Kohli 2012) and lifting methods (Laude et al. 2016; Möllenhoff et al. 2016; Möllenhoff and Cremers 2017) attempt to find a more label-efficient convex formulation. These approaches can be understood through duality (Fix and Agarwal 2014; Möllenhoff and Cremers 2017). Applied to (3), the idea is to replace the cost \(c : \Gamma \rightarrow \textbf{R}\) with a dual variable \(\textbf{q}: \Gamma \rightarrow \textbf{R}\):

$$\begin{aligned}&\min _{\textbf{u}\in \mathcal {P}(\Gamma )} \sup _{\textbf{q}: \Gamma \rightarrow \textbf{R}} ~ \int _{\Gamma } \textbf{q}(\gamma _1, \ldots , \gamma _k) \, \textrm{d}\textbf{u}(\gamma _1, \ldots , \gamma _k), \nonumber \\&~~\text {s.t.} ~\textbf{q}(\gamma ) \le c(\gamma ) ~\text { for all } \gamma \in \Gamma . \end{aligned}$$
(5)

The inner supremum in the formulation (5) maximizes the lower-bound \(\textbf{q}\). Additionally, if the dual variable is sufficiently expressive, this problem is actually equivalent to (4).

Approximating \(\textbf{q}\), for example with piecewise linear functions on \(\Gamma \), one arrives at a lower-bound to the nonconvex problem (3). It has been observed in a recent series of works (Laude et al. 2016; Möllenhoff et al. 2016; Möllenhoff and Cremers 2017; Vogt et al. 2020; Zach and Kohli 2012) that piecewise linear dual variables can lead to smooth solutions even when \(\textbf{q}\) (and therefore also \(\textbf{u}\)) is defined on a rather coarse mesh. As remarked by Fix and Agarwal (2014), Laude et al. (2016), Möllenhoff et al. (2016), for an affine dual variable this strategy corresponds to minimizing the convex envelope of the cost, \(\min _{\gamma \in \Gamma } c^{**}(\gamma )\), where \(c^{**}\) denotes the Fenchel biconjugate of c.

The implementation of the constraints in (5) can be challenging even in the case of piecewise-linear \(\textbf{q}\). This is partly due to the fact that Problem (5) is a semi-infinite optimization problem (Blankenship and Falk 1976), i.e., an optimization problem with infinitely many constraints. The works (Möllenhoff et al. 2016; Zach and Kohli 2012) implement the constraints via projections onto the epigraph of the (restricted) conjugate function of the cost within a proximal optimization framework. Such projections are only available in closed form for some choices of c and expensive to compute if the dimension is larger than one (Laude et al. 2016). This limits the applicability in a “plug-and-play” fashion.

Product-space liftings

The product-space lifting approach (Goldluecke et al. 2013) attempts to overcome the aforementioned exponential memory requirements of labeling methods in an orthogonal way to the sublabel-based methods. The main idea is to exploit the product-space structure in (1) and optimize over k marginal distributions of the probability measure \(\textbf{u}\in \mathcal {P}(\Gamma )\), which we denote by \(\textbf{u}_i \in \mathcal {P}(\Gamma _i)\). Applying (Goldluecke et al. 2013) to the single data term (3) one arrives at the following relaxation:

$$\begin{aligned}&\min _{ \{ \textbf{u}_i \in \mathcal {P}(\Gamma _i) \} } \sup _{ \{ \textbf{q}_i : \Gamma _i \rightarrow \textbf{R}\}} \sum _{i=1}^k \int _{\Gamma _i} \textbf{q}_i(\gamma _i) \, \textrm{d}\textbf{u}_i(\gamma _i) \nonumber \\&~~ ~ \text { s.t. } \sum _{i=1}^k \textbf{q}_i(\gamma _i) \le c(\gamma ) ~~ \text { for all } \gamma \in \Gamma . \end{aligned}$$
(6)

Since one only has to discretize the individual \(\Gamma _i\) this substantially reduces the memory requirements from \(\mathcal {O}(\ell ^N)\) to \(\mathcal {O}(\sum _{i=1}^k \ell ^{N_i})\). While at first glance it seems that the curse of dimensionality is lifted, the difficulty is moved to the dual, where we still have a large (or even infinite) number of constraints. A global implementation of the constraints with Lagrange multipliers as proposed in Goldluecke et al. (2013) again leads to the same exponential dependency on the dimension.
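For instance, with \(k=2\), \(N_1 = N_2 = 1\) (as in optical flow) and, say, \(\ell = 64\) labels per component, the full lifting requires \(\ell ^N = 64^2 = 4096\) unknowns for the discretized measure \(\textbf{u}\), whereas the product-space lifting only requires \(\ell ^{N_1} + \ell ^{N_2} = 64 + 64 = 128\).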

As a side note, readers familiar with optimal transport may notice that the supremum in (6) is a multi-marginal transportation problem (Carlier 2003; Villani 2008) with transportation cost c. This view is mentioned by Bach (2019) where relaxations of form (6) are analyzed under submodularity assumptions.

In summary, the sublabel-accurate lifting methods, discrete-continuous MRFs (Zach and Kohli 2012; Möllenhoff et al. 2016) and product-space liftings (Goldluecke et al. 2013) all share a common difficulty: implementation of an exponential or even infinite number of constraints on the dual variables.

Summary of our contribution

Our main contribution is a simple way to implement the dual constraints in an online fashion with a random sampling strategy which we present in Sect. 4. This allows a black-box implementation, which only requires an evaluation of the cost c and no epigraphical projection operations as in Möllenhoff et al. (2016), Zach and Kohli (2012). Moreover, the sampling approach allows us to propose and implement a sublabel-accurate variant of the product-space relaxation (Goldluecke et al. 2013) which we describe in the following section.

3 Product-Space Relaxation

Our starting point is the convex relaxation of (1) presented in Goldluecke et al. (2013), Strekalovskiy et al. (2014). In these works, \(\Gamma _i \subset \textbf{R}\) is chosen to be an interval. We first denote the Lagrangian as:

$$\begin{aligned}&\mathcal {L}( \{ \textbf{u}_i \} , \{ \textbf{q}_i \}, \{ \textbf{p}_i \} ) = \sum _{i=1}^k \int _{\Omega } \int _{\Gamma _i} \textbf{q}_i(x, \gamma _i) \end{aligned}$$
(7)
$$\begin{aligned}&\qquad - {{\,\textrm{Div}\,}}_x \textbf{p}_i(x, \gamma _i) \, \, \textrm{d}\textbf{u}_{i}^x(\gamma _i) \, \textrm{d}x. \end{aligned}$$
(8)

Following Vogt et al. (2020) we consider a generalization to manifolds \(\Gamma _i \subset \textbf{R}^{N_i}\) which leads us to the following relaxation:

$$\begin{aligned}&\min _{ \{ \textbf{u}_i : \Omega \rightarrow \mathcal {P}(\Gamma _i) \} } \sup _{ \begin{array}{c} \{ \textbf{q}_i : \Omega \times \Gamma _i \rightarrow \textbf{R}\} \\ \{ \textbf{p}_i : \Omega \times \Gamma _i \rightarrow \textbf{R}^2 \} \end{array} } \mathcal {L}( \{ \textbf{u}_i \} , \{ \textbf{q}_i \}, \{ \textbf{p}_i \} ), \nonumber \\&\text {s.t. }~ \Vert P_{T_{\gamma _i}}\nabla _{\gamma _i} \textbf{p}_i(x, \gamma _i) \Vert _* \le \lambda _i, ~ \forall i, x, \gamma _i, \end{aligned}$$
(9)
$$\begin{aligned}&\qquad \sum _{i=1}^k \textbf{q}_i(x, \gamma _i) \le c(x, \gamma ), ~ \forall x, \gamma \end{aligned}$$
(10)

This relaxation appears similar to (6) from the previous section, but there are two differences. First, we now have marginal distributions \(\textbf{u}_i(x)\) for every \(x \in \Omega \) since we no longer consider only a single data term. The notation \(\, \textrm{d}\textbf{u}_i^x\) in (8) denotes the integration against the probability measure \(\textbf{u}_i(x) \in \mathcal {P}(\Gamma _i)\). The variables \(\textbf{q}_i\) play the same role as in (6) and lower-bound the cost under constraint (10). The second difference is the introduction of additional dual variables \(\textbf{p}_i\) and the term \(-{{\,\textrm{Div}\,}}_x \textbf{p}_i\) in (8). Together with the constraint (9), this can be shown to implement the total variation regularization as in Lellmann et al. (2013), Vogt et al. (2020). Following Vogt et al. (2020), the derivative \(\nabla _{\gamma _i} \textbf{p}_i(x, \gamma _i)\) in (9) denotes the \((N_i \times 2)\)-dimensional Jacobian considered in the Euclidean sense and \(P_{T_{\gamma _i}}\) the projection onto the tangent space of \(\Gamma _i\) at the point \(\gamma _i\).

To build intuition for how (8) and (9) implement the total variation regularization, consider the case where the labeling variable \(\textbf{u}_i^x = \delta _{u(x)}\) is a Dirac measure at every point x. For simplicity, we consider \(k=1\). The term in (8) then simplifies to

$$\begin{aligned}&\int _{\Omega } -{{\,\textrm{Div}\,}}_x \textbf{p}(x, u(x)) \, \, \textrm{d}x \end{aligned}$$
(11)
$$\begin{aligned}&\quad = \int _{\Omega } \langle \nabla _{\gamma } \textbf{p}(x, u(x)), \nabla u(x) \rangle \, \, \textrm{d}x, \end{aligned}$$
(12)

which follows by applying the chain-rule and the fact that \(\textbf{p}\) has compact support. Finally, taking a point-wise supremum over \(\textbf{p}\) inside the integral in (12) under the dual-norm constraint (9) gives us the total variation of u: \(\int _{\Omega } \Vert \nabla u(x) \Vert \, \, \textrm{d}x\), which is the same as defined in (2).

3.1 Finite-Element Discretization

We approximate the infinite-dimensional problem (8) by restricting \(\textbf{u}_i\), \(\textbf{p}_i\) and \(\textbf{q}_i\) to finite-element functions on a discrete meshing of \(\Omega \times \Gamma _i\). The considered discretization is a standard finite-element approach and largely follows the work from Vogt et al. (2020). Unlike the forward-differences considered in Vogt et al. (2020) we use lowest-order Raviart–Thomas elements (see, e.g., Caillaud and Chambolle 2020, Section 5) in \(\Omega \), which are specifically tailored towards the considered total variation regularization.

Discrete mesh

We approximate each \(d_i\)-dimensional manifold \(\Gamma _i \subset \textbf{R}^{N_i}\) with a simplicial manifold \(\Gamma _i^h\), given by the union of a collection of \(d_i\)-dimensional simplices \(\mathcal {T}_{i}\). We denote the number of vertices (“labels”) in the triangulation of \(\Gamma _i\) as \(\ell _i\). The set of labels is denoted by \(\mathcal {L}_i = \{ \textbf{v}_{i,1}, \ldots , \textbf{v}_{i,\ell _i} \}\). As assumed, \(\Omega \subset \textbf{R}^2\) is a rectangle which we split into a set of faces \(\mathcal {F}\) of edge-length \(h_x\) with edge set \(\mathcal {E}\). The number of faces and edges are denoted by \(F = |\mathcal {F}|\), \(E = |\mathcal {E}|\).

Data term and the \(\textbf{u}_i\), \(\textbf{q}_i\) variables

We assume the cost \(c : \Omega \times \Gamma \rightarrow \textbf{R}_{\ge 0}\) is constant in \(x \in \Omega \) on each face and denote its value as \(c(x(f), \gamma )\) for \(f \in \mathcal {F}\), where \(x(f) \in \Omega \) denotes the midpoint of the face f. Similarly, we also assume the variables \(\textbf{u}_i\) and \(\textbf{q}_i\) to be constant in \(x \in \Omega \) on each face but continuous piecewise linear functions in \(\gamma _i\). They are represented by coefficient functions \(\textbf{u}_i^h, \textbf{q}_i^h \in \textbf{R}^{F \cdot \ell _i}\), i.e., we specify the values on the labels and linearly interpolate in between. This is done by the interpolation operator \(\textbf{W}_{i, f, \gamma _i} : \textbf{R}^{F \cdot \ell _i} \rightarrow \textbf{R}\) which, given an index \(1 \le i \le k\), a face f, and a (continuous) label position \(\gamma _i \in \Gamma _i\), computes the function value based on barycentric coordinates: \(\textbf{W}_{i, f, \gamma _i} \textbf{u}_i^h = \textbf{u}_i(x(f), \gamma _i)\). Note that after discretization, \(\textbf{u}_i\) is only defined on \(\Gamma _i^h\) but we can uniquely associate to each \(\gamma _i \in \Gamma _i^h\) a point on \(\Gamma _i\).
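A minimal sketch of such an interpolation operator for the one-dimensional case \(\Gamma _i = [0,1]\) (the general case uses barycentric coordinates on the simplices of \(\Gamma _i^h\); all names and values here are illustrative):

```python
import numpy as np

def interp_pl(coeffs, labels, gamma):
    """Evaluate the piecewise-linear function with nodal values `coeffs` on the
    sorted 1D label grid `labels` at a continuous (sublabel) position `gamma`."""
    j = np.clip(np.searchsorted(labels, gamma, side="right") - 1, 0, len(labels) - 2)
    w = (gamma - labels[j]) / (labels[j + 1] - labels[j])   # barycentric weight in [0, 1]
    return (1.0 - w) * coeffs[j] + w * coeffs[j + 1]

labels = np.linspace(0.0, 1.0, 5)             # ell_i = 5 labels
q_f = np.array([2.0, 1.0, 0.5, 1.0, 2.0])     # coefficients of q_i^h on one face f
print(interp_pl(q_f, labels, 0.3))            # 0.9, i.e., the value W_{i,f,0.3} q_i^h
```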

Divergence and \(\textbf{p}_i\) variables

Our variable \(\textbf{p}_i\) is represented by coefficients \(\textbf{p}_i^h \in \textbf{R}^{E \cdot \ell _i}\) which live on the edges in \(\Omega \) and the labels in \(\Gamma _i\). The vector \(\textbf{p}_i(x, \gamma _i) \in \textbf{R}^2\) is obtained by linearly interpolating the coefficients on the vertical and horizontal edges of the face and using the interpolated coefficients to evaluate the piecewise-linear function on \(\Gamma _i^h\). Under this approximation, the discrete divergence \({{\,\textrm{Div}\,}}_x^h : \textbf{R}^{E \cdot \ell _i} \rightarrow \textbf{R}^{F \cdot \ell _i}\) is given by \(({{\,\textrm{Div}\,}}_x^h \textbf{p}_i^h)(f) = \left( \textbf{p}_i^h(e_r) + \textbf{p}_i^h(e_t) - \textbf{p}_i^h(e_l) - \textbf{p}_i^h(e_b) \right) / h_x\) where \(e_r, e_t, e_l, e_b\) are the right, top, left and bottom edges of f, respectively.
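The following sketch evaluates this divergence for a single label layer, assuming the edge coefficients are stored in two arrays for vertical and horizontal edges (this storage layout is an assumption of the sketch, not our actual data structure):

```python
import numpy as np

def div_h(p_lr, p_bt, hx=1.0):
    """Discrete divergence on an (ny, nx) grid of faces.
    p_lr: (ny, nx+1) coefficients on vertical edges (left/right of each face),
    p_bt: (ny+1, nx) coefficients on horizontal edges (bottom/top of each face)."""
    return ((p_lr[:, 1:] - p_lr[:, :-1]) + (p_bt[1:, :] - p_bt[:-1, :])) / hx

ny, nx = 4, 4
p_lr = np.random.randn(ny, nx + 1)
p_bt = np.random.randn(ny + 1, nx)
print(div_h(p_lr, p_bt).shape)    # (4, 4): one divergence value per face f
```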

Total variation constraint

Computing the operator \(P_{T_{\gamma _i}} \nabla _{\gamma _i}\) is largely inspired by Vogt et al. (2020), Section 2.2. It is implemented by a linear map \(\textbf{D}_{i, f, \alpha , t} : \textbf{R}^{E \cdot \ell _i} \rightarrow \textbf{R}^{d_i \times 2}\). Here, \(f \in \mathcal {F}\) and \(\alpha \in [0,1]^2\) correspond to a point \(x \in \Omega \) while \(t \in \mathcal {T}_i\) is the simplex containing the point corresponding to \(\gamma _i \in \Gamma _i\). First, the operator computes coefficients in \(\textbf{R}^{\ell _i}\) of two piecewise-linear functions on the manifold by linearly interpolating the values on the edges based on the face index \(f \in \mathcal {F}\) and \(\alpha \in [0,1]^2\). For each function, the derivative in simplex \(t \in \mathcal {T}_i\) on the triangulated manifold is given by the gradient of an affine extension. Projecting the resulting vector onto the \(d_i\)-dimensional tangent space for both functions leads to a \(d_i \times 2\)-matrix which approximates \(P_{T_{\gamma _i}} \nabla _{\gamma _i} \textbf{p}_i(x, \gamma _i)\).

Final discretized problem

Plugging our discretized \(\textbf{u}_i\), \(\textbf{q}_i\), \(\textbf{p}_i\) into (8), we arrive at the following finite-dimensional optimization problem:

$$\begin{aligned}&\min _{\{\textbf{u}_i^h \in \textbf{R}^{F \cdot \ell _i} \}} \max _{\begin{array}{c} \{\textbf{p}_i^h \in \textbf{R}^{E \cdot \ell _i} \},\\ \{\textbf{q}_i^h \in \textbf{R}^{F \cdot \ell _i} \} \end{array}} h_x^2 \cdot \sum _{i=1}^k \langle \textbf{u}_i^h, \textbf{q}_i^h - {{\,\textrm{Div}\,}}_x^h \textbf{p}_i^h \rangle \nonumber \\&\qquad + \sum _{i=1}^k \sum _{f \in \mathcal {F}} \textbf{i}\{\textbf{u}_i^h(f) \in \Delta ^{\ell _i}\}, \end{aligned}$$
(13)
$$\begin{aligned}&\text {s.t. } ~~ \Vert \textbf{D}_{i, f, \alpha , t} \textbf{p}_i^h \Vert _* \le \lambda _i, \nonumber \\&\qquad \qquad \forall i \in [k], f \in \mathcal {F}, \alpha \in \{ 0, 1 \}^2, t \in \mathcal {T}_i, \end{aligned}$$
(14)
$$\begin{aligned}&\sum _{i=1}^k \textbf{W}_{i, f, \gamma _i}\textbf{q}_i^h \le c \left( x(f), \gamma \right) , \forall f \in \mathcal {F}, \gamma \in \Gamma , \end{aligned}$$
(15)

where \(\textbf{i}\{\cdot \}\) is the indicator function. In our applications, we found that it is sufficient to enforce the constraint (14) at the corners of each face which corresponds to choosing \(\alpha \in \{ 0, 1 \}^2\). Apart from the infinitely many constraints in (15), this is a finite-dimensional convex-concave saddle-point problem and can be tackled by many numerical optimization algorithms.

3.2 Solution Recovery

Before presenting in the next section how we propose to implement the constraints (15), we briefly discuss how a primal solution \(\{ \textbf{u}_i^h \}\) of the above problem is turned into an approximate solution to (1). To that end, we follow Lellmann et al. (2013), Vogt et al. (2020) and compute the Riemannian center of mass via an iteration \(\tau =1,\dots ,T\):

$$\begin{aligned} V^\tau _{j}&= \log _{u_i^\tau }(\textbf{v}_{i,j}), \end{aligned}$$
(16)
$$\begin{aligned} v^\tau&= \sum _{j=1}^{\ell _i} ~ \textbf{u}_i^h(f, j) V^\tau _j, \end{aligned}$$
(17)
$$\begin{aligned} \textbf{u}_i^{\tau +1}&= \exp _{u_i^\tau }(v^\tau ). \end{aligned}$$
(18)

Here, \(u_i^0 \in \Gamma _i\) is initialized by the label with the highest probability according to \(\textbf{u}_i^h(f, \cdot )\). \(\log _{u_i^\tau }\) and \(\exp _{u_i^\tau }\) denote the logarithmic and exponential maps between \(\Gamma _i^h\) and its tangent space at \(u_i^\tau \in \Gamma _i\), which are both available in closed form for the manifolds we consider here. In our case, \(T=20\) iterations were enough to reach convergence. For flat manifolds, \(T=1\) suffices, as the logarithmic and exponential maps reduce to vector subtraction and addition, so that (18) computes a weighted Euclidean mean in a single step.
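For illustration, a minimal sketch of the recovery (16)–(18) for the circle \(\Gamma _i = \mathbb {S}^1\), parametrized by angles, where the logarithmic map is the angle difference wrapped to \([-\pi , \pi )\) and the exponential map is addition of angles (a standard parametrization; the code below is a sketch, not our implementation):

```python
import numpy as np

def log_s1(u, v):
    """Logarithmic map on S^1 (angles): signed angle from u to v, wrapped to [-pi, pi)."""
    return np.mod(v - u + np.pi, 2.0 * np.pi) - np.pi

def riemannian_mean_s1(labels, weights, T=20):
    """Weighted Riemannian center of mass (16)-(18) of circle-valued labels (angles)."""
    u = labels[np.argmax(weights)]           # initialize at the most probable label
    for _ in range(T):
        V = log_s1(u, labels)                # (16): lift all labels to the tangent space at u
        v = np.sum(weights * V)              # (17): weighted average in the tangent space
        u = np.mod(u + v, 2.0 * np.pi)       # (18): exponential map back onto S^1
    return u

labels = np.array([0.1, 6.2, 3.0])           # label angles in [0, 2*pi)
weights = np.array([0.5, 0.45, 0.05])        # marginal weights u_i^h(f, .)
print(riemannian_mean_s1(labels, weights))   # ~0.16: stays near the cluster at 0 = 2*pi
```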

In general, there is no theory which shows that \(u^T(x)=(u_1^T(x), \ldots , u_k^T(x))\) from (18) is a global minimizer of (1). Tightness of the relaxation in the special case \(k = 1\) and \(\Gamma \subset \textbf{R}\) is shown in Pock et al. (2010). For higher-dimensional \(\Gamma \), the tightness of related relaxations is ongoing research; see Ghoussoub et al. (2021) for results on the Dirichlet energy. By computing a posteriori optimality gaps, solutions of (8) were shown to be typically near the global optimum of Problem (1); see, e.g., Goldluecke et al. (2013).

4 Implementation of the Constraints

Though the optimization variables in (13) are finite-dimensional, the energy is still difficult to optimize because of the infinitely many constraints in (15).

Before we present our approach, let us first describe what we refer to as the baseline method for the remainder of this paper. As the baseline, we consider the direct solution of (13) where we implement the constraints only at the label/discretization points \(\mathcal {L}_1 \times \cdots \times \mathcal {L}_k\) via Lagrange multipliers [this strategy is also employed by the global variant of the product-space approach (Goldluecke et al. 2013)]. This baseline corresponds to a single outer iteration (\(N_{it}=1\)) of the proposed Algorithm 1 with a large number \(M_{it}\) of inner iterations.

Algorithm 1

We aim for a framework that allows for solving a better approximation of (15) than the baseline while being of similar memory complexity. To this end, Algorithm 1 alternates the following two steps:

(1) Sampling Based on the current solution, we prune previously considered but now feasible constraints and sample a new subset of the infinitely many constraints in (15). From all currently sampled constraints, we keep the most violated ones for each face, add one sample at the current solution and discard the rest.

(2) Solving the subsampled problem Considering the current finite subset of constraints, we solve Problem (13) using a primal-dual method (Chambolle and Pock 2011) with diagonal preconditioning (Pock and Chambolle 2011). Both constraints (14) and (15) are implemented using Lagrange multipliers.

These two phases are performed in an alternating fashion, with the aim of eventually approaching the solution of the semi-infinite problem (13)–(15). In practice, a fixed number of outer iterations \(N_{it}\) is set. While we do not prove convergence of the overall algorithm, convergence results for related procedures exist; see, e.g., Blankenship and Falk (1976), Theorem 2.4.

The detailed procedure is given in Algorithm 1. The cost matrix \(\textbf{C}\) is constructed by evaluating \(c(x(f), \gamma )\) at the proposed samples \(\mathcal {S}_f\). We denote by \(\xi \) and \(\nu \) the Lagrange multipliers. The multiplier \(\xi \) is initialized by a warm-start strategy, i.e., \(\xi ^{it}\) is kept if the same sample was already proposed in the previous outer iteration. The \(\text {prox}\) of a function g with step size \(\tau \) is defined as:

$$\begin{aligned} \text {prox}_{\tau g}(x) = {{\,\mathrm{arg\,min}\,}}_y ~ \frac{1}{2\tau } \Vert x-y \Vert ^2 + g(y). \end{aligned}$$
(19)
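For example, the prox of the simplex indicator \(\textbf{i}\{\textbf{u}_i^h(f) \in \Delta ^{\ell _i}\}\) appearing in (13) is the Euclidean projection onto the unit simplex, for which a standard sort-based routine can be used; a minimal sketch (illustrative only):

```python
import numpy as np

def proj_simplex(x):
    """prox of the indicator of the unit simplex, i.e., the Euclidean projection."""
    u = np.sort(x)[::-1]                       # sort in descending order
    css = np.cumsum(u)
    rho = np.nonzero(u * np.arange(1, len(x) + 1) > css - 1.0)[0][-1]
    theta = (css[rho] - 1.0) / (rho + 1.0)     # threshold such that the result sums to 1
    return np.maximum(x - theta, 0.0)

print(proj_simplex(np.array([0.7, 0.6, -0.2])))   # [0.55 0.45 0.  ]
```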

Our constraint sampling strategy is detailed in Algorithm 2. For each face in \(\mathcal {F}\), it generates a finite set of “sublabels” \(\mathcal {S}_f \subset \Gamma \) at which we implement the constraints (15). Next, we provide the motivation behind each line in the algorithm.

Algorithm 2
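The following is a minimal Python sketch of this sampling routine for a one-dimensional label space \(\Gamma = [lo, hi]\), with comments marking the lines referred to below; the helper names and the `violation` callable (which evaluates \(\sum _i \textbf{q}_i(\gamma _i) - c(x(f), \gamma )\) for the current dual variables) are assumptions of this sketch, not our actual interface:

```python
import numpy as np

def sample_constraints(violation, u_cur, old_samples, lo, hi,
                       n=10, r=1, delta=0.1, rng=np.random.default_rng(0)):
    """Sampling strategy (Algorithm 2) for a single face; violation(g) > 0 means the
    constraint sum_i q_i(g) <= c(x(f), g) is violated at the sublabel g."""
    new = rng.uniform(lo, hi, size=n)                                  # Line 1: uniform samples
    new = np.append(new, np.clip(u_cur + delta * rng.standard_normal(n), lo, hi))  # Line 2: local perturbation
    cand = np.append(old_samples, new)                                 # pool old and new samples
    viol = np.array([violation(g) for g in cand])
    cand, viol = cand[viol > 0], viol[viol > 0]                        # Line 3: prune feasible constraints
    keep = cand[np.argsort(viol)[::-1][:r]]                            # Line 4: keep the r most violated
    return np.append(keep, u_cur)                                      # Line 5: always sample at u

# Hypothetical 1D face: current lower bound q(g) = 0.5, cost c(g) = (g - 0.3)^2 + 0.45
viol = lambda g: 0.5 - ((g - 0.3)**2 + 0.45)
print(sample_constraints(viol, u_cur=0.25, old_samples=np.array([0.0, 1.0]), lo=0.0, hi=1.0))
```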

Random uniform sampling (Line 1) To obtain a global view of the cost function, we consider uniform sampling on the label space \(\Gamma \). The parameter \(n > 0\) determines the number of samples per face.

Local perturbation around the mean (Line 2) Besides the global information, we apply local perturbation around the current solution \(\textbf{u}\). In case the current solution is close to the optimal one, this strategy allows us to refine it with these samples. The parameter \(\delta > 0\) determines the size of the local neighbourhood. In our experiments, we always used a Gaussian perturbation with \(\delta =0.1\).

Pruning strategy (Lines 3–4) Most samples from previous iterations are discarded because the corresponding constraints are already satisfied. We prune all currently feasible constraints as in Blankenship and Falk (1976). Similarly, the two random sampling strategies (Lines 1 and 2) might return some samples for which the constraints are already fulfilled. Therefore, we only consider the samples with violated constraints and pick the r most violated among them. This pruning strategy is essential for a memory-efficient implementation, as shown later.

Sampling at u (Line 5) Finally, we add one sample which is exactly at the current solution \(u \in \Gamma \) to have at least one guaranteed sample per face. In the next section, we illustrate the behavior of Algorithm 1 on a toy problem, and evaluate its performance on real-world imaging problems.

5 Numerical Validation

Our approach and the baseline are implemented in PyTorch. Code for reproducing the following experiments can be found here: https://github.com/zhenzhangye/sublabel_meets_product_space. Note that one of the runtime bottlenecks of our sampling strategy is creating the samples and picking the r most violated ones, as shown in Algorithm 2. Additionally, the sparse matrix operations and the PDHG updates can be implemented more efficiently in CUDA, since all PDHG updates can be executed in a single CUDA kernel instead of PyTorch’s multiple kernel calls. Therefore, a specialized implementation as in Goldluecke et al. (2013) can be expected to improve the runtime by a factor of \(10-100\times \).

5.1 Illustration of the Algorithm

First of all, we consider a simple minimization problem with a single nonconvex data term:

$$\begin{aligned} c(u) = {\left\{ \begin{array}{ll} -4 u + 2.4, &{} u \in [0.1, 0.35),\\ 4 u - 0.4, &{} u \in [0.35, 0.6] ,\\ 2, &{}\text {otherwise}, \end{array}\right. } \end{aligned}$$
(20)

with 5 labels, to illustrate the behavior of both the baseline algorithm and our sampling strategies.
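A direct implementation of the toy data term (20), assuming the 5 labels are placed uniformly on \([0,1]\) (used only for this illustration):

```python
import numpy as np

def c_toy(u):
    """Nonconvex toy cost (20): a V-shaped valley with its minimum c(0.35) = 1 at u = 0.35."""
    return np.where((u >= 0.1) & (u < 0.35), -4.0 * u + 2.4,
           np.where((u >= 0.35) & (u <= 0.6), 4.0 * u - 0.4, 2.0))

labels = np.linspace(0.0, 1.0, 5)   # the 5 labels used in this example
print(c_toy(labels))                # [2.  1.4 1.6 2.  2. ]: no label attains the minimum
```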

Fig. 2

Illustration of the baseline algorithm. a Five samples (red dots) from the labels are considered. b The dual variable \(\textbf{q}\) satisfies the constraints on the samples. However, \(\textbf{q}\) is not globally optimal as it violates the constraint on the optimal solution (“green > blue”) (Color figure online)

Fig. 3

Benefit of sampling at u (Line 5, Algorithm 2). a Because all of the proposed samples (gray dots) fulfill the constraint, they are all pruned (Line 3, Algorithm 2). b The updated \(\textbf{q}\) deteriorates because the subproblem is unconstrained in it. c At least one sample is taken (namely \(u^{it}\), red dot). d This sample constrains \(\textbf{q}\) and thus prevents a degenerate solution (Color figure online)

Figure 2 depicts the baseline’s behavior. Since it only evaluates the energy at the labels, five samples are considered, as illustrated by the red dots in Fig. 2a. Figure 2b shows the dual variable \(\textbf{q}_i^h\) after a number of iterations. Since the algorithm maximizes \(\textbf{q}_i^h\), the green and red dots should coincide (e.g., at the second label) at convergence. Despite convergence, the resulting \(\textbf{q}^h\) significantly violates the constraints close to the optimal solution (“green > blue”). Therefore, to attain the globally optimal solution, the baseline approach needs more labels, which requires more memory.

The motivation of random uniform sampling (Line 1, Algorithm 2) and local perturbation around the mean (Line 2, Algorithm 2) in our sampling strategy is intuitively clear. However, as demonstrated later in the experiment with a truncated quadratic energy, sampling at u (Line 5, Algorithm 2) is critical for the stability of our method. The comparison in Fig. 3 shows the necessity of this strategy. We ran two experiments with identical settings except for the sampling at u. After a given number of iterations, the dual variable \(\textbf{q}^h\) is approximately optimal, as indicated in Fig. 3a. Our pruning strategy (Line 3, Algorithm 2) removes all of the proposed samples since all of them satisfy the constraints. As a result, the subproblem becomes unconstrained in the dual variable \(\textbf{q}^h\) and its update is no longer meaningful, Fig. 3b. To solve this problem, we propose to always keep at least one sample at u, even when \(\textbf{q}^h\) is nearly optimal, cf. Fig. 3c. As illustrated in Fig. 3d, this avoids the degeneration of \(\textbf{q}^h\) as it remains constrained.

Fig. 4

Illustration of our proposed sampling strategies. a Two samples (red dots) are considered, leading to the shown optimal dual variable \(\textbf{q}\) after running primal-dual iterations. b The two samples are pruned because their constraints are satisfied. Several random samples are proposed (gray dots) and only one of them is kept (red dot). c One more sample at \(u^{it}\) is added and \(\textbf{q}\) is refined (Color figure online)

Finally, the complete sampling strategy is illustrated in Fig. 4. As shown in Fig. 4a, the primal-dual method obtains the optimal \(\textbf{q}^h\) for the sampled subproblem. Our sampling strategy then provides the necessary samples and prunes the feasible ones, cf. Fig. 4b. These few but meaningful samples lead \(\textbf{q}^h\) to global optimality, cf. Fig. 4c.

5.2 Truncated Quadratic Energy

In this section, we study the numerical effect of each line in Algorithm 2. We evaluate our method on the truncated quadratic energy \(c(x, u(x)) = \min \{(u(x) - f(x))^2, \nu \}\), where \(f : \Omega \rightarrow \textbf{R}\) is the input data as shown in Fig. 5. For this specific experiment, we generate a \(64 \times 64\) grayscale image degraded with Gaussian noise of standard deviation \(\sigma =0.05\) and 5% salt-and-pepper noise. The parameters are chosen as \(\nu =0.025\), \(\lambda =0.25\), \(N_{it}=10\), \(M_{it}=200\), \(n=10\) and \(r=1\). To reduce the effect of randomness, we run each algorithm 20 times and report the mean and standard deviation of the final energy for different numbers of labels in Table 1. Note that using more labels benefits both the baseline and our algorithm. Nevertheless, the proposed approach reaches lower energies with the same number of labels and similar memory requirements.
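For reference, a one-line sketch of this data term (names and values below are illustrative):

```python
import numpy as np

def c_trunc(u, f, nu=0.025):
    """Truncated quadratic data term: quadratic near f, capped at nu for outliers."""
    return np.minimum((u - f)**2, nu)

print(c_trunc(0.25, np.array([0.2, 0.9])))   # [0.0025 0.025 ]: the outlier at 0.9 is capped
```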

Fig. 5

a The original \(64 \times 64\) grayscale image. b Degraded with Gaussian noise of standard deviation \(\sigma =0.05\) and 5% salt-and-pepper noise

Table 1 Ablation study indicating the effect of individual lines in Algorithm 2

As can be seen in this table, adding uniform sampling and picking the most violated constraint per face (Lines 1 and 4 of Algorithm 2) already decreases the final energy significantly. We also consider local exploration around the current solution (Line 2), which helps to find better energies at the expense of higher memory requirements. The pruning strategy (Line 3) circumvents this memory issue; however, the energy deteriorates dramatically because some faces can end up with no samples after pruning. Keeping the current solution as a sample per face (Line 5) therefore prevents the energy from degrading. With all of these sampling strategies combined, the proposed method achieves the best energy and runtime at a memory usage comparable to the baseline method.

Fig. 6

Comparison between the baseline and our approach on a \(64\times 64\) grayscale image shown in Fig. 5, degraded with Gaussian and salt-and-pepper noise. Our approach finds lower energies in fewer iterations and less time than the baseline, which implements the constraints only at the label points

Figure 6 further compares the baseline and our proposed method in terms of the number of iterations and runtime. Due to the replacement of samples, the energy exhibits a peak right after each sampling phase. It quickly settles again, however, leading to an overall decreasing trend.

Additionally, we compare our method to the baseline on the more realistic CBSD dataset from Martin et al. (2001). This dataset contains 68 images, which are degraded with additive white Gaussian noise (5% in this experiment). The number of labels is 7 for both methods. 5K iterations are performed for the baseline method, while we set \(N_{it}=10\) and \(M_{it}=500\) to obtain a fair comparison. The other parameters are chosen as \(\lambda =0.25\), \(n=50\) and \(r=1\). The results are shown in Fig. 7. Our method outperforms the baseline on all images in terms of both energy and peak signal-to-noise ratio (PSNR).

5.3 Manifold-Valued Denoising

To show the flexibility of our algorithm, we next evaluate it on a manifold-valued denoising problem in HSV color space. The hue component of this space lives on a circle, i.e., \(\Gamma _1 =\mathbb {S}^1\), while \(\Gamma _2 = \Gamma _3 = [0,1]\).

The data term of this experiment is still a truncated quadratic energy, where for the hue component the distance is taken on the circle \(\mathbb {S}^1\). The input images (Baker et al. 2011; Martin et al. 2001) are degraded with the same setting as above.

Both the baseline and our method are implemented with 7 labels. First, we evaluate the impact of the number r of most violated samples. As shown in Fig. 8, the maximum difference in energy and memory is only 0.8% and 0.06%, respectively, i.e., both are almost constant wrt. r. We therefore pick \(r=5\). To get an equal number of total iterations, 30K iterations are performed for the baseline, while we set \(N_{it}=100\) outer iterations with \(M_{it}=300\) inner primal-dual steps for our method. The other parameters are chosen as \(\lambda =0.015\) and \(n=30\). As shown in Fig. 9, our method achieves a lower energy than the baseline. Qualitatively, since our method implements the constraints not only at the labels but also in between, the results exhibit less label bias.

Fig. 7

Quantitative results on the CBSD dataset (Martin et al. 2001). We order the results of both methods from best to worst. The difference is calculated using the formula \(\frac{\text {ours - baseline}}{\text {baseline}}\). Our method consistently outperforms the baseline across all images

Fig. 8

Effect of the number of most violated constraints r on energy and memory: a change in energy with \(r=1\) as the reference; b change in memory with \(r=1\) as the reference. Both the energy and the memory vary very little (by at most 0.8% and 0.06%, respectively) across different r

Fig. 9

Denoising of images (Baker et al. 2011; Martin et al. 2001) in HSV color space (\(\Gamma _1 = \mathbb {S}^1\), \(\Gamma _2=\Gamma _3=[0,1]\)) using our method and the baseline with 7 labels. The hue component plot demonstrates that our approach is able to handle the manifold setting, where the jump from \(2\pi \) to 0 is permitted. Since our approach implements the constraints adaptively in between the labels (gray lines), it reaches a lower energy with less label bias

Table 2 We compute the optical flow on the Middlebury dataset (Baker et al. 2011) using our method and the baseline for a varying amount of labels

5.4 Optical Flow

Given two input images \(I_1\), \(I_2\), we compute the optical flow \(u : \Omega \rightarrow \textbf{R}^2\) for the label space \(\Gamma = [a,b]^2\). We use a simple \(\ell _1\)-norm for the data term, i.e. \(c(x, u(x)) = \Vert I_2(x) - I_1(x+u(x))\Vert _1\) and set the regularization weight as \(\lambda =0.04\). The baseline approach runs for 50K iterations, while we set \(N_{it}=50\) and \(M_{it}=1000\) for a fair comparison. Additionally, the parameters are chosen as \(n=20\) and \(r=1\) in Algorithm 2.
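For concreteness, a sketch of how one such data-term evaluation could look for a single sampled flow value, using bilinear interpolation for the warped image (illustrative only; our implementation batches these evaluations over all faces and samples, and all names here are assumptions):

```python
import numpy as np

def bilinear(I, x, y):
    """Bilinearly interpolate image I of shape (H, W, C) at continuous coordinates (x, y)."""
    H, W = I.shape[:2]
    x0 = int(np.clip(np.floor(x), 0, W - 2)); y0 = int(np.clip(np.floor(y), 0, H - 2))
    a, b = x - x0, y - y0
    return ((1 - a) * (1 - b) * I[y0, x0] + a * (1 - b) * I[y0, x0 + 1]
            + (1 - a) * b * I[y0 + 1, x0] + a * b * I[y0 + 1, x0 + 1])

def flow_cost(I1, I2, x, y, u):
    """l1 photometric error c(x, u(x)) = ||I2(x) - I1(x + u(x))||_1 for one sampled flow u."""
    return np.abs(I2[y, x] - bilinear(I1, x + u[0], y + u[1])).sum()

I1, I2 = np.random.rand(32, 32, 3), np.random.rand(32, 32, 3)
print(flow_cost(I1, I2, x=10, y=12, u=np.array([0.6, -1.3])))
```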

Fig. 10

Visualization of the optical flow on Grove3 and RubberWhale from the Middlebury dataset (Baker et al. 2011), using the baseline, the method from Laude et al. (2016) and our method for a varying number of labels. OOM stands for out of memory. More qualitative results can be found in the appendix

We consider three methods for this experiment: the baseline, the method from Laude et al. (2016) and ours. The method from Laude et al. (2016) approximates the data term in a piecewise convex manner and requires a specific epigraph projection for the data term. Though it can attain lower energies, our approach requires less memory and can tackle any cost function in a simple black-box fashion. Table 2 summarizes the quantitative results obtained on the Middlebury dataset (Baker et al. 2011), while the detailed absolute numbers can be found in the appendix. This table reports how our approach performs relative to the baseline, i.e., the ratio \(\frac{\text {ours}}{\text {baseline}}\): e.g., for three labels our energy is 50.91% of the baseline energy for all 8 datasets, while using 99.84% of the baseline’s memory and reaching an average end point error (aep) and average angular error (aae) of 91.94% and 82.34% of the baseline’s error metrics, respectively. To enable a qualitative comparison, we visualize in Fig. 10 the results on two of the datasets. The remaining qualitative results on the Middlebury dataset (Baker et al. 2011) are shown in the appendix. Our method outperforms the baseline approach in terms of energy for the same number of labels while requiring the same amount of memory. Because Laude et al. (2016) use a tighter relaxation of the label space, they can achieve lower energies with a smaller number of labels. However, their method easily runs out of memory, while the proposed method scales better wrt. memory consumption.

6 Conclusion and Limitations

In this paper we made functional lifting methods more scalable by combining two advances, namely product-space relaxations (Goldluecke et al. 2013) and sublabel-accurate discretizations (Möllenhoff and Cremers 2017; Vogt et al. 2020). This combination is enabled by adapting a cutting-plane method from semi-infinite programming (Blankenship and Falk 1976). This allows an implementation of sublabel-accurate methods without difficult epigraphical projections.

Moreover, our approach makes sublabel-accurate functional-lifting methods applicable to any cost function in a simple black-box fashion. In experiments, we demonstrated the effectiveness of the approach over a baseline based on the product-space relaxation (Goldluecke et al. 2013) and provided a proof-of-concept experiment showcasing the method in the manifold-valued setting.

Future work will concentrate on applying and adapting the presented framework to solve large inverse problems in computer vision with multiple data terms, different regularizers and several manifold-valued optimization variables in a joint fashion. However, it is not obvious whether the presented cutting-plane approach is directly applicable to such large problems or whether novel ideas have to be pursued.