
1 Introduction

There has been a recent interest in unconventional pattern recognition formulations to solve modern, large-scale problems across several application domains. Among these, structured prediction problems [1], in which outputs may exhibit structure along with strong interdependencies, are particularly challenging. The case in which the output structure is a vector is referred to as Multitarget Regression (MTR) [2, 16, 17].

It is well known that the straightforward approach of predicting each output independently is not optimal. First attempts to improve on this combined the independently estimated outputs using canonical correlation [4] or used a low-rank decomposition of the input-output covariance matrix [6]. More recently proposed approaches [8, 14] take these dependencies into account through different sparseness patterns induced in the prediction model by different regularizers.

Another option to promote sparseness is to use random subspaces, as in [12], where new variables are formed by linearly combining k outputs with random weights drawn from a normal distribution. Modern large-scale problems can be tackled if the problem is formulated using two prediction steps that introduce latent variables and convenient regularizers [5, 16, 17]. These composite approaches usually lead to nonconvex formulations that require particular solving strategies.

The present work proposes a general MTR model that includes several recent approaches as particular cases and introduces new variants. In what follows, the MTR problem is presented and our main contribution is introduced. An empirical evaluation using publicly available data and involving several state-of-the-art approaches is then reported. Finally, the main conclusions along with lines of future research are outlined.

2 Multitarget Regression

Let \({\varvec{x}}\), \({\varvec{y}}\) be input and output vectors, respectively, and let \(\left( {\varvec{x}}^i,{\varvec{y}}^i\right) \in \mathbb {R}^p\times \mathbb {R}^q\), for \(i=1,\ldots ,N\), be a given training set. The MTR problem consists of obtaining a predictor \(h: \mathbb {R}^p\rightarrow \mathbb {R}^q\) such that the expected deviation between true and predicted outputs is minimized over all possible input/output pairs.

The most straightforward approach consists of obtaining a univariate predictor for each output variable independently, using any of the available methods for single-target prediction [2]. This constitutes the simplest of the so-called problem transformation (also known as local) methods, which transform the given MTR problem into one or more single-target ones [12]. The alternative is to tackle MTR through algorithm adaptation (also known as global) methods [2], which adapt a particular strategy to deal directly with multiple targets. Global methods are interesting because they focus on explicitly capturing the interdependencies and internal relationships among targets.
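For concreteness, the following minimal NumPy sketch illustrates the independent single-target baseline mentioned above, assuming a plain ridge penalty; the function name and the choice of regressor are illustrative assumptions, not part of the original text.

```python
import numpy as np

def single_target_baseline(X, Y, lam=1.0):
    """Fit one independent ridge regressor per target column.

    Because all targets share the same ridge penalty, solving the q
    normal-equation systems jointly is equivalent to fitting each target
    separately; no dependence among targets is exploited.
    """
    p = X.shape[1]
    A = X.T @ X + lam * np.eye(p)
    return np.linalg.solve(A, X.T @ Y)   # W of shape (p, q)
```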

A straightforward adaptation method for MTR is Multivariate Linear Regression with q output variables [2]. In this approach it is possible to estimate the coefficient matrix, \(W\in \mathbb {R}^{p\times q}\), through unconstrained optimization by combining an appropriate loss function, \(\mathcal {L}(W)\), and a regularizer, \(\mathcal {R}(W)\), as

$$\begin{aligned} \min _{W}\; \mathcal {L}(W)+\mathcal {R}(W) \end{aligned}$$
(1)

As in most previous works on MTR, the loss function used here is the Mean Squared Error (MSE) between true outputs and predictions, \( \frac{1}{2N} ||XW-Y||_F^2\), where \(||\cdot ||_F\) denotes the Frobenius norm and \(X\in \mathbb {R}^{N\times p}\) and \(Y\in \mathbb {R}^{N\times q}\) stack the training inputs and outputs by rows.

On the other hand, the regularizer depends on the kind of structure we want to enforce in W. For example, using the squared Frobenius norm, \(||\cdot ||_{F}^2\), leads to multivariate ridge regression. For a sparser structure (attribute selection) we can use the lasso, \(||\cdot ||_1\), the grouped lasso, \(||\cdot ||_{2,1}\) [9, 15], or particular combinations of these. The nuclear (or trace) norm [7], \(||\cdot ||_{*}\), can also be used to enforce low rank in W.
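As a quick illustration, the following sketch (an assumed NumPy illustration, not part of the original formulation) evaluates these four regularizers on a coefficient matrix W:

```python
import numpy as np

def frobenius_sq(W):
    # ||W||_F^2: sum of squared entries (ridge-style regularizer)
    return np.sum(W ** 2)

def lasso(W):
    # ||W||_1: element-wise sum of absolute values (sparse entries)
    return np.abs(W).sum()

def group_lasso(W):
    # ||W||_{2,1}: sum of Euclidean norms of the rows (row-wise sparsity,
    # i.e. attribute selection when rows correspond to input attributes)
    return np.linalg.norm(W, axis=1).sum()

def nuclear(W):
    # ||W||_*: sum of singular values (low-rank structure)
    return np.linalg.svd(W, compute_uv=False).sum()

W = np.random.randn(5, 3)
print(frobenius_sq(W), lasso(W), group_lasso(W), nuclear(W))
```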

In the above formulation, the underlying structure that establishes the conditional dependence that may exist among output variables is not specified. It has been shown in previous works [4, 10] that the predictive capacity of the models can be improved if these dependencies are taken into account, e.g. by using other output variables as inputs when predicting each output variable. In the particular case of the Curds & Whey approach [4], the final predictions of the output variables are expressed as a linear combination of the initial (independent) predictions. This idea of combining the outputs has been exploited in recent years using different ways of modeling the structure of the coefficient matrices and the dependence among the output variables [5, 16, 17].

3 MTR with Output Dependence Estimation

In line with the previous works described above, our model consists of two prediction steps. First, a linear prediction in a latent space of the same dimension as the output is obtained for each input vector, \({\varvec{x}}\), as \({\varvec{x}}W\). Then, this latent vector is linearly transformed into \({\varvec{x}}WS\) using a dependence matrix, \(S\in \mathbb {R}^{q\times q}\). The corresponding loss term on X can then be written as \(\mathcal {L}(W,S)= \frac{1}{2N} ||XWS- Y||_F^2\).

Note that the predictive power of the model is exactly the same as in the linear case. But by decomposing the coefficient matrix into two factors, we introduce the possibility of modelling different kinds of dependencies by using separate regularizers for the matrix factors. In particular, our model uses a linear combination of regularizers, \(\mathcal {R}(W,S)=\lambda _1 g_1(W)+\lambda _2 g_2(S)\).
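A minimal sketch of the resulting composite objective (hypothetical NumPy code; g1 and g2 are passed in as plain functions such as those shown earlier) is:

```python
import numpy as np

def objective(X, Y, W, S, g1, g2, lam1, lam2):
    """Composite MTR objective: (1/2N)||XWS - Y||_F^2 + lam1*g1(W) + lam2*g2(S)."""
    N = X.shape[0]
    residual = X @ W @ S - Y              # two-step linear prediction XWS
    loss = 0.5 / N * np.sum(residual ** 2)
    return loss + lam1 * g1(W) + lam2 * g2(S)
```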

This regularization model can be seen as a generalization of previously proposed models such as MTL [5], MMR [16], or MSLR [17]. In all these cases, specific solutions to the corresponding optimization problems have been developed through direct derivation of the corresponding subgradients. Moreover, MMR and MSLR are developed in the general case of using kernels, which adds flexibility and power but compromises scalability.

The proposed general approach for the MTR problem can be formulated using a coefficient matrix, W, and a dependence matrix, S, as

$$\begin{aligned} \min _{\{W,S\}} \frac{1}{2N}||XWS-Y ||_F^2 + \lambda _1 g_1(W)+\lambda _2 g_2(S) \end{aligned}$$
(2)

where \(g_1\) and \(g_2\) are any valid regularizers, usually matrix norms.

To solve problem (2) we follow a widely used strategy that consists of breaking the global (nonconvex) problem into two (convex) subproblems in an alternating way [13]. In our particular case, we fix one of the matrices and optimize the other, which leads to two simpler convex problems that can be solved in turn until convergence.

$$\begin{aligned} \min _{W} \frac{1}{2N}||Y-XW S ||_F^2+\lambda _1 g_1(W)&\text {for a given } S\end{aligned}$$
(3)
$$\begin{aligned} \min _{S} \frac{1}{2N}||Y-X W S ||_F^2+\lambda _2 g_2(S)&\text {for a given } W \end{aligned}$$
(4)

Each subproblem in (3) and (4) is convex but not necessarily differentiable, depending on the regularizers. In fact, in most of the interesting cases, such as \(||\cdot ||_1\), \(||\cdot ||_{2,1}\), or \(||\cdot ||_{*}\), the resulting function is not differentiable.

3.1 Applying the Accelerated Proximal Gradient (APG) Method

The two alternating optimization problems in (3) and (4) involve only one variable each and consist of a first term that is convex and differentiable and a second term that is convex but not differentiable. Consequently, for a wide range of regularizers, either of the two problems can be expressed as

$$\begin{aligned} \min _A f(A)+g(A) \end{aligned}$$
(5)

where f and g represent the differentiable and non-differentiable parts, respectively. The usual gradient update with step size \(s_t\) from a current solution, A, is the one that minimizes the quadratic approximation of the criterion around A. In our case, we can approximate only the differentiable part and minimize instead

$$\begin{aligned} \hat{f}_{s_t, A} (B) + g(B) = f(A) + \langle \nabla f(A) , B-A \rangle + \frac{1}{2 s_t} ||B-A ||_F^2 + g(B) \end{aligned}$$
(6)

The update that substitutes the usual gradient descent update for a given step size, \(s_t\), at a particular iteration, t, is [11]

$$\begin{aligned} A^{t+1} = \mathrm {prox}_{s_t g}\left( A^{t} - s_t \nabla f(A^{t})\right) \end{aligned}$$

where the so-called proximal operator is defined as

$$\begin{aligned} \mathrm {prox}_{s g}(Z) = \arg \min _{B} \left\{ \frac{1}{2 s} ||B-Z ||_F^2 + g(B)\right\} \end{aligned}$$

The step size, \(s_t\), can be selected using the quadratic approximation of the differentiable part, \(\hat{f}_{s_t, A} (B)\), to guide a line search procedure. Convergence can be further improved by combining the above proximal gradient update with Nesterov's acceleration scheme [11].
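The sketch below shows one possible implementation of this accelerated proximal gradient scheme for a generic pair (f, g); the backtracking rule is based on the quadratic approximation in (6) and the momentum sequence is the usual Nesterov one. Function names and the stopping rule are illustrative assumptions, not part of the original text.

```python
import numpy as np

def apg(grad_f, f, prox_g, A0, s0=1.0, beta=0.5, max_iter=500, tol=1e-6):
    """Accelerated proximal gradient (FISTA-style) for min_A f(A) + g(A).

    grad_f : gradient of the smooth part f
    prox_g : prox_g(Z, s) solves min_B (1/2s)||B - Z||_F^2 + g(B)
    """
    A_prev, B = A0.copy(), A0.copy()
    t_prev = 1.0
    for _ in range(max_iter):
        s = s0
        gB, fB = grad_f(B), f(B)
        while True:
            A = prox_g(B - s * gB, s)
            diff = A - B
            # Backtracking: accept s when f(A) <= quadratic model around B
            if f(A) <= fB + np.sum(gB * diff) + np.sum(diff ** 2) / (2 * s):
                break
            s *= beta
        # Nesterov acceleration
        t = 0.5 * (1.0 + np.sqrt(1.0 + 4.0 * t_prev ** 2))
        B = A + ((t_prev - 1.0) / t) * (A - A_prev)
        if np.linalg.norm(A - A_prev) <= tol * max(1.0, np.linalg.norm(A_prev)):
            return A
        A_prev, t_prev = A, t
    return A_prev
```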

3.2 Using Different Regularizer Functions

In a wide range of interesting cases, the proximal operator has a closed-form solution. For example, when the Euclidean norm in \(\mathbb {R}^n\) is used, \(g(z) = \lambda ||z ||_2\), the proximal operator for a threshold \(\tau = s_t \lambda \) reduces to a shrinkage (the residual of the projection onto the Euclidean ball of radius \(\tau \)) given by

$$\begin{aligned} \mathrm {prox}_{\tau ||\cdot ||_2}(z) = \left( 1 - \frac{\tau }{||z ||_2}\right) _+ z \end{aligned}$$

where \((z)_+\) is used as a shorthand for \(\max (0,z)\).

The grouped lasso regularizer for matrices, also referred to as the \(\ell _{2,1}\) norm, is defined as \(\lambda ||A ||_{2,1} = \lambda \sum _i ||A_i ||_2\), where \(A_i\) is the i-th row of matrix A. The corresponding proximal operator can be computed separately for each row as

$$\begin{aligned} \left[ \mathrm {prox}_{\tau ||\cdot ||_{2,1}}(A)\right] _i = \left( 1 - \frac{\tau }{||A_i ||_2}\right) _+ A_i \end{aligned}$$

In the same way, the (element-wise) lasso regularizer, \(\lambda ||A ||_{1} \), leads to

$$\begin{aligned} \mathrm {prox}_{\tau ||\cdot ||_{1}}(A) = \mathcal {S}_{\tau }(A) \end{aligned}$$

where \(\mathcal{S}_{\tau }\) is the soft-thresholding operator applied element-wise to matrix A,

$$\begin{aligned}{}[\mathcal{S}_{\tau }(A) ]_{ij} = (A_{ij}-\tau )_+ - (-A_{ij}-\tau )_+ = \left\{ \begin{array}{clc} A_{ij} -\tau &{} \text { if } &{}A_{ij}>\tau \\ 0&{} \text { if } &{} -\tau \le A_{ij} \le \tau \\ A_{ij}+\tau &{} \text { if } &{}A_{ij}< -\tau \end{array} \right. \end{aligned}$$

The proximal update can also be computed easily when the nuclear norm is used as a regularizer, in terms of the soft-thresholding operation applied to the singular values of the corresponding matrix [16].
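To make these closed forms concrete, here is a small sketch (assumed NumPy code) of the three proximal operators discussed above, with tau standing for the product of the step size and the regularization weight:

```python
import numpy as np

def prox_l1(A, tau):
    # Element-wise soft-thresholding: prox of tau * ||A||_1
    return np.sign(A) * np.maximum(np.abs(A) - tau, 0.0)

def prox_group_lasso(A, tau):
    # Row-wise shrinkage: prox of tau * ||A||_{2,1}
    norms = np.linalg.norm(A, axis=1, keepdims=True)
    scale = np.maximum(1.0 - tau / np.maximum(norms, 1e-12), 0.0)
    return scale * A

def prox_nuclear(A, tau):
    # Singular value soft-thresholding: prox of tau * ||A||_*
    U, d, Vt = np.linalg.svd(A, full_matrices=False)
    return U @ np.diag(np.maximum(d - tau, 0.0)) @ Vt
```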

Using different regularizers one obtains different optimization schemes. For example, the MMR algorithm [16] considers ridge regularizers for both the W and S matrices together with a nuclear norm regularizer for matrix S, and the MSLR algorithm [17] considers a ridge regularizer for W and the grouped lasso for S.

In the present work we reproduce these two algorithms within our generalized scheme, along with two new particular algorithms, to study and compare different sparsity patterns in the sought solutions. The first variant, \(GMLR_1\), considers an element-wise lasso regularizer for both matrices W and S. The second variant, \(GMLR_2\), combines this same regularizer for the dependency matrix S with a grouped lasso regularizer for the coefficient matrix W to enforce the selection of relevant attributes. The corresponding decompositions, gradients and proximal updates for all the alternatives considered are shown in Table 1, where \(\mathcal L\) stands for \( \frac{1}{2N} ||XWS- Y||_F^2\) and its gradients are

$$\begin{aligned} \nabla _W \mathcal{L} = \frac{1}{N} X^T (XWS - Y) S^T \quad \text {and} \quad \nabla _S \mathcal{L} = \frac{1}{N} W^{T} X^T (XWS - Y). \end{aligned}$$

The way in which the alternating optimization scheme is applied within our generalized scheme is of crucial importance. Following the recommendations in [3], we propose to alternate single updates instead of completely solving each subproblem, as done in the original MMR and MSLR algorithms. According to preliminary experiments, this yields better and faster results in general.
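A minimal sketch of this alternating single-update strategy for the \(GMLR_1\) case (element-wise lasso on both W and S) is shown below; it uses a fixed step size and plain proximal steps instead of the line search and acceleration of Sect. 3.1, and the initialization is an illustrative assumption.

```python
import numpy as np

def soft_threshold(A, tau):
    # Element-wise soft-thresholding (prox of tau * ||.||_1)
    return np.sign(A) * np.maximum(np.abs(A) - tau, 0.0)

def alternating_gmlr1(X, Y, lam1, lam2, step=1e-3, n_iter=1000):
    """Alternating proximal updates for
       min (1/2N)||XWS - Y||_F^2 + lam1*||W||_1 + lam2*||S||_1
    (a plain, non-accelerated sketch with a fixed step size)."""
    N, p = X.shape
    q = Y.shape[1]
    W = np.zeros((p, q))
    S = np.eye(q)                        # start from an identity dependence matrix
    for _ in range(n_iter):
        R = X @ W @ S - Y                # residual with S fixed
        grad_W = X.T @ R @ S.T / N       # gradient of the loss w.r.t. W
        W = soft_threshold(W - step * grad_W, step * lam1)
        R = X @ W @ S - Y                # residual with the updated W
        grad_S = W.T @ X.T @ R / N       # gradient of the loss w.r.t. S
        S = soft_threshold(S - step * grad_S, step * lam2)
    return W, S
```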

Table 1. Differentiable and non differentiable parts, gradients and proximal updates for the different alternatives considered. \(S=UDV^T\) is the SVD of matrix S.

4 Experiments and Results

The generalized scheme, including the two newly proposed variants, and the MMR and MSLR algorithms have been considered in this work. All of them have been implemented using the Accelerated Proximal Gradient strategy in the linear case. A set of 16 publicly available datasets from [2, 10] has been used to carry out comparative experiments. As in other similar works, we use the average Relative Root Mean Squared Error (aRRMSE) as the performance measure. Given a test set, \(D_{test}\), and a predictor, h, this measure is given as

$$\begin{aligned} aRRMSE(h;D_{test})=\frac{1}{q}\sum _{i=1}^{q}\sqrt{\frac{ \sum _{\left( {\varvec{x}},{\varvec{y}}\right) \in D_{test}} \left( h_i({\varvec{x}}) -y_i\right) ^{2}}{ \sum _{\left( {\varvec{x}},{\varvec{y}}\right) \in D_{test}} \left( \overline{y}_i - y_i\right) ^{2}}} \end{aligned}$$

where \(\overline{y}_i\) is the mean value of the i-th target variable and \(h_i({\varvec{x}})\) is the corresponding prediction.
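A direct translation of this measure (an assumed NumPy sketch; argument names are illustrative) is:

```python
import numpy as np

def arrmse(Y_true, Y_pred, y_bar):
    """Average Relative RMSE over the q targets.

    Y_true, Y_pred : (n_test, q) arrays of test targets and predictions.
    y_bar          : length-q vector of mean target values (typically
                     computed on the training set).
    """
    num = np.sum((Y_pred - Y_true) ** 2, axis=0)   # per-target squared error
    den = np.sum((y_bar - Y_true) ** 2, axis=0)    # per-target deviation from the mean
    return np.mean(np.sqrt(num / den))
```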

In all cases, the datasets were divided into training and test sets following an 80/20 proportion. The two new variants and the MSLR algorithm use two regularization parameters, which were selected using a grid search with 20 logarithmically spaced values per parameter. The MMR algorithm needs three regularization parameters and the corresponding grid search used 15 values for each. The best-performing values for each algorithm and dataset were selected.

Fig. 1. Criterion values corresponding to datasets atp1d (left) and sf1 (right).

Table 2. Dataset details and performance results obtained for all considered algorithms.
Fig. 2. Results of the Friedman test with Shaffer correction: \(\chi ^2 = 2.8688\), \(df1 = 3\), \(df2 = 45\), and \(p_v = 0.04683\).

In all cases, the alternating strategy converged to relatively good solutions at a reasonable speed. Figure 1 contains representative plots illustrating the convergence of the different algorithms for the datasets atp1d and sf1. The aRRMSE values for all datasets are shown in Table 2. As can be seen, the variant \(GMLR_2\) is the one that performs best in a majority of cases (10), followed by MMR, \(GMLR_1\), and MSLR. The relative merit of each of the four variants considered can be observed graphically in Fig. 2, which corresponds to the result of the Friedman test with Shaffer correction. Even though the differences are not significant, except for \(GMLR_2\) with regard to MSLR, these results can be seen as very encouraging, especially taking into account that we have restricted ourselves to the linear case.

Another fact worth mentioning is that the datasets with a larger number of attributes and targets are the ones for which the difference between \(GMLR_2\) and the rest is largest. This is not surprising given that the grouped lasso regularizer is especially well suited for attribute selection.

5 Conclusions and Further Work

A common framework to solve large-scale MTR problems based on alternating optimization and APG methods has been presented in this work. Within this framework, we have been able to reproduce previously proposed algorithms along with two new variants. The proposed scheme is flexible and can be used in a variety of situations, showing a smooth behavior and leading to appropriate solutions. The experiments carried out give strong evidence that the approach can be competitive with regard to other state-of-the-art methods. Further research will consider a wider catalogue of regularizers along with extending the method to the nonlinear case using kernels.