Abstract
Partial least squares (PLS) was first introduced by Wold in the mid 1960s as a heuristic algorithm to solve linear least squares (LS) problems. No optimality property of the algorithm was known then. Since then, however, a number of interesting properties have been established about the PLS algorithm for regression analysis (called PLS1). This paper shows that the PLS estimator for a specific dimensionality S is a kind of constrained LS estimator confined to a Krylov subspace of dimensionality S. Links to the Lanczos bidiagonalization and conjugate gradient methods are also discussed from a somewhat different perspective from previous authors.
Keywords
- Krylov subspace
- NIPALS
- PLS1 algorithm
- Lanczos bidiagonalization
- Conjugate gradients
- Constrained principal component analysis (CPCA)
1 Introduction
Partial least squares (PLS) was first introduced by Wold (1966) as a heuristic algorithm for estimating parameters in multiple regression. Since then, it has been elaborated in many directions, including extensions to multivariate cases (Abdi 2007; de Jong 1993) and structural equation modeling (Lohmöller 1989; Wold 1982). In this paper, we focus on the original PLS algorithm for univariate regression (called PLS1), and show its optimality given the subspace in which the vector of regression coefficients is supposed to lie. Links to state-of-the-art algorithms for solving a system of linear simultaneous equations, such as the Lanczos bidiagonalization and the conjugate gradient methods, are also discussed from a somewhat different perspective from previous authors (Eldén 2004; Phatak and de Hoog 2002). We refer the reader to Rosipal and Krämer (2006) for more comprehensive accounts and reviews of new developments of PLS.
2 PLS1 as Constrained Least Squares Estimator
Consider a linear regression model
\(\mathbf{z} = \mathbf{G}\mathbf{b} + \mathbf{e},\)  (2.1)
where z is the N-component vector of observations on the criterion variable, G is the N × P matrix of predictor variables, b is the P-component vector of regression coefficients, and e is the N-component vector of disturbance terms. The ordinary LS (OLS) criterion is often used to estimate b under the iid (independent and identically distributed) normal assumption on e. This is a reasonable practice if N is large compared to P, and columns of G are not highly collinear (i.e., as long as the matrix G′G is well-conditioned). However, if this condition is not satisfied, the use of OLS estimators (OLSE) is not recommended, because then these estimators tend to have large variances. Principal component regression (PCR) is often employed in such situations. In PCR, principal component analysis (PCA) is first applied to G to find a low rank (say, rank S) approximation, which is subsequently used as the set of new predictor variables in a linear regression analysis. One potential problem with PCR is that the low rank approximation of G best accounts for G but is not necessarily optimal for predicting z. By contrast, PLS extracts components of G that are good predictors of z. For the case of univariate regression, the PLS algorithm (called PLS1) proceeds as follows:
PLS1 Algorithm
- Step 1: Column-wise center G and z, and set \(\mathbf{G}_0 = \mathbf{G}\).
- Step 2: Repeat the following substeps for \(i = 1, \cdots, S\) (\(S \leq \mathrm{rank}(\mathbf{G})\)):
  - Step 2.1: Set \(\mathbf{w}_i = \mathbf{G}_{i-1}'\mathbf{z}/\|\mathbf{G}_{i-1}'\mathbf{z}\|\), where \(\|\mathbf{G}_{i-1}'\mathbf{z}\| = (\mathbf{z}'\mathbf{G}_{i-1}\mathbf{G}_{i-1}'\mathbf{z})^{1/2}\).
  - Step 2.2: Set \(\mathbf{t}_i = \mathbf{G}_{i-1}\mathbf{w}_i/\|\mathbf{G}_{i-1}\mathbf{w}_i\|\).
  - Step 2.3: Set \(\mathbf{v}_i = \mathbf{G}_{i-1}'\mathbf{t}_i\).
  - Step 2.4: Set \(\mathbf{G}_{i} = \mathbf{G}_{i-1} -\mathbf{t}_{i}\mathbf{v}_{i}' = \mathbf{Q}_{G_{i-1}w_{i}}\mathbf{G}_{i-1}\) (deflation), where \(\mathbf{Q}_{G_{i-1}w_{i}} = \mathbf{I} -\mathbf{G}_{i-1}\mathbf{w}_{i}(\mathbf{w}_{i}'\mathbf{G}_{i-1}'\mathbf{G}_{i-1}\mathbf{w}_{i})^{-1}\mathbf{w}_{i}'\mathbf{G}_{i-1}'\).

Here ′ denotes the transpose operation, and \(\|\cdot\|\) denotes the \(L_2\) norm of a vector (i.e., \(\|\mathbf{x}\| = \sqrt{\mathbf{x}'\mathbf{x}}\); see, e.g., Takane (2014), for details). Vectors \(\mathbf{w}_i\), \(\mathbf{t}_i\), and \(\mathbf{v}_i\) are called (respectively) weights, scores, and loadings, and are collected in matrices \(\mathbf{W}_S\), \(\mathbf{T}_S\), and \(\mathbf{V}_S\). For a given S, the PLS estimator (PLSE) of b is given by
\(\hat{\mathbf{b}}_{PLSE}^{(S)} = \mathbf{W}_S(\mathbf{V}_S'\mathbf{W}_S)^{-1}\mathbf{T}_S'\mathbf{z}\)  (2.2)
(see, e.g., Abdi 2007). The algorithm above assumes that S is known and, actually, the choice of its value is crucial for good performance of PLSE (a cross validation method is often used to choose the best value of S). It has been demonstrated (Phatak and de Hoog 2002) that for a given value of S, the PLSE of b has better predictability than the corresponding PCR estimator.
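As a concrete illustration, the steps above translate directly into NumPy. This is a sketch under our own naming (the function `pls1` and the test data are illustrative); the last line uses the PLS1 estimator \(\hat{\mathbf{b}}_{PLSE}^{(S)} = \mathbf{W}_S(\mathbf{V}_S'\mathbf{W}_S)^{-1}\mathbf{T}_S'\mathbf{z}\) of Eq. (2.2).

```python
import numpy as np

def pls1(G, z, S):
    """PLS1 as in Steps 1-2.4 above (a minimal sketch, no input checks).

    Returns the weight, score, and loading matrices W_S, T_S, V_S and the
    PLS estimate of b for the centered data.
    """
    Gc = G - G.mean(axis=0)          # Step 1: column-wise centering
    zc = z - z.mean()
    Gi = Gc.copy()
    W, T, V = [], [], []
    for _ in range(S):
        w = Gi.T @ zc
        w /= np.linalg.norm(w)       # Step 2.1: normalized weight vector
        t = Gi @ w
        t /= np.linalg.norm(t)       # Step 2.2: normalized score vector
        v = Gi.T @ t                 # Step 2.3: loading vector
        Gi = Gi - np.outer(t, v)     # Step 2.4: deflation
        W.append(w); T.append(t); V.append(v)
    W = np.column_stack(W); T = np.column_stack(T); V = np.column_stack(V)
    b = W @ np.linalg.solve(V.T @ W, T.T @ zc)   # Eq. (2.2)
    return W, T, V, b

rng = np.random.default_rng(0)
G = rng.standard_normal((30, 5))
z = rng.standard_normal(30)
W5, T5, V5, b5 = pls1(G, z, S=5)     # S = rank(G)
```

With S = rank(G), the PLS estimate coincides with the OLSE on the centered data, and both W_S and T_S come out column-wise orthonormal.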
The PLSE of b can be regarded as a special kind of constrained LS estimator (CLSE), in which b is constrained to lie in the Krylov subspace of dimensionality S defined by
\(\mathcal{K}_S(\mathbf{G}'\mathbf{G},\mathbf{G}'\mathbf{z}) = \mbox{Sp}(\mathbf{K}_S),\)  (2.3)
where Sp(K_S) is the space spanned by the column vectors of K_S, and
\(\mathbf{K}_S = [\mathbf{G}'\mathbf{z}, (\mathbf{G}'\mathbf{G})\mathbf{G}'\mathbf{z}, \cdots, (\mathbf{G}'\mathbf{G})^{S-1}\mathbf{G}'\mathbf{z}]\)  (2.4)
is called the Krylov matrix of order S. Because \(\mbox{Sp}(\mathbf{W}_{S}) = \mathcal{K}_{S}(\mathbf{G}'\mathbf{G},\mathbf{G}'\mathbf{z})\) (see Eldén 2004, Proposition 3.1; Phatak and de Hoog 2002), b can be re-parameterized as b = W_S a for some a. Then Eq. (2.1) can be rewritten as
\(\mathbf{z} = \mathbf{G}\mathbf{W}_S\mathbf{a} + \mathbf{e}.\)  (2.5)
The OLSE of a is given by
\(\hat{\mathbf{a}} = (\mathbf{W}_S'\mathbf{G}'\mathbf{G}\mathbf{W}_S)^{-1}\mathbf{W}_S'\mathbf{G}'\mathbf{z},\)  (2.6)
from which the CLSE of b is found as
\(\hat{\mathbf{b}}_{CLSE}^{(S)} = \mathbf{W}_S\hat{\mathbf{a}} = \mathbf{W}_S(\mathbf{W}_S'\mathbf{G}'\mathbf{G}\mathbf{W}_S)^{-1}\mathbf{W}_S'\mathbf{G}'\mathbf{z}.\)  (2.7)
To show that (2.7) is indeed equivalent to (2.2), we need several well-known results in the PLS literature (Bro and Eldén 2009; de Jong 1993; Eldén 2004; Phatak and de Hoog 2002). First of all, W_S is column-wise orthogonal, that is,
\(\mathbf{W}_S'\mathbf{W}_S = \mathbf{I}_S.\)  (2.8)
Secondly, T_S is also column-wise orthogonal,
\(\mathbf{T}_S'\mathbf{T}_S = \mathbf{I}_S,\)  (2.9)
and
\(\mathbf{G}\mathbf{W}_S = \mathbf{T}_S\mathbf{L}_S,\)  (2.10)
where L_S is an upper bidiagonal matrix. Relations (2.8), (2.9), and (2.10) imply that
\(\mathbf{W}_S'\mathbf{G}'\mathbf{G}\mathbf{W}_S = \mathbf{L}_S'\mathbf{L}_S \equiv \mathbf{H}_S,\)  (2.11)
where H_S is tridiagonal. Thirdly,
\(\mathbf{V}_S = \mathbf{G}'\mathbf{T}_S,\)  (2.12)
so that
\(\mathbf{V}_S'\mathbf{W}_S = \mathbf{T}_S'\mathbf{G}\mathbf{W}_S = \mathbf{L}_S.\)  (2.13)
Now it is straightforward to show that
\(\hat{\mathbf{b}}_{CLSE}^{(S)} = \mathbf{W}_S(\mathbf{L}_S'\mathbf{L}_S)^{-1}\mathbf{L}_S'\mathbf{T}_S'\mathbf{z} = \mathbf{W}_S(\mathbf{V}_S'\mathbf{W}_S)^{-1}\mathbf{T}_S'\mathbf{z} = \hat{\mathbf{b}}_{PLSE}^{(S)},\)  (2.14)
and this establishes the equivalence between Eqs. (2.7) and (2.2).
The PLSE of regression parameters reduces to the OLSE if S = rank(G) (when rank(G) < P, we use G +, which is the Moore-Penrose inverse of G, in lieu of (G′G)−1 G in the OLSE for regression coefficients).
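The constrained-LS view can also be checked numerically without running PLS1 at all: re-parameterize b = K_S a with the Krylov matrix K_S of Eq. (2.4) and solve for a by ordinary LS. A minimal sketch (the function name `clse_krylov` and the data are ours; `lstsq` is used because an explicit Krylov basis can be severely ill-conditioned):

```python
import numpy as np

def clse_krylov(G, z, S):
    """LS estimate of b constrained to the Krylov subspace K_S(G'G, G'z).

    Builds K_S = [G'z, (G'G)G'z, ..., (G'G)^{S-1} G'z] on the centered
    data, solves the LS problem for a in b = K_S a, and returns K_S a.
    """
    Gc = G - G.mean(axis=0)
    zc = z - z.mean()
    A = Gc.T @ Gc
    cols = [Gc.T @ zc]
    for _ in range(S - 1):
        cols.append(A @ cols[-1])        # next Krylov column (G'G)^k G'z
    K = np.column_stack(cols)
    a = np.linalg.lstsq(Gc @ K, zc, rcond=None)[0]
    return K @ a

rng = np.random.default_rng(1)
G = rng.standard_normal((40, 4))
z = rng.standard_normal(40)
b_s1 = clse_krylov(G, z, S=1)   # one-dimensional constraint: b lies along G'z
b_s4 = clse_krylov(G, z, S=4)   # S = rank(G): should reduce to the OLSE
```

At S = rank(G) the Krylov subspace fills the whole parameter space and the constrained estimate agrees with the OLSE, as noted above.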
3 Relations to the Lanczos Bidiagonalization Method
It has been pointed out (Eldén 2004) that PLS1 described above is equivalent to the following Lanczos bidiagonalization algorithm:
The Lanczos Bidiagonalization (LBD) Algorithm
- Step 1: Column-wise center G, and compute \(\mathbf{u}_1 = \mathbf{G}'\mathbf{z}/\|\mathbf{G}'\mathbf{z}\|\) and \(\mathbf{q}_1 = \mathbf{G}\mathbf{u}_1/\delta_1\), where \(\delta_1 = \|\mathbf{G}\mathbf{u}_1\|\).
- Step 2: For \(i = 2, \cdots, S\) (this is the same S as in PLS1):
  - (a) Compute \(\gamma_{i-1}\mathbf{u}_i = \mathbf{G}'\mathbf{q}_{i-1} - \delta_{i-1}\mathbf{u}_{i-1}\).
  - (b) Compute \(\delta_i\mathbf{q}_i = \mathbf{G}\mathbf{u}_i - \gamma_{i-1}\mathbf{q}_{i-1}\).

Scalars \(\gamma_{i-1}\) and \(\delta_i\) (\(i = 2, \cdots, S\)) are the normalization factors that make \(\|\mathbf{u}_i\| = 1\) and \(\|\mathbf{q}_i\| = 1\), respectively.
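A direct NumPy transcription of Steps 1–2 (a sketch under our own naming, without the reorthogonalization a production implementation would add; in floating point, orthogonality degrades for larger S):

```python
import numpy as np

def lanczos_bidiag(G, z, S):
    """Lanczos bidiagonalization as in Steps 1-2 above.

    Returns U_S and Q_S with (numerically) orthonormal columns such that
    Q_S' G U_S is upper bidiagonal on the centered G.
    """
    Gc = G - G.mean(axis=0)              # Step 1
    u = Gc.T @ z
    u /= np.linalg.norm(u)
    q = Gc @ u
    delta = np.linalg.norm(q)            # delta_1
    q /= delta
    U, Q = [u], [q]
    for _ in range(2, S + 1):            # Step 2
        u = Gc.T @ q - delta * u         # (a): gamma_{i-1} u_i
        gamma = np.linalg.norm(u)
        u /= gamma
        q = Gc @ u - gamma * q           # (b): delta_i q_i
        delta = np.linalg.norm(q)
        q /= delta
        U.append(u); Q.append(q)
    return np.column_stack(U), np.column_stack(Q)

rng = np.random.default_rng(2)
G = rng.standard_normal((50, 6))
z = rng.standard_normal(50)
U4, Q4 = lanczos_bidiag(G, z, 4)
```

On such data, \(\mathbf{Q}_S'\mathbf{G}\mathbf{U}_S\) is upper bidiagonal and \(\mathbf{U}_S'\mathbf{G}'\mathbf{G}\mathbf{U}_S\) is tridiagonal, in line with the discussion below.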
Let U_S and Q_S represent the collections of u_i and q_i for i = 1, ⋯, S. It has been shown (Eldén 2004, Proposition 3.1) that these two matrices are essentially the same as the W_S and T_S, respectively, obtained in PLS1. Here "essentially" means that they are identical to W_S and T_S except that the even-numbered columns of U_S and Q_S are reflected (i.e., have their signs reversed). We show this explicitly for u_2 and q_2 (i.e., u_2 = −w_2 and q_2 = −t_2). It is obvious from Step 1 of the two algorithms that
\(\mathbf{u}_1 = \mathbf{w}_1 \quad \mbox{and} \quad \mathbf{q}_1 = \mathbf{t}_1.\)  (2.15)
Let \(\alpha_1 = \|\mathbf{G}'\mathbf{z}\|\). Then
where ∝ means "proportional to." To obtain the last expression, we multiplied Eq. (2.16) by \(\delta_1/\alpha_1\ ({>}\,0)\). This last expression is proportional to \(-\mathbf{u}_2\), where \(\mathbf{u}_2 \propto \mathbf{G}'\mathbf{G}\mathbf{u}_1/\delta_1 - \delta_1\mathbf{u}_1\) from Step 2(a) of the Lanczos algorithm. This implies \(\mathbf{u}_2 = -\mathbf{w}_2\), because both u_2 and w_2 are normalized.
Similarly, define \(\beta_1^2 = \mathbf{w}_1'(\mathbf{G}'\mathbf{G})^2\mathbf{w}_1\). Then
To obtain Eq. (2.19), we multiplied (2.18) by \(\delta_1^2/\alpha_1\ ({>}\,0)\). On the other hand, we have
To show that \(\mathbf{q}_2 \propto -\mathbf{t}_2\), it remains to show that
From Step 2(a) of the Lanczos algorithm,
and so indeed (2.21) holds. Again, we have \(\mathbf{q}_2 = -\mathbf{t}_2\), because both q_2 and t_2 are normalized.
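These sign relations are easy to confirm numerically. The snippet below runs two steps of each algorithm on the same simulated data (names and data are illustrative only):

```python
import numpy as np

rng = np.random.default_rng(3)
G = rng.standard_normal((30, 5))
G -= G.mean(axis=0)                      # column-wise centered
z = rng.standard_normal(30)

# Two steps of PLS1
w1 = G.T @ z; w1 /= np.linalg.norm(w1)
t1 = G @ w1;  t1 /= np.linalg.norm(t1)
G1 = G - np.outer(t1, G.T @ t1)          # deflation
w2 = G1.T @ z; w2 /= np.linalg.norm(w2)
t2 = G1 @ w2;  t2 /= np.linalg.norm(t2)

# Two steps of Lanczos bidiagonalization
u1 = G.T @ z; u1 /= np.linalg.norm(u1)
d1 = np.linalg.norm(G @ u1)              # delta_1
q1 = G @ u1 / d1
g1 = np.linalg.norm(G.T @ q1 - d1 * u1)  # gamma_1
u2 = (G.T @ q1 - d1 * u1) / g1
q2 = G @ u2 - g1 * q1
q2 /= np.linalg.norm(q2)
```

The first pair of vectors agrees exactly, while the second pair agrees up to a sign flip, exactly as argued above.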
The sign reversals of u_2 and q_2 yield u_3 and q_3 identical to w_3 and t_3, respectively, by similar sign reversals, and u_4 and q_4 which are sign reversals of w_4 and t_4, and so on. Thus, only the even-numbered columns of U_S and Q_S are affected (i.e., have their signs reversed) relative to the corresponding columns of W_S and T_S, respectively. Of course, these sign reversals have no effect on estimates of the regression parameters. The estimate of the regression parameters by the Lanczos bidiagonalization method is given by
\(\hat{\mathbf{b}}_{LBD}^{(S)} = \mathbf{U}_S(\mathbf{L}_S^*)^{-1}\mathbf{Q}_S'\mathbf{z},\)  (2.23)
where
\(\mathbf{L}_S^* = \mathbf{Q}_S'\mathbf{G}\mathbf{U}_S,\)  (2.24)
which is upper bidiagonal, as is L_S (defined in Eq. (2.13)). Matrix \(\mathbf{L}_S^*\) differs from matrix L_S only in the signs of its super-diagonal elements. The matrices \(\mathbf{L}_S^{-1}\) and \((\mathbf{L}_S^*)^{-1}\) are also upper bidiagonal, with super-diagonal elements opposite in sign and identical diagonal elements. Thus
where \(\ell_{i,j}\) and \(\ell_{i,j}^*\) are the (i, j)-th elements of (respectively) L_S and \(\mathbf{L}_S^*\). Note that
It is widely known (see, e.g., Saad 2003) that the matrix of orthogonal basis vectors generated by the Arnoldi orthogonalization of K_S (Arnoldi 1951) is identical to the U_S obtained in the Lanczos algorithm. Starting from \(\mathbf{u}_1 = \mathbf{G}'\mathbf{z}/\|\mathbf{G}'\mathbf{z}\|\), this orthogonalization method finds \(\mathbf{u}_{i+1}\) (i = 1, ⋯, S − 1) by successively orthogonalizing \(\mathbf{G}'\mathbf{G}\mathbf{u}_i\) (i = 1, ⋯, S − 1) to all previous u_i's by a procedure similar to the Gram-Schmidt orthogonalization method. This yields U_S such that \(\mathbf{G}'\mathbf{G}\mathbf{U}_S = \mathbf{U}_S\mathbf{H}_S^*\), or
\(\mathbf{H}_S^* = \mathbf{U}_S'\mathbf{G}'\mathbf{G}\mathbf{U}_S,\)  (2.27)
where \(\mathbf{H}_S^*\) is tridiagonal, as is H_S defined in Eq. (2.11). The diagonal elements of \(\mathbf{H}_S^*\) are identical to those of H_S, while its sub- and super-diagonal elements have their signs reversed. Matrix \(\mathbf{H}_S^*\) is called the Lanczos tridiagonal matrix and is useful for obtaining the eigenvalues of G′G.
4 Relations to the Conjugate Gradient Method
It has been pointed out (Phatak and de Hoog 2002) that the conjugate gradient (CG) algorithm (Hestenes and Stiefel 1952) for solving a system of linear simultaneous equations \(\mathbf{G}'\mathbf{G}\mathbf{b} = \mathbf{G}'\mathbf{z}\) gives solutions identical to \(\hat{\mathbf{b}}_{PLSE}^{(s)}\) [s = 1, ⋯, rank(G)] if the CG iteration starts from the initial solution \(\hat{\mathbf{b}}_{CG}^{(0)} \equiv \mathbf{b}_{0} = \mathbf{0}\). To verify their assertion, we examine the CG algorithm, stated as follows:
The Conjugate Gradient (CG) Algorithm
- Step 1: Initialize \(\mathbf{b}_0 = \mathbf{0}\). Then \(\mathbf{r}_0 = \mathbf{G}'\mathbf{z} - \mathbf{G}'\mathbf{G}\mathbf{b}_0 = \mathbf{G}'\mathbf{z} = \mathbf{d}_0\). (Vectors r_0 and d_0 are called the initial residual and initial direction vectors, respectively.)
- Step 2: For \(i = 0, \cdots, s-1\), compute:
  - (a) \(a_i = \mathbf{d}_i'\mathbf{r}_i/\mathbf{d}_i'\mathbf{G}'\mathbf{G}\mathbf{d}_i = \|\mathbf{r}_i\|^2/\mathbf{d}_i'\mathbf{G}'\mathbf{G}\mathbf{d}_i\).
  - (b) \(\mathbf{b}_{i+1} = \mathbf{b}_i + a_i\mathbf{d}_i\).
  - (c) \(\mathbf{r}_{i+1} = \mathbf{G}'\mathbf{z} -\mathbf{G}'\mathbf{G}\mathbf{b}_{i+1} = \mathbf{r}_{i} - a_{i}\mathbf{G}'\mathbf{G}\mathbf{d}_{i} = \mathbf{Q}_{d_{i}/G'G}'\mathbf{r}_{i}\), where \(\mathbf{Q}_{d_{i}/G'G} = \mathbf{I} -\mathbf{d}_{i}(\mathbf{d}_{i}'\mathbf{G}'\mathbf{G}\mathbf{d}_{i})^{-1}\mathbf{d}_{i}'\mathbf{G}'\mathbf{G}\) is the projector onto the space orthogonal to Sp(G′Gd_i) along Sp(d_i) [its transpose, on the other hand, is the projector onto the space orthogonal to Sp(d_i) along Sp(G′Gd_i)].
  - (d) \(b_i = -\mathbf{r}_{i+1}'\mathbf{G}'\mathbf{G}\mathbf{d}_i/\mathbf{d}_i'\mathbf{G}'\mathbf{G}\mathbf{d}_i = \|\mathbf{r}_{i+1}\|^2/\|\mathbf{r}_i\|^2\).
  - (e) \(\mathbf{d}_{i+1} = \mathbf{r}_{i+1} + b_{i}\mathbf{d}_{i} = \mathbf{Q}_{d_{i}/G'G}\mathbf{r}_{i+1}\).
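The CG steps above translate directly into NumPy. The sketch below forms G′G explicitly for readability and omits any stopping rule (names such as `cg_normal_equations` are ours):

```python
import numpy as np

def cg_normal_equations(G, z, s):
    """Run s CG iterations on G'G b = G'z starting from b_0 = 0.

    Returns b_s together with R_s = [r_0, ..., r_{s-1}] and
    D_s = [d_0, ..., d_{s-1}], the residual and direction matrices
    used in the text.
    """
    Gc = G - G.mean(axis=0)
    zc = z - z.mean()
    A = Gc.T @ Gc
    b = np.zeros(Gc.shape[1])
    r = Gc.T @ zc                        # r_0 = G'z (since b_0 = 0)
    d = r.copy()                         # d_0 = r_0
    R, D = [], []
    for _ in range(s):
        R.append(r.copy()); D.append(d.copy())
        a_i = (r @ r) / (d @ (A @ d))    # Step 2(a)
        b = b + a_i * d                  # Step 2(b)
        r_new = r - a_i * (A @ d)        # Step 2(c)
        b_i = (r_new @ r_new) / (r @ r)  # Step 2(d)
        d = r_new + b_i * d              # Step 2(e)
        r = r_new
    return b, np.column_stack(R), np.column_stack(D)

rng = np.random.default_rng(4)
G = rng.standard_normal((40, 4))
z = rng.standard_normal(40)
b_cg, R4, D4 = cg_normal_equations(G, z, 4)
```

On this data the direction vectors come out G′G-conjugate, the residuals come out mutually orthogonal, and after s = rank(G) iterations the iterate coincides with the OLSE, in line with the claims proved below.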
Let \(\mathbf{R}_j = [\mathbf{r}_0, \cdots, \mathbf{r}_{j-1}]\) and \(\mathbf{D}_j = [\mathbf{d}_0, \cdots, \mathbf{d}_{j-1}]\) for j ≤ S. We first show that
\(\mbox{Sp}(\mathbf{R}_j) = \mbox{Sp}(\mathbf{D}_j) = \mathcal{K}_j(\mathbf{G}'\mathbf{G},\mathbf{G}'\mathbf{z})\)  (2.28)
by induction, where, as before, Sp(A) indicates the space spanned by the column vectors of matrix A. It is obvious that \(\mathbf{r}_0 = \mathbf{d}_0 = \mathbf{G}'\mathbf{z}\), so that \(\mbox{Sp}(\mathbf{R}_{1}) = \mbox{Sp}(\mathbf{D}_{1}) = \mathcal{K}_{1}(\mathbf{G}'\mathbf{G},\mathbf{G}'\mathbf{z})\). From Step 2(c) of the CG algorithm, we have
\(\mathbf{r}_1 = \mathbf{G}'\mathbf{z} - c_0(\mathbf{G}'\mathbf{G})\mathbf{G}'\mathbf{z}\)  (2.29)
for some scalar c_0, so that \(\mathbf{r}_{1} \in \mathcal{K}_{2}(\mathbf{G}'\mathbf{G},\mathbf{G}'\mathbf{z})\) because \(\mathbf{G}'\mathbf{G}\mathbf{d}_{0} \in \mathcal{K}_{2}(\mathbf{G}'\mathbf{G},\mathbf{G}'\mathbf{z})\). From Step 2(e), we also have
\(\mathbf{d}_1 = (1 + b_0)\mathbf{G}'\mathbf{z} - c_0^*(\mathbf{G}'\mathbf{G})\mathbf{G}'\mathbf{z}\)  (2.30)
for some \(c_0^*\), so that \(\mathbf{d}_{1} \in \mathcal{K}_{2}(\mathbf{G}'\mathbf{G},\mathbf{G}'\mathbf{z})\). This shows that \(\mbox{Sp}(\mathbf{R}_{2}) = \mbox{Sp}(\mathbf{D}_{2}) = \mathcal{K}_{2}(\mathbf{G}'\mathbf{G},\mathbf{G}'\mathbf{z})\). Similarly, we have \(\mathbf{r}_{2} \in \mathcal{K}_{3}(\mathbf{G}'\mathbf{G},\mathbf{G}'\mathbf{z})\) and \(\mathbf{d}_{2} \in \mathcal{K}_{3}(\mathbf{G}'\mathbf{G},\mathbf{G}'\mathbf{z})\), so that \(\mbox{Sp}(\mathbf{R}_{3}) = \mbox{Sp}(\mathbf{D}_{3}) = \mathcal{K}_{3}(\mathbf{G}'\mathbf{G},\mathbf{G}'\mathbf{z})\), and so on.
The property of D_j above implies that Sp(W_S) is identical to Sp(D_S), which in turn implies that
\(\hat{\mathbf{b}}_{CG}^{(S)} = \mathbf{D}_S(\mathbf{D}_S'\mathbf{G}'\mathbf{G}\mathbf{D}_S)^{-1}\mathbf{D}_S'\mathbf{G}'\mathbf{z}\)  (2.31)
is identical to \(\hat{\mathbf{b}}_{CLSE}^{(S)}\) as defined in Eq. (2.7), which in turn is equal to \(\hat{\mathbf{b}}_{PLSE}^{(S)}\) defined in Eq. (2.2) (Phatak and de Hoog 2002) by virtue of Eq. (2.14). It remains to show that \(\hat{\mathbf{b}}_{CG}^{(S)}\) defined in (2.31) coincides with b_S generated by the CG algorithm. By the G′G-conjugacy of the d_j's (the orthogonality of the d_j's with respect to G′G, i.e., \(\mathbf{d}_i'\mathbf{G}'\mathbf{G}\mathbf{d}_j = 0\) for any i ≠ j, as will be shown later), Eq. (2.31) can be rewritten as
\(\hat{\mathbf{b}}_{CG}^{(S)} = \sum_{j=0}^{S-1}\frac{\mathbf{d}_j'\mathbf{G}'\mathbf{z}}{\mathbf{d}_j'\mathbf{G}'\mathbf{G}\mathbf{d}_j}\,\mathbf{d}_j.\)  (2.32)
From Step 2(b) of the CG algorithm, on the other hand, we have
\(\mathbf{b}_1 = a_0\mathbf{d}_0 = \frac{\mathbf{d}_0'\mathbf{G}'\mathbf{z}}{\mathbf{d}_0'\mathbf{G}'\mathbf{G}\mathbf{d}_0}\,\mathbf{d}_0,\)  (2.33)
and
\(\mathbf{b}_2 = \mathbf{b}_1 + a_1\mathbf{d}_1 = \sum_{j=0}^{1}\frac{\mathbf{d}_j'\mathbf{G}'\mathbf{z}}{\mathbf{d}_j'\mathbf{G}'\mathbf{G}\mathbf{d}_j}\,\mathbf{d}_j,\)  (2.34)
since \(\mathbf{d}_{1}'\mathbf{r}_{1} = \mathbf{d}_{1}'\mathbf{Q}_{d_{0}/G'G}'\mathbf{r}_{0} = \mathbf{d}_{1}'\mathbf{r}_{0} = \mathbf{d}_{1}'\mathbf{G}'\mathbf{z}\) (the second equality holds due to the G′G-conjugacy of d_1 and d_0). Similarly, we obtain
\(\mathbf{b}_3 = \sum_{j=0}^{2}\frac{\mathbf{d}_j'\mathbf{G}'\mathbf{z}}{\mathbf{d}_j'\mathbf{G}'\mathbf{G}\mathbf{d}_j}\,\mathbf{d}_j,\)  (2.35)
since \(\mathbf{d}_{2}'\mathbf{r}_{2} = \mathbf{d}_{2}'\mathbf{Q}_{d_{1}/G'G}'\mathbf{r}_{1} = \mathbf{d}_{2}'\mathbf{r}_{1} = \mathbf{d}_{2}'\mathbf{Q}_{d_{0}/G'G}'\mathbf{r}_{0} = \mathbf{d}_{2}'\mathbf{r}_{0} = \mathbf{d}_{2}'\mathbf{G}'\mathbf{z}\). This extends to S larger than 3, and proves the claim made above that (2.31) is indeed identical to the b_S obtained from the CG iteration.
It is rather intricate to show the G′G-conjugacy of direction vectors (i.e., d j ′G′Gd i = 0 for j ≠ i), although it is widely known in the numerical linear algebra literature (Golub and van Loan 1989). The proofs given in Golub and van Loan (1989) are not very easy to follow, however. In what follows, we attempt to provide a step-by-step proof of this fact. Let R j and D j be as defined above. We temporarily assume that the columns of D j are already G′G-conjugate (i.e., D j ′G′GD j is diagonal). Later we show that such construction of D j is possible.
We first show that
\(\mathbf{d}_{j-1}'\mathbf{r}_j = 0.\)  (2.36)
From Step 2(c) of the CG algorithm, we have
\(\mathbf{d}_{j-1}'\mathbf{r}_j = \mathbf{d}_{j-1}'\mathbf{r}_{j-1} - a_{j-1}\mathbf{d}_{j-1}'\mathbf{G}'\mathbf{G}\mathbf{d}_{j-1} = \mathbf{d}_{j-1}'\mathbf{r}_{j-1} - \mathbf{d}_{j-1}'\mathbf{r}_{j-1} = 0,\)  (2.37)
as claimed above. We next show that
\(\mathbf{d}_{j-2}'\mathbf{r}_j = 0,\)  (2.38)
based on (2.36). From Step 2(c) of the algorithm, we have
\(\mathbf{d}_{j-2}'\mathbf{r}_j = \mathbf{d}_{j-2}'\mathbf{r}_{j-1} - a_{j-1}\mathbf{d}_{j-2}'\mathbf{G}'\mathbf{G}\mathbf{d}_{j-1} = \mathbf{d}_{j-2}'\mathbf{r}_{j-1} = 0,\)  (2.39)
as claimed. Note that \(\mathbf{d}_{j-2}'\mathbf{G}'\mathbf{G}\mathbf{d}_{j-1} = 0\) by the assumption of the G′G-conjugacy (among the column vectors) of D_j. The last equality in (2.39) holds due to (2.36). By repeating essentially the same process, we can prove that \(\mathbf{d}_{j-k}'\mathbf{r}_j = 0\) for k = 3, ⋯, j, which implies
\(\mathbf{D}_j'\mathbf{r}_j = \mathbf{0},\)  (2.40)
and
\(\mathbf{R}_j'\mathbf{r}_j = \mathbf{0},\)  (2.41)
since \(\mbox{Sp}(\mathbf{D}_{j}) = \mbox{Sp}(\mathbf{R}_{j}) = \mathcal{K}_{j}(\mathbf{G}'\mathbf{G},\mathbf{G}'\mathbf{z})\). These relations indicate that in the CG method, the residual vector r_j is orthogonal to all previous search directions as well as to all previous residual vectors.
We are now in a position to prove that
\(\mathbf{d}_{j-1}'\mathbf{G}'\mathbf{G}\mathbf{d}_j = 0.\)  (2.42)
To do so, we first need to show that
\(\mathbf{r}_j'\mathbf{d}_j = \|\mathbf{r}_j\|^2,\)  (2.43)
and also that
\(\mathbf{r}_{j-1}'\mathbf{d}_j = \|\mathbf{r}_j\|^2.\)  (2.44)
For Eq. (2.43), we note that
\(\mathbf{r}_j'\mathbf{d}_j = \mathbf{r}_j'(\mathbf{r}_j + b_{j-1}\mathbf{d}_{j-1}) = \|\mathbf{r}_j\|^2 + b_{j-1}\mathbf{r}_j'\mathbf{d}_{j-1} = \|\mathbf{r}_j\|^2,\)  (2.45)
due to Eq. (2.36). For Eq. (2.44), we have
\(\mathbf{r}_{j-1}'\mathbf{d}_j = \mathbf{r}_{j-1}'\mathbf{r}_j + b_{j-1}\mathbf{r}_{j-1}'\mathbf{d}_{j-1} = b_{j-1}\|\mathbf{r}_{j-1}\|^2 = \|\mathbf{r}_j\|^2.\)  (2.46)
To show that (2.42) holds is now straightforward. We note that
\(\mathbf{r}_j'\mathbf{d}_j = (\mathbf{r}_{j-1} - a_{j-1}\mathbf{G}'\mathbf{G}\mathbf{d}_{j-1})'\mathbf{d}_j = \mathbf{r}_{j-1}'\mathbf{d}_j - a_{j-1}\mathbf{d}_{j-1}'\mathbf{G}'\mathbf{G}\mathbf{d}_j\)  (2.47)
by Step 2(c), and that \(\mathbf{r}_j'\mathbf{d}_j = \mathbf{r}_{j-1}'\mathbf{d}_j = \|\mathbf{r}_j\|^2\) by Eqs. (2.43) and (2.44). Since \(a_{j-1} \neq 0\), this implies that \(\mathbf{d}_{j-1}'\mathbf{G}'\mathbf{G}\mathbf{d}_j = 0\). That is, d_j is G′G-conjugate to the previous direction vector d_{j−1}.
We can also show that d_j is G′G-conjugate to all previous direction vectors, despite the fact that at any specific iteration, d_j is taken to be G′G-conjugate to only d_{j−1}. We begin with
\(\mathbf{d}_{j-2}'\mathbf{G}'\mathbf{G}\mathbf{d}_j = 0.\)  (2.48)
We first note that
\(\mathbf{r}_{j-2}'\mathbf{d}_j = \mathbf{r}_{j-2}'\mathbf{r}_j + b_{j-1}\mathbf{r}_{j-2}'\mathbf{d}_{j-1} = b_{j-1}\|\mathbf{r}_{j-1}\|^2 = \|\mathbf{r}_j\|^2.\)  (2.49)
We also have
\(\mathbf{r}_{j-1}'\mathbf{d}_j = (\mathbf{r}_{j-2} - a_{j-2}\mathbf{G}'\mathbf{G}\mathbf{d}_{j-2})'\mathbf{d}_j = \mathbf{r}_{j-2}'\mathbf{d}_j - a_{j-2}\mathbf{d}_{j-2}'\mathbf{G}'\mathbf{G}\mathbf{d}_j\)  (2.50)
by Step 2(c). Since \(\mathbf{r}_{j-1}'\mathbf{d}_j = \mathbf{r}_{j-2}'\mathbf{d}_j = \|\mathbf{r}_j\|^2\) and \(a_{j-2} \neq 0\), this implies (2.48). We may follow a similar line of argument as above, and show that \(\mathbf{d}_{j-k}'\mathbf{G}'\mathbf{G}\mathbf{d}_j = 0\) for k = 3, ⋯, j. This shows that \(\mathbf{D}_j'\mathbf{G}'\mathbf{G}\mathbf{d}_j = \mathbf{0}\), as claimed.
In the proof above, it was assumed that the column vectors of D j were G′G-conjugate. It remains to show that such construction of D j is possible. We have D 1′r 1 = d 0′r 1 = 0 by (2.36). This implies that R 1′r 1 = 0 (since Sp(D 1) = Sp(R 1)), which in turn implies that D 1′G′Gd 1 = d 0′G′Gd 1 = 0. The columns of D 2 = [d 0, d 1] are now shown to be G′G-conjugate. We repeat this process until we reach D j whose column vectors are all G′G-conjugate. This process also generates R j whose columns are mutually orthogonal. This means that all residual vectors are orthogonal in the CG method. The CG algorithm is also equivalent to the GMRES (Generalized Minimum Residual) method (Saad and Schultz 1986), when the latter is applied to the symmetric positive definite (pd) matrix G′G.
It may also be pointed out that R_S is an un-normalized version of the W_S obtained in PLS1. This can be seen from the fact that the column vectors of both of these matrices are orthogonal to each other, and that \(\mbox{Sp}(\mathbf{W}_{S}) = \mbox{Sp}(\mathbf{R}_{S}) = \mathcal{K}_{S}(\mathbf{G}'\mathbf{G},\mathbf{G}'\mathbf{z})\). Although some columns of R_S may be sign-reversed, as are some columns of U_S in the Lanczos method, it can be directly verified that this does not happen to r_2 (i.e., \(\mathbf{r}_2/\|\mathbf{r}_2\| = \mathbf{w}_2\)), so it is not likely to happen to other columns of R_S.
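This claim can be spot-checked numerically. With the residuals indexed from zero as in the CG algorithm above, r_0 and the next residual normalize to w_1 and w_2 with no sign change (a minimal check on simulated data; all names are illustrative):

```python
import numpy as np

rng = np.random.default_rng(5)
G = rng.standard_normal((25, 4))
G -= G.mean(axis=0)                  # column-wise centered
z = rng.standard_normal(25)

# PLS1: first two weight vectors
w1 = G.T @ z; w1 /= np.linalg.norm(w1)
t1 = G @ w1;  t1 /= np.linalg.norm(t1)
G1 = G - np.outer(t1, G.T @ t1)      # deflation
w2 = G1.T @ z; w2 /= np.linalg.norm(w2)

# CG on G'G b = G'z: first two residuals
A = G.T @ G
r0 = G.T @ z
d0 = r0.copy()
a0 = (r0 @ r0) / (d0 @ A @ d0)
r1 = r0 - a0 * (A @ d0)
```

Normalizing each residual recovers the corresponding PLS1 weight vector with the same sign, unlike the even-numbered Lanczos vectors.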
5 Concluding Remarks
The PLS1 algorithm was initially invented as a heuristic technique to solve LS problems (Wold 1966). No optimality properties of the algorithm were known at that time, and for a long time it had been criticized for being somewhat ad-hoc. It was later shown, however, that it is equivalent to some of the most sophisticated numerical algorithms to date for solving systems of linear simultaneous equations, such as the Lanczos bidiagonalization and the conjugate gradient methods. It is amazing, and indeed admirable, that Herman Wold almost single-handedly reinvented the “wheel” in a totally different context.
References
Abdi, H.: Partial least squares regression. In: Salkind, N.J. (ed.) Encyclopedia of Measurement and Statistics, pp. 740–54. Sage, Thousand Oaks (2007)
Arnoldi, W.E.: The principle of minimized iterations in the solution of the matrix eigenvalue problem. Q. Appl. Math. 9, 17–29 (1951)
Bro, R., Eldén, L.: PLS works. J. Chemom. 23, 69–71 (2009)
de Jong, S.: SIMPLS: an alternative approach to partial least squares regression. Chemom. Intell. Lab. Syst. 18, 251–263 (1993)
Eldén, L.: Partial least-squares vs Lanczos bidiagonalization–I: analysis of a projection method for multiple regression. Comput. Stat. Data Anal. 46, 11–31 (2004)
Golub, G.H., van Loan, C.F.: Matrix Computations, 2nd edn. The Johns Hopkins University Press, Baltimore (1989)
Hestenes, M., Stiefel, E.: Methods of conjugate gradients for solving linear systems. J. Res. Natl. Bur. Stand. 49, 409–436 (1952)
Lohmöller, J.B.: Latent Variables Path-Modeling with Partial Least Squares. Physica-Verlag, Heidelberg (1989)
Phatak, A., de Hoog, F.: Exploiting the connection between PLS, Lanczos methods and conjugate gradients: alternative proofs of some properties of PLS. J. Chemom. 16, 361–367 (2002)
Rosipal, R., Krämer, N.: Overview and recent advances in partial least squares. In: Saunders, C., et al. (eds.) SLSFS 2005. LNCS 3940, pp. 34–51. Springer, Berlin (2006)
Saad, Y.: Iterative Methods for Sparse Linear Systems, 2nd edn. Society of Industrial and Applied Mathematics, Philadelphia (2003)
Saad, Y., Schultz, M.H.: A generalized minimal residual algorithm for solving nonsymmetric linear systems. SIAM J. Sci. Comput. 7, 856–869 (1986)
Takane, Y.: Constrained Principal Component Analysis and Related Techniques. CRC Press, Boca Raton (2014)
Wold, H.: Estimation of principal components and related models by iterative least squares. In: Krishnaiah, P.R. (ed.) Multivariate Analysis, pp. 391–420. Academic, New York (1966)
Wold, H.: Soft modeling: the basic design and some extensions. In: Jöreskog, K.G., Wold, H. (eds.) Systems Under Indirect Observations, Part 2, pp. 1–54. North-Holland, Amsterdam (1982)
© 2016 Springer International Publishing Switzerland
Takane, Y., Loisel, S. (2016). On the PLS Algorithm for Multiple Regression (PLS1). In: Abdi, H., Esposito Vinzi, V., Russolillo, G., Saporta, G., Trinchera, L. (eds) The Multiple Facets of Partial Least Squares and Related Methods. PLS 2014. Springer Proceedings in Mathematics & Statistics, vol 173. Springer, Cham. https://doi.org/10.1007/978-3-319-40643-5_2