
1 Introduction

Partial least squares (PLS) was first introduced by Wold (1966) as a heuristic algorithm for estimating parameters in multiple regression. Since then, it has been elaborated in many directions, including extensions to multivariate cases (Abdi 2007; de Jong 1993) and structural equation modeling (Lohmöller 1989; Wold 1982). In this paper, we focus on the original PLS algorithm for univariate regression (called PLS1), and show its optimality given the subspace in which the vector of regression coefficients is assumed to lie. Links to state-of-the-art algorithms for solving systems of linear simultaneous equations, such as the Lanczos bidiagonalization and the conjugate gradient methods, are also discussed, from a perspective somewhat different from that of previous authors (Eldén 2004; Phatak and de Hoog 2002). We refer the reader to Rosipal and Krämer (2006) for a more comprehensive account and a review of newer developments of PLS.

2 PLS1 as Constrained Least Squares Estimator

Consider a linear regression model

$$\displaystyle{ \mathbf{z} = \mathbf{G}\mathbf{b} + \mathbf{e}, }$$
(2.1)

where z is the N-component vector of observations on the criterion variable, G is the N × P matrix of predictor variables, b is the P-component vector of regression coefficients, and e is the N-component vector of disturbance terms. The ordinary LS (OLS) criterion is often used to estimate b under the iid (independent and identically distributed) normal assumption on e. This is a reasonable practice if N is large compared to P and the columns of G are not highly collinear (i.e., as long as the matrix G′G is well-conditioned). However, if this condition is not satisfied, the use of OLS estimators (OLSE) is not recommended, because these estimators then tend to have large variances. Principal component regression (PCR) is often employed in such situations. In PCR, principal component analysis (PCA) is first applied to G to find a low-rank (say, rank S) approximation, which is subsequently used as the set of new predictor variables in a linear regression analysis. One potential problem with PCR is that the low-rank approximation of G best accounts for G itself but is not necessarily optimal for predicting z. By contrast, PLS extracts components of G that are good predictors of z. For the case of univariate regression, the PLS algorithm (called PLS1) proceeds as follows:

PLS1 Algorithm

Step 1.:

Column-wise center G and z, and set \(\mathbf{G}_{0} = \mathbf{G}\).

Step 2.:

Repeat the following substeps for i = 1, ⋯ , S (S ≤ rank(G)):

Step 2.1.:

Set \(\mathbf{w}_{i} = \mathbf{G}_{i-1}'\mathbf{z}/\Vert \mathbf{G}_{i-1}'\mathbf{z}\Vert\), where \(\Vert \mathbf{G}_{i-1}'\mathbf{z}\Vert = (\mathbf{z}'\mathbf{G}_{i-1}\mathbf{G}_{i-1}'\mathbf{z})^{1/2}\).

Step 2.2.:

Set \(\mathbf{t}_{i} = \mathbf{G}_{i-1}\mathbf{w}_{i}/\Vert \mathbf{G}_{i-1}\mathbf{w}_{i}\Vert\).

Step 2.3.:

Set \(\mathbf{v}_{i} = \mathbf{G}_{i-1}'\mathbf{t}_{i}\).

Step 2.4.:

Set \(\mathbf{G}_{i} = \mathbf{G}_{i-1} -\mathbf{t}_{i}\mathbf{v}_{i}' = \mathbf{Q}_{G_{i-1}w_{i}}\mathbf{G}_{i-1}\) (deflation),

where \(\mathbf{Q}_{G_{i-1}w_{i}} = \mathbf{I} -\mathbf{G}_{i-1}\mathbf{w}_{i}(\mathbf{w}_{i}'\mathbf{G}_{i-1}'\mathbf{G}_{i-1}\mathbf{w}_{i})^{-1}\mathbf{w}_{i}'\mathbf{G}_{i-1}'\), ′ denotes the transpose operation, and \(\Vert \cdot \Vert\) denotes the \(L_{2}\) norm of a vector (i.e., \(\Vert \mathbf{x}\Vert = \sqrt{\mathbf{x}'\mathbf{x}}\); see, e.g., Takane 2014, for details). The vectors w_i, t_i, and v_i are called weights, scores, and loadings, respectively, and are collected in the matrices W_S, T_S, and V_S. For a given S, the PLS estimator (PLSE) of b is given by

$$\displaystyle{ \hat{\mathbf{b}}_{PLSE}^{(S)} = \mathbf{W}_{ S}(\mathbf{V}_{S}'\mathbf{W}_{S})^{-1}\mathbf{T}_{ S}'\mathbf{z} }$$
(2.2)

(see, e.g., Abdi 2007). The algorithm above assumes that S is known; in practice the choice of S is crucial for good performance of the PLSE, and cross-validation is often used to select it. It has been demonstrated (Phatak and de Hoog 2002) that, for a given value of S, the PLSE of b has better predictive ability than the corresponding PCR estimator.
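To make the steps above concrete, the following is a minimal NumPy sketch of PLS1 as just described (the function name pls1 and the toy data are illustrative choices, not part of the original presentation):

```python
import numpy as np

def pls1(G, z, S):
    """Minimal PLS1 sketch (Steps 1-2.4 above); returns weights W, scores T,
    loadings V, and the PLS estimate of b given by Eq. (2.2)."""
    G = G - G.mean(axis=0)            # Step 1: column-wise centering
    z = z - z.mean()
    Gi = G.copy()
    W, T, V = [], [], []
    for _ in range(S):
        w = Gi.T @ z
        w /= np.linalg.norm(w)        # Step 2.1: weight vector w_i
        t = Gi @ w
        t /= np.linalg.norm(t)        # Step 2.2: score vector t_i
        v = Gi.T @ t                  # Step 2.3: loading vector v_i
        Gi = Gi - np.outer(t, v)      # Step 2.4: deflation G_i = G_{i-1} - t_i v_i'
        W.append(w); T.append(t); V.append(v)
    W, T, V = np.column_stack(W), np.column_stack(T), np.column_stack(V)
    b = W @ np.linalg.solve(V.T @ W, T.T @ z)   # Eq. (2.2)
    return W, T, V, b

# toy example with a nearly collinear predictor
rng = np.random.default_rng(0)
G = rng.standard_normal((50, 8))
G[:, 7] = G[:, 0] + 0.01 * rng.standard_normal(50)
z = G @ rng.standard_normal(8) + rng.standard_normal(50)
W, T, V, b_pls = pls1(G, z, S=3)
```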

The PLSE of b can be regarded as a special kind of constrained LS estimator (CLSE), in which b is constrained to lie in the Krylov subspace of dimensionality S defined by

$$\displaystyle{ \mathcal{K}_{S}(\mathbf{G}'\mathbf{G},\mathbf{G}'\mathbf{z}) = \mbox{ Sp}(\mathbf{K}_{S}), }$$
(2.3)

where Sp(K S ) is the space spanned by the column vectors of K S , and

$$\displaystyle{ \mathbf{K}_{S} = [\mathbf{G}'\mathbf{z},(\mathbf{G}'\mathbf{G})\mathbf{G}'\mathbf{z},\cdots \,,(\mathbf{G}'\mathbf{G})^{S-1}\mathbf{G}'\mathbf{z}] }$$
(2.4)

is called the Krylov matrix of order S. Because \(\mbox{ Sp}(\mathbf{W}_{S}) = \mathcal{K}_{S}(\mathbf{G}'\mathbf{G},\mathbf{G}'\mathbf{z})\) (see Eldén 2004, Proposition 3.1; Phatak and de Hoog 2002), b can be re-parameterized as b = W_S a for some a. Then Eq. (2.1) can be rewritten as

$$\displaystyle{ \mathbf{z} = \mathbf{G}\mathbf{W}_{S}\mathbf{a} + \mathbf{e}. }$$
(2.5)

The OLSE of a is given by

$$\displaystyle{ \hat{\mathbf{a}} = (\mathbf{W}_{S}'\mathbf{G}'\mathbf{G}\mathbf{W}_{S})^{-1}\mathbf{W}_{ S}'\mathbf{G}'\mathbf{z}, }$$
(2.6)

from which the CLSE of b is found as

$$\displaystyle{ \hat{\mathbf{b}}_{CLSE}^{(S)} = \mathbf{W}_{ S}\hat{\mathbf{a}} = \mathbf{W}_{S}(\mathbf{W}_{S}'\mathbf{G}'\mathbf{G}\mathbf{W}_{S})^{-1}\mathbf{W}_{ S}'\mathbf{G}'\mathbf{z}. }$$
(2.7)
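The constrained-LS interpretation can be checked directly: build the Krylov matrix (2.4), replace W_S in (2.6)-(2.7) by any basis of the same column space (the estimate depends only on the subspace spanned), and solve the reduced LS problem. The sketch below is illustrative; the QR-based orthonormal basis is a numerical convenience, not part of the derivation above.

```python
import numpy as np

def krylov_clse(G, z, S):
    """CLSE of b constrained to the Krylov subspace K_S(G'G, G'z); cf. Eqs. (2.3)-(2.7).
    Any basis of Sp(K_S) may replace W_S in (2.7); a QR-based orthonormal basis is used here."""
    G = G - G.mean(axis=0)
    z = z - z.mean()
    A, c = G.T @ G, G.T @ z
    K = np.column_stack([np.linalg.matrix_power(A, j) @ c for j in range(S)])  # Eq. (2.4)
    B, _ = np.linalg.qr(K)                       # orthonormal basis of Sp(K_S)
    a = np.linalg.solve(B.T @ A @ B, B.T @ c)    # Eq. (2.6) with W_S replaced by B
    return B @ a                                 # Eq. (2.7)
```

For moderate S this agrees (up to rounding) with the estimate returned by the pls1 sketch above; for larger S the raw Krylov columns in (2.4) become nearly collinear, which is one reason recursions such as PLS1 are preferred in practice.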

To show that (2.7) is indeed equivalent to (2.2), we need several well-known results in the PLS literature (Bro and Eldén 2009; de Jong 1993; Eldén 2004; Phatak and de Hoog 2002). First of all, W_S has orthonormal columns, that is,

$$\displaystyle{ \mathbf{W}_{S}'\mathbf{W}_{S} = \mathbf{I}_{S}. }$$
(2.8)

Secondly, T_S also has orthonormal columns,

$$\displaystyle{ \mathbf{T}_{S}'\mathbf{T}_{S} = \mathbf{I}_{S}, }$$
(2.9)

and

$$\displaystyle{ \mathbf{T}_{S}\mathbf{L}_{S} = \mathbf{G}\mathbf{W}_{S}, }$$
(2.10)

where L S is an upper bidiagonal matrix. Relations (2.8), (2.9) and (2.10) imply that

$$\displaystyle{ \mathbf{W}_{S}'\mathbf{G}'\mathbf{G}\mathbf{W}_{S} = \mathbf{L}_{S}'\mathbf{L}_{S} = \mathbf{H}_{S}, }$$
(2.11)

where H S is tridiagonal. Thirdly,

$$\displaystyle{ \mathbf{V}_{S}' = \mathbf{T}_{S}'\mathbf{G}, }$$
(2.12)

so that

$$\displaystyle{ \mathbf{L}_{S} = \mathbf{T}_{S}'\mathbf{G}\mathbf{W}_{S} = \mathbf{V}_{S}'\mathbf{W}_{S}. }$$
(2.13)

Now it is straightforward to show that

$$\displaystyle\begin{array}{rcl} \hat{\mathbf{b}}_{CLSE}^{(S)}& =& \mathbf{W}_{ S}(\mathbf{W}_{S}'\mathbf{G}'\mathbf{G}\mathbf{W}_{S})^{-1}\mathbf{W}_{ S}'\mathbf{G}'\mathbf{z}\ \\ & =& \mathbf{W}_{S}\mathbf{H}_{S}^{-1}\mathbf{L}_{ S}'\mathbf{T}_{S}'\mathbf{z} \\ & =& \mathbf{W}_{S}(\mathbf{L}_{S}'\mathbf{L}_{S})^{-1}\mathbf{L}_{ S}'\mathbf{T}_{S}'\mathbf{z} \\ & =& \mathbf{W}_{S}\mathbf{L}_{S}^{-1}\mathbf{T}_{ S}'\mathbf{z} \\ & =& \mathbf{W}_{S}(\mathbf{V}_{S}'\mathbf{W}_{S})^{-1}\mathbf{T}_{ S}'\mathbf{z} \\ & =& \hat{\mathbf{b}}_{PLSE}^{(S)}, {}\end{array}$$
(2.14)

and this establishes the equivalence between Eqs. (2.7) and (2.2).

The PLSE of regression parameters reduces to the OLSE if S = rank(G) (when rank(G) < P, we use \(\mathbf{G}^{+}\), the Moore-Penrose inverse of G, in lieu of \((\mathbf{G}'\mathbf{G})^{-1}\mathbf{G}'\) in the OLSE of the regression coefficients).
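The parenthetical remark can be illustrated numerically: for a rank-deficient G, the minimum-norm OLS solution is \(\mathbf{G}^{+}\mathbf{z}\), which is also what NumPy's lstsq returns. A brief sketch with illustrative data:

```python
import numpy as np

rng = np.random.default_rng(1)
G = rng.standard_normal((30, 6))
G[:, 5] = G[:, 0] + G[:, 1]          # make G rank-deficient: rank(G) = 5 < P = 6
z = rng.standard_normal(30)
G = G - G.mean(axis=0); z = z - z.mean()

b_pinv  = np.linalg.pinv(G) @ z                  # Moore-Penrose solution G^+ z
b_lstsq = np.linalg.lstsq(G, z, rcond=None)[0]   # minimum-norm LS solution
print(np.allclose(b_pinv, b_lstsq))              # True
```

With S = rank(G), the pls1 sketch above should reproduce this same vector, in line with the remark in the text.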

3 Relations to the Lanczos Bidiagonalization Method

It has been pointed out (Eldén 2004) that PLS1 described above is equivalent to the following Lanczos bidiagonalization algorithm:

The Lanczos Bidiagonalization (LBD) Algorithm

Step 1.:

Column-wise center G, and compute \(\mathbf{u}_{1} = \mathbf{G}'\mathbf{z}/\Vert \mathbf{G}'\mathbf{z}\Vert\) and \(\mathbf{q}_{1} = \mathbf{G}\mathbf{u}_{1}/\delta _{1}\), where \(\delta _{1} = \Vert \mathbf{G}\mathbf{u}_{1}\Vert\).

Step 2.:

For i = 2, ⋯ , S (this is the same S as in PLS1),

  1. (a)

    Compute \(\gamma _{i-1}\mathbf{u}_{i} = \mathbf{G}'\mathbf{q}_{i-1} -\delta _{i-1}\mathbf{u}_{i-1}\).

  2. (b)

    Compute \(\delta _{i}\mathbf{q}_{i} = \mathbf{G}\mathbf{u}_{i} -\gamma _{i-1}\mathbf{q}_{i-1}\).

Scalars \(\gamma _{i-1}\) and \(\delta _{i}\) (i = 2, ⋯ , S) are normalization factors chosen so that \(\Vert \mathbf{u}_{i}\Vert = 1\) and \(\Vert \mathbf{q}_{i}\Vert = 1\), respectively.
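A minimal sketch of the recursion above (names and data illustrative); it also assembles the S × S upper bidiagonal matrix with the δ_i on its diagonal and the γ_i on its super-diagonal, which reappears as \(\mathbf{L}_{S}^{\ast}\) in Eq. (2.24) below.

```python
import numpy as np

def lanczos_bidiag(G, z, S):
    """Lanczos (Golub-Kahan) bidiagonalization sketch following Steps 1-2 above.
    Returns U (P x S), Q (N x S), and the upper bidiagonal matrix of deltas/gammas."""
    G = G - G.mean(axis=0)
    u = G.T @ z
    u /= np.linalg.norm(u)                   # u_1 = G'z / ||G'z||
    q = G @ u
    delta = np.linalg.norm(q)
    q /= delta                               # q_1 = G u_1 / delta_1
    U, Q, deltas, gammas = [u], [q], [delta], []
    for _ in range(1, S):
        u_new = G.T @ q - delta * u          # Step 2(a): gamma_{i-1} u_i = G'q_{i-1} - delta_{i-1} u_{i-1}
        gamma = np.linalg.norm(u_new)
        u = u_new / gamma
        q_new = G @ u - gamma * q            # Step 2(b): delta_i q_i = G u_i - gamma_{i-1} q_{i-1}
        delta = np.linalg.norm(q_new)
        q = q_new / delta
        U.append(u); Q.append(q); deltas.append(delta); gammas.append(gamma)
    U, Q = np.column_stack(U), np.column_stack(Q)
    return U, Q, np.diag(deltas) + np.diag(gammas, k=1)
```

On the same (centered) data, U and Q computed this way should match the W and T of the pls1 sketch above up to the sign reversals discussed next.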

Let U_S and Q_S represent the collections of u_i and q_i for i = 1, ⋯ , S. It has been shown (Eldén 2004, Proposition 3.1) that these two matrices are essentially the same as W_S and T_S, respectively, obtained in PLS1. Here "essentially" means that they are identical to W_S and T_S except that the even-numbered columns of U_S and Q_S are reflected (i.e., have their signs reversed). We show this explicitly for u_2 and q_2 (i.e., u_2 = −w_2 and q_2 = −t_2). It is obvious from Step 1 of the two algorithms that

$$\displaystyle{ \mathbf{w}_{1} = \mathbf{u}_{1}\hspace{5.69054pt} \mbox{ and}\hspace{5.69054pt} \mathbf{t}_{1} = \mathbf{q}_{1}. }$$
(2.15)

Let \(\alpha _{1} = \Vert \mathbf{G}'\mathbf{z}\Vert\). Then

$$\displaystyle\begin{array}{rcl} \mathbf{w}_{2}& \propto & \mathbf{G}'\mathbf{Q}_{Gw_{1}}\mathbf{z}\quad \mbox{ (from Step 2.4 of the PLS1 algorithm)} \\ & =& \mathbf{G}'\mathbf{z} -\mathbf{G}'\mathbf{G}\mathbf{w}_{1}(\mathbf{w}_{1}'\mathbf{G}'\mathbf{G}\mathbf{w}_{1})^{-1}\mathbf{w}_{ 1}'\mathbf{G}'\mathbf{z} \\ & =& \alpha _{1}(\mathbf{w}_{1} -\mathbf{G}'\mathbf{G}\mathbf{w}_{1}/\delta _{1}^{2}) {}\end{array}$$
(2.16)
$$\displaystyle\begin{array}{rcl} & \propto & -\mathbf{G}'\mathbf{G}\mathbf{w}_{1}/\delta _{1} +\delta _{1}\mathbf{w}_{1},{}\end{array}$$
(2.17)

where ∝ means "is proportional to." To obtain the last expression, we multiplied Eq. (2.16) by \(\delta _{1}/\alpha _{1}\) ( > 0). This last expression is proportional to −u_2, where \(\mathbf{u}_{2} \propto \mathbf{G}'\mathbf{G}\mathbf{u}_{1}/\delta _{1} -\delta _{1}\mathbf{u}_{1}\) from Step 2(a) of the Lanczos algorithm. This implies u_2 = −w_2, because both u_2 and w_2 are normalized.

Similarly, define \(\beta _{1}^{2} = \mathbf{w}_{1}'(\mathbf{G}'\mathbf{G})^{2}\mathbf{w}_{1}\). Then

$$\displaystyle\begin{array}{rcl} \mathbf{t}_{2}& \propto & \mathbf{Q}_{Gw_{1}}\mathbf{G}\mathbf{G}'\mathbf{Q}_{Gw_{1}}\mathbf{z}\hspace{8.53581pt} \mbox{ (from Step 2.2 of the PLS1 algorithm)} \\ & =& \alpha _{1}(\mathbf{G}\mathbf{w}_{1} -\mathbf{G}\mathbf{G}'\mathbf{G}\mathbf{w}_{1}/\delta _{1}^{2} -\mathbf{G}\mathbf{w}_{ 1} + \frac{\beta _{1}^{2}} {\delta _{1}^{4}}\mathbf{G}\mathbf{w}_{1}) {}\end{array}$$
(2.18)
$$\displaystyle\begin{array}{rcl} & \propto & -\mathbf{G}\mathbf{G}'\mathbf{G}\mathbf{w}_{1} + \frac{\beta _{1}^{2}} {\delta _{1}^{2}}\mathbf{G}\mathbf{w}_{1}.{}\end{array}$$
(2.19)

To obtain Eq. (2.19), we multiplied (2.18) by \(\delta _{1}^{2}/\alpha _{1}\) ( > 0). On the other hand, we have

$$\displaystyle\begin{array}{rcl} \mathbf{q}_{2}& \propto & \frac{1} {\delta _{1}\gamma _{1}}(\mathbf{G}\mathbf{G}'\mathbf{G}\mathbf{u}_{1} -\delta _{1}^{2}\mathbf{G}\mathbf{u}_{ 1} -\gamma _{1}^{2}\mathbf{G}\mathbf{u}_{ 1})\quad \mbox{ (from Step 2(b) of the Lanczos algorithm)} \\ & \propto & \mathbf{G}\mathbf{G}'\mathbf{G}\mathbf{u}_{1} - (\delta _{1}^{2} +\gamma _{ 1}^{2})\mathbf{G}\mathbf{u}_{ 1}. {}\end{array}$$
(2.20)

To show that q 2 ∝ −t 2, it remains to show that

$$\displaystyle{ \gamma _{1}^{2} +\delta _{1}^{2} =\beta _{ 1}^{2}/\delta _{ 1}^{2}. }$$
(2.21)

From Step 2(a) of the Lanczos algorithm,

$$\displaystyle\begin{array}{rcl} \gamma _{1}^{2}& =& (\mathbf{G}'\mathbf{G}\mathbf{u}_{ 1}/\delta _{1} -\delta _{1}\mathbf{u}_{1})'(\mathbf{G}'\mathbf{G}\mathbf{u}_{1}/\delta _{1} -\delta _{1}\mathbf{u}_{1}) \\ & =& \beta _{1}^{2}/\delta _{1}^{2} -\delta _{1}^{2}, {}\end{array}$$
(2.22)

and so indeed (2.21) holds. Again, we have q 2 = −t 2, because both q 2 and t 2 are normalized.
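The two-step comparison in this argument is easy to reproduce numerically; the following self-contained snippet (illustrative data) performs one PLS1 deflation and one Lanczos step and checks that u_2 = −w_2 and q_2 = −t_2.

```python
import numpy as np

rng = np.random.default_rng(2)
G = rng.standard_normal((40, 6)); G -= G.mean(axis=0)
z = rng.standard_normal(40);      z -= z.mean()

# PLS1: first two weights and scores (Steps 2.1-2.4)
w1 = G.T @ z; w1 /= np.linalg.norm(w1)
t1 = G @ w1;  t1 /= np.linalg.norm(t1)
G1 = G - np.outer(t1, G.T @ t1)              # deflation: G_1 = G - t_1 v_1'
w2 = G1.T @ z; w2 /= np.linalg.norm(w2)
t2 = G1 @ w2;  t2 /= np.linalg.norm(t2)

# Lanczos bidiagonalization: first two u's and q's (Steps 1-2)
u1 = G.T @ z; u1 /= np.linalg.norm(u1)
q1 = G @ u1;  d1 = np.linalg.norm(q1); q1 /= d1
u2_raw = G.T @ q1 - d1 * u1                  # gamma_1 u_2 = G'q_1 - delta_1 u_1
g1 = np.linalg.norm(u2_raw); u2 = u2_raw / g1
q2_raw = G @ u2 - g1 * q1                    # delta_2 q_2 = G u_2 - gamma_1 q_1
q2 = q2_raw / np.linalg.norm(q2_raw)

print(np.allclose(u2, -w2), np.allclose(q2, -t2))   # expected: True True
```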

Because u_2 = −w_2 and q_2 = −t_2, the next pair of vectors, u_3 and q_3, come out identical to w_3 and t_3 (the two sign reversals cancel), u_4 and q_4 are again sign-reversed versions of w_4 and t_4, and so on. Thus, only the even-numbered columns of U_S and Q_S are affected (i.e., have their signs reversed) relative to the corresponding columns of W_S and T_S. Of course, these sign reversals have no effect on estimates of regression parameters. The estimate of the regression parameters by the Lanczos bidiagonalization method is given by

$$\displaystyle{ \hat{\mathbf{b}}_{LBD}^{(S)} = \mathbf{U}_{ S}(\mathbf{L}_{S}^{{\ast}})^{-1}\mathbf{Q}_{ S}'\mathbf{z}, }$$
(2.23)

where

$$\displaystyle{ \mathbf{L}_{S}^{{\ast}} = \mathbf{Q}_{ S}'\mathbf{G}\mathbf{U}_{S}, }$$
(2.24)

which is upper bidiagonal, as is \(\mathbf{L}_{S}\) (defined in Eq. (2.13)). Matrix \(\mathbf{L}_{S}^{\ast}\) differs from \(\mathbf{L}_{S}\) only in the signs of its super-diagonal elements. Indeed, writing \(\mathbf{U}_{S} = \mathbf{W}_{S}\mathbf{D}_{S}\) and \(\mathbf{Q}_{S} = \mathbf{T}_{S}\mathbf{D}_{S}\), where \(\mathbf{D}_{S} = \mbox{diag}(1,-1,1,-1,\cdots)\) records the sign reversals noted above, we have \(\mathbf{L}_{S}^{\ast} = \mathbf{Q}_{S}'\mathbf{G}\mathbf{U}_{S} = \mathbf{D}_{S}\mathbf{L}_{S}\mathbf{D}_{S}\), and hence \((\mathbf{L}_{S}^{\ast})^{-1} = \mathbf{D}_{S}\mathbf{L}_{S}^{-1}\mathbf{D}_{S}\). The matrices \(\mathbf{L}_{S}^{-1}\) and \((\mathbf{L}_{S}^{\ast})^{-1}\) are upper triangular, and their corresponding elements differ only by the factor \((-1)^{i+j}\); in particular, their diagonal elements coincide. Thus

$$\displaystyle\begin{array}{rcl} \mathbf{W}_{S}\mathbf{L}_{S}^{-1}\mathbf{T}_{ S}'& =& \sum _{i\leq j}\ell_{i,j}\,\mathbf{w}_{i}\mathbf{t}_{j}' \\ & =& \sum _{i\leq j}\ell_{i,j}^{{\ast}}\,\mathbf{u}_{i}\mathbf{q}_{j}' \\ & =& \mathbf{U}_{S}(\mathbf{L}_{S}^{{\ast}})^{-1}\mathbf{Q}_{ S}', {}\end{array}$$
(2.25)

where \(\ell_{i,j}\) and \(\ell_{i,j}^{\ast}\) are the ij-th elements of \(\mathbf{L}_{S}^{-1}\) and \((\mathbf{L}_{S}^{\ast})^{-1}\), respectively. The second equality holds because

$$\displaystyle{ \ell_{i,j}^{{\ast}} = (-1)^{i+j}\ell_{i,j}\quad \text{and}\quad \mathbf{u}_{i}\mathbf{q}_{j}' = (-1)^{i+j}\mathbf{w}_{i}\mathbf{t}_{j}', }$$
(2.26)

It is widely known (see, e.g., Saad 2003) that the matrix of orthogonal basis vectors generated by the Arnoldi orthogonalization of \(\mathbf{K}_{S}\) (Arnoldi 1951) is identical to U_S obtained in the Lanczos algorithm. Starting from \(\mathbf{u}_{1} = \mathbf{G}'\mathbf{z}/\Vert \mathbf{G}'\mathbf{z}\Vert\), this orthogonalization method finds \(\mathbf{u}_{i+1}\) (i = 1, ⋯ , S − 1) by successively orthogonalizing \(\mathbf{G}'\mathbf{G}\mathbf{u}_{i}\) to all previous \(\mathbf{u}_{i}\)'s, by a procedure similar to the Gram-Schmidt orthogonalization method. This yields \(\mathbf{U}_{S}\) such that

$$\displaystyle{ \mathbf{U}_{S}'\mathbf{G}'\mathbf{G}\mathbf{U}_{S} = \mathbf{L}_{S}^{{\ast}'}\mathbf{L}_{ S}^{{\ast}} = \mathbf{H}_{ S}^{{\ast}}, }$$
(2.27)

where \(\mathbf{H}_{S}^{\ast}\) is tridiagonal, as is \(\mathbf{H}_{S}\) defined in Eq. (2.11). The diagonal elements of \(\mathbf{H}_{S}^{\ast}\) are identical to those of \(\mathbf{H}_{S}\), while its sub- and super-diagonal elements have their signs reversed. The matrix \(\mathbf{H}_{S}^{\ast}\) is called the Lanczos tridiagonal matrix, and it is useful for obtaining (approximations to) the eigenvalues of \(\mathbf{G}'\mathbf{G}\).
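The Arnoldi construction is easy to reproduce; the sketch below (illustrative names and data) orthogonalizes G′Gu_i against all previous basis vectors and checks that U_S′G′GU_S comes out (numerically) tridiagonal; its eigenvalues, the Ritz values, approximate eigenvalues of G′G.

```python
import numpy as np

def arnoldi_tridiag(G, z, S):
    """Arnoldi orthogonalization of the Krylov matrix K_S(G'G, G'z).
    Returns U_S and H_S* = U_S' G'G U_S, tridiagonal because G'G is symmetric."""
    A, c = G.T @ G, G.T @ z
    U = [c / np.linalg.norm(c)]                # u_1 = G'z / ||G'z||
    for _ in range(1, S):
        v = A @ U[-1]
        for uj in U:                           # Gram-Schmidt step against previous u_j's
            v -= (uj @ v) * uj
        U.append(v / np.linalg.norm(v))
    U = np.column_stack(U)
    return U, U.T @ A @ U

rng = np.random.default_rng(3)
G = rng.standard_normal((60, 10)); G -= G.mean(axis=0)
z = rng.standard_normal(60);       z -= z.mean()
U, H_star = arnoldi_tridiag(G, z, S=4)
print(np.allclose(np.triu(H_star, 2), 0) and np.allclose(np.tril(H_star, -2), 0))  # tridiagonal
print(np.linalg.eigvalsh(H_star)[-1], np.linalg.eigvalsh(G.T @ G)[-1])  # largest Ritz value vs. largest eigenvalue
```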

4 Relations to the Conjugate Gradient Method

It has been pointed out (Phatak and de Hoog 2002) that the conjugate gradient (CG) algorithm (Hestenes and Stiefel 1951) for solving the system of linear simultaneous equations \(\mathbf{G}'\mathbf{G}\mathbf{b} = \mathbf{G}'\mathbf{z}\) gives solutions identical to \(\hat{\mathbf{b}}_{PLSE}^{(s)}\) [s = 1, ⋯ , rank(G)], if the CG iteration starts from the initial solution \(\hat{\mathbf{b}}_{CG}^{(0)} \equiv \mathbf{b}_{0} = \mathbf{0}\). To verify their assertion, we look into the CG algorithm, stated as follows (a code sketch is given after the listing):

The Conjugate Gradient (CG) Algorithm

Step 1.:

Initialize \(\mathbf{b}_{0} = \mathbf{0}\). Then, \(\mathbf{r}_{0} = \mathbf{G}'\mathbf{z} -\mathbf{G}'\mathbf{G}\mathbf{b}_{0} = \mathbf{G}'\mathbf{z} = \mathbf{d}_{0}\). (Vectors r_0 and d_0 are called the initial residual and initial direction vectors, respectively.)

Step 2.:

For i = 0, ⋯ , s − 1, compute:

  1. (a)

    \(a_{i} = \mathbf{d}_{i}'\mathbf{r}_{i}/\mathbf{d}_{i}'\mathbf{G}'\mathbf{G}\mathbf{d}_{i} = \Vert \mathbf{r}_{i}\Vert ^{2}/\mathbf{d}_{i}'\mathbf{G}'\mathbf{G}\mathbf{d}_{i}\).

  2. (b)

    \(\mathbf{b}_{i+1} = \mathbf{b}_{i} + a_{i}\mathbf{d}_{i}\).

  3. (c)

    \(\mathbf{r}_{i+1} = \mathbf{G}'\mathbf{z} -\mathbf{G}'\mathbf{G}\mathbf{b}_{i+1} = \mathbf{r}_{i} - a_{i}\mathbf{G}'\mathbf{G}\mathbf{d}_{i} = \mathbf{Q}_{d_{i}/G'G}'\mathbf{r}_{i}\), where \(\mathbf{Q}_{d_{i}/G'G} = \mathbf{I} -\mathbf{d}_{i}(\mathbf{d}_{i}'\mathbf{G}'\mathbf{G}\mathbf{d}_{i})^{-1}\mathbf{d}_{i}'\mathbf{G}'\mathbf{G}\) is the projector onto the space orthogonal to \(\mbox{Sp}(\mathbf{G}'\mathbf{G}\mathbf{d}_{i})\) along \(\mbox{Sp}(\mathbf{d}_{i})\) [its transpose, on the other hand, is the projector onto the space orthogonal to \(\mbox{Sp}(\mathbf{d}_{i})\) along \(\mbox{Sp}(\mathbf{G}'\mathbf{G}\mathbf{d}_{i})\)].

  4. (d)

    \(b_{i} = -\mathbf{r}_{i+1}'\mathbf{G}'\mathbf{G}\mathbf{d}_{i}/\mathbf{d}_{i}'\mathbf{G}'\mathbf{G}\mathbf{d}_{i} = \Vert \mathbf{r}_{i+1}\Vert ^{2}/\Vert \mathbf{r}_{i}\Vert ^{2}\).

  5. (e)

    \(\mathbf{d}_{i+1} = \mathbf{r}_{i+1} + b_{i}\mathbf{d}_{i} = \mathbf{Q}_{d_{i}/G'G}\mathbf{r}_{i+1}\).
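A compact sketch of the CG iteration above, applied to G′Gb = G′z with b_0 = 0 (names and data illustrative); it stores the direction vectors so that the iterate b_S can be compared with the explicit Krylov-subspace expression (2.31)/(2.32) derived below.

```python
import numpy as np

def cg_pls(G, z, S):
    """Conjugate gradient sketch for G'G b = G'z started at b_0 = 0 (Steps 1-2 above).
    Returns b_S along with R_S = [r_0,...,r_{S-1}] and D_S = [d_0,...,d_{S-1}]."""
    A, c = G.T @ G, G.T @ z
    b = np.zeros(G.shape[1])
    r = c.copy()                          # r_0 = G'z - G'G b_0 = G'z
    d = r.copy()                          # d_0 = r_0
    R, D = [r.copy()], [d.copy()]
    for _ in range(S):
        Ad = A @ d
        a = (r @ r) / (d @ Ad)            # Step 2(a)
        b = b + a * d                     # Step 2(b)
        r_new = r - a * Ad                # Step 2(c)
        beta = (r_new @ r_new) / (r @ r)  # Step 2(d)
        d = r_new + beta * d              # Step 2(e)
        r = r_new
        R.append(r.copy()); D.append(d.copy())
    return b, np.column_stack(R[:S]), np.column_stack(D[:S])

rng = np.random.default_rng(4)
G = rng.standard_normal((50, 7)); G -= G.mean(axis=0)
z = rng.standard_normal(50);      z -= z.mean()
b_S, R_S, D_S = cg_pls(G, z, S=3)
A, c = G.T @ G, G.T @ z
b_expl = sum((d @ c) / (d @ A @ d) * d for d in D_S.T)   # Eq. (2.32)
print(np.allclose(b_S, b_expl))                          # expected: True
```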

Let R j  = [r 0, ⋯ , r j−1] and D j  = [d 0, ⋯ , d j−1] for j ≤ S. We first show that

$$\displaystyle{ \mbox{ Sp}(\mathbf{R}_{j}) = \mbox{ Sp}(\mathbf{D}_{j}) = \mathcal{K}_{j}(\mathbf{G}'\mathbf{G},\mathbf{G}'\mathbf{z}) }$$
(2.28)

by induction, where, as before, Sp(A) indicates the space spanned by the column vectors of matrix A. It is obvious that \(\mathbf{r}_{0} = \mathbf{d}_{0} = \mathbf{G}'\mathbf{z}\), so that \(\mbox{ Sp}(\mathbf{R}_{1}) = \mbox{ Sp}(\mathbf{D}_{1}) = \mathcal{K}_{1}(\mathbf{G}'\mathbf{G},\mathbf{G}'\mathbf{z})\). From Step 2(c) of the CG algorithm, we have

$$\displaystyle{ \mathbf{r}_{1} = \mathbf{Q}_{d_{0}/G'G}'\mathbf{r}_{0} = \mathbf{r}_{0} -\mathbf{G}'\mathbf{G}\mathbf{d}_{0}c_{0} }$$
(2.29)

for some scalar c 0, so that \(\mathbf{r}_{1} \in \mathcal{K}_{2}(\mathbf{G}'\mathbf{G},\mathbf{G}'\mathbf{z})\) because \(\mathbf{G}'\mathbf{G}\mathbf{d}_{0} \in \mathcal{K}_{2}(\mathbf{G}'\mathbf{G},\mathbf{G}'\mathbf{z})\). From Step 2(e), we also have

$$\displaystyle{ \mathbf{d}_{1} = \mathbf{Q}_{d_{0}/G'G}\mathbf{r}_{1} = \mathbf{r}_{1} -\mathbf{d}_{0}c_{0}^{{\ast}} }$$
(2.30)

for some c 0 , so that \(\mathbf{d}_{1} \in \mathcal{K}_{2}(\mathbf{G}'\mathbf{G},\mathbf{G}'\mathbf{z})\). This shows that \(\mbox{ Sp}(\mathbf{R}_{2}) = \mbox{ Sp}(\mathbf{D}_{2}) = \mathcal{K}_{2}(\mathbf{G}'\mathbf{G},\mathbf{G}'\mathbf{z})\). Similarly, we have \(\mathbf{r}_{2} \in \mathcal{K}_{3}(\mathbf{G}'\mathbf{G},\mathbf{G}'\mathbf{z})\) and \(\mathbf{d}_{2} \in \mathcal{K}_{3}(\mathbf{G}'\mathbf{G},\mathbf{G}'\mathbf{z})\), so that \(\mbox{ Sp}(\mathbf{R}_{3}) = \mbox{ Sp}(\mathbf{D}_{3}) = \mathcal{K}_{3}(\mathbf{G}'\mathbf{G},\mathbf{G}'\mathbf{z})\), and so on.

The property of D j above implies that Sp(W S ) is identical to Sp(D S ), which in turn implies that

$$\displaystyle{ \hat{\mathbf{b}}_{CG}^{(S)} = \mathbf{D}_{ S}(\mathbf{D}_{S}'\mathbf{G}'\mathbf{G}\mathbf{D}_{S})^{-1}\mathbf{D}_{ S}'\mathbf{G}'\mathbf{z} }$$
(2.31)

is identical to \(\hat{\mathbf{b}}_{CLSE}^{(S)}\) as defined in Eq. (2.7), which in turn is equal to \(\hat{\mathbf{b}}_{PLSE}^{(S)}\) defined in Eq. (2.2) (Phatak and de Hoog 2002) by virtue of Eq. (2.14). It remains to show that \(\hat{\mathbf{b}}_{CG}^{(S)}\) defined in (2.31) coincides with \(\mathbf{b}_{S}\) generated by the CG algorithm. By the G′G-conjugacy of the d_j's (the orthogonality of the d_j's with respect to G′G, i.e., \(\mathbf{d}_{i}'\mathbf{G}'\mathbf{G}\mathbf{d}_{j} = 0\) for any i ≠ j, as will be shown later), Eq. (2.31) can be rewritten as

$$\displaystyle{ \hat{\mathbf{b}}_{CG}^{(S)} =\sum _{ i=0}^{S-1}\mathbf{d}_{ i}(\mathbf{d}_{i}'\mathbf{G}'\mathbf{G}\mathbf{d}_{i})^{-1}\mathbf{d}_{ i}'\mathbf{G}'\mathbf{z}. }$$
(2.32)

From Step 2(b) of the CG algorithm, on the other hand, we have

$$\displaystyle{ \mathbf{b}_{1} = \mathbf{d}_{0}(\mathbf{d}_{0}'\mathbf{G}'\mathbf{G}\mathbf{d}_{0})^{-1}\mathbf{d}_{ 0}'\mathbf{r}_{0} = \mathbf{d}_{0}(\mathbf{d}_{0}'\mathbf{G}'\mathbf{G}\mathbf{d}_{0})^{-1}\mathbf{d}_{ 0}'\mathbf{G}'\mathbf{z} =\hat{ \mathbf{b}}_{CG}^{(1)}, }$$
(2.33)

and

$$\displaystyle\begin{array}{rcl} \mathbf{b}_{2}& =& \hat{\mathbf{b}}_{CG}^{(1)} + \mathbf{d}_{ 1}(\mathbf{d}_{1}'\mathbf{G}'\mathbf{G}\mathbf{d}_{1})^{-1}\mathbf{d}_{ 1}'\mathbf{r}_{1}, \\ & =& \hat{\mathbf{b}}_{CG}^{(1)} + \mathbf{d}_{ 1}(\mathbf{d}_{1}'\mathbf{G}'\mathbf{G}\mathbf{d}_{1})^{-1}\mathbf{d}_{ 1}'\mathbf{G}'\mathbf{z} =\hat{ \mathbf{b}}_{CG}^{(2)},{}\end{array}$$
(2.34)

since \(\mathbf{d}_{1}'\mathbf{r}_{1} = \mathbf{d}_{1}'\mathbf{Q}_{d_{0}/G'G}'\mathbf{r}_{0} = \mathbf{d}_{1}'\mathbf{r}_{0} = \mathbf{d}_{1}'\mathbf{G}'\mathbf{z}\) (the second equality in the preceding expression holds due to the G′G-conjugacy of d_1 and d_0). Similarly, we obtain

$$\displaystyle\begin{array}{rcl} \mathbf{b}_{3}& =& \hat{\mathbf{b}}_{CG}^{(2)} + \mathbf{d}_{ 2}(\mathbf{d}_{2}'\mathbf{G}'\mathbf{G}\mathbf{d}_{2})^{-1}\mathbf{d}_{ 2}'\mathbf{r}_{2}, \\ & =& \hat{\mathbf{b}}_{CG}^{(2)} + \mathbf{d}_{ 2}(\mathbf{d}_{2}'\mathbf{G}'\mathbf{G}\mathbf{d}_{2})^{-1}\mathbf{d}_{ 2}'\mathbf{G}'\mathbf{z} =\hat{ \mathbf{b}}_{CG}^{(3)},{}\end{array}$$
(2.35)

since \(\mathbf{d}_{2}'\mathbf{r}_{2} = \mathbf{d}_{2}'\mathbf{Q}_{d_{1}/G'G}'\mathbf{r}_{1} = \mathbf{d}_{2}'\mathbf{r}_{1} = \mathbf{d}_{2}'\mathbf{Q}_{d_{0}/G'G}'\mathbf{r}_{0} = \mathbf{d}_{2}'\mathbf{r}_{0} = \mathbf{d}_{2}'\mathbf{G}'\mathbf{z}\). This extends to S larger than 3. This proves the claim made above that (2.31) is indeed identical to \(\mathbf{b}_{S}\) obtained from the CG iteration.

It is rather intricate to show the G′G-conjugacy of the direction vectors (i.e., \(\mathbf{d}_{j}'\mathbf{G}'\mathbf{G}\mathbf{d}_{i} = 0\) for j ≠ i), although it is widely known in the numerical linear algebra literature (Golub and van Loan 1989). The proofs given in Golub and van Loan (1989) are not very easy to follow, however. In what follows, we attempt to provide a step-by-step proof of this fact. Let R_j and D_j be as defined above. We temporarily assume that the columns of D_j are already G′G-conjugate (i.e., \(\mathbf{D}_{j}'\mathbf{G}'\mathbf{G}\mathbf{D}_{j}\) is diagonal). Later we show that such a construction of D_j is possible.

We first show that

$$\displaystyle{ \mathbf{d}_{j-1}'\mathbf{r}_{j} = 0. }$$
(2.36)

From Step 2(c) of the CG algorithm, we have

$$\displaystyle{ \mathbf{d}_{j-1}'\mathbf{r}_{j} = \mathbf{d}_{j-1}'\mathbf{Q}_{d_{j-1}/G'G}'\mathbf{r}_{j-1} = \mathbf{d}_{j-1}'(\mathbf{I}-\mathbf{G}'\mathbf{G}\mathbf{d}_{j-1}(\mathbf{d}_{j-1}'\mathbf{G}'\mathbf{G}\mathbf{d}_{j-1})^{-1}\mathbf{d}_{ j-1}')\mathbf{r}_{j-1} = 0, }$$
(2.37)

as claimed above. We next show that

$$\displaystyle{ \mathbf{d}_{j-2}'\mathbf{r}_{j} = 0, }$$
(2.38)

based on (2.36). From Step 2(c) of the algorithm, we have

$$\displaystyle\begin{array}{rcl} \mathbf{d}_{j-2}'\mathbf{r}_{j}& =& \mathbf{d}_{j-2}'\mathbf{Q}_{d_{j-1}/G'G}'\mathbf{r}_{j-1} \\ & =& \mathbf{d}_{j-2}'(\mathbf{I} -\mathbf{G}'\mathbf{G}\mathbf{d}_{j-1}(\mathbf{d}_{j-1}'\mathbf{G}'\mathbf{G}\mathbf{d}_{j-1})^{-1}\mathbf{d}_{ j-1}')\mathbf{r}_{j-1} \\ & =& \mathbf{d}_{j-2}'\mathbf{r}_{j-1} = 0, {}\end{array}$$
(2.39)

as claimed. Note that \(\mathbf{d}_{j-2}'\mathbf{G}'\mathbf{G}\mathbf{d}_{j-1} = 0\) by the assumed G′G-conjugacy (among the column vectors) of D_j. The last equality in (2.39) holds due to (2.36). By repeating essentially the same process, we can prove that \(\mathbf{d}_{j-k}'\mathbf{r}_{j} = 0\) for k = 3, ⋯ , j, which implies

$$\displaystyle{ \mathbf{D}_{j}'\mathbf{r}_{j} = \mathbf{0}, }$$
(2.40)

and

$$\displaystyle{ \mathbf{R}_{j}'\mathbf{r}_{j} = \mathbf{0}, }$$
(2.41)

since \(\mbox{ Sp}(\mathbf{D}_{j}) = \mbox{ Sp}(\mathbf{R}_{j}) = \mathcal{K}_{j}(\mathbf{G}'\mathbf{G},\mathbf{G}'\mathbf{z})\). These relations indicate that in the CG method, the residual vector r j is orthogonal to all previous search directions as well as all previous residual vectors.

We are now in a position to prove that

$$\displaystyle{ \mathbf{d}_{j-1}'\mathbf{G}'\mathbf{G}\mathbf{d}_{j} = 0. }$$
(2.42)

To do so, we first need to show that

$$\displaystyle{ \mathbf{d}_{j}'\mathbf{r}_{j} = \vert \vert \mathbf{r}_{j}\vert \vert ^{2}, }$$
(2.43)

and also that

$$\displaystyle{ \mathbf{d}_{j}'\mathbf{r}_{j-1} = \vert \vert \mathbf{r}_{j}\vert \vert ^{2}. }$$
(2.44)

For Eq. (2.43), we note that

$$\displaystyle\begin{array}{rcl} \mathbf{d}_{j}'\mathbf{r}_{j}& =& \mathbf{r}_{j}'\mathbf{Q}_{d_{j-1}/G'G}'\mathbf{r}_{j}\quad \mbox{ (by Step 2(e))} \\ & =& \vert \vert \mathbf{r}_{j}\vert \vert ^{2} -\mathbf{r}_{ j}'\mathbf{G}'\mathbf{G}\mathbf{d}_{j-1}(\mathbf{d}_{j-1}'\mathbf{G}'\mathbf{G}\mathbf{d}_{j-1})^{-1}\mathbf{d}_{ j-1}'\mathbf{r}_{j} = \vert \vert \mathbf{r}_{j}\vert \vert ^{2},{}\end{array}$$
(2.45)

due to Eq. (2.36). For Eq. (2.44), we have

$$\displaystyle\begin{array}{rcl} \mathbf{d}_{j}'\mathbf{r}_{j-1}& =& \mathbf{r}_{j}'\mathbf{r}_{j-1} + b_{j-1}\mathbf{d}_{j-1}'\mathbf{r}_{j-1}\quad \mbox{ (by Step 2(e))} \\ & =& 0 + (\vert \vert \mathbf{r}_{j}\vert \vert ^{2}/\vert \vert \mathbf{r}_{ j-1}\vert \vert ^{2})\vert \vert \mathbf{r}_{ j-1}\vert \vert ^{2} = \vert \vert \mathbf{r}_{ j}\vert \vert ^{2}.{}\end{array}$$
(2.46)

To show that (2.42) holds is now straightforward. We note that

$$\displaystyle{ \mathbf{r}_{j}'\mathbf{d}_{j} = \mathbf{r}_{j-1}'\mathbf{d}_{j} - a_{j-1}\mathbf{d}_{j-1}'\mathbf{G}'\mathbf{G}\mathbf{d}_{j} }$$
(2.47)

by Step 2(c), and that \(\mathbf{r}_{j}'\mathbf{d}_{j} = \mathbf{r}_{j-1}'\mathbf{d}_{j} = \vert \vert \mathbf{r}_{j}\vert \vert ^{2}\) by Eqs. (2.43) and (2.44). Since \(a_{j-1}\neq 0\), this implies that \(\mathbf{d}_{j-1}'\mathbf{G}'\mathbf{G}\mathbf{d}_{j} = 0\). That is, d_j is G′G-conjugate to the previous direction vector d_{j−1}.

We can also show that d_j is G′G-conjugate to all previous direction vectors, despite the fact that at any specific iteration, d_j is taken to be G′G-conjugate only to d_{j−1}. We begin with

$$\displaystyle{ \mathbf{d}_{j-2}'\mathbf{G}'\mathbf{G}\mathbf{d}_{j} = 0. }$$
(2.48)

We first note that

$$\displaystyle\begin{array}{rcl} \mathbf{r}_{j-2}'\mathbf{d}_{j}& =& \mathbf{r}_{j-2}'\mathbf{r}_{j} + b_{j-1}\mathbf{r}_{j-2}'\mathbf{d}_{j-1}\quad \mbox{ (by Step 2(e))} \\ & =& 0 + (\vert \vert \mathbf{r}_{j}\vert \vert ^{2}/\vert \vert \mathbf{r}_{ j-1}\vert \vert ^{2})\vert \vert \mathbf{r}_{ j-1}\vert \vert ^{2}\quad \mbox{ (by (2.44))} \\ & =& \vert \vert \mathbf{r}_{j}\vert \vert ^{2}. {}\end{array}$$
(2.49)

We also have

$$\displaystyle{ \mathbf{r}_{j-1}'\mathbf{d}_{j} = \mathbf{r}_{j-2}'\mathbf{d}_{j} - a_{j-2}\mathbf{d}_{j-2}'\mathbf{G}'\mathbf{G}\mathbf{d}_{j} }$$
(2.50)

by Step 2(c). Since \(\mathbf{r}_{j-1}'\mathbf{d}_{j} = \mathbf{r}_{j-2}'\mathbf{d}_{j} = \vert \vert \mathbf{r}_{j}\vert \vert ^{2}\) and \(a_{j-2}\neq 0\), this implies (2.48). We may follow a similar line of argument as above and show that \(\mathbf{d}_{j-k}'\mathbf{G}'\mathbf{G}\mathbf{d}_{j} = 0\) for k = 3, ⋯ , j. This shows that \(\mathbf{D}_{j}'\mathbf{G}'\mathbf{G}\mathbf{d}_{j} = \mathbf{0}\), as claimed.

In the proof above, it was assumed that the column vectors of D_j were G′G-conjugate. It remains to show that such a construction of D_j is possible. We have \(\mathbf{D}_{1}'\mathbf{r}_{1} = \mathbf{d}_{0}'\mathbf{r}_{1} = 0\) by (2.36). This implies that \(\mathbf{R}_{1}'\mathbf{r}_{1} = \mathbf{0}\) (since Sp(D_1) = Sp(R_1)), which in turn implies that \(\mathbf{D}_{1}'\mathbf{G}'\mathbf{G}\mathbf{d}_{1} = \mathbf{d}_{0}'\mathbf{G}'\mathbf{G}\mathbf{d}_{1} = 0\). The columns of D_2 = [d_0, d_1] are thus G′G-conjugate. We repeat this process until we reach D_j, all of whose column vectors are G′G-conjugate. This process also generates R_j, whose columns are mutually orthogonal; that is, all residual vectors are orthogonal in the CG method. The CG algorithm is also closely related to the GMRES (Generalized Minimum Residual) method (Saad and Schultz 1986), which draws its iterates from the same Krylov subspace when applied to the symmetric positive definite (pd) matrix G′G.
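These orthogonality and conjugacy properties are easy to check numerically. The self-contained sketch below (illustrative data) reruns the CG recursion, collects the residual and direction vectors, and confirms that R_j′R_j and D_j′G′GD_j are diagonal up to rounding error.

```python
import numpy as np

rng = np.random.default_rng(5)
G = rng.standard_normal((80, 9)); G -= G.mean(axis=0)
z = rng.standard_normal(80);      z -= z.mean()
A, c = G.T @ G, G.T @ z

b = np.zeros(9); r = c.copy(); d = r.copy()
R, D = [r.copy()], [d.copy()]
for _ in range(5):                       # collect six residual/direction vectors
    Ad = A @ d
    a = (r @ r) / (d @ Ad)
    b = b + a * d
    r_new = r - a * Ad
    d = r_new + ((r_new @ r_new) / (r @ r)) * d
    r = r_new
    R.append(r.copy()); D.append(d.copy())
R, D = np.column_stack(R), np.column_stack(D)

off = lambda M: M - np.diag(np.diag(M))       # off-diagonal part
print(np.max(np.abs(off(R.T @ R))))           # ~0: residuals mutually orthogonal, cf. (2.41)
print(np.max(np.abs(off(D.T @ A @ D))))       # ~0: directions G'G-conjugate
```

In exact arithmetic these off-diagonal entries are exactly zero; in floating point they degrade slowly as the iteration proceeds, a well-known property of CG rather than a flaw in the derivation above.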

It may also be pointed out that R_S is an un-normalized version of W_S obtained in PLS1. This can be seen from the fact that both matrices have mutually orthogonal columns and that \(\mbox{ Sp}(\mathbf{W}_{S}) = \mbox{ Sp}(\mathbf{R}_{S}) = \mathcal{K}_{S}(\mathbf{G}'\mathbf{G},\mathbf{G}'\mathbf{z})\). Although some columns of R_S might in principle be sign-reversed, as are some columns of U_S in the Lanczos method, it can be directly verified that this does not happen to r_1 (i.e., \(\mathbf{r}_{1}/\Vert \mathbf{r}_{1}\Vert = \mathbf{w}_{2}\)), so it is not likely to happen to the other columns of R_S either.

5 Concluding Remarks

The PLS1 algorithm was initially invented as a heuristic technique for solving LS problems (Wold 1966). No optimality properties of the algorithm were known at that time, and for a long time it was criticized as somewhat ad hoc. It was later shown, however, that it is equivalent to some of the most sophisticated numerical algorithms to date for solving systems of linear simultaneous equations, such as the Lanczos bidiagonalization and the conjugate gradient methods. It is amazing, and indeed admirable, that Herman Wold almost single-handedly reinvented the "wheel" in a totally different context.