Abstract
Partial least squares (PLS) was first introduced by Wold in the mid 1960s as a heuristic algorithm to solve linear least squares (LS) problems. No optimality property of the algorithm was known then. Since then, however, a number of interesting properties have been established about the PLS algorithm for regression analysis (called PLS1). This paper shows that the PLS estimator for a specific dimensionality S is a kind of constrained LS estimator confined to a Krylov subspace of dimensionality S. Links to the Lanczos bidiagonalization and conjugate gradient methods are also discussed from a somewhat different perspective from previous authors.
Keywords
- Krylov subspace
- NIPALS
- PLS1 algorithm
- Lanczos bidiagonalization
- Conjugate gradients
- Constrained principal component analysis (CPCA)
1 Introduction
Partial least squares (PLS) was first introduced by Wold (1966) as a heuristic algorithm for estimating parameters in multiple regression. Since then, it has been elaborated in many directions, including extensions to multivariate cases (Abdi 2007; de Jong 1993) and structural equation modeling (Lohmöller 1989; Wold 1982). In this paper, we focus on the original PLS algorithm for univariate regression (called PLS1), and show its optimality given the subspace in which the vector of regression coefficients is supposed to lie. Links to state-of-the-art algorithms for solving a system of linear simultaneous equations, such as the Lanczos bidiagonalization and the conjugate gradient methods, are also discussed from a somewhat different perspective from previous authors (Eldén 2004; Phatak and de Hoog 2002). We refer the reader to Rosipal and Krämer (2006) for more comprehensive accounts and reviews of new developments of PLS.
2 PLS1 as Constrained Least Squares Estimator
Consider a linear regression model
\(\mathbf{z} = \mathbf{G}\mathbf{b} + \mathbf{e},\)  (2.1)
where z is the N-component vector of observations on the criterion variable, G is the N × P matrix of predictor variables, b is the P-component vector of regression coefficients, and e is the N-component vector of disturbance terms. The ordinary LS (OLS) criterion is often used to estimate b under the iid (independent and identically distributed) normal assumption on e. This is a reasonable practice if N is large compared to P, and columns of G are not highly collinear (i.e., as long as the matrix G′G is well-conditioned). However, if this condition is not satisfied, the use of OLS estimators (OLSE) is not recommended, because then these estimators tend to have large variances. Principal component regression (PCR) is often employed in such situations. In PCR, principal component analysis (PCA) is first applied to G to find a low rank (say, rank S) approximation, which is subsequently used as the set of new predictor variables in a linear regression analysis. One potential problem with PCR is that the low rank approximation of G best accounts for G but is not necessarily optimal for predicting z. By contrast, PLS extracts components of G that are good predictors of z. For the case of univariate regression, the PLS algorithm (called PLS1) proceeds as follows:
PLS1 Algorithm
- Step 1: Column-wise center G and z, and set \(\mathbf{G}_0 = \mathbf{G}\).
- Step 2: Repeat the following substeps for \(i = 1, \cdots, S\) (\(S \leq \mathrm{rank}(\mathbf{G})\)):
  - Step 2.1: Set \(\mathbf{w}_i = \mathbf{G}_{i-1}'\mathbf{z}/\|\mathbf{G}_{i-1}'\mathbf{z}\|\), where \(\|\mathbf{G}_{i-1}'\mathbf{z}\| = (\mathbf{z}'\mathbf{G}_{i-1}\mathbf{G}_{i-1}'\mathbf{z})^{1/2}\).
  - Step 2.2: Set \(\mathbf{t}_i = \mathbf{G}_{i-1}\mathbf{w}_i/\|\mathbf{G}_{i-1}\mathbf{w}_i\|\).
  - Step 2.3: Set \(\mathbf{v}_i = \mathbf{G}_{i-1}'\mathbf{t}_i\).
  - Step 2.4: Set \(\mathbf{G}_{i} = \mathbf{G}_{i-1} -\mathbf{t}_{i}\mathbf{v}_{i}' = \mathbf{Q}_{G_{i-1}w_{i}}\mathbf{G}_{i-1}\) (deflation), where \(\mathbf{Q}_{G_{i-1}w_{i}} = \mathbf{I} -\mathbf{G}_{i-1}\mathbf{w}_{i}(\mathbf{w}_{i}'\mathbf{G}_{i-1}'\mathbf{G}_{i-1}\mathbf{w}_{i})^{-1}\mathbf{w}_{i}'\mathbf{G}_{i-1}'\).

Here ′ denotes the transpose operation, and \(\|\cdot\|\) denotes the \(L_2\) norm of a vector (i.e., \(\|\mathbf{x}\| = \sqrt{\mathbf{x}'\mathbf{x}}\); see, e.g., Takane (2014), for details). Vectors \(\mathbf{w}_i\), \(\mathbf{t}_i\), and \(\mathbf{v}_i\) are called (respectively) weights, scores, and loadings, and are collected in matrices \(\mathbf{W}_S\), \(\mathbf{T}_S\), and \(\mathbf{V}_S\). For a given S, the PLS estimator (PLSE) of b is given by
\(\hat{\mathbf{b}}_{PLSE}^{(S)} = \mathbf{W}_S(\mathbf{V}_S'\mathbf{W}_S)^{-1}\mathbf{T}_S'\mathbf{z}\)  (2.2)
(see, e.g., Abdi 2007). The algorithm above assumes that S is known and, actually, the choice of its value is crucial for good performance of PLSE (a cross validation method is often used to choose the best value of S). It has been demonstrated (Phatak and de Hoog 2002) that for a given value of S, the PLSE of b has better predictability than the corresponding PCR estimator.
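As a concrete illustration, the steps above translate directly into NumPy. This is a sketch under our own naming (the function `pls1` and the test data are illustrative); the last line uses the PLS1 estimator \(\hat{\mathbf{b}}_{PLSE}^{(S)} = \mathbf{W}_S(\mathbf{V}_S'\mathbf{W}_S)^{-1}\mathbf{T}_S'\mathbf{z}\) of Eq. (2.2).

```python
import numpy as np

def pls1(G, z, S):
    """PLS1 as in Steps 1-2.4 above (a minimal sketch, no input checks).

    Returns the weight, score, and loading matrices W_S, T_S, V_S and the
    PLS estimate of b for the centered data.
    """
    Gc = G - G.mean(axis=0)          # Step 1: column-wise centering
    zc = z - z.mean()
    Gi = Gc.copy()
    W, T, V = [], [], []
    for _ in range(S):
        w = Gi.T @ zc
        w /= np.linalg.norm(w)       # Step 2.1: normalized weight vector
        t = Gi @ w
        t /= np.linalg.norm(t)       # Step 2.2: normalized score vector
        v = Gi.T @ t                 # Step 2.3: loading vector
        Gi = Gi - np.outer(t, v)     # Step 2.4: deflation
        W.append(w); T.append(t); V.append(v)
    W = np.column_stack(W); T = np.column_stack(T); V = np.column_stack(V)
    b = W @ np.linalg.solve(V.T @ W, T.T @ zc)   # Eq. (2.2)
    return W, T, V, b

rng = np.random.default_rng(0)
G = rng.standard_normal((30, 5))
z = rng.standard_normal(30)
W5, T5, V5, b5 = pls1(G, z, S=5)     # S = rank(G)
```

With S = rank(G), the PLS estimate coincides with the OLSE on the centered data, and both W_S and T_S come out column-wise orthonormal.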
The PLSE of b can be regarded as a special kind of constrained LS estimator (CLSE), in which b is constrained to lie in the Krylov subspace of dimensionality S defined by
\(\mathcal{K}_S(\mathbf{G}'\mathbf{G},\mathbf{G}'\mathbf{z}) = \mbox{Sp}(\mathbf{K}_S),\)  (2.3)
where Sp(K_S) is the space spanned by the column vectors of K_S, and
\(\mathbf{K}_S = [\mathbf{G}'\mathbf{z}, (\mathbf{G}'\mathbf{G})\mathbf{G}'\mathbf{z}, \cdots, (\mathbf{G}'\mathbf{G})^{S-1}\mathbf{G}'\mathbf{z}]\)  (2.4)
is called the Krylov matrix of order S. Because \(\mbox{Sp}(\mathbf{W}_{S}) = \mathcal{K}_{S}(\mathbf{G}'\mathbf{G},\mathbf{G}'\mathbf{z})\) (see Eldén 2004, Proposition 3.1; Phatak and de Hoog 2002), b can be re-parameterized as b = W_S a for some a. Then Eq. (2.1) can be rewritten as
\(\mathbf{z} = \mathbf{G}\mathbf{W}_S\mathbf{a} + \mathbf{e}.\)  (2.5)
The OLSE of a is given by
\(\hat{\mathbf{a}} = (\mathbf{W}_S'\mathbf{G}'\mathbf{G}\mathbf{W}_S)^{-1}\mathbf{W}_S'\mathbf{G}'\mathbf{z},\)  (2.6)
from which the CLSE of b is found as
\(\hat{\mathbf{b}}_{CLSE}^{(S)} = \mathbf{W}_S\hat{\mathbf{a}} = \mathbf{W}_S(\mathbf{W}_S'\mathbf{G}'\mathbf{G}\mathbf{W}_S)^{-1}\mathbf{W}_S'\mathbf{G}'\mathbf{z}.\)  (2.7)
To show that (2.7) is indeed equivalent to (2.2), we need several well-known results in the PLS literature (Bro and Eldén 2009; de Jong 1993; Eldén 2004; Phatak and de Hoog 2002). First of all, W_S is column-wise orthogonal, that is,
\(\mathbf{W}_S'\mathbf{W}_S = \mathbf{I}_S.\)  (2.8)
Secondly, T_S is also column-wise orthogonal,
\(\mathbf{T}_S'\mathbf{T}_S = \mathbf{I}_S,\)  (2.9)
and
\(\mathbf{G}\mathbf{W}_S = \mathbf{T}_S\mathbf{L}_S,\)  (2.10)
where L_S is an upper bidiagonal matrix. Relations (2.8), (2.9), and (2.10) imply that
\(\mathbf{W}_S'\mathbf{G}'\mathbf{G}\mathbf{W}_S = \mathbf{L}_S'\mathbf{L}_S \equiv \mathbf{H}_S,\)  (2.11)
where H_S is tridiagonal. Thirdly,
\(\mathbf{V}_S = \mathbf{G}'\mathbf{T}_S,\)  (2.12)
so that
\(\mathbf{V}_S'\mathbf{W}_S = \mathbf{T}_S'\mathbf{G}\mathbf{W}_S = \mathbf{L}_S.\)  (2.13)
Now it is straightforward to show that
\(\hat{\mathbf{b}}_{CLSE}^{(S)} = \mathbf{W}_S(\mathbf{L}_S'\mathbf{L}_S)^{-1}\mathbf{L}_S'\mathbf{T}_S'\mathbf{z} = \mathbf{W}_S(\mathbf{V}_S'\mathbf{W}_S)^{-1}\mathbf{T}_S'\mathbf{z} = \hat{\mathbf{b}}_{PLSE}^{(S)},\)  (2.14)
and this establishes the equivalence between Eqs. (2.7) and (2.2).
The PLSE of regression parameters reduces to the OLSE if S = rank(G) (when rank(G) < P, we use G +, which is the Moore-Penrose inverse of G, in lieu of (G′G)−1 G in the OLSE for regression coefficients).
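The constrained-LS view can also be checked numerically without running PLS1 at all: re-parameterize b = K_S a with the Krylov matrix K_S of Eq. (2.4) and solve for a by ordinary LS. A minimal sketch (the function name `clse_krylov` and the data are ours; `lstsq` is used because an explicit Krylov basis can be severely ill-conditioned):

```python
import numpy as np

def clse_krylov(G, z, S):
    """LS estimate of b constrained to the Krylov subspace K_S(G'G, G'z).

    Builds K_S = [G'z, (G'G)G'z, ..., (G'G)^{S-1} G'z] on the centered
    data, solves the LS problem for a in b = K_S a, and returns K_S a.
    """
    Gc = G - G.mean(axis=0)
    zc = z - z.mean()
    A = Gc.T @ Gc
    cols = [Gc.T @ zc]
    for _ in range(S - 1):
        cols.append(A @ cols[-1])        # next Krylov column (G'G)^k G'z
    K = np.column_stack(cols)
    a = np.linalg.lstsq(Gc @ K, zc, rcond=None)[0]
    return K @ a

rng = np.random.default_rng(1)
G = rng.standard_normal((40, 4))
z = rng.standard_normal(40)
b_s1 = clse_krylov(G, z, S=1)   # one-dimensional constraint: b lies along G'z
b_s4 = clse_krylov(G, z, S=4)   # S = rank(G): should reduce to the OLSE
```

At S = rank(G) the Krylov subspace fills the whole parameter space and the constrained estimate agrees with the OLSE, as noted above.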
3 Relations to the Lanczos Bidiagonalization Method
It has been pointed out (Eldén 2004) that PLS1 described above is equivalent to the following Lanczos bidiagonalization algorithm:
The Lanczos Bidiagonalization (LBD) Algorithm
- Step 1: Column-wise center G, and compute \(\mathbf{u}_1 = \mathbf{G}'\mathbf{z}/\|\mathbf{G}'\mathbf{z}\|\) and \(\mathbf{q}_1 = \mathbf{G}\mathbf{u}_1/\delta_1\), where \(\delta_1 = \|\mathbf{G}\mathbf{u}_1\|\).
- Step 2: For \(i = 2, \cdots, S\) (this is the same S as in PLS1):
  - (a) Compute \(\gamma_{i-1}\mathbf{u}_i = \mathbf{G}'\mathbf{q}_{i-1} - \delta_{i-1}\mathbf{u}_{i-1}\).
  - (b) Compute \(\delta_i\mathbf{q}_i = \mathbf{G}\mathbf{u}_i - \gamma_{i-1}\mathbf{q}_{i-1}\).

Scalars \(\gamma_{i-1}\) and \(\delta_i\) (\(i = 2, \cdots, S\)) are the normalization factors that make \(\|\mathbf{u}_i\| = 1\) and \(\|\mathbf{q}_i\| = 1\), respectively.
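A direct NumPy transcription of Steps 1–2 (a sketch under our own naming, without the reorthogonalization a production implementation would add; in floating point, orthogonality degrades for larger S):

```python
import numpy as np

def lanczos_bidiag(G, z, S):
    """Lanczos bidiagonalization as in Steps 1-2 above.

    Returns U_S and Q_S with (numerically) orthonormal columns such that
    Q_S' G U_S is upper bidiagonal on the centered G.
    """
    Gc = G - G.mean(axis=0)              # Step 1
    u = Gc.T @ z
    u /= np.linalg.norm(u)
    q = Gc @ u
    delta = np.linalg.norm(q)            # delta_1
    q /= delta
    U, Q = [u], [q]
    for _ in range(2, S + 1):            # Step 2
        u = Gc.T @ q - delta * u         # (a): gamma_{i-1} u_i
        gamma = np.linalg.norm(u)
        u /= gamma
        q = Gc @ u - gamma * q           # (b): delta_i q_i
        delta = np.linalg.norm(q)
        q /= delta
        U.append(u); Q.append(q)
    return np.column_stack(U), np.column_stack(Q)

rng = np.random.default_rng(2)
G = rng.standard_normal((50, 6))
z = rng.standard_normal(50)
U4, Q4 = lanczos_bidiag(G, z, 4)
```

On such data, \(\mathbf{Q}_S'\mathbf{G}\mathbf{U}_S\) is upper bidiagonal and \(\mathbf{U}_S'\mathbf{G}'\mathbf{G}\mathbf{U}_S\) is tridiagonal, in line with the discussion below.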
Let U_S and Q_S represent the collections of u_i and q_i for i = 1, ⋯, S. It has been shown (Eldén 2004, Proposition 3.1) that these two matrices are essentially the same as the W_S and T_S, respectively, obtained in PLS1. Here "essentially" means that they are identical to W_S and T_S except that the even-numbered columns of U_S and Q_S are reflected (i.e., have their signs reversed). We show this explicitly for u_2 and q_2 (i.e., u_2 = −w_2 and q_2 = −t_2). It is obvious from Step 1 of the two algorithms that
\(\mathbf{u}_1 = \mathbf{w}_1 \quad \mbox{and} \quad \mathbf{q}_1 = \mathbf{t}_1.\)  (2.15)
Let \(\alpha_1 = \|\mathbf{G}'\mathbf{z}\|\). Then
where ∝ means "proportional to." To obtain the last expression, we multiplied Eq. (2.16) by \(\delta_1/\alpha_1\ ({>}\,0)\). This last expression is proportional to \(-\mathbf{u}_2\), where \(\mathbf{u}_2 \propto \mathbf{G}'\mathbf{G}\mathbf{u}_1/\delta_1 - \delta_1\mathbf{u}_1\) from Step 2(a) of the Lanczos algorithm. This implies \(\mathbf{u}_2 = -\mathbf{w}_2\), because both u_2 and w_2 are normalized.
Similarly, define \(\beta_1^2 = \mathbf{w}_1'(\mathbf{G}'\mathbf{G})^2\mathbf{w}_1\). Then
To obtain Eq. (2.19), we multiplied (2.18) by \(\delta_1^2/\alpha_1\ ({>}\,0)\). On the other hand, we have
To show that \(\mathbf{q}_2 \propto -\mathbf{t}_2\), it remains to show that
From Step 2(a) of the Lanczos algorithm,
and so indeed (2.21) holds. Again, we have \(\mathbf{q}_2 = -\mathbf{t}_2\), because both q_2 and t_2 are normalized.
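These sign relations are easy to confirm numerically. The snippet below runs two steps of each algorithm on the same simulated data (names and data are illustrative only):

```python
import numpy as np

rng = np.random.default_rng(3)
G = rng.standard_normal((30, 5))
G -= G.mean(axis=0)                      # column-wise centered
z = rng.standard_normal(30)

# Two steps of PLS1
w1 = G.T @ z; w1 /= np.linalg.norm(w1)
t1 = G @ w1;  t1 /= np.linalg.norm(t1)
G1 = G - np.outer(t1, G.T @ t1)          # deflation
w2 = G1.T @ z; w2 /= np.linalg.norm(w2)
t2 = G1 @ w2;  t2 /= np.linalg.norm(t2)

# Two steps of Lanczos bidiagonalization
u1 = G.T @ z; u1 /= np.linalg.norm(u1)
d1 = np.linalg.norm(G @ u1)              # delta_1
q1 = G @ u1 / d1
g1 = np.linalg.norm(G.T @ q1 - d1 * u1)  # gamma_1
u2 = (G.T @ q1 - d1 * u1) / g1
q2 = G @ u2 - g1 * q1
q2 /= np.linalg.norm(q2)
```

The first pair of vectors agrees exactly, while the second pair agrees up to a sign flip, exactly as argued above.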
The sign reversals of u_2 and q_2 yield u_3 and q_3 identical to w_3 and t_3, respectively, by similar sign reversals, and u_4 and q_4 which are sign reversals of w_4 and t_4, and so on. Thus, only the even-numbered columns of U_S and Q_S are affected (i.e., have their signs reversed) relative to the corresponding columns of W_S and T_S, respectively. Of course, these sign reversals have no effect on estimates of the regression parameters. The estimate of the regression parameters by the Lanczos bidiagonalization method is given by
\(\hat{\mathbf{b}}_{LBD}^{(S)} = \mathbf{U}_S(\mathbf{L}_S^*)^{-1}\mathbf{Q}_S'\mathbf{z},\)  (2.23)
where
\(\mathbf{L}_S^* = \mathbf{Q}_S'\mathbf{G}\mathbf{U}_S,\)  (2.24)
which is upper bidiagonal, as is L_S (defined in Eq. (2.13)). Matrix \(\mathbf{L}_S^*\) differs from matrix L_S only in the signs of its super-diagonal elements. The matrices \(\mathbf{L}_S^{-1}\) and \((\mathbf{L}_S^*)^{-1}\) are also upper bidiagonal, with super-diagonal elements opposite in sign and identical diagonal elements. Thus
where \(\ell_{i,j}\) and \(\ell_{i,j}^*\) are the (i, j)-th elements of (respectively) L_S and \(\mathbf{L}_S^*\). Note that
It is widely known (see, e.g., Saad 2003) that the matrix of orthogonal basis vectors generated by the Arnoldi orthogonalization of K_S (Arnoldi 1951) is identical to the U_S obtained in the Lanczos algorithm. Starting from \(\mathbf{u}_1 = \mathbf{G}'\mathbf{z}/\|\mathbf{G}'\mathbf{z}\|\), this orthogonalization method finds \(\mathbf{u}_{i+1}\) (i = 1, ⋯, S − 1) by successively orthogonalizing \(\mathbf{G}'\mathbf{G}\mathbf{u}_i\) (i = 1, ⋯, S − 1) to all previous u_i's by a procedure similar to the Gram-Schmidt orthogonalization method. This yields U_S such that \(\mathbf{G}'\mathbf{G}\mathbf{U}_S = \mathbf{U}_S\mathbf{H}_S^*\), or
\(\mathbf{H}_S^* = \mathbf{U}_S'\mathbf{G}'\mathbf{G}\mathbf{U}_S,\)  (2.27)
where \(\mathbf{H}_S^*\) is tridiagonal, as is H_S defined in Eq. (2.11). The diagonal elements of \(\mathbf{H}_S^*\) are identical to those of H_S, while its sub- and super-diagonal elements have their signs reversed. Matrix \(\mathbf{H}_S^*\) is called the Lanczos tridiagonal matrix and is useful for obtaining the eigenvalues of G′G.
4 Relations to the Conjugate Gradient Method
It has been pointed out (Phatak and de Hoog 2002) that the conjugate gradient (CG) algorithm (Hestenes and Stiefel 1952) for solving a system of linear simultaneous equations \(\mathbf{G}'\mathbf{G}\mathbf{b} = \mathbf{G}'\mathbf{z}\) gives solutions identical to \(\hat{\mathbf{b}}_{PLSE}^{(s)}\) [s = 1, ⋯, rank(G)] if the CG iteration starts from the initial solution \(\hat{\mathbf{b}}_{CG}^{(0)} \equiv \mathbf{b}_{0} = \mathbf{0}\). To verify their assertion, we examine the CG algorithm, stated as follows:
The Conjugate Gradient (CG) Algorithm
- Step 1: Initialize \(\mathbf{b}_0 = \mathbf{0}\). Then \(\mathbf{r}_0 = \mathbf{G}'\mathbf{z} - \mathbf{G}'\mathbf{G}\mathbf{b}_0 = \mathbf{G}'\mathbf{z} = \mathbf{d}_0\). (Vectors r_0 and d_0 are called the initial residual and initial direction vectors, respectively.)
- Step 2: For \(i = 0, \cdots, s-1\), compute:
  - (a) \(a_i = \mathbf{d}_i'\mathbf{r}_i/\mathbf{d}_i'\mathbf{G}'\mathbf{G}\mathbf{d}_i = \|\mathbf{r}_i\|^2/\mathbf{d}_i'\mathbf{G}'\mathbf{G}\mathbf{d}_i\).
  - (b) \(\mathbf{b}_{i+1} = \mathbf{b}_i + a_i\mathbf{d}_i\).
  - (c) \(\mathbf{r}_{i+1} = \mathbf{G}'\mathbf{z} -\mathbf{G}'\mathbf{G}\mathbf{b}_{i+1} = \mathbf{r}_{i} - a_{i}\mathbf{G}'\mathbf{G}\mathbf{d}_{i} = \mathbf{Q}_{d_{i}/G'G}'\mathbf{r}_{i}\), where \(\mathbf{Q}_{d_{i}/G'G} = \mathbf{I} -\mathbf{d}_{i}(\mathbf{d}_{i}'\mathbf{G}'\mathbf{G}\mathbf{d}_{i})^{-1}\mathbf{d}_{i}'\mathbf{G}'\mathbf{G}\) is the projector onto the space orthogonal to Sp(G′Gd_i) along Sp(d_i) [its transpose, on the other hand, is the projector onto the space orthogonal to Sp(d_i) along Sp(G′Gd_i)].
  - (d) \(b_i = -\mathbf{r}_{i+1}'\mathbf{G}'\mathbf{G}\mathbf{d}_i/\mathbf{d}_i'\mathbf{G}'\mathbf{G}\mathbf{d}_i = \|\mathbf{r}_{i+1}\|^2/\|\mathbf{r}_i\|^2\).
  - (e) \(\mathbf{d}_{i+1} = \mathbf{r}_{i+1} + b_{i}\mathbf{d}_{i} = \mathbf{Q}_{d_{i}/G'G}\mathbf{r}_{i+1}\).
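The CG steps above translate directly into NumPy. The sketch below forms G′G explicitly for readability and omits any stopping rule (names such as `cg_normal_equations` are ours):

```python
import numpy as np

def cg_normal_equations(G, z, s):
    """Run s CG iterations on G'G b = G'z starting from b_0 = 0.

    Returns b_s together with R_s = [r_0, ..., r_{s-1}] and
    D_s = [d_0, ..., d_{s-1}], the residual and direction matrices
    used in the text.
    """
    Gc = G - G.mean(axis=0)
    zc = z - z.mean()
    A = Gc.T @ Gc
    b = np.zeros(Gc.shape[1])
    r = Gc.T @ zc                        # r_0 = G'z (since b_0 = 0)
    d = r.copy()                         # d_0 = r_0
    R, D = [], []
    for _ in range(s):
        R.append(r.copy()); D.append(d.copy())
        a_i = (r @ r) / (d @ (A @ d))    # Step 2(a)
        b = b + a_i * d                  # Step 2(b)
        r_new = r - a_i * (A @ d)        # Step 2(c)
        b_i = (r_new @ r_new) / (r @ r)  # Step 2(d)
        d = r_new + b_i * d              # Step 2(e)
        r = r_new
    return b, np.column_stack(R), np.column_stack(D)

rng = np.random.default_rng(4)
G = rng.standard_normal((40, 4))
z = rng.standard_normal(40)
b_cg, R4, D4 = cg_normal_equations(G, z, 4)
```

On this data the direction vectors come out G′G-conjugate, the residuals come out mutually orthogonal, and after s = rank(G) iterations the iterate coincides with the OLSE, in line with the claims proved below.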
Let \(\mathbf{R}_j = [\mathbf{r}_0, \cdots, \mathbf{r}_{j-1}]\) and \(\mathbf{D}_j = [\mathbf{d}_0, \cdots, \mathbf{d}_{j-1}]\) for j ≤ S. We first show that
\(\mbox{Sp}(\mathbf{R}_j) = \mbox{Sp}(\mathbf{D}_j) = \mathcal{K}_j(\mathbf{G}'\mathbf{G},\mathbf{G}'\mathbf{z})\)  (2.28)
by induction, where, as before, Sp(A) indicates the space spanned by the column vectors of matrix A. It is obvious that \(\mathbf{r}_0 = \mathbf{d}_0 = \mathbf{G}'\mathbf{z}\), so that \(\mbox{Sp}(\mathbf{R}_{1}) = \mbox{Sp}(\mathbf{D}_{1}) = \mathcal{K}_{1}(\mathbf{G}'\mathbf{G},\mathbf{G}'\mathbf{z})\). From Step 2(c) of the CG algorithm, we have
\(\mathbf{r}_1 = \mathbf{G}'\mathbf{z} - c_0(\mathbf{G}'\mathbf{G})\mathbf{G}'\mathbf{z}\)  (2.29)
for some scalar c_0, so that \(\mathbf{r}_{1} \in \mathcal{K}_{2}(\mathbf{G}'\mathbf{G},\mathbf{G}'\mathbf{z})\) because \(\mathbf{G}'\mathbf{G}\mathbf{d}_{0} \in \mathcal{K}_{2}(\mathbf{G}'\mathbf{G},\mathbf{G}'\mathbf{z})\). From Step 2(e), we also have
\(\mathbf{d}_1 = (1 + b_0)\mathbf{G}'\mathbf{z} - c_0^*(\mathbf{G}'\mathbf{G})\mathbf{G}'\mathbf{z}\)  (2.30)
for some \(c_0^*\), so that \(\mathbf{d}_{1} \in \mathcal{K}_{2}(\mathbf{G}'\mathbf{G},\mathbf{G}'\mathbf{z})\). This shows that \(\mbox{Sp}(\mathbf{R}_{2}) = \mbox{Sp}(\mathbf{D}_{2}) = \mathcal{K}_{2}(\mathbf{G}'\mathbf{G},\mathbf{G}'\mathbf{z})\). Similarly, we have \(\mathbf{r}_{2} \in \mathcal{K}_{3}(\mathbf{G}'\mathbf{G},\mathbf{G}'\mathbf{z})\) and \(\mathbf{d}_{2} \in \mathcal{K}_{3}(\mathbf{G}'\mathbf{G},\mathbf{G}'\mathbf{z})\), so that \(\mbox{Sp}(\mathbf{R}_{3}) = \mbox{Sp}(\mathbf{D}_{3}) = \mathcal{K}_{3}(\mathbf{G}'\mathbf{G},\mathbf{G}'\mathbf{z})\), and so on.
The property of D_j above implies that Sp(W_S) is identical to Sp(D_S), which in turn implies that
\(\hat{\mathbf{b}}_{CG}^{(S)} = \mathbf{D}_S(\mathbf{D}_S'\mathbf{G}'\mathbf{G}\mathbf{D}_S)^{-1}\mathbf{D}_S'\mathbf{G}'\mathbf{z}\)  (2.31)
is identical to \(\hat{\mathbf{b}}_{CLSE}^{(S)}\) as defined in Eq. (2.7), which in turn is equal to \(\hat{\mathbf{b}}_{PLSE}^{(S)}\) defined in Eq. (2.2) (Phatak and de Hoog 2002) by virtue of Eq. (2.14). It remains to show that \(\hat{\mathbf{b}}_{CG}^{(S)}\) defined in (2.31) coincides with b_S generated by the CG algorithm. By the G′G-conjugacy of the d_j's (the orthogonality of the d_j's with respect to G′G, i.e., \(\mathbf{d}_i'\mathbf{G}'\mathbf{G}\mathbf{d}_j = 0\) for any i ≠ j, as will be shown later), Eq. (2.31) can be rewritten as
\(\hat{\mathbf{b}}_{CG}^{(S)} = \sum_{j=0}^{S-1}\frac{\mathbf{d}_j'\mathbf{G}'\mathbf{z}}{\mathbf{d}_j'\mathbf{G}'\mathbf{G}\mathbf{d}_j}\,\mathbf{d}_j.\)  (2.32)
From Step 2(b) of the CG algorithm, on the other hand, we have
\(\mathbf{b}_1 = a_0\mathbf{d}_0 = \frac{\mathbf{d}_0'\mathbf{G}'\mathbf{z}}{\mathbf{d}_0'\mathbf{G}'\mathbf{G}\mathbf{d}_0}\,\mathbf{d}_0,\)  (2.33)
and
\(\mathbf{b}_2 = \mathbf{b}_1 + a_1\mathbf{d}_1 = \sum_{j=0}^{1}\frac{\mathbf{d}_j'\mathbf{G}'\mathbf{z}}{\mathbf{d}_j'\mathbf{G}'\mathbf{G}\mathbf{d}_j}\,\mathbf{d}_j,\)  (2.34)
since \(\mathbf{d}_{1}'\mathbf{r}_{1} = \mathbf{d}_{1}'\mathbf{Q}_{d_{0}/G'G}'\mathbf{r}_{0} = \mathbf{d}_{1}'\mathbf{r}_{0} = \mathbf{d}_{1}'\mathbf{G}'\mathbf{z}\) (the second equality holds due to the G′G-conjugacy of d_1 and d_0). Similarly, we obtain
\(\mathbf{b}_3 = \sum_{j=0}^{2}\frac{\mathbf{d}_j'\mathbf{G}'\mathbf{z}}{\mathbf{d}_j'\mathbf{G}'\mathbf{G}\mathbf{d}_j}\,\mathbf{d}_j,\)  (2.35)
since \(\mathbf{d}_{2}'\mathbf{r}_{2} = \mathbf{d}_{2}'\mathbf{Q}_{d_{1}/G'G}'\mathbf{r}_{1} = \mathbf{d}_{2}'\mathbf{r}_{1} = \mathbf{d}_{2}'\mathbf{Q}_{d_{0}/G'G}'\mathbf{r}_{0} = \mathbf{d}_{2}'\mathbf{r}_{0} = \mathbf{d}_{2}'\mathbf{G}'\mathbf{z}\). This extends to S larger than 3, and proves the claim made above that (2.31) is indeed identical to the b_S obtained from the CG iteration.
It is rather intricate to show the G′G-conjugacy of direction vectors (i.e., d j ′G′Gd i = 0 for j ≠ i), although it is widely known in the numerical linear algebra literature (Golub and van Loan 1989). The proofs given in Golub and van Loan (1989) are not very easy to follow, however. In what follows, we attempt to provide a step-by-step proof of this fact. Let R j and D j be as defined above. We temporarily assume that the columns of D j are already G′G-conjugate (i.e., D j ′G′GD j is diagonal). Later we show that such construction of D j is possible.
We first show that
\(\mathbf{d}_{j-1}'\mathbf{r}_j = 0.\)  (2.36)
From Step 2(c) of the CG algorithm, we have
\(\mathbf{d}_{j-1}'\mathbf{r}_j = \mathbf{d}_{j-1}'\mathbf{r}_{j-1} - a_{j-1}\mathbf{d}_{j-1}'\mathbf{G}'\mathbf{G}\mathbf{d}_{j-1} = \mathbf{d}_{j-1}'\mathbf{r}_{j-1} - \mathbf{d}_{j-1}'\mathbf{r}_{j-1} = 0,\)  (2.37)
as claimed above. We next show that
\(\mathbf{d}_{j-2}'\mathbf{r}_j = 0,\)  (2.38)
based on (2.36). From Step 2(c) of the algorithm, we have
\(\mathbf{d}_{j-2}'\mathbf{r}_j = \mathbf{d}_{j-2}'\mathbf{r}_{j-1} - a_{j-1}\mathbf{d}_{j-2}'\mathbf{G}'\mathbf{G}\mathbf{d}_{j-1} = \mathbf{d}_{j-2}'\mathbf{r}_{j-1} = 0,\)  (2.39)
as claimed. Note that \(\mathbf{d}_{j-2}'\mathbf{G}'\mathbf{G}\mathbf{d}_{j-1} = 0\) by the assumption of the G′G-conjugacy (among the column vectors) of D_j. The last equality in (2.39) holds due to (2.36). By repeating essentially the same process, we can prove that \(\mathbf{d}_{j-k}'\mathbf{r}_j = 0\) for k = 3, ⋯, j, which implies
\(\mathbf{D}_j'\mathbf{r}_j = \mathbf{0},\)  (2.40)
and
\(\mathbf{R}_j'\mathbf{r}_j = \mathbf{0},\)  (2.41)
since \(\mbox{Sp}(\mathbf{D}_{j}) = \mbox{Sp}(\mathbf{R}_{j}) = \mathcal{K}_{j}(\mathbf{G}'\mathbf{G},\mathbf{G}'\mathbf{z})\). These relations indicate that in the CG method, the residual vector r_j is orthogonal to all previous search directions as well as to all previous residual vectors.
We are now in a position to prove that
\(\mathbf{d}_{j-1}'\mathbf{G}'\mathbf{G}\mathbf{d}_j = 0.\)  (2.42)
To do so, we first need to show that
\(\mathbf{r}_j'\mathbf{d}_j = \|\mathbf{r}_j\|^2,\)  (2.43)
and also that
\(\mathbf{r}_{j-1}'\mathbf{d}_j = \|\mathbf{r}_j\|^2.\)  (2.44)
For Eq. (2.43), we note that
\(\mathbf{r}_j'\mathbf{d}_j = \mathbf{r}_j'(\mathbf{r}_j + b_{j-1}\mathbf{d}_{j-1}) = \|\mathbf{r}_j\|^2 + b_{j-1}\mathbf{r}_j'\mathbf{d}_{j-1} = \|\mathbf{r}_j\|^2,\)  (2.45)
due to Eq. (2.36). For Eq. (2.44), we have
\(\mathbf{r}_{j-1}'\mathbf{d}_j = \mathbf{r}_{j-1}'\mathbf{r}_j + b_{j-1}\mathbf{r}_{j-1}'\mathbf{d}_{j-1} = b_{j-1}\|\mathbf{r}_{j-1}\|^2 = \|\mathbf{r}_j\|^2.\)  (2.46)
To show that (2.42) holds is now straightforward. We note that
\(\mathbf{r}_j'\mathbf{d}_j = (\mathbf{r}_{j-1} - a_{j-1}\mathbf{G}'\mathbf{G}\mathbf{d}_{j-1})'\mathbf{d}_j = \mathbf{r}_{j-1}'\mathbf{d}_j - a_{j-1}\mathbf{d}_{j-1}'\mathbf{G}'\mathbf{G}\mathbf{d}_j\)  (2.47)
by Step 2(c), and that \(\mathbf{r}_j'\mathbf{d}_j = \mathbf{r}_{j-1}'\mathbf{d}_j = \|\mathbf{r}_j\|^2\) by Eqs. (2.43) and (2.44). Since \(a_{j-1} \neq 0\), this implies that \(\mathbf{d}_{j-1}'\mathbf{G}'\mathbf{G}\mathbf{d}_j = 0\). That is, d_j is G′G-conjugate to the previous direction vector d_{j−1}.
We can also show that d_j is G′G-conjugate to all previous direction vectors, despite the fact that at any specific iteration, d_j is taken to be G′G-conjugate to only d_{j−1}. We begin with
\(\mathbf{d}_{j-2}'\mathbf{G}'\mathbf{G}\mathbf{d}_j = 0.\)  (2.48)
We first note that
\(\mathbf{r}_{j-2}'\mathbf{d}_j = \mathbf{r}_{j-2}'\mathbf{r}_j + b_{j-1}\mathbf{r}_{j-2}'\mathbf{d}_{j-1} = b_{j-1}\|\mathbf{r}_{j-1}\|^2 = \|\mathbf{r}_j\|^2.\)  (2.49)
We also have
\(\mathbf{r}_{j-1}'\mathbf{d}_j = (\mathbf{r}_{j-2} - a_{j-2}\mathbf{G}'\mathbf{G}\mathbf{d}_{j-2})'\mathbf{d}_j = \mathbf{r}_{j-2}'\mathbf{d}_j - a_{j-2}\mathbf{d}_{j-2}'\mathbf{G}'\mathbf{G}\mathbf{d}_j\)  (2.50)
by Step 2(c). Since \(\mathbf{r}_{j-1}'\mathbf{d}_j = \mathbf{r}_{j-2}'\mathbf{d}_j = \|\mathbf{r}_j\|^2\) and \(a_{j-2} \neq 0\), this implies (2.48). We may follow a similar line of argument as above, and show that \(\mathbf{d}_{j-k}'\mathbf{G}'\mathbf{G}\mathbf{d}_j = 0\) for k = 3, ⋯, j. This shows that \(\mathbf{D}_j'\mathbf{G}'\mathbf{G}\mathbf{d}_j = \mathbf{0}\), as claimed.
In the proof above, it was assumed that the column vectors of D j were G′G-conjugate. It remains to show that such construction of D j is possible. We have D 1′r 1 = d 0′r 1 = 0 by (2.36). This implies that R 1′r 1 = 0 (since Sp(D 1) = Sp(R 1)), which in turn implies that D 1′G′Gd 1 = d 0′G′Gd 1 = 0. The columns of D 2 = [d 0, d 1] are now shown to be G′G-conjugate. We repeat this process until we reach D j whose column vectors are all G′G-conjugate. This process also generates R j whose columns are mutually orthogonal. This means that all residual vectors are orthogonal in the CG method. The CG algorithm is also equivalent to the GMRES (Generalized Minimum Residual) method (Saad and Schultz 1986), when the latter is applied to the symmetric positive definite (pd) matrix G′G.
It may also be pointed out that R_S is an un-normalized version of the W_S obtained in PLS1. This can be seen from the fact that the column vectors of both of these matrices are orthogonal to each other, and that \(\mbox{Sp}(\mathbf{W}_{S}) = \mbox{Sp}(\mathbf{R}_{S}) = \mathcal{K}_{S}(\mathbf{G}'\mathbf{G},\mathbf{G}'\mathbf{z})\). Although some columns of R_S may be sign-reversed, as are some columns of U_S in the Lanczos method, it can be directly verified that this does not happen to r_2 (i.e., \(\mathbf{r}_2/\|\mathbf{r}_2\| = \mathbf{w}_2\)), so it is not likely to happen to other columns of R_S.
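This claim can be spot-checked numerically. With the residuals indexed from zero as in the CG algorithm above, r_0 and the next residual normalize to w_1 and w_2 with no sign change (a minimal check on simulated data; all names are illustrative):

```python
import numpy as np

rng = np.random.default_rng(5)
G = rng.standard_normal((25, 4))
G -= G.mean(axis=0)                  # column-wise centered
z = rng.standard_normal(25)

# PLS1: first two weight vectors
w1 = G.T @ z; w1 /= np.linalg.norm(w1)
t1 = G @ w1;  t1 /= np.linalg.norm(t1)
G1 = G - np.outer(t1, G.T @ t1)      # deflation
w2 = G1.T @ z; w2 /= np.linalg.norm(w2)

# CG on G'G b = G'z: first two residuals
A = G.T @ G
r0 = G.T @ z
d0 = r0.copy()
a0 = (r0 @ r0) / (d0 @ A @ d0)
r1 = r0 - a0 * (A @ d0)
```

Normalizing each residual recovers the corresponding PLS1 weight vector with the same sign, unlike the even-numbered Lanczos vectors.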
5 Concluding Remarks
The PLS1 algorithm was initially invented as a heuristic technique to solve LS problems (Wold 1966). No optimality properties of the algorithm were known at that time, and for a long time it had been criticized for being somewhat ad-hoc. It was later shown, however, that it is equivalent to some of the most sophisticated numerical algorithms to date for solving systems of linear simultaneous equations, such as the Lanczos bidiagonalization and the conjugate gradient methods. It is amazing, and indeed admirable, that Herman Wold almost single-handedly reinvented the “wheel” in a totally different context.
References
Abdi, H.: Partial least squares regression. In: Salkind, N.J. (ed.) Encyclopedia of Measurement and Statistics, pp. 740–54. Sage, Thousand Oaks (2007)
Arnoldi, W.E.: The principle of minimized iterations in the solution of the matrix eigenvalue problem. Q. Appl. Math. 9, 17–29 (1951)
Bro, R., Eldén, L.: PLS works. J. Chemom. 23, 69–71 (2009)
de Jong, S.: SIMPLS: an alternative approach to partial least squares regression. Chemom. Intell. Lab. Syst. 18, 251–263 (1993)
Eldén, L.: Partial least-squares vs Lanczos bidiagonalization–I: analysis of a projection method for multiple regression. Comput. Stat. Data Anal. 46, 11–31 (2004)
Golub, G.H., van Loan, C.F.: Matrix Computations, 2nd edn. The Johns Hopkins University Press, Baltimore (1989)
Hestenes, M., Stiefel, E.: Methods of conjugate gradients for solving linear systems. J. Res. Natl. Bur. Stand. 49, 409–436 (1952)
Lohmöller, J.B.: Latent Variables Path-Modeling with Partial Least Squares. Physica-Verlag, Heidelberg (1989)
Phatak, A., de Hoog, F.: Exploiting the connection between PLS, Lanczos methods and conjugate gradients: alternative proofs of some properties of PLS. J. Chemom. 16, 361–367 (2002)
Rosipal, R., Krämer, N.: Overview and recent advances in partial least squares. In: Saunders, C., et al. (eds.) SLSFS 2005. LNCS 3940, pp. 34–51. Springer, Berlin (2006)
Saad, Y.: Iterative Methods for Sparse Linear Systems, 2nd edn. Society of Industrial and Applied Mathematics, Philadelphia (2003)
Saad, Y., Schultz, M.H.: A generalized minimal residual algorithm for solving nonsymmetric linear systems. SIAM J. Sci. Comput. 7, 856–869 (1986)
Takane, Y.: Constrained Principal Component Analysis and Related Techniques. CRC Press, Boca Raton (2014)
Wold, H.: Estimation of principal components and related models by iterative least squares. In: Krishnaiah, P.R. (ed.) Multivariate Analysis, pp. 391–420. Academic, New York (1966)
Wold, H.: Soft modeling: the basic design and some extensions. In: Jöreskog, K.G., Wold, H. (eds.) Systems Under Indirect Observations, Part 2, pp. 1–54. North-Holland, Amsterdam (1982)
© 2016 Springer International Publishing Switzerland
Takane, Y., Loisel, S. (2016). On the PLS Algorithm for Multiple Regression (PLS1). In: Abdi, H., Esposito Vinzi, V., Russolillo, G., Saporta, G., Trinchera, L. (eds) The Multiple Facets of Partial Least Squares and Related Methods. PLS 2014. Springer Proceedings in Mathematics & Statistics, vol 173. Springer, Cham. https://doi.org/10.1007/978-3-319-40643-5_2