Abstract
Simultaneously processing several large blocks of streaming data is a computationally expensive problem. Based on the incremental singular value decomposition algorithm, we propose a new procedure for calculating the factorization of the multiblock redundancy matrix \({{\textbf {M}}}\), which makes the multiblock method faster and more efficient when analyzing large streaming data and high-dimensional dense matrices. The procedure transforms a big data problem into a small one by processing small high-dimensional matrices where variables are in rows. Numerical experiments illustrate the accuracy and performance of the incremental solution for analyzing streaming multiblock redundancy data. The experiments demonstrate that the incremental algorithm may decompose a large matrix with a 75% reduction in execution time. It is more efficient to first partition the matrix \({{\textbf {M}}}\) and then decompose it with the incremental algorithm than to decompose the entire matrix \({{\textbf {M}}}\) using the standard singular value decomposition algorithm.
Notes
\(\Vert {{\textbf {t}}}_k \Vert = 1\), \({{\textbf {t}}}_k {{\textbf {t}}}_k' = {{\textbf {w}}}_k {{\textbf {X}}}_k {{\textbf {X}}}_k' {{\textbf {w}}}_k' = {{\textbf {w}}}_k ({{\textbf {X}}}_k {{\textbf {X}}}_k')^{1/2} (({{\textbf {X}}}_k {{\textbf {X}}}_k')^{1/2})' {{\textbf {w}}}_k' = {{\textbf {w}}}_k ({{\textbf {X}}}_k {{\textbf {X}}}_k')^{1/2} ({{\textbf {w}}}_k ({{\textbf {X}}}_k {{\textbf {X}}}_k')^{1/2})' = {{\textbf {b}}}_k {{\textbf {b}}}_k' = \Vert {{\textbf {b}}}_k \Vert ^2 = 1\)
\({{\textbf {A}}}_k\) is square of order q and symmetric. We can write \({{\textbf {A}}}_k = {\textbf {Y P}}_{X_k} {{\textbf {Y}}}'\) where \({{\textbf {P}}}_{X_k} = {{\textbf {X}}}_k' ({{\textbf {X}}}_k {{\textbf {X}}}_k')^{-1} {{\textbf {X}}}_k\) is the orthogonal projection operator onto the subspace spanned by the rows of \({{\textbf {X}}}_k\) (the columns of \({{\textbf {X}}}_k'\), since variables are in rows). \({{\textbf {P}}}_{X_k}\) is symmetric and idempotent. Then, \({{\textbf {A}}}_k = {\textbf {Y P}}_{X_k} {{\textbf {P}}}_{X_k}' {{\textbf {Y}}}' = ({\textbf {Y P}}_{X_k}) ({\textbf {Y P}}_{X_k})' = {\textbf {B B}}'\), with \({{\textbf {B}}} = {\textbf {Y P}}_{X_k}\). Moreover, \({{\textbf {A}}}_k\) is a positive semidefinite matrix, since \({\textbf {v A}}_k {{\textbf {v}}}' = {\textbf {v B B}}' {{\textbf {v}}}' = \Vert {\textbf {v B}} \Vert ^2 \ge 0\) for all nonzero \({{\textbf {v}}}\).
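The properties stated in this note can be checked numerically. The following is a minimal sketch in Python with NumPy, not the authors' code; the sizes `p`, `q`, `n` and the random data are assumptions chosen only for illustration, with variables stored in rows as in the text.

```python
import numpy as np

rng = np.random.default_rng(0)
p, q, n = 4, 3, 50                 # hypothetical sizes: p predictors, q responses, n observations
X = rng.standard_normal((p, n))    # block X_k, variables in rows
Y = rng.standard_normal((q, n))    # response block Y, variables in rows

# P = X'(XX')^{-1}X, formed with a linear solve instead of an explicit inverse
P = X.T @ np.linalg.solve(X @ X.T, X)

A = Y @ P @ Y.T                    # A_k = Y P Y'
B = Y @ P                          # B = Y P, so that A_k = B B'

is_symmetric = np.allclose(P, P.T)
is_idempotent = np.allclose(P @ P, P)
fixes_rows_of_X = np.allclose(X @ P, X)      # P leaves the rows of X unchanged
factorizes = np.allclose(A, B @ B.T)
min_eigenvalue = np.linalg.eigvalsh((A + A.T) / 2).min()

print(is_symmetric, is_idempotent, fixes_rows_of_X, factorizes, min_eigenvalue >= -1e-10)
```

The smallest eigenvalue is compared against a small negative tolerance rather than zero because rounding error can push an exactly zero eigenvalue slightly below zero.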
References
Baker CG, Gallivan KA, Van Dooren P (2012) Low-rank incremental methods for computing dominant singular subspaces. Linear Algebra Appl 436(8):2866–2888. https://doi.org/10.1016/j.laa.2011.07.018
Bougeard S, Hanafi M, Qannari EM (2007) ACPVI multibloc application en épidémiologie animale. J Soc Fr Stat 148(4):77–94
Bougeard S, Qannari EM, Lupo C, Hanafi M (2011a) From multiblock partial least squares to multiblock redundancy analysis: a continuum approach. Informatica 22(1):11–26. https://doi.org/10.15388/Informatica.2011.311
Bougeard S, Qannari EM, Rose N (2011b) Multiblock redundancy analysis: interpretation tools and application in epidemiology. J Chemom 25:467–475. https://doi.org/10.1002/cem.1392
Cardot H, Degras D (2018) Online principal component analysis in high dimension: which algorithm to choose? Int Stat Rev 86:29–50. https://doi.org/10.1111/insr.12220
Carroll JD (1968) Generalization of canonical correlation analysis to three or more sets of variables. In: Proceedings of the 76th annual convention APA, pp 227–228
Chan TF (1982) An improved algorithm for computing the singular value decomposition. ACM Trans Math Softw 8(1):72–83. https://doi.org/10.1145/355984.355991
D’Ambra L, Lauro C (1984) Principal components analysis onto reference subspaces. Rapporti di Ricerca NL/84 n.1, pp 1–22, Centre International de Mathématiques Pures et Appliquées
D’Ambra L, Lauro C (1992) Non symmetrical exploratory data analysis. Stat Appl 4:511–529
de Leeuw J, Young FW, Takane Y (1976) Additive structure in qualitative data: an alternating least squares method with optimal scaling features. Psychometrika 41(4):471–503. https://doi.org/10.1007/BF02296971
Degras D, Cardot H (2016) onlinePCA: online principal component analysis. R package version 1.3.1. https://cran.r-project.org/package=onlinePCA
D’Enza AI, Markos A (2015) Low-dimensional tracking of association structures in categorical data. Stat Comput 25:1009–1022. https://doi.org/10.1007/s11222-014-9470-4
D’Enza AI, Markos A, Buttarazzi D (2018) The idm package: incremental decomposition methods in R. J Stat Softw 86 Code Snippet 4. https://doi.org/10.18637/jss.v086.c04
Dongarra JJ, Demmel JW, Ostrouchov S (1992) LAPACK: a linear algebra library for high-performance computers. In: Dodge Y, Whittaker J (eds) Computational statistics. Springer, Heidelberg. https://doi.org/10.1007/978-3-662-26811-7_3
Ge Z (2017) Review on data-driven modeling and monitoring for plant-wide industrial processes. Chemom Intell Lab 171:16–25. https://doi.org/10.1016/j.chemolab.2017.09.021
Golub GH, Reinsch C (1970) Singular value decomposition and least squares solutions. Numer Math 14:403–420. https://doi.org/10.1007/BF02163027
Golub GH, Van Loan CF (1996) Matrix computations, 3rd edn. Johns Hopkins University Press, Baltimore. https://doi.org/10.1137/1028073
Hall P, Marshall D, Martin R (2002) Adding and subtracting eigenspaces with eigenvalue decomposition and singular value decomposition. Image Vis Comput 20:1009–1016. https://doi.org/10.1016/S0262-8856(02)00114-2
Horst P (1961) Relations among m sets of variables. Psychometrika 26(2):129–149. https://doi.org/10.1007/BF02289710
Hotelling H (1936) Relations between two sets of variables. Biometrika 28(3/4):321–377. https://doi.org/10.1093/biomet/28.3-4.321
Izenman AJ (1975) Reduced-rank regression for the multivariate linear model. J Multivar Anal 5:248–264. https://doi.org/10.1016/0047-259X(75)90042-1
Johansson JK (1981) An extension of Wollenberg’s redundancy analysis. Psychometrika 46(1):93–103. https://doi.org/10.1007/BF02293921
Johnson RA, Wichern DW (2007) Applied multivariate statistical analysis. Pearson Prentice Hall, Upper Saddle River
Kettenring JR (1971) Canonical analysis of several sets of variables. Biometrika 58(3):433–451. https://doi.org/10.1093/biomet/58.3.433
Legendre P, Anderson MJ (1999) Distance-based redundancy analysis: testing multispecies responses in multifactorial ecological experiments. Ecol Monogr 69(1):1–24. https://doi.org/10.1890/0012-9615(1999)069[0001:DBRATM]2.0.CO;2
Legendre P, Oksanen J, ter Braak CJF (2011) Testing the significance of canonical axes in redundancy analysis. Methods Ecol Evol 2:269–277. https://doi.org/10.1111/j.2041-210X.2010.00078.x
Levy A, Lindenbaum M (2000) Sequential Karhunen–Loeve basis extraction and its applications to images. IEEE Trans Image Process 9(8):1371–1374. https://doi.org/10.1109/83.855432
Markos A, D’Enza AI (2016) Incremental generalized canonical correlation analysis. In: Wilhelm A, Kestler H (eds) Analysis of large and complex data, studies in classification, data analysis, and knowledge organization. Springer, Cham, pp 185–194. https://doi.org/10.1007/978-3-319-25226-1_16
Martinez-Ruiz A, Montañola-Sales C (2019) Big data in multi-block data analysis: an approach to parallelizing partial least squares mode B algorithm. Heliyon 5(4):e01451. https://doi.org/10.1016/j.heliyon.2019.e01451
McArdle BH, Anderson MJ (2001) Fitting multivariate models to community data: a comment on distance-based redundancy analysis. Ecology 82(1):290–297. https://doi.org/10.1890/0012-9658(2001)082[0290:FMMTCD]2.0.CO;2
Obadia J (1978) L’analyse en composantes explicatives. Rev Stat Appl 26(4):5–28
Oja E, Karhunen J (1985) On stochastic approximation of the eigenvectors and eigenvalues of the expectation of a random matrix. J Math Anal Appl 106:69–84. https://doi.org/10.1016/0022-247X(85)90131-3
Qin SJ (2003) Statistical process monitoring: basics and beyond. J Chemom 17(8–9):480–502. https://doi.org/10.1002/cem.800
R Core Team (2022) R: a language and environment for statistical computing. R Foundation for Statistical Computing, Vienna, Austria. https://www.R-project.org/
Ramos JA, Verriest E (1984) A unifying tool for comparing stochastic realization algorithms and model reduction techniques. In: 1984 American control conference, San Diego, CA, USA, pp 150–155. https://doi.org/10.23919/ACC.1984.4788368
Rao CR (1964) The use and interpretation of principal component analysis in applied research. Sankhya Ser A 26(4):329–358
Robert P, Escoufier Y (1976) A unifying tool for linear multivariate statistical methods: the RV-coefficient. J R Stat Soc C Appl 25(3):257–265. https://doi.org/10.2307/2347233
Ross DA, Lim J, Lin RS, Yang MH (2008) Incremental learning for robust visual tracking. Int J Comput Vis 77:125–141. https://doi.org/10.1007/s11263-007-0075-7
Schafer J, Opgen-Rhein R, Zuber V, Ahdesmaki M, Duarte-Silva AP, Strimmer K (2017) corpcor: efficient estimation of covariance and (partial) correlation. R package version 1.6.9. https://cran.r-project.org/web/packages/corpcor/index.html
Smilde AK, Naes T, Liland KH (2022) Multiblock data fusion in statistics and machine learning. Applications in the natural and life sciences. Wiley, Hoboken. https://doi.org/10.1002/9781119600978
Smith B, Boyle J, Dongarra J, Garbow B, Ikebe Y, Klema V, Moler C (1976) Matrix eigensystem routines, EISPACK guide. Lecture notes in computer science, vol 6. Springer, Berlin. https://doi.org/10.1007/3-540-07546-1
Stewart GW (1993) On the early history of the singular value decomposition. SIAM Rev 35(4):551–566. https://doi.org/10.1137/1035134
Stewart D, Love W (1968) A general canonical correlation index. Psychol Bull 70(3):160–163. https://doi.org/10.1037/h0026143
Takane Y, Hwang H (2005) An extended redundancy analysis and its applications to two practical examples. Comput Stat Data Anal 49(3):785–808. https://doi.org/10.1016/j.csda.2004.06.004
Tenenhaus M (1998) La régression PLS: Théorie et pratique. Technip, Paris
Van den Wollenberg AL (1977) Redundancy analysis an alternative for canonical correlation analysis. Psychometrika 42(2):207–219. https://doi.org/10.1007/BF02294050
Wangen LE, Kowalski BR (1989) A multiblock partial least squares algorithm for investigating complex chemical systems. J Chemom 3(1):3–20. https://doi.org/10.1002/cem.1180030104
Weng J, Zhang Y, Hwang WS (2003) Candid covariance-free incremental principal component analysis. IEEE Trans Pattern Anal 25(8):1034–1040. https://doi.org/10.1109/TPAMI.2003.1217609
Young FW (1972) A model for polynomial conjoint analysis algorithms. In: Shepard RN, Romney AK, Nerlove S (eds) Multidimensional scaling: theory and applications in the behavioral sciences. Academic Press, New York
Acknowledgements
We sincerely thank the guest editors and the anonymous reviewers for their careful reading of the paper and for their helpful comments and suggestions, which greatly improved the article.
Ethics declarations
Conflict of interest
The authors declare that they have no conflict of interest.
Appendix 1: Comparison of elapsed times for \(({{\textbf {X}}} {{\textbf {X}}}')^{-1}\)
Figure 12 reports the CPU times of computing \({{\textbf {Y}}} {{\textbf {X}}}' ({{\textbf {X}}} {{\textbf {X}}}')^{-1} {{\textbf {X}}} {{\textbf {Y}}}'\) when \(({{\textbf {X}}} {{\textbf {X}}}')^{-1}\) is calculated through QR decomposition, LU decomposition, solving the system \({{\textbf {A}}} {{\textbf {x}}} = {{\textbf {b}}}\), and the spectral decomposition with subsequent modification of the resulting eigenvalues, as carried out by the R function mpower (Schafer et al. 2017). These times were obtained for data generated from a normal distribution. The multiblock setup included five blocks of variables \({{\textbf {X}}}\) and one endogenous block of variables \({{\textbf {Y}}}\), each with 10,000 observations. We processed matrices with 10, 50, 100, 250, 500, and 750 variables per block. With six blocks in total, the experiments therefore examined multiblock configurations with 60, 300, 600, 1500, 3000, and 4500 variables, respectively.
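To make the compared routes concrete, the following is a minimal sketch in Python with NumPy, not the authors' R code and not a timing benchmark. It computes \({{\textbf {Y}}} {{\textbf {X}}}' ({{\textbf {X}}} {{\textbf {X}}}')^{-1} {{\textbf {X}}} {{\textbf {Y}}}'\) for a single block \({{\textbf {X}}}\) by a linear solve, by a QR decomposition of \({{\textbf {X}}}'\), and by a spectral decomposition of \({{\textbf {X}}} {{\textbf {X}}}'\), and checks that the three results agree. The sizes and the random data are assumptions, much smaller than those in the experiments.

```python
import numpy as np

rng = np.random.default_rng(1)
p, q, n = 10, 10, 1000             # hypothetical sizes, far smaller than in Figure 12
X = rng.standard_normal((p, n))    # one predictor block, variables in rows
Y = rng.standard_normal((q, n))    # endogenous block, variables in rows

G = X @ X.T                        # p x p cross-product matrix XX'
C = Y @ X.T                        # q x p cross-product matrix YX'

# Route 1: solve G Z = C' (LU-based), so that C Z = C G^{-1} C'
M_solve = C @ np.linalg.solve(G, C.T)

# Route 2: reduced QR of X' gives X' = QR, hence X'(XX')^{-1}X = QQ'
Q, _ = np.linalg.qr(X.T)           # Q is n x p with orthonormal columns
YQ = Y @ Q
M_qr = YQ @ YQ.T

# Route 3: spectral decomposition G = V diag(w) V', inverted through 1/w
w, V = np.linalg.eigh(G)
CV = C @ V
M_eig = (CV / w) @ CV.T            # divides column j of CV by eigenvalue w[j]

max_diff = max(np.abs(M_solve - M_qr).max(), np.abs(M_solve - M_eig).max())
print(max_diff)
```

The QR route never forms \({{\textbf {X}}} {{\textbf {X}}}'\) explicitly, which is one reason such routes can differ in cost and numerical behavior; the relative timings on any given machine are not implied by this sketch.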
About this article
Cite this article
Martinez-Ruiz, A., Lauro, N.C. Incremental singular value decomposition for some numerical aspects of multiblock redundancy analysis. Comput Stat (2023). https://doi.org/10.1007/s00180-023-01418-5
DOI: https://doi.org/10.1007/s00180-023-01418-5