Abstract
This paper introduces principal component analysis (PCA), partial least squares projections to latent structures (PLS), and statistical molecular design (SMD) as useful tools in deriving multi- and megavariate quantitative structure-activity relationship (QSAR) models. Two QSAR data sets from the fields of environmental toxicology and environmental chemistry are worked out in detail, showing the benefits of PCA, PLS and SMD. PCA is useful when overviewing a data set and exploring relationships among compounds and relationships among variables. PLS is the regression extension of PCA and is used for establishing QSARs. SMD is essential for selecting informative training and test sets of compounds for QSAR calibration and validation.
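As an illustration of the first two tools, the following minimal sketch computes a PCA overview and a single-response PLS model on synthetic data. Everything in it is an assumption made for illustration: the simulated descriptor matrix, the function names (`autoscale`, `pca_scores`, `pls1_fit`), and the choice of SVD for PCA and a NIPALS-style deflation for PLS. It is not the software or the data sets used in the paper, and it does not cover SMD.

```python
import numpy as np

def autoscale(X):
    """Mean-centre each column and scale it to unit variance."""
    X = np.asarray(X, dtype=float)
    return (X - X.mean(axis=0)) / X.std(axis=0, ddof=1)

def pca_scores(X, n_components):
    """PCA via SVD: returns scores T, loadings P and explained variance fractions."""
    U, s, Vt = np.linalg.svd(X, full_matrices=False)
    T = U[:, :n_components] * s[:n_components]   # compound coordinates (scores)
    P = Vt[:n_components].T                      # variable contributions (loadings)
    explained = s[:n_components] ** 2 / np.sum(s ** 2)
    return T, P, explained

def pls1_fit(X, y, n_components):
    """PLS with one response, NIPALS-style; returns b such that y_hat = X @ b."""
    X = X.copy()
    y = y.copy()
    W, P, q = [], [], []
    for _ in range(n_components):
        w = X.T @ y
        w /= np.linalg.norm(w)        # weight vector of unit length
        t = X @ w                     # X-scores for this component
        tt = t @ t
        p = X.T @ t / tt              # X-loadings
        c = y @ t / tt                # regression of y on the scores
        X -= np.outer(t, p)           # deflate X
        y -= c * t                    # deflate y
        W.append(w)
        P.append(p)
        q.append(c)
    W, P, q = np.array(W).T, np.array(P).T, np.array(q)
    return W @ np.linalg.solve(P.T @ W, q)

# Synthetic "QSAR" table (assumption): 30 compounds, 6 correlated descriptors
# generated from 2 latent properties, and one response driven by the same properties.
rng = np.random.default_rng(0)
latent = rng.normal(size=(30, 2))
X_raw = latent @ rng.normal(size=(2, 6)) + 0.1 * rng.normal(size=(30, 6))
y_raw = latent @ np.array([1.0, -0.5]) + 0.1 * rng.normal(size=30)

X = autoscale(X_raw)
y = y_raw - y_raw.mean()

T, P, explained = pca_scores(X, 2)    # overview of compounds and variables
b1 = pls1_fit(X, y, 1)                # one-component PLS model
b2 = pls1_fit(X, y, 2)                # two-component PLS model

def r2(y, y_hat):
    """Fraction of the (centred) response variation explained by the fit."""
    return 1.0 - np.sum((y - y_hat) ** 2) / np.sum(y ** 2)

r2_one, r2_two = r2(y, X @ b1), r2(y, X @ b2)
print("PCA explained variance:", np.round(explained, 3))
print("PLS R2 (1 and 2 components):", round(r2_one, 3), round(r2_two, 3))
```

The score matrix `T` plays the role of the compound overview and the loading matrix `P` that of the variable overview. The R2 values printed here are fit statistics on the training data only; assessing predictive ability would require cross-validation or an external test set.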
Abbreviations
- CC: canonical correlation
- DOOD: D-optimal onion design
- FA: factor analysis
- FD: factorial design
- FFD: fractional factorial design
- HOMO: highest occupied molecular orbital
- LDA: linear discriminant analysis
- LUMO: lowest unoccupied molecular orbital
- MLR: multiple linear regression
- NN: neural networks
- PCA: principal component analysis
- PCB: polychlorinated biphenyls
- PCR: principal component regression
- PLS: partial least squares projections to latent structures
- PLS-DA: PLS discriminant analysis
- QSAR: quantitative structure-activity relationships
- RMSEE: root mean square error of estimation
- RMSEP: root mean square error of prediction
- RR: ridge regression
- SIMCA: soft independent modelling of class analogy
- SQRT: square root
- SMD: statistical molecular design
- SVM: support vector machines
Cite this article
Eriksson, L., Andersson, P.L., Johansson, E. et al. Megavariate analysis of environmental QSAR data. Part I – A basic framework founded on principal component analysis (PCA), partial least squares (PLS), and statistical molecular design (SMD). Mol Divers 10, 169–186 (2006). https://doi.org/10.1007/s11030-006-9024-6
DOI: https://doi.org/10.1007/s11030-006-9024-6