Megavariate analysis of environmental QSAR data. Part I – A basic framework founded on principal component analysis (PCA), partial least squares (PLS), and statistical molecular design (SMD)

  • Full-length paper
  • Molecular Diversity

Abstract

This paper introduces principal component analysis (PCA), partial least squares projections to latent structures (PLS), and statistical molecular design (SMD) as useful tools for deriving multi- and megavariate quantitative structure-activity relationship (QSAR) models. Two QSAR data sets, from environmental toxicology and environmental chemistry, are worked through in detail to show the benefits of PCA, PLS and SMD. PCA is useful for obtaining an overview of a data set and for exploring relationships among compounds and among variables. PLS is the regression extension of PCA and is used for establishing QSARs. SMD is essential for selecting informative training and test sets of compounds for QSAR calibration and validation.
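
The three steps named in the abstract can be illustrated with a minimal sketch. It is not the authors' implementation and does not use the paper's data: the descriptor matrix is synthetic, all function names (`autoscale`, `pca_scores`, `select_design`, `pls1_fit`) are hypothetical, and the corner-point selection is only a simple stand-in for a factorial design laid out in PCA score space. Only NumPy is assumed.

```python
import numpy as np

def autoscale(X):
    """Centre each column and scale it to unit variance."""
    return (X - X.mean(axis=0)) / X.std(axis=0, ddof=1)

def pca_scores(X, n_components):
    """PCA of an already scaled matrix via SVD; returns the score vectors."""
    U, s, _ = np.linalg.svd(X, full_matrices=False)
    return U[:, :n_components] * s[:n_components]

def select_design(scores):
    """SMD stand-in: pick the compound nearest each corner of a 2^2 factorial
    design in the first two score vectors, plus the one nearest the centre."""
    t = scores[:, :2] / np.abs(scores[:, :2]).max(axis=0)  # rescale to [-1, 1]
    targets = np.array([[-1, -1], [-1, 1], [1, -1], [1, 1], [0, 0]], dtype=float)
    chosen = []
    for target in targets:
        order = np.argsort(np.linalg.norm(t - target, axis=1))
        chosen.append(next(int(i) for i in order if int(i) not in chosen))
    return np.array(chosen)

def pls1_fit(X, y, n_components):
    """PLS with a single response (NIPALS); X and y must already be centred.
    Returns the coefficient vector b such that y is approximated by X @ b."""
    X, y = X.copy(), y.copy()
    W, P, q = [], [], []
    for _ in range(n_components):
        w = X.T @ y
        w /= np.linalg.norm(w)      # weight vector
        t = X @ w                   # score vector
        tt = t @ t
        p = X.T @ t / tt            # X loading
        c = y @ t / tt              # y loading
        X -= np.outer(t, p)         # deflate X
        y -= c * t                  # deflate y
        W.append(w); P.append(p); q.append(c)
    W, P, q = np.array(W).T, np.array(P).T, np.array(q)
    return W @ np.linalg.solve(P.T @ W, q)

# Synthetic example: 40 "compounds", 6 descriptors driven by 2 latent factors.
rng = np.random.default_rng(0)
n = 40
latent = rng.normal(size=(n, 2))
X = latent @ rng.normal(size=(2, 6)) + 0.1 * rng.normal(size=(n, 6))
y = latent @ np.array([1.0, -0.5]) + 0.1 * rng.normal(size=n)

Xs = autoscale(X)                        # descriptors are known for all compounds
train_idx = select_design(pca_scores(Xs, 2))
test_idx = np.setdiff1d(np.arange(n), train_idx)

# Calibrate on the designed training set, validate on the remaining compounds.
x_mean, y_mean = Xs[train_idx].mean(axis=0), y[train_idx].mean()
b = pls1_fit(Xs[train_idx] - x_mean, y[train_idx] - y_mean, n_components=2)
pred = (Xs[test_idx] - x_mean) @ b + y_mean
rmsep = np.sqrt(np.mean((y[test_idx] - pred) ** 2))
print("training compounds:", train_idx, "RMSEP:", rmsep)
```

In this sketch the descriptors are scaled using all compounds, on the assumption that calculated descriptors are available before any response is measured; only the response values of the selected training compounds enter the calibration.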

Abbreviations

  • CC: canonical correlation
  • DOOD: D-optimal onion design
  • FA: factor analysis
  • FD: factorial design
  • FFD: fractional factorial design
  • HOMO: highest occupied molecular orbital
  • LDA: linear discriminant analysis
  • LUMO: lowest unoccupied molecular orbital
  • MLR: multiple linear regression
  • NN: neural networks
  • PCA: principal component analysis
  • PCB: polychlorinated biphenyls
  • PCR: principal component regression
  • PLS: partial least squares projections to latent structures
  • PLS-DA: PLS discriminant analysis
  • QSAR: quantitative structure-activity relationships
  • RMSEE: root mean square error of estimation
  • RMSEP: root mean square error of prediction
  • RR: ridge regression
  • SIMCA: soft independent modelling of class analogy
  • SQRT: square root
  • SMD: statistical molecular design
  • SVM: support vector machines

Author information

Corresponding author

Correspondence to Lennart Eriksson.

About this article

Cite this article

Eriksson, L., Andersson, P.L., Johansson, E. et al. Megavariate analysis of environmental QSAR data. Part I – A basic framework founded on principal component analysis (PCA), partial least squares (PLS), and statistical molecular design (SMD). Mol Divers 10, 169–186 (2006). https://doi.org/10.1007/s11030-006-9024-6

  • DOI: https://doi.org/10.1007/s11030-006-9024-6
