Application of Mixture Models to Large Datasets

  • Chapter
Big Data Analytics

Abstract

Mixture distributions are commonly applied for modelling and for discriminant and cluster analyses in a wide variety of situations. We first consider normal and t-mixture models. As they are highly parameterized, we review methods that enable them to be fitted to large datasets involving many observations and variables. Attention is then given to extensions of these mixture models to mixtures of skew normal and skew t-distributions for the segmentation of data into clusters of non-elliptical shape. The focus is then on the latter models in conjunction with the JCM (joint clustering and matching) procedure, which provides an automated approach to the clustering of cells in a flow cytometry sample in which a large number of cells and their associated markers have been measured. For a class of multiple samples, we consider the use of JCM for matching the sample-specific clusters across the samples in the class and for improving the clustering of each individual sample. The supervised classification of a sample is also considered in the case where there are different classes of samples corresponding, for example, to different outcomes or treatment strategies for patients undergoing medical screening or treatment.
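
As a rough illustration of the simplest model mentioned above, the sketch below fits a univariate normal mixture by the EM algorithm using only the Python standard library. It is not the authors' implementation: the function name `fit_normal_mixture`, the quantile-based starting values, the variance floor, the simulated data and all numerical settings are assumptions made here for illustration.

```python
import math
import random


def normal_pdf(x, mean, var):
    """Density of a univariate normal distribution with the given mean and variance."""
    return math.exp(-0.5 * (x - mean) ** 2 / var) / math.sqrt(2.0 * math.pi * var)


def fit_normal_mixture(data, n_components=2, n_iter=200, tol=1e-8):
    """Fit a univariate normal mixture by EM (illustrative sketch only).

    Returns (weights, means, variances, loglik), where loglik is the
    log-likelihood evaluated at the parameters entering the last iteration.
    """
    n = len(data)
    ordered = sorted(data)
    # Assumed initialization: means at evenly spaced sample quantiles,
    # a common variance equal to the overall sample variance, equal weights.
    means = [ordered[int((k + 0.5) * n / n_components)] for k in range(n_components)]
    overall_mean = sum(data) / n
    overall_var = sum((x - overall_mean) ** 2 for x in data) / n
    variances = [overall_var] * n_components
    weights = [1.0 / n_components] * n_components

    prev_loglik = -math.inf
    loglik = prev_loglik
    for _ in range(n_iter):
        # E-step: posterior probability that each observation belongs to each component.
        resp = []
        loglik = 0.0
        for x in data:
            dens = [w * normal_pdf(x, m, v) for w, m, v in zip(weights, means, variances)]
            total = sum(dens)
            loglik += math.log(total)
            resp.append([d / total for d in dens])

        # M-step: weighted updates of mixing proportions, means and variances.
        for k in range(n_components):
            nk = sum(r[k] for r in resp)
            weights[k] = nk / n
            means[k] = sum(r[k] * x for r, x in zip(resp, data)) / nk
            var_k = sum(r[k] * (x - means[k]) ** 2 for r, x in zip(resp, data)) / nk
            variances[k] = max(var_k, 1e-6)  # assumed floor to avoid degenerate components

        # Stop once the log-likelihood has effectively stopped increasing.
        if loglik - prev_loglik < tol:
            break
        prev_loglik = loglik

    return weights, means, variances, loglik


# Simulated two-cluster data (hypothetical values chosen for the demonstration).
random.seed(0)
data = [random.gauss(-2.0, 1.0) for _ in range(300)]
data += [random.gauss(3.0, 1.0) for _ in range(200)]

weights, means, variances, loglik = fit_normal_mixture(data)
print("mixing proportions:", weights)
print("component means:", means)
print("component variances:", variances)
```

The multivariate normal, t, and skew mixtures discussed in the chapter follow the same alternation between an E-step and an M-step, but with more elaborate component densities and parameter updates that this sketch does not attempt to reproduce.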

Author information

Corresponding author

Correspondence to Geoffrey McLachlan.

Copyright information

© 2016 Springer India

About this chapter

Cite this chapter

Lee, S.X., McLachlan, G., Pyne, S. (2016). Application of Mixture Models to Large Datasets. In: Pyne, S., Rao, B., Rao, S. (eds) Big Data Analytics. Springer, New Delhi. https://doi.org/10.1007/978-81-322-3628-3_4
