A Bayesian semiparametric factor analysis model for subtype identification

Jiehuan Sun; Joshua L. Warren; Hongyu Zhao

doi:10.1515/sagmb-2016-0051

Published by De Gruyter March 25, 2017

From the journal Statistical Applications in Genetics and Molecular Biology

https://doi.org/10.1515/sagmb-2016-0051

Abstract:

Disease subtype identification (clustering) is an important problem in biomedical research. Gene expression profiles are commonly utilized to infer disease subtypes, which often lead to biologically meaningful insights into disease. Despite many successes, existing clustering methods may not perform well when genes are highly correlated and many uninformative genes are included for clustering due to the high dimensionality. In this article, we introduce a novel subtype identification method in the Bayesian setting based on gene expression profiles. This method, called BCSub, adopts an innovative semiparametric Bayesian factor analysis model to reduce the dimension of the data to a few factor scores for clustering. Specifically, the factor scores are assumed to follow the Dirichlet process mixture model in order to induce clustering. Through extensive simulation studies, we show that BCSub has improved performance over commonly used clustering methods. When applied to two gene expression datasets, our model is able to identify subtypes that are clinically more relevant than those identified from the existing methods.

Keywords: Bayesian factor analysis; Bayesian nonparametrics; clustering; Dirichlet process; gene expression study

Acknowledgement

We thank the editor and reviewers for careful reading of our paper and for their insightful and constructive comments, which have greatly helped improve our work. Jiehuan Sun and Hongyu Zhao were supported by National Science Foundation grant DMS-1106738 and National Institutes of Health grants R01 GM59507 and P01 CA154295. Joshua L. Warren was supported by CTSA Grant Number UL1 TR001863 from the National Center for Advancing Translational Science (NCATS), a component of the National Institutes of Health (NIH). Its contents are solely the responsibility of the authors and do not necessarily represent the official view of NIH.

Appendix

Here, we provide details of the MCMC sampling algorithm used for fitting our proposed method. Let e_i be the cluster membership indicator, i.e. e_i = k means subject i belongs to cluster k. Let K be the largest possible cluster indicator in the most recent step. Let 𝚯 be the set of all parameters. Based on the model specification and the prior distributions given in the main text, we can write out the joint distribution of all parameters given data as follows.

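$$
\begin{aligned}
P(\boldsymbol{\Theta} \mid Y) ={}& \prod_{i=1}^{n} \frac{\exp\left(-\frac{1}{2}(Y_i - \boldsymbol{\Lambda}\boldsymbol{\eta}_i)^T \boldsymbol{\Sigma}^{-1} (Y_i - \boldsymbol{\Lambda}\boldsymbol{\eta}_i)\right)}{\sqrt{(2\pi)^{G} |\boldsymbol{\Sigma}|}} \\
&\times \prod_{i=1}^{n} \prod_{k=1}^{K} \left\{ \frac{\exp\left(-\frac{1}{2}(\boldsymbol{\eta}_i - \boldsymbol{\mu}_k)^T \boldsymbol{\Omega}^{-1} (\boldsymbol{\eta}_i - \boldsymbol{\mu}_k)\right)}{\sqrt{(2\pi)^{M} |\boldsymbol{\Omega}|}} \right\}^{\delta(e_i = k)} \\
&\times \prod_{k=1}^{K} \frac{\exp\left(-\frac{1}{2}\boldsymbol{\mu}_k^T (\rho I_M)^{-1} \boldsymbol{\mu}_k\right)}{\sqrt{(2\pi)^{M} |\rho I_M|}} \, \delta(\rho \in [0, 2]) \\
&\times \prod_{g=1}^{G} \frac{v_2^{v_1}}{\Gamma(v_1)} [\boldsymbol{\Sigma}]_{gg}^{-v_1 - 1} \exp\left(-\frac{v_2}{[\boldsymbol{\Sigma}]_{gg}}\right) \\
&\times \prod_{m=1}^{M} \frac{v_2^{v_1}}{\Gamma(v_1)} [\boldsymbol{\Omega}]_{mm}^{-v_1 - 1} \exp\left(-\frac{v_2}{[\boldsymbol{\Omega}]_{mm}}\right) \\
&\times \prod_{g=1}^{G} \prod_{m < \min\{M, g\}} \frac{\exp\left(-\frac{1}{2\sigma^2}([\boldsymbol{\Lambda}]_{gm})^2\right)}{\sqrt{2\pi\sigma^2}} \\
&\times \frac{\Gamma(c)}{\Gamma(c + n)} \, c^{K} \prod_{k=1}^{K} \Gamma(n_k),
\end{aligned}
$$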

where $c = 1$ is the concentration parameter in the DP prior, $v_1 = v_2 = 0.01$ are the parameters of the Inverse Gamma prior distributions for the diagonal elements of $\boldsymbol{\Sigma}$ and $\boldsymbol{\Omega}$, $n_k = \sum_{i=1}^{n} \delta(e_i = k)$ is the number of subjects in cluster $k$ in the current step, and $[\cdot]_{gm}$ denotes the element in the $g$th row and $m$th column of the matrix. Then, MCMC sampling proceeds in the following steps:

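The gene specific variances are sampled as follows.
$$[\boldsymbol{\Sigma}]_{gg} \mid \ldots \sim \text{Inverse Gamma}\left(v_1 + \frac{n}{2},\; v_2 + \frac{1}{2}\sum_{i=1}^{n}\left[(Y_i - \boldsymbol{\Lambda}\boldsymbol{\eta}_i)(Y_i - \boldsymbol{\Lambda}\boldsymbol{\eta}_i)^T\right]_{gg}\right),$$
where $\mid \ldots$ denotes conditioning on all the other parameters and the data.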
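The diagonal elements of the covariance matrix $\boldsymbol{\Omega}$ are sampled as follows.
$$[\boldsymbol{\Omega}]_{mm} \mid \ldots \sim \text{Inverse Gamma}\left(v_1 + \frac{n}{2},\; v_2^*\right),$$
where $v_2^* = v_2 + \frac{1}{2}\sum_{i=1}^{n}\sum_{k=1}^{K} \delta(e_i = k)\left[(\boldsymbol{\eta}_i - \boldsymbol{\mu}_k)(\boldsymbol{\eta}_i - \boldsymbol{\mu}_k)^T\right]_{mm}$.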
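The two Inverse Gamma updates above can be sketched in a few lines of NumPy. This is only an illustrative sketch, not the authors' implementation: the function names, the array layout (`Y` as an n × G matrix, `Lambda` as G × M, `eta` as n × M, `mu` as K × M) and the 0-based cluster labels in `e` are assumptions made here. It relies on the fact that the reciprocal of a Gamma draw with a given shape and rate is an Inverse Gamma draw with that shape and scale.

```python
import numpy as np

def sample_inverse_gamma(shape, scale, rng):
    """Draw elementwise from Inverse Gamma(shape, scale).

    If X ~ Gamma(shape, rate=scale), then 1/X ~ Inverse Gamma(shape, scale).
    NumPy parameterizes the Gamma by scale = 1/rate.
    """
    return 1.0 / rng.gamma(shape, 1.0 / scale)

def sample_sigma_diag(Y, Lambda, eta, v1=0.01, v2=0.01, rng=None):
    """Gene specific variances [Sigma]_gg, one draw per gene."""
    rng = np.random.default_rng() if rng is None else rng
    n = Y.shape[0]
    resid = Y - eta @ Lambda.T            # row i holds (Y_i - Lambda eta_i)^T
    return sample_inverse_gamma(v1 + n / 2.0,
                                v2 + 0.5 * np.sum(resid ** 2, axis=0), rng)

def sample_omega_diag(eta, mu, e, v1=0.01, v2=0.01, rng=None):
    """Factor score variances [Omega]_mm, one draw per factor."""
    rng = np.random.default_rng() if rng is None else rng
    n = eta.shape[0]
    centered = eta - mu[e]                # eta_i minus the mean of its own cluster
    return sample_inverse_gamma(v1 + n / 2.0,
                                v2 + 0.5 * np.sum(centered ** 2, axis=0), rng)
```

Only the diagonal of each outer product is needed, so the sketch sums squared residuals column by column instead of forming the full matrices.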
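The subject specific factor scores $\boldsymbol{\eta}_i$ are sampled as follows.
$$\boldsymbol{\eta}_i \mid \ldots \sim \text{MVN}\left(\Omega^*\left(\boldsymbol{\Lambda}^T\boldsymbol{\Sigma}^{-1}Y_i + \boldsymbol{\Omega}^{-1}\boldsymbol{\mu}_k\right),\; \Omega^*\right),$$
where $\Omega^* = \left(\boldsymbol{\Lambda}^T\boldsymbol{\Sigma}^{-1}\boldsymbol{\Lambda} + \boldsymbol{\Omega}^{-1}\right)^{-1}$ and $k = e_i$ is the current cluster of subject $i$.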
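As a rough illustration of the factor score update above, the following NumPy sketch draws all $\boldsymbol{\eta}_i$ at once, using the fact that the conditional covariance $\Omega^*$ does not depend on $i$. The function name, the array shapes, the 0-based labels in `e`, and the storage of $\boldsymbol{\Sigma}$ and $\boldsymbol{\Omega}$ as vectors of diagonal entries are assumptions made for this sketch.

```python
import numpy as np

def sample_eta(Y, Lambda, sigma_diag, omega_diag, mu, e, rng=None):
    """Draw every factor score vector eta_i from its Gaussian full conditional.

    Y: (n, G), Lambda: (G, M), sigma_diag: (G,), omega_diag: (M,),
    mu: (K, M), e: (n,) integer labels in 0..K-1.
    """
    rng = np.random.default_rng() if rng is None else rng
    n, M = Y.shape[0], Lambda.shape[1]
    Lt_Sinv = Lambda.T / sigma_diag                  # Lambda^T Sigma^{-1}, shape (M, G)
    prec = Lt_Sinv @ Lambda + np.diag(1.0 / omega_diag)
    cov = np.linalg.inv(prec)                        # Omega^*, shared by all subjects
    # Row i is (Lambda^T Sigma^{-1} Y_i + Omega^{-1} mu_{e_i})^T.
    lin = Y @ Lt_Sinv.T + mu[e] / omega_diag
    mean = lin @ cov                                 # cov is symmetric
    chol = np.linalg.cholesky(cov)
    # z @ chol.T has covariance chol @ chol.T = cov for standard normal z.
    return mean + rng.standard_normal((n, M)) @ chol.T
```

Sharing one Cholesky factor across subjects keeps the cost at a single M × M factorization per sweep.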
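The rows of the loading matrix $\boldsymbol{\Lambda}$ are sampled as follows.
$$[\boldsymbol{\Lambda}]_{g\cdot} \mid \ldots \sim \text{MVN}\left(\Omega_g^*\left([\boldsymbol{\Sigma}]_{gg}^{-1}\boldsymbol{\eta}^T [Y]_{\cdot g}\right),\; \Omega_g^*\right),$$
where $\Omega_g^* = \left([\boldsymbol{\Sigma}]_{gg}^{-1}\boldsymbol{\eta}^T\boldsymbol{\eta} + \sigma^{-2} I_M\right)^{-1}$, $\sigma^2$ is the prior variance for the elements of $\boldsymbol{\Lambda}$, $\boldsymbol{\eta}^T = [\boldsymbol{\eta}_1, \ldots, \boldsymbol{\eta}_n]$, and $[Y]_{\cdot g}$ is the $g$th column of $Y = [Y_1, \ldots, Y_n]^T$. For constrained rows of $\boldsymbol{\Lambda}$, where we only need to update $m$ elements, the updating rule is similar except that $\boldsymbol{\eta}$ is restricted to its first $m$ columns and the contribution of the $(m + 1)$th column of $\boldsymbol{\eta}$ is deducted from the corresponding column of $Y$.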
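The row-wise loading update can be sketched as below for the unconstrained rows only. The handling of constrained rows described above is deliberately left out, and the function name, the array shapes and the default prior variance `sigma2_prior=1.0` are assumptions of this sketch, not values taken from the paper.

```python
import numpy as np

def sample_lambda_rows(Y, eta, sigma_diag, sigma2_prior=1.0, rng=None):
    """Draw each row of Lambda from its Gaussian full conditional.

    Y: (n, G), eta: (n, M), sigma_diag: (G,) gene specific variances.
    Identifiability constraints on Lambda are NOT handled here.
    """
    rng = np.random.default_rng() if rng is None else rng
    G, M = Y.shape[1], eta.shape[1]
    EtE = eta.T @ eta                     # eta^T eta, shape (M, M)
    EtY = eta.T @ Y                       # column g is eta^T [Y]_{.g}
    Lambda = np.empty((G, M))
    for g in range(G):
        prec = EtE / sigma_diag[g] + np.eye(M) / sigma2_prior
        cov = np.linalg.inv(prec)         # Omega_g^*
        mean = cov @ (EtY[:, g] / sigma_diag[g])
        chol = np.linalg.cholesky(cov)
        Lambda[g] = mean + chol @ rng.standard_normal(M)
    return Lambda
```

Each row has its own covariance because $[\boldsymbol{\Sigma}]_{gg}$ differs by gene, which is why the loop factorizes one M × M matrix per gene.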
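The cluster specific means $\boldsymbol{\mu}_k$ are sampled as follows.
$$\boldsymbol{\mu}_k \mid \ldots \sim \text{MVN}\left(\Omega_k^*\left(\boldsymbol{\Omega}^{-1}\sum_{i=1}^{n}\boldsymbol{\eta}_i\,\delta(e_i = k)\right),\; \Omega_k^*\right),$$
where $\Omega_k^* = \left(n_k\boldsymbol{\Omega}^{-1} + (\rho I_M)^{-1}\right)^{-1}$.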
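A minimal sketch of the cluster mean update follows, assuming $\boldsymbol{\Omega}$ is diagonal (as suggested by the priors placed only on its diagonal elements), so that $\Omega_k^*$ is diagonal too and each coordinate of $\boldsymbol{\mu}_k$ can be drawn independently. The function name, the 0-based labels and the array shapes are assumptions of the sketch.

```python
import numpy as np

def sample_mu(eta, e, K, omega_diag, rho, rng=None):
    """Draw the K cluster means; eta is (n, M), e holds labels in 0..K-1."""
    rng = np.random.default_rng() if rng is None else rng
    M = eta.shape[1]
    mu = np.empty((K, M))
    for k in range(K):
        members = eta[e == k]
        n_k = members.shape[0]
        # Diagonal of Omega_k^* = (n_k Omega^{-1} + (rho I_M)^{-1})^{-1}.
        var = 1.0 / (n_k / omega_diag + 1.0 / rho)
        mean = var * (members.sum(axis=0) / omega_diag)
        mu[k] = mean + np.sqrt(var) * rng.standard_normal(M)
    return mu
```

For an empty cluster the sum is zero and the variance reduces to $\rho$, so the draw comes from the zero-mean prior with covariance $\rho I_M$ that appears in the joint distribution.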
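The cluster membership indicator for each subject is sampled as follows. For cluster $k$, which is occupied by some subjects other than subject $i$, we have
$$P_k := P(e_i = k \mid e_{-i}, \boldsymbol{\Omega}, \rho, \boldsymbol{\eta}) = l \times n_k^{(-i)} \times \Phi\left(\boldsymbol{\eta}_i;\; \tilde{\boldsymbol{\mu}}_k^{(-i)},\; \tilde{\boldsymbol{\Omega}}_k^{(-i)}\right),$$
where $l$ is a positive constant shared across all clusters, $\Phi(\cdot\,; \boldsymbol{\mu}, \boldsymbol{\Sigma})$ is the density function of the Multivariate Normal distribution with mean $\boldsymbol{\mu}$ and covariance matrix $\boldsymbol{\Sigma}$, $n_k^{(-i)} = \sum_{j=1, j \neq i}^{n} \delta(e_j = k)$ is the number of subjects in cluster $k$ excluding the $i$th subject, and $\tilde{\boldsymbol{\mu}}_k^{(-i)}$ and $\tilde{\boldsymbol{\Omega}}_k^{(-i)}$ are the cluster specific means and covariance matrices for the factor scores respectively, which can be calculated as
$$
\begin{aligned}
\tilde{\boldsymbol{\Omega}}_k^{(-i)} &= \left(n_k^{(-i)}\boldsymbol{\Omega}^{-1} + (\rho I_M)^{-1}\right)^{-1} + \boldsymbol{\Omega}, \\
\tilde{\boldsymbol{\mu}}_k^{(-i)} &= \left(n_k^{(-i)}\boldsymbol{\Omega}^{-1} + (\rho I_M)^{-1}\right)^{-1}\left(n_k^{(-i)}\boldsymbol{\Omega}^{-1}\bar{\boldsymbol{\eta}}_k^{(-i)}\right), \\
\bar{\boldsymbol{\eta}}_k^{(-i)} &= \frac{\sum_{j=1, j \neq i}^{n} \boldsymbol{\eta}_j\,\delta(e_j = k)}{n_k^{(-i)}}.
\end{aligned}
$$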
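For a new cluster $K + 1$, we have
$$P_{K+1} := P(e_i = K + 1 \mid e_{-i}, \boldsymbol{\Omega}, \rho, \boldsymbol{\eta}) = l \times c \times \Phi\left(\boldsymbol{\eta}_i;\; \mathbf{0}_M,\; \rho I_M + \boldsymbol{\Omega}\right).$$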
Then, the cluster membership indicator is drawn from a multinomial distribution, that is
$$e_i \mid e_{-i}, \boldsymbol{\Omega}, \rho, \boldsymbol{\eta} \sim \text{multinomial}(P_1, \ldots, P_{K+1}).$$
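The membership step above can be sketched for a single subject as follows. The sketch assumes a diagonal $\boldsymbol{\Omega}$ (stored as a vector), 0-based integer labels, and a zero mean for the new-cluster predictive density, consistent with the zero-mean prior on $\boldsymbol{\mu}_k$ in the joint distribution. All function names are hypothetical. Weights are computed on the log scale, so the shared constant $l$ never has to be evaluated.

```python
import numpy as np

def log_normal_diag(x, mean, var):
    """Log density at x of a Normal with diagonal covariance `var`."""
    return -0.5 * np.sum(np.log(2.0 * np.pi * var) + (x - mean) ** 2 / var)

def sample_membership(i, eta, e, omega_diag, rho, c=1.0, rng=None):
    """Redraw the label of subject i; returns (new_label, candidates, probs)."""
    rng = np.random.default_rng() if rng is None else rng
    others = np.arange(eta.shape[0]) != i
    occupied = np.unique(e[others])       # clusters still occupied without subject i
    log_w = []
    for k in occupied:
        members = eta[others & (e == k)]
        n_k = members.shape[0]
        # Diagonal of (n_k Omega^{-1} + (rho I_M)^{-1})^{-1}.
        post_var = 1.0 / (n_k / omega_diag + 1.0 / rho)
        mu_tilde = post_var * (n_k / omega_diag) * members.mean(axis=0)
        omega_tilde = post_var + omega_diag
        log_w.append(np.log(n_k) + log_normal_diag(eta[i], mu_tilde, omega_tilde))
    # New cluster: weight c times the predictive density N(0, rho I_M + Omega).
    log_w.append(np.log(c) + log_normal_diag(eta[i], 0.0, rho + omega_diag))
    log_w = np.array(log_w)
    probs = np.exp(log_w - log_w.max())   # subtracting the max avoids underflow
    probs /= probs.sum()
    candidates = np.append(occupied, e.max() + 1)   # last entry is a fresh label
    return candidates[rng.choice(len(probs), p=probs)], candidates, probs
```

A full sweep would call this for every subject in turn and relabel clusters that become empty; that bookkeeping is omitted here.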
The variance parameter $\rho$ in the base distribution of the DP prior can be sampled as follows. For convenience, we first transform the variance parameter to remove its range constraint, and the posterior sampling is conducted on the transformed parameter. Specifically, let $\tau = \log\left(\frac{a - \rho}{\rho - b}\right)$, where $a$ and $b$ are the parameters in the prior distribution $\rho \sim \text{Uniform}(a, b)$ ($a = 0$ and $b = 2$ in our case). Then, a Metropolis-Hastings updating step can be performed to draw $\tau$ using $N(\tau', \omega^2)$ as the proposal distribution, where $\tau'$ is the current value and $\omega^2$ can be used to tune the acceptance rate. The acceptance probability can be calculated as
$$A(\tau \mid \tau') = \min\left\{1,\; \frac{P(\tau)\,P(\{\boldsymbol{\mu}_k\}_{k=1}^{K} \mid \tau)}{P(\tau')\,P(\{\boldsymbol{\mu}_k\}_{k=1}^{K} \mid \tau')}\right\},$$
where $P(\tau) = \frac{\exp(\tau)}{(\exp(\tau) + 1)^2}$ and $P(\{\boldsymbol{\mu}_k\}_{k=1}^{K} \mid \tau) \propto \prod_{k=1}^{K} \Phi\left(\boldsymbol{\mu}_k;\; \mathbf{0}_M,\; \left(\frac{b\exp(\tau) + a}{\exp(\tau) + 1}\right) I_M\right)$.
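The random-walk update for $\tau$ can be sketched as below. The back-transformation $\rho = (b\exp(\tau) + a)/(\exp(\tau) + 1)$ is the inverse of the transformation defined above. The function names and the default proposal standard deviation `step_sd=0.5` (playing the role of $\omega$) are assumptions of this sketch, and the densities are handled on the log scale for numerical stability.

```python
import numpy as np

def rho_from_tau(tau, a=0.0, b=2.0):
    """Map the unconstrained tau back to rho in (a, b)."""
    return (b * np.exp(tau) + a) / (np.exp(tau) + 1.0)

def log_target(tau, mu, a=0.0, b=2.0):
    """log P(tau) + log P({mu_k} | tau), up to an additive constant."""
    rho = rho_from_tau(tau, a, b)
    log_prior = tau - 2.0 * np.log1p(np.exp(tau))        # log of exp(tau)/(exp(tau)+1)^2
    K, M = mu.shape
    log_lik = -0.5 * K * M * np.log(2.0 * np.pi * rho) - 0.5 * np.sum(mu ** 2) / rho
    return log_prior + log_lik

def mh_step_tau(tau_cur, mu, step_sd=0.5, a=0.0, b=2.0, rng=None):
    """One Metropolis-Hastings step with a symmetric Normal proposal."""
    rng = np.random.default_rng() if rng is None else rng
    tau_prop = rng.normal(tau_cur, step_sd)
    log_ratio = log_target(tau_prop, mu, a, b) - log_target(tau_cur, mu, a, b)
    if np.log(rng.uniform()) < log_ratio:                 # accept with prob min(1, ratio)
        return tau_prop, True
    return tau_cur, False
```

Because the proposal is symmetric, the proposal densities cancel and only the target ratio is needed. Calling `rho_from_tau` on the returned value gives the updated $\rho$.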

Published Online: 2017-3-25

Published in Print: 2017-4-25
