A graphical model method for integrating multiple sources of genome-scale data

Daniel Dvorkin; Brian Biehs; Katerina Kechris

doi:10.1515/sagmb-2012-0051

Published by De Gruyter August 3, 2013

From the journal Statistical Applications in Genetics and Molecular Biology

https://doi.org/10.1515/sagmb-2012-0051

Abstract

Making effective use of multiple data sources is a major challenge in modern bioinformatics. Genome-wide data such as measures of transcription factor binding, gene expression, and sequence conservation are used to identify binding regions and genes that are important to major biological processes such as development and disease. These heterogeneous data types can be difficult to use together because they differ in biological meaning and statistical distribution, but each can provide valuable information for understanding the processes under study. Here we present methods for integrating multiple data sources to gain a more complete picture of gene regulation and expression. Our goal is to identify genes and cis-regulatory regions that play specific biological roles. We describe a graphical mixture model approach for data integration, examine the effect of using different model topologies, and discuss methods for evaluating the effectiveness of the models. Model fitting is computationally efficient and produces results that have clear biological and statistical interpretations. The Hedgehog and Dorsal signaling pathways in Drosophila, which are critical in embryonic development, are used as examples.

Keywords: data integration; genomics; graphical models; mixture models
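
The abstract does not spell out the model, so the following is only a minimal sketch of the general idea of a mixture model over several data sources, not the authors' method. It assumes two latent classes (for example "target" and "non-target" genes), Gaussian distributions for every source, and the simplest possible topology in which the sources are independent given the class. The function name `fit_independent_mixture` and the simulated data are hypothetical and are not taken from the paper.

```python
import numpy as np


def fit_independent_mixture(x, n_iter=200, tol=1e-8):
    """Fit a two-class mixture by EM (illustrative assumption, not the paper's model).

    x: array of shape (n_genes, n_sources), one column per data source.
    Sources are assumed to be independent Gaussians given the latent class.
    Returns mixing weights, class means, class standard deviations, and
    the posterior class probabilities for every gene.
    """
    n, _ = x.shape
    # Deterministic start: split genes at the median of their average z-score.
    score = ((x - x.mean(axis=0)) / x.std(axis=0)).mean(axis=1)
    high = score > np.median(score)
    resp = np.column_stack([~high, high]).astype(float)
    prev_ll = -np.inf
    for _ in range(n_iter):
        # M-step: weighted estimates of mixing weights, means, and variances.
        nk = resp.sum(axis=0)
        pi = nk / n
        mu = (resp.T @ x) / nk[:, None]
        var = (resp.T @ x**2) / nk[:, None] - mu**2
        var = np.maximum(var, 1e-6)  # guard against collapsing variances
        # E-step: per-source log densities are summed because the sources
        # are assumed conditionally independent given the class.
        log_pdf = -0.5 * (np.log(2 * np.pi * var)[None]
                          + (x[:, None, :] - mu[None]) ** 2 / var[None])
        log_joint = np.log(pi)[None] + log_pdf.sum(axis=2)
        log_norm = np.logaddexp(log_joint[:, 0], log_joint[:, 1])
        resp = np.exp(log_joint - log_norm[:, None])
        ll = log_norm.sum()
        if ll - prev_ll < tol:
            break
        prev_ll = ll
    return pi, mu, np.sqrt(var), resp


# Simulated example: 900 background genes and 100 "target" genes that
# score higher on each of three hypothetical data sources.
rng = np.random.default_rng(0)
n_null, n_target = 900, 100
x = np.vstack([
    rng.normal(0.0, 1.0, size=(n_null, 3)),
    rng.normal(2.5, 1.0, size=(n_target, 3)),
])
pi, mu, sd, resp = fit_independent_mixture(x)

# Treat the class with the larger average mean as the "target" class.
target_class = int(np.argmax(mu.mean(axis=1)))
called = resp[:, target_class] > 0.5
```

In this sketch the posterior probabilities in `resp` play the role of an integrated score: a gene is called a target when the evidence pooled across all sources favors the target class. Other topologies would replace the sum over independent per-source log densities with a different factorization of the joint distribution.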

Corresponding author: Daniel Dvorkin, Computational Bioscience Program, University of Colorado School of Medicine, Mail Stop 8303, 12801 E. 17th Ave., RC1S-L18 6103, Aurora, CO 80045-0511, USA

The authors thank Tom Kornberg at UCSF and Anis Karimpour-Fard at UCD. This work was supported by NIH/NLM training grant T15 LM009451 to DD.

Published Online: 2013-08-03

Published in Print: 2013-08-01
