Abstract
Making effective use of multiple data sources is a major challenge in modern bioinformatics. Genome-wide data such as measures of transcription factor binding, gene expression, and sequence conservation are used to identify binding regions and genes that are important to major biological processes such as development and disease. These heterogeneous data types can be difficult to use together because they differ in biological meaning and statistical distribution, yet each provides valuable information about the processes under study. Here we present methods for integrating multiple data sources to gain a more complete picture of gene regulation and expression. Our goal is to identify genes and cis-regulatory regions that play specific biological roles. We describe a graphical mixture model approach for data integration, examine the effect of using different model topologies, and discuss methods for evaluating the effectiveness of the models. Model fitting is computationally efficient and produces results that have clear biological and statistical interpretations. The Hedgehog and Dorsal signaling pathways in Drosophila, which are critical in embryonic development, are used as examples.
The authors thank Tom Kornberg at UCSF and Anis Karimpour-Fard at UCD. This work was supported by NIH/NLM training grant T15 LM009451 to DD.
©2013 by Walter de Gruyter Berlin Boston