Knowledge Discovery from Complex High Dimensional Data

Lee, Sangkyun; Holzinger, Andreas

doi:10.1007/978-3-319-41706-6_7

Part of the book series: Lecture Notes in Computer Science (LNAI, volume 9580)

Abstract

Modern data analysis faces problems of ever-increasing dimensionality, driven mainly by the higher resolutions available for data acquisition and by our use of larger models, with more degrees of freedom, to investigate complex systems more deeply. High dimensionality constitutes one aspect of “big data”, and it brings not only computational but also statistical and perceptual challenges. Most data analysis problems are solved using optimization techniques, and large-scale optimization requires faster algorithms and implementations. Computed solutions must be evaluated for statistical quality, since otherwise false discoveries can be made. Recent papers suggest controlling and modifying the algorithms themselves to obtain better statistical properties. Finally, human perception inherently limits our understanding to three-dimensional spaces, making it almost impossible to grasp complex phenomena directly. As an aid, we use dimensionality reduction or other techniques, but these usually do not capture relations between the objects of interest. Here graph-based knowledge representation has considerable potential, for instance to create perceivable and interactive representations and to perform new types of analysis based on graph theory and network topology. In this article, we give glimpses of new developments in these aspects.

Notes

1. Gene Expression Omnibus, http://www.ncbi.nlm.nih.gov/geo/.
2. The CEL files from the GEO were normalized and summarized for transcripts using the frozen RMA algorithm [55]. Only the verified (grade A) genes were then kept for further analysis, according to the NetAffx probeset annotation v33.1 of Affymetrix (\(n=20492\) afterwards). Microarrays of low quality, with GNUSE [54] error scores \(>1\), were also discarded (\(m=392\) afterwards).
3. OSGi Standard, http://www.osgi.org/Main/HomePage.
4. DEX Graph Database, http://www.sparsity-technologies.com/dex.
5. National Cancer Institute, http://www.cancer.gov.
6. DrugBank, http://www.drugbank.ca.
7. HCI-KDD Network, www.hci-kdd.org.

References

Anderson, N.R., Lee, E.S., Brockenbrough, J.S., Minie, M.E., Fuller, S., Brinkley, J., Tarczy-Hornoch, P.: Issues in biomedical research data management and analysis: needs and barriers. J. Am. Med. Inform. Assoc. 14(4), 478–488 (2007)
Bach, F.R.: Bolasso: Model consistent Lasso estimation through the bootstrap. In: 25th International Conference on Machine Learning, pp. 33–40 (2008)
Banerjee, O., Ghaoui, L.E., d’Aspremont, A.: Model selection through sparse maximum likelihood estimation for multivariate Gaussian or binary data. J. Mach. Learn. Res. 9, 485–516 (2008)
Barabasi, A.L., Albert, R.: Emergence of scaling in random networks. Science 286(5439), 509–512 (1999)
Barabási, A., Gulbahce, N., Loscalzo, J.: Network medicine: a network-based approach to human disease. Nat. Rev. Genet. 12(1), 56–68 (2011)
Beck, A., Tetruashvili, L.: On the convergence of block coordinate descent type methods. SIAM J. Optim. 23(4), 2037–2060 (2013)
Beck, A., Teboulle, M.: A fast iterative shrinkage-thresholding algorithm for linear inverse problems. SIAM J. Imaging Sci. 2(1), 183–202 (2009)
Bogdan, M., van den Berg, E., Sabatti, C., Su, W., Candes, E.J.: SLOPE - adaptive variable selection via convex optimization. (2014). arXiv:1407.3824
Boyd, S., Parikh, N., Chu, E., Peleato, B., Eckstein, J.: Distributed optimization and statistical learning via the alternating direction method of multipliers. Found. Trends Mach. Learn. 3(1), 1–122 (2011)
Bubenik, P., Kim, P.T.: A statistical approach to persistent homology. Homology Homotopy Appl. 9(2), 337–362 (2007)
Castellana, B., Escuin, D., Peiró, G., Garcia-Valdecasas, B., Vázquez, T., Pons, C., Pérez-Olabarria, M., Barnadas, A., Lerma, E.: ASPN and GJB2 are implicated in the mechanisms of invasion of ductal breast carcinomas. J. Cancer 3, 175–183 (2012)
Cerri, A., Fabio, B.D., Ferri, M., Frosini, P., Landi, C.: Betti numbers in multidimensional persistent homology are stable functions. Math. Methods Appl. Sci. 36(12), 1543–1557 (2013)
Chen, H., Sharp, B.M.: Content-rich biological network constructed by mining PubMed abstracts. BMC Bioinformatics 5(1), 147 (2004)
Cios, K.J., Moore, G.W.: Uniqueness of medical data mining. Artif. Intell. Med. 26(1), 1–24 (2002)
Cook, D.J., Holder, L.B.: Graph-based data mining. IEEE Intell. Syst. 15(2), 32–41 (2000)
Cox, D.R., Oakes, D.: Analysis of Survival Data. Monographs on Statistics & Applied Probability. Chapman & Hall/CRC, London (1984)
Dehaspe, L., Toivonen, H.: Discovery of frequent DATALOG patterns. Data Min. Knowl. Disc. 3(1), 7–36 (1999)
Iordache, O.: Methods. In: Iordache, O. (ed.) Polystochastic Models for Complexity. UCS, vol. 4, pp. 17–61. Springer, Heidelberg (2010)
Dehmer, M., Basak, S.C.: Statistical and Machine Learning Approaches for Network Analysis. Wiley, Hoboken (2012)
Donsa, K., Spat, S., Beck, P., Pieber, T.R., Holzinger, A.: Towards personalization of diabetes therapy using computerized decision support and machine learning: some open problems and challenges. In: Holzinger, A., Röcker, C., Ziefle, M. (eds.) Smart Health. LNCS, vol. 8700, pp. 237–260. Springer, Heidelberg (2015)
Dorogovtsev, S., Mendes, J.: Evolution of Networks: From Biological Nets to the Internet and WWW. Oxford University Press, Oxford (2003)
Duerr-Specht, M., Goebel, R., Holzinger, A.: Medicine and health care as a data problem: will computers become better medical doctors? In: Holzinger, A., Röcker, C., Ziefle, M. (eds.) Smart Health. LNCS, vol. 8700, pp. 21–39. Springer, Heidelberg (2015)
Epstein, C., Carlsson, G., Edelsbrunner, H.: Topological data analysis. Inverse Prob. 27(12), 120201 (2011)
Friedman, J., Hastie, T., Tibshirani, R.: Sparse inverse covariance estimation with the graphical Lasso. Biostatistics 9(3), 432–441 (2008)
Golumbic, M.C.: Algorithmic Graph Theory and Perfect Graphs. Elsevier, Amsterdam (2004)
Henderson, B.E., Feigelson, H.S.: Hormonal carcinogenesis. Carcinogenesis 21(3), 427–433 (2000)
Holzinger, A.: Human-Computer Interaction and Knowledge Discovery (HCI-KDD): what is the benefit of bringing those two fields to work together? In: Cuzzocrea, A., Kittl, C., Simos, D.E., Weippl, E., Xu, L. (eds.) CD-ARES 2013. LNCS, vol. 8127, pp. 319–328. Springer, Heidelberg (2013)
Holzinger, A., Dehmer, M., Jurisica, I.: Knowledge discovery and interactive data mining in bioinformatics - state-of-the-art, future challenges and research directions. BMC Bioinformatics 15(Suppl 6), I1 (2014)
Holzinger, A., Jurisica, I. (eds.): Interactive Knowledge Discovery and Data Mining in Biomedical Informatics: State-of-the-Art and Future Challenges, vol. 8401. Springer, Heidelberg (2014)
Holzinger, A., Jurisica, I.: Knowledge discovery and data mining in biomedical informatics: the future is in integrative, interactive machine learning solutions. In: Holzinger, A., Jurisica, I. (eds.) Interactive Knowledge Discovery and Data Mining in Biomedical Informatics. LNCS, vol. 8401, pp. 1–18. Springer, Heidelberg (2014)
Holzinger, A., Malle, B., Giuliani, N.: On graph extraction from image data. In: Ślȩzak, D., Tan, A.-H., Peters, J.F., Schwabe, L. (eds.) BIH 2014. LNCS, vol. 8609, pp. 552–563. Springer, Heidelberg (2014)
Holzinger, A., Ofner, B., Dehmer, M.: Multi-touch graph-based interaction for knowledge discovery on mobile devices: state-of-the-art and future challenges. In: Holzinger, A., Jurisica, I. (eds.) Interactive Knowledge Discovery and Data Mining in Biomedical Informatics. LNCS, vol. 8401, pp. 241–254. Springer, Heidelberg (2014)
Holzinger, A., Ofner, B., Stocker, C., Calero Valdez, A., Schaar, A.K., Ziefle, M., Dehmer, M.: On graph entropy measures for knowledge discovery from publication network data. In: Cuzzocrea, A., Kittl, C., Simos, D.E., Weippl, E., Xu, L. (eds.) CD-ARES 2013. LNCS, vol. 8127, pp. 354–362. Springer, Heidelberg (2013)
Holzinger, A., Stocker, C., Dehmer, M.: Big complex biomedical data: towards a taxonomy of data. In: Obaidat, M.S., Filipe, J. (eds.) Communications in Computer and Information Science CCIS 455, pp. 3–18. Springer, Heidelberg (2014)
Huppertz, B., Holzinger, A.: Biobanks – a source of large biological data sets: open problems and future challenges. In: Holzinger, A., Jurisica, I. (eds.) Interactive Knowledge Discovery and Data Mining in Biomedical Informatics. LNCS, vol. 8401, pp. 317–330. Springer, Heidelberg (2014)
Jacob, L., Obozinski, G., Vert, J.P.: Group Lasso with overlap and graph Lasso. In: Proceedings of the 26th International Conference on Machine Learning (ICML), pp. 433–440 (2009)
Javanmard, A., Montanari, A.: Model selection for high-dimensional regression under the generalized irrepresentability condition. Adv. Neural Inf. Process. Syst. 26, 3012–3020 (2013)
Joachims, T., Finley, T., Yu, C.N.: Cutting-plane training of structural SVMs. Mach. Learn. 77(1), 27–59 (2009)
Kleinberg, J.: Navigation in a small world. Nature 406(6798), 845–845 (2000)
Klopocki, E., Kristiansen, G., Wild, P.J., Klaman, I., Castanos-Velez, E., Singer, G., Stöhr, R., Simon, R., Sauter, G., Leibiger, H., Essers, L., Weber, B., Hermann, K., Rosenthal, A., Hartmann, A., Dahl, E.: Loss of SFRP1 is associated with breast cancer progression and poor prognosis in early stage tumors. Int. J. Oncol. 25(3), 641–649 (2004)
Knight, K., Fu, W.: Asymptotics for Lasso-type estimators. Ann. Stat. 28(5), 1356–1378 (2000)
Koontz, W., Narendra, P., Fukunaga, K.: A graph-theoretic approach to nonparametric cluster analysis. IEEE Trans. Comput. 100(9), 936–944 (1976)
Kumpulainen, S., Jarvelin, K.: Barriers to task-based information access in molecular medicine. J. Am. Soc. Inf. Sci. Technol. 63(1), 86–97 (2012)
Kurgan, L.A., Musilek, P.: A survey of knowledge discovery and data mining process models. Knowl. Eng. Rev. 21(01), 1–24 (2006)
Lauritzen, S.L.: Graphical Models. Oxford University Press, Oxford (1996)
Law, V., Knox, C., Djoumbou, Y., Jewison, T., Guo, A.C., Liu, Y.F., Maciejewski, A., Arndt, D., Wilson, M., Neveu, V., Tang, A., Gabriel, G., Ly, C., Adamjee, S., Dame, Z.T., Han, B.S., Zhou, Y., Wishart, D.S.: Drugbank 4.0: shedding new light on drug metabolism. Nucleic Acids Res. 42(D1), D1091–D1097 (2014)
Lee, S.: Sparse inverse covariance estimation for graph representation of feature structure. In: Holzinger, A., Jurisica, I. (eds.) Interactive Knowledge Discovery and Data Mining in Biomedical Informatics. LNCS, vol. 8401, pp. 227–240. Springer, Heidelberg (2014)
Lee, S.: Signature selection for grouped features with a case study on exon microarrays. In: Stańczyk, U., Jain, L.C. (eds.) Feature Selection for Data and Pattern Classification, pp. 329–349. Springer, Heidelberg (2015)
Lee, S., Wright, S.J.: Manifold identification in dual averaging methods for regularized stochastic online learning. J. Mach. Learn. Res. 13, 1705–1744 (2012)
Lilla, C., Koehler, T., Kropp, S., Wang-Gohrke, S., Chang-Claude, J.: Alcohol dehydrogenase 1B (ADH1B) genotype, alcohol consumption and breast cancer risk by age 50 years in a German case-control study. Br. J. Cancer 92(11), 2039–2041 (2005)
Lodhi, H., Saunders, C., Shawe-Taylor, J., Cristianini, N., Watkins, C.: Text classification using string kernels. J. Mach. Learn. Res. 2, 419–444 (2002)
Ma, K.L., Muelder, C.W.: Large-scale graph visualization and analytics. Computer 46(7), 39–46 (2013)
Mattmann, C.A.: Computing: a vision for data science. Nature 493(7433), 473–475 (2013)
McCall, M., Murakami, P., Lukk, M., Huber, W., Irizarry, R.: Assessing affymetrix genechip microarray quality. BMC Bioinformatics 12(1), 137 (2011)
McCall, M.N., Bolstad, B.M., Irizarry, R.A.: Frozen robust multiarray analysis (fRMA). Biostatistics 11(2), 242–253 (2010)
Meinshausen, N., Bühlmann, P.: High-dimensional graphs and variable selection with the Lasso. Ann. Stat. 34, 1436–1462 (2006)
Meinshausen, N., Bühlmann, P.: Stability selection. J. Roy. Stat. Soc. Ser. B 72(4), 417–473 (2010)
Müller, R.: Medikamente und Richtwerte in der Notfallmedizin, 11th edn. Ralf Müller Verlag, Graz (2012)
Nesterov, Y.E.: A method of solving a convex programming problem with convergence rate \(O(1/k^2)\). Soviet Math. Dokl. 27(2), 372–376 (1983)
Niakšu, O., Kurasova, O.: Data mining applications in healthcare: research vs practice. In: Databases and Information Systems Baltic DB & IS 2012, p. 58 (2012)
Otasek, D., Pastrello, C., Holzinger, A., Jurisica, I.: Visual data mining: effective exploration of the biological universe. In: Holzinger, A., Jurisica, I. (eds.) Interactive Knowledge Discovery and Data Mining in Biomedical Informatics. LNCS, vol. 8401, pp. 19–33. Springer, Heidelberg (2014)
Preuß, M., Dehmer, M., Pickl, S., Holzinger, A.: On terrain coverage optimization by using a network approach for universal graph-based data mining and knowledge discovery. In: Ślȩzak, D., Tan, A.-H., Peters, J.F., Schwabe, L. (eds.) BIH 2014. LNCS, vol. 8609, pp. 564–573. Springer, Heidelberg (2014)
Schoenauer, M., Akrour, R., Sebag, M., Souplet, J.C.: Programming by feedback. In: Proceedings of the 31st International Conference on Machine Learning (ICML 2014), pp. 1503–1511 (2014)
Spinrad, N.: Google car takes the test. Nature 514(7523), 528–528 (2014)
Strogatz, S.: Exploring complex networks. Nature 410(6825), 268–276 (2001)
Tibshirani, R.: Regression shrinkage and selection via the Lasso. J. Roy. Stat. Soc. Ser. B 58, 267–288 (1996)
Tseng, P.: Convergence of a block coordinate descent method for nondifferentiable minimization. J. Optim. Theory Appl. 109(3), 475–494 (2001)
Vandenberghe, L., Boyd, S., Wu, S.P.: Determinant maximization with linear matrix inequality constraints. SIAM J. Matrix Anal. Appl. 19(2), 499–533 (1998)
Wagner, H., Dłotko, P., Mrozek, M.: Computational topology in text mining. In: Ferri, M., Frosini, P., Landi, C., Cerri, A., Di Fabio, B. (eds.) CTIC 2012. LNCS, vol. 7309, pp. 68–78. Springer, Heidelberg (2012)
Washio, T., Motoda, H.: State of the art of graph-based data mining. ACM SIGKDD Explor. Newsl. 5(1), 59 (2003)
Wishart, D.S., Knox, C., Guo, A.C., Shrivastava, S., Hassanali, M., Stothard, P., Chang, Z., Woolsey, J.: Drugbank: a comprehensive resource for in silico drug discovery and exploration. Nucleic Acids Res. 34, D668–D672 (2006)
Wittkop, T., Emig, D., Truss, A., Albrecht, M., Boecker, S., Baumbach, J.: Comprehensive cluster analysis with transitivity clustering. Nat. Protoc. 6(3), 285–295 (2011)
Yoshida, K., Motoda, H., Indurkhya, N.: Graph-based induction as a unified learning framework. Appl. Intell. 4(3), 297–316 (1994)
Yuan, M., Lin, Y.: Model selection and estimation in regression with grouped variables. J. Roy. Stat. Soc. Ser. B 68, 49–67 (2006)
Yuan, M., Lin, Y.: Model selection and estimation in the Gaussian graphical model. Biometrika 94(1), 19–35 (2007)
Zhao, P., Yu, B.: On model selection consistency of Lasso. J. Mach. Learn. Res. 7, 2541–2563 (2006)
Zhengxiang, Z., Jifa, G., Wenxin, Y., Xingsen, L.: Toward domain-driven data mining. In: International Symposium on Intelligent Information Technology Application Workshops, pp. 44–48 (2008)
Zhu, X.: Persistent homology: an introduction and a new text representation for natural language processing. In: IJCAI, IJCAI/AAAI (2013)
Zou, H.: The adaptive Lasso and its Oracle properties. J. Am. Stat. Assoc. 101(476), 1418–1429 (2006)
Zou, H., Hastie, T.: Regularization and variable selection via the elastic net. J. Roy. Stat. Soc. Ser. B 67, 301–320 (2005)
Zudilova-Seinstra, E., Adriaansen, T.: Visualisation and interaction for scientific exploration and knowledge discovery. Knowl. Inf. Syst. 13(2), 115–117 (2007)

Author information

Authors and Affiliations

Artificial Intelligence Unit LS8, Computer Science Department, Technische Universität Dortmund, Dortmund, Germany
Sangkyun Lee
Research Unit HCI-KDD, Institute for Medical Informatics, Statistics and Documentation, Medical University Graz, Graz, Austria
Andreas Holzinger
Institute for Information Systems and Computer Media, Graz University of Technology, Graz, Austria
Andreas Holzinger

Corresponding author

Correspondence to Sangkyun Lee.

Editor information

Editors and Affiliations

TU Dortmund, Dortmund, Germany
Stefan Michaelis
TU Dortmund, Dortmund, Germany
Nico Piatkowski
TU Dortmund, Dortmund, Germany
Marco Stolpe

About this chapter

Cite this chapter

Lee, S., Holzinger, A. (2016). Knowledge Discovery from Complex High Dimensional Data. In: Michaelis, S., Piatkowski, N., Stolpe, M. (eds) Solving Large Scale Learning Tasks. Challenges and Algorithms. Lecture Notes in Computer Science, vol 9580. Springer, Cham. https://doi.org/10.1007/978-3-319-41706-6_7

DOI: https://doi.org/10.1007/978-3-319-41706-6_7
Published: 03 July 2016
Publisher Name: Springer, Cham
Print ISBN: 978-3-319-41705-9
Online ISBN: 978-3-319-41706-6
eBook Packages: Computer Science, Computer Science (R0)
