Knowledge Discovery from Complex High Dimensional Data

Solving Large Scale Learning Tasks. Challenges and Algorithms

Part of the book series: Lecture Notes in Computer Science (LNAI, volume 9580)

Abstract

Modern data analysis confronts problems of ever-increasing dimensionality, driven mainly by the higher resolutions available for data acquisition and by our use of larger models, with more degrees of freedom, to investigate complex systems in greater depth. High dimensionality constitutes one aspect of “big data”, and it brings not only computational but also statistical and perceptual challenges. Most data analysis problems are solved using optimization techniques, and large-scale optimization requires faster algorithms and implementations. Computed solutions must be evaluated for statistical quality, since false discoveries can otherwise be made. Recent papers suggest controlling and modifying the algorithms themselves to obtain better statistical properties. Finally, human perception inherently limits our understanding to three-dimensional spaces, making it almost impossible to grasp complex phenomena. As an aid we use dimensionality reduction or other techniques, but these usually do not capture relations between the objects of interest. Here graph-based knowledge representation has great potential, for instance to create perceivable and interactive representations and to perform new types of analysis based on graph theory and network topology. In this article, we give glimpses of new developments in these aspects.

Notes

  1. Gene Expression Omnibus, http://www.ncbi.nlm.nih.gov/geo/.

  2. The CEL files from the GEO were normalized and summarized for transcripts using the frozen RMA algorithm [55]. Only the verified (grade A) genes, according to the NetAffx probeset annotation v33.1 of Affymetrix, were then kept for further analysis (\(n=20492\) afterwards). Microarrays of low quality, with GNUSE [54] error scores \(>1\), were also discarded (\(m=392\) afterwards).

  3. OSGi Standard, http://www.osgi.org/Main/HomePage.

  4. DEX Graph Database, http://www.sparsity-technologies.com/dex.

  5. National Cancer Institute, http://www.cancer.gov.

  6. DrugBank, http://www.drugbank.ca.

  7. HCI-KDD Network, www.hci-kdd.org.

References

  1. Anderson, N.R., Lee, E.S., Brockenbrough, J.S., Minie, M.E., Fuller, S., Brinkley, J., Tarczy-Hornoch, P.: Issues in biomedical research data management and analysis: needs and barriers. J. Am. Med. Inform. Assoc. 14(4), 478–488 (2007)

  2. Bach, F.R.: Bolasso: Model consistent Lasso estimation through the bootstrap. In: 25th International Conference on Machine Learning, pp. 33–40 (2008)

  3. Banerjee, O., Ghaoui, L.E., d’Aspremont, A.: Model selection through sparse maximum likelihood estimation for multivariate Gaussian or binary data. J. Mach. Learn. Res. 9, 485–516 (2008)

  4. Barabasi, A.L., Albert, R.: Emergence of scaling in random networks. Science 286(5439), 509–512 (1999)

  5. Barabási, A., Gulbahce, N., Loscalzo, J.: Network medicine: a network-based approach to human disease. Nat. Rev. Genet. 12(1), 56–68 (2011)

  6. Beck, A., Tetruashvili, L.: On the convergence of block coordinate descent type methods. SIAM J. Optim. 23(4), 2037–2060 (2013)

  7. Beck, A., Teboulle, M.: A fast iterative shrinkage-thresholding algorithm for linear inverse problems. SIAM J. Imaging Sci. 2(1), 183–202 (2009)

  8. Bogdan, M., van den Berg, E., Sabatti, C., Su, W., Candes, E.J.: SLOPE - adaptive variable selection via convex optimization. (2014). arXiv:1407.3824

  9. Boyd, S., Parikh, N., Chu, E., Peleato, B., Eckstein, J.: Distributed optimization and statistical learning via the alternating direction method of multipliers. Found. Trends Mach. Learn. 3(1), 1–122 (2011)

  10. Bubenik, P., Kim, P.T.: A statistical approach to persistent homology. Homology Homotopy Appl. 9(2), 337–362 (2007)

  11. Castellana, B., Escuin, D., Peiró, G., Garcia-Valdecasas, B., Vázquez, T., Pons, C., Pérez-Olabarria, M., Barnadas, A., Lerma, E.: ASPN and GJB2 are implicated in the mechanisms of invasion of ductal breast carcinomas. Science 3, 175–183 (2012)

  12. Cerri, A., Fabio, B.D., Ferri, M., Frosini, P., Landi, C.: Betti numbers in multidimensional persistent homology are stable functions. Math. Methods Appl. Sci. 36(12), 1543–1557 (2013)

  13. Chen, H., Sharp, B.M.: Content-rich biological network constructed by mining PubMed abstracts. BMC Bioinformatics 5(1), 147 (2004)

  14. Cios, K.J., Moore, G.W.: Uniqueness of medical data mining. Artif. Intell. Med. 26(1), 1–24 (2002)

  15. Cook, D.J., Holder, L.B.: Graph-based data mining. IEEE Intell. Syst. Appl. 15(2), 32–41 (2000)

  16. Cox, D.R., Oakes, D.: Analysis of Survival Data. Monographs on Statistics & Applied Probability. Chapman & Hall/CRC, London (1984)

  17. Dehaspe, L., Toivonen, H.: Discovery of frequent DATALOG patterns. Data Min. Knowl. Discov. 3(1), 7–36 (1999)

  18. Iordache, O.: Methods. In: Iordache, O. (ed.) Polystochastic Models for Complexity. UCS, vol. 4, pp. 17–61. Springer, Heidelberg (2010)

  19. Dehmer, M., Basak, S.C.: Statistical and Machine Learning Approaches for Network Analysis. Wiley, Hoboken (2012)

  20. Donsa, K., Spat, S., Beck, P., Pieber, T.R., Holzinger, A.: Towards personalization of diabetes therapy using computerized decision support and machine learning: some open problems and challenges. In: Holzinger, A., Röcker, C., Ziefle, M. (eds.) Smart Health. LNCS, vol. 8700, pp. 237–260. Springer, Heidelberg (2015)

  21. Dorogovtsev, S., Mendes, J.: Evolution of Networks: From Biological Nets to the Internet and WWW. Oxford University Press, Oxford (2003)

  22. Duerr-Specht, M., Goebel, R., Holzinger, A.: Medicine and health care as a data problem: will computers become better medical doctors? In: Holzinger, A., Röcker, C., Ziefle, M. (eds.) Smart Health. LNCS, vol. 8700, pp. 21–39. Springer, Heidelberg (2015)

  23. Epstein, C., Carlsson, G., Edelsbrunner, H.: Topological data analysis. Inverse Prob. 27(12), 120201 (2011)

  24. Friedman, J., Hastie, T., Tibshirani, R.: Sparse inverse covariance estimation with the graphical Lasso. Biostatistics 9(3), 432–441 (2008)

  25. Golumbic, M.C.: Algorithmic Graph Theory and Perfect Graphs. Elsevier, Amsterdam (2004)

  26. Henderson, B.E., Feigelson, H.S.: Hormonal carcinogenesis. Carcinogenesis 21(3), 427–433 (2000)

  27. Holzinger, A.: Human-Computer Interaction and Knowledge Discovery (HCI-KDD): what is the benefit of bringing those two fields to work together? In: Cuzzocrea, A., Kittl, C., Simos, D.E., Weippl, E., Xu, L. (eds.) CD-ARES 2013. LNCS, vol. 8127, pp. 319–328. Springer, Heidelberg (2013)

  28. Holzinger, A., Dehmer, M., Jurisica, I.: Knowledge discovery and interactive data mining in bioinformatics - state-of-the-art, future challenges and research directions. BMC Bioinformatics 15(Suppl 6), I1 (2014)

  29. Holzinger, A., Jurisica, I. (eds.): Interactive Knowledge Discovery and Data Mining in Biomedical Informatics: State-of-the-Art and Future Challenges, vol. 8401. Springer, Heidelberg (2014)

  30. Holzinger, A., Jurisica, I.: Knowledge discovery and data mining in biomedical informatics: the future is in integrative, interactive machine learning solutions. In: Holzinger, A., Jurisica, I. (eds.) Interactive Knowledge Discovery and Data Mining in Biomedical Informatics. LNCS, vol. 8401, pp. 1–18. Springer, Heidelberg (2014)

  31. Holzinger, A., Malle, B., Giuliani, N.: On graph extraction from image data. In: Ślȩzak, D., Tan, A.-H., Peters, J.F., Schwabe, L. (eds.) BIH 2014. LNCS, vol. 8609, pp. 552–563. Springer, Heidelberg (2014)

  32. Holzinger, A., Ofner, B., Dehmer, M.: Multi-touch graph-based interaction for knowledge discovery on mobile devices: state-of-the-art and future challenges. In: Holzinger, A., Jurisica, I. (eds.) Interactive Knowledge Discovery and Data Mining in Biomedical Informatics. LNCS, vol. 8401, pp. 241–254. Springer, Heidelberg (2014)

  33. Holzinger, A., Ofner, B., Stocker, C., Calero Valdez, A., Schaar, A.K., Ziefle, M., Dehmer, M.: On graph entropy measures for knowledge discovery from publication network data. In: Cuzzocrea, A., Kittl, C., Simos, D.E., Weippl, E., Xu, L. (eds.) CD-ARES 2013. LNCS, vol. 8127, pp. 354–362. Springer, Heidelberg (2013)

  34. Holzinger, A., Stocker, C., Dehmer, M.: Big complex biomedical data: towards a taxonomy of data. In: Obaidat, M.S., Filipe, J. (eds.) Communications in Computer and Information Science CCIS 455, pp. 3–18. Springer, Heidelberg (2014)

  35. Huppertz, B., Holzinger, A.: Biobanks – a source of large biological data sets: open problems and future challenges. In: Holzinger, A., Jurisica, I. (eds.) Interactive Knowledge Discovery and Data Mining in Biomedical Informatics. LNCS, vol. 8401, pp. 317–330. Springer, Heidelberg (2014)

  36. Jacob, L., Obozinski, G., Vert, J.P.: Group Lasso with overlap and graph Lasso. In: Proceedings of the 26th International Conference on Machine Learning (ICML), pp. 433–440 (2009)

  37. Javanmard, A., Montanari, A.: Model selection for high-dimensional regression under the generalized irrepresentability condition. Adv. Neural Inf. Process. Syst. 26, 3012–3020 (2013)

  38. Joachims, T., Finley, T., Yu, C.N.: Cutting-plane training of structural SVMs. Mach. Learn. 77(1), 27–59 (2009)

  39. Kleinberg, J.: Navigation in a small world. Nature 406(6798), 845–845 (2000)

  40. Klopocki, E., Kristiansen, G., Wild, P.J., Klaman, I., Castanos-Velez, E., Singer, G., Stöhr, R., Simon, R., Sauter, G., Leibiger, H., Essers, L., Weber, B., Hermann, K., Rosenthal, A., Hartmann, A., Dahl, E.: Loss of SFRP1 is associated with breast cancer progression and poor prognosis in early stage tumors. Int. J. Oncol. 25(3), 641–649 (2004)

  41. Knight, K., Fu, W.: Asymptotics for Lasso-type estimators. Ann. Stat. 28(5), 1356–1378 (2000)

  42. Koontz, W., Narendra, P., Fukunaga, K.: A graph-theoretic approach to nonparametric cluster analysis. IEEE Trans. Comput. 100(9), 936–944 (1976)

  43. Kumpulainen, S., Jarvelin, K.: Barriers to task-based information access in molecular medicine. J. Am. Soc. Inf. Sci. Technol. 63(1), 86–97 (2012)

  44. Kurgan, L.A., Musilek, P.: A survey of knowledge discovery and data mining process models. Knowl. Eng. Rev. 21(01), 1–24 (2006)

  45. Lauritzen, S.L.: Graphical Models. Oxford University Press, Oxford (1996)

  46. Law, V., Knox, C., Djoumbou, Y., Jewison, T., Guo, A.C., Liu, Y.F., Maciejewski, A., Arndt, D., Wilson, M., Neveu, V., Tang, A., Gabriel, G., Ly, C., Adamjee, S., Dame, Z.T., Han, B.S., Zhou, Y., Wishart, D.S.: Drugbank 4.0: shedding new light on drug metabolism. Nucleic Acids Res. 42(D1), D1091–D1097 (2014)

  47. Lee, S.: Sparse inverse covariance estimation for graph representation of feature structure. In: Holzinger, A., Jurisica, I. (eds.) Interactive Knowledge Discovery and Data Mining in Biomedical Informatics. LNCS, vol. 8401, pp. 227–240. Springer, Heidelberg (2014)

  48. Lee, S.: Signature selection for grouped features with a case study on exon microarrays. In: Stańczyk, U., Jain, L.C. (eds.) Feature Selection for Data and Pattern Classification, pp. 329–349. Springer, Heidelberg (2015)

  49. Lee, S., Wright, S.J.: Manifold identification in dual averaging methods for regularized stochastic online learning. J. Mach. Learn. Res. 13, 1705–1744 (2012)

  50. Lilla, C., Koehler, T., Kropp, S., Wang-Gohrke, S., Chang-Claude, J.: Alcohol dehydrogenase 1B (ADH1B) genotype, alcohol consumption and breast cancer risk by age 50 years in a German case-control study. Br. J. Cancer 92(11), 2039–2041 (2005)

  51. Lodhi, H., Saunders, C., Shawe-Taylor, J., Watkins, N.C.C.: Text classification using string kernels. J. Mach. Learn. Res. 2, 419–444 (2002)

  52. Ma, K.L., Muelder, C.W.: Large-scale graph visualization and analytics. Computer 46(7), 39–46 (2013)

  53. Mattmann, C.A.: Computing: a vision for data science. Nature 493(7433), 473–475 (2013)

  54. McCall, M., Murakami, P., Lukk, M., Huber, W., Irizarry, R.: Assessing affymetrix genechip microarray quality. BMC Bioinformatics 12(1), 137 (2011)

  55. McCall, M.N., Bolstad, B.M., Irizarry, R.A.: Frozen robust multiarray analysis (fRMA). Biostatistics 11(2), 242–253 (2010)

  56. Meinshausen, N., Bühlmann, P.: High-dimensional graphs and variable selection with the Lasso. Ann. Stat. 34, 1436–1462 (2006)

  57. Meinshausen, N., Bühlmann, P.: Stability selection. J. R. Stat. Soc. Ser. B 72(4), 417–473 (2010)

  58. Müller, R.: Medikamente und Richtwerte in der Notfallmedizin, 11th edn. Ralf Müller Verlag, Graz (2012)

  59. Nesterov, Y.E.: A method of solving a convex programming problem with convergence rate \(o(1/k^2)\). Soviet Math. Dokl. 27(2), 372–376 (1983)

  60. Niakšu, O., Kurasova, O.: Data mining applications in healthcare: research vs practice. In: Databases and Information Systems Baltic DB & IS 2012, p. 58 (2012)

  61. Otasek, D., Pastrello, C., Holzinger, A., Jurisica, I.: Visual data mining: effective exploration of the biological universe. In: Holzinger, A., Jurisica, I. (eds.) Interactive Knowledge Discovery and Data Mining in Biomedical Informatics. LNCS, vol. 8401, pp. 19–33. Springer, Heidelberg (2014)

  62. Preuß, M., Dehmer, M., Pickl, S., Holzinger, A.: On terrain coverage optimization by using a network approach for universal graph-based data mining and knowledge discovery. In: Ślȩzak, D., Tan, A.-H., Peters, J.F., Schwabe, L. (eds.) BIH 2014. LNCS, vol. 8609, pp. 564–573. Springer, Heidelberg (2014)

  63. Schoenauer, M., Akrour, R., Sebag, M., Souplet, J.C.: Programming by feedback. In: Proceedings of the 31st International Conference on Machine Learning (ICML 2014), pp. 1503–1511 (2014)

  64. Spinrad, N.: Google car takes the test. Nature 514(7523), 528–528 (2014)

  65. Strogatz, S.: Exploring complex networks. Nature 410(6825), 268–276 (2001)

  66. Tibshirani, R.: Regression shrinkage and selection via the Lasso. J. R. Stat. Soc. Ser. B 58, 267–288 (1996)

  67. Tseng, P.: Convergence of a block coordinate descent method for nondifferentiable minimization. J. Optim. Theory Appl. 109(3), 475–494 (2001)

  68. Vandenberghe, L., Boyd, S., Wu, S.P.: Determinant maximization with linear matrix inequality constraints. SIAM J. Matrix Anal. Appl. 19(2), 499–533 (1998)

  69. Wagner, H., Dłotko, P., Mrozek, M.: Computational topology in text mining. In: Ferri, M., Frosini, P., Landi, C., Cerri, A., Di Fabio, B. (eds.) CTIC 2012. LNCS, vol. 7309, pp. 68–78. Springer, Heidelberg (2012)

  70. Washio, T., Motoda, H.: State of the art of graph-based data mining. ACM SIGKDD Explor. Newsl. 5(1), 59 (2003)

  71. Wishart, D.S., Knox, C., Guo, A.C., Shrivastava, S., Hassanali, M., Stothard, P., Chang, Z., Woolsey, J.: Drugbank: a comprehensive resource for in silico drug discovery and exploration. Nucleic Acids Res. 34, D668–D672 (2006)

  72. Wittkop, T., Emig, D., Truss, A., Albrecht, M., Boecker, S., Baumbach, J.: Comprehensive cluster analysis with transitivity clustering. Nat. Protoc. 6(3), 285–295 (2011)

  73. Yoshida, K., Motoda, H., Indurkhya, N.: Graph-based induction as a unified learning framework. Appl. Intell. 4(3), 297–316 (1994)

  74. Yuan, M., Lin, Y.: Model selection and estimation in regression with grouped variables. J. R. Stat. Soc. Ser. B 68, 49–67 (2006)

  75. Yuan, M., Lin, Y.: Model selection and estimation in the Gaussian graphical model. Biometrika 94(1), 19–35 (2007)

  76. Zhao, P., Yu, B.: On model selection consistency of Lasso. J. Mach. Learn. Res. 7, 2541–2563 (2006)

  77. Zhengxiang, Z., Jifa, G., Wenxin, Y., Xingsen, L.: Toward domain-driven data mining. In: International Symposium on Intelligent Information Technology Application Workshops, pp. 44–48 (2008)

  78. Zhu, X.: Persistent homology: an introduction and a new text representation for natural language processing. In: IJCAI, IJCAI/AAAI (2013)

  79. Zou, H.: The adaptive Lasso and its Oracle properties. J. Am. Stat. Assoc. 101(476), 1418–1429 (2006)

  80. Zou, H., Hastie, T.: Regularization and variable selection via the elastic net. J. R. Stat. Soc. Ser. B 67, 301–320 (2005)

  81. Zudilova-Seinstra, E., Adriaansen, T.: Visualisation and interaction for scientific exploration and knowledge discovery. Knowl. Inf. Syst. 13(2), 115–117 (2007)

Author information

Corresponding author

Correspondence to Sangkyun Lee.

Copyright information

© 2016 Springer International Publishing Switzerland

About this chapter

Cite this chapter

Lee, S., Holzinger, A. (2016). Knowledge Discovery from Complex High Dimensional Data. In: Michaelis, S., Piatkowski, N., Stolpe, M. (eds) Solving Large Scale Learning Tasks. Challenges and Algorithms. Lecture Notes in Computer Science, vol 9580. Springer, Cham. https://doi.org/10.1007/978-3-319-41706-6_7

  • DOI: https://doi.org/10.1007/978-3-319-41706-6_7

  • Publisher Name: Springer, Cham

  • Print ISBN: 978-3-319-41705-9

  • Online ISBN: 978-3-319-41706-6

  • eBook Packages: Computer Science (R0)
