DOI: 10.1145/2463676.2465338
tutorial

Machine learning for big data

Published: 22 June 2013

ABSTRACT

Statistical machine learning has moved from a purely academic endeavor to one of the main drivers of modern commerce and science. Moreover, recent results such as those on terascale learning [1] and on very large neural networks [2] suggest that scale is an important ingredient in building high-quality models. This tutorial introduces current applications, techniques, and systems with the aim of cross-fertilizing research between the database and machine learning communities.

The tutorial covers current large-scale applications of machine learning, their computational models, and the workflow behind building them. On this foundation, the bulk of the tutorial presents the current state of the art in systems support and identifies its critical gaps. This leads to the closing part of the tutorial, where we introduce two sets of open research questions: better systems support for the already established use cases of machine learning, and support for recent advances in machine learning research.

References

  1. A. Agarwal, O. Chapelle, M. Dudik and J. Langford, "A Reliable Effective Terascale Linear Learning System," arXiv.org, 2012.
  2. J. Dean, G. Corrado, R. Monga, K. Chen, M. Devin, Q. Le, M. Mao, A. Senior, P. Tucker, K. Yang and A. Ng, "Large Scale Distributed Deep Networks," in Advances in Neural Information Processing Systems, 2013.
  3. The Apache Project, "Apache Hadoop NextGen MapReduce (YARN)," The Apache Project, [Online]. Available: http://hadoop.apache.org/docs/r0.23.0/hadoop-yarn/hadoop-yarn-site/YARN.html.
  4. B. Hindman, A. Konwinski, M. Zaharia and I. Stoica, "A Common Substrate for Cluster Computing," in HotCloud, 2009.
  5. M. Kearns, "Efficient noise-tolerant learning from statistical queries," Journal of the ACM, pp. 392--401, 1998.
  6. C.-T. Chu, S. K. Kim, Y.-A. Lin, Y. Yu, G. Bradski and A. Y. Ng, "Map-Reduce for Machine Learning on Multicore," in Advances in Neural Information Processing Systems 19, Cambridge, MA, 2007.
  7. The Apache Foundation, "Apache Pig," 11 12 2012. [Online]. Available: http://pig.apache.org/.
  8. J. Dean and S. Ghemawat, "MapReduce: simplified data processing on large clusters," Communications of the ACM, vol. 51, pp. 107--113, 2008.
  9. The Apache Mahout Project, "Apache Mahout," 17 9 2012. [Online]. Available: http://mahout.apache.org/. [Accessed 17 9 2012].
  10. M. Zaharia, M. Chowdhury, T. Das, A. Dave, J. Ma, M. McCauley, M. J. Franklin, S. Shenker and I. Stoica, "Resilient Distributed Datasets: A Fault-Tolerant Abstraction for In-Memory Cluster Computing," in USENIX NSDI, San Jose, CA, 2012.
  11. G. Malewicz, M. H. Austern, A. J. Bik, J. C. Dehnert, I. Horn, N. Leiser and G. Czajkowski, "Pregel: a system for large-scale graph processing," in Proceedings of the 2010 ACM SIGMOD International Conference on Management of Data, Indianapolis, Indiana, USA, 2010.
  12. The Apache Software Foundation, "Apache Giraph," [Online]. Available: http://giraph.apache.org/.
  13. Y. Low, D. Bickson, J. Gonzalez, C. Guestrin, A. Kyrola and J. M. Hellerstein, "Distributed GraphLab: a framework for machine learning and data mining in the cloud," Proceedings of the VLDB Endowment, vol. 5, no. 8, pp. 716--727, April 2012.
  14. G. Hinton, "Learning Multiple Layers of Representation," Trends in Cognitive Sciences, vol. 11, pp. 428--434, 2007.
  15. G. E. Hinton, S. Osindero and Y. Teh, "A fast learning algorithm for deep belief nets," Neural Computation, vol. 18, pp. 1526--1554, 2006.
  16. J. Markoff, "How Many Computers to Identify a Cat? 16,000," The New York Times, 25 June 2012. [Online]. Available: http://www.nytimes.com/2012/06/26/technology/in-a-big-network-of-computers-evidence-of-machine-learning.html. [Accessed 11 December 2012].
  17. A. Smola, A. Ahmed and M. Weimer, "WWW 2012 Tutorial: New Templates for Scalable Data Analysis," June 2012. [Online]. Available: http://www2012.wwwconference.org/program/tutorials/ and http://cs.markusweimer.com/2012/04/06/www-2012-tutorial-new-templates-for-scalable-data-analysis/.
  18. G. Dror, N. Koenigstein, Y. Koren and M. Weimer, "The Yahoo! Music Dataset and KDD-Cup'11," in Proceedings of KDDCup 2011, San Diego, CA, 2011.

    • Published in

      SIGMOD '13: Proceedings of the 2013 ACM SIGMOD International Conference on Management of Data
      June 2013
      1322 pages
      ISBN:9781450320375
      DOI:10.1145/2463676

      Copyright © 2013 ACM

      Permission to make digital or hard copies of all or part of this work for personal or classroom use is granted without fee provided that copies are not made or distributed for profit or commercial advantage and that copies bear this notice and the full citation on the first page. Copyrights for components of this work owned by others than ACM must be honored. Abstracting with credit is permitted. To copy otherwise, or republish, to post on servers or to redistribute to lists, requires prior specific permission and/or a fee. Request permissions from permissions@acm.org.

      Publisher

      Association for Computing Machinery

      New York, NY, United States

      Acceptance Rates

      SIGMOD '13 paper acceptance rate: 76 of 372 submissions, 20%. Overall acceptance rate: 785 of 4,003 submissions, 20%.