ABSTRACT
Statistical Machine Learning has undergone a phase transition from a purely academic endeavor to one of the main drivers of modern commerce and science. Moreover, recent results such as those on tera-scale learning [1] and on very large neural networks [2] suggest that scale is an important ingredient in quality modeling. This tutorial introduces current applications, techniques and systems with the aim of cross-fertilizing research between the database and machine learning communities.
The tutorial covers current large-scale applications of Machine Learning, their computational model, and the workflow behind building them. On this foundation, the bulk of the tutorial presents the current state of the art in systems support and identifies critical gaps in it. This leads to the closing part of the tutorial, where we introduce two sets of open research questions: better systems support for the already established use cases of Machine Learning, and support for recent advances in Machine Learning research.
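A recurring computational model in this space, going back to the statistical-query framework [5] and its map-reduce formulation [6], is that many learners reduce to sums of per-record statistics, which a data-parallel system can compute partition by partition. The sketch below illustrates the idea for least-squares regression; the partitioning scheme and function names are illustrative, not part of any system covered in the tutorial.

```python
# Illustrative sketch of the "summation form" of learning [5, 6]:
# each data partition emits partial sufficient statistics (map),
# the partials are summed (reduce), and the model is solved once.
import numpy as np

def map_partition(X, y):
    """Per-partition sufficient statistics for least squares (the 'map' step)."""
    return X.T @ X, X.T @ y

def reduce_stats(partials):
    """Sum the partial statistics from all partitions (the 'reduce' step)."""
    A = sum(p[0] for p in partials)
    b = sum(p[1] for p in partials)
    return A, b

def fit(partitions):
    """Solve the normal equations A w = b from the aggregated statistics."""
    A, b = reduce_stats([map_partition(X, y) for X, y in partitions])
    return np.linalg.solve(A, b)

# Two toy "partitions" of data generated from y = 2 * x (no noise).
rng = np.random.default_rng(0)
partitions = []
for _ in range(2):
    X = rng.normal(size=(50, 1))
    partitions.append((X, X @ np.array([2.0])))

w = fit(partitions)
print(w)  # recovers a weight close to 2.0
```

Because only the fixed-size statistics cross partition boundaries, communication cost is independent of the number of records, which is what makes this pattern attractive on map-reduce-style systems.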
- A. Agarwal, O. Chapelle, M. Dudik and J. Langford, "A Reliable Effective Terascale Linear Learning System," arXiv.org, 2012.
- J. Dean, G. Corrado, R. Monga, K. Chen, M. Devin, Q. Le, M. Mao, A. Senior, P. Tucker, K. Yang and A. Ng, "Large Scale Distributed Deep Networks," in Advances in Neural Information Processing Systems, 2012.
- The Apache Project, "Apache Hadoop NextGen MapReduce (YARN)." [Online]. Available: http://hadoop.apache.org/docs/r0.23.0/hadoop-yarn/hadoop-yarn-site/YARN.html.
- B. Hindman, A. Konwinski, M. Zaharia and I. Stoica, "A Common Substrate for Cluster Computing," in HotCloud, 2009.
- M. Kearns, "Efficient noise-tolerant learning from statistical queries," Journal of the ACM, pp. 392--401, 1998.
- C.-T. Chu, S. K. Kim, Y.-A. Lin, Y. Yu, G. Bradski and A. Y. Ng, "Map-Reduce for Machine Learning on Multicore," in Advances in Neural Information Processing Systems 19, Cambridge, MA, 2007.
- The Apache Foundation, "Apache Pig," 11 December 2012. [Online]. Available: http://pig.apache.org/.
- J. Dean and S. Ghemawat, "MapReduce: simplified data processing on large clusters," Communications of the ACM, vol. 51, pp. 107--113, 2008.
- The Apache Mahout Project, "Apache Mahout," 17 September 2012. [Online]. Available: http://mahout.apache.org/. [Accessed 17 September 2012].
- M. Zaharia, M. Chowdhury, T. Das, A. Dave, J. Ma, M. McCauley, M. J. Franklin, S. Shenker and I. Stoica, "Resilient Distributed Datasets: A Fault-Tolerant Abstraction for In-Memory Cluster Computing," in USENIX NSDI, San Jose, CA, 2012.
- G. Malewicz, M. H. Austern, A. J. Bik, J. C. Dehnert, I. Horn, N. Leiser and G. Czajkowski, "Pregel: a system for large-scale graph processing," in Proceedings of the 2010 ACM SIGMOD International Conference on Management of Data, Indianapolis, Indiana, USA, 2010.
- The Apache Software Foundation, "Apache Giraph." [Online]. Available: http://giraph.apache.org/.
- Y. Low, D. Bickson, J. Gonzalez, C. Guestrin, A. Kyrola and J. M. Hellerstein, "Distributed GraphLab: a framework for machine learning and data mining in the cloud," Proceedings of the VLDB Endowment, vol. 5, no. 8, pp. 716--727, April 2012.
- G. Hinton, "Learning Multiple Layers of Representation," Trends in Cognitive Sciences, vol. 11, pp. 428--434, 2007.
- G. E. Hinton, S. Osindero and Y. Teh, "A fast learning algorithm for deep belief nets," Neural Computation, vol. 18, pp. 1526--1554, 2006.
- J. Markoff, "How Many Computers to Identify a Cat? 16,000," The New York Times, 25 June 2012. [Online]. Available: http://www.nytimes.com/2012/06/26/technology/in-a-big-network-of-computers-evidence-of-machine-learning.html. [Accessed 11 December 2012].
- A. Smola, A. Ahmed and M. Weimer, "WWW 2012 Tutorial: New Templates for Scalable Data Analysis," June 2012. [Online]. Available: http://www2012.wwwconference.org/program/tutorials/ and http://cs.markusweimer.com/2012/04/06/www-2012-tutorial-new-templates-for-scalable-data-analysis/.
- G. Dror, N. Koenigstein, Y. Koren and M. Weimer, "The Yahoo! Music Dataset and KDD-Cup'11," in Proceedings of KDD-Cup 2011, San Diego, CA, 2011.
Index Terms
- Machine learning for big data