ABSTRACT
Statistical Machine Learning has undergone a phase transition from a purely academic endeavor to one of the main drivers of modern commerce and science. Moreover, recent results such as those on tera-scale learning [1] and on very large neural networks [2] suggest that scale is an important ingredient in quality modeling. This tutorial introduces current applications, techniques and systems with the aim of cross-fertilizing research between the database and machine learning communities.
The tutorial covers current large-scale applications of Machine Learning, their computational model, and the workflow behind building them. On this foundation, the bulk of the tutorial presents the current state of the art in systems support and identifies critical gaps in it. This leads to the closing part of the tutorial, where we introduce two sets of open research questions: better systems support for the already established use cases of Machine Learning, and support for recent advances in Machine Learning research.
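A recurring computational model in this space, going back to the statistical-query framework [5] and its map-reduce formulation [6], is that many learners reduce to sums of per-record statistics, which a data-parallel system can compute partition by partition. The sketch below illustrates the idea for least-squares regression; the partitioning scheme and function names are illustrative, not part of any system covered in the tutorial.

```python
# Illustrative sketch of the "summation form" of learning [5, 6]:
# each data partition emits partial sufficient statistics (map),
# the partials are summed (reduce), and the model is solved once.
import numpy as np

def map_partition(X, y):
    """Per-partition sufficient statistics for least squares (the 'map' step)."""
    return X.T @ X, X.T @ y

def reduce_stats(partials):
    """Sum the partial statistics from all partitions (the 'reduce' step)."""
    A = sum(p[0] for p in partials)
    b = sum(p[1] for p in partials)
    return A, b

def fit(partitions):
    """Solve the normal equations A w = b from the aggregated statistics."""
    A, b = reduce_stats([map_partition(X, y) for X, y in partitions])
    return np.linalg.solve(A, b)

# Two toy "partitions" of data generated from y = 2 * x (no noise).
rng = np.random.default_rng(0)
partitions = []
for _ in range(2):
    X = rng.normal(size=(50, 1))
    partitions.append((X, X @ np.array([2.0])))

w = fit(partitions)
print(w)  # recovers a weight close to 2.0
```

Because only the fixed-size statistics cross partition boundaries, communication cost is independent of the number of records, which is what makes this pattern attractive on map-reduce-style systems.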
- A. Agarwal, O. Chapelle, M. Dudik and J. Langford, "A Reliable Effective Terascale Linear Learning System," arXiv.org, 2012.
- J. Dean, G. Corrado, R. Monga, K. Chen, M. Devin, Q. Le, M. Mao, A. Senior, P. Tucker, K. Yang and A. Ng, "Large Scale Distributed Deep Networks," in Advances in Neural Information Processing Systems, 2012.
- The Apache Project, "Apache Hadoop NextGen MapReduce (YARN)." [Online]. Available: http://hadoop.apache.org/docs/r0.23.0/hadoop-yarn/hadoop-yarn-site/YARN.html.
- B. Hindman, A. Konwinski, M. Zaharia and I. Stoica, "A Common Substrate for Cluster Computing," in HotCloud, 2009.
- M. Kearns, "Efficient noise-tolerant learning from statistical queries," Journal of the ACM, pp. 392--401, 1998.
- C.-T. Chu, S. K. Kim, Y.-A. Lin, Y. Yu, G. Bradski and A. Y. Ng, "Map-Reduce for Machine Learning on Multicore," in Advances in Neural Information Processing Systems 19, Cambridge, MA, 2007.
- The Apache Foundation, "Apache Pig," 11 December 2012. [Online]. Available: http://pig.apache.org/.
- J. Dean and S. Ghemawat, "MapReduce: simplified data processing on large clusters," Communications of the ACM, vol. 51, pp. 107--113, 2008.
- The Apache Mahout Project, "Apache Mahout," 17 September 2012. [Online]. Available: http://mahout.apache.org/. [Accessed 17 September 2012].
- M. Zaharia, M. Chowdhury, T. Das, A. Dave, J. Ma, M. McCauley, M. J. Franklin, S. Shenker and I. Stoica, "Resilient Distributed Datasets: A Fault-Tolerant Abstraction for In-Memory Cluster Computing," in USENIX NSDI, San Jose, CA, 2012.
- G. Malewicz, M. H. Austern, A. J. Bik, J. C. Dehnert, I. Horn, N. Leiser and G. Czajkowski, "Pregel: a system for large-scale graph processing," in Proceedings of the 2010 ACM SIGMOD International Conference on Management of Data, Indianapolis, Indiana, USA, 2010.
- The Apache Software Foundation, "Apache Giraph." [Online]. Available: http://giraph.apache.org/.
- Y. Low, D. Bickson, J. Gonzalez, C. Guestrin, A. Kyrola and J. M. Hellerstein, "Distributed GraphLab: a framework for machine learning and data mining in the cloud," Proceedings of the VLDB Endowment, vol. 5, no. 8, pp. 716--727, April 2012.
- G. Hinton, "Learning Multiple Layers of Representation," Trends in Cognitive Sciences, vol. 11, pp. 428--434, 2007.
- G. E. Hinton, S. Osindero and Y. Teh, "A fast learning algorithm for deep belief nets," Neural Computation, vol. 18, pp. 1526--1554, 2006.
- J. Markoff, "How Many Computers to Identify a Cat? 16,000," The New York Times, 25 June 2012. [Online]. Available: http://www.nytimes.com/2012/06/26/technology/in-a-big-network-of-computers-evidence-of-machine-learning.html. [Accessed 11 December 2012].
- A. Smola, A. Ahmed and M. Weimer, "WWW 2012 Tutorial: New Templates for Scalable Data Analysis," June 2012. [Online]. Available: http://www2012.wwwconference.org/program/tutorials/ and http://cs.markusweimer.com/2012/04/06/www-2012-tutorial-new-templates-for-scalable-data-analysis/.
- G. Dror, N. Koenigstein, Y. Koren and M. Weimer, "The Yahoo! Music Dataset and KDD-Cup'11," in Proceedings of KDD-Cup 2011, San Diego, CA, 2011.
Index Terms
- Machine learning for big data