HotML: A DSM-based machine learning system for social networks
Introduction
In recent years, social networks such as Twitter, Weibo, and Facebook have become increasingly popular worldwide. Social network data has grown dramatically and has become invaluable to both academia and industry for research and commerce. To support social network analysis, machine learning (ML) algorithms have been widely adopted for tasks such as spam bot detection [1], user classification [2], event detection [3], [4], [5], link prediction [6], [7], sentiment analysis [8], [9], and topic learning [10], among many others, and they have achieved good results.
However, in the so-called big data era, a social network may have billions of users, and the volume of user-generated content (UGC) is extremely large. Two challenges have therefore arisen in machine learning for social networks. The first is big data: the set of training samples is extremely large. The second is the big model: ML algorithms often have to train up to billions of parameters, a consequence of both the scale of the training samples and the deep architectures of neural networks. Together, big data and big models create a serious computational performance problem that must be addressed systematically.
Faced with these challenges of big data and big models, the parameter server (PS), a high-throughput machine learning architecture for social networks, has gained much attention. The parameter server is a key-value store organized as distributed shared memory, which lets clients easily share access to global model parameters that are striped across multiple servers.
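The key-value view of a parameter server described above can be illustrated with a minimal, single-process sketch. The class name `ShardedParameterStore`, its `get` and `inc` methods, and the hash-based placement of keys are assumptions made for illustration only; they are not the DPS or HotML interface, and real shards would live on separate server nodes rather than in one process.

```python
import zlib


class ShardedParameterStore:
    """Toy key-value parameter store split across in-process shards (hypothetical)."""

    def __init__(self, num_shards):
        # Each dict stands in for the memory of one server node.
        self.shards = [{} for _ in range(num_shards)]

    def _shard_for(self, key):
        # crc32 is stable across runs, unlike the built-in hash() for strings.
        index = zlib.crc32(key.encode("utf-8")) % len(self.shards)
        return self.shards[index]

    def get(self, key, default=0.0):
        # Read the current global value of one parameter.
        return self._shard_for(key).get(key, default)

    def inc(self, key, delta):
        # Apply an additive update, the usual way workers push gradients.
        shard = self._shard_for(key)
        shard[key] = shard.get(key, 0.0) + delta
        return shard[key]


store = ShardedParameterStore(num_shards=4)
store.inc("w:0", 0.5)    # one worker pushes a delta
store.inc("w:0", -0.25)  # another worker updates the same parameter
print(store.get("w:0"))
```

Additive updates are used here because they commute, so the order in which different workers reach a shard does not change the final value.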
Our prior work, DPS [11], introduced a novel parameter server built on Grappa [12], a recently proposed high-performance distributed shared memory (DSM) system. Grappa aims to give programmers a uniform view of the memory of all machines, so that distributed programs can be written as if for a single machine. DPS offers a high-level data abstraction, a user-friendly programming interface with data-flow operations such as map/reduce, a lightweight task scheduling system, and stale synchronous parallel (SSP) consistency. However, DPS has several drawbacks. (1) The server and the worker are tightly coupled on a single node, which limits overall performance because server parameter requests may be delayed while worker tasks execute. (2) The SSP consistency model may not make full use of the network bandwidth, and many trivial parameter updates waste the bandwidth that is used. (3) DPS provides no fault tolerance and therefore lacks high availability, even though ML algorithms for social networks may experience failures during long-running training. (4) DPS performs no load balancing, which can be very important in a heterogeneous cluster whose machines have different computing resources.
To overcome these drawbacks and further improve performance, in this paper we propose HotML, a novel DSM-based distributed machine learning system for social networks that builds on DPS and supports both data parallelism and model parallelism: the global parameters are stored across the server nodes, and each worker node takes a partition of the training data. HotML also contains many important components that cover the whole pipeline of machine learning for social networks.
The main contributions of HotML are as follows:
1. The design of the parameter server component in DPS is improved by physically decoupling the PS servers from the workers. Dedicated parameter servers are introduced to make full use of computing resources and to improve server throughput as well as the overall performance of HotML.
2. Flexible consistency models are designed to speed up the convergence of machine learning algorithms for social networks. SSP is implemented to relax consistency, as in DPS. In addition, HotML introduces SSPPush, an improved version of SSP in which servers use idle network bandwidth to push global parameters to workers in advance, reducing SSP waiting time. SSPDrop is designed for the worker side and drops trivial parameter updates to reduce network communication. SSPDrop works transparently with either SSP or SSPPush.
3. A flexible worker-side checkpoint mechanism and a consistent server-side checkpoint mechanism are introduced to improve the availability of HotML, because neither DPS nor the underlying DSM system Grappa provides a fault tolerance mechanism, and both may therefore be unreliable.
4. A worker workload balancer is introduced to deal with the straggler problem.
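The consistency ideas in the second contribution can be sketched as follows. This is a minimal illustration under stated assumptions, not the HotML implementation: the names `SSPClock` and `drop_trivial_updates` are hypothetical, the staleness rule is the standard bounded-staleness condition of SSP, and the magnitude threshold used to decide which updates are "trivial" is an assumption, since the text above does not state the criterion SSPDrop uses.

```python
class SSPClock:
    """Per-worker iteration clocks with a bounded-staleness check (hypothetical)."""

    def __init__(self, num_workers, staleness):
        self.staleness = staleness
        self.clocks = [0] * num_workers

    def tick(self, worker_id):
        # A worker calls this when it finishes one iteration.
        self.clocks[worker_id] += 1

    def can_proceed(self, worker_id):
        # SSP rule: a worker may run ahead of the slowest worker
        # by at most `staleness` iterations; otherwise it must wait.
        return self.clocks[worker_id] - min(self.clocks) <= self.staleness


def drop_trivial_updates(updates, threshold):
    # Assumed criterion: keep only updates whose magnitude reaches the threshold.
    return {key: delta for key, delta in updates.items() if abs(delta) >= threshold}


clock = SSPClock(num_workers=2, staleness=1)
clock.tick(0)
first = clock.can_proceed(0)   # worker 0 is one iteration ahead
clock.tick(0)
second = clock.can_proceed(0)  # worker 0 is two iterations ahead
clock.tick(1)
third = clock.can_proceed(0)   # worker 1 caught up by one iteration

kept = drop_trivial_updates({"w1": 0.5, "w2": 1e-6, "w3": -0.2}, threshold=1e-3)
print(first, second, third, kept)
```

In this sketch the filter runs on the worker before any update is sent, which matches the description of SSPDrop as a worker-side mechanism that is independent of how the servers enforce staleness.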
A series of experiments was conducted to evaluate the performance of HotML. The results show that HotML can reduce networking time by about 74% and achieve up to 1.9× the performance of the popular ML system Petuum.
The rest of the paper is organized as follows. Section 2 introduces the background and related work. Section 3 describes the design and implementation of HotML. Section 4 presents the performance evaluation results and their analysis. Finally, Section 5 concludes the paper.
Background and related work
In this section, we introduce consistency models, distributed shared memory, existing big data machine learning systems, and existing fault tolerance methods.
Design and implementation of HotML
In this section, we first give an overview of the HotML system and then describe the implementation details of its key features, including the flexible consistency models, the lightweight task scheduler, the GlobalTable data abstraction and programming interface, and fault tolerance.
Applications and evaluation
HotML was evaluated on machine learning algorithms for social networks: matrix factorization (SGD MF) and logistic regression. We compared the performance of HotML with that of the popular parameter server system Petuum [32] and with our prior work, DPS [11]. Additional experiments were conducted to evaluate the checkpoint mechanism, scalability, and workload balancing of HotML.
Conclusion
In this paper, we described a DSM-based machine learning system with high performance for social networks, HotML. HotML is based on our prior work DPS and adopts DPS's high-level data abstraction and programming interfaces, user-level task scheduling and SSP consistency. To further improve the performance and availability, in HotML, the PS design is improved by decoupling PS server and PS worker physically to improve server throughput; SSPPush and SSPDrop consistency are adopted; consistent
Acknowledgements
This work is supported by the China 973 Fundamental R&D Program (No. 2014CB340300), the NSFC program (Nos. 61472022, 61421003), SKLSDE-2016ZX-11, and partly by the Beijing Advanced Innovation Center for Big Data and Brain Computing. We would also like to extend our gratitude to the reviewers for their valuable comments and suggestions, which helped improve the quality of this manuscript.
References (51)
- et al. Twitter mood predicts the stock market. J. Comput. Sci. (2011)
- et al. Enterprise: breadth-first graph traversal on GPUs. Proceedings of the International Conference for High Performance Computing, Networking, Storage and Analysis (SC) (2015)
- et al. Scalable inference in latent variable models.
- et al. Towards an efficient snapshot approach for virtual machines in clouds. Inf. Sci. (2017)
- Detecting spam bots in online social networking sites: a machine learning approach. IFIP Annual Conference on Data and Applications Security and Privacy (2010)
- et al. A machine learning approach to Twitter user classification. ICWSM (2011)
- et al. Ring: real-time emerging anomaly monitoring system over text streams. IEEE Trans. Big Data (2017)
- et al. Learning similarity metrics for event identification in social media. Proceedings of the Third ACM International Conference on Web Search and Data Mining (2010)
- et al. Earthquake shakes Twitter users: real-time event detection by social sensors. Proceedings of the 19th International Conference on World Wide Web (2010)
- et al. Predicting positive and negative links in online social networks. Proceedings of the 19th International Conference on World Wide Web (2010)
- Exploiting place features in link prediction on location-based social networks. Proceedings of the 17th ACM SIGKDD International Conference on Knowledge Discovery and Data Mining
- MoodLens: an emoticon-based sentiment analysis system for Chinese tweets. Proceedings of the 18th ACM SIGKDD International Conference on Knowledge Discovery and Data Mining
- Learning evolving and emerging topics in social media: a dynamic NMF approach with temporal regularization. Proceedings of the Fifth ACM International Conference on Web Search and Data Mining
- DPS: a DSM-based parameter server for machine learning. 14th International Symposium on Pervasive Systems, Algorithms and Networks
- Latency-tolerant software distributed shared memory. Usenix Conference on Usenix Technical Conference
- Exploiting bounded staleness to speed up big data analytics. 2014 USENIX Annual Technical Conference (USENIX ATC 14)
- More effective distributed ML via a stale synchronous parallel parameter server. Adv. Neural Inf. Process. Syst.
- High-performance distributed ML at scale through parameter server consistency models. AAAI
- Exploiting heterogeneous parallelism on a multithreaded multiprocessor. International Conference on Supercomputing
- The Tera computer system. ACM SIGARCH Computer Architecture News
- MLlib: machine learning in Apache Spark. J. Mach. Learn. Res.
- Spark: cluster computing with working sets. Usenix Conference on Hot Topics in Cloud Computing
Yangyang Zhang is currently a Ph.D. student at the School of Computer Science and Engineering, Beihang University, China. His research interests include virtualization, machine learning, and distributed systems.
Jianxin Li received the PhD degree in January 2008. He is a professor in the School of Computer Science and Engineering, Beihang University. He was a visiting scholar in the Machine Learning Department, CMU, in 2015, and a visiting researcher at MSRA in 2011. His current research interests include virtualization and cloud computing, data analysis, and processing. He is a member of the IEEE and the ACM.
Chenggen Sun received the M.S. degree in Computer Science from Beihang University, China in 2017. His research interests include machine learning and distributed systems.
Md Zakirul Alam Bhuiyan received the PhD degree. He is currently an Assistant Professor in the Department of Computer and Information Sciences, Fordham University. Previously, he worked as an Assistant Professor at Temple University and as a postdoctoral research fellow at Central South University, China. His research focuses on dependable cyber-physical systems, WSN applications, big data, and cyber security. He has served as a lead guest editor of key journals, including the IEEE Transactions on Big Data, the ACM Transactions on Cyber-Physical Systems, Information Sciences, and the IEEE IoT Journal. He has also served as general chair, program chair, workshop chair, publicity chair, TPC member, and reviewer for various international journals and conferences. He is a member of the IEEE and the ACM.
Weiren Yu received the BE degree from the School of Advanced Engineering at Beihang University, China, in 2011. He has been a PhD candidate in the Department of Computer Science, Beihang University, since 2011. His research interests include distributed machine learning systems, scalable graphical models, and graph mining models for emerging event detection on social media.
Richong Zhang received the BS and MASc degrees from Jilin University, China, in 2001 and 2004, respectively, the MS degree from Dalhousie University, Canada, in 2006, and the PhD degree from the School of Information Technology and Engineering, University of Ottawa, Canada, in 2011. He is currently an Associate Professor in the School of Computer Science and Engineering, Beihang University, China. His research interests include artificial intelligence and data mining. He is a member of the IEEE.