
ParSoDA: high-level parallel programming for social data mining

  • Original Article
Social Network Analysis and Mining

Abstract

Software systems for social data mining provide algorithms and tools for extracting useful knowledge from user-generated social media data. ParSoDA (Parallel Social Data Analytics) is a high-level library for developing parallel data mining applications that extract useful knowledge from large data sets gathered from social media. The library aims to reduce the programming skills needed to implement scalable social data analysis applications. To reach this goal, ParSoDA defines a general structure for a social data analysis application, consisting of a number of configurable steps, and provides a predefined (but extensible) set of functions that can be used for each step. User applications based on the ParSoDA library can run on both Apache Hadoop and Apache Spark clusters. The paper describes the ParSoDA library and presents two social data analysis applications to assess its usability and scalability. Regarding usability, we compare the programming effort required to code a social media application with and without the ParSoDA library. The comparison shows that ParSoDA reduces the number of lines of code by about 65%, since the programmer only has to implement the application logic without worrying about configuring the environment and related classes. Regarding scalability, on a cluster with 300 cores and 1.2 TB of RAM, ParSoDA reduces the execution time of such applications by up to 85% compared to a cluster with 25 cores and 100 GB of RAM.
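The idea of a fixed application structure with configurable steps and a reusable set of functions per step can be illustrated with a minimal, single-machine sketch. This is not the ParSoDA API: the `Pipeline` class, the step names (filter, map, group, reduce), and the helper functions `in_language`, `lowercase_text`, `by_user`, and `count` are all hypothetical and chosen only for illustration. A real deployment would delegate each step to Hadoop or Spark rather than run a local loop.

```python
from collections import defaultdict

# Hypothetical sketch only: none of these names come from the ParSoDA library.


class Pipeline:
    """A fixed sequence of steps; the user only plugs in a function per step."""

    def __init__(self, filters, mappers, key_fn, reducer):
        self.filters = filters    # predicates: keep an item only if all return True
        self.mappers = mappers    # transformations applied in order to each kept item
        self.key_fn = key_fn      # assigns each item to a group
        self.reducer = reducer    # aggregates the list of items in each group

    def run(self, items):
        groups = defaultdict(list)
        for item in items:
            if not all(keep(item) for keep in self.filters):
                continue
            for transform in self.mappers:
                item = transform(item)
            groups[self.key_fn(item)].append(item)
        return {key: self.reducer(group) for key, group in groups.items()}


# A small "predefined" function set that applications could reuse.
def in_language(lang):
    return lambda item: item["lang"] == lang


def lowercase_text(item):
    return {**item, "text": item["text"].lower()}


def by_user(item):
    return item["user"]


def count(group):
    return len(group)


# Application logic reduces to choosing one or more functions for each step.
posts = [
    {"user": "a", "lang": "en", "text": "Hello Rome"},
    {"user": "b", "lang": "it", "text": "Ciao"},
    {"user": "a", "lang": "en", "text": "At EXPO"},
    {"user": "c", "lang": "en", "text": "Hi"},
]

pipeline = Pipeline(
    filters=[in_language("en")],
    mappers=[lowercase_text],
    key_fn=by_user,
    reducer=count,
)
result = pipeline.run(posts)
print(result)
```

In this sketch the application author writes no boilerplate beyond the configuration at the bottom, which is the kind of saving the usability comparison in the abstract refers to.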


Notes

  1. https://github.com/SCAlabUnical/ParSoDA.

  2. https://mahout.apache.org/.

  3. https://spark.apache.org/mllib/.


Acknowledgements

This work has been partially supported by the SMART Project, CUP J28C17000150006, funded by Regione Calabria (POR FESR-FSE 2014-2020), and by the ASPIDE Project, funded by the European Union's Horizon 2020 research and innovation programme under grant agreement No. 801091.

Author information

Corresponding author

Correspondence to Fabrizio Marozzo.


Cite this article

Belcastro, L., Marozzo, F., Talia, D. et al. ParSoDA: high-level parallel programming for social data mining. Soc. Netw. Anal. Min. 9, 4 (2019). https://doi.org/10.1007/s13278-018-0547-5


  • DOI: https://doi.org/10.1007/s13278-018-0547-5
