ABSTRACT
Social networks promote information sharing between people everywhere and at all times, and mining the data produced in this data-rich environment can be extremely useful. Frequent itemset mining plays an important role in mining associations, correlations, sequential patterns, causality, episodes, multidimensional patterns, max-patterns, partial periodicity, emerging patterns, and many other significant data mining tasks in social networks. As social network data grows exponentially to a terabyte or more, most traditional frequent itemset mining algorithms become ineffective because of either huge resource requirements or large communication overhead. Cloud computing has shown that very large datasets can be processed on commodity clusters given the right programming model. MapReduce, a parallel programming model and one of the most important techniques in cloud computing, has emerged as a way to mine datasets of terabyte scale or larger on clusters of computers. In this paper, we propose an efficient frequent itemset mining algorithm, called IMRApriori, based on the MapReduce framework and running on Hadoop, a parallel storage and computing platform. Experimental results are presented to corroborate the theoretical claims.
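To make the idea of counting itemsets with a map step and a reduce step concrete, the following is a minimal single-process sketch of one support-counting round in Python. It is not IMRApriori, whose details the abstract does not give, and it does not use Hadoop; the function names `map_phase`, `reduce_phase`, and `frequent_k_itemsets`, the `min_count` threshold, and the toy `splits` data are all assumptions made for illustration.

```python
from collections import Counter
from itertools import combinations


def map_phase(split, k):
    """Mapper (illustrative): emit (itemset, 1) for every k-item subset of each transaction."""
    for transaction in split:
        # Deduplicate and sort so the same itemset always yields the same key.
        for itemset in combinations(sorted(set(transaction)), k):
            yield itemset, 1


def reduce_phase(pairs, min_count):
    """Reducer (illustrative): sum counts per itemset and keep those reaching min_count."""
    totals = Counter()
    for itemset, count in pairs:
        totals[itemset] += count
    return {itemset: n for itemset, n in totals.items() if n >= min_count}


def frequent_k_itemsets(splits, k, min_count):
    """Run one map/reduce round over all splits (sequentially here, not on a cluster)."""
    pairs = (pair for split in splits for pair in map_phase(split, k))
    return reduce_phase(pairs, min_count)


# Hypothetical toy data: two input splits holding five transactions in total.
splits = [
    [["a", "b", "c"], ["a", "c"]],
    [["a", "d"], ["b", "c"], ["a", "b", "c"]],
]
frequent_pairs = frequent_k_itemsets(splits, k=2, min_count=3)
print(frequent_pairs)
```

On a real cluster each split would be handled by a separate mapper and the framework would group keys before the reducers run. A full Apriori-style algorithm would also generate candidate k-itemsets from the frequent (k-1)-itemsets instead of enumerating every subset, which this sketch omits.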
Index Terms
- Efficient mining of frequent itemsets in social network data based on MapReduce framework