Evolutionary Induction of Classification Trees on Spark

Reska, Daniel; Jurczuk, Krzysztof; Kretowski, Marek

doi:10.1007/978-3-319-91253-0_48

Daniel Reska¹⁸,
Krzysztof Jurczuk¹⁸ &
Marek Kretowski¹⁸

Part of the book series: Lecture Notes in Computer Science ((LNAI,volume 10841))

Included in the following conference series:

International Conference on Artificial Intelligence and Soft Computing

2199 Accesses
5 Citations

Abstract

Evolutionary-based approaches have recently been increasingly proposed for data mining tasks, but their real applicability depends on efficiency and scalability for large-scale data. It is clear that parallel and distributed processing support is indispensable herein. Apache Spark is one of the most promising cluster-computing engines for Big Data. In this paper, we investigate the application of Spark to speed up an evolutionary induction of classification trees in the Global Decision Tree (GDT) system. The system simultaneously searches for the tree structure and tests in non-terminal nodes due to specialized genetic operators. As the original GDT system is implemented in C++, the Java-based module is developed for Spark-based acceleration of the most computationally demanding fitness evaluation. The training dataset is transformed to Resilient Distributed Dataset, which enables in-memory processing of dataset’s parts on workers. Preliminary experimental validation on large-scale artificial and real-life datasets shows that the proposed solution is efficient and scales well.

This is a preview of subscription content, log in via an institution to check access.

Access this chapter

Log in via an institution

Chapter: USD 29.95; Price excludes VAT (USA)

eBook: USD 84.99; Price excludes VAT (USA)

Softcover Book: USD 109.99; Price excludes VAT (USA)

Tax calculation will be finalised at checkout

Purchases are for personal use only

Institutional subscriptions

Notes

1.
A dipole is a pair of observations; if observations come from different classes, a dipole is called mixed.

References

The Apache Software Foundation. Apache Spark - Lightning-Fast Cluster Computing (2018). https://spark.apache.org/
Alba, E., Tomassini, M.: Parallelism and evolutionary algorithms. IEEE Trans. Evol. Comput. 6(5), 443–462 (2002)
Article Google Scholar
Barros, R.C., Basgalupp, M.P., Carvalho, A.C., Freitas, A.A.: A survey of evolutionary algorithms for decision-tree induction. IEEE Trans. SMC, Part C 42(3), 291–312 (2012)
Google Scholar
Blake, C., Keogh, E., Merz, C.: UCI repository of machine learning databases (1998). http://www.ics.uci.edu/~mlearn/MLRepository.html
Czajkowski, M., Jurczuk, K., Kretowski, M.: A parallel approach for evolutionary induced decision trees. MPI+OpenMP implementation. In: Rutkowski, L., Korytkowski, M., Scherer, R., Tadeusiewicz, R., Zadeh, L.A., Zurada, J.M. (eds.) ICAISC 2015. LNCS (LNAI), vol. 9119, pp. 340–349. Springer, Cham (2015). https://doi.org/10.1007/978-3-319-19324-3_31
Chapter Google Scholar
Czajkowski, M., Jurczuk, K., Kretowski, M.: Hybrid parallelization of evolutionary model tree induction. In: Rutkowski, L., Korytkowski, M., Scherer, R., Tadeusiewicz, R., Zadeh, L.A., Zurada, J.M. (eds.) ICAISC 2016. LNCS (LNAI), vol. 9692, pp. 370–379. Springer, Cham (2016). https://doi.org/10.1007/978-3-319-39378-0_32
Chapter Google Scholar
Czajkowski, M., Kretowski, M.: Evolutionary induction of global model trees with specialized operators and memetic extensions. Inf. Sci. 288, 153–173 (2014)
Article Google Scholar
Deng, C., Tan, X., Dong, X., Tan, Y.: A parallel version of differential evolution based on resilient distributed datasets model. In: Gong, M., Pan, L., Song, T., Tang, K., Zhang, X. (eds.) BIC-TA 2015. CCIS, vol. 562, pp. 84–93. Springer, Heidelberg (2015). https://doi.org/10.1007/978-3-662-49014-3_8
Chapter Google Scholar
Ferranti, A., Marcelloni, F., Segatori, A., Antonelli, M., Ducange, P.: A distributed approach to multi-objective evolutionary generation of fuzzy rule-based classifiers from big data. Inf. Sci. 415–416, 319–340 (2017)
Article Google Scholar
Funika, W., Koperek, P.: Towards a scalable distributed fitness evaluation service. In: Wyrzykowski, R., Deelman, E., Dongarra, J., Karczewski, K., Kitowski, J., Wiatr, K. (eds.) PPAM 2015. LNCS, vol. 9573, pp. 493–502. Springer, Cham (2016). https://doi.org/10.1007/978-3-319-32149-3_46
Chapter Google Scholar
Gong, Y.J., Chen, W.N., Zhan, Z.H., Zhang, J., Li, Y., Zhang, Q., Li, J.J.: Distributed evolutionary algorithms and their models: a survey of the state-of-the-art. Appl. Soft Comput. 34, 286–300 (2015)
Article Google Scholar
Grama, A., Karypis, G., Kumar, V., Gupta, A.: Introduction to Parallel Computing. Addison-Wesley, Boston (2003)
MATH Google Scholar
Jurczuk, K., Czajkowski, M., Kretowski, M.: Evolutionary induction of a decision tree for large-scale data: a GPU-based approach. Soft Comput. 21(24), 7363–7379 (2017)
Article Google Scholar
Kretowski, M., Grzes, M.: Evolutionary induction of mixed decision trees. Int. J. Data Warehous. Min. (IJDWM) 3(4), 68–82 (2007)
Article Google Scholar
Meng, X., Bradley, J., Yavuz, B., Sparks, E., Venkataraman, S., Liu, D., Freeman, J., Tsai, D., Amde, M., Owen, S., et al.: MLlib: machine learning in apache spark. J. Mach. Learn. Res. 17(1), 1235–1241 (2016)
MathSciNet MATH Google Scholar
Michalewicz, Z.: Genetic Algorithms + Data Structures = Evolution Programs. Springer Science & Business Media, Heidelberg (2013). https://doi.org/10.1007/978-3-662-03315-9
Book MATH Google Scholar
Pulgar-Rubior, F., Rivera-Rivas, A., Perez-Godoy, M., Gonzalez, P., Carmona, C., del Jesus, M.: MEFASD-BD: multi-objective evolutionary fuzzy algorithm for subgroup discovery in big data environments - a MapReduce solution. Knowl.-Based Syst. 117, 70–78 (2017)
Article Google Scholar
Qi, R., Wang, Z., Li, S.: A parallel genetic algorithm based on Spark for pairwise test suite generation. J. Comput. Sci. Technol. 31(2), 417–427 (2016)
Article Google Scholar
Teijeiro, D., Pardo, X.C., González, P., Banga, J.R., Doallo, R.: Implementing parallel differential evolution on spark. In: Squillero, G., Burelli, P. (eds.) EvoApplications 2016. LNCS, vol. 9598, pp. 75–90. Springer, Cham (2016). https://doi.org/10.1007/978-3-319-31153-1_6
Chapter Google Scholar
Wu, X., Zhu, X., Wu, G.Q., Ding, W.: Data mining with big data. IEEE Trans. Knowl. Data Eng. 26(1), 97–107 (2014)
Article Google Scholar
Zaharia, M., et al.: Apache Spark: a unified engine for big data processing. Commun. ACM 59(11), 56–65 (2016)
Article Google Scholar

Download references

Acknowledgments

This work was supported by the grant S/WI/2/18 from Bialystok University of Technology founded by Polish Ministry of Science and Higher Education.

Author information

Authors and Affiliations

Faculty of Computer Science, Bialystok University of Technology, Wiejska 45a, 15-351, Bialystok, Poland
Daniel Reska, Krzysztof Jurczuk & Marek Kretowski

Authors

Daniel Reska
View author publications
You can also search for this author in PubMed Google Scholar
Krzysztof Jurczuk
View author publications
You can also search for this author in PubMed Google Scholar
Marek Kretowski
View author publications
You can also search for this author in PubMed Google Scholar

Corresponding author

Correspondence to Krzysztof Jurczuk .

Editor information

Editors and Affiliations

Częstochowa University of Technology, Częstochowa, Poland
Leszek Rutkowski
Częstochowa University of Technology, Częstochowa, Poland
Rafał Scherer
Częstochowa University of Technology, Częstochowa, Poland
Marcin Korytkowski
University of Alberta, Edmonton, AB, Canada
Witold Pedrycz
AGH University of Science and Technology, Kraków, Poland
Ryszard Tadeusiewicz
University of Louisville, Louisville, KY, USA
Jacek M. Zurada

Rights and permissions

Reprints and permissions

Copyright information

About this paper

Cite this paper

Reska, D., Jurczuk, K., Kretowski, M. (2018). Evolutionary Induction of Classification Trees on Spark. In: Rutkowski, L., Scherer, R., Korytkowski, M., Pedrycz, W., Tadeusiewicz, R., Zurada, J. (eds) Artificial Intelligence and Soft Computing. ICAISC 2018. Lecture Notes in Computer Science(), vol 10841. Springer, Cham. https://doi.org/10.1007/978-3-319-91253-0_48

Download citation

DOI: https://doi.org/10.1007/978-3-319-91253-0_48
Published: 11 May 2018
Publisher Name: Springer, Cham
Print ISBN: 978-3-319-91252-3
Online ISBN: 978-3-319-91253-0
eBook Packages: Computer ScienceComputer Science (R0)

Publish with us

Policies and ethics