Improvement in Hadoop performance using integrated feature extraction and machine learning algorithms

Sarumathiy, C. K.; Geetha, K.; Rajan, C.

doi:10.1007/s00500-019-04453-x

Methodologies and Application
Published: 29 October 2019

Volume 24, pages 627–636, (2020)
Soft Computing

Abstract

Big Data refers to datasets so large and complex that traditional data-processing technologies are inadequate for them. Big Data can transform many aspects of society, but collecting and managing such data is challenging and complex. Hadoop was designed to process large volumes of unstructured and complex data. It provides large-scale data storage and can handle a very large number of concurrent tasks or jobs. Feature selection is a powerful technique for dimensionality reduction and an important step in machine learning applications. In recent decades, data have grown steadily in the number of instances and features, which makes feature selection increasingly difficult. Coping with this era of Big Data requires new techniques that address the problem more efficiently, since the algorithms currently in use may not be suitable, especially when the data size exceeds hundreds of gigabytes. In this work, correlation-based feature selection and mutual information-based feature selection were used to improve performance, and AdaBoost and support vector machine classifiers were used to improve classification accuracy. The experimental results indicate that the proposed method achieved better performance than the other methods compared.
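
As a rough illustration of the two filter criteria named in the abstract, the sketch below scores each feature against the class label by mutual information and by absolute Pearson correlation, then ranks the features. It is a minimal single-machine sketch under stated assumptions, not the authors' implementation: the function names, the toy data, and the plain ranking step are all hypothetical, it does not use Hadoop or MapReduce, and full correlation-based feature selection also penalizes redundancy between features, which is omitted here.

```python
import math
from collections import Counter


def mutual_information(xs, ys):
    """Empirical mutual information (in bits) between two discrete sequences."""
    n = len(xs)
    px = Counter(xs)
    py = Counter(ys)
    pxy = Counter(zip(xs, ys))
    mi = 0.0
    for (x, y), count in pxy.items():
        # p(x, y) * log2( p(x, y) / (p(x) * p(y)) ), written with raw counts
        mi += (count / n) * math.log2(count * n / (px[x] * py[y]))
    return mi


def pearson(xs, ys):
    """Pearson correlation coefficient; returns 0.0 if either input is constant."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    if var_x == 0 or var_y == 0:
        return 0.0
    return cov / math.sqrt(var_x * var_y)


def rank_features(columns, labels, score):
    """Return feature names sorted from most to least relevant under `score`."""
    scores = {name: score(values, labels) for name, values in columns.items()}
    return sorted(scores, key=scores.get, reverse=True)


# Hypothetical toy data: one informative, one noisy, one useless feature.
labels = [0, 0, 1, 1, 0, 1, 0, 1]
columns = {
    "copy_of_label": [0, 0, 1, 1, 0, 1, 0, 1],
    "noisy": [0, 1, 1, 1, 0, 1, 0, 0],
    "constant": [1, 1, 1, 1, 1, 1, 1, 1],
}

mi_ranking = rank_features(columns, labels, mutual_information)
corr_ranking = rank_features(columns, labels, lambda c, y: abs(pearson(c, y)))
print(mi_ranking)
print(corr_ranking)
```

In a pipeline of the kind the abstract describes, the top-ranked features would then be passed to a classifier such as AdaBoost or a support vector machine; that step is left out of this sketch.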

Author information

Authors and Affiliations

Department of CSE, Excel Engineering College, Komarapalayam, Tamil Nadu, India
C. K. Sarumathiy & K. Geetha
Department of IT, K S Rangasamy College of Technology, Tiruchengode, Tamil Nadu, India
C. Rajan

Corresponding author

Correspondence to C. K. Sarumathiy.

Ethics declarations

Conflict of interest

The authors declare that they have no conflict of interest.

Ethical approval

All procedures performed in studies involving human participants were in accordance with the ethical standards of the institutional and/or national research committee and with the 1964 Helsinki Declaration and its later amendments or comparable ethical standards.

Informed consent

Informed consent was obtained from all individual participants included in the study.

Additional information

Communicated by V. Loia.

Publisher's Note

Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.

About this article

Cite this article

Sarumathiy, C.K., Geetha, K. & Rajan, C. Improvement in Hadoop performance using integrated feature extraction and machine learning algorithms. Soft Comput 24, 627–636 (2020). https://doi.org/10.1007/s00500-019-04453-x

Issue Date: January 2020
