Improvement in Hadoop performance using integrated feature extraction and machine learning algorithms

  • Methodologies and Application
Soft Computing

Abstract

Big Data refers to datasets so large and complex that traditional data-processing technologies are inadequate. Big Data can transform many aspects of society, but collecting and managing such data is challenging and complex. Hadoop is designed to process large volumes of unstructured and complex data. It provides large-scale data storage and the ability to handle virtually unlimited concurrent tasks or jobs. Feature selection is a powerful technique for dimensionality reduction and one of the most important steps in machine learning applications. In recent decades, data have grown steadily in both the number of instances and the number of features, making feature selection increasingly difficult. Coping with the Big Data era requires new techniques that address this problem more efficiently. At the same time, the algorithms currently in use may not be suitable, especially when the data exceed hundreds of gigabytes. In this work, correlation-based feature selection and mutual information-based feature selection are used to improve performance. AdaBoost and support vector machine (SVM) classifiers are used to improve classification accuracy. The experimental results show that the proposed method achieves better performance than the other methods compared.
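To make the two filter criteria named in the abstract concrete, the following is a minimal single-machine sketch of ranking features by mutual information and by absolute Pearson correlation with the class label. It is not the authors' implementation: the function names, the toy columns, and the choice of a simple per-feature ranking (rather than a full correlation-based subset search or a Hadoop/MapReduce job) are assumptions made purely for illustration.

```python
import math
from collections import Counter


def mutual_information(xs, ys):
    """Mutual information, in bits, between two equal-length discrete sequences."""
    n = len(xs)
    px = Counter(xs)            # marginal counts of feature values
    py = Counter(ys)            # marginal counts of labels
    pxy = Counter(zip(xs, ys))  # joint counts
    mi = 0.0
    for (x, y), c in pxy.items():
        # p(x,y) * log2(p(x,y) / (p(x) * p(y))), expressed with raw counts
        mi += (c / n) * math.log2(c * n / (px[x] * py[y]))
    return mi


def pearson(xs, ys):
    """Pearson correlation coefficient; returns 0.0 for a constant sequence."""
    n = len(xs)
    mx = sum(xs) / n
    my = sum(ys) / n
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    vx = sum((x - mx) ** 2 for x in xs)
    vy = sum((y - my) ** 2 for y in ys)
    if vx == 0 or vy == 0:
        return 0.0
    return cov / math.sqrt(vx * vy)


def rank_features(columns, labels, score):
    """Return feature names sorted from most to least relevant under `score`."""
    scored = {name: score(values, labels) for name, values in columns.items()}
    return sorted(scored, key=scored.get, reverse=True)


# Hypothetical toy data: three binary features and a binary class label.
labels = [0, 0, 1, 1, 0, 1, 0, 1]
columns = {
    "copy_of_label": [0, 0, 1, 1, 0, 1, 0, 1],  # perfectly informative
    "noisy":         [0, 0, 1, 1, 1, 0, 0, 1],  # agrees with the label 6 times out of 8
    "constant":      [1, 1, 1, 1, 1, 1, 1, 1],  # carries no information
}

by_mi = rank_features(columns, labels, mutual_information)
by_corr = rank_features(columns, labels, lambda xs, ys: abs(pearson(xs, ys)))
print(by_mi)
print(by_corr)
```

In a Hadoop setting, the counting inside `mutual_information` is the part that would plausibly be distributed, since joint and marginal counts can be accumulated per data split and then summed. The top-ranked features would then be passed to a classifier such as AdaBoost or an SVM.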

Author information

Corresponding author

Correspondence to C. K. Sarumathiy.

Ethics declarations

Conflict of interest

The authors declare that they have no conflict of interest.

Ethical approval

All procedures performed in studies involving human participants were in accordance with the ethical standards of the institutional and/or national research committee and with the 1964 Helsinki Declaration and its later amendments or comparable ethical standards.

Informed consent

Informed consent was obtained from all individual participants included in the study.

Additional information

Communicated by V. Loia.

Publisher's Note

Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.

About this article

Cite this article

Sarumathiy, C.K., Geetha, K. & Rajan, C. Improvement in Hadoop performance using integrated feature extraction and machine learning algorithms. Soft Comput 24, 627–636 (2020). https://doi.org/10.1007/s00500-019-04453-x

  • DOI: https://doi.org/10.1007/s00500-019-04453-x
