1 Introduction

The growth of social networking has given individuals new ways to express ideas and knowledge, turning us from passive data seekers into enthusiastic reviewers. Every moment, an enormous amount of review data is generated on the internet, and automated approaches are required to process it and extract relevant information. The need for an appropriate sentiment analysis approach grows with the rapid expansion of the internet and user-generated data [1]. Processing a large number of opinions can reveal patterns in human reactions and is valuable in various domains such as product review interpretation, retail marketing campaigns, political opinion tracking, and many more [2]. In e-commerce platforms, opinions and reports on goods and services are essential to human decision-making. Hence, identifying emotions and attitudes in textual data can be constructive in understanding customer attitudes towards situations and outcomes and is widely applicable and valuable in numerous areas [3].

Many approaches and techniques have been used to detect emotion in text; machine learning, lexicon-based, and hybrid methods are the most widely used [4]. However, the combined strategy of GA and filter-based SLI in [5] does not explain how or why decisions are made and fails to handle high-dimensional data. Interpretability and explainability aim to expose how a system makes its judgments; they are essential in methods meant to support human decision-making and are highly desirable [6]. In Ref. [7], the authors analyzed mutual-information-based and non-mutual-information-based feature selection approaches to show how an informed decision affects quality. In many cases, it is not enough to obtain a prediction; one must also know why and how it was made. Methods that can show how they reach their decisions give people a way to conceptualize those judgments and enhance their quality [8]. The field of sentiment analysis has attracted growing interest from research communities over the last decade and a half. Since 2004, sentiment analysis has been a fast-growing and highly active research area, with a significant increase in the number of papers focused on sentiment analysis and related mining efforts. The big data era is characterized by extremely large dataset sizes [9], which pose vital challenges in extracting helpful insight from big data. Compared with the complexity of reducing data volume, the curse of dimensionality may be much harder to resolve. In general, a large dataset contains many irrelevant and redundant features, which increase the complexity of data processing, pattern classification, and data mining. To address this obstacle, dimensionality reduction can filter out noise and unwanted data by shrinking the high-dimensional feature space to a compact intrinsic subspace [10]. In addition, dimensionality reduction provides a practical foundation for efficient and transparent data visualization. To handle such a high-dimensional feature space, the main focus of the current work is to reduce the feature space to an extreme minimum without compromising efficiency. Before the sentiment analysis and feature reduction steps, an essential task is to obtain a dataset. Datasets have greatly improved with advances in science and technology, and for opinion mining, user feedback is indispensable.

Separating the sentiments of a text is traditionally possible only through human review. Moreover, once the text is vectorized, it yields a very large number of features; high-dimensional vectors impose high computational costs and a high risk of overfitting. Feature reduction has therefore been an active research field in data mining, statistics, and pattern recognition. An exhaustive search guarantees that the most relevant variables are found, but in most cases this is computationally infeasible, even for average-sized databases [11]. Since testing all subsets is costly, a feasible and qualitatively viable solution should be sought. Most feature minimization approaches use metaheuristic methods to avoid raising computational complexity [12]. These algorithms evolve towards a variable subset with the highest fitting accuracy in a reasonable time.

Looking at the challenges of minimizing the feature space, many feasible solutions can be applied: filter-based approaches such as chi-square, ANOVA, and information gain, and wrapper-based methods such as RFE and heuristic or metaheuristic approaches. Ignoring feature correlations is the main drawback of filter-based methods, while the biggest challenge faced by almost every metaheuristic method is convergence to local minima, which can hurt efficiency. Given the limitations of both families, the proposed work combines them to benefit from both while eliminating their weaknesses. In this context, a metaheuristic-based modified GA and the filter-based ANOVA method are considered in this work.

GA-based optimization strategies are population-based. This paper proposes an improved genetic algorithm to choose a subset of features from a large word set. The method divides the chromosome into several stages of spatial management, and mutation and crossover operators are then applied at the specified stages to eliminate invalid chromosomes. Among evolutionary algorithms, the genetic algorithm has been used successfully for feature selection in high-dimensional databases. The suggested GA-based selection approach has a few innovations compared to the conventional method. Instead of a single crossover, a two-phase crossover is implemented to achieve broader exploration of the feature space. Further, the proposed method introduces a unique selection approach to overcome the limitation of local minima convergence and speed up exploitation towards the global optimum.

The major additions to the proposed work are as follows:

  1. A review of the text is performed before applying the proposed framework. This preprocessing includes stop-word removal, lemmatization, and tokenization.

  2. An extended sentiment evaluation framework is designed, developed, and evaluated by combining ML methods with dictionary-based methods to overcome the limitations of each technique.

  3. An extended feature reduction algorithm is proposed using ANOVA and a GA-based method with a customized objective function.

  4. The proposed algorithm is shown to reduce the feature space to a remarkable degree while increasing efficiency.

The remainder of the paper is organized as follows: Sect. 2 provides background on sentiment evaluation and related work; Sect. 3 defines the objective of the proposed research; Sect. 4 introduces the proposed approach and the theory behind it; Sect. 5 explains the experimental setup; Sect. 6 presents the experiments, interprets them, and discusses the results, demonstrating the effectiveness of the approach; Sect. 7 shows the practical implications of the model; and Sect. 8 concludes the proposed work.

2 Related work

Analyzing sentiment and identifying opinions in text attract the attention of the research community. Sentiment evaluation methods fall into two primary categories: lexicon-based and machine learning approaches. In Ref. [13], the authors used standard designs with various vector representations. All of the methods were then tested on four databases containing online user feedback in two languages. The hybrid method connects dictionaries with word2vec and is readily accessible with adequate performance, but due to its high dimensionality the model may face overfitting, which can be reduced using an efficient feature selection approach.

The algorithm in Ref. [14] integrated PSO's global search with the local exploitation capabilities of filter techniques. A swarm initialization technique based on symmetrical uncertainty is used to calculate the correlation between features and class labels in order to produce superior results. The major drawback of the approach is that its performance may be constrained in some situations because it ignores the characteristics of the feature selection problem itself. The proposed AEGA model overcomes this issue through its preprocessing task. Further, the low-exploration issue of the studied method is suppressed by the multi-point crossover of the proposed model. In Ref. [15], the algorithm used the wrapper method, which involves a learning algorithm to evaluate the obtained subset of features. The sigmoid and tanh functions are used to transform the continuous search space into a binary one to match the binary nature of the feature selection problem. Additionally, a k-nearest neighbor classifier with Euclidean distance is employed to search for the nearest neighbors. The main limitation of that paper is the absence of theoretical guarantees for convergence or optimality, whether global or local. The unique selection approach of the proposed model makes it better suited to handle this issue.

In Ref. [16], the authors used a log-term-frequency-based modified inverse class frequency, a hybrid-mutation-based earthworm algorithm, and a local-search-improvised bat-algorithm-based Elman neural network for the analysis of online reviews after preprocessing tasks such as white-space tokenization, GL, and SBS. The preprocessing of the studied paper limits its ability to identify more complex patterns in user reviews that could improve accuracy further. The stopword removal and garbage removal via regular expressions used in the proposed method address this issue.

The authors of Ref. [17] have suggested a distributed approach to calculate a score that assesses the value of each feature in relation to multiple labels. Two different approaches, named Euclidean Norm Maximization (ENM) and Geometric Mean Maximization (GMM), are used to aggregate the mutual information of multiple labels. The number of obtained features and the computational time are both limitations of the model, which are reduced using the efficient feature selection technique of AEGA. In Ref. [12], the authors have suggested an embedded multi-label feature selection approach with manifold regularization to tackle the learning problem in multi-label data sets, and it makes use of label information for capturing local and global correlations between labels. As the model has not considered the feature correlations between different labels, it may lead to suboptimal results in some cases. The ANOVA analysis of the proposed model makes it better able to handle this limitation.

In Ref. [18], the authors have proposed an intelligent cognitive-inspired model for sentiment analysis to reduce the complexity of big data analytics. The paper uses TF-IDF for feature extraction and the Binary BrainStorm Optimization technique for feature selection; classification then uses fuzzy cognitive maps to evaluate negative and positive sentiments. The number of features selected by the studied model is still large for big data, an issue the proposed work addresses with its logical AND operation. A graph-oriented, variable-based PageRank algorithm with a central weight-focused scheme is introduced in Ref. [19] to score features with multiple labels. The approach estimates the distance between the elements and the tags using Euclidean distance. The method falls short in capturing the correlation between features and labels, which may lead to inaccurate results in some cases; this gap is reduced by the ANOVA step of the proposed work. In Ref. [20], the authors have proposed a novel approach based on boosting, or sample re-weighting. The paper builds on feature rankings derived from fast and scalable tree-boosting models such as XGBoost and achieves better accuracy. Its main limitation is the computational time needed to evaluate the fitness value; the simple SVM used in the proposed work addresses this problem.

In Ref. [21], the authors have introduced a filter-based feature selection technique that maps the attributes to a multi-dimensional space. It uses Pareto dominance concepts from the multiobjective optimization domain to select salient features in the dataset. The limitation of the model is that it is not suitable when the number of features arriving sequentially increases while the number of samples remains fixed; the combination of TF-IDF and the GA makes it possible to handle this issue. In Ref. [22], the authors have proposed a mutual-information-based approach, called MICO, that selects a specific development feature. The approach considers the correlation among features and the influence of labels by coupling and improving the Jaccard correlation and the mRMR structure. A drawback of the model is the drop in accuracy as the data size grows; the enhanced GA used in the proposed work handles large datasets without compromising accuracy. In Ref. [23], the author built a model using feature selection techniques, namely the chi-square attribute selection technique and the information gain method, to select the best features for classifying SMS messages. The feature selection techniques in that work are limited to chi-square and information gain, and the research focuses on SMS spam detection only; other types of text messages, such as reviews, are not considered. In the proposed work, ANOVA and e-commerce review datasets are used to demonstrate the competency of the model.

In Ref. [24], the authors have used two methods for feature selection: multivariate and optimization. A multivariate method removes irrelevant or redundant features from the dataset, while a graph-clustering-based optimization method selects the optimal feature subset. The model suffers from reduced accuracy as the number of features grows; the proposed model maintains accuracy with a large number of features thanks to its unique feature selection approach. In Ref. [25], the authors have used three different feature selection techniques: Chi-Square, Info Gain, and ReliefF. They discuss how to identify suitable and relevant features for building efficient machine-learning-based classifiers to filter spam emails. The features selected by this model may perform poorly on large data because the work focuses only on feature correlations; this issue is solved by the enhanced GA with ANOVA.

In Ref. [26], the authors have proposed an efficient method for sentiment analysis of fake news related to COVID-19. It applies various machine learning approaches, such as KNN, Naive Bayes, and AdaBoost, and different deep learning models, such as LSTM, CNN, GRU, and RNN, along with data preprocessing techniques, to obtain a prediction model that can accurately classify the sentiment associated with each piece of fake news on COVID-19. The approach is limited to the analysis of fake news related to COVID-19. In Ref. [27], the authors have proposed a novel approach to tackle high-dimensional and noisy RNA-Seq data for multiple cancer types. The RNA-Seq values are first converted into 2D images, from which relevant features are extracted; classification using different deep learning models is performed as the final step. The studied work, however, is restricted to cancer data.

In Ref. [28], the authors have proposed a wrapper-filter combination of Ant Colony Optimization (ACO) for feature selection. The method evaluates subsets with a filter approach instead of the traditional wrapper approach to reduce computational complexity. Its limitations are that it does not consider the correlation between features while selecting a subset, which may lead to suboptimal results in some cases, and that it has been tested on only two classifiers (KNN and MLP), so its performance with other types of models needs further investigation. In the proposed work, the correlation between features is checked, and the performance is tested on five different machine learning models. The slime mould algorithm (SMA), a swarm-based stochastic optimizer, was employed in Ref. [29]; an SMA with a distributed foraging strategy was proposed to enhance the algorithm and sustain population diversity. That model classifies sentiment well while reducing the feature count, but the percentage of selected features remains high. Hence, a model is still needed that reduces the number of features by a remarkable amount.

3 Research objective

Several metaheuristic optimization techniques have recently been used as wrapper-based strategies to address feature selection concerns. Each optimization strategy takes time to complete a set number of model training iterations, and different feature selection methods select an average number of features with average performance. As a result, a new metaheuristic model is needed that completes the training process while choosing fewer features and delivering high performance.

4 The proposed AEGA model

This section describes the complete process followed to reduce the number of features by applying an extended genetic algorithm. The block diagram of the proposed AEGA model is shown in Fig. 1. The experimental setup and the result analysis are described in the subsequent sections.

Fig. 1 Block diagram of the proposed AEGA model

4.1 Data preprocessing

Before being fed to the model, the text needs to be processed. Data preprocessing follows the necessary steps described earlier in this paper. First, the text is tokenized, dividing the entire text into tokens. The tokenization phase is followed by stopword elimination and garbage elimination, in which all stopwords and garbage values are discarded to clean the raw data. Doing so reduces the complexity of the subsequent steps.

Meanwhile, a stemming step is applied to each token. Different stemming techniques are available, but here a lemmatizer is used, since its primary characteristic is to reduce each word to a meaningful base form. This process takes time but produces more reliable accuracy. Once it is complete, the final and most essential step, vectorization, is applied, in which all lemmatized words are transformed into vectors of numerical values.

In this paper, the TF-IDF vectorizer is used, as it treats every term fairly: each word receives its own weight. After this process, the vector form of the text is generated.
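The pipeline described above can be sketched as follows. This is a minimal illustration using NLTK and scikit-learn; the exact tokenizer, stopword list, and garbage-removal regular expressions are assumptions, since the paper does not specify them.

```python
import re

import nltk
from nltk.corpus import stopwords
from nltk.stem import WordNetLemmatizer
from sklearn.feature_extraction.text import TfidfVectorizer

for pkg in ("stopwords", "wordnet", "omw-1.4"):
    nltk.download(pkg, quiet=True)

STOPWORDS = set(stopwords.words("english"))
lemmatizer = WordNetLemmatizer()

def preprocess(text: str) -> str:
    text = re.sub(r"[^a-z\s]", " ", text.lower())        # drop garbage characters
    tokens = text.split()                                # whitespace tokenization
    tokens = [t for t in tokens if t not in STOPWORDS]   # stopword elimination
    return " ".join(lemmatizer.lemmatize(t) for t in tokens)

reviews = ["The product is great!", "Worst purchase, totally useless..."]
vectorizer = TfidfVectorizer()
X = vectorizer.fit_transform(preprocess(r) for r in reviews)  # TF-IDF vectors
```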

4.2 Data analysis

This step comprises the most crucial part of the pipeline, i.e. the feature reduction method. The vector form of the text obtained in the previous step is the input to this stage. The proposed extended genetic algorithm is applied to this vector to obtain a proper subset of features that yields excellent accuracy. The general flowchart of the proposed AEGA is shown in Fig. 2. When the vector form is created, all the keywords are included, producing a vast feature vector.

The accuracy value would be compromised if such a massive feature set were used for performance measurement, so instead of all the features, a set of important features is selected by the proposed algorithm, which examines the complete vector. Initially, a randomly generated population is considered, carrying some number of chromosomes; each chromosome, also called a candidate solution, is a string of 0s and 1s. In the primary step, features are selected according to the 1s present in each chromosome of the population. Considering those features, the fitness value of each candidate solution is evaluated, and based on these fitness values, the four topmost solutions are acknowledged as parents for mating. This solution set is then ready for crossover, as sketched below.
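A minimal sketch of the population encoding and parent selection just described; the vocabulary size and random seed are illustrative, and the fitness function itself is defined in Sect. 4.2.4.

```python
import numpy as np

rng = np.random.default_rng(42)     # seed is illustrative
n_features = 1000                   # illustrative TF-IDF vocabulary size
pop_size = 8                        # eight chromosomes per generation (Sect. 4.2.1)

# Each chromosome is a binary mask over the feature space:
# 1 keeps the corresponding TF-IDF feature, 0 discards it.
population = rng.integers(0, 2, size=(pop_size, n_features))

def select_parents(population, fitness_values, k=4):
    """Keep the k fittest chromosomes (lowest log-loss, Sect. 4.2.4) as parents."""
    order = np.argsort(fitness_values)
    return population[order[:k]]
```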

Fig. 2 General flowchart of the AEGA model

4.2.1 Crossover

Crossover serves the mating step. In the proposed model, a two-phase crossover is applied. In the first phase, a one-point crossover is used, with the crossover point at the middle of the chromosome, so that half of the bits are exchanged with the other parent for better exploration. The crossover rate is fixed at 0.5 as a threshold; thus, each time there is a 50% chance of producing an offspring through crossover. This rate was chosen to reduce computational overhead while improving exploration. Crossover is applied to the parents from the previous step, but with one modification to the conventional scheme: the best solution in the parent list is carried over unchanged to the next round, which helps the search converge more precisely towards the optimal solution. Apart from this first offspring, every offspring contains half of its bits from one parent and half from the other, so four offspring are produced in this phase, with the best parent serving as the first offspring. After the first phase, the second-phase crossover is applied using a multi-point crossover. Two crossover points are selected by calculating the information gain of each feature: the features with minimum and maximum entropy are chosen as the first and second crossover points, respectively. In the second phase, all four parents participate without exclusion. Finally, after mutation, a new population of eight chromosomes is formed by taking four offspring each from the first-phase and second-phase crossovers.
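A sketch of both crossover phases follows. This is an interpretation of the description above, not the authors' exact implementation: the pairing of parents, and the arguments `cut_lo` and `cut_hi` (assumed to be the sorted indices of the minimum- and maximum-information-gain features), are assumptions.

```python
import numpy as np

def phase_one(parents, rng, rate=0.5):
    """Phase 1: elitism plus mid-chromosome one-point crossover at rate 0.5."""
    mid = parents.shape[1] // 2                     # crossover point at the middle
    offspring = [parents[0].copy()]                 # the best parent survives as-is
    for i in range(1, len(parents)):
        if rng.random() < rate:                     # 50% chance of crossover
            child = np.concatenate([parents[0][:mid], parents[i][mid:]])
        else:
            child = parents[i].copy()
        offspring.append(child)
    return np.array(offspring)

def phase_two(parents, cut_lo, cut_hi):
    """Phase 2: two-point crossover; the cut points come from the features with
    minimum and maximum information gain (all four parents take part)."""
    offspring = []
    for i in range(len(parents)):
        mate = parents[(i + 1) % len(parents)]      # pairing scheme is assumed
        child = parents[i].copy()
        child[cut_lo:cut_hi] = mate[cut_lo:cut_hi]  # swap the enclosed segment
        offspring.append(child)
    return np.array(offspring)
```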

Algorithm 1 Pseudocode of the proposed AEGA model

4.2.2 Mutation

The mutation operator helps the search explore the search space, giving the proposed model its exploration ability. According to the studies, the mutation rate should be neither very high nor very low: when it is very high, the solution cannot converge correctly, and when it is very low, the search falls into a local optimum. Thus, in the proposed model, the mutation rate is 0.015, applied randomly in the first stage. Then, for all offspring, the features with the maximum and minimum values of information gain are set to 1 and 0, respectively.
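A minimal sketch of this mutation step, assuming the information-gain extremes have already been located; `ig_max_idx` and `ig_min_idx` are hypothetical helpers, not names from the paper.

```python
import numpy as np

MUTATION_RATE = 0.015   # mutation probability used by the proposed model

def mutate(offspring, ig_max_idx, ig_min_idx, rng):
    """Random bit-flip mutation followed by information-gain guided overrides."""
    flips = rng.random(offspring.shape) < MUTATION_RATE   # ~1.5% of bits flip
    offspring = np.where(flips, 1 - offspring, offspring)
    offspring[:, ig_max_idx] = 1   # max information-gain feature is always kept
    offspring[:, ig_min_idx] = 0   # min information-gain feature is always dropped
    return offspring
```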

4.2.3 Mathematical model

The proposed model searches the dataset for a subset of features most relevant to the dependent variable. Let A contain all the tokens from the text after preprocessing, and let l be the labeled value in the given dataset.

$$\begin{aligned} A = \{a_1, a_2, a_3,..., a_n\} \end{aligned}$$
(1)

Now, a set S is to be chosen such that

$$\begin{aligned} S \subset A \quad \mathrm{{and}} \quad \sum \limits _{i=1}^{n} S_i \approx l \end{aligned}$$
(2)

where \(S_i\) represents the sentiment score of the selected features and l represents the labeled value. The optimal solution, represented as B, is obtained from Eq. (2) and shown in Eq. (3).

$$\begin{aligned} \overrightarrow{B} = \{b_1, b_2, b_3,..., b_n\} \end{aligned}$$
(3)

This vector represents the minimal set of features obtained from the entire document given to the model.

4.2.4 Fitness evaluation

The fitness of the candidate solution is evaluated using the log-loss function. At first, for each candidate solution, all the features with the value “1” are selected. Considering these selected features, log-loss is applied to the dataset with a 70% training sample and 30% test sample, where each third item from the dataset comes under the test sample. For each candidate solution, the log-loss value is calculated by considering the selected features from the above step. The candidate solution having the minimum value of log-loss is considered the fittest solution. This technique is used to evaluate the best candidate solution in each iteration. With the help of this calculation, the proposed model approaches the optimum solution. The log-loss value is calculated using Eq. (4).

$$\begin{aligned} \mathrm{{Logloss}}_i = -[y_i\ \mathrm{{ln}}(p_i) + (1-y_i)\ \mathrm{{ln}}(1-p_i)] \end{aligned}$$
(4)

where i is the current observation, y denotes the real value, p denotes the probability of prediction, and a number’s natural logarithm is denoted as ln.
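A sketch of this fitness evaluation, assuming a scikit-learn Naïve Bayes learner (the classifier the paper ultimately selects in Sect. 5.2; the original fitness evaluator is not spelled out, so this is an approximation).

```python
import numpy as np
from sklearn.naive_bayes import MultinomialNB
from sklearn.metrics import log_loss

def fitness(chromosome, X, y):
    """Log-loss of a classifier trained on the features the chromosome keeps.

    Every third sample is held out for testing (~30%), as described above.
    """
    selected = np.flatnonzero(chromosome)       # indices of the bits set to 1
    if selected.size == 0:
        return np.inf                           # an empty subset is never fit
    test_idx = np.arange(2, len(y), 3)          # each third item -> test sample
    train_idx = np.setdiff1d(np.arange(len(y)), test_idx)
    clf = MultinomialNB().fit(X[train_idx][:, selected], y[train_idx])
    proba = clf.predict_proba(X[test_idx][:, selected])
    return log_loss(y[test_idx], proba)         # lower log-loss = fitter solution
```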

4.2.5 Feature selection (FS)

After mutation is complete, a new population is generated and used in the next iteration, and the process repeats for many generations. As shown in Fig. 3, convergence is reached by about the \(2300\mathrm{{th}}\) iteration, after which no further changes occur; thus, in the proposed model, the maximum number of generations is set to 2500 (\(\mathrm{{max\_gen}} = 2500\)). After all iterations, a subset of features closely related to the dataset labels is collected, and less relevant features are excluded from the list. Next, ANOVA is applied to the same dataset to select the n best features, where n denotes the number of features selected using the AEGA. Finally, a logical AND operation is performed on the outputs of both AEGA and ANOVA, and the features with value 1 are taken as the reduced feature set for classification, as sketched below.
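The AEGA-ANOVA combination can be sketched as follows, using scikit-learn's F-test-based `SelectKBest` as the ANOVA step; the function name and the guard against an empty mask are assumptions.

```python
import numpy as np
from sklearn.feature_selection import SelectKBest, f_classif

def combine_with_anova(aega_mask, X, y):
    """AND-combine the AEGA mask with the n best ANOVA (F-test) features."""
    n = max(1, int(aega_mask.sum()))            # n = number of features AEGA kept
    skb = SelectKBest(score_func=f_classif, k=n).fit(X, y)
    anova_mask = skb.get_support().astype(int)  # 1 where ANOVA keeps the feature
    return aega_mask & anova_mask               # keep features both methods agree on
```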

The suggested method’s pseudocode is presented in Algorithm 1.

5 Experiment setup

This section describes the operating environment, datasets, performance measures, classifiers, and parameter settings of the suggested approach. All experiments are performed in Python with Jupyter Notebook on a Windows 10 machine with an Intel Core i5 processor, 12 GB of RAM, and a 250 GB SSD. To achieve reliable outcomes, each experiment is repeated over multiple independent runs.

5.1 Dataset

Experiments are conducted on two datasets: the Amazon Review dataset, collected from the Kaggle repository (https://www.kaggle.com/datasets/marklvl/sentiment-labelled-sentences-data-set), which contains unique reviews labeled 0 or 1, and the Restaurant Customer Review dataset, gathered from the Kaggle repository (https://www.kaggle.com/vigneshwarsofficial/reviews), which carries one 0/1 label per review. In both datasets, 0 and 1 represent negative and positive sentiment, respectively.
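Loading these datasets might look as follows; the file names are assumed from the public Kaggle versions, and the column names for the Amazon file are illustrative since the raw file ships without a header.

```python
import pandas as pd

# File names assumed from the public Kaggle versions of these datasets.
# The Amazon file is tab-separated with no header (review, then 0/1 label);
# the restaurant file ships with a header row.
amazon = pd.read_csv("amazon_cells_labelled.txt", sep="\t",
                     names=["review", "label"])
restaurant = pd.read_csv("Restaurant_Reviews.tsv", sep="\t")
```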

5.2 Classifier and performance evaluation

To start the evaluation, several standard classifiers were used to predict the desired labels, and after running the model, Naïve Bayes was selected as the classifier for the proposed model. Existing algorithms are applied to the same datasets, and their results are evaluated and compared with the proposed feature reduction technique. Finally, after each run, the outcome is assessed using accuracy, precision, recall, and the F1 score, together with some statistical measures, rather than accuracy alone. The result values are tabulated for a clear view of the outcome.
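The evaluation described here can be sketched with scikit-learn's standard metric functions; the helper name is illustrative.

```python
from sklearn.metrics import (accuracy_score, average_precision_score, f1_score,
                             precision_score, recall_score, roc_auc_score)

def evaluate(clf, X_test, y_test):
    """Compute the metrics reported in Sect. 6 for a fitted binary classifier."""
    y_pred = clf.predict(X_test)
    y_prob = clf.predict_proba(X_test)[:, 1]    # probability of the positive class
    return {
        "accuracy": accuracy_score(y_test, y_pred),
        "precision": precision_score(y_test, y_pred),
        "recall": recall_score(y_test, y_pred),
        "f1": f1_score(y_test, y_pred),
        "auc": roc_auc_score(y_test, y_prob),
        "avg_precision": average_precision_score(y_test, y_prob),
    }
```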

Standard deviation (\(\sigma\)), skewness (\(\tilde{\mu }_3\)), Kurtosis value, and Jarque-Bera statistics (JB) are used to evaluate the classification performance statistically and can be calculated using Eqs. 5, 6, 7, and 8, respectively.

$$\begin{aligned} \sigma = \sqrt{\frac{\Sigma (x_i - \mu )^2}{N}} \end{aligned}$$
(5)

where N denotes the population size, \(x_i\) is the value of each element, and \(\mu\) is the mean of the population.

$$\begin{aligned} \tilde{\mu }_3 = \frac{\sum \nolimits _{i=1}^{N} (x_i - \overline{x})^3}{(N - 1)\, \sigma ^3} \end{aligned}$$
(6)

where N is the number of variables in the distribution, \(x_i\) is the selected variable, \(\overline{x}\) is the distribution mean, and \(\sigma\) denotes the standard deviation.

$$\begin{aligned} \mathrm{{Kurtosis}} = N \, \frac{\sum \nolimits _{i=1}^{N} (Y_i - \overline{Y})^4}{\left( {\sum \nolimits _{i=1}^{N} (Y_i - \overline{Y})^2}\right) ^2} \end{aligned}$$
(7)

where \(Y_i\) is the \(i\mathrm{{th}}\) variable, \(\overline{Y}\) is the distribution mean, and N is the number of variables engaged in the distribution.

$$\begin{aligned} \mathrm{{JB}} = n \left[ \frac{(\sqrt{b_1})^2}{6} + \frac{(b_2 - 3)^2}{24}\right] \end{aligned}$$
(8)

where n represents the sample size, \(\sqrt{b_1}\) denotes the coefficient of skewness, and \(b_2\) denotes the coefficient of kurtosis.
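These statistics can be reproduced with SciPy as sketched below; note that SciPy's estimators use slightly different normalizations than Eqs. (5)-(8) (for example, sample versus population skewness), so small numerical differences are expected.

```python
import numpy as np
from scipy import stats

def describe(scores):
    """Std. deviation, skewness, kurtosis, and Jarque-Bera over repeated runs."""
    scores = np.asarray(scores, dtype=float)
    jb_stat, _jb_p = stats.jarque_bera(scores)
    return {
        "std": np.std(scores),                             # Eq. (5), population form
        "skewness": stats.skew(scores),                    # cf. Eq. (6)
        "kurtosis": stats.kurtosis(scores, fisher=False),  # cf. Eq. (7)
        "jarque_bera": jb_stat,                            # Eq. (8)
    }
```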

5.3 Parameter adjustment

The recommended model needs some parameter settings. Parameters such as the population size (S), mutation probability (MP), crossover rate (CP), maximum number of generations (max_gen), mating pool size (MS), and initial random population size (PS) must be set before execution. These parameters are listed in Table 1.

Table 1 Parameter adjustment of AEGA
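Collecting the values stated in the text, the configuration might look like the following dictionary; Table 1 remains the authoritative source, and the key names here are illustrative.

```python
# Values gathered from Sects. 4.2.1-4.2.5; Table 1 is the authoritative source.
AEGA_PARAMS = {
    "mutation_rate": 0.015,    # MP (Sect. 4.2.2)
    "crossover_rate": 0.5,     # CP (Sect. 4.2.1)
    "max_gen": 2500,           # maximum generations (Sect. 4.2.5)
    "mating_pool_size": 4,     # MS: four fittest parents per generation
    "population_size": 8,      # S: eight chromosomes per generation
}
```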

6 Results and discussions

This section first compares the classification results of conventional classifiers (NB, SVM, DT, RF, and KNN) with the AEGA as the base model for feature selection, and then compares the proposed work with different existing algorithms such as PSO, GA, and RFE. The evaluation is done on the same platform, and the classification results are shown both graphically and in tables. All experiments are carried out using the 70-30% splitting strategy. Different strategies, such as 90-10%, 80-20%, 70-30%, or 60-40%, could be used to divide the data into training and testing groups; the 70-30% strategy was chosen for the proposed approach based on the analysis detailed in Table 2.

Table 2 Comparison of different splitting approaches

As shown in Table 2, the 70-30% strategy provides better results in almost all cases (precision being the exception) compared with the other strategies on both datasets. Thus, the 70-30% strategy is adopted as the splitting strategy in the present work.
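A sketch of how such a split comparison can be run, assuming `X` and `y` are the reduced TF-IDF matrix and labels from the earlier steps, with the paper's chosen NB classifier; the exact comparison pipeline is an assumption.

```python
from sklearn.model_selection import train_test_split
from sklearn.naive_bayes import MultinomialNB
from sklearn.metrics import accuracy_score

# Try each candidate split on the reduced feature matrix X and labels y.
for test_size in (0.1, 0.2, 0.3, 0.4):
    X_tr, X_te, y_tr, y_te = train_test_split(
        X, y, test_size=test_size, random_state=0, stratify=y)
    acc = accuracy_score(y_te, MultinomialNB().fit(X_tr, y_tr).predict(X_te))
    print(f"{round((1 - test_size) * 100)}-{round(test_size * 100)}% split: "
          f"accuracy = {acc:.3f}")
```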

Focusing on the objectives of the proposed work, the experiments aim to select as few features as possible while improving the efficiency of the model. The proposed model uses the enhanced GA and ANOVA together to achieve this goal. First, the local-minima convergence issue of the GA is addressed by the unique selection method adopted in this work: in each iteration, the model keeps the best chromosome as an offspring, helping it converge faster towards the global optimum. Further, the impact of each feature on the label is measured using ANOVA, and the best feature set is finally obtained by combining both approaches, yielding fewer features than the competing approaches. Sects. 6.1 and 6.2 show the usefulness and efficiency of the proposed work through tabular and graphical representations.

6.1 Using Amazon dataset

Tables 3 and 4 show the statistical and technical analysis of the proposed AEGA model with different classification algorithms using the Amazon dataset. This analysis is done to find the best classifier with the AEGA model after getting the optimal feature set to classify the sentiment. In statistical analysis, the NB classifier shows good performance, and in technical analysis, the NB performs outstandingly compared to other classifiers. By considering both analyses, NB is taken as the classifier, with the proposed model as the base model for the feature selection approach. The dataset is then applied to the final model, and the result is compared with different feature selection models to show the efficacy of the proposed work. Tables 5 and 6 show the statistical and technical metrics comparisons of the AEGA model with different existing models. Table 6 clearly shows the AEGA model outperforms other models with an accuracy of 78.6%, precision of 76.58%, recall value of 82.40%, F1 score of 79.38%, AUC value of 0.89, and average precision of 0.87.

In the AEGA model, 2500 iterations are taken to train the model completely. In each iteration, the local best accuracy, the global best accuracy, and the global best loss are calculated; their values per iteration are plotted in Figs. 3, 4, and 5, respectively. Figures 6 and 7 show the precision-recall (PR) and ROC curves for all the classifiers used with the AEGA model, and Figs. 8 and 9 show the PR and ROC curves for the existing algorithms alongside the AEGA model. Figure 10 graphically compares the evaluation performance of all the algorithms, and Fig. 11 compares the number of features selected after applying each algorithm.

Table 3 Statistical analysis of Amazon dataset using different classification algorithms with AEGA
Table 4 Technical analysis of Amazon dataset using different classification algorithms with AEGA
Fig. 3 Log-loss value over iterations

Fig. 4 Global best accuracy over iterations

Fig. 5 Local best accuracy over iterations

Fig. 6 PR curve for all classifiers using AEGA

Fig. 7 ROC curve for all classifiers using AEGA

Fig. 8 PR curve for existing algorithms

Fig. 9 ROC curve for existing algorithms

Fig. 10 Performance of existing algorithms

Fig. 11 Selected features using different FS algorithms

Table 5 Statistical analysis of Amazon dataset with different feature selection algorithms
Table 6 Technical analysis of Amazon dataset with different feature selection algorithms

6.2 Using restaurant review dataset

The statistical and technical analysis of the proposed AEGA model with different classification algorithms on the Restaurant Review dataset is shown in Tables 7 and 8, mirroring the analysis for the Amazon dataset. In both the statistical and the technical analysis, the NB approach outperforms the other machine learning classifiers by a small margin, so the NB model is chosen as the classifier for the proposed AEGA model. The final model's performance is then compared with other existing models to check its proficiency: the statistical metrics are shown in Table 9, and the technical metrics are shown in Table 10. With an accuracy of 77.7%, a precision of 76.38%, a recall of 80.20%, an F1 score of 78.24%, an AUC of 0.89, and an average precision of 0.89, Table 10 clearly illustrates that the proposed model outperforms the other models.

Table 7 Statistical analysis of restaurant review dataset using different classification algorithms with AEGA
Table 8 Technical analysis of restaurant review dataset using different classification algorithms with AEGA
Fig. 12 Log-loss value over iterations

Fig. 13 Global best accuracy over iterations

Fig. 14 Local best accuracy over iterations

Table 9 Statistical analysis of restaurant review dataset with different feature selection algorithms
Table 10 Technical analysis of restaurant review dataset with different existing algorithms
Fig. 15 PR curve for all classifiers using AEGA

Fig. 16 ROC curve for all classifiers using AEGA

Fig. 17 PR curve for existing algorithms

Fig. 18 ROC curve for existing algorithms

Fig. 19 Performance of existing algorithms

Fig. 20 Selected features using different FS algorithms

Table 11 Comparison of selected features using different FS algorithms

In the AEGA model, 2500 iterations are required to fully train the model. The local best accuracy, global best accuracy, and global best loss are calculated in each cycle, and their values per iteration are plotted in Figs. 12, 13, and 14, respectively. Figures 15 and 16 show the PR and ROC curves for all classifiers used with the AEGA model, while Figs. 17 and 18 depict the PR and ROC curves for the existing methods alongside the AEGA model. Figure 19 compares the performance of all the algorithms, and Fig. 20 compares the number of selected features; Table 11 summarizes the feature counts obtained with the different algorithms.

7 Practical implication

In the internet era, sentiment analysis is a powerful tool to apply before almost any decision. The proposed model can be deployed in any business field, such as e-commerce, to learn users' requirements, likes, and dislikes. Based on the analysis of user opinions about certain products, quality can be improved to secure a reasonable profit margin. This efficient model can also filter people's views on any political or social debate effectively: instead of processing all the data, a few informative details can be selected to reach a final conclusion. The model is useful for any kind of review, such as product, website, restaurant, political, social, or hotel reviews, and, more broadly, in any field that must draw conclusions from enormous amounts of textual data.

8 Conclusion

This paper introduced a novel feature minimization technique that works with an evolutionary approach and provides a better set of selected features, which helps classify sentiment with high accuracy. The proposed work extends the metaheuristic genetic algorithm so that it consistently moves towards an optimal solution. The novel two-phase crossover lets the algorithm carry the fittest solution forward in every generation. Finally, comparisons with different existing algorithms demonstrated the proposed work's efficiency relative to conventional classification approaches: it outperforms them across the measured parameters, including accuracy, recall, F1 score, AUC, and average precision. In future work, the proposed model will be further extended to solve other optimization problems, such as feature selection in semi-supervised and multi-label data.