Hybrid approach of improved binary particle swarm optimization and shuffled frog leaping for feature selection

https://doi.org/10.1016/j.compeleceng.2018.02.015

Abstract

People increasingly share opinions, feedback, and suggestions on a wide range of topics through websites, e-forums, and blogs. Consequently, consumers rely heavily on product reviews before buying products or using services. However, not all reviews available on the internet are authentic. Spammers manipulate reviews to either devalue or promote products, and these spurious reviews (i.e., spam content) lead customers to make wrong decisions. To address this problem, a hybrid approach of improved binary particle swarm optimization and the shuffled frog leaping algorithm is proposed to reduce the high dimensionality of the feature set and to select optimized feature subsets. Our approach helps customers ignore fake reviews and enhances classification performance by providing trustworthy reviews. Naive Bayes (NB), k-Nearest Neighbor (kNN), and Support Vector Machine (SVM) classifiers were used for classification. The results indicate that the proposed hybrid feature selection method provides an optimized feature subset and obtains higher classification accuracy.

Introduction

The amount of content available to users on the internet is increasing rapidly [1]. When purchasing a product or using a service, customers tend to make decisions relying solely on the information available on review sites [2]. However, there is limited quality control for this data. This limitation invites people to post spurious reviews in order to either promote or demote products [3]. Such individuals are known as opinion spammers. Positive spam reviews about a product may lead to financial gains and help to increase the popularity of the product [4]. Similarly, negative spam reviews are posted with the intention of defaming a product or service [5]. Recently, the problem of spam or fake reviews has been on the rise, and many such cases have been reported in the news. Hence, it is necessary to establish the authenticity of these reviews. Feature selection (FS) is a technique in which a subset of features is selected from the original dataset [6]. It is mainly used to build more robust learning models and to reduce the processing cost. The main purpose of feature selection is to reduce the number of features in order to increase both the performance of the model and the accuracy of classification [7]. FS can be viewed as a search in a state space. In principle, a full search can evaluate every state in the search space. However, this approach is not feasible when the number of features is very large. A heuristic search instead considers, at each iteration, the features that have not yet been selected as candidates for evaluation. A random search generates random subsets within the search space and evaluates them by their contribution to classification performance.
Due to their randomized nature, meta-heuristics such as particle swarm optimization (PSO), evolutionary algorithms (EA), the bat algorithm (BA), ant colony optimization (ACO), and genetic algorithms [8], [9] are widely used for feature selection. When the feature space is high dimensional, selecting the optimal feature subset using traditional optimization methods has not proven to be effective. Therefore, meta-heuristic algorithms are used extensively for the appropriate selection of features. Two types of feature selection methods, namely the filter method and the wrapper method, can be used to select a subset of features. The filter model analyzes the intrinsic properties of the data without involving any learning algorithm [9] and can perform both subset selection and ranking. Although ranking identifies the importance of all the features, this method is mostly used as a pre-processing step because it can select redundant features. The wrapper model, unlike filter approaches, considers the relationships between features [10]. This method first uses an optimization algorithm to generate various subsets of features and then uses a classification algorithm to evaluate the generated subsets.
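
The wrapper idea described above can be illustrated with a minimal sketch. It is not the authors' implementation: the leave-one-out 1-nearest-neighbour scorer, the function names, and the six-sample synthetic dataset are all assumptions made purely for illustration. The subsets are enumerated exhaustively, which is only possible here because there are three features (7 non-empty subsets); with n features there are 2^n - 1 subsets, which is why a meta-heuristic would replace the enumeration in practice.

```python
from itertools import product


def loo_1nn_accuracy(X, y, mask):
    """Leave-one-out accuracy of a 1-nearest-neighbour classifier,
    using only the features whose mask bit is 1."""
    cols = [j for j, keep in enumerate(mask) if keep]
    if not cols:
        return 0.0  # an empty subset cannot classify anything
    correct = 0
    for i in range(len(X)):
        best_dist, best_label = float("inf"), None
        for k in range(len(X)):
            if k == i:
                continue  # leave sample i out
            dist = sum((X[i][j] - X[k][j]) ** 2 for j in cols)
            if dist < best_dist:
                best_dist, best_label = dist, y[k]
        correct += best_label == y[i]
    return correct / len(X)


def exhaustive_wrapper(X, y):
    """Score every feature mask with the classifier and return the best one."""
    n = len(X[0])
    return max(product([0, 1], repeat=n),
               key=lambda m: loo_1nn_accuracy(X, y, m))


# Synthetic toy data (hypothetical): feature 0 separates the classes,
# features 1 and 2 are noise on a larger scale.
X = [
    [0.1, 5.0, 1.0],
    [0.2, 1.0, 9.0],
    [0.3, 8.0, 4.0],
    [0.9, 5.1, 1.1],
    [1.0, 1.1, 9.1],
    [1.1, 8.1, 4.1],
]
y = [0, 0, 0, 1, 1, 1]

best_mask = exhaustive_wrapper(X, y)
print(best_mask, loo_1nn_accuracy(X, y, best_mask))
print("all features:", loo_1nn_accuracy(X, y, (1, 1, 1)))
```

On this toy data the noisy features dominate the distance computation, so the full feature set scores worse than the single informative feature. This is the effect of irrelevant features that feature selection is meant to remove.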

A rule-based approach was investigated to detect fake reviews, in which unexpected rules were defined to detect unusual behaviors of reviewers [11]. The study used a dataset available from Amazon to identify spam activities. The N-gram method was applied to detect negative deceptive opinions [12]. A gold-standard negative spam dataset containing 400 reviews of 20 hotels in Chicago was used. Support Vector Machine (SVM) classifiers were trained on unigram and bigram features. The results revealed that the N-gram based SVM classifier achieved 86% accuracy, surpassing human judges. Two kinds of N-gram methods, namely the character n-gram (BON) and the word n-gram (BOW), were proposed to detect fake reviews [5]. A Naive Bayes (NB) classifier was used for classifying both positive and negative reviews. The experimental results showed that the NB classifier achieved better results for positive reviews, whereas the SVM method showed better results in classifying deceptive and truthful negative reviews. The authors claimed that BON was more robust than BOW, as it provided superior results with a small training dataset.
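
As a minimal sketch of the two feature types mentioned above, word n-grams and character n-grams can be extracted as follows. The lowercasing, the whitespace tokenization, and the example review text are assumptions for illustration; the cited studies may have used different preprocessing.

```python
from collections import Counter


def word_ngrams(text, n):
    """Word n-grams (as tuples) from a lowercased, whitespace-split text."""
    tokens = text.lower().split()
    return [tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]


def char_ngrams(text, n):
    """Character n-grams (as strings); spaces are kept as characters."""
    s = text.lower()
    return [s[i:i + n] for i in range(len(s) - n + 1)]


# Hypothetical review text, used only to show the shape of the features.
review = "The room was clean and the staff was friendly"
unigrams = Counter(word_ngrams(review, 1))
bigrams = Counter(word_ngrams(review, 2))
char_trigrams = Counter(char_ngrams(review, 3))

print(unigrams[("the",)], bigrams[("the", "room")])
```

The resulting counts form the sparse feature vectors on which a classifier such as SVM or NB would be trained.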

Content duplication was used to identify fake reviews [13]. Both duplicate and near-duplicate reviews were considered in the training dataset. Furthermore, two different techniques for spam detection were considered in the test dataset. The authors described content-based features covering three categories of reviews, the first being the similarity of a review to the same author's reviews and to other reviews of the target products. They also described reviewer-centric features based on burst patterns. A probabilistic language model was developed to generate a similarity score between reviews [14]. This approach estimates the likelihood that one review was derived from another. To detect content similarity, pairs of reviews were compared using the Kullback–Leibler divergence. In addition, the Kullback–Leibler divergence measure was used to calculate a spam score for every review. SVM was chosen to classify reviews as spam or ham. The method achieved 81% precision in detecting spam reviews.
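
The Kullback–Leibler comparison of two reviews can be sketched as below. This is an illustrative assumption rather than the method of [14]: each review is modelled as a unigram distribution over the combined vocabulary, and additive (Laplace) smoothing with a hypothetical parameter alpha is used so that no probability is zero. The three example reviews are invented.

```python
import math
from collections import Counter


def unigram_model(text, vocab, alpha=1.0):
    """Additively smoothed unigram distribution over a shared vocabulary."""
    counts = Counter(text.lower().split())
    total = sum(counts[w] for w in vocab) + alpha * len(vocab)
    return {w: (counts[w] + alpha) / total for w in vocab}


def kl_divergence(p, q):
    """D_KL(P || Q) = sum over w of P(w) * log(P(w) / Q(w)), in nats."""
    return sum(p[w] * math.log(p[w] / q[w]) for w in p)


def review_divergence(a, b, alpha=1.0):
    """KL divergence between the smoothed unigram models of two reviews."""
    vocab = sorted(set(a.lower().split()) | set(b.lower().split()))
    return kl_divergence(unigram_model(a, vocab, alpha),
                         unigram_model(b, vocab, alpha))


# Hypothetical reviews: r2 is a near-duplicate of r1, r3 is unrelated.
r1 = "great hotel friendly staff clean room"
r2 = "great hotel friendly staff clean rooms"
r3 = "terrible food slow service never again"

print(review_divergence(r1, r2), review_divergence(r1, r3))
```

A low divergence indicates that the two reviews use nearly the same wording, which is the signal a duplicate-detection approach relies on. Note that the KL divergence is not symmetric, so the order of the two reviews matters in general.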

Stylometric features, characterized as either lexical or syntactic representations, were used for identifying review spam. While lexical features represent character-based or word-based features, syntactic features capture the reviewer's writing style at the sentence level. A graph-based methodology, with the graph comprising three nodes, namely the review, the reviewer, and the store, was applied to detect review spammers [15], [16]. It establishes the inter-relationships between nodes by evaluating the credibility of the reviewer, the honesty of the reviews, and the reliability of the store. In this case, an agreement score is calculated based on the user rating. The reliability of the store depends on the credibility of its reviewers' comments.

Existing works have investigated traditional feature selection techniques such as bag of words, bag of nouns, linguistic features, weighted PCA, and keyword spotting, together with machine learning algorithms, for review spam classification. However, to date no attempts have been made to use hybrid evolutionary algorithms for review spam classification. Evolutionary algorithms have been applied in different applications such as scheduling, power systems, and wireless sensor networks. This is the first study that utilizes evolutionary algorithms for classifying reviews into spam and ham. FS plays a major role in classification. Hence, many researchers focus primarily on statistical measures to choose the features. However, these methods do not furnish an appropriate solution space. The size of the search space increases exponentially with the number of features in a given dataset. Traditional feature selection techniques retain a large number of features. Not all of them are required during classification, and a substantial number of irrelevant and redundant features tends to degrade the overall performance of the classifier.

Proposed model

The proposed methodology uses evolutionary algorithms for FS in order to obtain a feature subset that achieves better classification accuracy and identification of fake reviews. It consists of four phases, namely preprocessing, feature extraction, feature subset selection using hybrid iBPSO and SFLA, and classification. The block diagram of the proposed system is illustrated in Fig. 1.
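
To give a sense of the feature subset selection phase, the following is a minimal sketch of a standard binary PSO, in which each particle is a bit mask over the features and a sigmoid of the velocity gives the probability of setting each bit. It is explicitly not the authors' iBPSO and it contains no SFLA step, since those details are not given in this excerpt. The parameter values (w, c1, c2, v_max), the function names, and the toy fitness function are all assumptions; in a wrapper setting the fitness would be the accuracy of a classifier trained on the masked features.

```python
import math
import random


def sigmoid(v):
    """Transfer function mapping a velocity to a bit probability."""
    return 1.0 / (1.0 + math.exp(-v))


def binary_pso(fitness, n_features, n_particles=10, n_iter=30,
               w=0.7, c1=1.5, c2=1.5, v_max=4.0, seed=0):
    """Maximize `fitness` over bit masks of length n_features."""
    rng = random.Random(seed)
    X = [[rng.randint(0, 1) for _ in range(n_features)]
         for _ in range(n_particles)]
    V = [[0.0] * n_features for _ in range(n_particles)]
    pbest = [x[:] for x in X]
    pbest_fit = [fitness(x) for x in X]
    g = max(range(n_particles), key=lambda i: pbest_fit[i])
    gbest, gbest_fit = pbest[g][:], pbest_fit[g]

    for _ in range(n_iter):
        for i in range(n_particles):
            for j in range(n_features):
                r1, r2 = rng.random(), rng.random()
                v = (w * V[i][j]
                     + c1 * r1 * (pbest[i][j] - X[i][j])
                     + c2 * r2 * (gbest[j] - X[i][j]))
                V[i][j] = max(-v_max, min(v_max, v))  # clamp the velocity
                # Resample the bit with probability sigmoid(velocity).
                X[i][j] = 1 if rng.random() < sigmoid(V[i][j]) else 0
            fit = fitness(X[i])
            if fit > pbest_fit[i]:
                pbest[i], pbest_fit[i] = X[i][:], fit
                if fit > gbest_fit:
                    gbest, gbest_fit = X[i][:], fit
    return gbest, gbest_fit


# Hypothetical toy fitness: number of bits that agree with a target mask.
target = [1, 0, 1, 1, 0, 0, 1, 0]


def toy_fitness(mask):
    return sum(a == b for a, b in zip(mask, target))


mask, fit = binary_pso(toy_fitness, n_features=len(target))
print(mask, fit)
```

Because the search is stochastic, the returned mask is the best one found for the given seed and is not guaranteed to be the global optimum.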

Simulation results and discussion

The proposed hybrid iBPSO and SFLA algorithms were implemented in Java on an Intel P4 2.66 GHz CPU with 16 GB RAM, running the Windows XP Professional operating system. In this experiment, the hybrid iBPSO and SFLA FS algorithms were used to select optimized subsets from the review spam dataset. The results of each stage of the proposed method are presented below.

Conclusion

Feature selection is critical to improving classification performance. Hence, it is important to discard the irrelevant and noisy features from a given dataset that would decrease the classification accuracy. A number of methodologies have been adopted to select the best feature subset. In this investigation, a hybrid approach was applied for selecting the optimized feature subset. This hybrid methodology efficiently reduces the feature subset size due to randomization, which in

Acknowledgments

The authors would like to acknowledge the efforts from Dr. Biswapriya B. Misra [ORCID ID: 0000-0003-2589-6539], Assistant Professor, Internal Medicine, Wake Forest Baptist Medical Center, Winston-Salem, NC, USA for extensive help in editing the current version of the manuscript for language issues.

S. P. Rajamohana is an Assistant Professor in the Department of Information Technology at the PSG College of Technology, Coimbatore, India. She completed her Master's in Information Technology at the same institution and is currently pursuing her PhD in Information and Communication Engineering at PSG College of Technology, Anna University, Chennai. Her research interests include review spam classification and evolutionary algorithms.

References (31)

  • S. Abdul-Rahman et al.

    Optimizing big data in bioinformatics with swarm algorithms

  • L.Y. Chuang et al.

    Chaotic binary particle swarm optimization for feature selection using logistic map

  • R. Nakamura et al.

    BBA: a binary bat algorithm for feature selection

  • B. Xue et al.

    Particle swarm optimization for feature selection in classification: a multi-objective approach

    IEEE Trans. Cybern.

    (2013)

  • N. Jindal et al.

    Finding unusual review patterns using unexpected rules

K. Umamaheswari is Professor and Head of the Department of Information Technology at the PSG College of Technology, India. She completed her Bachelor's and Master's degrees in Computer Science and Engineering in 1989 and 2000, respectively, and received her PhD from Anna University in 2010. She has 22 years of teaching experience and more than 100 publications in international and national journals and conferences. Her research interests include data mining, cognitive networks and information retrieval.

Reviews processed and recommended for publication to the Editor-in-Chief by Guest Editor Dr. O. Bayat.
