Sequential targeting: A continual learning approach for data imbalance in text classification

https://doi.org/10.1016/j.eswa.2021.115067

Highlights

  • Handling data imbalance in text data with continual learning.

  • Detecting sexual harassment and sentiment in comments on social media platforms.

  • Effective trade-off between precision and recall for an overall increase in F1-score.

  • Results show a 56.38 %p increase in F1-score on the IMDB dataset.

  • Results show 16.89 %p and 34.76 %p increases in F1-score on the NAVER datasets.

Abstract

Text classification has numerous use cases, including sentiment analysis, spam detection, document classification, and hate speech detection. In realistic settings, text classification confronts imbalanced data conditions where the classes of interest usually compose only a minor fraction of the data. Deep neural networks used for text classification, such as recurrent neural networks and transformer networks, lack efficient methods for addressing imbalanced data. Traditional data-level methods that attempt to mitigate distributional skew include oversampling and undersampling. Oversampling methods degrade the quality of the original language representation of the sparse data from minority classes, whereas undersampling methods fail to fully utilize the rich context of majority classes. We address these issues by enforcing continual learning on imbalanced data: we partition the training data distribution into mutually exclusive subsets and perform continual learning, treating the individual subsets as distinct tasks. We demonstrate the effectiveness of our method through experiments on the IMDB dataset and on datasets constructed from real-world data. The experimental results show that the proposed method improves the F1-score by 56.38 %p on the IMDB dataset and by 16.89 %p and 34.76 %p on the constructed datasets compared to the baseline method.

Introduction

Classification is the task of learning a discriminant function from given data that assigns previously unseen data to the correct classes. In realistic settings, however, it is rarely the case that the discrete distribution of the training data acquired to develop the classifier is perfectly balanced across all classes. Classifiers trained in imbalanced settings tend to become biased towards the class with more samples (Xiao, Wang, & Du, 2019). In order to develop intelligent classifiers, finding methods that keep the classifier from becoming biased towards a certain class is of great importance.

With the recent breakthroughs in Natural Language Processing, many applications leverage text classification methods. Earlier methods rely on separate feature extraction and classification stages (Kowsari et al., 2019). Texts are converted into a structured feature vector space with methods such as TF-IDF (Jones, 1972), Word2Vec (Mikolov, Chen, Corrado, & Dean, 2013), and GloVe (Pennington, Socher, & Manning, 2014). The learned feature vectors are then given as inputs to separate classifiers such as the Naive Bayes classifier (Frank & Bouckaert, 2006), Support Vector Machines (Karamizadeh, Abdullah, Halimi, Shayan, & javad Rajabi, 2014), and Conditional Random Fields (Wallach, 2004).

Text classification has a broad field of applications, encompassing tasks from toxic comment detection to sentiment analysis. Deep learning models such as Recurrent Neural Networks (Wang, Hamza, & Florian, 2017), Convolutional Neural Networks (Guo, Zhang, Liu, & Ma, 2019), and their hybrids (Chen, Xu, He, & Wang, 2017) have proven to be much more effective than previous rule-based or machine learning-based text classification approaches. Moreover, transformer networks such as BERT (Devlin et al., 2018, Qin et al., 2020) show state-of-the-art performance in text classification. However, previous methods that attempt to mitigate imbalanced data distributions in text classification are highly incompatible with the aforementioned deep learning models: unlike rule-based and classical machine learning approaches, deep learning approaches conduct feature extraction and classification within a single model. Despite their recent successes, DNNs still struggle to generalize to a balanced testing criterion when the training data is imbalanced (Dong, Gong, & Zhu, 2018).

Previous methods addressing data imbalance in text classification can be categorized into data-level and algorithm-level methods. Data-level methods (Kaur, Pannu, & Malhi, 2019) manipulate the data by undersampling majority classes or oversampling minority classes. Data-level methods such as SMOTE and its variants (Chawla et al., 2002, Han et al., 2005) have proven to be highly effective. However, such methods require an effective numerical representation algorithm a priori, which reduces their compatibility with current deep neural network models: many deep neural networks unite the feature representation learning phase and the classifier learning phase, so data-level methods that require feature representations in advance do not fit the current scheme of deep learning. Algorithm-level methods modify the underlying classifier or its output to reduce bias towards the majority group (Haixiang et al., 2017). However, these methods are task-sensitive and somewhat heuristic, since they require practitioners to modify the classifier according to the innate properties of the task. Traditional random oversampling and undersampling (ROS, RUS), which simply duplicate or discard data instances, are free from these two limitations and are compatible with DNNs. However, they fail to preserve the information of the original data distribution and leave room for improvement. Task-independent methods for data imbalance in text classification that do not require feature-space algorithms are needed to train deep neural network classifiers.
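For reference, ROS and RUS reduce to a few lines of resampling. The sketch below is our own illustration of the two baselines, not code from the paper; function and variable names are ours:

```python
# Random oversampling (ROS) and random undersampling (RUS) on index arrays.
import numpy as np

def random_oversample(X, y, seed=0):
    """Duplicate minority-class instances until every class matches the majority count."""
    rng = np.random.default_rng(seed)
    classes, counts = np.unique(y, return_counts=True)
    target = counts.max()
    idx = np.concatenate([
        rng.choice(np.where(y == c)[0], size=target, replace=True)  # sample with replacement
        for c in classes
    ])
    return X[idx], y[idx]

def random_undersample(X, y, seed=0):
    """Discard majority-class instances until every class matches the minority count."""
    rng = np.random.default_rng(seed)
    classes, counts = np.unique(y, return_counts=True)
    target = counts.min()
    idx = np.concatenate([
        rng.choice(np.where(y == c)[0], size=target, replace=False)  # sample without replacement
        for c in classes
    ])
    return X[idx], y[idx]
```

As the paragraph above notes, ROS repeats minority samples verbatim and RUS throws majority samples away, so neither preserves the full information of the original distribution.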

We propose a novel training architecture, Sequential Targeting (ST), that handles the data imbalance problem by forcing a continual learning setting on deep learning-based classifiers. ST divides the entire training data set into mutually exclusive partitions, target-adaptively balancing the data distribution. The target distribution is a predetermined distributional setting under which the learner, when trained, exerts maximum performance. In an imbalanced setting, the ideal target distribution is uniform, where all classes hold equal importance. The optimal class distribution may differ with the innate properties of the data, but research shows that a balanced class distribution yields overall better performance than other distributions (Rokach, 2010). The partitions are sorted by their similarity to the target distribution, measured by KL-divergence: the first partition remains imbalanced while the last is arbitrarily modeled to be uniform across classes, and all partitions are used to train the learner sequentially.
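To make the partitioning concrete, the following is a minimal sketch under our own simplifying assumptions (two partitions only; the paper's exact partitioning rule appears in Section 3). Function names are illustrative:

```python
# Carve out a class-balanced final subset, leave the remainder as the
# imbalanced first subset, and order partitions by KL-divergence to the
# uniform target distribution.
import numpy as np
from scipy.stats import entropy  # entropy(p, q) computes KL(p || q)

def class_distribution(y, classes):
    counts = np.array([(y == c).sum() for c in classes], dtype=float)
    return counts / counts.sum()

def split_two_partitions(X, y, classes, seed=0):
    """Hold out an equal number of samples per class for the final, uniform
    partition; everything else forms the first, imbalanced partition."""
    rng = np.random.default_rng(seed)
    per_class = min((y == c).sum() for c in classes) // 2
    held = np.concatenate([
        rng.choice(np.where(y == c)[0], size=per_class, replace=False)
        for c in classes
    ])
    rest = np.setdiff1d(np.arange(len(y)), held)
    return [(X[rest], y[rest]), (X[held], y[held])]

def sort_by_target_divergence(partitions, classes):
    """Order partitions from most to least divergent w.r.t. the uniform target."""
    target = np.full(len(classes), 1.0 / len(classes))
    kl = [entropy(class_distribution(y, classes), target) for _, y in partitions]
    order = np.argsort(kl)[::-1]  # descending KL: far from target first
    return [partitions[i] for i in order]
```

Sorting by descending KL-divergence ensures training ends on the partition closest to the uniform target.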

We sequentially train the learner on the partitioned data while handling the issue of catastrophic forgetting (French, 1999), an inevitable phenomenon in transfer learning, by utilizing Elastic Weight Consolidation (EWC) (Kirkpatrick et al., 2017) to stabilize the knowledge obtained from previous data partitions. The proposed method is independent of both the representation method and the task at hand, which addresses the limitations of previous methods.
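EWC penalizes drift in parameters that were important for earlier partitions, weighting each squared deviation by an estimate of the diagonal Fisher information. A compact PyTorch sketch (the hyperparameter name `lam` and the training-loop glue are our assumptions, not the paper's reported settings):

```python
import torch

def fisher_diagonal(model, loss_fn, loader, device="cpu"):
    """Diagonal Fisher estimate: average squared gradients over a data loader."""
    fisher = {n: torch.zeros_like(p) for n, p in model.named_parameters()}
    model.eval()
    for x, y in loader:
        model.zero_grad()
        loss = loss_fn(model(x.to(device)), y.to(device))
        loss.backward()
        for n, p in model.named_parameters():
            if p.grad is not None:
                fisher[n] += p.grad.detach() ** 2
    return {n: f / len(loader) for n, f in fisher.items()}

def ewc_penalty(model, fisher, old_params, lam=100.0):
    """lam/2 * sum_i F_i * (theta_i - theta*_i)^2, summed over all parameters."""
    loss = 0.0
    for n, p in model.named_parameters():
        loss = loss + (fisher[n] * (p - old_params[n]) ** 2).sum()
    return 0.5 * lam * loss

# After training on partition t:
#   old_params = {n: p.detach().clone() for n, p in model.named_parameters()}
#   fisher = fisher_diagonal(model, loss_fn, loader_t)
# While training on partition t+1:
#   total_loss = task_loss + ewc_penalty(model, fisher, old_params)
```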

Currently, text classification models are frequently used for sentiment analysis on online platforms (Ruz et al., 2020, Burdisso et al., 2019, Wu et al., 2017, Guerreiro and Rita, 2020). In this paper, we specifically focus on training an efficient classifier detecting sexual harassment, a specific kind of hate speech, and sentiment in online comments on NAVER. Under the cover of anonymity, online discussion platforms have become a place where people undermine, harass, humiliate, threaten, and bully others (Samghabadi, Maharjan, Sprague, Diaz-Sprague, & Solorio, 2017) based on superficial characteristics such as gender, sexual orientation, and age (ElSherief, Kulkarni, Nguyen, Wang, & Belding, 2018). When collecting and annotating comments from NAVER, we observed severe data imbalance, as shown in Fig. 1.

Previous methods utilize CNNs (Georgakopoulos, Tasoulis, Vrahatis, & Plagianakos, 2018) and LSTMs (Shah, Sanghvi, Mehta, Shah, & Singh, 2021) for hate speech detection. Recently, state-of-the-art methods in hate speech detection utilize DNNs such as BiLSTM (Mollas, Chrysopoulou, Karlos, & Tsoumakas, 2020) and BERT (Mathew et al., 2020). To mitigate highly imbalanced toxic data, direct manipulation of the text via data augmentation (Rastogi, Mofid, & Hsiao, 2020) has also recently been proposed. In this paper, we utilize a CNN + BiLSTM model to test the proposed method.
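A plausible minimal form of such a CNN + BiLSTM classifier is sketched below; the layer sizes and mean-pooling scheme are our assumptions, not the paper's reported configuration:

```python
import torch
import torch.nn as nn

class CNNBiLSTM(nn.Module):
    def __init__(self, vocab_size, embed_dim=128, n_filters=64,
                 kernel_size=3, hidden_dim=64, n_classes=2):
        super().__init__()
        self.embedding = nn.Embedding(vocab_size, embed_dim, padding_idx=0)
        # 1-D convolution over the token dimension extracts local n-gram features.
        self.conv = nn.Conv1d(embed_dim, n_filters, kernel_size, padding=1)
        # The BiLSTM then models longer-range dependencies over those features.
        self.bilstm = nn.LSTM(n_filters, hidden_dim, batch_first=True,
                              bidirectional=True)
        self.fc = nn.Linear(2 * hidden_dim, n_classes)

    def forward(self, token_ids):                      # (batch, seq_len)
        x = self.embedding(token_ids)                  # (batch, seq_len, embed_dim)
        x = torch.relu(self.conv(x.transpose(1, 2)))   # (batch, n_filters, seq_len)
        x, _ = self.bilstm(x.transpose(1, 2))          # (batch, seq_len, 2*hidden_dim)
        return self.fc(x.mean(dim=1))                  # mean-pool over tokens, classify
```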

We perform experiments with the proposed method on real-world data and on simulated datasets (IMDB) with varying imbalance levels. We annotated and constructed three datasets consisting of comments made by users of different social platforms of NAVER: one for multi-class sentiment analysis and the others for binary detection of sexual harassment. Annotations were improved iteratively through in-lab annotation and crowdsourcing. Experimental results show that ST outperforms traditional approaches by a notable gap, especially when the ratio of majority-class to minority-class instances in the training data is extremely large. Lastly, ST proves to be compatible with previous data-level approaches to data imbalance.

Our contributions in this paper are threefold:

  • We introduce a novel method addressing the data imbalance problem in text classification that overcomes the limitations of previous methods by continually training on redistributed subsets of the data. The method proves effective on both the IMDB and NAVER datasets, outperforming the data-level methods ROS and RUS.

  • To the best of our knowledge, we are among the first to apply continual learning to the data imbalance problem.

  • Our method proves to be compatible with other strategies used in imbalanced classification problems. Combining our method with data-level methods outperforms the alternatives in several settings with varying imbalance levels.

The rest of the paper is organized as follows: Section 2 summarizes related work. Section 3 provides the details of the proposed method. Section 4 presents dataset descriptions, experimental setups, and quantitative experimental results on the various datasets. Finally, Section 5 concludes the paper with recommended application cases and future studies.

Section snippets

Methods Handling the Data Imbalance Problem in Text Data

Previous researchers have proposed data-level and algorithm-level methods to address the data imbalance problem in text data. Data-level techniques such as the Synthetic Minority Oversampling Technique (SMOTE) (Chawla et al., 2002, Han et al., 2005) are among the most widely used approaches. The basic idea is to oversample the training data by interpolating between the sparse observations of minority classes. Different variants of SMOTE such as SMOTE-SVM (Nguyen, Cooper, & Kamei, 2011) and MWMOTE (
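For concreteness, SMOTE is available off the shelf in the imbalanced-learn library. The sketch below uses a synthetic feature matrix standing in for a precomputed text representation such as TF-IDF vectors, which is exactly the a-priori representation requirement discussed in the Introduction:

```python
from collections import Counter
from sklearn.datasets import make_classification
from imblearn.over_sampling import SMOTE

# Synthetic 9:1 imbalanced feature matrix standing in for, e.g., TF-IDF vectors.
X, y = make_classification(n_samples=1000, weights=[0.9, 0.1], random_state=42)
X_res, y_res = SMOTE(random_state=42).fit_resample(X, y)
print(Counter(y), "->", Counter(y_res))  # minority class interpolated up to parity
```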

Sequential Targeting

We first give a broad overview of the novel training architecture, Sequential Targeting. Next, we show how the method is applied to address the data imbalance problem.

Evaluation Metrics

Accuracy is commonly used to measure the performance of a classification model. However, when it comes to skewed data, accuracy alone can be misleading, and other metrics are needed to correctly evaluate the performance of the model. In this paper, we use precision, recall, and macro F1-score to objectively evaluate the model in imbalanced data settings. Precision measures the percentage of actual positives among the positively predicted samples. Recall measures the
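For reference, the standard per-class definitions behind these metrics, with $C$ denoting the number of classes and $TP_c$, $FP_c$, $FN_c$ the true positives, false positives, and false negatives for class $c$:

```latex
\begin{align*}
\text{Precision}_c &= \frac{TP_c}{TP_c + FP_c}, \qquad
\text{Recall}_c = \frac{TP_c}{TP_c + FN_c},\\[4pt]
F1_c &= \frac{2\,\text{Precision}_c\,\text{Recall}_c}{\text{Precision}_c + \text{Recall}_c}, \qquad
\text{macro-}F1 = \frac{1}{C}\sum_{c=1}^{C} F1_c .
\end{align*}
```

Because macro F1 averages the per-class F1 scores with equal weight, a model that ignores minority classes is penalized even when its accuracy is high.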

Conclusion

It is seldom the case that text data in the wild has a balanced distribution. In realistic settings, acquiring relatively balanced data simply through the choice of balanced data sources is rarely possible. Handling data skewness is a crucial problem because learning from imbalanced data inevitably biases the model toward frequently observed classes. A wide range of previous balancing methods is limited to indirect manipulation at the raw-text level, which is highly time-consuming and oftentimes expensive.

CRediT authorship contribution statement

Joel Jang: Conceptualization, Methodology, Validation, Investigation, Writing - original draft, Writing - review & editing, Visualization. Yoonjeon Kim: Methodology, Formal analysis, Investigation, Writing - review & editing. Kyoungho Choi: Resources, Data curation. Sungho Suh: Software, Writing - review & editing.

Declaration of Competing Interest

The authors declare that they have no known competing financial interests or personal relationships that could have appeared to influence the work reported in this paper.

Acknowledgement

This work was supported by the Institute of Information & communications Technology Planning & Evaluation (IITP) grant funded by the Korea government (MSIT) (No. 2019-0-00075, Artificial Intelligence Graduate School Program (KAIST)).

References (59)

  • J. Andreas. Good-enough compositional data augmentation.

  • S. Barua et al. MWMOTE: Majority weighted minority oversampling technique for imbalanced data set learning. IEEE Transactions on Knowledge and Data Engineering (2012).

  • N.V. Chawla et al. SMOTE: Synthetic minority over-sampling technique. Journal of Artificial Intelligence Research (2002).

  • Devlin, J., Chang, M.-W., Lee, K., & Toutanova, K. (2018). BERT: Pre-training of deep bidirectional transformers for...

  • Q. Dong et al. Imbalanced deep learning by minority class incremental rectification. IEEE Transactions on Pattern Analysis and Machine Intelligence (2018).

  • ElSherief, M., Kulkarni, V., Nguyen, D., Wang, W.Y., & Belding, E. (2018). Hate lingo: A target-based linguistic...

  • Fernando, C., Banarse, D., Blundell, C., Zwols, Y., Ha, D., Rusu, A.A., Pritzel, A., & Wierstra, D. (2017). PathNet:...

  • E. Frank et al.

  • S.V. Georgakopoulos et al. Convolutional neural networks for toxic comment classification.

  • Goodfellow, I., Pouget-Abadie, J., Mirza, M., Xu, B., Warde-Farley, D., Ozair, S., Courville, A., & Bengio, Y. (2014)....

  • H. Han et al. Borderline-SMOTE: A new over-sampling method in imbalanced data sets learning.

  • Hensman, P., & Masko, D. (2015). The impact of imbalanced training data for convolutional neural networks. Degree...

  • J.M. Johnson et al. Survey on deep learning with class imbalance. Journal of Big Data (2019).

  • K.S. Jones. A statistical interpretation of term specificity and its application in retrieval. Journal of Documentation (1972).

  • S. Karamizadeh et al. Advantage and drawback of support vector machine functionality.

  • P. Karisani et al. Domain-guided task decomposition with self-training for detecting personal events in social media.

  • H. Kaur et al. A systematic review on imbalanced data challenges in machine learning: Applications and solutions. ACM Computing Surveys (CSUR) (2019).

  • S.H. Khan et al. Cost-sensitive learning of deep feature representations from imbalanced data. IEEE Transactions on Neural Networks and Learning Systems (2017).

  • J. Kirkpatrick et al. Overcoming catastrophic forgetting in neural networks. Proceedings of the National Academy of Sciences (2017).