Sequential targeting: A continual learning approach for data imbalance in text classification

https://doi.org/10.1016/j.eswa.2021.115067

Highlights

  • Handling data imbalance in text data with continual learning.

  • Detecting sexual harassment and sentiment in comments on social media platforms.

  • Effective trade-off between precision and recall for an overall increase in F1-score.

  • Results show a 56.38 %p increase in F1-score on the IMDB dataset.

  • Results show 16.89 %p and 34.76 %p increases in F1-score on the NAVER datasets.

Abstract

Text classification has numerous use cases, including sentiment analysis, spam detection, document classification, and hate speech detection. In realistic settings, text classification confronts imbalanced data conditions where the classes of interest usually compose only a minor fraction of the data. Deep neural networks used for text classification, such as recurrent neural networks and transformer networks, lack efficient methods for addressing imbalanced data. Traditional data-level methods that attempt to mitigate distributional skew include oversampling and undersampling. Oversampling methods degrade the quality of the original language representation of the sparse data from minority classes, whereas undersampling methods fail to fully utilize the rich context of majority classes. We address these issues by enforcing continual learning on imbalanced data: we partition the training data distribution into mutually exclusive subsets and perform continual learning, treating the individual subsets as distinct tasks. We demonstrate the effectiveness of our method through experiments on the IMDB dataset and on datasets constructed from real-world data. The experimental results show that the proposed method improves the F1-score by 56.38 %p on the IMDB dataset and by 16.89 %p and 34.76 %p on the constructed datasets compared to the baseline method.

Introduction

Classification is the task of learning a discriminant function from given data that assigns previously unseen data to the correct classes. In realistic settings, however, it is rarely the case that the discrete distribution of the training data acquired to develop the classifier is perfectly balanced across all classes. Classifiers trained in imbalanced settings tend to become biased towards the class with more samples (Xiao, Wang, & Du, 2019). In order to develop intelligent classifiers, finding methods that keep the classifier from becoming biased towards a certain class is of great importance.

With the recent breakthroughs in Natural Language Processing, many applications leverage text classification methods. Earlier methods rely on separate feature extraction and classification stages (Kowsari et al., 2019). Texts are converted into a structured feature vector space with methods such as TF-IDF (Jones, 1972), Word2Vec (Mikolov, Chen, Corrado, & Dean, 2013), and GloVe (Pennington, Socher, & Manning, 2014). The learned feature vectors are then given as inputs to separate classifiers such as the Naive Bayes classifier (Frank & Bouckaert, 2006), Support Vector Machines (Karamizadeh, Abdullah, Halimi, Shayan, & javad Rajabi, 2014), and Conditional Random Fields (Wallach, 2004).

Text classification has a broad field of applications, encompassing tasks from toxic comment detection to sentiment analysis. Deep learning models such as Recurrent Neural Networks (Wang, Hamza, & Florian, 2017), Convolutional Neural Networks (Guo, Zhang, Liu, & Ma, 2019), and their hybrids (Chen, Xu, He, & Wang, 2017) have proven to be much more effective than previous rule-based or machine learning-based text classification approaches. Moreover, transformer networks such as BERT (Devlin et al., 2018, Qin et al., 2020) show state-of-the-art performance in text classification. However, previous methods that attempt to mitigate imbalanced data distributions in text classification are highly incompatible with the aforementioned deep learning models: unlike rule-based and classical machine learning approaches, deep learning approaches conduct feature extraction and classification within a single model. Despite their recent successes, DNNs still struggle to generalize to a balanced testing criterion when the training data is imbalanced (Dong, Gong, & Zhu, 2018).

Previous methods addressing data imbalance in text classification can be categorized into data-level and algorithm-level methods. Data-level methods (Kaur, Pannu, & Malhi, 2019) manipulate the data by undersampling majority classes or oversampling minority classes. Data-level methods such as SMOTE and its variants (Chawla et al., 2002, Han et al., 2005) have proven to be highly effective. However, such methods require an effective numerical representation algorithm a priori, which reduces their compatibility with current deep neural network models: many deep neural networks unite the feature representation learning phase and the classifier learning phase, so data-level methods that require feature representations in advance do not fit the current scheme of deep learning. Algorithm-level methods modify the underlying classifier or its output to reduce bias towards the majority group (Haixiang et al., 2017). However, these methods are task-sensitive and somewhat heuristic, since they require practitioners to modify the classifier according to the innate properties of the task. Traditional random oversampling and undersampling (ROS, RUS), which simply duplicate or discard data instances, are free from these two limitations and are compatible with DNNs. However, they fail to preserve the information of the original data distribution and leave room for improvement. Task-independent methods for data imbalance in text classification that do not require feature-space algorithms are needed to train deep neural network classifiers.
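For reference, ROS and RUS reduce to a few lines of resampling. The sketch below is our own illustration of the two baselines, not code from the paper; function and variable names are ours:

```python
# Random oversampling (ROS) and random undersampling (RUS) on index arrays.
import numpy as np

def random_oversample(X, y, seed=0):
    """Duplicate minority-class instances until every class matches the majority count."""
    rng = np.random.default_rng(seed)
    classes, counts = np.unique(y, return_counts=True)
    target = counts.max()
    idx = np.concatenate([
        rng.choice(np.where(y == c)[0], size=target, replace=True)  # sample with replacement
        for c in classes
    ])
    return X[idx], y[idx]

def random_undersample(X, y, seed=0):
    """Discard majority-class instances until every class matches the minority count."""
    rng = np.random.default_rng(seed)
    classes, counts = np.unique(y, return_counts=True)
    target = counts.min()
    idx = np.concatenate([
        rng.choice(np.where(y == c)[0], size=target, replace=False)  # sample without replacement
        for c in classes
    ])
    return X[idx], y[idx]
```

As the paragraph above notes, ROS repeats minority samples verbatim and RUS throws majority samples away, so neither preserves the full information of the original distribution.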

We propose a novel training architecture, Sequential Targeting (ST), that handles the data imbalance problem by forcing a continual learning setting on deep learning-based classifiers. ST divides the entire training data set into mutually exclusive partitions, target-adaptively balancing the data distribution. The target distribution is a predetermined distributional setting under which the learner, when trained, exerts maximum performance. In an imbalanced setting, the ideal target distribution is uniform, where all classes hold equal importance. The optimal class distribution may differ with the innate properties of the data, but research shows that a balanced class distribution yields overall better performance than other distributions (Rokach, 2010). The partitions are sorted by their similarity to the target distribution, measured by KL-divergence: the first partition remains imbalanced while the last is arbitrarily modeled to be uniform across classes, and all partitions are used to train the learner sequentially.
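To make the partitioning concrete, the following is a minimal sketch under our own simplifying assumptions (two partitions only; the paper's exact partitioning rule appears in Section 3). Function names are illustrative:

```python
# Carve out a class-balanced final subset, leave the remainder as the
# imbalanced first subset, and order partitions by KL-divergence to the
# uniform target distribution.
import numpy as np
from scipy.stats import entropy  # entropy(p, q) computes KL(p || q)

def class_distribution(y, classes):
    counts = np.array([(y == c).sum() for c in classes], dtype=float)
    return counts / counts.sum()

def split_two_partitions(X, y, classes, seed=0):
    """Hold out an equal number of samples per class for the final, uniform
    partition; everything else forms the first, imbalanced partition."""
    rng = np.random.default_rng(seed)
    per_class = min((y == c).sum() for c in classes) // 2
    held = np.concatenate([
        rng.choice(np.where(y == c)[0], size=per_class, replace=False)
        for c in classes
    ])
    rest = np.setdiff1d(np.arange(len(y)), held)
    return [(X[rest], y[rest]), (X[held], y[held])]

def sort_by_target_divergence(partitions, classes):
    """Order partitions from most to least divergent w.r.t. the uniform target."""
    target = np.full(len(classes), 1.0 / len(classes))
    kl = [entropy(class_distribution(y, classes), target) for _, y in partitions]
    order = np.argsort(kl)[::-1]  # descending KL: far from target first
    return [partitions[i] for i in order]
```

Sorting by descending KL-divergence ensures training ends on the partition closest to the uniform target.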

We sequentially train the learner on the partitioned data while handling the issue of catastrophic forgetting (French, 1999), an inevitable phenomenon in transfer learning, by utilizing Elastic Weight Consolidation (EWC) (Kirkpatrick et al., 2017) to stabilize the knowledge obtained from previous data partitions. The proposed method is independent of both the representation method and the task at hand, which addresses the limitations of previous methods.
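EWC penalizes drift in parameters that were important for earlier partitions, weighting each squared deviation by an estimate of the diagonal Fisher information. A compact PyTorch sketch (the hyperparameter name `lam` and the training-loop glue are our assumptions, not the paper's reported settings):

```python
import torch

def fisher_diagonal(model, loss_fn, loader, device="cpu"):
    """Diagonal Fisher estimate: average squared gradients over a data loader."""
    fisher = {n: torch.zeros_like(p) for n, p in model.named_parameters()}
    model.eval()
    for x, y in loader:
        model.zero_grad()
        loss = loss_fn(model(x.to(device)), y.to(device))
        loss.backward()
        for n, p in model.named_parameters():
            if p.grad is not None:
                fisher[n] += p.grad.detach() ** 2
    return {n: f / len(loader) for n, f in fisher.items()}

def ewc_penalty(model, fisher, old_params, lam=100.0):
    """lam/2 * sum_i F_i * (theta_i - theta*_i)^2, summed over all parameters."""
    loss = 0.0
    for n, p in model.named_parameters():
        loss = loss + (fisher[n] * (p - old_params[n]) ** 2).sum()
    return 0.5 * lam * loss

# After training on partition t:
#   old_params = {n: p.detach().clone() for n, p in model.named_parameters()}
#   fisher = fisher_diagonal(model, loss_fn, loader_t)
# While training on partition t+1:
#   total_loss = task_loss + ewc_penalty(model, fisher, old_params)
```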

Currently, text classification models are frequently used for sentiment analysis on online platforms (Ruz et al., 2020, Burdisso et al., 2019, Wu et al., 2017, Guerreiro and Rita, 2020). In this paper, we specifically focus on training an efficient classifier detecting sexual harassment, a specific kind of hate speech, and sentiment in online comments on NAVER. Under the cover of anonymity, online discussion platforms have become a place where people undermine, harass, humiliate, threaten, and bully others (Samghabadi, Maharjan, Sprague, Diaz-Sprague, & Solorio, 2017) based on superficial characteristics such as gender, sexual orientation, and age (ElSherief, Kulkarni, Nguyen, Wang, & Belding, 2018). When collecting and annotating comments from NAVER, we observed severe data imbalance, as shown in Fig. 1.

Previous methods utilize CNNs (Georgakopoulos, Tasoulis, Vrahatis, & Plagianakos, 2018) and LSTMs (Shah, Sanghvi, Mehta, Shah, & Singh, 2021) for hate speech detection. Recently, state-of-the-art methods in hate speech detection utilize DNNs such as BiLSTM (Mollas, Chrysopoulou, Karlos, & Tsoumakas, 2020) and BERT (Mathew et al., 2020). To mitigate highly imbalanced toxic data, direct manipulation of the text via data augmentation (Rastogi, Mofid, & Hsiao, 2020) has also recently been proposed. In this paper, we utilize a CNN + BiLSTM model to test the proposed method.
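A plausible minimal form of such a CNN + BiLSTM classifier is sketched below; the layer sizes and mean-pooling scheme are our assumptions, not the paper's reported configuration:

```python
import torch
import torch.nn as nn

class CNNBiLSTM(nn.Module):
    def __init__(self, vocab_size, embed_dim=128, n_filters=64,
                 kernel_size=3, hidden_dim=64, n_classes=2):
        super().__init__()
        self.embedding = nn.Embedding(vocab_size, embed_dim, padding_idx=0)
        # 1-D convolution over the token dimension extracts local n-gram features.
        self.conv = nn.Conv1d(embed_dim, n_filters, kernel_size, padding=1)
        # The BiLSTM then models longer-range dependencies over those features.
        self.bilstm = nn.LSTM(n_filters, hidden_dim, batch_first=True,
                              bidirectional=True)
        self.fc = nn.Linear(2 * hidden_dim, n_classes)

    def forward(self, token_ids):                      # (batch, seq_len)
        x = self.embedding(token_ids)                  # (batch, seq_len, embed_dim)
        x = torch.relu(self.conv(x.transpose(1, 2)))   # (batch, n_filters, seq_len)
        x, _ = self.bilstm(x.transpose(1, 2))          # (batch, seq_len, 2*hidden_dim)
        return self.fc(x.mean(dim=1))                  # mean-pool over tokens, classify
```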

We perform experiments with the proposed method on real-world data and on simulated datasets (IMDB) with varying imbalance levels. We annotated and constructed three datasets consisting of comments made by users of different social platforms of NAVER: one for multi-class sentiment analysis and the others for binary detection of sexual harassment. Annotations were improved iteratively through in-lab annotation and crowdsourcing. Experimental results show that ST outperforms traditional approaches by a notable gap, especially when the ratio of majority-class to minority-class instances in the training data is extremely large. Lastly, ST proves to be compatible with previous data-level approaches to data imbalance.

Our contributions in this paper are threefold:

  • We introduce a novel method addressing the data imbalance problem in text classification that overcomes the limitations of previous methods by continually training on redistributed subsets of the data. The method proves effective on both the IMDB and NAVER datasets, outperforming the data-level methods ROS and RUS.

  • To the best of our knowledge, we are among the first to apply continual learning to the data imbalance problem.

  • Our method proves to be compatible with other strategies used in imbalanced classification problems. Combining our method with data-level methods outperforms the alternatives in several settings with varying imbalance levels.

The rest of the paper is organized as follows: Section 2 summarizes related work. Section 3 provides the details of the proposed method. Section 4 presents dataset descriptions, experimental setups, and quantitative experimental results on the various datasets. Finally, Section 5 concludes the paper with recommended application cases and future studies.

Section snippets

Methods Handling the Data Imbalance Problem in Text Data

Previous researchers have proposed data-level and algorithm-level methods to address the data imbalance problem in text data. Data-level techniques such as the Synthetic Minority Oversampling Technique (SMOTE) (Chawla et al., 2002, Han et al., 2005) are among the most widely used approaches. The basic idea is to oversample the training data by interpolating between the sparse observations of minority classes. Different variants of SMOTE such as SMOTE-SVM (Nguyen, Cooper, & Kamei, 2011) and MWMOTE (
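For concreteness, SMOTE is available off the shelf in the imbalanced-learn library. The sketch below uses a synthetic feature matrix standing in for a precomputed text representation such as TF-IDF vectors, which is exactly the a-priori representation requirement discussed in the Introduction:

```python
from collections import Counter
from sklearn.datasets import make_classification
from imblearn.over_sampling import SMOTE

# Synthetic 9:1 imbalanced feature matrix standing in for, e.g., TF-IDF vectors.
X, y = make_classification(n_samples=1000, weights=[0.9, 0.1], random_state=42)
X_res, y_res = SMOTE(random_state=42).fit_resample(X, y)
print(Counter(y), "->", Counter(y_res))  # minority class interpolated up to parity
```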

Sequential Targeting

We first give a broad overview of the novel training architecture, Sequential Targeting. Next, we show how the method is applied to address the data imbalance problem.

Evaluation Metrics

Accuracy is commonly used to measure the performance of a classification model. However, when it comes to skewed data, accuracy alone can be misleading, and other metrics are needed to correctly evaluate the performance of the model. In this paper, we use precision, recall, and macro F1-score to objectively evaluate the model in imbalanced data settings. Precision measures the percentage of actual positives among the positively predicted samples. Recall measures the
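For reference, the standard per-class definitions behind these metrics, with $C$ denoting the number of classes and $TP_c$, $FP_c$, $FN_c$ the true positives, false positives, and false negatives for class $c$:

```latex
\begin{align*}
\text{Precision}_c &= \frac{TP_c}{TP_c + FP_c}, \qquad
\text{Recall}_c = \frac{TP_c}{TP_c + FN_c},\\[4pt]
F1_c &= \frac{2\,\text{Precision}_c\,\text{Recall}_c}{\text{Precision}_c + \text{Recall}_c}, \qquad
\text{macro-}F1 = \frac{1}{C}\sum_{c=1}^{C} F1_c .
\end{align*}
```

Because macro F1 averages the per-class F1 scores with equal weight, a model that ignores minority classes is penalized even when its accuracy is high.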

Conclusion

It is seldom the case that text data in the wild has a balanced distribution. In realistic settings, acquiring relatively balanced data simply through the choice of balanced data sources is rarely possible. Handling data skewness is a crucial problem because learning from imbalanced data inevitably biases the model toward frequently observed classes. A wide range of previous balancing methods is limited to indirect manipulation at the raw-text level, which is highly time-consuming and oftentimes expensive.

CRediT authorship contribution statement

Joel Jang: Conceptualization, Methodology, Validation, Investigation, Writing - original draft, Writing - review & editing, Visualization. Yoonjeon Kim: Methodology, Formal analysis, Investigation, Writing - review & editing. Kyoungho Choi: Resources, Data curation. Sungho Suh: Software, Writing - review & editing.

Declaration of Competing Interest

The authors declare that they have no known competing financial interests or personal relationships that could have appeared to influence the work reported in this paper.

Acknowledgement

This work was supported by the Institute of Information & communications Technology Planning & Evaluation (IITP) grant funded by the Korea government (MSIT) (No. 2019-0-00075, Artificial Intelligence Graduate School Program (KAIST)).

References (59)

  • J. Andreas. Good-enough compositional data augmentation.

  • S. Barua et al. MWMOTE: Majority weighted minority oversampling technique for imbalanced data set learning. IEEE Transactions on Knowledge and Data Engineering (2012).

  • N.V. Chawla et al. SMOTE: Synthetic minority over-sampling technique. Journal of Artificial Intelligence Research (2002).

  • Devlin, J., Chang, M.-W., Lee, K., & Toutanova, K. (2018). BERT: Pre-training of deep bidirectional transformers for...

  • Q. Dong et al. Imbalanced deep learning by minority class incremental rectification. IEEE Transactions on Pattern Analysis and Machine Intelligence (2018).

  • ElSherief, M., Kulkarni, V., Nguyen, D., Wang, W.Y., & Belding, E. (2018). Hate lingo: A target-based linguistic...

  • Fernando, C., Banarse, D., Blundell, C., Zwols, Y., Ha, D., Rusu, A.A., Pritzel, A., & Wierstra, D. (2017). PathNet:...

  • E. Frank et al.

  • S.V. Georgakopoulos et al. Convolutional neural networks for toxic comment classification.

  • Goodfellow, I., Pouget-Abadie, J., Mirza, M., Xu, B., Warde-Farley, D., Ozair, S., Courville, A., & Bengio, Y. (2014)....

  • H. Han et al. Borderline-SMOTE: A new over-sampling method in imbalanced data sets learning.

  • Hensman, P., & Masko, D. (2015). The impact of imbalanced training data for convolutional neural networks. Degree...

  • J.M. Johnson et al. Survey on deep learning with class imbalance. Journal of Big Data (2019).

  • K.S. Jones. A statistical interpretation of term specificity and its application in retrieval. Journal of Documentation (1972).

  • S. Karamizadeh et al. Advantage and drawback of support vector machine functionality.

  • P. Karisani et al. Domain-guided task decomposition with self-training for detecting personal events in social media.

  • H. Kaur et al. A systematic review on imbalanced data challenges in machine learning: Applications and solutions. ACM Computing Surveys (CSUR) (2019).

  • S.H. Khan et al. Cost-sensitive learning of deep feature representations from imbalanced data. IEEE Transactions on Neural Networks and Learning Systems (2017).

  • J. Kirkpatrick et al. Overcoming catastrophic forgetting in neural networks. Proceedings of the National Academy of Sciences (2017).