Sequential targeting: A continual learning approach for data imbalance in text classification
Introduction
Classification is the task of learning a discriminant function from given data that assigns previously unseen instances to their correct classes. In realistic settings, however, the class distribution of the training data acquired to develop a classifier is rarely balanced. Classifiers trained on imbalanced data tend to become biased towards the classes with more samples (Xiao, Wang, & Du, 2019). In order to develop intelligent classifiers, finding methods that keep the classifier from becoming biased towards a particular class is of great importance.
With the recent breakthroughs in Natural Language Processing, many applications leverage text classification methods. Earlier methods rely on separate feature extraction and classification stages (Kowsari et al., 2019). Texts are first converted into a structured feature vector space with methods such as TF-IDF (Jones, 1972), Word2Vec (Mikolov, Chen, Corrado, & Dean, 2013), and GloVe (Pennington, Socher, & Manning, 2014). The resulting feature vectors are then given as inputs to separate classifiers such as the Naive Bayes classifier (Frank & Bouckaert, 2006), Support Vector Machines (Karamizadeh, Abdullah, Halimi, Shayan, & javad Rajabi, 2014), and Conditional Random Fields (Wallach, 2004).
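The two-stage pipeline above can be illustrated with a minimal, self-contained TF-IDF sketch; the function name `tfidf` and the toy documents are ours, and a real pipeline would use a library implementation before handing the vectors to a separate classifier.

```python
import math
from collections import Counter

def tfidf(docs):
    """Compute TF-IDF vectors for a list of tokenized documents.

    Classic two-stage pipelines first map text to such structured
    feature vectors, then feed them to a separate classifier
    (e.g. Naive Bayes or an SVM)."""
    n = len(docs)
    # document frequency: in how many documents each term appears
    df = Counter(term for doc in docs for term in set(doc))
    vocab = sorted(df)
    vectors = []
    for doc in docs:
        tf = Counter(doc)
        # term frequency weighted by inverse document frequency
        vec = [tf[t] / len(doc) * math.log(n / df[t]) for t in vocab]
        vectors.append(vec)
    return vocab, vectors

docs = [["good", "movie"], ["bad", "movie"], ["good", "acting"]]
vocab, vecs = tfidf(docs)
```

Terms that appear in every document receive an IDF of zero, which is what makes TF-IDF emphasize class-discriminative words.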
Text classification has a broad field of applications encompassing tasks from toxic comment detection to sentiment analysis. Deep learning models such as Recurrent Neural Networks (Wang, Hamza, & Florian, 2017), Convolutional Neural Networks (Guo, Zhang, Liu, & Ma, 2019), and their hybrids (Chen, Xu, He, & Wang, 2017) have proven to be much more effective than previous rule-based or machine learning-based text classification approaches. Moreover, transformer networks such as BERT (Devlin et al., 2018, Qin et al., 2020) show state-of-the-art performance in text classification. However, previous methods that attempt to mitigate imbalanced data distributions in text classification are highly incompatible with these deep learning models. Unlike rule-based and classical machine learning approaches, deep learning approaches conduct feature extraction and classification within a single model. Despite their recent successes, DNNs still struggle to generalize to a balanced testing criterion when the training data is imbalanced (Dong, Gong, & Zhu, 2018).
Previous methods addressing data imbalance in text classification can be categorized into data-level and algorithm-level methods. Data-level methods (Kaur, Pannu, & Malhi, 2019) manipulate the data by undersampling majority classes or oversampling minority classes. Data-level methods such as SMOTE and its variants (Chawla et al., 2002, Han et al., 2005) have proven highly effective. However, such methods require an effective numerical representation algorithm a priori, which reduces their compatibility with current deep neural network models. Many deep neural networks unite the feature representation learning phase and the classifier learning phase, so data-level methods that require feature representations in advance fit poorly with the current scheme of deep learning. Algorithm-level methods modify the underlying classifier or its output to reduce bias towards the majority group (Haixiang et al., 2017). However, these methods are task-sensitive and somewhat heuristic, since they require practitioners to modify the classifier in light of the innate properties of the task. Traditional random oversampling and undersampling methods (ROS, RUS), which simply duplicate or subsample data instances, are free of these two limitations and are compatible with DNNs. However, they fail to preserve the information of the original data distribution and leave room for improvement. Task-independent methods for data imbalance in text classification that do not require feature space algorithms are needed to train deep neural network classifiers.
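As a concrete baseline, ROS can be sketched in a few lines; the function name `random_oversample` is ours, and RUS would be the mirror image (subsampling every class down to the minority size, discarding information instead of duplicating it).

```python
import random
from collections import defaultdict

def random_oversample(samples, labels, seed=0):
    """Random oversampling (ROS): duplicate minority-class instances
    until every class matches the majority class size. Works on raw
    instances, so it needs no feature representation a priori and is
    directly compatible with end-to-end DNN training."""
    rng = random.Random(seed)
    by_class = defaultdict(list)
    for x, y in zip(samples, labels):
        by_class[y].append(x)
    target = max(len(xs) for xs in by_class.values())
    out_x, out_y = [], []
    for y, xs in by_class.items():
        # pad each class with random duplicates up to the majority size
        resampled = xs + [rng.choice(xs) for _ in range(target - len(xs))]
        out_x.extend(resampled)
        out_y.extend([y] * target)
    return out_x, out_y
```

Because duplicates carry no new information, ROS balances the label counts without changing the underlying data distribution, which is the limitation noted above.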
We propose a novel training architecture, Sequential Targeting (ST), that handles the data imbalance problem by forcing a continual learning setting on deep learning-based classifiers. ST divides the entire training data set into mutually exclusive partitions, target-adaptively balancing the data distribution. The target distribution is a predetermined distributional setting under which the learner attains maximum performance when trained. In an imbalanced setting, the ideal target distribution is uniform, with all classes holding equal importance. The optimal class distribution may differ with the innate properties of the data, but research shows that a balanced class distribution yields better overall performance than other distributions (Rokach, 2010). The partitions are then sorted by their similarity to the target distribution, measured by KL-divergence. The first partition of the split data is imbalanced, while the last partition is arbitrarily modeled to be uniform across classes, and all the partitions are used to train the learner sequentially.
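The ordering criterion described above can be sketched as follows. This is a minimal illustration assuming integer class labels and a uniform target distribution; the function names are ours, and the actual ST architecture partitions the data instances themselves rather than just their labels.

```python
import math
from collections import Counter

def kl_to_uniform(labels, num_classes):
    """KL-divergence D(p || u) between a partition's empirical class
    distribution p and the uniform target distribution u. Zero means
    the partition already matches the target."""
    counts = Counter(labels)
    n = len(labels)
    kl = 0.0
    for c in range(num_classes):
        p = counts.get(c, 0) / n
        if p > 0:  # terms with p = 0 contribute nothing
            kl += p * math.log(p / (1.0 / num_classes))
    return kl

def order_partitions(partitions, num_classes):
    """Sort partitions from most imbalanced (largest KL-divergence to
    the target) to most balanced, so sequential training ends on the
    partition closest to the uniform target distribution."""
    return sorted(partitions,
                  key=lambda labels: -kl_to_uniform(labels, num_classes))
```

Training then proceeds through the ordered partitions one after another, finishing on the most target-like partition.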
We sequentially train the learner on the partitioned data while handling the issue of catastrophic forgetting (French, 1999), an inevitable phenomenon in transfer learning, by utilizing Elastic Weight Consolidation (EWC) (Kirkpatrick et al., 2017) to stabilize the knowledge obtained from the previous data partitions. The proposed method is independent of both the representation method and the task at hand, which addresses the limitations of previous methods.
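The EWC regularizer used here adds a quadratic penalty that anchors parameters important to earlier partitions. A minimal sketch of that penalty on plain scalar parameters (the function name and toy inputs are ours; in practice this is computed over network tensors):

```python
def ewc_penalty(params, old_params, fisher, lam=1.0):
    """Elastic Weight Consolidation penalty (Kirkpatrick et al., 2017):

        (lam / 2) * sum_i F_i * (theta_i - theta*_i)^2

    where F_i is the diagonal Fisher information estimated on the
    previous partition and theta*_i the parameters learned there.
    Added to the task loss, it penalizes drift in weights that were
    important for earlier partitions, mitigating catastrophic
    forgetting during sequential training."""
    return 0.5 * lam * sum(
        f * (p - op) ** 2 for p, op, f in zip(params, old_params, fisher)
    )
```

The coefficient `lam` trades off plasticity on the current partition against stability of knowledge from earlier ones.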
Currently, text classification models are frequently used for sentiment analysis on online platforms (Ruz et al., 2020, Burdisso et al., 2019, Wu et al., 2017, Guerreiro and Rita, 2020). In this paper, we specifically focus on training an efficient classifier that detects sexual harassment, a specific kind of hate speech, and sentiment in online comments on NAVER. Under the shield of anonymity, online discussion platforms have become places where people undermine, harass, humiliate, threaten, and bully others (Samghabadi, Maharjan, Sprague, Diaz-Sprague, & Solorio, 2017) based on superficial characteristics such as gender, sexual orientation, and age (ElSherief, Kulkarni, Nguyen, Wang, & Belding, 2018). When collecting and annotating comments from NAVER, we observed severe data imbalance, as shown in Fig. 1.
Previous methods utilize CNNs (Georgakopoulos, Tasoulis, Vrahatis, & Plagianakos, 2018) and LSTMs (Shah, Sanghvi, Mehta, Shah, & Singh, 2021) for hate speech detection. Recently, state-of-the-art methods in hate speech detection utilize DNNs such as BiLSTM (Mollas, Chrysopoulou, Karlos, & Tsoumakas, 2020) and BERT (Mathew et al., 2020). To mitigate the high imbalance of toxic data, direct manipulation of the text via data augmentation (Rastogi, Mofid, & Hsiao, 2020) has also been proposed. In this paper, we use a CNN + BiLSTM model to test the proposed method.
We evaluate the proposed method on real-world data and on simulated datasets (IMDB) with varying imbalance levels. We annotated and constructed three datasets consisting of comments made by users on different social platforms of NAVER: one for multi-class sentiment analysis and the others for binary detection of sexual harassment. Annotations on the data were improved iteratively through in-lab annotation and crowdsourcing. Experimental results show that ST outperforms traditional approaches by a notable gap, especially when the ratio of majority-class to minority-class instances in the training data is extremely large. Lastly, ST proves to be compatible with previous data-level approaches addressing data imbalance.
Our contributions in this paper are threefold:
- We introduce a novel method that addresses the data imbalance problem in text classification, overcoming the limitations of previous methods by continually training on redistributed subsets of the data. The method proves effective on both the IMDB and NAVER datasets, outperforming the data-level methods ROS and RUS.
- To the best of our knowledge, we are among the first to apply continual learning to the data imbalance problem.
- Our method is compatible with other strategies for imbalanced classification. Combining it with a data-level method outperformed other approaches in several settings with varying imbalance levels.
The rest of the paper is organized as follows: Section 2 summarizes related work. Section 3 provides the details of the proposed method. Section 4 presents dataset descriptions, experiment setups, and experimental results on the various datasets. Finally, Section 5 concludes the paper with recommended application cases and future studies.
Methods Handling the Data Imbalance Problem in Text Data
Previous researchers have proposed data-level and algorithm-level methods to address the data imbalance problem in text data. Data-level techniques such as the Synthetic Minority Oversampling Technique (SMOTE) (Chawla et al., 2002, Han et al., 2005) are among the most widely used approaches. The basic idea is to oversample the training data by interpolating between the sparse observations of the minority classes. Variants of SMOTE such as SMOTE-SVM (Nguyen, Cooper, & Kamei, 2011) and MWMOTE have also been proposed.
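SMOTE's interpolation idea can be sketched as follows. This is a minimal illustration on raw feature vectors with names of our choosing (`smote_sample`); real SMOTE needs such a numerical feature space up front, which is precisely what makes it hard to apply inside end-to-end deep models.

```python
import random

def smote_sample(minority, k=2, seed=0):
    """Generate one synthetic minority instance by interpolating
    between a random minority point and one of its k nearest
    neighbours, as in SMOTE. `minority` is a list of feature vectors."""
    rng = random.Random(seed)
    x = rng.choice(minority)
    # k nearest neighbours of x by squared Euclidean distance
    neighbours = sorted(
        (p for p in minority if p is not x),
        key=lambda p: sum((a - b) ** 2 for a, b in zip(x, p)),
    )[:k]
    nb = rng.choice(neighbours)
    gap = rng.random()  # interpolation factor in [0, 1)
    # synthetic point lies on the segment between x and its neighbour
    return [a + gap * (b - a) for a, b in zip(x, nb)]
```

Because each synthetic point lies on a segment between two real minority points, the new samples stay inside the minority class's convex hull rather than being exact duplicates as in ROS.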
Sequential Targeting
We first introduce a broad overview of the novel training architecture: Sequential Targeting. Next, we show how this method has been applied to address the data imbalance problem.
Evaluation Metrics
Accuracy is commonly used to measure the performance of a classification model. However, on skewed data, accuracy alone can be misleading, and other metrics are needed to correctly evaluate the performance of the model. In this paper, we use precision, recall, and macro F1-score to objectively evaluate the model in imbalanced data settings. Precision measures the percentage of actual positives among the samples predicted positive. Recall measures the percentage of actual positive samples that are correctly predicted as positive.
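These metrics can be computed directly from true and predicted labels; a minimal sketch (the function name `macro_f1` is ours; a library such as scikit-learn would normally be used):

```python
def macro_f1(y_true, y_pred, classes):
    """Macro F1-score: the unweighted mean of per-class F1 scores,
    where per-class precision = TP / (TP + FP) and
    recall = TP / (TP + FN). Every class counts equally regardless
    of its frequency, which is why macro F1 suits imbalanced data."""
    f1s = []
    for c in classes:
        tp = sum(t == c and p == c for t, p in zip(y_true, y_pred))
        fp = sum(t != c and p == c for t, p in zip(y_true, y_pred))
        fn = sum(t == c and p != c for t, p in zip(y_true, y_pred))
        prec = tp / (tp + fp) if tp + fp else 0.0
        rec = tp / (tp + fn) if tp + fn else 0.0
        f1s.append(2 * prec * rec / (prec + rec) if prec + rec else 0.0)
    return sum(f1s) / len(f1s)
```

For example, a classifier that predicts only the majority class on a 3:1 split reaches 75% accuracy but a macro F1 of only about 0.43, illustrating why accuracy alone is misleading on skewed data.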
Conclusion
Text data in the wild seldom has a balanced distribution, and in realistic settings there is a limit to how much balance can be achieved through the choice of data sources. Handling data skewness is a crucial problem because learning from imbalanced data inevitably biases the model toward frequently observed classes. A wide range of previous balancing methods are limited to manipulation at the raw-text level, which is highly time-consuming and often expensive.
CRediT authorship contribution statement
Joel Jang: Conceptualization, Methodology, Validation, Investigation, Writing - original draft, Writing - review & editing, Visualization. Yoonjeon Kim: Methodology, Formal analysis, Investigation, Writing - review & editing. Kyoungho Choi: Resources, Data curation. Sungho Suh: Software, Writing - review & editing.
Declaration of Competing Interest
The authors declare that they have no known competing financial interests or personal relationships that could have appeared to influence the work reported in this paper.
Acknowledgement
This work was supported by the Institute of Information & communications Technology Planning & Evaluation (IITP) grant funded by the Korea government (MSIT) (No. 2019-0-00075, Artificial Intelligence Graduate School Program (KAIST)).
References
- et al., A systematic study of the class imbalance problem in convolutional neural networks, Neural Networks (2018)
- et al., A text classification framework for simple and effective early depression detection over social media streams, Expert Systems with Applications (2019)
- et al., Improving sentiment analysis via sentence type classification using BiLSTM-CRF and CNN, Expert Systems with Applications (2017)
- Catastrophic forgetting in connectionist networks, Trends in Cognitive Sciences (1999)
- et al., How to predict explicit recommendations in online reviews using text mining and sentiment analysis, Journal of Hospitality and Tourism Management (2020)
- et al., Improving text classification with weighted word embeddings via a multi-channel TextCNN model, Neurocomputing (2019)
- et al., Learning from class-imbalanced data: Review of methods and applications, Expert Systems with Applications (2017)
- et al., Dealing with data imbalance in text classification, Procedia Computer Science (2019)
- et al., Sentiment analysis of Twitter data during critical events through Bayesian networks classifiers, Future Generation Computer Systems (2020)
- et al., CEGAN: Classification enhancement generative adversarial networks for unraveling data imbalance problems, Neural Networks (2021)
- Good-enough compositional data augmentation
- MWMOTE: Majority weighted minority oversampling technique for imbalanced data set learning, IEEE Transactions on Knowledge and Data Engineering
- SMOTE: Synthetic minority over-sampling technique, Journal of Artificial Intelligence Research
- Imbalanced deep learning by minority class incremental rectification, IEEE Transactions on Pattern Analysis and Machine Intelligence
- Convolutional neural networks for toxic comment classification
- Borderline-SMOTE: A new over-sampling method in imbalanced data sets learning
- Survey on deep learning with class imbalance, Journal of Big Data
- A statistical interpretation of term specificity and its application in retrieval, Journal of Documentation
- Advantage and drawback of support vector machine functionality
- Domain-guided task decomposition with self-training for detecting personal events in social media
- A systematic review on imbalanced data challenges in machine learning: Applications and solutions, ACM Computing Surveys (CSUR)
- Cost-sensitive learning of deep feature representations from imbalanced data, IEEE Transactions on Neural Networks and Learning Systems
- Overcoming catastrophic forgetting in neural networks, Proceedings of the National Academy of Sciences
1. This work was done while the author was at NAVER.