1 Introduction

Spam and junk content have been a problem on the Internet for decades (Sahami et al., 1998; Kaur et al., 2016; Ferrara et al., 2016; Dou et al., 2020). With the proliferation of direct marketing online, these forms of pollution continue to increase rapidly. By definition (Wu et al., 2019, https://help.twitter.com/en/rules-and-policies/twitter-rules, Pedia 2020), spam is unsolicited and unsought information, including pornography, inappropriate or nonsensical content, and commercial advertisements. These different types of spam take on new meaning in the context of social media, particularly on platforms like Twitter. For example, not everyone would view advertising as spam. When conducting a content analysis on tweets about an election, advertising about diapers is irrelevant to the election discussion and would be viewed as spam. If instead, we are analyzing content about parenting, diaper advertisements would be relevant to the content analysis and may not be viewed as spam. Because of mismatches like this, we introduce the concept of context-specific spam and attempt to understand how to accurately identify posts containing this form of spam, as well as more traditional forms of spam on Twitter.

Figure 1 shows the different types of content as they relate to spam. Filtering out traditional spam is an insufficient way to remove all spam tweets since context-specific spam remains. Similarly, classical spam filtering that considers all advertisements as spam is inadequate because it classifies legitimate context-specific advertising as spam. To accurately detect spam pollution, contextual understanding is required. The goal of our paper is to identify spam on Twitter, both traditional and context-specific. Researchers have been working on spam and bot detection for decades and have proposed a number of supervised learning approaches (Sahami et al., 1998; Kaur et al., 2016; Chen et al., 2015; Cresci et al., 2018). The state-of-the-art approaches extract features from both content and user information. Yet, on some platforms, user information can be difficult to obtain, either because of privacy concerns or because of API limitations. Consider the case of identifying spam among a set of tweets about a particular hashtag stream or keyword search like #metoo or #trump. Getting the user information for every post is impractical for these frequently used hashtags that have thousands of posts daily. As such, we focus on building models that only use post content, not user information.

Fig. 1 Different types of content

In this paper, our primary goal is to identify polluted information, specifically traditional spam and context-specific spam, using content-based features extracted from posts. Overall, our core contributions are as follows: (i) we formally define different forms of conversation pollution on social media, building a useful taxonomy of poor quality information on social media; (ii) we present a neural network model that identifies traditional and context-specific spam in a low resource setting and show that using a language model within the neural network performs better than classic state-of-the-art machine learning models; (iii) we generate and make available three Mechanical Turk data sets in three different conversation domains (https://github.com/GU-DataLab/context-spam), show the existence of context-specific spam on Twitter, and show how the proportion of spam varies across conversation domains; (iv) we demonstrate the performance impact of imbalanced training data on Twitter and show that using a neural network model is promising in this setting; and (v) we show that classic machine learning models are more robust to cross-domain learning when the training data are balanced, but when the training data are heavily imbalanced, a neural network with a cross-domain pre-trained language model leads to better performance than classic models, though still not as strong as domain-specific training because of the presence of context-specific spam.

The remainder of this paper is organized as follows. We present our proposed conversation pollution taxonomy in Sect. 2. Next, we discuss the related work in Sect. 3. Section 4 describes our experiment design for the spam learning task. Our data sets and the labeling task are described in Sect. 5. Then, in Sect. 6, we present our empirical evaluation, followed by conclusions in Sect. 7.

2 Conversation pollution taxonomy

Researchers are investigating different types of conversation pollution. Kolari et al. (2007) create a taxonomy of spam across the Internet. The focus of their taxonomy is on the different ways spam is distributed, e.g. email, IM, blogs, etc. We present a taxonomy that groups different types of conversation pollution, where spam is one form of pollution. Figure 2 shows this taxonomy. There are three high-level categories of conversation pollution: deceptive/misleading (false information), abusive/offensive (threat), and persuasive/enticing (spam). False information is a post containing inaccurate content, including different forms of misinformation and disinformation. Threat content is designed to be offensive and/or abusive (Wu et al., 2019). Finally, spam content attempts to persuade and entice people to click, share, or buy something. We divide spam into two categories, traditional spam and context-specific spam.

Fig. 2 Different forms of conversation pollution

In this paper, we focus on spam-related pollution. Specifically, we introduce a new form of spam that we refer to as context-specific spam. Context-specific spam is any post that is undesirable given the context/theme of the discussion. This includes context-irrelevant posts like irrelevant advertising. For the purposes of this paper, we consider advertising to be posts that are intended to promote a product or service. Irrelevant advertising is advertising that is not related to the discussion domain.

More formally, let \(T = \{t_{1}, t_{2}, ..., t_{n}\}\) be a tweet database containing n tweets. Different subsets of tweets are related to different thematic domains \(D^i\), where \(T = \bigcup _{i} D^i\). Let S be a set of traditional spam tweets such that \(S \subset T\). While traditional spam is domain-independent, spam can also be specific to a domain. We define context-specific spam \(C^i\) as spam that is specific to a thematic domain \(D^i\) (\(C^i \subset D^i\)). An example of \(C^i\) is irrelevant advertising.
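
To make the notation concrete, the following minimal Python sketch (with made-up tweet IDs and domain assignments, purely for illustration) expresses the same set relationships.

```python
# Hypothetical illustration of the notation above; IDs and assignments are invented.
T = {1, 2, 3, 4, 5, 6}                  # tweet database with n = 6 tweets

D = {
    "election":  {1, 2, 3},             # D^election
    "parenting": {4, 5, 6},             # D^parenting
}
assert set.union(*D.values()) == T      # the domains together cover T

S = {3, 6}                              # traditional, domain-independent spam (S ⊂ T)

C = {
    "election":  {2},                   # e.g. a diaper ad inside the election stream
    "parenting": set(),
}
assert all(C[d] <= D[d] for d in D)     # C^i ⊂ D^i

def all_spam(domain):
    """All spam to remove when analyzing a domain: S ∪ C^i."""
    return S | C[domain]
```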

To provide more insight, suppose that we are analyzing tweets about an election, i.e. the domain of conversation is elections (\(D^i = election\)). Table 1 shows hypothetical, example tweets. The green tweets are examples of domain-relevant tweets. Given the domain, an advertisement about diapers (row 2) is irrelevant and therefore context-specific spam (\(C^i\)). However, a tweet promoting a candidate’s campaign (row 3) is an advertisement relevant to the domain (\(\overline{C^i \cup S}\)) and therefore is not considered spam.

Table 1 Examples of hypothetical tweets for the election domain

The goal of this paper is to provide researchers with automated approaches to remove irrelevant content from a particular domain of Twitter data. To that end, we propose and evaluate methods to identify \(S \cup C^i\) (all spam) for different domains \(D^i\) in a low resource setting, i.e. limited training data, and different levels of imbalanced training data. Both these constraints are important because of the cost associated with labeling training data and the imbalance of these training data sets with respect to spam. It is not unusual for less than 10% of the training data to be labeled as traditional spam or context-specific spam.

3 Related work

The definition of conversation pollution, especially on social media, is ambiguous at best. We divide this section into two subsections, focusing on different types of pollution detection. While methods for identifying content-based spam on Twitter are emerging, to the best of our knowledge, none of the state-of-the-art works include context-specific spam detection (Kaur et al., 2016; Wu et al., 2018).

3.1 Junk email detection

Early spam detection work focused on junk email detection (Sahami et al., 1998; Sasaki and Shinnou 2005; Wu 2009; Mantel and Jensen 2011; Cormack et al., 2007). One of the early and well-known approaches for filtering junk emails was a Bayesian model (Sahami et al., 1998) that used words, hand-crafted phrases, and the domains of senders as features. Unlike our study, these works use sender information, in addition to message content, to perform classification.

3.2 Spam detection on Twitter

Identifying poor quality content on social media is more challenging because the domain is broad and there are many different social media platforms with different types of posts. To further exacerbate the problem, the number of types of spam keeps increasing. Traditionally, most spammers were direct marketers trying to sell their products. More recently, researchers have identified spammers with a range of objectives (Ferrara et al., 2016; Jiang et al., 2016), including product marketing, sharing pornography, and influencing political views. These objectives vary across social media platforms. Since our work focuses on Twitter spam, we focus on literature related to spam detection on Twitter (Wu et al., 2018). We pause to mention that Twitter itself detects and blocks spam links by using Google SafeBrowsing (Chen et al., 2015), and, more recently, uses both user information and available content to identify those spreading disinformation (Safety 2020). This overall approach focuses on context-independent spam and is designed to be more global in nature. While this is an important step, for many public health and social science studies using Twitter data, not removing context-specific spam may lead to skewed research results.

Most research focuses on detecting content polluters (Lee et al., 2011; Wu et al., 2017; El-Mawass and Alaboodi 2016; Park and Han 2016; Hu et al., 2014), i.e. individuals who share poor quality content. Lee et al. (2011) studied content polluters using social honeypots and grouped content polluters into spammers and promoters. Wu et al. (2017) use discriminant analysis to identify key posts in order to detect content polluters. They define content polluters as fraudsters, scammers, and spammers who spread disinformation. El-Mawass and Alaboodi (2016) used machine learning to predict Arabic content polluters on Twitter and showed that Random Forest had the highest F1-score. Our work differs from these works because we focus on the identification of pollution at the post level as opposed to polluters at the individual level. We are also interested in low resource settings where the amount of training data available is limited.

Studies focusing on spammer/bot detection tend to use both content-based and user-based information (Wu et al., 2018; Wang 2010; Mccord and Chuah 2011; Chen et al., 2015; Lin et al., 2017; Wei 2020; Hu et al., 2013; Brophy and Lowd 2020; Jeong and Kim 2018) and the best approaches achieve a precision of around 80% for training and testing on balanced datasets. Those approaches can be used when both content and user information are available. As mentioned in Sect. 1, there are scenarios where this is impractical. For example, hundreds of thousands or millions of users may post using a specific hashtag (#covid) or keyword (coronavirus), making it impractical for a researcher using those data streams to collect user information from the Twitter API. To further complicate the situation, spam is not only generated by bots. It is also produced by humans. People and companies can post advertisements or links to low-quality content. Therefore, focusing on a strategy centered on bot detection will miss some types of spam.

There are also studies on spam (as opposed to spammer) detection on Twitter (see Kaur et al., 2016; Wu et al., 2018 for more detailed surveys). Wang (2010) proposes a Naive Bayes model that detects spam with graph-based features. The study shows that most spam tweets contain ‘@’ mentions and links. Since that early work, multiple studies have shown that Random Forest is effective for building spam detection models (Mccord and Chuah 2011; Santos et al., 2014; Chen et al., 2015; Lin et al., 2017). Chen et al. (2015) compare six algorithms that use both content-based and user-based features. They find that Random Forest is their best classifier, even when the training data are imbalanced. Detecting spam based solely on content is more challenging because of the lack of user information (Wang 2010). Santos et al. (2014) use traditional classifiers with Bag-of-Words (BoW) features to detect spam using only content-based information. They also find that the Random Forest model outperforms other classic models. Our work differs from all of this previous work since we want to detect both traditional and context-specific spam. We also compare classic machine learning and neural network models, conduct our analysis on three different domains on Twitter, and consider the impact of limited, imbalanced training data.

4 Experiment design for spam learning task

Our goal is to build a generalizable model for detecting spam on Twitter. In doing so, we investigate the following questions: (1) What are the best classic machine learning models for identifying spam on Twitter? (2) Can neural networks that incorporate a language model perform better than classic machine learning models? (3) How much does training set imbalance affect performance? (4) Are models built using one domain of Twitter training data transferable to another Twitter domain without customization? The last question is particularly important in cases when there are a small number of labels pertaining to the spam category.

Toward that end, this section describes the experimental design for understanding how different classic machine learning (Sect. 4.1) and neural network (Sect. 4.2) models perform and how transferable the models are (Sect. 4.3).

4.1 Classic machine learning models

We first evaluate classic machine learning models for this task given past good performance on variants of this task (Kaur et al., 2016; Wu et al., 2018; Santos et al., 2014; Hu et al., 2013, 2014). We build traditional models for each domain of interest. Figure 3 shows the details of our experimental design for generating ground truth labeled data (discussed in detail in Sect. 5), preprocessing, feature extraction, modeling and evaluation. The labeled ground truth data are inputs into the process. The data are preprocessed using simple, well-established cleaning methods.

4.1.1 Feature extraction

We consider four different types of features that are widely used in spam detection research (Wu et al., 2018): entity counts, bag-of-words (BoW), word embeddings, and TF-IDF scores. Features are mixed and matched for different models (Fig. 3).

Fig. 3 Classic machine learning models methodology overview

4.1.2 Entity count statistics

Previous spammer detection research has incorporated different entity count statistics (Chen et al., 2015; Brophy and Lowd 2020) such as the number of retweets, the number of friends, etc. However, since our focus is content-based analysis, we build features using only the tweet content. The entity count features we consider include text length, URL count, mention count, digit count, hashtag count, whether the tweet is a retweet, and word count after URLs are removed.
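
As an illustration, a minimal sketch of these seven content-only features might look like the following; the regular expressions and the helper name are our own assumptions, not the authors' exact implementation.

```python
import re

def entity_count_features(tweet_text):
    """Seven content-only count features for one tweet (illustrative sketch)."""
    urls     = re.findall(r"https?://\S+", tweet_text)
    mentions = re.findall(r"@\w+", tweet_text)
    hashtags = re.findall(r"#\w+", tweet_text)
    digits   = re.findall(r"\d", tweet_text)
    is_retweet = 1 if tweet_text.startswith("RT @") else 0
    words_no_urls = re.sub(r"https?://\S+", "", tweet_text).split()

    return [len(tweet_text), len(urls), len(mentions), len(digits),
            len(hashtags), is_retweet, len(words_no_urls)]
```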

4.1.3 Bag-of-Words (BoW)

Santos et al. (2014) show that Random Forest with BoW performed the best in content-based spam detection, so we also use word frequency counts as features.

4.1.4 Word embeddings

There are several embedding techniques for word representation. We use GloVe (Pennington and Socher 2014), a widely used set of pre-trained word embeddings for Twitter, to represent each word and then concatenate the word vectors into a feature vector for each tweet.
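
A sketch of this representation using the pre-trained Twitter GloVe vectors shipped with gensim is shown below; the embedding dimension and the fixed token budget are assumptions on our part (chosen so the concatenated dimensionality falls in the range reported in Sect. 6.1), not the authors' exact settings.

```python
import numpy as np
import gensim.downloader as api

glove = api.load("glove-twitter-25")   # pre-trained Twitter GloVe (25-dim; an assumption)
DIM, MAX_TOKENS = 25, 43               # fixed token budget (assumption)

def embedding_features(tokens):
    """Concatenate per-word GloVe vectors into one fixed-length tweet vector."""
    vecs = [glove[t] if t in glove else np.zeros(DIM) for t in tokens[:MAX_TOKENS]]
    vecs += [np.zeros(DIM)] * (MAX_TOKENS - len(vecs))   # pad short tweets
    return np.concatenate(vecs)                          # shape: (DIM * MAX_TOKENS,)
```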

4.1.5 TF-IDF

A classic information retrieval technique for identifying important words is computing the term frequency-inverse document frequency (TF-IDF) scores. We consider a variant of the BoW model where we use the TF-IDF weight instead of the word frequency to represent the words in each tweet.
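
Both the BoW and TF-IDF representations can be produced with scikit-learn, which the paper uses for its modeling (Sect. 6.1); the toy tweets below are hypothetical.

```python
from sklearn.feature_extraction.text import CountVectorizer, TfidfVectorizer

tweets = ["register to vote today", "buy cheap diapers now", "vote for change"]

bow_features   = CountVectorizer().fit_transform(tweets)   # word frequencies (BoW)
tfidf_features = TfidfVectorizer().fit_transform(tweets)   # TF-IDF-weighted words
```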

4.1.6 Machine learning algorithms

We build various classic machine learning models, including Naive Bayes (NB), k-Nearest Neighbors (kNN), Logistic Regression (LR), Support Vector Machine (SVM), Decision Tree (DT) and Random Forest (RF). Given the high dimensionality and sparse feature space, we also evaluate using Elastic Net (EN) since it has been shown to work well for spammer detection by making the sparse learning more stable (Hu et al., 2013, 2014). All the models are trained using different combinations of features described above. We build and test each model for each domain.
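
A minimal scikit-learn sketch of this model suite is shown below. The tiny corpus and labels are invented, TF-IDF stands in for the full feature combinations, and an elastic-net-penalized logistic regression stands in for Elastic Net; none of this is the authors' exact configuration.

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.naive_bayes import MultinomialNB
from sklearn.neighbors import KNeighborsClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.svm import LinearSVC
from sklearn.tree import DecisionTreeClassifier
from sklearn.ensemble import RandomForestClassifier

# Tiny invented corpus; the real training data come from the MTurk labels (Sect. 5).
tweets = ["register to vote in the primary", "win a free iphone click here",
          "the candidate spoke about gun laws", "buy cheap followers now",
          "town hall on school safety tonight", "hot singles in your area"]
labels = [0, 1, 0, 1, 0, 1]            # 1 = spam, 0 = domain tweet

X = TfidfVectorizer().fit_transform(tweets)

models = {
    "NB":  MultinomialNB(),
    "kNN": KNeighborsClassifier(n_neighbors=3),
    "LR":  LogisticRegression(max_iter=1000),
    "SVM": LinearSVC(),
    "DT":  DecisionTreeClassifier(criterion="entropy"),
    "RF":  RandomForestClassifier(n_estimators=100),
    # elastic-net-penalized logistic regression as a stand-in for EN
    "EN":  LogisticRegression(penalty="elasticnet", solver="saga",
                              l1_ratio=0.5, max_iter=5000),
}
for name, model in models.items():
    print(name, model.fit(X, labels).score(X, labels))   # training accuracy only
```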

4.2 Exploiting neural language models

Given the success of many neural models for text classification tasks using Twitter, we propose using a classic neural model that incorporates domain specific knowledge through the use of a language model. We hypothesize that using a language model specific to a particular domain will improve the accuracy of our spam detection models in that domain, providing necessary context. Toward that end, we incorporate a well-known neural language model, BERT (Devlin et al., 2018) and fine-tune it for this task. BERT has been used successfully for other learning tasks, including sentiment analysis (Munikar et al., 2019), natural language inference (Hossain et al., 2020) and document summarization (Cachola et al., 2020).

Our neural model begins by fine-tuning BERT to build a domain specific language model (LM) using unlabeled tweets from the domain. For example, if we are interested in gun violence, we use a large number of tweets that discuss gun violence to fine-tune BERT. BERT uses a bidirectional transformer (Vaswani et al., 2017) and its representations are jointly conditioned on both the left and right context in all layers. The output from the language model is input into a single-layer neural network for the classification task. The architecture of our neural model is shown in Fig. 4. We use the BERT tokenizer to tokenize a sentence into a list of tokens as input for BERT. After the multi-layer transformer encoder, we apply a dropout rate of 0.1 in order to avoid over-fitting. We then feed the output vectors into a single-layer neural network with softmax.
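
The sketch below illustrates this domain-specific fine-tuning step as a masked language modeling task with the HuggingFace transformers library; the library choice, hyperparameters, and example tweets are our assumptions, since the paper only states that the language models were implemented in PyTorch.

```python
from datasets import Dataset
from transformers import (BertTokenizerFast, BertForMaskedLM,
                          DataCollatorForLanguageModeling, Trainer, TrainingArguments)

# A handful of invented unlabeled domain tweets; the paper uses ~500,000 per domain.
domain_tweets = ["gun violence is preventable", "another shooting reported downtown"]

tokenizer = BertTokenizerFast.from_pretrained("bert-base-uncased")
model = BertForMaskedLM.from_pretrained("bert-base-uncased")

ds = Dataset.from_dict({"text": domain_tweets})
ds = ds.map(lambda batch: tokenizer(batch["text"], truncation=True, max_length=64),
            batched=True, remove_columns=["text"])
collator = DataCollatorForLanguageModeling(tokenizer, mlm_probability=0.15)

trainer = Trainer(
    model=model,
    args=TrainingArguments(output_dir="lm-gun-violence", num_train_epochs=1,
                           per_device_train_batch_size=32),
    train_dataset=ds,
    data_collator=collator,
)
trainer.train()
model.save_pretrained("lm-gun-violence")   # reused later as the classifier's encoder
```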

Fig. 4 The structure of our neural network model

The classifier is a single layer neural network as shown in Eq. 1, where y represents the output vector from the classifier, W is a randomly initialized weight matrix, x represents the contextual representation vector from BERT after the dropout layer (see Fig. 4), and b is a bias vector.

$$\begin{aligned} y = Wx^T + b \end{aligned}$$
(1)

The weights of the classifier are updated using the cross-entropy loss function shown in Eq. 2. The class label C is obtained using the softmax function (see Eq. 3) to normalize the values of the output vector y from the classifier in order to obtain a probability score for each class. All BERT-based models are trained using the Adam optimizer (Kingma and Ba 2014) with learning rates of \(1e-6\), \(5e-6\), and \(1e-5\), and a fixed batch size of 32. The models are fine-tuned for a maximum of 20 epochs with early stopping. We construct a language model for each domain and test the model in the same domain.

$$\begin{aligned} Loss(y,class) = -y[class] + \log {\left( \displaystyle \sum _{j} \exp {(y[j])}\right) } \end{aligned}$$
(2)
$$\begin{aligned} C = \mathop {\text {argmax}}\limits _{j}\left( \text {softmax}(y)\right) _{j} \end{aligned}$$
(3)
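
A minimal PyTorch sketch consistent with Eqs. 1–3 and the training settings above is given below; the choice of BERT's pooled output as the contextual representation, the model name, and the example tweets are our assumptions.

```python
import torch
import torch.nn as nn
from transformers import BertModel, BertTokenizerFast

class SpamClassifier(nn.Module):
    """BERT encoder, dropout of 0.1, and a single linear layer (Eq. 1)."""
    def __init__(self, encoder_name="bert-base-uncased", num_classes=2):
        super().__init__()
        self.bert = BertModel.from_pretrained(encoder_name)   # or a fine-tuned domain LM
        self.dropout = nn.Dropout(0.1)
        self.classifier = nn.Linear(self.bert.config.hidden_size, num_classes)

    def forward(self, input_ids, attention_mask):
        x = self.bert(input_ids=input_ids, attention_mask=attention_mask).pooler_output
        return self.classifier(self.dropout(x))               # y = Wx^T + b

tokenizer = BertTokenizerFast.from_pretrained("bert-base-uncased")
model = SpamClassifier()
loss_fn = nn.CrossEntropyLoss()                               # Eq. 2
optimizer = torch.optim.Adam(model.parameters(), lr=1e-5)

# One illustrative training step on invented tweets.
batch = tokenizer(["win a free iphone now", "the senate debated gun laws"],
                  padding=True, truncation=True, return_tensors="pt")
labels = torch.tensor([1, 0])                                 # 1 = spam, 0 = domain tweet

optimizer.zero_grad()
logits = model(batch["input_ids"], batch["attention_mask"])
loss = loss_fn(logits, labels)
loss.backward()
optimizer.step()

predictions = logits.softmax(dim=-1).argmax(dim=-1)           # Eq. 3
```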

4.3 Model transferability

One of our goals is to understand the strengths and limitations of our models for different domains on Twitter. For example, soccer is a different domain from politics. Toward that end, we design an experiment to measure how transferable each model built in one domain is to other domains. Figure 5 shows our experimental framework. First, we train and test models within the same domain independently. We also train models in one domain and then test them on other domains to determine the cross-domain generalizability and transferability of spam detection models on Twitter. In the case of the neural network model, we use a language model built across all the domains to determine its effectiveness in settings where limited ground truth data exist and the labeled ground truth data are imbalanced. We surmise that cross-domain learning will be beneficial for identifying traditional spam, but will not be as accurate for context-specific spam.
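
The sketch below illustrates the cross-domain protocol of Fig. 5 on a toy scale: one classifier per domain, evaluated on every domain's test set. The two tiny labeled sets and the shared vectorizer are invented for illustration only.

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import f1_score

# Toy stand-ins for labeled tweets from two domains (1 = spam, 0 = domain tweet).
data = {
    "gun-violence": (["gun laws debated today", "click here for free followers",
                      "march against gun violence", "hot deals buy now"],
                     [0, 1, 0, 1]),
    "parenting":    (["tips for toddler sleep", "win a free iphone now",
                      "best books for new parents", "cheap pills online"],
                     [0, 1, 0, 1]),
}

# A shared vocabulary across domains keeps the toy example simple.
vec = TfidfVectorizer().fit([t for texts, _ in data.values() for t in texts])
models = {d: LogisticRegression().fit(vec.transform(x), y) for d, (x, y) in data.items()}

for train_domain, model in models.items():
    for test_domain, (x, y) in data.items():
        score = f1_score(y, model.predict(vec.transform(x)))
        print(f"train on {train_domain}, test on {test_domain}: F1 = {score:.2f}")
```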

Fig. 5 The high-level process to test model transferability

5 Labeling spam in multiple Twitter domains

The need for ground truth data is two-fold. First, we need to determine whether or not context-specific spam is present and identifiable in tweets. Second, if it is present and identifiable, we need ground truth data to help build a reliable, predictive model. Since we are interested in comparing models on different domains of Twitter data, we also collected ground truth data on more than one domain. This section begins by describing the different data sets we use for our empirical evaluation (Sect. 5.1). We then explain how we generated the ground truth data (Sect. 5.2) and the characteristics of the spam in the ground truth data for each domain (Sect. 5.3).

5.1 Data description

We use three different domains of Twitter data for our empirical evaluation: meToo (a social movement focused on tackling issues related to sexual harassment and sexual assault of women), gun-violence, and parenting. The meToo data set was constructed by collecting tweets that include #meToo through the Twitter API. This data set contains over 12 million tweets from October 2018 to October 2019, the first year of the larger online movement. The gun-violence data set was collected using keywords and hashtags related to gun-violence through the Twitter API. Keywords used include guns, suicide, gun deaths, etc. We used data from 2017, approximately 22 million tweets. The parenting data set was constructed by collecting the tweets of 75 authorities who primarily post and discuss parenting topics and collecting tweets using parenting-related keywords and hashtags. Example authorities include parenting magazines and medical sites. For this data set, we collected over 200 million tweets in 2019. As is evident from the descriptions, these data sets cover a wide range of topics.

5.2 Ground truth data collection

We collected ground truth data by setting up multiple Amazon Mechanical Turk (MTurk) tasks, one for each domain of interest. MTurk is an established approach for labeling tasks (Buhrmester et al., 2011). We ask three raters to answer three questions per tweet:

  • Question 1: Is the tweet about domain \(D^i\)?

  • Question 2: Is the statement an advertisement?

  • Question 3: Is the statement spam?

For each question, we have definitions and examples to help raters answer the questions accurately. To make sure we separate advertising from traditional spam, in the definition of spam included for the workers, we explicitly state that advertising is not spam. We say that spam asks people for their personal information, contains harmful or inappropriate content/links including malware, phishing, or pornography, or is not understandable, e.g. not a complete sentence, not human language, etc. For this study, context-specific spam is limited to irrelevant advertising. Therefore, if Question 1 is false and Question 2 is true, we consider that context-specific spam. If Question 3 is true, we consider that traditional spam. Setting up the questions in this way allows us to see the prevalence of traditional and context-specific spam in the ground truth data sets.
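
A small sketch of how the three answers map to a label is shown below; the precedence given to Question 3 when multiple conditions hold is our assumption, since the paper does not state it explicitly.

```python
def spam_label(is_domain: bool, is_ad: bool, is_traditional_spam: bool) -> str:
    """Map the three MTurk answers (Questions 1-3) to a single label (sketch)."""
    if is_traditional_spam:            # Question 3 is true
        return "traditional spam"
    if is_ad and not is_domain:        # Question 2 true, Question 1 false
        return "context-specific spam"
    return "domain tweet"
```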

From each of the data sets, we randomly sample 5000 English tweets, ensuring that there are no duplicates and that each tweet contains textual content after removing URLs. We also ensure that each tweet belongs to only one domain (either parenting, meToo or gun-violence). We recognize that tweets may belong to multiple domains. Here, we focus on single domain tweets and leave multi-domain posts for future work. Each tweet is labeled by three workers. We compute Krippendorff’s alpha scores (Krippendorff 2011) and conduct an agreement analysis to determine the quality of the labeling for the three different-domain tasks. We compute both task-based and worker-based agreement scores and average them over tasks and workers, respectively. For the task-based score, we determine the majority vote for each tweet and divide its size by the number of answers (which equals the number of raters in our study). The worker-based score is the number of tasks on which a worker's label matches the majority vote divided by the total number of tasks the worker completed. We exclude workers who complete only one task.
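
The task-based and worker-based scores described above can be computed as in the following sketch; the toy answer dictionary is invented, and ties in the majority vote are broken arbitrarily here.

```python
from collections import Counter, defaultdict

# answers[task_id] -> list of (worker_id, label); invented toy data.
answers = {
    "t1": [("w1", "spam"), ("w2", "spam"), ("w3", "not spam")],
    "t2": [("w1", "not spam"), ("w2", "not spam"), ("w3", "not spam")],
}

# Task-based score: size of the majority vote divided by the number of answers.
majority, task_scores = {}, []
for task, votes in answers.items():
    label, count = Counter(v for _, v in votes).most_common(1)[0]
    majority[task] = label
    task_scores.append(count / len(votes))
task_agreement = sum(task_scores) / len(task_scores)

# Worker-based score: fraction of a worker's tasks matching the majority vote,
# excluding workers who completed only one task.
per_worker = defaultdict(list)
for task, votes in answers.items():
    for worker, label in votes:
        per_worker[worker].append(label == majority[task])
worker_scores = [sum(v) / len(v) for v in per_worker.values() if len(v) > 1]
worker_agreement = sum(worker_scores) / len(worker_scores)

print(task_agreement, worker_agreement)
```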

5.3 Data labeling results

Table 2 shows the Krippendorff’s alpha, task agreement and worker agreement scores for each data set. By traditional standards, the alpha agreement is low (below 0.67) for five of the label categories. This occurs because Krippendorff’s alpha relies on vote proportions and some of the label categories are heavily imbalanced. For example, in the parenting data set, there are only 12 tweets labeled as traditional spam and 4988 labeled as not traditional spam. In this domain, advertising, i.e. context-specific spam, is much more prevalent. These types of large imbalances reduce the overall average. Analyzing the task-based scores, we find that they are higher than 0.85 for all the questions across all three data sets. The worker-based scores vary more, from 0.79 to 0.99, but in general show high agreement. In the few places where there is higher disagreement, the text content is more ambiguous, and workers sometimes chose the “uncertain” option. Overall, the strong agreement across both tasks and workers gives us confidence in using these data to build our models.

Table 2 The agreement scores for the MTurk labeling

One interesting, but unexpected, finding is the difference in the fraction of spam across domains (see Fig. 6). The parenting-domain data set has the most balanced distribution between content-rich domain tweets and spam-related pollution tweets, consisting of 2,485 spam tweets (12 traditional and 2,473 context-specific) and 2,515 domain tweets. While this may seem surprising at first, some of the authorities are traditional magazines, e.g. Parents or Baby.com, and they post about products, some of which are relevant to the parenting domain, e.g. diapers, and some of which are not, e.g. iPhones. For the meToo data set, the distribution is highly imbalanced between content-rich domain tweets and spam tweets, with only 281 spam tweets (118 traditional and 163 context-specific) and 4,719 domain tweets. The gun-violence data set is reasonably balanced between spam and non-spam content, consisting of 2,704 spam tweets (2,559 traditional and 145 context-specific) and 2,296 domain tweets. The high level of traditional spam results from the high level of abusive, vulgar/inappropriate language in this data stream.

Across these data sets, we have varying distributions of spam and a varying proportion of traditional spam and context-specific spam. The parenting domain has substantially more context-specific spam when compared to the other two domains. In contrast, gun-violence has significantly more traditional spam than context-specific spam. Finally, the meToo data set has a more even distribution of context-specific and traditional spam but has a much smaller percentage of spam overall. These stark differences highlight the importance of capturing these different forms of spam. Ignoring one of them may lead to a substantial loss of information about low-quality content.

Fig. 6 The distribution of conversation pollution in our manually annotated data

6 Empirical evaluation

Our empirical evaluation is organized as follows. We begin by explaining the implementation details of the experiments (Sect. 6.1). Next, we present an in-domain analysis (Sect. 6.2) where models trained on data from a given domain are used to predict spam in the same domain. We also consider different levels of imbalance in the training data to determine the smallest proportion of context-specific spam needed to train a reliable classifier. Next, we perform a cross-domain analysis (Sect. 6.3), where models trained on one domain are used to predict spam in other domains. These experiments predict spam (traditional and context-specific) vs. no spam. We then detect traditional and context-specific spam separately using the neural network model and show that we can effectively detect both types of spam (Sect. 6.4).

6.1 Implementation details

We now present our implementation details. All experiments were conducted using Python 3. To accelerate the preprocessing and feature extraction process, we used Apache Spark v2.4.0 (Spark 2018), taking advantage of multiple-core processing instead of using one single-core machine. The dimensions of features vary depending upon which features were extracted (Sect. 4.1) and the training data set (Sect. 5). The sample dimension numbers include 7 for Entity Count Statistics, 7724 for Bag-of-Words, 1075 to 8600 for Word Embeddings and 5221 for TF-IDF. Processed features are then merged onto one single machine generating the final models. We use a single machine for the final model generation because some of the machine learning algorithms are not designed for a distributed environment and it is also easier for reproducibility to use a well-known package. Specifically, we use scikit-learn v0.22.1 (Pedregosa et al., 2011) for the modeling. For the neural language models (Sect. 4.2), we implemented both the language models and neural classifiers using PyTorch v1.4.0 (Paszke et al., 2019). All the language models and classifiers were trained using a Tesla T4 GPU.

For our different learning models, we conducted a sensitivity analysis using a grid-search on influential parameters. The best parameters varied by classifier, data set, and feature sets. They include \(k=\{5,7\}\) for kNN, \(C=\{1,10\}\) for LR and SVM, \(criterion=\{entropy,gini\}\) for DT, \(n\_estimators=\{50,100,200\}\) for RF, \(alpha=0.2\) and \(l1\_ratio=0.5\) for EN and \(hidden\_layer\_sizes=\{100,200\}\) for NN. The results presented in this paper use the optimized parameters for each model.
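
For example, a grid search over the Random Forest parameters named above can be run with scikit-learn as in the sketch below; the toy tweets, labels, and two-fold cross-validation are for illustration only (the paper uses 10-fold cross-validation on the full training sets).

```python
from sklearn.ensemble import RandomForestClassifier
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.model_selection import GridSearchCV

tweets = ["free iphone click now", "the bill passed the senate",
          "buy followers cheap", "rally planned for saturday"]
labels = [1, 0, 1, 0]
X = TfidfVectorizer().fit_transform(tweets)

grid = GridSearchCV(RandomForestClassifier(),
                    param_grid={"n_estimators": [50, 100, 200]},
                    scoring="f1", cv=2)
grid.fit(X, labels)
print(grid.best_params_)
```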

6.2 Within domain analysis

In this experiment, we compare classic machine learning models and a neural network model for predicting spam within a single domain, i.e. the training set and the test set are sampled from the same conversation domain. The gun-violence and parenting data sets are split into a 90/10 train/test split. Because the meToo data set only contains 281 spam tweets, we use 90% of the spam tweets in the training set (253 tweets) and have a small, balanced test set containing 28 spam and 28 domain tweets. Experiments were conducted and validated using 10-fold cross-validation on training sets and then evaluated on hold-out test sets. All baseline classifiers were constructed using different combinations of features as described in Sect. 4. The language model is pre-trained on 500,000 unlabeled domain-specific tweets.
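
A sketch of how the meToo split described above can be constructed is shown below; the placeholder tweet lists are stand-ins for the labeled data, and the random seed is arbitrary.

```python
import random
random.seed(0)

# Placeholders for the labeled meToo tweets (281 spam, 4719 domain tweets).
spam_tweets   = [f"spam_{i}" for i in range(281)]
domain_tweets = [f"domain_{i}" for i in range(4719)]

random.shuffle(spam_tweets)
random.shuffle(domain_tweets)

train_spam, test_spam = spam_tweets[:253], spam_tweets[253:]   # 90% / 10% of the spam
test_domain  = domain_tweets[:len(test_spam)]                  # 28 domain tweets
train_domain = domain_tweets[len(test_spam):]

train_set = train_spam + train_domain     # imbalanced training set
test_set  = test_spam + test_domain       # small balanced test set (28 + 28)
```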

Table 3 shows experimental results for both the training and testing data sets for detection of all spam - traditional and context-specific. Only the best classifier in each feature group is presented for ease of exposition. We also show the results of the vanilla neural network, i.e. a neural network without the pre-trained language model, to demonstrate the influence of the language model. We use the Counting+TF-IDF feature combination for the vanilla neural network since it tends to perform the best among the four feature sets based on F1 (10-fold). The F1 scores are presented for both the training and test experiments. Accuracy, precision and recall are only shown for the test experiments. The best scores for each domain are highlighted in the table.

Focusing on the F1 scores from the 10-fold cross-validated training sets, the table shows that the neural network models with the pre-trained language models outperform all the baselines by a small amount, 1 to 2% for every domain. Similarly, focusing on the test F1 score, they outperform all the baselines by 2%, 5%, and 1% for the parenting, meToo, and gun-violence data sets, respectively. Random forest is the best classic machine learning model across domains. In general, for classic models, using the entity counting and TF-IDF weighted word features performs better than other feature combinations. In all cases, entity counting statistics plus word features perform better than entity counting features alone. These findings are consistent with previous research (Santos et al., 2014; Wu et al., 2018). We see that the vanilla neural model without pre-training using a language model performs 9% to 12% worse on the test data sets, indicating the importance of pre-training.

Table 3 Experimental results of spam pollution detection on each data set

Exploring the impact of class imbalance further, we consider the trade-off between a larger imbalance with a larger training set versus a smaller imbalance with a smaller training set. Beginning with the meToo data set, we adjust the distribution of the meToo training data to analyze different levels of imbalance. We consider three split distributions: using all the data (95/5 split), using 2529 domain tweets and 281 spam tweets (90/10 split), and using 1124 domain tweets and 281 spam tweets (80/20 split). We show the 10-fold cross-validation results for these three split distributions in Table 4. In this resource-constrained, highly imbalanced training data scenario, logistic regression and random forest still perform better than the other classic machine learning models on all the different splits, while the neural network models still have the highest F1 scores. Even though this is not surprising, it is still important to note that as the imbalance decreases by 5%, the F1 score increases by 3 to 5%.

Table 4 Different distributions of spam pollution in the meToo domain

To see how generalizable this finding is, or whether it is unique to the meToo domain, we create two data sets with an 80/20 split for the parenting and gun-violence domains. We sample 281 spam tweets and 1124 domain tweets in order to directly compare to the meToo results in Table 4. Table 5 shows the comparison of results from the 10-fold cross-validation on the imbalanced data. The first observation is that the parenting and gun-violence F1 scores are significantly lower - the lack of examples reduces the scores by 19% and 17% respectively for the top models, and has an even greater impact on many of the classic models (see 10-fold on Table 3). Similar to the previous results, we see that the neural network model still outperforms the other models by 2 to 23%. These results indicate that spam within each domain has substantial variability and that having only 20% spam in the training data may not lead to acceptable results for certain domains.

Table 5 Comparison of 10-fold results on 80/20 imbalanced training data

6.3 Cross-domain analysis

We next determine how well models trained in one domain work in another domain for the spam detection task. Figure 7 shows the F1 scores of the best classifiers trained on one domain and tested on other domains. The x-axis shows each model and the y-axis shows the F1 score. It is no surprise that training and testing on the same domain always performs better than building the model in one domain and testing it on different domains. But there are a few other interesting takeaways. First, the classic models show more robustness than the neural network models. While the neural network always has the highest F1 score when the test data set is within the same domain, its drop in performance on other test data sets is worse than that of some of the classic models. The most stable model across test sets is random forest using Counting+GloVe features, followed by Counting+TF-IDF. We hypothesize that because the neural network model incorporates contextual representations more deeply, changing contexts impacts its performance more significantly. The other interesting finding is that the most robust training set is the gun-violence one. This makes sense because it has the highest fraction of traditional spam. In other words, cross-domain analysis is most beneficial when there is a large amount of traditional spam to learn from. Based on these two findings, we wanted to determine if we could improve the cross-domain performance of the neural model by pre-training the language model with data from all three domains.

Fig. 7 F1 scores from different feature sets and the best classifiers trained and tested across different datasets

To answer this question, we sampled 500,000 unlabeled tweets from each domain and combined them to fine-tune the language model (1.5M unlabeled tweets in total). We trained and tested the models for each domain. The results are shown in Table 6. The first two columns show the training and test data sets. The next column shows the F1 score of the best classic classifier, and the last two columns show the F1 score when the neural network model uses a language model pre-trained on the same domain (LM-same) and when it uses a language model pre-trained on the combined domains (LM-combined). The underlined numbers are the best F1 scores using different models on the test sets. We see that the classic models are generally better when there is a difference in the domain. This is consistent with our results in the previous analysis. In all cases, a language model built using all of the domains performs worse than, or only marginally better than, a domain-specific pre-trained language model. In other words, it is better to optimize for a specific domain by pre-training a domain-specific language model than it is to pre-train across all the domains, assuming the size of the data for pre-training is fixed. This is an unexpected finding. One would expect that pre-training across all the domains would improve the F1 scores when the training examples come from one domain and the testing examples from another. This was not the case: the variability in the type of spam, i.e. the different proportions of traditional and context-specific spam (see Fig. 6) and their variation in language, impacted the overall performance. Therefore, building a general model for spam is complex. We explore this more in the next set of experiments.

Table 6 Comparison of F1 scores from classical models and language models trained from same-domain or combined tweets

6.4 Traditional vs. context-specific spam

While it is important to be able to identify all spam, the last analysis highlighted the need to identify specific types of spam. In this experiment, we analyze the ability of the neural network models to detect each type of spam separately. We only include the meToo and gun-violence data in the traditional spam experiment since the parenting data contain too few traditional spam tweets. Because the data are imbalanced and oversampling can cause deep neural networks to overfit, for data sets with less than 20% spam we reduced the training sample size to maintain a 20/80 spam to not-spam ratio. These experiments were done using 10-fold cross-validation. Table 7 shows the performance of the neural network models on each task. Pollution spam is the combination of traditional and context-specific spam. While we also tested the classic models, we focus on the neural network models since they performed better than the classic models across all the data sets on this task. Tasks with a star next to their name are those for which we reduced the sample size to maintain the 20/80 ratio. The results show that the neural network labels both context-specific and traditional spam well on all of the domains, with an F1 score of over 80%.

Table 7 Comparison of the neural network models performance on different types of spam

A question that arises as we look at these results is whether or not a single spam detector can be designed to identify both traditional and context-specific spam across all domains. To test this, we compare detectors built using traditional spam from all the domains to detectors built using all the spam pollution, including context-specific spam, across all domains. We combined the traditional spam from all three domains and then undersampled non-spam tweets from all three domains equally. As a result, the models were evaluated on training data sets containing 2689 traditional spam tweets and 2688 non-spam tweets (896 from each domain). For pollution spam, regardless of type, we sampled 5470 spam tweets and 5469 non-spam tweets (1823 from each domain). We also sampled 2252 context-specific spam tweets combined from the three different domains and 2253 non-spam tweets (751 from each domain) in order to evaluate how much the performance drops if we use one classifier for detecting context-specific spam across three different domains. We evaluated our results using 10-fold cross-validation. The performance results are shown in Table 8. We see that generalizing across traditional spam performs better (F1 = 0.91) than generalizing across both context-specific and traditional spam (F1 = 0.86). Generalizing across context-specific spam alone negatively impacts the overall performance (F1 = 0.84). These differences of 5% and 7% are significant for this task. Given this result and the previous results, if there is sufficient spam of each type, then building models for each type individually can lead to higher F1 scores than building a model for both spam types together. However, in cases where the training data imbalances are large for the different types of spam, combining them leads to reasonable results.

Table 8 10-fold performances of our LM-combined models on spam from all domains

7 Conclusion

The goal of this paper is to demonstrate the existence of context-specific spam and build models to automatically identify both context-specific and traditional spam using only post data on Twitter. This is a challenging problem because: (1) what qualifies as spam varies across domains and as such, this task likely necessitates different models and new training data sets for each new domain; (2) labeling data is costly; and (3) different domains/themes of conversation on Twitter have different levels of spam, leading to different class balances in training data sets.

In this study, we define and show that context-specific spam exists on Twitter. We develop a broad conversation pollution taxonomy and place context-specific spam within that taxonomy. We experiment with different classic machine learning models and construct different types of text features introduced in previous literature to determine the best content-based algorithms for identifying spam within a single domain of posts and across multiple domains. We find that the best classic models are logistic regression and random forest, a finding that is consistent with previous literature. We analyze a neural network model that incorporates a pre-trained language model and show that this model performs better than the best classic machine learning models and a basic neural network model when the training and testing data sets are within a single domain. However, when the training and test data sets are from different domains, the classic models are more robust than the neural models. Still, neither is as good as the domain-specific models because of the presence of context-specific spam. We also consider heavily imbalanced data sets and show that using a cross-domain pre-trained language model when the training data are small and imbalanced reduces the negative impact of the large imbalance.

Future work will look at other forms of conversation pollution to see if incorporating language models into neural networks can improve the state of the art. Other possible directions include considering related domains (e.g. the child behavior domain as it is closely related to the parenting domain), quantifying the levels of spam present on different platforms, and developing strategies to intervene and remove or mark those that are detracting from the conversation.