1 Introduction

Crowdsourcing is a paradigm in which organizations or individuals obtain data or services from a large and relatively open group of users, or crowd. It has been increasingly used in modern society for data collection in various domains such as image annotation and real-time traffic reporting. Amazon Mechanical Turk (MTurk) [6, 20] is one of the most pervasive crowdsourcing marketplaces, in which requesters submit various tasks requiring human intelligence, such as labeling objects in an image or flagging inappropriate content. Another example is Waze [46], a crowd-driven navigation application: users report the traffic status at various locations, and these reports are aggregated to update the traffic conditions shown on the map.

A key component of crowdsourcing applications is truth inference, which aims to derive the answer for each task, e.g., the objects in an image or the true traffic condition of a road, by aggregating the user-provided data. Truth inference [12, 24, 29, 36, 38] is challenging due to the open nature of the crowd. First, the number of available ratings per task varies significantly. Second, the reliability of the workers can vary. For example, in the Waze application, it is common for some users not to report at all or to report the traffic condition carelessly. Estimating the level of trust to place in workers’ responses, and ultimately inferring the correct label for each task by aggregating those responses, therefore becomes complicated. Finally, crowdsourcing applications may be subject to data poisoning attacks [23, 40], where malicious users purposely and strategically report incorrect responses to mislead the system into inferring the wrong label for all tasks or a targeted set of tasks. In the Waze example, attackers might want to keep the road with the least traffic to themselves by deceiving the Waze application into wrongly indicating heavy traffic on that specific road. This can be achieved via Sybil attacks [10, 14, 49, 53], where an attacker creates a large number of Sybil workers to strategically report wrong answers.

Traditional Sybil detection in online social networks [1, 45] typically relies on additional features or metadata (e.g., the connectivity graph and IP addresses). Recent works have proposed Sybil detection methods for crowdsourcing based on golden questions and worker clustering [10, 49]. In this paper, we focus on truth inference methods that rely only on workers’ answers and on their robustness against poisoning attacks; this is orthogonal and complementary to Sybil detection methods that use additional metadata.

The simplest truth inference method is majority voting, where the inferred truth is the one chosen by the majority of the assigned workers. Since the reliability of workers is not considered, majority voting may fail in the presence of unreliable or malicious workers. Considerable research has been done on improving the accuracy of truth inference methods, including optimization based methods [22, 28], probabilistic graphical model based methods [11, 25, 33, 44, 56], and neural network based methods [15, 17, 50]. These methods construct models that explicitly or implicitly consider the credibility of workers, providing some form of defense against unreliable or malicious workers. Zheng et al. [56] evaluated truth inference methods with various worker models, types of tasks, and task models. Their evaluation focused on “normal” settings where workers may have varying reliability but do not intentionally or strategically manipulate their answers. They concluded that truth inference methods that model worker behavior with a confusion matrix and use a probabilistic graphical model (PGM) perform best in most settings.

One unanswered question is how robust these inference methods are under intentional and strategic data poisoning attacks that go beyond normal worker behaviors. Adversaries may disguise themselves as normal workers by providing reliable answers for certain tasks to evade the worker reliability model, while providing wrong answers for other, targeted tasks. In the worst case, when adversaries know the truth inference method and other workers’ answers, they may optimize their answers to maximize the error of the truth inference method for all tasks or a subset of tasks [30, 31]. Thus, it is important to understand the various types of data poisoning attacks and to evaluate how different truth inference methods behave under such attacks in order to ultimately build robust truth inference methods.

Contributions. We propose a comprehensive data poisoning attack taxonomy for truth inference in crowdsourcing and systematically evaluate the state-of-the-art truth inference methods under various attacks. In summary:

  • We present a comprehensive data poisoning attack taxonomy in crowdsourcing. We analyze the attacks along different dimensions, including attack goal (targeted vs. untargeted), adversarial knowledge (black-box vs. white-box), and attack strategy (heuristic vs. optimization based). We also discuss the similarities and differences between data poisoning attacks in crowdsourcing and those in machine learning and other domains.

  • We design heuristic and optimization based attacks that can be used on various truth inference methods as part of our evaluation methodology. The heuristic based attacks assume black-box or no adversarial knowledge and model the worker behavior using a confusion matrix [11] and an additional disguise parameter to hide their malicious behavior. The optimization based attacks assume white-box or full adversarial knowledge including the truth inference methods being used and other workers’ answers, and are adapted from existing optimization based attacks [30] while making them more generic so they are applicable to broader types of truth inference methods.

  • We systematically evaluate the state-of-the-art truth inference methods under both heuristic and optimization based data poisoning attacks. The truth inference methods are selected carefully to represent the different types of methods including majority voting based [22], optimization based [28], probabilistic graphical model based [11, 24, 25], and neural network based [50]. They also represent different worker behavior models including probability based [28], confusion matrix based [11, 24, 25], and implicit models [22, 28]. Our study includes not only the best performing methods from the experimental study [56], but also additional direct computation [22] and optimization based methods [22, 28] and more recent neural network based methods [50].

  • We propose several metrics to evaluate the robustness of truth inference methods against data poisoning attacks. We experiment on synthetic and real-world datasets with varying parameters such as the percentage of malicious workers, different attack parameters, and the sparsity of the crowdsourcing dataset. We summarize the experimental findings and draw conclusions on the robustness, strengths, and weaknesses of the methods. We believe these results help in understanding the resilience of existing methods and in ultimately building more robust truth inference methods against data poisoning attacks.

Section 2 summarizes the existing attacks in related domains. Section 3 formally defines the truth inference problem and presents a categorization of existing truth inference methods and the selected methods for evaluation. Section 4 presents the attack taxonomy. Section 5 describes our evaluation methodology. Section 6 presents the results with discussions and Sect. 7 concludes the paper with key findings and future work.

2 Related Work

Truth inference methods have been studied extensively. We provide a brief description of the methods included in our evaluation in Sect. 3. Here we briefly review the works in data poisoning attacks.

Data poisoning attacks on machine learning (ML) algorithms have been increasingly studied recently [2, 5, 18, 23, 37, 40, 47]. However, data poisoning attacks in ML and crowdsourcing differ in three ways: (1) attacks in ML target supervised models and aim to degrade the model’s performance on a validation dataset, whereas crowdsourcing is an unsupervised problem; (2) in ML, a certain percentage of records are poisoned and all features of a poisoned record (e.g., an image) can be altered, whereas in crowdsourcing a fraction of the workers may be malicious; and (3) ML problems typically have rich features per record, while in crowdsourcing only some worker ratings are available for each task. Hence, crowdsourcing systems are more susceptible to data poisoning attacks due to their open and unsupervised nature and the lack of rich features in the truth inference problem.

A shilling attack [7, 9, 16, 32, 34, 55] is a type of data poisoning attack in recommender systems where intruders attempt to simulate the behavior of a subset of users, leading to strategically manipulated recommendations for other users [16, 34]. The main difference between recommender systems and truth discovery is that the true labels of tasks in recommender systems, i.e., users’ personal opinions, are subjective, while the ground truth in truth discovery is universal.

Other related attacks include spammer [13, 19, 20, 35, 48] and Sybil [8, 26, 39, 43, 46, 51–54] attacks. In a spammer attack, workers (bots) randomly submit answers to tasks [13, 19, 20]. In Sybil attacks, infiltrators create fake identities to affect the performance of the system [53, 54]. The attack and defense methods, for example in social media and IoT, typically utilize metadata such as graph connectivity, relationships between nodes, and IP addresses. Sybil and spammer attacks mainly focus on the system infiltration part of the attack. The data poisoning attacks considered here assume adversaries have already successfully created or compromised multiple workers and can inject strategic answers.

3 Truth Inference Methods

Truth Inference Problem Definition. We consider a crowdsourcing system comprised of some tasks and a pool of workers. Each task is assigned to a subset of workers. The goal in truth inference is to determine the true answer based on available answers for each task. Tasks in crowdsourcing can be categorized as 1) decision-making tasks where workers select a binary answer (e.g. yes or no), 2) single label tasks where workers select a single label among multiple candidate labels, and 3) numeric tasks where an answer is a number. Moreover, truth inference methods may consider factors such as type of tasks, task difficulty, and task assignment methods [56].

In this paper, we focus on the robustness of truth inference methods under data poisoning attacks for decision-making tasks, i.e., the binary truth inference problem, and do not consider other variations that are largely orthogonal to our evaluation. Examples of decision-making tasks are labeling the sentiment of sentences as positive or negative, or reporting accidents on the road. We also assume tasks are equally hard for the average worker and are randomly assigned to a subset of workers following a power-law distribution [41].

Definition 1

(Truth Inference [36]). Consider a set of M workers \(\mathbf{W} = \{w_1, \ldots, w_M\}\) and a set of N tasks \(\mathbf{T} = \{t_1, t_2, \ldots, t_N\}\) where each task \(t_j\) has a truth label \(z^{*}_{t_j} \in \mathbf{L} = \{0,1\}\). Given an answer matrix \(\mathbf{C}\) where each element \(c^{w_i}_{t_j}\) indicates the answer from worker \(w_i\) for task \(t_j\), the goal of truth inference is to infer the truth label \(\hat{z}_{t_j} \in \mathbf{L}\) for each task \(t_j \in \mathbf{T}\).

Figure 1 illustrates a crowdsourcing system with 3 workers and 5 tasks and a bipartite task-worker assignment graph. The input of a truth inference method is an answer matrix \(\mathbf{C} \) provided by the workers. Workers label each task as 0 or 1, while x reflects that the task is not assigned to that worker. The output is \(\hat{\mathbf{Z }}\) reflecting the inferred answers for the tasks. The ground truth vector \(\mathbf{Z} ^{*}\) is shown as a reference. Table 1 summarizes notations used throughout the paper.

Fig. 1. Example of a crowdsourcing system
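
To make the answer matrix concrete, the following minimal sketch (with hypothetical values, not the actual entries of Fig. 1) shows one way \(\mathbf{C}\) can be represented for 3 workers and 5 tasks, using NaN for unassigned worker-task pairs.

```python
import numpy as np

# Hypothetical answer matrix C for 3 workers (rows) and 5 tasks (columns).
# Entries are 0/1 labels; np.nan marks worker-task pairs that were not
# assigned (shown as "x" in Fig. 1).
C = np.array([
    [1,      0,      np.nan, 1,      0],
    [1,      np.nan, 1,      0,      0],
    [np.nan, 0,      1,      1,      np.nan],
])

M, N = C.shape  # M workers, N tasks
print(M, N)     # 3 5
```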

Truth Inference Methods. There are four main categories of truth inference techniques: 1) direct computation, 2) optimization, 3) probabilistic graphical model (PGM), and 4) neural networks. Direct computation aggregates workers’ answers by majority voting while treating workers equally or heuristically assigning weights to them [22]. Optimization based methods [22, 27, 28, 57, 59] treat the estimated labels and worker reliability as unknowns and use an optimization approach to find them. Probabilistic graphical models (PGM) explicitly model workers’ reliability to estimate the labels [11, 12, 24, 25, 44, 48]. Optimization and PGM based methods follow an iterative Expectation Maximization (EM)-based approach consisting of: 1) inferring the labels of the tasks given the estimated workers’ reliability, and 2) computing workers’ reliability given the currently inferred labels of the tasks. Recently, unsupervised neural network based approaches [15, 50] have been proposed that feed the answers for each task into a neural network and output the inferred label of the task. Other approaches based on tensor augmentation and completion, with limited performance, have also been suggested [58].
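
To illustrate the shared EM-style structure of the optimization and PGM based methods, the following toy sketch alternates between weighted voting on task labels and re-estimating worker reliabilities. The single-reliability worker model, the initialization, and the agreement-based update are simplifying assumptions of ours and do not reproduce the exact update rules of any specific method.

```python
import numpy as np

def em_style_inference(C, iters=20):
    """Toy EM-style truth inference with a single reliability per worker.

    C: (M, N) answer matrix with entries in {0, 1} or np.nan for unassigned.
    Returns (inferred labels, estimated worker reliabilities).
    """
    M, N = C.shape
    answered = ~np.isnan(C)
    r = np.full(M, 0.8)                      # initial worker reliabilities
    labels = np.zeros(N)
    for _ in range(iters):
        # Step 1: infer labels given reliabilities (weighted vote per task).
        votes1 = np.where(C == 1, r[:, None], 0.0).sum(axis=0)
        votes0 = np.where(C == 0, r[:, None], 0.0).sum(axis=0)
        labels = (votes1 >= votes0).astype(float)
        # Step 2: re-estimate each worker's reliability as the fraction of
        # its answers that agree with the currently inferred labels.
        agree = (C == labels[None, :]) & answered
        counts = answered.sum(axis=1)
        r = np.where(counts > 0, agree.sum(axis=1) / np.maximum(counts, 1), 0.5)
    return labels.astype(int), r
```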

Worker Models. Some truth inference methods do not have an explicit worker model, while others model workers by: 1) a single worker reliability or penalty parameter reflecting their trustworthiness [12, 48], or 2) a confusion matrix capturing a worker’s probability of providing a certain label given the true label [11, 25, 44]. The confusion matrix \(\pi^{w_{i}}\) is an \(|\mathbf{L}| \times |\mathbf{L}|\) matrix where element \(\pi^{w_{i}}_{p,q}\) denotes the probability of worker \(w_{i}\) reporting label q given the true label p. Assuming a binary label set \(\mathbf{L} = \{0,1\}\), the matrix reduces to two variables, \(\alpha_{i}\) and \(\beta_{i}\), with \(\alpha_{i} = \mathrm{pr}(c^{w_{i}}_{t_j} = 1 \mid z^{*} = 1)\) and \(\beta_{i} = \mathrm{pr}(c^{w_{i}}_{t_j} = 0 \mid z^{*} = 0)\), i.e., the probability of worker \(w_{i}\) correctly reporting a task whose true label is 1 or 0, respectively.
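
In the binary case, the confusion matrix of each worker thus reduces to the pair \((\alpha_i, \beta_i)\). The snippet below is a simplified empirical estimator of these two parameters for one worker given the currently inferred labels; it is our own illustration, not the exact estimator used by D&S or BCC.

```python
import numpy as np

def estimate_alpha_beta(answers, labels):
    """Estimate (alpha, beta) for one worker.

    answers: 1D array of the worker's answers (0/1, np.nan if not assigned).
    labels:  1D array of currently inferred task labels (0/1).
    alpha = Pr(answer = 1 | label = 1), beta = Pr(answer = 0 | label = 0).
    """
    assigned = ~np.isnan(answers)
    pos = assigned & (labels == 1)
    neg = assigned & (labels == 0)
    alpha = np.mean(answers[pos] == 1) if pos.any() else 0.5  # fallback if no positive tasks
    beta = np.mean(answers[neg] == 0) if neg.any() else 0.5   # fallback if no negative tasks
    return alpha, beta
```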

Selected Methods. We aim to be comprehensive in representing categories of techniques and worker models to investigate their role in the robustness of inference methods under data poisoning attacks. When possible (e.g. for optimization and PGM based techniques), we leverage the findings of the previous study [56] by selecting the best-performing methods.

For direct computation, MV (majority voting) and its enhanced version, MV-Soft [22], are chosen. For optimization based methods, we include MV-Hard [22], which employs a semi-matching optimization on the worker-task bipartite graph, and PM [28], which poses an optimization problem over the inferred labels and workers’ reliability. For PGM based methods, we chose the best-performing D&S [11] and BCC [25], which model worker reliability with a confusion matrix, and KOS [24], which uses a single worker reliability parameter. For neural network based methods, we chose LAA-S [50], the only one applicable to non-complete bipartite graphs. We refer readers to the appendix for a more detailed discussion of each method.

Table 1. Summary of notations

4 Data Poisoning Attacks

Due to their open nature, crowdsourcing systems are subject to data poisoning attacks [23, 40] where malicious workers intentionally and strategically report incorrect labels to mislead the system to infer the wrong answer for all or a targeted set of tasks. This is different from unreliable behavior that is typically non-malicious, unintentional, and non-strategic. We propose a taxonomy for data poisoning attacks in crowdsourcing and review some existing attacks.

4.1 Attack Taxonomy

Attack Goal. Assuming a certain percentage of malicious workers, the answer matrix \(\mathbf{C} \) contains corrupt answers \(\mathbf{C} '\) contributed by malicious workers. There are two cases:

  • Untargeted Attack. Adversaries aim to decrease the overall accuracy. Their goal is to mislead the system into inferring the wrong label for as many tasks as possible, which can be formulated as: \(\max_{\mathbf{C}'} \, \sum_{j=1}^{N} \mathbbm{1} (\hat{z}_{t_j} \ne z^{*}_{t_j})\).

  • Targeted Attack. Adversaries aim to reduce the accuracy for only a targeted subset of tasks \(\mathbf{T}_{tar} \subseteq \mathbf{T}\), which can be written as: \(\max_{\mathbf{C}'} \, \sum_{t_j \in \mathbf{T}_{tar}} \mathbbm{1} (\hat{z}_{t_j} \ne z^{*}_{t_j})\). (A small sketch of how these objectives are evaluated follows this list.)
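
Both objectives are simple indicator sums over the tasks; the following minimal sketch shows how we evaluate them for analysis purposes (the adversary itself cannot compute them exactly, since it does not observe the ground truth).

```python
import numpy as np

def attack_success(z_hat, z_true, targeted=None):
    """Number of tasks whose inferred label differs from the ground truth.

    If `targeted` is given (an index array of targeted tasks), count only
    over that subset, matching the targeted-attack objective.
    """
    wrong = np.asarray(z_hat) != np.asarray(z_true)
    if targeted is not None:
        wrong = wrong[np.asarray(targeted)]
    return int(wrong.sum())
```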

Adversarial Knowledge. Malicious workers can have a varying level of adversarial knowledge.

  • Black-box Attacks. Adversaries only know their assigned tasks.

  • White-box Attacks. Adversaries know the inference method and all answers of other workers.

  • Gray-box Attacks. Adversaries have some knowledge, e.g. part of answers.

Attack Strategy. The attackers can adopt different strategies depending on their level of knowledge.

  • Heuristic Based Attacks. Attackers can use heuristics, e.g., always reporting the wrong answer for their tasks, or occasionally reporting true answers for some tasks to disguise themselves as honest workers. While this attack may not be as effective as the optimization based one below, it is easy to carry out and does not require strong adversarial knowledge, and hence is applicable to all truth inference algorithms in both black-box and white-box settings.

  • Optimization Based Attacks. Given full knowledge, attackers can formulate an optimization problem to choose answers that maximize the number of labels flipped before and after the attack [30, 31]. Though more effective, this attack depends on the truth inference algorithm being used and requires full or partial adversarial knowledge, and thus is applicable only to white-box or gray-box settings.

4.2 Existing Attacks

A few optimization based attacks [30, 31] have been studied for representative inference methods, namely D&S [11] and PM [28], assuming attackers’ full knowledge of all answers and the inference method used (i.e. white-box setting). Since the adversary does not know the tasks’ ground truth, the optimization aims to maximize the number of inferred labels that are flipped from before attack to after attack, while also maximizing the inferred reliability of the attackers by the truth inference method. Intuitively, this helps them to obfuscate their malicious nature and hence succeed in misleading the system.

Let \(\hat{z}^{a}_{t_j}\) and \(\hat{z}^{b}_{t_j}\) denote the inferred answer for task \(t_{j}\) by D&S after and before the attack, respectively, and let \(\hat{\alpha}_{w'}\) and \(\hat{\beta}_{w'}\) denote the inferred confusion matrix parameters of malicious worker \(w'\). The optimization is formulated as Eq. (1), where \(\lambda\) controls the trade-off between maximizing the collective reliability of the malicious workers and maximizing the number of flipped labels.

$$\begin{aligned} \max_{\mathbf{C}'} \sum_{j=1}^{N} \mathbbm{1} (\hat{z}^{a}_{t_j} \ne \hat{z}^{b}_{t_j}) + \lambda \sum_{w' \in \mathbf{W}'} (\hat{\alpha}_{w'} + \hat{\beta}_{w'}) \end{aligned}$$
(1)

For our evaluation, we design heuristic attacks and extend the optimization attack for D&S and PM to all truth inference methods (explained in Sect. 5.2).

5 Evaluation Methodology

5.1 Datasets

We use two real datasets of decision making tasks [4, 21] and synthetic datasets with varying parameters for the evaluation. The datasets’ properties are summarized in Table 2.

Product Dataset. The task is comparing two products, e.g. “Are iPad Two 16 GB WiFi White and iPad 2nd generation 16 GB WiFi White the same?” [21].

PosSent Dataset. The task is classifying positive/negative tweets about companies’ reputation, e.g., “The recent products of Apple is amazing” [4].

Synthetic Dataset. Synthetic datasets are created to evaluate the effect of redundancy, the number of tasks, and the number of workers on performance. The worker-task assignment graph follows a power-law distribution. The ground truth for tasks is drawn from a Bernoulli distribution with prior 0.5, i.e., a balanced dataset. We experimented with changing the prior, observed no significant change, and hence only report results for 0.5. The workers’ reliability parameters \(\alpha\) and \(\beta\) are drawn from a Beta distribution.
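
A minimal sketch of such a generator is shown below; the specific distribution parameters (power-law exponent, Beta shape parameters) and the function name are illustrative assumptions, not the exact values used in our experiments.

```python
import numpy as np

rng = np.random.default_rng(0)

def generate_synthetic(n_workers=50, n_tasks=200, a=2.5, beta_params=(5, 1)):
    """Generate ground truth, worker reliabilities, and an answer matrix.

    - The number of workers assigned per task follows a truncated power law.
    - Ground truth is Bernoulli(0.5); alpha/beta come from a Beta distribution.
    """
    truth = rng.binomial(1, 0.5, size=n_tasks)         # balanced ground truth
    alpha = rng.beta(*beta_params, size=n_workers)      # Pr(answer=1 | truth=1)
    beta = rng.beta(*beta_params, size=n_workers)       # Pr(answer=0 | truth=0)
    C = np.full((n_workers, n_tasks), np.nan)
    degrees = np.clip(rng.zipf(a, size=n_tasks), 1, n_workers)
    for j in range(n_tasks):
        workers = rng.choice(n_workers, size=degrees[j], replace=False)
        for i in workers:
            p_correct = alpha[i] if truth[j] == 1 else beta[i]
            correct = rng.random() < p_correct
            C[i, j] = truth[j] if correct else 1 - truth[j]
    return C, truth, alpha, beta
```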

Table 2. Properties of datasets
Table 3. Data poisoning attack parameters

5.2 Attack Design

For the attacks, we inject a set of malicious workers \(W'\), so the fraction of malicious workers is \(\frac{|W'|}{|W| + |W'|}\). We design both heuristic and optimization attacks for the inference methods chosen in Sect. 3. Both targeted and untargeted attacks are analyzed when feasible. The parameters used in the design of the attacks are given in Table 3. Default values (highlighted) are used unless specified otherwise.

Heuristic Based Attacks. We design heuristic based attacks applicable in black-box settings.

  • Untargeted Attacks. With only knowledge of the assigned tasks, the simplest heuristic for attackers is to always report the wrong answer for their tasks. However, this may be easily detected by most truth inference systems (except majority voting) that model workers’ reliability. Therefore, attackers want to disguise themselves as honest workers by providing correct answers to some tasks to avoid detection and being discounted later. To model such behavior, we use the following enhanced heuristic applicable to all truth inference methods. Each malicious worker behaves as a normal worker modeled by \(\pi^{w}\) with a disguise probability \(\gamma\) and as a malicious worker modeled by \(\pi^{w'}\) with probability \(1-\gamma\). For example, a malicious worker with a moderate level of disguise may be modeled with \(\pi^{w} = \left[\begin{array}{cc} \beta = 1 & 0 \\ 0 & \alpha = 1 \end{array}\right]\), \(\pi^{w'} = \left[\begin{array}{cc} \beta' = 0 & 1 \\ 1 & \alpha' = 0 \end{array}\right]\), and \(\gamma = 0.2\) (see the sketch after this list).

  • Targeted Attacks. For targeted attacks, the best strategy for the malicious workers is to flip the labels for the targeted tasks while acting truthfully for all other tasks, thereby building their reliability. Hence we set \(\gamma = 0\).
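
The following minimal sketch illustrates how a malicious worker’s answers can be generated under these heuristics, covering both the untargeted case with disguise probability \(\gamma\) and the targeted case with \(\gamma = 0\). The function and parameter names are our own, and the normal/malicious behaviors are simplified to always-correct/always-wrong, as in the example confusion matrices above.

```python
import numpy as np

rng = np.random.default_rng(1)

def malicious_answer(truth, gamma=0.2, targeted=None, task=None):
    """Return one malicious answer for a task with ground truth `truth`.

    Untargeted mode (targeted=None): with probability gamma the worker
    disguises itself (answers truthfully, i.e. alpha = beta = 1); otherwise
    it answers with the malicious model alpha' = beta' = 0 (always wrong).
    Targeted mode: answer wrongly only for tasks in `targeted` (gamma = 0
    there) and truthfully everywhere else.
    """
    if targeted is not None:
        return 1 - truth if task in targeted else truth
    return truth if rng.random() < gamma else 1 - truth

# Example: untargeted attack with moderate disguise (gamma = 0.2).
answers = [malicious_answer(t, gamma=0.2) for t in [1, 0, 1, 1, 0]]
```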

Optimization Based Attacks. We design optimization based attacks applicable to white-box settings, i.e., the attackers know the inference method and all other workers’ answers.

  • Untargeted Attacks. We extend the attack from [30] to all inference methods. For confusion matrix based ones, we use the same formulation as Eq. (1). For methods with a single reliability parameter or no worker model, we set \(\lambda =0\). We varied \(\lambda \) in the interval [0, 1] and observed that a \(\lambda \) value in [0.9, 1] leads to the most successful attack.

  • Targeted Attacks. Here the aim is to maximize the number of targeted tasks \(\mathbf{T}_{tar} \subseteq \mathbf{T}\) whose label is flipped. The optimization is \(\max_{\mathbf{C}'} \, \sum_{t_j \in \mathbf{T}_{tar}} \mathbbm{1} (\hat{z}^{a}_{t_j} \ne \hat{z}^{b}_{t_j}) + \lambda \sum_{w' \in \mathbf{W}'} (\hat{\alpha}_{w'} + \hat{\beta}_{w'})\). As for untargeted attacks, the second term is included only for confusion matrix based methods. (A generic sketch of the attack procedure follows this list.)
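
Since the exact optimization procedures in [30] are method specific, the following is a deliberately simplified greedy coordinate-ascent sketch of the white-box attack: each malicious answer is flipped in turn and the flip is kept only if the attack objective improves. The `infer` and `objective` callables are placeholders for the truth inference method under attack and for Eq. (1) or its targeted variant; this illustrates the idea rather than reproducing the algorithm of [30].

```python
import numpy as np

def greedy_poisoning(C, malicious_rows, infer, objective, sweeps=3):
    """Greedy coordinate-ascent sketch of the white-box optimization attack.

    C: full answer matrix including initial malicious answers (np.nan = unassigned).
    malicious_rows: indices of malicious workers whose answers may be changed.
    infer: callable C -> inferred labels (the truth inference method under attack).
    objective: callable (labels_after, C) -> attack objective value to maximize.
    """
    C = C.copy()
    best = objective(infer(C), C)
    for _ in range(sweeps):
        for i in malicious_rows:
            for j in np.where(~np.isnan(C[i]))[0]:
                C[i, j] = 1 - C[i, j]              # tentatively flip this answer
                score = objective(infer(C), C)
                if score > best:
                    best = score                   # keep the flip
                else:
                    C[i, j] = 1 - C[i, j]          # revert
    return C
```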

5.3 Metrics

We use the following metrics to assess the robustness of inference methods.

Accuracy. Accuracy is the fraction of correctly inferred tasks, formulated as \(\sum_{j=1}^{N} \mathbbm{1}(\hat{z}_{t_j} = z^*_{t_j}) / N\), where \(\hat{z}_{t_j}\) and \(z^*_{t_j}\) are the inferred and ground truth labels of task \(t_j\). A lower accuracy means a more successful attack. We note that [30] defined the attack success metric as the percentage of inferred labels flipped due to the attack. We believe this metric does not truly capture the attackers’ success, since some inferred labels may be wrong before the attack and flipped to correct by the attack, i.e., the adversaries inadvertently help the system correctly infer the label of an otherwise wrongly labeled task. Instead, we use the labels flipped w.r.t. the ground truth, i.e., accuracy, as the metric for attack success. We also report the F1 score to account for class skew in unbalanced datasets.

Accuracy_Targeted. \(Accuracy\_{Targeted}\) is the fraction of the targeted tasks \(\mathbf{T}_{tar}\) whose truth is inferred correctly, i.e., \(1 - \sum_{t_j \in \mathbf{T}_{tar}} \mathbbm{1}(\hat{z}_{t_j} \ne z^*_{t_j}) / |\mathbf{T}_{tar}|\).

Area Under Curve (AUC). Since the inference methods’ accuracy changes over parameters, e.g. percentage of malicious workers, we use AUC to compare the global performance of methods on an interval of parameter values, if feasible.

Recognizability. To assess the adversary detection ability of inference methods with explicit worker models, we define Recognizability as the similarity between the simulated (ground truth) worker reliability and the inferred reliability. A higher Recognizability means the method is better at detecting malicious workers. The worker behavior is modeled by a normal confusion matrix with \(\alpha \) and \(\beta \), a malicious one with \(\alpha '\) and \(\beta '\), and a disguise parameter \(\gamma \). We aggregate these into a single value \(r_{w'}\) showing the expected reliability of a worker and define Recognizability as \(1- \frac{1}{\left| \mathbf{W} '\right| }\sum _{w'\in \mathbf{W} '} \left| r_{w'} - \hat{r}_{w'} \right| \), where \(r_{w'}\) and \(\hat{r}_{w'}\) are the simulated and inferred reliability of malicious worker \(w'\) respectively.

$$\begin{aligned} r_{w'} = \frac{1}{\mid \mathbf{T}_{w'} \mid} \sum_{t_{j} \in \mathbf{T}_{w'}} \big( \alpha_{w'} \, \gamma + \alpha'_{w'} \, (1-\gamma) \big) \, \mathbbm{1}(z^*_{t_{j}}=1) + \big( \beta_{w'} \, \gamma + \beta'_{w'} \, (1-\gamma) \big) \, \mathbbm{1}(z^*_{t_{j}}=0) \end{aligned}$$
(2)
$$\begin{aligned} \hat{r}_{w'} = \frac{1}{\mid \mathbf{T}_{w'} \mid} \sum_{t_{j} \in \mathbf{T}_{w'}} \hat{\alpha}_{w'} \, \mathbbm{1}(z^*_{t_{j}}=1) + \hat{\beta}_{w'} \, \mathbbm{1}(z^*_{t_{j}}=0) \end{aligned}$$
(3)
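
The metrics above translate directly into code; the sketch below assumes the inferred labels, ground truth, targeted task indices, and the per-worker reliabilities \(r_{w'}\) and \(\hat{r}_{w'}\) (computed per Eqs. (2) and (3)) are already available as arrays.

```python
import numpy as np

def accuracy(z_hat, z_true):
    """Fraction of correctly inferred tasks."""
    return float(np.mean(np.asarray(z_hat) == np.asarray(z_true)))

def accuracy_targeted(z_hat, z_true, targeted):
    """Accuracy restricted to the targeted tasks."""
    idx = np.asarray(targeted)
    return float(np.mean(np.asarray(z_hat)[idx] == np.asarray(z_true)[idx]))

def recognizability(r_sim, r_inferred):
    """1 minus the mean absolute gap between simulated and inferred
    reliability of the malicious workers (Eqs. (2) and (3))."""
    return float(1 - np.mean(np.abs(np.asarray(r_sim) - np.asarray(r_inferred))))
```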

6 Evaluation Results

In this section, we report the robustness of various truth inference methods under heuristic based and optimization based attacks.

Fig. 2. Untargeted heuristic based attack: accuracy vs. % of malicious workers

6.1 Heuristic Based Attacks: Untargeted

Impact of Percentage of Malicious Workers. The number of normal workers is fixed and the percentage of added malicious workers varies from 0 to 60%. The adversary behavior setting is \(\gamma = 0\), \(\alpha' = 0\), and \(\beta' = 0\).

Fig. 3. Untargeted heuristic based attack: recognizability vs. % of malicious workers

Accuracy. Figure 2 shows the accuracy of the methods w.r.t. the percentage of malicious workers. We omit the results for the synthetic dataset, which show similar trends. Increasing the number of malicious workers lowers the accuracy of all methods. The drop for the direct computation methods (MV, MV-Soft) and the neural network method (LAA-S) is almost linear early on, while the PGM methods (D&S, BCC, and KOS) and PM are more resistant, especially with few adversaries, and drop to 0 once the percentage goes beyond 40% to 50%. Overall, D&S and LAA-S are the most resilient to this attack. Comparing the datasets, the Product dataset is more susceptible to the attack due to its low redundancy. Table 4 shows the AUC of the different methods over the interval [0–30%] of malicious workers as a realistic attack setting, and [0–50%] for more aggressive attacks. The results for Product confirm the advantage of D&S and LAA-S in overall performance. While most methods outperform MV, there is no clear winner on the PosSent dataset.

Table 4. AUC of methods’ accuracy over specified % of malicious workers interval: untargeted heuristic based attack

Recognizability. To show adversary detectability in inference methods, we report recognizability of methods with explicit worker modeling, i.e. D&S, BCC, PM, and KOS. We exclude MV and LAA-S as they do not explicitly model reliability. MV-Hard and MV-Soft are excluded too since these methods only remove the least credible worker without explicitly computed reliability.

Figure 3 shows the methods’ recognizability w.r.t. varying percentages of malicious workers. D&S and KOS perform better than BCC in adversary detection, while PM performs the worst. This explains the robustness of the accuracy of D&S and KOS observed earlier. Comparing Figs. 2 and 3, the accuracy and recognizability of D&S and KOS decrease together as the percentage of malicious workers increases, i.e., worker modeling with good detection is key to a robust inference algorithm under attack.

Fig. 4. Untargeted heuristic based attack: accuracy vs. redundancy and engagement (Synthetic)

Impact of Redundancy and Engagement. Redundancy is the mean number of workers assigned per task, while worker engagement is the mean number of tasks per worker. Figure 4 shows the accuracy w.r.t. varying redundancy and engagement. As expected, with increased redundancy it is harder for adversaries to reduce the accuracy (the percentage of attackers is set to 20%). D&S and the MV-based models were more sensitive to redundancy in the sparse dataset (Product). Worker engagement had no significant impact, since it does not directly impact worker reliability.

Fig. 5. Untargeted heuristic based attack: F1 score vs. class skewness

Impact of Class Skewness. We show the effect of skewness on the F1-score in the synthetic data. Figure 5 shows the F1-score w.r.t. varying ratios of the majority class. MV-based methods are vulnerable to class imbalance while the others are robust.

Impact of Disguise (\(\gamma\)). We show the trend of accuracy and recognizability w.r.t. disguise. When in disguise, the adversaries’ behavior is governed by \(\alpha = 1\) and \(\beta = 1\), compared to \(\alpha' = 0\) and \(\beta' = 0\) in pure malicious mode.

Fig. 6. Untargeted heuristic based attack: accuracy vs. disguise

Accuracy. Figure 6 shows the accuracy of the methods w.r.t. varying disguise levels. We use a different scale for each dataset’s y-axis to highlight the trends. Since increasing disguise beyond 0.7 resulted in monotonically increasing accuracy for all models, we terminate at 0.7. For methods with no inherent attacker recognition (e.g., MV), disguising only boosts the accuracy. For more robust methods, as we increase \(\gamma\) slightly, the algorithms fail to identify adversaries, leading to the success of the attack. However, as disguise increases further, the accuracy goes back up due to the correct answers provided by the disguised malicious workers. Hence there is an optimal level of disguise for attackers.

Fig. 7. Untargeted heuristic based attack: recognizability vs. disguise (\(\gamma\))

Recognizability. Figure 7 shows the recognizability of the methods w.r.t. varying disguise levels on the real datasets. The confusion matrix based models, BCC and D&S, are more robust to disguise and model the workers accurately regardless of the adversaries’ disguise. KOS uses a single reliability value and is thus more sensitive to disguise. Generally, the recognizability of adversaries drops as disguise increases, since their behavior more closely resembles that of normal workers.

Impact of Malicious Worker Confusion Matrix. We also evaluated the impact of varying the malicious worker behavior parameters (\(\alpha'\), \(\beta'\)). When varying \(\alpha'\), \(\beta'\) is 0, and vice versa. Increasing the adversaries’ reliability parameters has an effect very similar to increasing disguise. We omit the figures due to space limitations. Simple methods, e.g., MV-based, show increasing accuracy due to the adversaries’ true answers. For more sophisticated methods, e.g., PGM based, the attack is most successful at an optimal parameter value.

6.2 Heuristic Based Attacks: Targeted

We report the evaluation results for targeted attacks as outlined in Sect. 5.2. We focus on the parameters relevant to targeted attacks: the percentage of malicious workers and the proportion of targeted tasks.

Impact of Percentage of Malicious Workers. Figures 8 and 9 show the accuracy of the methods w.r.t. varying percentages of malicious workers on the real datasets. We fixed the fraction of targeted tasks for the Product and PosSent datasets to 0.2 and 0.1, respectively, to ensure that: 1) the targeted attack is impactful, and 2) there is an observable difference among the methods’ performance. The general trend is that as the number of attackers increases, the overall accuracy of the system increases thanks to the attackers’ truthful contributions to non-targeted tasks. However, the accuracy on the targeted tasks is decreased by the attackers.

Fig. 8. Targeted heuristic based attack: accuracy & accuracy_targeted vs. % of malicious workers (Product dataset)

Fig. 9. Targeted heuristic based attack: accuracy & accuracy_targeted vs. % of malicious workers (PosSent dataset)

Surprisingly, D&S and BCC, which are more robust against untargeted attacks, are more susceptible to targeted attacks. While they maintain high overall accuracy, their accuracy for the targeted tasks suffers the most due to their failure to differentiate targeted and non-targeted tasks when modeling workers’ behavior (i.e., they are misled by the malicious workers’ true answers to the non-targeted tasks). On the other hand, LAA-S is significantly more robust against targeted attacks, even though its overall accuracy is not as high as that of other methods, which is explained by the absence of explicit worker modeling. While MV-Soft performs worse than others under untargeted attacks (Fig. 2), it is, alongside LAA-S, the most resilient to targeted attacks. MV-Soft’s resilience is due to its accurate detection and penalization of malicious workers when they are the majority contributing to a task with conflicting answers, which happens more frequently when the attack focuses on a limited set of tasks rather than all tasks.

Impact of Proportion of Targeted Tasks. Figures 10 and 11 show the accuracy of the methods w.r.t. varying percentages of targeted tasks on the real datasets. As the ratio of targeted tasks increases, the overall accuracy decreases. However, when the ratio gets sufficiently large, accuracy increases for D&S and BCC: since the malicious workers have to dilute their effort among a larger set of targeted tasks, they become more discoverable and less effective.

Fig. 10. Targeted heuristic based attack: accuracy vs. ratio of targeted tasks (Product dataset)

Fig. 11. Targeted heuristic based attack: accuracy vs. ratio of targeted tasks (PosSent dataset)

Comparing the datasets, answer redundancy inversely affects the targeted attack’s success, as it does for untargeted attacks. Given the Product dataset’s lower redundancy, this attack remains successful even with a high ratio of targeted tasks.

6.3 Optimization Based Attacks: Untargeted

Impact of Percentage of Malicious Workers. We evaluate the optimization based attack in the white-box setting, i.e., given full knowledge of the inference method used and all other workers’ answers. Figure 12 shows the accuracy of the inference methods w.r.t. varying percentages of malicious workers.

Fig. 12. Untargeted optimization based attack: accuracy vs. % of malicious workers

All methods’ accuracy drops fairly quickly as the percentage of malicious workers increases. Comparing the heuristic and optimization based attacks (Fig. 2a vs. Fig. 12a for the Product dataset, Fig. 2b vs. Fig. 12b for the PosSent dataset), the accuracy under the optimization based attack drops to zero at a much lower percentage of malicious workers for all methods. The attackers are indeed more successful when using the optimized scheme given their stronger adversarial knowledge. Comparing the different methods, all perform similarly and are susceptible to the attack, since it is optimized for the particular inference method being used. However, LAA-S has a slight edge over the others.

Fig. 13. Targeted optimization based attack: accuracy_targeted vs. % of malicious workers on Product and PosSent datasets

6.4 Optimization Based Attacks: Targeted

Impact of Percentage of Malicious Workers. We set the ratio of targeted tasks to 0.01 and 0.005 for the Product and PosSent datasets, respectively. Since the optimization based attack is more successful, a lower ratio of targeted tasks is chosen than in Sect. 6.2. Figure 13 shows the accuracy_targeted of the inference methods w.r.t. varying percentages of malicious workers on the real datasets. Overall accuracy is not shown due to space limitations. \(Accuracy\_Targeted\) decreases as the percentage of malicious workers increases. Comparing these results to the heuristic based targeted attack (Figs. 8 and 9), the heuristic based attack is more effective in reducing accuracy at a small percentage of malicious workers. However, with a greater percentage of malicious workers, the targeted optimization based attack is more successful. One probable reason is that since only a subset of tasks is targeted, the chance of adversary detection is lower than in the untargeted setting. Thus, the optimization based attack, which trades some accuracy drop for lower detection, loses its edge in attack power over the heuristic attack, which focuses solely on reducing accuracy.

7 Conclusion and Discussion

We summarize our key findings on the performance of leading inference methods under various data poisoning attacks.

Comparison of Methods. Figure 14 shows the overall attack susceptibility of the different inference methods along two dimensions (untargeted attacks and targeted attacks) under heuristic and optimization based attacks. A more robust method should have a lower susceptibility along both dimensions; the most robust methods are those dominating the others (less vulnerable) in both dimensions, i.e., the Pareto optimal methods or skyline. Table 5 shows the top 2 performing methods for each category of attacks, and we discuss the main findings below. Susceptibility is defined as 1 - AUC, where the AUC is computed over the accuracy curve on an interval of the fraction of attackers; since in reality malicious workers are not the majority, we choose the interval [0–0.5].
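
As a concrete illustration of how susceptibility is computed, the sketch below applies a trapezoidal AUC to a hypothetical accuracy curve over the [0, 0.5] interval of malicious worker fractions (the accuracy values are made up for illustration).

```python
import numpy as np

# Hypothetical accuracy measurements at increasing fractions of malicious workers.
fractions = np.array([0.0, 0.1, 0.2, 0.3, 0.4, 0.5])
acc = np.array([0.95, 0.93, 0.88, 0.80, 0.65, 0.40])

# Trapezoidal AUC of the accuracy curve, normalized by the interval length
# so that it stays in [0, 1].
auc = np.sum((acc[1:] + acc[:-1]) / 2 * np.diff(fractions))
auc /= fractions[-1] - fractions[0]
susceptibility = 1 - auc
print(round(float(susceptibility), 3))
```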

Fig. 14. Susceptibility of different inference methods

Table 5. Top 2 robust methods under different attacks

  • Among direct computation methods, MV-Soft is more robust than MV, i.e. dominates MV, for all attacks, thanks to its modeling of worker reliability.

  • Among optimization based methods, MV-Hard is more robust than PM only under targeted heuristic based attack. This can be attributed to its optimal matching algorithm that penalizes or removes malicious workers when they become the majority among the contributing workers for conflicting tasks, which happens when they all give wrong answers to a target set of tasks. On the other hand, PM is more robust than MV-Hard under other attacks.

  • Among PGM based methods, for untargeted attacks, D&S dominates BCC under the heuristic based attack. However, under the optimization based attack, BCC dominates D&S. Note that in the heuristic attack the responses are designed for each individual worker, while in the optimization based attack the collective effect of the malicious workers is considered. Therefore, the correlation modeling in BCC can counter the possible collusion of malicious workers, which is reflected in BCC’s more resilient behavior under the optimization based attack. Overall, D&S is the most resilient if only the untargeted heuristic based attack is anticipated; however, it is vulnerable to all other attacks. Between KOS and BCC, KOS has a slight edge considering all four attacks.

  • The neural network based method LAA-S dominates all others in all attacks except having a slightly higher susceptibility than D&S in untargeted heuristic based attack in the low redundancy dataset (Product). We suspect this is due to the sparsity of the task representation vector in contrast to the large number of network parameters combined with the lower quality of malicious answers in heuristic based attack. The superior performance in all other cases can be attributed to its network of parameters, which can be considered as an implicit and sophisticated (non-linear) model of worker reliability.

  • Comparing different techniques, besides the best performing neural networks, PGM based methods are generally more robust than optimization based methods and direct computation methods. This is also consistent with the findings in [56] under normal settings with varying worker reliability.

  • Comparing different worker models, the confusion matrix based methods generally outperform those with a single reliability model and MV method with no worker model. Neural network based method LAA-S, even though with no explicit worker model, achieves the best performance thanks to its network of parameters, which can be considered as an implicit and more sophisticated (non-linear) model of the reliability associated with each worker.

Comparison of Attacks. From the attack point of view, optimization based attack is more effective compared to heuristic based attack, especially for untargeted attacks. This is not surprising given the adversarial knowledge and optimization based attack strategy. However, optimization based attack can only be carried out in white-box settings given full adversarial knowledge, and may not be a realistic threat to crowdsourcing applications.

For targeted attacks, at a lower percentage of adversaries, the heuristic based attack can be more successful than the optimization based attack. The optimization based attack trades some accuracy drop in exchange for lower adversary detection, hence losing its edge in attack power over the heuristic based attack, which optimizes solely for accuracy drop. However, under the heuristic based attack there is an optimal level of disguise and percentage of targeted tasks for the attackers, and these may not be easily identifiable as they vary substantially across inference methods and settings.

Comparison of Datasets and Other Factors. The comparison between the datasets reveals that a crowdsourcing system with a higher redundancy of answers is generally more robust. It remains an interesting question for a crowdsourcing system provider to find the best trade-off between redundancy and the overall platform cost to ensure the resiliency of the system. However, worker engagement does not have a major impact on the robustness of the system.

Future Works. The goal of our evaluation is to understand the resilience of existing leading-edge truth inference methods and ultimately build more robust systems. Towards this end, several directions for future work can be explored.

  • Attack resistant truth inference: while existing methods provide a certain level of resistance to data poisoning attacks, it remains an open question whether we can build more robust systems, e.g., by designing hybrid methods that combine the strengths of existing techniques (e.g., LAA-S against the optimization based attack and the targeted heuristic based attack, and D&S against the untargeted heuristic based attack) or by designing entirely new techniques. A more fundamental question is whether we can provide quantifiable guarantees for the robustness of the system against data poisoning attacks.

  • Semi-supervised learning: All the truth inference approaches evaluated in this paper assume no access to the ground truth. This unsupervised nature makes the methods inherently susceptible to attacks. A semi-supervised approach [3, 42], in which ground truth and workers’ answers for historical tasks are used to train the system, may provide resilience to attacks.

  • Dynamic behavior of workers: Our study is focused on a fixed set of workers and tasks. In practice, the workers could join and leave the system as they wish. They could also alter their behaviors dynamically in a strategic way to maximize the attack effect. Understanding the impact of such behaviors and building robust systems in a dynamic setting is also an important direction.