Abstract
Recent years have seen increased interest in crowdsourcing as a way of obtaining information from a potentially large group of workers at a reduced cost. The crowdsourcing process, as we consider it in this paper, is as follows: a requester hires a number of workers to work on a set of similar tasks. After completing the tasks, each worker reports back his or her outputs. The requester then aggregates the reported outputs to obtain aggregate outputs. A crucial question that arises during this process is: how many crowd workers should a requester hire? In this paper, we investigate, from an empirical perspective, the optimal number of workers a requester should hire when crowdsourcing tasks, with a particular focus on the crowdsourcing platform Amazon Mechanical Turk. Specifically, we report the results of three studies involving different tasks and payment schemes. We find that both the expected error in the aggregate outputs and the risk of a poor combination of workers decrease as the number of workers increases. Surprisingly, we find that the optimal number of workers a requester should hire for each task is around 10 to 11, regardless of the underlying task and payment scheme. To derive this result, we employ a principled analysis based on bootstrapping and segmented linear regression. We also find that top-performing workers are more consistent across multiple tasks than other workers. Our results thus contribute to a better understanding of, and provide new insights into, how to design more effective crowdsourcing processes.
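The abstract describes the analysis only at a high level. The sketch below illustrates, under stated assumptions, how a bootstrapping-plus-segmented-regression procedure of this kind can be set up: it resamples subsets of workers to estimate the expected aggregation error as a function of crowd size, then fits a two-segment linear model to locate the point where hiring additional workers stops paying off. The function and variable names (`bootstrap_error_curve`, `best_breakpoint`, `reports`, `truth`), the averaging aggregator, and the mean-absolute-error measure are illustrative assumptions, not the paper's exact procedure.

```python
import numpy as np

def bootstrap_error_curve(reports, truth, max_k=30, n_boot=1000, seed=0):
    """Expected aggregation error as a function of crowd size,
    estimated by bootstrapping over subsets of workers.

    reports: (n_workers, n_tasks) array of individual worker outputs
    truth:   (n_tasks,) array of reference answers
    """
    rng = np.random.default_rng(seed)
    n_workers = reports.shape[0]
    mean_err = np.empty(max_k)
    for k in range(1, max_k + 1):
        errs = np.empty(n_boot)
        for b in range(n_boot):
            crowd = rng.choice(n_workers, size=k, replace=True)  # resample k workers
            aggregate = reports[crowd].mean(axis=0)              # simple averaging aggregator
            errs[b] = np.mean(np.abs(aggregate - truth))         # mean absolute error
        mean_err[k - 1] = errs.mean()
    return mean_err

def best_breakpoint(err_curve):
    """Fit a two-segment linear model to the error curve and return the
    crowd size at which the slope changes (minimum total squared error)."""
    x = np.arange(1, len(err_curve) + 1, dtype=float)
    best_k, best_sse = None, np.inf
    for k in range(2, len(err_curve) - 1):            # candidate breakpoints
        sse = 0.0
        for xs, ys in ((x[:k], err_curve[:k]), (x[k:], err_curve[k:])):
            line = np.polyfit(xs, ys, 1)              # least-squares line per segment
            sse += np.sum((np.polyval(line, xs) - ys) ** 2)
        if sse < best_sse:
            best_k, best_sse = k, sse
    return best_k
```

On data shaped like that of the studies reported here, the estimated error curve keeps decreasing as more workers are added, but the fitted breakpoint marks where the decrease flattens; this is the sense in which the abstract reports an optimal crowd size of roughly 10 to 11 workers.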
Rights and permissions
Open Access This article is distributed under the terms of the Creative Commons Attribution 4.0 International License (http://creativecommons.org/licenses/by/4.0/), which permits unrestricted use, distribution, and reproduction in any medium, provided you give appropriate credit to the original author(s) and the source, provide a link to the Creative Commons license, and indicate if changes were made.
About this article
Cite this article
Carvalho, A., Dimitrov, S. & Larson, K. How many crowdsourced workers should a requester hire? Ann Math Artif Intell 78, 45–72 (2016). https://doi.org/10.1007/s10472-015-9492-4