Supervised Classifiers to Identify Hate Speech on English and Spanish Tweets

Almatarneh, Sattam; Gamallo, Pablo; Pena, Francisco J. Ribadas; Alexeev, Alexey

doi:10.1007/978-3-030-34058-2_3

Part of the book series: Lecture Notes in Computer Science (LNISA, volume 11853)

Included in the following conference series:

International Conference on Asian Digital Libraries


Abstract

In line with growing social and political concern about hatred and harassment on social media, the automatic detection of hate speech and offensive behavior has attracted considerable attention in recent years. In this paper, we examine the performance of several supervised classifiers at identifying hate speech on Twitter. More precisely, we carry out an empirical study of how two types of linguistic features (n-grams and word embeddings) affect performance when they are fed to different supervised machine learning classifiers: Support Vector Machine (SVM), Gaussian Naive Bayes (GNB), Complement Naive Bayes (CNB), Decision Tree (DT), Nearest Neighbors (KN), Random Forest (RF) and Neural Network (NN). Our experiments show that, when all features are taken into account, CNB, SVM, and RF outperform the other classifiers in both English and Spanish.
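To make the idea of feeding n-gram features to one of these classifiers concrete, the following is a minimal sketch and not the authors' pipeline. It uses only the Python standard library, extracts word unigrams and bigrams, and trains a small Complement Naive Bayes model in the spirit of Rennie et al. (2003), without the weight normalization step. The names `word_ngrams` and `TinyComplementNB`, the labels `hate` and `other`, and the four training sentences are all assumptions invented for illustration; they are not taken from the paper's data.

```python
import math
from collections import Counter


def word_ngrams(text, n_max=2):
    """Return the lowercase word n-grams (n = 1..n_max) of a text."""
    tokens = text.lower().split()
    return [
        " ".join(tokens[i:i + n])
        for n in range(1, n_max + 1)
        for i in range(len(tokens) - n + 1)
    ]


class TinyComplementNB:
    """Illustrative Complement Naive Bayes over n-gram counts (hypothetical helper)."""

    def __init__(self, alpha=1.0):
        self.alpha = alpha  # additive (Laplace) smoothing

    def fit(self, texts, labels):
        self.classes = sorted(set(labels))
        per_class = {c: Counter() for c in self.classes}
        for text, label in zip(texts, labels):
            per_class[label].update(word_ngrams(text))
        total = Counter()
        for counts in per_class.values():
            total.update(counts)
        self.vocab = set(total)
        self.weights = {}
        for c in self.classes:
            # Feature counts from every class except c (the complement of c).
            comp = {g: total[g] - per_class[c][g] for g in self.vocab}
            denom = sum(comp.values()) + self.alpha * len(self.vocab)
            self.weights[c] = {
                g: math.log((n + self.alpha) / denom) for g, n in comp.items()
            }
        return self

    def predict(self, text):
        # Out-of-vocabulary n-grams are ignored.
        counts = Counter(g for g in word_ngrams(text) if g in self.vocab)

        def score(c):
            return sum(n * self.weights[c][g] for g, n in counts.items())

        # The predicted class is the one whose complement fits the text worst.
        return min(self.classes, key=score)


# Invented toy data, for illustration only.
train_texts = [
    "those people are awful and should leave",
    "i cannot stand those people",
    "what a lovely sunny day",
    "i enjoyed the concert last night",
]
train_labels = ["hate", "hate", "other", "other"]

model = TinyComplementNB().fit(train_texts, train_labels)
print(model.predict("those people should leave"))
print(model.predict("a lovely concert"))
```

A realistic experiment would replace the toy sentences with an annotated tweet corpus, add proper tokenization, and compare several classifiers under cross-validation; the sketch only shows how n-gram counts become the input of a supervised model.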

Acknowledgments

Research partially funded by the Spanish Ministry of Economy and Competitiveness through project TIN2017-85160-C2-2-R, and by the Galician Regional Government under project ED431C 2018/50.

Author information

Authors and Affiliations

Centro Singular de Investigación en Tecnoloxías Intelixentes (CiTIUS), Universidad de Santiago de Compostela, Santiago, Spain
Sattam Almatarneh & Pablo Gamallo
Computer Science Department, University of Vigo, Escola Superior de Enxeñaría Informática, Campus As Lagoas, 32004, Ourense, Spain
Sattam Almatarneh & Francisco J. Ribadas Pena
ITMO University, Saint-Petersburg, Russia
Alexey Alexeev

Corresponding author

Correspondence to Sattam Almatarneh.

Editor information

Editors and Affiliations

Kyoto University, Kyoto, Japan
Adam Jatowt
Ritsumeikan University, Kusatsu, Japan
Akira Maeda
The Catholic University of America, Washington, DC, USA
Sue Yeon Syn

About this paper

Cite this paper

Almatarneh, S., Gamallo, P., Pena, F.J.R., Alexeev, A. (2019). Supervised Classifiers to Identify Hate Speech on English and Spanish Tweets. In: Jatowt, A., Maeda, A., Syn, S. (eds) Digital Libraries at the Crossroads of Digital Information for the Future. ICADL 2019. Lecture Notes in Computer Science, vol 11853. Springer, Cham. https://doi.org/10.1007/978-3-030-34058-2_3

DOI: https://doi.org/10.1007/978-3-030-34058-2_3
Published: 29 October 2019
Publisher Name: Springer, Cham
Print ISBN: 978-3-030-34057-5
Online ISBN: 978-3-030-34058-2
eBook Packages: Computer Science, Computer Science (R0)
