Skip to main content

Supervised Classifiers to Identify Hate Speech on English and Spanish Tweets

  • Conference paper
  • First Online:
Digital Libraries at the Crossroads of Digital Information for the Future (ICADL 2019)

Part of the book series: Lecture Notes in Computer Science ((LNISA,volume 11853))

Included in the following conference series:

Abstract

Consistently with social and political concern about hatred and harassment through social media, in recent years, automatic hate-speech detection and offensive behavior in social media are gaining a lot of attention. In this paper, we examine the performance of several supervised classifiers in the process of identifying hate speech on Twitter. More precisely, we do an empirical study that analyzes the influence of two types of linguistic features (n-grams, word embeddings) when they are used to feed different supervised machine learning classifiers: Support Vector Machine (SVM), Gaussian Naive Bayes (GNB), Complement Naive Bayes (CNB), Decision Tree (DT), Nearest Neighbors (KN), Random Forest (RF) and Neural Network (NN). The experiments we have carried out show that CNB, SVM, and RF are better than the rest classifiers in English and Spanish languages by taking into account all features.

This is a preview of subscription content, log in via an institution to check access.

Access this chapter

Chapter
USD 29.95
Price excludes VAT (USA)
  • Available as PDF
  • Read on any device
  • Instant download
  • Own it forever
eBook
USD 39.99
Price excludes VAT (USA)
  • Available as EPUB and PDF
  • Read on any device
  • Instant download
  • Own it forever
Softcover Book
USD 54.99
Price excludes VAT (USA)
  • Compact, lightweight edition
  • Dispatched in 3 to 5 business days
  • Free shipping worldwide - see info

Tax calculation will be finalised at checkout

Purchases are for personal use only

Institutional subscriptions

Notes

  1. 1.

    https://radimrehurek.com/gensim/.

  2. 2.

    http://scikit-learn.org/stable/.

References

  1. Almatarneh, S., Gamallo, P.: Comparing supervised machine learning strategies and linguistic features to search for very negative opinions. Information 10(1), 16 (2019). http://www.mdpi.com/2078-2489/10/1/16

    Article  Google Scholar 

  2. Almatarneh, S., Gamallo, P., Pena, F.J.R.: CiTIUS-COLE at semeval - 2019 task 5: combining linguistic features to identify hate speech against immigrants and women on multilingual tweets. In: The 13th International Workshop on Semantic Evaluation (2019)

    Google Scholar 

  3. Basile, V., et al.: Semeval-2019 task 5: Multilingual detection of hate speech against immigrants and women in twitter. In: Proceedings of the 13th International Workshop on Semantic Evaluation, pp. 54–63 (2019)

    Google Scholar 

  4. Breiman, L.: Random forests. Mach. Learn. 45(1), 5–32 (2001)

    Article  Google Scholar 

  5. Burnap, P., Williams, M.L.: Cyber hate speech on twitter: an application of machine classification and statistical modeling for policy and decision making. Policy Internet 7(2), 223–242 (2015)

    Article  Google Scholar 

  6. Burnap, P., Williams, M.L.: Hate speech, machine classification and statistical modelling of information flows on twitter: interpretation and communication for policy decision making (2014)

    Google Scholar 

  7. Chen, Y., Zhou, Y., Zhu, S., Xu, H.: Detecting offensive language in social media to protect adolescent online safety. In: 2012 International Conference on Privacy, Security, Risk and Trust and 2012 International Conference on Social Computing, pp. 71–80. IEEE (2012)

    Google Scholar 

  8. Collobert, R., Weston, J., Bottou, L., Karlen, M., Kavukcuoglu, K., Kuksa, P.: Natural language processing (almost) from scratch. J. Mach. Learn. Res. 12(Aug), 2493–2537 (2011)

    MATH  Google Scholar 

  9. Dai, A.M., Olah, C., Le, Q.V.: Document embedding with paragraph vectors. arXiv preprint arXiv:1507.07998 (2015)

  10. Fortuna, P., Nunes, S.: A survey on automatic detection of hate speech in text. ACM Comput. Surv. (CSUR) 51(4), 85 (2018)

    Article  Google Scholar 

  11. Gaydhani, A., Doma, V., Kendre, S., Bhagwat, L.: Detecting hate speech and offensive language on twitter using machine learning: An n-gram and tfidf based approach. arXiv preprint arXiv:1809.08651 (2018)

  12. Gitari, N.D., Zuping, Z., Damien, H., Long, J.: A lexicon-based approach for hate speech detection. Int. J. Multimed. Ubiquit. Eng. 10(4), 215–230 (2015)

    Article  Google Scholar 

  13. Greevy, E., Smeaton, A.F.: Classifying racist texts using a support vector machine. In: Proceedings of the 27th Annual International ACM SIGIR Conference on Research and Development in Information Retrieval, pp. 468–469. ACM (2004)

    Google Scholar 

  14. Kwok, I., Wang, Y.: Locate the hate: detecting tweets against blacks. In: Twenty-seventh AAAI Conference on Artificial Intelligence (2013)

    Google Scholar 

  15. Le, Q., Mikolov, T.: Distributed representations of sentences and documents. In: International Conference on Machine Learning, pp. 1188–1196 (2014)

    Google Scholar 

  16. Mikolov, T., Chen, K., Corrado, G., Dean, J.: Efficient estimation of word representations in vector space. arXiv preprint arXiv:1301.3781 (2013)

  17. Mikolov, T., Sutskever, I., Chen, K., Corrado, G.S., Dean, J.: Distributed representations of words and phrases and their compositionality. In: Advances in Neural Information Processing Systems, pp. 3111–3119 (2013)

    Google Scholar 

  18. Nobata, C., Tetreault, J., Thomas, A., Mehdad, Y., Chang, Y.: Abusive language detection in online user content. In: Proceedings of the 25th International Conference on World Wide Web, International World Wide Web Conferences Steering Committee, pp. 145–153 (2016)

    Google Scholar 

  19. Pedregosa, F., et al.: Scikit-learn: machine learning in python. J. Mac. Learn. Res. 12(Oct), 2825–2830 (2011)

    MathSciNet  MATH  Google Scholar 

  20. Rennie, J.D., Shih, L., Teevan, J., Karger, D.R.: Tackling the poor assumptions of naive Bayes text classifiers. In: Proceedings of the 20th International Conference on Machine Learning (ICML-2003), pp. 616–623 (2003)

    Google Scholar 

  21. Tulkens, S., Hilte, L., Lodewyckx, E., Verhoeven, B., Daelemans, W.: A dictionary-based approach to racism detection in dutch social media. arXiv preprint arXiv:1608.08738 (2016)

  22. Unsvåg, E.F., Gambäck, B.: The effects of user features on twitter hate speech detection. In: Proceedings of the 2nd Workshop on Abusive Language Online (ALW2), pp. 75–85 (2018)

    Google Scholar 

Download references

Acknowledgments

Research partially funded by the Spanish Ministry of Economy and Competitiveness through projects \(TIN2017-85160-C2-2-R\), and by the Galician Regional Government under projects ED431C 2018/50.

Author information

Authors and Affiliations

Authors

Corresponding author

Correspondence to Sattam Almatarneh .

Editor information

Editors and Affiliations

Rights and permissions

Reprints and permissions

Copyright information

© 2019 Springer Nature Switzerland AG

About this paper

Check for updates. Verify currency and authenticity via CrossMark

Cite this paper

Almatarneh, S., Gamallo, P., Pena, F.J.R., Alexeev, A. (2019). Supervised Classifiers to Identify Hate Speech on English and Spanish Tweets. In: Jatowt, A., Maeda, A., Syn, S. (eds) Digital Libraries at the Crossroads of Digital Information for the Future. ICADL 2019. Lecture Notes in Computer Science(), vol 11853. Springer, Cham. https://doi.org/10.1007/978-3-030-34058-2_3

Download citation

  • DOI: https://doi.org/10.1007/978-3-030-34058-2_3

  • Published:

  • Publisher Name: Springer, Cham

  • Print ISBN: 978-3-030-34057-5

  • Online ISBN: 978-3-030-34058-2

  • eBook Packages: Computer ScienceComputer Science (R0)

Publish with us

Policies and ethics