1 Introduction

Sentiment analysis is the classification task of mining sentiments from natural language, which finds use in numerous applications such as reputation management, customer support, and moderating content in social media (Wilson et al. 2005; Agarwal et al. 2011; Thavareesan and Mahesan 2019, 2020a). Sentiment analysis has helped industry to compile a summary of human perspectives and interests derived from feedback or even just the polarity of comments (Pang and Lee 2004; Thavareesan and Mahesan 2020b). Offensive language identification is another classification task in natural language processing (NLP), where the aim is to moderate and minimise offensive content in social media. In recent years, sentiment analysis and offensive language identification have gained significant interest in the field of NLP.

Social media websites and product review forums provide opportunities for users to create content in informal settings. Moreover, to improve user experience, these platforms let users communicate their opinions in whatever way they feel comfortable, either using their native language or switching between one or more languages in the same conversation (Vyas et al. 2014). However, most NLP systems are trained on languages in formal settings with proper grammar, which creates issues when it comes to analysing “user generated” comments (Chanda et al. 2016; Pratapa et al. 2018). Further, most developments in sentiment analysis and offensive language identification systems are performed on monolingual data for high-resource languages, while user-generated content in under-resourced settings is often mixed with English or other high-resource languages (Winata et al. 2019; Jose et al. 2020).

Code-mixing or code-switching is the alternation between two or more languages at the level of the document, paragraph, comment, sentence, phrase, word or morpheme. It is a distinctive aspect of conversation or dialogue in bilingual and multilingual societies (Barman et al. 2014) and is motivated by structural, discourse, pragmatic and socio-linguistic reasons (Sridhar 1978). Code-mixing is a common phenomenon in all kinds of communication among multilingual speakers, in both speech and text-based interactions, and refers to the way a bilingual or multilingual speaker switches between languages within an utterance. Most social media comments are code-mixed, while the resources created for sentiment analysis and offensive language identification are primarily available for monolingual texts, and the vast majority of language pairs are under-resourced with regard to code-mixing tasks (Bali et al. 2014; Jose et al. 2020).

In this paper, we describe the creation of a corpus for Dravidian languages in the context of sentiment analysis and offensive language detection tasks. Dravidian languages are spoken mainly in the south of India (Chakravarthi et al. 2020c). The four major literary languages belonging to the language family are Tamil (ISO 639-3: tam), Telugu (ISO 639-3: tel), Malayalam (ISO 639-3: mal), and Kannada (ISO 639-3: kan). Tamil, Malayalam and Kannada fall under the South Dravidian subgroup while Telugu belongs to the South Central Dravidian subgroup (Vikram and Urs 2007). Each of the four languages has official status as one of the 22 scheduled languages recognised by the Government of India. Tamil also has official status in Sri Lanka and Singapore (Thamburaj and Rengganathan 2015). Although the languages are widely spoken by millions of people, the tools and resources available for building robust NLP applications are under-developed for these languages.

Dravidian languages are highly agglutinative and each language uses its own script (Krishnamurti 2003; Sakuntharaj and Mahesan 2016, 2017). The writing systems of Malayalam and Kannada are phonemic abugidas written from left to right. The Dravidian scripts are first attested around 580 BCE in the Tamili script inscribed on the pottery of Keezhadi in the Sivagangai and Madurai districts of Tamil Nadu, India, excavated by the Tamil Nadu State Department of Archaeology and the Archaeological Survey of India (Sivanantham and Seran 2019). Historically, the Tamil writing system has its origin in the Tamili script, which was neither a pure abugida, abjad nor alphabet. The Tamili writing system is described in the old grammar text Tolkappiyam, variously dated between the 9th and 6th centuries BCE (Pillai 1904; Swamy 1975; Zvelebil 1991; Takahashi 1995), and in the Jain works Samavayanga Sutta and Pannavana Sutta, which date to the 3rd-4th century BCE (Salomon 1998). At different points in history, Tamil was written using the Tamili, Vattezhuthu, Chola, Pallava and Chola-Pallava scripts. The modern Tamil script descended from the Chola-Pallava script, which became the norm in the northern part of the Tamil country around the 8th century CE (Mahadevan 2003). The Malayalam script developed from old Vatteluttu with additional letters from the Grantha script to write loan words (Thottingal 2019). The scripts of Kannada and Telugu had their origins in the Bhattiprolu script, a southern variety of the Brahmi script, from which evolved an early form of the Kannada script called the Kadamba script (Gai 1996) that in turn gave rise to the Telugu and Kannada scripts. Although these languages have their own scripts, social media users often use the Latin script for typing in them because of its ease of use and accessibility on handheld devices and computers (Thamburaj et al. 2015).

Monolingual datasets are available for Indian languages for various research aims (Agrawal et al. 2018; Thenmozhi and Aravindan 2018; Kumar et al. 2020). However, there have been few attempts to generate datasets for Tamil, Kannada and Malayalam code-mixed text (Chakravarthi et al. 2020b, c; Chakravarthi 2020; Chakravarthi and Muralidaran 2021). We believe it is essential to come up with approaches to tackle this resource bottleneck so that these languages can be equipped with NLP support in social media in a way that is both cost-effective and rapid. To create resources for a Tamil-English, Kannada-English and Malayalam-English code-mixed scenario, we collected comments on various Tamil, Kannada and Malayalam movie trailers from YouTube.

The contributions of this paper are:

  1.

    We present the dataset for three Dravidian languages, namely Tamil, Kannada, and Malayalam, for sentiment analysis and offensive language identification tasks.

  2.

    The dataset contains all types of code-mixing. This is the first Dravidian language dataset to contain all types of code-mixing, including mixtures of these scripts and the Latin script. The dataset consists of around 44,000 comments in Tamil-English, around 7000 comments in Kannada-English, and around 20,000 comments in Malayalam-English.

  3.

    We provide an experimental analysis of logistic regression, naive Bayes, decision tree, random forest, SVM, BERT, DistilBERT, ALBERT, RoBERTa, XLM, XLM-R and Character BERT on our code-mixed data for classification tasks in order to create a benchmark for further research.

2 Related work

Sentiment analysis helps to understand the polarity (positive, negative or neutral) of the audience towards a piece of content (comment, tweet, image, video) or an event (Brexit, presidential elections). This polarity data can help in understanding public opinion. Furthermore, the inclusion of sentiment analysis can improve the performance of tasks such as recommendation systems (Krishna et al. 2013; Musto et al. 2017) and hate speech detection (Gitari et al. 2015). Over the last 20 years, social media networks have become a rich data source for sentiment analysis (Clarke and Grieve 2017; Tian et al. 2017). Extensive research has been done on sentiment analysis of monolingual corpora such as English (Hu and Liu 2004; Wiebe et al. 2005; Jiang et al. 2019), Russian (Rogers et al. 2018), German (Cieliebak et al. 2017), Norwegian (Mæhlum et al. 2019) and Indian languages (Agrawal et al. 2018; Rani et al. 2020). In initial research works, n-gram features were widely used for the classification of sentiments (Kouloumpis et al. 2011). However, recently, due to the readily available data on social media, these traditional techniques have been replaced by deep neural network techniques. Patwa et al. (2020) conducted sentiment analysis on code-mixed social media text for Hindi-English and Spanish-English. However, sentiment analysis in Dravidian languages is under-studied.

The use of aggressive, hateful or offensive language online has proliferated in social media posts for various technological and sociological reasons. This trend has encouraged the development of automatic moderation systems. These systems, if trained on proper data, can help detect aggressive speech and thus moderate spiteful content on public platforms. Collection of such data has therefore become a crucial part of social media analysis. To facilitate researchers working on these problems, shared tasks on aggression identification in social media (Kumar et al. 2018) and offensive language identification (Zampieri et al. 2019) have been conducted, providing the necessary datasets. As English is a commonly used language on social media, a significant amount of research goes into the identification of offensive English text. However, many internet users prefer to use their native languages, which has given rise to the development of offensive language identification datasets in Arabic, Danish, Greek, and Turkish (Zampieri et al. 2020). Inspired by this, we developed resources for offensive language identification for Dravidian languages.

In the past few years, cheaper internet and increased use of smartphones have significantly increased social media interaction in code-mixed native languages. Dravidian language speakers (who are often bilingual with English, as it is an official language in India), with a population base of 237 million, contribute a large portion of such interactions. Hence, there is an ever-increasing need for the analysis of code-mixed text in Dravidian languages. However, the freely available code-mixed datasets (Ranjan et al. 2016; Jose et al. 2020) are still limited in number, size, and availability. Sowmya Lakshmi and Shambhavi (2017) developed a Kannada-English dataset containing English and Kannada text with word-level code-mixing; they also employed sentence embeddings to detect stance in Kannada-English code-mixed social media text. Shalini et al. (2018) used distributed representations and neural networks for sentiment analysis of Kannada-English code-mixed texts with three tags: Positive, Negative and Neutral. However, the dataset for Kannada was not readily available for research purposes. To motivate further research, we conducted shared tasks (Chakravarthi et al. 2020a, d; Mandl et al. 2020; Chakravarthi et al. 2021) that provided Tamil-English, Kannada-English, and Malayalam-English code-mixed datasets, on which participants trained models to identify the sentiment (task A) and offensive classes (task B).

Most of the recent studies on sentiment analysis and offensive language identification have been conducted on high-resource languages using data from social media platforms. Models trained on such highly resourced monolingual data have succeeded in predicting sentiment and offensiveness. However, with the increased social media usage of bilingual users, systems trained on under-resourced code-mixed data are needed. In spite of this need, no large datasets for Tamil-English, Kannada-English and Malayalam-English are available. Hence, inspired by Severyn et al. (2014), we collected and created a code-mixed dataset from YouTube. In this work, we describe the process of corpora creation for under-resourced Dravidian languages from YouTube comments. This is an extension of two workshop papers (Chakravarthi et al. 2020b, c) and shared tasks (Chakravarthi et al. 2020d). We present the DravidianCodeMix corpora for Tamil-English (40,000 + comments), Kannada-English (7000 + comments) and Malayalam-English (nearly 20,000 comments) with manually annotated labels for sentiment analysis and offensive language identification. We used Krippendorff’s alpha to calculate agreement amongst annotators, made sure that each comment was annotated by at least three annotators, and made the labelled corpora freely available for research purposes. For benchmarking, we provide baseline experiments and results on the DravidianCodeMix corpora using machine learning models.

Fig. 1 Data collection process

Fig. 2 Examples of code-mixing in the Tamil dataset

Fig. 3 Examples of code-mixing in the Kannada dataset

Fig. 4 Examples of code-mixing in the Malayalam dataset

3 Raw data

Online media such as Twitter, Facebook and YouTube contain rapidly changing data produced by millions of users that can drastically alter the reputation of an individual or an organisation. This raises the significance of automatic extraction of sentiments and offensive language used in online social media. YouTube is one of the popular social media platforms in the Indian subcontinent because of the wide range of content available on the platform, such as songs, tutorials, product reviews, trailers and so on. YouTube allows users to create content and other users to comment on it, and it allows for more user-generated content in under-resourced languages. Hence, we chose YouTube to extract comments to create our dataset. We chose movie trailers as the topic for data collection because movies are quite popular among the Tamil, Malayalam, and Kannada speaking populace, which increases the chance of getting varied views on one topic. Figure 1 shows an overview of the steps involved in creating our dataset.

We compiled the comments from different movie trailers in Tamil, Kannada, and Malayalam from YouTube during 2019. The comments were gathered using the YouTube Comment Scraper tool. We utilized these comments to make the datasets for sentiment analysis and offensive language identification with manual annotations. We intended to collect comments that contain code-mixing at various levels of the text, with enough representation for each of the sentiment and offensive language classes in all three languages. It was a challenging task to extract the necessary text that suited our intent from the comment section, which was further complicated by the presence of remarks in other, non-target languages. As part of the preprocessing steps to clean the data, we utilized the langdetect library to tell different languages apart and eliminate the unintended ones. The langdetect library, however, largely filters out languages based on their scripts; this has serious limitations, as it misses a number of languages written in non-conventional scripts, which explains why we still get data from other languages despite using this library. Examples of code-mixing in the Tamil, Kannada and Malayalam corpora are shown in Figs. 2, 3, and 4 along with their translations into English. Keeping data privacy in mind, we made sure that all user-related information was removed from the corpora. As part of the text preprocessing, we also removed redundant information such as URLs.
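To illustrate this filtering step, the sketch below shows how the langdetect library can be used to discard comments whose detected language code falls outside the target set. The comment list and the set of allowed codes are illustrative assumptions, and, as noted above, romanised Dravidian text is often misidentified by such detectors.

```python
# A minimal sketch of the language-filtering step using langdetect.
# The comment list and the allowed language codes are illustrative assumptions.
from langdetect import DetectorFactory, detect
from langdetect.lang_detect_exception import LangDetectException

DetectorFactory.seed = 0  # make langdetect's predictions deterministic

ALLOWED = {"ta", "ml", "kn", "en"}  # Tamil, Malayalam, Kannada, English

def keep_comment(text):
    """Keep a comment only if its detected language is in the allowed set."""
    try:
        return detect(text) in ALLOWED
    except LangDetectException:
        # Very short or symbol-only comments cannot be detected reliably.
        return False

# Romanised Tamil such as the first comment may be assigned another language
# code, which is exactly the limitation discussed above.
comments = ["Enakku iru mugan trailer gnabagam than varuthu", "This looks great!"]
filtered = [c for c in comments if keep_comment(c)]
```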

Since we collected our corpora from social media, they contain different types of real-world code-mixed data. Inter-sentential switching is characterised by a change of language between sentences, where each sentence is written or spoken in one language. Intra-sentential switching occurs within a single sentence, for example when one clause is in one language and the other clause is in the second language. Our corpora contain all forms of code-mixing, ranging from purely monolingual texts in the native languages to mixing of scripts, words, morphology, and inter-sentential and intra-sentential switches. We retained all instances of code-mixing to faithfully preserve real-world usage.

Fig. 5 Example Google Form with annotation instructions for sentiment analysis

4 Methodology of annotation

We created our corpora for two tasks, namely sentiment analysis and offensive language identification. We anonymized the data gathered from YouTube in order to protect user privacy.

4.1 Annotation process

In order to find volunteers for the annotation process, we contacted students at the Indian Institute of Information Technology and Management-Kerala for Malayalam, and at the Indian Institute of Information Technology-Tiruchirapalli and Madurai Kamaraj University for Tamil. For Kannada, we contacted students at Visvesvaraya College of Engineering, Bangalore University. The student volunteer annotators received a link to a Google Form and did the annotations on their personal computers. The authors’ family members also volunteered to annotate the data. We created Google Forms to gather annotations from the annotators. Information on gender, educational background and medium of schooling was collected to know the diversity of the annotators. The annotators were cautioned that the user comments might contain hostile language. They were given the option to discontinue the annotation process in case the content was too upsetting to deal with. They were asked not to be partial to any specific individual, circumstance or occasion during the annotation process. Each Google Form was set to contain up to 100 comments, and each page was limited to ten comments. The annotators had to confirm that they understood the annotation scheme before they were allowed to proceed further. The annotation setup involved three stages. To begin with, each sentence was annotated by two individuals. In the second step, the data was included in the collection if both annotations agreed. In the event of conflict, a third individual was asked to annotate the sentence. In the third step, in the uncommon case that all three of them disagreed, two additional annotators were brought in to label the sentences. Each form was annotated by at least three annotators.

4.2 Sentiment analysis

For sentiment analysis, we followed the methodology of Chakravarthi et al. (2020c) and involved at least three annotators in labelling each sentence. The following annotation schema was given to the annotators in English and in the Dravidian languages.

  • Positive state: Comment contains an explicit or implicit clue in the content recommending that the speaker is in a positive state.

  • Negative state: Comment contains an explicit or implicit clue in the content recommending that the speaker is in a negative state.

  • Mixed feelings: Comment contains explicit or implicit clues to both positive and negative feelings.

  • Neutral state: Comment does not contain an explicit or implicit indicator of the speaker’s emotional state.

  • Not in intended language: If the comment is not in the intended language. For example, for Tamil, if the sentence does not contain Tamil written in Tamil script or Latin script, then it is not Tamil. These comments were discarded after the data annotation process.

Figures 5 and 6 show the sample Google Forms for general instructions and sentiment analysis respectively.

Fig. 6 Examples from the first page of the Google Form for sentiment analysis

4.3 Offensive language identification

We constructed an offensive language identification dataset for Dravidian languages by adapting the work of Zampieri et al. (2019). We reduced the three-level hierarchical annotation scheme of that work into a flat scheme with five labels that account for the types of offensiveness in the comments; a sixth label, Not in intended language, accounts for comments written in a language other than the intended one, for example comments written in other Dravidian languages using the Roman script. To simplify the annotation decisions, the six categories into which each comment is classified are as follows:

  • Not Offensive: Comment does not contain offence or profanity.

  • Offensive Untargeted: Comment contains offence or profanity not directed towards any target. These are the comments which contain unacceptable language without targeting anyone.

  • Offensive Targeted Individual: Comment contains offence or profanity which targets an individual.

  • Offensive Targeted Group: Comment contains offence or profanity which targets a group or a community.

  • Offensive Targeted Other: Comment contains offence or profanity which is targeted at something other than an individual or a group (e.g. a situation, an issue, an organization or an event).

  • Not in intended language: If the comment is not in the intended language. For example, in the Tamil task, if the sentence does not contain Tamil written in Tamil script or Latin script, then it is not Tamil. These comments were discarded after the data annotation process.

Fig. 7 Example Google Form with annotation instructions for offensive language identification

Fig. 8 Example Google Form with annotation instructions for offensive language identification

Fig. 9 Examples from the first page of the Google Form for offensive language identification

Examples of the Google Forms in English and the native languages for the offensive language identification task are given in Figs. 7, 8, and 9.

Table 1 Annotators statistics for sentiment analysis
Table 2 Annotators statistics for offensive language identification

Once the Google Form was ready, we sent it out to an equal number of males and females to enquire about their willingness to annotate. We received varied responses, so the distribution of male and female annotators involved in the task differs. From Table 1, we can see that only two female annotators volunteered to contribute for Tamil, while there were more female annotators for Malayalam and Kannada. For offensive language identification, Table 2 shows that there is a gender balance. The majority of the annotators have received postgraduate-level education. We were not able to find volunteers of non-binary gender to annotate our dataset. All the annotators who volunteered to annotate the Tamil-English, Kannada-English and Malayalam-English datasets had bilingual proficiency in the respective code-mixed pairs and were prepared to take up the task seriously. From Tables 1 and 2, we can observe that the majority of the annotators’ medium of schooling is English even though their mother tongue is Tamil, Kannada or Malayalam. For Kannada and Malayalam, only one annotator per language received their education through the medium of their native language. Although the medium of education of the participants was skewed towards English, we ensured that this would not affect the annotation task, as all of them are fully proficient in their native language.

We were aware that there could be other factors affecting the annotation decisions on offensive language, such as the annotators’ age, their field of education and their ideological stance. Due to the privacy issues involved, we did not collect this information from the annotators. A sample form (the first assignment) was annotated by experts and a gold standard was created. The experts were a team of NLP researchers with experience in creating annotation standards and guidelines. We manually compared the gold standard annotations with each volunteer’s submitted form. To control the quality of annotation, we eliminated annotators whose label assignments in the first form were poor: for instance, if they showed an unreasonable delay in responding, labelled all sentences with the same label, or got more than fifty annotations in a form wrong, their contributions were eliminated. A total of 22 and 23 volunteers were involved in the sentiment analysis and offensive language identification tasks respectively. Once a volunteer filled in the Google Form, 100 sentences were sent to them. If an annotator offered to volunteer more, the next Google Form with another set of 100 sentences was sent, and in this way each volunteer chose to annotate as many sentences from the corpus as they wanted. We sent the same comment forms to annotators for both tasks, but some of the forms were incomplete, so we discarded them. Hence there is some difference between the sentiment dataset and the offensive language dataset; however, more than 98% of the comments overlap between the two.

Table 3 Inter-annotator agreement in Krippendorff’s alpha

4.4 Inter-annotator agreement

Inter-annotator agreement is a measure of the extent to which the annotators agree in their ratings. This is necessary to ensure that the annotation scheme is consistent and that different raters are able to assign the same sentiment label to a given comment. There are two questions related to inter-annotator agreement: how do the annotators agree or disagree in their annotation, and how much of the observed agreement or disagreement might be due to chance? While the percentage of agreement is fairly straightforward, answering the second question involves defining and modelling what chance is and how to measure agreement due to chance. Different inter-annotator agreement measures are intended to answer this in order to assess the reliability of the annotation. We utilized Krippendorff’s alpha \((\alpha )\) (Krippendorff 1970) to gauge the agreement between annotators because of the nature of our annotation setup. Krippendorff’s alpha is a rigorous statistical measure that accounts for incomplete data and, consequently, does not require every annotator to annotate every sentence. It also considers the degree of disagreement between the predicted classes, which is critical in our annotation scheme: for example, a disagreement between the Positive and Negative classes is more serious than one between Mixed feelings and Neutral state, and \(\alpha \) is sensitive to such differences. \(\alpha \) is defined as:

$$\begin{aligned} \alpha = 1 - \frac{D_o}{D_e} \end{aligned}$$
(1)

\(D_o\) is the observed disagreement between sentiment labels assigned by the annotators and \(D_e\) is the disagreement expected when the coding of sentiments can be attributed to chance rather than due to the inherent property of the sentiment itself.

$$\begin{aligned} D_o = \frac{1}{n}\sum _{c}\sum _{k}o_{ck}\;{}_{metric}\delta ^2_{ck} \end{aligned}$$
(2)
$$\begin{aligned} D_e = \frac{1}{n(n-1)} \sum _{c}\sum _{k}n_c \cdot n_{k}\;{}_{metric}\delta ^2_{ck} \end{aligned}$$
(3)

Here \(o_{ck}\), \(n_c\), \(n_k\) and \(n\) refer to the frequencies of values in the coincidence matrices, and metric refers to any metric or level of measurement such as nominal, ordinal, interval or ratio. Krippendorff’s alpha applies to all these metrics; we used the nominal and ordinal metrics to calculate inter-annotator agreement. The range of \(\alpha \) is between 0 and 1, \(1 \ge \alpha \ge 0\). When \(\alpha \) is 1 there is perfect agreement between the annotators, and when it is 0 the agreement is entirely due to chance. Care should be taken in interpreting the reliability shown by Krippendorff’s alpha, because reliability basically measures the amount of noise in the data; however, the location of the noise and the strength of the relationship measured will interfere with the reliability estimate. It is customary to require \(\alpha \ge .800\). A reasonable rule of thumb that allows tentative conclusions to be drawn requires \(0.67 \le \alpha \le 0.8 \), while \(\alpha \ge \) .653 is the lowest conceivable limit. We used nltk to calculate Krippendorff’s alpha \((\alpha )\). The results of inter-annotator agreement between our annotators for the different languages on both the sentiment analysis and offensive language identification tasks are shown in Table 3.
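As a concrete illustration, the sketch below shows how nltk's AnnotationTask can compute \(\alpha \) from (annotator, item, label) triples. The triples shown are invented for illustration, and the nominal distance is used; a different distance function can be substituted to obtain the ordinal variant or to penalise particular confusions more heavily.

```python
# A minimal sketch of computing Krippendorff's alpha with NLTK.
# The (annotator, comment_id, label) triples below are invented for illustration.
from nltk.metrics.agreement import AnnotationTask
from nltk.metrics.distance import binary_distance

triples = [
    ("annotator_1", "comment_1", "Positive"),
    ("annotator_2", "comment_1", "Positive"),
    ("annotator_1", "comment_2", "Negative"),
    ("annotator_3", "comment_2", "Neutral state"),  # annotators need not all label every item
]

# binary_distance yields the nominal version of alpha; pass a custom distance
# function instead to weight some disagreements more heavily than others.
task = AnnotationTask(data=triples, distance=binary_distance)
print(f"Krippendorff's alpha: {task.alpha():.3f}")
```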

5 Corpus statistics

Tables 4 and 5 show the text statistics (number of words, vocabulary size, number of comments, number of sentences, and average number of words per sentence) for sentiment analysis and offensive language identification for Tamil, Malayalam and Kannada. The Tamil dataset has the highest number of samples while Kannada has the fewest in both tasks. On average, each comment contains only one sentence.

Fig. 10 Treemap comparing sentiment classes across Tamil, Malayalam and Kannada

Fig. 11 Treemap comparing offensive classes across Tamil, Malayalam and Kannada

Fig. 12 Treemap comparing offensive classes (excluding the Not Offensive class) across Tamil, Malayalam and Kannada

Tables 6 and 7 show the class distributions across Tamil, Malayalam and Kannada for the sentiment analysis and offensive language identification tasks. Furthermore, the treemaps in Figs. 10 and 11 depict a comparative analysis of the distribution of sentiment and offensive classes across languages. Figure 10 illustrates that there are more samples labelled “Positive” than any other class in all the languages. While the disparity between “Positive” and the other classes is large in Tamil, this is not the case for Malayalam and Kannada. In Malayalam, “Neutral state” is the second-largest class in terms of distribution; the 6502 comments labelled “Neutral state” could mean that many of the comments in Malayalam are vague remarks whose sentiment is unknown. On the other hand, Kannada has the fewest “Neutral state” comments. Figure 11 shows that the “Not Offensive” class is in the majority for all languages. In the case of Tamil, 75.49% of the total comments are not offensive, while Malayalam has 96.16% non-offensive comments. However, there is no consistent trend among the offensive classes across the languages, as shown in Fig. 12. In the case of Tamil, 60% of the offensive comments are targeted (at a group or an individual); similar trends are seen for Malayalam (66%) and Kannada (81.17%). The absence (Malayalam) or small number (Tamil, Kannada) of “Offensive Targeted Other” comments points to the fact that most of the offensive comments are targeted towards either an individual or a group.

Table 4 Corpus statistics for sentiment analysis
Table 5 Corpus statistics for offensive language identification
Table 6 Sentiment analysis dataset distribution
Table 7 Offensive language identification dataset distribution

Our datasets are stored in tab-separated (TSV) files. The first column of each TSV file contains the comment from YouTube and the second column contains the final annotation.
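The snippet below sketches how one of the released files can be read with pandas; the file name and the column names used here are assumptions, since only the column order is described above.

```python
# A minimal sketch of loading one of the TSV files; the file name and the
# column names used here are assumptions (only the column order is described).
import pandas as pd

df = pd.read_csv(
    "tamil_sentiment_full.tsv",   # hypothetical file name
    sep="\t",
    header=None,
    names=["comment", "label"],   # column 1: YouTube comment, column 2: final annotation
)
print(df["label"].value_counts())
```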

6 Difficult examples

The social media comments that form our dataset are code-mixed, showing a mixture of Dravidian languages and English. This poses a few major difficulties while annotating the sentiment and offensive language categories in our dataset. Dravidian languages are under-resourced, and the mixing of scripts makes the annotation task difficult, since the annotators must know both scripts, be familiar with how English words are adapted to native phonology, and know how certain English words take on a different meaning in the given local language. Reading and understanding code-mixed text, often with non-standardised spelling, is difficult unless the annotator is well-versed in both languages (Sridhar and Sridhar 1980). This made it difficult to find volunteer annotators who were fluent in both languages. Moreover, we created the annotation labels with the help of volunteer annotators for three languages (not just one), and it is challenging and time-consuming to collect this amount of data from bilingual volunteer annotators from three different language groups.

While annotating, it was found that some of the comments were ambiguous in conveying the viewers’ actual sentiment, which made the annotation task for sentiment analysis and offensive language identification difficult. The problems include comparison of the movie with movies of the same or other industries and expression of opinions about different aspects of the movie in the same sentence. Below are a few examples of such comments, along with details of how we resolved these issues. In this section, we discuss some examples from the Tamil language that were difficult to annotate.

  • Enakku iru mugan trailer gnabagam than varuthu - All it reminds me of is the trailer of the movie Irumugan. Not sure whether the speaker enjoyed Irumugan trailer or disliked it or simply observed the similarities between the two trailers. The annotators found it difficult to identify the sentiment behind the comment consistently.

  • Rajini ah vida akshay mass ah irukane - Akshay looks more amazing than Rajini. Difficult to decide if it is a disappointment that the villain looks better than the hero or a positive appreciation for the villain actor. Some annotators interpreted negative sentiment while some others took it as positive.

  • Ada dei nama sambatha da dei - I wonder, Is this our sampath? Hey!. Conflict between neutral and positive.

  • Lokesh kanagaraj movie naalae.... English Rap....Song vandurum - If it is a movie of Lokesh kanagaraj, it always has an English rap song. Ambiguous sentiment.

  • Ayayo bigil aprm release panratha idea iruka lokesh gaaru - Oh Dear! Are you even considering releasing the movie Bigil, Mr. Lokesh?. This comment contains a single word, ‘garu’, a non-Tamil, non-English politeness marker borrowed from Telugu. However, in this context the speaker uses the word sarcastically to insult the director because of the undue delay in releasing the movie. The annotators were inconsistent in interpreting this as offensive or as not Tamil.

  • No of dislikes la theriyudhu, idha yaru dislike panni irrupanga nu - It is obvious from the number of dislikes as to who would have disliked this (trailer). This comment appeared below the trailer of a movie that deals with caste issues in contemporary Tamil society. Based on the content of the trailer, the speaker offensively implies that scheduled caste people are the ones who would have disliked the movie and not others. Recognising the offensive undercurrent in a seemingly normal comment is difficult, and such examples complicate the annotation process.

According to the instructions, questions about the music director or the movie release date, and comments containing the speaker’s remarks about the date and time of watching the video, should be treated as belonging to the neutral class. However, the above examples show that some comments about the actors and movies can be ambiguously interpreted as neutral, positive or negative. We found annotator disagreements in such sentences. Below, we give similar examples from Malayalam.

  • Realistic bhoothanghalil ninnu oru vimochanam pratheekshikkunnu - Hoping for a deliverance from realistic demons. No category of audience can be pleased simultaneously: the widespread opinion is that the Malayalam film industry is advancing with more realistic movies, so the section of the audience that is fonder of action or non-realistic movies is not satisfied with this culture of realistic movies. In this comment, the viewer is not insulting this growing culture but is hoping that the upcoming film is of his favourite genre. Hence we labelled it non-offensive.

  • Ithilum valiya jhimikki kammal vannatha - There was an even bigger ‘pendant earring’. ‘Jhimikki kammal’ was a trending song from a movie of the same actor mentioned here. The movie received huge publicity even before its release because of the song, but it turned out to be a disappointment after its release. Thus the annotators were unsure whether the comment was meant as an insult or not; we concluded that the viewer is not offending the present trailer but rather marks his opinion as a warning for the audience not to judge a book by its cover.

  • Ithu kandittu nalla tholinja comedyaayi thonniyathu enikku mathram aano? - Am I the only person here who felt this was a stupid comedy? The meaning of the Malayalam word corresponding to ‘stupid’ here varies across regions of Kerala, so a disparity in opinion between annotators who speak different dialects of Malayalam was evident. Though in a few regions it is offensive, it is generally considered a byword for ‘bad’.

  • aa cinemayude peru kollam. Ithu Dileep ne udheshichanu, ayale mathram udheshichanu - The name of that movie is good. It is named after Dileep and intended only for him. There is obviously a chance of imagining several different movie names depending on the subjective predisposition of the annotator. As long as the movie name is unknown here, apparently no insult can be proved, and there is no profane language used in the sentence either.

  • Kanditt Amala Paul Aadai Tamil mattoru version aanu ennu thonnunu - It looks like another version of Amala Paul’s Tamil movie Aadai. Here the viewer suspects that the Malayalam movie ‘Helen’ is similar to the Tamil movie ‘Aadai’. Though the movie ‘Aadai’ was positively received by viewers and critics, we cannot generalize and assume that this comment is also positive only because of this comparison. Hence we added it to the ‘Mixed feelings’ category.

  • Evideo oru Hollywood story varunnilleee. Oru DBT. - Somewhere there is a Hollywood storyline... one doubt. This is also a comparison comment about the same movie ‘Helen’ mentioned above. Nevertheless, here the difference is that the movie is compared with the Hollywood standard, which is well known worldwide and generally considered positive. Hence it is marked as a positive comment.

  • Trailer pole nalla story undayal mathiyarinu. - It would be enough if the story is as good as the trailer. Here the viewer mentions two aspects of the movie, namely the trailer and the story. He appreciates the trailer but doubts the quality of the story at the same time. We considered this comment positive because it is clear that he enjoyed the trailer and conveys strong optimism for the movie.

7 Benchmark systems

In this section, we report the results obtained for the three languages on both tasks in the corpora introduced above. Like many earlier studies, we approach the tasks as text classification. In order to provide simple baselines, we applied several traditional machine learning algorithms, namely Logistic Regression (LR), Support Vector Machine (SVM), Multinomial Naive Bayes (MNB), K-Nearest Neighbours (KNN), Decision Trees (DT) and Random Forests (RF), separately for both sentiment analysis and offensive language detection on the code-mixed datasets. We also conducted experiments with BERT, CharacterBERT, DistilBERT, ALBERT, RoBERTa, XLM and XLM-R on our code-mixed data for the classification tasks to establish strong baselines (Tables 8 and 9).

Table 8 Train-development-test data distribution with 90–5–5% train-dev-test split for sentiment analysis
Table 9 Train-development-test data distribution with 90–5–5% train-dev-test for offensive language identification

7.1 Experimental setup

We used a randomly sampled 90–5–5% data split for the training, development and test sets in all experiments. All duplicated entries were removed from the dataset before the split, to make the test and development data truly unseen. All experiments were tuned on the development set and tested on the test set.
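A sketch of this split is shown below, assuming the DataFrame `df` from the loading example above; the random seed is an arbitrary choice rather than a reported setting.

```python
# A minimal sketch of the 90-5-5% split, assuming the DataFrame `df` from the
# loading example above; the random seed is an arbitrary choice.
from sklearn.model_selection import train_test_split

df = df.drop_duplicates(subset="comment")   # remove duplicated entries before splitting

train_df, heldout_df = train_test_split(df, test_size=0.10, random_state=42)
dev_df, test_df = train_test_split(heldout_df, test_size=0.50, random_state=42)

print(len(train_df), len(dev_df), len(test_df))   # roughly 90% / 5% / 5%
```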

7.1.1 Logistic regression (LR)

LR is one of the baseline machine learning algorithms: a probabilistic classifier used for classification (Genkin et al. 2007). It is essentially a transformed version of linear regression using the logistic function (Park 2013). It takes real-valued features as input, multiplies each by a weight, and feeds the weighted sum to the sigmoid function \( \sigma (z) \), also called the logistic function, to obtain the class probability (Shah et al. 2020). The decision is made based on a threshold value. The sigmoid function is given below:

$$\begin{aligned} \sigma (z) = \frac{\mathrm {1} }{\mathrm {1} + e^{-z} } \end{aligned}$$
(4)

Logistic regression has a close relationship with neural networks, as the latter can be viewed as a stack of several LR classifiers (de Gispert et al. 2015). Unlike naive Bayes, which is a generative classifier, LR is a discriminative classifier (Ng and Jordan 2002). While naive Bayes holds strict conditional independence assumptions, LR is evidently more robust to correlated features (Jin and Pedersen 2018): when several features, say F1, F2 and F3, are perfectly correlated, it divides the weight W among them as W1, W2 and W3 respectively.

We evaluated the logistic regression model with L2 regularization to reduce overfitting. The input features are the term frequency-inverse document frequency (TF-IDF) values of word n-grams up to length 3. With this approach, the model is trained only on this dataset without using any pre-trained embeddings.
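The sketch below reproduces this kind of baseline with scikit-learn; the TF-IDF word n-grams up to length 3 and the L2 penalty follow the description above, while the iteration budget and other defaults are assumptions.

```python
# A minimal sketch of the TF-IDF + logistic regression baseline; max_iter is
# an assumed value, while the n-gram range and L2 penalty follow the text above.
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import Pipeline

lr_baseline = Pipeline([
    ("tfidf", TfidfVectorizer(ngram_range=(1, 3))),           # unigrams to trigrams
    ("clf", LogisticRegression(penalty="l2", max_iter=1000)),
])

lr_baseline.fit(train_df["comment"], train_df["label"])
dev_predictions = lr_baseline.predict(dev_df["comment"])
```

Swapping the classifier for scikit-learn's LinearSVC over the same TF-IDF features gives one way to build the SVM baseline described in the next subsection.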

7.1.2 Support vector machine (SVM)

Support Vector Machines are a powerful supervised machine learning algorithm used mainly for classification tasks and also for regression. The goal of an SVM is to find the hyperplane in an N-dimensional space that distinctly separates the data points (Ekbal and Bandyopadhyay 2008). In other words, the algorithm draws the decision boundary between the data points that belong to a particular category and those that do not. This is applicable to any kind of data encoded as a vector; therefore, if we can produce appropriate vector representations of the data at hand, we can use an SVM to obtain the desired results (Ekbal and Bandyopadhyay 2008). Here the input features are the same as for LR, that is, the TF-IDF values of word n-grams up to length 3, and we evaluate the SVM model with L2 regularization.

7.1.3 Multinomial naive bayes (MNB)

This is a Bayesian classifier that works on the naive assumption of conditional independence of features, i.e. each feature is assumed to be independent of the others, which is unrealistic for real data. Nevertheless, the assumption simplifies several complex tasks and is therefore justified.

We evaluate a naive Bayes classifier for multinomially distributed data, derived from Bayes’ theorem, which finds the probability of a future event given an observed event. MNB is a specialized version of naive Bayes designed for text documents. Whereas simple naive Bayes would model a document only by the presence or absence of particular words, MNB explicitly models the word counts and adjusts the underlying calculations to deal with them. The input text is therefore treated as a bag of words, in which only the frequency of occurrence of words is considered and their positions are ignored.

Laplace smoothing with \(\alpha =1\) is performed to solve the zero-probability problem, and we then evaluate the MNB model with TF-IDF vectors.
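A corresponding sketch with scikit-learn is given below; \(\alpha = 1\) corresponds to Laplace (add-one) smoothing, and the TF-IDF features mirror the earlier baselines rather than a stated configuration.

```python
# A minimal sketch of the MNB baseline with Laplace smoothing (alpha = 1)
# on TF-IDF features; a sketch, not the authors' exact configuration.
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.naive_bayes import MultinomialNB
from sklearn.pipeline import Pipeline

mnb_baseline = Pipeline([
    ("tfidf", TfidfVectorizer(ngram_range=(1, 3))),
    ("clf", MultinomialNB(alpha=1.0)),   # alpha=1.0 gives Laplace (add-one) smoothing
])
mnb_baseline.fit(train_df["comment"], train_df["label"])
```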

7.1.4 K-nearest neighbour (KNN)

KNN is used for both classification and regression problems, but mostly for classification. The KNN algorithm stores all available data and classifies a new data point on the basis of its similarity to the stored data; as new data emerges, it can be conveniently assigned to the best-suited group. The algorithm assumes that new data is related to the available cases and places each new case into the category that is most similar to the available categories. KNN is a non-parametric algorithm, as it makes no assumptions about the underlying data (Nongmeikapam et al. 2017). It is often referred to as a lazy-learner algorithm because it does not learn from the training set immediately; at training time it only stores the dataset, and it performs its computation at classification time, grouping each new instance with the stored data most similar to it.

We use KNN for classification with 3, 4, 5, and 9 neighbours by applying uniform weights.
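A sketch of this configuration is shown below; reusing the TF-IDF featurisation from the other classical baselines is our assumption, not a stated choice.

```python
# A minimal sketch of the KNN baseline over the neighbour counts listed above;
# reusing TF-IDF features here is an assumption, not a stated choice.
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.neighbors import KNeighborsClassifier
from sklearn.pipeline import Pipeline

for k in (3, 4, 5, 9):
    knn_baseline = Pipeline([
        ("tfidf", TfidfVectorizer(ngram_range=(1, 3))),
        ("clf", KNeighborsClassifier(n_neighbors=k, weights="uniform")),
    ])
    knn_baseline.fit(train_df["comment"], train_df["label"])
    print(k, knn_baseline.score(dev_df["comment"], dev_df["label"]))
```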

7.1.5 Decision tree (DT)

The decision tree builds classification or regression models in the form of a tree structure. A dataset is broken down into smaller and smaller subsets while an associated decision tree is incrementally built at the same time. The final product is a tree with decision nodes and leaf nodes. A decision tree classifier therefore works by generating a tree structure in which each node corresponds to a feature name, the branches correspond to feature values and the leaves represent the classification labels. After sequentially choosing alternative decisions, each node is recursively split again and, finally, the classifier defines rules to predict the result. Decision trees can accommodate high-dimensional data and perform classification without needing much computation, and in general a decision tree classifier has reasonable accuracy. On the downside, decision trees are vulnerable to mistakes in classification problems with many classes and a comparatively small number of training examples. Moreover, growing a decision tree is computationally expensive: each candidate splitting field must be sorted at each node before the best split can be found, some algorithms use combinations of fields and must search for optimal combination weights, and pruning algorithms can also be costly because multiple candidate sub-trees must be formed and compared. Here, the maximum depth was 800 and the minimum sample split was 5 for DT; the splitting criteria were Gini and entropy.
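The stated configuration can be sketched as follows; trying both criteria in a loop is our assumption about how the two were compared, and the TF-IDF features are carried over from the earlier baselines.

```python
# A minimal sketch of the decision-tree baseline with the hyperparameters
# stated above; trying both criteria in a loop is an assumption.
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.pipeline import Pipeline
from sklearn.tree import DecisionTreeClassifier

for criterion in ("gini", "entropy"):
    dt_baseline = Pipeline([
        ("tfidf", TfidfVectorizer(ngram_range=(1, 3))),
        ("clf", DecisionTreeClassifier(max_depth=800, min_samples_split=5,
                                       criterion=criterion)),
    ])
    dt_baseline.fit(train_df["comment"], train_df["label"])
```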

7.1.6 Random forest (RF)

Random forest is an ensemble classifier that makes its prediction based on the combination of different decision trees trained on datasets of the same size as the training set, called bootstraps, created by random resampling of the training set itself (Breiman 2001). Once a tree is constructed, the set of bootstrap samples that do not include any particular record from the original dataset (out-of-bag (OOB) samples) is used as a test set. The error rate of the classification over all the test sets is the OOB estimate of the generalization error. RF has shown important advantages over other methodologies with regard to its ability to handle highly non-linearly correlated data, robustness to noise, tuning simplicity, and opportunity for efficient parallel processing. Moreover, RF presents another important characteristic: an intrinsic feature selection step, applied prior to the classification task, that reduces the variable space by giving an importance value to each feature. RF follows specific rules for tree growing, tree combination, self-testing and post-processing; it is robust to overfitting and is considered more stable in the presence of outliers and in very high-dimensional parameter spaces than other machine learning algorithms (Caruana and Niculescu-Mizil 2006). We evaluate the RF model with the same features as DT.
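A matching sketch for the forest is shown below; the number of trees is an assumed default, since it is not stated above.

```python
# A minimal sketch of the random-forest baseline with the same features as
# the decision tree; n_estimators is an assumed default, not a stated value.
from sklearn.ensemble import RandomForestClassifier
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.pipeline import Pipeline

rf_baseline = Pipeline([
    ("tfidf", TfidfVectorizer(ngram_range=(1, 3))),
    ("clf", RandomForestClassifier(n_estimators=100)),
])
rf_baseline.fit(train_df["comment"], train_df["label"])
```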

7.1.7 BERT

BERT is a language representation model that uses both left and right context conditioning with a masked language model training objective in a semi-supervised way (Devlin et al. 2019). These deep contextual representations can be extended with a classification head to fine-tune BERT on downstream NLP tasks. We use BERT with a classification head and fine-tune all parameters in an end-to-end fashion. We used the Hugging Face transformers library for the experiments.
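A sketch of this fine-tuning setup with the transformers library follows; the checkpoint name, sequence length, epoch count and batch size are illustrative assumptions rather than the settings used for the reported results.

```python
# A minimal sketch of fine-tuning a BERT-style model with a classification
# head via Hugging Face transformers; the checkpoint, sequence length, epochs
# and batch size are illustrative assumptions, not the reported settings.
from datasets import Dataset
from transformers import (AutoModelForSequenceClassification, AutoTokenizer,
                          Trainer, TrainingArguments)

model_name = "bert-base-multilingual-cased"            # e.g. "xlm-roberta-base" or
tokenizer = AutoTokenizer.from_pretrained(model_name)  # "distilbert-base-uncased"

label2id = {lab: i for i, lab in enumerate(sorted(train_df["label"].unique()))}
model = AutoModelForSequenceClassification.from_pretrained(
    model_name, num_labels=len(label2id))

def to_dataset(frame):
    """Turn a pandas split into a tokenized datasets.Dataset with integer labels."""
    ds = Dataset.from_pandas(frame.assign(label=frame["label"].map(label2id)))
    return ds.map(lambda batch: tokenizer(batch["comment"], truncation=True,
                                          padding="max_length", max_length=128),
                  batched=True)

args = TrainingArguments(output_dir="bert-baseline", num_train_epochs=3,
                         per_device_train_batch_size=16)
trainer = Trainer(model=model, args=args,
                  train_dataset=to_dataset(train_df),
                  eval_dataset=to_dataset(dev_df))
trainer.train()
```

Because the Auto classes resolve the matching architecture from the checkpoint name, swapping the name should be enough to try the other pre-trained variants discussed below.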

7.1.8 CharacterBERT

Many language representation models have adopted the transformer architecture as their fundamental building block because of BERT’s success. Interestingly, BERT’s wordpiece tokenization works well on most NLP tasks, but it also makes BERT a complex model in specialized settings. To reduce this complexity, CharacterBERT, a new variant of BERT, drops wordpiece tokenization entirely and instead uses a Character-CNN to represent whole words at the character level rather than at the sub-word level (El Boukkouri et al. 2020). CharacterBERT is based on the BERT “base-uncased” version (L = 12, H = 768, A = 12, total parameters = 109.5 M) and contains 104.6 M parameters.

7.1.9 DistilBERT

DistilBERT is a smaller, cheaper variant of BERT that has 40% fewer parameters while retaining 95% of BERT’s performance. Sanh et al. (2019) leveraged knowledge distillation during pre-training to obtain a smaller language model that achieves similar performance on downstream NLP tasks with less inference time. Knowledge distillation is a compression technique that uses a student-teacher setup, in which the student (the small model) learns the behaviour of the teacher (the large model) with the help of a distillation loss.

7.1.10 ALBERT

ALBERT (Lan et al. 2019) is a transformer model with fewer parameters than BERT, trained with a self-supervised loss. The model is based on two parameter-reduction techniques. The first factorizes the embedding parameterization, splitting the large vocabulary embedding matrix into smaller matrices. The second shares parameters across layers, further reducing the overall number of parameters. We used ALBERT in our experiments to study whether the claimed performance gain over BERT is observed in our case.

7.1.11 RoBERTa

RoBERTa (Liu et al. 2019), unlike BERT, is not trained with the next-sentence prediction objective. Instead, larger mini-batches and learning rates are used while training the language model with the masked language modelling objective. With these design choices, RoBERTa exceeds standard BERT baselines on downstream NLP tasks. We leveraged the abilities of RoBERTa in our experiments.

7.1.12 XLM

XLM (Lample and Conneau 2019) is a cross-lingual language model trained with three objectives: causal language modelling, masked language modelling and translation language modelling. The novelty of this model comes from its use of cross-lingual representations and a new supervised training objective that improves those representations.

7.1.13 XLNet

XLNet uses autoregressive (AR) language modelling to estimate the probability distribution of a text corpus while avoiding the use of the [MASK] token and concurrent independent predictions. This is accomplished via AR modelling, which provides a principled way to apply the product rule for factorizing the joint probability of the predicted tokens.

7.1.14 XLM-R

XLM-RoBERTa (XLM-R) was proposed as an unsupervised cross-lingual representation model that considerably outperforms multilingual BERT on a number of cross-lingual benchmarks (Conneau et al. 2020). XLM-R was trained on CommonCrawl data covering 100 languages and is fine-tuned for evaluation and inference on a variety of downstream tasks.

Table 10 Precision, recall, and F-score for Tamil sentiment analysis
Table 11 Precision, recall, and F-score for Malayalam sentiment analysis
Table 12 Precision, recall, and F-score for Kannada sentiment analysis
Table 13 Precision, recall, and F-score for Tamil offensive language identification
Table 14 Precision, recall, and F-score for Malayalam offensive language identification
Table 15 Precision, recall, and F-score for Kannada offensive language identification

8 Results and discussion

The results of the experiments with the classifiers described above for both sentiment analysis and offensive language detection are shown in terms of precision, recall, F1-score and support in Tables 10, 11, 12, 13, 14, and 15.

We used the sklearn library to develop the models. A macro-average computes the metrics (precision, recall, F1-score) independently for each class and then averages them; this metric therefore treats all classes equally and does not take class imbalance into account. A weighted average takes the metrics from each class just like a macro-average, but the contribution of each class to the average is weighted by the number of examples available for it. The number of comments belonging to the different classes in both tasks is listed as the support values in the respective tables.
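The sketch below shows how these per-class, macro and weighted scores can be produced with scikit-learn, assuming `y_true` and `y_pred` hold the gold and predicted labels for one language/task test split.

```python
# A minimal sketch of the reported metrics; `y_true` and `y_pred` are assumed
# to hold gold and predicted labels for one language/task test split.
from sklearn.metrics import classification_report, precision_recall_fscore_support

print(classification_report(y_true, y_pred))   # per-class precision/recall/F1 + support

macro = precision_recall_fscore_support(y_true, y_pred, average="macro")
weighted = precision_recall_fscore_support(y_true, y_pred, average="weighted")
print("macro    (all classes weighted equally):", macro[:3])
print("weighted (weighted by class support):   ", weighted[:3])
```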

For sentiment analysis, the performance of the various classification algorithms ranges from inadequate to average on the code-mixed dataset. Logistic regression, random forest and decision tree classifiers fared comparatively better across all sentiment classes. To our surprise, SVM performs poorly, with worse heterogeneity than the other methods. The precision, recall and F1-score are higher for the “Positive” class, followed by the “Negative” class; all the other classes performed very poorly. One of the reasons is the nature of the dataset, as the classes “Mixed feelings” and “Neutral state” are challenging for the annotators to label owing to the problematic examples described before. As can be observed from Table 12, the highest weighted average precision for sentiment analysis is 0.68 from Multinomial Naive Bayes (MNB), followed by CharBERT and XLM with the highest recall of 0.62, and finally the highest weighted F-score of 0.59 from multiple classifiers (BERT, CharBERT, XLM).

For offensive language detection, all the classification algorithms perform comparably poorly, with logistic regression and random forest performing relatively better than the others. The precision, recall and F1-score are higher for the “Not Offensive” class, followed by the “Offensive Targeted Individual” and “OL” classes. The reasons for the poor performance on the other classes are the same as for sentiment analysis. From the tables, we see that the classification algorithms performed better on sentiment analysis than on offensive language detection; one of the main reasons could be the differences in class distributions between the two tasks. For the offensive language task, we observe the highest weighted average precision (0.78), recall (0.76) and F-score (0.74) from MNB, RF/RoBERTa/XLM and DistilBERT respectively.

When it comes to the sentiment analysis dataset in Kannada, out of a total of 7671 sentences, 46% and 19% belong to the “Positive” and “Negative” classes respectively, while the other classes account for 9%, 11% and 15%. This distribution is better than that of the Kannada offensive language detection dataset, where 56% of comments belong to “Not Offensive” while the other classes have a low share of 4%, 8%, 6%, 2% and 24%. Although the distribution of offensive and non-offensive classes is skewed in all the languages, an overwhelmingly higher percentage of comments belongs to non-offensive classes in the Tamil and Malayalam datasets than in Kannada: 72.4% of comments in Tamil and 88.44% in Malayalam are non-offensive, while in Kannada only 55.79% of the total comments are non-offensive. This explains why the precision, recall and F-score values for identifying the non-offensive class are consistently higher for the Tamil and Malayalam data than for Kannada.

Since we collected the posts from movie trailers, we obtained more positive sentiment than other classes, as people who watch trailers are more likely to be interested in movies, and this skews the overall distribution. However, as the code-mixing phenomenon is not incorporated in earlier models, this resource can be taken as a starting point for further research, and there is significant room for improvement in code-mixed research with our dataset. In our experiments, we only utilized machine learning methods, but additional information, such as linguistic features or hierarchical meta-embeddings, could also be utilized.

9 Conclusion

This work introduced code-mixed datasets for the under-resourced Dravidian languages, comprising more than 60,000 comments annotated for sentiment analysis and offensive language identification. To improve research in the under-resourced Dravidian languages, we created an annotation scheme and achieved high inter-annotator agreement in terms of Krippendorff’s \(\alpha \) from voluntary annotators on contributions collected using Google Forms. We created baselines with gold-standard annotated data and presented our results for each class in terms of precision, recall, and F-score. We expect this resource will enable researchers to address new and exciting problems in code-mixed research. In future work, we intend to investigate whether we can use these corpora to build corpora for other under-resourced Dravidian languages.