An unsupervised multilingual approach for online social media topic identification
Introduction
Social media sharing is well known for its huge volume and high velocity, but it is less often recognised that the shared content also varies in structure, spanning images, videos, and text. Given the openness and free form of expression, it is not surprising to find textual content written in a mixture of languages, including informal terms and phrases that rarely follow proper grammatical rules. Effective mining of social media data can therefore no longer focus on a single language; a multilingual approach is essential to fully comprehend the sentiment and content shared online.
In this study, our aim is to identify topics that are of value to a community, in the form of opinions, concerns, news and so on, within the mass of social media data. Following previous studies (e.g., Aiello et al., 2013, Zhao et al., 2011) that have compared the coverage of social media against content from mainstream media, we also assess whether the topics shared on social media are the same as those covered by mainstream media. We have chosen Twitter as the base of our investigation because of its ability to propagate hot topics to a wide audience in a very short time. In addition, the degree of language variation found on Twitter can be immense. Most past studies (e.g., see Vicient and Moreno, 2015, Zhao et al., 2011) have focused only on analysing English tweets, even though English is used by just 28.6% of Internet users.
Besides the linguistic variation found in tweets, mixed languages and localised vernaculars are commonly used to express emotion online, especially in a multi-cultural environment (Zielinski et al., 2012). One such example is Singlish, the colloquial Singaporean English that has incorporated elements of some Chinese dialects and the Malay language (Leimgruber, 2011). One of the main reasons for the prevalent use of such a unique ‘language’ is that a native or localised vernacular can resonate with the local community better than a formal language. This led us to carry out multilingual analysis on Singlish tweets, to see if we could detect the concerns of, or news of interest to, the local Singaporean community. However, this can be very challenging because of the limited resources available for an informal language like Singlish.
Topic detection and tracking have been extensively studied to identify new topics in a temporally-ordered news stream (Allan, 2012). Topic modelling (Lu, 2015, Zhao et al., 2011), temporal segmentation (Benhardus and Kalita, 2013, Lu, 2015), classification of event and non-event (Becker, Naaman, & Gravano, 2011), unknown event identification (Psallidas, Becker, Naaman, & Gravano, 2013), hashtags (Vicient & Moreno, 2015) and exemplar-based topic detection (Elbagoury, Ibrahim, Farahat, Kamel, & Karray, 2015) are some of the approaches used in this research area. Temporal segmentation is important for detecting topics, and studies in the literature have segmented tweets using different time units, such as minutes (Benhardus & Kalita, 2013), hours (Becker et al., 2011, Benhardus and Kalita, 2013), days (Lu, 2015), and weeks (Lu, 2015). While most topic detection analyses have used specific events and annotated datasets for evaluation (e.g., see Aiello et al., 2013, Becker et al., 2011, Elbagoury et al., 2015), we propose a multilingual analysis through high-value term comparison using a Singlish dataset (see Section 3.2) to discover interesting topics via a candidate day selection process with minimal annotation effort. In addition, instead of manually categorising the vast number of tweets for evaluation purposes, we adopt an online web service to automatically classify the ground truth dataset (see Section 3.3), which allows a more comprehensive evaluation.
In terms of methods, we propose a Peak Identification algorithm and compare it to Term Frequency-Inverse Document Frequency (TFIDF) and Term Frequency (TF). We also consider three topic clustering methods: Twitter Latent Dirichlet Allocation (LDA) (Zhao et al., 2011), K-Means (MacQueen, 1967) and the Dirichlet Process Mixture Model (DPMM) (Antoniak, 1974). Our intention is to identify terms and topics that are relevant and of high value to a local community in an unsupervised manner. To ascertain whether the proposed approach can identify high-value topics, tweets have been clustered by day for assessment across different datasets. In addition, the top topics identified are subjected to multilingual sentiment analysis to uncover sentiments on the ground.
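As a rough illustration of the two baseline term ranking methods named above, the following minimal sketch ranks terms per day by TF and by TFIDF, treating all tweets from one day as a single document. The whitespace tokeniser, the function name `rank_terms_by_day`, and the sample tweets are assumptions made for illustration; this is not the preprocessing or the Peak Identification algorithm used in the paper.

```python
import math
from collections import Counter


def tokenize(text):
    # Assumption: naive lowercase whitespace tokenisation, for illustration only.
    return text.lower().split()


def rank_terms_by_day(tweets_by_day, top_n=3):
    """Rank terms per day by raw TF and by TF-IDF.

    Each day's tweets are pooled into one document, so the inverse
    document frequency is computed over days (an assumption of this sketch).
    """
    day_counts = {
        day: Counter(tok for tweet in tweets for tok in tokenize(tweet))
        for day, tweets in tweets_by_day.items()
    }
    n_days = len(day_counts)

    # Number of days in which each term appears at least once.
    doc_freq = Counter()
    for counts in day_counts.values():
        doc_freq.update(counts.keys())

    rankings = {}
    for day, counts in day_counts.items():
        tfidf = {
            term: tf * math.log(n_days / doc_freq[term])
            for term, tf in counts.items()
        }
        # Ties are broken alphabetically so the output is deterministic.
        by_tf = sorted(counts, key=lambda t: (-counts[t], t))[:top_n]
        by_tfidf = sorted(tfidf, key=lambda t: (-tfidf[t], t))[:top_n]
        rankings[day] = {"tf": by_tf, "tfidf": by_tfidf}
    return rankings


if __name__ == "__main__":
    sample = {
        "day1": ["haze very bad lah", "haze again lah"],
        "day2": ["mrt breakdown again lah", "mrt delay"],
    }
    for day, ranks in rank_terms_by_day(sample, top_n=2).items():
        print(day, ranks)
```

In this toy input, a particle such as "lah" appears every day, so TF ranks it highly while TFIDF gives it a weight of zero. That difference is the usual motivation for comparing the two baselines.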
The main contributions of this work can be summarised as follows:
- To the best of our knowledge, our work in this paper is the first attempt to identify topics from tweets with consideration of their multilingual nature through unsupervised learning, without the use of any external knowledge base to decipher the context of tweets.
- Tweets in a localised language (i.e., Singlish as used in this study) can be leveraged for identifying relevant and important topics that are of interest or concern to the local community. The comparison of top terms discovered with the Singlish dataset succeeded in choosing appropriate candidate days with minimal annotation effort.
- Our proposed approach of using the DPMM clustering method and a ‘Joint’ term ranking method has consistently performed well in the topic recall and precision@10 evaluation metrics.
- From the observation of our results, it is essential to find optimal parameters for the DPMM even though there is no pre-defined topic number required for DPMM clustering (as opposed to Twitter LDA and K-Means).
- Our multilingual sentiment analysis has uncovered mixed-language tweets that were not detected when using an English polarity detection algorithm. This finding is important, since it highlights the necessity of considering the multilingual nature of online sharing to ensure a more comprehensive analysis.
- It is observed that social media and mainstream media platforms share the same main topics if there are prominent events on the day. However, this observation does not hold for ‘ordinary’ days. Our approach of ranking the high-value topics therefore plays a crucial role in gauging the interests or concerns of the local community.
The rest of this paper is structured as follows. We review related work in Section 2, and describe the details of datasets and resources constructed in Section 3. Section 4 presents the methods and experimental setups, which include candidate day selection, term ranking methods, topic clustering methods, evaluation metrics and multilingual analysis. In Section 5, we list our findings and explain the results. Section 6 discusses our approach and observations before the conclusions are drawn in Section 7.
Topic detection
Broadly speaking, two main types of data sources have been used in evaluating topic detection approaches: labelled/curated and unlabelled data sources. The former mainly relies on annotated datasets of specific topics for identification or classification, while the latter attempts to cluster relevant topics based on features and information found in tweets without labelled data.
Aiello et al. (2013) adopted standard natural language processing techniques, n-grams, co-occurrence and a variant of
Twitter dataset collection
In order to enable unsupervised topic detection, continuous tweet extraction through a period of time was carried out. Since there was no Twitter dataset with Singlish content available for analysis, we collected the Twitter dataset used in this study by following a list of Twitter users who were tweeting topics relevant to Singapore and its regions. Given that the location information of Twitter users is typically not verified, we used Twitter's location information only as a reference and
Methods and experimental setups
In this section, we present the main components of our unsupervised multilingual topic identification approach. The overall architecture can be found in Fig. 1. As the figure shows, the process starts with candidate day selection: we analyse tweets over a period of seven months (as mentioned in Section 3.1) and select candidate days whose content is likely to resonate with the local community. Tweets from the candidate days are then extracted for further analysis.
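The candidate day selection step can be pictured with a minimal sketch such as the one below. It reflects only one plausible reading of the description: a day becomes a candidate when enough of its top terms in the full dataset also appear among the top terms of the localised (Singlish) tweets for that day. The function names, the frequency-based term ranking, and the `k` and `min_matches` thresholds are all hypothetical and are not taken from the paper.

```python
from collections import Counter


def top_terms(tweets, k):
    # Assumption: terms are ranked by raw frequency after whitespace tokenisation.
    counts = Counter(tok for tweet in tweets for tok in tweet.lower().split())
    ranked = sorted(counts.items(), key=lambda kv: (-kv[1], kv[0]))
    return [term for term, _ in ranked[:k]]


def select_candidate_days(all_by_day, local_by_day, k=5, min_matches=2):
    """Return {day: matched_terms} for days whose top-k terms overlap enough.

    `all_by_day` maps each day to all tweets collected that day, and
    `local_by_day` maps each day to the localised-language tweets only.
    Both thresholds are hypothetical values chosen for illustration.
    """
    selected = {}
    for day, tweets in all_by_day.items():
        overall = set(top_terms(tweets, k))
        local = set(top_terms(local_by_day.get(day, []), k))
        matched = overall & local
        if len(matched) >= min_matches:
            selected[day] = sorted(matched)
    return selected


if __name__ == "__main__":
    all_by_day = {
        "2014-01-01": ["haze psi high", "haze psi again"],
        "2014-01-02": ["football tonight", "football match"],
    }
    local_by_day = {
        "2014-01-01": ["wah haze psi jialat", "haze psi sian"],
        "2014-01-02": ["makan time lah"],
    }
    print(select_candidate_days(all_by_day, local_by_day))
```

Only the first day is selected in this toy input, because its top terms coincide in both the overall and the localised tweets.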
Results
In this section, we first describe the candidate days identified and their top matched terms before going into the details of ground truth data analysis. We then discuss the results of DPMM parameter selection and evaluate the three term ranking methods and the three topic clustering methods using recall and precision of terms and topics. Lastly, we present the findings of multilingual sentiment analysis on the top topic of each candidate day.
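For readers unfamiliar with the evaluation metrics mentioned here, the sketch below shows one common way to compute precision@k over a ranked term list and recall over detected topics. The matching rule used for topics (at least `min_overlap` shared keywords) is an assumption for illustration; the paper's exact matching criteria are not given in this excerpt.

```python
def precision_at_k(ranked_terms, relevant_terms, k=10):
    """Fraction of the top-k ranked terms that are in the relevant set.

    Divides by k, so a ranking shorter than k is penalised
    (one common convention, assumed here).
    """
    if k <= 0:
        return 0.0
    hits = sum(1 for term in ranked_terms[:k] if term in relevant_terms)
    return hits / k


def topic_recall(detected_topics, truth_topics, min_overlap=1):
    """Fraction of ground-truth topics matched by at least one detected topic.

    Each topic is a list of keywords. Two topics match when they share at
    least `min_overlap` keywords (a hypothetical rule for this sketch).
    """
    if not truth_topics:
        return 0.0
    found = sum(
        1
        for truth in truth_topics
        if any(len(set(truth) & set(det)) >= min_overlap for det in detected_topics)
    )
    return found / len(truth_topics)


if __name__ == "__main__":
    ranked = ["haze", "psi", "lah", "mrt"]
    relevant = {"haze", "mrt"}
    print(precision_at_k(ranked, relevant, k=4))

    detected = [["haze", "psi"], ["mrt", "delay"]]
    truth = [["haze", "air"], ["election"]]
    print(topic_recall(detected, truth))
```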
Discussion
Previous studies along this line of research (e.g., Aiello et al., 2013, Becker et al., 2011, Shamma et al., 2011) used annotated and curated datasets for evaluation. It is undeniable that if strict annotation rules are enforced in the datasets employed, accurate and domain-related topics can be discovered and identified. In this study, we avoided manual annotation effort in two aspects: domain selection and topic annotation. Most studies pre-defined specific events (Aiello et al., 2013,
Conclusion and future work
In this paper, we have demonstrated that it is feasible to identify high-value terms and topics from a vast amount of Twitter data through an unsupervised multilingual approach. We have also shown that by leveraging multilingual analysis and the Peak Identification algorithm, highly relevant topics that are of concern to the local community can be extracted through candidate day selection. From the observation of our results, the DPMM with our proposed ‘Joint’ ranking method has consistently
References (39)
- Sentiment analysis system adaptation for multilingual processing: The case of tweets. Information Processing & Management (2015)
- Sociolinguistic analysis of twitter in multilingual societies
- A multilingual semi-supervised approach in deriving Singlish sentic patterns for polarity detection. Knowledge-Based Systems (2016)
- Ranking of high-value social audiences on Twitter. Decision Support Systems (2016)
- Detecting short-term cyclical topic dynamics in the user-generated content and news. Decision Support Systems (2015)
- Unsupervised adaptive microblog filtering for broad dynamic topics. Information Processing & Management (2016)
- Event identification in web social media through named entity recognition and topic modeling. Data & Knowledge Engineering (2013)
- Unsupervised topic discovery in micro-blogging networks. Expert Systems with Applications (2015)
- Identifying interesting Twitter contents using topical analysis. Expert Systems with Applications (2014)
- Sensing trending topics in Twitter. IEEE Transactions on Multimedia (2013)
- Topic detection and tracking: Event-based information organization
- Mixtures of Dirichlet processes with applications to Bayesian nonparametric problems. The Annals of Statistics
- SentiWordNet 3.0: An enhanced lexical resource for sentiment analysis and opinion mining
- Improving sentiment analysis in Twitter using multilingual machine translated data
- Beyond trending topics: Real-world event identification on Twitter
- Streaming trend detection in twitter. International Journal of Web Based Communities
- Latent dirichlet allocation. The Journal of Machine Learning Research
- Making sense of social media streams through semantics: A survey. Semantic Web
- Emotion tokens: Bridging the gap among multilingual twitter sentiment analysis