1 Introduction

On February 11, 2020, the World Health Organization (WHO) officially named the disease caused by the 2019 novel coronavirus "COVID-19", an abbreviation of "COrona VIrus Disease" (Lovelace 2020). As the disease spread at an alarming rate, on March 11, 2020, the WHO officially declared COVID-19 a pandemic that had affected more than 110 countries (Beaubien 2020). Today, after a year of combating this pandemic, the WHO reports more than two million deaths and more than one hundred million infections, as shown in Fig. 1 (WHO 2021). This situation has created a severe challenge for all countries, which must monitor and slow down the pandemic using various measures, including social distancing, lockdowns, limits on the movement of persons, and avoidance of the 3Cs (closed spaces, crowded places, and close-contact settings). After some relief during the summer and the return to schools and universities, a new wave appeared in fall 2020. Highly contagious variants of the novel coronavirus were reported in the UK, South Africa, and Brazil (Bollinger and Ray 2020), adding new fears for international health organizations. Figure 1 shows the distribution of COVID-19 total cases worldwide and daily deaths from January 21, 2020, to December 31, 2020.

Fig. 1 COVID-19 worldwide total cases and daily deaths from 1/22/2020 to 12/31/2020

As reported by many health organizations, the top priorities for 2021 are to continue fighting COVID-19, repair and strengthen existing health systems, speed up access to COVID-19 treatments, and achieve equitable access to safe vaccines for all (WHO 2021). Several vaccines are currently being approved in several countries (Hindustan Times 2020). The majority of these vaccines were produced in North America (46%), China (18%), Asia (excluding China) and Australia (18%), and Europe (18%) (Le et al. 2020). Traditionally, however, a vaccine development pipeline takes ten years on average, a timescale that is incompatible with the speed of the COVID-19 pandemic.

Consequently, new strategies have been adopted to accelerate the development of vaccines (Lurie et al. 2020). The WHO reported that more than sixty COVID-19 vaccine candidates, such as Pfizer–BioNTech, Moderna, Sputnik V, Sinovac, and Oxford–AstraZeneca, are currently in clinical development and more than 170 are in preclinical development (Statista 2021). However, the WHO recently warned that, despite the approval of vaccines such as Pfizer-BioNTech's and Moderna's by the Food and Drug Administration (FDA), population immunity is highly unlikely to be achieved this year (Cheng and Keaten 2021). The WHO's concerns are unequal COVID-19 vaccine distribution, the vaccine divide between rich and poor, and vaccine nationalism. During the recent World Economic Forum, China's President (World Economic Forum 2021) stressed the importance of cooperation to develop and produce vaccines for all people worldwide. Table 1 shows the COVID-19 vaccines available under WHO emergency use (Olliaro et al. 2021), including the Chinese vaccines Sinopharm and Sinovac recently approved by the WHO (2020). While these vaccines use different methods, their efficacy (relative risk reduction) places Pfizer-BioNTech and Moderna among the highest.

Table 1 List of available COVID-19 vaccines (WHO 2020)

The present work aims to analyze people's sentiments during the pandemic by combining sentiment analysis and natural language processing algorithms. The remainder of the paper is organized as follows: first, a review of the relevant background and literature on sentiment analysis is presented, followed by the research methodology applied in this study. Then, the results are described and discussed, and finally the study's main conclusions are presented.

2 Background

Sentiment analysis has been applied in different domains such as social media monitoring, business, product analysis, stock markets, tourism, health, and education to understand events and trends (Vijaykumar et al. 2017; Ainin et al. 2020; Hassan et al. 2020; Abualigah et al. 2020; Yadav et al. 2018). With the growth of social media, sentiment analysis is one of the best ways of understanding people's opinions, whether positive, negative, or neutral. In healthcare, for example, sentiment analysis classifies patient comments and quantifies the performance of employees. It extends to medical sentiment analysis, which provides professionals with an automated support system, determines patients' concerns, or analyzes patients' emotions (Kushwah et al. 2021). Birjali et al. (2021) performed a comprehensive survey on sentiment analysis that discussed traditional and recent approaches, challenges, and future trends. Srivastava et al. (2022) explored the challenges of sarcasm detection, negation detection, and word multipolarity and ambiguity. Most of the literature divides sentiment analysis into three categories: machine learning approaches, lexicon-based approaches, and hybrid approaches (Duong and Nguyen-Thi 2021; Soong et al. 2021). One of the main problems in sentiment analysis is the classification of sentiment polarity as positive, negative, or neutral. Several research articles indicated that many outbreaks and pandemics could have been monitored quickly if experts had considered social media data (Alamoodi et al. 2021). According to Statista (2021), as of February 12, 2021, nearly three billion doses of the AstraZeneca/Oxford vaccine, one billion doses of the Pfizer-BioNTech vaccine, and more than six hundred million doses of the Moderna vaccine had been pre-purchased.
However, in a survey conducted in December 2020 by the Kaiser Family Foundation (KFF) under the research project "COVID-19 Vaccine Monitor" (Muñana 2020), 71% of respondents said they would agree to take the COVID-19 vaccine if it were free of charge and safe, while about 27% remained hesitant, fearing side effects. In another survey, based on the Health Belief Model (HBM) and conducted in Hong Kong (Wong et al. 2021) during the peak of the third wave of the pandemic between July 27 and August 27, 2020, the acceptance rate was 37.2%, while confidence in newer vaccines stood at 43.4% and confidence in the manufacturer at 52.2% of respondents. In a study by Raamkumar, Tan, and Wee (Raamkumar et al. 2020), the authors used Facebook to investigate and understand public sentiment and responses to the COVID-19 pandemic. The study aimed to improve the communication strategies of various public health authorities (PHAs) in the United States, Singapore, and England, and showed that social media analyses could provide insights into PHAs' communication strategies during a pandemic. In another study, Raamkumar, Tan, and Wee (Sesagiri Raamkumar et al. 2020) developed a deep-learning text classifier to investigate public perception of and reaction to physical distancing. The authors concluded that PHAs can characterize the public's health behaviors using classification models. Vijaykumar, Meurzec, and Jayasundar (Vijaykumar et al. 2017) studied Zika outbreaks in Singapore using Facebook to bridge the gaps between people and public officials. The results showed the importance of social media in sharing information on pandemics. Manguri, Rebaz, and Pshko (Manguri et al. 2020) analyzed Twitter sentiment on the COVID-19 outbreak by pulling data from Twitter from April 9, 2020, to April 15, 2020, using the tweepy library.
The study found that people's sentiments and reactions varied daily, while the proportion of neutral tweets was substantial for the coronavirus and COVID-19 keywords. Prabhakar and Krishna (2020) used Latent Dirichlet Allocation (LDA) to study the knowledge flow on Twitter during the COVID-19 pandemic. The sentiment analysis results confirmed the expected negative sentiment towards the COVID-19 pandemic. The authors concluded that the knowledge flow on Twitter about the coronavirus outbreak was necessary and correct, with only minor misinformation compared to the previous Ebola and Zika virus outbreaks, during which Twitter users widely disseminated misinformation. Xue et al. (2020) also used the LDA machine learning algorithm to conduct COVID-19 sentiment analysis on 4 million Twitter messages. The results showed that fear was prominent in public tweets. Shamrat et al. (2021) used Twitter to analyze people's sentiment towards three vaccines: Pfizer, Moderna, and AstraZeneca. The authors preprocessed the raw tweets using Natural Language Processing (NLP) and classified the processed data with the k-nearest neighbors (KNN) algorithm. They found a higher share of positive sentiment towards all three vaccines, with 47.29% for Pfizer, 46.16% for Moderna, and 40.08% for AstraZeneca (Shamrat et al. 2021). Another sentiment analysis study, involving 3242 tweets on the Pfizer and Sinovac vaccines, was conducted during October–November 2020 in Indonesia (Nurdeni et al. 2021). The results revealed that Pfizer had 81% positive perceptions while Sinovac had 77% positive perceptions.

3 Research methodology

Sentiment analysis (SA) is a text classification technique used to analyze natural language text. It uses Natural Language Processing (NLP) to determine whether the sentiments expressed towards a subject are positive, negative, or neutral (Solangi et al. 2018; Shamrat et al. 2021). It also helps identify people's emotions such as happiness, depression, anxiety, fear, and sadness. As shown in Fig. 2, there are three different approaches used in sentiment analysis (Kausar et al. 2019; Rokade and Aruna 2019; Hauthal et al. 2020): (1) a machine learning approach that includes linear classifiers (support vector machines and neural networks), probabilistic classifiers (Naïve Bayes and Bayesian network), and decision tree classifiers; (2) a lexicon-based approach, which includes a corpus-based approach (statistical and semantic) and a dictionary-based approach; and (3) a hybrid approach, which combines the previous two. The machine learning approach has better accuracy than the lexicon-based approach, particularly when it uses an extensive dataset. In contrast to machine learning, the lexicon-based method requires human involvement in the text analysis. Sentiment analysis involves four main steps: (i) a social media platform, (ii) data collection, (iii) pre-processing, and (iv) data analysis. In this study, a hybrid approach is used, including a corpus-based method with semantic analysis. The sentiment analysis process begins with data collection and identification, followed by the extraction and classification of features. Finally, a decision process is conducted at the stage of sentiment polarity and subjectivity. Data were collected in the form of tweets through social media during a first period from December 16, 2020 to January 27, 2021, and a second period from January 20, 2021 to April 8, 2021. The collected tweets are then pre-processed, and the cleaned data are classified, analyzed, and evaluated.

Fig. 2 Sentiment analysis approaches

3.1 Data collection and pre-processing

The pre-processing of the collected data is mainly used to clean the raw data and minimize the vocabulary of words detected in the text messages, using Natural Language Processing with Python's NLTK package (Solangi et al. 2018; Rajput 2019). Since the collected data is text, it needs to be transformed into a numerical representation using a fixed-length feature vector. This representation is commonly based on the bag-of-words (BOW) approach (Minaee et al. 2021), a simple and flexible way to extract features from documents and track how often each word is used. In this work, the #COVIDvaccine hashtag is tracked through two primary sources:

  • An online tweet-monitoring tool that tracks daily tweets on the hashtag (TAGS, a Google Sheet with macros: https://tags.hawksey.info/).

  • Tweet collection and analysis using Python code that monitors the #COVIDvaccine hashtag over one month (30 days) using the Twitter API and Python Natural Language Processing (NLP) libraries.
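The cleaning and bag-of-words steps described above can be illustrated with a minimal, dependency-free sketch. It is not the author's code: the study relies on NLTK and sklearn, whereas the stop-word list, the `clean_tweet` and `bag_of_words` helpers, and the two sample tweets here are hypothetical and serve only to show the idea of mapping each tweet to a fixed-length count vector.

```python
import re
from collections import Counter

# Hypothetical, deliberately tiny stop-word list; NLTK's English list is much longer.
STOPWORDS = {"a", "an", "the", "is", "to", "of", "and", "i", "my", "it", "for", "in", "on"}

def clean_tweet(text):
    """Lowercase, drop URLs and @mentions, keep alphanumeric tokens, remove stop words."""
    text = re.sub(r"https?://\S+|@\w+", " ", text.lower())
    tokens = re.findall(r"[a-z0-9]+", text)
    return [t for t in tokens if t not in STOPWORDS]

def bag_of_words(docs):
    """Return (vocabulary, vectors): one fixed-length count vector per document."""
    vocab = sorted({tok for doc in docs for tok in doc})
    index = {tok: i for i, tok in enumerate(vocab)}
    vectors = []
    for doc in docs:
        vec = [0] * len(vocab)
        for tok, n in Counter(doc).items():
            vec[index[tok]] = n
        vectors.append(vec)
    return vocab, vectors

# Made-up sample tweets, for illustration only.
tweets = [
    "Got my #COVIDvaccine today, feeling grateful! https://example.com",
    "@someone Is the #COVIDvaccine safe? Hesitant about the vaccine.",
]
docs = [clean_tweet(t) for t in tweets]
vocab, vectors = bag_of_words(docs)
```

Every vector has the same length as the vocabulary, which is what makes the representation usable as input to a classifier or to the tf-idf weighting.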

This hashtag's total number of tweets is 230,672, with 230,623 unique tweets (till April 8, 2021). The author selected a subset of the collected tweets (16/12/2020 to 26/1/2021) and parsed the texts using the Natural Language Toolkit in Python. Then, stop words (such as articles and common words) were eliminated from the collected corpus. Afterward, the author used the tf-idf algorithm to generate keywords (Schvaneveldt et al. 1976). The n-gram model was then used to identify the top 1-grams (1-g; n = 1; single words), bi-grams (n = 2; two adjacent words), and tri-grams (n = 3; three adjacent words).
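As a rough illustration of tf-idf scoring over n-grams, the sketch below uses the textbook weighting tf × log(N / df) in plain Python. This is an assumption: the paper does not state which tf-idf variant was used, and library implementations such as sklearn's apply a smoothed formula, so absolute scores will differ. The `ngrams` and `tfidf_scores` helpers and the toy documents are hypothetical.

```python
import math
from collections import Counter

def ngrams(tokens, n):
    """Return every run of n adjacent tokens, joined by a space."""
    return [" ".join(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]

def tfidf_scores(docs, n):
    """Sum over documents of tf(gram, doc) * log(N / df(gram)) for each n-gram."""
    grams_per_doc = [ngrams(doc, n) for doc in docs]
    num_docs = len(docs)
    # Document frequency: in how many documents each n-gram appears.
    df = Counter(g for grams in grams_per_doc for g in set(grams))
    scores = Counter()
    for grams in grams_per_doc:
        if not grams:
            continue  # document shorter than n tokens
        for gram, count in Counter(grams).items():
            tf = count / len(grams)
            scores[gram] += tf * math.log(num_docs / df[gram])
    return scores

# Toy tokenized tweets (already cleaned), for illustration only.
docs = [
    ["take", "covidvaccine", "today"],
    ["take", "covidvaccine", "safe"],
    ["honest", "discussion", "covidvaccine"],
]
for n in (1, 2, 3):
    print(n, tfidf_scores(docs, n).most_common(3))
```

With this unsmoothed formula, a term that occurs in every document (here `covidvaccine`) receives a score of zero, which is why the hashtag term itself does not dominate the ranking of single words.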

3.2 Data analysis and evaluation

After cleaning the data and removing the stop words, the author vectorized the tweets using the sklearn library and built the corpus of words from the tweets. A corpus analysis was carried out, and common words and emoji related to sentiments were identified, such as {'happi'; 'nice'; 'good'; 'bad'; 'sad'; 'mad'; 'best'; 'pretti'; 'critical'; 'decline'; 'die'; 'disaster'; 'collapse'; 'lockdown'; 'angry'; 'risk'; 'serious'; 'homeless'; 'scam'; 'reject'; 'efficient'; 'vaccine'}. Each tweet is then compared with these keywords and assigned to one of two groups: positive or negative.
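One simple way to realize such a keyword comparison is sketched below. The split of keywords into a positive and a negative set is an assumption made for illustration, since the text lists the keywords without stating the polarity assigned to each; the `classify` helper, its tie-handling rule, and the sample tweets are likewise hypothetical.

```python
# Hypothetical polarity split of some keywords from the corpus analysis;
# the source lists the keywords but does not say which group each belongs to.
POSITIVE = {"happi", "nice", "good", "best", "pretti", "efficient"}
NEGATIVE = {"bad", "sad", "mad", "angry", "risk", "scam", "reject", "disaster", "die"}

def classify(tokens):
    """Label a tokenized, stemmed tweet by comparing positive and negative keyword hits."""
    pos = sum(1 for t in tokens if t in POSITIVE)
    neg = sum(1 for t in tokens if t in NEGATIVE)
    if pos > neg:
        return "positive"
    if neg > pos:
        return "negative"
    return "unclassified"  # no keyword found, or a tie (an assumption of this sketch)

# Made-up stemmed tweets, for illustration only.
tweets = [
    ["got", "covidvaccin", "feel", "good", "happi"],
    ["covidvaccin", "scam", "risk"],
    ["covidvaccin", "appoint", "tomorrow"],
]
labels = [classify(t) for t in tweets]
```

The keywords are written in stemmed form (for example 'happi'), so the tweets must be stemmed with the same stemmer before the comparison.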

4 Results and discussion

In this section, the author investigates, in detail, the natural language processing of the tweets related to the main hashtag #COVIDvaccine.

4.1 Use of the TAGS tool

The author used the TAGS monitoring tool, freely offered by Martin Hawksey, to collect and monitor the tweets on the #COVIDvaccine hashtag. This tool helped in the social media analytics of the tweets and identified the trends in using a specific hashtag.

As shown in Fig. 3, more than 230,000 tweets were collected. The tool monitors the tweets daily. As shown, the top tweeter is VaccineCa. On the right side of the figure, the Twitter activity for the last three days is shown.

Fig. 3 TAGS Archive for #COVIDvaccine hashtag

4.2 Data pre-processing and n-gram analysis

The TAGS tool helped track the tweets' activity over a period and show their trends analytically. Then, using Python code, the author analyzed the selected tweets in more depth (based on data collected from December 16, 2020, to January 26, 2021) and applied tf-idf analysis to them. The results include the 1-g, bi-gram, and tri-gram keywords, as well as the collected hashtags.

Table 2 lists the top twenty 1-g from the tf-idf analysis. The words are organized by type, availability, and feelings. As shown, most of the top 1-g concerning the type are related to allergy, care, effectiveness, safety, reactions, and isolation. Other words in the list are associated with the availability of the vaccine, as shown in the table. The remaining words in the 1-g analysis relate to feelings, such as bragging, convinced, grateful, happy, hesitant, and honest. This shows that the analysis already surfaces some sentiments related to the vaccine.

Table 2 tf-idf analysis—1-g model

Table 3 lists some top tweets that appear in the data analysis. As shown, some of the feelings are carried in the tweets.

Table 3 Selected tweets related to the 1-gram analysis

Besides the 1-g analysis, the 2-g and 3-g models are also considered. Table 4 and Table 5 list the most common 2-g and 3-g, respectively, based on availability and feelings. As shown, several 2-g address the safety of the vaccine, people's awareness, and hesitancy towards the vaccine. The tweets show that people have mixed feelings about the vaccine. The bi-gram ‘take covidvaccine’ appears often in the tweets, as does the bi-gram ‘covidvaccine bragging’, which points to tweets in which people are perceived as boasting about the vaccine.

Table 4 tf-idf analysis—2-g model
Table 5 tf-idf analysis—3-g model

The 3-g analysis is closely related to the 1-g and 2-g analyses. As shown in Table 5, similar results appear for the feelings. Some new keywords are introduced, such as ‘2nd nation number’ and ‘community take covidvaccine’, which relate to the importance of the vaccine and to people taking it. For the feelings, new phrases appear, such as ‘excited dose wrap’, ‘reactions covidvaccine real’, and ‘honest discussion covidvaccine’.

Table 6 lists extracts of tweets related to the 3-g analysis above. As shown, some of the sentiments are not directly related to the vaccine itself; for example, the 3-g ‘government convinced people’ comes from a comparison between the vaccine and cigarettes. Another tweet is associated with the COVID-19 vaccine experiment and the excitement of taking it. The tweet analysis therefore retrieves mixed feelings.

Table 6 Selected tweets related to the 3-g analysis

4.3 Data classification and sentiment analysis

The collected tweets are vectorized using the sklearn library, and the corpus of words is built. As stated in Sect. 3, the corpus analysis identified specific keywords and emojis related to sentiments. The tweets are therefore divided into two groups: positive and negative tweets. The tweet analysis went through the following steps:

  1. Identification of most common keywords

  2. Division of the tweets into positive and negative tweets

  3. Stemming and cleaning the tweets

  4. Calculating the frequency of the keywords in the tweets

Table 7 summarizes three tests done on the tweets after stemming and removing the stop words.

Table 7 Tweets analysis and keywords classification

Several keywords representing sentiments appeared in the tweets' analysis in the different trials based on the tests, as shown in Figs. 4 and 5.

Fig. 4 Tweets keywords classification: (a) first test; (b) second test

Fig. 5 Tweets keywords classification: third test

Both figures above show the distribution of the keywords in the selected tweets for each test. As shown in the figures, the more tweets are selected, the more keywords are identified. Note that a logarithmic function is used to scale the data and avoid showing large values in the graph. The keywords are classified based on their frequencies, separating the positive and negative sentiments. This classification brings back some common words retrieved in the n-gram analysis, such as ‘lockdown’, ‘happy’, ‘scam’, and ‘risk’. Based on the results, people's sentiments are mixed, and both positive and negative emotions emerge from the tweet analysis.
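The frequency counting and logarithmic scaling mentioned above can be sketched as follows. The exact transform used for the figures is not specified in the text, so the base-10 form log10(1 + c) is an assumption, chosen because it maps a count of zero to zero. The `keyword_frequencies` and `log_scale` helpers and the sample data are hypothetical.

```python
import math
from collections import Counter

def keyword_frequencies(tweets, keywords):
    """Count how often each keyword occurs across a list of tokenized tweets."""
    counts = Counter()
    for tokens in tweets:
        counts.update(t for t in tokens if t in keywords)
    return counts

def log_scale(counts):
    """Map each raw count c to log10(1 + c) so that large counts do not dominate a chart."""
    return {keyword: math.log10(1 + c) for keyword, c in counts.items()}

# Made-up stemmed tweets and keyword set, for illustration only.
keywords = {"lockdown", "happi", "scam", "risk"}
tweets = [
    ["lockdown", "risk", "covidvaccin"],
    ["happi", "covidvaccin"],
    ["risk", "scam", "risk"],
]
raw = keyword_frequencies(tweets, keywords)
scaled = log_scale(raw)
```

Because the logarithm compresses large values, a keyword that occurs a thousand times more often than another appears only about three units higher on the scaled axis, which keeps rare keywords visible in the same chart.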

5 Conclusion

This study explores and analyzes people's sentiments on the vaccine during the COVID-19 pandemic by using natural language processing algorithms for sentiment analysis to classify texts and extract the corresponding polarity. As shown, the sentiments are mixed, with some tweets expressing positive feelings and others negative ones. The main keyword used for retrieving the tweets is #COVIDvaccine. From this work, the author concludes that the analysis of microblogging platforms such as Twitter can help determine people's opinions and feelings. This work focused mainly on the COVID-19 vaccine and what tweeters feel about it. Tweets were collected, analyzed, and classified based on keyword frequency.

For future work, more testing will be done on the tweets using machine learning techniques, comparing the results with those of the NLP techniques and generalizing the algorithm to different hashtags and other applications.