1 Introduction

Coronavirus Disease 2019 (COVID-19) is a rapidly spreading illness caused by the Severe Acute Respiratory Syndrome Coronavirus 2 (SARS-CoV-2; Khan et al. 2020). On March 11, 2020, the WHO officially classified the COVID-19 outbreak as a global pandemic affecting countries on all inhabited continents (Cucinotta 2020). Since December 2019, when the first cases were reported in Wuhan, China, the number of infections and fatalities worldwide has risen rapidly (Dong et al. 2020). Both the pandemic itself and the policy measures adopted to curb its spread have had unprecedented economic and social impacts (Nicola et al. 2020), affecting the lives of billions of people. Because of the virus's high infection and death rates and its potential for asymptomatic transmission, governments have implemented a wide range of policies to mitigate COVID-19's spread and impact. Such actions began with the Chinese government's order to quarantine Wuhan on January 23rd, 2020, followed relatively quickly by multiple countries declaring states of emergency and implementing strict quarantine and social distancing measures (Nussbaumer-Streit et al. 2020).

Unsurprisingly, COVID-19’s spread has been accompanied by a deluge of opinion, commentary, information, and misinformation circulating on social media platforms. Indeed, the social distancing measures required to slow the virus's spread might themselves be encouraging more people to turn to social media to share their experiences.

Infodemiology, or the study of (mis)information's diffusion via digital media, has been a growing concern since the World Wide Web's early years (Eysenbach 2002), but the pandemic has prompted an outpouring of hundreds of articles on the subject. These discussions center, in particular, on the concept of an "infodemic," an "overabundance of information—some accurate and some not—that occurs during an epidemic" (Tangcharoensathien et al. 2020). Not only is there a deluge of information from legacy and social media sources; Gazendam et al. (2020) also document exponential growth in medical journal publications—many of them opinion pieces or commentaries—related to the pandemic.

While social media users should by no means be assumed representative of the general public or public opinion (Baumann et al. 2020; Mellon & Prosser 2017), social media is a critical, if also critically flawed, medium of civic discourse (Kruse et al. 2017). Indeed, the infodemiological perspective implies that the dynamics of digital discussions interact with human behavior and the virus's spread in complex ways. Responding to this interest, several social media datasets related to COVID-19 became available at the beginning of the pandemic (e.g., Facebook (Shahi et al. 2020), news articles (Zhou et al. 2020), Instagram, Reddit (Zarei et al. 2020)).

Our primary interest here, however, is in the numerous Twitter datasets published since the pandemic began. Twitter is a widely used social media platform whose political importance is anchored not only in its millions of daily users but also its (ab)use by elites (Abokhodair et al. 2019; Wells et al. 2020).

While the open availability of numerous Twitter datasets is of great use to researchers, these resources differ in the number, timing, and language of tweets collected, as well as the search keywords used for collection (see Appendix 1). Moreover, while these datasets are a valuable source of text data related to the pandemic, users often still must implement their own Natural Language Processing techniques, which can be computationally intensive, if they are to make meaningful use of this unstructured data. This additional barrier might limit some datasets’ utility for interested researchers.

Hoping to facilitate further use of Twitter data for analyzing the COVID-19 infodemic, this work presents a dataset of tweets collected from all over the world, in multiple languages, starting from January 22nd, 2020. The English and Spanish tweets have been augmented using state-of-the-art Twitter Sentiment and Named Entity Recognition algorithms, providing additional structure for the raw tweet text and obviating the need to rehydrate tweets from Tweet IDs for many research purposes (e.g., sentiment analysis, social media network analysis). In addition to metadata at the level of individual tweets, hourly summary statistics of hashtags and mentions are provided, which are suitable for semantic network analysis applications. The data collection process and descriptive statistics for the dataset are presented below.

2 Uses of Twitter data amid the COVID-19 pandemic

Numerous studies use Twitter data to develop insights related to COVID-19. These analyses range considerably in focus, covering issues such as misinformation, conspiracy theories, and public health surveillance. They further differ substantially in scope. Some studies provide close readings of hundreds of tweets, while other work monitors hundreds of millions (Abdul-Mageed et al. 2020; Larson 2020).

Several studies investigate Twitter data's potential to serve as a tool for public health monitoring amid the pandemic. Al-Garadi et al. (2020), for example, build on Sarker et al.'s (2020) collection and identification of COVID-19 symptoms, developing a text classifier to monitor tweets for epidemiological purposes. Qin et al. (2020) present a social media search index that might be used to predict numbers of new COVID-19 cases. Mackey et al. (2020), finally, use Twitter data to look for signs of COVID-19 symptoms. Others consider how such data might help assess public responses to pandemic measures. Cotfas et al. (2021), for example, use supervised classification to identify positive, negative, and neutral attitudes toward COVID-19 vaccines in the first month after the initial successful vaccine announcement. Nurdeni et al. (2021) present a similar analysis of responses to the Sinovac and Pfizer vaccines in Indonesia. The dataset provided here can facilitate these types of analyses thanks to the Sentiment and Named Entity Recognition algorithms implemented to augment it.

A much larger set of studies, however, uses Twitter data for infodemiological purposes, tracking information and misinformation flows. These studies are too numerous to detail here, but a few prominent examples illustrate the diversity of applications. Gallagher et al. (2020), for example, use a panel of US registered voters' retweeting habits to identify the Twitterverse's authority elites on COVID-19 and the demographic features of their respective followers. Fang and Costas (2020) observe how research on COVID-19 is cited in tweets, while Gligorić et al. (2020) examine engagement with scientific and governmental authorities. Yang et al. (2020a, b) find that links to low-credibility sources accounted for more tweeted URLs in March 2020 than links to the CDC, while Pulido et al. (2020), analyzing a sample of 1,000 tweets, observe that false information was tweeted more often than true information but retweeted less often. Al-Rawi and Shukla (2020) identify the 1,000 most active accounts mentioning COVID in a population of approximately 50 million tweets, finding around 12% to be bots, most of which retweeted news from mainstream outlets, though some also appeared to boost survivalist discourse. Yang et al. (2021), similarly, find evidence of bot activity accelerating shares of links to websites known to post low-credibility content on COVID-19, though verified accounts of these low-credibility outlets, combined with densely connected clusters of accounts retweeting common sources, appear to account for much of the spread. While not all are applied to tweets, misinformation classifiers are also burgeoning (Ameur et al. 2021; Elhadad et al. 2020; Malla & Alphonse 2021; Shaar et al. 2021; Zeng & Chan 2021). All this research notwithstanding, Wicke and Bolognesi (2021) identify a relatively small number of topics in tweets posted between March 20 and July 1, 2020, which, while responsive to major events, occur in remarkably stable proportions across the period.

Another subset of studies focuses on prejudice and conspiracy theories linked to the pandemic. Ferrara (2020) studies the role of bots in amplifying COVID-19 conspiracy theories. Vidgen et al. (2020) present a classifier, trained on a 20,000-tweet dataset, to identify anti-Asian prejudice fomented by the pandemic. Tahmasbi et al. (2020) use text mining to trace the growth and evolution of anti-Chinese hate speech on Reddit and Twitter, and Rodrigues de Andrade et al. (2021) conduct a similar analysis combining sentiment and mentions of China to study popular COVID geopolitics in Brazil. Shahrezaye et al. (2020) investigate conspiracy narratives in a sample of 9.5 million German-language tweets, finding very low rates of both conspiracy narratives and bot activity. Ziems et al.'s (2020) COVID-HATE dataset provides access to ego networks of accounts with machine-classified instances of hate and counterspeech, while Li, Y., et al. (2020) study both stigma and conspiracy theories using manual coding of 7,000 tweets. The combination of sentiment, hashtags, mentions, and named entities in a single large dataset, as presented in this work, facilitates the analysis of social prejudice toward particular populations, individuals, locations, or organizations.

Other studies use Twitter to analyze attitudes and emotional responses to the pandemic (Garcia & Berton 2021; Kydros et al. 2021; Tyagi et al. 2021; Venigalla et al. 2020). Abd-Alrazaq et al. (2020), for instance, combine sentiment and topic modeling to identify issues and stances toward them. Yin et al. (2020) combine topic modeling and sentiment analysis to track emotional reactions to different aspects of the pandemic. Jiang et al. (2020) leverage geographic political polarization in the United States to study how political differences affect pandemic debates. Aiello et al. (2020) augment Chen et al.'s (2020) dataset, using topic modeling and sentiment analysis to study the evolution of English-language debate across the early months of the pandemic according to a model of epidemic psychology. Thelwall and Thelwall (2020) observe differences in word and topic usage by gender.

Dozens of openly available COVID-19-related Twitter datasets exist as of the time of writing (see Appendix 1). Nevertheless, the majority focus on the pandemic's earlier months, generally running from late January or early February to somewhere between March and June 2020. Moreover, most of the available datasets include only Tweet IDs and limited metadata. Researchers therefore need to rehydrate these datasets, a process in which users retrieve the full Twitter data using the Tweet IDs. While not technically complicated, rehydration can be time consuming (Chen et al. 2020). Some datasets provide additional structure for the tweets, in the form of sentiment analysis or topic modeling results, but only two datasets, one from Feng and Zhou (2020) and one from Gupta et al. (2020), provide both topics and sentiments, which together might allow researchers to bypass rehydration. Gupta et al.'s (2020) dataset is the larger of the two, at approximately 63 million tweets. Still, it runs only from late January to the first of July 2020, covering only the initial months of the pandemic. Furthermore, while topic modeling may be useful for some researchers, it is a complicated process that often must be tailored to specific interests and needs. Named Entity Recognition (NER), which attempts to identify specific referents, may be more generally applicable for research purposes. However, only one dataset of 8.2 million tweets, created by Dimitrov et al. (2020), features named entity information, and it covers only the period through April 2020.

The dataset we present overcomes several limitations of existing datasets. It encompasses a significantly larger corpus of tweets, augmented with Sentiment and NER algorithms as well as hashtag, mention, like, and retweet data. It can extend the types of studies outlined above by letting researchers use named entities, hashtags, mentions, likes, and retweets to identify a relevant corpus without rehydrating voluminous datasets. Further, these augmentations should allow researchers to perform several kinds of social media network analysis and potentially identify clusters of interest (e.g., clusters of misinformation or negative discourse) without rehydration. Because the dataset covers nearly the entire duration of the pandemic up to the time of writing, it provides a valuable resource, in particular, for researchers wishing to track changes in topics of discussion, relevant actors, sentiment, and hate speech over time.

3 Dataset

The dataset contains over 1.7 billion tweets related to COVID-19 (as of June 2021), collected on an ongoing basis and processed with both Sentiment analysis and Named Entity Recognition algorithms. These two operations were selected because they are computationally intensive and because, as previous studies show (see Sect. 2), they provide sufficient data about a given tweet to be useful for future research without further hydration. In addition to these tweet-level data, hourly summaries of hashtags, mentions, and the correspondence of hashtags and mentions at the tweet level, suitable for semantic network analysis, are also provided.

3.1 Data collection process

The dataset presented here has been collected continuously using the Standard Twitter API since January 22nd, 2020. Tweets are collected using Twitter's trending topics and selected keywords. These keywords include virus and coronavirus (since 1/22/2020); ncov19 and ncov2019 (since 2/26/2020); covid (since 3/22/2020); rona (since 4/22/2020); ramadandirumah ("Ramadan at home" in Bahasa Indonesia), dirumahaja ("just (staying) at home" in Bahasa Indonesia), and stayathome (since 5/6/2020); and mask and vaccine (since 11/18/2020). Moreover, the Twitter dataset from Chen et al. (2020) was used to supplement the dataset presented in this work by hydrating non-duplicated tweets.
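As a rough illustration of this kind of keyword-based collection (a minimal sketch, not the team's actual pipeline, which is available in the GitHub repository), the following Python code streams matching tweets with the tweepy library; the credentials and output file are placeholders.

```python
# Minimal sketch of keyword-based tweet collection with tweepy (v4.x).
# Credentials and the output path are placeholders; the real pipeline
# also draws on Twitter's trending topics.
import json
import tweepy

KEYWORDS = ["virus", "coronavirus", "ncov19", "ncov2019", "covid", "rona",
            "ramadandirumah", "dirumahaja", "stayathome", "mask", "vaccine"]

class CovidStream(tweepy.Stream):
    def on_status(self, status):
        # Append each incoming tweet's raw JSON payload to a local file.
        with open("tweets_raw.jsonl", "a", encoding="utf-8") as f:
            f.write(json.dumps(status._json) + "\n")

stream = CovidStream("CONSUMER_KEY", "CONSUMER_SECRET",
                     "ACCESS_TOKEN", "ACCESS_TOKEN_SECRET")
stream.filter(track=KEYWORDS)  # blocks, writing tweets as they arrive
```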

As COVID-19's impact grew around the world, the research team devoted more computing resources to collecting pandemic-relevant tweets. This is one reason the number of tweets increases significantly in specific periods (see Fig. 3). Moreover, given the way Twitter's API serves tweets, it cannot be guaranteed that the tweets in the dataset are a representative sample of all tweets at a given moment. Users of the data should therefore normalize the data or select appropriate subsets when conducting temporal analyses.

3.2 Data description

The dataset is organized by hour (UTC), and each hour contains seven tables: (1) "Summary_Details", (2) "Summary_Hashtag", (3) "Summary_Mentions", (4) "Summary_Sentiment", (5) "Summary_NER", (6) "Summary_Sentiment_ES", and (7) "Summary_NER_ES". Table 1 describes these seven summary tables. For example, given a retweet of the original tweet shown in Fig. 1, Fig. 2 shows the information contained in the five tables relevant to this data point. The "Tweets_ID" feature serves as the primary key connecting all the tables (a join sketch follows Fig. 2). There is no information about this tweet in the Spanish Sentiment and NER tables because it is an English tweet. A detailed description of the features in each table can be found in the GitHub repository of the dataset [https://github.com/lopezbec/COVID19_Tweets_Dataset].

Table 1 List and description of dataset tables
Fig. 1 Example of tweet related to COVID-19

Fig. 2 Example of dataset tables
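To make the table structure concrete, the following sketch joins an hour's sentiment and NER records to the tweet details via the shared ID key using pandas. The file names are illustrative assumptions; consult the repository documentation for the exact schema.

```python
# Sketch: joining the hourly summary tables on the shared tweet ID key.
# File names are illustrative; see the dataset's GitHub documentation
# for the exact schema.
import pandas as pd

details = pd.read_csv("2020_03_15_00_Summary_Details.csv")
sentiment = pd.read_csv("2020_03_15_00_Summary_Sentiment.csv")
ner = pd.read_csv("2020_03_15_00_Summary_NER.csv")

# One row per tweet, with its sentiment label attached.
tweets = details.merge(sentiment, on="Tweets_ID", how="left")

# The NER table holds one row per detected entity, so this join can
# duplicate tweet rows; aggregate or filter as your analysis requires.
tweets_entities = tweets.merge(ner, on="Tweets_ID", how="left")
print(tweets_entities.head())
```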

The English and Spanish tweets are augmented using state-of-the-art Twitter Sentiment and Named Entity Recognition (NER) algorithms. For English sentiment, the dataset applies Cliche's (2017) algorithm, an ensemble of Convolutional Neural Networks and Long Short-Term Memory networks that, according to Otter et al. (2020), achieves state-of-the-art performance on multiple Twitter benchmarks. Comparing several approaches to sentiment analysis on a sample of COVID-19 tweets, Rustam et al. (2021) find this type of approach to have an accuracy of approximately 75–80%. For each English tweet, the algorithm generates a vector of non-normalized predictions for three sentiment classes (neutral, positive, and negative) and assigns the tweet to the class with the highest predicted probability. For English-language NER, we used Akbik et al.'s (2019a, b) algorithm, which takes a pooled contextualized embedding approach; specifically, we applied the state-of-the-art English NER pre-trained model provided by Akbik et al. (2019b). Similarly, for all the Spanish tweets collected, we applied the Spanish NER pre-trained model provided by Yu et al. (2020). For each English and Spanish tweet, the NER algorithms identify all location (LOC), person (PER), organization (ORG), and miscellaneous (MISC) named entities, along with the predicted probability of each (i.e., NER_Label Prob). For the sentiment of Spanish tweets, we used the pre-trained Spanish-language neural network model provided by the Python library sentiment-analysis-spanish 0.0.25. This model predicts the probability that a given Spanish tweet has a positive sentiment, which is then used to label the tweet as positive, neutral, or negative.
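The English NER models of Akbik et al. ship with the Flair library, and the Spanish sentiment scores come from the sentiment-analysis-spanish package. The sketch below applies both to single tweets; the model identifier and the 0.3/0.7 thresholds for mapping the Spanish positivity probability onto three classes are illustrative assumptions, not necessarily the dataset's exact configuration.

```python
# Sketch: Flair NER tagging and Spanish sentiment scoring for one tweet.
# The model identifier and class thresholds are illustrative assumptions.
from flair.data import Sentence
from flair.models import SequenceTagger
from sentiment_analysis_spanish import sentiment_analysis

# Pre-trained English NER tagger distributed with Flair (Akbik et al.).
tagger = SequenceTagger.load("ner")

sentence = Sentence("WHO declares COVID-19 a pandemic in Geneva")
tagger.predict(sentence)
for entity in sentence.get_spans("ner"):
    print(entity.text, entity.tag, entity.score)  # e.g., "WHO ORG 0.99"

# Spanish sentiment: the model returns a positivity probability in [0, 1].
spanish = sentiment_analysis.SentimentAnalysisSpanish()
prob = spanish.sentiment("La vacuna es una gran noticia para todos")
label = "positive" if prob > 0.7 else "negative" if prob < 0.3 else "neutral"
print(prob, label)
```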

The English sentiment algorithm processes a tweet in 0.072 s on average using a 2.1 GHz CPU (i.e., 100 million tweets in approximately 83.29 days), but it can easily be parallelized. The NER algorithms, by contrast, cannot be easily parallelized; they process a tweet in 0.069 s on average using a single GeForce RTX 2070 1.62 GHz GPU (i.e., 100 million tweets in approximately 79.83 days). These costs underscore the value of making the sentiment and NER information available to the community, since other researchers need not spend the time and computational resources to extract it.
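Because sentiment scoring is independent per tweet, it parallelizes naturally across CPU cores. A minimal sketch using Python's multiprocessing follows; score_sentiment is a hypothetical stand-in for the ensemble classifier.

```python
# Sketch: parallelizing per-tweet sentiment scoring across CPU cores.
# score_sentiment is a hypothetical placeholder for the CNN/LSTM
# ensemble; in practice each worker process loads its own model copy.
from multiprocessing import Pool

def score_sentiment(text: str) -> str:
    # Placeholder: run the classifier and return the arg-max class.
    return "neutral"

def label_tweets(texts, workers=8):
    with Pool(processes=workers) as pool:
        # A large chunksize keeps inter-process overhead low.
        return pool.map(score_sentiment, texts, chunksize=1000)

if __name__ == "__main__":
    print(label_tweets(["stay home, stay safe", "wash your hands"]))
```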

3.3 Descriptive statistics

The average hourly number of tweets collected in the dataset was 151,355.31, roughly 3.6 million per day. The number of tweets collected increased every month, from 9,810,850 in January 2020 to a peak of 140,694,770 in August of the same year. Table 2 shows summary statistics for the daily number of tweets collected in each month through December 2020, including the daily average and total numbers of original tweets and retweets.

Table 2 Tweet summary statistics, by month

Table 3 shows the top five languages present in the dataset. English is the most prominent, accounting for 65.38% of the tweets, followed by Spanish with 13.13%; together, these two languages account for nearly four-fifths of the sample. This is the primary reason we decided to augment tweets in these two languages using the Sentiment and NER algorithms. Figure 3 presents the number of tweets collected over time in each of the top five languages.

Table 3 Distribution of tweets, by language
Fig. 3 Tweet frequency across top five observed languages

Information about the number of retweets, likes, and geolocation was also collected. While the dataset contains more than 4 million tweets with geolocation information, these represent just 0.23% of the total. Figure 4 presents the locations of the geolocated tweets collected since January 22nd, 2020, showing that most come from the US, Europe, and India.

Fig. 4 Map of tweets featuring geolocation information

The sentiment of all the English tweets was estimated using a state-of-the-art Twitter Sentiment algorithm. The dataset contains a total of 513,045,254 English tweets classified as negative (45.3%), 109,730,664 as positive (9.7%), and 510,487,085 as neutral (45.0%). Figure 5 shows the sentiment of all English-language tweets as of May 11th, 2021, while Fig. 6 shows the daily proportion of English tweets by sentiment (i.e., counts normalized by daily volume). As Fig. 6 shows, the proportion of negative English tweets increased in late June 2020 and then decreased steadily by the end of January 2021. These changes may be related to events in the US during that time frame, such as the Black Lives Matter protests and new death forecasts from the CDC in late June 2020, the roll-out of vaccines, and a new US president taking office in late January 2021.

Fig. 5 Sentiment of English-language tweets

Fig. 6 Daily proportion of English-language tweets by sentiment

Similarly, the sentiment of all the Spanish tweets was estimated using a Spanish-language neural network sentiment model. The dataset contains a total of 189,137,429 Spanish tweets classified as negative (83.1%), 13,423,158 as positive (5.9%), and 24,997,639 as neutral (11.0%). Figure 7 shows the sentiment of all Spanish tweets, while Fig. 8 shows the daily proportion of tweets in each sentiment category. Comparing Figs. 7 and 8 with Figs. 5 and 6 makes clear that the Spanish corpus contains proportionately more negative tweets than the English one.

Fig. 7 Sentiment of Spanish-language tweets

Fig. 8 Daily proportion of Spanish-language tweets by sentiment

Lastly, we used the Named Entity Recognition algorithms to extract topics of conversation identified as persons, locations, organizations, and miscellaneous entities in both the English and Spanish tweets. Table 4 shows the top five mentions and hashtags over the entire dataset, as well as the top named entities across the English and Spanish tweets. As Table 4 shows, in some circumstances the Named Entity Recognition algorithm identifies the word "covid" as the person (PER) to which a tweet refers. Moreover, many of the words, mentions, and hashtags could potentially be grouped together given their meaning (e.g., covid19, covi, covi-19). Because we believe it is best to preserve as much of the raw data as possible and leave aggregation decisions to researchers, who are likely to have diverse applications, the dataset does not group these words together (a grouping sketch follows Table 4).

Table 4 Top 5 Mentions, hashtags, and named entities
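Researchers who do wish to group such variants can apply a simple normalization pass before counting; the variant-to-canonical mapping below is an illustrative assumption, not part of the dataset.

```python
# Sketch: collapsing variant spellings of an entity before counting.
# The variant mapping is an illustrative assumption; build one suited
# to your own corpus and research question.
import re
from collections import Counter

CANONICAL = {"covid19": "covid", "covid-19": "covid",
             "covi": "covid", "covi-19": "covid"}

def normalize(term: str) -> str:
    term = re.sub(r"[#@]", "", term.lower().strip())
    return CANONICAL.get(term, term)

terms = ["#COVID19", "covid", "Covi-19", "wuhan", "#Wuhan"]
print(Counter(normalize(t) for t in terms))
# Counter({'covid': 3, 'wuhan': 2})
```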

3.4 Data accessibility

The dataset described in this work is available on GitHub at https://github.com/lopezbec/COVID19_Tweets_Dataset. It is licensed under the Creative Commons Attribution-NonCommercial-ShareAlike 4.0 International Public License (CC BY-NC-SA 4.0). By using this dataset, researchers agree to abide by the stipulations in the license and to remain in compliance with Twitter's Terms of Service. Users who would like to obtain all the information the Twitter API provides will need to rehydrate the tweets using the code provided in the GitHub repository. The dataset is still being continuously collected and routinely updated.
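The repository ships its own rehydration code; purely as an illustration of what the step involves, rehydration with tweepy's v2 client looks roughly like the following (the bearer token is a placeholder, and the lookup endpoint accepts at most 100 IDs per request).

```python
# Sketch: rehydrating tweets from their IDs with tweepy's v2 client.
# Illustration only; the GitHub repository provides its own code.
import tweepy

client = tweepy.Client(bearer_token="BEARER_TOKEN")  # placeholder

def rehydrate(tweet_ids):
    tweets = []
    # The tweets-lookup endpoint accepts at most 100 IDs per request.
    for i in range(0, len(tweet_ids), 100):
        response = client.get_tweets(
            ids=tweet_ids[i:i + 100],
            tweet_fields=["created_at", "lang", "public_metrics"],
        )
        if response.data:
            tweets.extend(response.data)
    return tweets
```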

3.5 Possible use cases

While, as noted above, there have already been publications providing descriptive analyses of COVID-related tweets, few studies as yet move on to hypothesis testing and causal inference (e.g., Gencoglu & Gruber 2020). While numerous researchers might be interested in conducting such analyses, they may lack the expertise or computational tools to conduct large-scale Sentiment or Named Entity Recognition analyses. Even the requirement to gather tweets from tweet IDs, known as rehydration, can be an unnecessary barrier to using and reusing large-scale Twitter datasets. By making sentiment and named entity data readily available, this dataset allows researchers to bypass these time- and resource-consuming tasks and to use the dataset as an input into larger analyses. These might include aggregating sentiment data to create a variable for use alongside other data on the pandemic, as Gencoglu and Gruber (2020) do, or observing changes in sentiment in response to specific events, as in Rodrigues de Andrade et al. (2021) and Tahmasbi et al. (2021). In these cases, the researchers used data like these to causally model the relationship between disease spread and sentiment (Gencoglu & Gruber 2020) and to identify how major policy events or statements triggered outpourings of anti-Chinese sentiment (Rodrigues de Andrade et al. 2021; Tahmasbi et al. 2021), but these are only a few of the dataset's potential applications.
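For instance, a researcher could collapse the tweet-level sentiment labels into a normalized daily series suitable for merging with case counts or policy timelines, as the sketch below illustrates; the column names ("created_at", "sentiment") are assumptions about how a user has loaded the tables.

```python
# Sketch: turning tweet-level sentiment labels into a normalized daily
# series. Column names ("created_at", "sentiment") are illustrative.
import pandas as pd

def daily_sentiment_shares(df: pd.DataFrame) -> pd.DataFrame:
    df = df.copy()
    df["date"] = pd.to_datetime(df["created_at"]).dt.date
    counts = df.groupby(["date", "sentiment"]).size().unstack(fill_value=0)
    # Normalize by daily volume so trends reflect discourse, not
    # fluctuations in the collection rate (see Sect. 3.1).
    return counts.div(counts.sum(axis=1), axis=0)
```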

In addition to the possible uses of readily available COVID-19 sentiment data, the named entity data can help researchers track sentiment associated with particular persons, places, or organizations over time, or identify actors that tend to appear together. This supports analysis of the emergence and change of conceptual associations between entities over the course of the pandemic, providing a more nuanced picture of how online discourse on COVID-19 evolves than sentiment alone could. For example, Fig. 9 shows a network of the ten most frequent words from hashtags, mentions, and named entities in all the English tweets collected in January 2020. The nodes represent words. An edge's color represents the average sentiment of the tweets in which both words appear (red = negative sentiment, black = neutral), while its thickness represents how frequently the two words appear together in a tweet. The figure shows considerable discussion of China and Wuhan at the beginning of the pandemic, as these nodes have the largest numbers of edges (6 and 4, respectively). It also shows that tweets relating China, Wuhan, or Chinese to the mention "@realDonaldTrump" have the most negative sentiment overall. With the dataset presented here, researchers can aggregate common entities (like #Wuhan and wuhan, or the permutations of China) to create more complex semantic networks (a construction sketch follows Fig. 9), analyze their changes over time to better understand evolving public sentiment and discourse regarding the pandemic, and find potential clusters of misinformation and highly negative sentiment.

Fig. 9 Network generated from the augmented English tweets dataset
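A network like the one in Fig. 9 can be assembled with standard tools. The sketch below builds a weighted co-occurrence graph with networkx from (term pair, frequency, mean sentiment) records; the sample records are illustrative, not values drawn from the dataset.

```python
# Sketch: building a term co-occurrence network in the spirit of Fig. 9.
# Each record is (term_a, term_b, co-occurrence count, mean sentiment);
# the sample values are illustrative, not drawn from the dataset.
import networkx as nx

records = [
    ("china", "wuhan", 1200, -0.62),
    ("china", "@realDonaldTrump", 800, -0.80),
    ("wuhan", "#coronavirus", 950, -0.40),
]

G = nx.Graph()
for a, b, count, mean_sent in records:
    # Edge width ~ co-occurrence frequency; edge color ~ mean sentiment.
    G.add_edge(a, b, weight=count, sentiment=mean_sent)

# Degree identifies the most-connected terms, as discussed for Fig. 9.
print(sorted(G.degree, key=lambda kv: kv[1], reverse=True))
```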

To our knowledge, no other freely available dataset with sentiment analysis and named entity recognition covers as long a period as the one presented in this work. This makes the dataset potentially useful not only for studying the medium-term evolution of online discourse on COVID-19 but also as a historical document of the period: it can serve as an archive for future historians interested in studying an exceptional moment in contemporary history.

4 Conclusion and summary

The main objective of this work is to introduce and share with the research community one of the largest openly accessible datasets of tweets related to the COVID-19 pandemic, augmented with sentiment and named entity metadata. The team continues to collect tweets and routinely updates the dataset with Sentiment and NER annotations, producing summary files suitable for semantic network and other forms of analysis. The dataset should enable researchers to develop models, test hypotheses, and garner insights from a large archive of Twitter-derived data without needing to rehydrate tweets or conduct computationally prohibitive analyses.