cell_lang_counts.zip (1.9 MB)

Spatial distributions of languages extracted from Twitter

Version 3 2021-04-29, 10:53

Version 2 2021-03-30, 13:35

Version 1 2021-03-30, 13:33

dataset

posted on 2021-03-30, 13:33 authored by Thomas Louf

This is a collection of GeoJSON files giving, for every cell of a grid laid over several regions of interest, the number of users belonging to each local language group. The cells are squares defined in a projected coordinate system adapted to each country, and their side length is specified in the file names (cell_size=Xm).
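As a rough illustration of how such files might be read, the sketch below extracts the cell size from a file name and sums count properties over the cells of a GeoJSON FeatureCollection. The file name `region_cell_size=10000m.geojson` and the property names `count_ca` and `count_es` are hypothetical placeholders, not taken from the archive, and the assumption that the size is a whole number of metres should be checked against the actual files.

```python
import json
import re


def parse_cell_size(file_name):
    """Return the integer X found in 'cell_size=Xm', or None if absent.

    Assumes X is a whole number (presumably metres).
    """
    match = re.search(r"cell_size=(\d+)m", file_name)
    return int(match.group(1)) if match else None


def total_counts(geojson, count_keys):
    """Sum the given count properties over all features of a FeatureCollection."""
    totals = {key: 0 for key in count_keys}
    for feature in geojson["features"]:
        props = feature.get("properties") or {}
        for key in count_keys:
            # Treat a missing or null count as zero.
            totals[key] += props.get(key) or 0
    return totals


# Tiny in-memory stand-in for one file of the archive (hypothetical content).
example_text = json.dumps({
    "type": "FeatureCollection",
    "features": [
        {"type": "Feature", "geometry": None,
         "properties": {"count_ca": 12, "count_es": 30}},
        {"type": "Feature", "geometry": None,
         "properties": {"count_ca": 5, "count_es": None}},
    ],
})
example = json.loads(example_text)

cell_size = parse_cell_size("region_cell_size=10000m.geojson")
totals = total_counts(example, ["count_ca", "count_es"])
```

For a real file, `json.load` on the opened file would replace the in-memory example; the property names to sum have to be read from the file itself.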

These counts were obtained by processing geo-located tweets posted between 2015 and 2019 in these regions. The tweets were collected through the streaming API of Twitter, more specifically using the "statuses/filter" endpoint (see Ref. 1), which provides a real-time sample of tweets matching the supplied filters. Bounding box filters were set to collect tweets from a set of countries of interest. Before reproducing this method of data collection, one should bear in mind that the form and even the availability of this endpoint are subject to future changes introduced by the Twitter developer team. The code used for this processing and for visualizing these data is available on GitHub (see Ref. 2).
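To make the bounding box filtering concrete, here is a minimal sketch that makes no network calls. It formats boxes in the longitude, latitude order (south-west corner first) that the "statuses/filter" `locations` parameter expected, and checks whether a point falls inside a box. The helper names and the example box are illustrative assumptions, not the filters actually used for this dataset, and boxes crossing the antimeridian are not handled.

```python
def locations_param(boxes):
    """Format (west, south, east, north) boxes as one comma-separated string."""
    return ",".join(str(value) for box in boxes for value in box)


def in_box(lon, lat, box):
    """Return True if the point lies inside the box (edges included)."""
    west, south, east, north = box
    return west <= lon <= east and south <= lat <= north


# Illustrative box only (roughly the San Francisco area).
example_box = (-122.75, 36.8, -121.75, 37.8)

param = locations_param([example_box])
inside = in_box(-122.4, 37.7, example_box)
outside = in_box(2.17, 41.39, example_box)
```

A point-in-box check of this kind can also be applied after collection, since a box drawn around a country generally covers parts of its neighbours as well.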

Funding

The authors acknowledge funding from the Spanish Ministry of Science and Innovation, the AEI and FEDER (EU) under the grant PACSS (RTI2018-093732-B-C22) and the Maria de Maeztu program for Units of Excellence in R&D (MDM-2017-0711).

Keywords

computational sociolinguistics; multilingualism; Twitter; Computational Linguistics; Computational Physics; Complex Physical Systems; Language in Time and Space (incl. Historical Linguistics, Dialectology); Language in Culture and Society (Sociolinguistics)

Licence

CC BY 4.0
