ACTIV-ES: a comparable Spanish corpus comprised of film dialogue from Argentine, Mexican and Spanish productions

doi:10.5281/zenodo.1492613

Published November 20, 2018 | Version activ-es-v.02

Dataset Open

Jerid Francom¹

1. Wake Forest University

DESCRIPTION: ACTIV-ES is a comparable Spanish corpus comprised of film dialogue from Argentine, Mexican and Spanish productions. Titles for each of the three countries were seeded from the Internet Movie Database (IMDb). Subtitle data for the hearing impaired was provided by Opensubtitles.org, post-processed to correct or remove subtitle, OCR and diacritic artifacts, and annotated for part of speech.

The data is available in two main formats: 1) running text for each document and 2) aggregate n-gram files (1- to 5-grams). Each format includes a plain-text version and a part-of-speech-annotated version. Document names reflect the language code, country, year, title, type, genre (first genre listed in the IMDb), and IMDb ID.
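The naming scheme described above can be unpacked programmatically. The following is a minimal sketch only: it assumes the seven fields appear in the order listed and are joined by underscores, which the description does not state. The separator, the function name `parse_doc_name`, and the sample file name are all hypothetical, so check them against the actual files in the archive before relying on this.

```python
from typing import NamedTuple


class DocInfo(NamedTuple):
    language: str
    country: str
    year: int
    title: str
    doc_type: str
    genre: str
    imdb_id: str


def parse_doc_name(name: str, sep: str = "_") -> DocInfo:
    """Split a document name into its seven metadata fields.

    ASSUMPTION: fields are separated by `sep` (underscore by default) and
    follow the order given in the description. The title is taken as
    everything between the year and the last three fields, so a title that
    itself contains the separator is still recovered whole.
    """
    # Drop a trailing file extension, if any.
    stem = name.rsplit(".", 1)[0] if "." in name else name
    parts = stem.split(sep)
    if len(parts) < 7:
        raise ValueError(f"expected at least 7 fields, got {len(parts)}: {name!r}")
    language, country, year = parts[:3]
    doc_type, genre, imdb_id = parts[-3:]
    title = sep.join(parts[3:-3])
    return DocInfo(language, country, int(year), title, doc_type, genre, imdb_id)


# Hypothetical file name, for illustration only.
info = parse_doc_name("es_Argentina_2005_Example-Title_movie_Drama_0000000.txt")
print(info.country, info.year, info.genre)
```

Keeping the IMDb ID as a string preserves any leading zeros, which matters if the ID is later used to look a title up.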

For more information about the development and evaluation of these resources and to cite this work refer to:

Francom, J., Hulden, M. and Ussishkin, A. (2014) ACTIV-ES: a comparable, cross-dialect corpus of 'everyday' Spanish from Argentina, Mexico, and Spain. In Proceedings of the Ninth Annual Language Resources and Evaluation Conference, Reykjavik, Iceland. European Language Resources Association (ELRA).

In version .02, a tagged running-text version of the corpus annotated with the EAGLES tagset has been added in the /eagles directory. This tagset is considerably more detailed than the simplified tagset used in the /tagged directory. For information on the tagset, see: http://nlp.lsi.upc.edu/freeling/doc/tagsets/tagset-es.html.

Files (216.3 MB)

Name	Size	Checksum
francojc/activ-es-activ-es-v.02.zip	216.3 MB	md5:ef3cabc94f840c047a7555959fd77515
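The MD5 checksum listed above can be used to confirm that a downloaded copy of the archive is intact. The sketch below uses only Python's standard `hashlib` module. The local file name is an assumption (use whatever path the archive was saved to), and the helper names `md5_of_file` and `verify_archive` are illustrative.

```python
import hashlib
from pathlib import Path

# Checksum as published in the file listing.
EXPECTED_MD5 = "ef3cabc94f840c047a7555959fd77515"


def md5_of_file(path, chunk_size: int = 1 << 20) -> str:
    """Return the hex MD5 digest of a file, read in chunks to limit memory use."""
    digest = hashlib.md5()
    with open(path, "rb") as fh:
        for chunk in iter(lambda: fh.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def verify_archive(path, expected: str = EXPECTED_MD5) -> bool:
    """Compare a file's MD5 digest with the expected value (case-insensitive)."""
    return md5_of_file(path) == expected.lower()


if __name__ == "__main__":
    # ASSUMPTION: the archive was saved under this name in the current directory.
    archive = Path("activ-es-activ-es-v.02.zip")
    if archive.exists():
        print("checksum OK" if verify_archive(archive) else "checksum MISMATCH")
    else:
        print(f"{archive} not found; download it first")
```

MD5 is adequate here for detecting a corrupted or truncated download, which is what the published checksum is for. It is not a safeguard against deliberate tampering.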

Additional details

Is supplement to: https://github.com/francojc/activ-es/tree/activ-es-v.02

	All versions	This version
Views	537	534
Downloads	69	69
Data volume	17.1 GB	17.1 GB
