Published June 26, 2019 | Version v1
Conference paper | Open Access

From the Cookbook of Corpus-Based Lexical Lectometry: A Taste of Chinese

  • KU Leuven

Description

Lectometric approaches measure distances between language varieties (dialects, sociolects, registers, etc.) by aggregating over observed differences in the realizations of a set of linguistic variables. In lexical lectometry, a variable consists of the alternative lexical expressions for one concept. In corpus-based lectometry, the observed realizations are culled from stratified corpora. Measuring semantically defined variables in corpora, and aggregating over them, poses specific methodological challenges that have been tackled in a number of studies (Heylen & Ruette 2013; Ruette et al. 2014; Ruette, Ehret & Szmrecsanyi 2016) with different statistical techniques, including Distributional Semantic Models. Yet so far, no general framework for corpus-based lexical lectometry has been formulated that systematically describes the issues and options at each step of the procedure, so that it could be applied straightforwardly to new data and to languages other than English (Ruette, Ehret & Szmrecsanyi 2016), Dutch (Geeraerts, Grondelaers & Speelman 1999) and Portuguese (Soares da Silva 2010).
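To make the measurement concrete, the following is a minimal Python sketch of one common way to compute such an aggregated distance: each concept yields an onomasiological profile (the relative frequencies of its alternative expressions in a given lect), profiles are compared pairwise, and the comparison is averaged over concepts. This is an illustration only, not the implementation of any of the cited studies; the lect names and corpus counts are invented for the example.

    def profile(counts):
        # Turn raw corpus counts (expression -> frequency) into relative frequencies.
        total = sum(counts.values())
        return {expr: freq / total for expr, freq in counts.items()}

    def profile_distance(p, q):
        # Total variation distance between two onomasiological profiles
        # (0 = identical naming preferences, 1 = completely disjoint).
        exprs = set(p) | set(q)
        return 0.5 * sum(abs(p.get(e, 0.0) - q.get(e, 0.0)) for e in exprs)

    def lectometric_distance(data, lect_a, lect_b):
        # Aggregate over all concepts: mean profile distance between two lects.
        dists = [profile_distance(profile(data[c][lect_a]), profile(data[c][lect_b]))
                 for c in data]
        return sum(dists) / len(dists)

    # Invented counts for the concept BICYCLE in two hypothetical subcorpora.
    data = {
        "BICYCLE": {
            "mainland": {"自行车": 90, "单车": 7, "脚踏车": 3},
            "taiwan":   {"脚踏车": 75, "自行车": 20, "单车": 5},
        },
    }
    print(lectometric_distance(data, "mainland", "taiwan"))  # prints 0.72

Other profile comparisons, such as the (weighted) uniformity measures of Geeraerts, Grondelaers & Speelman (1999), slot into the same scheme by replacing profile_distance.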

This paper can be characterized as a twofold extension of these previous studies. First, it aims to establish a general framework for lexical lectometry research that systematically covers the options available at each step of the procedure. Second, we want to move beyond Indo-European languages by extending the framework to a typologically unrelated language, namely Chinese.

For the general framework, we propose that a lexical lectometry study should normally involve the following steps: (1) compilation of a lectally stratified corpus; (2) sampling of concepts as measuring points for the lectometry; (3) identification of the lexical expressions per concept; (4) disambiguation of the lexical expressions in the corpus data; (5) calculation of aggregated lexico-lectometric distances; (6) evaluation of measurement reliability and validity. For each step, we further provide possible options and caveats. For instance, steps 2 and 3 can rely on existing concept-based lexical databases, such as a synonym dictionary, or use corpus-driven keyword extraction and semantic vector space models. Step 4 can either make use of token-level distributional semantic models or rely on simpler n-gram language models, as sketched below.
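As an illustration of the first option in step 4, here is a minimal Python sketch of a token-level filter: a corpus hit is represented by the average of the type-level vectors of its context words and kept only if it is distributionally close enough to a centroid vector for the target concept. The vector table, the concept centroid and the threshold are hypothetical placeholders, not parameters proposed in the paper.

    import numpy as np

    def token_vector(context_words, type_vectors):
        # A simple token-level model: average the type-level vectors
        # of the words surrounding the corpus hit.
        vecs = [type_vectors[w] for w in context_words if w in type_vectors]
        return np.mean(vecs, axis=0) if vecs else None

    def cosine(a, b):
        return float(np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b)))

    def is_relevant_token(context_words, concept_centroid, type_vectors, threshold=0.4):
        # Keep a hit only if its context is distributionally close to the target
        # concept, filtering out homonyms and irrelevant senses before the
        # counts enter the onomasiological profiles.
        v = token_vector(context_words, type_vectors)
        return v is not None and cosine(v, concept_centroid) >= threshold

The simpler n-gram alternative mentioned above would instead score each hit with concept-specific language models and keep the reading that receives the higher probability.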

To assess the portability of the general framework, in both practical and linguistic-typological terms, we perform a lexical lectometric analysis of varieties of Chinese based on data from large-scale corpora of Mainland Chinese, Taiwan Chinese and Singapore Chinese.

Files

abstract_ICLAVE2019_submitted.pdf (188.6 kB)