Statistical mechanical approach to human language
Introduction
Human language has recently become a subject of interdisciplinary study. Linguistic studies have traditionally been qualitative rather than quantitative. Recently, attempts based on evolutionary game theory [1] have been made to understand language evolution, and these have yielded some noticeable results. In particular, interesting considerations were made in studies of competition between languages using mathematical [2] and computational models [3], [4], [5], [6], [7], [8], [9], [10].
In this paper, we propose that human language can be described as a physical system within the framework of equilibrium statistical mechanics. By defining a Hamiltonian analogue associated with words, we are able to explain basic properties of spoken languages, such as the universality of the exponent of Zipf law [11], and to predict reasonably well the form of the curve of vocabulary size versus age for young children. We thus demonstrate that statistical physics can provide an interesting formulation for the study of spoken languages and can unify properties that at first glance seem completely different, such as the frequency distribution of words and the children's vocabulary learning rate.
A rather remarkable feature, common to several languages, is the so-called Zipf law [1], which states that if we assign the value $r = 1$ to the most frequent word of a language, $r = 2$ to the second one, etc., then the frequency of occurrence of a word with rank $r$ is
$$f(r) \sim r^{-\alpha}. \qquad (1)$$
This law has been verified experimentally for several languages, with the value of the exponent $\alpha$ found to be universal and approximately equal to one. An alternative way to present Zipf law, which is also used in the literature, is to state that the proportion of words whose frequency is $f$ (taking values in the range 0–1) in a given sample text is modelled by a power function $P(f) \sim f^{-\beta}$. The exponent $\beta$ is related to the exponent $\alpha$ in Eq. (1) by the equation $\beta = 1 + 1/\alpha$. Although it is not immediately evident, the frequency–rank Zipf plot is equivalent to a plot of the cumulative distribution of $P(f)$ versus frequency $f$ [12], [13]. Ref. [12], in particular, contains a detailed proof of the above statement.
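As an illustration of the frequency–rank form of Zipf law described above, the following minimal sketch estimates the exponent from a list of tokens. It assumes the power-law form f(r) ~ r^(-alpha), with r the rank and alpha the exponent (symbol names chosen here), and uses an ordinary least-squares fit on the log–log data; this is a simple estimator for illustration, not necessarily the procedure used in this work. The function names and the synthetic token list are hypothetical.

```python
import math
from collections import Counter


def rank_frequency(tokens):
    """Return relative word frequencies ordered by rank (most frequent first)."""
    counts = Counter(tokens)
    total = sum(counts.values())
    return [c / total for c in sorted(counts.values(), reverse=True)]


def fit_zipf_exponent(freqs):
    """Least-squares slope of log f versus log r.

    Returns alpha under the assumed form f(r) ~ r^(-alpha).
    """
    xs = [math.log(r) for r in range(1, len(freqs) + 1)]
    ys = [math.log(f) for f in freqs]
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxx = sum((x - mean_x) ** 2 for x in xs)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    return -sxy / sxx


# Synthetic data (an assumption for illustration only): the word of
# rank r appears about 1000 / r times, so alpha should come out near 1.
tokens = [f"w{r}" for r in range(1, 51) for _ in range(round(1000 / r))]
alpha = fit_zipf_exponent(rank_frequency(tokens))
```

On real corpora the fit is usually restricted to an intermediate range of ranks, since the first and last parts of the plot tend to deviate from a straight line.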
Traditionally, statistical mechanics does not deal with human language. It deals with physical systems, i.e., with collections of atoms, molecules or other elementary particles. According to statistical mechanics, when a system of particles is in equilibrium at constant temperature $T$, it can be found in one of $N$ states. The probability that it is found in a given state $i$ with energy $E_i$ is proportional to $e^{-E_i/(k_B T)}$, the “Boltzmann factor,” where $k_B$ is the Boltzmann constant. The temperature $T$ is the “measure” of the interaction of the system with the environment.
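The Boltzmann factor can be made concrete with a short sketch that turns a list of state energies into normalized occupation probabilities. The energies, the temperatures, the choice k_B = 1 and the function name are all assumptions made for illustration; they are not values taken from this work.

```python
import math


def boltzmann_probabilities(energies, temperature, k_b=1.0):
    """Return p_i = exp(-E_i / (k_b * T)) / Z for each state i.

    Energies are shifted by their minimum before exponentiating; this
    leaves the probabilities unchanged and avoids overflow.
    """
    if temperature <= 0:
        raise ValueError("temperature must be positive")
    e_min = min(energies)
    weights = [math.exp(-(e - e_min) / (k_b * temperature)) for e in energies]
    z = sum(weights)  # partition function (up to the constant shift)
    return [w / z for w in weights]


# Hypothetical three-state system in units where k_b = 1.
energies = [0.0, 1.0, 2.0]
cold = boltzmann_probabilities(energies, temperature=0.5)
hot = boltzmann_probabilities(energies, temperature=5.0)
```

At the lower temperature the lowest-energy state dominates, while at the higher temperature the probabilities become more nearly equal.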
Section snippets
The basic assumption of the model
Suppose that an individual possesses a vocabulary of N words. We treat the language department of the human brain as a physical system that can be found in one of N states. Each state represents one word. There is a one-to-one mapping between these states (which are enumerated using integers up to N) and words in the individual's vocabulary. If the system is found in state i, then the word associated with state i is pronounced. We denote as “temperature”, T, a measure of the willingness (or
Divergence from the power-law at the initial part of a Zipf plot
It has been observed that while Eq. (1) is a straight line in log–log form, there are noticeable deviations in the early part of the line [15]. In Fig. 1, we plot the word occurrence versus the word rank, using experimental data that come from a corpus (a large collection of texts) consisting of publications in several Greek Internet sites up to May 2001, collected by Prof. Franz Guenthner at the University of Munich. It has been checked and used by T. Kyriacopoulou [16]. This corpus
Discussion and conclusions
The model proposed here predicts that if we measure the word frequency distribution of young children, dividing them into age groups, then the resulting Zipf plots for each age group will also exhibit power-law behaviour, but the slope will be different from that for adults. Some preliminary work in this direction using CHILDES
Acknowledgement
This work was partially supported by the Greek Ministry of Education via Herakleitos and Pythagoras Projects.
References (26)
- et al., Physica A (2005)
- Contemp. Phys. (2005)
- The LINGUIST List, Department of English Language and Literature, Eastern Michigan University,...
- et al., Proc. Natl. Acad. Sci. USA (1999)
- et al., Nature (2003)
- C. Schulze, D. Stauffer, Int. J. Mod. Phys. C 16 (2005) 718 and AIP Conf. Proc. 119 (2005) 49 (8th Granada...
- et al., Phys. Life Rev. (2005)
- et al., Europhys. Lett. (2005)
- Int. J. Mod. Phys. C (2005)
- Int. J. Mod. Phys. C (2006)
- Int. J. Mod. Phys. C
- Human Behavior and the Principle of Least Effort: An Introduction to Human Ecology
- Physica A