Rank-frequency distribution of natural languages: A difference of probabilities approach
Introduction
The statistical study of languages has attracted increasing interest over the last decades, following the pioneering works of Zipf [1] and others [2], [3], [4], [5], [6], [7]. These studies have focused on the rank-frequency distribution of words. More recently, the rank diversity distribution has been proposed as a novel measure to characterize the statistical properties of languages [8]. This distribution can be understood as a measure of how word ranks change in time (see also [9]). It has also shown that the size of the language core is similar for most languages. From this statistical-linguistics point of view, in previous work we introduced a simple Gaussian random walk model for the rank diversity that reproduced some of the observed features of the evolution of this quantity quite well [8].
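To make the notion of rank diversity concrete, here is a minimal sketch. It assumes the definition used in [8]: the number of distinct words that occupy a given rank over a set of time slices, divided by the number of slices. The function name `rank_diversity` and the toy word lists are invented for illustration and are not data from the study.

```python
def rank_diversity(rankings):
    """Return the rank diversity for each rank.

    rankings[t][k] is the word found at rank k + 1 in time slice t.
    Assumed definition: distinct words seen at a rank / number of slices.
    """
    n_slices = len(rankings)
    depth = min(len(time_slice) for time_slice in rankings)
    return [
        len({time_slice[k] for time_slice in rankings}) / n_slices
        for k in range(depth)
    ]


# Toy rankings for three time slices (hypothetical, for illustration only).
toy_rankings = [
    ["the", "of", "and", "to"],
    ["the", "of", "to", "and"],
    ["the", "and", "of", "in"],
]

diversity = rank_diversity(toy_rankings)
for rank, value in enumerate(diversity, start=1):
    print(f"rank {rank}: d = {value:.2f}")
```

In this toy case the first rank is always held by the same word, so its diversity is the smallest possible value (one over the number of slices), while the lower ranks change hands in every slice.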
Furthermore, in recent years much effort has been devoted to the study of complex networks associated with physical systems, biological organisms, and social organizations; the structure and dynamics of these networks are a matter of intense research [10], [11]. In previous works [12], we studied the evolution of complex networks in terms of a master equation (ME) describing birth–death stochastic processes, along the lines developed for ecological models [13], [14]. We showed that, under very general conditions in which a dynamic conflict (frustration) exists between positive and negative mechanisms, the frequency distribution as a function of rank is given by the ratio of two power laws. This is also the case for birth and death processes in ecology, and for the excitation–inhibition process of neurons in the central nervous system. In a large variety of systems composed of similar elements with similar interactions between them, the response of the system is determined by general laws. However, the response always differs between realizations of the same experiment; these differences can be associated, for instance, with the law of large numbers or the central limit theorem, and follow a normal (Gaussian) distribution. In these cases, the average values are the ones that depend on general laws, whereas the differences among realizations of the experiment obey a different dynamics, namely that of the law of large numbers.
In this work we use this point of view to study the frequency distribution of words in six languages [8]. In particular, we analyze the difference between the data associated with different realizations of these conflictive dynamics and the fits to the real data. We do this in terms of a time-dependent probability density distribution, by assuming that the dynamics of the rank distribution may be described by the ME of an underlying one-step, Markovian, birth–death stochastic process [12], [15]. As we have shown in previous work [16], the data describing the frequency of words of several languages can be fitted well by an asymptotic beta function. However, as will be shown below, there is always a small difference between the data and this fit. This difference motivates the present work, and one of our main objectives is to analyze and explain its origin within the context of the proposed stochastic model.
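The kind of difference between data and fit discussed here can be illustrated with a small sketch. It assumes a beta-like rank-frequency form f(r) = A (N + 1 - r)^b / r^a, that is, a ratio of two power laws. The exact function used by the authors is not stated in this excerpt, so the form, the parameter values, and the synthetic "data" below are all assumptions made for illustration.

```python
import math


def beta_like_frequency(rank, n_ranks, amplitude, a, b):
    """Assumed beta-like form: A * (N + 1 - r)**b / r**a."""
    return amplitude * (n_ranks + 1 - rank) ** b / rank ** a


def log_residuals(observed, amplitude, a, b):
    """Log-difference between observed frequencies and the assumed form."""
    n_ranks = len(observed)
    return [
        math.log(f_obs)
        - math.log(beta_like_frequency(r, n_ranks, amplitude, a, b))
        for r, f_obs in enumerate(observed, start=1)
    ]


# Synthetic frequencies: the assumed form times a small deterministic
# perturbation that stands in for the data-minus-fit difference.
N_RANKS = 1000
PARAMS = {"amplitude": 0.1, "a": 1.0, "b": 0.5}  # hypothetical values
synthetic = [
    beta_like_frequency(r, N_RANKS, **PARAMS) * (1.0 + 0.05 * math.sin(r))
    for r in range(1, N_RANKS + 1)
]

residuals = log_residuals(synthetic, **PARAMS)
largest = max(abs(x) for x in residuals)
print(f"largest |log residual| = {largest:.4f}")
```

Working with log residuals keeps the head and the tail of the distribution on a comparable scale, which is a design choice of this sketch rather than something taken from the source.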
The outline of the present work is as follows. In Section 2 we define the stochastic model and construct a ME describing the data obtained for different Indo-European languages. In Section 3 the initial differential-difference ME is approximated by a (nonlinear) Fokker–Planck equation (FPE) in the continuum limit, where the discrete rank stochastic variable may be treated as a continuous variable. Closed analytic forms for both the stationary and the time-dependent probability density distributions of this equation are obtained using Padé approximants. In terms of these well-defined approximations, we show that the analytic time-dependent solution of the FPE describes some of the observed features well. Finally, in Section 4 we summarize our main results and critically discuss the novel features and limitations of our work.
Section snippets
Data adjustment for Indo-European languages
The variations in time of the rank of twenty words, at three different scales, were obtained for these six languages over two centuries in [8]; an example for English is given in Fig. 1. From the curves in [8] it can be observed that the behavior of the rank is similar for all languages. Words with low rank hardly vary in time, and as the rank increases, its variations depend on the rank itself. Notice that there is a higher variation at all scales before the year 1850. As an example, in
Stochastic model
Given a set of words forming a text, the number of times that a certain word appears with a given rank at a given time can be evaluated. If the change in rank is modeled by a one-step Markovian stochastic process, and if two arbitrary functions denote the transition probabilities per unit time for the rank to increase or to decrease by one unit, the dynamics of the probability density for the rank to take a given value at a given time is given by the nonlinear ME [17]
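As a rough numerical illustration of a one-step birth–death process of this kind, the sketch below advances a probability distribution over ranks with explicit Euler steps. It is not the authors' equation: the source describes a nonlinear ME, whereas this sketch assumes rates that do not depend on the probability itself. The constant rates, the rank range, the closed boundaries, and all names (`master_equation_step`, `mean_rank`) are assumptions made for illustration.

```python
def master_equation_step(p, up_rate, down_rate, dt):
    """One explicit Euler step of a one-step master equation.

    p[i] is the probability of rank i + 1. Probability moves only between
    neighbouring ranks, and nothing leaves through the two ends.
    """
    new_p = list(p)
    for i in range(len(p) - 1):
        k = i + 1  # rank at the lower end of the link (k, k + 1)
        net_flow = dt * (up_rate(k) * p[i] - down_rate(k + 1) * p[i + 1])
        new_p[i] -= net_flow
        new_p[i + 1] += net_flow
    return new_p


def mean_rank(p):
    """Average rank under the distribution p."""
    return sum((i + 1) * p_i for i, p_i in enumerate(p))


# Hypothetical setup: 50 ranks, a word that starts at rank 25, and constant
# rates with a slightly larger rate for the rank to decrease.
N_RANKS = 50
p = [0.0] * N_RANKS
p[24] = 1.0
for _ in range(2000):
    p = master_equation_step(p, lambda k: 1.0, lambda k: 1.2, dt=0.01)

print(f"total probability = {sum(p):.6f}")
print(f"mean rank         = {mean_rank(p):.2f}")
```

Writing the update as a flow across each link, rather than as separate gain and loss terms per rank, makes conservation of total probability explicit. The time step must stay small compared with the inverse of the summed rates for the Euler scheme to remain stable.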
Fokker–Planck approximation
Since the ME can be solved explicitly only in rare cases, we shall assume that the changes in the rank are small and that we are only interested in solutions that vary slowly with this discrete variable. In this limit the discrete variable may be treated as a continuous variable and the operators may be replaced by a Taylor series expansion, yielding the following nonlinear FPE approximation for the ME [18]
Here
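The equations of this section are not reproduced in this excerpt. For orientation only, the standard textbook version of the continuum limit is sketched below for a generic one-step ME whose rates do not depend on the probability itself, so it is not the nonlinear FPE of the source. All symbols are assumed notation: k is the rank treated as continuous, P(k,t) the probability density, g(k) and r(k) the rates for the rank to increase and to decrease, and \mathbb{E} the step operator.

```latex
% Step operators acting on a smooth function f(k), truncated at second order
\mathbb{E}^{\pm 1} f(k) = f(k \pm 1)
  \approx f(k) \pm \frac{\partial f}{\partial k}
  + \frac{1}{2}\frac{\partial^{2} f}{\partial k^{2}}

% Generic one-step ME with decrease rate r(k) and increase rate g(k)
\frac{\partial P(k,t)}{\partial t}
  = (\mathbb{E} - 1)\, r(k)\, P(k,t)
  + (\mathbb{E}^{-1} - 1)\, g(k)\, P(k,t)

% Inserting the truncated expansion gives a Fokker-Planck equation
\frac{\partial P(k,t)}{\partial t}
  \approx -\frac{\partial}{\partial k}\Bigl[\bigl(g(k) - r(k)\bigr) P(k,t)\Bigr]
  + \frac{1}{2}\frac{\partial^{2}}{\partial k^{2}}
    \Bigl[\bigl(g(k) + r(k)\bigr) P(k,t)\Bigr]
```

In this generic form the difference of the two rates plays the role of a drift and their sum plays the role of a diffusion coefficient.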
Discussion
In this work we have proposed a stochastic approach to analyze the dynamics of the rank variation of words in time for six Indo-European languages: English, French, German, Italian, Russian and Spanish. Based on numerical evidence, we showed that the rank may be regarded as a random variable exhibiting complex dynamics in different regions of the space. This fact suggests that its dynamics could be adequately described by a stochastic model, and we described it as a Markovian, one-step,
Acknowledgments
Support from projects CONACyT, Mexico 285754 and UNAM-PAPIIT, Mexico IG100518, IN-107414, and IN-107919 is acknowledged.
References (23)
- et al. A local-world evolving network model. Physica A (2003)
- et al. Birth and death master equation for the evolution of complex networks. Physica A (2014)
- et al. General model of substraction of stochastic variables. Attractor and stability analysis. Physica A (2011)
- Selective Studies and the Principle of Relative Frequency in Language (1932)
- An informational theory of the statistical structure of language
- The Evolution of Human Languages: Proceedings of the Workshop on the Evolution of Human Languages (1992)
- et al. Zipf’s law and random texts. Adv. Complex Syst. (2002)
- Evolution of the most common English words and phrases over the centuries. J. R. Soc. Interface (2012)
- et al. Robust clustering of languages across wikipedia growth. R. Soc. Open Sci. (2017)
- The Matthew effect in empirical data. J. R. Soc. Interface (2014)
- Rank diversity of languages: Generic behavior in computational linguistics. PLoS One
1 Deceased author.