Introduction

The rich get richer or success breeds success effect, also called Matthew’s principle (from the parable of the Talents in Matthew 25:14-30), has been invoked many times in the sociology of science to explain highly skewed distributions of bibliometric indicators (often power laws, see Egghe, 2005 and Rousseau, 2010) measuring the scientific production of scholars. The basic underlying idea is that if you have more, it is easier to gain more. This is a consequence of “the process of allocation of rewards to scientists for their contributions” (recognition) “which in turn affects the flow of ideas and findings through the communication networks of science” generating a reputational effect, as Merton (1968: 56) put it.

Concerning these mechanisms, Bonaccorsi et al. (2017) discussed recognition as a trigger of a cumulative increase in the scientific productivity of scholars and linked the results to the framework proposed by Whitley (2000). According to Whitley (2000), different scientific disciplines, which apply different knowledge production systems, can be investigated comparatively on a common ground, because they are all reputational work organizations.

A parable that is often considered similar to Matthew’s Talents, but which opens toward a different perspective, is the parable of the Ten Minas in Luke (Luke, 19:11-27). In Matthew, different outcomes are obtained starting from different amounts of Talents given at the initial time to servants with different abilities. In Luke, on the contrary, different outcomes are obtained starting from exactly the same stock (one Mina) given at the initial time to each servant, independently of their (unspecified) abilities. A view related to the latter can be found in Helvetius (1772), who proposes the materialist principle of equality of human intelligence.

The success breeds success principle is well known and has been reinvented many times over the last century. In animal and plant taxonomy it is known as the Yule process (Yule, 1924; Raup, 1985; Reed and Hughes, 2007), after Udny Yule (1871–1951), who studied the distribution of the sizes of biological taxa (for instance, how many species are in a genus) in 1925. From a mathematical point of view, the Yule process is a variation of Polya’s urn model (Mahmoud, 2008), attributed to the mathematician George Polya (1887–1985). Subsequently, the Yule process was generalized by the economist Herbert Simon (1916–2001), who won the Turing Award in 1975 and the Nobel Prize in Economics in 1978, to study the distribution of wealth (Simon, 1955; Mandelbrot, 1959; Simon, 1960). Simon demonstrated that the rich get richer mechanism produces power-law distributions. In Sociology, this principle was introduced by Robert Merton (1910–2003), who named it the “Matthew effect” (Merton, 1968; Wouters and Leydesdorff, 1994), after the quoted passage in the Gospel of Matthew. In Scientometrics the model was introduced in the 1970s by the physicist Derek de Solla Price (1922–1983) (de Solla Price, 1965; de Solla Price, 1976). Building on Simon’s work, he applied the Yule process to investigate the growth of the citation network, giving the mechanism a different name: “cumulative advantage”. In 1984 two Hungarian scholars, Wolfgang Glänzel, a mathematician, and András Schubert, with a background in physical chemistry, proposed a model of bibliometric distributions based on the success breeds success principle, which leads to the less common Waring distribution (Schubert and Glänzel, 1984). Both scholars were later awarded the Scientometrics Derek de Solla Price Medal.

More recently, the physicists Albert-Laszlo Barabasi and Reka Albert once more reinvented Price’s network evolution mechanism in a 1999 paper (Barabasi and Albert, 1999; Albert and Barabasi, 2002; Barabasi et al., 2002), renaming it “preferential attachment”. In a recent paper, Glänzel and Schubert (Glänzel and Schubert, 2016) present an overview of their 1984 statistical model. They illustrate the whole family of distributions which can be derived from their original model and show that, in retrospect, it can be considered a precursor of the preferential attachment network model proposed by Barabasi. Many other applications, and many other names, of the success breeds success mechanism can be found in the current literature. Among others, we quote (1) in systems biology, the vertex-copying models for the shape of genetic networks, recently proposed by the physicist Ricard Sole and colleagues (Sole and Montoya, 2001; Sole et al., 2002; Sole and Pastor-Satorras, 2003) and by the mathematician Alexei Vazquez and colleagues (Vazquez, 2003); (2) in the study of the WWW network, the fitness-based generalization of preferential attachment, proposed by the physicists Ginestra Bianconi and Albert-Laszlo Barabasi in 2001 (Bianconi and Barabasi, 2001); (3) the forest fire model for densification, proposed by the computer scientist Jure Leskovec and colleagues (Leskovec et al., 2005); (4) the local-competition mechanism proposed by the physicist Raissa D’Souza and colleagues (D’Souza et al., 2007); (5) the propagation of scientific memes studied by the physicist Matjaz Perc (Perc, 2013), who also recently reviewed the methodology for measuring the impact of the Matthew effect in social, technical and scientific areas (Perc, 2014).

In Scientometrics, the Price mechanism (as it is known) has been mainly focused on the distribution of citations. Price’s assumption was that the papers to be cited are chosen at random with a probability that is proportional to the number of citations those same papers already have. Thus, highly cited papers are likely to gain additional citations, giving rise to the rich get richer cumulative effect. Several modifications of the basic mechanism have been proposed from time to time, but, aside from small details, Price’s original formulation seems to catch the main features of the distribution of citations.
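Price’s assumption can be illustrated with a short simulation. The following is a minimal sketch, not Price’s own formulation: the function name `simulate_price`, the network size, the number of references per paper and the `offset` of one "ticket" per paper (added so that an uncited paper can receive its first citation) are all assumptions made for illustration.

```python
import random


def simulate_price(n_papers=2000, refs_per_paper=5, offset=1, seed=0):
    """Grow a citation network by cumulative advantage.

    Each new paper cites `refs_per_paper` earlier papers, choosing paper j
    with probability proportional to citations[j] + offset. The same paper
    may be drawn more than once (a simplification). `offset` must be a
    positive integer; it is an assumption of this sketch.
    """
    rng = random.Random(seed)
    citations = [0]        # one seed paper, not yet cited
    # The urn holds `offset` tickets per paper plus one ticket per citation,
    # so a uniform draw from it is a draw proportional to citations + offset.
    urn = [0] * offset
    for new in range(1, n_papers):
        targets = [rng.choice(urn) for _ in range(refs_per_paper)]
        for j in targets:
            citations[j] += 1
            urn.append(j)
        citations.append(0)            # the new paper enters uncited
        urn.extend([new] * offset)
    return citations


if __name__ == "__main__":
    c = simulate_price()
    ranked = sorted(c, reverse=True)
    print("mean citations:  ", sum(c) / len(c))
    print("median citations:", ranked[len(ranked) // 2])
    print("top five papers: ", ranked[:5])
```

Comparing the mean, the median and the few most cited papers gives a quick view of how skewed the resulting citation counts are.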

The current literature often focuses on the distribution of citations collected by a given paper. The question of what kind of mathematical function best describes this distribution is crucial. In 1998 Redner (Redner, 1998) considered the articles published in Physical Review D, along with all articles indexed by Thomson Scientific in the period 1981–1997. He found that the right tail of the distribution (corresponding to highly cited papers) follows a power law with exponent -3, in agreement with the conclusions of Price (Wouters and Leydesdorff, 1994). Later, Laherrere and Sornette (Laherrere and Sornette, 1998) studied the top thousand most cited physicists during the same period (1981–1997). The resulting citation distribution is better described by a stretched exponential distribution with β=0.3. Tsallis and de Albuquerque (Tsallis and de Albuquerque, 2000) analyzed the same data used by Redner with the addition of all papers published in Physical Review E, and found that the Tsallis distribution with ξ≈10 and β≈1.5 consistently fits the whole distribution of citations (not just the tail). More recently, Redner performed an analysis over all the papers published in the century-long history of all the journals published by the American Physical Society (Redner, 2005). He reached the conclusion that the Log-Normal distribution represents the data much better than a power law. In further studies the distributions of citations have been fitted with various functional forms: power laws (Seglen, 1992; Lehmann et al., 2003; Bommarito and Katz, 2010; Perc, 2010; Rodriguez-Navarro, 2011), Log-Normal (Radicchi et al., 2008; Stringer et al., 2008; Bommarito and Katz, 2010), Tsallis distribution (Wallace et al., 2009; Anastasiadis et al., 2010), modified Bessel function (Van Raan, 2001a; Van Raan 2001b) or more complicated distributions (Kryssanov et al., 2007).

It is worth noting that all but the Log-Normal fitting functions used to describe the distribution of citations c are monotonically decreasing functions of c, as the raw data clearly show no tendency to have a dip around c=0. Even in those cases where the Log-Normal shape of the distribution function has been found, the data were fitted to the high-c tail of the Log-Normal function (see, for example, Fig. 1 of Eom and Fortunato, 2011).

Figure 1. Examples of distribution functions obtained from eq. (11) for selected values of the parameters. In the upper panel we show the case τ1=2 and τ2=5 for three different values of Ω: 2 (black), 3 (red) and 4 (blue). In the middle panel, for the same three Ω values we have τ1=2 and τ2=7. Finally, in the lower panel, still at the same Ω’s, we report τ1=2 and τ2=9.

In addition to citation distributions, other bibliometric indicators have been shown to be well represented by a Log-Normal function over the whole domain. When the investigated indicator refers to a single scholar instead of a single paper, the distribution is far from being monotonically decreasing: as the value of the variable increases, it first increases, reaches a maximum, and then decreases with a longer right tail. Furthermore, different disciplines and different academic roles share the same Log-Normal distribution when the indicator is scaled by the median (or any other scale parameter) (Ruocco and Daraio, 2013). The same conclusion applies not only to the Hard and Life Science disciplines, but also to Social Sciences and Humanities (Bonaccorsi et al., 2017).

The universality (but for a scaling parameter) of the distribution of bibliometric parameters of scholars is an intriguing finding, and its analysis can provide important information on the Sociology and Science of Science. Also, the ultimate origin of the shape of the distribution, which is highly skewed and well represented by a Log-Normal function, can give some hints on the publishing behavior of scholars and the related scientific production process.

Why must the distribution of, for example, the number of papers published by a full professor in mathematics working on the theory of functions, or by an associate professor in astrophysics, or by a pathologist, or by a Latinist, in each case closely resemble a Log-Normal function? The origin of the Log-Normal distribution lies in multiplicative noise (Mitzenmacher, 2004; Limpert et al., 2016), that is, the product of a large number of statistically independent fluctuations (additive noise would give rise to a normal distribution function). This answer is not satisfactory: it is only a reformulation of the original question. Why should the scientific production of a scholar be the result of multiplicative random phenomena? Are there other phenomena behind the observed bibliometric distributions?
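The contrast between additive and multiplicative noise can be checked with a small experiment. The sketch below is only illustrative: the uniform shocks, their `spread`, the number of factors and the helper names `simulate` and `skewness` are assumptions, not quantities taken from the cited works. Sums of the shocks should be nearly symmetric, products should be strongly right-skewed, and the logarithms of the products should be nearly symmetric again, which is the signature of a Log-Normal shape.

```python
import math
import random
import statistics


def skewness(values):
    """Sample skewness: third standardized moment (0 for a symmetric sample)."""
    m = statistics.fmean(values)
    s = statistics.pstdev(values)
    return sum(((v - m) / s) ** 3 for v in values) / len(values)


def simulate(n_samples=5000, n_factors=50, spread=0.2, seed=1):
    """Return (sums, products) built from the same independent shocks.

    Each shock is uniform in [-spread, spread]; the additive outcome is the
    sum of the shocks, the multiplicative outcome the product of (1 + shock).
    """
    rng = random.Random(seed)
    sums, products = [], []
    for _ in range(n_samples):
        shocks = [rng.uniform(-spread, spread) for _ in range(n_factors)]
        sums.append(sum(shocks))
        products.append(math.prod(1.0 + s for s in shocks))
    return sums, products


if __name__ == "__main__":
    sums, products = simulate()
    print("skewness, additive:          ", skewness(sums))
    print("skewness, multiplicative:    ", skewness(products))
    print("skewness, log multiplicative:", skewness([math.log(p) for p in products]))
```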

In this article we propose a very simple model, based on the rich get richer rule, which, by the amplification of small initial fluctuations and by the reputational cumulative advantage mechanism, gives rise to the observed distribution of bibliometric parameters.

The mathematics of our model is straightforward. It is based on a deterministic differential equation for the individual productivity; the only statistical variability lies in the initial conditions. God (Nature) gives an almost equal (number of) talent (small “t”!) to any scholar. Each scholar performs equally well, but the small initial differences, as in an inflationary process, give rise to the huge differences observed in the distributions.

The statement about the near equality of talents (note that in the present paper ability, talents and intelligence are considered as synonyms) is counter-intuitive and requires some explanation. Indeed, scholars may be different not only in their abilities (natural talents) but also in their opportunities of doing research. Moreover, scholars are embedded in university departments, universities and countries, all these levels being different in resource allocation, recognition and prestige.

The rationale of our statement is that we would like to test whether the model, including this assumption, is still able to replicate the (Log-Normal) distributions observed in many empirical studies. This matters for what can be said about the meaning of bibliometric indicators. The reader is referred to the last section for more discussion on this point.

Model

Our model is inspired by Merton’s “Matthew effect”, and therefore by Matthew (25: 14–30), which is at the origin of the success breeds success effect. However, we also consider Luke’s parable of the Ten Minas (Luke, 19:11-27) and the materialist principle of Helvetius (Helvetius, 1772). We assume that there is an equal distribution of talents, abilities and intelligence (all these are considered as synonyms herein), and in this we depart from Matthew, which assumes an unequal distribution of abilities. See Table 1, which summarizes the main components of our model.

Table 1 The main elements of our model

Note that Luke does not say that individuals have equal abilities; he simply does not report anything about the abilities. For this reason we report that an equal distribution of ability is our interpretation of Luke, on a rational basis (Table 1).

Even if a theological interpretation is outside the scope of this paper, a comparative and exegetical analysis of the Gospels of Matthew and Luke shows some differences which are of interest here. Diez Herrera (2003) finds a relevant difference between Matthew and Luke: “but we also find divergences that we cannot consider secondary, since they decisively influence the interpretation of the parables. Thus, we have first of all the unequal distribution of money among the servants presented by Matthew. For him the important thing is not that all receive the same amount to trade on equal terms (something that does appear in the Lucan narrative); rather, he expressly stresses that they have received different sums, and this not by virtue of an arbitrary and discriminatory decision, but according to their ability” (Diez Herrera, 2003: 297–298). That is, the uneven distribution of money among the servants presented by Matthew is purposely related to their ability and is not an arbitrary decision. On the other hand, in the Lucan narrative, the important thing is that all receive the same amount to negotiate on equal terms. In particular, this analysis shows some similarity between Luke and the materialism of Helvetius (1772).

Maggioni (2000) proposes an interpretation of the meaning of the parable of Luke based on the story of the goods left in custody. That is, to take advantage of what God has given you is not simply a matter of preserving it but of producing fruit, of being active and productive with enthusiasm and courage. Man is not a simple guardian of God’s goods: he/she has the task of trading to multiply them: “Its meaning [of Luke’s parable] is instead to be sought in the story of the goods left in custody. That is: make use of what God has entrusted to you, because you will have to account for it. This is the theme of judgment. Which, however, must be further specified: it is not simply a matter of preserving, of not losing, but of making it bear fruit. One must live in expectation of a severe master, who wants to reap ‘where he has not sown’, who wants, that is, resourcefulness and courage from man. Man is not a simple custodian of God’s goods: he has the task of trading in order to multiply them” (Maggioni, 2000, p. 328–329).

In our model we adopt Maggioni (2000)’s entrepreneurial interpretation of Luke, to be productive, to trade and multiply the goods received in custody to support our hypothesis of the correspondence between productivity and ability/talent/ intelligence. Therefore, in our model, the operationalization of scholars’ talents (abilities, intelligence) in terms of research productivity is based on Maggioni (2000).

Let’s focus on a specific bibliometric indicator, for example, on the total number of papers published by a scholar in her/his whole academic life. None of the concepts introduced in what follows depends on the chosen indicator, and all the considerations and results may apply to any extensive parameter, as for example to the total number of citations received by any author’s papers, or to the total IF collected by a scientist.

Let’s call x(t) the number of papers published after a time t by a scholar, and define t=0 as the starting time of their academic career (obviously x(0)=0). In order to derive a model for the distribution of x we now need two elements: (1) the time evolution of x(t), and (2) the distribution of the academic ages at the observation time. As we will see, the latter quantity is much less important than the former, at least if no pathological age distributions are chosen.

We first derive a differential equation describing the evolution in time of the variable x(t), which is described in terms of a productivity (that is, the number of papers published in a given time), which, in turn, increases with time and is almost the same for all scholars at the beginning of their career. Specifically, the assumptions of the model are the following:

  • Nature gives the same amount of talents to any scholar. In mathematical terms, productivity at time zero, let’s call it α, is the same for all the scholars.

  • A tiny, random variability of the talents exists. The previous statement is not strictly true. The initial productivity is α+ηi, where ηi is a small, additive term that depends on the specific scholar i. The fluctuation of the initial talent, ηi, follows a normal distribution with zero average and standard deviation σ:

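    $$\langle \eta \rangle = 0, \qquad \langle \eta^2 \rangle = \sigma^2, \qquad P(\eta) = \frac{1}{\sqrt{2\pi}\,\sigma}\, e^{-\eta^2/(2\sigma^2)} \tag{1}$$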
  • According to a slightly modified version of the rich get richer principle, the productivity (not the products) increases proportionally to the amount of products accumulated up to that time. The rationale behind this assumption, which is central to the development of the model, is that the productivity of a scholar is related to her/his recognition and reputation. It is well known that grant allocation and conference participation, for instance, are based on the international visibility of papers, on their corresponding quality (for example, citations) and on the recognition by the international research community. This is a process which combines quantity and quality. In our model, the recognition increases, on average, with the number of papers produced, which in turn allows the scholar to get grants and thus to attract students and Post Docs, who, in turn, will increase her/his productivity. This will increase the opportunity to be invited to conferences (with the correlated advertisement of her/his works, publishing additional conference papers, and so on), thereby producing reputational cumulative effects. Mathematically, the productivity has a third addendum other than α and ηi, which is βx(t), where β has the dimension of an inverse time. Its inverse (1/β) represents the characteristic time in which, once the term βx(t) dominates the productivity, the production x(t) increases by a factor e (≈2.72). In other words, this parameter specifies how much the recognition counts in determining productivity (i.e. the cumulative advantage of reputation generated by recognition). The parameter β indeed determines the contribution to the productivity (dx/dt) of a given collection of output (x(t)). In the same regime, β can also be expressed as the logarithmic increment of production per unit of time: β=dln(x)/dt. We assume hereafter that β does not depend on the individual characteristics (it does not depend on “i”); rather, β is the same for all.

Each assumption brings an addendum to the productivity: α, η, and βx(t) respectively. The differential equation ruling the time evolution of x(t), thus, is simply the statement that productivity is the sum of the three terms:

$$\frac{d x_i(t)}{dt} = \alpha + \eta_i + \beta\, x_i(t) \tag{2}$$

where we have retained the subscript i as a reminder that, due to the presence of the statistical variable ηi, the evolution is different for each individual. This equation is readily solved, and its solution, with the initial condition xi(0)=0, is:

$$x_i(t) = \frac{\alpha + \eta_i}{\beta} \left[ e^{\beta t} - 1 \right] \tag{3}$$

This equation can be rearranged to be an expression for ηi:

$$\eta_i = \frac{\beta\, x_i(t)}{\left[ e^{\beta t} - 1 \right]} - \alpha \tag{4}$$

which establishes the identity between the statistical variable η and a quantity which depends on t and x, but which must be equal to η at any time. As we know the distribution function for η (eq. (1)), we can read eq. (4) as a change of variable x→η, with t as a parameter, and thus work out the distribution function of x(t) via P(η)dη=P(x(t))dx(t). Therefore:

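$$P(x(t), t) = \frac{d\eta}{dx(t)}\, P(\eta) = \frac{\beta}{\left[ e^{\beta t} - 1 \right]}\, P(\eta) = \frac{\beta}{e^{\beta t} - 1}\, \frac{1}{\sqrt{2\pi}\,\sigma} \exp\left\{ -\frac{1}{2\sigma^2} \left( \frac{\beta\, x(t)}{\left[ e^{\beta t} - 1 \right]} - \alpha \right)^2 \right\} \tag{5}$$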

where we have made explicit that the distribution function P(x,t) not only depends on x, but also explicitly on the time t.

The previous equation represents the statistical distribution of the production x(t), at a given academic time t, for the scholars. Its variability mirrors the small differences in the original productivity associated with the term η. The distribution is a normal distribution, in which both the mean, α[e^(βt)−1]/β, and the standard deviation, σ[e^(βt)−1]/β, increase with the time t.
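As a numerical cross-check of eqs. (2) and (3), the sketch below integrates the differential equation with a standard fourth-order Runge-Kutta scheme and compares the result with the closed-form solution. The parameter values (α=2 papers per year, β=0.1 per year, η=±0.1 or 0.3, t=20 years) and the function names are hypothetical, chosen only for illustration.

```python
import math


def closed_form(t, alpha, eta, beta):
    """Eq. (3): x(t) = (alpha + eta) / beta * (exp(beta * t) - 1)."""
    return (alpha + eta) / beta * (math.exp(beta * t) - 1.0)


def integrate_rk4(t_end, alpha, eta, beta, n_steps=1000):
    """Integrate eq. (2), dx/dt = alpha + eta + beta * x, from x(0) = 0."""
    def rate(x):
        return alpha + eta + beta * x

    h = t_end / n_steps
    x = 0.0
    for _ in range(n_steps):
        k1 = rate(x)
        k2 = rate(x + 0.5 * h * k1)
        k3 = rate(x + 0.5 * h * k2)
        k4 = rate(x + h * k3)
        x += h * (k1 + 2.0 * k2 + 2.0 * k3 + k4) / 6.0
    return x


if __name__ == "__main__":
    alpha, beta, t = 2.0, 0.1, 20.0      # hypothetical values
    print("numerical x(t):  ", integrate_rk4(t, alpha, 0.3, beta))
    print("closed-form x(t):", closed_form(t, alpha, 0.3, beta))
    # Two scholars whose initial productivity differs by 0.2 papers per year:
    gap = closed_form(t, alpha, 0.1, beta) - closed_form(t, alpha, -0.1, beta)
    print("gap in papers after t:", gap)
```

The last line shows the amplification at work: an initial difference Δη in productivity becomes a difference Δη[e^(βt)−1]/β in the number of papers.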

The distribution in eq. (5) depends on the two variables x and t, and has three model dependent parameters: α, β and σ. We can use two of these parameters to scale t and x, and we are therefore left with a single parameter. Defining the scaled time, τ, and the scaled number of papers, ξ, as:

$$\tau = \beta t, \qquad \xi = \frac{\beta}{\sigma}\, x \tag{6}$$

and the remaining parameter, Ω, as:

$$\Omega = \frac{\alpha}{\sigma} \tag{7}$$

we get (remembering that P(ξ)=P(x)dx/dξ=P(x)σ/β):

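$$P(\xi, \tau) = \frac{1}{\sqrt{2\pi}}\, \frac{1}{\left[ e^{\tau} - 1 \right]} \exp\left\{ -\frac{1}{2} \left( \frac{\xi}{\left[ e^{\tau} - 1 \right]} - \Omega \right)^2 \right\} \tag{8}$$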

This distribution is normalized, ∫ P(ξ,τ)dξ=1, and its mean and standard deviation are given by

$$\mu_P = \Omega \left[ e^{\tau} - 1 \right], \qquad \sigma_P = \left[ e^{\tau} - 1 \right] \tag{9}$$
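The moments in eq. (9) can be verified by direct sampling. In scaled variables eq. (3) reads ξ = (Ω + z)(e^τ − 1), with z = η/σ a standard normal variable. The sketch below draws z and compares the sample mean and standard deviation with eq. (9); the function name `sample_xi`, the sample size and the values τ=3, Ω=2.5 are illustrative assumptions.

```python
import math
import random
import statistics


def sample_xi(tau, omega, n=20000, seed=2):
    """Draw scaled productions xi = (omega + z) * (exp(tau) - 1), z ~ N(0, 1)."""
    rng = random.Random(seed)
    growth = math.exp(tau) - 1.0
    return [(omega + rng.gauss(0.0, 1.0)) * growth for _ in range(n)]


if __name__ == "__main__":
    tau, omega = 3.0, 2.5                # hypothetical values
    xs = sample_xi(tau, omega)
    growth = math.exp(tau) - 1.0
    print("sample mean:", statistics.fmean(xs), " eq. (9):", omega * growth)
    print("sample s.d.:", statistics.pstdev(xs), " eq. (9):", growth)
```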

The second step is to take into account the distribution, let’s say R(τ), of the (scaled) academic ages τ. The distribution of the (scaled) number of papers ξ is therefore:

$$F(\xi) = \int d\tau\, R(\tau)\, P(\xi, \tau) \tag{10}$$

In a mature, stationary world the distribution of the academic ages R(τ) is stable and, to a good approximation, flat in the time interval between the average academic time needed to reach the specific academic role and the time at which this role is left by promotion (or retirement, if we are considering the full professor role). We are confident that the choice R(τ)=θ([τ1−τ][τ−τ2])/(τ2−τ1), where τ1 and τ2 are the initial and final (scaled) times for the academic role and θ is the Heaviside step function, is a safe approximation at an aggregate level. However, it is well known that this is not exactly the case in centralized academic systems such as the Italian and the French ones (see Lissoni et al., 2011 and Pezzoni et al., 2012). For this reason, we have tested that the results are robust to modifications of this function, for example to the smoothing of the sharp discontinuities at τ1 and τ2.

In conclusion, we deal with the function:

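$$F(\xi) = \frac{1}{(\tau_2 - \tau_1)} \int_{\tau_1}^{\tau_2} d\tau\, P(\xi, \tau) = \frac{1}{\sqrt{2\pi}}\, \frac{1}{(\tau_2 - \tau_1)} \int_{\tau_1}^{\tau_2} d\tau\, \frac{1}{\left[ e^{\tau} - 1 \right]} \exp\left\{ -\frac{1}{2} \left( \frac{\xi}{\left[ e^{\tau} - 1 \right]} - \Omega \right)^2 \right\} \tag{11}$$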
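Eq. (11) has to be evaluated numerically. The sketch below is one possible way to do it, using a plain midpoint rule over τ; it is not the authors’ code, and the function names, the number of integration steps and the ±6 standard deviation window used for the normalization check are assumptions.

```python
import math


def p_scaled(xi, tau, omega):
    """Eq. (8): normal density in xi with mean omega*g and s.d. g = exp(tau) - 1."""
    g = math.exp(tau) - 1.0
    z = xi / g - omega
    return math.exp(-0.5 * z * z) / (math.sqrt(2.0 * math.pi) * g)


def f_exact(xi, omega, tau1, tau2, n_steps=400):
    """Eq. (11): average of eq. (8) over a flat age distribution on [tau1, tau2]."""
    h = (tau2 - tau1) / n_steps
    total = sum(p_scaled(xi, tau1 + (k + 0.5) * h, omega) for k in range(n_steps))
    return total * h / (tau2 - tau1)


def normalization(omega, tau1, tau2, step=2.0):
    """Integrate f_exact over xi; `step` should stay below exp(tau1) - 1."""
    g_max = math.exp(tau2) - 1.0
    lo, hi = (omega - 6.0) * g_max, (omega + 6.0) * g_max
    n = int((hi - lo) / step)
    return sum(f_exact(lo + k * step, omega, tau1, tau2) for k in range(n + 1)) * step


if __name__ == "__main__":
    omega, tau1, tau2 = 2.5, 2.0, 4.0    # the parameter set quoted for Fig. 2
    for xi in (10.0, 25.0, 50.0, 100.0, 200.0):
        print(xi, f_exact(xi, omega, tau1, tau2))
    print("normalization:", normalization(omega, tau1, tau2))
```

Because eq. (8) is a normal density, the integration window for the normalization extends to negative ξ; the mass there is small when Ω is a few units.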

Results

In Fig. 1, we show a few examples of the distribution functions obtained in the present paper. These have been obtained by a numerical integration of the expression in eq. (11). Each panel reports three different Ω values (2, black; 3, red; and 4, blue). The different panels refer to different τ2 values (upper, τ2=5; middle, τ2=7; lower, τ2=9), while τ1 is kept fixed at 2. The degree of similarity with the observed Log-Normal distribution depends on Ω and is highest for Ω between 2 and 3. However, for all the values of the parameters, the present model produces highly skewed distributions.

The present distribution is similar, but not mathematically equivalent to a Log-Normal distribution function:

$$L(\xi) = \frac{1}{\sqrt{2\pi}\, \xi\, \Sigma} \exp\left( -\frac{\log^2(\xi/\mu)}{2\Sigma^2} \right). \tag{12}$$

To better emphasize their similarities, in Fig. 2 we show an example of comparison. We choose a set of parameters for the present model, τ1=2, τ2=4, and Ω=2.5, and search, by a χ2 minimization, for the parameters of the Log-Normal distribution that give the best agreement between the two distributions: μ=41.3 and Σ=0.85.

Figure 2. Comparison of the distribution from eq. (11) and the Log-Normal distribution (eq. (12)). The parameters for the present model are τ1=2, τ2=4, and Ω=2.5, while the parameters for the Log-Normal distribution, μ=41.3 and Σ=0.85, were chosen to obtain the best agreement between the two distributions.
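A comparison of this kind can be sketched as follows. The exact χ2 definition and the ξ grid used for the published fit are not stated, so this example makes its own assumptions: an unweighted sum of squared differences on a linear ξ grid, minimized by a coarse grid search over μ and Σ. The fitted values it returns therefore need not coincide with those quoted above; all function names and grids are hypothetical.

```python
import math


def f_exact(xi, omega, tau1, tau2, n_steps=200):
    """Eq. (11) by the midpoint rule over tau (flat ages on [tau1, tau2])."""
    h = (tau2 - tau1) / n_steps
    total = 0.0
    for k in range(n_steps):
        g = math.exp(tau1 + (k + 0.5) * h) - 1.0
        z = xi / g - omega
        total += math.exp(-0.5 * z * z) / (math.sqrt(2.0 * math.pi) * g)
    return total * h / (tau2 - tau1)


def lognormal(xi, mu, sigma):
    """Eq. (12): mu is the scale (median), sigma the logarithmic width."""
    u = math.log(xi / mu)
    return math.exp(-u * u / (2.0 * sigma * sigma)) / (math.sqrt(2.0 * math.pi) * xi * sigma)


def fit_lognormal(xis, target, mus, sigmas):
    """Grid search for the (mu, sigma) minimizing the sum of squared differences."""
    best = None
    for mu in mus:
        for sigma in sigmas:
            sse = sum((lognormal(x, mu, sigma) - y) ** 2 for x, y in zip(xis, target))
            if best is None or sse < best[0]:
                best = (sse, mu, sigma)
    return best


def demo():
    omega, tau1, tau2 = 2.5, 2.0, 4.0                # model parameters of Fig. 2
    xis = [2.0 + 3.0 * k for k in range(100)]        # assumed xi grid
    target = [f_exact(x, omega, tau1, tau2) for x in xis]
    mus = [20.0 + k for k in range(61)]              # assumed search grids
    sigmas = [0.30 + 0.02 * k for k in range(51)]
    return fit_lognormal(xis, target, mus, sigmas)


if __name__ == "__main__":
    sse, mu, sigma = demo()
    print("best mu:", mu, " best Sigma:", sigma, " residual:", sse)
```

A finer grid, a logarithmic ξ axis or a weighted χ2 are natural variations; each may shift the best-fit parameters somewhat.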

Having established that the distribution obtained in the present model is practically indistinguishable from a Log-Normal distribution, it is important to map the set of parameters describing the present model onto those describing a Log-Normal. As an example, in Fig. 3 we report the best choice of the Log-Normal’s μ and Σ for each Ω value, for selected τ1 and τ2. This mapping has been obtained by a numerical χ2 minimization.

Figure 3. Example of mapping between the Ω parameter of the present model and the μ and Σ parameters of the Log-Normal distribution that gives the best agreement between the two curves. In the present example, we keep τ1=1 and τ2=3 fixed.

We illustrate an example of application of the present model to show its ability to represent some real data, although its validity may be broader (see the last section for a discussion on this point). In Fig. 4 we illustrate a comparison between some experimental data and the present model. The data, reported as a function of the scaled variable ξ/ξo, represent the distributions of the number of publications, in all of the different disciplines, in a 10 year period, for all the Italian scholars, scaled by their medians, as obtained in Bonaccorsi et al. (2017). They studied the scientific production of the universe of Italian academic scholars over a ten-year period across 2002–2012 by using a database built by the Italian National Agency for the evaluation of Universities and Research Institutes. In Italy, each scholar belongs to a disciplinary sector by law. This official classification of scholars separates disciplines according to Life and Hard Sciences (LHS) disciplinary sectors and Social Science and Humanities (SSH) sectors. This classification therefore offers the opportunity to investigate the behavior of scholars without having to create a subjective classification of scholars for the analysis.

Figure 4. A comparison between some experimental data and the outcome of the present model. The data, reported as a function of the scaled variable ξ/ξo, represent the distribution of the number of publications in a ten-year period for all the Italian scholars, obtained in Bonaccorsi et al. (2017) by scaling the distribution of all the different disciplines by their medians. See figure 6 in Bonaccorsi et al. (2017). The red points represent the data for scholars belonging to Life and Hard Science disciplines, the blue points those of the Social Science and Humanities disciplines. The present model has been adjusted to the data. The resulting parameters are Ω=2.0, τ1=2.0, τ2=3.6 and ξo=30.

For additional information, including descriptive statistics on the data, see Bonaccorsi et al. (2017). In Fig. 4, the red points represent the data of scholars belonging to LHS disciplines, the blue points those of the SSH disciplines. The present model has been adjusted to the real data by a numerical χ2 minimization.

Finally, for practical purposes, we now present an approximation to eq. (11) that leads to a simpler analytic expression for the distribution F(ξ), not involving the numerical integration over τ. If the value of τ1 is large enough, we can exploit the approximation exp(−τ)≪1, so that 1/[e^τ−1]≈e^(−τ), to rewrite eq. (11) as:

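$$F(\xi) = \frac{1}{\sqrt{2\pi}}\, \frac{1}{(\tau_2 - \tau_1)} \int_{\tau_1}^{\tau_2} d\tau\, e^{-\tau} \exp\left\{ -\frac{1}{2} \left( \xi e^{-\tau} - \Omega \right)^2 \right\} \tag{13}$$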

Now the integral in this equation can be solved with the substitution ζ=exp(−τ), that is:

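$$F(\xi) = \frac{1}{\sqrt{2\pi}}\, \frac{1}{(\tau_2 - \tau_1)} \int_{\exp(-\tau_2)}^{\exp(-\tau_1)} d\zeta\, \exp\left\{ -\frac{1}{2} \left( \xi \zeta - \Omega \right)^2 \right\} = \frac{1}{(\tau_2 - \tau_1)}\, \frac{1}{2\xi} \left[ \mathrm{erf}\left( \frac{\xi e^{-\tau_1} - \Omega}{\sqrt{2}} \right) - \mathrm{erf}\left( \frac{\xi e^{-\tau_2} - \Omega}{\sqrt{2}} \right) \right] \tag{14}$$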

As an example, in Fig. 5 we report the comparison between the exact result in eq. (11) and its approximate counterpart for selected values of the parameters. As expected, the approximation improves as τ1 increases, but already at τ1=2 the two curves appear to be indistinguishable.

Figure 5. Comparison of the exact result of the present model from eq. (11) and its approximation, eq. (14), for the selected parameter values: Ω=2, τ1=2, τ2=4.
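The gap between eq. (11) and eq. (14) can be quantified with a few lines of code. The sketch below evaluates both expressions on a ξ grid and reports the largest absolute difference relative to the peak of the exact curve. The grid, the measure of the gap and the function names are assumptions made here for illustration; the second parameter set (τ1=6, τ2=8) is hypothetical and serves only to show the trend with increasing τ1.

```python
import math


def f_exact(xi, omega, tau1, tau2, n_steps=400):
    """Eq. (11) by the midpoint rule over tau (flat ages on [tau1, tau2])."""
    h = (tau2 - tau1) / n_steps
    total = 0.0
    for k in range(n_steps):
        g = math.exp(tau1 + (k + 0.5) * h) - 1.0
        z = xi / g - omega
        total += math.exp(-0.5 * z * z) / (math.sqrt(2.0 * math.pi) * g)
    return total * h / (tau2 - tau1)


def f_approx(xi, omega, tau1, tau2):
    """Eq. (14): closed form obtained for exp(-tau1) << 1."""
    upper = math.erf((xi * math.exp(-tau1) - omega) / math.sqrt(2.0))
    lower = math.erf((xi * math.exp(-tau2) - omega) / math.sqrt(2.0))
    return (upper - lower) / (2.0 * xi * (tau2 - tau1))


def peak_normalized_gap(omega, tau1, tau2, n_points=80):
    """Largest |approx - exact| on a xi grid, divided by the exact peak value."""
    scale = math.exp(tau1)                       # grid spans 0.5 to 40 times exp(tau1)
    xis = [scale * 0.5 * (k + 1) for k in range(n_points)]
    exact = [f_exact(x, omega, tau1, tau2) for x in xis]
    approx = [f_approx(x, omega, tau1, tau2) for x in xis]
    return max(abs(a - e) for a, e in zip(approx, exact)) / max(exact)


if __name__ == "__main__":
    print("gap for tau1=2, tau2=4:", peak_normalized_gap(2.0, 2.0, 4.0))
    print("gap for tau1=6, tau2=8:", peak_normalized_gap(2.0, 6.0, 8.0))
```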

Discussion and Conclusion

The distributions reported in the previous section do not coincide with the Log-Normal function, but they share its main features, to the point that the two can be confused. Also, the unavoidable statistical uncertainty of the experimental data does not allow us to distinguish the small differences between eq. (11) and a Log-Normal function. In summary, we can talk of eq. (11) as a quasi-Log-Normal distribution. In Fig. 3 we have shown a mapping between the parameters Ω, τ1 and τ2 and the genuine Log-Normal parameters, and we conclude that the distributions observed in the bibliometric parameters in Ruocco and Daraio (2013) and Bonaccorsi et al. (2017) can be described by eq. (11) to a high degree of accuracy.

In conclusion, we have presented an (over)simplified model that catches the main features of the observed distributions of different bibliometric indicators. This model is built on the simple assumption that the natural talent is (almost) the same for all scholars at the beginning of their career. It is well known that visibility is not just the effect of publication rates. Moreover, it is the effect of only some publications, not all of them (Merton, 1968). In our model, only small fluctuations are allowed. These fluctuations inflate with time following the recognition and reputation rule à la Merton, mediated by the entrepreneurial interpretation of Luke (Maggioni, 2000): the more you publish, the more you are known, the higher your probability of being recognized, the more likely you are to get the right conditions for increasing your publication rate. With these simple ingredients, and with elementary algebra, we derive a functional form that, although not coincident with a Log-Normal function, has all the features of this function, to the extent that the two can be confused with one another. We have called this function quasi-lognormal, and we showed that, for any practical purpose, one could use the Log-Normal functional shape to fit the experimental data.

It is worth noting that the assumptions at the basis of the present model, and therefore the implications and the outcome of the model itself, can be extended to other fields outside of the investigation of scientific publishing. The same set of assumptions may apply not only to scientific production, but to numerous other activities as well. Some examples may include the analysis of production and trade, income and wealth distribution, but also more applied political economy matters including public choice or policy advice analyses. In this sense the model may certainly have implications that go beyond the Science of Science.

However, all these considerations leave a tricky issue open: what do bibliometric indicators really measure? A discussion on this point follows in the next section.

Policy implications and further research

The investigation of the ab-initio causes of the observed empirical distributions of bibliometric indicators is an interesting topic from a philosophical and modelling perspective. On the other hand, policy makers need metrics for, among other things, setting thresholds, establishing criteria of funding allocation or rules for national qualification of scholars. They are not very interested in the philosophical investigation of the origin of the success breeds success effect, that is, whether all scholars receive the same amount of talents or intelligence, or whether they receive different levels of it. Policy makers are mostly interested in understanding what publications and citations really measure, and whether these metrics are a good proxy of the scientific achievements, ability and efforts of the scholars. For this purpose, our model could provide some hints for further development. According to the hypotheses of our model, the empirical distributions of the bibliometric parameters observed might be the result of chance and noise (chaos) related to multiplicative phenomena connected to a publish or perish inflationary mechanism, led by the recognition and reputation of scholars. Summing up: being a scholar in the right tail or in the left tail of the distribution could have very little connection to her/his merit and achievements. This interpretation might cast some doubts on the use of the number of papers and/or citations as a measure of scientific achievements, along the lines of the general critiques against quantitative metrics (see e.g. Wilsdon, 2015, 2016), and may lead to a reconsideration of the method of peer review, despite its well-known limitations.

In the interpretation of our model, however, we follow the deductive induction of Popper (1959). In other words, the assumption of our model about the equality of ability/talents/intelligence, operationalized through an inflationary productivity process, along with the other assumptions, has led to a model that seems to reproduce some observed empirical evidence (Log-Normal distributions). This does not mean that the assumptions of the model (including that of equality of talents) are true, but simply that, according to the modus tollens, they are not falsified by our model. A tricky issue seems to emerge from this interpretation of our model: what do bibliometric indicators really measure? The analysis of this issue calls for deeper investigations on the meaning of the bibliometric indicators. These further analyses are clearly outside the purpose of the present paper. They will require the development of more detailed and accurate models than our (over)simplified model, in which the relationships among intelligence, talents, their historical characterization, ability, merits and their measure (see, for example, Carson, 2007) are more carefully taken into account and modelled. This is an interesting and intriguing topic for further research to be carried out beyond Science of Science and Sociology of Science, including elements and investigation tools from Philosophy, Psychology and Theology. It could also be worthwhile to investigate it further from a policy maker’s perspective, to understand, model, explain and assess the scholars’ behavior and its relation with scientific publication parameters.

Data availability

Data sharing is not applicable to this article as no datasets were generated during the current study. Data illustrated in Fig. 4 come from Fig. 6 in Bonaccorsi et al. (2017).

Additional information

How to cite this article: Ruocco G, Daraio C, Folli V and Leonetti M (2017) Bibliometric indicators: the origin of their log-normal distribution and why they are not a reliable proxy for an individual scholar’s talent. Palgrave Communications. 3:17064 doi: 10.1057/palcomms.2017.64.
