1 Introduction

The analysis of large real graphs has attracted the interest of industry and academia due to its multiple applications and, as a consequence, many technologies for their analysis have emerged. In order to fairly compare the performance and features of such technologies, several benchmarking initiatives have kicked off [2, 4, 6]. In general, these initiatives use synthetic graph generators in search of the flexibility not always found in real datasets.

Being able to generate credible graphs that mimic the characteristics of real ones is of high importance, because these characteristics directly impact the performance of the systems under test. One of these characteristics is the degree distribution of the nodes in the graph. In general, it is widely accepted that in real networks the degree sequence follows a power law, since the majority of the nodes have a small degree while just a few of them are connected to many neighbours, thus having a very large degree [7].

A particular power-law distribution whose support is the strictly positive integers is the Zipf distribution. Zipf's law shows a linear shape in the log-log scale, but in practice this is not always the case in real networks, where linearity is only observed for sufficiently large degree values. For low-degree nodes, the plot usually shows concavity and, less often, convexity [3]. One generalization of Zipf's law that addresses this issue is the Marshall-Olkin extended Zipf distribution (MOEZipf), which uses the Marshall-Olkin transformation to add an extra parameter that gives more flexibility to the family.

In this paper, we first show, by means of an analysis of several real graphs, the suitability of the MOEZipf model as a degree distribution. Second, we propose a method to generate degree sequences following the MOEZipf distribution, which is implemented in a scalable graph generator (the LDBC Data Generator [4]), showing that it works well in practice.

The paper is structured as follows. In Sect. 2, we introduce the MOEZipf probability distribution. In Sect. 3, we show that MOEZipf fits the degree distributions of real graphs well. In Sect. 4, we propose a method to generate random samples from a MOEZipf distribution. In Sect. 5, we show the results obtained with Datagen using the proposed approach. Finally, in Sect. 6, we conclude the paper.

2 The MOEZipf Model

A random variable (r.v.) X is said to follow a Zipf distribution with scale parameter \(\alpha >1\) if, and only if, its probability mass function (pmf) is equal to:

$$\begin{aligned} P(X=x)=\frac{x^{-\alpha }}{\xi (\alpha )},\quad \text{for}\quad x=1,2,3,\cdots , \end{aligned}$$
(1)

where \(\xi (\alpha )=\sum _{k=1}^{+\infty } k^{-\alpha }\) is the Riemann zeta function. The Zipf distribution is often suitable to fit data that correspond to frequencies of frequencies or to ranked data. This type of data typically shows a very large probability at one and very small probabilities at some very large values. Taking logarithms in (1), one obtains that the Zipf distribution shows a linear pattern in the log-log scale, since

$$ \log (P(X=x))=-\alpha \log (x)-\log (\xi (\alpha )). $$

This linearity is useful to check whether the data may be well fitted by the Zipf distribution, by just plotting the empirical probabilities. However, in practice this linearity is usually observed only in the tail of the distribution, while a concavity is observed at the beginning. The MOEZipf distribution is proposed in [8] as a model able to accommodate this behaviour.
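The log-log linearity of (1) can be checked numerically. The following Python sketch is only an illustration: the helper `hurwitz_zeta` and its `terms` parameter are names introduced here, and the zeta sums are approximated by a partial sum plus an Euler-Maclaurin tail estimate rather than by a library routine.

```python
import math


def hurwitz_zeta(alpha, x, terms=64):
    """Approximate sum_{k=x}^{inf} k**(-alpha) for alpha > 1 and integer x >= 1."""
    m = x + terms
    # Exact partial sum over k = x, ..., m - 1.
    head = sum(k ** -alpha for k in range(x, m))
    # Euler-Maclaurin estimate of the remaining tail sum_{k=m}^{inf} k**(-alpha).
    tail = (m ** (1 - alpha) / (alpha - 1)
            + 0.5 * m ** -alpha
            + alpha * m ** (-alpha - 1) / 12)
    return head + tail


def zipf_pmf(x, alpha):
    """Zipf pmf of Eq. (1): x**(-alpha) divided by the zeta normalising constant."""
    return x ** -alpha / hurwitz_zeta(alpha, 1)


# Slopes between consecutive points in log-log scale (alpha chosen arbitrarily).
alpha = 2.5
xs = [1, 10, 100, 1000]
log_p = [math.log(zipf_pmf(x, alpha)) for x in xs]
slopes = [
    (log_p[i + 1] - log_p[i]) / (math.log(xs[i + 1]) - math.log(xs[i]))
    for i in range(len(xs) - 1)
]
```

Every entry of `slopes` should coincide with \(-\alpha \) up to floating-point error, which is the straight line described above.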

An r.v. X is said to follow a MOEZipf distribution with parameters \(\alpha \) and \(\beta \) if, and only if, its survival function (SF) is equal to:

$$\begin{aligned} P(X>x)={\overline{G}}(x; \alpha , \beta )=\frac{\beta \,\, {\overline{F}}(x)}{1-{\overline{\beta }}\,{\overline{F}}(x)}=\frac{\beta \,\,\xi (\alpha , x+1)}{\xi (\alpha )-{\overline{\beta }}\, \xi (\alpha , x+1)}, \end{aligned}$$
(2)

for \(\beta >0\), \(\alpha > 1\) and \(\overline{\beta } = 1 - \beta \), where \({\overline{F}} (x)\) is the SF of the Zipf(\(\alpha \)) distribution.

The pmf of the MOEZipf may be computed by means of

$$\begin{aligned} P(X=x)= & {} {\overline{G}} (x-1; \alpha , \beta )-{\overline{G}} (x;\alpha , \beta ) \nonumber \\= & {} \frac{x^{-\alpha }\,\beta \, \xi (\alpha )}{[\xi (\alpha )-{\overline{\beta }} \xi (\alpha , x)] [\xi (\alpha )-{\overline{\beta }} \xi (\alpha , x+1)]}, x=1,2,3,\cdots , \end{aligned}$$
(3)

where \(\xi (\alpha ,x)=\sum _{k=x}^{+\infty } k^{-\alpha }\) is the Hurwitz zeta function with parameter \(\alpha \), so that \({\overline{F}}(x)=\xi (\alpha ,x+1)/\xi (\alpha )\). When \(\beta =1\), (3) reduces to the pmf of the Zipf(\(\alpha \)) distribution. An advantage of the MOEZipf distribution is that it shows convexity or concavity at the beginning of the distribution, depending on whether \(0<\beta <1\) or \(\beta > 1\) respectively, while keeping the linearity in the tail.
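Equations (2) and (3) can be evaluated directly. The sketch below is a minimal Python illustration, assuming that \(\xi (\alpha ,x)\) denotes \(\sum _{k\ge x} k^{-\alpha }\), the convention under which (3) equals the difference of two consecutive survival values. The function names are introduced here for illustration, and the parameter values are the Youtube estimates reported in Sect. 3.

```python
import math


def hurwitz_zeta(alpha, x, terms=64):
    """Approximate sum_{k=x}^{inf} k**(-alpha) for alpha > 1 and integer x >= 1."""
    m = x + terms
    head = sum(k ** -alpha for k in range(x, m))
    # Euler-Maclaurin estimate of the tail sum_{k=m}^{inf} k**(-alpha).
    tail = (m ** (1 - alpha) / (alpha - 1)
            + 0.5 * m ** -alpha
            + alpha * m ** (-alpha - 1) / 12)
    return head + tail


def moezipf_sf(x, alpha, beta):
    """Survival function P(X > x) of Eq. (2); x = 0 gives 1."""
    z = hurwitz_zeta(alpha, 1)
    t = hurwitz_zeta(alpha, x + 1)
    return beta * t / (z - (1 - beta) * t)


def moezipf_pmf(x, alpha, beta):
    """Probability mass function P(X = x) of Eq. (3), for x = 1, 2, 3, ..."""
    z = hurwitz_zeta(alpha, 1)
    a = hurwitz_zeta(alpha, x)
    b = hurwitz_zeta(alpha, x + 1)
    return x ** -alpha * beta * z / ((z - (1 - beta) * a) * (z - (1 - beta) * b))


# Local log-log slopes at the head and in the tail for a case with beta > 1.
alpha, beta = 2.089, 2.4101
head_slope = (math.log(moezipf_pmf(2, alpha, beta))
              - math.log(moezipf_pmf(1, alpha, beta))) / math.log(2)
tail_slope = (math.log(moezipf_pmf(2000, alpha, beta))
              - math.log(moezipf_pmf(1000, alpha, beta))) / math.log(2)
```

With \(\beta > 1\), `head_slope` is flatter than `tail_slope`, and `tail_slope` is close to \(-\alpha \), which matches the concave head and linear tail described above.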

3 Real Graphs Analysis

This paper is motivated by the analysis of the degree distribution of nine real networks coming from diverse domains, using eight different probabilistic models: Geometric, Poisson, Zipf, Right-truncated Zipf, Altmann, MOEZipf, Negative Binomial and Discrete Weibull. Table 1 shows the number of nodes, number of edges, global clustering coefficient (GCC), average clustering coefficient (ACC), assortativity degree (AD) and directionality of the networks analysed. For each directed network, both the in-degree (In) and out-degree (Out) sequences were analysed, making a total of 13 degree sequences.

Table 1. Main characteristics of the nine real networks analysed.

The Zipf, the Right-truncated Zipf and the MOEZipf probability distributions have been considered mainly for two reasons. On the one hand, the Zipf distribution is assumed to be the node degree distribution in most scientific papers, and its Right-truncated version is a way to improve the fit in the tail of the distribution. On the other hand, we are interested in showing the suitability of a generalization of the Zipf: the MOEZipf distribution.

The Poisson and the Negative Binomial distributions have been included because the former is the classical distribution for counts when the events take place randomly and with the same probability, and the latter is its usual bi-parametric alternative, used when the data show more dispersion than initially expected. However, we have observed that this is not our case, since fitting the Negative Binomial often results in numerical problems and, when it does not, the fits are not satisfactory.

The reason for including the Geometric and the Discrete Weibull distributions is clearly different. These distributions may be seen, respectively, as the discrete versions of the Exponential and the Weibull, which are the continuous distributions associated with time-to-event r.v.s. One advantage of the Geometric is its simplicity and that it does not require truncation at one, because its support is the strictly positive integers. The Discrete Weibull is useful when the lifetime is measured by counting cycles, shocks or revolutions. In our case, it makes sense to think of an individual as being active or alive while he or she is able to create connections with others. From this point of view, the distribution arises naturally if one thinks of the lifetime as measured by the number of connections made. Finally, the Altmann distribution, also known as the Zipf-Alekseev distribution or Zipf with an exponential cutoff, is used in quantitative linguistics and is also a bi-parametric generalization of the Zipf. In this case, it is assumed that the support is finite and the tail decreases quickly, since the probabilities are multiplied by \(e^{-x\beta }\).

In order to fit the degree sequence of a given graph, the maximum likelihood parameter estimates were calculated by means of the mle function included in the R software [9]. Maximum likelihood estimators are a standard choice because, under regularity conditions, they are consistent and asymptotically efficient. The models were compared using the Akaike Information Criterion (AIC) and the Bayesian Information Criterion (BIC) goodness of fit measures [1], which are defined as:
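To illustrate what such a maximum likelihood fit involves, the following Python sketch estimates the single parameter of the Zipf model by minimising the negative log-likelihood with a golden-section search. It is not the R procedure used for the analysis: the function names, the search interval and the iteration count are assumptions made for this example, and the two-parameter MOEZipf case would need a multivariate optimiser instead.

```python
import math
from collections import Counter


def zeta(alpha, terms=64):
    """Approximate sum_{k=1}^{inf} k**(-alpha) for alpha > 1."""
    m = 1 + terms
    head = sum(k ** -alpha for k in range(1, m))
    # Euler-Maclaurin estimate of the tail sum_{k=m}^{inf} k**(-alpha).
    tail = (m ** (1 - alpha) / (alpha - 1)
            + 0.5 * m ** -alpha
            + alpha * m ** (-alpha - 1) / 12)
    return head + tail


def zipf_neg_loglik(alpha, counts):
    """Negative Zipf log-likelihood for a {degree: frequency} mapping."""
    n = sum(counts.values())
    sum_log = sum(c * math.log(x) for x, c in counts.items())
    return alpha * sum_log + n * math.log(zeta(alpha))


def fit_zipf(degrees, lo=1.01, hi=10.0, iters=80):
    """Return (alpha_hat, log-likelihood at alpha_hat) for a degree sequence."""
    counts = Counter(degrees)
    ratio = (math.sqrt(5) - 1) / 2
    a, b = lo, hi
    # Golden-section search; the Zipf negative log-likelihood is convex in alpha.
    for _ in range(iters):
        c = b - ratio * (b - a)
        d = a + ratio * (b - a)
        if zipf_neg_loglik(c, counts) < zipf_neg_loglik(d, counts):
            b = d
        else:
            a = c
    alpha_hat = (a + b) / 2
    return alpha_hat, -zipf_neg_loglik(alpha_hat, counts)
```

The returned log-likelihood is the quantity \(l(\hat{\theta }, k)\) that enters the AIC and BIC.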

$$\begin{aligned} AIC = -2l(\hat{\theta }, k) + 2M \frac{N}{N - M - 1} \end{aligned}$$

and

$$\begin{aligned} BIC = -2l(\hat{\theta }, k) + M\log (N) \end{aligned}$$

respectively, where \(l(\hat{\theta }, k)\) is the value of the log-likelihood function evaluated at \(\hat{\theta }\), the maximum likelihood estimation of \(\theta \), for a given degree sequence k. M is the number of parameters of each probabilistic model (in our case it is equal to one or two) and N is the number of nodes of the network.

Table 2 shows the \(\Delta \)AIC and \(\Delta \)BIC for each network and all the models. These values were computed by means of the difference between the value in the current model and the value in the best model. Therefore, for each network the best model is the one that has a zero value in \(\Delta \)AIC and \(\Delta \)BIC.
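The two criteria and the differences reported in Table 2 follow directly from the formulas above. A minimal Python sketch is given below; the function names are introduced here, and the log-likelihood values in the usage lines are hypothetical placeholders, not results from the paper.

```python
import math


def aic(loglik, n_params, n_nodes):
    """AIC as defined above, including the N / (N - M - 1) factor."""
    return -2.0 * loglik + 2.0 * n_params * n_nodes / (n_nodes - n_params - 1)


def bic(loglik, n_params, n_nodes):
    """BIC as defined above."""
    return -2.0 * loglik + n_params * math.log(n_nodes)


def deltas(scores):
    """Difference between each model's score and the best (smallest) score."""
    best = min(scores.values())
    return {name: value - best for name, value in scores.items()}


# Hypothetical log-likelihoods and parameter counts, for illustration only.
n_nodes = 1000
fits = {"Zipf": (-2100.0, 1), "MOEZipf": (-2050.0, 2), "Geometric": (-2400.0, 1)}
delta_aic = deltas({m: aic(ll, k, n_nodes) for m, (ll, k) in fits.items()})
delta_bic = deltas({m: bic(ll, k, n_nodes) for m, (ll, k) in fits.items()})
```

The model whose entry in `delta_aic` or `delta_bic` is zero is the best one under that criterion.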

Table 2. Values of the \(\Delta \)AIC and \(\Delta \)BIC for the different networks and probabilistic models.

Our experiments reveal that the analysed degree sequences can be explained with just three out of the eight models considered, which are: the MOEZipf, the Discrete Weibull and the Altmann models. The MOEZipf model is the best in 54 % of the cases, followed by the Discrete Weibull in 38 % of the cases and the Altmann in 8 % of the cases.

Figure 1 shows four degree sequences associated with the networks Amazon (In), DBLP, Patents (Out) and Youtube respectively, together with the fit of the best four models in each case. In all cases the plots are in log-log scale. The best fit for Amazon (In) is given by the MOEZipf model with parameter estimations \(\hat{\alpha }= 3.0295\) and \(\hat{\beta }=27.1284\); the second best model in this case is the Discrete Weibull with parameters \(\hat{p} = 0.7519\) and \(\hat{\beta }= 0.9271\). The model that gives the best fit to the DBLP degree sequence is the Discrete Weibull with parameter estimations \(\hat{p} = 0.2622\) and \(\hat{\beta }= 0.3881\), followed by the MOEZipf model with parameters \(\hat{\alpha }= 2.2767\) and \(\hat{\beta }=4.8613\). For the Patents (Out) degree sequence, the best model is the MOEZipf with parameters \(\hat{\alpha }= 3.196 \) and \(\hat{\beta }= 119.264\) and, in second place, the Negative Binomial model with parameters \(\hat{\gamma }= 1.4873\) and \(\hat{p} = 0.8317\). The best model for the Youtube network is the MOEZipf with parameters \(\hat{\alpha }= 2.089 \) and \(\hat{\beta }= 2.4101\), and the second best model is the Discrete Weibull with parameters \(\hat{p} = 0.0044\) and \(\hat{\beta }= 0.1424\). How well each model behaves with respect to the others can be seen in Table 2.

Fig. 1. Observed degree sequences in the Amazon (In), DBLP, Patents (Out) and Youtube networks jointly with the four best models in each case.

4 Generating MOEZipf Degree Samples

The proposed method for generating MOEZipf degree sequences is based on the well-known Inverse Principle [5]. Given a sequence of uniformly distributed random values between 0 and 1, we obtain a sequence of values of the target distribution using its inverse cumulative probability function (cpf). Given that the cpf is equal to one minus the SF, and that from (2) the SF of the MOEZipf is easily deduced from the SF of the Zipf, one can obtain the desired value by applying the Inverse Principle to the Zipf distribution after properly modifying the generated uniform random value.

Algorithm 1 shows the pseudocode associated with this procedure. Given fixed values for \(\alpha \) and \(\beta \), we first initialize the variable x to the first value in the support of the MOEZipf, which is one. After generating a value u uniformly between 0 and 1, it is transformed into the value \(u'\) as follows:

$$\begin{aligned} u' = \frac{u \beta }{1 + u(\beta - 1)} \end{aligned}$$

If \(1-\frac{1}{u} \le \beta \), the final x value is equal to the first integer value such that \(u' \le F_\alpha (x)\), where \(F_\alpha (x)\) is the cpf of the Zipf distribution. Otherwise, the final x is equal to the first value satisfying \(u' \ge F_\alpha (x)\).

[Algorithm 1: pseudocode figure]
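The procedure described in this section can be sketched in Python as follows. This is an illustrative reading of the method, not the Datagen implementation: the function names are introduced here, and the search for x uses doubling followed by bisection on the Zipf survival function (the condition \(u' \le F_\alpha (x)\) is equivalent to \({\overline{F}}(x) \le 1-u'\)) instead of a sequential scan, to keep the cost bounded for values far in the tail. Since \(\beta > 0\) and \(0<u<1\) imply \(1-\frac{1}{u} \le \beta \), only the first branch is implemented.

```python
import random


def hurwitz_zeta(alpha, x, terms=64):
    """Approximate sum_{k=x}^{inf} k**(-alpha) for alpha > 1 and integer x >= 1."""
    m = x + terms
    head = sum(k ** -alpha for k in range(x, m))
    # Euler-Maclaurin estimate of the tail sum_{k=m}^{inf} k**(-alpha).
    tail = (m ** (1 - alpha) / (alpha - 1)
            + 0.5 * m ** -alpha
            + alpha * m ** (-alpha - 1) / 12)
    return head + tail


def moezipf_quantile(u, alpha, beta):
    """Smallest integer x >= 1 whose Zipf cpf is at least u'."""
    # 1 - u', with u' = u * beta / (1 + u * (beta - 1)); computed directly
    # to avoid cancellation when u is close to 1.
    target = (1.0 - u) / (1.0 + u * (beta - 1.0))
    z = hurwitz_zeta(alpha, 1)

    def zipf_sf(x):
        return hurwitz_zeta(alpha, x + 1) / z

    # Doubling phase: find an upper bound hi with zipf_sf(hi) <= target.
    hi = 1
    while zipf_sf(hi) > target:
        hi *= 2
    lo = hi // 2  # either 0 or a value already known to exceed the target
    # Bisection phase: shrink (lo, hi] to the first x meeting the condition.
    while hi - lo > 1:
        mid = (lo + hi) // 2
        if zipf_sf(mid) > target:
            lo = mid
        else:
            hi = mid
    return hi


def sample_moezipf(n, alpha, beta, seed=None):
    """Draw n MOEZipf(alpha, beta) values by inversion of uniform variates."""
    rng = random.Random(seed)
    return [moezipf_quantile(rng.random(), alpha, beta) for _ in range(n)]


# Usage: a small degree sequence with the Youtube estimates from Sect. 3.
degrees = sample_moezipf(1000, 2.089, 2.4101, seed=42)
```

Setting \(\beta =1\) leaves u unchanged, so the same function then samples from the plain Zipf(\(\alpha \)) distribution.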

5 Scalable MOEZipf Generation with Datagen

Datagen is the synthetic graph generator used in the LDBC Social Network Benchmark [4]. It is designed to generate undirected social networks with different degree distributions, with correlated attributes, and with edges connecting people with similar characteristics in a homophilic way. Datagen is implemented using the MapReduce parallel programming paradigm, and is therefore able to generate large graphs by running on small commodity clusters.

We have extended Datagen with the method proposed in Sect. 4 to generate MOEZipf-based graphs in a scalable way. We have generated seven synthetic graphs with degree distributions similar to those of the seven real degree sequences analysed in Sect. 3 for which the MOEZipf distribution is the best fitting model: Amazon (In), Amazon (Out), NotreDame (In), NotreDame (Out), Patents (Out), Wikipedia (In) and Youtube. To generate the graphs we have used the same number of nodes, and configured the implemented MOEZipf degree sequence generation with the same parameters as those estimated from the original networks. Note that Datagen only generates undirected graphs, but for the purpose of this paper, we are only interested in being able to mimic the degree distributions and to show that these can be generated in a scalable way.

Table 3. Parameters of the MOEZipf distribution estimated from the original networks vs the ones estimated from the networks generated using Datagen; the generation time of each network.

Table 3 shows the following information for each of the networks. On the one hand, the parameters \(\hat{\alpha }\) and \(\hat{\beta }\) estimated from the original networks, which are used to generate the synthetic ones. On the other hand, the parameters \(\tilde{\alpha }\) and \(\tilde{\beta }\) estimated from the resulting synthetic networks. We see that, in six out of seven cases, the parameters estimated from the synthetic networks are very similar to those from the original graphs. Only for the Amazon (Out) degree sequence is there a remarkable difference in the value of the \(\beta \) parameter. This is because the log-likelihood function, \(l(\beta ,\alpha ;k)\), as a function of \(\beta \) tends to an asymptote as \(\beta \) increases. More exactly:

$$\begin{aligned} l(\beta ,\alpha ;k)\simeq N\,\log (\beta )+ g(\alpha ;k), \end{aligned}$$

where \(g(\alpha ;k)\) is a function that does not involve the \(\beta \) parameter. Thus, there are no significant differences between the values of the log-likelihood function for two \(\beta \) values if both are large enough. Finally, the last column of Table 3 shows the time taken to generate these datasets on our test cluster, composed of four quad-core nodes with 32 GB of RAM each and 2 TB spinning disks. In general, we see that the generation model is able to accurately generate degree sequences with the desired characteristics, in a scalable way.

Figure 2 shows the degree distributions of two synthetically generated graphs, specifically the ones generated to mimic the characteristics of the Patents (Out) and Youtube degree sequences. We also plot the theoretical MOEZipf degree distribution with the parameter estimations (\(\tilde{\alpha }, \tilde{\beta }\)). In both cases, we see that Datagen is able to generate a graph whose degree sequence accurately reproduces the characteristics of the real one.

Fig. 2. Synthetically generated graphs with similar characteristics to the Patents (Out) (N = 3774767, \(\hat{\alpha }\,=\,3.196\) and \(\hat{\beta }\,=\,119.264\)) and Youtube (N = 1134890, \(\hat{\alpha }\,=\,2.089\) and \(\hat{\beta }\,=\,2.4101\)) graphs respectively.

6 Conclusions and Future Work

We have analysed a set of degree distributions from real networks using several probability models. The AIC and BIC have been used to compare the different tested models. We have shown that the MOEZipf distribution is the model that most often best explains the degree distributions observed in the real networks analysed. Based on this result, we have presented a method to generate MOEZipf degree sequences, and implemented it as an extension to the LDBC graph generator, namely Datagen. Our experiments have shown that, with the Datagen implementation, we can generate graphs with realistic degree distributions in a scalable way.

In this work, we have focused on generating realistic degree distributions. Future work will consist of developing techniques to reproduce other structural characteristics of networks, such as the clustering coefficient or the degree of assortativity. Moreover, Datagen currently only supports the generation of undirected graphs. In the future we will work on extending Datagen to generate directed graphs with different in-degree and out-degree distributions.