Neurocomputing

Volume 192, 5 June 2016, Pages 81-91
Exact ICL maximization in a non-stationary temporal extension of the stochastic block model for dynamic networks

https://doi.org/10.1016/j.neucom.2016.02.031

Abstract

The stochastic block model (SBM) is a flexible probabilistic tool that can be used to model interactions between clusters of nodes in a network. However, it does not account for interactions of time-varying intensity between clusters. The extension of the SBM developed in this paper addresses this shortcoming through a temporal partition: assuming that interactions between nodes are recorded on fixed-length time intervals, the inference procedure associated with the proposed model allows us to simultaneously cluster the nodes of the network and the time intervals. The number of clusters of nodes and of time intervals, as well as the cluster memberships, are obtained by maximizing an exact integrated complete-data likelihood, relying on a greedy search approach. Experiments on simulated and real data are carried out in order to assess the proposed methodology.

Introduction

Network analysis has been applied since the 1930s to many scientific fields. Indeed, graph-based modelling has been used in social sciences since the pioneering work of Jacob Moreno [1]. Nowadays, network analyses are used for instance in physics [2], economics [3], biology [4], [5] and history [6], among other fields.

One of the main tools of network analysis is clustering which aims at detecting clusters of nodes sharing similar connectivity patterns. Most of the clustering techniques look for communities, a pattern in which nodes of a given cluster are more likely to connect to members of the same cluster than to members of other clusters (see [7] for a survey). Those methods usually rely on the maximization of the modularity, a quality measure proposed by Girvan and Newman [8]. However, maximizing the modularity has been shown to be asymptotically biased [9].

From a probabilistic perspective, the stochastic block model (SBM) [10] assumes that the nodes of a graph belong to hidden clusters and that the probabilities of interaction between nodes depend only on these clusters. The SBM can characterize the presence of communities but also more complicated patterns [11]. Many inference procedures have been derived for the SBM, such as variational expectation maximization (VEM) [12], variational Bayes EM (VBEM) [13], Gibbs sampling [14], allocation sampler [15], greedy search [16] and non-parametric schemes [17]. A detailed survey on the statistical and probabilistic take on network analysis can be found in [18].
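To make the SBM assumption concrete, the following is a minimal sketch of sampling from a binary, undirected SBM. The names `sample_sbm`, `proportions` and `connect_prob` are hypothetical and chosen for illustration; the Bernoulli edge distribution and the undirected, loop-free graph are assumptions of this sketch.

```python
import random


def sample_sbm(n_nodes, proportions, connect_prob, seed=0):
    """Draw an undirected graph from a Bernoulli SBM (illustrative sketch).

    proportions[k]     : assumed prior probability of cluster k
    connect_prob[k][g] : assumed edge probability between clusters k and g
    """
    rng = random.Random(seed)
    n_clusters = len(proportions)
    # Hidden cluster label of each node.
    labels = rng.choices(range(n_clusters), weights=proportions, k=n_nodes)
    # The presence of edge (i, j) depends only on the clusters of i and j.
    adjacency = [[0] * n_nodes for _ in range(n_nodes)]
    for i in range(n_nodes):
        for j in range(i + 1, n_nodes):
            p = connect_prob[labels[i]][labels[j]]
            if rng.random() < p:
                adjacency[i][j] = adjacency[j][i] = 1
    return labels, adjacency


labels, adjacency = sample_sbm(
    n_nodes=30,
    proportions=[0.5, 0.5],
    connect_prob=[[0.8, 0.05], [0.05, 0.8]],  # community-like pattern
)
```

Swapping the diagonal and off-diagonal values of `connect_prob` would give a pattern in which nodes mostly connect across clusters, which is one example of a structure that is not a community.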

While the original SBM was developed for static networks, extensions have been proposed recently to deal with dynamic graphs. In this context, both the cluster memberships of the nodes and the interactions between nodes can be seen as stochastic processes. For instance, in the model of Yang et al. [19], the connectivity pattern between clusters is fixed through time and a hidden Markov model is used to describe cluster evolution: the cluster of a node at time t+1 is obtained from its cluster at time t via a Markov chain. Conversely, Xu et al. [20] as well as Xing et al. [21] used a state space model to describe temporal changes at the level of the connectivity pattern. In the latter, the authors developed a method to retrieve overlapping clusters through time.

Other temporal variations of the SBM have been proposed. They generally share with the ones described above a major assumption: the data set consists of a sequence of graphs. This is by far the most common setting for dynamic networks. Some papers remove this assumption by considering continuous time models in which edges occur at specific instants (for instance when someone sends an email). This is the case of, e.g., [22] and of [23], [24]. The model developed in the present paper introduces a sequence of graphs as an explicit aggregated view of a continuous time model.
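The aggregated view can be illustrated by binning timestamped interactions into fixed-length intervals. This is only a sketch: the function name `aggregate_counts`, the `(source, target, timestamp)` event format and the treatment of the right boundary are assumptions made for the example.

```python
def aggregate_counts(events, n_nodes, t_start, t_end, n_intervals):
    """Count interactions per (source, target, interval).

    events: iterable of (source, target, timestamp) triples (hypothetical format).
    Returns counts[i][j][u] = number of i -> j interactions in interval u.
    """
    width = (t_end - t_start) / n_intervals
    counts = [[[0] * n_intervals for _ in range(n_nodes)] for _ in range(n_nodes)]
    for source, target, timestamp in events:
        if not (t_start <= timestamp <= t_end):
            continue  # ignore events outside the studied time horizon
        # The last interval is closed on the right so that t_end is not dropped.
        u = min(int((timestamp - t_start) / width), n_intervals - 1)
        counts[source][target][u] += 1
    return counts


events = [(0, 1, 0.2), (0, 1, 0.7), (1, 2, 1.5), (0, 1, 2.9), (2, 0, 3.0)]
counts = aggregate_counts(events, n_nodes=3, t_start=0.0, t_end=3.0, n_intervals=3)
```

Each slice `counts[.][.][u]` can then be read as the weighted adjacency matrix of one graph in the sequence.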

More precisely, our model, which we call the temporal SBM (TSBM), assumes that nodes belong to clusters that do not change over time but that interaction patterns between those clusters have a time-varying structure. The time interval over which interactions are studied is first segmented into sub-intervals of fixed identical duration. The model assumes that those sub-intervals can be clustered into classes of homogeneous interaction patterns: the distribution of the number of interactions that take place between nodes of two given clusters during a sub-interval depends only on the clusters of the nodes and on the cluster of the sub-interval. This provides a non-stationary extension of the SBM, which is based on the simultaneous modelling of clusters of nodes and of sub-intervals of the time horizon. Notice that a related approach is adopted in [25], but with a substantial difference: they consider time intervals whose membership is known and hence exogenous, whereas in this paper the membership of each interval is hidden and therefore inferred from the data.
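The generative idea can be sketched as follows. The text above only states that the count distribution depends on the two node clusters and the interval cluster; the Poisson distribution, the exclusion of self-interactions, and the names `sample_tsbm`, `sample_poisson` and `rates` are assumptions of this illustration.

```python
import math
import random


def sample_poisson(rate, rng):
    """Knuth's multiplication method; adequate for small rates."""
    threshold = math.exp(-rate)
    k, product = 0, rng.random()
    while product > threshold:
        k += 1
        product *= rng.random()
    return k


def sample_tsbm(node_labels, interval_labels, rates, seed=0):
    """Draw counts[i][j][u] with a rate that depends only on the cluster of
    node i, the cluster of node j and the cluster of interval u.

    rates[k][g][d] is the assumed Poisson rate for node clusters (k, g)
    during an interval of cluster d. Node labels are fixed over time.
    """
    rng = random.Random(seed)
    n, n_int = len(node_labels), len(interval_labels)
    counts = [[[0] * n_int for _ in range(n)] for _ in range(n)]
    for i in range(n):
        for j in range(n):
            if i == j:
                continue  # self-interactions excluded (assumption)
            for u in range(n_int):
                rate = rates[node_labels[i]][node_labels[j]][interval_labels[u]]
                counts[i][j][u] = sample_poisson(rate, rng)
    return counts


# Two node clusters, two interval clusters ("busy" = 0, "quiet" = 1).
rates = [[[5.0, 0.5], [0.2, 0.2]],
         [[0.2, 0.2], [5.0, 0.5]]]
counts = sample_tsbm(node_labels=[0, 0, 1, 1],
                     interval_labels=[0, 0, 1, 1, 0],
                     rates=rates)
```

In this toy setting, intervals 0, 1 and 4 share one interaction pattern and intervals 2 and 3 share another, even though they are not contiguous in time.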

The greedy search strategy proposed for the (original) stationary SBM was compared with other SBM inference tools in many scenarios, using both simulated and real data, in [16]. The experimental results illustrated the capacity of the method to retrieve relevant clusters. Note that the same framework was considered for the (related) latent block model [26], in the context of biclustering, and similar conclusions were drawn. Indeed, contrary to most other techniques, this approach relies on an exact likelihood criterion, the so-called integrated complete-data likelihood (ICL), for optimization. In particular, it does not involve any variational approximation. Moreover, it allows the clustering of the nodes and the estimation of the number of clusters to be performed simultaneously. Alternative strategies usually first perform the clustering for various numbers of clusters, by maximizing a given criterion, typically a lower bound, and then rely on a model selection criterion to estimate the number of clusters (see [12] for instance). Some sampling strategies also allow the simultaneous estimation [15], [17]. However, the corresponding Markov chains tend to exhibit poor mixing properties, i.e. low acceptance rates, for large networks. Finally, the greedy search incurs a smaller computational cost than existing techniques [16]. Therefore, we follow the greedy search approach and derive an inference algorithm for the new model we propose, which estimates the number of clusters, for both nodes and time intervals, as well as the cluster memberships.

Finally, we cite the recent work of Matias et al. [27], who independently developed a temporal stochastic block model related to the one proposed in this paper. Interactions in continuous time are counted by non-homogeneous Poisson processes whose intensity functions depend only on the node clusters. A variational EM algorithm was derived to maximize an approximation of the likelihood, and non-parametric estimates of the intensity functions are provided.

This paper is structured as follows: Section 2 presents the proposed temporal extension of the SBM and derives the exact ICL for this model. Section 3 presents the greedy search algorithm used to maximize the ICL. Section 4 gathers experimental results on simulated data and on real world data.

Section snippets

A non-stationary stochastic block model

We describe in this section the proposed extension of the stochastic block model (SBM) to non-stationary situations. First, we recall the standard modeling assumptions of the SBM, then introduce our temporal extension and finally derive an exact integrated classification likelihood (ICL) for this extension.
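For orientation, an exact ICL of this kind is generally the log of the complete-data likelihood with the model parameters integrated out under a prior. The expression below is a generic sketch and not the paper's own equation: X (the observed interaction counts) and θ (all model parameters) are placeholder symbols introduced here, while c, y, K and D are read as the node labels, the interval labels and the corresponding numbers of clusters.

```latex
\mathrm{ICL}(c, y, K, D)
  = \log p(X, c, y \mid K, D)
  = \log \int p(X, c, y \mid \theta, K, D)\, p(\theta \mid K, D)\, \mathrm{d}\theta
```

The criterion is called exact when the prior is chosen so that this integral has a closed form, which avoids asymptotic or variational approximations.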

ICL maximization

The integrated complete likelihood (ICL) in Eq. (8) has to be maximized with respect to the four unknowns c, y, K, and D, which are discrete variables. Obviously, no closed-form solution can be obtained and it would be computationally prohibitive to test every combination of the four unknowns. Following the approach described in [16], we rely on a greedy search strategy. The main idea is to start with a fine clustering of the nodes and of the intervals (possibly size one clusters) and then to alternate
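The flavour of such a greedy search can be sketched for a single label vector. This is not the algorithm of the paper: the ICL of Eq. (8) is not reproduced here, so `toy_score` is a hypothetical stand-in objective (a penalized within-cluster sum of squares on made-up one-dimensional points), and `greedy_relabel` is an illustrative name.

```python
def greedy_relabel(labels, score):
    """Greedy exchange: move one item at a time to the cluster that most
    increases score(labels); stop when no single move improves it.
    A cluster disappears when its last member leaves, so the number of
    clusters is reduced along the way."""
    labels = list(labels)
    best = score(labels)
    improved = True
    while improved:
        improved = False
        for item in range(len(labels)):
            current = labels[item]
            best_target = current
            for target in set(labels) - {current}:
                labels[item] = target
                candidate = score(labels)
                if candidate > best:
                    best, best_target = candidate, target
            labels[item] = best_target  # keep the best move, or restore
            if best_target != current:
                improved = True
    return labels, best


def toy_score(labels, points=(0.0, 0.1, 0.2, 5.0, 5.1, 5.2), penalty=1.0):
    """Stand-in objective (NOT the ICL): negative within-cluster sum of
    squares minus a penalty per non-empty cluster."""
    clusters = {}
    for label, x in zip(labels, points):
        clusters.setdefault(label, []).append(x)
    within = 0.0
    for members in clusters.values():
        mean = sum(members) / len(members)
        within += sum((x - mean) ** 2 for x in members)
    return -within - penalty * len(clusters)


# Start from the finest clustering: one cluster per item.
final_labels, final_score = greedy_relabel(range(6), toy_score)
```

Starting from singleton clusters, each accepted move strictly increases the objective, so the loop terminates at a local maximum; on this toy data it ends with two clusters.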

Experiments

To assess the reliability of the proposed methodology, some experiments on synthetic and real data were conducted. All runtimes mentioned in the next two sections are measured on a 12-core Intel Xeon server with 92 GB of main memory running a GNU/Linux operating system. The greedy algorithm described in Section 3 was implemented in C++. A Euclidean hierarchical clustering algorithm was used to initialize the labels, and Kmax and Dmax were set equal to N/2 and U/2 respectively.

Conclusion

We proposed a non-stationary extension of the stochastic block model (SBM) allowing us to simultaneously cluster nodes and infer the time structure of a network. The approach we chose consists in partitioning the time interval over which interactions are studied into sub-intervals of fixed identical duration. Those intervals provide aggregated interaction counts that are studied with an SBM-inspired model: nodes and time intervals are clustered in such a way that aggregated interaction counts are

Marco Corneli graduated in 2014 from the University of Paris 7 Denis-Diderot (Research Master M2MO). During the master's program he studied advanced probability theory, stochastic calculus and Monte Carlo simulation techniques. Before that, Marco graduated from the University of Siena in Italy (MSc in Finance), where he mostly studied econometrics, time series and quantitative finance. His favorite research topics are Bayesian statistics and applied probability.

References (37)

  • N. Villa, F. Rossi, Q. Truong, Mining a medieval social network by kernel SOM and related methods, arXiv preprint...
  • M. Girvan et al., Community structure in social and biological networks, Proc. Natl. Acad. Sci. (2002)
  • P. Bickel et al., A nonparametric view of network models and Newman–Girvan and other modularities, Proc. Natl. Acad. Sci. (2009)
  • P. Latouche, E. Birmelé, C. Ambroise, Bayesian Methods for Graph Clustering, Springer, 2009, pp....
  • J.-J. Daudin et al., A mixture model for random graphs, Stat. Comput. (2008)
  • P. Latouche et al., Variational Bayesian inference and complexity control for stochastic block models, Stat. Model. (2012)
  • K. Nowicki et al., Estimation and prediction for stochastic block structures, J. Am. Stat. Assoc. (2001)
  • E. Côme et al., Model selection and clustering in stochastic block models based on the exact integrated complete data likelihood, Stat. Model. (2015)
    Pierre Latouche studied at UTC University, Compiègne, France. He obtained his MSc by research in machine learning from Aston University, Birmingham, UK, and his PhD in statistics from the University of Evry, France. He is now an associate professor in applied mathematics at Paris 1 Panthéon-Sorbonne University. His research focuses on networks and high dimensional data. He is interested in model selection, Bayesian analysis, and variational approximations.

    Fabrice Rossi is a Professor of applied mathematics at Paris 1 Panthéon Sorbonne University. He is a member of the SAMM research group and the head of the statistical learning and network team of this group. He has (co)authored more than 150 peer reviewed research papers published in international journals and in proceedings of conferences. His research interests include machine learning and data analysis, from theoretical aspects (learning theory) to practical applications (especially in humanities).