1 Introduction

Complex networks are nowadays used to describe a wide range of real-world phenomena: social and biological interactions, economic systems, and optimization problems are examples of how broad the range of topics studied with network science approaches has become. This breadth of application scenarios is one of the main reasons for the renewed interest in network analysis that has emerged in the scientific community in recent years. Indeed, a wide class of network problems has been analyzed and applied to several branches of research: community discovery, link prediction, node ranking and classification are only a few of the many tasks extensively investigated. Among all these tasks, the most challenging and interesting ones aim to describe how networks evolve through time.

Networks are rarely used to model static entities: in social contexts, for example, we can observe that as time goes by users appear and disappear, new interactions take place, and existing ones fall apart, disrupting existing paths. Understanding these dynamics is the first step toward obtaining insights into the real nature of the phenomenon modeled by the observed network. Moreover, almost all network problems can be reformulated to take the temporal dimension into account: communities can be tracked through their whole life cycle to unveil their history; incremental rankings can be computed in order to optimize execution costs; links can be predicted using information obtained from the analysis of topology changes in the local surroundings of nodes. Networks that take the temporal dimension into account are called dynamic. The topology of these networks evolves over time as new links and nodes may appear or disappear according to the interactions among their users.

In order to analyze dynamic networks in a reliable way, the social features affecting their structure and behavior must be considered. Indeed, temporal changes are sometimes independent of the network topology itself and result from external factors. The problem of predicting the existence of hidden links or the creation of new ones in social networks is commonly referred to as the link prediction problem. In this work, we propose an analytic process which, exploiting well-known state-of-the-art techniques, is able to tackle this challenging task in dynamic networks.

In order to capture how topological features evolve—knowledge needed to perform prediction in dynamic contexts—we make use of time series. Specifically, given a dynamic social network, we build a time series for each social feature of each pair of nodes, that is, a sequence of measures taken at successive, uniformly spaced points in time. In our approach, we use such structures to forecast the future value of each feature: the time series forecasts are then used to solve the link prediction problem.

Several works highlight that, when addressing link prediction through supervised learning, no single set of features or similarity index outperforms all others in every setting: depending on the network analyzed, various measures may or may not be particularly promising (Liben-Nowell and Kleinberg 2007). This suggests that the predictors which work best for a given network are related to the structure of that network rather than to a universal best set of predictors. Topological similarity indexes encode information about the relative overlap between nodes’ neighborhoods. We expect that the more similar two nodes’ neighborhoods are (e.g., the more friends they share), the more likely they are to exhibit a future link. Moreover, we exploit well-known social network characteristics such as the power law degree distribution (Barabási and Albert 1999), the small-world phenomenon (Watts and Strogatz 1998), and community structure (Girvan and Newman 2002).

In this study, a valuable piece of topological information that we leverage is the modular structure of social networks: indeed, social networks can be partitioned into densely and internally connected vertex sets, and it has been extensively observed that such topologies bound the sociality of the users within them. Furthermore, in a dynamic scenario, more than in a static one, the evolution of such boundaries describes changes in people's social behaviors. Starting from this observation, we decided to divide the original problem into two disjoint tasks:

  • intra-community interaction prediction;

  • inter-community interaction prediction.

Following the hypothesis that friends of friends are more likely to become friends than individuals who have no friends in common (Granovetter 1973; Rapoport 1963), in the former task we restrict our attention to the prediction of new links at time \(t+1\) occurring between individuals who belong to the same community at least once in [0, t]. This strategy has the non-negligible computational advantage of requiring feature calculation only for nodes belonging to the same community. The latter task, on the other hand, focuses on the forecast of future bridges across network modules: such interactions represent the weak ties that hold the overall network structure together.

In this paper, we propose a data mining process able to provide a solution to both tasks; moreover, we formalize the link prediction problem for dynamic networks as the Interaction Prediction problem. Our approach predicts future interactions by combining dynamic social network analysis, time series forecasting, feature selection and network community structure.

The rest of this paper is organized as follows. Section 2 reports the formal definition of the link prediction problem studied. Section 3 illustrates the details of the proposed approach as a workflow. Section 4 reports the experimental results, for both the intra-community and the inter-community interaction prediction tasks, obtained on real-world datasets. Section 5 introduces related work on the link prediction problem. Finally, Sect. 6 summarizes conclusions and future work.

2 Interaction prediction problem

The classic formulation of link prediction involves using the observed network status to predict new edges that are likely to appear in the future or to unveil hidden connections among existing nodes. To satisfy this definition, a wide set of approaches has been proposed and tested on several different domains, both in supervised and in unsupervised fashion. Graph structures are often used to describe rapidly evolving human dynamics: social interactions, call graphs, buyer–seller scenarios and scientific collaborations are only a few examples. This is exactly the reason why link prediction has become the principal instrument used to address the need to deal with networks that evolve through time.

In this work, our aim is to exploit the temporal information carried by the appearance and disappearance of edges in a fully dynamic context: in doing so, we plan to overcome the limitations imposed by the analysis of a static scenario when making predictions. To model such rapidly evolving dynamics, we adopt the interaction network model:

Definition 1

(Interaction Network) An interaction network \(G = (V, E, T)\) is defined by a set of nodes V and a set of time-stamped edges \(E \subseteq V \times V \times T\) describing the interactions among them. An edge \(e\in E\) is thus described by the triple \((u,v,t)\) where \(u,v \in V\) and \(t\in T\). Each edge e represents an interaction between nodes u and v that took place at time t.

To easily analyze an interaction network G, we discretize it into \(\tau\) consecutive snapshots of the same duration, thus obtaining a set of graphs \({\mathcal {G}} = \{G_0,\dots ,G_{\tau }\}\). We assume that the interactions belonging to \(G_t\) are only the ones that appear in the interval \((t,t+1)\). This modeling choice allows us to make predictions not only for interactions among previously unconnected nodes, but also for edges that have already appeared in the past. This decision is made in order to better simulate the dynamics that real interaction networks exhibit, allowing both nodes and edges to rise and fall. In real interaction networks, this model is a good proxy for structural dynamics since it allows us to implicitly assign a time to live to links (i.e., in a call graph, it enables weighting recent interactions more than older ones when predicting future contacts between a pair of nodes). Due to the adoption of this more complex graph model, hereafter we will refer to this peculiar formulation of the LP problem as the Interaction Prediction problem:
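For illustration, Definition 1 and the snapshot discretization can be sketched in Python as follows (using networkx; function and variable names are illustrative, not part of our implementation):

```python
# Illustrative sketch: edges are (u, v, t) triples; snapshot G_i collects
# the interactions whose timestamp falls in the i-th uniform interval.
import networkx as nx

def discretize(interactions, n_snapshots, t_min, t_max):
    """Split time-stamped edges (u, v, t) into n_snapshots static graphs."""
    width = (t_max - t_min) / n_snapshots
    snapshots = [nx.Graph() for _ in range(n_snapshots)]
    for u, v, t in interactions:
        idx = min(int((t - t_min) / width), n_snapshots - 1)
        snapshots[idx].add_edge(u, v)
    return snapshots

# Example: three interactions split into two snapshots.
edges = [("a", "b", 0.2), ("b", "c", 0.4), ("a", "c", 0.9)]
G0, G1 = discretize(edges, n_snapshots=2, t_min=0.0, t_max=1.0)
print(list(G0.edges()), list(G1.edges()))  # [('a','b'),('b','c')] [('a','c')]
```

In our experiments, Social is discretized into monthly snapshots and DBLP into yearly ones (see Sect. 4.1).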

Definition 2

(Interaction Prediction) Given a set \({\mathcal {G}}=\{G_0,\dots ,G_t,\dots ,G_{\tau }\}\) of ordered network observations, with \(t \in T = \{0,\dots ,\tau \}\), the interaction prediction problem aims to predict the new interactions that will take place at time \(\tau +1\), thus composing \(G_{\tau +1}\).

In the following section, we introduce our analytical workflow, built upon a supervised learning strategy, designed to solve the Interaction Prediction problem.

3 Proposed approach

The Interaction Prediction problem introduces new challenges to an already complex task. Due to the evolutionary behavior of the networks under investigation, a particular effort is needed to account for structural dynamics during the prediction phase. To this end, we make use of time-stamped network observations and community knowledge, besides classical features, in order to learn a robust machine learning model able to forecast new interactions. We design our approach to follow four steps (graphically represented in Fig. 1):

Fig. 1

Proposed approach workflow. The interaction network is split into network snapshots and each snapshot is partitioned using a community discovery algorithm (Step 1). Then, for each community, a large set of features describing nodes and links is calculated (Step 2). Using these values, different time series are built and a forecast of their future values is provided for the time of the prediction (Step 3). Finally, these expected values are used to train a classifier able to predict new interactions (Step 4)

Step 1 :

Given an interaction network G as input, for each temporal snapshot \(t \in T\) we compute a partition \(\mathcal {C}_t = \{ C_{t,0}, \dots , C_{t,k} \}\) of \(G_t\) using a community discovery algorithm. Then, for each t and each community \(C \in \mathcal {C}_t\), we define \(G_{C_t}=(V_{t, C}, E_{t,C})\) as the subgraph induced on \(G_t\) by the nodes in C, such that \(V_{t,C} \subseteq V_t\) and \(E_{t,C} \subseteq E_t\).

Step 2 :

For each \(t \in T\), we consider the interaction communities \(\mathcal {C}_t\) of \(G_t\) and compute a set of measures F for each pair of nodes \((u,v) \in W_{t,{\mathcal {C}}}\), where \(W_{t,{\mathcal {C}}} = \{(u,v): u,v \in V_{t,C} \wedge C \in \mathcal {C}_t \}\), that is, the pairs (u, v) belonging to the same community at time t. We thus obtain values \(f_{t}^{u,v}\) describing structural, topological and community features of the node pairs (u, v) at time t.

Step 3 :

With these values, for each pair of nodes \((u,v) \in W_{t,\mathcal {C}}\) and each feature \(f \in F\), we build a time series \(S_f^{u,v}\) using the sequence of measures \(f_{0}^{u,v},f_{1}^{u,v},\dots , f_{\tau }^{u,v}\). Then, we apply well-known forecasting techniques in order to obtain its expected future value \(f_{\tau +1}^{u,v}\).

Step 4 :

Finally, we use the set of expected values \(f_{\tau +1}^{u,v}\), for each feature \(f \in F\), to build a classifier able to predict future intra-community interactions.

In the following, we discuss each step in turn, proposing solutions that instantiate the described analytical process with well-known methodologies.

3.1 Step 1: community discovery

Partitioning a network into communities is a complex task: for this reason, several approaches have been introduced during the last decade, each of them tailored to extract communities carrying specific traits. Given the absence of a universally shared community definition, in order to evaluate the impact of community structure on the predictive power of the proposed supervised learning strategy, we tested three different CD algorithms, namely Louvain, Infohiermap and DEMON. Here we provide a short description of their major characteristics, while in the experimental section we will discuss how they affect the predictive power of the described analytical process. We recall that we adopted community discovery algorithms to split interaction networks into communities, and then used these communities both to calculate the features illustrated in the following and to perform the predictions of new interactions.

Louvain is a heuristic method based on modularity optimization (Blondel et al. 2008). It is fast, scalable to very large networks, and reaches high accuracy on ad hoc modular networks. The optimization is performed in two steps. First, it looks for “small” communities by optimizing modularity locally. Second, it aggregates nodes belonging to the same community and builds a new network whose nodes are the communities. These steps are repeated iteratively until a maximum of modularity is attained and a hierarchy of communities is produced. Louvain produces a complete non-overlapping partitioning of the graph. Like most approaches based on modularity optimization, it suffers from a “scale” problem that causes the extraction of a few big communities and a high number of very small ones.

Infohiermap is one of the most accurate and best-performing hierarchical non-overlapping clustering algorithms for community discovery (Rosvall and Bergstrom 2011), designed to optimize community conductance. The graph structure is explored with a number of random walks of a given length, with a given probability of jumping to a random node. Intuitively, the random walkers are trapped in a community and exit from it very rarely. Each walk is described as a sequence of steps inside a community followed by a jump. By using unique names for communities and reusing short codes for nodes inside a community, the walk description can be highly compressed, in the same way as street names (nodes) are reused across different cities (communities). The renaming is done by assigning a Huffman coding to the nodes of the network. The best network partition is the one yielding the shortest description for all the walks.

DEMON is an incremental, limited-time-complexity algorithm for community discovery (Coscia et al. 2012). It extracts ego networks, i.e., the sets of nodes connected to an ego node u, and identifies the real communities by adopting a democratic, bottom-up merging of such structures. Following this approach, each node, through its ego network (i.e., the graph induced on its one-hop neighborhood), gives its perspective of the communities surrounding it: all the different node perspectives are then merged, leading to an overlapping partition. A label propagation algorithm is applied to each ego network, ignoring the presence of the ego itself, in order to identify local micro-communities; then, with equity, each such individual micro-level view is combined with the ones obtained from the other nodes' ego networks. The result of this combination is a set of overlapping modules: the guess of the real communities in the global system, made not by an external observer, but by the actors of the network itself.

We chose these algorithms since, due to their formulations, they cover three different kinds of community definitions: modularity-, conductance- and density-based ones. Since in our tests we vary the structural properties of the communities used to extract the classification features, the experimental analysis will let us discuss which network partitioning approach provides the most useful insights into future interactions.
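For illustration, Step 1 can be sketched as follows, using the Louvain implementation shipped with recent versions of networkx (>= 2.8) as a stand-in for any of the three algorithms; the function names are illustrative:

```python
# Illustrative sketch of Step 1: partition each snapshot G_t and build
# the induced community subgraphs G_{C_t}.
import networkx as nx
from networkx.algorithms.community import louvain_communities

def community_subgraphs(snapshots, seed=42):
    """For each snapshot G_t, return the induced subgraphs G_{C_t}."""
    result = []
    for G_t in snapshots:
        communities = louvain_communities(G_t, seed=seed)  # list of node sets
        result.append([G_t.subgraph(C).copy() for C in communities])
    return result
```

Infohiermap and DEMON have their own implementations but plug into the workflow in exactly the same way: any algorithm returning a list of node sets per snapshot can be used.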

3.2 Step 2: features design

In order to efficiently approach the Interaction Prediction task using a supervised learning strategy, it is crucial to identify and calculate a valuable set of features to train the classifier. When dealing with large-scale graphs that may include millions of vertices and links, one of the challenges is the computationally intensive extraction of such features. Several studies related to link prediction, such as Feng et al. (2012), Fire et al. (2013), Jahanbakhsh et al. (2012), Lichtenwalter and Chawla (2012), Xu and Rockmore (2012), have tried to suggest the optimal topological structure of a network and the best features to be used. Building on the results of such analyses, we decided to use information belonging to three different families: pairwise structural features, global topological features and community features. We recall that all the features were computed, after the community extraction phase, on node pairs sharing the same social context.

3.2.1 Pairwise structural features

In this class fall all the measures used in the literature to score the likelihood of new links in unsupervised scenarios. Starting from the measures proposed in Liben-Nowell and Kleinberg (2007), we restricted our set to the ones in Table 1.

Table 1 Pairwise structural features

Given a graph G, we will use the following notation: \(\varGamma (u)\) identifies the set of neighbors of a node u in G; \(|\bullet |\) represents the cardinality of the set \(\bullet\).

  • Common Neighbor (CN) assigns as likelihood score of a new link the number of neighbors shared by its endpoints (Newman 2001).

  • Jaccard Coefficient (JC) measures the likelihood of two nodes establishing a new connection as the ratio between their shared neighbors and the total number of their distinct neighbors (Salton and McGill 1983).

  • Adamic Adar (AA) refines \(\hbox{CN}\) by increasing the importance of nodes which possess fewer connections (Adamic and Adar 2003).

  • Preferential Attachment (PA) assumes that the probability of a future link between two nodes is proportional to the product of their degrees (Barabási and Albert 1999).

As a direct consequence of their formulations, \(\hbox{CN}\), \(\hbox{JC}\) and \(\hbox{AA}\) share the same result set, composed of all the pairs of nodes at most two hops apart in G. However, the values obtained by the three measures for the same edge do not correlate (i.e., having a high \(\hbox{CN}\) does not imply having a high \(\hbox{JC}\) or \(\hbox{AA}\)). Conversely, \(\hbox{PA}\) generates scores for all possible node pairs: we restrict its computation to nodes at most at distance two in order to make its result set uniform with those of the other measures. We recall that in our feature computation G corresponds to \(G_{C_t}\), that is, the subgraph induced on \(G_t\) by each community at time stamp t.
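For illustration, the four measures can be computed on a community subgraph as follows (an illustrative sketch using networkx, not our production code); the formulas follow the standard definitions cited above:

```python
# Illustrative sketch of the Table 1 measures for a candidate pair (u, v),
# computed on a community subgraph G_{C_t}; candidate pairs are at most
# two hops apart.
import math
import networkx as nx

def pairwise_features(G, u, v):
    nu, nv = set(G.neighbors(u)), set(G.neighbors(v))
    shared = nu & nv
    return {
        "CN": len(shared),                                     # common neighbors
        "JC": len(shared) / len(nu | nv) if nu | nv else 0.0,  # Jaccard
        "AA": sum(1.0 / math.log(G.degree(z))                  # Adamic Adar
                  for z in shared if G.degree(z) > 1),
        "PA": G.degree(u) * G.degree(v),                       # pref. attachment
    }
```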

3.2.2 Global topological features

The features discussed so far look at the nodes' immediate surroundings. However, the position of a node within the network also carries valuable information that can be exploited in order to predict which kinds of nodes are attracted by it.

Table 2 Global topological features

In the literature, a wide set of measures has been proposed to estimate the centrality of nodes and edges as well as their rank within a network. These scores are often computationally expensive to calculate: for this reason, we have decided to use only two of them, whose definitions are reported in Table 2 and which are calculated on \(G_{C_t}\).

  • Degree Centrality (DC) relates the centrality of a node to its degree.

  • PageRank (PR) is a link analysis algorithm introduced by Page et al. (1999) and used by the Google Web search engine. It assigns a numerical score to each element of a hyperlinked set of documents with the purpose of measuring its relative importance within the set.

DC and \(\hbox{PR}\) scores were computed for both endpoints of each candidate pair: the underlying idea is to understand whether there is some correlation between the centrality of two nodes and the likelihood of a new interaction appearing between them. This choice can be seen as a way to generalize the \(\hbox{PA}\) measure where the operator combining the individual scores is not fixed. For \(\hbox{PR}\), we use as damping factor (d in the formula) its default value (0.85).
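Both measures are directly available in networkx; an illustrative sketch:

```python
# Illustrative sketch of the Table 2 measures on a community subgraph.
import networkx as nx

def topological_features(G):
    dc = nx.degree_centrality(G)     # degree / (|V| - 1), per node
    pr = nx.pagerank(G, alpha=0.85)  # PageRank with damping factor 0.85
    return dc, pr
```

Since the scores are kept per endpoint, a candidate pair (u, v) contributes four values to the feature vector: \(DC_u\), \(DC_v\), \(PR_u\) and \(PR_v\).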

3.2.3 Community features

One of the most pressing issues related to LP regards the reduction of false-positive forecasts. To this end, as briefly mentioned before, we exploit community discovery as a way to reduce the number of predictions provided by the chosen pairwise structural features.

Table 3 Community features

Communities group together nodes that are more tightly connected with each other than with the rest of the network. Making predictions only between nodes belonging to the same community allows the predictive process to focus on connections that are more likely to appear, thus discarding the ones connecting different graph substructures. However, following the general intuition behind the idea of community, we can take advantage of more specifically designed measures. Indeed, all the information we can gather from the topological analysis of the communities can be used as features describing the extended surroundings of nodes. With this aim, we introduce the set of features summarized in Table 3 (a computational sketch follows the list).

  • Community Size (CS) number of nodes belonging to the community C.

  • Community Edges (CE) number of edges among nodes in C.

  • Shared Communities (SC) identifies the number of communities shared by a pair of nodes. When dealing with network partitions, SC takes values in \(\{0,1\}\), while in case of overlapping communities its domain is [0, \(|\mathcal {C}|\)].

  • Community Density (D) ratio of edges belonging to the community over the number of possible edges among all the nodes within it.

  • Transitivity (T) identifies the ratio of triangles with respect to open “triads” (two edges with a shared vertex).

  • Max Degree (MD) identifies the degree (w.r.t. the community subgraph) of the principal hub for the community.

  • Average Degree (AD) identifies the average degree (w.r.t. the community subgraph) of the nodes within the community.
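An illustrative sketch of these measures on a community subgraph (SC, being pair-level, is computed separately):

```python
# Illustrative sketch of the Table 3 measures for a community subgraph G_C.
import networkx as nx

def community_features(G_C):
    n, m = G_C.number_of_nodes(), G_C.number_of_edges()
    degrees = [d for _, d in G_C.degree()]
    return {
        "CS": n,                               # community size
        "CE": m,                               # community edges
        "D": nx.density(G_C),                  # 2m / (n (n - 1))
        "T": nx.transitivity(G_C),             # triangles vs. open triads
        "MD": max(degrees) if degrees else 0,  # degree of the main hub
        "AD": sum(degrees) / n if n else 0.0,  # average degree
    }

def shared_communities(u, v, communities):
    """SC: number of communities (node sets) containing both u and v."""
    return sum(1 for C in communities if u in C and v in C)
```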

3.3 Step 3: forecasting models

The third step of our approach involves the adoption of time series forecasting models to obtain, given subsequent observations of the same feature for the same pair of nodes, an estimate of its future value. Since the behavior of the observed time series is not known in advance, we adopt several forecasting models based on different underlying assumptions. This choice allows us to identify, later on, which one best describes the evolution of the analyzed network. Since the time series we are analyzing are not long, we decided not to employ complex models that are known to be effective only over extended observation periods. Instead, we tested four computationally efficient models that have been shown to achieve good performances on short time series.

Table 4 Time series forecasting approaches

In Table 4, we summarize the forecasting approaches tested (a minimal implementation sketch follows the list): in our definitions we identify with \(Z_t\), \(t=1,\dots ,\tau\), a time series with \(\tau\) observations and with \(\varTheta _t\) its forecast at time t.

  • Last Value (Lv) considers as forecast the last observed value of the time series.

  • Average (Av) is the average of all the observations in \(Z_t\).

  • Moving Average (Ma) predicts the next value by taking the mean of the n most recent observed values of a series \(Z_t\). In our experiments, we ranged n over the interval \([1,\tau ]\).

  • Linear Regression (LR) fits the time series to a straight line. The level \(\alpha\) and the trend \(\beta\) parameters (the latter used to estimate the slope of the line) were defined by minimizing the sum of squared errors between the observed values of the series and the expected ones estimated by the model.
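An illustrative sketch of the four models, assuming each series is a plain list of observations \(f_0,\dots ,f_{\tau}\):

```python
# Illustrative sketch of the Table 4 forecasting models: each function
# returns the expected feature value at time tau + 1.
def last_value(series):                      # Lv
    return series[-1]

def average(series):                         # Av
    return sum(series) / len(series)

def moving_average(series, n):               # Ma, window n in [1, tau]
    window = series[-n:]
    return sum(window) / len(window)

def linear_regression(series):               # LR (assumes >= 2 observations)
    tau = len(series)
    x_bar, y_bar = (tau - 1) / 2.0, sum(series) / tau
    beta = (sum((x - x_bar) * (y - y_bar)    # trend: least-squares slope
                for x, y in enumerate(series))
            / sum((x - x_bar) ** 2 for x in range(tau)))
    alpha = y_bar - beta * x_bar             # level: intercept
    return alpha + beta * tau                # extrapolate to tau + 1
```

Note that Lv and Av coincide with Ma for \(n=1\) and \(n=\tau\), respectively, a fact we exploit in the experimental analysis.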

3.4 Step 4: classifier models

Correctly predicting new interactions is not an easy task. The complexity is mainly due to the highly unbalanced class distribution that characterizes the solution space: real-world networks are generally sparse; thus, the number of new interactions over the total possible ones tends to be small. We have discussed how it is possible, at least to some extent, to mitigate this problem by restricting the prediction set (i.e., predicting only new edges among nodes that, during the network history, were involved in at least one common community).

However, even adopting such precautions, we can expect a substantial unevenness between the positive and the negative classes. This translates into a very high, hard-to-improve threshold for the baseline model: in the case of a network having density 0.1, which identifies the presence of “only” 1/10 of the possible edges, the majority classifier is capable of reaching more than 0.9 accuracy by simply predicting the absence of new interactions, even though no interaction will actually be predicted since every possible future link will be marked as not present.

In order to better characterize our approach, we instantiated it in two different scenarios (both for inter- and for intra-community predictions):

  • Balanced class distribution we adopted class balancing through downsampling [as performed in previous works (Lichtenwalter et al. 2010)], thus obtaining balanced classes and a baseline model having 0.5 accuracy.

  • Unbalanced class distribution in order to provide an estimate of the real predictive power expressed by our methodology, we tested it against the unbalanced class distribution expressed by the original data.

Moreover, since the main focus of this work is to describe a data mining approach that can be used to solve the Interaction Prediction problem, and not to discuss a specific classification model, we evaluated our strategy independently of the hosted classifier: for this reason, in the following section we will discuss results achieved by an ensemble of classifiers, showing the scores only for the best performing ones. In detail, our supervised learning model set is composed of: decision trees (C4.5, C&R, CHAID, QUEST, random forest), neural networks, SVM and logistic regression.
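For illustration, the balancing and training stage can be sketched with scikit-learn as follows; a random forest stands in for the full model set, and all names are illustrative:

```python
# Illustrative sketch of Step 4: downsample the majority class, then
# train a classifier. X holds one row of forecasts f_{tau+1} per
# candidate pair, y the observed presence of the interaction in G_{tau+1}.
import numpy as np
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import train_test_split

def downsample(X, y, seed=0):
    """Balance classes by sampling the majority (negative) class down."""
    rng = np.random.default_rng(seed)
    pos, neg = np.where(y == 1)[0], np.where(y == 0)[0]
    keep = rng.choice(neg, size=len(pos), replace=False)  # sparse data: |neg| >= |pos|
    idx = np.concatenate([pos, keep])
    return X[idx], y[idx]

# X_bal, y_bal = downsample(X, y)
# X_tr, X_te, y_tr, y_te = train_test_split(X_bal, y_bal, test_size=0.3)
# clf = RandomForestClassifier(random_state=0).fit(X_tr, y_tr)
```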

4 Experiments and results

In this section, we report the results obtained by applying our approach to two real-world interaction networks. In Sect. 4.1, the datasets used to perform the experiments are briefly introduced. Section 4.2 discusses the results obtained when focusing the prediction on intra-community interactions: in this context, both balanced and unbalanced class scenarios are proposed and used to evaluate our approach. Finally, in Sect. 4.3 the same approach is applied to the forecast of inter-community interactions, the weak links that hold together the modular structure of complex networks.

4.1 Datasets

We tested our approach on two networks: an interaction network obtained from a Facebook-like social network and a co-authorship graph extracted from DBLP. These datasets allow us to test our procedure on two different grounds: a “virtual” context, in which people share thoughts and opinions via a social media platform, and a “professional” one. The general statistics of the datasets are shown in Table 5, while brief descriptions follow:

Social The Facebook-like social network originates from an online community for students at the University of California, Irvine. The dataset includes the users that sent or received at least one message during a 6-month period. We discretize the network into 6 monthly snapshots and use the first 5 to compute the features needed to predict the edges present in the last one.

DBLP We extract an author–author relationship if two authors collaborated on at least one paper. The co-authorship relations fall in a temporal window of 10 years (2001–2010). The network is discretized on a yearly basis: we use the first 9 years to compute the features and set as target for the prediction the edges belonging to the last one.

In Table 5, we can observe the low average density \(\mu _{\rm D}\) of the studied networks across the various snapshots. We also notice how the low standard deviations \(\sigma _{\rm D}\) and \(\sigma _{\rm CC}\) guarantee that the average density and clustering coefficient are good summary statistics.

Table 5 Network statistics: average density \(\mu_{\rm D}\), average clustering coefficient \(\mu _{\rm CC}\) and their standard deviations, \(\sigma _{\rm D}\) and \(\sigma _{\rm CC}\), reported as representative aggregates over the various snapshots

For this reason, it is remarkable that Social is denser than DBLP even though its clustering coefficient is considerably lower. This means that, due to its nature, when a new interaction appears in DBLP, more than a couple of users is often involved, automatically creating a complete clique, while in Social a new interaction just expresses the exchange of a direct message between two users.

4.2 Intra-community interaction prediction

The Interaction Prediction problem is computationally expensive to address since, in theory, a prediction should be output for each pair of nodes in the analyzed network. However, social networks are known to be sparse and easy to partition into internally dense substructures. Leveraging this observation, our approach is designed to reduce the node pairs for which a prediction is computed to the ones whose endpoints share at least one community membership. By operating this choice, we focus on analyzing strong ties—the intra-community links—and discard the bridges that connect different communities.

Table 6 Confusion matrix of a binary classifier

4.2.1 Balanced scenario

It happens frequently, in the LP problem, that the two classes to be predicted (i.e., whether a link will appear or not) are highly unbalanced. In our case, we have highly unbalanced datasets, with an unlinked–linked proportion of 95.95–4.05 % for Social and of 98.13–1.87 % for DBLP. Unfortunately, the classifiers used in our experiments need a balanced training set in order to build the predictive model properly. Following what is generally done in the literature, we balanced every snapshot \(G_t\) for Social and DBLP.

To evaluate the performances of the classifiers, we used accuracy and AUC, which are defined in terms of the confusion matrix of a binary classifier (see Table 6) and can be computed as sketched after the definitions:

  • Accuracy, defined as \(\hbox {ACC}=\frac{\hbox {TP}\,+\,\hbox {TN}}{\hbox {TP}\,+\,\hbox {FN}\,+\,\hbox {TN}\,+\,\hbox {FP}}\), measures the ratio of correct predictions over the total;

  • AUC identifies the area under the receiver operating characteristic (ROC). It illustrates the performances of binary classifiers relating the true-positive rate \(\hbox {TPR}=\frac{\hbox {TP}}{\hbox {TP}\,+\,\hbox {FN}}\) to the false-positive rate \(\hbox {FPR}=\frac{\hbox {FP}}{\hbox {FP}\,+\,\hbox {TN}}\) and providing a visual interpretation useful to compare different models.
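An illustrative computation of both metrics with scikit-learn:

```python
# Illustrative sketch: accuracy from the Table 6 confusion matrix and
# AUC from the predicted scores.
from sklearn.metrics import confusion_matrix, roc_auc_score

def evaluate(y_true, y_pred, y_score):
    tn, fp, fn, tp = confusion_matrix(y_true, y_pred).ravel()
    acc = (tp + tn) / (tp + fn + tn + fp)  # ACC as defined above
    auc = roc_auc_score(y_true, y_score)   # area under the ROC curve
    return acc, auc
```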

To better highlight how the proposed approach performs on real-world networks, we compare the outcomes of its instantiations, varying the combination of community discovery algorithm and time series forecasting model used.

We carried out a preliminary study aimed at identifying the optimal window size n for the moving average (Ma) forecast, having fixed the community discovery algorithm. By definition, Lv and Av are special cases of the more general Ma: the former is equivalent to Ma with \(n=1\), the latter with \(n=\tau\). Figure 2 shows, for the three community discovery algorithms, how the classification performance (AUC) behaves as the observation window n varies. We can observe different trends for the Social and DBLP networks. In the former, the AUC is maximized by the classifier built upon DEMON communities, while in the latter the same approach is the one with the worst performances. This is probably due to the particular definition of ego-network-based overlapping communities provided by this approach, which is tailored explicitly for social contexts. Furthermore, by observing these plots we can conclude that, in order to obtain higher performances using Ma, two strategies are viable: (1) minimize n, using the last value (Lv) as forecast, in order to make inferences approximating the future with the actual network status, or (2) use \(n\simeq \tau\) in order to obtain a better estimate of the whole historical trend. Hereafter, we make use of the best scoring classifiers in Fig. 2 to detail our analysis. We will refer to them as the Ma models for each specific network and community definition.

Fig. 2

Balanced scenario. AUC behavior varying the observation window \(n \in [0,\tau ]\) using the Moving Average Ma. Dots highlight the highest values

Fig. 3

Balanced scenario. ROC curves of the proposed workflow executed with different community discovery algorithms and forecasting methods. In Social, the best performer is DEMON with Moving Average, while in DBLP no combination is considerably better than the others

Table 7 Balanced scenario

As a second step, we compare the outcomes of the classifiers built using the LR forecast models with the Ma ones. Figure 3 shows the ROC curves for both the Social and DBLP datasets. In the former network, we can observe how LR and Ma provide very similar results, even if the moving average is always capable of obtaining slightly better performances. DBLP shows the same trend with a small gap between the two approaches (for this reason, we omit the LR curve). We report in Table 7 the AUC and the ACC for all the comparisons.

Having identified the two best performers for Social (DEMON Ma and Infohiermap Ma) and for DBLP (Louvain Ma and Infohiermap Ma) w.r.t. AUC and ACC, we investigated which key features contribute to their performances. We report in Fig. 4 the relative importance of the features used by the classifiers for each method. We can see how in Social the classifier built upon DEMON (a), as well as the one using Infohiermap communities (c), gives high importance to degree centrality and community measures (in particular to density, size and average degree) and tends to make less discriminating decisions using pairwise structural features (with the exception of \(\hbox{PA}\)). Conversely, in DBLP (b, d, e) the community feature set seems to show small predictive power for both the analyzed algorithms. This discrepancy is probably due to the different nature of the studied networks: Social naturally models real social interactions over a short period, while DBLP is inferred from connections (working collaborations) that are developed through years.

Fig. 4

Balanced scenario. Feature importance: the classifiers built for Social (in particular a and c) give high importance to degree centrality DC, community density D and community size CS. On the other hand, for DBLP the most important features are Adamic Adar \(\hbox{AA}\) and preferential attachment \(\hbox{PA}\)

Table 8 Balanced scenario (social)

In order to understand the boost provided to the classifier by the adoption of the right community discovery algorithm, we designed two different baselines: Structural Forecast (SF) and Filtered Structural Forecast (FSF). The SF model trains the classifier using only the forecasts of the pairwise structural features (\(\hbox{CN}\), \(\hbox{AA}\), \(\hbox{PA}\) and \(\hbox{JC}\)) computed on all the pairs of nodes at most 3 hops apart in the whole network, without taking into account the presence/absence of shared communities among them. On the other hand, the FSF model restricts the computation to pairs of nodes belonging to the same community, as the proposed approach does. As a case study, we report in Table 8 the AUC and ACC of the best Ma and LR baselines for the Social dataset.

Since in Social our best performing approach is the one built upon DEMON communities, the structural features for the FSF baseline were computed using that partition of the network. The obtained results show that, using features extracted from the communities, we are able to gain 0.025 in AUC and 3.45 % in ACC with respect to the FSF Ma baseline, and 0.08 in AUC and \(10.67\,\%\) in ACC with respect to the SF Ma one. These results highlight the importance of communities for the interaction prediction task, not only in providing features for pairs of nodes, but also in filtering the dataset in order to determine a more accurate selection of node pairs for the prediction. Without loss of generality, in the rest of this section, in order to reduce the number of comparisons, we will report a full analysis only for the Social dataset. The results obtained for the DBLP scenario do not differ significantly from the ones discussed, with the exception, as seen previously, of the best community discovery algorithm (Louvain instead of DEMON). This divergence is due to the different nature and topology of the analyzed networks.

Feature Class Prevalence Since our models are built upon three different classes of features (structural, topological and community related), it is mandatory to compare their results against classifiers using each class separately.

Such analysis allows us to assess the predictive power of each class of features, giving an idea of their overall importance for the complete model. We built a classifier for each community discovery algorithm and each feature class, using together all the forecasted versions of the features belonging to it. As shown in Table 9, regardless of the community discovery algorithm used, the most predictive features are the ones belonging to the topology class, followed by the structural and community ones. However, we can observe how the AUC and ACC are always higher for the model based on the DEMON approach: this trend suggests that this algorithm is the one that best bounds, at least for this network, the nodes that are more likely to establish future interactions.

Table 9 Balanced scenario (social)

Complete Classifier We investigated whether the performances of the analyzed classifiers can be improved by combining all the features obtained at the end of the forecasting stage (i.e., all the time series forecasts computed with Ma and LR). As we can see in Table 10, the performance boost is negligible with respect to DEMON Ma; in fact, we gain only \(0.35\,\%\) in ACC while maintaining the same AUC w.r.t. the results shown in Table 7. This means that the feature set used by our best classifier is “stable”: extending it does not produce advantages that justify the increase in model complexity. Conversely, for Louvain and Infohiermap the gain in AUC and ACC is more evident: this is due to the different degree of approximation introduced for each feature in the forecasting stage.

Table 10 Balanced scenario (social)
Table 11 Balanced scenario (social)

Features forecast correlation

Given the minor deviations in performances for different forecasting methods, we investigated the correlations among the forecasted values calculated by LR and by Ma with \(n \in [0,\tau ]\). We analyzed each feature separately, observing the average, median and variance of the correlations. In Table 11, we report the average of the variances of these values, aggregated for the different classes of features. From this table it emerges that, regarding structural features, Louvain has the lowest average variance of correlations, while, for topological and community related features, DEMON has the lowest one.

As a result, we can say that, if we use Infohiermap (which has the highest average of the variances) to extract the communities from the interaction network, the choice among the different forecasting methods matters. On the other hand, if we calculate the communities with DEMON, it does not matter very much which kind of forecasting technique (LR or Ma) we use to calculate the expected values. This statement holds less strongly for Louvain, which has a low correlation variance only for structural features.

Features forecast deviation

We estimated how good the proposed approach is by analyzing the deviation of the values calculated with the forecasting methods from the real values of the features at \(\tau +1\). The models built using the real features at \(\tau +1\) reach good performances (see Table 12).

Table 12 Balanced scenario (social)
Fig. 5

Balanced scenario (Social). The boxplots of squared errors per feature show how, independently of the community discovery algorithm or the forecasting method, the deviation is always very low, especially for the most important features. a Social Louvain Ma, b Social Louvain LR, c Social DEMON Ma, d Social Infohiermap Ma

This indicates that a good approximation of the real values is important to build a reliable classifier. Given these good performances, an analysis of the deviation of the expected values obtained via time series forecasting from the real values is needed to understand which measures can be predicted better than others with a certain community discovery algorithm or a certain forecasting technique. Thus, we analyzed the deviations \((f_{\tau +1}^{u,v} - \hat{f}_{\tau +1}^{u,v})^2\) between the expected values of the different forecasting methods and the real ones.

We analyzed the sum of squared errors (SSE) of each forecasting method for each feature in Fig. 5, and we observed that: (1) DEMON and Infohiermap perform better with Ma, (2) Louvain is generally worse than the others for every feature, (3) Infohiermap works better for structural and topological features, and (4) DEMON minimizes the error for the community features. However, independently of the community discovery algorithm or the forecasting method, the deviation is always very low, justifying the good performances previously reported.

In particular, we found that, with respect to the other combinations, Infohiermap with LR has the highest SSE for each attribute. On the other hand, the best approximations are achieved by Infohiermap and DEMON with Ma with \(n \in \{3,4\}\). Indeed, with the exception of \(\hbox{AA}\), Louvain never has the lowest SSE among the features used. At the same time, by ranking the SSE among the different community discovery algorithms and forecasting techniques, it emerges that with Louvain the lowest SSE belongs to \(\hbox{AA}\) and the highest to SC. On the contrary, with DEMON the lowest SSE belongs to SC, while the highest changes with respect to the forecasting method. Finally, as far as Infohiermap is concerned, we cannot derive anything noteworthy. Thus, probably due to its ego-network-based nature, DEMON gives better results than the other community discovery algorithms for community features, while \(\hbox{AA}\) works really well with the communities extracted by Louvain.

4.2.2 Unbalanced scenario

We have shown how the described analytic workflow is able to obtain good results when dealing with datasets having a balanced class distribution. Unfortunately, this scenario is not common when addressing the Interaction Prediction problem. Furthermore, making predictions on the new interactions that will appear in a network potentially involves computing scores for all the \(|V|\times (|V|-1)\) pairs of nodes of a network. Social networks are generally sparse, and this leads to a high rate of false-positive predictions (in case of unsupervised approaches) or to models that maintain high accuracy by just predicting the absence of new links (the majority classifier in case of supervised learning). Indeed, predicting every object as belonging to the most frequent class guarantees high performances, but in general it leads to useless classification results. For this reason, evaluating the performances of classifiers in highly unbalanced scenarios is not an easy task, but it is definitely a very important one.

Since we want to correctly predict new links, our primary purpose is to reach high precision, avoiding the generation of false-positive predictions. This is why, in the unbalanced scenario, we discuss, besides AUC and ACC, the Lift Chart and the precision of the tested classifiers.

Precision is defined as \(PPV=\frac{TP}{TP+FP}\). It represents the ratio of correct predictions for a specific class (in our case the one representing the presence of the edge in the test set) with respect to the total predictions provided.

Lift Chart graphically represents the improvement that a mining model provides when compared against a random guess, and measures the change in terms of lift score. By comparing the lift scores for various portions of a dataset and for different models, it is possible to determine which model is the best and which percentage of the cases in the dataset would benefit from applying the model’s predictions.

We report the precision instead of the accuracy because, unlike the balanced scenario (where, starting from a 50–50 ratio, the accuracy is strongly significant), in the unbalanced one it is very easy to obtain a high, but meaningless, accuracy. This is due to the fact that, as a consequence of the sparsity of the interaction network, the majority classifier can always predict “no edge” with no effort, reaching very high performances. Besides this, we report the Lift Chart because, unlike AUC and PPV (with which it shares the conveyed information, describing isomorphic spaces), it is able, even in unbalanced scenarios, to graphically emphasize the improvement provided by the tested classifier over a baseline model.
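An illustrative computation of the lift score underlying the charts, assuming numpy arrays of true labels and predicted scores:

```python
# Illustrative sketch: lift within the top q fraction of pairs ranked by
# predicted score is the precision in that slice divided by the overall
# positive rate; a random guess scores 1.
import numpy as np

def lift(y_true, y_score, q=0.1):
    order = np.argsort(-y_score)            # rank pairs by score, descending
    top = y_true[order][: max(1, int(q * len(y_true)))]
    return top.mean() / y_true.mean()       # top-slice precision vs. base rate
```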

We preserved the original ratio between the node pairs with and without a future interaction in the Social and DBLP datasets. For both networks, we used the DEMON algorithm to extract communities. This choice is due to the following reasons: (1) on Social, DEMON reaches the best performances in the balanced scenario, so we expect it to behave well in the unbalanced one too; (2) on DBLP, using Louvain (i.e., the best performer in the balanced scenario) in the unbalanced scenario, all the classification models degenerate to the majority classifier.

In Social, the ratio of the negative class over the total amount of possible pairs is \(95.947\,\%\), which means that a majority classifier predicting no edge for all the pairs would have an accuracy of almost \(96\,\%\). As output of the classification phase with Ma, we obtain a model which reaches an AUC of 0.966 with a prediction accuracy of \(98.75\,\%\) and a precision w.r.t. the positive class of \(95.61\,\%\). These are two very significant results: on the one hand, we have an accuracy improvement of \(2.803\,\%\) within an ideal window of \(4.053\,\%\) (100–95.947 %) with respect to the majority classifier; on the other, we have a very high precision on the positive class, considering that a classifier always predicting an edge would have a precision of \(4.053\,\%\). In addition to the Ma model, we also built three classifiers, each of them considering all the forecasts for a single category of features: topological, structural and community.

Fig. 6

Unbalanced scenario. The lift charts of the compared methods show how in both networks DEMON with Moving Average is the combination able to reach the best performances

In Fig. 6-left, we show the Lift Chart of the four models for Social. From the chart it emerges that, after the Ma model, the most promising is the one built upon the topological features, followed by the structural and community ones.

Also in this unbalanced scenario, we want to measure how much efficiency the community approach provides by just filtering in the “promising pairs.” By building the dataset with all possible pairs, without leveraging community information, we get a majority class, i.e., the absence of a link, with a ratio of \(98.96\,\%\) over the total number of entries. In order to better compare the two cases, we randomly filter out some pairs with no edge, bringing the accuracy of the majority classifier to \(95.947\,\%\) (as in the case with community discovery). Again we compare the performances of SF and FSF, reported in Table 13, but now considering the precision instead of the accuracy. We can see that we gain almost \(10\,\%\) of precision just by filtering out, in any time slot, all the pairs not belonging to the same community.

Table 13 Unbalanced scenario (Social)

In the DBLP case study, the resulting classifier has an AUC of 0.86, an ACC of \(98.135\,\%\) and a precision with respect to the positive class of \(44.78\,\%\). The majority class (no link) has a ratio of \(98.13\,\%\) over all the instances of the dataset. A possible reason for the lower performances obtained on DBLP w.r.t. Social is that in the latter an interaction represents a real social action between two different actors, while in DBLP an interaction models a co-authorship relation on a paper, and co-authorship is not, in our opinion, a strong representative of social interaction. However, the performances are far from useless: we obtain a precision of \(44.78\,\%\) starting from a positive class ratio of \(1.865\,\%\) (100–98.135 %), that is, 24 times better than predicting the presence of an edge for every pair. Finally, we can observe from the Lift Chart in Fig. 6-right how, differently from the Social case, the most predictive set of features is the community one, over the structural and topological ones.

4.3 Inter-community interaction prediction

So far, we have focused our attention on the task of predicting interactions within a community. We have shown that our approach is able to achieve good performances with both balanced and unbalanced class distributions, and we have discussed the features that better predict the presence (or absence) of a new interaction. Here we address the complementary problem: the prediction of inter-community interactions. Since the direct prediction of the network's weak ties is a very complex problem, prevalently due to the low stability of such links through time, we shift our interest to a related problem. We do not aim to predict the specific endpoints of an interaction (i.e., user u of community \(C_j\) and user v of community \(C_z\)), but the presence of at least one interaction among users of two different communities, say \(C_j\) and \(C_z\). To do so, we slightly modified our method:

  • instead of using the original interaction network, we preprocess our data and build, for each snapshot, an induced graph using the previously extracted communities. In particular, for each snapshot graph \(G_i\) and related set of communities \(C_i\) we perform the transformation described by Algorithm 1;

  • we compute the structural and topological features on the community-node pairs of each new induced graph;

  • we apply the time series forecast and, on the forecasted feature values, we build the prediction model.

The main difference w.r.t. the original approach lies in the use of the communities as network nodes and not as filters (i.e., no community features are used to build the final model).

Algorithm 1 Community-graph construction (reported in the original as a figure)
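Since Algorithm 1 is reported as a figure, the following Python sketch reconstructs it from the description below (core nodes are the top-\(k\%\) high-degree nodes of each community, with \(k=5\) in our experiments), together with the Jaccard matching of Eq. (1); it is a reconstruction, not the original pseudocode:

```python
# Reconstruction (not the original pseudocode) of Algorithm 1 and of the
# Jaccard-based reconciliation: two communities are linked in the
# community-graph iff any two of their core nodes interact in the
# original snapshot. Communities are plain node sets.
import networkx as nx

def identify_community_cores(G, C, k=0.05):
    """Top-k% high-degree nodes of community C (degrees taken from G)."""
    ranked = sorted(C, key=G.degree, reverse=True)
    return set(ranked[: max(1, int(k * len(ranked)))])

def community_graph(G, communities, k=0.05):
    cores = [identify_community_cores(G, C, k) for C in communities]
    CG = nx.Graph()
    CG.add_nodes_from(range(len(communities)))
    for i in range(len(communities)):
        for j in range(i + 1, len(communities)):
            if any(G.has_edge(u, v) for u in cores[i] for v in cores[j]):
                CG.add_edge(i, j)
    return CG

def jaccard_match(C_t, communities_t1):
    """Future expression of C_t: the community maximizing Eq. (1)."""
    return max(communities_t1, key=lambda C: len(C_t & C) / len(C_t | C))
```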

A crucial aspect is the process used to build each community-graph. As shown in Algorithm 1, the core nodes of each community are identified first (lines 3-6); then, a new edge is created in the induced graph between communities \(C_j\) and \(C_z\) if there exists at least one edge in the original graph connecting two of their core nodes (lines 7-15). There are several ways to implement the IDENTIFYCOMMUNITYCORES function: in our experiments, we use the top-\(k\%\) high-degree nodes within each community (we fixed k to 5). After the construction of the community-network, we apply a reconciliation phase across consecutive snapshots in order to align the community ids. To build the evolutive chain of each community (i.e., to find the correspondence of a given community across time), we employed a well-established set matching procedure often used by dynamic community discovery approaches (Hartmann et al. 2014), namely the Jaccard matching:

$$\begin{aligned} \hbox{Jaccard}(C_t,C_{t+1}) = \frac{|C_t \cap C_{t+1}|}{|C_t \cup C_{t+1}|} \end{aligned}$$
(1)

Given a community C at time t (\(C_t\) in the equation), we identify as its future expression in \(t+1\) the community which maximizes the Jaccard function over their node sets. We decided to evaluate the introduced methodology on a very specific case study: inter-community interaction prediction on the DBLP community-graph built upon the Infohiermap partition. The reasons behind this choice are the following:

  • Among the previously analyzed datasets, DBLP is the bigger one and it is always decomposed into a higher number of communities (ensuring community-graphs of meaningful size);

  • DEMON generates overlapping communities; thus, the community-graph extraction loses some effectiveness (shared nodes generate a densely connected graph);

  • Louvain, like all modularity-based approaches, suffers from the scale problem: this causes very sparse, star-like community-graphs composed of a few focal nodes (i.e., the bigger communities) linked to many satellites (i.e., very small communities that are rarely connected by interactions).

4.3.1 Balanced scenario

In the intra-community scenario, w.r.t. the DBLP dataset and Infohiermap communities, we were able to produce predictions for approximately \(91\,\%\) of the interactions actually present in the test set. The filter produced by the application of Infohiermap was thus able to discriminate weak ties across different network partitions and to guarantee high AUC and accuracy. Thanks to the community-graph construction defined in Algorithm 1, we now group the remaining \(9\,\%\) of the interactions into meta-links connecting different Infohiermap communities. Obviously, due to the IDENTIFYCOMMUNITYCORES strategy, we are not able to make predictions for all the weak ties: however, the introduced filtering groups together \(97\,\%\) of them, producing a very reliable sample.

Fig. 7

Inter-community prediction: left balanced and right unbalanced scenarios. AUC values varying \(n \in [0,\tau ]\) using the Moving Average Ma. Dots highlight highest values. In both scenarios, the optimal window size is 8

Following the method designed for inter-community interaction prediction, we tested all the time series forecasting strategies discussed in Table 4 and identified Ma as the best scoring one (as shown in Fig. 7 for both the balanced and unbalanced scenarios). In the balanced class scenario, we obtained the results reported in Table 14. Our results are, as expected, not as good as those obtained for the intra-community interaction problem. Here the best predictive power is expressed by the Ma time series forecast, able to reach \(66\,\%\) accuracy w.r.t. the \(50\,\%\) of the majority classifier. In order to better understand the impact of the time variable on such a volatile network structure, we also trained a classifier on the same feature set computed on the flattened community-graph (i.e., the graph built by keeping together nodes and edges of all the temporal snapshots). The obtained results suggest that, conversely from the intra-community setting, here the adoption of time series does not play a crucial role, even though with Ma and Av forecasting it allows a slight increase in prediction accuracy.

Table 14 Balanced scenario (DBLP)

4.3.2 Unbalanced scenario

To complete our analysis, we evaluated the effectiveness of our approach in the unbalanced inter-community setting as well. This scenario represents the most complex one we can design: we are targeting weak ties (i.e., the \(9\,\%\) of the interactions not covered by the intra-community predictions) while the majority class—no interaction—accounts for approximately \(98\,\%\).

Table 15 Unbalanced scenario (DBLP)

The results in Table 15 show a relatively high precision w.r.t. the minority class: while the baseline (the minority classifier, i.e., one always predicting an interaction) reaches \(4.01\,\%\) precision, our approach is able to reach \(PPV = 50\,\%\) (even though the recall on the minority class drops from \(100\,\%\) to “only” \(65\,\%\)). In this scenario too, the Ma time series forecast strategy is the one that offers the higher quality models. Conversely from the balanced scenario, we can observe how the classifier built upon the flattened community-graph does not produce interesting results: even though it guarantees a higher precision (\(PPV=57.2\,\%\)), the overall model quality is lower (flat graph \(AUC=0.316\) vs. Ma \(AUC=0.647\)). The predictions made on the flattened network are more precise, but the recall is low (\(\sim 9\,\%\)). In an unbalanced scenario, the low stability of inter-community interactions amplifies the complexity of the predictive task: flattening the temporal dimension increases the false-negative predictions, which leads to performance degradation.

5 Related works

The link prediction problem has been widely studied in the literature. The methods used to solve LP apply supervised and/or unsupervised approaches (Lü and Zhou 2011). In particular, link prediction strategies may be broadly categorized into four groups: (1) similarity-based strategies, (2) maximum likelihood algorithms, (3) probabilistic models and (4) supervised learning algorithms (Lü and Zhou 2011).

The first group defines measures of similarity as a score between each pair of nodes. All non-observed links are ranked according to their scores, and the links connecting more similar nodes are assumed to have higher existence likelihood. Despite its simplicity, the definition of node similarity is a non-trivial challenge. A similarity index can be very simple or very complicated, and it may work well for some networks while failing for others. For example, in Dong et al. (2012) the authors introduce an unsupervised method based on ranking factors, using the assumption that people make friends in different networks following similar principles.
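As an illustration of this first family of methods, the sketch below ranks the non-observed links of a toy graph by the Jaccard similarity of the endpoints' neighborhoods; it assumes the networkx library and is not tied to any specific work cited above.

```python
# A minimal sketch of a similarity-based strategy: score every
# non-observed pair with the Jaccard index of the endpoints'
# neighborhoods and rank pairs by score. The toy graph is an assumption.
import networkx as nx
from itertools import combinations

G = nx.karate_club_graph()
candidates = [(u, v) for u, v in combinations(G.nodes(), 2)
              if not G.has_edge(u, v)]          # non-observed links only

# jaccard_coefficient yields (u, v, score) triples for the given pairs.
scored = sorted(nx.jaccard_coefficient(G, candidates),
                key=lambda t: t[2], reverse=True)
for u, v, score in scored[:5]:                  # five most similar pairs
    print(u, v, round(score, 3))
```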

The second set of methods is based on maximum likelihood estimation. Empirical studies suggest that many real-world networks exhibit hierarchical organization. These algorithms presuppose some organizing principles of the network structure, with the detailed rules and specific parameters obtained by maximizing the likelihood of the observed structure. From the viewpoint of practical applications, an obvious drawback of maximum likelihood methods is that they are very time-consuming. In addition, maximum likelihood methods are probably not among the most accurate ones. Huang et al. (2012) use a continuous-time stochastic process for predicting aggregate social activities, that is, different activities between users in the same social network.

The third group of algorithms is based on probabilistic Bayesian estimation. Probabilistic models aim at abstracting the underlying structure from the observed network, and then predicting the missing links by using the learned model. Given a target network, the probabilistic model optimizes a target function to establish a model, based on a group of parameters, which best fits the observed data of the target network. The probability that a nonexistent link will appear is then estimated by the conditional probability. Zhu (2012) proposes nonparametric latent feature relational models that minimize an objective function for a normalized link likelihood model.

The proposed approach belongs to the category of methods which employ supervised machine learning techniques. LP through supervised learning algorithms was introduced in Liben-Nowell and Kleinberg (2007). The authors studied the usefulness of graph topological features by testing them on co-authorship networks. A classifier is trained according to the knowledge that a link will or will not be present in the future; the classifier is then used to predict new links. After Liben-Nowell and Kleinberg (2007), a wide range of models exploiting several different strategies have been proposed. Indeed, it has been proven that supervised methods reach better performance than unsupervised ones, in terms of both AUC and precision.
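The supervised formulation can be summarized by the following hedged sketch: node pairs are labeled according to whether a link appears in a later snapshot, described through topological scores, and fed to a classifier. The feature choice, the classifier and the toy snapshots are illustrative assumptions, not a reproduction of any cited work.

```python
# A sketch of the supervised link prediction formulation: unseen pairs
# are described by topological scores on an earlier snapshot and labeled
# by their presence in a later one. Features and classifier are assumptions.
import networkx as nx
from itertools import combinations
from sklearn.ensemble import RandomForestClassifier

def pair_features(G, u, v):
    """Topological scores describing the candidate pair (u, v)."""
    cn = len(list(nx.common_neighbors(G, u, v)))
    jac = next(nx.jaccard_coefficient(G, [(u, v)]))[2]
    return [cn, jac]

def build_dataset(G_train, G_test):
    """Label each unseen pair by its presence in the later snapshot."""
    X, y = [], []
    for u, v in combinations(G_train.nodes(), 2):
        if G_train.has_edge(u, v):
            continue                          # predict only non-observed pairs
        X.append(pair_features(G_train, u, v))
        y.append(int(G_test.has_edge(u, v)))  # 1 if the link appears later
    return X, y

# Toy temporal snapshots: the "future" graph contains one new link.
G_train = nx.karate_club_graph()
G_test = G_train.copy()
G_test.add_edge(0, 9)                         # the link to be learned
X, y = build_dataset(G_train, G_test)
clf = RandomForestClassifier(n_estimators=100).fit(X, y)
```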

In order to build an efficient classifier, many works focused on finding an efficient set of features. Jahanbakhsh et al. (2012) show that only a small set of features is essential for predicting new edges and that contacts between nodes with high centrality are more predictable than those between nodes with low centrality. Following these principles, in Bao et al. (2013) principal component analysis is used to determine the weights of the features; according to these weights, the number of features taken as input by the regression algorithm used for prediction is reduced. A rank aggregation approach is proposed in Pujari and Kanawati (2012). The authors rank the list of unlinked nodes according to some topological measures; then, at each new time instant, each measure is weighted according to its performance in predicting new links. The learned weights are used in a reinforcing way for the final prediction. Finally, in Spiegel et al. (2011) tensor factorization is used to select the most predictive attributes, while Lichtenwalter et al. (2010) examine important features for link prediction and provide a general, high-performance framework for the prediction task.

As we did with community features, many works reinforce the classifier with other kinds of knowledge. The authors of Shibata et al. (2012) used textual features besides the topological ones and applied SVM as the supervised learning method. In Wang et al. (2011), spatial and mobility information is used to help the classifier.

Despite the good performances achieved, none of the works reported so far solves the interaction prediction problem. Works which do consider dynamic networks are Bringmann et al. (2010) and Bliss et al. (2013). In Bringmann et al. (2010), association rules and frequent-pattern mining are used to search for typical patterns of structural changes in dynamic networks. The authors developed the Graph Evolution Rule Miner to extract such rules and applied them to predict future network evolution. In Bliss et al. (2013), the prediction is optimized through weights used in a linear combination of sixteen neighborhood and node similarity features by applying the covariance matrix adaptation evolution strategy. However, in this second work the authors tried to predict only new interactions and not re-occurring ones. Finally, other works such as da Silva Soares and Prudencio (2012), Sarkar et al. (2012) show how an approach based on time series modeling the evolution of continuous univariate features describing node characteristics substantially helps in solving the link prediction task.

As shown in Lü and Zhou (2011), despite their high precision, supervised approaches can be prohibitively time-consuming for large networks having over 10,000 nodes. Moreover, supervised methods have been proven to reach better performance in terms of both accuracy and precision than unsupervised methods. Thus, given our interest in large, sparse networks, and given that all the works cited highlight the importance of using features outside the links’ dimension, our focus on local information gathered from communities and time series features to train the classifier is justified. In order to reduce the computational complexity, several approaches such as Soundarajan and Hopcroft (2012) make use of clustering and community information. These analyses suggest that clustering information, no matter the algorithm used, improves link prediction accuracy.
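The computational saving offered by community information can be illustrated with a short sketch: only pairs falling inside the same community are retained as candidates, shrinking the quadratic pair space before any feature is computed. The partition below is a toy assumption; any community discovery output would serve.

```python
# A minimal sketch of community-based candidate filtering: the quadratic
# space of node pairs is reduced to intra-community, non-observed pairs.
# The toy partition and node set are illustrative assumptions.
from itertools import combinations

def intra_community_candidates(nodes, community_of, existing_edges):
    """Keep only same-community pairs that are not already linked."""
    existing = {frozenset(e) for e in existing_edges}
    return [(u, v) for u, v in combinations(nodes, 2)
            if community_of[u] == community_of[v]
            and frozenset((u, v)) not in existing]

nodes = [1, 2, 3, 4, 5, 6]
community_of = {1: "A", 2: "A", 3: "A", 4: "B", 5: "B", 6: "B"}
print(intra_community_candidates(nodes, community_of, [(1, 2)]))
# [(1, 3), (2, 3), (4, 5), (4, 6), (5, 6)] -> 5 candidates instead of 15 pairs
```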

In order to build an efficient classifier for link prediction, it is crucial to define and calculate a set of graph structural features. As stated by the papers mentioned previously, when dealing with large-scale graphs that may include millions of vertices and links, one of the challenges is the computationally intensive extraction of such features. Using our approach, we dramatically reduce the cost of feature computation because the calculation is performed separately on the links present within each network community. Several studies related to link prediction, such as Feng et al. (2012), Fire et al. (2013), Jahanbakhsh et al. (2012), Lichtenwalter and Chawla (2012), Xu and Rockmore (2012), try to suggest which topological structures of a network are optimal and which features to use with them. For example, Feng et al. (2012) analyze the relation between network structure and the performance of link prediction algorithms, while Jahanbakhsh et al. (2012), as noted above, show that only a small set of features is essential for predicting new edges and that contacts between high-centrality nodes are more predictable. The authors also claim that link prediction methods perform poorly on networks with low clustering coefficient, while, as the clustering coefficient grows, the accuracy is drastically improved. Fire et al. (2013) investigate the effectiveness of link prediction by gradually reducing the number of visible links in the studied networks. They demonstrate that classification quality degrades with the number of visible links and that even a small fraction of visible links helps in solving the problem with chances significantly higher than random. The authors of Xu and Rockmore (2012) propose a feature selection framework based on ranking, weighting, correlation and redundancy. In particular, they focus on preserving the maximum accuracy by finding the minimum redundancy in the feature space using a greedy scheme.

We proved that a specific community discovery algorithm can improve performance depending on the type of dataset. Moreover, the main difference between our approach and the works reported is that our prediction is based not only on the observed structural, topological and community features, but also on the forecast of future features. In other words, it improves the state of the art by combining the use of communities and time series for solving interaction prediction.

Finally, in the literature only a few works treat the problem of weak ties in link prediction, which we analyzed in the last section. Some studies show how and why weak ties can be useful in link prediction. In particular, Lü and Zhou (2009) show how the accuracy in link prediction can be improved by exploiting the contribution of weak ties. The Weak Ties Theory (Granovetter 1973) states that people usually obtain useful information or opportunities through acquaintances rather than close friends, i.e., the weak links in their friendship network play a significant role. Recently, the authors of Onnela et al. (2007) demonstrated that weak ties mainly maintain the connectivity in mobile communication networks, and Csermely (2004) explains how weak ties maintain the stability of biological systems. Xiang et al. (2010) develop an unsupervised model to estimate relationship strength from interaction activity and user similarity, while Gilbert and Karahalios (2009) present a predictive model that maps social media data to tie strength. These approaches were neither exploited nor used in our workflow on weak ties because of (1) the dynamic nature of our dataset, (2) the higher abstraction level selected (i.e., we consider weak ties as the ties among communities, thus losing the original source and destination nodes), and (3) our intention to replicate the workflow adopted for the link prediction of strong ties.

6 Conclusions

In this work, we have tackled the Link Prediction problem in a dynamic network scenario. Since networks often model highly evolving realities that cannot easily be “frozen” in time without loss of information, a time-aware approach to link prediction is mandatory to achieve valuable results. Moreover, due to the intrinsically high computational cost of the approaches that solve this problem, it is important to reduce the list of possible candidates for which to compute a prediction (preferably avoiding the generation of false positives). To this end, we have exploited the community structure of social networks both to bound the result set and to design features whose analysis through time has allowed the definition of a high-performance supervised learning strategy. However, using network partitions as filters makes the proposed approach focus only on the prediction of intra-community interactions: to overcome this issue, we propose an experimental setting specifically designed to address inter-community interaction prediction. Using community-induced graphs, we show that the proposed analytical workflow can be applied to this complex problem and discuss the quality of the obtained results.

The results obtained with the proposed methodology open the way to several future lines of analysis. Indeed, more accurate time series forecasting techniques can be evaluated in order to reduce the forecast error, and evolutionary community discovery approaches can be used in order to incorporate community life cycle features within the predictive process. Moreover, with respect to the type of dataset used, it could be possible to consider other types of features, such as mobility knowledge and spatial co-location. All these improvements will lead to narrower and more sophisticated classifiers that, taking into account more and more aspects, will be able to better predict future human interactions.