1 Introduction

Often the data within so-called Big Data is highly connected and has a complex structure. Representing it as a graph, with attributes associated with nodes and relationships represented by edges, usually provides a good model. Social Networks, Biological Networks and Telecommunications Networks are examples of areas where graph models are widely used. Representing data as a graph allows highly connected data to be explored by searching for paths among nodes and by other graph-related tasks, which are useful in different contexts within these areas [1, 2].

When the graphs become larger and their structure more complex, querying and accessing the data becomes more difficult. Interactive clustering and cluster analysis can play a key role in this case, as data-mining and statistical methods could be coupled with interactive clustering to get better solutions when handling large volumes of data [2]. Exploratory Data Analysis requires systems that are interactive for users and which incorporate commonly used techniques such as clustering, visual filtering, zooming and aggregation [3, 4], in order to process the large amount of available data in big data and graphs [5].

Our paper focuses on graph simplification techniques, and more specifically on the clustering algorithms that they rely upon [6]. These algorithms should be highly scalable. However, current analytical methods that require clustering of big graph data have scalability problems and a consequent lack of interactivity, as we discuss further in the related work [5, 7]. A second important issue in graph clustering is that most solutions address either the structure of the graph (its topology) or its content information, i.e. the node attribute information and its different data types (numerical, categorical, etc.), but not both. In this paper we propose scalable graph clustering algorithms which can incorporate both structural and contextual information of a graph.

The major contribution of this paper is to combine two existing techniques, graph embedding and clustering, in a novel way which has not been attempted previously in the literature, in order to solve an important problem for graph data analysts: interactive graph clustering.

Although parallelization is not the focus of the work presented in this paper, we have chosen an embedding and a clustering method which are highly parallelizable, and we discuss how to do this in the paper, as future work.

The paper is structured as follows: in Sect. 2 we discuss relevant related works; in Sect. 3 we outline our graph embedding and clustering approach; in Sect. 4 we present the empirical results of benchmarking the embedded graphs for different dataset sizes and numbers of clusters, discuss thresholds of interactivity for a user and discuss the parallelization of the processes. Finally, we present the conclusions and future work.

2 State of the Art and Related Work

In this Section we consider graph clustering approaches in general, mixed data type clustering approaches, k-Means clustering, which we adapt for our method, and graph embedding systems.

2.1 Graph Clustering Approaches

The interactive techniques of aggregation, zooming and filtering of considerable volumes of highly multidimensional data are strongly based on clustering algorithms, which provide a perception of the general structure and trends of the data, based on a good summarization [8]. Most current approaches to graph clustering can be classified as structural or content-based [8]. A purely structural approach takes into consideration only the topological information (structure) of a graph, and not the domain knowledge (attributes). Modularity-based, spectral or singularity-based, fuzzy, Markov chain and local methods are some examples of the structural approach to clustering. As indicated in [9, 10], this approach leads to a random distribution of vertex properties within clusters, since the information content of the attributes is not considered. The purely content-based clustering approach takes into consideration just the domain knowledge, but not the topological information of the graph. This usually leads to clustering solutions that are domain-specific, over-specialized, and with a loose intra-cluster structure [9, 10].

2.2 Mixed Data Type Clustering Approaches

Some representative references among the works addressing mixed clustering in recent years are [10–13]. The most widely discussed and cited algorithm is SA-Cluster, described in [9]. It can be argued that it represents the state-of-the-art approach against which others should be compared. The algorithm aims to maintain both homogeneous attribute values and a cohesive structure within clusters at the same time. Extensive experiments over a number of real-world networks, using a set of metrics such as density, entropy, DBI (the metric definitions can be found in [14, 15]) and run time, support the adequacy of the algorithm.

For the understanding of our contributions, let us describe summarily the SA-Cluster algorithm [9]. The initial graph (the “structure”) is extended with additional nodes and attributes (which are associated with the nodes), thus adding new edges. In this way the attribute information is present in the structure of the new augmented heterogeneous graph. This augmented graph will be clustered on the basis of a distance combining both attribute and structural distances, which is changed iteratively in the clustering process as described next.

One starts with a (random walk) distance defined as follows: consider a transition probability matrix P over a graph, which indicates the probability of transition between any two nodes, and a restart probability c ∈ (0, 1). The term \( c(1 - c)^{\text{length}(\tau)} \) represents the probability of returning to the initial node following a random walk of length(τ), where τ is a path from vi to vj whose length is length(τ). That is, before making a choice on the random walk, there is a probability c of returning to the starting node.

We note the distinction between a ‘path’ and a ‘walk’: a path can be said to link two vertices vi and vj, whereas a ‘walk’ may start at a vertex vi and finish at another vertex vj.

Given the definition for the random walk, the neighbourhood random walk distance between two nodes, vi and vj of the initial graph can be defined as:

$$ d(v_i, v_j) = \sum_{\substack{\tau : v_i \to v_j \\ \text{length}(\tau) \le l}} p(\tau)\, c\,(1 - c)^{\text{length}(\tau)} $$
(1)

in which p(τ) is the transition probability of the path τ and l is the maximum length that a random walk can take.
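As an illustration of Eq. 1, the following is a minimal sketch rather than the SA-Cluster implementation. It assumes that τ ranges over all walks of length at most l, so that the sum over τ can be accumulated from powers of P (entry (i, j) of the g-th power of P is the total probability of all walks of length g from vi to vj). The function names and the toy three-node graph are hypothetical.

```python
def mat_mul(a, b):
    """Multiply two square matrices given as lists of lists."""
    n = len(a)
    return [[sum(a[i][k] * b[k][j] for k in range(n)) for j in range(n)]
            for i in range(n)]


def random_walk_distance(P, c, max_len):
    """Return R with R[i][j] = sum over g = 1..max_len of c * (1 - c)**g * (P**g)[i][j]."""
    n = len(P)
    R = [[0.0] * n for _ in range(n)]
    power = [row[:] for row in P]           # holds P**g, starting at g = 1
    for g in range(1, max_len + 1):
        coeff = c * (1.0 - c) ** g          # restart weighting from Eq. 1
        for i in range(n):
            for j in range(n):
                R[i][j] += coeff * power[i][j]
        power = mat_mul(power, P)           # advance to P**(g + 1)
    return R


# Toy path graph 0 - 1 - 2 with uniform transition probabilities (assumed data).
P = [[0.0, 1.0, 0.0],
     [0.5, 0.0, 0.5],
     [0.0, 1.0, 0.0]]
R = random_walk_distance(P, c=0.2, max_len=2)
```

Because every entry requires repeated dense matrix products, the cost grows quickly with the number of nodes, which is consistent with the scalability concerns raised for random walk distances.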

The initial weights of the transition probability matrix P are 1.0 for both structural and attribute nodes, and are changed at each clustering iteration using:

$$ \upomega_{\text{i}}^{{{\text{t}} + 1}} = \frac{1}{2}(\upomega_{\text{i}}^{\text{t}} + {\Delta \omega }_{\text{i}}^{\text{t}} ) $$
(2)

in which \( \omega_{i}^{t + 1} \) represents the weight of attribute \( a_i \) in the (t + 1)th iteration.

As SA-Cluster uses random walk distances, it is difficult to parallelize and, moreover, not practically applicable to large graphs/networks. A range of improvements and alternative approaches have been proposed, which we turn to next.

SI-Cluster [11] is another distance-based algorithm, which is specifically focused on social networks. It is based on a heat-diffusion model of the network. It splits the initial graph into a set of influence graphs, followed by an iterative process of cluster quantification and weight updates similar to that of SA-Cluster, with the difference that influence-based scores are used as measures. The SI-Cluster runtime scales better for large datasets, and significantly outperforms all the previous algorithms. Indeed, one of the results is that SI-Cluster can handle 1 M nodes on an 8 GB machine in just over 6000 s, while the other algorithms ran out of memory. Its main limitation is that the heat diffusion model might be limited to Social Networks.

2.3 K-Means Clustering Algorithm

In the current work we have adapted the standard k-Means [16] for mixed data type clustering. We will now give a brief description of the basic algorithm. Later in Sect. 3 we will describe the adaptations made. k-Means [16] is an iterative algorithm which partitions n observations into k clusters, in which each observation belongs to the cluster whose mean value is closest to the value of the observation. In each cluster, the mean value represents the ‘prototype’ or ‘centroid’ of that cluster. More specifically, consider that we give as initial input to k-Means a set of observations (x1, x2,…, xn), in which each observation is a d-dimensional real vector, that is, a file whose rows are the observations and whose columns are the attributes. The algorithm will then try to partition the n observations into k (≤ n) sets S = {S1, S2,…, Sk} in such a way that the within-cluster sum of squares (Eq. 3) of the distances between the centroid (mean) and the observations is minimized. That is:

$$ \mathop{\arg\min}\limits_{S} \sum_{i = 1}^{k} \sum_{x \in S_i} \left\| x - \mu_i \right\|^{2} $$
(3)

where μi is the mean of points in cluster Si.
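The basic algorithm can be sketched as follows. This is a minimal, pure-Python illustration of Lloyd's iteration for Eq. 3, not the ELKI implementation used later in the paper; the function names, the random initialisation with a fixed seed and the toy points are assumptions made for the example.

```python
import random


def squared_distance(x, y):
    """Squared Euclidean distance between two equal-length vectors."""
    return sum((a - b) ** 2 for a, b in zip(x, y))


def mean(points):
    """Coordinate-wise mean of a non-empty list of points."""
    dims = len(points[0])
    return [sum(p[i] for p in points) / len(points) for i in range(dims)]


def k_means(points, k, iterations=100, seed=0):
    """Lloyd's algorithm: alternate assignment and centroid update until stable."""
    rng = random.Random(seed)
    centroids = [list(p) for p in rng.sample(points, k)]
    assignment = None
    for _ in range(iterations):
        # Assignment step: each observation joins the cluster with the closest mean.
        new_assignment = [
            min(range(k), key=lambda j: squared_distance(p, centroids[j]))
            for p in points
        ]
        if new_assignment == assignment:
            break
        assignment = new_assignment
        # Update step: each centroid becomes the mean of its members.
        for j in range(k):
            members = [p for p, a in zip(points, assignment) if a == j]
            if members:  # keep the old centroid if a cluster becomes empty
                centroids[j] = mean(members)
    return centroids, assignment


def within_cluster_ss(points, centroids, assignment):
    """Objective of Eq. 3 for a given partition."""
    return sum(squared_distance(p, centroids[a]) for p, a in zip(points, assignment))


points = [(0.0, 0.0), (0.0, 1.0), (1.0, 0.0),
          (10.0, 10.0), (10.0, 11.0), (11.0, 10.0)]
centroids, assignment = k_means(points, k=2)
```

Each iteration performs n × k distance evaluations, which is the cost that dominates the runtime for large n.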

2.4 Graph Embedding Systems

In simple terms, graph embedding is a process in which a graph (vertices, edges) is mapped into a space in which the distance between the vertices can be read off from their coordinates within that space. For example, if two nodes which are three links away in a graph are embedded in a two-dimensional space, then the physical (Euclidean) distance between them would be three (units). This representation is very useful for answering distance queries, for example, in a Telecommunications Network. A more formal treatment is given in Sect. 3 of the paper.

Two embedding systems which are commonly referenced in the literature are Orion [17], a graph coordinate embedding on Euclidean spaces, and Rigel [18], which uses hyperbolic spaces. Both have a computationally expensive first step of graph embedding and coordinate computation, but then the distance between any two nodes can be quickly computed.

Orion [17] maps the graph nodes to a low-dimensional Euclidean space coordinate system. A landmark-based approach is used: a fixed reasonably small number of nodes are selected and the distances between each pair of them are computed. Then, the remaining nodes obtain their coordinates iteratively, one by one, through a BFS (Breadth-First Search) algorithm to calculate the distances to each landmark. Landmarks are chosen according to their degree of centrality and being far apart: nodes with a higher degree of centrality and not too close to each other are preferred. Rigel [18] employs a hyperbolic coordinate system for graph embedding, making use of parallelization to reduce the computational cost, by partitioning the graph and performing the embedding in parallel. The system is also landmark based. The additional parameter for the hyperbolic space, the curvature, has to be calibrated.
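The landmark bootstrapping step can be illustrated with a short sketch. It is not the Orion or Rigel code: landmark selection is simplified here to the highest-degree nodes (the 'far apart' criterion is ignored), the step that fits coordinates to the measured hop distances is omitted, and all names and the toy graph are hypothetical.

```python
from collections import deque


def bfs_hops(adjacency, source):
    """Hop distance from source to every reachable node of an unweighted graph."""
    dist = {source: 0}
    queue = deque([source])
    while queue:
        u = queue.popleft()
        for v in adjacency[u]:
            if v not in dist:
                dist[v] = dist[u] + 1
                queue.append(v)
    return dist


def pick_landmarks_by_degree(adjacency, count):
    """Simplified selection: the `count` highest-degree nodes (assumption)."""
    return sorted(adjacency, key=lambda n: len(adjacency[n]), reverse=True)[:count]


def landmark_distance_table(adjacency, landmarks):
    """One BFS per landmark; the BFS runs are independent of each other."""
    return {lm: bfs_hops(adjacency, lm) for lm in landmarks}


# Toy undirected graph given as adjacency lists (assumed data).
adjacency = {
    "a": ["b"],
    "b": ["a", "c", "d"],
    "c": ["b", "d"],
    "d": ["b", "c", "e"],
    "e": ["d"],
}
landmarks = pick_landmarks_by_degree(adjacency, 2)
table = landmark_distance_table(adjacency, landmarks)
```

The resulting table holds, for every node, its hop distance to each landmark; an embedding system would then choose coordinates so that distances in the coordinate space approximate these values.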

According to [17], the Orion embedding of a 275 K-node graph takes about 2–3 h, with an error of 15–20 % on average, i.e., between 0.5 and 5 hops in absolute terms. Rigel, on the other hand, is in principle more time-consuming than Orion, but as the embedding can be parallelized it can serve for very large graphs. The accuracy of shortest path computation is significantly higher on average, the absolute error ranging between 0 and 0.9 hops. We will discuss the parallelization of Rigel later in Sect. 4.6.

3 Graph Coordinates Approach: Embedding + Clustering

In this Section we describe our novel approach for interactive clustering of mixed type graph data, based on embedding the graph in an appropriate coordinate system, followed by standard clustering using mixed structural and attribute distance(s).

3.1 Embedding

The embedding of a graph into a coordinate system with a distance is done by a suitable algorithm which maps every vertex to a point in this n-dimensional system (see Fig. 1). The most frequently used space is the Euclidean metric space; however, we have used the hyperbolic space employed by the Rigel [18] system, given its superior precision and performance potential. A particular type of embedding, greedy embedding, has been used in telecommunications. It is defined by the following property: for any two nodes p and q of the embedded graph, there is a neighbour q’ of q whose (Euclidean) distance to p is less than the one between p and q ([19], p.10). Greedy embedding of telecommunication networks allows for greedy routing of messages within a network.

Fig. 1. Schematic representation of the process of graph coordinate embedding

We have implemented the hyperbolic distance embedding. The hyperbolic distance between two n-dimensional points x and y is defined as:

$$ \delta(x, y) = \text{arccosh}\left( \sqrt{\left( 1 + \sum_{i = 1}^{n} x_i^{2} \right)\left( 1 + \sum_{i = 1}^{n} y_i^{2} \right)} - \sum_{i = 1}^{n} x_i y_i \right) \cdot |c| $$
(4)

where c is the curvature parameter (c ≤ 0; c = 0 gives Euclidean space). Refer to [18] for specific implementation details.
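Eq. 4 translates directly into a few lines of code. The following is a minimal sketch of the formula as written, not the Rigel implementation; the function name and the clamping of the arccosh argument are choices made for this example.

```python
import math


def hyperbolic_distance(x, y, curvature):
    """Distance of Eq. 4 between two n-dimensional points, scaled by |curvature|."""
    sum_x2 = sum(a * a for a in x)
    sum_y2 = sum(b * b for b in y)
    dot = sum(a * b for a, b in zip(x, y))
    arg = math.sqrt((1.0 + sum_x2) * (1.0 + sum_y2)) - dot
    # By the Cauchy-Schwarz inequality arg >= 1; the clamp only guards
    # against floating-point rounding slightly below 1.
    return math.acosh(max(arg, 1.0)) * abs(curvature)


d_same = hyperbolic_distance([1.0, 2.0], [1.0, 2.0], curvature=-1.0)
d_unit = hyperbolic_distance([0.0], [1.0], curvature=-1.0)
```

The cost of one evaluation depends only on the number of dimensions n, not on the number of vertices in the graph, which is what makes distance look-ups fast once the embedding has been computed.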

More formally, to describe the general procedure of embedding, consider a graph G = (V, E), where V is a set of vertices, E is a set of edges, and \( s(v_i, v_j) \) is a similarity relationship between the vertex pair \( (v_i, v_j) \) in G. Then, the embedding of G into the metric space M, using the hyperbolic distance function δ(x, y) defined in Eq. 4, is a mapping \( f: (G, s) \to (M, d) \) such that the similarity \( s(v_i, v_j) \) in the unmapped space corresponds to a distance \( d(v'_i, v'_j) \) in the mapped space. The objective is to minimize \( \| s(v_i, v_j) - d(v'_i, v'_j) \| \) for each pair of vertices, according to the specific choice of norm \( \| \cdot \| \), where \( v'_i \) and \( v'_j \) are the images (in M) of the vertices \( v_i \) and \( v_j \), respectively.

We recall that the objective of the graph-coordinate system embedding is to be able to read off the shortest paths between vertices from the mapped (hyperbolic) space during the clustering process. This shortest-path value will be used as the numerical value which represents structural information for each pair of vertices in the graph. We note that one of the vertices can be the medoid vertex for a cluster. This calculation can be done ‘on the fly’, as the distance is calculated between a pair of vertex points.

3.2 Clustering

From Sect. 3.1 we are now able to obtain a numerical distance value \( \mathbb{D}(v_i, v_j) \) for each vertex pair in the hyperbolic space representation of the graph: the shortest path distance between them. As we have already stated that we are going to perform mixed data type clustering, we also need a categorical value. We will supply a categorical value Cvi to each vertex vi by randomly assigning categories from a set of N arbitrary category labels, represented as alphabetic letters: A, B, C, etc.

Then, for a vertex pair {vi, vj}, the structural data item will be designated as the value \( \mathbb{D}(v_i, v_j) \) and the categorical data items will be designated as Cvi and Cvj. We then calculate the distance between two nodes vi and vj as the weighted sum of the respective categorical data item and the structural data item, thus:

$$ \text{Dist} = \left( \sigma \times \left| C_{v_i} - C_{v_j} \right| \right) + \left( \phi \times \left| \mathbb{D}(v_i, v_j) \right| \right) $$
(5)

where σ + ϕ = 1, |Cvi − Cvj| ∈ {0, 1} and \( \mathbb{D}(v_i, v_j) \in [0, 1] \).

We note that the value \( \mathbb{D}(v_i, v_j) \) is calculated ‘on the fly’ and represents the shortest path between the corresponding vertices vi and vj in the hyperbolic space. The weights σ and ϕ enable us to calibrate the relative weight in the overall distance of the category data item with respect to the numerical structural data item. In the experiments in Sect. 4 we have used default values of σ = 0.5 and ϕ = 0.5 to give an equal weighting. While graph embedding is time consuming, it can be conceived as a pre-processing step before clustering. Once done, the distances between nodes and clusters can be very quickly computed, without further concern about the embedding itself.
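The mixed distance of Eq. 5 can be sketched as below. This is an illustration under two stated assumptions: the categorical term is taken to be 0 when the two labels are equal and 1 otherwise (the text only states that it lies in {0, 1}), and the structural value is assumed to have been normalised to [0, 1] beforehand. The function name is hypothetical.

```python
def mixed_distance(cat_i, cat_j, structural, sigma=0.5, phi=0.5):
    """Weighted sum of a categorical mismatch term and a structural distance (Eq. 5)."""
    if abs(sigma + phi - 1.0) > 1e-9:
        raise ValueError("sigma and phi must sum to 1")
    if not 0.0 <= structural <= 1.0:
        raise ValueError("structural distance must be normalised to [0, 1]")
    # Assumption: simple matching, 0 for equal labels and 1 for different labels.
    categorical = 0.0 if cat_i == cat_j else 1.0
    return sigma * categorical + phi * structural


same_label = mixed_distance("A", "A", structural=0.4)
diff_label = mixed_distance("A", "B", structural=0.4)
structure_only = mixed_distance("A", "B", structural=0.4, sigma=0.0, phi=1.0)
```

Changing σ and ϕ only affects this function, so different weightings can be tried without recomputing the embedding.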

3.3 Complete Process

First, we transform the graph data \( G_D \) and its structure via the hyperbolic embedding process described in Sect. 3.1, which gives a new graph dataset \( {\text{G}}_{\text{D}}^{\text{M}} \), where M is the new space into which the graph is mapped. Then we apply the adapted k-Means clustering algorithm to \( {\text{G}}_{\text{D}}^{\text{M}} \) to give \( {\text{G}}_{\text{D}}^{\text{MC}} \). The complete process is outlined in the pseudo-code shown as ‘Algorithm 1’.

We note that the user could explore the most suitable clustering interactively, for instance, by modifying the weights of the mixed structural and attribute distance.
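The clustering stage of the process can be illustrated with the sketch below. It is not a reproduction of Algorithm 1 or of the adapted ELKI k-Means: it assumes a medoid-style prototype (a cluster member that minimises the total mixed distance to the other members), a 0/1 label mismatch for the categorical term, and a caller-supplied `structural(i, j)` function returning the embedded distance already normalised to [0, 1]. All names and the toy data are hypothetical.

```python
import random


def cluster_mixed(n, categories, structural, k, sigma=0.5, phi=0.5,
                  iterations=20, seed=0):
    """Medoid-style clustering of vertices 0..n-1 under the mixed distance."""

    def dist(i, j):
        categorical = 0.0 if categories[i] == categories[j] else 1.0
        return sigma * categorical + phi * structural(i, j)

    rng = random.Random(seed)
    medoids = rng.sample(range(n), k)
    assignment = []
    for _ in range(iterations):
        # Assignment step: each vertex joins its closest medoid.
        assignment = [min(range(k), key=lambda c: dist(v, medoids[c]))
                      for v in range(n)]
        # Update step: the new medoid minimises the total distance to its members.
        new_medoids = []
        for c in range(k):
            members = [v for v in range(n) if assignment[v] == c]
            if not members:
                new_medoids.append(medoids[c])  # keep the prototype of an empty cluster
                continue
            new_medoids.append(
                min(members, key=lambda m: sum(dist(m, v) for v in members)))
        if new_medoids == medoids:
            break
        medoids = new_medoids
    return medoids, assignment


# Toy data: six vertices with pre-normalised 1-D "embedded" positions (assumed).
positions = [0.0, 0.05, 0.1, 0.9, 0.95, 1.0]
categories = ["A", "A", "A", "B", "B", "B"]


def structural(i, j):
    return abs(positions[i] - positions[j])


medoids, assignment = cluster_mixed(6, categories, structural, k=2)
```

Re-running `cluster_mixed` with different values of `sigma` and `phi` corresponds to the interactive re-weighting described above; the embedded positions are reused unchanged.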

4 Scalability Testing

We present first the experimental setup to test the scalability of our approach, then the runtime results, followed by their comparison with other methods, and, finally we discuss interactivity, accuracy and parallelization issues.

4.1 Experimental Setup

To test the feasibility of the proposed approach, we performed hyperbolic graph-coordinate embedding followed by (mixed) clustering using an adapted version of the k-Means algorithm [16]. We considered a simple case for the distance calculation, as described in Sect. 3.2. This simplified model serves our present goal, which is to evaluate the feasibility of the approach. However, it could be extended to a more general set of cases. We used two different types of graphs: randomly built (using R-Mat [20]), and the DBLP bibliographical data graph [21]. We tested with 10 K, 50 K, 100 K, 200 K, 500 K, and 1 M nodes. In the case of the DBLP dataset, we sampled all the available data in order to obtain the required number of nodes for each test dataset. For each dataset, we tested a number of clusters going from 10 to 100, in steps of ten (that is, 10, 20,…, 100). The hardware used was a medium/high range PC with 4 cores, the CPU being an Intel i7 960 at 3.2 GHz. Java/Netbeans were used for programming. The implementation of Rigel (source and executable of the hyperbolic embedding system) was adapted from: http://sandlab.cs.ucsb.edu/rigel/. The version of k-Means was Lloyd’s within the ELKI framework, adapted from: http://elki.dbs.ifi.lmu.de/.

The tests intend to check the feasibility of the following use case: the user starts with a coarse view of 10 clusters and requests finer resolution of the clustering up to 100 clusters. The user intends to use different weights for the mixed distance, which requires re-computing the clustering but not the embedding. Thus, we need to see whether re-computing the clustering can be interactive.

4.2 Results

The (sequential) hyperbolic graph-coordinate embedding of 1 M nodes took around 6.5 h, without noticeable difference between the two datasets (Random and DBLP). However, as it is a pre-processing step, we are not especially concerned about this time, and it could be parallelized as discussed later in Sect. 4.6. We note that the distance calculation between each node pair in the hyperbolic space, and the comparison of the categorical values for each node pair, incur an overhead which makes the computational cost non-linear for increasing graph sizes. However, we would expect linear performance to be obtained with parallelization, as discussed in Sect. 4.6.

Figure 2(a) shows the runtime in seconds for 1 M node graphs and different numbers of clusters. It confirms the linearity of the runtime with respect to the number of desired clusters, with an upper bound of approximately 83 min for 100 clusters. The run-time is approximately the same for both types of graphs. In Fig. 2(b) we see the runtime in seconds for 100 clusters and different graph sizes, which ranges from 7 s for 10 K nodes and 10 clusters up to the approximately 83 min for 1 M nodes and 100 clusters.

Fig. 2. (a) Number of clusters vs. runtime for 1 M nodes; (b) Size of graph vs. runtime for 100 clusters.

The minimal runtime for 500 K nodes and just 10 clusters is around 5 min. If the graph size is 10 K or 50 K, the worst-case time is 5 min. For 100 K and 200 K nodes, the runtime ranges between 1 and 15 min (see Tables 1 and 2).

Table 1. Clustering time (sec.) for random graph (number of nodes versus number of clusters)
Table 2. Clustering time (sec.) for DBLP graph (number of nodes versus number of clusters)

4.3 Comparison with Other Methods

Most of the state-of-the-art attribute-structural clustering algorithms mentioned previously do not scale appropriately, as they fail to work on 1 M nodes, with the exception of SI-Cluster. Furthermore, the runtime results presented show that our approach outperforms most of those algorithms, again with the exception of SI-Cluster (which took 6000 s for 1 M nodes and 4000 clusters, and so would outperform our approach on the larger number of clusters), and BAGC, which showed a higher speed. Unlike SI-Cluster, which is limited to social graphs, our approach holds for generic graphs. The proposed approach also has the advantage that the weights can be modified and only the clustering needs to be re-computed, with the runtimes indicated. Furthermore, it is not limited to categorical/discrete attributes, unlike the other approaches.

4.4 Interactivity

The results shown could be improved using a more powerful desktop computer, or a cluster. However, they do provide a qualitative picture in terms of reasonable desktop interactivity for an average user, which also depends on the context of usage. This paper is not about the parallelization of graph data processing. However, we can say that if we are able to obtain a simpler and more informative representation of a graph topology, this will make any parallelization effort easier and more effective. Furthermore, the current work can be considered a preliminary step towards implementing a parallelized version of the hyperbolic embedding and the k-Means clustering. This is discussed in Sect. 4.6.

With reference to Tables 1 and 2, we can say that 500 K and 1 M nodes currently fail to be interactive irrespective of the number of clusters, as the minimal runtime for 500 K nodes and just 10 clusters is around 5 min. If the graph size is 10 K or 50 K, the worst-case time is 5 min, making full interactivity possible for the user. For 100 K and 200 K nodes the runtime ranges between 1 and 15 min, which allows some interactivity, but the number of allowed clusters would be limited. Thus, our approach scales reasonably well, supporting reasonable interactivity for graphs and clusterings up to medium size. The tests show that the proposed approach makes interactivity feasible in these cases.

4.5 Accuracy

As already mentioned, we chose to use the Rigel hyperbolic embedding method because its error is much smaller than that of Orion’s Euclidean embedding. However, its absolute error varies from 0 to 0.9 hops, which might be unacceptable in some cases. There could be ways to deal with this limitation, for example, by using overlapping clustering. Indeed, in this case we would just set the gap to 2 × max(ϕ) = 2 × 0.9 = 1.8 hops, where ϕ here denotes the absolute error of the embedding (not the weight ϕ of Eq. 5). We intend to explore this in the future.

4.6 Discussion About the Parallelization of the Clustering and Embedding Processes

In [22], a parallelization of k-Means using MapReduce is presented. It is stated that for the k-Means algorithm, the main computational cost is in the distance calculations. For each iteration, nk distance computations are performed, where n is the number of objects and k is the number of clusters being formed. Within a given cluster, the distance calculation between an object and the centre is independent of the same calculation in another cluster. Hence, distance computations for different clusters can be executed in parallel. However, in each iteration, the new centres to be used in the next iteration must be updated. This obliges the iterative procedures to be executed serially.

In [18] the authors implemented a parallel version of their Rigel embedding system. In Rigel, the landmark bootstrapping pre-process computes the BFS trees rooted from each landmark. This can be run independently and in parallel for each landmark on different servers. Following the bootstrapping process, each graph vertex can also be embedded independently and in parallel based on the coordinates of the global landmarks. Given the large number of nodes, their distribution across servers has to be done carefully to ensure load balancing.
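The structure of one MapReduce-style k-Means iteration can be sketched as follows. This is an illustration of the idea only, not the method of [22]: both phases run sequentially in a single process here, plain Euclidean distance is used instead of the mixed distance, and the function names and toy data are hypothetical.

```python
def squared_distance(x, y):
    """Squared Euclidean distance between two equal-length vectors."""
    return sum((a - b) ** 2 for a, b in zip(x, y))


def map_phase(points, centres):
    """Map: emit (index of nearest centre, (point, 1)) for every point.

    Each point is processed independently of the others, so this is the
    part that could be split across workers.
    """
    for p in points:
        nearest = min(range(len(centres)),
                      key=lambda j: squared_distance(p, centres[j]))
        yield nearest, (p, 1)


def reduce_phase(pairs, centres):
    """Reduce: sum the points emitted for each centre and return the new centres."""
    dims = len(centres[0])
    totals = {j: [0.0] * dims for j in range(len(centres))}
    counts = {j: 0 for j in range(len(centres))}
    for j, (p, one) in pairs:
        counts[j] += one
        for d in range(dims):
            totals[j][d] += p[d]
    return [
        [t / counts[j] for t in totals[j]] if counts[j] else list(centres[j])
        for j in range(len(centres))
    ]


def k_means_iteration(points, centres):
    """One iteration; successive iterations must still run one after another."""
    return reduce_phase(map_phase(points, centres), centres)


points = [(0.0, 0.0), (0.0, 2.0), (10.0, 0.0), (10.0, 2.0)]
centres = [[1.0, 1.0], [9.0, 1.0]]
centres = k_means_iteration(points, centres)
```

The map phase carries the nk distance computations and has no dependencies between points, whereas the reduce phase is the synchronisation point that forces iterations to be executed serially.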

5 Conclusions and Future Work

With this work we have addressed the issue of the lack of scalable algorithms for the support of interactive analytical tasks such as clustering and have opened an area for their development, focusing in particular on algorithms which incorporate both structural and contextual information of a network.

In this research we have made use of the results in the field of telecommunications and proposed applying them to solve the issues discussed for the present field of interactive clustering, related to the interactive exploration of large networks.

We have used a graph coordinate embedding approach which has shown its potential scalability as well as applicability to the mixed graph clustering techniques.

Several challenges remain ahead as future work. The first challenge is to parallelize the hyperbolic embedding process and the k-Means clustering. Secondly, we need to develop metrics which accurately measure the quality of the processing results. A third challenge is to test different distance computations and calibrate the weights.

To summarize, constant-time distance calculation has potential for the development of mixed graph clustering methods with application to interactive analysis techniques, as it allows the distance between two vertices of a graph to be calculated in constant time, regardless of the graph’s size.

We have identified a range in which interactive data analytics is plausible, in terms of the number of clusters and graph dataset size. We propose that the embedding pre-process followed by clustering of the mapped space within a given interactivity threshold, once parallelized, will represent a powerful and flexible environment for graph data analysts.