
1 Introduction

Graph representation, or graph embedding [17], aims at mapping the vertices of a graph into a low-dimensional space while preserving the structural information and revealing the proximity of instances. The compact representations of graph vertices are then useful for further tasks such as classification [10] and clustering [8, 15].

The most intuitive and simple way to handle a graph is to use only the connection information and represent the graph as a deterministic adjacency matrix. Dimension reduction techniques [12] applied directly to the adjacency matrix can achieve superior performance in many cases.

Although directly applying dimension reduction to the complete graph is efficient in many scenarios, such direct matrix models also have obvious disadvantages. Generally speaking, their limitations are threefold. First, direct matrix models easily suffer from high computational complexity on large-scale graphs. Since the adjacency matrix is deterministic and fixed, such methods are not flexible enough when the dataset is large. Second, direct matrix models do not consider enough information: they only provide a global view of the graph structure, whereas the local information that depicts the neighborhood of each vertex should also be reflected in the learned features. Finally, the success of direct matrix models depends heavily on the representation power of the dimension reduction model. Methods such as spectral learning [12] and Non-negative Matrix Factorization [9] have limited representation power.

In order to address these challenges, we propose a new deep graph representation method based on streaming algorithms [1, 19]. The proposed method keeps the advantages of deterministic matrix methods while introducing several new ideas to handle their limitations.

First, we introduce a streaming-motivated stochastic idea into the model. Streaming methods process data streams: the input of a streaming model is organized as a sequence of data blocks, and the main purpose of the streaming formulation is to address memory issues. In this paper, we sample a small portion of vertices at a time and formulate a graph stream. As vertices accumulate along the data stream, more and more information is captured by the model rather than stored in the data. Since we choose a fixed, small number of vertices each time, the dimension of the input is reduced significantly. Consequently, the streaming strategy helps to handle the computational complexity issues.

Second, in order to incorporate more information into the model, we adopt a regularization framework. Direct matrix models only consider the visible edges between vertices. The key point of the regularization framework is that the graph regularization term brings vertex similarities into the model in addition to the visible connections: vertices that are similar in the original space should have similar representations in the latent low-dimensional space.

Finally, after the graph stream is obtained, we feed it into a deep autoencoder [6] to learn the representations of the graph vertices. The learning power of the deep autoencoder ensures that the learned features keep sufficient information from the original graph.

2 Related Work

Graph representation, also known as graph embedding, is a subtopic of representation learning. Representation learning [4] attempts to encode data from the original space into a vector space in an unsupervised fashion, so that the learned features can be used in further tasks.

Early methods such as Laplacian Eigenmaps (LE) [3] and Locally Linear Embedding (LLE) [14] are deterministic. In these methods, the graph is represented as an adjacency matrix, and the embedding is generally obtained through dimension reduction techniques [18]. Specifically, the intuitive idea is to solve for the eigenvectors of the affinity matrix; such methods exploit the spectral properties of affinity matrices and are known as spectral methods. More recently, deep learning models have also been used as dimension reduction tools because of their superior representation ability [8, 15].

More recent works on graph embedding are stochastic, in which the graph is no longer represented as a fixed matrix [5, 13]. Methods such as DeepWalk [13] and node2vec [5] regard the graph as a vertex vocabulary from which a collection of node sequences is sampled. Subsequently, language models such as skip-gram [11] can be used to obtain the final representations.

While the deterministic methods are not flexible [2], the disadvantages of stochastic models are also obvious. Since stochastic models only consider local information describing the nearest neighbors of vertices, they fail to provide a global picture of the whole graph. The loss of global information hurts the performance of such models when the graph structure is irregular.

3 Network Embedding Problem

3.1 Notations

In this paper, we denote vectors as lowercase letters in boldface and matrices as uppercase letters in boldface. The elements of a matrix and a vector are denoted as \(\mathbf {X}_{ij}\) and \(\mathbf {x}_{i}\), respectively. Given a graph G(V, E), V is the vertex set denoted as \(\{v_{1},\ldots,v_{n}\}\) and E is the edge set denoted as \(\{e_{ij}\}_{i,j=1}^{n}\).

We then define the graph embedding as:

Definition 1

(Graph Embedding). Given an N-vertex graph G(V, E), the goal of graph embedding is to learn a mapping \(v_{i} \longmapsto \mathbf {y}_{i}\), \(\mathbf {y}_{i} \in \mathbb {R}^{d}\). The learned representations in the latent low-dimensional space should be able to preserve the structural information of the original graph.

3.2 Streaming Strategy

In this subsection, we illustrate how to formulate the data stream from a given graph G(V, E). Let K be the number of data chunks in a data stream, and denote by \(S_{k}\) the \(k^{th}\) data chunk, where \(k\in \{1,2,\cdots ,K\}\). K can be very large, since a substantial number of samplings helps to visit the graph completely.

Fig. 1. The streaming strategy. Each time we choose a fixed number of vertices; the data chunks are constructed from the selected vertices.

Let the number of vertices selected at one time be \(D\) (\(D \ll N\)). D is also the input dimension of the embedding model, since D is fixed as a constant. In the training phase, at an arbitrary step k, we select D nodes uniformly from the vertex collection \(\{v_{1},\ldots,v_{n}\}\). A subgraph is then constructed from the selected nodes, and \(S_{k}\) is the adjacency matrix of this subgraph. Fig. 1 illustrates the sampling process that forms a data stream. A data stream is denoted as \(S=\{S_{k}; k\in \{1,\cdots ,K\}\}\).
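A minimal sketch of this chunk construction, assuming the full graph is available as a dense NumPy adjacency matrix `A`; the function names are illustrative, not from the paper.

```python
import numpy as np

def sample_chunk(A, D, rng):
    """Sample D vertices uniformly and return the adjacency matrix of the induced subgraph."""
    N = A.shape[0]
    idx = rng.choice(N, size=D, replace=False)  # uniform sampling without replacement
    return A[np.ix_(idx, idx)]                  # D x D chunk S_k

def graph_stream(A, D, K, seed=0):
    """Yield K data chunks S_1, ..., S_K forming the graph stream S."""
    rng = np.random.default_rng(seed)
    for _ in range(K):
        yield sample_chunk(A, D, rng)
```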

In the embedding phase, the goal is to map each vertex to its representation using the trained model. However, the dimension of the original data, N, is much higher than the input dimension of the model, D. Consequently, we run a simple Principal Component Analysis (PCA) on \(\mathbf {X}\) to obtain \(\mathbf {X}^{D}\) with dimension D. Then \(\mathbf {X}^{D}\) serves as the input to obtain the compact representation of each vertex.
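A sketch of this embedding phase using scikit-learn's PCA, assuming a trained `encoder` callable that maps D-dimensional inputs to d-dimensional codes (both names are illustrative):

```python
from sklearn.decomposition import PCA

def embed_all_vertices(A, encoder, D):
    """Reduce the N-dimensional rows of A to D dimensions with PCA, then encode them."""
    X_D = PCA(n_components=D).fit_transform(A)  # N x D input for the trained model
    return encoder(X_D)                         # N x d compact representations
```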

3.3 Graph Autoencoder

Autoencoders [16] are powerful in representation tasks. After obtaining the data stream, we use a deep graph autoencoder to learn the low-dimensional vectors.

Deep Autoencoder: The autoencoder paradigm attempts to copy its input to its output, which results in a code layer that may capture useful properties of the input.

Let \(\mathbf {X} = \{ {\mathbf {x}_{i}}: {\mathbf {x}_{i}}\in \mathbb {R}^{m \times 1} \}_{i=1}^{n}\) and \(\mathbf {Z} = \{ {\mathbf {z}}_{i}: {\mathbf {z}}_{i}\in \mathbb {R}^{m \times 1} \}_{i=1}^{n}\) be the input matrix and the reconstruction matrix, respectively. \(\mathbf {Y} = \{ {\mathbf {y}_{i}}: {\mathbf {y}_{i}}\in \mathbb {R}^{d \times 1} \}_{i=1}^{n}\) is the code matrix, where the dimension of \(\mathbf {y}_{i}\) is usually much lower than that of the original data \(\mathbf {x}_{i}\). A layer-wise interpretation of the encoder and decoder can be written as:

$$\begin{aligned} \mathbf {Y} = f_{\theta }(\mathbf {X}) = \delta (W_{encoder}\mathbf {X} + b_{encoder}) \end{aligned}$$
(1)
$$\begin{aligned} \mathbf {Z} = g_{\theta }(\mathbf {Y}) = \delta (W_{decoder}\mathbf {Y} + b_{decoder}) \end{aligned}$$
(2)

For convenience, we summarize the encoder parameters as \(\theta _{encoder}\), and the decoder parameters as \(\theta _{decoder}\). Then the loss function can be defined as:

$$\begin{aligned} \mathcal {L} = \Vert \mathbf {X} - \mathbf {Z}\Vert ^{2}_{F} = \sum _{i=1}^{n} \Vert \mathbf {x}_{i} - \mathbf {z}_{i} \Vert ^{2}_{2} \end{aligned}$$
(3)

Since a deep autoencoder can be regarded as a special case of a feedforward network, the parameters are optimized by backpropagating gradients via the chain rule.
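A minimal single-layer sketch of Eqs. (1)-(3) in PyTorch; the class name and the choice of the sigmoid function for \(\delta\) are our assumptions rather than details given in the paper.

```python
import torch
import torch.nn as nn

class Autoencoder(nn.Module):
    def __init__(self, input_dim, code_dim):
        super().__init__()
        self.encoder = nn.Linear(input_dim, code_dim)  # W_encoder, b_encoder
        self.decoder = nn.Linear(code_dim, input_dim)  # W_decoder, b_decoder

    def forward(self, X):
        Y = torch.sigmoid(self.encoder(X))  # Eq. (1): Y = delta(W_enc X + b_enc)
        Z = torch.sigmoid(self.decoder(Y))  # Eq. (2): Z = delta(W_dec Y + b_dec)
        return Y, Z

def reconstruction_loss(X, Z):
    """Eq. (3): squared Frobenius reconstruction error."""
    return ((X - Z) ** 2).sum()
```

A deeper stack of such layers, e.g. the 500-200-100-200-500 architecture used in the experiments, follows the same pattern.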

Graph Regularization: In order to preserve the local structure of the data, we employ a graph regularization term derived from Laplacian Eigenmaps [3]. Suppose \(\mathbf{A}\) is the indicator matrix in which \(\mathbf {A}_{ij}\) indicates whether node i and node j are connected; the Laplacian loss is then defined as:

$$\begin{aligned} Laplacian = \sum _{i}^{n}\sum _{j}^{n}\mathbf {A}_{ij} \Vert \mathbf {y}_{i}-\mathbf {y}_{j} \Vert _{2}^{2} \end{aligned}$$
(4)

The Laplacian loss can be further written as:

$$\begin{aligned} Laplacian = \sum _{i}^{n}\sum _{j}^{n}\mathbf {A}_{ij} \Vert \mathbf {y}_{i}-\mathbf {y}_{j} \Vert _{2}^{2} = 2tr(\mathbf {Y}^{T}\mathbf {L}\mathbf {Y}) \end{aligned}$$
(5)

where \(tr(\cdot)\) denotes the trace and \(\mathbf {L}\) is the Laplacian matrix computed from \(\mathbf {A}\): \(\mathbf {L} = \mathbf {D} - \mathbf {A}\), where \(\mathbf {D}\) is the diagonal degree matrix with \(\mathbf {D}_{ii} = \sum _{j=1}^{n} \mathbf {A}_{ij}\).
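A small PyTorch sketch of the trace form in Eq. (5), assuming `A` and `Y` are float tensors (the function name is illustrative):

```python
import torch

def laplacian_loss(A, Y):
    """2 * tr(Y^T L Y) with L = D - A, where D is the diagonal degree matrix of A."""
    deg = torch.diag(A.sum(dim=1))  # D_ii = sum_j A_ij
    L = deg - A                     # graph Laplacian
    return 2.0 * torch.trace(Y.t() @ L @ Y)
```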

Combining the graph information, the optimization problem is:

$$\begin{aligned} \mathcal {L} = \Vert \mathbf {X} - \mathbf {Z}\Vert ^{2}_{F} + \alpha ^\prime \cdot 2tr(\mathbf {Y}^{T}\mathbf {L}\mathbf {Y}) + \beta ^\prime \cdot \frac{1}{2} \Vert W\Vert _{F}^{2} \end{aligned}$$
(6)

Merging the constant factors into the parameters \(\alpha \) and \(\beta \), the loss function becomes:

$$\begin{aligned} \mathcal {L} = \Vert \mathbf {X} - \mathbf {Z}\Vert ^{2}_{F} + \alpha tr(\mathbf {Y}^{T}\mathbf {L}\mathbf {Y}) + \beta \Vert W\Vert _{F}^{2} \end{aligned}$$
(7)

where \(\alpha \) and \(\beta \) are the hyperparameters that control the model complexity.

Recall that each time we receive a data chunk \(S_{k}\); we let \(\mathbf {X}=\mathbf {A}=S_{k}\) and then run the graph-regularized autoencoder with \(\mathbf {X}\) and \(\mathbf {A}\). As in most deep neural networks, we use gradient descent to optimize the deep autoencoder. The objective function is \( \mathcal {L} = \varepsilon (f,g) + \lambda \varOmega (f)\), where the first term \(\varepsilon (f,g)\) is the reconstruction error and the second term \(\varOmega (f)\) is the regularization term. The partial derivatives with respect to \(\theta _{decoder}\) depend only on the first term, while those with respect to \(\theta _{encoder}\) depend on both terms. Using the chain rule, the parameters at each layer can be computed sequentially.
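A training-loop sketch over the graph stream, combining Eq. (7) with the chunk-wise setting \(\mathbf{X}=\mathbf{A}=S_{k}\). It reuses the `Autoencoder` sketch above; the hyperparameter values and the use of plain SGD are illustrative assumptions.

```python
import torch

def train_on_stream(model, stream, alpha=1e-2, beta=1e-3, lr=1e-3):
    # The beta * ||W||_F^2 penalty is approximated by the optimizer's weight decay
    # (up to a constant factor absorbed into beta).
    opt = torch.optim.SGD(model.parameters(), lr=lr, weight_decay=beta)
    for S_k in stream:                                 # one D x D chunk per step
        X = torch.as_tensor(S_k, dtype=torch.float32)
        A = X                                          # X = A = S_k
        Y, Z = model(X)
        L = torch.diag(A.sum(dim=1)) - A               # Laplacian of the chunk
        loss = ((X - Z) ** 2).sum() + alpha * torch.trace(Y.t() @ L @ Y)  # Eq. (7)
        opt.zero_grad()
        loss.backward()
        opt.step()
    return model
```

For example, `train_on_stream(Autoencoder(input_dim=D, code_dim=d), graph_stream(A, D, K))` would run one pass over a stream of K chunks.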

4 Experiments

In this section, we conduct experiments on clustering tasks to verify the effectiveness of our method.

We use two datasets, COIL20 and ORL, to evaluate our method. COIL20 contains 1440 instances belonging to 20 categories, and ORL contains 400 samples belonging to 40 classes. A KNN graph is constructed by computing the k nearest neighbors of each sample.
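One way to build such a KNN graph with scikit-learn is sketched below; the value of k and the symmetrization step are our assumptions, as the paper does not specify them.

```python
import numpy as np
from sklearn.neighbors import kneighbors_graph

def build_knn_graph(features, k=5):
    """Binary adjacency matrix of the k-nearest-neighbor graph."""
    A = kneighbors_graph(features, n_neighbors=k, mode="connectivity").toarray()
    return np.maximum(A, A.T)  # symmetrize so the graph is undirected
```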

We compare our approach with several deep models to evaluate its performance. Specifically, we employ the deep autoencoder (DAE) [7] and the stacked autoencoder (SAE) [16] as baseline models.

We evaluate the learned features on the clustering task. Following common practice in clustering evaluation, we employ purity and normalized mutual information (NMI) to evaluate the results.
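Both metrics can be computed with standard tools; the following is a sketch, not code from the paper:

```python
from sklearn.metrics import normalized_mutual_info_score
from sklearn.metrics.cluster import contingency_matrix

def purity_score(labels_true, labels_pred):
    """Fraction of samples assigned to the majority true class of their cluster."""
    C = contingency_matrix(labels_true, labels_pred)
    return C.max(axis=0).sum() / C.sum()

def nmi_score(labels_true, labels_pred):
    return normalized_mutual_info_score(labels_true, labels_pred)
```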

In our experiments, we set D to 500 for COIL20 and 200 for ORL. The deep graph autoencoders for COIL20 and ORL have 5 and 3 layers, respectively. For COIL20, the layer dimensions are \(500-200-100-200-500\); for ORL, they are \(200-100-200\).

The clustering results on COIL20 and ORL are presented in Table 1. The results show that the streaming method has competitive representation power compared with the baseline models that utilize the complete matrices. They also indicate that, when encountering large graphs, the streaming method is relieved of the computation issues while still being able to achieve superior performance.

Table 1. Results in clustering task

5 Conclusion

We proposed a streaming-motivated embedding method to learn low-dimensional representations of a graph. The streaming strategy is used to reduce the computational complexity, while the deep autoencoder and the graph regularization idea ensure that the learned features retain enough information. Experiments on the clustering task verify the effectiveness of our method: our model achieves results as good as those of models that apply dimension reduction directly to the original matrix, and the approach generalizes to large graphs where direct matrix models are inapplicable.