ORIGINAL RESEARCH article

Front. Appl. Math. Stat., 22 April 2021
Sec. Mathematics of Computation and Data Science
Volume 7 - 2021 | https://doi.org/10.3389/fams.2021.651467

Optimization of Spectral Wavelets for Persistence-Based Graph Classification

Ka Man Yim* and Jacob Leygonie

  • Mathematical Institute, University of Oxford, Oxford, United Kingdom

A graph's spectral wavelet signature determines a filtration, and consequently an associated set of extended persistence diagrams. We propose a framework that optimizes the choice of wavelet for a dataset of graphs, such that their associated persistence diagrams capture features of the graphs that are best suited to a given data science problem. Since the spectral wavelet signature of a graph is derived from its Laplacian, our framework encodes geometric properties of graphs in their associated persistence diagrams and can be applied to graphs without a priori node attributes. We apply our framework to graph classification problems and obtain performances competitive with other persistence-based architectures. To provide the underlying theoretical foundations, we extend the differentiability result for ordinary persistent homology to extended persistent homology.

1. Introduction

1.1. Background

Graph classification is a challenging problem in machine learning. Unlike data represented in Euclidean space, there is no easily computable notion of distance or similarity between graphs. As such, graph classification requires techniques beyond mainstream machine learning methods focused on Euclidean data. Much research has been conducted on methods such as graph neural networks (GNNs) [1] and graph kernels [2, 3] that embed graphs in Euclidean space in a consistent manner.

Recently, persistent homology [4, 5] has been applied as a feature map that explicitly represents topological and geometric features of a graph as a set of persistence diagrams (a.k.a. barcodes). In the context of our discussion, the persistent homology of a graph G = (V, E) depends on a vertex function f : V → ℝ. In the case where a vertex function is not given with the data, several schemes have been proposed in the literature to assign vertex functions to graphs in a consistent way. For example, vertex functions can be constructed using local geometric descriptions of vertex neighborhoods, such as discrete curvature [6], heat kernel signatures [7] and Weisfeiler–Lehman graph kernels [8].

However, it is often difficult to know a priori whether a heuristic vertex assignment scheme will perform well in addressing different data science problems. For a single graph, we can optimize the vertex function over |V| many degrees of freedom in ℝV. In recent years, there have been many other examples of persistence optimization in data science applications. The first two examples of persistence optimization are the computation of the Fréchet mean of barcodes using gradients on Alexandrov spaces [9], and that of point cloud inference [10], where a point cloud is optimized so that its barcode fits a fixed target barcode. The latter is an instance of a topological inverse problem (see Oudot and Solomon [11] for a recent overview of such problems). Another inverse problem is that of surface reconstruction [12]. In addition, in the context of shape matching [13], persistence optimization is used to learn an adequate function between shapes. Finally, there are also many recent applications of persistence optimization in machine learning, such as the incorporation of topological information in generative modeling [14–16] or in image segmentation [17, 18], and the design of topological losses for regularization in supervised learning [19] or for dimension reduction [20].

Each of these applications can be thought of as minimizing a certain loss function over a manifold M of parameters:

min_{θ ∈ M} L(θ),

where L : M → ℝ factors through the space BarN of N-tuples of barcodes. The aim is to find the parameter θ that best fits the application at hand. Gradient descent is a very popular approach to minimization, but it requires the ability to differentiate the loss function. In fact, Leygonie et al. [21] provide notions of differentiability for maps into and out of Bar that are compatible with smooth calculus, and show that the loss functions L corresponding to the applications cited above are generically differentiable. The use of (stochastic) gradient descent is further justified by Carriere et al. [22], where convergence guarantees for persistence optimization problems are devised, using a recent study of stratified non-smooth optimization problems [23]. In practice, the minimization of L can be unstable due to its non-convexity and partial non-differentiability. Some research has been conducted in order to smooth and regularize the optimization procedure [24, 25].

In a supervised learning setting, we want to optimize our vertex function assignment scheme over many individual graphs in a dataset. Since graphs may not share the same vertex set and come in different sizes, optimizing over the |V| degrees of freedom of any one graph is not conducive to learning a vertex function assignment scheme that can generalize to another graph. The degrees of freedom in any practical vertex assignment scheme should be independent of the number of vertices of a graph. However, a framework for parameterizing and optimizing the vertex functions of many graphs over a common parameter space M is not immediately apparent.

The first instance of a graph persistence optimization framework (GFL) [26] uses a one layer graph isomorphism network (GIN) [1] to parameterize vertex functions. The GIN learns a vertex function by exploiting the local topology around each vertex. In this paper, we propose a different framework for assigning and parameterizing vertex functions, based on a graph's Laplacian operator. Using the Laplacian, we can explicitly take both local and global structures of the graph into consideration in an interpretable and transparent manner.

1.2. Outline and Contributions

We address the issue of vertex function parameterization and optimization using wavelet signatures. Wavelet signatures are vertex functions derived from the eigenvalues and eigenvectors of the graph Laplacian and encode multiscale geometric information about the graph [27]. The wavelet signature of a graph depends on a choice of wavelet g : ℝ → ℝ, a function on the eigenvalues of the graph's Laplacian matrix. We can thus obtain a parameterization of vertex functions F : M → ℝV for any graph by parameterizing g. Consequently, the extended persistence of a graph—which has only four non-trivial persistence diagrams—can be varied over the parameter space M. If we have a function Out : Bar4 → ℝ on persistence diagrams that we wish to minimize, we can optimize over M to minimize the loss function

L : M → ℝV → Bar4 → ℝ,   L = Out ∘ EPH ∘ F.    (1)

If L is generically differentiable, we can optimize the wavelet signature parameters θ ∈ M using gradient descent methods. We illustrate an application of this framework to a graph classification problem in Figure 1, where the loss function L is the classification error of a graph classification prediction model based on the graph's extended persistence diagrams.

FIGURE 1

Figure 1. Given a wavelet g : ℝ → ℝ, we can equip any graph with a non-trivial vertex function. This allows us to compute the extended persistence diagrams of a graph and use the diagrams as features of the graph to predict a graph's classification in some real world setting. The wavelet g can be optimized to improve the classification accuracy of a graph classification pipeline based on the extended persistence diagrams of a graph's vertex function.

In section 2, we describe the assignment of vertex functions F : M → ℝV by reviewing the definition of wavelet signatures. While spectral wavelets have been used in graph neural network architectures that predict vertex features [1] and compress vertex functions [28], they have not been considered in a persistent homology framework for graph classification. We describe several ways to parameterize wavelets. We also show in Proposition 2.2 that the wavelet signature is independent of the choice of eigenbasis of the graph Laplacian from which it is derived, ensuring that it is well-defined. We prove this result in Appendix B in Supplementary Material.

In section 3, we describe the theoretical basis for optimizing the extended persistent homology of a vertex function EPH : ℝV → Bar4 and elucidate what it means for L to be differentiable. In Proposition 3.3, we generalize the differentiability formalism of ordinary persistence [21] to extended persistence. We prove this result in Appendix A in Supplementary Material.

Finally, in section 4, we apply our framework to graph classification problems on several benchmark datasets. We show that our model is competitive with state-of-the-art persistence-based models. In particular, optimizing the vertex function appreciably improves the prediction accuracy on some datasets.

2. Filter Function Parameterization

We describe our recipe for assigning vertex functions to any simplicial graph G = (V, E) based on a parameterized spectral wavelet, the first part F of the loss function

L : M → ℝV → Bar4 → ℝ,   L = Out ∘ EPH ∘ F.    (1)

Our recipe is based on a graph's wavelet signature, a vertex function derived from the graph's Laplacian. The wavelet signature also depends on a so-called ‘wavelet function’ g : ℝ → ℝ, which is independent of the graph. By modulating the wavelet, we can jointly vary the wavelet signature across many graphs. We parameterize the wavelet using a finite linear combination of basis functions, such that the wavelet signature can be manipulated in a computationally tractable way. In the following subsections, we define the wavelet signature and describe our linear approach to wavelet parameterization.

2.1. Wavelet Signatures

The wavelet signature is a vertex function initially derived from wavelet transforms of vertex functions on graphs [29], a generalization of the wavelet transforms for square integrable functions on Euclidean space [30, 31] used in signal analysis [32]. Wavelet signatures for graphs have been applied to encode geometric information about meshes of 3D shapes [27, 32]. Special cases of wavelet signatures, such as the heat kernel signature [33] and the wave kernel signature [34], have also been applied to describe graphs and 3D shapes [35, 36].

The wavelet signature of a graph is constructed from the graph's Laplacian operator. A graph's normalized Laplacian L ∈ ℝV × V is a symmetric positive semi-definite matrix, whose entries are given by

Luv = 1 if u = v,  Luv = −1/√(ku kv) if (u, v) ∈ E,  and Luv = 0 otherwise,    (2)

where ku is the degree of vertex u. The Laplacian's eigenvalues λ and eigenvectors ϕ are known to encode various topological and geometric information about the graph [37, 38]; for example, the number of zero eigenvalues equals the number of connected components of the graph. The spectrum of the normalized Laplacian consists of real eigenvalues in [0, 2] [37]. As such, any function g : ℝ → ℝ evaluated on the eigenvalues need only be defined on [0, 2]. Moreover, functions on a compact domain are easily parameterized using convenient bases.

Definition 2.1. (Wavelet Signature [27]) Let L ∈ ℝV × V be the normalized Laplacian of a simplicial graph G = (V, E). Let ϕ1, …, ϕ|V| be an orthonormal eigenbasis for L and λ1, …, λ|V| be the corresponding eigenvalues. The wavelet signature W : ℝ[0, 2] → ℝV maps a function g : [0, 2] → ℝ, which we refer to as a wavelet, to a vertex function W(g) ∈ ℝV linearly, where the value of W(g) on vertex v is given by

W(g)v = Σ_{i=1}^{|V|} g(λi) (ϕi)v²,    (3)

and (ϕi)v denotes the component of eigenvector ϕi corresponding to vertex v.
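
For concreteness, the following is a minimal NumPy sketch of Definition 2.1; the graph is assumed to be given by its adjacency matrix, and the function names are illustrative rather than taken from the authors' code.

```python
import numpy as np

def normalized_laplacian(adjacency: np.ndarray) -> np.ndarray:
    """Normalized Laplacian L = I - D^{-1/2} A D^{-1/2} of a simple graph (Equation 2)."""
    degrees = adjacency.sum(axis=1)
    inv_sqrt = np.where(degrees > 0, 1.0 / np.sqrt(degrees), 0.0)
    return np.eye(len(adjacency)) - inv_sqrt[:, None] * adjacency * inv_sqrt[None, :]

def wavelet_signature(adjacency: np.ndarray, g) -> np.ndarray:
    """Vertex function W(g)_v = sum_i g(lambda_i) (phi_i)_v^2 (Equation 3)."""
    lam, phi = np.linalg.eigh(normalized_laplacian(adjacency))  # columns of phi are orthonormal eigenvectors
    return (phi ** 2) @ g(lam)

# Example: the heat kernel signature W(e^{-10 x}) of a 4-cycle.
A = np.array([[0, 1, 0, 1],
              [1, 0, 1, 0],
              [0, 1, 0, 1],
              [1, 0, 1, 0]], dtype=float)
print(wavelet_signature(A, lambda lam: np.exp(-10.0 * lam)))
```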

If the eigenvalues of L have geometric multiplicity one (i.e., their eigenspaces are one dimensional), then the orthonormal eigenvectors are uniquely defined up to a choice of sign. It is then apparent from Equation (3) that the wavelet signature is independent of the choice of sign. If some eigenvalues have geometric multiplicity greater than one, however, the orthonormal eigenvectors of L are only defined up to orthonormal transformations of the individual eigenspaces. Nevertheless, the wavelet signature is well-defined even when the multiplicities of eigenvalues are greater than one. This is the content of the next Proposition, whose proof is deferred to Appendix B in Supplementary Material.

PROPOSITION 2.2. The wavelet signature of a graph is independent of the choice of orthonormal eigenbasis for the Laplacian.

Remark 2.3. In addition to the traditional view of wavelets from a spectral signal processing perspective [29], we can also relate the wavelet signature of a vertex v to the degrees of vertices in some neighborhood of v prescribed by g. Consider a wavelet g : [0, 2] → ℝ. On a finite graph G, the normalized Laplacian L has at most |V| many distinct eigenvalues. As such, there exists a polynomial ĝ(x) = Σ_{n=0}^{p} an x^n of finite order that interpolates g at the eigenvalues, g(λi) = ĝ(λi). Therefore, W(g) = W(ĝ). Moreover, the vertex values assigned by W(ĝ) are the diagonal entries of the matrix polynomial ĝ(L):

ĝ(L)vv = Σ_{n=0}^{p} an (L^n)vv = Σ_{i=1}^{|V|} ĝ(λi) (ϕi)v² = Σ_{i=1}^{|V|} g(λi) (ϕi)v² = W(g)v.    (4)

Furthermore, we can also write the matrix polynomial ĝ(L) as a matrix polynomial in A = I − L, the normalized adjacency matrix. From the definition of L, we can compute the diagonal entry of a monomial A^r corresponding to vertex v as an inverse degree weighted count of paths1 [v0, v1, …, vr] on the graph which begin and end on vertex v = v0 = vr [39]:

(A^r)vv = (1/kv) Σ_{[v, v1, …, vr−1, v]} Π_{l=1}^{r−1} (1/kvl).    (5)

By expressing the wavelet signature as a matrix polynomial in A, we see that g controls how information at different length scales of the graph contributes to the wavelet signature. For instance, if g were a polynomial of order p, then W(g)v would only take into account the degrees of vertices at most ⌊p/2⌋ steps away from v. As a corollary, since W(g) can be specified by replacing g with a polynomial ĝ of order at most |V| − 1, the wavelet signature at a vertex v only depends on the subgraph of G within ⌊(|V| − 1)/2⌋ steps of v.
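
As a quick numerical illustration of Equation (4) (a self-contained sketch, not from the paper's code), the wavelet signature for g(x) = e^{−x} coincides with the diagonal of the matrix exponential of −L, which SciPy computes without an eigendecomposition:

```python
import numpy as np
from scipy.linalg import expm

# Path graph on 4 vertices, normalized Laplacian L = I - D^{-1/2} A D^{-1/2}.
A = np.array([[0, 1, 0, 0],
              [1, 0, 1, 0],
              [0, 1, 0, 1],
              [0, 0, 1, 0]], dtype=float)
deg = A.sum(axis=1)
L = np.eye(4) - A / np.sqrt(np.outer(deg, deg))

lam, phi = np.linalg.eigh(L)
signature = (phi ** 2) @ np.exp(-lam)             # W(g) via Equation (3), with g(x) = exp(-x)
assert np.allclose(signature, np.diag(expm(-L)))  # Equation (4): diagonal of the matrix function g(L)
```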

2.2. Parameterizing the Wavelet

We see from Remark 2.3 that the choice of wavelet g determines how the topology and geometry of the graph is reflected in the vertex function. Though the space of wavelets is potentially infinite dimensional, here we only consider wavelets gθ(x) parameterized by θ in a finite dimensional manifold, so that we can easily optimize them using computational methods. In particular, we focus on wavelets written as a linear combination of m basis functions h1, …, hm : [0, 2] → ℝ:

gθ(x) := Σ_{j=1}^{m} θj hj(x).    (6)

This parameterization of wavelets in turn defines a parameterization of vertex functions F : ℝm → ℝV for our optimization pipeline in Equation (1)

F : θ ∈ ℝm ↦ F(θ) := W(gθ) ∈ ℝV.    (7)

Since W(g) is a linear function of the wavelet g, F is a linear transformation:

F(θ) = W( Σ_{j=1}^{m} θj hj ) = Σ_{j=1}^{m} θj W(hj).    (8)

We can write F as a |V| × m matrix, whose columns are the vertex functions W(hj), acting on a vector [θ1, …, θm] ∈ ℝm.
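
A sketch of this linear parameterization follows; the function names and the small ad hoc basis are illustrative (the bases actually used are discussed in the examples below and in section 4).

```python
import numpy as np

def parameterization_matrix(adjacency: np.ndarray, basis) -> np.ndarray:
    """|V| x m matrix whose j-th column is the wavelet signature W(h_j) (Equation 8)."""
    deg = adjacency.sum(axis=1)
    inv_sqrt = np.where(deg > 0, 1.0 / np.sqrt(deg), 0.0)
    laplacian = np.eye(len(adjacency)) - inv_sqrt[:, None] * adjacency * inv_sqrt[None, :]
    lam, phi = np.linalg.eigh(laplacian)
    return np.column_stack([(phi ** 2) @ h(lam) for h in basis])

# F(theta) = F_mat @ theta is the vertex function W(g_theta) for g_theta = sum_j theta_j h_j.
basis = [lambda lam: np.ones_like(lam),
         lambda lam: np.exp(-lam),
         lambda lam: np.exp(-10.0 * lam)]
A = np.array([[0, 1, 1], [1, 0, 1], [1, 1, 0]], dtype=float)   # triangle graph
F_mat = parameterization_matrix(A, basis)
f = F_mat @ np.array([0.5, 1.0, -1.0])
```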

Example 2.4 (Chebyshev Polynomials). Any Lipschitz continuous function on an interval can be well-approximated by truncating its Chebyshev series at some finite order [40]. The Chebyshev polynomials Tn : [−1, 1] → ℝ

Tn(x) = cos(n arccos(x)),  n ≥ 0,    (9)

form an orthogonal set of functions. We can thus consider hj(λ) = Tj(λ − 1), j = 0, 2, 3, …, m, as a naïve basis for wavelets. We exclude T1(x) = x from the linear combination as W(T1(1 − x)) = 0 for graphs without self loops.

Example 2.5 (Radial Basis Functions). In the machine learning community, a radial function refers loosely to a continuous monotonically decreasing function ρ : ℝ≥0 → ℝ≥0. There are many possible choices for ρ, for example, the inverse multiquadric

ρ(r) = ( (r/ϵ)² + 1 )^(−1/2),    (10)

where ϵ ≠ 0 is a width parameter. We can obtain a naïve wavelet basis hj(x) = ρ(||x − xj||) using copies of ρ offset by a collection of centroids xj ∈ ℝ along ℝ. In general, the centroids are parameters that could be optimized, but we fix them in this study. This parameterization can be considered a radial basis function neural network. Radial basis function networks are well-studied in function approximation and, subsequently, machine learning; we refer readers to [41, 42] for further details.

2.3. The Choice of Wavelet Basis

The choice of basis functions determines the space of wavelet signatures as well as the numerical stability of the basis function coefficients, which serve as the wavelet signature parameters. The stability of the parameterization depends on the graphs as much as on the choice of wavelet basis h1, …, hm. We can analyse the stability of a parameterization F through its singular value decomposition

F = Σ_{k=1}^{r} σk uk vk^⊤,    (11)

where σ1, …, σr are the non-zero singular values of the matrix, and uk ∈ ℝ|V| and vk ∈ ℝm are orthonormal sets of vectors, respectively. If the distribution of singular values spans many orders of magnitude, we say the parameterization is ill-conditioned. An ill-conditioned parameterization interferes with the convergence of gradient descent algorithms on a loss function evaluated on wavelet signatures. We discuss the relationship between the conditioning of F and the stability of gradient descent in detail in Remark 2.7.

We empirically observe that the coefficients of a naïve choice of basis functions, such as Chebyshev polynomials or radial basis functions, are numerically ill-conditioned. In Figure A2 (Appendix in Supplementary Material), we can see that the singular values of radial basis function and Chebyshev polynomial parameterizations, respectively, are distributed across a large range on the logarithmic scale for some datasets of graphs in machine learning. We address this problem by picking out a new wavelet basis

h′k(x) = (1/σk) Σ_{j=1}^{m} (vk)j hj(x),  k = 1, …, r,    (12)

where σk are the singular values of F and vk are the associated vectors in ℝm from the singular value decomposition of matrix F in Equation (11). Then the parameterization F′ : ℝr → ℝV

F′(θ) = Σ_{k=1}^{r} θk W(h′k)    (13)

has all singular values equal to one, since F′(θ) is a linear combination of the orthonormal vectors uk ∈ ℝ|V|:

W(h′k) = Σ_{j=1}^{m} (1/σk) (vk)j W(hj) = (1/σk) F vk = uk.    (14)

As an example, we plot the new wavelet basis h′k derived from a 12 parameter radial basis function parameterization for the MUTAG dataset in Figure A3 in Appendix B in Supplementary Material.

Remark 2.6 (Learning a Wavelet Basis for Wavelet Signatures on Multiple Graphs). In the case where the wavelet coefficients parameterize the wavelet signatures over graphs G1, …, GN, we can view the maps F1, …, FN that map wavelet basis coefficients to vertex functions of graphs G1, …, GN, respectively, as a parameterization for the disjoint union ⊔iGi:

f = [f1; …; fN] = [F1; …; FN] θ =: F θ.    (15)

We can then perform a singular value decomposition of the parameterization F on ⊔iGi and derive a new, well-conditioned basis.
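
A sketch of this reconditioning step (Equations 11, 12, and 15) is given below; the stacked parameterization matrix over the dataset is taken as given, and the singular value cutoff is an illustrative choice not specified in the paper.

```python
import numpy as np

def recondition(stacked_F: np.ndarray, tol: float = 1e-10) -> np.ndarray:
    """Return the r x m coefficient matrix whose k-th row is v_k / sigma_k, so that
    the new basis h'_k = (1/sigma_k) sum_j (v_k)_j h_j (Equation 12) gives a
    parameterization with all singular values equal to one (Equation 14)."""
    U, sigma, Vt = np.linalg.svd(stacked_F, full_matrices=False)
    keep = sigma > tol * sigma[0]              # drop numerically zero singular directions
    return Vt[keep] / sigma[keep][:, None]

# Sanity check on a random stand-in for the stacked matrix F of Equation (15):
F_stacked = np.random.default_rng(0).normal(size=(200, 12))
coeffs = recondition(F_stacked)
print(np.linalg.svd(F_stacked @ coeffs.T, compute_uv=False))   # all close to one
```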

Remark 2.7 (Why the Conditioning of F Matters). Let us optimize a loss function L on the parameter space of wavelet coefficients θ using a gradient descent algorithm. In a gradient descent step of step size s, the wavelet coefficients are updated to θ ↦ θ − s∇θL. Using the singular value decomposition of F (Equation 11), we can write

∇θL = (∂f/∂θ)^⊤ ∇fL = F^⊤ ∇fL = Σ_{k=1}^{r} σk ⟨∇fL, uk⟩ vk.    (16)

The change in the vertex function is simply the matrix F applied to the change in wavelet parameters. Hence, the vertex function is updated to f ↦ f − sF∇θL, where

F∇θL = Σ_{k=1}^{r} σk ⟨∇fL, uk⟩ F vk = Σ_{k=1}^{r} σk² ⟨∇fL, uk⟩ uk.    (17)

If the loss function L has large second derivatives, for example due to non-linearities in the function on persistence diagrams Out : Bar4 → ℝ, then the projections ⟨∇fL, uk⟩ in Equations (16) and (17) may change dramatically from one gradient descent update to another. If the smallest singular value is much smaller than the largest, then updates to the wavelet signature can be especially unstable throughout the optimization process. This source of instability can be removed if we choose a parameterization with uniform singular values σk = 1. In this case, the update to f is simply the projection of ∇fL onto the space of wavelet signatures spanned by u1, …, ur, without any distortion introduced by non-uniform singular values:

f ↦ f − s Σ_{k=1}^{r} ⟨uk, ∇fL⟩ uk.    (18)

3. Extended Persistent Homology

The homology of a given graph is a computable vector space whose dimension counts the number of connected components or cycles in the graph. Finer information can be retained by filtering the graph and analyzing the evolution of the homology throughout the filtration. This evolution is described by a set of extended persistence diagrams (a.k.a. extended barcodes), a multiset of points 〈b, d〉 that record the birth b and death d of homological features in the filtration. In this section, we begin by summarizing these constructions. We refer the reader to Zomorodian and Carlsson [4], Edelsbrunner and Harer [5], and Cohen-Steiner et al. [43] for full treatments of the theory of persistence.

Compared to ordinary persistence, extended persistence is a more informative and convenient feature map for graphs. Extended persistence encodes strictly more information than ordinary persistence. For instance, the cycles of a graph are represented as points with d = ∞ in ordinary persistence, so only the birth coordinate b of such points contains useful information about the cycles. In contrast, the corresponding points in extended persistence are each endowed with a finite death time d, thus associating extra information to the cycles. The points at infinity in ordinary persistence also introduce obstacles to vectorization procedures, as arbitrary finite cutoffs are often needed to ‘tame’ the persistence diagrams before vectorization.

3.1. Extended Persistent Homology

Let G = (V, E) be a finite graph without double edges and self-loops. For the purposes of this paper, the associated extended persistent homology is a map

EPH : ℝV → Bar4

from functions f ∈ ℝV on its vertices to the space of four persistence diagrams or barcodes, which we define below. The map arises from a filtration of the graph, a sequential attachment of vertices and edges in ascending or descending order of f. We extend f to each edge e = (v, v′) by the maximal value of f over the vertices v and v′, and we then let Gt ⊆ G be the subgraph induced by the vertices taking value less than t. Then we have the following sequence of inclusions:

∅ ⊆ ⋯ ⊆ Gt ⊆ Gt′ ⊆ ⋯ ⊆ G,  t ≤ t′.    (19)

Similarly, the subgraphs G^t ⊆ G induced by the vertices taking value greater than t assemble into a sequence of inclusions:

∅ ⊆ ⋯ ⊆ G^t′ ⊆ G^t ⊆ ⋯ ⊆ G,  t ≤ t′.    (20)

The changes in the topology of the graph along the filtration in ascending and descending order of f can be detected by its extended persistence module, indexed over the poset ℝ ∪ {∞} ∪ ℝop:

Vp(f) :  Hp(Gt) → Hp(Gt′) → ⋯ → Hp(G)
         → ⋯ → Hp(G, G^t′) → Hp(G, G^t) → ⋯,  t ≤ t′,    (21)

where Hp is the singular (relative) homology functor in degree p ∈ {0, 1} with coefficients in a fixed field, chosen to be ℤ/2ℤ in practice. In general terms, the modules V0(f) and V1(f) together capture the evolution of the connected components and loops in the subgraphs of G induced by the function f.

Each module Vp(f) is completely characterized by a finite multi-set EPHp(f) of pairs of real numbers 〈b, d〉 called intervals representing the birth and death of homological features. Following Cohen-Steiner et al. [44], the intervals in EPHp(f) are further partitioned according to the type of homological feature they represent:

EPHp(f) = {⟨b, d⟩ | b < d < ∞} ⊔ {⟨b, d⟩ | b < ∞ < d} ⊔ {⟨b, d⟩ | ∞ < b < d}
        = EPHpord(f) ⊔ EPHpext(f) ⊔ EPHprel(f).    (22)

Each of the three finite multisets EPHpk(f), for k ∈ {ord, ext, rel}, is an element of the space Bar of so-called barcodes or persistence diagrams. Since EPH0rel(f) and EPH1ord(f) are trivial for graphs, we refer to the collection of the four remaining persistence diagrams

EPH(f) = [EPH0ord(f), EPH0ext(f), EPH1ext(f), EPH1rel(f)] ∈ Bar4    (23)

as the extended barcode or extended persistence diagram of f. We have thus defined the extended persistence map

EPH : ℝV → Bar4.
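
In practice, the extended persistence of a vertex function can be computed with the GUDHI library (used by the authors for persistence computations, see section 4.2). The following is a minimal sketch; it assumes a recent GUDHI version, and note that GUDHI groups its output into ordinary, relative, extended+ and extended- diagrams rather than in the order of Equation (23).

```python
import gudhi

def extended_persistence_diagrams(edges, f):
    """Extended persistence of the vertex function f (a list of reals) on a graph
    given by its edge list, filtered as described above."""
    st = gudhi.SimplexTree()
    for v, fv in enumerate(f):
        st.insert([v], filtration=fv)                     # vertices enter at their function value
    for u, v in edges:
        st.insert([u, v], filtration=max(f[u], f[v]))     # edges enter at the max over their endpoints
    st.extend_filtration()                                 # cone construction encoding extended persistence
    return st.extended_persistence()                       # four diagrams of (dim, (birth, death)) pairs

# Example: a 4-cycle with a non-constant vertex function.
diagrams = extended_persistence_diagrams([(0, 1), (1, 2), (2, 3), (3, 0)], [0.0, 1.0, 0.5, 2.0])
```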

Remark 3.1. If we only apply homology to the filtration of Equation (19), we get an ordinary persistence module indexed over the real line, which is essentially the first row in Equation (21). This module is characterized by a unique barcode PHp(f) ∈ Bar. We refer to the map

PH : f ∈ ℝV ↦ [PH0(f), PH1(f)] ∈ Bar2    (24)

as the ordinary persistence map.

3.2. Differentiability of Extended Persistence

The extended persistence map can be shown to be locally Lipschitz by the Stability theorem [44]. The Rademacher theorem states that any real-valued function that is locally Lipschitz is differentiable on a full measure set. Thus, so is our loss function

L = Out ∘ EPH ∘ F : M → ℝ,

as long as Out and F are smooth or locally Lipschitz2. If a loss function L is locally Lipschitz, we can use stochastic gradient descent as a paradigm for optimization. Nonetheless, the theorem above does not rule out dense sets of non-differentiability in general.

In this section, we show that the set where EPH is not differentiable is not pathological. Namely, we show that EPH is generically differentiable, i.e., differentiable on an open dense subset. This property guarantees that local gradients yield reliable descent directions in a neighborhood of the current iterate. We recall from Leygonie et al. [21] the definition of differentiability for maps to barcodes.

We call a map F : M → ℝV a parameterization, as it corresponds to a selection of filter functions over G parameterized by the manifold M. Then B := EPH ∘ F is the barcode valued map whose differentiability properties are of interest in applications.

Definition 3.2. A map B : M → Bar on a smooth manifold M is said to be differentiable at θ ∈ M if for some neighborhood U of θ, there exists a finite collection of differentiable maps3 bi, di : U → ℝ ∪ {∞}, called a local coordinate system for B at θ, such that

∀ θ′ ∈ U,  B(θ′) = {⟨bi(θ′), di(θ′)⟩ | bi(θ′) ≠ di(θ′)}.

For N ∈ ℕ, we say that a map B : M → BarN is differentiable at θ if all its components are so.

In Leygonie et al. [21], it is proven that the composition PH ◦ F is generically differentiable as long as F is so. It is possible to show that EPH ◦ F is generically differentiable along the same lines, but we rather provide an alternative argument in the Appendix. Namely, we rely on the fact that the extended persistence of G can be decoded from the ordinary persistence of the cone complex C(G), a connection first noted in Cohen-Steiner et al. [44] for computational purposes.

PROPOSITION 3.3. Let F : M → ℝV be a generically differentiable parameterization. Then the composition EPH ∘ F is generically differentiable.

For completeness, the proof provided in the Appendix treats the general case of a finite simplicial complex K of arbitrary dimension.

4. Binary Graph Classification

We investigate whether optimizing the extended persistence of wavelet signatures can be usefully applied to graph classification problems, where persistence diagrams are used as features to predict discrete, real life attributes of networks. In this setting, we aim to learn θ ∈ M that minimizes the classification error of graphs over a training dataset.

We apply our wavelet optimization framework to classification problems on the graph datasets MUTAG [47, 48], COX2 [49], DHFR [49], NCI1 [50, 51], PROTEINS [52, 53], and IMDB-B [54]. The first five datasets consist of biochemical molecules, while IMDB-B is a collection of social ego networks. In our models, we use persistence images [46] as a fixed vectorization method and use a feed forward neural network to map the persistence images to classification labels. We also include the eigenvalues of the graph Laplacian as additional features; model particulars are described in the sections below.

To illustrate the effect of wavelet optimization on different classification problems, we also perform a set of control experiments where for the same model architecture, we fix the wavelet and only optimize the parameters of the neural network. The control experiment functions as a baseline against which we assess the efficacy of wavelet optimization.

We benchmark our results against two existing persistence based architectures, PersLay [7] and GFL [26]. PersLay optimizes the vectorization parameters and uses two heat kernel signatures as fixed rather than optimizable vertex functions for computing extended persistence. GFL optimizes and parameterizes vertex functions using a graph isomorphism network [1], and computes ordinary sublevel and superlevel set persistence instead of extended persistence.

4.1. Model Architecture

We give a high level description of our model and relegate details and hyperparameter choices of the vectorization method and neural network architecture to Appendix C in Supplementary Material. In our setting, the extended persistence diagrams of the optimizable wavelet signatures for each graph are vectorized as persistence images. We also include the static persistence images of a fixed heat kernel signature, W(e−0.1x), as an additional set of features, alongside some non-persistence features. Both the optimized and static persistence diagrams are transformed into the persistence images using identical hyperparameters. We feed the optimizable and static persistence images into two separate convolutional neural networks (CNNs) with the same architecture. Similarly, we feed the non-persistence features as a vector into a separate multilayer perceptron. The outputs of the CNNs are concatenated with the outputs of the multi-layer perceptron. Finally, an affine transformation sends the concatenated vector to a real number whose sign determines the binary classification.
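
To fix ideas, here is a schematic PyTorch sketch of the architecture just described. The image resolution, channel counts, and layer widths are placeholder values, not the hyperparameters reported in Appendix C in Supplementary Material.

```python
import torch
import torch.nn as nn

class DiagramCNN(nn.Module):
    """CNN applied to a stack of persistence images (one channel per diagram)."""
    def __init__(self, in_channels: int = 4, out_features: int = 16, image_size: int = 20):
        super().__init__()
        self.net = nn.Sequential(
            nn.Conv2d(in_channels, 8, kernel_size=3, padding=1), nn.ReLU(),
            nn.MaxPool2d(2), nn.Flatten(),
            nn.Linear(8 * (image_size // 2) ** 2, out_features), nn.ReLU(),
        )

    def forward(self, x):
        return self.net(x)

class PersistenceClassifier(nn.Module):
    """Two CNNs with identical architecture (optimizable and static persistence images),
    an MLP for non-persistence features, concatenation, and a final affine map whose
    sign gives the binary classification."""
    def __init__(self, n_extra: int = 8, image_size: int = 20):
        super().__init__()
        self.cnn_optimizable = DiagramCNN(image_size=image_size)
        self.cnn_static = DiagramCNN(image_size=image_size)
        self.mlp = nn.Sequential(nn.Linear(n_extra, 16), nn.ReLU())
        self.head = nn.Linear(16 + 16 + 16, 1)

    def forward(self, images_optimizable, images_static, extra_features):
        h = torch.cat([self.cnn_optimizable(images_optimizable),
                       self.cnn_static(images_static),
                       self.mlp(extra_features)], dim=1)
        return self.head(h).squeeze(-1)    # logit for binary cross entropy
```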

4.1.1. Wavelet Parameterization

We choose a space of wavelets spanned by 12 inverse multiquadric radial basis functions

hj(x) = ( ((x − xj)/ϵ)² + 1 )^(−1/2),    (25)

whose centroids xj are located at xj = 2(j − 1)/9, j = 0, …, 11. The width parameter is chosen to be the distance between the centroids, ϵ = 2/9. On each dataset, we derive a numerically stable parameterization using the procedure described in section 2.3; the parameters we optimize are the coefficients of the new basis given by Equation (12). We initialize the parameters by fitting them via least squares to the heat kernel signature W(e−10x) on the whole dataset of graphs, as sketched below.
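
The sketch below spells out this choice of basis and the least-squares initialization; the stacked parameterization matrix of Equation (15) and the target heat kernel signature are taken as given, and the function names are illustrative.

```python
import numpy as np

# Twelve inverse multiquadric basis functions on [0, 2] (Equation 25).
eps = 2.0 / 9.0
centroids = 2.0 * (np.arange(12) - 1.0) / 9.0
basis = [lambda x, c=c: (((x - c) / eps) ** 2 + 1.0) ** -0.5 for c in centroids]

def initial_coefficients(stacked_F: np.ndarray, target_signature: np.ndarray) -> np.ndarray:
    """Least-squares fit of the wavelet coefficients so that F @ theta approximates the
    target vertex function, here the heat kernel signature W(e^{-10 x}) over all graphs."""
    theta, *_ = np.linalg.lstsq(stacked_F, target_signature, rcond=None)
    return theta
```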

4.1.2. Non-Persistence Features

We also incorporate the eigenvalues of the normalized Laplacian as additional, fixed features of the graph. Since the number of eigenvalues for a given graph is equal to the number of vertices, it differs between graphs in the same dataset. To encode the information represented in the eigenvalues as a fixed length vector, we first sort the eigenvalues into a time-series; we then compute the log path signature of the time series up to level four, which is a fixed length vector in ℝ8. The log-signature captures the geometric features of the path; we refer readers to Chevyrev and Kormilitzin [55] for details about path signatures. For IMDB-B in particular, we also include the maxima and minima of the heat kernel signatures W(e−10x) and W(e−0.1x), respectively, of each graph.
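
The paper does not name the library used for the log path signature; one possible realization, sketched below with the iisignature package, treats the sorted eigenvalues as a two-dimensional path (normalized index, eigenvalue), for which the level-four log-signature has exactly 8 components.

```python
import numpy as np
import iisignature

def eigenvalue_logsig(eigenvalues: np.ndarray, level: int = 4) -> np.ndarray:
    """Level-4 log-signature of the sorted Laplacian eigenvalues, viewed as a 2-d path."""
    lam = np.sort(eigenvalues)
    path = np.column_stack([np.linspace(0.0, 1.0, len(lam)), lam])   # shape (|V|, 2)
    prep = iisignature.prepare(2, level)       # basis data for 2-d paths up to level 4
    return iisignature.logsig(path, prep)      # vector in R^8
```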

4.2. Experimental Set Up

We employ a scheme of 10 repeated ten-fold train-test splits on each dataset to measure the accuracy of our model. Each ten-fold is a set of ten experiments, corresponding to a random partition of the dataset into ten portions. In each experiment, a different portion is selected as the test set while the model is trained on the remaining nine portions. We perform 10 ten-folds to obtain a total of 10 × 10 experiments, and report the accuracy of the classifier as the average accuracy over these 100 experiments. The epochs at which the accuracies were measured are specified in Table C1.

Across all experiments, we use binary cross entropy as the loss function. We use the Adam optimizer [56] with learning rate lr = 1e-3 to optimize the parameters of the neural network. The wavelet parameters are updated using stochastic gradient descent with learning rate lr = 1e-2, for all datasets except for IMDB-B, where the learning rate is set to lr = 1e-1. The batch sizes for each experiment are shown in Table C2. In all experiments, we stop the optimization of wavelet parameters at epoch 50 while the neural network parameters continue to be optimized.

We use the GUDHI library to compute persistence, and make use of the optimization and machine learning library PyTorch to construct the graph classification models.

4.3. Results and Discussion

In Table 1, we present the classification accuracies of our model. For each dataset, we perform four experiments using our model, varying whether the wavelet parameter is optimized and whether additional features are included. In Table 2, we show the test accuracy of our model alongside two persistence-based graph classification architectures, Perslay and GFL, as well as other state-of-the-art graph classification architectures.

TABLE 1

Table 1. Binary classification accuracy of our model where we vary whether non-Persistence features are included and whether the wavelet is optimized.

TABLE 2

Table 2. Binary classification accuracy on datasets of graphs.

We first compare the performance of our model between the cases where we optimize and where we fix the wavelets. In Table 1, we see that on MUTAG and DHFR, optimizing the wavelet improves the classification accuracy regardless of whether extra features are included. On NCI1, wavelet optimization improves the classification accuracy only when persistence features alone are included. When we include non-persistence features in the model, the performances of the optimized and control models are statistically indistinguishable for NCI1, suggesting that the non-persistence features play a more significant role in the classification. As for COX2, PROTEINS, and IMDB-B, optimizing the wavelet coefficients does not bring about statistically significant improvements. This indicates that the initial wavelet signature—the heat kernel signature W(e−10x)—is a locally optimal choice of wavelet for our neural network classifier.

We now compare our architecture to the other persistence based architectures, PersLay and GFL, where node attributes are excluded from their vertex function models. Except on PROTEINS, our wavelet optimized model matches or exceeds PersLay. While our model architecture and choice of wavelet initialization are similar to those of PersLay, we differ in two important respects: PersLay fixes the vertex functions, whereas we optimize them; conversely, PersLay optimizes the weights assigned to points on the persistence diagrams, as well as the parameters of the persistence images, whereas our vectorization is fixed. Our improvements on PersLay for MUTAG, DHFR, and IMDB-B indicate that vertex function optimization yields improvements that cannot be obtained through vectorization optimization alone on some datasets of graphs.

Compared to GFL (without node attributes), both PersLay and our architecture achieve similar or higher classification accuracies on PROTEINS and NCI1. This supports the viability of wavelet signatures as models for vertex functions on those datasets. On the other hand, both PersLay and our model lag behind GFL on IMDB-B. We attribute this to the fact that IMDB-B, unlike the bioinformatics datasets, consists of densely connected graphs. The graphs in IMDB-B have diameter at most two and 14% of the graphs are cliques. This fact has two consequences. First, we expect the one-layer GIN used in GFL—a local topology summary—to be highly effective in optimizing for the salient features of a graph with small diameter. Second, the extended persistence diagrams of cliques contain only zero-persistence points, since all vertices are assigned the same function value by symmetry. In contrast, the ordinary persistence used in GFL is able to capture the cycles in a complete graph as points with infinite persistence.

Compared to the non-persistence state-of-the-art architectures in Table 2, our model achieves competitive accuracies on MUTAG, COX2, and DHFR. For NCI1 and PROTEINS, all persistence architectures listed that exclude additional node attributes perform poorly in comparison, though PWL was able to achieve leading results with node attributes.

All in all, we observe that wavelet signatures can be an effective parameterization of vertex functions when we use extended persistence as features for graph classification. In particular, on some bioinformatics datasets, we show that optimizing the wavelet signature can lead to improvements in classification accuracy. The wavelet signature approach is complementary to the GFL approach to vertex function parameterization as they show strengths on different datasets.

5. Conclusion

We have presented a framework for equipping any graph G with a set of extended persistence diagrams EPH ∘ F : M → Bar4 parameterized over a manifold M, a parameter space for the graph's wavelet signature. We described how wavelet signatures can be parameterized and interpreted. Given a differentiable function on extended persistence diagrams Out : Bar4 → ℝ, we have shown that the loss function L = Out ∘ EPH ∘ F is generically differentiable with respect to θ ∈ M. Thus, we can apply gradient descent methods to optimize the extended persistence diagrams of a graph to minimize L.

We applied this framework to a graph classification architecture where the wavelet signature is optimized for classification accuracy. We are able to demonstrate an increase in accuracy on several benchmark datasets where the wavelet is optimized, and perform competitively with state-of-the-art persistence based graph classification architectures.

Data Availability Statement

The code for the computational experiments in section 4 can be found in the GitHub repository https://github.com/kmyim/Persistence_Opt_Spectral_Wavelets. The datasets we use are publicly available at the repository TUDatasets https://chrsmrrs.github.io/datasets/ [60].

Author Contributions

The overall framework was jointly conceived by both authors. KY was responsible for developing wavelet signatures as a vertex function parameterization framework, along with the experimental design and analysis. The proof of the differentiability of extended persistence is due to JL. Both authors participated in the writing of the article.

Funding

KY was funded by the EPSRC Centre For Doctoral Training in Industrially Focused Mathematical Modelling (EP/L015803/1) with industrial sponsorship from Elsevier. JL was funded by the EPSRC grant EP/R513295/1. Both authors are members of the Centre for Topological Data Analysis, which is supported by the EPSRC grant New Approaches to Data Science: Application Driven Topological Data Analysis EP/R018472/1.

Conflict of Interest

The authors declare that the research was conducted in the absence of any commercial or financial relationships that could be construed as a potential conflict of interest.

Acknowledgments

The authors would like to thank Ulrike Tillmann and Heather Harrington for their close guidance and thoughtful advice on this project. In addition, the authors would like to thank Vidit Nanda, Peter Grindrod CBE, Andrew Mellor, Steve Oudot, Mathieu Carrière, and Theo Lacombe for fruitful discussions on this subject. Finally, we are indebted to the reviewers for their thoughtful and constructive comments, which led to many improvements of the paper.

Supplementary Material

The Supplementary Material for this article can be found online at: https://www.frontiersin.org/articles/10.3389/fams.2021.651467/full#supplementary-material

Footnotes

1. ^Here a path refers to a sequence of vertices in which each vertex is connected to the next vertex in the sequence by an edge.

2. ^In practice, a locally Lipschitz Out can be constructed out of Lipschitz stable vectorization methods, such as Persistence Landscapes [45] and Persistence Images [46].

3. ^By convention, a differentiable map that takes the value ∞ is constant.

References

1. Xu B, Shen H, Cao Q, Qiu Y, Cheng X. Graph wavelet neural network. arXiv [Preprint]. (2019) arXiv:1904.07785.

2. Vishwanathan SVN, Schraudolph NN, Kondor R, Borgwardt KM. Graph kernels. J Mach Learn. Res. (2010) 11:1201–42.

3. Shervashidze N, Vishwanathan SVN, Petri T, Mehlhorn K, Borgwardt K. Efficient graphlet kernels for large graph comparison. In: van Dyk D, Welling M, editors. Proceedings of the Twelfth International Conference on Artificial Intelligence and Statistics, Vol. 5. Clearwater Beach, FL: PMLR (2009). p. 488–95. Available online at: http://proceedings.mlr.press/v5/shervashidze09a.html

4. Zomorodian A, Carlsson G. Computing persistent homology. Discr Comput Geom. (2005) 33:249–74. doi: 10.1007/s00454-004-1146-y

5. Edelsbrunner H, Harer J. Persistent homology-a survey. Contemp Mathe. (2008) 453:257–82. doi: 10.1090/conm/453/08802

6. Zhao Q, Wang Y. Learning metrics for persistence-based summaries and applications for graph classification. In: Wallach H, Larochelle H, Beygelzimer A, d'Alché-Buc F, Fox E, Garnett R, editors. Advances in Neural Information Processing Systems, Vol. 32. Red Hook, NY: Curran Associates, Inc. (2019). p. 9859–70. Available online at: https://proceedings.neurips.cc/paper/2019/file/12780ea688a71dabc284b064add459a4-Paper.pdf

7. Carrière M, Chazal F, Ike Y, Lacombe T, Royer M, Umeda Y. Perslay: a neural network layer for persistence diagrams and new graph topological signatures. In: Chiappa S, Calandra R, editors. Proceedings of the Twenty Third International Conference on Artificial Intelligence and Statistics, Vol. 108. PMLR (2020). p. 2786–96. Available online at: http://proceedings.mlr.press/v108/carriere20a.html.

8. Rieck B, Bock C, Borgwardt K. A persistent weisfeiler-lehman procedure for graph classification. In: International Conference on Machine Learning. PMLR (2019). p. 5448–58.

9. Turner K, Mileyko Y, Mukherjee S, Harer J. Fréchet means for distributions of persistence diagrams. Discr Comput Geom. (2014) 52:44–70. doi: 10.1007/s00454-014-9604-7

10. Gameiro M, Hiraoka Y, Obayashi I. Continuation of point clouds via persistence diagrams. Phys D. (2016) 334:118–32. doi: 10.1016/j.physd.2015.11.011

11. Oudot S, Solomon E. Inverse problems in topological persistence. In: Baas NA, Carlsson GE, Quick G, Szymik M, Thaule M, editors. Topological Data Analysis. Cham: Springer (2020). p. 405–33. doi: 10.1007/978-3-030-43408-3_16

12. Brüel-Gabrielsson R, Ganapathi-Subramanian V, Skraba P, Guibas LJ. Topology-aware surface reconstruction for point clouds. Comp Graph Forum. (2020) 39:197–207. doi: 10.1111/cgf.14079

13. Poulenard A, Skraba P, Ovsjanikov M. Topological function optimization for continuous shape matching. Comp Graph Forum. (2018) 37:13–25. doi: 10.1111/cgf.13487

14. Moor M, Horn M, Rieck B, Borgwardt K. Topological autoencoders. In: International Conference on Machine Learning. PMLR (2020). p. 7045–54.

15. Hofer C, Kwitt R, Niethammer M, Dixit M. Connectivity-optimized representation learning via persistent homology. In: Chaudhuri K, Salakhutdinov R, editors. Proceedings of the 36th International Conference on Machine Learning. Long Beach, CA: PMLR (2019). p. 2751–60. Available online at: http://proceedings.mlr.press/v97/hofer19a.html

16. Gabrielsson RB, Nelson BJ, Dwaraknath A, Skraba P. A topology layer for machine learning. In: International Conference on Artificial Intelligence and Statistics. PMLR (2020). p. 1553–63.

17. Hu X, Li F, Samaras D, Chen C. Topology-preserving deep image segmentation. In: Advances in Neural Information Processing Systems. Red Hook, NY: Curran Associates, Inc. (2019). p. 5658–69. Available online at: https://proceedings.neurips.cc/paper/2019/file/12780ea688a71dabc284b064add459a4-Paper.pdf

18. Clough JR, Oksuz I, Byrne N, Schnabel JA, King AP. Explicit topological priors for deep-learning based image segmentation using persistent homology. In: International Conference on Information Processing in Medical Imaging. Hong Kong: Springer (2019). p. 16–28. doi: 10.1007/978-3-030-20351-1_2

19. Chen C, Ni X, Bai Q, Wang Y. A topological regularizer for classifiers via persistent homology. In: The 22nd International Conference on Artificial Intelligence and Statistics. Naha (2019). p. 2573–82.

20. Kachan O. Persistent homology-based projection pursuit. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition Workshops. (2020). p. 856–7. doi: 10.1109/CVPRW50498.2020.00436

21. Leygonie J, Oudot S, Tillmann U. A framework for differential calculus on persistence barcodes. arXiv [Preprint]. (2019) arXiv:1910.00960.

22. Carriere M, Chazal F, Glisse M, Ike Y, Kannan H. A note on stochastic subgradient descent for persistence-based functionals: convergence and practical aspects. arXiv [Preprint]. (2020) arXiv:2010.08356.

23. Davis D, Drusvyatskiy D, Kakade S, Lee JD. Stochastic subgradient method converges on tame functions. Found Comput Math. (2020) 20:119–54. doi: 10.1007/s10208-018-09409-5

24. Solomon E, Wagner A, Bendich P. A fast and robust method for global topological functional optimization. arXiv [Preprint]. (2020) arXiv:2009.08496.

25. Corcoran P, Deng B. Regularization of persistent homology gradient computation. arXiv [Preprint]. (2020) arXiv:2011.05804.

26. Hofer C, Graf F, Rieck B, Niethammer M, Kwitt R. Graph filtration learning. In: Daumé III H, and Singh A, editors. Proceedings of the 37th International Conference on Machine Learning. PMLR (2020). p. 4314–23. Available online at: http://proceedings.mlr.press/v119/hofer20b.html

27. Li C, Hamza AB. A multiresolution descriptor for deformable 3D shape retrieval. Visu Comp. (2013) 29:513–524. doi: 10.1007/s00371-013-0815-3

28. Rustamov RM, Guibas LJ. Wavelets on graphs via deep learning. In: Stanković L, Sejdic E, editors. Vertex-Frequency Analysis of Graph Signals. Cham: Springer (2019). p. 207–22. doi: 10.1007/978-3-030-03574-7

29. Hammond DK, Vandergheynst P, Gribonval R. Wavelets on graphs via spectral graph theory. Appl Computat Harm Anal. (2011) 30:129–50. doi: 10.1016/j.acha.2010.04.005

30. Graps A. An introduction to wavelets. IEEE Comput Sci Eng. (1995) 2:50–61. doi: 10.1109/99.388960

31. Chui CK, Chan AK, Liu SJ. An Introduction to Wavelets. San Diego, CA: Academic Press (1992).

32. Akansu AN, Haddad RA. Multiresolution Signal Decomposition : Transforms, Subbands, and Wavelets, 2nd Edn. San Diego, CA: Academic Press (2001). doi: 10.1016/B978-012047141-6/50002-1

33. Sun J, Ovsjanikov M, Guibas L. A concise and provably informative multi-scale signature based on heat diffusion. Comp Graph Forum. (2009) 28:1383–92. doi: 10.1111/j.1467-8659.2009.01515.x

34. Aubry M, Schlickewei U, Cremers D. The wave kernel signature: a quantum mechanical approach to shape analysis. In: 2011 IEEE International Conference on Computer Vision Workshops (ICCV Workshops). IEEE (2011). p. 1626–33. doi: 10.1109/ICCVW.2011.6130444

35. Bronstein MM, Kokkinos I. Scale-invariant heat kernel signatures for non-rigid shape recognition. In: 2010 IEEE Computer Society Conference on Computer Vision and Pattern Recognition. San Francisco, CA: IEEE (2010). p. 1704–11. doi: 10.1109/CVPR.2010.5539838

36. Hu N, Rustamov RM, Guibas L. Stable and informative spectral signatures for graph matching. In: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition. Columbus, OH: IEEE (2014). p. 2305–12. doi: 10.1109/CVPR.2014.296

37. Chung FRK, Graham FC. Spectral Graph Theory. Providence, RI: American Mathematical Society (1997).

38. Biyikoglu T, Leydold J, Stadler PF. Laplacian Eigenvectors of Graphs: Perron-Frobenius and Faber-Krahn Type Theorems. Berlin: Springer (2007). doi: 10.1007/978-3-540-73510-6

39. Newman M. Networks. Oxford University Press (2018). doi: 10.1093/oso/9780198805090.001.0001

40. Trefethen LN, Bau D III. Numerical Linear Algebra. Vol. 50. SIAM (1997). doi: 10.1137/1.9780898719574

41. Chen S, Cowan CFN, Grant PM. Orthogonal least squares learning algorithm for radial basis function networks. IEEE Trans Neural Netw. (1991) 2:302–9. doi: 10.1109/72.80341

42. Park J, Sandberg IW. Universal approximation using radial-basis-function networks. Neural Comput. (1991) 3:246–57. doi: 10.1162/neco.1991.3.2.246

43. Cohen-Steiner D, Edelsbrunner H, Harer J. Stability of persistence diagrams. Discr Comput Geom. (2007) 37:103–20. doi: 10.1007/s00454-006-1276-5

44. Cohen-Steiner D, Edelsbrunner H, Harer J. Extending persistence using Poincaré and Lefschetz duality. Found Comput Mathe. (2009) 9:79–103. doi: 10.1007/s10208-008-9027-z

45. Bubenik P. Statistical topological data analysis using persistence landscapes. J Mach Learn Res. (2015) 16:77–102.

46. Adams H, Emerson T, Kirby M, Neville R, Peterson C, Shipman P, et al. Persistence images: a stable vector representation of persistent homology. J Mach Learn Res. (2017) 18:218–52.

47. Debnath AK, Lopez de Compadre RL, Debnath G, Shusterman AJ, Hansch C. Structure-activity relationship of mutagenic aromatic and heteroaromatic nitro compounds. Correlation with molecular orbital energies and hydrophobicity. J Med Chem. (1991) 34:786–97. doi: 10.1021/jm00106a046

48. Kriege N, Mutzel P. Subgraph matching kernels for attributed graphs. In: Proceedings of the 29th International Coference on International Conference on Machine Learning. Madison, WI: Omnipress (2012). p. 291–8.

49. Sutherland JJ, O'brien LA, Weaver DF. Spline-fitting with a genetic algorithm: A method for developing classification structure- activity relationships. J Chem Inf Comp Sci. (2003) 43:1906–15. doi: 10.1021/ci034143r

50. Wale N, Watson IA, Karypis G. Comparison of descriptor spaces for chemical compound retrieval and classification. Knowl Inf Syst. (2008) 14:347–75. doi: 10.1007/s10115-007-0103-5

51. Shervashidze N, Schweitzer P, Van Leeuwen EJ, Mehlhorn K, Borgwardt KM. Weisfeiler-lehman graph kernels. J Mach Learn Res. (2011) 12:2539–61.

52. Borgwardt KM, Ong CS, Schönauer S, Vishwanathan SVN, Smola AJ, Kriegel HP. Protein function prediction via graph kernels. Bioinformatics. (2005) 21:i47–i56. doi: 10.1093/bioinformatics/bti1007

53. Dobson PD, Doig AJ. Distinguishing enzyme structures from non-enzymes without alignments. J Mol Biol. (2003) 330:771–83. doi: 10.1016/S0022-2836(03)00628-4

54. Yanardag P, Vishwanathan SVN. Deep graph kernels. In: Proceedings of the 21st ACM SIGKDD International Conference on Knowledge Discovery and Data Mining. Sydney, NSW (2015). p. 1365–74. doi: 10.1145/2783258.2783417

55. Chevyrev I, Kormilitzin A. A primer on the signature method in machine learning. arXiv [Preprint]. (2016) arXiv:1603.03788.

56. Kingma DP, Ba J. Adam: a method for stochastic optimization. arXiv [Preprint]. (2014) arXiv:1412.6980.

57. Niepert M, Ahmed M, Kutzkov K. Learning convolutional neural networks for graphs. In: Balcan MF, Weinberger KQ, editors. Proceedings of The 33rd International Conference on Machine Learning. Vol. 48. New York, NY: PMLR (2016). p. 2014–23. Available online at: http://proceedings.mlr.press/v48/niepert16.html

58. Zhang Z, Wang M, Xiang Y, Huang Y, Nehorai A. RetGK: graph kernels based on return probabilities of random walks. In: Proceedings of the 32nd International Conference on Neural Information Processing Systems. Red Hook, NY: Curran Associates Inc. (2018). p. 3968–78.

59. Verma S, Zhang ZL. Hunt for the unique, stable, sparse and fast feature learning on graphs. In: Proceedings of the 31st International Conference on Neural Information Processing Systems. Red Hook, NY: Curran Associates Inc. (2017). p. 87–97.

60. Morris C, Kriege NM, Bause F, Kersting K, Mutzel P, Neumann M. TUDataset: a collection of benchmark datasets for learning with graphs. In: ICML 2020 Workshop on Graph Representation Learning and Beyond (GRL+ 2020). (2020). Available online at: www.graphlearning.io

Keywords: topological data analysis, graph classification, graph Laplacian, extended persistent homology, spectral wavelet signatures, radial basis neural network

Citation: Yim KM and Leygonie J (2021) Optimization of Spectral Wavelets for Persistence-Based Graph Classification. Front. Appl. Math. Stat. 7:651467. doi: 10.3389/fams.2021.651467

Received: 09 January 2021; Accepted: 24 February 2021;
Published: 22 April 2021.

Edited by:

Umberto Lupo, École Polytechnique Fédérale de Lausanne, Switzerland

Reviewed by:

Bastian Rieck, ETH Zürich, Switzerland
Primoz Skraba, Queen Mary University of London, United Kingdom

Copyright © 2021 Yim and Leygonie. This is an open-access article distributed under the terms of the Creative Commons Attribution License (CC BY). The use, distribution or reproduction in other forums is permitted, provided the original author(s) and the copyright owner(s) are credited and that the original publication in this journal is cited, in accordance with accepted academic practice. No use, distribution or reproduction is permitted which does not comply with these terms.

*Correspondence: Ka Man Yim, yim@maths.ox.ac.uk
