Mining frequent subgraphs in multigraphs

https://doi.org/10.1016/j.ins.2018.04.001

Abstract

For more than a decade, extracting frequent patterns from single large graphs has been a major research focus. However, in this era of data explosion, rich and complex data is being generated at an unprecedented rate. Such complex data can be represented as a multigraph, a generic and rich graph representation. In this paper, we propose MuGraM, a novel frequent subgraph mining approach that can be applied to multigraphs. MuGraM is a generic frequent subgraph mining algorithm that discovers frequent multigraph patterns. It efficiently performs subgraph matching, which is crucial for computing the support measure, and leverages several optimization techniques for swift discovery of frequent subgraphs. Our experiments reveal two things: MuGraM discovers multigraph patterns where existing approaches are unable to do so; and, when applied to simple graphs, MuGraM outperforms the state-of-the-art approaches by at least one order of magnitude.

Introduction

Real world data can be easily modelled as a graph where entities are represented as nodes, and interactions between entities are represented as edges. When only one edge type is allowed between a pair of nodes, we refer to this graph structure as a single edge graph; when more than one edge type is allowed between a pair of nodes, we refer to it as a multigraph. The multigraph structure enables us to represent multiple relations between a pair of nodes [2], [3].

Many real world datasets can be modelled as a network in which a set of nodes is interconnected by multiple relations. Multigraphs abound in various domains: social networks spanning the same set of people but covering different aspects of life (e.g., social relationships on Facebook, Twitter, LinkedIn, etc.); protein-protein interaction multigraphs, where protein pairs have direct interactions or physical associations, or are co-localised [1]; gene multigraphs, where genes are connected by interactions that belong to different pathways; and Resource Description Framework knowledge graphs, where a subject/object node pair is connected by different predicates [17].

Since multigraphs allow more than one relation between a pair of nodes, we can represent real world data more succinctly, which in turn helps in mining patterns that cannot be discovered in simple graphs. For example, a recent work in the field of bioinformatics [16] creates multigraphs by merging heterogeneous genomic and phenotype data in order to identify disease genes. Many such applications stand to benefit from mining interesting and useful patterns.

One of the most important tasks in graph data management is frequent subgraph mining (FSM) [6], [11], [13], [14], [22], where the problem is to discover patterns that occur frequently in a graph database. Although plenty of approaches exist to mine frequent patterns in single edge graphs, to the best of our knowledge, no approach exists to mine frequent patterns in multigraphs.

Considering FSM in single edge graph data, the existing approaches can be categorized into two main families: (i) FSM for transactional graph setting, where the graph database consists of a set of relatively small sized graphs called transactions, and (ii) FSM for single large graph setting, where the graph database consists of a single large graph.

In transactional graph databases, a subgraph is frequent if it appears in at least δ transactions, where δ is a user defined frequency threshold. Several works have been proposed to address FSM in transactional graph databases [11], [13], [22]. FSM in the single large graph setting is more challenging than in the transactional one, and several approaches [6], [14] have been proposed for it by considering various frequency (or support) evaluation measures [4], [20].
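The transactional notion of frequency can be illustrated with a minimal sketch. It assumes undirected graphs with unlabelled vertices and edges, and it tests pattern occurrence by brute force over injective vertex mappings (non-induced subgraph matching), which is only practical for very small graphs. The names `occurs_in`, `transactional_support`, and `is_frequent` are hypothetical helpers introduced here for illustration; they are not taken from any of the cited approaches.

```python
from itertools import permutations
from typing import FrozenSet, List, Set

# An undirected simple graph, stored as a set of 2-element frozensets.
Graph = Set[FrozenSet[int]]


def e(u: int, v: int) -> FrozenSet[int]:
    """Build an undirected edge {u, v}."""
    return frozenset((u, v))


def vertices(graph: Graph) -> List[int]:
    """Sorted list of all endpoints that appear in the graph."""
    return sorted({v for edge in graph for v in edge})


def occurs_in(pattern: Graph, transaction: Graph) -> bool:
    """True if some injective vertex mapping sends every pattern edge
    onto an edge of the transaction (brute force, assumption: tiny graphs)."""
    p_nodes = vertices(pattern)
    t_nodes = vertices(transaction)
    for image in permutations(t_nodes, len(p_nodes)):
        mapping = dict(zip(p_nodes, image))
        if all(frozenset(mapping[v] for v in edge) in transaction
               for edge in pattern):
            return True
    return False


def transactional_support(pattern: Graph, database: List[Graph]) -> int:
    """Number of transactions that contain at least one occurrence."""
    return sum(occurs_in(pattern, t) for t in database)


def is_frequent(pattern: Graph, database: List[Graph], delta: int) -> bool:
    """A pattern is frequent if it appears in at least delta transactions."""
    return transactional_support(pattern, database) >= delta


# Hypothetical toy database: a triangle, a 2-edge path, and a single edge.
database = [
    {e(1, 2), e(2, 3), e(1, 3)},
    {e(1, 2), e(2, 3)},
    {e(1, 2)},
]
path = {e(0, 1), e(1, 2)}
triangle = {e(0, 1), e(1, 2), e(0, 2)}
```

With δ=2 on this toy database, the 2-edge path is frequent (it occurs in the first two transactions) while the triangle is not. In the single large graph setting there is only one graph, so counting transactions no longer applies, which is why dedicated support measures are needed there.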

This work is motivated by the fact that the existing FSM approaches cannot be applied to multigraph data. That is, when the graph data contains multiple relations between a pair of entities, the existing FSM approaches cannot discover frequent patterns that contain a subset of multiedges. Thus, whenever multiple relations (multiedges) exist between a pair of nodes, using the existing FSM approaches requires mapping the multiple relations (multiedges) to a unique value (a distinct edge label) before performing FSM. This does not yield the desired results, which makes the existing approaches incomplete. To the best of our knowledge, no existing work can discover frequent subgraphs in single large multigraphs. It is to be noted that one of the recent works, GraMi [6], claims to handle multi-labeled graphs (which we refer to as multigraphs). However, the paper provides no details about managing multigraphs, nor is the latest code capable of handling multigraph data.

Let us consider a typical scenario of performing FSM on a multigraph as depicted in Fig. 1. The data multigraph in Fig. 1(a) is an extract of the real world AUCS dataset [12], which has five different relations (edge types), namely lunch, Facebook, coauthor, leisure, and work, defined among a set of university employees (nodes). If we perform FSM on this dataset with a frequency threshold δ=2, the existing FSM approaches output no patterns, since they treat the set of relations between a pair of nodes as a unique identifier rather than as a set of multiple relations. Consequently, they are unable to discover the frequent patterns that span a subset of the relations, as depicted in Fig. 1(b).
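The effect described above can be reproduced on a small example. The sketch below uses a hypothetical toy multigraph (not the AUCS extract of Fig. 1) and, as a simplifying assumption, measures the support of a single-edge pattern as the number of data edges that match it; this plain count is for illustration only and is not necessarily the support measure used by MuGraM. The function names are likewise illustrative.

```python
from collections import Counter
from itertools import combinations
from typing import Dict, FrozenSet, Set, Tuple

# Hypothetical toy data: each node pair carries a set of relation types.
edges: Dict[Tuple[str, str], Set[str]] = {
    ("a", "b"): {"lunch", "work"},
    ("b", "c"): {"work", "facebook"},
    ("c", "d"): {"lunch", "work", "leisure"},
}
delta = 2  # frequency threshold


def frequent_merged_labels(edges, delta) -> Set[FrozenSet[str]]:
    """Treat each whole type set as one opaque edge label, as happens
    when multiedges are mapped to a unique value before mining."""
    counts = Counter(frozenset(types) for types in edges.values())
    return {label for label, n in counts.items() if n >= delta}


def frequent_type_subsets(edges, delta) -> Set[FrozenSet[str]]:
    """Let a single-edge pattern match any data edge whose type set
    contains the pattern's types (support = matching edge count)."""
    counts: Counter = Counter()
    for types in edges.values():
        for k in range(1, len(types) + 1):
            for subset in combinations(sorted(types), k):
                counts[frozenset(subset)] += 1
    return {subset for subset, n in counts.items() if n >= delta}


merged = frequent_merged_labels(edges, delta)
subsets = frequent_type_subsets(edges, delta)
```

On this toy data, `merged` is empty because every merged label occurs exactly once, whereas `subsets` contains {work}, {lunch}, and {lunch, work}. Enumerating all subsets per edge is exponential in the number of types on that edge, so this is a demonstration of the semantics rather than an efficient procedure.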

The objective of the proposed work is to fill this gap in the field of FSM by proposing an approach that extracts frequent patterns from multigraph data, considering patterns that can span a subset of the multigraph relations. Thus, we propose MuGraM (Frequent MultiGraph Miner), an algorithm that enumerates all frequent subgraph patterns in a single large multigraph. The major contributions of this work are:

  • a set of efficient pruning rules to swiftly traverse the search space for multigraph pattern extraction;

  • an efficient method to quickly evaluate the pattern support;

  • a quantitative and qualitative evaluation of MuGraM on real world graph data.

The experimental evaluation reveals that MuGraM not only handles multigraph data efficiently but also outperforms the state-of-the-art approaches in extracting frequent subgraphs from single edge graph data.

The rest of the paper is organized as follows. In Section 2, we discuss the related works. In Section 3, we introduce some basic definitions and formalize the problem. In Section 4, we discuss the proposed multigraph mining algorithm MuGraM along with several optimization strategies. Detailed experimental evaluations are conducted in Section 5, followed by the conclusion in Section 6.

Section snippets

Related work

Several existing works address the problem of FSM for both transactional graph databases and single graph databases. For the transactional graph database setting, the work of Inokuchi et al. [11] has shaped the foundation for many later works. This work proposed an approach called AGM to efficiently mine the association rules among the frequently appearing substructures in a given graph data set, by treating a transaction as an adjacency matrix. Among the later works, a few are notable: FSG by

Preliminaries and problem definition

In this paper we address the problem of mining single large multigraphs with undirected edges and unlabelled vertices, which will be referred to as multigraphs. A multigraph G is defined as a tuple (V, E, L_E, T), where V is a set of vertices, T is a set of edge types, E ⊆ V × V is a set of undirected edges, and L_E: V × V → 2^T is a labelling function that assigns to each edge in E the subset of edge types it carries. Since the labelling function L_E maps an edge to a set of edge types, i.e., a multiedge, G is a multigraph.
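As a minimal sketch of this definition, the tuple (V, E, L_E, T) can be held in a small container that stores, for each undirected node pair, the set of edge types it carries. The class name `Multigraph` and its methods are hypothetical and introduced only for illustration; excluding self-loops is an assumption of the sketch, not a statement of the definition.

```python
from typing import Dict, FrozenSet, Set


class Multigraph:
    """Toy container for G = (V, E, L_E, T) with unlabelled vertices."""

    def __init__(self) -> None:
        self.vertices: Set[int] = set()        # V
        self.edge_types: Set[str] = set()      # T
        # L_E, keyed by the undirected pair {u, v}; its keys form E.
        self._labels: Dict[FrozenSet[int], Set[str]] = {}

    def add_relation(self, u: int, v: int, edge_type: str) -> None:
        """Add one typed relation between u and v."""
        if u == v:
            # Assumption of this sketch: self-loops are not modelled.
            raise ValueError("self-loops are not modelled in this sketch")
        self.vertices.update((u, v))
        self.edge_types.add(edge_type)
        self._labels.setdefault(frozenset((u, v)), set()).add(edge_type)

    def edges(self) -> Set[FrozenSet[int]]:
        """E: node pairs that carry at least one edge type."""
        return set(self._labels)

    def label(self, u: int, v: int) -> FrozenSet[str]:
        """L_E(u, v): the multiedge, i.e. the subset of T on the pair."""
        return frozenset(self._labels.get(frozenset((u, v)), ()))


# Hypothetical usage with relation names borrowed from the AUCS example.
g = Multigraph()
g.add_relation(1, 2, "work")
g.add_relation(2, 1, "lunch")   # same undirected pair, second type
g.add_relation(2, 3, "work")
```

Here `g.label(1, 2)` and `g.label(2, 1)` both return the multiedge {work, lunch}, and a pair with no relation maps to the empty set. Keying by `frozenset` makes the undirected nature of the edges explicit and keeps a single entry per node pair.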

MuGraM: an algorithm for mining multigraphs

In order to address the problem of FSM for single large multigraph data, we propose MuGraM, a frequent multigraph mining algorithm. Our proposed approach follows a framework similar to that of existing mining approaches, as introduced in [6], [14]. A generic framework (as depicted in Fig. 3) for mining single large graphs involves the following steps: (i) enumerate the frequent edges (frequent patterns of size s=1), (ii) extend each frequent pattern by successively adding the frequent edges

Experimental analyses

In this section we evaluate the performance of MuGraM by carrying out both quantitative and qualitative analyses. For the quantitative evaluation, we compare the time performance of MuGraM with that of GraMi, one of the recent state-of-the-art FSM approaches; for this evaluation, we use single edge graphs since no approach exists to perform FSM on multigraphs. Further, the qualitative analysis is performed on a few real-world datasets to demonstrate the nature of patterns extracted by the proposed multigraph

Conclusions

In this work we proposed a generic multigraph mining algorithm called MuGraM that can efficiently discover frequent patterns in single large multigraphs. The main contributions of this work include (i) a set of pruning techniques that reduce the search space exploration by avoiding the expensive support computation as much as possible, and further expediting the support computation, and (ii) an efficient support computation mechanism that relies on a backtracking approach to discover multigraph

Acknowledgment

This work has been funded by LabEx NUMEV integrated into the I-SITE MUSE (ANR-10-LABX-20).

References (22)

  • Z. Aidong

    Protein Interaction Networks: Computational Analysis

    (2009)
  • B. Boden et al.

    Mining coherent subgraphs in multi-layer graphs with edge labels

    Proceedings of the 18th ACM SIGKDD International Conference on Knowledge Discovery and Data Mining

    (2012)
  • F. Bonchi et al.

    Distance oracles in edge-labeled graphs

    EDBT

    (2014)
  • B. Bringmann et al.

    What is frequent in a single graph?

    Pacific-Asia Conference on Knowledge Discovery and Data Mining

    (2008)
  • A. Cardillo, J. Gómez-Gardenes, M. Zanin, M. Romance, D. Papo, F. del Pozo, S. Boccaletti, Emergence of network...
  • M. Elseidy et al.

    GRAMI: frequent subgraph and pattern mining in a single large graph

    Proc. VLDB

    (2014)
  • M. Fiedler et al.

    Subgraph support in a single large graph

    Seventh IEEE International Conference on Data Mining Workshops

    (2007)
  • J. Gonzalez et al.

    Efficient mining of graph-based data

    Proceedings of the AAAI Workshop on Learning Statistical Models from Relational Data

    (2000)
  • L. Holder et al.

    Substructure discovery in the SUBDUE system

    KDD Workshop

    (1994)
  • V. Ingalalli et al.

    SuMGra: querying multigraphs via efficient indexing

    International Conference on Database and Expert Systems Applications

    (2016)
  • A. Inokuchi et al.

    An apriori-based algorithm for mining frequent substructures from graph data

    European Conference on Principles of Data Mining and Knowledge Discovery

    (2000)