Extending association rules with graph patterns

https://doi.org/10.1016/j.eswa.2019.112897

Highlights

  • Graph-pattern association rules can be used for discovering associations among entities in social networks.

  • Parallel techniques can significantly improve the performance of frequent pattern mining.

  • Graph-pattern association rules can be efficiently generated with a code graph.

  • Rules generated with maximal frequent patterns are very representative.

Abstract

We propose a general class of graph-pattern association rules (GPARs) for social network analysis. Extending association rules for itemsets, GPARs can help us discover associations among entities in social networks and identify potential customers. Despite these benefits, GPARs bring challenges: the problem of GPAR discovery is already intractable, not to mention mining over large social networks. Nonetheless, we show that it is still feasible to discover GPARs from large social networks. We first formalize the GPARs mining problem and decompose it into two subproblems: frequent pattern mining and rule generation. To address the two subproblems, we develop a parallel algorithm along with an optimization strategy to construct DFS code graphs, whose nodes correspond to frequent patterns. We also provide efficient algorithms to generate (resp. representative) GPARs by using (resp. maximal) frequent patterns. Using real-life and synthetic graphs, we experimentally verify that our algorithms not only scale well but can also identify interesting, high-quality GPARs among social entities.

Introduction

Association rules have been studied for discovering regularities between items in relational data (Agrawal, Imielinski, & Swami, 1993). They have a traditional form X ⇒ Y, where X and Y are disjoint itemsets. For example, {diaper} ⇒ {beer} is an association rule indicating that if a customer buys diapers, then that customer is also likely to buy beer.
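To make the conventional notions concrete, the following minimal Python sketch computes the support and confidence of an itemset rule X ⇒ Y over a handful of transactions. The function names and the toy transactions are illustrative assumptions and do not come from the paper; the definitions are the standard ones for transactional data.

```python
def support(itemset, transactions):
    """Fraction of transactions that contain every item of `itemset`."""
    itemset = frozenset(itemset)
    return sum(1 for t in transactions if itemset <= t) / len(transactions)


def confidence(x, y, transactions):
    """Conventional confidence of X => Y: supp(X union Y) / supp(X)."""
    supp_x = support(x, transactions)
    if supp_x == 0:
        return 0.0
    return support(frozenset(x) | frozenset(y), transactions) / supp_x


# Hypothetical market-basket data, used only for illustration.
transactions = [
    frozenset({"diaper", "beer", "milk"}),
    frozenset({"diaper", "beer"}),
    frozenset({"diaper", "bread"}),
    frozenset({"milk", "bread"}),
]

rule_support = support({"diaper", "beer"}, transactions)
rule_confidence = confidence({"diaper"}, {"beer"}, transactions)
```

On this toy data, two of the four transactions contain both items and three contain diapers, so the rule has support 0.5 and confidence 2/3.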

In recent years, there has been interest in studying how to identify association rules in graph data because such rules can capture associations among social entities and be used in social marketing. For example, Fan, Wang, Wu, and Xu (2015) extended association rules with special graph patterns and used the rules in social media marketing and social recommendation. However, these rules cannot model more complicated associations among social entities, because their consequents, as pattern graphs, contain only a single edge. As a result, numerous meaningful rules cannot be captured. In contrast, generalized graph-pattern association rules take general patterns as both antecedents and consequents. This highlights the need for extending association rules with general graph patterns and discovering these rules on social graphs.

Example 1

A fraction of a social graph G is shown in Fig. 1 (a), where each node denotes a person with name as an identifier and job title (e.g., project manager (PM), database administrator (DBA), programmer (PRG), business analyst (BA) and software tester (ST)), and each edge indicates friendship, e.g., (Bob, Mat) indicates that Bob and Mat are friends. The graph G is distributed to sites S1, S2 and S3.

One can easily infer the following rule R from graph G: among a group of people with titles PM, BA, DBA, PRG and ST, if PM and BA, PM and DBA, DBA and PRG, DBA and ST, and PRG and ST are friends, then PRG and BA, as well as BA and DBA, are likely to be friends. The rule R, referred to as a graph-pattern association rule (GPAR), is defined on graphs rather than itemsets. As shown in Fig. 1 (b), the antecedent and consequent of the rule are represented as graph patterns, i.e., Ql and Qr. The graph patterns specify conditions on various entities in a social graph in terms of topological constraints. With the rule, one can infer social relationships and recommend friends in whom a person will most likely be interested, e.g., recommend Mary to Tim, and Roy to Mary. Beyond social recommendation, GPARs can also be used in, e.g., link prediction (Ebisu & Ichise, 2019; Lin, Song, Shen, & Wu, 2018), graph repairing (Cheng, Chen, Yuan, & Wang, 2018), and network evolution analysis (Chaturvedi, Tiwari, & Spyratos, 2019).
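The rule R of Example 1 can be written down as two edge lists and applied by brute force on a small graph. The sketch below is only an illustration under stated assumptions: the node identifiers u1 to u5 and the toy graph are hypothetical (they are not the graph of Fig. 1), pattern nodes are identified by their job titles because each title occurs once in the rule, and the exhaustive matcher stands in for a real subgraph isomorphism algorithm.

```python
from itertools import permutations

# Antecedent Q_l and consequent Q_r of rule R, as undirected edges between titles.
Q_L = [("PM", "BA"), ("PM", "DBA"), ("DBA", "PRG"), ("DBA", "ST"), ("PRG", "ST")]
Q_R = [("PRG", "BA"), ("BA", "DBA")]


def find_matches(pattern_edges, labels, edges):
    """Return every mapping title -> person that sends each pattern edge to a graph edge."""
    titles = sorted({t for edge in pattern_edges for t in edge})
    edge_set = {frozenset(e) for e in edges}
    found = []
    # Exhaustive search over injective assignments; fine for toy graphs only.
    for people in permutations(sorted(labels), len(titles)):
        m = dict(zip(titles, people))
        if any(labels[m[t]] != t for t in titles):
            continue
        if all(frozenset((m[a], m[b])) in edge_set for a, b in pattern_edges):
            found.append(m)
    return found


def missing_edges(match, consequent, edges):
    """Consequent edges that the rule predicts but the graph does not yet contain."""
    edge_set = {frozenset(e) for e in edges}
    return [(match[a], match[b]) for a, b in consequent
            if frozenset((match[a], match[b])) not in edge_set]


# Hypothetical toy graph: the antecedent holds, and only BA-DBA of the consequent exists.
labels = {"u1": "PM", "u2": "BA", "u3": "DBA", "u4": "PRG", "u5": "ST"}
edges = [("u1", "u2"), ("u1", "u3"), ("u3", "u4"), ("u3", "u5"), ("u4", "u5"), ("u2", "u3")]

antecedent_matches = find_matches(Q_L, labels, edges)
suggestions = [missing_edges(m, Q_R, edges) for m in antecedent_matches]
```

Here the antecedent has one match, and the only consequent edge absent from the toy graph is the one between the PRG node and the BA node, which is what a recommender built on the rule would suggest.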

While useful, GPARs mining creates challenges. (1) Traditional techniques for transactional data cannot be applied. Superficially, a GPAR could be defined following its traditional counterpart, i.e., with antecedent and consequent defined on edge sets, so that traditional techniques could be trivially applied. However, in the context of a single graph, frequent edge sets cannot be mined along the same lines as frequent itemsets on transactional data. In addition, the prior method for special GPARs is also not feasible because it takes a set of single-edge consequents as input and outputs GPARs with these consequents. (2) Social graphs are often large and stored in a distributed manner; hence, centralized techniques are no longer viable. One may easily verify that the rule R, along with its support, may not be correctly computed from G without distributed algorithms. Worse still, mining computations are often cost-prohibitive. This calls for an algorithm that allows a high degree of parallelism and discovers GPARs efficiently.

The practical need for GPARs raises the following fundamental questions. (1) As will be seen shortly, conventional support and confidence metrics no longer work for GPARs. Thus, what metrics can be used to measure support and confidence? (2) Is there any parallel algorithm for GPARs mining over distributed graphs? (3) As a large graph often yields an excessive number of GPARs, how can we develop techniques to generate “representative” GPARs to facilitate inspection and interpretation of GPARs?

Contributions. This paper proposes generalized GPARs, which extend association rules with general graph patterns, and provides effective algorithms for discovering (representative) GPARs in a distributed setting.

  • (1)

    We first propose generalized graph-pattern association rules to capture complex social relations among social entities (Section 2.3). We next show that the GPARs mining problem can be decomposed into two subproblems, i.e., frequent pattern mining and rule generation. We then outline an algorithm for the problem (Section 2.4).

  • (2)

    We study the frequent pattern mining (FPM) problem in a distributed setting (Section 3). Inspired by the strategy used for constraint satisfaction problems, i.e., “look-ahead & backtracking”, we develop an algorithm FPMiner to generate a DFS code graph Gc, whose nodes correspond to frequent patterns, i.e., patterns with support above a threshold (Section 3.1). The algorithm works in parallel and hence obtains desirable performance: it computes the support of a pattern Q in O(|Ef|((k+1)2k+1)k1) time and incurs O((k+1)|F||Vf|) data shipment, where k indicates the level of the node to which the pattern Q corresponds in Gc, and |F|, |Ef|, and |Vf| are the numbers of sites, crossing edges, and virtual nodes, respectively (see definitions of F, Ef and Vf in Section 2.1). We also provide techniques for optimizing mining computations (Section 3.3).

  • (3)

    We study how to generate GPARs by using the DFS code graph Gc (Section 4). Given Gc=(Vc,Ec) and confidence bound η, we develop an algorithm, denoted as RuleGen, to produce GPARs with confidence above η in O(|Vc|(|Vc|+|Ec|)) time, which is independent of the size of the underlying big graph G. We then study how to generate “representative” GPARs. We start from a notion of maximal frequent patterns, followed by the problem of maximal frequent pattern mining (MFPM). We show that MFPM does not increase the difficulty: after Gc is constructed, the leaf nodes of Gc correspond to the set of maximal frequent patterns. Using maximal frequent patterns, we provide an algorithm, denoted by RepRuleGen, to generate “representative” GPARs, which reduces the number of GPARs and eases understanding.

  • (4)

    Using real-life and synthetic graphs, we experimentally verify the performance of our algorithms (Section 5). We find the following. (a) Our distributed algorithms for frequent pattern mining scale well as the number of processors (n) increases. For example, our algorithm FPMiner is on average 4.1, 2.9 and 2.5 times faster on three real-world social networks when n increases from 4 to 20. (b) Our algorithms work reasonably well on large graphs. For example, on a graph with 4 million nodes and 53.5 million edges, FPMiner spends less than 5 minutes (269 s) and ships only 15.9% of the entire graph to discover frequent patterns using 20 processors. (c) The optimization technique for FPMiner is effective. For example, FPMineropt, the optimized version of FPMiner, requires only 56.3% of the time and ships 68% of the data of FPMiner, on average. (d) Our rule generation algorithm RuleGen is very efficient, requiring less than 0.4 s on real-life graphs. (e) Generating GPARs with maximal frequent patterns is very effective, as the “representative” GPARs account for only 1.57% of the entire ruleset and cover more than 90% of the top-k GPARs, on average. (f) Our GPARs can predict missing relationships with an average accuracy of 43.8% on real-life graphs and outperform existing link prediction methods.
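The cost bound O(|Vc|(|Vc|+|Ec|)) stated in contribution (3) is what one obtains from running one graph traversal per node of the code graph. The Python sketch below illustrates only that shape and the observation that leaves of Gc give the maximal frequent patterns. It is not the paper's RuleGen: the adjacency-dictionary encoding, the pattern names Q1 to Q4, the support values, and in particular the placeholder confidence (support of the descendant divided by support of the ancestor) are all assumptions made for illustration, since the paper defines its own support and confidence measures.

```python
from collections import deque


def leaf_patterns(children):
    """Nodes without outgoing edges; per the text these are the maximal frequent patterns."""
    return sorted(v for v, kids in children.items() if not kids)


def candidate_rules(children, supp, eta):
    """One BFS per node: O(|Vc| * (|Vc| + |Ec|)) overall.

    Emits (ancestor, descendant, conf) when the placeholder confidence
    supp[descendant] / supp[ancestor] exceeds eta.
    """
    rules = []
    for u in children:
        seen = {u}
        queue = deque([u])
        while queue:
            x = queue.popleft()
            for v in children[x]:
                if v in seen:
                    continue
                seen.add(v)
                queue.append(v)
                conf = supp[v] / supp[u]
                if conf > eta:
                    rules.append((u, v, conf))
    return rules


# Hypothetical code graph: an edge Qi -> Qj means Qj extends Qi by one edge.
children = {"Q1": ["Q2", "Q3"], "Q2": ["Q4"], "Q3": ["Q4"], "Q4": []}
supp = {"Q1": 10, "Q2": 6, "Q3": 4, "Q4": 3}

rules = candidate_rules(children, supp, eta=0.45)
maximal = leaf_patterns(children)
```

Because the traversal touches only Gc and the stored supports, the running time depends on |Vc| and |Ec| and not on the size of the underlying graph G, which matches the independence claim above.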

The work provides a full treatment for mining generalized graph-pattern association rules from large social graphs. It provides a parallel algorithm along with an optimization strategy for mining frequent patterns and develops techniques to generate (resp. “representative”) GPARs with (resp. maximal) frequent patterns. Compared with earlier works, this work fills one critical void for mining generalized GPARs and yields a promising approach for social network analysis.

Related Work. This paper extends our prior work (Wang & Xu, 2018) by including the following new contributions: (1) proofs of Proposition 1 (Section 3.1) and Lemma 1 (Section 3.2); (2) a procedure EvalT and a detailed analysis of algorithm FPMiner (Section 3.2); (3) an optimization strategy for local computation in workers (Section 3.3); (4) a maximal frequent pattern mining problem (MFPM) and techniques to generate “representative” GPARs (Section 4.2); and (5) a set of new experimental studies (Section 5), including an efficiency and scalability test of FPMineropt, data shipment of FPMiner vs. FPMineropt, a performance evaluation over a new dataset Amazon, a performance comparison between RuleGen and RepRuleGen, and case studies over two real-life graphs.

We next categorize the related work as follows.

Association rules. Association rules, which are defined on relations of transaction data, were first introduced in Agrawal et al. (1993). Prior work on association rules for social networks (Schmitz, Hotho, Jäschke, & Stumme, 2006) and RDF knowledge bases mined conventional rules and Horn rules (as conjunctive binary predicates) (Galárraga, Teflioudi, Hose, & Suchanek, 2013) over a set of tuples with extracted attributes from social graphs instead of exploiting graph patterns. A special type of GPARs was proposed in Fan et al. (2015), where the consequents were defined as pattern graphs with a single edge.

Existing research has also shown that traditional techniques can produce too many association rules because for a frequent itemset of size k, there may exist $2^k - 2$ frequent subsets (Zaki, 2000). Therefore, maximal frequent itemsets and a mining algorithm for them were introduced by Gouda and Zaki (2001), as maximal frequent itemsets can imply and include all information of frequent itemsets.
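The count of subsets and the notion of maximality can be checked directly on small inputs. The following Python sketch is illustrative only: the helper names and the toy itemsets are assumptions, and the code enumerates subsets exhaustively rather than implementing the algorithm of Gouda and Zaki (2001).

```python
from itertools import combinations


def proper_nonempty_subsets(items):
    """All subsets of `items` except the empty set and `items` itself."""
    items = sorted(items)
    return [frozenset(c) for r in range(1, len(items)) for c in combinations(items, r)]


def maximal_itemsets(frequent):
    """Keep the frequent itemsets that have no frequent proper superset."""
    frequent = [frozenset(s) for s in frequent]
    return [s for s in frequent if not any(s < t for t in frequent)]


# A frequent itemset of size k = 4 has 2**4 - 2 = 14 non-empty proper subsets,
# all of which are frequent as well.
subsets = proper_nonempty_subsets({"a", "b", "c", "d"})

# Hypothetical collection of frequent itemsets.
frequent = [{"a"}, {"b"}, {"a", "b"}, {"c"}]
maximal = maximal_itemsets(frequent)
```

In the toy collection, {a} and {b} are implied by {a, b}, so only {a, b} and {c} are maximal.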

This work proposes mining techniques for generalized GPARs, which are defined with general graph patterns as antecedent and consequent. Moreover, we show that the set of maximal frequent patterns can be easily obtained from a DFS code graph and used to generate “representative” GPARs.

Big graph analysis. In the era of big graphs, the development of algorithms for graph analytics has spawned a large amount of research. Sapountzi and Psannis (2018) categorized social network analysis into three types of tasks, namely sentiment analysis and opinion mining, topic detection, and collaborative recommendation, and described various methods and their associated frameworks for these tasks. A recent survey of data mining and processing frameworks for big graphs was conducted by Aridhi and Nguifo (2016), in which the problem of frequent pattern mining was formulated as two problems: graph transaction based and single graph based. This classification coincides with that of Jiang, Coenen, and Zito (2013).

As will be seen shortly, single graph based pattern mining is a subtask of GPARs mining, and the patterns discovered need to be organized well for rule generation. Although graph analytical platforms exist for single graph based pattern mining, e.g., Teixeira et al. (2015), their output may still need to be reorganized to facilitate (“representative”) rule generation.

Graph-pattern mining. The generalization from itemsets to graph patterns also requires effective graph-pattern mining techniques to discover GPARs in big graphs beyond traditional rule mining techniques (Zhang & Zhang, 2002). The topic of graph-pattern mining has been intensively studied (see Jiang et al. (2013) for a survey). Two technical routes have developed according to the essential differences in graph data.

(1) Algorithms for graph databases. The algorithms can be further categorized into two types: (i) Apriori methods (Inokuchi, Washio, & Motoda, 2000), which start with small graphs and generate substructures by joining similar but slightly different frequent subgraphs in a bottom-up manner; and (ii) pattern-growth methods (Huan, Wang, Prins, & Yang, 2004), which extend a frequent graph by adding new nodes and edges in every possible position. (2) Algorithms for single large graphs. For example, Elseidy, Abdelhamid, Skiadopoulos, and Kalnis (2014) proposed using a minimum image, which preserves the anti-monotonic property, as a support metric to measure pattern frequency. To further improve efficiency, parallel techniques were studied in Talukder and Zaki (2016) and Teixeira et al. (2015). Talukder and Zaki (2016) developed a level-wise approach and an effective pruning strategy in the distributed setting. Teixeira et al. (2015) introduced Arabesque, a Pregel-inspired distributed platform for frequent pattern mining.
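The minimum image based support mentioned above can be sketched in a few lines. The version below assumes the commonly used definition, in which the support of a pattern is the smallest number of distinct graph nodes that any single pattern node is mapped to across all matches; the function name, the encoding of matches as dictionaries, and the toy matches are illustrative assumptions rather than details taken from the cited work.

```python
def min_image_support(matches):
    """Minimum, over pattern nodes, of the number of distinct graph nodes they map to.

    `matches` is a list of dicts, each mapping pattern node -> graph node.
    """
    if not matches:
        return 0
    images = {}
    for match in matches:
        for pattern_node, graph_node in match.items():
            images.setdefault(pattern_node, set()).add(graph_node)
    return min(len(nodes) for nodes in images.values())


# Hypothetical matches of a single-edge pattern A-B in a star-shaped graph:
# the hub (graph node 1) always plays role A, while the spokes 2, 3, 4 play role B.
star_matches = [{"A": 1, "B": 2}, {"A": 1, "B": 3}, {"A": 1, "B": 4}]
star_support = min_image_support(star_matches)
```

Counting raw matches would give 3 for the star, whereas the minimum image based value is 1 because pattern node A has a single image. Extending a pattern can only shrink each node's image set, which is why a metric of this kind does not increase when a pattern grows.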

More recently, de J. Costa, Bernardini, Artigas, and Viterbo (2019) proposed methods for mining directed acyclic patterns to identify retention patterns in undergraduate programmes. To identify periodic patterns in dynamic networks, Halder, Samiullah, and Lee (2017) proposed SPPMiner, a single-pass, supergraph-based periodic pattern mining technique. SPPMiner stores all entities of the dynamic network only once and calculates common sub-patterns only once at each timestamp, thus achieving high efficiency.

However, techniques for frequent pattern mining over graph databases (Huan, Wang, Prins, & Yang, 2004; Inokuchi, Washio, & Motoda, 2000) cannot be directly used to mine GPARs, as their anti-monotonic property does not hold in a single graph (Jiang et al., 2013). Our settings are quite different from those of de J. Costa et al. (2019) and Halder et al. (2017), as our method can mine general patterns rather than DAG patterns, and we deal with static graphs rather than dynamic graphs. Our work differs from distributed techniques (Talukder & Zaki, 2016; Teixeira et al., 2015) as follows: we leverage both partial evaluation and asynchronous message passing to identify matches of candidate patterns in a distributed environment instead of level-wise evaluation or Pregel-based computation. Moreover, frequent patterns are our intermediate results, and our distributed technique organizes all frequent patterns in a graph structure, which facilitates (“representative”) rule generation.

GPARs mining. Special GPARs and their mining techniques were introduced in Fan et al. (2015), where consequents of the GPARs are defined as pattern graphs with a single edge, and the algorithm is specifically developed to mine diversified GPARs with fixed consequent. Fan, Wu, and Xu (2016) next introduced quantified graph association rules, which extend GPARs with quantifiers to identify potential customers in social media marketing. Another work addressed mining GPARs over stream data (Namaki, Wu, Song, Lin, & Ge, 2017).

Our work differs from these studies in the semantics, i.e., we are mining generalized GPARs, with antecedent and consequent represented by general graph patterns.

Section snippets

Graph-pattern association rules

In this section, we first review graphs, patterns, subgraphs, (subgraph) isomorphism and distributed graphs. We then introduce DFS Code, DFS Lexicographic Order and DFS Code Tree, followed by the definition of graph-pattern association rules.

Frequent pattern mining

In this section, we first investigate the frequent pattern mining problem (Section 3.1); we then develop a parallel algorithm for the problem (Section 3.2) and provide an optimization strategy to improve performance (Section 3.3).

GPARs generation

In this section, we first introduce how to generate a set of GPARs from code graph Gc. We then introduce the notion of “representative” GPARs and develop a technique to generate “representative” GPARs.

Experimental study

We next present an experimental study of our algorithms. Using real-life and synthetic data, we conducted two sets of experiments to evaluate: (1) The efficiency and data shipment of our distributed algorithm FPMiner and the effectiveness of optimization techniques for FPMiner; (2) the efficiency of the rule generation algorithm RuleGen and the effectiveness of our “representative” rule generation algorithm RepRuleGen.

Experimental setting. We used three real-life graphs. (a) Amazon (Leskovec &

Conclusion

We have proposed generalized graph-pattern association rules (GPARs) and viable support and confidence measures for the discovery of GPARs. Compared with special rules introduced in Fan et al. (2015), the generalized GPARs are capable of modelling even more complicated associations among social entities. We have provided techniques to efficiently mine GPARs. In particular, our technique for frequent pattern mining follows the “look-ahead & backtracking” strategy, which is widely employed for the

Author statement

Xin Wang conceived of the presented idea and developed the theory and algorithms. Yang Xu verified the methods and implemented the algorithms with the help of Xin Wang. Yang Xu and Huayi Zhan designed the experiments and conducted the tests. Xin Wang wrote the manuscript with support from Yang Xu and Huayi Zhan.

Xin Wang agreed to be accountable for all aspects of the work in ensuring that problems appearing in any part of the work are appropriately investigated and resolved.

Declaration of Competing Interest

The authors whose names are listed immediately below certify that they have NO affiliations with or involvement in any organization or entity with any financial interest (such as honoraria; educational grants; participation in speakers’ bureaus; membership, employment, consultancies, stock ownership, or other equity interest; and expert testimony or patent-licensing arrangements), or non-financial interest (such as personal or professional relationships, affiliations, knowledge or beliefs) in

Acknowledgments

This work is supported in part by the NSFC 71490722, 71490725, 71671146, Sichuan Provincial Major Project 2017JY0225 and Fundamental Research Funds for the Central Universities, China.

References (43)

  • S. Aridhi et al.

    Big graph mining: Frameworks and techniques

    Big Data Research

    (2016)
  • S. Halder et al.

    Supergraph based periodic pattern mining in dynamic social networks

    Expert Systems with Applications

    (2017)
  • L. Lü et al.

    Link prediction in complex networks: A survey

    Physica A: Statistical Mechanics and its Applications

    (2011)
  • A. Sapountzi et al.

    Social networking data analysis tools & challenges

    Future Generation Computer Systems

    (2018)
  • R. Agrawal et al.

    Mining association rules between sets of items in large databases

    Proceedings of the 1993 ACM SIGMOD international conference on management of data

    (1993)
  • S. Brin et al.

    Beyond market baskets: Generalizing association rules to correlations

    Proceedings ACM SIGMOD international conference on management of data

    (1997)
  • A. Chaturvedi et al.

    minstab: Stable network evolution rule mining for system changeability analysis

    IEEE Transactions on Emerging Topics in Computational Intelligence

    (2019)
  • Y. Cheng et al.

    Rule-based graph repairing: Semantic and efficient repairing methods

    2018 IEEE 34th international conference on data engineering (ICDE)

    (2018)
  • L.P. Cordella et al.

    A (sub)graph isomorphism algorithm for matching large graphs

    TPAMI

    (2004)
  • T. Ebisu et al.

    Graph pattern entity ranking model for knowledge graph completion

    Proceedings of the 2019 conference of the north american chapter of the association for computational linguistics: Human language technologies, NAACL-HLT 2019, volume 1

    (2019)
  • M. Elseidy et al.

    GRAMI: Frequent subgraph and pattern mining in a single large graph

    PVLDB

    (2014)
  • W. Fan et al.

    Association rules with graph patterns

    PVLDB

    (2015)
  • W. Fan et al.

    Adding counting quantifiers to graph patterns

    Proceedings of the 2016 international conference on management of data, SIGMOD conference

    (2016)
  • M. Fiedler et al.

    Subgraph support in a single large graph

    Workshops proceedings of the 7th IEEE international conference on data mining (ICDM)

    (2007)
  • L.A. Galárraga et al.

    AMIE: Association rule mining under incomplete evidence in ontological knowledge bases

    22nd international world wide web conference, WWW

    (2013)
  • S. Garg et al.

    Evolution of an online social aggregation network: An empirical study

    Proceedings of the 9th ACM SIGCOMM internet measurement conference, IMC

    (2009)
  • N.Z. Gong et al.

    Evolution of social-attribute networks: Measurements, modeling, and implications using google+

    Proceedings of the 12th ACM SIGCOMM internet measurement conference, IMC

    (2012)
  • K. Gouda et al.

    Efficiently mining maximal frequent itemsets

    Proceedings of the 2001 IEEE international conference on data mining

    (2001)
  • GraMi (2015)....
  • I. Grujic et al.

    Collecting and analyzing data from e-government facebook pages

    ICT Innovations

    (2014)
  • E. Gudes et al.

    Discovering frequent graph patterns using disjoint paths

    IEEE Transactions on Knowledge and Data Engineering

    (2006)