Extending association rules with graph patterns

doi:10.1016/j.eswa.2019.112897

Expert Systems with Applications

Volume 141, 1 March 2020, 112897

https://doi.org/10.1016/j.eswa.2019.112897

Highlights

• Graph-pattern association rules can be used for discovering associations among entities in social networks.
• Parallel techniques can significantly improve the performance of frequent pattern mining.
• Graph-pattern association rules can be efficiently generated with a code graph.
• Rules generated with maximal frequent patterns are very representative.

Abstract

We propose a general class of graph-pattern association rules ( $GPARs$ ) for social network analysis. Extending association rules for itemsets, $GPARs$ can help us discover associations among entities in social networks and identify potential customers. Despite the benefits, $GPARs$ bring challenges: the problem of $GPARs$ discovery is already intractable, not to mention mining over large social networks. Nonetheless, we show that it is still feasible to discover $GPARs$ from large social networks. We first formalize the $GPARs$ mining problem and decompose it into two subproblems: frequent pattern mining and rule generation. To address the two subproblems, we develop a parallel algorithm along with an optimization strategy to construct DFS code graphs, whose nodes correspond to frequent patterns. We also provide efficient algorithms to generate (resp. representative) $GPARs$ by using (resp. maximal) frequent patterns. Using real-life and synthetic graphs, we experimentally verify that our algorithms not only scale well but can also identify interesting, high-quality $GPARs$ among social entities.

Introduction

Association rules have been studied for discovering regularities between items in relational data (Agrawal, Imielinski, & Swami, 1993). They have a traditional form X⇒Y, where X and Y are disjoint itemsets. For example, {diaper}⇒ {beer} is an association rule indicating that if a customer buys diapers, then he will also buy beer.
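As a brief illustration of the traditional setting, the sketch below computes the usual support and confidence of {diaper}⇒ {beer} over a handful of transactions. The transactions and the helper names `support` and `confidence` are hypothetical and chosen only for illustration; they are not data or code from the paper.

```python
from typing import FrozenSet, Iterable, List

Transaction = FrozenSet[str]


def support(itemset: Iterable[str], transactions: List[Transaction]) -> float:
    """Fraction of transactions that contain every item in `itemset`."""
    items = frozenset(itemset)
    return sum(items <= t for t in transactions) / len(transactions)


def confidence(x: Iterable[str], y: Iterable[str],
               transactions: List[Transaction]) -> float:
    """Conventional confidence of X => Y, i.e. supp(X u Y) / supp(X)."""
    both = frozenset(x) | frozenset(y)
    return support(both, transactions) / support(x, transactions)


# Hypothetical toy transactions (an assumption, not data from the paper).
transactions = [
    frozenset({"diaper", "beer", "milk"}),
    frozenset({"diaper", "beer"}),
    frozenset({"diaper", "bread"}),
    frozenset({"milk", "bread"}),
]

rule_support = support({"diaper", "beer"}, transactions)
rule_confidence = confidence({"diaper"}, {"beer"}, transactions)
print(rule_support, rule_confidence)
```

Here two of the four transactions contain both items, and three contain diapers, so the rule holds in two out of the three transactions where its antecedent applies.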

In recent years, there has been interest in identifying association rules in graph data, because such rules can capture associations among social entities and be used in social marketing. For example, Fan, Wang, Wu, and Xu (2015) extended association rules with special graph patterns and used the rules in social media marketing and social recommendation. However, these rules cannot model more complicated associations among social entities, because their consequents, as pattern graphs, consist of only a single edge. As a result, numerous meaningful rules cannot be captured. Graph-pattern association rules that take general patterns as both antecedents and consequents avoid this limitation, although they are more involved. This highlights the need for extending association rules with general graph patterns and discovering these rules on social graphs.

Example 1

A fraction of a social graph G is shown in Fig. 1 (a), where each node denotes a person with name as an identifier and job title (e.g., project manager (PM), database administrator (DBA), programmer (PRG), business analyst (BA) and software tester (ST)), and each edge indicates friendship, e.g., (Bob, Mat) indicates that Bob and Mat are friends. The graph G is distributed to sites S₁, S₂ and S₃.

One can easily infer the following rule R from graph G: among a group of people with titles PM, BA, DBA, PRG and ST, if PM and BA, PM and DBA, DBA and PRG, DBA and ST, and PRG and ST are friends, then PRG and BA, as well as BA and DBA, are likely to be friends. The rule R, referred to as a graph-pattern association rule ( $GPAR$ ), is defined on graphs rather than itemsets. As shown in Fig. 1 (b), the antecedent and consequent of the rule are represented as graph patterns, i.e., Q_l and Q_r. The graph patterns specify conditions on various entities in a social graph in terms of topological constraints. With the rule, one can infer social relationships and recommend to users the friends they will most likely be interested in, e.g., recommend Mary to Tim, and Roy to Mary. Not limited to social recommendation, $GPARs$ can also be used in, e.g., link prediction (Ebisu & Ichise, 2019; Lin, Song, Shen, & Wu, 2018), graph repairing (Cheng, Chen, Yuan, & Wang, 2018), and network evolution analysis (Chaturvedi, Tiwari, & Spyratos, 2019).
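To make the shape of rule R concrete, the following minimal sketch encodes Q_l and Q_r as sets of undirected edges between job titles, and applies the rule to one given assignment of people to titles. It only checks a single assignment and does not search for matches via subgraph isomorphism. The person identifiers `p1` to `p5`, the friendship set and the function name `recommend` are hypothetical; they do not reproduce the graph in Fig. 1.

```python
# Antecedent Q_l and consequent Q_r of rule R from Example 1,
# written as undirected edges between job titles.
Q_L = {frozenset(e) for e in [("PM", "BA"), ("PM", "DBA"), ("DBA", "PRG"),
                              ("DBA", "ST"), ("PRG", "ST")]}
Q_R = {frozenset(e) for e in [("PRG", "BA"), ("BA", "DBA")]}


def recommend(assignment, friendships):
    """Return friend suggestions for one title -> person assignment.

    If every antecedent edge is present among `friendships`, the
    consequent edges that are still missing are returned as suggestions.
    """
    def linked(edge):
        a, b = sorted(edge)
        return frozenset((assignment[a], assignment[b])) in friendships

    if not all(linked(e) for e in Q_L):
        return []
    return sorted(tuple(sorted(assignment[t] for t in e))
                  for e in Q_R if not linked(e))


# Hypothetical people and friendships (assumptions for illustration only).
assignment = {"PM": "p1", "BA": "p2", "DBA": "p3", "PRG": "p4", "ST": "p5"}
friendships = {frozenset(p) for p in [("p1", "p2"), ("p1", "p3"), ("p3", "p4"),
                                      ("p3", "p5"), ("p4", "p5")]}

suggestions = recommend(assignment, friendships)
print(suggestions)
```

In this toy setting the antecedent holds, so the BA (`p2`) is suggested as a friend to both the DBA (`p3`) and the PRG (`p4`).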

While useful, $GPARs$ mining creates challenges. (1) Traditional techniques for transactional data cannot be applied. Superficially, a $GPAR$ could be defined following its traditional counterpart, i.e., with antecedent and consequent defined on edge sets, so that traditional techniques could be trivially applied. However, in the context of a single graph, frequent edge sets cannot be mined along the same lines as frequent itemsets on transactional data. In addition, the prior method for special $GPARs$ is also not feasible, because it takes a set of single-edge consequents as input and outputs $GPARs$ with these consequents. (2) Social graphs are often large and distributively stored; hence, centralized techniques are no longer viable. One may easily verify that the rule R, along with its support, may not be correctly computed from G without distributed algorithms. Worse still, mining computations are often cost-prohibitive. These challenges call for an algorithm that allows a high degree of parallelism and efficiently discovers $GPARs$ .

The practical need for $GPARs$ raises the following fundamental questions. (1) As will be seen shortly, conventional support and confidence metrics no longer work for $GPARs$ . Thus, what metrics can be used to measure support and confidence? (2) Is there any parallel algorithm for $GPARs$ mining over distributed graphs? (3) As a large graph often yields an excessive number of $GPARs$ , how can we develop techniques to generate “representative” $GPARs$ to facilitate their inspection and interpretation?

Contributions. This paper proposes generalized $GPARs$ , which extend association rules with general graph patterns, and provides effective algorithms for discovering (representative) $GPARs$ under a distributed scenario.

(1) We first propose generalized graph-pattern association rules to capture complex social relations among social entities (Section 2.3). We next show that the $GPARs$ mining problem can be decomposed into two subproblems, i.e., frequent pattern mining and rule generation. We then outline an algorithm for the problem (Section 2.4).
(2) We study the frequent pattern mining ( $FPM$ ) problem under a distributed scenario (Section 3). Inspired by the “look-ahead & backtracking” strategy used for constraint satisfaction problems, we develop an algorithm $FPMiner$ to generate a DFS code graph $G_{c}$ , whose nodes correspond to frequent patterns, i.e., patterns with support above a threshold (Section 3.1). The algorithm works in parallel and hence obtains desirable performance: it computes the support of a pattern Q in $O(|E_{f}| ((k + 1) 2^{k + 1})^{k - 1})$ time and incurs $O((k + 1) |F| |V_{f}|)$ data shipment, where k indicates the level of the node to which the pattern Q corresponds in $G_{c}$ , and $|F|$ , $|E_{f}|$ and $|V_{f}|$ are the number of sites, crossing edges, and virtual nodes, respectively (see definitions of $F$ , $E_{f}$ and $V_{f}$ in Section 2.1). We also provide techniques for optimizing mining computations (Section 3.3).
(3) We study how to generate $GPARs$ by using the DFS code graph $G_{c}$ (Section 4). Given $G_{c} = (V_{c}, E_{c})$ and a confidence bound η, we develop an algorithm, denoted as $RuleGen$ , to produce $GPARs$ with confidence above η in $O(|V_{c}| (|V_{c}| + |E_{c}|))$ time, which is independent of the size of the underlying big graph G. We then study how to generate “representative” $GPARs$ . We start from a notion of maximal frequent patterns, followed by the problem of maximal frequent pattern mining ( $MFPM$ ). We show that $MFPM$ does not increase the difficulty: after $G_{c}$ is constructed, all the leaf nodes of $G_{c}$ correspond to the set of maximal frequent patterns. Using maximal frequent patterns, we provide an algorithm, denoted by $RepRuleGen$ , to generate “representative” $GPARs$ , which reduces the number of excessive $GPARs$ and eases understanding.
(4) Using real-life and synthetic graphs, we experimentally verify the performance of our algorithms (Section 5). We find the following. (a) Our distributed algorithms for frequent pattern mining scale well as the number of processors (n) increases. For example, our algorithm $FPMiner$ is on average 4.1, 2.9 and 2.5 times faster on three real-world social networks when n increases from 4 to 20. (b) Our algorithms work reasonably well on large graphs. For example, on a graph with 4 million nodes and 53.5 million edges, $FPMiner$ spends less than 5 minutes (269 s) and ships only 15.9% of the entire graph to discover frequent patterns using 20 processors. (c) The optimization technique for $FPMiner$ is effective. For example, ${FPMiner}_{opt}$ , the optimized version of $FPMiner$ , requires only 56.3% of the time and ships 68% of the data of $FPMiner$ , on average. (d) Our rule generation algorithm $RuleGen$ is very efficient, requiring less than 0.4 s over real-life graphs. (e) Generating $GPARs$ with maximal frequent patterns is very effective, as the “representative” $GPARs$ account for only 1.57% of the entire ruleset and cover more than 90% of the top-k $GPARs$ , on average. (f) Our $GPARs$ can predict missing relationships with an average accuracy of 43.8% on real-life graphs and outperform existing link prediction methods.

The work provides a full treatment for mining generalized graph-pattern association rules from large social graphs. It provides a parallel algorithm along with an optimization strategy for mining frequent patterns and develops techniques to generate (resp. “representative”) $GPARs$ with (resp. maximal) frequent patterns. Compared with earlier works, this work fills one critical void for mining generalized $GPARs$ and yields a promising approach for social network analysis.

Related Work. This paper extends our prior work (Wang & Xu, 2018) by including the following new contributions: (1) proofs of Proposition 1 (Section 3.1) and Lemma 1 (Section 3.2); (2) a procedure $EvalT$ and a detailed analysis of algorithm $FPMiner$ (Section 3.2); (3) an optimization strategy for local computation in workers (Section 3.3); (4) a maximal frequent pattern mining problem ( $MFPM$ ) and techniques to generate “representative” $GPARs$ (Section 4.2); and (5) a set of new experimental studies (Section 5), including an efficiency and scalability test of ${FPMiner}_{opt}$ , a comparison of the data shipment of $FPMiner$ vs. ${FPMiner}_{opt}$ , a performance evaluation over a new dataset Amazon, a performance comparison between $RuleGen$ and $RepRuleGen$ , and case studies over two real-life graphs.

We next categorize the related work as follows.

Association rules. Association rules, which are defined on relations of transaction data, were first introduced in Agrawal et al. (1993). Prior work on association rules for social networks (Schmitz, Hotho, Jäschke, & Stumme, 2006) and RDF knowledge bases mined conventional rules and Horn rules (as conjunctive binary predicates) (Galárraga, Teflioudi, Hose, & Suchanek, 2013) over a set of tuples with extracted attributes from social graphs instead of exploiting graph patterns. A special type of $GPARs$ was proposed in Fan et al. (2015), where the consequents were defined as pattern graphs with a single edge.

Existing research has also shown that traditional techniques can produce too many association rules because for a frequent itemset of size k, there may exist $2^{k} - 2$ frequent subsets (Zaki, 2000). Therefore, maximal frequent itemsets and their mining algorithm were introduced by Gouda and Zaki (2001), as maximal frequent itemsets can imply and include all information of frequent itemsets.
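The count $2^{k} - 2$ is the number of nonempty proper subsets of a k-itemset, each of which is itself frequent whenever the itemset is. The sketch below enumerates these subsets for k = 3 and then keeps only the maximal itemsets, i.e., those with no frequent proper superset. The items `a`, `b`, `c` and both helper names are hypothetical, and the filter is a naive quadratic check rather than the mining algorithm of Gouda and Zaki (2001).

```python
from itertools import combinations


def proper_nonempty_subsets(itemset):
    """All subsets of `itemset` except the empty set and the set itself."""
    items = sorted(itemset)
    return [frozenset(c)
            for r in range(1, len(items))
            for c in combinations(items, r)]


def maximal_itemsets(frequent):
    """Keep only itemsets that have no proper superset in `frequent`."""
    frequent = [frozenset(s) for s in frequent]
    return [s for s in frequent if not any(s < other for other in frequent)]


# Hypothetical frequent 3-itemset; its 2^3 - 2 = 6 subsets are frequent too.
subsets = proper_nonempty_subsets({"a", "b", "c"})
frequent = subsets + [frozenset({"a", "b", "c"})]
maximal = maximal_itemsets(frequent)
print(len(subsets), maximal)
```

All seven frequent itemsets collapse to the single maximal itemset {a, b, c}, from which the others can be recovered as subsets.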

This work proposes mining techniques for generalized $GPARs,$ which are defined with general graph patterns as antecedent and consequent. Moreover, we show that the set of maximal frequent patterns can be easily obtained from a DFS code graph and used to generate “representative” $GPARs$ .
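A minimal sketch of this observation follows, assuming the code graph is given as an adjacency mapping from each frequent pattern to the patterns that extend it. The pattern identifiers `Q1` to `Q5`, the toy graph and the function name are hypothetical; how the code graph is built is not shown here.

```python
def maximal_frequent_patterns(code_graph):
    """Return the leaf nodes of a code graph.

    `code_graph` maps each frequent pattern to the frequent patterns that
    extend it. A node with no outgoing edge has no frequent extension,
    which is the leaf property the text associates with maximal patterns.
    """
    return sorted(node for node, children in code_graph.items() if not children)


# Hypothetical toy code graph (an assumption for illustration only).
code_graph = {
    "Q1": ["Q2", "Q3"],
    "Q2": ["Q4"],
    "Q3": ["Q4", "Q5"],
    "Q4": [],
    "Q5": [],
}

leaves = maximal_frequent_patterns(code_graph)
print(leaves)
```

A single pass over the nodes suffices, which is consistent with the claim that no additional mining is needed once the code graph exists.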

Big graph analysis. In the era of big graphs, the development of algorithms for graph analytics has spawned a large amount of research. Sapountzi and Psannis (2018) categorized social network analysis into three types of tasks, namely sentiment analysis & opinion mining, topic detection, and collaborative recommendation, and described various methods and their associated frameworks for these tasks. A recent survey regarding data mining and processing frameworks for big graphs was conducted by Aridhi and Nguifo (2016), in which the problem of frequent pattern mining was formulated as two problems: graph transaction based and single graph based. This classification coincides with that of Jiang, Coenen, and Zito (2013).

As will be seen shortly, single graph based pattern mining is a subtask of $GPARs$ mining, and the patterns discovered need to be organized well for rule generation. Although graph analytical platforms exist for single graph based pattern mining, e.g., Teixeira et al. (2015), their output may still need to be reorganized to facilitate (“representative”) rule generation.

Graph-pattern mining. The generalization from item sets to graph patterns also requires effective graph-pattern mining techniques to discover $GPARs$ in big graphs beyond traditional rule mining techniques (Zhang & Zhang, 2002). The topic of graph-pattern mining has been intensively studied (see Jiang et al. (2013) for a survey). Two technical routes developed according to the essential differences in graph data.

(1) Algorithms for graph databases. The algorithms can be further categorized into two types: (i) Apriori methods (Inokuchi, Washio, & Motoda, 2000) that start with small graphs, and generate substructures by joining similar but slightly different frequent subgraphs, in a bottom-up manner; and (ii) pattern-growth methods (Huan, Wang, Prins, & Yang, 2004), which extend a frequent graph by adding new nodes and edges in every possible position. (2) Algorithms for single large graphs. For example, Elseidy, Abdelhamid, Skiadopoulos, and Kalnis (2014) proposed using a minimum image that preserves anti-monotonic property as a support metric to measure pattern frequency. To further improve efficiency, parallel techniques were studied in Talukder and Zaki (2016) and Teixeira et al. (2015). Talukder and Zaki (2016) developed a level-wise approach and effective pruning strategy under the distributive scenario. Teixeira et al. (2015) introduced a Pregel-inspired distributed platform Arabesque for frequent pattern mining.
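To illustrate the kind of support metric mentioned for single large graphs, here is a minimal sketch of a minimum-image-based measure: for each pattern node, count the distinct graph nodes it is mapped to across all matches, and take the minimum of these counts. The node names and the function name are hypothetical, the matches are assumed to be given, and this is not necessarily the support measure adopted later in this paper.

```python
from typing import Dict, Hashable, List

Match = Dict[Hashable, Hashable]  # pattern node -> graph node


def minimum_image_support(matches: List[Match]) -> int:
    """Minimum, over pattern nodes, of the number of distinct images."""
    if not matches:
        return 0
    pattern_nodes = matches[0].keys()
    return min(len({m[u] for m in matches}) for u in pattern_nodes)


# Hypothetical example: a pattern edge (u, v) matched in a star-shaped
# graph where a1 is adjacent to b1, b2 and b3.
matches = [
    {"u": "a1", "v": "b1"},
    {"u": "a1", "v": "b2"},
    {"u": "a1", "v": "b3"},
]

mni = minimum_image_support(matches)
print(len(matches), mni)
```

Although there are three matches, pattern node `u` has only one distinct image, so the measure is 1. Counting raw matches instead would let a larger pattern appear more frequent than its sub-patterns, which is what an anti-monotonic measure avoids.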

More recently, de J. Costa, Bernardini, Artigas, and Viterbo (2019) proposed methods for mining directed acyclic patterns to identify retention patterns in undergraduate programmes. To identify periodic patterns in dynamic networks, Halder, Samiullah, and Lee (2017) proposed SPPMiner, a single-pass, supergraph-based periodic pattern mining technique. SPPMiner stores all entities of the dynamic network only once and calculates common sub-patterns only once at each timestamp, thus achieving high efficiency.

However, techniques for frequent pattern mining over graph databases (Huan, Wang, Prins, & Yang, 2004; Inokuchi, Washio, & Motoda, 2000) cannot be directly used to mine $GPARs$ , as their anti-monotonic property does not hold in a single graph (Jiang et al., 2013). Our settings are quite different from de J. Costa et al. (2019) and Halder et al. (2017), as our method can mine general patterns rather than DAG patterns, and we are coping with static graphs rather than dynamic graphs. Our work differs from distributed techniques (Talukder & Zaki, 2016; Teixeira, Fonseca, Serafini, Siganos, Zaki, & Aboulnaga, 2015) as follows: we leverage both partial evaluation and asynchronous message passing to identify matches of candidate patterns in a distributed environment, instead of level-wise evaluation or Pregel-based computation. Moreover, frequent patterns are our intermediate results, and our distributed technique organizes all frequent patterns in a graph structure, which facilitates (“representative”) rule generation.

GPARs mining. Special $GPARs$ and their mining techniques were introduced in Fan et al. (2015), where consequents of the $GPARs$ are defined as pattern graphs with a single edge, and the algorithm is specifically developed to mine diversified $GPARs$ with fixed consequent. Fan, Wu, and Xu (2016) next introduced quantified graph association rules, which extend $GPARs$ with quantifiers to identify potential customers in social media marketing. Another work addressed mining $GPARs$ over stream data (Namaki, Wu, Song, Lin, & Ge, 2017).

Our work differs from these studies in the semantics, i.e., we are mining generalized $GPARs,$ with antecedent and consequent represented by general graph patterns.

Section snippets

Graph-pattern association rules

In this section, we first review graphs, patterns, subgraphs, (subgraph) isomorphism and distributed graphs. We then introduce DFS Code, DFS Lexicographic Order and DFS Code Tree, followed by the definition of graph-pattern association rules.

Frequent pattern mining

In this section, we first investigate the frequent pattern mining problem (Section 3.1); we then develop a parallel algorithm for the problem (Section 3.2) and provide an optimization strategy to improve performance (Section 3.3).

GPARs generation

In this section, we first introduce how to generate a set of $GPARs$ from code graph $G_{c}$ . We then introduce the notion of “representative” $GPARs$ and develop a technique to generate “representative” $GPARs$ .
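One way to picture rule generation over a code graph is a traversal per node: each node is paired with every node reachable from it, and a pair is kept when its confidence exceeds the bound η. The sketch below rests on loudly stated assumptions: the code graph is an adjacency mapping, each pattern comes with a precomputed support, and confidence is taken as the ratio of the descendant's support to the ancestor's support. That ratio is a placeholder, not the confidence measure defined in the paper, and the function is not the authors' $RuleGen$ . All names and numbers are hypothetical.

```python
from collections import deque


def generate_rules(code_graph, support, eta):
    """Pair each node with its descendants and keep high-confidence pairs.

    ASSUMPTION: confidence = support[descendant] / support[ancestor].
    This ratio only stands in for the paper's confidence measure.
    """
    rules = []
    for source in code_graph:
        seen = {source}
        queue = deque([source])
        while queue:  # breadth-first traversal from `source`
            node = queue.popleft()
            for child in code_graph[node]:
                if child in seen:
                    continue
                seen.add(child)
                queue.append(child)
                conf = support[child] / support[source]
                if conf > eta:
                    rules.append((source, child, conf))
    return rules


# Hypothetical code graph and supports (assumptions for illustration only).
code_graph = {"Q1": ["Q2", "Q3"], "Q2": ["Q4"], "Q3": ["Q4", "Q5"],
              "Q4": [], "Q5": []}
support = {"Q1": 10, "Q2": 8, "Q3": 6, "Q4": 5, "Q5": 2}

rules = generate_rules(code_graph, support, eta=0.7)
print(rules)
```

Running one traversal from each node costs time proportional to the number of nodes times the size of the code graph, and it never touches the underlying data graph.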

Experimental study

We next present an experimental study of our algorithms. Using real-life and synthetic data, we conducted two sets of experiments to evaluate: (1) The efficiency and data shipment of our distributed algorithm $FPMiner$ and the effectiveness of optimization techniques for $FPMiner$ ; (2) the efficiency of the rule generation algorithm $RuleGen$ and the effectiveness of our “representative” rule generation algorithm $RepRuleGen$ .

Experimental setting. We used three real-life graphs. (a) Amazon (Leskovec &

Conclusion

We have proposed generalized graph-pattern association rules ( $GPARs$ ) and viable support and confidence measures for the discovery of $GPARs$ . Compared with the special rules introduced in Fan et al. (2015), the generalized $GPARs$ are capable of modelling even more complicated associations among social entities. We have provided techniques to efficiently mine $GPARs$ . In particular, our technique for frequent pattern mining follows the “look-ahead & backtracking” strategy that is widely employed for the

Author statement

Xin Wang conceived the presented idea and developed the theory and algorithms. Yang Xu verified the methods and implemented the algorithms with the help of Xin Wang. Yang Xu and Huayi Zhan designed the experiments and conducted the tests. Xin Wang wrote the manuscript with support from Yang Xu and Huayi Zhan.

Xin Wang agreed to be accountable for all aspects of the work in ensuring that problems appearing in any part of the work are appropriately investigated and resolved.

Declaration of Competing Interest

The authors whose names are listed immediately below certify that they have NO affiliations with or involvement in any organization or entity with any financial interest (such as honoraria; educational grants; participation in speakers’ bureaus; membership, employment, consultancies, stock ownership, or other equity interest; and expert testimony or patent-licensing arrangements), or non-financial interest (such as personal or professional relationships, affiliations, knowledge or beliefs) in

Acknowledgments

This work is supported in part by the NSFC 71490722, 71490725, 71671146, Sichuan Provincial Major Project 2017JY0225 and Fundamental Research Funds for the Central Universities, China.

References (43)

S. Aridhi et al.
Big graph mining: Frameworks and techniques
Big Data Research
(2016)

S. Halder et al.
Supergraph based periodic pattern mining in dynamic social networks
Expert Systems with Applications
(2017)

L. Lü et al.
Link prediction in complex networks: A survey
Physica A: Statistical Mechanics and its Applications
(2011)

A. Sapountzi et al.
Social networking data analysis tools & challenges
Future Generation Computer Systems
(2018)

R. Agrawal et al.
Mining association rules between sets of items in large databases
Proceedings of the 1993 ACM SIGMOD international conference on management of data
(1993)

S. Brin et al.
Beyond market baskets: Generalizing association rules to correlations
Proceedings ACM SIGMOD international conference on management of data
(1997)

A. Chaturvedi et al.
minstab: Stable network evolution rule mining for system changeability analysis
IEEE Transactions on Emerging Topics in Computational Intelligence
(2019)

Y. Cheng et al.
Rule-based graph repairing: Semantic and efficient repairing methods
2018 IEEE 34th international conference on data engineering (ICDE)
(2018)

L.P. Cordella et al.
A (sub)graph isomorphism algorithm for matching large graphs
TPAMI
(2004)

T. Ebisu et al.
Graph pattern entity ranking model for knowledge graph completion
Proceedings of the 2019 conference of the North American chapter of the association for computational linguistics: Human language technologies, NAACL-HLT 2019, volume 1
(2019)

M. Elseidy et al.
GRAMI: Frequent subgraph and pattern mining in a single large graph
PVLDB
(2014)

W. Fan et al.
Association rules with graph patterns
PVLDB
(2015)

W. Fan et al.
Adding counting quantifiers to graph patterns
Proceedings of the 2016 international conference on management of data, SIGMOD conference
(2016)

M. Fiedler et al.
Subgraph support in a single large graph
Workshops proceedings of the 7th IEEE international conference on data mining (ICDM)
(2007)

L.A. Galárraga et al.
AMIE: Association rule mining under incomplete evidence in ontological knowledge bases
22nd international world wide web conference, WWW
(2013)

S. Garg et al.
Evolution of an online social aggregation network: An empirical study
Proceedings of the 9th ACM SIGCOMM internet measurement conference, IMC
(2009)

N.Z. Gong et al.
Evolution of social-attribute networks: Measurements, modeling, and implications using Google+
Proceedings of the 12th ACM SIGCOMM internet measurement conference, IMC
(2012)

K. Gouda et al.
Efficiently mining maximal frequent itemsets
Proceedings of the 2001 IEEE international conference on data mining
(2001)

GraMi (2015)....

I. Grujic et al.
Collecting and analyzing data from e-government Facebook pages
ICT Innovations
(2014)

E. Gudes et al.
Discovering frequent graph patterns using disjoint paths
IEEE Transactions on Knowledge and Data Engineering
(2006)
