Extending association rules with graph patterns
Introduction
Association rules have been studied for discovering regularities between items in relational data (Agrawal, Imielinski, & Swami, 1993). They have the traditional form X ⇒ Y, where X and Y are disjoint itemsets. For example, {diaper} ⇒ {beer} is an association rule indicating that a customer who buys diapers is also likely to buy beer.
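As a minimal illustration of the traditional setting, the following sketch computes the conventional support and confidence of the rule {diaper} ⇒ {beer} over a handful of transactions. The toy transactions and the helper names `support` and `confidence` are assumptions made for illustration only; they are not taken from the cited work.

```python
from typing import FrozenSet, Iterable, List


def support(itemset: Iterable[str], transactions: List[FrozenSet[str]]) -> float:
    """Fraction of transactions that contain every item of `itemset`."""
    items = frozenset(itemset)
    return sum(1 for t in transactions if items <= t) / len(transactions)


def confidence(x: Iterable[str], y: Iterable[str],
               transactions: List[FrozenSet[str]]) -> float:
    """Conventional confidence of X => Y: support(X u Y) / support(X)."""
    return support(set(x) | set(y), transactions) / support(x, transactions)


# Hypothetical market-basket data (assumption, for illustration only).
transactions = [
    frozenset({"diaper", "beer", "milk"}),
    frozenset({"diaper", "beer"}),
    frozenset({"diaper", "bread"}),
    frozenset({"milk", "bread"}),
]

rule_support = support({"diaper", "beer"}, transactions)
rule_confidence = confidence({"diaper"}, {"beer"}, transactions)
```

Here two of the four transactions contain both items, and two of the three transactions with diapers also contain beer, so the rule has support 0.5 and confidence 2/3.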
In recent years, there has been interest in identifying association rules in graph data, because such rules can capture associations among social entities and be used in social marketing. For example, Fan, Wang, Wu, and Xu (2015) extended association rules with special graph patterns and used the rules in social media marketing and social recommendation. However, these rules cannot model more complicated associations among social entities, because their consequents, as pattern graphs, consist of only a single edge. As a result, numerous meaningful rules cannot be captured. Graph-pattern association rules become more involved once general patterns are allowed as antecedents and consequents. This highlights the need for extending association rules with general graph patterns and discovering these rules on social graphs.

Example 1. A fraction of a social graph G is shown in Fig. 1(a), where each node denotes a person, with the name as an identifier and a job title (project manager (PM), database administrator (DBA), programmer (PRG), business analyst (BA) or software tester (ST)), and each edge indicates friendship; e.g., (Bob, Mat) indicates that Bob and Mat are friends. The graph G is distributed over sites S1, S2 and S3. From G one can infer the following rule R: among a group of people with titles PM, BA, DBA, PRG and ST, if PM and BA, PM and DBA, DBA and PRG, DBA and ST, and PRG and ST are friends, then PRG and BA, as well as BA and DBA, are likely to be friends. The rule R, referred to as a graph-pattern association rule (GPAR), is defined on graphs rather than itemsets. As shown in Fig. 1(b), the antecedent and consequent of the rule are represented as graph patterns Ql and Qr, respectively. The graph patterns specify conditions on the entities in a social graph in terms of topological constraints.
With the rule, one can infer social relationships and recommend to people the friends they will most likely be interested in, e.g., recommend Mary to Tim, and Roy to Mary. Beyond social recommendation, GPARs can also be used in, e.g., link prediction (Ebisu & Ichise, 2019; Lin, Song, Shen, & Wu, 2018), graph repairing (Cheng, Chen, Yuan, & Wang, 2018), and network evolution analysis (Chaturvedi, Tiwari, & Spyratos, 2019). While useful, mining GPARs poses challenges. (1) Traditional techniques for transactional data cannot be applied. Superficially, a GPAR could be defined following its traditional counterpart, i.e., with antecedent and consequent defined on edge sets, so that traditional techniques could be trivially applied. However, in a single graph, frequent edge sets cannot be mined along the same lines as frequent itemsets in transactional data. In addition, the prior method for special GPARs is not feasible either, because it takes a set of single-edge consequents as input and outputs GPARs with these consequents. (2) Social graphs are often large and stored in a distributed manner; hence, centralized techniques are no longer viable. One may verify that the rule R, along with its support, may not be correctly computed from G without distributed algorithms. Worse still, mining computations are often cost-prohibitive. These challenges call for an algorithm that allows a high degree of parallelism and discovers GPARs efficiently.
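To make the rule of Example 1 concrete, the sketch below encodes the antecedent Ql and the consequent Qr as labelled edge lists, finds matches of Ql in a small graph by brute-force enumeration, and reports the consequent edges that are still missing as friend suggestions. The toy graph (nodes `u1` to `u6`), the function names and the use of plain enumeration are assumptions for illustration; the graph is not the one in Fig. 1, and this is not the distributed algorithm developed in the paper.

```python
from itertools import permutations

# Pattern nodes (variable -> job title) shared by Ql and Qr, as in Example 1.
Q_NODES = {"pm": "PM", "ba": "BA", "dba": "DBA", "prg": "PRG", "st": "ST"}
# Antecedent Ql: friendships that must already exist.
QL_EDGES = [("pm", "ba"), ("pm", "dba"), ("dba", "prg"), ("dba", "st"), ("prg", "st")]
# Consequent Qr: friendships the rule predicts.
QR_EDGES = [("prg", "ba"), ("ba", "dba")]


def matches(pattern_nodes, pattern_edges, labels, edges):
    """Yield injective mappings that preserve labels and pattern edges."""
    adj = {frozenset(e) for e in edges}
    variables = list(pattern_nodes)
    for image in permutations(labels, len(variables)):
        m = dict(zip(variables, image))
        if all(labels[m[v]] == pattern_nodes[v] for v in variables) and all(
            frozenset((m[u], m[v])) in adj for u, v in pattern_edges
        ):
            yield m


def suggest_links(labels, edges):
    """For every match of Ql, collect the Qr edges absent from the graph."""
    adj = {frozenset(e) for e in edges}
    suggestions = set()
    for m in matches(Q_NODES, QL_EDGES, labels, edges):
        for u, v in QR_EDGES:
            pair = frozenset((m[u], m[v]))
            if pair not in adj:
                suggestions.add(pair)
    return suggestions


# Hypothetical social graph (assumption): u6 is an isolated programmer.
labels = {"u1": "PM", "u2": "BA", "u3": "DBA", "u4": "PRG", "u5": "ST", "u6": "PRG"}
edges = [("u1", "u2"), ("u1", "u3"), ("u3", "u4"), ("u3", "u5"), ("u4", "u5"),
         ("u2", "u3")]  # BA and DBA are already friends

suggested = suggest_links(labels, edges)
```

In this toy graph the only match of Ql maps the programmer to `u4`, and because the BA and the DBA are already friends, the single suggestion is the pair (`u2`, `u4`).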
The practical need for GPARs raises the following fundamental questions. (1) As will be seen shortly, conventional support and confidence metrics no longer work for GPARs. What metrics, then, can be used to measure support and confidence? (2) Is there a parallel algorithm for mining GPARs over distributed graphs? (3) As a large graph often yields an excessive number of GPARs, how can we generate “representative” GPARs to facilitate inspection and interpretation of the rules?
Contributions. This paper proposes generalized GPARs, which extend association rules with general graph patterns, and provides effective algorithms for discovering (representative) GPARs in a distributed setting.
- (1) We first propose generalized graph-pattern association rules to capture complex social relations among social entities (Section 2.3). We next show that the GPAR mining problem can be decomposed into two subproblems, i.e., frequent pattern mining and rule generation. We then outline an algorithm for the problem (Section 2.4).
- (2) We study the frequent pattern mining problem in a distributed setting (Section 3). Inspired by the “look-ahead & backtracking” strategy used for constraint satisfaction problems, we develop an algorithm that generates a DFS code graph whose nodes correspond to frequent patterns, i.e., patterns with support above a threshold (Section 3.1). The algorithm works in parallel and hence achieves desirable performance: the time it needs to compute the support of a pattern Q, and the data shipment it incurs, are bounded in terms of k, the level of the node to which Q corresponds in the DFS code graph, together with the number of sites, the number |Ef| of crossing edges and the number |Vf| of virtual nodes (see the definitions of Ef and Vf in Section 2.1). We also provide techniques for optimizing mining computations (Section 3.3).
- (3) We study how to generate GPARs by using the DFS code graph (Section 4). Given the code graph and a confidence bound η, we develop an algorithm that produces GPARs with confidence above η in time that is independent of the size of the underlying big graph G. We then study how to generate “representative” GPARs. We start from the notion of maximal frequent patterns, followed by the problem of maximal frequent pattern mining. We show that this problem does not increase the difficulty: once the code graph is constructed, its leaf nodes correspond to the set of maximal frequent patterns. Using maximal frequent patterns, we provide an algorithm that generates “representative” GPARs to reduce the excessive number of GPARs and ease understanding.
- (4) Using real-life and synthetic graphs, we experimentally verify the performance of our algorithms (Section 5). We find the following. (a) Our distributed algorithms for frequent pattern mining scale well with the number of processors (n). For example, our algorithm is on average 4.1, 2.9 and 2.5 times faster on three real-world social networks when n increases from 4 to 20. (b) Our algorithms work reasonably well on large graphs. For example, on a graph with 4 million nodes and 53.5 million edges, our algorithm takes less than 5 minutes (269 s) and ships only 15.9% of the entire graph to discover frequent patterns using 20 processors. (c) The optimization technique is effective. For example, the optimized algorithm requires on average only 56.3% of the time and ships 68% of the data of its unoptimized counterpart. (d) Our rule generation algorithm is very efficient, requiring less than 0.4 s over real-life graphs. (e) Generating GPARs with maximal frequent patterns is very effective, as the “representative” GPARs account for only 1.57% of the entire rule set and cover more than 90% of the top-k rules on average. (f) Our GPARs can predict missing relationships with an average accuracy of 43.8% on real-life graphs and outperform existing link prediction methods.
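The “look-ahead & backtracking” strategy mentioned in contribution (2) comes from constraint satisfaction. The sketch below shows the generic strategy on a toy task, namely finding all matches of a small labelled pattern in a graph: after each tentative assignment, the candidate sets of the unassigned pattern nodes are pruned (look-ahead), and the search backtracks as soon as one of them becomes empty. This is only an illustration of the general idea under assumed names and data; it is not the paper's algorithm for building the DFS code graph.

```python
def find_matches(p_labels, p_edges, g_labels, g_edges):
    """Backtracking search with forward checking for pattern matches."""
    g_adj = {v: set() for v in g_labels}
    for u, v in g_edges:
        g_adj[u].add(v)
        g_adj[v].add(u)
    p_adj = {v: set() for v in p_labels}
    for u, v in p_edges:
        p_adj[u].add(v)
        p_adj[v].add(u)

    # Initial candidate sets: graph nodes carrying the same label.
    start = {u: {v for v in g_labels if g_labels[v] == p_labels[u]}
             for u in p_labels}
    results = []

    def backtrack(assignment, domains):
        if len(assignment) == len(p_labels):
            results.append(dict(assignment))
            return
        # Choose the unassigned pattern node with the fewest candidates.
        u = min((x for x in p_labels if x not in assignment),
                key=lambda x: len(domains[x]))
        for v in sorted(domains[u]):
            pruned, consistent = {}, True
            for w in p_labels:
                if w in assignment or w == u:
                    continue
                cand = domains[w] - {v}          # keep the mapping injective
                if w in p_adj[u]:
                    cand = cand & g_adj[v]       # neighbours must stay adjacent
                if not cand:                     # look-ahead detects a dead end
                    consistent = False
                    break
                pruned[w] = cand
            if consistent:
                assignment[u] = v
                backtrack(assignment, pruned)
                del assignment[u]                # backtrack

    backtrack({}, start)
    return results


# Hypothetical data (assumption): a PM-DBA-PRG path pattern on a 4-node graph.
g_labels = {"a": "PM", "b": "DBA", "c": "DBA", "d": "PRG"}
g_edges = [("a", "b"), ("a", "c"), ("b", "d")]
p_labels = {"x": "PM", "y": "DBA", "z": "PRG"}
p_edges = [("x", "y"), ("y", "z")]

found = find_matches(p_labels, p_edges, g_labels, g_edges)
```

On this input the candidate `c` for the DBA node is discarded by look-ahead as soon as the programmer is fixed, so the search returns the single match x→a, y→b, z→d without ever exploring `c`.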
This work provides a full treatment of mining generalized graph-pattern association rules from large social graphs. It provides a parallel algorithm, along with an optimization strategy, for mining frequent patterns and develops techniques to generate GPARs (resp. “representative” GPARs) with frequent (resp. maximal frequent) patterns. Compared with earlier works, this work fills a critical void in mining generalized GPARs and yields a promising approach for social network analysis.
Related Work. This paper extends our prior work (Wang & Xu, 2018) with the following new contributions: (1) proofs of Proposition 1 (Section 3.1) and Lemma 1 (Section 3.2); (2) a procedure and a detailed analysis of the mining algorithm (Section 3.2); (3) an optimization strategy for local computation at workers (Section 3.3); (4) a maximal frequent pattern mining problem and techniques to generate “representative” GPARs (Section 4.2); and (5) a set of new experimental studies (Section 5), including comparative tests of efficiency and of the scalability of data shipment, a performance evaluation over a new dataset (Amazon), a performance comparison between algorithms, and case studies over two real-life graphs.
We next categorize the related work as follows.
Association rules. Association rules, which are defined on relations of transaction data, were first introduced in Agrawal et al. (1993). Prior work on association rules for social networks (Schmitz, Hotho, Jäschke, & Stumme, 2006) and RDF knowledge bases (Galárraga, Teflioudi, Hose, & Suchanek, 2013) mined conventional rules and Horn rules (as conjunctive binary predicates) over a set of tuples with attributes extracted from social graphs, instead of exploiting graph patterns. A special type of GPAR was proposed in Fan et al. (2015), where the consequents were defined as pattern graphs with a single edge.
Existing research has also shown that traditional techniques can produce too many association rules, because a frequent itemset of size k has exponentially many (on the order of 2^k) frequent subsets (Zaki, 2000). Therefore, maximal frequent itemsets and an algorithm for mining them were introduced by Gouda and Zaki (2001), as the maximal frequent itemsets implicitly determine all frequent itemsets.
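The blow-up and the compression offered by maximal itemsets can be seen on a very small example. The sketch below enumerates frequent itemsets level by level and then keeps only those with no frequent proper superset. The transactions, the threshold and the function names are assumptions for illustration, and the naive enumeration is not the algorithm of Gouda and Zaki (2001).

```python
from itertools import combinations


def frequent_itemsets(transactions, min_count):
    """Level-wise enumeration of all itemsets occurring in >= min_count transactions."""
    items = sorted(set().union(*transactions))
    frequent = {}
    for size in range(1, len(items) + 1):
        found = False
        for combo in combinations(items, size):
            candidate = frozenset(combo)
            count = sum(1 for t in transactions if candidate <= t)
            if count >= min_count:
                frequent[candidate] = count
                found = True
        if not found:
            break  # anti-monotonicity: no larger itemset can be frequent
    return frequent


def maximal_itemsets(frequent):
    """Frequent itemsets that have no frequent proper superset."""
    return [s for s in frequent if not any(s < other for other in frequent)]


# Hypothetical transactions over items a, b, c, d (assumption).
transactions = [frozenset("abc"), frozenset("abc"), frozenset("ab"), frozenset("cd")]

freq = frequent_itemsets(transactions, min_count=2)
maximal = maximal_itemsets(freq)
```

With a minimum count of 2, the itemset {a, b, c} is frequent, and so are all 2^3 − 1 = 7 of its non-empty subsets, yet the single maximal itemset {a, b, c} is enough to recover which itemsets are frequent.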
This work proposes mining techniques for generalized GPARs, which are defined with general graph patterns as antecedent and consequent. Moreover, we show that the set of maximal frequent patterns can be easily obtained from a DFS code graph and used to generate “representative” GPARs.
Big graph analysis. In the era of big graphs, the development of algorithms for graph analytics has spawned a large amount of research. Sapountzi and Psannis (2018) categorized social network analysis into three types of tasks (sentiment analysis and opinion mining, topic detection, and collaborative recommendation) and described various methods and their associated frameworks for these tasks. A recent survey of data mining and processing frameworks for big graphs was conducted by Aridhi and Nguifo (2016), in which frequent pattern mining was formulated as two problems: graph-transaction based and single-graph based. This classification coincides with that of Jiang, Coenen, and Zito (2013).
As will be seen shortly, single-graph based pattern mining is a subtask of GPAR mining, and the patterns discovered need to be organized well for rule generation. Although graph analytical platforms exist for single-graph based pattern mining, e.g., Teixeira et al. (2015), their output may still need to be reorganized to facilitate (“representative”) rule generation.
Graph-pattern mining. The generalization from itemsets to graph patterns also requires effective graph-pattern mining techniques to discover GPARs in big graphs, beyond traditional rule mining techniques (Zhang & Zhang, 2002). The topic of graph-pattern mining has been intensively studied (see Jiang et al. (2013) for a survey). Two technical routes have developed according to the essential differences in graph data.
(1) Algorithms for graph databases. The algorithms can be further categorized into two types: (i) Apriori methods (Inokuchi, Washio, & Motoda, 2000) that start with small graphs, and generate substructures by joining similar but slightly different frequent subgraphs, in a bottom-up manner; and (ii) pattern-growth methods (Huan, Wang, Prins, & Yang, 2004), which extend a frequent graph by adding new nodes and edges in every possible position. (2) Algorithms for single large graphs. For example, Elseidy, Abdelhamid, Skiadopoulos, and Kalnis (2014) proposed using a minimum image that preserves anti-monotonic property as a support metric to measure pattern frequency. To further improve efficiency, parallel techniques were studied in Talukder and Zaki (2016) and Teixeira et al. (2015). Talukder and Zaki (2016) developed a level-wise approach and effective pruning strategy under the distributive scenario. Teixeira et al. (2015) introduced a Pregel-inspired distributed platform Arabesque for frequent pattern mining.
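A minimum-image style support for a single large graph can be sketched as follows: enumerate the matches of a pattern, collect for each pattern node the set of distinct graph nodes it is mapped to, and take the size of the smallest such set. The code below is a brute-force illustration under assumed names and toy data (and uses edge-preserving, non-induced matching); it is not the implementation of any of the cited systems.

```python
from itertools import permutations


def embeddings(p_labels, p_edges, g_labels, g_edges):
    """Yield injective, label- and edge-preserving mappings of the pattern."""
    adj = {frozenset(e) for e in g_edges}
    p_nodes = list(p_labels)
    for image in permutations(g_labels, len(p_nodes)):
        m = dict(zip(p_nodes, image))
        if any(p_labels[u] != g_labels[m[u]] for u in p_nodes):
            continue
        if all(frozenset((m[u], m[v])) in adj for u, v in p_edges):
            yield m


def min_image_support(p_labels, p_edges, g_labels, g_edges):
    """Smallest number of distinct graph nodes matched by any pattern node."""
    images = {u: set() for u in p_labels}
    for m in embeddings(p_labels, p_edges, g_labels, g_edges):
        for u, g in m.items():
            images[u].add(g)
    return min(len(s) for s in images.values())


# Hypothetical labelled graph (assumption).
g_labels = {1: "PM", 2: "DBA", 3: "DBA", 4: "PRG", 5: "PM"}
g_edges = [(1, 2), (1, 3), (5, 3), (2, 4)]

# Pattern A: PM - DBA.  Pattern B extends A with a PRG attached to the DBA.
support_a = min_image_support({"x": "PM", "y": "DBA"}, [("x", "y")],
                              g_labels, g_edges)
support_b = min_image_support({"x": "PM", "y": "DBA", "z": "PRG"},
                              [("x", "y"), ("y", "z")], g_labels, g_edges)
```

Pattern A has three matches but a support of 2, and its extension B has support 1, which illustrates the anti-monotonic behaviour: extending a pattern never increases this measure.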
More recently, de J. Costa, Bernardini, Artigas, and Viterbo (2019) proposed methods for mining directed acyclic patterns to identify retention patterns in undergraduate programmes. To identify periodic patterns in dynamic networks, Halder, Samiullah, and Lee (2017) proposed SPPMiner, a single-pass, supergraph-based periodic pattern mining technique. SPPMiner stores all entities of the dynamic network only once and calculates common sub-patterns only once at each timestamp, thus achieving high efficiency.
However, techniques for frequent pattern mining over graph databases (Huan, Wang, Prins, & Yang, 2004; Inokuchi, Washio, & Motoda, 2000) cannot be directly used to mine GPARs, as their anti-monotonic property does not hold in a single graph (Jiang et al., 2013). Our setting is quite different from those of de J. Costa et al. (2019) and Halder et al. (2017), as our method can mine general patterns rather than DAG patterns, and we deal with static rather than dynamic graphs. Our work differs from distributed techniques (Talukder & Zaki, 2016; Teixeira, Fonseca, Serafini, Siganos, Zaki, & Aboulnaga, 2015) as follows: we leverage both partial evaluation and asynchronous message passing to identify matches of candidate patterns in a distributed environment, instead of level-wise evaluation or Pregel-based computation; moreover, frequent patterns are our intermediate results, and our distributed technique organizes all frequent patterns in a graph structure, which facilitates (“representative”) rule generation.
GPARs mining. Special GPARs and their mining techniques were introduced in Fan et al. (2015), where the consequents of the GPARs are defined as pattern graphs with a single edge, and the algorithm is specifically developed to mine diversified GPARs with a fixed consequent. Fan, Wu, and Xu (2016) next introduced quantified graph association rules, which extend GPARs with quantifiers to identify potential customers in social media marketing. Another work addressed mining GPARs over stream data (Namaki, Wu, Song, Lin, & Ge, 2017).
Our work differs from these studies in the semantics, i.e., we mine generalized GPARs whose antecedent and consequent are both represented by general graph patterns.
Section snippets
Graph-pattern association rules
In this section, we first review graphs, patterns, subgraphs, (subgraph) isomorphism and distributed graphs. We then introduce DFS Code, DFS Lexicographic Order and DFS Code Tree, followed by the definition of graph-pattern association rules.
Frequent pattern mining
In this section, we first investigate the frequent pattern mining problem (Section 3.1); we then develop a parallel algorithm for the problem (Section 3.2) and provide an optimization strategy to improve performance (Section 3.3).
GPARs generation
In this section, we first introduce how to generate a set of GPARs from the DFS code graph. We then introduce the notion of “representative” GPARs and develop a technique to generate them.
Experimental study
We next present an experimental study of our algorithms. Using real-life and synthetic data, we conducted two sets of experiments to evaluate (1) the efficiency and data shipment of our distributed algorithm and the effectiveness of the optimization techniques; and (2) the efficiency of the rule generation algorithm and the effectiveness of our “representative” rule generation algorithm.
Experimental setting. We used three real-life graphs. (a) Amazon (Leskovec &
Conclusion
We have proposed generalized graph-pattern association rules (GPARs) and viable support and confidence measures for the discovery of GPARs. Compared with the special rules introduced in Fan et al. (2015), the generalized GPARs are capable of modelling more complicated associations among social entities. We have provided techniques to efficiently mine GPARs. In particular, our technique for frequent pattern mining follows the “look-ahead & backtracking” strategy, which is widely employed for the
Author statement
Xin Wang conceived the presented idea and developed the theory and algorithms. Yang Xu verified the methods and implemented the algorithms with the help of Xin Wang. Yang Xu and Huayi Zhan designed the experiments and conducted the tests. Xin Wang wrote the manuscript with support from Yang Xu and Huayi Zhan.
Xin Wang agreed to be accountable for all aspects of the work in ensuring that problems appearing in any part of the work are appropriately investigated and resolved.
Declaration of Competing Interest
The authors whose names are listed immediately below certify that they have NO affiliations with or involvement in any organization or entity with any financial interest (such as honoraria; educational grants; participation in speakers’ bureaus; membership, employment, consultancies, stock ownership, or other equity interest; and expert testimony or patent-licensing arrangements), or non-financial interest (such as personal or professional relationships, affiliations, knowledge or beliefs) in
Acknowledgments
This work is supported in part by the NSFC 71490722, 71490725, 71671146, Sichuan Provincial Major Project 2017JY0225 and Fundamental Research Funds for the Central Universities, China.
References (43)
- et al. (2016). Big graph mining: Frameworks and techniques. Big Data Research.
- et al. (2017). Supergraph based periodic pattern mining in dynamic social networks. Expert Systems with Applications.
- et al. (2011). Link prediction in complex networks: A survey. Physica A: Statistical Mechanics and its Applications.
- et al. (2018). Social networking data analysis tools & challenges. Future Generation Computer Systems.
- et al. (1993). Mining association rules between sets of items in large databases. Proceedings of the 1993 ACM SIGMOD International Conference on Management of Data.
- et al. (1997). Beyond market baskets: Generalizing association rules to correlations. Proceedings ACM SIGMOD International Conference on Management of Data.
- et al. (2019). minstab: Stable network evolution rule mining for system changeability analysis. IEEE Transactions on Emerging Topics in Computational Intelligence.
- et al. (2018). Rule-based graph repairing: Semantic and efficient repairing methods. 2018 IEEE 34th International Conference on Data Engineering (ICDE).
- et al. (2004). A (sub)graph isomorphism algorithm for matching large graphs. TPAMI.
- et al. (2019). Graph pattern entity ranking model for knowledge graph completion. Proceedings of the 2019 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, NAACL-HLT 2019, Volume 1.