ABSTRACT
Graph search, i.e., finding all graphs in a database D that contain the query graph q, is a classical primitive prevalent in various graph database applications. Many studies have been devoted to this topic; however, the recent emergence of large information networks poses new challenges to the research community. Most traditional graph search schemes rely on graph-feature-based indexing, but the index construction step, which often involves frequent subgraph mining, becomes a bottleneck for large graphs because of its high computational complexity. Although several methods have been proposed to relieve this mining bottleneck, such as summarization of database graphs, the frequent subgraphs they generate are still unsatisfactory as indexing features: the feature set is in general both deficient for the large graph scenario and cluttered with redundant features. Furthermore, the large size of the graphs makes it too easy for a small feature to be contained in many of them, severely impairing its selectivity and pruning power. Motivated by these issues, in this paper we propose CP-Index (Contact Preservation), a novel index for efficient indexing of large graphs. To overcome the low selectivity issue, we reap further pruning opportunities by leveraging each feature's location information in the database graphs. Specifically, we look at how features touch each other in the query and check whether this contact pattern is preserved in the target graphs. Then, to tackle the deficiency and redundancy problems associated with features, we introduce new feature generation and selection methods, such as dual feature generation and size-increasing bootstrapping feature selection, to complete our design. Experimental results show that CP-Index is much more effective in indexing large graphs.
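The two pruning stages described in the abstract can be illustrated with a minimal Python sketch. It is not the paper's algorithm or data structure: the occurrence-list layout, the function names (`feature_filter`, `contact_filter`, `touches`), and the definition of contact as "two feature occurrences share at least one vertex" are all assumptions made for illustration. The sketch also assumes that the index stores every occurrence of each feature in each graph, which is what makes the contact check safe to use as a filter.

```python
from itertools import combinations

def touches(occ_a, occ_b):
    # Assumed contact test: two occurrences touch if they share a vertex.
    return not occ_a.isdisjoint(occ_b)

def feature_filter(query_occ, index):
    # Classical filter: keep graphs that contain every query feature.
    result = {g for postings in index.values() for g in postings}
    for feature in query_occ:
        result &= set(index.get(feature, {}))
    return result

def contact_filter(query_occ, index, graph_ids):
    # Extra filter: every pair of features that touches in the query
    # must also have a touching pair of occurrences in the graph.
    touching_pairs = [
        (a, b)
        for a, b in combinations(query_occ, 2)
        if any(touches(x, y) for x in query_occ[a] for y in query_occ[b])
    ]
    return {
        g
        for g in graph_ids
        if all(
            any(touches(x, y) for x in index[a][g] for y in index[b][g])
            for a, b in touching_pairs
        )
    }

# Hypothetical toy data: two edge features, "A-B" and "B-C".
# index[feature][graph_id] lists the vertex sets of all occurrences.
index = {
    "A-B": {"g1": [frozenset({10, 11})],
            "g2": [frozenset({1, 2})],
            "g3": [frozenset({7, 8})]},
    "B-C": {"g1": [frozenset({11, 12})],
            "g2": [frozenset({5, 6})]},
}
# The query is the path A-B-C, so its two features share the B vertex.
query_occ = {
    "A-B": [frozenset({0, 1})],
    "B-C": [frozenset({1, 2})],
}

after_features = feature_filter(query_occ, index)
after_contact = contact_filter(query_occ, index, after_features)
```

In this toy data, g3 lacks the "B-C" feature and is dropped by the feature filter, while g2 contains both features but at disjoint locations, so only the contact check removes it. The surviving candidates would still need subgraph isomorphism verification, since both filters only rule graphs out.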
Index Terms
- CP-index: on the efficient indexing of large graphs