Abstract
One of the most fundamental problems in computer science is the reachability problem: Given a directed graph and two vertices s and t, can s reach t via a path? We revisit existing techniques and combine them with new approaches to support a large portion of reachability queries in constant time using a linear-sized reachability index. Our new algorithm
In a detailed experimental study, we compare a variety of algorithms with respect to their index-building and query times as well as their memory footprint on a diverse set of instances. Our experiments indicate that the query performance often depends strongly not only on the type of graph but also on the result, i.e., reachable or unreachable. Furthermore, we show that previous algorithms are significantly sped up when combined with our new approach in almost all scenarios. Surprisingly, due to cache effects, a higher investment in space does not necessarily pay off: Reachability queries can often be answered even faster than single memory accesses in a precomputed full reachability matrix.
1 INTRODUCTION
Graphs are used to model problem settings in a variety of disciplines. A natural question that arises frequently is whether one vertex of the graph can reach another vertex via a path of directed edges. Reachability finds application in a wide variety of fields, such as program and dataflow analysis [24, 25], user-input dependence analysis [27], XML query processing [34], and more [40]. Another prominent example is the Semantic Web, which is composed of RDF/OWL data. These are often very large graphs with rich content. Here, reachability queries are often necessary to deduce relationships among the objects.
There are two straightforward solutions to the reachability problem: The first is to answer each query individually with a graph traversal algorithm, such as breadth-first search (BFS) or depth-first search (DFS), in worst-case \( \mathcal {O}(m+n) \) time and \( \mathcal {O}(n) \) space. The second is to precompute a full all-pairs reachability matrix in an initialization step and answer all ensuing queries in worst-case constant time. In return, this approach suffers from a space complexity of \( \mathcal {O}(n^2) \) and an initialization time of \( \mathcal {O}(n^3) \) using the Floyd–Warshall algorithm [6, 7, 35] or \( \mathcal {O}(n\cdot (n+m)) \) when starting a graph traversal at each vertex in turn. Alternatively, the initialization step can be performed in \( \mathcal {O}(n^\omega) \) via fast matrix multiplication, where \( \mathcal {O}(n^\omega) \) is the time required to multiply two \( n \times n \) matrices (\( 2 \le \omega \lt 2.38 \) [20]). With increasing graph size, however, both the initialization time and space complexity of this approach become impractical. We, therefore, strive for alternative algorithms which decrease these complexities whilst still providing fast query lookups.
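The two baseline solutions can be sketched as follows. This is a minimal illustration rather than the implementation evaluated later; the function names and the adjacency-list representation (`adj[u]` lists the out-neighbors of `u`) are assumptions made for this example, and each matrix row is stored as a Python integer used as a bitset.

```python
from collections import deque


def bfs_reachable(adj, s, t):
    """Answer a single query with a forward BFS from s: O(n + m) time, O(n) space."""
    seen = [False] * len(adj)
    seen[s] = True
    queue = deque([s])
    while queue:
        u = queue.popleft()
        if u == t:
            return True
        for v in adj[u]:
            if not seen[v]:
                seen[v] = True
                queue.append(v)
    return False


def reachability_matrix(adj):
    """Precompute all pairs with one traversal per vertex: O(n * (n + m)) time, n^2 bits."""
    rows = []
    for s in range(len(adj)):
        row = 1 << s  # every vertex reaches itself (path of length 0)
        stack = [s]
        while stack:
            u = stack.pop()
            for v in adj[u]:
                if not (row >> v) & 1:
                    row |= 1 << v
                    stack.append(v)
        rows.append(row)
    return rows


def matrix_reachable(rows, s, t):
    """Constant-time lookup in the precomputed matrix."""
    return bool((rows[s] >> t) & 1)
```

The first function needs no index at all, whereas the second trades quadratic space for a single bit lookup per query.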
Contribution. In this article, we study a variety of approaches that are able to support fast reachability queries. All of these algorithms perform some kind of preprocessing on the graph and then use the collected data to answer reachability queries in a timely manner. Based on simple observations, we provide a new algorithm,
2 PRELIMINARIES
Terms and Definitions. Let \( G=(V, E) \) be a simple directed graph with vertex set V and edge set \( E\subseteq V \times V \). As usual, \( n=|V| \) and \( m=|E| \). An edge \( (u, v) \) is said to be outgoing at u and incoming at v, and u and v are called adjacent. The out-degree \( \mathrm{deg}^{+}(u) \) (in-degree \( \mathrm{deg}^{-}(u) \)) of a vertex u is its number of outgoing (incoming) edges. A vertex without incoming (outgoing) edges is called a source (sink). The out-neighborhood \( \textsf {N}^{{+}}(u) \) (in-neighborhood \( \textsf {N}^{{-}}(u) \)) of a vertex u is the set of all vertices v such that \( (u, v) \in E \) (\( (v, u) \in E \)). The reverse of an edge \( (u, v) \) is an edge \( (v, u) = {(u, v)}^{{\mathrm{R}}} \). The reverse \( {G}^{{\mathrm{R}}} \) of a graph G is obtained by keeping the vertices of G, but substituting each edge \( (u, v) \in E \) by its reverse, i.e., \( {G}^{{\mathrm{R}}} = (V, {E}^{{\mathrm{R}}}) \).
A sequence of vertices \( s = v_0 \rightarrow \dots \rightarrow v_k = t \), \( k \ge 0 \), such that for each pair of consecutive vertices \( v_i \rightarrow v_{i+1} \), \( (v_i, v_{i+1})\in E \), is called an s-t path. If such a path exists, s is said to reach t and we write \( s \rightarrow ^*t \) for short, and \( s \not\rightarrow ^*t \) otherwise. The out-reachability \( \textsf {R}^{+}(u) = \lbrace v \mid u \rightarrow ^*v\rbrace \) (in-reachability \( \textsf {R}^{-}(u) = \lbrace v \mid v \rightarrow ^*u\rbrace \)) of a vertex \( u \in V \) is the set of all vertices that u can reach (that can reach u).
A weakly connected component (WCC) of G is a maximal set of vertices \( C \subseteq V \) such that \( \forall u, v \in C: u \rightarrow ^*v \) in \( G=(V, E \cup {E}^{{\mathrm{R}}}) \), i.e., also using the reverse of edges. Note that if two vertices \( u, v \) reside in different WCCs, then \( u \not\rightarrow ^*v \) and \( v \not\rightarrow ^*u \). A strongly connected component (SCC) of G denotes a maximal set of vertices \( S \subseteq V \) such that \( \forall u, v \in S: u \rightarrow ^*v \wedge v \rightarrow ^*u \) in G. Contracting each SCC S of G to a single vertex \( v_S \), called its representative, while preserving edges between different SCCs as edges between their corresponding representatives, yields the condensation \( {G}^{{\mathrm{C}}} \) of G. We denote the SCC a vertex \( v \in V \) belongs to by \( \mathcal {S}(v) \). A directed graph G is strongly connected if it only has a single SCC and acyclic if each SCC is a singleton, i.e., if G has n SCCs. Observe that G and \( {G}^{{\mathrm{R}}} \) have exactly the same WCCs and SCCs and that \( {G}^{{\mathrm{C}}} \) is a directed acyclic graph (DAG). WCCs of a graph can be computed in \( \mathcal {O}(n+m) \) time, e.g., via a BFS that ignores edge directions. The SCCs of a graph can be computed in linear time [29] as well.
A topological ordering \( \tau : V \rightarrow \mathbb {N}_0 \) of a DAG G is a total ordering of its vertices such that \( \forall (u, v) \in E: \tau (u) \lt \tau (v) \). Note that the topological ordering of G is not necessarily unique, i.e., there can be multiple different topological orderings. For a vertex \( u \in V \), the forward topological level is \( \mathcal {F}(u) = \min _\tau \tau (u) \), i.e., the minimum value of \( \tau (u) \) among all topological orderings \( \tau \) of G. Consequently, \( \mathcal {F}(u) = 0 \) if and only if u is a source. The backward topological level \( \mathcal {B}(u) \) of \( u \in V \) is the topological level of u with respect to \( {G}^{{\mathrm{R}}} \) and \( \mathcal {B}(u) = 0 \) if and only if u is a sink. A topological ordering, as well as the forward and backward topological levels, can be computed in linear time [6, 19, 30], see also Section 4.
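As an illustration, the forward and backward topological levels can be computed with a queue-based variant of Kahn's algorithm. The sketch below assumes that \( \tau \) may assign equal values to vertices that are not connected by an edge, so that \( \mathcal {F}(u) \) equals the length of the longest path ending at u; this reading, the function names, and the adjacency-list input are assumptions of this example.

```python
from collections import deque


def forward_levels(adj):
    """Forward topological level of every vertex of a DAG in O(n + m) time."""
    n = len(adj)
    indeg = [0] * n
    for u in range(n):
        for v in adj[u]:
            indeg[v] += 1
    level = [0] * n  # sources keep level 0
    queue = deque(u for u in range(n) if indeg[u] == 0)
    processed = 0
    while queue:
        u = queue.popleft()
        processed += 1
        for v in adj[u]:
            # v must be placed strictly after each of its in-neighbors
            level[v] = max(level[v], level[u] + 1)
            indeg[v] -= 1
            if indeg[v] == 0:
                queue.append(v)
    if processed != n:
        raise ValueError("input graph contains a cycle")
    return level


def reverse_graph(adj):
    """Adjacency lists of the reverse graph."""
    radj = [[] for _ in adj]
    for u, targets in enumerate(adj):
        for v in targets:
            radj[v].append(u)
    return radj


def backward_levels(adj):
    """Backward topological levels, i.e., forward levels of the reverse graph."""
    return forward_levels(reverse_graph(adj))
```

Since every edge \( (u, v) \) forces \( \mathcal {F}(u) \lt \mathcal {F}(v) \), a vertex s can only reach a different vertex t if its forward level is strictly smaller; the symmetric statement holds for backward levels.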
A reachability query Query(\( s, t \)) for a pair of vertices \( s, t \in V \) is called positive and answered with
Basic Observations. With respect to processing a reachability Query(\( s, t \)) in a graph G for an arbitrary pair of vertices \( s \ne t \in V \), the following basic observations are immediate and have partially also been noted elsewhere [22]:
As mentioned above, the precomputations necessary for Observations (B2) and (B3) can be performed in \( \mathcal {O}(n+m) \) time. Note, however, that Observations (B3) and (B4) together are equivalent to asking whether \( s \rightarrow ^*t \): If \( s \rightarrow ^*t \) and \( \mathcal {S}(s) \ne \mathcal {S}(t) \), then for every topological ordering \( \tau \), \( \tau (\mathcal {S}(s)) \lt \tau (\mathcal {S}(t)) \). Otherwise, if \( s \not\rightarrow ^*t \), a topological ordering \( \tau \) with \( \tau (\mathcal {S}(t)) \lt \tau (\mathcal {S}(s)) \) can be computed by topologically sorting \( {G}^{{\mathrm{C}}} \cup \lbrace (\mathcal {S}(t),\mathcal {S}(s))\rbrace \). Hence, the precomputations necessary for Observation (B4) would require solving the Reachability problem for all pairs of vertices already. Furthermore, a DAG can have exponentially many different topological orderings. In consequence, weaker forms are employed, such as the following [22, 38, 39] (see also Section 4):
Assumptions. Following the convention introduced in the preceding work [3, 22, 38, 39] (cf. Section 3), we only consider Reachability on DAGs from here on and implicitly assume that the condensation, if necessary, has already been computed and Observation (B3) has been applied. For better readability, we also drop the use of \( \mathcal {S}(\cdot) \).
3 RELATED WORK
A large amount of research on reachability indices has been conducted. Existing approaches can roughly be put into three categories: compression of transitive closure [2, 13, 14, 15, 32, 34], hop-labeling-based algorithms [4, 5, 16, 26, 37], as well as pruned search [18, 22, 28, 31, 33, 36, 38, 39]. As Merz and Sanders [22] noted, the first category gives very good query times for small networks but does not scale well to large networks (which are the focus of this work). Therefore, we do not consider approaches based on this technique more closely. Hop labeling algorithms typically build paths from labels that are stored for each vertex. For example, in 2-hop labeling, each vertex stores two sets containing vertices it can reach in the given graph as well as in the reverse graph. A query can then be reduced to the set intersection problem. Pruned-search-based approaches precompute information to speed up queries by pruning the search.
Due to its volume, it is impossible to compare against all previous work. We mostly follow the methodology of Merz and Sanders [22] and focus on five recent techniques. The two most recent hop-labeling-based approaches are
Table 1 summarizes the time and space complexities of the new algorithm
Algorithm | Initialization Time | Index Size (\( {\text{Byte}} \)) | Query Time | Query Space |
---|---|---|---|---|
BFS/DFS | \( \mathcal {O}(1) \) | 0 | \( \mathcal {O}(n+m) \) | \( \mathcal {O}(n) \) |
Full matrix | \( \mathcal {O}(n \cdot (n+m)) \) | \( n^2/8 \) | \( \mathcal {O}(1) \) | \( \mathcal {O}(1) \) |
… | \( \mathcal {O}(n\log n+m) \) | \( \mathcal {O}(n \log n) \) | \( \mathcal {O}(\log n) \) | \( \mathcal {O}(\log n) \) |
… | \( \mathcal {O}(m+n\log n) \) | 56n | \( \mathcal {O}(1) \) / \( \mathcal {O}(n+m) \) | \( \mathcal {O}(n) \) |
\( \texttt {IP} \) | \( \mathcal {O}((k_{\texttt {IP}}{} + h_{\texttt {IP}}{})(n+m)) \) | \( \mathcal {O}((k_{\texttt {IP}}{} + h_{\texttt {IP}}{})n) \) | \( \mathcal {O}(k_{\texttt {IP}}{}) \) / \( \mathcal {O}(k_{\texttt {IP}}{}\cdot n \cdot \rho {}^2) \) | \( \mathcal {O}(n) \) |
\( \texttt {BFL} \) | \( \mathcal {O}(s_{\texttt {BFL}}{}\cdot (n+m)) \) | \( 2\lceil \frac{s_{\texttt {BFL}}{}}{8}\rceil n \) | \( \mathcal {O}(s_{\texttt {BFL}}{}) \) / \( \mathcal {O}(s_{\texttt {BFL}}{}\cdot n + m) \) | \( \mathcal {O}(n) \) |
\( \texttt {O'Reach} \) | \( \mathcal {O}((d+kp)(n+m)) \) | \( (12 + 12 d+ 2 \lceil \frac{k}{8}\rceil)n \) | \( \mathcal {O}(k+ d+ 1) \) / \( \mathcal {O}(n+m) \) | \( \mathcal {O}(n) \) |
Parameters: \( k_{\texttt {IP}}{} \): #permutations, \( h_{\texttt {IP}}{} \): #vertices with precomputed \( \textsf {R}^{+}(\cdot) \), \( s_{\texttt {BFL}}{} \): size of Bloom filter (bits), \( \rho {} \): reachability in G, d: #topological orderings, k: #supportive vertices, p: #candidates per supportive vertex.
4 O’REACH: FASTER REACHABILITY VIA OBSERVATIONS
In this section, we propose our new algorithm
Overview. The hop labeling technique used in our algorithm is inspired by a recent result for experimentally faster reachability queries in a dynamic graph by Hanauer et al. [11]. The idea here is to speed up reachability queries based on a selected set of so-called supportive vertices, for which complete out- and in-reachability is maintained explicitly. This information is used in three simple observations, which allow answering matching queries in constant time. In our algorithm, we transfer this idea to the static setting. We further increase the ratio of queries answerable in constant time by a new perspective on topological orderings and their conflation with DFS, which provides additional reachability information. In case we cannot answer a query via an observation, we fall back to either a pruning bidirectional BFS or one of the existing algorithms.
In the following, we switch the order and first discuss topological orderings in depth, followed by our adaptation of supportive vertices. For both parts, consider a reachability Query(\( s, t \)) for two vertices \( s, t \in V \) with \( s \ne t \).
4.1 Extended Topological Orderings
Taking up the observation that topological orderings can be used to answer a reachability query decisively negative, we first investigate how Observation (B4) can be used most effectively in practice. Before we dive deeper into this subject, let us briefly review some facts concerning topological orderings and reachability in general.
Let \( \mathcal {N}(\tau) \subseteq \mathcal {N} \) denote the set of negative queries a topological ordering \( \tau \) can answer, i.e., the set of all \( (s, t) \in \mathcal {N} \) such that \( \tau (t) \lt \tau (s) \), and let \( \rho ^{{-}}(\tau) = |\mathcal {N}(\tau)| / |\mathcal {N}| \) be the answerable negative query ratio.
(i) The reachability in any DAG is at most 50%. If it is exactly 50%, the topological ordering is unique.
(ii) Any topological ordering \( \tau \) witnesses the non-reachability between exactly 50% of all pairs of distinct vertices. Therefore, \( \rho ^{{-}}(\tau) \ge 50\% \).
(iii) Every topological ordering of the same DAG can answer the same
(iv) For two different topological orderings \( \tau \ne \tau ^{\prime } \) of a DAG, \( \mathcal {N}(\tau) \ne \mathcal {N}(\tau ^{\prime }) \).
Proof. Let G be a DAG.
(i) As G is acyclic, there is at least one topological ordering \( \tau \) of G. Then, for every edge \( (u, v) \) of G, \( \tau (u) \lt \tau (v) \), which implies that each vertex u can reach at most all those vertices \( w \ne u \) with \( \tau (u) \lt \tau (w) \). Consequently, a vertex u with \( \tau (u) = i \) can reach at most \( n-i-1 \) other vertices (note that \( i \ge 0 \)). Thus, the reachability in G is at most \( \frac{1}{n(n-1)}\sum _{i=0}^{n-1} (n-i-1) = \frac{1}{n(n-1)}\sum _{j=0}^{n-1} j = \frac{n(n-1)}{n(n-1)\cdot 2} = \frac{1}{2} \). Conversely, assume that the reachability in G is \( \frac{1}{2} \). Then, each vertex u with \( \tau (u) = i \) reaches exactly all \( n-i-1 \) other vertices ordered after it, which implies that there exists no other topological ordering \( \tau ^{\prime } \) with \( \tau ^{\prime }(u) \gt \tau (u) \). By induction on i, the topological ordering of G is unique.
(ii) Let \( \tau \) be an arbitrary topological ordering of G. Then, each vertex u with \( \tau (u) = i \) can certainly not reach any of the i vertices v with \( \tau (v) \lt \tau (u) \). Hence, \( \tau \) witnesses the non-reachability of exactly \( \sum _{i=1}^{n-1} i = \frac{n(n-1)}{2} \) pairs of distinct vertices, i.e., of half of all \( n(n-1) \) ordered pairs.
(iii) As Observation (B4) corresponds exactly to the non-reachability between those pairs of vertices witnessed by the topological ordering, the claim follows directly from (ii).
(iv) As \( \tau \ne \tau ^{\prime } \), there is at least one \( i \in \mathbb {N}_0 \) such that \( \tau (u) = i = \tau ^{\prime }(v) \) and \( u \ne v \). Let \( j = \tau (v) \). If \( j \gt i \), the number of non-reachabilities from v to another vertex witnessed by \( \tau \) exceeds the number of those witnessed by \( \tau ^{\prime } \), and falls behind it otherwise. In both cases, the difference in numbers immediately implies a difference in the set of vertex pairs, which proves the claim. \( \square \)
In consequence, it is pointless to look for one particularly good topological ordering. Instead, to get the most out of Observation (B4), we need topological orderings whose sets of answerable negative queries differ greatly, such that their union covers a large fraction of \( \mathcal {N} \). Note that both forward and backward topological levels each represent the set of topological orderings that can be obtained by ordering the vertices in blocks grouped by their level and arbitrarily permuting the vertices in each block. Different algorithms [6, 19, 29] for computing a topological ordering in linear time have been proposed over the years, with Kahn’s algorithm [19] in combination with a queue being one that always yields a topological ordering represented by forward topological levels. We, therefore, complement the forward and backward topological levels by stack-based approaches, as in Kahn’s algorithm [19] in combination with a stack or Tarjan’s DFS-based algorithm [29] for computing the SCCs of a graph, which as a by-product also yields a topological ordering of the condensation. To diversify the set of answerable negative queries further, we additionally randomize the order in which vertices are processed in case of ties and also compute topological orderings on the reverse graph, in analogy to backward topological levels.
We next show how, with a small extension, the stack-based topological orderings mentioned above can be used to additionally answer positive queries. To keep the description concise, we concentrate on Tarjan’s algorithm [29] in the following and reduce it to the part relevant for obtaining a topological ordering of a DAG. In short, the algorithm starts a DFS at an arbitrary vertex \( s \in S \), where \( S \subseteq V \) is a given set of vertices to start from. Whenever it visits a vertex v, it marks v as visited and recursively visits all unvisited vertices in its out-neighborhood. On return, it prepends v to the topological ordering. A loop over \( S = V \) ensures that all vertices are visited. Note that although the vertices are visited in DFS order, the topological ordering is different from a DFS numbering as it is constructed “from back to front” and corresponds to a reverse sorting according to what is also called finishing time of each vertex.
To answer positive queries, we exploit the invariant that when visiting a vertex v, all yet unvisited vertices reachable from v will be prepended to the topological ordering prior to v being prepended. Consequently, v can certainly reach all vertices in the topological ordering between v and, exclusively, the vertex w that was at the front of the topological ordering when v was visited. Let x denote the vertex preceding w in the final topological ordering, i.e., the vertex with the largest index that was reached recursively from v. For a topological ordering \( \tau \) constructed in this way, we call \( \tau (x) \) the high index of v and denote it with \( \tau _H(v) \). Furthermore, v may be able to also reach w and vertices beyond, which occurs if \( v \rightarrow ^*y \) for some vertex y, but y had already been visited earlier. We, therefore, additionally track the max index, the largest index of any vertex that v can reach, and denote it with \( \tau _{X}(v) \). Figure 1(a) shows how to compute an extended topological ordering with both high and max indices in pseudo-code and highlights our extensions. Compared to Tarjan’s original version [29], the running time remains unaffected by our modifications and is still in \( \mathcal {O}(n+m) \).
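The construction described above can be sketched as follows. This is an illustrative reimplementation under stated assumptions, not the pseudo-code of Figure 1(a): the names `extended_top_sort` and `try_answer` are hypothetical, the DFS is written iteratively to avoid recursion limits, and the query tests are derived directly from the definitions of the high and max indices given in the text.

```python
def extended_top_sort(adj, starts=None):
    """Return (tau, high, mx) for a DAG given as adjacency lists, in O(n + m) time."""
    n = len(adj)
    tau, high, mx = [-1] * n, [-1] * n, [-1] * n
    visited = [False] * n
    next_index = n - 1  # indices are handed out from back to front
    roots = list(starts or []) + list(range(n))  # range(n) guarantees full coverage
    for root in roots:
        if visited[root]:
            continue
        visited[root] = True
        high[root] = mx[root] = next_index  # front of the ordering at visit time
        stack = [(root, iter(adj[root]))]
        while stack:
            v, neighbors = stack[-1]
            descended = False
            for w in neighbors:
                if not visited[w]:
                    visited[w] = True
                    high[w] = mx[w] = next_index
                    stack.append((w, iter(adj[w])))
                    descended = True
                    break
                # w was finished earlier (no back edges in a DAG): inherit its max index
                mx[v] = max(mx[v], mx[w])
            if descended:
                continue
            tau[v] = next_index  # "prepend" v to the ordering
            next_index -= 1
            stack.pop()
            if stack:
                parent = stack[-1][0]
                mx[parent] = max(mx[parent], mx[v])
    return tau, high, mx


def try_answer(tau, high, mx, s, t):
    """Return True/False if the indices decide whether s reaches t, else None."""
    if tau[t] < tau[s]:
        return False  # t is ordered before s
    if tau[t] <= high[s]:
        return True   # t was prepended while s was being processed
    if tau[t] > mx[s]:
        return False  # t lies beyond every vertex s can reach
    if tau[t] == mx[s]:
        return True   # t is the reachable vertex with the largest index
    return None
```

On the diamond-shaped DAG with edges 0→1, 0→2, 1→3, 2→3, the sketch decides all queries from vertex 0 positively via the high index, whereas the query from vertex 2 to vertex 1 remains undecided and would require a fallback.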
Note that neither max nor high indices yield an ordering of V: Every vertex that is visited recursively starting from v and before vertex x with \( \tau (x) = \tau _H(v) \), inclusively, has the same high index as v, and the high index of each vertex in a graph consisting of a single path, e.g., would be \( n-1 \). In particular, neither max nor high indices form a DFS numbering, and they also differ in definition and use from the DFS finishing times \( \hat{\phi } \) used in
If ExtendedTopSort is run on the reverse graph, it yields a topological ordering \( \tau ^{\prime } \) and high and max indices \( \tau ^{\prime }_H \) and \( \tau ^{\prime }_{X} \), such that reversing \( \tau ^{\prime } \) yields again a topological ordering \( \tau \) of the original graph. Furthermore, \( \tau _L(v) := n - 1 - \tau ^{\prime }_H(v) \) is a low index for each vertex v, which denotes the smallest index of a vertex in \( \tau \) that can certainly reach v, i.e., the out-reachability of v is replaced by in-reachability. Analogously, \( \tau _{N}(v) := n - 1 - \tau ^{\prime }_{X}(v) \) is a min index in \( \tau \) and no vertex u with \( \tau (u) \lt \tau _{N}(v) \) can reach v.
The following observations show how such an extended topological ordering \( \tau \) can be used to answer both positive and negative reachability queries:
Recall that by definition, \( \tau (s) \le \tau _H(s) \le \tau _{X}(s) \) and \( \tau _{N}(t) \le \tau _L(t) \le \tau (t) \). Figure 1(b) depicts three examples of extended topological orderings. In contrast to negative queries, not every extended topological ordering is equally effective in answering positive queries, and it can be arbitrarily bad, as shown in the extremes on the left (worst) and at the center (best) of Figure 1(b):
Let \( \mathcal {P}(\tau) \subseteq \mathcal {P} \) be the set of positive queries an extended topological ordering \( \tau \) can answer and let \( \rho ^{{+}}(\tau) = |\mathcal {P}(\tau)| / |\mathcal {P}| \) be the answerable positive query ratio. Then, \( 0 \le \rho ^{{+}}(\tau) \le 1 \).
Instead, the effectiveness of an extended topological ordering depends positively on the size of the ranges \( \left[\tau (v), \tau _H(v)\right] \) and \( \left[\tau _L(v), \tau (v)\right] \), and negatively on \( \left[\tau _H(v), \tau _{X}(v)\right] \) and \( \left[\tau _{N}(v), \tau _L(v)\right] \) which in turn depend on the recursion depths during construction and the order of recursive calls. The former two can be maximized if the first, non-recursive call to Visit() in line 4 in ExtendedTopSort always has a source as its argument, i.e., if the algorithm’s parameter S corresponds to the set of all sources. Clearly, this still guarantees that every vertex is visited.
In addition to the forward and backward topological levels,
4.2 Supportive Vertices
We now show how to apply and improve the idea of supportive vertices in the static setting. A vertex v is supportive if the sets of vertices that v can reach and that can reach v, \( {R^{+}}(v) \) and \( {R^{-}}(v) \), respectively, have been precomputed and membership queries can be performed in sublinear time. We can then answer reachability queries using the following simple observations [11]:
To apply these observations, our algorithm selects a set of k supportive vertices during the initialization phase. In contrast to the original use scenario in the dynamic setting, where the graph changes over time and it is difficult to choose “good” supportive vertices that can help to answer many queries, the static setting leaves room for further optimizations here: With respect to Observation (S1), we consider a supportive vertex v “good” if \( |{R^{+}}(v)| \cdot |{R^{-}}(v)| \) is large, as this maximizes the probability that \( s \in {R^{-}}(v) \wedge t \in {R^{+}}(v) \). With respect to Observations (S2) and (S3), we expect a “good” supportive vertex to have out- or in-reachability sets, respectively, of size close to \( \frac{n}{2} \), i.e., when \( |{R^{+}}(v)|\cdot |V \setminus {R^{+}}(v)| \) or \( |{R^{-}}(v)|\cdot |V \setminus {R^{-}}(v)| \), respectively, are maximal. Furthermore, to increase total coverage and avoid redundancy, the set of queries Query(\( s, t \)) covered by two different supportive vertices should ideally overlap as little as possible.
We remark that this is a general-purpose approach that has been shown to work well across different types of instances, albeit possibly at the expense of an increased initialization time. It seems natural that more specialized routines for different graph classes can improve both running time and coverage.
4.3 The Complete Algorithm
Given a graph G and a sequence of queries Q, we summarize in the following how
Steps 1 and 2 run in linear time. As shown in Sections 4.1 and 4.2, the same applies to Steps 3 and 4, assuming that all parameters are constants. The required space is linear for all steps. The reachability index consists of the following information for each vertex v: one integer for the WCC, one integer each for \( \mathcal {F}(v) \) and \( \mathcal {B}(v) \), three integers for each of the d extended topological orderings \( \tau \) (\( \tau (v), \tau _H(v)/\tau _L(v), \tau _{X}(v)/\tau _{N}(v) \)), and two bits for each of the k supportive vertices, indicating its reachability to/from v. For graphs with \( n \le 2^{32} \), \( 4 \,{\text{Byte}} \) per integer suffice. Furthermore, we group the bits encoding the reachabilities to and from the supportive vertices, respectively, and represent them each by one suitably sized integer, e.g., using
For each query Query(\( s, t \)),
Test 1:
Test 2:
Test 3: k supportive vertices, positive (S1).
Test 4:
Test 5:
Test 6: remaining \( d-1 \) topological orderings (B4), (T1)/(T4), (T2)/(T5), (T3)/(T6).
Test 7: different WCCs (B2).
Observe that the tests for Observations (S1)–(S3) can each be implemented easily using boolean logic, which allows for a concurrent test of all supports whose reachability information is encoded in one accordingly-sized integer: For Observation (S1), it suffices to test whether \( r^-(s) \wedge r^+(t) \gt 0 \), and \( r^+(s) \wedge \lnot r^+(t) \gt 0 \) and \( \lnot r^-(s) \wedge r^-(t) \gt 0 \) for Observations (S2) and (S3), where \( r^+ \) and \( r^- \) hold the respective forward and backward reachability information in the same order for all supports. Each test hence requires at most one comparison of two integers plus at most two elementary bit operations. Also, note that Observation (B1) is implicitly tested by Observations (B5) and (B6). Using the data structure described above, our algorithm requires at most one memory transfer for s and one for t for each Query(\( s, t \)) that is answerable by one of the observations. Note that there are more observations that allow us to identify a negative query than a positive query, which is why we expect a more pronounced speedup for the former. However, as stated in Theorem 4.1, the reachability in DAGs is always at most 50%, which justifies a bias towards an optimization for negative queries.
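The bit-parallel tests for the supportive vertices can be illustrated as follows. This is a sketch under assumptions: the helper names are hypothetical, the supportive vertices are passed in explicitly rather than selected by the procedure of Section 4.2, and Python integers stand in for the fixed-width integers of the index. Bit i of `r_plus[x]` is set if the i-th supportive vertex reaches x, and bit i of `r_minus[x]` is set if x reaches the i-th supportive vertex.

```python
from collections import deque


def reach_set(adj, v):
    """All vertices reachable from v (including v itself) via BFS."""
    seen = {v}
    queue = deque([v])
    while queue:
        u = queue.popleft()
        for w in adj[u]:
            if w not in seen:
                seen.add(w)
                queue.append(w)
    return seen


def build_support_bits(adj, supports):
    """Pack out- and in-reachability of all supportive vertices into bit vectors."""
    n = len(adj)
    radj = [[] for _ in range(n)]
    for u in range(n):
        for w in adj[u]:
            radj[w].append(u)
    r_plus, r_minus = [0] * n, [0] * n
    for i, v in enumerate(supports):
        for x in reach_set(adj, v):   # x in R+(v)
            r_plus[x] |= 1 << i
        for x in reach_set(radj, v):  # x in R-(v)
            r_minus[x] |= 1 << i
    return r_plus, r_minus


def support_test(r_plus, r_minus, k, s, t):
    """Return True/False if some supportive vertex decides Query(s, t), else None."""
    mask = (1 << k) - 1  # restrict negations to the k meaningful bits
    if r_minus[s] & r_plus[t]:
        return True   # (S1): s reaches some support v, and v reaches t
    if r_plus[s] & ~r_plus[t] & mask:
        return False  # (S2): some support reaches s but not t
    if ~r_minus[s] & r_minus[t] & mask:
        return False  # (S3): t reaches some support that s cannot reach
    return None
```

The explicit `mask` is only needed because Python integers are unbounded; with fixed-width machine words that hold exactly k bits, the negation alone suffices.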
If the query cannot be answered using any of these tests, we instead fall back to either another algorithm or a bidirectional BFS with pruning, which uses these tests for each newly encountered vertex v in a subquery Query(\( v, t \)) (forward step) or Query(\( s, v \)) (backward step). If a subquery can be answered decisively positive by a test, the bidirectional BFS can immediately answer Query(\( s, t \)) positively. Otherwise, if a subquery is answered decisively negative by a test, the encountered vertex v is no longer considered (pruning step). If the subquery could not be answered by a test, the vertex v is added to the queue as in a regular (bidirectional) BFS.
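A minimal sketch of such a pruned bidirectional BFS is given below. It assumes a hypothetical callback `test(u, v)` that bundles the constant-time tests and returns `True`, `False`, or `None` (undecided), and it assumes that this callback never gives a wrong answer; the strict alternation between one forward and one backward step is a simplification chosen for this example.

```python
from collections import deque


def reverse(adj):
    """Adjacency lists of the reverse graph."""
    radj = [[] for _ in adj]
    for u, targets in enumerate(adj):
        for v in targets:
            radj[v].append(u)
    return radj


def pruned_bibfs(adj, radj, s, t, test):
    """Decide whether s reaches t, pruning the search with test(u, v)."""
    if s == t:
        return True
    answer = test(s, t)
    if answer is not None:
        return answer
    seen_f, seen_b = {s}, {t}
    queue_f, queue_b = deque([s]), deque([t])
    while queue_f and queue_b:
        # forward step: expand one vertex reachable from s
        u = queue_f.popleft()
        for v in adj[u]:
            if v in seen_b:
                return True  # the two searches met
            if v in seen_f:
                continue
            seen_f.add(v)
            answer = test(v, t)  # subquery Query(v, t)
            if answer is True:
                return True
            if answer is None:
                queue_f.append(v)
            # answer is False: v cannot reach t and is pruned
        # backward step: expand one vertex that reaches t
        u = queue_b.popleft()
        for v in radj[u]:
            if v in seen_f:
                return True
            if v in seen_b:
                continue
            seen_b.add(v)
            answer = test(s, v)  # subquery Query(s, v)
            if answer is True:
                return True
            if answer is None:
                queue_b.append(v)
    return False  # one side is exhausted without meeting the other
```

Passing a callback that always returns `None` turns the sketch into a plain bidirectional BFS, which makes it easy to compare the number of expanded vertices with and without pruning.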
5 EXPERIMENTAL EVALUATION
We evaluated our new algorithm
5.1 Setup and Methodology
We implemented
To counteract artifacts of measurement and accuracy, we ran each algorithm five times on each instance and in general use the median for the evaluation. As
Instances. To facilitate comparability, we adopt the instances used in the articles introducing
Kronecker: These instances were generated by the RMAT generator for the Graph500 benchmark [23] and oriented acyclically from smaller to larger node ID. The name encodes the number of vertices \( 2^i \) as kron_logni.
Random: Graphs generated according to the Erdős–Rényi model \( G(n, m) \) and oriented acyclically from smaller to larger node ID. The name encodes \( n=2^i \) and \( m=2^j \) as randni-j.
Delaunay: Delaunay graphs from the 10th DIMACS Challenge [1, 8]. delaunay_ni is a Delaunay triangulation of \( 2^{i} \) random points in the unit square.
Large real: Introduced in [39], these instances represent citation networks (citeseer.scc, citeseerx, cit-Patents), a taxonomy graph (go-uniprot), as well as excerpts from the RDF graph of a protein database (uniprotm22, uniprotm100, uniprotm150).
Small real dense: Among these instances, introduced in [17], are again citation networks (arXiv, pubmed_sub, citeseer_sub), a taxonomy graph (go_sub), as well as one obtained from a semantic knowledge database (yago_sub).
Small real sparse: These instances were introduced in [18] and represent XML documents (xmark, nasa), metabolic networks (amaze, kegg) or originate from pathway and genome databases (all others).
Queries. Following the methodology of [22], we generated three sets of 100,000 queries each: positive, negative, and random. Each set consists of random queries, which were generated by picking two vertices uniformly at random and filtering out negative or positive queries for the positive and negative query sets, respectively. The fourth query set, mixed, is a randomly shuffled union of all queries from positive and negative and hence contains 200,000 pairs of vertices. As the order of the queries within each set had an observable effect on the running time due to caching effects and memory layout, we randomly shuffled every query set five times and used a different permutation for each repetition of an experiment to ensure equal conditions for all algorithms.
5.2 Experimental Results
We ran
Average query times. Table A.5 lists the average time per query for the query sets negative and positive. All missing values are due to a memory requirement of more than \( 32 \,\mathrm{G}{\rm B} \) (
Our results by and large confirm the performance comparison of
Notably, \( \texttt {Matrix}{} \) was outperformed quite often, especially for queries in the set negative, which correlates with the fact that a large portion of these queries could be answered by constant-time observations (see also the detailed analysis of observation effectiveness below) and is due to its larger memory footprint. Across all instances and seeds, more than \( 95 \,\% \) of all queries in this set could be answered by
There are some instances where
The results on the query sets random and mixed are similar and listed in Tables A.6 and A.8. Once again,
Speedups by
Initialization Times (Table A.10). On all graphs,
Based on the average query time per instance, the minimum number of random queries necessary to amortize the additional investment in initialization time if
Effectiveness of Observations. We collected a vast amount of statistical data to perform an analysis of the effectiveness of the different observations used in
First, we look only at fast queries, i.e., those queries that could be answered without a fallback. For this analysis, we increased the counter for all observations that could answer a query, not just the first in order, which is why there may be overlaps (one query can be answered multiple times). Across all query sets, the most effective observation was the negative basic observation on topological orderings, (B4), which could answer \( 54 \,\% \) of all fast queries. As the average reachability in the random query set is very low, negative queries predominate in the overall picture. It thus does not come as a surprise that the most effective observation is a negative one. On the negative query set, it could answer even \( 84 \,\% \) of all fast queries. The negative observations second to (B4) in effectiveness were those looking at the forward and backward topological levels, Observations (B5) and (B6), which could answer around \( 74.5 \,\% \) each on the negative query set and around \( 47.5 \,\% \) of all fast queries. The observations using the max and min indices of extended topological orderings, (T2) and (T5), could answer \( 26 \,\% \) and \( 19 \,\% \) of the fast queries in the negative query set, and the observations based on supportive vertices, (S2) and (S3), \( 19 \,\% \) and \( 12 \,\% \), respectively.
After lowering the number of topological orderings from \( d= 4 \) to \( d= 2 \), (B4) was as effective as (B5) and (B6), each of which could answer around \( 48 \,\% \) of all fast queries and \( 75 \,\% \) of those in the negative query set. Observe that decreasing d reduces the number of fast queries, which in turn leads to slightly increased ratios for (B5) and (B6). For Observations (T2) and (T5), the effectiveness was reduced to \( 21 \,\% \) and \( 16 \,\% \) on the negative query set, and to \( 13 \,\% \) and \( 10 \,\% \) across all query sets.
The most effective positive observation, and the second-best across all query sets, was the supportive-vertices-based Observation (S1), which could answer around \( 25 \,\% \) of all fast queries and \( 66 \,\% \) of those in the positive query set. Next in effectiveness were the observations using high and low indices, (T1) and (T4), with \( 21 \,\% \) and \( 23 \,\% \) effectiveness for the positive query set, and around \( 7.5 \,\% \) across all query sets. The remaining two, (T3) and (T6), could answer \( 10 \,\% \) and \( 5 \,\% \) in the positive set.
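To make the role of supportive vertices concrete, the following Python sketch shows how a small set of vertices with precomputed forward and backward reachability sets can settle queries in time proportional to the number of supportive vertices. It is an illustration under assumptions, not the article's implementation: the mapping of the three rules to the labels (S1), (S2), and (S3) is assumed, the supportive vertices are passed in by hand rather than selected from p candidates, and all names (`reachable_set`, `build_support_index`, `query_via_support`) are hypothetical.

```python
def reachable_set(start, adjacency):
    """Iterative DFS; returns all vertices reachable from start (inclusive)."""
    seen = {start}
    stack = [start]
    while stack:
        u = stack.pop()
        for v in adjacency[u]:
            if v not in seen:
                seen.add(v)
                stack.append(v)
    return seen


def build_support_index(n, edges, supports):
    """For each supportive vertex v, store (forward set, backward set)."""
    succ = [[] for _ in range(n)]
    pred = [[] for _ in range(n)]
    for u, v in edges:
        succ[u].append(v)
        pred[v].append(u)
    return [(reachable_set(v, succ), reachable_set(v, pred)) for v in supports]


def query_via_support(s, t, index):
    """Returns True/False if some supportive vertex settles the query, else None."""
    for forward, backward in index:
        # Positive rule (assumed (S1)): s reaches v and v reaches t.
        if s in backward and t in forward:
            return True
        # Negative rule (assumed (S2)): v reaches s but not t, so s cannot reach t.
        if s in forward and t not in forward:
            return False
        # Negative rule (assumed (S3)): t reaches v but s does not.
        if t in backward and s not in backward:
            return False
    return None


# Toy DAG (hypothetical instance) with vertex 1 as the only supportive vertex.
index = build_support_index(5, [(0, 1), (1, 2), (1, 3), (4, 1)], supports=[1])
```

With set membership standing in for the bit tests of a real index, each query costs one pass over the k supportive vertices. A result of `None` means that no supportive vertex was conclusive, for example for the query from 2 to 3 on the toy instance.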
Reducing the number of supportive vertices from \( k= 16 \) to \( k= 8 \) slightly lowered the effectiveness of Observation (S1) to around \( 64.5 \,\% \) on the set of positive queries, regardless of whether the number of candidates to choose from was kept equal (\( p= 150 \)) or reduced analogously (\( p= 75 \)). Reducing the number of topological orderings to \( d= 2 \) resulted in a slight deterioration in case of (T1) and (T4) to \( 19 \,\% \) and \( 21 \,\% \), and to \( 5 \,\% \) with respect to the positive query set.
Among all fast queries that could be answered by only one observation, the most effective observation was the positive supportive-vertices-based Observation (S1) with \( 38 \,\% \) for all query sets and \( 65 \,\% \) for the positive query set, followed by the negative basic observation using topological orderings, (B4), with around \( 29 \,\% \) for all query sets and \( 63 \,\% \) for the negative query set.
Looking now at the entire query sets, our statistics show that on the negative set, \( 95 \,\% \) of all queries could be answered via an observation. In \( 70 \,\% \) of all cases, (B5), which is checked in the second test and uses topological forward levels, could already answer the query. In a further \( 16 \,\% \) of all cases, the observation based on topological backward levels, (B6), was successful. On the positive query set, the fallback rate was around \( 29 \,\% \) and hence higher than on the negative query set. \( 52 \,\% \) of all queries in this set could be answered by the supportive-vertices-based observation (S1), and the high and low indices of extended topological orderings, (T1) and (T4), were responsible for another \( 7 \,\% \) and \( 4 \,\% \), respectively. Observe that here, the first observation in the order that can answer a query “wins the point”, i.e., the effectiveness depends on the order and there are no overlaps in the reported effectiveness.
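The two ways of attributing queries to observations used in this analysis, overlapping counts among fast queries versus "first in order wins" over the entire query set, can be contrasted in a short Python sketch. The helper `tally` and the toy observations `A` and `B` are hypothetical and only illustrate the bookkeeping, not the article's instrumentation.

```python
from collections import Counter


def tally(queries, observations):
    """Count observation effectiveness in two ways.

    observations: ordered list of (name, test) pairs, where test(query)
    returns True/False if the observation settles the query and None
    if it is inconclusive.
    Returns (overlapping, first_wins, fallbacks).
    """
    overlapping = Counter()  # every observation that could answer the query
    first_wins = Counter()   # only the first one in the test order
    fallbacks = 0            # queries no observation could answer
    for query in queries:
        hits = [name for name, test in observations if test(query) is not None]
        if not hits:
            fallbacks += 1
            continue
        overlapping.update(hits)
        first_wins[hits[0]] += 1
    return overlapping, first_wins, fallbacks


# Toy stand-ins: "A" settles even query ids, "B" settles multiples of three.
observations = [
    ("A", lambda q: False if q % 2 == 0 else None),
    ("B", lambda q: False if q % 3 == 0 else None),
]
overlapping, first_wins, fallbacks = tally(range(1, 7), observations)
```

In the toy run, the overlapping counts add up to more than the number of fast queries because query 6 is credited to both observations, whereas the first-wins counts plus the fallbacks add up exactly to the number of queries. This is why the percentages of the two analyses are not directly comparable.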
Memory Consumption. Table A.11 lists the memory each algorithm used for its reachability index. As
6 CONCLUSION
In this article, we revisited existing techniques for the static reachability problem and combined them with new approaches to support a large portion of reachability queries in constant time using a linear-sized reachability index. Our extensive experimental evaluation shows that in almost all scenarios, combining any of the existing algorithms with our new techniques implemented in
On average, the fastest algorithm across all instances and types of queries was a combination of
APPENDIX
A TABLES AND FIGURES
Footnotes
1 Otherwise, \( \frac{1}{n} \le \rho \).
2 Source code and instances are available from https://oreach.taa.univie.ac.at.
3 Provided directly by the authors.
4 https://github.com/fiji-flo/preach2014/tree/master/original_code.
5 https://github.com/datourat/IP-label-for-graph-reachability.
6 https://github.com/BoleynSu/bfl.
7 https://code.google.com/archive/p/grail/.
8 The statistics were obtained in a slightly different way in [12].
- [1] 2014. Benchmarking for graph clustering and partitioning. In Proceedings of the Encyclopedia of Social Network Analysis and Mining. Springer.
- [2] 2008. An efficient algorithm for answering graph reachability queries. In Proceedings of the 24th International Conference on Data Engineering. IEEE Computer Society, 893–902.
- [3] 2013. TF-label: A topological-folding labeling scheme for reachability querying in a large graph. In Proceedings of the ACM SIGMOD International Conference on Management of Data. ACM, 193–204.
- [4] 2006. Fast computation of reachability labeling for large graphs. In Proceedings of the Advances in Database Technology - EDBT 2006, 10th International Conference on Extending Database Technology. Lecture Notes in Computer Science, Vol. 3896, Springer, 961–979.
- [5] 2003. Reachability and distance queries via 2-hop labels. SIAM Journal on Computing 32, 5 (2003), 1338–1355.
- [6] 2009. Introduction to Algorithms (3rd ed.). MIT Press, Chapter Elementary Data Structures.
- [7] 1962. Algorithm 97: Shortest path. Communications of the ACM 5, 6 (1962), 345.
- [8] 2018. Communication-free massively distributed graph generation. In Proceedings of the 2018 IEEE International Parallel and Distributed Processing Symposium.
- [9] 2008. Contraction hierarchies: Faster and simpler hierarchical routing in road networks. In Proceedings of the International Workshop on Experimental and Efficient Algorithms. Springer, 319–333.
- [10] 2012. Exact routing in large road networks using contraction hierarchies. Transportation Science 46, 3 (2012), 388–404.
- [11] 2020. Faster fully dynamic transitive closure in practice. In Proceedings of the 18th International Symposium on Experimental Algorithms. Leibniz International Proceedings in Informatics, Vol. 160, Schloss Dagstuhl–Leibniz-Zentrum für Informatik, Dagstuhl, Germany, 14:1–14:14.
- [12] 2021. O’reach: Even faster reachability in large graphs. In Proceedings of the 19th International Symposium on Experimental Algorithms. Schloss Dagstuhl - Leibniz-Zentrum für Informatik, 13:1–13:24.
- [13] 1990. A compression technique to materialize transitive closure. ACM Transactions on Database Systems 15, 4 (1990), 558–598.
- [14] 2012. SCARAB: Scaling reachability computation on large graphs. In Proceedings of the ACM SIGMOD International Conference on Management of Data. ACM, 169–180.
- [15] 2011. Path-tree: An efficient reachability indexing scheme for large directed graphs. ACM Transactions on Database Systems 36, 1 (2011), 7:1–7:44.
- [16] 2013. Simple, fast, and scalable reachability oracle. Proceedings of the VLDB Endowment 6, 14 (2013), 1978–1989.
- [17] 2009. 3-HOP: A high-compression indexing scheme for reachability query. In Proceedings of the 2009 ACM SIGMOD International Conference on Management of Data. Association for Computing Machinery, New York, NY, 813–826.
- [18] 2008. Efficiently answering reachability queries on very large directed graphs. In Proceedings of the ACM SIGMOD International Conference on Management of Data. ACM, 595–608.
- [19] 1962. Topological sorting of large networks. Communications of the ACM 5, 11 (1962), 558–562.
- [20] 2014. Powers of tensors and fast matrix multiplication. In Proceedings of the International Symposium on Symbolic and Algebraic Computation. ACM, 296–303.
- [21] 2014. SNAP Datasets: Stanford Large Network Dataset Collection. Retrieved Feb 1, 2021 from http://snap.stanford.edu/data.
- [22] 2014. PReaCH: A fast lightweight reachability index using pruning and contraction hierarchies. In Proceedings of the European Symposium on Algorithms. Springer, Berlin, 701–712.
- [23] 2010. Introducing the graph 500. Cray Users Group 19 (2010), 45–74.
- [24] 1998. Program analysis via graph reachability. Information and Software Technology 40, 11–12 (1998), 701–726.
- [25] 1995. Precise interprocedural dataflow analysis via graph reachability. In Proceedings of the 22nd ACM SIGPLAN-SIGACT Symposium on Principles of Programming Languages. 49–61.
- [26] 2004. HOPI: An efficient connection index for complex XML document collections. In Proceedings of the Advances in Database Technology - EDBT 2004, 9th International Conference on Extending Database Technology. Lecture Notes in Computer Science, Vol. 2992, Springer, 237–255.
- [27] 2008. User-input dependence analysis via graph reachability. In Proceedings of the 2008 8th IEEE International Working Conference on Source Code Analysis and Manipulation. 25–34.
- [28] 2017. Reachability querying: Can it be even faster? IEEE Transactions on Knowledge and Data Engineering 29, 3 (2017), 683–697.
- [29] 1972. Depth-first search and linear graph algorithms. SIAM Journal on Computing 1, 2 (1972), 146–160.
- [30] 1976. Edge-disjoint spanning trees and depth-first search. Acta Informatica 6, 2 (1976), 171–185.
- [31] 2007. Fast and practical indexing and querying of very large graphs. In Proceedings of the ACM SIGMOD International Conference on Management of Data. ACM, 845–856.
- [32] 2011. A memory efficient reachability data structure through bit vector compression. In Proceedings of the ACM SIGMOD International Conference on Management of Data. ACM, 913–924.
- [33] 2014. Reachability queries in very large graphs: A fast refined online search approach. In Proceedings of the EDBT. 511–522.
- [34] 2006. Dual labeling: Answering graph reachability queries in constant time. In Proceedings of the 22nd International Conference on Data Engineering. IEEE Computer Society, 75.
- [35] 1962. A theorem on boolean matrices. Journal of the ACM 9, 1 (1962), 11–12.
- [36] 2018. Reachability querying: An independent permutation labeling approach. The VLDB Journal 27, 1 (2018), 1–26.
- [37] 2013. Fast and scalable reachability queries on graphs by pruned labeling with landmarks and paths. In Proceedings of the 22nd ACM International Conference on Information and Knowledge Management. ACM, 1601–1606.
- [38] 2010. GRAIL: Scalable reachability index for large graphs. Proceedings of the VLDB Endowment 3, 1–2 (2010), 276–284.
- [39] 2012. GRAIL: A scalable index for reachability queries in very large graphs. The VLDB Journal 21, 4 (2012), 509–534.
- [40] 2010. Graph reachability queries: A survey. In Proceedings of the Managing and Mining Graph Data. Advances in Database Systems, Vol. 40. Springer, 181–215.