Tree edit distance: Robust and memory-efficient
Introduction
Data with hierarchical dependencies are often modelled as trees. Tree data appear in many applications, ranging from hierarchical data formats like JSON or XML to merger trees in astrophysics [33]. An interesting query computes the similarity between two trees. The standard measure for tree similarity is the tree edit distance, which is defined as the cost of the minimum-cost sequence of node edit operations that transforms one tree into another. The tree edit distance has been successfully applied in bioinformatics (e.g., to find similarities between RNA secondary structures [1], [29], neuronal cells [21], or glycan structures [3]), in image analysis [7], pattern recognition [25], melody recognition [19], natural language processing [28], information extraction [12], [23], and document retrieval [22], and has received considerable attention from the database community [5], [8], [9], [10], [11], [16], [17], [18], [26], [27].
The fastest algorithms for the tree edit distance (TED) decompose the input trees into smaller subtrees and use dynamic programming to build the overall solution from the subtree solutions. The key difference between various TED algorithms is the decomposition strategy, which has a major impact on the runtime. Early attempts to compute TED [13], [24], [37] use a hard-coded strategy, which disregards or only partially considers the shape of the input trees. This may lead to very poor strategies and asymptotic runtime differences of up to a polynomial degree. The most recent development is the Robust Tree Edit Distance (RTED) algorithm [30], which operates in two steps (cf. Fig. 1(a)). In the first step, a decomposition strategy is computed. The strategy adapts to the input trees and is shown to be optimal among all previously proposed strategies. The actual distance computation is done in the second step, which executes the strategy.
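The recursive decomposition that these algorithms build on can be sketched as follows. This is a minimal illustration, not the RTED or AP-TED implementation: it assumes unit costs for delete, insert, and rename, always removes the rightmost root (one fixed, hard-coded choice), and relies on plain memoization. The tuple-based tree encoding and the names `size`, `forest_distance`, and `tree_edit_distance` are assumptions made for this example.

```python
from functools import lru_cache

# Assumed encoding: a tree is (label, children), where children is a tuple
# of trees. A forest is a tuple of trees.


def size(forest):
    """Number of nodes in a forest."""
    return sum(1 + size(children) for _, children in forest)


@lru_cache(maxsize=None)
def forest_distance(f, g):
    """Unit-cost edit distance between two ordered forests."""
    if not f and not g:
        return 0
    if not f:
        return size(g)  # insert every node of g
    if not g:
        return size(f)  # delete every node of f
    # Decompose both forests at their rightmost root nodes v and w.
    (v_label, v_children), f_rest = f[-1], f[:-1]
    (w_label, w_children), g_rest = g[-1], g[:-1]
    # Delete v: its children take its place in the forest.
    delete = forest_distance(f_rest + v_children, g) + 1
    # Insert w: symmetric case on the other forest.
    insert = forest_distance(f, g_rest + w_children) + 1
    # Map v to w: match the two subtrees below them and the remaining forests.
    rename = (forest_distance(v_children, w_children)
              + forest_distance(f_rest, g_rest)
              + (v_label != w_label))
    return min(delete, insert, rename)


def tree_edit_distance(t1, t2):
    return forest_distance((t1,), (t2,))
```

The sketch always decomposes at the rightmost root. The same recursion also holds when both forests are decomposed at their leftmost roots, and it is this kind of choice that determines which subproblems have to be computed.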
In terms of runtime, the overhead for the strategy computation in RTED is small compared to the gain due to the better strategy. Unfortunately, this does not hold for the main memory consumption. Fig. 1(b) shows the memory usage for two example trees (perfect binary trees) of 8191 nodes: the strategy computation requires 1.1 GB of RAM, while the execution of the strategy (i.e., the actual distance computation) requires only 0.55 GB. Thus, for large instances, the strategy computation is the bottleneck and the fallback is a hard-coded strategy. This is undesirable since the gain of a good strategy grows with the instance size. Reducing the memory requirements of the strategy computation increases the maximum tree size that can be processed. This is crucial especially for large trees like abstract syntax trees of source code repositories [15], [20] (Emacs: nodes and MythTV: nodes) or merger trees in astrophysics [33].
In this paper we propose the AP-TED algorithm, which solves the memory problem of the strategy computation. This is achieved by computing the strategy bottom-up using dynamic programming and releasing part of the memorization tables early. We prove that our algorithm requires at most 1/3 of the memory that is needed by RTED's strategy computation [30]. As a result, the memory cost of the strategy computation is never above the cost of the distance computation. Our extensive experimental evaluation on various tree shapes, which require very different strategies, confirms our analytic memory bound and shows that our algorithm is often much better than its theoretical upper bound. For some tree shapes, it even runs in linear space, while the RTED strategy algorithm always requires quadratic space.
In addition to reducing the memory usage, AP-TED computes the optimum in a larger class of strategies than RTED. Strategies are expressed by root-leaf paths that guide the decomposition of the input trees. A path decomposes a tree into subtrees by deleting nodes and edges on a root-leaf path. Each resulting subtree is recursively decomposed by a new root-leaf path. RTED computes the optimal LRH strategy. An LRH strategy considers only left, right, and heavy paths. The left (right) root-leaf path connects each parent with its first (last) child; the heavy path connects the parent with the rightmost child that roots the largest subtree. AP-TED considers all root-leaf paths and is not limited to left, right, and heavy paths. Thus, our strategy is at least as good as the strategies used by RTED. To the best of our knowledge, this is the first algorithm to compute the optimal all-path strategy. The runtime complexity of our strategy algorithm is quadratic in the input size, the same as for the RTED strategy algorithm. This result is surprising since in each recursive step we need to consider a linear number of paths compared to only three paths (left, right, and heavy) in the RTED strategy. Our empirical evaluation suggests that in practice our strategy algorithm is even slightly faster than the RTED strategy algorithm since it allocates less memory.
On the distance computation side, we observe that a large number of subproblems that result from the tree decompositions are very small trees with one or two nodes only. We show that a significant boost can be achieved by treating these cases separately. We introduce the AP-TED+ algorithm, which leverages that fact and achieves runtime improvements of more than 50% in some cases.
Summarizing, the contributions of this paper are the following:
- Memory efficiency. We substantially reduce the memory requirements w.r.t. previous strategy computation algorithms by traversing the trees bottom-up and systematically releasing memory early. The resulting AP-TED algorithm always consumes less memory for the strategy computation than for the actual distance computation and thus breaks the bottleneck of previous algorithms. (We show the correctness of our approach and prove an upper bound for the memory usage.)
- Optimal all-path strategy. The decomposition strategy used by AP-TED is optimal in the class of all-path strategies. This class generalizes LRH strategies and contains all strategies of previous TED algorithms. Although our strategy algorithm must consider more paths, it is as efficient as the strategy algorithm in RTED (quadratic in the input size).
- New single-path functions. We develop AP-TED+, which leverages two new single-path functions to compute the distance of subtree pairs when one of the subtrees is small. This case occurs frequently during the decomposition process. Our new single-path functions run in linear time and at most linear space, which substantially improves over the existing single-path functions used in RTED [30]. To take full advantage of the new functions, we integrate them into the strategy computation to obtain better strategies. Our experiments confirm the significant runtime improvement.
The paper is structured as follows. Section 2 sets the stage for our discussion of strategy algorithms. In Section 3 we define the problem, and we present our AP-TED algorithm in Section 4. The memory efficient implementation of the strategy computation in AP-TED is discussed in Section 5. The AP-TED+ algorithm is presented in Section 6. We treat related work in Section 7, experimentally evaluate our solution in Section 8, and conclude in Section 9.
Notation
We follow the notation of [30] when possible. A tree F is a directed, acyclic, connected graph with nodes N(F) and edges $E(F) \subseteq N(F) \times N(F)$, where each node has at most one incoming edge. Each node has a label, which is not necessarily unique within the tree. The nodes of a tree F are strictly and totally ordered such that (a) $v > w$ for any edge $(v, w) \in E(F)$, and (b) for any two nodes $f, g$, if $f < g$ and f is not a descendant of g, then $f < g'$ for all descendants $g'$ of g. The tree traversal that visits all
Problem definition
As outlined in previous sections, the path strategies introduced by Pawlik and Augsten [30] generalize all state-of-the-art algorithms for computing the tree edit distance. They consider the class of LRH strategies and show optimality. However, LRH strategies limit the paths to be left, right, or heavy. We observe that allowing all paths leads to less expensive strategies. Another drawback of the RTED algorithm is the fact that the computation of the optimal strategy requires more space than
AP-TED algorithm
Until now, only LRH strategies have been considered in literature [13], [24], [30], [37]. They are limited to left, right and heavy paths only. LRH strategies are only a fraction of all possible path strategies. There may exist non-LRH path strategies that lead to better solutions. In principle, all possible path strategies must be checked for the best result. In this section we present AP-TED, a new algorithm that computes the tree edit distance with the optimal all-path strategy. The core of
Memory efficiency in AP-TED
The main memory requirement is a bottleneck of the tree edit distance computation. The strategy computation in RTED exceeds the memory needed for executing the strategy. Our AP-TED strategy algorithm reduces the memory usage by at least 2/3 and never uses more memory than the execution of the strategy. We achieve that by decreasing the maximum size of the data structures used for strategy computation.
AP-TED+ algorithm
The RTED algorithm computes the tree edit distance by executing the single-path functions for the subtree pairs resulting from the strategy. We observe that when one of the input trees in a single-path function is small, the distance can be computed more efficiently than with the existing single-path functions. We address two special cases, which are very frequent and have a high impact on the runtime: one- and two-node trees. We present AP-TED+, a new algorithm that improves over previous
Related work
Tree edit distance algorithms. The tree edit distance has a recursive solution, which decomposes the input trees into smaller subtrees and subforests. The best known algorithms are dynamic programming implementations of this recursive solution, where small subproblems are computed first. The first tree edit distance algorithm was proposed by Tai [34]. It runs in $O(n^6)$ time and space where n is the number of tree nodes. The runtime complexity is given by the number of subproblems that must be
Experiments
In this section we experimentally evaluate AP-TED and AP-TED+ and compare them to RTED [30]. Our empirical evaluation on real-world and synthetic data confirms our analytical results: computing the strategy in AP-TED is as efficient as in RTED, but requires significantly less memory. In particular, the strategy computation requires less memory than the actual tree edit distance computation.
Set-up. All algorithms are implemented as single-thread applications in Java 1.7. We run the experiments
Conclusion
In this paper we develop two new algorithms for the tree edit distance: AP-TED and AP-TED+. The strategy computation is a main memory bottleneck of the state-of-the-art solution, RTED [30]. The memory required for the strategy computation can be twice the memory needed for the actual tree edit distance computation. Our AP-TED strategy algorithm reduces the memory by at least 2/3 compared to the strategy computation in RTED and never uses more memory than the distance computation. The
Acknowledgements
This work is partially supported by the SyRA project of the Free University of Bozen-Bolzano, Italy.
References (37)
- et al. A methodology for clustering XML documents by structure. Inf. Syst. (2006)
- et al. Decomposition algorithms for the tree edit distance problem. J. Discret. Algorithms (2005)
- et al. Computing similarity between RNA structures. Theor. Comput. Sci. (2002)
- Tree edit distance problems: algorithms and applications to bioinformatics. IEICE Trans. Inf. Syst. E (2010)
- et al. Approximating tree edit distance through string edit distance. Algorithmica (2010)
- et al. Efficient tree-matching methods for accurate carbohydrate database queries. Genome Inform. (2003)
- et al. Approximating tree edit distance through string edit distance for binary tree codes. Fundam. Inform. (2010)
- et al. Efficient top-k approximate subtree matching in small memory. IEEE Trans. Knowl. Data Eng. (TKDE) (2011)
- et al. The pq-gram distance between ordered labeled trees. ACM Trans. Database Syst. (TODS) (2010)
- J. Bellando, R. Kothari, Region-based modeling and tree edit distance as a basis for gesture recognition, in:...
- An optimal decomposition algorithm for tree edit distance. ACM Trans. Algorithms
- XML stream processing using tree-edit distance embeddings. ACM Trans. Database Syst. (TODS)