Tree edit distance: Robust and memory-efficient

doi:10.1016/j.is.2015.08.004

Information Systems

Volume 56, March 2016, Pages 157-173

https://doi.org/10.1016/j.is.2015.08.004

Highlights

• We address the memory problem of the strategy computation in the RTED algorithm for the tree edit distance.
• We prove an upper bound which guarantees that the strategy computation never uses more memory than the distance computation.
• We compute the optimal strategy in the class of all-path strategies which subsumes the LRH strategies used before.
• We develop new single-path functions which are better in terms of runtime and memory than the previously used functions.

Abstract

Hierarchical data are often modelled as trees. An interesting query identifies pairs of similar trees. The standard approach to tree similarity is the tree edit distance, which has successfully been applied in a wide range of applications. In terms of runtime, the state-of-the-art algorithm for the tree edit distance is RTED, which is guaranteed to be fast independent of the tree shape. Unfortunately, this algorithm requires up to twice the memory of its competitors. The memory is quadratic in the tree size and is a bottleneck for the tree edit distance computation.

In this paper we present a new, memory-efficient algorithm for the tree edit distance, AP-TED (All Path Tree Edit Distance). Our algorithm runs at least as fast as RTED without sacrificing memory efficiency. This is achieved by releasing memory early during the first step of the algorithm, which computes a decomposition strategy for the actual distance computation. We show the correctness of our approach and prove an upper bound for the memory usage. The strategy computed by AP-TED is optimal in the class of all-path strategies, which subsumes the class of LRH strategies used in RTED. We further present the AP-TED⁺ algorithm, which requires less computational effort for very small subtrees and improves the runtime of the distance computation. Our experimental evaluation confirms the low memory requirements and the runtime efficiency of our approach.

Introduction

Data with hierarchical dependencies are often modelled as trees. Tree data appear in many applications, ranging from hierarchical data formats like JSON or XML to merger trees in astrophysics [33]. An interesting query computes the similarity between two trees. The standard measure for tree similarity is the tree edit distance, which is defined as the minimum-cost sequence of node edit operations that transform one tree into another. The tree edit distance has been successfully applied in bioinformatics (e.g., to find similarities between RNA secondary structures [1], [29], neuronal cells [21], or glycan structures [3]), in image analysis [7], pattern recognition [25], melody recognition [19], natural language processing [28], information extraction [12], [23], and document retrieval [22], and has received considerable attention from the database community [5], [8], [9], [10], [11], [16], [17], [18], [26], [27].

The fastest algorithms for the tree edit distance (TED) decompose the input trees into smaller subtrees and use dynamic programming to build the overall solution from the subtree solutions. The key difference between various TED algorithms is the decomposition strategy, which has a major impact on the runtime. Early attempts to compute TED [13], [24], [37] use a hard-coded strategy, which disregards or only partially considers the shape of the input trees. This may lead to very poor strategies and asymptotic runtime differences of up to a polynomial degree. The most recent development is the Robust Tree Edit Distance (RTED) algorithm [30], which operates in two steps (cf. Fig. 1(a)). In the first step, a decomposition strategy is computed. The strategy adapts to the input trees and is shown to be optimal among all previously proposed strategies. The actual distance computation is done in the second step, which executes the strategy.
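The decompose-and-combine idea can be made concrete with a minimal sketch of the classic forest recursion. It is not RTED or AP-TED: it assumes unit costs, represents a tree as a hypothetical `(label, children)` tuple, and uses one fixed decomposition that always removes the rightmost root. Choosing how to decompose at each step is what a strategy governs.

```python
from functools import lru_cache

# A tree is (label, children), where children is a tuple of trees.
# A forest is a tuple of trees. Unit costs are an assumption of this sketch.

def size(forest):
    """Number of nodes in a forest."""
    return sum(1 + size(children) for _, children in forest)

@lru_cache(maxsize=None)
def forest_distance(f, g):
    """Edit distance between two ordered forests (memoized recursion)."""
    if not f and not g:
        return 0
    if not f:
        return size(g)  # insert every node of g
    if not g:
        return size(f)  # delete every node of f
    (f_label, f_children), (g_label, g_children) = f[-1], g[-1]
    # Delete the rightmost root of f: its children take its place.
    delete = forest_distance(f[:-1] + f_children, g) + 1
    # Insert the rightmost root of g: symmetric to deletion.
    insert = forest_distance(f, g[:-1] + g_children) + 1
    # Map the two rightmost roots onto each other and solve the
    # remaining forests and the two child forests independently.
    rename = (forest_distance(f[:-1], g[:-1])
              + forest_distance(f_children, g_children)
              + (0 if f_label == g_label else 1))
    return min(delete, insert, rename)

def tree_edit_distance(t1, t2):
    return forest_distance((t1,), (t2,))

t1 = ("a", (("b", ()), ("c", ())))
t2 = ("a", (("b", ()),))
t3 = ("a", (("x", (("b", ()), ("c", ()))),))
```

Here `tree_edit_distance(t1, t2)` deletes node `c`, and `tree_edit_distance(t1, t3)` inserts node `x` between `a` and its children. The memoization table grows with the number of distinct subforest pairs, which is why the choice of decomposition matters for both runtime and memory.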

In terms of runtime, the overhead for the strategy computation in RTED is small compared to the gain due to the better strategy. Unfortunately, this does not hold for the main memory consumption. Fig. 1(b) shows the memory usage for two example trees (perfect binary trees) of 8191 nodes: the strategy computation requires 1.1 GB of RAM, while the execution of the strategy (i.e., the actual distance computation) requires only 0.55 GB. Thus, for large instances, the strategy computation is the bottleneck and the fallback is a hard-coded strategy. This is undesirable since the gain of a good strategy grows with the instance size. Reducing the memory requirements of the strategy computation increases the maximum tree size that can be processed. This is crucial especially for large trees like abstract syntax trees of source code repositories [15], [20] (Emacs: $> 10k$ nodes, MythTV: $> 50k$ nodes) or merger trees in astrophysics [33].
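A back-of-the-envelope check shows that these figures are in line with quadratic memory. The sketch assumes one 8-byte value per pair of nodes, which is an assumption made here for illustration and is not stated in the text. The helper name `quadratic_table_bytes` is hypothetical.

```python
def quadratic_table_bytes(n_nodes_f, n_nodes_g, bytes_per_cell=8):
    """Size of one table with a cell per node pair.

    bytes_per_cell=8 is an assumption of this sketch, not a figure
    taken from the paper.
    """
    return n_nodes_f * n_nodes_g * bytes_per_cell

n = 8191  # perfect binary tree with 2**13 - 1 nodes
one_table_gb = quadratic_table_bytes(n, n) / 1e9
two_tables_gb = 2 * one_table_gb
print(f"one table: {one_table_gb:.2f} GB, two tables: {two_tables_gb:.2f} GB")
```

Under this assumption, one quadratic table takes roughly 0.54 GB and two take roughly 1.07 GB, which is close to the reported 0.55 GB and 1.1 GB.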

In this paper we propose the AP-TED algorithm, which solves the memory problem of the strategy computation. This is achieved by computing the strategy bottom-up using dynamic programming and releasing part of the memoization tables early. We prove that our algorithm requires at most 1/3 of the memory that is needed by RTED's strategy computation [30]. As a result, the memory cost of the strategy computation is never above the cost of the distance computation. Our extensive experimental evaluation on various tree shapes, which require very different strategies, confirms our analytic memory bound and shows that our algorithm is often much better than its theoretical upper bound. For some tree shapes, it even runs in linear space, while the RTED strategy algorithm always requires quadratic space.
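The effect of releasing memory early in a bottom-up computation can be illustrated with a toy model. This is not the AP-TED strategy algorithm or its data structures. It assumes that each node needs one table row, that a row depends only on the rows of the node's children, and that the children's rows are released as soon as the parent's row is complete. The function name `peak_live_rows` and the dictionary-based tree encoding are hypothetical.

```python
def peak_live_rows(children, root):
    """Peak number of rows alive during a postorder (bottom-up) pass.

    children maps a node to the list of its children. Toy model: one row
    per node, and child rows are freed once the parent row is computed.
    """
    live = set()
    peak = 0
    stack = [(root, False)]
    while stack:
        node, expanded = stack.pop()
        if not expanded:
            # Visit the children first (left to right), then the node itself.
            stack.append((node, True))
            for child in reversed(children.get(node, [])):
                stack.append((child, False))
        else:
            live.add(node)                 # allocate the row of this node
            peak = max(peak, len(live))
            for child in children.get(node, []):
                live.discard(child)        # child rows are no longer needed
    return peak

chain = {0: [1], 1: [2], 2: [3]}               # path-shaped tree, 4 nodes
star = {0: [1, 2, 3, 4]}                       # root with 4 leaf children
binary = {0: [1, 2], 1: [3, 4], 2: [5, 6]}     # perfect binary tree, 7 nodes
```

In this model the chain never holds more than two rows at once, regardless of its length, while the star holds one row per node until the root is processed. This shape dependence is the kind of behaviour that allows a bottom-up computation to stay well below a worst-case bound on some inputs.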

In addition to reducing the memory usage, AP-TED computes the optimum in a larger class of strategies than RTED. Strategies are expressed by root-leaf paths that guide the decomposition of the input trees. A path decomposes a tree into subtrees by deleting nodes and edges on a root-leaf path. Each resulting subtree is recursively decomposed by a new root-leaf path. RTED computes the optimal LRH strategy. An LRH strategy considers only left, right, and heavy paths. The left (right) root-leaf path connects each parent with its first (last) child; the heavy path connects the parent with the rightmost child that roots the largest subtree. AP-TED considers all root-leaf paths and is not limited to left, right, and heavy paths. Thus, our strategy is at least as good as the strategies used by RTED. To the best of our knowledge, this is the first algorithm to compute the optimal all-path strategy. The runtime complexity of our strategy algorithm is $O (n^{2})$ as for the RTED strategy. This result is surprising since in each recursive step we need to consider a linear number of paths compared to only three paths (left, right, and heavy) in the RTED strategy. Our empirical evaluation suggests that in practice our strategy algorithm is even slightly faster than the RTED strategy algorithm since it allocates less memory.
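The path definitions above can be sketched as follows. The tree encoding (a dictionary from a node to its ordered list of children) and all function names are hypothetical choices for this illustration. The sketch only enumerates paths and does not compute a strategy.

```python
def subtree_sizes(children, root):
    """Map every node to the number of nodes in the subtree it roots."""
    sizes = {}
    def visit(node):
        sizes[node] = 1 + sum(visit(c) for c in children.get(node, []))
        return sizes[node]
    visit(root)
    return sizes

def root_leaf_path(children, root, choose):
    """Follow choose(children) from the root until a leaf is reached."""
    path = [root]
    while children.get(path[-1]):
        path.append(choose(children[path[-1]]))
    return path

def lrh_paths(children, root):
    """Left, right, and heavy root-leaf paths of the tree."""
    sizes = subtree_sizes(children, root)
    def heavy(kids):
        # Rightmost child among those rooting the largest subtree.
        largest = max(sizes[k] for k in kids)
        return [k for k in kids if sizes[k] == largest][-1]
    return {
        "left": root_leaf_path(children, root, lambda kids: kids[0]),
        "right": root_leaf_path(children, root, lambda kids: kids[-1]),
        "heavy": root_leaf_path(children, root, heavy),
    }

def all_root_leaf_paths(children, root):
    """Every root-leaf path: exactly one per leaf."""
    paths = []
    def visit(node, prefix):
        prefix = prefix + [node]
        kids = children.get(node, [])
        if not kids:
            paths.append(prefix)
        for kid in kids:
            visit(kid, prefix)
    visit(root, [])
    return paths

tree = {"a": ["b", "c", "d"], "c": ["e", "f"]}
lrh = lrh_paths(tree, "a")
every = all_root_leaf_paths(tree, "a")
```

For this example tree the left path is `a, b`, the right path is `a, d`, and the heavy path is `a, c, f`. The path `a, c, e` is a root-leaf path that is neither left, right, nor heavy, so it is available to an all-path strategy but not to an LRH strategy.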

On the distance computation side, we observe that a large number of subproblems that result from the tree decompositions are very small trees with one or two nodes only. We show that a significant boost can be achieved by treating these cases separately. We introduce the AP-TED⁺ algorithm, which leverages that fact and achieves runtime improvements of more than 50% in some cases.
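To see why very small trees admit a cheaper treatment, consider the one-node case under unit costs. The sketch below is a textbook closed form and is not the single-path functions of AP-TED⁺. Unit costs, the `(label, children)` tuple encoding, and the function names are assumptions of this illustration. A single node can be mapped to at most one node of the other tree, so the distance is obtained in one linear scan.

```python
def labels(tree):
    """Yield the labels of all nodes of a (label, children) tree."""
    label, children = tree
    yield label
    for child in children:
        yield from labels(child)

def single_node_distance(label, tree):
    """Unit-cost edit distance between a one-node tree and `tree`.

    The single node is mapped to one node of `tree` (free if the labels
    match, one rename otherwise) and all other nodes are inserted.
    Deleting the single node and inserting the whole tree is never cheaper.
    """
    all_labels = list(labels(tree))
    rename_cost = 0 if label in all_labels else 1
    return len(all_labels) - 1 + rename_cost

g = ("a", (("b", ()), ("c", ())))
```

With `g` above, `single_node_distance("b", g)` inserts two nodes, and `single_node_distance("x", g)` additionally renames one node. No quadratic table is needed for this case.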

Summarizing, the contributions of this paper are the following:

• Memory efficiency. We substantially reduce the memory requirements w.r.t. previous strategy computation algorithms by traversing the trees bottom-up and systematically releasing memory early. The resulting AP-TED algorithm always consumes less memory for the strategy computation than for the actual distance computation and thus breaks the bottleneck of previous algorithms. (We show the correctness of our approach and prove an upper bound for the memory usage.)
• Optimal all-path strategy. The decomposition strategy used by AP-TED is optimal in the class of all-path strategies. This class generalizes LRH strategies and contains all strategies of previous TED algorithms. Although our strategy algorithm must consider more paths, it is as efficient as the strategy algorithm in RTED (quadratic in the input size).
• New single-path functions. We develop AP-TED⁺, which leverages two new single-path functions to compute the distance of subtree pairs when one of the subtrees is small. This case occurs frequently during the decomposition process. Our new single-path functions run in linear time and at most linear space, which substantially improves over the single-path functions $Δ^{L}$, $Δ^{R}$, and $Δ^{I}$ used in RTED [30]. To take full advantage of the new functions, we integrate them into the strategy computation to obtain better strategies. Our experiments confirm the significant runtime improvement.

The paper is structured as follows. Section 2 sets the stage for our discussion of strategy algorithms. In Section 3 we define the problem, and we present our AP-TED algorithm in Section 4. The memory-efficient implementation of the strategy computation in AP-TED is discussed in Section 5. The AP-TED⁺ algorithm is presented in Section 6. We treat related work in Section 7, experimentally evaluate our solution in Section 8, and conclude in Section 9.

Section snippets

Notation

We follow the notation of [30] when possible. A tree F is a directed, acyclic, connected graph with nodes N(F) and edges $E (F) \subseteq N (F) \times N (F)$ , where each node has at most one incoming edge. Each node has a label, which is not necessarily unique within the tree. The nodes of a tree F are strictly and totally ordered such that (a) $v > w$ for any edge $(v, w) \in E (F)$ , and (b) for any two nodes $f, g$ , if $f < g$ and f is not a descendant of g, then $f < g^{'}$ for all descendants $g^{'}$ of g. The tree traversal that visits all

Problem definition

As outlined in previous sections, the path strategies introduced by Pawlik and Augsten [30] generalize all state-of-the-art algorithms for computing the tree edit distance. They consider the class of LRH strategies and show optimality. However, LRH strategies limit the paths to be left, right, or heavy. We observe that allowing all paths leads to less expensive strategies. Another drawback of the RTED algorithm is the fact that the computation of the optimal strategy requires more space than

AP-TED algorithm

Until now, only LRH strategies have been considered in the literature [13], [24], [30], [37]. They are limited to left, right, and heavy paths. LRH strategies are only a fraction of all possible path strategies. There may exist non-LRH path strategies that lead to better solutions. In principle, all possible path strategies must be checked for the best result. In this section we present AP-TED, a new algorithm that computes the tree edit distance with the optimal all-path strategy. The core of

Memory efficiency in AP-TED

The main memory requirement is a bottleneck of the tree edit distance computation. The strategy computation in RTED exceeds the memory needed for executing the strategy. Our AP-TED strategy algorithm reduces the memory usage by at least 2/3 and never uses more memory than the execution of the strategy. We achieve that by decreasing the maximum size of the data structures used for strategy computation.

AP-TED⁺ algorithm

The RTED algorithm computes the tree edit distance by executing the single-path functions for the subtree pairs resulting from the strategy. We observe that when one of the input trees in a single-path function is small, the distance can be computed more efficiently than with the existing single-path functions. We address two special cases, which are very frequent and have a high impact on the runtime: one- and two-node trees. We present AP-TED⁺, a new algorithm that improves over previous

Related work

Tree edit distance algorithms. The tree edit distance has a recursive solution, which decomposes the input trees into smaller subtrees and subforests. The best known algorithms are dynamic programming implementations of this recursive solution, where small subproblems are computed first. The first tree edit distance algorithm was proposed by Tai [34]. It runs in $O (n^{6})$ time and space where n is the number of tree nodes. The runtime complexity is given by the number of subproblems that must be

Experiments

In this section we experimentally evaluate AP-TED and AP-TED⁺ and compare them to RTED [30]. Our empirical evaluation on real-world and synthetic data confirms our analytical results: computing the strategy in AP-TED is as efficient as in RTED, but requires significantly less memory. In particular, the strategy computation requires less memory than the actual tree edit distance computation.

Set-up. All algorithms are implemented as single-thread applications in Java 1.7. We run the experiments

Conclusion

In this paper we develop two new algorithms for the tree edit distance: AP-TED and AP-TED⁺. The strategy computation is a main memory bottleneck of the state-of-the-art solution, RTED [30]. The memory required for the strategy computation can be twice the memory needed for the actual tree edit distance computation. Our AP-TED strategy algorithm reduces the memory by at least 2/3 compared to the strategy computation in RTED and never uses more memory than the distance computation. The

Acknowledgements

This work is partially supported by the SyRA project of the Free University of Bozen-Bolzano, Italy.

References (37)

T. Dalamagas et al., A methodology for clustering XML documents by structure, Inf. Syst. (2006)

S. Dulucq et al., Decomposition algorithms for the tree edit distance problem, J. Discret. Algorithms (2005)

B. Ma et al., Computing similarity between RNA structures, Theor. Comput. Sci. (2002)

T. Akutsu, Tree edit distance problems: algorithms and applications to bioinformatics, IEICE Trans. Inf. Syst. E (2010)

T. Akutsu et al., Approximating tree edit distance through string edit distance, Algorithmica (2010)

K.F. Aoki et al., Efficient tree-matching methods for accurate carbohydrate database queries, Genome Inform. (2003)

T. Aratsu et al., Approximating tree edit distance through string edit distance for binary tree codes, Fundam. Inform. (2010)

N. Augsten et al., Efficient top-k approximate subtree matching in small memory, IEEE Trans. Knowl. Data Eng. (TKDE) (2011)

N. Augsten et al., The pq-gram distance between ordered labeled trees, ACM Trans. Database Syst. (TODS) (2010)

J. Bellando, R. Kothari, Region-based modeling and tree edit distance as a basis for gesture recognition, in:...

S.S. Chawathe, Comparing hierarchical data in external memory, in: International Conference on Very Large Data Bases...

G. Cobéna, S. Abiteboul, A. Marian, Detecting changes in XML documents, in: International Conference on Data...

S. Cohen, Indexing for subtree similarity-search using edit distance, in: ACM SIGMOD International Conference on...

D. de Castro Reis, P.B. Golgher, A.S. da Silva, A.H.F. Laender, Automatic web news extraction using tree edit distance,...

E.D. Demaine et al., An optimal decomposition algorithm for tree edit distance, ACM Trans. Algorithms (2009)

J.-R. Falleri, F. Morandat, X. Blanc, M. Martinez, M. Montperrus, Fine-grained and accurate source code differencing,...

J.P. Finis, M. Raiber, N. Augsten, R. Brunel, A. Kemper, F. Färber, Rws-diff: flexible and efficient change detection...

M. Garofalakis et al., XML stream processing using tree-edit distance embeddings, ACM Trans. Database Syst. (TODS) (2005)

