
1 Introduction

1.1 Word Equations

The word equation problem, i.e. solving equations in the algebra of words, was first investigated by Markov in the fifties. In this problem we are given as input an equation of the form

$$\begin{aligned} u = v \end{aligned}$$

where u and v are strings of letters (from a fixed alphabet) as well as variables, and a solution is a substitution of words for variables that turns this formal equation into a true equality of strings of letters (over the same fixed alphabet). It is relatively easy to reduce this problem to Hilbert's 10th problem, i.e. the question of solving systems of Diophantine equations. Already then it was generally accepted that Hilbert's 10th problem is undecidable, and Markov wanted to show this by proving the undecidability of word equations.

Alas, while Hilbert's 10th problem is undecidable, the word equation problem is decidable, which was shown by Makanin [54]. The termination proof of his algorithm is very complex and yields a relatively weak bound on the computational complexity, thus over the years several improvements and simplifications of the original algorithm were proposed [27, 29, 43, 79]. Simplifications have many potential advantages: it seems natural that a simpler algorithm can be generalised or extended more easily (for instance, to the case of equations in groups) than a complex one. Moreover, a simpler algorithm should be more effective in practical applications and should have lower complexity bounds.

Subcases. It is easy to show NP-hardness for word equations; so far no better lower bound on the computational complexity is known. Such hardness stimulated a search for restricted subclasses of the problem for which efficient (i.e. polynomial) algorithms can be given [4]. One such subclass is defined by restricting the number of different variables that can be used in an equation: it is known that equations with one [11, 45] and two [4, 10, 28] variables can be solved in polynomial time. Already for three variables it is not known whether they are in NP or not [71], and partial results require nontrivial analysis [71].

Generalisations. Since Makanin's original solution much effort was put into extending his algorithm to other structures. Several directions seemed most natural:

  • adding constraints to word equations;

  • equations in free groups;

  • partial commutation;

  • equations in terms.

  • Constraints. From the application point of view, it is advantageous to consider word equations that can also use some additional constraints, i.e. we require that the word substituted for a variable X has some additional properties. This was first done for regular constraints [79]; on the other hand, for several types of constraints, for instance length constraints, it is still open whether the resulting problem is decidable (it becomes undecidable if we allow counting occurrences of a particular letter in the substitutions and arithmetic operations on such counts [3]).

  • Free groups. From the algebraic point of view, the word equation problem is solving equations in a free semigroup. It is natural to try to extend an algorithm from the free semigroup also to the case of free groups and then perhaps even to a larger class of groups (observe that there are groups and semigroups for which the word problem is undecidable). The first algorithm for the group case was given by Makanin [55, 56]; his algorithm is not primitive recursive [44]. Furthermore, Razborov showed that this algorithm can be used to give a description of all solutions of an equation [68] (a more readable description of Razborov's construction is available in [41]). As a final comment, note that such a description was the first step in proving Tarski's Conjecture for free groups (that the theory of free groups is decidable) [42].

  • Partial commutation. Another natural generalisation is to allow partial commutation between the letters, i.e. for each pair of letters we specify whether \(ab = ba\) or not. Such partially commutative words are usually called traces and the corresponding groups are usually known as Right-Angled Artin Groups, RAAGs for short. Decidability for trace equations was shown by Matiyasevich [57] and for RAAGs by Diekert and Muscholl [15]. In both cases, the main step in the proof was a reduction from the partially commutative case to a non-commutative one.

  • Terms. We can view words as very simple terms: each letter is a function symbol of arity 1. In this way word equations are equations over (very simple) terms. It is known that term unification can be decided in polynomial time, assuming that variables represent closed (full) terms [69]; thus such a problem is unlikely to generalise word equations. A natural generalisation of term unification and word equations is second-order unification, in which we allow variables to represent functions that take arguments (which need to be closed terms). However, it is known that this problem is undecidable, even in many restricted subcases [16, 26, 47, 49]. Context unification [7, 8, 74] is a natural problem ‘in between’: we allow variables representing functions, but we insist that they use their argument exactly once. It is easy to show that the problem defined in this way generalises word equations; on the other hand, the undecidability proofs for second-order unification do not transfer directly to this model. Being a natural generalisation is not enough to explain the interest in this problem; more importantly, context unification has natural connections with other, well-studied problems (equality up to constraints [61], linear second-order unification [47, 50], one-step term rewriting [62], bounded second order unification [76], ...). Unfortunately, for over two decades the question of decidability of context unification remained open; despite intensive research, only results for some restricted subcases were known [8, 19, 47, 48, 51, 75, 77, 78].

1.2 Compression and Word Equations

For more than 20 years since Makanin's original solution there was very little progress in algorithms for word equations: the algorithm was improved in many places, in particular this led to better estimations of the running time; however, the main idea (and the general complexity of the proof) remained essentially the same.

The breakthrough was made by Plandowski and Rytter [67], who, for the first time, used compression to solve word equations. They showed that the shortest solution (of size N) of a word equation (of size n) has an SLP representation of size \(\mathsf{{poly}}(n, \log N)\); here a Straight-Line Programme (SLP for short) is simply a context-free grammar generating exactly one word. Using the algorithm for testing the equality of two SLPs [63] this easily yields a (non-deterministic) algorithm running in time \(\mathsf{{poly}}(n, \log N)\). Unfortunately, this work did not provide any bound on N, and the only known bound (four times exponential in n) came directly from Makanin's algorithm; together those two results yielded a 3NEXPTIME algorithm. Soon afterwards the bound on the size of the shortest solution was improved to triply exponential [27], which immediately yielded a 2NEXPTIME algorithm; moreover, the same paper [27] improved Makanin's algorithm so that it worked in EXPSPACE.

Next, Plandowski gave a better (doubly exponential) bound on the size of the shortest solution [64] and thus obtained a NEXPTIME algorithm, which at that time was the best known algorithm for this problem. The proof was based on novel factorisations of words. By better exploiting the interplay between factorisations and compression, he improved the algorithm so that it worked in PSPACE [65].

It is worth mentioning that the solution proposed by Plandowski is essentially different from the one given by Makanin. In particular, it allowed for easier generalisations: Diekert, Gutiérrez and Hagenah [13] showed that Plandowski's algorithm can be extended to the case in which we allow regular constraints in the equation (i.e. we require that the word substituted for X is from a regular language, whose description by a finite automaton is part of the input) and inversion; such an extended algorithm still works in polynomial space. It is easy to show that solving equations in free groups reduces to the above-mentioned problem of word equations with regular constraints and inversion [13] (it is worth mentioning that in general we do not know whether solving equations in free groups is easier or harder than solving the ones in a free semigroup).

On the other hand, Plandowski showed that his algorithm can be used to generate a finite representation of all solutions of a word equation [66], which allows solving several decision problems concerning the set of all solutions (finiteness, boundedness, boundedness of the exponent of periodicity, etc.). It is not known whether this algorithm can be generalised so that it generates all solutions also in the case of regular constraints and inversion (or in a free group).

The new, simpler algorithm for word equations and the demonstrated connection between compression and word equations gave new hope for solving the context unification problem. The first results were very promising: by using ‘tree’ equivalents of SLPs [2] the computational complexity of some problems related to context unification was established [9, 19, 48]. Unfortunately, this approach failed to fully generalise Plandowski's algorithm for words: the equivalents of the factorisations used in that algorithm were not found for trees.

It is worth mentioning that Plandowski and Rytter's approach, in which we compress a solution using SLPs (or, in the non-deterministic case, guess the compressed representation of the solution) and then perform the computation directly on the SLP-compressed representations using known algorithms that work in polynomial time, turned out to be extremely fruitful in many branches of computer science. The recent survey by Lohrey gives several such successful applications [53].

1.3 Recompression

Recompression was developed for a specific problem concerning compressed data (the fully compressed membership problem for finite automata [30]) and was later successfully applied to word equations [36] and other problems related to compressed representations. The usual approach for word equations (and compressed data in general) is that one tries to extract information about the combinatorics of the underlying words from the equation (compressed representation) and use this structure to solve the problem at hand. This is somewhat natural: if the word can be represented compactly (be it as a solution of a word equation or using some compression mechanism) then it should have a lot of internal structure.

Recompression takes a different approach: our aim is to perform simple compression operations on the solution word of the word equation directly on the compressed representation. We need to modify the equation a bit in order to do that; however, the choice of the compression operation and the analysis focus on the compressed representation and its properties and (almost) completely ignore the properties of the solution. The idea of performing the compression operation is somewhat natural in view of the already mentioned result of Plandowski and Rytter [67] that the (length-minimal) solution has a small SLP: since such an SLP exists, we can try to build it bottom-up, i.e. the SLP has a rule \(a \rightarrow bc\) and so we will replace each bc in the solution by a. (There are some complications in the case of \(b=c\), as then the compression is ambiguous: we solve this by replacing maximal repetitions of the letter b instead of replacing bb.)

Of course, performing such a compression on the equation might be difficult or even impossible, and we sometimes need to modify the equation. However, it turns out that a greedy choice suffices to guarantee that the kept equation is of quadratic size. The correctness and size analysis turns out to be surprisingly easy. The method is also very robust, so it can be applied to various scenarios related to word equations: one-variable word equations [35], equations in free groups [14], twisted word equations [12], context unification [31], ... See the following sections for details of some of those results.

1.4 Algorithms for Grammar-Based Compression

Due to the ever-increasing amount of data, compression is widely applied in order to decrease the data's size. Still, the stored data is accessed and processed. Decompressing it on each such occasion basically wastes the gain of reduced storage size. Thus there is a demand for algorithms dealing directly with the compressed data, without explicit decompression. Indeed, efficient algorithms for fundamental text operations (pattern matching, equality testing, etc.) are known for various practically used compression methods (LZ77, LZW, their variants, etc.) [20,21,22,23,24,25, 63].

Note that above, compression can be seen as a source of problems that we want to overcome. However, as demonstrated by Plandowski and Rytter [67], compression can also be seen as a solution to some problems: if we can show that the instance or its solution is (highly) compressible, then we can compress it and, using the algorithms mentioned above, perform the computation on the compressed representation. See the recent survey of Lohrey [53], which gives examples of applications of this approach in various fields, ranging from group theory and computational topology to program verification.

Compression standards differ in the main idea as well as in details. Thus when devising algorithms for compressed data, quite early one needs to focus on the exact compression method to which the algorithm is applied. The most practical (and challenging) choice is one of the widely used standards, like LZW or LZ77. However, a different approach is also pursued: for some applications (and most theory-oriented considerations) it is useful to model one of the practical compression standards by a more mathematically well-founded and ‘clean’ method. The already mentioned Straight-Line Programs (SLPs) are such a clean formulation for many block compression methods: each LZ77-compressed text can be converted into an equivalent SLP of size \(\mathcal {O}( n \log ( N / n))\) and in \(\mathcal {O}(n \log (N/n))\) time [5, 70] (where N is the size of the decompressed text), while each SLP can be converted to an equivalent LZ77-like representation of \(\mathcal {O}(n)\) size in polynomial time. Another reason for the popularity of SLPs is that they usually compress the input text well [46, 60]. Lastly, greedy grammar compression can be efficiently implemented and thus can be used as preprocessing for other compression methods, like those based on the Burrows-Wheeler transform [39].

One can treat an SLP as a system of (very simple) word equations, i.e. a production \(X \rightarrow \alpha \) is rewritten as \(X = \alpha \), and so the recompression algorithm generalises also to this setting. It can then be seen as a variant of locally consistent parsing [1, 58, 72], and indeed those techniques were one of the sources of the recompression approach.

It is no surprise that the highly non-deterministic recompression algorithm determinises when applied to SLPs; what is surprising is that it can be made efficient. In particular, it can be used to check the equality of two SLPs in roughly quadratic time, which is the fastest known algorithm for this problem [33] (and also for a generalisation of this problem, fully compressed pattern matching).

The main drawback of grammar compression is that the size of the smallest grammar cannot even be approximated within a (small enough) constant factor [5, 80]. There are many algorithms that achieve a logarithmic approximation ratio [5, 70, 73]; recompression can also be used to obtain one (in fact: two different ones). One of those algorithms [32] seems to have a slightly better practical behaviour than the other ones, while the second one has a much simpler analysis than other approximation algorithms [34] (as it is essentially a greedy left-to-right scan).

Just as recompression generalizes from word equations to context unification (i.e. term equations), the approximation algorithm based on recompression for strings can be generalized to trees [38], in which case it produces a so-called tree SLP [2]. This was the first approximation algorithm for this problem.

Survey’s Limitations

As this is an informal survey presentation, most of the proofs are only sketched or omitted. Due to space constraints, only some applications and results are explained in detail.

2 Recompression for Word Equations

We begin with a formal definition of the word equation problem. Consider a finite alphabet \(\varSigma \) and a set of variables \(\mathcal X\); during the algorithm \(\varSigma \) will be extended by new letters, but it will always remain finite. A word equation is of the form ‘\(u = v\)’, where \(u, v \in (\varSigma \cup \mathcal X)^*\), and its solution is a homomorphism \(S :(\varSigma \cup \mathcal X)^* \rightarrow \varSigma ^*\) which is constant on \(\varSigma \), that is \(S(a) = a\) for every \(a \in \varSigma \), and satisfies the equation, i.e. the words \(S(u)\) and \(S(v)\) are equal. By n we denote the size of the equation, i.e. \(|u|+|v|\). The algorithm requires only small modifications so that it applies also to systems of equations; to streamline the presentation we will not consider this case.
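To make the definition concrete, the following minimal sketch (in Python) checks whether a given substitution solves a given equation; representing a side of the equation as a list of symbols and marking variables by uppercase strings are conventions of this sketch only.

```python
def apply_substitution(side, subst):
    """Extend a substitution S (dict: variable -> word) to a side of the
    equation, acting as the identity on letters."""
    out = []
    for symbol in side:
        if symbol in subst:            # a variable: substitute its value
            out.extend(subst[symbol])
        else:                          # a letter: keep it
            out.append(symbol)
    return "".join(out)

def is_solution(u, v, subst):
    """Check whether S(u) = S(v), i.e. whether subst solves the equation u = v."""
    return apply_substitution(u, subst) == apply_substitution(v, subst)

# The equation aXb = Xab has, among others, the solution S(X) = a:
u = ["a", "X", "b"]
v = ["X", "a", "b"]
print(is_solution(u, v, {"X": "a"}))   # True
```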

Fix any solution \(S\) of the equation \(u = v\); without loss of generality we can assume that this is the shortest solution, i.e. the one minimising \(|S(u) |\); let N denote the length of the solution, that is \(|S(u) |\). By the earlier work of Plandowski and Rytter [67], we know that \(S(u)\) (and also \(S(X)\) for each variable X) has an SLP (of size \(\mathsf{{poly}}(n, \log N)\)); in fact the same conclusion can be drawn from the later works of Plandowski [64,65,66]. Regardless of the form of \(S\) and the SLP, we know that at least one of the productions in this SLP is of the form \(c \rightarrow ab\), where c is a nonterminal of the SLP while \(a,b \in \varSigma \) are letters. Let us ‘reverse’ this production, i.e. replace in \(S(u)\) all pairs of letters ab by c. It is relatively easy to formalise this operation for words; it is not so clear what should be done in the case of equations, so let us inspect the easier fragment first.


Consider an explicitly given word w. Performing the ‘ab-pair compression’ on it is easy (we replace each pair ab by c), as long as \(a \ne b\): replacing pairs aa is ambiguous, as such pairs can ‘overlap’. Instead, we replace maximal blocks of a letter a: a block \(a^\ell \) is maximal when there is no letter a to the left nor to the right of it (in particular, there may be no adjacent letter at all).

Formally, the operations are defined as follows:

  • ab-pair compression: For a given word w, replace all occurrences of ab in w by a fresh letter c.

  • a-block compression: For a given word w, replace all occurrences of maximal blocks \(a^\ell \), for \(\ell > 1\), in w by fresh letters \(a_\ell \).

We always assume that in the ab-pair compression the letters a and b are different.

Observe that those operations are indeed ‘inverses’ of SLP productions: replacing ab with c corresponds to a production \(c \rightarrow ab\); similarly, replacing \(a^\ell \) with \(a_\ell \) corresponds to a production \(a_\ell \rightarrow a^\ell \).
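Both operations are easy to perform on an explicitly given word; below is a minimal sketch in Python, where fresh letters are simulated by bracketed names (an assumption of the sketch, not of the method).

```python
import re

def pair_compression(w, a, b, c):
    """ab-pair compression: replace every occurrence of ab in w by the fresh
    letter c; we assume a != b, so the occurrences cannot overlap."""
    assert a != b
    return w.replace(a + b, c)

def block_compression(w, a, fresh):
    """a-block compression: replace every maximal block a^l with l > 1 by the
    fresh letter fresh(l)."""
    return re.sub(re.escape(a) + "{2,}",
                  lambda m: fresh(len(m.group(0))), w)

# Reversing the productions c -> ab and a_2 -> aa, a_3 -> aaa:
print(pair_compression("abaab", "a", "b", "c"))                  # 'cac'
print(block_compression("aabaaab", "a", lambda l: f"[a_{l}]"))   # '[a_2]b[a_3]b'
```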


Iterating the pair and block compressions results in a compression of the word w, assuming that we treat the introduced symbols as normal letters. There are several possible ways to implement such an iteration; different results are obtained by altering the order of the compressions, the exact treatment of new letters and so on. Still, essentially each ‘reasonable’ variant works.
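One such iteration is sketched below: in every round it first performs the block compression for all letters and then one pair compression; the list representation and the way fresh symbols are generated are choices of this sketch only.

```python
def compression(word):
    """Iterate block and pair compressions until a single symbol is left."""
    word = list(word)
    fresh_blocks, fresh_pairs, counter = {}, {}, 0

    def fresh(table, key):
        nonlocal counter
        if key not in table:
            counter += 1
            table[key] = f"#{counter}"
        return table[key]

    while len(word) > 1:
        # block compression, for every letter at once (maximal blocks never overlap)
        out, i = [], 0
        while i < len(word):
            j = i
            while j < len(word) and word[j] == word[i]:
                j += 1
            out.append(word[i] if j == i + 1 else fresh(fresh_blocks, (word[i], j - i)))
            i = j
        word = out
        # pair compression of one pair; after the block compression adjacent
        # symbols differ, so the first two symbols always form a valid pair ab
        if len(word) > 1:
            a, b = word[0], word[1]
            c = fresh(fresh_pairs, (a, b))
            out, i = [], 0
            while i < len(word):
                if i + 1 < len(word) and (word[i], word[i + 1]) == (a, b):
                    out.append(c)
                    i += 2
                else:
                    out.append(word[i])
                    i += 1
            word = out
    return word

print(compression("abaababaab"))   # a single fresh symbol remains
```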

Observe that if we compress two words, say \(w_1\) and \(w_2\), in parallel then the resulting words \(w_1'\) and \(w_2'\) are equal if and only if \(w_1\) and \(w_2\) are. This justifies applying the compression operations to both sides of the word equation in parallel; it remains to show how to do that.

Let us fix a solution \(S\) and a pair ab (where \(a \ne b\)); consider how a particular occurrence of ab got into \(S(u)\).

Definition 1

For an equation \(u = v\), a solution \(S\) and a pair ab, an occurrence of ab in \(S(u) \) (or \(S(v)\)) is

  • explicit, if it consists solely of letters coming from u (or v);

  • implicit, if it consists solely of letters coming from a substitution \(S(X)\) for a fixed occurrence of some variable X;

  • crossing, otherwise.

A pair ab is crossing (for a solution \(S\)) if it has at least one crossing occurrence and non-crossing (for a solution \(S\)) otherwise.

We similarly define explicit, implicit and crossing occurrences for blocks of the letter a; a is crossing if at least one of its blocks has a crossing occurrence (in other words: aa is crossing).

Example 1

Equation

$$\begin{aligned} aXa bb abab ab a = Xaa bb Y ab X \end{aligned}$$

has a unique solution \(S(X) = a\), \(S(Y) = abab\), under which sides evaluate to

$$\begin{aligned} aaa bb abab ab a = aaa bb abab ab a. \end{aligned}$$

Pair ba is crossing (as the first letter of \(S(Y) \) is a and Y is preceded by the letter b; moreover, the last letter of \(S(Y) \) is b and Y is succeeded by the letter a), pair ab is non-crossing. The letter b is non-crossing, while the letter a is crossing (as X is preceded by the letter a on the left-hand side of the equation and on the right-hand side of the equation X is succeeded by the letter a).
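Definition 1 can be checked mechanically once a solution is known. The sketch below classifies the occurrences of a pair on one side of the equation (uppercase symbols denote variables, a convention of this sketch); run on the right-hand side of Example 1 it reports crossing occurrences for ba and none for ab.

```python
def classify_pair(side, subst, a, b):
    """Classify the occurrences of the pair ab in S(side) as explicit,
    implicit or crossing, as in Definition 1."""
    letters, origin = [], []               # origin[i]: 'eq' for an explicit
    for pos, symbol in enumerate(side):    # letter, else the index of the
        if symbol in subst:                # variable occurrence it came from
            for ch in subst[symbol]:
                letters.append(ch)
                origin.append(pos)
        else:
            letters.append(symbol)
            origin.append("eq")
    kinds = []
    for i in range(len(letters) - 1):
        if letters[i] == a and letters[i + 1] == b:
            if origin[i] == origin[i + 1] == "eq":
                kinds.append("explicit")
            elif origin[i] == origin[i + 1]:
                kinds.append("implicit")
            else:
                kinds.append("crossing")
    return kinds

# Right-hand side of Example 1 with S(X) = a, S(Y) = abab:
rhs = ["X", "a", "a", "b", "b", "Y", "a", "b", "X"]
S = {"X": "a", "Y": "abab"}
print(classify_pair(rhs, S, "b", "a"))   # ['crossing', 'implicit', 'crossing', 'crossing']
print(classify_pair(rhs, S, "a", "b"))   # ['explicit', 'implicit', 'implicit', 'explicit']
```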


Fix a pair ab and a solution \(S\) of the equation \(u = v\). If ab is non-crossing, performing \(\mathsf{PairComp} \)(ab, \(S(u)\)) is easy: we need to replace every explicit occurrence (which we do directly on the equation) as well as each implicit occurrence, which is done ‘implicitly’, as the solution is not stored nor written anywhere. Due to the similarity we will simply write \(\mathsf{PairComp} \)(ab, ‘\(u = v\)’) when we perform the pair compression on the equation. The argument above shows that if the equation had a solution for which ab is non-crossing then also the obtained equation has a solution. The same applies to the block compression, called \(\mathsf{BlockComp} \)(a, ‘\(u = v\)’) for simplicity. On the other hand, if the obtained equation has a solution, then also the original equation had one (this solution is obtained by replacing each letter c by ab); the argument for the block compression is the same.

Lemma 1

Let the equation \(u = v\) have a solution \(S\) such that ab is non-crossing for \(S\). Then the equation \(u' = v'\) obtained by \(\mathsf{PairComp} \)(ab, ‘\(u = v\)’) is satisfiable. Conversely, if the obtained equation \(u' = v'\) is satisfiable, then so is the original equation \(u = v\). The same applies to \(\mathsf{BlockComp} \)(a, ‘\(u = v\)’).

Unfortunately, Lemma 1 is not enough to simulate Compression(w) directly on the equation: in general there is no guarantee that the pair ab (letter a) is non-crossing; moreover, we do not know which pairs have only implicit occurrences. It turns out that the second problem is trivial: if we restrict ourselves to the shortest solutions then every pair that has an implicit occurrence also has a crossing or explicit one, and a similar statement holds for blocks of letters.

Lemma 2

([67]). Let \(S\) be a shortest solution of an equation ‘\(u = v\)’. Then:

  • If ab is a substring of \(S(u)\), where \(a \ne b\), then a, b have explicit occurrences in the equation and ab has an explicit or crossing occurrence.

  • If \(a^k\) is a maximal block in \(S(u)\) then a has an explicit occurrence in the equation and \(a^k\) has an explicit or crossing occurrence.

The proof is simple: suppose that a pair has only implicit occurrences. Then we could remove these occurrences from the substitutions for the variables, and the obtained solution is shorter, contradicting the minimality assumption. The argument for blocks is a bit more involved, as they can overlap.

Getting back to the crossing pairs (and blocks): if we fix a pair ab (letter a), then it is easy to ‘uncross’ it. By Definition 1 we can conclude that the pair ab is crossing if and only if for some variables X and Y (not necessarily different) one of the following conditions holds (we assume that the solution does not assign the empty word to any variable; otherwise we could simply remove such a variable from the equation):

  • (CP1) aX occurs in the equation and \(S(X)\) begins with b;

  • (CP2) Yb occurs in the equation and \(S(Y)\) ends with a;

  • (CP3) YX occurs in the equation, \(S(X)\) begins with b and \(S(Y)\) ends with a.

In each of these cases the ‘uncrossing’ is natural: in (CP1) we ‘pop’ from X the letter b to the left, in (CP2) we pop a to the right from Y, and in (CP3) we perform both operations. It turns out that in fact we can be even more systematic: we do not have to look at the occurrences of variables, it is enough to consider the first and last letter of \(S(X)\) for each variable X:

  • If \(S(X)\) begins with b then we replace X with bX (changing implicitly the solution \(S(X) = b w\) to \(S '(X) = w\)); if in the new solution \(S '(X) = \epsilon \), i.e. it is empty, then we remove X from the equation;

  • if \(S(X)\) ends with a then we apply a symmetric procedure.

Such an algorithm is called \(\mathsf{Pop} \).
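A sketch of Pop is given below; the non-deterministic choices (the guessed first and last letters of each \(S(X)\), or None when nothing is popped) are passed in explicitly, and the removal of variables that become empty is omitted for brevity.

```python
def pop(u, v, variables, first, last):
    """One run of Pop: every variable X is replaced by first[X] X last[X],
    where first[X] and last[X] are the guessed first/last letters of S(X)
    (None means: do not pop on that side)."""
    def pop_side(side):
        out = []
        for symbol in side:
            if symbol in variables:
                if first.get(symbol):
                    out.append(first[symbol])
                out.append(symbol)
                if last.get(symbol):
                    out.append(last[symbol])
            else:
                out.append(symbol)
        return out
    return pop_side(u), pop_side(v)

# E.g. guessing that S(X) begins with b (and popping nothing to the right):
u2, v2 = pop(["a", "X", "c"], ["c", "X", "a"], {"X"},
             first={"X": "b"}, last={"X": None})
print(u2, v2)   # ['a', 'b', 'X', 'c'] ['c', 'b', 'X', 'a']
```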


It is easy to see that for appropriate non-deterministic choices the obtained equation has a solution for which ab is non-crossing: for instance, if aX occurs in the equation and \(S(X)\) begins with b then we make the corresponding non-deterministic choice, popping b to the left and obtaining abX; the proof, though simple, requires a precise statement of the claim as well as some case analysis.

Lemma 3

If the equation ‘\(u = v\)’ has a solution \(S\) then for an appropriate run of \(\mathsf{Pop} \)(a, b, ‘\(u = v\)’) (i.e. for appropriate non-deterministic choices) the obtained equation \(u' = v'\) has a corresponding solution \(S '\), i.e. \(S(u) = S '(u')\), for which ab is a non-crossing pair. If the obtained equation has a solution then so does the original equation.

Thus, we know how to proceed with a crossing ab-pair compression: we first turn ab into a non-crossing pair (Pop) and then compress it as a non-crossing pair (PairComp).

We would like to perform similar operations for the block compression. For non-crossing blocks we can naturally define a similar algorithm \(\mathsf{BlockComp} \)(a, ‘\(u = v\)’). It remains to show how to ‘uncross’ a letter a. Unfortunately, if aX occurs in the equation and \(S(X)\) begins with a then replacing X with aX is not enough, as the new \(S(X)\) may still begin with a. In such a case we iterate the procedure until the first letter of \(S(X)\) is not a (this includes the case in which we remove the whole variable X). Observe that instead of doing this letter by letter, we can uncross a in one step: it is enough to remove from the variable X the whole a-prefix and a-suffix of \(S(X)\) (if \(w = a^\ell w' a^r\), where \(w'\) neither begins nor ends with a, then the a-prefix of w is \(a^\ell \) and the a-suffix is \(a^r\); if \(w = a^\ell \) then the a-suffix and \(w'\) are empty). Such an algorithm is called \(\mathsf{CutPrefSuff} \).
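A sketch of CutPrefSuff in the same style as Pop above: the guessed a-prefix and a-suffix lengths are the parameters of the run; for simplicity the popped blocks are written out letter by letter, whereas the actual algorithm keeps their lengths in binary and compresses the blocks immediately afterwards.

```python
def cut_pref_suff(u, v, variables, a, pref, suff):
    """One run of CutPrefSuff for the letter a: every variable X is replaced
    by a^pref[X] X a^suff[X], the guessed a-prefix and a-suffix of S(X)
    (removing variables that become empty is again omitted)."""
    def process(side):
        out = []
        for symbol in side:
            if symbol in variables:
                out.extend([a] * pref.get(symbol, 0))
                out.append(symbol)
                out.extend([a] * suff.get(symbol, 0))
            else:
                out.append(symbol)
        return out
    return process(u), process(v)

u2, v2 = cut_pref_suff(["a", "X", "b"], ["X", "a", "b"], {"X"}, "a",
                       pref={"X": 2}, suff={"X": 0})
print(u2, v2)   # ['a', 'a', 'a', 'X', 'b'] ['a', 'a', 'X', 'a', 'b']
```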


Similarly as for Pop, we can show that after an appropriate run of CutPrefSuff the obtained equation has a (corresponding) solution for which a is non-crossing. Unfortunately, there is another problem: we need to write down the lengths \(\ell \) and r of the a-prefixes and a-suffixes. We can write them as binary numbers, in which case they use \(\mathcal {O}(\log \ell + \log r)\) bits of memory; however, in general those can still be arbitrarily large numbers. Fortunately, we can show that in some solution those values are at most exponential (and so their description is of polynomial size). This easily follows from the exponential bound on the exponent of periodicity [43]. For the moment it is enough to know that:

Lemma 4

([43]). In the shortest solution of the equation ‘\(u = v\)’ each a-prefix and a-suffix has at most exponential length (in terms of \(|u|+|v|\)).

Thus in CutPrefSuff we can restrict ourselves to a-prefixes and a-suffixes of at most exponential length.

Lemma 5

Let \(S\) be a shortest solution of ‘\(u = v\)’. For some non-deterministic choices, i.e. after some run of \(\mathsf{CutPrefSuff} \)(a, ‘\(u = v\)’), the obtained equation ‘\(u' = v'\)’ has a corresponding solution \(S '\), such that \(S '(u') = S(u) \), and a is a non-crossing letter for \(S '\); moreover, the explicit a-blocks in ‘\(u' = v'\)’ have at most exponential length. If the obtained equation has a solution then so does the original equation.

After CutPrefSuff we can compress the a-blocks using \(\mathsf{BlockComp} \)(a, ‘\(u' = v'\)’); observe that afterwards long a-blocks are replaced with single letters.

We are now ready to simulate Compression directly on the equation. The question is in which order we should compress pairs and blocks. We make the choice nondeterministically: if there are any non-crossing pairs or letters, we compress them. This is natural, as such a compression decreases both the size of the equation and the size of the length-minimal solution of the equation. If all pairs and letters are crossing, we choose greedily, i.e. the one that leads to the smallest equation (in one step). It is easy to show that such a strategy keeps the equation quadratic; a more involved strategy, in which we compress many pairs/blocks in parallel, leads to an equation of linear length.
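The greedy choice itself is easy to illustrate. The sketch below counts, for every pair and block occurring explicitly in the equation, how many explicit letters its compression removes in one step and returns the best candidate; the letters popped during uncrossing are ignored here, as they add only \(\mathcal {O}(n)\) symbols per phase.

```python
from collections import Counter

def best_compression(u, v, variables):
    """Greedy choice: among all pairs ab (a != b) and letters a occurring
    explicitly in the equation, pick the compression that removes the most
    explicit letters in a single step (ties broken arbitrarily)."""
    gain = Counter()
    for side in (u, v):
        i = 0
        while i < len(side):
            if side[i] in variables:
                i += 1
                continue
            j = i                                  # maximal explicit block
            while j < len(side) and side[j] == side[i]:
                j += 1
            if j - i > 1:
                gain[("block", side[i])] += j - i - 1
            if j < len(side) and side[j] not in variables:
                gain[("pair", side[i], side[j])] += 1
            i = j
    return max(gain, key=gain.get) if gain else None

u = ["a", "a", "X", "b", "b", "a", "b"]
v = ["X", "a", "a", "b", "b", "a", "b"]
print(best_compression(u, v, {"X"}))   # ('pair', 'a', 'b'), removing 3 letters
```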


Call one iteration of the main loop a phase.

The correctness of the algorithm follows from the earlier discussion on the correctness of BlockComp, CutPrefSuff, PairComp and Pop. In particular, the length of the length-minimal solution drops by at least 1 in each iteration, thus the algorithm terminates.

Lemma 6

Algorithm WordEqSAT has \(\mathcal {O}(N)\) phases, where N is the length of the shortest solution of the input equation.

Let us bound the space needed by the algorithm: we claim that for appropriate nondeterministic choices the stored equation has at most \(8n^2\) letters (and n variables). To see this, observe first that each Pop introduces at most 2n letters, one on each side of each occurrence of a variable. The same applies to CutPrefSuff (formally, CutPrefSuff introduces long blocks, but they are immediately replaced with single letters, and so we can think that in fact we introduce only 2n letters). By (CP1)–(CP3) we know that there are at most 2n crossing pairs and crossing letters (as each crossing pair/each crossing letter corresponds to one occurrence of a variable and one ‘side’ of such an occurrence). If the equation has m letters (and at most n occurrences of variables) and there is an occurrence of a non-crossing pair or block then we choose it for compression. Otherwise, there are m letters in the equation and each is covered by at least one pair/block, so for one of the 2n choices at least \(\frac{m}{2n}\) letters are covered, and so at least \(\frac{m}{4n}\) letters are removed by some compression. Thus the new equation has at most

$$\begin{aligned} \underbrace{m}_{\text {previous}} - \underbrace{\frac{m}{4n}}_{\text {removed}} + \underbrace{2n}_{\text {popped}}&= m \left( 1 - \frac{1}{4n}\right) + 2n\\&\le 8n^2\left( 1 - \frac{1}{4n}\right) + 2n\\&= 8n^2 -2n + 2n = 8n^2 \end{aligned}$$

letters, where the inequality follows from the inductive assumption that \(m \le 8n^2\). Turning to the bit-size, each symbol requires at most a logarithmic number of bits, and so

Lemma 7

WordEqSAT runs in \(\mathcal {O}(n^2 \log n)\) (bit) space.

With some effort we can make the above analysis much tighter, see Sect. 3.1.

Theorem 1

([36]). The recompression based algorithm (nondeterministically) decides word equations problem in \(\mathcal {O}(n \log n)\) bit-space; moreover, the stored equation has linear length.

Moreover, with some extra effort one can also remove the logarithmic dependency and show that satisfiability of word equations is in non-deterministic linear space, i.e. the problem is context-sensitive. Surprisingly, it is enough to employ Huffman coding for the equation and run a variant of the algorithm. However, the analysis requires a deeper understanding of how fragments of the equation change during the algorithm and how they depend on one another.

Theorem 2

([37]). A variant of the recompression based algorithm which encodes the equation using Huffman coding (nondeterministically) decides the word equation problem in \(\mathcal {O}(m)\) bit-space, where m is the bit-size of the encoding of the input using any prefix-free code.

Note that we allow some bit-optimization in the size of the input problem.

As a reminder: a PSPACE algorithm for this problem was already known [65]. Its memory consumption is not stated explicitly in that work; however, it is much larger than \(\mathcal {O}(n \log n)\): the stored equations are of length \(\mathcal {O}(n^3)\) and during the transformations the algorithm uses considerably more memory.

3 Extensions of the Algorithm for Word Equations

3.1 \(\mathcal {O}(n \log n)\) Space

In order to improve the space consumption from quadratic to \(\mathcal {O}(n \log n)\) we want to perform several compressions in parallel. To make it more precise, observe that

  • All block compressions (also for different letters) can be performed in parallel, as such blocks do not overlap. Moreover, uncrossing different letters can also be done in parallel: if a is the first letter of \(S(X) \) and b the last, then we pop from X the a-prefix and b-suffix.

  • If \(\varSigma _\ell \) and \(\varSigma _r\) are disjoint, then the pair compressions for ab with \(a \in \varSigma _\ell \) and \(b \in \varSigma _r\) can be done in parallel. Similarly as in the previous case, uncrossing can be done in parallel, by popping first letter if it is from \(\varSigma _r\) and last if it is from \(\varSigma _\ell \).

  • We do not compress all pairs, only those from \(\mathcal {O}(1)\) partitions \(\varSigma _\ell \), \(\varSigma _r\) that cover ‘many’ occurrences of pairs in the equation and in the solution.

The crucial thing is the choice of the partitions. It turns out that choosing a random partition reduces the length of the solution by a constant fraction: consider two consecutive letters ab in \(S(X) \). If \(a = b\) then they will be compressed as part of the maximal block. If \(a \ne b\) then there is a 1/4 chance that \(ab \in \varSigma _\ell \varSigma _r\). Thus, in expectation, the word shortens by one fourth of its length.
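One round of this parallel compression on an explicit word can be sketched as follows: every letter is put into \(\varSigma _\ell \) or \(\varSigma _r\) uniformly at random and all pairs from \(\varSigma _\ell \varSigma _r\) are replaced in a single left-to-right pass (such occurrences cannot overlap, as the two parts are disjoint); the bracketed names for fresh letters are again an artefact of the sketch.

```python
import random

def random_partition_compression(w):
    """Compress, in parallel, all pairs ab with a in the (random) left part
    and b in the right part of the alphabet."""
    left = {a for a in set(w) if random.random() < 0.5}
    fresh, out, i = {}, [], 0
    while i < len(w):
        if i + 1 < len(w) and w[i] in left and w[i + 1] not in left:
            pair = (w[i], w[i + 1])
            fresh.setdefault(pair, f"[{pair[0]}{pair[1]}]")
            out.append(fresh[pair])
            i += 2
        else:
            out.append(w[i])
            i += 1
    return "".join(out)

print(random_partition_compression("abbabaabab"))   # e.g. '[ab]b[ab]a[ab][ab]'
```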

A similar argument also shows that the number of letters in the equation remains linear, when a random partition is chosen. Thus, the equation will be of linear size (though each letter may need \(\mathcal {O}(\log n)\) bits for the encoding).

3.2 Equations with Regular Constraints and Inversion; Equations in Free Groups

As already mentioned, it is natural and important to extend word equations by regular constraints and inversion; in particular this leads to an algorithm for equations in free groups [13] (the reduction between those two problems is fully syntactical and does not depend on the particular algorithm for solving word equations). Note that it is not known whether the algorithm generating a representation of all solutions can also be extended by regular constraints and inversion. Thus the only previously known algorithm for a representation of all solutions of an equation in a free group was due to Razborov [68], and it was based on Makanin's algorithm for equations in free groups.

Adding regular constraints to the recompression based algorithm WordEqSAT is fairly standard: we can encode all constraints using one non-deterministic finite automaton (the constraints for particular variables differ only in the set of accepting states). For each letter c we store its transition function, i.e. a function \(f_c :Q \rightarrow 2^Q\), which says that the automaton in state q after reading the letter c reaches a state in \(f_c(q)\). This function naturally extends to words: \(f_w(q)\) is the set of states that can be reached from q after reading w. Formally, \(f_{wa}(q) = \{ p \, | \, \exists q' \in f_w(q) \text { such that } p \in f_a(q') \} \) for a letter a. If we introduce a new letter c (which replaces a word w) then we naturally define its transition function as \(f_c \leftarrow f_w\). We can express the regular constraints in terms of this function: saying that \(S(X) \) is accepted by the automaton means that \(f_{S(X)}(q_0)\) contains an accepting state. So it is enough to guess a value of \(f_{S(X)}\) which satisfies this condition; in this way we can talk about the value \(f_X\) for a variable X. Popping letters from a variable means that we need to adjust the transition function: when we replace X by aX (writing \(X'\) for the new variable) we have \(f_{X}(q) = \{ p \, | \, \exists q' \in f_a(q) \text { such that } p \in f_{X'}(q') \} \); we similarly adjust \(f_X\) when we pop letters to the right.
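Maintaining the transition functions under recompression amounts to composing them; a small sketch is given below, with \(f_c\) represented as a dictionary from states to sets of states (an implementation choice of the sketch). When a fresh letter replaces the pair ab we compose \(f_a\) with \(f_b\); for a block \(a^\ell \) one composes \(f_a\) with itself \(\ell \) times, e.g. by fast exponentiation, as \(\ell \) may be exponentially large.

```python
def compose(f_w, f_a):
    """Transition function of the concatenation wa: the states reachable by
    first reading w and then the letter a."""
    return {q: {p for q1 in f_w[q] for p in f_a[q1]} for q in f_w}

# A two-state automaton: f_a and f_b for the letters a and b.
f_a = {0: {1}, 1: {1}}
f_b = {0: set(), 1: {0}}
f_ab = compose(f_a, f_b)     # transition function of the fresh letter for ab
print(f_ab)                  # {0: {0}, 1: {0}}
```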

More problems are caused by the inversion: intuitively it corresponds to taking the inverse element in the group, and on the semigroup level it is simulated by requiring that \(\overline{\overline{a}}=a\) for each letter a and \(\overline{a_1a_2\dots a_m}= \overline{a_m} \dots \overline{a_2}\, \overline{a_1}\). This has an impact on the compression: when we compress a pair ab into c, we should also replace \(\overline{ab} = \overline{b}\,\overline{a}\) by the letter \(\overline{c}\). At first sight this looks easy, but it becomes problematic when those two pairs are not disjoint, i.e. when \(\overline{a} = a\) (or \(\overline{b} = b\)); in general we cannot exclude such a case, and when it happens, in a sequence \(ba\overline{b}\) during the compression of the pair ba we would have to simultaneously replace ba and \(a\overline{b}\), which is not possible. Instead, we replace maximal fragments that can be fully covered with the pairs ba and \(a\overline{b}\), in this case the whole triple \(ba\overline{b}\). In the worst case (when \(a = \overline{a}\) and \(b = \overline{b}\)) we need to replace whole sequences of the form \((ab)^n\), which is a common generalisation of both the pair and the block compression.

Theorem 3

([6, 14]). A recompression based algorithm generates in polynomial space the description of all solutions of a word equation in free semigroups with inversion and regular constraints.

3.3 Context Unification

Recall that context unification is a generalisation of word equations to the case of terms (Fig. 2). What type of equations would we like to consider? Clearly we consider terms over a fixed signature (which is usually part of the input), and allow occurrences of constants and variables. If we allow only variables that represent full terms, then the satisfiability of such equations is decidable in polynomial time [69] and so it probably does not generalise word equations (which are NP-hard). This is also easy to observe when we look closer at a word equation: the words substituted for the variables can be concatenated with other words at both ends, i.e. they represent terms with a missing argument.

We arrive at the conclusion that our generalisation should use variables with arguments, i.e. (second-order) variables that take an argument which is a full term and can use it, perhaps several times. Such a definition leads to second-order unification, which is known to be undecidable even in very restricted subcases [16, 26, 47, 49].

Thus we would like to have a subclass of second-order unification that still generalises word equations. In order to do that we put an additional restriction on the solutions: each argument has to be used by the term exactly once. Observe that this still generalises word equations: using the argument exactly once naturally corresponds to concatenation (Fig. 1).

Fig. 1. A context and the same context applied on an argument.

Formally, in the context unification problem [7, 8, 74] we consider an equation \(u = v\) in which we use term variables (representing closed terms), which we denote by letters x, y, as well as context variables (representing terms with one ‘hole’ for the argument, usually called contexts), which we denote by letters X, Y. Syntactically, u and v are terms that use letters from the signature \(\varSigma \) (which is part of the input), term variables and context variables; the former are treated as symbols of arity 0, while the latter as symbols of arity 1. A substitution \(S\) assigns to each term variable a closed term over \(\varSigma \) and to each context variable it assigns a context, i.e. a term over \(\varSigma \cup \{\varOmega \}\) in which the special symbol \(\varOmega \) has arity 0 and is used exactly once (intuitively, it corresponds to the place in which we later substitute the argument). \(S\) is extended to u, v in a natural way; note that for a context variable X the term \(S(X(t))\) is obtained by replacing in \(S(X)\) the unique symbol \(\varOmega \) by \(S(t)\). A solution is a substitution satisfying \(S(u) = S(v) \).

Example 2

Consider a signature \(\{ f, c, c'\}\), where f has arity 2 while \(c,c'\) have arity 0 and consider an equation \(X(c) = Y(c')\), where X and Y are context variables. The equation has a solution \(S(X) = f(\varOmega ,c'), S(Y) = f(c, \varOmega )\) and then \(S(X(c)) = f(c,c') = S(Y(c')) \).
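The example can be verified mechanically; in the small sketch below terms are nested tuples and the hole \(\varOmega \) is a distinguished leaf (both are representation choices of the sketch only).

```python
HOLE = ("Ω",)   # plays the role of the special symbol Ω

def apply_context(context, argument):
    """Replace the (unique) occurrence of Ω in the context by the argument."""
    if context == HOLE:
        return argument
    head, *children = context
    return (head, *[apply_context(ch, argument) for ch in children])

c, c1 = ("c",), ("c'",)
S_X = ("f", HOLE, c1)        # S(X) = f(Ω, c')
S_Y = ("f", c, HOLE)         # S(Y) = f(c, Ω)
print(apply_context(S_X, c) == apply_context(S_Y, c1))   # True: both are f(c, c')
```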

Fig. 2. The term \(f(h(c,c,c), f(c,f(c,c)))\) viewed as a tree; f is of arity 2, h of arity 3 and c of arity 0.

We try to apply the main idea of recompression also in the case of terms: we iterate local compression operations and we guarantee that the term equation is kept of polynomial size. Since several term problems were solved using compression-based methods [9, 17,18,19, 48], there is reasonable hope that our approach may succeed.

Pair and block compression easily generalise to sequences of letters of arity 1 (we can think of them as words); unfortunately, there is no guarantee that a term contains even one such letter. Intuitively, we rather expect that it consists mostly of leaves and symbols of larger arity. This leads us to another local compression operation: leaf compression. Consider a node labelled with f whose i-th child is a leaf. We want to compress f with this child, leaving the other children (and their subtrees) unchanged. Formally, given f of arity at least 1, a position \(1 \le i \le {{\,\mathrm{ar}\,}}(f)\) and a letter c of arity 0, the operation LeafComp(f, i, c, t) (leaf compression) replaces in the term t every subterm of the form \(f(t_1,\ldots ,t_{i-1},c,t_{i+1},\ldots ,t_{{{\,\mathrm{ar}\,}}(f)})\) (where c and the position i are fixed, while the other terms \(t_1,\ldots ,t_{i-1},t_{i+1},\ldots ,t_{{{\,\mathrm{ar}\,}}(f)}\) vary) by \(f'(t_1',\ldots ,t_{i-1}',t_{i+1}',\ldots ,t_{{{\,\mathrm{ar}\,}}(f)}')\), where \(t_1',\ldots ,t_{i-1}',t_{i+1}',\ldots ,t_{{{\,\mathrm{ar}\,}}(f)}'\) are obtained by applying LeafComp recursively to \(t_1,\ldots ,t_{i-1},t_{i+1},\ldots ,t_{{{\,\mathrm{ar}\,}}(f)}\); in other words, we first change the label from f to \(f'\) and then remove the i-th child, which is a leaf labelled with c, and we apply this compression to all occurrences of f and c in parallel.
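A sketch of leaf compression on terms represented as nested tuples (the position i is 0-based here, unlike in the text):

```python
def leaf_compression(term, f, i, c, f_new):
    """LeafComp(f, i, c): wherever a node labelled f has the constant c as its
    i-th child, relabel the node to f_new and remove that child; the
    replacement is applied recursively to all occurrences of f and c."""
    head, *children = term
    children = [leaf_compression(ch, f, i, c, f_new) for ch in children]
    if head == f and i < len(children) and children[i] == (c,):
        children = children[:i] + children[i + 1:]
        head = f_new
    return (head, *children)

# Compressing f with the constant c at its first position in f(c, f(c, h(c))):
t = ("f", ("c",), ("f", ("c",), ("h", ("c",))))
print(leaf_compression(t, "f", 0, "c", "f'"))
# prints the compressed term, i.e. f'(f'(h(c)))
```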

The notion of a crossing pair generalises to this case in a natural way, and the uncrossing either replaces a term variable with a constant or replaces X(t) with \(X(f(x_1,\ldots ,x_i,t,x_{i+1},\ldots , x_\ell ))\). Note that this introduces new term variables.

Now the whole algorithm looks similar to the one for word equations; we simply use the additional compression operation. However, the analysis is much more involved, as the new uncrossing introduces fresh term variables. Still, their number at any point can be bounded linearly, and the polynomial upper bound follows.

Theorem 4

([31]). A recompression based algorithm solves context unification in nondeterministic polynomial space.

4 Recompression and Compressed Data

The recompression technique is (partially) inspired by methods coming from algorithm design [1, 58]. In this section we show that it is able to contribute back to algorithmics: some algorithmic questions for compressed data can be solved using the recompression technique. The obtained solutions are as good as, and sometimes better than, the known ones, which is surprising taking into account the robustness of the method.

4.1 Straight Line Programs and Recompression

Recall that a Straight-Line Programme (SLP) was defined as a context-free grammar in which each nonterminal generates exactly one word. We employ the following naming conventions for SLPs: the nonterminals are ordered (without loss of generality: \(X_1\), \(X_2\), ..., \(X_m\)), each nonterminal has exactly one production, and if \(X_j\) occurs in the production for \(X_i\) then \(j <i\); we will use symbols \(\mathcal A\), \(\mathcal B\), etc. to denote SLPs. The unique word generated by a nonterminal \(X_i\) is denoted by \({{\,\mathrm{val}\,}}(X_i)\), while the whole SLP \(\mathcal A\) defines the word \({{\,\mathrm{val}\,}}(\mathcal A) = {{\,\mathrm{val}\,}}(X_m)\).

We can treat an SLP as a system of word equations (in variables \(X_1, \ldots , X_m\)): a production \(X_i \rightarrow \alpha _i\) corresponds to an equation \(X_i = \alpha _i\); observe that such an equality is meaningful, as \({{\,\mathrm{val}\,}}(X_i) = {{\,\mathrm{val}\,}}(\alpha _i)\) (where \({{\,\mathrm{val}\,}}\) is naturally extended to strings of letters and nonterminals), and moreover this is the unique solution of this system. Thus the recompression technique can be applied to SLPs as well (so far we used recompression only for one equation, but it easily generalises also to a system of equations).

However, there are two issues that need to be solved: non-determinism and efficiency: the recompression for word equations is highly non-deterministic while algorithms for SLPs should, if possible, be deterministic and we usually want them to be efficient, i.e. we want as small polynomial degree as possible.

Let us inspect the sources of non-determinism of the recompression-based approach; it is needed to:

  1. establish whether \({{\,\mathrm{val}\,}}(X_i) = \epsilon \);

  2. establish the first (and last) letter of \({{\,\mathrm{val}\,}}(X_i)\);

  3. establish the length of the a-prefix and a-suffix of \({{\,\mathrm{val}\,}}(X_i)\);

  4. choose the partition to compress.

The first three questions ask about basic properties of the solution and can be easily answered in the case of SLPs, assuming that we already know the answers for all \(X_j\) with \(j <i\): let \(X_i \rightarrow \alpha _i\) be the production; we first remove from \(\alpha _i\) all nonterminals \(X_j\) with \({{\,\mathrm{val}\,}}(X_j) = \epsilon \), and then

  1. \({{\,\mathrm{val}\,}}(X_i) = \epsilon \) if and only if \(\alpha _i = \epsilon \);

  2. the first letter of \({{\,\mathrm{val}\,}}(X_i)\) is the first letter of \(\alpha _i\), or the first letter of \({{\,\mathrm{val}\,}}(X_j)\) if the first symbol of \(\alpha _i\) is \(X_j\);

  3. the length of the a-prefix depends only on the letters a in \(\alpha _i\) and the lengths of the a-prefixes of the nonterminals in \(\alpha _i\).

All those conditions can be verified in linear time. The last question is of a different nature. However, the argument used to show that a good choice of a partition exists actually shows that in expectation a random choice is good, and this approach can be easily derandomised using the conditional expectation approach. In particular, this subprocedure can be implemented in linear time.
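For concreteness, a sketch of the bottom-up computation: an SLP is given as a list of right-hand sides, where an integer refers to an earlier nonterminal and a string is a letter (a representation chosen for this sketch). It answers questions 1 and 2 directly; the a-prefix and a-suffix lengths (question 3) are computed analogously.

```python
def slp_basic_facts(productions):
    """For each nonterminal X_i compute, bottom-up: whether val(X_i) is empty,
    the first letter of val(X_i), and |val(X_i)|."""
    empty, first, length = [], [], []
    for rhs in productions:
        # drop nonterminals that generate the empty word
        syms = [s for s in rhs if not (isinstance(s, int) and empty[s])]
        empty.append(len(syms) == 0)
        if empty[-1]:
            first.append(None)
        else:
            s = syms[0]
            first.append(first[s] if isinstance(s, int) else s)
        length.append(sum(length[s] if isinstance(s, int) else 1 for s in syms))
    return empty, first, length

# X_0 -> ab, X_1 -> X_0 X_0 b, X_2 -> X_1 X_0
prods = [["a", "b"], [0, 0, "b"], [1, 0]]
print(slp_basic_facts(prods))
# ([False, False, False], ['a', 'a', 'a'], [2, 5, 7])
```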

Concerning the running time, the generalisations of Pop, PairComp, CutPrefSuff and BlockComp can be implemented in linear time, thus the recompression for SLPs runs in time polynomial in the SLP's size.

Lemma 8

The recompression for SLPs runs in \(\mathcal {O}(n \log N)\le \mathcal {O}(n^2)\) time, where n is the size of the input SLP and N is the length of the defined word.

4.2 SLP Equality and Fully Compressed Pattern Matching

One of the first (and most important) problems considered for SLPs is equality testing: given two SLPs, we want to decide whether they define the same word. The first polynomial-time algorithm for this problem was given in 1994 by Plandowski [63]; to be more precise, his algorithm runs in \(\mathcal {O}(n^4)\) time. Afterwards research mostly focused on the more general problem of fully compressed pattern matching: for given SLPs \(\mathcal A\) and \(\mathcal B\) we want to decide whether \({{\,\mathrm{val}\,}}(\mathcal A)\) occurs in \({{\,\mathrm{val}\,}}(\mathcal B)\) (as a subword). The first solution to this problem was given by Karpiński et al. [40] in 1995. Gasieniec et al. [21] gave a faster randomised algorithm. In 1997 Miyazaki et al. [59] constructed an \(\mathcal {O}(n^4)\) algorithm. Finally, Lifshits gave an \(\mathcal {O}(n^3)\) algorithm for this problem [52]. All of the mentioned papers were based on the same original idea as Plandowski's algorithm.

Recompression can be naturally applied to equality testing of SLPs: given two SLPs \(\mathcal A\) and \(\mathcal B\) we add the equation \(X_{m_{\mathcal A}} = Y_{m_{\mathcal B}}\) (equating the starting nonterminals of the two SLPs) and ask about the satisfiability of the whole system. As already observed, the recompression based algorithm will work in polynomial time. It turns out that a proper implementation (using many nontrivial algorithmic techniques) runs in time \(\mathcal {O}(n \log N)\), where \(N = |{{\,\mathrm{val}\,}}(\mathcal A)| = |{{\,\mathrm{val}\,}}(\mathcal B)|\) (if \(|{{\,\mathrm{val}\,}}(\mathcal A)| \ne |{{\,\mathrm{val}\,}}(\mathcal B)|\) then clearly \(\mathcal A\) and \(\mathcal B\) are not equal) and n is the sum of the sizes of the SLPs \(\mathcal A\) and \(\mathcal B\). In order to obtain such a running time, we need several optimisations.

Theorem 5

([33]). The recompression based algorithm for equality testing for SLPs runs in \(\mathcal {O}(n \log N)\) time, where n is the sum of SLPs’ sizes while N the size of the defined (decompressed) words.

In order to use the recompression technique for the fully compressed pattern matching problem, we need some essential modifications: consider the ba-pair compression on the pattern ab and the text bab. We obtain the same pattern ab and the text cb, losing the only occurrence of the pattern in the text. This happens because the compression (on the text) is done partially on the pattern occurrence and partially outside it. To remedy this, we perform the compression operations in a particular order, which takes into account the first and last letters of the pattern and of the text. (In the considered example, we make the ab-pair compression first, and this preserves the occurrences of the pattern.) A similar approach works also for the block compression.

Theorem 6

([33]). The recompression based algorithm for fully compressed pattern matching runs in \(\mathcal {O}(n \log M)\) time, where n is the sum of SLPs’ sizes while M the length of the (uncompressed) pattern.