1 Introduction

Counting occurrences of letters in words is a major topic in formal language theory. In particular, much ink has been spent on investigating the counting ability of some language classes. For example, Joshi et al. [1] suggested that the language \(\mathrm {MIX}= \{w \in \{a,b,c\}^* \mid |w|_a = |w|_b = |w|_c \}\) should not be in the class of so-called mildly context-sensitive languages since it allows too much freedom in word order, so that relations between MIX and several language classes have been investigated (e.g., indexed languages [2], range concatenation languages [3], tree-adjoining languages [4], multiple context-free languages [5], etc.). The Parikh map is another rich example on this topic (counting occurrences of letters) [6].

In the recent work [7] by Colbourn et al., the counting feature of MIX is generalised from counting letter occurrences to counting word occurrences. They considered several problems for languages of the form \(L_{\!A}(w_1,\ldots , w_k) = \{ w \in A^* \mid |w|_{w_1} = \cdots = |w|_{w_k}\}\) (where \(|u|_v\) is the number of occurrences of \(v\) in \(u\)), which we call Word-MIX languages (WMIX for short) in this paper. While \(L_{\!A}(w_1, w_2)\) is always deterministic context-free, it can also be regular (\(L_{\!A}(ab, ba)\) is regular if \(A = \{a,b\}\), while it is not regular if \(A = \{a,b,c\}\), for example) [7]. This kind of generalisation – from letter occurrences to word occurrences – is also considered in the context of the Parikh map through so-called Parikh matrices [8] and subword histories [9, 10] (in this setting, scattered subword occurrences are considered instead of subword occurrences).

Colbourn et al. [7] provided a necessary and sufficient condition for \(w_1\) and \(w_2\) for the WMIX language \(L_{\!A}(w_1, w_2)\) to be regular, and gave a polynomial time algorithm for testing that condition. For the fully general case, the decidability of the regularity problem for WMIX languages can be derived from some known results on unambiguous constrained automata (UnCA for short), since \(L_{\!A}(w_1, \ldots , w_k)\) is always recognised by an UnCA, and the regularity for UnCA languages is decidable due to [11].

In this paper, we show that context-freeness is decidable for WMIX languages. We also give an alternative decidability proof for the regularity of WMIX languages. As mentioned above, regularity of WMIX languages is already known to be decidable thanks to the decidability results on UnCA languages (which include all WMIX languages) given by Cadilhac et al. [11]. However, the alternative proof given in this paper provides more structural information about WMIX languages, and it extends naturally to context-freeness. We introduce a new notion called dimension, which represents certain structural information of WMIX languages, and prove that a WMIX language is (1) regular if and only if its dimension is at most one, and (2) context-free if and only if its dimension is at most two. To the best of our knowledge, there has been no research on context-freeness for WMIX languages or UnCA languages. As far as we know, language classes with a decidable context-freeness property are very rare. We are only aware of such examples in some subclasses of bounded languages [12,13,14] and languages associated with vector addition systems [15].

Due to space restrictions, we omit some definitions and proofs; see the full version [16] for details.

2 Preliminaries

For a set \(X\), we denote by \(\#\!\left( X\right) \) the cardinality of \(X\), and we write \(2^X\) for the power set of \(X\). We denote by \(\mathbb {N}\) the set of natural numbers including 0. We call a mapping \(M: X \rightarrow \mathbb {N}\) a multiset over \(X\).

We assume that the reader has a basic understanding of automata and linear algebra.

2.1 Words and Word-MIX Languages

For an alphabet A, we denote the set of all words (resp. all non-empty words) over A by \(A^*\) (resp. \(A^+\)). We write \(A^n\) (resp. \(A^{<n}\)) for the set of all words of length n (resp. less than n), and write \(\mathbb {N}^{\le c}\) for the set of all natural numbers less than or equal to \(c\) for \(c \in \mathbb {N}\). For a pair of words \(v, w \in A^*\), \(|w|_v\) denotes the number of subword occurrences of v in w, i.e.,

$$ |w|_v = \#\!\left( \{ (x, y) \in A^* \times A^* \mid w = x v y \}\right) . $$

We write \(u \sqsubseteq v\) if \(u\) is a subword of \(v\), and write \(u \sqsubseteq _{\mathrm {sc}}v\) if \(u\) is a scattered subword of \(v\). For words \(w_1, \ldots , w_k \in A^*\), we define

$$ L_{\!A}(w_1, \ldots , w_k) = \{ w \in A^* \mid |w|_{w_1} = \cdots = |w|_{w_k} \} $$

and call it the Word-MIX (WMIX for short) language of the \(k\) parameter words \(w_1, \ldots , w_k\) over \(A\). For a word \(w \in A^*\), we denote the set of prefixes and suffixes of w by \(\mathrm {pref}(w)\) and \(\mathrm {suff}(w)\), and denote the length-n (\(n \le |w|\)) prefix and suffix of w by \(\mathrm {pref}_n(w)\) and \(\mathrm {suff}_n(w)\), respectively.
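The definition of a WMIX language can be checked directly on small inputs. The following is a minimal sketch, not part of the original development; the function names `occurrences` and `in_wmix` are hypothetical, and occurrences are counted with overlaps, which is our reading of "subword occurrences".

```python
def occurrences(w: str, v: str) -> int:
    """Number of (possibly overlapping) subword occurrences of v in w.

    For the empty word v this returns len(w) + 1, one per position.
    """
    return sum(1 for i in range(len(w) - len(v) + 1) if w[i:i + len(v)] == v)


def in_wmix(w: str, params: list[str]) -> bool:
    """Membership test for L_A(w_1, ..., w_k): all parameter words occur equally often."""
    counts = {occurrences(w, p) for p in params}
    return len(counts) <= 1


# MIX is the special case L_A(a, b, c) over A = {a, b, c}.
print(in_wmix("abcabc", ["a", "b", "c"]))  # all three letters occur twice
print(in_wmix("aab", ["ab", "ba"]))        # "ab" occurs once, "ba" never
```

The test is quadratic in the worst case, which is enough for experimenting with the examples in this paper.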

2.2 Graphs and Walks

Let \({\mathcal G}= (V, E)\) be a (directed) graph. We call a sequence of vertices \(\omega = (v_1, \ldots , v_n) \in V^n \, (n \ge 1)\) a walk (from \(v_1\) into \(v_n\) in \({\mathcal G}\)) if \((v_i, v_{i+1}) \in E\) for each \(i \in \{1, \ldots , n-1\}\), and define the length of \(\omega \) as \(n-1\) and denote it by \(|\omega |\). We denote by \(\mathtt {from}(\omega )\) and \(\mathtt {into}(\omega )\) the source \(v_1\) and the target \(v_n\) of \(\omega \), respectively. A walk \(\omega \) is called an empty walk if \(|\omega | = 0\). If two walks \(\omega _1 = (v_1, \ldots , v_m), \omega _2 = (v'_1, \ldots , v'_n)\) are connectable (i.e., \(\mathtt {into}(\omega _1) = \mathtt {from}(\omega _2)\)), we write \(\omega _1 \odot \omega _2\) for the connecting walk \((v_1, \ldots , v_m, v'_2, \ldots , v'_n)\). A non-empty walk \(\omega \) is called a loop (on \(\mathtt {from}(\omega )\)) if \(\mathtt {from}(\omega ) = \mathtt {into}(\omega )\). A walk \((v_1, \ldots , v_n)\) is called a path if \(v_i \ne v_j\) for every \(i, j \in \{1, \ldots , n\}\) with \(i \ne j\). A loop \((v, v_1, \ldots , v_n, v)\) is called a cycle if \((v, v_1, \ldots , v_n)\) is a path. We use the metavariable \(\pi \) for a path, and the metavariable \(\gamma \) for a cycle. For a cycle \(\gamma \) and \(n \ge 1\), we write \(\gamma ^n\) for the loop which is an n-times repetition of \(\gamma \). We denote by \({\mathcal W}({\mathcal G})\), \({\mathcal P}({\mathcal G})\) and \({\mathcal C}({\mathcal G})\) the sets of all walks, paths and cycles in \({\mathcal G}\), respectively. Note that \({\mathcal W}({\mathcal G})\) is infinite in general, but \({\mathcal P}({\mathcal G})\) and \({\mathcal C}({\mathcal G})\) are both finite if \({\mathcal G}\) is finite.

The N-dimensional de Bruijn graph \({\mathcal G}^{N}_A= (A^N, E)\) over A is the graph whose vertex set \(A^N\) is the set of words of length N and whose edge set E is defined by

$$ E = \{ (a v, v b) \mid a, b \in A, \, v \in A^{N-1} \}. $$

The case \(N = 2\) is depicted in Fig. 1.

Fig. 1. The 2-dimensional de Bruijn graph \({\mathcal G}^{2}_A\) over \(A = \{a, b\}\), a walk (ba, aa, aa, ab, bb, ba) (dotted red arrow) on \({\mathcal G}^{2}_A\) and its corresponding word baaabba. (Color figure online)

Let v be a vertex of \({\mathcal G}^{N}_A\). A word \(w = a_1 \cdots a_m \in A^+\) induces the walk \((v, v_1, \ldots , v_m)\) (where \(v_i = \mathrm {suff}_N(v \, \mathrm {pref}_i(w))\)) in \({\mathcal G}^{N}_A\), and we denote it by \(\mathtt {walk}_{{\mathcal G}^{N}_A}(v, w)\). Conversely, a walk \(\omega = (v_1, \ldots , v_n)\) in \({\mathcal G}^{N}_A\) induces the word \(v_1 \mathrm {suff}_1(v_2) \cdots \mathrm {suff}_1(v_n) \in A^*\), and we denote it by \(\mathtt {word}_{{\mathcal G}^{N}_A}(\omega )\) (see Fig. 1). For words \(w, w_1, \ldots , w_k \in A^*\) and a walk \(\omega = (v_0, v_1, \ldots , v_n) \in {\mathcal W}({\mathcal G}^{N}_A)\), we define the following vectors in \(\mathbb {N}^k\):

$$ |w|_{(w_1, \ldots , w_k)} = (|w|_{w_1}, \ldots , |w|_{w_k}), \qquad |\omega |_{(w_1, \ldots , w_k)} = \sum _{i=1}^{n} (c_{i,1}, \ldots , c_{i,k}) \quad \text {where } c_{i,j} = {\left\{ \begin{array}{ll} 1 &{} \text {if } w_j \in \mathrm {suff}(v_i), \\ 0 &{} \text {otherwise.} \end{array}\right. } $$

We call \(|w|_{(w_1, \ldots , w_k)}\) (resp. \(|\omega |_{(w_1, \ldots , w_k)}\)) the occurrence vector of \(w\) (resp. \(\omega \)). Note that the range of the summation in the above definition of \(|\omega |_{(w_1, \ldots , w_k)}\) does not contain 0, hence \(|\omega |_{(w_1, \ldots , w_k)} = (0, \ldots , 0)\) if \(\omega \) is an empty walk \(\omega = (v_0)\). The next proposition states a basic property of \({\mathcal G}^{N}_A\), which can be shown by a straightforward induction on the length of w.

Proposition 1

Let \(w_1, \ldots , w_k \in A^*\) and \(N = \max (|w_1|, \ldots , |w_k|)\). For any pair of words \(v,w \in A^*\) such that \(|v| = N\) and \(\omega = \mathtt {walk}_{{\mathcal G}^{N}_A}(v, w)\), we have

$$ |vw|_{(w_1, \ldots , w_k)} = |v|_{(w_1, \ldots , w_k)} + |\omega |_{(w_1, \ldots , w_k)}. $$
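Proposition 1 can be sanity-checked on the walk of Fig. 1. The sketch below is illustrative only: the helper names are hypothetical, and it assumes that the occurrence vector of a walk counts, for every vertex except the first, the parameter words that are suffixes of that vertex (so each newly read letter contributes the occurrences that end at it).

```python
def occurrences(w: str, v: str) -> int:
    """Number of (possibly overlapping) subword occurrences of v in w."""
    return sum(1 for i in range(len(w) - len(v) + 1) if w[i:i + len(v)] == v)


def walk(v: str, w: str) -> list[str]:
    """Walk induced by w from vertex v in the N-dimensional de Bruijn graph, N = |v| >= 1."""
    n = len(v)
    assert n >= 1, "this sketch assumes N >= 1"
    return [v] + [(v + w[:i])[-n:] for i in range(1, len(w) + 1)]


def word(omega: list[str]) -> str:
    """Word induced by a walk: the first vertex followed by the last letter of each later vertex."""
    return omega[0] + "".join(u[-1] for u in omega[1:])


def word_vector(w: str, params: list[str]) -> tuple[int, ...]:
    """Occurrence vector of a word."""
    return tuple(occurrences(w, p) for p in params)


def walk_vector(omega: list[str], params: list[str]) -> tuple[int, ...]:
    """Occurrence vector of a walk (assumed definition, see the lead-in); the first vertex is skipped."""
    return tuple(sum(1 for u in omega[1:] if u.endswith(p)) for p in params)


params = ["ab", "ba"]              # N = max(|ab|, |ba|) = 2
v, w = "ba", "aabba"               # the walk of Fig. 1
omega = walk(v, w)
lhs = word_vector(v + w, params)
rhs = tuple(a + b for a, b in zip(word_vector(v, params), walk_vector(omega, params)))
print(omega, word(omega), lhs, rhs)
```

Both sides of the identity are computed independently, so the check would fail if the assumed definition of the walk vector missed or double-counted an occurrence.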

2.3 Well-Quasi-Orders

A quasi order \(\le \) on a set X is called a well-quasi-order (wqo for short) if any infinite sequence \((x_i)_{i \in \mathbb {N}} \, (x_i \in X)\) contains an increasing pair \(x_i \le x_j\) with \(i < j\). Let \(\le _1\) be a quasi order on a set \(X_1\) and \(\le _2\) be a quasi order on a set \(X_2\). The product order \(\le _{1,2}\) is a quasi order on \(X_1 \times X_2\) defined by

$$ (x_1, y_1) \le _{1,2} (x_2, y_2) {\mathop {\Longleftrightarrow }\limits ^{\mathtt {def}}}x_1 \le _1 x_2 \text { and } y_1 \le _2 y_2. $$

Proposition 2

(cf. Proposition 6.1.1 in [17]). Let \(\le _1\) be a wqo on a set \(X_1\) and \(\le _2\) be a wqo on a set \(X_2\). The product order \(\le _{1,2}\) is again a wqo on \(X_1 \times X_2\).

We list some examples of wqos below:

(1) The identity relation \(=\) on any finite set X is a wqo (the pigeonhole principle).

(2) The usual order \(\le \) on \(\mathbb {N}\) is a wqo.

(3) The product order \(\le _m\) on \(\mathbb {N}^m\) is a wqo for any \(m \ge 1\) (Dickson’s lemma), which is a direct corollary of Proposition 2.

(4) The point-wise order \(\le _{\mathtt {pt}}\) on the multisets \(\mathbb {N}^X\) (\(M \le _{\mathtt {pt}}M' {\mathop {\Longleftrightarrow }\limits ^{\mathtt {def}}}M(x) \le M'(x)\) for all \(x \in X\)) over a finite set X is a wqo (just a paraphrase of Dickson’s lemma).

3 Path-Cycle Decomposition of Walks

In this section, we provide a simple method which decomposes, in a left-to-right manner, a walk \(\omega \) into a (possibly empty) path \(\pi \) and a sequence of cycles \(\varGamma \) (Fig. 2). This decomposition and its inverse operation (composition) are probably folklore, and the contents of this section already appeared in the author’s unpublished note [18]. A similar method is also used in [11].

Let \({\mathcal G}= (V, E)\) be a graph. For a pair of sequences of cycles \(\varGamma _1 = (\gamma _1, \ldots , \gamma _n),\) \(\varGamma _2 = (\gamma '_1, \ldots , \gamma '_m)\), we write \(\varGamma _1.\varGamma _2\) for the concatenation \((\gamma _1, \ldots , \gamma _n, \gamma '_1, \ldots , \gamma '_m)\). When \(\varGamma _1 = (\gamma )\) we simply write \(\gamma .\varGamma _2\) for \(\varGamma _1.\varGamma _2\). We write \(\emptyset \) for the empty sequence of cycles. For \(\varGamma = (\gamma _1, \ldots , \gamma _n)\), we denote by \(\varGamma (i)\) the i-th component \(\gamma _i\) of \(\varGamma \), and denote by \(|\varGamma |_\gamma \) the number \(\#\!\left( \{i \mid \varGamma (i) = \gamma \}\right) \) of occurrences of \(\gamma \) in \(\varGamma \). For a walk \(\omega = (v_1, \ldots , v_n)\), we denote by \(V(\omega )\) the set of all vertices appearing in \(\omega \): \(V(\omega ) = \{v_1, \ldots , v_n\}\).

Fig. 2. Computation of \(\varPhi _{{\mathcal K}_4}\) and \(\varPsi _{{\mathcal K}_4}\)

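We then define a decomposition function \(\varPhi _{{\mathcal G}}\) inductively as follows:

$$ \varPhi _{{\mathcal G}}((v)) = ((v), \emptyset ) $$

and

$$ \varPhi _{{\mathcal G}}((v_1, \ldots , v_n, v)) = {\left\{ \begin{array}{ll} ((u_1, \ldots , u_m, v), \varGamma ) &{} \text {if } v \notin V(\pi ), \\ ((u_1, \ldots , u_i), \varGamma .((u_i, \ldots , u_m, v))) &{} \text {if } v = u_i \text { for some } i \in \{1, \ldots , m\}, \end{array}\right. } $$

where \((\pi , \varGamma ) = \varPhi _{{\mathcal G}}((v_1, \ldots , v_n))\) and \(\pi = (u_1, \ldots , u_m)\).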

It is clear by definition that, for any \(\omega \) and \((\omega ', \varGamma ) = \varPhi _{{\mathcal G}}(\omega )\), \(\omega '\) is a path and \(\varGamma \) is a sequence of cycles, i.e., \(\varPhi _{{\mathcal G}}: {\mathcal W}({\mathcal G}) \rightarrow {\mathcal P}({\mathcal G}) \times {\mathcal C}({\mathcal G})^*\). Conversely, we can define a composition (partial) function \(\varPsi _{{\mathcal G}}\) as an inverse of \(\varPhi _{{\mathcal G}}\), i.e., \(\omega = \varPsi _{{\mathcal G}}(\varPhi _{{\mathcal G}}(\omega )).\) The formal definition of \(\varPsi _{{\mathcal G}}\) can be found in the full version [16].

Example 1

Consider the complete graph \({\mathcal K}_4 = (V_4 = \{1,2,3,4\}, E_4 = V_4 \times V_4)\) of order 4 and a walk \(\omega = (1,2,3,2,3,4,3,4,2,4)\). The result of decomposition is \(\varPhi _{{\mathcal K}_4}(\omega ) = (\pi = (1,2,4), \varGamma = ((2,3,2), (3,4,3), (2,3,4,2)))\). All intermediate computation steps of \(\varPhi _{\mathcal{K}_4}(\omega )\) and \(\varPsi _{{\mathcal K}_4}(\varPhi _{{\mathcal K}_4}(\omega ))\) are drawn in Fig. 2 (in the figure we denote by \( \pi \& \varGamma \) a pair \((\pi , \varGamma )\) for visibility).
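The left-to-right decomposition can be sketched in a few lines. This is our reading of the procedure as illustrated by Example 1, not the formal definition from the full version; the name `decompose` is hypothetical, and the input is assumed to be a valid non-empty walk given as a list of vertices.

```python
def decompose(omega: list) -> tuple[tuple, list[tuple]]:
    """Split a walk into a path and a sequence of cycles, scanning left to right.

    Whenever the next vertex already lies on the current path, the part of the
    path from that vertex onwards is closed into a cycle and cut off.
    """
    path = [omega[0]]
    cycles = []
    for v in omega[1:]:
        if v in path:
            i = path.index(v)
            cycles.append(tuple(path[i:] + [v]))  # cycle starting and ending at v
            path = path[:i + 1]                    # keep the path up to v
        else:
            path.append(v)
    return tuple(path), cycles


# The walk of Example 1 in the complete graph K_4.
pi, gamma = decompose([1, 2, 3, 2, 3, 4, 3, 4, 2, 4])
print(pi, gamma)
```

On the walk of Example 1 this yields the path \((1,2,4)\) and the cycle sequence \(((2,3,2), (3,4,3), (2,3,4,2))\), in agreement with the text.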

3.1 Multi-traces and Traces

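For a walk \(\omega \) in a graph \({\mathcal G}\) with \((\pi , \varGamma ) = \varPhi _{{\mathcal G}}(\omega )\), we define the multi-trace \({{\,\mathrm{\mathbb {N}Tr}\,}}(\omega ): {\mathcal P}({\mathcal G}) \cup {\mathcal C}({\mathcal G}) \rightarrow \mathbb {N}\) of \(\omega \) as the following multiset over paths and cycles:

$$ ({{\,\mathrm{\mathbb {N}Tr}\,}}(\omega ))(\pi ') = {\left\{ \begin{array}{ll} 1 &{} \text {if } \pi ' = \pi , \\ 0 &{} \text {otherwise} \end{array}\right. } \quad (\pi ' \in {\mathcal P}({\mathcal G})), \qquad ({{\,\mathrm{\mathbb {N}Tr}\,}}(\omega ))(\gamma ) = |\varGamma |_\gamma \quad (\gamma \in {\mathcal C}({\mathcal G})). $$

We define the trace \({{\,\mathrm{Tr}\,}}(\omega )\) of a walk \(\omega \) in \({\mathcal G}\) as the following set of paths and cycles:

$$ {{\,\mathrm{Tr}\,}}(\omega ) = \{ x \in {\mathcal P}({\mathcal G}) \cup {\mathcal C}({\mathcal G}) \mid ({{\,\mathrm{\mathbb {N}Tr}\,}}(\omega ))(x) > 0 \}. $$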

Intuitively, the multi-trace of \(\omega \) in \({\mathcal G}\) is obtained by forgetting the ordering of the decomposition result \((\pi , \varGamma ) = \varPhi _{{\mathcal G}}(\omega )\) of \(\omega \), and the trace of \(\omega \) is obtained by forgetting the multiplicity from the original multi-trace (see Fig. 3 for the relation).

Fig. 3. Relations between words, walks and (multi-)traces (\(N = 2\) for the examples).

The following proposition states that the occurrences of parameter words of a walk are completely determined from its multi-trace (see the full version [16] for details).

Proposition 3

Let \(w_1, \ldots , w_k \in A^*\) and \(N = \max (|w_1|, \ldots , |w_k|)\). For any \(\omega \) in \({\mathcal W}({\mathcal G}^{N}_A)\), we have

$$\begin{aligned} |\omega |_{(w_1, \ldots , w_k)} = \sum _{\pi \in {\mathcal P}({\mathcal G}^{N}_A)} \!\!\! ({{\,\mathrm{\mathbb {N}Tr}\,}}(\omega ))(\pi ) \cdot |\pi |_{(w_1, \ldots , w_k)} + \!\!\! \sum _{\gamma \in {\mathcal C}({\mathcal G}^{N}_A)} \!\!\! ({{\,\mathrm{\mathbb {N}Tr}\,}}(\omega ))(\gamma ) \cdot |\gamma |_{(w_1, \ldots , w_k)}. \end{aligned}$$
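Proposition 3 can be illustrated on the walk of Fig. 1. The following sketch makes two assumptions that are ours: the multi-trace is represented as a `Counter` holding the path with multiplicity one and each cycle with its number of occurrences in the decomposition, and the occurrence vector of a walk counts the parameter words that are suffixes of each non-initial vertex. All function names are hypothetical.

```python
from collections import Counter


def decompose(omega):
    """Left-to-right path-cycle decomposition of a walk (list of vertices)."""
    path, cycles = [omega[0]], []
    for v in omega[1:]:
        if v in path:
            i = path.index(v)
            cycles.append(tuple(path[i:] + [v]))
            path = path[:i + 1]
        else:
            path.append(v)
    return tuple(path), cycles


def multi_trace(omega) -> Counter:
    """Multiset over paths and cycles: the decomposition with its ordering forgotten."""
    path, cycles = decompose(omega)
    m = Counter(cycles)
    m[path] += 1
    return m


def walk_vector(omega, params):
    """Occurrence vector of a walk (assumed definition); the first vertex is skipped."""
    return tuple(sum(1 for u in omega[1:] if u.endswith(p)) for p in params)


params = ["ab", "ba"]
omega = ["ba", "aa", "aa", "ab", "bb", "ba"]   # walk of Fig. 1
m = multi_trace(omega)
trace = set(m)                                  # forget multiplicities

# Right-hand side of Proposition 3: multiplicity-weighted sum of occurrence vectors.
total = [0] * len(params)
for x, mult in m.items():
    for j, c in enumerate(walk_vector(list(x), params)):
        total[j] += mult * c
print(tuple(total), walk_vector(omega, params))
```

The identity holds because every step of the walk ends up in exactly one component of the decomposition, either the remaining path or one of the cycles.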

4 Main Results

In this section we first introduce a new notion for WMIX languages called dimension. Afterwards, we state our main results that characterise both regularity and context-freeness of WMIX languages.

Definition 1

Let \(w_1, \ldots , w_k \in A^*\) and \(N = \max (|w_1|, \ldots , |w_k|)\). Let \(T = \{\pi \} \cup \{\gamma _1, \ldots , \gamma _m\}\) be a trace of a walk \(\omega \) in \({\mathcal G}^{N}_A\). A subset \(S\) of \(\{\gamma _1, \ldots , \gamma _m\}\) is called pumpable in \(T\) of \(L_{\!A}(w_1, \ldots , w_k)\) if, for any number \(n \ge 1\), there exists a word \(uv \in L_{\!A}(w_1, \ldots , w_k)\) with \(\omega = \mathtt {walk}_{{\mathcal G}^{N}_A}(u, v)\) such that (1) \({{\,\mathrm{Tr}\,}}(\omega ) = T\) and (2) \(({{\,\mathrm{\mathbb {N}Tr}\,}}(\omega ))(\gamma ) \ge n\) for each \(\gamma \in S\). We further say \(S\) is maximal if no proper superset of \(S\) included in \(\{\gamma _1, \ldots , \gamma _m\}\) is pumpable.

Remark 1

The empty set \(\emptyset \) is always pumpable in a trace \(T\) of \(L_{\!A}(w_1,\ldots ,w_k)\) such that \({{\,\mathrm{Tr}\,}}(\mathtt {walk}_{{\mathcal G}^{N}_A}(u,v)) = T\) for some \(uv \in L_{\!A}(w_1,\ldots ,w_k)\). Moreover, it is decidable whether \(S\) is pumpable or not in \(T\) of \(L_{\!A}(w_1, \ldots , w_k)\) (see the full version [16]).

Recall that a vector space is a set \(\varvec{V} \subseteq \mathbb {R}^k\) such that \(\varvec{0} \in \varvec{V}, \varvec{V} + \varvec{V}\subseteq \varvec{V}\) and \(\mathbb {R}\varvec{V} = \{ \alpha \cdot \varvec{v} \mid \varvec{v} \in \varvec{V}, \alpha \in \mathbb {R}\}\subseteq \varvec{V}\) where \(\varvec{0}\) is the vector with all zeros.

Definition 2

Let \(w_1, \ldots , w_k \in A^*\), \(N = \max (|w_1|, \ldots , |w_k|)\). The dimension of \(L = L_{\!A}(w_1, \ldots , w_k)\) is the natural number defined as

$$ \max \{ \dim (\varvec{V}) \mid \varvec{V} \!=\! \mathrm {span}(\{ |\gamma |_{(w_1, \ldots , w_k)} \mid \gamma \in S \}), S \text { is pumpable in some } T \text { of } L \} $$

where \(\dim (\varvec{V})\) is the dimension of the vector space \(\varvec{V}\) and \(\mathrm {span}(B)\) is the vector space spanned by \(B\) (where \(\mathrm {span}(\emptyset ) = \{\varvec{0}\}\)).

Fig. 4. The 1-dimensional de Bruijn graph \({\mathcal G}^{1}_A\) over \(A = \{a, b, c\}\).

The dimension of a WMIX language \(L\) is, roughly speaking, the minimum number of cycles (in the de Bruijn graph) that should be counted independently. We describe this intuition more rigorously by using \(\mathrm {MIX}= L_{\!A}(a,b,c)\) for \(A = \{a,b,c\}\) as a simple example.

Example 2

Since \(\max (|a|, |b|, |c|) = 1\), it is enough to consider the 1-dimensional de Bruijn graph \({\mathcal G}^{1}_A\) over \(A = \{a,b,c\}\) (see Fig. 4). One can easily observe that the set of cycles \(S = \{\gamma _1 = (a,a), \gamma _2 = (b,b), \gamma _3 = (c,c)\}\) (each \(\gamma _i\) is depicted in Fig. 4) is pumpable in the trace \(T = \{ (a,b,c) \} \cup S\): for any \(n > 0\), the word \(a w_n = a^{n+1} b^{n+1} c^{n+1}\) is in \(\mathrm {MIX}\) and it satisfies the two conditions in Definition 1, namely (1) \({{\,\mathrm{Tr}\,}}(\mathtt {walk}_{{\mathcal G}^{N}_A}(a, w_n)) = T\) and (2) \(({{\,\mathrm{\mathbb {N}Tr}\,}}(\mathtt {walk}_{{\mathcal G}^{N}_A}(a, w_n)))(\gamma _i) = n\) for each \(\gamma _i \in S\). The occurrence vectors corresponding to \(\gamma _1, \gamma _2, \gamma _3\) are \(\varvec{v_1} = (1,0,0), \varvec{v_2} = (0,1,0), \varvec{v_3} = (0,0,1)\), respectively. Since those occurrence vectors are linearly independent, the vector space spanned by them is \(\mathbb {R}^3\) and thus the dimension of \(\mathrm {MIX}\) is three.
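The last step of Example 2, computing the dimension of the span of the occurrence vectors, amounts to a rank computation. The sketch below uses exact rational Gaussian elimination; the helper names `cycle_vector` and `rank` are hypothetical, and the pumpable set is taken from the example rather than computed.

```python
from fractions import Fraction


def cycle_vector(cycle, params):
    """Occurrence vector of a cycle: count parameter words that are suffixes of each non-initial vertex."""
    return tuple(sum(1 for u in cycle[1:] if u.endswith(p)) for p in params)


def rank(vectors) -> int:
    """Rank of a list of integer vectors, by Gauss-Jordan elimination over the rationals."""
    rows = [[Fraction(x) for x in v] for v in vectors]
    ncols = len(rows[0]) if rows else 0
    r = 0
    for c in range(ncols):
        pivot = next((i for i in range(r, len(rows)) if rows[i][c] != 0), None)
        if pivot is None:
            continue
        rows[r], rows[pivot] = rows[pivot], rows[r]
        for i in range(len(rows)):
            if i != r and rows[i][c] != 0:
                f = rows[i][c] / rows[r][c]
                rows[i] = [a - f * b for a, b in zip(rows[i], rows[r])]
        r += 1
    return r


params = ["a", "b", "c"]
S = [("a", "a"), ("b", "b"), ("c", "c")]        # the pumpable cycles of Example 2
vectors = [cycle_vector(g, params) for g in S]
print(vectors, rank(vectors))
```

For MIX the three vectors are the unit vectors, so the rank is three, matching the dimension stated in the example.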

By considering dimensions of WMIX languages, we can nicely characterise both regularity and context-freeness as follows.

Theorem 1

(regularity). \(L_{\!A}(w_1, \ldots , w_k)\) is regular if and only if its dimension is at most one.

Theorem 2

(context-freeness). \(L_{\!A}(w_1, \ldots , w_k)\) is context-free if and only if its dimension is at most two.

Some pushdown automaton \(\mathcal {A}\) can recognise \(L_{\!A}(a,b)\) since, by using its stack, \(\mathcal {A}\) can track the number \(|w|_a - |w|_b\). However, no pushdown automaton \(\mathcal {A}\) can recognise \(\mathrm {MIX}= L_{\!A}(a,b,c)\) since, for that purpose, one should track the numbers \(|w|_a - |w|_b\) and \(|w|_b - |w|_c\) simultaneously. This is a rough intuition for why a language with dimension greater than or equal to three is never context-free (the formal proof is in the next section).
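The stack discipline behind the first claim can be sketched as follows. This is only an illustration of how a single stack tracks \(|w|_a - |w|_b\), not the construction used in the proofs; the function name `accepts_equal_ab` is hypothetical, and letters other than a and b are simply skipped.

```python
def accepts_equal_ab(w: str) -> bool:
    """Simulate a pushdown-style recogniser for L_A(a, b).

    The stack always holds copies of a single letter: its height is
    | |w|_a - |w|_b | and its content records which letter is in excess.
    """
    stack = []
    for ch in w:
        if ch not in ("a", "b"):
            continue                 # other letters do not affect the counts
        if stack and stack[-1] != ch:
            stack.pop()              # cancel one excess letter
        else:
            stack.append(ch)         # record a new excess letter
    return not stack                 # accept iff the difference is zero


print(accepts_equal_ab("abba"), accepts_equal_ab("aab"))
```

A single stack of this kind stores one signed difference; tracking two independent differences at once, as MIX would require, is what the intuition above says a pushdown automaton cannot do.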

The set \({\mathcal P}({\mathcal G}^{N}_A) \cup {\mathcal C}({\mathcal G}^{N}_A)\) of paths and cycles in the \(N\)-dimensional de Bruijn graph is finite, hence we can effectively enumerate all traces of all walks in \({\mathcal G}^{N}_A\) (see the full version [16] for details). Moreover, as we mentioned in Remark 1, we can also effectively enumerate all pumpable sets in a trace. For a pumpable set \(S\), computing the dimension of the vector space spanned by the occurrence vectors of \(S\) is just counting the maximum number of linearly independent vectors among the occurrence vectors of \(S\). Combining these facts with Theorems 1 and 2, we can effectively compute the dimension of \(L_{\!A}(w_1, \ldots , w_k)\) and hence we have the following decidability result.

Corollary 1

Regularity and context-freeness are decidable for WMIX languages.

5 Proof of the Main Results

The proof structure of Theorem 1 is similar to that of Theorem 2, although the latter is more complicated. In this section, we first investigate some structural properties of pumpable sets, which play a crucial role in the main proof. We then give a proof of Theorem 1, which provides a good intuition for the latter proof. Finally, we give a proof of Theorem 2.

5.1 Properties of Pumpable Sets

For a vector \(\varvec{v} = (c_1, \ldots , c_k) \in \mathbb {R}^k\), we define \(\mathtt {diff}(\varvec{v}) = \max \{ |c_i - c_j| \mid 1 \le i, j \le k \}\). Observe that \(w \in L_{\!A}(w_1, \ldots , w_k)\) if and only if \(\mathtt {diff}(|w|_{(w_1, \ldots , w_k)}) = 0\).

Lemma 1

Let \(w_1, \ldots , w_k \in A^*\) and \(N = \max (|w_1|, \ldots , |w_k|)\). For any maximum pumpable set \(S\) in \(T = \{\pi \} \cup S'\) of \(L_{\!A}(w_1, \ldots , w_k)\), if \(\varvec{V} = \mathrm {span}(\{ |\gamma |_{(w_1, \ldots , w_k)} \mid \gamma \in S \})\) has a non-zero dimension, then \(\varvec{V}\) contains the vector \(\varvec{1}\).

Proof

Let \(u = \mathtt {from}(\pi )\). Since \(S\) is pumpable, there exists an infinite sequence \((uv_i)_{i \in \mathbb {N}}\), where \(u \in A^N\), of words that satisfies:

(1) \({{\,\mathrm{Tr}\,}}(\mathtt {walk}_{{\mathcal G}^{N}_A}(u, v_i)) = T\) for all \(i \in \mathbb {N}\).

(2) \(uv_i \in L_{\!A}(w_1, \ldots , w_k)\) for all \(i \in \mathbb {N}\).

(3) \({{\,\mathrm{\mathbb {N}Tr}\,}}(\mathtt {walk}_{{\mathcal G}^{N}_A}(u, v_i))(\gamma ) < {{\,\mathrm{\mathbb {N}Tr}\,}}(\mathtt {walk}_{{\mathcal G}^{N}_A}(u, v_j))(\gamma )\) for all \(i,j \in \mathbb {N}\) with \(i < j\) and for all \(\gamma \in S\).

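Now consider the infinite sequence of multi-traces of the above sequence:

$$ (M_i)_{i \in \mathbb {N}} = \left( {{\,\mathrm{\mathbb {N}Tr}\,}}(\mathtt {walk}_{{\mathcal G}^{N}_A}(u, v_i)) \right) _{i \in \mathbb {N}}. $$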

Since the point-wise order on the multisets over any finite set is a wqo (thanks to Dickson’s lemma) and \({\mathcal P}({\mathcal G}^{N}_A) \cup {\mathcal C}({\mathcal G}^{N}_A)\) is finite, \((M_i)_{i \in \mathbb {N}}\) contains an infinite increasing subsequence \((M_j)_{j \in J}\) (\(J \subseteq \mathbb {N}\)). Let \(\overline{S} = (S' \setminus S)\). Because \(S\) is maximum, the number of occurrences of any non-pumpable cycle \(\gamma \in \overline{S}\) is bounded, i.e., there is some constant \(c \in \mathbb {N}\) such that \(({{\,\mathrm{\mathbb {N}Tr}\,}}(\mathtt {walk}_{{\mathcal G}^{N}_A}(u, v_i)))(\gamma ) < c\) for any \(\gamma \in \overline{S}\) and \(i \in \mathbb {N}\). By the pigeonhole principle, we can deduce that, in the infinite sequence \((M_j)_{j \in J}\), there exists a pair \((i_1, i_2) \in J^2\) with \(i_1 < i_2\) such that \((M_{i_1})(\gamma ) = (M_{i_2})(\gamma )\) for all \(\gamma \in \overline{S}\). Let \(C = \sum _{\gamma \in \overline{S}} M_{i_1}(\gamma ) \cdot |\gamma |_{(w_1,\ldots , w_k)}\). Combining the above observation and the condition (3) of \((u v_i)_{i \in \mathbb {N}}\), we have

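$$ M_{i_1}(\gamma ) = M_{i_2}(\gamma ) \text { for all } \gamma \in \overline{S} \quad \text {and} \quad M_{i_1}(\gamma ) < M_{i_2}(\gamma ) \text { for all } \gamma \in S. \qquad (\bigstar ) $$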

Because \(u v_{i_1}, uv_{i_2} \in L_{\!A}(w_1, \ldots , w_k)\), by Proposition 1 and Proposition 3, we have

$$\begin{aligned}&\mathtt {diff}(|u v_{i_1}|_{(w_1,\ldots , w_k)}) = \mathtt {diff}(|u v_{i_2}|_{(w_1, \ldots , w_k)}) = 0\\ = \,&\mathtt {diff}\left( |u|_{(w_1, \ldots , w_k)} + |\pi |_{(w_1, \ldots , w_k)} + C + \sum _{\gamma \in S} M_{i_1}(\gamma ) \cdot |\gamma |_{(w_1, \ldots , w_k)} \right) \\ =\,&\mathtt {diff}\left( |u|_{(w_1, \ldots , w_k)} + |\pi |_{(w_1, \ldots , w_k)} + C + \sum _{\gamma \in S} M_{i_2}(\gamma ) \cdot |\gamma |_{(w_1, \ldots , w_k)}\right) . \end{aligned}$$

Moreover, from the above equation we obtain

$$\begin{aligned} \mathtt {diff}\left( \sum _{\gamma \in S} (M_{i_2}(\gamma ) - M_{i_1}(\gamma ))\cdot |\gamma |_{(w_1, \ldots , w_k)} \right) = 0 \end{aligned} \tag{1}$$

because for any \(\varvec{v}\) such that \(\mathtt {diff}(\varvec{v}) = 0\), \(\mathtt {diff}(\varvec{v} + \varvec{v'}) = 0\) if and only if \(\mathtt {diff}(\varvec{v}') = 0\). By Condition (\(\bigstar \)), the vector

$$ \varvec{v} = \sum _{\gamma \in S} (M_{i_2}(\gamma ) - M_{i_1}(\gamma )) \cdot |\gamma |_{(w_1, \ldots , w_k)} $$

is not the zero vector \(\varvec{0}\). Thus \(\varvec{v}\) is of the form \(n \cdot \varvec{1} \, (n \ne 0)\), i.e., \(\varvec{1} \in \varvec{V}\).    \(\square \)

Lemma 2

Let \(w_1, \ldots , w_k \in A^*\) and \(N = \max (|w_1|, \ldots , |w_k|)\). For any trace T of some walk in \({\mathcal G}^{N}_A\), if \({{\,\mathrm{Tr}\,}}(\mathtt {walk}_{{\mathcal G}^{N}_A}(u,v)) = T\) for some \(uv \in L_{\!A}(w_1,\ldots ,w_k)\), then there exists a unique maximal (i.e., the maximum) pumpable set \(S\) in \(T\) of \(L_{\!A}(w_1, \ldots , w_k)\).

Proof

Let \(S_1, S_2\) be two maximal pumpable sets in \(T\) of \(L_{\!A}(w_1,\ldots ,w_k)\) and \(S_1 = \{\gamma _1, \ldots , \gamma _m\}\). We now prove that \(S_1 \cup S_2\) is also pumpable in \(T\), which implies \(S_1 = S_2\) by the maximality of \(S_1\) and \(S_2\). By Condition (\(\bigstar \)) and Equation (1) in the proof of Lemma 1, we can deduce that there exist \(n_1, \ldots , n_m \in \mathbb {N}\) such that \(n_i > 0\) for all \(i \in \{1, \ldots , m\}\) and \(\mathtt {diff}(\sum _{i=1}^{m} n_i \cdot |\gamma _i|_{(w_1,\ldots ,w_k)}) = 0\). Let \((uv_i)_{i \in \mathbb {N}}\) be an infinite sequence that ensures the pumpability of \(S_2\), namely,

(1) \({{\,\mathrm{Tr}\,}}(\mathtt {walk}_{{\mathcal G}^{N}_A}(u,v_i)) = T\) for all \(i \in \mathbb {N}\).

(2) \(uv_i \in L_{\!A}(w_1, \ldots , w_k)\) for all \(i \in \mathbb {N}\).

(3) \({{\,\mathrm{\mathbb {N}Tr}\,}}(\mathtt {walk}_{{\mathcal G}^{N}_A}(u, v_i))(\gamma ) \ge i\) for all \(i \in \mathbb {N}\) and for all \(\gamma \in S_2\).

Let \(uv'_i\) be a word satisfying \({{\,\mathrm{Tr}\,}}(\mathtt {walk}_{{\mathcal G}^{N}_A}(u,v'_i)) = T\) and \({{\,\mathrm{\mathbb {N}Tr}\,}}(\mathtt {walk}_{{\mathcal G}^{N}_A}(u, v'_i))(\gamma _j) = {{\,\mathrm{\mathbb {N}Tr}\,}}(\mathtt {walk}_{{\mathcal G}^{N}_A}(u, v_i))(\gamma _j) + i \times n_j\) for all \(i \in \mathbb {N}\) and \(\gamma _j \in S_1\). Such a word \(uv'_i\) always exists because we can just pump an occurrence of \(\gamma _j \in S_1\) in \(\mathtt {walk}_{{\mathcal G}^{N}_A}(u, v_i)\) \((i \times n_j)\)-times repeatedly. Then the infinite sequence \((uv'_i)_{i \in \mathbb {N}}\) satisfies \(uv'_i \in L_{\!A}(w_1,\ldots ,w_k)\) and \({{\,\mathrm{\mathbb {N}Tr}\,}}(\mathtt {walk}_{{\mathcal G}^{N}_A}(u, v'_i))(\gamma ) \ge i\) for all \(i \in \mathbb {N}\) and for all \(\gamma \in S_1 \cup S_2\), because \(\mathtt {diff}(\sum _{j = 1}^{m} n_j \cdot |\gamma _j|_{(w_1,\ldots ,w_k)}) = 0\). This means that \((uv'_i)_{i \in \mathbb {N}}\) ensures the pumpability of \(S_1 \cup S_2\), which ends the proof.    \(\square \)

Lemma 3

Let \(w_1, \ldots , w_k \in A^*\) and \(N = \max (|w_1|, \ldots , |w_k|)\). For any maximum pumpable set \(S\) of \(L_{\!A}(w_1, \ldots , w_k)\),

(1) if the vector space \(\varvec{V}\) spanned by the occurrence vectors of \(S\) is of dimension one, then \(\varvec{V} = \mathrm {span}(\{\varvec{1}\})\) where \(\varvec{1}\) is the \(k\)-dimensional vector with entries all 1, i.e., any occurrence vector \(\varvec{v}\) of \(S\) satisfies \(\mathtt {diff}(\varvec{v}) = 0\).

(2) if the vector space \(\varvec{V}\) spanned by the occurrence vectors of \(S\) is of dimension greater than or equal to two, then we can choose a basis \(B \subseteq \{ |\gamma |_{(w_1, \ldots , w_k)} \mid \gamma \in S \}\) of \(\varvec{V}\) such that any element \(\varvec{v}\) of \(B\) satisfies \(\mathtt {diff}(\varvec{v}) \ne 0\).

Proof

Condition (1) is a direct consequence of Lemma 1. Condition (2) also follows from Lemma 1. Let \(\gamma \in S\) be a pumpable cycle such that \(\mathtt {diff}(|\gamma |_{(w_1, \ldots , w_k)}) \ne 0\). Such \(\gamma \) always exists since \(S\) contains at least two cycles whose occurrence vectors are linearly independent. Moreover, by Condition (\(\bigstar \)) in the proof of Lemma 1, we can deduce that there exists \(B' \subseteq S\) such that the occurrence vectors of \(B' \cup \{\gamma \}\) are linearly independent and \(\varvec{1}\) is in the span of the occurrence vectors of \(B' \cup \{\gamma \}\). Since no vector of the form \(n \cdot \varvec{1} \, (n \ne 0)\) is among the occurrence vectors of \(B' \cup \{\gamma \}\), we can take a desired basis \(B\) as an extension of \(B' \cup \{\gamma \}\) (\(B' \cup \{\gamma \} \subseteq B\)).    \(\square \)

5.2 Proof of Theorem 1

To prove the “only if” part, we modify the standard Pumping Lemma as follows and call it the Shrinking Lemma (see the full version [16] for the proof).

Lemma 4

(Shrinking Lemma for regular languages). Let \(L \subseteq A^*\) be a regular language. Then there exists a constant \(c \in \mathbb {N}\) such that, for any number \(n \ge c\) and for any word \(w \in L\) with \(|w| \ge n\), for any factorisation \(w = xyz\) such that \(|y| = n \ge c\), there exists a word \(y'\) such that (1) \(y' \sqsubseteq _{\mathrm {sc}}y\), (2) \(|y'| \le c\) and (3) \(x y' z \in L\).

Now we prove Theorem 1. Let \(N = \max (|w_1|, \ldots , |w_k|)\). The “only if” part is shown by contraposition. Assume that the dimension of \(L = L_{\!A}(w_1, \ldots , w_k)\) is two (the higher-dimensional case can be shown similarly). Because \(L\) is of dimension two, there exists a maximum pumpable set \(S = \{\gamma _{i_1}, \ldots , \gamma _{i_j}\}\) in some trace \(T = \{\pi \} \cup \{\gamma _1, \ldots , \gamma _m\} \) in \({\mathcal G}^{N}_A\) such that two occurrence vectors \(|\gamma _{\alpha }|_{(w_1,\ldots ,w_k)}\) and \(|\gamma _{\beta }|_{(w_1,\ldots ,w_k)}\) of two cycles \(\gamma _{\alpha }\) and \(\gamma _{\beta }\) in \(S\) are linearly independent and any occurrence vector of an element of \(S\) can be represented as a linear combination of \(|\gamma _{\alpha }|_{(w_1,\ldots ,w_k)}\) and \(|\gamma _{\beta }|_{(w_1,\ldots ,w_k)}\). By Condition (2) of Lemma 3, we can assume that \(\mathtt {diff}(|\gamma _{\alpha }|_{(w_1, \ldots , w_k)}) \ne 0\) and \(\mathtt {diff}(|\gamma _{\beta }|_{(w_1, \ldots , w_k)}) \ne 0\). Since \(S\) is a maximum pumpable set and the dimension of \(L\) is two, there exists a constant \(c_T \in \mathbb {N}\) such that for any \(n \in \mathbb {N}\) there exists a word \(uv_n \in L\) with \({{\,\mathrm{Tr}\,}}(\mathtt {walk}_{{\mathcal G}^{N}_A}(u,v_n)) = T\), \(({{\,\mathrm{\mathbb {N}Tr}\,}}(\mathtt {walk}_{{\mathcal G}^{N}_A}(u,v_n)))(\gamma _\alpha ) = n_\alpha \), \(({{\,\mathrm{\mathbb {N}Tr}\,}}(\mathtt {walk}_{{\mathcal G}^{N}_A}(u,v_n)))(\gamma _\beta ) = n_\beta \ge n\) and \(({{\,\mathrm{\mathbb {N}Tr}\,}}(\mathtt {walk}_{{\mathcal G}^{N}_A}(u,v_n)))(\gamma _i) \le c_T\) for each \(i \in (\{1,\ldots ,m\}\setminus \{\alpha ,\beta \})\). By Proposition 3, we can assume that the walk \(\mathtt {walk}_{{\mathcal G}^{N}_A}(u,v_n)\) is of the form

$$ \mathtt {walk}_{{\mathcal G}^{N}_A}(u,v_n) = \omega _1 \odot \gamma _\alpha ^{n_\alpha } \odot \omega _2 \odot \gamma _\beta ^{n_\beta } \odot \omega _3. $$

Intuitively, \(\mathtt {walk}_{{\mathcal G}^{N}_A}(u,v_n)\) first moves to \(\mathtt {from}(\gamma _\alpha )\) (part of \(\omega _1\)), then passes \(\gamma _\alpha \) repeatedly \(n_\alpha \)-times and moves to \(\mathtt {from}(\gamma _\beta )\) (part of \(\gamma _\alpha ^{n_\alpha } \odot \omega _2\)), and finally passes \(\gamma _\beta \) repeatedly \(n_\beta \)-times and moves to the end (part of \(\gamma _\beta ^{n_\beta } \odot \omega _3\)). If \(L\) is regular, then by Lemma 4, there exists a constant \(c\) such that for any \(n \ge c\) and the factorisation \(uv_n = x y_n z_n\), where \(x,y_n\) and \(z_n\) are the words corresponding to the first, second and last parts of the walk described above, there exists a word \(y'_n\) satisfying conditions (1)–(3) in Lemma 4. Because \(\mathtt {diff}(|\gamma _\beta |_{(w_1, \ldots , w_k)}) \ne 0\), we have \(|\gamma _{\beta }|_{w_j} < |\gamma _{\beta }|_{w_{j'}}\) for some \(1 \le j,j' \le k\). However, since the lengths of \(x\) and \(y'_n\) are bounded by a constant but \(z_n\) can be arbitrarily long, the gap of the occurrences \(|z_n|_{w_{j'}} - |z_n|_{w_j}\) can be arbitrarily large (thus \(|x y'_n z_n|_{w_{j'}} - |x y'_n z_n|_{w_j}\) can be arbitrarily large, too). This means that \(x y'_n z_n \not \in L\) for sufficiently large \(n\), a contradiction.

The “if” part is achieved by showing that the language \(L_T = \{ uv \in L \mid |u| = N, {{\,\mathrm{Tr}\,}}(\mathtt {walk}_{{\mathcal G}^{N}_A}(u, v)) = T \} \) is regular for each trace \(T = \{\pi \} \cup \{\gamma _1, \ldots , \gamma _m\}\) in \({\mathcal G}^{N}_A\). It implies that \(L\) is regular because \(L = L^{<N} \cup \bigcup _{T: \text {trace}} L_T\) (notice that \(L^{<N} = \{ w \in L \mid |w| < N \}\) is finite and thus regular). One can observe that \(L = \{ w \in L \mid |w| < N\} \cup \bigcup _{T: \text { trace in } {\mathcal G}^{N}_A} L_T\), hence if every \(L_T\) is regular then \(L\) is also regular. To achieve it, we construct a deterministic automaton \(\mathcal {A}_{T,S}\), where \(S\) is the maximum pumpable set in \(T\), so that \(L_T = L(\mathcal {A}_{T,S})\). Let \(\overline{S} = (T \setminus S \setminus \{\pi \})\) and define

$$ c_T = \max \{ ({{\,\mathrm{\mathbb {N}Tr}\,}}(\mathtt {walk}_{{\mathcal G}^{N}_A}(u,v)))(\gamma ) \mid uv \in L_T,\ \gamma \in \overline{S} \} $$

(notice that \(\max \emptyset = 0\) as usual). \(c_T\) is a well-defined natural number because, by the definition of a pumpable set and \(S\) being maximum, for any cycle \(\gamma \) in \(T\) but not in \(S\), the maximum number of occurrences of \(\gamma \) in a walk of some word in \(L\) is bounded. We denote by \(\mathcal {F}\) the set of all functions from \(\overline{S}\) to \(\mathbb {N}^{\le c_T}\). Since both \(\overline{S}\) and \(\mathbb {N}^{\le c_T}\) are finite, \(\mathcal {F}\) is also finite. Let \(f_0 \in \mathcal {F}\) be the constant map to 0. Then the construction is as follows: \(\mathcal {A}_{T, S} = (Q, \delta , \varepsilon , F)\), where each component is defined in Fig. 5.

Fig. 5. The construction of \(\mathcal {A}_{T,S} = (Q, \delta ,\varepsilon , F)\).

Although the formal definition in Fig. 5 may look complex, the behavior of \(\mathcal {A}_{T,S}\) is simple: it computes the path-cycle decomposition and counts the number of occurrences of each non-pumpable cycle \(\gamma \in \overline{S}\). The main part of the states is \(Q'\), which consists of the path part \({\mathcal P}({\mathcal G}^{N}_A)\), the pumpable-cycles part \(2^S\) and the non-pumpable-cycles part \(\mathcal {F}\). While reading an input word \(w\), \(\mathcal {A}_{T,S}\) extends the path part if the next vertex \(wb\) is not in the current path. If the next vertex \(wb\) is already in the current path, there are four possibilities. If the induced cycle \(\gamma \) on \(wb\) is in \(S\), \(\mathcal {A}_{T,S}\) updates the pumpable-cycles part. The number of occurrences of such a cycle \(\gamma \in S\) need not be memorised, since \(\mathtt {diff}(|\gamma |_{(w_1,\ldots ,w_k)}) = 0\) by Condition (1) of Lemma 3. If \(\gamma \) is not in \(T\), \(\mathcal {A}_{T,S}\) goes to the rejecting state \(q_{\mathtt {rej}}\), since the trace of \(w\) can never be \(T\). If \(\gamma \) is in \(\overline{S}\), there are two further possibilities: if the current number of occurrences of \(\gamma \) is less than \(c_T\), \(\mathcal {A}_{T,S}\) increments it; otherwise, \(\mathcal {A}_{T,S}\) goes to \(q_{\mathtt {rej}}\), because \(w\) can never be in \(L\) by the definition of \(c_T\).    \(\square \)
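The bookkeeping described in the proof above (extend the current path, or close a cycle when a vertex repeats) can be sketched as follows. This is only an illustration under stated assumptions and not the automaton of Fig. 5: the names `windows` and `path_cycle_decomposition` are hypothetical, the vertices visited by a word are assumed to be its length-\(N\) factors read from left to right, and a cycle is represented by the tuple of vertices it visits. The bound \(c_T\) and the rejecting state are omitted.

```python
from collections import Counter


def windows(word, n):
    """Assumed vertex sequence of `word` in a de Bruijn-like graph: its length-n factors."""
    return [word[i:i + n] for i in range(len(word) - n + 1)]


def path_cycle_decomposition(vertices):
    """Split a walk (a list of vertices) into a simple path and a multiset of cycles.

    The current path is extended while vertices are new. When a vertex already
    on the path is reached again, the induced cycle is recorded and the path
    is cut back to the first visit of that vertex.
    """
    path = []
    cycles = Counter()
    for v in vertices:
        if v in path:
            i = path.index(v)
            cycles[tuple(path[i:])] += 1  # cycle starting and ending at v
            del path[i + 1:]              # keep the path up to and including v
        else:
            path.append(v)
    return path, cycles


if __name__ == "__main__":
    path, cycles = path_cycle_decomposition(windows("abababb", 2))
    print(path)
    print(dict(cycles))
```

In this sketch the counter plays the role of the non-pumpable-cycles part; an automaton only needs it up to the bound \(c_T\), and it needs only the set of cycles seen for the pumpable part.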

5.3 Proof of Theorem 2

The proof structure is similar to that of the regular case (Theorem 1). The following lemma is a context-free variant of Lemma 4 (see the full version [16] for the proof).

Lemma 5

(Shrinking Lemma for context-free languages). Let \(L \subseteq A^*\) be a context-free language. Then there exists a constant \(c \in \mathbb {N}\) such that, for any number \(n \ge c\) and for any word \(w \in L\) with \(|w| \ge n\), there exists a factorisation \(w = xyz\) and a word \(y'\) such that (0) \(2n > |y| \ge n \ge c\), (1) \(y' \sqsubseteq _{\mathrm {sc}}y\), (2) \(|y'| \le c\) and (3) \(xy'z \in L\).
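To make conditions (0)–(3) concrete, the following sketch checks them for one given factorisation. It is an illustration only: the function names are hypothetical, \(\sqsubseteq _{\mathrm {sc}}\) is assumed to denote the scattered subword relation, and the language is supplied as a membership predicate. The value of \(c\) used in the example is chosen by hand and is not claimed to be the constant guaranteed by the lemma.

```python
def is_scattered_subword(small, big):
    """True iff `small` can be obtained from `big` by deleting letters."""
    remaining = iter(big)
    # `ch in remaining` consumes the iterator up to the first match,
    # so the letters of `small` are matched from left to right.
    return all(ch in remaining for ch in small)


def satisfies_shrinking_conditions(x, y, z, y_prime, n, c, in_language):
    """Check conditions (0)-(3) of Lemma 5 for w = x y z and the candidate y'."""
    return (
        2 * n > len(y) >= n >= c                 # condition (0)
        and is_scattered_subword(y_prime, y)     # condition (1)
        and len(y_prime) <= c                    # condition (2)
        and in_language(x + y_prime + z)         # condition (3)
    )


def is_anbn(word):
    """Membership predicate for the example language {a^m b^m : m >= 0}."""
    half = len(word) // 2
    return word == "a" * half + "b" * half


if __name__ == "__main__":
    # w = aaaabbbb, factorised around the middle; y = ab shrinks to the empty word.
    print(satisfies_shrinking_conditions("aaa", "ab", "bbb", "", 2, 2, is_anbn))
```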

Now we prove Theorem 2. Let \(N = \max (|w_1|, \ldots , |w_k|)\). The “only if” part is shown by contraposition. Assume that the dimension of \(L = L_{\!A}(w_1, \ldots , w_k)\) is three (the higher-dimensional case can be shown similarly). Because \(L\) is of dimension three, there exists a maximum pumpable set \(S = \{\gamma _{i_1}, \ldots , \gamma _{i_j}\}\) in some trace \(T = \{\pi \} \cup \{\gamma _1, \ldots , \gamma _m\} \) in \({\mathcal G}^{N}_A\) such that the occurrence vectors \(B = \{|\gamma _{\alpha }|_{(w_1,\ldots ,w_k)}, |\gamma _{\beta }|_{(w_1,\ldots ,w_k)}, |\gamma _{\delta }|_{(w_1,\ldots ,w_k)}\}\) of three cycles \(\gamma _{\alpha }\), \(\gamma _{\beta }\) and \(\gamma _{\delta }\) in \(S\) are linearly independent and the occurrence vector of any element of \(S\) can be represented as a linear combination of \(B\). By Condition (2) of Lemma 3, we can assume that any vector \(\varvec{v}\) in \(B\) satisfies \(\mathtt {diff}(\varvec{v}) \ne 0\). Since \(S\) is a maximum pumpable set and the dimension of \(L\) is three, there exists a constant \(c_T \in \mathbb {N}\) such that for any \(n \in \mathbb {N}\) there exists a word \(uv_n \in L\) with \({{\,\mathrm{Tr}\,}}(\mathtt {walk}_{{\mathcal G}^{N}_A}(u,v_n)) = T\), \(({{\,\mathrm{\mathbb {N}Tr}\,}}(\mathtt {walk}_{{\mathcal G}^{N}_A}(u,v_n)))(\gamma _i) = n_i \ge n \) for each \(i \in \{\alpha ,\beta ,\delta \}\) and \(({{\,\mathrm{\mathbb {N}Tr}\,}}(\mathtt {walk}_{{\mathcal G}^{N}_A}(u,v_n)))(\gamma _i) \le c_T\) for each \(i \in (\{1,\ldots ,m\}\setminus \{\alpha ,\beta ,\delta \})\). By Proposition 3, we can assume that the walk \(\mathtt {walk}_{{\mathcal G}^{N}_A}(u,v_n)\) is of the form

$$ \mathtt {walk}_{{\mathcal G}^{N}_A}(u,v_n) = \omega _1 \odot \gamma _\alpha ^{n_\alpha } \odot \omega _2 \odot \gamma _\beta ^{n_\beta } \odot \omega _3 \odot \gamma _\delta ^{n_\delta } \odot \omega _4. $$

Let \(u_{n,1}, u_{n,2}\) and \(u_{n,3}\) be the words corresponding to \(\omega _1 \odot \gamma _\alpha ^{n_\alpha }, \omega _2 \odot \gamma _\beta ^{n_\beta }\) and \(\omega _3 \odot \gamma _\delta ^{n_\delta } \odot \omega _4\), respectively (thus \(uv_n = u_{n,1} u_{n,2} u_{n,3}\)). Let \(M_n = \min \{ n_\alpha \cdot |\gamma _\alpha |, n_\beta \cdot |\gamma _\beta |, n_\delta \cdot |\gamma _\delta | \}\). If \(L\) is context-free, then by Lemma 5 there exists a constant \(c\) such that, for any \(n \ge c\), there is a factorisation \(uv_n = x_n y_n z_n\) and a word \(y_n'\) satisfying conditions (0)–(3) in Lemma 5. Take \(n \in \mathbb {N}\) that satisfies \(M_n \ge c\). Then the word \(y_n\) in the factorisation \(uv_n = x_n y_n z_n\) above can cross at most two of the words \(u_{n,1}, u_{n,2}, u_{n,3}\), so at least one of them is left untouched when \(y_n\) is replaced by \(y'_n\). Since the vectors in \(B\) are linearly independent, this implies that \(x_n y'_n z_n \not \in L\) for sufficiently large \(n\), a contradiction.
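The linear-independence condition on the occurrence vectors in \(B\) used in the proof above can be checked mechanically. The sketch below is a generic rank computation over the rationals and is not taken from the paper: the function name is hypothetical, and the vectors in the example are made up for illustration rather than derived from actual cycles.

```python
from fractions import Fraction


def rank(vectors):
    """Rank over the rationals of a list of equal-length integer vectors."""
    rows = [[Fraction(x) for x in v] for v in vectors]
    if not rows:
        return 0
    r = 0  # number of pivots found so far
    for col in range(len(rows[0])):
        pivot = next((i for i in range(r, len(rows)) if rows[i][col] != 0), None)
        if pivot is None:
            continue
        rows[r], rows[pivot] = rows[pivot], rows[r]
        for i in range(len(rows)):
            if i != r and rows[i][col] != 0:
                factor = rows[i][col] / rows[r][col]
                rows[i] = [a - factor * b for a, b in zip(rows[i], rows[r])]
        r += 1
        if r == len(rows):
            break
    return r


if __name__ == "__main__":
    # Three made-up occurrence vectors for k = 3 words.
    print(rank([(2, 1, 1), (1, 2, 1), (1, 1, 2)]))
    # A dependent family: the third vector is the sum of the first two.
    print(rank([(1, 0, 0), (0, 1, 0), (1, 1, 0)]))
```

Exact rational arithmetic is used so that the result does not depend on floating-point rounding.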

The “if” part is achieved in a similar way to the regular case: we can construct a pushdown automaton \(\mathcal {A}_{T,S}\), where \(S\) is the maximum pumpable set in \(T\), such that \(L_T = L(\mathcal {A}_{T,S})\). The only difference is that \(\mathcal {A}_{T,S}\) uses its stack to check the consistency of the occurrences of two linearly independent occurrence vectors. \(\mathcal {A}_{T,S}\) achieves this in the same way as a pushdown automaton recognises \(L_{\!A}(a,b)\).    \(\square \)
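For reference, the stack discipline by which a pushdown automaton can recognise \(L_{\!A}(a,b)\) may be sketched as follows. This is a simulation written for illustration, not the automaton \(\mathcal {A}_{T,S}\) itself: the function name and the stack symbols are hypothetical, and a Python list stands in for the stack. A real pushdown automaton would use a bottom-of-stack marker to detect emptiness.

```python
def accepts_equal_a_b(word):
    """Simulate a one-stack recogniser for { w : |w|_a = |w|_b }.

    The stack holds only 'A's (surplus of a) or only 'B's (surplus of b).
    Letters other than a and b leave the stack unchanged.
    """
    stack = []
    for letter in word:
        if letter == "a":
            if stack and stack[-1] == "B":
                stack.pop()        # an a cancels a pending surplus b
            else:
                stack.append("A")  # record one more surplus a
        elif letter == "b":
            if stack and stack[-1] == "A":
                stack.pop()        # a b cancels a pending surplus a
            else:
                stack.append("B")  # record one more surplus b
    return not stack               # accept iff no surplus remains


if __name__ == "__main__":
    for w in ("", "abba", "cabc", "aab"):
        print(repr(w), accepts_equal_a_b(w))
```

The stack height equals the current absolute difference between the two counts, which is the quantity that \(\mathcal {A}_{T,S}\) would have to track for the two linearly independent occurrence vectors.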

6 Conclusion and Future Work

In this paper, we provided decidable, necessary and sufficient conditions for the regularity and context-freeness of WMIX languages by using the notion of dimension. Complexity issues for these problems (tight lower and upper bounds, more efficient algorithms, etc.) are left untouched and could be future work.

The author’s main interest is how to generalise the main result to richer language classes, e.g., UnCA languages [11]. Moving from WMIX languages (represented by de Bruijn graphs and diagonals \(\{n\cdot \varvec{1} \mid n \in \mathbb {N}\}\)) to UnCA languages (represented by unambiguous automata and semilinear sets) would require modifying the notion of dimension and some parts of the proof strategy, but the author conjectures that context-freeness is still decidable for UnCA languages.