Elsevier

Theoretical Computer Science

Volume 866, 18 April 2021, Pages 70-81

Shortest covers of all cyclic shifts of a string

https://doi.org/10.1016/j.tcs.2021.03.011

Highlights

  • Computing shortest covers of cyclic shifts of a string is considered.

  • This problem extends in a natural way previous results on quasiperiodicity.

  • We give an O(n log n)-time algorithm for this problem.

  • Shortest covers of cyclic shifts of a Fibonacci string are also characterized.

Abstract

A factor C of a string S is called a cover of S if each position of S is contained in an occurrence of C. Breslauer (1992) [3] proposed a well-known O(n)-time algorithm that computes the shortest cover of every prefix of a string of length n. We show an O(n log n)-time and O(n)-space algorithm that computes the shortest cover of every cyclic shift of a string of length n and an O(n)-time algorithm that computes the shortest among these covers. We also provide a combinatorial characterization of shortest covers of cyclic shifts of Fibonacci strings that leads to efficient algorithms for computing these covers.

We further consider the bound on the number of different lengths of shortest covers of cyclic shifts of the same string of length n. We show that this number is Θ(log n) for Fibonacci strings.

Introduction

We consider strings as finite sequences of letters drawn from an alphabet Σ = [0, n^{O(1)}], often referred to as an integer alphabet [1]. The notion of periodicity in strings and its many variants have been well studied in many fields such as combinatorics on words, pattern matching, data compression, automata theory, formal language theory, and molecular biology. A typical regularity, the period U of a given string S, captures the repetitiveness of S, since S is a prefix of a string constructed by concatenations of U. If S = AWB for some, possibly empty, strings A, W, B, then W is called a factor of S and, respectively, S is a superstring of W. A factor C of S is called a cover of S if each position of S is contained in an occurrence of C. A factor C of S is called a seed of S if there exists a superstring of S which is constructed by concatenations and superpositions of C. In other words, C is a seed of S if S is covered by occurrences and left and right overhangs of C. For example, abc is a period of abcabcabca, abca is a cover of abcabcaabca, and abca is a seed of bcabcaabc. The notions “cover” and “seed” are generalizations of periods in the sense that superpositions as well as concatenations are considered to define them, whereas only concatenations are considered for periods.
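The three notions can be checked directly from their definitions. The following is a minimal Python sketch, not an efficient algorithm; the function names `is_period`, `is_cover` and `is_seed` are illustrative and do not come from the paper. The seed check follows the "occurrences and left and right overhangs" reading given above and additionally requires C to be a factor of S.

```python
def is_period(u, s):
    """True if s is a prefix of u concatenated with itself repeatedly."""
    return bool(u) and all(s[i] == u[i % len(u)] for i in range(len(s)))


def is_cover(c, s):
    """True if every position of s lies inside some occurrence of c."""
    if not c or len(c) > len(s):
        return False
    covered_up_to = 0  # positions [0, covered_up_to) are covered so far
    for i in range(len(s) - len(c) + 1):
        if i > covered_up_to:
            return False  # position covered_up_to can no longer be covered
        if s.startswith(c, i):
            covered_up_to = i + len(c)
    return covered_up_to == len(s)


def is_seed(c, s):
    """True if s is covered by occurrences and left/right overhangs of c."""
    m, n = len(c), len(s)
    if m == 0 or c not in s:  # the text defines a seed as a factor of s
        return False
    covered_up_to = 0
    # A negative offset is a left overhang; an offset > n - m is a right one.
    for off in range(-(m - 1), n):
        lo, hi = max(off, 0), min(off + m, n)
        if lo > covered_up_to:
            return False
        if s[lo:hi] == c[lo - off:hi - off]:
            covered_up_to = max(covered_up_to, hi)
    return covered_up_to == n


print(is_period("abc", "abcabcabca"))   # example from the text
print(is_cover("abca", "abcabcaabca"))  # example from the text
print(is_seed("abca", "bcabcaabc"))     # example from the text
```

Each check scans candidate placements from left to right and fails as soon as a position is left uncovered, which keeps the code close to the wording of the definitions.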

In computation of covers, two problems have been considered in the literature. The shortest-cover problem (also known as the superprimitivity test) is that of computing the shortest cover of a given string of length n, and the all-covers problem is that of computing all the covers of a given string. Apostolico et al. [2] introduced the notion of covers and gave a linear-time algorithm for the shortest-cover problem. Breslauer [3] proposed an on-line algorithm for computing the shortest cover that works in linear time. In particular, his algorithm computes the shortest cover of every prefix of a string. The other direction was taken by Moore and Smyth [4], [5] and by Li and Smyth [6] who computed all the covers of a string and a representation of all the covers of all prefixes of a string, respectively. A circular string S corresponding to a given string S is formed by concatenating the first letter of S to the right of its last letter. Covers of circular strings were also considered. It is implicit in [7] that covers of a circular string S are exactly seeds of S^2. Covers and seeds of Fibonacci strings were studied in [8], whereas covers of circular Fibonacci strings were considered in [9].

All the seeds of a string of length n can be represented in O(n) space as a collection of a linear number of disjoint paths in the suffix trees of the string and of its reversal. This representation can be computed in O(n log n) time [7] and even in O(n) time [10]. Recently it was also shown in [11] that all the seeds can also be represented as a linear number of disjoint paths in just the suffix tree of the string. This implies the following fact:
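To illustrate computing the shortest cover of every prefix in linear time, here is a Python sketch of the classical border-array technique, written in the spirit of the on-line algorithm mentioned above. It is a reconstruction from the standard textbook description, not code taken from [3]; the names `shortest_prefix_covers`, `border`, `cover` and `reach` are assumptions made for this example.

```python
def shortest_prefix_covers(s):
    """Return a list whose entry i-1 is the shortest-cover length of s[:i]."""
    n = len(s)
    # border[i] = length of the longest proper border of the prefix s[:i]
    border = [0] * (n + 1)
    k = 0
    for i in range(2, n + 1):
        while k > 0 and s[i - 1] != s[k]:
            k = border[k]
        if s[i - 1] == s[k]:
            k += 1
        border[i] = k

    cover = [0] * (n + 1)  # cover[i] = shortest-cover length of s[:i]
    reach = [0] * (n + 1)  # reach[c] = longest prefix seen so far covered by s[:c]
    for i in range(1, n + 1):
        # A proper cover of s[:i] must be the shortest cover of its longest border.
        c = cover[border[i]]
        if c > 0 and reach[c] >= i - c:
            cover[i] = c
            reach[c] = i
        else:
            cover[i] = i  # s[:i] is covered only by itself
            reach[i] = i
    return cover[1:]


print(shortest_prefix_covers("ababa"))            # → [1, 2, 3, 2, 3]
print(shortest_prefix_covers("abcabcaabca")[-1])  # → 4
```

The second call reproduces the earlier example in which abca is a cover of abcabcaabca.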

Lemma 1

The problem of computing the shortest cover of a circular string can be solved in linear time.

We say that a string Y is a cyclic shift of a string X if X = AB and Y = BA for some strings A and B; in this case we also write Y = rot_{|A|}(X). It seems that the problem of computing shortest covers of all cyclic shifts of a string is harder than that of computing the shortest cover of a circular string. A straightforward application of any of the aforementioned algorithms for computing covers of a string yields an O(n^2)-time solution to the problem. One should note that covers of circular strings are a different notion than that of covers of cyclic shifts of a string; see Fig. 1.

The shortest covers of cyclic shifts of a string can behave rather irregularly. For example, the length of the shortest cover of S = abaabababababababa equals 3, whereas the shortest cover of rot_1(S) has length 18.

We consider the following problem.

Let S be a string of length n and ShCov(S) denote the shortest cover of S. We introduce an array CC_S of length n such that CC_S[i] = |ShCov(rot_i(S))|. Our main result is computing this array. We also denote CCSet(S) = {CC_S[i] : i = 0, …, n−1}.

Example 1

For the Fibonacci strings S_1 = abaab, S_2 = abaababaabaab we have:
CC_{S_1} = [5, 5, 5, 3, 5], CC_{S_2} = [5, 5, 13, 3, …],
CCSet(S_1) = {3, 5}, CCSet(S_2) = {3, 5, 8, 13}.
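The array of shortest-cover lengths of all cyclic shifts can be reproduced on small inputs by brute force. The Python sketch below applies the definitions directly and is far slower than the O(n log n) algorithm of the paper; it is meant only for checking small examples. The helper names `rot`, `shortest_cover_length`, `cc_array` and `cc_set` are illustrative assumptions.

```python
def rot(s, i):
    """Cyclic shift: s = AB with |A| = i is mapped to BA."""
    return s[i:] + s[:i]


def shortest_cover_length(s):
    """Length of the shortest cover of s, by trying every prefix in turn."""
    n = len(s)
    for m in range(1, n + 1):
        c, covered_up_to = s[:m], 0
        for i in range(n - m + 1):
            if i > covered_up_to:
                break  # a gap appeared, so this prefix is not a cover
            if s.startswith(c, i):
                covered_up_to = i + m
        if covered_up_to == n:
            return m
    return 0  # only reached for the empty string


def cc_array(s):
    """Shortest-cover length of every cyclic shift of s."""
    return [shortest_cover_length(rot(s, i)) for i in range(len(s))]


def cc_set(s):
    """Set of distinct shortest-cover lengths over all cyclic shifts."""
    return set(cc_array(s))


print(cc_array("abaab"))                 # → [5, 5, 5, 3, 5]
print(sorted(cc_set("abaababaabaab")))   # → [3, 5, 8, 13]
```

On S = abaabababababababa the same code gives 3 for the string itself and 18 for rot_1(S), matching the irregular behaviour noted earlier.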

Our results. We show that the whole array CC_S and min_i CC_S[i] for a string S of length n can be computed in O(n log n) time and O(n) time, respectively, and O(n) space. For this we use a characterization of covers of cyclic shifts of a string by seeds and squares, i.e., strings of the form W^2, and the suffix tree data structure.

We give a simple recursive formula for computing CC_S for a Fibonacci string S. It implies a linear-time algorithm for computing this array as a whole and can be used to devise time and space efficient algorithms for computing subsequent elements of this array. We also show that for the family of Fibonacci strings we have |CCSet(S)| = Θ(log |S|).

Structure of the paper. In Section 2 we recall basic definitions and illustrate them by showing a linear-time algorithm that solves a similar problem to the one in scope, that is, computing the shortest periods of all cyclic shifts of a string. Then in Section 3 we present characterizations of shortest covers of cyclic shifts of a string, which lead us to the main algorithmic results in Section 4. Shortest covers of cyclic shifts of Fibonacci strings are studied in Section 5. We conclude and mention some open problems in Section 6.

This is a full version of the paper [12]. In particular, compared to the conference version, it contains a much more precise characterization of shortest covers of cyclic shifts of Fibonacci strings.

Section snippets

Preliminaries

We assume that positions of a string S are numbered 0 through |S|−1, S = S[0] … S[|S|−1]. By S[i..j] we denote a factor of S equal to S[i] … S[j]. A factor is called a prefix if i = 0 and a suffix if j = |S|−1. A factor that occurs both as a prefix and as a suffix of S is called a border of S. A positive integer p is a period of S if S[i] = S[i+p] for all i = 0, …, |S|−p−1.
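Borders and periods are two views of the same structure: S has a border of length b < |S| exactly when |S| − b is a period. The Python sketch below computes the border array with the standard prefix-function technique and reads all periods off the chain of borders. The function names `border_array` and `periods` are assumptions made for this example.

```python
def border_array(s):
    """b[i] = length of the longest proper border of s[:i+1]."""
    b = [0] * len(s)
    k = 0
    for i in range(1, len(s)):
        while k > 0 and s[i] != s[k]:
            k = b[k - 1]
        if s[i] == s[k]:
            k += 1
        b[i] = k
    return b


def periods(s):
    """All periods p <= |s| of s in increasing order."""
    n = len(s)
    if n == 0:
        return []
    b = border_array(s)
    result = []
    k = b[-1]
    while k > 0:             # walk the chain of borders, longest first
        result.append(n - k)
        k = b[k - 1]
    result.append(n)         # |s| satisfies the definition vacuously
    return result


print(periods("abcabcabca"))  # → [3, 6, 9, 10]
```

The smallest returned value is the shortest period of the string.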

Covers of cyclic shifts

We denote by ½-Squares(S) (square halves) the set of factors Z of S such that the square Z^2 is also a factor of S and by ½-PSquares(S) the subset of ½-Squares(S) that consists only of primitive strings. We further denote by Seeds(S) the set of factors which are seeds of S. We use these sets for the string S^3 in order to characterize covers of all cyclic shifts of S.
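For small strings the two sets of square halves can be enumerated straight from the definition. The following Python sketch runs in cubic time and is meant only as an illustration; `half_squares`, `half_psquares` and `is_primitive` are hypothetical names, and the primitivity test uses the well-known fact that Z is primitive exactly when Z first reoccurs in ZZ at position |Z|.

```python
def is_primitive(z):
    """True if z is not a power of a shorter string."""
    return bool(z) and (z + z).find(z, 1) == len(z)


def half_squares(s):
    """All factors Z of s such that ZZ is also a factor of s."""
    n = len(s)
    found = set()
    for i in range(n):
        for h in range(1, (n - i) // 2 + 1):
            if s[i:i + h] == s[i + h:i + 2 * h]:
                found.add(s[i:i + h])
    return found


def half_psquares(s):
    """The primitive members of half_squares(s)."""
    return {z for z in half_squares(s) if is_primitive(z)}


print(sorted(half_squares("aaaa")))   # → ['a', 'aa']
print(sorted(half_psquares("aaaa")))  # → ['a']
```

To mirror the use described above, one would call these functions on the string repeated three times, for example `half_psquares(s * 3)`.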

Lemma 3

Let S be a string of length n and C be a string of length up to n. Then C is a cover of rot_i(S) if and only if C ∈ Seeds(S^3) ½-

Main algorithm

First we have to show how to compute efficiently the tree T(S^3). We denote by OccPSquares(S) the set of all occurrences of primitively rooted squares in S. Each occurrence is represented in O(1) space as a factor of S. A direct consequence of the Three-square-prefix Lemma, see [16], is that a string of length n has no more than log n prefixes that are primitively rooted squares.

Lemma 6

[16]

For a string S of length n, |OccPSquares(S)| = O(n log n).
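The set of occurrences of primitively rooted squares can be listed naively for small inputs. In the Python sketch below each occurrence is stored as a pair (start, half-length), which is one possible O(1)-space representation; the pair encoding and the function names `is_primitive` and `occ_psquares` are assumptions made for this example, not the paper's data structure.

```python
def is_primitive(z):
    """True if z is not a power of a shorter string."""
    return bool(z) and (z + z).find(z, 1) == len(z)


def occ_psquares(s):
    """All (start, half) such that s[start:start+2*half] is a primitively rooted square."""
    n = len(s)
    occurrences = []
    for start in range(n):
        for half in range(1, (n - start) // 2 + 1):
            root = s[start:start + half]
            if s[start + half:start + 2 * half] == root and is_primitive(root):
                occurrences.append((start, half))
    return occurrences


print(occ_psquares("aaaa"))  # → [(0, 1), (1, 1), (2, 1)]
print(occ_psquares("abab"))  # → [(0, 2)]
```

Filtering the result for `start == 0` gives the prefixes that are primitively rooted squares, which is the quantity bounded by log n in the text.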

Lemma 7

For a string S of length n, |½-PSquares(S)| = O(n) and this set

Shortest covers of cyclic shifts of Fibonacci strings

Recall that the Fibonacci strings are defined as Fib_0 = b, Fib_1 = a, Fib_k = Fib_{k−1} Fib_{k−2} for k ≥ 2. In other words, Fib_k = ϕ^k(Fib_0), where ϕ is a morphism ϕ(a) = ab, ϕ(b) = a. Hence Fib_2 = ab, Fib_3 = aba, Fib_4 = abaab, Fib_5 = abaababa, …. We denote F_k = |Fib_k|, the k-th Fibonacci number. In this section we give a precise characterization of CC and CCSet of Fibonacci strings. We further denote CF_n = CC_{Fib_n}. See Fig. 8 for an example.
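Both definitions of Fibonacci strings are easy to check against each other. The Python sketch below builds Fib_k from the recurrence and, separately, by iterating the morphism ϕ; the function names `fib_string`, `phi` and `fib_by_morphism` are illustrative.

```python
def fib_string(k):
    """Fib_0 = b, Fib_1 = a, Fib_k = Fib_{k-1} Fib_{k-2} for k >= 2."""
    prev, cur = "b", "a"
    if k == 0:
        return prev
    for _ in range(k - 1):
        prev, cur = cur, cur + prev
    return cur


def phi(word):
    """The morphism a -> ab, b -> a applied letter by letter."""
    return "".join("ab" if ch == "a" else "a" for ch in word)


def fib_by_morphism(k):
    """Fib_k obtained as phi applied k times to Fib_0 = b."""
    word = "b"
    for _ in range(k):
        word = phi(word)
    return word


print([fib_string(k) for k in range(6)])
# → ['b', 'a', 'ab', 'aba', 'abaab', 'abaababa']
print(all(fib_string(k) == fib_by_morphism(k) for k in range(15)))  # → True
```

The lengths `len(fib_string(k))` give the numbers F_k used in this section.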

Let lcp(U,V) denote the length of the longest common prefix of strings U and V. We use the

Conclusions and open problems

Breslauer [3] proposed a linear-time algorithm for computing the shortest cover of every prefix of a string. We have proposed an O(n log n)-time algorithm for computing the shortest cover of every cyclic shift of a string. It remains an open problem if these values can be computed in O(n) time.

O(n)-, O(n log n)- and O(n^2)-time algorithms for computing the shortest left seed, right seed, and seed, respectively, of all the prefixes of a string are known; see [25], [26]. Here left and right seed are

Declaration of Competing Interest

The authors declare that they have no known competing financial interests or personal relationships that could have appeared to influence the work reported in this paper.

References (26)

  • M. Christou et al., Efficient seed computation revisited, Theor. Comput. Sci. (2013)

  • M. Farach, Optimal suffix tree construction with large alphabets

  • Y. Li et al., Computing the cover array in linear time, Algorithmica (2002)

    1

    Supported by the Polish National Science Center, grant no. 2018/31/D/ST6/03991.
