
Fast sequence segmentation using log-linear models

Published in: Data Mining and Knowledge Discovery

Abstract

Sequence segmentation is a well-studied problem, where given a sequence of elements, an integer K, and some measure of homogeneity, the task is to split the sequence into K contiguous segments that are maximally homogeneous. A classic approach to find the optimal solution is by using a dynamic program. Unfortunately, the execution time of this program is quadratic with respect to the length of the input sequence. This makes the algorithm slow for a sequence of non-trivial length. In this paper we study segmentations whose measure of goodness is based on log-linear models, a rich family that contains many of the standard distributions. We present a theoretical result allowing us to prune many suboptimal segmentations. Using this result, we modify the standard dynamic program for 1D log-linear models, and by doing so reduce the computational time. We demonstrate empirically, that this approach can significantly reduce the computational burden of finding the optimal segmentation.
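The classic quadratic dynamic program mentioned in the abstract can be sketched as follows. This is a generic baseline in the spirit of Bellman (1961), not the paper's pruned algorithm; the squared-error homogeneity measure and all function names are illustrative assumptions.

```python
import numpy as np

def segment(x, K):
    """Optimal K-segmentation of sequence x under squared-error cost,
    via the classic O(n^2 K) dynamic program."""
    x = np.asarray(x, dtype=float)
    n = len(x)
    # Prefix sums let us evaluate the cost of any segment x[i:j] in O(1).
    p1 = np.concatenate(([0.0], np.cumsum(x)))
    p2 = np.concatenate(([0.0], np.cumsum(x * x)))

    def sse(i, j):
        # Sum of squared deviations from the mean of x[i:j].
        s, s2, m = p1[j] - p1[i], p2[j] - p2[i], j - i
        return s2 - s * s / m

    INF = float("inf")
    cost = np.full((K + 1, n + 1), INF)
    back = np.zeros((K + 1, n + 1), dtype=int)
    cost[0, 0] = 0.0
    for k in range(1, K + 1):
        for j in range(k, n + 1):
            # Try every start point i for the last segment x[i:j];
            # this inner loop is what makes the program quadratic in n.
            for i in range(k - 1, j):
                c = cost[k - 1, i] + sse(i, j)
                if c < cost[k, j]:
                    cost[k, j], back[k, j] = c, i
    # Recover segment boundaries by walking the back pointers.
    bounds, j = [n], n
    for k in range(K, 0, -1):
        j = back[k, j]
        bounds.append(j)
    return bounds[::-1]  # e.g. [0, b1, n] for K = 2

# A sequence with one obvious change point splits where expected.
print(segment([0, 0, 0, 5, 5, 5], 2))  # → [0, 3, 6]
```

Replacing `sse` with the negated log-likelihood of a log-linear model fitted to each candidate segment would correspond to the scoring setup the paper builds on; the pruning result in the paper aims to avoid scanning all split points `i` in the inner loop.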



Notes

  1. Calders et al. (2007) deal only with binary sequences, but these results extend easily to the general case.

  2. The implementation of the algorithm is given at http://adrem.ua.ac.be/segmentation.

  3. For clarity's sake, figures show average lifetimes of bins containing 40 points.

  4. The datasets were obtained from http://www.cs.ucr.edu/~eamonn/discords/.

References

  • Basseville M, Nikiforov IV (1993) Detection of abrupt changes—theory and application. Prentice-Hall, Englewood Cliffs

  • Bellman R (1961) On the approximation of curves by line segments using dynamic programming. Commun ACM 4(6):284

  • Bernaola-Galván P, Román-Roldán R, Oliver JL (1996) Compositional segmentation and long-range fractal correlations in DNA sequences. Phys Rev E Stat Phys Plasmas Fluids Relat Interdiscip Topics 53(5):5181–5189

  • Calders T, Dexters N, Goethals B (2007) Mining frequent itemsets in a stream. In: ICDM, pp 83–92

  • Douglas D, Peucker T (1973) Algorithms for the reduction of the number of points required to represent a digitized line or its caricature. Can Cartogr 10(2):112–122

  • Džeroski S, Goethals B, Panov P (eds) (2011) Inductive databases and constraint-based data mining. Springer, New York

  • Gedikli A, Aksoy H, Unal NE, Kehagias A (2010) Modified dynamic programming approach for offline segmentation of long hydrometeorological time series. Stoch Environ Res Risk Assess 24(5):547–557

  • Gionis A, Mannila H (2003) Finding recurrent sources in sequences. In: Proceedings of the seventh annual international conference on research in computational molecular biology, RECOMB ’03, pp 123–130

  • Grünwald P (2007) The minimum description length principle. MIT Press, Cambridge

  • Haiminen N, Gionis A (2004) Unimodal segmentation of sequences. In: ICDM, pp 106–113

  • Himberg J, Korpiaho K, Mannila H, Tikanmäki J, Toivonen H (2001) Time series segmentation for context recognition in mobile devices. In: ICDM, pp 203–210

  • Keogh EJ, Lin J, Fu AWC (2005) HOT SAX: efficiently finding the most unusual time series subsequence. In: ICDM, pp 226–233

  • Kifer D, Ben-David S, Gehrke J (2004) Detecting change in data streams. In: VLDB, pp 180–191

  • Lavrenko V, Schmill M, Lawrie D, Ogilvie P, Jensen D, Allan J (2000) Mining of concurrent text and time series. In: KDD workshop on text mining, pp 37–44

  • Palpanas T, Vlachos M, Keogh EJ, Gunopulos D, Truppel W (2004) Online amnesic approximation of streaming time series. In: ICDE, pp 339–349

  • Schwarz G (1978) Estimating the dimension of a model. Ann Stat 6(2):461–464

  • Shatkay H, Zdonik SB (1996) Approximate queries and representations for large data sequences. In: ICDE, pp 536–545

  • Terzi E, Tsaparas P (2006) Efficient algorithms for sequence segmentation. In: SDM


Acknowledgments

Nikolaj Tatti was partly supported by a Post-Doctoral Fellowship of the Research Foundation-Flanders (FWO).

Correspondence to Nikolaj Tatti.

Additional information

Communicated by Hendrik Blockeel, Kristian Kersting, Siegfried Nijssen, Filip Zelezny.

Appendix: Proofs


1.1 A.1 Proof of Theorem 1

Theorem 1 will follow from the following theorem.

Theorem 9

Let \(D = \left(D_1 ,\ldots , D_e\right)\) and let \(1 \le m < e\). Assume that \({ diff }\mathopen {}\left(D[1, m], D[m + 1, e]\right)\) is a cover. Then there exists \(n > m\) such that \({ sc }\mathopen {}\left([1, n], [n + 1, e]\right) > { sc }\mathopen {}\left([1, m], [m + 1, e]\right)\) or there exists \(l < m\) such that \({ sc }\mathopen {}\left([1, l], [l + 1, e]\right) \ge { sc }\mathopen {}\left([1, m], [m + 1, e]\right)\).

In order to prove the theorem we will introduce some helpful notation. First, given parameter vectors \(s\) and \(r\), we define

$$\begin{aligned} h(k \mid s, r) = { sc }\mathopen {}\left([1, k] \mid s\right) + { sc }\mathopen {}\left([k + 1, e] \mid r\right). \end{aligned}$$

Note that \(h(k \mid s, r) \le { sc }\mathopen {}\left([1, k], [k + 1, e]\right)\). We also define

$$\begin{aligned} g(l, \delta \mid s, r) = l\left(Z(s) - Z(r) + (s - r)^T\delta \right). \end{aligned}$$

This function is essentially the difference between two scores.

Lemma 1

Let \(k > l\). We have \(h(k \mid s, r) - h(l \mid s, r) =g(k - l, { av }\mathopen {}\left(l + 1, k\right) \mid s, r)\).

Proof

Note that

$$\begin{aligned} h(k \mid s, r)&= kZ(s) + s^T{ c }\mathopen {}\left(k\right) + (e - k)Z(r) + r^T({ c }\mathopen {}\left(e\right) - { c }\mathopen {}\left(k\right)) \\&= k(Z(s) - Z(r)) + (s - r)^T{ c }\mathopen {}\left(k\right) + eZ(r) + r^T{ c }\mathopen {}\left(e\right). \end{aligned}$$

The last two terms do not depend on \(k\). This allows us to write

$$\begin{aligned} h(k \mid s, r) - h(l \mid s, r)&= k(Z(s) - Z(r)) + (s - r)^T{ c }\mathopen {}\left(k\right)\\&- l(Z(s) - Z(r)) - (s - r)^T{ c }\mathopen {}\left(l\right) \\&= (k - l)\left(Z(s) - Z(r) + (s - r)^T\frac{{ c }\mathopen {}\left(k\right) - { c }\mathopen {}\left(l\right)}{k - l}\right)\\&= g(k - l, { av }\mathopen {}\left(l + 1, k\right) \mid s, r). \end{aligned}$$

This completes the proof. \(\square \)
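Since the identity in Lemma 1 is purely algebraic and holds for any log-partition function \(Z\), it can be sanity-checked numerically. The sketch below assumes a one-dimensional parameter, a cumulative sufficient statistic \(c(k)\), and a concrete choice of \(Z\); the function names mirror the notation above, but the model and data are illustrative.

```python
import numpy as np

rng = np.random.default_rng(0)
e = 20
D = rng.normal(size=e)  # a 1-D sequence of sufficient statistics

def c(k):
    # Cumulative sufficient statistic c(k) = sum of D_1, ..., D_k.
    return D[:k].sum()

def av(a, b):
    # Average of positions a..b (1-based, inclusive): av(l+1, k) = (c(k) - c(l)) / (k - l).
    return (c(b) - c(a - 1)) / (b - a + 1)

def Z(s):
    # An arbitrary smooth stand-in for the log-partition function (illustrative).
    return -0.5 * s * s

def h(k, s, r):
    # h(k | s, r) = sc([1, k] | s) + sc([k + 1, e] | r), expanded as in the proof of Lemma 1.
    return k * Z(s) + s * c(k) + (e - k) * Z(r) + r * (c(e) - c(k))

def g(l, delta, s, r):
    # g(l, delta | s, r) = l (Z(s) - Z(r) + (s - r) delta).
    return l * (Z(s) - Z(r) + (s - r) * delta)

# Lemma 1: h(k) - h(l) = g(k - l, av(l + 1, k)) for any k > l.
s, r = 0.7, -0.3
l, k = 4, 13
lhs = h(k, s, r) - h(l, s, r)
rhs = g(k - l, av(l + 1, k), s, r)
assert abs(lhs - rhs) < 1e-12
```

The check passes for any choice of \(s\), \(r\), \(l < k\), and \(Z\), because the \(Z\)-terms and the cumulative statistics cancel exactly as in the displayed derivation.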

Proof of Theorem 9 Write \(y = { sc }\mathopen {}\left([1, m], [m + 1, e]\right)\) and define

$$\begin{aligned} x = \max _{k < m} { sc }\mathopen {}\left([1, k], [k + 1, e]\right) \quad {\text{ and}}\quad z = \max _{k > m} { sc }\mathopen {}\left([1, k], [k + 1, e]\right). \end{aligned}$$

We need to show that either \(x \ge y\) or \(z > y\). Assume that \(z \le y\). Fix \(\epsilon > 0\). By definition, there exist \(s\) and \(r\) such that

$$\begin{aligned} { sc }\mathopen {}\left([1, m] \mid s\right) + { sc }\mathopen {}\left([m + 1, e] \mid r\right) \ge y - \epsilon . \end{aligned}$$

From now on we will write \(h(k)\) to mean \(h(k \mid s, r)\) and \(g(k, \delta )\) to mean \(g(k, \delta \mid s, r)\). We must have \(h(m) + \epsilon \ge y \ge z\) or, equivalently, \(\epsilon \ge z - h(m)\).

Since \({ diff }\mathopen {}\left(D[1, m], D[m + 1, e]\right)\) is a cover, there exist integers \(l\) and \(n, 0 \le l < m < n \le e\), such that \((\alpha - \beta )^T(s - r) \ge 0\), where \(\alpha = { av }\mathopen {}\left(m + 1, n\right)\) and \(\beta = { av }\mathopen {}\left(l + 1, m\right)\).

Define \(c = (n - m) / (m - l)\). We now have

$$\begin{aligned} \epsilon&\ge z - h(m) \ge h(n) - h(m) = g(n - m, \alpha ) = cg(m - l, \alpha ) \\&= cg(m - l, \beta ) + c(m - l)(s - r)^T(\alpha - \beta ) \ge cg(m - l, \beta ) \\&= c(h(m) - h(l)) \ge c(h(m) - x) \ge c(y - \epsilon - x), \end{aligned}$$

which implies \(y - x \le \epsilon (1 + c^{-1}) \le \epsilon (1 + e)\), since \(c^{-1} = (m - l)/(n - m) \le e\). Since this holds for any \(\epsilon > 0\), we conclude that \(y \le x\). This proves the theorem. \(\square \)

Proof of Theorem 1 Let \(P\) be a segmentation and let \(I\) and \(J\) be two consecutive segments such that \({ diff }\mathopen {}\left(D[I], D[J]\right)\) is a cover. We can now apply Theorem 9 to find alternative segments \(I^{\prime }\) and \(J^{\prime }\) such that if we define \(P^{\prime }\) by replacing \(I\) and \(J\) in \(P\) with \(I^{\prime }\) and \(J^{\prime }\), then either \({ sc }\mathopen {}\left(P^{\prime } \mid D\right) > { sc }\mathopen {}\left(P \mid D\right)\), or \({ sc }\mathopen {}\left(P^{\prime } \mid D\right) \ge { sc }\mathopen {}\left(P \mid D\right)\) and \(I^{\prime }\) ends before \(I\). We repeat this until no two consecutive segments constitute a cover. This repetition terminates because no segmentation occurs twice during these steps and there is only a finite number of segmentations. No segmentation occurs twice because at each step either the score strictly increases, or the score stays the same and a breakpoint moves to the left. \(\square \)

1.2 A.2 Proof of Theorem 8

Let \(U\) be the tree resulting from \({\textsc {UpdateTree}}(T, C, D, i)\). To prove the theorem we need to show that the paths of \(U\) from leaves to the root consist of borders, that there are no nodes in \(U\) outside the borders, and that children are ordered. We prove these results in a series of lemmata.

Lemma 2

Let \(T^{\prime }\) be a tree after we have added a node \(i\) in UpdateTree. Let \(n \ne i\) be a node in \(T^{\prime }\) and let \(m\) be its parent. Let \(c \in C\) be such that \(n \in { borders }\mathopen {}\left(c, i - 1\right)\). If \(m \notin { borders }\mathopen {}\left(c, i\right)\), then \(n\) will cease to be a child of \(m\) during some stage of UpdateTree.

Proof

Let \(r\) be a root node of \(T^{\prime }\). Consider a pre-order of nodes of \(T^{\prime }\), that is, parents and earlier siblings come first. We will prove the lemma using induction on the pre-order.

To prove the first step, let \(n\) be the first child of \(i\). If \(i \notin { borders }\mathopen {}\left(c, i\right)\), then Theorem 5 implies that \({ av }\mathopen {}\left(n, i\right) \ge { av }\mathopen {}\left(i, i\right)\) which is exactly the test on Line 9. Hence, \(n\) will be disconnected from \(i\).

Let us now prove the induction step. Let \(p\) be the parent of \(m\) in \(T^{\prime }\). Assume that \(p \ne r\). Note that \(p\) is the border next to \(m\) in \({ borders }\mathopen {}\left(c, i - 1\right)\). Theorem 5 implies that \(p \notin { borders }\mathopen {}\left(c, i\right)\), hence the induction assumption implies that \(m\) and \(p\) are disconnected and \(m\) becomes a child of \(r\) at some point.

Assume now that \(n\) is not the first child of \(m\) and let \(q\) be the sibling left to \(n\), and let \(p\) be such that \(q \in { borders }\mathopen {}\left(p, i - 1\right)\). Theorem 3 implies that \({ av }\mathopen {}\left(q, m - 1\right) \ge { av }\mathopen {}\left(j, m - 1\right)\) for any \(q \le j < m\). Since \(n > q\), we must have \({ av }\mathopen {}\left(q, m - 1\right) \ge { av }\mathopen {}\left(n, m - 1\right) \ge { av }\mathopen {}\left(m, i\right)\), which implies that \(m \notin { borders }\mathopen {}\left(p, i\right)\). Again, the induction assumption implies that \(q\) and \(m\) will be disconnected. Consequently, \(n\) will be the first child of \(m\) at some point.

Note that while moving \(m\) or left siblings of \(n\) to be children of \(r\) we move the current node \(a\) in UpdateTree to the left. Hence, there will be a point where \(a = m\) and \(n\) is the first child of \(m\). Theorem 5 implies that \({ av }\mathopen {}\left(n, i\right) \ge { av }\mathopen {}\left(m, i\right)\) which is exactly the test on Line 9. Hence, \(n\) will be disconnected from \(m\). This proves the lemma. \(\square \)

Lemma 3

For every \(c \in C\), a path in \(U\) from \(c\) to a child of the root node \(r\) equals \({ borders }\mathopen {}\left(c, i\right)\).

Proof

Fix \(c \in C\) and let \(\left(b_1 ,\ldots , b_M\right) = { borders }\mathopen {}\left(c, i - 1\right)\) and define \(b_{M + 1} = i\). Theorem 5 implies that there is \(1 \le N \le M + 1\) such that \(\left(b_1 ,\ldots , b_N\right) = { borders }\mathopen {}\left(c, i\right)\).

After adding \(i\) to \(T\), UpdateTree will not add new nodes into the path from \(c\) to \(r\). Lemma 2 now implies that the path from \(c\) to \(r\) will be \(\left(b_1 ,\ldots , b_K\right)\), where \(K \le N\). If \(N = 1\), then immediately \(K = 1\). To conclude that \(K = N\) in general, assume that \(N > 1\) and assume that at some point in UpdateTree we have \(a = b_N\) and \(b = b_{N - 1}\). Then, according to Theorem 5, the test on Line 9 will fail and \(b_{N - 1}\) remains as a child of \(b_N\). \(\square \)

Lemma 4

Let \(n\) be a node in \(U\), then there is \(c \in C\) such that \(n \in { borders }\mathopen {}\left(c, i\right)\).

Proof

Let \(m\) be a node that occurs in \(T\) but not in \({ btree }\mathopen {}\left(D[1, i], C\right)\). The lemma will follow if we can show that \(m\) is not in \(U\). Let \(n\) be the last child of \(m\). Lemma 2 implies that at some point \(n\) will be disconnected from \(m\), and we will visit \(m\) when it is a leaf; since \(m \notin C\), we then delete \(m\). \(\square \)

Lemma 5

Consider a post-order of nodes of \(T = { btree }\mathopen {}\left(D[1, i - 1], C\right)\), that is, parents and later siblings come first. Node values decrease with respect to this order.

Proof

We will prove that the following holds: Let \(n\) be a node and let \(m\) be its left sibling. Let \(q\) be the smallest child of \(n\). Then \(m < q\). Note that this automatically proves the lemma.

Note that \(q \in C\). To prove that \(m < q\), let \(c \in C\) such that \(m \in { borders }\mathopen {}\left(c, i - 1\right)\). If \(c \ge q\), then since \(n > m \ge c\), Theorem 7 implies that \(n \in { borders }\mathopen {}\left(c, i - 1\right)\) which is a contradiction. Consequently, \(c < q\). If \(q \le m\), then again Theorem 7 implies that \(m \in { borders }\mathopen {}\left(q, i - 1\right)\) which is a contradiction. This proves that \(m < q\). \(\square \)

Lemma 6

Child nodes of each node in \(U\) are ordered from smallest to largest.

Proof

UpdateTree modifies the tree by moving the first child of a node \(a\) to be the left sibling of \(a\). This does not change the post-order of the nodes. This implies that, since node values decrease with respect to the post-order in \(T\), they will also decrease in \(U\). This proves the lemma. \(\square \)

Cite this article

Tatti, N. Fast sequence segmentation using log-linear models. Data Min Knowl Disc 27, 421–441 (2013). https://doi.org/10.1007/s10618-012-0301-y
