
Fast sequence segmentation using log-linear models

Published in: Data Mining and Knowledge Discovery

Abstract

Sequence segmentation is a well-studied problem, where given a sequence of elements, an integer K, and some measure of homogeneity, the task is to split the sequence into K contiguous segments that are maximally homogeneous. A classic approach to find the optimal solution is by using a dynamic program. Unfortunately, the execution time of this program is quadratic with respect to the length of the input sequence. This makes the algorithm slow for a sequence of non-trivial length. In this paper we study segmentations whose measure of goodness is based on log-linear models, a rich family that contains many of the standard distributions. We present a theoretical result allowing us to prune many suboptimal segmentations. Using this result, we modify the standard dynamic program for 1D log-linear models, and by doing so reduce the computational time. We demonstrate empirically, that this approach can significantly reduce the computational burden of finding the optimal segmentation.
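The classic quadratic dynamic program mentioned in the abstract can be sketched as follows. This is a generic baseline in the spirit of Bellman (1961), not the paper's pruned algorithm; the squared-error homogeneity measure and all function names are illustrative assumptions.

```python
import numpy as np

def segment(x, K):
    """Optimal K-segmentation of sequence x under squared-error cost,
    via the classic O(n^2 K) dynamic program."""
    x = np.asarray(x, dtype=float)
    n = len(x)
    # Prefix sums let us evaluate the cost of any segment x[i:j] in O(1).
    p1 = np.concatenate(([0.0], np.cumsum(x)))
    p2 = np.concatenate(([0.0], np.cumsum(x * x)))

    def sse(i, j):
        # Sum of squared deviations from the mean of x[i:j].
        s, s2, m = p1[j] - p1[i], p2[j] - p2[i], j - i
        return s2 - s * s / m

    INF = float("inf")
    cost = np.full((K + 1, n + 1), INF)
    back = np.zeros((K + 1, n + 1), dtype=int)
    cost[0, 0] = 0.0
    for k in range(1, K + 1):
        for j in range(k, n + 1):
            # Try every start point i for the last segment x[i:j];
            # this inner loop is what makes the program quadratic in n.
            for i in range(k - 1, j):
                c = cost[k - 1, i] + sse(i, j)
                if c < cost[k, j]:
                    cost[k, j], back[k, j] = c, i
    # Recover segment boundaries by walking the back pointers.
    bounds, j = [n], n
    for k in range(K, 0, -1):
        j = back[k, j]
        bounds.append(j)
    return bounds[::-1]  # e.g. [0, b1, n] for K = 2

# A sequence with one obvious change point splits where expected.
print(segment([0, 0, 0, 5, 5, 5], 2))  # → [0, 3, 6]
```

Replacing `sse` with the negated log-likelihood of a log-linear model fitted to each candidate segment would correspond to the scoring setup the paper builds on; the pruning result in the paper aims to avoid scanning all split points `i` in the inner loop.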



Notes

  1. Calders et al. (2007) deal only with binary sequences, but these results extend easily to the general case.

  2. The implementation of the algorithm is given at http://adrem.ua.ac.be/segmentation.

  3. For clarity's sake, figures show average lifetimes of bins containing 40 points.

  4. The datasets were obtained from http://www.cs.ucr.edu/~eamonn/discords/.

References

  • Basseville M, Nikiforov IV (1993) Detection of abrupt changes—theory and application. Prentice-Hall, Englewood Cliffs

  • Bellman R (1961) On the approximation of curves by line segments using dynamic programming. Commun ACM 4(6):284

  • Bernaola-Galván P, Román-Roldán R, Oliver JL (1996) Compositional segmentation and long-range fractal correlations in DNA sequences. Phys Rev E Stat Phys Plasmas Fluids Relat Interdiscip Topics 53(5):5181–5189

  • Calders T, Dexters N, Goethals B (2007) Mining frequent itemsets in a stream. In: ICDM, pp 83–92

  • Douglas D, Peucker T (1973) Algorithms for the reduction of the number of points required to represent a digitized line or its caricature. Can Cartogr 10(2):112–122

  • Džeroski S, Goethals B, Panov P (eds) (2011) Inductive databases and constraint-based data mining. Springer, New York

  • Gedikli A, Aksoy H, Unal NE, Kehagias A (2010) Modified dynamic programming approach for offline segmentation of long hydrometeorological time series. Stoch Environ Res Risk Assess 24(5):547–557

  • Gionis A, Mannila H (2003) Finding recurrent sources in sequences. In: Proceedings of the seventh annual international conference on research in computational molecular biology, RECOMB ’03, pp 123–130

  • Grünwald P (2007) The minimum description length principle. MIT Press, Cambridge

  • Haiminen N, Gionis A (2004) Unimodal segmentation of sequences. In: ICDM, pp 106–113

  • Himberg J, Korpiaho K, Mannila H, Tikanmäki J, Toivonen H (2001) Time series segmentation for context recognition in mobile devices. In: ICDM, pp 203–210

  • Keogh EJ, Lin J, Fu AWC (2005) HOT SAX: efficiently finding the most unusual time series subsequence. In: ICDM, pp 226–233

  • Kifer D, Ben-David S, Gehrke J (2004) Detecting change in data streams. In: VLDB, pp 180–191

  • Lavrenko V, Schmill M, Lawrie D, Ogilvie P, Jensen D, Allan J (2000) Mining of concurrent text and time series. In: KDD workshop on text mining, pp 37–44

  • Palpanas T, Vlachos M, Keogh EJ, Gunopulos D, Truppel W (2004) Online amnesic approximation of streaming time series. In: ICDE, pp 339–349

  • Schwarz G (1978) Estimating the dimension of a model. Ann Stat 6(2):461–464

  • Shatkay H, Zdonik SB (1996) Approximate queries and representations for large data sequences. In: ICDE, pp 536–545

  • Terzi E, Tsaparas P (2006) Efficient algorithms for sequence segmentation. In: SDM


Acknowledgments

Nikolaj Tatti was partly supported by a Post-Doctoral Fellowship of the Research Foundation-Flanders (FWO).

Correspondence to Nikolaj Tatti.

Additional information

Communicated by Hendrik Blockeel, Kristian Kersting, Siegfried Nijssen, Filip Zelezny.

Appendix: Proofs


1.1 A.1 Proof of Theorem 1

Theorem 1 will follow from the following theorem.

Theorem 9

Let \(D = \left(D_1 ,\ldots , D_e\right)\) and let \(1 \le m < e\). Assume that \({ diff }\mathopen {}\left(D[1, m], D[m + 1, e]\right)\) is a cover. Then there exists \(n > m\) such that \({ sc }\mathopen {}\left([1, n], [n + 1, e]\right) > { sc }\mathopen {}\left([1, m], [m + 1, e]\right)\) or there exists \(l < m\) such that \({ sc }\mathopen {}\left([1, l], [l + 1, e]\right) \ge { sc }\mathopen {}\left([1, m], [m + 1, e]\right)\).

In order to prove the theorem we will introduce some helpful notation. First, given parameter vectors \(s\) and \(r\), we define

$$\begin{aligned} h(k \mid s, r) = { sc }\mathopen {}\left([1, k] \mid s\right) + { sc }\mathopen {}\left([k + 1, e] \mid r\right). \end{aligned}$$

Note that \(h(k \mid s, r) \le { sc }\mathopen {}\left([1, k], [k + 1, e]\right)\). We also define

$$\begin{aligned} g(l, \delta \mid s, r) = l\left(Z(s) - Z(r) + (s - r)^T\delta \right). \end{aligned}$$

This function is essentially the difference between two scores.

Lemma 1

Let \(k > l\). We have \(h(k \mid s, r) - h(l \mid s, r) =g(k - l, { av }\mathopen {}\left(l + 1, k\right) \mid s, r)\).

Proof

Note that

$$\begin{aligned} h(k \mid s, r)&= kZ(s) + s^T{ c }\mathopen {}\left(k\right) + (e - k)Z(r) + r^T({ c }\mathopen {}\left(e\right) - { c }\mathopen {}\left(k\right)) \\&= k(Z(s) - Z(r)) + (s - r)^T{ c }\mathopen {}\left(k\right) + eZ(r) + r^T{ c }\mathopen {}\left(e\right). \end{aligned}$$

The last two terms do not depend on \(k\). This allows us to write

$$\begin{aligned} h(k \mid s, r) - h(l \mid s, r)&= k(Z(s) - Z(r)) + (s - r)^T{ c }\mathopen {}\left(k\right)\\&- l(Z(s) - Z(r)) - (s - r)^T{ c }\mathopen {}\left(l\right) \\&= (k - l)\left(Z(s) - Z(r) + (s - r)^T\frac{{ c }\mathopen {}\left(k\right) - { c }\mathopen {}\left(l\right)}{k - l}\right)\\&= g(k - l, { av }\mathopen {}\left(l + 1, k\right) \mid s, r). \end{aligned}$$

This completes the proof. \(\square \)
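Since the identity in Lemma 1 is purely algebraic and holds for any log-partition function \(Z\), it can be sanity-checked numerically. The sketch below assumes a one-dimensional parameter, a cumulative sufficient statistic \(c(k)\), and a concrete choice of \(Z\); the function names mirror the notation above, but the model and data are illustrative.

```python
import numpy as np

rng = np.random.default_rng(0)
e = 20
D = rng.normal(size=e)  # a 1-D sequence of sufficient statistics

def c(k):
    # Cumulative sufficient statistic c(k) = sum of D_1, ..., D_k.
    return D[:k].sum()

def av(a, b):
    # Average of positions a..b (1-based, inclusive): av(l+1, k) = (c(k) - c(l)) / (k - l).
    return (c(b) - c(a - 1)) / (b - a + 1)

def Z(s):
    # An arbitrary smooth stand-in for the log-partition function (illustrative).
    return -0.5 * s * s

def h(k, s, r):
    # h(k | s, r) = sc([1, k] | s) + sc([k + 1, e] | r), expanded as in the proof of Lemma 1.
    return k * Z(s) + s * c(k) + (e - k) * Z(r) + r * (c(e) - c(k))

def g(l, delta, s, r):
    # g(l, delta | s, r) = l (Z(s) - Z(r) + (s - r) delta).
    return l * (Z(s) - Z(r) + (s - r) * delta)

# Lemma 1: h(k) - h(l) = g(k - l, av(l + 1, k)) for any k > l.
s, r = 0.7, -0.3
l, k = 4, 13
lhs = h(k, s, r) - h(l, s, r)
rhs = g(k - l, av(l + 1, k), s, r)
assert abs(lhs - rhs) < 1e-12
```

The check passes for any choice of \(s\), \(r\), \(l < k\), and \(Z\), because the \(Z\)-terms and the cumulative statistics cancel exactly as in the displayed derivation.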

Proof of Theorem 9 Write \(y = { sc }\mathopen {}\left([1, m], [m + 1, e]\right)\) and define

$$\begin{aligned} x = \max _{k < m} { sc }\mathopen {}\left([1, k], [k + 1, e]\right) \quad {\text{ and}}\quad z = \max _{k > m} { sc }\mathopen {}\left([1, k], [k + 1, e]\right). \end{aligned}$$

We need to show that either \(x \ge y\) or \(z > y\). Assume that \(z \le y\). Fix \(\epsilon > 0\). By definition, there exist \(s\) and \(r\) such that

$$\begin{aligned} { sc }\mathopen {}\left([1, m] \mid s\right) + { sc }\mathopen {}\left([m + 1, e] \mid r\right) \ge y - \epsilon . \end{aligned}$$

From now on we will write \(h(k)\) to mean \(h(k \mid s, r)\) and \(g(k, \delta )\) to mean \(g(k, \delta \mid s, r)\). We must have \(h(m) + \epsilon \ge y \ge z\) or, equivalently, \(\epsilon \ge z - h(m)\).

Since \({ diff }\mathopen {}\left(D[1, m], D[m + 1, e]\right)\) is a cover, there exist integers \(l\) and \(n, 0 \le l < m < n \le e\), such that \((\alpha - \beta )^T(s - r) \ge 0\), where \(\alpha = { av }\mathopen {}\left(m + 1, n\right)\) and \(\beta = { av }\mathopen {}\left(l + 1, m\right)\).

Define \(c = (n - m) / (m - l)\). We now have

$$\begin{aligned} \epsilon&\ge z - h(m) \ge h(n) - h(m) = g(n - m, \alpha ) = cg(m - l, \alpha ) \\&= cg(m - l, \beta ) + c(m - l)(s - r)^T(\alpha - \beta ) \ge cg(m - l, \beta ) \\&= c(h(m) - h(l)) \ge c(h(m) - x) \ge c(y - \epsilon - x), \end{aligned}$$

which implies \(y - x \le \epsilon (1 + c^{-1}) \le \epsilon (1 + e)\), since \(c^{-1} = (m - l)/(n - m) \le e\). Since this holds for any \(\epsilon > 0\), we conclude that \(y \le x\). This proves the theorem. \(\square \)

Proof of Theorem 1 Let \(P\) be a segmentation and let \(I\) and \(J\) be two consecutive segments such that \({ diff }\mathopen {}\left(D[I], D[J]\right)\) is a cover. We can now apply Theorem 9 to find alternative segments \(I^{\prime }\) and \(J^{\prime }\) such that if we define \(P^{\prime }\) by replacing \(I\) and \(J\) in \(P\) with \(I^{\prime }\) and \(J^{\prime }\), then either \({ sc }\mathopen {}\left(P^{\prime } \mid D\right) > { sc }\mathopen {}\left(P \mid D\right)\), or \({ sc }\mathopen {}\left(P^{\prime } \mid D\right) \ge { sc }\mathopen {}\left(P \mid D\right)\) and \(I^{\prime }\) ends before \(I\). We repeat this until no two consecutive segments constitute a cover. This repetition terminates because no segmentation occurs twice during these steps and there is only a finite number of segmentations. No segmentation occurs twice because at each step either the score strictly increases, or the score stays the same and a breakpoint moves to the left. \(\square \)

1.2 A.2 Proof of Theorem 8

Let \(U\) be the tree resulting from \({\textsc {UpdateTree}}(T, C, D, i)\). To prove the theorem we need to show that the paths of \(U\) from leaves to the root consist of borders, that there are no nodes in \(U\) outside the borders, and that children are ordered. We prove these results in a series of lemmata.

Lemma 2

Let \(T^{\prime }\) be a tree after we have added a node \(i\) in UpdateTree. Let \(n \ne i\) be a node in \(T^{\prime }\) and let \(m\) be its parent. Let \(c \in C\) be such that \(n \in { borders }\mathopen {}\left(c, i - 1\right)\). If \(m \notin { borders }\mathopen {}\left(c, i\right)\), then \(n\) will cease to be a child of \(m\) during some stage of UpdateTree.

Proof

Let \(r\) be a root node of \(T^{\prime }\). Consider a pre-order of nodes of \(T^{\prime }\), that is, parents and earlier siblings come first. We will prove the lemma using induction on the pre-order.

To prove the first step, let \(n\) be the first child of \(i\). If \(i \notin { borders }\mathopen {}\left(c, i\right)\), then Theorem 5 implies that \({ av }\mathopen {}\left(n, i\right) \ge { av }\mathopen {}\left(i, i\right)\) which is exactly the test on Line 9. Hence, \(n\) will be disconnected from \(i\).

Let us now prove the induction step. Let \(p\) be the parent of \(m\) in \(T^{\prime }\). Assume that \(p \ne r\). Note that \(p\) is the border next to \(m\) in \({ borders }\mathopen {}\left(c, i - 1\right)\). Theorem 5 implies that \(p \notin { borders }\mathopen {}\left(c, i\right)\), hence the induction assumption implies that \(m\) and \(p\) are disconnected and \(m\) becomes a child of \(r\) at some point.

Assume now that \(n\) is not the first child of \(m\) and let \(q\) be the sibling left to \(n\), and let \(p\) be such that \(q \in { borders }\mathopen {}\left(p, i - 1\right)\). Theorem 3 implies that \({ av }\mathopen {}\left(q, m - 1\right) \ge { av }\mathopen {}\left(j, m - 1\right)\) for any \(q \le j < m\). Since \(n > q\), we must have \({ av }\mathopen {}\left(q, m - 1\right) \ge { av }\mathopen {}\left(n, m - 1\right) \ge { av }\mathopen {}\left(m, i\right)\), which implies that \(m \notin { borders }\mathopen {}\left(p, i\right)\). Again, the induction assumption implies that \(q\) and \(m\) will be disconnected. Consequently, \(n\) will be the first child of \(m\) at some point.

Note that while moving \(m\) or left siblings of \(n\) to be children of \(r\) we move the current node \(a\) in UpdateTree to the left. Hence, there will be a point where \(a = m\) and \(n\) is the first child of \(m\). Theorem 5 implies that \({ av }\mathopen {}\left(n, i\right) \ge { av }\mathopen {}\left(m, i\right)\) which is exactly the test on Line 9. Hence, \(n\) will be disconnected from \(m\). This proves the lemma. \(\square \)

Lemma 3

For every \(c \in C\), a path in \(U\) from \(c\) to a child of the root node \(r\) equals \({ borders }\mathopen {}\left(c, i\right)\).

Proof

Fix \(c \in C\) and let \(\left(b_1 ,\ldots , b_M\right) = { borders }\mathopen {}\left(c, i - 1\right)\) and define \(b_{M + 1} = i\). Theorem 5 implies that there is \(1 \le N \le M + 1\) such that \(\left(b_1 ,\ldots , b_N\right) = { borders }\mathopen {}\left(c, i\right)\).

After adding \(i\) to \(T\), UpdateTree will not add new nodes into the path from \(c\) to \(r\). Lemma 2 now implies that the path from \(c\) to \(r\) will be \(\left(b_1 ,\ldots , b_K\right)\), where \(K \le N\). If \(N = 1\), then immediately \(K = 1\). To conclude that \(K = N\) in general, assume that \(N > 1\) and assume that at some point in UpdateTree we have \(a = b_N\) and \(b = b_{N - 1}\). Then, according to Theorem 5, the test on Line 9 will fail and \(b_{N - 1}\) remains as a child of \(b_N\). \(\square \)

Lemma 4

Let \(n\) be a node in \(U\), then there is \(c \in C\) such that \(n \in { borders }\mathopen {}\left(c, i\right)\).

Proof

Let \(m\) be a node that occurs in \(T\) but not in \({ btree }\mathopen {}\left(D[1, i], C\right)\). The lemma will follow if we can show that \(m\) is not in \(U\). Let \(n\) be the last child of \(m\). Lemma 2 implies that at some point \(n\) will be disconnected from \(m\), and we will visit \(m\) when it is a leaf; since \(m \notin C\), we then delete \(m\). \(\square \)

Lemma 5

Consider a post-order of nodes of \(T = { btree }\mathopen {}\left(D[1, i - 1], C\right)\), that is, parents and later siblings come first. Node values decrease with respect to this order.

Proof

We will prove that the following holds: Let \(n\) be a node and let \(m\) be its left sibling. Let \(q\) be the smallest child of \(n\). Then \(m < q\). Note that this automatically proves the lemma.

Note that \(q \in C\). To prove that \(m < q\), let \(c \in C\) such that \(m \in { borders }\mathopen {}\left(c, i - 1\right)\). If \(c \ge q\), then since \(n > m \ge c\), Theorem 7 implies that \(n \in { borders }\mathopen {}\left(c, i - 1\right)\) which is a contradiction. Consequently, \(c < q\). If \(q \le m\), then again Theorem 7 implies that \(m \in { borders }\mathopen {}\left(q, i - 1\right)\) which is a contradiction. This proves that \(m < q\). \(\square \)

Lemma 6

Child nodes of each node in \(U\) are ordered from smallest to largest.

Proof

UpdateTree modifies the tree by moving the first child of a node \(a\) to be the left sibling of \(a\). This does not change the post-order of the nodes. This implies that, since node values decrease with respect to the post-order in \(T\), they will also decrease in \(U\). This proves the lemma. \(\square \)

Cite this article

Tatti, N. Fast sequence segmentation using log-linear models. Data Min Knowl Disc 27, 421–441 (2013). https://doi.org/10.1007/s10618-012-0301-y
