research-article

SAND: streaming subsequence anomaly detection

Authors:
Paul Boniol (Univ. de Paris)
John Paparrizos (University of Chicago)
Themis Palpanas (Univ. de Paris; IUF)
Michael J. Franklin (University of Chicago)

Proceedings of the VLDB Endowment, Volume 14, Issue 10, pp. 1717–1729. https://doi.org/10.14778/3467861.3467863

Published: 01 June 2021

Abstract

With the increasing demand for real-time analytics and decision making, anomaly detection methods need to operate over streams of values and handle drifts in data distribution. Unfortunately, existing approaches have severe limitations: they either require prior domain knowledge or become cumbersome and expensive to use in situations with recurrent anomalies of the same type. In addition, subsequence anomaly detection methods usually require access to the entire dataset and are not able to learn and detect anomalies in streaming settings. To address these problems, we propose SAND, a novel online method suitable for domain-agnostic anomaly detection. SAND aims to detect anomalies based on their distance to a model that represents normal behavior. SAND relies on a novel streaming methodology to incrementally update this model, which adapts to distribution drifts and omits obsolete data. The experimental results on several real-world datasets demonstrate that SAND correctly identifies single and recurrent anomalies without prior knowledge of the characteristics of these anomalies. SAND outperforms the current state-of-the-art algorithms by a large margin in terms of accuracy while achieving orders-of-magnitude speedups.
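The general idea of scoring subsequences by their distance to an incrementally updated model of normal behavior can be illustrated with a toy sketch. This is not the SAND algorithm: it assumes a single normal pattern, non-overlapping windows aligned to the pattern length, z-normalized Euclidean distance, and a simple weighted-average update. The class name `StreamingNormalModel` and the parameters `length` and `decay` are hypothetical and chosen only for illustration.

```python
import math
from typing import List, Sequence


def znorm(x: Sequence[float]) -> List[float]:
    """Z-normalize a window; a constant window maps to all zeros."""
    n = len(x)
    mean = sum(x) / n
    std = math.sqrt(sum((v - mean) ** 2 for v in x) / n)
    if std == 0.0:
        return [0.0] * n
    return [(v - mean) / std for v in x]


def euclidean(a: Sequence[float], b: Sequence[float]) -> float:
    """Euclidean distance between two equal-length windows."""
    return math.sqrt(sum((p - q) ** 2 for p, q in zip(a, b)))


class StreamingNormalModel:
    """Toy model (illustrative assumption): one averaged 'normal' pattern."""

    def __init__(self, length: int, decay: float = 0.5) -> None:
        self.length = length            # window length
        self.decay = decay              # weight kept by the old pattern
        self.pattern: List[float] = []  # empty until the first batch arrives

    def _windows(self, batch: Sequence[float]) -> List[List[float]]:
        # Non-overlapping windows; trailing points that do not fill a window
        # are ignored.
        stop = len(batch) - self.length + 1
        return [znorm(batch[i:i + self.length])
                for i in range(0, stop, self.length)]

    def process(self, batch: Sequence[float]) -> List[float]:
        """Score each window against the current pattern, then update it."""
        windows = self._windows(batch)
        if not windows:
            return []
        if self.pattern:
            scores = [euclidean(w, self.pattern) for w in windows]
        else:
            scores = [0.0] * len(windows)  # nothing to compare against yet
        # Average the windows of this batch, point by point.
        batch_mean = [sum(col) / len(windows) for col in zip(*windows)]
        if self.pattern:
            # Older data loses weight geometrically with each new batch.
            self.pattern = [self.decay * p + (1.0 - self.decay) * m
                            for p, m in zip(self.pattern, batch_mean)]
        else:
            self.pattern = batch_mean
        return scores


if __name__ == "__main__":
    normal = [0.0, 1.0, 2.0, 1.0]
    model = StreamingNormalModel(length=4)
    model.process(normal * 5)  # first batch only initializes the pattern
    scores = model.process(normal * 2 + [2.0, 1.0, 0.0, 1.0] + normal * 2)
    print(scores)              # the third window gets the largest score
```

Each batch is scored before the model is updated, so a window is always compared against a pattern learned from earlier data. In this toy version anomalous windows also feed into the update; a real method would need a more careful update rule.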

Published in
Proceedings of the VLDB Endowment, Volume 14, Issue 10
June 2021
219 pages
ISSN: 2150-8097
Editors: Xin Luna Dong (Amazon), Felix Naumann (HPI, University of Potsdam)
Publisher: VLDB Endowment