Abstract
With the increasing demand for real-time analytics and decision making, anomaly detection methods need to operate over streams of values and handle drifts in data distribution. Unfortunately, existing approaches have severe limitations: they either require prior domain knowledge or become cumbersome and expensive to use in situations with recurrent anomalies of the same type. In addition, subsequence anomaly detection methods usually require access to the entire dataset and are not able to learn and detect anomalies in streaming settings. To address these problems, we propose SAND, a novel online method suitable for domain-agnostic anomaly detection. SAND detects anomalies based on their distance to a model that represents normal behavior. SAND relies on a novel streaming methodology to incrementally update this model, which adapts to distribution drifts and omits obsolete data. The experimental results on several real-world datasets demonstrate that SAND correctly identifies single and recurrent anomalies without prior knowledge of the characteristics of these anomalies. SAND outperforms the current state-of-the-art algorithms in accuracy by a large margin, while achieving speedups of orders of magnitude.
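The general idea of scoring subsequences by their distance to an incrementally updated model of normal behavior can be illustrated with a minimal sketch. This is not SAND's algorithm: the class `StreamingNormalModel`, the single-pattern model, the z-normalized Euclidean distance, and the `decay` parameter are all assumptions chosen for illustration only. The sketch keeps one exponentially weighted average pattern, so that older subsequences gradually lose influence as new batches arrive.

```python
import math


def z_normalize(seq):
    """Rescale a subsequence to zero mean and unit variance."""
    mean = sum(seq) / len(seq)
    std = math.sqrt(sum((x - mean) ** 2 for x in seq) / len(seq))
    if std == 0:
        return [0.0] * len(seq)
    return [(x - mean) / std for x in seq]


class StreamingNormalModel:
    """Hypothetical illustration: one average pattern with exponential forgetting."""

    def __init__(self, decay=0.9):
        self.decay = decay    # assumed forgetting factor in (0, 1)
        self.pattern = None   # current estimate of normal behavior

    def score(self, subsequence):
        """Anomaly score: Euclidean distance to the normal pattern."""
        if self.pattern is None:
            return 0.0
        z = z_normalize(subsequence)
        return math.sqrt(sum((a - b) ** 2 for a, b in zip(z, self.pattern)))

    def update(self, batch):
        """Blend each new subsequence into the pattern; old data fades out."""
        for subsequence in batch:
            z = z_normalize(subsequence)
            if self.pattern is None:
                self.pattern = z
            else:
                self.pattern = [
                    self.decay * p + (1 - self.decay) * v
                    for p, v in zip(self.pattern, z)
                ]


# Usage: learn from repeated sine periods, then score two subsequences.
normal = [math.sin(2 * math.pi * i / 16) for i in range(16)]
anomaly = [0.0] * 15 + [5.0]  # flat segment ending in a spike

model = StreamingNormalModel(decay=0.9)
model.update([normal] * 5)

normal_score = model.score(normal)
anomaly_score = model.score(anomaly)
```

In this toy setting the spike subsequence receives a much higher score than the sine period the model was updated with. A smaller `decay` makes the pattern follow distribution drifts faster, at the cost of being more easily pulled toward anomalous batches.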