research-article

SAND: streaming subsequence anomaly detection

Published: 01 June 2021
Abstract

With the increasing demand for real-time analytics and decision making, anomaly detection methods need to operate over streams of values and handle drifts in data distribution. Unfortunately, existing approaches have severe limitations: they either require prior domain knowledge or become cumbersome and expensive to use in situations with recurrent anomalies of the same type. In addition, subsequence anomaly detection methods usually require access to the entire dataset and are not able to learn and detect anomalies in streaming settings. To address these problems, we propose SAND, a novel online method suitable for domain-agnostic anomaly detection. SAND aims to detect anomalies based on their distance to a model that represents normal behavior. SAND relies on a novel streaming methodology to incrementally update this model, which adapts to distribution drifts and omits obsolete data. The experimental results on several real-world datasets demonstrate that SAND correctly identifies single and recurrent anomalies without prior knowledge of the characteristics of these anomalies. SAND outperforms the current state-of-the-art algorithms by a large margin in terms of accuracy while achieving orders of magnitude speedups.
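
The general idea of scoring subsequences by their distance to a model of normal behavior, while updating that model as new batches arrive, can be illustrated with a deliberately simplified sketch. This is not SAND's algorithm: the abstract does not specify the model, the distance measure, or the update rule, so everything below is an assumption made for illustration. The model is a single averaged pattern, the distance is Euclidean over z-normalized, non-overlapping windows, and old data is down-weighted with an exponential decay. The names `ToyStreamingModel`, `znorm`, `euclidean`, `length`, and `decay` are hypothetical.

```python
import math
from typing import List, Optional, Sequence


def znorm(x: Sequence[float]) -> List[float]:
    """Z-normalize a subsequence; a constant subsequence maps to all zeros."""
    n = len(x)
    mean = sum(x) / n
    std = math.sqrt(sum((v - mean) ** 2 for v in x) / n)
    if std < 1e-12:
        return [0.0] * n
    return [(v - mean) / std for v in x]


def euclidean(a: Sequence[float], b: Sequence[float]) -> float:
    """Euclidean distance between two equal-length sequences."""
    return math.sqrt(sum((p - q) ** 2 for p, q in zip(a, b)))


class ToyStreamingModel:
    """Illustrative stand-in for a 'model of normal behavior' (assumption,
    not the method from the paper): one averaged pattern, updated per batch."""

    def __init__(self, length: int, decay: float = 0.5) -> None:
        self.length = length      # subsequence length (assumed known)
        self.decay = decay        # weight kept by the old pattern on update
        self.pattern: Optional[List[float]] = None

    def _windows(self, batch: Sequence[float]) -> List[List[float]]:
        # Non-overlapping, z-normalized windows of size `length`.
        step = self.length
        return [znorm(batch[i:i + step])
                for i in range(0, len(batch) - step + 1, step)]

    def _update(self, windows: List[List[float]]) -> None:
        # Point-wise mean of the batch, blended into the current pattern so
        # that older batches fade out geometrically.
        batch_mean = [sum(col) / len(windows) for col in zip(*windows)]
        if self.pattern is None:
            self.pattern = batch_mean
        else:
            d = self.decay
            self.pattern = [d * old + (1 - d) * new
                            for old, new in zip(self.pattern, batch_mean)]

    def process(self, batch: Sequence[float]) -> List[float]:
        """Return one anomaly score per window, then fold the batch into the model."""
        windows = self._windows(batch)
        if not windows:
            return []
        if self.pattern is None:
            # First batch: there is no model yet, so build it before scoring.
            self._update(windows)
            return [euclidean(w, self.pattern) for w in windows]
        scores = [euclidean(w, self.pattern) for w in windows]
        # Note: anomalous windows also enter the update in this toy version.
        self._update(windows)
        return scores
```

Calling `process` on successive batches returns one score per window; higher scores mark windows that are farther from the current pattern. A realistic method would need overlapping subsequences, a richer model than a single pattern, and an update that is robust to anomalies in the incoming batch.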

Published in

Proceedings of the VLDB Endowment, Volume 14, Issue 10
June 2021
219 pages
ISSN: 2150-8097

Publisher

VLDB Endowment
