Distance-based outlier detection: consolidation and renewed bearing

Published: 01 September 2010

Abstract

Detecting outliers in data is an important problem with interesting applications in a myriad of domains ranging from data cleaning to financial fraud detection and from network intrusion detection to clinical diagnosis of diseases. Over the last decade of research, distance-based outlier detection algorithms have emerged as a viable, scalable, parameter-free alternative to the more traditional statistical approaches.
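
As a rough illustration of what a distance-based detector computes, the sketch below scores each point by the distance to its k-th nearest neighbor and reports the n highest-scoring points, using a nested loop with a simple cutoff-based pruning rule. This is one common formulation of the problem and is not necessarily the framework evaluated in the paper; the function name `knn_outliers` and the parameters `k` and `n` are assumptions made for this example.

```python
import heapq
import math


def knn_outliers(points, k, n):
    """Return up to n (score, index) pairs, highest score first.

    The score of a point is the distance to its k-th nearest neighbor.
    Hypothetical helper for illustration only.
    """
    if len(points) <= k:
        raise ValueError("need more than k points")

    top = []        # min-heap of (score, index) holding the current top-n outliers
    cutoff = 0.0    # smallest score among the current top-n, once n are known

    for i, p in enumerate(points):
        neighbors = []  # max-heap (negated) of the k smallest distances seen so far
        pruned = False
        for j, q in enumerate(points):
            if i == j:
                continue
            d = math.dist(p, q)
            if len(neighbors) < k:
                heapq.heappush(neighbors, -d)
            elif d < -neighbors[0]:
                heapq.heapreplace(neighbors, -d)
            # The running k-th distance only shrinks, so it is an upper bound on
            # the final score. Once it drops to the cutoff, p cannot enter the top-n.
            if len(top) == n and len(neighbors) == k and -neighbors[0] <= cutoff:
                pruned = True
                break
        if pruned:
            continue

        score = -neighbors[0]
        if len(top) < n:
            heapq.heappush(top, (score, i))
        elif score > top[0][0]:
            heapq.heapreplace(top, (score, i))
        if len(top) == n:
            cutoff = top[0][0]

    return sorted(top, reverse=True)
```

For example, `knn_outliers([(0, 0), (0, 1), (1, 0), (1, 1), (10, 10)], k=2, n=1)` flags the isolated point at index 4. The early exit in the inner loop is what keeps most non-outliers cheap to process in this kind of scheme.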

In this paper, we assess and evaluate several distance-based outlier detection approaches. We begin by surveying the design landscape of extant approaches and identifying their key design decisions. We then implement an outlier detection framework and conduct a factorial design experiment on a diverse set of real-life datasets to understand the pros and cons of various optimizations, both those proposed by us and those proposed in the literature, independently and in conjunction with one another. To the best of our knowledge, this is the first such study in the literature. The outcome of this study is a family of state-of-the-art distance-based outlier detection algorithms.
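
To make the factorial design idea concrete, the sketch below enumerates every on/off combination of a set of optimizations and estimates the main effect of one factor as the difference in mean runtime between runs with it enabled and runs with it disabled. The factor names (`opt_a`, `opt_b`) and the helpers `full_factorial` and `main_effect` are hypothetical and stand in for whatever optimizations and measurements an actual study would use.

```python
from itertools import product


def full_factorial(factors):
    """Enumerate every combination of factor levels.

    `factors` maps a factor name to its list of levels,
    e.g. {"opt_a": [False, True], "opt_b": [False, True]}.
    """
    names = list(factors)
    return [
        dict(zip(names, levels))
        for levels in product(*(factors[name] for name in names))
    ]


def main_effect(runs, factor):
    """Mean runtime with `factor` enabled minus mean runtime with it disabled.

    `runs` is a list of (config, runtime) pairs, one per configuration,
    where each config is a dict of boolean factor settings.
    """
    on = [t for cfg, t in runs if cfg[factor]]
    off = [t for cfg, t in runs if not cfg[factor]]
    return sum(on) / len(on) - sum(off) / len(off)
```

A negative main effect would suggest that the optimization reduces runtime on average over the other settings. Comparing individual configurations rather than averages is what exposes interactions between optimizations.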

Our detailed empirical study supports the following observations. Combining optimization strategies enables significant efficiency gains. Our factorial design study shows that no single optimization or combination of optimizations (factors) dominates on all types of data. The study also allows us to characterize when a given combination of optimizations is likely to prevail, and it provides useful insights for moving forward in this domain.

Published in

Proceedings of the VLDB Endowment, Volume 3, Issue 1-2
September 2010
1658 pages

Publisher: VLDB Endowment
