Distance-based outlier detection: consolidation and renewed bearing

Authors:
Gustavo H. Orair, Universidade Federal de Minas Gerais, Belo Horizonte, Brazil
Carlos H. C. Teixeira, Universidade Federal de Minas Gerais, Belo Horizonte, Brazil
Wagner Meira, Universidade Federal de Minas Gerais, Belo Horizonte, Brazil
Ye Wang, The Ohio State University, Columbus
Srinivasan Parthasarathy, The Ohio State University, Columbus

Proceedings of the VLDB Endowment, Volume 3, Issue 1-2, pp. 1469–1480. https://doi.org/10.14778/1920841.1921021

Published: 01 September 2010

Abstract

Detecting outliers in data is an important problem with interesting applications in a myriad of domains ranging from data cleaning to financial fraud detection and from network intrusion detection to clinical diagnosis of diseases. Over the last decade of research, distance-based outlier detection algorithms have emerged as a viable, scalable, parameter-free alternative to the more traditional statistical approaches.

In this paper, we assess and evaluate several distance-based outlier detection approaches. We begin by surveying the design landscape of extant approaches and identifying their key design decisions. We then implement an outlier detection framework and conduct a factorial design experiment to understand the pros and cons of various optimizations, both those proposed by us and those proposed in the literature, independently and in conjunction with one another, on a diverse set of real-life datasets. To the best of our knowledge, this is the first such study in the literature. The outcome of this study is a family of state-of-the-art distance-based outlier detection algorithms.

Our detailed empirical study supports the following observations. The combination of optimization strategies enables significant efficiency gains. Our factorial design study highlights the important fact that no single optimization or combination of optimizations (factors) always dominates on all types of data. Our study also allows us to characterize when a certain combination of optimizations is likely to prevail and helps provide interesting and useful insights for moving forward in this domain.
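
For readers new to the setting, the sketch below illustrates one common formulation of distance-based outlier detection: score each point by the distance to its k-th nearest neighbor and report the n highest-scoring points. It also applies a simple cutoff-based pruning rule of the kind used in nested-loop approaches. This is an illustration under stated assumptions, not the framework evaluated in the paper; the function names `kth_nn_distance` and `top_outliers`, the Euclidean metric, and the parameters `k` and `n` are all chosen here for the example.

```python
import heapq
import math


def kth_nn_distance(points, i, k, cutoff=0.0):
    """Distance from points[i] to its k-th nearest neighbor (assumes len(points) > k).

    Returns None as soon as the running k-th-neighbor distance drops below
    `cutoff`: the final score can only get smaller, so the point cannot
    enter the current top-n list.
    """
    nearest = []  # max-heap via negated distances: the k smallest seen so far
    for j, q in enumerate(points):
        if j == i:
            continue
        d = math.dist(points[i], q)  # Euclidean distance (an assumption)
        if len(nearest) < k:
            heapq.heappush(nearest, -d)
        elif d < -nearest[0]:
            heapq.heapreplace(nearest, -d)
        if len(nearest) == k and -nearest[0] < cutoff:
            return None  # pruned
    return -nearest[0]


def top_outliers(points, k, n):
    """Return the n points with the largest k-th-neighbor distance as (score, index)."""
    top = []       # min-heap; top[0] is the weakest outlier kept so far
    cutoff = 0.0   # nothing is pruned until n candidates have been collected
    for i in range(len(points)):
        score = kth_nn_distance(points, i, k, cutoff)
        if score is None:
            continue
        if len(top) < n:
            heapq.heappush(top, (score, i))
        elif score > top[0][0]:
            heapq.heapreplace(top, (score, i))
        if len(top) == n:
            cutoff = top[0][0]
    return sorted(top, reverse=True)


if __name__ == "__main__":
    data = [(0, 0), (0, 1), (1, 0), (1, 1), (10, 10)]
    print(top_outliers(data, k=2, n=1))
```

The cutoff grows as stronger outliers are found, so most inliers are discarded after inspecting only a few neighbors. How much this helps depends on the data and the scan order, which is consistent with the abstract's observation that no single optimization dominates on all types of data.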

Published in

Proceedings of the VLDB Endowment, Volume 3, Issue 1-2
September 2010
1658 pages
ISSN: 2150-8097
Publisher: VLDB Endowment