Abstract
Detecting outliers in data is an important problem with applications in many domains, ranging from data cleaning and financial fraud detection to network intrusion detection and the clinical diagnosis of diseases. Over the last decade of research, distance-based outlier detection algorithms have emerged as a viable, scalable, parameter-free alternative to the more traditional statistical approaches.
In this paper we assess and evaluate several distance-based outlier detection approaches. We begin by surveying the design landscape of existing approaches and identifying their key design decisions. We then implement an outlier detection framework and conduct a factorial design experiment to understand the pros and cons of various optimizations, both those we propose and those proposed in the literature, applied independently and in conjunction with one another on a diverse set of real-life datasets. To the best of our knowledge, this is the first such study in the literature. The outcome of this study is a family of state-of-the-art distance-based outlier detection algorithms.
Our detailed empirical study supports the following observations. Combining optimization strategies enables significant efficiency gains. Our factorial design study highlights the important fact that no single optimization or combination of optimizations (factors) dominates on all types of data. Our study also allows us to characterize when a certain combination of optimizations is likely to prevail, and it provides useful insights for future work in this domain.
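As an illustration of the kind of algorithm under study, the following is a minimal sketch, not the framework evaluated in the paper. It assumes one common definition of a distance-based outlier (a point is scored by the distance to its k-th nearest neighbor, and the n highest-scoring points are reported) and one well-known optimization from the literature (a nested-loop scan in random order that abandons a point as soon as its score can no longer reach the current top n). The function name `knn_distance_outliers` and its parameters are hypothetical.

```python
import heapq
import math
import random


def knn_distance_outliers(points, k, n_outliers, seed=0):
    """Return [(score, index), ...] for the n_outliers points whose distance
    to their k-th nearest neighbor is largest, highest score first.

    Illustrative sketch only: names and defaults are assumptions.
    """
    if not 0 < k < len(points):
        raise ValueError("k must satisfy 0 < k < len(points)")

    # Scanning candidates in random order tends to find close neighbors
    # early, which makes the pruning test below fire sooner.
    order = list(range(len(points)))
    random.Random(seed).shuffle(order)

    top = []      # min-heap of (score, index): best outliers found so far
    cutoff = 0.0  # weakest score in `top` once it holds n_outliers entries

    for i in range(len(points)):
        nearest = []  # negated distances: max-heap of the k smallest so far
        pruned = False
        for j in order:
            if i == j:
                continue
            d = math.dist(points[i], points[j])
            if len(nearest) < k:
                heapq.heappush(nearest, -d)
            elif d < -nearest[0]:
                heapq.heapreplace(nearest, -d)
            # Once k neighbors are known, -nearest[0] is an upper bound on
            # the final score; below the cutoff, point i cannot be a top outlier.
            if (len(nearest) == k and len(top) == n_outliers
                    and -nearest[0] < cutoff):
                pruned = True
                break
        if pruned:
            continue
        score = -nearest[0]
        if len(top) < n_outliers:
            heapq.heappush(top, (score, i))
        else:
            heapq.heapreplace(top, (score, i))
        if len(top) == n_outliers:
            cutoff = top[0][0]

    return sorted(top, reverse=True)
```

Calling `knn_distance_outliers(points, k=3, n_outliers=2)` on a list of equal-length coordinate tuples returns the two highest-scoring points with their scores. Whether the pruning rule pays off depends on the data, which is consistent with the observation above that no single optimization dominates everywhere.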