Detecting outlying subspaces for high-dimensional data: the new task, algorithms, and performance

Zhang, Ji; Wang, Hai

doi:10.1007/s10115-006-0020-z

Regular Paper
Published: 15 March 2006

Volume 10, pages 333–355, (2006)
Knowledge and Information Systems

Ji Zhang¹ &
Hai Wang²

Abstract

In this paper, we identify a new task for studying the outlying degree (OD) of high-dimensional data: finding the subspaces (subsets of features) in which given points are outliers. We call these the points' outlying subspaces. Since state-of-the-art outlier detection techniques fail to handle this new problem, we propose a novel detection algorithm, High-Dimension Outlying subspace Detection (HighDOD), to detect the outlying subspaces of high-dimensional data efficiently. The intuitive idea of HighDOD is to measure the OD of a point as the sum of the distances between that point and its k nearest neighbors. Two heuristic pruning strategies are proposed to speed up the subspace search, and an efficient dynamic subspace search method with a sample-based learning process has been implemented. Experimental results show that HighDOD is efficient and outperforms other search alternatives such as the naive top–down, bottom–up and random search methods, and that existing outlier detection methods cannot fulfill this new task effectively.
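
The OD measure described in the abstract can be illustrated with a short Python sketch. This is not the HighDOD algorithm: it omits the pruning strategies and the sample-based dynamic search, and simply enumerates every subspace. The function names, the Euclidean distance, the `threshold` cutoff, and the toy data are all assumptions made for illustration.

```python
from itertools import combinations
from math import sqrt


def outlying_degree(point, data, subspace, k):
    """Sum of Euclidean distances from `point` to its k nearest neighbors,
    measured only on the feature indices listed in `subspace`."""
    dists = sorted(
        sqrt(sum((point[d] - other[d]) ** 2 for d in subspace))
        for other in data
        if other is not point  # skip the query point itself (identity check)
    )
    return sum(dists[:k])


def outlying_subspaces(point, data, k, threshold):
    """Naive exhaustive search: collect every non-empty subspace in which the
    OD of `point` reaches `threshold` (a hypothetical, user-chosen cutoff)."""
    n_dims = len(point)
    found = []
    for size in range(1, n_dims + 1):
        for subspace in combinations(range(n_dims), size):
            if outlying_degree(point, data, subspace, k) >= threshold:
                found.append(subspace)
    return found


# Toy data: the last point is far from the others only along feature 2.
data = [(0.0, 0.0, 0.0), (0.1, 0.0, 0.1), (0.0, 0.1, 0.0), (0.1, 0.1, 5.0)]
query = data[3]
result = outlying_subspaces(query, data, k=2, threshold=5.0)
print(result)  # [(2,), (0, 2), (1, 2), (0, 1, 2)]
```

Every reported subspace contains feature 2, the only feature on which the query point deviates. The exhaustive loop visits all 2^d − 1 subspaces of a d-dimensional dataset, which is why a pruned search is needed in high dimensions.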

Author information

Authors and Affiliations

Faculty of Computer Science, Dalhousie University, Halifax, Nova Scotia, Canada
Ji Zhang
Sobey School of Business, Saint Mary's University, Halifax, Nova Scotia, Canada
Hai Wang

Corresponding author

Correspondence to Ji Zhang.

Additional information

Ji Zhang received his BS from the Department of Information Systems and Information Management at Southeast University, Nanjing, China, in 2000 and his MSc from the Department of Computer Science at the National University of Singapore in 2002. He worked as a researcher in the Center for Information Mining and Extraction (CHIME) at the National University of Singapore from 2002 to 2003 and in the Department of Computer Science at the University of Toronto from 2003 to 2005. He is currently with the Faculty of Computer Science at Dalhousie University, Canada. His research interests include knowledge discovery and data mining, XML, and data cleaning. He has published papers in the Journal of Intelligent Information Systems (JIIS), the Journal of Database Management (JDM), and major international conferences such as VLDB, WWW, DEXA, DaWaK, and SDM.

Hai Wang is an assistant professor in the Department of Finance Management Science at the Sobey School of Business of Saint Mary's University, Canada. He received his BSc in computer science from the University of New Brunswick, and his MSc and PhD in computer science from the University of Toronto. His research interests are in the areas of database management, data mining, e-commerce, and performance evaluation. His papers have been published in the International Journal of Mobile Communications, Data & Knowledge Engineering, ACM SIGMETRICS Performance Evaluation Review, Knowledge and Information Systems, Performance Evaluation, and others.

About this article

Cite this article

Zhang, J., Wang, H. Detecting outlying subspaces for high-dimensional data: the new task, algorithms, and performance. Knowl Inf Syst 10, 333–355 (2006). https://doi.org/10.1007/s10115-006-0020-z

Received: 03 December 2003
Revised: 18 July 2005
Accepted: 30 July 2005
Published: 15 March 2006
Issue Date: October 2006
DOI: https://doi.org/10.1007/s10115-006-0020-z
