Information Sciences
Volume 177, Issue 2, 15 January 2007, Pages 467-475
 
doi:10.1016/j.ins.2006.03.006
Copyright © 2006 Elsevier Inc. All rights reserved.

Anomaly detection in web documents using crisp and fuzzy-based cosine clustering methodology

Menahem Friedman (a, b), Mark Last (b, corresponding author), Yaniv Makover (b) and Abraham Kandel (c)

(a) Department of Physics, Nuclear Research Center – Negev, Beer-Sheva, POB 9001, Israel
(b) Department of Information Systems Engineering, Ben-Gurion University of the Negev, Hanessiim, Beer-Sheva 84105, Israel
(c) Department of Computer Science and Engineering, University of South Florida, Tampa, FL 33620, USA

Available online 18 April 2006.


Abstract

Cluster analysis is a primary tool for detecting anomalous behavior in real-world data such as web documents, patients' medical records, or other personal data. Most existing methods for document clustering are based on the classical vector-space model, which represents each document by a fixed-size vector of weighted key terms, often referred to as key phrases. Since vector representations of documents are frequently very sparse, inverted files are used to avoid the heavy computational load that can arise in large and diverse document collections, such as pages downloaded from the World Wide Web. To reduce computation costs and space complexity, many popular methods for clustering web documents, including those using inverted files, usually assume a relatively small, predetermined number of clusters.
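
As a rough illustration of the representation described above (not code from the paper), a sparse document vector can be held as a mapping from key phrase to weight, and an inverted file can map each key phrase to the documents that contain it, so that similarity is computed only against documents sharing at least one phrase with the query. The function names `cosine`, `build_inverted_file`, and `candidates`, as well as the toy documents and weights, are assumptions made for this sketch.

```python
import math
from collections import defaultdict


def cosine(u, v):
    """Cosine similarity of two sparse vectors given as {key_phrase: weight}."""
    if len(u) > len(v):
        u, v = v, u  # iterate over the shorter vector
    dot = sum(w * v[k] for k, w in u.items() if k in v)
    norm_u = math.sqrt(sum(w * w for w in u.values()))
    norm_v = math.sqrt(sum(w * w for w in v.values()))
    if norm_u == 0.0 or norm_v == 0.0:
        return 0.0
    return dot / (norm_u * norm_v)


def build_inverted_file(docs):
    """Map each key phrase to the set of document ids that contain it."""
    index = defaultdict(set)
    for doc_id, vec in docs.items():
        for phrase in vec:
            index[phrase].add(doc_id)
    return index


def candidates(index, query):
    """Ids of documents sharing at least one key phrase with the query."""
    found = set()
    for phrase in query:
        found |= index.get(phrase, set())
    return found


# Toy collection (hypothetical key phrases and weights).
docs = {
    "d1": {"fuzzy clustering": 0.8, "cosine similarity": 0.6},
    "d2": {"cosine similarity": 1.0},
    "d3": {"nuclear physics": 1.0},
}
index = build_inverted_file(docs)
query = {"cosine similarity": 1.0}
for doc_id in sorted(candidates(index, query)):
    print(doc_id, round(cosine(query, docs[doc_id]), 3))
```

Here "d3" is never scored, because it shares no key phrase with the query; that pruning is the saving an inverted file offers on sparse collections.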

We propose several new crisp and fuzzy approaches, based on the cosine similarity principle, for clustering documents that are represented by variable-size vectors of key phrases, without limiting the final number of clusters. Each entry in a vector consists of two fields: the first refers to a key phrase in the document, and the second denotes the importance weight of that key phrase within the particular document. Removing the restriction on the total number of clusters may moderately increase computing costs, but it improves the method's performance in classifying incoming vectors as normal or abnormal, based on their similarity to the existing clusters. All the procedures presented in this work are characterized by two features: (a) the number of clusters is not restricted to some relatively small, predetermined number, i.e., an arbitrary new incoming vector that is not similar to any of the existing cluster centers necessarily starts a new cluster; and (b) a vector that appears n times in the training set is counted as n distinct vectors rather than as a single vector. These features are the main reasons for the high-quality performance of the proposed algorithms. We later describe the algorithms in detail and show their implementation in a real-world application from the area of web activity monitoring: detecting anomalous documents downloaded from the internet by users with abnormal information interests.
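
The two features (a) and (b) can be pictured with a generic threshold-based ("leader"-style) clustering pass. This is only a sketch under stated assumptions, not the CCC, FCC, or LFCC procedures, whose details the abstract does not give. The similarity threshold, the use of a running sum of member vectors as the cluster center, and the names `train`, `best_match`, and `is_anomalous` are all hypothetical.

```python
import math


def cosine(u, v):
    """Cosine similarity of two sparse {key_phrase: weight} vectors."""
    dot = sum(w * v.get(k, 0.0) for k, w in u.items())
    nu = math.sqrt(sum(w * w for w in u.values()))
    nv = math.sqrt(sum(w * w for w in v.values()))
    return dot / (nu * nv) if nu and nv else 0.0


def best_match(centers, vec):
    """Return (index, similarity) of the most similar center, or (None, 0.0)."""
    best_i, best_s = None, 0.0
    for i, center in enumerate(centers):
        s = cosine(center, vec)
        if best_i is None or s > best_s:
            best_i, best_s = i, s
    return best_i, best_s


def train(vectors, threshold):
    """One pass over the training vectors; the number of clusters is not capped."""
    centers, counts = [], []
    # Repeated vectors are processed once per occurrence, so a vector that
    # appears n times contributes n times to its cluster (feature b).
    for vec in vectors:
        i, s = best_match(centers, vec)
        if i is None or s < threshold:
            # Not similar to any existing center: start a new cluster (feature a).
            centers.append(dict(vec))
            counts.append(1)
        else:
            # Assumed center update: running sum of members (cosine ignores scale).
            for k, w in vec.items():
                centers[i][k] = centers[i].get(k, 0.0) + w
            counts[i] += 1
    return centers, counts


def is_anomalous(centers, vec, threshold):
    """Flag a vector as abnormal if no existing center is similar enough."""
    _, s = best_match(centers, vec)
    return s < threshold


# Toy training set (hypothetical key phrases and weights).
training = [
    {"stock market": 1.0, "interest rate": 0.5},
    {"stock market": 1.0, "interest rate": 0.5},
    {"football": 1.0, "league table": 0.7},
]
centers, counts = train(training, threshold=0.5)
print(counts)
print(is_anomalous(centers, {"explosives": 1.0}, threshold=0.5))
```

In this toy run the duplicated first vector is counted twice in its cluster, and a vector with an unseen key phrase matches no center and is therefore flagged as abnormal.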

Keywords: Fuzzy-based clustering; Document clustering; Cosine similarity; Anomaly detection

Article Outline

1. Introduction
2. The problem
3. Cosine similarity-based algorithms
3.1. Crisp cosine clustering (CCC)
3.2. Fuzzy-based cosine clustering (FCC)
3.3. Local fuzzy-based cosine clustering (LFCC)
4. Application: detecting anomalies in web documents
4.1. The experiment
4.2. The results
5. Summary
Acknowledgements
References



