Data & Knowledge Engineering
Volume 54, Issue 2, August 2005, Pages 189-210
 
 
doi:10.1016/j.datak.2004.12.001
Copyright © 2004 Elsevier B.V. All rights reserved.

Knowledge discovery by probabilistic clustering of distributed databases

Sally McClean (corresponding author), Bryan Scotney, Philip Morrow and Kieran Greer

School of Computing and Information Engineering, University of Ulster, Cromore Road, Coleraine BT52 1SA, Northern Ireland

Received 17 July 2004; accepted 1 December 2004. Available online 23 December 2004.


Abstract

Clustering of distributed databases facilitates knowledge discovery through learning of new concepts that characterise common features and differences between datasets. Hence, general patterns can be learned rather than restricting learning to specific databases from which rules may not be generalisable. We cluster databases that hold aggregate count data on categorical attributes that have been classified according to homogeneous or heterogeneous classification schemes. Clustering of datasets is carried out via the probability distributions that describe their respective aggregates. The homogeneous case is straightforward. For heterogeneous data we investigate a number of clustering strategies, of which the most efficient avoid the need to compute a dynamic shared ontology to homogenise the classification schemes prior to clustering.

Keywords: Distributed databases; Probabilistic clustering; Aggregates; Dynamic shared ontology
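
The abstract describes clustering datasets via the probability distributions of their aggregate counts, and the outline names Euclidean distance and the Kullback–Leibler divergence as distance metrics. The following is a minimal sketch of that idea for the homogeneous case only, where every database uses the same categories. It is not the authors' algorithm: the function names, the smoothing constant, the symmetrised form of the divergence, the threshold-based single-linkage grouping and the example counts are all assumptions made for illustration.

```python
import math


def to_distribution(counts, smoothing=1e-9):
    """Turn aggregate counts into probabilities.

    The tiny smoothing term (an assumption) keeps every probability
    positive so that the logarithm in the KL divergence is defined.
    """
    total = sum(counts)
    k = len(counts)
    return [(c + smoothing) / (total + smoothing * k) for c in counts]


def euclidean(p, q):
    """Euclidean distance between two distributions over the same categories."""
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(p, q)))


def kl_divergence(p, q):
    """Kullback-Leibler divergence D(p || q); note it is not symmetric."""
    return sum(a * math.log(a / b) for a, b in zip(p, q))


def symmetric_kl(p, q):
    """Symmetrised KL divergence, used here so the measure is order-independent."""
    return kl_divergence(p, q) + kl_divergence(q, p)


def cluster(distributions, distance, threshold):
    """Group databases by repeatedly merging two clusters that contain
    at least one pair of members closer than `threshold` (single linkage)."""
    clusters = [[i] for i in range(len(distributions))]
    merged = True
    while merged:
        merged = False
        for a in range(len(clusters)):
            for b in range(a + 1, len(clusters)):
                close = any(
                    distance(distributions[i], distributions[j]) < threshold
                    for i in clusters[a]
                    for j in clusters[b]
                )
                if close:
                    clusters[a].extend(clusters[b])
                    del clusters[b]
                    merged = True
                    break
            if merged:
                break  # restart the scan after every merge
    return [sorted(c) for c in clusters]


# Hypothetical aggregate counts: four databases, three shared categories.
counts = [
    [50, 30, 20],
    [48, 32, 20],
    [10, 20, 70],
    [12, 18, 70],
]
dists = [to_distribution(c) for c in counts]

by_euclidean = cluster(dists, euclidean, threshold=0.1)
by_kl = cluster(dists, symmetric_kl, threshold=0.1)
```

With these made-up counts both measures place databases 0 and 1 in one cluster and databases 2 and 3 in another. The heterogeneous case discussed in the article, where classification schemes differ between databases, needs additional machinery that this sketch does not attempt.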

Article Outline

1. Background
2. Terminology and data models
3. The distance metrics
3.1. Introduction
3.2. The homogeneous case
3.3. The heterogeneous case
3.3.1. The Euclidean distance metric
3.3.2. The Kullback–Leibler information divergence as a distance metric
4. Clustering the databases
5. Performance evaluation
5.1. An overlap metric
5.2. Performance of the Aggint operator
5.3. Accuracy of the clustering algorithms
5.4. Performance of the clustering algorithms
6. Summary and further work
Acknowledgements
References
Vitae





 