Copyright © 2004 Elsevier B.V. All rights reserved.
Knowledge discovery by probabilistic clustering of distributed databases
Received 17 July 2004
Abstract
Clustering of distributed databases facilitates knowledge discovery through learning of new concepts that characterise common features and differences between datasets. Hence, general patterns can be learned rather than restricting learning to specific databases from which rules may not be generalisable. We cluster databases that hold aggregate count data on categorical attributes that have been classified according to homogeneous or heterogeneous classification schemes. Clustering of datasets is carried out via the probability distributions that describe their respective aggregates. The homogeneous case is straightforward. For heterogeneous data we investigate a number of clustering strategies, of which the most efficient avoid the need to compute a dynamic shared ontology to homogenise the classification schemes prior to clustering.
Keywords: Distributed databases; Probabilistic clustering; Aggregates; Dynamic shared ontology
Article Outline
- 1. Background
- 2. Terminology and data models
- 3. The distance metrics
- 3.1. Introduction
- 3.2. The homogeneous case
- 3.3. The heterogeneous case
- 4. Clustering the databases
- 5. Performance evaluation
- 5.1. An overlap metric
- 5.2. Performance of the Aggint operator
- 5.3. Accuracy of the clustering algorithms
- 5.4. Performance of the clustering algorithms
- 6. Summary and further work
- Acknowledgements
- References
- Vitae






