Abstract
This work introduces a new approach to record clustering: a hybrid algorithm that clusters records according to threshold values and the query patterns issued against a particular database. The Hamming distance of a file is used as a measure of space density. The objective of the algorithm is to minimize the Hamming distance of the file while giving weight to the most frequently asked queries. Simulation experiments show that restructuring a file yields a substantial reduction in response time. We study the space density properties of a file and how they affect retrieval time before and after clustering, as a means of predicting file performance and choosing appropriate parameters. We also study factors that affect clustering and response time, including block size, threshold value, and the percentage of records satisfying a given set of queries.
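The abstract does not spell out how the Hamming distance of a file is computed, so the following is only an illustrative sketch under two assumptions that are not taken from the paper: each record is a binary attribute vector, and the file-level distance is the sum of Hamming distances between consecutive records. The names `hamming` and `file_hamming_distance` are hypothetical. Under those assumptions, placing similar records next to each other lowers the file's total distance, which is the effect the clustering objective aims for.

```python
from typing import List, Sequence


def hamming(a: Sequence[int], b: Sequence[int]) -> int:
    """Count the positions at which two equal-length binary vectors differ."""
    if len(a) != len(b):
        raise ValueError("vectors must have the same length")
    return sum(x != y for x, y in zip(a, b))


def file_hamming_distance(records: List[Sequence[int]]) -> int:
    """ASSUMED definition (not from the paper): sum of the Hamming
    distances between each pair of consecutive records in the file."""
    return sum(hamming(r, s) for r, s in zip(records, records[1:]))


# Hypothetical toy file: the same four records in two different orders.
unordered = [(1, 0, 0, 1), (0, 1, 1, 0), (1, 0, 0, 1), (0, 1, 1, 0)]
clustered = [(1, 0, 0, 1), (1, 0, 0, 1), (0, 1, 1, 0), (0, 1, 1, 0)]

print(file_hamming_distance(unordered))  # 12
print(file_hamming_distance(clustered))  # 4
```

In this toy case the reordered file has a third of the original distance, since identical records now sit side by side and only one boundary between dissimilar records remains.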
Cite this article
Moghrabi, I., Makholian, R. A New Approach to Clustering Records in Information Retrieval Systems. Information Retrieval 3, 105–126 (2000). https://doi.org/10.1023/A:1009901830009
DOI: https://doi.org/10.1023/A:1009901830009