Information Processing & Management
Volume 41, Issue 5, September 2005, Pages 1141-1161
 
doi:10.1016/j.ipm.2004.05.002
Copyright © 2004 Elsevier Ltd All rights reserved.

A case study of distributed information retrieval architectures to index one terabyte of text

Fidel Cacheda (a, corresponding author), Vassilis Plachouras (b) and Iadh Ounis (b)

(a) Department of Information and Communication Technologies, Facultad de Informática, University of A Coruña, Campus de Elviña s/n, 15071 A Coruña, Spain
(b) Department of Computing Science, University of Glasgow, Glasgow G12 8QQ, UK

Received 14 November 2003; accepted 3 May 2004. Available online 28 July 2004.


Abstract

The increasing number of documents to be indexed in many environments (Web, intranets, digital libraries) and the limitations of a single centralised index (lack of scalability, server overloading and failures) lead to the use of distributed information retrieval systems to search and locate the desired information efficiently. This work is a case study of different architectures for a distributed information retrieval system, intended as a guide to approximating the optimal architecture for a specific set of resources. We analyse the effectiveness of distributed, replicated and clustered architectures by simulating a variable number of workstations (from 1 up to 4096). A collection of approximately 94 million documents and 1 terabyte (TB) of text is used to test the performance of the different architectures. In a purely distributed information retrieval system, the brokers become the bottleneck due to the high number of local answer sets to be sorted. In a replicated system, the network is the bottleneck due to the high number of query servers and the continuous data interchange with the brokers. Finally, we demonstrate that a clustered system will outperform a replicated system if a high number of query servers is used, essentially due to the reduction of the network load. However, a change in the distribution of the users’ queries could reduce the performance of a clustered system.
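
The broker's role in a purely distributed system can be illustrated with a minimal sketch. It is not taken from the article: it assumes each query server returns a local answer set already sorted by descending score, and that the broker only has to merge these lists into a global top-k. The function name, the (score, doc_id) tuple layout and the sample data are all hypothetical. The point is that the broker's work grows with the number of local answer sets it receives, which is the bottleneck the abstract describes.

```python
import heapq
from itertools import islice


def merge_local_answer_sets(local_answer_sets, k):
    """Merge per-server ranked lists into a global top-k list.

    Assumption (not from the article): every local answer set is a list of
    (score, doc_id) tuples sorted by descending score.
    """
    # heapq.merge lazily merges already-sorted inputs; reverse=True means
    # the inputs and the output are ordered from largest to smallest score.
    merged = heapq.merge(*local_answer_sets, key=lambda hit: hit[0], reverse=True)
    return list(islice(merged, k))


# Hypothetical local answer sets from three query servers.
server_a = [(0.9, "a1"), (0.4, "a2")]
server_b = [(0.8, "b1"), (0.7, "b2")]
server_c = [(0.6, "c1")]

top3 = merge_local_answer_sets([server_a, server_b, server_c], k=3)
print(top3)
```

With n query servers, each merge step compares against a heap of up to n candidates, so a single broker does more work per query as servers are added, even before network transfer is considered.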

Keywords: Distributed information retrieval; Performance; Simulation

Article Outline

1. Introduction
2. Related work
3. Simulation model
3.1. Analytical model
3.2. The SPIRIT collection model
3.2.1. Document model
3.2.2. Query model
3.3. Distributed model
4. Simulation results
4.1. Distributed system
4.2. Replicated system
4.2.1. Compression to reduce network congestion
4.3. Clustered system
5. Conclusions
Acknowledgements
References












 