Abstract
We simulate different architectures of a distributed Information Retrieval system on a very large Web collection, in order to determine the optimal configuration for a given set of resources. We analyse the effectiveness of distributed, replicated and clustered architectures using a variable number of workstations. A collection of approximately 94 million documents and 1 terabyte of text is used to test the performance of the different architectures. We show that in a purely distributed architecture, the brokers become the bottleneck because of the large number of local answer sets to be sorted. In a replicated system, the network is the bottleneck because of the large number of query servers and the continuous data interchange with the brokers. Finally, we demonstrate that a clustered system outperforms a replicated system when a large number of query servers is used, mainly because of the reduced network load.
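To illustrate why the broker can become the bottleneck in a purely distributed architecture, the following is a minimal sketch of a broker merging ranked local answer sets from several query servers. It is not the simulation model used in the paper: the function name `merge_local_answer_sets`, the `(score, doc_id)` tuple layout and the sample data are all assumptions made for illustration, and the sketch assumes each query server returns its local answer set already sorted by descending score.

```python
import heapq
from itertools import islice


def merge_local_answer_sets(local_sets, k):
    """Return the top-k (score, doc_id) pairs across all query servers.

    Hypothetical helper: each local answer set is assumed to be sorted
    by descending score before it reaches the broker.
    """
    # heapq.merge performs a lazy k-way merge; negating the score makes
    # the ascending merge order correspond to descending relevance.
    merged = heapq.merge(*local_sets, key=lambda hit: -hit[0])
    return list(islice(merged, k))


# Three hypothetical query servers, each returning a ranked local answer set.
local_sets = [
    [(0.9, "d1"), (0.4, "d7")],
    [(0.8, "d3"), (0.5, "d2")],
    [(0.7, "d5"), (0.1, "d9")],
]
top = merge_local_answer_sets(local_sets, 3)
print(top)
```

In this sketch the broker has to receive and compare the head of every local answer set for each query, so its work grows with the number of query servers, which is consistent with the bottleneck described above.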
Copyright information
© 2004 Springer-Verlag Berlin Heidelberg
About this paper
Cite this paper
Cacheda, F., Plachouras, V., Ounis, I. (2004). Performance Analysis of Distributed Architectures to Index One Terabyte of Text. In: McDonald, S., Tait, J. (eds) Advances in Information Retrieval. ECIR 2004. Lecture Notes in Computer Science, vol 2997. Springer, Berlin, Heidelberg. https://doi.org/10.1007/978-3-540-24752-4_29
DOI: https://doi.org/10.1007/978-3-540-24752-4_29
Publisher Name: Springer, Berlin, Heidelberg
Print ISBN: 978-3-540-21382-6
Online ISBN: 978-3-540-24752-4
eBook Packages: Springer Book Archive