Performance Analysis of Distributed Architectures to Index One Terabyte of Text

Cacheda, Fidel; Plachouras, Vassilis; Ounis, Iadh

doi:10.1007/978-3-540-24752-4_29

Conference paper

Part of the book series: Lecture Notes in Computer Science (LNCS, volume 2997)

Abstract

We simulate different architectures of a distributed Information Retrieval system on a very large Web collection, in order to work out the optimal setting for a particular set of resources. We analyse the effectiveness of a distributed, replicated and clustered architecture using a variable number of workstations. A collection of approximately 94 million documents and 1 terabyte of text is used to test the performance of the different architectures. We show that in a purely distributed architecture, the brokers become the bottleneck due to the high number of local answer sets to be sorted. In a replicated system, the network is the bottleneck due to the high number of query servers and the continuous data interchange with the brokers. Finally, we demonstrate that a clustered system will outperform a replicated system if a large number of query servers is used, mainly due to the reduction of the network load.
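The broker step that the abstract identifies as the bottleneck in the purely distributed architecture can be pictured with a minimal sketch. It assumes, purely for illustration, that each query server returns a local answer set already sorted by descending score, and that the broker merges these into one global ranking. The function name `merge_local_answer_sets`, the `(score, doc_id)` tuple layout, and the sample data are hypothetical and are not taken from the paper.

```python
import heapq
from itertools import islice


def merge_local_answer_sets(local_answer_sets, top_k):
    """Merge per-server ranked lists into one global top_k ranking.

    Assumption: every local answer set is a list of (score, doc_id)
    tuples already sorted by descending score.
    """
    # heapq.merge expects ascending inputs, so the score is negated in the key.
    merged = heapq.merge(*local_answer_sets, key=lambda hit: -hit[0])
    # Only the first top_k hits are materialised by the broker.
    return list(islice(merged, top_k))


# Hypothetical local answer sets from three query servers.
local_answer_sets = [
    [(0.91, "doc-a"), (0.40, "doc-b")],
    [(0.87, "doc-c"), (0.12, "doc-d")],
    [(0.95, "doc-e")],
]
top = merge_local_answer_sets(local_answer_sets, top_k=3)
print(top)
```

In this sketch the broker's work grows with the number of local answer sets it has to combine, which is consistent with the abstract's observation that the brokers saturate when many query servers report to them.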

Author information

Authors and Affiliations

Department of Information and Communication Technologies, University of A Coruña, Facultad de Informática, Campus de Elviña s/n, 15071, A Coruña, Spain
Fidel Cacheda
Department of Computing Science, University of Glasgow, Glasgow, G12 8QQ, UK
Vassilis Plachouras & Iadh Ounis

Editor information

Editors and Affiliations

School of Computing and Technology, David Goldman Informatics Centre, University of Sunderland, St. Peter’s Campus, SR6 0DD, Sunderland, UK
Sharon McDonald
School of Computing and Technology, University of Sunderland, St. Peter’s Campus, St. Peter’s Way, SR6 0DD, Sunderland, United Kingdom
John Tait

Cite this paper

Cacheda, F., Plachouras, V., Ounis, I. (2004). Performance Analysis of Distributed Architectures to Index One Terabyte of Text. In: McDonald, S., Tait, J. (eds) Advances in Information Retrieval. ECIR 2004. Lecture Notes in Computer Science, vol 2997. Springer, Berlin, Heidelberg. https://doi.org/10.1007/978-3-540-24752-4_29

DOI: https://doi.org/10.1007/978-3-540-24752-4_29
Publisher Name: Springer, Berlin, Heidelberg
Print ISBN: 978-3-540-21382-6
Online ISBN: 978-3-540-24752-4
eBook Packages: Springer Book Archive
