Performance Analysis of Distributed Architectures to Index One Terabyte of Text

  • Conference paper

Part of the book series: Lecture Notes in Computer Science (LNCS, volume 2997)

Abstract

We simulate different architectures of a distributed Information Retrieval system on a very large Web collection, in order to work out the optimal setting for a particular set of resources. We analyse the effectiveness of distributed, replicated and clustered architectures using a variable number of workstations. A collection of approximately 94 million documents and 1 terabyte of text is used to test the performance of the different architectures. We show that in a purely distributed architecture, the brokers become the bottleneck due to the high number of local answer sets to be sorted. In a replicated system, the network is the bottleneck due to the high number of query servers and the continuous data interchange with the brokers. Finally, we demonstrate that a clustered system will outperform a replicated system if a large number of query servers is used, mainly due to the reduction of the network load.
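
To make the broker's role concrete, the following is a minimal sketch of how a broker might combine the local answer sets returned by several query servers into one global ranking. It is an illustration only, not the simulation model used in the paper: the function name `merge_answer_sets`, the `(doc_id, score)` tuple layout, and the assumption that each query server returns its answers already sorted by descending score are all hypothetical.

```python
import heapq
from itertools import islice


def merge_answer_sets(local_answer_sets, k):
    """Merge per-server ranked lists into a single global top-k list.

    Assumption: each local answer set is a list of (doc_id, score)
    tuples already sorted by descending score.
    """
    # heapq.merge performs a lazy k-way merge; negating the score makes
    # the ascending merge yield the highest-scoring documents first.
    merged = heapq.merge(*local_answer_sets, key=lambda hit: -hit[1])
    return list(islice(merged, k))


# Hypothetical answers from three query servers for one query.
server_a = [("d1", 0.9), ("d4", 0.5)]
server_b = [("d2", 0.8), ("d5", 0.3)]
server_c = [("d3", 0.7)]

top3 = merge_answer_sets([server_a, server_b, server_c], k=3)
print(top3)
```

In this sketch the broker's work for each query grows with the number of local answer sets it receives, which gives an intuition for why brokers can become the bottleneck as query servers are added.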

Copyright information

© 2004 Springer-Verlag Berlin Heidelberg

About this paper

Cite this paper

Cacheda, F., Plachouras, V., Ounis, I. (2004). Performance Analysis of Distributed Architectures to Index One Terabyte of Text. In: McDonald, S., Tait, J. (eds) Advances in Information Retrieval. ECIR 2004. Lecture Notes in Computer Science, vol 2997. Springer, Berlin, Heidelberg. https://doi.org/10.1007/978-3-540-24752-4_29

  • DOI: https://doi.org/10.1007/978-3-540-24752-4_29

  • Publisher Name: Springer, Berlin, Heidelberg

  • Print ISBN: 978-3-540-21382-6

  • Online ISBN: 978-3-540-24752-4

  • eBook Packages: Springer Book Archive
