Files

Abstract

Standard general-purpose Web retrieval relies on centralized search engines that do not realistically scale when applied to the exponentially growing number of documents available on the Web. By taking advantage of the resource sharing principle, Peer-to-Peer (P2P) techniques represent a promising architectural alternative for building decentralized Web search engines offering true Web-size scalability, provided that enough peers are available. However, in all such P2P approaches proposed so far, excessive network bandwidth consumption during retrieval, caused by the necessary transmission of possibly very long posting lists, was identified as the major bottleneck for implementing truly scalable P2P full-text Web retrieval. The main objective of the present research is thus to find a decentralized indexing/retrieval strategy that fully exploits the distributed computation possibilities provided by P2P networks, but keeps the required network bandwidth consumption scalable, while guaranteeing an acceptable retrieval quality. To address this problem we introduce a novel indexing/retrieval model based on Highly Discriminative Keys (HDKs), which correspond to carefully selected indexing terms and term sets associated with posting lists truncated to the top-k most relevant documents with respect to the associated key. Using HDKs for indexing thus increases the number of indexing features but, at the same time, strictly limits the size of the associated posting lists. When combined with an adequate retrieval model, this leads to strongly reduced network traffic. More precisely, our experimental results show that HDK-based indexing and retrieval lead to storage and bandwidth requirements that remain manageable with respect to the growth of document collection while preserving a fully acceptable retrieval quality.

Details

Actions