Peer-to-peer information retrieval using shared-content clustering

Knowledge and Information Systems

Abstract

Peer-to-peer (p2p) networks are used by millions for searching and downloading content. Recently, clustering algorithms were shown to be useful for helping users find content in large networks. Yet many of these algorithms overlook the fact that p2p networks follow graph models with a power-law node degree distribution. This paper studies the clusters obtained when clustering algorithms are applied to power-law graphs, and their applicability for finding content. Driven by the observed deficiencies, a simple yet efficient clustering algorithm is proposed, which targets a relaxed optimization of a minimal distance distribution of each cluster with a size balancing scheme. A comparative analysis using a song-similarity graph collected from 1.2 million Gnutella users reveals that commonly used efficiency measures often overlook search and recommendation applicability issues and give the false impression that the resulting clusters are well suited for these tasks. We show that the proposed algorithm performs well on various measures that are well suited for the domain.
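
The abstract does not spell out the proposed algorithm, so the following is only a rough illustration of what a size balancing scheme can look like in a distance-based clustering. It is not the authors' method. The function name `balanced_assign`, the `slack` parameter, and the toy distance table are all hypothetical. The sketch assumes that a few centers (for example, hubs in a power-law graph) are close to almost every item, so plain nearest-center assignment collapses into one large cluster, while a per-cluster size cap spreads the items out.

```python
from math import ceil


def balanced_assign(dist, k, slack=1.0):
    """Greedy size-capped assignment (illustrative only, not the paper's algorithm).

    dist[i][c] is the distance from item i to center c.
    Each cluster accepts at most ceil(slack * n / k) items; slack >= 1 is
    assumed so that every item can be placed.
    """
    n = len(dist)
    cap = ceil(slack * n / k)
    # Every (distance, item, center) pair, closest pairs first.
    pairs = sorted((dist[i][c], i, c) for i in range(n) for c in range(k))
    label = [None] * n
    size = [0] * k
    for _, i, c in pairs:
        # Take the closest still-open center for each unassigned item.
        if label[i] is None and size[c] < cap:
            label[i] = c
            size[c] += 1
    return label


# Toy data: center 0 behaves like a hub that is near every item.
dist = [[1, 5], [1, 4], [2, 3], [2, 6], [1, 2], [3, 3]]

# Plain nearest-center assignment puts everything in cluster 0.
nearest = [min(range(2), key=lambda c: row[c]) for row in dist]

# With a cap of 3 items per cluster, the items are split evenly.
labels = balanced_assign(dist, k=2)
print(nearest)  # [0, 0, 0, 0, 0, 0]
print(labels)   # [0, 0, 1, 1, 0, 1]
```

The greedy pass trades some total distance for balance: items 2, 3 and 5 end up in their second-choice cluster once the hub cluster is full. Raising `slack` above 1 loosens the cap and moves the result back toward plain nearest-center assignment.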


Author information

Corresponding author

Correspondence to Udi Weinsberg.

About this article

Cite this article

Ben-Gal, I., Shavitt, Y., Weinsberg, E. et al. Peer-to-peer information retrieval using shared-content clustering. Knowl Inf Syst 39, 383–408 (2014). https://doi.org/10.1007/s10115-013-0619-9

  • DOI: https://doi.org/10.1007/s10115-013-0619-9
