Distributed Large-Scale Information Filtering

Part of the book series: Lecture Notes in Computer Science (TLDKS, volume 8420)

Abstract

We study the problem of distributed resource sharing in peer-to-peer networks and focus on the problem of information filtering. In our setting, subscriptions and publications are specified using an expressive attribute-value representation that supports both the Boolean and Vector Space models. We use an extension of the distributed hash table Chord to organise the nodes and store user subscriptions, and utilise efficient publication protocols that keep the network traffic and latency low at filtering time. To verify our approach, we evaluate the proposed protocols experimentally using thousands of nodes, millions of user subscriptions, and two different real-life corpora. We also study three important facets of the load-balancing problem in such a scenario and present a novel algorithm that manages to distribute the load evenly among the nodes. Our results show that the designed protocols are scalable and efficient: they achieve expressive information filtering functionality with low message traffic and latency.
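As a rough illustration of how subscriptions might be spread over a Chord-style ring, the sketch below hashes keys with SHA-1 and assigns each key to the first node whose identifier is equal to or follows it, which is the standard Chord placement rule. It is not the chapter's protocol: the identifier length `M`, the `attribute:word` key format, and the helper names `chord_id`, `successor`, and `place_subscription` are all assumptions made for this example.

```python
import hashlib
from bisect import bisect_left

M = 16  # identifier length in bits (assumption; kept small for readability)


def chord_id(key, m=M):
    """Map a string key to an identifier on a ring of size 2**m using SHA-1."""
    digest = hashlib.sha1(key.encode("utf-8")).digest()
    return int.from_bytes(digest, "big") % (1 << m)


def successor(ident, node_ids):
    """Return the first node id >= ident, wrapping around the ring.

    node_ids must be a non-empty sorted list of node identifiers.
    """
    i = bisect_left(node_ids, ident)
    return node_ids[i % len(node_ids)]


def place_subscription(subscription, node_ids):
    """Assign each (attribute, word) pair of a subscription to a node.

    subscription maps an attribute name to a list of words. Keying on
    "attribute:word" is a hypothetical choice, not taken from the chapter.
    """
    placement = {}
    for attribute, words in subscription.items():
        for word in words:
            node = successor(chord_id(attribute + ":" + word), node_ids)
            placement.setdefault(node, []).append((attribute, word))
    return placement


if __name__ == "__main__":
    nodes = sorted({chord_id("node-%d" % i) for i in range(8)})
    sub = {"title": ["peer", "filtering"], "author": ["koubarakis"]}
    for node, pairs in sorted(place_subscription(sub, nodes).items()):
        print(node, pairs)
```

Because placement depends only on the hash of the key, any node can compute where a given pair lives without global coordination; the price is that popular words concentrate load on a few nodes, which is the kind of imbalance the abstract's load-balancing study is concerned with.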

Part of this work was performed while the authors were with the Technical University of Crete, Chania, Greece. C. Tryfonopoulos was partially supported by programme Heraclitus of the Greek Ministry of Education.

Notes

  1. A proximity formula is an expression of the form \(w_1 \prec_{\xi_1} \cdots \prec_{\xi_k} w_{k}\), where each \(w_i\) is a word and each \(\xi_i\) is a distance interval from the set \(\{[l,u] : l,u \in \mathbb{N},\ l \ge 0 \ \text{and}\ l \le u\} \cup \{[l,\infty) : l \in \mathbb{N} \ \text{and}\ l \ge 0\}\). The proximity operator \(\prec_{\xi}\) captures the concepts of order and distance between words in a text document, using intervals that impose lower and upper bounds on the distances between words.
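The definition in the note can be made concrete with a small matcher. The sketch below checks whether a tokenised document satisfies a proximity formula, under two assumptions that the note does not state: the distance between two words is taken to be the number of words lying between them, and a formula over n words carries n - 1 intervals, one per adjacent pair. The function name `satisfies` and the tuple encoding of intervals, with `math.inf` standing for an unbounded upper end, are illustrative choices.

```python
from math import inf


def satisfies(document, words, intervals):
    """Check whether `document` (a list of tokens) satisfies the formula
    words[0] <_{intervals[0]} words[1] <_{intervals[1]} ... words[-1].

    Each interval is a pair (l, u) with 0 <= l <= u; u may be math.inf.
    Assumption: distance = number of tokens strictly between two words.
    """
    if len(intervals) != len(words) - 1:
        raise ValueError("need exactly one interval per adjacent word pair")

    # Index every token by the positions at which it occurs.
    positions = {}
    for index, token in enumerate(document):
        positions.setdefault(token, []).append(index)

    def search(k, previous):
        """Try to place words[k:] given that words[k-1] sits at `previous`."""
        if k == len(words):
            return True
        low, high = intervals[k - 1]
        for p in positions.get(words[k], []):
            gap = p - previous - 1  # negative when p does not follow previous
            if low <= gap <= high and search(k + 1, p):
                return True
        return False

    return any(search(1, p) for p in positions.get(words[0], []))


if __name__ == "__main__":
    doc = "distributed information filtering in large peer to peer networks".split()
    print(satisfies(doc, ["information", "filtering"], [(0, 0)]))
    print(satisfies(doc, ["filtering", "information"], [(0, inf)]))
```

Since every lower bound is at least 0, a candidate position that does not come strictly after the previous word yields a negative gap and is rejected, so word order is enforced by the interval test itself.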

Corresponding author

Correspondence to Christos Tryfonopoulos.

Copyright information

© 2014 Springer-Verlag Berlin Heidelberg

About this chapter

Cite this chapter

Tryfonopoulos, C., Idreos, S., Koubarakis, M., Raftopoulou, P. (2014). Distributed Large-Scale Information Filtering. In: Hameurlain, A., Küng, J., Wagner, R. (eds) Transactions on Large-Scale Data- and Knowledge-Centered Systems XIII. Lecture Notes in Computer Science, vol 8420. Springer, Berlin, Heidelberg. https://doi.org/10.1007/978-3-642-54426-2_4

  • DOI: https://doi.org/10.1007/978-3-642-54426-2_4

  • Publisher Name: Springer, Berlin, Heidelberg

  • Print ISBN: 978-3-642-54425-5

  • Online ISBN: 978-3-642-54426-2

  • eBook Packages: Computer Science (R0)
