skip to main content
10.1145/1135777.1135793acmconferencesArticle/Chapter ViewAbstractPublication PageswwwConference Proceedingsconference-collections
Article

Site level noise removal for search engines

Published:23 May 2006Publication History

ABSTRACT

The currently booming search engine industry has determined many online organizations to attempt to artificially increase their ranking in order to attract more visitors to their web sites. At the same time, the growth of the web has also inherently generated several navigational hyperlink structures that have a negative impact on the importance measures employed by current search engines. In this paper we propose and evaluate algorithms for identifying all these noisy links on the web graph, may them be spam or simple relationships between real world entities represented by sites, replication of content, etc. Unlike prior work, we target a different type of noisy link structures, residing at the site level, instead of the page level. We thus investigate and annihilate site level mutual reinforcement relationships, abnormal support coming from one site towards another, as well as complex link alliances between web sites. Our experiments with the link database of the TodoBR search engine show a very strong increase in the quality of the output rankings after having applied our techniques.

References

  1. E. Amitay, D. Carmel, A. Darlow, R. Lempel, and A. Soffer. The connectivity sonar: detecting site functionality by structural patterns. In Proceedings of the 14th ACM Conference on Hypertext and Hypermedia, pages 38--47, 2003. Google ScholarGoogle ScholarDigital LibraryDigital Library
  2. Badrank. http://en.efactory.de/e-pr0.shtml.Google ScholarGoogle Scholar
  3. R. Baeza-Yates, C. Castillo, and V. López. Pagerank increase under different collusion topologies. In First International Workshop on Adversarial Information Retrieval on the Web, 2005.Google ScholarGoogle Scholar
  4. R. A. Baeza-Yates and B. A. Ribeiro-Neto. Modern Information Retrieval. ACM Press / Addison-Wesley, 1999. Google ScholarGoogle ScholarDigital LibraryDigital Library
  5. A. A. Benczur, K. Csalogany, T. Sarlos, and M. Uher. Spamrank - fully automatic link spam detection. In First International Workshop on Adversarial Information Retrieval on the Web, 2005.Google ScholarGoogle Scholar
  6. K. Bharat, A. Z. Broder, J. Dean, and M. R. Henzinger. A comparison of techniques to find mirrored hosts on the WWW. Journal of the American Society of Information Science, 51(12):1114--1122, 2000. Google ScholarGoogle ScholarDigital LibraryDigital Library
  7. K. Bharat and M. R. Henzinger. Improved algorithms for topic distillation in a hyperlinked environment. In Proc. of 21st ACM International SIGIR Conference on Research and Development in Information Retrieval, pages 104--111, Melbourne, AU, 1998. Google ScholarGoogle ScholarDigital LibraryDigital Library
  8. A. Borodin, G. O. Roberts, J. S. Rosenthal, and P. Tsaparas. Finding authorities and hubs from link structures on the world wide web. In Proceedings of the 10th International Conference on World Wide Web, pages 415--429, 2001. Google ScholarGoogle ScholarDigital LibraryDigital Library
  9. S. Brin, R. Motwani, L. Page, and T. Winograd. What can you do with a web in your pocket? Data Engineering Bulletin, 21(2):37--47, 1998.Google ScholarGoogle Scholar
  10. A. Z. Broder, S. C. Glassman, M. S. Manasse, and G. Zweig. Syntactic clustering of the web. Comput. Netw. ISDN Syst., 29(8-13):1157--1166, 1997. Google ScholarGoogle ScholarDigital LibraryDigital Library
  11. S. Chakrabarti. Integrating the document object model with hyperlinks for enhanced topic distillation and information extraction. In Proc. of the 10th International Conference on World Wide Web, pages 211--220, 2001. Google ScholarGoogle ScholarDigital LibraryDigital Library
  12. S. Chakrabarti. Mining the Web: Discovering Knowledge from Hypertext Data. Morgan Kaufmann, 2003. Google ScholarGoogle ScholarDigital LibraryDigital Library
  13. B. Davison. Recognizing nepotistic links on the web. In Proceedings of the AAAI-2000 Workshop on Artificial Intelligence for Web Search, 2000.Google ScholarGoogle Scholar
  14. N. Eiron and K. S. McCurley. Untangling compound documents on the web. In Proc. of the 14th ACM Conference on Hypertext and Hypermedia, pages 85--94, 2003. Google ScholarGoogle ScholarDigital LibraryDigital Library
  15. D. Fetterly, M. Manasse, and M. Najork. Spam, damn spam, and statistics: using statistical analysis to locate spam web pages. In WebDB '04: Proceedings of the 7th International Workshop on the Web and Databases, pages 1--6, 2004. Google ScholarGoogle ScholarDigital LibraryDigital Library
  16. Z. Gyöngyi and H. Garcia-Molina. Link spam alliances. In Proc. of the 31st International VLDB Conference on Very Large Data Bases, pages 517--528, 2005. Google ScholarGoogle ScholarDigital LibraryDigital Library
  17. Z. Gyöngyi and H. Garcia-Molina. Web spam taxonomy. In Proceedings of the Adversarial Information Retrieval held the 14th Intl. World Wide Web Conference, 2005.Google ScholarGoogle Scholar
  18. Z. Gyöngyi, H. Garcia-Molina, and J. Pendersen. Combating web spam with trustrank. In Proceedings of the 30th International VLDB Conference, 2004. Google ScholarGoogle ScholarDigital LibraryDigital Library
  19. D. Hawking, E. Voorhees, N. Craswell, and P. Bailey. Overview of the trec8 web track. In Eighth Text Retrieval Conference, 1999.Google ScholarGoogle Scholar
  20. T. Joachims, L. Granka, B. Pan, H. Hembrooke, and G. Gay. Accurately interpreting clickthrough data as implicit feedback. In Proc. of the 28th International ACM SIGIR Conference on Research and Development in Information Retrieval, 2005. Google ScholarGoogle ScholarDigital LibraryDigital Library
  21. J. M. Kleinberg. Authoritative sources in a hyperlinked environment. Journal of the ACM, 46(5):604--632, 1999. Google ScholarGoogle ScholarDigital LibraryDigital Library
  22. R. Kumar, P. Raghavan, S. Rajagopalan, and A. Tomkins. Trawling the web for emerging cyber-communities. In Proceeding of the 8th International Conference on World Wide Web, pages 1481--1493, 1999. Google ScholarGoogle ScholarDigital LibraryDigital Library
  23. R. Lempel and S. Moran. The stochastic approach for link-structure analysis (SALSA) and the TKC effect. Computer Networks (Amsterdam, Netherlands: 1999), 33(1-6):387--401, 2000. Google ScholarGoogle ScholarDigital LibraryDigital Library
  24. L. Li, Y. Shang, and W. Zhang. Improvement of hits-based algorithms on web documents. In Proceedings of the 11th International Conference on World Wide Web, pages 527--535, 2002. Google ScholarGoogle ScholarDigital LibraryDigital Library
  25. L. Page, S. Brin, R. Motwani, and T. Winograd. The pagerank citation ranking: Bringing order to the web. Technical report, Stanford University, 1998.Google ScholarGoogle Scholar
  26. G. Roberts and J. Rosenthal. Downweighting tightly knit communities in world wide web rankings. Advances and Applications in Statistics (ADAS), 3:199--216, 2003.Google ScholarGoogle Scholar
  27. B. Wu and B. Davison. Identifying link farm spam pages. In Proceedings of the 14th World Wide Web Conference, 2005. Google ScholarGoogle ScholarDigital LibraryDigital Library
  28. B. Wu and B. Davison. Undue influence: Eliminating the impact of link plagiarism on web search rankings. Technical report, LeHigh University, 2005.Google ScholarGoogle Scholar
  29. H. Zhang, A. Goel, R. Govindan, K. Mason, and B. van Roy. Improving eigenvector-based reputation systems against collusions. In Proceedings of the 3rd Workshop on Web Graph Algorithms, 2004.Google ScholarGoogle ScholarCross RefCross Ref

Index Terms

  1. Site level noise removal for search engines

    Recommendations

    Comments

    Login options

    Check if you have access through your login credentials or your institution to get full access on this article.

    Sign in
    • Published in

      cover image ACM Conferences
      WWW '06: Proceedings of the 15th international conference on World Wide Web
      May 2006
      1102 pages
      ISBN:1595933239
      DOI:10.1145/1135777

      Copyright © 2006 ACM

      Permission to make digital or hard copies of all or part of this work for personal or classroom use is granted without fee provided that copies are not made or distributed for profit or commercial advantage and that copies bear this notice and the full citation on the first page. Copyrights for components of this work owned by others than ACM must be honored. Abstracting with credit is permitted. To copy otherwise, or republish, to post on servers or to redistribute to lists, requires prior specific permission and/or a fee. Request permissions from [email protected]

      Publisher

      Association for Computing Machinery

      New York, NY, United States

      Publication History

      • Published: 23 May 2006

      Permissions

      Request permissions about this article.

      Request Permissions

      Check for updates

      Qualifiers

      • Article

      Acceptance Rates

      Overall Acceptance Rate1,899of8,196submissions,23%

      Upcoming Conference

      WWW '24
      The ACM Web Conference 2024
      May 13 - 17, 2024
      Singapore , Singapore

    PDF Format

    View or Download as a PDF file.

    PDF

    eReader

    View online with eReader.

    eReader