ABSTRACT
The currently booming search engine industry has determined many online organizations to attempt to artificially increase their ranking in order to attract more visitors to their web sites. At the same time, the growth of the web has also inherently generated several navigational hyperlink structures that have a negative impact on the importance measures employed by current search engines. In this paper we propose and evaluate algorithms for identifying all these noisy links on the web graph, may them be spam or simple relationships between real world entities represented by sites, replication of content, etc. Unlike prior work, we target a different type of noisy link structures, residing at the site level, instead of the page level. We thus investigate and annihilate site level mutual reinforcement relationships, abnormal support coming from one site towards another, as well as complex link alliances between web sites. Our experiments with the link database of the TodoBR search engine show a very strong increase in the quality of the output rankings after having applied our techniques.
- E. Amitay, D. Carmel, A. Darlow, R. Lempel, and A. Soffer. The connectivity sonar: detecting site functionality by structural patterns. In Proceedings of the 14th ACM Conference on Hypertext and Hypermedia, pages 38--47, 2003. Google ScholarDigital Library
- Badrank. http://en.efactory.de/e-pr0.shtml.Google Scholar
- R. Baeza-Yates, C. Castillo, and V. López. Pagerank increase under different collusion topologies. In First International Workshop on Adversarial Information Retrieval on the Web, 2005.Google Scholar
- R. A. Baeza-Yates and B. A. Ribeiro-Neto. Modern Information Retrieval. ACM Press / Addison-Wesley, 1999. Google ScholarDigital Library
- A. A. Benczur, K. Csalogany, T. Sarlos, and M. Uher. Spamrank - fully automatic link spam detection. In First International Workshop on Adversarial Information Retrieval on the Web, 2005.Google Scholar
- K. Bharat, A. Z. Broder, J. Dean, and M. R. Henzinger. A comparison of techniques to find mirrored hosts on the WWW. Journal of the American Society of Information Science, 51(12):1114--1122, 2000. Google ScholarDigital Library
- K. Bharat and M. R. Henzinger. Improved algorithms for topic distillation in a hyperlinked environment. In Proc. of 21st ACM International SIGIR Conference on Research and Development in Information Retrieval, pages 104--111, Melbourne, AU, 1998. Google ScholarDigital Library
- A. Borodin, G. O. Roberts, J. S. Rosenthal, and P. Tsaparas. Finding authorities and hubs from link structures on the world wide web. In Proceedings of the 10th International Conference on World Wide Web, pages 415--429, 2001. Google ScholarDigital Library
- S. Brin, R. Motwani, L. Page, and T. Winograd. What can you do with a web in your pocket? Data Engineering Bulletin, 21(2):37--47, 1998.Google Scholar
- A. Z. Broder, S. C. Glassman, M. S. Manasse, and G. Zweig. Syntactic clustering of the web. Comput. Netw. ISDN Syst., 29(8-13):1157--1166, 1997. Google ScholarDigital Library
- S. Chakrabarti. Integrating the document object model with hyperlinks for enhanced topic distillation and information extraction. In Proc. of the 10th International Conference on World Wide Web, pages 211--220, 2001. Google ScholarDigital Library
- S. Chakrabarti. Mining the Web: Discovering Knowledge from Hypertext Data. Morgan Kaufmann, 2003. Google ScholarDigital Library
- B. Davison. Recognizing nepotistic links on the web. In Proceedings of the AAAI-2000 Workshop on Artificial Intelligence for Web Search, 2000.Google Scholar
- N. Eiron and K. S. McCurley. Untangling compound documents on the web. In Proc. of the 14th ACM Conference on Hypertext and Hypermedia, pages 85--94, 2003. Google ScholarDigital Library
- D. Fetterly, M. Manasse, and M. Najork. Spam, damn spam, and statistics: using statistical analysis to locate spam web pages. In WebDB '04: Proceedings of the 7th International Workshop on the Web and Databases, pages 1--6, 2004. Google ScholarDigital Library
- Z. Gyöngyi and H. Garcia-Molina. Link spam alliances. In Proc. of the 31st International VLDB Conference on Very Large Data Bases, pages 517--528, 2005. Google ScholarDigital Library
- Z. Gyöngyi and H. Garcia-Molina. Web spam taxonomy. In Proceedings of the Adversarial Information Retrieval held the 14th Intl. World Wide Web Conference, 2005.Google Scholar
- Z. Gyöngyi, H. Garcia-Molina, and J. Pendersen. Combating web spam with trustrank. In Proceedings of the 30th International VLDB Conference, 2004. Google ScholarDigital Library
- D. Hawking, E. Voorhees, N. Craswell, and P. Bailey. Overview of the trec8 web track. In Eighth Text Retrieval Conference, 1999.Google Scholar
- T. Joachims, L. Granka, B. Pan, H. Hembrooke, and G. Gay. Accurately interpreting clickthrough data as implicit feedback. In Proc. of the 28th International ACM SIGIR Conference on Research and Development in Information Retrieval, 2005. Google ScholarDigital Library
- J. M. Kleinberg. Authoritative sources in a hyperlinked environment. Journal of the ACM, 46(5):604--632, 1999. Google ScholarDigital Library
- R. Kumar, P. Raghavan, S. Rajagopalan, and A. Tomkins. Trawling the web for emerging cyber-communities. In Proceeding of the 8th International Conference on World Wide Web, pages 1481--1493, 1999. Google ScholarDigital Library
- R. Lempel and S. Moran. The stochastic approach for link-structure analysis (SALSA) and the TKC effect. Computer Networks (Amsterdam, Netherlands: 1999), 33(1-6):387--401, 2000. Google ScholarDigital Library
- L. Li, Y. Shang, and W. Zhang. Improvement of hits-based algorithms on web documents. In Proceedings of the 11th International Conference on World Wide Web, pages 527--535, 2002. Google ScholarDigital Library
- L. Page, S. Brin, R. Motwani, and T. Winograd. The pagerank citation ranking: Bringing order to the web. Technical report, Stanford University, 1998.Google Scholar
- G. Roberts and J. Rosenthal. Downweighting tightly knit communities in world wide web rankings. Advances and Applications in Statistics (ADAS), 3:199--216, 2003.Google Scholar
- B. Wu and B. Davison. Identifying link farm spam pages. In Proceedings of the 14th World Wide Web Conference, 2005. Google ScholarDigital Library
- B. Wu and B. Davison. Undue influence: Eliminating the impact of link plagiarism on web search rankings. Technical report, LeHigh University, 2005.Google Scholar
- H. Zhang, A. Goel, R. Govindan, K. Mason, and B. van Roy. Improving eigenvector-based reputation systems against collusions. In Proceedings of the 3rd Workshop on Web Graph Algorithms, 2004.Google ScholarCross Ref
Index Terms
- Site level noise removal for search engines
Recommendations
A site-ranking algorithm for a small group of sites
ICCSA'07: Proceedings of the 2007 international conference on Computational science and Its applications - Volume Part IIHyperlink, or shortly link, analysis seeks to model the web structures and discover the relations among web sites or Web pages. The extracted models or relations can be used for the web mining applications, including market researches and various online ...
Impact of search engines on page popularity
WWW '04: Proceedings of the 13th international conference on World Wide WebRecent studies show that a majority of Web page accesses are referred by search engines. In this paper we study the widespread use of Web search engines and its impact on the ecology of the Web. In particular, we study how much impact search engines ...
Identifying link farm spam pages
WWW '05: Special interest tracks and posters of the 14th international conference on World Wide WebWith the increasing importance of search in guiding today's web traffic, more and more effort has been spent to create search engine spam. Since link analysis is one of the most important factors in current commercial search engines' ranking systems, ...
Comments