Article

Site level noise removal for search engines

Authors:
André Luiz da Costa Carvalho
Federal University of Amazonas, Ramos, Manaus, Brazil

Paul-Alexandru Chirita
L3S and University of Hannover, Hannover, Germany

Edleno Silva de Moura
Federal University of Amazonas, Ramos, Manaus, Brazil

Pável Calado
IST/INESC-ID, Porto Salvo, Portugal

Wolfgang Nejdl
L3S and University of Hannover, Hannover, Germany

WWW '06: Proceedings of the 15th international conference on World Wide Web
May 2006
Pages 73–82
https://doi.org/10.1145/1135777.1135793

Published: 23 May 2006

ABSTRACT

The booming search engine industry has led many online organizations to try to artificially inflate their rankings in order to attract more visitors to their web sites. At the same time, the growth of the web has inherently produced several navigational hyperlink structures that negatively affect the importance measures employed by current search engines. In this paper we propose and evaluate algorithms for identifying all of these noisy links in the web graph, be they spam or simple relationships between real-world entities represented by sites, replication of content, etc. Unlike prior work, we target noisy link structures that reside at the site level rather than the page level. We thus investigate and eliminate site-level mutual reinforcement relationships, abnormal support coming from one site towards another, and complex link alliances between web sites. Our experiments with the link database of the TodoBR search engine show a very strong increase in the quality of the output rankings after applying our techniques.
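To make the idea of a site-level mutual reinforcement relationship concrete, here is a minimal sketch. It is an illustration only and not the algorithm evaluated in the paper: the abstract does not give the detection rules, so the host-name-based site mapping, the `threshold` parameter, and all function names below are assumptions. The sketch collapses page-level links into counts between sites, flags pairs of sites that link to each other heavily in both directions, and drops the page-level links between flagged pairs.

```python
from collections import Counter
from urllib.parse import urlparse


def site_of(url):
    """Map a page URL to its site; using the host name is an assumption."""
    return urlparse(url).netloc.lower()


def site_link_counts(page_links):
    """Count page-level links between each ordered pair of distinct sites."""
    counts = Counter()
    for src, dst in page_links:
        s, d = site_of(src), site_of(dst)
        if s and d and s != d:  # intra-site links are not counted
            counts[(s, d)] += 1
    return counts


def mutual_reinforcement_pairs(counts, threshold=2):
    """Flag site pairs linking to each other >= threshold times each way.

    The threshold value is hypothetical, chosen only for this example.
    """
    flagged = set()
    for (s, d), n in counts.items():
        if n >= threshold and counts.get((d, s), 0) >= threshold:
            flagged.add(tuple(sorted((s, d))))
    return flagged


def remove_noisy_links(page_links, flagged):
    """Drop page-level links whose two sites form a flagged pair."""
    return [
        (src, dst)
        for src, dst in page_links
        if tuple(sorted((site_of(src), site_of(dst)))) not in flagged
    ]


# Toy link list with made-up URLs.
links = [
    ("http://a.example/1", "http://b.example/x"),
    ("http://a.example/2", "http://b.example/y"),
    ("http://b.example/x", "http://a.example/1"),
    ("http://b.example/z", "http://a.example/2"),
    ("http://a.example/1", "http://c.example/"),
    ("http://a.example/1", "http://a.example/2"),
]
counts = site_link_counts(links)
flagged = mutual_reinforcement_pairs(counts, threshold=2)
cleaned = remove_noisy_links(links, flagged)
```

In this toy input, the sites `a.example` and `b.example` exchange two links in each direction, so the pair is flagged and its four links are removed; the single link to `c.example` and the intra-site link are kept. A ranking function such as PageRank would then be computed on the cleaned link list.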

Index Terms

Site level noise removal for search engines
1. Information systems
  1. Information retrieval

Published in
WWW '06: Proceedings of the 15th international conference on World Wide Web
May 2006
1102 pages
ISBN:1595933239
DOI:10.1145/1135777
General Chairs:
Leslie Carr, University of Southampton
David De Roure, University of Southampton
Arun Iyengar, IBM Research
Program Chairs:
Carole Goble, University of Manchester, UK
Mike Dahlin, University of Texas at Austin
Copyright © 2006 ACM
Permission to make digital or hard copies of all or part of this work for personal or classroom use is granted without fee provided that copies are not made or distributed for profit or commercial advantage and that copies bear this notice and the full citation on the first page. Copyrights for components of this work owned by others than ACM must be honored. Abstracting with credit is permitted. To copy otherwise, or republish, to post on servers or to redistribute to lists, requires prior specific permission and/or a fee. Request permissions from [email protected]
Publisher
Association for Computing Machinery
New York, NY, United States
Author Tags
PageRank
link analysis
noise reduction
spam
Qualifiers
- Article
Conference

Acceptance Rates
Overall Acceptance Rate: 1,899 of 8,196 submissions, 23%