Abstract
Spamdexing is any of various methods used to manipulate the relevancy or prominence of resources indexed by a search engine, usually in a manner inconsistent with the purpose of the indexing system. Combating spamdexing has become one of the top challenges for web search. Machine-learning-based methods have proved advantageous because they adapt easily to newly developed spam techniques. In this paper, we propose a two-stage classification strategy to detect web spam, based on the spamicity predicted by learning algorithms and on hyperlink propagation. Preliminary experiments on the standard WEBSPAM-UK2006 benchmark show that the two-stage strategy is reasonable and effective.
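The abstract does not spell out how the two stages are combined, so the following is only a minimal sketch of one plausible reading, not the authors' method. It assumes that a first-stage classifier has already assigned each host a spamicity score in [0, 1], and that the second stage blends each host's score with the mean score of the hosts it links to. The function names, the `weight` parameter, the 0.5 threshold, and the toy hosts are all hypothetical.

```python
from typing import Dict, List


def propagate_spamicity(
    spamicity: Dict[str, float],
    out_links: Dict[str, List[str]],
    weight: float = 0.5,  # hypothetical blending factor, not from the paper
) -> Dict[str, float]:
    """Blend each host's first-stage score with the mean score of its out-neighbours."""
    refined = {}
    for host, score in spamicity.items():
        # Only neighbours that also received a first-stage score are used.
        neighbours = [n for n in out_links.get(host, []) if n in spamicity]
        if neighbours:
            mean = sum(spamicity[n] for n in neighbours) / len(neighbours)
            refined[host] = (1 - weight) * score + weight * mean
        else:
            # No scored neighbours: keep the first-stage score unchanged.
            refined[host] = score
    return refined


def classify(refined: Dict[str, float], threshold: float = 0.5) -> Dict[str, bool]:
    """Label a host as spam when its refined score reaches the threshold."""
    return {host: score >= threshold for host, score in refined.items()}


# Toy first-stage scores (assumed, for illustration only).
stage_one = {"a.example": 0.9, "b.example": 0.8, "c.example": 0.4, "d.example": 0.1}
# Toy hyperlink graph: c.example links to two high-spamicity hosts.
links = {"c.example": ["a.example", "b.example"], "d.example": []}

refined_scores = propagate_spamicity(stage_one, links)
labels = classify(refined_scores)
```

In this toy graph, `c.example` falls below the threshold on its first-stage score alone but is pulled above it by the hosts it links to, which is the kind of correction a link-based second stage is meant to provide.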
Copyright information
© 2008 Springer-Verlag Berlin Heidelberg
About this paper
Cite this paper
Geng, GG., Wang, CH., Li, QD. (2008). Improving Spamdexing Detection Via a Two-Stage Classification Strategy. In: Li, H., Liu, T., Ma, WY., Sakai, T., Wong, KF., Zhou, G. (eds) Information Retrieval Technology. AIRS 2008. Lecture Notes in Computer Science, vol 4993. Springer, Berlin, Heidelberg. https://doi.org/10.1007/978-3-540-68636-1_34
DOI: https://doi.org/10.1007/978-3-540-68636-1_34
Publisher Name: Springer, Berlin, Heidelberg
Print ISBN: 978-3-540-68633-0
Online ISBN: 978-3-540-68636-1
eBook Packages: Computer Science, Computer Science (R0)