Topic-Independent Web High-Quality Page Selection Based on K-Means Clustering

Wang, Canhui; Liu, Yiqun; Zhang, Min; Ma, Shaoping

doi:10.1007/11562382_43

Part of the book series: Lecture Notes in Computer Science (LNISA, volume 3689)

Included in the following conference series:

Asia Information Retrieval Symposium

Abstract

One challenge for web search engines is to assess the quality of web pages independently of any given user request. High-quality web pages give readers suitable entry points to more concentrated, relevant information on the web. This paper focuses on topic-independent selection of high-quality web pages in order to reduce redundancy and remove noise in web information. Several non-content features and their effects on high-quality page selection are studied. K-means clustering over these features is then performed to separate high-quality pages from ordinary ones. Experiments were carried out on the 19 GB (document size) TREC web data set (.GOV data). With the proposed approach, fewer than 50% of the web pages are selected as high-quality ones, yet they cover about 90% of the key information in the whole set. Information retrieval on this high-quality page set achieves more than 40% improvement compared with retrieval on the whole data collection.
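
As a rough illustration of the clustering step described above, the following Python sketch standardizes a few non-content features and runs two-cluster K-means over them. It is not the authors' implementation. The feature names (in-link count, URL depth), the sample values, the farthest-point initialization, and the rule that treats the cluster with the higher in-link centroid as the high-quality one are all assumptions made for illustration; the abstract does not specify them.

```python
import math
import statistics

# Hypothetical non-content features per page: (in-link count, URL depth).
# The abstract does not list the features actually used in the paper.
PAGES = {
    "a": (120.0, 1.0),
    "b": (95.0, 1.0),
    "c": (110.0, 2.0),
    "d": (3.0, 4.0),
    "e": (1.0, 5.0),
    "f": (5.0, 4.0),
}


def standardize(rows):
    """Z-score each column so no single feature dominates the distance."""
    cols = list(zip(*rows))
    means = [statistics.fmean(c) for c in cols]
    # Fall back to 1.0 for constant columns to avoid division by zero.
    stds = [statistics.pstdev(c) or 1.0 for c in cols]
    return [tuple((v - m) / s for v, m, s in zip(r, means, stds)) for r in rows]


def kmeans2(points, iters=100):
    """Lloyd's K-means with k=2 and a deterministic farthest-point start."""
    first = points[0]
    second = max(points, key=lambda p: math.dist(p, first))
    centroids = [first, second]
    labels = None
    for _ in range(iters):
        # Assignment step: each point goes to its nearest centroid.
        new_labels = [
            min((0, 1), key=lambda c: math.dist(p, centroids[c])) for p in points
        ]
        if new_labels == labels:
            break  # assignments are stable, so the algorithm has converged
        labels = new_labels
        # Update step: each centroid becomes the mean of its members.
        for c in (0, 1):
            members = [p for p, lab in zip(points, labels) if lab == c]
            if members:  # keep the old centroid if a cluster empties
                centroids[c] = tuple(statistics.fmean(col) for col in zip(*members))
    return labels, centroids


def select_high_quality(pages):
    """Return the names of pages in the cluster assumed to be high quality."""
    names = list(pages)
    points = standardize([pages[n] for n in names])
    labels, centroids = kmeans2(points)
    # Assumed rule: the cluster with the higher mean in-link score wins.
    best = max((0, 1), key=lambda c: centroids[c][0])
    return [n for n, lab in zip(names, labels) if lab == best]


if __name__ == "__main__":
    print(select_high_quality(PAGES))
```

Standardizing first matters because raw in-link counts span a far wider range than URL depth and would otherwise dominate the Euclidean distance. On a real collection the initialization would deserve more care, since K-means results depend on the starting centroids.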

Author information

Authors and Affiliations

State Key Lab of Intelligent Technology & Systems, Tsinghua University, Beijing 100084, P.R. China
Canhui Wang, Yiqun Liu, Min Zhang & Shaoping Ma

Editor information

Editors and Affiliations

Department of Computer Science and Engineering, Pohang University of Science and Technology, San 31, Hyoja-dong, Nam-gu, 790-784, Pohang, Korea
Gary Geunbae Lee
Computer and Communication Media Research, NEC Corp., Miyazaki 4-1-1, Miyamae-ku, 216-8555, Kawasaki, Japan
Akio Yamada
Human-Computer Communications Laboratory, Department of Systems Engineering and Engineering Management, The Chinese University of Hong Kong, Hong Kong
Helen Meng
School of Engineering, Information and Communications University, 119, Munjiro, Yuseong-gu, 305-732, Daejeon, Korea
Sung Hyon Myaeng

Cite this paper

Wang, C., Liu, Y., Zhang, M., Ma, S. (2005). Topic-Independent Web High-Quality Page Selection Based on K-Means Clustering. In: Lee, G.G., Yamada, A., Meng, H., Myaeng, S.H. (eds) Information Retrieval Technology. AIRS 2005. Lecture Notes in Computer Science, vol 3689. Springer, Berlin, Heidelberg. https://doi.org/10.1007/11562382_43

DOI: https://doi.org/10.1007/11562382_43
Publisher Name: Springer, Berlin, Heidelberg
Print ISBN: 978-3-540-29186-2
Online ISBN: 978-3-540-32001-2
eBook Packages: Computer Science, Computer Science (R0)
