ABSTRACT
It is well known that Web-page classification can be enhanced by using the hyperlinks that connect Web pages. However, in the Web space, hyperlinks are usually sparse and noisy, and thus in many situations can provide only limited help in classification. In this paper, we extend the concept of linkage from explicit hyperlinks to implicit links built between Web pages. Observing that people who search the Web with the same query often click on different but related documents, we draw an implicit link between Web pages that are clicked after the same query. We provide an approach for automatically building these implicit links from Web query logs, together with a thorough comparison between the uses of implicit and explicit links in Web-page classification. Our experimental results on a large dataset confirm that implicit links yield better classification performance than explicit links, with an increase of more than 10.5% in terms of the Macro-F1 measure.
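The idea of linking pages that are clicked after the same query can be illustrated with a minimal sketch. The abstract does not specify the log format or how links are weighted, so the `(query, clicked_url)` record layout, the lowercase query normalization, and the choice of weighting a link by the number of shared queries are all assumptions made here for illustration, not the paper's actual procedure.

```python
from collections import defaultdict
from itertools import combinations


def build_implicit_links(click_log):
    """Build weighted implicit links from a query click log.

    click_log: iterable of (query, clicked_url) pairs. This record
    layout is a hypothetical one chosen for the example.
    Returns a dict mapping an ordered URL pair to the number of
    distinct normalized queries after which both URLs were clicked.
    """
    # Group clicked pages by query (assumed normalization: trim + lowercase).
    pages_by_query = defaultdict(set)
    for query, url in click_log:
        pages_by_query[query.strip().lower()].add(url)

    # Every pair of pages clicked after the same query gets an implicit link.
    link_weights = defaultdict(int)
    for pages in pages_by_query.values():
        for a, b in combinations(sorted(pages), 2):
            link_weights[(a, b)] += 1
    return dict(link_weights)


# Toy log with made-up URLs.
log = [
    ("jaguar speed", "a.example/cats"),
    ("jaguar speed", "b.example/wildlife"),
    ("Jaguar Speed", "c.example/cars"),
    ("python tutorial", "d.example/py"),
]
links = build_implicit_links(log)
for pair, weight in sorted(links.items()):
    print(pair, weight)
```

In this toy log the three pages clicked after "jaguar speed" become pairwise linked, while the page clicked after "python tutorial" stays unlinked because no other page shares its query. A real pipeline would likely filter low-weight links to reduce noise; that threshold is left out here because the abstract gives no value for it.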
Index Terms
- A comparison of implicit and explicit links for web page classification