Abstract
A typical on-line content directory contains factual information about entities (e.g., address of a company) together with entity categories (e.g., company’s industries). The categories are a salient element of the system as they allow users to browse for entities of a chosen type. Assigning categories manually can be a challenging task, considering that an entity can belong to few out of hundreds of categories (e.g., all possible industry types). Instead we suggest to augment this process with an automatic categorization system that suggests categories based on the entity’s home page. To improve the accuracy of results, the system follows links extracted from the home page and uses retrieved content to expand an entity’s term profile. The profile is later used by a multi-label classification system to assign categories to the entity. The key element of the system is a link ranking module, which uses home page features (e.g., position and anchor text of links) to select links that are most likely to improve the categorization results. Evaluation on a data set of nearly ten thousand company home pages confirmed that the link ranking approach allows the system to limit the retrieval and processing costs to allow real-time responses and still outperform the categorization results of baseline systems.
Access this chapter
Tax calculation will be finalised at checkout
Purchases are for personal use only
Preview
Unable to display preview. Download preview PDF.
References
Angelova, R., Weikum, G.: Graph-based text classification: learn from your neighbors. In: Proceedings of the 29th Annual International ACM SIGIR Conference on Research and Development in Information Retrieval, SIGIR 2006, pp. 485–492. ACM (2006)
Baldi, P., Frasconi, P., Smyth, P.: Modeling the Internet and the Web: probabilistic methods and algorithms. Wiley Series in Probability and Statistics. Wiley (2003)
Burges, C., Shaked, T., Renshaw, E., Lazier, A., Deeds, M., Hamilton, N., Hullender, G.: Learning to rank using gradient descent. In: Proceedings of the 22nd International Conference on Machine Learning, ICML 2005, pp. 89–96. ACM (2005)
Carlson, A., Schafer, C.: Bootstrapping information extraction from semi-structured web pages. In: Daelemans, W., Goethals, B., Morik, K. (eds.) ECML PKDD 2008, Part I. LNCS (LNAI), vol. 5211, pp. 195–210. Springer, Heidelberg (2008)
Chakrabarti, S., Dom, B., Indyk, P.: Enhanced hypertext categorization using hyperlinks. SIGMOD Rec. 27(2), 307–318 (1998)
Chakrabarti, S., van den Berg, M., Dom, B.: Focused crawling: a new approach to topic-specific web resource discovery. Comput. Netw. 31(11-16), 1623–1640 (1999)
Freund, Y., Iyer, R., Schapire, R.E., Singer, Y.: An efficient boosting algorithm for combining preferences. J. Mach. Learn. Res. 4, 933–969 (2003)
Guyon, I., Weston, J., Barnhill, S., Vapnik, V.: Gene selection for cancer classification using support vector machines. Mach. Learn. 46(1-3), 389–422 (2002)
Hao, Q., Cai, R., Pang, Y., Zhang, L.: From one tree to a forest: a unified solution for structured web data extraction. In: Proceedings of the 34th International ACM SIGIR Conference on Research and Development in Information Retrieval, SIGIR 2011, pp. 775–784. ACM (2011)
Herbrich, R., Graepel, T., Obermayer, K.: Support vector learning for ordinal regression. In: International Conference on Artificial Neural Networks, pp. 97–102 (1999)
Joachims, T.: Training linear SVMs in linear time. In: Proceedings of the 12th ACM SIGKDD International Conference on Knowledge Discovery and Data Mining, KDD 2006, pp. 217–226. ACM (2006)
McCallum, A., Nigam, K.: A comparison of event models for naive bayes text classification. In: AAAI 1998 Workshop on Learning For Text Categorization, pp. 41–48. AAAI Press (1998)
Nguyen, H., Fuxman, A., Paparizos, S., Freire, J., Agrawal, R.: Synthesizing products for online catalogs. Proc. VLDB Endow. 4(7), 409–418 (2011)
Oh, H.-J., Myaeng, S.H., Lee, M.-H.: A practical hypertext categorization method using links and incrementally available class information. In: Proceedings of the 23rd Annual International ACM SIGIR Conference on Research and Development in Information Retrieval, SIGIR 2000, pp. 264–271. ACM (2000)
Pal, A., Tomar, D.S., Shrivastava, S.C.: Effective focused crawling based on content and link structure analysis. International Journal of Computer Science and Information Security 2(1), 140–152 (2009)
Pant, G., Srinivasan, P.: Learning to crawl: Comparing classification schemes. ACM Trans. Inf. Syst. 23(4), 430–462 (2005)
Qi, X., Davison, B.D.: Web page classification: Features and algorithms. ACM Comput. Surv. 41(2), 12:1–12:31 (2009)
Read, J., Pfahringer, B., Holmes, G., Frank, E.: Classifier chains for multi-label classification. Mach. Learn. 85(3), 333–359 (2011)
Roth, S.P., Schmutz, P., Pauwels, S.L., Bargas-Avila, J.A., Opwis, K.: Mental models for web objects: Where do users expect to find the most frequent objects in online shops, news portals, and company web pages? Interact. Comput. 22(2), 140–152 (2010)
Suchanek, F.M., Kasneci, G., Weikum, G.: Yago: a core of semantic knowledge. In: Proceedings of the 16th International Conference on World Wide Web, WWW 2007, pp. 697–706. ACM (2007)
Tang, T.T., Hawking, D., Craswell, N., Griffiths, K.: Focused crawling for both topical relevance and quality of medical information. In: Proceedings of the 14th ACM International Conference on Information and Knowledge Management, CIKM 2005, pp. 147–154. ACM (2005)
Tsoumakas, G., Vlahavas, I.: Random k-labelsets: An ensemble method for multilabel classification. In: Kok, J.N., Koronacki, J., Lopez de Mantaras, R., Matwin, S., Mladenič, D., Skowron, A. (eds.) ECML 2007. LNCS (LNAI), vol. 4701, pp. 406–417. Springer, Heidelberg (2007)
Xu, J., Li, H.: Adarank: a boosting algorithm for information retrieval. In: Proceedings of the 30th Annual International ACM SIGIR Conference on Research and Development in Information Retrieval, SIGIR 2007, pp. 391–398. ACM (2007)
Author information
Authors and Affiliations
Editor information
Editors and Affiliations
Rights and permissions
Copyright information
© 2013 Springer-Verlag Berlin Heidelberg
About this paper
Cite this paper
Lipczak, M., Niewiarowski, T., Keselj, V., Milios, E. (2013). Selective Retrieval for Categorization of Semi-structured Web Resources. In: Zaïane, O.R., Zilles, S. (eds) Advances in Artificial Intelligence. Canadian AI 2013. Lecture Notes in Computer Science(), vol 7884. Springer, Berlin, Heidelberg. https://doi.org/10.1007/978-3-642-38457-8_11
Download citation
DOI: https://doi.org/10.1007/978-3-642-38457-8_11
Publisher Name: Springer, Berlin, Heidelberg
Print ISBN: 978-3-642-38456-1
Online ISBN: 978-3-642-38457-8
eBook Packages: Computer ScienceComputer Science (R0)