Selective Retrieval for Categorization of Semi-structured Web Resources

Lipczak, Marek; Niewiarowski, Tomasz; Keselj, Vlado; Milios, Evangelos

doi:10.1007/978-3-642-38457-8_11

Part of the book series: Lecture Notes in Computer Science (LNAI, volume 7884)

Included in the following conference series:

Canadian Conference on Artificial Intelligence

Abstract

A typical on-line content directory contains factual information about entities (e.g., the address of a company) together with entity categories (e.g., the company’s industries). The categories are a salient element of the system because they allow users to browse for entities of a chosen type. Assigning categories manually is challenging, since an entity may belong to only a few of hundreds of possible categories (e.g., all possible industry types). We therefore propose augmenting this process with an automatic categorization system that suggests categories based on the entity’s home page. To improve accuracy, the system follows links extracted from the home page and uses the retrieved content to expand the entity’s term profile. A multi-label classification system then uses the profile to assign categories to the entity. The key element of the system is a link ranking module, which uses home page features (e.g., the position and anchor text of links) to select the links most likely to improve the categorization results. Evaluation on a data set of nearly ten thousand company home pages confirmed that link ranking lets the system keep retrieval and processing costs low enough for real-time responses while still outperforming the categorization results of baseline systems.
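
The selective-retrieval idea in the abstract can be illustrated with a minimal Python sketch. It is not the authors' implementation: the anchor-text cue weights, the scoring formula, and the function names (`score_link`, `select_links`, `term_profile`) are assumptions made for illustration only, whereas the paper's ranking module is described only at the level of the home page features it uses. The sketch extracts links from a home page, scores each one from its anchor text and position, keeps the top k, and builds a bag-of-words term profile from the texts that would be retrieved.

```python
from collections import Counter
from html.parser import HTMLParser
import re

# Hypothetical anchor-text cues with hand-set weights (an assumption;
# these values are illustrative and do not come from the paper).
ANCHOR_WEIGHTS = {"about": 2.0, "products": 1.5, "services": 1.5, "contact": -1.0}


class LinkExtractor(HTMLParser):
    """Collects (href, anchor_text, position) for every <a href=...> tag."""

    def __init__(self):
        super().__init__()
        self.links = []
        self._href = None
        self._text = []

    def handle_starttag(self, tag, attrs):
        if tag == "a":
            self._href = dict(attrs).get("href")
            self._text = []

    def handle_data(self, data):
        # Only collect text while inside an anchor that has an href.
        if self._href is not None:
            self._text.append(data)

    def handle_endtag(self, tag):
        if tag == "a" and self._href is not None:
            anchor = " ".join(self._text).strip().lower()
            self.links.append((self._href, anchor, len(self.links)))
            self._href = None


def score_link(anchor, position):
    """Toy score: anchor-text cue weights plus a small bonus for early links."""
    cue = sum(w for term, w in ANCHOR_WEIGHTS.items() if term in anchor)
    return cue + 1.0 / (1 + position)


def select_links(home_html, k):
    """Return the hrefs of the k highest-scoring links on the home page."""
    parser = LinkExtractor()
    parser.feed(home_html)
    ranked = sorted(parser.links, key=lambda l: score_link(l[1], l[2]), reverse=True)
    return [href for href, _, _ in ranked[:k]]


def term_profile(texts):
    """Bag-of-words profile over the home page text and retrieved page texts."""
    profile = Counter()
    for text in texts:
        profile.update(re.findall(r"[a-z]+", text.lower()))
    return profile
```

In this sketch, `select_links(home_html, k)` bounds the number of pages fetched per entity, which is the cost-limiting step the abstract refers to. The resulting `term_profile` would then be passed to a multi-label classifier, which is omitted here.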

Author information

Authors and Affiliations

Dalhousie University, Halifax, Canada, B3H 1W5
Marek Lipczak, Tomasz Niewiarowski, Vlado Keselj & Evangelos Milios
2nd Act Innovations Inc., Halifax, Canada
Marek Lipczak

Editor information

Editors and Affiliations

University of Alberta, Edmonton, AB, Canada
Osmar R. Zaïane
Department of Computer Science, University of Regina, Canada
Sandra Zilles

About this paper

Cite this paper

Lipczak, M., Niewiarowski, T., Keselj, V., Milios, E. (2013). Selective Retrieval for Categorization of Semi-structured Web Resources. In: Zaïane, O.R., Zilles, S. (eds) Advances in Artificial Intelligence. Canadian AI 2013. Lecture Notes in Computer Science, vol 7884. Springer, Berlin, Heidelberg. https://doi.org/10.1007/978-3-642-38457-8_11

DOI: https://doi.org/10.1007/978-3-642-38457-8_11
Publisher Name: Springer, Berlin, Heidelberg
Print ISBN: 978-3-642-38456-1
Online ISBN: 978-3-642-38457-8
eBook Packages: Computer Science, Computer Science (R0)