Selective Retrieval for Categorization of Semi-structured Web Resources

  • Conference paper
Advances in Artificial Intelligence (Canadian AI 2013)

Part of the book series: Lecture Notes in Computer Science (LNAI, volume 7884)

Abstract

A typical on-line content directory contains factual information about entities (e.g., the address of a company) together with entity categories (e.g., the company’s industries). The categories are a salient element of the system, as they allow users to browse for entities of a chosen type. Assigning categories manually can be challenging, because an entity typically belongs to only a few out of hundreds of possible categories (e.g., all possible industry types). We propose augmenting this process with an automatic categorization system that suggests categories based on the entity’s home page. To improve accuracy, the system follows links extracted from the home page and uses the retrieved content to expand the entity’s term profile. A multi-label classification system then uses this profile to assign categories to the entity. The key element of the system is a link ranking module, which uses home page features (e.g., the position and anchor text of links) to select the links most likely to improve the categorization results. Evaluation on a data set of nearly ten thousand company home pages confirmed that link ranking lets the system keep retrieval and processing costs low enough for real-time responses while still outperforming the categorization results of baseline systems.
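The selective retrieval idea described in the abstract can be illustrated with a minimal Python sketch. It is not the authors' implementation: the abstract does not say how links are scored, so the hand-set scoring function, the `INFORMATIVE_ANCHORS` weights, the `Link` type, the function names, and the in-memory `PAGES` stand-in for web retrieval are all assumptions made for illustration. The sketch only shows the general shape of the pipeline: score links using their anchor text and position, retrieve the top `k`, and merge the retrieved terms into the entity's term profile.

```python
import re
from collections import Counter
from dataclasses import dataclass
from typing import Callable


@dataclass
class Link:
    url: str
    anchor_text: str
    position: int  # 0-based order of the link on the home page


# Hypothetical anchor-text weights; a real ranker would be learned from data.
INFORMATIVE_ANCHORS = {"about": 2.0, "products": 1.5, "services": 1.5}


def tokenize(text: str) -> list[str]:
    """Lowercase the text and split it into alphabetic terms."""
    return re.findall(r"[a-z]+", text.lower())


def score_link(link: Link) -> float:
    """Hand-set stand-in for a link ranker: anchor cues plus an early-position bonus."""
    anchor_score = sum(INFORMATIVE_ANCHORS.get(t, 0.0) for t in tokenize(link.anchor_text))
    position_score = 1.0 / (1 + link.position)
    return anchor_score + position_score


def select_links(links: list[Link], k: int) -> list[Link]:
    """Keep only the k highest-scoring links, which bounds the retrieval cost."""
    return sorted(links, key=score_link, reverse=True)[:k]


def build_profile(home_text: str, links: list[Link], k: int,
                  fetch: Callable[[str], str]) -> Counter:
    """Expand the home page's term counts with terms from the selected pages."""
    profile = Counter(tokenize(home_text))
    for link in select_links(links, k):
        profile.update(tokenize(fetch(link.url)))
    return profile


# Toy data standing in for real pages (assumption: no network access here).
PAGES = {
    "/contact": "Call our office",
    "/about": "We manufacture industrial pumps",
    "/products": "Pumps and valves",
    "/careers": "Open careers and jobs",
}
links = [
    Link("/contact", "Contact us", 0),
    Link("/about", "About the company", 1),
    Link("/products", "Our products", 2),
    Link("/careers", "Careers", 3),
]
profile = build_profile("Acme Corp home", links, k=2, fetch=lambda url: PAGES[url])
```

In this toy run only the "about" and "products" pages are retrieved, so their terms enter the profile while the other two pages are never fetched. The resulting `profile` would then be passed to a multi-label classifier, which is outside the scope of this sketch.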

Copyright information

© 2013 Springer-Verlag Berlin Heidelberg

About this paper

Cite this paper

Lipczak, M., Niewiarowski, T., Keselj, V., Milios, E. (2013). Selective Retrieval for Categorization of Semi-structured Web Resources. In: Zaïane, O.R., Zilles, S. (eds) Advances in Artificial Intelligence. Canadian AI 2013. Lecture Notes in Computer Science, vol 7884. Springer, Berlin, Heidelberg. https://doi.org/10.1007/978-3-642-38457-8_11

  • DOI: https://doi.org/10.1007/978-3-642-38457-8_11

  • Publisher Name: Springer, Berlin, Heidelberg

  • Print ISBN: 978-3-642-38456-1

  • Online ISBN: 978-3-642-38457-8

  • eBook Packages: Computer Science, Computer Science (R0)
