xCrawl: a high-recall crawling method for Web mining

  • Regular Paper
Knowledge and Information Systems

Abstract

Web mining systems exploit the redundancy of data published on the Web to automatically extract information from existing Web documents. The first step in the information extraction process is therefore to locate, within a limited period of time, as many Web pages as possible that contain relevant information, a task that is commonly accomplished with focused crawling techniques. The performance of such a crawler can be measured by its “recall”, i.e., the percentage of relevant documents that the crawler finds and identifies as relevant, out of all relevant documents that exist. A higher recall value implies that more redundant data are available, which in turn leads to better results in the subsequent fact extraction phase of the Web mining process. In this paper, we propose xCrawl, a new focused crawling method that outperforms state-of-the-art approaches with respect to the recall achievable within a given period of time. The method is based on a new combination of ideas and techniques for identifying and exploiting the navigational structures of Web sites, such as hierarchies, lists, or maps. In addition, automatic query generation is applied to rapidly collect Web sources containing target documents. The proposed crawling technique was inspired by the requirements of a Web mining system developed to extract product and service descriptions given in tabular form, and it was evaluated in different application scenarios. Comparisons with existing focused crawling techniques reveal that the new crawling method leads to a significant increase in recall while maintaining precision.
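
To make the two evaluation measures named in the abstract concrete, the following minimal sketch computes recall and precision for a single crawl. It is an illustration under stated assumptions and not the evaluation code of the paper: the function names, the URL sets, and the reading of precision as the share of downloaded pages that are relevant are all hypothetical choices made for this example.

```python
def recall(found_relevant, all_relevant):
    """Share of all existing relevant documents that the crawler found."""
    if not all_relevant:
        raise ValueError("recall is undefined when no relevant documents exist")
    return len(found_relevant & all_relevant) / len(all_relevant)


def precision(found_relevant, downloaded):
    """Share of downloaded documents that are relevant (assumed definition)."""
    if not downloaded:
        raise ValueError("precision is undefined when nothing was downloaded")
    return len(found_relevant & downloaded) / len(downloaded)


# Hypothetical toy data: four relevant pages exist on the Web,
# the crawler downloaded three pages, two of which are relevant.
all_relevant = {"site/a", "site/b", "site/c", "site/d"}
downloaded = {"site/a", "site/b", "site/x"}
found_relevant = downloaded & all_relevant

crawl_recall = recall(found_relevant, all_relevant)
crawl_precision = precision(found_relevant, downloaded)
print(f"recall={crawl_recall:.2f} precision={crawl_precision:.2f}")
```

In this toy setting, a crawler that later also reaches "site/c" and "site/d" without downloading further irrelevant pages would raise recall while precision does not drop, which is the kind of trade-off the comparison in the abstract refers to.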

Author information

Corresponding author

Correspondence to Kostyantyn Shchekotykhin.

About this article

Cite this article

Shchekotykhin, K., Jannach, D. & Friedrich, G. xCrawl: a high-recall crawling method for Web mining. Knowl Inf Syst 25, 303–326 (2010). https://doi.org/10.1007/s10115-009-0266-3

  • DOI: https://doi.org/10.1007/s10115-009-0266-3
