xCrawl: a high-recall crawling method for Web mining

Shchekotykhin, Kostyantyn; Jannach, Dietmar; Friedrich, Gerhard

doi:10.1007/s10115-009-0266-3

Regular Paper
Published: 18 November 2009

Knowledge and Information Systems, volume 25, pages 303–326 (2010)

Kostyantyn Shchekotykhin¹, Dietmar Jannach² & Gerhard Friedrich¹

Abstract

Web mining systems exploit the redundancy of data published on the Web to automatically extract information from existing Web documents. The first step in the Information Extraction process is thus to locate as many Web pages as possible that contain relevant information within a limited period of time, a task which is commonly accomplished by applying focused crawling techniques. The performance of such a crawler can be measured by its “recall”, i.e., the percentage of documents found and identified as relevant compared to the total number of existing documents. A higher recall value implies that more redundant data are available, which in turn leads to better results in the subsequent fact extraction phase of the Web mining process. In this paper, we propose xCrawl, a new focused crawling method which outperforms state-of-the-art approaches with respect to the recall values achievable within a given period of time. This method is based on a new combination of ideas and techniques used to identify and exploit the navigational structures of Web sites, such as hierarchies, lists, or maps. In addition, automatic query generation is applied to rapidly collect Web sources containing target documents. The proposed crawling technique was inspired by the requirements of a Web mining system developed to extract product and service descriptions given in tabular form and was evaluated in different application scenarios. Comparisons with existing focused crawling techniques reveal that the new crawling method leads to a significant increase in recall while maintaining precision.
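
The recall measure described above, together with precision, can be illustrated with a minimal sketch. It assumes that the set of all relevant documents is known in advance and that "total number of existing documents" refers to existing relevant documents. The function name `precision_recall` and the toy URLs are hypothetical and do not come from the paper.

```python
from typing import Set, Tuple


def precision_recall(retrieved: Set[str], relevant: Set[str]) -> Tuple[float, float]:
    """Return (precision, recall) for one crawl.

    retrieved: URLs the crawler downloaded and classified as relevant.
    relevant:  URLs of all relevant documents that exist (assumed ground truth).
    """
    true_positives = len(retrieved & relevant)
    # Precision: share of reported documents that are truly relevant.
    precision = true_positives / len(retrieved) if retrieved else 0.0
    # Recall: share of existing relevant documents that the crawler found.
    recall = true_positives / len(relevant) if relevant else 0.0
    return precision, recall


# Hypothetical toy data: 4 relevant pages exist; the crawler reports 3 pages,
# 2 of which are truly relevant.
relevant_pages = {"a.html", "b.html", "c.html", "d.html"}
crawled_pages = {"a.html", "b.html", "x.html"}

p, r = precision_recall(crawled_pages, relevant_pages)
print(f"precision={p:.2f} recall={r:.2f}")  # precision=0.67 recall=0.50
```

In this toy case a crawler that found the two remaining relevant pages without reporting further irrelevant ones would raise recall to 1.0 while precision would rise to 0.8, which is the kind of trade-off the evaluation in the abstract refers to.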

Author information

Authors and Affiliations

University Klagenfurt, 9020, Klagenfurt, Austria
Kostyantyn Shchekotykhin & Gerhard Friedrich
Technische Universität Dortmund, 44221, Dortmund, Germany
Dietmar Jannach

Corresponding author

Correspondence to Kostyantyn Shchekotykhin.

About this article

Cite this article

Shchekotykhin, K., Jannach, D. & Friedrich, G. xCrawl: a high-recall crawling method for Web mining. Knowl Inf Syst 25, 303–326 (2010). https://doi.org/10.1007/s10115-009-0266-3

Received: 13 March 2009
Revised: 31 August 2009
Accepted: 28 September 2009
Published: 18 November 2009
Issue Date: November 2010
DOI: https://doi.org/10.1007/s10115-009-0266-3
