Abstract
Web mining systems exploit the redundancy of data published on the Web to automatically extract information from existing Web documents. The first step in the Information Extraction process is thus to locate as many Web pages as possible that contain relevant information within a limited period of time, a task that is commonly accomplished by applying focused crawling techniques. The performance of such a crawler can be measured by its “recall”, i.e., the percentage of documents found and identified as relevant relative to the total number of relevant documents that exist. A higher recall value implies that more redundant data are available, which in turn leads to better results in the subsequent fact extraction phase of the Web mining process. In this paper, we propose xCrawl, a new focused crawling method which outperforms state-of-the-art approaches with respect to the recall values achievable within a given period of time. This method is based on a new combination of ideas and techniques used to identify and exploit the navigational structures of Web sites, such as hierarchies, lists, or maps. In addition, automatic query generation is applied to rapidly collect Web sources containing target documents. The proposed crawling technique was inspired by the requirements of a Web mining system developed to extract product and service descriptions given in tabular form and was evaluated in different application scenarios. Comparisons with existing focused crawling techniques reveal that the new crawling method leads to a significant increase in recall while maintaining precision.
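The recall and precision measures mentioned above can be illustrated with a minimal sketch. It assumes the usual information-retrieval reading, in which recall is taken over the set of relevant documents that exist; the function names and the example URLs are hypothetical and are not taken from the xCrawl system.

```python
def recall(reported: set[str], relevant: set[str]) -> float:
    """Share of all existing relevant documents that the crawler found."""
    if not relevant:
        raise ValueError("recall is undefined when no relevant documents exist")
    return len(reported & relevant) / len(relevant)


def precision(reported: set[str], relevant: set[str]) -> float:
    """Share of the documents reported as relevant that really are relevant."""
    if not reported:
        raise ValueError("precision is undefined when nothing was reported")
    return len(reported & relevant) / len(reported)


# Hypothetical ground truth: four relevant pages exist on the Web.
relevant_pages = {
    "http://example.org/a",
    "http://example.org/b",
    "http://example.org/c",
    "http://example.org/d",
}

# Hypothetical crawl result: three pages reported, one of them a false hit.
reported_pages = {
    "http://example.org/a",
    "http://example.org/b",
    "http://example.org/x",
}

crawl_recall = recall(reported_pages, relevant_pages)        # 2 of 4 found
crawl_precision = precision(reported_pages, relevant_pages)  # 2 of 3 correct
```

In this toy case the crawler reaches a recall of 0.5 and a precision of about 0.67. A method that raises recall "while maintaining precision" would find more of the four relevant pages without increasing the proportion of false hits.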
Cite this article
Shchekotykhin, K., Jannach, D. & Friedrich, G. xCrawl: a high-recall crawling method for Web mining. Knowl Inf Syst 25, 303–326 (2010). https://doi.org/10.1007/s10115-009-0266-3
DOI: https://doi.org/10.1007/s10115-009-0266-3