Exploiting Genre in Focused Crawling

  • Conference paper

Part of the book series: Lecture Notes in Computer Science (LNTCS, volume 4726)

Abstract

In this paper, we propose a novel approach to focused crawling that exploits both genre and content-related information present in Web pages to guide the crawling process. We demonstrate the effectiveness, efficiency and scalability of this approach through a set of experiments that crawl pages related to syllabi (the genre) of computer science courses (the content). The results show that focused crawlers built according to our approach achieve F1 levels above 92% (an average gain of 178% over traditional focused crawlers) and need to analyze no more than 60% of the visited pages to find 90% of the relevant pages (an average gain of 82% over traditional focused crawlers).
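
The two figures reported above, F1 and the share of visited pages needed to reach a given fraction of the relevant pages, can be illustrated with a minimal sketch. The abstract does not give the paper's exact definitions, so the code below assumes the standard set-based F1 (harmonic mean of precision and recall) and a simple recall-at-prefix measure over the crawl order. The function names `f1_score` and `fraction_visited_for_recall` and the toy URLs are hypothetical, chosen only for illustration.

```python
def f1_score(retrieved, relevant):
    """Standard F1: harmonic mean of precision and recall over sets of pages.

    Assumption: pages are identified by hashable keys such as URLs.
    """
    retrieved, relevant = set(retrieved), set(relevant)
    hits = len(retrieved & relevant)
    if hits == 0:
        return 0.0
    precision = hits / len(retrieved)
    recall = hits / len(relevant)
    return 2 * precision * recall / (precision + recall)


def fraction_visited_for_recall(visit_order, relevant, target=0.9):
    """Fraction of the visit sequence consumed when `target` recall is first reached.

    Returns None if the crawl never reaches the target recall.
    Assumption: this is one plausible reading of "pages analyzed to find
    90% of the relevant pages", not necessarily the paper's definition.
    """
    relevant = set(relevant)
    needed = target * len(relevant)
    found = 0
    seen = set()
    for position, page in enumerate(visit_order, start=1):
        # Count each relevant page only once, even if it is revisited.
        if page in relevant and page not in seen:
            found += 1
        seen.add(page)
        if found >= needed:
            return position / len(visit_order)
    return None


if __name__ == "__main__":
    # Toy data, purely illustrative.
    relevant_pages = {"a", "b", "e"}
    crawled_as_relevant = {"a", "b", "c", "d"}
    print(f1_score(crawled_as_relevant, relevant_pages))
    print(fraction_visited_for_recall(["a", "x", "b", "y", "c"], {"a", "b"}, target=1.0))
```

Under this reading, a lower fraction means the crawler reaches the relevant pages earlier in its visit sequence, which is the efficiency property the abstract describes.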

Editor information

Nivio Ziviani Ricardo Baeza-Yates

Copyright information

© 2007 Springer-Verlag Berlin Heidelberg

About this paper

Cite this paper

de Assis, G.T., Laender, A.H.F., Gonçalves, M.A., da Silva, A.S. (2007). Exploiting Genre in Focused Crawling. In: Ziviani, N., Baeza-Yates, R. (eds) String Processing and Information Retrieval. SPIRE 2007. Lecture Notes in Computer Science, vol 4726. Springer, Berlin, Heidelberg. https://doi.org/10.1007/978-3-540-75530-2_6

  • DOI: https://doi.org/10.1007/978-3-540-75530-2_6

  • Publisher Name: Springer, Berlin, Heidelberg

  • Print ISBN: 978-3-540-75529-6

  • Online ISBN: 978-3-540-75530-2

  • eBook Packages: Computer Science, Computer Science (R0)
