Filtering Web Documents for a Thematic Warehouse Case Study: eDot a Food Risk Data Warehouse (extended)

  • Conference paper
Intelligent Information Processing and Web Mining

Part of the book series: Advances in Soft Computing (AINSC, volume 31)

Abstract

Ordinary sources, such as databases and general-purpose document collections, seem insufficient and inadequate to meet the needs and requirements of a new generation of warehouses: thematic data warehouses. As more and more thematic data becomes available online, the web can be considered a useful data source for populating thematic data warehouses. To do so, the warehouse data supplier must be able to filter the heterogeneous web content and keep only the documents that correspond to the warehouse topic. Building efficient automatic tools to characterize web documents dealing with a given theme is therefore essential to address the warehouse data acquisition problem. In this paper, we present our filtering approach, implemented in an automatic tool called “eDot-Filter”. This tool filters crawled documents to keep only those dealing with food risk. These documents are then stored in a thematic warehouse called “eDot”. Our filtering approach is based on “WeQueL”, a declarative web query language that improves the expressive power of keyword-based queries.



Copyright information

© 2005 Springer-Verlag Berlin Heidelberg

About this paper

Cite this paper

Mezaour, AD. (2005). Filtering Web Documents for a Thematic Warehouse Case Study: eDot a Food Risk Data Warehouse (extended). In: Kłopotek, M.A., Wierzchoń, S.T., Trojanowski, K. (eds) Intelligent Information Processing and Web Mining. Advances in Soft Computing, vol 31. Springer, Berlin, Heidelberg. https://doi.org/10.1007/3-540-32392-9_28

  • DOI: https://doi.org/10.1007/3-540-32392-9_28

  • Publisher Name: Springer, Berlin, Heidelberg

  • Print ISBN: 978-3-540-25056-2

  • Online ISBN: 978-3-540-32392-1

  • eBook Packages: Engineering, Engineering (R0)
