Filtering Web Documents for a Thematic Warehouse Case Study: eDot a Food Risk Data Warehouse (extended)

Mezaour, Amar-Djalil

doi:10.1007/3-540-32392-9_28

Part of the book series: Advances in Soft Computing (AINSC, volume 31)

Abstract

Ordinary sources, such as databases and general-purpose document collections, appear insufficient to meet the needs and requirements of a new generation of warehouses: thematic data warehouses. Because more and more thematic data is available online, the web can serve as a useful data source for populating thematic data warehouses. To do so, the warehouse data supplier must be able to filter heterogeneous web content and keep only the documents that correspond to the warehouse topic. Building efficient automatic tools that characterize web documents dealing with a given theme is therefore essential to address the warehouse data acquisition problem. In this paper, we present our filtering approach, implemented in an automatic tool called “eDot-Filter”. This tool filters crawled documents to keep only those dealing with food risk. These documents are then stored in a thematic warehouse called “eDot”. Our filtering approach is based on “WeQueL”, a declarative web query language that improves the expressive power of keyword-based queries.

Author information

Authors and Affiliations

Laboratoire de Recherche en Informatique (LRI), Université Paris Sud, 91405, Orsay cedex, France
Amar-Djalil Mezaour

Editor information

Editors and Affiliations

Institute of Computer Sciences, Polish Academy of Sciences, ul. Ordona 21, 01-237, Warszawa, Poland
Mieczysław A. Kłopotek, Sławomir T. Wierzchoń & Krzysztof Trojanowski

About this paper

Cite this paper

Mezaour, AD. (2005). Filtering Web Documents for a Thematic Warehouse Case Study: eDot a Food Risk Data Warehouse (extended). In: Kłopotek, M.A., Wierzchoń, S.T., Trojanowski, K. (eds) Intelligent Information Processing and Web Mining. Advances in Soft Computing, vol 31. Springer, Berlin, Heidelberg. https://doi.org/10.1007/3-540-32392-9_28

DOI: https://doi.org/10.1007/3-540-32392-9_28
Publisher Name: Springer, Berlin, Heidelberg
Print ISBN: 978-3-540-25056-2
Online ISBN: 978-3-540-32392-1
eBook Packages: Engineering, Engineering (R0)
