Abstract
Ordinary sources, such as databases and general-purpose document collections, seem insufficient to meet the needs of a new generation of warehouses: thematic data warehouses. As more and more thematic data becomes available online, the web can be considered a useful data source for populating thematic data warehouses. To do so, the warehouse data supplier must be able to filter heterogeneous web content and keep only the documents that correspond to the warehouse topic. Building efficient automatic tools to characterize web documents dealing with a given theme is therefore essential to address the warehouse data acquisition problem. In this paper, we present our filtering approach, implemented in an automatic tool called “eDot-Filter”. This tool filters crawled documents to keep only those dealing with food risk. These documents are then stored in a thematic warehouse called “eDot”. Our filtering approach is based on “WeQueL”, a declarative web query language that improves the expressive power of keyword-based queries.
Copyright information
© 2005 Springer-Verlag Berlin Heidelberg
About this paper
Cite this paper
Mezaour, AD. (2005). Filtering Web Documents for a Thematic Warehouse Case Study: eDot a Food Risk Data Warehouse (extended). In: Kłopotek, M.A., Wierzchoń, S.T., Trojanowski, K. (eds) Intelligent Information Processing and Web Mining. Advances in Soft Computing, vol 31. Springer, Berlin, Heidelberg. https://doi.org/10.1007/3-540-32392-9_28
DOI: https://doi.org/10.1007/3-540-32392-9_28
Publisher Name: Springer, Berlin, Heidelberg
Print ISBN: 978-3-540-25056-2
Online ISBN: 978-3-540-32392-1
eBook Packages: Engineering, Engineering (R0)