Abstract
In human health and life sciences, researchers collaborate extensively, sharing genomic, biomedical and experimental results. This requires dynamically integrating different databases into a single repository or warehouse. The data integrated in these warehouses are extracted from various heterogeneous sources with differing degrees of quality and trust. Most of the time, they are neither rigorously chosen nor carefully controlled for data quality. Data preparation and data quality metadata are recommended but remain underexploited for ensuring quality and validating the results of information retrieval or data mining techniques.
In previous work, we built a data warehouse called GEDAW (Gene Expression Data Warehouse) that stores several kinds of information: data on genes expressed in the liver during iron overload and liver diseases, relevant information from public databanks (mostly in XML), in-house DNA-chip experiments, and medical records. Drawing on that experience, this paper briefly reports the lessons learned about biomedical data integration and data quality, and the solutions we propose to the many problems of schema evolution in both the data sources and the warehousing system. In this context, we present QDex, a Quality driven bio-Data Exploration tool, which provides a functional and modular architecture for database profiling and exploration. It lets users set up query workflows and exploit data quality profiling metadata before the complex processes of data integration in the warehouse. An illustration using the QDex tool is then presented.
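To make the idea of data quality profiling metadata concrete, the following is a minimal sketch of what a column-level profile might look like. It is not QDex's implementation: the function name `profile_column`, the chosen indicators (completeness, distinct count, duplicated values), and the sample records are all assumptions made for illustration.

```python
from collections import Counter


def profile_column(records, column):
    """Compute simple, illustrative quality indicators for one column."""
    values = [r.get(column) for r in records]
    # Treat None and empty strings as missing values (an assumption).
    present = [v for v in values if v not in (None, "")]
    total = len(values)
    counts = Counter(present)
    return {
        "column": column,
        "completeness": len(present) / total if total else 0.0,
        "distinct": len(counts),
        "duplicated_values": sorted(v for v, n in counts.items() if n > 1),
    }


# Hypothetical records standing in for rows extracted from two source databanks.
records = [
    {"gene_id": "G1", "symbol": "HAMP", "source": "bankA"},
    {"gene_id": "G2", "symbol": "", "source": "bankA"},
    {"gene_id": "G3", "symbol": "HFE", "source": None},
    {"gene_id": "G3", "symbol": "HFE", "source": "bankB"},
]

# Profile every column before deciding what to integrate.
profiles = {c: profile_column(records, c) for c in ("gene_id", "symbol", "source")}
for name, profile in profiles.items():
    print(name, profile)
```

A profile of this kind could be inspected before integration, for example to flag a source whose identifier column contains duplicates or whose annotation columns are largely empty.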
Copyright information
© 2007 Springer-Verlag Berlin Heidelberg
About this paper
Cite this paper
Moussouni, F., Berti-Équille, L., Rozé, G., Loréal, O., Guérin, E. (2007). QDex: A Database Profiler for Generic Bio-data Exploration and Quality Aware Integration. In: Weske, M., Hacid, MS., Godart, C. (eds) Web Information Systems Engineering – WISE 2007 Workshops. WISE 2007. Lecture Notes in Computer Science, vol 4832. Springer, Berlin, Heidelberg. https://doi.org/10.1007/978-3-540-77010-7_2
DOI: https://doi.org/10.1007/978-3-540-77010-7_2
Publisher Name: Springer, Berlin, Heidelberg
Print ISBN: 978-3-540-77009-1
Online ISBN: 978-3-540-77010-7
eBook Packages: Computer Science, Computer Science (R0)