QDex: A Database Profiler for Generic Bio-data Exploration and Quality Aware Integration

  • Conference paper
Web Information Systems Engineering – WISE 2007 Workshops (WISE 2007)

Part of the book series: Lecture Notes in Computer Science (LNISA, volume 4832)

Abstract

In human health and life sciences, researchers collaborate extensively, sharing genomic, biomedical and experimental results. This requires dynamically integrating different databases into a single repository or warehouse. The data integrated in these warehouses are extracted from various heterogeneous sources with differing degrees of quality and trust. Most of the time, they are neither rigorously chosen nor carefully controlled for data quality. Data preparation and data quality metadata are recommended but remain insufficiently exploited for ensuring quality and validating the results of information retrieval or data mining techniques.

In previous work, we built a data warehouse called GEDAW (Gene Expression Data Warehouse) that stores several kinds of information: data on genes expressed in the liver during iron overload and liver diseases, relevant information from public databanks (mostly in XML), in-house DNA-chip experiments, and medical records. Based on this experience, this paper briefly reports the lessons learned from biomedical data integration and data quality issues, and the solutions we propose to the numerous problems of schema evolution in both the data sources and the warehousing system. In this context, we present QDex, a Quality-driven bio-Data Exploration tool, which provides a functional and modular architecture for database profiling and exploration, enabling users to set up query workflows and take advantage of data quality profiling metadata before the complex processes of data integration in the warehouse. An illustration with the QDex tool is then given.

Editor information

Mathias Weske Mohand-Saïd Hacid Claude Godart

Copyright information

© 2007 Springer-Verlag Berlin Heidelberg

About this paper

Cite this paper

Moussouni, F., Berti-Équille, L., Rozé, G., Loréal, O., Guérin, E. (2007). QDex: A Database Profiler for Generic Bio-data Exploration and Quality Aware Integration. In: Weske, M., Hacid, MS., Godart, C. (eds) Web Information Systems Engineering – WISE 2007 Workshops. WISE 2007. Lecture Notes in Computer Science, vol 4832. Springer, Berlin, Heidelberg. https://doi.org/10.1007/978-3-540-77010-7_2

  • DOI: https://doi.org/10.1007/978-3-540-77010-7_2

  • Publisher Name: Springer, Berlin, Heidelberg

  • Print ISBN: 978-3-540-77009-1

  • Online ISBN: 978-3-540-77010-7

  • eBook Packages: Computer Science, Computer Science (R0)
