TAQIH, a tool for tabular data quality assessment and improvement in the context of health data

https://doi.org/10.1016/j.cmpb.2018.12.029

Highlights

  • A web-based data quality assessment and improvement tool is implemented, built around the data quality dimensions. The tool also includes exploratory data analysis functionalities.

  • Data quality dimensions are reviewed, focusing on those more relevant in the healthcare field.

  • Data quality assessment and improvement techniques are reviewed.

  • The proposed tool has been applied to two datasets related to drug prescriptions and a glucose monitoring system, and results are discussed.

Abstract

Background and Objectives

Data curation is a tedious task but of paramount relevance for data analytics, especially in the health context where data-driven decisions must be extremely accurate. The ambition of TAQIH is to support non-technical users in 1) the exploratory data analysis (EDA) process of tabular health data, and 2) the assessment and improvement of its quality.

Methods

A web-based tool has been implemented with a simple yet powerful visual interface. First, it provides interfaces to understand the dataset and to gain insight into its content, structure and distribution. Then, it provides data visualization and improvement utilities for the data quality dimensions of completeness, accuracy, redundancy and readability.

Results

It has been applied in two different scenarios: (1) the Northern Ireland General Practitioners (GPs) Prescription Data, an open dataset containing drug prescriptions; and (2) a dataset from a glucose monitoring telehealth system. Findings on (1) include: features with a significant amount of missing values (e.g. the AMP_NM variable, 53.39%); instances with a high percentage of missing variable values (e.g. 0.21% of the instances with > 75% of values missing); and highly correlated variables (e.g. Gross and Actual cost, almost completely correlated (∼ +1.0)). Findings on (2) include: features with a significant amount of missing values (e.g. patient height, weight and body mass index (BMI) (> 70%), and date of diagnosis (13%)); and highly correlated variables (e.g. height, weight and BMI). Full details of the testing and insights related to the findings are reported.

Conclusions

TAQIH enables and supports users to carry out EDA on tabular health data and to assess and improve its quality. Having the application menu arranged sequentially, following the conventional EDA pipeline, helps users follow a consistent analysis process. The general description of the dataset and features section is very useful for a first overview of the dataset. The missing value heatmap is also very helpful for visually identifying correlations among missing values. The correlations and outliers sections have proved useful as preliminary steps before further data analysis pipelines. Finally, the data quality section provides a quantitative value for the dataset improvements.

Introduction

It is often said that 80% of data analysis is spent on the process of cleaning and preparing the data [1].

When data scientists or statisticians approach a data problem they generally follow a stepwise process. They first choose a knowledge discovery methodology such as CRISP-DM, SEMMA or KDD [2]. These methodologies consider aspects such as business and data understanding, data preparation / transformation, modelling, evaluation / interpretation and decision / deployment.

Then, they perform exploratory data analysis (EDA), refining the contextual understanding of the data and reviewing its quality before working with it, so that issues such as incomplete data, noisy or inconsistent values, etc. can be identified early.

Finally, data pre-processing is performed, transforming the data so that it can be used by knowledge extraction algorithms. Data pre-processing includes: data preparation, composed of the integration, cleaning, normalization and transformation of data; and data reduction, such as feature selection, instance selection, discretization, etc.
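To make this pipeline concrete, the following is a minimal sketch in Python with pandas (not the TAQIH implementation; the file name dataset.csv and the 75% missing-value threshold are hypothetical):

    import pandas as pd

    # Load the tabular dataset (hypothetical CSV file)
    df = pd.read_csv("dataset.csv")

    # Exploratory data analysis: structure, types, distributions and missingness
    print(df.shape)                    # number of instances and variables
    print(df.dtypes)                   # variable types
    print(df.describe(include="all"))  # per-variable summary statistics
    print(df.isna().mean())            # fraction of missing values per variable

    # Data preparation: normalization of the numeric variables
    numeric = df.select_dtypes("number").columns
    df[numeric] = (df[numeric] - df[numeric].mean()) / df[numeric].std()

    # Data reduction: drop variables with more than 75% missing values
    df = df.loc[:, df.isna().mean() <= 0.75]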

Although some of the data preprocessing methods are applicable to most types of data, it is important to note that at least three distinct groups of data types can be differentiated, requiring their own specific methods and techniques.

Firstly, the tabular data type, commonly known as strongly structured data, which includes Electronic Health Records (EHR), Personal Health Records (PHR), and even most open data.

Secondly, semi-structured data such as social networks or sensor signals.

And finally, non-structured data, such as medical imaging, free text or even omics data.

The TAQIH tool focuses on structured numeric data types, which are the basic data types in health, although we are in the process of including text and categorical data types, which are also frequently used in this field. Consequently, the rest of the paper focuses on numeric tabular data.

In the following, the relevant state of the art regarding data quality evaluation and improvement is introduced. For an extended explanation of the basic concepts, relevance and general context around data quality, the reader is referred to Introduction to Information Quality, in: Data and Information Quality, Data-Centric Systems and Applications [3].

In 2009 Sattler et al. [4] articulated the main dimensions of data quality as completeness, accuracy, consistency and timeliness. They also considered additional application-dependent dimensions such as relevance, response time and trustworthiness. Additionally, the Canadian Institute for Health Information [5] proposed the dimensions of accuracy, timeliness, comparability, usability and relevance, in the context of public policy supporting health care management and with the purpose of building public awareness about the factors that affect health. Later, in 2013, the DAMA UK Working Group refined the data quality dimensions as: completeness, uniqueness, timeliness, validity, accuracy and consistency, in their white paper [6]. In addition to these six data quality dimensions, the DAMA UK Working Group considered that it would also be useful to ask the following questions about the data:

  • Usability of the data - Is it understandable, simple, relevant, accessible, maintainable and at the right level of precision?

  • Timing issues with the data (beyond timeliness itself) - Is it stable yet responsive to legitimate change requests?

  • Flexibility of the data - Is it comparable and compatible with other data, does it have useful groupings and classifications? Can it be repurposed, and is it easy to manipulate?

  • Confidence in the data - Are Data Governance, Data Protection and Data Security in place? What is the reputation of the data, and is it verified or verifiable?

  • Value of the data - Is there a good cost/benefit case for the data? Is it being optimally used? Does it endanger people's safety or privacy or the legal responsibilities of the enterprise? Does it support or contradict the corporate image or the corporate message?

Nevertheless, they also highlight that in some situations one or more dimensions may not be relevant and can therefore be discarded. Moreover, depending on each organization's requirements, some dimensions are likely to be given more weight, since they can play a more important role than others.
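As an illustration of how some of these dimensions can be quantified on a tabular dataset, the following Python/pandas sketch computes simple indicators for completeness, uniqueness and validity (the file name, the age column and the 0–120 validity rule are hypothetical):

    import pandas as pd

    df = pd.read_csv("dataset.csv")  # hypothetical input file

    # Completeness: proportion of non-missing cells in the dataset
    completeness = 1.0 - df.isna().mean().mean()

    # Uniqueness: proportion of non-duplicated instances
    uniqueness = 1.0 - df.duplicated().mean()

    # Validity: proportion of values satisfying a domain rule
    # (hypothetical rule: age must lie between 0 and 120)
    validity = df["age"].between(0, 120).mean()

    print({"completeness": round(completeness, 3),
           "uniqueness": round(uniqueness, 3),
           "validity": round(validity, 3)})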

In the field of public health information systems, [7] carried out a study about the dimensions which were most frequently assessed. According to their study, completeness, accuracy and timeliness were the three most-used attributes among a total of 49 attributes of data quality.

In the context of EHR data reuse for clinical research, [8] describe five dimensions of data quality: completeness, correctness (accuracy), concordance (consistency), plausibility (trust and accuracy), and currency (timeliness), which also fit into the dimensions by the DAMA UK group [6].

In the previous examples from the literature there are discrepancies among authors regarding the data quality dimensions, but many of the terms presented as dimensions are related and could even be considered synonyms. Therefore, it makes sense to review all these studies and group similar dimension concepts, with the objective of obtaining an inclusive definition of data quality dimensions. As an example of this work, [9], [10] presented a richer categorization of the data quality dimensions, which uses a previously defined framework to group data quality dimension concepts into clusters based on their similarity. Using related papers in the literature, they propose the dimensions described below, which are used as the reference standard throughout this paper:

  • 1. Accuracy, correctness, validity, and precision focus on the adherence to a given reality of interest.

  • 2. Completeness, pertinence, and relevance refer to the capability of representing all and only the relevant aspects of the reality of interest.

  • 3. Redundancy, minimality, compactness, and conciseness refer to the capability of representing the aspects of the reality of interest with the minimal use of informative resources.

  • 4. Readability, comprehensibility, clarity, and simplicity refer to ease of understanding and fruition of information by users.

  • 5. Accessibility and availability are related to the ability of the user to access information from his or her culture, physical status/functions, and technologies available.

  • 6. Consistency, cohesion, and coherence refer to the capability of the information to comply without contradictions to all properties of the reality of interest, as specified in terms of integrity constraints, data edits, business rules, and other formalisms.

  • 7. Usefulness, related to the advantage the user gains from the use of information.

  • 8. Trust, including believability, reliability, and reputation, catching how much information derives from an authoritative source. The trust cluster also encompasses issues related to security.

Next, data quality assessment approaches are described in Section 1.2 Indicators and methods for evaluating Data Quality, while corrective and palliative approaches are described in Section 1.3 Data quality improvement strategies.

In order to perform the assessment of data quality, according to [6], the common approach would follow the steps below (a minimal scoring sketch is given after the list):

  • 1. Select the data to be assessed.

  • 2. Assess which data quality dimensions to use, as well as their weighting.

  • 3. Define the thresholds for good and bad quality data regarding each data quality dimension.

  • 4. Apply the assessment.

  • 5. Review the results to determine whether the data quality is acceptable or not.

  • 6. When appropriate, perform corrective actions.

  • 7. Perform follow-up monitoring by periodically repeating the procedure.
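A minimal sketch of steps 2 to 5 in Python, assuming per-dimension scores in [0, 1] have already been computed (the dimension weights and the acceptance threshold are hypothetical values, not recommendations from [6]):

    # Hypothetical per-dimension scores obtained in a previous assessment step
    scores = {"completeness": 0.82, "accuracy": 0.91, "consistency": 0.77}

    # Step 2: data quality dimensions and their weighting (assumed values)
    weights = {"completeness": 0.5, "accuracy": 0.3, "consistency": 0.2}

    # Step 3: threshold separating acceptable from unacceptable quality
    threshold = 0.80

    # Steps 4-5: apply the assessment and review the result
    overall = sum(scores[d] * weights[d] for d in scores)
    print(f"overall quality = {overall:.2f}, acceptable = {overall >= threshold}")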

In the context of Big Data in healthcare, [11] state that the assessment of data quality in health care has to consider: (1) the entire lifecycle of health data; (2) problems arising from errors and inaccuracies in the data itself; (3) the source(s) and the pedigree of the data; and (4) how the underlying purpose of data collection impacts the analytic processing and the knowledge expected to be derived. Automation in the form of data handling, storage, entry and processing technologies is to be viewed as a double-edged sword. At one level, automation can be a good solution, while at another level it can create a different set of data quality issues.

Regarding the reuse of EHR data for clinical research, [8] reviewed the clinical research literature discussing data quality assessment methodology for EHR data. According to them, seven broad categories of data quality assessment methods can be defined: comparison with gold standards, data element agreement, data source agreement, distribution comparison, validity checks, log review, and element presence. This work was later refined by [12], who introduced a data quality assessment guideline for EHR data reuse called 3 × 3 Data Quality Assessment (3 × 3 DQA). The 3 × 3 DQA is built around three constructs of data quality: complete, correct and current data. Each of these constructs is operationalized according to the three primary dimensions of EHR data: patients, variables and time. Each of the nine operationalized constructs maps to a methodological recommendation for EHR data quality assessment. The initial expert response to the framework was positive, but improvements are required. Specifically, the constructs and operationalized constructs of EHR data quality need to be improved, and the primary goals and principles of the guideline need to be explicitly stated and explained to intended users. Automated execution of the guideline should also be explored, to reduce the cognitive overhead for potential users of interpreting the complex guideline logic.

To conclude this section, the reader may have noticed that many of the assessment methods depend on knowing many priors about the data itself and are not directly applicable in an automatic manner. Therefore, the authors consider that, instead of looking for fully automatic tools for data quality assessment, in many cases either interactive tools or tools that facilitate data exploration are the most appropriate approaches.

The completeness dimension is inversely related to the handling of missing values, i.e. values that are not available in the dataset for a certain field or variable. There are two main and non-exclusive procedures to deal with missing values according to Luengo, Herrera et al. [13], [14], [15]: deletion (drop) and imputation. Deletion implies the removal of the instance (listwise) or the whole variable (pairwise) with missing values. Imputation, on the other hand, is the process of replacing missing data with substituted values, which can be carried out considering either only the attribute with missing values (standalone imputation) or the relationships among attributes. In both cases, different datasets could be used for selecting the appropriate substitution values.

Regarding variable (i.e. feature) or instance deletion, it is important to consider their relevance for further analysis steps, since in some cases they might be essential.
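A minimal pandas sketch of both deletion strategies; the 50% missing-value threshold and the protected variable name are assumptions for illustration:

    import pandas as pd

    df = pd.read_csv("dataset.csv")  # hypothetical input file

    # Listwise deletion: drop instances containing any missing value
    df_rows = df.dropna(axis=0, how="any")

    # Variable deletion: drop variables with more than 50% missing values,
    # keeping variables known to be essential for later analysis steps
    protected = {"patient_id"}  # hypothetical essential variable
    missing_frac = df.isna().mean()
    too_sparse = [c for c in df.columns
                  if missing_frac[c] > 0.5 and c not in protected]
    df_cols = df.drop(columns=too_sparse)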

With regard to standalone imputation, these methods only consider the feature/variable with the missing value in order to set the new value. The usual replacement values include the mean, median and mode, but refinements can be applied, such as using the value computed for the instance's category instead of the value computed for the whole dataset. These approaches are usually simple to compute but do not produce accurate results and may negatively affect further analyses on the imputed data.
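For illustration, a standalone imputation sketch in pandas, including the per-category refinement mentioned above (all column names are hypothetical):

    import pandas as pd

    df = pd.read_csv("dataset.csv")  # hypothetical input file

    # Global mean, median and mode imputation
    df["weight"] = df["weight"].fillna(df["weight"].mean())
    df["height"] = df["height"].fillna(df["height"].median())
    df["sex"] = df["sex"].fillna(df["sex"].mode().iloc[0])

    # Refinement: impute with the mean computed for the instance's category
    # (hypothetical grouping variable "diagnosis") instead of the global mean
    df["bmi"] = df["bmi"].fillna(
        df.groupby("diagnosis")["bmi"].transform("mean"))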

Since standalone imputation methods generally produce poor results, imputation methods which consider attribute relationships have been developed, by both the statistics and the machine learning communities. In the statistics field many approaches have been explored, such as hot-deck [16], cold-deck, nearest neighbour, interpolation, model-based filling or propensity score [17] (nonparametric). Some specific approaches for discrete variables, such as parametric regression, have also been developed. On the borderline between statistics and machine learning we can find Bayesian Principal Component Analysis (PCA). Machine learning approaches usually rely on classification or regression models.
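As an example from the nearest-neighbour family, the following sketch uses scikit-learn's KNNImputer on the numeric variables; this is one possible relationship-aware method, not the specific implementations evaluated in [13]:

    import pandas as pd
    from sklearn.impute import KNNImputer

    df = pd.read_csv("dataset.csv")  # hypothetical input file
    numeric = df.select_dtypes("number").columns

    # Each missing value is imputed from the 5 most similar instances,
    # thereby exploiting the relationships among the numeric attributes
    imputer = KNNImputer(n_neighbors=5)
    df[numeric] = imputer.fit_transform(df[numeric])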

According to the study carried out by [13], whose purpose is to use the datasets for classification tasks in the machine learning field, the imputation methods that fill in the missing values outperform both case deletion and the lack of imputation. They also state that there is no universal imputation method that performs best for all classifiers. Finally, they conclude that the Fuzzy K-Means Imputation (FKMI) and Event Covering (EC) imputation methods are the best choices under general assumptions, but show little advantage over the rest of the imputation methods analysed.

If the purpose is to get the most accurate data for further processing, and enough useful data is available after the deletion, then instance deletion is a sensible approach. But in practice, the amount of useful data is usually limited, especially for certain label values.

For a more thorough description of methods dealing with missing data the reader is referred to Missing-Data Adjustments in Large Surveys [18], On the choice of the best imputation methods for missing values considering three groups of classification methods [13], and Statistical Analysis with Missing Data [19].

An outlying observation, or "outlier," is one that appears to deviate markedly from other members of the sample in which it occurs [20]. Outliers can be either the result of poor data collection, in which case they are generally dropped, or genuine observations. In the latter case, if the outlier does not change the results but does affect assumptions, it may be dropped, but it should be reported for the further data analysis process. When the outlier affects both results and assumptions, it is not legitimate to simply drop it. And if the outlier creates a significant association, it should be dropped and no significance should be reported from the analysis.

Outliers are related to the data quality dimensions of accuracy, since they can point out errors in the data collection process; as well as to redundancy/compactness and readability, since they can be related to edge cases which are not useful for the general understanding of the data but obscure its analysis and visualization.

Outliers can be identified using univariate and multivariate methods and managed accordingly, depending on the case, to alleviate their effect on later analytics. Alternatively, data analysis methods that are not affected by outliers, such as robust statistics [21], [22], can be used.

Univariate methods for outlier detection look for data points with extreme values on one variable. If the outlier values are not extreme, these methods do not detect or filter them. A box plot can be used as a preliminary tool for visually identifying outliers with a univariate method. Tukey's outlier detection method follows a similar approach to box plots, since it defines an outlier as a value that falls far from the central point (i.e. the median). This distance is usually given as a factor of the interquartile range (IQR).
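A minimal sketch of Tukey's rule for a single numeric variable; the 1.5 factor is the conventional choice and the column name is hypothetical:

    import pandas as pd

    df = pd.read_csv("dataset.csv")  # hypothetical input file
    x = df["actual_cost"]            # hypothetical numeric variable

    q1, q3 = x.quantile(0.25), x.quantile(0.75)
    iqr = q3 - q1
    lower, upper = q1 - 1.5 * iqr, q3 + 1.5 * iqr

    # Instances falling outside the Tukey fences are flagged as outliers
    outliers = df[(x < lower) | (x > upper)]
    print(f"{len(outliers)} univariate outliers detected")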

Multivariate methods look for unusual combinations on all the variables. Two broad categories can be identified: machine learning model based and distance-based methods.

In the former, a probabilistic mathematical model (e.g. a regression) is trained using all the data and then applied to all the data, so that instances with a significant difference between the predicted value and the ground truth are considered outliers. Distance-based methods, in turn, are grounded on the existence of a distance metric among instances. In this case, the usual method to identify outliers relies on computing the normalized mean absolute deviation, and then considering as an outlier any instance that is more than x standard deviations from the mean, where x is a pre-established threshold (e.g. 2 or 1.5). The biggest drawback of this approach is that the presence of outliers has a strong effect on the arithmetic mean and standard deviation. More sophisticated methods use the covariance matrix (CM) as a basis, but require a robust CM estimator to avoid the negative effect of outliers. In this sense, the Minimum Covariance Determinant estimator [23] uses the minimum covariance determinant as a robust method to obtain the covariance matrix, after which the Mahalanobis distance is used to compute outliers.
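A sketch of the robust multivariate approach using scikit-learn's MinCovDet estimator; the chi-squared cut-off at the 97.5th percentile is a common convention assumed here, not a value prescribed by [23]:

    import pandas as pd
    from scipy.stats import chi2
    from sklearn.covariance import MinCovDet

    df = pd.read_csv("dataset.csv")          # hypothetical input file
    X = df.select_dtypes("number").dropna()  # numeric, complete cases only

    # Robust covariance estimate via the Minimum Covariance Determinant
    mcd = MinCovDet(random_state=0).fit(X)

    # Squared Mahalanobis distances to the robust location estimate
    d2 = mcd.mahalanobis(X)

    # Flag instances beyond the chi-squared cut-off as multivariate outliers
    cutoff = chi2.ppf(0.975, df=X.shape[1])
    outliers = X[d2 > cutoff]
    print(f"{len(outliers)} multivariate outliers detected")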

Recently, the work described in Automated Exploration for Interactive Outlier Detection [24] presented a framework to select the most suitable outlier detection algorithm set given a dataset and execution time constraints.

Section snippets

Methods

The TAQIH tool, which has been implemented in the context of the Meaningful Integration of Data, Analytics and Service (MIDAS) EU H2020 project, is presented here.

TAQIH is a web-based software tool to support the exploratory data analysis (EDA) process, with a special emphasis on the assessment of the data quality, as well as delivering semi-automatic data quality improvement.

During EDA, interactions with the dataset permit increasing its understanding and carrying out actions which directly

Results

We have applied our TAQIH tool in two different scenarios. First, the Northern Ireland General Practitioners (GPs) Prescription Data, an open dataset containing drug prescriptions. And second, data related to a glucose monitoring telehealth system.

In both cases, we follow the data quality assessment process according to the DAMA UK group [6].

Discussion

We have carried out a literature review on data quality dimensions, as well as on data quality assessment and improvement, focused on the healthcare field. In this context, we have presented TAQIH, a data quality assessment and improvement software tool, and its application to two practical scenarios.

In the first scenario, the open data drug prescription dataset, we realized that missing values were coded using a nonstandard value. After this adjustment, we could notice that there were some

Acknowledgements

This work was supported by the European Union's Horizon 2020 research and innovation programme under grant agreement No 727721 (MIDAS).

References (28)

  • D.W. Joenssen et al., Hot deck methods for imputing missing data.

  • Y. Fu et al., REMIX: automated exploration for interactive outlier detection.

  • T. Dasu et al., Exploratory Data Mining and Data Cleaning (2003).

  • A.I.R.L. Azevedo et al., KDD, SEMMA and CRISP-DM: a parallel overview.

  • C. Batini et al., Introduction to information quality.

  • K.-U. Sattler, Data quality dimensions.

  • Canadian Institute for Health Information, The CIHI Data Quality Framework, Ottawa, ON, Canada, ...

  • The six primary dimensions for data quality assessment (2013).

  • H. Chen et al., A review of data quality assessment methods for public health information systems, Int. J. Environ. Res. Public Health (2014).

  • N.G. Weiskopf et al., Methods and dimensions of electronic health record data quality assessment: enabling reuse for clinical research, J. Am. Med. Inform. Assoc. (JAMIA) (2013).

  • C. Batini et al., Data quality dimensions.

  • C. Batini et al., Information quality dimensions for maps and texts.

  • S.R. Sukumar et al., Quality of big data in health care, Int. J. Health Care Qual. Assur. (2015).

  • N.G. Weiskopf et al., A data quality assessment guideline for electronic health record data reuse, eGEMs (Gener. Evid. Methods Improve Patient Outcomes) (2017).