TAQIH, a tool for tabular data quality assessment and improvement in the context of health data
Introduction
It is often said that 80% of data analysis is spent on the process of cleaning and preparing the data [1].
When data scientists or statisticians approach a data problem they generally follow a stepwise process. They first choose a knowledge discovery methodology such as CRISP-DM, SEMMA or KDD [2]. These methodologies consider aspects like business and data understanding, data preparation / transformation, modelling, evaluation / interpretation and decision / deployment.
Then, they perform exploratory data analysis (EDA), refining the contextual understanding of the data as well as reviewing its quality before working with it, so that issues such as incomplete data, noisy values or inconsistent values can be identified early.
Finally, data pre-processing is performed, transforming the data so that it can be used by knowledge extraction algorithms. Data pre-processing includes data preparation, comprising the integration, cleaning, normalization and transformation of data; and data reduction, such as feature selection, instance selection and discretization.
Although some of the data preprocessing methods are applicable to most types of data, it is important to note that at least three distinct groups of data types can be differentiated, requiring their own specific methods and techniques.
Firstly, the tabular data type, commonly known as strongly structured, which includes Electronic Health Records (EHR), Personal Health Records (PHR), and even most open data.
Secondly, semi-structured data such as social networks or sensor signals.
And finally, non-structured data, such as medical imaging, free text or even omics data.
The TAQIH tool focuses on structured numeric data types, which are the basic data type in health, although we are in the process of including text and categorical data types, which are also frequently used in this field. Consequently, the rest of the paper focuses on numeric tabular data.
In the following, the relevant state of the art regarding data quality evaluation and improvement is introduced. For an extended explanation of the basic concepts, relevance and general context around data quality, the reader is referred to Introduction to Information Quality, in: Data and Information Quality, Data-Centric Systems and Applications [3].
In 2009 Sattler et al. [4] articulated the main dimensions of data quality as completeness, accuracy, consistency and timeliness. They also considered additional application-dependent dimensions such as relevance, response time and trustworthiness. Additionally, the Canadian Institute for Health Information [5] proposed the dimensions of accuracy, timeliness, comparability, usability and relevance, in the context of public policy supporting health care management, and with the purpose of building public awareness about the factors that affect health. Later, in 2013, the DAMA UK Working Group refined the data quality dimensions in their white paper [6] as: completeness, uniqueness, timeliness, validity, accuracy and consistency. In addition to these six data quality dimensions, the DAMA UK Working Group considered that it would also be useful to ask the following questions about the data:
- Usability of the data - Is it understandable, simple, relevant, accessible, maintainable and at the right level of precision?
- Timing issues with the data (beyond timeliness itself) - Is it stable yet responsive to legitimate change requests?
- Flexibility of the data - Is it comparable and compatible with other data, does it have useful groupings and classifications? Can it be repurposed, and is it easy to manipulate?
- Confidence in the data - Are Data Governance, Data Protection and Data Security in place? What is the reputation of the data, and is it verified or verifiable?
- Value of the data - Is there a good cost/benefit case for the data? Is it being optimally used? Does it endanger people's safety or privacy or the legal responsibilities of the enterprise? Does it support or contradict the corporate image or the corporate message?
Nevertheless, they also highlight the fact that in some situations one or more dimensions may not be relevant and can therefore be discarded. Moreover, depending on each organization's requirements, some dimensions are likely to be given more relevance, since they can play a more important role than others.
In the field of public health information systems, [7] carried out a study about the dimensions which were most frequently assessed. According to their study, completeness, accuracy and timeliness were the three most-used attributes among a total of 49 attributes of data quality.
In the context of EHR data reuse for clinical research, [8] describe five dimensions of data quality: completeness, correctness (accuracy), concordance (consistency), plausibility (trust and accuracy), and currency (timeliness), which also fit into the dimensions by the DAMA UK group [6].
In the previous examples from the literature there are discrepancies regarding the data quality dimensions among authors, but many of the terms presented as dimensions are related and could even be considered synonyms. Therefore, it would make sense to review all these studies and group similar dimension concepts with the objective of obtaining an inclusive definition of data quality dimensions. As an example of this work, [9], [10] presented a richer categorization of the data quality dimensions, which uses a previously defined framework, grouping data quality dimension concepts into clusters based on their similarity. Using related papers in the literature, they propose the dimensions described below, which are used as the reference standard throughout this paper:
1. Accuracy, correctness, validity, and precision focus on the adherence to a given reality of interest.
2. Completeness, pertinence, and relevance refer to the capability of representing all and only the relevant aspects of the reality of interest.
3. Redundancy, minimality, compactness, and conciseness refer to the capability of representing the aspects of the reality of interest with the minimal use of informative resources.
4. Readability, comprehensibility, clarity, and simplicity refer to the ease of understanding and fruition of information by users.
5. Accessibility and availability are related to the ability of the user to access information from his or her culture, physical status/functions, and technologies available.
6. Consistency, cohesion, and coherence refer to the capability of the information to comply without contradictions to all properties of the reality of interest, as specified in terms of integrity constraints, data edits, business rules, and other formalisms.
7. Usefulness, related to the advantage the user gains from the use of information.
8. Trust, including believability, reliability, and reputation, catching how much information derives from an authoritative source. The trust cluster also encompasses issues related to security.
Next, data quality assessment approaches are described in Section 1.2 Indicators and methods for evaluating Data Quality, while corrective and palliative approaches are described in Section 1.3 Data quality improvement strategies.
In order to perform the Assessment of Data Quality, according to [6], the common approach would follow these steps:
1. Select the data to be assessed
2. Assess which data quality dimensions to use as well as their weighting
3. Define the thresholds for good and bad quality data regarding each data quality dimension
4. Apply the assessment
5. Review the results to determine whether the data quality is acceptable or not
6. When appropriate, perform corrective actions
7. Perform follow-up monitoring by periodically repeating the procedure
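Steps 1 to 5 of this procedure can be sketched in code; the dataset, the choice of the completeness dimension, and the threshold below are illustrative assumptions, not part of the DAMA guidance:

```python
# Sketch of steps 1-5 for the completeness dimension.
# Dataset, dimension choice and threshold are illustrative assumptions.
records = [                          # 1. select the data to be assessed
    {"age": 54, "glucose": 7.1},
    {"age": None, "glucose": 6.4},
    {"age": 61, "glucose": None},
    {"age": 47, "glucose": 5.9},
]

THRESHOLD = 0.9                      # 3. threshold for good vs. bad quality

def completeness(rows, field):
    """2./4. assess the chosen dimension (completeness) for one field."""
    present = sum(1 for r in rows if r.get(field) is not None)
    return present / len(rows)

report = {f: completeness(records, f) for f in ("age", "glucose")}
acceptable = {f: score >= THRESHOLD for f, score in report.items()}  # 5. review
```

For step 7, the same report could simply be recomputed periodically and compared against earlier runs.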
In the context of Big Data in healthcare, [11] state that the assessment of data quality in health care has to consider: (1) the entire lifecycle of health data; (2) problems arising from errors and inaccuracies in the data itself; (3) the source(s) and the pedigree of the data; and (4) how the underlying purpose of data collection impacts the analytic processing and knowledge expected to be derived. Automation in the form of data handling, storage, entry and processing technologies is to be viewed as a double-edged sword. At one level, automation can be a good solution, while at another level it can create a different set of data quality issues.
Regarding the reuse of EHR data for clinical research, [8] reviewed the clinical research literature discussing data quality assessment methodology for EHR data. According to them, seven broad categories of data quality assessment methods can be defined: comparison with gold standards, data element agreement, data source agreement, distribution comparison, validity checks, log review, and element presence. This work was later refined by [12], introducing a data quality assessment guideline for EHR data reuse called 3 × 3 Data Quality Assessment (3 × 3 DQA). This 3 × 3 DQA is built around three constructs of data quality: complete, correct and current data. Each of these constructs is operationalized according to the three primary dimensions of EHR data: patients, variables and time. Each of the nine operationalized constructs maps to a methodological recommendation for EHR data quality assessment. The initial expert response to the framework was positive, but improvements are required. Specifically, the constructs and operationalized constructs of EHR data quality need to be improved, and the primary goals and principles of the guideline need to be explicitly stated and explained to intended users. Automated execution of the guideline should also be explored to reduce the cognitive overhead for potential users of interpreting the complex guideline logic.
As a conclusion of this section the reader may have noticed that many of the assessment methods are dependent on knowing many priors about the data itself and are not directly applicable in an automatic manner. Therefore, the authors consider that instead of looking for fully automatic tools for data quality assessment, in many cases either interactive tools or tools to facilitate data exploration are the most appropriate approaches.
The completeness dimension is inversely related to the handling of missing values, i.e. values that are not available in the dataset for a certain field or variable. There are two main and non-exclusive procedures to deal with missing values according to Luengo, Herrera et al. [13], [14], [15]: deletion (drop) and imputation. Deletion implies the removal of the instance (listwise) or the whole variable (pairwise) with missing values. Imputation, on the other hand, is the process of replacing missing data with substituted values, which can be carried out considering either only the attribute with missing values (standalone imputation) or including the relationships among attributes. In both cases, different datasets could be used for selecting the appropriate substitution values.
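The two deletion variants can be sketched as follows; the records and field names are illustrative:

```python
# Sketch of the two deletion strategies on toy records with missing values.
rows = [
    {"age": 54, "bmi": 27.1, "hba1c": None},
    {"age": None, "bmi": 24.3, "hba1c": 6.2},
    {"age": 61, "bmi": 30.0, "hba1c": 7.0},
]

# Instance deletion: drop every record containing a missing value.
complete_rows = [r for r in rows if all(v is not None for v in r.values())]

# Variable deletion: drop every variable containing a missing value.
fields = list(rows[0])
complete_fields = [f for f in fields if all(r[f] is not None for r in rows)]
reduced = [{f: r[f] for f in complete_fields} for r in rows]
```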
Regarding variable (i.e. feature) or instance deletion, it is important to consider their relevance for further analysis steps, since in some cases they might be essential.
In regard to standalone imputation, these methods only consider the feature/variable with missing values in order to set the new value. The usual replacement values are the mean, median and mode. Refinements can be applied, such as using the value computed for the instance's category instead of the value computed for the whole dataset. These approaches are usually simple to compute but do not produce accurate results and may negatively affect further analyses on the imputed data.
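A minimal sketch of standalone imputation, including the per-category refinement; the fields, values and the choice of the mean are illustrative assumptions:

```python
from statistics import mean

# Toy records: one glucose value is missing for a female patient.
rows = [
    {"sex": "F", "glucose": 5.8},
    {"sex": "F", "glucose": None},
    {"sex": "M", "glucose": 7.4},
    {"sex": "M", "glucose": 7.0},
]

# Whole-dataset mean imputation.
observed = [r["glucose"] for r in rows if r["glucose"] is not None]
global_mean = mean(observed)

# Refinement: impute with the mean of the instance's category instead.
def category_mean(cat):
    vals = [r["glucose"] for r in rows
            if r["sex"] == cat and r["glucose"] is not None]
    return mean(vals)

imputed = [dict(r, glucose=r["glucose"] if r["glucose"] is not None
                else category_mean(r["sex"]))
           for r in rows]
```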
Standalone imputation methods produce poor results in general. To overcome this issue, imputation methods that consider attribute relationships have been developed by both the statistics and the machine learning communities. In the statistics field, many approaches have been explored, such as hot-deck [16], cold-deck, nearest neighbour, interpolation, model-based filling or propensity score [17] (nonparametric). Some specific approaches for discrete variables, such as parametric regression, have also been developed. On the borderline between statistics and machine learning we can find Bayesian Principal Component Analysis (PCA). Machine learning approaches usually rely on classification or regression models.
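As an illustration of imputation that exploits attribute relationships, the nearest-neighbour approach can be sketched as follows (toy records and a Euclidean distance over the remaining attributes are assumptions made for the example):

```python
import math

# Toy records; the last instance is missing its glucose value.
rows = [
    {"age": 50.0, "bmi": 27.0, "glucose": 6.0},
    {"age": 52.0, "bmi": 28.0, "glucose": 6.2},
    {"age": 70.0, "bmi": 35.0, "glucose": 8.1},
    {"age": 51.5, "bmi": 27.5, "glucose": None},
]

def nn_impute(target, donors, feature):
    """Copy `feature` from the complete donor closest on the other attributes."""
    others = [k for k in target if k != feature and target[k] is not None]
    def dist(donor):
        return math.sqrt(sum((target[k] - donor[k]) ** 2 for k in others))
    nearest = min((d for d in donors if d[feature] is not None), key=dist)
    return nearest[feature]

rows[3]["glucose"] = nn_impute(rows[3], rows[:3], "glucose")
```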
According to the study carried out by [13], where the purpose is to use the datasets for classification tasks in the machine learning field, the imputation methods that fill in the missing values outperform the case deletion and the lack of imputation. They also state that there is no universal imputation method that performs best for all classifiers. Finally, they conclude that the use of the Fuzzy K-Means Imputation (FKMI) and Event Covering (EC) imputation methods are the best choices under general assumptions but show little advantage with respect to the rest of imputation methods analysed.
If the purpose is to get the most accurate data for further processing, and enough useful data is available after the deletion, then instance deletion is a sensible approach. But in practice, the amount of useful data is usually limited, especially for certain label values.
For a more thorough description of methods dealing with missing data the reader is referred to Missing-Data Adjustments in Large Surveys [18], On the choice of the best imputation methods for missing values considering three groups of classification methods [13], and Statistical Analysis with Missing Data [19].
An outlying observation, or "outlier," is one that appears to deviate markedly from the other members of the sample in which it occurs [20]. Outliers can be either the result of poor data collection, in which case they are generally dropped, or genuinely valid. In the latter case, if the outlier does not change the results but does affect assumptions, it may be dropped, but it should be reported for the subsequent data analysis process. When the outlier affects both results and assumptions, it is not legitimate to simply drop it. And if the outlier creates a significant association, it should be dropped and no significance should be reported from the analysis.
Outliers are related to the data quality dimensions of accuracy, since they can point out errors in the data collection process; as well as to redundancy/compactness and readability, since they can be related to edge cases which are not useful for the general understanding of the data but obscure its analysis and visualization.
Outliers can be identified using univariate and multivariate methods and managed accordingly depending on the case, to alleviate their effect on later analytics. Alternatively, data analysis methods that are not affected by outliers can be used, such as robust statistics [21], [22].
Univariate methods for outlier detection look for data points with extreme values on a single variable. If the outlier values are not extreme, these methods do not detect or filter them. A box plot can be used as a preliminary tool for visually identifying outliers with a univariate method. Tukey's outlier detection method follows a similar approach to box plots, since it defines an outlier as a value that falls far from the central point (i.e. the median). This distance is usually given as a factor of the interquartile range (IQR).
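Tukey's method can be sketched as follows; the sample values and the commonly used factor of 1.5 times the IQR are illustrative:

```python
from statistics import quantiles

# Tukey's fences: values outside [Q1 - k*IQR, Q3 + k*IQR] are flagged.
values = [12, 13, 13, 14, 15, 15, 16, 16, 17, 48]
q1, _, q3 = quantiles(values, n=4)        # quartiles of the sample
iqr = q3 - q1                             # interquartile range
k = 1.5                                   # commonly used factor
lower, upper = q1 - k * iqr, q3 + k * iqr
outliers = [v for v in values if v < lower or v > upper]
```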
Multivariate methods look for unusual combinations across all the variables. Two broad categories can be identified: machine learning model-based methods and distance-based methods.
In the former, a probabilistic mathematical model (e.g. a regression) is trained on all the data and then applied to it, so that instances with a significant difference between the predicted value and the ground truth are considered outliers. Distance-based methods, in turn, are grounded on the existence of a distance metric among instances. In this case, the usual method to identify outliers relies on computing the normalized mean absolute deviation, and then considering as an outlier any instance that is more than x standard deviations from the mean, where x is a pre-established threshold (e.g. 1.5 or 2). Its biggest drawback is that the presence of outliers has a strong effect on the arithmetic mean and standard deviation. More sophisticated methods use the covariance matrix (CM) as a basis, but require a robust CM estimator to avoid the outliers' negative effect. In this sense, the Minimum Covariance Determinant estimator [23] provides a robust method to obtain the covariance matrix, after which the Mahalanobis distance is used to identify outliers.
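A minimal sketch of the basic distance-based approach, flagging instances more than x standard deviations from the mean (the values and x = 2 are illustrative); note how the extreme value itself inflates the mean and standard deviation it is tested against, which is the drawback mentioned above:

```python
from statistics import mean, pstdev

# Flag instances more than x standard deviations from the mean (x = 2).
values = [10.0, 10.5, 9.8, 10.2, 9.9, 10.1, 25.0]
mu, sigma = mean(values), pstdev(values)
x = 2.0
outliers = [v for v in values if abs(v - mu) > x * sigma]
# The extreme value inflates both mu and sigma before it is tested.
```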
Recently, the work described in Automated Exploration for Interactive Outlier Detection [24] presented a framework to select the most suitable outlier detection algorithm set given a dataset and execution time constraints.
Methods
The TAQIH tool, which has been implemented in the context of the Meaningful Integration of Data, Analytics and Services (MIDAS) EU H2020 project, is presented here.
TAQIH is a web-based software tool to support the exploratory data analysis (EDA) process, with a special emphasis on the assessment of the data quality, as well as delivering semi-automatic data quality improvement.
During EDA, interactions with the dataset permit increasing its understanding and carrying out actions which directly
Results
We have applied our TAQIH tool in two different scenarios. First, the Northern Ireland General Practitioners (GPs) Prescription Data, an open dataset containing drug prescriptions; and second, data from a glucose-monitoring telehealth system.
In both cases, we follow the data quality assessment process according to the DAMA UK group [6].
Discussion
We have carried out a bibliography review on data quality dimensions, as well as data quality assessment and improvement, focused on the healthcare field. In this context, we have presented TAQIH, a data quality assessment and improvement software tool and its application into two practical scenarios.
In the first scenario, the open data drug prescription dataset, we realized that missing values were coded using a nonstandard value. After this adjustment, we could notice that there were some
Acknowledgements
This work was supported by the European Union's Horizon 2020 research and innovation programme under grant agreement No 727721 (MIDAS).
References (28)
- Hot deck methods for imputing missing data
- REMIX: automated exploration for interactive outlier detection
- Exploratory Data Mining and Data Cleaning (2003)
- KDD, SEMMA and CRISP-DM: a parallel overview
- Introduction to information quality
- Data quality dimensions
- Canadian Institute for Health Information, The CIHI Data Quality Framework, Ottawa, ON, Canada, ...
- The six primary dimensions for data quality assessment (2013)
- A review of data quality assessment methods for public health information systems, Int. J. Environ. Res. Public Health (2014)
- Methods and dimensions of electronic health record data quality assessment: enabling reuse for clinical research, J. Am. Med. Inform. Assoc. (JAMIA) (2013)
- Data quality dimensions
- Information quality dimensions for maps and texts
- Quality of big data in health care, Int. J. Health Care Qual. Assur.
- A Data Quality Assessment Guideline for Electronic Health Record Data Reuse, eGEMs