Introduction

Familial hypercholesterolaemia (FH) is a common inherited genetic disorder causing high cholesterol levels from birth1 and increased risk of premature heart disease and death2. FH affects ~1 in 200–500 of the general population3,4. However, most individuals with FH and affected family members remain undiagnosed worldwide5. In patients with heterozygous FH, lipid-lowering therapy such as the use of moderate- to high-intensity statins, or newer PCSK9 inhibitors markedly improves prognosis6 – reducing risk of coronary heart disease and all-cause mortality by at least 44%7,8. Patients who remain unidentified will be untreated or be sub-optimally treated with low-intensity statins and assumed to have commoner multifactorial causes for raised cholesterol.

Internationally, current approaches to identity FH based on clinical characteristics recommend use of the Simon-Broome diagnostic criteria (SB)2, Dutch Lipid Clinic Network criteria (DLCN)9, Make Early Diagnosis to Prevent Early Deaths (MEDPED)10, or Japanese Atherosclerosis Society (JAS) criteria11 (see Supplementary Box 1). These criteria have all been developed from specialist FH or lipid clinic registries, with emphasise on conducting a thorough family history and assessment of clinical features such as tendon xanthoma and arcus senilis. This means the application of these criteria in searching electronic health records of the wider general population in primary care will be limited. For instance, family histories are poorly recorded as evidenced in primary care databases from the UK and Australia, hence an acknowledged limitation of using primary care databases12.

Hence, there has been a drive to develop bespoke algorithms derived from large electronic health records (EHRs) to detect FH. The SEARCH study in the US13 used an electronic version of the DLCN criteria, while the FAMCAT tool in the UK14,15,16 and FindFH model in the US17 have been recently developed from prediction modelling. Using standard logistic regression, area under receiver operating curves (AUC) for FAMCAT were from 0.86 in its development database (UK Clinical Practice Research Datalink)18, to 0.83 and 0.84 in two separate external validation databases14,15 (QRESEARCH and RCGP Surveillance Network). Developed as a data-driven machine-learning (ML) algorithm from US administrative health data, the recent FindFH model resulted in an AUC of 0.89 using a random forest models approach17.

ML algorithms have diverse applications including disease modelling19, with the potential of improving prediction, identifying latent variables which are unlikely to be observed but might be inferred from other variables. ML, therefore, offers an alternative approach to standard prediction modelling20. The aims of this study were firstly, to evaluate the performance of a range of different ML algorithms to identify patients with FH within a large UK general primary care population; secondly, we sought to determine potential differences in the clinical utility of using different ML algorithms.

Results

Study population characteristics

There was a total of 4,157,705 individuals in CPRD study population with either a total cholesterol or LDL-cholesterol record during the study period. 129,930 (3.1%) individuals were excluded from the analysis for either having outlying cholesterol measurements, data entry errors, having a death or transfer out date before study start date or a diagnosis of FH before the study start date. The complete study cohort for analysis was made up of 4,027,775 individuals, with 7928 (0.2%) having a documented diagnosis of FH (Fig. 1). Reported FH prevalence was higher towards the south of England, with London having greatest frequency of FH identified. Other regions of England towards the North and Northeast have lower population frequency of FH identified.

Fig. 1: Map of familial hypercholesterolaemia prevalence.
figure 1

Reported prevalence of familial hypercholesterolaemia in primary care electronic health records by English region.

To develop the FH models, 75% of the complete cohort (n = 3,020,832) was randomly sampled to become the training cohort and the remaining 25% of the cohort (n = 1,006,943) assigned as the validation cohort. Table 1 shows the descriptive characteristics of both training and validation cohorts stratified by sex.

Table 1 Clinical characteristics for men and women age 16 years or above in the derivation and validation cohorts.

Variable rankings

Clinical features ranking for predicting FH are presented in Fig. 2, Supplementary Table 1. All 45 predictor variables were included in developing all the models. All models, apart from deep learning, indicated that cholesterol values and family history features were strong indicators of FH, consistent with existing diagnostic criteria. For the logistic regression model, only three features remained as relevant. For random forests and gradient boosting machines both featured current statin potency, triglycerides, body mass index and systolic blood pressure to determine the likelihood of FH. The deep learning model prioritised exclusion features which indicate a lower likelihood of FH, including secondary causes of raised cholesterol due to chronic conditions such as kidney disease, diabetes, hypertension and hypothyroidism. The deep learning model also identified rare signals such as tendon xanthomata, which is known to be under-recognised in primary care.

Fig. 2: Top 10 risk factors for familial hypercholesterolaemia.
figure 2

Algorithms were derived from training cohort of 3,020,832 patients. BP blood pressure, CHD coronary heart disease, DLM deep learning model, FH familial hypercholesterolaemia, GBM gradient boosting model, LDL-C low-density lipoprotein cholesterol, LRM logistic regression model, RFM random forest model, TC total cholesterol.

Discrimination

To predict the risk/probability of having FH for each individual, the algorithms were applied to the validation cohort (n = 1,006,943). The discrimination accuracy based on AUC, c-statistics, is presented in Fig. 3 for all the models (Supplementary Table 2 for details). AUC was lowest for the logistic regression model. The discrimination accuracy was similar by sex. For instance, for the ensemble model, which is a combination of all the other models, the c-statistics for FH in men was 0.898 (95% CI: 0.886–0.911) compared to 0.884 (95% CI: 0.873–0.895) for women.

Fig. 3: Discrimination accuracy of the models for identifying familial hypercholesterolaemia in primary care.
figure 3

Based on validation cohort of 1,006,943 patients.

Calibration

Calibration accuracy for all the models was assessed by plotting deciles of predicted risk against expected proportion of FH diagnosis for each decile, Fig. 4. At lower predicted risks, the algorithms were generally well calibrated, however, at higher predicted risks were not as well-calibrated.

Fig. 4: Calibration plots for all models.
figure 4

Based on validation cohort of 1,006,943 patients: a logistic regression model, b random forest model, c gradient boosting model, d deep-learning model and e ensemble model. E:O log of the expected/observed number of events, CITL calibration-in-the-large, AUC area under the curve, slope calibration slope. The circles represent deciles of patients grouped by similar predicted risk. The distribution of patients (stratified by outcome) is indicated with spikes at the bottom of the graph. Patients with a diagnosis of familial hypercholesterolaemia (FH) are represented by spikes above the x-axis (red line), and patients without a diagnosis of FH, below the x-axis).

Sensitivity and specificity

Using a cut-off above 1 in 250 (0·004), we determined the sensitivity, specificity, positive predictive value (PPV) and negative predictive value (NPV) of each machine-learning algorithm shown in Table 2. The number of high probability individuals varied due to the shapes of the probability distributions between algorithms, ranging from 0.73% of the population using ensemble learning to 10.16% of the population using deep learning. Specificity was not as variable as sensitivity ranging from 90.0% for deep leaning to 99.3% for ensemble learning. The corresponding NPVs for all algorithms were all above 99%. Sensitivity was highest for the deep learning algorithm (72.6%) and lowest for ensemble learning (30.5%). However, a lower proportion of high probability individuals using ensemble learning models meant that the PPV for ensemble learning would be the highest (15.5%). In contrast, deep learning models, by identifying a far greater number of high probability individuals, meant that this model would result in the lowest PPV (2.8%).

Table 2 Sensitivity, specificity, positive predictive and negative values for machine-learning models for detecting familial hypercholesterolaemia in the validation cohort (n = 1,006,943).

Likelihood ratios

We determined positive and negative likelihood ratios (LR) for each machine-learning algorithm (Fig. 5). The positive likelihood ratio (LR+) estimates the likelihood of having FH, give a positive test result (>1 in 250). The negative likelihood ratio (LR−) estimates the likelihood of not having FH, given a negative test result (≤1/250). All machine-learning models resulted in significant LR+ and LR−, with ensemble learning having the highest LR+ (45.5, 95% CI 42.4–49.9) and deep learning models having lowest LR− (0.31, 95% CI 0.28–0.33).

Fig. 5: Positive and negative likelihood ratios for machine-learning models for detecting familial hypercholesterolaemia.
figure 5

Based on validation cohort of 1,006,943 patients: a positive likelihood ratios (95% confidence interval) and b negative likelihood ratios (95% confidence interval).

Discussion

We have assessed the ability of five different machine-learning (ML) algorithms to detect cases of familial hypercholesterolaemia (FH) in over 4 million patients’ routine primary health care records.

We found four ML models (random forest, gradient boosting, deep learning and ensemble learning) all had similarly high predictive accuracy, with AUC > 0.89. This is highly consistent with that found for the FindFH model, using a random forest algorithm in US administrative health data17; and a 3–6% improvement on the UK FAMCAT tool using standard logistic regression14,15,18.

We found substantial differences for clinical utility between ML algorithms. Despite their similar overall accuracy (apart from logistic regression), our analysis highlights a trade-off that will be necessary between specificity and sensitivity of these models for binary risk stratification. Specificity and negative predictive values were consistently high across all methods, due to the low prevalence of FH in the general population. However, numbers identified as high probability of FH, sensitivity and positive predictive values varied between approaches.

For instance, we found a deep learning model would identify ~10% of the population as probable FH, generating a very high case load for clinicians to screen, review and test. This would have the highest sensitivity (i.e. proportion of patients with actual FH identified) but the detection rate would be poor (low positive predictive value). Conversely, ensemble learning would identify only 0.73% of the population as probable FH requiring clinical review. Although this has lower sensitivity, it would be more efficient in having a higher detection rate (higher positive predictive value).

For example, in an average sized UK primary care practice of 8800 patients, an estimated 30% (n = 2640) of individuals would have had a registered cholesterol measured21. Using a deep learning model would identify 264 probable FH needing clinical assessment and testing. This strategy would yield the maximum absolute number of FH cases identified but would also be the most resource intensive. Conversely, ensemble learning would minimise the number of probable FH to <20 patients for clinical assessment. As the ensemble learning algorithm has the greatest positive likelihood ratio whilst maintaining a significant negative likelihood ratio, this may arguably be the most viable ML model to implement for FH case-finding in primary care practice, given workload and resource implications. Following a model-based approach to case-finding for potential FH, the patient would require a detailed clinical assessment and confirmatory diagnosis by genetic testing to identify a pathogenic mutation. Hence, the extent of false-positive results would have significant resource implications.

This study further highlights interesting and significant variations in the clinical variables identified by the different ML models used. The ML-based logistic regression only consisted of three variables (total cholesterol, LDL-cholesterol and potency of statin prescribed) which is similar to the initial triage of primary care electronic health records recommended by English NICE guideline recommendations through identification of elevated cholesterols alone to systematically identify those with possible FH22. Other models also identified potential negative indicators of FH – those with elevated triglycerides and secondary causes of raised cholesterol. In FH patients, serum triglycerides are usually not elevated23. Raised triglycerides appeared as an important negative feature in both random forest and gradient boosting models. Deep learning identified several secondary causes which were strong negative indicators of FH, including liver disease, chronic kidney disease, hypothyroidism and diabetes. These factors are supported by guidelines recommending excluding these secondary causes prior to establishing a possible FH diagnosis22. The deep learning model also identified tendon xanthomata as an important clinical feature suggesting a definite FH diagnosis, in line with established SB and DLCN criteria2,9. Given its poor recognition in primary care, in any standard modelling, this would have been very unlikely to have the statistical power to identify FH given this clinical feature is only present in <0.01% of the total cohort population.

This research offers a number of strengths. Our study evaluates a range of different ML models for detection of FH in primary care and has done so using not only conventional AUC but also the sensitivity, specificity, positive predictive value and negative predictive values. We employed a large sample size of over four million patients, embracing 6% of the entire UK population, enhancing generalisability of the findings. In particular, this work has also assessed the clinical value of these ML algorithms by exploring diagnostic test accuracy metrics, seldom reported for prediction models using machine-learning.

The current UK study and recent study in the US17 confirm that ML approaches are viable to use in EHR systems and can significantly enhance detection FH. This offers major opportunities to increase diagnosis of FH and to prevent premature heart disease and early deaths. Moreover, while replication of ML methods can be questioned24, the use of different datasets in the UK and US, with consistency between their analysis by different study teams is now available, supporting the generalisability of these ML approaches. In this regard, we have also made fully available, in GitHub, the codes for our models to assist with replication, validation and implementation.

However, we acknowledge several study limitations, in common with other research using large health care databases. These include lack of formal adjudication of diagnoses, information bias and potential bias due to missing data. Missing data could potentially introduce bias in the effect estimates of the prediction models as well as a reducing power. However, we used imputation methods for variables which were sufficiently missing-at-random and a very large sample size to mitigate these effects. The specific coding of FH recorded in UK general practice records will include patients identified with phenotypic FH, who may or may not have been confirmed by genetic testing. A recognised issue in EHRs is that some patients FH could potentially be misclassified, have not yet been identified, or might not have had cholesterol assessed.

Future research should validate and replicate our ML models in other large clinical datasets in other populations. Secondly, further evaluation of the feasibility and acceptability of machine-learning applications in clinical practice is needed. The computational capacity of health care systems continues to evolve; and electronic health records are increasingly moving to cloud-based servers with data centralisation. This presents exciting opportunities to exploit machine-learning as a realistic option to detect uncommon conditions of major health importance, such as familial hypercholesterolaemia.

Methods

Study design and data source

The study cohort was obtained from the Clinical Practice Research Datalink (CPRD). CPRD contains anonymised electronic medical records from 836 general practices with over 11 million research-usable patients25 and is representative of the UK general population26. Over 5.2 million of these patient records are currently of active research-quality, making CPRD one of the most widely used real-world data sources for healthcare research. Information routinely collected from primary care practices, as part of the database, include demographics, lifestyle, diagnoses, prescriptions, diagnostic tests, referrals to specialists and secondary care and death status. Secondary care activity is incorporated in the primary care records through hospital discharge letters from hospital or referral notes from specialists.

Study population

A record of cholesterol level is essential to establish a diagnosis of FH hence, all patients included in the study had at least a single record of total cholesterol measurement between the baseline date of 1 January 1999 or the earliest date the CPRD primary care practice started contributing data to the database after 1 January 1999 and the end date of 25 June 2019 or the latest date the CPRD primary care practice finished contributing data prior to 25 June 2019. Where follow-up was not completed by a patient, the end date was specified as the date of death, transfer out of practice, or final practice visit. Patients with a diagnosis of FH, had the date of the diagnosis specified as their end date. Patients aged 16 years and younger were excluded as the cholesterol level thresholds for the diagnosis and treatment of FH vary when compared to adults22. Patients with a FH diagnosis prior to study entry date (1 January 1999) or with a prior diagnosis of other inherited lipid disorders were excluded.

Clinical features

Clinical features incorporated into all machine-learning models are documented in Box 1. These were derived from known associations between these features and having FH from previous literature, in recommended diagnostic criteria, previously developed algorithms, or expert clinical opinion. We included a range of clinical features which could either increase or decrease the likelihood of FH.

Identifying patients with possible FH using Simon-Broome criteria is based on the following variables: total cholesterol, LDL-cholesterol, family history of MI, and family history of raised cholesterol3. Where a patients had both LDL and total cholesterol levels recorded, the LDL-cholesterol was prioritised, given its importance in recommended diagnostic criteria. For patients with multiple cholesterol levels recorded, the highest cholesterol value at any point between 1 January 1999 and 25 June 2019 was used. Each patient’s recorded triglyceride record at the time of each cholesterol measurement was extracted – raised levels are a negative predictor of FH23. We assessed for outliers (cholesterol and triglyceride levels ≤0 mmol/L or >5 positive standard deviations [SD] from the mean) and data entry errors. A prior history of CHD <60 years may also lead to a higher likelihood of being diagnosed with FH27.

Family histories of MI and of raised cholesterol were included as likely diagnostic variables22. while not specifically included in previous criteria/guidelines, family history of FH was also examined. Family history variables were dichotomised to either having a family history or not. In the event that family history was not evaluated, we assumed that there was no family history. Additional information was not available to further categorise family history to identify the relative affected and age at diagnosis for the condition28.

Although current diagnostic criteria use untreated cholesterol levels to evaluate probability of FH27, patients with elevated cholesterol levels may be receiving lipid-lowering therapy. Consequently, the prescribing and potency of lipid-lowering therapy were included as variables of interest. Cholesterol level was considered treated when the most recent prescription of lipid-lowering therapy ended within 30-days or overlapped with the date of the cholesterol measurement. A 30-day washout period was used to account for any residual effects of the lipid-lowering drugs when the drug treatment had been stopped29. Statin potency was classified using the most recent recommendations for statin intensity in the clinical guidance of the UK National Institute of Health and Care Excellence (NICE), which is based on a previous meta-analysis30.

Secondary causes of hypercholesterolaemia are currently recommended as negative predictors of FH in clinical guidelines22. The following important secondary conditions were, therefore, included in our assessment: liver disease (defined as, fatty liver disease, cirrhosis, chronic liver failure and alcoholic liver disease), diabetes mellitus (type I and type II), hypothyroidism (acquired and congenital), kidney disease (defined as, chronic kidney disease, renal impairment and acute renal failure) and nephrotic syndrome.

Outcome

The primary outcome was a documented incident diagnosis of FH in the patient records during the specified study period. FH is explicitly coded using the internationally recognised Read coding system in UK primary electronic health records (EHRs). This diagnostic code is entered into primary care electronic records after lipid specialist assessment, based on clinical phenotype, and/or by genetic test. To ensure temporality between predictors and the outcome, the diagnosis of FH must have occurred after the predictor variables.

Machine-learning algorithms/models

The total study cohort was randomly split into a ‘training’ cohort (75% of the study cohort) in which the FH algorithms were derived and a ‘validation’ cohort (remaining 25% of the cohort) in which the algorithms were applied and tested. The data split was computer-generated using a uniform distribution to generate random numbers in STATA. The five commonly used algorithms were used – logistic regression31, random forest32, gradient boosting machines33, deep-learning neural networks34 and ensemble learning35. Ensemble learning model was a combination of the four (4) other ML algorithms. Using the library package h2o (http://www.h2o.ai) in R Studio, the risk algorithms were developed in the training cohort and applied to the validation cohort. A grid search was used to determine the hyper parameters for each model and 10-fold cross-validations was done to determine the values for the best performance using the training cohort (Supplementary Methods 1).

Statistical analysis

Descriptive characteristics for the study population are reported as numbers with percentages or mean with standard deviation (SD) for categorical and continuous variables, respectively. The level of missing values ranged between 2.4% for systolic blood pressure to 23.3% for body mass index (BMI) (Supplementary Methods 2). To estimate missing values for BMI, LDL-C levels, triglyceride levels, systolic and diastolic blood pressures, multiple imputation by chained equations was used to generate 10 imputed datasets using all the other available patient variables36. The imputed datasets were pooled into a single dataset using Rubin’s rule37.

Harrell’s c-statistic, a measure of the total area under the receiver operating characteristic curve (AUC), was calculated using the validation cohorts to determine the predictive accuracy of the models developed in the training cohort. A jack-knife procedure was used to estimate the standard errors and 95% confidence intervals for the c-statistic estimates38. AUC is a global indicator of a test’s ability to determine whether or not a specific condition is present39. AUC value lies between 0.5 and 1.0–0.5 indicates a poor classifier and 1.0 indicates an excellent classifier. Calibration, the degree of similarity between observed and predicted probability of sub-optimal response, was assessed by a calibration plot in groups across the risk spectrum as recommended in TRIPOD guidelines40. Sensitivity, specificity, positive predictive value, and negative predictive value were calculated using a probability threshold of >1 in 250 (0.004)4 to reflect the expected prevalence of FH in the general population. Stata 16 MP4 version was used for statistical analyses to assess model performance.

Informed consent/IRB statement

Ethical approval for this study was obtained from the Independent Scientific Advisory Committee (ISAC) – study protocol number 19_083R. De-identified (anonymised) patient data was obtained from CPRD hence this study was exempt from obtaining informed consent from patients.

Reporting summary

Further information on research design is available in the Nature Research Reporting Summary linked to this article.