Introduction

As a urologist or oncologist it is not rare to encounter a 77-year-old prostate cancer patient treated with androgen deprivation therapy, whose PSA rises on consecutive measurements despite castrate serum levels of testosterone and who develops new bone lesions on imaging studies. According to the European Association of Urology guidelines, this patient meets the criteria for metastatic castration-resistant prostate cancer (CRPC) (Cornford et al. 2017). The patient has a medical history of chronic obstructive pulmonary disease (COPD) and diabetes mellitus. He has no prostate cancer related symptoms, but due to his comorbidities he has a performance status of 1. We have previously shown that, based on these factors, Dutch clinicians are more likely to opt for watchful waiting or hormone-targeted drugs instead of docetaxel/prednisolone or radium-223 (Angst et al. 2019). In the absence of clear recommendations for a preferred treatment option and sequence, clinicians may benefit from the support of a clinical prediction model that can predict survival per treatment option based on patients’ clinical baseline characteristics.

Recently, a significant amount of work has been published on risk prediction in prostate cancer (Kearns and Lin 2017). Risk prediction models have evolved into indispensable tools that aid clinicians in making evidence-based decisions. In the field of urology, clinical risk prediction models exist for different disease states of prostate cancer, predicting for example the probability of biopsy-detectable aggressive prostate cancer, lymph node involvement, or overall survival (OS) during first-line chemotherapy. Nevertheless, despite existing general guidelines for the reporting of multivariable prediction models for individual prognosis or diagnosis (Collins et al. 2015), the process of developing and validating such models is still shrouded in mystery for most clinicians. The aim of this paper is to provide a comprehensive, detailed guide to help clinicians understand the (sometimes complex) steps in developing a useful prediction model for CRPC patients, based on a real-life case and using a retrospective dataset of real-world patients treated for CRPC. We aim both to assist the clinician in understanding the development of a prediction model and to support the clinician in recognizing common shortcomings in existing prediction models. Of course, it is highly important to involve a statistician in the preparatory phase as well as in constructing and validating the model.

Methodology

Research question and statistical model choice

First and foremost, one needs to formulate a clear research question. Additionally, before delving into the process of developing a prediction model, one should first check whether a similar model already exists; if so, it may be more appropriate to update or adapt that model. In this study we aimed to develop a model to predict mortality in patients with CRPC treated in first line with either abiraterone, enzalutamide, docetaxel, watchful waiting (defined as best supportive care using systemic treatment without proven life-prolonging benefits, such as anti-androgens and ketoconazole) or radium-223, with the goal of using the model for treatment decision-making and incorporating it into a decision aid. An appropriate model should be chosen based on the type of outcome, because different models are needed for different types of data (Supplementary Table 1). In our case we are dealing with survival data. Hence, a semi-parametric Cox proportional hazards model was chosen. It should be noted that for very long-term predictions a parametric model (e.g. Weibull) may be preferred, since these provide more stable predictions at the end of follow-up (Carroll 2003). A summary of all considerations in model development is presented in Table 1; a brief R sketch of both model types follows Table 1.

Table 1 Summary of considerations in prediction modelling.
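To make the choice of model family concrete, a minimal sketch in R with the survival package is shown below, contrasting the semi-parametric Cox model with a parametric Weibull alternative. The data frame `dat` and the variable names (`time`, `status`, `age`, `who_ps`) are hypothetical placeholders, not the CAPRI data.

```r
# Minimal sketch: semi-parametric Cox model versus a parametric Weibull model
# for survival data (dat, time, status and the predictors are placeholders).
library(survival)

# Cox proportional hazards model: no assumption about the baseline hazard
cox_fit <- coxph(Surv(time, status) ~ age + who_ps, data = dat)

# Parametric Weibull model: assumes a Weibull baseline hazard, which gives
# more stable extrapolation towards the end of follow-up
weibull_fit <- survreg(Surv(time, status) ~ age + who_ps, data = dat,
                       dist = "weibull")

summary(cox_fit)
summary(weibull_fit)
```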

Data inspection

In our case we used a retrospective registry called the CAstration-resistant Prostate cancer RegIstry (CAPRI), which is an investigator-initiated, observational, multi-center registry in 20 hospitals in the Netherlands. In the subset of the data we used, with first-line treatment only, 3588 patients and 2335 deaths were recorded (Westgeest et al. 2018). The patients were treated according to clinical practice with a variety of first-line treatments including abiraterone, enzalutamide, docetaxel, or watchful waiting. Radium-223 was excluded from the analyses because only ten patients received it as first-line treatment in this dataset. Baseline variables are presented in Table 2. Furthermore, this dataset contained sixteen potential predictors. In general, it is recommended to have at least ten events (deaths in our case) per predictor investigated. If a predictor has multiple categories, you need 10 × (number of categories − 1) events for that predictor.

Table 2 Baseline characteristics of patients with CRPC treated with abiraterone, enzalutamide, docetaxel or watchful waiting

Missing values and coding of predictors

In an ideal world the predictors in a dataset are all clinically relevant (Cornford et al. 2017), comprehensible (Angst et al. 2019), measured reliably (Kearns and Lin 2017), without missing data (Collins et al. 2015), and not correlated with each other (Carroll 2003). Unfortunately, datasets fulfilling all these criteria are the exception rather than the rule. Regarding the first three criteria it is recommended that clinicians’ perspectives are taken into account. Several authors recommend performing systematic reviews in order to find suitable candidate predictors (Steyerberg 2008). In the sections below we address the latter two criteria (missing values and correlation between predictors). Additionally, we give special attention to how to handle continuous predictors (e.g. age and hemoglobin).

Missing values

Various approaches have been described to handle missing data, each with its own limitations and benefits (Papageorgiou et al. 2018). In our case we used multiple imputation with the MICE package in R (Buuren and Groothuis-Oudshoorn 2011). “Imputation” in the context of missing baseline variables basically means that missing values are predicted from the other baseline variables and/or the outcome. As with almost every statistical procedure, certain assumptions must be made about the missing data; in particular, the missing-data mechanism (missing completely at random, missing at random, missing not at random) should be addressed (Papageorgiou et al. 2018). Following the latest consensus we incorporated the outcome in the imputation model using the Nelson-Aalen estimator, a non-parametric estimator of the cumulative hazard function (Moons et al. 2006). With multiple imputation one creates multiple datasets in which the missing values are imputed, resulting in multiple completed datasets. The formal rules state that the analyses need to be conducted on each dataset separately and that the obtained estimates must be pooled thereafter (Rubin 2004). Nevertheless, in the case of few missing values some authors have proposed to develop the model on one dataset and test it on the other datasets (Steyerberg 2008). Controversy remains about the cut-off for how much missing data is “too much” (Papageorgiou et al. 2018).
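As a rough sketch of this approach, assuming a data frame `dat` with follow-up time `time`, event indicator `status` and a few baseline predictors (all names illustrative), multiple imputation in R with the outcome included via the Nelson-Aalen estimator could look as follows.

```r
# Sketch: multiple imputation with mice, including the outcome via the
# cumulative hazard (Nelson-Aalen estimator). Variable names are placeholders.
library(mice)
library(survival)

# Add the cumulative hazard as an auxiliary variable for the imputation model
dat$na_hazard <- nelsonaalen(dat, timevar = time, statusvar = status)

# Create m = 10 completed datasets
imp <- mice(dat, m = 10, seed = 2023, printFlag = FALSE)

# Fit the analysis model on every completed dataset and pool the estimates
# according to Rubin's rules
fits <- with(imp, coxph(Surv(time, status) ~ age + hemoglobin + who_ps))
summary(pool(fits))
```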

Correlation between predictors

In medicine many variables roughly describe the same phenomenon and are therefore correlated with each other. One should avoid putting highly correlated variables in the same model. Firstly, a prediction model should be as simple as possible, and incorporating near-identical variables is redundant. Secondly, correlated variables can give rise to a phenomenon called “multicollinearity”, characterized by implausibly high or low estimates or inflated standard errors (Multicollinearity 2020). Therefore, it is advisable to investigate the correlations between all predictors by means of Pearson’s r or Spearman’s rho, and high correlations should be addressed. This can be done either by excluding one of the two correlated variables or by recoding them into one new variable. In our case the variables “pain” and “opioid use” were correlated (Spearman’s rho: 0.36). Clinically this makes perfect sense, as opioids are prescribed when a patient is in pain. We recoded opioid use and pain into several candidate variables, and a combined variable consisting of three categories proved to be the best predictor (Supplementary Table 3).
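A minimal sketch of checking such a correlation and recoding the two variables into one combined predictor is shown below; the variable names and the exact three categories are illustrative, and the categorization actually used in the study is given in Supplementary Table 3.

```r
# Sketch: check the correlation between two predictors and recode them into
# one combined categorical variable (names and categories are illustrative).
cor(dat$pain, dat$opioid_use, method = "spearman",
    use = "pairwise.complete.obs")

# Combined predictor with three categories
dat$pain_opioid <- with(dat, ifelse(opioid_use == 1, "opioid use",
                             ifelse(pain == 1, "pain, no opioids",
                                    "no pain, no opioids")))
dat$pain_opioid <- factor(dat$pain_opioid,
                          levels = c("no pain, no opioids",
                                     "pain, no opioids", "opioid use"))
table(dat$pain_opioid)
```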

Continuous predictors

Continuous predictors are variables that can take an infinite number of values (e.g. age and lactate dehydrogenase) and therefore contain a lot of information. Hence, simply dichotomizing continuous predictors comes with substantial information loss (Royston et al. 2006). Nevertheless, incorporating a continuous predictor into a statistical model comes with the assumption that the predictor is associated with the outcome in a linear way. While a linear term may approximate some non-linear associations reasonably well, this is not always the case (Fig. 1). Thus, we recommend first exploring the association of the continuous predictor with the outcome in a univariable model. To find the best fitting form of the association between a continuous predictor and the outcome one can use transformation (such as logarithmic transformation), categorization, splines, or fractional polynomials, as explained in Table 3 and Fig. 2 (Steyerberg 2008); a brief R sketch is given below Fig. 2.

Fig. 1

Example of a continuous outcome (y-axis) and continuous predictor (x-axis). As shown, when the relation is assumed to be linear, the model (red line) does not fit the observed data (black dots) well

Table 3 Performance of the model when flexibility is added to the assumed linear association with the outcome
Fig. 2

Example of relaxing the assumed linear association (red line) between a continuous outcome and a predictor. This can be done either with natural splines (green line) or with fractional polynomials (FP) (blue line). With splines the data are divided into separate sections, and each section has its own estimate of the line. With fractional polynomials the relationship is described by multiple polynomial terms, which can produce a very flexible line
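The sketch below illustrates how these options can be compared for a single continuous predictor in a univariable Cox model, using a hypothetical predictor `ldh` and the AIC as a rough yardstick for model fit; the fractional polynomial fit is left as a comment because it requires an additional package (mfp).

```r
# Sketch: exploring the functional form of a continuous predictor (ldh is a
# placeholder) in a univariable Cox model and comparing the fits by AIC.
library(survival)
library(splines)

fit_linear <- coxph(Surv(time, status) ~ ldh, data = dat)
fit_log    <- coxph(Surv(time, status) ~ log(ldh), data = dat)        # transformation
fit_spline <- coxph(Surv(time, status) ~ ns(ldh, df = 3), data = dat) # natural spline

AIC(fit_linear, fit_log, fit_spline)  # lower AIC indicates a better fit

# Fractional polynomials can be fitted with, for example, the mfp package:
# library(mfp)
# mfp(Surv(time, status) ~ fp(ldh), family = cox, data = dat)
```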

Interaction

Let us consider two predictors. Separately, they have no association with the outcome; however, when they are both present, a significant association with the outcome is observed (or vice versa). Such a phenomenon is called “interaction” (Steyerberg 2008). Interactions of this kind are quite common in genetic studies: only when both gene X and gene Y are turned on will a certain chemical reaction start; when either one of the genes is turned off, the reaction will not begin. Naturally, such interactions can also be present in epidemiological studies. However, especially when one considers many predictors, constructing interaction terms can be an overwhelming task: there are so many possibilities that one cannot see the wood for the trees. In this case it is advisable to turn to the clinicians and a priori select a number of possible interactions that make clinical sense. In our study, we tested the interaction between “watchful waiting” and “opioid use or pain”, which turned out to be highly significant. This corresponds to clinical practice: a patient under watchful waiting with opioid use or pain indicates a palliative setting, in which the patient is expected to die soon. Hence, watchful waiting and opioid use together have a stronger association with the outcome than watchful waiting and opioid use separately.
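A minimal sketch of testing such a pre-specified interaction in R is shown below, using placeholder names for the treatment variable and the combined pain/opioid predictor; the likelihood ratio test compares the model with and without the interaction term.

```r
# Sketch: testing an a priori selected interaction between first-line
# treatment and the combined pain/opioid predictor (names are placeholders).
library(survival)

fit_main <- coxph(Surv(time, status) ~ treatment + pain_opioid, data = dat)
fit_int  <- coxph(Surv(time, status) ~ treatment * pain_opioid, data = dat)

# Likelihood ratio test: does the interaction term improve the model?
anova(fit_main, fit_int)
```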

Model specification

As mentioned earlier, the first step of predictor selection should be taken together with subject-specific experts. Predictor selection is arguably the hardest part of model building (Ratner 2010). Multiple methods exist for selecting from the a priori chosen set of predictors. The most widely used methods include stepwise selection and best subset regression, which have been described previously (Miller 2002; Harrell 2015). In our case we had many variables due to the interaction terms and the non-linear continuous predictors. One always wants the most parsimonious model and does not want to exceed the one-predictor-per-ten-events rule of thumb. Therefore, it is reasonable to drop predictors that do not add much to the performance of the model. We employed a lesser known selection method, Least Absolute Shrinkage and Selection Operator (LASSO) regression (Tibshirani 1996). This is a penalized machine learning technique that shrinks the estimates of unimportant predictors to zero (Supplementary Fig. 1). An estimate of zero equals no association with the outcome, and therefore such a predictor is excluded. This method can also handle correlation between predictors to some extent, as the algorithm will “see” that, in the case of high correlation between predictors A and B, shrinking predictor B to zero will hardly influence the performance of the model (Tibshirani 1996). Nevertheless, an algorithm cannot judge which predictor is more comprehensible or measured more reliably. Therefore, one should never skip the step of looking for correlations between predictors. A package to run LASSO regression in R is the “glmnet” package (Friedman et al. 2010), with an elaborate vignette on how to code this in R (Hastie and Qian 2016). However, in our case we had multiple polynomials describing the relation of a continuous predictor with the outcome (see “Continuous predictors”). One wants to include either all of these polynomials in the model or none at all. Hence, we needed to “tell” the LASSO algorithm that they belong together as a group. The statistical R package “grpreg” has implemented such a grouped penalty (Breheny and Huang 2015).

We opted for a two-step approach. First, we ran the LASSO regression; thereafter, we incorporated all predictors with non-zero coefficients into a Cox model. The final model is shown in Table 4, and a sketch of the two-step approach in R is given below Table 4.

Table 4 Final Cox model for predicting mortality in patients with CRPC
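The sketch below illustrates the two-step approach with a plain (ungrouped) LASSO via glmnet; all predictor names are placeholders, and in the actual analysis the grouped penalty from grpreg was used so that all polynomial terms of one predictor are kept or dropped together.

```r
# Sketch of the two-step approach: (1) cross-validated LASSO selection for the
# Cox family with glmnet, (2) refit an unpenalized Cox model with the selected
# predictors. Variable names are illustrative placeholders.
library(glmnet)
library(survival)

# Numeric candidate predictors only, for simplicity; factors would be
# dummy-coded with model.matrix() and mapped back to their factors afterwards
x <- as.matrix(dat[, c("age", "hemoglobin", "ldh", "alkaline_phosphatase")])
y <- cbind(time = dat$time, status = dat$status)

# Step 1: LASSO with the penalty chosen by cross-validation
cv_fit <- cv.glmnet(x, y, family = "cox", alpha = 1)
beta   <- as.matrix(coef(cv_fit, s = "lambda.min"))
selected <- rownames(beta)[beta[, 1] != 0]

# Step 2: unpenalized Cox model with the non-zero predictors
form <- as.formula(paste("Surv(time, status) ~",
                         paste(selected, collapse = " + ")))
final_fit <- coxph(form, data = dat)
summary(final_fit)
```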

Assessment of assumptions

Every statistical model comes with certain assumptions (Freedman 2009). If these assumptions are not met, the model is less valid or even invalid (Freedman 2009). Each model family has its own specific assumptions. A key assumption of the Cox model we used is the proportional hazards (PH) assumption. This basically means that the ratio of hazards (the output of a Cox model) is constant over time. Two approaches are commonly used to test whether this assumption is violated: plotting Kaplan–Meier curves or plotting the residuals. Both methods are implemented in most statistical programs or packages. The Schoenfeld residuals should be used to test the PH assumption. Schoenfeld residuals represent the difference between the observed covariate value and the expected value given the risk set at that time. If one draws an average line through the residuals, this line should be straight (Schoenfeld 1982). A formal test based on these residuals has also been developed (Grambsch and Therneau 1994). In our model certain variables did not meet the PH assumption. Fortunately, this is not the end of the world. One can turn to parametric models, since some of these do not rely on the PH assumption, although this means starting the modelling process all over again. Another approach is to use an extension of the Cox model called time-varying coefficients, not to be confused with time-varying covariates (Hastie and Tibshirani 1993; Fisher and Lin 1999). Time-varying coefficients can be applied if the effect of a predictor is not constant over time, in other words if the PH assumption is violated. In our case the effect of the predictor WHO performance status was not constant over time. As shown in the Schoenfeld residual plot, the effect of performance status was stronger in the first months than later in follow-up (Fig. 3a). Therefore, we decided to use a step-wise time-varying coefficient function: we estimated a separate hazard ratio for the first ten months and for the period thereafter. As presented in Fig. 3b, the PH assumption was no longer violated. A vignette on implementing time-varying coefficients in R has been published previously (Therneau et al. 2013).
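The sketch below illustrates both steps in R, following the approach described in the survival package vignette (Therneau et al. 2013): checking the PH assumption with cox.zph and, when it is violated, splitting follow-up at ten months so that the coefficient of performance status can differ between the two periods. Variable names are placeholders and time is assumed to be recorded in months.

```r
# Sketch: checking the proportional hazards assumption with scaled Schoenfeld
# residuals and, when it is violated, estimating a separate coefficient for
# WHO performance status before and after ten months (names are placeholders).
library(survival)

fit <- coxph(Surv(time, status) ~ who_ps + age + pain_opioid, data = dat)
zph <- cox.zph(fit)
print(zph)      # Grambsch-Therneau test per predictor and globally
plot(zph[1])    # scaled Schoenfeld residuals for who_ps over time

# Split each patient's follow-up at 10 months so that who_ps can get a
# different coefficient in each period (step-wise time-varying coefficient)
dat_split <- survSplit(Surv(time, status) ~ ., data = dat,
                       cut = 10, episode = "period")

fit_tv <- coxph(Surv(tstart, time, status) ~ who_ps:strata(period) +
                  age + pain_opioid, data = dat_split)
summary(fit_tv)
```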

Fig. 3

a Example of a Schoenfeld residuals plot used to check the proportional hazards assumption. When the effect of WHO performance status is assumed constant over time (blue line in part a), the assumption is violated: especially in the first ten months the blue line deviates from the red line. In part b we have two coefficients for WHO performance status, one for the first ten months and one for the period thereafter. The proportional hazards assumption is no longer violated

Model performance

Two related terms are important in model performance: discrimination and calibration (Alba et al. 2017). Discrimination describes how well a model distinguishes a high-risk patient from a low-risk patient or, in other words: does the model estimate higher probabilities for patients who have an event than for patients who do not? Discrimination is measured with the c-statistic or with ROC curves for binary outcomes (Pencina and D'Agostino 2015); for survival data, Harrell’s c-statistic is commonly used. In our study, the overall c-statistic of the model was 0.74, which indicates good discrimination. Calibration, or goodness-of-fit, conveys to what extent the predicted probabilities agree with the observed probabilities. For example, suppose a high-risk patient has a sevenfold higher predicted probability of an event than a low-risk patient, with predicted risks of 7% versus 1%, whereas the observed probabilities are 70% versus 10%. In this case discrimination is satisfactory, as the model discriminates well between a high-risk and a low-risk patient. Nevertheless, calibration is far off; the observed risks are not even close to the predicted risks. Several methods to assess calibration exist and have been described previously (Calster et al. 2016).
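A rough sketch of assessing discrimination and calibration in R is shown below, using the survival and rms packages; the variable names, the 12-month horizon and the number of bootstrap samples are illustrative choices and not taken from the study.

```r
# Sketch: discrimination (Harrell's c-statistic) and a calibration plot for a
# Cox model (placeholder variable names; the 12-month horizon is illustrative).
library(survival)
library(rms)

fit <- coxph(Surv(time, status) ~ who_ps + age + pain_opioid, data = dat)
concordance(fit)   # c-statistic with standard error

# Calibration: refit with rms::cph and compare predicted with observed
# survival at 12 months, using bootstrapping to correct for optimism
fit_rms <- cph(Surv(time, status) ~ who_ps + age + pain_opioid,
               data = dat, x = TRUE, y = TRUE, surv = TRUE, time.inc = 12)
cal <- calibrate(fit_rms, u = 12, B = 200)
plot(cal)
```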

Model validation

Testing model performance on the dataset on which the model was developed is usually overly optimistic (Babyak 2004). After all, the model “learned” its estimates from the correlations/associations in that specific dataset. To quantify this possibly overly optimistic performance, a statistical model should be validated, preferably both internally and externally. During internal validation the model is validated with the original dataset. Historically, this has been done by randomly splitting the original dataset into a training dataset and a validation dataset. Nevertheless, this approach is not recommended, because it inherently implies that one cannot train the model on all patients. In small datasets the amount of data is reduced, possibly leading to overfitting, and in very large datasets random splitting results in two very comparable datasets. Therefore, we recommend employing either bootstrapping techniques or k-fold cross-validation. With k-fold cross-validation one uses the whole dataset to train the model and thereafter splits the dataset into k groups (usually ten). One group serves as the validation set and the others as the training set. This process is repeated k times, each time with a different group as the validation set (Supplementary Fig. 3) (Harrell 2015). With bootstrapping the model is also trained on the whole dataset, and thereafter random samples are drawn with replacement from the original data; a patient can thus be drawn multiple times, and the drawn sample is usually of the same size as the original dataset (Supplementary Fig. 4) (Efron and Tibshirani 1994).
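A sketch of bootstrap-based internal validation with the rms package is shown below, yielding an optimism-corrected c-statistic; the variable names and the choice of 200 bootstrap samples are illustrative.

```r
# Sketch: internal validation by bootstrapping, giving an optimism-corrected
# estimate of discrimination (placeholder names; B = 200 is illustrative).
library(rms)
library(survival)

fit_rms <- cph(Surv(time, status) ~ who_ps + age + pain_opioid,
               data = dat, x = TRUE, y = TRUE)
val <- validate(fit_rms, method = "boot", B = 200)
val
# validate(fit_rms, method = "crossvalidation", B = 10)  # 10-fold alternative

# Optimism-corrected c-statistic, derived from Somers' Dxy
0.5 * (val["Dxy", "index.corrected"] + 1)
```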

Notwithstanding, the ultimate test for a model is external validation: does the performance of the model remain satisfactory when it is tested on a different dataset? Such a dataset could, for example, be derived from another center or geographical area. A model that calibrates poorly on external data can be recalibrated, whereas a model that discriminates poorly cannot; in that case a new model is required (Su et al. 2018).

There is another highly important form of validity called “face validity”. Here, again, the expert clinician comes into play, as there are no formal ways to test face validity. Face validity concerns whether the test or model measures what it is supposed to measure. For instance, face validity may be impaired when key predictors are not included in the model because they were not collected, or when the dataset is old and no longer represents clinical practice. In our case, the patients in the CAPRI dataset were included from January 1, 2010 until December 31, 2017. Our aim was to develop a model to predict mortality in patients with CRPC treated with either abiraterone, enzalutamide, docetaxel, or watchful waiting in first line, to support adequate decision-making. However, due to the retrospective nature of this dataset, strong selection bias is present for treatment, especially since abiraterone and enzalutamide were not available as first-line treatment in the Netherlands from 2010 to 2013. Patients who would have been eligible for those treatments therefore received watchful waiting or docetaxel in this period. Of course, a multivariable model will adjust for this to some extent, and one can include intervention year as a covariate to assess and adjust for this phenomenon. However, for future predictions, including intervention year as a covariate implies that a certain trend will continue in the future, which does not make (clinical) sense. Hence, this model failed the face validity check.

Conclusion

Risk prediction is becoming increasingly important in medical practice. In this article, we discussed several steps in developing a prediction model, including handling missing data, coding and selecting predictors using LASSO, testing model assumptions, and assessing performance and validation, using an example from uro-oncology. Prediction model development is not a trivial task, and the input of both the clinician and the statistician is essential. This article may be used to bridge the gap between the two disciplines.