Opinion Article

Response heterogeneity: Challenges for personalised medicine and big data approaches in psychiatry and chronic pain

[version 1; peer review: 2 approved, 1 approved with reservations]
PUBLISHED 15 Jan 2018

Abstract

Response rates to available treatments for psychological and chronic pain disorders are poor, and there is a considerable burden of suffering and disability for patients, who often cycle through several rounds of ineffective treatment. As individuals presenting to the clinic with symptoms of these disorders are likely to be heterogeneous, there is considerable interest in the possibility that different constellations of signs could be used to identify subgroups of patients that might preferentially benefit from particular kinds of treatment. To this end, there has been a recent focus on the application of machine learning methods to attempt to identify sets of predictor variables (demographic, genetic, etc.) that could be used to target individuals towards treatments that are more likely to work for them in the first instance.
Importantly, the training of such models generally relies on datasets where groups of individual predictor variables are labelled with a binary outcome category – usually ‘responder’ or ‘non-responder’ (to a particular treatment). However, as previously highlighted in other areas of medicine, there is a basic statistical problem in classifying individuals as ‘responding’ to a particular treatment on the basis of data from conventional randomised controlled trials. Specifically, insufficient information on the partition of variance components in individual symptom changes means that it is inappropriate to consider data from the active treatment arm alone in this way. This may be particularly problematic in the case of psychiatric and chronic pain symptom data, where both within-subject variability and measurement error are likely to be high.
Here, we outline some possible solutions to this problem in terms of dataset design and machine learning methodology, and conclude that it is important to carefully consider the kind of inferences that particular training data are able to afford, especially in arenas where the potential clinical benefit is so large.

Keywords

personalised medicine, big data, machine learning, psychiatry, chronic pain, individual differences, response heterogeneity, clinical trial design

Introduction

The proportion of patients who respond to available treatments for psychological and chronic pain disorders is often low. For example, in major depression, roughly 40% of individuals experience a ‘clinically significant’ response (a decrease in symptom severity score above some minimum value) over the course of treatment (e.g. 1,2). Similarly, a recent meta-analysis of available pharmacotherapies for neuropathic pain found that estimates of the ‘number needed to treat’ (the number of patients who need to be treated for one additional patient to benefit) for effective treatments ranged from 4 to 10, indicating poor response rates3. For patients, this often means a lengthy process of cycling through different treatment options, in a sequence that may be significantly influenced by non-clinical concerns (e.g. relative drug cost, therapist availability, local health authority guidelines), and where there may be inadequate data on the safety and effectiveness of switching regimes (e.g. 4). For psychological conditions, this process can be particularly lengthy, given the significant period of time before common pharmacological treatments are expected to take effect (e.g. 4–6 weeks before a particular drug treatment can be concluded to be ineffective4). Together, this results in a substantial burden of suffering and disability for individuals with a diagnosis of these disorders before (if) an effective treatment option can be found.
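For readers less familiar with this metric, the NNT is simply the reciprocal of the absolute difference in response rates between the treatment and control arms, so even the best figure quoted above implies a fairly modest absolute benefit (a rough illustration of the definition, not a calculation from the cited trials):

```latex
% Number needed to treat, in terms of the proportion responding in each arm:
\mathrm{NNT} \;=\; \frac{1}{p_{\mathrm{treatment}} - p_{\mathrm{control}}}
% e.g. an NNT of 4 corresponds to a 25 percentage-point absolute improvement
% over control, and an NNT of 10 to only a 10 percentage-point improvement.
```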

It is generally assumed that differential response to a particular treatment across individuals can be at least partially explained by patient heterogeneity within a certain diagnostic category – i.e. that individuals who present to the clinic with similar sets of symptoms may have different underlying pathologies. This seems a particularly reasonable assumption in the case of both mental health disorders and chronic pain, as diagnosis is often made purely on the basis of self-reported symptom checklists, and our lack of insight into the aetiology of these conditions means we have little opportunity for differential diagnosis. Indeed, in the case of psychiatric disorders such as depression, diagnosis can often be made on the basis of directly contradictory symptom reports (e.g. sleeping too much vs sleeping too little), and there may be many different ways to meet diagnostic criteria (e.g. 227 possible symptom combinations for major depressive disorder, according to DSM-IV5). Similarly, even patients with a diagnosis of a particular pain condition are likely to have distinct patterns of nervous system damage, involving multiple pathways (e.g. 6), and definitions of chronic pain itself can vary dramatically across research groups and clinical centres7.

Even if we lack insight into pathological mechanisms, being able to use some kind of predictive method to direct individuals towards treatments that are more likely to be effective for them could be valuable: even a small increase in the resulting response rate could have a large effect on disease burden for individual patients. There has therefore recently been great interest in doing just this for psychiatric data, via the application of supervised learning methods to large datasets of individual clinical predictors and treatment response data (see 8 for an excellent recent review of potential clinical advantages and best methodological practice in this area).

The current gold standard approach is firstly to define a set of features and targets for various machine learning algorithms to train on. In this context, features are individual difference variables that may potentially relate to future treatment outcome (clinical, demographic, physiological, genetic, behavioural, etc. information). The target variable (that the algorithm must learn to predict) is usually a binary category label, such as ‘responder’ or ‘non-responder’ (whether or not an individual has exhibited symptom improvement above some threshold level, following a particular course of treatment). Various supervised learning algorithms can then be trained on this labelled dataset (ideally using a rigorous cross-validated approach), and assessed in terms of their predictive accuracy on independent ‘unseen’ (during model training) data. Finally, the best model can be brought forward to a randomised controlled trial framework, where treatment allocation by current clinical guidelines could be compared to algorithm-assisted treatment assignment8.
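As a concrete (and deliberately simplified) illustration of this workflow, the sketch below uses scikit-learn with entirely synthetic data; the feature set, model choice and sample sizes are our assumptions, not those of any study cited here.

```python
# A minimal sketch of the 'gold standard' workflow described above: train a
# classifier to predict a binary 'responder' label from baseline features,
# estimate out-of-sample accuracy with cross-validation, then test on a
# held-out external dataset. All data here are synthetic and illustrative.
import numpy as np
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import cross_val_score
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

rng = np.random.default_rng(42)

# Hypothetical training data: rows = patients, columns = baseline predictors
# (demographic, clinical, genetic, ...); y = 1 if 'responder', 0 otherwise.
X_train = rng.normal(size=(300, 20))
y_train = rng.integers(0, 2, size=300)

model = make_pipeline(StandardScaler(),
                      RandomForestClassifier(n_estimators=200, random_state=0))

# Cross-validated estimate of predictive accuracy on data unseen during training.
cv_acc = cross_val_score(model, X_train, y_train, cv=5, scoring="balanced_accuracy")
print("cross-validated balanced accuracy:", cv_acc.mean())

# 'External validation' on an independent dataset (here, more synthetic data).
X_ext = rng.normal(size=(100, 20))
y_ext = rng.integers(0, 2, size=100)
model.fit(X_train, y_train)
print("external accuracy:", model.score(X_ext, y_ext))
```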

This approach is highly attractive, as the potential clinical gains from even a small increase in likelihood of treatment response for a particular individual are large. However, across the field of medicine in general, attempts to make such clinical gains via a personalised medicine approach have not often fulfilled their initial promise – with relatively few reaching the clinic (e.g. 9). Here, we explore a basic statistical issue that may limit the effectiveness of this process – i.e. the reliability of distinguishing between treatment ‘responders’ and ‘non-responders’ in the first place. We further discuss the reasons why this problem may be particularly acute in the case of available data regarding psychiatric disorders and chronic pain conditions, and some potential solutions.

The problem of response heterogeneity

The problem of properly identifying response heterogeneity, or, more simply, reliably distinguishing between responders and non-responders to a particular treatment, on the basis of randomised controlled trial (RCT) data, has previously been highlighted across various fields of medicine10–12. If not properly addressed, this constitutes an absolute limit on the effectiveness of predictive models at the level of input or training data, thereby limiting their future clinical usefulness.

The issue is best illustrated by considering the nature of data collected during RCTs, and the kind of inferences this process affords. The foundation of an RCT is that the mean effect of an intervention (e.g. active drug treatment) is derived by comparing what happened, on average, to the (randomly allocated) participants in the intervention group to what happened, on average, to participants in the control (e.g. placebo) arm. The random allocation of participants to the intervention vs control arms allows the control group to function as an illustration of what we might have expected to occur in the intervention group, had they not received the active treatment – in turn allowing us to draw conclusions about the overall (average) effects of the treatment itself12. Crucially, we can only draw this inference by direct comparison to the control arm data.

The logic of the RCT means that we cannot identify responders and non-responders by considering individuals in the intervention group alone. In other words, we cannot legitimately label an individual who received a particular active treatment as a ‘responder’ (or not) because we do not know what would have happened to that particular individual if they had been in the comparator (or placebo) arm10. This kind of information is very hard to obtain at the individual (cf. the group) level, as there is no good way to obtain a control observation. Formally, to properly infer whether a particular participant responded or didn’t respond to a particular treatment, we would require knowledge of what would have happened if a key event (treatment administration) both did and did not occur (a form of counterfactual reasoning), which is not possible in the real world11.
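In the potential-outcomes notation commonly used in the causal inference literature (our shorthand for the argument above, not notation taken from the cited papers), the quantity we would need for each patient is:

```latex
% Individual treatment effect for patient i:
%   Y_i(1) = outcome under active treatment, Y_i(0) = outcome under control.
\tau_i \;=\; Y_i(1) - Y_i(0)
% Only one of the two potential outcomes is ever observed for a given patient,
% so randomisation identifies only the average effect,
%   \mathbb{E}[Y(1)] - \mathbb{E}[Y(0)],
% and not the individual-level effects \tau_i.
```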

A particularly acute issue for psychiatric and chronic pain datasets?

Variability of change (e.g. the t2 − t1 change in symptom score) in the intervention arm is not a true estimate of variability in treatment response, because it includes components of within-subject variation and measurement error10. Even if measurement error is small (i.e. we can precisely measure the outcome variable of interest), for many medical interventions, the outcome variable will depend on a complex interplay of biological factors (e.g. time of day, stress level, etc.), and so within-subject variability will be relatively high. This means that the reliability of within-subject measurements across time points can be somewhat poor, and large variation in changes between study time points may be evident − even where there is no true individual difference in treatment response.
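A schematic variance decomposition makes this concrete (our illustration, assuming independent, additive components and comparable within-subject variation and measurement error in the two trial arms):

```latex
% Observed change score for individual i in the intervention arm:
%   \Delta_i^{T} = \mu_T + \tau_i + \varepsilon_i,
% where \tau_i is the individual deviation in true treatment response and
% \varepsilon_i bundles within-subject variation and measurement error.
\operatorname{Var}(\Delta^{T}) = \sigma_{\tau}^{2} + \sigma_{\varepsilon}^{2},
\qquad
\operatorname{Var}(\Delta^{C}) = \sigma_{\varepsilon}^{2}
\;\;\Rightarrow\;\;
\sigma_{\tau} \approx \sqrt{\operatorname{Var}(\Delta^{T}) - \operatorname{Var}(\Delta^{C})}
```

Under these assumptions, if the spread of change scores in the intervention arm is no larger than in the control arm, there is no evidence of true individual differences in treatment response at all.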

Unfortunately, for psychiatric and chronic pain symptom data, both measurement error and within-subject variation are likely to be high. Measurement error may be higher than in other areas of medicine, as the main tools used to assess clinical outcomes are patient- or clinician-completed questionnaire measures, which are relatively low-precision instruments. Further, although self-reported symptom levels are considered the gold standard outcome measure for both psychiatric disorders and chronic pain conditions13, reliability is limited by factors such as cognitive capacity and level of insight for patient-rated measures (e.g. 14), and by interviewer skill and inter-rater agreement for clinician-rated measures (e.g. 15–17). Finally, these classes of disorders represent episodic, chronically relapsing conditions, which will likely contribute to large within-subject variation, particularly at typical RCT follow-up timescales (often around 6 months to 1 year; cf. a median depressive episode duration of ~20 weeks18). If the variation in outcome due to these sources is greater than that due to any true individual differences in treatment response, it will be very hard to detect the latter under a conventional RCT framework.

A further problem in predicting true response heterogeneity is the susceptibility of symptom change data to regression to the mean and mathematical coupling artefacts19,20. Regression to the mean refers to the phenomenon whereby, if an individual is selected on the basis of having an extreme measurement value at time point one, their second measurement value will, on average, be closer to the mean of the population distribution (due to the influences of measurement error and normal within-subject variation). A corollary of this effect is that t1 severity is often a significant covariate of the change in symptom score between t1 and t2 – meaning that individuals with higher initial scores may appear to show the greatest improvement in symptom levels at follow-up, even when the true magnitude of change does not vary across individuals (see 10 for a worked example). The fact that the t1 score is used to calculate both quantities (i.e. they are mathematically coupled) further inflates this relationship (see 20). Care should therefore be taken when key predictors in response algorithms closely index t1 severity, as this may result in a poorly generalising model. Nevertheless, in previous studies of psychiatric datasets, baseline severity score is usually included among the features used to train response prediction algorithms (e.g. 21–23).
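The following minimal simulation (assumed parameter values, Python) illustrates how regression to the mean and mathematical coupling can generate a strong apparent relationship between baseline severity and improvement even when the true treatment effect is identical for every individual:

```python
# A minimal simulation (assumed parameter values) of the artefact described
# above: a constant true effect plus noisy measurements still produces an
# apparent "higher baseline severity predicts greater improvement" effect.
import numpy as np

rng = np.random.default_rng(0)
n = 10_000
true_severity = rng.normal(20, 4, n)   # stable underlying symptom level
noise_sd = 5                            # within-subject variation + measurement error
true_effect = -3                        # identical true change for every individual

t1 = true_severity + rng.normal(0, noise_sd, n)
t2 = true_severity + true_effect + rng.normal(0, noise_sd, n)
change = t2 - t1

# Despite a constant true effect, observed change correlates with baseline score,
# because t1 appears (with opposite sign) in both of the correlated quantities.
print(np.corrcoef(t1, change)[0, 1])    # substantially negative, ~ -0.55 here
```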

These factors may help explain why previous attempts to apply machine learning approaches to outcome prediction in psychiatric datasets have thus far had limited success in terms of out-of-sample (unseen data) classification. For example, a recent methodologically rigorous trial aiming to predict significant response (remission) following treatment with a particular antidepressant achieved only ~60% classification accuracy when the model was applied in external validation datasets23. However, as previously noted, tools with only modest true predictive value may still have reasonably high clinical utility compared to current best practice8; therefore this is still an approach very much worth pursuing.

Potential solutions

Clinical trial design

The problem of identifying true response heterogeneity is a problem of appropriately partitioning variance components in observed outcomes11. The ability to properly identify differential response to a particular treatment in different individuals requires replication at the level at which the differential response is claimed (i.e., that particular treatment in that particular individual). Differential treatment response (i.e. identification of patient by treatment interactions) can therefore be identified by use of repeated period cross-over designs – a form of trial where each participant receives both placebo and active treatments more than once11. However, in practice, these designs are rare, as they are likely to be impractical (prohibitively lengthy and expensive) and/or unethical. This kind of design also assumes that treatments wash out fully between administrations, which might not be reasonable for some interventions (e.g. psychological therapies)24.

Training data definition and selection

An alternative approach is to improve the way data from existing RCTs is used to train predictive models. For example, it has been suggested that the uncertainty in each individual’s ‘response’ (change in symptom score in the active treatment group) could be expressed as a confidence interval by reference to the standard deviation of the change scores in the control (placebo) group multiplied by the appropriate value from the t distribution (e.g. individual change score ± 1.96*SD of control arm changes for a 95% CI, see 24). The probability that any given individual in the intervention group is a true responder (true change score is greater than the minimum clinically significant change) can then be derived from individual CIs using a Bayesian approach10. Appropriate supervised learning algorithms could then be trained to predict (continuous) treatment response probability, as opposed to dividing individuals into binary response categories (e.g. using Gaussian process regression,25).
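A minimal sketch of this idea is given below (hypothetical data and threshold; normality and a flat prior on the true change are assumed); the resulting probabilities could then serve as continuous targets for the regression models mentioned above.

```python
# A minimal sketch of expressing each treated individual's 'response' as a
# probability rather than a binary label: the spread of change scores in the
# control arm is used as an estimate of non-treatment-related variability.
import numpy as np
from scipy import stats

def response_probability(change_treated, control_changes, mcid):
    """P(true change exceeds the minimum clinically important difference),
    assuming observed change = true change + noise, with the noise SD
    approximated by the SD of control-arm change scores (normality assumed)."""
    noise_sd = np.std(control_changes, ddof=1)
    # 'Improvement' is coded here as a positive change score.
    return stats.norm.sf(mcid, loc=np.asarray(change_treated), scale=noise_sd)

# Hypothetical example: change scores from the two arms, MCID of 5 points.
treated = np.array([12.0, 4.0, -1.0, 8.0])
control = np.random.default_rng(1).normal(2.0, 6.0, size=100)
print(response_probability(treated, control, mcid=5.0))
```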

It may also be important to think carefully about the nature of the predictors (features) included in supervised learning model training data, as those that reference initial clinical severity may be vulnerable to regression to the mean-related artefacts. Statistical methods have been proposed to correct for regression to the mean when correlating t2–t1 symptom changes with initial severity level (see 20). However, these may require additional measurements (e.g. multiple estimates of the t1 value, in order to control for effects of measurement error).

Counterfactual probabilistic modelling

When a particular experiment is not feasible, one alternative is to train models from observational (non-experimental) data that are able to make counterfactual predictions – i.e. of the outcomes that would have been observed, had we run that particular experiment. For example, Saria and colleagues have recently developed a counterfactual Gaussian process (CGP) approach to modelling clinical outcome data26. The CGP is trained on observational (non-experimental) time series data, in order to form a model of clinical outcomes under a series of treatments in continuous time. Crucially, the CGP is trained using a joint maximum likelihood objective, which parses dependencies between observed actions (e.g. treatments) and outcomes in the data. This feature allows prediction of how future trajectories (symptom levels) may change in response to different treatment interventions, and has previously been shown to successfully predict real clinical data (renal health markers following different kinds of dialysis,26,27).
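The sketch below is emphatically not the CGP of Saria and colleagues26; it is a naive Gaussian process regression on an invented toy trajectory with a treatment covariate, shown only to illustrate the 'query the model under an alternative treatment history' interface. In particular, it ignores the dependency between treatment decisions and outcomes that the joint CGP objective is specifically designed to capture.

```python
# A highly simplified illustration (not the counterfactual GP of reference 26):
# fit a standard GP to an observed symptom trajectory plus a treatment covariate,
# then query the fitted model under an alternative ('never treated') history.
import numpy as np
from sklearn.gaussian_process import GaussianProcessRegressor
from sklearn.gaussian_process.kernels import RBF, WhiteKernel

# Hypothetical training data: (time in weeks, treatment indicator) -> symptom score.
X = np.array([[0, 0], [2, 0], [4, 1], [6, 1], [8, 1], [10, 1]], dtype=float)
y = np.array([24.0, 23.0, 20.0, 17.0, 15.0, 14.0])

gp = GaussianProcessRegressor(kernel=RBF(length_scale=4.0) + WhiteKernel(1.0),
                              normalize_y=True).fit(X, y)

# Predict the same time points under the counterfactual untreated history
# (treatment indicator set to 0 throughout), with predictive uncertainty.
X_cf = X.copy()
X_cf[:, 1] = 0.0
mean_cf, sd_cf = gp.predict(X_cf, return_std=True)
print(mean_cf, sd_cf)
```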

This modelling approach requires datasets with semi-continuous measurement of the relevant clinical outcome (both pre- and post- intervention), in order to generate hypothetical treatment response traces – a kind of data that is not usually available from existing RCTs. Given sufficient attention to patient confidentiality and other ethical concerns, it may be possible to obtain appropriate training data from health service clinical records; however, frequency and consistency of symptom reporting may pose analytical problems (e.g. 27). The use of personal devices such as smartphones or other wearable technology to regularly self-record symptom levels may be a potential source of this kind of data in the future, given sufficient insight and patient compliance (e.g. 28). The CGP approach also rests on two key mathematical assumptions: that there will be a consistency of outcomes between training observations and future outcomes, given a particular treatment; and that there are no important confounding variables missing from the dataset26. It may require careful consideration as to whether these are reasonable assumptions for modelling psychological and chronic pain symptomatology.

Conclusions

The issues discussed above underline the importance of focusing on where data come from when considering strategies for personalised medicine. In particular, it is problematic to designate individual data points from a conventional RCT design as ‘responders’ or ‘non-responders’ to a particular treatment, as this in effect treats the intervention arm as a single-arm (uncontrolled) study, unadjusted for other important sources of variation in outcome. This may be particularly important when considering patients with episodic, chronically-relapsing disorders, as non-treatment-related variability is likely to be high (and symptom measurement itself is often imprecise). One solution to this problem is to use data derived from repeated cross-over design clinical trials, although in practice these can be prohibitively difficult and/or ethically problematic. It may be possible to alleviate these issues with careful model design, but this may still require changes to the way data are collected and monitored in the future in order to maximise potential clinical utility.

How to cite this article: Norbury A and Seymour B. Response heterogeneity: Challenges for personalised medicine and big data approaches in psychiatry and chronic pain [version 1; peer review: 2 approved, 1 approved with reservations]. F1000Research 2018, 7:55 (https://doi.org/10.12688/f1000research.13723.1)

Open Peer Review

Version 1 (published 15 Jan 2018)
Reviewer Report 20 Feb 2018
Ricardo Silva, The Alan Turing Institute, London, UK 
Approved
I would like to thank the authors for the informative review of the many difficulties pertinent to modeling the treatment of psychiatric disorders. My main take will be on the statistical and machine learning aspects, causal inference in particular, as I do …
  • Author Response 01 Mar 2018
    Agnes Norbury, Computational and Biological Learning Laboratory, Department of Engineering, University of Cambridge, Cambridge, CB2 1PZ, UK
    Thank you for taking the time to review this manuscript, and for providing a thoughtful meditation on individual response prediction from a causal inference perspective. We have updated the manuscript …
Reviewer Report 02 Feb 2018
William G. Hopkins, Institute for Health and Sport, Victoria University, Melbourne, VIC, Australia 
Approved with Reservations
This article represents a valuable contribution to the developing literature on quantification of individual responses to treatments and identification of patients who respond positively to treatments. Perhaps there should be some attention to the issue of identifying negative responders, since …
  • Author Response 01 Mar 2018
    Agnes Norbury, Computational and Biological Learning Laboratory, Department of Engineering, University of Cambridge, Cambridge, CB2 1PZ, UK
    Thank you for taking the time to review this manuscript. Please see below for responses to your main comments, which are addressed in the revised manuscript.
    1. We have added reference …
Reviewer Report 18 Jan 2018
Greg Atkinson,  Health and Social Care Institute, Teesside University, Middlesbrough, UK 
Philip J. Williamson, Health and Social Care Institute, Teesside University, Middlesbrough, UK 
Approved
In this manuscript, the authors discussed the concept of inter-individual differences in response to treatment interventions, particularly those focussed on psychological-related outcomes. The consideration of inter-individual responses is an important issue and the authors provide further insights previously not considered …
  • Author Response 01 Mar 2018
    Agnes Norbury, Computational and Biological Learning Laboratory, Department of Engineering, University of Cambridge, Cambridge, CB2 1PZ, UK
    Thank you for taking the time to review this manuscript. Please see below for responses to your main comments, which are addressed in the revised manuscript.
    1. We have added reference …
