1 Introduction

Multiple Sclerosis (MS) is an inflammatory, demyelinating disease of the central nervous system which commonly affects young adults, with no currently known cure [1]. Magnetic Resonance Imaging (MRI) has been used to diagnose and monitor disease activity and progression, as one of the hallmarks of the disease includes the presence of lesions which are visible in MRI. The number of new or enlarged T2 lesions is a marker of MS activity and the volume of lesions is often used to quantify accumulated “disease burden” [2]. For relapsing remitting MS (RRMS), these measurements have become essential in the evaluation of new treatments through clinical trials; therapy success is often measured through reduction in the number of new lesions. Hence, predicting future lesion activity in MRI could lead to a better understanding of disease worsening, and also to evaluating treatment efficacy in clinical trials. However, automatic prediction is challenging as the MRI of patients with MS presents wide variability across the population, with varying number, sizes and shapes of lesions throughout the brain. The effects of these variable lesion characteristics have a totally unknown effect on patients’ outcomes, making this context a perfect candidate for automatic data mining and machine learning techniques. The longitudinal course of RRMS is also highly variable across the population, resulting in new lesions that can appear and disappear, grow or remain stable over time, for reasons that are not well understood. As a result, a unified static and dynamic model across the population is difficult to develop. While the detection of Gadolinium-enhancing lesions has been shown to be a good indicator that a patient’s disease is currently active, administering Gadolinium has important side effects for patients. Several automatic prediction methods predict the conversion of patients with preliminary symptoms to MS rather than predict dynamics of patients known to have the disease, using logistic regression on a number of clinical indicators [3] and more recently, deep learning methods [4, 5]. Other efforts have predicted long-term clinical effects [6].

In this work, we develop a fully automatic, probabilistic machine learning framework to model the variability of lesions in the multi-modal MRI of patients with RRMS with the objectives of: (1) automatic identification of lesion types across the population, (2) probabilistic prediction of new lesion activity in patients two years in the future based only on baseline multi-modal MRI and (3) automatic identification of responders to treatment using lesion activity prediction learned for untreated and treated groups. Leveraging the success of the Bag-of-Words model in performing unsupervised categorization in the field of computer vision [7], we develop a novel unsupervised Bag-of-Lesions (BoL) model for brain image representation in the context of MS. The method first clusters previously labelled lesions based on a variety of image-based features (e.g. textures, prior tissue atlas). This leads to a codebook for lesion types. Lesions are represented probabilistically over codewords, and patients are represented as a “Bag-of-Lesions”, based on probabilistic lesion codeword histograms. This permits the automatic unsupervised grouping of images through histogram clustering. Experiments on a proprietary dataset of 1048 patients, acquired during a large, multi-center, multi-scanner clinical trial, show that the BoL representation at baseline, combined with a random forest classifier, can be used to accurately predict future patient lesion activity two years in the future, where activity is defined as the presence of new or enlarged T2 lesions. In 50-fold cross-validation, our results compare favourably to Support Vector Machines (SVM) and Nearest Neighbour classifiers, as well as a simpler Naive Bayesian classifier based on counts of lesions of different sizes. We also use this framework to automatically identify responders in two different treated groups of patients, with sensitivity of 82% and 84% and specificity of 92% and 94% respectively.

2 Proposed Method

Each RRMS patient presents at baseline with a set of multi-channel MRI, \(\varvec{I}\), and a set of L coarsely labelled lesions obtained automatically through an algorithm (e.g. [8]) or manually. Our first objective is to model the variability of lesions, and develop a robust, data-driven categorization of lesions into a finite set of types. In order to obtain such a representation, lesions are first divided into coarse size bins. Each lesion is then described by a set of vector-valued, intensity-based features \(f_x\). In this work, we use four different kinds of features: RIFT [10] and Local Binary Pattern (LBP) [11] at varying window sizes to encode the texture of the lesion and surrounding tissues, a probabilistic healthy tissue class prior to encode tissue context (represented by a mean and variance of healthy tissue prior probabilities from an atlas over the voxels labelled as lesion), and intensity features (mean and variance of the intensity of the lesion voxels). Other features can be added as desired. Lesion features are binned according to size groups, and modelled using a Gaussian mixture model (GMM), whose components have full covariance matrixFootnote 1. The mixture is learned in standard fashion using Expectation Maximization (EM). Bayesian Information Criterion (BIC) is used to determine the number of mixture components, \(n_x\). We refer to the components of these GMMs, denoted \(f_{x,j}, j=1\dots n_x\), as feature-types.

For lesion \(L_i\), let \(f_{x}(\varvec{I}, L_i), x=1, \dots 4\) denote the features extracted for this lesion, and let \(c(x,j,i)=P(f_{x,j}|f_{x}(\varvec{I}, L_i)), j=1\dots n_x\). We construct the Cartesian product of the feature types (which has \(\prod _{x=1}^4 n_x\) elements). We consider each of these elements a lesion type. For each element \((j_1, \dots j_N)\), the product \(c(x,j_1 ,i)\cdot c(x,j_N ,i)\) represents the codeword of lesion i corresponding to feature vector x. The use of this product encodes a conditional independence assumption: feature types are considered conditionally independent given the lesion. We then collect all codewords for all lesions. Finally, a patient’s representation is a probabilistic histogram of the lesion-types present in their brain scans, referred to as a Bag of Lesions representation (by analogy to the Bag of Words representation used in text and image processing). An overview of our framework can be found in Fig. 1.

Fig. 1.
figure 1

(a) Learning the Bag of Lesions from Training Data. Lesions are first separated by size. Features (e.g. RIFT) are extracted from each lesion in the database. Each feature is modelled as a separate GMM, with each component referred to as a feature-type. Each lesion codeword is the combination of feature-types. (b) Representing a new patient. The lesion codeword is determined for all patient lesions. The patient is represented by a probabilistic histogram of lesion-types.

As patients are represented as a distribution of lesion-types, groups of similar patients can also be found by automatic clustering using EM. The optimal number of groups can be selected automatically using the Bayesian Information Criterion (BIC). We compute the likelihood that a new test patient is part of a group based on their BoL by computing the Mahalanobis distance to each group. In this way we automatically learn patterns of lesion presentation across the population.

2.1 Activity Prediction

The appearance of new lesions or enlargement of existing lesions can be used as a biomarker for focal inflammatory activity, which is associated with relapses in RRMS. We seek a probabilistic prediction of future activity, based on the baseline BoL representation, \(P(A=1|BoL)\). We train a random forest classifier to predict MS activity based on the BoL representation P(A|BoL) with different sets of lesion-types. The lesion-types are progressively eliminated using a backward elimination method, removing 20% of the least informative remaining types (as determined by the Gini impurity across all nodes of all trees) at each iteration and evaluating prediction accuracy on a retrained random forest [9]. The lesion-types that result in the highest prediction accuracy are preserved in the final model. The final prediction is computed by averaging the activity probability predicted by each tree. Because the dataset is imbalanced, with many fewer patients being inactive, the training error weights the two types of misclassification differently, accounting for the proportion of examples in each class.

2.2 Identifying Responders to Treatment

Ground truth information regarding which patients in a treatment group have definitively responded to treatment is rarely available. In this work, responders to treatment are defined as patients predicted, with high confidence, to have new lesions or lesion growth two years from baseline if not treated, but instead had no lesion activity. This can act as a proxy for ground truth, based on the assumption that treatment must have halted the activity of the disease. To achieve this goal, we fit activity prediction models for the untreated and treated populations separately. To identify whether a new patient is a responder to a drug, we compute the patient’s probability of future activity, \(P_{untr}(A=1|BoL)\) using the "untreated" model from the Bag-of-Lesion representation computed from the baseline MRI, and the probability \(P_{treat}(A=0|BoL)\) using the model computed from treated patients. A patient is considered a responder if these probabilities exceed thresholds \(\alpha \) and \(\beta \) respectively, essentially stating that the two models disagree with high confidence.

3 Experiments and Results

In order to validate the framework for characterizing lesion types and patient groups, for predicting future lesion activity and for classifying responders to treatment, we conducted experiments using a large, proprietary dataset of real MS patient brain images from a multi-centre, multi-scanner clinical trial. The data contained 1048 RRMS patients, each with 4 MR image sequences available: T1, T2, PD, and FLAIR. Each volume was at a resolution of 1 mm\(\,\times \,\) 1 mm\(\,\times \,\)3 mm. Pre-processing included brain extraction [12], bias field inhomogeneity correction using N3 [13], Nyul image intensity normalization, and registration of all images to MNI-space. Included with the clinical trial dataset were: (1) T2 lesion label masks for each patient at baseline, (2) New disease activity labels for each patient, defined as the presence of any new or enlarging T2 lesions 24 months from baseline. The T2 lesion masks provided were obtained through a semi-manual process whereby a trained expert reader corrected an in-house automated segmentation result. The new and enlarged T2 lesion masks provided were obtained through expert validation of an automatic longitudinal MS lesion segmentation framework [14]. Patients were treated in a double-blind study with either a placebo or one of two drugs, divided as follows: 259 Untreated (placebo), 280 Drug A, 259 Drug B. The trial did not achieve its primary endpoint (due to insufficient evidence of effectiveness across the entire cohort). However, there was a clear trend towards a treatment response for some patients in the trial, rendering the task of automatically finding responders at once challenging and compelling for this dataset.

A total of 98,106 lesions were used to build a comprehensive lesion codebook. According to clinical protocol, lesions of less than three voxels were omitted. Lesions were subdivided into four coarse size groups: tiny (3–10 voxels), small (11–25 voxels), medium (26–100 voxels), and large (101+ voxels). For each lesion, RIFT features were extracted at three scales (3 mm, 6 mm, 9 mm) with eight bins for gradients in two dimensions. LBP features were obtained by binarizing intensity differences around central voxels at fixed radii (1 mm, 2 mm, 3 mm). As such, RIFT and LBP captured the textures of the lesions and their surrounding tissues (e.g. see Fig. 1), overcoming any minor under/over-segmentation in the lesion labelling. Probabilistic healthy tissue context was obtained through registration to MNI-space, leading to prior probabilities of white matter (WM), gray matter (GM), cerebral spinal fluid (CSF), partial volume (PV) (at the interface of GM and CSF). The mean and the variance of the probabilities at the lesion voxels are taken as the features. Intensity was encoded as the mean and variance of the intensity of each of the image modalities across each lesion. Examples of lesions drawn from several types are shown in Fig. 2. Patients clustered automatically based on their BoL representation were found to exhibit similar lesion distributions (See Fig. 3).

Fig. 2.
figure 2

Examples of lesions from three lesion types. Top: Lesions (red) over FLAIR images. Bottom: Zoomed in. (a) Lesions between ventricles. (b) Cortical lesions. (c) Large peri-ventricular lesions.

Fig. 3.
figure 3

Patients automatically grouped based on similar lesion histogram distributions. (Top) Patients with a few very small lesions mostly in the white matter. (Bottom) Patients with large lesions near lateral ventricles.

3.1 Disease Activity Prediction

Each of the MS patient multi-channel volumes in the clinical trial dataset were considered as baseline acquisitions from which BoL representations were inferred. The emergence of new and enlarging lesions 2 years after baseline was additional information provided for all but 250 patients (as they did not complete the study). These markers were used as indicators of future disease activity. A random forest classifier was used for optimal lesion-type selection and for the prediction of P(A|BoL). 50-fold cross validation experiments were performed on the untreated (placebo) dataset. Figure 4 shows the maximum likelihood random forest results in comparison with several classifiers: (1) Nearest Neighbour (NN), where activity was assigned based on the closest training case, as defined by three different distance metrics (Euclidean, Mahalanobis, and \(\chi ^2\)).Footnote 2 the (2) Support Vector Machines (SVM)Footnote 3, using linear, RBF and \(\chi ^2\) kernels. (3) Naive Bayes classifier, based solely on the number of lesions in each size bin, in order to explore whether this was the dominating factor in our framework. Both NN and SVM were based on the BoL representations. The random forest classifier (\(\alpha =0.5\)) performed favourably against the other methods overall, with mean values of 70% sensitivity and 58% specificity (for \(A=1\)). All methods based on the BoL representation outperformed the Naive Bayesian method. When considering only activity predictions with high probabilities (above \(\alpha = 0.8\)), sensitivity increased to 94%. However, the specificity dropped substantially, partially because there were only 14 inactive cases at that threshold.

Fig. 4.
figure 4

Comparison of disease activity prediction results based on a 50-fold cross-validation on the placebo dataset: 3 Nearest Neighbour (NN) methods, 3 Support Vector Machines (SVM), proposed Random Forest classifier (\(\alpha =0.5\)), and Naive Bayesian classifier trained only on the number of lesions of each size.

Two treated groups of patients were available for training and testing a separate activity prediction model under the effects of treatment after baseline. The results for 50-fold cross validation using the random forest classifier on the treated cases, the sensitivities increased to close to 1 for both treatments at high probability thresholds (\(\beta =0.8\)), with specificities at around 0.5. Interestingly, when patients in the treated groups were tested using the untreated model, this led to a decrease in specificity by 7% for both treatments (\(\alpha =0.5\)), due to an increase in false positive predictions. This indicates the effectiveness of the treated patient prediction model and, for some patients, the treatment seems to be effectively halting the formation of new or enlarged lesions.

3.2 Responder Identification

We define “responders” (\(R=1\)) as those patients in the treated group whose baseline scans lead to high probability (\(\alpha = 0.8\)) in predicted activity two years later under the untreated model, where they have a known outcome of inactive (i.e. no new or enlarged T2 lesions). At this probability threshold, the sensitivity for detecting activity in the untreated patients is at 98%. Using this definition, there were 25 responders in the Drug A treatment arm and 24 responders in the Drug B treatment arm. Table 1 shows the results of responder classification for two different treatments in the clinical trial dataset, when the probability thresholds are set to high (\(\alpha =\beta =0.8\)). The results indicate that the treatment can be reliably predicted to work on a small subset of patients, even though the overall objectives of the clinical trial were not met.

Table 1. Responder prediction results for treatments A & B, probability thresholds \(\alpha =\beta =0.8\).

4 Conclusion

In this paper, we introduce a fully automatic, probabilistic framework for the prediction of future MS disease activity in patients based on a new Bag-of-Lesions representation of their scans at baseline. We develop a probabilistic codebook of distinct lesion types across the population, and show how those lesion types can be used to separate patients into groups that present similar lesion patterns. Additional clinical validation is required to determine how this translates into discoveries of natural patterns of MS disease variability. The activity prediction is then used to automatically identify potential responders to two treatments in the context of a real, large, multi-centre, multi-scanner clinical trial for RRMS patients, showing sensitivities of 82% and 84% and specificities of 92% and 94% respectively. This suggests the possibility of a tool for personalized treatment for new MS patients, and for assessing treatment efficacy.