Participants
Participants were assessed as part of the ESTRA, STRATIFY, and IMAGEN studies. These were sibling studies that were designed with matched assessments and protocols to enable comparability.
Case-control studies
Our clinical sample included participants with AN and BN, recruited as part of the ESTRA study. All the participants were female, aged 18-25 years, and recruited at the London study site. Healthy controls (HC) for the ED patients were selected from the IMAGEN study (see below) at the third follow-up (~23 years old); they were female, recruited in London, and screened negative for all psychiatric diagnoses based on the Mini International Neuropsychiatric Interview 53. Participants with MDD and AUD, and the corresponding HCs, were aged 18-25 years and recruited as part of the STRATIFY study from three study sites: London and Southampton, UK, and Berlin, Germany. Written consent was obtained from all the participants (see Supplementary Methods for more details).
Longitudinal cohort study
This population sample was derived from IMAGEN, a longitudinal neuroimaging and genetics study of adolescents recruited from eight study sites in Europe 54. Written assent was obtained from all the participants and written consent from their parents/guardians. The data used in the longitudinal prediction analysis were acquired at ages 14, 16, and 19 years.
Eating disorder symptoms were assessed by self-report of concerns over one’s shape, weight, and eating, and disordered eating behaviors (binge-eating, purging, and dieting) based on the Development and Wellbeing Assessment (DAWBA) 55. ‘Developers’ were defined as individuals who did not report any ED symptom at age 14, but reported one or more symptoms at ages 16 or 19. They were compared to controls, who remained asymptomatic across the three ages. Developers of depression and harmful drinking were defined as scoring low on depressive symptoms and harmful drinking 56, respectively, at age 14, but high at ages 16/19. Controls for these groups scored low on depressive symptoms and harmful drinking, respectively, across the three ages (for more details, see Supplementary Methods). Data collected at age 14 were used to predict whether participants developed each mental health symptom at ages 16 or 19.
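The grouping rule for ED symptoms can be illustrated with a minimal Python sketch. The function name and the use of symptom counts as inputs are hypothetical, and the sketch assumes complete data at all three ages; the handling of missing follow-ups is not described here.

```python
def classify_trajectory(n_symptoms_14, n_symptoms_16, n_symptoms_19):
    """Return 'developer', 'control', or None (falls in neither group)."""
    if n_symptoms_14 > 0:
        # Symptomatic at baseline: neither a developer nor a control.
        return None
    if n_symptoms_16 > 0 or n_symptoms_19 > 0:
        # Asymptomatic at 14, one or more symptoms at 16 or 19.
        return "developer"
    # Asymptomatic at all three ages.
    return "control"


groups = [classify_trajectory(*counts) for counts in [(0, 0, 2), (0, 0, 0), (1, 0, 0)]]
```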
Measures
Demographic information, including sex assigned at birth, age, and ethnicity, was acquired from self-report. Our analyses combined a wide range of data domains comprising cognition, environment, personality, psychopathology, substance use, and BMI (for full details, see Supplementary Methods). Full lists of variables and percentages of missing data are provided in Supplementary Tables 2-4.
Data Analysis
A logistic regression model with combined L1 and L2 regularization (Elastic Net) was used, implemented in the glmnet (version 4.1-7) package 57 in R (version 4.2.1). Model performance was assessed by the area under the receiver operating characteristic curve (AUC-ROC) and the area under the precision-recall curve (AUC-PR). These performance metrics were derived from a nested cross-validation (CV) procedure. The whole dataset was randomly split into 10 subsets, with the ratio of cases to controls kept the same across subsets. One subset (10% of the whole dataset) was reserved for model testing, and the remaining data (90% of the whole dataset) were used for model training.
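The stratified split can be sketched as follows. This is an illustrative Python sketch rather than the R code used in the study; the function name `stratified_folds` and the round-robin assignment within each class are assumptions.

```python
import random
from collections import defaultdict


def stratified_folds(labels, n_folds=10, seed=0):
    """Assign each sample a fold index so that every fold has a similar class ratio."""
    rng = random.Random(seed)
    by_class = defaultdict(list)
    for idx, label in enumerate(labels):
        by_class[label].append(idx)
    fold_of = [None] * len(labels)
    for indices in by_class.values():
        rng.shuffle(indices)  # random order within each class
        for position, idx in enumerate(indices):
            fold_of[idx] = position % n_folds  # deal samples out round-robin
    return fold_of


labels = [1] * 30 + [0] * 70  # toy data: 30 cases, 70 controls
fold_of = stratified_folds(labels)
for k in range(10):
    test_idx = [i for i, f in enumerate(fold_of) if f == k]   # ~10% for testing
    train_idx = [i for i, f in enumerate(fold_of) if f != k]  # ~90% for training
```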
The data preparation procedure included imputation of missing values, partialling out the effect of confounding variables, standardization, and dealing with extreme values. First, missing data were imputed in the training and testing data separately, by using a Random Forest-based method implemented in the missForest package 58 in R (version 4.2.1). Second, the effects of confounding variables were partialled out from the training and testing data separately, following the procedure recommended by Snoek et al. (2019) 59. For each feature in the training data, a linear regression model was fitted with the confounding variables as the only predictors. Residuals from this model were used for model training. This linear regression model was directly applied to the testing data (without model refitting) to obtain residuals of each feature. This approach ensured that no information from the testing data was utilized in the model training process. Third, each feature in the training data was standardized into z-scores. The mean and standard deviation of each feature in the training data were used to standardize the testing data. Last, to mitigate the impact of extreme values on model fitting, the z-scores smaller than -3 or larger than 3 were recoded as -3 and 3, respectively.
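The confound adjustment, standardization, and recoding of extreme values can be sketched in Python for one feature and one confound. This is a simplified illustration, not the study code: the function names are hypothetical, a single confound replaces the multi-confound regression, imputation is omitted, and the sample standard deviation (n-1 denominator) is an assumption. The key point is that every parameter is estimated on the training data and then reused unchanged on the testing data.

```python
from statistics import mean, stdev


def fit_confound_model(feature, confound):
    """Least-squares fit of feature ~ intercept + slope * confound (training data only)."""
    mx, my = mean(confound), mean(feature)
    sxx = sum((x - mx) ** 2 for x in confound)
    sxy = sum((x - mx) * (y - my) for x, y in zip(confound, feature))
    slope = sxy / sxx
    return my - slope * mx, slope


def residualize(feature, confound, intercept, slope):
    """Subtract the confound-predicted value from each observation."""
    return [y - (intercept + slope * x) for x, y in zip(confound, feature)]


def prepare(train_feat, train_conf, test_feat, test_conf, limit=3.0):
    # Regression fitted on the training data, applied to both sets without refitting.
    intercept, slope = fit_confound_model(train_feat, train_conf)
    train_res = residualize(train_feat, train_conf, intercept, slope)
    test_res = residualize(test_feat, test_conf, intercept, slope)
    # Mean and SD of the training residuals standardize both sets.
    mu, sd = mean(train_res), stdev(train_res)

    def to_clipped_z(r):
        # z-scores beyond +/- limit are recoded to the limit.
        return max(-limit, min(limit, (r - mu) / sd))

    return [to_clipped_z(r) for r in train_res], [to_clipped_z(r) for r in test_res]


train_z, test_z = prepare([2.0, 4.1, 5.9, 8.2, 9.8], [1, 2, 3, 4, 5], [100.0, 6.0], [3, 3])
```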
A five-fold inner CV was nested in the training data to select the optimal hyper-parameters (alpha and lambda) for the Elastic Net model, with the goal of maximizing AUC-ROC on the training data. By using the optimal hyper-parameters, an Elastic Net model was fitted on all the training data (90% of the whole dataset). The classification performance of the constructed model was assessed using the remaining subset (10% of the whole dataset). This process was repeated until each subset had been used as the testing data. If the model involved a single predictor of BMI, an ordinary logistic regression model was used instead. The same 10-fold CV procedure was employed as above, but the nested CV and hyper-parameter tuning procedures were omitted.
The above CV procedure was repeated 10 times to mitigate the effect of data splitting. The model’s performance metrics were averaged across the 10 repetitions. The ROC curves were plotted with the ROCR package (https://CRAN.R-project.org/package=ROCR).
Sample weighting in the prediction models: In building the prediction models using the longitudinal IMAGEN data, the model training and testing procedures were the same as those used for the clinical sample, except that sample weights were provided for the model training to deal with group size imbalances between the developers and controls (Supplementary Table 1). The weight of a sample was inversely proportional to the group size, thus assigning higher weights to the developers than the controls.
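A minimal Python sketch of inverse-group-size weighting is given below. The text specifies only that weights were inversely proportional to group size; the normalization used here (weights sum to the sample size) and the function name are assumptions made for illustration.

```python
from collections import Counter


def inverse_frequency_weights(labels):
    """Weight each sample by n / (n_groups * size of its group)."""
    counts = Counter(labels)
    n, n_groups = len(labels), len(counts)
    return [n / (n_groups * counts[y]) for y in labels]


labels = [1] * 10 + [0] * 90  # toy data: 10 developers, 90 controls
weights = inverse_frequency_weights(labels)
# Each developer receives a larger weight than each control,
# and the two groups contribute equal total weight.
```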
Bootstrapping confidence intervals: Confidence intervals (CIs) of the performance metrics (AUC-ROC and AUC-PR) were obtained by bootstrapping. For each repeat of the CV, the model’s output was resampled with replacement, and the performance metrics were computed from the resampled values. This procedure was repeated 2000 times for each repeat of the CV, forming a bootstrap distribution. The lower and upper bounds of the CI were taken as the 2.5th and 97.5th percentiles of the bootstrap distribution and averaged across the 10 repetitions.
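A percentile bootstrap for AUC-ROC within a single CV repeat can be sketched in Python as follows. This is an illustration under stated assumptions: the function names are hypothetical, AUC-ROC is computed by pairwise comparison, resamples containing only one class are redrawn, and percentiles use a simple nearest-rank rule. None of these details is specified in the text.

```python
import random


def auc_roc(labels, scores):
    """AUC-ROC as the share of case-control pairs ranked correctly (ties count 0.5)."""
    pos = [s for y, s in zip(labels, scores) if y == 1]
    neg = [s for y, s in zip(labels, scores) if y == 0]
    wins = sum((p > q) + 0.5 * (p == q) for p in pos for q in neg)
    return wins / (len(pos) * len(neg))


def bootstrap_ci(labels, scores, n_boot=2000, seed=0):
    """Percentile CI from resampling (label, score) pairs with replacement."""
    rng = random.Random(seed)
    n = len(labels)
    stats = []
    while len(stats) < n_boot:
        idx = [rng.randrange(n) for _ in range(n)]
        ys = [labels[i] for i in idx]
        if len(set(ys)) < 2:
            continue  # assumption: redraw if a resample lacks one of the classes
        stats.append(auc_roc(ys, [scores[i] for i in idx]))
    stats.sort()
    lower = stats[round(0.025 * n_boot)]      # nearest-rank 2.5th percentile
    upper = stats[round(0.975 * n_boot) - 1]  # nearest-rank 97.5th percentile
    return lower, upper


labels = [0] * 20 + [1] * 20
scores = [i / 40 for i in range(20)] + [(i + 10) / 40 for i in range(20)]
lower, upper = bootstrap_ci(labels, scores)
```

In the procedure described above, the resulting bounds would then be averaged across the 10 CV repetitions.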
Permutation test: P values for the model’s performance were obtained from permutation tests. We randomly shuffled the group membership of samples before submitting the data to the same CV procedure described above, and derived the performance metrics. This procedure was repeated 5000 times to derive null distributions of AUC-ROC and AUC-PR. To calculate the P value, we counted how many values in the null distribution exceeded the actual performance and divided this count by the number of permutations.
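The permutation logic can be sketched in Python as follows. In this illustration, `metric_fn` is a hypothetical stand-in for the full CV pipeline (it takes a label vector and returns a performance metric); the toy example scores fixed predictions rather than retraining a model, which is a simplification.

```python
import random


def auc_roc(labels, scores):
    """AUC-ROC as the share of case-control pairs ranked correctly (ties count 0.5)."""
    pos = [s for y, s in zip(labels, scores) if y == 1]
    neg = [s for y, s in zip(labels, scores) if y == 0]
    wins = sum((p > q) + 0.5 * (p == q) for p in pos for q in neg)
    return wins / (len(pos) * len(neg))


def permutation_p_value(labels, metric_fn, n_perm=5000, seed=0):
    """Share of label-shuffled runs whose metric exceeds the observed metric."""
    rng = random.Random(seed)
    observed = metric_fn(labels)
    exceed = 0
    for _ in range(n_perm):
        shuffled = list(labels)
        rng.shuffle(shuffled)  # break the link between labels and data
        if metric_fn(shuffled) > observed:
            exceed += 1
    return observed, exceed / n_perm


# Toy stand-in for the CV pipeline: fixed scores evaluated against the labels.
scores = [i / 20 for i in range(20)]
labels = [0] * 10 + [1] * 10
observed, p_value = permutation_p_value(labels, lambda y: auc_roc(y, scores), n_perm=200)
```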
Classification of ED patients: First, we included all the variables (n=47) in building the classification model and considered age as a confounding variable. Given that BMI is a diagnostic criterion for AN, a second model was run after excluding BMI. We further built models that involved each data domain alone to test whether they could distinguish the ED groups. A total of 18 models were built (Figure 1). A variable was identified as a reliable contributor to the Elastic Net model if it had a non-zero coefficient in at least 90% of all the CV folds 60. The coefficient of each feature was summarized across all the CV folds by its median value, which represents the feature’s importance.
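The selection of reliable contributors can be sketched in Python as follows. The function name, variable names, and toy coefficients are hypothetical, and the sketch summarizes each coefficient by its median across folds, which is one reading of the description above.

```python
from statistics import median


def reliable_contributors(coefs_per_fold, names, threshold=0.9):
    """Keep features with a non-zero coefficient in >= threshold of folds.

    coefs_per_fold: one list of coefficients per CV fold, ordered as in `names`.
    Returns {feature name: median coefficient across folds}.
    """
    n_folds = len(coefs_per_fold)
    selected = {}
    for j, name in enumerate(names):
        values = [fold[j] for fold in coefs_per_fold]
        nonzero_share = sum(1 for v in values if v != 0) / n_folds
        if nonzero_share >= threshold:
            selected[name] = median(values)
    return selected


names = ["var_a", "var_b", "var_c"]  # hypothetical variable names
# Toy coefficients from 10 folds: var_a always non-zero, var_b in 9 folds, var_c in 5.
folds = [[0.5, 0.2 if k < 9 else 0.0, 0.1 if k < 5 else 0.0] for k in range(10)]
importance = reliable_contributors(folds, names)
```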
Classification of MDD and AUD patients: We excluded 14 variables with excessive missing data, such as measures of cognitive performance and traumatic experiences (as indicated in Supplementary Table 3). Furthermore, we excluded depressive and emotional symptoms from MDD vs. HC analysis, and excluded the harmful drinking scale from AUD vs. HC analysis. Considering a sex bias in the HC group (59% females, Supplementary Table 1), sex was considered as a confounding variable, in addition to age and study site.
Transdiagnostic models: We tested whether the models derived from the AN vs. HC and BN vs. HC analyses could also distinguish MDD and AUD from HC, and vice versa. Because BMI is a diagnostic criterion of AN but is unrelated to MDD and AUD, it was excluded from the transdiagnostic analysis. In addition, variables with excessive missing data in the AUD and MDD samples, such as measures of cognitive performance and traumatic experiences (as indicated in Supplementary Table 3), were also excluded. To derive a single model for AN vs. HC classification, we used the median values of the hyper-parameters (alpha and lambda) across all the CV folds to train a model using the entire AN and HC data. We tested whether this model could distinguish MDD and AUD from HC, respectively. Similarly, we trained a model for BN vs. HC classification and tested it on the MDD and AUD samples. Conversely, we tested whether the models developed for MDD vs. HC and AUD vs. HC classifications could distinguish ED patients from healthy controls. The same data preparation procedure as in the classification analyses was used, including data imputation, adjustment for confounding variables, standardization, and handling of extreme values.
Predicting the development of future mental health symptoms: The top 10 reliable variables identified from the classification analyses in the clinical EDs, MDD, and AUD samples were pooled together and used for the prediction analysis in the longitudinal population sample. Data collected at age 14 were used to predict the development of ED symptoms, depressive symptoms, and harmful drinking at ages 16/19 years. In addition, we built a second model by adding known risk factors of EDs, including sex, BMI, and pubertal development scale to investigate whether they could improve prediction accuracy.