Introduction

Known predictors of iron status in women include genetic factors, diet, supplement use, menopausal status and recency of menopause, recency of blood donation, and medical conditions. These predictors may together be usable as proxies for serum iron measures, which are not feasible to collect in large prospective studies, and which, when available in retrospective studies, are subject to bias from reverse causal effects.

Both dietary and nondietary factors, such as blood loss, can influence iron stores1. Dietary predictors include alcohol2 and heme iron intake from sources such as red meat3,4. Numerous nondietary factors are associated with iron levels, including menstruation5, hormone use6, physical activity7, body mass index (BMI)1,5, iron supplements1,5, calcium supplements8, aspirin1,9,10, statins11,12,13,14, polycystic ovary syndrome (PCOS)15,16,17, parity18, occult bleeding from a stomach ulcer19, previous diagnosis of iron deficiency anemia (IDA), and repeated cesarean section20.

Recent studies also report blood donor status as a strong predictor of iron status in women21,22. In a 2014 study of a Danish population5, frequent and recent blood donation showed the strongest association with low serum ferritin levels. Other factors associated with low serum ferritin levels in that study included menopause status in women, weight, age, menstruation characteristics, supplements, and diet. The study underlines the importance of blood donor status and blood loss when considering iron status, with evidence of an association between iron deficiency anemia and blood donation.

Simultaneous assessment of the many common predictors of serum iron measures could contribute to existing knowledge and provide a chance to identify additional iron-related health outcomes through development of a predictive model. Such a model could enable use of common self-reported items from surveys as predictors to assess body iron status. Once the most jointly important predictors have been identified, one could use that set of predictors via a multiple imputation model to leverage a substudy based on a small subcohort that provides blood-based measures and scale up to an analysis of the entire cohort, including outcomes from non-subcohort members. Our primary aim is to develop and validate a prediction model of serum iron measures at baseline with many iron-related predictors related to absorption, metabolism and loss of iron, in a sample from a large U.S.-based cohort of women. We hypothesized that a prediction model based on questionnaire data in a cross-sectional study could serve as a proxy for blood-based measures to assess iron status. We used R2 as our main measure to assess our hypothesis that a questionnaire-based model performs well.

Methods

Study population

The original case-cohort study for this analysis contains 6008 participants with iron measures23 from the Sister Study24, a prospective cohort of 50,884 women who, at enrollment, had not been diagnosed with breast cancer themselves but had at least one sister diagnosed with breast cancer. Measures from the randomly selected subcohort (n = 3200) served as a random sample from the Sister Study. We omitted 29 participants missing all baseline serum iron measures and, among the 3171 remaining participants (Supplementary Fig. S1), stratified the analyses by menopause status at baseline. We then modeled the sample as a whole, with menopausal status as a predictor, to estimate model performance for the entire sample.

Outcome variables for body iron levels

Our outcomes included: (1) serum iron (mcg/dL), (2) ferritin (mcg/dL), and (3) transferrin saturation (%). Serum sample handling and sample analysis have been previously described23. Briefly, a Roche Cobas 6000 Chemistry analyzer was used to measure serum iron, ferritin and unsaturated iron binding capacity (UIBC), with two levels of quality control and an inter-assay coefficient of variation not exceeding 3.3%. Transferrin saturation was calculated as the ratio of serum iron to the sum of serum iron and UIBC, multiplied by 100 to express it as a percentage.
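The transferrin saturation calculation described above can be sketched in a few lines of Python. This is only an illustration of the stated ratio; the function name and the example values are hypothetical and are not taken from the study data or the analysis code.

```python
def transferrin_saturation(serum_iron, uibc):
    """Return transferrin saturation (%) from serum iron and UIBC.

    Both inputs must be in the same units (for example mcg/dL).
    """
    total = serum_iron + uibc  # serum iron + UIBC, the denominator of the ratio
    if total <= 0:
        raise ValueError("serum iron + UIBC must be positive")
    return 100.0 * serum_iron / total


# Hypothetical values for illustration only (not study data).
example_tsat = transferrin_saturation(serum_iron=90.0, uibc=210.0)
print(round(example_tsat, 1))
```

Because the denominator already contains serum iron, the result is bounded between 0 and 100 whenever both inputs are non-negative.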

Candidate predictors

We selected predictors of serum iron biomarkers based on individual associations previously reported in the literature. These predictors can be grouped into the following five categories: (1) supplements and dietary intake, (2) lifestyle, (3) reproductive history, (4) health conditions, and (5) a general category of binary predictors.

The first group included red meat consumption and calcium and iron intake from diet and supplements, based on a nutrition analysis derived from a self-reported food frequency questionnaire administered as a separate survey at baseline24. In addition to iron supplements (including multivitamins), we included calcium intake because calcium inhibits iron absorption8. The second group comprised continuous lifestyle-related variables collected at study entry: exercise (hours per week), examiner-measured BMI (kg/m2), and alcohol consumption (drinks per month). The third group comprised reproductive and blood-loss related variables reported at baseline: years since last menstrual period, reproductive life span (calculated by subtracting age at menarche from age at most recent menstrual period), estrogen or progesterone use (number of total years), and total years pregnant or breastfeeding. The fourth group comprised self-reports of ever having the specified health condition at baseline: preeclampsia as a substitute for cesarean section (not available)25, polycystic ovary syndrome (PCOS), irritable bowel syndrome (IBS), colon/rectum polyps, and iron-deficiency anemia (IDA). The fifth and final group was composed of the following self-reported binary response variables: statin use, regular aspirin use (at least once per week for 3 or more months), and recent blood donation (within the past 12 months).

All Sister Study participants provided written informed consent. The institutional review board of the National Institutes of Health, Bethesda, Maryland (United States of America), provides study approval and oversight (protocol number 02EN271). All methods for this study were carried out in accordance with relevant guidelines and regulations.

Statistical analyses

The descriptive statistics included either the median and interquartile range for continuous variables or frequencies accompanied by percent distributions for categorical variables.

We estimated the association between individual predictors and serum iron outcomes by fitting simple linear regression models separately for each of the iron-related predictors, with age included in all models. The multivariable prediction model for iron status included all variables in one model for each of the three iron status markers. We also considered simpler predictive models based on a step-down procedure, with successive removal of predictors that were contributing little to the goodness of fit of the model (p > 0.05). We also considered a penalization method, the lasso model26, as a means to remove variables and simplify the predictor model since this model had the lowest mean squared error for the minimum alpha tuning parameter27 from the glmnet28 R package. To assess the performance and internal validation of these multivariable predictor models, we followed best practice guidelines for prediction models29,30,31, which are outlined below.

Model performance and validation

To estimate model performance for the prediction models, we estimated the proportion of explained variation, R2. Following best practices in prediction model reporting29, we conducted internal validation procedures for the prediction model to assess how the model would perform in new samples. Multivariable prediction models based on the observed data will typically be ‘optimistic,’ in that they provide better performance than would be found in new samples32, either from the same or different underlying populations. We used bootstrapping to estimate how the prediction model would perform in new samples drawn from the same underlying population, an approach known as ‘internal validation’32. Because we could not find similar data collected from a different population, we did not conduct external validation.

For internal validation of performance measures, we used a repeated sampling approach33 with bootstrap samples, as defined by Harrell’s bias correction34. This method performs well and includes the following steps34,35: (1) estimate the prediction model and calculate performance statistics with the observed data, (2) generate a bootstrap sample with replacement from the original sample, (3) create a prediction model based on the bootstrap sample and generate the performance statistic (R2 and calibration slopes) – the ‘training’ estimate, (4) using the model parameter estimates from the bootstrap sample in step 3 and the observed data, calculate the performance statistic – the ‘testing’ estimate, (5) repeat steps 2–4 for each of the bootstrap samples. In our case we repeated steps 2–4 1000 times and calculated the ‘optimism’ of the original prediction model by averaging the difference between the performance statistics based on the fits to the bootstrap data and fits based on the original observed data (in steps 3 and 4). The ‘optimism’ estimates the bias indicating to what extent the performance statistic from the original model was too high relative to what would be seen with new samples from the same population.
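The bootstrap steps listed above can be sketched as follows. This is a minimal illustration under stated assumptions, not the code used for the analyses: it uses a single predictor with ordinary least squares instead of the multivariable models, simulated data, and hypothetical function names (`fit_line`, `r_squared`, `optimism_corrected_r2`).

```python
import random


def fit_line(x, y):
    """Ordinary least squares for y = a + b*x; returns (a, b)."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    sxx = sum((xi - mx) ** 2 for xi in x)
    sxy = sum((xi - mx) * (yi - my) for xi, yi in zip(x, y))
    b = sxy / sxx
    return my - b * mx, b


def r_squared(a, b, x, y):
    """R2 = 1 - SSE/SST for fixed coefficients (a, b) evaluated on (x, y)."""
    my = sum(y) / len(y)
    sse = sum((yi - (a + b * xi)) ** 2 for xi, yi in zip(x, y))
    sst = sum((yi - my) ** 2 for yi in y)
    return 1.0 - sse / sst


def optimism_corrected_r2(x, y, n_boot=1000, seed=1):
    """Return (apparent R2, estimated optimism, optimism-corrected R2)."""
    rng = random.Random(seed)
    n = len(x)
    a0, b0 = fit_line(x, y)
    apparent = r_squared(a0, b0, x, y)  # step 1: fit and evaluate on observed data
    differences = []
    for _ in range(n_boot):
        idx = [rng.randrange(n) for _ in range(n)]  # step 2: resample with replacement
        xb = [x[i] for i in idx]
        yb = [y[i] for i in idx]
        ab, bb = fit_line(xb, yb)  # step 3: refit on the bootstrap sample
        training = r_squared(ab, bb, xb, yb)  # 'training' estimate
        testing = r_squared(ab, bb, x, y)  # step 4: bootstrap fit applied to observed data
        differences.append(training - testing)
    optimism = sum(differences) / n_boot  # average over repeats of steps 2-4
    return apparent, optimism, apparent - optimism


# Simulated data for illustration only (not study data).
sim = random.Random(0)
x = [sim.gauss(0, 1) for _ in range(200)]
y = [0.4 * xi + sim.gauss(0, 1) for xi in x]
apparent, optimism, corrected = optimism_corrected_r2(x, y, n_boot=200)
print(round(apparent, 3), round(optimism, 3), round(corrected, 3))
```

Subtracting the averaged optimism from the apparent R2 gives the optimism-corrected estimate; the same loop could track a calibration slope in place of R2.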

In addition to R2, we used calibration measures, paired with the internal model validation described above, to assess how well the prediction model based on the observed outcomes can function with new samples30,32. Calibration applies to many outcomes, including continuous variables33. Generally defined, the calibration model is derived by fitting a regression model with observed values as the outcome and predicted values as the covariate. In our internal validation analyses, the slope from a calibration model based on observed data and predictions from the bootstrap-based parameter estimates, defined in step four above, indicated how much the estimated parameters need to be adjusted for the model to fit well in new samples from the underlying population. A calibration slope of 1.0 indicates a well-calibrated fit, and a slope less than 1.0 suggests that the predictions are more extreme than the observed values, a sign of ‘overfitting’32. This slope can also provide a uniform shrinkage factor to adjust coefficients for future use32.
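As a small worked illustration of the calibration slope, the sketch below regresses observed values on predicted values and uses the resulting slope as a uniform shrinkage factor. The data, the inflated coefficient of 2.5, and the function name are hypothetical and chosen only so that the arithmetic is easy to follow.

```python
def calibration_slope(observed, predicted):
    """Slope from a simple linear regression of observed on predicted values."""
    n = len(observed)
    mean_pred = sum(predicted) / n
    mean_obs = sum(observed) / n
    s_pp = sum((p - mean_pred) ** 2 for p in predicted)
    s_po = sum((p - mean_pred) * (o - mean_obs) for p, o in zip(predicted, observed))
    return s_po / s_pp


# Hypothetical example: the outcome changes by 2.0 per unit of x,
# but the fitted model uses an inflated coefficient of 2.5.
x = [0.0, 1.0, 2.0, 3.0, 4.0]
observed = [1.0 + 2.0 * xi for xi in x]
predicted = [1.0 + 2.5 * xi for xi in x]  # predictions that are too extreme

slope = calibration_slope(observed, predicted)
shrunken_coefficient = slope * 2.5  # uniform shrinkage applied to the inflated coefficient
print(round(slope, 3), round(shrunken_coefficient, 3))
```

In this example the slope is 2.0/2.5 = 0.8, below 1.0 because the predictions are spread more widely than the observed values, and multiplying the inflated coefficient by the slope recovers 2.0.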

Missing data

None of the predictor variables had over 8 percent missing values for the premenopausal group or over 14 percent missing for the postmenopausal group. However, among the variables considered for the multivariable model, 17 percent (n = 181) of the premenopausal women and 24 percent (n = 498) of the postmenopausal women had at least one missing value. We used multiple imputation to account for the missing data in the multivariable models, assuming the data were missing at random. To estimate parameters using multiple imputation, we used the mice36 package in R to create imputed data sets. We also estimated the prediction models with only complete case data.

For the full multivariable model using the maximum likelihood method, we used 100 imputed samples to compute the estimates and pooled the estimates and standard errors according to Rubin’s method37. For the backwards selection models with the same set of covariates as for the full multivariable model, we used the psfmi38 package with Rubin’s Rules for variable selection39,40,41, in which the full model is fit in all imputed data sets, the estimates and standard errors are combined using Rubin’s Rules, the variable with the largest p-value is excluded, and this process is repeated until all parameters in the model have a p-value below 0.05. To obtain imputed estimates for variable selection from the penalized regression models, we used the miselect42 package in R to implement a stacked pooled objective function across the imputed data sets43. For the model performance statistics, we estimated each statistic, including its bootstrap validation, using the relevant model within each imputed data set, then averaged the results across imputations.
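For readers unfamiliar with pooling across imputed data sets, the sketch below shows the standard form of Rubin's rules for one coefficient: the pooled estimate is the mean of the per-imputation estimates, and the total variance is the mean within-imputation variance plus (1 + 1/m) times the between-imputation variance. The function name and the three example values are hypothetical; the analyses described here used 100 imputations and R packages rather than this code.

```python
from statistics import mean, variance


def pool_rubin(estimates, std_errors):
    """Pool one coefficient across m imputed data sets using Rubin's rules.

    Returns (pooled estimate, pooled standard error).
    """
    m = len(estimates)
    pooled = mean(estimates)
    within = mean([se ** 2 for se in std_errors])  # average within-imputation variance
    between = variance(estimates)  # between-imputation variance (m - 1 denominator)
    total = within + (1.0 + 1.0 / m) * between
    return pooled, total ** 0.5


# Hypothetical estimates from m = 3 imputed data sets (illustration only).
pooled_estimate, pooled_se = pool_rubin([1.0, 1.2, 0.8], [0.5, 0.5, 0.5])
print(round(pooled_estimate, 3), round(pooled_se, 3))
```

The between-imputation term is what makes the pooled standard error larger than the average per-imputation standard error, reflecting uncertainty due to the missing values.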

We used R software44, version 3.6.1, to handle data and conduct all statistical analyses.

Results

We excluded the supplemental breast cancer cases from the original case-cohort and only considered the 3171 randomly sampled women in the subcohort, using only their data and blood sample taken at baseline. Of those women, 2106 reported being postmenopausal and 1063 were premenopausal (Table 1). The median ferritin level was higher in the postmenopausal women. Spearman correlations among the three iron outcome measures in the total sample (n = 3171) were 0.21 for serum iron and ferritin, 0.34 for transferrin saturation and ferritin, and 0.92 for serum iron and transferrin saturation.

Table 1 Baseline characteristics of subcohort by menopause status at baseline.

Univariable candidate predictor models

We characterized the associations between each of the predictors in the five categories and iron status in separate linear regression models. In the premenopausal group of women, strong positive associations with ferritin included red meat consumption, alcohol consumption, BMI, and current statin use (Supplementary Fig. S2, Supplementary Table S1). Strong inverse associations with ferritin were seen for years pregnant and breastfeeding, IDA, and recent blood donation. The associations with IDA and recent blood donation were also inverse for transferrin saturation and serum iron. The association with BMI was positive for ferritin, but strongly inverse for transferrin saturation and serum iron. Red meat consumption, years pregnant or breastfeeding, and current statin use were only related to ferritin.

In the postmenopausal women, large positive associations with ferritin included: red meat consumption, alcohol consumption, and BMI (Supplementary Fig. S3, Supplementary Table S2). Strong inverse associations with ferritin existed for calcium intake, iron deficiency anemia (IDA), and recent blood donation. Some findings were different for the transferrin saturation and serum iron outcomes, including strong positive associations for exercise and inverse associations for BMI, years since last menstrual period (LMP), and statin use. Alcohol consumption, recent blood donation, and diagnosis with IDA each showed a consistent direction of association across the three iron outcomes. Blood donation and IDA were both consistently strong predictors in both the pre- and postmenopausal groups.

Model performance and validation

For the premenopausal group, after accounting for the missing values in the prediction models, we found that the estimated R2 for the multivariable prediction model was very low for both serum iron (0.07) and transferrin saturation (0.07) (Table 2). In analyses restricted to individuals with complete data, the R2 remained low but was slightly higher for serum iron and transferrin saturation (Supplementary Table S3) than in the prediction models accounting for the missing data. The multivariable prediction model for serum ferritin had a higher R2 (0.13), explaining more of the observed variation than the models for serum iron and transferrin saturation. Internal validation of these estimates based on the bootstrap method suggests overfitting in the model, implying a reduced R2 after an optimism correction. The internal validation calibration slopes were between 0.9 and 1.0, suggesting that the predictions from models based on the observed data would tend to provide more extreme estimates when applied to new samples drawn from the same underlying population.

Table 2 Multivariable regression model parameter estimates by iron outcome for premenopausal group, multiple imputation.

In the postmenopausal group of women, after accounting for the missing values, we found that the estimated R2 for the predictor model was low for both serum iron (0.06) and transferrin saturation (0.09) (Table 3). The model predicting serum ferritin measures had a higher R2 (0.19), explaining more of the observed variation than the models for serum iron and transferrin saturation. In analyses restricted to individuals with complete data, the R2 remained low but was slightly higher for serum iron and transferrin saturation (Supplementary Table S4) relative to the prediction models accounting for missing data. Internal validation of these estimates based on bootstraps suggests overfitting in the model, implying a reduced optimism-corrected R2. As in the premenopausal group of women, the internal validation calibration slopes remained between 0.9 and 1.0, suggesting that the model would provide somewhat reduced strength of associations between the predictors and the outcomes in new samples drawn from the same underlying population.

Table 3 Multivariable regression model parameter estimates by iron outcome for postmenopausal group, multiple imputation.

We find similar patterns in performance when combining the pre- and postmenopausal groups of women in one model (Supplementary Table S5). The model with a ferritin outcome has the highest R2 (0.31) compared to serum iron (0.08) and transferrin saturation (0.09). Because menopause status is predictive in the combined sample, the amount of explained variation is greater than in the strata based on the two menopausal subgroups. Models explaining more of the variance, in this case indicated by R2, are better performing prediction models. An R2 approaching 1 (100%) implies the observed outcomes are very close to the predicted values, and a value near 0 implies no association between the observed and predicted values32. An R2 of 0.31 implies that 69% of the variance remains unexplained.

Discussion

In this work, we evaluated prediction models for iron status using questionnaire-based data, including many variables relating to diet, supplements, lifestyle-related variables, reproductive characteristics, iron-related health conditions, medications, and recent blood donation status. We developed this model with a large sample of women in the United States, ages 35–74 years, with three different common biomarkers for iron status based on serum: iron, transferrin saturation, and ferritin. Although each person in that sample had a sister previously diagnosed with breast cancer, we know of no reason to expect that history to affect the relationship between questionnaire variables and their iron status.

Model performance was weak, with an R2 less than 0.10 for the maximum likelihood models for iron and transferrin saturation but stronger, with an R2 of 0.13 in the premenopausal group and 0.19 in the postmenopausal group for ferritin. The overall R2 for ferritin with both menopausal groups in the same model was 0.31. The internal validation calibration slopes were close to 1, showing good calibration between observed and predicted outcomes, for all models in the internal validation process. Internal validation calibration slopes greater than 0.9 suggest a tendency of the observed prediction model to provide similar predictions in new samples from the underlying population. Using the variable selection approaches, including backwards selection and penalized regression, did not appear to substantively change the estimated model performance measures in the complete case sample. We did not assess predictions of dichotomous outcomes, such as iron deficiency or overload based on a cutoff value for ferritin, but such a model may be worth considering in future work.

The weak model performance does not support the use of a large array of predictive variables to replace common serum measures. The predictive model for serum ferritin demonstrated the best performance, especially for the postmenopausal group of women, with an R2 close to 0.20. No menopause-stratified prediction model R2 exceeded 0.20, indicating that within each of the menopause categories over 80% of the variance remained unexplained. When we combined pre- and post-menopausal groups into one analysis (Supplementary Table S5), the explained variance (R2) for the serum ferritin outcome was 0.31. In this model menopause was a strong predictor and consequently stratifying by menopause reduces the explained variance. However, even that model leaves around 69% of the variance unexplained. Nevertheless, the calibration slope from the internal validation was close to one, 0.98, for the ferritin outcome in the postmenopausal group, indicating that this model should provide predicted values in new samples from the same underlying population that are not systematically different from those observed.

Although this study used easily accessible self-reported variables, collecting information for all candidate predictors in these analyses may not be possible when resources are limited. Based on the variable reduction models, including stepwise backwards selection and lasso models, we identified several variables, such as recent blood donation, that remained consistently in the model and explained a sizable amount of the observed variance, although the total fraction explained was low. Recent blood donation was the predictor with the strongest association with iron outcomes for both pre- and postmenopausal groups of women, and it remained in the model with both variable selection approaches. This finding is consistent with results from a European sample of blood donors21,22. We considered only cross-sectional data. Investigators carrying out longitudinal analyses could presumably capture important temporal variation in iron status by acquiring careful records of the timing of repeated donations of whole blood.

There could also be reverse-causal effects when cross-sectional data are used for assessing iron status. For example, a history of IDA, heavy menstruation or regular blood donation could prompt some to take regular iron supplements. Although less strongly associated than the blood donation indicator variable, lifestyle measures frequently remained after variable selection. Alcohol and red meat consumption remained in both pre- and postmenopausal groups after variable selection. The indicator for ever having preeclampsia was unexpectedly positively associated with ferritin and remained in the model for the premenopausal group only, providing a more informative predictor for this group than for the postmenopausal group. This may be in part a birth cohort or recall bias effect, since the pregnancies tend to be farther in the past for the postmenopausal group. The basis for that association is not clear, but it is possible that the positive association with ferritin reflects an effect of inflammation45 on ferritin and not actual iron status. There is, however, evidence that iron overload plays a causative role in preeclampsia46.

Strengths and limitations

One strength in assessing prediction models of iron status was the evaluation of three commonly measured serum-based iron biomarkers and a comprehensive set of predictors that can be easily obtained through questionnaires – and, excepting food frequency questionnaires, readily incorporated into electronic health systems – related to body iron status. Moreover, we assessed these variables in a large random sample from a contemporary U.S. cohort of women. From this information, we assessed the suitability of using questionnaire items to assess iron status through a multivariable prediction model. These strengths of this study had the potential to show, provided we found evidence of strong prediction model performance, a means to extend analyses based on serum iron measures to larger samples through a proxy prediction model based on more accessible measures.

Unfortunately, none of the predictive models showed strong performance, indicating the possibility that the variation among women in these markers of iron status depends on additional variables outside of our candidates. Measurement error could be impacting both the predictor and outcome variables, which would affect model performance. Also, questionnaire items, no matter how extensive, are inherently error-prone, given the potential for misclassification or recall bias. Another limitation is the known temporal variability of the outcome variables, especially for serum iron and transferrin saturation47.

An inherent limitation relates to the role of inflammation, for which we did not have a good measure, and which is known to elevate ferritin levels and disrupt its ability to serve as a marker for iron status48. Some of our questionnaire data could have acted as markers for chronic inflammation, e.g. IBD, as could the use of statins or aspirin, high BMI, and a history of preeclampsia, all of which are related to heart disease and therefore inflammation, and all of which predicted higher ferritin but not higher serum iron or percent transferrin saturation (see supplemental figures).

Another limitation of this study is the inability to carry out external validation and, consequently, to generalize this prediction model to other populations, such as men. This study population was women ages 35–74 years who had at least one sister diagnosed with breast cancer at study entry and was disproportionately non-Hispanic White. The extensive nature of this study, given the large number of predictor variables paired with serum iron values, prevented us from finding a similar sample in a different population and conducting external validation analyses. External validation presents a future area of research, with the potential for a similar prediction model to perform better in different populations such as men.

In developing prediction models based on easily accessible questionnaire data, we did not find evidence to support that questionnaire-based prediction models could serve as a good proxy for serum iron measures. When developing and validating the prediction models, we determined that serum ferritin served as a more predictable outcome than either serum iron or transferrin saturation. The prediction models for the postmenopausal group of women had better performance measures than those for the premenopausal group. Although more extensive and precise predictor variables may improve the performance of the model, it seems likely that unmeasured genetic and environmental factors may cause temporal variability and heterogeneity among women in their iron status. It was also of note that recency of blood donation was the most important factor for any group of women and any outcome measured, underlining the consistency and importance of collecting a detailed history of blood donation recency and frequency in determining iron status. Creating predictive models for body iron status based on questionnaire data that could usefully replace the iron markers measured in serum samples remains a challenge.