Introduction

Health system assessments of depression care quality typically consist of process measures, such as adequacy of antidepressant medication prescribed or psychotherapy visits completed.1,2,3,4 Performance measures based on patient-reported outcomes (e.g., improvement in symptoms) are potentially important additions to care quality assessment because they reflect the realized effectiveness of implemented evidence-based treatments.5 Performance measures based on depression symptom outcomes have been developed by the National Committee for Quality Assurance (NCQA) and the International Consortium for Health Outcomes Measurement (ICHOM), and health systems may consider adopting these or similar measures to inform quality improvement initiatives.6,7

Within large health systems, performance measures may be used to compare quality across facilities in addition to understanding system-wide performance. For patient-reported outcome performance measures (PRO-PMs) to be useful for identifying quality improvement opportunities across practice sites, it is important to establish whether there are meaningful differences in outcomes between sites that are due to care quality and not due to chance or differences in patient populations.8 Though findings are not consistent, studies show patient characteristics such as age, gender, race, and socioeconomic status predict treatment response via mechanisms that may be independent of the quality or adequacy of the treatment received.9,10 If these characteristics differ at a population level (e.g., an older vs. younger patient population), PRO-PM scores may need to be adjusted for these characteristics (referred to as case-mix adjustment).11,12 Kramer et al.12 evaluated the optimal case-mix adjustment factors for depression outcomes among outpatients with depression and identified that baseline depression severity and physical functioning were associated with depression outcomes. However, this prior study did not use the Patient Health Questionnaire (PHQ9)13 to measure depression symptoms (as recommended for depression PRO-PMs) or assess performance across multiple sites within a single health system.

Treatment with antidepressant medications and certain psychotherapies improves depression symptoms in clinical trials,14 and measures of adequate receipt of these treatments were associated with improved outcomes in studies of quality improvement interventions (e.g., collaborative care management).15,16,17 However, these findings may not generalize to routine care, where the relationships between process measures reflecting treatment receipt and outcome measures are less well established. Identifying process measures that impact PRO-PM performance could provide more actionable targets for quality managers and providers.

This study conducted a longitudinal survey of depression outcomes among US Department of Veterans Affairs (VA) patients receiving care in Midwestern facilities to address existing knowledge gaps and inform the potential adoption of depression PRO-PMs. Although VA providers collect PHQ9 depression outcomes as part of routine care, this study measured PHQ9 outcomes independently using largely automated systems to avoid potential bias and imprecision from clinically administered measures and missing outcomes for patients who drop out of care. The study’s key aims were to (1) identify baseline patient characteristics that best predict subsequent depression outcomes (i.e., case-mix variables); (2) after adjusting for case-mix, assess variation in depression outcomes across facilities; and (3) assess whether various measures of depression care processes are associated with case-mix adjusted symptom outcomes in routine care.

Methods

Setting and sample

This study recruited patients who accessed care across 29 VA Healthcare facilities in a Midwestern Veterans Integrated Service Network between June 2017 and October 2019. The sampling strategy was designed to recruit similar numbers of patients across these facilities, regardless of size, to improve cross-site analyses. On a weekly basis, the study team identified patients from electronic medical record data with a clinical diagnosis of depression (ICD-10 codes: F32.0x, F32.1x, F32.2x, F32.4x, F32.8x, F32.9x, F33.0x, F33.1x, F33.2x, F33.9x, F34.1x, F43.21, F43.23), a PHQ2 score > 2, or a new prescription for an antidepressant medication at a primary care or mental health provider visit during the past week. Patients without recent provider documentation of current depression symptoms and those with a diagnosis of bipolar disorder, schizophrenia/schizoaffective disorder, or a neurocognitive disorder such as dementia were excluded. While some depression quality measures (e.g., those focused on antidepressant treatment adequacy) only include patients with new episodes of care, others include all patients with a depression diagnosis regardless of when treatment was initiated (i.e., new and prevalent cases).1,7 Sampling was evenly stratified between patients with new vs. prevalent episodes of depression to assess whether this distinction is associated with differences in outcomes and the impact of including new vs. prevalent patients on PRO-PMs. New episodes were conservatively defined as no depression diagnosis or positive 2-item PHQ screen (PHQ2) in the past year and no antidepressant treatment in the prior 6 months.18 Study staff contacted and screened potentially eligible patients with the PHQ9. Those with a score of 10 or more (which has 88% sensitivity and specificity for detecting a major depressive episode)13 could complete the remaining baseline assessment and follow-up assessments at 6 weeks, 3 months, 6 months, and 12 months.
The VA Ann Arbor Healthcare System’s Institutional Review Board approved human subjects’ involvement.
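
The new-episode and follow-up eligibility rules described above can be sketched as a small classifier. This is an illustration only, not the study's actual screening code: the `ChartHistory` structure, its field names, and the conversion of "past year" and "prior 6 months" to 365 and 182 days are assumptions made for the example, with all intervals counted back from the index visit.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class ChartHistory:
    """Hypothetical per-patient summary; None means no such event on record."""
    days_since_depression_dx: Optional[int]
    days_since_positive_phq2: Optional[int]
    days_since_antidepressant: Optional[int]


def _none_within(days_since: Optional[int], window_days: int) -> bool:
    """True when the event never occurred or occurred before the lookback window."""
    return days_since is None or days_since > window_days


def is_new_episode(history: ChartHistory) -> bool:
    # No depression diagnosis or positive PHQ2 screen in the past year
    # (assumed 365 days) and no antidepressant treatment in the prior
    # 6 months (assumed 182 days).
    return (
        _none_within(history.days_since_depression_dx, 365)
        and _none_within(history.days_since_positive_phq2, 365)
        and _none_within(history.days_since_antidepressant, 182)
    )


def eligible_for_follow_up(screening_phq9: int) -> bool:
    # Patients scoring 10 or more on the PHQ9 screen continued in the study.
    return screening_phq9 >= 10
```

Under these assumptions, a patient whose last antidepressant fill was 90 days before the index visit would be classified as a prevalent case even with no diagnosis on record.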

Survey administration

Participants completed all survey assessments, including the initial PHQ9 screening, by self-report via the participant’s choice of telephone interactive voice response (IVR), web-based survey accessed via a short message service (SMS) text message link, or mailed paper surveys. The IVR and SMS systems conducted automated follow-up assessments supplemented by study staff reminder calls.

Measures

The primary outcome for the study was depression symptom severity at 6 months according to the continuous PHQ9 score. The PHQ9 has good sensitivity and specificity for identifying a major depressive episode, has a range from 0 to 27, and is the measure the NCQA and ICHOM depression PRO-PMs use.1,6,13,19 The PHQ9 was also used to construct 3 PRO-PMs from dichotomized patient outcomes: 5-point reduction in PHQ9 from baseline to follow-up (i.e., improvement), 50% reduction in PHQ9 from baseline to follow-up (i.e., response), and PHQ9 of 5 or less at follow-up (i.e., remission). These measure definitions were chosen from outcome assessments of collaborative care interventions for depression in primary care settings.15
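
The three dichotomized outcomes can be expressed compactly in code. The sketch below assumes that a "5-point reduction" means a reduction of at least 5 points and a "50% reduction" means a reduction of at least half the baseline score; the function name `pro_pm_flags` is hypothetical.

```python
def pro_pm_flags(baseline: int, follow_up: int) -> dict:
    """Dichotomize a pair of PHQ9 scores (each 0 to 27) into three outcomes."""
    for score in (baseline, follow_up):
        if not 0 <= score <= 27:
            raise ValueError("PHQ9 scores range from 0 to 27")
    reduction = baseline - follow_up
    return {
        # Assumed reading: a drop of at least 5 points from baseline.
        "improvement": reduction >= 5,
        # Assumed reading: a drop of at least half the baseline score.
        "response": baseline > 0 and reduction >= 0.5 * baseline,
        # PHQ9 of 5 or less at follow-up.
        "remission": follow_up <= 5,
    }
```

For example, a patient moving from 16 to 10 meets the improvement definition but not response or remission, whereas a patient moving from 12 to 5 meets all three.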

The baseline survey assessed patient characteristics associated with depression outcomes, and, when possible, survey items were aligned with those recommended by the ICHOM.6,9,10 To minimize response burden, single-item assessments were used unless otherwise noted to measure the following: race, ethnicity, marital status, people living in the home, education, employment, financial distress, Medical Outcomes Study Social Support scale (4 items),20 social distress (4 items),21 age of onset of first depressive episode (before or after age 18), number of lifetime depressive episodes, duration of current depressive episode (greater or less than 2 years), current antidepressant use and duration, current receipt of psychotherapy, depression treatment expectancy, anxiety score (0 to 100 with 100 most severe), pain score (0 to 100 with 100 most severe), PROMIS physical functioning scale (4 items),22 and general health.23 Additional variables extracted from the medical record included age, gender, service-connected disability, Elixhauser comorbidity score24 (modified to remove mental health and substance use disorders), substance use disorders, anxiety disorders, pain disorders, new treatment episode, and number of prior mental health visits. Follow-up assessments included the PHQ9. Mode of survey completion (IVR, SMS, or paper) and an indicator for survey completion during the COVID-19 pandemic (applicable to 7.7% of 6-month and 27.2% of 12-month surveys) were also included. Quality of care provided by a facility might have influenced some baseline variables, such as prior treatment and treatment expectancy; however, these variables were included given the important baseline patient characteristics they may represent separate from care quality, and unadjusted analyses were conducted without these included.

Medical record data was used to construct three care process measures for patient-level analyses. Adequate antidepressant treatment was defined as receipt of at least an 84-day supply of antidepressant medication over the 114 days prior to the 3-month assessment among those receiving any antidepressant, consistent with a NCQA measure.1 Psychotherapy treatment adequacy was defined as receipt of at least 3 psychotherapy visits in the 84 days prior to the 3-month assessment among those receiving any psychotherapy. At least three psychotherapy visits have been used in other studies to define minimally adequate treatment in the VA.4,25 The third measure represented an exploratory measure of treatment intensification, defined as whether a patient who had not experienced a 5-point improvement in PHQ9 score from baseline to 6 weeks received a new antidepressant medication, an increase in antidepressant medication dose, a depression augmentation agent, or initiation of psychotherapy.
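
The two adequacy definitions can be illustrated with a short sketch. It is not the study's or the NCQA's specification: counting distinct covered days (rather than summing days supplied), the half-open lookback windows, the `(fill_date, days_supply)` record layout, and returning `None` for patients with no records (i.e., outside the measure denominator) are all assumptions made for the example.

```python
from datetime import date, timedelta
from typing import Iterable, Optional, Tuple


def covered_days(fills: Iterable[Tuple[date, int]], assessment: date,
                 window_days: int = 114) -> int:
    """Count distinct days in the lookback window covered by any fill."""
    start = assessment - timedelta(days=window_days)
    covered = set()
    for fill_date, days_supply in fills:
        for offset in range(days_supply):
            day = fill_date + timedelta(days=offset)
            if start <= day < assessment:
                covered.add(day)  # overlapping fills are counted once
    return len(covered)


def adequate_antidepressant(fills, assessment: date) -> Optional[bool]:
    """At least 84 covered days in the 114 days before the 3-month assessment."""
    fills = list(fills)
    if not fills:
        return None  # no antidepressant received: outside the denominator
    return covered_days(fills, assessment) >= 84


def adequate_psychotherapy(visit_dates, assessment: date) -> Optional[bool]:
    """At least 3 psychotherapy visits in the 84 days before the assessment."""
    visit_dates = list(visit_dates)
    if not visit_dates:
        return None  # no psychotherapy received: outside the denominator
    start = assessment - timedelta(days=84)
    return sum(start <= d < assessment for d in visit_dates) >= 3
```

Counting distinct covered days avoids crediting early refills twice; a simpler sum of days supplied would be an equally plausible reading of the definition.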

Analyses

To identify a parsimonious set of case-mix adjustment variables, a backward stepwise variable selection process was used. The purpose of removing variables was to reduce the risk of over-fitting the models and to inform health systems regarding which variables to prioritize for the purpose of case-mix adjustment. The initial model included all baseline variables (including baseline PHQ9 score and an indicator for new episodes) and exempted predictors of follow-up assessment completion from removal to account for missing data assuming missingness at random. Using linear regression models, variables that improved the Akaike information criterion (AIC) the most were removed, one at a time, until the AIC no longer improved with removal of any subsequent variables. A hierarchical model was used to determine coefficients for the retained variables with 6-month continuous PHQ9 scores as the outcome and facilities as random intercepts. Intra-class correlation (ICC) coefficients were used to describe the variation in outcomes at the facility-level relative to patient-level variation within facilities by dividing the facility level variance by the total variance. ICCs were calculated for PHQ9 scores at 6 months using both unadjusted and adjusted hierarchical models and for the subgroup of patients with new episodes of depression. Two sets of sensitivity analyses were used to assess whether the results are robust to changes in the primary model. In the first set, the analyses were repeated using 3- and 12-month PHQ9 scores as outcomes to assess sensitivity to duration of follow-up. In the second set, analyses were repeated using 6-month dichotomous outcomes of improvement, response, and remission based on PHQ9 scores to assess sensitivity to alternative outcome definitions.
Visual comparisons were used to assess for outlying performing facilities using (1) the unadjusted mean change in PHQ9 scores at 6 months (calculated as 6 months minus baseline) along with their 95% confidence intervals and (2) the random intercepts for each facility in the final adjusted model of 6-month PHQ9 scores. Finally, each of the 3 measures of depression care processes was included in each of the prior hierarchical models. The sample size was designed to provide adequate precision to estimate the intraclass correlation coefficient.26
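
The variable selection step can be sketched as follows. This is a minimal illustration of backward elimination by AIC with ordinary least squares, assuming a fully numeric, complete-case design matrix; it is not the study's code, the function names `ols_aic` and `backward_select` are hypothetical, and the subsequent hierarchical model with facility random intercepts is not shown.

```python
import numpy as np


def ols_aic(X: np.ndarray, y: np.ndarray) -> float:
    """AIC of a Gaussian linear model with intercept, up to an additive constant."""
    n = len(y)
    design = np.column_stack([np.ones(n), X])
    beta, *_ = np.linalg.lstsq(design, y, rcond=None)
    rss = float(np.sum((y - design @ beta) ** 2))
    k = design.shape[1] + 1  # regression coefficients plus the error variance
    return n * np.log(rss / n) + 2 * k


def backward_select(X: np.ndarray, y: np.ndarray, names, keep=()):
    """Drop, one at a time, the variable whose removal lowers the AIC the most.

    Variables listed in `keep` are exempt from removal, mirroring the exemption
    of predictors of follow-up completion described above.
    """
    active = list(range(X.shape[1]))
    best = ols_aic(X[:, active], y)
    while True:
        candidates = [
            (ols_aic(X[:, [j for j in active if j != i]], y), i)
            for i in active
            if names[i] not in keep
        ]
        if not candidates:
            break
        aic, drop = min(candidates)
        if aic >= best:
            break  # no single removal improves the AIC any further
        best = aic
        active.remove(drop)
    return [names[i] for i in active], best
```

Because the AIC penalty is only 2 per parameter, this procedure tends to retain weak predictors more readily than selection based on p-values, which fits the stated goal of reducing over-fitting without discarding potentially useful case-mix variables.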

Results

Enrollment, follow-up completion, and sample characteristics

Study staff screened 17,433 patients with medical record indicators of depression, yielding 10,666 (61%) eligible following chart review. Of these, 5,138 could not be reached by phone, 2,652 refused, 131 proved ineligible, and 2,745 consented to participate. Of those consented, 2,390 (87%) completed the baseline PHQ9 screen and 1,638 (69%) of baseline completers had a PHQ9 score ≥ 10 and became eligible for follow-up assessments. Participants completed 1,224 (75%) assessments at 6 weeks, 1,295 (79%) at 3 months, 1,214 (74%) at 6 months, and 1,106 (68%) at 12 months. Older age and SMS survey method were positively associated with completion of the 3-, 6-, and 12-month assessments; additionally, greater educational attainment was positively associated with 3-month assessment completion, and a substance use disorder diagnosis and degree of social distress were negatively associated with 12-month assessment completion.
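
The enrollment figures above are internally consistent, which can be confirmed with a short arithmetic check. The counts are taken directly from the paragraph; the variable names and the `pct` helper are introduced only for this illustration.

```python
screened = 17_433
eligible = 10_666
unreachable, refused, ineligible, consented = 5_138, 2_652, 131, 2_745
baseline_completed = 2_390
follow_up_eligible = 1_638
completed = {
    "6 weeks": 1_224,
    "3 months": 1_295,
    "6 months": 1_214,
    "12 months": 1_106,
}


def pct(part: int, whole: int) -> int:
    """Percentage rounded to the nearest whole number."""
    return round(100 * part / whole)


# The four dispositions account for every chart-review-eligible patient.
assert unreachable + refused + ineligible + consented == eligible

print("eligible after chart review:", pct(eligible, screened), "%")
print("baseline screen completed:", pct(baseline_completed, consented), "%")
print("PHQ9 of 10 or more:", pct(follow_up_eligible, baseline_completed), "%")
for label, count in completed.items():
    print(f"completed at {label}:", pct(count, follow_up_eligible), "%")
```

Note that the follow-up completion percentages use the 1,638 participants eligible for follow-up as the denominator, not the 2,745 who consented.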

The sample completing the 6-month assessments (N = 1,214) had a mean age of 52 (SD = 15) years and was 20% female, 80% White, 11% Black, 4% Hispanic, 0.7% Asian American or Pacific Islander, and 5% multiracial or other race. Only 7% of participants described their current depressive episode as their first, and 24% indicated their current depressive episode as less than 2 years duration. Frequency of comorbid diagnoses included 43% for anxiety disorders, 43% for PTSD, 18% for substance use disorders, and 72% for a pain diagnosis. Although 52% of patients were considered new episodes of depression via medical record screening (e.g., no depression diagnosis in past year, no antidepressant in past 6 months), most participants (88%) had previously seen a VA mental health provider (i.e., for mental health diagnoses other than depression for those with new depression episodes) in the past year. The mean number of mental health visits was 9.6 (SD = 14.2) and for those with any visits it was 11.0 (SD = 14.6). In the year prior to their baseline, 70% of participants received antidepressant treatment, and 63% received psychotherapy according to medical record data. According to survey responses, 73% of participants were taking an antidepressant at baseline. Only 7% of participants expected their treatment to be very successful, 30% expected treatment to be moderately successful, 47% expected treatment to be somewhat successful, and 17% expected their treatment to not at all be successful (see Table 1).

Table 1 Sample Characteristics (N = 1,214)

Depression outcomes and case-mix adjustment variables

Participants had a mean PHQ9 score of 16.2 (SD = 4.4) at baseline, 14.4 (SD = 5.7) at 3 months, 13.8 (SD = 5.9) at 6 months, and 13.8 (SD = 6.2) at 12 months. At 3 months, 27.4% of participants had a 5-point improvement in PHQ9 score, 12.4% had a 50% improvement, and 6.0% had a score of 5 or less. At 6 months, these figures were 30.6%, 14.7%, and 8.2% and at 12 months were 32.5%, 16.7%, and 9.4%, respectively.

Age and mode of survey completion predicted follow-up survey completion at 6 months and were excluded from variable reduction. Following removal of 17 variables, statistically significant predictors of lower PHQ9 scores at 6 months were female gender, less than 2-year duration of current depressive episode, depression onset before age 18, substance use disorder diagnosis, and expectancy that treatment would be very successful (Table 2). Predictors of greater PHQ9 scores at 6 months were greater baseline PHQ9, separated marital status, worse physical functioning, anxiety rating, and number of past-year mental health visits. Across models of 3-month and 12-month continuous outcomes and dichotomous 6-month outcomes (5-point improvement, 50% improvement, and PHQ9 of 5 or less) (Table 3), no variable was a significant predictor across all models. Baseline PHQ9 score, physical functioning, total prior mental health visits, and expectancy that treatment would be very successful were significant in 4 of the 5 sensitivity models, while depression episode duration of less than 2 years and expectancy that treatment would be moderately successful were significant in 3 of the 5. Female gender was associated with lower PHQ9 scores at 3 and 12 months but not with greater likelihood of 5-point improvement, response, or remission at 6 months.

Table 2 Predictors of PHQ9 scores following variable reduction
Table 3 Models of dichotomous 6-month outcomes

Depression outcomes and treatment facility

In the unadjusted model of continuous PHQ9 outcomes at 6 months, the ICC was 0.01, showing little variation across facilities relative to within facilities. In the model including baseline case-mix adjustment variables, the ICC was < 0.01. Sensitivity analyses of 3-month and 12-month PHQ9 scores and those using dichotomous outcome definitions at 6 months did not find any ICC above 0.01. In subgroup analysis of patients with new episodes of depression, the ICC for 6-month continuous outcomes was 0.03. Mean changes in unadjusted PHQ9 outcomes at 6 months by facility are depicted in Fig. 1 with overlapping 95% confidence intervals for the true means. In adjusted models, the random intercepts for facility also had overlapping confidence intervals with no outliers (please see electronic supplementary material).
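
To make the scale of these ICC values concrete, the calculation described in the Analyses section (facility-level variance divided by total variance) can be sketched as below. The variance components in the usage note are made-up round numbers chosen only to illustrate the ratio; they are not estimates from this study.

```python
def icc(facility_variance: float, residual_variance: float) -> float:
    """Share of total outcome variance attributable to facilities."""
    if facility_variance < 0 or residual_variance < 0:
        raise ValueError("variance components must be non-negative")
    total = facility_variance + residual_variance
    if total == 0:
        raise ValueError("total variance must be positive")
    return facility_variance / total
```

With hypothetical components of 1 unit of between-facility variance and 99 units of within-facility variance, `icc(1.0, 99.0)` yields 0.01, meaning that 1% of the variation in outcomes lies between facilities and 99% lies between patients within the same facility.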

Fig. 1 Unadjusted mean change in PHQ9 score from baseline to 6 months by VA facility including 95% confidence intervals

Depression outcomes and care processes

Seventy-one percent of participants with any antidepressant received an adequate 84-day supply, 54% with any psychotherapy received at least 3 sessions, and 32% of those whose PHQ9 did not improve by 5 points at 6 weeks received intensification of treatment. None of these indicators significantly predicted continuous 6-month PHQ9 outcomes when added separately to unadjusted or fully adjusted patient-level models. In the sensitivity analyses, at least 3 sessions of psychotherapy yielded higher odds of remission at 6 months (OR 2.70; 95% CI: 1.10, 6.64; p = 0.03).

Discussion

This study found depression outcomes were primarily influenced by baseline patient characteristics. Indicators of more severe or treatment-resistant depression, specifically greater baseline PHQ9, duration of current depressive episode more than 2 years, and greater number of prior mental health visits, consistently predicted worse subsequent depression outcomes. Physical functioning also consistently predicted outcomes, consistent with Kramer et al.’s prior study of depression case-mix adjustment and other work investigating the relationship between medical illness and depression.12,27 Treatment expectancy has been shown to influence outcomes in clinical trials for depression.28,29 The results of this study extend these findings by demonstrating that treatment expectancy also predicts PHQ9 outcomes in routine care. These findings support the ICHOM approach of including physical functioning and treatment expectancy among other depression case-mix variables. However, unlike baseline PHQ9 scores and prior mental health visits, physical functioning and treatment expectancy are not often collected or contained within existing medical records, and the costs of collecting these additional measures may only prove worthwhile if substantial differences in these characteristics exist across planned comparison settings.

This study found that only a minimal amount of variation in depression outcomes is explained by the facility in which patients received their care and no individual facility outperformed the others. These findings suggest that at least in the VA healthcare system, performance measures based on depression symptoms are unlikely to be useful for comparing care quality across facilities. Consistency in outcomes across facilities could be due to similar patterns of care delivery, although several aspects of depression care vary across VA facilities, such as the propensity to provide psychotherapy vs. antidepressant medications and the propensity for patients to be treated by an integrated primary care mental health provider.30,31 Since this study used facility as the unit of analysis, clinically significant differences in quality and resultant outcomes may exist within individual clinics or care teams yet manifest on average as small differences at the facility level. Depression PRO-PMs in the VA health system may need to focus on identifying quality improvement opportunities within particular clinics (e.g., primary care) or teams rather than VA facilities as a whole. The minimal impact of treatment setting on depression outcomes could have resulted from the substantial degree to which baseline patient characteristics and unobserved factors present before or during treatment (e.g., patients’ life events) determine depression outcomes.32,33

Study findings suggest PRO-PMs may more reliably detect differences in outcomes across facilities when restricted to patients with new depressive episodes. Our criteria for identifying new patients (e.g., no depression diagnosis or positive screen in 1 year, no antidepressant in 6 months) did not screen out patients with chronic untreated depression, depression treatment outside the VA, or prior VA mental health use for other diagnoses. Refining the criteria for new patients could further improve PRO-PMs for new patients; however, performance measures will still be needed for chronic and treatment-resistant depression in the VA given the prevalence of these conditions.

While there was little variation in PHQ9 outcomes across facilities, these findings do not inform use of the PHQ9 with individual patients as part of measurement-based care. Although the study did not measure use of PHQ9 scores by individual clinicians, if use by clinicians is consistent across facilities (i.e., consistent high or low-level use), then use would not drive differences in outcomes across facilities despite potential effectiveness with individual patients. It is possible more efficient depression outcome measures (e.g., a single-item assessment of mood)34 may be sufficient for comparing quality across facilities; however, in settings where the PHQ9 is routinely collected for patient care, using alternative outcome measures may only add to patient and health system burden.

This study found none of the process measures of adequate antidepressant medication treatment, adequate psychotherapy, or treatment intensification were associated with depression outcomes in the primary analyses. Psychotherapy receipt predicted remission at 6 months in a sensitivity analysis, though the wide confidence interval suggests remaining cautious about this finding. These primarily null findings could be due to treatment selection bias mitigating treatment effects, such that patients with more severe or persistent symptoms may be more likely to seek and receive adequate treatment or treatment intensification; conversely, patients tend to stop depression treatment once symptom improvement has been achieved.35,36 Although this study’s adjusted models included baseline symptom severity, this may not account for symptom and functional improvements that drive treatment decisions over time.

Although the studied process measures appropriately reflect receipt of effective evidence-based treatments, limitations to the degree to which they capture fidelity to the interventions used in clinical trials could also explain the largely null associations. Measures of antidepressant treatment adequacy use pharmacy prescription fills and not actual ingestion of medication or appropriate dosing, and psychotherapy procedure codes do not ensure the therapist utilized a specific evidence-based psychotherapy protocol. Future research should explore depression care process measures that more accurately assess fidelity, are not reliant on patient treatment adherence, or reflect shared decision-making and patient preferences when treatment does not meet criteria for adequacy.37,38

The degree of treatment resistance (i.e., continued symptoms despite prior treatment) in the study population may also explain the lack of associations between process measures and outcomes. In the STAR*D trial of sequential antidepressant treatments, remission rates decreased to 13.7% and 13.0%, respectively, by the third and fourth antidepressant trials, and rates of relapse in these groups were high (> 50%) within 12 months.39 This study’s remission rates of < 10% across time points despite subsequent treatment are largely consistent with the patients in STAR*D who did not respond to initial treatments and likely reflect a patient population with more treatment-resistant depression.

This study of VA patients with a high proportion of chronic depression (episode duration > 2 years) receiving care in the Midwest may not generalize to other treatment populations or to VA patients with new-onset depression or residing in other regions of the USA. Generalizability was strengthened by recruiting patients across 29 different VA facilities, and this study avoided referral bias by using medical records to identify participants. The inability to include data from patients who refused any study participation or who were unable to be contacted may bias the results. However, among enrolled participants eligible for follow-up assessments, this study adjusted for characteristics that predicted follow-up completion and achieved an adequate follow-up rate of 74% at 6 months. Since the study sample was designed to include a mix of patients with new and existing depression diagnoses evenly distributed across facilities, descriptive characteristics of the sample (e.g., age) may not represent all patients with depression in the VA.

Implications for Behavioral Health

Depression outcome-based quality measures generated using automated methods do not appear reliable for assessing differences in care quality between VA facilities that treat patients with predominantly chronic depression. The lack of association between patient outcomes and measures of antidepressant or psychotherapy use suggests that current process measures may not adequately capture provider fidelity to evidence-based practices or are confounded by treatment nonadherence when patients’ symptoms improve. Based on strong associations observed at the individual level, if depression outcomes are used to compare clinics or care teams, outcomes should be adjusted for baseline patient characteristics including depression severity, duration of depression, prior specialty mental health service use, treatment expectancy, and physical functioning. Depression care quality improvement efforts in the VA and related research should focus on identifying and improving care for treatment-resistant depression, given the high prevalence of chronic depression and limited symptom improvement that was observed.