INTRODUCTION

Gender inequity is pervasive in academic medicine. While the number of female physicians has steadily increased, a gender gap remains in both pay and promotion.1,2,3 Women have comprised more than 40% of medical students since 1995, yet continue to be underrepresented at higher levels of academic rank and leadership.4,5,6 Within internal medicine, women make up 52% of clinical instructors but only 38% of associate professors and 24% of full professors.5,6 Past work has demonstrated that the field of general internal medicine (GIM) is not immune to gender disparities.7,8,9 Even in the relatively “newer” field of hospital medicine within GIM, these disparities have been well documented.10,11,12,13 Despite abundant evidence regarding the existence of this “leaky pipeline” for women in academic medicine, the factors that contribute to its persistence are less well understood. To develop effective strategies to rectify gender inequities in academic medicine, we must first identify the factors that contribute to their existence.

Teaching evaluations are used to make decisions about promotion, rank, and leadership positions for clinician educators.14 However, the inherently subjective nature of evaluations introduces the risk of reflecting and amplifying evaluators’ underlying explicit (overt) or implicit (unconscious) biases.15,16,17 In previous work, learners rated instructors labeled as male significantly higher than instructors labeled as female regardless of the instructor’s actual gender, suggesting gender bias was playing a significant role in their evaluations.16 Others have demonstrated that descriptive language used in evaluation forms may influence how men and women are assessed.18 While the current literature is varied in terms of the impact of gender on teaching evaluations of clinical faculty in medical education, there is evidence that residents and medical students may also be vulnerable to such biases.19,20,21,22,23

Gender-based social role theory and stereotype-based cognitive biases likely impact how women are evaluated and thus contribute to gender disparities. Specifically, agentic characteristics including decisiveness, instrumental competence, and assertiveness are considered more traditionally masculine, while communal characteristics such as compassion, empathy, and caring are considered more traditionally feminine.24,25 The wide-sweeping nature of these social norms results in preconceived “gendered expectations.” These expectations are further amplified in fields or roles that have been traditionally occupied by men, including procedural specialties and leadership roles in academic medicine.20,24 For women who demonstrate agentic behaviors as leaders of inpatient or cardiac resuscitation teams, these expectations may result in being penalized, a phenomenon previously described as “the double bind” or “role incongruity.”24,26,27,28,29,30

Internal medicine residents who pursue a career in GIM may practice in the inpatient setting as a hospitalist or in the outpatient setting as a primary care physician. These distinct clinical settings also align with differing gender norms. Responsibilities traditionally seen as “agentic,” including performing procedures and leading rapid response and resuscitation teams, are more frequently performed by hospitalists,31,32 while primary care physicians practice in settings where “communal” traits such as strong physician-patient communication and collaborative care have greater emphasis.33 The potential impact of gender-based expectations on learner assessment of faculty performance in these different clinical settings is unknown.

The objective of this study was to determine whether gender-based differences exist in the assessment of teaching performance of GIM attending physicians by residents and to explore the extent to which the language used to describe the quality, skill, or behavior being evaluated (agentic vs. communal) and the environment of the interaction (inpatient vs. outpatient) impact gender differences in assessment. We hypothesized that male faculty would receive more favorable ratings overall, as well as for skills and behaviors related to agentic traits or described with more classically agentic language. Similarly, we hypothesized that female faculty would be more favorably assessed for skills or traits described with more communal language. Finally, we hypothesized that male faculty would be rated more favorably in the inpatient setting, whereas female faculty would be assessed more favorably in the outpatient setting, where communal characteristics may be more highly valued.

METHODS

Setting and Participants

Participants included GIM faculty who served as teaching attendings on inpatient general medicine services and GIM faculty who supervised outpatient continuity clinics at a single Midwestern academic tertiary care center and an affiliated Veteran’s Administration hospital (VA) between July 1, 2015, and December 31, 2018. Inpatient services included four general medicine teams and two resident-based hospital medicine teams at the University hospital and four general medicine teams at the VA. A general medicine team consisted of one senior resident, two interns, and a faculty member. Trainees rotated monthly and faculty rotated every half month. A resident-based hospital medicine team included two senior residents and a faculty member who worked together in half-month increments. The outpatient continuity clinics occurred at a traditional academic GIM practice based at the University Hospital, at the VA, or in one of four University-affiliated community-based GIM practices. Residents averaged one half day per week in continuity clinic, working with the same one or two attending physicians over the 3 years of training.

Faculty Evaluations

At our institution, inpatient faculty are evaluated by trainees at the end of each half-month rotation, while faculty supervising primary care continuity clinics are evaluated by trainees in their clinic twice annually. Evaluations are performed using MedHub™ (Minneapolis, MN).

Inpatient faculty are evaluated by all trainees on the inpatient team including residents from the categorical Internal Medicine program (N=136 total, 45–46 trainees per year), Medicine/Pediatrics (N=32 total, 8 trainees per year), interns completing a preliminary year (N=8), and Anesthesia interns (N=28). Outpatient faculty are evaluated by trainees in continuity clinic from the categorical Internal Medicine Program (N=136).

The faculty evaluation tool was developed internally by our residency program leadership team in 2014–2015. The form utilizes a 5-point Likert scale to answer prompts organized using the ACGME competencies: medical knowledge (MK), patient care (PC), interpersonal and communication skills (ICS), professionalism (PROF), practice-based learning and improvement (PBLI), and systems-based practice (SBP). The form also includes a global assessment of overall teaching ability (Fig. 1).

Figure 1

Assessment form of attending general internal medicine physicians in the inpatient and outpatient setting. Words or skills previously associated positively with men/classic agentic characteristics are in bold; words or skills previously associated positively with women/classic communal characteristics are underlined and italicized.

Study Design

The faculty evaluation form was reviewed by two members of the study team (JRL and SH) for language or skills that, based on prior literature, represented gender norms for men (agentic terms such as “leader,” “confident” or “autonomy” and skills such as procedures) or women (communal terms such as “collaborative,” “empathy” and “compassion” or skills such as history gathering and physical exam).34,35,36,37,38,39 The PC and ICS competency sections contained phrases weighted toward traditionally “feminine” characteristics, while the competencies of MK and PBLI were more weighted toward traditionally “masculine” characteristics (Fig. 1). The competencies of SBP and PROF contained a mix of both agentic and communal key words or skills.

The results of trainee evaluations of faculty were compiled from rotations on inpatient general medicine, resident-based hospital medicine, and continuity clinic. Evaluations of subspecialist faculty were excluded. Faculty characteristics including gender and year of medical school graduation were added from a departmental database. The variables of interest included the gender of the trainee who completed the evaluation (evaluator) and of the faculty member being evaluated (evaluatee). Data were de-identified prior to analysis.

Primary Outcome

Our primary outcome of interest was the mean rating of the faculty’s overall teaching ability and the mean rating of each ACGME competency based on our assessment form. Mean ratings were then compared by faculty gender.

Data Analysis

Descriptive statistics including means with standard deviations were reported for men and women for overall teaching ability and each ACGME competency. Differences in evaluation scores between male and female faculty were assessed using a multilevel model with resident evaluator identity as a random factor and attending gender (male or female) as a fixed factor. The inclusion of the random factor nullified any effect of an individual evaluator’s tendency to give low or high ratings (“hawk” vs. “dove” bias), which is a possible confounder given the unbalanced design of these observational data. By nullifying this effect, the estimated means of male and female scores were free of this possible confounding and provided a stronger test of gender effects. The effects of gender on overall rating and each competency rating were evaluated in separate models. Evaluations with missing data were excluded casewise from the multilevel analysis.
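
The rater-leniency problem described above can be illustrated with a small sketch. It uses invented toy data, not study data, and it swaps in within-rater centering (subtracting each rater's own mean) as a simpler stand-in for the random rater effect; it is not the multilevel model itself. The rater labels, scores, and function names are all hypothetical.

```python
from collections import defaultdict
from statistics import mean

# Hypothetical toy data (NOT from the study): (rater_id, faculty_gender, rating).
# The "hawk" scores everyone one point lower than the "dove" and happens to
# evaluate mostly female faculty, mimicking an unbalanced observational design.
ratings = [
    ("hawk", "F", 4.0), ("hawk", "F", 4.0), ("hawk", "F", 4.0), ("hawk", "M", 4.0),
    ("dove", "M", 5.0), ("dove", "M", 5.0), ("dove", "M", 5.0), ("dove", "F", 5.0),
]

def gender_gap(rows):
    """Return mean(male) - mean(female) for (rater, gender, score) rows."""
    by_gender = defaultdict(list)
    for _, gender, score in rows:
        by_gender[gender].append(score)
    return mean(by_gender["M"]) - mean(by_gender["F"])

def center_within_rater(rows):
    """Subtract each rater's own mean score from that rater's ratings."""
    by_rater = defaultdict(list)
    for rater, _, score in rows:
        by_rater[rater].append(score)
    rater_mean = {rater: mean(scores) for rater, scores in by_rater.items()}
    return [(r, g, s - rater_mean[r]) for r, g, s in rows]

raw_gap = gender_gap(ratings)
adjusted_gap = gender_gap(center_within_rater(ratings))
print(f"raw gap: {raw_gap:.2f}, within-rater gap: {adjusted_gap:.2f}")
```

In this toy example neither rater distinguishes between male and female faculty, yet the raw means differ by half a point purely because of who rated whom. Removing each rater's leniency removes the spurious gap.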

We evaluated the interaction of setting (inpatient vs. outpatient) on the effect of faculty gender on assessment scores using a multilevel model with trainee evaluator identity as a random factor and a full factorial of attending gender (male or female) and clinical site type (inpatient vs. outpatient). Again, the inclusion of trainee identity nullified rater bias. A significant interaction of gender and site type would indicate a difference in effect of gender based on clinical site. Least-squares means (classical Yates contrasts) were used as post hoc tests to compare the gender effect in each site type. The interaction effects on overall rating and for each competency rating were assessed in separate models.
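
One way to write the model described above is sketched below. The text specifies only that trainee identity is a random factor, so the sketch assumes a random intercept for each trainee and normally distributed errors. The symbols and the 0/1 coding are illustrative choices, not taken from the source.

```latex
\begin{align}
y_{ijk} &= \beta_0 + \beta_1 G_j + \beta_2 S_{ijk} + \beta_3\, G_j S_{ijk} + u_i + \varepsilon_{ijk}, \\
u_i &\sim \mathcal{N}(0, \sigma_u^2), \qquad \varepsilon_{ijk} \sim \mathcal{N}(0, \sigma^2)
\end{align}
```

Here $y_{ijk}$ is the rating given in the $k$-th evaluation by trainee $i$ of attending $j$, $G_j = 1$ for female attendings (0 for male), $S_{ijk} = 1$ for inpatient evaluations (0 for outpatient), and $u_i$ absorbs trainee $i$'s overall leniency or severity. Under this coding the gender effect is $\beta_1$ in the outpatient setting and $\beta_1 + \beta_3$ in the inpatient setting, so a nonzero $\beta_3$ corresponds to the gender-by-setting interaction. Setting $\beta_2 = \beta_3 = 0$ gives the gender-only model.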

All analyses were conducted using R version 3.6.3. Multilevel models were fitted using the “lmerTest” (version 3.1-2) addition to the “lme4” (version 1.1-23) package. Post hoc Yates contrasts were performed using the “ls_means” command in “lmerTest.”

The study was determined to be exempt by the University of Michigan Institutional Review Board (HUM00160043).

RESULTS

In total, 4081 faculty teaching evaluations from inpatient and outpatient general medicine services were completed by trainees. Five hundred assessments were of subspecialty faculty and excluded. One hundred thirty (3.6%) evaluations were missing data for at least one evaluation measure and excluded casewise from the multilevel analysis.

Of the final 3581 evaluations included, 2046 evaluations were of male faculty (57.1%) and 1535 (42.9%) evaluations were of female faculty (Fig. 2). Among these, 445 total trainees (245 male, 55.1%, and 200 female, 44.9%) assessed 161 distinct attending GIM physicians (81 male, 50.3% and 80 female, 49.7%) with 2365 unique rater-attending pairs. The majority of pairs involved a single assessment of an attending physician by a single resident (N=1861, 78.7%). In a minority of cases (N=302, 12.8%), a resident assessed the same attending three or more times, mostly in the outpatient setting (N=298, 98.9%).
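
The counts reported above can be cross-checked with a few lines of arithmetic. The sketch below uses only figures stated in the text. Using the 3581 included evaluations as the denominator for the missing-data percentage is an assumption, since the text does not state it.

```python
# Counts as reported in the text above.
total_completed = 4081
subspecialty_excluded = 500
included = total_completed - subspecialty_excluded

male_evals, female_evals = 2046, 1535
missing = 130

def pct(part, whole):
    """Percentage of whole, rounded to one decimal place."""
    return round(100 * part / whole, 1)

print("included evaluations:", included)
print("male %:", pct(male_evals, included), "female %:", pct(female_evals, included))
# Assumption: the missing-data percentage is taken over the included evaluations.
print("missing %:", pct(missing, included))
```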

Figure 2

Evaluations of general internal medicine faculty.

Among all faculty included in our analysis, 83% were on the clinician educator track (85% of total female faculty, 81% of total male faculty) and 17% were on the clinician investigator (i.e., tenure) track. Among the inpatient faculty included in our analysis, 90% were clinician educators, while 75% of the outpatient faculty were clinician educators. A total of seven faculty in our cohort attended in both the inpatient and outpatient setting (five men and two women). Male faculty were on average 20.2 years post-medical school graduation, while female faculty were on average 15.7 years post-graduation. The number of years since training for individual male attendings ranged from a mean of 14.7 at their earliest evaluation during the analyzed period to 17.3 at their last evaluation compared to individual female attendings who ranged from a mean of 12.8 to 14.8 years since training.

Teaching Assessments by Gender and ACGME Competency

Faculty of both genders were rated by trainees as having excellent clinical performance and teaching ability (Fig. 3). After controlling for rater gender, male faculty were rated as having higher overall teaching ability compared to their female colleagues (male=4.69 vs. female=4.63, p=0.003). Male and female faculty were rated similarly in PC (male=4.67 vs. female=4.67, p=0.94) and ICS (male=4.72 vs. female=4.72, p=0.79). In contrast, male faculty received higher scores than female faculty in the competencies of MK (male = 4.73 vs. female = 4.67, p<0.001), PROF (4.79 vs. 4.76, p=0.02), PBLI (4.76 vs. 4.73, p = 0.04), and SBP (4.75 vs. 4.71, p = 0.01).

Figure 3

Mean male and female general internal medicine faculty evaluations by trainees using combined inpatient and outpatient settings. MK, medical knowledge; PC, patient care; ICS, interpersonal and communication skills; PROF, professionalism; PBLI, practice-based learning and improvement; SBP, systems-based practice. *p<0.05.

Impact of Clinical Setting on Gender Differences in Assessment

A total of 1843 evaluations were from inpatient experiences (70.9% male and 29.1% female) and 1738 evaluations were from outpatient experiences (42.6% male and 57.4% female). For all competencies, there was a significant interaction of attending gender and clinical setting (inpatient vs. outpatient) (Fig. 4). This was predominantly due to a larger gender difference in the inpatient setting where male faculty received higher teaching ratings than female faculty overall and in each of the six competencies (Fig. 4). By contrast, there was no difference in the overall rating of male and female faculty in the outpatient setting or in the competencies of MK, PROF, PBLI, SBP, or ICS. Female faculty in the outpatient setting were rated higher than male faculty in the competency of PC (Fig. 4).

Figure 4

Mean male and female internal medicine faculty evaluations by trainees by clinical setting. (a) Overall rating: outpatient male vs. female faculty (4.68 vs. 4.69) and inpatient male vs. female faculty (4.70 vs. 4.53); (b) medical knowledge: outpatient male vs. female faculty (4.73 vs. 4.71) and inpatient male vs. female faculty (4.73 vs. 4.62); (c) patient care: outpatient male vs. female faculty (4.65 vs. 4.71) and inpatient male vs. female faculty (4.67 vs. 4.59); (d) communication: outpatient male vs. female faculty (4.71 vs. 4.75) and inpatient male vs. female faculty (4.73 vs. 4.66); (e) professionalism: outpatient male vs. female faculty (4.78 vs. 4.80) and inpatient male vs. female faculty (4.80 vs. 4.71); (f) practice-based learning and improvement: outpatient male vs. female faculty (4.76 vs. 4.76) and inpatient male vs. female faculty (4.76 vs. 4.68); (g) systems-based practice: outpatient male vs. female faculty (4.73 vs. 4.74) and inpatient male vs. female faculty (4.76 vs. 4.67).
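
The male-minus-female gap in each setting can be tabulated directly from the means reported in the Figure 4 caption, as in the sketch below. These are simple descriptive differences of the reported means. They are not the model-based least-squares contrasts and carry no information about statistical significance. The variable and function names are illustrative.

```python
# (male, female) mean ratings by setting, copied from the Figure 4 caption.
MEANS = {
    "Overall": {"outpatient": (4.68, 4.69), "inpatient": (4.70, 4.53)},
    "MK":      {"outpatient": (4.73, 4.71), "inpatient": (4.73, 4.62)},
    "PC":      {"outpatient": (4.65, 4.71), "inpatient": (4.67, 4.59)},
    "ICS":     {"outpatient": (4.71, 4.75), "inpatient": (4.73, 4.66)},
    "PROF":    {"outpatient": (4.78, 4.80), "inpatient": (4.80, 4.71)},
    "PBLI":    {"outpatient": (4.76, 4.76), "inpatient": (4.76, 4.68)},
    "SBP":     {"outpatient": (4.73, 4.74), "inpatient": (4.76, 4.67)},
}

def gap(pair):
    """Male minus female mean rating, rounded to two decimals."""
    male, female = pair
    return round(male - female, 2)

rows = {}
for name, by_setting in MEANS.items():
    out_gap = gap(by_setting["outpatient"])
    in_gap = gap(by_setting["inpatient"])
    # Descriptive difference-in-differences: how much larger the gap is inpatient.
    rows[name] = (out_gap, in_gap, round(in_gap - out_gap, 2))

for name, (out_gap, in_gap, diff) in rows.items():
    print(f"{name:8s} outpatient {out_gap:+.2f}  inpatient {in_gap:+.2f}  difference {diff:+.2f}")
```

A positive value means male faculty were rated higher. In these reported means, every inpatient gap is positive, while the outpatient gaps are near zero or negative.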

DISCUSSION

This study adds to the growing literature addressing gender disparities in academic medicine. In our cohort, female GIM faculty received lower overall teaching scores than their male counterparts. This difference was largely attributable to evaluations from the inpatient setting. Teaching evaluations for clinician educators play an important role in promotion and compensation; thus, differences in evaluation may be contributing to the “leaky pipeline” of academic medicine. While the absolute differences are small, they should not be discredited as “unimportant.” Prior work on the phenomena of amplification cascade (small differences in evaluations leading to large differences in overall assessment) and bias accumulation (multiple subtle biases adding up to overt discrimination) supports the theory that gender disparities in GIM are a culmination of countless “small” differences like the ones found here.40,41

When evaluating assessments of GIM faculty independent of setting, male faculty scored higher in overall teaching and in four ACGME competencies (MK, PROF, PBLI, SBP). While we had hypothesized men would score higher in the competencies with evaluation prompts that included more traditionally agentic language, traits, or skills (MK and PBLI), men were rated higher both in these competencies and in those using both agentic and communal evaluation prompts (PROF and SBP). Conversely, the PC and ICS prompts contained language, traits, or skills considered more communal; while we hypothesized this would result in higher ratings of female faculty, male and female faculty scored no differently in these competencies (Fig. 3). These findings are important: while prior work has shown that gender-biased language is common in narrative comments on evaluations in academic medicine,34,35,37,38,42,43 our findings also suggest a potential impact of gendered language within the assessment tools themselves.

While we strive for all faculty to demonstrate each of the attributes described on the assessment form, it is important to understand the impact context-based and gender-based behavior expectations may have on how faculty are evaluated by trainees. In our cohort, men were rated higher than their female peers in overall teaching and across all competency groupings in the inpatient setting, a clinical environment which has been historically male dominated and where traditionally agentic characteristics are more highly valued. Conversely, female GIM faculty in the outpatient setting received higher ratings in PC compared to male faculty, with no difference in ratings for overall teaching, MK, ICS, PROF, PBLI, and SBP. We hypothesize that the gender disparity in assessment of faculty performance in the inpatient setting may represent the discordance between the expected gender norms for female physicians and the clinical requirements of the inpatient setting, where decisiveness, assertiveness, and urgency of clinical situations may contradict the communal gender-based expectations others have for female faculty. Prior literature has described how female hospitalists leading inpatient teaching services intentionally work to navigate the “too nice” versus “too aggressive” discord between societal gender-based behavior expectations and the more “masculine” expectations of the inpatient clinical setting.44

Female faculty may pay a “gender-tax” when being evaluated by trainees in the inpatient setting due to this discordance between expected and observed behaviors. The congruence between expected gender norms and the emphasis on communal traits, such as collaboration, interpersonal sensitivity, and communication, in the outpatient clinical setting is a potential explanation for the equivalent or higher evaluation scores for female GIM faculty in this clinical setting. Like female GIM inpatient faculty, male GIM faculty in the outpatient setting may also pay a “gender-tax” in performance assessment, where traditionally male gender norms could be incongruent with the expectations for the clinical setting.

Overall, these findings highlight the complex interplay between gender norms and the potential impact of clinical setting on teaching evaluations for female faculty. Previously, others have hypothesized that gender disparities may not be as prevalent in hospital medicine due to the near equal number of men and women faculty practicing as academic hospitalists.10,11,45 However, the amount of time an individual academic hospitalist spends clinically on teaching vs. non-teaching services varies.46 In our study, it is notable that only 29% of the total evaluations completed by trainees in the inpatient setting were of women faculty, despite the fact that women made up 44% of the total GIM faculty practicing as hospitalists at our institution, highlighting potential differences in the gender makeup among faculty on teaching vs. non-teaching services. An important next step will be to examine representation of female faculty in inpatient internal medicine teaching roles, as this may contribute to trainees’ expectations for behaviors and result in gender differences in teaching evaluations, similar to what has been seen in other male-dominated fields in medicine.20

The fact that the male faculty in our cohort were further out from medical school graduation raises the question of whether there is a confounding disparity in seniority and experience that explains the differences in performance ratings between male and female faculty. While many hypothesize that more senior faculty would be assessed more favorably by learners, this theory is not supported in the literature.21,47 We were not able to examine the impact of seniority in our data set due to a lack of female faculty at the most senior level to compare to their male colleagues.

Our study has several limitations. First, this represents the experience of a single institution, using an internally developed evaluation form, and only includes faculty from GIM; our findings may not be representative of all GIM programs or other specialties. Additionally, we recognize that the relationship between faculty and learners is not identical in the inpatient and outpatient environments and that this may impact resident assessment of faculty performance. While it is uncommon for a resident to evaluate the same faculty member more than once in the inpatient setting, it does occur in the outpatient setting due to the longitudinal nature of resident continuity clinic. This longitudinal relationship may impact how a resident perceives their attending, independent of the attending’s gender.

In conclusion, our findings suggest that female GIM faculty in academic medicine may be evaluated less favorably by trainees compared to their male colleagues. Our work suggests that gender disparity in evaluations may be heightened based on the language used in the evaluation tools and influenced by the clinical setting of the evaluation. Implicit bias, stereotype threat, and role incongruity all likely play a role in these observed disparities. Because of the potential impact of teaching evaluations on faculty promotion, advancement, and salary, recognition of gender-based biases in teaching and performance evaluations is essential, especially for female faculty in divisions of hospital medicine and female faculty in other inpatient-focused specialties.