Anesthesiologists’ clinical competency will be assessed from completion of residency through to retirement. There are several limitations to using simulation for these assessments. Mannequin-based and/or multidisciplinary simulations can require expensive travel to a test centre and time away from clinical practice for both ratees and raters. Simulations also often focus on managing a few crises of brief duration. In contrast, in situ assessments quantify anesthesiologists’ clinical performance in the dynamic and unpredictable environment where they personally deliver care. This environment includes a range of major and unexpected problems, in which anesthesiologists’ roles include foreseeing and preventing problems and in which social, team, and environmental factors influence anesthesiologists’ effectiveness.1,2 Thus, as part of an overall assessment of clinical competency, our department uses in situ assessments of individual anesthesiologists working in operating rooms and other procedural locations (henceforth referred to as “ORs”) to determine how well they provide clinical supervision of anesthesia residents (Table 1).3,4,5,6,7,8,9,10,11,12,13,14 Higher scores for clinical supervision are associated with fewer resident reports of errors with adverse effects on patients (Table 2.15)11,12,13 and with a greater preference for the anesthesiologist to care for the rating resident’s family (Table 2.7).7

Table 1 de Oliveira Filho et al.’s instrument6 for measuring faculty anesthesiologists’ supervision of residents during clinical operating room care
Table 2 Previous findings regarding supervision of anesthesia residents and nurse anesthetists by faculty anesthesiologists

Supervision, in this context, refers to all clinical oversight functions directed toward assuring the quality of clinical care whenever the anesthesiologist is not the sole anesthesia care provider.3,4,5 The de Oliveira Filho unidimensional nine-item supervision instrument is a reliable scale used to assess the quality of supervision provided by each anesthesiologist (Table 1).6,7,8,9 The scale measures all attributes of anesthesiologists’ supervision of anesthesia residents (Table 2.1)6,7,10,11,12,13 and has been shown, in multiple studies, to do this as a unidimensional construct (Table 2.2).6,9,10,12 Low supervision scores are associated with written comments about the anesthesiologist being disrespectful, unprofessional, and/or teaching poorly that day (Table 2.18).10,14,15 Scores increase when anesthesiologists receive individual feedback regarding the quality of their supervision (Table 2.17).4 Scores are monitored daily and each anesthesiologist is provided with periodic feedback.3,15

The supervision scale’s maximum value is 4.00, which corresponds to a response of 4 (i.e., “always”) to each of the nine questions (Table 1).6 Because many ratings equal this maximum (a ceiling effect), the scale’s reliability7,9,10,15 for differentiating performance among anesthesiologists is reduced, even though such differentiation is mandatory (see Discussion).

We previously asked residents to provide a single evaluation for the overall quality of supervision they received from the department’s faculty (i.e., as if intended as an evaluation of the residency program) (Table 2.14).13 We compared those overall scores pairwise with the mean of each resident’s evaluations of all individual anesthesiologists with whom they worked during the preceding eight months.13 Both sets of scores showed considerable heterogeneity among the residents (e.g., some residents provided overall lower scores than other residents did).13 Consequently, our hypothesis was that greater differentiation among anesthesiologists’ supervision scores could be obtained by incorporating the resident’s (rater’s) scoring leniency into the statistical analysis (i.e., treating a high score as less meaningful when given by a resident who consistently provides high scores and is therefore lenient relative to other raters) (Footnote 1).

Methods

The University of Iowa Institutional Review Board affirmed (June 8, 2016) that this investigation did not meet the regulatory definition of research in human subjects. Analyses were performed with de-identified data.

From July 1, 2013 to December 31, 2015, our department used the de Oliveira Filho supervision scale to assess the quality of clinical supervision by staff anesthesiologists (Table 1).6,7 The cohort reported herein includes all rater evaluations of all staff anesthesiologists (ratees) over that 2.5-year period, which was chosen for convenience. We analyzed five six-month periods because we previously showed that six months was a sufficient duration in our department for nearly all ratees to receive evaluations and for an adequate number of unique raters to differentiate reliably among ratees using the supervision scale.9,10,15

The evaluation process consisted of daily, automated e-mail requests16 to raters to evaluate the supervision provided by each ratee with whom they worked the previous day in an OR setting for at least one hour, including obstetrics and/or non-operating room anesthesia (e.g., radiation therapy).4,8,9,10 Raters evaluated ratees’ supervision by logging in to a secure webpage.8 The raters could not submit their rating until each of the nine questions was answered with their choice of 1-4: 1 = never; 2 = rarely; 3 = frequently; or 4 = always (Table 1). The “score” for each evaluation was equal to the mean of the responses to the nine questions (Table 1). The scores remained confidential and were provided to the ratees periodically (every six months) only after averaging among multiple raters.1,15,17
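As a minimal illustration of this scoring rule (the responses shown are hypothetical), each submitted evaluation is reduced to a single score by averaging the nine answers:

```python
# Responses of 1 (never) to 4 (always) for the nine supervision questions
responses = [4, 4, 3, 4, 4, 4, 3, 4, 4]   # hypothetical single evaluation

# The evaluation's score is the mean of the nine answers (here, 3.78)
score = sum(responses) / len(responses)
print(round(score, 2))
```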

Statistical analysis

If leniency had been concentrated in one or two of the nine questions, a potential intervention would have been either to modify the question(s) or to provide an example of behaviour that should affect the answer to the question(s) (see Discussion). In contrast, if leniency were present throughout all questions, then an analysis of leniency would need to incorporate the raters’ average scores. Whether leniency was present in a few vs all questions was addressed by analyzing the mean ratings for each combination of the 65 raters and nine questions. Each of these means was based on at least 37 answers (mean, 210 answers), large enough that the means of the ordinal responses of 1, 2, 3, or 4 can be treated as interval-level measurements. Cronbach’s alpha, a measure of internal consistency among answers to questions (Footnote 2), was calculated using the resulting 65 × 9 matrix, equally weighting each rater. The confidence interval (CI) for Cronbach’s alpha was calculated using the asymptotic method.18
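For illustration, the following is a minimal sketch of this internal-consistency calculation, assuming the 65 × 9 matrix of per-rater mean answers is available as a NumPy array (rows = raters, columns = questions); the data here are simulated. The percentile bootstrap interval is shown only as a stand-in for the asymptotic CI method of reference 18, which is not reproduced.

```python
import numpy as np

def cronbach_alpha(x: np.ndarray) -> float:
    """Cronbach's alpha for a raters-by-questions matrix of mean answers."""
    k = x.shape[1]                           # number of questions (items)
    item_var = x.var(axis=0, ddof=1).sum()   # sum of per-question variances
    total_var = x.sum(axis=1).var(ddof=1)    # variance of per-rater totals
    return k / (k - 1) * (1 - item_var / total_var)

def bootstrap_ci(x: np.ndarray, n_boot: int = 10_000, seed: int = 0):
    """Percentile bootstrap CI for alpha, resampling raters (rows) with replacement."""
    rng = np.random.default_rng(seed)
    n = x.shape[0]
    stats = [cronbach_alpha(x[rng.integers(0, n, n)]) for _ in range(n_boot)]
    return np.percentile(stats, [2.5, 97.5])

# Simulated 65 x 9 matrix of per-rater mean answers (1-4 scale):
# a rater-specific tendency (leniency) plus small question-specific noise
rng = np.random.default_rng(1)
rater_effect = rng.normal(3.7, 0.2, (65, 1))
mean_answers = np.clip(rater_effect + rng.normal(0, 0.05, (65, 9)), 1, 4)

print(cronbach_alpha(mean_answers), bootstrap_ci(mean_answers))
```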

We analyzed the statistical distribution of the rater leniency scores. For each observed rater-ratee pair, we calculated the mean score specific to that pair.8,15,19 (Footnote 3) For each rater, we then obtained the equally weighted average of that rater’s pair-specific mean scores across all ratees (Fig. 1).8,C We previously showed that the number of scores per pair differs markedly among raters for each ratee (i.e., there is non-random assignment of residents and anesthesiologists, such that leniency will not average out; P < 0.001).8,19
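A minimal sketch of this two-level averaging, assuming a hypothetical long-format table with one row per submitted evaluation and columns named rater, ratee, and score (the mean of the nine answers):

```python
import pandas as pd

# Hypothetical long-format data: one row per evaluation
df = pd.DataFrame({
    "rater": ["R1", "R1", "R1", "R2", "R2", "R3"],
    "ratee": ["A",  "A",  "B",  "A",  "B",  "B"],
    "score": [4.00, 3.78, 3.56, 4.00, 4.00, 3.89],
})

# Mean score for each observed rater-ratee pair
pair_means = (
    df.groupby(["rater", "ratee"])["score"]
      .mean()
      .rename("pair_mean")
      .reset_index()
)

# Equally weighted average of the pair means, by rater (rater leniency; Fig. 1)
rater_leniency = pair_means.groupby("rater")["pair_mean"].mean()

# The same approach, by ratee (ratee averages used in the comparisons below)
ratee_average = pair_means.groupby("ratee")["pair_mean"].mean()

print(rater_leniency, ratee_average, sep="\n")
```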

Fig. 1

Distribution of the average scores among raters. There were 3,421 observed combinations of the 65 raters (i.e., residents) and the 97 ratees (i.e., faculty anesthesiologists). For each combination, the mean was calculated. Then, for each rater, the average of the means across ratees was calculated. Those 65 values are shown in the figure. For details, see Footnote C and the companion papers of References (8) and (15). The dot plot shows that these averages were not normally distributed. The dot plot uses rounding for presentation; only one of the 65 raters had all scores equal to 4.00 (i.e., the average of the averages was also equal to 4.00)

The same approach was used to calculate the average of the means by ratee. To assess whether the scores of individual ratees were unusually low or high, we compared each ratee’s average of the means with the value of 3.80 using Student’s t test and the Wilcoxon signed-rank test. The value of 3.80 is the overall mean supervision score among all ratees’ scores (see Results). The P values for the Wilcoxon signed-rank test were exact, calculated using StatXact® 11 (Cytel Inc., Cambridge, MA, USA). Student’s t test does not adjust for rater leniency.
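A minimal sketch of these per-ratee comparisons against the overall mean of 3.80; the ratee and its rater-specific mean scores are hypothetical, and SciPy’s exact Wilcoxon P value is used here as a stand-in for the StatXact calculation.

```python
import numpy as np
from scipy import stats

OVERALL_MEAN = 3.80

# Hypothetical: rater-specific mean scores for one ratee
ratee_pair_means = {
    "A": np.array([3.56, 3.61, 3.67, 3.72, 3.78, 3.83, 3.89, 3.94, 4.00]),
}

for ratee, means in ratee_pair_means.items():
    # Student's t test of the ratee's per-rater means against 3.80
    t_res = stats.ttest_1samp(means, popmean=OVERALL_MEAN)
    # Wilcoxon signed-rank test on the paired differences (exact P value)
    w_res = stats.wilcoxon(means - OVERALL_MEAN, method="exact")
    print(ratee, round(t_res.pvalue, 4), round(w_res.pvalue, 4))
```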

Stata® 14.1 (StataCorp LP, College Station, TX, USA) was used to perform mixed-effects analyses treating the rater as a categorical fixed effect and the ratee as a random effect. Results of the mixed-effects analyses allowed us to assess the ratees’ quality of supervision. Mixed-effects analyses were carried out separately for two dependent variables: 1) the average score, and 2) the binary variable of whether the score equalled the maximum of 4.00 (Fig. 2). The logistic regression was performed using the “melogit” command with mean-variance adaptive Gauss-Hermite quadrature and 30 quadrature points. As described later (see Results), several analyses were repeated using other estimation methods, including cluster-robust variance estimation, an unstructured covariance matrix, and estimation by Laplace approximation. All tests for differences between ratees were performed treating two-sided P < 0.01 as statistically significant. The imbalance in the number of resident ratings per ratee is considered in Appendix 1.
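As a rough Python analogue of the first (linear) model, the following sketch uses statsmodels’ MixedLM with rater as a categorical fixed effect and a random intercept per ratee; the column names and simulated data are hypothetical. The binary outcome (score = 4.00) would require a frequentist mixed-effects logistic regression, as performed with Stata’s melogit above, and is not reproduced here.

```python
import numpy as np
import pandas as pd
import statsmodels.formula.api as smf

rng = np.random.default_rng(0)
n = 400

# Hypothetical evaluation-level data: columns "rater", "ratee", "score"
df = pd.DataFrame({
    "rater": rng.choice([f"R{i}" for i in range(10)], n),
    "ratee": rng.choice([f"A{i}" for i in range(15)], n),
    "score": np.clip(rng.normal(3.8, 0.3, n), 1, 4),
})

# Linear mixed model: rater as categorical fixed effect (leniency),
# ratee as random intercept (quality of supervision)
model = smf.mixedlm("score ~ C(rater)", data=df, groups="ratee")
result = model.fit()
print(result.summary())

# Predicted random effects (shrunken ratee-level deviations from the mean)
print(result.random_effects)
```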

Fig. 2

Distribution among raters of their percentage of scores equal to the maximum value of 4.00 (i.e., all nine questions were answered “always”). The figure shows percentages for the 65 raters (i.e., residents). Unlike in Fig. 1, the 13,664 individual scores were used, and 9,305 (68.1%) were equal to 4.00. The dot plot shows a wide spread of percentages among raters. The dot plot uses rounding for presentation; matching Fig. 1, there is just one rater with all scores equal to 4.00 (i.e., 100%). Among the 49 raters each with > 12 scores ≤ 4.00, the logits followed a normal distribution: Lilliefors test, P = 0.24. Among the 38 raters each with ≥ 19 such scores, Lilliefors test, P = 0.21

Results

Internal consistency of raters’ answers to the nine questions contributing to the score

Individual questions did not contribute disproportionately to leniency (i.e., considering individual questions could not improve the statistical modelling). Cronbach’s alpha for the raters’ answers to the nine questions was high (0.977; 95% CI, 0.968 to 0.985); therefore, the score for each rating (i.e., the mean of the answers to the nine questions in the supervision scale; Table 1) could be used.

Statistical distributions of rater and ratee scores

We used the 13,664 scores (Footnote 4), with 3,421 observed combinations of the 65 raters and 97 ratees. In Appendix 2, we show that the statistical assumptions of a random effects model on the original score scale were not valid.20,21

We treated the rater as a fixed effect to incorporate rater leniency in a mixed-effects logistic regression model. Fig. 2 shows the distribution among raters of the percentage of scores equal to the maximum value of 4.00 (i.e., all nine questions answered “always”). The 65 raters differed significantly from one another in the percentages of their scores equal to 4.00 (P < 0.001 using fixed-effect logistic regression).
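A minimal sketch of this rater-only comparison, assuming a hypothetical evaluation-level table with a binary indicator of whether the score equalled 4.00; the overall test of whether raters differ is the model’s likelihood-ratio test against an intercept-only model.

```python
import numpy as np
import pandas as pd
import statsmodels.formula.api as smf

rng = np.random.default_rng(0)
n = 2000

# Hypothetical evaluation-level data
df = pd.DataFrame({"rater": rng.choice([f"R{i}" for i in range(20)], n)})
df["max_score"] = rng.binomial(1, 0.68, n)   # 1 if the score equalled 4.00

# Fixed-effect logistic regression: does the probability of a maximum
# score differ among raters?
fit = smf.logit("max_score ~ C(rater)", data=df).fit(disp=False)
print(fit.llr_pvalue)   # likelihood-ratio test vs the intercept-only model
```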

The mixed-effects model with rater as a fixed effect and ratee as a random effect relies on the assumption that the distribution of the logits among ratees follows a normal distribution; this assumption was satisfied. Specifically, no ratee had all scores equal to 4.00 (for which the logit would have been undefined because of division by zero), and no ratee had all scores less than 4.00 (for which the logit would also be undefined). There were 60 ratees, each with ≥ 14 scores lower than 4.00 among their ≥ 32 scores (i.e., sample sizes large enough to obtain reliable estimates of the logits).8,22 The logits followed a normal distribution [Lilliefors test, P = 0.50; mean (standard deviation [SD]), −0.781 (0.491)].
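A minimal sketch of this normality check, assuming a hypothetical set of per-ratee counts of scores < 4.00 and total scores; each ratee’s logit is log(p/(1 − p)) of the proportion of scores below the maximum.

```python
import numpy as np
from statsmodels.stats.diagnostic import lilliefors

# Hypothetical: per ratee, (number of scores < 4.00, total number of scores)
counts = {"A1": (14, 40), "A2": (20, 55), "A3": (17, 60), "A4": (25, 62),
          "A5": (15, 33), "A6": (30, 80), "A7": (22, 45), "A8": (12, 50)}

logits = np.array([np.log((below / total) / (1 - below / total))
                   for below, total in counts.values()])

# Lilliefors (Kolmogorov-Smirnov) test of normality of the ratee logits
stat, pvalue = lilliefors(logits, dist="norm")
print(stat, pvalue)
```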

Effectiveness of logistic regression with leniency relative to Student’s t tests (i.e., without adjustment for leniency)

Figures 3 and 4 show each ratee’s average score (i.e., the average of the rater-specific means, equally weighting each rater; see “Statistical distributions of rater and ratee scores” above).4,8,10,15 We use symbols to indicate whether the ratee’s average score differed significantly (P < 0.01) from the average score among all ratees when using Student’s t test (i.e., without adjustment for rater leniency). We also indicate whether the ratee’s percentage of scores < 4.00 differed from that of other ratees when using mixed-effects logistic regression, with rater leniency treated as a fixed effect and ratee as a random effect. We subsequently refer to that mixed-effects model as “logistic regression with leniency”.

Fig. 3

Comparison between Student’s t test and logistic regression with leniency for identification of ratees with below average quality of supervision. Among the 97 ratees (i.e., faculty anesthesiologists), there were 14 ratees with fewer than the minimum of nine different raters (i.e., residents) required for reliability. Those 14 ratees were excluded. Among the other 83 ratees, there were 30 with average scores < 3.80, with the averages calculated as in Fig. 1, equally weighting each rater. The data for the 30 ratees are shown. Among the 30 ratees, eight are plotted using red triangles. Student’s t test found that the averages for these eight ratees were < 3.80 (P < 0.01). The mixed-effects logistic regression with rater leniency treated as a fixed effect also found that the percentage of scores equal to 4.00 for each of these eight ratees differed from that of other anesthesiologists (P < 0.01). There are 13 ratees plotted using orange squares. These ratees were not significant by Student’s t test (P ≥ 0.025), but were significant by logistic regression with leniency (P ≤ 0.0068). There are nine ratees plotted using blue circles. Neither of the two statistical methods found these ratees to be significant at P < 0.01. There were no ratees for which Student’s t test was significant but logistic regression with leniency was not; the black X is therefore shown in the legend only to match Fig. 5, and no such data points are plotted. The scale of the vertical axis in Fig. 3 differs from that in Fig. 4

Fig. 4

Comparison between Student’s t test and logistic regression with leniency for identification of ratees with greater than average quality of supervision. This figure matches Fig. 3, except that it is for the 53 ratees (i.e., faculty anesthesiologists) with average scores > 3.80. Among the 53 ratees, there are 13 plotted using red triangles. Student’s t test found these 13 ratees’ averages to be > 3.80 (P < 0.01). The mixed-effects logistic regression with rater leniency treated as a fixed effect also found that the percentage of scores equal to 4.00 for each of these 13 ratees differed from that of other anesthesiologists (P < 0.01). There are seven ratees plotted using orange squares. These ratees were not significant by the Student’s t test (P ≥ 0.016), but were significant by logistic regression with leniency (P ≤ 0.0034). There are 30 ratees plotted using blue circles. Neither of the two statistical methods found these ratees to be significant at P < 0.01. There are three ratees plotted using black Xs. These were significant by the Student’s t test (P ≤ 0.0018) but not by logistic regression with leniency (P ≥ 0.32). The scale of the vertical axis in Fig. 4 differs from that in Fig. 3

The principal result is that 20 of the 97 ratees were identified as outliers by logistic regression with leniency but not by Student’s t tests, whereas only three of the 97 ratees were identified as outliers by Student’s t tests but not by logistic regression with leniency. The difference (20 vs 3) was statistically significant (exact McNemar’s test, P < 0.001). Thus, adjusting for rater leniency increased the ability to distinguish the quality of anesthesiologists’ clinical supervision.
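A minimal sketch of this paired comparison of the two methods’ classifications; the exact McNemar test depends only on the discordant counts (20 and 3), so the concordant cells shown are placeholders.

```python
from statsmodels.stats.contingency_tables import mcnemar

# 2 x 2 table of ratee classifications (outlier vs not) by the two methods.
# Rows: Student's t test; columns: logistic regression with leniency.
# The off-diagonal (discordant) counts are 3 and 20 as reported; the
# concordant cells are placeholders and do not affect the exact test.
table = [[21, 3],
         [20, 53]]

result = mcnemar(table, exact=True)
print(result.statistic, result.pvalue)
```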

In Appendix 3, we confirm the corollary that the individual values of scores < 4.00 provide less information than does the percentage of scores equal to the maximum of 4.00.

In Appendix 4, we show that our previous observation of an increase in supervision score over time with evaluation and feedback (Table 2.17)4 holds when analyzed using logistic regression with leniency.

In Appendix 5, we show that our previous analyses and publications without consideration of rater leniency were reasonable because initially there was greater heterogeneity of scores among ratees.

Graphical presentation of the principal result

In this final section, we examine why incorporating rater leniency increased the sensitivity to detect both below average and above average performance differences among ratees. Readers who are less interested in “why” may want to go directly to the Discussion.

The figures are divided between descriptions of the supervision scores of ratees (i.e., anesthesiologists) with scores less than (Figs 3, 5) and greater than (Figs 4, 6) the overall average score of 3.80. No ratee happened to have an average score equal to 3.80. Thus, the division reflected different concerns, i.e., identification of ratees potentially performing below average vs those potentially performing above average.

Fig. 5

Comparison between logistic regression with and without leniency for identification of ratees with below average quality of supervision. As described in the Fig. 3 legend, data for 30 ratees (i.e., faculty anesthesiologists) are displayed, each with average scores < 3.80. Among the 30 ratees, 15 are plotted using red triangles. Both logistic regression models detected that these ratees had a significantly (P < 0.01) greater percentage of scores < 4.00 than other ratees. There are six ratees plotted using orange squares. Logistic regression with leniency, but not without leniency, detected these ratees as providing significantly lower quality supervision than other ratees. There are nine ratees plotted using blue circles. Neither of the two statistical methods found these ratees to be significant at P < 0.01. Including leniency in the logistic regression did not prevent significance for any ratees

Fig. 6

Comparison between logistic regression with and without leniency for identification of ratees with greater than average quality of supervision. This figure matches Fig. 5, except that it is for the 53 ratees (i.e., faculty anesthesiologists) with average scores > 3.80. Among the 53 ratees, 13 are plotted using red triangles. Both logistic regression models detected that these ratees had a significantly (P < 0.01) smaller percentage of scores < 4.00 than other ratees. There are seven ratees plotted using orange squares. Logistic regression with leniency, but not without leniency, detected these ratees as providing significantly higher quality supervision than other ratees. There are 30 ratees plotted using blue circles. Neither of the two statistical methods found these ratees to be significant at P < 0.01. Including leniency in the logistic regression did not prevent significance for any ratees

Statistical significance of the logistic regression with leniency depended on the number of scores < 4.00, shown on the vertical axes of Figs 3-6 (see Appendix 1 and Appendix 6). For a given ratee average score, blue circles (indicating lack of statistical significance) occur more often at smaller sample sizes than do red triangles and orange squares.

Among the 30 ratees with average scores < 3.80, 13 were not significantly different from the average of 3.80 using the Student’s t test, but were significantly different from the other ratees by logistic regression with leniency (Fig. 3). For illustration, we consider the ratee with an average score of 3.56, shown by the left-most orange square. This score was the smallest value not found to be significantly less than the overall average of 3.80 using the Student’s t test, but found to differ significantly from the other ratees by logistic regression with leniency. In Appendix 7, we show that this finding was caused by substantial variability among raters (i.e., residents) regarding how much the ratee’s quality of supervision was less than the maximum score (4.00).

Among the 53 ratees who had average supervision scores > 3.80 and who had at least nine different raters, seven were not significantly different from average as determined by the Student’s t test, but were significantly different using logistic regression with leniency (Fig. 4). There were 3/53 ratees who were significantly different from average by the Student’s t test, but not significantly different using logistic regression with leniency. For illustration, we consider the ratee with the highest average score. In Appendix 8, we show that logistic regression with, or without, leniency (Fig. 6) lacked statistical power to differentiate this ratee from other anesthesiologists because the ratee had above average quality of supervision and relatively few clinical days (i.e., ratings).

Discussion

The supervision scores are the cumulative result of how the anesthesiologists perform in clinical environments. The scores reflect in situ performance and can improve with feedback.4,15 Supervision scores are used in our department for mandatory annual collegiate evaluations and for maintenance of hospital clinical privileges (i.e., the United States’ mandatory semi-annual “Ongoing Professional Practice Evaluation”). Consequently, the statistical comparisons could reasonably be considered to represent high-stakes testing (Footnote 5). We therefore considered statistical approaches that satisfy statistical assumptions as much as possible. In addition, we conservatively treated as statistically significant only those differences in ratee scores with P < 0.01 and used random effects modelling (i.e., shrinkage of estimates for anesthesiologists with small sample sizes toward the average).23,24,25,26,27 Nevertheless, we showed that mixed-effects logistic regression modelling, with rater leniency entered as a fixed effect, detected more performance outliers than did Student’s t test (i.e., without adjustment for rater leniency). Comparing the mixed-effects logistic regression model with rater leniency against multiple Student’s t tests, rather than against a random effects model of the average scores without rater leniency, resulted in a lesser chance23,24,25 of detecting a benefit of the logistic regression (i.e., our conclusion is deliberately conservative).

Previous psychometric studies of anesthesiologists’ assessments of resident performance have also found significant rater leniency.28,29 Even with an adjustment of the average scores for rater leniency, the number of different ratings that faculty needed for a reliable assessment of resident performance exceeded the total number of faculty in many departments.28 Our paper provides a methodological framework for future statistical analyses of leniency for such applications.

Suppose the anesthesiologists were divided into nine (3 × 3) categories: a less than average, average, or greater than average annual number of clinical days (and hence number of evaluations of their clinical performance), crossed with less than average, average, or greater than average quality of supervision. We think that, among these nine groups, the least institutional cost of misclassifying the quality of clinical supervision (below average, average, above average) would be to consider the group of anesthesiologists with a less than average clinical workload and greater than average quality of supervision as providing average quality of supervision. Because this was the only group that was “misclassified” through use of logistic regression with leniency, we think it is reasonable managerially to use this method to analyze the supervision data.

We showed that leniency in the supervision scale (Table 1) was caused by the cumulative effect of all questions (i.e., leniency was not the disproportionate effect of a few questions). If an individual question had accounted for variability in leniency among raters, providing examples of behaviour corresponding to an answer could have been an alternative intervention to reduce leniency. Because our department provides OR care for a large diversity of procedures, it is not obvious to us how to provide such examples: there are many different interactions between residents and anesthesiologists that could contribute to less than or greater than average quality of supervision.1,10 Nevertheless, the finding that leniency arises from the cumulative effect of all questions shows that the issue is moot. Variability in rater leniency is the result of the raters’ overall (omnibus) assessments of anesthesiologists’ performance, without distinction among the nine items describing specific attributes of supervision.

The supervision score is a surrogate for whether a resident would choose the anesthesiologist to care for their family (Table 2.7).7 Supervision scores for specific rotations are associated with perceived teamwork during the rotation (Table 2.8).12 Observation of intraoperative briefings has found that sometimes anesthesiologists barely participate (e.g., being occupied with other activities).30 Team members can “rate the value” of the intraoperative briefing performed “in the OR when the patient is awake”.31 Thus, we have hypothesized that leniency may be related to interactions among organizational safety culture, residents’ perceptions of the importance of the intraoperative briefing to patient outcome, and the anesthesiologists’ participation (or lack thereof) in the briefings. Our finding of high internal consistency of raters’ answers across the nine questions shows that such a hypothesis cannot be supported. Supervision begins when residents and anesthesiologists are assigned cases together, ends after the day’s patient care is completed, and includes inseparable attributes (Table 1). Future studies could evaluate whether rater leniency is personality based and/or applies to rating other domains such as quality of life.

Our findings are limited by raters being nested within departments (i.e., residents in one department rarely work with anesthesiologists in other departments). Consequently, for external reporting, we recommend that evaluation of each ratee (anesthesiologist, subspecialty,12 or department11) be performed using the equally weighted average of the scores from each rater. Results are reported as average scores of equally weighted raters, along with confidence intervals.8,C In contrast, for assessment and progressive quality improvement within a department, we recommend the use of mixed-effects logistic regression with rater leniency. Results are reported as odds ratios, along with confidence intervals. Regardless, in situ assessment of the quality of supervision depends (Figs 4 and 6) on there being at least nine (and preferably more) unique raters for each ratee (Table 2.11).7 Although this generally holds for operating room anesthesia, it can be a limitation for specialties (e.g., chronic pain) in which residents rotate for weeks at a time and work with only one or two attending physicians.