Assessment of Emergency Medicine Resident Performance in a Pediatric In Situ Simulation Using Multi-Source Feedback

Introduction: Multi-source feedback (MSF) is an evaluation method mandated by the Accreditation Council for Graduate Medical Education (ACGME). The Queen’s Simulation Assessment Tool (QSAT) has been validated as able to distinguish between resident performances in a simulation setting. The QSAT has also demonstrated excellent MSF agreement when used in an adult simulation performed in a simulation lab. Using the QSAT, this study sought to determine the degree of agreement of MSF in a single pediatric (Peds) simulation case conducted in situ in a Peds emergency department (ED).

Methods: This Institutional Review Board-approved study was conducted in a four-year emergency medicine residency. A Peds resuscitation case was developed with specific behavioral anchors on the QSAT, which uses a 1-5 scale in each of five categories: Primary Assessment, Diagnostic Actions, Therapeutic Actions, Communication, and Overall Assessment. Data were gathered from six participants for each simulation. The lead resident self-evaluated and received MSF from a junior peer resident, a fixed Peds ED nurse, a random ED nurse, and two faculty (one fixed, the other from a dyad). Agreement was calculated with intraclass correlation coefficients (ICC).

Results: The simulation was performed on 35 separate days over two academic years. A total of 106 MSF participants were enrolled. Enrollees included three faculty members, 35 team leaders, 34 peers, 33 ED registered nurses (RN), and one Peds RN; 50% of the enrollees were female (n=53). Mean QSAT scores ranged from 20.7 to 23.4. Fair agreement was demonstrated via ICC; there was no statistically significant difference between sources of MSF. Removing self-evaluation led to the highest ICC. The ICC for any single or grouped non-faculty source of MSF was poor.

Conclusion: Using the QSAT, the findings from this single-site cohort suggest that faculty must be included in MSF. Self-evaluation appears to be of limited value in MSF with the QSAT. The degree of MSF agreement as gathered by the QSAT was lower in this cohort than previously reported for adult simulation cases performed in the simulation lab. This may be due to the pediatric nature of the case, the location of the simulation, or both.


Introduction
Pediatric (Peds) simulation using a simulation mannequin can help to assess a learner's progress through residency training across a variety of emergency department (ED) case presentations in a controlled, structured environment. Since its introduction, simulation has evolved in application from teaching alone to serving as a means of assessment as well. Assessment of some residency core competencies has been found to be better conducted through simulation than through other traditional means [1]. The Accreditation Council for Graduate Medical Education (ACGME) requires the assessment of emergency medicine (EM) residents through a set of milestones, and simulation has been deemed an acceptable form of assessment specifically for Milestones 1-11 and 16-23 [2].
There are several different proposed tools to evaluate EM learners in the simulation environment [3][4][5], but few of them have been validated in multicenter studies like the Queen's Simulation Assessment Tool (QSAT) [6]. First described in 2012, the QSAT was developed via a Delphi process with the intent of being "simple and modifiable" to new simulation cases [6,7]. The QSAT measures four performance domains (initial assessment, diagnostic approach, therapeutic approach, and communication skills) in addition to providing a single global assessment [7]. These five categories are all measured on a 5-point scale [7]. The QSAT was validated for resident assessment in resuscitation cases and was able to differentiate between post-graduate year (PGY) classes by demonstrating persistently higher scores among senior residents [8]. In these studies, the QSAT was completed exclusively by EM faculty [6][7][8].
According to the ACGME, multi-source feedback (MSF) is a suggested method of evaluation for 10 of the 23 Milestones [2]. Data for MSF is drawn from learners being evaluated by peers, colleagues, and/or supervisors. Several studies have demonstrated adequate reliability using simulation performance checklists to generate MSF [9][10][11].
New research continues to be conducted regarding the use of MSF combined with simulation. Previous research using the QSAT for MSF has shown favorable outcomes, with excellent inter-rater reliability (IRR) in an adult simulation case in the simulation lab [12]. However, the QSAT has not yet been studied in a Peds simulation conducted in the clinical environment (in situ). In situ simulation has demonstrated benefits over simulation in the simulation lab, including improved teamwork, patient safety, cost, availability, opportunity for repetition, and a more realistic setting [13,14]. Our study aimed to determine the concordance of rater evaluations on the QSAT when used to provide MSF assessing EM resident performance in a single, standardized, in situ Peds simulation resuscitation case performed within a children's ED.

Materials And Methods
This prospective study was an a priori planned extension of a two-part study. Part one of the Institutional Review Board (IRB) protocol entailed the simulation of an adult toxic ingestion case conducted in the simulation lab. Those MSF results were previously published [12]. This study represents part two of the IRB protocol and evaluates the concordance of MSF for a Peds simulation case conducted in situ. Similar to the QSAT development studies, a single Peds resuscitation case was developed by a group of simulation-trained EM physicians using standard simulation templates [6][7][8]. The case, a toxic ingestion, was run using a high-fidelity Peds simulation mannequin in a patient room within the children's ED. All simulations were conducted during a single standard in situ time: 0700 to 0730 on Tuesdays. A maximum of two simulations was conducted in any single calendar month.
The study was conducted at a four-year dually approved EM residency at an independent academic center within a suburban healthcare network. The program trains 14 residents per PGY. All study participants consented to participation in the simulation. As part of the consent process, an independent party's contact information was provided. This party, an educator within the Network's Department of Education, could be contacted by study subjects to allow for anonymous removal from the study.
All EM residents in their PGY 2-4 levels of training were eligible for enrollment as team leaders. The team leader was provided MSF using the QSAT, a previously validated rubric [6][7][8]. The QSAT assesses resident performance in four categories using case-specific behavioral anchors: primary assessment of the patient; diagnostic testing; treatment of the underlying condition; and interpersonal communication with staff and consultants. The QSAT with the case-specific behavioral anchors is shown in Table 1. A fifth category, overall assessment, is holistic and was left to the interpretation of the assessor.

Table 1: Queen's Simulation Assessment Tool (QSAT) With Behavioral Anchors
MSF was provided to the team leader from five sources: two attendings, two nurses, and a resident peer. The team leader, through self-assessment, provided the sixth source. As with part one, in order to control for issues of multiple comparisons, each source of MSF contributed in proportions determined a priori [11]. For the attendings, a single fixed EM core teaching faculty member was present at all simulations, and an EM faculty dyad comprising two core faculty participated in 50% of the cases each. Nursing [registered nurse (RN)] MSF was provided by a random, general ED nurse (ED RN), enrolled once and only once, and a fixed Peds EM-trained nurse (Peds RN), who was present at all simulations. The peer resident, enrolled once and only once, was intentionally junior (as measured by PGY) to the team leader. Each team leader, enrolled once for the study, performed a self-evaluation.
The QSAT was completed by all participants in the patient room utilized for the simulation as soon as the case was completed. While no prior training on the QSAT was provided to the study participants, the evaluating faculty had experience in developing the simulation case and QSAT anchors. Participants provided personal demographics prior to the simulation. PGY was based on the date of enrollment.

Data analysis
The analytic plan has been previously described [12]. Descriptive statistics were used to characterize the sample; counts and percentages were reported for categorical variables. Normally distributed continuous data were presented as mean and standard deviation; the median was used for non-normal distributions. Normality was determined by a skew between -1 and +1 and by visual inspection of the histogram plots. The plan to avoid issues with repeated measures is presented above.
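As an illustration of the normality screen described above, skewness can be computed directly and checked against the ±1 window. The exact estimator used by the statistical packages is not specified in the text, so this simple moment-based form (g1) is an assumption, and the scores shown are hypothetical:

```python
def sample_skew(xs):
    """Moment-based skewness g1 = m3 / m2**1.5 (population moments)."""
    n = len(xs)
    mean = sum(xs) / n
    m2 = sum((x - mean) ** 2 for x in xs) / n  # second central moment
    m3 = sum((x - mean) ** 3 for x in xs) / n  # third central moment
    return m3 / m2 ** 1.5


def is_approximately_normal(xs):
    """Apply the study's screening rule: skew between -1 and +1."""
    return -1.0 < sample_skew(xs) < 1.0


# Hypothetical QSAT totals: mostly high scores with one low outlier,
# producing the negative skew pattern described in the Results.
totals = [25, 24, 25, 23, 24, 25, 10]
```

A cluster of high scores with a few low outliers, as reported for this cohort, yields a negative g1, which is why the median rather than the mean would be reported for such a distribution.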
The cumulative QSAT score was the sum of the ratings in each of the five sections, resulting in totals ranging from 5-25. As in part one, the hypothesis was tested via IRR, obtained as intraclass correlation coefficients (ICC) for the groups of raters [12]. Two-way random ICCs were used to determine the average level of absolute rater agreement among all raters within each simulation. ICC interpretation follows Cicchetti, who defined results less than 0.40 as poor, 0.40 to 0.59 as fair, 0.60 to 0.74 as good, and ≥0.75 as excellent [15]. ICC was calculated systematically across the different sources of MSF.
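The two-way random, absolute-agreement, average-measures ICC described above can be sketched from the ANOVA mean squares, following the McGraw and Wong formulation that SPSS implements. This is an illustrative re-derivation, not the study's code, and the score matrix below is hypothetical:

```python
def icc2k(scores):
    """Two-way random-effects, absolute-agreement, average-measures ICC.

    scores: one row per subject (simulation), one column per rater
    (source of MSF). ICC(2,k) = (MSR - MSE) / (MSR + (MSC - MSE)/n).
    """
    n = len(scores)       # number of subjects
    k = len(scores[0])    # number of raters
    grand = sum(sum(row) for row in scores) / (n * k)
    row_means = [sum(row) / k for row in scores]
    col_means = [sum(scores[i][j] for i in range(n)) / n for j in range(k)]

    # ANOVA mean squares: rows (subjects), columns (raters), residual error
    ms_rows = k * sum((m - grand) ** 2 for m in row_means) / (n - 1)
    ms_cols = n * sum((m - grand) ** 2 for m in col_means) / (k - 1)
    ss_err = sum(
        (scores[i][j] - row_means[i] - col_means[j] + grand) ** 2
        for i in range(n)
        for j in range(k)
    )
    ms_err = ss_err / ((n - 1) * (k - 1))

    return (ms_rows - ms_err) / (ms_rows + (ms_cols - ms_err) / n)


def cicchetti(icc):
    """Interpretation bands per Cicchetti [15]."""
    if icc < 0.40:
        return "poor"
    if icc < 0.60:
        return "fair"
    if icc < 0.75:
        return "good"
    return "excellent"
```

When every rater returns identical scores for each subject, both the rater and residual mean squares vanish and the ICC is exactly 1; any rater disagreement pulls it below 1, and the Cicchetti bands then label the result as reported in Tables 4 and 5.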
All analyses were two-tailed with the alpha set at 0.05. All statistical analyses were performed using SAS version 9.3 (SAS Institute, Cary, NC) and SPSS Statistics version 24 for Windows (IBM, Armonk, NY).

Results
This study enrolled 106 participants prospectively during 35 separate in situ Peds simulations over the course of two academic years. Enrollees included three faculty members, 35 team leaders, 34 peers, 33 ED RNs, and one Peds RN; 53 of the enrollees were female (50%). Team leaders were Doctors of Osteopathic Medicine (DO) (86%, n=30) and Doctors of Medicine (MD) (14%, n=5). The PGY of team leaders was as follows: three PGY 2, 18 PGY 3, and 14 PGY 4. Peers consisted of DO (85%, n=29) and MD (15%, n=5). The median years of experience for ED RNs was 2.5 (IQR: 1.0-9.0).

Table 2 reports, for each source of MSF, the average QSAT score for each QSAT category, along with the total score. For the entire cohort, self-evaluations accounted for the lowest summative scores and the lowest categorical scores for Primary Assessment, Diagnostic Actions (tied), and Overall Assessment. Self-evaluation was followed closely by the fixed faculty, who had the lowest categorical scores in Diagnostic Actions (tied), Therapeutic Actions, and Communication. Peer evaluations accounted for the highest summative scores and were the highest in each of the five categorical scores. These differences, based on 95% CI, were not statistically significant.

The skew of the ratings is presented in Table 3. All ratings are negatively skewed, indicating that most scores were high with a few outlying low scores. The ICCs in Table 4 demonstrate the degree of agreement between sources of MSF. This agreement analysis speaks to whether each source of MSF provided the same score as the other sources. Based on the definitions by Cicchetti, this agreement for all raters was fair at 0.531 (ICC 1) [15].

Discussion
The agreement between sources of MSF using the QSAT in this cohort was both fair at best and lower than we had previously reported [12]. The lesser agreement may be due to the simulation being a Peds case, an in situ simulation, or both.
Exposure to Peds resuscitation cases has been noted to be low for trainees [16]. The limited number of reliable and validated Peds simulation assessment tools has been previously noted [17]. A study of emergency medical services (EMS) providers found a longer time to intervention in Peds vs. adult simulation cases, though the difference was not statistically significant [18]. A study of adult EM physicians observed that pediatric skills begin to decrease after six months [19]. So while increasing the frequency of Peds critical care education may be a countermeasure to this skill decay, it may not improve inter-rater agreement for MSF on Peds cases. In fact, in this cohort, the negative skew in Table 3 demonstrates that MSF raters more frequently scored participants highly. This suggests that the decreased inter-rater agreement is likely more closely linked to confounders other than Peds critical care skills.
Another explanation is the in situ nature of this cohort. While in situ simulation has its benefits, it also poses unique challenges [13,14]. In situ simulation has been described as having different goals from simulations offered in the simulation lab [20]. In that work, "time pressures" are noted as a barrier to in situ simulations. Difficulties in locating eligible participants from the available clinical staff led to incomplete sources of MSF, as noted in Table 2. Since the enrollees, except for the faculty, needed to return to clinical duties, this may have impacted their QSAT scores. The impact of "time pressure" is complex and has been the subject of study in the sociology literature [21]. Issues noted include a "speed-accuracy trade-off" and that people cope by "attend(ing) selectively to information." That said, if MSF is to be utilized in the clinical environment, staff and faculty alike will experience some degree of "time pressure." If the impact of "time pressure" is one of decreased agreement, the ability of Clinical Competency Committees (CCC) to utilize MSF to evaluate residents based on clinical encounters may be problematic. The ACGME endorses simulation, where MSF may be obtained in real time outside of the temporal pressures of the ED. Otherwise, if the decreased agreement was in fact due to "time pressure," MSF would need to be obtained summatively rather than in real time.
This study is limited by the fact that the Peds case was internally developed. The negative skew noted in Table 3 may have occurred because the case was "too easy." This conclusion is further supported by the fact that Table 2 does not demonstrate a clear progression of scores as PGY increases. Since the study's hypothesis concerned the degree of agreement between different sources of MSF, not the creation of a simulation case that differentiates the abilities of the enrolled resident team leaders, the difficulty of the case should not impact the degree of agreement. The negative skew may, however, have artificially increased the agreement seen, since the lower part of the 1-5 scale was used with decreased frequency.

Even though Table 2 does not show a significant increase in total QSAT scores as PGY increases, it is interesting to note that as PGY increases, self-evaluation total scores decrease. This finding appears consistent with the "Dunning-Kruger effect," in which the less skilled overestimate their abilities, in large part because of a lack of understanding or "irrational optimism" [22]. Concordant with our initial findings, self-assessment decreased the correlation of MSF in this cohort as well [12]. Table 4 demonstrates that the greatest agreement occurs when the self rater is removed. The Milestone 19, Level 3 anchor notes that programs must ask the resident to self-assess [2]. The self-assessment behavioral anchor lies within the Problem-Based Learning competency, which also houses Evidence-Based Medicine (EBM) skills. It has been demonstrated that self-assessment of information literacy (EBM) skills has "little calibration" with actual skill [23]. In the end, the process of self-assessment may be better done holistically or summatively rather than on an individual case. At the very least, these data reinforce that the QSAT is not an appropriate instrument for self-assessment.
In this cohort, the total and categorical scores from the peer evaluator were the highest overall and for each stratified PGY, except for PGY 4 Primary Assessment and PGY 2 and 3 Diagnostic Actions. Peer raters had the highest, or were tied for the highest, score in 21 of the 24 possible categorical and total scores. A previous study found that providing "uphill" feedback is perceived as difficult by junior residents [24]. Concerns about the personal relationships of the residents were also noted. These high scores on the QSAT contrast with our prior work, where peer scores were almost universally lower than both faculty scores and occasionally lower than the EMS score [12]. Because this part of the larger study protocol followed the first, prior exposure to the protocol may have influenced the peer residents. The location (in situ while on shift vs. the simulation lab during grand rounds) and the duration of the study (the initial study was completed in weeks; this study spanned two academic years) may have also played a role. While the two cohorts demonstrate a different relationship between peer and faculty (ICC of 0.799 previously [12] vs. 0.349 here, Table 5), there is evidence that peer evaluations have a positive impact on performance [25]. For residents, peer evaluation of teaching in the clinical environment has been demonstrated to be feasible, suggesting that in certain circumstances, peer contributions to MSF on shift may be possible [26]. Peer contribution to MSF for the Professionalism and Communication competencies was determined to be both possible and reliable [27]. As noted above, for EM, these ACGME competencies may prove the most practical focus for future studies of MSF.
In the end, the decreased level of agreement in this study may most closely correlate with the decreased agreement between the faculty. Here, the agreement between the faculty was 0.570 (Table 5) as compared to 0.840 previously [12]. While the fixed faculty had lower scores than the dyad (Table 2), the 95% CIs always overlapped. Interestingly, the 95% CI drops into negative values when the faculty raters are removed (Table 5). Negative ICCs are theoretically difficult to interpret. Some argue that they are invalid (though still mathematically computable, which is why they are reported), while others argue that they are valid and simply signify disagreement or poor agreement [28,29]. By all measures, this cohort again confirms that faculty are a necessary component of MSF, and their inclusion yields the highest level of agreement. The 95% CI also drops into negative numbers when faculty and peer scores are combined. This stems from the elevated peer scores discussed above combined with the lower scores provided by the fixed attending. While conducted at the same program, the enrolled faculty providing scores did not overlap between the two studies. Taken together, the two cohorts appear to support the important role of the CCC. The ACGME notes that the "CCC can be an opportunity to balance out the 'hawks' and 'doves'" [30]. For programs attempting to utilize the QSAT for MSF, whether in the simulation lab, for in situ simulation, or for actual patient encounters, the CCC should interpret the findings based on their experience with the faculty who contribute the scores.

Limitations
Limitations of this study include its small sample size and single site of enrollment. While the QSAT has been previously validated, the Peds simulation case used here was not [6][7][8]. Not every Peds simulation conducted had all six sources of MSF. Consistent with prior work using the QSAT for MSF, the team members did not receive explicit QSAT training [12]. Training, described as simple, was provided in the prior faculty studies of the QSAT [6][7][8]. This lack of training, while felt to better reflect how the QSAT would be employed for in situ MSF or for MSF on actual patient encounters, may have impacted the results. Finally, the impact of the pediatric nature of the case and the in situ nature of the simulation cannot be separated. The case, the location, or both may have impacted the results.

Conclusions
In this single-site simulation study using a locally developed Peds toxic ingestion case, the QSAT demonstrated fair inter-rater agreement. This level of agreement may have been influenced by the in situ nature of the study, the pediatric nature of the case, or both. Unless faculty are included in QSAT MSF, agreement is poor, with a mathematical suggestion of disagreement. Peer evaluation consistently provided the highest scores in this cohort, though the differences between the sources of MSF were not significant. When using the QSAT, self-evaluation decreases agreement, suggesting that this form of MSF may not be appropriate with this instrument. Reliably gathering MSF to inform CCCs remains a challenge for EM, and future efforts may be best focused on ACGME Competencies other than Patient Care.

Additional Information
Disclosures