Application of the Multi-Facet Rasch Model to Evaluate the Chief Resident Selection Survey: A Two-Year Study

Introduction: The Chief Resident (CR) selection process is described by many residency programs as a collective effort from the residency program leadership, key faculty members, and resident peers. Unfortunately, the literature does not offer any established guidelines, methods, or psychometrically sound instruments to aid this process. The purpose of this study was to evaluate the properties of the newly developed CR selection survey across two years using the Multi-Facet Rasch Model (MFRM). Methods: This study used the MFRM to analyze two years of data from the newly developed CR selection survey. After the first implementation of the tool in 2015, the instrument underwent a second round of evaluation for the CR selection in 2016. We applied a three-facet Rasch model (candidates, questions, and raters). We used Facets v. 3.66 and SAS 9.4 (SAS Institute Inc., Cary, NC) for data analysis. Results: In 2015, 40 out of 100 residents completed the survey to select three of the four candidates for the 2017-2018 CR positions. Candidate 1 received the highest mean rating of 5.56, while Candidates 2 and 4 received the exact same mean ratings. The majority of survey items performed very well based on the MFRM results, while leaving room for improvement for a few items. In 2016, 55 out of 100 residents completed the revised survey to select three of the six candidates for the 2018-2019 CR positions. Candidate 3 received the highest mean rating of 5.81, while Candidate 2 received the lowest mean rating of 5.12. Based on the results from the revised survey, the item reliability improved from 0.70 to 0.88. The results were used to help inform decisions regarding the selection of chief residents. Conclusions: The CR selection process requires a fair and collective effort from program leadership, relevant faculty members, and input from the resident group. Our study demonstrated that the survey tool we developed is appropriate for selecting CR candidates and that MFRM is a promising technique for survey development and the evaluation of survey items.


Introduction
Chief residents (CRs) are selected for a leadership position that offers them experience in the clinical, administrative, and educational activities of a hospital department or a residency program. Their leadership plays a critical role in post-graduate medical training in the context of exercising their administrative duties, teaching activities, and patient care [1]. In 1889, William Stewart Halsted, the father of the modern era of surgery, pioneered the concept of the chief residency in academic residency training programs when he was the Chairman of Surgery at the Johns Hopkins Hospital [2][3]. Since then, almost all academic centers designate or appoint a few of their high-achieving resident physicians to spend either their final year or an additional year as CRs. The duties of CRs may vary depending on the size of the department, the length of the residency, and the number of housestaff members.
Although being selected as a CR is an honor and privilege sought by many residents at their respective programs, the literature does not identify any established guidelines, methods, or psychometrically sound instruments to augment the selection process. Only a few relevant studies were noted in the limited literature. One study by Jain et al. surveyed 83 program directors regarding the selection methods, duties, training, and evaluation of their CRs [4]. In this study, the authors showed that CRs were mostly appointed by their program directors or chairman, or through a vote of the faculty. Another study by Young et al. reported that the most common methods of CR selection were faculty vote (45%), combined resident-faculty vote (22%), and other (33%) [3]. On the other hand, many residency program websites describe the CR selection process as a collective effort between the Residency Program Director and key faculty members while also taking into account input from resident peers; however, a formal outline of the process is not shared, and specific tools, such as a validated survey to appropriately capture peer feedback, are not described. In essence, CR selection typically lacks a reliable process to capture the performance and leadership qualities of potential candidates.
In our Pediatric Residency Program, we annually select three CRs during their postgraduate year (PGY)-2 to serve during PGY-4. Candidates self-nominate by submitting biosketches that include a) background information, b) interests, c) professional plans beyond residency, d) a goal statement, e) qualities the candidates possess that would make them a good CR, and f) goals for the CR position. Each candidate goes through two sets of interviews with the residency program leadership: 1) 30-minute themed interviews with the Residency Program Director, the Chair of Graduate Medical Education (GME), and the Vice-Chair of GME; and 2) 30-minute group interviews with the Associate Program Directors, current Chief Residents, and Program Coordinators. Finally, a resident survey captures peer opinions on the CR candidates' skills. While this model facilitates annual deliberation for the CR selection, the previous ranking survey reflected the popularity of the CR candidates among peers and was not able to capture the personal and leadership characteristics desired in CR candidates. Therefore, we developed a unique survey instrument to capture those characteristics from the resident peers' perspectives. We intended to use the results of this survey during the CR selection consensus meeting that is scheduled after all interviews are completed.
Thus, the purpose of this study was to investigate the properties of the newly developed CR selection survey tool under the item response theory (IRT) framework [5][6]. Specifically, we utilized a multi-facet Rasch model (MFRM), an extended version of the one-parameter IRT technique [7][8], focusing on three aspects of our survey tool: 1) the reliability of the items included in the first and revised versions of the survey to differentiate each candidate for the CR position across the two years of the study; 2) the performance of each item with respect to rating scale function in both versions of the survey; and 3) raters' severity or leniency in rating each candidate.

Materials And Methods
This survey-based study was conducted at the Pediatric Residency Program at Children's Mercy Hospital in Kansas City, Missouri. Using the Multi-Facet Rasch Model (MFRM), we evaluated our CR selection survey tool from the resident peers' perspectives, as an addition to the traditional CR selection interview process. After the utilization of the results in the first implementation for the CR selection in 2015, the instrument underwent a second round of development for the CR selection in 2016.
We obtained the Children's Mercy Hospital Institutional Review Board approval 15100469 as non-human subjects for this study.

Tool development
To develop the CR selection survey, we conducted a comprehensive search within the Google, MEDLINE, PubMed Central, Scopus, PsycINFO, and ERIC databases to locate a validated tool that could measure the personal and leadership characteristics of a CR candidate. While we were not able to find well-established guidelines, methods, or psychometrically sound instruments for selecting CRs in residency programs, this literature search enabled us to prepare a comprehensive list of personal and leadership attributes desirable in a leadership position. Feedback from residency leadership then helped us determine key constructs with specific domains for the measurement items of the CR tool. After developing the first version of the survey instrument in 2015, we edited and finalized the items based on feedback obtained from stakeholders.
The CR selection survey tool had two parts. The first part consisted of two questions rating 16 descriptive characteristics on a six-point quality scale, from one representing "Very Poor" to six representing "Excellent," followed by an open-ended question. We used a six-point scale to force more positive or more negative response choices from respondents. All the items were listed alphabetically so as not to give priority to any of the traits and qualities. The second part of the survey had two rating-type questions with a comment box provided. After the implementation of this survey, we revised it for the next cycle in 2016 based on feedback from the program leadership involved in the CR selection process. In this version, while we revised a few items in the first part of the survey, we also added new items to further measure candidates' characteristics and leadership skills, which increased the number of items from 16 to 22 (Appendix 1). We obtained the face and content validity of the tool through peer reviews for both versions.
We administered the first version of the survey in September 2015 to identify candidates for the 2017-2018 CR positions. We used the revised version of the survey in September 2016 for the selection of the 2018-2019 CRs. We emailed the survey link for anonymous responses to all pediatric residents in our program, with two follow-up reminders for both implementations. Along with the survey link, we also provided the candidates' biosketches/CVs in our internal shared folder to give respondents further information about each candidate. There were four chief resident candidates in 2015 and six candidates in 2016. In our institution, all candidates are self-nominated for the CR positions, and their peers served as the raters in this study. Survey results are reviewed only by program leadership involved in CR selection and are not available to CR candidates or peers.

Data analysis
Traditionally, when analyzing survey questionnaires, a sum score or average score over all the survey questions is often used. However, respondents, also called raters when a rating scale is used in the survey, may have different thresholds or may use different criteria when they respond to survey questions. For example, it may have a different meaning when respondent 1 and respondent 2 both select "strongly agree" for the same survey question. Similarly, some respondents may endorse "Poor" after comparing all candidates (i.e., relative ranking) while others may endorse "Poor" without considering the other candidates (i.e., absolute ranking). This is called the rater effect or, more accurately, rater leniency or stringency in the literature. This rater effect cannot be captured using traditional average/sum scores.
In this study, we utilized the multi-facet Rasch model (MFRM) to address how the rater effect can be captured when analyzing the survey data. Thus, we applied a three-facet Rasch model to our study, including 1) Candidates; 2) Questions; and 3) Raters. MFRM is intended to calibrate the candidate ability or performance (corresponding to the sum or average score over all the survey items for each candidate), item difficulty (corresponding to the sum or average score over all the candidates and raters for each item), and rater stringency or leniency (corresponding to the sum or average score over all the candidates and items for each rater) into the same logit scale, facilitating direct comparison between elements. In the analyses, the item facet was centered on zero in the logit scale so that the candidate performance and rater stringency or leniency could be evaluated relative to the items. The advantage of MFRM is that all sources of variability, such as candidates, survey questions, and raters, which all have the potential to influence measurement results, are considered simultaneously [9].
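Concretely, the three-facet rating-scale MFRM models the log-odds of a candidate receiving rating category k rather than k-1 on an item from a rater as the candidate measure minus the item difficulty, the rater severity, and the category threshold. The following Python sketch, with purely illustrative logit values rather than estimates from our data (Facets performs the actual estimation), shows how the three facets combine on the same logit scale to yield category probabilities and a model-expected rating:

```python
import math

def mfrm_category_probs(ability, item_difficulty, rater_severity, thresholds):
    """Category probabilities under a three-facet rating-scale Rasch model.

    The log-odds of category k versus k-1 is
        ability - item_difficulty - rater_severity - thresholds[k-1],
    so the probability of category k is proportional to
    exp(sum of the step logits up to k).
    """
    step_logits = [0.0]  # lowest category is the reference
    cumulative = 0.0
    for f in thresholds:
        cumulative += ability - item_difficulty - rater_severity - f
        step_logits.append(cumulative)
    exps = [math.exp(s) for s in step_logits]
    denom = sum(exps)
    return [e / denom for e in exps]

# Illustrative logit values only (not estimates from our data): a strong
# candidate, an average-difficulty item, a slightly lenient rater, and
# five thresholds for a six-point scale like the one in our survey.
probs = mfrm_category_probs(ability=1.2, item_difficulty=0.0,
                            rater_severity=-0.5,
                            thresholds=[-2.0, -1.0, 0.0, 1.0, 2.0])
expected_rating = sum(k * p for k, p in enumerate(probs))  # on a 0-5 scale
```

Because all three facets enter the same linear predictor, a more able candidate, an easier item, or a more lenient rater each shifts the expected rating upward by the same logit mechanism, which is what allows the direct comparisons described above.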
We first analyzed the 2015 survey results to evaluate the ability of the CR candidates and to assess the quality of the survey items. Then we analyzed the 2016 survey results using a similar method to evaluate the candidates' abilities and to assess the quality of the revised survey items. For each year's data, we provided the results of: 1) The chi-square test: It evaluates the divergence between observed and model-expected scores [10]. We reported these test statistics as a regular check of the overall data-model fit. However, whether it is statistically significant or not would not be a primary concern under the MFRM framework [11].
2) Separation reliability: The separation reliability is equivalent to Cronbach's alpha, ranging from 0 to 1. According to Linacre [11], the separation reliability for the source of rater variance is used in calculating the inter-rater agreement index. The separation reliability is a measure of how different the measures are within each facet examined in this study.
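As a rough illustration of how a facet's separation reliability relates to its measures and standard errors, the following sketch computes the conventional Rasch reliability, i.e., the proportion of observed variance in the facet's measures that is not attributable to measurement error. The numbers are hypothetical and not taken from our Facets output:

```python
def separation_reliability(measures, std_errors):
    """Rasch separation reliability for one facet: the ratio of "true"
    (error-adjusted) variance to observed variance in the measures."""
    n = len(measures)
    mean = sum(measures) / n
    observed_var = sum((m - mean) ** 2 for m in measures) / n
    error_var = sum(se ** 2 for se in std_errors) / n  # mean-square error
    true_var = max(observed_var - error_var, 0.0)
    return true_var / observed_var if observed_var > 0 else 0.0

# Hypothetical candidate measures (in logits) with small standard errors:
# well-separated measures with precise estimates yield reliability near 1.
rel = separation_reliability([-1.0, -0.3, 0.4, 1.1], [0.1, 0.1, 0.1, 0.1])
```

This makes explicit why a high candidate reliability is desirable (candidates are genuinely distinguishable) while a high rater reliability signals severity/leniency differences that need adjustment.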
3) Measurement results of each element and each facet: In addition to the observed average ratings reported in other studies, the measurement on the logit scale and its associated standard error are also provided from the MFRM outputs. These are direct measures of the performance of each candidate, survey question, and rater. 4) Infit and outfit statistics: Infit and outfit statistics are two fit indices that represent the relationship between observed ratings and model-derived ratings for both items and candidates. Values equal to or near 1 show close correspondence between observed and expected values, while standardized values higher than 2 signal the presence of serious distortions in the data [12][13]. The difference between the infit and outfit values derives from the way in which the statistics are calculated; the outfit statistic is sensitive to outlying or extreme ratings.
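The infit and outfit mean squares can be sketched directly from their definitions: outfit is the unweighted average of squared standardized residuals, while infit weights the squared residuals by their model variance, which is why outfit reacts more strongly to a single extreme rating. The ratings, expectations, and variances below are hypothetical illustrations, not values from our analysis:

```python
def fit_mean_squares(observed, expected, variances):
    """Infit and outfit mean-square fit statistics for one element
    (an item or a candidate) under a Rasch model.

    observed  : the ratings the element actually received
    expected  : the model-expected rating for each observation
    variances : the model variance of each rating
    """
    sq_resid = [(x - e) ** 2 for x, e in zip(observed, expected)]
    # Outfit: unweighted mean of squared standardized residuals,
    # sensitive to outlying/extreme ratings.
    outfit = sum(r / v for r, v in zip(sq_resid, variances)) / len(observed)
    # Infit: information-weighted, sensitive to unexpected patterns
    # on well-targeted observations.
    infit = sum(sq_resid) / sum(variances)
    return infit, outfit

# Hypothetical ratings for one item; values near 1 indicate good fit.
infit, outfit = fit_mean_squares(
    observed=[5, 4, 6, 5], expected=[4.8, 4.5, 5.2, 4.9],
    variances=[0.6, 0.7, 0.5, 0.6])
```

When every squared residual exactly equals its model variance, both statistics are 1, the "perfect correspondence" benchmark cited above.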
All these statistics mentioned above were reported for candidates, items, and raters, respectively. In addition, the interaction analyses between candidates and items were also assessed to provide additional information for candidates' performance and survey items. Data analysis of MFRM was completed using Facets v. 3.66.1 [14]. The remaining data analysis was completed using SAS 9.4 (SAS Institute Inc., Cary, NC).

Results

Descriptive results
Out of 100 residents, 40 responded to the 2015 survey to select three of the four candidates for the CR positions for the year 2017-2018. Table 1 shows the frequencies of the 16 survey questions at each rating, with mean ratings and their associated standard deviations for each candidate. Candidate 1 received the highest mean rating of 5.56, while Candidates 2 and 4 received the exact same mean ratings over the 16 survey questions.

2015 survey results from the MFRM
For the person and model fit, the chi-square test is χ²(3) = 76.2, p < 0.001, with a candidate reliability separation index of 0.96. A reliability separation index above 0.70 would suggest that, on average, there are discernible statistically significant differences between candidates. Therefore, both the chi-square test and the reliability separation index indicate that overall, the model fits the data well and candidates' abilities could be differentiated well. For the item and model fit, the chi-square test is χ²(15) = 49.6, p < 0.001, and the item reliability is 0.70. We expect some degree of difference among items so that each item can contribute to the overall survey instrument. Table 4 provides the observed average, logit item difficulty with standard error, and item infit and outfit statistics. The item difficulty measure ranges from -0.44 (easiest item) for item 3 ("respectful") to 0.51 (most difficult) for item 8 ("acceptance of criticism/ability to accept his/her own mistakes"). Using the criterion of a standardized value (Zstd) of 2 to indicate poor fit, the infit and/or outfit columns in Table 4 show that item 4 ("confident"), item 10 ("diversity awareness/open-minded"), and item 11 ("enthusiastic") exhibit relatively poor item fit, indicating these three items do not fit the model well. For the overall rater and model fit, the chi-square test is χ²(39) = 729.6, p < 0.001, signifying that the raters did not all exercise the same level of severity when evaluating the four candidates in this study. The reliability of the rater separation index is 0.89 when extreme ratings are included. This result suggests that there are discernible statistically significant differences between at least two of the raters; at the very least, the most severe and most lenient raters were significantly different.
Therefore, both the chi-square test and the reliability of the rater separation index suggest the need for an adjustment for severity when assessing candidates' performance. Rater severity levels range from -4 (most lenient) to 3.41 (most severe) on the logit scale.

In addition, the FACETS variable map (also called the "Wright map") is particularly useful when interpreting the MFRM results (Figure 1). This map visually summarizes all facets investigated, with each facet presented in a separate column. The logit scale appears in the first column of the map. The other columns are candidates (second column), questions (third column), and raters (fourth column). Only the first facet (candidates) is positively oriented (+ Candidates), which means more able candidates have positive measurement values and are therefore placed higher in the column, while less able ones are placed lower. For the questions column, more difficult questions are placed at the top while easier questions are at the bottom. For the raters' column, more severe raters appear higher in the column while more lenient raters appear lower. The variable map allows for different comparisons within and between facets. As can be seen, the variability across raters in their level of severity was substantial. In fact, the rater effect is very spread out, with values ranging from around -4 (most lenient, raters 11, 14, and 37) to around 3.5 (most severe, raters 15 and 25).

Interaction analysis for the 2015 survey
The interaction between candidates and survey questions is particularly useful for providing additional information about whether the observed ratings candidates received on a particular item were unexpected under the model. Relative measures (t scores) above 2.0 or below -2.0 would indicate an interaction effect. Figure 2 shows the relative measures for the candidate-question interaction. Candidate 1 and Candidate 3 received relatively consistent ratings across all 16 questions. However, Candidate 2 received particularly low ratings on question 4 (confident), question 9 (common sense), and question 11 (efficient). On closer examination of the interaction between raters and candidates, five raters or more endorsed particularly low ratings for Candidate 2 (the results table is not provided since there are 40 raters). This piece of information might be particularly useful when the program had to select a CR candidate between Candidate 2 and Candidate 4.

FIGURE 2: Relative measures based on the candidates and questions interaction analysis for the 2015 survey
In summary, the results showed that this newly developed survey can differentiate CR candidates' performance well. Based on the ability estimates from MFRM, Candidate 1 and Candidate 3 were rated higher than Candidate 2 and Candidate 4 by their peer residents. As for Candidate 2 and Candidate 4, although they received about the same ratings either based on the mean ratings or based on the ability estimates from MFRM, Candidate 4 would be a better fit than Candidate 2 if an additional candidate had to be selected. In terms of item performance for this newly developed CR selection tool, the majority of items performed very well based on the psychometric results from the MFRM while leaving room for improvement for a few items.

2016 survey results from the MFRM
For the person and model fit, the chi-square test is χ²(5) = 753.5, p < 0.001, with a candidate reliability of 0.99. Similar to the survey results in 2015, both the chi-square test and the reliability separation index suggest that overall, the model fits the data, and candidates' abilities could be differentiated very well. For the item and model fit, the chi-square test is significant (p < 0.001), with a reliability of 0.88. Table 6 provides the observed average, logit item difficulty with standard error, and item infit and outfit statistics. The logit measure of item difficulty ranges from -0.76 (easiest item) for item 8 (dedicated) to 0.62 (hardest item) for item 2 (ability to accept his/her own mistakes). In addition, examination of the infit and outfit statistics showed that item 3 (assertive), item 6 (confident), and item 17 (non-discriminatory) showed relatively poor item fit, with mean squares of 1.88, 1.70, and 0.68, respectively, indicating these three items do not fit the model well and do not contribute to assessing candidates' performance. For the overall rater and model fit, the chi-square test is χ²(53) = 1281.0, p < 0.001, and the reliability is 0.94. This result also suggests the need for an adjustment for severity when assessing candidates' performance. Rater severity levels range from -2.17 (most lenient) to 2.26 (most severe) on the logit scale. Among all 55 raters, five raters endorsed the rating of 6 ("excellent") for all six candidates on all the questions they rated. These five raters' ratings were weighted to the minimum by FACETS so that the candidates' ability estimates were not much affected by these extreme ratings. In addition, the three strictest raters had observed average ratings of 4.4, 4.7, and 4.8 across all 22 questions and six candidates.

We also provide the FACETS variable map for the 2016 survey so that the relative distribution of each facet can be seen (Figure 3). The rater column still shows a large spread of rater severity and leniency on the logit scale. Five raters (5, 6, 16, 21, and 23) occupy similar positions at the bottom of the column, indicating they are the most lenient raters, with logit measures of around -3. From this variable map, we also can see that raters 25, 49, and 51 were the most stringent raters and were placed at the top of the rater column. Similar to the results in 2015, all 22 items were placed between -1 and 1, indicating the items were not spread widely when compared to the variability of the raters.

FIGURE 4: Candidates and questions interaction analysis for the 2016 survey
In summary, the results from the revised survey showed that the item reliability was improved from 0.70 to 0.88. This revised survey can differentiate CR candidates' performance well. Based on the ability estimates from MFRM, Candidate 3, Candidate 5, and Candidate 1 would be recommended for chief residents.

Discussion
Our study demonstrated the development process of a psychometrically sound survey for CR selection through the utility of MFRM. When combined with our full CR selection process, including the use of biosketches to review candidates' self-identified skills and goals for the position, interviews with all program leadership, and the MFRM results showing peer-identified leadership qualities, the bundle of all elements allowed for comprehensive CR selection. The overall candidate ranking results from MFRM are almost identical to the ranking results obtained using average scores, suggesting that the MFRM results are valid and consistent with the traditional method. However, it would be hard to differentiate Candidates 2 and 4 for the 2015 survey using traditional average/sum scores to evaluate each candidate. A major benefit of applying the MFRM to analyze the survey data is that rater severity/leniency can be treated as a facet alongside survey items and CR candidates, and adjusted ratings can then be applied to the evaluation of item and candidate performance. Another benefit of using MFRM is that it can provide multiple measures to assess the quality of each item, which is very useful in survey development [11]. Therefore, MFRM makes important contributions to the analysis of survey results by simultaneously assessing CR candidates, the survey items, and rater severity and leniency effects.
The findings are promising since the results showed that the survey in both years is reliable and can differentiate candidates' personal and leadership attributes well. Meanwhile, the innovative application of MFRM provides insights into survey development by providing a measure and related infit and outfit statistics for each item. If an item shows extreme infit/outfit statistics, it can be edited for clarity or removed from the survey. The MFRM provides a common metric for the facet scores, including candidates' performance, survey questions, and raters, so that item quality and raters' severity/leniency can be monitored. This innovative approach could facilitate our understanding of the scale development process as well as provide objective measurement of each facet evaluated in the model.

Reliability of the survey items
Our CR selection survey tool and the MFRM allowed the CR selection process to capture the personal and leadership characteristics of CR candidates from the resident peers' perspectives. The three facets of the model enabled us to revise the items after the first implementation of the CR selection in 2015. The second round of the CR selection in 2016 provided additional evidence for another iteration of the items to further improve the tool. The chi-square tests and separation reliability indices were used to measure the overall data-model fit for candidates, survey questions, and raters for the surveys in 2015 and 2016. The results obtained through MFRM demonstrated that the survey overall can differentiate candidates well in both years and that the majority of items performed very well in differentiating candidates within each year. Although, in general, a low rater separation reliability is more desirable, indicating less variability across raters, a high value of separation reliability indicates the existence of rater severity and leniency effects detected by the MFRM [15].
However, for this specific chief resident selection purpose, most candidates were nominated because of their excellent performance and standing in the program. We also can consider removing the "Very Poor" and "Poor" rating categories because both the number of candidates and the number of raters were relatively small in both years of administering the surveys. With a relatively larger sample of candidates or raters, we may see more endorsement of "Poor" or "Very Poor" if this survey is administered at multiple institutions in the future. In addition, we do not know the exact process by which peer residents evaluate these CR candidates. If they evaluate all CR candidates' relative performance, then it is possible that they may endorse "Very Poor" or "Poor" for specific CR candidates on certain survey items. Besides removing these two rating categories, the other way to remediate this is to clearly define and differentiate each rating category in the survey.

Conclusions
Finding an effective and fair way to select CRs and to identify personal and leadership attributes were our primary objectives. In this regard, our surveys for collecting candid feedback from peer residents were instrumental in identifying the desired characteristics in residents. This process was important since the information about the candidates' personal, professional, and leadership qualities gave us insights into each candidate's ability to take on a CR leadership role. Our study demonstrated an application of multi-facet Rasch modeling to analyze survey data and to inform decisions regarding the selection of CRs.
Finally, the CR selection survey development process we adopted in this study can be applied to other educational settings and/or professions, especially when the selection process requires a collective effort from multiple parties (faculty, peers, trainees, program directors, or other stakeholders).
In addition, this model can increase transparency in selection processes by adjusting for potential rater severity and leniency. Meanwhile, this paper may also motivate researchers in the field to apply MFRM to validate surveys and to analyze Likert-scale survey data.

Additional Information

Disclosures
Human subjects: All authors have confirmed that this study did not involve human participants or tissue. Animal subjects: All authors have confirmed that this study did not involve animal subjects or tissue.

Conflicts of interest:
In compliance with the ICMJE uniform disclosure form, all authors declare the following: Payment/services info: All authors have declared that no financial support was received from any organization for the submitted work. Financial relationships: All authors have declared that they have no financial relationships at present or within the previous three years with any organizations that might have an interest in the submitted work. Other relationships: All authors have declared that there are no other relationships or activities that could appear to have influenced the submitted work.