The Impact of Electronic Data Capture on Qualitative Comments in a Competency-Based Assessment System

Introduction: Digitalizing workplace-based assessments (WBA) holds the potential for facilitating feedback and performance review, wherein we can easily record, store, and analyze data in real time. When digitizing assessment systems, however, it is unclear what is gained and lost in the message as a result of the change in medium. This study evaluates the quality of comments generated in paper vs. electronic media and the influence of an assessor's seniority.

Methods: Using a realist evaluation framework, a retrospective database review was conducted on paper-based and electronic comments. A sample of assessments was examined to determine any influence of the medium on word count and the Quality of Assessment for Learning (QuAL) score. A correlation analysis evaluated the relationship between word count and QuAL score. Separate univariate analyses of variance (ANOVAs) were used to examine the influence of the assessor's seniority and the medium on word count, QuAL score, and WBA scores.

Results: The analysis included a total of 1,825 records. The average word count for electronic comments (M=16) was significantly higher than for paper comments (M=12; p=0.01). Longer comments positively correlated with QuAL score (r=0.2). Paper-based comments received lower QuAL scores (0.41) than electronic comments (0.51; p<0.01). Years in practice was negatively correlated with QuAL score (r=-0.08; p<0.001), as was word count (r=-0.2; p<0.001).

Conclusion: Digitization of WBAs increased the length of comments and did not appear to jeopardize the quality of WBAs; these results indicate higher-quality assessment data. True digital transformation may be possible by harnessing trainee data repositories and repurposing them to analyze for faculty-relevant metrics.


Introduction
Electronic data capture systems, now the norm at most educational institutions that collect assessment data about trainees [1,2], require careful transition planning and change management [3,4]. While previous studies provide good examples of such transitions from a technical standpoint, there may be important individual differences in how faculty adapt to these changes, which may, in turn, lead to different interpretations of educational assessment outcomes [1-4].
Electronic assessment records have the potential to drastically improve assessment data aggregation. If properly implemented, they hold the potential for facilitating feedback and performance review, wherein we can easily record, store, and analyze data in real time [1,5]. Sentiment analysis and natural language machine learning algorithms hold great promise for enhancing real-time qualitative analysis as well [6,7], but these technologies are contingent on raters submitting high-quality observations and assessments. Specifically, faculty may differ in how they interact with electronic assessment platforms when recording feedback to trainees [8]. Critically, the technology used to gather and house assessment data may influence the rater-trainee experience. It is well described, for instance, that electronic medical records (EMRs) have reshaped how clinicians interact with patients.

Materials And Methods
We adopted a realist evaluation perspective to examine our local workplace-based assessment program [15]. The goal of a realist evaluation framework is to take into consideration how a program's implementation is affected by contextual factors and how that context works in conjunction with a given mechanism to produce outcomes. Locally, at McMaster University's specialist emergency medicine program, we have developed a daily WBA program known as the McMaster Modular Assessment Program (McMAP) [12]. Details regarding our successful blueprinting and implementation are discussed elsewhere [12,16-18]. The current study contributes to continued quality assurance and explores the contextual influence of converting McMAP to computer-based data collection methods. A previous version of this work was presented as a virtual poster at the Association for Medical Education in Europe (AMEE) 2020 virtual conference.

Procedure
We retrospectively examined the assessments completed by 85 raters on 30 residents from October 2012 to June 2015. The data from these assessments were preprocessed to anonymize comments, names, and any user identifiers, and were securely stored on the lead author's encrypted university computer. Comments were then evaluated to determine an objective quality score and word count.

Data selection
The comments from October 2012 to June 2013 were collected in a paper-based series of WBA workbooks, which were filled out contemporaneously in the clinical setting. The comments from July 2013 to June 2015 were collected using an online e-portfolio system created and housed locally at McMaster University. All data were entered into a Microsoft Excel workbook (Microsoft Corp., Redmond, WA). Each comment examined was associated with a word count, determined using standard data processing functions within Excel: we first removed double spaces, then counted the spaces in each comment and added one. Records with zero to three words were excluded from this study. Appendix Table 3 lists the comments with zero to three words so that readers can appraise their worth as needed.
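For illustration, the same preprocessing can be reproduced outside of Excel. The short Python sketch below is illustrative only (the column and variable names are hypothetical; it is not the original workbook or dataset): it collapses repeated spaces, counts words, and drops records with three words or fewer.

```python
import re

import pandas as pd

def word_count(comment: str) -> int:
    """Collapse repeated whitespace, then count space-delimited words."""
    cleaned = re.sub(r"\s+", " ", str(comment)).strip()
    return 0 if not cleaned else cleaned.count(" ") + 1

# Hypothetical records; the real comments were stored in the Excel workbook.
records = pd.DataFrame({
    "comment": [
        "Good job",
        "Clear, focused history; consider summarizing the plan for the patient",
        "",
    ]
})
records["word_count"] = records["comment"].apply(word_count)

# Records with zero to three words were excluded from the study.
included = records[records["word_count"] > 3]
print(included)
```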

Coding
There were two independent variables: method of data capture and years in practice. The method of data capture (medium) was coded as paper or electronic entry. Years in practice was coded as a continuous variable. The dependent variables were word count, coded as a count for each comment, and comment quality, coded using the QuAL score [14] and the McMAP global scale [12], a global rating of trainee performance on a particular shift scored out of seven using behavioral anchors. Average word counts were rounded and reported as whole numbers. As a result of excluding records with lower word counts, there were no missing data for word count or QuAL score. However, there were seven missing McMAP global scores, all of which occurred within the paper-based iteration of our system (mandatory fields prevented such missing data in the electronic version).

Quality Assessment
We used a novel evaluation tool [14] to determine the quality of the assessment comments. The derivation of that tool is described elsewhere [14]. Essentially, three authors (TC, SS, and SM) scored different subsets of the comments. In a previous study, this scoring process was calibrated using 10% of the dataset, achieving an intraclass correlation coefficient (ICC) of 0.95 [14].
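As a hedged illustration only (not the calibration analysis from the QuAL derivation study [14]), an intraclass correlation of this kind can be computed on a double-scored calibration subset. The sketch below assumes the third-party pingouin package and uses entirely hypothetical scores:

```python
import pandas as pd
import pingouin as pg  # assumed third-party package providing the ICC

# Hypothetical calibration subset: three raters each score the same comments.
calibration = pd.DataFrame({
    "comment_id": sorted([1, 2, 3, 4, 5] * 3),
    "rater": ["TC", "SS", "SM"] * 5,
    "qual_score": [3, 3, 2, 0, 1, 0, 4, 4, 4, 2, 2, 3, 1, 0, 1],
})

# Intraclass correlation coefficients across the shared comments.
icc = pg.intraclass_corr(
    data=calibration, targets="comment_id", raters="rater", ratings="qual_score"
)
print(icc[["Type", "ICC"]])
```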

Quantitative analysis
All data were processed and coded using Microsoft Excel 2011 for Mac. Descriptive statistics, Pearson correlations, chi-squared tests, and univariate analyses of variance (ANOVA) were calculated using IBM SPSS Statistics, Version 26.0 (IBM Corp., Armonk, NY; released 2019). To evaluate the impact of medium and years in practice, the three dependent variables (word count, QuAL score, and McMAP rating score) were analyzed separately. Pearson correlation analyses were used to describe the relationship between our dependent variables (word count, McMAP score, and QuAL score) and years in practice. Separate univariate ANOVAs were used to describe the impact of medium (paper or electronic) on our dependent variables (word count, McMAP score, and QuAL score). A chi-squared analysis was used to evaluate the independence of the frequencies of records with three words or fewer between paper and electronic entries.
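For readers without SPSS, the same family of analyses can be approximated with open-source tools. The sketch below is illustrative only: it assumes a tidy table with one row per included assessment, uses hypothetical column names and counts, and relies on SciPy equivalents of the Pearson correlation, univariate ANOVA, and chi-squared test rather than the SPSS procedures actually used.

```python
import pandas as pd
from scipy import stats

# Hypothetical tidy dataset: one row per included assessment.
df = pd.DataFrame({
    "medium": ["paper", "paper", "electronic", "electronic", "electronic"],
    "years_in_practice": [12, 20, 3, 7, 1],
    "word_count": [10, 6, 25, 18, 30],
    "qual_score": [0, 0, 2, 1, 3],
})

# Pearson correlation between years in practice and a dependent variable.
r, p = stats.pearsonr(df["years_in_practice"], df["word_count"])

# Univariate (one-way) ANOVA: effect of medium on word count; QuAL and
# McMAP scores would be analyzed in the same way.
groups = [group["word_count"].to_numpy() for _, group in df.groupby("medium")]
f_stat, p_anova = stats.f_oneway(*groups)

# Chi-squared test of independence for short (<=3 word) vs. retained records
# by medium, using an illustrative 2x2 table of counts.
contingency = pd.DataFrame(
    {"short": [30, 25], "retained": [70, 75]}, index=["paper", "electronic"]
)
chi2, p_chi, dof, expected = stats.chi2_contingency(contingency)

print(f"r={r:.2f} (p={p:.3f}); F={f_stat:.2f} (p={p_anova:.3f}); "
      f"chi2={chi2:.2f} (p={p_chi:.3f})")
```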

Ethics
Our project was reviewed by the chairperson of our local institutional review board (Hamilton Integrated Research Ethics Board) and granted an exemption.

Results
From October 2012 to July 2015, 2,556 assessments were generated that included written comments, rating scale scores, or both. A total of 1,018 of these assessments were generated within paper workbooks and subsequently transcribed by an administrative staff member or a junior faculty member. Of those 1,018 paper-based records, 33% (n=338) were excluded for having three words or fewer, and 22% (n=225) had no words. From July 2013 to July 2015, daily evaluations of both tasks and overall performance were collected via a novel electronic platform, yielding 1,538 assessments that included both written comments and numerical ratings. Of those 1,538 records, 24% (n=373) were excluded for having three words or fewer, and 16% (n=241) had zero words. Table 1 shows the distribution of records by year and medium. A chi-squared test of independence for the distribution of comments with three words or fewer was not significant (χ²=2.2, p=0.5). Assessments came from a total of 86 faculty members, of whom 64% were male. Around 1% (n=20) of entries that met the inclusion criteria did not provide information about years in independent practice or rater identity; these records were excluded from the analysis. Table 2 shows the overall demographics of the assessors.

Word count
After the exclusion of 731 records, the average word count was 15 (SD=14) across 1,825 comments. There was a significant main effect of medium on word count (F(1, 1823)=52.87, p=0.01, partial η²=0.03). The electronic records had a higher average word count (M=16) than the paper-based records (M=12).

Word count and years in practice
Years in practice was negatively correlated with word count (r=-0.2; p<0.001). The correlation was similar for paper and electronic records (paper: r=-0.25, p<0.001; electronic: r=-0.19, p<0.001).

Quality assessment of comments
The average QuAL score, 0.47/5 (SD=0.86), was lower than in the sample studied previously (mean=0.9/5, SD=0.9) [14]. The low range of scores is consistent with prior work, indicating the continued need for faculty development around providing written feedback [14]. Evaluating the transition from paper to electronic data capture, the univariate ANOVA showed a main effect of medium (F(1, 1823)=6.7, p<0.01, partial η²=0.004). Critically for our study goal, the effect size was small, and the quality of comments was not reduced by the transition: the average QuAL score for electronically captured assessments was 0.51 compared to 0.41 for paper-based assessments.
Unsurprisingly, longer comments (regardless of medium) were positively correlated with scores on all three QuAL subscales: evidence of observed behavior (r=0.46, p<0.01), suggestion for improvement (r=0.40, p<0.01), and evidence linked to suggestion (r=0.41, p<0.001).

Quality assessment of comments and years in practice
Years in practice was negatively correlated with QuAL score (r=-0.08, p<0.001).

McMAP global rating score
The mean McMAP score was 5.9/7 (SD=0.82), with a median score of 6. There was no main effect of medium on McMAP scores.

McMAP global rating score and years in practice
There was no significant correlation between years in practice and McMAP global rating score. Although Figure 1 depicts some variability in scores across different years in practice, we did not detect a significant or meaningful relationship between years in practice and McMAP scores.

FIGURE 1: Mean McMAP scores of trainees generated in both contexts (paper vs. electronic)
Year 0 on the x-axis denotes the end of the assessor's postgraduate training. Negative numbers on this axis denote the number of years before licensure, since senior residents often acted as assessors for junior residents (e.g., a senior emergency medicine resident would observe, provide feedback to, and rate a first- or second-year trainee).

Discussion
Our study examined the relationship between the method of collecting assessment data and the quality and quantity of words within the comments for one WBA system. Our intention was to determine whether the quality of comments generated in paper vs. electronic media was influenced by an assessor's seniority. Contrary to the postulations of our residents in our previous qualitative program evaluation study [8], our faculty members were not deterred by the transition to electronic media and, on average, wrote more words in their qualitative comments. Faculty members did skip comment boxes more often in the electronic version (23% vs. 9.2%), despite the presence of mandatory fields within the digital version.
Going beyond the initial mandate of our study, we also found an interesting pattern in the volume of feedback generated by different cohorts of our attending physicians. Mid-career faculty members tended to write the least. In our locale, we hypothesize that the phenomenon we observed may be due to the effects described by Govaerts and colleagues [13], but it may also intersect with faculty engagement. In our local quality assurance focus groups, residents revealed that many faculty members were rather disengaged with the new WBA system (McMAP) [8]. As such, we postulate that a number of different forces may be at play.
With the advent of competency-based medical education (CBME), there has been a marked increase in the use of digital systems to capture WBAs [19-22]. With the increasing use of these databases, many groups have turned their attention to trainee behaviors around data capture [23,24]. However, more attention must be paid to how faculty respond to and engage with these systems, and to how their needs are then addressed through faculty development [25]. Whereas in traditional testing, validity lies in the hands of the students and their engagement in the response process, in the age of CBME and WBAs, the response process of the faculty members who enter the data is of paramount importance. Our present study sheds light on an important aspect of the response process for generating high-quality data about trainee performance. Others have examined the time burden of WBA on faculty [26], faculty engagement in the assessment process [27], the biases faculty exhibit [28], and even faculty perceptions of their role within these systems [16,17].
In our study, by examining the WBA participation of various faculty cohorts, we show how more nuanced analyses can address the differing needs of various faculty subgroups. Through this type of analysis, we believe we can begin to refine approaches to faculty development. Rather than treating faculty as a single, homogeneous group, more nuanced and targeted approaches to faculty development can be generated by transforming trainee databases to reveal new insights about faculty performance [29].

Next Steps
Trainee databases and repositories may represent a wealth of untapped data that can provide faculty with tangible, actionable insights about their own performance as raters within a system of assessment. A true digital transformation of faculty development may be possible if we harness newly developed trainee assessment databases to generate useful metrics on faculty performance in terms of their contributions to assessment, feedback, and rating of trainees in the age of CBME [30]. Repurposing trainee data in this way holds great potential for providing true insight into faculty performance related to assessment and their tangible contributions to academic medicine. Future studies in this area may examine sentiment analysis or applications of natural language processing (such as sequencing of feedback statements, syntactic complexity, local or text coherence, and lexical sophistication) to the real-time capture of trainee feedback comments.
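As a purely illustrative sketch of this direction (not a validated instrument, and not part of the present study), the Python example below computes toy comment-level metrics, a keyword-based sentiment score and simple lexical sophistication proxies, of the kind such future work might apply to feedback comments. The lexicon and metric names are hypothetical placeholders:

```python
import re
from statistics import mean

# Toy sentiment lexicon: hypothetical placeholder, not a validated word list.
POSITIVE = {"excellent", "clear", "organized", "thorough", "improved"}
NEGATIVE = {"rushed", "disorganized", "missed", "unclear", "incomplete"}

def comment_metrics(comment: str) -> dict:
    """Return simple, illustrative metrics for one feedback comment."""
    words = re.findall(r"[a-z']+", comment.lower())
    sentiment = sum(w in POSITIVE for w in words) - sum(w in NEGATIVE for w in words)
    return {
        "word_count": len(words),
        "sentiment": sentiment,  # crude keyword-based polarity
        "type_token_ratio": len(set(words)) / len(words) if words else 0.0,
        "mean_word_length": mean(len(w) for w in words) if words else 0.0,
    }

print(comment_metrics(
    "Thorough and organized history, but the discharge plan was unclear."
))
```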

Limitations
This study has a number of limitations. It is a retrospective program evaluation study. Novelty effects of the technology may have distorted the use of the electronic vs. paper medium; however, increased use of the electronic medium can reduce the technological barrier for raters and create better buy-in for generating feedback to residents. The paper-based data came only from the first year, so score changes from year to year may reflect increasing score drift due to increased usage by faculty members; therefore, we cannot tease apart the initial pilot year's novelty effect in our present study. Finally, the data come from a single institution with a focus on emergency medicine, so the generalizability of our results is limited to our research population.

Conclusions
We detail our journey through the effective digitalization of a WBA system, which resulted in more words written per comment about trainee performance and in higher-quality comments. Digitalization of workplace-based assessments increased the length of the comments available to educators, and the longer comments captured on digital systems did not appear to negatively impact the quality of the assessments. The medium, electronic vs. paper, influences the evaluative message being sent and received. True digital transformation may be possible by harnessing trainee data repositories and repurposing them to analyze for faculty-relevant metrics.