Development and Test-Item Analysis of a Freely Available 1900-Item Question Bank for Rheumatology Trainees

Background: Tests composed of multiple-choice questions are an established tool for evaluating knowledge of medical content. Within the field of rheumatology, there is an absence of free, easily accessible sets of multiple-choice questions that have been rigorously evaluated and analyzed.
Objective: To develop a question bank of multiple-choice questions that evaluate trainee knowledge of rheumatology, and to investigate the psychometric properties (reliability, discrimination indices, difficulty indices) of items within the question bank.
Methods: Multiple-choice questions were drafted according to a strict methodology devised by the investigators. Between January and December 2020, questions were administered in sets of 20-25 to test-takers who were either current trainees or recent graduates of training programs. Performance was evaluated through descriptive statistics (mean, median, range, standard deviation) and test-item statistics (difficulty index, discrimination index, reliability).
Results: The investigators drafted 1900 multiple-choice questions within 95 sections, each composed of 20 to 25 questions. These questions were administered to 32 participants. The mean difficulty index was 0.57 (standard deviation: 0.22) and the mean discrimination index was 0.38 (standard deviation: 0.23). Reliability indices for the 95 sections ranged from 0.45 to 0.85 (mean: 0.613; standard deviation: 0.09). The overall reliability index for the entire item bank was greater than 0.95.
Conclusion: The investigators developed a 1900-item question bank composed of items with difficulty and discrimination indices sufficient for use in low- and moderate-stakes settings. A rigorous methodology was employed to create the first freely accessible, reliable tool for the assessment of rheumatology knowledge. This tool can be purposed for both summative and formative evaluation in multiple settings and platforms.


Introduction
Multiple-choice questions have been a mainstay of educational assessment for the past century and are the basis for certification examinations in both Internal Medicine and Rheumatology, among other fields [1]. Tests composed of carefully constructed multiple-choice questions have psychometric properties that make them conducive to learner assessment [2]. Additionally, when crafted appropriately, multiple-choice questions may enable self-regulated learning at the individual level [3]. Within rheumatology, however, there is an absence of free, easily accessible sets of multiple-choice questions that have been rigorously evaluated and analyzed.
To address this need, the investigators systematically developed a test item bank of 1900 items for Rheumatology fellows and other learners. Through the application of principles for test item writing, the investigators ensured that these multiple-choice questions have psychometric properties conducive to their use for formative evaluation and self-evaluation.

Materials And Methods
The University of Iowa Institutional Review Board reviewed this project and determined that it was not human subjects research: it was an educational intervention involving only interactions with an educational test, and any disclosure of responses would not reasonably place the subjects at risk of criminal or civil liability or be damaging to the subjects' financial standing, employability, educational advancement, or reputation. The project was completed from June 2017 to December 2020.
A charter drafted by the investigators outlined the process of constructing and evaluating items (Figure 1). As part of the charter, the investigators delineated (1) the content of the item bank, (2) specific educational objectives, (3) appraisal of test item quality, (4) the protocol for drafting test items, and (5) calculation of psychometric properties of test items. The authors outlined the process of constructing test items using 10 principles.

Item drafting
After the charter was ratified, the investigators began writing the components of each item: the stem, options (including correct answer choices and distractors), clinical significance, and references.
The abovementioned set of 10 criteria guided the production of these test items. Each item listed only one correct answer, along with two incorrect answer choices (distractors). A fourth answer choice, "I will have to look that up," was also added. The three-option format was selected because previous data demonstrate psychometric properties equivalent to those of four-option items [8]. Additionally, the fourth answer choice was incorporated to minimize random guessing and to promote reflection on confidence of knowledge [9].
Each test item also had an explanation, entitled "Clinical Significance," that justified the importance of knowing that particular piece of information in the clinical context. When feasible, explanations also discussed the relationships among the distractors, the correct answer choice, and the stem.
These questions were based on 1238 distinct objectives, split among the six levels of Bloom's taxonomy. A plurality of test items involved comprehension, followed by application and analysis of knowledge (40.03%, 23.51%, and 15.89%, respectively). Slightly over 5% of questions involved synthesis (5.16%).

Appraisal of test item quality
All 1900 questions underwent two rounds of review by three investigators. The average score for test items during the initial review was 9.1 (SD=0.8), with a kappa statistic of 0.94. A total of 150 questions fell below the threshold of 8 and were rewritten for the second round of reviews. The second round of reviews yielded an average score of 9.4 (SD=0.3), with a kappa statistic of 0.98. The greatest source of disagreement was parallelism among answer choices, which accounted for the disagreement in 132 of the 1900 questions (6.9%). Table 2 displays the inter-rater reliabilities for each criterion upon final review.
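The kappa statistics above quantify chance-corrected agreement between reviewers. As an illustration only (the ratings below are hypothetical, not the study's data), Cohen's kappa for two raters can be computed as:

```python
from collections import Counter

def cohens_kappa(rater_a, rater_b):
    """Cohen's kappa: chance-corrected agreement between two raters."""
    assert len(rater_a) == len(rater_b)
    n = len(rater_a)
    # Observed proportion of agreement
    observed = sum(a == b for a, b in zip(rater_a, rater_b)) / n
    # Agreement expected by chance, from each rater's marginal frequencies
    freq_a, freq_b = Counter(rater_a), Counter(rater_b)
    expected = sum(freq_a[c] * freq_b[c] for c in freq_a) / n**2
    return (observed - expected) / (1 - expected)

# Hypothetical pass/fail ratings of 10 items by two reviewers
a = ["pass"] * 8 + ["fail"] * 2
b = ["pass"] * 7 + ["fail"] * 3
```

With these hypothetical ratings, nine of ten items agree, giving a kappa of about 0.74; values near the study's 0.94-0.98 would require near-perfect agreement.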

TABLE 2: Inter-rater reliability (κ) for each criterion at final review

Test item statistics
Test item statistics were calculated for each of the 95 sections as well as for the entire test item bank (Table 3).
For the entire item bank, the mean score was 0.58 (range: 0.31-0.69; median: 0.608; SD=0.083). The reliability of the entire question bank, calculated through the KR-20, was 0.986. Difficulty and discrimination indices for each test item were also calculated relative to its contribution to the section score and to the total score (Table 4). Difficulty ranged from 0 to 1.00 (mean: 0.576; median: 0.563; SD: 0.217). Discrimination indices relative to section scores ranged from -0.375 to 1 (mean: 0.376; SD: 0.235), with a median of 0.375. In contrast, discrimination indices relative to the total score ranged from -0.625 to 0.875 (mean: 0.184; SD: 0.214), with a median of 0.375 (Figure 2).

FIGURE 2: Distribution of difficulty and discrimination indices for each test item.
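The difficulty index (proportion of examinees answering correctly), the discrimination index, and KR-20 reliability are standard classical test theory quantities. A minimal sketch, using a tiny hypothetical response matrix rather than the study's data and an upper-versus-lower-halves variant of the discrimination index:

```python
def item_statistics(responses):
    """responses: one list of 0/1 item scores per examinee.
    Returns per-item difficulty, per-item discrimination, and KR-20."""
    n = len(responses)
    n_items = len(responses[0])
    totals = [sum(r) for r in responses]
    # Difficulty index: proportion of examinees answering each item correctly
    p = [sum(r[i] for r in responses) / n for i in range(n_items)]
    # Discrimination index: difference in proportion correct between the
    # top- and bottom-scoring halves (one common upper-lower formulation)
    order = sorted(range(n), key=lambda j: totals[j])
    half = n // 2
    low, high = order[:half], order[-half:]
    disc = [
        (sum(responses[j][i] for j in high) - sum(responses[j][i] for j in low)) / half
        for i in range(n_items)
    ]
    # KR-20 reliability for dichotomous items
    mean_t = sum(totals) / n
    var_t = sum((t - mean_t) ** 2 for t in totals) / n
    kr20 = (n_items / (n_items - 1)) * (1 - sum(pi * (1 - pi) for pi in p) / var_t)
    return p, disc, kr20

# Hypothetical 4-examinee, 3-item response matrix (not the study's data)
resp = [[1, 1, 1], [1, 1, 0], [1, 0, 0], [0, 0, 0]]
```

On this toy matrix the difficulty indices are 0.75, 0.50, and 0.25, the discrimination indices are 0.5, 1.0, and 0.5, and KR-20 is 0.75.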
Finally, test items were grouped by Bloom's taxonomy to calculate their difficulty and discrimination indices. The mean difficulty indices for each group varied from 0.405 (evaluation) to 0.715 (comprehension), with significant variability within the six groups. The mean discrimination indices also ranged widely, from 0.220 to 0.548 (Table 5).

Discussion
The investigators drafted a set of 1900 multiple-choice items to evaluate knowledge of rheumatology among trainees. Based on the responses of these 32 participants, the items may be suitable for uses such as examination preparation and formative evaluation during training. The systematic manner in which the items were constructed and evaluated has been essential for ensuring transparency of results.

Systematic and methodical test item drafting enables rapid assessment of knowledge
First, the items were constructed in a rigorous and systematic manner based on established principles of test item creation. Tethering items to principles such as simplicity, objectivity, positivity, and clarity helped prevent the introduction of unwarranted variability associated with test-taking. This allowed for the moderately high reliability indices observed among the different sections, as well as the high reliability index for the entire test bank.
The approach of defining objectives first, followed by drafting items, was also important in upholding item quality. Indeed, it helped ensure diversity of items, ranging from simple recollection to higher-order evaluation. Additionally, this approach helped reduce redundancy that would affect the psychometric properties of test items.
Because the items have relatively short stems and answer choices, they are particularly useful for rapid assessment of knowledge. They can be used in point-of-care settings where deficits in knowledge can be identified and corrected promptly. Additionally, because each item is explicitly tethered to one or two educational objectives, test-takers can better recognize exactly where their knowledge is deficient. The explanations and references that accompany each item are also instrumental in ensuring that learners can expand their knowledge base.
The relatively high inter-rater reliability of each criterion speaks to the objectivity with which these criteria were drafted. The criterion with the lowest inter-rater reliability was parallelism, which is understandable given that it is the most difficult criterion to operationalize.

Difficulty and discrimination indices suggest the item bank's utility for a variety of learners
Second, the items in the item bank tended to be difficult; the mean difficulty index was 0.576. Since one of the principles was to set the difficulty at the level of a second-year fellow approaching graduation, with a target difficulty index of 0.65, a relatively low difficulty index was somewhat expected. This has important implications for application, as it allows more opportunities to correct deficits in knowledge. The overall spread and distribution of difficulty indices are also notable, since they allow the item bank, as a whole, to evaluate test-takers of differing abilities. Although the majority of test-takers were rheumatology fellows, this wide range of difficulty indices may render the item bank suitable for other learners, including medical students, Internal Medicine residents, and physicians in independent practice.
Similarly, the discrimination indices with respect to section scores were modest. Forty-two items (2%) had negative discrimination indices with respect to section scores. The distribution of discrimination indices suggests that individual test items can be broadly used to help distinguish low-scorers from high-scorers within sections. As expected, when discrimination indices were calculated with respect to total scores, the values were lower, and 224 items (11.2%) were negative. This likely reflects the heterogeneity of test-takers, who may have strengths in certain sections but weaknesses in others. These items were retained in the bank, but those with negative discrimination indices are marked accordingly in the explanations.

Test items can address multiple levels within Bloom's Taxonomy
Third, there did not appear to be major differences in the difficulty or discrimination indices among the different taxons of objectives. Items focusing on comprehension of rheumatology knowledge were, on average, easier than those focusing on other skills, but the variability was sufficiently high that the mean difficulty indices of all six taxons approximated one another. Similar results were present for the discrimination indices, although analysis items appeared to have, overall, a greater ability to discriminate between high- and low-achievers. This bolsters the utility of these items for evaluating a host of skills in addition to knowledge and comprehension.

Implications and future directions
This item bank can be utilized in a variety of ways by both learners and teachers. Although multiple-choice items are more commonly used for summative evaluation (determination of an outcome at the end of an educational program), the sheer number and diversity of items within this item bank may also render it useful for formative evaluation (assessment of abilities at a given point during the educational program).
Previous studies have suggested that multiple-choice items may be used for self-directed and self-regulated learning [12]. Because each item is linked to a specific learning objective, an explanation, and a reference, these items are particularly suitable for trainees who would like more structure in their self-directed learning. Additionally, because data are available regarding test-taker performance, self-directed learners can identify their own strengths and weaknesses relative to their peers. Alternatively, items from this bank may also be used for CME-directed activities.
Beyond the level of the individual, these items may be used in board preparation. Indeed, the item content was loosely based on the American Board of Internal Medicine's Blueprint for the Certification Examination, with expansions to accommodate further topics relevant to the practice of rheumatology. Since the items are grouped thematically into sections, training programs can use the sections to evaluate trainee performance and adjust didactic and other teaching activities accordingly. These sections have sufficient reliability to help provide a rough estimate on trainee performance with respect to peers.
In addition, future versions of the question bank may include images and videos. They were deliberately excluded in this first version to streamline the process of drafting the test item stems and homogenize test items for psychometric analysis. However, these can be included in future versions with their own guidelines to help ensure sufficiently high difficulty and discrimination indices.
Although these items were distributed electronically as sections and manually graded, there exists the potential for computer-adaptive testing [13]. Computerized testing enables a platform that is friendlier to the test-taker and offers the capacity to instantly grade and report scores. Additionally, sophisticated computer algorithms based on item response theory may be able to use the difficulty and discrimination indices to identify the most appropriate items with which to evaluate trainee performance. Since most of these algorithms select items that a given test-taker has a 50% probability of answering correctly, this approach is exceptionally suitable for this item bank, whose median and mean item difficulty indices are 0.563 and 0.576, respectively.
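The 50% rule follows from the fact that, under a two-parameter logistic (2PL) item response theory model without guessing, an item's Fisher information peaks where the probability of a correct response is 0.5. A sketch of maximum-information item selection; the item parameters and the `next_item` helper are illustrative assumptions, not part of the authors' platform:

```python
import math

def p_correct(theta, a, b):
    """2PL model: probability that an examinee of ability theta responds correctly.
    a = discrimination parameter, b = difficulty parameter (illustrative)."""
    return 1.0 / (1.0 + math.exp(-a * (theta - b)))

def item_information(theta, a, b):
    """Fisher information of a 2PL item; maximal where p_correct == 0.5 (theta == b)."""
    p = p_correct(theta, a, b)
    return a * a * p * (1.0 - p)

def next_item(theta, items):
    """Pick the item carrying the most information at the current ability estimate."""
    return max(items, key=lambda it: item_information(theta, it["a"], it["b"]))

# Hypothetical item pool: the adaptive algorithm picks the item whose
# difficulty is closest to the examinee's current ability estimate.
pool = [{"a": 1.0, "b": -1.0}, {"a": 1.0, "b": 0.2}, {"a": 1.0, "b": 2.0}]
```

For an examinee with an ability estimate of 0, the selected item is the one with difficulty 0.2, i.e., the item the examinee would answer correctly about half the time.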

Limitations and unanswered questions
The prospective design of this investigation, the methodological rigor of assessing test item quality, and the relatively large number of participants bolster the validity of this item bank. At the same time, there are notable limitations.
First, the inclusion of option D ("I will have to look that up") is a deviation from standard test item writing principles; the preferred approach is to recommend that test-takers skip such questions altogether. However, since these questions were developed for an audience of learners rather than examiners, the authors felt it was important to provide an option for learners to reflect on their own degree of self-confidence. While the inclusion of option D likely reduced the reliability of the test, this disadvantage is offset by benefits for learners who seek to gauge their self-confidence in applying their knowledge. Because we did not force learners to answer all items (instead of skipping), there is ambiguity in interpreting why a test-taker answered D rather than skipping a question. Likewise, for simplicity, we did not split this option into further options assessing different levels of metacognition, since that was not our intended purpose.
Second, our convenience sampling makes it unclear whether these data are generalizable to a larger population of rheumatology fellows and other learners. Similarly, our strict guideline that delivery of further test item sections was contingent on completion of previous sections may have enabled a very high completion rate, but it is unlikely to be replicable in the general population, where such strictness is unfeasible. Of note, we also did not record performance outcomes in passing board certification or recertification examinations; therefore, the bank's utility in examination preparation remains unclear.
Third, there are other systemic confounders that may have altered test scores. Because learners were not supervised while answering the items, they may have had access to resources. They were also asked not to guess, in favor of answering D ("I will have to look that up") or skipping the question altogether, but it is likely that many answers were products of educated guessing, altering some of our test statistics.
Fourth, though the objectives for test items included all six taxons, they were not equally distributed among the taxons. There were fewer test items for analysis, evaluation, and synthesis than for knowledge, comprehension, and application. This is likely because the 10 criteria constrained the long, nuanced stems that lend themselves more naturally to higher-order thinking. To balance this, the investigators sought to transfer this nuance to the explanations, which go into greater depth for higher-order thinking.
Lastly, rheumatology is an evolving field, and, as it evolves, the item bank will need to be periodically updated to incorporate the most recent state of the art. Likewise, we acknowledge that 42 test items (2%) had negative discrimination indices with respect to section scores. These can be eliminated or significantly modified in future versions. We have continued to include them in the current test item bank, with an advisory about their negative discrimination indices, for full transparency to reviewers and to test-takers.

Conclusions
A rigorous methodology, employing best practices in test item writing, was used to create the first freely accessible and reliable test item bank for the assessment of rheumatology knowledge. This test item bank has psychometric properties that may make it suitable for formative evaluation of learners and for self-regulated learning. The test item bank covers the spectrum of rheumatology topics, as set by the American Board of Internal Medicine, and assesses learners across the six taxons of Bloom's taxonomy (knowledge, comprehension, application, analysis, synthesis, and evaluation). Because there has been transparency in the development and psychometric analysis of these test items, learners using this test item bank are better able to appraise their own performance relative to their peers and, in so doing, are better empowered to guide their own learning.

Additional Information Disclosures
Human subjects: All authors have confirmed that this study did not involve human participants or tissue. Animal subjects: All authors have confirmed that this study did not involve animal subjects or tissue.

Conflicts of interest:
In compliance with the ICMJE uniform disclosure form, all authors declare the following: Payment/services info: All authors have declared that no financial support was received from any organization for the submitted work. Financial relationships: All authors have declared that they have no financial relationships at present or within the previous three years with any organizations that might have an interest in the submitted work. Other relationships: All authors have declared that there are no other relationships or activities that could appear to have influenced the submitted work.