Phase II Quantitative Validation of AI Platforms for Clinical Evidence Retrieval: Gold-Standard Comparison and Scoring Rubric Analysis of LDRT Trials in Osteoarthritis


Abstract

Background

Artificial intelligence (AI) platforms such as ChatGPT, Claude, Gemini, and Perplexity are increasingly used by patients and clinicians to interpret treatment-related information. However, their reliability in synthesizing complex clinical evidence remains uncertain. Low-dose radiotherapy (LDRT) has emerged as a therapeutic option for osteoarthritis, offering anti-inflammatory effects and pain relief, making it an ideal model for evaluating AI-driven evidence synthesis.

Methods

A multi-phase validation study was conducted. In Phase 1, a gold-standard dataset was created from randomized controlled trials and prospective studies on LDRT for osteoarthritis, with manual extraction of study design, interventions, comparators, outcomes, and limitations. In Phase 2 (in progress), leading AI platforms (ChatGPT-4 Turbo, Claude Sonnet 4.0, Gemini Flash 2.5, Perplexity AI, and Meta AI) are benchmarked against this dataset using a structured scoring rubric converted into JSON fields to assess accuracy, completeness, and consistency. Phase 3 (planned) will evaluate reproducibility and hallucination rates through repeated outputs and natural language processing-based reliability analysis.
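The rubric-to-JSON conversion described above can be sketched in code. This is a hypothetical illustration only: the field names (accuracy, completeness, consistency) come from the abstract, but the 0-5 rating scale, the equal weighting, and the function name are illustrative assumptions, not the study's actual rubric.

```python
import json

# Rubric dimensions named in the Methods; the 0-5 scale and equal
# weighting below are illustrative assumptions, not the study protocol.
RUBRIC_FIELDS = ["accuracy", "completeness", "consistency"]


def score_platform_output(ratings: dict) -> dict:
    """Validate rubric ratings (assumed 0-5 scale) and compute a mean score."""
    for field in RUBRIC_FIELDS:
        value = ratings.get(field)
        if value is None or not 0 <= value <= 5:
            raise ValueError(f"{field} must be rated on a 0-5 scale")
    mean = sum(ratings[f] for f in RUBRIC_FIELDS) / len(RUBRIC_FIELDS)
    # Emit a JSON-serializable record, mirroring the rubric-as-JSON-fields design.
    return {"ratings": ratings, "mean_score": round(mean, 2)}


record = score_platform_output({"accuracy": 4, "completeness": 3, "consistency": 5})
print(json.dumps(record))
```

Structuring each rating as a JSON field in this way would let repeated outputs from the same platform be compared programmatically, which is the kind of consistency check Phase 3 envisions.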

Results

Preliminary findings indicate variability in AI platform performance when extracting and synthesizing LDRT clinical evidence. While some platforms demonstrated partial alignment with the reference data, common limitations included incomplete study identification, inconsistencies in reported outcomes, and occasional hallucinations. These discrepancies were most pronounced when platforms interpreted heterogeneous trial designs and comparative effectiveness data.

Conclusion

AI platforms demonstrate variable reliability in summarizing clinical evidence for LDRT in osteoarthritis. Although promising for improving accessibility of complex data, current limitations in accuracy and completeness necessitate careful human oversight before clinical or research application.

Poster (non-peer-reviewed)


Author Information

Aishwarya Kalluri

Research, Orlando College of Osteopathic Medicine, Winter Garden, USA

Kristal De La Cruz Quezada

Research, Orlando College of Osteopathic Medicine, Winter Garden, USA

Nadiya A. Persaud (Corresponding Author)

College of Public Health, University of South Florida, Tampa, USA

Justin Rineer

Oncology, Orlando Health, Orlando, USA

Tomas Dvorak

Oncology, Orlando Health, Orlando, USA

