Abstract
Background
Artificial intelligence (AI) platforms such as ChatGPT, Claude, Gemini, and Perplexity are increasingly used by patients and clinicians to interpret treatment-related information, yet their reliability in synthesizing complex clinical evidence remains uncertain. Low-dose radiotherapy (LDRT) has emerged as a therapeutic option for osteoarthritis, offering anti-inflammatory effects and pain relief. Its focused but heterogeneous evidence base makes it a suitable test case for evaluating AI-driven evidence synthesis.
Methods
A multi-phase validation study was conducted. In Phase 1, a gold-standard dataset was created from randomized controlled trials and prospective studies of LDRT for osteoarthritis, with manual extraction of study design, interventions, comparators, outcomes, and limitations. In Phase 2 (in progress), leading AI platforms (ChatGPT-4 Turbo, Claude Sonnet 4, Gemini 2.5 Flash, Perplexity AI, and Meta AI) are benchmarked against this dataset using a structured scoring rubric encoded as JSON fields to assess accuracy, completeness, and consistency. Phase 3 (planned) will evaluate reproducibility and hallucination rates through repeated outputs and natural language processing-based reliability analysis.
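As a minimal sketch of what one JSON-encoded rubric record might look like, the snippet below serializes a single platform-by-study score. All field names and the 0-5 ordinal scale are illustrative assumptions; the abstract does not specify the actual schema.

```python
import json

# Hypothetical rubric record for one AI platform's output on one study.
# Field names, study identifier format, and scoring scale are assumptions,
# not the study's actual schema.
rubric_entry = {
    "study_id": "LDRT-OA-001",       # assumed identifier format
    "platform": "ChatGPT-4 Turbo",
    "accuracy": 4,                   # assumed 0-5 ordinal scale
    "completeness": 3,
    "consistency": 5,
    "hallucination_flag": False,
    "notes": "Omitted one comparator arm",
}

# Serialize to JSON so records can be pooled and scored programmatically.
serialized = json.dumps(rubric_entry, indent=2)
print(serialized)
```

Storing each rating as a flat JSON object like this would let repeated outputs (Phase 3) be compared field by field for reproducibility.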
Results
Preliminary findings indicate variability in AI platform performance when extracting and synthesizing LDRT clinical evidence. While some platforms demonstrated partial alignment with the reference data, common limitations included incomplete study identification, inconsistencies in reported outcomes, and occasional hallucinations. These discrepancies were more pronounced when interpreting heterogeneous trial designs and comparative effectiveness data.
Conclusion
AI platforms demonstrate variable reliability in summarizing clinical evidence for LDRT in osteoarthritis. Although promising for improving accessibility of complex data, current limitations in accuracy and completeness necessitate careful human oversight before clinical or research application.
