A Unified Platform for Radiology Report Generation and Clinician-Centered AI Evaluation

Zhuoqi Ma, PhD, Radiology
Xinye Yang, BS, Brown CS
Zach Atalay, Website Dev
Andrew Yang, BS, Brown CS
Scott Collins, RT, Radiology
Harrison Bai, MD, University of Colorado

Paper status - Under Review


Figure: AI in medical imaging.

Abstract

Generative AI models have demonstrated strong potential for radiology report generation, but their clinical adoption depends on physician trust. In this study, we conducted a radiology-focused Turing test to evaluate how well attending radiologists and residents distinguish AI-generated reports from those written by radiologists, and how their confidence and decision time reflect trust. To support this study, we developed an integrated web-based platform comprising two core modules: Report Generation and Report Evaluation. Using this platform, eight participants evaluated 48 anonymized X-ray cases, each paired with two reports drawn from three comparison groups: radiologist vs. AI model 1, radiologist vs. AI model 2, and AI model 1 vs. AI model 2. Participants selected the report they believed was AI-generated, rated their confidence, and indicated which report they preferred. Attendings outperformed residents in identifying AI-generated reports (49.9% vs. 41.1%) and exhibited longer decision times, suggesting more deliberate judgment. Both groups took longer when both reports were AI-generated. Our findings highlight the role of clinical experience in AI acceptance and the need for design strategies that foster trust in clinical applications.


Demo Visuals



Methodology


We use the web-based radiology report evaluation platform to assess physicians’ trust in and acceptance of AI-generated radiology reports through a Turing-test-inspired interface. In the Report Evaluation module, the interface presents a medical image (e.g., an X-ray or CT slice) at the top, accompanied by two diagnostic reports labeled Report 1 and Report 2. Each report contains detailed findings describing anatomical and pathological observations. The physician reviews both reports and the corresponding image, then responds to three key questions: (1) Preference – Which report would you prefer to use in clinical practice? (2) AI Identification – Which report do you believe was generated by AI? (3) Confidence – How confident are you in your previous judgment? To ensure fair evaluation, report identities (AI vs. human) are randomly shuffled across cases, report formatting is standardized to reduce stylistic cues, and no model or author identifiers are exposed during the task.

Forty-eight X-ray cases and the corresponding radiologists’ reports were retrieved from Rhode Island Hospital; all data were collected anonymously to ensure privacy. For each case, we generated AI reports using two state-of-the-art methods. The cases were evenly divided into three comparison groups based on the source of the paired reports: radiologist vs. AI model 1, radiologist vs. AI model 2, and AI model 1 vs. AI model 2 (16 cases per condition). Eight participants (four attendings and four residents) took part in the experiment. After reviewing each case, participants identified which report they believed was AI-generated, rated their confidence in that judgment on a Likert scale from 1 (low confidence) to 5 (high confidence), and finally indicated which report they would be more inclined to adopt in clinical practice.
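The sketch below illustrates how the study design described above could be implemented: assigning the 48 cases evenly to the three comparison groups, shuffling which report appears as Report 1 vs. Report 2 for each case, and recording one participant response (preference, AI identification, and a 1–5 confidence rating). It is a minimal illustration only; the names (Trial, build_trials, record_response) are hypothetical and are not taken from the platform’s actual code.

```python
import random
from dataclasses import dataclass

# The three comparison groups described in the Methodology section.
GROUPS = [
    ("radiologist", "ai_model_1"),
    ("radiologist", "ai_model_2"),
    ("ai_model_1", "ai_model_2"),
]
CASES_PER_GROUP = 16  # 48 cases split evenly across the three groups


@dataclass
class Trial:
    case_id: str
    group: tuple          # which comparison group this case belongs to
    report_order: tuple   # display order (Report 1, Report 2), shuffled per case


def build_trials(case_ids, seed=0):
    """Assign 48 anonymized cases to the three comparison groups and
    randomize which report source appears as Report 1 vs. Report 2."""
    assert len(case_ids) == CASES_PER_GROUP * len(GROUPS)
    rng = random.Random(seed)
    shuffled = list(case_ids)
    rng.shuffle(shuffled)

    trials = []
    for g_idx, group in enumerate(GROUPS):
        start = g_idx * CASES_PER_GROUP
        for case_id in shuffled[start:start + CASES_PER_GROUP]:
            order = list(group)
            rng.shuffle(order)          # hide which side is AI vs. human
            trials.append(Trial(case_id, group, tuple(order)))
    rng.shuffle(trials)                 # randomize case presentation order
    return trials


def record_response(trial, preferred, ai_choice, confidence):
    """Store one participant response: preference, AI identification,
    and a 1-5 Likert confidence rating."""
    assert preferred in trial.report_order and ai_choice in trial.report_order
    assert 1 <= confidence <= 5
    return {
        "case_id": trial.case_id,
        "group": trial.group,
        "preferred": preferred,
        "ai_choice": ai_choice,
        "confidence": confidence,
    }
```

Fixing a random seed per participant, as assumed here, would make each participant’s case order reproducible while still hiding the AI/human assignment from the interface.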

Figure: Methodology overview.

Citation

Citation details are forthcoming.