Decreasing Administrative Effort Related to Non-Approval of Image-guidED Procedures Using Large Language Models - The DENIED-AI Pilot Study.

McCarthy, Colin J, Vijay Ramalingam, Yiftach Barash, Seetharam Chadalavada, Xiao Wu, Oleksandra Kutsenko, Daniel Raskin, Vera Sorin, and Ammar Sarwar. 2026. “Decreasing Administrative Effort Related to Non-Approval of Image-GuidED Procedures Using Large Language Models - The DENIED-AI Pilot Study.” Academic Radiology.

Abstract

RATIONALE AND OBJECTIVES: To evaluate whether large language models (LLMs) can generate accurate, clinically valid, and usable letters to appeal insurance denials for radiology procedures.

MATERIALS AND METHODS: This pilot study generated insurance appeal letters for a simulated clinical scenario. Four LLMs (Claude 3.5, Nova Pro, Llama-3.1-70B, ChatGPT-4o) were used with zero-shot, few-shot, and retrieval-augmented generation (RAG) techniques. Four board-certified interventional radiologists, blinded to model and technique, scored letters for content (accuracy, personalization, references), grammar and structure (readability, tone, persuasiveness), and usability (estimated editing time, usefulness as a template). References were verified for accuracy, and outputs were carefully examined for hallucinations. Statistical analyses included ANOVA, Chi-square, and Fleiss' Kappa for interrater reliability.

RESULTS: Mean content and grammar scores were 3.9 ± 0.95 and 4.3 ± 0.9 (out of 5), with no significant differences by model or technique (p > .05). Reviewer agreement was poor (Fleiss' Kappa -0.18 for content, -0.085 for grammar). Hallucinations were flagged by reviewers in 16/48 assessments, significantly more often with the online model (ChatGPT-4o: 58% vs offline 25%; p = .03). Of 44 references, 80% from the offline models were fabricated compared with 29% from ChatGPT-4o (p < .001). Estimated editing time was less than 10 min in 71% of responses, and the reviewers felt the letters would be useful as templates in 73% of cases.

CONCLUSION: LLM-generated appeal letters for insurance denials were generally well received, with high usability and adequate quality. However, fabricated references and hallucinations remain prevalent, necessitating careful human review before clinical use.

Last updated on 05/08/2026