RATIONALE AND OBJECTIVES: To evaluate whether large language models (LLMs) can generate accurate, clinically valid, and usable letters to appeal insurance denials for radiology procedures.
MATERIALS AND METHODS: This pilot study generated insurance appeal letters for a simulated clinical scenario. Four LLMs (Claude 3.5, Nova Pro, Llama-3.1-70B, ChatGPT-4o) were prompted using zero-shot, few-shot, and retrieval-augmented generation (RAG) techniques. Four board-certified interventional radiologists, blinded to model and technique, scored each letter for content (accuracy, personalization, references), grammar and structure (readability, tone, persuasiveness), and usability (estimated editing time, usefulness as a template). References were verified for accuracy, and outputs were examined for hallucinations. Statistical analyses included ANOVA, chi-square tests, and Fleiss' kappa for interrater reliability.
RESULTS: Mean content and grammar scores were 3.9 ± 0.95 and 4.3 ± 0.9, respectively (on a 5-point scale), with no significant differences by model or prompting technique (p > .05). Interrater agreement was poor (Fleiss' kappa = -0.18 for content and -0.085 for grammar). Reviewers flagged hallucinations in 16 of 48 assessments, significantly more often for the online model (ChatGPT-4o, 58%) than for the offline models (25%; p = .03). Of 44 references, 80% generated by the offline models were fabricated, compared with 29% from ChatGPT-4o (p < .001). Estimated editing time was under 10 minutes for 71% of responses, and reviewers considered the letters useful as templates in 73% of cases.
CONCLUSION: LLM-generated appeal letters for insurance denials were generally well received, with high usability and adequate quality. However, fabricated references and hallucinations were prevalent, necessitating careful human review before clinical use.