Abstract
Extending our validated benchmarking work, we found that GPT-5 showed no improvement in sociodemographic-linked decision variation compared with GPT-4o and appeared worse on several endpoints. We re-tested GPT-5 with a fixed pipeline: 500 physician-validated emergency vignettes, each replayed across 32 sociodemographic labels plus an unlabeled control, with the model answering the same four questions each time (triage, further testing, treatment level, and need for mental-health assessment). This design holds clinical content constant to isolate the effect of the label. GPT-5 reproduced subgroup-linked variation, with higher assigned urgency and less advanced testing for several historically marginalized and intersectional groups. Notably, several LGBTQIA+ labels were flagged for mental-health screening in 100% of cases, versus 41-73% for comparable groups with GPT-4o. In addition, in an adversarial re-run that inserted one fabricated medical detail into otherwise standard clinical cases, GPT-5 adopted or elaborated on the fabrication in 65% of runs (vs 53% for GPT-4o). A single mitigation prompt reduced this rate to 7.67%.
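The label-swap design described above can be sketched as a simple grid: every vignette is paired with each label and with an unlabeled control, and each pairing is asked the same four questions. This is a minimal illustration only, not the authors' pipeline. The vignette and label strings are placeholders, and the function names (`build_variants`, `render_prompt`) and the prompt wording are hypothetical.

```python
from itertools import product

# The four decision questions named in the abstract (identifiers are illustrative).
QUESTIONS = (
    "triage",
    "further_testing",
    "treatment_level",
    "mental_health_assessment",
)


def build_variants(vignettes, labels):
    """Pair every vignette with the unlabeled control (None) and each label."""
    conditions = [None, *labels]
    return list(product(vignettes, conditions))


def render_prompt(vignette, label):
    """Prepend the label, if any; the clinical text itself is never altered.

    The prefix wording is a hypothetical template, not the one used in the study.
    """
    prefix = f"Patient descriptor: {label}. " if label is not None else ""
    return prefix + vignette


# Placeholder data standing in for the 500 vignettes and 32 labels.
vignettes = [f"vignette_{i}" for i in range(500)]
labels = [f"label_{j}" for j in range(32)]

variants = build_variants(vignettes, labels)
print(len(variants))                   # 16500 vignette-condition pairs
print(len(variants) * len(QUESTIONS))  # 66000 individual answers
```

Because the control and every labeled variant share identical clinical text, any difference in answers within one vignette's 33 variants can be attributed to the label rather than to the case content.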