A weakly supervised transformer for rare disease diagnosis and subphenotyping from EHRs with pulmonary case studies.

Greco, K. F., Yang, Z., Li, M., Tong, H., Sweet, S. M., Geva, A., Mandl, K. D., Raby, B. A., & Cai, T. (2026). A weakly supervised transformer for rare disease diagnosis and subphenotyping from EHRs with pulmonary case studies. NPJ Digital Medicine, 9(1).

Abstract

Rare diseases affect an estimated 300–400 million people worldwide, yet individual conditions remain underdiagnosed and poorly characterized due to low prevalence and limited clinician familiarity. Computational phenotyping offers a scalable approach to improving rare disease detection, but algorithm development is constrained by scarce high-quality labeled data. Expert-labeled datasets from chart reviews and registries are highly accurate but limited in scope, whereas labels derived from electronic health records (EHRs) provide broader coverage but are often noisy or incomplete. To efficiently leverage both sources, we propose WEST (WEakly Supervised Transformer) for rare disease diagnosis and subphenotyping from EHRs. At its core, WEST employs a weakly supervised transformer trained on a limited set of expert-validated labels and extensive probabilistic silver-standard labels, derived from structured and unstructured EHR features, that are iteratively refined across training rounds to improve model calibration. We evaluate WEST on two rare pulmonary conditions using EHR data from Boston Children's Hospital and show that it outperforms existing methods in phenotype classification, identification of clinically relevant subphenotypes, and prediction of disease progression. By reducing reliance on manual annotation, WEST enables label-efficient representation learning that supports accurate rare disease diagnosis and reveals deeper clinical insights from routine EHR data.
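The abstract does not specify how the silver-standard labels are refined, so the following is only an illustrative sketch of the general idea, not the WEST implementation. It assumes a plain logistic regression as a stand-in for the transformer, a hypothetical `gold_weight` that upweights expert-validated labels, and a hypothetical `mix` factor that blends each silver label with the current model prediction after every training round. All function and parameter names are invented for illustration.

```python
import math


def sigmoid(z):
    # Numerically stable logistic function.
    if z >= 0:
        return 1.0 / (1.0 + math.exp(-z))
    e = math.exp(z)
    return e / (1.0 + e)


def predict(X, w):
    # Predicted probability for each feature vector in X.
    return [sigmoid(sum(wj * xj for wj, xj in zip(w, xi))) for xi in X]


def fit_logistic(X, y, w, sample_weights, epochs=200, lr=0.5):
    # Weighted gradient descent on cross-entropy; targets y may be soft (in [0, 1]).
    w = list(w)
    total = sum(sample_weights)
    for _ in range(epochs):
        grad = [0.0] * len(w)
        for xi, yi, si, pi in zip(X, y, sample_weights, predict(X, w)):
            for j, xj in enumerate(xi):
                grad[j] += si * (pi - yi) * xj
        w = [wj - lr * g / total for wj, g in zip(w, grad)]
    return w


def refine_silver_labels(X_gold, y_gold, X_silver, y_silver,
                         rounds=3, mix=0.5, gold_weight=5.0):
    # Assumed scheme: train on gold + silver labels, then move each silver
    # label part of the way toward the model's prediction, and repeat.
    silver = list(y_silver)
    w = [0.0] * len(X_gold[0])
    X = list(X_gold) + list(X_silver)
    weights = [gold_weight] * len(X_gold) + [1.0] * len(X_silver)
    for _ in range(rounds):
        y = list(y_gold) + silver
        w = fit_logistic(X, y, w, weights)
        preds = predict(X_silver, w)
        silver = [(1 - mix) * s + mix * p for s, p in zip(silver, preds)]
    return w, silver
```

In this sketch each feature vector starts with a constant 1.0 to act as a bias term. The blend keeps every refined label inside [0, 1], and the heavier weight on gold labels reflects the abstract's point that expert labels are accurate but scarce while silver labels are plentiful but noisy.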

Last updated on 04/02/2026