Although randomized controlled trials (RCTs) are the gold standard for establishing the efficacy and safety of a medical treatment, real-world evidence (RWE) generated from real-world data has been vital in postapproval monitoring and is being promoted for the regulatory process of experimental therapies. An emerging source of real-world data is electronic health records (EHRs), which contain detailed information on patient care in both structured (eg, diagnosis codes) and unstructured (eg, clinical notes and images) forms. Despite the granularity of the data available in EHRs, the critical variables required to reliably assess the relationship between a treatment and clinical outcome are challenging to extract. To address this fundamental challenge and accelerate the reliable use of EHRs for RWE, we introduce an integrated data curation and modeling pipeline consisting of 4 modules that leverage recent advances in natural language processing, computational phenotyping, and causal modeling techniques with noisy data. Module 1 consists of techniques for data harmonization. We use natural language processing to recognize clinical variables from RCT design documents and map the extracted variables to EHR features with description matching and knowledge networks. Module 2 then develops techniques for cohort construction using advanced phenotyping algorithms to both identify patients with diseases of interest and define the treatment arms. Module 3 introduces methods for variable curation, including a list of existing tools to extract baseline variables from different sources (eg, codified, free text, and medical imaging) and end points of various types (eg, death, binary, temporal, and numerical). Finally, module 4 presents validation and robust modeling methods, and we propose a strategy to create gold-standard labels for EHR variables of interest to validate data curation quality and perform subsequent causal modeling for RWE. In addition to the workflow proposed in our pipeline, we also develop a reporting guideline for RWE that covers the necessary information to facilitate transparent reporting and reproducibility of results. Moreover, our pipeline is highly data driven, enhancing study data with a rich variety of publicly available information and knowledge sources. We also showcase our pipeline and provide guidance on the deployment of relevant tools by revisiting the emulation of the Clinical Outcomes of Surgical Therapy Study Group Trial on laparoscopy-assisted colectomy versus open colectomy in patients with early-stage colon cancer. We also draw on existing literature on EHR emulation of RCTs together with our own studies with the Mass General Brigham EHR.
Publications
2023
Functional tissue units (FTUs) form the basic building blocks of organs and are important for understanding and modeling the healthy physiological function of the organ and changes during disease states. In this first comprehensive catalog of FTUs, we document the definition, physical dimensions, vasculature, and cellular composition of 22 anatomically correct, nested functional tissue units (FTUs) in 10 healthy human organs. The catalog includes datasets, illustrations, an interactive online FTU explorer, and a large printable poster. All data and code are freely available. This is part of a larger ongoing international effort to construct a Human Reference Atlas (HRA) of all cells in the human body.
2022
In this extended abstract, we describe and analyze a lossy compression of MinHash from buckets of size O(logn) to buckets of size O(loglogn) by encoding using floating-point notation. This new compressed sketch, which we call HyperMinHash, as we build off a HyperLogLog scaffold, can be used as a drop-in replacement of MinHash. Unlike comparable Jaccard index fingerprinting algorithms in sub-logarithmic space (such as b-bit MinHash), HyperMinHash retains MinHash's features of streaming updates, unions, and cardinality estimation. For an additive approximation error ϵ on a Jaccard index t, given a random oracle, HyperMinHash needs O(ϵ-2(loglogn+log1ϵ)) space. HyperMinHash allows estimating Jaccard indices of 0.01 for set cardinalities on the order of 1019 with relative error of around 10% using 2MiB of memory; MinHash can only estimate Jaccard indices for cardinalities of 1010 with the same memory consumption.
Admissions are generally classified as COVID-19 hospitalizations if the patient has a positive SARS-CoV-2 polymerase chain reaction (PCR) test. However, because 35% of SARS-CoV-2 infections are asymptomatic, patients admitted for unrelated indications with an incidentally positive test could be misclassified as a COVID-19 hospitalization. EHR-based studies have been unable to distinguish between a hospitalization specifically for COVID-19 versus an incidental SARS-CoV-2 hospitalization. From a retrospective EHR-based cohort in four US healthcare systems, a random sample of 1,123 SARS-CoV-2 PCR-positive patients hospitalized between 3/2020â€"8/2021 was manually chart-reviewed and classified as admitted-with-COVID-19 (incidental) vs. specifically admitted for COVID-19 (for-COVID-19). EHR-based phenotyped feature sets filtered out incidental admissions, which occurred in 26%. The top site-specific feature sets had 79-99% specificity with 62-75% sensitivity, while the best performing across-site feature set had 71-94% specificity with 69-81% sensitivity. A large proportion of SARS-CoV-2 PCR-positive admissions were incidental. Straightforward EHR-based phenotypes differentiated admissions, which is important to assure accurate public health reporting and research.
Although intracranial hemorrhage (ICH) is frequent in the setting of brain metastases, there are limited data on the influence of antiplatelet agents on the development of brain tumor-associated ICH. To evaluate whether the administration of antiplatelet agents increases the risk of ICH, we performed a matched cohort analysis of patients with metastatic brain tumors with blinded radiology review. The study population included 392 patients with metastatic brain tumors (134 received antiplatelet agents and 258 acted as controls). Non-small cell lung cancer was the most common malignancy in the cohort (74.0%), followed by small cell lung cancer (9.9%), melanoma (4.6%), and renal cell cancer (4.3%). Among those who received an antiplatelet agent, 86.6% received aspirin alone and 23.1% received therapeutic anticoagulation during the study period. The cumulative incidence of any ICH at 1 year was 19.3% (95% CI, 14.1-24.4) in patients not receiving antiplatelet agents compared with 22.5% (95% CI, 15.2-29.8; P = .22, Gray test) in those receiving antiplatelet agents. The cumulative incidence of major ICH was 5.4% (95% CI, 2.6-8.3) among controls compared with 5.5% (95% CI, 1.5-9.5; P = .80) in those exposed to antiplatelet agents. The combination of anticoagulation plus antiplatelet agents did not increase the risk of major ICH. The use of antiplatelet agents was not associated with an increase in the incidence, size, or severity of ICH in the setting of brain metastases.
BACKGROUND: Admissions are generally classified as COVID-19 hospitalizations if the patient has a positive SARS-CoV-2 polymerase chain reaction (PCR) test. However, because 35% of SARS-CoV-2 infections are asymptomatic, patients admitted for unrelated indications with an incidentally positive test could be misclassified as a COVID-19 hospitalization. Electronic health record (EHR)-based studies have been unable to distinguish between a hospitalization specifically for COVID-19 versus an incidental SARS-CoV-2 hospitalization. Although the need to improve classification of COVID-19 versus incidental SARS-CoV-2 is well understood, the magnitude of the problems has only been characterized in small, single-center studies. Furthermore, there have been no peer-reviewed studies evaluating methods for improving classification.
OBJECTIVE: The aims of this study are to, first, quantify the frequency of incidental hospitalizations over the first 15 months of the pandemic in multiple hospital systems in the United States and, second, to apply electronic phenotyping techniques to automatically improve COVID-19 hospitalization classification.
METHODS: From a retrospective EHR-based cohort in 4 US health care systems in Massachusetts, Pennsylvania, and Illinois, a random sample of 1123 SARS-CoV-2 PCR-positive patients hospitalized from March 2020 to August 2021 was manually chart-reviewed and classified as "admitted with COVID-19" (incidental) versus specifically admitted for COVID-19 ("for COVID-19"). EHR-based phenotyping was used to find feature sets to filter out incidental admissions.
RESULTS: EHR-based phenotyped feature sets filtered out incidental admissions, which occurred in an average of 26% of hospitalizations (although this varied widely over time, from 0% to 75%). The top site-specific feature sets had 79%-99% specificity with 62%-75% sensitivity, while the best-performing across-site feature sets had 71%-94% specificity with 69%-81% sensitivity.
CONCLUSIONS: A large proportion of SARS-CoV-2 PCR-positive admissions were incidental. Straightforward EHR-based phenotypes differentiated admissions, which is important to assure accurate public health reporting and research.
Given the growing number of prediction algorithms developed to predict COVID-19 mortality, we evaluated the transportability of a mortality prediction algorithm using a multi-national network of healthcare systems. We predicted COVID-19 mortality using baseline commonly measured laboratory values and standard demographic and clinical covariates across healthcare systems, countries, and continents. Specifically, we trained a Cox regression model with nine measured laboratory test values, standard demographics at admission, and comorbidity burden pre-admission. These models were compared at site, country, and continent level. Of the 39,969 hospitalized patients with COVID-19 (68.6% male), 5717 (14.3%) died. In the Cox model, age, albumin, AST, creatine, CRP, and white blood cell count are most predictive of mortality. The baseline covariates are more predictive of mortality during the early days of COVID-19 hospitalization. Models trained at healthcare systems with larger cohort size largely retain good transportability performance when porting to different sites. The combination of routine laboratory test values at admission along with basic demographic features can predict mortality in patients hospitalized with COVID-19. Importantly, this potentially deployable model differs from prior work by demonstrating not only consistent performance but also reliable transportability across healthcare systems in the US and Europe, highlighting the generalizability of this model and the overall approach.
OBJECTIVE: To assess changes in international mortality rates and laboratory recovery rates during hospitalisation for patients hospitalised with SARS-CoV-2 between the first wave (1 March to 30 June 2020) and the second wave (1 July 2020 to 31 January 2021) of the COVID-19 pandemic.
DESIGN, SETTING AND PARTICIPANTS: This is a retrospective cohort study of 83 178 hospitalised patients admitted between 7 days before or 14 days after PCR-confirmed SARS-CoV-2 infection within the Consortium for Clinical Characterization of COVID-19 by Electronic Health Record, an international multihealthcare system collaborative of 288 hospitals in the USA and Europe. The laboratory recovery rates and mortality rates over time were compared between the two waves of the pandemic.
PRIMARY AND SECONDARY OUTCOME MEASURES: The primary outcome was all-cause mortality rate within 28 days after hospitalisation stratified by predicted low, medium and high mortality risk at baseline. The secondary outcome was the average rate of change in laboratory values during the first week of hospitalisation.
RESULTS: Baseline Charlson Comorbidity Index and laboratory values at admission were not significantly different between the first and second waves. The improvement in laboratory values over time was faster in the second wave compared with the first. The average C reactive protein rate of change was -4.72 mg/dL vs -4.14 mg/dL per day (p=0.05). The mortality rates within each risk category significantly decreased over time, with the most substantial decrease in the high-risk group (42.3% in March-April 2020 vs 30.8% in November 2020 to January 2021, p<0.001) and a moderate decrease in the intermediate-risk group (21.5% in March-April 2020 vs 14.3% in November 2020 to January 2021, p<0.001).
CONCLUSIONS: Admission profiles of patients hospitalised with SARS-CoV-2 infection did not differ greatly between the first and second waves of the pandemic, but there were notable differences in laboratory improvement rates during hospitalisation. Mortality risks among patients with similar risk profiles decreased over the course of the pandemic. The improvement in laboratory values and mortality risk was consistent across multiple countries.
The risk profiles of post-acute sequelae of COVID-19 (PASC) have not been well characterized in multi-national settings with appropriate controls. We leveraged electronic health record (EHR) data from 277 international hospitals representing 414,602 patients with COVID-19, 2.3 million control patients without COVID-19 in the inpatient and outpatient settings, and over 221 million diagnosis codes to systematically identify new-onset conditions enriched among patients with COVID-19 during the post-acute period. Compared to inpatient controls, inpatient COVID-19 cases were at significant risk for angina pectoris (RR 1.30, 95% CI 1.09-1.55), heart failure (RR 1.22, 95% CI 1.10-1.35), cognitive dysfunctions (RR 1.18, 95% CI 1.07-1.31), and fatigue (RR 1.18, 95% CI 1.07-1.30). Relative to outpatient controls, outpatient COVID-19 cases were at risk for pulmonary embolism (RR 2.10, 95% CI 1.58-2.76), venous embolism (RR 1.34, 95% CI 1.17-1.54), atrial fibrillation (RR 1.30, 95% CI 1.13-1.50), type 2 diabetes (RR 1.26, 95% CI 1.16-1.36) and vitamin D deficiency (RR 1.19, 95% CI 1.09-1.30). Outpatient COVID-19 cases were also at risk for loss of smell and taste (RR 2.42, 95% CI 1.90-3.06), inflammatory neuropathy (RR 1.66, 95% CI 1.21-2.27), and cognitive dysfunction (RR 1.18, 95% CI 1.04-1.33). The incidence of post-acute cardiovascular and pulmonary conditions decreased across time among inpatient cases while the incidence of cardiovascular, digestive, and metabolic conditions increased among outpatient cases. Our study, based on a federated international network, systematically identified robust conditions associated with PASC compared to control groups, underscoring the multifaceted cardiovascular and neurological phenotype profiles of PASC.
OBJECTIVE: For multi-center heterogeneous Real-World Data (RWD) with time-to-event outcomes and high-dimensional features, we propose the SurvMaximin algorithm to estimate Cox model feature coefficients for a target population by borrowing summary information from a set of health care centers without sharing patient-level information.
MATERIALS AND METHODS: For each of the centers from which we want to borrow information to improve the prediction performance for the target population, a penalized Cox model is fitted to estimate feature coefficients for the center. Using estimated feature coefficients and the covariance matrix of the target population, we then obtain a SurvMaximin estimated set of feature coefficients for the target population. The target population can be an entire cohort comprised of all centers, corresponding to federated learning, or a single center, corresponding to transfer learning.
RESULTS: Simulation studies and a real-world international electronic health records application study, with 15 participating health care centers across three countries (France, Germany, and the U.S.), show that the proposed SurvMaximin algorithm achieves comparable or higher accuracy compared with the estimator using only the information of the target site and other existing methods. The SurvMaximin estimator is robust to variations in sample sizes and estimated feature coefficients between centers, which amounts to significantly improved estimates for target sites with fewer observations.
CONCLUSIONS: The SurvMaximin method is well suited for both federated and transfer learning in the high-dimensional survival analysis setting. SurvMaximin only requires a one-time summary information exchange from participating centers. Estimated regression vectors can be very heterogeneous. SurvMaximin provides robust Cox feature coefficient estimates without outcome information in the target population and is privacy-preserving.