Publications

2022

Gutiérrez-Sacristán, Alba, Arnaud Serret-Larmande, Meghan R Hutch, Carlos Sáez, Bruce J Aronow, Surbhi Bhatnagar, Clara-Lea Bonzel, et al. 2022. “Hospitalizations Associated With Mental Health Conditions Among Adolescents in the US and France During the COVID-19 Pandemic.” JAMA Network Open 5 (12): e2246548. https://doi.org/10.1001/jamanetworkopen.2022.46548.

IMPORTANCE: The COVID-19 pandemic has been associated with an increase in mental health diagnoses among adolescents, though the extent of the increase, particularly for severe cases requiring hospitalization, has not been well characterized. Large-scale federated informatics approaches provide the ability to efficiently and securely query health care data sets to assess and monitor hospitalization patterns for mental health conditions among adolescents.

OBJECTIVE: To estimate changes in the proportion of hospitalizations associated with mental health conditions among adolescents following onset of the COVID-19 pandemic.

DESIGN, SETTING, AND PARTICIPANTS: This retrospective, multisite cohort study of adolescents 11 to 17 years of age who were hospitalized with at least 1 mental health condition diagnosis between February 1, 2019, and April 30, 2021, used patient-level data from electronic health records of 8 children's hospitals in the US and France.

MAIN OUTCOMES AND MEASURES: Change in the monthly proportion of mental health condition-associated hospitalizations between the prepandemic (February 1, 2019, to March 31, 2020) and pandemic (April 1, 2020, to April 30, 2021) periods using interrupted time series analysis.

RESULTS: There were 9696 adolescents hospitalized with a mental health condition during the prepandemic period (5966 [61.5%] female) and 11 101 during the pandemic period (7603 [68.5%] female). The mean (SD) age in the prepandemic cohort was 14.6 (1.9) years and in the pandemic cohort, 14.7 (1.8) years. The most prevalent diagnoses during the pandemic were anxiety (6066 [57.4%]), depression (5065 [48.0%]), and suicidality or self-injury (4673 [44.2%]). There was an increase in the proportions of monthly hospitalizations during the pandemic for anxiety (0.55%; 95% CI, 0.26%-0.84%), depression (0.50%; 95% CI, 0.19%-0.79%), and suicidality or self-injury (0.38%; 95% CI, 0.08%-0.68%). There was an estimated 0.60% increase (95% CI, 0.31%-0.89%) overall in the monthly proportion of mental health-associated hospitalizations following onset of the pandemic compared with the prepandemic period.

CONCLUSIONS AND RELEVANCE: In this cohort study, onset of the COVID-19 pandemic was associated with increased hospitalizations with mental health diagnoses among adolescents. These findings support the need for greater resources within children's hospitals to care for adolescents with mental health conditions during the pandemic and beyond.

Börner, Katy, Andreas Bueckle, Bruce W Herr, Leonard E Cross, Ellen M Quardokus, Elizabeth G Record, Yingnan Ju, et al. 2022. “Tissue Registration and Exploration User Interfaces in Support of a Human Reference Atlas.” Communications Biology 5 (1): 1369. https://doi.org/10.1038/s42003-022-03644-x.

Seventeen international consortia are collaborating on a human reference atlas (HRA), a comprehensive, high-resolution, three-dimensional atlas of all the cells in the healthy human body. Laboratories around the world are collecting tissue specimens from donors varying in sex, age, ethnicity, and body mass index. However, harmonizing tissue data across 25 organs and more than 15 bulk and spatial single-cell assay types poses challenges. Here, we present software tools and user interfaces developed to spatially and semantically annotate ("register") and explore the tissue data and the evolving HRA. A key part of these tools is a common coordinate framework, providing standard terminologies and data structures for describing specimen, biological structure, and spatial data linked to existing ontologies. As of April 22, 2022, the "registration" user interface has been used to harmonize and publish data on 5,909 tissue blocks collected by the Human Biomolecular Atlas Program (HuBMAP), the Stimulating Peripheral Activity to Relieve Conditions program (SPARC), the Human Cell Atlas (HCA), the Kidney Precision Medicine Project (KPMP), and the Genotype Tissue Expression project (GTEx). Further, 5,856 tissue sections were derived from 506 HuBMAP tissue blocks. The second "exploration" user interface enables consortia to evaluate data quality, explore tissue data spatially within the context of the HRA, and guide data acquisition. A companion website is at https://cns-iu.github.io/HRA-supporting-information/ .

2021

Weber, Griffin M, Chuan Hong, Nathan P Palmer, Paul Avillach, Shawn N Murphy, Alba Gutiérrez-Sacristán, Zongqi Xia, et al. 2021. “International Comparisons of Harmonized Laboratory Value Trajectories to Predict Severe COVID-19: Leveraging the 4CE Collaborative Across 342 Hospitals and 6 Countries: A Retrospective Cohort Study.” MedRxiv: The Preprint Server for Health Sciences. https://doi.org/10.1101/2020.12.16.20247684.

OBJECTIVES: To perform an international comparison of the trajectory of laboratory values among hospitalized patients with COVID-19 who develop severe disease and identify optimal timing of laboratory value collection to predict severity across hospitals and regions.

DESIGN: Retrospective cohort study.

SETTING: The Consortium for Clinical Characterization of COVID-19 by EHR (4CE), an international multi-site data-sharing collaborative of 342 hospitals in the US and in Europe.

PARTICIPANTS: Patients hospitalized with COVID-19, admitted before or after PCR-confirmed result for SARS-CoV-2.

PRIMARY AND SECONDARY OUTCOME MEASURES: Patients were categorized as "ever-severe" or "never-severe" using the validated 4CE severity criteria. Eighteen laboratory tests associated with poor COVID-19-related outcomes were evaluated for predictive accuracy by area under the curve (AUC), compared between the severity categories. Subgroup analysis was performed to validate a subset of laboratory values as predictive of severity against a published algorithm. A subset of laboratory values (CRP, albumin, LDH, neutrophil count, D-dimer, and procalcitonin) was compared between North American and European sites for severity prediction.

RESULTS: Of 36,447 patients with COVID-19, 19,953 (43.7%) were categorized as ever-severe. Most patients (78.7%) were 50 years of age or older and male (60.5%). Longitudinal trajectories of CRP, albumin, LDH, neutrophil count, D-dimer, and procalcitonin showed association with disease severity. Significant differences of laboratory values at admission were found between the two groups. With the exception of D-dimer, predictive discrimination of laboratory values did not improve after admission. Sub-group analysis using age, D-dimer, CRP, and lymphocyte count as predictive of severity at admission showed similar discrimination to a published algorithm (AUC=0.88 and 0.91, respectively). Both models deteriorated in predictive accuracy as the disease progressed. On average, no difference in severity prediction was found between North American and European sites.

CONCLUSIONS: Laboratory test values at admission can be used to predict severity in patients with COVID-19. Prediction models show consistency across international sites highlighting the potential generalizability of these models.

Haendel, Melissa A, Christopher G Chute, Tellen D Bennett, David A Eichmann, Justin Guinney, Warren A Kibbe, Philip R O Payne, et al. 2021. “The National COVID Cohort Collaborative (N3C): Rationale, Design, Infrastructure, and Deployment.” Journal of the American Medical Informatics Association: JAMIA 28 (3): 427-43. https://doi.org/10.1093/jamia/ocaa196.

OBJECTIVE: Coronavirus disease 2019 (COVID-19) poses societal challenges that require expeditious data and knowledge sharing. Though organizational clinical data are abundant, these are largely inaccessible to outside researchers. Statistical, machine learning, and causal analyses are most successful with large-scale data beyond what is available in any given organization. Here, we introduce the National COVID Cohort Collaborative (N3C), an open science community focused on analyzing patient-level data from many centers.

MATERIALS AND METHODS: The Clinical and Translational Science Award Program and scientific community created N3C to overcome technical, regulatory, policy, and governance barriers to sharing and harmonizing individual-level clinical data. We developed solutions to extract, aggregate, and harmonize data across organizations and data models, and created a secure data enclave to enable efficient, transparent, and reproducible collaborative analytics.

RESULTS: Organized in inclusive workstreams, we created legal agreements and governance for organizations and researchers; data extraction scripts to identify and ingest positive, negative, and possible COVID-19 cases; a data quality assurance and harmonization pipeline to create a single harmonized dataset; population of the secure data enclave with data, machine learning, and statistical analytics tools; dissemination mechanisms; and a synthetic data pilot to democratize data access.

CONCLUSIONS: The N3C has demonstrated that a multisite collaborative learning health network can overcome barriers to rapidly build a scalable infrastructure incorporating multiorganizational clinical data for COVID-19 analytics. We expect this effort to save lives by enabling rapid collaboration among clinicians, researchers, and data scientists to identify treatments and specialized care and thereby reduce the immediate and long-term impacts of COVID-19.

Kohane, Isaac S, Bruce J Aronow, Paul Avillach, Brett K Beaulieu-Jones, Riccardo Bellazzi, Robert L Bradford, Gabriel A Brat, et al. 2021. “What Every Reader Should Know About Studies Using Electronic Health Record Data But May Be Afraid to Ask.” Journal of Medical Internet Research 23 (3): e22219. https://doi.org/10.2196/22219.

Coincident with the tsunami of COVID-19-related publications, there has been a surge of studies using real-world data, including those obtained from the electronic health record (EHR). Unfortunately, several of these high-profile publications were retracted because of concerns regarding the soundness and quality of the studies and the EHR data they purported to analyze. These retractions highlight that although a small community of EHR informatics experts can readily identify strengths and flaws in EHR-derived studies, many medical editorial teams and otherwise sophisticated medical readers lack the framework to fully critically appraise these studies. In addition, conventional statistical analyses cannot overcome the need for an understanding of the opportunities and limitations of EHR-derived studies. We distill here from the broader informatics literature six key considerations that are crucial for appraising studies utilizing EHR data: data completeness, data collection and handling (eg, transformation), data type (ie, codified, textual), robustness of methods against EHR variability (within and across institutions, countries, and time), transparency of data and analytic code, and the multidisciplinary approach. These considerations will inform researchers, clinicians, and other stakeholders as to the recommended best practices in reviewing manuscripts, grants, and other outputs from EHR-data derived studies, and thereby promote and foster rigor, quality, and reliability of this rapidly growing field.

Beaulieu-Jones, Brett K, William Yuan, Gabriel A Brat, Andrew L Beam, Griffin Weber, Marshall Ruffin, and Isaac S Kohane. 2021. “Machine Learning for Patient Risk Stratification: Standing On, or Looking Over, the Shoulders of Clinicians?” NPJ Digital Medicine 4 (1): 62. https://doi.org/10.1038/s41746-021-00426-3.

Machine learning can help clinicians to make individualized patient predictions only if researchers demonstrate models that contribute novel insights, rather than learning the most likely next step in a set of actions a clinician will take. We trained deep learning models using only clinician-initiated, administrative data for 42.9 million admissions using three subsets of data: demographic data only, demographic data and information available at admission, and the previous data plus charges recorded during the first day of admission. Models trained on charges during the first day of admission achieve performance close to published full EMR-based benchmarks for inpatient outcomes: in-hospital mortality (0.89 AUC), prolonged length of stay (0.82 AUC), and 30-day readmission rate (0.71 AUC). Similar performance between models trained with only clinician-initiated data and those trained with full EMR data purporting to include information about patient state and physiology should raise concern in the deployment of these models. Furthermore, these models exhibited significant declines in performance when evaluated over only myocardial infarction (MI) patients relative to models trained over MI patients alone, highlighting the importance of physician diagnosis in the prognostic performance of these models. These results provide a benchmark for predictive accuracy trained only on prior clinical actions and indicate that models with similar performance may derive their signal by looking over clinicians' shoulders, using clinical behavior as the expression of preexisting intuition and suspicion to generate a prediction. For models to guide clinicians in individual decisions, performance exceeding these benchmarks is necessary.

Visweswaran, Shyam, Malarkodi J Samayamuthu, Michele Morris, Griffin M Weber, Douglas MacFadden, Philip Trevvett, Jeffrey G Klann, et al. 2021. “Development of a COVID-19 Application Ontology for the ACT Network.” MedRxiv: The Preprint Server for Health Sciences. https://doi.org/10.1101/2021.03.15.21253596.

Clinical data networks that leverage large volumes of data in electronic health records (EHRs) are significant resources for research on coronavirus disease 2019 (COVID-19). Data harmonization is a key challenge in seamless use of multisite EHRs for COVID-19 research. We developed a COVID-19 application ontology in the national Accrual to Clinical Trials (ACT) network that enables harmonization of data elements that are critical to COVID-19 research. The ontology contains over 50,000 concepts in the domains of diagnosis, procedures, medications, and laboratory tests. In particular, it has computational phenotypes to characterize the course of illness and outcomes, derived terms, and harmonized value sets for SARS-CoV-2 laboratory tests. The ontology was deployed and validated on the ACT COVID-19 network that consists of nine academic health centers with data on 14.5M patients. This ontology, which is freely available to the entire research community on GitHub at https://github.com/shyamvis/ACT-COVID-Ontology, will be useful for harmonizing EHRs for COVID-19 research beyond the ACT network.

Visweswaran, Shyam, Malarkodi J Samayamuthu, Michele Morris, Griffin M Weber, Douglas MacFadden, Philip Trevvett, Jeffrey G Klann, et al. 2021. “Development of a Coronavirus Disease 2019 (COVID-19) Application Ontology for the Accrual to Clinical Trials (ACT) Network.” JAMIA Open 4 (2): ooab036. https://doi.org/10.1093/jamiaopen/ooab036.

Clinical data networks that leverage large volumes of data in electronic health records (EHRs) are significant resources for research on coronavirus disease 2019 (COVID-19). Data harmonization is a key challenge in seamless use of multisite EHRs for COVID-19 research. We developed a COVID-19 application ontology in the national Accrual to Clinical Trials (ACT) network that enables harmonization of data elements that are critical to COVID-19 research. The ontology contains over 50 000 concepts in the domains of diagnosis, procedures, medications, and laboratory tests. In particular, it has computational phenotypes to characterize the course of illness and outcomes, derived terms, and harmonized value sets for severe acute respiratory syndrome coronavirus 2 laboratory tests. The ontology was deployed and validated on the ACT COVID-19 network that consists of 9 academic health centers with data on 14.5M patients. This ontology, which is freely available to the entire research community on GitHub at https://github.com/shyamvis/ACT-COVID-Ontology, will be useful for harmonizing EHRs for COVID-19 research beyond the ACT network.

Bourgeois, Florence T, Alba Gutiérrez-Sacristán, Mark S Keller, Molei Liu, Chuan Hong, Clara-Lea Bonzel, Amelia L M Tan, et al. 2021. “International Analysis of Electronic Health Records of Children and Youth Hospitalized With COVID-19 Infection in 6 Countries.” JAMA Network Open 4 (6): e2112596. https://doi.org/10.1001/jamanetworkopen.2021.12596.

IMPORTANCE: Additional sources of pediatric epidemiological and clinical data are needed to efficiently study COVID-19 in children and youth and inform infection prevention and clinical treatment of pediatric patients.

OBJECTIVE: To describe international hospitalization trends and key epidemiological and clinical features of children and youth with COVID-19.

DESIGN, SETTING, AND PARTICIPANTS: This retrospective cohort study included pediatric patients hospitalized between February 2 and October 10, 2020. Patient-level electronic health record (EHR) data were collected across 27 hospitals in France, Germany, Spain, Singapore, the UK, and the US. Patients younger than 21 years who tested positive for COVID-19 and were hospitalized at an institution participating in the Consortium for Clinical Characterization of COVID-19 by EHR were included in the study.

MAIN OUTCOMES AND MEASURES: Patient characteristics, clinical features, and medication use.

RESULTS: There were 347 males (52%; 95% CI, 48.5-55.3) and 324 females (48%; 95% CI, 44.4-51.3) in this study's cohort. There was a bimodal age distribution, with the greatest proportion of patients in the 0- to 2-year (199 patients [30%]) and 12- to 17-year (170 patients [25%]) age range. Trends in hospitalizations for 671 children and youth found discrete surges with variable timing across 6 countries. Data from this cohort mirrored national-level pediatric hospitalization trends for most countries with available data, with peaks in hospitalizations during the initial spring surge occurring within 23 days in the national-level and 4CE data. A total of 27 364 laboratory values for 16 laboratory tests were analyzed, with mean values indicating elevations in markers of inflammation (C-reactive protein, 83 mg/L; 95% CI, 53-112 mg/L; ferritin, 417 ng/mL; 95% CI, 228-607 ng/mL; and procalcitonin, 1.45 ng/mL; 95% CI, 0.13-2.77 ng/mL). Abnormalities in coagulation were also evident (D-dimer, 0.78 µg/mL; 95% CI, 0.35-1.21 µg/mL; and fibrinogen, 477 mg/dL; 95% CI, 385-569 mg/dL). Cardiac troponin, when checked (n = 59), was elevated (0.032 ng/mL; 95% CI, 0.000-0.080 ng/mL). Common complications included cardiac arrhythmias (15.0%; 95% CI, 8.1%-21.7%), viral pneumonia (13.3%; 95% CI, 6.5%-20.1%), and respiratory failure (10.5%; 95% CI, 5.8%-15.3%). Few children were treated with COVID-19-directed medications.

CONCLUSIONS AND RELEVANCE: This study of EHRs of children and youth hospitalized for COVID-19 in 6 countries demonstrated variability in hospitalization trends across countries and identified common complications and laboratory abnormalities in children and youth with COVID-19 infection. Large-scale informatics-based approaches to integrate and analyze data across health care systems complement methods of disease surveillance and advance understanding of epidemiological and clinical features associated with COVID-19 in children and youth.

Tao, Ziye, Griffin M Weber, and Yun William Yu. 2021. “Expected 10-Anonymity of HyperLogLog Sketches for Federated Queries of Clinical Data Repositories.” Bioinformatics (Oxford, England) 37 (Suppl_1): i151-i160. https://doi.org/10.1093/bioinformatics/btab292.

MOTIVATION: The rapid growth of electronic medical records provides immense potential to researchers, but the records are often siloed at separate hospitals. As a result, federated networks have arisen, which allow simultaneous querying of medical databases at a group of connected institutions. The most basic such query is the aggregate count, e.g., how many patients have diabetes? However, depending on the protocol used to estimate that total, there is always a tradeoff between the accuracy of the estimate and the risk of leaking confidential data. Prior work has shown that it is possible to empirically control that tradeoff by using the HyperLogLog (HLL) probabilistic sketch.

RESULTS: In this article, we prove complementary theoretical bounds on the k-anonymity privacy risk of using HLL sketches, as well as exhibit code to efficiently compute those bounds.

AVAILABILITY AND IMPLEMENTATION: https://github.com/tzyRachel/K-anonymity-Expectation.
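
For readers unfamiliar with the HLL sketch named in the abstract above, the following is a minimal illustrative sketch of how a federated aggregate count could work in principle. It is not the code from the repository linked above and it does not compute the k-anonymity bounds proved in the article. The class name `HyperLogLog`, the helper `build_sketch`, the patient identifier format, the use of SHA-256 as the hash, and the choice of 2^10 registers are all assumptions made for illustration.

```python
import hashlib
import math


class HyperLogLog:
    """Toy HyperLogLog sketch with 2**p registers (illustrative only)."""

    def __init__(self, p=10):
        # The bias constant used in estimate() assumes at least 128 registers.
        if p < 7:
            raise ValueError("p must be at least 7 in this sketch")
        self.p = p
        self.m = 1 << p
        self.registers = [0] * self.m

    def add(self, item):
        # Derive a 64-bit hash from SHA-256 (an arbitrary choice for this example).
        digest = hashlib.sha256(str(item).encode("utf-8")).digest()
        h = int.from_bytes(digest[:8], "big")
        idx = h >> (64 - self.p)                 # first p bits select a register
        rest = h & ((1 << (64 - self.p)) - 1)    # remaining 64 - p bits
        # Rank = 1-based position of the leftmost 1-bit in the remaining bits.
        rank = (64 - self.p) - rest.bit_length() + 1
        self.registers[idx] = max(self.registers[idx], rank)

    def merge(self, other):
        # The union of two sketches is the register-wise maximum, so a patient
        # seen at both sites is counted once.
        if other.p != self.p:
            raise ValueError("sketches must use the same number of registers")
        merged = HyperLogLog(self.p)
        merged.registers = [max(a, b) for a, b in zip(self.registers, other.registers)]
        return merged

    def estimate(self):
        alpha = 0.7213 / (1 + 1.079 / self.m)
        raw = alpha * self.m ** 2 / sum(2.0 ** -r for r in self.registers)
        zeros = self.registers.count(0)
        if raw <= 2.5 * self.m and zeros:
            # Small-range correction (linear counting).
            return self.m * math.log(self.m / zeros)
        return raw


def build_sketch(patient_ids, p=10):
    """Hypothetical helper: each site summarizes its matching patients locally."""
    sketch = HyperLogLog(p)
    for pid in patient_ids:
        sketch.add(pid)
    return sketch


if __name__ == "__main__":
    # Two hypothetical hospitals whose patient populations overlap.
    site_a = build_sketch(f"patient-{i}" for i in range(0, 6000))
    site_b = build_sketch(f"patient-{i}" for i in range(4000, 10000))
    print(round(site_a.merge(site_b).estimate()))  # approximate distinct count
```

In this sketch each site shares only its register array rather than patient identifiers, and the coordinator merges the arrays to estimate the number of distinct patients. Fewer registers reveal less about any individual but give a noisier count, which is the kind of accuracy versus privacy tradeoff the abstract describes.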