Publications

2020

Weber, Griffin M, Yingnan Ju, and Katy Börner. 2020. “Considerations for Using the Vasculature As a Coordinate System to Map All the Cells in the Human Body.” Frontiers in Cardiovascular Medicine 7: 29. https://doi.org/10.3389/fcvm.2020.00029.

Several ongoing international efforts are developing methods of localizing single cells within organs or mapping the entire human body at the single cell level, including the Chan Zuckerberg Initiative's Human Cell Atlas (HCA), the Knut and Alice Wallenberg Foundation's Human Protein Atlas (HPA), and the National Institutes of Health's Human BioMolecular Atlas Program (HuBMAP). Their goals are to understand cell specialization, interactions, spatial organization in their natural context, and ultimately the function of every cell within the body. In the same way that the Human Genome Project had to assemble sequence data from different people to construct a complete sequence, multiple centers around the world are collecting tissue specimens from diverse populations that vary in age, race, sex, and body size. A challenge will be combining these heterogeneous tissue samples into a 3D reference map that will enable multiscale, multidimensional Google Maps-like exploration of the human body. Key to making alignment of tissue samples work is identifying and using a coordinate system called a Common Coordinate Framework (CCF), which defines the positions, or "addresses," in a reference body, from whole organs down to functional tissue units and individual cells. In this perspective, we examine the concept of a CCF based on the vasculature and describe why it would be an attractive choice for mapping the human body.

Yu, Yun William, and Griffin M Weber. 2020. “Balancing Accuracy and Privacy in Federated Queries of Clinical Data Repositories: Algorithm Development and Validation.” Journal of Medical Internet Research 22 (11): e18735. https://doi.org/10.2196/18735.

BACKGROUND: Over the past decade, the emergence of several large federated clinical data networks has enabled researchers to access data on millions of patients at dozens of health care organizations. Typically, queries are broadcast to each of the sites in the network, which then return aggregate counts of the number of matching patients. However, because patients can receive care from multiple sites in the network, simply adding the numbers frequently double counts patients. Various methods such as the use of trusted third parties or secure multiparty computation have been proposed to link patient records across sites. However, they either have large trade-offs in accuracy and privacy or are not scalable to large networks.

OBJECTIVE: This study aims to enable accurate estimates of the number of patients matching a federated query while providing strong guarantees on the amount of protected medical information revealed.

METHODS: We introduce a novel probabilistic approach to running federated network queries. It combines an algorithm called HyperLogLog with obfuscation in the form of hashing, masking, and homomorphic encryption. It is tunable, in that it allows networks to balance accuracy versus privacy, and it is computationally efficient even for large networks. We built a user-friendly free open-source benchmarking platform to simulate federated queries in large hospital networks. Using this platform, we compare the accuracy, k-anonymity privacy risk (with k=10), and computational runtime of our algorithm with several existing techniques.

RESULTS: In simulated queries matching 1 to 100 million patients in a 100-hospital network, our method was significantly more accurate than adding aggregate counts while maintaining k-anonymity. On average, it required a total of 12 kilobytes of data to be sent to the network hub and added only 5 milliseconds to the overall federated query runtime. This was orders of magnitude better than other approaches that guaranteed the exact answer.

CONCLUSIONS: Using our method, it is possible to run highly accurate federated queries of clinical data repositories that both protect patient privacy and scale to large networks.
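The double-counting problem and the HyperLogLog idea named in the abstract above can be illustrated with a minimal sketch. This is not the authors' implementation: it omits the masking and homomorphic encryption steps entirely, and the register count, the SHA-256 hash, and the function names (make_sketch, merge, estimate) are assumptions chosen for illustration. Each site builds a small register array from hashed patient identifiers, the hub takes the element-wise maximum, and the merged array estimates the number of distinct patients.

```python
import hashlib
import math

P = 12            # assumed precision: 2**12 registers per sketch
M = 1 << P


def _hash64(item: str) -> int:
    """Map an identifier to a 64-bit integer (SHA-256 is an illustrative choice)."""
    return int.from_bytes(hashlib.sha256(item.encode("utf-8")).digest()[:8], "big")


def make_sketch(patient_ids):
    """Build one site's HyperLogLog registers from its patient identifiers."""
    registers = [0] * M
    for pid in patient_ids:
        h = _hash64(pid)
        idx = h >> (64 - P)                      # top P bits select a register
        rest = h & ((1 << (64 - P)) - 1)         # remaining 52 bits
        rank = (64 - P) - rest.bit_length() + 1  # position of the leftmost 1 bit
        registers[idx] = max(registers[idx], rank)
    return registers


def merge(sketches):
    """Combine sketches from several sites; duplicates collapse automatically."""
    return [max(values) for values in zip(*sketches)]


def estimate(registers):
    """Estimate the number of distinct identifiers represented by the registers."""
    alpha = 0.7213 / (1 + 1.079 / M)
    raw = alpha * M * M / sum(2.0 ** -r for r in registers)
    zeros = registers.count(0)
    if raw <= 2.5 * M and zeros:
        return M * math.log(M / zeros)           # small-range correction
    return raw


if __name__ == "__main__":
    # Hypothetical data: two sites sharing 10,000 of their patients.
    site_a = [f"patient-{i}" for i in range(0, 30000)]
    site_b = [f"patient-{i}" for i in range(20000, 50000)]
    naive_total = len(site_a) + len(site_b)      # counts shared patients twice
    merged = merge([make_sketch(site_a), make_sketch(site_b)])
    print("sum of site counts:", naive_total)
    print("HyperLogLog estimate of distinct patients:", round(estimate(merged)))
```

Because merging is an element-wise maximum, a patient seen at both sites updates the same register with the same value, so the merged estimate tracks the distinct count rather than the sum. The accuracy versus size trade-off is governed by the number of registers.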

Brat, Gabriel A, Griffin M Weber, Nils Gehlenborg, Paul Avillach, Nathan P Palmer, Luca Chiovato, James Cimino, et al. 2020. “International Electronic Health Record-Derived COVID-19 Clinical Course Profiles: The 4CE Consortium.” NPJ Digital Medicine 3: 109. https://doi.org/10.1038/s41746-020-00308-0.

We leveraged the largely untapped resource of electronic health record data to address critical clinical and epidemiological questions about Coronavirus Disease 2019 (COVID-19). To do this, we formed an international consortium (4CE) of 96 hospitals across five countries (www.covidclinical.net). Contributors utilized the Informatics for Integrating Biology and the Bedside (i2b2) or Observational Medical Outcomes Partnership (OMOP) platforms to map to a common data model. The group focused on temporal changes in key laboratory test values. Harmonized data were analyzed locally and converted to a shared aggregate form for rapid analysis and visualization of regional differences and global commonalities. Data covered 27,584 COVID-19 cases with 187,802 laboratory tests. Case counts and laboratory trajectories were concordant with existing literature. Laboratory tests at the time of diagnosis showed hospital-level differences equivalent to country-level variation across the consortium partners. Despite the limitations of decentralized data generation, we established a framework to capture the trajectory of COVID-19 disease in patients and their response to interventions.

2019

Hejblum, Boris P, Griffin M Weber, Katherine P Liao, Nathan P Palmer, Susanne Churchill, Nancy A Shadick, Peter Szolovits, Shawn N Murphy, Isaac S Kohane, and Tianxi Cai. 2019. “Probabilistic Record Linkage of De-Identified Research Datasets With Discrepancies Using Diagnosis Codes.” Scientific Data 6: 180298. https://doi.org/10.1038/sdata.2018.298.

We develop an algorithm for probabilistic linkage of de-identified research datasets at the patient level, when only diagnosis codes with discrepancies and no personal health identifiers such as name or date of birth are available. It relies on Bayesian modelling of binarized diagnosis codes, and provides a posterior probability of matching for each patient pair, while considering all the data at once. Both in our simulation study (using an administrative claims dataset for data generation) and in two real use-cases linking patient electronic health records from a large tertiary care network, our method exhibits good performance and compares favourably to the standard baseline Fellegi-Sunter algorithm. We propose a scalable, fast and efficient open-source implementation in the ludic R package available on CRAN, which also includes the anonymized diagnosis code data from our real use-case. This work suggests it is possible to link de-identified research databases stripped of any personal health identifiers using only diagnosis codes, provided sufficient information is shared between the data sources.
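For orientation, the Fellegi-Sunter baseline mentioned in the abstract above scores a candidate record pair by summing log-likelihood-ratio weights over fields. The sketch below applies that idea to binarized diagnosis codes. It is not the Bayesian method of the paper and not the ludic package's API; the function name fs_weight and the m and u probabilities are hypothetical values chosen for illustration.

```python
import math


def fs_weight(rec_a, rec_b, m, u):
    """Fellegi-Sunter style match weight for two binary code vectors.

    m[k] is the assumed probability that field k agrees for a true match,
    u[k] the assumed probability that it agrees for a non-match.
    """
    total = 0.0
    for a, b, m_k, u_k in zip(rec_a, rec_b, m, u):
        if a == b:
            total += math.log(m_k / u_k)              # agreement weight
        else:
            total += math.log((1 - m_k) / (1 - u_k))  # disagreement weight
    return total


if __name__ == "__main__":
    # Hypothetical parameters for four binarized diagnosis codes.
    m = [0.9, 0.9, 0.9, 0.9]
    u = [0.5, 0.5, 0.5, 0.5]
    patient = [1, 0, 1, 1]
    same_codes = [1, 0, 1, 1]
    different_codes = [0, 1, 0, 0]
    print("identical codes:", fs_weight(patient, same_codes, m, u))
    print("opposite codes:", fs_weight(patient, different_codes, m, u))
```

Pairs with a positive total weight are more likely under the match model than under the non-match model. In practice the m and u probabilities would be estimated from data rather than fixed by hand.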

HuBMAP Consortium. 2019. “The Human Body at Cellular Resolution: The NIH Human Biomolecular Atlas Program.” Nature 574 (7777): 187-92. https://doi.org/10.1038/s41586-019-1629-x.

Transformative technologies are enabling the construction of three-dimensional maps of tissues with unprecedented spatial and molecular resolution. Over the next seven years, the NIH Common Fund Human Biomolecular Atlas Program (HuBMAP) intends to develop a widely accessible framework for comprehensively mapping the human body at single-cell resolution by supporting technology development, data acquisition, and detailed spatial mapping. HuBMAP will integrate its efforts with other funding agencies, programs, consortia, and the biomedical research community at large towards the shared vision of a comprehensive, accessible three-dimensional molecular and cellular atlas of the human body, in health and under various disease conditions.

2018

Agniel, Denis, Isaac S Kohane, and Griffin M Weber. 2018. “Biases in Electronic Health Record Data Due to Processes Within the Healthcare System: Retrospective Observational Study.” BMJ (Clinical Research Ed.) 361: k1479. https://doi.org/10.1136/bmj.k1479.

OBJECTIVE: To evaluate on a large scale, across 272 common types of laboratory tests, the impact of healthcare processes on the predictive value of electronic health record (EHR) data.

DESIGN: Retrospective observational study.

SETTING: Two large hospitals in Boston, Massachusetts, with inpatient, emergency, and ambulatory care.

PARTICIPANTS: All 669 452 patients treated at the two hospitals over one year between 2005 and 2006.

MAIN OUTCOME MEASURES: The relative predictive accuracy of each laboratory test for three year survival, using the time of the day, day of the week, and ordering frequency of the test, compared to the value of the test result.

RESULTS: The presence of a laboratory test order, regardless of any other information about the test result, has a significant association (P<0.001) with the odds of survival in 233 of 272 (86%) tests. Data about the timing of when laboratory tests were ordered were more accurate than the test results in predicting survival in 118 of 174 tests (68%).

CONCLUSIONS: Healthcare processes must be addressed and accounted for in analysis of observational health data. Without careful consideration to context, EHR data are unsuitable for many research questions. However, if explicitly modeled, the same processes that make EHR data complex can be leveraged to gain insight into patients' state of health.

Lungeanu, Alina, Dorothy R Carter, Leslie A DeChurch, and Noshir S Contractor. 2018. “How Team Interlock Ecosystems Shape the Assembly of Scientific Teams: A Hypergraph Approach.” Communication Methods and Measures 12 (2-3): 174-98. https://doi.org/10.1080/19312458.2018.1430756.

Today's most pressing scientific problems necessitate scientific teamwork; the increasing complexity and specialization of knowledge render "lone geniuses" ill-equipped to make high-impact scientific breakthroughs. Social network research has begun to explore the factors that promote the assembly of scientific teams. However, this work has been limited by network approaches centered conceptually and analytically on "nodes as people," or "nodes as teams." In this paper, we develop a "team-interlock ecosystem" conceptualization of collaborative environments within which new scientific teams, or other creative team-based enterprises, assemble. Team interlock ecosystems comprise teams linked to one another through overlapping memberships and/or overlapping knowledge domains. They depict teams, people, and knowledge sets as nodes, and thus present both conceptual advantages and methodological challenges. Conceptually, team interlock ecosystems invite novel questions about how the structural characteristics of embedding ecosystems serve as the primordial soup from which new teams assemble. Methodologically, however, studying ecosystems requires the use of more advanced analytics that correspond to the inherently multilevel phenomenon of scientists nested within multiple teams. To address these methodological challenges, we advance the use of hypergraph methodologies combined with bibliometric data and simulation-based approaches to test hypotheses related to the ecosystem drivers of team assembly.
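The notion of teams linked through overlapping memberships can be made concrete with a small sketch. It treats each team as a hyperedge (a set of people) and lists the pairs of teams that share at least one member. The team names, member names, and the function team_interlocks are hypothetical; the paper's own analytics (hypergraph models combined with bibliometric data and simulation) are considerably richer than this.

```python
from itertools import combinations


def team_interlocks(teams):
    """Return {(team_a, team_b): shared_members} for every overlapping pair.

    `teams` maps a team name to the set of its members, so each team acts as
    a hyperedge connecting any number of people at once.
    """
    links = {}
    for (name_a, members_a), (name_b, members_b) in combinations(sorted(teams.items()), 2):
        shared = members_a & members_b
        if shared:
            links[(name_a, name_b)] = shared
    return links


if __name__ == "__main__":
    # Hypothetical ecosystem of three teams.
    teams = {
        "T1": {"ana", "ben"},
        "T2": {"ben", "cy"},
        "T3": {"dee"},
    }
    for pair, shared in team_interlocks(teams).items():
        print(pair, "interlock through", sorted(shared))
```

Representing a team as a set rather than as a collection of pairwise ties keeps the team itself as the unit of analysis, which is the motivation for a hypergraph over an ordinary person-to-person network.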

2017

Norton, Wynne E, Alina Lungeanu, David A Chambers, and Noshir Contractor. 2017. “Mapping the Growing Discipline of Dissemination and Implementation Science in Health.” Scientometrics 112 (3): 1367-90. https://doi.org/10.1007/s11192-017-2455-2.

BACKGROUND: The field of dissemination and implementation (D&I) research in health has grown considerably in the past decade. Despite the potential for advancing the science, limited research has focused on mapping the field.

METHODS: We administered an online survey to individuals in the D&I field to assess participants' demographics and expertise, as well as engagement with journals and conferences, publications, and grants. A combined roster-nomination method was used to collect data on participants' advice networks and collaboration networks; participants' motivations for choosing collaborators were also assessed. Frequency and descriptive statistics were used to characterize the overall sample; network metrics were used to characterize both networks. Among a sub-sample of respondents who were researchers, regression analyses identified predictors of two metrics of academic performance (i.e., publications and funded grants).

RESULTS: A total of 421 individuals completed the survey, representing a 30.75% response rate of eligible individuals. Most participants were White (n = 343), female (n = 284, 67.4%), and identified as a researcher (n = 340, 81%). Both the advice and the collaboration networks displayed characteristics of a small world network. The most important motivations for selecting collaborators were aligned with advancing the science (i.e., prior collaborators, strong reputation, and good collaborators) rather than relying on human proclivities for homophily, proximity, and friendship. Among a sub-sample of 295 researchers, expertise (individual predictor), status (advice network), and connectedness (collaboration network) were significant predictors of both metrics of academic performance.

CONCLUSIONS: Network-based interventions can enhance collaboration and productivity; future research is needed to leverage these data to advance the field.

Mantia, Charlene, Erik J Uhlmann, Maneka Puligandla, Griffin M Weber, Donna Neuberg, and Jeffrey I Zwicker. 2017. “Predicting the Higher Rate of Intracranial Hemorrhage in Glioma Patients Receiving Therapeutic Enoxaparin.” Blood 129 (25): 3379-85. https://doi.org/10.1182/blood-2017-02-767285.

Venous thromboembolism occurs in up to one-third of patients with primary brain tumors. Spontaneous intracranial hemorrhage (ICH) is also a frequent occurrence in these patients, but there is limited data on the safety of therapeutic anticoagulation. To determine the rate of ICH in patients treated with enoxaparin, we performed a matched, retrospective cohort study with blinded radiology review for 133 patients with high-grade glioma. After diagnosis of glioma, the cohort that received enoxaparin was 3 times more likely to develop a major ICH than those not treated with anticoagulation (14.7% vs 2.5%; P = .036; hazard ratio [HR], 3.37; 95% confidence interval [CI], 1.02-11.14). When enoxaparin was analyzed as a time-varying covariate, anticoagulation was associated with a >13-fold increased risk of hemorrhage (HR, 13.26; 95% CI, 3.33-52.85; P < .0001). Overall survival was significantly shorter for patients who suffered a major ICH on enoxaparin compared with patients not receiving anticoagulation (3.3 vs 10.2 months; log-rank P = .012). We applied a validated ICH prediction risk score PANWARDS (platelets, albumin, no congestive heart failure, warfarin, age, race, diastolic blood pressure, stroke), and observed that all major ICHs on enoxaparin occurred in the setting of a PANWARDS score ≥25, corresponding with a sensitivity of 100% (95% CI, 63% to 100%) and a specificity of 40% (95% CI, 25% to 56%). We conclude that caution is warranted when considering therapeutic anticoagulation in patients with high-grade gliomas given the increased risk of ICH and poor prognosis after a major hemorrhage on anticoagulation. The PANWARDS score may assist clinicians in identifying the patients at greatest risk of suffering a major intracranial hemorrhage with anticoagulation.