Publications

  • Chang, Chi-Yuan, Robert Moss, Brandon Westover, and Daniel M Goldenholz. 2025. “Rigorous Evaluation of Five Models for E-Diary-Only Seizure Forecasting: Retrospective and Prospective Datasets Do Not Outperform the Napkin Method.” Epilepsia. https://doi.org/10.1111/epi.18677.

    OBJECTIVE: Seizure forecasting using e-diaries may help patients with seizures organize their daily lives. To date, most methods have not been rigorously tested against a strict standard. This study aims to assess whether various models for seizure forecasting using e-diaries perform better than a moving-window average (the “Napkin” method, so named for its simplicity of calculation).

    METHODS: We analyzed three cohorts from Seizure Tracker: a retrospective study and two prospective studies. E-diaries and seizure types were extracted from the datasets. We implemented five machine learning models (Perceptron, 1D-convolution, Multilayer Perceptron, Cycle, point-process generalized linear model) and compared their performance at seizure forecasting against the Napkin forecast. The models predicted the probability of having at least one seizure in the next 24-h period based on a 90-day historical window. Model performance was evaluated by commonly used metrics (area under the precision-recall curve, area under the receiver operating characteristic curve, and Brier score). We considered a model to be clinically ineffective if it did not outperform the Napkin method across metrics and seizure frequencies.

    RESULTS: A total of 5501 retrospective patients (3300 training, 1100 validation, and 1101 testing) and 36 prospective patients (21 from one cohort, 15 from the other) were included in the analysis. No model achieved significantly better performance than the Napkin method across metrics and frequencies.

    SIGNIFICANCE: Clinically effective seizure forecasting (i.e., beyond the Napkin method) for 24-h risk using e-diaries alone may be infeasible with currently available techniques.
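    As the abstract describes, the Napkin baseline is simply a trailing moving-window average of seizure days. A minimal sketch of that idea follows; the function name, the binary daily-diary representation, and the example diary are illustrative assumptions, not the authors' code:

```python
import numpy as np

def napkin_forecast(diary, window=90):
    """Forecast the probability of >=1 seizure in the next 24 h as the
    fraction of days with at least one seizure in the trailing window.

    diary: sequence of daily seizure counts, one entry per day.
    Returns one forecast per day after the first `window` days.
    """
    had_seizure = (np.asarray(diary) > 0).astype(float)
    # Moving average of the seizure-day indicator over the trailing window.
    return np.array([had_seizure[i - window:i].mean()
                     for i in range(window, len(had_seizure))])

# Illustrative diary: a seizure on roughly 10% of days over 120 days.
rng = np.random.default_rng(0)
diary = rng.binomial(1, 0.1, size=120)
probs = napkin_forecast(diary, window=90)
```

    Despite its simplicity, this is the benchmark the paper reports none of the five machine learning models could beat.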

  • Henry, Christopher N, and Daniel M Goldenholz. 2025. “The Road Not Taken: Misclassifying an Anti-Seizure Medication As a Failure.” Annals of Clinical and Translational Neurology. https://doi.org/10.1002/acn3.70139.

    OBJECTIVE: To quantify how often anti-seizure medications (ASMs) appear ineffective yet provide benefit when considering seizure frequency (SF) variability.

    METHODS: We used the CHOCOLATES seizure diary simulator to generate 100,000 patient seizure diaries that reflect natural SF variation in a heterogeneous population. Medication effect was modeled as a 20% average SF reduction (standard deviation 10%). We identified how many patients with an observed ≥ 25% SF increase (apparent worsening) actually had a true ≥ 10% SF reduction (vs. no medication), and how many with an observed ≥ 50% SF reduction (apparent responders) would have shown < 0% reduction if not taking the ASM. We also quantified how many individuals with apparent worsening actually worsened (> 0% SF increase vs. no medication).

    RESULTS: Simulations closely matched real-world ASM trials, showing a median SF reduction of 36% with ASM versus 17% with placebo; 35% of patients on ASM achieved ≥ 50% SF reduction versus 20% on placebo. Apparent worsening occurred in 12%; among these, 76% were true improvers. Of the apparent responders, 12% were true nonresponders. Only 4% of the individuals with apparent worsening truly worsened compared to no medication.

    INTERPRETATION: SF variability can lead to significant misclassification of ASM benefit. Many patients labeled as having "failed" an ASM trial were likely receiving meaningful benefit and may warrant reconsideration of the medication. Prospective clinical studies are needed to determine how best to account for SF variability and refine the interpretation of treatment response in epilepsy management.
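    The mechanism behind these numbers can be illustrated with a far simpler Monte Carlo than the CHOCOLATES simulator. In the sketch below, the baseline-rate distribution, Poisson counting noise, and observation windows are all assumptions chosen for illustration; only the drug-effect distribution (mean 20% reduction, SD 10%) comes from the abstract:

```python
import numpy as np

rng = np.random.default_rng(42)
n = 100_000

# Assumed: heterogeneous baseline seizure rates and Poisson counting noise.
base_rate = rng.lognormal(mean=1.0, sigma=1.0, size=n)   # seizures per month
effect = np.clip(rng.normal(0.20, 0.10, size=n), 0, 1)   # true fractional SF reduction

# Observed counts over assumed 3-month baseline and treatment periods.
months = 3
baseline = rng.poisson(base_rate * months)
on_drug = rng.poisson(base_rate * (1 - effect) * months)

with np.errstate(divide="ignore", invalid="ignore"):
    pct_change = (on_drug - baseline) / baseline         # observed relative change

valid = baseline > 0
apparent_worse = valid & (pct_change >= 0.25)            # looks like the ASM "failed"
true_improver = effect >= 0.10                           # truly benefited >=10%

# Fraction of apparently worsening patients who truly improved.
frac = true_improver[apparent_worse].mean()
```

    Even this toy model reproduces the qualitative finding: random count variability alone makes a sizeable share of truly benefiting patients look like treatment failures.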

  • Li, J, D M Goldenholz, M Alkofer, C Sun, F A Nascimento, J J Halford, B C Dean, et al. 2025. “Expert-Level Detection of Epilepsy Markers in EEG on Short and Long Timescales.” NEJM AI 2 (7). https://doi.org/10.1056/aioa2401221.

    BACKGROUND: Epileptiform discharges, or spikes, within electroencephalogram (EEG) recordings are essential for diagnosing epilepsy and localizing seizure origins. Artificial intelligence (AI) offers a promising approach to automating detection, but current models are often hindered by artifact-related false positives and often target either event- or EEG-level classification, thus limiting clinical utility.

    METHODS: We developed SpikeNet2, a deep-learning model based on a residual network architecture, and enhanced it with hard-negative mining to reduce false positives. Our study analyzed 17,812 EEG recordings from 13,523 patients across multiple institutions, including Massachusetts General Brigham (MGB) hospitals. Data from the Human Epilepsy Project (HEP) and SCORE-AI (SAI) were also included. A total of 32,433 event-level samples, labeled by experts, were used for training and evaluation. Performance was assessed using the area under the receiver operating characteristic curve (AUROC), the area under the precision-recall curve (AUPRC), calibration error, and a modified area under the curve (mAUC) metric. The model's generalizability was evaluated using external datasets.

    RESULTS: SpikeNet2 demonstrated strong performance in event-level spike detection, achieving an AUROC of 0.973 and an AUPRC of 0.995, with 44% of experts surpassing the model on the MGB dataset. In external validation, the model achieved an AUROC of 0.942 and an AUPRC of 0.948 on the HEP dataset. For EEG-level classification, SpikeNet2 recorded an AUROC of 0.958 and an AUPRC of 0.959 on the MGB dataset, an AUROC of 0.888 and an AUPRC of 0.823 on the HEP dataset, and an AUROC of 0.995 and an AUPRC of 0.991 on the SAI dataset, with 32% of experts outperforming the model. The false-positive rate was reduced to an average of nine spikes per hour.

    CONCLUSIONS: SpikeNet2 offers expert-level accuracy in both event-level spike detection and EEG-level classification, while significantly reducing false positives. Its dual functionality and robust performance across diverse datasets make it a promising tool for clinical and telemedicine applications, particularly in resource-limited settings. (Funded by the National Institutes of Health and others.)
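    The hard-negative mining step mentioned in the methods can be sketched generically: train an initial detector, score a large pool of known negatives, and retrain with the negatives the model most confidently mislabels. The sketch below uses a logistic regression on synthetic features purely for illustration; it is not SpikeNet2's residual-network training pipeline, and all data and sizes are assumptions:

```python
import numpy as np
from sklearn.linear_model import LogisticRegression

rng = np.random.default_rng(0)

# Synthetic stand-ins: "spike" windows vs. a large pool of background windows.
X_pos = rng.normal(1.0, 1.0, size=(200, 16))
X_neg_pool = rng.normal(0.0, 1.0, size=(5000, 16))

# Round 1: train on positives plus a random subset of negatives.
idx = rng.choice(len(X_neg_pool), size=200, replace=False)
X = np.vstack([X_pos, X_neg_pool[idx]])
y = np.r_[np.ones(200), np.zeros(200)]
clf = LogisticRegression(max_iter=1000).fit(X, y)

# Hard-negative mining: score the full negative pool and collect the
# negatives the model most confidently (and wrongly) calls positive.
scores = clf.predict_proba(X_neg_pool)[:, 1]
hard = np.argsort(scores)[-200:]

# Round 2: retrain with the hard negatives added as explicit negatives.
X2 = np.vstack([X, X_neg_pool[hard]])
y2 = np.r_[y, np.zeros(200)]
clf2 = LogisticRegression(max_iter=1000).fit(X2, y2)
```

    The same loop, applied to artifact-rich EEG background instead of Gaussian noise, is what drives down artifact-related false positives in detectors of this kind.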