Reconstructing Sepsis Trajectories from Clinical Case Reports using LLMs: the Textual Time Series Corpus for Sepsis
Pith reviewed 2026-05-22 21:12 UTC · model grok-4.3
The pith
LLMs can extract and time-order sepsis findings from narrative case reports with event match rates up to 0.93.
A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.
Core claim
An LLM pipeline phenotypes, extracts, and annotates time-localized findings from sepsis case reports to generate an open corpus of 2,139 reports; on held-out validation material the pipeline recovers events at rates of 0.93 (GPT-5) and 0.76 (Llama 3.3 70B Instruct) with temporal concordances of 0.965 and 0.908 respectively when measured against expert labels.
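To make the two headline metrics concrete, the sketch below computes an event match rate and a pairwise temporal concordance on toy timelines. The definitions are assumptions chosen for illustration: events are matched by exact name, and concordance is the fraction of comparable reference pairs whose order the prediction preserves, in the style of a c-index. The paper's own matching procedure and concordance definition may differ, and every name and value here is hypothetical.

```python
from itertools import combinations


def event_match_rate(reference, predicted):
    """Fraction of reference events that also appear in the prediction."""
    if not reference:
        return 0.0
    matched = [event for event in reference if event in predicted]
    return len(matched) / len(reference)


def temporal_concordance(reference, predicted):
    """Fraction of matched event pairs whose time order agrees.

    Pairs tied in the reference are skipped; a tie in the prediction
    counts as half-concordant (a common c-index convention, assumed here).
    """
    shared = [event for event in reference if event in predicted]
    score, comparable = 0.0, 0
    for a, b in combinations(shared, 2):
        ref_diff = reference[a] - reference[b]
        if ref_diff == 0:
            continue
        comparable += 1
        pred_diff = predicted[a] - predicted[b]
        if pred_diff == 0:
            score += 0.5
        elif (ref_diff > 0) == (pred_diff > 0):
            score += 1.0
    return score / comparable if comparable else float("nan")


# Toy timelines: event -> hours relative to admission (hypothetical data).
reference = {"fever": -24, "hypotension": 0, "intubation": 2, "vasopressors": 3}
predicted = {"fever": -12, "hypotension": 0, "vasopressors": 1}

match_rate = event_match_rate(reference, predicted)
concordance = temporal_concordance(reference, predicted)
print(match_rate, concordance)
```

In this toy case the prediction misses one of four reference events but orders the three it recovers correctly, which shows why the two numbers can diverge: concordance is only computed over events that were matched in the first place.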
What carries the argument
An LLM pipeline that phenotypes, extracts, and annotates time-localized clinical findings in narrative case reports.
If this is right
- The corpus supplies temporally fine-grained sepsis trajectories for training predictive models.
- LLMs can serve as a practical tool for temporal reconstruction from clinical narrative with documented performance bounds.
- Multimodal integration is identified as one concrete direction to address remaining reconstruction errors.
- The same extraction approach can be reused on case reports for other conditions to build additional textual time series.
Where Pith is reading between the lines
- Pairing the new corpus with existing structured sources such as MIMIC-IV could produce hybrid training sets that combine narrative completeness with coded timeliness.
- If the temporal accuracy holds on prospective notes, the method could feed earlier-warning systems that operate on raw text rather than delayed discharge summaries.
- Persistent narrative time ambiguities may require new annotation conventions that distinguish explicit clock times from relative phrases before further scaling.
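The distinction between explicit clock times and relative phrases in the last point can be illustrated with a small rule-based tagger. This is a minimal sketch, not part of the paper's pipeline: the two regular expressions are a deliberately tiny, hypothetical pattern set, and real case reports would need a much richer inventory of temporal expressions.

```python
import re
from collections import Counter

# Hypothetical patterns: clock times (14:30) and ISO dates (2024-09-14).
EXPLICIT = re.compile(r"\b(?:\d{1,2}:\d{2}|\d{4}-\d{2}-\d{2})\b")
# Hypothetical cue words for times anchored to another event.
RELATIVE = re.compile(
    r"\b(?:within|after|before|prior to|following|later|earlier)\b",
    re.IGNORECASE,
)


def classify_time_expression(sentence):
    """Label a sentence as 'explicit', 'relative', or 'none'.

    Explicit times take priority when both kinds of cue are present.
    """
    if EXPLICIT.search(sentence):
        return "explicit"
    if RELATIVE.search(sentence):
        return "relative"
    return "none"


sentences = [
    "Intubated at 14:30 on arrival.",
    "Hypotension developed within hours of intubation.",
    "The patient had a fever.",
]
counts = Counter(classify_time_expression(s) for s in sentences)
print(counts)
```

Counting labels over a corpus in this way would give a rough estimate of how often narratives rely on relative phrasing, which is the quantity an annotation convention of this kind would need to track.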
Load-bearing premise
Discrepancies between the pipeline output and physician labels arise only from LLM limitations and not from ambiguities in how time is expressed in the original case-report text.
What would settle it
A new blinded expert annotation pass on several hundred generated timelines that yields event match rates below 0.70 across both model families.
Original abstract
Clinical case reports and discharge summaries may be the most complete and accurate summarization of patient encounters, yet they are finalized, i.e., timestamped after the encounter. Complementary structured data streams become available sooner but suffer from incompleteness. To train models and algorithms on more complete and temporally fine-grained data, we construct a pipeline to phenotype, extract, and annotate time-localized findings within case reports using large language models. We apply our pipeline to generate an open-access textual time series corpus for Sepsis-3 comprising 2,139 case reports from the PubMed-Open Access (PMOA) Subset. To validate our system, we apply it to PMOA and timeline annotations from i2b2/MIMIC-IV and compare the results to physician-expert annotations. We show high recovery rates of clinical findings (event match rates: GPT-5: 0.93; Llama 3.3 70B Instruct: 0.76) and strong temporal ordering (concordance: GPT-5: 0.965; Llama 3.3 70B Instruct: 0.908). Our work characterizes the ability of LLMs to time-localize clinical findings in text, illustrating the limitations of LLM use for temporal reconstruction and providing several potential avenues of improvement via multimodal integration.
Editorial analysis
A structured set of objections, weighed in public.
Referee Report
Summary. The paper introduces an LLM-based pipeline to extract and time-localize clinical findings from sepsis case reports, producing an open 2,139-report textual time series corpus from the PubMed Open Access (PMOA) subset. Validation on PMOA reports and i2b2/MIMIC-IV timeline annotations, compared against physician-expert annotations, yields event match rates of 0.93 (GPT-5) and 0.76 (Llama 3.3 70B Instruct) and temporal concordances of 0.965 and 0.908, respectively. The work positions the corpus as higher-fidelity temporal ground truth than structured data streams and characterizes LLM limitations for this task.
Significance. If the validation metrics can be shown to primarily reflect recoverable signal rather than text ambiguity, the corpus would be a useful resource for training temporally-aware clinical NLP models on sepsis trajectories. The concrete match and concordance numbers, plus the open release, provide a starting point for multimodal extensions mentioned in the abstract.
major comments (2)
- [Validation] Validation section: the central claim that the pipeline produces usable high-fidelity temporal ground truth rests on match rates and concordance against physician annotations, yet no inter-annotator agreement is reported on the same i2b2/MIMIC-IV subset and no quantification is given for how often source narratives contain under-specified temporal expressions (e.g., “within hours of intubation”). Without these, the metrics conflate LLM fidelity with irreducible text ambiguity, weakening the justification for releasing the corpus as ground truth.
- [Abstract] Abstract and results: event match rates and concordance are reported as single point estimates (0.93/0.76 and 0.965/0.908) with no error bars, no per-event-type breakdown, and no stratification by time granularity or prompting variant; this limits assessment of robustness and directly affects the soundness of the “high recovery rates” claim.
minor comments (2)
- [Abstract] Clarify the exact model referred to as “GPT-5” and whether any post-hoc exclusions or prompting choices were applied during validation.
- [Abstract] The abstract states the corpus is for Sepsis-3 but does not specify how Sepsis-3 criteria were applied or verified in the PMOA reports.
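The second major comment asks for error bars on the point estimates. One common way to obtain them is a percentile bootstrap that resamples whole reports, sketched below. This is an illustration under stated assumptions, not the authors' procedure: the per-report counts are invented, the function name and its parameters are hypothetical, and each report is assumed to contain at least one reference event.

```python
import random


def bootstrap_ci(per_report, n_boot=2000, alpha=0.05, seed=0):
    """Percentile bootstrap interval for a pooled event match rate.

    per_report: list of (matched_events, reference_events), one per report.
    Reports are resampled with replacement so that correlation between
    events inside the same report is preserved.
    """
    rng = random.Random(seed)
    stats = []
    for _ in range(n_boot):
        sample = rng.choices(per_report, k=len(per_report))
        total_ref = sum(ref for _, ref in sample)
        stats.append(sum(matched for matched, _ in sample) / total_ref)
    stats.sort()
    lower = stats[int((alpha / 2) * n_boot)]
    upper = stats[int((1 - alpha / 2) * n_boot) - 1]
    return lower, upper


# Hypothetical counts: (events matched, events in the expert annotation).
per_report = [(9, 10), (7, 8), (12, 12), (5, 9), (10, 11)]
point = sum(m for m, _ in per_report) / sum(r for _, r in per_report)
lower, upper = bootstrap_ci(per_report)
print(point, lower, upper)
```

Resampling at the report level rather than the event level is a design choice: events within one narrative are not independent, so event-level resampling would tend to produce intervals that are too narrow.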
Simulated Author's Rebuttal
We thank the referee for the constructive feedback. We address each major comment below and will revise the manuscript accordingly where possible.
Point-by-point responses
- Referee: [Validation] Validation section: the central claim that the pipeline produces usable high-fidelity temporal ground truth rests on match rates and concordance against physician annotations, yet no inter-annotator agreement is reported on the same i2b2/MIMIC-IV subset and no quantification is given for how often source narratives contain under-specified temporal expressions (e.g., “within hours of intubation”). Without these, the metrics conflate LLM fidelity with irreducible text ambiguity, weakening the justification for releasing the corpus as ground truth.
Authors: We agree this is an important limitation. The i2b2/MIMIC-IV annotations used for validation come from the original dataset releases, which provide single-expert annotations per case; inter-annotator agreement (IAA) cannot be computed from the available data without new multi-annotator labeling. We will add explicit discussion of this in the revised validation and limitations sections, tempering claims about 'high-fidelity ground truth' to reflect agreement with available expert annotations rather than absolute fidelity. We did not quantify the frequency of under-specified temporal expressions in the current study, but we will add a qualitative breakdown or note their contribution to ambiguity if feasible with existing resources.
Revision: partial
- Referee: [Abstract] Abstract and results: event match rates and concordance are reported as single point estimates (0.93/0.76 and 0.965/0.908) with no error bars, no per-event-type breakdown, and no stratification by time granularity or prompting variant; this limits assessment of robustness and directly affects the soundness of the “high recovery rates” claim.
Authors: We agree that additional detail is needed for robustness assessment. In the revision we will add bootstrap-derived 95% confidence intervals for the primary metrics, per-event-type breakdowns (e.g., vital signs, labs, symptoms, interventions), and stratification by time granularity where the data permit. We will also summarize prompting-variant results in a supplementary table. These changes will be incorporated into the results section and referenced in the abstract.
Revision: yes
- Inter-annotator agreement on the i2b2/MIMIC-IV validation subset (single-annotator source datasets prevent retrospective computation)
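If a second annotation pass were ever collected, agreement between the two annotators could be summarized with Cohen's kappa, which corrects raw agreement for agreement expected by chance. The sketch below is a plain-Python illustration on invented labels; the choice of kappa and of temporal-relation labels such as "before" and "after" is an assumption, not something the paper or the rebuttal specifies.

```python
from collections import Counter


def cohens_kappa(labels_a, labels_b):
    """Cohen's kappa for two annotators labeling the same items."""
    if len(labels_a) != len(labels_b) or not labels_a:
        raise ValueError("need two non-empty label lists of equal length")
    n = len(labels_a)
    # Observed agreement: share of items given the same label.
    observed = sum(a == b for a, b in zip(labels_a, labels_b)) / n
    # Chance agreement from each annotator's marginal label frequencies.
    freq_a, freq_b = Counter(labels_a), Counter(labels_b)
    expected = sum(freq_a[k] * freq_b[k] for k in freq_a) / (n * n)
    if expected == 1.0:
        return 1.0  # both annotators used a single identical label
    return (observed - expected) / (1 - expected)


# Hypothetical temporal-relation labels for six event pairs.
annotator_1 = ["before", "before", "after", "after", "overlap", "before"]
annotator_2 = ["before", "after", "after", "after", "overlap", "before"]

kappa = cohens_kappa(annotator_1, annotator_2)
print(round(kappa, 3))
```

Here the annotators agree on five of six items, but because chance agreement is 13/36, kappa comes out lower than the raw agreement of 5/6. A figure of this kind would give the event match rates a human ceiling to be read against.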
Circularity Check
No circularity: empirical validation against external annotations
The paper describes an LLM-based extraction pipeline applied to case reports, with performance quantified via direct comparison to independent physician annotations on the i2b2/MIMIC-IV subset. Event match rates and concordance scores are measured outputs, not quantities fitted or defined in terms of themselves. No equations, self-definitional steps, fitted-input predictions, or load-bearing self-citations appear in the derivation chain; the central claims rest on external benchmarks rather than reducing to the paper's own inputs by construction.
Axiom & Free-Parameter Ledger
axioms (1)
- Domain assumption: Large language models can identify clinical findings and their relative temporal order from narrative case-report text at rates comparable to human experts.
Reference graph
Works this paper leans on
- [1] E. Kyriazopoulou, L. Liaskou-Antoniou, G. Adamis, A. Panagaki, N. Melachroinopoulos, E. Drakou, K. Marousis, G. Chrysos, A. Spyrou, N. Alexiou et al., Procalcitonin to reduce long-term infection-associated adverse events in sepsis. A randomized trial, American Journal of Respiratory and Critical Care Medicine 203, 202 (2021)
- [2] C. W. Seymour, J. N. Kennedy, S. Wang, C.-C. H. Chang, C. F. Elliott, Z. Xu, S. Berry, G. Clermont, G. Cooper, H. Gomez et al., Derivation, validation, and potential treatment implications of novel clinical phenotypes for sepsis, JAMA 321, 2003 (2019)
- [3] K. E. Henry, R. Adams, C. Parent, H. Soleimani, A. Sridharan, L. Johnson, D. N. Hager, S. E. Cosgrove, A. Markowski, E. Y. Klein et al., Factors driving provider adoption of the TREWS machine learning-based early warning system and its effects on sepsis treatment timing, Nature Medicine 28, 1447 (2022)
- [4]
- [5] S. Noroozizadeh, J. C. Weiss and G. H. Chen, Temporal supervised contrastive learning for modeling patient risk progression, in Machine Learning for Health (ML4H) (PMLR, 2023)
- [6] A. Moldwin, D. Demner-Fushman and T. R. Goodwin, Empirical findings on the role of structured data, unstructured data, and their combination for automatic clinical phenotyping, AMIA Summits on Translational Science Proceedings 2021, p. 445 (2021)
- [7] W. Sun, A. Rumshisky and O. Uzuner, Evaluating temporal relations in clinical text: 2012 i2b2 challenge, Journal of the American Medical Informatics Association 20, 806 (2013)
- [8] A. Leeuwenberg and M.-F. Moens, Towards extracting absolute event timelines from English clinical reports, IEEE/ACM Transactions on Audio, Speech, and Language Processing 28, 2710 (2020)
- [9] G. Frattallone-Llado, J. Kim, C. Cheng, D. Salazar, S. Edakalavan and J. C. Weiss, Using multimodal data to improve precision of inpatient event timelines, in Pacific-Asia Conference on Knowledge Discovery and Data Mining (Springer, May 2024)
- [10] P. J. Thoral, J. M. Peppink, R. H. Driessen, E. J. Sijbrands, E. J. Kompanje, L. Kaplan, H. Bailey, J. Kesecioglu, M. Cecconi, M. Churpek et al., Sharing ICU patient data responsibly under the society of critical care medicine/European society of intensive care medicine joint data science collaboration: the Amsterdam university medical centers databas...
- [11] T. J. Pollard, A. E. Johnson, J. D. Raffa, L. A. Celi, R. G. Mark and O. Badawi, The eICU collaborative research database, a freely available multi-center database for critical care research, Scientific Data 5, 1 (2018)
- [12] A. E. Johnson, T. J. Pollard, L. Shen, L.-w. H. Lehman, M. Feng, M. Ghassemi, B. Moody, P. Szolovits, L. Anthony Celi and R. G. Mark, MIMIC-III, a freely accessible critical care database, Scientific Data 3, 1 (2016)
- [13] A. Johnson, T. Pollard, S. Horng, L. A. Celi and R. Mark, MIMIC-IV-Note: Deidentified free-text clinical notes (version 2.2) (2023)
- [14] A. E. Johnson, J. Aboab, J. D. Raffa, T. J. Pollard, R. O. Deliberato, L. A. Celi and D. J. Stone, A comparative analysis of sepsis identification methods in an electronic database, Critical Care Medicine 46, 494 (2018)
- [15] T. M. Seinen, J. A. Kors, E. M. van Mulligen and P. R. Rijnbeek, Using structured codes and free-text notes to measure information complementarity in electronic health records: Feasibility and validation study, Journal of Medical Internet Research 27, p. e66910 (2025)
- [16]
- [17] D. Van Veen, C. Van Uden, L. Blankemeier, J.-B. Delbrouck, A. Aali, C. Bluethgen, A. Pareek, M. Polacin, E. P. Reis, A. Seehofnerová et al., Adapted large language models can outperform medical experts in clinical text summarization, Nature Medicine 30, 1134 (2024)
- [18] D. P. Jeong, S. Garg, Z. C. Lipton and M. Oberst, Medical adaptation of large language and vision-language models: Are we making progress?, in Empirical Methods in Natural Language Processing, eds. Y. Al-Onaizan, M. Bansal and Y.-N. Chen (Association for Computational Linguistics, Miami, Florida, USA, November 2024)
- [19] PMC Open Access Subset. https://pmc.ncbi.nlm.nih.gov/tools/openftlist/, (2024), Accessed: 2024-09-14
- [20] Z. Zhong and D. Chen, A frustratingly easy approach for entity and relation extraction, in Proceedings of the 2021 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, eds. K. Toutanova, A. Rumshisky, L. Zettlemoyer, D. Hakkani-Tur, I. Beltagy, S. Bethard, R. Cotterell, T. Chakrabor...
- [21] P. R. Deka, A. N. Jurek-Loughrey and D. Padmanabhan, Improved methods to aid unsupervised evidence-based fact checking for online health news, Journal of Data Intelligence 3, 474 (Nov 2022)
- [22] M. Abu-Tineh, M. A. Alamin, E. Aljaloudi, A. Alshurafa, B. Garcia-Cañibano, R. Y. Taha and S. A. Elkourashy, A rare case of Lambert-Eaton myasthenia syndrome associated with non-Hodgkin’s lymphoma: A case report and review of the literature, Case Reports in Oncology 16, 1300 (2023). Appendix A. Log-Time Cumulative Distribution Function Recall the log-t...
- [23] These results highlight a systematic bias of under-identification of clinical events in i2m4 compared to sepsis-10 by Llama 3.3. Additionally, the i2m4 dataset contains 2.3× more clinical events per report on average than sepsis-10, as annotated by the clinician, with a much larger variance in event counts across reports (Figure E1). This heterogeneity re...