pith. sign in

arxiv: 2605.26560 · v1 · pith:ELJH2RE2new · submitted 2026-05-26 · 💻 cs.CL · cs.AI

Reliable Extraction of Clinical Follow-Up Instructions: A Hybrid Neural-Symbolic Pipeline

Pith reviewed 2026-06-29 18:40 UTC · model grok-4.3

classification 💻 cs.CL cs.AI
keywords clinical follow-up extractionhybrid neural-symbolic pipelineaction-date pairsoutpatient notestime normalizationentity linkingsynthetic corpus evaluationgenerative model comparison
0
0 comments X

The pith

A hybrid pipeline extracts clinical follow-up action-date pairs at 0.99 F1 by separating neural tagging from deterministic arithmetic.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper establishes that a pipeline which uses neural methods only for identifying actions and times, then applies fixed rules for linking and date calculation, produces far more accurate (action, date) pairs than models that generate the full extraction in one step. On a synthetic set of outpatient notes it reaches near-perfect pair-level F1 and exact day accuracy even for actions never seen in training, while generative approaches lose most of the correct pairings. A reader would care because reliable extraction of scheduled follow-ups directly supports automated scheduling, compliance checks, and reduced manual review in outpatient care. The work shows the gain comes from making the arithmetic and linking steps explicit rather than implicit in decoding.

Core claim

The pipeline defines TestSpecification and TimeSpecification entities linked by a ScheduledFor relation; a neural tagger and linker identify them in text, after which an ontology canonicalizes the 28 possible actions and a deterministic converter turns time expressions into day offsets. On 259-note seen and OOV splits this yields Test-Time Pair F1 of 0.997 and 0.986 with 0.00-day mean absolute error. Generative baselines reach high action F1 yet only 0.51-0.57 pair F1, with non-overlapping confidence intervals.

What carries the argument

ScheduledFor relation between TestSpecification and TimeSpecification entities, extracted neurally then normalized symbolically via ontology and deterministic day-offset conversion.

If this is right

  • The method generalizes to actions held out from training, maintaining high pair F1.
  • Explicit separation of entity extraction from date arithmetic produces exact day offsets where generative decoding does not.
  • The approach surfaces concrete failure modes of pure generation on linking and arithmetic tasks.
  • The benchmark identifies transfer to real EHR notes as the required next validation step.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

  • The same split between learned recognition and rule-based calculation could be applied to other clinical extraction tasks that require precise temporal offsets.
  • High performance on OOV actions suggests the ontology plus deterministic normalizer removes the main source of date errors once entities are found.
  • If real notes contain more varied phrasing than the synthetic set, the neural tagging stage would become the new bottleneck rather than the arithmetic stage.
  • The zero-day MAE result indicates that once the ScheduledFor link is correct the rest of the pipeline introduces no additional timing error.

Load-bearing premise

The 2,000-note synthetic outpatient corpus with action-disjoint splits captures the linguistic variety, time-expression ambiguity, and action distribution of real clinical notes.

What would settle it

Running the same pipeline and generative baselines on a held-out collection of real EHR outpatient notes and comparing pair F1 and day MAE would show whether the reported advantage survives outside the synthetic corpus.

Figures

Figures reproduced from arXiv: 2605.26560 by Alexander Apartsin, Michal Laufer, Yehudit Aperstein.

Figure 1
Figure 1. Figure 1: System overview. Solid arrows show the principal data flow. The note text is encoded by a shared BioBERT encoder over sliding windows of length 512 with stride 128; a BIO tagging head predicts TestSpecification and TimeSpecification entity spans from the contextual states; a biaffine relation head links each predicted TestSpecification span to a TimeSpecification span (or to a none option), producing a Sch… view at source ↗
Figure 2
Figure 2. Figure 2: Dataset composition. Each bar shows one specialty stratum, segmented by the number of follow-up actions per note (0, 1, or 2). Specialty proportions are approximately balanced (379-422 notes per specialty), and the action-count distribution is consistent across specialties [PITH_FULL_IMAGE:figures/full_fig_p008_2.png] view at source ↗
Figure 3
Figure 3. Figure 3: Vocabulary distributions in the corpus. (a) Top 15 follow-up action types out of 28 in the closed set; imaging and laboratory tests dominate, consults and rehabilitation appear in the long tail. (b) Top 15 temporal-expression surface forms out of 549 distinct forms among 2,028 gold mentions, illustrating the lexical diversity (the most frequent phrase has only twelve mentions). Stress factors are present a… view at source ↗
Figure 4
Figure 4. Figure 4: Held-out performance with note-level bootstrap 95% confidence intervals on the seen-test (in-distribution) and OOV-test (action-disjoint) splits. The structured pipeline reaches near-perfect F1 on all three metrics in both splits. Generative baselines achieve high TestSpecification F1 but their TimeSpecification Offset and Test-Time Pair F1 drop sharply, with CIs that do not overlap the BioBERT intervals o… view at source ↗
Figure 5
Figure 5. Figure 5: Mean absolute time-offset error in days on matched TestSpecification entities, with note-level bootstrap 95% confidence intervals. Reference lines mark one day, one week, and one month. The structured pipeline achieves 0.00-day MAE in both splits because every linked time phrase is normalized by the deterministic rule grammar. Generative baselines produce day-scale to month-scale arithmetic errors. The str… view at source ↗
read the original abstract

Objective. Outpatient notes carry follow-up instructions pairing actions with future times ("MRI brain in two weeks"). Extracting (action, date) pairs supports scheduling and audit, but generative extractors miss the date because linking and arithmetic are implicit in decoding. We test a hybrid neural-symbolic pipeline against direct generation. Methods. We define TestSpecification and TimeSpecification entities and a ScheduledFor relation. BioBERT feeds BIO tagging and a biaffine linker; entities are canonicalized via a 28-action ontology and times normalized to day offsets deterministically. We evaluate on a 2,000-note synthetic outpatient corpus with action-disjoint splits (18 train, 6 OOV-test) against zero-shot GPT-4o-mini and LoRA-fine-tuned LLaMA-3 8B with note-level bootstrap 95% CIs. Results. On 259-note seen and OOV splits the hybrid pipeline achieves Test-Time Pair F1 of 0.997 and 0.986 with 0.00-day MAE. Baselines reach high action F1 (LLaMA-3 0.992; GPT-4o-mini 0.963 seen) but Pair F1 stays at 0.51-0.57 (LLaMA-3) and 0.53 (GPT-4o-mini), CIs non-overlapping with the hybrid. Conclusion. Separating learned entity extraction from deterministic date arithmetic outperforms generation on this benchmark, generalizes to held-out actions, and exposes failure modes. Transfer to real EHR notes is the next validation; a first-pass realism check is in Limitations.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Referee Report

2 major / 1 minor

Summary. The paper introduces a hybrid neural-symbolic pipeline for extracting follow-up instructions as (action, time) pairs from outpatient clinical notes. It uses BioBERT for BIO tagging and biaffine linking to identify TestSpecification and TimeSpecification entities and ScheduledFor relations, followed by canonicalization with a 28-action ontology and deterministic time normalization to day offsets. Evaluated on a 2,000-note synthetic corpus with action-disjoint train/test splits (including OOV actions), the pipeline achieves Test-Time Pair F1 of 0.997 (seen) and 0.986 (OOV) with 0.00-day MAE, significantly outperforming zero-shot GPT-4o-mini and LoRA-fine-tuned LLaMA-3 8B baselines whose Pair F1 remains around 0.5-0.57 despite high action F1.

Significance. If the synthetic corpus adequately represents real clinical notes, the result demonstrates that separating learned entity extraction from deterministic symbolic date handling can achieve near-perfect performance and generalization to held-out actions, where end-to-end generative models fail on linking and arithmetic. The bootstrap 95% CIs and explicit OOV splits provide reproducible evidence of robustness on the benchmark.

major comments (2)
  1. [Methods (corpus and evaluation)] Methods (synthetic corpus generation and evaluation setup): The headline claims of Test-Time Pair F1 0.997/0.986 and 0.00-day MAE rest exclusively on the 2,000-note synthetic outpatient corpus with action-disjoint splits. The paper states in the Conclusion that transfer to real EHR notes is the next validation and references a first-pass realism check in Limitations, but no quantitative evidence is provided that the corpus replicates the linguistic variety, time-expression ambiguity, or action distribution of authentic clinical notes. This assumption is load-bearing for the Objective's claim of reliability in a clinical setting.
  2. [Results] Results (baseline comparison): The non-overlapping CIs are a strength, but the generative baselines' low Pair F1 (0.51-0.57) is attributed to failures in linking and arithmetic; without ablation showing that the hybrid's deterministic normalization is the decisive factor (vs. the ontology or linker), it is unclear whether the gap would persist on more varied data.
minor comments (1)
  1. [Abstract / Methods] The abstract and Methods should explicitly state the number of bootstrap samples and how note-level aggregation is performed for the 95% CIs to support reproducibility.

Simulated Author's Rebuttal

2 responses · 1 unresolved

We thank the referee for the constructive comments. We address each major point below, indicating planned revisions where appropriate.

read point-by-point responses
  1. Referee: [Methods (corpus and evaluation)] Methods (synthetic corpus generation and evaluation setup): The headline claims of Test-Time Pair F1 0.997/0.986 and 0.00-day MAE rest exclusively on the 2,000-note synthetic outpatient corpus with action-disjoint splits. The paper states in the Conclusion that transfer to real EHR notes is the next validation and references a first-pass realism check in Limitations, but no quantitative evidence is provided that the corpus replicates the linguistic variety, time-expression ambiguity, or action distribution of authentic clinical notes. This assumption is load-bearing for the Objective's claim of reliability in a clinical setting.

    Authors: We agree the synthetic corpus is central and that no quantitative metrics comparing it to real notes are provided. The corpus was deliberately constructed with action-disjoint splits and controlled time expressions to isolate OOV generalization and deterministic normalization effects. The manuscript already states in Conclusion and Limitations that real-EHR transfer is future work. We will revise Limitations to more explicitly discuss potential gaps in linguistic variety and ambiguity relative to authentic notes and will temper claims about immediate clinical deployment. revision: partial

  2. Referee: [Results] Results (baseline comparison): The non-overlapping CIs are a strength, but the generative baselines' low Pair F1 (0.51-0.57) is attributed to failures in linking and arithmetic; without ablation showing that the hybrid's deterministic normalization is the decisive factor (vs. the ontology or linker), it is unclear whether the gap would persist on more varied data.

    Authors: The attribution follows directly from the baselines achieving high action F1 yet low Pair F1, pointing to linking and arithmetic as the failure modes. The hybrid architecture separates these steps by design. We acknowledge an explicit ablation isolating the deterministic normalizer is absent. We will add a clarifying sentence in Results/Discussion noting this limitation and that future work could include such ablations on more varied data. revision: partial

standing simulated objections not resolved
  • Quantitative evidence that the synthetic corpus replicates linguistic variety, time-expression ambiguity, or action distributions of real clinical notes

Circularity Check

0 steps flagged

No circularity; purely empirical evaluation on held-out synthetic splits

full rationale

The paper defines a hybrid pipeline (BIO tagging + biaffine linker + deterministic normalization via 28-action ontology) and reports Test-Time Pair F1 and MAE directly on action-disjoint held-out splits of a 2,000-note synthetic corpus. No equations, parameters, or claims reduce by construction to fitted inputs or self-citations; the central results are measured performance numbers on unseen data. The authors explicitly flag transfer to real EHR notes as future work, confirming the evaluation is self-contained against the stated benchmark.

Axiom & Free-Parameter Ledger

1 free parameters · 1 axioms · 0 invented entities

The central claim rests on a hand-crafted 28-action ontology for canonicalization and the assumption that time expressions can be normalized deterministically without learned disambiguation. No free parameters are fitted inside the reported pipeline itself beyond the pre-trained BioBERT weights.

free parameters (1)
  • 28-action ontology
    Hand-curated set of canonical actions used for entity normalization; size and contents chosen outside the reported experiments.
axioms (1)
  • domain assumption Time expressions in outpatient notes can be mapped to exact day offsets using deterministic rules without residual ambiguity
    Invoked when converting relative times to numeric offsets after linking.

pith-pipeline@v0.9.1-grok · 5830 in / 1295 out tokens · 39930 ms · 2026-06-29T18:40:48.248597+00:00 · methodology

discussion (0)

Sign in with ORCID, Apple, or X to comment. Anyone can read and Pith papers without signing in.

Reference graph

Works this paper leans on

24 extracted references · 20 canonical work pages

  1. [1]

    Y. Wang, L. Wang, M. Rastegar-Mojarad, S. Moon, F. Shen, N. Afzal, S. Liu, Y. Zeng, S. Mehrabi, S. Sohn, H. Liu, Clinical information extraction applications: a literature review, J. Biomed. Inform. 77 (2018) 34 -49. https://doi.org/10.1016/j.jbi.2017.11.011

  2. [2]

    Savova, J.J

    G.K. Savova, J.J. Masanz, P.V. Ogren, J. Zheng, S. Sohn, K.C. Kipper -Schuler, C.G. Chute, Mayo clinical Text Analysis and Knowledge Extraction System (cTAKES): architecture, component evaluation and applications, J. Am. Med. Inform. Assoc. 17 (2010) 50 7-513. https://doi.org/10.1136/jamia.2009.001560

  3. [3]

    W. Sun, A. Rumshisky, O. Uzuner, Evaluating temporal relations in clinical text: 2012 i2b2 Challenge, J. Am. Med. Inform. Assoc. 20 (2013) 806 -813. https://doi.org/10.1136/amiajnl -2013-001628

  4. [4]

    Styler IV, S

    W.F. Styler IV, S. Bethard, S. Finan, M. Palmer, S. Pradhan, P.C. de Groen, B. Erickson, T. Miller, C. Lin, G. Savova, J. Pustejovsky, Temporal annotation in the clinical domain, Trans. Assoc. Comput. Linguist. 2 (2014) 143-154. https://doi.org/10.1162/tacl_a_00172

  5. [5]

    Bethard, L

    S. Bethard, L. Derczynski, G. Savova, J. Pustejovsky, M. Verhagen, SemEval -2015 Task 6: Clinical TempEval, in: Proc. 9th Int. Workshop Semantic Eval. (SemEval 2015), Association for Computational Linguistics, Denver, CO, 2015, pp. 806 -814. https://doi.org/10.18653/v1/S15 -2136

  6. [6]

    Bethard, G

    S. Bethard, G. Savova, M. Palmer, J. Pustejovsky, SemEval-2016 Task 12: Clinical TempEval, in: Proc. 10th Int. Workshop Semantic Eval. (SemEval 2016), Association for Computational Linguistics, San Diego, CA, 2016, pp. 1052 -1062. https://doi.org/10.18653/v1/S16 -1165

  7. [7]

    Devlin, M.-W

    J. Devlin, M.-W. Chang, K. Lee, K. Toutanova, BERT: pre -training of deep bidirectional transformers for language understanding, in: Proc. 2019 Conf. North Am. Chapter Assoc. Comput. Linguist.: Hum. Lang. Technol., Association for Computational Linguist ics, Minneapolis, MN, 2019, pp. 4171 -4186. https://doi.org/10.18653/v1/N19 -1423

  8. [8]

    J. Lee, W. Yoon, S. Kim, D. Kim, S. Kim, C.H. So, J. Kang, BioBERT: a pre -trained biomedical language representation model for biomedical text mining, Bioinformatics 36 (2020) 1234 -1240. https://doi.org/10.1093/bioinformatics/btz682

  9. [9]

    Alsentzer, J.R

    E. Alsentzer, J.R. Murphy, W. Boag, W.-H. Weng, D. Jin, T. Naumann, M. McDermott, Publicly available clinical BERT embeddings, in: Proc. 2nd Clin. Nat. Lang. Process. Workshop, Association for Computational Linguistics, Minneapolis, MN, 2019, pp. 72 -78. https://doi.org/10.18653/v1/W19 -1909

  10. [10]

    Y. Gu, R. Tinn, H. Cheng, M. Lucas, N. Usuyama, X. Liu, T. Naumann, J. Gao, H. Poon, Domain-specific language model pretraining for biomedical natural language processing, ACM Trans. Comput. Healthc. 3 (2021) 2:1-2:23. https://doi.org/10.1145/3458754

  11. [11]

    X. Yang, A. Chen, N. PourNejatian, H.C. Shin, K.E. Smith, C. Parisien, C. Compas, C. Martin, M.G. Flores, Y. Zhang, T. Magoc, C.A. Harle, G. Lipori, D.A. Mitchell, W.R. Hogan, E.A. Shenkman, J. Bian, Y. Wu, A large language model for electronic health records, NPJ Digit. Med. 5 (2022) 194. https://doi.org/10.1038/s41746-022- 00742 -2

  12. [12]

    doi: 10.18653/v1/2022.emnlp-main.130

    M. Agrawal, S. Hegselmann, H. Lang, Y. Kim, D. Sontag, Large language models are few -shot clinical information extractors, in: Proc. 2022 Conf. Empir. Methods Nat. Lang. Process. (EMNLP), Association for Computational Linguistics, Abu Dhabi, 2022, pp. 1998-2022. https://doi.org/10.18653/v1/2022.emnlp-main.130

  13. [14]

    Strotgen, M

    J. Strotgen, M. Gertz, HeidelTime: high quality rule -based extraction and normalization of temporal expressions, in: Proc. 5th Int. Workshop Semantic Eval. (SemEval 2010), Association for Computational Linguistics, Uppsala, 2010, pp. 321 -324

  14. [15]

    Chang, C.D

    A.X. Chang, C.D. Manning, SUTime: a library for recognizing and normalizing time expressions, in: Proc. 8th Int. Conf. Lang. Resour. Eval. (LREC 2012), European Language Resources Association, Istanbul, 2012, pp. 3735-3740

  15. [16]

    C. Lin, T. Miller, D. Dligach, S. Bethard, G. Savova, A BERT-based universal model for both within- and cross-sentence clinical temporal relation extraction, in: Proc. 2nd Clin. Nat. Lang. Process. Workshop, Association for Computational Linguistics, Minneapolis, MN, 2019, pp. 65-71. https://doi.org/10.18653/v1/W19- 1908. 17

  16. [17]

    Eberts, A

    M. Eberts, A. Ulges, Span-based joint entity and relation extraction with transformer pre-training, in: Proc. 24th Eur. Conf. Artif. Intell. (ECAI 2020), IOS Press, Santiago de Compostela, 2020, pp. 2006 -2013. https://doi.org/10.3233/FAIA200321

  17. [18]

    Dozat, C.D

    T. Dozat, C.D. Manning, Deep biaffine attention for neural dependency parsing, in: Proc. 5th Int. Conf. Learn. Represent. (ICLR 2017), Toulon, 2017

  18. [19]

    Wornow, Y

    M. Wornow, Y. Xu, R. Thapa, B. Patel, E. Steinberg, S. Fleming, M.A. Pfeffer, J.A. Fries, N.H. Shah, The shaky foundations of large language models and foundation models for electronic health records, NPJ Digit. Med. 6 (2023) 135. https://doi.org/10.1038/s41746 -023-00879 -8

  19. [20]

    Y. Hu, Q. Chen, J. Du, X. Peng, V.K. Keloth, X. Zuo, Y. Zhou, Z. Li, X. Jiang, Z. Lu, K. Roberts, H. Xu, Improving large language models for clinical named entity recognition via prompt engineering, J. Am. Med. Inform. Assoc. 31 (2024) 1812 -1820. https://doi.org/10.1093/jamia/ocad259

  20. [21]

    Kweon, J

    S. Kweon, J. Kim, J. Kim, S. Im, E. Cho, S. Bae, J. Oh, G. Lee, J.H. Moon, S.C. You, S. Baek, C.H. Han, Y.B. Jung, Y. Jo, E. Choi, Publicly shareable clinical large language model built on synthetic clinical notes, in: Findings Assoc. Comput. Linguist.: ACL 2024, Association for Computational Linguistics, Bangkok, 2024, pp. 5148-5168

  21. [22]

    M. Loni, F. Poursalim, M. Asadi, A. Gharehbaghi, A review on generative AI models for synthetic medical text, time series, and longitudinal data, NPJ Digit. Med. 8 (2025) 281. https://doi.org/10.1038/s41746-024-01409- w

  22. [23]

    Callen, J.I

    J.L. Callen, J.I. Westbrook, A. Georgiou, J. Li, Failure to follow -up test results for ambulatory patients: a systematic review, J. Gen. Intern. Med. 27 (2012) 1334 -1348. https://doi.org/10.1007/s11606 -011-1949-5

  23. [24]

    Yetisgen-Yildiz, M.L

    M. Yetisgen-Yildiz, M.L. Gunn, F. Xia, T.H. Payne, A text processing pipeline to extract recommendations from radiology reports, J. Biomed. Inform. 46 (2013) 354 -362. https://doi.org/10.1016/j.jbi.2012.12.005

  24. [25]

    M., & von Bloh, W

    J. Dagdelen, A. Dunn, S. Lee, N. Walker, A.S. Rosen, G. Ceder, K.A. Persson, A. Jain, Structured information extraction from scientific text with large language models, Nat. Commun. 15 (2024) 1418. https://doi.org/10.1038/s41467 -024-45563 -x