LLM StructCore: Schema-Guided Reasoning Condensation and Deterministic Compilation
Pith reviewed 2026-05-09 23:40 UTC · model grok-4.3
The pith
A two-stage pipeline condenses clinical notes into a fixed nine-key schema-guided summary that a deterministic compiler then expands into 134-field case report forms.
A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.
Core claim
The central claim is that replacing direct 134-field prediction with schema-guided condensation into a nine-key summary followed by deterministic compilation produces accurate, filter-compliant CRF outputs by removing LLM variability from the expansion and filtering steps.
What carries the argument
Schema-guided reasoning condensation to a stable JSON object with exactly nine domain keys, which a deterministic compiler then parses, canonicalizes, filters with evidence gates, and expands to the 134-item target structure.
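The Stage 2 contract described above (parse, canonicalize, normalize to the controlled vocabulary, evidence-gate, expand) can be sketched as a single deterministic function. The key names, vocabulary, and gate rule below are illustrative assumptions, not the paper's actual schema:

```python
import json

# Hypothetical domain keys and controlled vocabulary; the paper's actual
# schema and normalization rules are not reproduced here.
NINE_KEYS = ["symptoms", "vitals", "history", "medications", "exams",
             "diagnoses", "procedures", "outcomes", "context"]
VOCAB = {"yes": "present", "present": "present", "no": "absent", "absent": "absent"}
CRF_ITEMS = [f"item_{i:03d}" for i in range(134)]  # stand-in for the 134 official items

def compile_summary(stage1_json: str) -> dict:
    summary = json.loads(stage1_json)                              # parse Stage 1 output
    summary = {k.strip().lower(): v for k, v in summary.items()}   # canonicalize key names
    if set(summary) != set(NINE_KEYS):
        raise ValueError("contract violation: expected exactly 9 domain keys")
    crf = {item: "unknown" for item in CRF_ITEMS}                  # expand: default everything to unknown
    for key in NINE_KEYS:
        for item, pred in summary[key].items():
            value = VOCAB.get(str(pred.get("value", "")).lower())  # normalize to controlled vocabulary
            # evidence gate: discard any prediction with no supporting text span
            if value is not None and pred.get("evidence") and item in crf:
                crf[item] = value
    return crf
```

Because the function is pure and LLM-free, the same Stage 1 summary always compiles to the same 134-field output, which is the variability-removal argument in a nutshell.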
Load-bearing premise
The nine-key summary produced in the first stage must capture every necessary fact from the notes without loss, so that the deterministic compiler can correctly expand and filter it into the final 134 fields.
What would settle it
A clinical note containing a fact required for one of the 134 fields that is omitted from the nine-key summary, or an unsupported prediction that the evidence-gated filters fail to remove before it reaches the final output.
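Both falsifiers can be operationalized as simple checks against gold annotations. A hypothetical sketch, assuming CRF fields use an "unknown" default (a convention assumed here, not taken from the paper):

```python
def missing_facts(gold_crf: dict, compiled_crf: dict) -> list:
    """Gold fields with a known value that the compiled output left unknown:
    each one is a fact the nine-key summary failed to carry forward."""
    return sorted(
        item for item, gold in gold_crf.items()
        if gold != "unknown" and compiled_crf.get(item, "unknown") == "unknown"
    )

def unsupported_predictions(gold_crf: dict, compiled_crf: dict) -> list:
    """Compiled fields asserting a value where the gold is unknown: each one
    is a prediction the evidence-gated filters failed to remove."""
    return sorted(
        item for item, pred in compiled_crf.items()
        if pred != "unknown" and gold_crf.get(item, "unknown") == "unknown"
    )
```

A single non-empty return from either function on the dev set would be the counter-example the review asks for.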
Original abstract
Automatically filling Case Report Forms (CRFs) from clinical notes is challenging due to noisy language, strict output contracts, and the high cost of false positives. We describe our CL4Health 2026 submission for Dyspnea CRF filling (134 items) using a contract-driven two-stage design grounded in Schema-Guided Reasoning (SGR). The key task property is extreme sparsity: the majority of fields are unknown, and official scoring penalizes both empty values and unsupported predictions. We shift from a single-step "LLM predicts 134 fields" approach to a decomposition where (i) Stage 1 produces a stable SGR-style JSON summary with exactly 9 domain keys, and (ii) Stage 2 is a fully deterministic, 0-LLM compiler that parses the Stage 1 summary, canonicalizes item names, normalizes predictions to the official controlled vocabulary, applies evidence-gated false-positive filters, and expands the output into the required 134-item format. On the dev80 split, the best teacher configuration achieves macro-F1 0.6543 (EN) and 0.6905 (IT); on the hidden test200, the submitted English variant scores 0.63 on Codabench. The pipeline is language-agnostic: Italian results match or exceed English with no language-specific engineering.
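For readers unfamiliar with the metric: macro-F1 here averages per-item F1 over the CRF items, so under extreme sparsity both missed facts (false negatives) and unsupported predictions (false positives) hurt the score. A rough sketch, assuming an "unknown" default value; the official Codabench scorer may differ in details:

```python
from collections import defaultdict

def macro_f1(gold_docs, pred_docs):
    """gold_docs / pred_docs: lists of {item: value} dicts, one per document.
    Items that are unknown in both gold and prediction never accumulate
    counts and so do not enter the macro average."""
    tp, fp, fn = defaultdict(int), defaultdict(int), defaultdict(int)
    for gold, pred in zip(gold_docs, pred_docs):
        for item in gold:
            g, p = gold[item], pred.get(item, "unknown")
            if g != "unknown" and p == g:
                tp[item] += 1
            elif p != "unknown":           # unsupported or wrong prediction
                fp[item] += 1
            if g != "unknown" and p != g:  # missed or wrong value
                fn[item] += 1
    scores = []
    for item in set(tp) | set(fp) | set(fn):
        denom = 2 * tp[item] + fp[item] + fn[item]
        scores.append(2 * tp[item] / denom if denom else 0.0)
    return sum(scores) / len(scores) if scores else 0.0
```

This asymmetry is why the paper invests in evidence-gated filters: a confident wrong answer is counted twice (as a false positive and a false negative), while an honest "unknown" costs only the false negative.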
Editorial analysis
A structured set of objections, weighed in public.
Referee Report
Summary. The manuscript presents LLM StructCore, a two-stage pipeline for filling 134-item Dyspnea Case Report Forms from clinical notes. Stage 1 uses an LLM to condense input notes into a fixed 9-key schema-guided JSON summary via Schema-Guided Reasoning; Stage 2 applies a fully deterministic compiler that canonicalizes names, normalizes to controlled vocabulary, applies evidence-gated false-positive filters, and expands to the required 134-field output. The approach is presented as language-agnostic. On the dev80 split the best teacher configuration reaches macro-F1 0.6543 (EN) and 0.6905 (IT); the submitted English system scores 0.63 on the hidden test200.
Significance. If the 9-key condensation step preserves all information required by the deterministic compiler, the decomposition offers a practical route to high-precision, low-hallucination structured extraction under strict output contracts and extreme sparsity. The language-agnostic results and the separation of LLM reasoning from deterministic compilation are potentially useful for other clinical or regulatory NLP tasks that penalize unsupported predictions.
Major comments (2)
- [Stage 1 / abstract] The central claim that the fixed 9-key JSON summary encodes every fact needed for lossless deterministic expansion to all 134 CRF items (including correct triggering of the evidence-gated filters) is load-bearing for attributing the reported macro-F1 scores to the method. No coverage analysis, ablation removing individual keys, or counter-example set is provided to verify completeness on noisy, sparse clinical notes; if any required datum lies outside the 9 keys, Stage 2 cannot recover it and the performance numbers cannot be credited to the pipeline.
- [Abstract / results] The language-agnostic claim (Italian results match or exceed English with no language-specific engineering) rests on the same 9 keys being sufficient for both languages. No cross-lingual coverage check or per-language error analysis is described that would confirm the schema does not silently drop language-specific clinical details.
Minor comments (2)
- [Abstract] The abstract states concrete macro-F1 numbers on dev80 and hidden test200 but does not specify the exact teacher model, prompt template, or number of runs; adding these details would improve reproducibility.
- [Results] No baseline comparison (e.g., direct 134-field LLM prompting or rule-based extraction) is mentioned in the provided text; including at least one such reference point would clarify the contribution of the two-stage design.
Simulated Author's Rebuttal
We thank the referee for the positive assessment of our work and the detailed feedback. We address the major comments below and will revise the manuscript accordingly to strengthen the claims.
Point-by-point responses
Referee: [Stage 1 / abstract] The central claim that the fixed 9-key JSON summary encodes every fact needed for lossless deterministic expansion to all 134 CRF items (including correct triggering of the evidence-gated filters) is load-bearing for attributing the reported macro-F1 scores to the method. No coverage analysis, ablation removing individual keys, or counter-example set is provided to verify completeness on noisy, sparse clinical notes; if any required datum lies outside the 9 keys, Stage 2 cannot recover it and the performance numbers cannot be credited to the pipeline.
Authors: We agree that an explicit verification of the 9-key schema's completeness is important for substantiating the pipeline's performance. The schema was iteratively designed with clinical experts to ensure all information necessary for the 134 fields is captured, and the deterministic nature of Stage 2 ensures no information is added or lost beyond what is in the summary. However, we acknowledge the absence of a formal coverage analysis in the current manuscript. In the revised version, we will add a section detailing the mapping from the 9 keys to the 134 CRF items, an ablation study removing keys to show impact on F1, and a set of counter-examples from the dev set where the schema might be insufficient, if any exist. This will allow readers to assess the completeness claim directly.
Revision: yes
Referee: [Abstract / results] The language-agnostic claim (Italian results match or exceed English with no language-specific engineering) rests on the same 9 keys being sufficient for both languages. No cross-lingual coverage check or per-language error analysis is described that would confirm the schema does not silently drop language-specific clinical details.
Authors: The language-agnostic property is supported by the fact that the 9-key schema is defined in English but applied to Italian notes without modification, and the deterministic compiler operates on the structured output independently of language. The results show comparable or better performance on Italian, suggesting the schema captures the necessary clinical concepts across languages. That said, we did not include a specific cross-lingual coverage analysis or error breakdown by language in the manuscript. We will add this in the revision, including a per-language error analysis on the dev set to demonstrate that no language-specific details are lost due to the schema.
Revision: yes
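The ablation the authors promise has a simple generic shape: blank one key at a time before Stage 2 and record the score drop per key. A sketch with stand-in `compile_fn` and `score_fn` arguments (the authors' compiler and metric are not published in this text, so both are assumptions):

```python
def key_ablation(summaries, golds, keys, compile_fn, score_fn):
    """Blank each schema key in turn before compilation and report the
    baseline score plus the per-key score drop. A large drop means the
    key carries information the 134 fields depend on."""
    baseline = score_fn(golds, [compile_fn(s) for s in summaries])
    drops = {}
    for key in keys:
        ablated = [{k: ({} if k == key else v) for k, v in s.items()}
                   for s in summaries]
        drops[key] = baseline - score_fn(golds, [compile_fn(a) for a in ablated])
    return baseline, drops
```

Keys whose removal costs nothing are candidates for schema simplification; keys whose removal is catastrophic localize the load-bearing content.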
Circularity Check
No circularity: empirical pipeline description with external benchmarks
Full rationale
The manuscript presents an applied two-stage system for CRF filling: Stage 1 uses an LLM to produce a fixed 9-key schema-guided JSON summary, and Stage 2 applies a deterministic compiler to expand it to 134 fields while enforcing evidence-gated filters. No equations, parameter fitting, or derivations appear in the load-bearing claims. Results are reported on held-out dev80 and hidden test200 splits, providing external validation independent of the method definition. The completeness of the 9-key representation is treated as an empirical design choice whose adequacy is assessed by downstream F1 scores rather than by any self-referential reduction or self-citation chain. No self-citations, ansatzes, or uniqueness theorems are invoked to justify the pipeline.
Axiom & Free-Parameter Ledger
Axioms (1)
- Domain assumption: the LLM can produce stable and accurate 9-key JSON summaries from clinical notes under schema guidance.
Reference graph
Works this paper leans on
- [1] Rinat Abdullin. Schema-guided reasoning (SGR): A complete guide. https://abdullin.com/schema-guided-reasoning/, 2026.
- [2] Alan R. Aronson. Effective mapping of biomedical text to the UMLS Metathesaurus: The MetaMap program. In Proceedings of the AMIA Symposium, pages 17–21, 2001.
- [3] Yixin Dong, Charlie F. Ruan, Yaxing Cai, Ruihang Lai, Ziyi Xu, Yilong Zhao, and Tianqi Chen. XGrammar: Flexible and efficient structured generation engine for large language models. Proceedings of Machine Learning and Systems, 7, 2024.
- [4] Pietro Ferrazzi, Alberto Lavelli, and Bernardo Magnini. Converting annotated clinical cases into structured case report forms. In Dina Demner-Fushman, Sophia Ananiadou, Makoto Miwa, and Junichi Tsujii, editors, Proceedings of the 24th Workshop on Biomedical Language Processing, pages 307–318, Vienna, Austria, August 2025. Association for Computational Linguistics.
- [5] Pietro Ferrazzi, Mattia Franzin, Alberto Lavelli, and Bernardo Magnini. Small LLMs for medical NLP: A systematic analysis of few-shot, constraint decoding, fine-tuning and continual pre-training in Italian. arXiv preprint arXiv:2602.17475, 2026.
- [6] Pietro Ferrazzi, Soumitra Ghosh, Alberto Lavelli, and Bernardo Magnini. Overview of the CRF 2026 shared task on clinical case report forms filling. In Proceedings of the Third Workshop on Patient-Oriented Language Processing (CL4Health), Palma, Mallorca (Spain), 2026. ELRA.
- [7] Omar Khattab, Arnav Singhvi, Paridhi Maheshwari, Zhiyuan Zhang, Keshav Santhanam, Sri Vardhamanan, Saiful Haq, Ashutosh Sharma, Thomas T. Joshi, Hanna Moazam, et al. DSPy: Compiling declarative language model calls into self-improving pipelines. arXiv preprint arXiv:2310.03714, 2023.
- [8] Mark Neumann, Daniel King, Iz Beltagy, and Waleed Ammar. ScispaCy: Fast and robust models for biomedical natural language processing. In Proceedings of the 18th BioNLP Workshop and Shared Task, 2019.
- [9] Khaled Saab, Tao Tu, Wei-Hung Weng, Ryutaro Tanno, David Stutz, Ellery Wulczyn, et al. Capabilities of Gemini models in medicine. arXiv preprint arXiv:2404.18416, 2024.
- [10] Karan Singhal, Shekoofeh Azizi, Tao Tu, S. Sara Mahdavi, Jason Wei, Hyung Won Chung, Nathan Scales, Ajay Tanwani, Heather Cole-Lewis, Stephen Pfohl, et al. Large language models encode clinical knowledge. Nature, 620(7972):172–180, 2023.
- [11] Luca Soldaini and Nazli Goharian. QuickUMLS: A fast, unsupervised approach for medical concept extraction. In MedIR Workshop at SIGIR, 2016.
- [12] Hanna Suominen, Sanna Salanterä, Sumithra Velupillai, Wendy W. Chapman, Guergana Savova, Noemie Elhadad, Sameer Pradhan, Brett R. South, Danielle L. Mowery, Gareth J. F. Jones, Johannes Leveling, Liadh Kelly, Lorraine Goeuriot, David Martinez, and Guido Zuccon. Overview of the ShARe/CLEF eHealth evaluation lab 2013. In CLEF, 2013.
- [13] Özlem Uzuner, Brett R. South, Shuying Shen, and Scott L. DuVall. 2010 i2b2/VA challenge on concepts, assertions, and relations in clinical text. Journal of the American Medical Informatics Association, 18(5):552–556, 2011.
- [14] Brandon T. Willard and Rémi Louf. Efficient guided generation for large language models. arXiv preprint arXiv:2307.09702, 2023.
- [15] Lianmin Zheng, Liangsheng Yin, Zhiqiang Xie, Jeff Huang, Chuyue Sun, Cody Hao Yu, Shiyi Cao, Christos Kozyrakis, Joseph E. Gonzalez, Clark Barrett, Ying Sheng, and Ion Stoica. SGLang: Efficient execution of structured language model programs. arXiv preprint arXiv:2312.07104, 2024.