LLM StructCore: Schema-Guided Reasoning Condensation and Deterministic Compilation
Pith reviewed 2026-05-09 23:40 UTC · model grok-4.3
The pith
A two-stage pipeline condenses clinical notes into a fixed nine-key schema-guided summary that a deterministic compiler then expands into 134-field case report forms.
A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.
Core claim
The central claim is that replacing direct 134-field prediction with schema-guided condensation into a nine-key summary followed by deterministic compilation produces accurate, filter-compliant CRF outputs by removing LLM variability from the expansion and filtering steps.
What carries the argument
Schema-guided reasoning condensation to a stable JSON object with exactly nine domain keys, which a deterministic compiler then parses, canonicalizes, filters with evidence gates, and expands to the 134-item target structure.
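The Stage 2 contract described above (parse, canonicalize, normalize to the controlled vocabulary, evidence-gate, expand) can be sketched as a single deterministic function. The key names, vocabulary, and gate rule below are illustrative assumptions, not the paper's actual schema:

```python
import json

# Hypothetical domain keys and controlled vocabulary; the paper's actual
# schema and normalization rules are not reproduced here.
NINE_KEYS = ["symptoms", "vitals", "history", "medications", "exams",
             "diagnoses", "procedures", "outcomes", "context"]
VOCAB = {"yes": "present", "present": "present", "no": "absent", "absent": "absent"}
CRF_ITEMS = [f"item_{i:03d}" for i in range(134)]  # stand-in for the 134 official items

def compile_summary(stage1_json: str) -> dict:
    summary = json.loads(stage1_json)                              # parse Stage 1 output
    summary = {k.strip().lower(): v for k, v in summary.items()}   # canonicalize key names
    if set(summary) != set(NINE_KEYS):
        raise ValueError("contract violation: expected exactly 9 domain keys")
    crf = {item: "unknown" for item in CRF_ITEMS}                  # expand: default everything to unknown
    for key in NINE_KEYS:
        for item, pred in summary[key].items():
            value = VOCAB.get(str(pred.get("value", "")).lower())  # normalize to controlled vocabulary
            # evidence gate: discard any prediction with no supporting text span
            if value is not None and pred.get("evidence") and item in crf:
                crf[item] = value
    return crf
```

Because the function is pure and LLM-free, the same Stage 1 summary always compiles to the same 134-field output, which is the variability-removal argument in a nutshell.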
Load-bearing premise
The nine-key summary produced in the first stage must capture every necessary fact from the notes without loss, so that the deterministic compiler can correctly expand and filter it into the final 134 fields.
What would settle it
A clinical note containing a fact required for one of the 134 fields that is omitted from the nine-key summary, or an unsupported prediction that the evidence-gated filters fail to remove before it reaches the final output.
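Both falsifiers can be operationalized as simple checks against gold annotations. A hypothetical sketch, assuming CRF fields use an "unknown" default (a convention assumed here, not taken from the paper):

```python
def missing_facts(gold_crf: dict, compiled_crf: dict) -> list:
    """Gold fields with a known value that the compiled output left unknown:
    each one is a fact the nine-key summary failed to carry forward."""
    return sorted(
        item for item, gold in gold_crf.items()
        if gold != "unknown" and compiled_crf.get(item, "unknown") == "unknown"
    )

def unsupported_predictions(gold_crf: dict, compiled_crf: dict) -> list:
    """Compiled fields asserting a value where the gold is unknown: each one
    is a prediction the evidence-gated filters failed to remove."""
    return sorted(
        item for item, pred in compiled_crf.items()
        if pred != "unknown" and gold_crf.get(item, "unknown") == "unknown"
    )
```

A single non-empty return from either function on the dev set would be the counter-example the review asks for.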
Original abstract
Automatically filling Case Report Forms (CRFs) from clinical notes is challenging due to noisy language, strict output contracts, and the high cost of false positives. We describe our CL4Health 2026 submission for Dyspnea CRF filling (134 items) using a contract-driven two-stage design grounded in Schema-Guided Reasoning (SGR). The key task property is extreme sparsity: the majority of fields are unknown, and official scoring penalizes both empty values and unsupported predictions. We shift from a single-step "LLM predicts 134 fields" approach to a decomposition where (i) Stage 1 produces a stable SGR-style JSON summary with exactly 9 domain keys, and (ii) Stage 2 is a fully deterministic, 0-LLM compiler that parses the Stage 1 summary, canonicalizes item names, normalizes predictions to the official controlled vocabulary, applies evidence-gated false-positive filters, and expands the output into the required 134-item format. On the dev80 split, the best teacher configuration achieves macro-F1 0.6543 (EN) and 0.6905 (IT); on the hidden test200, the submitted English variant scores 0.63 on Codabench. The pipeline is language-agnostic: Italian results match or exceed English with no language-specific engineering.
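For readers unfamiliar with the metric: macro-F1 here averages per-item F1 over the CRF items, so under extreme sparsity both missed facts (false negatives) and unsupported predictions (false positives) hurt the score. A rough sketch, assuming an "unknown" default value; the official Codabench scorer may differ in details:

```python
from collections import defaultdict

def macro_f1(gold_docs, pred_docs):
    """gold_docs / pred_docs: lists of {item: value} dicts, one per document.
    Items that are unknown in both gold and prediction never accumulate
    counts and so do not enter the macro average."""
    tp, fp, fn = defaultdict(int), defaultdict(int), defaultdict(int)
    for gold, pred in zip(gold_docs, pred_docs):
        for item in gold:
            g, p = gold[item], pred.get(item, "unknown")
            if g != "unknown" and p == g:
                tp[item] += 1
            elif p != "unknown":           # unsupported or wrong prediction
                fp[item] += 1
            if g != "unknown" and p != g:  # missed or wrong value
                fn[item] += 1
    scores = []
    for item in set(tp) | set(fp) | set(fn):
        denom = 2 * tp[item] + fp[item] + fn[item]
        scores.append(2 * tp[item] / denom if denom else 0.0)
    return sum(scores) / len(scores) if scores else 0.0
```

This asymmetry is why the paper invests in evidence-gated filters: a confident wrong answer is counted twice (as a false positive and a false negative), while an honest "unknown" costs only the false negative.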
Editorial analysis
A structured set of objections, weighed in public.
Referee Report
Summary. The manuscript presents LLM StructCore, a two-stage pipeline for filling 134-item Dyspnea Case Report Forms from clinical notes. Stage 1 uses an LLM to condense input notes into a fixed 9-key schema-guided JSON summary via Schema-Guided Reasoning; Stage 2 applies a fully deterministic compiler that canonicalizes names, normalizes to controlled vocabulary, applies evidence-gated false-positive filters, and expands to the required 134-field output. The approach is presented as language-agnostic. On the dev80 split the best teacher configuration reaches macro-F1 0.6543 (EN) and 0.6905 (IT); the submitted English system scores 0.63 on the hidden test200.
Significance. If the 9-key condensation step preserves all information required by the deterministic compiler, the decomposition offers a practical route to high-precision, low-hallucination structured extraction under strict output contracts and extreme sparsity. The language-agnostic results and the separation of LLM reasoning from deterministic compilation are potentially useful for other clinical or regulatory NLP tasks that penalize unsupported predictions.
Major comments (2)
- [Stage 1 / abstract] The central claim that the fixed 9-key JSON summary encodes every fact needed for lossless deterministic expansion to all 134 CRF items (including correct triggering of the evidence-gated filters) is load-bearing for attributing the reported macro-F1 scores to the method. No coverage analysis, ablation removing individual keys, or counter-example set is provided to verify completeness on noisy, sparse clinical notes; if any required datum lies outside the 9 keys, Stage 2 cannot recover it and the performance numbers cannot be credited to the pipeline.
- [Abstract / results] The language-agnostic claim (Italian results match or exceed English with no language-specific engineering) rests on the same 9 keys being sufficient for both languages. No cross-lingual coverage check or per-language error analysis is described that would confirm the schema does not silently drop language-specific clinical details.
Minor comments (2)
- [Abstract] The abstract states concrete macro-F1 numbers on dev80 and hidden test200 but does not specify the exact teacher model, prompt template, or number of runs; adding these details would improve reproducibility.
- [Results] No baseline comparison (e.g., direct 134-field LLM prompting or rule-based extraction) is mentioned in the provided text; including at least one such reference point would clarify the contribution of the two-stage design.
Simulated Author's Rebuttal
We thank the referee for the positive assessment of our work and the detailed feedback. We address the major comments below and will revise the manuscript accordingly to strengthen the claims.
Point-by-point responses
Referee: [Stage 1 / abstract] The central claim that the fixed 9-key JSON summary encodes every fact needed for lossless deterministic expansion to all 134 CRF items (including correct triggering of the evidence-gated filters) is load-bearing for attributing the reported macro-F1 scores to the method. No coverage analysis, ablation removing individual keys, or counter-example set is provided to verify completeness on noisy, sparse clinical notes; if any required datum lies outside the 9 keys, Stage 2 cannot recover it and the performance numbers cannot be credited to the pipeline.
Authors: We agree that an explicit verification of the 9-key schema's completeness is important for substantiating the pipeline's performance. The schema was iteratively designed with clinical experts to ensure all information necessary for the 134 fields is captured, and the deterministic nature of Stage 2 ensures no information is added or lost beyond what is in the summary. However, we acknowledge the absence of a formal coverage analysis in the current manuscript. In the revised version, we will add a section detailing the mapping from the 9 keys to the 134 CRF items, an ablation study removing keys to show impact on F1, and a set of counter-examples from the dev set where the schema might be insufficient, if any exist. This will allow readers to assess the completeness claim directly.
Revision: yes
Referee: [Abstract / results] The language-agnostic claim (Italian results match or exceed English with no language-specific engineering) rests on the same 9 keys being sufficient for both languages. No cross-lingual coverage check or per-language error analysis is described that would confirm the schema does not silently drop language-specific clinical details.
Authors: The language-agnostic property is supported by the fact that the 9-key schema is defined in English but applied to Italian notes without modification, and the deterministic compiler operates on the structured output independently of language. The results show comparable or better performance on Italian, suggesting the schema captures the necessary clinical concepts across languages. That said, we did not include a specific cross-lingual coverage analysis or error breakdown by language in the manuscript. We will add this in the revision, including a per-language error analysis on the dev set to demonstrate that no language-specific details are lost due to the schema.
Revision: yes
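The ablation the authors promise has a simple generic shape: blank one key at a time before Stage 2 and record the score drop per key. A sketch with stand-in `compile_fn` and `score_fn` arguments (the authors' compiler and metric are not published in this text, so both are assumptions):

```python
def key_ablation(summaries, golds, keys, compile_fn, score_fn):
    """Blank each schema key in turn before compilation and report the
    baseline score plus the per-key score drop. A large drop means the
    key carries information the 134 fields depend on."""
    baseline = score_fn(golds, [compile_fn(s) for s in summaries])
    drops = {}
    for key in keys:
        ablated = [{k: ({} if k == key else v) for k, v in s.items()}
                   for s in summaries]
        drops[key] = baseline - score_fn(golds, [compile_fn(a) for a in ablated])
    return baseline, drops
```

Keys whose removal costs nothing are candidates for schema simplification; keys whose removal is catastrophic localize the load-bearing content.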
Circularity Check
No circularity: empirical pipeline description with external benchmarks
Full rationale
The manuscript presents an applied two-stage system for CRF filling: Stage 1 uses an LLM to produce a fixed 9-key schema-guided JSON summary, and Stage 2 applies a deterministic compiler to expand it to 134 fields while enforcing evidence-gated filters. No equations, parameter fitting, or derivations appear in the load-bearing claims. Results are reported on held-out dev80 and hidden test200 splits, providing external validation independent of the method definition. The completeness of the 9-key representation is treated as an empirical design choice whose adequacy is assessed by downstream F1 scores rather than by any self-referential reduction or self-citation chain. No self-citations, ansatzes, or uniqueness theorems are invoked to justify the pipeline.
Axiom & Free-Parameter Ledger
Axioms (1)
- Domain assumption: the LLM can produce stable and accurate 9-key JSON summaries from clinical notes under schema guidance.
Reference graph
Works this paper leans on
- [1] Rinat Abdullin. Schema-guided reasoning (SGR): A complete guide. https://abdullin.com/schema-guided-reasoning/, 2026.
- [2] Alan R. Aronson. Effective mapping of biomedical text to the UMLS Metathesaurus: The MetaMap program. In Proceedings of the AMIA Symposium, pages 17–21, 2001.
- [3] Yixin Dong, Charlie F. Ruan, Yaxing Cai, Ruihang Lai, Ziyi Xu, Yilong Zhao, and Tianqi Chen. XGrammar: Flexible and efficient structured generation engine for large language models. Proceedings of Machine Learning and Systems, 7, 2024.
- [4] Pietro Ferrazzi, Alberto Lavelli, and Bernardo Magnini. Converting annotated clinical cases into structured case report forms. In Dina Demner-Fushman, Sophia Ananiadou, Makoto Miwa, and Junichi Tsujii, editors, Proceedings of the 24th Workshop on Biomedical Language Processing, pages 307–318, Vienna, Austria, August 2025. Association for Computational Linguistics.
- [5] Pietro Ferrazzi, Mattia Franzin, Alberto Lavelli, and Bernardo Magnini. Small LLMs for medical NLP: A systematic analysis of few-shot, constraint decoding, fine-tuning and continual pre-training in Italian. arXiv preprint arXiv:2602.17475, 2026.
- [6] Pietro Ferrazzi, Soumitra Ghosh, Alberto Lavelli, and Bernardo Magnini. Overview of the CRF 2026 shared task on clinical case report forms filling. In Proceedings of the Third Workshop on Patient-Oriented Language Processing (CL4Health), Palma, Mallorca (Spain), 2026. ELRA.
- [7] Omar Khattab, Arnav Singhvi, Paridhi Maheshwari, Zhiyuan Zhang, Keshav Santhanam, Sri Vardhamanan, Saiful Haq, Ashutosh Sharma, Thomas T. Joshi, Hanna Moazam, et al. DSPy: Compiling declarative language model calls into self-improving pipelines. arXiv preprint arXiv:2310.03714, 2023.
- [8] Mark Neumann, Daniel King, Iz Beltagy, and Waleed Ammar. ScispaCy: Fast and robust models for biomedical natural language processing. In Proceedings of the 18th BioNLP Workshop and Shared Task, 2019.
- [9] Khaled Saab, Tao Tu, Wei-Hung Weng, Ryutaro Tanno, David Stutz, Ellery Wulczyn, et al. Capabilities of Gemini models in medicine. arXiv preprint arXiv:2404.18416, 2024.
- [10] Karan Singhal, Shekoofeh Azizi, Tao Tu, S. Sara Mahdavi, Jason Wei, Hyung Won Chung, Nathan Scales, Ajay Tanwani, Heather Cole-Lewis, Stephen Pfohl, et al. Large language models encode clinical knowledge. Nature, 620(7972):172–180, 2023.
- [11] Luca Soldaini and Nazli Goharian. QuickUMLS: A fast, unsupervised approach for medical concept extraction. In MedIR Workshop at SIGIR, 2016.
- [12] Hanna Suominen, Sanna Salanterä, Sumithra Velupillai, Wendy W. Chapman, Guergana Savova, Noemie Elhadad, Sameer Pradhan, Brett R. South, Danielle L. Mowery, Gareth J. F. Jones, Johannes Leveling, Liadh Kelly, Lorraine Goeuriot, David Martinez, and Guido Zuccon. Overview of the ShARe/CLEF eHealth evaluation lab 2013. In CLEF, 2013.
- [13] Özlem Uzuner, Brett R. South, Shuying Shen, and Scott L. DuVall. 2010 i2b2/VA challenge on concepts, assertions, and relations in clinical text. Journal of the American Medical Informatics Association, 18(5):552–556, 2011.
- [14] Brandon T. Willard and Rémi Louf. Efficient guided generation for large language models. arXiv preprint arXiv:2307.09702, 2023.
- [15] Lianmin Zheng, Liangsheng Yin, Zhiqiang Xie, Jeff Huang, Chuyue Sun, Cody Hao Yu, Shiyi Cao, Christos Kozyrakis, Joseph E. Gonzalez, Clark Barrett, Ying Sheng, and Ion Stoica. SGLang: Efficient execution of structured language model programs. arXiv preprint arXiv:2312.07104, 2024.