
arxiv: 2507.12261 · v2 · submitted 2025-07-16 · 💻 cs.CL · cs.AI

Infherno: End-to-end Agent-based FHIR Resource Synthesis from Free-form Clinical Notes

Pith reviewed 2026-05-19 04:31 UTC · model grok-4.3

classification 💻 cs.CL cs.AI
keywords FHIR resource synthesis · LLM agents · clinical notes · healthcare interoperability · structured data generation · medical terminology tools · end-to-end automation

The pith

An agent-based LLM system converts free-form clinical notes into structured FHIR resources, with performance approaching a human baseline.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper introduces Infherno as an end-to-end framework that deploys LLM agents equipped with code execution and terminology database tools to generate FHIR resources directly from unstructured clinical notes. Earlier modular pipelines and constrained-decoding methods often produced invalid structures or failed to generalize, so the new approach targets those gaps by letting agents iteratively build, validate, and correct outputs. A sympathetic reader would care because successful automation could reduce manual structuring work and support consistent data exchange between hospitals and systems. Evaluation on synthetic and real clinical datasets shows that the system, especially with Gemini 2.5-Pro, performs close to the level of human annotators.
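
The build, validate, and correct cycle described above can be sketched in a few lines of Python. This is a minimal illustration under stated assumptions, not the authors' implementation: `validate_condition`, `synthesize`, and `stub_agent` are hypothetical names, the validator checks only three fields of a FHIR Condition, and the "agent" is a hard-coded stand-in for an LLM call.

```python
from typing import Callable

# Hypothetical minimal rules for a FHIR Condition; the real schema is far richer.
REQUIRED_FIELDS = {"resourceType", "subject", "code"}


def validate_condition(resource: dict) -> list[str]:
    """Return human-readable problems; an empty list means the draft passes."""
    missing = sorted(REQUIRED_FIELDS - set(resource))
    problems = [f"missing field: {name}" for name in missing]
    if resource.get("resourceType", "Condition") != "Condition":
        problems.append("resourceType must be 'Condition'")
    return problems


def synthesize(
    note: str,
    propose: Callable[[str, list[str]], dict],
    max_steps: int = 5,
) -> dict:
    """Request a draft, validate it, and feed the errors back until it passes."""
    feedback: list[str] = []
    for _ in range(max_steps):
        draft = propose(note, feedback)
        feedback = validate_condition(draft)
        if not feedback:
            return draft
    raise RuntimeError(f"no valid resource after {max_steps} steps: {feedback}")


def stub_agent(note: str, feedback: list[str]) -> dict:
    """Stand-in for an LLM call: the first draft omits `subject`, the retry adds it."""
    draft = {"resourceType": "Condition", "code": {"text": note}}
    if "missing field: subject" in feedback:
        draft["subject"] = {"reference": "Patient/example"}
    return draft


result = synthesize("type 2 diabetes", stub_agent)
```

The point of the design is that validator output is returned to the proposer as feedback, so a structurally invalid draft triggers another attempt instead of being emitted as the final answer.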

Core claim

Infherno is an agent-driven pipeline that combines large language models, executable code, and healthcare terminology lookups to translate free-form clinical notes into FHIR-compliant resources end to end, achieving structural validity and clinical accuracy that competes with human baseline predictions on both synthetic and clinical test sets.

What carries the argument

Infherno, the end-to-end LLM-agent framework that uses code execution and terminology tools to enforce FHIR schema adherence and clinical correctness during resource synthesis from free-form text.
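
To make the terminology-tool idea concrete, here is a small hedged sketch of a lookup that turns a free-text term into a coded FHIR `CodeableConcept`. The in-memory table, `search_terms`, and `to_codeable_concept` are illustrative assumptions; the paper's actual SNOMED CT tooling is not described on this page, and a real deployment would presumably query a terminology server.

```python
# Illustrative in-memory table standing in for a terminology service.
SNOMED_SAMPLE = {
    "diabetes mellitus": "73211009",
    "myocardial infarction": "22298006",
    "hypertensive disorder": "38341003",
}
SNOMED_SYSTEM = "http://snomed.info/sct"  # FHIR code system URI for SNOMED CT


def search_terms(query: str) -> list[tuple[str, str]]:
    """Return (display, code) pairs whose display contains every query word."""
    words = query.lower().split()
    return [
        (display, code)
        for display, code in SNOMED_SAMPLE.items()
        if all(word in display for word in words)
    ]


def to_codeable_concept(display: str, code: str) -> dict:
    """Wrap a terminology hit in the shape of a FHIR CodeableConcept."""
    return {
        "coding": [{"system": SNOMED_SYSTEM, "code": code, "display": display}],
        "text": display,
    }


hits = search_terms("infarction")
concept = to_codeable_concept(*hits[0])
```

An agent with access to such a tool can ground a mention from the note in a looked-up code, which is one way to reduce the risk of a model inventing identifiers.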

If this is right

  • The framework supports clinical data integration and interoperability across different healthcare institutions.
  • A dedicated frontend allows processing of both custom synthetic data and real clinical notes.
  • Users can run the system with either local models or proprietary models according to institutional constraints.
  • Gemini 2.5-Pro emerges as the strongest performer among tested models on the evaluated synthetic and clinical datasets.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

  • Widespread adoption could lower the cost of populating standardized electronic health records by shifting effort from manual entry to review of agent outputs.
  • The same agent-plus-tool pattern might extend to generating resources in other medical data standards beyond FHIR.
  • Iterative agent conversations could be added to clarify ambiguous phrases in notes before final resource creation.
  • Deployment in live hospital workflows would need separate measurement of latency and error rates under time pressure.

Load-bearing premise

LLM agents with code execution and terminology tools can consistently produce structurally valid and clinically accurate FHIR resources from ambiguous free-form notes even when reliable ground-truth data is difficult to collect.

What would settle it

A large blinded comparison in which expert clinicians score Infherno-generated FHIR resources against manually created reference resources drawn from a diverse collection of real-world clinical notes for both syntactic schema compliance and clinical accuracy.

Figures

Figures reproduced from arXiv: 2507.12261 by Alexander Meyer, Frank Kramer, Johann Frei, Lisa Raithel, Nils Feldhus, Roland Roller.

Figure 1: Illustrative example of how Infherno, an agentic approach for FHIR resource synthesis, processes a discharge letter (top left, cyan) using SNOMED CT tools (light blue) and code search (green) and fhir.resources code loops (purple, right). After a few iterations including tool calls and observations from a Python executor, the LLM agent proceeds to produce a final answer (red) in a FHIR/JSON format, repres…
Figure 3: Front end of Infherno showing an intermediate step (Code Search) during the text-to-FHIR translation with the Log Replay function.
Figure 4: Extended example of clinical note synthesis with …
Figure 5: The full text from the first document from the synthetic corpus.
Original abstract

For clinical data integration and healthcare services, the HL7 FHIR standard has established itself as a desirable format for interoperability between complex health data. Previous attempts at automating the translation from free-form clinical notes into structured FHIR resources address narrowly defined tasks and rely on modular approaches or LLMs with instruction tuning and constrained decoding. As those solutions frequently suffer from limited generalizability and structural inconformity, we propose an end-to-end framework powered by LLM agents, code execution, and healthcare terminology database tools to address these issues. Our solution, called Infherno, is designed to adhere to the FHIR document schema and competes well with a human baseline in predicting FHIR resources from unstructured text. The implementation features a front end for custom and synthetic data and both local and proprietary models, supporting clinical data integration processes and interoperability across institutions. Gemini 2.5-Pro excels in our evaluation on synthetic and clinical datasets, yet ambiguity and feasibility of collecting ground-truth data remain open problems.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Referee Report

2 major / 2 minor

Summary. The manuscript introduces Infherno, an end-to-end framework that employs LLM agents augmented with code execution and healthcare terminology database tools to translate free-form clinical notes into HL7 FHIR resources. It positions this agent-based approach as superior to prior modular pipelines or instruction-tuned LLMs with constrained decoding, emphasizing improved structural conformity to the FHIR schema. The work reports evaluations on synthetic and clinical datasets, claiming competitive performance against a human baseline (with Gemini 2.5-Pro performing strongly), provides a frontend for custom/synthetic data and model support, and flags ambiguity plus ground-truth collection as remaining open problems.

Significance. If the performance claims are substantiated with rigorous metrics, this engineering contribution could meaningfully advance automated clinical data structuring for interoperability, addressing documented weaknesses in generalizability and schema validity of earlier methods. The integration of agentic workflows with external tools is a practical strength, and the provision of a frontend plus support for both local and proprietary models enhances usability. The empirical focus on real-world FHIR synthesis is timely given the standard's adoption in healthcare.

major comments (2)
  1. Abstract: The central claim that Infherno 'competes well with a human baseline in predicting FHIR resources from unstructured text' is unsupported by any quantitative metrics (F1, exact-match, structural validity rates), error analysis, or protocol details for the human baseline (annotator numbers, adjudication process, or ambiguity handling). This is load-bearing for the primary contribution, especially since the abstract itself identifies ground-truth collection as an open problem.
  2. Evaluation/Results: No tables, figures, or specific numerical results are referenced to substantiate performance on synthetic versus clinical datasets or to enable comparison against the human baseline, preventing assessment of robustness or clinical accuracy.
minor comments (2)
  1. The abstract would benefit from briefly specifying example FHIR resources targeted (e.g., Patient, Observation, Condition) to ground the synthesis task.
  2. Consider adding a diagram of the agent workflow (including tool calls for code execution and terminology lookup) to improve clarity of the end-to-end pipeline.
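
The metrics named in major comment 1 (F1, exact match, structural validity rate) could be computed along the following lines. This is a sketch under assumptions: the page does not report how the paper scores predictions, so the field-level F1 over flattened (path, value) pairs, and the helper names `flatten`, `field_f1`, `exact_match`, and `validity_rate`, are hypothetical choices made for illustration only.

```python
from typing import Callable


def flatten(node: object, prefix: str = "") -> set[tuple[str, str]]:
    """Flatten nested dicts and lists into (path, value) pairs."""
    items: set[tuple[str, str]] = set()
    if isinstance(node, dict):
        for key, value in node.items():
            items |= flatten(value, f"{prefix}.{key}" if prefix else key)
    elif isinstance(node, list):
        for index, value in enumerate(node):
            items |= flatten(value, f"{prefix}[{index}]")
    else:
        items.add((prefix, str(node)))
    return items


def field_f1(pred: dict, gold: dict) -> float:
    """Harmonic mean of precision and recall over (path, value) pairs."""
    p, g = flatten(pred), flatten(gold)
    true_positives = len(p & g)
    if true_positives == 0:
        return 0.0
    precision, recall = true_positives / len(p), true_positives / len(g)
    return 2 * precision * recall / (precision + recall)


def exact_match(pred: dict, gold: dict) -> bool:
    """True only when the predicted resource equals the reference exactly."""
    return pred == gold


def validity_rate(resources: list[dict], is_valid: Callable[[dict], bool]) -> float:
    """Fraction of resources accepted by a caller-supplied validator."""
    return sum(is_valid(r) for r in resources) / len(resources) if resources else 0.0


gold = {"resourceType": "Condition", "code": {"text": "asthma"},
        "subject": {"reference": "Patient/1"}}
pred = {"resourceType": "Condition", "code": {"text": "asthma"},
        "subject": {"reference": "Patient/2"}}
score = field_f1(pred, gold)
```

In this toy pair, two of the three leaf fields agree, so precision and recall are both 2/3 while exact match fails. Reporting both kinds of number would show how much of a resource is right when it is not entirely right.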

Simulated Author's Rebuttal

2 responses · 0 unresolved

We thank the referee for the constructive and detailed review. We address each major comment below, indicating where revisions will be made to better substantiate the claims and improve clarity.

  1. Referee: Abstract: The central claim that Infherno 'competes well with a human baseline in predicting FHIR resources from unstructured text' is unsupported by any quantitative metrics (F1, exact-match, structural validity rates), error analysis, or protocol details for the human baseline (annotator numbers, adjudication process, or ambiguity handling). This is load-bearing for the primary contribution, especially since the abstract itself identifies ground-truth collection as an open problem.

    Authors: We agree that the abstract claim would benefit from more explicit support to strengthen the manuscript. In the revision, we will update the abstract to include or directly reference key quantitative metrics (such as F1 scores, exact-match rates, and structural validity) from the evaluations on synthetic and clinical datasets. We will also add concise details on the human baseline protocol, including annotator numbers, adjudication process, and ambiguity handling. This addresses the load-bearing nature of the claim while preserving the abstract's brevity. The manuscript already flags ground-truth collection as an open problem precisely because of these ambiguities in clinical notes. revision: yes

  2. Referee: Evaluation/Results: No tables, figures, or specific numerical results are referenced to substantiate performance on synthetic versus clinical datasets or to enable comparison against the human baseline, preventing assessment of robustness or clinical accuracy.

    Authors: We acknowledge that clearer referencing and presentation of results would aid assessment. The manuscript reports evaluations on both synthetic and clinical datasets with comparisons to the human baseline, but we will revise the results section to explicitly cite and describe the relevant tables and figures. This will include specific numerical results (F1, exact-match, structural validity rates) for Infherno variants versus the human baseline, along with an expanded error analysis to demonstrate robustness and clinical accuracy. We will ensure these elements are prominently referenced in the text. revision: yes

Circularity Check

0 steps flagged

No circularity: empirical engineering evaluation against external baselines


The paper describes an LLM-agent pipeline for converting clinical notes to FHIR resources and reports empirical performance against human baselines and datasets. No derivations, equations, fitted parameters, or predictions are present that could reduce to self-definition or internal fits. Claims rest on external evaluation rather than any self-referential construction, self-citation load-bearing argument, or renamed known result. The work is self-contained as an applied system description with open problems explicitly noted.

Axiom & Free-Parameter Ledger

0 free parameters · 0 axioms · 0 invented entities

The approach rests on standard assumptions about LLM tool-use reliability and FHIR schema validity rather than introducing new fitted parameters, axioms, or invented entities.


