Infherno: End-to-end Agent-based FHIR Resource Synthesis from Free-form Clinical Notes
Pith reviewed 2026-05-19 04:31 UTC · model grok-4.3
The pith
An agent-based LLM system converts free-form clinical notes into structured FHIR resources and competes well with a human baseline.
A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.
Core claim
Infherno is an agent-driven pipeline that combines large language models, executable code, and healthcare terminology lookups to translate free-form clinical notes into FHIR-compliant resources end to end, achieving structural validity and clinical accuracy that competes with human baseline predictions on both synthetic and clinical test sets.
What carries the argument
Infherno, the end-to-end LLM-agent framework that uses code execution and terminology tools to enforce FHIR schema adherence and clinical correctness during resource synthesis from free-form text.
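The validate-and-retry idea behind "code execution to enforce schema adherence" can be sketched in a few lines. This is a minimal illustration under stated assumptions, not Infherno's implementation: `REQUIRED_FIELDS`, `validate`, `synthesize`, and `fake_model` are hypothetical names, the rule set is a toy stand-in for real FHIR validation, and `fake_model` replaces an actual LLM call.

```python
from typing import Callable

# Toy rule set chosen for illustration; real FHIR validation covers far more
# (cardinalities, data types, value sets, profiles).
REQUIRED_FIELDS = {"Condition": ["resourceType", "subject", "code"]}


def validate(resource: dict) -> list[str]:
    """Return a list of problems; an empty list means the toy checks passed."""
    rtype = resource.get("resourceType")
    if rtype not in REQUIRED_FIELDS:
        return [f"unknown resourceType: {rtype!r}"]
    return [
        f"missing field: {name}"
        for name in REQUIRED_FIELDS[rtype]
        if name not in resource
    ]


def synthesize(
    note: str,
    propose: Callable[[str, list[str]], dict],
    max_steps: int = 3,
) -> dict:
    """Ask `propose` for a resource, feeding validation errors back each step."""
    errors: list[str] = []
    for _ in range(max_steps):
        candidate = propose(note, errors)
        errors = validate(candidate)
        if not errors:
            return candidate
    raise ValueError(f"no valid resource after {max_steps} steps: {errors}")


def fake_model(note: str, errors: list[str]) -> dict:
    """Stand-in for an LLM call: omits `subject` until told it is missing."""
    resource = {"resourceType": "Condition", "code": {"text": note}}
    if "missing field: subject" in errors:
        resource["subject"] = {"reference": "Patient/example"}
    return resource


result = synthesize("type 2 diabetes", fake_model)
print(result)
```

The point of the design is that structural errors become feedback for the next proposal rather than a final failure, which is one plausible reading of how an agent with code execution could keep its output schema-conformant.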
If this is right
- The framework supports clinical data integration and interoperability across different healthcare institutions.
- A dedicated frontend allows processing of both custom synthetic data and real clinical notes.
- Users can run the system with either local models or proprietary models according to institutional constraints.
- Gemini 2.5-Pro emerges as the strongest performer among tested models on the evaluated synthetic and clinical datasets.
Where Pith is reading between the lines
- Widespread adoption could lower the cost of populating standardized electronic health records by shifting effort from manual entry to review of agent outputs.
- The same agent-plus-tool pattern might extend to generating resources in other medical data standards beyond FHIR.
- Iterative agent conversations could be added to clarify ambiguous phrases in notes before final resource creation.
- Deployment in live hospital workflows would need separate measurement of latency and error rates under time pressure.
Load-bearing premise
LLM agents with code execution and terminology tools can consistently produce structurally valid and clinically accurate FHIR resources from ambiguous free-form notes even when reliable ground-truth data is difficult to collect.
What would settle it
A large blinded comparison in which expert clinicians score Infherno-generated FHIR resources against manually created reference resources drawn from a diverse collection of real-world clinical notes for both syntactic schema compliance and clinical accuracy.
Original abstract
For clinical data integration and healthcare services, the HL7 FHIR standard has established itself as a desirable format for interoperability between complex health data. Previous attempts at automating the translation from free-form clinical notes into structured FHIR resources address narrowly defined tasks and rely on modular approaches or LLMs with instruction tuning and constrained decoding. As those solutions frequently suffer from limited generalizability and structural inconformity, we propose an end-to-end framework powered by LLM agents, code execution, and healthcare terminology database tools to address these issues. Our solution, called Infherno, is designed to adhere to the FHIR document schema and competes well with a human baseline in predicting FHIR resources from unstructured text. The implementation features a front end for custom and synthetic data and both local and proprietary models, supporting clinical data integration processes and interoperability across institutions. Gemini 2.5-Pro excels in our evaluation on synthetic and clinical datasets, yet ambiguity and feasibility of collecting ground-truth data remain open problems.
Editorial analysis
A structured set of objections, weighed in public.
Referee Report
Summary. The manuscript introduces Infherno, an end-to-end framework that employs LLM agents augmented with code execution and healthcare terminology database tools to translate free-form clinical notes into HL7 FHIR resources. It positions this agent-based approach as superior to prior modular pipelines or instruction-tuned LLMs with constrained decoding, emphasizing improved structural conformity to the FHIR schema. The work reports evaluations on synthetic and clinical datasets, claiming competitive performance against a human baseline (with Gemini 2.5-Pro performing strongly), provides a frontend for custom/synthetic data and model support, and flags ambiguity plus ground-truth collection as remaining open problems.
Significance. If the performance claims are substantiated with rigorous metrics, this engineering contribution could meaningfully advance automated clinical data structuring for interoperability, addressing documented weaknesses in generalizability and schema validity of earlier methods. The integration of agentic workflows with external tools is a practical strength, and the provision of a frontend plus support for both local and proprietary models enhances usability. The empirical focus on real-world FHIR synthesis is timely given the standard's adoption in healthcare.
major comments (2)
- Abstract: The central claim that Infherno 'competes well with a human baseline in predicting FHIR resources from unstructured text' is unsupported by any quantitative metrics (F1, exact-match, structural validity rates), error analysis, or protocol details for the human baseline (annotator numbers, adjudication process, or ambiguity handling). This is load-bearing for the primary contribution, especially since the abstract itself identifies ground-truth collection as an open problem.
- Evaluation/Results: No tables, figures, or specific numerical results are referenced to substantiate performance on synthetic versus clinical datasets or to enable comparison against the human baseline, preventing assessment of robustness or clinical accuracy.
minor comments (2)
- The abstract would benefit from briefly specifying example FHIR resources targeted (e.g., Patient, Observation, Condition) to ground the synthesis task.
- Consider adding a diagram of the agent workflow (including tool calls for code execution and terminology lookup) to improve clarity of the end-to-end pipeline.
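To make the minor comments concrete, the following sketch shows what one targeted resource type (Condition) might look like when a terminology lookup is used as a tool. Everything here is an assumption for illustration: `TERMINOLOGY`, `lookup`, and `build_condition` are hypothetical names, and the codes and system URL are placeholders rather than real SNOMED CT or LOINC content. The paper's actual tools and output shape may differ.

```python
import json
from typing import Optional

# Hypothetical in-memory terminology table. Codes and the system URL are
# placeholders, not entries from any real code system.
TERMINOLOGY = {
    "type 2 diabetes": {
        "system": "http://example.org/codes",
        "code": "D-001",
        "display": "Type 2 diabetes",
    },
    "hypertension": {
        "system": "http://example.org/codes",
        "code": "H-001",
        "display": "Hypertension",
    },
}


def lookup(term: str) -> Optional[dict]:
    """Return a coding for a free-text term, or None if it is unknown."""
    return TERMINOLOGY.get(term.strip().lower())


def build_condition(term: str, patient_ref: str) -> dict:
    """Build a minimal Condition-shaped dict, adding a coding when one is found."""
    code: dict = {"text": term}
    coding = lookup(term)
    if coding is not None:
        code["coding"] = [coding]
    return {
        "resourceType": "Condition",
        "subject": {"reference": patient_ref},
        "code": code,
    }


condition = build_condition("Type 2 Diabetes", "Patient/example")
print(json.dumps(condition, indent=2))
```

When the lookup fails, the sketch keeps the free-text term and omits the coding instead of guessing a code, which keeps the ambiguity visible to a reviewer.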
Simulated Author's Rebuttal
We thank the referee for the constructive and detailed review. We address each major comment below, indicating where revisions will be made to better substantiate the claims and improve clarity.
- Referee: Abstract: The central claim that Infherno 'competes well with a human baseline in predicting FHIR resources from unstructured text' is unsupported by any quantitative metrics (F1, exact-match, structural validity rates), error analysis, or protocol details for the human baseline (annotator numbers, adjudication process, or ambiguity handling). This is load-bearing for the primary contribution, especially since the abstract itself identifies ground-truth collection as an open problem.
  Authors: We agree that the abstract claim would benefit from more explicit support to strengthen the manuscript. In the revision, we will update the abstract to include or directly reference key quantitative metrics (such as F1 scores, exact-match rates, and structural validity) from the evaluations on synthetic and clinical datasets. We will also add concise details on the human baseline protocol, including annotator numbers, adjudication process, and ambiguity handling. This addresses the load-bearing nature of the claim while preserving the abstract's brevity. The manuscript already flags ground-truth collection as an open problem precisely because of these ambiguities in clinical notes.
  revision: yes
- Referee: Evaluation/Results: No tables, figures, or specific numerical results are referenced to substantiate performance on synthetic versus clinical datasets or to enable comparison against the human baseline, preventing assessment of robustness or clinical accuracy.
  Authors: We acknowledge that clearer referencing and presentation of results would aid assessment. The manuscript reports evaluations on both synthetic and clinical datasets with comparisons to the human baseline, but we will revise the results section to explicitly cite and describe the relevant tables and figures. This will include specific numerical results (F1, exact-match, structural validity rates) for Infherno variants versus the human baseline, along with an expanded error analysis to demonstrate robustness and clinical accuracy. We will ensure these elements are prominently referenced in the text.
  revision: yes
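The metrics named in this exchange (F1, exact match) can be computed over FHIR-like resources in several ways. The sketch below shows one possible definition, assuming a field-level comparison of flattened (path, value) pairs. It is not the paper's evaluation protocol; `flatten`, `field_f1`, and `exact_match` are hypothetical helper names.

```python
def flatten(obj, prefix=""):
    """Flatten nested dicts and lists into a set of (path, value) pairs."""
    if isinstance(obj, dict):
        pairs = set()
        for key, value in obj.items():
            pairs |= flatten(value, f"{prefix}.{key}" if prefix else key)
        return pairs
    if isinstance(obj, list):
        pairs = set()
        for index, value in enumerate(obj):
            pairs |= flatten(value, f"{prefix}[{index}]")
        return pairs
    return {(prefix, obj)}


def field_f1(predicted: dict, reference: dict) -> float:
    """Harmonic mean of precision and recall over flattened leaf fields."""
    pred, ref = flatten(predicted), flatten(reference)
    true_pos = len(pred & ref)
    if true_pos == 0:
        return 0.0
    precision = true_pos / len(pred)
    recall = true_pos / len(ref)
    return 2 * precision * recall / (precision + recall)


def exact_match(predicted: dict, reference: dict) -> bool:
    """True only when the two resources are identical."""
    return predicted == reference


reference = {
    "resourceType": "Condition",
    "subject": {"reference": "Patient/1"},
    "code": {"text": "hypertension"},
}
predicted = {
    "resourceType": "Condition",
    "subject": {"reference": "Patient/1"},
    "code": {"text": "high blood pressure"},
}

print(exact_match(predicted, reference))
print(round(field_f1(predicted, reference), 3))
```

In this example two of three leaf fields agree, so precision and recall are both 2/3 while exact match fails. The gap between the two scores illustrates why the referee asks for more than one metric: a paraphrased but clinically equivalent value is penalized by both, which ties back to the ambiguity problem the authors flag. Because list elements are compared by index, this particular definition is also sensitive to element order.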
Circularity Check
No circularity: empirical engineering evaluation against external baselines
The paper describes an LLM-agent pipeline for converting clinical notes to FHIR resources and reports empirical performance against human baselines and datasets. No derivations, equations, fitted parameters, or predictions are present that could reduce to self-definition or internal fits. Claims rest on external evaluation rather than any self-referential construction, self-citation load-bearing argument, or renamed known result. The work is self-contained as an applied system description with open problems explicitly noted.