pith. machine review for the scientific record.

arxiv: 2605.14126 · v1 · submitted 2026-05-13 · 💻 cs.LG · cs.AI

Recognition: no theorem link

Reinforcement Learning for Tool-Calling Agents in Fast Healthcare Interoperability Resources (FHIR)


Pith reviewed 2026-05-15 04:53 UTC · model grok-4.3

classification 💻 cs.LG cs.AI
keywords reinforcement learning · tool-calling agents · FHIR · healthcare data · multi-turn reasoning · CodeAct agent · LLM judge

The pith

Reinforcement learning post-training raises a Qwen3-8B model to 77 percent correctness on FHIR clinical queries, surpassing larger closed models.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper frames answering questions over FHIR health records as a sequential decision process on a directed graph of resources. It trains a multi-turn CodeAct agent with reinforcement learning where an LLM judge supplies rewards grounded in actual execution outcomes. This yields 77 percent answer correctness on FHIR-AgentBench using the smaller Qwen3-8B model, up from 50 percent for the o4-mini baseline, while enforcing data-integrity constraints during traversals. The work supplies an end-to-end pipeline covering environment setup, harness construction, training, and evaluation. A sympathetic reader would care because the result shows RL can make cheaper open models competitive for realistic multi-step queries over structured clinical data.
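To make that setup concrete, here is a minimal sketch of one episode of such an agent. Everything named here (`llm.generate`, `fhir_get`, `run_python`, the `FINAL_ANSWER:` convention, the local server URL) is an illustrative assumption standing in for the paper's harness, which is not reproduced in this review.

```python
# Minimal sketch of one CodeAct-style episode over a FHIR server.
# Hypothetical stand-ins: `llm.generate`, the FINAL_ANSWER convention,
# and the local endpoint; the paper's actual harness differs.
import contextlib
import io
import requests

FHIR_BASE = "http://localhost:8080/fhir"  # assumed local FHIR endpoint

def fhir_get(resource_type: str, params: dict) -> dict:
    """Retrieval tool: load clinical resources from the FHIR server."""
    resp = requests.get(f"{FHIR_BASE}/{resource_type}", params=params, timeout=30)
    resp.raise_for_status()
    return resp.json()

def run_python(code: str, env: dict) -> str:
    """Code-execution tool: run agent-written Python, capture stdout."""
    buf = io.StringIO()
    with contextlib.redirect_stdout(buf):
        exec(code, env)  # real harnesses sandbox this; omitted for brevity
    return buf.getvalue()

def run_episode(llm, question: str, max_turns: int = 6):
    """Act-observe-think loop: the agent emits Python each turn until it
    submits a final answer or exhausts its turn budget."""
    history = [{"role": "user", "content": question}]
    env = {"fhir_get": fhir_get}  # tools exposed inside the interpreter
    for _ in range(max_turns):
        action = llm.generate(history)          # model writes code or an answer
        if "FINAL_ANSWER:" in action:           # assumed submission convention
            return action.split("FINAL_ANSWER:", 1)[1].strip()
        observation = run_python(action, env)   # execute, observe output
        history += [{"role": "assistant", "content": action},
                    {"role": "user", "content": observation}]
    return None  # no answer submitted within the budget
```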

Core claim

By casting FHIR reasoning as sequential decision-making over a queryable structured graph and post-training a multi-turn CodeAct agent with reinforcement learning using execution-grounded rewards from an LLM judge, the authors achieve 77 percent answer correctness on FHIR-AgentBench with a Qwen3-8B model, compared to 50 percent for o4-mini, while maintaining data-integrity constraints.

What carries the argument

The RL post-trained multi-turn CodeAct agent equipped with a custom harness and an LLM judge that supplies execution-grounded rewards for traversals across FHIR resource graphs.
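The reward side can be sketched just as briefly: an LLM judge compares the submitted answer to ground truth obtained by executing the reference query, and returns a scalar for the policy update. The prompt wording and the binary 0/1 scale below are assumptions for illustration, not the paper's reward specification.

```python
# Illustrative execution-grounded reward: a judge model grades the agent's
# final answer against ground truth from the executed reference query.
JUDGE_PROMPT = """You are grading a clinical question-answering agent.
Question: {question}
Ground-truth answer (from executing the reference FHIR query): {truth}
Agent's answer: {answer}
Reply with exactly CORRECT or INCORRECT."""

def judge_reward(judge_llm, question: str, truth: str, answer) -> float:
    """Return 1.0 for a judged-correct answer, 0.0 otherwise (incl. no answer)."""
    if answer is None:  # agent never submitted within its turn budget
        return 0.0
    verdict = judge_llm.generate(
        JUDGE_PROMPT.format(question=question, truth=truth, answer=answer))
    return 1.0 if verdict.strip().upper().startswith("CORRECT") else 0.0
```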

If this is right

  • Smaller open-weight models can reach higher accuracy than larger closed models on structured healthcare question-answering tasks.
  • Reinforcement learning enforces data-integrity constraints during agent reasoning over clinical graphs.
  • An end-to-end post-training pipeline reliably improves multi-turn reasoning on queryable structured data.
  • Performance gains arise from reward-driven training rather than prompt engineering alone.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

  • The same reinforcement-learning harness could be applied to other graph-structured interoperability standards outside healthcare.
  • Lower inference costs from using smaller models could make clinical decision-support tools more accessible in resource-limited settings.
  • Varying the judge model or adding human verification steps would test how sensitive the gains are to reward quality.
  • Combining this approach with retrieval or planning modules from other agent frameworks might further reduce error rates on complex aggregation queries.

Load-bearing premise

An LLM judge can supply reliable execution-grounded rewards without bias or systematic errors when scoring multi-step FHIR traversals.

What would settle it

Evaluating the trained agent on a fresh held-out portion of FHIR-AgentBench questions, graded against independently executed ground-truth results; correctness at or below the 50 percent o4-mini baseline would overturn the claim.
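Under the assumptions of the episode sketch above, that settling experiment is a short script: run the trained agent on the held-out split and grade it against independently executed ground truth, with no LLM judge in the scoring path. `run_episode` refers to the loop sketched earlier; `execute_reference_query` is a hypothetical oracle that runs a vetted reference query against the FHIR server.

```python
# Sketch of the settling experiment: judge-free grading on held-out data.
def normalize(x) -> str:
    """Canonicalize answers before exact-match comparison (illustrative)."""
    return " ".join(str(x).lower().split())

def heldout_correctness(llm, heldout: list) -> float:
    correct = 0
    for item in heldout:  # each item: {"question": ..., "reference_query": ...}
        answer = run_episode(llm, item["question"])
        truth = execute_reference_query(item["reference_query"])  # hypothetical oracle
        correct += answer is not None and normalize(answer) == normalize(truth)
    return correct / len(heldout)
```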

Figures

Figures reproduced from arXiv: 2605.14126 by Jan P. Bremer, Marius S. Knorr, Nils Schweingruber, Robert Müller.

Figure 1
Figure 1: Reinforcement learning pipeline for training a clinical FHIR reasoning agent. The agent follows a structured act–observe–think loop, with access to two tools: a retrieval tool for loading clinical resources from a FHIR server, and a Python interpreter for code execution. After one or more reasoning cycles, the agent emits a final solution, which is evaluated by an LLM judge against ground-truth answers from… view at source ↗
Figure 2
Figure 2: Fast Healthcare Interoperability Resources (FHIR). (A) Resource graph of a MIMIC-FHIR patient, connected by references. (B) Clinical concepts can be mapped to FHIR: a patient (1) presents to the emergency department (2; first Encounter) with one or more conditions (3). During the visit, clinical observations (4) were recorded (e.g. heart rate, blood pressure), each linked to the procedure that generated th… view at source ↗
Figure 3
Figure 3: Answer correctness on FHIR-AgentBench (Lee et al., 2025) of vanilla Qwen3 models in different sizes (4B to 32B), API-based models (closed/open-weights), and after RL training (step 3000). Small crosses indicate performance restricted to questions where the agent successfully submitted an answer.… view at source ↗
Figure 4
Figure 4: Training curves for Qwen3-8B. Curves show the answer correctness on the FHIR-AgentBench (Lee et al., 2025) validation subset (n=424) with a 6-turn budget. Recipes: vanilla GRPO (blue), +clip-higher (orange), +DAPO-filter (green). Dashed horizontal lines mark each run's score at step 3000. view at source ↗
Figure 7
Figure 7: Our best RL run (GRPO + clip-higher + DAPO filter), broken down by FHIR resource type over training steps. The model first learns to return "Empty" on questions with no matching resources, which is the cheapest reward to get, before improving on the other resource types. The dotted line shows the overall mean.… view at source ↗
Figure 8
Figure 8: Example trajectory. The agent first retrieves all MedicationRequest resources. Then it filters for "iv", and sorts the resources. It identifies the correct MedicationRequest resource, and then resolves the reference to Medication. Medication holds the actual medication that the question is looking for. In this plot, Python tool outputs were collapsed for brevity.… view at source ↗
Figure 9
Figure 9: Accuracy by FHIR resource type for Qwen3 (zero-shot and trained), and API baselines. view at source ↗
read the original abstract

Fast Healthcare Interoperability Resources (FHIR) is the dominant standard for interoperable exchange of healthcare data. In FHIR, electronic health records form a directed graph of resources. Answering clinically meaningful questions over FHIR requires agents to perform multi-step reasoning, filtering, and aggregation across multiple resource types. Prior work shows that even tool-augmented LLM agents (retrieval, code execution, multi-turn planning) often select the wrong resources or violate traversal constraints. We study this problem in the context of FHIR-AgentBench, a benchmark for realistic question answering over real-world hospital data, and frame reasoning on FHIR as a sequential decision-making problem over a queryable structured graph. We implement a multi-turn CodeAct agent and post-train it with reinforcement learning using a custom harness and tools. A LLM Judge provides execution-grounded rewards. Compared to prompt-based, closed-model baselines, RL post-training improves performance while enforcing data-integrity constraints. Empirically, our approach improves answer correctness from 50% (o4-mini) to 77% on FHIR-AgentBench using a smaller and cheaper Qwen3-8B model. We present an end-to-end post-training pipeline (environment building, harness construction, model training and custom evaluation) that reliably improves multi-turn reasoning over structured clinical graphs.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Referee Report

2 major / 1 minor

Summary. The manuscript presents an end-to-end pipeline for post-training a Qwen3-8B model with reinforcement learning as a multi-turn tool-calling agent for FHIR data querying. The authors frame the problem as sequential decision-making over a structured graph, employ a CodeAct agent and an LLM judge to provide execution-grounded rewards, and report an improvement in answer correctness from 50% (o4-mini baseline) to 77% on the FHIR-AgentBench benchmark.

Significance. If the results are robust, this work would be significant for demonstrating that RL can enhance smaller open models to outperform larger proprietary ones on complex, constraint-heavy reasoning tasks in healthcare interoperability. It provides a practical pipeline for building reliable agents over clinical data graphs, potentially reducing costs and improving accessibility while enforcing data integrity.

major comments (2)
  1. [Reward Design] The central empirical claim (50% to 77% correctness) depends on the LLM Judge supplying reliable rewards for multi-step FHIR traversals. However, the manuscript provides no calibration data, human agreement metrics, or ablation studies on the judge's prompting or consistency, making it impossible to rule out systematic biases as the source of the performance lift rather than the RL training itself.
  2. [Experimental Setup] The abstract reports a clear numerical gain but the full methods lack details on training curves, reward shaping specifics, baseline implementations, statistical significance testing, or error analysis, which are necessary to assess the robustness of the 27-point improvement.
minor comments (1)
  1. [Methods] Clarify the exact definition of the CodeAct agent and how it interfaces with the FHIR environment in the methods section.

Simulated Author's Rebuttal

2 responses · 0 unresolved

We thank the referee for their constructive feedback on our manuscript. We address each major comment point by point below and indicate the revisions planned for the next version.

read point-by-point responses
  1. Referee: [Reward Design] The central empirical claim (50% to 77% correctness) depends on the LLM Judge supplying reliable rewards for multi-step FHIR traversals. However, the manuscript provides no calibration data, human agreement metrics, or ablation studies on the judge's prompting or consistency, making it impossible to rule out systematic biases as the source of the performance lift rather than the RL training itself.

    Authors: We agree that explicit validation of the LLM Judge is required to support the central claim. In the revised manuscript we will add a dedicated subsection reporting calibration results: human-expert agreement on a random sample of 150 multi-turn trajectories (Cohen's kappa and raw agreement), plus ablations on judge prompt phrasing and temperature. These data will be placed in the main text with full details in the appendix. We note that rewards remain execution-grounded (FHIR query success and constraint violations are verified by the environment), which limits the scope for judge-induced bias, but the requested metrics will allow readers to quantify any residual judge variance. revision: yes

  2. Referee: [Experimental Setup] The abstract reports a clear numerical gain but the full methods lack details on training curves, reward shaping specifics, baseline implementations, statistical significance testing, or error analysis, which are necessary to assess the robustness of the 27-point improvement.

    Authors: We accept that the current Methods section is insufficiently detailed for reproducibility and robustness assessment. The revision will expand this section to include: (i) training curves for both reward and correctness metrics across epochs, (ii) the exact reward-shaping formula with component weights, (iii) verbatim prompt templates and tool configurations used for the o4-mini baseline, (iv) statistical significance results (bootstrap 95% CI and McNemar test on paired outcomes), and (v) a categorized error analysis of the remaining 23% failures. These additions will be integrated into the main paper and will directly address concerns about the reported 27-point gain. revision: yes
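For readers wanting to pre-register expectations, the checks promised in both responses are standard and cheap to run. The sketch below computes judge-vs-human agreement (raw agreement and Cohen's kappa, response 1) and paired significance for the 27-point gap (McNemar's exact test plus a percentile-bootstrap CI, response 2). The binary 0/1 labels and data layout are assumptions, not the authors' protocol.

```python
# Agreement and paired-significance checks of the kind the rebuttal promises.
# Assumes 0/1 correctness labels per question/trajectory.
import numpy as np
from sklearn.metrics import cohen_kappa_score
from statsmodels.stats.contingency_tables import mcnemar

def judge_agreement(judge: np.ndarray, human: np.ndarray) -> dict:
    """Judge-vs-human agreement on a labeled sample of trajectories."""
    raw = float((judge == human).mean())
    return {"n": len(judge), "raw_agreement": raw,
            "cohens_kappa": cohen_kappa_score(judge, human)}

def paired_comparison(ours: np.ndarray, base: np.ndarray, n_boot: int = 10_000):
    """McNemar's exact test and bootstrap 95% CI for the accuracy gap, where
    `ours` and `base` are correctness vectors over the same questions."""
    # 2x2 concordance table; McNemar uses the off-diagonal (discordant) cells
    table = [[int(((ours == 1) & (base == 1)).sum()), int(((ours == 1) & (base == 0)).sum())],
             [int(((ours == 0) & (base == 1)).sum()), int(((ours == 0) & (base == 0)).sum())]]
    p_value = mcnemar(table, exact=True).pvalue
    # Percentile bootstrap over questions for the accuracy difference
    rng = np.random.default_rng(0)
    idx = rng.integers(0, len(ours), size=(n_boot, len(ours)))
    diffs = ours[idx].mean(axis=1) - base[idx].mean(axis=1)
    return p_value, tuple(np.percentile(diffs, [2.5, 97.5]))
```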

Circularity Check

0 steps flagged

No significant circularity; empirical result rests on external benchmark

full rationale

The paper frames FHIR reasoning as a sequential decision problem and reports an empirical performance lift (50% to 77% on FHIR-AgentBench) obtained by RL post-training of Qwen3-8B with rewards supplied by an LLM judge. No equations, derivations, or parameter-fitting steps are described that would reduce the reported correctness metric to the inputs by construction. The central claim depends on an external benchmark and the unvalidated assumption that the LLM judge supplies reliable execution-grounded rewards; this is an empirical assumption rather than a self-referential definition or a prediction forced by fitted parameters. No self-citation chains, uniqueness theorems, or ansatzes are invoked in a load-bearing way. The derivation chain is therefore self-contained as a standard RL experiment on a public benchmark.

Axiom & Free-Parameter Ledger

0 free parameters · 1 axiom · 0 invented entities

The approach rests on standard RL assumptions plus the domain-specific premise that an LLM can serve as a trustworthy reward signal for FHIR query correctness.

axioms (1)
  • domain assumption LLM Judge supplies execution-grounded rewards that correlate with true answer correctness
    Invoked to generate training signal for the RL post-training step

pith-pipeline@v0.9.0 · 5541 in / 1147 out tokens · 25587 ms · 2026-05-15T04:53:35.494733+00:00 · methodology

discussion (0)


Reference graph

Works this paper leans on

66 extracted references · 66 canonical work pages · 8 internal anchors

  1. [1] Electronic health record adoption in… Journal of the American Medical Informatics Association, 2017. doi:10.1093/jamia/ocx080
  2. [2] Challenges in… Gut and Liver, 2024. doi:10.5009/gnl230272
  3. [3] Language,… AMA Journal of Ethics, 2017. doi:10.1001/journalofethics.2017.19.3.stas1-1703
  4. [4] Challenges and opportunities beyond structured data in analysis of electronic health records. WIREs Computational Statistics, 2021. doi:10.1002/wics.1549
  5. [5] The application of large language models in medicine:… iScience, 2024. doi:10.1016/j.isci.2024.109713
  6. [6] NEJM AI, 2024. doi:10.1056/aics2300301
  7. [7] Proceedings of the Conference on Empirical Methods in Natural Language Processing, 2024. doi:10.18653/v1/2024.emnlp-main.1245
  8. [8] DeepSeek-AI; Guo, Daya; Yang, Dejian; Zhang, Haowei; Song, Junxiao; Zhang, Ruoyu; et al.…
  9. [9] External validation of a… BMC Medical Research Methodology, 2013. doi:10.1186/1471-2288-13-33
  10. [10] Hosmer, David W.; Lemeshow, Stanley; May, Susanne. Applied…
  11. [11] Survival prediction models: an introduction to discrete-time modeling. BMC Medical Research Methodology, 2022. doi:10.1186/s12874-022-01679-6
  12. [12] Brown, Tom B.; Mann, Benjamin; Ryder, Nick; Subbiah, Melanie; Kaplan, Jared; et al.…
  13. [13] Müller, Robert. Semantic… 2025. doi:10.48550/arXiv.2507.10820
  14. [14] Feng, Jiazhan; Huang, Shijue; Qu, Xingwei; Zhang, Ge; Qin, Yujia; Zhong, Baoquan; Jiang, Chengquan; Chi, Jinxin; Zhong, Wanjun. ReTool: Reinforcement Learning for Strategic Tool Use in LLMs. 2025. doi:10.48550/arXiv.2504.11536
  15. [15] Singh, Joykirat; Magazine, Raghav; Pandya, Yash; Nambi, Akshay. Agentic… 2025. doi:10.48550/arXiv.2505.01441
  16. [16] Zhang, Shaokun; Dong, Yi; Zhang, Jieyu; Kautz, Jan; Catanzaro, Bryan; Tao, Andrew; Wu, Qingyun; Yu, Zhiding; Liu, Guilin. Nemotron-Research-Tool-N1: Exploring Tool-Using Language Models with Reinforced Reasoning. 2025. doi:10.48550/arXiv.2505.00024
  17. [17] Qian, Cheng; Acikgoz, Emre Can; He, Qi; Wang, Hongru; Chen, Xiusi; Hakkani-Tür, Dilek; Tur, Gokhan; Ji, Heng. ToolRL: Reward is All Tool Learning Needs. 2025. doi:10.48550/arXiv.2504.13958
  18. [18] Pharmacy and Therapeutics, 2017.
  19. [19] Liu, Zichen; Chen, Changyu; Li, Wenjun; Qi, Penghui; Pang, Tianyu; Du, Chao; Lee, Wee Sun; Lin, Min. Understanding R1-Zero-Like Training: A Critical Perspective. 2025. doi:10.48550/arXiv.2503.20783
  20. [20] BMC Medical Research Methodology, 2018. doi:10.1186/s12874-018-0482-1
  21. [21] Evaluating the yield of medical tests. JAMA, 1982.
  22. [22] Yu, Qiying; Zhang, Zheng; Zhu, Ruofei; Yuan, Yufeng; Zuo, Xiaochen; Yue, Yu; et al.… DAPO: An Open-Source LLM Reinforcement Learning System at Scale.
  23. [23] Hilton, Brad; Corbitt, Kyle; Corbitt, David; Gandhi, Saumya; William, Angky; Kovalenskyi, Bohdan; Jones, Andie. GitHub repository, 2025.
  24. [24] RULER: Relative Universal LLM-Elicited Rewards. 2025.
  25. [25] Qwen2.5 Technical Report. 2025.
  26. [26] ReTool: Reinforcement Learning for Strategic Tool Use in LLMs. 2025.
  27. [27] Nemotron-Research-Tool-N1: Exploring Tool-Using Language Models with Reinforced Reasoning. 2025.
  28. [28] Agentic Reasoning and Tool Integration for LLMs via Reinforcement Learning. 2025.
  29. [29] APIGen-MT: Agentic Pipeline for Multi-Turn Data Generation via Simulated Agent-Human Interplay. 2025.
  30. [30] ToolRL: Reward is All Tool Learning Needs. 2025.
  31. [31] Semantic Context for Tool Orchestration. 2025.
  32. [32] DeepSeekMath: Pushing the Limits of Mathematical Reasoning in Open Language Models. 2024.
  33. [33] MedAlpaca -- An Open-Source Collection of Medical Conversational AI Models and Training Data. 2025.
  34. [34] SkyRL-SQL: Matching GPT-4o and o4-mini on Text2SQL with Multi-Turn RL. 2025.
  35. [35] Evolving SkyRL into a Highly-Modular RL Framework. 2025.
  36. [36] Executable Code Actions Elicit Better LLM Agents. 2024.
  37. [37] ReAct: Synergizing Reasoning and Acting in Language Models. International Conference on Learning Representations (ICLR).
  38. [38] PAL: Program-aided Language Models. arXiv:2211.10435.
  39. [39] Program of Thoughts Prompting: Disentangling Computation from Reasoning for Numerical Reasoning Tasks. arXiv:2211.12588.
  40. [40] The Art of Scaling Reinforcement Learning Compute for LLMs. 2025.
  41. [41] TRL GRPO Trainer Documentation.
  42. [42] VL Norm: Rethink Loss Aggregation in RLVR. 2025.
  43. [43] Efficient Reasoning via Powered Length Penalty. arXiv:2506.10446.
  44. [44] DLER: Doing Length pEnalty Right - Incentivizing More Intelligence per Token via Reinforcement Learning. 2025.
  45. [45] … 2019.
  46. [46] The Heat is On: US Caught FHIR in 2019.
  47. [47] Lee et al. FHIR-AgentBench: Benchmarking LLM Agents for Realistic Interoperable EHR Question Answering. 2025.
  48. [48] Large Language Models for Automating Clinical Data Standardization: HL7 FHIR Use Case. 2025.
  49. [49] Using a… Journal of Medical Internet Research, 2025. doi:10.2196/73540
  50. [50] LLM on FHIR -- Demystifying Health Records. 2024.
  51. [51] Question Answering on Patient Medical Records with Private Fine-Tuned LLMs. 2025.
  52. [52] MedAgentBench: A Realistic Virtual EHR Environment to Benchmark Medical LLM Agents. 2025.
  53. [53] Schopplich, Johann. 2025.
  54. [54] Blaze: A FHIR… 2025.
  55. [55] Patil, Shishir G.; Mao, Huanzhi; Yan, Fanjia; Ji, Charlie; Suresh, Vishnu; Stoica, Ion; Gonzalez, Joseph E. 2025.
  56. [56] StableToolBench: Towards Stable Large-Scale Benchmarking on Tool Learning of Large Language Models. 2024.
  57. [57] Butterfly Effects in Toolchains: Categorizing and Analyzing Failures in Parameter Filling for LLMs. Proceedings of the 2025 Conference on Empirical Methods in Natural Language Processing (EMNLP).
  58. [58] Lost in Execution: On the Multilingual Robustness of Tool Calling in Large Language Models. 2026.
  59. [59] Tool-Augmented Policy Optimization: Synergizing Reasoning and Adaptive Tool Use with Reinforcement Learning. 2025.
  60. [60] Robust Tool Use via Fission-GRPO: Learning to Recover from Execution Errors. 2026.
  61. [61] ToolOrchestra: Elevating Intelligence via Efficient Model and Tool Orchestration. 2025.
  62. [62] ToolGate: Contract-Grounded and Verified Tool Execution for LLMs. 2026.
  63. [63] From Failure to Mastery: Generating Hard Samples for Tool-use Agents. 2026.
  64. [64] Trajectory2Task: Training Robust Tool-Calling Agents with Synthesized Yet Verifiable Data for Complex User Intents. 2026.
  65. [65] Emerging from Ground: Addressing Intent Deviation in Tool-Using Agents via Deriving Real Calls into Virtual Trajectories. 2026.
  66. [66] Reducing… AMIA Annual Symposium Proceedings, 2023.