arxiv: 2605.16679 · v2 · pith:RLPSQLQAnew · submitted 2026-05-15 · 💻 cs.CL · cs.AI

CHI-Bench: Can AI Agents Automate End-to-End, Long-Horizon, Policy-Rich Healthcare Workflows?

Pith reviewed 2026-05-20 17:36 UTC · model grok-4.3

classification 💻 cs.CL cs.AI
keywords AI agents · healthcare workflows · benchmark · long-horizon tasks · policy density · multi-role composition · prior authorization · utilization management

The pith

AI agents resolve only 28 percent of long-horizon healthcare workflows even in the best case.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper introduces CHI-Bench to test AI agents on realistic end-to-end healthcare operations that demand following dense policies, switching between multiple roles with handoffs, and managing multi-turn interactions such as peer reviews and patient outreach. Tasks place the agent inside a simulator of 20 healthcare apps reached through 87 tools, where it must drive each clinical case to completion while consulting a handbook of over 1,290 managed-care documents. Evaluation of 30 agent harness and model combinations shows the strongest result reaches only 28.0 percent task resolution, with no setup clearing 20 percent on a strict pass^3 criterion and performance falling to 3.8 percent when all tasks run inside one continuous session. These outcomes indicate that current agents face substantial barriers in policy-dense, role-composed enterprise workflows.

Core claim

CHI-Bench evaluates agents on long-horizon tasks across provider prior authorization, payer utilization management, and care management. Each task supplies a clinical case that the agent must advance to terminal status inside a high-fidelity simulator of 20 healthcare applications exposed via 87 MCP tools, while producing required role-specific artifacts and obeying rules drawn from a 1,290-plus document managed-care operations handbook. Across thirty agent configurations the highest observed resolution rate is 28.0 percent; no configuration exceeds 20 percent under a strict pass^3 metric, and collapsing all tasks into a single session reduces success to 3.8 percent.
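
To make "advance to terminal status" concrete, here is a minimal sketch of a case state machine with explicit legal transitions, the kind of mechanism the paper's simulator description mentions (29 statuses with explicit legal transitions). The five statuses, the transition table, and the `Case` class are hypothetical illustrations, not taken from the paper.

```python
from enum import Enum


class Status(Enum):
    # Hypothetical statuses for illustration; the paper reports 29.
    INTAKE = "intake"
    NURSE_REVIEW = "nurse_review"
    P2P_REVIEW = "p2p_review"
    APPROVED = "approved"
    DENIED = "denied"


# Explicit legal transitions (assumed); any move not listed is rejected.
LEGAL = {
    Status.INTAKE: {Status.NURSE_REVIEW},
    Status.NURSE_REVIEW: {Status.APPROVED, Status.P2P_REVIEW},
    Status.P2P_REVIEW: {Status.APPROVED, Status.DENIED},
    Status.APPROVED: set(),
    Status.DENIED: set(),
}
TERMINAL = {Status.APPROVED, Status.DENIED}


class Case:
    """A clinical case that can only move along legal transitions."""

    def __init__(self):
        self.status = Status.INTAKE
        self.history = [Status.INTAKE]

    def advance(self, new_status):
        if new_status not in LEGAL[self.status]:
            raise ValueError(
                f"illegal transition {self.status.name} -> {new_status.name}"
            )
        self.status = new_status
        self.history.append(new_status)

    @property
    def is_terminal(self):
        return self.status in TERMINAL


case = Case()
case.advance(Status.NURSE_REVIEW)
case.advance(Status.APPROVED)
```

In a design like this, a tool call that tries to skip a step (for example, moving from intake straight to a determination) fails loudly instead of silently corrupting the case, which is one way irreversible workflow errors can become measurable.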

What carries the argument

The CHI-Bench task suite: a simulator of 20 healthcare apps reached through 87 MCP tools, paired with a 1,290-plus document managed-care operations handbook. Together they require agents to execute policy-grounded, multi-role, multilateral workflows to completion.

If this is right

  • Agents must sustain coherent action sequences across dozens of tool calls and document productions while switching roles.
  • Single-session execution exposes sharp degradation compared with reset or multi-session operation.
  • Comparable success-rate ceilings are expected in any enterprise domain that combines dense policy libraries with irreversible role-composed decisions.
  • Existing agent benchmarks likely understate the difficulty of full workflow automation in regulated operational settings.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

  • Explicit mechanisms for retrieving and applying specific policy clauses during planning could raise completion rates on these tasks.
  • Testing the same workflow patterns in legal compliance or financial operations would likely surface parallel gaps in current agent capabilities.
  • Production systems would need human oversight loops or verification stages to compensate for the low autonomous success rates shown here.

Load-bearing premise

The simulator of 20 apps and the managed-care handbook together capture the policy density, multi-role handoffs, and multilateral interactions of actual healthcare operations.

What would settle it

An agent configuration that achieves greater than 50 percent task resolution on the identical CHI-Bench tasks under the reported evaluation protocol would directly challenge the measured performance ceiling.

Figures

Figures reproduced from arXiv: 2605.16679 by Biwei Huang, Caiming Xiong, Carl Yang, Chenyu You, Deon Metelski, Eric P. Xing, Fan Feng, Fangli Geng, Frank Wang, Hang Jiang, Hank Capps MD, Haolin Chen, Hua Wei, Joonyul Lee, Kevin Riley, Kun Zhang, Leon Qi, Lingjing Kong, Philip S. Yu, Qingsong Wen, Sanmi Koyejo, Steve Brown, Tao Xia, Tianyi Zeng, T. Y. Alvin Liu, Weiran Yao, Xiangchen Song, Yanjie Fu, Yuan Yuan, Yue Zhao, Zeyu Tang, Zhiwei Liu, Zixian Ma.

Figure 1: χ-Bench: Clinical Healthcare In-Situ Environment and Evaluation Benchmark.
Figure 2: Illustration of the three challenges: policy retrieval, multi-role composition (intake clerk …
Figure 3: pass@1 across the three χ-Bench environments of frontier proprietary LLMs with their first-party agent harness. Error bars are task-level percentile bootstrap 95% confidence intervals. We evaluated 30 agent harness/model configurations spanning major frontier models and strong agent stacks.
Figure 4: Comparing strengths and weaknesses of Codex GPT-5.5 and Claude Code Opus 4.6 across …
Figure 5: χ-World Engine: Simulated Worlds for Clinical Healthcare In-Situ Workflows. 3.1.1 Realistic Healthcare Software Environments: We implement the apps across three domains: provider PA, payer UM, and care management. Built in ∼115K lines of Python, the simulator captures features absent from general-purpose benchmarks: case state machines with 29 statuses and explicit legal transitions; reviewer-independence …
Figure 6: Healthcare apps across three task domains.
Figure 7: Managed-Care Operations Handbook Skill is organized as a progressive-disclosure manual. The top-level SKILL.md acts as a table of contents that routes the agent to one of three role sub-skills (provider-pa, payer-um, care-manager); the two shared medical-library (clinical lookup) and platform (role-specific tutorials) are reachable from any sub-skill via the dashed access bus.
Figure 8: Example of a Payer UM task for hereditary breast-cancer genomic sequencing.
Figure 9: Task breakdown. Inner: Domain; Middle: PA/UM terminal state, CM patient persona; Outer: clinical/service category. Step 3 – Multi-reviewer review. Each trajectory is reviewed by at least 1 practicing healthcare worker and 5 authors for clinical precision, and must clear a residual-PHI scan and a clinical-realism check before admission. The detailed human validation protocols are described in Section D.1. …
Figure 10: Verification pipeline. Each trial emits a persisted record to deterministic contract verifier and rubric-based LLM judge under strict-majority vote. A trial passes only when both layers pass. 4 Experiments, 4.1 Experiment Setup: We evaluate 30 agent harness/model configurations across two stacks: a proprietary stack pairing each frontier lab’s first-party CLI (Claude Code [3], OpenAI Codex [37], Gemini CLI…
Figure 11: (a) Each marker is one row of Table 2 …
Figure 12: Pass@1 under trimmed skills. We trimmed the 1,279-document Managed-Care Operations Handbook Skill three ways (−Domain drops the domain handbook, −Medical drops the medical library, −Both drops both), ran all tasks with Codex + GPT-5.5, and found that the handbook’s effect is domain-dependent ( …
Figure 14: Second-level failure modes. % is over failed trials; colors show first-level categories.
Figure 15: Provider Prior Authorization workflow. Eight phases from referral to terminal determina…
Figure 16: Payer Utilization Management workflow. Six stages from intake normalization to outbound …
Figure 17: Care Management workflow. Five phases from intake to finalized care plan, with a …
Figure 18: Provider PA example task (Hard / Returned for docs). Top: patient chart snapshot. Bottom: …
Figure 19: Payer UM example task (Moderate / Approved, picked up at Nurse review). Top: patient …
Figure 20: Care Management example task (PTSD / Hard, Refusing). Top: chart snapshot. Bottom: …
Figure 21: Per-rubric Cohen’s κ distribution across the 199 binary rubrics (V=3 votes each). Dashed line marks the substantial-agreement threshold κ = 0.6. … Policy-Compliance, Tool-Use-Error, Abstain-or-Stuck, and Hallucination). The three judgment / completion / policy axes combined (71.9%) dominate the agent-attributable bucket, indicating that observed failures reflect agent execution and policy interpretation rat…
Figure 22: Per-trial wall-clock (left column) and total token usage (right column) violin distributions, …
Figure 23: Per-task pass@1 heatmap on PA. Cells = pass@1 in [0, 1]; right-edge annotation = per-row mean pass@1.
Figure 24: Per-task pass@1 heatmap on UM. Same format as …
Figure 25: Per-task pass@1 heatmap on CM. Same format as …
Figure 26: Per-row outcome breakdown into pass / agent-failure / infrastructure-failure / wall-clock …
Figure 27: Per-row 100%-stacked failure-mode distribution split into PA, UM, and CM panels. Within …
Figure 28: Detailed second-level mode breakdown: one stacked bar per (harness, model) row. Within …
Figure 29: Two-panel policy-read summary. Left: per-row mean recall partitioned by trial-level outcome (Overall / Pass / Fail). Right: per-row recall on the y-axis against pass@1 on the x-axis (one marker per (harness, model) cell), with a strong positive rank correlation (r = +0.77, n = 30). Recall is the fraction of GT-cited handbook policies the agent’s trajectory accesses via Read / Grep / Bash tool calls; the w…
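
The recall measure described in the Figure 29 caption (the fraction of ground-truth-cited handbook policies that a trajectory accesses) can be sketched as follows. The function names, the file paths, and the trial record layout are assumptions made for illustration; the paper's own tooling is not shown on this page.

```python
def policy_recall(gt_policies, accessed_paths):
    """Fraction of ground-truth policies that the trajectory touched."""
    gt = set(gt_policies)
    if not gt:
        return None  # recall is undefined when no policy is cited
    return len(gt & set(accessed_paths)) / len(gt)


def mean_recall_by_outcome(trials):
    """Average recall separately over passing and failing trials."""
    buckets = {"pass": [], "fail": []}
    for t in trials:
        r = policy_recall(t["gt"], t["accessed"])
        if r is not None:
            buckets["pass" if t["passed"] else "fail"].append(r)
    return {k: (sum(v) / len(v) if v else None) for k, v in buckets.items()}


# Hypothetical trial records; paths are made up for the example.
trials = [
    {
        "gt": ["pa/criteria.md", "lib/brca.md"],
        "accessed": ["SKILL.md", "pa/criteria.md", "lib/brca.md"],
        "passed": True,
    },
    {
        "gt": ["um/appeals.md", "um/timelines.md"],
        "accessed": ["um/appeals.md"],
        "passed": False,
    },
]
summary = mean_recall_by_outcome(trials)
```

Note that reading extra files does not lower recall in this sketch; only missing a cited policy does.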

Original abstract

End-to-end automation of realistic healthcare operations stresses three capabilities underrepresented in current benchmarks: policy density (decisions must be grounded in a large library of medical, insurance, and operational rules); multi-role composition (a single task requires the agent to play multiple roles with handoffs); and multilateral interaction (intermediate workflow steps are multi-turn dialogs, such as peer-to-peer review and patient outreach). We introduce $\chi$-Bench, a benchmark of long-horizon healthcare workflows across three domains: provider prior authorization, payer utilization management, and care management. Each task hands the agent a clinical case in a high-fidelity simulator of 20 healthcare apps exposed via 87 MCP tools, which it must drive to a terminal status through tool calls and by writing the role's artifacts, guided by a 1,290+ document managed-care operations handbook skill. Across 30 agent harness/model configurations, the best agent resolves only 28.0% of tasks, no agent clears 20% on strict pass^3, and executing all tasks in a single session drops performance to 3.8%. These results raise the hypothesis that similar gaps are likely to surface in other policy-dense, role-composed, irreversible enterprise domains.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Referee Report

2 major / 2 minor

Summary. The paper introduces CHI-Bench, a benchmark for long-horizon healthcare workflows across prior authorization, utilization management, and care management. Tasks require agents to navigate a high-fidelity simulator of 20 apps via 87 MCP tools, produce role-specific artifacts, and follow a 1,290+ document managed-care handbook. Across 30 agent harness/model configurations, the best agent achieves 28.0% task resolution, no configuration exceeds 20% on strict pass^3, and single-session execution drops to 3.8%. The authors hypothesize that comparable limitations will appear in other policy-dense, role-composed enterprise domains.

Significance. If the simulator and handbook faithfully reproduce policy density, multi-role handoffs, and irreversible multilateral steps, the empirical results provide a concrete, falsifiable demonstration of current agent limitations on realistic enterprise workflows. The benchmark construction itself (multi-app tool interface plus large policy corpus) is a positive contribution that could serve as a template for other domains.

major comments (2)
  1. [Benchmark construction and task definition sections] The central empirical claims (28% best-case resolution, <20% pass^3, 3.8% single-session) rest on the unverified assumption that the 20-app simulator and 1,290+ document handbook reproduce the policy branching, state dependencies, and multilateral interactions of real managed-care operations. No quantitative coverage metrics, expert validation study, or comparison against de-identified production logs are reported to support this fidelity claim.
  2. [Evaluation protocol] The pass criteria and error taxonomy are not sufficiently specified to allow independent replication or assessment of whether the measured gap is driven by policy complexity versus simulator simplifications (e.g., deterministic tool outcomes or limited state).
minor comments (2)
  1. [Evaluation metrics] Clarify the exact definition of 'pass^3' and how terminal status is determined across the three domains.
  2. [Simulator description] Add a table or appendix listing the 87 MCP tools with brief descriptions of their state effects and policy dependencies.
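
This page does not define pass^3. One common reading, assumed here, is the pass^k convention used in the τ-bench line of work: the chance that k independent trials of the same task all succeed, estimated per task as C(c, k) / C(n, k) for c successes in n trials and then averaged across tasks. Whether CHI-Bench uses exactly this estimator is an assumption, and the function names and task IDs below are hypothetical.

```python
from math import comb


def pass_hat_k(num_trials, num_success, k):
    """Estimate the probability that k i.i.d. trials of one task all pass."""
    if k > num_trials:
        raise ValueError("k cannot exceed the number of trials")
    # math.comb returns 0 when k > num_success, so the estimate is 0 then.
    return comb(num_success, k) / comb(num_trials, k)


def benchmark_pass_hat_k(results, k):
    """Average the per-task estimate; results maps task id -> list of bools."""
    per_task = [pass_hat_k(len(r), sum(r), k) for r in results.values()]
    return sum(per_task) / len(per_task)


# Hypothetical outcomes for three tasks, three trials each.
results = {
    "task_a": [True, True, True],
    "task_b": [True, False, True],
    "task_c": [False, False, False],
}
pass_at_1 = benchmark_pass_hat_k(results, k=1)  # mean single-trial success
pass_hat_3 = benchmark_pass_hat_k(results, k=3)  # all three trials must pass
```

With three trials per task and k = 3, the estimate reduces to the share of tasks that passed every trial, which is why it is stricter than single-trial resolution.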

Simulated Author's Rebuttal

2 responses · 0 unresolved

We thank the referee for their constructive comments on our paper. We have carefully considered the feedback regarding benchmark fidelity and evaluation protocol. Our responses to the major comments are provided below, along with planned revisions to the manuscript.

Point-by-point responses
  1. Referee: [Benchmark construction and task definition sections] The central empirical claims (28% best-case resolution, <20% pass^3, 3.8% single-session) rest on the unverified assumption that the 20-app simulator and 1,290+ document handbook reproduce the policy branching, state dependencies, and multilateral interactions of real managed-care operations. No quantitative coverage metrics, expert validation study, or comparison against de-identified production logs are reported to support this fidelity claim.

    Authors: We recognize the importance of validating the simulator's fidelity to real managed-care operations. Unfortunately, privacy regulations prevent us from accessing or comparing against de-identified production logs, and a full expert validation study was beyond the scope of this initial benchmark release. The benchmark was constructed by compiling a comprehensive set of publicly available policy documents from managed care handbooks and designing tasks that reflect typical workflow complexities in prior authorization, utilization management, and care management. We will revise the manuscript to provide a more detailed description of the benchmark construction process, including examples of how specific policy rules and multi-role handoffs are instantiated in the tasks. We will also add a dedicated limitations subsection that explicitly discusses the challenges in achieving and verifying full fidelity to production environments. revision: partial

  2. Referee: [Evaluation protocol] The pass criteria and error taxonomy are not sufficiently specified to allow independent replication or assessment of whether the measured gap is driven by policy complexity versus simulator simplifications (e.g., deterministic tool outcomes or limited state).

    Authors: We agree that more detailed specification of the evaluation protocol is necessary for replication. In the revised manuscript, we will provide a complete description of the pass criteria, including the exact conditions for task resolution and the strict pass^3 metric. We will also expand the error taxonomy to categorize failures into policy adherence errors, tool invocation mistakes, state tracking issues, and interaction failures, with illustrative examples from our experiments. This will help clarify whether the performance gaps stem from policy complexity or other factors. revision: yes
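
For orientation, the two-layer pass rule summarized in the Figure 10 caption (a deterministic contract verifier plus a rubric-based LLM judge under strict-majority vote, with a trial passing only when both layers pass) can be sketched as below. How rubric-level votes aggregate into the judge layer is not stated on this page; requiring every rubric to win a strict majority is an assumption, as are the check and rubric names.

```python
def strict_majority(votes):
    """True only if strictly more than half of the votes are passes."""
    return 2 * sum(votes) > len(votes)


def trial_passes(contract_checks, rubric_votes):
    """Combine the deterministic layer and the judged layer with a logical AND."""
    # Layer 1: every deterministic contract check must hold.
    contract_ok = all(contract_checks.values())
    # Layer 2 (assumed aggregation): every rubric needs a strict majority.
    judge_ok = all(strict_majority(v) for v in rubric_votes.values())
    return contract_ok and judge_ok


# Hypothetical record for one trial.
contract_checks = {"terminal_status_reached": True, "letter_artifact_saved": True}
rubric_votes = {
    "cites_correct_policy": [True, True, False],
    "rationale_matches_chart": [True, False, False],
}
verdict = trial_passes(contract_checks, rubric_votes)
```

Here the second rubric receives one pass vote out of three, so the judge layer fails and the trial fails even though both contract checks hold.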

Circularity Check

0 steps flagged

No significant circularity in empirical benchmark results

full rationale

The paper constructs CHI-Bench as a new high-fidelity simulator with 20 apps, 87 MCP tools, and a 1,290+ document handbook, then directly measures agent success rates (28.0% best resolution, <20% pass^3, 3.8% single-session) across 30 configurations. These metrics are obtained by running agents on the benchmark tasks and are not reduced to fitted parameters, self-definitions, or self-citation chains. No equations, uniqueness theorems, or ansatzes are invoked that would make the reported performance equivalent to the inputs by construction. The evaluation is self-contained as an empirical measurement on an independently specified testbed.
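
As a reference point for how such measured rates carry uncertainty, the task-level percentile bootstrap mentioned in the Figure 3 caption can be sketched as follows. This is a generic percentile bootstrap, not the authors' code; the resample count, the seed, and the example scores are assumptions.

```python
import random


def bootstrap_ci(task_scores, n_resamples=2000, alpha=0.05, seed=0):
    """Percentile bootstrap interval for the mean of per-task scores."""
    rng = random.Random(seed)
    n = len(task_scores)
    # Resample tasks with replacement and record each resampled mean.
    means = sorted(
        sum(rng.choices(task_scores, k=n)) / n for _ in range(n_resamples)
    )
    lo_idx = int((alpha / 2) * n_resamples)
    hi_idx = min(n_resamples - 1, int((1 - alpha / 2) * n_resamples))
    return means[lo_idx], means[hi_idx]


# Hypothetical per-task pass@1 values (1.0 = resolved, 0.0 = not resolved).
scores = [1.0, 0.0, 0.0, 1.0, 0.0, 0.0, 0.0, 1.0, 0.0, 0.0]
low, high = bootstrap_ci(scores)
```

Resampling at the task level treats tasks, rather than individual trials, as the unit of variation, so the interval reflects how much the headline rate depends on which tasks happen to be in the suite.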

Axiom & Free-Parameter Ledger

0 free parameters · 1 axiom · 0 invented entities

The central claim rests on the domain assumption that the constructed simulator and handbook faithfully represent real-world healthcare operations; no free parameters are fitted to produce the headline percentages and no new physical entities are postulated.

axioms (1)
  • domain assumption: The 1,290+ document managed-care operations handbook and the 20-app simulator with 87 MCP tools accurately reflect the policy density and interaction patterns of real healthcare workflows.
    This assumption underpins the claim that low agent performance indicates broader limitations in policy-rich enterprise domains.

Reference graph

Works this paper leans on

94 extracted references · 94 canonical work pages · 16 internal anchors

  1. [1]

    2024 AMA prior authorization physician survey

    American Medical Association. 2024 AMA prior authorization physician survey. Presented at the Annual Meeting of the American Medical Association, Chicago, IL, 2024. URL https: //www.ama-assn.org/system/files/prior-authorization-survey.pdf

  2. [2]

    Introducing the Model Context Protocol

    Anthropic. Introducing the Model Context Protocol. https://www.anthropic.com/news/ model-context-protocol, 2024. Accessed: 2026-04-30. 11

  3. [3]

    Claude Code

    Anthropic. Claude Code. https://github.com/anthropics/claude-code, 2025. Ac- cessed: 2026-04-30

  4. [4]

    Claude Opus 4.7 system card

    Anthropic. Claude Opus 4.7 system card. https://www.anthropic.com/system-cards,

  5. [5]

    Covers Claude Opus 4.7, Sonnet 4.6, and Haiku 4.5

    Accessed: 2026-04-30. Covers Claude Opus 4.7, Sonnet 4.6, and Haiku 4.5

  6. [6]

    R. K. Arora, J. Wei, R. S. Hicks, P. Bowman, J. Quiñonero-Candela, F. Tsimpourlas, M. Sharman, M. Shah, A. Vallone, A. Beutel, J. Heidecke, and K. Singhal. Healthbench: Evaluating large language models towards improved human health, 2025. URL https://arxiv.org/abs/ 2505.08775

  7. [7]

    $\tau^2$-Bench: Evaluating Conversational Agents in a Dual-Control Environment

    V . Barres, H. Dong, S. Ray, X. Si, and K. Narasimhan.τ 2-bench: Evaluating conversational agents in a dual-control environment, 2025. URLhttps://arxiv.org/abs/2506.07982

  8. [8]

    S. Bedi, H. Cui, M. Fuentes, A. Unell, M. Wornow, J. M. Banda, N. Kotecha, T. Keyes, Y . Mai, M. Oez, et al. Medhelm: Holistic evaluation of large language models for medical tasks.arXiv preprint arXiv:2505.23802, 2025

  9. [9]

    S. Bedi, R. Welch, E. Steinberg, M. Wornow, T. M. Kim, H. Ahmed, P. Sterling, B. Purohit, Q. Akram, A. Acosta, et al. Healthadminbench: Evaluating computer-use agents on healthcare administration tasks.arXiv preprint arXiv:2604.09937, 2026

  10. [10]

    M. Chen, J. Tworek, H. Jun, Q. Yuan, H. P. d. O. Pinto, J. Kaplan, H. Edwards, Y . Burda, N. Joseph, G. Brockman, et al. Evaluating large language models trained on code.arXiv preprint arXiv:2107.03374, 2021

  11. [11]

    Cuellar, A

    A. Cuellar, A. H. Krist, L. M. Nichols, and A. J. Kuzel. Facilitators and barriers to care coordination in patient-centered medical homes (PCMHs) from coordinators’ perspectives. Journal of the American Board of Family Medicine, 31(1):90–101, 2018. doi: 10.3122/jabfm. 2018.01.170133. PMC4809054

  12. [12]

    Cutler, E

    D. Cutler, E. Wikler, and P. Basch. Reducing administrative costs and improving the health care system.New England Journal of Medicine, 367(20):1875–1878, 2012. doi: 10.1056/ NEJMp1209711

  13. [13]

    DeepSeek-V4 Pro model card

    DeepSeek-AI. DeepSeek-V4 Pro model card. https://huggingface.co/deepseek-ai/ DeepSeek-V4-Pro, 2026. Accessed: 2026-04-30

  14. [14]

    WorkArena: How Capable Are Web Agents at Solving Common Knowledge Work Tasks?

    A. Drouin, M. Gasse, M. Caccia, I. H. Laradji, M. D. Verme, T. Marty, L. Boisvert, M. Thakkar, Q. Cappart, D. Vazquez, N. Chapados, and A. Lacoste. WorkArena: How capable are web agents at solving common knowledge work tasks? InProceedings of the 41st International Conference on Machine Learning (ICML), 2024. URLhttps://arxiv.org/abs/2403.07718

  15. [15]

    GLM-5: from Vibe Coding to Agentic Engineering

    GLM-5 Team. GLM-5: From vibe coding to agentic engineering.arXiv preprint arXiv:2602.15763, 2026

  16. [16]

    Gemini CLI

    Google. Gemini CLI. https://github.com/google-gemini/gemini-cli, 2025. Ac- cessed: 2026-04-30

  17. [17]

    Gemini 3.1 Pro model card

    Google DeepMind. Gemini 3.1 Pro model card. https://deepmind.google/models/ model-cards/gemini-3-1-pro/ , 2026. Accessed: 2026-04-30. Covers Gemini 3.1 Pro and Gemini 3 Flash

  18. [18]

    Harbor: A framework for agent evaluations and RL environments

    Harbor Framework. Harbor: A framework for agent evaluations and RL environments. https: //github.com/harbor-framework/harbor, 2026. Accessed: 2026-04-30

  19. [19]

    Jiang, K

    Y . Jiang, K. C. Black, G. Geng, D. Park, J. Zou, A. Y . Ng, and J. H. Chen. Medagentbench: a virtual ehr environment to benchmark medical llm agents.Nejm Ai, 2(9):AIdbp2500144, 2025

  20. [20]

    C. E. Jimenez, J. Yang, A. Wettig, S. Yao, K. Pei, O. Press, and K. Narasimhan. Swe-bench: Can language models resolve real-world github issues?, 2024. URL https://arxiv.org/ abs/2310.06770. 12

  21. [21]

    D. Jin, E. Pan, N. Oufattole, W.-H. Weng, H. Fang, and P. Szolovits. What disease does this patient have? a large-scale open domain question answering dataset from medical exams. Applied Sciences, 11(14):6421, 2021. URLhttps://arxiv.org/abs/2009.13081

  22. [22]

    Q. Jin, B. Dhingra, Z. Liu, W. W. Cohen, and X. Lu. PubMedQA: A dataset for biomedical research question answering. InProceedings of the 2019 Conference on Empirical Methods in Natural Language Processing and the 9th International Joint Conference on Natural Language Processing (EMNLP-IJCNLP), pages 2567–2577. Association for Computational Linguistics,

  23. [23]

    URLhttps://arxiv.org/abs/1909.06146

  24. [24]

    Jones and C

    A. Jones and C. Kelly. Code execution with mcp: Building more efficient agents, 2025

  25. [25]

    H.-H. Ju. Improving care coordination of patients with chronic diseases.The Journal for Nurse Practitioners, 18(9):926–929, 2022. doi: 10.1016/j.nurpra.2022.07.005

  26. [26]

    L. P. Kaelbling, M. L. Littman, and A. R. Cassandra. Planning and acting in partially observable stochastic domains.Artificial intelligence, 101(1-2):99–134, 1998

  27. [27]

    Karam, M.-C

    M. Karam, M.-C. Chouinard, M. Kevork, R. Fleming, and A. Duhoux. Nurses’ and patients’ perspectives on care coordination across health care and social services sectors: A qualitative study.SAGE Open Nursing, 2026. doi: 10.1177/08445621251395347

  28. [28]

    Khandekar, Q

    N. Khandekar, Q. Jin, G. Xiong, S. Dunn, S. S. Applebaum, Z. Anwar, M. Sarfo-Gyamfi, C. W. Safranek, A. A. Anwar, A. Zhang, A. Gilson, M. B. Singer, A. Dave, A. Taylor, A. Zhang, Q. Chen, and Z. Lu. MedCalc-Bench: Evaluating large language models for medical calculations. InAdvances in Neural Information Processing Systems 37: Datasets and Benchmarks Track,

  29. [29]

    URLhttps://arxiv.org/abs/2406.12036

  30. [30]

    Kimi K2: Open Agentic Intelligence

    Kimi Team. Kimi K2: Open agentic intelligence.arXiv preprint arXiv:2507.20534, 2025

  31. [31]

    DeepAgents

    LangChain. DeepAgents. https://github.com/langchain-ai/deepagents, 2025. Ac- cessed: 2026-04-30

  32. [32]

    G. Lee, H. Hwang, S. Bae, Y . Kwon, W. Shin, S. Yang, M. Seo, J.-Y . Kim, and E. Choi. EHRSQL: A practical text-to-SQL benchmark for electronic health records. InAdvances in Neural Information Processing Systems 35: Datasets and Benchmarks Track, 2022. URL https://arxiv.org/abs/2301.07695

  33. [33]

    J. Li, W. Zhao, J. Zhao, W. Zeng, H. Wu, X. Wang, R. Ge, Y . Cao, Y . Huang, W. Liu, et al. The tool decathlon: Benchmarking language agents for diverse, realistic, and long-horizon task execution.arXiv preprint arXiv:2510.25726, 2025

  34. [34]

    X. Li, W. Chen, Y . Liu, S. Zheng, X. Chen, Y . He, Y . Li, B. You, H. Shen, J. Sun, et al. Skillsbench: Benchmarking how well agent skills work across diverse tasks.arXiv preprint arXiv:2602.12670, 2026

  35. [35]

    J. Liu, W. Wang, Z. Ma, G. Huang, Y . Su, K.-J. Chang, W. Chen, H. Li, L. Shen, and M. R. Lyu. MedChain: Bridging the gap between LLM agents and clinical practice through interactive sequential benchmarking. InAdvances in Neural Information Processing Systems 38: Datasets and Benchmarks Track, 2025. URLhttps://arxiv.org/abs/2412.01605

  36. [36]

    M. A. Merrill, A. G. Shaw, N. Carlini, B. Li, H. Raj, I. Bercovich, L. Shi, J. Y . Shin, T. Walshe, E. K. Buchanan, et al. Terminal-bench: Benchmarking agents on hard, realistic tasks in command line interfaces.arXiv preprint arXiv:2601.11868, 2026

  37. [37]

    Modal: High-performance serverless infrastructure for AI and data

    Modal Labs. Modal: High-performance serverless infrastructure for AI and data. https: //modal.com, 2025. Accessed: 2026-04-30

  38. [38]

    Hermes Agent: The agent that grows with you

    Nous Research. Hermes Agent: The agent that grows with you. https://github.com/ NousResearch/hermes-agent, 2026. Accessed: 2026-04-30

  39. [39]

    OpenAI Agents SDK (python)

    OpenAI. OpenAI Agents SDK (python). https://github.com/openai/ openai-agents-python, 2025. Accessed: 2026-04-30. 13

  40. [40]

    OpenAI Codex CLI

    OpenAI. OpenAI Codex CLI. https://github.com/openai/codex, 2025. Accessed: 2026-04-30

  41. [41]

    GPT-5.5 system card

    OpenAI. GPT-5.5 system card. https://openai.com/index/gpt-5-5-system-card/ ,

  42. [42]

    Covers the GPT-5.5, GPT-5.4, and GPT-5.4 Mini family

    Accessed: 2026-04-30. Covers the GPT-5.5, GPT-5.4, and GPT-5.4 Mini family

  43. [43]

    OpenClaw: Your own personal ai assistant

    OpenClaw. OpenClaw: Your own personal ai assistant. https://github.com/openclaw/ openclaw, 2025. Accessed: 2026-04-30

  44. [44]

    A. Pal, L. K. Umapathi, and M. Sankarasubbu. MedMCQA: A large-scale multi-subject multi- choice dataset for medical domain question answering. InProceedings of the Conference on Health, Inference, and Learning (CHIL), volume 174 ofProceedings of Machine Learning Research, pages 248–260. PMLR, 2022. URLhttps://arxiv.org/abs/2203.14371

  45. [45]

    Qwen3 Technical Report

    Qwen Team. Qwen3 technical report.arXiv preprint arXiv:2505.09388, 2025

  46. [46]

    N. R. Sahni, P. Gupta, M. Peterson, and D. M. Cutler. Active steps to reduce administrative spending associated with financial transactions in US health care.Health Affairs Scholar, 1(5): qxad053, 2023. doi: 10.1093/haschl/qxad053

  47. [47]

    N. R. Sahni, B. Istvan, and D. M. Cutler. Perceptions of prior authorization burden and solutions. Health Affairs Scholar, 2(9):qxae096, 2024. doi: 10.1093/haschl/qxae096

  48. [48]

    Schmidgall, R

    S. Schmidgall, R. Ziaei, C. Harris, E. Reis, J. Jopling, and M. Moor. AgentClinic: a multimodal agent benchmark to evaluate AI in simulated clinical environments, 2024. URL https: //arxiv.org/abs/2405.07960

  49. [49]

    C. A. Sinsky, L. Colligan, L. Li, M. Prgomet, S. Reynolds, L. Goeders, J. Westbrook, M. Tutty, and G. Blike. Allocation of physician time in ambulatory practice: A time and motion study in 4 specialties.Annals of Internal Medicine, 165(11):753–760, 2016. doi: 10.7326/M16-0961

  50. [50]

    Steinberger

    P. Steinberger. MCPorter: TypeScript runtime and CLI for connecting to MCP servers. https: //github.com/steipete/mcporter, 2025. npm packagemcporter; accessed 2026-05-03

  51. [51]

    R. S. Sutton, D. Precup, and S. Singh. Between MDPs and semi-MDPs: A framework for temporal abstraction in reinforcement learning. Artificial Intelligence, 112(1-2):181–211, 1999

  52. [52]

    X. Tang, B. Qian, R. Gao, J. Chen, X. Chen, and M. Gerstein. BioCoder: a benchmark for bioinformatics code generation with large language models. Bioinformatics, 40(Supplement_1):i266–i276, 2024. doi: 10.1093/bioinformatics/btae230. URL https://arxiv.org/abs/2308.16458

  53. [53]

    X. Tang, D. Shao, J. Sohn, J. Chen, J. Zhang, J. Xiang, F. Wu, Y. Zhao, C. Wu, W. Shi, A. Cohan, and M. Gerstein. MedAgentsBench: Benchmarking thinking models and agent frameworks for complex medical reasoning, 2025. URL https://arxiv.org/abs/2503.07459

  54. [54]

    H. Trivedi, T. Khot, M. Hartmann, R. Manku, V. Dong, E. Li, S. Gupta, A. Sabharwal, and N. Balasubramanian. AppWorld: A controllable world of apps and people for benchmarking interactive coding agents. In Proceedings of the 62nd Annual Meeting of the Association for Computational Linguistics (ACL). Association for Computational Linguistics, 2024. URL http...

  55. [55]

    G. Tsatsaronis, G. Balikas, P. Malakasiotis, I. Partalas, M. Zschunke, M. R. Alvers, D. Weissenborn, A. Krithara, S. Petridis, D. Polychronopoulos, Y. Almirantis, J. Pavlopoulos, N. Baskiotis, P. Gallinari, T. Artières, A.-C. N. Ngomo, N. Heino, E. Gaussier, L. Barrio-Alvers, M. Schroeder, I. Androutsopoulos, and G. Paliouras. An overview of the BIOA...

  56. [56]

    Z. Wang, B. Danek, Z. Yang, Z. Chen, and J. Sun. Can large language models replace data scientists in biomedical research?, 2024. URL https://arxiv.org/abs/2410.21591

  57. [57]

    M. Wornow, R. Thapa, E. Steinberg, J. A. Fries, and N. H. Shah. EHRSHOT: An EHR benchmark for few-shot evaluation of foundation models. In Advances in Neural Information Processing Systems 36: Datasets and Benchmarks Track, 2023. URL https://arxiv.org/abs/2307.02028

  58. [58]

    Grok 4 model card

    xAI. Grok 4 model card. https://data.x.ai/2025-08-20-grok-4-model-card.pdf. Accessed: 2026-04-30

  60. [60]

    T. Xie, D. Zhang, J. Chen, X. Li, S. Zhao, R. Cao, T. J. Hua, Z. Cheng, D. Shin, F. Lei, Y. Liu, Y. Xu, S. Zhou, S. Savarese, C. Xiong, V. Zhong, and T. Yu. OSWorld: Benchmarking multimodal agents for open-ended tasks in real computer environments. In Advances in Neural Information Processing Systems 37: Datasets and Benchmarks Track, 2024. URL https://...

  61. [61]

    G. Xiong, Q. Jin, Z. Lu, and A. Zhang. Benchmarking retrieval-augmented generation for medicine. In Findings of the Association for Computational Linguistics: ACL 2024. Association for Computational Linguistics, 2024. URL https://arxiv.org/abs/2402.13178

  62. [62]

    F. F. Xu, Y. Song, B. Li, Y. Tang, K. Jain, M. Bao, Z. Z. Wang, X. Zhou, Z. Guo, M. Cao, M. Yang, H. Y. Lu, A. Martin, Z. Su, L. Maben, R. Mehta, W. Chi, L. Jang, Y. Xie, S. Zhou, and G. Neubig. TheAgentCompany: Benchmarking LLM agents on consequential real world tasks, 2024. URL https://arxiv.org/abs/2412.14161

  63. [63]

    R. Xu, Y. Zhuang, Y. Zhong, Y. Yu, Z. Wang, X. Tang, H. Wu, M. D. Wang, J. C. Ho, Y. Xiao, W. Shi, and C. Yang. MedAgentGym: A scalable agentic training environment for code-centric reasoning in biomedical data science. In International Conference on Learning Representations (ICLR), 2026. URL https://arxiv.org/abs/2506.04405

  64. [64]

    S. Yao, N. Shinn, P. Razavi, and K. Narasimhan. τ-bench: A benchmark for tool-agent-user interaction in real-world domains. arXiv preprint arXiv:2406.12045, 2024

  65. [65]

    B. Zhang, K. Lazuka, and M. Murag. Equipping agents for the real world with agent skills. Anthropic Engineering Blog, 2025

  66. [66]

    S. Zhou, F. F. Xu, H. Zhu, X. Zhou, R. Lo, A. Sridhar, X. Cheng, T. Ou, Y. Bisk, D. Fried, U. Alon, and G. Neubig. WebArena: A realistic web environment for building autonomous agents. In International Conference on Learning Representations (ICLR), 2024. URL https://arxiv.org/abs/2307.13854

  67. [67]

    MedXpertQA: Benchmarking Expert-Level Medical Reasoning and Understanding

    Y. Zuo, S. Qu, Y. Li, Z. Chen, X. Zhu, E. Hua, K. Zhang, N. Ding, and B. Zhou. MedXpertQA: Benchmarking expert-level medical reasoning and understanding. arXiv preprint arXiv:2501.18362, 2025.

  68. [68]

    This category captures only failures whose proximate cause is independent of the agent: WebSocket gateway abnormal closures (OpenClaw 1006), MCP container setup errors (e.g., a missing shared world file), trial-runner exceptions on transport timeouts, and the rare zero-step exits where the agent emitted no actions before the runtime ended the trial. We del...

  69. [69]

    Read rubrics.json first

  70. [70]

    Read canonical_case_record.json and source evidence files

  71. [71]

    For each rubric, read agent_outputs/stage_*.json named by the rubric

  72. [72]

    Cite the file path and line range in every verdict explanation

  73. [73]

    Write verdicts.json incrementally so partial progress is persisted. Grading boundaries: - Do not create new deterministic checks beyond what rubrics.json defines. - Semantic equivalence is acceptable on phrasing. - Fail items where the agent fabricated evidence, cited facts the source documents do not support, or missed a required element. {{role_speci...

  74. [74]

    Enumerate-in, enumerate-out: when CONTEXT names items, account for each one explicitly (satisfied / unsatisfied / absent)

  75. [75]

    Affirmative claims require primary-source citation: evidence_refs must point to the agent artifact, not the canonical record or rubric

  76. [76]

    Literal quotation, no coercion: the field name and value you quote must appear verbatim in the agent’s artifact

  77. [77]

    Structured evidence (optional): items_credited / items_missing / evidence_quotes alongside the prose explanation. Citation discipline: cite specific file path + line range (e.g., "policies/06__bariatric-surgery.md 4.3", "agent_outputs/stage_md_review.json: decision=approve"). Output schema (verdicts.json): { "rubric_verdicts": { "<rubric_id>": { "pass": <...

  78. [78]

    Read canonical_case_record.json, cm_reference.json, chart_slice.json, and task_instruction.md

  79. [79]

    Read handbook files when a rubric cites handbook standards

  80. [80]

    Read the relevant agent_outputs/stage_*.json named by each rubric. ... [grading boundaries, hard-fail rules, verdict discipline, output schema; truncated for length, identical to template above] ... - ANTI-TRIGGER HARD FAIL: when grading rb_outreach_*, cm.outreach.*, or cm_v4 cm.outreach.quality, read per-task persona consent_anti_triggers and cm_refer...

Showing first 80 references.