CHI-Bench: Can AI Agents Automate End-to-End, Long-Horizon, Policy-Rich Healthcare Workflows?

Biwei Huang; Caiming Xiong; Carl Yang; Chenyu You; Deon Metelski; Eric P. Xing; Fan Feng; Fangli Geng; Frank Wang; Hang Jiang

arxiv: 2605.16679 · v2 · pith:RLPSQLQAnew · submitted 2026-05-15 · 💻 cs.CL · cs.AI

Haolin Chen, Deon Metelski, Leon Qi, Tao Xia, Joonyul Lee, Steve Brown, Kevin Riley, Frank Wang, T. Y. Alvin Liu, Hank Capps MD, Zeyu Tang, Xiangchen Song, Lingjing Kong, Fan Feng, Tianyi Zeng, Zhiwei Liu, Zixian Ma, Hang Jiang, Fangli Geng, Yuan Yuan, Chenyu You, Qingsong Wen, Hua Wei, Yanjie Fu, Yue Zhao, Carl Yang, Biwei Huang, Kun Zhang, Caiming Xiong, Sanmi Koyejo, Eric P. Xing, Philip S. Yu, Weiran Yao

Pith reviewed 2026-05-20 17:36 UTC · model grok-4.3

classification 💻 cs.CL cs.AI

keywords AI agents, healthcare workflows, benchmark, long-horizon tasks, policy density, multi-role composition, prior authorization, utilization management

The pith

AI agents resolve only 28 percent of long-horizon healthcare workflows even in the best case.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper introduces CHI-Bench to test AI agents on realistic end-to-end healthcare operations that demand following dense policies, switching between multiple roles with handoffs, and managing multi-turn interactions such as peer reviews and patient outreach. Tasks place the agent inside a simulator of 20 healthcare apps reached through 87 tools, where it must drive each clinical case to completion while consulting a handbook of over 1,290 managed-care documents. Evaluation of 30 agent harness and model combinations shows the strongest result reaches only 28.0 percent task resolution, with no setup clearing 20 percent on a strict pass^3 criterion and performance falling to 3.8 percent when all tasks run inside one continuous session. These outcomes indicate that current agents face substantial barriers in policy-dense, role-composed enterprise workflows.

Core claim

CHI-Bench evaluates agents on long-horizon tasks across provider prior authorization, payer utilization management, and care management. Each task supplies a clinical case that the agent must advance to terminal status inside a high-fidelity simulator of 20 healthcare applications exposed via 87 MCP tools, while producing required role-specific artifacts and obeying rules drawn from a 1,290-plus document managed-care operations handbook. Across thirty agent configurations the highest observed resolution rate is 28.0 percent; no configuration exceeds 20 percent under a strict pass^3 metric, and collapsing all tasks into a single session reduces success to 3.8 percent.
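
The strict pass^3 figure can be made concrete with a small sketch. This page does not reproduce the paper's exact definition, so the code below assumes the pass^k convention used in the τ-bench family of benchmarks: for a task with n trials of which c succeed, the chance that k randomly chosen trials all succeed is estimated as C(c, k) / C(n, k), then averaged over tasks. The function names and the toy counts are illustrative assumptions, not taken from the paper.

```python
from math import comb


def pass_hat_k(successes: int, trials: int, k: int) -> float:
    """Estimate the probability that k trials of one task all pass.

    Assumed estimator (tau-bench style): C(successes, k) / C(trials, k).
    """
    if not 0 <= successes <= trials:
        raise ValueError("successes must lie between 0 and trials")
    if not 1 <= k <= trials:
        raise ValueError("k must lie between 1 and trials")
    # math.comb returns 0 when k > successes, so an inconsistent task scores 0.
    return comb(successes, k) / comb(trials, k)


def benchmark_pass_hat_k(per_task_counts: list[tuple[int, int]], k: int) -> float:
    """Average the per-task estimate over (successes, trials) pairs."""
    if not per_task_counts:
        raise ValueError("need at least one task")
    scores = [pass_hat_k(c, n, k) for c, n in per_task_counts]
    return sum(scores) / len(scores)


# Hypothetical toy data: three tasks, three trials each.
toy_counts = [(3, 3), (2, 3), (0, 3)]
pass_1 = benchmark_pass_hat_k(toy_counts, k=1)
pass_3 = benchmark_pass_hat_k(toy_counts, k=3)
```

With three trials per task, pass^3 under this reading credits a task only when all three trials pass, so an agent that succeeds on two of three trials contributes to pass^1 but nothing to pass^3.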

What carries the argument

The CHI-Bench task suite: a simulator of 20 healthcare apps reached through 87 MCP tools, paired with a 1,290-plus document managed-care operations handbook. Together they require agents to execute policy-grounded, multi-role, multilateral workflows to completion.

If this is right

Agents must sustain coherent action sequences across dozens of tool calls and document productions while switching roles.
Single-session execution exposes sharp degradation compared with reset or multi-session operation.
The paper hypothesizes that comparable gaps are likely to surface in other enterprise domains that combine dense policy libraries with irreversible, role-composed decisions.
Existing agent benchmarks likely understate the difficulty of full workflow automation in regulated operational settings.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

Explicit mechanisms for retrieving and applying specific policy clauses during planning could raise completion rates on these tasks.
Testing the same workflow patterns in legal compliance or financial operations would likely surface parallel gaps in current agent capabilities.
Production systems would need human oversight loops or verification stages to compensate for the low autonomous success rates shown here.

Load-bearing premise

The simulator of 20 apps and the managed-care handbook together capture the policy density, multi-role handoffs, and multilateral interactions of actual healthcare operations.

What would settle it

An agent configuration that achieves greater than 50 percent task resolution on the identical CHI-Bench tasks under the reported evaluation protocol would directly challenge the measured performance ceiling.

Figures

Figures reproduced from arXiv: 2605.16679 by Biwei Huang, Caiming Xiong, Carl Yang, Chenyu You, Deon Metelski, Eric P. Xing, Fan Feng, Fangli Geng, Frank Wang, Hang Jiang, Hank Capps MD, Haolin Chen, Hua Wei, Joonyul Lee, Kevin Riley, Kun Zhang, Leon Qi, Lingjing Kong, Philip S. Yu, Qingsong Wen, Sanmi Koyejo, Steve Brown, Tao Xia, Tianyi Zeng, T. Y. Alvin Liu, Weiran Yao, Xiangchen Song, Yanjie Fu, Yuan Yuan, Yue Zhao, Zeyu Tang, Zhiwei Liu, Zixian Ma.

**Figure 1.** χ-Bench: Clinical Healthcare In-Situ Environment and Evaluation Benchmark.

**Figure 2.** Illustration of the three challenges: policy retrieval, multi-role composition (intake clerk …

**Figure 3.** pass@1 across the three χ-Bench environments of frontier proprietary LLMs with their first-party agent harness. Error bars are task-level percentile bootstrap 95% confidence intervals. We evaluated 30 agent harness/model configurations spanning major frontier models and strong agent stacks.

**Figure 4.** Comparing strengths and weaknesses of Codex GPT-5.5 and Claude Code Opus 4.6 across …

**Figure 5.** χ-World Engine: Simulated Worlds for Clinical Healthcare In-Situ Workflows. From Section 3.1.1 (Realistic Healthcare Software Environments): We implement the apps across three domains: provider PA, payer UM, and care management. Built in ∼115K lines of Python, the simulator captures features absent from general-purpose benchmarks: case state machines with 29 statuses and explicit legal transitions; reviewer-independence …

**Figure 6.** Healthcare apps across three task domains.

**Figure 7.** Managed-Care Operations Handbook Skill is organized as a progressive-disclosure manual. The top-level SKILL.md acts as a table of contents that routes the agent to one of three role sub-skills (provider-pa, payer-um, care-manager); the two shared components, medical-library (clinical lookup) and platform (role-specific tutorials), are reachable from any sub-skill via the dashed access bus.

**Figure 8.** Example of a Payer UM task for hereditary breast-cancer genomic sequencing.

**Figure 9.** Task breakdown. Inner: Domain; Middle: PA/UM terminal state, CM patient persona; Outer: clinical/service category. From the adjacent text (Step 3, Multi-reviewer review): Each trajectory is reviewed by at least 1 practicing healthcare worker and 5 authors for clinical precision, and must clear a residual-PHI scan and a clinical-realism check before admission. The detailed human validation protocols are described in Section D.1. …

**Figure 10.** Verification pipeline. Each trial emits a persisted record to a deterministic contract verifier and a rubric-based LLM judge under strict-majority vote. A trial passes only when both layers pass. From Section 4.1 (Experiment Setup): We evaluate 30 agent harness/model configurations across two stacks: a proprietary stack pairing each frontier lab’s first-party CLI (Claude Code [3], OpenAI Codex [37], Gemini CLI …

**Figure 11.** (a) Each marker is one row of Table 2 …

**Figure 12.** Pass@1 under trimmed skills. We trimmed the 1,279-document Managed-Care Operations Handbook Skill three ways (−Domain drops the domain handbook, −Medical drops the medical library, −Both drops both), ran all tasks with Codex + GPT-5.5, and found that the handbook’s effect is domain-dependent ( …

**Figure 14.** Second-level failure modes. % is over failed trials; colors show first-level categories.

**Figure 15.** Provider Prior Authorization workflow. Eight phases from referral to terminal determina…

**Figure 16.** Payer Utilization Management workflow. Six stages from intake normalization to outbound …

**Figure 17.** Care Management workflow. Five phases from intake to finalized care plan, with a …

**Figure 18.** Provider PA example task (Hard / Returned for docs). Top: patient chart snapshot. Bottom: …

**Figure 19.** Payer UM example task (Moderate / Approved, picked up at Nurse review). Top: patient …

**Figure 20.** Care Management example task (PTSD / Hard, Refusing). Top: chart snapshot. Bottom: …

**Figure 21.** Per-rubric Cohen’s κ distribution across the 199 binary rubrics (V=3 votes each). Dashed line marks the substantial-agreement threshold κ = 0.6. From the adjacent text: … Policy-Compliance, Tool-Use-Error, Abstain-or-Stuck, and Hallucination). The three judgment / completion / policy axes combined (71.9%) dominate the agent-attributable bucket, indicating that observed failures reflect agent execution and policy interpretation rat…

**Figure 22.** Per-trial wall-clock (left column) and total token usage (right column) violin distributions, …

**Figure 23.** Per-task pass@1 heatmap on PA. Cells = pass@1 in [0, 1]; right-edge annotation = per-row mean pass@1. Row labels are task identifiers such as pa_t012_t012_o001_p01_triage_payer.

**Figure 24.** Per-task pass@1 heatmap on UM. Same format as …

**Figure 25.** Per-task pass@1 heatmap on CM. Same format as …

**Figure 26.** Per-row outcome breakdown into pass / agent-failure / infrastructure-failure / wall-clock …

**Figure 27.** Per-row 100%-stacked failure-mode distribution split into PA, UM, and CM panels. Within …

**Figure 28.** Detailed second-level mode breakdown: one stacked bar per (harness, model) row. Within …

**Figure 29.** Two-panel policy-read summary. Left: per-row mean recall partitioned by trial-level outcome (Overall / Pass / Fail). Right: per-row recall on the y-axis against pass@1 on the x-axis (one marker per (harness, model) cell), with a strong positive rank correlation (r = +0.77, n = 30). Recall is the fraction of GT-cited handbook policies the agent’s trajectory accesses via Read / Grep / Bash tool calls; the w…
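
The recall statistic in the Figure 29 caption can be sketched as follows. This is a minimal illustration under stated assumptions: the policy identifiers are hypothetical, the set of accessed policies is taken as given (the paper derives it from Read / Grep / Bash tool calls), and Spearman's coefficient is assumed for the rank correlation because the caption does not name the estimator.

```python
def policy_read_recall(gt_cited: set[str], accessed: set[str]) -> float:
    """Fraction of ground-truth-cited policies that the trajectory accessed."""
    if not gt_cited:
        raise ValueError("a task must cite at least one policy")
    return len(gt_cited & accessed) / len(gt_cited)


def average_ranks(values: list[float]) -> list[float]:
    """1-based ranks, with tied values sharing the average of their positions."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        shared = (i + j) / 2 + 1
        for t in range(i, j + 1):
            ranks[order[t]] = shared
        i = j + 1
    return ranks


def pearson(x: list[float], y: list[float]) -> float:
    """Pearson correlation; undefined (division by zero) if either input is constant."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    var_x = sum((a - mean_x) ** 2 for a in x)
    var_y = sum((b - mean_y) ** 2 for b in y)
    return cov / (var_x * var_y) ** 0.5


def spearman(x: list[float], y: list[float]) -> float:
    """Spearman rank correlation as Pearson correlation of average ranks."""
    if len(x) != len(y) or len(x) < 2:
        raise ValueError("inputs must have equal length of at least 2")
    return pearson(average_ranks(x), average_ranks(y))


# Hypothetical example: four cited policies, two of them accessed.
example_recall = policy_read_recall(
    {"um-101", "um-204", "med-017", "pa-330"},
    {"um-101", "med-017", "platform-tutorial"},
)
```

In the figure's setting, `spearman` would be applied to one recall value and one pass@1 value per (harness, model) row, giving 30 pairs.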

Original abstract

End-to-end automation of realistic healthcare operations stresses three capabilities underrepresented in current benchmarks: policy density (decisions must be grounded in a large library of medical, insurance, and operational rules); multi-role composition (a single task requires the agent to play multiple roles with handoffs); and multilateral interaction (intermediate workflow steps are multi-turn dialogs, such as peer-to-peer review and patient outreach). We introduce $\chi$-Bench, a benchmark of long-horizon healthcare workflows across three domains: provider prior authorization, payer utilization management, and care management. Each task hands the agent a clinical case in a high-fidelity simulator of 20 healthcare apps exposed via 87 MCP tools, which it must drive to a terminal status by making tool calls and writing the role's artifacts, guided by a 1,290+ document managed-care operations handbook skill. Across 30 agent harness/model configurations, the best agent resolves only 28.0% of tasks, no agent clears 20% on strict pass^3, and executing all tasks in a single session drops performance to 3.8%. These results raise the hypothesis that similar gaps are likely to surface in other policy-dense, role-composed, irreversible enterprise domains.
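
How a trial comes to count as resolved can be sketched from the Figure 10 caption, which says each trial's record goes to a deterministic contract verifier and a rubric-based LLM judge under strict-majority vote, and that a trial passes only when both layers pass. The sketch below is an illustration under assumptions: the `TrialRecord` fields, the check and rubric names, and the rule that every check and every rubric must pass are hypothetical choices, not the paper's implementation.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class TrialRecord:
    """Hypothetical persisted record for one trial."""
    contract_checks: dict[str, bool]      # deterministic check name -> passed
    rubric_votes: dict[str, list[bool]]   # rubric name -> individual judge votes


def strict_majority(votes: list[bool]) -> bool:
    """True only when more than half of the votes are positive (ties fail)."""
    if not votes:
        raise ValueError("need at least one vote")
    return 2 * sum(votes) > len(votes)


def trial_passes(record: TrialRecord) -> bool:
    """A trial passes only when both layers pass (assumed: all items in each layer)."""
    contract_ok = all(record.contract_checks.values())
    rubric_ok = all(strict_majority(v) for v in record.rubric_votes.values())
    return contract_ok and rubric_ok


# Hypothetical records with three judge votes per rubric.
resolved = TrialRecord(
    contract_checks={"terminal_status_reached": True, "artifact_written": True},
    rubric_votes={"cites_correct_policy": [True, True, False]},
)
judge_rejects = TrialRecord(
    contract_checks={"terminal_status_reached": True, "artifact_written": True},
    rubric_votes={"cites_correct_policy": [True, False, False]},
)
contract_fails = TrialRecord(
    contract_checks={"terminal_status_reached": False, "artifact_written": True},
    rubric_votes={"cites_correct_policy": [True, True, True]},
)
```

Requiring both layers means a trial that reaches the right terminal state can still fail on the judged rubrics, and a well-written artifact cannot rescue a trial that ends in the wrong state.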

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Referee Report

2 major / 2 minor

Summary. The paper introduces CHI-Bench, a benchmark for long-horizon healthcare workflows across prior authorization, utilization management, and care management. Tasks require agents to navigate a high-fidelity simulator of 20 apps via 87 MCP tools, produce role-specific artifacts, and follow a 1,290+ document managed-care handbook. Across 30 agent harness/model configurations, the best agent achieves 28.0% task resolution, no configuration exceeds 20% on strict pass^3, and single-session execution drops to 3.8%. The authors hypothesize that comparable limitations will appear in other policy-dense, role-composed enterprise domains.

Significance. If the simulator and handbook faithfully reproduce policy density, multi-role handoffs, and irreversible multilateral steps, the empirical results provide a concrete, falsifiable demonstration of current agent limitations on realistic enterprise workflows. The benchmark construction itself (multi-app tool interface plus large policy corpus) is a positive contribution that could serve as a template for other domains.

major comments (2)

[Benchmark construction and task definition sections] The central empirical claims (28% best-case resolution, <20% pass^3, 3.8% single-session) rest on the unverified assumption that the 20-app simulator and 1,290+ document handbook reproduce the policy branching, state dependencies, and multilateral interactions of real managed-care operations. No quantitative coverage metrics, expert validation study, or comparison against de-identified production logs are reported to support this fidelity claim.
[Evaluation protocol] The pass criteria and error taxonomy are not sufficiently specified to allow independent replication or assessment of whether the measured gap is driven by policy complexity versus simulator simplifications (e.g., deterministic tool outcomes or limited state).

minor comments (2)

[Evaluation metrics] Clarify the exact definition of 'pass^3' and how terminal status is determined across the three domains.
[Simulator description] Add a table or appendix listing the 87 MCP tools with brief descriptions of their state effects and policy dependencies.

Simulated Author's Rebuttal

2 responses · 0 unresolved

We thank the referee for their constructive comments on our paper. We have carefully considered the feedback regarding benchmark fidelity and evaluation protocol. Our responses to the major comments are provided below, along with planned revisions to the manuscript.

Referee: [Benchmark construction and task definition sections] The central empirical claims (28% best-case resolution, <20% pass^3, 3.8% single-session) rest on the unverified assumption that the 20-app simulator and 1,290+ document handbook reproduce the policy branching, state dependencies, and multilateral interactions of real managed-care operations. No quantitative coverage metrics, expert validation study, or comparison against de-identified production logs are reported to support this fidelity claim.

Authors: We recognize the importance of validating the simulator's fidelity to real managed-care operations. Unfortunately, privacy regulations prevent us from accessing or comparing against de-identified production logs, and a full expert validation study was beyond the scope of this initial benchmark release. The benchmark was constructed by compiling a comprehensive set of publicly available policy documents from managed care handbooks and designing tasks that reflect typical workflow complexities in prior authorization, utilization management, and care management. We will revise the manuscript to provide a more detailed description of the benchmark construction process, including examples of how specific policy rules and multi-role handoffs are instantiated in the tasks. We will also add a dedicated limitations subsection that explicitly discusses the challenges in achieving and verifying full fidelity to production environments.

revision: partial

Referee: [Evaluation protocol] The pass criteria and error taxonomy are not sufficiently specified to allow independent replication or assessment of whether the measured gap is driven by policy complexity versus simulator simplifications (e.g., deterministic tool outcomes or limited state).

Authors: We agree that more detailed specification of the evaluation protocol is necessary for replication. In the revised manuscript, we will provide a complete description of the pass criteria, including the exact conditions for task resolution and the strict pass^3 metric. We will also expand the error taxonomy to categorize failures into policy adherence errors, tool invocation mistakes, state tracking issues, and interaction failures, with illustrative examples from our experiments. This will help clarify whether the performance gaps stem from policy complexity or other factors.

revision: yes

Circularity Check

0 steps flagged

No significant circularity in empirical benchmark results

The paper constructs CHI-Bench as a new high-fidelity simulator with 20 apps, 87 MCP tools, and a 1,290+ document handbook, then directly measures agent success rates (28.0% best resolution, <20% pass^3, 3.8% single-session) across 30 configurations. These metrics are obtained by running agents on the benchmark tasks and are not reduced to fitted parameters, self-definitions, or self-citation chains. No equations, uniqueness theorems, or ansatzes are invoked that would make the reported performance equivalent to the inputs by construction. The evaluation is self-contained as an empirical measurement on an independently specified testbed.

Axiom & Free-Parameter Ledger

0 free parameters · 1 axiom · 0 invented entities

The central claim rests on the domain assumption that the constructed simulator and handbook faithfully represent real-world healthcare operations; no free parameters are fitted to produce the headline percentages and no new physical entities are postulated.

axioms (1)

domain assumption: The 1,290+ document managed-care operations handbook and the 20-app simulator with 87 MCP tools accurately reflect the policy density and interaction patterns of real healthcare workflows.
This assumption underpins the claim that low agent performance indicates broader limitations in policy-rich enterprise domains.

pith-pipeline@v0.9.0 · 5877 in / 1327 out tokens · 109410 ms · 2026-05-20T17:36:10.097055+00:00 · methodology

Lean theorems connected to this paper

Citations machine-checked in the Pith Canon. Every link opens the source theorem in the public Lean library.

IndisputableMonolith/Cost/FunctionalEquation.lean · washburn_uniqueness_aczel · unclear

Paper passage: Each task hands the agent a clinical case in a high-fidelity simulator of 20 healthcare apps exposed via 87 MCP tools... guided by a 1,279-document managed-care operations handbook skill.

IndisputableMonolith/Foundation/ArithmeticFromLogic.lean · LogicNat induction and recovery · unclear

Paper passage: The resulting world state, artifacts and event trail are scored in-situ by a composite verifier that combines deterministic checks with rubric-based LLM judge.

What do these tags mean?

matches: The paper's claim is directly supported by a theorem in the formal canon.
supports: The theorem supports part of the paper's argument, but the paper may add assumptions or extra steps.
extends: The paper goes beyond the formal theorem; the theorem is a base layer rather than the whole result.
uses: The paper appears to rely on the theorem as machinery.
contradicts: The paper's claim conflicts with a theorem or certificate in the canon.
unclear: Pith found a possible connection, but the passage is too broad, indirect, or ambiguous to say the theorem truly supports the claim.

Reference graph

Works this paper leans on

94 extracted references · 94 canonical work pages · 16 internal anchors

[1] American Medical Association. 2024 AMA prior authorization physician survey. Presented at the Annual Meeting of the American Medical Association, Chicago, IL, 2024. URL https://www.ama-assn.org/system/files/prior-authorization-survey.pdf
[2] Anthropic. Introducing the Model Context Protocol. https://www.anthropic.com/news/model-context-protocol, 2024. Accessed: 2026-04-30.
[3] Anthropic. Claude Code. https://github.com/anthropics/claude-code, 2025. Accessed: 2026-04-30.
[4] Anthropic. Claude Opus 4.7 system card. https://www.anthropic.com/system-cards. Accessed: 2026-04-30. Covers Claude Opus 4.7, Sonnet 4.6, and Haiku 4.5.
[6] R. K. Arora, J. Wei, R. S. Hicks, P. Bowman, J. Quiñonero-Candela, F. Tsimpourlas, M. Sharman, M. Shah, A. Vallone, A. Beutel, J. Heidecke, and K. Singhal. Healthbench: Evaluating large language models towards improved human health, 2025. URL https://arxiv.org/abs/2505.08775
[7] V. Barres, H. Dong, S. Ray, X. Si, and K. Narasimhan. $\tau^2$-bench: Evaluating conversational agents in a dual-control environment, 2025. URL https://arxiv.org/abs/2506.07982
[8] S. Bedi, H. Cui, M. Fuentes, A. Unell, M. Wornow, J. M. Banda, N. Kotecha, T. Keyes, Y. Mai, M. Oez, et al. Medhelm: Holistic evaluation of large language models for medical tasks. arXiv preprint arXiv:2505.23802, 2025.
[9] S. Bedi, R. Welch, E. Steinberg, M. Wornow, T. M. Kim, H. Ahmed, P. Sterling, B. Purohit, Q. Akram, A. Acosta, et al. Healthadminbench: Evaluating computer-use agents on healthcare administration tasks. arXiv preprint arXiv:2604.09937, 2026.
[10] M. Chen, J. Tworek, H. Jun, Q. Yuan, H. P. d. O. Pinto, J. Kaplan, H. Edwards, Y. Burda, N. Joseph, G. Brockman, et al. Evaluating large language models trained on code. arXiv preprint arXiv:2107.03374, 2021.
[11] A. Cuellar, A. H. Krist, L. M. Nichols, and A. J. Kuzel. Facilitators and barriers to care coordination in patient-centered medical homes (PCMHs) from coordinators’ perspectives. Journal of the American Board of Family Medicine, 31(1):90–101, 2018. doi: 10.3122/jabfm.2018.01.170133. PMC4809054.
[12] D. Cutler, E. Wikler, and P. Basch. Reducing administrative costs and improving the health care system. New England Journal of Medicine, 367(20):1875–1878, 2012. doi: 10.1056/NEJMp1209711.
[13] DeepSeek-AI. DeepSeek-V4 Pro model card. https://huggingface.co/deepseek-ai/DeepSeek-V4-Pro, 2026. Accessed: 2026-04-30.
[14] A. Drouin, M. Gasse, M. Caccia, I. H. Laradji, M. D. Verme, T. Marty, L. Boisvert, M. Thakkar, Q. Cappart, D. Vazquez, N. Chapados, and A. Lacoste. WorkArena: How capable are web agents at solving common knowledge work tasks? In Proceedings of the 41st International Conference on Machine Learning (ICML), 2024. URL https://arxiv.org/abs/2403.07718
[15] GLM-5 Team. GLM-5: From vibe coding to agentic engineering. arXiv preprint arXiv:2602.15763, 2026.
[16] Google. Gemini CLI. https://github.com/google-gemini/gemini-cli, 2025. Accessed: 2026-04-30.
[17] Google DeepMind. Gemini 3.1 Pro model card. https://deepmind.google/models/model-cards/gemini-3-1-pro/, 2026. Accessed: 2026-04-30. Covers Gemini 3.1 Pro and Gemini 3 Flash.
[18] Harbor Framework. Harbor: A framework for agent evaluations and RL environments. https://github.com/harbor-framework/harbor, 2026. Accessed: 2026-04-30.
[19] Y. Jiang, K. C. Black, G. Geng, D. Park, J. Zou, A. Y. Ng, and J. H. Chen. Medagentbench: a virtual ehr environment to benchmark medical llm agents. NEJM AI, 2(9):AIdbp2500144, 2025.
[20] C. E. Jimenez, J. Yang, A. Wettig, S. Yao, K. Pei, O. Press, and K. Narasimhan. Swe-bench: Can language models resolve real-world github issues?, 2024. URL https://arxiv.org/abs/2310.06770
[21] D. Jin, E. Pan, N. Oufattole, W.-H. Weng, H. Fang, and P. Szolovits. What disease does this patient have? a large-scale open domain question answering dataset from medical exams. Applied Sciences, 11(14):6421, 2021. URL https://arxiv.org/abs/2009.13081
[22] Q. Jin, B. Dhingra, Z. Liu, W. W. Cohen, and X. Lu. PubMedQA: A dataset for biomedical research question answering. In Proceedings of the 2019 Conference on Empirical Methods in Natural Language Processing and the 9th International Joint Conference on Natural Language Processing (EMNLP-IJCNLP), pages 2567–2577. Association for Computational Linguistics. URL https://arxiv.org/abs/1909.06146
[24] A. Jones and C. Kelly. Code execution with mcp: Building more efficient agents, 2025.
[25] H.-H. Ju. Improving care coordination of patients with chronic diseases. The Journal for Nurse Practitioners, 18(9):926–929, 2022. doi: 10.1016/j.nurpra.2022.07.005.
[26] L. P. Kaelbling, M. L. Littman, and A. R. Cassandra. Planning and acting in partially observable stochastic domains. Artificial intelligence, 101(1-2):99–134, 1998.
[27] M. Karam, M.-C. Chouinard, M. Kevork, R. Fleming, and A. Duhoux. Nurses’ and patients’ perspectives on care coordination across health care and social services sectors: A qualitative study. SAGE Open Nursing, 2026. doi: 10.1177/08445621251395347.
[28] N. Khandekar, Q. Jin, G. Xiong, S. Dunn, S. S. Applebaum, Z. Anwar, M. Sarfo-Gyamfi, C. W. Safranek, A. A. Anwar, A. Zhang, A. Gilson, M. B. Singer, A. Dave, A. Taylor, A. Zhang, Q. Chen, and Z. Lu. MedCalc-Bench: Evaluating large language models for medical calculations. In Advances in Neural Information Processing Systems 37: Datasets and Benchmarks Track. URL https://arxiv.org/abs/2406.12036
[30] Kimi Team. Kimi K2: Open agentic intelligence. arXiv preprint arXiv:2507.20534, 2025.
[31] LangChain. DeepAgents. https://github.com/langchain-ai/deepagents, 2025. Accessed: 2026-04-30.
[32] G. Lee, H. Hwang, S. Bae, Y. Kwon, W. Shin, S. Yang, M. Seo, J.-Y. Kim, and E. Choi. EHRSQL: A practical text-to-SQL benchmark for electronic health records. In Advances in Neural Information Processing Systems 35: Datasets and Benchmarks Track, 2022. URL https://arxiv.org/abs/2301.07695
[33] J. Li, W. Zhao, J. Zhao, W. Zeng, H. Wu, X. Wang, R. Ge, Y. Cao, Y. Huang, W. Liu, et al. The tool decathlon: Benchmarking language agents for diverse, realistic, and long-horizon task execution. arXiv preprint arXiv:2510.25726, 2025.
[34] X. Li, W. Chen, Y. Liu, S. Zheng, X. Chen, Y. He, Y. Li, B. You, H. Shen, J. Sun, et al. Skillsbench: Benchmarking how well agent skills work across diverse tasks. arXiv preprint arXiv:2602.12670, 2026.
[35] J. Liu, W. Wang, Z. Ma, G. Huang, Y. Su, K.-J. Chang, W. Chen, H. Li, L. Shen, and M. R. Lyu. MedChain: Bridging the gap between LLM agents and clinical practice through interactive sequential benchmarking. In Advances in Neural Information Processing Systems 38: Datasets and Benchmarks Track, 2025. URL https://arxiv.org/abs/2412.01605
[36] M. A. Merrill, A. G. Shaw, N. Carlini, B. Li, H. Raj, I. Bercovich, L. Shi, J. Y. Shin, T. Walshe, E. K. Buchanan, et al. Terminal-bench: Benchmarking agents on hard, realistic tasks in command line interfaces. arXiv preprint arXiv:2601.11868, 2026.
[37] Modal Labs. Modal: High-performance serverless infrastructure for AI and data. https://modal.com, 2025. Accessed: 2026-04-30.
[38] Nous Research. Hermes Agent: The agent that grows with you. https://github.com/NousResearch/hermes-agent, 2026. Accessed: 2026-04-30.
[39] OpenAI. OpenAI Agents SDK (python). https://github.com/openai/openai-agents-python, 2025. Accessed: 2026-04-30.
[40] OpenAI. OpenAI Codex CLI. https://github.com/openai/codex, 2025. Accessed: 2026-04-30.
[41] OpenAI. GPT-5.5 system card. https://openai.com/index/gpt-5-5-system-card/. Accessed: 2026-04-30. Covers the GPT-5.5, GPT-5.4, and GPT-5.4 Mini family.
[43] OpenClaw. OpenClaw: Your own personal ai assistant. https://github.com/openclaw/openclaw, 2025. Accessed: 2026-04-30.
[44] A. Pal, L. K. Umapathi, and M. Sankarasubbu. MedMCQA: A large-scale multi-subject multi-choice dataset for medical domain question answering. In Proceedings of the Conference on Health, Inference, and Learning (CHIL), volume 174 of Proceedings of Machine Learning Research, pages 248–260. PMLR, 2022. URL https://arxiv.org/abs/2203.14371
[45] Qwen Team. Qwen3 technical report. arXiv preprint arXiv:2505.09388, 2025.
[46] N. R. Sahni, P. Gupta, M. Peterson, and D. M. Cutler. Active steps to reduce administrative spending associated with financial transactions in US health care. Health Affairs Scholar, 1(5):qxad053, 2023. doi: 10.1093/haschl/qxad053.
[47] N. R. Sahni, B. Istvan, and D. M. Cutler. Perceptions of prior authorization burden and solutions. Health Affairs Scholar, 2(9):qxae096, 2024. doi: 10.1093/haschl/qxae096.
[48] S. Schmidgall, R. Ziaei, C. Harris, E. Reis, J. Jopling, and M. Moor. AgentClinic: a multimodal agent benchmark to evaluate AI in simulated clinical environments, 2024. URL https://arxiv.org/abs/2405.07960
[49] C. A. Sinsky, L. Colligan, L. Li, M. Prgomet, S. Reynolds, L. Goeders, J. Westbrook, M. Tutty, and G. Blike. Allocation of physician time in ambulatory practice: A time and motion study in 4 specialties. Annals of Internal Medicine, 165(11):753–760, 2016. doi: 10.7326/M16-0961.
[50] P. Steinberger. MCPorter: TypeScript runtime and CLI for connecting to MCP servers. https://github.com/steipete/mcporter, 2025. npm package mcporter; accessed 2026-05-03.
[51] R. S. Sutton, D. Precup, and S. Singh. Between mdps and semi-mdps: A framework for temporal abstraction in reinforcement learning. Artificial intelligence, 112(1-2):181–211, 1999.

work page 1999
[52]

X. Tang, B. Qian, R. Gao, J. Chen, X. Chen, and M. Gerstein. BioCoder: a benchmark for bioinformatics code generation with large language models.Bioinformatics, 40(Supplement_1): i266–i276, 2024. doi: 10.1093/bioinformatics/btae230. URL https://arxiv.org/abs/ 2308.16458

work page doi:10.1093/bioinformatics/btae230 2024
[53]

X. Tang, D. Shao, J. Sohn, J. Chen, J. Zhang, J. Xiang, F. Wu, Y . Zhao, C. Wu, W. Shi, A. Cohan, and M. Gerstein. MedAgentsBench: Benchmarking thinking models and agent frameworks for complex medical reasoning, 2025. URLhttps://arxiv.org/abs/2503.07459

work page arXiv 2025
[54]

Trivedi, T

H. Trivedi, T. Khot, M. Hartmann, R. Manku, V . Dong, E. Li, S. Gupta, A. Sabharwal, and N. Balasubramanian. AppWorld: A controllable world of apps and people for benchmarking interactive coding agents. InProceedings of the 62nd Annual Meeting of the Association for Computational Linguistics (ACL). Association for Computational Linguistics, 2024. URL http...

work page arXiv 2024
[55]

Tsatsaronis, G

G. Tsatsaronis, G. Balikas, P. Malakasiotis, I. Partalas, M. Zschunke, M. R. Alvers, D. Weis- senborn, A. Krithara, S. Petridis, D. Polychronopoulos, Y . Almirantis, J. Pavlopoulos, N. Bask- iotis, P. Gallinari, T. Artières, A.-C. N. Ngomo, N. Heino, E. Gaussier, L. Barrio-Alvers, M. Schroeder, I. Androutsopoulos, and G. Paliouras. An overview of the BIOA...

work page doi:10.1186/s12859-015-0564-6 2015
[56]

Z. Wang, B. Danek, Z. Yang, Z. Chen, and J. Sun. Can large language models replace data scientists in biomedical research?, 2024. URLhttps://arxiv.org/abs/2410.21591. 14

work page arXiv 2024
[57]

Wornow, R

M. Wornow, R. Thapa, E. Steinberg, J. A. Fries, and N. H. Shah. EHRSHOT: An EHR benchmark for few-shot evaluation of foundation models. InAdvances in Neural Information Processing Systems 36: Datasets and Benchmarks Track, 2023. URLhttps://arxiv.org/ abs/2307.02028

work page arXiv 2023
[58]

Grok 4 model card

xAI. Grok 4 model card. https://data.x.ai/2025-08-20-grok-4-model-card.pdf ,

work page 2025
[59]

Accessed: 2026-04-30

work page 2026
[60]

T. Xie, D. Zhang, J. Chen, X. Li, S. Zhao, R. Cao, T. J. Hua, Z. Cheng, D. Shin, F. Lei, Y . Liu, Y . Xu, S. Zhou, S. Savarese, C. Xiong, V . Zhong, and T. Yu. OSWorld: Benchmarking multimodal agents for open-ended tasks in real computer environments. InAdvances in Neural Information Processing Systems 37: Datasets and Benchmarks Track, 2024. URL https://...

work page internal anchor Pith review Pith/arXiv arXiv 2024
[61]

Xiong, Q

G. Xiong, Q. Jin, Z. Lu, and A. Zhang. Benchmarking retrieval-augmented generation for medicine. InFindings of the Association for Computational Linguistics: ACL 2024. Association for Computational Linguistics, 2024. URLhttps://arxiv.org/abs/2402.13178

work page arXiv 2024
[62]

F. F. Xu, Y . Song, B. Li, Y . Tang, K. Jain, M. Bao, Z. Z. Wang, X. Zhou, Z. Guo, M. Cao, M. Yang, H. Y . Lu, A. Martin, Z. Su, L. Maben, R. Mehta, W. Chi, L. Jang, Y . Xie, S. Zhou, and G. Neubig. TheAgentCompany: Benchmarking LLM agents on consequential real world tasks, 2024. URLhttps://arxiv.org/abs/2412.14161

work page internal anchor Pith review Pith/arXiv arXiv 2024
[63]

R. Xu, Y . Zhuang, Y . Zhong, Y . Yu, Z. Wang, X. Tang, H. Wu, M. D. Wang, J. C. Ho, Y . Xiao, W. Shi, and C. Yang. MedAgentGym: A scalable agentic training environment for code-centric reasoning in biomedical data science. InInternational Conference on Learning Representations (ICLR), 2026. URLhttps://arxiv.org/abs/2506.04405

work page arXiv 2026
[64]

S. Yao, N. Shinn, P. Razavi, and K. Narasimhan. τ-bench: A benchmark for tool-agent-user interaction in real-world domains.arXiv preprint arXiv:2406.12045, 2024

work page internal anchor Pith review Pith/arXiv arXiv 2024
[65]

Zhang, K

B. Zhang, K. Lazuka, and M. Murag. Equipping agents for the real world with agent skills. Anthropic Engineering Blog, 2025

work page 2025
[66]

S. Zhou, F. F. Xu, H. Zhu, X. Zhou, R. Lo, A. Sridhar, X. Cheng, T. Ou, Y . Bisk, D. Fried, U. Alon, and G. Neubig. WebArena: A realistic web environment for building autonomous agents. InInternational Conference on Learning Representations (ICLR), 2024. URL https: //arxiv.org/abs/2307.13854

work page internal anchor Pith review Pith/arXiv arXiv 2024
[67]

MedXpertQA: Benchmarking Expert-Level Medical Reasoning and Understanding

Y . Zuo, S. Qu, Y . Li, Z. Chen, X. Zhu, E. Hua, K. Zhang, N. Ding, and B. Zhou. Medx- pertqa: Benchmarking expert-level medical reasoning and understanding.arXiv preprint arXiv:2501.18362, 2025. 15 χ-Bench Appendix A Ethical Statement 17 B Extended Related Work 17 B.1 Axis Definitions . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . ...

work page internal anchor Pith review Pith/arXiv arXiv 2025
[68]

has been480.00000000000006

This category captures only failures whose proximate cause is independent of the agent: WebSocket gateway abnormal closures (OpenClaw1006), MCP container setup errors (e.g., a missing shared world file), trial-runner exceptions on transport timeouts, and the rare zero-step exits where the agent emitted no actions before the runtime ended the trial. We del...

work page 2024
[69]

Read rubrics.json first

work page
[70]

Read canonical_case_record.json and source evidence files

work page
[71]

For each rubric, read agent_outputs/stage_*.json named by the rubric

work page
[72]

Cite the file path and line range in every verdict explanation

work page
[73]

59 Grading boundaries: - Do not create new deterministic checks beyond what rubrics.json defines

Write verdicts.json incrementally so partial progress is persisted. 59 Grading boundaries: - Do not create new deterministic checks beyond what rubrics.json defines. - Semantic equivalence is acceptable on phrasing. - Fail items where the agent fabricated evidence, cited facts the source documents do not support, or missed a required element. {{role_speci...

work page
[74]

Enumerate-in, enumerate-out: when CONTEXT names items, account for each one explicitly (satisfied / unsatisfied / absent)

work page
[75]

Affirmative claims require primary-source citation: evidence_refs must point to the agent artifact, not the canonical record or rubric

work page
[76]

Literal quotation, no coercion: the field name and value you quote must appear verbatim in the agent’s artifact

work page
[77]

policies/06__bariatric-surgery.md 4.3

Structured evidence (optional): items_credited / items_missing / evidence_quotes alongside the prose explanation. Citation discipline: cite specific file path + line range (e.g., "policies/06__bariatric-surgery.md 4.3", "agent_outputs/stage_md_review.json: decision=approve"). Output schema (verdicts.json): { "rubric_verdicts": { "<rubric_id>": { "pass": <...

work page
[78]

Read canonical_case_record.json, cm_reference.json, chart_slice.json, and task_instruction.md

work page
[79]

Read handbook files when a rubric cites handbook standards

work page
[80]

Send me information

Read the relevant agent_outputs/stage_*.json named by each rubric. ... [grading boundaries, hard-fail rules, verdict discipline, output schema; truncated for length, identical to template above] ... 60 - ANTI-TRIGGER HARD FAIL: when grading rb_outreach_*, cm.outreach.*, or cm_v4 cm.outreach.quality, read per-task persona consent_anti_triggers and cm_refer...

work page

Showing first 80 references.

[1] American Medical Association. 2024 AMA prior authorization physician survey. Presented at the Annual Meeting of the American Medical Association, Chicago, IL, 2024. URL https://www.ama-assn.org/system/files/prior-authorization-survey.pdf

[2] Anthropic. Introducing the Model Context Protocol. https://www.anthropic.com/news/model-context-protocol, 2024. Accessed: 2026-04-30.

[3] Anthropic. Claude Code. https://github.com/anthropics/claude-code, 2025. Accessed: 2026-04-30.

[4] Anthropic. Claude Opus 4.7 system card. https://www.anthropic.com/system-cards. Accessed: 2026-04-30. Covers Claude Opus 4.7, Sonnet 4.6, and Haiku 4.5.

[6] R. K. Arora, J. Wei, R. S. Hicks, P. Bowman, J. Quiñonero-Candela, F. Tsimpourlas, M. Sharman, M. Shah, A. Vallone, A. Beutel, J. Heidecke, and K. Singhal. HealthBench: Evaluating large language models towards improved human health, 2025. URL https://arxiv.org/abs/2505.08775

[7] V. Barres, H. Dong, S. Ray, X. Si, and K. Narasimhan. τ²-bench: Evaluating conversational agents in a dual-control environment, 2025. URL https://arxiv.org/abs/2506.07982

[8] S. Bedi, H. Cui, M. Fuentes, A. Unell, M. Wornow, J. M. Banda, N. Kotecha, T. Keyes, Y. Mai, M. Oez, et al. MedHELM: Holistic evaluation of large language models for medical tasks. arXiv preprint arXiv:2505.23802, 2025.

[9] S. Bedi, R. Welch, E. Steinberg, M. Wornow, T. M. Kim, H. Ahmed, P. Sterling, B. Purohit, Q. Akram, A. Acosta, et al. HealthAdminBench: Evaluating computer-use agents on healthcare administration tasks. arXiv preprint arXiv:2604.09937, 2026.

[10] M. Chen, J. Tworek, H. Jun, Q. Yuan, H. P. d. O. Pinto, J. Kaplan, H. Edwards, Y. Burda, N. Joseph, G. Brockman, et al. Evaluating large language models trained on code. arXiv preprint arXiv:2107.03374, 2021.

[11] A. Cuellar, A. H. Krist, L. M. Nichols, and A. J. Kuzel. Facilitators and barriers to care coordination in patient-centered medical homes (PCMHs) from coordinators’ perspectives. Journal of the American Board of Family Medicine, 31(1):90–101, 2018. doi: 10.3122/jabfm.2018.01.170133. PMC4809054

[12] D. Cutler, E. Wikler, and P. Basch. Reducing administrative costs and improving the health care system. New England Journal of Medicine, 367(20):1875–1878, 2012. doi: 10.1056/NEJMp1209711

[13] DeepSeek-AI. DeepSeek-V4 Pro model card. https://huggingface.co/deepseek-ai/DeepSeek-V4-Pro, 2026. Accessed: 2026-04-30.

[14] A. Drouin, M. Gasse, M. Caccia, I. H. Laradji, M. D. Verme, T. Marty, L. Boisvert, M. Thakkar, Q. Cappart, D. Vazquez, N. Chapados, and A. Lacoste. WorkArena: How capable are web agents at solving common knowledge work tasks? In Proceedings of the 41st International Conference on Machine Learning (ICML), 2024. URL https://arxiv.org/abs/2403.07718

[15] GLM-5 Team. GLM-5: From vibe coding to agentic engineering. arXiv preprint arXiv:2602.15763, 2026.

[16] Google. Gemini CLI. https://github.com/google-gemini/gemini-cli, 2025. Accessed: 2026-04-30.

[17] Google DeepMind. Gemini 3.1 Pro model card. https://deepmind.google/models/model-cards/gemini-3-1-pro/, 2026. Accessed: 2026-04-30. Covers Gemini 3.1 Pro and Gemini 3 Flash.

[18] Harbor Framework. Harbor: A framework for agent evaluations and RL environments. https://github.com/harbor-framework/harbor, 2026. Accessed: 2026-04-30.

[19] Y. Jiang, K. C. Black, G. Geng, D. Park, J. Zou, A. Y. Ng, and J. H. Chen. MedAgentBench: A virtual EHR environment to benchmark medical LLM agents. NEJM AI, 2(9):AIdbp2500144, 2025.

[20] C. E. Jimenez, J. Yang, A. Wettig, S. Yao, K. Pei, O. Press, and K. Narasimhan. SWE-bench: Can language models resolve real-world GitHub issues?, 2024. URL https://arxiv.org/abs/2310.06770

[21] D. Jin, E. Pan, N. Oufattole, W.-H. Weng, H. Fang, and P. Szolovits. What disease does this patient have? A large-scale open domain question answering dataset from medical exams. Applied Sciences, 11(14):6421, 2021. URL https://arxiv.org/abs/2009.13081

[22] Q. Jin, B. Dhingra, Z. Liu, W. W. Cohen, and X. Lu. PubMedQA: A dataset for biomedical research question answering. In Proceedings of the 2019 Conference on Empirical Methods in Natural Language Processing and the 9th International Joint Conference on Natural Language Processing (EMNLP-IJCNLP), pages 2567–2577. Association for Computational Linguistics, 2019. URL https://arxiv.org/abs/1909.06146

[24] A. Jones and C. Kelly. Code execution with MCP: Building more efficient agents, 2025.

[25] H.-H. Ju. Improving care coordination of patients with chronic diseases. The Journal for Nurse Practitioners, 18(9):926–929, 2022. doi: 10.1016/j.nurpra.2022.07.005

[26] L. P. Kaelbling, M. L. Littman, and A. R. Cassandra. Planning and acting in partially observable stochastic domains. Artificial Intelligence, 101(1-2):99–134, 1998.

[27] M. Karam, M.-C. Chouinard, M. Kevork, R. Fleming, and A. Duhoux. Nurses’ and patients’ perspectives on care coordination across health care and social services sectors: A qualitative study. SAGE Open Nursing, 2026. doi: 10.1177/08445621251395347

[28] N. Khandekar, Q. Jin, G. Xiong, S. Dunn, S. S. Applebaum, Z. Anwar, M. Sarfo-Gyamfi, C. W. Safranek, A. A. Anwar, A. Zhang, A. Gilson, M. B. Singer, A. Dave, A. Taylor, A. Zhang, Q. Chen, and Z. Lu. MedCalc-Bench: Evaluating large language models for medical calculations. In Advances in Neural Information Processing Systems 37: Datasets and Benchmarks Track, 2024. URL https://arxiv.org/abs/2406.12036

[30] Kimi Team. Kimi K2: Open agentic intelligence. arXiv preprint arXiv:2507.20534, 2025.

[31] LangChain. DeepAgents. https://github.com/langchain-ai/deepagents, 2025. Accessed: 2026-04-30.

[32] G. Lee, H. Hwang, S. Bae, Y. Kwon, W. Shin, S. Yang, M. Seo, J.-Y. Kim, and E. Choi. EHRSQL: A practical text-to-SQL benchmark for electronic health records. In Advances in Neural Information Processing Systems 35: Datasets and Benchmarks Track, 2022. URL https://arxiv.org/abs/2301.07695

[33] J. Li, W. Zhao, J. Zhao, W. Zeng, H. Wu, X. Wang, R. Ge, Y. Cao, Y. Huang, W. Liu, et al. The tool decathlon: Benchmarking language agents for diverse, realistic, and long-horizon task execution. arXiv preprint arXiv:2510.25726, 2025.

[34] X. Li, W. Chen, Y. Liu, S. Zheng, X. Chen, Y. He, Y. Li, B. You, H. Shen, J. Sun, et al. SkillsBench: Benchmarking how well agent skills work across diverse tasks. arXiv preprint arXiv:2602.12670, 2026.

[35] J. Liu, W. Wang, Z. Ma, G. Huang, Y. Su, K.-J. Chang, W. Chen, H. Li, L. Shen, and M. R. Lyu. MedChain: Bridging the gap between LLM agents and clinical practice through interactive sequential benchmarking. In Advances in Neural Information Processing Systems 38: Datasets and Benchmarks Track, 2025. URL https://arxiv.org/abs/2412.01605

[36] M. A. Merrill, A. G. Shaw, N. Carlini, B. Li, H. Raj, I. Bercovich, L. Shi, J. Y. Shin, T. Walshe, E. K. Buchanan, et al. Terminal-bench: Benchmarking agents on hard, realistic tasks in command line interfaces. arXiv preprint arXiv:2601.11868, 2026.

[37] Modal Labs. Modal: High-performance serverless infrastructure for AI and data. https://modal.com, 2025. Accessed: 2026-04-30.

[38] Nous Research. Hermes Agent: The agent that grows with you. https://github.com/NousResearch/hermes-agent, 2026. Accessed: 2026-04-30.

[39] OpenAI. OpenAI Agents SDK (Python). https://github.com/openai/openai-agents-python, 2025. Accessed: 2026-04-30.

[40] OpenAI. OpenAI Codex CLI. https://github.com/openai/codex, 2025. Accessed: 2026-04-30.

[41] OpenAI. GPT-5.5 system card. https://openai.com/index/gpt-5-5-system-card/. Accessed: 2026-04-30. Covers the GPT-5.5, GPT-5.4, and GPT-5.4 Mini family.

[43] OpenClaw. OpenClaw: Your own personal AI assistant. https://github.com/openclaw/openclaw, 2025. Accessed: 2026-04-30.

[44] A. Pal, L. K. Umapathi, and M. Sankarasubbu. MedMCQA: A large-scale multi-subject multi-choice dataset for medical domain question answering. In Proceedings of the Conference on Health, Inference, and Learning (CHIL), volume 174 of Proceedings of Machine Learning Research, pages 248–260. PMLR, 2022. URL https://arxiv.org/abs/2203.14371

[45] Qwen Team. Qwen3 technical report. arXiv preprint arXiv:2505.09388, 2025.

[46] N. R. Sahni, P. Gupta, M. Peterson, and D. M. Cutler. Active steps to reduce administrative spending associated with financial transactions in US health care. Health Affairs Scholar, 1(5):qxad053, 2023. doi: 10.1093/haschl/qxad053

[47] N. R. Sahni, B. Istvan, and D. M. Cutler. Perceptions of prior authorization burden and solutions. Health Affairs Scholar, 2(9):qxae096, 2024. doi: 10.1093/haschl/qxae096

[48] S. Schmidgall, R. Ziaei, C. Harris, E. Reis, J. Jopling, and M. Moor. AgentClinic: A multimodal agent benchmark to evaluate AI in simulated clinical environments, 2024. URL https://arxiv.org/abs/2405.07960

[49] C. A. Sinsky, L. Colligan, L. Li, M. Prgomet, S. Reynolds, L. Goeders, J. Westbrook, M. Tutty, and G. Blike. Allocation of physician time in ambulatory practice: A time and motion study in 4 specialties. Annals of Internal Medicine, 165(11):753–760, 2016. doi: 10.7326/M16-0961

[50] P. Steinberger. MCPorter: TypeScript runtime and CLI for connecting to MCP servers. https://github.com/steipete/mcporter, 2025. npm package mcporter; accessed 2026-05-03.

[51] R. S. Sutton, D. Precup, and S. Singh. Between MDPs and semi-MDPs: A framework for temporal abstraction in reinforcement learning. Artificial Intelligence, 112(1-2):181–211, 1999.

[52] X. Tang, B. Qian, R. Gao, J. Chen, X. Chen, and M. Gerstein. BioCoder: A benchmark for bioinformatics code generation with large language models. Bioinformatics, 40(Supplement_1):i266–i276, 2024. doi: 10.1093/bioinformatics/btae230. URL https://arxiv.org/abs/2308.16458

[53] X. Tang, D. Shao, J. Sohn, J. Chen, J. Zhang, J. Xiang, F. Wu, Y. Zhao, C. Wu, W. Shi, A. Cohan, and M. Gerstein. MedAgentsBench: Benchmarking thinking models and agent frameworks for complex medical reasoning, 2025. URL https://arxiv.org/abs/2503.07459

[54] H. Trivedi, T. Khot, M. Hartmann, R. Manku, V. Dong, E. Li, S. Gupta, A. Sabharwal, and N. Balasubramanian. AppWorld: A controllable world of apps and people for benchmarking interactive coding agents. In Proceedings of the 62nd Annual Meeting of the Association for Computational Linguistics (ACL). Association for Computational Linguistics, 2024. URL http...

[55] G. Tsatsaronis, G. Balikas, P. Malakasiotis, I. Partalas, M. Zschunke, M. R. Alvers, D. Weissenborn, A. Krithara, S. Petridis, D. Polychronopoulos, Y. Almirantis, J. Pavlopoulos, N. Baskiotis, P. Gallinari, T. Artières, A.-C. N. Ngomo, N. Heino, E. Gaussier, L. Barrio-Alvers, M. Schroeder, I. Androutsopoulos, and G. Paliouras. An overview of the BIOA...

[56] Z. Wang, B. Danek, Z. Yang, Z. Chen, and J. Sun. Can large language models replace data scientists in biomedical research?, 2024. URL https://arxiv.org/abs/2410.21591

[57] M. Wornow, R. Thapa, E. Steinberg, J. A. Fries, and N. H. Shah. EHRSHOT: An EHR benchmark for few-shot evaluation of foundation models. In Advances in Neural Information Processing Systems 36: Datasets and Benchmarks Track, 2023. URL https://arxiv.org/abs/2307.02028

[58] xAI. Grok 4 model card. https://data.x.ai/2025-08-20-grok-4-model-card.pdf. Accessed: 2026-04-30.

[60] T. Xie, D. Zhang, J. Chen, X. Li, S. Zhao, R. Cao, T. J. Hua, Z. Cheng, D. Shin, F. Lei, Y. Liu, Y. Xu, S. Zhou, S. Savarese, C. Xiong, V. Zhong, and T. Yu. OSWorld: Benchmarking multimodal agents for open-ended tasks in real computer environments. In Advances in Neural Information Processing Systems 37: Datasets and Benchmarks Track, 2024. URL https://...

[61] G. Xiong, Q. Jin, Z. Lu, and A. Zhang. Benchmarking retrieval-augmented generation for medicine. In Findings of the Association for Computational Linguistics: ACL 2024. Association for Computational Linguistics, 2024. URL https://arxiv.org/abs/2402.13178

[62] F. F. Xu, Y. Song, B. Li, Y. Tang, K. Jain, M. Bao, Z. Z. Wang, X. Zhou, Z. Guo, M. Cao, M. Yang, H. Y. Lu, A. Martin, Z. Su, L. Maben, R. Mehta, W. Chi, L. Jang, Y. Xie, S. Zhou, and G. Neubig. TheAgentCompany: Benchmarking LLM agents on consequential real world tasks, 2024. URL https://arxiv.org/abs/2412.14161

[63] R. Xu, Y. Zhuang, Y. Zhong, Y. Yu, Z. Wang, X. Tang, H. Wu, M. D. Wang, J. C. Ho, Y. Xiao, W. Shi, and C. Yang. MedAgentGym: A scalable agentic training environment for code-centric reasoning in biomedical data science. In International Conference on Learning Representations (ICLR), 2026. URL https://arxiv.org/abs/2506.04405

[64] S. Yao, N. Shinn, P. Razavi, and K. Narasimhan. τ-bench: A benchmark for tool-agent-user interaction in real-world domains. arXiv preprint arXiv:2406.12045, 2024.

[65] B. Zhang, K. Lazuka, and M. Murag. Equipping agents for the real world with agent skills. Anthropic Engineering Blog, 2025.

[66] S. Zhou, F. F. Xu, H. Zhu, X. Zhou, R. Lo, A. Sridhar, X. Cheng, T. Ou, Y. Bisk, D. Fried, U. Alon, and G. Neubig. WebArena: A realistic web environment for building autonomous agents. In International Conference on Learning Representations (ICLR), 2024. URL https://arxiv.org/abs/2307.13854

[67] Y. Zuo, S. Qu, Y. Li, Z. Chen, X. Zhu, E. Hua, K. Zhang, N. Ding, and B. Zhou. MedXpertQA: Benchmarking expert-level medical reasoning and understanding. arXiv preprint arXiv:2501.18362, 2025.

χ-Bench Appendix

A Ethical Statement
B Extended Related Work
B.1 Axis Definitions ...

This category captures only failures whose proximate cause is independent of the agent: WebSocket gateway abnormal closures (OpenClaw 1006), MCP container setup errors (e.g., a missing shared world file), trial-runner exceptions on transport timeouts, and the rare zero-step exits where the agent emitted no actions before the runtime ended the trial. We del...

- Read rubrics.json first
- Read canonical_case_record.json and source evidence files
- For each rubric, read agent_outputs/stage_*.json named by the rubric
- Cite the file path and line range in every verdict explanation
- Write verdicts.json incrementally so partial progress is persisted.

Grading boundaries:
- Do not create new deterministic checks beyond what rubrics.json defines.
- Semantic equivalence is acceptable on phrasing.
- Fail items where the agent fabricated evidence, cited facts the source documents do not support, or missed a required element.

{{role_speci...

- Enumerate-in, enumerate-out: when CONTEXT names items, account for each one explicitly (satisfied / unsatisfied / absent)
- Affirmative claims require primary-source citation: evidence_refs must point to the agent artifact, not the canonical record or rubric
- Literal quotation, no coercion: the field name and value you quote must appear verbatim in the agent’s artifact
- Structured evidence (optional): items_credited / items_missing / evidence_quotes alongside the prose explanation.

Citation discipline: cite specific file path + line range (e.g., "policies/06__bariatric-surgery.md 4.3", "agent_outputs/stage_md_review.json: decision=approve").

Output schema (verdicts.json): { "rubric_verdicts": { "<rubric_id>": { "pass": <...

- Read canonical_case_record.json, cm_reference.json, chart_slice.json, and task_instruction.md
- Read handbook files when a rubric cites handbook standards
- Read the relevant agent_outputs/stage_*.json named by each rubric.

... [grading boundaries, hard-fail rules, verdict discipline, output schema; truncated for length, identical to template above] ...

- ANTI-TRIGGER HARD FAIL: when grading rb_outreach_*, cm.outreach.*, or cm_v4 cm.outreach.quality, read per-task persona consent_anti_triggers and cm_refer...