
arxiv: 2604.09937 · v1 · submitted 2026-04-10 · 💻 cs.AI

HealthAdminBench: Evaluating Computer-Use Agents on Healthcare Administration Tasks

Pith reviewed 2026-05-10 16:37 UTC · model grok-4.3

classification 💻 cs.AI
keywords HealthAdminBench · computer-use agents · healthcare administration · prior authorization · LLM evaluation · GUI environments · end-to-end workflows · benchmark

The pith

The best computer-use agent completes only 36 percent of full healthcare administrative workflows, even though the strongest agent passes 83 percent of individual subtasks.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper creates HealthAdminBench to measure how well LLM-based computer-use agents perform on realistic end-to-end healthcare administration tasks such as prior authorizations, appeals, and equipment orders. It builds four simulated GUI environments and 135 expert tasks that break down into 1,698 verifiable subtasks. Testing multiple agents reveals strong performance on isolated steps but frequent failure to finish complete workflows. A reader would care because these administrative processes drive over a trillion dollars in annual spending and any reliable automation would directly affect costs and access.

Core claim

HealthAdminBench shows that the best agent reaches only 36.3 percent full-task success while the strongest subtask success rate is 82.8 percent, exposing a clear gap between current agent abilities and the requirements of real-world healthcare administrative workflows.
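A back-of-the-envelope sketch of why a high subtask rate can coexist with a low task rate. It rests on assumptions that are not stated in the paper: a task counts as solved only when all of its subtasks pass, every task has the average number of subtasks (1,698 / 135), and subtask outcomes are independent. Note also that the two headline figures come from different agents, so the numbers below are only an intuition pump, and the function names are hypothetical.

```python
# Illustrative arithmetic only; the independence and "all subtasks must pass"
# assumptions are ours, not the paper's.
N_TASKS = 135        # expert-defined tasks (from the paper)
N_SUBTASKS = 1698    # verifiable subtasks (from the paper)


def mean_subtasks_per_task(n_subtasks: int = N_SUBTASKS, n_tasks: int = N_TASKS) -> float:
    """Average number of subtasks per task."""
    return n_subtasks / n_tasks


def task_success_if_independent(p_subtask: float, k: float) -> float:
    """Probability that k independent subtasks all pass at rate p_subtask."""
    return p_subtask ** k


def implied_per_subtask_rate(p_task: float, k: float) -> float:
    """Per-subtask rate that would yield p_task under the same assumptions."""
    return p_task ** (1.0 / k)


k = mean_subtasks_per_task()
print(f"mean subtasks per task: {k:.2f}")
print(f"task success implied by 82.8% subtask success: {task_success_if_independent(0.828, k):.3f}")
print(f"per-subtask rate implied by 36.3% task success: {implied_per_subtask_rate(0.363, k):.3f}")
```

Under these assumptions an 82.8 percent subtask rate would imply task success of roughly 9 percent, well below the reported 36.3 percent. One reading is that subtask failures cluster in a subset of tasks rather than being spread independently, though the source does not report this directly.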

What carries the argument

The HealthAdminBench benchmark, which consists of four GUI environments (an EHR system, two payer portals, and a fax system) and 135 expert-defined tasks across prior authorization, appeals and denials, and durable medical equipment processing, each decomposed into fine-grained verifiable subtasks.

If this is right

  • Progress on these workflows will require agents to maintain coherence across dozens of sequential GUI actions rather than succeeding on isolated steps.
  • The benchmark supplies a fixed, reproducible testbed that can track whether future agents close the observed reliability gap.
  • Differences in performance across task types and environments can guide targeted improvements in planning, error recovery, and verification.
  • Automation of these administrative flows could eventually reduce the trillion-dollar annual spending if end-to-end success rates rise substantially.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

  • The same evaluation approach could be applied to other regulated domains with complex multi-step GUI interactions, such as insurance claims or regulatory filings.
  • If the gap persists when agents are given richer state representations or longer context, it would point to deeper limitations in sequential decision-making rather than simple interface issues.
  • Real deployment would still need additional layers of human oversight and audit trails beyond what the benchmark measures.

Load-bearing premise

The 135 expert-defined tasks and their subtask decompositions accurately capture the complexity, variability, and verification needs of real-world healthcare administrative workflows.

What would settle it

A controlled evaluation in which any agent configuration achieves above 70 percent end-to-end task success across the full set of 135 tasks under the same prompting and observation conditions.

Figures

Figures reproduced from arXiv: 2604.09937 by Angelic Acosta, Bravim Purohit, Esther Nubla, Ethan Steinberg, Haroun Ahmed, Michael A. Pfeffer, Michael Wornow, Nigam H. Shah, Peter Sterling, Pritika Sharma, Qurat Akram, Ryan Welch, Sanmi Koyejo, Suhana Bedi, Taeil Matthew Kim.

Figure 1. HEALTHADMINBENCH evaluation loop. Each task is executed by an agent through iterative observation, action selection, and interaction with simulated environments (EHR, payer portals, and fax). Task success is determined using a combination of deterministic checks and LLM-based judges.
    […] agents in realistic settings such as CRM and ticketing systems, while frameworks like BrowserGym unify these environments u…

Figure 2. HEALTHADMINBENCH contains four environments which mimic commonly utilized applications for administrative healthcare tasks: (a) an EHR inspired by Epic, (b) a payer portal inspired by Anthem, (c) a portal inspired by Availity, and (d) an eFax inspired by RightFax. These environments are implemented as websites following the REAL framework (Garg et al., 2025).
    […] patient communication, documentation, medical resea…

Figure 3. Performance of evaluated agents on HEALTHADMINBENCH, reported as (a) task success rate and (b) subtask success rate. 95% test-set bootstrap confidence intervals are noted with error bars.
    4.1 PERFORMANCE BY TASK TYPE. As described in Section 3.4, tasks are grouped into three administrative task types: Prior Authorization, Appeals and Denials Management, and DME Order Processing.

Figure 4. Performance of evaluated agents on HEALTHADMINBENCH across prompting and observation settings, reported as task success rate.
    5 DISCUSSION. HEALTHADMINBENCH addresses a key gap in the evaluation of healthcare AI agents by moving beyond static, text-only assessments to realistic administrative workflows that require long-horizon, cross-system interaction. Across all evaluated agents, end-to-end task success…

Figure 5. Violin plots indicating the distribution of steps taken by each agent for easy, medium, and…

Figure 6. Violin plots indicating the cost taken by each agent for easy, medium, and hard tasks on…

Figure 7. Subtask success rate of evaluated agents on H…
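The evaluation loop described in the Figure 1 caption (observe, select an action, act on a simulated environment, then check the outcome) can be sketched as follows. This is a toy illustration, not the benchmark's harness: `ToyEnv`, `scripted_agent`, `run_episode`, and `deterministic_checks` are hypothetical names, the "environment" is a two-field form, and the LLM-based judge mentioned in the caption is omitted.

```python
from dataclasses import dataclass, field
from typing import Callable, Dict, List, Tuple

Action = Tuple[str, ...]


@dataclass
class ToyEnv:
    """Hypothetical stand-in for a simulated portal: a form with required fields."""
    required: Tuple[str, ...] = ("member_id", "cpt_code")
    filled: Dict[str, str] = field(default_factory=dict)
    submitted: bool = False

    def observe(self) -> dict:
        """Return what the agent can see: missing fields and submission state."""
        missing = [name for name in self.required if name not in self.filled]
        return {"missing": missing, "submitted": self.submitted}

    def step(self, action: Action) -> None:
        """Apply one action to the environment state."""
        if action[0] == "fill":
            _, name, value = action
            self.filled[name] = value
        elif action[0] == "submit":
            self.submitted = True


def scripted_agent(obs: dict) -> Action:
    """Placeholder policy; a real agent would query a model here."""
    if obs["missing"]:
        return ("fill", obs["missing"][0], "demo-value")
    if not obs["submitted"]:
        return ("submit",)
    return ("done",)


def run_episode(env: ToyEnv, agent: Callable[[dict], Action], max_steps: int = 20) -> List[Action]:
    """Iterate observation -> action selection -> interaction until done or budget is spent."""
    trajectory: List[Action] = []
    for _ in range(max_steps):
        action = agent(env.observe())
        trajectory.append(action)
        if action[0] == "done":
            break
        env.step(action)
    return trajectory


def deterministic_checks(env: ToyEnv) -> Dict[str, bool]:
    """Verify final state with rule-based checks, one boolean per subtask."""
    return {
        "all_fields_filled": all(name in env.filled for name in env.required),
        "form_submitted": env.submitted,
    }


env = ToyEnv()
print(run_episode(env, scripted_agent))
print(deterministic_checks(env))
```

Keeping the checks separate from the loop mirrors the caption's split between execution and judging; it also lets the same trajectory be re-scored if the verification criteria change.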
Original abstract

Healthcare administration accounts for over $1 trillion in annual spending, making it a promising target for LLM-based computer-use agents (CUAs). While clinical applications of LLMs have received significant attention, no benchmark exists for evaluating CUAs on end-to-end administrative workflows. To address this gap, we introduce HealthAdminBench, a benchmark comprising four realistic GUI environments: an EHR, two payer portals, and a fax system, and 135 expert-defined tasks spanning three administrative task types: Prior Authorization, Appeals and Denials Management, and Durable Medical Equipment (DME) Order Processing. Each task is decomposed into fine-grained, verifiable subtasks, yielding 1,698 evaluation points. We evaluate seven agent configurations under multiple prompting and observation settings and find that, despite strong subtask performance, end-to-end reliability remains low: the best-performing agent (Claude Opus 4.6 CUA) achieves only 36.3 percent task success, while GPT-5.4 CUA attains the highest subtask success rate (82.8 percent). These results reveal a substantial gap between current agent capabilities and the demands of real-world administrative workflows. HealthAdminBench provides a rigorous foundation for evaluating progress toward safe and reliable automation of healthcare administrative workflows.
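To make the two reported metrics concrete, here is a minimal sketch of how a task success rate and a subtask success rate could be computed from per-subtask verdicts. The data layout and function names are hypothetical, the toy verdicts are invented for illustration, and the rule that a task passes only when every one of its subtasks passes is an assumption rather than something the abstract states.

```python
from typing import Dict, List


def subtask_success_rate(results: Dict[str, List[bool]]) -> float:
    """Fraction of all subtask checks that passed, pooled across tasks."""
    flat = [passed for checks in results.values() for passed in checks]
    return sum(flat) / len(flat)


def task_success_rate(results: Dict[str, List[bool]]) -> float:
    """Fraction of tasks whose subtasks ALL passed (assumed success rule)."""
    return sum(all(checks) for checks in results.values()) / len(results)


# Toy verdicts for three hypothetical tasks with four subtasks each.
toy_results = {
    "prior_auth_001": [True, True, True, True],
    "appeal_014": [True, True, True, False],
    "dme_order_007": [True, False, True, True],
}

print(f"subtask success: {subtask_success_rate(toy_results):.3f}")
print(f"task success:    {task_success_rate(toy_results):.3f}")
```

In this toy case ten of twelve subtasks pass while only one of three tasks does, which is the same qualitative pattern the abstract reports at scale.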

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Referee Report

2 major / 2 minor

Summary. The paper introduces HealthAdminBench, a benchmark for LLM-based computer-use agents (CUAs) on healthcare administration. It comprises four simulated GUI environments (EHR system, two payer portals, fax system) and 135 expert-defined tasks across prior authorization, appeals/denials management, and DME order processing. Tasks are decomposed into 1,698 fine-grained, verifiable subtasks. Evaluations of seven agent configurations under varied prompting and observation settings show strong subtask performance (up to 82.8% for GPT-5.4 CUA) but low end-to-end task success (36.3% best for Claude Opus 4.6 CUA), supporting the claim of a substantial gap between current agent capabilities and real-world administrative workflow demands.

Significance. If the benchmark tasks and environments are faithful proxies, this provides a valuable, fine-grained evaluation framework for a high-impact domain accounting for over $1T in annual spending. The expert-defined tasks, multi-environment setup, and subtask decomposition enable targeted diagnosis of agent failures and a reproducible testbed for tracking progress toward reliable automation. The direct empirical measurement of the subtask-to-end-to-end gap is a clear strength.

major comments (2)
  1. [Benchmark construction section] The manuscript provides no validation evidence (e.g., coverage analysis against production logs, inter-rater reliability scores for the 135 task decompositions, or side-by-side comparison of simulated GUI behavior vs. live systems) that the tasks and environments capture real-world variability, policy nuances, or multi-system handoffs. This is load-bearing for the central claim that the 36.3% end-to-end success rate demonstrates a general capability gap rather than a benchmark-specific artifact.
  2. [Results section (agent evaluations)] The reported subtask success rates (e.g., 82.8%) and end-to-end rates are given as point estimates without error bars, run-to-run variance, or sensitivity analysis to the specific subtask decompositions and verification criteria. This weakens confidence in the robustness of the observed gap.
minor comments (2)
  1. [Abstract and methods] The phrase 'multiple prompting and observation settings' is used without enumerating the exact variants tested; these should be listed explicitly with references to the corresponding result rows for reproducibility.
  2. [Table of results] Agent names (e.g., 'Claude Opus 4.6 CUA') should be defined in a legend or footnote to avoid ambiguity across tables and figures.

Simulated Author's Rebuttal

2 responses · 0 unresolved

We thank the referee for their constructive and detailed feedback on our manuscript. We address each major comment point-by-point below, providing honest clarifications based on the work presented and indicating where revisions have been made to strengthen the paper.

Point-by-point responses
  1. Referee: [Benchmark construction section] The manuscript provides no validation evidence (e.g., coverage analysis against production logs, inter-rater reliability scores for the 135 task decompositions, or side-by-side comparison of simulated GUI behavior vs. live systems) that the tasks and environments capture real-world variability, policy nuances, or multi-system handoffs. This is load-bearing for the central claim that the 36.3% end-to-end success rate demonstrates a general capability gap rather than a benchmark-specific artifact.

    Authors: We acknowledge that direct empirical validation against production systems would further strengthen the benchmark. Due to HIPAA regulations and the proprietary nature of live healthcare IT systems, access to production logs for coverage analysis or side-by-side live comparisons was not obtainable. The 135 tasks were developed iteratively by a team of domain experts with direct professional experience in healthcare administration, drawing from CMS guidelines, standard payer policies, and common workflow patterns. We have revised Section 3 to include an expanded 'Task and Environment Construction' subsection detailing the expert review process and how multi-system handoffs (e.g., EHR to payer portal to fax) are modeled. We have also added a dedicated paragraph in the Limitations section explicitly discussing the lack of inter-rater reliability metrics and quantitative fidelity comparisons. These changes provide greater transparency while preserving the claim that the benchmark reveals a meaningful capability gap, as the tasks target core, verifiable administrative processes. revision: partial

  2. Referee: [Results section (agent evaluations)] The reported subtask success rates (e.g., 82.8%) and end-to-end rates are given as point estimates without error bars, run-to-run variance, or sensitivity analysis to the specific subtask decompositions and verification criteria. This weakens confidence in the robustness of the observed gap.

    Authors: We agree that measures of uncertainty and sensitivity would improve confidence in the results. The reported figures reflect single-run evaluations per configuration, driven by the high cost and time of executing long-horizon agent trajectories. In the revised manuscript, we have added bootstrap-derived 95% confidence intervals and standard errors for the primary metrics in the Results section (updated Table 2 and Figure 3). We have also included a new sensitivity analysis in Appendix C that varies subtask verification thresholds and decomposition granularity, confirming that the subtask-to-end-to-end gap remains large and consistent. These additions directly address the concern and demonstrate the robustness of the core finding. revision: yes
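For readers unfamiliar with the bootstrap-derived 95% confidence intervals mentioned in the response above, the following is a minimal percentile-bootstrap sketch over per-task pass/fail outcomes. It is not the authors' procedure, which the text does not specify. The outcome vector is a reconstruction: 49 successes out of 135 is simply the count that rounds to 36.3 percent, since per-task results are not given here. `bootstrap_ci` is a hypothetical helper.

```python
import random
from typing import List, Tuple


def bootstrap_ci(
    outcomes: List[int],
    n_resamples: int = 2000,
    alpha: float = 0.05,
    seed: int = 0,
) -> Tuple[float, float]:
    """Percentile bootstrap interval for the mean of binary outcomes."""
    rng = random.Random(seed)  # fixed seed so the sketch is repeatable
    n = len(outcomes)
    # Resample tasks with replacement and record the success rate each time.
    means = sorted(sum(rng.choices(outcomes, k=n)) / n for _ in range(n_resamples))
    lower = means[int((alpha / 2) * n_resamples)]
    upper = means[min(n_resamples - 1, int((1 - alpha / 2) * n_resamples))]
    return lower, upper


# Reconstructed outcomes: 49 passes and 86 failures (49 / 135 rounds to 36.3%).
task_outcomes = [1] * 49 + [0] * 86
point_estimate = sum(task_outcomes) / len(task_outcomes)
ci_low, ci_high = bootstrap_ci(task_outcomes)

print(f"task success: {point_estimate:.3f}, 95% CI: ({ci_low:.3f}, {ci_high:.3f})")
```

Resampling at the task level treats tasks as the unit of variation. It does not capture run-to-run variance of a stochastic agent, which would need repeated executions of the same task.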

Circularity Check

0 steps flagged

No circularity in benchmark evaluation or claims

The paper introduces HealthAdminBench with 135 expert-defined tasks across four GUI environments and reports direct empirical measurements of agent success rates on those tasks and subtasks. There are no equations, derivations, fitted parameters, predictions, or self-referential quantities in the claimed results. The central finding of low end-to-end reliability (e.g., 36.3% task success) is a straightforward observation on the defined benchmark rather than a reduction to inputs by construction. No self-citation load-bearing steps, uniqueness theorems, or ansatzes appear in the provided text, and the evaluation chain is self-contained as direct measurement.

Axiom & Free-Parameter Ledger

0 free parameters · 2 axioms · 0 invented entities

The central claim rests on domain assumptions that expert-defined tasks and GUI environments faithfully represent real healthcare admin work; no free parameters or invented entities are introduced.

axioms (2)
  • domain assumption Expert-defined tasks and subtasks accurately represent real-world healthcare administrative workflows
    135 tasks spanning Prior Authorization, Appeals and Denials Management, and DME Order Processing are presented as realistic.
  • domain assumption The four GUI environments are realistic proxies for actual production systems
    EHR, two payer portals, and fax system are described as realistic.

pith-pipeline@v0.9.0 · 5577 in / 1518 out tokens · 58714 ms · 2026-05-10T16:37:25.022900+00:00 · methodology

Forward citations

Cited by 1 Pith paper

Reviewed papers in the Pith corpus that reference this work. Sorted by Pith novelty score.

  1. CHI-Bench: Can AI Agents Automate End-to-End, Long-Horizon, Policy-Rich Healthcare Workflows?

    cs.CL 2026-05 unverdicted novelty 6.0

    CHI-Bench shows current AI agents achieve at most 28% success on long-horizon healthcare workflows that require dense policy adherence, multi-role handoffs, and multi-turn interactions.

Reference graph

Works this paper leans on

31 extracted references · 31 canonical work pages · cited by 1 Pith paper

  1. [1]

    REAL: Benchmarking Autonomous Agents on Deterministic Simulations of Real Websites. arXiv preprint arXiv:2504.11543, April 2025

    URL https://arxiv.org/abs/2504.11543. Hongliang He, Wenlin Yao, Kaixin Ma, Wenhao Yu, Yong Dai, Hongming Zhang, Zhenzhong Lan, and Dong Yu. Webvoyager: Building an end-to-end web agent with large multimodal models,

  2. [2]

    text") - Type text at the current cursor focus - type text coord(

    URL https://arxiv.org/abs/2401.13919. Yixing Jiang, Kameron C. Black, Gloria Geng, Danny Park, James Zou, Andrew Y. Ng, and Jonathan H. Chen. Medagentbench: A realistic virtual ehr environment to benchmark medical llm agents. arXiv preprint arXiv:2501.14654, 2025. URL https://arxiv.org/abs/2501.14654. Gyubok Lee, Elea Bach, Eric Yang, Tom Pollard, Ali...

  3. [3]

    Coordinates are in pixels relative to the screenshot

  4. [4]

    Use the screenshot to locate UI elements visually

  5. [5]

    Prefer clicking UI elements instead of typing URLs

  6. [6]

    text") - Type text into an input field - select([id],

    Complete the objective step by step. Base System Prompt (Accessibility Tree Setting): You are an autonomous web agent that can interact with websites by performing actions. Your task is to complete the given objective by analyzing the current page and selecting the appropriate action. AVAILABLE ACTIONS: - click([id]) - Click an element with the specified...

  7. [7]

    Always extract element identifiers from the PAGE ELEMENTS section

  8. [8]

    Only use identifiers that are explicitly shown in PAGE ELEMENTS (e.g., [id])

  9. [9]

    Do not invent or guess identifiers

  10. [10]

    In axtree only mode, PAGE ELEMENTS already includes the full page; scrolling rarely reveals new elements

  11. [11]

    If an element is not in PAGE ELEMENTS, try checking other tabs or sections

  12. [12]

    Complete the objective step by step

  13. [13]

    This setting is designed to measure the extent to which agents can infer appropriate strategies for administrative workflows from the environment alone

    Call done() only when the entire objective is accomplished. A.2 TASK DESCRIPTION PROMPT AMENDMENTS: To evaluate agent performance under minimal guidance, we adopt a prompting setting in which the agent receives only the base system prompt and the task goal. This setting is designed to measure the extent to which agents can infer appropriate strategies for a...

  14. [14]

    click([dropdown-testid]) to open the options list

  15. [15]

    Do NOT use select() --- it will not work

    click([dropdown-testid-option-{value}]) to select the desired option. Do NOT use select() --- it will not work. - Dates: fill the text field with MM/DD/YYYY. DOCUMENT TRANSFER: - Download all required documents in EMR BEFORE navigating to any payer or fax portal. - To open a document: click the “View→” button on the RIGHT side of the document row. Do NO...

  16. [16]

    [EVALUATED] Diagnoses tab→record all ICD-10 codes

  17. [17]

    [EVALUATED] Services tab→record all CPT/HCPCS codes

  18. [18]

    Referral tab→capture referral details

  19. [19]

    [EVALUATED] General tab→scroll to Documents section→for each required doc, click “View→” then Download

  20. [20]

    SCROLL DOWN to find the “Open Portal” button

    [EVALUATED] Coverages tab→capture payer credentials and portal link. SCROLL DOWN to find the “Open Portal” button. - After returning from a payer portal: scroll to Communications→Add Note→fill subject and content (include the confirmation number)→Save→Clear from Worklist. PAYER A: - Eligibility check: Click “Member Eligibility” tab→fill Member ID, Fi...

  21. [21]

    On the DME Orders page (/emr/dme), click on patient Garcia, Sofia

  22. [22]

    Review the order details and note the DME supplier (Option Care Health)

    The referral opens with the Orders tab (Active sub-tab by default). Review the order details and note the DME supplier (Option Care Health)

  23. [23]

    Click the Chart Review tab in the top navigation bar

  24. [24]

    In Chart Review, open the Face-to-Face Evaluation document --- observe that it is dated April 2025, more than 6 months before the order date

  25. [25]

    This indicates the F2F evaluation is expired per Medi-Cal policy (must be within 6 months)

  26. [26]

    Do NOT open the fax portal or send any fax

  27. [27]

    Do NOT clear the referral from the worklist

  28. [28]

    Click the Notes tab in the top navigation bar

  29. [29]

    Emily Foster; order on hold; no fax sent to Option Care Health

    In the Notes tab (right panel, Edit Note), enter a subject and a detailed progress note documenting: patient Sofia Garcia; enteral feeding pump order; expired F2F evaluation (April 2025); Medi-Cal requirement that F2F be within 6 months; need for updated F2F from Dr. Emily Foster; order on hold; no fax sent to Option Care Health

  30. [30]

    To ensure reproducibility regardless of when the benchmark is run, all tasks use a fixed benchmark date (February 25, 2026)

    Click Sign to save the note. A.5 USER PROMPTS: The user prompt presents the agent with the current environment state, including the task objective, interaction metadata, and the most recent observation. To ensure reproducibility regardless of when the benchmark is run, all tasks use a fixed benchmark date (February 25, 2026). This date is injected into every...

  31. [31]

    We find that increase in subtask performance of Qwen-3.5-Kinetic-SFT compared to Claude Opus 4.6 is statistically significant, but that the task performance is not. Baseline Compare Qwen -3.5-Kinetic-SFT Claude Opus 4.6 Qwen-3.5 Claude Opus 4.6 14.3% (-2.9% - 31.4%) Qwen-3.5 22.9% (8.6% - 37.1%) 8.6% (-2.9% - 20.0%) Table 9: Head-to-head differences in ta...
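Several of the extracted fragments above turn on a date rule: a Face-to-Face evaluation dated April 2025 is treated as expired because it must fall within 6 months, and all tasks run against a fixed benchmark date of February 25, 2026. The sketch below shows one way such a window check could be written. It is an illustration under stated assumptions: the day of the month for the evaluation is not given in the source (April 15 is a placeholder), the order date is not given either (the fixed benchmark date stands in for it), and `add_months` and `f2f_is_expired` are hypothetical helpers, not part of the benchmark.

```python
import calendar
from datetime import date

# Fixed benchmark date quoted in the extracted text.
BENCHMARK_DATE = date(2026, 2, 25)


def add_months(d: date, months: int) -> date:
    """Shift a date by whole months, clamping to the last valid day of the month."""
    total = d.year * 12 + (d.month - 1) + months
    year, month_index = divmod(total, 12)
    month = month_index + 1
    day = min(d.day, calendar.monthrange(year, month)[1])
    return date(year, month, day)


def f2f_is_expired(f2f_date: date, reference_date: date, window_months: int = 6) -> bool:
    """True when the evaluation is older than the allowed window at reference_date."""
    return add_months(f2f_date, window_months) < reference_date


# Placeholder day: the source only says the document is dated April 2025.
assumed_f2f_date = date(2025, 4, 15)
print(f2f_is_expired(assumed_f2f_date, BENCHMARK_DATE))
```

Pinning the reference date, as the benchmark does, keeps a check like this deterministic: the same document yields the same verdict no matter when the evaluation is run.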