HealthAdminBench: Evaluating Computer-Use Agents on Healthcare Administration Tasks
Pith reviewed 2026-05-10 16:37 UTC · model grok-4.3
The pith
The best computer-use agent completes only 36 percent of full healthcare administrative workflows, even though the strongest agent on individual steps passes 83 percent of them.
A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.
Core claim
HealthAdminBench shows that the best agent reaches only 36.3 percent full-task success while the strongest subtask success rate is 82.8 percent, exposing a clear gap between current agent abilities and the requirements of real-world healthcare administrative workflows.
What carries the argument
The HealthAdminBench benchmark, which consists of four GUI environments (an EHR system, two payer portals, and a fax system) and 135 expert-defined tasks across prior authorization, appeals and denials, and durable medical equipment processing, each decomposed into fine-grained, verifiable subtasks.
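The two headline metrics can be read off a per-task list of subtask checks. The sketch below is illustrative only: the `TaskResult` structure, the task identifiers, and the rule that a task succeeds only when every subtask check passes are assumptions made for this example, not details confirmed by the paper.

```python
from dataclasses import dataclass


@dataclass
class TaskResult:
    """Outcome of one agent run on one task (hypothetical structure)."""
    task_id: str
    subtask_passed: list[bool]  # one entry per verifiable subtask check

    @property
    def task_success(self) -> bool:
        # Assumption: a task succeeds only if every subtask check passes.
        return all(self.subtask_passed)


def subtask_success_rate(results: list[TaskResult]) -> float:
    """Fraction of all subtask checks passed, pooled across tasks."""
    total = sum(len(r.subtask_passed) for r in results)
    passed = sum(sum(r.subtask_passed) for r in results)
    return passed / total


def task_success_rate(results: list[TaskResult]) -> float:
    """Fraction of tasks in which every subtask check passed."""
    return sum(r.task_success for r in results) / len(results)


# Toy data, not benchmark results.
results = [
    TaskResult("pa-001", [True, True, True, True]),
    TaskResult("dme-002", [True, True, False, True]),
    TaskResult("appeal-003", [True, False, True, True]),
]
print(subtask_success_rate(results))  # 10 of 12 checks passed
print(task_success_rate(results))     # 1 of 3 tasks fully passed
```

In this toy case a single failed check per task is enough to pull the task-level rate far below the pooled subtask rate, which is the pattern the benchmark reports at scale.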
If this is right
- Progress on these workflows will require agents to maintain coherence across dozens of sequential GUI actions rather than succeeding on isolated steps.
- The benchmark supplies a fixed, reproducible testbed that can track whether future agents close the observed reliability gap.
- Differences in performance across task types and environments can guide targeted improvements in planning, error recovery, and verification.
- Automating these administrative flows could eventually reduce the more than $1 trillion spent annually on healthcare administration, provided end-to-end success rates rise substantially.
Where Pith is reading between the lines
- The same evaluation approach could be applied to other regulated domains with complex multi-step GUI interactions, such as insurance claims or regulatory filings.
- If the gap persists when agents are given richer state representations or longer context, it would point to deeper limitations in sequential decision-making rather than simple interface issues.
- Real deployment would still need additional layers of human oversight and audit trails beyond what the benchmark measures.
Load-bearing premise
The 135 expert-defined tasks and their subtask decompositions accurately capture the complexity, variability, and verification needs of real-world healthcare administrative workflows.
What would settle it
A controlled evaluation in which any agent configuration achieves above 70 percent end-to-end task success across the full set of 135 tasks under the same prompting and observation conditions.
Original abstract
Healthcare administration accounts for over $1 trillion in annual spending, making it a promising target for LLM-based computer-use agents (CUAs). While clinical applications of LLMs have received significant attention, no benchmark exists for evaluating CUAs on end-to-end administrative workflows. To address this gap, we introduce HealthAdminBench, a benchmark comprising four realistic GUI environments: an EHR, two payer portals, and a fax system, and 135 expert-defined tasks spanning three administrative task types: Prior Authorization, Appeals and Denials Management, and Durable Medical Equipment (DME) Order Processing. Each task is decomposed into fine-grained, verifiable subtasks, yielding 1,698 evaluation points. We evaluate seven agent configurations under multiple prompting and observation settings and find that, despite strong subtask performance, end-to-end reliability remains low: the best-performing agent (Claude Opus 4.6 CUA) achieves only 36.3 percent task success, while GPT-5.4 CUA attains the highest subtask success rate (82.8 percent). These results reveal a substantial gap between current agent capabilities and the demands of real-world administrative workflows. HealthAdminBench provides a rigorous foundation for evaluating progress toward safe and reliable automation of healthcare administrative workflows.
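A back-of-the-envelope calculation shows why a high subtask rate can coexist with a low task rate. The sketch below uses only the figures quoted in the abstract and rests on strong simplifying assumptions that the paper does not make: every task has the average number of subtasks, subtask outcomes are independent with a uniform pass probability, and a task succeeds only if all of its subtasks pass. Note also that the two headline rates come from different agents.

```python
# Figures quoted in the abstract.
N_TASKS = 135
N_SUBTASKS = 1698
BEST_SUBTASK_RATE = 0.828  # GPT-5.4 CUA
BEST_TASK_RATE = 0.363     # Claude Opus 4.6 CUA (a different agent)

# Average number of evaluation points per task (about 12.6).
mean_subtasks = N_SUBTASKS / N_TASKS

# Assumption: independent subtasks, uniform pass probability,
# and all-or-nothing task success.
naive_task_rate = BEST_SUBTASK_RATE ** mean_subtasks

# Per-subtask pass rate that would reproduce the observed task rate
# under the same assumptions.
implied_subtask_rate = BEST_TASK_RATE ** (1 / mean_subtasks)

print(f"mean subtasks per task: {mean_subtasks:.2f}")
print(f"naive task success at 82.8% per subtask: {naive_task_rate:.3f}")
print(f"per-subtask rate implied by 36.3% task success: {implied_subtask_rate:.3f}")
```

Under these assumptions an 82.8 percent per-step rate compounds to roughly 9 percent at the task level, so per-step errors do not need to be frequent to dominate long workflows. Real subtask failures are unlikely to be independent, so this is an intuition aid rather than a model of the benchmark.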
Editorial analysis
A structured set of objections, weighed in public.
Referee Report
Summary. The paper introduces HealthAdminBench, a benchmark for LLM-based computer-use agents (CUAs) on healthcare administration. It comprises four simulated GUI environments (EHR system, two payer portals, fax system) and 135 expert-defined tasks across prior authorization, appeals/denials management, and DME order processing. Tasks are decomposed into 1,698 fine-grained, verifiable subtasks. Evaluations of seven agent configurations under varied prompting and observation settings show strong subtask performance (up to 82.8% for GPT-5.4 CUA) but low end-to-end task success (36.3% best for Claude Opus 4.6 CUA), supporting the claim of a substantial gap between current agent capabilities and real-world administrative workflow demands.
Significance. If the benchmark tasks and environments are faithful proxies, this provides a valuable, fine-grained evaluation framework for a high-impact domain accounting for over $1T in annual spending. The expert-defined tasks, multi-environment setup, and subtask decomposition enable targeted diagnosis of agent failures and a reproducible testbed for tracking progress toward reliable automation. The direct empirical measurement of the subtask-to-end-to-end gap is a clear strength.
major comments (2)
- [Benchmark construction section] The manuscript provides no validation evidence (e.g., coverage analysis against production logs, inter-rater reliability scores for the 135 task decompositions, or side-by-side comparison of simulated GUI behavior vs. live systems) that the tasks and environments capture real-world variability, policy nuances, or multi-system handoffs. This is load-bearing for the central claim that the 36.3% end-to-end success rate demonstrates a general capability gap rather than a benchmark-specific artifact.
- [Results section (agent evaluations)] The reported subtask success rates (e.g., 82.8%) and end-to-end rates are given as point estimates without error bars, run-to-run variance, or sensitivity analysis to the specific subtask decompositions and verification criteria. This weakens confidence in the robustness of the observed gap.
minor comments (2)
- [Abstract and methods] The phrase 'multiple prompting and observation settings' is used without enumerating the exact variants tested; these should be listed explicitly with references to the corresponding result rows for reproducibility.
- [Table of results] Agent names (e.g., 'Claude Opus 4.6 CUA') should be defined in a legend or footnote to avoid ambiguity across tables and figures.
Simulated Author's Rebuttal
We thank the referee for their constructive and detailed feedback on our manuscript. We address each major comment point-by-point below, providing honest clarifications based on the work presented and indicating where revisions have been made to strengthen the paper.
- Referee: [Benchmark construction section] The manuscript provides no validation evidence (e.g., coverage analysis against production logs, inter-rater reliability scores for the 135 task decompositions, or side-by-side comparison of simulated GUI behavior vs. live systems) that the tasks and environments capture real-world variability, policy nuances, or multi-system handoffs. This is load-bearing for the central claim that the 36.3% end-to-end success rate demonstrates a general capability gap rather than a benchmark-specific artifact.
Authors: We acknowledge that direct empirical validation against production systems would further strengthen the benchmark. Due to HIPAA regulations and the proprietary nature of live healthcare IT systems, access to production logs for coverage analysis or side-by-side live comparisons was not obtainable. The 135 tasks were developed iteratively by a team of domain experts with direct professional experience in healthcare administration, drawing from CMS guidelines, standard payer policies, and common workflow patterns. We have revised Section 3 to include an expanded 'Task and Environment Construction' subsection detailing the expert review process and how multi-system handoffs (e.g., EHR to payer portal to fax) are modeled. We have also added a dedicated paragraph in the Limitations section explicitly discussing the lack of inter-rater reliability metrics and quantitative fidelity comparisons. These changes provide greater transparency while preserving the claim that the benchmark reveals a meaningful capability gap, as the tasks target core, verifiable administrative processes. revision: partial
- Referee: [Results section (agent evaluations)] The reported subtask success rates (e.g., 82.8%) and end-to-end rates are given as point estimates without error bars, run-to-run variance, or sensitivity analysis to the specific subtask decompositions and verification criteria. This weakens confidence in the robustness of the observed gap.
Authors: We agree that measures of uncertainty and sensitivity would improve confidence in the results. The reported figures reflect single-run evaluations per configuration, driven by the high cost and time of executing long-horizon agent trajectories. In the revised manuscript, we have added bootstrap-derived 95% confidence intervals and standard errors for the primary metrics in the Results section (updated Table 2 and Figure 3). We have also included a new sensitivity analysis in Appendix C that varies subtask verification thresholds and decomposition granularity, confirming that the subtask-to-end-to-end gap remains large and consistent. These additions directly address the concern and demonstrate the robustness of the core finding. revision: yes
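As a rough illustration of what a bootstrap-derived 95% confidence interval on a single-run task success rate looks like, the sketch below resamples per-task outcomes with replacement and takes percentile bounds. Several things here are assumptions: the rebuttal does not say which unit is resampled or which bootstrap variant is used, the function name `bootstrap_ci` is made up for this example, and the count of 49 successes out of 135 tasks is inferred from the reported 36.3% rather than taken from the paper.

```python
import random


def bootstrap_ci(outcomes, n_resamples=5_000, alpha=0.05, seed=0):
    """Percentile bootstrap confidence interval for the mean of 0/1 outcomes."""
    rng = random.Random(seed)
    n = len(outcomes)
    # Resample tasks with replacement and record each resample's success rate.
    means = sorted(
        sum(rng.choices(outcomes, k=n)) / n for _ in range(n_resamples)
    )
    lower = means[int((alpha / 2) * n_resamples)]
    upper = means[int((1 - alpha / 2) * n_resamples) - 1]
    return lower, upper


# Assumption: 49 of 135 tasks succeeded, which matches the reported 36.3%.
outcomes = [1] * 49 + [0] * 86
point = sum(outcomes) / len(outcomes)
lower, upper = bootstrap_ci(outcomes)
print(f"task success: {point:.3f} (95% CI {lower:.3f} to {upper:.3f})")
```

Resampling over tasks captures uncertainty from the finite task set only. It does not capture run-to-run variance of the agent itself, which would require repeated runs per task.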
Circularity Check
No circularity in benchmark evaluation or claims
The paper introduces HealthAdminBench with 135 expert-defined tasks across four GUI environments and reports direct empirical measurements of agent success rates on those tasks and subtasks. There are no equations, derivations, fitted parameters, predictions, or self-referential quantities in the claimed results. The central finding of low end-to-end reliability (e.g., 36.3% task success) is a straightforward observation on the defined benchmark rather than a reduction to inputs by construction. No self-citation load-bearing steps, uniqueness theorems, or ansatzes appear in the provided text, and the evaluation chain is self-contained as direct measurement.
Axiom & Free-Parameter Ledger
axioms (2)
- domain assumption: Expert-defined tasks and subtasks accurately represent real-world healthcare administrative workflows
- domain assumption: The four GUI environments are realistic proxies for actual production systems
Forward citations
Cited by 1 Pith paper
- CHI-Bench: Can AI Agents Automate End-to-End, Long-Horizon, Policy-Rich Healthcare Workflows?
CHI-Bench shows current AI agents achieve at most 28% success on long-horizon healthcare workflows that require dense policy adherence, multi-role handoffs, and multi-turn interactions.