Healthcare Mechanisms from Policy-as-Code Search under Strategic Provider Response

Hongyuan Zha; Wenhao Li; Xiang Xu; Zihan Wang

arxiv: 2605.30680 · v1 · pith:GZXJXR2Znew · submitted 2026-05-29 · 💻 cs.AI · cs.MA

Zihan Wang, Xiang Xu, Hongyuan Zha, Wenhao Li

Pith reviewed 2026-06-28 22:42 UTC · model grok-4.3

classification: cs.AI · cs.MA

keywords: healthcare mechanism design · policy as code · program synthesis · strategic response · multi-agent simulation · incentive design · up-coding · claim rejection

The pith

LLM-guided code search over rule programs yields a mixed-objective healthcare incentive that eliminates up-coding, halves rejections, and keeps most baseline funds.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper treats healthcare payment rules as programs that must be evaluated under the strategic responses of providers rather than under fixed behavior. A multi-agent simulator, Medi-Sim, scores candidate rule programs through five provider channels: coding, selection, delay, effort, and triage. An incentive sweep in the simulator recovers known patterns such as up-coding under profit pressure and pressure migration when one channel is closed. An evolutionary search guided by language models then produces an inspectable program that counters profit-driven distortions and, inside the simulator, improves on a profit baseline while retaining most of its funds.
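
To make "payment rules as programs" concrete, here is a minimal sketch of what a typed, inspectable rule and a strategic coding response might look like. It is not the paper's rule language or the Medi-Sim API: `RuleProgram`, its fields, and `provider_codes_high` are hypothetical names chosen for illustration, and only the coding channel is modeled.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class RuleProgram:
    """Hypothetical typed payment rule; field names are illustrative only."""
    base_payment: float      # paid for every settled episode
    complexity_bonus: float  # extra paid when an episode is coded as high complexity
    audit_rate: float        # probability that a claim is audited, in [0, 1]
    audit_penalty: float     # clawback when an audited claim turns out to be up-coded


def expected_payment(rule: RuleProgram, coded_high: bool, truly_high: bool) -> float:
    """Expected payment to the provider for one episode under `rule`."""
    pay = rule.base_payment + (rule.complexity_bonus if coded_high else 0.0)
    if coded_high and not truly_high:
        # Up-coded claims lose the penalty whenever they are audited.
        pay -= rule.audit_rate * rule.audit_penalty
    return pay


def provider_codes_high(rule: RuleProgram, truly_high: bool) -> bool:
    """Toy strategic response: choose the code with the higher expected payment."""
    return expected_payment(rule, True, truly_high) > expected_payment(rule, False, truly_high)


# Two candidate rules that differ only in audit intensity (made-up numbers).
lax = RuleProgram(base_payment=100.0, complexity_bonus=40.0, audit_rate=0.1, audit_penalty=100.0)
strict = RuleProgram(base_payment=100.0, complexity_bonus=40.0, audit_rate=0.5, audit_penalty=100.0)
```

Under these made-up numbers a low-complexity episode is up-coded under `lax` and coded honestly under `strict`, which is the sense in which a rule has to be judged by the behavior it induces rather than by its text alone.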

Core claim

Framing mechanism design as synthesis of typed rule programs scored by the Medi-Sim simulator recovers classical incentive patterns and allows search to produce a mixed-objective program that removes up-coding, halves rejection rates, and retains most of the funds generated by a profit-oriented baseline.

What carries the argument

LLM-guided evolutionary code search over typed, inspectable rule programs executed inside the five-channel Medi-Sim multi-agent simulator
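
The search loop can be pictured as a mutate, score, keep cycle. In this sketch a random perturbation stands in for the LLM proposal step and `toy_score` stands in for a Medi-Sim rollout. Both are assumptions, as are all names and numbers; the AlphaEvolve search reported in the figures is more elaborate than this.

```python
import random
from typing import List, Tuple

Params = Tuple[float, float]  # (audit_rate, bonus); a hypothetical two-lever rule


def toy_score(params: Params) -> float:
    """Stand-in for a simulator rollout; peaks at audit_rate=0.6, bonus=20."""
    audit_rate, bonus = params
    return -((audit_rate - 0.6) ** 2) - ((bonus - 20.0) / 100.0) ** 2


def mutate(params: Params, rng: random.Random) -> Params:
    """Stand-in for an LLM-proposed edit: a small random change to each lever."""
    audit_rate, bonus = params
    audit_rate = min(1.0, max(0.0, audit_rate + rng.gauss(0.0, 0.1)))
    bonus = max(0.0, bonus + rng.gauss(0.0, 5.0))
    return (audit_rate, bonus)


def evolve(seed_params: Params, generations: int, population: int,
           rng: random.Random) -> Tuple[Params, float, List[float]]:
    """Keep the best-scoring candidate seen so far; record the running best."""
    best, best_score = seed_params, toy_score(seed_params)
    history = [best_score]
    for _ in range(generations):
        for candidate in (mutate(best, rng) for _ in range(population)):
            score = toy_score(candidate)
            if score > best_score:
                best, best_score = candidate, score
        history.append(best_score)
    return best, best_score, history
```

Because only improvements replace the incumbent, `history` is non-decreasing and piecewise constant, which matches the shape of the running-best trace described for Figure 18.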

If this is right

An incentive sweep across rule programs reproduces up-coding and low-complexity patient selection under profit pressure.
Closing the coding channel with a single audit lever more than doubles low-complexity patient selection.
Goodhart-style drift appears where measured performance becomes anti-correlated with true clinical outcomes.
The synthesized mixed-objective program eliminates up-coding while halving rejections relative to the profit baseline.
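
The pressure-migration item in the list above can be illustrated with a deliberately simple target-income toy model: a provider that wants to recover a fixed gain uses coding first and fills the remainder through patient selection. The function name, the parameters, and the numbers are hypothetical. This is not how Medi-Sim models providers, and the magnitudes are not the paper's.

```python
from typing import Dict


def best_response(target_gain: float, coding_capacity: float, audit_on: bool) -> Dict[str, float]:
    """Toy provider: recover `target_gain` via coding first, then via selection."""
    # An active audit closes the coding channel entirely in this toy model.
    coding = 0.0 if audit_on else min(target_gain, coding_capacity)
    # Whatever coding cannot supply migrates to patient selection.
    selection = max(0.0, target_gain - coding)
    return {"coding": coding, "selection": selection}


channel_open = best_response(target_gain=1.0, coding_capacity=0.6, audit_on=False)
channel_closed = best_response(target_gain=1.0, coding_capacity=0.6, audit_on=True)
```

With the audit on, coding falls to zero and the whole target moves to selection, so a lever that looks successful on the coding metric leaves total distortion unchanged in this toy setting.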

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the authors make directly.

The same search approach could be used to design rules in other domains with strategic agents, such as tax compliance or environmental permitting.
Because the output programs are typed and inspectable, regulators can audit exactly which behaviors are rewarded or penalized.
Extending the simulator with additional channels or real claims data would test whether the synthesized program remains effective outside the original five channels.

Load-bearing premise

The five-channel Medi-Sim simulator produces provider equilibria that match real-world responses to the tested incentive rules.

What would settle it

Deploy the synthesized rule program in an actual hospital claims system and measure whether up-coding rates and rejection rates match the simulator outputs under the same payment schedule.
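
One way to run such a comparison, assuming up-coding can be counted as flagged claims both in simulator output and in an audited sample of real claims, is a two-proportion z-test on the two rates. The function below is a standard textbook test, not something the paper describes, and any counts passed to it would be placeholders until real data exist.

```python
import math


def two_proportion_z(events_a: int, n_a: int, events_b: int, n_b: int) -> float:
    """z statistic for the difference between two event rates (pooled variance)."""
    p_a, p_b = events_a / n_a, events_b / n_b
    pooled = (events_a + events_b) / (n_a + n_b)
    se = math.sqrt(pooled * (1.0 - pooled) * (1.0 / n_a + 1.0 / n_b))
    if se == 0.0:
        # Both samples are all-events or no-events; the rates cannot differ.
        return 0.0
    return (p_a - p_b) / se
```

An absolute z value above roughly 1.96 would indicate, at the conventional 5% level, that the simulated and observed rates differ by more than sampling noise.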

Figures

Figures reproduced from arXiv: 2605.30680 by Hongyuan Zha, Wenhao Li, Xiang Xu, Zihan Wang.

**Figure 1.** Medi-Sim IPS and policy-as-code overview. Top: the hospital administrator writes episode-level front-desk, ward, and billing/review rules; stars mark levers refined by AlphaEvolve. Middle: clinician programs respond within locked rules through the Identify–Produce–Settle loop. Bottom: the dashboard reports channel-level diagnostics that guide policy search.

**Figure 2.** L1 incentive phase diagram. Each panel reports 30-seed means over the 11…

**Figure 3.** L2 one-at-a-time policy ablations. Curves report 30-seed means over horizon…

**Figure 4.** Cell-annotated L1 phase diagram. Each cell reports the 30-seed mean of the indicated outcome…

**Figure 5.** Access decomposition on the L1 grid. Panels (a) and (b) split rejection by true complexity…

**Figure 6.** Strategic-delay decomposition on the L1 grid. Panels (a) and (b) report deferral rates separately…

**Figure 7.** Clinical and budgetary outcomes on the L1 grid. Top row: mean clinical effort, effort per unit…

**Figure 8.** Team-level and KPI-level diagnostics on the L1 grid. Panels (a) and (b) report Pearson…

**Figure 9.** L2 bonus-pool ablation. Points are 30-seed means over horizon…

**Figure 10.** L2 steering-off flexible-pool diagnostic.

**Figure 11.** L3 AlphaEvolve validation search curves. Each panel shows generation-best and running-best fitness under a different evaluator with K=200. These curves are diagnostic: the main-text L3 claim rests on held-out rollout profiles in…

**Figure 12.** K = 200 mixed-objective four-method comparison. Bars are means over the five held-out test seeds; whiskers are 95% bootstrap intervals. The AlphaEvolve column is the central qualitative finding: it achieves profit-comparable discounted return while reducing mean bonus and bonus pressure by about 60% relative to Fixed, with doctor margin recovered to within 25% of the Profit baseline.

**Figure 13.** K = 200 pure-profit five-method comparison. Bars are means over the five held-out test seeds. Adding the Neutral baseline reveals that Neutral achieves a lower but still nontrivial return despite substantially higher mean effort and bonus pressure; AlphaEvolve makes a small profit-family refinement visible in return, doctor utility, and doctor margin.

**Figure 14.** K = 200 pure-welfare five-method comparison. Bars are means over the five held-out test seeds. The AlphaEvolve refinement of the welfare family lifts discounted return above all four baselines while keeping mean effort below Quality and lifting doctor margin from near zero to ∼1.4, indicating that the search has trimmed the gold-plating slack identified in §E.4.

**Figure 15.** K = 300 mixed-objective four-method comparison, the longer-budget counterpart of…

**Figure 16.** K = 300 pure-profit four-method comparison. Bars are means over the five held-out test seeds. The longer budget gives AlphaEvolve a small gain over Profit in discounted return, doctor utility, and doctor margin, while preserving the same profit-oriented risk profile.

**Figure 17.** K = 300 pure-welfare four-method comparison. Bars are means over the five held-out test seeds. The welfare AlphaEvolve refinement extends the discounted-return lift seen at K = 200 and continues to suppress Quality-style gold-plating on the effort and bonus panels. … delay gap falls from 0.328 to ∼0.24 as the rejection channel takes over the cost-shedding role the delay channel had carried, and mean wait…

**Figure 18.** K = 300 mixed-objective running-best trace. Each step is a search iteration; the first marker is the neutral seed and later markers indicate iterations at which AlphaEvolve discovered a policy that improved the incumbent search fitness. The trajectory is piecewise constant with three update events at iterations 198, 213, and 273; …
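
Figure 12 reports means over five held-out test seeds with 95% bootstrap intervals. A percentile bootstrap of that kind can be sketched as follows. The resample count, the seeding, and the function name are assumptions, since the captions above do not say how the paper computes its intervals.

```python
import random
import statistics
from typing import Sequence, Tuple


def bootstrap_mean_ci(values: Sequence[float], n_resamples: int = 2000,
                      alpha: float = 0.05, seed: int = 0) -> Tuple[float, float, float]:
    """Return (mean, lower, upper) using a percentile bootstrap of the mean."""
    if not values:
        raise ValueError("values must be non-empty")
    rng = random.Random(seed)
    n = len(values)
    # Resample with replacement and record the mean of each resample.
    means = sorted(
        statistics.fmean(rng.choices(values, k=n)) for _ in range(n_resamples)
    )
    lower = means[round((alpha / 2) * n_resamples)]
    upper = means[round((1 - alpha / 2) * n_resamples) - 1]
    return statistics.fmean(values), lower, upper
```

With only five seeds per bar the resampled means take few distinct values, so intervals built this way are coarse and are best read as rough indications of spread.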

Original abstract

Healthcare mechanisms are inseparable from the strategic provider response they induce: existing healthcare AI benchmarks hold this response fixed and so cannot evaluate mechanisms by the equilibrium they produce. We recast hospital mechanism design as program synthesis for language models: typed, inspectable rule programs are executed and scored by Medi-Sim, a multi-agent simulator with five strategic provider channels (coding, selection, delay, effort, triage). An incentive sweep recovers classical health-economics findings as adjacent regimes -- up-coding and low-complexity-patient selection under profit pressure, and Goodhart-style drift where measured performance becomes anti-correlated with true outcomes -- and a single audit lever exposes pressure migration: closing the coding channel more than doubles low-complexity selection. LLM-guided evolutionary code search over the same rule-program space then synthesizes an inspectable mixed-objective program that eliminates up-coding, halves rejection, and retains most of the profit-oriented baseline's funds.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Desk Editor's Note (private letter to a colleague)

They built a simulator to search for inspectable payment rules that beat baselines inside the model, but the gains rest on untested assumptions about provider behavior.

The main takeaway is that the authors recast mechanism design as synthesizing typed rule programs via LLM-guided search, then score them in a custom multi-agent simulator with five provider response channels. The incentive sweep recovers known patterns like up-coding under profit pressure and Goodhart drift, which at least shows the simulator behaves consistently with classical health-economics expectations.

What works is the concrete output: they produce an inspectable mixed-objective program that, inside the simulator, removes up-coding, cuts rejections in half, and keeps most of the baseline funds. Framing the problem as program synthesis rather than black-box optimization is a reasonable move for domains that need auditability.

The soft spot is the complete absence of external grounding. Nothing in the abstract or described results compares simulated coding rates, selection thresholds, or rejection elasticities to real claims data, audits, or empirical studies. Because the search happens inside the same simulator used for evaluation, any mismatch in how the five channels model actual strategic responses means the synthesized program optimizes an artifact. The circularity burden is real here.

This is for researchers working on AI-assisted policy or mechanism design who want to see program synthesis applied to strategic settings. A reader already thinking about simulator validation or sensitivity checks would find the approach worth examining.

Send it to peer review. The internal consistency check is useful, but referees will need to press on whether the simulator channels can be defended or calibrated against outside evidence.

Referee Report

2 major / 0 minor

Summary. The paper recasts healthcare mechanism design as LLM-guided evolutionary synthesis of typed, inspectable rule programs that are executed and scored inside Medi-Sim, a multi-agent simulator containing five strategic provider channels (coding, selection, delay, effort, triage). An incentive sweep over rule programs recovers classical health-economics patterns (up-coding and low-complexity selection under profit pressure, Goodhart drift, and migration to selection when the coding channel is closed). The search then produces a mixed-objective program claimed to eliminate up-coding, halve rejection rates, and retain most of the funds generated by a profit-oriented baseline.

Significance. If the simulator equilibria match real provider behavior, the method would supply a route to equilibrium-aware, human-inspectable mechanisms that improve on fixed-response benchmarks. The recovery of classical findings supplies internal consistency evidence, and the production of executable rule code is a concrete strength. The absence of any external anchor, however, confines the demonstrated gains to the simulator itself.

major comments (2)

[Abstract] The central performance claims (eliminates up-coding, halves rejection, retains most funds) are obtained by evaluating the synthesized program inside the same Medi-Sim instance used to generate it; the manuscript supplies no comparison of simulated coding rates, selection thresholds, rejection elasticities, or equilibrium outcomes against empirical estimates from claims data, audits, or field studies.

[Abstract] The five strategic channels are introduced without reported calibration to external data or sensitivity analysis showing that the synthesized program's improvements are robust to plausible changes in the functional forms of provider response.

Simulated Authors' Rebuttal

2 responses · 0 unresolved

We thank the referee for the constructive feedback. We address the two major comments point-by-point below.

Referee: [Abstract] The central performance claims (eliminates up-coding, halves rejection, retains most funds) are obtained by evaluating the synthesized program inside the same Medi-Sim instance used to generate it; the manuscript supplies no comparison of simulated coding rates, selection thresholds, rejection elasticities, or equilibrium outcomes against empirical estimates from claims data, audits, or field studies.

Authors: The evaluation is indeed performed entirely within Medi-Sim. This is by design, as the contribution lies in showing that evolutionary program synthesis can discover equilibrium-aware rules that improve on baselines while recovering known incentive problems as adjacent regimes. We have no empirical claims data for this study and therefore cannot supply the requested external comparisons. In revision we will temper the abstract language and insert a dedicated limitations section that states the simulator-internal scope explicitly.

Revision: yes

Referee: [Abstract] The five strategic channels are introduced without reported calibration to external data or sensitivity analysis showing that the synthesized program's improvements are robust to plausible changes in the functional forms of provider response.

Authors: The channel specifications are stylized but are chosen to reproduce documented phenomena from the health-economics literature. The fact that the incentive sweep reproduces up-coding, selection, Goodhart drift, and channel migration supplies qualitative validation of the functional forms. We will add a sensitivity-analysis subsection that varies the main behavioral parameters and confirms that the synthesized rule retains its advantage over the profit baseline across the tested range.

Revision: yes
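
A sensitivity analysis of the kind discussed in this exchange could take the following shape: re-score the candidate and baseline rules while sweeping one behavioral parameter, and report where the advantage changes sign. Everything here is a hypothetical stand-in. `toy_welfare` is not a Medi-Sim score, and `response_elasticity`, the rule fields, and the numbers are invented for illustration.

```python
from typing import Dict, Iterable, List, Tuple

Rule = Dict[str, float]


def toy_welfare(rule: Rule, response_elasticity: float) -> float:
    """Hypothetical stand-in for a rollout score under one behavioral setting."""
    # Up-coding grows with the bonus and with provider responsiveness,
    # and shrinks as the audit rate rises.
    upcoding = max(0.0, response_elasticity * rule["bonus"] * (1.0 - rule["audit_rate"]))
    return 0.5 * rule["bonus"] - upcoding


def sensitivity_sweep(candidate: Rule, baseline: Rule,
                      elasticities: Iterable[float]) -> List[Tuple[float, float]]:
    """Return (elasticity, candidate score minus baseline score) for each setting."""
    return [
        (e, toy_welfare(candidate, e) - toy_welfare(baseline, e))
        for e in elasticities
    ]


candidate = {"bonus": 20.0, "audit_rate": 0.8}
baseline = {"bonus": 40.0, "audit_rate": 0.1}
sweep = sensitivity_sweep(candidate, baseline, [0.1, 0.5, 1.0])
```

In this toy setting the candidate wins when providers are highly responsive and loses when they are not, which is the kind of sign change a sensitivity subsection would need to rule out or report.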

Circularity Check

0 steps flagged

No circularity: explicit simulation-based search and evaluation

The paper performs an incentive sweep inside Medi-Sim to recover classical health-economics patterns as an internal consistency check, then runs LLM-guided search over rule programs scored by the same simulator. This is standard simulation-based mechanism design rather than any derivation, prediction, or first-principles result that reduces to its inputs by construction. No self-definitional equations, fitted parameters renamed as predictions, or load-bearing self-citations appear. The performance claims are simulator outputs, not externally validated, but that is a fidelity issue, not circularity per the enumerated patterns.

Axiom & Free-Parameter Ledger

0 free parameters · 1 axiom · 1 invented entity

The central claim depends on the unvalidated accuracy of the five-channel strategic-response model and on the assumption that evolutionary search over rule programs finds rules whose performance generalizes beyond the simulator.

axioms (1)

Domain assumption: Medi-Sim's five strategic provider channels capture the dominant real-world responses to payment incentives.
The entire evaluation pipeline and the synthesized program rest on this modeling choice.

invented entities (1)

Medi-Sim multi-agent simulator (no independent evidence)
Purpose: execute typed rule programs and score them under strategic provider responses.
New simulator introduced to replace fixed-response benchmarks.

pith-pipeline@v0.9.1-grok · 5685 in / 1285 out tokens · 26989 ms · 2026-06-28T22:42:47.604486+00:00 · methodology
