Adaptive Simulation Experiment for LLM Policy Optimization

Enlu Zhou; Jian-qiang Hu; Mingjie Hu; Siyang Gao

arxiv: 2604.08779 · v1 · submitted 2026-04-09 · 💻 cs.LG

Mingjie Hu, Siyang Gao, Jian-qiang Hu, Enlu Zhou

Pith reviewed 2026-05-10 17:32 UTC · model grok-4.3

classification 💻 cs.LG

keywords: LLM policy optimization · adaptive experiments · pairwise comparisons · optimal policy identification · stochastic simulators · preference models · simulation optimization

The pith

An adaptive procedure identifies the optimal LLM policy from pairwise comparisons while asymptotically matching the minimal data requirements.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The authors treat large language models as stochastic simulators that generate variable responses to the same input. They propose an adaptive experiment called LLM-PO that uses pairwise comparisons between responses to determine which policy among a finite set produces the best outcomes. For policies without additional structure, they calculate exact optimal sampling rates for comparisons; for structured policies based on an underlying preference model, they solve a convex optimization problem to set sampling proportions. The adaptive method updates its sampling as data arrives, guarantees selection of the true best policy with high probability, and approaches the lowest possible number of comparisons needed. This approach matters for operations management because it allows efficient tuning of LLM behaviors without excessive testing.

Core claim

We develop an adaptive experimental procedure, termed LLM-PO, for both policy spaces, and prove that it identifies the optimal policy with the desired statistical guarantee while asymptotically attaining the fundamental data requirements.

What carries the argument

LLM-PO, the adaptive simulation experiment that dynamically allocates pairwise comparisons between candidate policies to identify the optimal one.
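
To make "dynamically allocates pairwise comparisons" concrete, here is a minimal sketch of a generic plug-in tracking loop. It is not the authors' LLM-PO procedure: the function name `tracking_loop`, the `target_fn` callback, and the rule "sample the pair whose count lags its target share the most" are illustrative assumptions, and the stopping rule and any forced exploration are omitted.

```python
import random


def tracking_loop(win_prob, target_fn, budget, seed=0):
    """Generic plug-in tracking sketch (hypothetical, not LLM-PO itself).

    win_prob[i][j] is the assumed true probability that policy i's response
    is preferred over policy j's; it is used only to simulate comparisons.
    target_fn maps current estimates {pair: win rate} to {pair: proportion}.
    """
    rng = random.Random(seed)
    n = len(win_prob)
    pairs = [(i, j) for i in range(n) for j in range(i + 1, n)]
    counts = {pair: 0 for pair in pairs}
    wins = {pair: 0 for pair in pairs}  # wins of the first policy in each pair

    for t in range(budget):
        if t < len(pairs):
            # Initial round: compare every pair once so estimates exist.
            pair = pairs[t]
        else:
            estimates = {p: wins[p] / counts[p] for p in pairs}
            target = target_fn(estimates)
            # Pick the pair that is furthest below its target share.
            pair = max(pairs, key=lambda p: t * target.get(p, 0.0) - counts[p])
        i, j = pair
        counts[pair] += 1
        if rng.random() < win_prob[i][j]:
            wins[pair] += 1
    return counts, wins
```

Passing a `target_fn` that returns equal shares reduces the loop to round-robin sampling; an adaptive design would instead recompute the target proportions from the current estimates at each step.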

Load-bearing premise

LLM responses can be ranked reliably through pairwise comparisons and, in the structured case, follow an underlying preference model.

What would settle it

Running LLM-PO on a set of policies with a known best one and finding that it either selects a suboptimal policy with probability higher than allowed or uses significantly more comparisons than the theoretical lower bound.
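
The first half of that check can be sketched as a small Monte Carlo harness. Everything here is an assumption for illustration: comparisons are simulated from a known win-probability matrix rather than from real LLM responses, `uniform_baseline` is a placeholder procedure (even sampling plus most-wins selection), not LLM-PO, and all function names are hypothetical.

```python
import random


def simulate_comparison(win_prob, i, j, rng):
    """Return True if policy i's response is preferred over policy j's."""
    return rng.random() < win_prob[i][j]


def uniform_baseline(win_prob, budget, rng):
    """Placeholder procedure: cycle through all pairs, pick the policy with most wins."""
    n = len(win_prob)
    pairs = [(i, j) for i in range(n) for j in range(i + 1, n)]
    wins = [0] * n
    for t in range(budget):
        i, j = pairs[t % len(pairs)]
        if simulate_comparison(win_prob, i, j, rng):
            wins[i] += 1
        else:
            wins[j] += 1
    return max(range(n), key=lambda k: wins[k])


def misidentification_rate(procedure, win_prob, best, budget, runs, seed=0):
    """Fraction of independent runs in which the procedure returns a non-best policy."""
    rng = random.Random(seed)
    errors = sum(procedure(win_prob, budget, rng) != best for _ in range(runs))
    return errors / runs
```

To test a procedure against a target error level, one would swap it in for `uniform_baseline` and compare the estimated rate with the allowed probability; checking the number of comparisons against the theoretical lower bound would need the bound itself, which this sketch does not compute.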

Original abstract

Large language models (LLMs) have significant potential to improve operational efficiency in operations management. Deploying these models requires specifying a policy that governs response quality, shapes user experience, and influences operational value. In this research, we treat LLMs as stochastic simulators and propose a pairwise comparison-based adaptive simulation experiment framework for identifying the optimal policy from a finite set of candidates. We consider two policy spaces: an unstructured space with no parametric assumption, and a structured space in which the data are generated from a preference model. For both settings, we characterize the fundamental data requirements for identifying the optimal policy with high probability. In the unstructured case, we derive a closed-form expression for the optimal sampling proportions, together with a clear operational interpretation. In the structured case, we formulate a regularized convex program to compute the optimal proportions. We then develop an adaptive experimental procedure, termed LLM-PO, for both policy spaces, and prove that it identifies the optimal policy with the desired statistical guarantee while asymptotically attaining the fundamental data requirements. Numerical experiments demonstrate that LLM-PO consistently outperforms benchmark methods and improves LLM performance.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Desk Editor's Note: private letter to a colleague

This paper applies standard adaptive best-arm identification to LLM policy selection via pairwise comparisons, with closed-form allocations for the unstructured case and a convex program for the structured preference-model case.

The main point is that the authors treat LLMs as stochastic simulators and build an adaptive procedure called LLM-PO to identify the best policy from a finite set using pairwise comparisons. They split the problem into an unstructured policy space with no extra assumptions and a structured space that assumes responses follow a preference model. In the unstructured case they give a closed-form expression for the optimal sampling proportions along with an operational reading. In the structured case they set up a regularized convex program to find the proportions. They then prove that the adaptive LLM-PO rule identifies the optimal policy with the target probability and asymptotically meets the information-theoretic lower bound on the number of comparisons needed.
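
As a rough illustration of the closed-form allocation in the unstructured case, the sketch below computes proportions of the form quoted in the Corollary 1 passage in the Lean-theorem section of this page, with each suboptimal policy's weight proportional to 1/d*_i. It is not the authors' code. It assumes that d is the Bernoulli KL divergence, that d*_i is the largest divergence from 1/2 among the comparisons that policy i loses, and that the best policy beats every other policy; the helper names and the win-probability matrix layout are hypothetical.

```python
import math


def bernoulli_kl(p, q):
    """KL divergence between Bernoulli(p) and Bernoulli(q), for 0 < q < 1."""
    total = 0.0
    if p > 0:
        total += p * math.log(p / q)
    if p < 1:
        total += (1 - p) * math.log((1 - p) / (1 - q))
    return total


def closed_form_proportions(win_prob, best):
    """Return {(opponent, i): proportion} for each suboptimal policy i.

    win_prob[j][i] is the assumed probability that policy j's response
    is preferred over policy i's. All of the weight for policy i goes to
    the opponent that beats it by the largest divergence (assumption).
    """
    n = len(win_prob)
    d_star = {}
    for i in range(n):
        if i == best:
            continue
        # Opponents that beat policy i, scored by divergence from a coin flip.
        candidates = [
            (bernoulli_kl(win_prob[j][i], 0.5), j)
            for j in range(n)
            if j != i and win_prob[j][i] > 0.5
        ]
        d_star[i] = max(candidates)  # (largest divergence, opponent index)
    norm = sum(1.0 / d for d, _ in d_star.values())
    return {(j, i): (1.0 / d) / norm for i, (d, j) in d_star.items()}
```

Under these assumptions, a policy that is hard to separate from its strongest opponent (small d*_i) receives a larger share of the comparisons, which is one way to read the "operational interpretation" the abstract mentions.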

Referee Report

0 major / 2 minor

Summary. The manuscript introduces an adaptive simulation experiment framework, LLM-PO, for identifying the optimal policy among a finite set of candidates for large language models, which are treated as stochastic simulators. It handles both unstructured policy spaces, deriving closed-form optimal sampling proportions with operational interpretation, and structured spaces where data follow a preference model, using a regularized convex program for optimal proportions. The adaptive procedure is proven to identify the optimal policy with the desired statistical guarantee and to asymptotically attain the fundamental data requirements. Numerical experiments illustrate that LLM-PO outperforms benchmark methods.

Significance. If the results hold, the paper provides a valuable contribution by bridging optimal experimental design with LLM policy optimization in operations management. The closed-form solutions and convex formulations offer practical tools, while the proofs of statistical guarantees and asymptotic efficiency add rigor. This could lead to more efficient deployment of LLMs by minimizing the number of simulations needed for policy selection.

minor comments (2)

The abstract asserts closed-form expressions, convex programs, and proofs of statistical guarantees, but the full manuscript should explicitly restate the core assumptions on LLM response stochasticity and pairwise comparison reliability in the main text to support the central claims.
Numerical experiments are described as demonstrating outperformance, but additional details on the specific LLMs tested, the exact benchmark methods, and any sensitivity analysis to comparison noise would strengthen the supporting evidence.

Simulated Author's Rebuttal

0 responses · 0 unresolved

We thank the referee for the positive and constructive review of our manuscript. We are pleased that the referee recognizes the value of LLM-PO in bridging optimal experimental design with LLM policy optimization, including the closed-form solutions, convex formulations, statistical guarantees, and asymptotic efficiency. The recommendation for minor revision is noted; we will incorporate any necessary polishing in the revised version.

Circularity Check

0 steps flagged

No significant circularity detected

The paper's core chain proceeds from explicit model assumptions (LLMs as stochastic simulators with reliable pairwise rankings) to characterization of information-theoretic data requirements, closed-form optimal proportions in the unstructured case, and a convex program in the structured preference-model case. The adaptive LLM-PO procedure is then shown to achieve finite-sample identification guarantees while asymptotically matching those requirements. No equation or claim reduces by construction to a fitted parameter renamed as a prediction, no self-citation supplies a load-bearing uniqueness result, and no ansatz is smuggled in; the derivation is self-contained against standard optimal experimental design principles with the assumptions explicitly scoped.

Axiom & Free-Parameter Ledger

0 free parameters · 2 axioms · 0 invented entities

The framework rests on standard modeling assumptions from stochastic simulation and preference learning rather than new postulates; no free parameters or invented entities are explicitly introduced in the abstract.

axioms (2)

domain assumption: LLMs can be treated as stochastic simulators.
Central modeling choice that allows policy selection to be cast as simulation optimization.
domain assumption: Pairwise comparisons provide sufficient information to identify the optimal policy.
Basis for both the unstructured and structured identification results.

pith-pipeline@v0.9.0 · 5490 in / 1314 out tokens · 71540 ms · 2026-05-10T17:32:10.263723+00:00 · methodology

Lean theorems connected to this paper

Citations machine-checked in the Pith Canon. Every link opens the source theorem in the public Lean library.

IndisputableMonolith/Cost/FunctionalEquation.lean · washburn_uniqueness_aczel · unclear

Paper passage: Theorem 1 … T*(μ) defined via sup_ω inf_λ ∑ ω_ij d(μ(i,j), λ(i,j)) … Corollary 1 … ω*_{j̃(i),i} = (1/d*_i) / ∑_k (1/d*_k)

IndisputableMonolith/Foundation/RealityFromDistinction.lean · reality_from_one_distinction · unclear

Paper passage: Theorem 2 … U*(μ) via Fisher matrix H(θ*,ω) … Bradley-Terry logistic model
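
The Theorem 2 passage mentions a Fisher matrix H(θ*,ω) under a Bradley-Terry logistic model. A minimal sketch of the textbook version follows, assuming one scalar score per policy so that P(i preferred over j) is the logistic function of θ_i − θ_j. The paper's parameterization may differ (for example, a feature-based preference model), and the function names are hypothetical.

```python
import math


def bt_win_prob(theta_i, theta_j):
    """Bradley-Terry probability that item i is preferred over item j."""
    return 1.0 / (1.0 + math.exp(-(theta_i - theta_j)))


def bt_fisher_information(theta, omega):
    """Fisher information of the scores under sampling proportions omega.

    theta: list of scalar scores, one per policy (assumed parameterization).
    omega: dict {(i, j): proportion of comparisons given to pair (i, j)}.
    Each pair contributes w * p * (1 - p) * (e_i - e_j)(e_i - e_j)^T.
    """
    n = len(theta)
    H = [[0.0] * n for _ in range(n)]
    for (i, j), w in omega.items():
        p = bt_win_prob(theta[i], theta[j])
        v = w * p * (1.0 - p)
        H[i][i] += v
        H[j][j] += v
        H[i][j] -= v
        H[j][i] -= v
    return H
```

Because only score differences are identified in this parameterization, every row of H sums to zero and the matrix is singular; that may be one reason a regularized program is used in the structured case, though this page does not say so.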

What do these tags mean?

matches: The paper's claim is directly supported by a theorem in the formal canon.
supports: The theorem supports part of the paper's argument, but the paper may add assumptions or extra steps.
extends: The paper goes beyond the formal theorem; the theorem is a base layer rather than the whole result.
uses: The paper appears to rely on the theorem as machinery.
contradicts: The paper's claim conflicts with a theorem or certificate in the canon.
unclear: Pith found a possible connection, but the passage is too broad, indirect, or ambiguous to say the theorem truly supports the claim.

Reference graph

Works this paper leans on

4 extracted references · 4 canonical work pages

[1]

Likewise, if p > 1/2, then inf_{q<1/2} d(p, q) = d(p, 1/2), and the infimum is approached by q ↑ 1/2. Applying these facts coordinate-wise, an infimizing sequence over C_i is obtained by setting λ(j, i) ↑ 1/2 for all j ∈ I_1(i), λ(i, h) ↓ 1/2 for all h ∈ I_2(i), while keeping all other pairs unchanged. Consequently, T*(µ)^{-1} = max_{ω∈Ω} min_{i≠i*(µ)} ∑_{j∈I_1(i)} ω_ji d(µ(j,...

work page
[2]

If k < p, then applying (EC.15) to the pair (k, p) gives γ = ρ_p d(µ(k, p), 1/2) + ξ_kp

In either case, the corresponding divergence term is strictly positive. If k < p, then applying (EC.15) to the pair (k, p) gives γ = ρ_p d(µ(k, p), 1/2) + ξ_kp. Since ρ_p > 0, d(µ(k, p), 1/2) > 0, and ξ_kp ≥ 0, it follows that γ > 0. If instead k > p, then applying (EC.15) to the pair (p, k) yields γ = ρ_p d(µ(p, k), 1/2) + ξ_pk. Again, since ρ_p > 0, d(µ(p, k), 1/2) > 0, and ξ_p...

work page
[3]

ec6 e-companion to Hu, Gao, Hu, Zhou: Adaptive Simulation Experiment for LLM Policy Optimization. Since ρ_i = 0 and γ > 0, we obtain ξ_ih = γ > 0

Applying (EC.15) to the pair (i, h) gives ρ_i d(µ(i, h), 1/2) − γ + ξ_ih = 0. Since ρ_i = 0 and γ > 0, we obtain ξ_ih = γ > 0. By complementary slackness (EC.17), it follows that ω_ih = 0, ∀ h ∈ I_2(i). Next, consider any j ∈ I_1(i). Then j < i and µ(j, i) > 1...

work page
[4]

I have a pig, two ducks, and a dog. How many animals do I have?

Applying (EC.15) to the pair (j, i) yields ρ_i d(µ(j, i), 1/2) − γ + ξ_ji = 0. Again, since ρ_i = 0 and γ > 0, we have ξ_ji = γ > 0, and thus ω_ji = 0 by (EC.17). Therefore, ω_ji = 0, ∀ j ∈ I_1(i). Consequently, every term in the i-th constraint of (EC.12) vanishes, so ∑_{j∈I_1(i)} ω_ji d(µ(j, i), 1/2) + ∑_{h∈I_2(i)} ω_ih d(µ(i, h), 1/2) = 0. Since (ν, ω) is primal feasible and (ν, ω)...

work page 2021
