
arxiv: 2605.13909 · v1 · pith:DDI4OEIXnew · submitted 2026-05-13 · 💻 cs.GT · cs.AI

TERMS-Bench: Diagnosing LLM Negotiation Agents Beyond Deal Rate

Pith reviewed 2026-05-15 02:51 UTC · model grok-4.3

keywords LLM negotiation · Bayesian games · multi-turn strategy · economic reasoning · agent evaluation · surplus extraction · belief calibration · bilateral bargaining

The pith

A Bayesian-game testbed diagnoses LLM agents in price negotiation by measuring surplus extraction, cue use, and belief calibration rather than deal rate alone.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

Negotiation requires models to reason under hidden preferences and multi-turn strategic talk, yet existing evaluations rely on LLM-vs-LLM play or on aggregate outcomes such as deal rate, which leaves failures opaque. Terms-Bench addresses this by giving the counterpart a specified latent type, policy, and payoff structure that the evaluator can observe but the tested agent cannot. When thirteen frontier and open models negotiate bilateral prices, nearly all reach agreements at high rates, but they differ markedly in how much surplus they capture, how they use the counterpart's cues, and how well they calibrate beliefs about the other side. The framework therefore converts aggregate success into agent-specific failure maps that point to concrete reasoning gaps.

Core claim

Terms-Bench instantiates a Bayesian-game verifier in bilateral price negotiation so that the counterpart's latent type, simulator policy, and payoff structure become observable diagnostics; this setup reveals that frontier LLMs saturate deal rate while diverging on surplus extraction, cue use, belief calibration, and compliance, exposing agent-specific bargaining bottlenecks that aggregate metrics conceal.

What carries the argument

The Bayesian-game framework in bilateral price negotiation, with the counterpart's private state and policy hidden from the agent but known to the evaluator, turning the opponent into an oracle reference for measuring optimality gaps.

If this is right

  • Models can be compared by their distance to the oracle-optimal surplus given the known policy, rather than by deal rate.
  • Failures in belief updating can be separated from failures in communication strategy or constraint compliance.
  • Training interventions can target the specific measured gaps, such as cue interpretation or surplus-maximizing offers.
  • The same verifier structure can be reused across different payoff matrices to test generalization of strategic reasoning.
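As a rough illustration of the first point, surplus efficiency can be sketched as the share of the zone of possible agreement (ZOPA) that the agent captures. The function names, the zero score for disagreement, and the restriction to feasible episodes are assumptions made for this sketch, not the paper's exact definition.

```python
from typing import Iterable, Optional, Tuple

Episode = Tuple[Optional[float], float, float, bool]


def surplus_share(price: Optional[float], agent_reservation: float,
                  counterpart_reservation: float, agent_is_buyer: bool) -> float:
    """Agent surplus as a fraction of ZOPA width.

    Assumption: a disagreement (price is None) scores 0.0.
    """
    if agent_is_buyer:
        # Buyer pays at most its reservation; seller accepts at least its own.
        zopa = agent_reservation - counterpart_reservation
        surplus = None if price is None else agent_reservation - price
    else:
        zopa = counterpart_reservation - agent_reservation
        surplus = None if price is None else price - agent_reservation
    if zopa <= 0:
        raise ValueError("infeasible episode: no ZOPA to normalize by")
    return 0.0 if surplus is None else surplus / zopa


def mean_surplus_efficiency(episodes: Iterable[Episode]) -> float:
    """Average share over feasible episodes (hypothetical aggregation)."""
    shares = [surplus_share(*episode) for episode in episodes]
    return sum(shares) / len(shares)
```

For example, a buyer with reservation 100 facing a seller with reservation 60 who closes at 70 captures 30 of the 40 available units, a share of 0.75.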

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

  • Extending the benchmark to multi-issue or multi-party settings would expose whether current models handle increased dimensionality in hidden information.
  • Agents that extract more surplus against fixed policies may perform better in real markets where counterpart types are drawn from similar distributions.
  • The diagnostic lens suggests that future model releases should report calibration error and cue sensitivity alongside task success rates.

Load-bearing premise

The chosen simulator policy and payoff structure for bilateral price negotiation accurately reflect the strategic and informational features that matter in real negotiations.

What would settle it

Running the same negotiation protocol with human participants and observing whether their patterns of surplus extraction, cue use, and belief updates match or diverge from the LLM distributions would test whether the benchmark's agent-attributable gaps are real or artifacts of the simulator.

Figures

Figures reproduced from arXiv: 2605.13909 by Aneesh Pappu, Batu El, Erica Zhang, Fangzhao Zhang, James Zou, Jiashuo Liu, Jose Blanchet, Susan Athey.

Figure 1: Teaser terminal performance on the synthetic bilateral negotiation benchmark. Surplus efficiency on feasible episodes (SE$^+_\pi$, normalized by ZOPA width) for 13 LLM agents and three fixed-concession baselines (FC-1%, 10%, 30%), each evaluated on the same 1,800 seeded episodes against the TERMS-BENCH simulator. Bars are colored by tier. SE$^+_\pi$ is one of six diagnostic metrics spanning terminal value, agreement …
Figure 2: Overview of TERMS-BENCH. Commercial negotiation settings motivate an environment-verifier evaluation pipeline: an LLM agent negotiates with a fixed simulator whose latent type, policy, and payoff structure are hidden from the agent but observable to the evaluator. Controlled regimes test feasible bargaining, urgency shifts, and no-deal cases, while simulated counterpart families vary whether behavior is dr…
Figure 3: Oracle gap decomposition. Rows are normalized to each family’s base-to-oracle utility gap, so the signed inference, uncertainty, and control components sum to 100%. Negative $\Delta_{\text{inf}}$: posterior injection hurt utility. Negative $\Delta_{\text{ctrl}}$: full-reveal LLM beat the discretized oracle (see §2.3 for details).
Figure 4: Per-family surplus efficiency, cue and inference penalty. CANDID/EXPRESSIVE expose informative cues; paired TACITURN/STRATEGIC mute them. We define the cue penalty $\alpha_{\text{cue}} = \mathrm{SE}^+_\pi(\text{cue-revealing}) - \mathrm{SE}^+_\pi(\text{cue-muted})$. For every LLM, $\alpha_{\text{cue}} < 0$ (…
Figure 5: Behavioral profiles across three axes. Strategic profile (left) shows mean offer price trajectories in the (seller-opens, overlap) regime, where lines are clipped at mean closing round per-agent; diamonds mark mean closing price, and panel annotations report trajectory coefficient $\alpha_\pi$, closer rate $\rho_\pi$, and conditional utility (cond.). Background tints define five bargaining typologies: anchor-and-hold, mid/b…
Figure 6: Data-grounded variant summary (eleven paired models). A. Per-model SE$^+_\pi$ slopegraph from the synthetic suite (left) to the data-grounded suite (right) with 95% bootstrap CIs. Lines are colored green where the data-grounded CI lies entirely above the synthetic CI (gains under data-grounded), pink where it lies entirely below (losses under data-grounded), and grey otherwise; rank order is largely preserved. …
Figure 7: Cash-balance trajectories under the bankroll chain. Per-period mean cash balance for seven LLM merchants across stateful sessions; shaded ribbons show ±1 SEM and the right-edge ladder ranks agents by terminal balance. The dashed line marks the starting bankroll and the dotted line marks the bankruptcy threshold. Five LLMs compound to $380–$443 with full survival; Grok 4.20 reaches $110 (75% survival) and G…
Figure 8: Surplus efficiency across structural environment-difficulty bins. Bins progress from easy to hard. Darker cells indicate higher SE$^+$. The right-hand column reports the percentage drop from the easiest to the hardest bin.
Figure 9: Rank stability across difficulty bins. Panel A quantifies the cliff between bins 0–3 and the hardest bin; Panel B shows where the rank churn actually occurs, with GLM 5.1 (−7) and Doubao 2.0 Pro (+5) the largest moves and Gemini 3.1 Pro (+4) and Gemma 4 31B (+2) taking the top two hardest-bin positions.
Figure 10: Per-episode price geometry, synthetic vs. product-grounded suite (Claude Opus 4.6, n=1800 vs. n=1643; scenarios are deterministic by seed and identical across models within a suite). A. Public price range. B. Absolute ZOPA width on feasible episodes. C. Relative ZOPA width. D. Infeasibility gap on no-deal episodes. The relative ZOPA collapses by about an order of magnitude under data grounding, so agents …
Figure 11: Per-model overall SE$^+_\pi$ in synthetic vs. product-grounded suites with 95% bootstrap CIs. Rank order is largely preserved (Spearman ρ = 0.90); the shift is structured: models in the upper half of the synthetic leaderboard tend to gain or hold (exception: GLM 5.1), while most lower-half models lose (exception: GPT-5.5).
Table columns: Model | Synthetic SE$^+_\pi$ | Product-grounded SE$^+_\pi$ | ∆ (PG−Synth). First row: Claude Opus 4.6 | 0.694 [0.681, …
Figure 12: Persistence of the two structural penalties under product grounding (95% bootstrap CIs, B = 2000). A. $\alpha_{\text{cue}}$ remains negative for 8/11 models but attenuates: the salient product anchor partly insulates agents from verbal-cue over-reaction. B. $\alpha_{\text{inf}}$ becomes negative for all 11 paired models, and amplifies for 10 of 11 (GPT-4o-mini is the exception), with three PG intervals excluding zero: wide, skewed action r…
Figure 13: Runtime and estimated inference cost across evaluated LLM agents on the bilateral price-negotiation instantiation of TERMS-BENCH.
Thus, the reverse-direction analysis preserves the broad ranking while showing that agent-side time pressure compresses the top-end surplus advantage. The effect is not primarily an agreement-rate effect: all three models retain nearly the same AGR$^+_\pi$, while the change appears…
Figure 14: Effect of reversing the urgency-shift direction. The main condition makes the counterpart more urgent; the auxiliary condition makes the evaluated agent more urgent. Error bars show 95% percentile bootstrap confidence intervals over episodes (B = 2000).
Opener-role decomposition. The strategic profile is most visible when the agent makes the first move (the agent-opens cells), bec…
Figure 15: Buyer/seller asymmetry on closing surplus. A. Per-model $\sigma_\pi$ split by agent role with 95% bootstrap CIs (B = 2000, n = 600 per cell). B. Per-model $\Delta\sigma_\pi = \sigma_\pi(\text{seller}) - \sigma_\pi(\text{buyer})$ with 95% two-sample bootstrap CIs, sorted by magnitude. Bars are red where the CI excludes zero, grey otherwise. 12 of 13 point estimates are positive (GPT-4o-mini is the lone negative exception); 10 of 13 individual CIs strictly excl…
Figure 16: Per-family SE$^+_\pi$ (overlap regime, 95% bootstrap CI) under voice-on (red) and voice-off (blue) for the three models with matched ablations. Voice-off matches or exceeds voice-on on every family/model cell except STRATEGIC, the family whose structured cues are collapsed by design and where voice is therefore the only informative signal carrier.
Figure 17: Overall overlap-regime SE$^+_\pi$ (with 95% bootstrap CIs) under each condition. Rank order is preserved and between-model gaps remain larger than within-model voice deltas.
We focus on the three diagnostic axes most directly affected by the linguistic surface: feasible surplus efficiency SE$^+_\pi$, feasible agreement rate AGR$^+_\pi$, and agreement-conditional surplus CSE$^+_\pi$ (which separates pricing quality from dea…
Figure 18: Language-and-reasoning ablation across reveal levels (L0–L3) and voice settings. A. Overall SE$^+_\pi$ with 95% bootstrap CIs (B = 2000, n≈480 feasible episodes per cell). B. Belief error BE$_{\text{type}}$ on the same cells. The vertical dotted line separates L3 (full type reveal, leakage upper bound) from the L0–L2 reveal grid. Two patterns are visible: across L0–L2, GPT-5 gains modestly with information (∼0.08 SE$^+_\pi$, …
Figure 19: Per-metric val and test bars for seed, GEPA-optimised, and full-information oracle. Dashed reference line marks the oracle. Surplus capture only closes a fraction (25/38%) of the seed-to-oracle gap
Figure 20: System prompt used for the buyer agent. The seller agent system prompt is structurally identical, with seller-side utility u(p) = p − reservation_price, IR constraint counterpart_offer ≥ reservation_price, and monotonically non-increasing seller offers.
Figure 21: Product-context block prepended to the standard buyer/seller system prompt in product-grounded runs. The base prompt (…
Figure 22: System and user prompt for the counterpart voice layer. The voice LLM is a strictly cosmetic surface realisation: the simulator’s stochastic policy $\pi_B$ controls all economic outcomes (price, accept/reject, sentiment, stance), and the voice LLM only writes the message consistent with those pre-committed values.
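Several captions above report 95% bootstrap confidence intervals with B = 2000 and define the cue penalty as the difference in surplus efficiency between cue-revealing and cue-muted families. A minimal sketch of a two-sample percentile bootstrap for that difference follows. The function name, the per-episode input lists, and the independent resampling of the two samples are assumptions for illustration; the paper's exact resampling scheme is not shown in this excerpt.

```python
import random
from typing import List, Tuple


def _mean(values: List[float]) -> float:
    return sum(values) / len(values)


def cue_penalty_ci(revealing: List[float], muted: List[float],
                   n_boot: int = 2000, level: float = 0.95,
                   seed: int = 0) -> Tuple[float, float, float]:
    """Point estimate and percentile-bootstrap CI for
    mean(revealing) - mean(muted), resampling each sample independently."""
    rng = random.Random(seed)  # fixed seed keeps the sketch reproducible
    point = _mean(revealing) - _mean(muted)
    diffs = []
    for _ in range(n_boot):
        r = rng.choices(revealing, k=len(revealing))  # resample with replacement
        m = rng.choices(muted, k=len(muted))
        diffs.append(_mean(r) - _mean(m))
    diffs.sort()
    lo_idx = round((1 - level) / 2 * n_boot)
    hi_idx = round((1 + level) / 2 * n_boot) - 1
    return point, diffs[lo_idx], diffs[hi_idx]
```

A negative point estimate whose interval excludes zero would indicate that the agent does worse when informative cues are present, which is the pattern the Figure 4 caption describes.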
Original abstract

Negotiation is a central mechanism of economic exchange, shaping markets, procurement, labor agreements, and resource allocation. It is also a canonical testbed for agentic language models, requiring multi-turn interaction under hidden preferences, strategic communication, and binding constraints. These properties make negotiation hard to evaluate: unlike math or code, it has no intrinsic verifier. Existing LLM negotiation evaluations rely on LLM-vs.-LLM interaction or aggregate outcomes such as deal rate, leaving failures opaque. We introduce Terms-Bench, short for Testbed for Economic Reasoning in Multi-turn Strategy, a Bayesian-game framework that makes the environment itself the verifier by specifying the counterpart's latent type, policy, and payoff structure. We instantiate it in bilateral price negotiation, where the counterpart's private state and simulator policy are hidden from the agent but observable to the evaluator. This turns the counterpart from a black-box opponent into a diagnostic instrument, enabling agent-attributable failure analysis and oracle-reference optimality gaps. Evaluating 13 LLM agents spanning frontier systems from major providers, Terms-Bench turns negotiation evaluation from aggregate ranking into actionable diagnosis: where agents fail, why they fail, and what to strengthen. Empirically, frontier models saturate deal rate yet diverge in surplus extraction, cue use, belief calibration, and compliance, revealing agent-specific bargaining bottlenecks masked by prior benchmarks.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Referee Report

2 major / 2 minor

Summary. The paper introduces Terms-Bench, a Bayesian-game testbed for LLM negotiation agents instantiated in bilateral price negotiation. By making the counterpart's latent type, policy, and payoff structure observable to the evaluator while hidden from the agent, it converts aggregate deal-rate metrics into diagnostics of surplus extraction, cue use, belief calibration, and compliance. Evaluation of 13 frontier and open-source LLMs shows saturation on deal rate but substantial divergence on the four diagnostic dimensions, which the authors attribute to agent-specific bargaining limitations.

Significance. If the simulator policy accurately captures relevant strategic and informational features of real negotiations, the framework offers a verifiable, agent-attributable alternative to opaque LLM-vs-LLM benchmarks and could guide targeted improvements in multi-turn strategic reasoning. The explicit use of an oracle-reference optimality gap is a methodological strength.

major comments (2)
  1. [§3] (Framework definition): The simulator policy and payoff structure are load-bearing for the central claim that divergences are agent-attributable rather than environment-driven, yet no derivation from equilibrium concepts, human data, or validation of cue-generation and belief-update rules is provided; without this grounding, measured gaps on surplus extraction and belief calibration cannot be confidently attributed to the LLMs.
  2. [§4.3] (Empirical results): The reported divergences across the 13 agents on surplus, cue use, and compliance lack statistical significance tests, confidence intervals, or controls for simulator stochasticity, so it is unclear whether the observed agent-specific bottlenecks are robust or could arise from environment variance.
minor comments (2)
  1. [§3.2] Notation for the Bayesian type space and belief updates could be clarified with an explicit table of symbols to aid reproducibility.
  2. [Abstract] The abstract and introduction would benefit from a one-sentence statement of the precise payoff structure used in the bilateral negotiation.

Simulated Author's Rebuttal

2 responses · 0 unresolved

We thank the referee for the constructive feedback and recommendation for major revision. We address each major comment below and will make the corresponding changes to strengthen the manuscript.

Point-by-point responses
  1. Referee: [§3] (Framework definition): The simulator policy and payoff structure are load-bearing for the central claim that divergences are agent-attributable rather than environment-driven, yet no derivation from equilibrium concepts, human data, or validation of cue-generation and belief-update rules is provided; without this grounding, measured gaps on surplus extraction and belief calibration cannot be confidently attributed to the LLMs.

    Authors: We agree that the simulator policy is central to attributing divergences to the agents rather than the environment. The manuscript defines the policy via type-dependent reservation prices and myopic belief updates drawn from standard bilateral bargaining models, but does not include a formal derivation or validation. In revision we will add a dedicated subsection deriving the policy from Bayesian-game equilibrium concepts, include sensitivity checks across alternative cue-generation rules, and report results under perturbed simulator parameters to confirm attribution. revision: yes

  2. Referee: [§4.3] (Empirical results): The reported divergences across the 13 agents on surplus, cue use, and compliance lack statistical significance tests, confidence intervals, or controls for simulator stochasticity, so it is unclear whether the observed agent-specific bottlenecks are robust or could arise from environment variance.

    Authors: We acknowledge the lack of statistical tests and controls for stochasticity in the reported results. Although metrics were averaged over repeated runs, no confidence intervals or significance tests were provided. In the revised version we will add bootstrap confidence intervals for all four diagnostic metrics, conduct paired statistical tests across agents while averaging over 200 simulator seeds, and include variance decomposition to isolate agent effects from environment noise. revision: yes

Circularity Check

0 steps flagged

No significant circularity; benchmark definitions and metrics are independent of fitted inputs or self-citation chains.

Full rationale

The paper introduces Terms-Bench as a Bayesian-game framework for bilateral price negotiation, specifying counterpart latent type, policy, and payoff structure to enable diagnostic evaluation. No equations, fitted parameters, or self-citations are presented in the abstract or described derivation that reduce reported metrics (deal rate, surplus extraction, cue use, belief calibration, compliance) to quantities defined by the authors' own prior work. The central claim—that frontier models saturate deal rate but diverge on agent-specific diagnostics—rests on empirical evaluation within the explicitly constructed simulator, which is presented as a novel verifier rather than a tautological restatement of inputs. This satisfies the default expectation of a self-contained benchmark paper with no load-bearing circular steps.

Axiom & Free-Parameter Ledger

0 free parameters · 1 axiom · 1 invented entity

The central claim rests on standard Bayesian-game assumptions plus the new testbed itself; no free parameters are mentioned and no new physical or mathematical entities are postulated beyond the benchmark definition.

axioms (1)
  • Domain assumption: The negotiation environment can be faithfully modeled as a Bayesian game in which the evaluator observes the counterpart's latent type and policy while the agent does not.
    Invoked in the description of the bilateral price negotiation instantiation; this is the load-bearing premise that allows the environment to serve as verifier.
invented entities (1)
  • Terms-Bench testbed (no independent evidence)
    Purpose: To convert the negotiation counterpart from black-box opponent into diagnostic instrument with observable ground truth.
    Newly introduced framework; independent evidence is the empirical evaluation of 13 agents, but the entity itself is defined by the paper.




Reference graph

Works this paper leans on

57 extracted references · 57 canonical work pages

  1. [1]

    The agent chooses Accept; the outcome is the counterpart’s last offered price.

  2. [2]

    The agent chooses Reject; the outcome is disagreement ⊥.

  3. [3]

    The counterpart accepts the agent’s offer; the outcome is the agent’s proposed price.

  4. [4]

    The counterpart terminally rejects (walk-away); the outcome is disagreement ⊥.

  5. [5]

    The round limit $K$ is reached without agreement; the outcome is disagreement ⊥. Constraints. All Offer actions must satisfy: (i) price bounds $p_{\min} \le p_k \le p_{\max}$; (ii) monotonic concession: for buyer agents, $p_k \ge p_{k-1}$, where $k$ and $k-1$ index the agent’s own offers; for seller agents, $p_k \le p_{k-1}$; (iii) turn budget $k \le K$. Violations of (i) or of individual rationality (acc...
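The three offer constraints listed above can be checked mechanically. The following is a minimal sketch, not the benchmark's own validator: the function name and the returned `(index, rule)` pairs are hypothetical, and treating the index of the agent's own offer as the turn counter for the budget check is an assumption.

```python
from typing import List, Tuple


def offer_violations(offers: List[float], is_buyer: bool,
                     p_min: float, p_max: float,
                     max_rounds: int) -> List[Tuple[int, str]]:
    """Return (offer index, rule name) for each violated constraint.

    offers: the agent's own offer sequence, indexed from 1.
    """
    violations = []
    for k, price in enumerate(offers, start=1):
        # (i) price bounds: p_min <= p_k <= p_max
        if not (p_min <= price <= p_max):
            violations.append((k, "bounds"))
        # (ii) monotonic concession: buyers never lower, sellers never raise
        if k > 1:
            previous = offers[k - 2]
            conceding = price >= previous if is_buyer else price <= previous
            if not conceding:
                violations.append((k, "monotonicity"))
        # (iii) turn budget: k <= K (assumed to count the agent's own offers)
        if k > max_rounds:
            violations.append((k, "budget"))
    return violations
```

A buyer sequence such as `[50, 55, 54]` would be flagged at the third offer, since the buyer moved its price back down.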

  6. [6]

    Candid counterparts [type-instrumental economics, accurate cues]. This family replaces the earlier Truthful family. The counterpart’s economic behavior is low-noise and strongly type-conditioned: reservation value sets the feasible boundary, urgency changes acceptance and concession timing, and stance changes the payoff consequences of rigidity and concess...

  7. [7]

    Taciturn counterparts [type-instrumental economics, uninformative cues]. Economic behavior follows the same type-instrumental preset as Candid, but the cue channel is collapsed to neutral, noncommittal states. This isolates inference from economic behavior alone: an agent that degrades relative to Candid is relying heavily on linguistic or stylistic cues.

  8. [8]

    Expressive counterparts [high-reactivity economics, accurate cues]. The cue channel remains accurate, but economic behavior is more strongly history-reactive. Counter-offers and acceptance probabilities respond more to the agent’s recent concession pattern and rigidity. This family tests whether agents can use reliable cues while avoiding confusion between...

  9. [9]

    Strategic counterparts [high-reactivity economics, uninformative cues]. Economic behavior is strongly history-reactive, and the cue channel is uninformative. The counterpart is linguistically guarded while adapting tactically through price and acceptance behavior. This is the hardest core family for opponent modeling because both the economic and language ...

  10. [10]

    Adversarial counterparts [hardball economics, pressuring cues]. This family is an explicit stress test rather than part of the core factorial. The stance prior is skewed toward aggressive counterparts, economic reactivity is high, concessionary behavior is strongly exploited, and rigidity is punished for aggressive types. The cue channel is biased toward n...

  11. [11]

    Stochastic counterparts [moderate-reactivity economics, noisy/weak cues]. This family degrades both the price and cue channels through noise rather than through deliberate strategic concealment. Economic behavior uses a moderate reactivity preset, but price noise is high, so offer trajectories are less diagnostic of the underlying concession rule. The cue ...
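The bracketed tags on the six families can be read as a small configuration table, in which the cue-revealing and cue-muted pairs share an economics preset. The sketch below encodes only those tags; the class name, field names, and the pairing helper are hypothetical and carry none of the simulator's actual parameters.

```python
from dataclasses import dataclass
from typing import List, Sequence, Tuple


@dataclass(frozen=True)
class CounterpartFamily:
    name: str
    economics: str  # economic-behavior preset, as tagged in the text
    cues: str       # quality of the cue channel, as tagged in the text


FAMILIES = (
    CounterpartFamily("Candid", "type-instrumental", "accurate"),
    CounterpartFamily("Taciturn", "type-instrumental", "uninformative"),
    CounterpartFamily("Expressive", "high-reactivity", "accurate"),
    CounterpartFamily("Strategic", "high-reactivity", "uninformative"),
    CounterpartFamily("Adversarial", "hardball", "pressuring"),
    CounterpartFamily("Stochastic", "moderate-reactivity", "noisy/weak"),
)


def cue_pairs(families: Sequence[CounterpartFamily] = FAMILIES) -> List[Tuple[str, str]]:
    """Pair each accurate-cue family with the uninformative-cue family
    that shares its economics preset."""
    pairs = []
    for revealing in families:
        if revealing.cues != "accurate":
            continue
        for muted in families:
            if muted.cues == "uninformative" and muted.economics == revealing.economics:
                pairs.append((revealing.name, muted.name))
    return pairs
```

Under this reading, Adversarial and Stochastic have no partner because no other family shares their economics preset, which matches the description of Adversarial as a stress test outside the core factorial.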

  12. [12]

    Opener role $\chi \in \{\text{AgentOpens}, \text{CounterpartOpens}\}$, assigned at episode start and constant across rounds.

  13. [13]

    History-reactive features $\phi_k := (\mathrm{ConcedeSpeed}_k, \mathrm{Rigidity}_k, \mathrm{ConcedeMagnitude}_k)$, which are deterministic functions of the agent’s past offer sequence (Appendix C.3) and parameterize the counterpart’s acceptance probability, walk-away hazard, and concession rate.

  14. [14]

    Offer-history summary $h^B_k := (p^B_k, p^B_{k-1})$, which records the counterpart’s current and previous offers. These are needed to evaluate acceptance utility $u_A(p^B_k)$, to compute the price-likelihood mean (28), to derive the counterpart’s concession magnitude $C^B_k$ used in the strategic-cue model, and to specify the role-dependent monotone feasible interv...

  15. [15]

    Projection onto $\mathcal{B}^B = [a_0, b_0]$ produces a mixed distribution with point masses at the two endpoints:

    $$
    f_{\text{open}}(p^B_1 \mid t^B, d_0, e) =
    \begin{cases}
    \Phi\!\left(\dfrac{a_0 - \mu_0(t^B, d_0, e)}{\sigma_0}\right), & p^B_1 = a_0, \\[2ex]
    \dfrac{1}{\sigma_0}\,\phi\!\left(\dfrac{p^B_1 - \mu_0(t^B, d_0, e)}{\sigma_0}\right), & a_0 < p^B_1 < b_0, \\[2ex]
    1 - \Phi\!\left(\dfrac{b_0 - \mu_0(t^B, d_0, e)}{\sigma_0}\right), & p^B_1 = b_0,
    \end{cases}
    \tag{39}
    $$

    with $(a_0, b_0) = (r^B, p_{\max})$ for a seller counterpart and $(p_{\min}, r^B)$ for a buyer. I...
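The opening-offer law in (39) is a normal distribution projected onto an interval, so it carries probability mass at both endpoints and an ordinary density in between. A minimal standard-library sketch follows; the function names are hypothetical, and the mean is passed in as a plain number because the form of the type-dependent mean is not shown in this excerpt.

```python
import math


def normal_pdf(z: float) -> float:
    """Standard normal density."""
    return math.exp(-0.5 * z * z) / math.sqrt(2.0 * math.pi)


def normal_cdf(z: float) -> float:
    """Standard normal distribution function."""
    return 0.5 * (1.0 + math.erf(z / math.sqrt(2.0)))


def opening_offer_density(p: float, mu0: float, sigma0: float,
                          a0: float, b0: float) -> float:
    """Point mass at a0 or b0, interior density on (a0, b0), zero outside.

    mu0 stands in for the type-dependent mean (supplied by the caller).
    """
    if p < a0 or p > b0:
        return 0.0
    if p == a0:
        return normal_cdf((a0 - mu0) / sigma0)        # mass pushed up to a0
    if p == b0:
        return 1.0 - normal_cdf((b0 - mu0) / sigma0)  # mass pushed down to b0
    return normal_pdf((p - mu0) / sigma0) / sigma0
```

A quick sanity check on any such implementation is that the two endpoint masses plus the integral of the interior density add up to one.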

  16. [16]

    Base agent. The evaluated LLM observes the standard benchmark interface and must infer $t^B$ from prices, actions, and messages.

  17. [17]

    Oracle-posterior agent. The evaluated LLM observes the standard interface plus $z^{\text{post}}_k$ at each round. This removes posterior-formation error while preserving uncertainty over $t^B$.

  18. [18]

    Revealed-type agent. The evaluated LLM is given the true latent type $t^B$ directly. This removes both posterior-formation error and residual latent-state uncertainty.

  19. [19]

    Model-based oracle. The dynamic-programming policy $\pi^\star$ acts from the oracle belief state and the known simulator model. This removes LLM planning, execution, and prompt-following errors. These conditions form an intervention ladder. Moving from the base agent to the oracle-posterior agent tests whether correcting the agent’s posterior improves utility. Movi...
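The four conditions above suggest a telescoping decomposition of the base-to-oracle utility gap, in line with the Figure 3 caption, where signed inference, uncertainty, and control components sum to 100%. The sketch below is one reading of that ladder; the function name, the mapping of each rung to a component, and the example utilities are assumptions, not reported values.

```python
from typing import Dict


def oracle_gap_shares(u_base: float, u_posterior: float,
                      u_revealed: float, u_oracle: float) -> Dict[str, float]:
    """Split the base-to-oracle gap into signed shares that sum to 1.

    inference:   base agent       -> oracle-posterior agent
    uncertainty: oracle-posterior -> revealed-type agent
    control:     revealed-type    -> model-based oracle
    """
    total = u_oracle - u_base
    if total == 0:
        raise ValueError("no base-to-oracle gap to decompose")
    return {
        "inference": (u_posterior - u_base) / total,
        "uncertainty": (u_revealed - u_posterior) / total,
        "control": (u_oracle - u_revealed) / total,
    }
```

With illustrative utilities of 0.50, 0.45, 0.70, and 0.90, the inference share is negative (injecting the posterior lowered utility), while the three shares still sum to one.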

  20. [20]

    Near-universal seller advantage on closing surplus. 12 of 13 LLMs extract more closing surplus as seller than as buyer (median $\Delta\sigma_\pi = +0.037$; sign-test $p = 0.0017$, exact paired Wilcoxon $p = 0.0061$); GPT-4o-mini is the lone exception, with $\Delta\sigma_\pi = -0.063$. Among the 12 positive agents the magnitudes are highly model-dependent, from +0.014 (Grok 4.20) to +0.13...

  21. [21]

    Compensating agreement-rate dip. In the opposite direction, sellers close fewer deals: median $\Delta\mathrm{AGR}^+_\pi = -0.010$, with 0/13 models showing seller > buyer (paired Wilcoxon $p = 0.0005$). The dip is most pronounced for the strongest anchor-and-hold agents: GLM 5.1 reaches feasible AGR$^+_\pi$ = 1.000 as buyer but 0.902 as seller, and Claude Opus 4.7 drops from 0.998 ...

  22. [22]

    Net effect on SE$^+_\pi$ remains positive for 12 of 13. The seller-side $\sigma_\pi$ gain dominates the agreement-rate drop in SE$^+_\pi$ terms for the same 12 of 13 agents (median $\Delta\mathrm{SE}^+_\pi = +0.032$, paired Wilcoxon $p = 0.0100$); the lone exception is GPT-4o-mini (−0.065). Among the 12 agents that do show a seller advantage, the heterogeneity in magnitude tracks opening-pric...

  23. [23]

    The typology is preserved across role. Despite the universal $\sigma_\pi$ asymmetry, no agent crosses a typology boundary by role: every agent’s qualitative profile (anchor-and-hold, mid/balanced, anchor-...

  24. [24]

    Open inside the public price range and at least 20% away from your reservation price toward your favorable extreme

  25. [25]

    Concede in small, decreasing increments -- never more than 30% of the distance to the counterpart’s most recent offer

  26. [26]

    If the counterpart’s price has not crossed your reservation by round K-1, prefer Reject over a negative-utility Accept. Listing 2: Final GEPA-optimised strategy_patch. SYSTEM INSTRUCTIONS FOR TERMS-Bench BILATERAL NEGOTIATION AGENT ======================================...

  27. [27]

    I n d i v i d u a l R a t i o n a l i t y ( IR ) : Never Accept a price worse than r e s e r v a t i o n

  28. [28]

    Bounds : Every Offer must be within [ p_min , p_max ]

  29. [29]

    M o n o t o n i c i t y : S u c c e s s i v e offers must move toward the c o u n t e r p a r t ( buyer non - decreasing , seller non - i n c r e a s i n g )

  30. [30]

    Budget : Never exceed r e s e r v a t i o n in an offer you ’ d accept

  31. [31]

    ST RA TE GY G UI DA NC E = = = = = = = = = = = = = = = = = ### Opening ( when you open first )

    Valid Actions : Always output exactly one of Offer X / Accept / Reject . ST RA TE GY G UI DA NC E = = = = = = = = = = = = = = = = = ### Opening ( when you open first )

  32. [32]

    Start at a p p r o x i m a t e l y 20 -30% of the range from your f a v o r a b l e extreme

    Open a g g r e s s i v e l y but within [ p_min , p_max ]. Start at a p p r o x i m a t e l y 20 -30% of the range from your f a v o r a b l e extreme . - BUYER : ~ p_min + 0.2 * ( p_max - p_min ) . - SELLER : ~ p_max - 0.2 * ( p_max - p_min )

  33. [33]

    ### R e s p o n d i n g to the counterpart ’ s first offer

    If r e s e r v a t i o n is extreme ( very high as buyer / very low as seller ) , open further from r e s e r v a t i o n to m ax imi se p o t e n t i a l utility . ### R e s p o n d i n g to the counterpart ’ s first offer

  34. [34]

    This avoids wa lk aw ay risk

    Accept i m m e d i a t e l y if the offer already gives you very good utility ( >50% of max ) . This avoids wa lk aw ay risk

  35. [35]

    ### Concession pattern

    Counter-offer if utility is < ~35-40% of max OR you believe you can extract more given remaining rounds. ### Concession pattern

  36. [36]

    Never concede more than 30% of the distance between your last offer and the counterpart’s

    Concede in small, decreasing increments. Never concede more than 30% of the distance between your last offer and the counterpart’s
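The 30% cap on a single concession, combined with the rule of never conceding past one's own reservation, can be sketched as a clamp on the next offer. This is an illustrative reading rather than the paper's implementation: `capped_concession` and `desired_step` are hypothetical names, and the "decreasing increments" schedule is left to the caller.

```python
MAX_STEP_FRACTION = 0.30  # cap taken from the quoted guidance


def capped_concession(role: str, my_last: float, their_last: float,
                      reservation: float, desired_step: float) -> float:
    """Next offer after applying the 30% gap cap and the reservation limit."""
    if role not in ("buyer", "seller"):
        raise ValueError("role must be 'buyer' or 'seller'")

    # If the counterpart has already met or crossed our last offer,
    # no further concession is needed.
    if role == "buyer" and their_last <= my_last:
        return my_last
    if role == "seller" and their_last >= my_last:
        return my_last

    gap = abs(their_last - my_last)
    step = min(max(desired_step, 0.0), MAX_STEP_FRACTION * gap)

    if role == "buyer":
        return min(my_last + step, reservation)  # move up, never past reservation
    return max(my_last - step, reservation)      # move down, never past reservation
```

For instance, a buyer at 70 facing a counterpart at 120 can move at most 15 in one turn, so a requested step of 40 is reduced and the next offer is about 85.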

  37. [37]

    In later rounds, make token concessions (0.5-2% of range) to signal firmness

    Slow concessions as you approach reservation. In later rounds, make token concessions (0.5-2% of range) to signal firmness

  38. [38]

    ### Endgame

    Never concede past your reservation price. ### Endgame

  39. [39]

    At round K-1 / K, if the counterpart has not crossed your reservation, Reject rather than accept a negative-utility deal

  40. [40]

    Repeated offers at reservation signal your limit

    At your reservation, hold firm. Repeated offers at reservation signal your limit

  41. [41]

    ### Detecting no-deal scenarios

    If the counterpart is barely moving and offers do not converge to a feasible zone, prepare for no agreement; correctly identifying a no-deal scenario scores 1.0. ### Detecting no-deal scenarios

  42. [42]

    Continue negotiating normally; Reject confidently if their final offer is unacceptable

    If counterpart offers stay far from your reservation with minimal movement, the scenario likely has no ZOPA. Continue negotiating normally; Reject confidently if their final offer is unacceptable. ### Accepting

  43. [43]

    Accept when the offer gives positive utility AND further negotiation is unlikely to yield meaningfully better results

  44. [44]

    Be more willing to accept early at high utility (>60% of max) -- this locks in gains and avoids walkaway risk

  45. [45]

    ### Message strategy

    Be cautious about accepting too quickly at mediocre utility (<40%) -- you may leave significant value on the table. ### Message strategy

  46. [46]

    near my limit

    Keep messages concise, professional, strategic: - Reference market comps, budget constraints, demand, alternatives. - Ask about their constraints to gather information. - Signal urgency / willingness to close when conceding. - Late rounds: signal firmness ("near my limit",...

  47. [47]

    " - Accept : Accept msg =

    Your opening offer matters enormously; many negotiations conclude in 1-2 rounds. Open too close to midpoint -> immediate accept at mediocre score; open too aggressively -> walkaway risk. - When the counterpart opens favorably (below reservation as buyer, above as seller ...

  48. [48]

    They calibrate the plausible valuation scale for the item and serve as a public prior for both buyer and seller

    avg, low, and high are historical market statistics drawn from the AmazonHistoryPrice corpus (Appendix H.2.1). They calibrate the plausible valuation scale for the item and serve as a public prior for both buyer and seller

  49. [49]

    The counterpart’s true reservation value r_B, urgency κ_B, and stance η_B remain private and unobserved

  50. [50]

    The agent’s own reservation_price is sampled from a role-conditioned wedge around the product reference price (see §H.2.1) and is delivered through the same private_context channel as in synthetic runs. Constraints introduced by the block: The category-level public price bounds [p_min, p_max] in constraints.price_bounds are derived from the product category r...

  51. [51]

    Never change the action d_k ∈ {OFFER, ACCEPT, REJECT} or the price p_k

  52. [52]

    Never introduce new numbers, constraints, deadlines, or factual claims

  53. [53]

    Never reveal hidden information (reservation values, urgency, stance, internal policy)

  54. [54]

    Never reference internal variables (types, simulator, cues, κ, η)

  55. [55]

    Shape tone using sentiment (positive, neutral, negative) and strategy_cue (Concede, Hold, Pressure)

  56. [56]

    Keep the message realistic and concise (1–3 sentences)

  57. [57]

    buyer" or

    If is_opening_turn = No, briefly respond to the agent’s last message in a way consistent with the cues; if Yes, initiate naturally. Action-specific requirements: Offer → state the provided price string verbatim with no rounding or paraphrase; Accept → confirm agreement and make clear that the negotiation has concluded with a deal; Reject → firmly close the negot...