Training Language Models for Bilateral Trade with Private Information
Pith reviewed 2026-05-10 17:10 UTC · model grok-4.3
The pith
Language models learn to price-discriminate in bilateral bargaining by anchoring high and conceding gradually across sequential offers.
A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.
Core claim
Effective LLM bargaining agents implement price discrimination through sequential offers; stronger models scale their behavior proportionally to item value and maintain performance across price tiers. Supervised fine-tuning approximately doubles surplus share but reduces deal rates, while subsequent reinforcement learning recovers deal rates at the expense of some surplus gains. The fine-tuning also compresses surplus variation across price tiers and generalizes this proportional behavior to unseen opponents.
What carries the argument
The event-driven simulator that separates binding price offers from natural-language messages, allowing automated scoring of surplus, deal completion, and strategy patterns in private-information bilateral trade.
If this is right
- Stronger models maintain high surplus capture across all value levels by scaling offers proportionally to the item.
- Accommodating strategies that concede quickly in the buyer role disable price discrimination and produce the lowest surplus capture and deal rates.
- Supervised fine-tuning produces proportional strategies that transfer to bargaining opponents not encountered in training.
- The reward structure in reinforcement learning directly trades off higher surplus extraction against higher deal completion rates.
Where Pith is reading between the lines
- Similar training pipelines could be applied to train LLM agents for other private-information settings such as auctions or contract negotiations.
- Reward functions that explicitly penalize both low surplus and failed deals might allow models to retain more of the surplus gains from fine-tuning.
- The observed link between temporal patience and higher surplus suggests testing whether adding explicit time costs in the simulator would sharpen the learned strategies further.
- If the proportional scaling behavior generalizes, the same models might serve as starting points for multi-party or repeated-trade environments.
Load-bearing premise
The simulator's separation of binding offers from chat messages accurately reflects the core strategic dynamics of real bilateral bargaining under private information.
What would settle it
Running the trained Qwen models against human bargainers, or in a protocol without the simulator's clean separation of offers and messages, would settle it: if surplus shares do not double, or if generalization to new opponents disappears, the training claims are falsified.
Original abstract
Bilateral bargaining under incomplete information provides a controlled testbed for evaluating large language model (LLM) agent capabilities. Bilateral trade demands individual rationality, strategic surplus maximization, and cooperation to realize gains from trade. We develop a structured bargaining environment where LLMs negotiate via tool calls within an event-driven simulator, separating binding offers from natural-language messages to enable automated evaluation. The environment serves two purposes: as a benchmark for frontier models and as a training environment for open-weight models via reinforcement learning. In benchmark experiments, a round-robin tournament among five frontier models (15,000 negotiations) reveals that effective strategies implement price discrimination through sequential offers. Aggressive anchoring, calibrated concession, and temporal patience correlate with the highest surplus share and deal rate. Accommodating strategies that concede quickly disable price discrimination in the buyer role, yielding the lowest surplus capture and deal completion. Stronger models scale their behavior proportionally to item value, maintaining performance across price tiers; weaker models perform well only when wide zones of possible agreement offset suboptimal strategies. In training experiments, we fine-tune Qwen3 (8B, 14B) via supervised fine-tuning (SFT) followed by Group Relative Policy Optimization (GRPO) against a fixed frontier opponent. These stages optimize competing objectives: SFT approximately doubles surplus share but reduces deal rates, while RL recovers deal rates but erodes surplus gains, reflecting the reward structure. SFT also compresses surplus variation across price tiers, which generalizes to unseen opponents, suggesting that behavioral cloning instills proportional strategies rather than memorized price points.
Editorial analysis
A structured set of objections, weighed in public.
Referee Report
Summary. The manuscript introduces an event-driven simulator for bilateral bargaining under private information in which LLMs negotiate via tool calls, separating binding offers from natural-language messages. Through a round-robin tournament of five frontier models across 15,000 negotiations, it reports that effective strategies implement price discrimination via sequential offers, with aggressive anchoring, calibrated concession, and temporal patience correlating with the highest surplus share and deal rates. Training experiments fine-tune Qwen3 (8B, 14B) via SFT followed by GRPO against a fixed opponent, finding that SFT approximately doubles surplus share but reduces deal rates, while RL recovers deal rates at the cost of surplus gains; SFT also compresses surplus variation across price tiers and generalizes to unseen opponents.
Significance. If the simulator accurately captures the strategic incentives of private-information bilateral trade, the work supplies a reproducible benchmark for LLM agent capabilities in incomplete-information settings and a concrete training pipeline that exposes trade-offs between surplus maximization and deal completion. The empirical distinction between SFT and RL effects, together with the generalization results, is a useful contribution to understanding how behavioral cloning versus reinforcement learning shapes strategic behavior in LLMs. The automated evaluation framework enabled by tool calls and separated message/offer channels is a technical strength that supports scalable experimentation.
major comments (3)
- [Benchmark Experiments] The claim that aggressive anchoring, calibrated concession, and temporal patience 'correlate with the highest surplus share and deal rate' is presented without quantitative measures (e.g., correlation coefficients, regression coefficients, or controls for item value and model strength). This is load-bearing for the strategy-identification result that underpins the benchmark findings.
- [Training Experiments] The statement that 'SFT approximately doubles surplus share' is given without the pre- and post-SFT values, standard errors, number of independent training runs, or statistical tests. The same paragraph reports RL effects on deal rates and surplus; absent variance estimates or run counts, the magnitude and reliability of the reported trade-off cannot be assessed.
- [Environment] The central modeling choice of separating binding offers from natural-language messages is introduced without validation against human bargaining data, equilibrium predictions (e.g., Myerson-Satterthwaite), or an ablation that integrates messages and offers. Because all reported correlations and training effects rest on this separation, a concrete test comparing the separated and integrated settings is needed to rule out simulator artifacts.
minor comments (2)
- [Abstract] The abstract states '15,000 negotiations' but does not indicate the distribution across model pairs, price tiers, or buyer/seller roles; adding this breakdown would improve interpretability of the tournament results.
- [Methods] Define surplus share and deal rate formally (with equations) in the methods section before reporting numerical outcomes.
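For reference, the standard definitions the referee is asking for would look like the following. The notation (buyer reservation price b, seller reservation price s, agreed price p, N negotiations) matches the appendix transcript excerpted below; the formulas themselves are a hedged reconstruction, not the paper's own equations.

```latex
% Hedged reconstruction of standard definitions; not the paper's equations.
% A deal at price p is individually rational when s <= p <= b.
\[
  \text{TotalSurplus} = b - s, \qquad
  \text{SurplusShare}_{\text{seller}} = \frac{p - s}{b - s}, \qquad
  \text{SurplusShare}_{\text{buyer}} = \frac{b - p}{b - s},
\]
\[
  \text{DealRate} = \frac{1}{N} \sum_{i=1}^{N}
    \mathbf{1}\!\left[\text{negotiation } i \text{ ends in an accepted offer}\right].
\]
```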
Simulated Author's Rebuttal
We thank the referee for the detailed and constructive comments. We address each major point below, providing the strongest honest responses possible. We commit to revisions that add quantitative support and statistical detail where feasible, while noting limitations on full external validation.
Point-by-point responses
- Referee: [Benchmark Experiments] The claim that aggressive anchoring, calibrated concession, and temporal patience 'correlate with the highest surplus share and deal rate' is presented without quantitative measures (e.g., correlation coefficients, regression coefficients, or controls for item value and model strength). This is load-bearing for the strategy-identification result that underpins the benchmark findings.
  Authors: We agree that explicit quantitative measures would strengthen the presentation. In the revision we will compute and report Pearson and Spearman correlations between the behavioral metrics (anchoring magnitude, concession speed, response latency) and both surplus share and deal rate. We will also add OLS regressions with controls for item value and model identity to isolate the partial correlations. (revision: yes)
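A minimal sketch of that committed analysis, assuming one row per negotiation; the file name and column names (anchor_magnitude, concession_speed, response_latency, item_value, model_id) are hypothetical, not from the paper.

```python
# Hedged sketch: correlations between behavioral metrics and surplus share,
# plus an OLS regression with controls. All column names are hypothetical.
import pandas as pd
from scipy import stats
import statsmodels.formula.api as smf

df = pd.read_csv("negotiations.csv")  # one row per negotiation (hypothetical file)

for metric in ["anchor_magnitude", "concession_speed", "response_latency"]:
    r, p_r = stats.pearsonr(df[metric], df["surplus_share"])
    rho, p_rho = stats.spearmanr(df[metric], df["surplus_share"])
    print(f"{metric}: Pearson r={r:.3f} (p={p_r:.3g}), "
          f"Spearman rho={rho:.3f} (p={p_rho:.3g})")

# OLS with controls for item value and model identity (C() expands dummies)
model = smf.ols(
    "surplus_share ~ anchor_magnitude + concession_speed + response_latency"
    " + item_value + C(model_id)",
    data=df,
).fit()
print(model.summary())
```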
- Referee: [Training Experiments] The statement that 'SFT approximately doubles surplus share' is given without the pre- and post-SFT values, standard errors, number of independent training runs, or statistical tests. The same paragraph reports RL effects on deal rates and surplus; absent variance estimates or run counts, the magnitude and reliability of the reported trade-off cannot be assessed.
  Authors: We accept that the training results require fuller statistical reporting. The revised manuscript will state the exact pre-SFT and post-SFT mean surplus shares, standard errors across at least three independent training seeds, and paired t-test or Wilcoxon results for the SFT and subsequent GRPO stages. (revision: yes)
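A minimal sketch of the committed seed-level comparison; the per-seed surplus shares below are placeholders, not the paper's numbers.

```python
# Hedged sketch: paired comparison of surplus share before and after SFT
# across independent training seeds. Values are placeholders.
import numpy as np
from scipy import stats

pre_sft = np.array([0.21, 0.19, 0.23])    # per-seed mean surplus share (placeholder)
post_sft = np.array([0.42, 0.40, 0.44])   # placeholder post-SFT values

t, p_t = stats.ttest_rel(post_sft, pre_sft)   # paired t-test
w, p_w = stats.wilcoxon(post_sft, pre_sft)    # paired Wilcoxon signed-rank test
print(f"mean change = {np.mean(post_sft - pre_sft):.3f}")
print(f"paired t: t={t:.2f}, p={p_t:.3g}; Wilcoxon: W={w}, p={p_w:.3g}")
```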
- Referee: [Environment] The central modeling choice of separating binding offers from natural-language messages is introduced without validation against human bargaining data, equilibrium predictions (e.g., Myerson-Satterthwaite), or an ablation that integrates messages and offers. Because all reported correlations and training effects rest on this separation, a concrete test comparing the separated and integrated settings is needed to rule out simulator artifacts.
  Authors: We recognize the importance of validating the core design choice. A full human-subject replication is beyond the scope of the current study, but we will add an ablation that runs the identical tournament and training pipeline in an integrated channel where natural-language text and offers share a single message stream. We will also include a brief discussion of how the separation is consistent with the distinction between binding commitments and cheap talk in the Myerson-Satterthwaite framework. (revision: partial)
Circularity Check
No significant circularity: all claims are empirical simulation outcomes
full rationale
The paper reports benchmark tournaments and training runs (SFT + GRPO) inside an event-driven simulator. No mathematical derivation chain, first-principles prediction, or fitted parameter is presented as a result; every reported correlation (anchoring, concession, surplus share, deal rate, tier generalization) is an observed statistic from 15,000+ negotiations against fixed opponents. The environment definition and reward structure are explicit inputs, not outputs derived from the same equations. No self-citation load-bearing step, ansatz smuggling, or renaming of known results occurs. The central claims therefore remain independent of any internal reduction.
Axiom & Free-Parameter Ledger
invented entities (1)
- Event-driven simulator separating binding offers from natural-language messages (no independent evidence).
Reference graph
Works this paper leans on
- [1] Gemini Team. Gemini 2.5: Pushing the frontier with advanced reasoning, multimodality, long context, and next generation agentic capabilities. arXiv preprint arXiv:2507.06261, 2025.
- [2] Etan A. Green and E. Barry Plunkett. The science of the deal: Optimal bargaining on eBay using deep reinforcement learning. In Proceedings of the 2…
- [3] Roger B. Myerson. Optimal auction design. Mathematics of Operations Research, 6(1):58–73, 1981. doi:10.1287/moor.6.1.58.
- [4] Roger B. Myerson and Mark A. Satterthwaite. Efficient mechanisms for bilateral trading. Journal of Economic Theory, 29(2):265–281, 1983. doi:10.1016/0022-0531(83)90048-0.
- [5] John F. Nash. The bargaining problem. Econometrica, 18(2):155–162, 1950. doi:10.2307/1907266.
- [6] Jihwan Oh, Murad Aghazada, Se-Young Yun, and Taehyeon Kim. LLM a…
- [7] John Schulman, Filip Wolski, Prafulla Dhariwal, Alec Radford, and Oleg Klimov. Proximal policy optimization algorithms. arXiv preprint arXiv:1707.06347, 2017.
- [8] Zhihong Shao, Peiyi Wang, Qihao Zhu, Runxin Xu, Junxiao Song, Xiao Bi, Haowei Zhang, Mingchuan Zhang, Y. K. Li, Y. Wu, and Daya Guo. DeepSeekMath: Pushing the limits of mathematical reasoning in open language models. arXiv preprint arXiv:2402.03300, 2024.
Appendix excerpts
Simulator event loop (Appendix A):
1. Pop the next event batch from the queue (events sharing the same timestamp and agent).
2. Advance the simulation clock to the event timestamp.
3. Convert events to a natural-language observation (e.g., "Seller proposed $1,500"; "Buyer rejected your offer and proposed $1,200").
4. Route the observation to the counterpart agent.
5. The counterpart generates a Thought: block (internal reasoning, hidden from the opponent) followed by a Code: block (tool calls).
6. The simulator parses and executes the tool calls, creating new events.
7. Check termination conditions; if not met, return to step 1.

Example negotiation transcript (Appendix A.4, GFT example). Item: "Used Laptop" (historical high $1,500, historical low $800). Buyer reservation price b = $1,200; seller reservation price s = $900. ZOPA = [900, 1200]; total surplus = $300. Round 1: Seller make_offer(1400). Round 2: Buyer send_message("Your…
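A minimal, runnable sketch of the seven-step loop above, with toy agents standing in for LLM tool calls. Every class and function name here is our invention; the real simulator additionally parses Thought:/Code: blocks from model output and tracks richer termination conditions.

```python
# Minimal sketch of the event loop enumerated above, assuming agents that
# return structured (action, value) tool calls rather than raw text.
import heapq
from dataclasses import dataclass, field

@dataclass(order=True)
class Event:
    timestamp: float
    recipient: str = field(compare=False)  # agent the event is routed to
    kind: str = field(compare=False)       # "offer" (binding) or "message" (cheap talk)
    payload: object = field(compare=False)

class Simulator:
    def __init__(self, agents):
        self.agents = agents               # {"buyer": agent, "seller": agent}
        self.queue: list[Event] = []
        self.clock = 0.0
        self.deal_price = None

    def push(self, event):
        heapq.heappush(self.queue, event)

    def run(self, max_events=100):
        for _ in range(max_events):
            if not self.queue or self.deal_price is not None:
                break                                       # step 7: termination check
            ev = heapq.heappop(self.queue)                  # step 1: pop next event
            self.clock = ev.timestamp                       # step 2: advance the clock
            other = "seller" if ev.recipient == "buyer" else "buyer"
            if ev.kind == "offer":                          # step 3: render observation
                obs = f"{other.capitalize()} proposed ${ev.payload:,}"
            else:
                obs = f"{other.capitalize()} said: {ev.payload!r}"
            action, value = self.agents[ev.recipient](obs)  # steps 4-5: route and act
            if action == "accept":                          # step 6: execute tool call
                self.deal_price = value
            elif action in ("offer", "message"):
                self.push(Event(self.clock + 1.0, other, action, value))

# Toy agents: the seller anchors high and concedes; the buyer accepts any
# price at or below its reservation and otherwise sends a cheap-talk message.
def make_seller(reservation=900):
    offers = iter([1300, 1200, 1100])
    return lambda obs: ("offer", next(offers, reservation))

def make_buyer(reservation=1200):
    def act(obs):
        if obs.startswith("Seller proposed"):
            price = int(obs.split("$")[1].replace(",", ""))
            return ("accept", price) if price <= reservation else ("message", "Too high.")
        return ("offer", 1000)
    return act

sim = Simulator({"buyer": make_buyer(), "seller": make_seller()})
sim.push(Event(0.0, "buyer", "offer", 1400))  # seller's opening offer, routed to buyer
sim.run()
print(f"deal at ${sim.deal_price:,}" if sim.deal_price else "no deal")
```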
SFT data pipeline (Appendix B):
- Demonstration generation. DeepSeek-R1 self-play negotiations are generated within the simulator using the same system prompt template, tool definitions, and item catalog used in the benchmark (Section 4.1).
- Reasoning-trace cleaning. <think>...</think> reasoning blocks are removed from input prompts (but retained in target completions), as described in Section 5.2.
- Turn-level decomposition. Each multi-turn trajectory is decomposed into autoregressive training samples of increasing context length (Section 5.2).
- Format conversion. Samples are converted to input–output pairs, where the input concatenates the system prompt and conversation history and the output is the target agent response. Loss is computed only on output tokens.
- SFT hyperparameters. Table 12 lists the SFT training hyperparameters for both model sizes. Training uses the TRL SFTTrainer with DeepSpeed…
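A minimal sketch of the turn-level decomposition with loss computed only on output tokens, as described above. The helper is our reconstruction, not the paper's pipeline code; it omits the <think>-block stripping step, and the Qwen tokenizer id is an assumption.

```python
# Hedged reconstruction of turn-level decomposition with output-only loss.
# Each agent turn becomes one training sample whose input is the full
# conversation so far; labels are -100 (ignored by the loss) on the prompt.
from transformers import AutoTokenizer

tok = AutoTokenizer.from_pretrained("Qwen/Qwen3-8B")

def decompose(system_prompt: str, turns: list[tuple[str, str]]) -> list[dict]:
    samples = []
    history = system_prompt
    for speaker, text in turns:
        if speaker == "agent":
            prompt_ids = tok(history, add_special_tokens=False).input_ids
            target_ids = tok(text, add_special_tokens=False).input_ids
            samples.append({
                "input_ids": prompt_ids + target_ids,
                "labels": [-100] * len(prompt_ids) + target_ids,  # loss on output only
            })
        history += text  # contexts grow turn by turn
    return samples

# One sample per agent turn, each with a longer context than the last.
samples = decompose(
    "You are the seller. Item: Used Laptop.",
    [("agent", "Thought: anchor high. Code: make_offer(1400)"),
     ("opponent", "Buyer rejected your offer and proposed $1,200"),
     ("agent", "Thought: concede slightly. Code: make_offer(1300)")],
)
```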
Synthetic quit handling in GRPO (Appendix B):
1. Generate a synthetic quit negotiation response (with Thought/Code blocks matching the expected format).
2. Attach synthetic logprobs using the PAD token (token ID 151643, <|endoftext|> in Qwen models) with a nominal logprob of −0.1.
3. Execute the synthetic response through the environment, terminating the negotiation with no deal.
4. Assign reward R = 0.

The choice of the PAD token is critical. An early implementation used token ID 1 (the double-quote character " in Qwen's vocabulary), which created real gradient updates: with a negative advantage from the failed negotiation, GRPO would systematically decrease the probability of generating double quotes after long contexts. The PAD toke…
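A hedged sketch of the synthetic-quit bookkeeping described above, assuming (as the token-ID-1 anecdote implies) that PAD tokens are masked out of the GRPO loss, so the placeholder logprobs never produce real gradients. Helper names are ours, not the paper's.

```python
# Hedged sketch: build a placeholder rollout for a forced quit. PAD tokens
# (masked from the loss) carry nominal logprobs, so the zero-reward rollout
# contributes no gradient. Using a real token such as ID 1 ('"') instead
# lets a negative advantage push down that token's probability everywhere.
import torch

PAD_TOKEN_ID = 151643     # <|endoftext|> in Qwen tokenizers
NOMINAL_LOGPROB = -0.1

def synthetic_quit_rollout(length: int) -> dict:
    return {
        "token_ids": torch.full((length,), PAD_TOKEN_ID, dtype=torch.long),
        "logprobs": torch.full((length,), NOMINAL_LOGPROB),
        "reward": 0.0,    # quit terminates the negotiation with no deal
    }

def loss_mask(token_ids: torch.Tensor) -> torch.Tensor:
    # Assumed masking rule: PAD positions get no policy-gradient update.
    return token_ids != PAD_TOKEN_ID
```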
Weight synchronization (Appendix B):
1. The training process computes gradients over accumulated batches.
2. After 8 gradient steps, it initiates weight synchronization.
3. The vLLM server atomically swaps model weights.
4. Subsequent rollouts use the updated policy immediately.
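A schematic of that cadence with duck-typed stand-ins: update_weights here represents whatever RPC the trainer uses to push a state dict to the vLLM server, since the excerpt does not name the actual call.

```python
# Schematic of the rollout/train/sync cadence described above; all object
# interfaces are hypothetical stand-ins, not a real vLLM or TRL API.
SYNC_EVERY = 8  # gradient steps between weight pushes

def train(trainer, rollouts, inference_server, total_steps: int) -> None:
    for step in range(1, total_steps + 1):
        batch = rollouts.next_batch()       # generated by the server's current policy
        trainer.accumulate_and_step(batch)  # gradients over accumulated batches
        if step % SYNC_EVERY == 0:
            # The server swaps weights atomically: in-flight rollouts finish
            # on the old policy; subsequent rollouts use the new one.
            inference_server.update_weights(trainer.model.state_dict())
```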
Training dynamics (Appendix B.6). Training naturally progresses through three phases:
1. Tool mastery (steps 1–20). The agent learns to generate well-formed tool calls and execute valid code. Rewards are dominated by R_parsing and R_execution (Equation (8) and Table 8), as successful negotiations are rare.
2. Constraint awareness (steps 20–40). The agent begins respecting IR constraints, learning to reject offers that yield negative utility. IR violation rates drop from >30% to <5%.
3. Strategic optimization (steps 40–58). With basic competence established, the agent develops negotiation strategies: anchoring with aggressive opening offers, progressive concessions, and strategic outside-option exercise.

Appendix C provides supplementary figures and tables for the frontier model benchmark (Section 4).
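A hedged reconstruction of the reward shaping these phases imply: parsing and execution terms dominate before deals occur, surplus enters only on completed deals, and IR violations are penalized. The weights and term names are placeholders, not the paper's Equation (8) or Table 8.

```python
# Placeholder reward shaping consistent with the three phases above; the
# weights (0.1, 1/100, 1.0) are illustrative, not the paper's values.
def reward(parsed_ok: bool, executed_ok: bool,
           deal_price: float | None, reservation: float, role: str) -> float:
    r = 0.0
    r += 0.1 if parsed_ok else -0.1     # R_parsing: well-formed Thought/Code blocks
    r += 0.1 if executed_ok else -0.1   # R_execution: tool calls run without error
    if deal_price is not None:
        surplus = (reservation - deal_price if role == "buyer"
                   else deal_price - reservation)
        r += max(surplus, 0.0) / 100.0  # R_negotiation: scaled surplus from a deal
        if surplus < 0:
            r -= 1.0                    # IR violation: deal worse than walking away
    return r
```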