{"paper":{"title":"TERMS-Bench: Diagnosing LLM Negotiation Agents Beyond Deal Rate","license":"http://creativecommons.org/licenses/by/4.0/","headline":"A Bayesian-game testbed diagnoses LLM agents in price negotiation by measuring surplus extraction, cue use, and belief calibration rather than deal rate alone.","cross_cats":["cs.AI"],"primary_cat":"cs.GT","authors_text":"Aneesh Pappu, Batu El, Erica Zhang, Fangzhao Zhang, James Zou, Jiashuo Liu, Jose Blanchet, Susan Athey","submitted_at":"2026-05-13T06:22:50Z","abstract_excerpt":"Negotiation is a central mechanism of economic exchange, shaping markets, procurement, labor agreements, and resource allocation. It is also a canonical testbed for agentic language models, requiring multi-turn interaction under hidden preferences, strategic communication, and binding constraints. These properties make negotiation hard to evaluate: unlike math or code, it has no intrinsic verifier. Existing LLM negotiation evaluations rely on LLM-vs.-LLM interaction or aggregate outcomes such as deal rate, leaving failures opaque. 
We introduce Terms-Bench, short for Testbed for Economic Reason"},"claims":{"count":4,"items":[{"kind":"strongest_claim","text":"Evaluating 13 LLM agents spanning frontier systems, Terms-Bench shows frontier models saturate deal rate yet diverge in surplus extraction, cue use, belief calibration, and compliance, revealing agent-specific bargaining bottlenecks masked by prior benchmarks.","source":"verdict.strongest_claim","status":"machine_extracted","claim_id":"C1","attestation":"unclaimed"},{"kind":"weakest_assumption","text":"The simulator policy and payoff structure chosen for the bilateral price negotiation accurately capture the strategic and informational features that matter in real human negotiations, so that observed gaps can be attributed to the agent rather than to an unrealistic environment.","source":"verdict.weakest_assumption","status":"machine_extracted","claim_id":"C2","attestation":"unclaimed"},{"kind":"one_line_summary","text":"Terms-Bench is a diagnostic benchmark for LLM negotiation agents that reveals agent-specific strategic failures beyond simple deal rates by using hidden-type simulators as oracles.","source":"verdict.one_line_summary","status":"machine_extracted","claim_id":"C3","attestation":"unclaimed"},{"kind":"headline","text":"A Bayesian-game testbed diagnoses LLM agents in price negotiation by measuring surplus extraction, cue use, and belief calibration rather than deal rate alone.","source":"verdict.pith_extraction.headline","status":"machine_extracted","claim_id":"C4","attestation":"unclaimed"}],"snapshot_sha256":"0189bfc33d72fc22e8e4fd3f8282ae616dd73d4d5b2c382b3594709f1102170b"},"source":{"id":"2605.13909","kind":"arxiv","version":1},"verdict":{"id":"9ec843c2-7206-4ded-bdec-3a00cd9fef4b","model_set":{"reader":"grok-4.3"},"created_at":"2026-05-15T02:51:40.460142Z","strongest_claim":"Evaluating 13 LLM agents spanning frontier systems, Terms-Bench shows frontier models saturate deal rate yet diverge in surplus extraction, cue use, 
belief calibration, and compliance, revealing agent-specific bargaining bottlenecks masked by prior benchmarks.","one_line_summary":"Terms-Bench is a diagnostic benchmark for LLM negotiation agents that reveals agent-specific strategic failures beyond simple deal rates by using hidden-type simulators as oracles.","pipeline_version":"pith-pipeline@v0.9.0","weakest_assumption":"The simulator policy and payoff structure chosen for the bilateral price negotiation accurately capture the strategic and informational features that matter in real human negotiations, so that observed gaps can be attributed to the agent rather than to an unrealistic environment.","pith_extraction_headline":"A Bayesian-game testbed diagnoses LLM agents in price negotiation by measuring surplus extraction, cue use, and belief calibration rather than deal rate alone."},"references":{"count":57,"sample":[{"doi":"","year":null,"title":"The agent chooses Accept; the outcome is the counterpart’s last offered price","work_id":"c99133e5-8bca-4100-919f-39e217edc131","ref_index":1,"cited_arxiv_id":"","is_internal_anchor":false},{"doi":"","year":null,"title":"The agent chooses Reject; the outcome is disagreement ⊥","work_id":"0a805241-2cc5-448b-9a20-ea0654126bf6","ref_index":2,"cited_arxiv_id":"","is_internal_anchor":false},{"doi":"","year":null,"title":"The counterpart accepts the agent’s offer; the outcome is the agent’s proposed price","work_id":"e4258cbf-1a38-4ab6-b16b-d237d1c56c02","ref_index":3,"cited_arxiv_id":"","is_internal_anchor":false},{"doi":"","year":null,"title":"The counterpart terminally rejects (walk-away); the outcome is disagreement ⊥","work_id":"b72816ac-bebd-4460-82c6-c64108f0886f","ref_index":4,"cited_arxiv_id":"","is_internal_anchor":false},{"doi":"","year":2025,"title":"The round limit K is reached without agreement; the outcome is disagreement ⊥. 
Constraints. All Offer actions must satisfy: (i) price bounds p_min ≤ p_k ≤ p_max; (ii) monotonic concession: for buyer agents, ","work_id":"57e9dacb-ce25-4d52-b05e-397c66745a32","ref_index":5,"cited_arxiv_id":"","is_internal_anchor":false}],"resolved_work":57,"snapshot_sha256":"66929ba71dd869f657dcc5d2f0d43ece34ce65991e8b74a9a4574b21e226f835","internal_anchors":0},"formal_canon":{"evidence_count":2,"snapshot_sha256":"b5ef96c5359623a7f996dcab57d3b6c76b3caac153762af4ba35b1282ab08f9a"},"author_claims":{"count":0,"strong_count":0,"snapshot_sha256":"258153158e38e3291e3d48162225fcdb2d5a3ed65a07baac614ab91432fd4f57"},"builder_version":"pith-number-builder-2026-05-17-v1"}