{"paper":{"title":"Cattle Trade: A Multi-Agent Benchmark for LLM Bluffing, Bidding, and Bargaining","license":"http://arxiv.org/licenses/nonexclusive-distrib/1.0/","headline":"In the Cattle Trade benchmark, strategic coherence like spending efficiency and adaptive bidding predicts rank better than spending volume.","cross_cats":[],"primary_cat":"cs.AI","authors_text":"Clemens Müller, Robert Müller","submitted_at":"2026-05-14T08:20:03Z","abstract_excerpt":"We introduce Cattle Trade, a multi-agent benchmark for evaluating large language models (LLMs) as agents in strategic reasoning under imperfect information, adversarial interaction, and resource constraints. The benchmark combines auctions, hidden-offer trade challenges (TCs), bargaining, bluffing, opponent modeling, and resource allocation within a single long-horizon game lasting 50–60 turns. Unlike prior agent benchmarks that test these abilities in isolation, Cattle Trade evaluates whether agents integrate them across a competitive, multi-agent economic game with conflict"},"claims":{"count":4,"items":[{"kind":"strongest_claim","text":"Strategic coherence, in particular spending efficiency, resource discipline, and phase-adaptive bidding, is associated with rank more strongly than spending volume or any single subskill. Two heuristic code agents outperform most tested LLMs.","source":"verdict.strongest_claim","status":"machine_extracted","claim_id":"C1","attestation":"unclaimed"},{"kind":"weakest_assumption","text":"That performance differences observed in this specific Cattle Trade game design accurately reflect general agentic competence in strategic reasoning under imperfect information rather than being artifacts of the particular rules, card mechanics, or turn structure.","source":"verdict.weakest_assumption","status":"machine_extracted","claim_id":"C2","attestation":"unclaimed"},{"kind":"one_line_summary","text":"Cattle Trade benchmark shows heuristic code agents outperforming most LLMs in integrated strategic tasks like bidding, bluffing, and resource allocation across 242 games, with strategic coherence predicting rank better than spending volume.","source":"verdict.one_line_summary","status":"machine_extracted","claim_id":"C3","attestation":"unclaimed"},{"kind":"headline","text":"In the Cattle Trade benchmark, strategic coherence like spending efficiency and adaptive bidding predicts rank better than spending volume.","source":"verdict.pith_extraction.headline","status":"machine_extracted","claim_id":"C4","attestation":"unclaimed"}],"snapshot_sha256":"76f02deaea85bfb692ca0ef6a19a0879fb4efb722eaa0daa5ec023da567c60cd"},"source":{"id":"2605.14537","kind":"arxiv","version":1},"verdict":{"id":"7a9c7dc9-a85b-4be7-8675-66eb45586ca1","model_set":{"reader":"grok-4.3"},"created_at":"2026-05-15T01:50:27.270880Z","strongest_claim":"Strategic coherence, in particular spending efficiency, resource discipline, and phase-adaptive bidding, is associated with rank more strongly than spending volume or any single subskill. Two heuristic code agents outperform most tested LLMs.","one_line_summary":"Cattle Trade benchmark shows heuristic code agents outperforming most LLMs in integrated strategic tasks like bidding, bluffing, and resource allocation across 242 games, with strategic coherence predicting rank better than spending volume.","pipeline_version":"pith-pipeline@v0.9.0","weakest_assumption":"That performance differences observed in this specific Cattle Trade game design accurately reflect general agentic competence in strategic reasoning under imperfect information rather than being artifacts of the particular rules, card mechanics, or turn structure.","pith_extraction_headline":"In the Cattle Trade benchmark, strategic coherence like spending efficiency and adaptive bidding predicts rank better than spending volume."},"references":{"count":32,"sample":[{"doi":"","year":2006,"title":"TrueSkill: A","work_id":"3429a6d8-a706-4a62-80d8-4764f0b6485e","ref_index":1,"cited_arxiv_id":"","is_internal_anchor":false},{"doi":"","year":2017,"title":"Mastering the Game of","work_id":"849782af-8ae7-477f-aa78-549968caf849","ref_index":2,"cited_arxiv_id":"","is_internal_anchor":false},{"doi":"","year":2019,"title":"Brown, Noam and Sandholm, Tuomas , journal=. Superhuman. 2019 , publisher=","work_id":"36e004f9-1df3-4288-964c-aff63cd87a7d","ref_index":3,"cited_arxiv_id":"","is_internal_anchor":false},{"doi":"","year":2022,"title":"Human-Level Play in the Game of","work_id":"b6b387a9-3f99-442a-9b90-6bbb1a41dec5","ref_index":4,"cited_arxiv_id":"","is_internal_anchor":false},{"doi":"","year":null,"title":"International Conference on Learning Representations , year=","work_id":"7d66a0d4-a1e1-4771-b0cd-ea8c424cb4bd","ref_index":5,"cited_arxiv_id":"","is_internal_anchor":false}],"resolved_work":32,"snapshot_sha256":"ba2e65e0c7b7aa382756c013a59a0e2bc73da399a7a757534fe778cc6b32c464","internal_anchors":2},"formal_canon":{"evidence_count":0,"snapshot_sha256":"258153158e38e3291e3d48162225fcdb2d5a3ed65a07baac614ab91432fd4f57"},"author_claims":{"count":0,"strong_count":0,"snapshot_sha256":"258153158e38e3291e3d48162225fcdb2d5a3ed65a07baac614ab91432fd4f57"},"builder_version":"pith-number-builder-2026-05-17-v1"}