GAMBIT: A Three-Mode Benchmark for Adversarial Robustness in Multi-Agent LLM Collectives
Pith reviewed 2026-05-14 21:05 UTC · model grok-4.3
The pith
A three-mode benchmark shows zero-shot detection of adaptive imposters in multi-agent LLM systems is misleading because adaptation gaps only appear after few-shot recalibration.
A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.
Core claim
GAMBIT supplies a dataset and three evaluation modes for imposter detectors in multi-agent LLM collectives: two zero-shot modes under increasing distribution shift and one recalibration mode that tracks adaptation speed from twenty labeled examples. An evolutionary imposter built on this substrate reduces collective task performance while remaining essentially undetectable by standard detectors. The evaluation shows that zero-shot scores alone cannot predict real-world robustness because detectors with nearly identical zero-shot performance differ by up to eight times in few-shot adaptation, and meta-learned detectors reach high accuracy twenty times faster than baselines, a distinction that
What carries the argument
The GAMBIT benchmark with its three evaluation modes (zero-shot under distribution shift and few-shot recalibration) together with the evolutionary imposter agent that co-evolves attack strategies against the collective.
If this is right
- Detectors must be tested in a recalibration mode to reveal their true robustness against evolving adversaries.
- Meta-learned detectors can reach usable accuracy with far fewer examples than standard fine-tuning.
- An evolutionary attack framework can produce imposters that simultaneously damage collective output and evade detection.
- Zero-shot leaderboards will systematically overstate the security of multi-agent systems against adaptive threats.
Where Pith is reading between the lines
- Defenses for multi-agent systems may need built-in mechanisms for rapid online recalibration rather than static detectors.
- The same evolutionary attack method could be applied to other cooperative tasks such as code generation or planning.
- Benchmarking should treat attack and defense as co-evolving processes rather than fixed pairs.
Load-bearing premise
That chess games played by Gemini agents form a representative substrate for general multi-agent LLM collectives and that the evolutionary imposter framework will generalize to other tasks without major changes.
What would settle it
A direct test in which every detector shows identical relative performance across the zero-shot and recalibration modes, or in which the evolutionary imposter fails to reduce collective performance on a non-chess reasoning task.
Figures
read the original abstract
In multi-agent systems (MAS), a single deceptive agent can nullify all gains of an agentic AI collective and evade deployed defenses. However, existing adversarial studies on MAS target only shallow tasks and do not consider adaptive adversaries, which evolve their strategies to evade the very detectors trained to catch them. To address that gap, we introduce GAMBIT, a benchmark with three evaluation modes and two independent scores for evaluating imposter detectors: the first two modes measure zero-shot detection under increasing distribution shift, and a third recalibration mode measures how quickly a detector adapts to novel attacks from just 20 labeled examples. The benchmark comes with a dataset of 27,804 labeled instances spanning 240 co-evolved imposter strategies. Our contributions are threefold: (1) Using chess as a substrate deep reasoning problem and Gemini 3.1 Pro for agents, we release GAMBIT and its dataset to evaluate imposter detectors under realistic constraints against a stealthy adaptive imposter; (2) We introduce an adaptive imposter agent based on an efficient evolutionary framework, generalizable beyond chess, that collapses collective task performance while remaining essentially undetectable (50.5% F1-score with a Gemini-based detector); (3) We show that zero-shot evaluation can be highly misleading for adaptive adversaries: two detectors with near-identical zero-shot scores differ by 8x on few-shot adaptation, while the meta-learned variant converges 20x faster, a gap only visible in the recalibration mode. Altogether, GAMBIT provides the first multi-agent benchmark where adversarial attacks and defenses co-evolve, with an imposter framework generalizable beyond our use case, and promising techniques for fast recalibration in a rapidly evolving adversarial system. Code and data: https://anonymous.4open.science/r/gambit.
Editorial analysis
A structured set of objections, weighed in public.
Referee Report
Summary. The paper introduces GAMBIT, a three-mode benchmark for adversarial robustness in multi-agent LLM collectives. Using chess games played by Gemini 3.1 Pro agents as substrate, it releases a dataset of 27,804 labeled instances spanning 240 co-evolved imposter strategies. The benchmark evaluates imposter detectors in zero-shot modes under distribution shift and a recalibration mode measuring few-shot adaptation from 20 examples. Key claims are that an evolutionary imposter framework collapses collective performance while remaining nearly undetectable (50.5% F1 with a Gemini detector), and that zero-shot scores are highly misleading: detectors with similar zero-shot performance differ by 8x on adaptation, while a meta-learned variant converges 20x faster, gaps visible only in recalibration.
Significance. If the empirical contrasts hold after addressing validation gaps, GAMBIT would be a useful contribution by providing the first co-evolving attack-defense benchmark for MAS and demonstrating the value of a recalibration mode for adaptive adversaries. The public dataset release and evolutionary framework (claimed generalizable beyond chess) are concrete strengths that could support follow-on work on fast adaptation techniques.
major comments (3)
- [Abstract and experimental results] Abstract and § on experimental setup: the reported 50.5% F1, 8x, and 20x gaps are given without error bars, confidence intervals, or statistical tests; this is load-bearing for the central claim that zero-shot evaluation is 'highly misleading' because the magnitude of the adaptation gaps cannot be assessed for reliability.
- [Dataset and evolutionary imposter] § on evolutionary framework and dataset construction: no details are provided on how the 240 strategies were verified as independent (e.g., pairwise similarity metrics or diversity analysis), nor on regularization or held-out validation to avoid overfitting within the evolutionary loop; this directly affects whether the recalibration-mode gaps reflect genuine adaptation or artifacts of the generation process.
- [Introduction and discussion] § on generalizability and substrate choice: the headline claim that zero-shot scores mislead for adaptive adversaries rests on chess with a single model family; no cross-domain transfer experiments (e.g., to negotiation or planning tasks) are described, leaving the 8x/20x differences vulnerable to the substrate-specific concern that chess move distributions and Gemini reasoning style may not generalize.
minor comments (1)
- [Appendix or reproducibility] The anonymous code link is noted but the manuscript should include a brief reproducibility checklist (e.g., exact evolutionary hyperparameters, random seeds, and prompt templates) to support the 'generalizable beyond chess' claim.
Simulated Author's Rebuttal
We thank the referee for their constructive feedback on our manuscript. We address each of the major comments below, outlining the revisions we plan to make to strengthen the paper.
read point-by-point responses
-
Referee: [Abstract and experimental results] Abstract and § on experimental setup: the reported 50.5% F1, 8x, and 20x gaps are given without error bars, confidence intervals, or statistical tests; this is load-bearing for the central claim that zero-shot evaluation is 'highly misleading' because the magnitude of the adaptation gaps cannot be assessed for reliability.
Authors: We fully agree that statistical rigor is crucial for validating the central claims. In the revised version, we will recompute the metrics with multiple independent runs and include error bars (standard deviations), 95% confidence intervals, and statistical significance tests (such as Wilcoxon signed-rank tests for the gaps). This will allow readers to assess the reliability of the 8x and 20x differences. revision: yes
-
Referee: [Dataset and evolutionary imposter] § on evolutionary framework and dataset construction: no details are provided on how the 240 strategies were verified as independent (e.g., pairwise similarity metrics or diversity analysis), nor on regularization or held-out validation to avoid overfitting within the evolutionary loop; this directly affects whether the recalibration-mode gaps reflect genuine adaptation or artifacts of the generation process.
Authors: We appreciate this point and will enhance the manuscript with additional details on the evolutionary framework. Specifically, we will add descriptions of how strategy independence was ensured through pairwise similarity metrics (using cosine similarity on strategy embeddings), diversity analysis via clustering and entropy calculations, and the incorporation of held-out validation sets and regularization in the evolutionary fitness function to mitigate overfitting. These additions will demonstrate that the observed adaptation gaps arise from genuine generalization rather than generation artifacts. revision: yes
-
Referee: [Introduction and discussion] § on generalizability and substrate choice: the headline claim that zero-shot scores mislead for adaptive adversaries rests on chess with a single model family; no cross-domain transfer experiments (e.g., to negotiation or planning tasks) are described, leaving the 8x/20x differences vulnerable to the substrate-specific concern that chess move distributions and Gemini reasoning style may not generalize.
Authors: We acknowledge the limitation regarding generalizability. Chess was selected as the substrate because it demands deep strategic reasoning and provides a clear, quantifiable task performance metric, making it suitable for evaluating deception in multi-agent settings. The evolutionary imposter framework is designed to be model- and domain-agnostic. In the revision, we will expand the discussion to include a more thorough analysis of potential substrate biases and argue for broader applicability based on the framework's design. We will also explicitly state plans for future cross-domain validation as future work, as conducting such experiments would require substantial additional resources beyond this revision. revision: partial
Circularity Check
No significant circularity in GAMBIT benchmark derivation
full rationale
The paper introduces an external benchmark and dataset of 27,804 instances generated from 240 co-evolved strategies via a new evolutionary imposter framework. Reported gaps (8x adaptation difference, 20x faster convergence) are measured empirical outcomes on held-out strategies across zero-shot and recalibration modes, not quantities defined by construction from the same fitted parameters or inputs. No self-citations are load-bearing for the central claims, no ansatz is smuggled, and the framework is presented as generalizable without reducing the results to prior self-work by definition. The derivation chain is self-contained against the released artifacts.
Axiom & Free-Parameter Ledger
free parameters (1)
- evolutionary hyperparameters
axioms (1)
- domain assumption Chess games require deep reasoning that generalizes to other multi-agent LLM tasks
Lean theorems connected to this paper
-
IndisputableMonolith/Foundation/RealityFromDistinction.leanreality_from_one_distinction unclear?
unclearRelation between the paper passage and the cited Recognition theorem.
We introduce GAMBIT, a benchmark with three evaluation modes... evolutionary framework producing 240 distinct imposter strategies... detection score and adaptation score
-
IndisputableMonolith/Cost/FunctionalEquation.leanwashburn_uniqueness_aczel unclear?
unclearRelation between the paper passage and the cited Recognition theorem.
move_score = max(primary_compliance,0.01)/3 * cpl_bin(pushed_move_cpl)
What do these tags mean?
- matches
- The paper's claim is directly supported by a theorem in the formal canon.
- supports
- The theorem supports part of the paper's argument, but the paper may add assumptions or extra steps.
- extends
- The paper goes beyond the formal theorem; the theorem is a base layer rather than the whole result.
- uses
- The paper appears to rely on the theorem as machinery.
- contradicts
- The paper's claim conflicts with a theorem or certificate in the canon.
- unclear
- Pith found a possible connection, but the passage is too broad, indirect, or ambiguous to say the theorem truly supports the claim.
Reference graph
Works this paper leans on
-
[1]
Kim, Yubin and Gu, Ken and Park, Chanwoo and others , title =. 2025 , journal =
work page 2025
-
[2]
Liu, Yi and Deng, Gelei and Li, Yuekang and Wang, Kailong and Zhang, Tianwei and Liu, Yepang and Wang, Haoyu and Zheng, Yanhong and Liu, Yang , title =. 2023 , journal =
work page 2023
- [3]
-
[4]
Amortized Planning with Large-Scale Transformers:
Ruoss, Anian and Del\'etang, Gr\'egoire and Medapati, Sourabh and Grau-Moya, Jordi and Wenliang,. Amortized Planning with Large-Scale Transformers:. 2024 , booktitle =
work page 2024
-
[5]
Amayuelas, Alfonso and Yang, Xianjun and Antoniades, Antonis and others , title =. 2024 , booktitle =
work page 2024
-
[6]
Huang, Jen-tse and Zhou, Jiaxu and Jin, Tailin and others , title =. 2025 , booktitle =
work page 2025
-
[7]
Xie, Yizhe and Zhu, Congcong and Zhang, Xinyue and others , title =. 2025 , journal =
work page 2025
-
[8]
Curvo, Pedro M. P. , title =. 2025 , journal =
work page 2025
-
[9]
Ju, Tianjie and Wang, Yiting and Ma, Xinbei and others , title =. 2024 , journal =
work page 2024
-
[10]
Han, Chen and Zheng, Wenzhen and Tang, Xijin , title =. 2025 , booktitle =
work page 2025
- [11]
-
[12]
Wang, J. and Wang, J. and Athiwaratkun, B. and Zhang, C. and Zou, J. , title =. 2025 , booktitle =
work page 2025
- [13]
- [14]
- [15]
-
[16]
Cemri, M. and Pan, M. Z. and Yang, S. and Agrawal, L. A. and Chopra, B. and Tiwari, R. and Keutzer, K. and Parameswaran, A. and Klein, D. and Ramchandran, K. and Zaharia, M. and Gonzalez, J. E. and Stoica, I. , title =. 2025 , journal =
work page 2025
-
[17]
Liu, F. and Zhao, R. and Chen, S. and Li, G. and Torr, P. and Han, L. and Gu, J. , title =. 2025 , journal =
work page 2025
-
[18]
Wang, X. and Wei, J. and Schuurmans, D. and Le, Q. and Chi, E. and Narang, S. and Chowdhery, A. and Zhou, D. , title =. 2023 , booktitle =
work page 2023
-
[19]
Guo, T. and Chen, X. and Wang, Y. and Chang, R. and Pei, S. and Chawla, N. V. and Wiest, O. and Zhang, X. , title =. 2024 , booktitle =
work page 2024
-
[20]
Song, M. and Pala, T. D. and Zhou, R. and Jin, W. and Zadeh, A. and Li, C. and Herremans, D. and Poria, S. , title =. 2025 , journal =
work page 2025
-
[21]
Sharma, Mrinank and Tong, Meg and Korbak, Tomasz and Duvenaud, David and Askell, Amanda and Bowman, Samuel R. and others , title =. 2024 , booktitle =
work page 2024
-
[22]
Mazeika, Mantas and Phan, Long and Yin, Xuwang and Zou, Andy and others , title =. 2024 , booktitle =
work page 2024
-
[23]
Yi, Sibo and Liu, Yule and Sun, Zhen and Cong, Tianshuo and He, Xinlei and Song, Jiaxing and Xu, Ke and Li, Qi , title =. 2024 , journal =
work page 2024
-
[24]
Huang, Yao and Sun, Yitong and Zhang, Yichi and Zhang, Ruochen and Dong, Yinpeng and Wei, Xingxing , title =. 2025 , booktitle =
work page 2025
-
[25]
Kolasani, S. and Saplin, M. and Crispino, N. and Montgomery, K. and Davis, J.Q. and Zaharia, M. and Wang, C. and Wang, C. , title =. 2025 , journal =
work page 2025
-
[26]
Duan, J. and Zhang, R. and Diffenderfer, J. and Kailkhura, B. and Sun, L. and Stengel-Eskin, E. and Bansal, M. and Chen, T. and Xu, K. , title =. 2024 , journal =
work page 2024
- [27]
-
[28]
Feng, X. and Luo, Y. and Wang, Z. and Tang, H. and Yang, M. and Shao, K. and Mguni, D. and Du, Y. and Wang, J. , title =. 2023 , booktitle =
work page 2023
-
[29]
Balunovic, M. and Dekoninck, J. and Petrov, I. and Jovanovic, N. and Vechev, M. , title =. 2025 , booktitle =
work page 2025
-
[30]
Tang, Z. and Wen, Q. and Grief-Albert, S. and Elgabra, Y. and Yang, B. and Dong, H. and Anderson, A. , title =. 2026 , journal =
work page 2026
- [31]
- [32]
-
[33]
Raghu, A. and Raghu, M. and Bengio, S. and Vinyals, O. , title =. 2020 , booktitle =
work page 2020
-
[34]
Wu, J. and Yang, S. and Zhan, R. and Yuan, Y. and Chao, L.S. and Wong, D.F. , title =. 2025 , journal =
work page 2025
-
[35]
Wu, J. and Zhan, R. and Wong, D.F. and Yang, S. and Yang, X. and Yuan, Y. and Chao, L.S. , title =. 2024 , booktitle =
work page 2024
- [36]
-
[37]
Guo, Q. and Wang, R. and Guo, J. and Li, B. and Song, K. and Tan, X. and Liu, G. and Bian, J. and Yang, Y. , title =. 2024 , booktitle =
work page 2024
-
[38]
Perez, E. and Huang, S. and Song, F. and Cai, T. and Ring, R. and Aslanides, J. and Glaese, A. and McAleese, N. and Irving, G. , title =. 2022 , booktitle =
work page 2022
-
[39]
Chao, P. and Robey, A. and Dobriban, E. and Hassani, H. and Pappas, G. J. and Wong, E. , title =. 2025 , booktitle =
work page 2025
-
[40]
Samvelyan, M. and Raparthy, S. C. and Lupu, A. and Hambro, E. and Markosyan, A. H. and Bhatt, M. and Mao, Y. and Jiang, M. and Parker-Holder, J. and Foerster, J. and Rocktaschel, T. and Raileanu, R. , title =. 2024 , booktitle =
work page 2024
- [41]
-
[42]
Bergstra, J. and Bardenet, R. and Bengio, Y. and Kegl, B. , title =. 2011 , booktitle =
work page 2011
-
[43]
Akiba, T. and Sano, S. and Yanase, T. and Ohta, T. and Koyama, M. , title =. 2019 , booktitle =
work page 2019
- [44]
-
[45]
Hu, E.J. and Shen, Y. and Wallis, P. and Allen-Zhu, Z. and Li, Y. and Wang, S. and Wang, L. and Chen, W. , title =. 2022 , booktitle =
work page 2022
- [46]
-
[47]
Chao, Patrick and Robey, Alexander and Dobriban, Edgar and Hassani, Hamed and Pappas, George J. and Wong, Eric , title =. 2024 , booktitle =
work page 2024
- [48]
-
[49]
Yomtov, Oren and McCarty, Paul , title =. 2026 , howpublished =
work page 2026
- [50]
-
[51]
Proceedings of the AAAI Conference on Artificial Intelligence , year =
Chen, Jianming and Wang, Yawen and Wang, Junjie and Xie, Xiaofei and Hu, Yuanzhe and Wang, Qing and Xu, Fanjiang , title =. Proceedings of the AAAI Conference on Artificial Intelligence , year =
discussion (0)
Sign in with ORCID, Apple, or X to comment. Anyone can read and Pith papers without signing in.