
arXiv: 2605.09027 · v2 · submitted 2026-05-09 · cs.CL · cs.AI · cs.LG · cs.MA

GAMBIT: A Three-Mode Benchmark for Adversarial Robustness in Multi-Agent LLM Collectives

Pith reviewed 2026-05-14 21:05 UTC · model grok-4.3

classification: cs.CL · cs.AI · cs.LG · cs.MA
keywords: multi-agent systems · adversarial robustness · LLM collectives · imposter detection · benchmark · adaptive adversaries · zero-shot evaluation · recalibration

The pith

A three-mode benchmark shows zero-shot detection of adaptive imposters in multi-agent LLM systems is misleading because adaptation gaps only appear after few-shot recalibration.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper introduces GAMBIT to test detectors against adaptive imposters that evolve to evade them in multi-agent LLM collectives. It uses two zero-shot modes that increase distribution shift and a third recalibration mode that measures how fast a detector improves from just 20 labeled examples. The work demonstrates that detectors with similar zero-shot scores can differ by a factor of eight in adaptation performance and that a meta-learned variant converges twenty times faster, differences visible only in the recalibration setting. Using chess games as the substrate, an evolutionary imposter framework collapses collective performance while staying nearly undetectable at 50.5 percent F1-score. The benchmark supplies 27,804 labeled instances across 240 co-evolved strategies to make these comparisons possible.

Core claim

GAMBIT supplies a dataset and three evaluation modes for imposter detectors in multi-agent LLM collectives: two zero-shot modes under increasing distribution shift and one recalibration mode that tracks adaptation speed from twenty labeled examples. An evolutionary imposter built on this substrate reduces collective task performance while remaining essentially undetectable by standard detectors. The evaluation shows that zero-shot scores alone cannot predict real-world robustness: detectors with nearly identical zero-shot performance differ by up to eight times in few-shot adaptation, and the meta-learned variant converges twenty times faster than baselines, a distinction that is visible only in the recalibration mode.

What carries the argument

The GAMBIT benchmark with its three evaluation modes (zero-shot under distribution shift and few-shot recalibration) together with the evolutionary imposter agent that co-evolves attack strategies against the collective.
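To make the two kinds of score concrete, here is a minimal sketch of how an imposter-class F1 and a recalibration gain could be computed. The function names, the label convention (1 = imposter, 0 = honest), and the toy predictions are illustrative assumptions; the paper reports a per-chain macro ∆F1, which this simplified version does not reproduce.

```python
from typing import Sequence


def imposter_f1(y_true: Sequence[int], y_pred: Sequence[int]) -> float:
    """F1 for the positive (imposter = 1) class; returns 0.0 when undefined."""
    tp = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 1)
    fp = sum(1 for t, p in zip(y_true, y_pred) if t == 0 and p == 1)
    fn = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 0)
    if tp == 0:
        return 0.0
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    return 2 * precision * recall / (precision + recall)


def adaptation_gain(f1_zero_shot: float, f1_recalibrated: float) -> float:
    """Improvement in F1 after recalibrating on a small labeled set."""
    return f1_recalibrated - f1_zero_shot


# Toy data (hypothetical): the same eight turns scored before and after
# a detector has been recalibrated on a handful of labeled examples.
y_true = [1, 1, 1, 1, 0, 0, 0, 0]
zero_shot_pred = [1, 0, 0, 0, 1, 0, 0, 0]
recalibrated_pred = [1, 1, 1, 0, 0, 0, 0, 1]

f1_before = imposter_f1(y_true, zero_shot_pred)
f1_after = imposter_f1(y_true, recalibrated_pred)
delta = adaptation_gain(f1_before, f1_after)
print(f"zero-shot F1={f1_before:.3f}, recalibrated F1={f1_after:.3f}, gain={delta:.3f}")
```

Two detectors can share the same `f1_before` and still produce very different `delta` values, which is the kind of contrast the recalibration mode is meant to expose.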

If this is right

  • Detectors must be tested in a recalibration mode to reveal their true robustness against evolving adversaries.
  • Meta-learned detectors can reach usable accuracy with far fewer examples than standard fine-tuning.
  • An evolutionary attack framework can produce imposters that simultaneously damage collective output and evade detection.
  • Zero-shot leaderboards will systematically overstate the security of multi-agent systems against adaptive threats.
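The third point describes an attacker that optimises two objectives at once: harm to the collective and stealth against the detector. The sketch below shows one generic way such a trade-off could be encoded in an evolutionary loop. Everything in it is an assumption made for illustration: the binary 10-gene genome, the `damage_fn` and `detect_fn` callables, the weighting, and the selection scheme are hypothetical and are not taken from the paper's framework.

```python
import random


def fitness(genome, damage_fn, detect_fn, stealth_weight=1.0):
    """Reward damage to the collective and penalise detectability (both in [0, 1])."""
    return damage_fn(genome) - stealth_weight * detect_fn(genome)


def evolve(population, damage_fn, detect_fn, generations=4, keep=4,
           mutation_rate=0.1, seed=0):
    """Elitist evolutionary loop: keep the best genomes, refill with mutated offspring."""
    rng = random.Random(seed)
    for _ in range(generations):
        ranked = sorted(population,
                        key=lambda g: fitness(g, damage_fn, detect_fn),
                        reverse=True)
        parents = ranked[:keep]
        children = []
        while len(parents) + len(children) < len(population):
            mother, father = rng.sample(parents, 2)
            # Uniform crossover: each gene comes from one of the two parents.
            child = [rng.choice(pair) for pair in zip(mother, father)]
            # Bit-flip mutation with a small per-gene probability.
            child = [1 - bit if rng.random() < mutation_rate else bit for bit in child]
            children.append(child)
        population = parents + children
    return max(population, key=lambda g: fitness(g, damage_fn, detect_fn))


# Hypothetical stand-ins: the first five genes drive damage, the last five
# make the imposter easier to detect.
def damage(genome):
    return sum(genome[:5]) / 5


def detect(genome):
    return sum(genome[5:]) / 5


init_rng = random.Random(1)
initial = [[init_rng.randint(0, 1) for _ in range(10)] for _ in range(12)]
initial_best = max(fitness(g, damage, detect) for g in initial)
best = evolve(initial, damage, detect)
print(best, fitness(best, damage, detect))
```

In a real setting the two callables would be expensive (full games plus a detector pass), so the number of evaluations per generation becomes the dominant cost.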

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

  • Defenses for multi-agent systems may need built-in mechanisms for rapid online recalibration rather than static detectors.
  • The same evolutionary attack method could be applied to other cooperative tasks such as code generation or planning.
  • Benchmarking should treat attack and defense as co-evolving processes rather than fixed pairs.

Load-bearing premise

That chess games played by Gemini agents form a representative substrate for general multi-agent LLM collectives and that the evolutionary imposter framework will generalize to other tasks without major changes.

What would settle it

A direct test in which every detector shows identical relative performance across the zero-shot and recalibration modes, or in which the evolutionary imposter fails to reduce collective performance on a non-chess reasoning task.

Figures

Figures reproduced from arXiv: 2605.09027 by Alexandre Le Mercier, Chris Develder, Thomas Demeester.

Figure 1. Overview of the GAMBIT framework. Left: the game environment presents four candidate decisions stratified by quality (see …
Figure 2. Illustration of the collective framework used in …
Figure 3. Final CP score (higher is better) after forced opening plus 20 non-forced moves across …
Figure 4. Median Stockfish evaluation trajectory over 20 moves for all seven conditions ( …
Figure 5. Move quality distribution by condition (log scale). Tiers are position-relative (assigned by …
Figure 6. Undermine approach effectiveness across four generations (dashed line marks the train/test boundary). “Comparative” (red) is the top or near-top strategy in every generation (Spearman ρ = +1.000 from Gen 1 to Gen 4), making it the single OOD-stable invariant in the 10-gene space. Error bars show standard error of the mean. standard_inject dominates from Gen 3 onward (Gen 4 mean move_score 0.199, n = 708, +…
Figure 7. Imposter evolution across four generations (dual axis). Move score (red, left axis) is …
Figure 8. N-gram classifier ID vs OOD collapse. ID imposter F1 is near-perfect ( …
Figure 9. Detector benchmark results. Left: imposter-class F1 for ID (in-distribution, extending left) and OOD (out-of-distribution, extending right); the boxed value is the normalized detection score. Right: per-chain macro ∆F1 (adaptation score) for SmolLM 3B under SFT and ANIL training across all 62 test-set imposter strategies (each defined by a unique gene combination; cf. §3). SFT and ANIL achieve near-identic…
Figure 10. Imposter-class validation F1 over wall-clock time for all detector configurations. Stars …
Figure 11. Wall-clock training time for all ID and OOD detector configurations. Light bars show …
Figure 12. Bimodal compliance distribution across all imposter turns (n=4,644). 68.8% of turns are fully rejected (0/3 honest agents comply) and 25.4% are unanimously captured (3/3), with only 5.9% in the intermediate range, revealing a threshold persuasion effect rather than gradual compliance.
Figure 13. Gen 1 imposter deception (move 31B, Spanish Chigorin). Dmitri [imposter] pushes Qa6 …
Figure 14. Gen 4 distributional evasion (move 24B, QGA Steinitz). The imposter’s text reads as …
Figure 15. Suspicious mutual accusation (move 12, Najdorf Byrne). Aria accuses Dmitri and Bastien …
Figure 16. Honest-scapegoat (move 9B, Catalan Closed, Sus …

Original abstract

In multi-agent systems (MAS), a single deceptive agent can nullify all gains of an agentic AI collective and evade deployed defenses. However, existing adversarial studies on MAS target only shallow tasks and do not consider adaptive adversaries, which evolve their strategies to evade the very detectors trained to catch them. To address that gap, we introduce GAMBIT, a benchmark with three evaluation modes and two independent scores for evaluating imposter detectors: the first two modes measure zero-shot detection under increasing distribution shift, and a third recalibration mode measures how quickly a detector adapts to novel attacks from just 20 labeled examples. The benchmark comes with a dataset of 27,804 labeled instances spanning 240 co-evolved imposter strategies. Our contributions are threefold: (1) Using chess as a substrate deep reasoning problem and Gemini 3.1 Pro for agents, we release GAMBIT and its dataset to evaluate imposter detectors under realistic constraints against a stealthy adaptive imposter; (2) We introduce an adaptive imposter agent based on an efficient evolutionary framework, generalizable beyond chess, that collapses collective task performance while remaining essentially undetectable (50.5% F1-score with a Gemini-based detector); (3) We show that zero-shot evaluation can be highly misleading for adaptive adversaries: two detectors with near-identical zero-shot scores differ by 8x on few-shot adaptation, while the meta-learned variant converges 20x faster, a gap only visible in the recalibration mode. Altogether, GAMBIT provides the first multi-agent benchmark where adversarial attacks and defenses co-evolve, with an imposter framework generalizable beyond our use case, and promising techniques for fast recalibration in a rapidly evolving adversarial system. Code and data: https://anonymous.4open.science/r/gambit.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Referee Report

3 major / 1 minor

Summary. The paper introduces GAMBIT, a three-mode benchmark for adversarial robustness in multi-agent LLM collectives. Using chess games played by Gemini 3.1 Pro agents as substrate, it releases a dataset of 27,804 labeled instances spanning 240 co-evolved imposter strategies. The benchmark evaluates imposter detectors in zero-shot modes under distribution shift and a recalibration mode measuring few-shot adaptation from 20 examples. Key claims are that an evolutionary imposter framework collapses collective performance while remaining nearly undetectable (50.5% F1 with a Gemini detector), and that zero-shot scores are highly misleading: detectors with similar zero-shot performance differ by 8x on adaptation, while a meta-learned variant converges 20x faster, gaps visible only in recalibration.

Significance. If the empirical contrasts hold after addressing validation gaps, GAMBIT would be a useful contribution by providing the first co-evolving attack-defense benchmark for MAS and demonstrating the value of a recalibration mode for adaptive adversaries. The public dataset release and evolutionary framework (claimed generalizable beyond chess) are concrete strengths that could support follow-on work on fast adaptation techniques.

major comments (3)
  1. [Abstract and experimental results] Abstract and § on experimental setup: the reported 50.5% F1, 8x, and 20x gaps are given without error bars, confidence intervals, or statistical tests; this is load-bearing for the central claim that zero-shot evaluation is 'highly misleading' because the magnitude of the adaptation gaps cannot be assessed for reliability.
  2. [Dataset and evolutionary imposter] § on evolutionary framework and dataset construction: no details are provided on how the 240 strategies were verified as independent (e.g., pairwise similarity metrics or diversity analysis), nor on regularization or held-out validation to avoid overfitting within the evolutionary loop; this directly affects whether the recalibration-mode gaps reflect genuine adaptation or artifacts of the generation process.
  3. [Introduction and discussion] § on generalizability and substrate choice: the headline claim that zero-shot scores mislead for adaptive adversaries rests on chess with a single model family; no cross-domain transfer experiments (e.g., to negotiation or planning tasks) are described, leaving the 8x/20x differences vulnerable to the substrate-specific concern that chess move distributions and Gemini reasoning style may not generalize.
minor comments (1)
  1. [Appendix or reproducibility] The anonymous code link is noted but the manuscript should include a brief reproducibility checklist (e.g., exact evolutionary hyperparameters, random seeds, and prompt templates) to support the 'generalizable beyond chess' claim.
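Major comment 1 asks for confidence intervals on the reported gaps. As a rough illustration of what that could look like, the sketch below computes a percentile bootstrap interval for the ratio of mean adaptation gains between two detectors. The per-strategy gain values are synthetic, the function name is hypothetical, and the bootstrap is just one generic option, not the analysis used or planned by the authors.

```python
import random
from statistics import mean


def bootstrap_ratio_ci(gains_a, gains_b, n_resamples=2000, alpha=0.05, seed=0):
    """Percentile bootstrap CI for mean(gains_a) / mean(gains_b)."""
    rng = random.Random(seed)
    ratios = []
    for _ in range(n_resamples):
        # Resample each detector's per-strategy gains with replacement.
        sample_a = rng.choices(gains_a, k=len(gains_a))
        sample_b = rng.choices(gains_b, k=len(gains_b))
        denominator = mean(sample_b)
        if denominator <= 0:
            continue  # the ratio is not meaningful for this resample
        ratios.append(mean(sample_a) / denominator)
    ratios.sort()
    lower = ratios[int((alpha / 2) * len(ratios))]
    upper = ratios[int((1 - alpha / 2) * len(ratios)) - 1]
    return lower, upper


# Synthetic per-strategy adaptation gains (delta F1) for two detectors.
gains_fast = [0.30, 0.42, 0.35, 0.28, 0.40, 0.33, 0.37, 0.31]
gains_slow = [0.05, 0.03, 0.06, 0.04, 0.05, 0.04, 0.06, 0.03]

point_estimate = mean(gains_fast) / mean(gains_slow)
ci_low, ci_high = bootstrap_ratio_ci(gains_fast, gains_slow)
print(f"ratio={point_estimate:.2f}, 95% CI=({ci_low:.2f}, {ci_high:.2f})")
```

A headline multiplier such as "8x" is easier to interpret when it is reported together with an interval of this kind, because ratios of small means can be unstable.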

Simulated Author's Rebuttal

3 responses · 0 unresolved

We thank the referee for their constructive feedback on our manuscript. We address each of the major comments below, outlining the revisions we plan to make to strengthen the paper.

  1. Referee: [Abstract and experimental results] Abstract and § on experimental setup: the reported 50.5% F1, 8x, and 20x gaps are given without error bars, confidence intervals, or statistical tests; this is load-bearing for the central claim that zero-shot evaluation is 'highly misleading' because the magnitude of the adaptation gaps cannot be assessed for reliability.

    Authors: We fully agree that statistical rigor is crucial for validating the central claims. In the revised version, we will recompute the metrics with multiple independent runs and include error bars (standard deviations), 95% confidence intervals, and statistical significance tests (such as Wilcoxon signed-rank tests for the gaps). This will allow readers to assess the reliability of the 8x and 20x differences. revision: yes

  2. Referee: [Dataset and evolutionary imposter] § on evolutionary framework and dataset construction: no details are provided on how the 240 strategies were verified as independent (e.g., pairwise similarity metrics or diversity analysis), nor on regularization or held-out validation to avoid overfitting within the evolutionary loop; this directly affects whether the recalibration-mode gaps reflect genuine adaptation or artifacts of the generation process.

    Authors: We appreciate this point and will enhance the manuscript with additional details on the evolutionary framework. Specifically, we will add descriptions of how strategy independence was ensured through pairwise similarity metrics (using cosine similarity on strategy embeddings), diversity analysis via clustering and entropy calculations, and the incorporation of held-out validation sets and regularization in the evolutionary fitness function to mitigate overfitting. These additions will demonstrate that the observed adaptation gaps arise from genuine generalization rather than generation artifacts. revision: yes

  3. Referee: [Introduction and discussion] § on generalizability and substrate choice: the headline claim that zero-shot scores mislead for adaptive adversaries rests on chess with a single model family; no cross-domain transfer experiments (e.g., to negotiation or planning tasks) are described, leaving the 8x/20x differences vulnerable to the substrate-specific concern that chess move distributions and Gemini reasoning style may not generalize.

    Authors: We acknowledge the limitation regarding generalizability. Chess was selected as the substrate because it demands deep strategic reasoning and provides a clear, quantifiable task performance metric, making it suitable for evaluating deception in multi-agent settings. The evolutionary imposter framework is designed to be model- and domain-agnostic. In the revision, we will expand the discussion to include a more thorough analysis of potential substrate biases and argue for broader applicability based on the framework's design. We will also explicitly state plans for future cross-domain validation as future work, as conducting such experiments would require substantial additional resources beyond this revision. revision: partial
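The second response mentions pairwise cosine similarity on strategy embeddings and entropy-based diversity analysis. The sketch below shows what those two quantities look like on toy inputs. The embeddings, the cluster labels, and the function names are hypothetical; how the authors would embed or cluster strategies is not specified here.

```python
import math
from collections import Counter
from itertools import combinations


def cosine_similarity(u, v):
    """Cosine of the angle between two vectors; 0.0 if either is all zeros."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    if norm_u == 0 or norm_v == 0:
        return 0.0
    return dot / (norm_u * norm_v)


def mean_pairwise_similarity(embeddings):
    """Average cosine similarity over all unordered pairs (lower = more diverse)."""
    pairs = list(combinations(embeddings, 2))
    return sum(cosine_similarity(u, v) for u, v in pairs) / len(pairs)


def label_entropy_bits(labels):
    """Shannon entropy (in bits) of a cluster assignment (higher = more spread out)."""
    counts = Counter(labels)
    total = len(labels)
    return -sum((c / total) * math.log2(c / total) for c in counts.values())


# Toy strategy embeddings and cluster labels (hypothetical values).
strategy_embeddings = [[1.0, 0.0, 0.0], [0.0, 1.0, 0.0], [1.0, 1.0, 0.0]]
cluster_labels = ["a", "a", "b", "b"]

similarity = mean_pairwise_similarity(strategy_embeddings)
entropy = label_entropy_bits(cluster_labels)
print(f"mean pairwise cosine similarity={similarity:.3f}, entropy={entropy:.3f} bits")
```

A mean similarity close to 1 or an entropy close to 0 would suggest that the strategies are near-duplicates, which is the concern raised in the referee's second comment.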

Circularity Check

0 steps flagged

No significant circularity in GAMBIT benchmark derivation

The paper introduces an external benchmark and dataset of 27,804 instances generated from 240 co-evolved strategies via a new evolutionary imposter framework. Reported gaps (8x adaptation difference, 20x faster convergence) are measured empirical outcomes on held-out strategies across zero-shot and recalibration modes, not quantities defined by construction from the same fitted parameters or inputs. No self-citations are load-bearing for the central claims, no ansatz is smuggled, and the framework is presented as generalizable without reducing the results to prior self-work by definition. The derivation chain is self-contained against the released artifacts.

Axiom & Free-Parameter Ledger

1 free parameter · 1 axiom · 0 invented entities

The central claims rest on the assumption that chess with Gemini 3.1 Pro constitutes a sufficiently deep and representative substrate for multi-agent LLM collectives, plus the unstated premise that the evolutionary framework produces genuinely novel attack strategies rather than rediscovering known patterns.

free parameters (1)
  • evolutionary hyperparameters
    The abstract does not specify population size, mutation rates, or selection criteria used to co-evolve the 240 imposter strategies; these are free parameters that directly shape the reported F1 scores.
axioms (1)
  • domain assumption: Chess games require deep reasoning that generalizes to other multi-agent LLM tasks
    Invoked when choosing chess as the substrate without further justification in the abstract.

pith-pipeline@v0.9.0 · 5639 in / 1492 out tokens · 36902 ms · 2026-05-14T21:05:26.707590+00:00 · methodology
