pith. sign in

arxiv: 2606.28514 · v1 · pith:DGKM5TFBnew · submitted 2026-06-26 · 💻 cs.AI · cs.CL

GPTNT: Benchmarking Real-Time Collaboration Between Multimodal Agents on Keep Talking And Nobody Explodes

Pith reviewed 2026-06-30 01:11 UTC · model grok-4.3

classification 💻 cs.AI cs.CL
keywords multimodal agentsreal-time collaborationbenchmarkinformation asymmetrycooperative gamesstate trackingprocedural generationerror recovery
0
0 comments X

The pith

No tested multimodal models defuse even one bomb in real time under the benchmark conditions, while human players succeed.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper presents GPTNT, a benchmark built on the cooperative game Keep Talking and Nobody Explodes, where two agents must collaborate in real time against a countdown. One agent sees and manipulates the bomb without instructions; the other has the instructions but cannot see the bomb. The setup uses procedural generation and conditions that withhold the manual, the partner, or both to force genuine derivation and communication rather than recall of known solutions. Current models from both closed and open sources achieve zero successes, exposing shortfalls in state tracking, action under time pressure, ambiguity handling, and error recovery.

Core claim

GPTNT requires two agents to coordinate asynchronously in real time to defuse procedurally generated bomb modules, with success depending on effective communication across information asymmetry. None of the tested models completes even a single defusal, a threshold human players meet, and controlled experiments trace the failures to specific weaknesses in maintaining shared state, acting efficiently under countdown pressure, resolving ambiguous messages, and recovering from mistakes.

What carries the argument

The GPTNT benchmark, which places agents in asymmetric roles within a live countdown game and withholds the manual or partner to isolate real-time collaboration from memorized knowledge.

If this is right

  • Models exhibit specific shortfalls in state tracking and handling time pressure that prevent collaborative success.
  • Error recovery and ambiguity resolution remain insufficient for tasks requiring ongoing coordination.
  • The benchmark can continue to evolve through the game's procedural generation and modding community instead of becoming fixed and retired.
  • Withholding conditions allow direct measurement of derivation versus recall in collaborative settings.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

  • Similar real-time asymmetry tests could be applied to other domains where agents must act without full shared information.
  • Models that improve on GPTNT might transfer better to dynamic human-AI teaming scenarios than those tuned only on static benchmarks.
  • The zero-success result suggests current training may under-emphasize efficient communication under strict time limits.

Load-bearing premise

The game setup with procedural generation and withholding conditions accurately isolates in-the-moment derivation and collaboration skills from memorized knowledge or other confounds.

What would settle it

A model that completes at least one procedurally generated bomb defusal in real time when either the instruction manual or the partner is withheld.

Figures

Figures reproduced from arXiv: 2606.28514 by Alessandro Suglia, Amit Parekh, Ioannis Konstas, Kareem Al-Hasan, Malvina Nikandrou, Sabrina McCallum.

Figure 1
Figure 1. Figure 1: High-level overview of the interaction in [PITH_FULL_IMAGE:figures/full_fig_p001_1.png] view at source ↗
Figure 2
Figure 2. Figure 2: Rollout of two agents interacting in real time. The Defuser gathers and relays information about [PITH_FULL_IMAGE:figures/full_fig_p004_2.png] view at source ↗
Figure 3
Figure 3. Figure 3: Overview of the default puzzle module types in [PITH_FULL_IMAGE:figures/full_fig_p005_3.png] view at source ↗
Figure 4
Figure 4. Figure 4: Manual page for Simon Says with widget references highlighted. Widgets. The four sides of a bomb—left, right, top, and bottom—can carry a number of widgets: non-interactive components that are sampled randomly, and can include a serial number, various types of batteries, ports, and indicators. The correct solution to several module types depends on these values, for instance, whether an indicator is lit, t… view at source ↗
Figure 5
Figure 5. Figure 5: Example showing how the generation time for each request relates to the in-game time for [PITH_FULL_IMAGE:figures/full_fig_p006_5.png] view at source ↗
Figure 6
Figure 6. Figure 6: SoM on Wires Set-of-Marks. While gptnt supports coordinate-based interaction, models using direct coordinate output perform substantially worse (see Appendix G.10 for prelimi￾nary results). We therefore adopt a set-of-marks (SoM) approach (Yang et al., 2023), following its use in recent agentic settings (Koh et al., 2024; Liu et al., 2024b; Xie et al., 2024b; Zhang et al., 2025b), to avoid obscuring agent … view at source ↗
Figure 7
Figure 7. Figure 7: Distribution of action types (% of game steps) by role and mission outcome across the early, middle, [PITH_FULL_IMAGE:figures/full_fig_p014_7.png] view at source ↗
Figure 8
Figure 8. Figure 8: Frequency of game outcomes depending on strikes across games played by all possible model [PITH_FULL_IMAGE:figures/full_fig_p014_8.png] view at source ↗
Figure 9
Figure 9. Figure 9: Examples for each task in the offline Manual VQA. [PITH_FULL_IMAGE:figures/full_fig_p015_9.png] view at source ↗
Figure 10
Figure 10. Figure 10: Examples from the offline Simulator VQA evaluation tasks. [PITH_FULL_IMAGE:figures/full_fig_p016_10.png] view at source ↗
Figure 11
Figure 11. Figure 11: Examples of SoM localisation tasks on Simon Says and Wires modules. In ktane, the Defuser must not only perceive, but also act on what it perceives. To interact with UI elements, the Defuser must localise the correct element according to the Expert’s instructions. As discussed in §4, we facilitate localisation via set-of-marks (SoM; Yang et al., 2023). Labelled segmentation masks are overlaid on the image… view at source ↗
Figure 12
Figure 12. Figure 12: Example comparing Passwords across English and Arabic in ktane. The gptnt environment is uniquely suited to stay ahead of growing model capabilities and mitigate the risk of the benchmark becoming saturated. The set of core missions we introduce in this iteration of the benchmark only covers relatively low difficulty levels (2–3 of 7, see Appendix B.3), and more challenging missions can easily be configur… view at source ↗
read the original abstract

Multimodal models are increasingly deployed to solve tasks collaboratively with humans or other artificial agents. Existing benchmarks show that these models possess many of the required component capabilities, but the conditions that coincide in collaboration, including time pressure, information asymmetry, and imperfect communication, are usually studied in isolation. We introduce GPTNT, a benchmark built on the cooperative video game Keep Talking and Nobody Explodes, in which two agents must coordinate to defuse procedurally generated bomb puzzles against a live countdown. One agent can see and manipulate the bomb but does not have the defusal instructions; the other has the instructions but cannot see or manipulate the bomb. Neither agent can succeed alone: success requires effective and efficient communication. Unlike turn-based proxies, GPTNT requires agents to act asynchronously and communicate in real time. GPTNT is designed to separate collaboration from reliance on memorized solutions: the instruction manual, the partner, or both can be withheld to isolate what a model derives in the moment from what it already knows. We show that GPTNT poses a substantial challenge for state-of-the-art systems: none of the closed- or open-source models we test defuses a single bomb in real time, a bar that human players clear. Through controlled experiments, we identify critical weaknesses in state tracking, efficient action under time pressure, ambiguity handling, and error recovery. We release GPTNT as a benchmark for collaborative performance that current evaluations leave unmeasured. Because it runs on the real game, GPTNT benefits from procedural generation and inherits a living modding community, allowing the benchmark to evolve as models improve rather than being solved once and retired.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Referee Report

2 major / 0 minor

Summary. The paper introduces the GPTNT benchmark, built on the cooperative game Keep Talking and Nobody Explodes, to evaluate real-time multimodal agent collaboration under time pressure, information asymmetry, and imperfect communication. Two agents must coordinate asynchronously: one manipulates the bomb without instructions, the other holds the manual without visual access. The design includes withholding conditions (manual, partner, or both) plus procedural generation to isolate in-the-moment derivation and collaboration skills from memorized knowledge. The central empirical claim is that none of the tested closed- or open-source models defuse any bomb in real time (a bar humans clear), with controlled experiments identifying weaknesses in state tracking, efficient action under time pressure, ambiguity handling, and error recovery. The benchmark is released to evolve with the modding community.

Significance. If the isolation claim holds and the zero-success result is robust, GPTNT would provide a valuable, hard-to-game benchmark for collaborative capabilities that existing evaluations do not measure. Strengths include the use of a real game with procedural generation (preventing static solution memorization) and the potential for the benchmark to scale rather than be retired. The identification of concrete failure modes offers actionable directions, though the significance is tempered by the need to confirm that observed failures stem from the targeted skills rather than confounds.

major comments (2)
  1. [Abstract / Benchmark Design] Abstract and benchmark design: the claim that withholding the manual (or partner) isolates 'what a model derives in the moment from what it already knows' is load-bearing for attributing the zero-success result to deficits in state tracking, real-time action, and collaboration. Because the KTANE manual is a fixed public document, models whose pretraining includes wikis, rule PDFs, or transcripts could still apply correct procedures even when the text is withheld at runtime. Procedural generation randomizes module layouts and configurations but leaves rule semantics unchanged, so it does not block this pathway. Without additional controls (e.g., explicit tests of rule recall when the manual is absent), the headline result cannot be cleanly attributed to the intended skills.
  2. [Experimental Results] Experimental results section: the abstract reports that 'none of the closed- or open-source models we test defuses a single bomb in real time' yet supplies no information on the number of trials per condition, specific models and versions tested, success criteria, error bars, or statistical tests. This absence prevents verification that the zero-success rate is reliable rather than an artifact of small sample size or particular prompting setups.

Simulated Author's Rebuttal

2 responses · 0 unresolved

We thank the referee for the detailed and constructive report. We address each major comment below and outline revisions to improve clarity and robustness of the claims.

read point-by-point responses
  1. Referee: [Abstract / Benchmark Design] The claim that withholding the manual (or partner) isolates 'what a model derives in the moment from what it already knows' is load-bearing... Without additional controls (e.g., explicit tests of rule recall when the manual is absent), the headline result cannot be cleanly attributed to the intended skills.

    Authors: We agree this is a valid concern: because the KTANE manual is a fixed public document, pretraining exposure could allow rule recall even under withholding. Procedural generation varies module instances but not core semantics. To strengthen the isolation claim, we will add a dedicated control experiment measuring rule recall accuracy when the manual is withheld at runtime (comparing against conditions where it is provided). This will be reported in a new subsection of the benchmark design and results, allowing clearer attribution of failures to state tracking and collaboration rather than memorized knowledge. revision: yes

  2. Referee: [Experimental Results] the abstract reports that 'none of the closed- or open-source models we test defuses a single bomb in real time' yet supplies no information on the number of trials per condition, specific models and versions tested, success criteria, error bars, or statistical tests.

    Authors: The full manuscript provides these details in the Experimental Results section (including trial counts, model versions, exact success criteria based on bomb defusal within the timer, and any applicable statistics). However, we acknowledge the abstract is too terse for standalone verification. We will revise the abstract to include a concise statement on the number of trials, models evaluated, and success definition. revision: yes

Circularity Check

0 steps flagged

No circularity; new benchmark evaluated empirically against external models

full rationale

The paper introduces GPTNT, a new benchmark built on an existing game, and reports an empirical observation that none of the tested closed- or open-source models succeed while humans do. No equations, parameter fits, self-definitional derivations, or load-bearing self-citations appear in the provided text. The withholding conditions and procedural generation are presented as design choices to isolate skills, but the result (model failure rates) does not reduce to those choices by construction or via any cited prior work by the authors. The evaluation relies on external models and is falsifiable outside the paper.

Axiom & Free-Parameter Ledger

0 free parameters · 1 axioms · 0 invented entities

The central claim rests on the validity of the benchmark design assumptions and the reported experimental outcomes; no free parameters or invented entities are described in the abstract.

axioms (1)
  • domain assumption Procedural generation of bomb puzzles combined with withholding conditions isolates what a model derives in the moment from memorized solutions.
    Stated as a core design goal to separate collaboration from reliance on prior knowledge.

pith-pipeline@v0.9.1-grok · 5846 in / 1183 out tokens · 27362 ms · 2026-06-30T01:11:18.321286+00:00 · methodology

discussion (0)

Sign in with ORCID, Apple, or X to comment. Anyone can read and Pith papers without signing in.

Reference graph

Works this paper leans on

17 extracted references · 7 canonical work pages · 3 internal anchors

  1. [1]

    ARC-AGI-2: A New Challenge for Frontier AI Reasoning Systems

    doi:10.1145/3474666. Tamay Besiroglu, Sage Andrus Bergerson, Amelia Michael, Lennart Heim, Xueyun Luo, and Neil Thompson. The Compute Divide in Machine Learning: A Threat to Academic Contribution and Scrutiny?, January 2024. Stella Biderman, Hailey Schoelkopf, Lintang Sutawika, Leo Gao, Jonathan Tow, Baber Abbasi, Alham Fikri Aji, Pawan Sasanka Ammanamanc...

  2. [2]

    Other-Play

    Curran Associates Inc. Xiaofeng Gao, Qiaozi Gao, Ran Gong, Kaixiang Lin, Govind Thattai, and Gaurav S. Sukhatme. DialFRED: Dialogue-Enabled Agents for Embodied Instruction Following.IEEE Robotics and Automation Letters, 7 (4):10049–10056, October 2022. ISSN 2377-3766. doi:10.1109/LRA.2022.3193254. Darren Gergle, Robert E. Kraut, and Susan R. Fussell. Lang...

  3. [3]

    doi:10.18653/v1/2022.acl-short.18

    Association for Computational Linguistics. doi:10.18653/v1/2022.acl-short.18. Volodymyr Mnih, Koray Kavukcuoglu, David Silver, Andrei A. Rusu, Joel Veness, Marc G. Bellemare, Alex Graves, Martin Riedmiller, Andreas K. Fidjeland, Georg Ostrovski, Stig Petersen, Charles Beattie, Amir Sadik, Ioannis Antonoglou, Helen King, Dharshan Kumaran, Daan Wierstra, Sh...

  4. [4]

    Dota 2 with Large Scale Deep Reinforcement Learning

    ISSN 0028-0836, 1476-4687. doi:10.1038/nature14236. OpenAI. Computer-Using Agent, January 2025a. URLhttps://openai.com/index/computer-using-age nt/. OpenAI. Introducing GPT-5.2, December 2025b. URLhttps://openai.com/index/introducing-gpt-5 -2/. OpenAI. Function calling, April 2026. URLhttps://developers.openai.com/api/docs/guides/functi on-calling. OpenAI...

  5. [5]

    doi:10.1111/psyp.12171

    ISSN 1469-8986. doi:10.1111/psyp.12171. Mrinank Sharma, Meg Tong, Tomasz Korbak, David Duvenaud, Amanda Askell, Samuel R. Bowman, Esin Durmus, Zac Hatfield-Dodds, Scott R. Johnston, Shauna M. Kravec, Timothy Maxwell, Sam McCandlish, Kamal Ndousse, Oliver Rausch, Nicholas Schiefer, Da Yan, Miranda Zhang, and Ethan Perez. Towards Understanding Sycophancy in...

  6. [6]

    UI-TARS-2 Technical Report: Advancing GUI Agent with Multi-Turn Reinforcement Learning

    URLhttps://openreview.net/forum?id=dfeFy1PSSw. Charlie Victor Snell, Jaehoon Lee, Kelvin Xu, and Aviral Kumar. Scaling LLM Test-Time Compute Optimally Can be More Effective than Scaling Parameters for Reasoning. InThe Thirteenth International Conference on Learning Representations, October 2024. URLhttps://openreview.net/forum?id=4FWAwZtd2n. Leonidas Spil...

  7. [7]

    Jianwei Yang, Hao Zhang, Feng Li, Xueyan Zou, Chunyuan Li, and Jianfeng Gao

    URL https://openaccess.thecvf.com/content/CVPR2026/html/Xu_VS-Bench_Evaluating_VL Ms_for_Strategic_Abilities_in_Multi-Agent_Environments_CVPR_2026_paper.html. Jianwei Yang, Hao Zhang, Feng Li, Xueyan Zou, Chunyuan Li, and Jianfeng Gao. Set-of-Mark Prompting Unleashes Extraordinary Visual Grounding in GPT-4V, November 2023. 33 Shunyu Yao, Jeffrey Zhao, Dia...

  8. [8]

    Li et al

    moves in a complementary direction by introducing natural environment dynamics—where the world evolves independently of what agents do—but does so in a single-agent setting, forfeiting the cooperative structure entirely. Li et al. (2024c) takes yet another path, trading visual grounding for richer interaction modelling and operating on text-based state re...

  9. [9]

    automatically solved

    and Werewolf (Wu et al., 2024) operationalise this best, with hidden roles creating a communicative pressure under which agents must reason strategically over what others know and believe. WhodunitBench (Xie et al., 2024a) extends this with visual grounding over evidential images, adding an absent multimodal layer to purely text-based deduction. But all t...

  10. [10]

    Because the game is paused, observations are captured from a frozen, fully resolved game state

    Observe.The Defuser pulls its observations and drains its message queue. Because the game is paused, observations are captured from a frozen, fully resolved game state

  11. [11]

    Unity’s internal time scale is set to zero during generation, so the model may take unbounded wall-clock time at no cost in game time

    Generate.The Defuser produces a forward pass. Unity’s internal time scale is set to zero during generation, so the model may take unbounded wall-clock time at no cost in game time

  12. [12]

    The action executes and all resulting animations and state transitions resolve before the game is paused again

    Act.The action is dispatched to the game, which is resumed for a window of at least3s of game time; in practice the realised window is3s to 3.5s (Appendix B.4.1). The action executes and all resulting animations and state transitions resolve before the game is paused again. The bomb clock therefore advances for each Defuser turn, regardless of the type of...

  13. [13]

    result": {

    Expert turn.While the game is paused, the Expert pulls its message queue, produces a forward pass, and dispatches its action. Expert turns consume no game time. The cycle repeats until the game ends. Because each Defuser turn consumes3s, a mission’s time limit translates into a maximum number of Defuser turns—the time limit divided by three (Appendix B.4)...

  14. [14]

    Therefore, the base conversational ability must not have been degraded by task-specific post-training

    Conversational capability.Models must sustain coherent multi-turn dialogue across a full game, which can span dozens of turns per player. Therefore, the base conversational ability must not have been degraded by task-specific post-training

  15. [15]

    Interleaved multi-image multi-turn support.Models must handle sequences of images interleaved with text within a single conversation. The Defuser receives new visual observations on every turn, and the Expert receives the manual as a sequence of page images interleaved with extracted text across the first user turn. Architectures that support only single-...

  16. [16]

    This places a practical lower bound on usable context length, with InternVL representing the most constrained case at 40,060 tokens (Table E.1)

    Context length.Models must be capable of holding the Expert’s manual and a sufficient dialogue history simultaneously within their context window. This places a practical lower bound on usable context length, with InternVL representing the most constrained case at 40,060 tokens (Table E.1)

  17. [17]

    Keep Talking and Nobody Explodes

    No task-specific training.Models may not be fine-tuned on ktane, the game manual, or any task-derived data prior to their evaluation. Fine-tuning on the manual would trivialise the benchmark: a model capable of reciting module solutions from parametric memory is not performing the collaborative grounded reasoning the benchmark is designed to measure. Gene...