
arxiv: 2606.24470 · v1 · pith:WHBSEZOOnew · submitted 2026-06-23 · 💻 cs.AI

The Latent Bridge: A Continuous Slow-Fast Channel for Real-Time Game Agents

Pith reviewed 2026-06-26 00:07 UTC · model grok-4.3

classification 💻 cs.AI
keywords: latent bridge, slow-fast coupling, real-time agents, vision-language models, Atari games, text bridge, continuous channel, game agents

The pith

A learned continuous latent bridge between slow reasoning and fast reactive VLMs matches or exceeds text bridging for real-time game agents.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

Real-time agents must act in tens of milliseconds yet plan over seconds, creating a conflict between reactive speed and deliberative quality. The paper couples a frozen 8B reasoning VLM with a frozen 9B reactive VLM by training only one component: a continuous projection that injects the slow model's residuals directly into the fast model's embedding space. This Latent Bridge is tested against a Text Bridge, where the slow model writes natural-language guidance, and against Fast-Only baselines across seven Atari games plus a MetaDrive driving task. It matches or beats the Text Bridge in every domain, with large gains in MsPacman and RoadRunner precisely when the slow model already outperforms the fast one. The two bridge types produce correlated gains (r=0.93), and using both together causes destructive interference.

Core claim

We introduce the Latent Bridge, a learned continuous channel that projects residuals from a frozen slow reasoning VLM into the input embedding space of a frozen fast reactive VLM in a LLaVA-style manner. On 7 Atari games and MetaDrive, with the action decoder tuned per channel on held-out seeds, the Latent Bridge matches or beats the Text Bridge in every domain, delivering substantial gains (+57% in MsPacman, +28% in RoadRunner) exactly when the slow model outperforms the fast one; the gains of the two bridges over Fast-Only correlate at r=0.93. Combining both channels interferes destructively, and the bridge is inert in MetaDrive where the Text Bridge adds no value.

What carries the argument

The Latent Bridge: a single learned continuous projection of the slow model's residuals into the fast model's embedding space, kept as the only trainable component between two frozen matched-scale VLMs.
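
A minimal sketch of what such a projection could look like, assuming a single affine map followed by a LayerNorm with unit gain and zero bias, and assuming the bridge tokens are prepended to the fast model's input embeddings (the prepend layout and the output LayerNorm are mentioned in the figure captions; everything else here is a hypothetical stand-in, not the authors' code). Pure Python with toy sizes is used only to keep the example dependency-free; a real implementation would use a tensor library.

```python
import math
import random

def layer_norm(vec, eps=1e-5):
    """Normalize to zero mean and unit variance (assumed gain 1, bias 0)."""
    mean = sum(vec) / len(vec)
    var = sum((x - mean) ** 2 for x in vec) / len(vec)
    return [(x - mean) / math.sqrt(var + eps) for x in vec]

def project(residual, weight, bias):
    """Affine map from the slow model's width to the fast model's width."""
    return [sum(w * x for w, x in zip(row, residual)) + b
            for row, b in zip(weight, bias)]

def latent_bridge(slow_residuals, weight, bias):
    """Turn each slow-model residual vector into one bridge token."""
    return [layer_norm(project(r, weight, bias)) for r in slow_residuals]

def prepend(bridge_tokens, fast_embeddings):
    """Place bridge tokens ahead of the fast model's own input embeddings."""
    return bridge_tokens + fast_embeddings

# Toy, hypothetical sizes; the weights would be learned, here they are random.
rng = random.Random(0)
D_SLOW, D_FAST, N_BRIDGE, N_FAST = 6, 16, 8, 5
weight = [[rng.gauss(0, 1) for _ in range(D_SLOW)] for _ in range(D_FAST)]
bias = [0.0] * D_FAST
slow_residuals = [[rng.gauss(0, 1) for _ in range(D_SLOW)] for _ in range(N_BRIDGE)]
fast_embeddings = [[rng.gauss(0, 1) for _ in range(D_FAST)] for _ in range(N_FAST)]

bridge_tokens = latent_bridge(slow_residuals, weight, bias)
sequence = prepend(bridge_tokens, fast_embeddings)
token_norms = [math.sqrt(sum(x * x for x in tok)) for tok in bridge_tokens]
```

Under these assumptions each bridge token has an L2 norm of about the square root of the embedding width (here √16 = 4). At a width of 4096 that would be √4096 = 64, which is consistent with the ∼64 norm quoted in the Figure 9 caption.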

If this is right

  • The bridge helps if and only if the slow reasoning model already outperforms the fast reactive model on the task.
  • Using both latent and text channels together produces destructive interference rather than additive benefit.
  • The method remains inert in domains such as MetaDrive where text bridging itself adds no value over fast-only.
  • Performance differences between the two bridge types are highly predictable from the relative strength of slow versus fast models alone.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

  • The same projection technique could be tested in non-game real-time control settings such as robotic manipulation whenever a slow planner and fast executor show complementary strengths.
  • Varying the relative scales of the two frozen models while keeping the bridge architecture fixed would test whether scale matching is required for the observed correlation.
  • The r=0.93 predictability offers a low-cost way to decide in advance whether adding a bridge is likely to help on a new task without training the full system.
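
To make the correlation concrete, here is a small sketch of how a Pearson r between the two bridges' gains over Fast-Only could be computed. The per-game scores below are placeholders invented for illustration and are not the paper's data; the function names are likewise hypothetical.

```python
import math

def pearson_r(xs, ys):
    """Pearson correlation coefficient of two equal-length sequences."""
    if len(xs) != len(ys) or len(xs) < 2:
        raise ValueError("need two sequences of equal length >= 2")
    mean_x = sum(xs) / len(xs)
    mean_y = sum(ys) / len(ys)
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    spread_x = math.sqrt(sum((x - mean_x) ** 2 for x in xs))
    spread_y = math.sqrt(sum((y - mean_y) ** 2 for y in ys))
    if spread_x == 0 or spread_y == 0:
        raise ValueError("correlation is undefined for constant input")
    return cov / (spread_x * spread_y)

def gains_over_fast_only(scores):
    """scores maps game -> (F, T, L); returns the lists of T-F and L-F."""
    text_gain = [t - f for f, t, _ in scores.values()]
    latent_gain = [l - f for f, _, l in scores.values()]
    return text_gain, latent_gain

# Placeholder scores (NOT from the paper): game -> (Fast-Only, Text, Latent).
placeholder_scores = {
    "game_a": (100, 150, 170),
    "game_b": (200, 190, 195),
    "game_c": (50, 50, 50),
    "game_d": (300, 420, 400),
}
text_gain, latent_gain = gains_over_fast_only(placeholder_scores)
r = pearson_r(text_gain, latent_gain)
```

With only eight domains in the paper, a point estimate like this is sensitive to individual games, so it is best read alongside the robustness statistics the Figure 10 caption points to.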

Load-bearing premise

The action decoder can be tuned independently per channel on held-out seeds without introducing bias, and the models' scales are matched so that the bridge is the only variable.
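
The premise is easier to weigh with the selection step written out. The sketch below assumes the tuning amounts to picking, for each channel separately, the decoder setting with the best mean score on held-out seeds; the setting names (`greedy`, `sampled`) and all scores are hypothetical placeholders, not values from the paper.

```python
from statistics import mean

def select_decoder(heldout_scores):
    """Pick the decoder setting with the highest mean held-out score.

    heldout_scores maps decoder name -> list of scores, one per held-out seed.
    """
    return max(heldout_scores, key=lambda name: mean(heldout_scores[name]))

def select_per_channel(heldout):
    """Run the same selection rule independently for every channel."""
    return {channel: select_decoder(by_decoder)
            for channel, by_decoder in heldout.items()}

# Placeholder held-out scores (NOT from the paper).
heldout = {
    "F": {"greedy": [10, 12], "sampled": [9, 11]},
    "T": {"greedy": [8, 10], "sampled": [14, 12]},
    "L": {"greedy": [15, 17], "sampled": [13, 15]},
}
chosen = select_per_channel(heldout)
```

Because each channel may end up with a different setting, a score gap between channels reflects the channel plus its selected decoder, which is the confound the referee report raises.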

What would settle it

A domain where the slow reasoning model already beats the fast model yet the latent bridge shows no gain over fast-only or breaks the r=0.93 correlation with text-bridge gains.

Figures

Figures reproduced from arXiv: 2606.24470 by Bojie Li, Noah Shi.

Figure 1: Cross-game scores with each channel at its own best decoder, selected on held-out seeds (n = 12; reported action-head variant per game, selected by the rule in §2). The Latent Bridge significantly beats the Text Bridge on MsPacman (+57%) and RoadRunner (+28%) and ties on the other five, never losing. Stars: L-vs-T significance by Mann–Whitney U (** p < .01, *** p < .001); ∗ on a game label marks the robust…
Figure 2: The eight evaluated domains. Seven Atari games (raw pixels, ∼15 Hz control), spanning fast hazard avoidance (ghosts, obstacles, enemy fire) and slower route/strategy planning, plus MetaDrive (top-down driving), our non-Atari controlled negative. Frostbite (excluded) and Pong (reported but uninformative) are not shown. Text Bridge vs. Latent Bridge. The standard coupling is the Text Bridge (T): the slow model…
Figure 3: System architecture. The fast model (MiniCPM-o 4.5, frozen) runs the reactive loop: vision tokens and a game-state prompt feed a 36-layer LLM whose trained action head emits one action per tick (∼33–38 ms warm path). The slow model (Qwen3-VL-8B-Thinking, frozen) reasons asynchronously over structured state at ∼1 Hz; the fast loop never blocks on it and reuses the latest emission until replaced. The slow ou…
Figure 4: v1 vs v2 bridge architectures. v1 (top, failed): 256-d cross-attention at 2 of 36 layers. v2 (bottom, working): 4096-d prepend, all 36 layers attend. … (the action with the highest probability) except where we tune the decoder per channel for the headline (§4.4); under greedy, many cells exhibit zero per-episode variance (Appendix B). 4.1 Headline: the Latent Bridge is a safe-or-better drop-in for the Text B…
Figure 5: Ms. Pac-Man, one representative episode (same tick across channels). Fast-Only (F, left) receives no slow guidance; the Text Bridge (T, middle) and Latent Bridge (L, right) act on the slow model’s emission and here lead F on score (180/360/440). This single frame is illustrative only: the headline +57% of L over T is an episode-mean over n=12 (…
Figure 6: RoadRunner: F cannot score; L unlocks the policy. Under the bare action head, the most-confident output is no-op even when the Coyote is closing. The slow model’s guidance (“head right”) breaks the local maximum. L = 608 vs T = 475, +28%. 4.3 Greedy decoding flips deterministic cells either way. Some cells are deterministic under greedy decoding: with a small action space and a confident head, all 12 ep…
Figure 7: One action-head retraining recovers two distinct collapse modes. SpaceInvaders (left): bare collapses both bridges to zero; robust lifts them off the floor (though still below Fast-Only). River Raid (right): bare is a near-tie far below Fast-Only; robust leaves the Text Bridge flat but lifts the Latent Bridge well above it under greedy decoding, a margin that itself becomes a tie once the decoder is tuned p…
Figure 8: A longer text suffix widens the gap. The Text Bridge’s score as a function of how many recent slow emissions are concatenated into the suffix; the Latent Bridge (latest emission only) is the fixed dashed line. (RoadRunner is run-to-run-variable, §4; shown is one internally consistent run, where the within-run w=1 baseline starts at 608 and collapses; what matters is the trend, not the headline level.)
Figure 9: The latent’s lift over Fast-Only splits into architectural “slots” and learned content (MsPacman). Eight zero or random prepended tokens already supply the “slots” part; the trained bridge adds the rest. Zero and random are statistically indistinguishable (Appendix M). The bridge is non-lexical. Bridge tokens have L2 norm ∼64 (constrained by the output LayerNorm); vocab embeddings have norm ∼1.45 (p99 1.9…
Figure 10: The Latent Bridge helps if and only if the slow model helps with the task. Left: per-game Latent benefit L−F vs. Text benefit T−F across 7 Atari games and MetaDrive, on a signed square-root axis. Bold points are the reported-variant cell per game (Pearson r = 0.93, n = 8); faint grey points are all 16 evaluated cells (r = 0.96), so the relationship is not an artifact of variant selection (robustness stats in Appe…
Figure 11: Emission statistics do not predict the sign of L − T (n = 7). x = mean unique whitespace tokens per slow emission (seed-0 T-trajectory). y = (L − T)/T. Q*bert is the only y < 0 point with a large T, but Enduro and River Raid (lower diversity) have y > 0, and SpaceInvaders (high diversity) is mildly y < 0. Pearson r = −0.08. The dashed line is the linear fit and is shown only to make the lack of structure …
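
The relative-gain quantity used in these captions, y = (L − T)/T, can be checked against the one pair of raw scores the captions give (RoadRunner, L = 608 and T = 475 in Figure 6). The helper name below is an illustrative choice, not something from the paper.

```python
def relative_gain(latent_score, text_score):
    """Relative gain of the Latent Bridge over the Text Bridge: (L - T) / T."""
    if text_score == 0:
        raise ValueError("relative gain is undefined when the Text Bridge scores 0")
    return (latent_score - text_score) / text_score

# RoadRunner scores as quoted in the Figure 6 caption.
roadrunner_gain = relative_gain(608, 475)
roadrunner_label = f"{roadrunner_gain:+.0%}"  # "+28%"
```

The zero-score guard matters in practice: the captions describe cells where a channel collapses to zero, and a ratio against such a cell has no meaningful value.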
Original abstract

A real-time agent for general computer use - with games as the most demanding case - must act within tens of milliseconds while still planning over seconds. These two regimes sit at opposite ends of the latency-quality tradeoff. A reasoning VLM (Qwen3-VL-8B-Thinking) deliberates effectively but requires ~1.5 s per response - far too slow for a 15 Hz control loop. In contrast, a reactive VLM (MiniCPM-o 4.5) acts in milliseconds but underperforms on planning-heavy tasks. We couple two frozen models of matched scale (9B reactive, 8B reasoning), leaving the communication channel as the sole trainable component. The standard coupling is a Text Bridge (T): the slow model writes a suffix the fast model reads. We introduce a learned continuous Latent Bridge (L) that projects the slow model's residuals into the fast model's input-embedding space in a LLaVA-style manner, avoiding any text round-trip; both are compared against Fast-Only (F). On 7 Atari games and a driving domain (MetaDrive), tuning the action decoder per channel on held-out seeds, the Latent Bridge matches or beats the Text Bridge in every domain: it significantly improves two games (MsPacman +57%, RoadRunner +28%) and is a safe drop-in elsewhere. Combining both channels interferes destructively (RoadRunner -96%), so only one should be used. The benefit is highly predictable: the bridge helps if and only if slow reasoning already beats fast reaction (T > F) - the Latent and Text gains over Fast-Only move together at r=0.93. MetaDrive is the controlled negative, where the Latent Bridge is demonstrably inert because the Text Bridge adds no value. We release replay recordings and reproducible pipelines.
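
The timing mismatch described in the abstract can be illustrated with a small discrete-time simulation. It assumes a 15 Hz tick, a fixed 1.5 s slow-model latency, back-to-back slow jobs, and a fast loop that only checks for a finished emission at tick boundaries and otherwise reuses the latest one. These are simplifications for illustration, not the paper's pipeline, and all names are hypothetical.

```python
def simulate_fast_loop(num_ticks, tick_hz=15.0, slow_latency_s=1.5):
    """Return, for every fast tick, the id of the slow emission in use.

    None means no slow emission has arrived yet. The fast loop never waits:
    it picks up a finished emission at a tick boundary and otherwise keeps
    reusing the previous one.
    """
    latest_emission = None   # most recent completed slow emission
    emission_id = 0
    job_started_at = 0.0     # the slow model starts reasoning at t = 0
    used_per_tick = []
    for tick in range(num_ticks):
        now = tick / tick_hz
        if now - job_started_at >= slow_latency_s:
            emission_id += 1
            latest_emission = emission_id
            job_started_at = now  # the next slow job starts right away
        used_per_tick.append(latest_emission)
    return used_per_tick

used = simulate_fast_loop(num_ticks=60)
first_guided_tick = next(i for i, e in enumerate(used) if e is not None)
```

Under these assumptions the first 23 ticks run without any slow guidance, and each emission is then reused for 23 consecutive ticks before it is replaced, which is why the fast model has to act sensibly on stale guidance.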

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Referee Report

1 major / 1 minor

Summary. The paper proposes coupling a frozen slow reasoning VLM (Qwen3-VL-8B-Thinking, ~1.5s latency) with a fast reactive VLM (MiniCPM-o 4.5, millisecond latency) for real-time game agents via a learned continuous Latent Bridge that projects residuals into the fast model's embedding space. It claims this Latent Bridge matches or beats the Text Bridge (slow model writes text suffix for fast model) on 7 Atari games plus MetaDrive, with significant gains in MsPacman (+57%) and RoadRunner (+28%), while the benefit is predictable (r=0.93) from whether the slow model already outperforms the fast one (T > F); MetaDrive serves as a negative control where the bridge adds no value. The communication channel is presented as the sole trainable component, with the action decoder tuned per channel on held-out seeds; both are compared to Fast-Only, and combining channels is destructive.

Significance. If the attribution of gains to the bridge holds after controlling for decoder adaptation, the work would offer a practical method for low-latency integration of slow and fast VLMs in real-time control without text round-trips, with the r=0.93 predictability providing a clear, falsifiable condition for when the bridge helps. The release of replay recordings and reproducible pipelines strengthens verifiability.

major comments (1)
  1. [Abstract] The claim that 'the communication channel [is] the sole trainable component' is undermined by the protocol of 'tuning the action decoder per channel on held-out seeds.' This allows each decoder to adapt to the distinct output statistics of its upstream bridge (Latent vs. Text), so the reported deltas (+57% MsPacman, +28% RoadRunner) and the r=0.93 correlation cannot be attributed solely to the bridge rather than to decoder compensation. The MetaDrive negative control does not resolve the confound because it only tests the case where T ≯ F.
minor comments (1)
  1. The abstract reports specific performance percentages and a correlation without error bars, number of runs, or mention of statistical tests, making it difficult to assess the reliability of the gains and the r=0.93 value.
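
For context on the statistical-test point: the Figure 1 caption names Mann–Whitney U as the L-vs-T test. A minimal sketch of the U statistic itself is below, counting ties as one half. It does not compute a p-value; for that a library routine such as SciPy's `scipy.stats.mannwhitneyu` would normally be used. The sample values in any call are the caller's own, and none are taken from the paper.

```python
def mann_whitney_u(sample_a, sample_b):
    """Return (U_a, U_b) for two independent samples.

    U_a counts the pairs (a, b) in which a beats b, with ties counted as 0.5.
    U_a + U_b always equals len(sample_a) * len(sample_b).
    """
    u_a = 0.0
    for a in sample_a:
        for b in sample_b:
            if a > b:
                u_a += 1.0
            elif a == b:
                u_a += 0.5
    u_b = len(sample_a) * len(sample_b) - u_a
    return u_a, u_b
```

With n = 12 episodes per channel there are 144 pairs, so U ranges from 0 (one channel always loses) to 144 (it always wins).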

Simulated Author's Rebuttal

1 response · 0 unresolved

We thank the referee for the careful reading and for identifying an imprecision in the abstract that affects how the experimental controls are interpreted. We address the concern directly below.

Point-by-point responses
  1. Referee: [Abstract] The claim that 'the communication channel [is] the sole trainable component' is undermined by the protocol of 'tuning the action decoder per channel on held-out seeds.' This allows each decoder to adapt to the distinct output statistics of its upstream bridge (Latent vs. Text), so the reported deltas (+57% MsPacman, +28% RoadRunner) and the r=0.93 correlation cannot be attributed solely to the bridge rather than to decoder compensation. The MetaDrive negative control does not resolve the confound because it only tests the case where T ≯ F.

    Authors: We agree the abstract wording is imprecise and will be revised. The action decoder is a lightweight head tuned per channel on held-out seeds so that each communication method (latent projection, text suffix, or fast-only) receives a decoder matched to its output distribution; this is required for a fair system-level comparison. Because the identical tuning protocol is applied to Latent, Text, and Fast-Only, performance deltas between channels remain attributable to the communication mechanism itself. The r=0.93 correlation between Latent and Text gains over Fast-Only further indicates that gains track the presence of useful slow reasoning rather than decoder-specific compensation. MetaDrive serves as a negative control precisely because T ≯ F there; the fact that L is also inert in that regime is consistent with the bridge transmitting information only when it is beneficial. We will (1) remove the 'sole trainable component' phrasing from the abstract, (2) clarify in the methods that decoder adaptation is a controlled, per-condition step, and (3) add a short discussion of why this protocol does not confound the bridge comparison.

    Revision: yes

Circularity Check

0 steps flagged

Empirical comparison with observed correlation; no derivation reduces to inputs by construction

Full rationale

The paper is an empirical study reporting performance deltas and an r=0.93 correlation computed directly from measured game outcomes across domains. The abstract states the channel is the sole trainable component while describing per-channel decoder tuning on held-out seeds as the evaluation protocol; this is an experimental design choice, not a mathematical reduction where a fitted parameter is renamed as a prediction or where any equation equals its input by construction. No self-citations, uniqueness theorems, or ansatzes are invoked. The result is self-contained against external benchmarks (Atari games, MetaDrive) with no load-bearing circular steps.

Axiom & Free-Parameter Ledger

1 free parameter · 1 axiom · 1 invented entity

The central claim depends on the effectiveness of the learned latent projection and the assumption that text round-trip is avoidable this way; abstract provides no further parameter details.

free parameters (1)
  • projection parameters for latent bridge
    The trainable component of the bridge is fitted during training on game data.
axioms (1)
  • Domain assumption: the two VLMs are frozen and only the bridge is trained.
    Stated in the abstract as the coupling method.
invented entities (1)
  • Latent Bridge (no independent evidence)
    purpose: To enable continuous communication from slow to fast model without text.
    Newly introduced in the paper as the main contribution.

pith-pipeline@v0.9.1-grok · 5863 in / 1397 out tokens · 38486 ms · 2026-06-26T00:07:13.348260+00:00 · methodology

