
arxiv: 2606.26463 · v2 · pith:7J3SU7LLnew · submitted 2026-06-24 · 💻 cs.LG

Finding the Time to Think: Learning Planning Budgets in Real-Time RL

Pith reviewed 2026-06-30 09:24 UTC · model grok-4.3

classification 💻 cs.LG
keywords real-time reinforcement learning · planning budgets · gating policy · variable delay · Pac-Man · Tetris · real-time RL · deliberation time

The pith

A lightweight gating policy learns to select state-dependent planning budgets for agents in real-time RL environments where deliberation consumes time.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper examines real-time reinforcement learning in which the environment keeps advancing while the agent decides what to do. Standard planners assume unlimited thinking time, but here the agent must choose its own deliberation length at each step. A gating policy is trained on top of an existing planner to pick these lengths based on the current state. This method is evaluated in real-time versions of Pac-Man, Tetris, Snake, Speed Hex, and Speed Go, where it surpasses fixed-budget and heuristic approaches and also works when the environment and agent run on separate GPUs.

Core claim

In variable-delay real-time RL, the agent chooses how long to deliberate at each decision point while the environment continues to progress. For the planning agents considered, the suitable delay varies with state, yet naively planning over how long to plan can paralyze the agent. Training a lightweight gating policy to choose state-dependent budgets instead yields higher performance than fixed-budget and heuristic baselines across the tested games, and transfers to a real-time setup in which the environment and agent run on two different GPUs.
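A minimal sketch of one variable-delay decision, following the timeline in the Figure 1 and Figure 2 captions (k−1 committed reflex actions, then the planned action). The names `run_budgeted_option`, `env_step`, `reflex_policy`, and `plan` are hypothetical and not taken from the paper's code. The planner is called sequentially here as a stand-in for MCTS running concurrently with the committed actions.

```python
def run_budgeted_option(env_step, state, k, reflex_policy, plan):
    """Execute one budgeted option of length k frames.

    env_step(state, action) -> (next_state, reward)   # hypothetical interface
    reflex_policy(state)    -> action                 # cheap reactive policy
    plan(state, k)          -> action                 # stand-in for a k-frame planner
    Returns (next_state, total_reward, frames_elapsed).
    """
    if k < 1:
        raise ValueError("budget k must be at least 1")
    # Assumption: the planner sees the state at decision time and its output
    # is applied only after the k-1 committed frames have passed.
    planned_action = plan(state, k)
    total_reward = 0.0
    for _ in range(k - 1):
        # The environment keeps advancing while planning is "in progress".
        state, reward = env_step(state, reflex_policy(state))
        total_reward += reward
    state, reward = env_step(state, planned_action)
    total_reward += reward
    return state, total_reward, k


# Toy usage: integer state, the action is added to the state, reward = action.
def toy_step(state, action):
    return state + action, float(action)


final_state, total, frames = run_budgeted_option(
    toy_step, state=0, k=3, reflex_policy=lambda s: 0, plan=lambda s, k: 5
)
```

With k=1 the loop body never runs, which corresponds to reacting immediately; larger k trades more elapsed frames for a (presumably) better planned action.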

What carries the argument

A lightweight gating policy that selects state-dependent planning budgets on top of a base planner.
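As an illustration only, a gate of this kind can be pictured as a small classifier over a discrete set of budgets. The sketch below assumes a linear scorer with a softmax over four options; the budget set `BUDGETS`, the feature vector, and the weights are all hypothetical, and the paper's actual gate architecture is not described on this page.

```python
import math
import random

BUDGETS = (1, 2, 3, 4)  # hypothetical option set K, in environment frames


def softmax(logits):
    """Numerically stable softmax over a list of floats."""
    m = max(logits)
    exps = [math.exp(x - m) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]


def gate_logits(features, weights, bias):
    """One logit per budget: a linear score of the state features."""
    return [
        sum(w * f for w, f in zip(row, features)) + b
        for row, b in zip(weights, bias)
    ]


def choose_budget(features, weights, bias, rng=None, greedy=False):
    """Return (budget, probabilities) for the given state features."""
    probs = softmax(gate_logits(features, weights, bias))
    if greedy:
        idx = max(range(len(probs)), key=probs.__getitem__)
    else:
        rng = rng or random.Random()
        idx = rng.choices(range(len(probs)), weights=probs, k=1)[0]
    return BUDGETS[idx], probs


# Toy weights: a single "threat" feature pushes mass toward deeper budgets.
W = [[-2.0], [-1.0], [1.0], [2.0]]
B = [0.0, 0.0, 0.0, 0.0]
calm_budget, _ = choose_budget([-1.0], W, B, greedy=True)
tense_budget, tense_probs = choose_budget([1.0], W, B, greedy=True)
```

The point of keeping the gate this small is that its own inference cost should stay well below the smallest budget it can select.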

If this is right

  • State-dependent budget selection improves scores over any constant planning time in the real-time game domains.
  • The learned gate transfers directly to asynchronous execution where environment and planner run on separate GPUs.
  • Avoiding explicit planning over the budget itself prevents the paralysis observed when agents try to optimize their own thinking time.
  • The approach applies uniformly to multiple planning-based agents across distinct real-time environments.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

  • Similar gating could be added to planners in continuous-control tasks where action latency directly affects safety or cost.
  • Joint training of the gate and the base planner might further reduce the total compute needed for a given performance level.
  • In competitive settings the gate could implicitly learn to allocate more time when the opponent is in a threatening state.

Load-bearing premise

The suitable amount of planning time at each moment depends on the current state in a way that a separate policy can learn without excessive overhead.

What would settle it

A controlled comparison in which the best single fixed planning budget, chosen after exhaustive search, matches or exceeds the gating policy's score in every real-time game environment.
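The comparison described above could be scored with something like the following sketch. The data layout and the function names are hypothetical, and the numbers in the usage example are placeholders, not results from the paper.

```python
def best_fixed_budget(fixed_scores):
    """fixed_scores: {budget: mean_score}. Returns (budget, score) of the best one."""
    return max(fixed_scores.items(), key=lambda kv: kv[1])


def gate_is_matched_everywhere(results):
    """results: {env: {"gate": score, "fixed": {budget: score}}}.

    True only if, in every environment, the best fixed budget scores at
    least as high as the gating policy.
    """
    return all(
        best_fixed_budget(r["fixed"])[1] >= r["gate"] for r in results.values()
    )


# Placeholder numbers for illustration only.
example = {
    "env_a": {"gate": 10.0, "fixed": {1: 8.0, 4: 9.0}},
    "env_b": {"gate": 5.0, "fixed": {1: 6.0, 4: 4.0}},
}
settled = gate_is_matched_everywhere(example)
```

A real version of this check would also need to account for run-to-run variance (for example by comparing confidence intervals) rather than raw means.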

Figures

Figures reproduced from arXiv: 2606.26463 by Aneesh Muppidi, Dylan Cope, Firas Darwish, Jakob Nicolaus Foerster, João F. Henriques.

Figure 1: Given the current state, the gating policy chooses whether to react immediately or spend time planning, selecting the number of timesteps k over which to plan. The agent then takes k−1 committed actions using π_reflex (π_0) while MCTS plans, and finally executes the planned action.
… system. We instead let the agent choose its delay at each decision point, turning delay selection itself into the learning probl…
Figure 2: Timeline of a budgeted option o_k. The agent emits k−1 committed actions from π_reflex while computation c_k runs (instantiated as MCTS for k frames), then applies c_k’s output π_k and returns to the meta level at s_{t+k}. In clock environments, the committed steps are no-ops that only consume clock.
… Meta-level SMDP. The gating policy π_gate(k | s_t) is a meta-policy that chooses among the |K| budgeted options. At e…
Figure 3: Across Pac-Man, Tetris, and 2-player Speed Hex, planning quality rises with simulation count while inference latency rises alongside it. Blue denotes performance; dashed curves denote per-step latency on H100, A100, and A40; shading shows ±SE.
… chosen action is applied. Intuitively, this is the same pressure as in speed chess: the board waits for your move, but your limited clock keeps running while you thi…
Figure 4: Across all five environments, the gating policy outperforms fixed-budget and heuristic baselines, showing that adaptive allocation matters more than committing to a single search budget. Bars show mean ± SE over 100 episodes; for Speed Hex and Speed Go, expected score is averaged over the shared sampled clock budgets.
… Making this meta-MCTS AlphaZero stack computationally feasible requires careful attention…
Figure 5: The policy plans deeply precisely when the state is dangerous or constrained. Across Pac-Man, real-time Tetris, and Snake, larger chosen budgets are associated with higher threat, denser boards, or fewer safe continuations, indicating that the gate is responding to meaningful decision difficulty. Plots show state features conditioned on chosen budget k (mean ± 1 SE, 100 episodes).
[Plot residue: axis label "Move fraction"]
Figure 6: In both Speed Hex and Speed Go, the learned policy is much more reactive under the small budget (T=300) and distributes mass toward deeper options once more clock is available (T=4100). In real-time Tetris on H100, the policy preserves the same strongly bimodal “react or plan deeply” strategy seen in simulation. Appendix J gives the full per-environment, per-FPS, per-GPU analysis.
… This result tests a key h…
Figure 7: Two-GPU asynchronous deployment pipeline.
Figure 8: Simulation-trained policies transfer cleanly to hardware deployment.
Figure 9: Full co-scaling results across Pac-Man, real-time Tetris, Speed Hex, Speed Go, and Snake. Blue denotes planning quality; dashed curves denote per-step latency on H100, A100, and A40; shading shows ±SE.
… F Cross-Evaluation and Base Model Selection. For the committed-action environments, we trained four AlphaZero checkpoints at budgets k ∈ K and evaluated all 16 (train-k, eval-k) combinations. Here, train-k de…
Figure 10: Simulation-option calibration for the two clock environments. Each point shows a budget’s average expected score against all other candidate budgets. Red circles mark the options used in our clocked experiments: 16/32/64/96 simulations for Speed Go and 2/8/32/128 for Speed Hex.
Environment | Train-k | Eval-k=1 | Eval-k=2 | Eval-k=3 | Eval-k=4
real-time Tetris | k=1 | 61.8 @ 128 | 85.0 @ 128 | 73.2 @ 128 | 69.0 @ 128
 | k=2 | 30.0…
Figure 11: Different environments induce different allocation profiles. Pac-Man becomes more reactive later in the episode, real-time Tetris shifts toward deeper planning as the board densifies, Snake remains mostly reactive with occasional deeper planning in constrained states, and both clock games become less reactive when more time is available. This figure complements …
Figure 12: Strict-timeout Speed Hex is substantially easier than the main Speed Hex benchmark. Using the unique-game expected-score metric, the learned gate remains above parity against every opponent on average, with means of 0.952 against the policy-only opponent, 0.903 against fixed 2-simulation play, 0.904 against random allocation, and 0.893 even against the fixed 128-simulation opponent. At large budgets, the …
Figure 13: Detailed two-GPU deployment breakdown across real-time Tetris, Pac-Man, and Snake. Top: return vs. FPS. Middle: deadline miss rate. Bottom: p95 slack to the k=4 deadline. Snake remains robust, real-time Tetris fails only near the tightest A40 regime, and Pac-Man is the most latency-sensitive.
… L Real-Time RL, SMDPs, and Committed-Action Training. This appendix expands Section 3.2 by focusing on the distinct…
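
The deployment metrics named in the Figure 13 caption (deadline miss rate and p95 slack) can be illustrated with a small sketch. It assumes that a budget of k frames at a given FPS yields a wall-clock deadline of k / FPS seconds and that percentiles use the nearest-rank convention; both are assumptions, as are all function names, and the latencies in the usage example are placeholders.

```python
def deadline_s(k, fps):
    """Wall-clock time available for a k-frame budget at the given frame rate."""
    return k / fps


def miss_rate(k, fps, planner_latencies_s):
    """Fraction of planner calls that finished after the k-frame deadline."""
    limit = deadline_s(k, fps)
    misses = sum(1 for lat in planner_latencies_s if lat > limit)
    return misses / len(planner_latencies_s)


def slacks_s(k, fps, planner_latencies_s):
    """Per-call slack: positive means the planner finished before the deadline."""
    limit = deadline_s(k, fps)
    return [limit - lat for lat in planner_latencies_s]


def percentile_nearest_rank(values, pct):
    """Nearest-rank percentile for an integer pct in 1..100."""
    ordered = sorted(values)
    rank = (pct * len(ordered) + 99) // 100  # integer ceiling of pct*n/100
    return ordered[rank - 1]


# Placeholder latencies in seconds, for illustration only.
latencies = [0.1, 0.2, 0.3, 0.5]
rate = miss_rate(4, 10, latencies)      # deadline is 0.4 s at k=4, 10 FPS
slack = slacks_s(4, 10, latencies)
```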

Original abstract

Deliberating takes time. In real-time settings, that time is not free. Standard reinforcement learning (RL) sidesteps this as the environment waits indefinitely for the agent's decision. Instead, we study real-time RL environments where the environment progresses while waiting for the agent's action. Building on prior real-time formalizations, we introduce variable-delay real-time RL, where the agent chooses how long to deliberate at each decision point since the environment progresses. For the planning agents we use, the right delay is state-dependent, and naively planning how long to plan can paralyze the agent. We instead approach this setting by training a lightweight gating policy on top of a planner to select state-dependent planning budgets. Across real-time Pac-Man, Tetris, Snake, Speed Hex, and Speed Go, our gating policy outperforms fixed-budget and heuristic baselines, and transfers to a real-time setup where the environment and agent run on two different GPUs.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Referee Report

2 major / 1 minor

Summary. The paper introduces variable-delay real-time RL in which the agent selects a state-dependent deliberation time at each step. It proposes training a lightweight gating policy atop a base planner to choose these planning budgets, avoiding paralysis from naive meta-planning. The gating policy is reported to outperform fixed-budget and heuristic baselines across real-time Pac-Man, Tetris, Snake, Speed Hex, and Speed Go, and to transfer successfully to a two-GPU distributed real-time environment.

Significance. If the empirical results are robust, the work provides a practical mechanism for adaptive computation in real-time RL settings where environment progress during deliberation is explicit. The multi-game evaluation and the two-GPU transfer experiment are positive elements that would strengthen applicability claims, provided the gating overhead is shown to be negligible.

major comments (2)
  1. [real-time transfer experiment] The real-time transfer claim (two-GPU setup) is load-bearing for the strongest result. The manuscript must report measured inference latency of the gating policy itself and demonstrate that this latency remains small relative to the budgets it selects; otherwise the effective planning time deviates from the intended value and comparisons to fixed-budget baselines become invalid.
  2. [experimental evaluation] The outperformance claims across five games rest on quantitative results that are not summarized in the abstract. The paper should include, for each game, mean performance with error bars, number of runs, and explicit controls for total compute or wall-clock time to allow verification that gains are not artifacts of unequal resource allocation.
minor comments (1)
  1. [method] Clarify the training procedure for the gating policy (reward signal, data collection, and whether it is trained jointly or separately) to make the method reproducible.
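
The latency check requested in the first major comment could be approached along these lines. This is a sketch under stated assumptions: the gate is treated as an arbitrary Python callable, the smallest budget is converted to seconds as frames / FPS, and the helper names are hypothetical. On a GPU, a real measurement would also need to synchronize the device before reading the clock.

```python
import time


def median_latency_s(fn, arg, repeats=50):
    """Median wall-clock time of fn(arg) over several repeats."""
    samples = []
    for _ in range(repeats):
        start = time.perf_counter()
        fn(arg)
        samples.append(time.perf_counter() - start)
    samples.sort()
    return samples[len(samples) // 2]


def gate_overhead_fraction(gate_latency_s, smallest_budget_frames, fps):
    """Gate latency as a fraction of the smallest selectable budget."""
    return gate_latency_s / (smallest_budget_frames / fps)


# Example: a 1 ms gate against a 1-frame budget at 20 FPS (50 ms).
fraction = gate_overhead_fraction(0.001, smallest_budget_frames=1, fps=20)
measured = median_latency_s(lambda x: x * 2, 21)
```

Reporting this fraction per environment and per GPU would show directly whether the effective planning time matches the intended budget.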

Simulated Author's Rebuttal

2 responses · 0 unresolved

Thank you for the constructive feedback. We address each major comment below and will incorporate revisions to strengthen the manuscript.

Point-by-point responses
  1. Referee: [real-time transfer experiment] The real-time transfer claim (two-GPU setup) is load-bearing for the strongest result. The manuscript must report measured inference latency of the gating policy itself and demonstrate that this latency remains small relative to the budgets it selects; otherwise the effective planning time deviates from the intended value and comparisons to fixed-budget baselines become invalid.

    Authors: We agree this validation is necessary. In the revision we will add direct measurements of gating-policy inference latency on the same hardware used for the two-GPU experiment and show that the added latency is negligible relative to the budgets chosen by the policy (typically <5% of the smallest budget). This will confirm that the reported planning times remain accurate and that baseline comparisons are unaffected.
    revision: yes

  2. Referee: [experimental evaluation] The outperformance claims across five games rest on quantitative results that are not summarized in the abstract. The paper should include, for each game, mean performance with error bars, number of runs, and explicit controls for total compute or wall-clock time to allow verification that gains are not artifacts of unequal resource allocation.

    Authors: The experimental section already reports per-game means, standard deviations, and the number of independent runs (n=10 for all environments). We will (1) insert a concise quantitative summary into the abstract and (2) add an explicit paragraph in the experimental setup detailing the wall-clock-time and total-FLOP budgets enforced across all methods, confirming that every agent receives identical compute resources per decision cycle.
    revision: yes

Circularity Check

0 steps flagged

No significant circularity; derivation is self-contained

Full rationale

The paper trains a lightweight gating policy via standard RL to select state-dependent planning budgets in variable-delay real-time environments. The central claim rests on empirical outperformance against fixed-budget and heuristic baselines across multiple games, plus a two-GPU transfer experiment. No equations or steps reduce a claimed prediction or result to a fitted parameter or self-citation by construction; the gating policy is learned independently rather than defined in terms of its own outputs. The derivation chain uses external benchmarks and does not invoke uniqueness theorems or ansatzes from prior self-work as load-bearing justification.

Axiom & Free-Parameter Ledger

0 free parameters · 0 axioms · 0 invented entities

Abstract provides no details on free parameters, axioms, or invented entities; assessment limited to surface description.

pith-pipeline@v0.9.1-grok · 5708 in / 970 out tokens · 39971 ms · 2026-06-30T09:24:20.266657+00:00 · methodology


Reference graph

Works this paper leans on

32 extracted references · 24 canonical work pages · 14 internal anchors

  1. [1]

    L1: Controlling How Long A Reasoning Model Thinks With Reinforcement Learning

    Pranjal Aggarwal and Sean Welleck. L1: Controlling how long a reasoning model thinks with reinforcement learning. arXiv preprint arXiv:2503.04697,

  2. [2]

    Handling delay in real-time reinforcement learning

    Ivan Anokhin, Rishav Rishav, Matthew Riemer, Stephen Chung, Irina Rish, and Samira Ebrahimi Kahou. Handling delay in real-time reinforcement learning. arXiv preprint arXiv:2503.23478,

  3. [3]

    Pondernet: Learning to ponder

    Andrea Banino, Jan Balaguer, and Charles Blundell. Pondernet: Learning to ponder. arXiv preprint arXiv:2107.05407,

  4. [4]

    Permutation invariant learning with high-dimensional particle filters

    Akhilan Boopathy, Aneesh Muppidi, Peggy Yang, Abhiram Iyer, William Yue, and Ila Fiete. Permutation invariant learning with high-dimensional particle filters. arXiv preprint arXiv:2410.22695,

  5. [5]

    Learning to select computations

    Frederick Callaway, Sayan Gul, Paul M Krueger, Thomas L Griffiths, and Falk Lieder. Learning to select computations. arXiv preprint arXiv:1711.06892,

  6. [6]

    FrugalGPT: How to Use Large Language Models While Reducing Cost and Improving Performance

    Lingjiao Chen, Matei Zaharia, and James Zou. Frugalgpt: How to use large language models while reducing cost and improving performance. arXiv preprint arXiv:2305.05176,

  7. [7]

    Acting in delayed environments with non-stationary markov policies

    Esther Derman, Gal Dalal, and Shie Mannor. Acting in delayed environments with non-stationary markov policies. arXiv preprint arXiv:2101.11992,

  8. [8]

    Thinkless: Llm learns when to think

    Gongfan Fang, Xinyin Ma, and Xinchao Wang. Thinkless: Llm learns when to think. arXiv preprint arXiv:2505.13379,

  9. [9]

    TreeQN and ATreeC: Differentiable Tree-Structured Models for Deep Reinforcement Learning

    Gregory Farquhar, Tim Rocktäschel, Maximilian Igl, and Shimon Whiteson. Treeqn and atreec: Differentiable tree-structured models for deep reinforcement learning. arXiv preprint arXiv:1710.11417,

  10. [10]

    High entropy leads to symmetry-equivariant policies in Dec-POMDPs

    Johannes Forkel and Jakob Foerster. Entropy is all you need for inter-seed cross-play in hanabi. arXiv preprint arXiv:2511.22581,

  11. [11]

    Adaptive Computation Time for Recurrent Neural Networks

    Alex Graves. Adaptive computation time for recurrent neural networks. arXiv preprint arXiv:1603.08983,

  12. [12]

    DeepSeek-R1: Incentivizing Reasoning Capability in LLMs via Reinforcement Learning

    Daya Guo, Dejian Yang, Haowei Zhang, Junxiao Song, Peiyi Wang, Qihao Zhu, Runxin Xu, Ruoyu Zhang, Shirong Ma, Xiao Bi, et al. Deepseek-r1: Incentivizing reasoning capability in llms via reinforcement learning. arXiv preprint arXiv:2501.12948,

  13. [13]

    Metacontrol for adaptive imagination-based optimization

    Jessica B Hamrick, Andrew J Ballard, Razvan Pascanu, Oriol Vinyals, Nicolas Heess, and Peter W Battaglia. Metacontrol for adaptive imagination-based optimization. arXiv preprint arXiv:1705.02670,

  14. [14]

    Combining q-learning and search with amortized value estimates

    Jessica B Hamrick, Victor Bapst, Alvaro Sanchez-Gonzalez, Tobias Pfaff, Theophane Weber, Lars Buesing, and Peter W Battaglia. Combining q-learning and search with amortized value estimates. arXiv preprint arXiv:1912.02807,

  15. [15]

    On the role of planning in model-based deep reinforcement learning

    Jessica B Hamrick, Abram L Friesen, Feryal Behbahani, Arthur Guez, Fabio Viola, Sims Witherspoon, Thomas Anthony, Lars Buesing, Petar Veličković, and Théophane Weber. On the role of planning in model-based deep reinforcement learning. arXiv preprint arXiv:2011.04021,

  16. [16]

    Selecting Computations: Theory and Applications

    Nicholas Hay, Stuart Russell, David Tolpin, and Solomon Eyal Shimony. Selecting computations: Theory and applications. arXiv preprint arXiv:1408.2048,

  17. [17]

    Reasoning, Metareasoning, and Mathematical Truth: Studies of Theorem Proving under Limited Resources

    Eric J Horvitz and Adrian Klein. Reasoning, metareasoning, and mathematical truth: Studies of theorem proving under limited resources. arXiv preprint arXiv:1302.4960,

  18. [18]

    Metareasoning for Planning Under Uncertainty

    Christopher H Lin, Andrey Kolobov, Ece Kamar, and Eric Horvitz. Metareasoning for planning under uncertainty. arXiv preprint arXiv:1505.00399,

  19. [19]

    s1: Simple test-time scaling

    Niklas Muennighoff, Zitong Yang, Weijia Shi, Xiang Lisa Li, Li Fei-Fei, Hannaneh Hajishirzi, Luke Zettlemoyer, Percy Liang, Emmanuel Candès, and Tatsunori B Hashimoto. s1: Simple test-time scaling. In Proceedings of the 2025 Conference on Empirical Methods in Natural Language Processing, pp. 20286–20332,

  20. [20]

    RouteLLM: Learning to Route LLMs with Preference Data

    Isaac Ong, Amjad Almahairi, Vincent Wu, Wei-Lin Chiang, Tianhao Wu, Joseph E Gonzalez, M Waleed Kadous, and Ion Stoica. Routellm: Learning to route llms with preference data. arXiv preprint arXiv:2406.18665,

  21. [21]

    Enabling realtime reinforcement learning at scale with staggered asynchronous inference

    Matthew Riemer, Gopeshh Subbaraj, Glen Berseth, and Irina Rish. Enabling realtime reinforcement learning at scale with staggered asynchronous inference. arXiv preprint arXiv:2412.14355,

  22. [22]

    High-Dimensional Continuous Control Using Generalized Advantage Estimation

    John Schulman, Philipp Moritz, Sergey Levine, Michael Jordan, and Pieter Abbeel. High-dimensional continuous control using generalized advantage estimation. arXiv preprint arXiv:1506.02438,

  23. [23]

    Proximal Policy Optimization Algorithms

    John Schulman, Filip Wolski, Prafulla Dhariwal, Alec Radford, and Oleg Klimov. Proximal policy optimization algorithms. arXiv preprint arXiv:1707.06347,

  24. [24]

    Dast: Difficulty-adaptive slow-thinking for large reasoning models

    Yi Shen, Jian Zhang, Jieyun Huang, Shuming Shi, Wenjing Zhang, Jiangze Yan, Ning Wang, Kai Wang, Zhaoxiang Liu, and Shiguo Lian. Dast: Difficulty-adaptive slow-thinking for large reasoning models. In Proceedings of the 2025 Conference on Empirical Methods in Natural Language Processing: Industry Track, pp. 2322–2331,

  25. [25]

    Scaling LLM Test-Time Compute Optimally can be More Effective than Scaling Model Parameters

    Charlie Snell, Jaehoon Lee, Kelvin Xu, and Aviral Kumar. Scaling llm test-time compute optimally can be more effective than scaling model parameters. arXiv preprint arXiv:2408.03314,

  26. [26]

    Thinking while moving: Deep reinforcement learning with concurrent control

    Ted Xiao, Eric Jang, Dmitry Kalashnikov, Sergey Levine, Julian Ibarz, Karol Hausman, and Alexander Herzog. Thinking while moving: Deep reinforcement learning with concurrent control. arXiv preprint arXiv:2004.06089,

  27. [27]

    running a simulation

    A Reproducibility. The full JAX implementation of every environment, base planner, and gating policy in this paper, together with pretrained checkpoints for all five environments and the two-GPU deployment harness, is released at https://aneeshers.github.io/realtime-rl/. Each result reported in Sections 5 and 7 has a corresponding environment-variable-drive...

  28. [28]

    estimates the advantage at timestep t as a λ-weighted sum of one-step TD residuals, $\hat{A}_t = \sum_{l=0}^{\infty} (\gamma\lambda)^{l}\, \delta_{t+l}$, $\delta_t = r_t + \gamma V(s_{t+1}) - V(s_t)$, (6) which assumes a unit time gap between consecutive states. In our SMDP, consecutive meta-states $s_t$ and $s_{t+k_t}$ are separated by a variable number of environment frames $k_t$, so the per-step discount in the TD residual becomes $\gamma^{k_t}$:...

  29. [29]

    Thus, in both clocked domains, the selected options provide the gating policy with meaningfully separated quality-latency tradeoffs. [Plot residue: panels Pac-Man (Episode Return), Tetris RT (Episode Return), Speed Hex (Win Rate), and a truncated fourth panel (Expe...), each plotted against Simulations, 2 ... 256]

  30. [30]

    Once timeout is an immediate terminal event, stronger search is no longer reliably beneficial; it often just increases the probability of burning too much clock

    This pattern is the opposite of what we observe in the main benchmark. Once timeout is an immediate terminal event, stronger search is no longer reliably beneficial; it often just increases the probability of burning too much clock. In other words, the game becomes easier ... [Plot residue: x-axis "Clock budget", 0 to 5000]

  31. [31]

    In our internal runs, the end-to-end training workflow dropped from roughly two weeks in an earlier less-batched pipeline to roughly six hours in the optimized JAX implementation

    ... for search, vmap for environment parallelism, jit for whole-program compilation, pmap for device parallelism, and lax.scan for both MCTS inner loops and PPO rollout loops, reduce wall-clock training time dramatically. In our internal runs, the end-to-end training workflow dropped from roughly two weeks in an earlier less-batched pipeline to roughly six hour...

  32. [32]

    This is the discrete-time SMDP solved during training

    ... with deterministic holding time τ = k, so successive meta-decisions are spaced k environment steps apart and the appropriate discount is γ^k. This is the discrete-time SMDP solved during training. Deployment. In real-time deployment the holding time between meta-decisions equals k × T_frame: MCTS runs concurrently on GPU 1 while the environment executes the k reflex fr...
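
The excerpts in entries [28] and [32] say that, with a holding time of k_t frames between meta-decisions, the discount in the TD residual becomes γ^{k_t}, but the modified estimator itself is cut off. The sketch below is one plausible reading and not the paper's formula: it assumes `rewards[t]` is the reward already accumulated over option t, uses δ_t = R_t + γ^{k_t} V(s_{t+k_t}) − V(s_t), and decays the running sum by (γλ)^{k_t}. The function name `smdp_gae` and that last choice are assumptions.

```python
def smdp_gae(rewards, values, holds, gamma=0.99, lam=0.95):
    """Generalized advantage estimation with variable holding times.

    rewards[t]: reward accumulated over meta-step t (assumed pre-summed)
    values:     value estimates at meta-states, length len(rewards) + 1
    holds[t]:   number of environment frames k_t spanned by meta-step t
    """
    steps = len(rewards)
    if len(values) != steps + 1 or len(holds) != steps:
        raise ValueError("expected len(values) == len(rewards) + 1 == len(holds) + 1")
    advantages = [0.0] * steps
    running = 0.0
    for t in reversed(range(steps)):
        discount = gamma ** holds[t]  # gamma^{k_t}, as stated in the excerpt
        delta = rewards[t] + discount * values[t + 1] - values[t]
        # Assumption: the lambda-weighted trace also decays per frame.
        running = delta + (gamma * lam) ** holds[t] * running
        advantages[t] = running
    return advantages


# With every holding time equal to 1 this reduces to standard GAE.
unit = smdp_gae([1.0, 1.0], [0.0, 0.0, 0.0], [1, 1], gamma=0.5, lam=1.0)
mixed = smdp_gae([1.0, 1.0], [0.0, 0.0, 0.0], [2, 1], gamma=0.5, lam=1.0)
```

The unit-holding-time case is a useful sanity check, since any variable-delay estimator of this family should recover equation (6) when k_t = 1 everywhere.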