ELASTIC: Efficiently Learning to Adaptively Scale Test-Time Compute for Generative Control Policies

Andrea Bajcsy; Andrew Zou Li; Gokul Swamy; Yonatan Bisk

arxiv: 2606.31132 · v1 · pith:FT4E4HB4new · submitted 2026-06-30 · 💻 cs.RO


Pith reviewed 2026-07-01 05:42 UTC · model grok-4.3

classification 💻 cs.RO

keywords test-time compute scaling · generative control policies · diffusion policies · robot manipulation · meta reinforcement learning · adaptive compute allocation


The pith

A meta-policy learns to allocate sequential and parallel test-time compute for generative robot policies to raise success at fixed budgets.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper presents ELASTIC, which treats test-time compute allocation for generative control policies as a meta-Markov Decision Process. A meta-policy is trained with reinforcement learning to choose the number of denoising steps and the number of parallel action samples at each iteration while interacting with a frozen base policy. The meta-policy receives only task success as its reward and never sees the base policy's original training data. If the claim holds, robot systems can replace hand-designed or fixed scaling rules with learned, state-dependent schedules that deliver higher task success for the same total compute.

Core claim

ELASTIC formulates compute allocation as a meta-Markov Decision Process in which a meta-policy interacts with a frozen pretrained robot policy, a generative control policy (GCP), and selects sequential steps and parallel samples at each denoising iteration to maximize task success while minimizing compute. Using reinforcement learning, this meta-policy learns adaptive compute schedules without access to the GCP's training data. Across simulated manipulation benchmarks with diffusion policies, ELASTIC Pareto-dominates fixed and single-axis scaling baselines at matched compute budgets. On real-world robot manipulation with the π0.5 vision-language-action model, ELASTIC matches best-of-10 success while reducing wall-clock latency by 34%.

What carries the argument

The meta-Markov Decision Process in which the meta-policy chooses sequential denoising steps and parallel action samples to interact with a frozen generative control policy.
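As a rough illustration of this loop, the following Python sketch shows one control step in which a meta-policy picks a number of denoising steps and parallel samples, a frozen base policy generates candidates, and a scorer keeps the best one. Everything in it is hypothetical: `toy_base_policy`, `toy_meta_policy`, `toy_score`, the scalar observation, and the `steps * samples` compute count are stand-ins chosen for illustration, not the paper's models or cost definition. The sketch also chooses the allocation once per control step, which simplifies the per-denoising-iteration choice the paper describes.

```python
import random
from dataclasses import dataclass


@dataclass
class Allocation:
    steps: int    # sequential denoising steps (L)
    samples: int  # parallel action samples (P)


def toy_base_policy(obs, steps, rng):
    """Stand-in for a frozen generative policy: more steps means less noise."""
    target = -obs  # hypothetical ideal action for this toy task
    return target + rng.gauss(0.0, 1.0) / steps


def toy_meta_policy(obs):
    """Hand-written rule standing in for a learned meta-policy."""
    if abs(obs) < 0.5:
        return Allocation(steps=8, samples=1)  # refine a single candidate
    return Allocation(steps=2, samples=4)      # explore several candidates


def toy_score(obs, action):
    """Hypothetical scorer used to pick among candidates (higher is better)."""
    return -abs(obs + action)


def act(obs, rng):
    """One control step: allocate compute, sample candidates, keep the best."""
    alloc = toy_meta_policy(obs)
    candidates = [toy_base_policy(obs, alloc.steps, rng)
                  for _ in range(alloc.samples)]
    best = max(candidates, key=lambda a: toy_score(obs, a))
    compute = alloc.steps * alloc.samples  # assumed cost: network evaluations
    return best, alloc, compute
```

In this toy setup both branches spend the same assumed budget of 8 evaluations but split it differently across the two axes, which is the kind of state-dependent trade the meta-policy is meant to learn.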

If this is right

Adaptive schedules Pareto-dominate fixed and single-axis scaling at matched compute budgets on simulated manipulation tasks.
The approach achieves equivalent success to best-of-10 sampling while cutting wall-clock latency by 34 percent on real robot manipulation.
The meta-policy learns without access to the generative policy's training data or extra reward shaping.
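The Pareto-dominance claim can be made concrete with a small check over (compute, success) operating points. The sketch below is a generic definition of dominance, not code from the paper; `OperatingPoint`, `dominates`, and `pareto_front` are illustrative names, and any numbers fed to them would have to come from the paper's own measurements.

```python
from typing import NamedTuple


class OperatingPoint(NamedTuple):
    compute: float  # e.g. latency or number of network evaluations
    success: float  # task success rate in [0, 1]


def dominates(a, b):
    """True if a is no worse than b on both axes and strictly better on one."""
    no_worse = a.compute <= b.compute and a.success >= b.success
    strictly_better = a.compute < b.compute or a.success > b.success
    return no_worse and strictly_better


def pareto_front(points):
    """Keep the points that no other point dominates."""
    return [p for p in points if not any(dominates(q, p) for q in points)]
```

At a matched budget the compute values are equal, so dominance reduces to comparing success rates; the latency result is the mirror case, with success matched and compute reduced.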

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

The same meta-MDP framing could be applied to test-time scaling in other generative models such as language or image generators.
Jointly optimizing the base policy and the meta-policy might yield further gains beyond the current frozen-base setting.
State-dependent scaling may prove especially useful in tasks whose precision demands change sharply across phases.

Load-bearing premise

The reinforcement learning procedure lets the meta-policy discover useful compute schedules using only task success as the reward signal.

What would settle it

If a new manipulation task shows that the learned meta-policy needs strictly more total compute than the best fixed schedule to reach the same success rate, the central claim is falsified.

Figures

Figures reproduced from arXiv: 2606.31132 by Andrea Bajcsy, Andrew Zou Li, Gokul Swamy, Yonatan Bisk.

**Figure 1.** Parallel vs. Sequential Scaling in Robotics. Parallel scaling explores diverse action modes, while sequential scaling refines solution quality. Our ELASTIC algorithm jointly allocates both compute axes to optimize the performance–latency tradeoff. … a mode selection problem that benefits more from parallel sampling. Conversely, parallel scaling without sufficient sequential refinement produces imprecise acti…

**Figure 2.** Diffusion Policy Success Rate vs. Sequential Steps L and Parallel Samples P. ELASTIC outperforms Fixed and Sequential-only scaling at matched compute across 6 tasks and 3 cost settings (200 trials each), demonstrating the value of state-dependent allocation. Each marker is one method’s operating point (L, P); Sequential and Fixed sweep three matched-budget points (one per cost setting). The gap between Fix…

**Figure 3.** Figure 3 shows how the meta-policy allocates compute along both axes across time in CAN PAIRED. Sequential steps spike at the pickup phase, where clean denoising is required for reliable grasping; parallel candidates spike after, where the base policy’s training data splits into multiple modes and Qϕ can discriminate between them. After the mode split resolves, both axes collapse: free motion toward the goal is uni…

**Figure 4.** LIBERO-10 Allocation Heatmaps. ELASTIC allocates compute differently across different tasks on a multi-task VLA. At pick and place states, the top task shows elevated sequential steps, whereas the bottom task instead uses more parallel samples. Setup. For simulation, we use the LIBERO-10 benchmark [32] and the corresponding task prompts and run the π0.5-LIBERO checkpoint zero-shot over 20 trials. For har…

**Figure 5.** Real World Success Rate vs Inference Latency. ELASTIC adaptively scales compute to match BoN performance with 34% less inference latency. Average (Sequential Steps, Parallel Samples) per method are labeled. Hyperparameters. For simulation, we set α = 0.13 and β = 0.1 based on wall-clock latency scaling tests on a single RTX 4090. The optimization is conducted with 10 rollouts per task from the base polic…

**Figure 6.** Allocation Timeline: Real World Mug Task. Using ELASTIC, the meta-policy allocates more parallel samples when deciding which mug to pick near the start and during the placing action. Sequential steps vary less, with some peaks near grasping and prior to placing. Failure cases when using insufficient compute include selecting the wrong mug, failing to grasp, and failing to reach over the basket prior to pla…

**Figure 7.** ELASTIC Pipeline.

**Figure 8.**

**Figure 9.** Allocation Timeline: Square MH and PH. Using ELASTIC, the meta-policy allocates more parallel compute overall to Square MH (left) compared to Square PH (right). In particular, sequential compute varies a lot throughout the Square MH trajectory, peaking when first moving toward the square and while approaching the insertion…

**Figure 10.** Denoising Timeline. Visualization of how action samples evolve through the denoising process under the meta-policy. Each plot shows the Euclidean distance between samples vs. denoising time from different states in a CAN PAIRED trajectory. Stopping behaviors emerge from the single scalar reward in Eq. 1 without hand-designed stopping rules or specially trained noise-aware verifiers.

**Figure 11.** LIBERO-10 Success Rates vs Inference Latency. On LIBERO-10, base policy π0.5 already achieves strong performance. Both scaling axes provide minor success rate gains at the cost of significantly longer inference latency: +36.8% for BoN with V-GPS (10 Sequential Steps, 10 Parallel Samples) and +40.3% for Sequential (18 Sequential Steps, 1 Sample). ELASTIC recovers performance gains equal to Sequential with …

**Figure 12.** LIBERO-10 Allocation Heatmap. The heatmap shows how ELASTIC allocates sequential steps (top) and parallel samples (bottom) across states for π0.5.

**Figure 13.** Diffusion Policy Success Rate vs. Sequential Steps L and Parallel Samples P. We plot an additional Learned Fixed baseline, which learns a constant allocation of L sequential steps and P parallel samples per action generation. 7.2.1 Baselines. Since ELASTIC uses varying (L, P) per state, Fixed matches the mean, (L̄, P̄), by randomly sampling between {⌊L̄⌋, ⌈L̄⌉} steps and {⌊P̄⌋, ⌈P̄⌉} samples at each enviro…

**Figure 14.** Inference Latency vs Sequential Steps and Parallel Samples.
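The Figure 13 caption describes the Fixed baseline as matching ELASTIC's mean allocation by randomly sampling between the floor and ceiling of each mean. One way to do this is stochastic rounding, sketched below. The caption is truncated and does not state the sampling probabilities, so drawing the ceiling with probability equal to the fractional part is an assumption made here so that the expected value equals the mean; `sample_matched` and `fixed_allocation` are illustrative names.

```python
import math
import random


def sample_matched(mean, rng):
    """Return floor(mean) or floor(mean) + 1 with expected value equal to mean."""
    low = math.floor(mean)
    frac = mean - low  # probability of rounding up (assumed, see lead-in)
    return low + 1 if rng.random() < frac else low


def fixed_allocation(mean_steps, mean_samples, rng):
    """Draw one (steps, samples) pair for a single environment step."""
    return sample_matched(mean_steps, rng), sample_matched(mean_samples, rng)
```

Because each draw is independent of the state, a baseline built this way spends the same average compute as the adaptive schedule while removing any state dependence, which is what makes it a matched-budget comparison.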

Original abstract

Generative control policies (GCPs), such as diffusion policies and flow-based vision-language-action models, enable test-time scaling in robot control. Test-time compute can be allocated along two axes: sequential scaling, which increases denoising steps to refine actions, and parallel scaling, which samples multiple candidate actions to search across modes of the policy distribution. However, the optimal allocation of sequential and parallel compute is hard to know a priori as it is state-, task-, and policy-dependent. For example, early stages of a grasp may benefit from broader parallel exploration, while near-contact phases may require more sequential refinement for precision. We present ELASTIC, an algorithm that learns state-dependent test-time compute schedules for GCPs. We formulate compute allocation as a meta-Markov Decision Process in which a meta-policy interacts with a frozen pretrained robot policy and selects sequential steps and parallel samples at each denoising iteration to maximize task success while minimizing compute. Using reinforcement learning, this meta-policy also learns adaptive compute schedules without access to the GCP's training data. Across simulated manipulation benchmarks with diffusion policies, ELASTIC Pareto-dominates fixed and single-axis scaling baselines at matched compute budgets. On real-world robot manipulation with the $\pi_{0.5}$ vision-language-action model, ELASTIC matches best-of-$10$ success while reducing wall-clock latency by 34%.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Desk Editor's Note · private letter to a colleague

ELASTIC's meta-RL setup for state-dependent sequential and parallel scaling on frozen generative policies is a practical angle, but the abstract gives no experimental details and the binary-success learning concern looks unresolved.


The paper's core move is to treat test-time compute allocation for a generative control policy as a meta-MDP. A separate meta-policy picks, at each denoising step, how many sequential steps and how many parallel samples to run, trained by RL to maximize task success while keeping total compute down. It does this on a frozen base policy and without access to the base policy's original training data. That joint schedule and the no-data-access constraint are the concrete novelties relative to fixed or single-axis scaling.

The abstract reports that this beats fixed and single-axis baselines on simulated manipulation tasks at matched budgets, and that on a real robot with the π0.5 model it matches best-of-10 success while cutting wall-clock latency by 34 percent. Those numbers, if they hold, would matter for anyone trying to run diffusion-style policies on physical hardware where latency and success both count.

The main weakness is that none of the supporting evidence is visible. There are no training curves for the meta-policy, no description of the reward beyond terminal success, no statistical tests, and no check on whether the learned schedules actually vary with state or simply collapse near a constant allocation. The stress-test concern lands: with only binary success at the end and no shaped intermediate signals or access to the base policy's data distribution, credit assignment across the denoising trajectory is likely to be weak. If the meta-policy cannot reliably discover better-than-fixed behavior, the Pareto claim does not follow.

This is aimed at people already working on test-time scaling for generative robot policies. A reader in that niche would see the formulation and the deployment motivation, but would still need the full methods and data to judge whether the result is real. I would send it to peer review so the experiments can be examined, though the current write-up does not yet make a strong case on its own.

Referee Report

1 major / 1 minor

Summary. The paper introduces ELASTIC, which formulates test-time compute allocation for generative control policies (diffusion policies and vision-language-action models) as a meta-Markov Decision Process. A meta-policy is trained via reinforcement learning on a frozen pretrained GCP to select state-dependent sequential denoising steps and parallel samples at each iteration, using only terminal task success as reward and without access to the GCP's original training data. The central empirical claim is that the resulting adaptive schedules Pareto-dominate fixed and single-axis scaling baselines on simulated manipulation tasks at matched compute budgets, and on real-robot manipulation with the π0.5 model they match best-of-10 success while cutting wall-clock latency by 34%.

Significance. If the reported results are reproducible, the work shows that RL on a meta-MDP can discover non-trivial adaptive compute schedules for GCPs that improve the efficiency-performance trade-off in robot manipulation without requiring access to the base policy's training distribution or shaped rewards.

major comments (1)

[Methods (meta-MDP and RL procedure)] The meta-MDP formulation and RL procedure (described in the methods) receive only terminal binary success and interact solely with the frozen GCP. No mechanism is shown for how credit assignment across the denoising trajectory produces state-dependent schedules that outperform fixed baselines rather than collapsing to near-constant allocation; this is load-bearing for the Pareto-dominance claim.

minor comments (1)

[Abstract] The abstract states concrete performance numbers (Pareto dominance, 34% latency reduction) but supplies no experimental protocol, reward definition, training curves, or statistical tests; these details should be summarized in the abstract or a dedicated experimental setup subsection for immediate assessability.

Simulated Author's Rebuttal

1 response · 0 unresolved

We thank the referee for the constructive feedback. We address the major comment on the meta-MDP formulation, RL procedure, and credit assignment below.


Referee: [Methods (meta-MDP and RL procedure)] The meta-MDP formulation and RL procedure (described in the methods) receive only terminal binary success and interact solely with the frozen GCP. No mechanism is shown for how credit assignment across the denoising trajectory produces state-dependent schedules that outperform fixed baselines rather than collapsing to near-constant allocation; this is load-bearing for the Pareto-dominance claim.

Authors: We agree that the manuscript would benefit from a clearer explanation of how credit assignment enables non-constant, state-dependent allocations. The meta-policy state includes the current robot observation and the denoising iteration index. The meta-policy is trained with PPO using generalized advantage estimation on trajectories that receive only terminal binary success; this propagates credit across the sequence of allocation decisions. Because the state representation conditions on both task progress and denoising stage, the learned policy can (and in our experiments does) vary sequential and parallel compute per state rather than collapsing to a fixed schedule. To make this explicit, we will revise the methods section to detail the state features, advantage estimator, and training hyperparameters, and we will add an analysis subsection with visualizations of per-state allocation decisions plus statistics on allocation variance across episodes and tasks. These additions will directly support the Pareto-dominance results. revision: yes
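To illustrate the credit-assignment argument in this response, the sketch below computes generalized advantage estimates for one episode whose only reward is a terminal success signal. It is a textbook GAE recursion, not the authors' training code; the discount `gamma`, the mixing weight `lam`, and the zero critic values in the usage note are assumptions, since the response gives no hyperparameters.

```python
def gae(rewards, values, gamma=0.99, lam=0.95):
    """Generalized advantage estimates for one finished episode.

    rewards[t] is the reward after decision t, and values[t] is the critic's
    estimate at decision t. The value after the final decision is taken as 0.
    """
    advantages = [0.0] * len(rewards)
    running = 0.0
    for t in reversed(range(len(rewards))):
        next_value = values[t + 1] if t + 1 < len(values) else 0.0
        delta = rewards[t] + gamma * next_value - values[t]  # TD residual
        running = delta + gamma * lam * running              # discounted sum
        advantages[t] = running
    return advantages
```

With rewards `[0, 0, 0, 1]` and an uninformed critic (all values zero), every decision receives a positive advantage that shrinks with distance from the terminal reward. This shows how a binary outcome can reach early allocation decisions, though it does not by itself show that the resulting signal is strong enough to learn state-dependent schedules.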

Circularity Check

0 steps flagged

No circularity; empirical RL procedure on frozen policy

full rationale

The paper formulates compute allocation as a meta-MDP and trains a meta-policy via RL to maximize task success on a frozen GCP. Reported Pareto dominance and latency reductions are measured experimental outcomes across simulation and real-robot benchmarks, not quantities that reduce to fitted inputs or self-citations by construction. No equations, ansatzes, or uniqueness theorems are invoked that would make the success rates tautological; the approach remains self-contained against external benchmarks.

Axiom & Free-Parameter Ledger

0 free parameters · 0 axioms · 0 invented entities

Abstract-only review supplies no explicit free parameters, axioms, or invented entities; the meta-MDP and RL training are standard tools whose concrete hyperparameters and reward definitions are not stated.

pith-pipeline@v0.9.1-grok · 5785 in / 1147 out tokens · 24903 ms · 2026-07-01T05:42:25.791348+00:00 · methodology


Reference graph

Works this paper leans on

39 extracted references · 9 canonical work pages · 2 internal anchors

[1]

C. Chi, S. Feng, Y. Du, Z. Xu, E. Cousineau, B. Burchfiel, and S. Song. Diffusion policy: Visuomotor policy learning via action diffusion. In Proceedings of Robotics: Science and Systems (RSS), 2023.

[2]

K. Black, N. Brown, J. Darpinian, K. Dhabalia, D. Driess, A. Esmail, M. R. Equi, C. Finn, N. Fusai, M. Y. Galliker, D. Ghosh, L. Groom, K. Hausman, B. Ichter, S. Jakubczak, T. Jones, L. Ke, D. LeBlanc, S. Levine, A. Li-Bell, M. Mothukuri, S. Nair, K. Pertsch, A. Z. Ren, L. X. Shi, L. Smith, J. T. Springenberg, K. Stachowicz, J. Tanner, Q. Vuong, H. Walke... PMLR, 27–30 Sep 2025. URL https://proceedings.mlr.press/v305/black25a.html

[3]

M. J. Kim, K. Pertsch, S. Karamcheti, T. Xiao, A. Balakrishna, S. Nair, R. Rafailov, E. P. Foster, P. R. Sanketi, Q. Vuong, T. Kollar, B. Burchfiel, R. Tedrake, D. Sadigh, S. Levine, P. Liang, and C. Finn. Openvla: An open-source vision-language-action model. In P. Agrawal, O. Kroemer, and W. Burgard, editors, Proceedings of The 8th Conference on Robot Lea...

[4]

Octo Model Team, D. Ghosh, H. Walke, K. Pertsch, K. Black, O. Mees, S. Dasari, J. Hejna, C. Xu, J. Luo, T. Kreiman, Y. Tan, L. Y. Chen, P. Sanketi, Q. Vuong, T. Xiao, D. Sadigh, C. Finn, and S. Levine. Octo: An open-source generalist robot policy. In Proceedings of Robotics: Science and Systems, Delft, Netherlands, 2024.

[5]

J. Ho, A. Jain, and P. Abbeel. Denoising diffusion probabilistic models. In Advances in Neural Information Processing Systems, volume 33, pages 6840–6851. Curran Associates, Inc., 2020. URL https://proceedings.neurips.cc/paper/2020/hash/4c5bcfec8584af0d967f1ab10179ca4b-Abstract.html

[6]

J. Song, C. Meng, and S. Ermon. Denoising diffusion implicit models. In International Conference on Learning Representations, Oct. 2021. URL https://openreview.net/forum?id=St1giarCHLP

[7]

Y. Lipman, R. T. Q. Chen, H. Ben-Hamu, M. Nickel, and M. Le. Flow matching for generative modeling. In International Conference on Learning Representations, 2023. URL https://openreview.net/forum?id=PqvMRDCJT9t

[8]

B. Brown, J. Juravsky, R. S. Ehrlich, R. Clark, Q. V. Le, C. Re, and A. Mirhoseini. Large language monkeys: Scaling inference compute with repeated sampling, 2025. URL https://openreview.net/forum?id=0xUEBQV54B

[9]

A. Prasad, K. Lin, J. Wu, L. Zhou, and J. Bohg. Consistency policy: Accelerated visuomotor policies via consistency distillation. In Robotics: Science and Systems, 2024.

[10]

C. Lu, Y. Zhou, F. Bao, J. Chen, C. Li, and J. Zhu. Dpm-solver: A fast ode solver for diffusion probabilistic model sampling in around 10 steps. In Advances in Neural Information Processing Systems, 2022. URL https://proceedings.neurips.cc/paper_files/paper/2022/hash/260a14acce2a89dad36adc8eefe7c59e-Abstract-Conference.html

[11]

X. Hu, B. Liu, X. Liu, and Q. Liu. Adaflow: Imitation learning with variance-adaptive flow-based policies. In Advances in Neural Information Processing Systems, 2024. URL https://proceedings.neurips.cc/paper_files/paper/2024/hash/fa9cfdb49f7de3dee54007f84c0745b1-Abstract-Conference.html

[12]

A. Trivedi, A. Li, M. Elnoor, Y. U. Ciftci, A. Singh, J. D’sa, S. Bae, D. Isele, T. Padir, and F. M. Tariq. Adaptive time step flow matching for autonomous driving motion planning, 2026. URL https://arxiv.org/abs/2602.10285

[13]

S.-A. Yu, F. Gao, Y. Wu, C. Yu, and Y. Wang. D3p: Dynamic denoising diffusion policy via reinforcement learning. (arXiv:2508.06804), Aug. 2025. doi:10.48550/arXiv.2508.06804. URL http://arxiv.org/abs/2508.06804. arXiv:2508.06804 [cs]

[14]

A. Setlur, N. Rajaraman, S. Levine, and A. Kumar. Scaling test-time compute without verification or RL is suboptimal. In Forty-second International Conference on Machine Learning, 2025. URL https://openreview.net/forum?id=beeNgQEfe2

[15]

N. Ma, S. Tong, H. Jia, H. Hu, Y.-C. Su, M. Zhang, X. Yang, Y. Li, T. Jaakkola, X. Jia, and S. Xie. Scaling inference time compute for diffusion models. In IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), pages 2523–2534, 2025. URL https://openaccess.thecvf.com/content/CVPR2025/html/Ma_Scaling_Inference_Time_Compute_for_Diffusion_...

[16]

M. Damani, I. Shenfeld, A. Peng, A. Bobu, and J. Andreas. Learning how hard to think: Input-adaptive allocation of lm computation. In International Conference on Learning Representations, 2025. URL https://openreview.net/forum?id=6qUUgw9bAZ

[17]

J. Qi, X. Ye, H. Tang, Z. Zhu, and E. Choi. Learning to reason across parallel samples for llm reasoning. (arXiv:2506.09014), Oct. 2025. doi:10.48550/arXiv.2506.09014. URL http://arxiv.org/abs/2506.09014. arXiv:2506.09014 [cs]

[18]

J. Kwok, C. Agia, R. Sinha, M. Foutter, S. Li, I. Stoica, A. Mirhoseini, and M. Pavone. Robomonkey: Scaling test-time sampling and verification for vision-language-action models. (arXiv:2506.17811), 2025. doi:10.48550/arXiv.2506.17811. URL http://arxiv.org/abs/2506.17811. arXiv:2506.17811 [cs]

[19]

A. K. Jain, V. Mohta, S. Kim, A. Bhardwaj, J. Ren, Y. Feng, S. Choudhury, and G. Swamy. A smooth sea never made a skilled SAILOR: Robust imitation via learning to search. In Advances in Neural Information Processing Systems, 2025. URL https://openreview.net/forum?id=qN5hmLkBtC

[20]

M. Nakamoto, O. Mees, A. Kumar, and S. Levine. Steering your generalists: Improving robotic foundation models via value guidance. Conference on Robot Learning (CoRL), 2024.

[21]

Y. Wu, R. Tian, G. Swamy, and A. Bajcsy. From foresight to forethought: Vlm-in-the-loop policy steering via latent alignment. In Proceedings of Robotics: Science and Systems, 2025. URL https://roboticsproceedings.org/rss21/p076.html

[22]

Y. Wu, A. Li, T. Hermans, F. Ramos, A. Bajcsy, and C. Pérez-D’Arpino. Do what you say: Steering vision-language-action models via runtime reasoning-action alignment verification. 2026 IEEE International Conference on Robotics & Automation (ICRA), 2026.

[23]

P. Dong, A. Swerdlow, D. Sadigh, and C. Finn. Faster: Value-guided sampling for fast rl, 2026. URL https://arxiv.org/abs/2604.19730

[24]

R. Manvi, J. Hong, T. Seyde, M. Labonne, M. Lechner, and S. Levine. Zero-overhead introspection for adaptive test-time compute. (arXiv:2512.01457), Dec. 2025. doi:10.48550/arXiv.2512.01457. URL http://arxiv.org/abs/2512.01457. arXiv:2512.01457 [cs]

[25]

Y. Song, J. Sohl-Dickstein, D. P. Kingma, A. Kumar, S. Ermon, and B. Poole. Score-based generative modeling through stochastic differential equations. In International Conference on Learning Representations, Oct. 2021. URL https://openreview.net/forum?id=PxTIG12RRHS

[26]

A. Z. Ren, J. Lidard, L. L. Ankile, A. Simeonov, P. Agrawal, A. Majumdar, B. Burchfiel, H. Dai, and M. Simchowitz. Diffusion policy policy optimization. In International Conference on Learning Representations, 2025. URL https://openreview.net/forum?id=YPR0X7dCXn

[27]

A. Y. Ng, D. Harada, and S. J. Russell. Policy invariance under reward transformations: Theory and application to reward shaping. In Proceedings of the Sixteenth International Conference on Machine Learning (ICML), pages 278–287, 1999. URL https://dl.acm.org/doi/10.5555/645528.657613

[28]

T. Haarnoja, A. Zhou, P. Abbeel, and S. Levine. Soft actor-critic: Off-policy maximum entropy deep reinforcement learning with a stochastic actor. In J. Dy and A. Krause, editors, Proceedings of the 35th International Conference on Machine Learning, volume 80 of Proceedings of Machine Learning Research, pages 1861–1870. PMLR, 10–15 Jul 2018. URL https:...

[29]

E. Perez, F. Strub, H. de Vries, V. Dumoulin, and A. C. Courville. Film: Visual reasoning with a general conditioning layer. In AAAI, 2018.

[30]

Y. Song, Y. Zhou, A. Sekhari, D. Bagnell, A. Krishnamurthy, and W. Sun. Hybrid RL: Using both offline and online data can make RL efficient. In The Eleventh International Conference on Learning Representations, 2023. URL https://openreview.net/forum?id=yyBis80iUuU

[31]

A. Mandlekar, D. Xu, J. Wong, S. Nasiriany, C. Wang, R. Kulkarni, L. Fei-Fei, S. Savarese, Y. Zhu, and R. Martín-Martín. What matters in learning from offline human demonstrations for robot manipulation. In Proceedings of the 5th Conference on Robot Learning, volume 164 of Proceedings of Machine Learning Research, pages 1678–1690. PMLR, 2022. URL https:/...

[32]

B. Liu, Y. Zhu, C. Gao, Y. Feng, Q. Liu, Y. Zhu, and P. Stone. LIBERO: Benchmarking knowledge transfer for lifelong robot learning. In Thirty-seventh Conference on Neural Information Processing Systems Datasets and Benchmarks Track, 2023. URL https://openreview.net/forum?id=xzEtNSuDJk

[33]

A. Liang, Y. Korkmaz, J. Zhang, M. Hwang, A. Anwar, S. Kaushik, A. Shah, A. S. Huang, L. Zettlemoyer, D. Fox, Y. Xiang, A. Li, A. Bobu, A. Gupta, S. Tu, E. Biyik, and J. Zhang. Robometer: Scaling general-purpose robotic reward models via trajectory comparisons. In Robotics: Science and Systems 2026, 2026.

[34]

Y. J. Ma, J. Hejna, C. Fu, D. Shah, J. Liang, Z. Xu, S. Kirmani, P. Xu, D. Driess, T. Xiao, O. Bastani, D. Jayaraman, W. Yu, T. Zhang, D. Sadigh, and F. Xia. Vision language models are in-context value learners. In The Thirteenth International Conference on Learning Representations, 2025. URL https://openreview.net/forum?id=friHAl5ofG

[35]

T. Lee, A. Wagenmaker, K. Pertsch, P. Liang, S. Levine, and C. Finn. Roboreward: General-purpose vision-language reward models for robotics, 2026. URL https://arxiv.org/abs/2601.00675

[36]

H. Tan, S. Chen, Y. Xu, Z. Wang, Y. Ji, C. Chi, Y. Lyu, Z. Zhao, X. Chen, P. Co, et al. Robo-dopamine: General process reward modeling for high-precision robotic manipulation. arXiv preprint arXiv:2512.23703, 2025.

7 Appendix

7.1 ELASTIC Algorithm

Algorithm 1 Inference at every environment step t

1: Observation o_t, base policy π_GCP, meta-policy π_meta, re...

[1] [1]

C. Chi, S. Feng, Y . Du, Z. Xu, E. Cousineau, B. Burchfiel, and S. Song. Diffusion policy: Visuomotor policy learning via action diffusion. InProceedings of Robotics: Science and Systems (RSS), 2023

2023

[2] [2]

Black, N

K. Black, N. Brown, J. Darpinian, K. Dhabalia, D. Driess, A. Esmail, M. R. Equi, C. Finn, N. Fusai, M. Y . Galliker, D. Ghosh, L. Groom, K. Hausman, B. Ichter, S. Jakubczak, T. Jones, L. Ke, D. LeBlanc, S. Levine, A. Li-Bell, M. Mothukuri, S. Nair, K. Pertsch, A. Z. Ren, L. X. Shi, L. Smith, J. T. Springenberg, K. Stachowicz, J. Tanner, Q. Vuong, H. Walke...

[3] [3]

URLhttps://proceedings.mlr.press/v305/black25a

PMLR, 27–30 Sep 2025. URLhttps://proceedings.mlr.press/v305/black25a. html

2025

[4] [4]

M. J. Kim, K. Pertsch, S. Karamcheti, T. Xiao, A. Balakrishna, S. Nair, R. Rafailov, E. P. Foster, P. R. Sanketi, Q. Vuong, T. Kollar, B. Burchfiel, R. Tedrake, D. Sadigh, S. Levine, P. Liang, and C. Finn. Openvla: An open-source vision-language-action model. In P. Agrawal, O. Kroemer, and W. Burgard, editors,Proceedings of The 8th Conference on Robot Lea...

2025

[5] [5]

Ghosh, H

Octo Model Team, D. Ghosh, H. Walke, K. Pertsch, K. Black, O. Mees, S. Dasari, J. Hejna, C. Xu, J. Luo, T. Kreiman, Y . Tan, L. Y . Chen, P. Sanketi, Q. Vuong, T. Xiao, D. Sadigh, C. Finn, and S. Levine. Octo: An open-source generalist robot policy. InProceedings of Robotics: Science and Systems, Delft, Netherlands, 2024

2024

[6] [6]

J. Ho, A. Jain, and P. Abbeel. Denoising diffusion probabilistic models. InAd- vances in Neural Information Processing Systems, volume 33, page 6840–6851. Curran Associates, Inc., 2020. URLhttps://proceedings.neurips.cc/paper/2020/hash/ 4c5bcfec8584af0d967f1ab10179ca4b-Abstract.html

2020

[7] [7]

J. Song, C. Meng, and S. Ermon. Denoising diffusion implicit models. InInternational Con- ference on Learning Representations, Oct. 2021. URLhttps://openreview.net/forum? id=St1giarCHLP

2021

[8] [8]

Lipman, R

Y . Lipman, R. T. Q. Chen, H. Ben-Hamu, M. Nickel, and M. Le. Flow matching for generative modeling. InInternational Conference on Learning Representations, 2023. URLhttps: //openreview.net/forum?id=PqvMRDCJT9t

2023

[9] [9]

Brown, J

B. Brown, J. Juravsky, R. S. Ehrlich, R. Clark, Q. V . Le, C. Re, and A. Mirhoseini. Large language monkeys: Scaling inference compute with repeated sampling, 2025. URLhttps: //openreview.net/forum?id=0xUEBQV54B

2025

[10] [10]

Prasad, K

A. Prasad, K. Lin, J. Wu, L. Zhou, and J. Bohg. Consistency policy: Accelerated visuomotor policies via consistency distillation. InRobotics: Science and Systems, 2024

2024

[11] [11]

C. Lu, Y . Zhou, F. Bao, J. Chen, C. Li, and J. Zhu. Dpm-solver: A fast ode solver for diffusion probabilistic model sampling in around 10 steps. InAdvances in Neural Information Pro- cessing Systems, 2022. URLhttps://proceedings.neurips.cc/paper_files/paper/ 2022/hash/260a14acce2a89dad36adc8eefe7c59e-Abstract-Conference.html

2022

[12] [12]

X. Hu, B. Liu, X. Liu, and Q. Liu. Adaflow: Imitation learning with variance- adaptive flow-based policies. InAdvances in Neural Information Processing Systems,

[13] [13]

URLhttps://proceedings.neurips.cc/paper_files/paper/2024/hash/ fa9cfdb49f7de3dee54007f84c0745b1-Abstract-Conference.html. 10

2024

[14] [14]

Trivedi, A

A. Trivedi, A. Li, M. Elnoor, Y . U. Ciftci, A. Singh, J. D’sa, S. Bae, D. Isele, T. Padir, and F. M. Tariq. Adaptive time step flow matching for autonomous driving motion planning, 2026. URLhttps://arxiv.org/abs/2602.10285

work page arXiv 2026

[15]

S.-A. Yu, F. Gao, Y. Wu, C. Yu, and Y. Wang. D3p: Dynamic denoising diffusion policy via reinforcement learning. (arXiv:2508.06804), Aug. 2025. doi:10.48550/arXiv.2508.06804. URL http://arxiv.org/abs/2508.06804. arXiv:2508.06804 [cs]

[16]

A. Setlur, N. Rajaraman, S. Levine, and A. Kumar. Scaling test-time compute without verification or RL is suboptimal. In Forty-second International Conference on Machine Learning, 2025. URL https://openreview.net/forum?id=beeNgQEfe2

[18]

N. Ma, S. Tong, H. Jia, H. Hu, Y.-C. Su, M. Zhang, X. Yang, Y. Li, T. Jaakkola, X. Jia, and S. Xie. Scaling inference time compute for diffusion models. In IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), pages 2523–2534, 2025. URL https://openaccess.thecvf.com/content/CVPR2025/html/Ma_Scaling_Inference_Time_Compute_for_Diffusion_...

[19]

M. Damani, I. Shenfeld, A. Peng, A. Bobu, and J. Andreas. Learning how hard to think: Input-adaptive allocation of lm computation. In International Conference on Learning Representations, 2025. URL https://openreview.net/forum?id=6qUUgw9bAZ

[20]

J. Qi, X. Ye, H. Tang, Z. Zhu, and E. Choi. Learning to reason across parallel samples for llm reasoning. (arXiv:2506.09014), Oct. 2025. doi:10.48550/arXiv.2506.09014. URL http://arxiv.org/abs/2506.09014. arXiv:2506.09014 [cs]

[21]

J. Kwok, C. Agia, R. Sinha, M. Foutter, S. Li, I. Stoica, A. Mirhoseini, and M. Pavone. Robomonkey: Scaling test-time sampling and verification for vision-language-action models. (arXiv:2506.17811), 2025. doi:10.48550/arXiv.2506.17811. URL http://arxiv.org/abs/2506.17811. arXiv:2506.17811 [cs]

[22]

A. K. Jain, V. Mohta, S. Kim, A. Bhardwaj, J. Ren, Y. Feng, S. Choudhury, and G. Swamy. A smooth sea never made a skilled SAILOR: Robust imitation via learning to search. In Advances in Neural Information Processing Systems, 2025. URL https://openreview.net/forum?id=qN5hmLkBtC

[23]

M. Nakamoto, O. Mees, A. Kumar, and S. Levine. Steering your generalists: Improving robotic foundation models via value guidance. Conference on Robot Learning (CoRL), 2024

[24]

Y. Wu, R. Tian, G. Swamy, and A. Bajcsy. From foresight to forethought: Vlm-in-the-loop policy steering via latent alignment. In Proceedings of Robotics: Science and Systems, 2025. URL https://roboticsproceedings.org/rss21/p076.html

[25]

Y. Wu, A. Li, T. Hermans, F. Ramos, A. Bajcsy, and C. Pérez-D’Arpino. Do what you say: Steering vision-language-action models via runtime reasoning-action alignment verification. 2026 IEEE International Conference on Robotics & Automation (ICRA), 2026

[26]

P. Dong, A. Swerdlow, D. Sadigh, and C. Finn. Faster: Value-guided sampling for fast rl, 2026. URL https://arxiv.org/abs/2604.19730

[27]

R. Manvi, J. Hong, T. Seyde, M. Labonne, M. Lechner, and S. Levine. Zero-overhead introspection for adaptive test-time compute. (arXiv:2512.01457), Dec. 2025. doi:10.48550/arXiv.2512.01457. URL http://arxiv.org/abs/2512.01457. arXiv:2512.01457 [cs]

[28]

Y. Song, J. Sohl-Dickstein, D. P. Kingma, A. Kumar, S. Ermon, and B. Poole. Score-based generative modeling through stochastic differential equations. In International Conference on Learning Representations, Oct. 2021. URL https://openreview.net/forum?id=PxTIG12RRHS

[29]

A. Z. Ren, J. Lidard, L. L. Ankile, A. Simeonov, P. Agrawal, A. Majumdar, B. Burchfiel, H. Dai, and M. Simchowitz. Diffusion policy policy optimization. In International Conference on Learning Representations, 2025. URL https://openreview.net/forum?id=YPR0X7dCXn

[30]

A. Y. Ng, D. Harada, and S. J. Russell. Policy invariance under reward transformations: Theory and application to reward shaping. In Proceedings of the Sixteenth International Conference on Machine Learning (ICML), pages 278–287, 1999. URL https://dl.acm.org/doi/10.5555/645528.657613

[31]

T. Haarnoja, A. Zhou, P. Abbeel, and S. Levine. Soft actor-critic: Off-policy maximum entropy deep reinforcement learning with a stochastic actor. In J. Dy and A. Krause, editors, Proceedings of the 35th International Conference on Machine Learning, volume 80 of Proceedings of Machine Learning Research, pages 1861–1870. PMLR, 10–15 Jul 2018. URL https:...

[32]

E. Perez, F. Strub, H. de Vries, V. Dumoulin, and A. C. Courville. Film: Visual reasoning with a general conditioning layer. In AAAI, 2018

[33]

Y. Song, Y. Zhou, A. Sekhari, D. Bagnell, A. Krishnamurthy, and W. Sun. Hybrid RL: Using both offline and online data can make RL efficient. In The Eleventh International Conference on Learning Representations, 2023. URL https://openreview.net/forum?id=yyBis80iUuU

[34]

A. Mandlekar, D. Xu, J. Wong, S. Nasiriany, C. Wang, R. Kulkarni, L. Fei-Fei, S. Savarese, Y. Zhu, and R. Martín-Martín. What matters in learning from offline human demonstrations for robot manipulation. In Proceedings of the 5th Conference on Robot Learning, volume 164 of Proceedings of Machine Learning Research, pages 1678–1690. PMLR, 2022. URL https:/...

[35]

B. Liu, Y. Zhu, C. Gao, Y. Feng, Q. Liu, Y. Zhu, and P. Stone. LIBERO: Benchmarking knowledge transfer for lifelong robot learning. In Thirty-seventh Conference on Neural Information Processing Systems Datasets and Benchmarks Track, 2023. URL https://openreview.net/forum?id=xzEtNSuDJk

[36]

A. Liang, Y. Korkmaz, J. Zhang, M. Hwang, A. Anwar, S. Kaushik, A. Shah, A. S. Huang, L. Zettlemoyer, D. Fox, Y. Xiang, A. Li, A. Bobu, A. Gupta, S. Tu, E. Biyik, and J. Zhang. Robometer: Scaling general-purpose robotic reward models via trajectory comparisons. In Robotics: Science and Systems 2026, 2026

[37]

Y. J. Ma, J. Hejna, C. Fu, D. Shah, J. Liang, Z. Xu, S. Kirmani, P. Xu, D. Driess, T. Xiao, O. Bastani, D. Jayaraman, W. Yu, T. Zhang, D. Sadigh, and F. Xia. Vision language models are in-context value learners. In The Thirteenth International Conference on Learning Representations, 2025. URL https://openreview.net/forum?id=friHAl5ofG

[38]

T. Lee, A. Wagenmaker, K. Pertsch, P. Liang, S. Levine, and C. Finn. Roboreward: General-purpose vision-language reward models for robotics, 2026. URL https://arxiv.org/abs/2601.00675

[39]

H. Tan, S. Chen, Y. Xu, Z. Wang, Y. Ji, C. Chi, Y. Lyu, Z. Zhao, X. Chen, P. Co, et al. Robo-dopamine: General process reward modeling for high-precision robotic manipulation. arXiv preprint arXiv:2512.23703, 2025.

7 Appendix

7.1 ELASTIC Algorithm

Algorithm 1 Inference at every environment step t

1: Observation o_t, base policy π_GCP, meta-policy π_meta, re...