TMRL: Diffusion Timestep-Modulated Pretraining Enables Exploration for Efficient Policy Finetuning
Pith reviewed 2026-05-13 04:14 UTC · model grok-4.3
The pith
Pretraining robot policies with injected diffusion noise and then modulating the timestep during RL fine-tuning creates controllable exploration that improves sample efficiency.
A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.
Core claim
The central claim is twofold: Context-Smoothed Pre-training injects forward-diffusion noise into policy inputs to create a tunable continuum between precise imitation and broad coverage, and Timestep-Modulated Reinforcement Learning then lets the policy choose the conditioning timestep at each step, so exploration can be adjusted on the fly and downstream RL fine-tuning gains sample efficiency.
What carries the argument
Timestep-Modulated Reinforcement Learning (TMRL) that conditions the policy on a selected diffusion timestep during fine-tuning, built on top of Context-Smoothed Pre-training (CSP) that adds controlled forward-diffusion noise to the inputs.
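A minimal sketch of the two mechanisms as described above, assuming a standard DDPM cumulative noise schedule alpha_bar and a PyTorch-style, timestep-conditioned policy interface; the function names and the learned timestep head are illustrative assumptions, not the authors' implementation:

```python
import torch

def csp_smooth_context(context, t, alpha_bar):
    """Context-Smoothed Pre-training (CSP) as described: inject forward-diffusion
    noise into the policy *inputs* at timestep t. alpha_bar is an assumed DDPM
    cumulative schedule, a 1-D tensor indexed by timestep."""
    eps = torch.randn_like(context)
    return alpha_bar[t].sqrt() * context + (1.0 - alpha_bar[t]).sqrt() * eps

def tmrl_act(policy, timestep_head, context, alpha_bar):
    """Timestep-Modulated RL (TMRL) as described: the agent selects the
    conditioning timestep t (here via a hypothetical learned head), then acts
    on the correspondingly smoothed context. Small t: near-clean input and
    precise imitation; large t: heavily noised input and broad exploration."""
    t = timestep_head(context)                        # chosen exploration level
    noisy = csp_smooth_context(context, t, alpha_bar)
    return policy(noisy, t), t                        # policy is t-conditioned
```

On this reading of the abstract, pre-training would draw t at random per sample, while fine-tuning trains the head's choice of t under the RL objective.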
If this is right
- The method integrates directly with policies that take states, 3D point clouds, or image inputs without requiring architectural changes.
- RL fine-tuning reaches successful policies with substantially fewer environment interactions than standard approaches.
- Complex real-world manipulation tasks become solvable within one hour of robot fine-tuning time.
Where Pith is reading between the lines
- The same noise-injection-plus-timestep-control pattern could be tested in non-robot RL domains where pretraining is used to initialize policies.
- If the timestep selection proves stable, it might reduce the amount of real-world data collection needed for new robot skills.
- Future work could examine whether the same mechanism helps when the pretraining data itself comes from noisy or incomplete demonstrations.
Load-bearing premise
That the noise levels introduced in pretraining actually produce a useful, continuous range of behaviors that can be selected via timestep without harming the policy's ability to imitate or to learn.
What would settle it
Either of two results would falsify the central claim: a controlled comparison on the same pre-trained policy showing that standard RL fine-tuning without timestep modulation reaches the same success rate with equal or fewer samples, or evidence that real-world manipulation tasks still require more than one hour of fine-tuning.
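Read as a concrete protocol, that comparison might look like the following sketch; the finetune interface, seed count, and "interactions to reach a target success rate" metric are illustrative assumptions:

```python
def falsification_study(pretrained_policy, env, finetune, seeds=range(5)):
    """Hedged sketch of the settling experiment: fine-tune the *same*
    CSP-pretrained policy with and without timestep modulation, and compare
    environment interactions needed to reach a fixed target success rate."""
    mean_interactions = {}
    for name, modulate in [("tmrl", True), ("fixed_timestep", False)]:
        runs = [finetune(pretrained_policy, env, modulate_timestep=modulate,
                         seed=s)                   # returns interaction count
                for s in seeds]
        mean_interactions[name] = sum(runs) / len(runs)
    # The central claim is falsified if fixed-timestep fine-tuning is at
    # least as sample-efficient as TMRL on the same pre-trained policy.
    return mean_interactions["fixed_timestep"] <= mean_interactions["tmrl"]
```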
Original abstract
Fine-tuning pre-trained robot policies with reinforcement learning (RL) often inherits the bottlenecks introduced by pre-training with behavioral cloning (BC), which produces narrow action distributions that lack the coverage necessary for downstream exploration. We present a unified framework that enables the exploration necessary to enable efficient robot policy finetuning by bridging BC pre-training and RL fine-tuning. Our pre-training method, Context-Smoothed Pre-training (CSP), injects forward-diffusion noise into policy inputs, creating a continuum between precise imitation and broad action coverage. We then fine-tune pre-trained policies via Timestep-Modulated Reinforcement Learning (TMRL), which trains the agent to dynamically adjust this conditioning during fine-tuning by modulating the diffusion timestep, granting explicit control over exploration. Integrating seamlessly with arbitrary policy inputs, e.g., states, 3D point clouds, or image-based VLA policies, we show that TMRL improves RL fine-tuning sample efficiency. Notably, TMRL enables successful real-world fine-tuning on complex manipulation tasks in under one hour. Videos and code available at https://weirdlabuw.github.io/tmrl/.
Editorial analysis
A structured set of objections, weighed in public.
Referee Report
Summary. The paper proposes Context-Smoothed Pre-training (CSP), which injects forward-diffusion noise into policy inputs during behavioral cloning to create a continuum between precise imitation and broad action coverage, and Timestep-Modulated Reinforcement Learning (TMRL), which conditions the policy on the diffusion timestep during RL fine-tuning to explicitly modulate exploration. It claims seamless integration with arbitrary inputs (states, 3D point clouds, image-based VLA policies), improved RL fine-tuning sample efficiency, and successful real-world fine-tuning on complex manipulation tasks in under one hour, with code and videos released.
Significance. If the empirical claims hold, this could be a significant contribution to robot learning by offering a unified, conditioning-based bridge between BC pre-training and RL fine-tuning that directly addresses the exploration bottleneck through diffusion timestep modulation. The approach's compatibility with modern multimodal policies and the real-world demonstration would be impactful; the release of code and videos is a clear strength for reproducibility.
Major comments (1)
- Abstract: The abstract states performance improvements and real-world success but provides no quantitative results, baselines, experimental details, or error analysis, making it impossible to assess whether the data supports the claims.
Simulated Author's Rebuttal
We thank the referee for their constructive feedback and for acknowledging the potential significance of our approach for bridging BC pre-training and RL fine-tuning in robot learning. We address the major comment point by point below.
Point-by-point responses
Referee: Abstract: The abstract states performance improvements and real-world success but provides no quantitative results, baselines, experimental details, or error analysis, making it impossible to assess whether the data supports the claims.
Authors: We agree that the abstract would be strengthened by including specific quantitative results to better substantiate the claims. In the revised version, we will update the abstract to incorporate key metrics such as sample efficiency improvements (e.g., achieving target performance with X% fewer environment interactions than standard BC+RL baselines), details on the baselines compared (including vanilla diffusion policies and other exploration methods), and reference to error analysis from multiple random seeds. These additions will be kept concise to maintain abstract length while providing sufficient evidence for the reported gains in simulation and the real-world one-hour fine-tuning results. We believe this directly addresses the concern without altering the core narrative.
Revision: yes
Circularity Check
No significant circularity detected
Full rationale
The paper describes a high-level methodological framework (CSP for pretraining via forward-diffusion noise injection on inputs, followed by TMRL conditioning on the diffusion timestep during RL fine-tuning) without presenting any equations, derivations, or parameter-fitting procedures in the abstract or summary. No load-bearing steps reduce to self-definitions, fitted inputs renamed as predictions, or self-citation chains. Claims of improved sample efficiency and real-world applicability are positioned as empirical outcomes rather than tautological constructions, so the argument stands or falls on external benchmarks rather than on its own definitions.
Axiom & Free-Parameter Ledger
Axioms (1)
- Domain assumption: forward-diffusion noise can be injected into policy inputs to create a controllable continuum between imitation and exploration.
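For reference, the forward-diffusion injection this assumption refers to is, in the standard DDPM parameterization (a fact about the diffusion literature; the notation c_t for a noised policy input is ours, not the paper's):

```latex
% Standard DDPM forward process applied to a clean policy input c_0:
\[
  q(c_t \mid c_0)
    = \mathcal{N}\!\left(\sqrt{\bar\alpha_t}\, c_0,\; (1 - \bar\alpha_t)\, I\right),
  \qquad
  \bar\alpha_t = \prod_{s=1}^{t} \alpha_s .
\]
% t = 0 recovers the clean input (precise imitation); large t approaches pure
% noise (broad coverage), which is the controllable continuum the axiom posits.
```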
Lean theorems connected to this paper
- IndisputableMonolith/Foundation/RealityFromDistinction.lean · reality_from_one_distinction (tagged: unclear)
Unclear: the relation between the paper passage and the cited Recognition theorem.
Linked passage: Theorem 1 (Smoothing increases overlap...): TV(p_σ(·|c), p_σ(·|c')) ≤ (E∥w∥/σ) ∥c - c'∥
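Restated in clean notation (reading p_σ(·|c) as the policy's action distribution under context noise w at scale σ is an inference from the CSP construction, not a definition taken from the paper):

```latex
% Theorem 1 (Smoothing increases overlap), as excerpted above:
\[
  \mathrm{TV}\!\left( p_\sigma(\cdot \mid c),\; p_\sigma(\cdot \mid c') \right)
  \;\le\; \frac{\mathbb{E}\,\lVert w \rVert}{\sigma}\, \lVert c - c' \rVert .
\]
% Nearby contexts induce nearby smoothed action distributions, so noising the
% input makes conditionals overlap: the mechanism behind the
% imitation-to-exploration continuum.
```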
What do these tags mean?
- matches: the paper's claim is directly supported by a theorem in the formal canon.
- supports: the theorem supports part of the paper's argument, but the paper may add assumptions or extra steps.
- extends: the paper goes beyond the formal theorem; the theorem is a base layer rather than the whole result.
- uses: the paper appears to rely on the theorem as machinery.
- contradicts: the paper's claim conflicts with a theorem or certificate in the canon.
- unclear: Pith found a possible connection, but the passage is too broad, indirect, or ambiguous to say the theorem truly supports the claim.