arxiv: 2603.15757 · v2 · submitted 2026-03-16 · 💻 cs.RO · cs.AI

Recognition: 1 theorem link

· Lean Theorem

You've Got a Golden Ticket: Improving Generative Robot Policies With A Single Noise Vector

Omkar Patil , Ondrej Biza , Thomas Weng , Karl Schmeckpeper , Wil Thomason , Xiaohan Zhang , Robin Walters , Nakul Gopalan

show 2 more authors

Sebastian Castro Eric Rosen

Authors on Pith no claims yet

Pith reviewed 2026-05-15 09:50 UTC · model grok-4.3

classification 💻 cs.RO cs.AI

keywords robot manipulationdiffusion policiesflow matchingpolicy improvementnoise initializationgenerative modelsvision language action models

0 comments

The pith

A fixed initial noise vector can improve the performance of pretrained diffusion and flow-matching robot policies without any retraining.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper shows that generative robot policies, which sample initial noise from a Gaussian each time, can achieve higher task rewards when given one well-chosen constant noise vector instead. This vector, called a golden ticket, is found by a simple Monte-Carlo search over a few rollouts while keeping the policy frozen. The approach requires no new training or models and works across simulated and real manipulation tasks, boosting success rates in most cases and offering natural behavior diversity for multi-objective tradeoffs in multi-task settings. A reader should care because it turns an existing policy into a better one with almost no extra cost, just by picking the right starting noise once.

Core claim

We demonstrate that the performance of a pretrained, frozen diffusion or flow matching policy can be improved with respect to a downstream reward by swapping the sampling of initial noise from the prior distribution (typically isotropic Gaussian) with a well-chosen, constant initial noise input -- a golden ticket. We propose a search method to find golden tickets using Monte-Carlo policy evaluation that keeps the pretrained policy frozen, does not train any new networks, and is applicable to all diffusion/flow matching policies.

What carries the argument

The golden ticket, a single constant initial noise vector selected through Monte-Carlo search on episode rewards, which when fed repeatedly to the policy produces higher-reward trajectories than random sampling from the prior.

If this is right

Improves success rates on 38 out of 43 tasks in simulation and real robot benchmarks.
Relative gains reach 58% in simulation and 60% in real-world within 50 search episodes.
Enables a Pareto frontier of behaviors in multi-task settings by using different tickets.
A ticket optimized for one task can improve related tasks in vision-language-action models.
Requires only the ability to inject noise and observe sparse rewards, with no extra infrastructure.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

This approach implies that the choice of initial noise is an under-explored control knob for generative policies that can be tuned post-training.
Golden tickets might transfer across similar environments or tasks without re-searching, though this is not tested.
Extending the search to optimize for multiple objectives simultaneously could yield tickets for custom reward balances.
The method could apply to other generative models outside robotics if they use initial noise sampling.

Load-bearing premise

That the noise vector discovered by search on a small set of evaluation episodes will keep delivering higher rewards on fresh episodes and under changes in conditions.

What would settle it

Running the policy with the found golden ticket on a new set of episodes drawn from the same distribution and observing whether the average reward falls back to or below the level achieved with random noise sampling.

Figures

Figures reproduced from arXiv: 2603.15757 by Eric Rosen, Karl Schmeckpeper, Nakul Gopalan, Omkar Patil, Ondrej Biza, Robin Walters, Sebastian Castro, Thomas Weng, Wil Thomason, Xiaohan Zhang.

**Figure 2.** Figure 2: Overview of standard diffusion policy inference (left) versus our proposed approach of using golden tickets (right). [PITH_FULL_IMAGE:figures/full_fig_p002_2.png] view at source ↗

**Figure 3.** Figure 3: Sample images from some of our simulated benchmarks: (1-3) LIBERO-O [PITH_FULL_IMAGE:figures/full_fig_p005_3.png] view at source ↗

**Figure 4.** Figure 4: Rollouts from diffusion policies sampling with Gaus [PITH_FULL_IMAGE:figures/full_fig_p005_4.png] view at source ↗

**Figure 5.** Figure 5: Comparison of task performance of the base policy (blue, left) and our approach using golden tickets (gold, right) [PITH_FULL_IMAGE:figures/full_fig_p006_5.png] view at source ↗

**Figure 6.** Figure 6: Comparison of DSRL vs. our proposed method in [PITH_FULL_IMAGE:figures/full_fig_p007_6.png] view at source ↗

**Figure 7.** Figure 7: Various tickets (pink) for the franka sim pick policy, evaluated according to success rate and speed (determined by length of successful episodes). Higher is better success rate, left is faster time to success. Because lottery tickets exhibit extreme differences in policy performance, a Pareto frontier is defined by tickets that are further left/up than others (represented with golden ticket icons). We not… view at source ↗

**Figure 8.** Figure 8: Base policy and golden ticket performance with 2 and [PITH_FULL_IMAGE:figures/full_fig_p013_8.png] view at source ↗

**Figure 9.** Figure 9: Base policy and golden ticket performance with 2 [PITH_FULL_IMAGE:figures/full_fig_p013_9.png] view at source ↗

read the original abstract

What happens when a pretrained generative robot policy is provided a constant initial noise as input, rather than repeatedly sampling it from a Gaussian? We demonstrate that the performance of a pretrained, frozen diffusion or flow matching policy can be improved with respect to a downstream reward by swapping the sampling of initial noise from the prior distribution (typically isotropic Gaussian) with a well-chosen, constant initial noise input -- a golden ticket. We propose a search method to find golden tickets using Monte-Carlo policy evaluation that keeps the pretrained policy frozen, does not train any new networks, and is applicable to all diffusion/flow matching policies (and therefore many VLAs). Our approach to policy improvement makes no assumptions beyond being able to inject initial noise into the policy and calculate (sparse) task rewards of episode rollouts, making it deployable with no additional infrastructure or models. Our method improves the performance of policies in 38 out of 43 tasks across simulated and real-world robot manipulation benchmarks, with relative improvements in success rate by up to 58% for some simulated tasks, and 60% within 50 search episodes for real-world tasks. We also show unique benefits of golden tickets for multi-task settings: the diversity of behaviors from different tickets naturally defines a Pareto frontier for balancing different objectives (e.g., speed, success rates); in VLAs, we find that a golden ticket optimized for one task can also boost performance in other related tasks. We release a codebase with pretrained policies and golden tickets for simulation benchmarks using VLAs, diffusion policies, and flow matching policies.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Desk Editor's Note private letter to a colleague

A single fixed noise vector found by search can lift success rates on frozen diffusion and flow-matching robot policies, but the reported gains may partly come from alignment with the specific search rollouts.

read the letter

The main takeaway is that you can often improve a pretrained, frozen generative policy on robot tasks by swapping random Gaussian noise for one constant vector chosen through search. The paper calls this a golden ticket and shows it works across diffusion and flow-matching policies, including VLAs, without any retraining or new models. They run Monte Carlo rollouts on the frozen policy, score the episodes with the task reward, and keep the best vector. This produces gains on 38 of 43 tasks, with relative success-rate lifts up to 58 percent in simulation and 60 percent in real-world settings after roughly 50 search episodes. The same tickets also let them trace Pareto fronts for multi-task tradeoffs and sometimes transfer across related tasks. Releasing the code, policies, and the tickets themselves makes the result easy to test directly. The approach is narrow but practical: it only needs rollout access and sparse rewards, so it fits existing deployed systems. The soft spot is the evaluation split. Because the search directly maximizes observed reward on the rollouts it sees, the chosen vector could be tuned to quirks in those particular trajectories rather than delivering a general improvement. The abstract does not state whether final numbers are measured on completely held-out episodes or on the same ones used during search. If the latter, the effect would shrink on fresh data. They also do not test how the fixed vector behaves under distribution shift in initial states or dynamics. This paper is for roboticists who already run generative policies and want a cheap way to squeeze out more performance. A reader working on manipulation benchmarks will find the numbers large enough to try the trick themselves. It is not a deep theoretical result, but the empirical pattern is clear enough to merit checking. I would send it for peer review so the controls on evaluation separation and generalization can be verified.

Referee Report

2 major / 2 minor

Summary. The manuscript claims that replacing the standard Gaussian-sampled initial noise in pretrained, frozen diffusion or flow-matching robot policies with a single fixed 'golden ticket' noise vector—found via Monte-Carlo search over episode rollouts—improves downstream task success rates without any policy training or additional models. The approach is reported to succeed on 38 of 43 tasks across simulation and real-world manipulation benchmarks, yielding relative success-rate gains up to 58% in simulation and 60% in real-world settings within 50 search episodes, while also enabling Pareto frontiers in multi-task settings and some cross-task transfer.

Significance. If the reported gains hold on independent episodes, the result would be significant: it supplies a training-free, infrastructure-minimal way to boost existing generative policies and VLAs using only rollout rewards. The explicit release of code, pretrained policies, and golden tickets for multiple policy classes strengthens reproducibility and practical utility. The multi-task Pareto observation is a useful byproduct for trading off objectives such as speed versus success.

major comments (2)

[Experiments] Evaluation procedure: the manuscript does not describe an explicit held-out episode set for final reporting after the Monte-Carlo search selects the golden ticket. Because the search directly maximizes observed rewards on the trajectories used for selection, the claimed improvements (38/43 tasks, up to 58% relative) could reflect selection of a noise vector that happens to align with the particular initial states or dynamics realizations present in the search rollouts rather than a general policy enhancement.
[Method] Search protocol details: the number of candidate noise vectors evaluated, the exact number of rollouts per candidate, and any mechanism to avoid overfitting the selected ticket to the search episodes are not specified with sufficient precision to allow independent verification of the reported gains.

minor comments (2)

[Abstract] The abstract states applicability to 'all diffusion/flow matching policies' but the experiments should explicitly list the precise conditions (e.g., noise-injection point in the denoising schedule) under which the method was tested.
Table or figure captions reporting success rates should include the baseline success rate alongside the relative improvement for immediate context.

Simulated Author's Rebuttal

2 responses · 0 unresolved

We thank the referee for their constructive comments, which have helped us improve the clarity and rigor of our manuscript. We address each major comment in detail below.

read point-by-point responses

Referee: [Experiments] Evaluation procedure: the manuscript does not describe an explicit held-out episode set for final reporting after the Monte-Carlo search selects the golden ticket. Because the search directly maximizes observed rewards on the trajectories used for selection, the claimed improvements (38/43 tasks, up to 58% relative) could reflect selection of a noise vector that happens to align with the particular initial states or dynamics realizations present in the search rollouts rather than a general policy enhancement.

Authors: We acknowledge the validity of this concern regarding potential overfitting. To address it, we will revise the manuscript to explicitly describe our use of a held-out evaluation set. Specifically, the golden ticket search is performed using Monte-Carlo rollouts on a designated search set of episodes, and all reported performance metrics are computed on a completely separate held-out test set of episodes. This protocol was followed in our experiments, and we will provide the sizes of these sets (e.g., 50 search episodes and 100 test episodes for simulation tasks) along with results on the held-out set to confirm the gains are general. We believe this clarification will resolve the issue. revision: yes
Referee: [Method] Search protocol details: the number of candidate noise vectors evaluated, the exact number of rollouts per candidate, and any mechanism to avoid overfitting the selected ticket to the search episodes are not specified with sufficient precision to allow independent verification of the reported gains.

Authors: We agree that the search protocol details were insufficiently specified. In the revised manuscript, we will provide precise details: we evaluate 100 candidate noise vectors, each using 5 rollouts to estimate the expected reward. To prevent overfitting to the search episodes, we incorporate a validation split within the search episodes, selecting the ticket that performs best on the validation portion. Additionally, we will include the exact search budget (up to 50 episodes for real-world tasks as mentioned) and pseudocode for the procedure. These additions will allow for independent verification of the results. revision: yes

Circularity Check

0 steps flagged

No circularity: empirical Monte-Carlo search on external rollouts

full rationale

The paper's central result is an empirical procedure that selects a fixed noise vector by direct Monte-Carlo evaluation of episode rewards on a pretrained frozen policy. Reported success rates are measured outcomes of those rollouts rather than quantities derived from fitted parameters, self-referential equations, or prior self-citations. No derivation chain exists that reduces the claimed improvement to its own inputs by construction; the method relies only on the ability to sample rollouts and compute task rewards, which are external to the selection process itself.

Axiom & Free-Parameter Ledger

0 free parameters · 1 axioms · 0 invented entities

The approach rests on the domain assumption that initial noise has a large and controllable effect on policy behavior and that Monte-Carlo search over a modest number of episodes can locate a better constant without retraining.

axioms (1)

domain assumption Initial noise input to a diffusion or flow-matching policy meaningfully affects the generated action sequence
Standard modeling assumption for these generative policies; invoked when the authors propose swapping the noise input.

pith-pipeline@v0.9.0 · 5613 in / 1221 out tokens · 45588 ms · 2026-05-15T09:50:08.193482+00:00 · methodology

discussion (0)

Lean theorems connected to this paper

Citations machine-checked in the Pith Canon. Every link opens the source theorem in the public Lean library.

IndisputableMonolith/Cost/FunctionalEquation.lean washburn_uniqueness_aczel unclear

?

unclear
Relation between the paper passage and the cited Recognition theorem.

We propose a search method to find golden tickets using Monte-Carlo policy evaluation that keeps the pretrained policy frozen, does not train any new networks...

What do these tags mean?

matches: The paper's claim is directly supported by a theorem in the formal canon.
supports: The theorem supports part of the paper's argument, but the paper may add assumptions or extra steps.
extends: The paper goes beyond the formal theorem; the theorem is a base layer rather than the whole result.
uses: The paper appears to rely on the theorem as machinery.
contradicts: The paper's claim conflicts with a theorem or certificate in the canon.
unclear: Pith found a possible connection, but the passage is too broad, indirect, or ambiguous to say the theorem truly supports the claim.

Forward citations

Cited by 1 Pith paper

Reviewed papers in the Pith corpus that reference this work. Sorted by Pith novelty score.

Unified Noise Steering for Efficient Human-Guided VLA Adaptation
cs.RO 2026-05 unverdicted novelty 6.0

UniSteer unifies human corrective actions and noise-space RL for VLA adaptation by inverting actions to noise targets, raising success rates from 20% to 90% in 66 minutes across four real-world manipulation tasks.

Reference graph

Works this paper leans on

53 extracted references · 53 canonical work pages · cited by 1 Pith paper · 13 internal anchors

[1]

Find: Fine-tuning initial noise distribution with policy optimization for diffusion models

Changgu Chen, Libing Yang, Xiaoyan Yang, Liang- gangxu Chen, Gaoqi He, Changbo Wang, and Yang Li. Find: Fine-tuning initial noise distribution with policy optimization for diffusion models. InProceedings of the 32nd ACM International Conference on Multimedia, pages 6735–6744, 2024

work page 2024
[2]

Diffusion policy: Visuomotor policy learning via action diffusion.The International Journal of Robotics Research, page 02783649241273668, 2023

Cheng Chi, Zhenjia Xu, Siyuan Feng, Eric Cousineau, Yilun Du, Benjamin Burchfiel, Russ Tedrake, and Shuran Song. Diffusion policy: Visuomotor policy learning via action diffusion.The International Journal of Robotics Research, page 02783649241273668, 2023

work page 2023
[3]

Dynaguide: Steering diffusion polices with active dynamic guidance.arXiv preprint arXiv:2506.13922, 2025

Maximilian Du and Shuran Song. Dynaguide: Steering diffusion polices with active dynamic guidance.arXiv preprint arXiv:2506.13922, 2025

work page arXiv 2025
[4]

The lottery ticket hypothesis: Finding sparse, trainable neural networks,

Jonathan Frankle and Michael Carbin. The lottery ticket hypothesis: Finding sparse, trainable neural networks,

work page
[5]

URL https://arxiv.org/abs/1803.03635

work page internal anchor Pith review Pith/arXiv arXiv
[6]

Vector quantization.IEEE Assp Magazine, 1(2):4–29, 1984

Robert Gray. Vector quantization.IEEE Assp Magazine, 1(2):4–29, 1984

work page 1984
[7]

Gaussian Error Linear Units (GELUs)

D Hendrycks. Gaussian error linear units (gelus).arXiv preprint arXiv:1606.08415, 2016

work page internal anchor Pith review Pith/arXiv arXiv 2016
[8]

Denoising diffusion probabilistic models.Advances in neural infor- mation processing systems, 33:6840–6851, 2020

Jonathan Ho, Ajay Jain, and Pieter Abbeel. Denoising diffusion probabilistic models.Advances in neural infor- mation processing systems, 33:6840–6851, 2020

work page 2020
[9]

Denoising Diffusion Probabilistic Models

Jonathan Ho, Ajay Jain, and Pieter Abbeel. Denoising diffusion probabilistic models, 2020. URL https://arxiv. org/abs/2006.11239

work page internal anchor Pith review Pith/arXiv arXiv 2020
[10]

LoRA: Low-Rank Adaptation of Large Language Models

Edward J. Hu, Yelong Shen, Phillip Wallis, Zeyuan Allen-Zhu, Yuanzhi Li, Shean Wang, and Weizhu Chen. Lora: Low-rank adaptation of large language models. CoRR, abs/2106.09685, 2021. URL https://arxiv.org/abs/ 2106.09685

work page internal anchor Pith review Pith/arXiv arXiv 2021
[11]

Physical Intelligence, Ali Amin, Raichelle Aniceto, Ash- win Balakrishna, Kevin Black, Ken Conley, Grace Con- nors, James Darpinian, Karan Dhabalia, Jared DiCarlo, Danny Driess, Michael Equi, Adnan Esmail, Yunhao Fang, Chelsea Finn, Catherine Glossop, Thomas God- den, Ivan Goryachev, Lachy Groom, Hunter Hancock, Karol Hausman, Gashon Hussein, Brian Ichter...

work page
[12]

URL https://arxiv.org/abs/2511.14759

work page internal anchor Pith review Pith/arXiv arXiv
[13]

Dexmimicgen: Automated data generation for bimanual dexterous manipulation via imitation learning

Zhenyu Jiang, Yuqi Xie, Kevin Lin, Zhenjia Xu, Weikang Wan, Ajay Mandlekar, Linxi Jim Fan, and Yuke Zhu. Dexmimicgen: Automated data generation for bimanual dexterous manipulation via imitation learning. In2025 IEEE International Conference on Robotics and Automa- tion (ICRA), pages 16923–16930. IEEE, 2025

work page 2025
[14]

Adam: A Method for Stochastic Optimization

Diederik P Kingma. Adam: A method for stochastic optimization.arXiv preprint arXiv:1412.6980, 2014

work page internal anchor Pith review Pith/arXiv arXiv 2014
[15]

Gr-rl: Going dexterous and precise for long-horizon robotic manipulation.arXiv preprint arXiv:2512.01801, 2025

Yunfei Li, Xiao Ma, Jiafeng Xu, Yu Cui, Zhongren Cui, Zhigang Han, Liqun Huang, Tao Kong, Yuxiao Liu, Hao Niu, et al. Gr-rl: Going dexterous and precise for long-horizon robotic manipulation.arXiv preprint arXiv:2512.01801, 2025

work page arXiv 2025
[16]

Yaron Lipman, Ricky T. Q. Chen, Heli Ben-Hamu, Maxi- milian Nickel, and Matt Le. Flow matching for generative modeling, 2023. URL https://arxiv.org/abs/2210.02747

work page internal anchor Pith review Pith/arXiv arXiv 2023
[17]

Libero: Benchmarking knowledge transfer for lifelong robot learning.Advances in Neural Information Processing Systems, 36:44776– 44791, 2023

Bo Liu, Yifeng Zhu, Chongkai Gao, Yihao Feng, Qiang Liu, Yuke Zhu, and Peter Stone. Libero: Benchmarking knowledge transfer for lifelong robot learning.Advances in Neural Information Processing Systems, 36:44776– 44791, 2023

work page 2023
[18]

Vla-rl: Towards masterful and general robotic manipulation with scalable reinforcement learning.arXiv preprint arXiv:2505.18719, 2025

Guanxing Lu, Wenkai Guo, Chubin Zhang, Yuheng Zhou, Haonan Jiang, Zifeng Gao, Yansong Tang, and Zi- wei Wang. Vla-rl: Towards masterful and general robotic manipulation with scalable reinforcement learning.arXiv preprint arXiv:2505.18719, 2025

work page arXiv 2025
[19]

Serl: A software suite for sample-efficient robotic reinforcement learning, 2024

Jianlan Luo, Zheyuan Hu, Charles Xu, You Liang Tan, Jacob Berg, Archit Sharma, Stefan Schaal, Chelsea Finn, Abhishek Gupta, and Sergey Levine. Serl: A software suite for sample-efficient robotic reinforcement learning, 2024

work page 2024
[20]

An empirical study of catastrophic forgetting in large language models during continual fine- tuning, 2025

Yun Luo, Zhen Yang, Fandong Meng, Yafu Li, Jie Zhou, and Yue Zhang. An empirical study of catastrophic forgetting in large language models during continual fine- tuning, 2025. URL https://arxiv.org/abs/2308.08747

work page arXiv 2025
[21]

What Matters in Learning from Offline Human Demonstrations for Robot Manipulation

Ajay Mandlekar, Danfei Xu, Josiah Wong, Soroush Nasiriany, Chen Wang, Rohun Kulkarni, Li Fei-Fei, Silvio Savarese, Yuke Zhu, and Roberto Mart ´ın-Mart´ın. What matters in learning from offline human demon- strations for robot manipulation.arXiv preprint arXiv:2108.03298, 2021

work page internal anchor Pith review Pith/arXiv arXiv 2021
[22]

The lottery ticket hypothesis in denoising: Towards semantic- driven initialization

Jiafeng Mao, Xueting Wang, and Kiyoharu Aizawa. The lottery ticket hypothesis in denoising: Towards semantic- driven initialization. InEuropean Conference on Com- puter Vision, pages 93–109. Springer, 2024

work page 2024
[23]

A minimalist method for fine- tuning text-to-image diffusion models.arXiv preprint arXiv:2506.12036, 2025

Yanting Miao, William Loh, Pacal Poupart, and Suraj Kothawade. A minimalist method for fine- tuning text-to-image diffusion models.arXiv preprint arXiv:2506.12036, 2025

work page arXiv 2025
[24]

Steering your generalists: Improving robotic foundation models via value guidance.arXiv preprint arXiv:2410.13816, 2024

Mitsuhiko Nakamoto, Oier Mees, Aviral Kumar, and Sergey Levine. Steering your generalists: Improving robotic foundation models via value guidance.arXiv preprint arXiv:2410.13816, 2024

work page arXiv 2024
[25]

Consistency policy: Accelerated visuomotor policies via consistency distillation.arXiv preprint arXiv:2405.07503, 2024

Aaditya Prasad, Kevin Lin, Jimmy Wu, Linqi Zhou, and Jeannette Bohg. Consistency policy: Accelerated visuomotor policies via consistency distillation.arXiv preprint arXiv:2405.07503, 2024

work page arXiv 2024
[26]

Pointnet: Deep learning on point sets for 3d classification and segmentation

Charles R Qi, Hao Su, Kaichun Mo, and Leonidas J Guibas. Pointnet: Deep learning on point sets for 3d classification and segmentation. InProceedings of the IEEE conference on computer vision and pattern recog- nition, pages 652–660, 2017

work page 2017
[27]

Not all noises are created equally: Diffusion noise selection and optimization.arXiv preprint arXiv:2407.14041, 2024

Zipeng Qi, Lichen Bai, Haoyi Xiong, and Zeke Xie. Not all noises are created equally: Diffusion noise selection and optimization.arXiv preprint arXiv:2407.14041, 2024

work page arXiv 2024
[28]

Diffusion policy policy optimization.arXiv preprint arXiv:2409.00588, 2024

Allen Z Ren, Justin Lidard, Lars L Ankile, Anthony Simeonov, Pulkit Agrawal, Anirudha Majumdar, Ben- jamin Burchfiel, Hongkai Dai, and Max Simchowitz. Diffusion policy policy optimization.arXiv preprint arXiv:2409.00588, 2024

work page arXiv 2024
[29]

Hoidini: Human-object interaction through diffusion noise optimization.arXiv preprint arXiv:2506.15625, 2025

Roey Ron, Guy Tevet, Haim Sawdayee, and Amit H Bermano. Hoidini: Human-object interaction through diffusion noise optimization.arXiv preprint arXiv:2506.15625, 2025

work page arXiv 2025
[30]

U- net: Convolutional networks for biomedical image seg- mentation

Olaf Ronneberger, Philipp Fischer, and Thomas Brox. U- net: Convolutional networks for biomedical image seg- mentation. InMedical image computing and computer- assisted intervention–MICCAI 2015: 18th international conference, Munich, Germany, October 5-9, 2015, pro- ceedings, part III 18, pages 234–241. Springer, 2015

work page 2015
[31]

Lightning-fast image inversion and editing for text-to-image diffusion models.arXiv preprint arXiv:2312.12540, 2023

Dvir Samuel, Barak Meiri, Haggai Maron, Yoad Tewel, Nir Darshan, Shai Avidan, Gal Chechik, and Rami Ben-Ari. Lightning-fast image inversion and editing for text-to-image diffusion models.arXiv preprint arXiv:2312.12540, 2023

work page arXiv 2023
[32]

Generating images of rare concepts using pre-trained diffusion models

Dvir Samuel, Rami Ben-Ari, Simon Raviv, Nir Darshan, and Gal Chechik. Generating images of rare concepts using pre-trained diffusion models. InProceedings of the AAAI Conference on Artificial Intelligence, volume 38, pages 4695–4703, 2024

work page 2024
[33]

SmolVLA: A Vision-Language-Action Model for Affordable and Efficient Robotics

Mustafa Shukor, Dana Aubakirova, Francesco Ca- puano, Pepijn Kooijmans, Steven Palma, Adil Zouitine, Michel Aractingi, Caroline Pascal, Martino Russi, An- dres Marafioti, et al. Smolvla: A vision-language-action model for affordable and efficient robotics.arXiv preprint arXiv:2506.01844, 2025

work page internal anchor Pith review Pith/arXiv arXiv 2025
[34]

Residual Policy Learning

Tom Silver, Kelsey Allen, Josh Tenenbaum, and Leslie Kaelbling. Residual policy learning.arXiv preprint arXiv:1812.06298, 2018

work page internal anchor Pith review Pith/arXiv arXiv 2018
[35]

Denoising Diffusion Implicit Models

Jiaming Song, Chenlin Meng, and Stefano Ermon. De- noising diffusion implicit models, 2022. URL https: //arxiv.org/abs/2010.02502

work page internal anchor Pith review Pith/arXiv arXiv 2022
[36]

Rfs: Reinforcement learning with residual flow steering for dexterous manipulation

Entong Su, Tyler Westenbroek, Anusha Nagabandi, and Abhishek Gupta. Rfs: Reinforcement learning with residual flow steering for dexterous manipulation. In The Fourteenth International Conference on Learning Representations, 2026

work page 2026
[37]

Efficientnet: Rethinking model scaling for convolutional neural networks

Mingxing Tan and Quoc Le. Efficientnet: Rethinking model scaling for convolutional neural networks. In International conference on machine learning, pages 6105–6114. PMLR, 2019

work page 2019
[38]

Inference- time alignment of diffusion models with direct noise optimization.arXiv preprint arXiv:2405.18881, 2024

Zhiwei Tang, Jiangweizhi Peng, Jiasheng Tang, Mingyi Hong, Fan Wang, and Tsung-Hui Chang. Inference- time alignment of diffusion models with direct noise optimization.arXiv preprint arXiv:2405.18881, 2024

work page arXiv 2024
[39]

Steering your diffusion policy with latent space reinforcement learning

Andrew Wagenmaker, Mitsuhiko Nakamoto, Yunchu Zhang, Seohong Park, Waleed Yagoub, Anusha Naga- bandi, Abhishek Gupta, and Sergey Levine. Steering your diffusion policy with latent space reinforcement learning. arXiv preprint arXiv:2506.15799, 2025

work page arXiv 2025
[40]

Inference-time policy steering through human interactions, 2025

Yanwei Wang, Lirui Wang, Yilun Du, Balakumar Sun- daralingam, Xuning Yang, Yu-Wei Chao, Claudia Perez- D’Arpino, Dieter Fox, and Julie Shah. Inference-time policy steering through human interactions, 2025. URL https://arxiv.org/abs/2411.16627

work page arXiv 2025
[41]

Inference-time policy steering through human interactions

Yanwei Wang, Lirui Wang, Yilun Du, Balakumar Sun- daralingam, Xuning Yang, Yu-Wei Chao, Claudia P ´erez- D’Arpino, Dieter Fox, and Julie Shah. Inference-time policy steering through human interactions. In2025 IEEE International Conference on Robotics and Automa- tion (ICRA), pages 15626–15633. IEEE, 2025

work page 2025
[42]

One-step diffusion policy: Fast visuomotor policies via diffusion distillation

Zhendong Wang, Zhaoshuo Li, Ajay Mandlekar, Zhenjia Xu, Jiaojiao Fan, Yashraj Narang, Linxi Fan, Yuke Zhu, Yogesh Balaji, Mingyuan Zhou, et al. One-step diffusion policy: Fast visuomotor policies via diffusion distillation. arXiv preprint arXiv:2410.21257, 2024

work page arXiv 2024
[43]

Simple statistical gradient-following algorithms for connectionist reinforcement learning.Ma- chine learning, 8(3):229–256, 1992

Ronald J Williams. Simple statistical gradient-following algorithms for connectionist reinforcement learning.Ma- chine learning, 8(3):229–256, 1992

work page 1992
[44]

ExpertGen: Scalable Sim-to-Real Expert Policy Learning from Imperfect Behavior Priors

Zifan Xu, Ran Gong, Maria Vittoria Minniti, Ahmet Salih Gundogdu, Eric Rosen, Kausik Sivakumar, Riedana Yan, Zixing Wang, Di Deng, Peter Stone, et al. Expertgen: Scalable sim-to-real expert policy learning from imper- fect behavior priors.arXiv preprint arXiv:2603.15956, 2026

work page internal anchor Pith review Pith/arXiv arXiv 2026
[45]

Learning Fine-Grained Bimanual Manipulation with Low-Cost Hardware

Tony Z Zhao, Vikash Kumar, Sergey Levine, and Chelsea Finn. Learning fine-grained bimanual manipulation with low-cost hardware.arXiv preprint arXiv:2304.13705, 2023

work page internal anchor Pith review Pith/arXiv arXiv 2023
[46]

background

Zikai Zhou, Shitong Shao, Lichen Bai, Shufei Zhang, Zhiqiang Xu, Bo Han, and Zeke Xie. Golden noise for diffusion models: A learning framework. InProceedings of the IEEE/CVF International Conference on Computer Vision, pages 17688–17697, 2025. APPENDIXA RELATEDWORK A. Robot Policy Improvement Methods One class of approaches to improving pretrained policie...

work page 2025
[47]

We collect1000demon- strations from our task-and-motion planning heuristic, and train4model checkpoints for100epochs, using a batch size of20

Flow matching policy in franka sim:We use a4layer MLP with GELU activations [6] as the non-linearities, and each layer has256hidden dimension. We collect1000demon- strations from our task-and-motion planning heuristic, and train4model checkpoints for100epochs, using a batch size of20. We use the Adam [12] optimizer with a learning rate of0.001. The policy...

work page
[48]

We use all default configurations for inference included with the model card

SmolVLA inLIBERO:We use the publicly re- leased SmolVLA model checkpoint that was finetuned for LIBERO https://huggingface.co/HuggingFaceVLA/smolvla libero, where all model and training details can be found. We use all default configurations for inference included with the model card. The policy takes in2RGB images, the low- dimensional state of the robot...

work page
[49]

We search for5000tickets, for100environments each

DPPO in robomimic:We use the publicly released checkpoints released from the original DPPO codebase (which were also used in the original DSRL experiments): https: //github.com/irom-princeton/dppo. We search for5000tickets, for100environments each. We evaluate on100episodes across5random seeds

work page
[50]

The policy takes in3RGB images and the end effector position, quaternion and gripper state for both arms

RGB diffusion policy in DexMimicGen:We use a diffu- sion policy with a U-Net backbone [28], which has a ResNet- 18 encoder for the RGB images, and an MLP for the robot proprioception data. The policy takes in3RGB images and the end effector position, quaternion and gripper state for both arms. We train a separate policy for each of the5tasks. We search fo...

work page
[51]

The Franka Research 3 arm is equipped with a Robotiq 2F-85 gripper

Franka hardware - RGB diffusion policy:We use Re- alSense D435 cameras for our static, external cameras, and D405 for the wrist camera. The Franka Research 3 arm is equipped with a Robotiq 2F-85 gripper. For the RGB policies, we use a standard diffusion policy architecture with a U-Net backbone and ResNet-18 architecture for the image encoders

work page
[52]

We use 2 calibrated extrinsic cameras to generate a single fused pointcloud, and remove all points that are at or below the surface of the table

Franka hardware - Pointcloud diffusion policy:For the pointcloud policies, we use the same U-Net backbone but instead use a PointNet encoder [24] for the pointcloud. We use 2 calibrated extrinsic cameras to generate a single fused pointcloud, and remove all points that are at or below the surface of the table. APPENDIXD RESULTS A. Real World Results We pr...

work page
[53]

assume DDIM sampling [33] since it is deterministic and therefore takes less samples to estimate the cumulative discounted expected rewards induced by an initial noise vector. DDIM sampling has been widely adopted in robotics as an alternative to DDPM [7] as it requires fewer sampling steps, although other techniques such as distillation [40, 23] have add...

work page