Diffusion Policy Policy Optimization
Recognition: 3 Lean theorem links
Pith reviewed 2026-05-16 08:44 UTC · model grok-4.3
The pith
DPPO fine-tunes diffusion-based policies with policy gradients to reach stronger performance than prior RL methods on robot tasks.
A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.
Core claim
DPPO is an algorithmic framework that fine-tunes diffusion-based policies using policy-gradient reinforcement learning. The method achieves the strongest overall performance and sample efficiency on common benchmarks compared with other RL algorithms for diffusion policies and compared with policy-gradient tuning of alternative policy parameterizations. It exploits synergies between the diffusion parameterization and policy-gradient updates that yield structured on-manifold exploration, stable training dynamics, and high policy robustness, as verified on simulated robotic tasks with pixel observations and via zero-shot transfer to physical robot hardware in long-horizon manipulation.
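To make the core claim concrete, here is a minimal, self-contained sketch (not the authors' implementation) of the idea it rests on: sample actions by iterative denoising, log every denoising transition, and update the denoising network with a clipped policy-gradient objective in which each denoising step shares the advantage computed from the environment-level return. The environment is collapsed to a toy bandit with reward -||a - goal||^2, there is no observation conditioning, and a fixed per-step noise scale stands in for a proper DDPM schedule; all names (Denoiser, rollout, GOAL, the hyperparameters) are hypothetical.

import torch
import torch.nn as nn

torch.manual_seed(0)
ACT_DIM, K, SIGMA, CLIP = 2, 5, 0.1, 0.2
GOAL = torch.tensor([0.5, -0.3])

class Denoiser(nn.Module):
    """Predicts the mean of the next (less noisy) action given the current one."""
    def __init__(self):
        super().__init__()
        self.net = nn.Sequential(nn.Linear(ACT_DIM + 1, 64), nn.ReLU(),
                                 nn.Linear(64, ACT_DIM))
    def forward(self, a, k):
        k_feat = torch.full_like(a[..., :1], float(k) / K)   # denoising-step index feature
        return a + self.net(torch.cat([a, k_feat], dim=-1))  # residual mean update

policy = Denoiser()
opt = torch.optim.Adam(policy.parameters(), lr=3e-4)

def rollout(batch=256):
    """Sample actions by iterative denoising, logging every denoising transition."""
    a = torch.randn(batch, ACT_DIM)                      # a^K ~ N(0, I)
    steps = []
    for k in range(K, 0, -1):
        mu = policy(a, k)
        a_next = mu + SIGMA * torch.randn_like(mu)       # stochastic denoising step
        steps.append((a.detach(), k, a_next.detach()))
        a = a_next
    reward = -((a - GOAL) ** 2).sum(-1)                  # env reward on the final action a^0
    return steps, reward.detach()

def log_prob(mu, a_next):
    return (-0.5 * ((a_next - mu) / SIGMA) ** 2).sum(-1)

for it in range(200):
    steps, reward = rollout()
    adv = (reward - reward.mean()) / (reward.std() + 1e-8)        # simple baseline
    with torch.no_grad():
        old_lp = [log_prob(policy(a, k), a_next) for a, k, a_next in steps]
    for _ in range(4):                                            # a few PPO-style epochs
        loss = 0.0
        for (a, k, a_next), lp_old in zip(steps, old_lp):
            ratio = torch.exp(log_prob(policy(a, k), a_next) - lp_old)
            loss = loss - torch.min(ratio * adv,
                                    ratio.clamp(1 - CLIP, 1 + CLIP) * adv).mean()
        opt.zero_grad(); loss.backward(); opt.step()

print("mean reward:", rollout()[1].mean().item())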
What carries the argument
The diffusion policy parameterization combined with policy-gradient updates, which produces structured on-manifold exploration and stable fine-tuning trajectories.
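A toy numerical illustration of the on-manifold point, using a hand-crafted denoiser whose targets lie on the unit circle as a stand-in for a learned model (nothing here is from the paper, and denoise_step, sample_on_manifold, and sample_isotropic are hypothetical names): noise injected inside the denoising chain is repeatedly pulled back toward the circle by later steps, while isotropic noise of comparable total scale added only to the final action is not.

import numpy as np

rng = np.random.default_rng(0)
K, STEP_SIGMA, N = 10, 0.1, 2000

def denoise_step(a):
    """Stand-in for a learned denoiser: move points partway toward the unit circle."""
    radial = a / np.linalg.norm(a, axis=-1, keepdims=True)
    return a + 0.5 * (radial - a)

def sample_on_manifold():
    a = rng.normal(size=(N, 2))
    for _ in range(K):
        a = denoise_step(a) + STEP_SIGMA * rng.normal(size=a.shape)   # noise inside the chain
    return a

def sample_isotropic():
    a = rng.normal(size=(N, 2))
    for _ in range(K):
        a = denoise_step(a)                                           # deterministic denoising
    total_sigma = STEP_SIGMA * np.sqrt(K)
    return a + total_sigma * rng.normal(size=a.shape)                 # noise only at the output

def distance_to_circle(a):
    return np.abs(np.linalg.norm(a, axis=-1) - 1.0).mean()

print("denoising-chain noise :", distance_to_circle(sample_on_manifold()))
print("output isotropic noise:", distance_to_circle(sample_isotropic()))

Running this should print a noticeably smaller mean distance to the circle for the denoising-chain samples, which is the structured-exploration behavior the claim refers to.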
If this is right
- Fine-tuning becomes practical for diffusion policies that were previously considered difficult to optimize with standard RL.
- Policies exhibit lower variance across random seeds, reducing the number of trials needed to reach reliable performance.
- On-manifold exploration improves sample efficiency in high-dimensional continuous action spaces typical of robotics.
- Zero-shot sim-to-real transfer succeeds for multi-stage manipulation tasks without additional real-world fine-tuning.
- Pixel-observation tasks become trainable end-to-end with the same diffusion-plus-gradient recipe.
Where Pith is reading between the lines
- The same diffusion-plus-gradient recipe may transfer to other generative policy classes such as flow-matching or score-based models.
- Combining DPPO with offline datasets could further boost sample efficiency by blending imitation and on-policy signals.
- The on-manifold property may reduce the sim-to-real gap in contact-rich tasks where action distributions must stay physically plausible.
Load-bearing premise
The observed performance and efficiency gains arise specifically from synergies between the diffusion parameterization and policy-gradient updates rather than from benchmark-specific hyperparameter choices or implementation details.
What would settle it
A controlled re-implementation in which a non-diffusion policy receives identical hyperparameter tuning, network capacity, and training budget yet matches or exceeds DPPO's benchmark scores would indicate the gains are not unique to the diffusion structure.
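A sketch of what such a controlled protocol could look like in code, with hypothetical names throughout (POLICY_CLASSES, sample_hparams, train_and_eval) and a deliberately unimplemented training call so no results are implied: every policy class draws from the same search space with the same number of trials, seeds, and training steps.

import random

POLICY_CLASSES = ["diffusion", "gaussian_mlp", "gmm"]   # hypothetical labels
SEARCH_TRIALS, SEEDS, TRAIN_STEPS = 20, 5, 1_000_000

def sample_hparams(rng):
    return {"lr": 10 ** rng.uniform(-5, -3),
            "batch_size": rng.choice([128, 256, 512]),
            "clip": rng.uniform(0.1, 0.3)}

def train_and_eval(policy_class, hparams, seed, steps):
    """Placeholder: a real study would train this policy and return its benchmark score."""
    raise NotImplementedError

def controlled_comparison():
    results = {}
    for policy_class in POLICY_CLASSES:
        rng = random.Random(0)               # identical search space and draws for every class
        best = float("-inf")
        for _ in range(SEARCH_TRIALS):
            hparams = sample_hparams(rng)
            scores = [train_and_eval(policy_class, hparams, s, TRAIN_STEPS)
                      for s in range(SEEDS)]
            best = max(best, sum(scores) / len(scores))
        results[policy_class] = best
    return results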
Original abstract
We introduce Diffusion Policy Policy Optimization, DPPO, an algorithmic framework including best practices for fine-tuning diffusion-based policies (e.g. Diffusion Policy) in continuous control and robot learning tasks using the policy gradient (PG) method from reinforcement learning (RL). PG methods are ubiquitous in training RL policies with other policy parameterizations; nevertheless, they had been conjectured to be less efficient for diffusion-based policies. Surprisingly, we show that DPPO achieves the strongest overall performance and efficiency for fine-tuning in common benchmarks compared to other RL methods for diffusion-based policies and also compared to PG fine-tuning of other policy parameterizations. Through experimental investigation, we find that DPPO takes advantage of unique synergies between RL fine-tuning and the diffusion parameterization, leading to structured and on-manifold exploration, stable training, and strong policy robustness. We further demonstrate the strengths of DPPO in a range of realistic settings, including simulated robotic tasks with pixel observations, and via zero-shot deployment of simulation-trained policies on robot hardware in a long-horizon, multi-stage manipulation task. Website with code: diffusion-ppo.github.io
Editorial analysis
A structured set of objections, weighed in public.
Referee Report
Summary. The manuscript introduces Diffusion Policy Policy Optimization (DPPO), a framework for fine-tuning diffusion-based policies (such as Diffusion Policy) in continuous control and robot learning using policy gradient (PG) methods. It reports that DPPO attains the strongest overall performance and efficiency on common benchmarks relative to other RL algorithms for diffusion policies and to PG fine-tuning of alternative policy parameterizations. The authors attribute these gains to unique synergies between the diffusion parameterization and PG updates that enable structured on-manifold exploration, stable training, and policy robustness. Results are shown on simulated robotic tasks with pixel observations and via zero-shot sim-to-real transfer on a long-horizon multi-stage manipulation task.
Significance. If the empirical superiority holds after equalizing hyperparameter search effort across all compared methods, the work would be significant for robot learning: it would demonstrate that PG fine-tuning is both viable and advantageous for diffusion policies, contrary to prior conjectures, and supply a practical recipe with hardware validation that could influence how diffusion policies are deployed in real-world manipulation.
Major comments (2)
- [Experiments] Experiments section: The headline claim that performance edges arise from 'unique synergies' between diffusion parameterization and PG updates is load-bearing, yet the manuscript provides no explicit statement of the hyperparameter search budget, number of trials, or optimizer settings allocated to each baseline (other RL methods on diffusion policies and PG on alternative parameterizations). Without this information, the reported gains cannot be confidently attributed to the claimed mechanism rather than unequal tuning or implementation quality.
- [Method] Method section (DPPO algorithm description): The precise modifications to the standard PG update—particularly how the denoising network is treated during the policy gradient step, whether noise schedules are frozen or annealed, and how the diffusion loss is combined with the RL objective—are not stated with sufficient equation-level detail to allow independent reproduction of the reported stability benefits.
Minor comments (2)
- [Abstract] The abstract refers to 'common benchmarks' without naming the specific environments or tasks; an explicit list (e.g., the MuJoCo or Meta-World suites used) would improve immediate readability.
- [Figures] Learning-curve figures would be clearer if they included shaded regions for standard deviation across random seeds rather than only mean curves (a small plotting sketch illustrating this follows below).
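Below is a minimal plotting sketch for the shaded-variance suggestion, assuming learning curves have already been logged as an array of shape (num_seeds, num_evals) at shared evaluation steps; the synthetic data at the bottom is only a stand-in for real logs, and plot_learning_curve is a hypothetical helper name.

import numpy as np
import matplotlib.pyplot as plt

def plot_learning_curve(steps, curves, label):
    """Mean learning curve with a +/- one standard deviation band across seeds."""
    mean = curves.mean(axis=0)
    std = curves.std(axis=0)
    plt.plot(steps, mean, label=label)
    plt.fill_between(steps, mean - std, mean + std, alpha=0.2)

# Synthetic stand-in for returns logged by 5 seeds at 50 shared evaluation points.
steps = np.linspace(0, 1e6, 50)
curves = np.random.default_rng(0).normal(loc=steps / 1e6, scale=0.1, size=(5, 50))
plot_learning_curve(steps, curves, label="DPPO (toy data)")
plt.xlabel("environment steps")
plt.ylabel("return")
plt.legend()
plt.show()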
Simulated Author's Rebuttal
We thank the referee for their constructive feedback. We agree that the manuscript requires additional details on hyperparameter tuning procedures and algorithmic specifications to support the claims and enable reproduction. We have revised the manuscript accordingly, as described in the point-by-point responses below.
Point-by-point responses
Referee: [Experiments] Experiments section: The headline claim that performance edges arise from 'unique synergies' between diffusion parameterization and PG updates is load-bearing, yet the manuscript provides no explicit statement of the hyperparameter search budget, number of trials, or optimizer settings allocated to each baseline (other RL methods on diffusion policies and PG on alternative parameterizations). Without this information, the reported gains cannot be confidently attributed to the claimed mechanism rather than unequal tuning or implementation quality.
Authors: We agree that explicit documentation of the hyperparameter search effort is required to substantiate the performance claims. In the revised manuscript, we have added a new subsection to the Experiments section along with an appendix that specifies the search budget, number of trials, optimizer settings, learning rates, batch sizes, and random seeds used for every baseline, including other RL methods applied to diffusion policies and PG fine-tuning of alternative parameterizations. These additions confirm that comparable tuning resources were allocated across methods. (Revision: yes)
Referee: [Method] Method section (DPPO algorithm description): The precise modifications to the standard PG update—particularly how the denoising network is treated during the policy gradient step, whether noise schedules are frozen or annealed, and how the diffusion loss is combined with the RL objective—are not stated with sufficient equation-level detail to allow independent reproduction of the reported stability benefits.
Authors: We thank the referee for highlighting the need for greater precision. The revised Method section now includes explicit equations detailing the policy gradient update applied to the denoising network parameters, the fixed noise schedule used during fine-tuning, and the combined objective that integrates the diffusion denoising loss with the RL term. Updated pseudocode has also been added to facilitate independent reproduction of the reported stability properties. (Revision: yes)
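For readers wanting a sense of what the requested equation-level detail amounts to, one standard way to write the per-step likelihood and the clipped surrogate for a diffusion policy is sketched below. This is a generic formulation consistent with the response, not the authors' exact equations: each denoising step is a Gaussian whose mean is produced by the denoising network, and the PPO ratio and clip are applied per denoising step.

\[
\pi_\theta\left(a^{k-1} \mid s, a^{k}\right)
  = \mathcal{N}\left(a^{k-1};\ \mu_\theta\left(s, a^{k}, k\right),\ \sigma_k^{2} I\right),
  \qquad k = K, \dots, 1,
\]
\[
r_k(\theta) = \frac{\pi_\theta\left(a^{k-1} \mid s, a^{k}\right)}
                   {\pi_{\theta_{\mathrm{old}}}\left(a^{k-1} \mid s, a^{k}\right)},
\qquad
L(\theta) = \mathbb{E}\left[\min\left(r_k(\theta)\,\hat{A},\
            \mathrm{clip}\left(r_k(\theta),\, 1-\epsilon,\, 1+\epsilon\right)\hat{A}\right)\right],
\]

where \(\hat{A}\) is the advantage estimated from environment-level returns and \(\{\sigma_k\}\) is the fixed noise schedule referenced in the response.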
Circularity Check
No circularity: empirical performance claims rest on external benchmarks
Full rationale
The paper presents DPPO as an algorithmic framework for RL fine-tuning of diffusion policies and supports its claims of superior performance and efficiency exclusively through experimental comparisons on standard benchmarks against other RL methods and policy parameterizations. No mathematical derivation, equation, or ansatz is offered whose validity reduces to a fitted parameter, self-citation chain, or input by construction. The central assertions are externally falsifiable via replication of the reported tasks, random seeds, and hyperparameter protocols; any self-citations to prior diffusion-policy work serve only as background and are not invoked to prove uniqueness or forbid alternatives. The result is therefore self-contained against external benchmarks.
Axiom & Free-Parameter Ledger
Lean theorems connected to this paper
- Cost.FunctionalEquation.washburn_uniqueness_aczel (tag: unclear; relation between the paper passage and the cited Recognition theorem)
  Passage: "We introduce Diffusion Policy Policy Optimization, DPPO, an algorithmic framework including best practices for fine-tuning diffusion-based policies (e.g. Diffusion Policy) in continuous control and robot learning tasks using the policy gradient (PG) method from reinforcement learning (RL)."
- Foundation.PhiForcing.phi_equation (tag: unclear; relation between the paper passage and the cited Recognition theorem)
  Passage: "Through experimental investigation, we find that DPPO takes advantage of unique synergies between RL fine-tuning and the diffusion parameterization, leading to structured and on-manifold exploration, stable training, and strong policy robustness."
- Foundation.DiscretenessForcing.continuous_no_isolated_zero_defect (tag: unclear; relation between the paper passage and the cited Recognition theorem)
  Passage: "PG methods are ubiquitous in training RL policies with other policy parameterizations; nevertheless, they had been conjectured to be less efficient for diffusion-based policies."
What do these tags mean?
- matches: The paper's claim is directly supported by a theorem in the formal canon.
- supports: The theorem supports part of the paper's argument, but the paper may add assumptions or extra steps.
- extends: The paper goes beyond the formal theorem; the theorem is a base layer rather than the whole result.
- uses: The paper appears to rely on the theorem as machinery.
- contradicts: The paper's claim conflicts with a theorem or certificate in the canon.
- unclear: Pith found a possible connection, but the passage is too broad, indirect, or ambiguous to say the theorem truly supports the claim.
Forward citations
Cited by 21 Pith papers
- BrickCraft: Visuomotor Skill Composition with Situated Manual Guidance for Long-Horizon Interlocking Brick Assembly
  BrickCraft composes reusable visuomotor skills via relative anchoring to partial structures and situated visual manuals to achieve long-horizon interlocking brick assembly from limited demonstrations with generalizati...
- ReflectDrive-2: Reinforcement-Learning-Aligned Self-Editing for Discrete Diffusion Driving
  ReflectDrive-2 achieves 91.0 PDMS on NAVSIM with camera input by training a discrete diffusion model to self-edit trajectories via RL-aligned AutoEdit.
- ScoRe-Flow: Complete Distributional Control via Score-Based Reinforcement Learning for Flow Matching
  ScoRe-Flow achieves decoupled mean-variance control in stochastic flow matching by deriving a closed-form score for drift modulation plus learned variance, yielding faster RL convergence and higher success rates on lo...
- You've Got a Golden Ticket: Improving Generative Robot Policies With A Single Noise Vector
  Optimizing a single constant initial noise vector for frozen generative robot policies improves success rates on 38 of 43 tasks by up to 58% relative improvement.
- ReCogDrive: A Reinforced Cognitive Framework for End-to-End Autonomous Driving
  ReCogDrive unifies VLM scene understanding with a diffusion planner reinforced by DiffGRPO to reach state-of-the-art results on NAVSIM and Bench2Drive benchmarks.
- Driving Intents Amplify Planning-Oriented Reinforcement Learning
  DIAL uses intent-conditioned CFG and multi-intent GRPO to expand and preserve diverse modes in continuous-action preference RL, lifting RFS to 9.14 and surpassing both prior best (8.5) and human demonstration (8.13).
- Long-Horizon Q-Learning: Accurate Value Learning via n-Step Inequalities
  LQL turns n-step action-sequence lower bounds into a practical hinge-loss stabilizer for off-policy Q-learning without extra networks or forward passes.
- Long-Horizon Q-Learning: Accurate Value Learning via n-Step Inequalities
  LQL stabilizes Q-learning by penalizing violations of n-step action-sequence lower bounds with a hinge loss computed from standard network outputs.
- ReflectDrive-2: Reinforcement-Learning-Aligned Self-Editing for Discrete Diffusion Driving
  ReflectDrive-2 combines masked discrete diffusion with RL-aligned self-editing to generate and refine driving trajectories, reaching 91.0 PDMS on NAVSIM camera-only and 94.8 in best-of-6.
- OGPO: Sample Efficient Full-Finetuning of Generative Control Policies
  OGPO is a sample-efficient off-policy method for full finetuning of generative control policies that reaches SOTA on robotic manipulation tasks and can recover from poor behavior-cloning initializations without expert data.
- One Step Forward and K Steps Back: Better Reasoning with Denoising Recursion Models
  Denoising Recursion Models train multi-step noise reversal in looped transformers and outperform the prior Tiny Recursion Model on ARC-AGI.
- ExpertGen: Scalable Sim-to-Real Expert Policy Learning from Imperfect Behavior Priors
  ExpertGen generates high-success expert policies in simulation from imperfect priors by freezing a diffusion behavior model and optimizing its initial noise via RL, then distills them for real-robot deployment.
- What Does Flow Matching Bring To TD Learning?
  Flow matching critics outperform monolithic ones in RL by 2x performance and 5x sample efficiency via test-time error recovery through integration and multi-point velocity supervision that preserves feature plasticity.
- Space Syntax-guided Post-training for Residential Floor Plan Generation
  SSPT turns space-syntax integration metrics into post-training feedback signals that improve public-space dominance and functional hierarchy in AI-generated residential floor plans.
- RL-RIG: A Generative Spatial Reasoner via Intrinsic Reflection
  RL-RIG uses a generate-reflect-edit loop with reinforcement learning to improve spatial accuracy in image generation, reporting up to 11% gains over prior open-source models on scene-graph metrics.
- How Does the Lagrangian Guide Safe Reinforcement Learning through Diffusion Models?
  ALGD augments the Lagrangian to locally convexify the energy landscape in diffusion models, stabilizing safe RL training and generation without changing optimal policies.
- DexVLA: Vision-Language Model with Plug-In Diffusion Expert for General Robot Control
  DexVLA combines a scaled diffusion action expert with embodiment curriculum learning to achieve better generalization and performance than prior VLA models on diverse robot hardware and long-horizon tasks.
- EponaV2: Driving World Model with Comprehensive Future Reasoning
  EponaV2 advances perception-free driving world models by forecasting comprehensive future 3D geometry and semantic representations, achieving SOTA planning performance on NAVSIM benchmarks.
- Driving Intents Amplify Planning-Oriented Reinforcement Learning
  DIAL expands continuous-action driving policies via intent-conditioned flow matching and multi-intent GRPO, lifting best-of-N preference scores above human demonstrations for the first time on WOD-E2E.
- StableIDM: Stabilizing Inverse Dynamics Model against Manipulator Truncation via Spatio-Temporal Refinement
  StableIDM stabilizes inverse dynamics models under manipulator truncation by combining robot-centric masking, directional spatial feature aggregation, and temporal dynamics refinement, yielding 12.1% higher strict act...
- Redefining End-of-Life: Intelligent Automation for Electronics Remanufacturing Systems
  A literature review of intelligent automation approaches using robotics, AI, and control for disassembly, inspection, sorting, and reprocessing of end-of-life electronics.