Greensmith, E., Bartlett, P. L., and Baxter, J. (2004). Variance reduction techniques for gradient estimates in reinforcement learning. Journal of Machine Learning Research, 5(Nov):1471–1530.
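The cited work studies how subtracting a baseline from the reward lowers the variance of score-function (REINFORCE) gradient estimates without biasing them. A minimal sketch of that effect, using a hypothetical one-parameter Bernoulli policy and a made-up two-entry reward table (both are illustrative assumptions, not from the paper):

```python
import random
import math
import statistics

def grad_samples(theta, baseline, n=20000, seed=0):
    """Score-function (REINFORCE) gradient samples for a Bernoulli policy.

    Policy: pi(a=1) = sigmoid(theta). The estimator (a - p) * (r(a) - baseline)
    keeps the same expectation for any constant baseline, but a baseline near
    E[r] typically shrinks the variance. Reward table is hypothetical.
    """
    rng = random.Random(seed)
    p = 1.0 / (1.0 + math.exp(-theta))
    reward = {0: 1.0, 1: 2.0}  # assumed rewards for illustration only
    samples = []
    for _ in range(n):
        a = 1 if rng.random() < p else 0
        # d/dtheta log pi(a) = a - p for the Bernoulli-sigmoid policy
        samples.append((a - p) * (reward[a] - baseline))
    return samples

plain = grad_samples(theta=0.3, baseline=0.0)
based = grad_samples(theta=0.3, baseline=1.5)  # baseline close to E[r]
print("means:", statistics.mean(plain), statistics.mean(based))
print("variance reduced:", statistics.variance(based) < statistics.variance(plain))
```

Both runs share a seed, so they see the same action sequence; the means agree up to sampling noise while the baselined estimator's variance is visibly smaller.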
2 Pith papers cite this work. Polarity classification is still indexing.
Pith papers citing it: 2
Citing years: 2026
Verdicts: 2 (unverdicted)
Representative citing papers: 2
Citing papers explorer
- FP4 Explore, BF16 Train: Diffusion Reinforcement Learning via Efficient Rollout Scaling
  Sol-RL decouples FP4-based candidate exploration from BF16 policy optimization in diffusion RL, delivering up to 4.64x faster convergence while matching or exceeding alignment performance on models such as FLUX.1 and SD3.5.
- Model-Driven Policy Optimization in Differentiable Simulators via Stochastic Exploration
  MDPO improves differentiable planning by injecting gradient-sensitivity-adapted noise into the action space, outperforming both deterministic variants and PPO on nonlinear and hybrid benchmarks.