Flow-based single-step completion for efficient and expressive policy learning. arXiv preprint arXiv:2506.21427, 2025.
3 Pith papers cite this work. Polarity classification is still indexing.
fields: cs.LG (3)
years: 2026 (3)
verdicts: UNVERDICTED (3)
citing papers explorer
- Test-Time Gradient Guidance of Flow Policies in Reinforcement Learning
  QGF performs test-time policy optimization for flow models in RL by guiding a behavior-cloned reference policy with value-function gradients, achieving strong results on high-dimensional offline RL benchmarks without additional policy training.
- Reinforcement Learning for Flow-Matching Policies with Density Transport
  RLDT fine-tunes pretrained flow-matching policies for continuous control by aligning them to a max-entropy RL transport field constructed via SVGD, using expected-target estimation for stable multi-step updates.
- Drift Q-Learning
  DriftQL is a single-pass offline RL algorithm using drift regularization that outperforms diffusion and flow policies on standard benchmarks.