pith. sign in

Q-learning with adjoint matching

10 Pith papers cite this work. Polarity classification is still indexing.

10 Pith papers citing it
abstract

We propose Q-learning with Adjoint Matching (QAM), a novel TD-based reinforcement learning (RL) algorithm that tackles a long-standing challenge in continuous-action RL: efficient optimization of an expressive diffusion or flow-matching policy with respect to a parameterized Q-function. Effective optimization requires exploiting the first-order information of the critic, but it is challenging to do so for flow or diffusion policies because direct gradient-based optimization via backpropagation through their multi-step denoising process is numerically unstable. Existing methods work around this either by only using the value and discarding the gradient information, or by relying on approximations that sacrifice policy expressivity or bias the learned policy. QAM sidesteps both of these challenges by leveraging adjoint matching, a recently proposed technique in generative modeling, which transforms the critic's action gradient to form a step-wise objective function that is free from unstable backpropagation, while providing an unbiased, expressive policy at the optimum. Combined with temporal-difference backup for critic learning, QAM consistently outperforms prior approaches on hard, sparse reward tasks in both offline and offline-to-online RL.

citation-role summary

background 1 method 1

citation-polarity summary

fields

cs.LG 8 cs.RO 2

years

2026 10

verdicts

UNVERDICTED 10

clear filters

representative citing papers

Test-Time Gradient Guidance of Flow Policies in Reinforcement Learning

cs.LG · 2026-06-09 · unverdicted · novelty 7.0

QGF performs test-time policy optimization for flow models in RL by guiding a behavior-cloned reference policy with value-function gradients, achieving strong results on high-dimensional offline RL benchmarks without additional policy training.

Reinforcement Learning via Value Gradient Flow

cs.LG · 2026-04-15 · unverdicted · novelty 7.0

VGF solves behavior-regularized RL by transporting particles from a reference distribution to the value-induced optimal policy via discrete value-guided gradient flow.

Reversal Q-Learning

cs.LG · 2026-06-16 · unverdicted · novelty 6.0

Reversal Q-Learning (RQL) proposes reversing flows for virtual trajectories and bias-variance reduction in an expanded MDP to train flow policies, reporting best average performance on 50 simulated robotic tasks versus prior flow-based offline RL methods.

DiPOD: Diffusion Policy Optimization without Drifting Apart

cs.LG · 2026-06-11 · unverdicted · novelty 6.0

DiPOD stabilizes diffusion policy optimization by interleaving self-distillation with gradient updates via an on-policy ELBO regularizer, yielding more stable training and higher rewards than prior methods.

FASTER: Value-Guided Sampling for Fast RL

cs.LG · 2026-04-21 · unverdicted · novelty 6.0

FASTER models multi-candidate denoising as an MDP and trains a value function to filter actions early, delivering the performance of full sampling at lower cost in diffusion RL policies.

citing papers explorer

Showing 0 of 0 citing papers after filters.

No citing papers match the current filters.