Q-learning with adjoint matching

· 2026 · cs.LG · arXiv 2601.14234

10 Pith papers cite this work. Polarity classification is still indexing.

10 Pith papers citing it

open full Pith review browse 10 citing papers arXiv PDF

abstract

We propose Q-learning with Adjoint Matching (QAM), a novel TD-based reinforcement learning (RL) algorithm that tackles a long-standing challenge in continuous-action RL: efficient optimization of an expressive diffusion or flow-matching policy with respect to a parameterized Q-function. Effective optimization requires exploiting the first-order information of the critic, but it is challenging to do so for flow or diffusion policies because direct gradient-based optimization via backpropagation through their multi-step denoising process is numerically unstable. Existing methods work around this either by only using the value and discarding the gradient information, or by relying on approximations that sacrifice policy expressivity or bias the learned policy. QAM sidesteps both of these challenges by leveraging adjoint matching, a recently proposed technique in generative modeling, which transforms the critic's action gradient to form a step-wise objective function that is free from unstable backpropagation, while providing an unbiased, expressive policy at the optimum. Combined with temporal-difference backup for critic learning, QAM consistently outperforms prior approaches on hard, sparse reward tasks in both offline and offline-to-online RL.

citation-role summary

background 1 method 1

citation-polarity summary

background 1 use method 1

representative citing papers

Test-Time Gradient Guidance of Flow Policies in Reinforcement Learning

cs.LG · 2026-06-09 · unverdicted · novelty 7.0

QGF performs test-time policy optimization for flow models in RL by guiding a behavior-cloned reference policy with value-function gradients, achieving strong results on high-dimensional offline RL benchmarks without additional policy training.

Reinforcement Learning for Flow-Matching Policies with Density Transport

cs.LG · 2026-06-07 · unverdicted · novelty 7.0

RLDT fine-tunes pretrained flow-matching policies for continuous control by aligning them to a max-entropy RL transport field constructed via SVGD, using expected-target estimation for stable multi-step updates.

Q-VGM: Q-Guided Value-Gradient Matching for Flow-Matching VLA Policies

cs.RO · 2026-06-06 · unverdicted · novelty 7.0

Q-VGM introduces value-gradient matching via VGG-Flow to improve flow-matching VLA policies with a Cal-QL critic, achieving success rate lifts on LIBERO, RoboTwin, and real-robot tasks.

Aligning Few-Step Generative Models by Amortizing Sample-based Variational Inference

cs.LG · 2026-05-26 · unverdicted · novelty 7.0

FAV aligns few-step generative models by amortizing SVGD updates from reward-tilted sampling into generator parameters via fixed-point regression, requiring only sample access, and shows outperformance on robotics tasks plus scaling on image generators.

Reinforcement Learning via Value Gradient Flow

cs.LG · 2026-04-15 · unverdicted · novelty 7.0

VGF solves behavior-regularized RL by transporting particles from a reference distribution to the value-induced optimal policy via discrete value-guided gradient flow.

Reversal Q-Learning

cs.LG · 2026-06-16 · unverdicted · novelty 6.0

Reversal Q-Learning (RQL) proposes reversing flows for virtual trajectories and bias-variance reduction in an expanded MDP to train flow policies, reporting best average performance on 50 simulated robotic tasks versus prior flow-based offline RL methods.

DiPOD: Diffusion Policy Optimization without Drifting Apart

cs.LG · 2026-06-11 · unverdicted · novelty 6.0

DiPOD stabilizes diffusion policy optimization by interleaving self-distillation with gradient updates via an on-policy ELBO regularizer, yielding more stable training and higher rewards than prior methods.

Learning While Deploying: Fleet-Scale Reinforcement Learning for Generalist Robot Policies

cs.RO · 2026-05-01 · unverdicted · novelty 6.0 · 2 refs

LWD is a fleet-scale offline-to-online RL framework that continually improves pretrained VLA policies using autonomous rollouts and human interventions, reaching 95% average success on real-world manipulation tasks.

FASTER: Value-Guided Sampling for Fast RL

cs.LG · 2026-04-21 · unverdicted · novelty 6.0

FASTER models multi-candidate denoising as an MDP and trains a value function to filter actions early, delivering the performance of full sampling at lower cost in diffusion RL policies.

Scaling World-Model Reinforcement Learning Through Diffusion Policy Optimization

cs.LG · 2026-05-25 · unverdicted · novelty 5.0

MBDPO reformulates policy optimization as a diffusion process over searched trajectories in latent world models to reduce misalignment between search and value learning.

citing papers explorer

Showing 0 of 0 citing papers after filters.

No citing papers match the current filters.

Q-learning with adjoint matching

citation-role summary

citation-polarity summary

fields

years

verdicts

roles

polarities

representative citing papers

citing papers explorer