Advantage-Weighted Regression: Simple and Scalable Off-Policy Reinforcement Learning

Xue Bin Peng, Aviral Kumar, Grace Zhang, Sergey Levine · 2019 · cs.LG · arXiv 1910.00177

Mixed citation behavior. Most common role is background (65%).

82 Pith papers citing it

Background 65% of classified citations

abstract

In this paper, we aim to develop a simple and scalable reinforcement learning algorithm that uses standard supervised learning methods as subroutines. Our goal is an algorithm that utilizes only simple and convergent maximum likelihood loss functions, while also being able to leverage off-policy data. Our proposed approach, which we refer to as advantage-weighted regression (AWR), consists of two standard supervised learning steps: one to regress onto target values for a value function, and another to regress onto weighted target actions for the policy. The method is simple and general, can accommodate continuous and discrete actions, and can be implemented in just a few lines of code on top of standard supervised learning methods. We provide a theoretical motivation for AWR and analyze its properties when incorporating off-policy data from experience replay. We evaluate AWR on a suite of standard OpenAI Gym benchmark tasks, and show that it achieves competitive performance compared to a number of well-established state-of-the-art RL algorithms. AWR is also able to acquire more effective policies than most off-policy algorithms when learning from purely static datasets with no additional environmental interactions. Furthermore, we demonstrate our algorithm on challenging continuous control tasks with highly complex simulated characters.
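The two supervised steps described in the abstract can be sketched roughly as follows. This is a minimal illustration, not the authors' implementation: the function name `awr_iteration`, the linear models, the precomputed `returns`, and the default values of `beta` and `max_weight` are all assumptions made for the example. The exponential weighting exp(advantage / beta) follows the policy objective in the AWR paper.

```python
import numpy as np


def awr_iteration(states, actions, returns, beta=1.0, max_weight=20.0):
    """One AWR-style update using linear models (illustrative sketch only).

    states:  (N, state_dim) array
    actions: (N, action_dim) array
    returns: (N,) array of precomputed return estimates (assumed given)
    """
    # Append a bias column so both regressions can learn an offset.
    X = np.hstack([states, np.ones((states.shape[0], 1))])

    # Step 1: value regression. Least-squares fit of V(s) to the returns.
    v_params, *_ = np.linalg.lstsq(X, returns, rcond=None)
    values = X @ v_params

    # Advantage estimates and exponential weights. The exponent is capped
    # so that no weight exceeds max_weight (hypothetical default).
    advantages = returns - values
    weights = np.exp(np.minimum(advantages / beta, np.log(max_weight)))

    # Step 2: policy regression. For a fixed-variance Gaussian policy,
    # weighted maximum likelihood of the mean is weighted least squares.
    sqrt_w = np.sqrt(weights)[:, None]
    pi_params, *_ = np.linalg.lstsq(X * sqrt_w, actions * sqrt_w, rcond=None)

    return v_params, pi_params, weights


# Usage on synthetic data (random values, for shape checking only).
rng = np.random.default_rng(0)
S = rng.normal(size=(64, 3))
A = rng.normal(size=(64, 2))
R = rng.normal(size=64)
v_params, pi_params, weights = awr_iteration(S, A, R)
```

Both steps reduce to ordinary (weighted) regression here, which is the point the abstract makes about building on standard supervised learning; a neural-network version would replace each `lstsq` call with a few gradient steps on the same losses.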

citation-role summary

background: 11 · method: 6
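
The 65% background share reported above appears to be computed over the 17 classified citations in this summary rather than over all 82 citing papers. That reading is an inference from the counts, not something the page states:

```latex
\frac{11}{11 + 6} = \frac{11}{17} \approx 0.647 \approx 65\%
```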

citation-polarity summary

background: 11 · use method: 5 · unclear: 1

representative citing papers

Flow-GRPO: Training Flow Matching Models via Online RL

cs.CV · 2025-05-08 · unverdicted · novelty 8.0

Flow-GRPO is the first online RL method for flow matching models, raising GenEval accuracy from 63% to 95% and text-rendering accuracy from 59% to 92% with little reward hacking.

Offline Reinforcement Learning with Implicit Q-Learning

cs.LG · 2021-10-12 · unverdicted · novelty 8.0

IQL achieves policy improvement in offline RL by implicitly estimating optimal action values through state-conditional upper expectiles of value functions, without querying Q-functions on out-of-distribution actions.

Decision Transformer: Reinforcement Learning via Sequence Modeling

cs.LG · 2021-06-02 · accept · novelty 8.0

Decision Transformer casts RL as autoregressive sequence modeling conditioned on desired returns, past states and actions, matching or exceeding offline RL baselines on Atari, Gym and Key-to-Door tasks.

D4RL: Datasets for Deep Data-Driven Reinforcement Learning

cs.LG · 2020-04-15 · accept · novelty 8.0

D4RL supplies new offline RL benchmarks and datasets from expert and mixed sources to expose weaknesses in existing algorithms and standardize evaluation.

Dual Advantage Fields

cs.LG · 2026-06-02 · conditional · novelty 7.0

Dual Advantage Fields converts bilinear dual value models into local advantage scores via learned action-effect models, equaling the goal-conditioned Bellman advantage under realizability and improving aggregate metrics on OGBench locomotion, manipulation, and puzzle tasks.

Explicit Critic Guidance for Aligning Diffusion Models

cs.LG · 2026-05-26 · unverdicted · novelty 7.0

Introduces a state-aligned latent actor-critic framework that lets diffusion models act as their own timestep-conditioned value functions for trajectory-level RL post-training and inference steering.

Aligning Few-Step Generative Models by Amortizing Sample-based Variational Inference

cs.LG · 2026-05-26 · unverdicted · novelty 7.0

FAV aligns few-step generative models by amortizing SVGD updates from reward-tilted sampling into generator parameters via fixed-point regression, requiring only sample access; it reports improved performance on robotics tasks and scaling to image generators.

RankE: End-to-End Post-Training for Discrete Text-to-Image Generation with Decoder Co-Evolution

cs.CV · 2026-05-20 · conditional · novelty 7.0

RankE co-evolves AR policy and decoder via alternating ranking optimization, improving both FID and CLIP scores on LlamaGen-XL and Janus-Pro where policy-only RL degrades FID.

Compositional Transduction with Latent Analogies for Offline Goal-Conditioned Reinforcement Learning

cs.LG · 2026-05-20 · unverdicted · novelty 7.0

Proposes latent analogies and analogy transduction to enable compositional generalization to unseen goal-context pairs in offline GCRL, outperforming trajectory-stitching baselines on manipulation tasks.

Learning from Language Feedback via Variational Policy Distillation

cs.LG · 2026-05-14 · unverdicted · novelty 7.0

VPD frames language feedback learning as variational EM so the teacher policy refines itself via trust-region updates on outcomes while the student learns dense token distributions on its own rollouts, outperforming fixed-teacher baselines on reasoning and code tasks.

Peng's Q($\lambda$) for Conservative Value Estimation in Offline Reinforcement Learning

cs.LG · 2026-05-14 · unverdicted · novelty 7.0

CPQL adapts the multi-step Peng's Q(λ) operator for conservative offline value estimation, achieving performance guarantees and empirical gains over single-step baselines on D4RL while supporting offline-to-online fine-tuning.

Switching Successor Measures for Hierarchical Zero-shot Reinforcement Learning

cs.LG · 2026-05-13 · unverdicted · novelty 7.0

Switching successor measures extend classical successor measures to enable hierarchical zero-shot RL via the FB π-Switch algorithm that extracts subgoal-selection and control policies from forward-backward representations.

Block-R1: Rethinking the Role of Block Size in Multi-domain Reinforcement Learning for Diffusion Large Language Models

cs.LG · 2026-05-12 · unverdicted · novelty 7.0 · 2 refs

Introduces Block-R1 benchmark, Block-R1-41K dataset, and a conflict score to handle domain-specific optimal block sizes in RL post-training of diffusion LLMs.

Active Learning for Gaussian Process Regression Under Self-Induced Boltzmann Weights

cs.LG · 2026-05-11 · unverdicted · novelty 7.0

AB-SID-iVAR enables Gaussian process active learning for self-induced Boltzmann distributions by closed-form approximation of the target, with high-probability error vanishing guarantees and empirical gains on PES and drug discovery tasks.

Reference-Sampled Boltzmann Projection for KL-Regularized RLVR: Target-Matched Weighted SFT, Finite One-Shot Gaps, and Policy Mirror Descent

cs.LG · 2026-05-04 · unverdicted · novelty 7.0

Reference-sampled weighted SFT with prompt-normalized Boltzmann weights induces the same policy as fixed-reference KL-regularized RLVR, with BOLT as the estimator and a finite one-shot error decomposition separating coverage, variance, and other terms.

Improving Text-to-Image Generation with Intrinsic Self-Confidence Rewards

cs.CV · 2026-03-01 · unverdicted · novelty 7.0

SOLACE improves text-to-image generation by using intrinsic self-confidence rewards from noise reconstruction accuracy during reinforcement learning post-training without external supervision.

Test-time Offline Reinforcement Learning on Goal-related Experience

cs.LG · 2025-07-24 · unverdicted · novelty 7.0

GC-TTT adapts goal-conditioned policies at test time by fine-tuning on goal-related offline data selected in a self-supervised manner, yielding performance gains in loco-navigation and manipulation tasks.

Group-in-Group Policy Optimization for LLM Agent Training

cs.LG · 2025-05-16 · unverdicted · novelty 7.0

GiGPO adds a hierarchical grouping mechanism to group-based RL so that LLM agents receive both global trajectory and local step-level credit signals, yielding >12% gains on ALFWorld and >9% on WebShop over GRPO while keeping the same rollout and memory footprint.

KTO: Model Alignment as Prospect Theoretic Optimization

cs.LG · 2024-02-02 · conditional · novelty 7.0

KTO aligns LLMs by directly maximizing prospect-theoretic utility on binary signals and matches or exceeds preference-based methods like DPO from 1B to 30B parameters.

Direct Preference Optimization: Your Language Model is Secretly a Reward Model

cs.LG · 2023-05-29 · accept · novelty 7.0

DPO derives the optimal policy directly from human preferences via a reparameterized reward model, solving the RLHF objective with only a binary classification loss and no sampling or separate reward model.

VIP: Towards Universal Visual Reward and Representation via Value-Implicit Pre-Training

cs.RO · 2022-09-30 · unverdicted · novelty 7.0

VIP learns a visual embedding from human videos whose distance defines dense, smooth rewards for arbitrary goal-image robot tasks without task-specific fine-tuning.

Diffusion Policies as an Expressive Policy Class for Offline Reinforcement Learning

cs.LG · 2022-08-12 · unverdicted · novelty 7.0

Diffusion-QL uses conditional diffusion models as expressive policies in offline RL by coupling behavior cloning with Q-value maximization, achieving SOTA on most D4RL tasks.

Freeform Preference Learning for Robotic Manipulation

cs.RO · 2026-06-30 · unverdicted · novelty 6.0

Freeform Preference Learning trains language-conditioned multi-axis reward models from human pairwise preferences to produce steerable and compositional robot policies that outperform sparse and binary-preference baselines by 38 percentage points.

STEAM: Self-Supervised Temporal Ensemble Advantage Modeling for Real-World Robot Learning

cs.RO · 2026-06-29 · unverdicted · novelty 6.0

STEAM learns advantages from expert trajectories via self-supervised temporal ensemble modeling to improve policy learning on real robot tasks like bimanual folding and pick-and-place.

citing papers explorer

Showing 50 of 56 citing papers after filters.

Offline Reinforcement Learning with Implicit Q-Learning cs.LG · 2021-10-12 · unverdicted · none · ref 10 · internal anchor
IQL achieves policy improvement in offline RL by implicitly estimating optimal action values through state-conditional upper expectiles of value functions, without querying Q-functions on out-of-distribution actions.
Decision Transformer: Reinforcement Learning via Sequence Modeling cs.LG · 2021-06-02 · accept · none · ref 24 · internal anchor
Decision Transformer casts RL as autoregressive sequence modeling conditioned on desired returns, past states and actions, matching or exceeding offline RL baselines on Atari, Gym and Key-to-Door tasks.
D4RL: Datasets for Deep Data-Driven Reinforcement Learning cs.LG · 2020-04-15 · accept · none · ref 18 · internal anchor
D4RL supplies new offline RL benchmarks and datasets from expert and mixed sources to expose weaknesses in existing algorithms and standardize evaluation.
Dual Advantage Fields cs.LG · 2026-06-02 · conditional · none · ref 23 · internal anchor
Dual Advantage Fields converts bilinear dual value models into local advantage scores via learned action-effect models, equaling the goal-conditioned Bellman advantage under realizability and improving aggregate metrics on OGBench locomotion, manipulation, and puzzle tasks.
Explicit Critic Guidance for Aligning Diffusion Models cs.LG · 2026-05-26 · unverdicted · none · ref 56 · internal anchor
Introduces a state-aligned latent actor-critic framework that lets diffusion models act as their own timestep-conditioned value functions for trajectory-level RL post-training and inference steering.
Aligning Few-Step Generative Models by Amortizing Sample-based Variational Inference cs.LG · 2026-05-26 · unverdicted · none · ref 75 · internal anchor
FAV aligns few-step generative models by amortizing SVGD updates from reward-tilted sampling into generator parameters via fixed-point regression, requiring only sample access; it reports improved performance on robotics tasks and scaling to image generators.
Compositional Transduction with Latent Analogies for Offline Goal-Conditioned Reinforcement Learning cs.LG · 2026-05-20 · unverdicted · none · ref 6 · internal anchor
Proposes latent analogies and analogy transduction to enable compositional generalization to unseen goal-context pairs in offline GCRL, outperforming trajectory-stitching baselines on manipulation tasks.
Learning from Language Feedback via Variational Policy Distillation cs.LG · 2026-05-14 · unverdicted · none · ref 25 · internal anchor
VPD frames language feedback learning as variational EM so the teacher policy refines itself via trust-region updates on outcomes while the student learns dense token distributions on its own rollouts, outperforming fixed-teacher baselines on reasoning and code tasks.
Peng's Q($\lambda$) for Conservative Value Estimation in Offline Reinforcement Learning cs.LG · 2026-05-14 · unverdicted · none · ref 12 · internal anchor
CPQL adapts the multi-step Peng's Q(λ) operator for conservative offline value estimation, achieving performance guarantees and empirical gains over single-step baselines on D4RL while supporting offline-to-online fine-tuning.
Switching Successor Measures for Hierarchical Zero-shot Reinforcement Learning cs.LG · 2026-05-13 · unverdicted · none · ref 45 · internal anchor
Switching successor measures extend classical successor measures to enable hierarchical zero-shot RL via the FB π-Switch algorithm that extracts subgoal-selection and control policies from forward-backward representations.
Block-R1: Rethinking the Role of Block Size in Multi-domain Reinforcement Learning for Diffusion Large Language Models cs.LG · 2026-05-12 · unverdicted · none · ref 42 · 2 links · internal anchor
Introduces Block-R1 benchmark, Block-R1-41K dataset, and a conflict score to handle domain-specific optimal block sizes in RL post-training of diffusion LLMs.
Active Learning for Gaussian Process Regression Under Self-Induced Boltzmann Weights cs.LG · 2026-05-11 · unverdicted · none · ref 50 · internal anchor
AB-SID-iVAR enables Gaussian process active learning for self-induced Boltzmann distributions by closed-form approximation of the target, with high-probability error vanishing guarantees and empirical gains on PES and drug discovery tasks.
Reference-Sampled Boltzmann Projection for KL-Regularized RLVR: Target-Matched Weighted SFT, Finite One-Shot Gaps, and Policy Mirror Descent cs.LG · 2026-05-04 · unverdicted · none · ref 35 · internal anchor
Reference-sampled weighted SFT with prompt-normalized Boltzmann weights induces the same policy as fixed-reference KL-regularized RLVR, with BOLT as the estimator and a finite one-shot error decomposition separating coverage, variance, and other terms.
Test-time Offline Reinforcement Learning on Goal-related Experience cs.LG · 2025-07-24 · unverdicted · none · ref 7 · internal anchor
GC-TTT adapts goal-conditioned policies at test time by fine-tuning on goal-related offline data selected in a self-supervised manner, yielding performance gains in loco-navigation and manipulation tasks.
Group-in-Group Policy Optimization for LLM Agent Training cs.LG · 2025-05-16 · unverdicted · none · ref 43 · internal anchor
GiGPO adds a hierarchical grouping mechanism to group-based RL so that LLM agents receive both global trajectory and local step-level credit signals, yielding >12% gains on ALFWorld and >9% on WebShop over GRPO while keeping the same rollout and memory footprint.
KTO: Model Alignment as Prospect Theoretic Optimization cs.LG · 2024-02-02 · conditional · none · ref 14 · internal anchor
KTO aligns LLMs by directly maximizing prospect-theoretic utility on binary signals and matches or exceeds preference-based methods like DPO from 1B to 30B parameters.
Direct Preference Optimization: Your Language Model is Secretly a Reward Model cs.LG · 2023-05-29 · accept · none · ref 32 · internal anchor
DPO derives the optimal policy directly from human preferences via a reparameterized reward model, solving the RLHF objective with only a binary classification loss and no sampling or separate reward model.
Diffusion Policies as an Expressive Policy Class for Offline Reinforcement Learning cs.LG · 2022-08-12 · unverdicted · none · ref 13 · internal anchor
Diffusion-QL uses conditional diffusion models as expressive policies in offline RL by coupling behavior cloning with Q-value maximization, achieving SOTA on most D4RL tasks.
DRIFT: Decoupled Rollouts and Importance-Weighted Fine-Tuning for Efficient Multi-Turn Optimization cs.LG · 2026-05-29 · unverdicted · none · ref 16 · internal anchor
DRIFT achieves multi-turn RL performance via offline importance-weighted SFT by leveraging the equivalence of KL-regularized RL to weighted supervised learning.
FLAG: Flow Policy MaxEnt-RL by Latent Augmented Guidance cs.LG · 2026-05-29 · unverdicted · none · ref 36 · internal anchor
FLAG augments state space with flow latent variable to optimize a proxy MaxEnt-RL objective, enabling expressive policies with limited importance samples in high-dimensional control.
Moment Matching Q-Learning cs.LG · 2026-05-27 · unverdicted · none · ref 6 · internal anchor
MoMa QL uses MMD moment matching to enforce distribution-level convergence of conditional score functions in flow-based RL policies for improved sampling efficiency.
SPAR: Support-Preserving Action Rectification cs.LG · 2026-05-27 · unverdicted · none · ref 12 · internal anchor
SPAR anchors policy learning to a frozen BC policy for residual rectification and introduces latent self-imitation to eliminate manifold drift, achieving SOTA on D4RL.
Latent Representation Alignment for Offline Goal-Conditioned Reinforcement Learning cs.LG · 2026-05-25 · unverdicted · none · ref 3 · internal anchor
LAVL combines latent-representation value generalization with hierarchical planning to reduce erroneous generalization in offline GCRL and outperforms prior methods on 20 of 22 OGBench datasets.
Goal-Conditioned Agents that Learn Everything All at Once cs.LG · 2026-05-22 · unverdicted · none · ref 45 · internal anchor
LEO enables efficient all-goals learning in goal-conditioned RL by jointly predicting for all goals in one network pass, yielding >250x speedup over relabelling and better performance on Craftax.
Reinforcing Human Behavior Simulation via Verbal Feedback cs.LG · 2026-05-19 · unverdicted · none · ref 25 · internal anchor
DITTO uses RL with verbal feedback to train LLMs for human behavior simulation, reporting 36% average gains over base models and outperforming GPT-5.4 on 6 of 10 SOUL benchmark tasks.
Offline Reinforcement Learning with Universal Horizon Models cs.LG · 2026-05-15 · unverdicted · none · ref 52 · internal anchor
Universal horizon models extend geometric horizon models to arbitrary horizons and apply winsorized distributions for stable offline RL value learning, outperforming baselines on 100 OGBench tasks.
Drifting Field Policy: A One-Step Generative Policy via Wasserstein Gradient Flow cs.LG · 2026-05-08 · unverdicted · none · ref 46 · internal anchor
DFP is a one-step generative policy using Wasserstein gradient flow on a drifting model backbone, with a top-K behavior cloning surrogate, that reaches SOTA on Robomimic and OGBench manipulation tasks.
Listwise Policy Optimization: Group-based RLVR as Target-Projection on the LLM Response Simplex cs.LG · 2026-05-07 · unverdicted · none · ref 44 · 2 links · internal anchor
Listwise Policy Optimization explicitly performs target-projection on the LLM response simplex, unifying and improving group-based RLVR methods with monotonic improvement and flexible divergences.
Threshold-Guided Optimization for Visual Generative Models cs.LG · 2026-05-06 · unverdicted · none · ref 22 · internal anchor
A threshold-guided alignment method lets visual generative models be optimized directly from scalar human ratings instead of requiring paired preference data.
OGPO: Sample Efficient Full-Finetuning of Generative Control Policies cs.LG · 2026-05-04 · unverdicted · none · ref 34 · 2 links · internal anchor
OGPO enables sample-efficient full-finetuning of generative control policies via off-policy critics and modified PPO, achieving SOTA on robot manipulation tasks while rescuing poorly initialized behavior cloning policies without expert data.
AdamO: A Collapse-Suppressed Optimizer for Offline RL cs.LG · 2026-05-03 · unverdicted · none · ref 53 · internal anchor
AdamO modifies Adam with an orthogonality correction to ensure the spectral radius of the TD update operator stays below one, providing a theoretical stability guarantee for offline RL.
QHyer: Q-conditioned Hybrid Attention-mamba Transformer for Offline Goal-conditioned RL cs.LG · 2026-05-03 · unverdicted · none · ref 145 · internal anchor
QHyer replaces return-to-go with a state-conditioned Q-estimator and adds a gated hybrid attention-mamba backbone to achieve state-of-the-art performance in offline goal-conditioned RL on both Markovian and non-Markovian datasets.
Towards Efficient and Expressive Offline RL via Flow-Anchored Noise-conditioned Q-Learning cs.LG · 2026-05-03 · unverdicted · none · ref 10 · 2 links · internal anchor
FAN simplifies expressive flow policies and distributional critics in offline RL via single-iteration behavior regularization and single-sample noise conditioning to claim SOTA performance with lower training and inference time.
When Policies Cannot Be Retrained: A Unified Closed-Form View of Post-Training Steering in Offline Reinforcement Learning cs.LG · 2026-04-23 · unverdicted · none · ref 9 · internal anchor
For diagonal-Gaussian frozen actors, PoE with alpha equals KL adaptation with beta = alpha/(1-alpha); empirically, composition shows an actor-competence ceiling with 4/5/3 HELP/FROZEN/HURT split on D4RL and zero success on AntMaze.
Beyond Importance Sampling: Rejection-Gated Policy Optimization cs.LG · 2026-04-16 · unverdicted · none · ref 7 · internal anchor
RGPO replaces importance sampling with a smooth [0,1] acceptance gate in policy gradients, unifying TRPO/PPO/REINFORCE, bounding variance for heavy-tailed ratios, and showing gains in online RLHF experiments.
Value-Guidance MeanFlow for Offline Multi-Agent Reinforcement Learning cs.LG · 2026-04-09 · unverdicted · none · ref 25 · internal anchor
VGM²P achieves SOTA-comparable performance in offline MARL via value-guided conditional behavior cloning with MeanFlow, enabling efficient single-step action generation insensitive to regularization coefficients.
Target Policy Optimization cs.LG · 2026-04-07 · unverdicted · none · ref 11 · internal anchor
TPO constructs a target distribution q proportional to the old policy times exp(utility) and trains the policy to match it via cross-entropy, matching or beating PPO and GRPO especially under sparse rewards.
Delightful Distributed Policy Gradient cs.LG · 2026-03-20 · unverdicted · none · ref 19 · internal anchor
Delightful Policy Gradient gates updates with advantage times surprisal to suppress rare failures while preserving rare successes in distributed RL with stale or buggy data.
On the Non-decoupling of Supervised Fine-tuning and Reinforcement Learning in Post-training cs.LG · 2026-01-12 · unverdicted · none · ref 11 · internal anchor
SFT and RL cannot be decoupled in LLM post-training because each step increases the loss or lowers the reward of the prior step under KL and PL analyses.
$\pi^{*}_{0.6}$: a VLA That Learns From Experience cs.LG · 2025-11-18 · unverdicted · none · ref 69 · internal anchor
RECAP enables a generalist VLA to self-improve via advantage-conditioned RL on mixed real-world data, more than doubling throughput and halving failure rates on hard manipulation tasks.
Training Diffusion Models with Reinforcement Learning cs.LG · 2023-05-22 · unverdicted · none · ref 20 · internal anchor
DDPO uses policy gradients on the denoising process to optimize diffusion models for arbitrary rewards like human feedback or compressibility.
IDQL: Implicit Q-Learning as an Actor-Critic Method with Diffusion Policies cs.LG · 2023-04-20 · conditional · none · ref 38 · internal anchor
IDQL generalizes IQL into an actor-critic framework and uses diffusion policies for robust policy extraction, outperforming prior offline RL methods.
FlowAWR: Online Adaptive Flow Reinforcement via Advantage-Weighted Rectification cs.LG · 2026-06-29 · unverdicted · none · ref 17 · internal anchor
FlowAWR derives an advantage-weighted rectification for optimal velocity fields in flow models, claiming 2-5x faster convergence than DiffusionNFT on SD3.5-Medium.
Two-Stage Fine-Tuning for Protein Sequence Generation with Targeted Amino-Acid Composition cs.LG · 2026-06-26 · unverdicted · none · ref 7 · internal anchor
A domain-adaptive fine-tuning stage followed by reward-weighted RL fine-tuning produces protein sequences whose amino-acid composition matches a specified target while preserving sequence statistics and diversity.
Abstraction for Offline Goal-Conditioned Reinforcement Learning cs.LG · 2026-05-21 · unverdicted · none · ref 8 · internal anchor
Introduces relativised options and hierarchical abstraction to reuse experience across similar contexts in offline GCRL, with two algorithms demonstrating performance gains.
COOPO: Cyclic Offline-Online Policy Optimization Algorithm cs.LG · 2026-05-18 · unverdicted · none · ref 36 · internal anchor
COOPO is a cyclic offline-online RL algorithm that repeatedly anchors the policy to a dataset via KL-regularized updates then fine-tunes online, claiming better sample efficiency and monotonic improvement under coverage assumptions.
ISEP: Implicit Support Expansion for Offline Reinforcement Learning via Stochastic Policy Optimization cs.LG · 2026-05-18 · unverdicted · none · ref 39 · internal anchor
ISEP expands action support in offline RL via value interpolation between data and policy samples, then uses stochastic policy optimization to avoid mode collapse in the resulting multimodal objective.
Q-Flow: Stable and Expressive Reinforcement Learning with Flow-Based Policy cs.LG · 2026-05-13 · unverdicted · none · ref 10 · 2 links · internal anchor
Q-Flow bridges stability and expressivity in flow-based RL policies by propagating terminal trajectory values to intermediate states for gradient-based optimization.
Behavioral Mode Discovery for Fine-tuning Multimodal Generative Policies cs.LG · 2026-05-12 · unverdicted · none · ref 65 · internal anchor
Unsupervised behavioral mode discovery combined with mutual information rewards enables RL fine-tuning of multimodal generative policies that achieves higher success rates without losing action diversity.
Entropy-Regularized Adjoint Matching for Offline Reinforcement Learning cs.LG · 2026-05-07 · unverdicted · none · ref 8 · 2 links · internal anchor
ME-AM adds mirror-descent entropy maximization and a mixture behavior prior to adjoint matching in flow-based policies to mitigate popularity bias and support binding in offline RL.
