hub Mixed citations

Offline Reinforcement Learning with Implicit Q-Learning

Ilya Kostrikov, Ashvin Nair, Sergey Levine · 2021 · cs.LG · arXiv 2110.06169

Mixed citation behavior. Most common role is background (33%).

53 Pith papers citing it

Background 33% of classified citations

open full Pith review browse 53 citing papers arXiv PDF

abstract

Offline reinforcement learning requires reconciling two conflicting aims: learning a policy that improves over the behavior policy that collected the dataset, while at the same time minimizing the deviation from the behavior policy so as to avoid errors due to distributional shift. This trade-off is critical, because most current offline reinforcement learning methods need to query the value of unseen actions during training to improve the policy, and therefore need to either constrain these actions to be in-distribution, or else regularize their values. We propose an offline RL method that never needs to evaluate actions outside of the dataset, but still enables the learned policy to improve substantially over the best behavior in the data through generalization. The main insight in our work is that, instead of evaluating unseen actions from the latest policy, we can approximate the policy improvement step implicitly by treating the state value function as a random variable, with randomness determined by the action (while still integrating over the dynamics to avoid excessive optimism), and then taking a state conditional upper expectile of this random variable to estimate the value of the best actions in that state. This leverages the generalization capacity of the function approximator to estimate the value of the best available action at a given state without ever directly querying a Q-function with this unseen action. Our algorithm alternates between fitting this upper expectile value function and backing it up into a Q-function. Then, we extract the policy via advantage-weighted behavioral cloning. We dub our method implicit Q-learning (IQL). IQL demonstrates the state-of-the-art performance on D4RL, a standard benchmark for offline reinforcement learning. We also demonstrate that IQL achieves strong performance fine-tuning using online interaction after offline initialization.

hub tools

JSON dossier citing papers JSON arXiv source

citation-role summary

background 5 baseline 4 method 3

citation-polarity summary

background 4 baseline 4 use method 3 unclear 1

representative citing papers

Decoupling KL and Trajectories: A Unified Perspective for SFT, DAgger, Offline RL, and OPD in LLM Distillation

cs.LG · 2026-05-16 · unverdicted · novelty 7.0

Decoupling prefix source from token-level KL direction in autoregressive sequence KL yields four objectives unifying SFT, DAgger, offline RL and OPD, with KL mixing and entropy-gated curriculum improving math reasoning accuracy and shortening responses.

Path-Coupled Bellman Flows for Distributional Reinforcement Learning

cs.LG · 2026-05-07 · unverdicted · novelty 7.0

Path-Coupled Bellman Flows use source-consistent Bellman-coupled paths and a lambda-parameterized control-variate to learn return distributions via flow matching, improving fidelity and stability over prior DRL approaches.

Beyond Penalization: Diffusion-based Out-of-Distribution Detection and Selective Regularization in Offline Reinforcement Learning

cs.LG · 2026-05-06 · unverdicted · novelty 7.0

DOSER detects OOD actions via diffusion-model denoising error and applies selective regularization based on predicted transitions, proving gamma-contraction with performance bounds and outperforming priors on offline RL benchmarks.

SceneOrchestra: Efficient Agentic 3D Scene Synthesis via Full Tool-Call Trajectory Generation

cs.CV · 2026-04-21 · unverdicted · novelty 7.0

SceneOrchestra trains an orchestrator to generate full tool-call trajectories for 3D scene synthesis and uses a discriminator during training to select high-quality plans, yielding state-of-the-art results with lower runtime.

Provably Efficient Offline-to-Online Value Adaptation with General Function Approximation

cs.LG · 2026-04-15 · unverdicted · novelty 7.0

Offline-to-online value adaptation in RL has a minimax lower bound matching pure online learning in hard cases, yet O2O-LSVI improves sample complexity under a novel structural condition on pretrained Q-functions.

WOMBET: World Model-based Experience Transfer for Robust and Sample-efficient Reinforcement Learning

cs.LG · 2026-04-10 · unverdicted · novelty 7.0

WOMBET generates reliable prior data with world-model uncertainty penalization and transfers it to target tasks via adaptive offline-online sampling, yielding better sample efficiency than baselines.

From Static Constraints to Dynamic Adaptation: Sample-Level Constraint Relaxation for Offline-to-Online Reinforcement Learning

cs.LG · 2025-11-05 · unverdicted · novelty 7.0 · 2 refs

DARE performs sample-level constraint relaxation in offline-to-online RL by conditioning on behavioral consistency with a behavior model via posterior-induced exchange, yielding improved fine-tuning stability and performance on D4RL benchmarks.

EXPO: Stable Reinforcement Learning with Expressive Policies

cs.LG · 2025-07-10 · conditional · novelty 7.0

EXPO stabilizes online RL for expressive policies by training a base policy with imitation and using a lightweight Gaussian edit policy to select higher-value actions on the fly for sampling and TD backups.

Steering Your Diffusion Policy with Latent Space Reinforcement Learning

cs.RO · 2025-06-18 · unverdicted · novelty 7.0

DSRL steers pretrained diffusion policies for robotics by applying RL to their latent noise inputs, achieving sample-efficient real-world adaptation with only black-box access.

BiTrajDiff: Bidirectional Trajectory Generation with Diffusion Models for Offline Reinforcement Learning

cs.LG · 2025-06-06 · conditional · novelty 7.0

BiTrajDiff augments offline RL datasets by running independent forward and backward diffusion processes from intermediate states, yielding higher performance than prior one-directional data-augmentation baselines on D4RL.

Diffusion Policies as an Expressive Policy Class for Offline Reinforcement Learning

cs.LG · 2022-08-12 · unverdicted · novelty 7.0

Diffusion-QL uses conditional diffusion models as expressive policies in offline RL by coupling behavior cloning with Q-value maximization, achieving SOTA on most D4RL tasks.

Target-Aligned Bellman Backup for Cross-domain Offline Reinforcement Learning

cs.LG · 2026-05-21 · unverdicted · novelty 6.0

Target-Aligned Bellman Backup (TABB) improves cross-domain offline RL by selecting source transitions according to their contribution to accurate target-domain Bellman target estimation.

Generative Auto-Bidding with Unified Modeling and Exploration

cs.AI · 2026-05-19 · unverdicted · novelty 6.0

GUIDE integrates a Decision Transformer for joint modeling of bidding actions and states with Q-value regularization for exploration and an IDM for safe policy fallback, outperforming baselines in simulations and real Taobao deployment with gains in GMV, clicks, cost, and ROI.

CA2: Code-Aware Agent for Automated Game Testing

cs.SE · 2026-05-13 · unverdicted · novelty 6.0

CA2 integrates call stack information into RL agents for game testing and shows consistent gains over baselines that ignore code signals.

Multi-scale Predictive Representations for Goal-conditioned Reinforcement Learning

cs.LG · 2026-05-10 · unverdicted · novelty 6.0

Ms.PR applies multi-scale predictive supervision to enforce goal-directed alignment in latent spaces for offline GCRL, yielding improved representation quality and performance on vision and state-based tasks.

Predictive but Not Plannable: RC-aux for Latent World Models

cs.LG · 2026-05-08 · unverdicted · novelty 6.0

RC-aux corrects spatiotemporal mismatch in reconstruction-free latent world models by adding multi-horizon prediction and reachability supervision, improving planning performance on goal-conditioned pixel-control tasks.

Beyond Autoregressive RTG: Conditioning via Injection Outside Sequential Modeling in Decision Transformer

cs.LG · 2026-05-07 · unverdicted · novelty 6.0

Injecting RTG into states outside the autoregressive sequence yields shorter, more efficient Decision Transformers that outperform the original on offline RL tasks.

Offline Reinforcement Learning for Rotation Profile Control in Tokamaks

cs.LG · 2026-05-07 · unverdicted · novelty 6.0

Offline RL policies trained solely on DIII-D historical data were deployed on the tokamak and produced promising real-world control of the plasma rotation profile.

On the Role of Language Representations in Auto-Bidding: Findings and Implications

cs.AI · 2026-05-07 · unverdicted · novelty 6.0

SemBid injects LLM-encoded Task, History, and Strategy semantics as tokens into offline bidding trajectories and uses self-attention to outperform numerical-only baselines in performance, constraint satisfaction, and robustness.

Long-Horizon Q-Learning: Accurate Value Learning via n-Step Inequalities

cs.AI · 2026-05-07 · unverdicted · novelty 6.0 · 2 refs

LQL turns n-step action-sequence lower bounds into a practical hinge-loss stabilizer for off-policy Q-learning without extra networks or forward passes.

Adaptive Q-Chunking for Offline-to-Online Reinforcement Learning

cs.LG · 2026-05-07 · unverdicted · novelty 6.0

Adaptive Q-Chunking selects optimal action chunk sizes at each state via normalized advantage comparisons to outperform fixed chunk sizes in offline-to-online RL on robot benchmarks.

OGPO: Sample Efficient Full-Finetuning of Generative Control Policies

cs.LG · 2026-05-04 · unverdicted · novelty 6.0

OGPO is a sample-efficient off-policy method for full finetuning of generative control policies that reaches SOTA on robotic manipulation tasks and can recover from poor behavior-cloning initializations without expert data.

Learning while Deploying: Fleet-Scale Reinforcement Learning for Generalist Robot Policies

cs.RO · 2026-05-01 · unverdicted · novelty 6.0

Fleet-scale RL framework improves a single generalist VLA policy from deployment data to 95% average success on eight real-world manipulation tasks with 16 dual-arm robots.

When Policies Cannot Be Retrained: A Unified Closed-Form View of Post-Training Steering in Offline Reinforcement Learning

cs.LG · 2026-04-23 · unverdicted · novelty 6.0

For diagonal-Gaussian frozen actors, PoE with alpha equals KL adaptation with beta = alpha/(1-alpha); empirically, composition shows an actor-competence ceiling with 4/5/3 HELP/FROZEN/HURT split on D4RL and zero success on AntMaze.

citing papers explorer

Showing 50 of 53 citing papers.

Decoupling KL and Trajectories: A Unified Perspective for SFT, DAgger, Offline RL, and OPD in LLM Distillation cs.LG · 2026-05-16 · unverdicted · none · ref 19 · internal anchor
Decoupling prefix source from token-level KL direction in autoregressive sequence KL yields four objectives unifying SFT, DAgger, offline RL and OPD, with KL mixing and entropy-gated curriculum improving math reasoning accuracy and shortening responses.
Path-Coupled Bellman Flows for Distributional Reinforcement Learning cs.LG · 2026-05-07 · unverdicted · none · ref 17 · internal anchor
Path-Coupled Bellman Flows use source-consistent Bellman-coupled paths and a lambda-parameterized control-variate to learn return distributions via flow matching, improving fidelity and stability over prior DRL approaches.
Beyond Penalization: Diffusion-based Out-of-Distribution Detection and Selective Regularization in Offline Reinforcement Learning cs.LG · 2026-05-06 · unverdicted · none · ref 6 · internal anchor
DOSER detects OOD actions via diffusion-model denoising error and applies selective regularization based on predicted transitions, proving gamma-contraction with performance bounds and outperforming priors on offline RL benchmarks.
SceneOrchestra: Efficient Agentic 3D Scene Synthesis via Full Tool-Call Trajectory Generation cs.CV · 2026-04-21 · unverdicted · none · ref 17 · internal anchor
SceneOrchestra trains an orchestrator to generate full tool-call trajectories for 3D scene synthesis and uses a discriminator during training to select high-quality plans, yielding state-of-the-art results with lower runtime.
Provably Efficient Offline-to-Online Value Adaptation with General Function Approximation cs.LG · 2026-04-15 · unverdicted · none · ref 6 · internal anchor
Offline-to-online value adaptation in RL has a minimax lower bound matching pure online learning in hard cases, yet O2O-LSVI improves sample complexity under a novel structural condition on pretrained Q-functions.
WOMBET: World Model-based Experience Transfer for Robust and Sample-efficient Reinforcement Learning cs.LG · 2026-04-10 · unverdicted · none · ref 6 · internal anchor
WOMBET generates reliable prior data with world-model uncertainty penalization and transfers it to target tasks via adaptive offline-online sampling, yielding better sample efficiency than baselines.
From Static Constraints to Dynamic Adaptation: Sample-Level Constraint Relaxation for Offline-to-Online Reinforcement Learning cs.LG · 2025-11-05 · unverdicted · none · ref 7 · 2 links · internal anchor
DARE performs sample-level constraint relaxation in offline-to-online RL by conditioning on behavioral consistency with a behavior model via posterior-induced exchange, yielding improved fine-tuning stability and performance on D4RL benchmarks.
EXPO: Stable Reinforcement Learning with Expressive Policies cs.LG · 2025-07-10 · conditional · none · ref 14 · internal anchor
EXPO stabilizes online RL for expressive policies by training a base policy with imitation and using a lightweight Gaussian edit policy to select higher-value actions on the fly for sampling and TD backups.
Steering Your Diffusion Policy with Latent Space Reinforcement Learning cs.RO · 2025-06-18 · unverdicted · none · ref 81 · internal anchor
DSRL steers pretrained diffusion policies for robotics by applying RL to their latent noise inputs, achieving sample-efficient real-world adaptation with only black-box access.
BiTrajDiff: Bidirectional Trajectory Generation with Diffusion Models for Offline Reinforcement Learning cs.LG · 2025-06-06 · conditional · none · ref 23 · internal anchor
BiTrajDiff augments offline RL datasets by running independent forward and backward diffusion processes from intermediate states, yielding higher performance than prior one-directional data-augmentation baselines on D4RL.
Diffusion Policies as an Expressive Policy Class for Offline Reinforcement Learning cs.LG · 2022-08-12 · unverdicted · none · ref 7 · internal anchor
Diffusion-QL uses conditional diffusion models as expressive policies in offline RL by coupling behavior cloning with Q-value maximization, achieving SOTA on most D4RL tasks.
Target-Aligned Bellman Backup for Cross-domain Offline Reinforcement Learning cs.LG · 2026-05-21 · unverdicted · none · ref 36 · internal anchor
Target-Aligned Bellman Backup (TABB) improves cross-domain offline RL by selecting source transitions according to their contribution to accurate target-domain Bellman target estimation.
Generative Auto-Bidding with Unified Modeling and Exploration cs.AI · 2026-05-19 · unverdicted · none · ref 23 · internal anchor
GUIDE integrates a Decision Transformer for joint modeling of bidding actions and states with Q-value regularization for exploration and an IDM for safe policy fallback, outperforming baselines in simulations and real Taobao deployment with gains in GMV, clicks, cost, and ROI.
CA2: Code-Aware Agent for Automated Game Testing cs.SE · 2026-05-13 · unverdicted · none · ref 29 · internal anchor
CA2 integrates call stack information into RL agents for game testing and shows consistent gains over baselines that ignore code signals.
Multi-scale Predictive Representations for Goal-conditioned Reinforcement Learning cs.LG · 2026-05-10 · unverdicted · none · ref 21 · internal anchor
Ms.PR applies multi-scale predictive supervision to enforce goal-directed alignment in latent spaces for offline GCRL, yielding improved representation quality and performance on vision and state-based tasks.
Predictive but Not Plannable: RC-aux for Latent World Models cs.LG · 2026-05-08 · unverdicted · none · ref 23 · internal anchor
RC-aux corrects spatiotemporal mismatch in reconstruction-free latent world models by adding multi-horizon prediction and reachability supervision, improving planning performance on goal-conditioned pixel-control tasks.
Beyond Autoregressive RTG: Conditioning via Injection Outside Sequential Modeling in Decision Transformer cs.LG · 2026-05-07 · unverdicted · none · ref 9 · internal anchor
Injecting RTG into states outside the autoregressive sequence yields shorter, more efficient Decision Transformers that outperform the original on offline RL tasks.
Offline Reinforcement Learning for Rotation Profile Control in Tokamaks cs.LG · 2026-05-07 · unverdicted · none · ref 55 · internal anchor
Offline RL policies trained solely on DIII-D historical data were deployed on the tokamak and produced promising real-world control of the plasma rotation profile.
On the Role of Language Representations in Auto-Bidding: Findings and Implications cs.AI · 2026-05-07 · unverdicted · none · ref 25 · internal anchor
SemBid injects LLM-encoded Task, History, and Strategy semantics as tokens into offline bidding trajectories and uses self-attention to outperform numerical-only baselines in performance, constraint satisfaction, and robustness.
Long-Horizon Q-Learning: Accurate Value Learning via n-Step Inequalities cs.AI · 2026-05-07 · unverdicted · none · ref 23 · 2 links · internal anchor
LQL turns n-step action-sequence lower bounds into a practical hinge-loss stabilizer for off-policy Q-learning without extra networks or forward passes.
Adaptive Q-Chunking for Offline-to-Online Reinforcement Learning cs.LG · 2026-05-07 · unverdicted · none · ref 21 · internal anchor
Adaptive Q-Chunking selects optimal action chunk sizes at each state via normalized advantage comparisons to outperform fixed chunk sizes in offline-to-online RL on robot benchmarks.
OGPO: Sample Efficient Full-Finetuning of Generative Control Policies cs.LG · 2026-05-04 · unverdicted · none · ref 53 · internal anchor
OGPO is a sample-efficient off-policy method for full finetuning of generative control policies that reaches SOTA on robotic manipulation tasks and can recover from poor behavior-cloning initializations without expert data.
Learning while Deploying: Fleet-Scale Reinforcement Learning for Generalist Robot Policies cs.RO · 2026-05-01 · unverdicted · none · ref 23 · internal anchor
Fleet-scale RL framework improves a single generalist VLA policy from deployment data to 95% average success on eight real-world manipulation tasks with 16 dual-arm robots.
When Policies Cannot Be Retrained: A Unified Closed-Form View of Post-Training Steering in Offline Reinforcement Learning cs.LG · 2026-04-23 · unverdicted · none · ref 4 · internal anchor
For diagonal-Gaussian frozen actors, PoE with alpha equals KL adaptation with beta = alpha/(1-alpha); empirically, composition shows an actor-competence ceiling with 4/5/3 HELP/FROZEN/HURT split on D4RL and zero success on AntMaze.
Occupancy Reward Shaping: Improving Credit Assignment for Offline Goal-Conditioned Reinforcement Learning cs.LG · 2026-04-22 · conditional · none · ref 10 · internal anchor
Occupancy Reward Shaping extracts goal-reaching rewards from world-model occupancy measures using optimal transport, improving offline goal-conditioned RL performance 2.2x on 13 tasks without changing the optimal policy.
Fisher Decorator: Refining Flow Policy via a Local Transport Map cs.LG · 2026-04-20 · unverdicted · none · ref 40 · internal anchor
Fisher Decorator refines flow policies in offline RL via a local transport map and Fisher-matrix quadratic approximation of the KL constraint, yielding controllable error near the optimum and SOTA benchmark results.
Abstract Sim2Real through Approximate Information States cs.RO · 2026-04-16 · unverdicted · none · ref 38 · internal anchor
Abstract simulators can be grounded to real tasks by making their dynamics history-dependent and correcting them with real data, enabling RL policy transfer.
When Missing Becomes Structure: Intent-Preserving Policy Completion from Financial KOL Discourse cs.LG · 2026-04-15 · unverdicted · none · ref 18 · internal anchor
KICL completes execution decisions in KOL financial discourse using offline RL, achieving top returns and Sharpe ratios with no unsupported trades or direction changes on YouTube and X data from 2022-2025.
MoRI: Mixture of RL and IL Experts for Long-Horizon Manipulation Tasks cs.RO · 2026-04-11 · unverdicted · none · ref 18 · internal anchor
MoRI dynamically mixes RL and IL experts with variance-based switching and IL regularization to reach 97.5% success in four real-world robotic tasks while cutting human intervention by 85.8%.
JD-BP: A Joint-Decision Generative Framework for Auto-Bidding and Pricing cs.GT · 2026-04-07 · unverdicted · none · ref 21 · internal anchor
JD-BP jointly generates bids and pricing corrections via generative models, memory-less return-to-go, trajectory augmentation, and energy-based DPO to improve auto-bidding performance despite prediction errors and latency.
Hierarchical Reinforcement Learning with Augmented Step-Level Transitions for LLM Agents cs.AI · 2026-04-07 · unverdicted · none · ref 2 · internal anchor
STEP-HRL enables step-level learning in LLM agents via hierarchical task structure and local progress modules, outperforming baselines on ScienceWorld and ALFWorld while cutting token usage.
Hierarchical Planning with Latent World Models cs.LG · 2026-04-03 · unverdicted · none · ref 28 · internal anchor
Hierarchical planning over multi-scale latent world models enables 70% success on real robotic pick-and-place with goal-only input where flat models achieve 0%, while cutting planning compute up to 4x in simulations.
LeWorldModel: Stable End-to-End Joint-Embedding Predictive Architecture from Pixels cs.LG · 2026-03-13 · unverdicted · none · ref 48 · internal anchor
LeWM is the first end-to-end trainable JEPA from pixels that uses only two loss terms for stable training and fast planning on 2D/3D control tasks.
WebFactory: Automated Compression of Foundational Language Intelligence into Grounded Web Agents cs.AI · 2026-03-05 · unverdicted · none · ref 12 · internal anchor
WebFactory is a fully automated RL pipeline that compresses LLM-encoded internet knowledge into grounded web agents, achieving performance comparable to human-annotated training but using synthetic data from only 10 websites.
Robometer: Scaling General-Purpose Robotic Reward Models via Trajectory Comparisons cs.RO · 2026-03-02 · unverdicted · none · ref 78 · internal anchor
Robometer combines intra-trajectory progress supervision with inter-trajectory preference supervision on a 1M-trajectory dataset to learn more generalizable robotic reward functions than prior methods.
OPRIDE: Offline Preference-based Reinforcement Learning via In-Dataset Exploration cs.LG · 2026-02-19 · unverdicted · none · ref 25 · internal anchor
OPRIDE improves query efficiency in offline PbRL via a principled in-dataset exploration strategy and discount scheduling, outperforming prior methods with fewer queries and providing theoretical guarantees.
The hidden risks of temporal resampling in clinical reinforcement learning cs.LG · 2026-02-06 · conditional · none · ref 61 · internal anchor
Resampling clinical time series into uniform bins for offline RL reduces performance by up to 60% and causes retrospective evaluations to overestimate returns by 1.5-3x versus unprocessed data.
The Signal is in the Steps: Local Scoring for Reasoning Data Selection cs.LG · 2025-10-05 · unverdicted · none · ref 9 · internal anchor
LALP scores local reasoning steps rather than full trajectories to improve selection of training data from diverse teacher models for distilling long-form reasoning.
DAWM: Diffusion Action World Models for Offline Reinforcement Learning via Action-Inferred Transitions cs.LG · 2025-09-23 · unverdicted · none · ref 17 · internal anchor
DAWM introduces a modular diffusion world model with an inverse dynamics model to produce complete synthetic transitions that improve conservative offline RL algorithms like TD3BC and IQL on D4RL tasks.
Reinforcement Learning with Action Chunking cs.LG · 2025-07-10 · unverdicted · none · ref 32 · internal anchor
Q-chunking improves offline-to-online RL sample efficiency on long-horizon sparse-reward manipulation tasks by applying action chunking to TD learning.
RoboMD: Uncovering Robot Vulnerabilities through Semantic Potential Fields cs.RO · 2024-12-03 · unverdicted · none · ref 54 · internal anchor
A deep RL vulnerability-prediction policy trained in semantic embedding space finds up to 23% more unique robot manipulation failures than vision-language baselines and enables more efficient fine-tuning.
IDQL: Implicit Q-Learning as an Actor-Critic Method with Diffusion Policies cs.LG · 2023-04-20 · conditional · none · ref 27 · internal anchor
IDQL generalizes IQL into an actor-critic framework and uses diffusion policies for robust policy extraction, outperforming prior offline RL methods.
Abstraction for Offline Goal-Conditioned Reinforcement Learning cs.LG · 2026-05-21 · unverdicted · none · ref 7 · internal anchor
Introduces relativised options and hierarchical abstraction to reuse experience across similar contexts in offline GCRL, with two algorithms demonstrating performance gains.
stable-worldmodel: A Platform for Reproducible World Modeling Research and Evaluation cs.LG · 2026-05-20 · unverdicted · none · ref 35 · internal anchor
The paper presents stable-worldmodel (swm), a platform with high-performance data layer, modern world model baselines, planning solvers, and extended environments for reproducible research and generalization evaluation.
COOPO: Cyclic Offline-Online Policy Optimization Algorithm cs.LG · 2026-05-18 · unverdicted · none · ref 17 · internal anchor
COOPO is a cyclic offline-online RL algorithm that repeatedly anchors the policy to a dataset via KL-regularized updates then fine-tunes online, claiming better sample efficiency and monotonic improvement under coverage assumptions.
RankQ: Offline-to-Online Reinforcement Learning via Self-Supervised Action Ranking cs.AI · 2026-05-11 · unverdicted · none · ref 16 · 2 links · internal anchor
RankQ augments temporal-difference Q-learning with a multi-term self-supervised ranking loss to enforce structured action ordering, yielding competitive or better results than prior methods on D4RL and large gains in vision-based robot fine-tuning.
Safe-Support Q-Learning: Learning without Unsafe Exploration cs.LG · 2026-04-28 · unverdicted · none · ref 3 · internal anchor
Safe-Support Q-Learning trains Q-functions and policies in reinforcement learning without ever visiting unsafe states by constraining the behavior policy to a safe set and using KL-regularized Bellman targets in a two-stage framework.
Efficient Hierarchical Implicit Flow Q-learning for Offline Goal-conditioned Reinforcement Learning cs.LG · 2026-04-10 · unverdicted · none · ref 21 · internal anchor
Proposes mean flow policies and LeJEPA loss to overcome Gaussian policy limits and weak subgoal generation in hierarchical offline GCRL, reporting strong results on OGBench state and pixel tasks.
VGAS: Value-Guided Action-Chunk Selection for Few-Shot Vision-Language-Action Adaptation cs.AI · 2026-02-07 · unverdicted · none · ref 20 · internal anchor
VGAS uses best-of-N selection with a geometrically grounded critic and explicit regularization to improve success rates of few-shot VLA policies under limited data and distribution shifts.
Reflection-Based Task Adaptation for Self-Improving VLA cs.RO · 2025-10-14 · unverdicted · none · ref 18 · internal anchor
Reflective Self-Adaptation combines failure-reflective reinforcement learning with success-guided imitation learning to enable faster and more reliable task adaptation for pre-trained Vision-Language-Action models.

Offline Reinforcement Learning with Implicit Q-Learning

hub tools

citation-role summary

citation-polarity summary

fields

years

verdicts

roles

polarities

representative citing papers

citing papers explorer