D4RL supplies new offline RL benchmarks and datasets from expert and mixed sources to expose weaknesses in existing algorithms and standardize evaluation.
Off-policy deep reinforcement learning without exploration
8 Pith papers cite this work.
citation roles: background 1
citation polarities: background 1
representative citing papers
DAWM introduces a modular diffusion world model with an inverse dynamics model to produce complete synthetic transitions that improve conservative offline RL algorithms like TD3BC and IQL on D4RL tasks.
A deep RL vulnerability-prediction policy trained in semantic embedding space finds up to 23% more unique robot manipulation failures than vision-language baselines and enables more efficient fine-tuning.
Behavior-regularized actor-critic methods achieve strong offline RL results with simple regularization, rendering many recent technical additions unnecessary.
Develops Way Off-Policy batch RL algorithms with pre-trained model priors, KL-control, and dropout uncertainty estimates to learn implicit rewards from offline human dialog data, reporting live deployment gains over prior offline methods.
RankQ augments temporal-difference Q-learning with a multi-term self-supervised ranking loss to enforce structured action ordering, yielding competitive or better results than prior methods on D4RL and large gains in vision-based robot fine-tuning.
A preference fine-tuning method for LLMs that combines context augmentation, theory-driven preference pair construction, curriculum learning, and a density estimation support constraint to produce domain-aligned review responses with reduced hallucinations and over-conservatism.
Offline RL promises to extract high-utility policies from static datasets but faces fundamental challenges that current methods only partially address.
citing papers explorer
- RankQ: Offline-to-Online Reinforcement Learning via Self-Supervised Action Ranking
- Align Generative Artificial Intelligence with Human Preferences: A Novel Large Language Model Fine-Tuning Method for Online Review Management