Model-Based Value Estimation for Efficient Model-Free Reinforcement Learning

Alvin Wan; Ion Stoica; Joseph E. Gonzalez; Michael I. Jordan; Sergey Levine; Vladimir Feinberg

arxiv: 1803.00101 · v1 · pith:VY7FCRDPnew · submitted 2018-02-28 · cs.LG · cs.AI · stat.ML

Vladimir Feinberg, Alvin Wan, Ion Stoica, Michael I. Jordan, Joseph E. Gonzalez, Sergey Levine

classification cs.LG · cs.AI · stat.ML

keywords learning · dynamics · model · model-free · reinforcement · value · complexity · data

Recent model-free reinforcement learning algorithms have proposed incorporating learned dynamics models as a source of additional data with the intention of reducing sample complexity. Such methods hold the promise of incorporating imagined data coupled with a notion of model uncertainty to accelerate the learning of continuous control tasks. Unfortunately, they rely on heuristics that limit usage of the dynamics model. We present model-based value expansion, which controls for uncertainty in the model by only allowing imagination to fixed depth. By enabling wider use of learned dynamics models within a model-free reinforcement learning algorithm, we improve value estimation, which, in turn, reduces the sample complexity of learning.
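The fixed-depth imagination described in the abstract can be illustrated with a short sketch. It rolls a learned model forward for a fixed number of steps, sums the discounted imagined rewards, and then bootstraps from a value estimate at the final imagined state. This is a minimal illustration under stated assumptions, not the authors' implementation: the function name `value_expansion_target` and the callables `policy`, `dynamics`, `reward`, and `value` are hypothetical placeholders, and states and actions are plain floats for simplicity.

```python
from typing import Callable

# Hypothetical scalar state/action types, chosen only to keep the sketch small.
State = float
Action = float


def value_expansion_target(
    state: State,
    horizon: int,
    gamma: float,
    policy: Callable[[State], Action],
    dynamics: Callable[[State, Action], State],
    reward: Callable[[State, Action], float],
    value: Callable[[State], float],
) -> float:
    """Return a value target built from `horizon` imagined steps.

    Assumed form: sum_{t<H} gamma^t * r_t + gamma^H * value(s_H),
    where every transition comes from the learned `dynamics` model.
    """
    total = 0.0
    discount = 1.0
    s = state
    for _ in range(horizon):
        a = policy(s)                     # act with the current policy
        total += discount * reward(s, a)  # accumulate imagined reward
        s = dynamics(s, a)                # step the learned model
        discount *= gamma
    # Bootstrap from the value estimate at the last imagined state.
    return total + discount * value(s)


if __name__ == "__main__":
    # Toy problem: the state increases by the action; reward equals the state.
    target = value_expansion_target(
        state=0.0,
        horizon=3,
        gamma=0.5,
        policy=lambda s: 1.0,
        dynamics=lambda s, a: s + a,
        reward=lambda s, a: s,
        value=lambda s: 10.0,
    )
    print(target)
```

With `horizon=0` the function falls back to the plain value estimate, so in this sketch the horizon directly controls how much the target depends on the learned model.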

Forward citations

Cited by 10 Pith papers

Reviewed papers in the Pith corpus that reference this work. Sorted by Pith novelty score.

Dream to Control: Learning Behaviors by Latent Imagination
cs.LG 2019-12 accept novelty 7.0

Dreamer learns to control from images by imagining and optimizing behaviors in a learned latent world model, outperforming prior methods on 20 visual tasks in data efficiency and final performance.
Exploring Model-based Planning with Policy Networks
cs.LG 2019-06 unverdicted novelty 7.0

POPLIN combines policy networks with model-predictive planning by optimizing either action sequences or policy parameters, yielding 3x better sample efficiency than PETS, TD3 and SAC on MuJoCo locomotion tasks.
QHyer: Q-conditioned Hybrid Attention-mamba Transformer for Offline Goal-conditioned RL
cs.LG 2026-05 unverdicted novelty 6.0

QHyer achieves state-of-the-art results in offline goal-conditioned RL by replacing return-to-go with a state-conditioned Q-estimator and introducing a gated hybrid attention-mamba backbone for content-adaptive histor...
QHyer: Q-conditioned Hybrid Attention-mamba Transformer for Offline Goal-conditioned RL
cs.LG 2026-05 unverdicted novelty 6.0

QHyer replaces return-to-go with a state-conditioned Q-estimator and adds a gated hybrid attention-mamba backbone to achieve state-of-the-art performance in offline goal-conditioned RL on both Markovian and non-Markov...
A KL-regularization Framework for Learning to Plan with Adaptive Priors
cs.LG 2025-10 unverdicted novelty 6.0

PO-MPC unifies prior MPPI-based RL approaches under a single KL-regularized framework that uses the planner distribution as a prior, with new variations yielding performance gains in experiments.
DAWM: Diffusion Action World Models for Offline Reinforcement Learning via Action-Inferred Transitions
cs.LG 2025-09 unverdicted novelty 6.0

DAWM introduces a modular diffusion world model with an inverse dynamics model to produce complete synthetic transitions that improve conservative offline RL algorithms like TD3BC and IQL on D4RL tasks.
Quantifying the Utility of User Simulators for Building Collaborative LLM Assistants
cs.CL 2026-05 unverdicted novelty 5.0

Fine-tuned simulators grounded in real human data produce LLM assistants that win more often against real users than those trained against role-playing simulators.
ReinVBC: A Model-based Reinforcement Learning Approach to Vehicle Braking Controller
cs.RO 2026-04 unverdicted novelty 5.0

ReinVBC applies offline model-based RL to learn vehicle dynamics and braking policies, with results indicating real-world capability and potential to replace production anti-lock braking systems.
EvolvingAgent: Curriculum Self-evolving Agent with Continual World Model for Long-Horizon Tasks
cs.RO 2025-02 unverdicted novelty 5.0

EvolvingAgent autonomously completes long-horizon tasks via a closed-loop planner-controller-reflector system with continual world model updates, reporting 111.74% higher success rates than baselines in Minecraft and ...
EfficientTDMPC: Improved MPC Objectives for Sample-Efficient Continuous Control
cs.LG 2026-05 unverdicted novelty 4.0

EfficientTDMPC extends the TD-MPC family with model ensembles, return averaging, and uncertainty penalties to reach SOTA sample efficiency on hard continuous control benchmarks in low-data regimes.