arxiv: 1912.02807 · v2 · pith:L72EUOZ2new · submitted 2019-12-05 · 💻 cs.LG · stat.ML

Combining Q-Learning and Search with Amortized Value Estimates

classification 💻 cs.LG stat.ML
keywords search, save, combining, estimates, mcts, model-based, model-free, q-learning

We introduce "Search with Amortized Value Estimates" (SAVE), an approach for combining model-free Q-learning with model-based Monte-Carlo Tree Search (MCTS). In SAVE, a learned prior over state-action values is used to guide MCTS, which estimates an improved set of state-action values. The new Q-estimates are then used in combination with real experience to update the prior. This effectively amortizes the value computation performed by MCTS, resulting in a cooperative relationship between model-free learning and model-based search. SAVE can be implemented on top of any Q-learning agent with access to a model, which we demonstrate by incorporating it into agents that perform challenging physical reasoning tasks and Atari. SAVE consistently achieves higher rewards with fewer training steps and, in contrast to typical model-based search approaches, yields strong performance with very small search budgets. By combining real experience with information computed during search, SAVE demonstrates that it is possible to improve on both the performance of model-free learning and the computational cost of planning.
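The loop described in the abstract (prior guides search, search refines the Q-estimates, refined estimates plus real experience update the prior) can be illustrated with a small tabular sketch. This is not the authors' implementation. It assumes a hypothetical six-state chain environment, replaces full MCTS with a depth-one UCB search at the root, and uses a simple squared-error style pull toward the search values as the amortization step. All names here (`model`, `search`, `save_update`, `train`, `beta`, `c_ucb`) are illustrative.

```python
import math
import random

N_STATES = 6        # chain states 0..5; reaching state 5 ends the episode
ACTIONS = (0, 1)    # 0 = left, 1 = right
GAMMA = 0.9


def model(state, action):
    """Hypothetical deterministic model: returns (next_state, reward, done)."""
    nxt = max(0, state - 1) if action == 0 else state + 1
    done = nxt == N_STATES - 1
    return nxt, (1.0 if done else 0.0), done


def search(q, state, budget=4, c_ucb=1.0):
    """Depth-one UCB search at the root, seeded with the learned prior.

    A stand-in for MCTS: each simulation takes one model step and then
    bootstraps from the prior Q-values at the resulting state.
    """
    q_search = list(q[state])      # initialise search values from the prior
    visits = [1] * len(ACTIONS)    # the prior counts as one visit per action
    for _ in range(budget):
        total = sum(visits)
        a = max(ACTIONS, key=lambda b: q_search[b]
                + c_ucb * math.sqrt(math.log(total) / visits[b]))
        nxt, r, done = model(state, a)
        ret = r if done else r + GAMMA * max(q[nxt])
        visits[a] += 1
        q_search[a] += (ret - q_search[a]) / visits[a]   # running mean
    return q_search


def save_update(q, s, a, r, nxt, done, q_search, lr=0.5, beta=0.5):
    """Combine a Q-learning step on real experience with amortization."""
    # Model-free part: one-step Q-learning on the real transition.
    target = r if done else r + GAMMA * max(q[nxt])
    q[s][a] += lr * (target - q[s][a])
    # Amortization part (assumed form): pull the prior toward search values.
    for b in ACTIONS:
        q[s][b] += lr * beta * (q_search[b] - q[s][b])


def train(episodes=200, epsilon=0.2, max_steps=50, seed=0):
    rng = random.Random(seed)
    q = [[0.0 for _ in ACTIONS] for _ in range(N_STATES)]
    for _ in range(episodes):
        s = 0
        for _ in range(max_steps):
            q_search = search(q, s)
            if rng.random() < epsilon:
                a = rng.choice(ACTIONS)
            else:
                # Act greedily on the search values, breaking ties at random.
                a = max(ACTIONS, key=lambda b: (q_search[b], rng.random()))
            nxt, r, done = model(s, a)
            save_update(q, s, a, r, nxt, done, q_search)
            s = nxt
            if done:
                break
    return q


if __name__ == "__main__":
    for state, values in enumerate(train()):
        print(state, [round(v, 3) for v in values])
```

Even with a budget of four simulations per step, the search values seeded by the prior tend to point toward the goal before the prior itself has learned the reward, which is the kind of small-budget benefit the abstract describes. The mixing weight `beta` and the depth-one search are simplifications chosen for brevity.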

Forward citations

Cited by 3 Pith papers

Reviewed papers in the Pith corpus that reference this work. Sorted by Pith novelty score.

  1. TabQL: In-Context Q-Learning with Tabular Foundation Models

    cs.LG 2026-05 unverdicted novelty 7.0

    TabQL is a reinforcement learning framework that substitutes a tabular foundation model with in-context capabilities for the parametric Q-network in DQN, with a warm-up phase and theoretical analysis claiming improved...

  2. Finding the Time to Think: Learning Planning Budgets in Real-Time RL

    cs.LG 2026-06 unverdicted novelty 6.0

    A learned gating policy selects state-dependent planning budgets in variable-delay real-time RL and outperforms fixed-budget and heuristic baselines across Pac-Man, Tetris, Snake, Speed Hex, and Speed Go.

  3. Finding the Time to Think: Learning Planning Budgets in Real-Time RL

    cs.LG 2026-06 unverdicted novelty 6.0

    Trains a gating policy to select state-dependent planning budgets in variable-delay real-time RL, outperforming fixed-budget and heuristic baselines across Pac-Man, Tetris, Snake, Speed Hex, and Speed Go.