On Inductive Biases in Deep Reinforcement Learning

David Silver; Hado van Hasselt; Joseph Modayil; Matteo Hessel

arxiv: 1907.02908 · v1 · pith:6HCSNN7Onew · submitted 2019-07-05 · 💻 cs.LG · cs.AI · stat.ML

Matteo Hessel, Hado van Hasselt, Joseph Modayil, David Silver

Pith reviewed 2026-05-25 02:12 UTC · model grok-4.3

classification 💻 cs.LG · cs.AI · stat.ML

keywords deep reinforcement learning · inductive biases · adaptive components · generalization · continuous control · domain-specific components

The pith

Replacing domain-specific components in deep RL agents with adaptive alternatives from the literature can improve performance on new tasks without retuning.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper examines the trade-off between generality and performance created by inductive biases in common deep reinforcement learning algorithms. These biases often appear as domain-specific components that shape the agent's objective and its interface to the environment. The authors replace several such components with adaptive solutions drawn from prior work and measure the effects on both the original domains and a fresh set of continuous control problems. On the original tasks results are mixed, but on the new tasks the adaptive version outperforms the original on many problems when neither system receives extra tuning.

Core claim

The system with adaptive components performed better on many of the new tasks.

What carries the argument

Domain-specific components that bias the objective and environmental interface of deep RL agents, replaced by adaptive solutions from the literature.
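
To make the distinction concrete, here is a minimal sketch contrasting a fixed, domain-tuned transformation with an adaptive one. This page does not name the specific components the paper swaps, so the pairing below (fixed reward clipping versus running normalization) is an assumption chosen only for illustration, and the names `clip_reward` and `RunningNormalizer` are hypothetical.

```python
import math


def clip_reward(r, low=-1.0, high=1.0):
    """Domain-specific choice: squash every reward into a fixed range.

    The bounds are a hand-picked assumption about the reward scale.
    """
    return max(low, min(high, r))


class RunningNormalizer:
    """Adaptive alternative (illustrative): rescale by running statistics.

    Uses Welford's online algorithm, so no reward scale is assumed upfront.
    """

    def __init__(self, eps=1e-8):
        self.count = 0
        self.mean = 0.0
        self.m2 = 0.0  # running sum of squared deviations from the mean
        self.eps = eps

    def update(self, x):
        self.count += 1
        delta = x - self.mean
        self.mean += delta / self.count
        self.m2 += delta * (x - self.mean)

    def std(self):
        # Fall back to unit scale until there are at least two samples.
        if self.count < 2:
            return 1.0
        return math.sqrt(self.m2 / self.count)

    def normalize(self, x):
        return (x - self.mean) / (self.std() + self.eps)
```

With rewards of 0, 100, and 200, the clipped version maps both 100 and 200 to the same value, while the normalizer keeps them distinct. That loss of information under a fixed bound is one way a domain-tuned component might transfer poorly to tasks with a different reward scale.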

If this is right

Performance on the original tasks sometimes decreases and sometimes increases when domain-specific components are replaced by adaptive ones.
Reducing the number of domain-specific components improves learning performance on new tasks without any additional tuning of either system.
The effort required to obtain domain knowledge or tune hyper-parameters can be reduced by using adaptive alternatives.
Weaker inductive biases can lead to more general algorithms that transfer better across problems.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

Many reported successes in deep RL may depend on hidden domain knowledge embedded in components that are treated as standard.
Algorithms built with fewer fixed biases could scale to wider ranges of environments if the adaptive replacements remain stable.
The same substitution approach could be applied to other families of RL agents to test whether the generalization benefit holds beyond the tested architectures.

Load-bearing premise

Adaptive solutions drawn from the literature can be substituted for the original domain-specific components while preserving all other aspects of the agent's learning dynamics and interface.

What would settle it

Finding that the adaptive system fails to outperform the original system on the new set of continuous control problems would falsify the central result.
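
One way to make "fails to outperform" checkable is a per-task sign test over final scores. The sketch below is an assumption about how such a check could be set up, not the paper's own analysis; the function names are hypothetical, and any scores passed in would have to come from the actual experiments.

```python
from math import comb


def count_wins(adaptive, baseline):
    """Count tasks where the adaptive agent scores higher or lower.

    Both arguments map task name -> final score; ties are ignored.
    """
    wins = sum(1 for task in adaptive if adaptive[task] > baseline[task])
    losses = sum(1 for task in adaptive if adaptive[task] < baseline[task])
    return wins, losses


def sign_test_p_value(wins, losses):
    """One-sided sign test: P(X >= wins) for X ~ Binomial(wins + losses, 0.5).

    A small value suggests the wins are unlikely under a coin-flip null.
    """
    n = wins + losses
    if n == 0:
        return 1.0
    return sum(comb(n, k) for k in range(wins, n + 1)) / 2 ** n
```

For example, 8 wins and 2 losses give a one-sided p-value of 56/1024, or about 0.055. A sign test ignores the size of each gap, so it complements rather than replaces per-task learning curves.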

Original abstract

Many deep reinforcement learning algorithms contain inductive biases that sculpt the agent's objective and its interface to the environment. These inductive biases can take many forms, including domain knowledge and pretuned hyper-parameters. In general, there is a trade-off between generality and performance when algorithms use such biases. Stronger biases can lead to faster learning, but weaker biases can potentially lead to more general algorithms. This trade-off is important because inductive biases are not free; substantial effort may be required to obtain relevant domain knowledge or to tune hyper-parameters effectively. In this paper, we re-examine several domain-specific components that bias the objective and the environmental interface of common deep reinforcement learning agents. We investigated whether the performance deteriorates when these components are replaced with adaptive solutions from the literature. In our experiments, performance sometimes decreased with the adaptive components, as one might expect when comparing to components crafted for the domain, but sometimes the adaptive components performed better. We investigated the main benefit of having fewer domain-specific components, by comparing the learning performance of the two systems on a different set of continuous control problems, without additional tuning of either system. As hypothesized, the system with adaptive components performed better on many of the new tasks.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Desk Editor's Note (private letter to a colleague)

The experiments show adaptive replacements for some domain-specific RL components can beat the originals on untuned new continuous-control tasks, though the substitutions may not be neutral.

The paper's main result is that replacing certain domain-specific pieces in deep RL agents with more adaptive versions from the literature led to better performance on many new tasks without any retuning, even though the same changes sometimes hurt results on the original tasks. They frame this as evidence that weaker inductive biases can buy generality at some cost to specialization.

The setup directly tests the trade-off by holding the new tasks fixed and comparing the two agent versions head-to-head. That is a clean way to check the generality hypothesis and gives readers concrete numbers rather than just discussion. The work is useful for anyone who designs or tunes RL agents and wants data on how much domain knowledge is really necessary.

The soft spot is the substitution step itself. The adaptive components are drawn from prior work, but any incidental difference in normalization, variance handling, credit assignment, or interface details would confound the claim that the gain comes purely from reduced bias. The observed drop on the original tasks already suggests the replacements are not perfect drop-ins. If the paper does not show explicit checks that state-action spaces, rewards, and optimization objectives stayed identical, the attribution to inductive-bias reduction stays shaky. The abstract supplies no stats, controls, or implementation specifics, so the full manuscript needs to carry that weight.

This is the sort of empirical question that belongs in the literature. I would bring it to a reading group to walk through the exact component swaps and any interface-matching tests. It deserves peer review; the question is relevant and the experimental logic is straightforward enough to evaluate once the details are on the table.

Referee Report

2 major / 1 minor

Summary. The paper claims that domain-specific inductive biases in deep RL algorithms (e.g., pretuned hyperparameters and domain knowledge in objective and interface) trade off performance for generality. By replacing several such components with adaptive alternatives drawn from the literature, the authors observe that performance on the original tasks sometimes decreases, but the resulting system outperforms the original on many new continuous-control tasks when neither is retuned.

Significance. If the substitutions are shown to differ from the originals only in reduced domain bias while preserving interfaces, reward structure, optimization, and exploration mechanics, the result would be significant: it supplies empirical support for the hypothesis that weaker inductive biases improve out-of-distribution generalization in RL without further tuning. The work also quantifies the practical cost of obtaining domain-specific components.

major comments (2)

[Abstract / Experiments] Abstract and experimental sections: the central generalization claim requires that the adaptive substitutions (drawn from the literature) preserve state/action interfaces, reward structure, optimization objective, and exploration mechanics exactly. The abstract notes performance decreases on original tasks, which is consistent with imperfect substitution; without explicit controls or ablations demonstrating that the only systematic difference is adaptivity, attribution of the new-task advantage to reduced bias is not yet load-bearing.
[Abstract] The paper reports that the adaptive system performed better on many new tasks, but supplies no statistical tests, number of runs, or variance measures in the provided abstract. This makes it impossible to judge whether the reported advantage is robust or could be explained by incidental differences introduced by the adaptive components (e.g., implicit normalization or altered credit assignment).
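
The reporting asked for in the second comment can be sketched as follows: a per-agent mean with standard error across seeds, plus a bootstrap interval on the difference between the two agents. This is a generic illustration under stated assumptions, not the authors' analysis; the function names and the resampling settings are hypothetical.

```python
import random
import statistics


def mean_and_stderr(scores):
    """Mean and standard error of final scores across independent seeds."""
    m = statistics.mean(scores)
    se = statistics.stdev(scores) / len(scores) ** 0.5
    return m, se


def bootstrap_diff_ci(a, b, n_resamples=2000, alpha=0.05, seed=0):
    """Percentile bootstrap interval for mean(a) - mean(b).

    a and b are per-seed scores of two agents on one task. An interval
    that excludes zero suggests the gap is not just seed noise.
    """
    rng = random.Random(seed)
    diffs = []
    for _ in range(n_resamples):
        resample_a = [rng.choice(a) for _ in a]
        resample_b = [rng.choice(b) for _ in b]
        diffs.append(statistics.mean(resample_a) - statistics.mean(resample_b))
    diffs.sort()
    lo = diffs[int((alpha / 2) * n_resamples)]
    hi = diffs[int((1 - alpha / 2) * n_resamples) - 1]
    return lo, hi
```

Calling `bootstrap_diff_ci(adaptive_scores, baseline_scores)` on the per-seed results for one task returns a lower and upper bound on the gap. With only a handful of seeds per task the interval will be wide, which is itself useful information for judging robustness.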

minor comments (1)

[Methods] Clarify in the methods section exactly which domain-specific components were replaced and cite the precise literature sources for each adaptive substitute.

Simulated Author's Rebuttal

2 responses · 0 unresolved

We thank the referee for the constructive comments. We address each major point below and will revise the manuscript accordingly where the concerns are valid.

Referee: [Abstract / Experiments] Abstract and experimental sections: the central generalization claim requires that the adaptive substitutions (drawn from the literature) preserve state/action interfaces, reward structure, optimization objective, and exploration mechanics exactly. The abstract notes performance decreases on original tasks, which is consistent with imperfect substitution; without explicit controls or ablations demonstrating that the only systematic difference is adaptivity, attribution of the new-task advantage to reduced bias is not yet load-bearing.

Authors: The adaptive substitutions were drawn from the literature specifically because they are designed as drop-in replacements that preserve the original state/action interfaces, reward structure, optimization objective, and exploration mechanics while removing only the domain-specific tuning or knowledge. The observed performance decrease on the original tasks is the expected cost of reduced bias rather than evidence of imperfect substitution. We will add an explicit section or appendix detailing the preserved elements for each substitution and any supporting ablations or controls that isolate the effect of adaptivity versus other incidental changes.
revision: partial
Referee: [Abstract] The paper reports that the adaptive system performed better on many new tasks, but supplies no statistical tests, number of runs, or variance measures in the provided abstract. This makes it impossible to judge whether the reported advantage is robust or could be explained by incidental differences introduced by the adaptive components (e.g., implicit normalization or altered credit assignment).

Authors: We agree that the abstract should report statistical details for transparency. The body of the paper already includes results over multiple runs with means and variances; we will revise the abstract to include the number of runs, variance information, and any statistical tests supporting the new-task advantages.
revision: yes

Circularity Check

0 steps flagged

No circularity: empirical comparison with no derivation chain

The paper reports an experimental study replacing domain-specific RL components with adaptive alternatives from the literature and measuring performance on original and new continuous-control tasks. No equations, fitted parameters, or first-principles derivations are present; the central claim rests on direct empirical outcomes rather than any reduction of a prediction to its own inputs by construction. Self-citations, if present, are not load-bearing for a mathematical result. The substitution-neutrality premise is an empirical assumption open to falsification by the experiments themselves, not a definitional or self-referential step.

Axiom & Free-Parameter Ledger

0 free parameters · 0 axioms · 0 invented entities

Abstract-only review supplies no explicit free parameters, axioms, or invented entities; the central claim rests on the unstated premise that the adaptive replacements are fair substitutes.

Lean theorems connected to this paper

Citations machine-checked in the Pith Canon. Every link opens the source theorem in the public Lean library.

Foundation/BranchSelection.lean branch_selection echoes

ECHOES: this paper passage has the same mathematical shape or conceptual pattern as the Recognition theorem, but is not a direct formal dependency.

We investigated whether the performance deteriorates when these components are replaced with adaptive solutions from the literature... the system with adaptive components performed better on many of the new tasks.
Foundation/AbsoluteFloorClosure.lean absolute_floor_iff_bare_distinguishability echoes

the same algorithm could master other games, such as Shogi and Chess... removing these domain heuristics

What do these tags mean?

matches: The paper's claim is directly supported by a theorem in the formal canon.
supports: The theorem supports part of the paper's argument, but the paper may add assumptions or extra steps.
extends: The paper goes beyond the formal theorem; the theorem is a base layer rather than the whole result.
uses: The paper appears to rely on the theorem as machinery.
contradicts: The paper's claim conflicts with a theorem or certificate in the canon.
unclear: Pith found a possible connection, but the passage is too broad, indirect, or ambiguous to say the theorem truly supports the claim.

Reference graph

Works this paper leans on

10 extracted references · 10 canonical work pages · 6 internal anchors

[1]

DeepMind Lab

C. Beattie, J. Z. Leibo, D. Teplyashin, T. Ward, M. Wainwright, H. Küttler, A. Lefrancq, S. Green, V. Valdés, A. Sadik, J. Schrittwieser, K. Anderson, S. York, M. Cant, A. Cain, A. Bolton, S. Gaffney, H. King, D. Hassabis, S. Legg, and S. Petersen. Deepmind lab. CoRR, abs/1612.03801,

[2]

M. G. Bellemare, G. Ostrovski, A. Guez, P. S. Thomas, and R. Munos. Increasing the action gap: New operators for reinforcement learning. CoRR, abs/1512.04860,

[3]

Y. Bengio. Gradient-based optimization of hyperparameters. Neural computation, 12(8):1889–1900,

[5]

URL http://arxiv.org/abs/1802.10217. L. Espeholt, H. Soyer, R. Munos, K. Simonyan, V. Mnih, T. Ward, Y. Doron, V. Firoiu, T. Harley, I. Dunning, S. Legg, and K. Kavukcuoglu. Impala: Scalable distributed deep-rl with importance weighted actor-learner architectures. In International Conference on Machine Learning,

[6]

Innateness, AlphaZero, and Artificial Intelligence

Marcus. Innateness, alphazero, and artificial intelligence. CoRR, abs/1801.05667,

[8]

On the difficulty of training Recurrent Neural Networks

URL http://arxiv.org/abs/1211.5063. Silver, Hubert, Schrittwieser, Antonoglou, Lai, Guez, Lanctot, Sifre, Kumaran, Graepel, Lillicrap, Simonyan, and Hassabis. Mastering chess and shogi by self-play with a general reinforcement learning algorithm. CoRR, abs/1712.01815,

[9]

H. van Hasselt, A. Guez, and D. Silver. Deep reinforcement learning with double Q-learning. In AAAI Conference on Artificial Intelligence, pages 2094–2100,

[10]

Meta-Gradient Reinforcement Learning

Xu, van Hasselt, and Silver. Meta-gradient reinforcement learning. CoRR, abs/1805.09801,

[11]

Appendix A. Training Details. We performed very limited tuning on Atari, both due to the cost of running so many comparison with 8 seeds at scale across 57 games, and because we were interested in generalization to a different domain. We used a learning rate of 1e-3, an entropy cost of 0.01 and a baseli...

[12]

No additional tuning was performed for any of the experiments on the Control Suite. B. Experiment Details. In Figure 4 we report the detailed learning curves for all Atari games for three distinct agents: the fully adaptive agent (in red), the agent with fixed action repeats (in green), and the agent acting at the fastest timescale (in blue). It’s interesti...
