Accelerating Q-learning through Efficient Value-Sharing across Actions
Pith reviewed 2026-06-30 07:16 UTC · model grok-4.3
The pith
The mean-expansion layer accelerates action-value learning by sharing values across actions within each state.
A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.
Core claim
The mean-expansion layer accelerates action-value learning by sharing values across actions within a state and by changing the problem from directly learning potentially large action-values to learning a lower-norm representation of them. In deep RL, this layer can be applied as a parameter-free addition to Q-network architectures without altering the underlying algorithm. Applied to deep Q-networks and implicit quantile networks, it improves aggregate performance across 57 Atari games while increasing action gaps and dramatically reducing value overestimation.
What carries the argument
The mean-expansion layer, which shares a mean value component across actions in a state while learning a lower-norm residual representation of the action values.
If this is right
- Aggregate performance improves across 57 Atari games when the layer is added to DQN and IQN.
- Action gaps become larger after the layer is introduced.
- Value overestimation is dramatically reduced.
- The layer works without any change to the underlying Q-learning algorithm.
Where Pith is reading between the lines
- The same layer could be tested on other methods that learn action-values, such as SARSA or actor-critic variants with a Q-function critic.
- Lower-norm targets might reduce the need for target networks or other stabilization tricks in deep RL.
- The sharing mechanism could be generalized from discrete to continuous action spaces.
Load-bearing premise
The mean-expansion layer can be inserted into Q-networks as a parameter-free addition without changing the rest of the learning algorithm.
What would settle it
Running the same DQN and IQN agents on the 57 Atari games with and without the mean-expansion layer and finding no improvement in aggregate score or no reduction in value overestimation.
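The two secondary metrics in this comparison can be computed per state from logged data. The sketch below assumes the usual definitions (action gap as the difference between the largest and second-largest action-value, overestimation as the predicted greedy value minus the realized discounted return). The paper's exact measurement protocol is not given in the text above, and the function names and numbers are illustrative only.

```python
def action_gap(q_values):
    """Gap between the largest and second-largest action-value in one state."""
    if len(q_values) < 2:
        raise ValueError("need at least two actions")
    top, second = sorted(q_values, reverse=True)[:2]
    return top - second


def discounted_return(rewards, gamma):
    """Discounted sum of the rewards observed from a state onward."""
    g = 0.0
    for r in reversed(rewards):
        g = r + gamma * g
    return g


def overestimation(q_values, rewards, gamma):
    """Predicted greedy value minus the realized discounted return (assumed definition)."""
    return max(q_values) - discounted_return(rewards, gamma)


# Made-up numbers for a single state with three actions.
q = [1.0, 3.0, 2.5]
print(action_gap(q))                            # 0.5
print(overestimation(q, [1.0, 1.0, 1.0], 0.5))  # 1.25
```

Averaging these quantities over visited states, with and without the layer, would give the comparison described above.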
Original abstract
Action-values are foundational to many control algorithms such as Q-learning. Therefore learning action-values efficiently is central to reinforcement learning (RL). However, learning them can be slow, requiring many updates to move values from their initialization, typically near zero, to their true values, which may be far from zero. Moreover, action-value learning algorithms typically update each state-action pair independently, without learning shared value structure across actions within a state. In this paper, we address these inefficiencies by introducing the mean-expansion layer, which accelerates action-value learning by sharing values across actions within a state and by changing the problem from directly learning potentially large action-values to learning a lower-norm representation of them. In deep RL, this layer can be applied as a parameter-free addition to Q-network architectures without altering the underlying algorithm. Applied to deep Q-networks and implicit quantile networks, it improves aggregate performance across 57 Atari games while increasing action gaps and dramatically reducing value overestimation.
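The motivation in the abstract (values start near zero, true values lie far from zero, and each action is updated on its own) can be illustrated with a single-state toy problem. The shared form used here, Q[a] = w[a] + mean(w), is a hypothetical parametrization chosen only to show what sharing across actions can do. It is not taken from the paper, whose layer may be defined differently, and the targets, step size, and tolerance are made up.

```python
def updates_until_close(targets, shared, lr=0.1, tol=1.0, max_steps=10_000):
    """Count single-action updates until every value is within tol of its target.

    shared=False: Q[a] = w[a]             (each action learned independently)
    shared=True:  Q[a] = w[a] + mean(w)   (hypothetical shared-mean form)
    """
    n = len(targets)
    w = [0.0] * n  # start at zero, as in the abstract's motivation

    def q_values():
        m = sum(w) / n if shared else 0.0
        return [wi + m for wi in w]

    for step in range(1, max_steps + 1):
        a = (step - 1) % n                 # visit actions in a fixed cycle
        err = targets[a] - q_values()[a]   # error on the visited action only
        for j in range(n):
            # dQ[a]/dw[j]: 1 for j == a, plus 1/n through the shared mean
            grad = (1.0 if j == a else 0.0) + (1.0 / n if shared else 0.0)
            w[j] += lr * err * grad
        if max(abs(t - q) for t, q in zip(targets, q_values())) < tol:
            return step
    raise RuntimeError("did not converge within max_steps")


targets = [100.0, 101.0, 99.0, 102.0]  # made-up values, all far from zero
independent = updates_until_close(targets, shared=False)
with_sharing = updates_until_close(targets, shared=True)
print(independent, with_sharing)
```

In this toy setting the shared form needs fewer updates, because each update also moves the other actions toward their common offset. It says nothing about the Atari results.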
Editorial analysis
A structured set of objections, weighed in public.
Referee Report
Summary. The paper introduces the mean-expansion layer to accelerate Q-learning by sharing action values across actions within each state and by reformulating the problem as learning a lower-norm representation of the action-values. The layer is described as a parameter-free architectural addition to Q-networks (including DQN and IQN) that leaves the underlying Q-learning update rule, target computation, and loss unchanged. The manuscript claims that this yields improved aggregate performance across 57 Atari games, larger action gaps, and substantially reduced value overestimation.
Significance. If the empirical claims hold and the layer truly functions as a pure architectural change, the work would offer a lightweight, broadly applicable improvement to value-based deep RL that directly targets known inefficiencies in independent per-action updates and slow convergence from zero initialization. Demonstrating gains on the full Atari suite while also reporting secondary metrics (action gaps, overestimation) would strengthen its potential impact on algorithm design.
major comments (2)
- [Abstract] The central claim that performance gains arise from value-sharing via a parameter-free addition 'without altering the underlying algorithm' is load-bearing. The manuscript must show (e.g., in the method section) the exact forward and backward pass through the layer and confirm that the Q-target, TD error, and gradient flow remain identical to the baseline; otherwise the observed improvements could stem from an implicit normalization or sharing effect rather than the intended mechanism.
- The absence of any derivation or pseudocode for the mean-expansion layer in the provided text leaves the 'lower-norm representation' claim unverified; if the layer simply computes a mean and expands it, the reduction in norm must be shown to follow directly from the architecture rather than from training dynamics.
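A check of the kind requested in the first comment can be sketched numerically. The layer below is a stand-in that adds the per-state mean to every output, chosen only so that there is something to differentiate. The manuscript's layer is not defined in the text reviewed here, and `hypothetical_layer` and `numeric_jacobian` are names introduced for this sketch.

```python
def hypothetical_layer(x):
    """Stand-in layer (an assumption): add the per-state mean to every output."""
    m = sum(x) / len(x)
    return [xi + m for xi in x]


def numeric_jacobian(layer, x, eps=1e-6):
    """Central-difference Jacobian, jac[i][j] = d layer(x)[i] / d x[j]."""
    n = len(x)
    jac = [[0.0] * n for _ in range(n)]
    for j in range(n):
        hi = list(x)
        lo = list(x)
        hi[j] += eps
        lo[j] -= eps
        y_hi, y_lo = layer(hi), layer(lo)
        for i in range(n):
            jac[i][j] = (y_hi[i] - y_lo[i]) / (2 * eps)
    return jac


x = [4.0, 1.0, -2.0, 5.0]
J = numeric_jacobian(hypothetical_layer, x)
for row in J:
    print([round(v, 3) for v in row])
```

For this assumed form the Jacobian is I + (1/n) 1 1^T rather than the identity, so a TD error on one action also changes the pre-layer outputs of the other actions. Whether the same holds for the paper's layer depends on its actual definition, which is what the comment asks the authors to state.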
Simulated Author's Rebuttal
We thank the referee for the constructive comments and the recommendation for major revision. We address each major comment below and will incorporate the requested clarifications and derivations into the revised manuscript.
- Referee: [Abstract] The central claim that performance gains arise from value-sharing via a parameter-free addition 'without altering the underlying algorithm' is load-bearing. The manuscript must show (e.g., in the method section) the exact forward and backward pass through the layer and confirm that the Q-target, TD error, and gradient flow remain identical to the baseline; otherwise the observed improvements could stem from an implicit normalization or sharing effect rather than the intended mechanism.
Authors: We agree that the method section requires explicit verification of the forward and backward passes. In the revision we will add the precise formulation: given network outputs Q(s,·), the mean-expansion layer computes the state-wise mean m and produces outputs whose per-action deviations from m are learned by the network while m itself is shared. Because the layer is parameter-free and its Jacobian is the identity (up to the mean subtraction which cancels in the gradient), the Q-target, TD error, and loss are mathematically identical to the baseline. Gradient flow through the layer is unchanged, ensuring that any performance difference arises from the altered representation rather than an implicit algorithmic modification. Pseudocode will be included. revision: yes
- Referee: The absence of any derivation or pseudocode for the mean-expansion layer in the provided text leaves the 'lower-norm representation' claim unverified; if the layer simply computes a mean and expands it, the reduction in norm must be shown to follow directly from the architecture rather than from training dynamics.
Authors: We acknowledge the current text lacks an explicit derivation. The layer reformulates the learning target so that the network directly optimizes a zero-mean deviation vector whose Euclidean norm is provably no larger than that of the original action-value vector, and strictly smaller whenever the mean is nonzero (by the property that ||v - mean(v)||_2 ≤ ||v||_2). We will insert both the algebraic derivation and layer pseudocode in the methods section to demonstrate that the norm reduction is an immediate architectural consequence, independent of training dynamics. revision: yes
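The inequality cited in the second response follows from an orthogonal decomposition. A short derivation, writing v for the length-n action-value vector of one state, \bar{v} for its mean, and \mathbf{1} for the all-ones vector (notation assumed here, not taken from the paper):

```latex
\begin{aligned}
v &= (v - \bar{v}\,\mathbf{1}) + \bar{v}\,\mathbf{1},
  \qquad \bar{v} = \tfrac{1}{n}\,\mathbf{1}^\top v, \\
(v - \bar{v}\,\mathbf{1})^\top \mathbf{1} &= \mathbf{1}^\top v - n\,\bar{v} = 0, \\
\|v\|_2^2 &= \|v - \bar{v}\,\mathbf{1}\|_2^2 + n\,\bar{v}^2
  \;\ge\; \|v - \bar{v}\,\mathbf{1}\|_2^2 .
\end{aligned}
```

Equality holds only when \bar{v} = 0, so the difference between the two squared norms is n\bar{v}^2 and grows with the magnitude of the shared mean. This establishes the inequality itself; whether the paper's layer actually learns this deviation vector is a separate question that the promised pseudocode would have to answer.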
Circularity Check
No circularity: mean-expansion layer is a novel parameter-free architectural addition validated empirically
The paper introduces the mean-expansion layer as a new component that shares values across actions and recasts the learning problem as learning a lower-norm representation. It is presented as a direct architectural modification to Q-networks that leaves the underlying Q-learning update rule and loss unchanged. Performance gains on 57 Atari games are shown via experiments on DQN and IQN; the argument involves no derivation chain, no fitted parameters renamed as predictions, and no load-bearing self-citations. The abstract and described method contain no self-definitional reductions or uniqueness theorems imported from prior author work. The result is self-contained against external benchmarks.
Axiom & Free-Parameter Ledger
invented entities (1)
- mean-expansion layer: no independent evidence