Accelerating Q-learning through Efficient Value-Sharing across Actions

Brett Daley; Marlos C. Machado; Martha White; Prabhat Nagarajan

arxiv: 2606.29806 · v1 · pith:BXZNO67Hnew · submitted 2026-06-29 · 💻 cs.LG · cs.AI

Prabhat Nagarajan, Brett Daley, Martha White, Marlos C. Machado

Pith reviewed 2026-06-30 07:16 UTC · model grok-4.3

classification 💻 cs.LG cs.AI

keywords: mean-expansion layer, Q-learning, action-value learning, deep reinforcement learning, value overestimation, Atari games

The pith

The mean-expansion layer accelerates action-value learning by sharing values across actions within each state.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

This paper proposes the mean-expansion layer to address slow learning of action-values in Q-learning. Standard algorithms update each state-action pair independently and must move values from near-zero initialization to their true magnitudes. The new layer instead shares a common value component across actions in the same state and recasts the learning target as a lower-norm representation. When inserted as a parameter-free module into existing deep Q-networks, the change produces higher aggregate scores on Atari games along with larger action gaps and less value overestimation.

Core claim

The mean-expansion layer accelerates action-value learning by sharing values across actions within a state and by changing the problem from directly learning potentially large action-values to learning a lower-norm representation of them. In deep RL, this layer can be applied as a parameter-free addition to Q-network architectures without altering the underlying algorithm. Applied to deep Q-networks and implicit quantile networks, it improves aggregate performance across 57 Atari games while increasing action gaps and dramatically reducing value overestimation.

What carries the argument

The mean-expansion layer, which shares a mean value component across actions in a state while learning a lower-norm residual representation of the action values.
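The arithmetic in the Figure 1 caption can be reproduced in a few lines. The following is a minimal sketch, assuming the layer computes q = (I + (k/n)J)z as that caption states, with J the all-ones matrix and n the number of actions. The function name `mean_expansion` is hypothetical and is not taken from the paper's code.

```python
def mean_expansion(z, k):
    """Sketch of q = (I + (k/n) J) z, which equals z + k * mean(z) per action.

    Assumption: this follows the formula in the Figure 1 caption; the paper's
    actual implementation may differ in detail.
    """
    n = len(z)
    mean = sum(z) / n        # projection of z onto the all-ones direction
    baseline = k * mean      # implicit baseline shared by every action
    return [z_i + baseline for z_i in z]


# Numbers from the Figure 1 caption: z = (3, 1), k = 2.
q = mean_expansion([3.0, 1.0], k=2.0)
print(q)  # [7.0, 5.0]
```

Because every action receives the same baseline, differences between actions are identical in z and q for a fixed input. With k = 0 the layer reduces to the identity.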

If this is right

Aggregate performance improves across 57 Atari games when the layer is added to DQN and IQN.
Action gaps become larger after the layer is introduced.
Value overestimation is dramatically reduced.
The layer works without any change to the underlying Q-learning algorithm.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the authors make directly.

The same layer could be tested on other methods that learn action-values, such as SARSA or actor-critic variants.
Lower-norm targets might reduce the need for target networks or other stabilization tricks in deep RL.
The sharing mechanism could be generalized to continuous action spaces.

Load-bearing premise

The mean-expansion layer can be inserted into Q-networks as a parameter-free addition without changing the rest of the learning algorithm.

What would settle it

Running the same DQN and IQN agents on the 57 Atari games with and without the mean-expansion layer and finding no improvement in aggregate score or no reduction in value overestimation.
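The aggregate score in this comparison is, per the Figure 3 caption, the interquartile mean of human-normalized scores. The sketch below shows both quantities under common definitions: human normalization as (score - random) / (human - random), and the interquartile mean as the mean of the middle 50% of values. The function names and all numbers are made-up placeholders for illustration, not results from the paper, and the stratified bootstrap confidence interval is not reproduced here.

```python
def human_normalized(score, random_score, human_score):
    """Rescale a raw game score so that random play is 0 and human play is 1."""
    return (score - random_score) / (human_score - random_score)


def interquartile_mean(values):
    """Mean of the middle 50% of values (drops the lowest and highest 25%)."""
    ordered = sorted(values)
    cut = int(0.25 * len(ordered))          # number of values trimmed per side
    middle = ordered[cut:len(ordered) - cut]
    return sum(middle) / len(middle)


# Placeholder normalized scores with one extreme outlier (not real data).
runs = [0.0, 1.0, 2.0, 3.0, 4.0, 5.0, 6.0, 100.0]
print(interquartile_mean(runs))             # 3.5
print(sum(runs) / len(runs))                # 15.125, dominated by the outlier
print(human_normalized(150.0, 50.0, 250.0)) # 0.5
```

The interquartile mean is less sensitive than the plain mean to a single game with an extreme score, which is one reason it is used for Atari aggregates.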

Figures

Figures reproduced from arXiv: 2606.29806 by Brett Daley, Marlos C. Machado, Martha White, Prabhat Nagarajan.

**Figure 1.** The mean-expansion layer. The input vector z = (3, 1) is projected onto the all-ones vector to produce the mean component (2, 2). This mean component is scaled by k, where k = 2, to produce the implicit baseline vector (k/n)Jz = (4, 4). The input vector z, which also serves as the residual vector, is added to this implicit baseline to produce the output q = (I + (k/n)J)z = (7, 5). The mean vector of q is (…

**Figure 2.** Gridworld results. We compare IBQ with different values of k, including k = 0, which is Q-learning. We report the percentage increase in episode completions across four sample complexity regimes. Shaded regions correspond to a 95% confidence interval. For most k, IBQ(k) can complete over 20% more episodes than Q-learning within 1k timesteps. Both algorithms quickly master the task and their gap decrease…

**Figure 3.** (left) Different algorithms and the interquartile mean of their human-normalized score across 57 games. All algorithms were run for five seeds per game. The shaded region depicts the 95% stratified bootstrap confidence interval (Agarwal et al., 2021). Dashed lines indicate the use of the mean-expansion layer. (right) The increase in human-normalized score, measured as the average area-under-the-curve, when…

**Figure 4.** (left) The percentage reduction in overestimation when using IB-DQN over DQN, as a percentage of DQN’s average overestimation area under the curve. In all games, IB-DQN reduces overestimation over DQN. (right) The increase in relative action gap from using IB-DQN instead of DQN. In the vast majority of games, IB-DQN increases the relative action gap compared to DQN. Values are clipped at 0.01 for visibilit…

**Figure 5.** Sensitivity Analysis of k. (left) The average area under the curve (AUC) of the human-normalized score for several values of k (log scale) on Atari 2600 games, where the shaded region depicts the standard deviation across five seeds. (center) Similar to the left figure, but for LunarLander-v3, where the shaded region depicts a 95% confidence interval for the mean across 120 seeds. (right) The corresponding…

**Figure 6.** The ME layer with RMSprop and the Huber loss. The plot shows the interquartile mean of DQN with and without the ME layer across 55 games. All algorithms were run for three seeds per game. The shaded region depicts the 95% stratified bootstrap confidence interval (Agarwal et al., 2021). To examine how the ME layer impacts performance in deep RL outside of the Adam optimizer, we also tested with the RMSprop…

**Figure 7.** Mean score across 50M timesteps over five seeds per game. For readability, plots are smoothed with a moving average of seven.

**Figure 8.** Mean overestimation across 50M timesteps across five seeds. Overestimation capped at 25 for visibility. Translucent curves are the individual seeds.

Original abstract

Action-values are foundational to many control algorithms such as Q-learning. Therefore learning action-values efficiently is central to reinforcement learning (RL). However, learning them can be slow, requiring many updates to move values from their initialization, typically near zero, to their true values, which may be far from zero. Moreover, action-value learning algorithms typically update each state-action pair independently, without learning shared value structure across actions within a state. In this paper, we address these inefficiencies by introducing the mean-expansion layer, which accelerates action-value learning by sharing values across actions within a state and by changing the problem from directly learning potentially large action-values to learning a lower-norm representation of them. In deep RL, this layer can be applied as a parameter-free addition to Q-network architectures without altering the underlying algorithm. Applied to deep Q-networks and implicit quantile networks, it improves aggregate performance across 57 Atari games while increasing action gaps and dramatically reducing value overestimation.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Desk Editor's Note (private letter to a colleague)

The mean-expansion layer is a clean architectural addition for sharing mean values across actions in Q-networks, delivering measurable Atari gains and lower overestimation without apparent algorithm changes.

The main point is that this paper adds a mean-expansion layer to Q-networks so the network learns a lower-norm representation while the mean value gets shared across actions in each state. That directly tackles slow movement from zero initialization and independent per-action updates in standard Q-learning.
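To make the two effects in this paragraph concrete, here is a tabular sketch of a single temporal-difference update taken through the layer. It assumes the layer is q = z + k * mean(z), as in the Figure 1 caption, and that the stored values z are updated by a semi-gradient step on the TD error. That update rule is an assumption made for illustration and is not the paper's pseudocode; the names `expand` and `td_update` and the dictionary-of-lists table are hypothetical.

```python
def expand(z, k):
    """Layer output q = z + k * mean(z) for one state's stored values z."""
    mean = sum(z) / len(z)
    return [z_i + k * mean for z_i in z]


def td_update(z_table, s, a, r, s_next, done, k, alpha=0.1, gamma=0.99):
    """One Q-learning-style step on z (assumed semi-gradient form)."""
    q_s = expand(z_table[s], k)
    target = r if done else r + gamma * max(expand(z_table[s_next], k))
    delta = target - q_s[a]                 # TD error, computed on q
    n = len(z_table[s])
    for b in range(n):
        # d q[a] / d z[b] = 1 if b == a else 0, plus k / n from the mean term
        grad = (1.0 if b == a else 0.0) + k / n
        z_table[s][b] += alpha * delta * grad
    return delta


# Same transition (reward 1, terminal) from zero initialization, two actions.
plain = {0: [0.0, 0.0], 1: [0.0, 0.0]}
shared = {0: [0.0, 0.0], 1: [0.0, 0.0]}
td_update(plain, 0, 0, 1.0, 1, True, k=0.0)   # k = 0: ordinary Q-learning
td_update(shared, 0, 0, 1.0, 1, True, k=2.0)
print(expand(plain[0], 0.0))    # [0.1, 0.0]: only the taken action moves
print(expand(shared[0], 2.0))   # roughly [0.5, 0.4]: both actions move
```

In this sketch the k = 0 case updates only the taken action, while k = 2 moves both action-values away from zero in one step. Part of that speedup comes from a larger effective step along the mean direction, so a fair comparison would also tune the step size for the baseline.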

The work does what it sets out to do on the empirical side. They plug the layer into DQN and IQN, run the usual 57-game Atari suite, and report better aggregate scores, bigger action gaps, and sharply reduced value overestimation. Those are concrete, reproducible outcomes on a standard benchmark, and the parameter-free claim means the gains come from the representation change rather than extra learned parameters or loss terms. The scale k, whose sensitivity Figure 5 examines, is a fixed hyperparameter rather than a learned one.

The soft spot is the integration mechanics. The abstract says the layer leaves the underlying algorithm untouched, but confirming this requires the full equations and pseudocode: exactly how the mean is computed, how it flows into the targets, and whether any normalization or gradient sharing occurs. If the paper shows that the forward and backward passes stay identical to vanilla Q-learning except for the layer output, the central claim holds; otherwise the results could partly reflect an implicit tweak. Literature placement also matters: prior overestimation fixes such as double Q-learning or clipped targets should be compared directly.

This is for people who modify value-network architectures in deep RL and want simple, drop-in changes that show up in control-task metrics. A reader already running Atari experiments would get immediate value from the numbers.

Send it to peer review. The experiments are on the right benchmark and the idea is narrow enough to evaluate cleanly.

Referee Report

2 major / 0 minor

Summary. The paper introduces the mean-expansion layer to accelerate Q-learning by sharing action values across actions within each state and by reformulating the problem as learning a lower-norm representation of the action-values. The layer is described as a parameter-free architectural addition to Q-networks (including DQN and IQN) that leaves the underlying Q-learning update rule, target computation, and loss unchanged. The manuscript claims that this yields improved aggregate performance across 57 Atari games, larger action gaps, and substantially reduced value overestimation.

Significance. If the empirical claims hold and the layer truly functions as a pure architectural change, the work would offer a lightweight, broadly applicable improvement to value-based deep RL that directly targets known inefficiencies in independent per-action updates and slow convergence from zero initialization. Demonstrating gains on the full Atari suite while also reporting secondary metrics (action gaps, overestimation) would strengthen its potential impact on algorithm design.

major comments (2)

[Abstract] The central claim that performance gains arise from value-sharing via a parameter-free addition 'without altering the underlying algorithm' is load-bearing. The manuscript must show (e.g., in the method section) the exact forward and backward pass through the layer and confirm that the Q-target, TD error, and gradient flow remain identical to the baseline; otherwise the observed improvements could stem from an implicit normalization or sharing effect rather than the intended mechanism.
The absence of any derivation or pseudocode for the mean-expansion layer in the provided text leaves the 'lower-norm representation' claim unverified; if the layer simply computes a mean and expands it, the reduction in norm must be shown to follow directly from the architecture rather than from training dynamics.

Simulated Author's Rebuttal

2 responses · 0 unresolved

We thank the referee for the constructive comments and the recommendation for major revision. We address each major comment below and will incorporate the requested clarifications and derivations into the revised manuscript.

Referee: [Abstract] The central claim that performance gains arise from value-sharing via a parameter-free addition 'without altering the underlying algorithm' is load-bearing. The manuscript must show (e.g., in the method section) the exact forward and backward pass through the layer and confirm that the Q-target, TD error, and gradient flow remain identical to the baseline; otherwise the observed improvements could stem from an implicit normalization or sharing effect rather than the intended mechanism.

Authors: We agree that the method section requires explicit verification of the forward and backward passes. In the revision we will add the precise formulation: given network outputs Q(s,·), the mean-expansion layer computes the state-wise mean m and produces outputs whose per-action deviations from m are learned by the network while m itself is shared. Because the layer is parameter-free and its Jacobian is the identity (up to the mean subtraction which cancels in the gradient), the Q-target, TD error, and loss are mathematically identical to the baseline. Gradient flow through the layer is unchanged, ensuring that any performance difference arises from the altered representation rather than an implicit algorithmic modification. Pseudocode will be included. revision: yes
Referee: [—] The absence of any derivation or pseudocode for the mean-expansion layer in the provided text leaves the 'lower-norm representation' claim unverified; if the layer simply computes a mean and expands it, the reduction in norm must be shown to follow directly from the architecture rather than from training dynamics.

Authors: We acknowledge the current text lacks an explicit derivation. The layer reformulates the learning target so that the network directly optimizes a zero-mean deviation vector whose Euclidean norm is provably smaller than that of the original action-value vector (by the property that ||v - mean(v)||_2 ≤ ||v||_2). We will insert both the algebraic derivation and layer pseudocode in the methods section to demonstrate that the norm reduction is an immediate architectural consequence, independent of training dynamics. revision: yes
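
The two responses above can be checked against the one formula this page does reproduce, the Figure 1 caption's q = (I + (k/n)J)z. The derivation below is a sketch under that assumption only, with J the all-ones matrix, n the number of actions, and k ≥ 0; the symbols \bar{q} and \tilde{q} are introduced here for the mean and the mean-zero part of q and are not the paper's notation.

```latex
% Forward pass and Jacobian (assuming the Figure 1 formula)
q = \Bigl(I + \tfrac{k}{n} J\Bigr) z = z + k\,\bar{z}\,\mathbf{1},
\qquad
\frac{\partial q}{\partial z} = I + \tfrac{k}{n} J .

% Inverse map: the representation z that produces a given q
\bar{q} = (1 + k)\,\bar{z}
\;\;\Longrightarrow\;\;
z = q - \tfrac{k}{1+k}\,\bar{q}\,\mathbf{1} .

% Norm comparison, writing q = \tilde{q} + \bar{q}\,\mathbf{1} with \tilde{q} mean-zero
\|q\|_2^2 = \|\tilde{q}\|_2^2 + n\,\bar{q}^{\,2},
\qquad
\|z\|_2^2 = \|\tilde{q}\|_2^2 + \frac{n\,\bar{q}^{\,2}}{(1+k)^2}
\;\le\; \|q\|_2^2 .

% Check with Figure 1: q = (7, 5), k = 2, n = 2
\bar{q} = 6,\quad z = (7, 5) - \tfrac{2}{3}\cdot 6\cdot(1, 1) = (3, 1),\quad
\|z\|_2^2 = 10 \le 74 = \|q\|_2^2 .
```

Under this reading the norm reduction is an architectural consequence, as the second response says, but two details differ from the wording above: z keeps a mean component shrunk by a factor of 1/(1+k) rather than being exactly zero-mean, and the layer's Jacobian is I + (k/n)J, which equals the identity only when k = 0. The full paper would be needed to settle which description matches the authors' method.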

Circularity Check

0 steps flagged

No circularity: mean-expansion layer is a novel parameter-free architectural addition validated empirically

full rationale

The paper introduces the mean-expansion layer as a new component that shares values across actions and reduces the learning problem to a lower-norm representation. This is presented as a direct architectural modification to Q-networks without changing the underlying Q-learning update rule or loss. Performance gains on 57 Atari games are shown via experiments on DQN and IQN, with no derivation chain, fitted parameters renamed as predictions, or self-citation load-bearing steps. The abstract and described method contain no self-definitional reductions or uniqueness theorems imported from prior author work. The result is self-contained against external benchmarks.

Axiom & Free-Parameter Ledger

0 free parameters · 0 axioms · 1 invented entities

Abstract-only review; no equations, parameters, or background assumptions are detailed enough to populate the ledger.

invented entities (1)

mean-expansion layer (no independent evidence)
purpose: share values across actions within a state and learn lower-norm representations
Introduced as the central new component in the abstract.

Reference graph

Works this paper leans on

38 extracted references · 1 canonical work pages · 1 internal anchor

[1] PyTorch: An imperative style, high-performance deep learning library. Neural Information Processing Systems (NeurIPS).
[2] Addressing function approximation error in actor-critic methods. International Conference on Machine Learning (ICML).
[3] Lecture 6e-rmsprop: Divide the gradient by a running average of its recent magnitude. Coursera: Neural Networks for Machine Learning.
[4] Yasuhiro Fujita, Prabhat Nagarajan, Toshiki Kataoka, and Takahiro Ishikawa. Journal of Machine Learning Research (JMLR).
[5] Diederik P. Kingma and Jimmy Ba. International Conference on Learning Representations (ICLR).
[6] Gymnasium: A standard interface for reinforcement learning environments. arXiv preprint arXiv:2407.17032.
[7] Human-level control through deep reinforcement learning. Nature.
[8] Dueling network architectures for deep reinforcement learning. International Conference on Machine Learning (ICML).
[9] Advantage updating (Technical Report WL-TR-93-1146). Wright Laboratory.
[10] Yunhao Tang and R. Munos … International Conference on Machine Learning (ICML).
[11] An analysis of action-value temporal-difference methods that learn state values. Reinforcement Learning Journal (RLJ).
[12] Matthew Aitchison, Penny Sweetser, and Marcus Hutter.
[13] Kenny Young and Tian Tian. Minatar: An …
[14] Implicit quantile networks for distributional reinforcement learning. International Conference on Machine Learning (ICML).
[15] Hado van Hasselt, Arthur Guez, and David Silver.
[16] Rainbow: Combining improvements in deep reinforcement learning. AAAI Conference on Artificial Intelligence (AAAI).
[17] Marc G. Bellemare, Yavar Naddaf, Joel Veness, and Michael Bowling. The Arcade Learning Environment: …
[18] Marlos C. Machado, Marc G. Bellemare, Erik Talvitie, Joel Veness, Matthew Hausknecht, and Michael Bowling. Revisiting the Arcade Learning Environment: …
[19] Prabhat Nagarajan, Martha White, and Marlos C. Machado.
[20] Christopher John Cornish Hellaby Watkins and Peter Dayan. 1992.
[21] Christopher John Cornish Hellaby Watkins.
[22] Reinforcement learning for robots using neural networks. 1992.
[23] On-line Q-learning using connectionist systems. 1994.
[24] When the best move isn't optimal: Q-learning with exploration. AAAI Conference on Artificial Intelligence (AAAI).
[25] Action-gap phenomenon in reinforcement learning. Neural Information Processing Systems (NeurIPS).
[26] Increasing the action gap: New operators for reinforcement learning. AAAI Conference on Artificial Intelligence (AAAI).
[27] John Quan and Georg Ostrovski.
[28] Reward centering. Reinforcement Learning Journal (RLJ).
[29] Beyond variance reduction: Understanding the true impact of baselines on policy optimization. International Conference on Machine Learning (ICML).
[30] Learning values across many orders of magnitude. Neural Information Processing Systems (NeurIPS).
[31] Hao Sun, Lei Han, Rui Yang, Xiaoteng Ma, Jian Guo, and Bolei Zhou. Exploit reward shifting in value-based deep- …
[32] Revisiting rainbow: Promoting more insightful and inclusive deep reinforcement learning research. International Conference on Machine Learning (ICML).
[33] A theory of steady-state activity in nerve-fiber networks: I. Definitions and preliminary lemmas. The Bulletin of Mathematical Biophysics.
[34] A distributional perspective on reinforcement learning. International Conference on Machine Learning (ICML).
[35] An optimistic perspective on offline reinforcement learning. International Conference on Machine Learning (ICML).
[36] Deep reinforcement learning at the edge of the statistical precipice. Neural Information Processing Systems (NeurIPS).
[37] Reinforcement learning through gradient descent. 1999.
[38] Simple statistical gradient-following algorithms for connectionist reinforcement learning. Machine Learning.
