
arxiv: 2605.16318 · v1 · pith:FSC52N3Znew · submitted 2026-05-04 · 💻 cs.LG

Investigating Action Encodings in Recurrent Neural Networks in Reinforcement Learning

Pith reviewed 2026-05-20 23:22 UTC · model grok-4.3

classification 💻 cs.LG
keywords: recurrent neural networks, reinforcement learning, action encoding, state representation, partial observability

The pith

Different methods for feeding actions into recurrent cell updates produce measurable differences in reinforcement learning performance.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper examines several design options for including action information inside the state update step of recurrent neural networks used by reinforcement learning agents. It tests these options on a collection of simple domains to observe how each choice influences the agent's ability to build useful internal state and solve tasks. A sympathetic reader cares because real-world reinforcement learning often requires memory of past observations and actions, and the way actions are folded into the recurrence can determine whether that memory remains informative. The work treats these incorporation choices as an explicit axis of variation rather than an implementation detail.

Core claim

Several distinct ways exist to incorporate previous actions into the hidden-state update of a recurrent cell, and empirical comparisons of the resulting architectures on illustrative reinforcement learning domains reveal performance differences that arise from these choices.

What carries the argument

action incorporation into the recurrent state update function

If this is right

  • Recurrent agents can achieve different levels of success depending on whether actions are concatenated, added, or otherwise combined with observations inside the cell update.
  • Certain incorporation methods may prove more robust when observations are noisy or when actions have delayed effects.
  • Future recurrent architectures intended for reinforcement learning should treat the action pathway as a first-class design decision rather than a default concatenation.
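To make the difference between combination styles concrete, here is a minimal sketch of two update rules, assuming discrete one-hot actions, a tanh nonlinearity, and no bias terms. The function names (`additive_step`, `multiplicative_step`) and the weight layout are illustrative assumptions, not the authors' implementation.

```python
import math
import random


def matvec(W, v):
    """Multiply a matrix (list of rows) by a vector."""
    return [sum(w * x for w, x in zip(row, v)) for row in W]


def rand_matrix(rows, cols, rng, scale=0.1):
    """Small random weights; the scale is an arbitrary illustrative choice."""
    return [[rng.uniform(-scale, scale) for _ in range(cols)] for _ in range(rows)]


def one_hot(index, size):
    return [1.0 if i == index else 0.0 for i in range(size)]


def additive_step(W, h, obs, action, n_actions):
    """Additive update: the one-hot action is concatenated to the input, so it
    only shifts the pre-activation by an action-specific offset."""
    z = obs + one_hot(action, n_actions) + h
    return [math.tanh(p) for p in matvec(W, z)]


def multiplicative_step(Ws, h, obs, action):
    """Multiplicative update (assumed one-hot form): the action selects its own
    weight matrix, so it changes how observation and state are combined."""
    z = obs + h
    return [math.tanh(p) for p in matvec(Ws[action], z)]


rng = random.Random(0)
n_obs, n_actions, n_hidden = 3, 2, 4
W_add = rand_matrix(n_hidden, n_obs + n_actions + n_hidden, rng)
W_mul = [rand_matrix(n_hidden, n_obs + n_hidden, rng) for _ in range(n_actions)]

# Roll both cells over the same two (observation, action) pairs.
h_add = [0.0] * n_hidden
h_mul = [0.0] * n_hidden
for obs, action in [([1.0, 0.0, 0.0], 0), ([0.0, 1.0, 0.0], 1)]:
    h_add = additive_step(W_add, h_add, obs, action, n_actions)
    h_mul = multiplicative_step(W_mul, h_mul, obs, action)
```

In this sketch the multiplicative cell stores one weight matrix per action, so at equal hidden size it carries more parameters than the additive cell. That may be why the figure captions pair multiplicative cells with smaller hidden sizes, though the page does not say so explicitly.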

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

  • The same investigation could be repeated for other signals such as rewards or termination flags to see whether they produce comparable sensitivities.
  • If one encoding consistently wins on simple domains, it becomes a natural candidate for transfer to large-scale agents that already use recurrence.
  • The performance gaps may shrink or grow with network depth or with the length of the temporal dependencies the agent must capture.

Load-bearing premise

The simple illustrative domains used in the experiments capture the main difficulties recurrent networks encounter when deployed in practical reinforcement learning settings.

What would settle it

A follow-up study on a larger-scale partially observable control task in which every tested action incorporation method yields statistically indistinguishable final performance would undermine the claim that these design choices matter.
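One way to operationalize "statistically indistinguishable" is a permutation test on per-seed final performance. The sketch below is an assumption about how such a check could be run, not a procedure taken from the paper. The function name `permutation_p_value` is hypothetical and the scores are made up for illustration.

```python
import random
from statistics import mean


def permutation_p_value(a, b, n_resamples=10_000, seed=0):
    """Two-sided permutation test on the difference in mean final performance."""
    rng = random.Random(seed)
    observed = abs(mean(a) - mean(b))
    pooled = list(a) + list(b)
    hits = 0
    for _ in range(n_resamples):
        # Reassign scores to the two groups at random and recompute the gap.
        rng.shuffle(pooled)
        diff = abs(mean(pooled[:len(a)]) - mean(pooled[len(a):]))
        if diff >= observed:
            hits += 1
    # Add-one correction keeps the estimate strictly positive.
    return (hits + 1) / (n_resamples + 1)


# Made-up per-seed success rates, for illustration only.
encoding_a = [0.91, 0.88, 0.93, 0.90, 0.87, 0.92]
encoding_b = [0.62, 0.71, 0.58, 0.66, 0.69, 0.60]
p_different = permutation_p_value(encoding_a, encoding_b)
p_same = permutation_p_value(encoding_a, encoding_a)
```

A small p-value for every pair of encodings would support the paper's claim. Large p-values across the board on a harder task would be the undermining outcome described above, provided the number of seeds is large enough for the test to have power.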

Figures

Figures reproduced from arXiv: 2605.16318 by Adam White, Martha White, Matthew Schlegel, Volodymyr Tkachuk.

Figure 1: Learning Curves for various RNN cells in …

Figure 2: Visualizations of the multiplicative and additive RNNs. The dimensions of the weight matrices …

Figure 3: The illustrative environments used in Section 5.1 and Section 5.2 respectively. (left) The Ring World environment with 6 states is depicted, where the observation the agent receives is denoted in each of the circles, available actions denoted by the red arrows, and the agent's current location denoted by a double line. (right) The base TMaze environments are depicted with the available actions denoted be…

Figure 4: Ring World sensitivity curves of RMSVE over the final 50k steps for CELL (hidden size) …

Figure 6: Individual learning curves for the additive (hidden size of 15) and multiplicative (hidden size 12) …

Figure 5: Ring World predictions of seed = 62 for the multiplicative and additive RNNs. Discounts listed with the target policy persistently going counter-clockwise.

Figure 7: TSNE plots for the additive and multiplicative RNNs for truncation …

Figure 8: TSNE plots for the additive and multiplicative RNNs for truncation …

Figure 9: (left) Bakker's TMaze box plots and violin plots over the performance averaged over the final 10% with 50 independent runs. Trained over 300k steps with τ = 10. All GRUs use a state size 6, while RNNs use a state size 20. The deep additive used an action encoding of |a| = 4. (right) Directional TMaze comparison over the performance averaged over the final 10% of episodes with 100 independent runs trained o…

Figure 10: Sensitivity curves over number of factors M with standard error for the (top) FacRNN (30) and (bottom) FacGRU (17). All agents were trained over 300k steps. See Appendix F.3 for sweeps over different state sizes. We use the data generated by a sweep over the learning rate with 40 runs and compare to the data in figure 9. The red labels on the x-axis indicate when the network has the same number of param…

Figure 11: Two variants of combining cells. State size chosen based on procedures of previous environments. (top) Performance of success rates (left) TMaze with same basic parameters as above for CELL (hidden size): Softmax GRU (6), Cat GRU (6), Softmax RNN (20), Cat RNN (20). (right) Directional TMaze with same parameters as above for CELL (hidden size): Softmax GRU (8), Cat GRU (12), Softmax RNN (15), Cat RNN (2…

Figure 12: (left) Image Directional TMaze percent success over the final 10% of episodes for 20 runs for CELL (hidden size): AAGRU (70), MAGRU (32), DAGRU (45, |a| = 128). Using ADAM trained over 400k steps, τ = 20. GRU omitted due to prior performance. (center) Lunar Lander average reward over all episodes for CELL (hidden size): GRU (154), AAGRU (152), MAGRU (64), DAGRU (152, |a| = 64) and τ = 16. (right) Lu…

Figure 13: Average success over the intervention taking the go forward activation and starting in the eastward position. (Naive Strategy) Using the evaluated intervention over 60k steps for training, (Hand Designed) a sequence of hand designed interventions to build up to the final evaluation intervention over 60k steps.

Figure 14: Directional TMaze sweep over size of the gating network (i.e. how many units in a single hidden …

Figure 15: TSNE plots for multiplicative and additive RNNs for various number of training samples.

Figure 16: Online: (left + middle) Directional TMaze percent success in reaching the goal over the final 10% of episodes with 100 independent runs for CELL (hidden size): RNN (46), AARNN (46), MARNN (27), FacRNN (46) M = 24, GRU (26), AAGRU (26), MAGRU (15), FacGRU (26) M = 21. (right) Ring World learning curves over RMSVE with 100 independent runs for: RNN (20), AARNN (20), MARNN (15), GRU (12), AAGRU (12), MAGRU (…

Figure 17: Average number of steps to goal over trun …

Figure 18: (left) Resulting average success over the last 10% of episodes for various number of hidden units in the action encoding network for the deep additive networks with standard error intervals. Each layer (denoted by the title of the plot) contains the number of hidden denoted by the x-axis. (right) Comparing the deep multiplicative operation with the base cells used in the main text.

Figure 20: Truncation sensitivity curves for the Experience Replay setting in ring world. Results are RMSVE …

Figure 19: Ring World Hyperparameters

Figure 21: TMaze Experience Replay experiments: (top left) The hyperparameters used across all cells (bottom) The cell specific hyperparameters (top right) Percent success over the final 10% of episodes. Same as …

Figure 22: Directional TMaze Experience Replay results: …

Figure 23: Image Directional TMaze: (top) Percent success over final 10% of episodes for the image tmaze for τ = 12 and τ = 20 (labeled). See labels for size of hidden state with left being small networks, and right being larger. (bottom left) The hyperparameters used across all cells in Image Directional TMaze (bottom right) The cell specific hyperparameters.

Figure 24: Lunar Lander experimental details: (top left) The hyperparameters used across all cells in Lunar Lander (bottom) The cell specific hyperparameters (top right) Average final reward over all episodes (same as figure 12)

Figure 25: Lunar Lander further results: (top left) Average final reward over the final 10% of episodes for 20 runs (top middle) Total steps per episode for non-factored cells for 20 runs (top right) Total steps per episode for factored cells for 20 runs (bottom left) learning rate sensitivity curves for 10 runs (bottom middle) Learning curves per episode for non-factored cells over total reward for 20 runs (bottom …

Figure 26: Individual learning curves. Line is the median over 1000 episodes, with the shaded region as the …
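Several captions refer to factored cells (FacRNN, FacGRU) with a number of factors M. The page does not spell out the factored update, so the following is a sketch under an assumption: that the per-action weights are replaced by a low-rank product in which the action and the input are each projected to M factors, multiplied elementwise, and mapped back to the hidden size. All names here are hypothetical.

```python
import math
import random


def matvec(W, v):
    """Multiply a matrix (list of rows) by a vector."""
    return [sum(w * x for w, x in zip(row, v)) for row in W]


def rand_matrix(rows, cols, rng, scale=0.1):
    return [[rng.uniform(-scale, scale) for _ in range(cols)] for _ in range(rows)]


def factored_step(W_out, W_act, W_in, h, obs, action_one_hot):
    """Assumed factored update: project the action and the [obs, h] input into
    M factors, multiply them elementwise, then map back to the hidden size."""
    act_factors = matvec(W_act, action_one_hot)  # length M
    in_factors = matvec(W_in, obs + h)           # length M
    mixed = [a * z for a, z in zip(act_factors, in_factors)]
    return [math.tanh(p) for p in matvec(W_out, mixed)]


def factored_param_count(n_obs, n_actions, n_hidden, n_factors):
    """Weights in W_out (hidden x M), W_act (M x actions), W_in (M x (obs + hidden))."""
    return n_factors * (n_hidden + n_actions + n_obs + n_hidden)


def multiplicative_param_count(n_obs, n_actions, n_hidden):
    """One hidden x (obs + hidden) matrix per action, biases ignored."""
    return n_actions * n_hidden * (n_obs + n_hidden)


rng = random.Random(0)
n_obs, n_actions, n_hidden, n_factors = 3, 2, 4, 5
W_out = rand_matrix(n_hidden, n_factors, rng)
W_act = rand_matrix(n_factors, n_actions, rng)
W_in = rand_matrix(n_factors, n_obs + n_hidden, rng)

h = factored_step(W_out, W_act, W_in, [0.0] * n_hidden, [1.0, 0.0, 0.0], [1.0, 0.0])
fac_params = factored_param_count(n_obs, n_actions, n_hidden, n_factors)
mul_params = multiplicative_param_count(n_obs, n_actions, n_hidden)
```

Under this assumed form, M sets the parameter budget independently of the number of actions. That would be consistent with the Figure 10 caption marking the values of M at which parameter counts match.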
read the original abstract

Building and maintaining state to learn policies and value functions is critical for deploying reinforcement learning (RL) agents in the real world. Recurrent neural networks (RNNs) have become a key point of interest for the state-building problem, and several large-scale reinforcement learning agents incorporate recurrent networks. While RNNs have become a mainstay in many RL applications, many key design choices and implementation details responsible for performance improvements are often not reported. In this work, we discuss one axis on which RNN architectures can be (and have been) modified for use in RL. Specifically, we look at how action information can be incorporated into the state update function of a recurrent cell. We discuss several choices in using action information and empirically evaluate the resulting architectures on a set of illustrative domains. Finally, we discuss future work in developing recurrent cells and discuss challenges specific to the RL setting.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Referee Report

0 major / 3 minor

Summary. The paper investigates multiple distinct approaches for incorporating action information into the state-update function of recurrent cells in RNNs for reinforcement learning. It describes several architectural choices for action encoding, performs an empirical comparison of the resulting networks on a set of illustrative domains, and concludes with a discussion of future work and RL-specific challenges for recurrent architectures.

Significance. If the reported performance differences hold under replication, the work provides a useful, focused examination of an under-documented design axis in recurrent RL agents. By isolating action incorporation from other RNN modifications, the study can help practitioners understand why certain recurrent architectures succeed or fail in POMDP settings. The empirical framing on illustrative domains is a reasonable starting point for such an investigation, though the paper does not claim broad generalization.

minor comments (3)
  1. The abstract and introduction would benefit from explicitly naming the illustrative domains, the performance metrics, the baselines, and whether statistical significance testing was performed; without these details the reader cannot immediately assess the strength of the empirical claims.
  2. Section 4 (or the experimental section): clarify whether the reported results include error bars or multiple random seeds, and whether the same hyper-parameter search budget was allocated to each action-encoding variant.
  3. Notation: ensure that the mathematical description of each action-encoding variant (e.g., concatenation, gating, or embedding) is presented with consistent symbols so that differences between the variants are immediately visible.
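The figure captions report performance averaged over the final 10% of episodes across independent runs, sometimes with standard error. A minimal sketch of that summary is given below. The helper names and the data are hypothetical and are not taken from the paper's code.

```python
import math
from statistics import mean, stdev


def final_fraction_mean(values, fraction=0.1):
    """Average the last `fraction` of a run's per-episode values."""
    n_tail = max(1, int(len(values) * fraction))
    return mean(values[-n_tail:])


def mean_and_standard_error(per_run_scores):
    """Mean across independent runs with the standard error of that mean."""
    n = len(per_run_scores)
    return mean(per_run_scores), stdev(per_run_scores) / math.sqrt(n)


# Three made-up runs of 20 episodes each (1.0 = success), for illustration only.
runs = [
    [0.0] * 10 + [1.0] * 10,
    [0.0] * 19 + [1.0],
    [0.0] * 18 + [1.0, 0.0],
]
scores = [final_fraction_mean(r) for r in runs]
avg, sem = mean_and_standard_error(scores)
```

Reporting both numbers for every action-encoding variant, computed from the same number of runs, would address the second minor comment directly.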

Simulated Author's Rebuttal

0 responses · 0 unresolved

We thank the referee for their positive assessment of our work and for recommending minor revision. The referee's summary accurately reflects the scope and contributions of the manuscript. As no specific major comments are provided in the report, we have no individual points to address below.

Circularity Check

0 steps flagged

No significant circularity in empirical investigation

full rationale

The paper presents an empirical investigation of different methods for incorporating action information into the state update of recurrent cells for RL. It discusses architectural choices and evaluates them experimentally on illustrative domains without any claimed mathematical derivations, predictions derived from fitted parameters, or self-referential definitions. The central claims rest on experimental comparisons rather than reducing to inputs by construction, making the work self-contained against external benchmarks with no load-bearing self-citations or ansatzes identified in the provided text.

Axiom & Free-Parameter Ledger

0 free parameters · 1 axiom · 0 invented entities

The paper rests on standard assumptions from RL and RNN literature without introducing new free parameters, invented entities, or ad-hoc axioms beyond the usual modeling choices for recurrent state updates.

axioms (1)
  • Domain assumption: Recurrent networks are a suitable mechanism for building state representations in partially observable RL problems.
    Invoked in the opening sentences of the abstract when motivating the use of RNNs.



Published in Transactions on Machine Learning Research (12/2022).