
arxiv: 1906.09310 · v1 · pith:2ENPI6WZnew · submitted 2019-06-21 · 💻 cs.LG · cs.AI · cs.CL

A Study of State Aliasing in Structured Prediction with RNNs

Pith reviewed 2026-05-25 18:38 UTC · model grok-4.3

classification 💻 cs.LG · cs.AI · cs.CL
keywords state aliasing · recurrent neural networks · policy gradient · reinforcement learning · state representation · structured prediction · value-based methods · text-based games

The pith

Recurrent neural networks trained with policy gradient often fail to learn distinct state representations when multiple states share the same optimal action.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

This paper examines the state representations learned by RNN-based reinforcement learning agents trained with both policy gradient and value-based methods. It shows that policy gradient training produces state aliasing, where distinct states are conflated in the representation space because they require the same action, which blocks optimal policies. The effect appears across simple maze navigation and more complex text-based games. Value-based methods avoid the aliasing. A sympathetic reader would care because end-to-end RNN agents are common in structured prediction tasks such as dialogue, and the finding isolates a concrete training interaction that can derail learning.

Core claim

The authors demonstrate through experiments that state aliasing, the conflation of two or more distinct states in the representation space, occurs when several states share the same optimal action and the agent is trained via policy gradient. This produces RNN agents that fail to learn state representations leading to an optimal policy. The paper characterizes the phenomenon in a maze setting and a text-based game and contrasts it with value-based training.

What carries the argument

State aliasing, the conflation of distinct states in the representation space that occurs when those states share an optimal action under policy gradient training.
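The figure captions on this page describe aliasing in terms of the Euclidean distance between two hidden states, written h_{x3,x1} and h_{x3,x2}. A minimal sketch of that measurement follows. It assumes h_{x3,x1} means the hidden state at observation x3 after having seen x1, and it uses a hypothetical vanilla tanh RNN with random, untrained weights rather than the GRUs and LSTMs studied in the paper. All function names are illustrative.

```python
import math
import random


def rnn_step(h, x, w_hh, w_xh):
    """One step of a vanilla tanh RNN: h' = tanh(W_hh h + W_xh x)."""
    return [
        math.tanh(
            sum(w * hj for w, hj in zip(w_hh[i], h))
            + sum(w * xj for w, xj in zip(w_xh[i], x))
        )
        for i in range(len(h))
    ]


def encode(history, w_hh, w_xh, hidden_size):
    """Run the RNN over a list of observations and return the final hidden state."""
    h = [0.0] * hidden_size
    for x in history:
        h = rnn_step(h, x, w_hh, w_xh)
    return h


def l2_distance(a, b):
    """Euclidean distance between two vectors of equal length."""
    return math.sqrt(sum((ai - bi) ** 2 for ai, bi in zip(a, b)))


# Assumption: random untrained weights, sizes chosen only for illustration.
rng = random.Random(0)
hidden_size, obs_size = 4, 3
w_hh = [[rng.uniform(-0.5, 0.5) for _ in range(hidden_size)] for _ in range(hidden_size)]
w_xh = [[rng.uniform(-0.5, 0.5) for _ in range(obs_size)] for _ in range(hidden_size)]

# One-hot observations standing in for x1, x2, x3.
x1, x2, x3 = [1.0, 0.0, 0.0], [0.0, 1.0, 0.0], [0.0, 0.0, 1.0]

h_x3_x1 = encode([x1, x3], w_hh, w_xh, hidden_size)  # at x3, coming from x1
h_x3_x2 = encode([x2, x3], w_hh, w_xh, hidden_size)  # at x3, coming from x2

distance = l2_distance(h_x3_x1, h_x3_x2)
print(f"||h_x3,x1 - h_x3,x2||_2 = {distance:.4f}")
```

Tracking this distance over training is how one would see aliasing: the two states are conflated when the distance collapses toward zero even though the histories differ.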

If this is right

  • Policy gradient methods often produce suboptimal policies for RNN agents in environments where the same action is optimal at multiple distinct states.
  • Value-based methods avoid the state aliasing that policy gradient induces in these settings.
  • Training recommendations can be made to select or modify methods when RNNs are used for reinforcement learning in structured prediction.
  • The aliasing effect can be reproduced and measured in both minimal maze tasks and richer text-game environments.
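The contrast between the two training signals in the list above can be written out. The following is an illustrative sketch in standard notation, not the paper's own derivation. The symbols θ, h_t, R_t, γ and Q_θ are assumptions introduced here. The first line is the REINFORCE estimator (Williams, 1992) and the second is a one-step Q-learning loss.

```latex
% Policy gradient (REINFORCE): the hidden state h_t only enters through
% the log-probability of the chosen action.
\nabla_\theta J(\theta)
  = \mathbb{E}_{\pi_\theta}\!\left[ \sum_{t} R_t \, \nabla_\theta \log \pi_\theta(a_t \mid h_t) \right]

% Value-based (one-step Q-learning): the hidden state h_t must support
% a regression onto the bootstrapped return.
L(\theta)
  = \mathbb{E}\!\left[ \Big( r_t + \gamma \max_{a'} Q_\theta(h_{t+1}, a') - Q_\theta(h_t, a_t) \Big)^{2} \right]
```

Under these forms, a policy head only has to rank the shared optimal action highest, which it can do even when two hidden states coincide. A value head that is a deterministic function of the hidden state can output different values for the same action at two states only if their hidden states differ. This is one plausible reading of why value-based training would resist aliasing; the evidence for the claim is the paper's experiments, not this argument.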

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

  • The same aliasing pattern could appear in other recurrent sequence tasks that involve repeated actions across different contexts.
  • Auxiliary objectives that encourage state discrimination might reduce aliasing even under policy gradient training.
  • The findings point toward preferring value-based updates when the task structure features repeated optimal actions.
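The auxiliary-objective idea in the list above could take many forms. One hypothetical form is a hinge penalty that is added to the policy loss and vanishes once two hidden states are far enough apart. The sketch below is not from the paper; the names `separation_penalty` and `total_loss`, the margin, and the weight are all assumptions chosen for illustration.

```python
import math


def separation_penalty(h_a, h_b, margin=1.0):
    """Hinge penalty: positive while the two hidden states are closer than
    `margin` in Euclidean distance, zero once they are at least that far apart."""
    dist = math.sqrt(sum((a - b) ** 2 for a, b in zip(h_a, h_b)))
    return max(0.0, margin - dist)


def total_loss(policy_loss, h_a, h_b, weight=0.1, margin=1.0):
    """Policy loss plus a weighted separation penalty (hypothetical combination)."""
    return policy_loss + weight * separation_penalty(h_a, h_b, margin)


# Fully aliased states pay the full penalty; well-separated states pay none.
print(total_loss(2.0, [0.0, 0.0], [0.0, 0.0]))
print(total_loss(2.0, [0.0, 0.0], [3.0, 4.0]))
```

Such a penalty needs to know which pairs of states should be kept apart, which is itself a design choice that the paper's setting does not provide for free.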

Load-bearing premise

The observed failure in the maze and text-game experiments stems specifically from the interaction of policy gradient with shared optimal actions rather than other aspects of the RNN or training procedure.

What would settle it

An experiment in which an RNN trained with policy gradient on a task with repeated optimal actions across states nevertheless learns distinct representations for those states and reaches the optimal policy would falsify the central claim.

Figures

Figures reproduced from arXiv: 1906.09310 by Adam Trischler, Layla El Asri.

Figure 1: Simple maze. Source: (McCallum, 1996). Labels in the figure: right, left; x1, x2, x3.
Figure 2: On the left: evolution of the distance between hidden states…
Figure 3: Evolution of the distance between hidden states h_{x3,x1} and h_{x3,x2} throughout learning with R3.
… occur as often in this setting, but the model fails 30% of the time. We again plot the evolution of the Euclidean distance ||h_{x3,x1} − h_{x3,x2}||_2 during an unsuccessful run with R3, in …
Figure 4: TextWorld game used to illustrate state aliasing in a language setting. The maze experiments show that state aliasing arises in GRUs and LSTMs when different states share the same optimal action and the networks are trained with policy-gradient methods. To study this phenomenon in a setting like dialogue modelling, we design a game in which the same action must be taken in different contexts. We use the Te…
Figure 5: Retrieval model tested on TextWorld. The model, inspired by the one proposed by He et al. (2016), is described in…
Figure 6: Illustration of the state aliasing prob…
Original abstract

End-to-end reinforcement learning agents learn a state representation and a policy at the same time. Recurrent neural networks (RNNs) have been trained successfully as reinforcement learning agents in settings like dialogue that require structured prediction. In this paper, we investigate the representations learned by RNN-based agents when trained with both policy gradient and value-based methods. We show through extensive experiments and analysis that, when trained with policy gradient, recurrent neural networks often fail to learn a state representation that leads to an optimal policy in settings where the same action should be taken at different states. To explain this failure, we highlight the problem of state aliasing, which entails conflating two or more distinct states in the representation space. We demonstrate that state aliasing occurs when several states share the same optimal action and the agent is trained via policy gradient. We characterize this phenomenon through experiments on a simple maze setting and a more complex text-based game, and make recommendations for training RNNs with reinforcement learning.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Referee Report

0 major / 2 minor

Summary. The paper claims that RNN-based RL agents trained with policy gradient methods often fail to learn distinct state representations when different states require the same optimal action, which results in state aliasing. Experiments on a maze setting and a text-based game show policy gradient yielding suboptimal policies where value-based methods do not. The work provides analysis and recommendations for training such agents.

Significance. If the findings hold, this empirical study is significant for understanding limitations in training RNNs with policy gradients in structured prediction tasks common in dialogue and similar domains. By contrasting with value-based methods and using both simple and complex environments, it offers practical insights. The identification of state aliasing as a specific issue when optimal actions are shared is a useful contribution.

minor comments (2)
  1. The abstract mentions 'extensive experiments and analysis' but does not summarize any specific quantitative results, metrics, or performance numbers; including a brief mention of key findings would improve clarity for readers.
  2. Experimental figures and tables should report results over multiple random seeds with error bars or standard deviations to allow assessment of variability and robustness of the observed state aliasing effect.
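The second comment asks for variability across random seeds. A minimal sketch of that reporting, using only the Python standard library, is shown below. The list of distances is placeholder data invented for illustration and is NOT a result from the paper; `summarize` is a hypothetical helper name.

```python
import statistics


def summarize(values_per_seed):
    """Return the mean and sample standard deviation across seeds.

    statistics.stdev uses the n-1 denominator and needs at least two values.
    """
    return statistics.mean(values_per_seed), statistics.stdev(values_per_seed)


# Placeholder numbers only: one final hidden-state distance per random seed.
final_distances = [0.5, 1.5, 1.0, 2.0, 0.0]

mean, std = summarize(final_distances)
print(f"{mean:.2f} ± {std:.2f} over {len(final_distances)} seeds")
```

Reporting the number of seeds alongside the mean and deviation lets a reader judge whether a failure rate, such as the 30% figure quoted in the Figure 3 text, is stable.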

Simulated Author's Rebuttal

0 responses · 0 unresolved

We thank the referee for their positive assessment of the work and recommendation for minor revision. The summary correctly captures the core claim that policy-gradient training of RNN agents leads to state aliasing when multiple states share the same optimal action, in contrast to value-based methods, with supporting experiments in both simple and complex environments.

Circularity Check

0 steps flagged

No significant circularity

full rationale

This is an empirical study whose central claims rest on experimental comparisons between policy-gradient and value-based training of RNN agents in maze and text-game environments. No derivations, first-principles predictions, or fitted parameters are presented that could reduce to their own inputs by construction. The reported failure mode (state aliasing under shared optimal actions) is characterized directly from observed outcomes rather than from any self-referential definition or self-citation chain. The work therefore contains no load-bearing circular steps.

Axiom & Free-Parameter Ledger

0 free parameters · 0 axioms · 0 invented entities

The paper is an empirical investigation; the abstract introduces no free parameters, mathematical axioms, or new postulated entities.

pith-pipeline@v0.9.0 · 5696 in / 1007 out tokens · 20684 ms · 2026-05-25T18:38:54.234201+00:00 · methodology


Reference graph

Works this paper leans on

20 extracted references · 20 canonical work pages · 1 internal anchor

  2. [2]

    An Actor-Critic Algorithm for Sequence Prediction

    Dzmitry Bahdanau, Philemon Brakel, Kelvin Xu, Anirudh Goyal, Ryan Lowe, Joelle Pineau, Aaron Courville, and Yoshua Bengio. An Actor-Critic Algorithm for Sequence Prediction. In Proceedings of the International Conference on Learning Representations, 2017

  3. [3]

    Learning phrase representations using RNN encoder-decoder for statistical machine translation

    Kyunghyun Cho, Bart van Merriënboer, Çağlar Gülçehre, Dzmitry Bahdanau, Fethi Bougares, Holger Schwenk, and Yoshua Bengio. Learning phrase representations using RNN encoder-decoder for statistical machine translation. In Proceedings of Empirical Methods in Natural Language Processing, 2014

  4. [4]

    TextWorld: A learning environment for text-based games

    Marc-Alexandre Côté, Ákos Kádár, Xingdi Yuan, Ben A. Kybartas, Tavian Barnes, Emery Fine, James Moore, Matthew J. Hausknecht, Layla El Asri, Mahmoud Adada, Wendy Tay, and Adam Trischler. TextWorld: A learning environment for text-based games. In Proceedings of the Computer Games Workshop at ICML/IJCAI, 2018

  5. [5]

    Abhishek Das, Satwik Kottur, José M. F. Moura, Stefan Lee, and Dhruv Batra. Learning Cooperative Visual Dialog Agents with Deep Reinforcement Learning. In Proceedings of the International Conference on Computer Vision, 2017

  6. [6]

    A sequence-to-sequence model for user simulation in spoken dialogue systems

    Layla El Asri, Jing He, and Kaheer Suleman. A sequence-to-sequence model for user simulation in spoken dialogue systems. In Proceedings of Interspeech, 2016

  7. [7]

    Deep reinforcement learning with a natural language action space

    Ji He, Jianshu Chen, Xiaodong He, Jianfeng Gao, Lihong Li, Li Deng, and Mari Ostendorf. Deep reinforcement learning with a natural language action space. In Proceedings of the 54th Annual Meeting of the Association for Computational Linguistics, 2016

  8. [8]

    Long short-term memory

    Sepp Hochreiter and Jürgen Schmidhuber. Long short-term memory. Neural Computation, 9(8):1735–1780, 1997

  9. [9]

    Why are sequence-to-sequence models so dull? understanding the low-diversity problem of chatbots

    Shaojie Jiang and Maarten de Rijke. Why are sequence-to-sequence models so dull? understanding the low-diversity problem of chatbots. In Proceedings of the EMNLP Search-Oriented Conversational AI Workshop, 2018

  10. [10]

    Adam: A Method for Stochastic Optimization

    Diederik P. Kingma and Jimmy Ba. Adam: A method for stochastic optimization. CoRR, abs/1412.6980, 2014. URL http://arxiv.org/abs/1412.6980

  11. [11]

    Deep Reinforcement Learning for Dialogue Generation

    Jiwei Li, Will Monroe, Alan Ritter, Michel Galley, Jianfeng Gao, and Dan Jurafsky. Deep Reinforcement Learning for Dialogue Generation. In Proceedings of the Conference on Empirical Methods in Natural Language Processing, 2016

  12. [12]

    Reinforcement Learning with Selective Perception and Hidden State

    Andrew Kachites McCallum. Reinforcement Learning with Selective Perception and Hidden State. PhD thesis, 1996

  13. [13]

    Playing Atari with deep reinforcement learning

    V. Mnih, K. Kavukcuoglu, D. Silver, A. Graves, I. Antonoglou, D. Wierstra, and M. Riedmiller. Playing Atari with deep reinforcement learning. In Proceedings of the NeurIPS workshop on Deep Reinforcement Learning, 2013

  14. [14]

    Ranking sentences for extractive summarization with reinforcement learning

    Shashi Narayan, Shay B. Cohen, and Mirella Lapata. Ranking sentences for extractive summarization with reinforcement learning. In Proceedings of the North American Chapter of the Annual Conference of the Association for Computational Linguistics, 2018

  15. [15]

    Bleu: a method for automatic evaluation of machine translation

    Kishore Papineni, Salim Roukos, Todd Ward, and Wei-Jing Zhu. Bleu: a method for automatic evaluation of machine translation. In Proceedings of the Annual Conference of the Association for Computational Linguistics, 2002

  16. [16]

    Sequence Level Training with Recurrent Neural Networks

    Marc'Aurelio Ranzato, Sumit Chopra, Michael Auli, and Wojciech Zaremba. Sequence Level Training with Recurrent Neural Networks. In Proceedings of the International Conference on Learning Representations, 2016

  17. [17]

    End-to-end optimization of goal-driven and visually grounded dialogue systems

    Florian Strub, Harm de Vries, Jeremie Mary, Bilal Piot, Aaron Courville, and Olivier Pietquin. End-to-end optimization of goal-driven and visually grounded dialogue systems. In Proceedings of the International Joint Conference on Artificial Intelligence, 2017

  18. [18]

    Lecture 6.5 - RmsProp: Divide the gradient by a running average of its recent magnitude

    T. Tieleman and G. Hinton. Lecture 6.5 - RmsProp: Divide the gradient by a running average of its recent magnitude. COURSERA: Neural Networks for Machine Learning, 2012

  19. [19]

    Simple statistical gradient-following algorithms for connectionist reinforcement learning

    Ronald J. Williams. Simple statistical gradient-following algorithms for connectionist reinforcement learning. pp. 229–256, 1992

  20. [20]

    L. Wu, F. Tian, T. Qin, J. Lai, and T.-Y. Liu. A study of reinforcement learning for neural machine translation. In Proceedings of the Conference on Empirical Methods in Natural Language Processing, 2018