A Study of State Aliasing in Structured Prediction with RNNs

Adam Trischler; Layla El Asri

arXiv: 1906.09310 · v1 · pith:2ENPI6WZnew · submitted 2019-06-21 · cs.LG · cs.AI · cs.CL

Pith reviewed 2026-05-25 18:38 UTC · model grok-4.3

classification cs.LG · cs.AI · cs.CL

keywords state aliasing · recurrent neural networks · policy gradient · reinforcement learning · state representation · structured prediction · value-based methods · text-based games

The pith

Recurrent neural networks trained with policy gradient often fail to learn distinct state representations when multiple states share the same optimal action.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

This paper examines the state representations learned by RNN-based reinforcement learning agents trained with both policy gradient and value-based methods. It shows that policy gradient training produces state aliasing, where distinct states are conflated in the representation space because they require the same action, which blocks optimal policies. The effect appears across simple maze navigation and more complex text-based games. Value-based methods avoid the aliasing. A sympathetic reader would care because end-to-end RNN agents are common in structured prediction tasks such as dialogue, and the finding isolates a concrete training interaction that can derail learning.

Core claim

The authors demonstrate through experiments that state aliasing, the conflation of two or more distinct states in the representation space, occurs when several states share the same optimal action and the agent is trained via policy gradient. This produces RNN agents that fail to learn state representations leading to an optimal policy. The paper characterizes the phenomenon in a maze setting and a text-based game and contrasts it with value-based training.

What carries the argument

State aliasing, the conflation of distinct states in the representation space that occurs when those states share an optimal action under policy gradient training.
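One way to make this definition operational is to record the RNN hidden vector at each distinct environment state and flag pairs that sit almost on top of each other. The sketch below is a minimal illustration under stated assumptions: the function names, the state labels, the tolerance `tol`, and the example vectors are all hypothetical and are not taken from the paper, which reports the Euclidean distance between hidden states in its figures.

```python
import math


def euclidean_distance(h_a, h_b):
    """Euclidean (L2) distance between two hidden-state vectors."""
    if len(h_a) != len(h_b):
        raise ValueError("hidden states must have the same dimension")
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(h_a, h_b)))


def aliased_pairs(hidden_states, tol=1e-2):
    """Return pairs of state labels whose hidden vectors lie within `tol`.

    `hidden_states` maps a (hypothetical) state label to the hidden vector
    recorded when the agent is in that state. `tol` is an arbitrary threshold
    chosen for illustration, not a value from the paper.
    """
    labels = sorted(hidden_states)
    pairs = []
    for i, first in enumerate(labels):
        for second in labels[i + 1:]:
            distance = euclidean_distance(hidden_states[first], hidden_states[second])
            if distance < tol:
                pairs.append((first, second))
    return pairs


# Made-up vectors for illustration only (not results from the paper).
example = {
    "s1": [0.50, -0.20, 0.10],
    "s2": [0.50, -0.20, 0.10],  # identical to s1, so the two states are conflated
    "s3": [-0.30, 0.40, 0.90],
}

print(aliased_pairs(example))
```

In practice the threshold would need to be chosen relative to the typical scale of the hidden vectors, and tracking the distance over training is likely more informative than a single snapshot.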

If this is right

Policy gradient methods can leave RNN agents with suboptimal policies in environments where the same action is optimal at multiple distinct states.
Value-based methods avoid the state aliasing that policy gradient induces in these settings.
Training recommendations can be made to select or modify methods when RNNs are used for reinforcement learning in structured prediction.
The aliasing effect can be reproduced and measured in both minimal maze tasks and richer text-game environments.
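A toy calculation gives one possible intuition for the contrast between the two training families; it is an editorial illustration, not the paper's analysis. Assume a hypothetical deterministic corridor with reward 1 on entering the goal, 0 elsewhere, and a discount factor of 0.9. Two states share the optimal action "move right", yet their optimal Q-values differ, so a value-based learner would have to tell them apart to fit both targets, whereas a policy only needs to output the same action in both.

```python
GAMMA = 0.9  # assumed discount factor, chosen for illustration


def optimal_q_right(steps_to_goal, gamma=GAMMA, goal_reward=1.0):
    """Optimal Q-value of moving right in a deterministic corridor.

    Assumes the reward is `goal_reward` on the transition into the goal and
    0 on every other transition, so the value is the discounted goal reward.
    """
    if steps_to_goal < 1:
        raise ValueError("steps_to_goal must be at least 1")
    return goal_reward * gamma ** (steps_to_goal - 1)


# Two hypothetical states with the same optimal action but different distances.
steps_to_goal = {"near": 1, "far": 2}
q_values = {name: optimal_q_right(steps) for name, steps in steps_to_goal.items()}

for name, q in q_values.items():
    print(f"{name}: optimal action = right, Q* = {q:.2f}")
```

If the task were undiscounted and the reward structure made both values equal, this particular source of pressure to separate the states would disappear, which is one reason the sketch should be read as an intuition rather than a guarantee.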

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the authors make directly.

The same aliasing pattern could appear in other recurrent sequence tasks that involve repeated actions across different contexts.
Auxiliary objectives that encourage state discrimination might reduce aliasing even under policy gradient training.
The findings point toward preferring value-based updates when the task structure features repeated optimal actions.

Load-bearing premise

The observed failure in the maze and text-game experiments stems specifically from the interaction of policy gradient with shared optimal actions rather than other aspects of the RNN or training procedure.

What would settle it

An experiment in which an RNN trained with policy gradient on a task with repeated optimal actions across states nevertheless learns distinct representations for those states and reaches the optimal policy would falsify the central claim.

Figures

Figures reproduced from arXiv: 1906.09310 by Adam Trischler, Layla El Asri.

**Figure 2.** On the left: evolution of the distance between hidden states…

**Figure 3.** Evolution of the distance between hidden states $h_{x_3,x_1}$ and $h_{x_3,x_2}$ throughout learning with R3. Accompanying text: "… occur as often in this setting, but the model fails 30% of the time. We again plot the evolution of the Euclidean distance $\|h_{x_3,x_1} - h_{x_3,x_2}\|_2$ during an unsuccessful run with R3, in …"

**Figure 4.** TextWorld game used to illustrate state aliasing in a language setting. Accompanying text: "The maze experiments show that state aliasing arises in GRUs and LSTMs when different states share the same optimal action and the networks are trained with policy-gradient methods. To study this phenomenon in a setting like dialogue modelling, we design a game in which the same action must be taken in different contexts. We use the Te…"

**Figure 5.** Retrieval model tested on TextWorld. The model, inspired by the one proposed by He et al. (2016), is described in…

**Figure 6.** Illustration of the state aliasing prob…

Original abstract

End-to-end reinforcement learning agents learn a state representation and a policy at the same time. Recurrent neural networks (RNNs) have been trained successfully as reinforcement learning agents in settings like dialogue that require structured prediction. In this paper, we investigate the representations learned by RNN-based agents when trained with both policy gradient and value-based methods. We show through extensive experiments and analysis that, when trained with policy gradient, recurrent neural networks often fail to learn a state representation that leads to an optimal policy in settings where the same action should be taken at different states. To explain this failure, we highlight the problem of state aliasing, which entails conflating two or more distinct states in the representation space. We demonstrate that state aliasing occurs when several states share the same optimal action and the agent is trained via policy gradient. We characterize this phenomenon through experiments on a simple maze setting and a more complex text-based game, and make recommendations for training RNNs with reinforcement learning.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Desk Editor's Note (private letter to a colleague)

Policy gradient training makes RNN agents alias states that share an optimal action, shown in maze and text-game experiments.

The main observation is that RNNs trained as RL agents with policy gradient often end up merging distinct states in their representation when those states call for the same action. This produces policies that cannot be optimal. Value-based methods avoid the same collapse in the same setups. The paper demonstrates the pattern in a simple maze and a text-based game, then ties the aliasing directly to the shared-action condition under the policy-gradient update. That contrast is the useful part. It gives practitioners a concrete reason to prefer value-based training or to add explicit state distinctions when using policy gradient on recurrent models for dialogue or other structured tasks. The experiments isolate the effect reasonably well by holding the architecture fixed and swapping only the learning rule. The write-up stays within its scope and ends with training recommendations instead of claiming a general theoretical fix. The environments are small, so the magnitude of the problem could shrink in larger domains, but the paper does not overstate its reach. The central claim holds up on the reported comparisons without obvious confounds in the design. This is the sort of targeted empirical note that helps people avoid a recurring gotcha when they combine RNNs and policy gradient. A reader working on RL for language or sequential prediction will find it worth their time. It is solid enough to go to referees rather than receive a desk rejection.
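For readers who want the two learning rules side by side, the block below writes them in their standard textbook forms, with $h_t$ standing for the RNN hidden state at step $t$, $G_t$ for the return from step $t$, and $\gamma$ for the discount factor. These are generic formulations in the style of Williams (1992) and Mnih et al. (2013), given here as an assumption; the paper's exact objectives, baselines, and target-network details may differ.

```latex
% REINFORCE-style policy gradient: the update depends on the state only
% through the action distribution \pi_\theta(\cdot \mid h_t).
\nabla_\theta J(\theta)
  = \mathbb{E}_{\pi_\theta}\!\left[ \sum_{t} G_t \,
      \nabla_\theta \log \pi_\theta(a_t \mid h_t) \right]

% One-step Q-learning loss (target network omitted for brevity): the
% regression target depends on the reward and the value of the next state.
\mathcal{L}(\theta)
  = \mathbb{E}\!\left[ \Big( r_t + \gamma \max_{a'} Q_\theta(h_{t+1}, a')
      - Q_\theta(h_t, a_t) \Big)^{2} \right]
```

Under the first objective, two hidden states that map to the same action distribution are equally good, while the second asks for a state-specific scalar, which is one plausible reading of why swapping only the learning rule changes the learned representation.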

Referee Report

0 major / 2 minor

Summary. The paper claims that RNN-based RL agents trained with policy gradient methods often fail to learn distinct state representations when different states require the same optimal action, resulting in state aliasing and suboptimal policies. This is shown through experiments on a maze setting and a text-based game, where policy gradient leads to suboptimal policies unlike value-based methods. The work provides analysis and recommendations for training such agents.

Significance. If the findings hold, this empirical study is significant for understanding limitations in training RNNs with policy gradients in structured prediction tasks common in dialogue and similar domains. By contrasting with value-based methods and using both simple and complex environments, it offers practical insights. The identification of state aliasing as a specific issue when optimal actions are shared is a useful contribution.

minor comments (2)

The abstract mentions 'extensive experiments and analysis' but does not summarize any specific quantitative results, metrics, or performance numbers; including a brief mention of key findings would improve clarity for readers.
Experimental figures and tables should report results over multiple random seeds with error bars or standard deviations to allow assessment of variability and robustness of the observed state aliasing effect.
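The reporting asked for in the second comment can be sketched in a few lines. The helper names and the per-seed numbers below are hypothetical placeholders, not values from the paper; the sketch only shows how a mean, a sample standard deviation, and a failure rate across seeds might be computed with the Python standard library.

```python
import statistics


def summarize_over_seeds(values):
    """Return (mean, sample standard deviation) of one metric across seeds."""
    mean = statistics.mean(values)
    # stdev needs at least two data points; report 0.0 for a single seed.
    std = statistics.stdev(values) if len(values) > 1 else 0.0
    return mean, std


def failure_rate(run_succeeded):
    """Fraction of runs that did not reach the optimal policy."""
    if not run_succeeded:
        raise ValueError("need at least one run")
    failures = sum(1 for ok in run_succeeded if not ok)
    return failures / len(run_succeeded)


# Placeholder data for illustration: final hidden-state distance per seed,
# and whether each seed's run reached the optimal policy.
final_distances = [1.0, 2.0, 3.0]
run_succeeded = [True, True, False, True]

mean, std = summarize_over_seeds(final_distances)
print(f"final distance: {mean:.2f} +/- {std:.2f} over {len(final_distances)} seeds")
print(f"failure rate: {failure_rate(run_succeeded):.0%}")
```

The sample standard deviation is used here because seeds are a sample of possible runs; a confidence interval or bootstrap would be a reasonable alternative when the number of seeds is small.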

Simulated Author's Rebuttal

0 responses · 0 unresolved

We thank the referee for their positive assessment of the work and recommendation for minor revision. The summary correctly captures the core claim that policy-gradient training of RNN agents leads to state aliasing when multiple states share the same optimal action, in contrast to value-based methods, with supporting experiments in both simple and complex environments.

Circularity Check

0 steps flagged

No significant circularity

This is an empirical study whose central claims rest on experimental comparisons between policy-gradient and value-based training of RNN agents in maze and text-game environments. No derivations, first-principles predictions, or fitted parameters are presented that could reduce to their own inputs by construction. The reported failure mode (state aliasing under shared optimal actions) is characterized directly from observed outcomes rather than from any self-referential definition or self-citation chain. The work therefore contains no load-bearing circular steps.

Axiom & Free-Parameter Ledger

0 free parameters · 0 axioms · 0 invented entities

The paper is an empirical investigation; the abstract introduces no free parameters, mathematical axioms, or new postulated entities.

Reference graph

Works this paper leans on

20 extracted references · 20 canonical work pages · 1 internal anchor

[2] Dzmitry Bahdanau, Philemon Brakel, Kelvin Xu, Anirudh Goyal, Ryan Lowe, Joelle Pineau, Aaron Courville, and Yoshua Bengio. An Actor-Critic Algorithm for Sequence Prediction. In Proceedings of the International Conference on Learning Representations, 2017.

[3] Kyunghyun Cho, Bart van Merriënboer, Çağlar Gülçehre, Dzmitry Bahdanau, Fethi Bougares, Holger Schwenk, and Yoshua Bengio. Learning phrase representations using RNN encoder-decoder for statistical machine translation. In Proceedings of Empirical Methods in Natural Language Processing, 2014.

[4] Marc-Alexandre Côté, Ákos Kádár, Xingdi Yuan, Ben A. Kybartas, Tavian Barnes, Emery Fine, James Moore, Matthew J. Hausknecht, Layla El Asri, Mahmoud Adada, Wendy Tay, and Adam Trischler. TextWorld: A learning environment for text-based games. In Proceedings of the Computer Games Workshop at ICML/IJCAI, 2018.

[5] Abhishek Das, Satwik Kottur, José M. F. Moura, Stefan Lee, and Dhruv Batra. Learning Cooperative Visual Dialog Agents with Deep Reinforcement Learning. In Proceedings of the International Conference on Computer Vision, 2017.

[6] Layla El Asri, Jing He, and Kaheer Suleman. A sequence-to-sequence model for user simulation in spoken dialogue systems. In Proceedings of Interspeech, 2016.

[7] Ji He, Jianshu Chen, Xiaodong He, Jianfeng Gao, Lihong Li, Li Deng, and Mari Ostendorf. Deep reinforcement learning with a natural language action space. In Proceedings of the 54th Annual Meeting of the Association for Computational Linguistics, 2016.

[8] Sepp Hochreiter and Jürgen Schmidhuber. Long short-term memory. Neural Computation, 9(8):1735–1780, 1997.

[9] Shaojie Jiang and Maarten de Rijke. Why are sequence-to-sequence models so dull? Understanding the low-diversity problem of chatbots. In Proceedings of the EMNLP Search-Oriented Conversational AI Workshop, 2018.

[10] Diederik P. Kingma and Jimmy Ba. Adam: A method for stochastic optimization. CoRR, abs/1412.6980, 2014. URL http://arxiv.org/abs/1412.6980

[11] Jiwei Li, Will Monroe, Alan Ritter, Michel Galley, Jianfeng Gao, and Dan Jurafsky. Deep Reinforcement Learning for Dialogue Generation. In Proceedings of the Conference on Empirical Methods in Natural Language Processing, 2016.

[12] Andrew Kachites McCallum. Reinforcement Learning with Selective Perception and Hidden State. PhD thesis, 1996.

[13] V. Mnih, K. Kavukcuoglu, D. Silver, A. Graves, I. Antonoglou, D. Wierstra, and M. Riedmiller. Playing Atari with deep reinforcement learning. In Proceedings of the NeurIPS Workshop on Deep Reinforcement Learning, 2013.

[14] Shashi Narayan, Shay B. Cohen, and Mirella Lapata. Ranking sentences for extractive summarization with reinforcement learning. In Proceedings of the North American Chapter of the Annual Conference of the Association for Computational Linguistics, 2018.

[15] Kishore Papineni, Salim Roukos, Todd Ward, and Wei-Jing Zhu. Bleu: a method for automatic evaluation of machine translation. In Proceedings of the Annual Conference of the Association for Computational Linguistics, 2002.

[16] Marc'Aurelio Ranzato, Sumit Chopra, Michael Auli, and Wojciech Zaremba. Sequence Level Training with Recurrent Neural Networks. In Proceedings of the International Conference on Learning Representations, 2016.

[17] Florian Strub, Harm de Vries, Jeremie Mary, Bilal Piot, Aaron Courville, and Olivier Pietquin. End-to-end optimization of goal-driven and visually grounded dialogue systems. In Proceedings of the International Joint Conference on Artificial Intelligence, 2017.

[18] T. Tieleman and G. Hinton. Lecture 6.5 - RmsProp: Divide the gradient by a running average of its recent magnitude. COURSERA: Neural Networks for Machine Learning, 2012.

[19] Ronald J. Williams. Simple statistical gradient-following algorithms for connectionist reinforcement learning. pp. 229–256, 1992.

[20] L. Wu, F. Tian, T. Qin, J. Lai, and T.-Y. Liu. A study of reinforcement learning for neural machine translation. In Proceedings of the Conference on Empirical Methods in Natural Language Processing, 2018.
