pith. machine review for the scientific record.

arxiv: 2605.10816 · v1 · submitted 2026-05-11 · 💻 cs.LG · cs.AI

Recognition: no theorem link

Policy Gradient Methods for Non-Markovian Reinforcement Learning

Authors on Pith no claims yet

Pith reviewed 2026-05-12 05:33 UTC · model grok-4.3

classification 💻 cs.LG cs.AI
keywords policy gradient · non-Markovian reinforcement learning · agent state · reinforcement learning · ASMPG algorithm · NMDP · convergence guarantees

The pith

A policy gradient theorem for Agent State-Markov policies allows joint reward-driven optimization of internal states and actions in non-Markovian reinforcement learning.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper extends policy gradient methods from Markovian to non-Markovian decision processes by introducing Agent State-Markov policies. These policies maintain a recursively updated internal agent state that summarizes history and map that state to actions. Instead of fixing the state dynamics or training them with separate predictive losses, the approach derives a gradient that lets the state update rules and the control policy be optimized together to maximize expected reward. This produces the ASMPG algorithm, which comes with finite-time and almost-sure convergence results and outperforms predictive baselines on several non-Markovian tasks.
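
To make the construction concrete, here is a minimal sketch of an ASM policy trained end-to-end from reward alone, in the spirit described above. Everything in it is an illustrative assumption: the GRU-style state update, the toy CueRecallEnv, the hyperparameters, and the plain REINFORCE estimator stand in for the paper's actual architecture, benchmarks, and ASMPG estimator.

```python
# Sketch of an ASM policy: a parameterized recursive state update f_phi and a
# control policy pi_theta acting on the agent state, trained jointly from
# reward with a REINFORCE-style estimator. Illustrative, not the paper's code.
import torch
import torch.nn as nn

class CueRecallEnv:
    """Toy non-Markovian task (hypothetical, not from the paper): a binary cue
    is shown only at t=0; reward 1 arrives at the end iff the final action
    matches the cue, so the agent must carry the cue in its agent state."""
    def __init__(self, horizon=5):
        self.horizon = horizon
    def reset(self):
        self.cue = torch.randint(0, 2, (1,)).item()
        self.t = 0
        return torch.tensor([float(self.cue), 1.0])  # [cue, "cue visible" flag]
    def step(self, action):
        self.t += 1
        done = self.t == self.horizon
        reward = float(action == self.cue) if done else 0.0
        return torch.zeros(2), reward, done          # cue hidden after t=0

class ASMPolicy(nn.Module):
    """z_t = f_phi(z_{t-1}, a_{t-1}, o_t) followed by a_t ~ pi_theta(.|z_t)."""
    def __init__(self, obs_dim=2, n_actions=2, state_dim=8):
        super().__init__()
        self.f = nn.GRUCell(obs_dim + n_actions, state_dim)  # agent-state dynamics
        self.pi = nn.Linear(state_dim, n_actions)            # control policy
        self.state_dim, self.n_actions = state_dim, n_actions
    def act(self, z, prev_a, obs):
        z = self.f(torch.cat([obs, prev_a]).unsqueeze(0), z)
        dist = torch.distributions.Categorical(logits=self.pi(z).squeeze(0))
        a = dist.sample()
        return z, a, dist.log_prob(a)

env, agent = CueRecallEnv(), ASMPolicy()
opt = torch.optim.Adam(agent.parameters(), lr=1e-2)  # updates phi and theta jointly
for episode in range(2000):
    obs, done = env.reset(), False
    z = torch.zeros(1, agent.state_dim)
    prev_a = torch.zeros(agent.n_actions)
    logps, rewards = [], []
    while not done:
        z, a, logp = agent.act(z, prev_a, obs)
        obs, r, done = env.step(a.item())
        prev_a = nn.functional.one_hot(a, agent.n_actions).float()
        logps.append(logp); rewards.append(r)
    G = sum(rewards)                       # undiscounted episodic return
    loss = -G * torch.stack(logps).sum()   # REINFORCE on the joint parameters
    opt.zero_grad(); loss.backward(); opt.step()
```

Because the log-probabilities are backpropagated through the recurrent state, a single optimizer step moves both the state-update parameters and the policy head, with no auxiliary predictive loss anywhere in the loop.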

Core claim

We establish a novel policy gradient theorem for ASM policies, extending the classical policy gradient results from the Markovian setting to episodic and infinite-horizon discounted NMDPs. Building on this gradient expression, we propose the Agent State-Markov Policy Gradient (ASMPG) algorithm, which leverages the recursive structure of the agent state dynamics for efficient optimization. We establish finite-time and almost sure convergence guarantees, and empirically demonstrate that, on a range of non-Markovian tasks, ASMPG outperforms baselines that learn state representations via predictive objectives.

What carries the argument

The Agent State-Markov (ASM) policy, which pairs recursively updated agent state dynamics with a control policy that selects actions from the agent state, together with the derived policy gradient expression that differentiates through both components.
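
For orientation, here is one score-function form such a theorem could take, assuming stochastic parameterized state updates Z_{t+1} ~ f_φ(· | Z_t, A_t, O_{t+1}) and a control policy A_t ~ π_θ(· | Z_t). This is a sketch consistent with classical policy gradient arguments, not a reproduction of the paper's stated expression.

```latex
% Sketch of a joint score-function gradient for an ASM policy; the paper's
% exact theorem is not reproduced here and may differ in form.
\nabla_{(\theta,\varphi)} J
  = \mathbb{E}\!\left[ \sum_{t \ge 0} \gamma^{t} G_t
      \left( \nabla_{\theta} \log \pi_{\theta}(A_t \mid Z_t)
           + \nabla_{\varphi} \log f_{\varphi}(Z_t \mid Z_{t-1}, A_{t-1}, O_t) \right) \right],
\qquad G_t = \sum_{s \ge t} \gamma^{\,s-t} R_s .
```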

If this is right

  • The derived gradient expression supports direct differentiation through the recursive state update, enabling end-to-end optimization without auxiliary losses.
  • Finite-time convergence bounds hold for both episodic and infinite-horizon discounted non-Markovian settings.
  • Almost-sure convergence of the ASMPG iterates is guaranteed under standard step-size conditions (see the sketch after this list).
  • Empirical results show higher returns than predictive baselines across multiple history-dependent tasks.
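
The "standard step-size conditions" in the third bullet are presumably the usual Robbins-Monro requirements on the learning rates α_k; a minimal statement, assuming that is what the paper invokes:

```latex
% Robbins-Monro step-size conditions, assumed here to be the "standard
% step-size conditions" under which almost-sure convergence is claimed.
\sum_{k=0}^{\infty} \alpha_k = \infty,
\qquad
\sum_{k=0}^{\infty} \alpha_k^{2} < \infty,
\qquad \text{e.g. } \alpha_k = \tfrac{c}{k+1} \text{ for some } c > 0 .
```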

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

  • The reward-centric formulation may produce more compact agent states than prediction-based methods when only reward-relevant history matters.
  • The same gradient construction could be applied to other recursive memory architectures beyond the specific agent state used here.
  • If the joint optimization succeeds, separate representation-learning stages may become unnecessary in many partially observable or history-dependent problems.

Load-bearing premise

A recursively updated agent state can be jointly optimized with the policy using only reward signals to form a sufficient compact summary of non-Markovian history.

What would settle it

A non-Markovian task where ASMPG fails to reach the performance of a method that first learns predictive state representations and then optimizes the policy separately.

Figures

Figures reproduced from arXiv: 2605.10816 by Avik Kar, Eric Moulines, Nicholas Bambos, Rahul Singh, Shalabh Bhatnagar, Siddharth Chandak, Soumitra Sinhahajari.

Figure 1
Figure 1: Chatbot as a non-Markovian environment with agent state. We illustrate the role of agent state and ASM policies using a chatbot (Young et al., 2013). At each time step t, the user (the environment in this example) utters O_t ∈ O. The agent selects an action A_t ∈ A, a response to the user, and receives a reward r_t based on the user's satisfaction. The environment is non-Markovian, as both the utterances and the rewards depend on the entire interaction history.
Figure 2
Figure 2: Learning curves for ASMPG, AIS-KL, and AIS-MMD on the five environments.
Figure 3
Figure 3: Best-checkpoint performance over 10 random seeds, comparing ASMPG, AIS-KL, and AIS-MMD. The box denotes the interquartile range, and the whiskers indicate the range of non-outlier values; points beyond the whiskers are outliers. Each method is trained for 10^6 environment steps, with performance evaluated every 10,000 steps.
Figure 4
Figure 4: CheeseMaze. CheeseMaze is a partially observed navigation problem based on the maze example in McCallum (1993). The environment has 11 latent states, labeled 0, …, 10, where state 10 is the terminal goal state. The agent observes only 7 distinct observation symbols, so multiple maze locations are observationally aliased.
Figure 5
Figure 5: HallwayNavigation. HallwayNavigation is a deterministic, partially observed maze-navigation task introduced by McCallum (1995). The latent state is the agent's position in a 7 × 4 grid containing two interior blocked regions.
original abstract

We study policy gradient methods for reinforcement learning in non-Markovian decision processes (NMDPs), where observations and rewards depend on the entire interaction history. To handle this dependence, the agent maintains an internal state that is recursively updated to provide a compact summary of past observations and actions. In contrast to approaches that treat the agent state dynamics as fixed or learn it via predictive objectives, we propose a reward-centric formulation that jointly optimizes the agent state dynamics and the control policy to maximize the expected cumulative reward. To this end, we consider a class of Agent State-Markov (ASM) policies, comprising an agent state dynamics and a control policy that maps the agent state to actions. We establish a novel policy gradient theorem for ASM policies, extending the classical policy gradient results from the Markovian setting to episodic and infinite-horizon discounted NMDPs. Building on this gradient expression, we propose the Agent State-Markov Policy Gradient (ASMPG) algorithm, which leverages the recursive structure of the agent state dynamics for efficient optimization. We establish finite-time and almost sure convergence guarantees, and empirically demonstrate that, on a range of non-Markovian tasks, ASMPG outperforms baselines that learn state representations via predictive objectives.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Referee Report

1 major / 0 minor

Summary. The manuscript proposes a reward-centric approach to policy gradients in non-Markovian RL by introducing Agent State-Markov (ASM) policies, where agent state dynamics are jointly optimized with the policy parameters to maximize expected reward. It derives a policy gradient theorem for ASM policies in episodic and discounted infinite-horizon NMDPs, introduces the ASMPG algorithm leveraging the recursive structure, proves finite-time and almost-sure convergence, and reports empirical superiority over predictive state representation baselines on non-Markovian tasks.

Significance. If the results hold, the work offers a simpler alternative to non-Markovian RL methods that rely on predictive objectives for learning state representations, by instead using only the reward signal for joint optimization. The extension of the policy gradient theorem and the convergence guarantees would be valuable contributions, particularly if they demonstrate that reward-centric optimization suffices for discovering adequate history summaries. The empirical results suggest practical advantages, but the significance depends on the validity of the joint optimization claim.

major comments (1)
  1. The policy gradient theorem for ASM policies is claimed to extend classical results, but since it holds for any fixed agent-state recursion (reducing to a Markov process on the augmented state), the central novelty and load-bearing claim is the joint optimization of the recursion parameters with the policy using only reward. The finite-time and a.s. convergence results guarantee convergence to a stationary point of this joint objective, but do not establish that the stationary point yields a sufficient statistic for the non-Markovian history, unlike methods with explicit predictive losses.
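
The reduction the referee invokes can be made explicit: for a fixed recursion f_φ, augment the (non-Markovian) history H_t with the agent state Z_t. The pair is Markov by construction, since H_{t+1} = (H_t, A_t, O_{t+1}) and Z_{t+1} = f_φ(Z_t, A_t, O_{t+1}), so the classical theorem already yields the θ-gradient. A one-line sketch of this reading, not the paper's construction:

```latex
% Fixed recursion => the augmented process is Markov, and the classical
% policy gradient theorem applies to the theta-parameters alone.
X_t := (H_t, Z_t), \qquad
\mathbb{P}(X_{t+1} \in \cdot \mid X_0, \dots, X_t)
  = \mathbb{P}(X_{t+1} \in \cdot \mid X_t)
\;\Longrightarrow\;
\nabla_{\theta} J
  = \mathbb{E}\!\left[ \sum_{t \ge 0} \gamma^{t} G_t \,
      \nabla_{\theta} \log \pi_{\theta}(A_t \mid Z_t) \right].
```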

Simulated Author's Rebuttal

1 response · 0 unresolved

We thank the referee for their constructive feedback. We address the major comment below with clarifications on the scope of our contributions.

point-by-point responses
  1. Referee: The policy gradient theorem for ASM policies is claimed to extend classical results, but since it holds for any fixed agent-state recursion (reducing to a Markov process on the augmented state), the central novelty and load-bearing claim is the joint optimization of the recursion parameters with the policy using only reward. The finite-time and a.s. convergence results guarantee convergence to a stationary point of this joint objective, but do not establish that the stationary point yields a sufficient statistic for the non-Markovian history, unlike methods with explicit predictive losses.

    Authors: We agree that for any fixed agent-state recursion the theorem reduces to the classical policy gradient result on the induced Markov process over the augmented state. The central contribution of the manuscript is the joint optimization of the recursion parameters with the policy parameters using only the reward signal; this requires deriving a gradient expression that propagates through the parameterized state dynamics. We do not claim or prove that stationary points of the joint objective are sufficient statistics for the history. Our finite-time and almost-sure convergence results apply strictly to the reward objective. Empirical results on non-Markovian tasks indicate that the learned representations are effective in practice. In the revision we will add an explicit discussion section delineating these theoretical limitations and contrasting the approach with predictive-state methods. revision: yes
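
For a deterministic, differentiable recursion z_t = f_φ(z_{t-1}, a_{t-1}, o_t), "propagating the gradient through the parameterized state dynamics" is the recursive sensitivity below; this is our gloss of the mechanism, not the authors' derivation.

```latex
% Chain rule through the recursive agent-state update: the sensitivity of
% z_t to phi accumulates across time, and the phi-gradient of the action
% log-likelihood is obtained by composing with the policy's z-gradient.
\frac{\mathrm{d} z_t}{\mathrm{d} \varphi}
  = \frac{\partial f_{\varphi}}{\partial \varphi}(z_{t-1}, a_{t-1}, o_t)
  + \frac{\partial f_{\varphi}}{\partial z}(z_{t-1}, a_{t-1}, o_t)\,
    \frac{\mathrm{d} z_{t-1}}{\mathrm{d} \varphi},
\qquad
\nabla_{\varphi} \log \pi_{\theta}(a_t \mid z_t)
  = \left( \nabla_{z} \log \pi_{\theta}(a_t \mid z_t) \right)^{\!\top}
    \frac{\mathrm{d} z_t}{\mathrm{d} \varphi}.
```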

Circularity Check

0 steps flagged

No circularity: novel policy gradient theorem is an independent extension of classical results to ASM policies

full rationale

The paper defines the ASM policy class independently (agent state recursion plus policy on that state) and derives a new policy gradient theorem by extending the classical Markovian policy gradient to episodic and infinite-horizon discounted NMDPs. The ASMPG algorithm is then constructed directly from this gradient expression, with finite-time and almost-sure convergence results stated for the joint optimization of recursion and policy parameters. No equation or claim reduces by construction to a fitted quantity, prior predictive objective, or self-citation chain; the central mathematical content is self-contained against the classical baseline, and empirical comparisons are external to the derivation.

Axiom & Free-Parameter Ledger

1 free parameter · 1 axiom · 1 invented entity

The central claim rests on the existence of a learnable recursive agent state that compactly summarizes history for the purpose of reward maximization.

free parameters (1)
  • parameters of agent state dynamics
    The recursive update rule for the agent state is parameterized and optimized jointly via gradients.
axioms (1)
  • domain assumption · NMDPs admit a compact recursive agent state representation sufficient for policy optimization
    Invoked to justify the ASM policy class and the extension of the policy gradient theorem.
invented entities (1)
  • Agent State-Markov (ASM) policies · no independent evidence
    purpose: Class of policies that combine learnable agent state dynamics with a control policy mapping state to actions
    New construct introduced to enable the reward-centric formulation and gradient theorem.

pith-pipeline@v0.9.0 · 5534 in / 1290 out tokens · 57329 ms · 2026-05-12T05:33:26.909648+00:00 · methodology


Reference graph

Works this paper leans on

81 extracted references · 81 canonical work pages · 1 internal anchor

  1. Infinite-horizon policy-gradient estimation. Journal of Artificial Intelligence Research, 2001.
  2. Learning finite-state controllers for partially observable environments. Proceedings of the Fifteenth Conference on Uncertainty in Artificial Intelligence, 1999.
  3. Aberdeen, D., Buffet, O., and Thomas, O. Policy-gradients for PSRs and POMDPs. 2007.
  4. Yu, H. A function approximation approach to estimation of policy gradient for POMDP with structured policies.
  5. Recurrent policy gradients. Logic Journal of the IGPL, 2010.
  6. Ni, T., Eysenbach, B., and Salakhutdinov, R. Recurrent model-free RL can be a strong baseline for many POMDPs. 2022.
  7. Using reward machines for high-level task specification and decomposition in reinforcement learning. Proceedings of the 35th International Conference on Machine Learning, 2018.
  8. Cayci, S., He, N., and Srikant, R. Finite-time analysis of natural actor-critic for POMDPs. 2024.
  9. Cayci, S. and Eryilmaz, A. Recurrent natural policy gradient for POMDPs. 2025.
  10. Training language models to follow instructions with human feedback. Advances in Neural Information Processing Systems.
  11. End-to-end training of deep visuomotor policies. Journal of Machine Learning Research.
  12. Reinforcement learning in healthcare: A survey. ACM Computing Surveys (CSUR), 2021.
  13. Reinforcement learning, bit by bit. Foundations and Trends in Machine Learning, 2023.
  14. Puterman, M. L. Markov Decision Processes: Discrete Stochastic Dynamic Programming.
  15. Simple agent, complex environment: Efficient reinforcement learning with agent states. Journal of Machine Learning Research.
  16. Approximate information state for approximate planning and reinforcement learning in partially observed systems. Journal of Machine Learning Research.
  17. Kara, A. D. and Yuksel, S. Q-learning for stochastic control under general information structures and non-Markovian environments.
  18. Approximate information state based convergence analysis of recurrent Q-learning. Sixteenth European Workshop on Reinforcement Learning.
  19. Reinforcement learning with long short-term memory. Advances in Neural Information Processing Systems.
  20. Regular decision processes: A model for non-Markovian domains. IJCAI.
  21. Learning and solving regular decision processes. 29th International Joint Conference on Artificial Intelligence (IJCAI 2020), 2020.
  22. Provably efficient offline reinforcement learning in regular decision processes. Advances in Neural Information Processing Systems.
  23. Efficient PAC reinforcement learning in regular decision processes. IJCAI.
  24. Offline RL in regular decision processes: Sample efficiency via language metrics. The Thirteenth International Conference on Learning Representations.
  25. Tractable offline learning of regular decision processes. CoRR.
  26. Predictive representations of state. Advances in Neural Information Processing Systems.
  27. Learning predictive state representations. Proceedings of the 20th International Conference on Machine Learning (ICML-03).
  28. Bounded finite state controllers. Advances in Neural Information Processing Systems.
  29. Partially observed Markov decision processes. 2016.
  30. High probability convergence of stochastic gradient methods. International Conference on Machine Learning, 2023.
  31. Stochastic first- and zeroth-order methods for nonconvex stochastic programming. SIAM Journal on Optimization, 2013.
  32. Khaled, A. and Richtárik, P. Better theory for SGD in the nonconvex world. Transactions on Machine Learning Research, 2023.
  33. Decentralized stochastic control. Annals of Operations Research, 2016.
  34. Sequential decomposition of sequential dynamic teams: Applications to real-time communication and networked control systems. 2008.
  35. Stochastic Systems: Estimation, Identification, and Adaptive Control. 2015.
  36. Reinforcement learning in non-Markovian environments. Systems & Control Letters, 2024.
  37. Schmidhuber, J. Reinforcement learning in Markovian and non-Markovian environments. Advances in Neural Information Processing Systems.
  38. Wiering, M. A. and Schmidhuber, J. 1996.
  39. Simple statistical gradient-following algorithms for connectionist reinforcement learning. Machine Learning, 1992.
  40. Planning and acting in partially observable stochastic domains. Artificial Intelligence, 1998.
  41. Deep reinforcement learning for dialogue generation. Proceedings of the 2016 Conference on Empirical Methods in Natural Language Processing.
  42. Reinforcement learning with selective perception and hidden state. 1996.
  43. Sufficient statistics in the optimum control of stochastic systems. Journal of Mathematical Analysis and Applications, 1965.
  44. Policy gradient methods for reinforcement learning with function approximation. Advances in Neural Information Processing Systems.
  45. Approximate information state based convergence analysis of recurrent Q-learning. arXiv preprint arXiv:2306.05991.
  46. Sinha, A., Geist, M., and Mahajan, A. Periodic agent-state based Q-learning for POMDPs.
  47. Convergence of finite memory Q-learning for POMDPs and near optimality of learned policies under filter stability. Mathematics of Operations Research, 2023.
  48. On the theory of policy gradient methods: Optimality, approximation, and distribution shift. Journal of Machine Learning Research.
  49. Sample efficient reinforcement learning with REINFORCE. Proceedings of the AAAI Conference on Artificial Intelligence.
  50. On the global convergence rates of softmax policy gradient methods. International Conference on Machine Learning, 2020.
  51. Stochastic gradient succeeds for bandits. International Conference on Machine Learning, 2023.
  52. REINFORCE converges to optimal policies with any learning rate. The Thirty-Ninth Annual Conference on Neural Information Processing Systems.
  53. Young, S., Gašić, M., et al. POMDP-based statistical spoken dialog systems: A review. Proceedings of the IEEE, 2013.
  54. To modulate or to skip: De-escalating PARP inhibitor maintenance therapy in ovarian cancer using adaptive therapy. Cell Systems, 2024.
  55. Deep reinforcement learning for dynamic treatment regimes on medical registry data. 2017 IEEE International Conference on Healthcare Informatics (ICHI), 2017.
  56. Imperfect maintenance. European Journal of Operational Research, 1996.
  57. Giorgio, M., Guida, M., and Pulcini, G. An age- and state-dependent Markov model for degradation processes. 2011.
  58. A condition-based prognostic approach for age- and state-dependent partially observable nonlinear degrading system. Reliability Engineering & System Safety, 2023.
  59. Neuronlike adaptive elements that can solve difficult learning control problems. IEEE Transactions on Systems, Man, and Cybernetics, 1983.
  60. OpenAI Gym. arXiv preprint arXiv:1606.01540, 2016.
  61. Probability and Measure. 2017.
  62. Overcoming incomplete perception with utile distinction memory. Proceedings of the Tenth International Conference on Machine Learning, 1993.
  63. Wierstra, D. and Wiering, M. Utile distinction hidden Markov models.
  64. Instance-based utile distinctions for reinforcement learning with hidden state. Machine Learning Proceedings 1995, 1995.
  65. Actor-critic algorithms. Advances in Neural Information Processing Systems.
  66. Approximately optimal approximate reinforcement learning. Proceedings of the Nineteenth International Conference on Machine Learning.
  67. Choudhary, K., Gupta, D., and Thomas, P. S.
  68. Choi, Y., Oh, S., Huh, J. W., Joo, H.-T., Lee, H., You, W., Bae, C., Choi, J.-H., and Kim, K.-J. Communications Medicine.
  69. Optimum maintenance with incomplete information. Operations Research, 1968.
  70. Morad, S., Kortvelesy, R., Bettini, M., Liwicki, S., and Prorok, A. POPGym: Benchmarking partially observable reinforcement learning.
  71. Inventory control with product returns: The impact of imperfect information. European Journal of Operational Research, 2009.
  72. Adaptive quantitative trading: An imitative deep reinforcement learning approach. Proceedings of the AAAI Conference on Artificial Intelligence.
  73. FinRL-Meta: Market environments and benchmarks for data-driven financial reinforcement learning. Advances in Neural Information Processing Systems.
  74. Partially observable reinforcement learning for dialog-based interactive recommendation. Proceedings of the 15th ACM Conference on Recommender Systems.
  75. Bertsekas, D. P. and Tsitsiklis, J. N. Gradient convergence in gradient methods with errors. SIAM Journal on Optimization, 2000.
  76. Optimal control of Markov processes with incomplete state information. Journal of Mathematical Analysis and Applications, 1965.
  77. Learning representations by back-propagating errors. Nature, 1986.
  78. Smallwood, R. D. and Sondik, E. J. The optimal control of partially observable Markov processes over a finite horizon. Operations Research, 1973.
  79. Long short-term memory. Neural Computation, 1997.
  80. Loch, J. and Singh, S. Using eligibility traces to find the best memoryless policy in partially observable Markov decision processes.

Showing first 80 references.