Recognition: no theorem link
Policy Gradient Methods for Non-Markovian Reinforcement Learning
Pith reviewed 2026-05-12 05:33 UTC · model grok-4.3
The pith
A policy gradient theorem for Agent State-Markov policies allows joint reward-driven optimization of internal states and actions in non-Markovian reinforcement learning.
A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.
Core claim
We establish a novel policy gradient theorem for ASM policies, extending the classical policy gradient results from the Markovian setting to episodic and infinite-horizon discounted NMDPs. Building on this gradient expression, we propose the Agent State-Markov Policy Gradient (ASMPG) algorithm, which leverages the recursive structure of the agent state dynamics for efficient optimization. We establish finite-time and almost sure convergence guarantees, and empirically demonstrate that, on a range of non-Markovian tasks, ASMPG outperforms baselines that learn state representations via predictive objectives.
What carries the argument
The Agent State-Markov (ASM) policy, which pairs recursively updated agent state dynamics with a control policy that selects actions from the agent state, together with the derived policy gradient expression that differentiates through both components.
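Concretely, an ASM policy pairs two parameterized maps: a recursive state update z' = f_theta(z, a, o) and a control policy pi_phi(a | z). The sketch below is an assumed illustration of that structure (linear-tanh recursion, softmax head); it is not the paper's exact parameterization.

```python
import numpy as np

rng = np.random.default_rng(0)

class ASMPolicy:
    """Illustrative Agent State-Markov policy: a recursively updated
    agent state paired with a softmax control policy over that state.
    The linear-tanh recursion and softmax head are assumptions, not
    the paper's architecture."""

    def __init__(self, obs_dim, n_actions, state_dim):
        # theta: parameters of the agent-state dynamics f_theta
        self.W_z = 0.1 * rng.standard_normal((state_dim, state_dim))
        self.W_o = 0.1 * rng.standard_normal((state_dim, obs_dim))
        self.W_a = 0.1 * rng.standard_normal((state_dim, n_actions))
        # phi: parameters of the control policy pi_phi(a | z)
        self.W_pi = 0.1 * rng.standard_normal((n_actions, state_dim))
        self.n_actions = n_actions
        self.state_dim = state_dim

    def init_state(self):
        return np.zeros(self.state_dim)

    def update_state(self, z, a, o):
        # z' = f_theta(z, a, o): compact recursive summary of history
        a_onehot = np.eye(self.n_actions)[a]
        return np.tanh(self.W_z @ z + self.W_o @ o + self.W_a @ a_onehot)

    def action_probs(self, z):
        # pi_phi(. | z): actions depend on history only through z
        logits = self.W_pi @ z
        e = np.exp(logits - logits.max())
        return e / e.sum()

policy = ASMPolicy(obs_dim=3, n_actions=2, state_dim=4)
z = policy.init_state()
o = rng.standard_normal(3)
p = policy.action_probs(z)
a = int(rng.choice(2, p=p))
z_next = policy.update_state(z, a, o)
```

The derived gradient then differentiates through both `update_state` (in theta) and `action_probs` (in phi).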
If this is right
- The derived gradient expression supports direct differentiation through the recursive state update, enabling end-to-end optimization without auxiliary losses.
- Finite-time convergence bounds hold for both episodic and infinite-horizon discounted non-Markovian settings.
- Almost-sure convergence of the ASMPG iterates is guaranteed under standard step-size conditions.
- Empirical results show higher returns than predictive baselines across multiple history-dependent tasks.
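The first and third bullets can be sketched together on a toy history-dependent task. Everything below is an assumed illustration (scalar agent state, Bernoulli policy, hand-derived chain rule, invented environment), not the paper's ASMPG algorithm: the score function of the dynamics parameters is obtained by propagating sensitivities through the recursive state update, and the step sizes satisfy the standard Robbins-Monro conditions.

```python
import numpy as np

rng = np.random.default_rng(1)

def run_episode(w, v, u, T=3):
    """Toy non-Markovian episode: the final reward is 1 iff the last
    action matches the sign of the FIRST observation, so the agent must
    carry history in its recursive state. Assumed illustration only."""
    z, dz_dw, dz_dv = 0.0, 0.0, 0.0    # agent state and its sensitivities
    score_w = score_v = score_u = 0.0  # accumulated score functions
    o0 = rng.choice([-1.0, 1.0])
    o = o0
    for _ in range(T):
        p = 1.0 / (1.0 + np.exp(-u * z))   # pi_phi(a=1 | z)
        a = 1 if rng.random() < p else 0
        # d log pi / d u acts directly; d log pi / d(w, v) acts only
        # through the recursive state z (chain rule through f_theta)
        score_u += (a - p) * z
        score_w += (a - p) * u * dz_dw
        score_v += (a - p) * u * dz_dv
        # recursive update z' = tanh(w z + v o), propagating dz/dw, dz/dv
        z_new = np.tanh(w * z + v * o)
        d = 1.0 - z_new ** 2
        dz_dw, dz_dv = d * (z + w * dz_dw), d * (o + w * dz_dv)
        z = z_new
        o = 0.0  # later observations are uninformative
    R = 1.0 if (a == 1) == (o0 > 0) else 0.0
    return R, np.array([score_w, score_v, score_u])

# REINFORCE-style joint update of dynamics (w, v) and policy (u)
# parameters, with Robbins-Monro step sizes alpha_k = (k+1)^{-0.6}
# (sum infinite, sum of squares finite) and a constant baseline of 0.5.
params = 0.5 * rng.standard_normal(3)
for k in range(2000):
    R, score = run_episode(*params)
    params += (k + 1) ** -0.6 * (R - 0.5) * score
```

No auxiliary predictive loss appears anywhere: the state recursion is shaped purely by the reward signal, which is the pattern the first bullet describes.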
Where Pith is reading between the lines
- The reward-centric formulation may produce more compact agent states than prediction-based methods when only reward-relevant history matters.
- The same gradient construction could be applied to other recursive memory architectures beyond the specific agent state used here.
- If the joint optimization succeeds, separate representation-learning stages may become unnecessary in many partially observable or history-dependent problems.
Load-bearing premise
A recursively updated agent state can be jointly optimized with the policy using only reward signals to form a sufficient compact summary of non-Markovian history.
What would settle it
A non-Markovian task where ASMPG fails to reach the performance of a method that first learns predictive state representations and then optimizes the policy separately.
Original abstract
We study policy gradient methods for reinforcement learning in non-Markovian decision processes (NMDPs), where observations and rewards depend on the entire interaction history. To handle this dependence, the agent maintains an internal state that is recursively updated to provide a compact summary of past observations and actions. In contrast to approaches that treat the agent state dynamics as fixed or learn it via predictive objectives, we propose a reward-centric formulation that jointly optimizes the agent state dynamics and the control policy to maximize the expected cumulative reward. To this end, we consider a class of Agent State-Markov (ASM) policies, comprising an agent state dynamics and a control policy that maps the agent state to actions. We establish a novel policy gradient theorem for ASM policies, extending the classical policy gradient results from the Markovian setting to episodic and infinite-horizon discounted NMDPs. Building on this gradient expression, we propose the Agent State-Markov Policy Gradient (ASMPG) algorithm, which leverages the recursive structure of the agent state dynamics for efficient optimization. We establish finite-time and almost sure convergence guarantees, and empirically demonstrate that, on a range of non-Markovian tasks, ASMPG outperforms baselines that learn state representations via predictive objectives.
Editorial analysis
A structured set of objections, weighed in public.
Referee Report
Summary. The manuscript proposes a reward-centric approach to policy gradients in non-Markovian RL by introducing Agent State-Markov (ASM) policies, where agent state dynamics are jointly optimized with the policy parameters to maximize expected reward. It derives a policy gradient theorem for ASM policies in episodic and discounted infinite-horizon NMDPs, introduces the ASMPG algorithm leveraging the recursive structure, proves finite-time and almost-sure convergence, and reports empirical superiority over predictive state representation baselines on non-Markovian tasks.
Significance. If the results hold, the work offers a simpler alternative to non-Markovian RL methods that rely on predictive objectives for learning state representations, by instead using only the reward signal for joint optimization. The extension of the policy gradient theorem and the convergence guarantees would be valuable contributions, particularly if they demonstrate that reward-centric optimization suffices for discovering adequate history summaries. The empirical results suggest practical advantages, but the significance depends on the validity of the joint optimization claim.
major comments (1)
- The policy gradient theorem for ASM policies is claimed to extend classical results, but since it holds for any fixed agent-state recursion (reducing to a Markov process on the augmented state), the central novelty and load-bearing claim is the joint optimization of the recursion parameters with the policy using only reward. The finite-time and a.s. convergence results guarantee convergence to a stationary point of this joint objective, but do not establish that the stationary point yields a sufficient statistic for the non-Markovian history, unlike methods with explicit predictive losses.
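The reduction the referee describes can be made explicit; the following sketch uses assumed notation, not necessarily the paper's:

```latex
% H_t = (O_0, A_0, \dots, O_t) is the history and Z_t = z_\theta(H_t)
% the agent state obtained by unrolling Z_{t+1} = f_\theta(Z_t, A_t, O_{t+1}).
% For fixed \theta, the history process (H_t) is trivially Markov, and
\pi_{\theta,\phi}(a \mid h) \;=\; \pi_\phi\bigl(a \mid z_\theta(h)\bigr)
% is a Markov policy on that process, so the classical policy gradient
% theorem already yields \nabla_\phi J.  The load-bearing new content is
% \nabla_\theta J, which differentiates z_\theta(h) through the recursion.
```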
Simulated Author's Rebuttal

We thank the referee for their constructive feedback. We address the major comment below with clarifications on the scope of our contributions.
Point-by-point responses
- Referee: The policy gradient theorem for ASM policies is claimed to extend classical results, but since it holds for any fixed agent-state recursion (reducing to a Markov process on the augmented state), the central novelty and load-bearing claim is the joint optimization of the recursion parameters with the policy using only reward. The finite-time and a.s. convergence results guarantee convergence to a stationary point of this joint objective, but do not establish that the stationary point yields a sufficient statistic for the non-Markovian history, unlike methods with explicit predictive losses.
Authors: We agree that for any fixed agent-state recursion the theorem reduces to the classical policy gradient result on the induced Markov process over the augmented state. The central contribution of the manuscript is the joint optimization of the recursion parameters with the policy parameters using only the reward signal; this requires deriving a gradient expression that propagates through the parameterized state dynamics. We do not claim or prove that stationary points of the joint objective are sufficient statistics for the history. Our finite-time and almost-sure convergence results apply strictly to the reward objective. Empirical results on non-Markovian tasks indicate that the learned representations are effective in practice. In the revision we will add an explicit discussion section delineating these theoretical limitations and contrasting the approach with predictive-state methods.
Revision: yes
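For concreteness, a REINFORCE-style form of the gradient the rebuttal describes (propagating through the parameterized state dynamics) would plausibly read as below. This is a standard likelihood-ratio sketch under assumed notation, not the paper's stated theorem:

```latex
% Episodic likelihood-ratio sketch (assumed notation):
\nabla_{(\theta,\phi)} J
  = \mathbb{E}\!\left[\Bigl(\textstyle\sum_{t=0}^{T-1}
      \nabla_{(\theta,\phi)} \log \pi_\phi\bigl(A_t \mid Z_t^\theta\bigr)\Bigr)
      \textstyle\sum_{t=0}^{T-1} \gamma^t R_t\right],
\qquad
\nabla_\theta \log \pi_\phi\bigl(A_t \mid Z_t^\theta\bigr)
  = \nabla_z \log \pi_\phi(A_t \mid z)\big|_{z=Z_t^\theta}\,
    \nabla_\theta Z_t^\theta,
% where \nabla_\theta Z_t^\theta is obtained by unrolling the recursion
% Z_{t+1}^\theta = f_\theta(Z_t^\theta, A_t, O_{t+1}) (backprop through time).
```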
Circularity Check
No circularity: the novel policy gradient theorem is an independent extension of classical results to ASM policies
Full rationale
The paper defines the ASM policy class independently (agent state recursion plus policy on that state) and derives a new policy gradient theorem by extending the classical Markovian policy gradient to episodic and infinite-horizon discounted NMDPs. The ASMPG algorithm is then constructed directly from this gradient expression, with finite-time and almost-sure convergence results stated for the joint optimization of recursion and policy parameters. No equation or claim reduces by construction to a fitted quantity, prior predictive objective, or self-citation chain; the central mathematical content is self-contained against the classical baseline, and empirical comparisons are external to the derivation.
Axiom & Free-Parameter Ledger
free parameters (1)
- parameters of agent state dynamics
axioms (1)
- Domain assumption: NMDPs admit a compact recursive agent state representation sufficient for policy optimization
invented entities (1)
- Agent State-Markov (ASM) policies (no independent evidence)