pith. machine review for the scientific record.

arxiv: 2605.10816 · v1 · submitted 2026-05-11 · 💻 cs.LG · cs.AI

Recognition: no theorem link

Policy Gradient Methods for Non-Markovian Reinforcement Learning

Authors on Pith no claims yet

Pith reviewed 2026-05-12 05:33 UTC · model grok-4.3

classification 💻 cs.LG cs.AI
keywords policy gradient · non-Markovian reinforcement learning · agent state · reinforcement learning · ASMPG algorithm · NMDP · convergence guarantees

The pith

A policy gradient theorem for Agent State-Markov policies allows joint reward-driven optimization of internal states and actions in non-Markovian reinforcement learning.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper extends policy gradient methods from Markovian to non-Markovian decision processes by introducing Agent State-Markov policies. These policies maintain a recursively updated internal agent state that summarizes history and map that state to actions. Instead of fixing the state dynamics or training them with separate predictive losses, the approach derives a gradient that lets the state update rules and the control policy be optimized together to maximize expected reward. This produces the ASMPG algorithm, which comes with finite-time and almost-sure convergence results and outperforms predictive baselines on several non-Markovian tasks.
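
To make the construction concrete, here is a minimal sketch of an ASM policy trained end-to-end from reward alone, in the spirit described above. Everything in it is an illustrative assumption: the GRU-style state update, the toy CueRecallEnv, the hyperparameters, and the plain REINFORCE estimator stand in for the paper's actual architecture, benchmarks, and ASMPG estimator.

```python
# Sketch of an ASM policy: a parameterized recursive state update f_phi and a
# control policy pi_theta acting on the agent state, trained jointly from
# reward with a REINFORCE-style estimator. Illustrative, not the paper's code.
import torch
import torch.nn as nn

class CueRecallEnv:
    """Toy non-Markovian task (hypothetical, not from the paper): a binary cue
    is shown only at t=0; reward 1 arrives at the end iff the final action
    matches the cue, so the agent must carry the cue in its agent state."""
    def __init__(self, horizon=5):
        self.horizon = horizon
    def reset(self):
        self.cue = torch.randint(0, 2, (1,)).item()
        self.t = 0
        return torch.tensor([float(self.cue), 1.0])  # [cue, "cue visible" flag]
    def step(self, action):
        self.t += 1
        done = self.t == self.horizon
        reward = float(action == self.cue) if done else 0.0
        return torch.zeros(2), reward, done          # cue hidden after t=0

class ASMPolicy(nn.Module):
    """z_t = f_phi(z_{t-1}, a_{t-1}, o_t) followed by a_t ~ pi_theta(.|z_t)."""
    def __init__(self, obs_dim=2, n_actions=2, state_dim=8):
        super().__init__()
        self.f = nn.GRUCell(obs_dim + n_actions, state_dim)  # agent-state dynamics
        self.pi = nn.Linear(state_dim, n_actions)            # control policy
        self.state_dim, self.n_actions = state_dim, n_actions
    def act(self, z, prev_a, obs):
        z = self.f(torch.cat([obs, prev_a]).unsqueeze(0), z)
        dist = torch.distributions.Categorical(logits=self.pi(z).squeeze(0))
        a = dist.sample()
        return z, a, dist.log_prob(a)

env, agent = CueRecallEnv(), ASMPolicy()
opt = torch.optim.Adam(agent.parameters(), lr=1e-2)  # updates phi and theta jointly
for episode in range(2000):
    obs, done = env.reset(), False
    z = torch.zeros(1, agent.state_dim)
    prev_a = torch.zeros(agent.n_actions)
    logps, rewards = [], []
    while not done:
        z, a, logp = agent.act(z, prev_a, obs)
        obs, r, done = env.step(a.item())
        prev_a = nn.functional.one_hot(a, agent.n_actions).float()
        logps.append(logp); rewards.append(r)
    G = sum(rewards)                       # undiscounted episodic return
    loss = -G * torch.stack(logps).sum()   # REINFORCE on the joint parameters
    opt.zero_grad(); loss.backward(); opt.step()
```

Because the log-probabilities are backpropagated through the recurrent state, a single optimizer step moves both the state-update parameters and the policy head, with no auxiliary predictive loss anywhere in the loop.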

Core claim

We establish a novel policy gradient theorem for ASM policies, extending the classical policy gradient results from the Markovian setting to episodic and infinite-horizon discounted NMDPs. Building on this gradient expression, we propose the Agent State-Markov Policy Gradient (ASMPG) algorithm, which leverages the recursive structure of the agent state dynamics for efficient optimization. We establish finite-time and almost sure convergence guarantees, and empirically demonstrate that, on a range of non-Markovian tasks, ASMPG outperforms baselines that learn state representations via predictive objectives.

What carries the argument

The Agent State-Markov (ASM) policy, which pairs recursively updated agent state dynamics with a control policy that selects actions from the agent state, together with the derived policy gradient expression that differentiates through both components.
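
For orientation, here is one score-function form such a theorem could take, assuming stochastic parameterized state updates Z_{t+1} ~ f_φ(· | Z_t, A_t, O_{t+1}) and a control policy A_t ~ π_θ(· | Z_t). This is a sketch consistent with classical policy gradient arguments, not a reproduction of the paper's stated expression.

```latex
% Sketch of a joint score-function gradient for an ASM policy; the paper's
% exact theorem is not reproduced here and may differ in form.
\nabla_{(\theta,\varphi)} J
  = \mathbb{E}\!\left[ \sum_{t \ge 0} \gamma^{t} G_t
      \left( \nabla_{\theta} \log \pi_{\theta}(A_t \mid Z_t)
           + \nabla_{\varphi} \log f_{\varphi}(Z_t \mid Z_{t-1}, A_{t-1}, O_t) \right) \right],
\qquad G_t = \sum_{s \ge t} \gamma^{\,s-t} R_s .
```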

If this is right

  • The derived gradient expression supports direct differentiation through the recursive state update, enabling end-to-end optimization without auxiliary losses.
  • Finite-time convergence bounds hold for both episodic and infinite-horizon discounted non-Markovian settings.
  • Almost-sure convergence of the ASMPG iterates is guaranteed under standard step-size conditions (see the sketch after this list).
  • Empirical results show higher returns than predictive baselines across multiple history-dependent tasks.
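
The "standard step-size conditions" in the third bullet are presumably the usual Robbins-Monro requirements on the learning rates α_k; a minimal statement, assuming that is what the paper invokes:

```latex
% Robbins-Monro step-size conditions, assumed here to be the "standard
% step-size conditions" under which almost-sure convergence is claimed.
\sum_{k=0}^{\infty} \alpha_k = \infty,
\qquad
\sum_{k=0}^{\infty} \alpha_k^{2} < \infty,
\qquad \text{e.g. } \alpha_k = \tfrac{c}{k+1} \text{ for some } c > 0 .
```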

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

  • The reward-centric formulation may produce more compact agent states than prediction-based methods when only reward-relevant history matters.
  • The same gradient construction could be applied to other recursive memory architectures beyond the specific agent state used here.
  • If the joint optimization succeeds, separate representation-learning stages may become unnecessary in many partially observable or history-dependent problems.

Load-bearing premise

A recursively updated agent state can be jointly optimized with the policy using only reward signals to form a sufficient compact summary of non-Markovian history.

What would settle it

A non-Markovian task where ASMPG fails to reach the performance of a method that first learns predictive state representations and then optimizes the policy separately.

Figures

Figures reproduced from arXiv: 2605.10816 by Avik Kar, Eric Moulines, Nicholas Bambos, Rahul Singh, Shalabh Bhatnagar, Siddharth Chandak, Soumitra Sinhahajari.

Figure 1
Figure 1: Chatbot as a non-Markovian environment with agent state. We illustrate the role of agent state and ASM policies using a chatbot (Young et al., 2013). At each time step t, the user (the environment in this example) utters O_t ∈ O. The agent selects an action A_t ∈ A, a response to the user, and receives a reward r_t based on the user's satisfaction. The environment is non-Markovian, as both the utterances and the rewards depend on the entire interaction history.
Figure 2
Figure 2: Learning curves for ASMPG, AIS-KL, and AIS-MMD on the five environments.
Figure 3
Figure 3: Best-checkpoint performance over 10 random seeds, comparing ASMPG, AIS-KL, and AIS-MMD. The box denotes the interquartile range, and the whiskers indicate the range of non-outlier values; points beyond the whiskers are outliers. Each method is trained for 10^6 environment steps, with performance evaluated every 10,000 steps.
Figure 4
Figure 4: CheeseMaze. CheeseMaze is a partially observed navigation problem based on the maze example in McCallum (1993). The environment has 11 latent states, labeled 0, …, 10, where state 10 is the terminal goal state. The agent observes only 7 distinct observation symbols, so multiple maze locations are observationally aliased.
Figure 5
Figure 5: HallwayNavigation. HallwayNavigation is a deterministic, partially observed maze-navigation task introduced by McCallum (1995). The latent state is the agent's position in a 7 × 4 grid containing two interior blocked regions.
original abstract

We study policy gradient methods for reinforcement learning in non-Markovian decision processes (NMDPs), where observations and rewards depend on the entire interaction history. To handle this dependence, the agent maintains an internal state that is recursively updated to provide a compact summary of past observations and actions. In contrast to approaches that treat the agent state dynamics as fixed or learn it via predictive objectives, we propose a reward-centric formulation that jointly optimizes the agent state dynamics and the control policy to maximize the expected cumulative reward. To this end, we consider a class of Agent State-Markov (ASM) policies, comprising an agent state dynamics and a control policy that maps the agent state to actions. We establish a novel policy gradient theorem for ASM policies, extending the classical policy gradient results from the Markovian setting to episodic and infinite-horizon discounted NMDPs. Building on this gradient expression, we propose the Agent State-Markov Policy Gradient (ASMPG) algorithm, which leverages the recursive structure of the agent state dynamics for efficient optimization. We establish finite-time and almost sure convergence guarantees, and empirically demonstrate that, on a range of non-Markovian tasks, ASMPG outperforms baselines that learn state representations via predictive objectives.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Referee Report

1 major / 0 minor

Summary. The manuscript proposes a reward-centric approach to policy gradients in non-Markovian RL by introducing Agent State-Markov (ASM) policies, where agent state dynamics are jointly optimized with the policy parameters to maximize expected reward. It derives a policy gradient theorem for ASM policies in episodic and discounted infinite-horizon NMDPs, introduces the ASMPG algorithm leveraging the recursive structure, proves finite-time and almost-sure convergence, and reports empirical superiority over predictive state representation baselines on non-Markovian tasks.

Significance. If the results hold, the work offers a simpler alternative to non-Markovian RL methods that rely on predictive objectives for learning state representations, by instead using only the reward signal for joint optimization. The extension of the policy gradient theorem and the convergence guarantees would be valuable contributions, particularly if they demonstrate that reward-centric optimization suffices for discovering adequate history summaries. The empirical results suggest practical advantages, but the significance depends on the validity of the joint optimization claim.

major comments (1)
  1. The policy gradient theorem for ASM policies is claimed to extend classical results, but since it holds for any fixed agent-state recursion (reducing to a Markov process on the augmented state), the central novelty and load-bearing claim is the joint optimization of the recursion parameters with the policy using only reward. The finite-time and a.s. convergence results guarantee convergence to a stationary point of this joint objective, but do not establish that the stationary point yields a sufficient statistic for the non-Markovian history, unlike methods with explicit predictive losses.
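
The reduction the referee invokes can be made explicit: for a fixed recursion f_φ, augment the (non-Markovian) history H_t with the agent state Z_t. The pair is Markov by construction, since H_{t+1} = (H_t, A_t, O_{t+1}) and Z_{t+1} = f_φ(Z_t, A_t, O_{t+1}), so the classical theorem already yields the θ-gradient. A one-line sketch of this reading, not the paper's construction:

```latex
% Fixed recursion => the augmented process is Markov, and the classical
% policy gradient theorem applies to the theta-parameters alone.
X_t := (H_t, Z_t), \qquad
\mathbb{P}(X_{t+1} \in \cdot \mid X_0, \dots, X_t)
  = \mathbb{P}(X_{t+1} \in \cdot \mid X_t)
\;\Longrightarrow\;
\nabla_{\theta} J
  = \mathbb{E}\!\left[ \sum_{t \ge 0} \gamma^{t} G_t \,
      \nabla_{\theta} \log \pi_{\theta}(A_t \mid Z_t) \right].
```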

Simulated Author's Rebuttal

1 response · 0 unresolved

We thank the referee for their constructive feedback. We address the major comment below with clarifications on the scope of our contributions.

point-by-point responses
  1. Referee: The policy gradient theorem for ASM policies is claimed to extend classical results, but since it holds for any fixed agent-state recursion (reducing to a Markov process on the augmented state), the central novelty and load-bearing claim is the joint optimization of the recursion parameters with the policy using only reward. The finite-time and a.s. convergence results guarantee convergence to a stationary point of this joint objective, but do not establish that the stationary point yields a sufficient statistic for the non-Markovian history, unlike methods with explicit predictive losses.

    Authors: We agree that for any fixed agent-state recursion the theorem reduces to the classical policy gradient result on the induced Markov process over the augmented state. The central contribution of the manuscript is the joint optimization of the recursion parameters with the policy parameters using only the reward signal; this requires deriving a gradient expression that propagates through the parameterized state dynamics. We do not claim or prove that stationary points of the joint objective are sufficient statistics for the history. Our finite-time and almost-sure convergence results apply strictly to the reward objective. Empirical results on non-Markovian tasks indicate that the learned representations are effective in practice. In the revision we will add an explicit discussion section delineating these theoretical limitations and contrasting the approach with predictive-state methods. revision: yes
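
For a deterministic, differentiable recursion z_t = f_φ(z_{t-1}, a_{t-1}, o_t), "propagating the gradient through the parameterized state dynamics" is the recursive sensitivity below; this is our gloss of the mechanism, not the authors' derivation.

```latex
% Chain rule through the recursive agent-state update: the sensitivity of
% z_t to phi accumulates across time, and the phi-gradient of the action
% log-likelihood is obtained by composing with the policy's z-gradient.
\frac{\mathrm{d} z_t}{\mathrm{d} \varphi}
  = \frac{\partial f_{\varphi}}{\partial \varphi}(z_{t-1}, a_{t-1}, o_t)
  + \frac{\partial f_{\varphi}}{\partial z}(z_{t-1}, a_{t-1}, o_t)\,
    \frac{\mathrm{d} z_{t-1}}{\mathrm{d} \varphi},
\qquad
\nabla_{\varphi} \log \pi_{\theta}(a_t \mid z_t)
  = \left( \nabla_{z} \log \pi_{\theta}(a_t \mid z_t) \right)^{\!\top}
    \frac{\mathrm{d} z_t}{\mathrm{d} \varphi}.
```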

Circularity Check

0 steps flagged

No circularity: novel policy gradient theorem is an independent extension of classical results to ASM policies

full rationale

The paper defines the ASM policy class independently (agent state recursion plus policy on that state) and derives a new policy gradient theorem by extending the classical Markovian policy gradient to episodic and infinite-horizon discounted NMDPs. The ASMPG algorithm is then constructed directly from this gradient expression, with finite-time and almost-sure convergence results stated for the joint optimization of recursion and policy parameters. No equation or claim reduces by construction to a fitted quantity, prior predictive objective, or self-citation chain; the central mathematical content is self-contained against the classical baseline, and empirical comparisons are external to the derivation.

Axiom & Free-Parameter Ledger

1 free parameter · 1 axiom · 1 invented entity

The central claim rests on the existence of a learnable recursive agent state that compactly summarizes history for the purpose of reward maximization.

free parameters (1)
  • parameters of agent state dynamics
    The recursive update rule for the agent state is parameterized and optimized jointly via gradients.
axioms (1)
  • domain assumption · NMDPs admit a compact recursive agent state representation sufficient for policy optimization
    Invoked to justify the ASM policy class and the extension of the policy gradient theorem.
invented entities (1)
  • Agent State-Markov (ASM) policies · no independent evidence
    purpose: Class of policies that combine learnable agent state dynamics with a control policy mapping state to actions
    New construct introduced to enable the reward-centric formulation and gradient theorem.

pith-pipeline@v0.9.0 · 5534 in / 1290 out tokens · 57329 ms · 2026-05-12T05:33:26.909648+00:00 · methodology


Reference graph

Works this paper leans on

81 extracted references · 81 canonical work pages · 1 internal anchor

  1. Infinite-horizon policy-gradient estimation. Journal of Artificial Intelligence Research, 2001.
  2. Learning finite-state controllers for partially observable environments. Proceedings of the Fifteenth Conference on Uncertainty in Artificial Intelligence, 1999.
  3. Aberdeen, D., Buffet, O., and Thomas, O. Policy-gradients for PSRs and POMDPs. 2007.
  4. Yu, H. A function approximation approach to estimation of policy gradient for POMDP with structured policies.
  5. Recurrent policy gradients. Logic Journal of the IGPL, 2010.
  6. Ni, T., Eysenbach, B., and Salakhutdinov, R. Recurrent model-free RL can be a strong baseline for many POMDPs. 2022.
  7. Using reward machines for high-level task specification and decomposition in reinforcement learning. Proceedings of the 35th International Conference on Machine Learning, 2018.
  8. Cayci, S., He, N., and Srikant, R. Finite-time analysis of natural actor-critic for POMDPs. 2024.
  9. Cayci, S. and Eryilmaz, A. Recurrent natural policy gradient for POMDPs. 2025.
  10. Training language models to follow instructions with human feedback. Advances in Neural Information Processing Systems.
  11. End-to-end training of deep visuomotor policies. Journal of Machine Learning Research.
  12. Reinforcement learning in healthcare: A survey. ACM Computing Surveys (CSUR), 2021.
  13. Reinforcement learning, bit by bit. Foundations and Trends in Machine Learning, 2023.
  14. Puterman, M. L. Markov Decision Processes: Discrete Stochastic Dynamic Programming.
  15. Simple agent, complex environment: Efficient reinforcement learning with agent states. Journal of Machine Learning Research.
  16. Approximate information state for approximate planning and reinforcement learning in partially observed systems. Journal of Machine Learning Research.
  17. Kara, A. D. and Yuksel, S. Q-learning for stochastic control under general information structures and non-Markovian environments.
  18. Approximate information state based convergence analysis of recurrent Q-learning. Sixteenth European Workshop on Reinforcement Learning.
  19. Reinforcement learning with long short-term memory. Advances in Neural Information Processing Systems.
  20. Regular decision processes: A model for non-Markovian domains. IJCAI.
  21. Learning and solving regular decision processes. 29th International Joint Conference on Artificial Intelligence (IJCAI 2020), 2020.
  22. Provably efficient offline reinforcement learning in regular decision processes. Advances in Neural Information Processing Systems.
  23. Efficient PAC reinforcement learning in regular decision processes. IJCAI.
  24. Offline RL in regular decision processes: Sample efficiency via language metrics. The Thirteenth International Conference on Learning Representations.
  25. Tractable offline learning of regular decision processes. CoRR.
  26. Predictive representations of state. Advances in Neural Information Processing Systems.
  27. Learning predictive state representations. Proceedings of the 20th International Conference on Machine Learning (ICML-03).
  28. Bounded finite state controllers. Advances in Neural Information Processing Systems.
  29. Partially observed Markov decision processes. 2016.
  30. High probability convergence of stochastic gradient methods. International Conference on Machine Learning, 2023.
  31. Stochastic first- and zeroth-order methods for nonconvex stochastic programming. SIAM Journal on Optimization, 2013.
  32. Khaled, A. and Richtárik, P. Better theory for SGD in the nonconvex world. Transactions on Machine Learning Research, 2023.
  33. Decentralized stochastic control. Annals of Operations Research, 2016.
  34. Sequential decomposition of sequential dynamic teams: Applications to real-time communication and networked control systems. 2008.
  35. Stochastic Systems: Estimation, Identification, and Adaptive Control. 2015.
  36. Reinforcement learning in non-Markovian environments. Systems & Control Letters, 2024.
  37. Schmidhuber, J. Reinforcement learning in Markovian and non-Markovian environments. Advances in Neural Information Processing Systems.
  38. Wiering, M. A. and Schmidhuber, J. 1996.
  39. Simple statistical gradient-following algorithms for connectionist reinforcement learning. Machine Learning, 1992.
  40. Planning and acting in partially observable stochastic domains. Artificial Intelligence, 1998.
  41. Deep reinforcement learning for dialogue generation. Proceedings of the 2016 Conference on Empirical Methods in Natural Language Processing.
  42. Reinforcement learning with selective perception and hidden state. 1996.
  43. Sufficient statistics in the optimum control of stochastic systems. Journal of Mathematical Analysis and Applications, 1965.
  44. Policy gradient methods for reinforcement learning with function approximation. Advances in Neural Information Processing Systems.
  45. Approximate information state based convergence analysis of recurrent Q-learning. arXiv preprint arXiv:2306.05991.
  46. Sinha, A., Geist, M., and Mahajan, A. Periodic agent-state based Q-learning for POMDPs.
  47. Convergence of finite memory Q-learning for POMDPs and near optimality of learned policies under filter stability. Mathematics of Operations Research, 2023.
  48. On the theory of policy gradient methods: Optimality, approximation, and distribution shift. Journal of Machine Learning Research.
  49. Sample efficient reinforcement learning with REINFORCE. Proceedings of the AAAI Conference on Artificial Intelligence.
  50. On the global convergence rates of softmax policy gradient methods. International Conference on Machine Learning, 2020.
  51. Stochastic gradient succeeds for bandits. International Conference on Machine Learning, 2023.
  52. REINFORCE converges to optimal policies with any learning rate. The Thirty-Ninth Annual Conference on Neural Information Processing Systems.
  53. Young, S., Gašić, M., et al. POMDP-based statistical spoken dialog systems: A review. Proceedings of the IEEE, 2013.
  54. To modulate or to skip: De-escalating PARP inhibitor maintenance therapy in ovarian cancer using adaptive therapy. Cell Systems, 2024.
  55. Deep reinforcement learning for dynamic treatment regimes on medical registry data. 2017 IEEE International Conference on Healthcare Informatics (ICHI), 2017.
  56. Imperfect maintenance. European Journal of Operational Research, 1996.
  57. Giorgio, M., Guida, M., and Pulcini, G. An age- and state-dependent Markov model for degradation processes. 2011.
  58. A condition-based prognostic approach for age- and state-dependent partially observable nonlinear degrading system. Reliability Engineering & System Safety, 2023.
  59. Neuronlike adaptive elements that can solve difficult learning control problems. IEEE Transactions on Systems, Man, and Cybernetics, 1983.
  60. OpenAI Gym. arXiv preprint arXiv:1606.01540, 2016.
  61. Probability and Measure. 2017.
  62. Overcoming incomplete perception with utile distinction memory. Proceedings of the Tenth International Conference on Machine Learning, 1993.
  63. Wierstra, D. and Wiering, M. Utile distinction hidden Markov models.
  64. Instance-based utile distinctions for reinforcement learning with hidden state. Machine Learning Proceedings 1995, 1995.
  65. Actor-critic algorithms. Advances in Neural Information Processing Systems.
  66. Approximately optimal approximate reinforcement learning. Proceedings of the Nineteenth International Conference on Machine Learning.
  67. Choudhary, K., Gupta, D., and Thomas, P. S.
  68. Choi, Y., Oh, S., Huh, J. W., Joo, H.-T., Lee, H., You, W., Bae, C., Choi, J.-H., and Kim, K.-J. Communications Medicine.
  69. Optimum maintenance with incomplete information. Operations Research, 1968.
  70. Morad, S., Kortvelesy, R., Bettini, M., Liwicki, S., and Prorok, A. POPGym: Benchmarking partially observable reinforcement learning.
  71. Inventory control with product returns: The impact of imperfect information. European Journal of Operational Research, 2009.
  72. Adaptive quantitative trading: An imitative deep reinforcement learning approach. Proceedings of the AAAI Conference on Artificial Intelligence.
  73. FinRL-Meta: Market environments and benchmarks for data-driven financial reinforcement learning. Advances in Neural Information Processing Systems.
  74. Partially observable reinforcement learning for dialog-based interactive recommendation. Proceedings of the 15th ACM Conference on Recommender Systems.
  75. Bertsekas, D. P. and Tsitsiklis, J. N. Gradient convergence in gradient methods with errors. SIAM Journal on Optimization, 2000.
  76. Optimal control of Markov processes with incomplete state information. Journal of Mathematical Analysis and Applications, 1965.
  77. Learning representations by back-propagating errors. Nature, 1986.
  78. Smallwood, R. D. and Sondik, E. J. The optimal control of partially observable Markov processes over a finite horizon. Operations Research, 1973.
  79. Long short-term memory. Neural Computation, 1997.
  80. Loch, J. and Singh, S. Using eligibility traces to find the best memoryless policy in partially observable Markov decision processes.

Showing first 80 references.