
arxiv: 2605.17431 · v1 · pith:SKCF3QJInew · submitted 2026-05-17 · 💻 cs.LG · cs.AI

MATE: Solving Contextual Markov Decision Processes with Memory of Accumulated Transition Embeddings

Pith reviewed 2026-05-20 13:35 UTC · model grok-4.3

classification 💻 cs.LG cs.AI
keywords Contextual Markov Decision Processes · Memory Architecture · Reinforcement Learning · Posterior Approximation · Permutation Invariance · Online Adaptation · Transition Embeddings · Sequence Models

The pith

A sum of transition embeddings can stand in for the full posterior over contexts in CMDPs while keeping enough information for near-optimal decisions.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

In contextual Markov decision processes an agent must adapt its actions to an unknown context that shapes the transition dynamics. Computing the exact posterior over possible contexts grows intractable with more observations. MATE keeps a running sum of embeddings computed from each observed transition instead. The sum works because the true posterior is unchanged by the order of those observations, so the aggregated memory stays expressive enough for good action choices. The resulting architecture runs with fixed per-step cost and sidesteps both the quadratic expense of attention models and the gradient problems of recurrent networks.
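The running-sum update described above can be sketched in a few lines. This is a minimal illustration, not the paper's implementation: the `embed` function is a hand-written stand-in for a learned transition encoder, and `EMBED_DIM`, `SumMemory`, and the random trajectory are hypothetical names and data chosen only for the example.

```python
import math
import random

EMBED_DIM = 4  # hypothetical embedding size, chosen for illustration


def embed(transition):
    """Toy stand-in for a learned transition encoder (an assumption, not the paper's network).

    Maps a (state, action, reward, next_state) tuple of floats to a fixed-size vector.
    """
    s, a, r, s_next = transition
    return [math.tanh(s + a), math.tanh(r), math.tanh(s_next - s), math.tanh(a * r)]


class SumMemory:
    """Fixed-size memory updated as m_t = m_{t-1} + embed(x_t)."""

    def __init__(self, dim=EMBED_DIM):
        self.m = [0.0] * dim

    def update(self, transition):
        # One embedding and one vector addition per step, independent of history length.
        e = embed(transition)
        self.m = [mi + ei for mi, ei in zip(self.m, e)]
        return self.m


random.seed(0)
trajectory = [tuple(random.uniform(-1, 1) for _ in range(4)) for _ in range(50)]

memory = SumMemory()
for x in trajectory:
    memory.update(x)

# The running sum agrees with the batch sum over the whole history (up to float rounding).
batch = [sum(col) for col in zip(*(embed(x) for x in trajectory))]
max_gap = max(abs(a - b) for a, b in zip(memory.m, batch))
print(max_gap < 1e-9)
```

The memory never grows with the number of observed transitions, which is the property contrasted with attention over a growing history.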

Core claim

The paper establishes that a memory formed by summing embeddings of successive transitions is provably sufficient to represent the posterior belief over contexts in a CMDP. This follows directly from the fact that the posterior distribution is invariant to the ordering of the observations. Consequently the sum serves as a fixed-size, constant-cost substitute for the growing belief state, enabling online adaptation that matches the returns of standard sequence models on benchmark CMDP tasks.
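The order-invariance step can be written out with the standard Bayes argument. The notation below is an assumption made for illustration: $x_i$ is the $i$-th observed transition, $p(x_i \mid c)$ is its context-dependent likelihood (for example a dynamics term such as $p(s'_i \mid s_i, a_i, c)$), and actions are assumed to depend on the context only through the observed history. The paper's own statement and proof may differ in detail.

```latex
\begin{align}
p(c \mid x_{1:t})
  &\propto p(c) \prod_{i=1}^{t} p(x_i \mid c) \\
  &= p(c) \prod_{i=1}^{t} p(x_{\sigma(i)} \mid c)
  \quad \text{for any permutation } \sigma \text{ of } \{1, \dots, t\}, \\
\Rightarrow \quad
p(c \mid x_{1:t}) &= p(c \mid x_{\sigma(1)}, \dots, x_{\sigma(t)}).
\end{align}
```

The product is commutative, so reordering the transitions leaves the posterior unchanged. Invariance alone shows that a symmetric aggregator is admissible; whether a particular sum of embeddings keeps enough information is a separate question.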

What carries the argument

Sum-aggregated memory of transition embeddings, which exploits the permutation invariance of the context posterior to retain sufficient statistics for action selection.

Load-bearing premise

That simply adding up transition embeddings is enough to keep the information needed for near-optimal choices, with no need for order or other structure.

What would settle it

A CMDP in which optimal behavior requires remembering the exact sequence of transitions rather than only their aggregate statistics; the sum memory would then produce visibly lower returns than an order-sensitive model.
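The falsification test above hinges on one property: two histories that are permutations of each other produce the same sum memory. A small sketch makes this concrete. The one-hot cue embedding and the "first cue" target are hypothetical, chosen only to show an order-dependent quantity that the sum cannot recover; they are not taken from the paper's benchmarks.

```python
# Hypothetical exact embedding of two cue types as one-hot integer vectors.
ONE_HOT = {"L": (1, 0), "R": (0, 1)}


def sum_memory(seq):
    """Accumulate one-hot cue embeddings into a fixed-size sum."""
    m = (0, 0)
    for cue in seq:
        e = ONE_HOT[cue]
        m = (m[0] + e[0], m[1] + e[1])
    return m


def first_cue(seq):
    """An order-dependent target: which cue was observed first."""
    return seq[0]


seq_a = ["L", "R", "R"]
seq_b = ["R", "L", "R"]  # same cues, different order

same_memory = sum_memory(seq_a) == sum_memory(seq_b)
different_target = first_cue(seq_a) != first_cue(seq_b)
print(same_memory, different_target)  # prints: True True
```

Any policy that reads only this memory must act identically on both histories, so a task whose optimal action depends on `first_cue` would expose the gap, provided the order is not already encoded inside each transition.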

Figures

Figures reproduced from arXiv: 2605.17431 by Frank Chongwoo Park, Gene Chung, Himchan Hwang, Hyeokju Jeong, Sangwoong Yoon, Seungyeon Kim.

Figure 1: Overview of MATE. MATE represents the memory $m_t$ as a summation of transition embeddings, serving as a tractable substitute for the intractable posterior $p(c \mid x_{1:t})$. By preserving the permutation invariance of the posterior, $\pi_\theta$ is provably capable of representing the optimal policy $\pi^*$, despite its structural simplicity.
…former and RNN baselines across diverse benchmarks. Our results highlight its effective…
Figure 2: Comparison of memory architectures. MATE replaces the attention mechanisms of Transformers and the recurrent connections of RNNs with simple sum aggregation. Single-layer versions are illustrated for clarity.
…Transition Embeddings (MATE), a permutation-invariant memory defined as $m_t = \sum_{i=1}^{t} E_\psi(x_i)$ (6). The transition encoder $E_\psi$ can be any network that maps each transition $x_i$ to an embedding. As highligh…
Figure 3: Learning curves on MuJoCo (Left, Center) and Meta-World (Right) benchmarks. Tasks are ordered by increasing difficulty from top to bottom. Solid lines and shaded regions represent the mean and standard deviation across 3 random seeds. All agents are trained using Soft Actor-Critic (SAC) (Haarnoja et al., 2018).
…also scale $\hat{m}_t$ by the square root of its dimension, adopting the strategy in RMSNorm (Zhang & Se…
Figure 4: Performance comparison on the T-Maze tasks. The figures show the maximum average return on the (Left) Passive T-Maze and (Right) Active T-Maze environments. The reported values are the best results from 2 independent runs. All agents are trained using Double Q-Learning (DDQN) (Van Hasselt et al., 2016).
…ML10 includes out-of-distribution test tasks, we focus on evaluating adaptability to the variation withi…
Figure 5: Memory-based RL framework used in our experiments. An episode trajectory consisting of observation, action, and reward history is processed by a sequence encoder to produce a latent memory representation $m_t$ that summarizes past interactions. This memory vector conditions standard off-policy RL components. For discrete-control tasks (e.g., T-Maze), $m_t$ is provided to a DDQN head to estimate action values. Fo…
Abstract

We propose MATE, a simple yet effective memory architecture for solving Contextual Markov Decision Processes (CMDPs), a family of MDPs parameterized by an unobserved context. In CMDPs, an optimal agent can adapt online by maintaining the posterior belief over contexts. MATE replaces this intractable posterior with a sum-aggregated memory, leveraging the posterior's permutation invariance to retain provably sufficient expressiveness. Compared to prior memory architectures, MATE avoids the growing per-step rollout cost of Transformers and the gradient issues commonly associated with Recurrent Neural Networks (RNNs). Extensive evaluations across diverse benchmarks demonstrate that MATE provides clear computational advantages while achieving performance comparable to standard sequence-model baselines.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Referee Report

2 major / 2 minor

Summary. The paper proposes MATE, a memory architecture for Contextual Markov Decision Processes (CMDPs) that replaces the intractable posterior belief over contexts with a sum-aggregated memory of transition embeddings. It leverages the permutation invariance of the posterior to claim provably sufficient expressiveness, avoiding the per-step costs of Transformers and gradient problems of RNNs, and reports comparable performance to sequence-model baselines on diverse benchmarks.

Significance. If the theoretical justification for sufficiency holds, MATE could provide a simple and scalable memory mechanism for online adaptation in CMDPs, offering computational advantages in reinforcement learning settings with unobserved contexts. The empirical results suggest practical utility, though the lack of detailed ablations limits assessment of robustness.

major comments (2)
  1. [Abstract and §3] Abstract and §3 (MATE architecture): The claim that sum-aggregation of transition embeddings 'retains provably sufficient expressiveness' solely via the posterior's permutation invariance is load-bearing for the central contribution. Permutation invariance ensures order-independence but does not establish that the sum is injective or information-preserving w.r.t. context-distinguishing likelihoods. No theorem or explicit construction (e.g., moment-matching or injective feature map) is provided to rule out collisions when context effects are non-additive, so the sufficiency for near-optimal action selection in general CMDPs remains unsubstantiated.
  2. [§4 or §5] §4 or §5 (Experiments): The reported performance comparability to sequence-model baselines lacks error bars, ablation results on the embedding function, or controls testing whether sum-aggregation actually preserves posterior distinguishability. Without these, it is unclear whether observed results support the sufficiency claim or arise from other implementation details.
minor comments (2)
  1. [Method] Clarify the precise per-step computation of the accumulated sum and the embedding network architecture to support reproducibility.
  2. [Related Work] Add missing references to prior work on sufficient statistics for CMDPs or POMDPs to better situate the permutation-invariance argument.
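The collision concern in major comment 1 can be illustrated with scalars. The sketch below is an assumption-laden toy, not the paper's construction: `identity_phi` and `power_phi` are hypothetical feature maps, and the multisets are hand-picked. It shows that a sum over a non-injective embedding can merge two different multisets, while a richer embedding (here, low-order power sums, in the spirit of Deep Sets-style arguments) separates this particular pair.

```python
def sum_pool(multiset, phi):
    """Sum-aggregate per-element feature tuples into a single tuple."""
    feats = [phi(x) for x in multiset]
    return tuple(sum(col) for col in zip(*feats))


def identity_phi(x):
    # Hypothetical one-dimensional embedding: keeps only the value itself.
    return (x,)


def power_phi(x):
    # Hypothetical embedding with count, first, and second power of the value.
    return (1, x, x * x)


a = [1, 3]
b = [2, 2]  # a different multiset with the same total

collides = sum_pool(a, identity_phi) == sum_pool(b, identity_phi)
separated = sum_pool(a, power_phi) != sum_pool(b, power_phi)
print(collides, separated)  # prints: True True
```

Permutation invariance holds for both embeddings, yet only the second one distinguishes the two inputs. That is the gap the referee points to between order-independence and information preservation.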

Simulated Author's Rebuttal

2 responses · 0 unresolved

We thank the referee for their constructive feedback. We address each major comment below, providing clarifications on the theoretical foundations and committing to empirical enhancements in the revision.

  1. Referee: [Abstract and §3] Abstract and §3 (MATE architecture): The claim that sum-aggregation of transition embeddings 'retains provably sufficient expressiveness' solely via the posterior's permutation invariance is load-bearing for the central contribution. Permutation invariance ensures order-independence but does not establish that the sum is injective or information-preserving w.r.t. context-distinguishing likelihoods. No theorem or explicit construction (e.g., moment-matching or injective feature map) is provided to rule out collisions when context effects are non-additive, so the sufficiency for near-optimal action selection in general CMDPs remains unsubstantiated.

    Authors: We appreciate the referee's focus on this foundational claim. The manuscript's argument rests on the fact that the posterior over contexts is exchangeable (hence permutation-invariant) with respect to the sequence of observed transitions. This symmetry permits the use of a sum aggregator without regard to order. To address potential collisions under non-additive context effects, we will add an explicit proposition in §3 that specifies sufficient conditions on the embedding function—namely, that it realizes an injective map from transition distributions to a feature space whose sums uniquely recover the posterior's sufficient statistics for policy optimality. This construction draws on known results for symmetric function approximation and will be accompanied by a proof sketch ruling out information loss for the relevant class of CMDPs. revision: yes

  2. Referee: [§4 or §5] §4 or §5 (Experiments): The reported performance comparability to sequence-model baselines lacks error bars, ablation results on the embedding function, or controls testing whether sum-aggregation actually preserves posterior distinguishability. Without these, it is unclear whether observed results support the sufficiency claim or arise from other implementation details.

    Authors: We agree that stronger empirical support is needed to link the observed performance to the sum-aggregation mechanism. In the revised manuscript we will augment the experimental section with (i) mean performance and standard-error bars computed over multiple independent seeds for every benchmark, (ii) an ablation varying the transition-embedding architecture, and (iii) a control that substitutes alternative permutation-invariant aggregators (e.g., mean or learned attention) while keeping all other components fixed. These additions will directly test whether the sum preserves the distinguishability required by the theoretical claim. revision: yes
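One reason the proposed aggregator control is informative: sum and mean are both permutation-invariant, but they keep different information. The toy below is a sketch under stated assumptions, not the authors' experiment. It uses two hypothetical contexts with success probabilities 0.6 and 0.9 and a uniform prior, and shows that the mean discards how much evidence has been seen even though the posterior depends on it.

```python
def aggregate(embeddings, mode):
    """Permutation-invariant pooling of scalar embeddings."""
    total = sum(embeddings)
    if mode == "sum":
        return total
    if mode == "mean":
        return total / len(embeddings)
    raise ValueError(f"unknown mode: {mode}")


def posterior_high(num_successes, p_low=0.6, p_high=0.9):
    """Posterior of the high-probability context after only successes.

    Assumes a uniform prior over two hypothetical contexts.
    """
    like_high = p_high ** num_successes
    like_low = p_low ** num_successes
    return like_high / (like_high + like_low)


short = [1.0]            # one observed success
long = [1.0, 1.0, 1.0]   # three observed successes

print(aggregate(short, "mean") == aggregate(long, "mean"))  # mean cannot tell them apart
print(aggregate(short, "sum") == aggregate(long, "sum"))    # sum can
print(posterior_high(1), posterior_high(3))                 # the beliefs differ
```

A mean aggregator could recover the lost information if the step count were supplied separately, so a fair control would state whether that is done.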

Circularity Check

0 steps flagged

No significant circularity detected


The paper's derivation claims that sum-aggregation of transition embeddings retains provably sufficient expressiveness for the context posterior by leveraging its permutation invariance. This rests on a standard mathematical property of posterior distributions over contexts rather than any self-definition, fitted input renamed as prediction, or load-bearing self-citation. No equations or steps in the abstract or described chain reduce the sufficiency claim to a tautology or prior author work by construction; the argument treats invariance as an external fact that permits the aggregation without additional mechanisms. The central claim therefore remains self-contained against external benchmarks and does not trigger any of the enumerated circularity patterns.

Axiom & Free-Parameter Ledger

0 free parameters · 1 axioms · 0 invented entities

The central claim rests on the unproven assertion that permutation invariance of the posterior makes sum aggregation sufficient; no free parameters or invented entities are named in the abstract.

axioms (1)
  • domain assumption: The posterior belief over contexts in a CMDP is permutation-invariant with respect to the order of observed transitions.
    Invoked to justify that sum aggregation retains sufficient expressiveness.
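The domain assumption can be checked numerically in a small discrete case. The sketch below assumes two hypothetical contexts and a made-up likelihood table over three transition outcomes, with observations treated as conditionally independent given the context. It runs a sequential Bayes update over every ordering of the same observations and compares the resulting beliefs.

```python
import itertools

# Hypothetical prior and likelihood table P(x | c); values are illustrative only.
PRIOR = {"c0": 0.5, "c1": 0.5}
LIKELIHOOD = {
    "c0": {"a": 0.7, "b": 0.2, "c": 0.1},
    "c1": {"a": 0.1, "b": 0.3, "c": 0.6},
}


def posterior(observations):
    """Sequential Bayes update of the belief over contexts."""
    belief = dict(PRIOR)
    for x in observations:
        unnorm = {c: belief[c] * LIKELIHOOD[c][x] for c in belief}
        z = sum(unnorm.values())
        belief = {c: v / z for c, v in unnorm.items()}
    return belief


obs = ["a", "c", "b", "a"]
reference = posterior(obs)

# Largest deviation of the belief in c0 across all orderings of the same observations.
max_dev = max(
    abs(posterior(list(p))["c0"] - reference["c0"])
    for p in itertools.permutations(obs)
)
print(max_dev < 1e-12)
```

The check confirms order-independence of the belief for this toy model. It says nothing about whether a learned sum of embeddings can represent that belief, which is the separate sufficiency question.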

pith-pipeline@v0.9.0 · 5653 in / 1183 out tokens · 43980 ms · 2026-05-20T13:35:58.791187+00:00 · methodology


Lean theorems connected to this paper

Citations machine-checked in the Pith Canon. Every link opens the source theorem in the public Lean library.

What do these tags mean?
  • matches: The paper's claim is directly supported by a theorem in the formal canon.
  • supports: The theorem supports part of the paper's argument, but the paper may add assumptions or extra steps.
  • extends: The paper goes beyond the formal theorem; the theorem is a base layer rather than the whole result.
  • uses: The paper appears to rely on the theorem as machinery.
  • contradicts: The paper's claim conflicts with a theorem or certificate in the canon.
  • unclear: Pith found a possible connection, but the passage is too broad, indirect, or ambiguous to say the theorem truly supports the claim.

Reference graph

Works this paper leans on

52 extracted references · 52 canonical work pages · 1 internal anchor

  1. Efficient off-policy meta-reinforcement learning via probabilistic context variables. International Conference on Machine Learning, 2019.
  2. VariBAD: a very good method for Bayes-adaptive deep RL via meta-learning. Proceedings of ICLR 2020.
  3. Recurrent Model-Free RL Can Be a Strong Baseline for Many POMDPs. International Conference on Machine Learning, 2022.
  4. Contextual Markov Decision Processes. arXiv preprint arXiv:1502.02259.
  5. A survey of zero-shot generalisation in deep reinforcement learning. Journal of Artificial Intelligence Research.
  6. SplAgger: Split Aggregation for Meta-Reinforcement Learning. CoRR.
  7. Bridging State and History Representations: Understanding Self-Predictive RL. CoRR.
  8. Towards an information theoretic framework of context-based offline meta-reinforcement learning. Proceedings of the 38th International Conference on Neural Information Processing Systems.
  9. A tutorial on meta-reinforcement learning. Foundations and Trends in Machine Learning, 2025.
  10. Recurrent hypernetworks are surprisingly strong in meta-RL. Advances in Neural Information Processing Systems.
  11. Learning long-term dependencies with gradient descent is difficult. IEEE Transactions on Neural Networks, 1994.
  12. On the difficulty of training recurrent neural networks. International Conference on Machine Learning, 2013.
  13. Deep sets. Advances in Neural Information Processing Systems.
  14. On the limitations of representing functions on sets. International Conference on Machine Learning, 2019.
  15. Universal representation of permutation-invariant functions on vectors and tensors. International Conference on Algorithmic Learning Theory, 2024.
  16. Off-policy meta-reinforcement learning with belief-based task inference. IEEE Access, 2022.
  17. Fast adaptation via policy-dynamics value functions. arXiv preprint arXiv:2007.02879.
  18. Exchangeable Models in Meta Reinforcement Learning. 4th Lifelong Machine Learning Workshop at ICML 2020.
  19. Relic: A recipe for 64k steps of in-context reinforcement learning for embodied AI. arXiv preprint arXiv:2410.02751.
  20. Structured state space models for in-context reinforcement learning. Advances in Neural Information Processing Systems.
  21. Bayesian decision problems and Markov chains.
  22. Optimal Learning: Computational procedures for Bayes-adaptive Markov decision processes. 2002.
  23. Why generalization in RL is difficult: Epistemic POMDPs and implicit partial observability. Advances in Neural Information Processing Systems.
  24. A survey of in-context reinforcement learning. arXiv preprint arXiv:2502.07978.
  25. Grigsby, Jake and Fan, Jim and Zhu, Yuke. AMAGO: Scalable In-Context Reinforcement Learning for Adaptive Agents.
  26. AMAGO-2: Breaking the multi-task barrier in meta-reinforcement learning with transformers. Advances in Neural Information Processing Systems.
  27. Planning and acting in partially observable stochastic domains. Artificial Intelligence, 1998.
  28. In-context Reinforcement Learning with Algorithm Distillation. 2023.
  29. 1000 Layer Networks for Self-Supervised RL: Scaling Depth Can Enable New Goal-Reaching Capabilities. CoRR.
  30. Contextualize Me: The Case for Context in Reinforcement Learning. Transactions on Machine Learning Research.
  31. Acting optimally in partially observable stochastic domains. AAAI.
  32. Neural injective functions for multisets, measures and graphs via a finite witness theorem. Advances in Neural Information Processing Systems.
  33. Approximation by superpositions of a sigmoidal function. Mathematics of Control, Signals and Systems, 1989.
  34. Predictive Coding Enhances Meta-RL To Achieve Interpretable Bayes-Optimal Belief Representation Under Partial Observability. The Thirty-ninth Annual Conference on Neural Information Processing Systems.
  35. Root mean square layer normalization. Advances in Neural Information Processing Systems.
  36. MuJoCo: A physics engine for model-based control. 2012 IEEE/RSJ International Conference on Intelligent Robots and Systems (IROS).
  37. Model-Agnostic Meta-Learning for Fast Adaptation of Deep Networks. Proceedings of the 34th International Conference on Machine Learning, 2017.
  38. Language models are unsupervised multitask learners. OpenAI blog.
  39. Long short-term memory. Neural Computation, 1997.
  40. When do transformers shine in RL? Decoupling memory from credit assignment. Advances in Neural Information Processing Systems.
  41. Scaling Algorithm Distillation for Continuous Control with Mamba. arXiv preprint arXiv:2506.13892.
  42. A Large Recurrent Action Model: xLSTM enables Fast Inference for Robotics Tasks. Forty-second International Conference on Machine Learning.
  43. Deep reinforcement learning with double Q-learning. Proceedings of the AAAI Conference on Artificial Intelligence.
  44. Soft actor-critic: Off-policy maximum entropy deep reinforcement learning with a stochastic actor. International Conference on Machine Learning, 2018.
  45. AMRL: Aggregated memory for reinforcement learning. International Conference on Learning Representations.
  46. Decision Mamba: Reinforcement learning via hybrid selective sequence modeling. Advances in Neural Information Processing Systems.
  47. Efficient Cross-Episode Meta-RL. The Thirteenth International Conference on Learning Representations.
  48. Duan, Yan and Schulman, John and Chen, Xi and Bartlett, Peter L and Sutskever, Ilya and Abbeel, Pieter.
  49. Batch normalization: Accelerating deep network training by reducing internal covariate shift. International Conference on Machine Learning, 2015.
  50. Meta-World: A benchmark and evaluation for multi-task and meta reinforcement learning. Conference on Robot Learning, 2020.
  51. Mamba: Linear-time sequence modeling with selective state spaces. First Conference on Language Modeling.
  52. Context is Everything: Implicit Identification for Dynamics Adaptation. 2022 International Conference on Robotics and Automation (ICRA).