The Terminal Representation in Reinforcement Learning

Amir Esterhuysen; Anders Jonsson

arxiv: 2605.31289 · v1 · pith:2GWE2LN2new · submitted 2026-05-29 · 💻 cs.LG · cs.AI

Pith reviewed 2026-06-28 23:16 UTC · model grok-4.3

keywords terminal representation · default representation · successor representation · reinforcement learning · representation learning · option discovery · reward shaping

The pith

The terminal representation encodes reward-weighted trajectories in RL as a lower-dimensional object that can be used directly without eigenvector computations.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper introduces the terminal representation (TR) as a formulation that captures reward-weighted trajectories in a manner similar to the default representation (DR). It establishes that the TR can be learned in lower dimensionality and applied directly to tasks such as option discovery, reward shaping, transfer learning, and exploration. This avoids the need for eigendecomposition and the associated assumption of symmetric transition dynamics. A sympathetic reader would care because the approach promises reduced computational overhead in both learning and using the representation while preserving the underlying knowledge.

Core claim

The terminal representation (TR) encodes reward-weighted trajectories similarly to the DR, but can be learned as a lower-dimensionality object and can be used directly for option discovery, reward shaping, transfer learning, and exploration without eigenvector computations. Eigendecomposition also imposes the assumption of symmetric transition dynamics, which the TR can bypass. The TR is embedded in the top DR eigenvector, allowing it to capture the same underlying knowledge without eigendecomposition. The work develops the theoretical foundations, including the derivation, convergence of two learning algorithms, use for zero-shot compositionality, and equivalences between alternative reward formulations, along with empirical evidence of the TR as a viable alternative to existing representations.

What carries the argument

The terminal representation (TR), a lower-dimensional encoding of reward-weighted trajectories that is embedded in the top eigenvector of the default representation.

If this is right

The TR supports zero-shot compositionality of representations.
Two learning algorithms converge to the TR.
Equivalences hold between alternative reward formulations within the TR.
The TR enables the listed applications without requiring eigendecomposition.
Learning, storing, and using the TR requires less computational overhead than the DR.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

The lower-dimensional nature could make the TR more scalable to high-dimensional state spaces where eigendecomposition becomes prohibitive.
Bypassing the symmetry assumption opens the possibility of applying these representations in non-reversible environments typical of many real control problems.
The embedding property suggests potential for hybrid methods that combine TR with existing eigenvector-based techniques when partial symmetry holds.
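To make the symmetry point concrete, the following is a minimal sketch on a toy environment that is not taken from the paper. The function name `ring_transitions`, the ring size, and the step probabilities are all illustrative assumptions. It compares the spectrum of a symmetric (reversible) random walk with that of a drifting, non-symmetric one: the former has real eigenvalues, while the latter has complex ones, which is one reason eigenvector-based methods are usually stated for symmetric dynamics.

```python
import numpy as np

def ring_transitions(n, p_forward):
    """Random walk on a ring of n states (hypothetical toy environment).

    From each state, step to the next state with probability p_forward
    and to the previous state otherwise.
    """
    P = np.zeros((n, n))
    for s in range(n):
        P[s, (s + 1) % n] += p_forward
        P[s, (s - 1) % n] += 1.0 - p_forward
    return P

P_sym = ring_transitions(6, 0.5)   # unbiased walk: P equals its transpose
P_asym = ring_transitions(6, 0.9)  # biased drift: P is not symmetric

eig_sym = np.linalg.eigvals(P_sym)
eig_asym = np.linalg.eigvals(P_asym)

# Largest imaginary part of any eigenvalue in each case.
print("symmetric P: ", np.allclose(P_sym, P_sym.T), np.abs(eig_sym.imag).max())
print("asymmetric P:", np.allclose(P_asym, P_asym.T), np.abs(eig_asym.imag).max())
```

The sketch only illustrates the general linear-algebra issue. It says nothing about how the TR itself behaves under asymmetric dynamics, which is the claim the paper would need to support.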

Load-bearing premise

The premise is twofold: that the TR is embedded in the top DR eigenvector, so that it captures the same knowledge without eigendecomposition, and that the two learning algorithms (not specified in the abstract) converge to it while supporting the listed downstream uses.

What would settle it

An experiment demonstrating that the TR does not support direct use for option discovery or exploration without eigenvector computations, or fails to match DR information content under asymmetric transition dynamics.

Figures

Figures reproduced from arXiv: 2605.31289 by Amir Esterhuysen, Anders Jonsson.

**Figure 1.** Left: TR option discovery variants vs. baselines using the DR (RACE), the SR (CEO), and a random walk. Experiments conducted in the Four Rooms environment with 4 differently weighted goals; 10 seeds tested. Right: Q-learning boosted by different representations: all TR variants, the DR (RACE+Q), and the SR (CEO+Q). All TR variants produce competitive performance. Using all column vectors achieves higher average re…

**Figure 2.** Reward shaping in the indicated environments, averaged over 50 seeds (TR in yellow).

**Figure 3.** Transfer learning in the multi-goal Four Rooms environment, averaged over 10 seeds. The…

Original abstract

Representation learning is a powerful tool for spatio-temporal abstraction within reinforcement learning (RL). Two well established approaches are through the successor representation (SR) and the default representation (DR). The SR encodes states by the future trajectories they induce, capturing information flow decoupled from reward. The DR builds on this by weighting trajectories with reward, integrating credit-assignment structure into the representation. Eigenvectors of both representations have been used to support a range of downstream tasks -- including option discovery, reward shaping, transfer learning, and exploration. We introduce a structurally distinct formulation: the terminal representation (TR). The TR encodes reward-weighted trajectories similarly to the DR, but can be learned as a lower-dimensionality object, and can be used directly for the mentioned applications without eigenvector computations. Eigendecomposition also imposes the assumption of symmetric transition dynamics, which the TR can bypass. In this work we develop the theoretical foundations of the TR: its derivation, convergence of two learning algorithms, its use for zero-shot compositionality, and equivalences between alternative reward formulations. We further show the TR is embedded in the top DR eigenvector, allowing it to capture the same underlying knowledge without eigendecomposition. Additionally, we provide empirical evidence of the TR as a viable alternative to existing representations in subsidiary applications, while requiring less computational overhead to learn, store, and use.
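As a concrete reading of the abstract's description of the SR, the sketch below computes the standard closed form of the successor representation for a fixed policy, (I - γP)^{-1}, and cross-checks it against the discounted sum of transition powers. The 3-state transition matrix `P`, the discount `gamma`, and the function name are illustrative assumptions, not taken from the paper. The DR and TR are not reproduced here because their definitions are not visible in the material above.

```python
import numpy as np

def successor_representation(P, gamma):
    """Closed-form SR for a fixed policy: sum_t gamma^t P^t = (I - gamma P)^{-1}."""
    n = P.shape[0]
    return np.linalg.solve(np.eye(n) - gamma * P, np.eye(n))

# Hypothetical 3-state chain under some fixed policy (each row sums to 1).
P = np.array([[0.5, 0.5, 0.0],
              [0.0, 0.5, 0.5],
              [0.0, 0.0, 1.0]])
gamma = 0.9

SR = successor_representation(P, gamma)

# Cross-check against a long truncation of the discounted sum of powers of P.
approx = sum(gamma**t * np.linalg.matrix_power(P, t) for t in range(500))
print(np.allclose(SR, approx))

# Each row of the SR sums to 1 / (1 - gamma), the total discounted occupancy.
print(SR.sum(axis=1))
```

Entry (s, s') is the expected discounted number of visits to s' when starting from s, which is the sense in which the SR captures trajectory structure independent of reward.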

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Desk Editor's Note (private letter to a colleague)

TR is pitched as a lower-dimensional direct alternative to DR that skips eigendecomposition, but the embedding relation and algorithm convergence are the parts that still need verification.

The paper introduces the terminal representation as a formulation that encodes reward-weighted trajectories like the default representation but claims lower dimensionality and direct use for option discovery, reward shaping, and exploration without eigenvectors or the symmetric-dynamics assumption.

It lays out a derivation, shows two learning algorithms, discusses zero-shot compositionality, and notes equivalences across reward formulations. The central technical move is the statement that TR sits inside the top DR eigenvector, which would let it capture the same information with less overhead. Empirical checks for lower compute and storage are mentioned as well.

The soft spots sit exactly where the stress-test flags them: the embedding relation and the convergence of the two algorithms. If the embedding is shown only by construction rather than an independent derivation, or if convergence does not preserve the downstream properties, the claimed advantages do not follow. The abstract and visible material do not make those steps fully explicit, so the practical payoff remains conditional on those details.

This work is aimed at researchers already using successor or default representations for abstraction in RL. Someone looking for a lighter alternative to eigenvector methods could extract ideas from the theoretical sections if the math holds.

It deserves peer review to check the embedding and convergence arguments. I would send it rather than desk-reject.

Referee Report

3 major / 2 minor

Summary. The paper introduces the terminal representation (TR) as a new formulation in reinforcement learning that encodes reward-weighted trajectories similarly to the default representation (DR) but as a lower-dimensional object. It develops theoretical foundations including its derivation, convergence of two learning algorithms, use for zero-shot compositionality, and equivalences between reward formulations. The central claims are that the TR is embedded in the top DR eigenvector (allowing the same knowledge without eigendecomposition), that TR can be used directly for option discovery, reward shaping, transfer learning, and exploration, and that it bypasses the symmetric transition dynamics assumption of eigendecomposition. Empirical evidence is provided that TR is a viable alternative with less computational overhead.

Significance. If the embedding relation and convergence results hold, the TR would offer a lower-dimensional, directly usable alternative to DR/SR eigenvectors for multiple downstream RL tasks, reducing computational costs in learning, storage, and application while avoiding symmetry assumptions. The empirical evidence for subsidiary applications is a positive contribution if the theoretical claims are secured.

major comments (3)

[theoretical foundations on embedding] The claim that the TR is embedded in the top DR eigenvector (allowing capture of equivalent knowledge without eigendecomposition) is asserted in the abstract and theoretical foundations but no explicit embedding relation, derivation, or equation is provided, which is load-bearing for the dimensionality reduction and bypass of eigendecomposition.
[section on learning algorithms and convergence] Convergence of the two learning algorithms to the TR is claimed as part of the theoretical foundations but neither the algorithms nor their convergence proofs/derivations are supplied, which is central to the assertion that TR can be learned as a lower-dimensionality object supporting the listed applications.
[section on equivalences and compositionality] The equivalences between alternative reward formulations and the use for zero-shot compositionality are developed theoretically, but without the embedding and convergence established, it is unclear whether these preserve the representational power claimed relative to the DR.

minor comments (2)

The two learning algorithms are referenced but not named or described in the abstract or early sections; this should be clarified with pseudocode or definitions.
Notation for SR, DR, and TR should be introduced with explicit equations in the preliminaries to improve readability before the new derivations.

Simulated Author's Rebuttal

3 responses · 0 unresolved

We thank the referee for their careful reading and constructive feedback on the manuscript. We address each major comment below and will revise the paper to strengthen the theoretical foundations as requested.

Referee: The claim that the TR is embedded in the top DR eigenvector (allowing capture of equivalent knowledge without eigendecomposition) is asserted in the abstract and theoretical foundations but no explicit embedding relation, derivation, or equation is provided, which is load-bearing for the dimensionality reduction and bypass of eigendecomposition.

Authors: We agree that an explicit derivation of the embedding relation is necessary to support the central claims. In the revised manuscript we will add a dedicated derivation subsection that presents the embedding equation relating the TR to the top DR eigenvector, along with the steps showing how this relation permits equivalent knowledge capture without requiring eigendecomposition.

Revision: yes

Referee: Convergence of the two learning algorithms to the TR is claimed as part of the theoretical foundations but neither the algorithms nor their convergence proofs/derivations are supplied, which is central to the assertion that TR can be learned as a lower-dimensionality object supporting the listed applications.

Authors: We acknowledge that the algorithms and their convergence proofs must be supplied explicitly. The revision will include full descriptions of both learning algorithms together with the corresponding convergence derivations, thereby completing the theoretical support for learning the TR directly as a lower-dimensional object.

Revision: yes

Referee: The equivalences between alternative reward formulations and the use for zero-shot compositionality are developed theoretically, but without the embedding and convergence established, it is unclear whether these preserve the representational power claimed relative to the DR.

Authors: With the embedding relation and convergence results added as described above, the equivalences and zero-shot compositionality claims will be shown to preserve the asserted representational power relative to the DR. The revision will include explicit cross-references and additional explanatory text linking these results to the newly provided foundations.

Revision: yes

Circularity Check

0 steps flagged

No circularity: derivation presented as independent

The abstract and provided text introduce the TR via its own derivation and two learning algorithms whose convergence is claimed to be shown. The embedding of TR in the top DR eigenvector is asserted as a result shown in the work, not presupposed by definition or by fitting a parameter to the target quantity. No equations are supplied that would allow reduction of any prediction to its inputs by construction, and no self-citation chain is invoked to justify the central premises. The paper therefore remains self-contained against external benchmarks with no load-bearing step that collapses to a renaming, ansatz smuggling, or fitted-input prediction.

Axiom & Free-Parameter Ledger

0 free parameters · 0 axioms · 0 invented entities

Abstract-only review provides no explicit free parameters, axioms, or invented entities; the central claims rest on unspecified convergence of learning algorithms and an embedding relation whose details are not visible.

pith-pipeline@v0.9.1-grok · 5759 in / 1049 out tokens · 16542 ms · 2026-06-28T23:16:35.050433+00:00 · methodology

Reference graph

Works this paper leans on

18 extracted references · 1 canonical work pages

[1] André Barreto, Will Dabney, Rémi Munos, Jonathan J Hunt, Tom Schaul, Hado P Van Hasselt, and David Silver. Successor features for transfer in reinforcement learning. Advances in Neural Information Processing Systems, 30, 2017.

[2] Yoshua Bengio, Aaron Courville, and Pascal Vincent. Representation learning: A review and new perspectives. IEEE Transactions on Pattern Analysis and Machine Intelligence, 35(8):1798–1828, 2013.

[3] Peter Dayan. Improving generalization for temporal difference learning: The successor representation. Neural Computation, 5(4):613–624, 1993.

[4] Gene H. Golub and Charles F. Van Loan. Matrix Computations. Johns Hopkins University Press, 4th edition, 2013.

[5] Roger A Horn and Charles R Johnson. Matrix Analysis. Cambridge University Press, 2012.

[6] Guillermo Infante, Anders Jonsson, and Vicenç Gómez. Globally optimal hierarchical reinforcement learning for linearly-solvable Markov decision processes. In Proceedings of the AAAI Conference on Artificial Intelligence, volume 36, pages 6970–6977, 2022.

[7] Marlos C. Machado, Clemens Rosenbaum, Xiaoxiao Guo, Miao Liu, Gerald Tesauro, and Murray Campbell. Eigenoption discovery through the deep successor representation. In International Conference on Learning Representations (ICLR), 2018.

[8] Marlos C. Machado, Marc G. Bellemare, and Michael Bowling. Count-based exploration with the successor representation. In AAAI Conference on Artificial Intelligence (AAAI), 2020.

[9] Marlos C. Machado, Andre Barreto, Doina Precup, and Michael Bowling. Temporal abstraction in reinforcement learning with the successor representation. Journal of Machine Learning Research, 24(80):1–69, 2023.

[10] Sridhar Mahadevan and Mauro Maggioni. Proto-value functions: A Laplacian framework for learning representation and control in Markov decision processes. Journal of Machine Learning Research, 8(10), 2007.

[11] Andrew Y. Ng, Daishi Harada, and Stuart Russell. Policy invariance under reward transformations: Theory and application to reward shaping. In International Conference on Machine Learning (ICML), 1999.

[12] Payam Piray and Nathaniel D. Daw. Linear reinforcement learning in planning, grid fields, and cognitive control. Nature Communications, 12(1):4942, 2021.

[13] Yousef Saad. Numerical Methods for Large Eigenvalue Problems. SIAM, revised edition, 2011.

[14] Richard S Sutton, Doina Precup, and Satinder Singh. Between MDPs and semi-MDPs: A framework for temporal abstraction in reinforcement learning. Artificial Intelligence, 112(1-2):181–211, 1999.

[15] Emanuel Todorov. Linearly-solvable Markov decision problems. Advances in Neural Information Processing Systems (NIPS), pages 1369–1376, 2006.

[16] Emanuel Todorov. Compositionality of optimal control laws. Advances in Neural Information Processing Systems, 22, 2009.

[17] Hon Tik Tse, Siddarth Chandrasekar, and Marlos C Machado. Reward-aware proto-representations in reinforcement learning. arXiv preprint arXiv:2505.16217, 2025.

[18] John N Tsitsiklis. Asynchronous stochastic approximation and Q-learning. Machine Learning, 16(3):185–202, 1994.

Appendix text extracted together with [18]: A Theory. In this appendix we prove several of the theorems stated in the main text. For clarity we restate the theorems here. A.1 Proof of Theorem 4.1. Theorem 4.1. Let $M_0 = D_T$. The update rule $M_{k+1} = D_T + D_S M_k$ (5) converges to the TR: $\lim_{k\to\infty} M_k$...
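The update rule of Theorem 4.1 that surfaces in the extracted appendix text, M_{k+1} = D_T + D_S M_k with M_0 = D_T, is a linear fixed-point iteration. As a hedged illustration only: the extract does not define D_S and D_T, so the sketch below assumes they are matrices of compatible shape (D_S square, D_T with the same number of rows) and that the spectral radius of D_S is below 1. Under those assumptions the iteration converges to (I - D_S)^{-1} D_T. The random matrices, sizes, and function name are hypothetical and stand in for whatever the paper actually uses.

```python
import numpy as np

def iterate_fixed_point(D_S, D_T, tol=1e-12, max_iters=10_000):
    """Iterate M_{k+1} = D_T + D_S @ M_k, starting from M_0 = D_T."""
    M = D_T.copy()
    for k in range(max_iters):
        M_next = D_T + D_S @ M
        if np.max(np.abs(M_next - M)) < tol:
            return M_next, k + 1
        M = M_next
    raise RuntimeError("no convergence; is the spectral radius of D_S below 1?")

rng = np.random.default_rng(0)
n_s, n_t = 5, 2  # hypothetical sizes, chosen only for illustration

# Assumed stand-ins for the paper's D_S and D_T.
D_S = rng.random((n_s, n_s))
D_S *= 0.9 / np.abs(np.linalg.eigvals(D_S)).max()  # rescale to spectral radius 0.9
D_T = rng.random((n_s, n_t))

M, iters = iterate_fixed_point(D_S, D_T)

# Closed-form limit of the iteration under the spectral-radius assumption.
closed_form = np.linalg.solve(np.eye(n_s) - D_S, D_T)
print(iters, np.allclose(M, closed_form))
```

Whether the paper's D_S satisfies the spectral-radius condition is exactly what its convergence proof would have to establish; the sketch takes it as given.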
