Learning in Low-Dimensional Subspaces: Orthogonal Bottlenecks for Reinforcement Learning

Aleksandar Todorov; Matthia Sabatelli

arxiv: 2605.26012 · v1 · pith:Z2PUSD43new · submitted 2026-05-25 · 💻 cs.LG · cs.AI

Pith reviewed 2026-06-29 22:45 UTC · model grok-4.3

classification 💻 cs.LG cs.AI

keywords reinforcement learning · representation learning · orthogonal bottlenecks · low-dimensional subspaces · linear realizability · value function approximation

The pith

A fixed orthonormal projection in RL encoders constrains features to low-dimensional subspaces while preserving expressivity and gradient dynamics above the value function's intrinsic rank.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper introduces orthogonal bottlenecks as a fixed orthonormal projection placed after the encoder to limit features to a chosen low dimension. Under the assumption that the optimal value function is linear in those features, the projection keeps full expressivity and leaves gradient updates equivalent to a low-dimensional version when the chosen dimension exceeds the rank of the value function. Experiments across single-task and multi-task settings show that performance stays the same or improves once the dimension passes a small task-specific threshold, and many tasks allow extreme compression. The bottlenecks also stabilize feature norms and raise effective rank. The approach adds no auxiliary losses or changes to the RL algorithm itself.

Core claim

Under a linear realizability assumption, when the bottleneck dimension exceeds the intrinsic rank of the optimal value function in feature space, the bottleneck preserves expressivity and leaves the induced gradient dynamics unchanged up to an equivalent low-dimensional parameterization.

What carries the argument

The fixed orthonormal projection that constrains encoder features to a low-dimensional subspace.
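A minimal sketch of such a projection, assuming B is drawn once by QR-orthogonalizing a Gaussian matrix and then frozen. The function names, the sizes D = 256 and k = 8, and the random stand-in for encoder outputs are illustrative assumptions, not the paper's code; the paper may use a different orthogonalization method.

```python
import numpy as np

def make_orthonormal_bottleneck(D, k, seed=0):
    """Return a fixed D x k matrix B with orthonormal columns (B.T @ B = I_k)."""
    rng = np.random.default_rng(seed)
    G = rng.standard_normal((D, k))
    # Reduced QR: Q has shape (D, k) and orthonormal columns.
    Q, _ = np.linalg.qr(G)
    return Q

def bottleneck_features(z, B):
    """Map encoder outputs z (batch x D) to bottleneck coordinates (batch x k)."""
    # Row-wise this computes B.T @ z_i for every sample z_i.
    return z @ B

D, k = 256, 8                                   # hypothetical sizes
B = make_orthonormal_bottleneck(D, k)           # drawn once, never trained
z = np.random.default_rng(1).standard_normal((32, D))  # stand-in for encoder outputs
h = bottleneck_features(z, B)

print(h.shape)
print(np.allclose(B.T @ B, np.eye(k)))
```

Because B has orthonormal columns, the projection can never increase a feature vector's Euclidean norm, which is one simple way to see why it might help keep feature norms under control.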

Load-bearing premise

The optimal value function is linear in the encoder features.

What would settle it

Measure the intrinsic rank of the optimal value function in feature space and check whether performance or policy quality drops when the bottleneck dimension is set below that rank but holds when set above it.
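The threshold behaviour this test looks for can be illustrated in an idealized, purely linear setting. The sketch below is not an RL experiment: it assumes a synthetic linear value map Theta of known rank r (all sizes are made up) and uses the Eckart-Young theorem to compute the smallest error any map of rank at most k can achieve, which is the best a k-dimensional bottleneck followed by a linear head could do.

```python
import numpy as np

rng = np.random.default_rng(0)
m, D, r = 4, 64, 3   # hypothetical sizes: m outputs, D features, intrinsic rank r

# Synthetic rank-r linear value map (m x D); a stand-in for the optimal one.
Theta = rng.standard_normal((m, r)) @ rng.standard_normal((r, D))

U, s, Vt = np.linalg.svd(Theta, full_matrices=False)

def best_error_through_bottleneck(k):
    """Frobenius error of the best rank-<=k approximation of Theta (Eckart-Young)."""
    Theta_k = (U[:, :k] * s[:k]) @ Vt[:k]
    return float(np.linalg.norm(Theta - Theta_k))

errors = {k: best_error_through_bottleneck(k) for k in range(1, m + 1)}
for k, err in errors.items():
    print(f"k={k}: best achievable error = {err:.3e}")
```

In this toy setting the error is strictly positive for k below r and vanishes (up to floating-point noise) from k = r onward. With trained deep encoders the intrinsic rank is not known in advance, so the corresponding experiment would have to estimate it or sweep k empirically.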

Figures

Figures reproduced from arXiv: 2605.26012 by Aleksandar Todorov, Matthia Sabatelli.

**Figure 2.** For PPO in Humanoid, feature norms explode when using a fixed Gaussian projection B and lower performance is achieved, while a fixed orthonormal B learns reliably. Both bottlenecks use k = 8.

Why Orthogonality Matters. A key requirement in Theorem 3.2 is that the projection matrix satisfies the orthogonality condition $B^\top B = I_k$. In particular, it ensures that the induced gradient dynamics on the effectiv…

**Figure 3.** Bottleneck manifolds for Acrobot-v1 (DQN, top) and Freeway-MinAtar (PPO, bottom)

**Figure 4.** Three-dimensional bottleneck embeddings for Acrobot-v1 (DQN, top) and Freeway…

**Figure 5.** Final performance (IQM over seeds) as a function of bottleneck dimension

**Figure 6.** Performance (top row) and normalized mean effective rank (bottom row) for no projec…

**Figure 7.** Meta-World MT10 performance of a baseline PPO agent and an agent equipped with a fixed bottleneck of dimension k = 24.

We finally turn to the multi-task setting, where a single agent must solve multiple tasks simultaneously using shared parameters. Multi-task RL is often limited by negative transfer and representational interference, since tasks compete for shared capacity and updates from one task can d…

**Figure 8.** Humanoid encoder-width D sweep with fixed bottleneck dimension k = 8. Curves largely overlap across encoder widths, indicating that performance is only weakly sensitive to encoder width once k is fixed; very large widths (e.g., D = 1024) can exhibit slightly lower returns.

D. Orthogonal Initialization Variants. To verify that our results are not sensitive to the orthogonalization method used to initialize th…

**Figure 9.** Humanoid learning curves with a fixed bottleneck dimension

**Figure 10.** Classic Control learning curves for CartPole-v1 and Acrobot-v1 comparing the uncon…

**Figure 11.** MinAtar learning curves comparing the unconstrained PPO baseline to fixed orthogonal…

**Figure 12.** Representative Atari learning curves for each game, showing the unconstrained PQN…

**Figure 13.** Representative Brax MuJoCo learning curves for each task, showing the unconstrained…

**Figure 14.** Classic Control normalized mean effective rank curves for CartPole-v1 and Acrobot-v1

**Figure 15.** MinAtar normalized mean effective rank curves for the same runs shown in Figure…

**Figure 16.** Atari normalized mean effective rank curves for the same runs shown in Figure…

**Figure 17.** Brax MuJoCo normalized mean effective rank curves for the same runs shown in Figure…

**Figure 18.** Meta-World MT10 performance and normalized mean effective rank for the same runs.

Original abstract

Deep reinforcement learning (RL) agents commonly rely on high-dimensional neural representations, despite growing evidence that task-relevant value and policy structure may be intrinsically low-dimensional. In this work, we present a simple yet effective representation-level prior that inserts a fixed orthonormal projection to constrain encoder features to a low-dimensional subspace, requiring no auxiliary objectives, pretraining, or changes to the underlying RL algorithm. Under a linear realizability assumption, we prove that when the bottleneck dimension exceeds the intrinsic rank of the optimal value function in feature space, the bottleneck preserves expressivity and leaves the induced gradient dynamics unchanged up to an equivalent low-dimensional parameterization. Empirically, we find that across both single and multi-task benchmarks, baseline performance is either matched or improved once the bottleneck dimension exceeds a small task-dependent threshold; in many cases, value representations can be compressed to extremely low dimensions without loss, and the minimal sufficient dimension depends far more on environment complexity than encoder width. In addition, we analyze representation geometry and find that orthogonal bottlenecks stabilize feature norms and are associated with higher effective rank. Together, these results support a representation-space interpretation of the manifold hypothesis in reinforcement learning and position orthogonal bottlenecks as a lightweight, architecture-agnostic mechanism for shaping RL representations.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Desk Editor's Note

private letter to a colleague

The paper adds a fixed orthonormal bottleneck to RL encoders with a linear-realizability proof, but the assumption looks untested against the deep-net experiments.

The core move here is inserting a fixed orthonormal projection right after the encoder to push features into a low-dimensional subspace. No new losses, no pretraining, and the RL algorithm stays untouched. Under the assumption that the optimal value function is exactly linear in the encoder features, they prove that once the bottleneck dimension clears the intrinsic rank, you keep full expressivity and the gradient flow is equivalent to a reparameterized low-dimensional version.

That combination of the fixed projection plus the realizability argument is the new piece. The empirical side shows that performance holds or improves once the dimension passes a small task-specific threshold, and that you can often compress quite aggressively without loss. The geometry observations on norm stability and effective rank are also straightforward to check.
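As a rough illustration of how those two geometry checks could be run on a batch of features, the sketch below uses a common entropy-based definition of effective rank (the exponential of the Shannon entropy of the normalized singular values) and the mean Euclidean feature norm. Whether the paper's "normalized mean effective rank" matches this definition is an assumption; dividing by the feature dimension is one plausible normalization, and the random matrices are stand-ins for real features.

```python
import numpy as np

def effective_rank(features, eps=1e-12):
    """Entropy-based effective rank of a (batch x dim) feature matrix."""
    s = np.linalg.svd(features, compute_uv=False)
    p = s / (s.sum() + eps)      # singular values normalized to sum to ~1
    p = p[p > eps]               # drop numerically zero entries before the log
    return float(np.exp(-(p * np.log(p)).sum()))

def mean_feature_norm(features):
    """Average Euclidean norm of the feature vectors in a batch."""
    return float(np.linalg.norm(features, axis=1).mean())

rng = np.random.default_rng(0)
dim = 16
full = rng.standard_normal((512, dim))                                   # uses all directions
collapsed = rng.standard_normal((512, 2)) @ rng.standard_normal((2, dim))  # rank 2

for name, feats in [("full", full), ("collapsed", collapsed)]:
    print(name, effective_rank(feats) / dim, mean_feature_norm(feats))
```

Tracking both quantities over training, with and without the bottleneck, would be the natural way to reproduce the kind of comparison the paper reports.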

The main weakness is that the linear realizability assumption is doing the heavy lifting for the theory, yet the experiments use standard deep encoders where exact linearity is unlikely. The abstract gives no sign that they measured how close the learned features actually come to satisfying it, so it is unclear whether the proof explains the reported robustness or whether other factors like implicit regularization are at work. The performance claims are also summarized without error bars or statistical detail, which makes the "matched or improved" statement hard to weigh.

This is aimed at researchers who want lightweight representation priors in RL and are willing to test the assumption themselves. If the full paper supplies the proof details and some diagnostic checks on linearity, it is worth sending out for review with those points flagged.

Referee Report

2 major / 2 minor

Summary. The paper proposes inserting a fixed orthonormal projection (orthogonal bottleneck) as a representation-level prior into RL encoders to constrain features to a low-dimensional subspace without auxiliary losses or algorithm changes. Under a linear realizability assumption on the optimal value function, it proves that when the bottleneck dimension exceeds the intrinsic rank of V* in feature space, expressivity is preserved and gradient dynamics remain unchanged up to an equivalent low-dimensional reparameterization. Empirically, across single- and multi-task RL benchmarks, baseline performance is matched or improved once the bottleneck dimension exceeds a small task-dependent threshold, with value representations compressible to very low dimensions; additional analysis shows the bottlenecks stabilize feature norms and increase effective rank.

Significance. If the linear realizability assumption holds, the result supplies a simple, architecture-agnostic mechanism for enforcing the manifold hypothesis in RL representations and yields a clean theoretical guarantee on expressivity and dynamics. The empirical observation that minimal sufficient dimension depends more on environment complexity than encoder width is potentially useful for practical design. The work is strengthened by its parameter-free nature and lack of extra objectives, but the explanatory link between the conditional theory and the reported robustness is weakened by the absence of any verification that the assumption approximately holds for the trained deep encoders.

major comments (2)

§3 (theoretical guarantee): the claim that the bottleneck 'leaves the induced gradient dynamics unchanged up to an equivalent low-dimensional parameterization' is conditioned on the linear realizability assumption that V* is exactly linear in the encoder features ϕ; this assumption is invoked for the guarantee but the manuscript supplies no diagnostic (e.g., residual of the linear fit or rank of the value matrix) to check whether it holds even approximately for the deep-network encoders used in the experiments.
§5 (empirical evaluation): performance claims that 'baseline performance is either matched or improved' and that 'value representations can be compressed to extremely low dimensions without loss' are presented as summary statements without reported error bars, number of random seeds, dataset sizes, or statistical tests, making it impossible to assess whether the observed robustness to small bottleneck dimensions is reliable or could be explained by mechanisms other than the stated theory.
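The diagnostic requested in the first comment could look roughly like the following sketch: fit value targets linearly on the encoder features by least squares and report the relative residual together with the numerical rank of the feature matrix. Everything here is an assumption for illustration: the function name, the random stand-in features, and the synthetic targets. In practice V* is unknown, so the targets would have to be a proxy such as Monte Carlo returns or a converged critic's outputs.

```python
import numpy as np

def linear_fit_diagnostic(features, values):
    """Least-squares fit values ~ features @ w; returns (relative residual, feature rank)."""
    w, _, rank, _ = np.linalg.lstsq(features, values, rcond=None)
    residual = values - features @ w
    rel = np.linalg.norm(residual) / (np.linalg.norm(values) + 1e-12)
    return float(rel), int(rank)

rng = np.random.default_rng(0)
phi = rng.standard_normal((1000, 32))               # stand-in for encoder features
v_linear = phi @ rng.standard_normal(32)            # targets that are exactly linear in phi
v_nonlinear = np.sin(3.0 * phi[:, 0]) * phi[:, 1]   # targets that are clearly not linear

rel_lin, rank_lin = linear_fit_diagnostic(phi, v_linear)
rel_non, rank_non = linear_fit_diagnostic(phi, v_nonlinear)
print(f"linear targets:     relative residual = {rel_lin:.2e}, rank = {rank_lin}")
print(f"non-linear targets: relative residual = {rel_non:.2e}, rank = {rank_non}")
```

A relative residual near zero on held-out states would support approximate linear realizability; a residual close to one would suggest that the theory does not by itself explain the observed robustness.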

minor comments (2)

Notation for the orthonormal projection matrix and the intrinsic rank of V* should be introduced with a single consistent symbol set in the theory section before being used in the experiments.
Figure captions for the representation-geometry plots should explicitly state the number of environments and seeds averaged.

Simulated Author's Rebuttal

2 responses · 0 unresolved

We thank the referee for their thoughtful and constructive comments. We address each major comment below and indicate the revisions we will incorporate.

Referee: §3 (theoretical guarantee): the claim that the bottleneck 'leaves the induced gradient dynamics unchanged up to an equivalent low-dimensional parameterization' is conditioned on the linear realizability assumption that V* is exactly linear in the encoder features ϕ; this assumption is invoked for the guarantee but the manuscript supplies no diagnostic (e.g., residual of the linear fit or rank of the value matrix) to check whether it holds even approximately for the deep-network encoders used in the experiments.

Authors: We agree that the theoretical guarantee is conditioned on linear realizability and that the manuscript would be strengthened by empirical diagnostics assessing how closely the assumption holds for the trained encoders. In the revision we will add an appendix that reports the effective rank of the value function in feature space together with the residual of the linear fit on representative tasks.
revision: yes

Referee: §5 (empirical evaluation): performance claims that 'baseline performance is either matched or improved' and that 'value representations can be compressed to extremely low dimensions without loss' are presented as summary statements without reported error bars, number of random seeds, dataset sizes, or statistical tests, making it impossible to assess whether the observed robustness to small bottleneck dimensions is reliable or could be explained by mechanisms other than the stated theory.

Authors: We acknowledge that the empirical section would benefit from fuller statistical reporting. In the revised manuscript we will report error bars over five random seeds, specify evaluation episode counts, and include statistical tests supporting the performance claims.
revision: yes
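One way the promised reporting could be computed is sketched below: the interquartile mean (IQM) over seeds, which the paper's Figure 5 caption already names, with a percentile-bootstrap confidence interval. The trimming rule (drop the lowest and highest 25% of runs, rounding the cut down), the bootstrap settings, and the five return values are all illustrative assumptions; the numbers are made up and are not results from the paper.

```python
import numpy as np

def iqm(scores):
    """Interquartile mean: average of the runs left after trimming 25% from each end."""
    x = np.sort(np.asarray(scores, dtype=float))
    cut = int(0.25 * x.size)
    return float(x[cut:x.size - cut].mean())

def bootstrap_ci(scores, stat=iqm, n_boot=2000, alpha=0.05, seed=0):
    """Percentile-bootstrap confidence interval for a statistic over seeds."""
    rng = np.random.default_rng(seed)
    scores = np.asarray(scores, dtype=float)
    draws = rng.choice(scores, size=(n_boot, scores.size), replace=True)
    stats = np.array([stat(d) for d in draws])
    lo, hi = np.quantile(stats, [alpha / 2, 1 - alpha / 2])
    return float(lo), float(hi)

# Hypothetical final returns for five seeds (made-up values with one outlier).
returns = [10.0, 12.0, 11.0, 13.0, 50.0]
point = iqm(returns)
lo, hi = bootstrap_ci(returns)
print(f"IQM = {point:.2f}, 95% bootstrap CI = [{lo:.2f}, {hi:.2f}]")
```

With only five seeds the interval will typically be wide, which is itself useful information when judging a "matched or improved" claim.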

Circularity Check

0 steps flagged

No circularity; central claim conditioned on external linear realizability assumption

The paper's theoretical guarantee is explicitly stated as holding 'under a linear realizability assumption' that the optimal value function is linear in the encoder features. This is an external modeling assumption invoked to prove preservation of expressivity, not a quantity defined or fitted inside the paper's own equations. No self-citations, fitted inputs renamed as predictions, or self-definitional reductions appear in the abstract or described derivation chain. The empirical sections are presented as separate validation and do not feed back into the proof. The derivation chain is therefore self-contained against the stated assumption.

Axiom & Free-Parameter Ledger

0 free parameters · 1 axioms · 0 invented entities

The proof depends on one domain assumption; no free parameters or invented entities are introduced in the abstract.

axioms (1)

domain assumption Linear realizability assumption: the optimal value function is linear in the encoder features.
Invoked to guarantee that the bottleneck preserves expressivity and gradient dynamics.

pith-pipeline@v0.9.1-grok · 5748 in / 1132 out tokens · 23990 ms · 2026-06-29T22:45:41.749859+00:00 · methodology

Reference graph

Works this paper leans on

7 extracted references · 3 canonical work pages · 1 internal anchor

[1]

ISSN: 2640-3498

URL https://proceedings.mlr.press/v80/chen18i.html. ISSN: 2640-3498. Simon S. Du, Sham M. Kakade, Ruosong Wang, and Lin F. Yang. Is a Good Representation Sufficient for Sample Efficient Reinforcement Learning?, February 2020. URL http://arxiv.org/abs/1910.03016. arXiv:1910.03016 [cs]. Ayoub Echchahed and Pablo Samuel Castro. A Survey of State Representa...

work page doi:10.1090/jams/852 2020
[2]

Plasticity Loss in Deep Reinforcement Learning: A Survey

ISBN 978-0-8218-5030-5 978-0-8218-7611-4. DOI: 10.1090/conm/026/737400. URL http://www.ams.org/conm/026/. Timo Klein, Lukas Miklautz, Kevin Sidak, Claudia Plant, and Sebastian Tschiatschek. Plasticity Loss in Deep Reinforcement Learning: A Survey, November 2024. URL http://arxiv.org/abs/2411.04832. arXiv:2411.04832 [cs]. Aviral Kumar, Rishabh Agarwal,...

work page internal anchor Pith review Pith/arXiv arXiv doi:10.1090/conm/026/737400 2024
[3]

Tor Lattimore, Csaba Szepesvari, and Gellert Weisz

URL https://proceedings.neurips.cc/paper_files/paper/2022/hash/debf482a7dbdc401f9052dbe15702837-Abstract-Conference.html. Tor Lattimore, Csaba Szepesvari, and Gellert Weisz. Learning with Good Feature Representations in Bandits and in RL with a Generative Model. In Proceedings of the 37th International Conference on Machine Learning, pp. 5662–5670. PMLR, ...

work page doi:10.1038/nature14236 2022
[4]

Representational sufficiency. There exist encoder parameters and head parameters $\theta^\star$ such that the network $s \mapsto H(B^\top z(s); \theta^\star)$ exactly realizes $V^\star(s)$ for all $s \in \mathcal{S}$
[5]

Training $(\theta, W)$ by gradient descent on loss $L$ evolves $A_t$ identically to training the direct parameterization $h = C\phi(s)$ on $(\theta, C)$, given $C_0 = A_0$

Trainability: Let $W \in \mathbb{R}^{D \times D}$ be the encoder's final layer and $A_t = B^\top W_t$ the composite feature-to-bottleneck map. Training $(\theta, W)$ by gradient descent on loss $L$ evolves $A_t$ identically to training the direct parameterization $h = C\phi(s)$ on $(\theta, C)$, given $C_0 = A_0$. Proof. For general notation, fix a feature map $\phi : \mathcal{S} \to \mathbb{R}^D$ as in Assumption 3.1. We focus on the last linear la...
[6]

The key point is that $\Theta^\star$ has rank $r$ and we assume $k \ge r$

Representational sufficiency. We will explicitly construct parameters $(W^\star, \theta^\star)$ such that for all $s$, $H(B^\top W^\star \phi(s); \theta^\star) = \Theta^\star \phi(s) = V^\star(s)$. The key point is that $\Theta^\star$ has rank $r$ and we assume $k \ge r$. Since $\Theta^\star \in \mathbb{R}^{m \times D}$ has rank $r$, it admits a singular value decomposition $\Theta^\star = U_r \Sigma_r V_r^\top$, where
• $U_r \in \mathbb{R}^{m \times r}$ has orthonormal columns ($U_r^\top U_r = I_r$),
• $\Sigma_r \in \mathbb{R}^{r \times r}$ is diagonal with s...
[7]

Consider an arbitrary (differentiable) training objective $L$ computed from the head output

Trainability. We now prove that training $(\theta, W)$ with the orthogonal bottleneck induces the same gradient descent dynamics on the composite map $A_t = B^\top W_t$ as training a direct bottleneck map $C_t$ in the parameterization $h = C\phi(s)$, provided $C_0 = A_0$. Consider an arbitrary (differentiable) training objective $L$ computed from the head output. For example, $L$ can be a...
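The two claims excerpted above (representational sufficiency and trainability) can be checked numerically in a toy setting. The sketch below is not the paper's proof or code. It assumes a linear head $H(h; \theta) = \theta h$, fixed random features in place of a trained encoder, a mean squared-error loss, and made-up sizes; the particular choice of $(W^\star, \theta^\star)$ is one construction consistent with the SVD in the excerpt, since the paper's own construction is truncated in the source. An unnormalized Gaussian projection is included only as a contrast case.

```python
import numpy as np

rng = np.random.default_rng(0)
D, k, m, r, N = 12, 4, 3, 2, 64     # hypothetical toy sizes with k >= r

B, _ = np.linalg.qr(rng.standard_normal((D, k)))   # orthonormal columns: B.T @ B = I_k
Phi = rng.standard_normal((N, D))                  # fixed stand-in features phi(s)

# Representational sufficiency: build (W*, theta*) realizing a rank-r Theta*.
Theta_star = rng.standard_normal((m, r)) @ rng.standard_normal((r, D))
U, s, Vt = np.linalg.svd(Theta_star, full_matrices=False)
A_star = np.zeros((k, D))
A_star[:r] = s[:r, None] * Vt[:r]                  # [Sigma_r V_r^T; 0]
theta_star = np.zeros((m, k))
theta_star[:, :r] = U[:, :r]                       # [U_r, 0]
W_star = B @ A_star                                # so that B.T @ W_star == A_star
realized = np.allclose(theta_star @ B.T @ W_star, Theta_star)

# Trainability: gradient descent on (theta, W) versus (theta, C) with C_0 = P.T @ W_0.
def grads(theta, A, Y):
    """Gradients of the mean squared error with respect to theta and the map A."""
    H = Phi @ A.T                      # bottleneck activations, shape (N, k)
    E = H @ theta.T - Y                # prediction errors, shape (N, m)
    return E.T @ H / N, (E @ theta).T @ Phi / N

def max_gap(P, steps=10, lr=0.005):
    """Largest entrywise gap between P.T @ W_t and C_t after training both."""
    Y = Phi @ Theta_star.T
    W = 0.1 * np.random.default_rng(1).standard_normal((D, D))
    theta_w = theta_c = 0.1 * np.random.default_rng(2).standard_normal((m, k))
    C = P.T @ W
    for _ in range(steps):
        g_theta, g_A = grads(theta_w, P.T @ W, Y)
        theta_w, W = theta_w - lr * g_theta, W - lr * (P @ g_A)   # dL/dW = P dL/dA
        g_theta, g_C = grads(theta_c, C, Y)
        theta_c, C = theta_c - lr * g_theta, C - lr * g_C
    return float(np.max(np.abs(P.T @ W - C)))

G = rng.standard_normal((D, k))        # Gaussian projection, not orthonormal
gap_orth, gap_gauss = max_gap(B), max_gap(G)
print(realized, gap_orth, gap_gauss)
```

The update to $A_t = P^\top W_t$ is $-\eta\, P^\top P\, \nabla_A L$, so it coincides with the direct update $-\eta\, \nabla_C L$ exactly when $P^\top P = I_k$. In this toy run the orthonormal case therefore agrees to floating-point precision, while the Gaussian case drifts away from the direct parameterization.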
