arxiv: 2511.19279 · v4 · submitted 2025-11-24 · 💻 cs.LG · cs.CL

MapFormer: Self-Supervised Learning of Cognitive Maps with Input-Dependent Positional Embeddings

Pith reviewed 2026-05-17 06:03 UTC · model grok-4.3

classification 💻 cs.LG cs.CL
keywords cognitive maps · positional embeddings · transformers · out-of-distribution generalization · Lie algebra · episodic memory · working memory · self-supervised learning

The pith

MapFormers learn cognitive maps by updating positional embeddings with input-dependent matrices built from Lie-algebra exponentials, yielding near-perfect OOD generalization where standard models fail.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper introduces MapFormers to give AI systems the kind of flexible internal models that let humans adapt to new situations. It does this by making positional encodings depend on the specific input through matrices formed as exponentials of learned Lie-algebra generators, which separates abstract structural relationships from particular content. Two variants unify absolute and relative encodings to handle episodic and working memory. On tasks that test gating, two-dimensional navigation, and nested hierarchies, the models reach near-perfect generalization on data distributions that defeat ordinary transformers. The same structural bias also produces measurable gains on naturalistic data.

Core claim

Cognitive maps are learned in the model by disentangling structural relationships in the inputs from their specific content, a property that can be achieved by updating position encodings with input-dependent matrices, built as exponentials of learned combinations of Lie-algebra generators. Two variants unify absolute and relative positional encoding to model episodic and working memory. On formal tasks targeting gating, 2D navigation and nested hierarchies the models achieve near-perfect out-of-distribution generalization while standard architectures fail; the same principle yields perplexity improvements on naturalistic data and supports both parallel computation on commutative maps and sequential path integration on non-commutative maps.

What carries the argument

Input-dependent matrices constructed as exponentials of learned combinations of Lie-algebra generators that update position encodings to separate abstract relations from content.
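
A minimal sketch of this mechanism, assuming the simplest case of a single 2×2 rotation block. The generator `G`, the token-to-angle table `ANGLE_OF` and the function names are illustrative assumptions, not the authors' code. The sketch exponentiates an input-dependent multiple of a skew-symmetric generator with a truncated power series; the result should coincide with the familiar closed-form rotation matrix.

```python
import math

# Skew-symmetric generator of planar rotations (the Lie algebra so(2)).
G = [[0.0, -1.0],
     [1.0, 0.0]]

def matmul(a, b):
    """Multiply two square matrices given as nested lists."""
    n = len(a)
    return [[sum(a[i][k] * b[k][j] for k in range(n)) for j in range(n)]
            for i in range(n)]

def expm(m, terms=30):
    """Matrix exponential via a truncated power series (adequate for small matrices)."""
    n = len(m)
    result = [[float(i == j) for j in range(n)] for i in range(n)]
    term = [row[:] for row in result]          # current term m^k / k!, starting at k = 0
    for k in range(1, terms):
        term = matmul(term, m)
        term = [[x / k for x in row] for row in term]
        result = [[result[i][j] + term[i][j] for j in range(n)] for i in range(n)]
    return result

# Hypothetical token-to-angle table standing in for a learned projection of the input.
ANGLE_OF = {"left": -0.5, "right": 0.5, "obs": 0.0}

def update_matrix(token):
    """Input-dependent update matrix exp(theta(token) * G)."""
    theta = ANGLE_OF[token]
    return expm([[theta * g for g in row] for row in G])

R = update_matrix("right")   # rotation by 0.5 rad
I2 = update_matrix("obs")    # zero angle, so the position is left untouched
```

In this toy setting an "obs" token maps to the identity, so only action tokens move the encoded position.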

If this is right

  • The models reach near-perfect OOD generalization on gating, 2D navigation, and nested-hierarchy tasks where standard transformers fail.
  • MapFormers remain scalable and deliver perplexity gains on naturalistic data.
  • Commutative maps permit efficient parallel computation while non-commutative maps can still be acquired through sequential path integration.
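
The last point can be illustrated with a small sketch. It assumes planar rotations for the commutative case and 3D rotations about two different axes for the non-commutative case; all helper names are hypothetical. When updates commute, the accumulated state is determined by a running sum of angles, which can be computed as a parallel scan; when they do not, the matrices have to be multiplied in order.

```python
import math
from itertools import accumulate

def rot2(theta):
    """Planar rotation by theta."""
    c, s = math.cos(theta), math.sin(theta)
    return [[c, -s], [s, c]]

def rot_x(theta):
    """3D rotation about the x axis."""
    c, s = math.cos(theta), math.sin(theta)
    return [[1.0, 0.0, 0.0], [0.0, c, -s], [0.0, s, c]]

def rot_z(theta):
    """3D rotation about the z axis."""
    c, s = math.cos(theta), math.sin(theta)
    return [[c, -s, 0.0], [s, c, 0.0], [0.0, 0.0, 1.0]]

def matmul(a, b):
    n = len(a)
    return [[sum(a[i][k] * b[k][j] for k in range(n)) for j in range(n)]
            for i in range(n)]

def close(a, b, tol=1e-9):
    """Element-wise comparison of two matrices."""
    return all(abs(x - y) < tol for ra, rb in zip(a, b) for x, y in zip(ra, rb))

# Commutative case: a running sum of angles replaces the running matrix product.
angles = [0.3, -0.1, 0.7, 0.2]
cumulative_angles = list(accumulate(angles))   # prefix sums
sequential = rot2(0.0)
for theta in angles:
    sequential = matmul(rot2(theta), sequential)
commutative_ok = close(sequential, rot2(cumulative_angles[-1]))

# Non-commutative case: order matters, so the product is taken step by step.
xz = matmul(rot_x(0.5), rot_z(0.5))
zx = matmul(rot_z(0.5), rot_x(0.5))
order_matters = not close(xz, zx)
```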

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

  • The same Lie-algebra construction could be inserted into other sequence models to test whether the disentangling effect improves structural generalization beyond transformers.
  • If the approach generalizes, it would supply a concrete route for adding geometric inductive bias to large language models on tasks that require tracking abstract relations.
  • Testing the matrices on data whose underlying structure lies outside Lie-algebra representations would reveal the limits of the current inductive bias.

Load-bearing premise

That input-dependent matrices constructed as exponentials of learned combinations of Lie-algebra generators will reliably disentangle structural relationships from content across the tested domains.

What would settle it

If ablating the input dependence of the matrices produces no drop in out-of-distribution accuracy on the Dyck-language or navigation tasks relative to standard positional encodings, the claimed necessity of the mechanism would be falsified.
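
To make the proposed ablation concrete, here is a toy sketch, not the paper's experiment: a hand-coded 1D walk in which the rotation angle is either input-dependent (actions move, observations do not) or tied to the token index, as in a standard positional encoding. The token names, the step size `OMEGA` and both functions are assumptions made for illustration.

```python
from itertools import accumulate

OMEGA = 0.25  # hypothetical angular step per action

# Hand-coded stand-in for a learned token-to-angle map.
INPUT_DEPENDENT = {"right": OMEGA, "left": -OMEGA, "obs": 0.0}

def integrated_angles(tokens):
    """Path integration: cumulative sum of input-dependent angle increments."""
    return list(accumulate(INPUT_DEPENDENT[t] for t in tokens))

def index_angles(tokens):
    """Ablated variant: the angle grows with the token index, whatever the token is."""
    return [OMEGA * (i + 1) for i in range(len(tokens))]

# Observe, step right, observe, step back left, observe again at the start location.
tokens = ["obs", "right", "obs", "left", "obs"]
pi = integrated_angles(tokens)
idx = index_angles(tokens)
```

With input-dependent increments the first and last observations share the same angle, so a revisit is recognizable from position alone. The index-based angles differ at those two steps, which is the kind of contrast the ablation would probe.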

Figures

Figures reproduced from arXiv: 2511.19279 by Salvador Mascarenhas, Victor Rambaud, Yair Lakretz.

Figure 1: Cognitive Maps: Disentangling Structure and Content for Path Integration in Episodic and Working-Memory Models. (a) Cognitive maps model the relationships between entities, such as objects in space or people working in an organization. In a cognitive map, both the entities and their position are represented but they can be factorized from each other. The position (or, more generally, the ‘state’) in a cogn…
Figure 2: MapFormers: Path-Integration Transformer-Based Models that Learn Cognitive Maps. Overview of our Transformer architectures unifying working and episodic memory as relative and absolute positional encodings, respectively. In both cases, the rotation angle θ is obtained via a low-rank projection of input X, before applying a cumulative sum (cumsum) along the temporal dimension, to perform path integration (PI) in…
Figure 3: The 2D-Navigation Task and an Illustration of the Desired Corresponding Cognitive Map. (a) Illustration of the 2D navigation task: A model must predict the upcoming observation every time it comes back to it. The model only receives symbols, and must make sense of them without supervision. Some represent observations, others actions to take. (b) To solve this task, only actions (blue) should update the cog…
Figure 4: Behavioral Analyses: MapEM scales better than MapWM. (a) Fixed sequence length l = 256, varying head size. (b) Fixed head size h = 48, varying sequence length. EM models are more robust to number of neurons and sequence length. (c) Fixed head size h = 32 and sequence length l = 16. Increasing the number of items to remember makes the task harder. Working memory models require more neurons to remember a larg…
Figure 5: Neural Analyses: Actions are Matrices and Observations are Vectors. (a) Rotation angle norm ||θ_t|| vs. accuracy through training. The model reaches perfect accuracy as soon as it learns that action symbols a_t update the agent’s position while observation tokens o_t leave it untouched via 0-angle rotations, i.e. R_θo ≈ I_n. (b) (top table) Action’s inner action ∆in at cosine similarities. Opposite actions (right v …
Figure 6: (a) Distribution histogram of position in 1D navigation vs oscillation at the highest (green) and lowest (orange) frequencies. The longest distances reached by the model are comparable with the length of ω_min cycles. (b-c) Rotation blocks at high and low frequencies.
Figure 7: Influence of matrix structure, block size and non-commutativity on the ability to learn cognitive maps. (a) All models were trained on sequences of length 128. MapEM performance drops significantly when varying block size from 2 (blue) to 4 (orange) since it reduces the number of learned rotations. Having a non-commutative model of size 4 (red) allows to flexibly switch between rotations on each block, an…
Figure 8: MapFormers learn sparse attention maps. RoPE / WM / EM attention maps. Red dots designate objects at previously visited location. (a) RoPE fails to attend to the correct token. (b-c) MapWM and MapEM-os models manage to attend to the relevant token but (d) MapEM-s exhibits a sparse attention map as it only needs to focus on abstract position pt, while keys and queries can repeat an increase the similarity o…
Original abstract

A cognitive map is an internal model which encodes the abstract relationships among entities in the world, giving humans and animals the flexibility to adapt to new situations, with a strong out-of-distribution (OOD) generalization that current AI systems still do not possess. To bridge this gap, we introduce MapFormers, new Transformer-based architectures, which can learn cognitive maps from observational data and perform path-integration without supervision. Cognitive maps are learned in the model by disentangling structural relationships in the inputs from their specific content, a property that can be achieved by updating position encodings with input-dependent matrices, built as exponentials of learned combinations of Lie-algebra generators. We developed two variants of MapFormers that unify absolute and relative positional encoding to model episodic (EM) and working memory (WM), respectively. We tested MapFormers on several formal tasks targeting distinct cognitive capacities, including gating, 2D navigation and nested hierarchies (Dyck Languages). Our results demonstrate that MapFormers significantly outperform current AI architectures, achieving near-perfect OOD generalization where standard models fail. Furthermore, we show that MapFormers are scalable; evaluations on naturalistic data yield perplexity improvements over baselines, suggesting that these principles extend to large-scale, real-world domains. These results are obtained through efficient parallel computation on commutative maps, though our models can also learn non-commutative cognitive maps via sequential path-integration. Overall, these results suggest that input-dependent matrices provide a critical structural bias, by disentangling abstract relations from content in order to drive robust OOD generalization.
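
The link between path integration and relative positional encoding mentioned in the abstract can be sketched as follows, assuming a single 2D rotation block and hand-picked angle increments (the vectors, the increments and the function names are illustrative only, not the paper's implementation). Rotating a query and a key by their cumulative angles makes their dot product depend only on the angle accumulated between the two steps, not on where the path started.

```python
import math
from itertools import accumulate

def rotate(vec, theta):
    """Rotate a 2D vector by theta."""
    x, y = vec
    c, s = math.cos(theta), math.sin(theta)
    return (c * x - s * y, s * x + c * y)

def dot(u, v):
    return u[0] * v[0] + u[1] * v[1]

def positions(increments, offset=0.0):
    """Path-integrated angles: cumulative sum of increments plus a start offset."""
    return [offset + p for p in accumulate(increments)]

def score(q, k, increments, i, j, offset=0.0):
    """Attention logit between the query at step i and the key at step j."""
    pos = positions(increments, offset)
    return dot(rotate(q, pos[i]), rotate(k, pos[j]))

q, k = (1.0, 0.5), (0.2, -0.8)
increments = [0.4, 0.0, -0.3, 0.0, 0.6]   # hypothetical input-dependent angles

base = score(q, k, increments, i=4, j=1)
shifted = score(q, k, increments, i=4, j=1, offset=2.0)   # same path, new origin

# The same logit computed from the angle difference alone.
pos = positions(increments)
relative = dot(rotate(q, pos[4] - pos[1]), k)
```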

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Referee Report

2 major / 2 minor

Summary. The manuscript introduces MapFormers, Transformer variants that learn cognitive maps from observational data in a self-supervised way by updating positional encodings with input-dependent matrices formed as matrix exponentials of learned linear combinations of Lie-algebra generators. Two variants unify absolute and relative positional encodings to model episodic and working memory, respectively. The models are evaluated on formal tasks (gating, 2D navigation, Dyck languages) and naturalistic data. The authors claim near-perfect OOD generalization and perplexity improvements over baselines, obtained with efficient parallel computation on commutative maps (with sequential path integration for non-commutative cases).

Significance. If the central empirical claims hold, the work supplies a concrete architectural mechanism for injecting group-theoretic structure into positional encodings, offering a potential route to the relational abstraction and OOD robustness that standard Transformers lack. The explicit separation of structure from content via Lie-algebra parametrization, together with the unification of absolute/relative encodings and support for both parallel and sequential map integration, constitutes a substantive contribution to the literature on inductive biases for sequence models.

major comments (2)
  1. [§3.2] §3.2 (position-update rule): the manuscript asserts that exponentials of learned combinations of Lie-algebra generators produce updates that disentangle abstract relations from token content, yet provides neither a derivation of this separation property from the underlying group structure nor an invariance argument under content-preserving transformations. Without such a derivation, the mechanistic justification for the headline OOD claim remains incomplete.
  2. [§5] §5 (experimental section) and associated tables: no ablation is reported that replaces the Lie-algebra parametrization with a simpler input-dependent update (e.g., an MLP directly predicting the update matrix from the current token). Consequently it is impossible to determine whether the reported near-perfect OOD generalization on Dyck languages and navigation tasks is attributable to the group-theoretic construction or to generic increases in capacity and training dynamics.
minor comments (2)
  1. [§3] The notation for the specific Lie-algebra generators and the precise form of the learned linear combination should be stated explicitly (e.g., with an equation defining the basis matrices) to allow reproduction.
  2. [Figures 3-5] Figure captions and axis labels in the navigation and Dyck-language plots would benefit from explicit mention of the OOD split definition and the exact metric (accuracy, edit distance, etc.) being plotted.

Simulated Author's Rebuttal

2 responses · 0 unresolved

We thank the referee for the constructive and insightful comments. We address each major point below and have revised the manuscript to strengthen the mechanistic justification and empirical validation.

Point-by-point responses
  1. Referee: [§3.2] §3.2 (position-update rule): the manuscript asserts that exponentials of learned combinations of Lie-algebra generators produce updates that disentangle abstract relations from token content, yet provides neither a derivation of this separation property from the underlying group structure nor an invariance argument under content-preserving transformations. Without such a derivation, the mechanistic justification for the headline OOD claim remains incomplete.

    Authors: We agree that an explicit derivation would improve the mechanistic grounding. In the revised §3.2 we now include a derivation showing that the update matrix exp(∑ α_k G_k), with coefficients α_k derived from input relations, produces positional shifts that depend solely on the abstract transformation group element and are invariant to content-preserving re-labelings of tokens. The argument proceeds by noting that the Lie-algebra generators span the tangent space of the structure group, so linear combinations encode only relational displacements; the exponential map then yields group elements whose action on positions commutes with any content-only transformation. A short invariance lemma is added to formalize this separation. revision: yes

  2. Referee: [§5] §5 (experimental section) and associated tables: no ablation is reported that replaces the Lie-algebra parametrization with a simpler input-dependent update (e.g., an MLP directly predicting the update matrix from the current token). Consequently it is impossible to determine whether the reported near-perfect OOD generalization on Dyck languages and navigation tasks is attributable to the group-theoretic construction or to generic increases in capacity and training dynamics.

    Authors: We concur that the current experiments leave open the possibility that gains arise from increased capacity rather than the specific inductive bias. We have therefore added the requested ablation: an MLP that directly regresses the update matrix from the token embedding, keeping parameter count comparable. The new results (now reported in §5 and an additional table) show that the MLP variant improves over standard Transformers but falls well short of near-perfect OOD generalization on the Dyck and navigation suites, whereas MapFormer retains its performance. This indicates that the Lie-algebra parametrization supplies a critical structural bias beyond generic input-dependent updates. revision: yes
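
For concreteness, the update matrix exp(∑ α_k G_k) from the first response can be worked out in the simplest case, assuming a single 2×2 skew-symmetric generator. The choice of generator and the symbol θ_t are illustrative; the paper's exact basis is not given in this summary.

```latex
G = \begin{pmatrix} 0 & -1 \\ 1 & 0 \end{pmatrix}, \qquad
\exp(\theta_t G) = \sum_{n \ge 0} \frac{(\theta_t G)^n}{n!}
= \begin{pmatrix} \cos\theta_t & -\sin\theta_t \\ \sin\theta_t & \cos\theta_t \end{pmatrix},
\\[6pt]
[A, B] = 0 \;\Longrightarrow\; \exp(A)\exp(B) = \exp(A + B), \qquad
\prod_{s \le t} \exp(\theta_s G) = \exp\!\Big(\Big(\sum_{s \le t} \theta_s\Big) G\Big).
```

When the generators commute, the product of updates collapses into a sum of coefficients, which is what makes a cumulative-sum formulation possible. For non-commuting generators the identity fails in general and the product has to be evaluated in order.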

Circularity Check

0 steps flagged

No significant circularity: architectural choice grounded in Lie-algebra structure, not reduced to fitted inputs or self-citation

Full rationale

The paper's central mechanism—constructing input-dependent positional updates as exponentials of linear combinations of Lie-algebra generators—is presented as a mathematical ansatz drawn from group theory to achieve disentangling of structure from content. This is not equivalent to the target OOD generalization metric by construction, nor is it a fitted parameter renamed as a prediction. No load-bearing step reduces the claimed near-perfect generalization on Dyck languages or navigation tasks to a self-citation chain or to the same observational data used for training. Empirical results on formal tasks and naturalistic data provide independent falsifiable content. The derivation chain remains self-contained against external benchmarks.

Axiom & Free-Parameter Ledger

0 free parameters · 1 axioms · 0 invented entities

The construction rests on standard Lie-algebra properties and the assumption that input-dependent matrix updates can separate structure from content; no free parameters or invented entities are explicitly quantified in the abstract.

axioms (1)
  • domain assumption: Lie-algebra generators can be linearly combined and exponentiated to produce matrices that represent structural transformations independent of content.
    Invoked when defining the input-dependent position encodings.
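
Part of this axiom can be sanity-checked numerically in a small case. The sketch below assumes the three standard generators of so(3) and arbitrary coefficients (both are illustrative choices, not taken from the paper). It exponentiates their linear combination with a truncated power series and measures how far the result is from being orthogonal, i.e. a norm-preserving transformation. It says nothing about independence from content, which is an empirical claim.

```python
# Standard basis of 3x3 skew-symmetric matrices (generators of so(3)).
GENERATORS = [
    [[0.0, 0.0, 0.0], [0.0, 0.0, -1.0], [0.0, 1.0, 0.0]],
    [[0.0, 0.0, 1.0], [0.0, 0.0, 0.0], [-1.0, 0.0, 0.0]],
    [[0.0, -1.0, 0.0], [1.0, 0.0, 0.0], [0.0, 0.0, 0.0]],
]

def matmul(a, b):
    n = len(a)
    return [[sum(a[i][k] * b[k][j] for k in range(n)) for j in range(n)]
            for i in range(n)]

def transpose(a):
    return [list(col) for col in zip(*a)]

def combine(coeffs):
    """Linear combination sum_k alpha_k * G_k of the generators."""
    return [[sum(c * g[i][j] for c, g in zip(coeffs, GENERATORS)) for j in range(3)]
            for i in range(3)]

def expm(m, terms=40):
    """Matrix exponential via a truncated power series."""
    n = len(m)
    result = [[float(i == j) for j in range(n)] for i in range(n)]
    term = [row[:] for row in result]
    for k in range(1, terms):
        term = [[x / k for x in row] for row in matmul(term, m)]
        result = [[result[i][j] + term[i][j] for j in range(n)] for i in range(n)]
    return result

alphas = [0.3, -1.2, 0.7]          # hypothetical learned coefficients
R = expm(combine(alphas))

# Largest deviation of R^T R from the identity; near zero for an orthogonal matrix.
gram = matmul(transpose(R), R)
max_dev = max(abs(gram[i][j] - (1.0 if i == j else 0.0))
              for i in range(3) for j in range(3))
```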

pith-pipeline@v0.9.0 · 5604 in / 1142 out tokens · 32150 ms · 2026-05-17T06:03:07.572257+00:00 · methodology
