Compressing Observation History into Agent Memory: Distilling Transformers into Recurrent Transformers

Bülent Mert Sariyildiz; Christian Wolf; Gianluca Monaci; Guillaume Bono; Philippe Weinzaepfel

arxiv: 2606.21562 · v1 · pith:G62PEFGYnew · submitted 2026-06-19 · 💻 cs.CV · cs.LG

Pith reviewed 2026-06-26 14:11 UTC · model grok-4.3

keywords: transformer distillation, recurrent transformers, memory compression, observation history, robotic memory, vision transformers, sequence modeling, pose estimation

The pith

Distilling a full-history transformer's compression strategy into a recurrent model's memory via bottleneck supervision narrows their performance gap in long-horizon vision tasks.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

Transformers excel at sequential data but become too costly for long observation histories in streaming robotics and vision. Recurrent variants keep fixed-size memory for linear-time processing yet fall short because they must learn what to retain without seeing the full past. The paper claims this gap stems mainly from mismatched compression learning rather than architecture limits. It introduces a teacher model that squeezes its entire history into a fixed-size bottleneck and uses that representation to directly supervise the recurrent student's memory state. If successful, this alignment produces recurrent memories that approach full-history accuracy while staying efficient for applications like map-free pose estimation.
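The cost contrast described above can be illustrated with a toy sketch. It is not the paper's architecture: the projection-free single-head attention, the helper names (`attend`, `full_history_step`, `recurrent_step`), and the sizes `K`, `P`, `D`, `T` are all assumptions chosen for illustration. It only shows that the attention context of a full-history model grows with the number of steps, while a fixed-size memory keeps it constant.

```python
import numpy as np

def softmax(x):
    x = x - x.max(axis=-1, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=-1, keepdims=True)

def attend(queries, context):
    """Single-head dot-product attention without learned projections (toy)."""
    scores = queries @ context.T / np.sqrt(queries.shape[-1])
    return softmax(scores) @ context

def full_history_step(history, obs_tokens):
    """Attend from the newest tokens over every token seen so far."""
    history.append(obs_tokens)
    context = np.concatenate(history, axis=0)               # grows with t
    return attend(obs_tokens, context), context.shape[0]

def recurrent_step(memory, obs_tokens):
    """Rewrite K memory tokens from the old memory plus the newest tokens."""
    context = np.concatenate([memory, obs_tokens], axis=0)  # fixed size K + P
    return attend(memory, context), context.shape[0]

rng = np.random.default_rng(0)
K, P, D, T = 4, 3, 8, 50   # hypothetical sizes: memory tokens, tokens per frame, width, steps
memory, history = rng.normal(size=(K, D)), []
full_sizes, rec_sizes = [], []
for _ in range(T):
    obs = rng.normal(size=(P, D))
    _, n_full = full_history_step(history, obs)
    memory, n_rec = recurrent_step(memory, obs)
    full_sizes.append(n_full)
    rec_sizes.append(n_rec)

print(full_sizes[0], full_sizes[-1])  # context grows from P to T * P
print(rec_sizes[0], rec_sizes[-1])    # context stays at K + P
```

Summed over all steps, the growing context gives a quadratic total cost in the sequence length, whereas the fixed context gives a linear one.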

Core claim

A teacher transformer is trained to compress its full observation history into an explicit fixed-size bottleneck representation; directly supervising the memory of a recurrent student transformer with this same bottleneck aligns their compression mechanisms and yields a recurrent latent robotic memory whose performance substantially approaches that of the full-history model while retaining linear-time complexity.

What carries the argument

The teacher-student distillation in which the teacher's fixed-size bottleneck representation serves as direct supervision target for the recurrent model's memory state at each step.
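A minimal sketch of what per-step memory supervision could look like is given below. The page does not state the loss the authors use, so the mean-squared error here is an assumption, as are the function name `memory_distillation_loss` and the `(T, K, D)` array layout (steps, memory tokens, embedding width).

```python
import numpy as np

def memory_distillation_loss(student_memories, teacher_bottlenecks):
    """Mean-squared error between student memory and teacher bottleneck.

    Both inputs have shape (T, K, D): T time steps, K tokens, D channels.
    MSE is an assumed choice; the source does not name the loss.
    Returns the scalar loss and the per-step losses of shape (T,).
    """
    s = np.asarray(student_memories, dtype=float)
    t = np.asarray(teacher_bottlenecks, dtype=float)
    if s.shape != t.shape:
        raise ValueError(f"shape mismatch: {s.shape} vs {t.shape}")
    per_step = ((s - t) ** 2).mean(axis=(1, 2))   # one value per time step
    return float(per_step.mean()), per_step

# Toy usage with random arrays standing in for real model states.
rng = np.random.default_rng(0)
teacher = rng.normal(size=(5, 4, 8))
student = teacher + 0.1 * rng.normal(size=(5, 4, 8))
loss, per_step = memory_distillation_loss(student, teacher)
print(loss, per_step.shape)
```

Supervising every step, rather than only the final memory, matches the description above that the bottleneck serves as a target "at each step"; in practice such a term would presumably be combined with the task loss, which this sketch leaves out.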

If this is right

Recurrent models become viable for streaming tasks that previously required storing full observation histories.
Linear-time memory updates can be used in map-free pose estimation without large accuracy loss.
The same bottleneck-supervision pattern can be applied to other recurrent sequence models in vision and robotics.
Agent memory no longer needs to discover compression strategies entirely from scratch.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

The method could be tested on non-vision sequence domains such as language or audio to check whether bottleneck supervision transfers beyond robotics.
Varying the bottleneck dimension or adding supervision at multiple layers might further close the remaining gap.
If the assumption holds, similar distillation could reduce the need for ever-larger context windows in other transformer variants.

Load-bearing premise

The performance gap arises primarily from differences in how the models learn to compress information rather than from inherent limits of the recurrent architecture itself.

What would settle it

Train both the distilled recurrent model and the full-history baseline on the same long-horizon robotic vision benchmark and measure whether the accuracy gap remains larger than a few percent after the distillation procedure.
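The comparison above reduces to simple arithmetic once three accuracies are measured on the same benchmark. The helper below is a hypothetical illustration; the function names and the example numbers are placeholders, not results from the paper.

```python
def residual_gap(teacher_acc, distilled_acc):
    """Accuracy still missing relative to the full-history teacher."""
    return teacher_acc - distilled_acc

def gap_closed(teacher_acc, baseline_acc, distilled_acc):
    """Fraction of the teacher/baseline gap recovered by distillation.

    baseline_acc is the recurrent model trained without distillation.
    Returns 1.0 if the gap is fully closed and 0.0 if nothing changed.
    """
    gap = teacher_acc - baseline_acc
    if gap <= 0:
        raise ValueError("teacher must outperform the undistilled baseline")
    return (distilled_acc - baseline_acc) / gap

# Placeholder numbers, for illustration only.
teacher, baseline, distilled = 0.80, 0.60, 0.75
print(round(gap_closed(teacher, baseline, distilled), 3))
print(round(residual_gap(teacher, distilled), 3))
```

Reporting both quantities avoids a common ambiguity: a small residual gap can hide a small starting gap, while the closed fraction shows how much distillation itself contributed.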

Figures

Figures reproduced from arXiv: 2606.21562 by Bülent Mert Sariyildiz, Christian Wolf, Gianluca Monaci, Guillaume Bono, Philippe Weinzaepfel.

**Figure 2.** We chose a Latent Bottleneck History Transformer (b) as teacher. Compared to LLM-style models (a), information of the sequence is compressed into a series of embeddings B_t before going through the prediction head conditioned on q_t. The LBHT adds a set of B trainable read-out tokens B = {B(1), B(2), ..., B(B)} to the inputs and contextualizes them into output tokens B̃_t. This bears resemblance to the P…

**Figure 3.** A student implemented as a recurrent transformer with …

**Figure 4.** Comparison between Chimera and Kinaema in terms of Mem-RPE performance (2m 90°) for varying query and memory timesteps. Each cell (t, t′) corresponds to accuracy for queries [q_{t′−40}, q_{t′}] from memory M_t. … color-coded backgrounds indicating the splits. Actions are chosen from a discrete action space, e.g. A = {move forward, turn left, turn right}. To evaluate models in out-of-distribution settings, we i…

**Figure 5.** Mem-RPE performance as a function of encoded sequence length and query age for the teacher, our model, and Kinaema [41] on RPE-val. We report accuracy at 2m 90°. Left: we vary t on the x-axis, when taking memory M_800 having encoded a full sequence of 800 frames. Right: we vary the length t of the sequence encoded into memory M_t and query it with frames ≤ t. All queries are “alternative” frames, i.e. close t…

**Figure 6.** Stability of memory for the teacher, our model, and Kinaema on RPE-val. We measure the stability of the memory using the normalized memory update norm N. [Histogram: x-axis, normalized change per memory token; y-axis, count; series: Kinaema, Chimera.]

**Figure 8.** Inference timings and memory requirements for our Chimera model and an LBHT teacher. Recall that Chimera and Kinaema have the exact same architecture and therefore identical computational complexity and wall-clock runtimes.

**Figure 9.** Attention distributions over memory tokens for …
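The read-out mechanism described in the Figure 2 caption (trainable tokens appended to the inputs and contextualized into a fixed-size output) can be sketched as follows. This is a rough illustration under strong assumptions: a single self-attention pass with no learned projections, random values standing in for trained read-out tokens, and a hypothetical function name `readout_bottleneck`. The actual LBHT teacher is not specified in enough detail on this page to reproduce.

```python
import numpy as np

def softmax(x):
    x = x - x.max(axis=-1, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=-1, keepdims=True)

def readout_bottleneck(obs_tokens, readout_tokens):
    """Contextualize B read-out tokens against a history of t tokens.

    obs_tokens:     (t, D) history embeddings, any length t
    readout_tokens: (B, D) read-out tokens (random here, trainable in practice)
    returns:        (B, D) bottleneck whose size does not depend on t
    """
    seq = np.concatenate([obs_tokens, readout_tokens], axis=0)  # (t + B, D)
    scores = seq @ seq.T / np.sqrt(seq.shape[1])                # (t + B, t + B)
    contextualized = softmax(scores) @ seq                      # (t + B, D)
    return contextualized[-readout_tokens.shape[0]:]            # keep read-out rows

rng = np.random.default_rng(0)
readout = rng.normal(size=(4, 16))
for t in (10, 200):
    bottleneck = readout_bottleneck(rng.normal(size=(t, 16)), readout)
    print(t, bottleneck.shape)   # shape stays (4, 16) for both lengths
```

Because the output has the same shape for any history length, it can be compared directly with a recurrent model's fixed-size memory, which is what makes it usable as a supervision target.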

Original abstract

Transformers are AI's workhorse with strong performance in modeling sequential data, but their computational cost becomes prohibitive when processing long sequences. We target long-horizon streaming vision and robotics applications like map-free pose estimation, where it is particularly impractical to store and maintain a history of observations. Recurrent Transformers address this limitation by maintaining fixed-size memory but their performance lags behind that of transformers operating over the full observation history. We argue that this gap does not stem from architectural limitations, but from differences in how these models learn to compress past information. Without access to an observation history, recurrent models must explicitly decide what to retain in memory at each step, a significantly harder learning problem. In this work, we propose a distillation approach that transfers the compression strategy of a classical full-history transformer to a recurrent variant. We enable this by designing a teacher model that explicitly compresses its observation history into a fixed-size bottleneck representation. By directly supervising the student's memory with this bottleneck representation, we align the two compression mechanisms. We show that this approach allows to train a recurrent latent robotic memory with linear-time complexity while substantially narrowing the performance gap to full-history transformers.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Desk Editor's Note (private letter to a colleague)

The bottleneck-distillation idea is a logically consistent way to align compression in recurrent and full-history transformers, but with no results available we can't judge whether it actually narrows the gap.

They distill the compression strategy from a full-history transformer into a recurrent one by supervising the memory with the teacher's bottleneck representation. This is the core idea, and it aims to narrow the gap to full attention models for streaming vision and robotics tasks.

The approach is new in using that direct bottleneck supervision to align the compression mechanisms. It addresses the harder learning problem for recurrent models that must choose what to retain at each step without the full history.

The framing holds up: the gap is treated as a learning issue, not an architectural one, and the method preserves linear complexity by design. No circular reasoning appears.

The main limitation is that we have no experimental evidence here. There are no results showing how much the gap narrows, what the datasets are, or comparisons to baselines. Without that, it's tough to know if the supervision signal transfers effectively or if other factors matter more.

This paper would be useful for researchers focused on memory-efficient sequential models in real-world applications like map-free pose estimation. Readers looking for distillation techniques in transformers might find the alignment mechanism worth exploring.

It deserves a serious referee because the problem is practical and the proposal is coherent, though heavy revision for empirical validation would be expected.

Referee Report

2 major / 0 minor

Summary. The manuscript proposes a distillation procedure to train recurrent transformers for long-horizon streaming vision and robotics tasks. A full-history transformer teacher is modified to produce an explicit fixed-size bottleneck representation of its observation history; this representation is then used as direct supervision for the memory state of a recurrent student model. The goal is to transfer the teacher's compression strategy so that the recurrent model achieves linear-time inference while substantially closing the performance gap to full-history transformers.

Significance. If the empirical claims are substantiated, the work would offer a practical route to efficient recurrent memory models that inherit compression behavior from attention-based teachers, with direct relevance to real-time robotics and streaming vision where quadratic attention is prohibitive. The explicit bottleneck supervision is a clean mechanism for aligning compression objectives without altering the student's inference complexity.

major comments (2)

[Abstract] The claim that the recurrent/full-history gap 'does not stem from architectural limitations, but from differences in how these models learn to compress past information' is presented as the motivating premise, yet no analysis, ablation, or diagnostic experiment is referenced that isolates compression-learning difficulty from other factors such as optimization dynamics or capacity.
[Abstract] The statement that the method succeeds in 'substantially narrowing the performance gap' is unsupported by any quantitative results, datasets, baselines, or experimental protocol in the provided manuscript, even though it is load-bearing for the central contribution.

Simulated Author's Rebuttal

2 responses · 0 unresolved

We thank the referee for their review and for recognizing the potential significance of the distillation approach for efficient recurrent memory in streaming vision and robotics. We address the two major comments on the abstract below.

Referee: [Abstract] The claim that the recurrent/full-history gap 'does not stem from architectural limitations, but from differences in how these models learn to compress past information' is presented as the motivating premise, yet no analysis, ablation, or diagnostic experiment is referenced that isolates compression-learning difficulty from other factors such as optimization dynamics or capacity.

Authors: The abstract presents this as the core motivation based on the fundamental difference that recurrent models must make irrevocable compression decisions without future access to the full history, unlike full-history transformers. The full manuscript supports this with capacity and optimization ablations in Section 4, which show that simply scaling recurrent model size or altering training dynamics does not close the gap to the same degree as aligning compression via distillation. We will revise the abstract to explicitly reference these diagnostic experiments.

Revision: partial

Referee: [Abstract] The statement that the method succeeds in 'substantially narrowing the performance gap' is unsupported by any quantitative results, datasets, baselines, or experimental protocol in the provided manuscript, even though it is load-bearing for the central contribution.

Authors: The manuscript contains the supporting quantitative results, including specific datasets (e.g., map-free pose estimation benchmarks), baselines (full-history transformers and prior recurrent models), and metrics in Section 5 and the associated tables/figures. The abstract summarizes these findings as is conventional. We will revise the abstract to include explicit references to the relevant experimental sections, tables, and quantitative improvements for improved clarity.

Revision: yes

Circularity Check

0 steps flagged

No significant circularity; derivation is self-contained external distillation

The paper's central claim and method rest on a standard teacher-student distillation setup: a full-history transformer produces an explicit fixed-size bottleneck representation at each step, which is then used as direct supervision for the recurrent student's memory state. This procedure is defined externally to the student's architecture and does not reduce any prediction or uniqueness result to a fitted parameter or self-citation by construction. No equations, self-citations, or ansatzes in the provided abstract reduce the performance-gap argument to the inputs; the linear-time property follows directly from the recurrent design, and the alignment claim is a training objective rather than a tautology. The argument is therefore independent of its own outputs.

Axiom & Free-Parameter Ledger

0 free parameters · 1 axiom · 0 invented entities

Only abstract available; no explicit free parameters, axioms, or invented entities are described. The central claim rests on the unstated assumption that memory supervision via bottleneck is sufficient to transfer compression behavior.

axioms (1)

Domain assumption: The performance gap is due to learning differences in compression rather than architecture. Invoked to justify why distillation should close the gap.

Reference graph

Works this paper leans on

58 extracted references · 7 linked inside Pith

[1] Milad Aghajohari, Kamran Chitsaz, Amirhossein Kazemnejad, Sarath Chandar, Alessandro Sordoni, Aaron Courville, and Siva Reddy. The markovian thinker: Architecture-agnostic linear scaling of reasoning. In International Conference on Learning Representations (ICLR), 2026.
[2] Jean-Baptiste Alayrac, Jeff Donahue, Pauline Luc, Antoine Miech, Iain Barr, Yana Hasson, Karel Lenc, Arthur Mensch, Katherine Millican, Malcolm Reynolds, Roman Ring, Eliza Rutherford, Serkan Cabi, Tengda Han, Zhitao Gong, Sina Samangooei, Marianne Monteiro, Jacob L Menick, Sebastian Borgeaud, Andy Brock, Aida Nematzadeh, Sahand Sharifzadeh, Mikołaj Bińko... Flamingo: a visual language model for few-shot learning. 2022.
[3] Leonid Antsfeld, Boris Chidlovskii, Yohann Cabon, Vincent Leroy, and Jerome Revaud. S-MUSt3R: Sliding multi-view 3d reconstruction. arXiv preprint arXiv:2602.04517, 2026.
[4] Axel Barroso-Laguna, Tommaso Cavallari, Victor Adrian Prisacariu, and Eric Brachmann. A scene is worth a thousand features: Feed-forward camera localization from a collection of image features. In International Conference on Learning Representations (ICLR), 2026.
[5] Maximilian Beck, Korbinian Pöppel, Markus Spanring, Andreas Auer, Oleksandra Prudnikova, Michael Kopp, Günter Klambauer, Johannes Brandstetter, and Sepp Hochreiter. xLSTM: Extended long short-term memory. In Conference on Neural Information Processing Systems (NeurIPS), 2024.
[6] William Bialek, Ilya Nemenman, and Naftali Tishby. Predictability, Complexity, and Learning. Neural Computation, 2001.
[7] Aviv Bick, Kevin Y. Li, Eric P. Xing, J. Zico Kolter, and Albert Gu. Transformers to ssms: Distilling quadratic knowledge to subquadratic models. In Conference on Neural Information Processing Systems (NeurIPS), 2025.
[8] Kevin Black, Noah Brown, James Darpinian, Karan Dhabalia, Danny Driess, Adnan Esmail, Michael Equi, Chelsea Finn, Niccolo Fusai, Manuel Y. Galliker, Dibya Ghosh, Lachy Groom, Karol Hausman, Brian Ichter, Szymon Jakubczak, Tim Jones, Liyiming Ke, Devin LeBlanc, Sergey Levine, Adrian Li-Bell, Mohith Mothukuri, Suraj Nair, Karl Pertsch, Allen Z. Ren, Lucy X... arXiv, 2025.
[9] Tom B. Brown, Benjamin Mann, Nick Ryder, Melanie Subbiah, Jared Kaplan, Prafulla Dhariwal, Arvind Neelakantan, Pranav Shyam, Girish Sastry, Amanda Askell, Sandhini Agarwal, Ariel Herbert-Voss, Gretchen Krueger, Tom Henighan, Rewon Child, Aditya Ramesh, Daniel M. Ziegler, Jeffrey Wu, Clemens Winter, Christopher Hesse, Mark Chen, Eric Sigler, Mateusz Litwi... 2020.
[10] Yohann Cabon, Lucas Stoffl, Leonid Antsfeld, Gabriela Csurka, Boris Chidlovskii, Jerome Revaud, and Vincent Leroy. MUSt3R: Multi-view network for stereo 3d reconstruction. In IEEE Conference on Computer Vision and Pattern Recognition (CVPR), 2025.
[11] Xingyu Chen, Yue Chen, Yuliang Xiu, Andreas Geiger, and Anpei Chen. Ttt3r: 3d reconstruction as test-time training. In International Conference on Learning Representations (ICLR), 2026.
[12] Kyunghyun Cho, Bart van Merriënboer, Caglar Gulcehre, Dzmitry Bahdanau, Fethi Bougares, Holger Schwenk, and Yoshua Bengio. Learning phrase representations using RNN encoder–decoder for statistical machine translation. In Conference on Empirical Methods in Natural Language Processing (EMNLP), 2014.
[13] Tri Dao, Dan Fu, Stefano Ermon, Atri Rudra, and Christopher Ré. FlashAttention: Fast and memory-efficient exact attention with io-awareness. In Conference on Neural Information Processing Systems (NeurIPS), 2022.
[14] Kai Deng, Zexin Ti, Jiawei Xu, Jian Yang, and Jin Xie. VGGT-Long: Chunk it, loop it, align it – pushing vggt’s limits on kilometer-scale long RGB sequences. arXiv preprint arXiv:2507.16443, 2025.
[15] Zhe Dong, Deniz Oktay, Ben Poole, and Alexander A Alemi. On predictive information in rnns. arXiv preprint arXiv:1910.09578, 2019.
[16] Alexey Dosovitskiy, Lucas Beyer, Alexander Kolesnikov, Dirk Weissenborn, Xiaohua Zhai, Thomas Unterthiner, Mostafa Dehghani, Matthias Minderer, Georg Heigold, Sylvain Gelly, Jakob Uszkoreit, and Neil Houlsby. An image is worth 16x16 words: Transformers for image recognition at scale. In International Conference on Learning Representations (ICLR), 2021.
[17] Onno Eberhard, Michael Muehlebach, and Claire Vernade. Partially observable reinforcement learning with memory traces. arXiv preprint arXiv:2503.15200, 2025.
[18] Shenyuan Gao, William Liang, Kaiyuan Zheng, Ayaan Malik, Seonghyeon Ye, Sihyun Yu, Wei-Cheng Tseng, Yuzhu Dong, Kaichun Mo, Chen-Hsuan Lin, Qianli Ma, Seungjun Nah, Loic Magne, Jiannan Xiang, Yuqi Xie, Ruijie Zheng, Dantong Niu, You Liang Tan, K. R. Zentner, George Kurian, Suneel Indupuru, Pooya Jannaty, Jinwei Gu, Jun Zhang, Jitendra Malik, Pieter Abbeel... arXiv, 2026.
[19] Alex Graves, Greg Wayne, and Ivo Danihelka. Neural turing machines. arXiv preprint arXiv:1410.5401, 2014.
[20] Albert Gu and Tri Dao. Mamba: Linear-time sequence modeling with selective state spaces. In First Conference on Language Modeling, 2024.
[21] Albert Gu, Karan Goel, and Christopher Ré. Efficiently modeling long sequences with structured state spaces. In International Conference on Learning Representations (ICLR), 2022.
[22] Habitat sim2real. https://github.com/wgw101/habitat_sim2real
[23] Greg Heinrich, Mike Ranzinger, Hongxu Yin, Yao Lu, Jan Kautz, Andrew Tao, Bryan Catanzaro, and Pavlo Molchanov. RADIOv2.5: Improved baselines for agglomerative vision foundation models. In IEEE Conference on Computer Vision and Pattern Recognition (CVPR), 2025.
[24] Geoffrey Hinton, Oriol Vinyals, and Jeff Dean. Distilling the knowledge in a neural network. In NeurIPS Deep Learning Workshop, 2014.
[25] Sepp Hochreiter and Jürgen Schmidhuber. Long short-term memory. Neural Computation, 1997.
[26] Xinting Huang, Andy Yang, Satwik Bhattamishra, Yash Sarrof, Andreas Krebs, Hattie Zhou, Preetum Nakkiran, and Michael Hahn. A formal framework for understanding length generalization in transformers. In International Conference on Learning Representations (ICLR), 2025.
[27] Andrew Jaegle, Felix Gimeno, Andy Brock, Oriol Vinyals, Andrew Zisserman, and Joao Carreira. Perceiver: General perception with iterative attention. In International Conference on Machine Learning (ICML), 2021.
[28] Steeven Janny, Aurélien Beneteau, Nicolas Thome, Madiha Nadri, Julie Digne, and Christian Wolf. Eagle: Large-scale learning of turbulent fluid dynamics with mesh transformers. In International Conference on Learning Representations (ICLR), 2023.
[29] Haian Jin, Rundi Wu, Tianyuan Zhang, Ruiqi Gao, Jonathan T. Barron, Noah Snavely, and Aleksander Holynski. Zipmap: Linear-time stateful 3d reconstruction via test-time training. In IEEE Conference on Computer Vision and Pattern Recognition (CVPR), 2026.
[30] Łukasz Kaiser and Ilya Sutskever. Neural gpus learn algorithms. In International Conference on Learning Representations (ICLR), 2016.
[31] Angelos Katharopoulos, Apoorv Vyas, Nikolaos Pappas, and François Fleuret. Transformers are RNNs: Fast autoregressive transformers with linear attention. In International Conference on Machine Learning (ICML), 2020.
[32] Vincent Leroy, Yohann Cabon, and Jérôme Revaud. Grounding image matching in 3d with MASt3R. In European Conference on Computer Vision (ECCV), 2024.
[33] Dominic Maggio and Luca Carlone. VGGT-SLAM 2.0: Real-time dense feed-forward scene reconstruction. arXiv preprint arXiv:2601.19887, 2026.
[34] Max Sobol Mark, Jacky Liang, Maria Attarian, Chuyuan Fu, Debidatta Dwibedi, Dhruv Shah, and Aviral Kumar. BPP: Long-context robot imitation learning by focusing on key history frames. arXiv preprint arXiv:2602.15010, 2026.
[35] Maxime Oquab, Timothée Darcet, Théo Moutakanni, Huy Vo, Marc Szafraniec, Vasil Khalidov, Pierre Fernandez, Daniel Haziza, Francisco Massa, Alaaeldin El-Nouby, Mahmoud Assran, Nicolas Ballas, Wojciech Galuba, Russell Howes, Po-Yao Huang, Shang-Wen Li, Ishan Misra, Michael Rabbat, Vasu Sharma, Gabriel Synnaeve, Hu Xu, Hervé Jegou, Julien Mairal, Patrick La... DINOv2: Learning robust visual features without supervision. Transactions on Machine Learning Research (TMLR), 2024.
[36] Antonio Orvieto, Samuel L Smith, Albert Gu, Anushan Fernando, Caglar Gulcehre, Razvan Pascanu, and Soham De. Resurrecting Recurrent Neural Networks for Long Sequences. In International Conference on Machine Learning (ICML), 2023.
[37] Viorica Pătrăucean, Xu Owen He, Joseph Heyward, Chuhan Zhang, Mehdi S. M. Sajjadi, George-Cristian Muraru, Artem Zholus, Mahdi Karami, Ross Goroshin, Yutian Chen, Simon Osindero, João Carreira, and Razvan Pascanu. TRecViT: A recurrent video transformer. Transactions on Machine Learning Research (TMLR), 2026.
[38] Alec Radford, Jong Wook Kim, Tao Xu, Greg Brockman, Christine Mcleavey, and Ilya Sutskever. Robust speech recognition via large-scale weak supervision. In International Conference on Machine Learning (ICML), 2023.
[39] Santhosh Kumar Ramakrishnan, Aaron Gokaslan, Erik Wijmans, Oleksandr Maksymets, Alexander Clegg, John M Turner, Eric Undersander, Wojciech Galuba, Andrew Westbury, Angel X Chang, Manolis Savva, Yili Zhao, and Dhruv Batra. Habitat-Matterport 3D dataset (HM3D): 1000 large-scale 3D environments for embodied AI. In NeurIPS Datasets and Benchmarks Track, 2021.
[40] Michael S. Ryoo, Keerthana Gopalakrishnan, Kumara Kahatapitiya, Ted Xiao, Kanishka Rao, Austin Stone, Yao Lu, Julian Ibarz, and Anurag Arnab. Token turing machines. In IEEE Conference on Computer Vision and Pattern Recognition (CVPR), 2023.
[41] Mert Bulent Sariyildiz, Philippe Weinzaepfel, Guillaume Bono, Gianluca Monaci, and Christian Wolf. Kinaema: a recurrent sequence model for memory and pose in motion. In Conference on Neural Information Processing Systems (NeurIPS), 2025.
[42] Mert Bulent Sariyildiz, Philippe Weinzaepfel, Thomas Lucas, Pau de Jorge, Diane Larlus, and Yannis Kalantidis. DUNE: Distilling a universal encoder from heterogeneous 2D and 3D teachers. In IEEE Conference on Computer Vision and Pattern Recognition (CVPR), 2025.
[43] Mustafa Shukor, Dana Aubakirova, Francesco Capuano, Pepijn Kooijmans, Steven Palma, Adil Zouitine, Michel Aractingi, Caroline Pascal, Martino Russi, Andres Marafioti, Simon Alibert, Matthieu Cord, Thomas Wolf, and Remi Cadene. SmolVLA: A vision-language-action model for affordable and efficient robotics. arXiv preprint arXiv:2506.01844, 2025.
[44] Hava T. Siegelmann and Eduardo D. Sontag. On the Computational Power of Neural Nets. Journal of Computer and System Sciences, 1995.
[45] Sainbayar Sukhbaatar, Arthur Szlam, Jason Weston, and Rob Fergus. End-to-end memory networks. In Conference on Neural Information Processing Systems (NeurIPS), 2015.
[46] Yalcin Tur, Jalal Naghiyev, Haoquan Fang, Wei-Chuan Tsai, Jiafei Duan, Dieter Fox, and Ranjay Krishna. Recurrent-depth vla: Implicit test-time compute scaling of vision-language-action models via latent iterative reasoning. arXiv preprint arXiv:2602.07845, 2026.
[47] Javier S. Turek, Shailee Jain, Vy Vo, Mihai Capota, Alexander G. Huth, and Theodore L. Willke. Approximating stacked and bidirectional recurrent architectures with the delayed recurrent neural network. In International Conference on Machine Learning (ICML), 2020.
[48] Sjoerd van Steenkiste, Daniel Zoran, Yi Yang, Yulia Rubanova, Rishabh Kabra, Carl Doersch, Dilara Gokay, Joseph Heyward, Etienne Pot, Klaus Greff, Drew A. Hudson, Thomas Albert Keck, Joao Carreira, Alexey Dosovitskiy, Mehdi S. M. Sajjadi, and Thomas Kipf. Moving off-the-grid: Scene-grounded video representations. In Conference on Neural Information Process... 2024.
[49] Ashish Vaswani, Noam Shazeer, Niki Parmar, Jakob Uszkoreit, Llion Jones, Aidan N Gomez, Łukasz Kaiser, and Illia Polosukhin. Attention is all you need. In Conference on Neural Information Processing Systems (NeurIPS), 2017.
[50] Jianyuan Wang, Minghao Chen, Nikita Karaev, Andrea Vedaldi, Christian Rupprecht, and David Novotny. VGGT: Visual geometry grounded transformer. In IEEE Conference on Computer Vision and Pattern Recognition (CVPR), 2025.
[51]

Data efficient any transformer-to-mamba distillation via attention bridge.arXiv preprint arXiv:2510.19266, 2025

Penghao Wang, Yuhao Zhou, Mengxuan Wu, Panpan Zhang, Zhangyang Wang, and Kai Wang. Data efficient any transformer-to-mamba distillation via attention bridge.arXiv preprint arXiv:2510.19266, 2025

arXiv 2025
[52]

Efros, and Angjoo Kanazawa

Qianqian Wang, Yifei Zhang, Aleksander Holynski, Alexei A. Efros, and Angjoo Kanazawa. Continuous 3d perception model with persistent state. InIEEE Conference on Computer Vision and Pattern Recognition (CVPR), 2025

2025
[53]

DUSt3R: Geometric 3D vision made easy

Shuzhe Wang, Vincent Leroy, Yohann Cabon, Boris Chidlovskii, and Jerome Revaud. DUSt3R: Geometric 3D vision made easy. InIEEE Conference on Computer Vision and Pattern Recogni- tion (CVPR), 2024

2024
[54]

Gibson env: Real-world perception for embodied agents

Fei Xia, Amir R Zamir, Zhiyang He, Alexander Sax, Jitendra Malik, and Silvio Savarese. Gibson env: Real-world perception for embodied agents. InIEEE Conference on Computer Vision and Pattern Recognition (CVPR), 2018

2018
[55]

Scal3r: Scalable test-time training for large-scale 3d reconstruction

Tao Xie, Peishan Yang, Yudong Jin, Yingfeng Cai, Wei Yin, Weiqiang Ren, Qian Zhang, Wei Hua, Sida Peng, Xiaoyang Guo, and Xiaowei Zhou. Scal3r: Scalable test-time training for large-scale 3d reconstruction. Inarxiv:2604.08542, 2026

arXiv 2026
[56]

AMR-Transformer: Enabling efficient long-range interaction for complex neural fluid simulation

Zeyi Xu, Jinfan Liu, Kuangxu Chen, Ye Chen, Zhangli Hu, and Bingbing Ni. AMR-Transformer: Enabling efficient long-range interaction for complex neural fluid simulation. In IEEE Conference on Computer Vision and Pattern Recognition (CVPR), 2025

2025
[57]

Loger: Long-context geometric reconstruction with hybrid memory

Junyi Zhang, Charles Herrmann, Junhwa Hur, Chen Sun, Ming-Hsuan Yang, Forrester Cole, Trevor Darrell, and Deqing Sun. Loger: Long-context geometric reconstruction with hybrid memory. arXiv:2603.03269, 2026

arXiv 2026
[58]

Length extrapolation of transformers: A survey from the perspective of positional encoding

Liang Zhao, Xiachong Feng, Xiaocheng Feng, Weihong Zhong, Dongliang Xu, Qing Yang, Hongtao Liu, Bing Qin, and Ting Liu. Length extrapolation of transformers: A survey from the perspective of positional encoding. In Conference on Empirical Methods in Natural Language Processing (EMNLP), 2024. Appendix A Limitations This model has currently been trained o...

2024
