
arxiv: 2606.21562 · v1 · pith:G62PEFGYnew · submitted 2026-06-19 · 💻 cs.CV · cs.LG

Compressing Observation History into Agent Memory: Distilling Transformers into Recurrent Transformers

Pith reviewed 2026-06-26 14:11 UTC · model grok-4.3

classification 💻 cs.CV cs.LG
keywords transformer distillation · recurrent transformers · memory compression · observation history · robotic memory · vision transformers · sequence modeling · pose estimation

The pith

Distilling a full-history transformer's compression strategy into a recurrent model's memory via bottleneck supervision narrows their performance gap in long-horizon vision tasks.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

Transformers excel at sequential data but become too costly for long observation histories in streaming robotics and vision. Recurrent variants keep fixed-size memory for linear-time processing yet fall short because they must learn what to retain without seeing the full past. The paper claims this gap stems mainly from mismatched compression learning rather than architecture limits. It introduces a teacher model that squeezes its entire history into a fixed-size bottleneck and uses that representation to directly supervise the recurrent student's memory state. If successful, this alignment produces recurrent memories that approach full-history accuracy while staying efficient for applications like map-free pose estimation.
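The cost argument above can be made concrete with a rough count of token-pair interactions. The sketch below is only an illustration under stated assumptions: `tokens_per_frame` and `memory_tokens` are hypothetical sizes, and the counting model (each new frame attends either to the whole history or to a fixed memory) is a simplification, not the paper's measured runtime.

```python
def full_history_cost(num_frames: int, tokens_per_frame: int) -> int:
    """Token-pair interactions when each new frame attends to all frames so far."""
    total = 0
    for t in range(1, num_frames + 1):
        context = t * tokens_per_frame        # history length grows with t
        total += tokens_per_frame * context   # new tokens attend to the whole history
    return total


def recurrent_cost(num_frames: int, tokens_per_frame: int, memory_tokens: int) -> int:
    """Token-pair interactions when each step sees only the new frame plus a fixed memory."""
    per_step = (tokens_per_frame + memory_tokens) ** 2  # constant, independent of t
    return num_frames * per_step


if __name__ == "__main__":
    # Hypothetical sizes, chosen only to show the scaling trend.
    for frames in (200, 400, 800):
        print(frames, full_history_cost(frames, 10), recurrent_cost(frames, 10, 20))
```

Doubling the number of frames doubles the recurrent count but roughly quadruples the full-history count, which is the quadratic-versus-linear contrast the summary refers to.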

Core claim

A teacher transformer is trained to compress its full observation history into an explicit fixed-size bottleneck representation; directly supervising the memory of a recurrent student transformer with this same bottleneck aligns their compression mechanisms and yields a recurrent latent robotic memory whose performance substantially approaches that of the full-history model while retaining linear-time complexity.

What carries the argument

The teacher-student distillation in which the teacher's fixed-size bottleneck representation serves as direct supervision target for the recurrent model's memory state at each step.
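A minimal sketch of what per-step bottleneck supervision could look like is given below. It assumes a mean-squared-error objective between the teacher's bottleneck tokens and the student's memory tokens at each step; the page does not state the loss the paper actually uses, and the function names are hypothetical.

```python
def mse(a, b):
    """Mean squared error between two equally shaped token matrices (lists of lists)."""
    count = sum(len(row) for row in a)
    squared = sum((x - y) ** 2 for row_a, row_b in zip(a, b) for x, y in zip(row_a, row_b))
    return squared / count


def memory_distillation_loss(teacher_bottlenecks, student_memories):
    """Average over time steps of the MSE between teacher bottleneck and student memory.

    Each argument is a list with one entry per step; every entry is a
    [num_tokens][dim] matrix. MSE is an assumed choice of distance.
    """
    if len(teacher_bottlenecks) != len(student_memories):
        raise ValueError("teacher and student must cover the same number of steps")
    per_step = [mse(b, m) for b, m in zip(teacher_bottlenecks, student_memories)]
    return sum(per_step) / len(per_step)
```

In a real training loop this term would typically be added to the task loss, with the teacher's bottleneck treated as a fixed target; how the two are weighted is a design choice not described on this page.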

If this is right

  • Recurrent models become viable for streaming tasks that previously required storing full observation histories.
  • Linear-time memory updates can be used in map-free pose estimation without large accuracy loss.
  • The same bottleneck-supervision pattern can be applied to other recurrent sequence models in vision and robotics.
  • Agent memory no longer needs to discover compression strategies entirely from scratch.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

  • The method could be tested on non-vision sequence domains such as language or audio to check whether bottleneck supervision transfers beyond robotics.
  • Varying the bottleneck dimension or adding supervision at multiple layers might further close the remaining gap.
  • If the assumption holds, similar distillation could reduce the need for ever-larger context windows in other transformer variants.

Load-bearing premise

The performance gap arises primarily from differences in how the models learn to compress information rather than from inherent limits of the recurrent architecture itself.

What would settle it

Train both the distilled recurrent model and the full-history baseline on the same long-horizon robotic vision benchmark and measure whether the accuracy gap remains larger than a few percent after the distillation procedure.
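The comparison described above could be scored roughly as follows. This sketch assumes that the "2m 90°" figure quoted in the captions means a prediction counts as correct when its translation error is at most 2 m and its rotation error is at most 90°; that reading, the function names, and the input format are assumptions, not details confirmed by the page.

```python
def pose_accuracy(errors, max_trans_m=2.0, max_rot_deg=90.0):
    """Fraction of queries whose translation and rotation errors are both within threshold.

    `errors` is a list of (translation_error_m, rotation_error_deg) pairs.
    """
    if not errors:
        raise ValueError("no pose errors given")
    hits = sum(1 for trans, rot in errors if trans <= max_trans_m and rot <= max_rot_deg)
    return hits / len(errors)


def accuracy_gap(full_history_errors, recurrent_errors):
    """Accuracy of the full-history model minus accuracy of the recurrent model."""
    return pose_accuracy(full_history_errors) - pose_accuracy(recurrent_errors)
```

Both models would need to be evaluated on the same queries for the gap to be meaningful; a gap of zero means the recurrent model matches the full-history baseline at this threshold.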

Figures

Figures reproduced from arXiv: 2606.21562 by Bülent Mert Sariyildiz, Christian Wolf, Gianluca Monaci, Guillaume Bono, Philippe Weinzaepfel.

Figure 1: In Embodied AI, transformers (left) attend over …
Figure 2: We chose a Latent Bottleneck History Transformer (b) as teacher. Compared to LLM-style models (a), information of the sequence is compressed into a series of embeddings B_t before going through the prediction head conditioned on q_t. The LBHT adds a set of B trainable read-out tokens B = {B^(1), B^(2), ..., B^(B)} to the inputs and contextualizes them into output tokens B̃_t. This bears resemblance to the P…
Figure 3: A student implemented as a recurrent transformer with …
Figure 4: Comparison between Chimera and Kinaema in terms of Mem-RPE performance (2m 90°) for varying query and memory timesteps. Each cell (t, t′) corresponds to accuracy for queries [q_{t′−40}, q_{t′}] from memory M_t. color-coded backgrounds indicating the splits. Actions are chosen from a discrete action space, e.g. A = {move forward, turn left, turn right}. To evaluate models in out-of-distribution settings, we i…
Figure 5: Mem-RPE performance as function of encoded sequence length and query age for the teacher, our model, and Kinaema [41] on RPE-val. We report accuracy at 2m 90°. Left: We vary t on the x-axis, when taking memory M_800 having encoded a full sequence of 800 frames. Right: we vary the length t of the sequence encoded into memory M_t and query it with frames ≤ t. All queries are “alternative” frames, i.e. close t…
Figure 6: Stability of memory for the teacher, our model, and Kinaema on RPE-val. We measure the stability of the memory using the normalized memory update norm N. [Plot axes: normalized change per memory token (x) vs. count (y); series: Kinaema, Chimera.]
Figure 8: Inference timings and memory requirements for our Chimera model and a LBHT Teacher. Recall that Chimera and Kinaema have the exact same architecture and therefore identical computational complexity and wall-clock runtimes.
Figure 9: Attention distributions over memory tokens for …
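The Figure 6 caption refers to a normalized memory update norm without giving its formula on this page. One plausible reading, offered purely as an assumption, is the per-token change between consecutive memory states divided by the norm of the previous state; the sketch below implements that reading with hypothetical names.

```python
import math


def normalized_update_norms(prev_memory, next_memory, eps=1e-8):
    """Per-token ratio ||m_next - m_prev|| / ||m_prev|| for two consecutive memory states.

    Both arguments are [num_tokens][dim] matrices. This definition is an
    assumption; the paper's exact normalization is not given on this page.
    """
    norms = []
    for old, new in zip(prev_memory, next_memory):
        delta = math.sqrt(sum((n - o) ** 2 for o, n in zip(old, new)))
        scale = math.sqrt(sum(o ** 2 for o in old))
        norms.append(delta / (scale + eps))  # eps guards against an all-zero token
    return norms
```

Under this reading, values near zero indicate a token that barely changes between steps, while values above one indicate an update larger than the token itself.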
Original abstract

Transformers are AI's workhorse with strong performance in modeling sequential data, but their computational cost becomes prohibitive when processing long sequences. We target long-horizon streaming vision and robotics applications like map-free pose estimation, where it is particularly impractical to store and maintain a history of observations. Recurrent Transformers address this limitation by maintaining fixed-size memory but their performance lags behind that of transformers operating over the full observation history. We argue that this gap does not stem from architectural limitations, but from differences in how these models learn to compress past information. Without access to an observation history, recurrent models must explicitly decide what to retain in memory at each step, a significantly harder learning problem. In this work, we propose a distillation approach that transfers the compression strategy of a classical full-history transformer to a recurrent variant. We enable this by designing a teacher model that explicitly compresses its observation history into a fixed-size bottleneck representation. By directly supervising the student's memory with this bottleneck representation, we align the two compression mechanisms. We show that this approach allows to train a recurrent latent robotic memory with linear-time complexity while substantially narrowing the performance gap to full-history transformers.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Referee Report

2 major / 0 minor

Summary. The manuscript proposes a distillation procedure to train recurrent transformers for long-horizon streaming vision and robotics tasks. A full-history transformer teacher is modified to produce an explicit fixed-size bottleneck representation of its observation history; this representation is then used as direct supervision for the memory state of a recurrent student model. The goal is to transfer the teacher's compression strategy so that the recurrent model achieves linear-time inference while substantially closing the performance gap to full-history transformers.

Significance. If the empirical claims are substantiated, the work would offer a practical route to efficient recurrent memory models that inherit compression behavior from attention-based teachers, with direct relevance to real-time robotics and streaming vision where quadratic attention is prohibitive. The explicit bottleneck supervision is a clean mechanism for aligning compression objectives without altering the student's inference complexity.

major comments (2)
  1. [Abstract] The claim that the recurrent/full-history gap 'does not stem from architectural limitations, but from differences in how these models learn to compress past information' is presented as the motivating premise, yet no analysis, ablation, or diagnostic experiment is referenced that isolates compression-learning difficulty from other factors such as optimization dynamics or capacity.
  2. [Abstract] The claim of 'substantially narrowing the performance gap' is load-bearing for the central contribution, yet the provided manuscript supports it with no quantitative results, datasets, baselines, or experimental protocol.

Simulated Author's Rebuttal

2 responses · 0 unresolved

We thank the referee for their review and for recognizing the potential significance of the distillation approach for efficient recurrent memory in streaming vision and robotics. We address the two major comments on the abstract below.

Point-by-point responses
  1. Referee: [Abstract] The claim that the recurrent/full-history gap 'does not stem from architectural limitations, but from differences in how these models learn to compress past information' is presented as the motivating premise, yet no analysis, ablation, or diagnostic experiment is referenced that isolates compression-learning difficulty from other factors such as optimization dynamics or capacity.

    Authors: The abstract presents this as the core motivation based on the fundamental difference that recurrent models must make irrevocable compression decisions without future access to the full history, unlike full-history transformers. The full manuscript supports this with capacity and optimization ablations in Section 4, which show that simply scaling recurrent model size or altering training dynamics does not close the gap to the same degree as aligning compression via distillation. We will revise the abstract to explicitly reference these diagnostic experiments. revision: partial

  2. Referee: [Abstract] The claim of 'substantially narrowing the performance gap' is load-bearing for the central contribution, yet the provided manuscript supports it with no quantitative results, datasets, baselines, or experimental protocol.

    Authors: The manuscript contains the supporting quantitative results, including specific datasets (e.g., map-free pose estimation benchmarks), baselines (full-history transformers and prior recurrent models), and metrics in Section 5 and the associated tables/figures. The abstract summarizes these findings as is conventional. We will revise the abstract to include explicit references to the relevant experimental sections, tables, and quantitative improvements for improved clarity. revision: yes

Circularity Check

0 steps flagged

No significant circularity; derivation is self-contained external distillation

full rationale

The paper's central claim and method rest on a standard teacher-student distillation setup: a full-history transformer produces an explicit fixed-size bottleneck representation at each step, which is then used as direct supervision for the recurrent student's memory state. This procedure is defined externally to the student's architecture and does not reduce any prediction or uniqueness result to a fitted parameter or self-citation by construction. No equations, self-citations, or ansatzes in the provided abstract reduce the performance-gap argument to the inputs; the linear-time property follows directly from the recurrent design, and the alignment claim is a training objective rather than a tautology. The argument is therefore independent of its own outputs.

Axiom & Free-Parameter Ledger

0 free parameters · 1 axiom · 0 invented entities

Only abstract available; no explicit free parameters, axioms, or invented entities are described. The central claim rests on the unstated assumption that memory supervision via bottleneck is sufficient to transfer compression behavior.

axioms (1)
  • domain assumption: The performance gap is due to learning differences in compression rather than architecture.
    Invoked to justify why distillation should close the gap.

pith-pipeline@v0.9.1-grok · 5746 in / 1176 out tokens · 17983 ms · 2026-06-26T14:11:22.085882+00:00 · methodology


Reference graph

Works this paper leans on

58 extracted references · 7 linked inside Pith

  1. [1]

    The markovian thinker: Architecture-agnostic linear scaling of reasoning

    Milad Aghajohari, Kamran Chitsaz, Amirhossein Kazemnejad, Sarath Chandar, Alessandro Sordoni, Aaron Courville, and Siva Reddy. The markovian thinker: Architecture-agnostic linear scaling of reasoning. In International Conference on Learning Representations (ICLR), 2026

  2. [2]

    Flamingo: a visual language model for few-shot learning

    Jean-Baptiste Alayrac, Jeff Donahue, Pauline Luc, Antoine Miech, Iain Barr, Yana Hasson, Karel Lenc, Arthur Mensch, Katherine Millican, Malcolm Reynolds, Roman Ring, Eliza Rutherford, Serkan Cabi, Tengda Han, Zhitao Gong, Sina Samangooei, Marianne Monteiro, Jacob L Menick, Sebastian Borgeaud, Andy Brock, Aida Nematzadeh, Sahand Sharifzadeh, Mikołaj Bi´nko...

  3. [3]

    S-MUSt3R: Sliding multi-view 3d reconstruction

    Leonid Antsfeld, Boris Chidlovskii, Yohann Cabon, Vincent Leroy, and Jerome Revaud. S-MUSt3R: Sliding multi-view 3d reconstruction. arXiv preprint arXiv:2602.04517, 2026

  4. [4]

    A scene is worth a thousand features: Feed-forward camera localization from a collection of image features

    Axel Barroso-Laguna, Tommaso Cavallari, Victor Adrian Prisacariu, and Eric Brachmann. A scene is worth a thousand features: Feed-forward camera localization from a collection of image features. In International Conference on Learning Representations (ICLR), 2026

  5. [5]

    xLSTM: Extended long short-term memory

    Maximilian Beck, Korbinian Pöppel, Markus Spanring, Andreas Auer, Oleksandra Prudnikova, Michael Kopp, Günter Klambauer, Johannes Brandstetter, and Sepp Hochreiter. xLSTM: Extended long short-term memory. In Conference on Neural Information Processing Systems (NeurIPS), 2024

  6. [6]

    Predictability, Complexity, and Learning

    William Bialek, Ilya Nemenman, and Naftali Tishby. Predictability, Complexity, and Learning. Neural Computation, 2001

  7. [7]

    Transformers to ssms: Distilling quadratic knowledge to subquadratic models

    Aviv Bick, Kevin Y. Li, Eric P. Xing, J. Zico Kolter, and Albert Gu. Transformers to ssms: Distilling quadratic knowledge to subquadratic models. In Conference on Neural Information Processing Systems (NeurIPS), 2025

  8. [8]

    Kevin Black, Noah Brown, James Darpinian, Karan Dhabalia, Danny Driess, Adnan Esmail, Michael Equi, Chelsea Finn, Niccolo Fusai, Manuel Y. Galliker, Dibya Ghosh, Lachy Groom, Karol Hausman, Brian Ichter, Szymon Jakubczak, Tim Jones, Liyiming Ke, Devin LeBlanc, Sergey Levine, Adrian Li-Bell, Mohith Mothukuri, Suraj Nair, Karl Pertsch, Allen Z. Ren, Lucy X...

  9. [9]

    Tom B. Brown, Benjamin Mann, Nick Ryder, Melanie Subbiah, Jared Kaplan, Prafulla Dhariwal, Arvind Neelakantan, Pranav Shyam, Girish Sastry, Amanda Askell, Sandhini Agarwal, Ariel Herbert-Voss, Gretchen Krueger, Tom Henighan, Rewon Child, Aditya Ramesh, Daniel M. Ziegler, Jeffrey Wu, Clemens Winter, Christopher Hesse, Mark Chen, Eric Sigler, Mateusz Litwi...

  10. [10]

    MUSt3R: Multi-view network for stereo 3d reconstruction

    Yohann Cabon, Lucas Stoffl, Leonid Antsfeld, Gabriela Csurka, Boris Chidlovskii, Jerome Revaud, and Vincent Leroy. MUSt3R: Multi-view network for stereo 3d reconstruction. In IEEE Conference on Computer Vision and Pattern Recognition (CVPR), 2025

  11. [11]

    Ttt3r: 3d reconstruction as test-time training

    Xingyu Chen, Yue Chen, Yuliang Xiu, Andreas Geiger, and Anpei Chen. Ttt3r: 3d reconstruction as test-time training. In International Conference on Learning Representations (ICLR), 2026

  12. [12]

    Learning phrase representations using RNN encoder–decoder for statistical machine translation

    Kyunghyun Cho, Bart van Merriënboer, Caglar Gulcehre, Dzmitry Bahdanau, Fethi Bougares, Holger Schwenk, and Yoshua Bengio. Learning phrase representations using RNN encoder–decoder for statistical machine translation. In Conference on Empirical Methods in Natural Language Processing (EMNLP), 2014

  13. [13]

    FlashAttention: Fast and memory-efficient exact attention with io-awareness

    Tri Dao, Dan Fu, Stefano Ermon, Atri Rudra, and Christopher Ré. FlashAttention: Fast and memory-efficient exact attention with io-awareness. In Conference on Neural Information Processing Systems (NeurIPS), 2022

  14. [14]

    VGGT-Long: Chunk it, loop it, align it – pushing vggt’s limits on kilometer-scale long RGB sequences

    Kai Deng, Zexin Ti, Jiawei Xu, Jian Yang, and Jin Xie. VGGT-Long: Chunk it, loop it, align it – pushing vggt’s limits on kilometer-scale long RGB sequences. arXiv preprint arXiv:2507.16443, 2025

  15. [15]

    On predictive information in rnns

    Zhe Dong, Deniz Oktay, Ben Poole, and Alexander A Alemi. On predictive information in rnns. arXiv preprint arXiv:1910.09578, 2019

  16. [16]

    An image is worth 16x16 words: Transformers for image recognition at scale

    Alexey Dosovitskiy, Lucas Beyer, Alexander Kolesnikov, Dirk Weissenborn, Xiaohua Zhai, Thomas Unterthiner, Mostafa Dehghani, Matthias Minderer, Georg Heigold, Sylvain Gelly, Jakob Uszkoreit, and Neil Houlsby. An image is worth 16x16 words: Transformers for image recognition at scale. In International Conference on Learning Representations (ICLR), 2021

  17. [17]

    Partially observable reinforcement learning with memory traces

    Onno Eberhard, Michael Muehlebach, and Claire Vernade. Partially observable reinforcement learning with memory traces. arXiv preprint arXiv:2503.15200, 2025

  18. [18]

    Shenyuan Gao, William Liang, Kaiyuan Zheng, Ayaan Malik, Seonghyeon Ye, Sihyun Yu, Wei-Cheng Tseng, Yuzhu Dong, Kaichun Mo, Chen-Hsuan Lin, Qianli Ma, Seungjun Nah, Loic Magne, Jiannan Xiang, Yuqi Xie, Ruijie Zheng, Dantong Niu, You Liang Tan, K. R. Zentner, George Kurian, Suneel Indupuru, Pooya Jannaty, Jinwei Gu, Jun Zhang, Jitendra Malik, Pieter Abbeel...

  19. [19]

    Neural turing machines

    Alex Graves, Greg Wayne, and Ivo Danihelka. Neural turing machines. arXiv preprint arXiv:1410.5401, 2014

  20. [20]

    Mamba: Linear-time sequence modeling with selective state spaces

    Albert Gu and Tri Dao. Mamba: Linear-time sequence modeling with selective state spaces. In First Conference on Language Modeling, 2024

  21. [21]

    Efficiently modeling long sequences with structured state spaces

    Albert Gu, Karan Goel, and Christopher Ré. Efficiently modeling long sequences with structured state spaces. In International Conference on Learning Representations (ICLR), 2022

  22. [22]

    Habitat sim2real. https://github.com/wgw101/habitat_sim2real

  23. [23]

    RADIOv2.5: Improved baselines for agglomerative vision foundation models

    Greg Heinrich, Mike Ranzinger, Hongxu Yin, Yao Lu, Jan Kautz, Andrew Tao, Bryan Catanzaro, and Pavlo Molchanov. RADIOv2.5: Improved baselines for agglomerative vision foundation models. In IEEE Conference on Computer Vision and Pattern Recognition (CVPR), 2025

  24. [24]

    Distilling the knowledge in a neural network

    Geoffrey Hinton, Oriol Vinyals, and Jeff Dean. Distilling the knowledge in a neural network. In NeurIPS Deep Learning Workshop, 2014

  25. [25]

    Long short-term memory

    Sepp Hochreiter and Jürgen Schmidhuber. Long short-term memory. Neural Computation, 1997

  26. [26]

    A formal framework for understanding length generalization in transformers

    Xinting Huang, Andy Yang, Satwik Bhattamishra, Yash Sarrof, Andreas Krebs, Hattie Zhou, Preetum Nakkiran, and Michael Hahn. A formal framework for understanding length generalization in transformers. In International Conference on Learning Representations (ICLR), 2025

  27. [27]

    Perceiver: General perception with iterative attention

    Andrew Jaegle, Felix Gimeno, Andy Brock, Oriol Vinyals, Andrew Zisserman, and Joao Carreira. Perceiver: General perception with iterative attention. In International Conference on Machine Learning (ICML), 2021

  28. [28]

    Eagle: Large-scale learning of turbulent fluid dynamics with mesh transformers

    Steeven Janny, Aurélien Beneteau, Nicolas Thome, Madiha Nadri, Julie Digne, and Christian Wolf. Eagle: Large-scale learning of turbulent fluid dynamics with mesh transformers. In International Conference on Learning Representations (ICLR), 2023

  29. [29]

    Zipmap: Linear-time stateful 3d reconstruction via test-time training

    Haian Jin, Rundi Wu, Tianyuan Zhang, Ruiqi Gao, Jonathan T. Barron, Noah Snavely, and Aleksander Holynski. Zipmap: Linear-time stateful 3d reconstruction via test-time training. In IEEE Conference on Computer Vision and Pattern Recognition (CVPR), 2026

  30. [30]

    Neural gpus learn algorithms

    Łukasz Kaiser and Ilya Sutskever. Neural gpus learn algorithms. In International Conference on Learning Representations (ICLR), 2016

  31. [31]

    Transformers are RNNs: Fast autoregressive transformers with linear attention

    Angelos Katharopoulos, Apoorv Vyas, Nikolaos Pappas, and François Fleuret. Transformers are RNNs: Fast autoregressive transformers with linear attention. In International Conference on Machine Learning (ICML), 2020

  32. [32]

    Grounding image matching in 3d with MASt3R

    Vincent Leroy, Yohann Cabon, and Jérôme Revaud. Grounding image matching in 3d with MASt3R. In European Conference on Computer Vision (ECCV), 2024

  33. [33]

    VGGT-SLAM 2.0: Real-time dense feed-forward scene reconstruction

    Dominic Maggio and Luca Carlone. VGGT-SLAM 2.0: Real-time dense feed-forward scene reconstruction. arXiv preprint arXiv:2601.19887, 2026

  34. [34]

    BPP: Long-context robot imitation learning by focusing on key history frames

    Max Sobol Mark, Jacky Liang, Maria Attarian, Chuyuan Fu, Debidatta Dwibedi, Dhruv Shah, and Aviral Kumar. BPP: Long-context robot imitation learning by focusing on key history frames. arXiv preprint arXiv:2602.15010, 2026

  35. [35]

    DINOv2: Learning robust visual features without supervision

    Maxime Oquab, Timothée Darcet, Théo Moutakanni, Huy Vo, Marc Szafraniec, Vasil Khalidov, Pierre Fernandez, Daniel Haziza, Francisco Massa, Alaaeldin El-Nouby, Mahmoud Assran, Nicolas Ballas, Wojciech Galuba, Russell Howes, Po-Yao Huang, Shang-Wen Li, Ishan Misra, Michael Rabbat, Vasu Sharma, Gabriel Synnaeve, Hu Xu, Hervé Jegou, Julien Mairal, Patrick La...

  36. [36]

    Resurrecting Recurrent Neural Networks for Long Sequences

    Antonio Orvieto, Samuel L Smith, Albert Gu, Anushan Fernando, Caglar Gulcehre, Razvan Pascanu, and Soham De. Resurrecting Recurrent Neural Networks for Long Sequences. In International Conference on Machine Learning (ICML), 2023

  37. [37]

    Viorica Pătrăucean, Xu Owen He, Joseph Heyward, Chuhan Zhang, Mehdi S. M. Sajjadi, George-Cristian Muraru, Artem Zholus, Mahdi Karami, Ross Goroshin, Yutian Chen, Simon Osindero, João Carreira, and Razvan Pascanu. TRecViT: A recurrent video transformer. Transactions on Machine Learning Research (TMLR), 2026

  38. [38]

    Robust speech recognition via large-scale weak supervision

    Alec Radford, Jong Wook Kim, Tao Xu, Greg Brockman, Christine Mcleavey, and Ilya Sutskever. Robust speech recognition via large-scale weak supervision. In International Conference on Machine Learning (ICML), 2023

  39. [39]

    Habitat-Matterport 3D dataset (HM3D): 1000 large-scale 3D environments for embodied AI

    Santhosh Kumar Ramakrishnan, Aaron Gokaslan, Erik Wijmans, Oleksandr Maksymets, Alexander Clegg, John M Turner, Eric Undersander, Wojciech Galuba, Andrew Westbury, Angel X Chang, Manolis Savva, Yili Zhao, and Dhruv Batra. Habitat-Matterport 3D dataset (HM3D): 1000 large-scale 3D environments for embodied AI. In NeurIPS Datasets and Benchmarks Track, 2021

  40. [40]

    Token turing machines

    Michael S. Ryoo, Keerthana Gopalakrishnan, Kumara Kahatapitiya, Ted Xiao, Kanishka Rao, Austin Stone, Yao Lu, Julian Ibarz, and Anurag Arnab. Token turing machines. In IEEE Conference on Computer Vision and Pattern Recognition (CVPR), 2023

  41. [41]

    Kinaema: a recurrent sequence model for memory and pose in motion

    Mert Bulent Sariyildiz, Philippe Weinzaepfel, Guillaume Bono, Gianluca Monaci, and Christian Wolf. Kinaema: a recurrent sequence model for memory and pose in motion. In Conference on Neural Information Processing Systems (NeurIPS), 2025

  42. [42]

    DUNE: Distilling a universal encoder from heterogeneous 2D and 3D teachers

    Mert Bulent Sariyildiz, Philippe Weinzaepfel, Thomas Lucas, Pau de Jorge, Diane Larlus, and Yannis Kalantidis. DUNE: Distilling a universal encoder from heterogeneous 2D and 3D teachers. In IEEE Conference on Computer Vision and Pattern Recognition (CVPR), 2025

  43. [43]

    SmolVLA: A vision-language-action model for affordable and efficient robotics

    Mustafa Shukor, Dana Aubakirova, Francesco Capuano, Pepijn Kooijmans, Steven Palma, Adil Zouitine, Michel Aractingi, Caroline Pascal, Martino Russi, Andres Marafioti, Simon Alibert, Matthieu Cord, Thomas Wolf, and Remi Cadene. SmolVLA: A vision-language-action model for affordable and efficient robotics. arXiv preprint arXiv:2506.01844, 2025

  44. [44]

    On the Computational Power of Neural Nets

    Hava T. Siegelmann and Eduardo D. Sontag. On the Computational Power of Neural Nets. Journal of Computer and System Sciences, 1995

  45. [45]

    End-to-end memory networks

    Sainbayar Sukhbaatar, Arthur Szlam, Jason Weston, and Rob Fergus. End-to-end memory networks. In Conference on Neural Information Processing Systems (NeurIPS), 2015

  46. [46]

    Recurrent-depth vla: Implicit test-time compute scaling of vision-language-action models via latent iterative reasoning

    Yalcin Tur, Jalal Naghiyev, Haoquan Fang, Wei-Chuan Tsai, Jiafei Duan, Dieter Fox, and Ranjay Krishna. Recurrent-depth vla: Implicit test-time compute scaling of vision-language-action models via latent iterative reasoning. arXiv preprint arXiv:2602.07845, 2026

  47. [47]

    Approximating stacked and bidirectional recurrent architectures with the delayed recurrent neural network

    Javier S. Turek, Shailee Jain, Vy Vo, Mihai Capota, Alexander G. Huth, and Theodore L. Willke. Approximating stacked and bidirectional recurrent architectures with the delayed recurrent neural network. In International Conference on Machine Learning (ICML), 2020

  48. [48]

    Moving off-the-grid: Scene-grounded video representations

    Sjoerd van Steenkiste, Daniel Zoran, Yi Yang, Yulia Rubanova, Rishabh Kabra, Carl Doersch, Dilara Gokay, Joseph Heyward, Etienne Pot, Klaus Greff, Drew A. Hudson, Thomas Albert Keck, Joao Carreira, Alexey Dosovitskiy, Mehdi S. M. Sajjadi, and Thomas Kipf. Moving off-the-grid: Scene-grounded video representations. In Conference on Neural Information Process...

  49. [49]

    Attention is all you need

    Ashish Vaswani, Noam Shazeer, Niki Parmar, Jakob Uszkoreit, Llion Jones, Aidan N Gomez, Łukasz Kaiser, and Illia Polosukhin. Attention is all you need. In Conference on Neural Information Processing Systems (NeurIPS), 2017

  50. [50]

    VGGT: Visual geometry grounded transformer

    Jianyuan Wang, Minghao Chen, Nikita Karaev, Andrea Vedaldi, Christian Rupprecht, and David Novotny. VGGT: Visual geometry grounded transformer. In IEEE Conference on Computer Vision and Pattern Recognition (CVPR), 2025

  51. [51]

    Data efficient any transformer-to-mamba distillation via attention bridge

    Penghao Wang, Yuhao Zhou, Mengxuan Wu, Panpan Zhang, Zhangyang Wang, and Kai Wang. Data efficient any transformer-to-mamba distillation via attention bridge. arXiv preprint arXiv:2510.19266, 2025

  52. [52]

    Continuous 3d perception model with persistent state

    Qianqian Wang, Yifei Zhang, Aleksander Holynski, Alexei A. Efros, and Angjoo Kanazawa. Continuous 3d perception model with persistent state. In IEEE Conference on Computer Vision and Pattern Recognition (CVPR), 2025

  53. [53]

    DUSt3R: Geometric 3D vision made easy

    Shuzhe Wang, Vincent Leroy, Yohann Cabon, Boris Chidlovskii, and Jerome Revaud. DUSt3R: Geometric 3D vision made easy. In IEEE Conference on Computer Vision and Pattern Recognition (CVPR), 2024

  54. [54]

    Gibson env: Real-world perception for embodied agents

    Fei Xia, Amir R Zamir, Zhiyang He, Alexander Sax, Jitendra Malik, and Silvio Savarese. Gibson env: Real-world perception for embodied agents. In IEEE Conference on Computer Vision and Pattern Recognition (CVPR), 2018

  55. [55]

    Scal3r: Scalable test-time training for large-scale 3d reconstruction

    Tao Xie, Peishan Yang, Yudong Jin, Yingfeng Cai, Wei Yin, Weiqiang Ren, Qian Zhang, Wei Hua, Sida Peng, Xiaoyang Guo, and Xiaowei Zhou. Scal3r: Scalable test-time training for large-scale 3d reconstruction. In arXiv:2604.08542, 2026

  56. [56]

    AMR-Transformer: Enabling efficient long-range interaction for complex neural fluid simulation

    Zeyi Xu, Jinfan Liu, Kuangxu Chen, Ye Chen, Zhangli Hu, and Bingbing Ni. AMR-Transformer: Enabling efficient long-range interaction for complex neural fluid simulation. In IEEE Conference on Computer Vision and Pattern Recognition (CVPR), 2025

  57. [57]

    Loger: Long-context geometric reconstruction with hybrid memory

    Junyi Zhang, Charles Herrmann, Junhwa Hur, Chen Sun, Ming-Hsuan Yang, Forrester Cole, Trevor Darrell, and Deqing Sun. Loger: Long-context geometric reconstruction with hybrid memory. In arXiv:2603.03269, 2026

  58. [58]

    Length extrapolation of transformers: A survey from the perspective of positional encoding

    Liang Zhao, Xiachong Feng, Xiaocheng Feng, Weihong Zhong, Dongliang Xu, Qing Yang, Hongtao Liu, Bing Qin, and Ting Liu. Length extrapolation of transformers: A survey from the perspective of positional encoding. In Conference on Empirical Methods in Natural Language Processing (EMNLP), 2024