arxiv: 2509.26645 · v4 · submitted 2025-09-30 · 💻 cs.CV

Recognition: 1 theorem link

· Lean Theorem

TTT3R: 3D Reconstruction as Test-Time Training

Xingyu Chen , Yue Chen , Yuliang Xiu , Andreas Geiger , Anpei Chen

Authors on Pith no claims yet

Pith reviewed 2026-05-17 06:35 UTC · model grok-4.3

classification 💻 cs.CV

keywords 3D reconstructiontest-time trainingrecurrent neural networkslength generalizationonline learningpose estimationmemory update

0 comments

The pith

Framing 3D reconstruction as test-time training yields a closed-form learning rate from alignment confidence that doubles global pose accuracy on long sequences.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

Modern recurrent neural networks for 3D reconstruction lose accuracy when sequences exceed the training length. The authors treat the reconstruction process as an online learning problem at test time. Alignment confidence between the stored memory state and each new observation supplies the information needed to set a learning rate. The rate decides how strongly the memory should incorporate the new data while keeping prior information. The resulting update rule improves accuracy on extended sequences without any retraining or extra parameters.

Core claim

By viewing recurrent 3D reconstruction as an online learning problem, a closed-form learning rate derived directly from the alignment confidence between the memory state and incoming observations balances retention of historical information against adaptation to new observations, delivering substantially better length generalization.

What carries the argument

Alignment confidence between the memory state and incoming observations, used to compute a closed-form learning rate that controls memory updates in recurrent models.

Load-bearing premise

That alignment confidence between the memory state and new observations can be computed reliably enough to produce a stable learning rate that correctly balances history retention with adaptation and avoids instability.

What would settle it

Applying the method to long image sequences and measuring no gain, or a loss, in global pose estimation accuracy relative to the baseline would falsify the central claim.

read the original abstract

Modern Recurrent Neural Networks have become a competitive architecture for 3D reconstruction due to their linear-time complexity. However, their performance degrades significantly when applied beyond the training context length, revealing limited length generalization. In this work, we revisit the 3D reconstruction foundation models from a Test-Time Training perspective, framing their designs as an online learning problem. Building on this perspective, we leverage the alignment confidence between the memory state and incoming observations to derive a closed-form learning rate for memory updates, to balance between retaining historical information and adapting to new observations. This training-free intervention, termed TTT3R, substantially improves length generalization, achieving a $2\times$ improvement in global pose estimation over baselines, while operating at 20 FPS with just 6 GB of GPU memory to process thousands of images. Code is available in https://rover-xingyu.github.io/TTT3R

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Referee Report

2 major / 1 minor

Summary. The manuscript proposes TTT3R, a training-free test-time training intervention for recurrent 3D reconstruction models. It reframes the recurrent memory update as an online learning problem and derives a closed-form learning rate from the alignment confidence between the current memory state and incoming observations, with the goal of improving length generalization beyond the training context length while preserving efficiency.

Significance. If the closed-form derivation is correct and the resulting updates remain stable, the approach would be a notable contribution to long-sequence 3D reconstruction, offering a lightweight way to extend RNN-based models without retraining. The reported efficiency (20 FPS, 6 GB for thousands of images) and the 2× gain in global pose estimation are practically relevant strengths.

major comments (2)

[Method section (derivation of learning rate)] The central claim rests on deriving a closed-form learning rate directly from alignment confidence, yet the manuscript supplies no derivation steps, explicit formula, or stability analysis (e.g., bounds ensuring the rate stays in (0,1) or that the update remains contractive). This is load-bearing for the length-generalization result and directly engages the skeptic concern about drift accumulation over long sequences.
[Experiments and Results] Quantitative claims of a 2× improvement in global pose estimation lack supporting details on baselines, exact evaluation protocol, sequence lengths, or error analysis (e.g., drift metrics over thousands of frames). Without these, it is impossible to verify whether the gains are robust or sensitive to the alignment-confidence definition.

minor comments (1)

[Abstract] The abstract states that code is available at the given URL; a direct repository link or DOI would improve reproducibility.

Simulated Author's Rebuttal

2 responses · 0 unresolved

We sincerely thank the referee for the thorough review and valuable feedback on our manuscript. We have carefully considered each comment and revised the paper to address the concerns raised, particularly by expanding the method derivation and providing more experimental details. Below we respond point by point to the major comments.

read point-by-point responses

Referee: [Method section (derivation of learning rate)] The central claim rests on deriving a closed-form learning rate directly from alignment confidence, yet the manuscript supplies no derivation steps, explicit formula, or stability analysis (e.g., bounds ensuring the rate stays in (0,1) or that the update remains contractive). This is load-bearing for the length-generalization result and directly engages the skeptic concern about drift accumulation over long sequences.

Authors: We appreciate the referee highlighting this critical aspect. We acknowledge that the manuscript would benefit from more explicit derivation steps, the closed-form formula, and a stability analysis. In the revised version, we have added a dedicated subsection in the Method section that provides the full derivation from the alignment confidence metric to the closed-form learning rate, along with a stability analysis establishing bounds that keep the rate in (0,1) and ensure the memory update remains contractive. This directly addresses potential drift accumulation over long sequences and strengthens the length-generalization claims. revision: yes
Referee: [Experiments and Results] Quantitative claims of a 2× improvement in global pose estimation lack supporting details on baselines, exact evaluation protocol, sequence lengths, or error analysis (e.g., drift metrics over thousands of frames). Without these, it is impossible to verify whether the gains are robust or sensitive to the alignment-confidence definition.

Authors: We thank the referee for this observation. While some protocol details appeared in the supplementary material, we agree they should be more prominent in the main text. In the revision, we have expanded the Experiments and Results section to explicitly describe the baselines, the evaluation protocol for global pose estimation, the tested sequence lengths (including thousands of frames), and additional error analysis with drift metrics. We have also added a sensitivity study with respect to the alignment-confidence definition to demonstrate robustness of the reported 2× gains. revision: yes

Circularity Check

0 steps flagged

No circularity: closed-form learning rate derived from observable alignment confidence

full rationale

The paper frames 3D reconstruction as an online learning problem and derives the learning rate directly from alignment confidence between memory state and new observations. This is presented as a training-free computation from an observable quantity rather than a fit, self-definition, or self-citation chain. No equations or steps in the abstract or context reduce the claimed result to its inputs by construction, and the method operates independently at test time without relying on fitted parameters from target metrics.

Axiom & Free-Parameter Ledger

0 free parameters · 1 axioms · 0 invented entities

Review is based solely on the abstract; no explicit free parameters, invented entities, or detailed axioms are stated. The central claim rests on the unelaborated assumption that alignment confidence yields a useful closed-form update rule.

axioms (1)

domain assumption Alignment confidence between memory state and incoming observations can be computed and directly converted into an optimal learning rate for memory updates.
This premise underpins the closed-form derivation and is invoked when the paper describes balancing historical information with new observations.

pith-pipeline@v0.9.0 · 5457 in / 1293 out tokens · 57550 ms · 2026-05-17T06:35:59.893810+00:00 · methodology

discussion (0)

Forward citations

Cited by 18 Pith papers

Reviewed papers in the Pith corpus that reference this work. Sorted by Pith novelty score.

Ray-Aware Pointer Memory with Adaptive Updates for Streaming 3D Reconstruction
cs.CV 2026-05 unverdicted novelty 7.0

Ray-aware pointers that track both location and viewing direction enable adaptive retain-or-replace memory updates for more stable streaming 3D reconstruction.
GlobalSplat: Efficient Feed-Forward 3D Gaussian Splatting via Global Scene Tokens
cs.CV 2026-04 unverdicted novelty 7.0

GlobalSplat achieves competitive novel-view synthesis on RealEstate10K and ACID using only 16K Gaussians via global scene tokens and coarse-to-fine training, with a 4MB footprint and under 78ms inference.
TokenGS: Decoupling 3D Gaussian Prediction from Pixels with Learnable Tokens
cs.CV 2026-04 unverdicted novelty 7.0

TokenGS uses learnable Gaussian tokens in an encoder-decoder architecture to regress 3D means directly, achieving SOTA feed-forward reconstruction on static and dynamic scenes with better robustness.
RobotPan: A 360$^\circ$ Surround-View Robotic Vision System for Embodied Perception
cs.RO 2026-04 unverdicted novelty 7.0

RobotPan predicts metric-scaled compact 3D Gaussians from calibrated multi-view inputs via spherical coordinates and hierarchical voxel priors for real-time 360° robotic perception and reconstruction.
AnyImageNav: Any-View Geometry for Precise Last-Meter Image-Goal Navigation
cs.RO 2026-04 unverdicted novelty 7.0

AnyImageNav uses a semantic-to-geometric cascade with 3D multi-view foundation models to recover precise 6-DoF poses from goal images, achieving 0.27m position error and state-of-the-art success rates on Gibson and HM...
Learning 3D Reconstruction with Priors in Test Time
cs.CV 2026-04 unverdicted novelty 7.0

Test-time constrained optimization incorporates priors into pre-trained multiview transformers via self-supervised losses and penalty terms to improve 3D reconstruction accuracy.
FrameVGGT: Geometry-Aligned Frame-Level Memory for Bounded Streaming VGGT
cs.CV 2026-03 unverdicted novelty 7.0

FrameVGGT replaces token-level KV retention with frame-level segments and prototypes to bound memory while preserving geometric coherence in streaming VGGT.
ZipMap: Linear-Time Stateful 3D Reconstruction via Test-Time Training
cs.CV 2026-03 unverdicted novelty 7.0

ZipMap achieves linear-time bidirectional 3D reconstruction by zipping image collections into a compact stateful representation via test-time training layers.
MoonSeg3R: Monocular Online Zero-Shot Segment Anything in 3D with Reconstructive Foundation Priors
cs.CV 2025-12 unverdicted novelty 7.0

MoonSeg3R is the first method for online monocular 3D instance segmentation, achieving performance competitive with RGB-D systems by using CUT3R priors for geometric consistency and temporal query memory.
Attention Itself Could Retrieve.RetrieveVGGT: Training-Free Long Context Streaming 3D Reconstruction via Query-Key Similarity Retrieval
cs.CV 2026-05 unverdicted novelty 6.0

RetrieveVGGT enables constant-memory long-context streaming 3D reconstruction by retrieving relevant frames via query-key similarities in VGGT's first attention layer, outperforming StreamVGGT and others.
Spark3R: Asymmetric Token Reduction Makes Fast Feed-Forward 3D Reconstruction
cs.CV 2026-05 unverdicted novelty 6.0

Asymmetric token reduction, with distinct merging for queries and pruning for key-values plus layer-wise adaptation, delivers up to 28x speedup on 1000-frame 3D reconstruction inputs while preserving competitive quality.
Ray-Aware Pointer Memory with Adaptive Updates for Streaming 3D Reconstruction
cs.CV 2026-05 unverdicted novelty 6.0

Ray-aware pointer memory with adaptive retain-or-replace updates enhances stability and accuracy in streaming 3D reconstruction.
Linearizing Vision Transformer with Test-Time Training
cs.CV 2026-05 unverdicted novelty 6.0

Using Test-Time Training's structural match to Softmax attention plus key normalization and locality modules allows inheriting pretrained weights and fine-tuning Stable Diffusion 3.5 in one hour to match quality while...
Feed-Forward 3D Scene Modeling: A Problem-Driven Perspective
cs.CV 2026-04 unverdicted novelty 6.0

The paper proposes a problem-driven taxonomy for feed-forward 3D scene modeling that groups methods by five core challenges: feature enhancement, geometry awareness, model efficiency, augmentation strategies, and temp...
Scal3R: Scalable Test-Time Training for Large-Scale 3D Reconstruction
cs.CV 2026-04 unverdicted novelty 6.0

Scal3R achieves better accuracy and consistency in large-scale 3D scene reconstruction by maintaining a compressed global context through test-time adaptation of lightweight neural networks on long video sequences.
OVGGT: O(1) Constant-Cost Streaming Visual Geometry Transformer
cs.CV 2026-03 conditional novelty 6.0

OVGGT achieves constant O(1) memory and compute for streaming 3D geometry reconstruction by using FFN-residual-based KV cache compression and dynamic anchor protection, matching state-of-the-art accuracy on long sequences.
ViT$^3$: Unlocking Test-Time Training in Vision
cs.CV 2025-12 unverdicted novelty 6.0

ViT³ is a Test-Time Training vision model that achieves linear complexity, matches or exceeds other linear models like Mamba on classification, generation, detection and segmentation, and narrows the gap to standard v...
StreamCacheVGGT: Streaming Visual Geometry Transformers with Robust Scoring and Hybrid Cache Compression
cs.CV 2026-04 unverdicted novelty 5.0

StreamCacheVGGT improves streaming 3D geometry reconstruction accuracy and stability under fixed memory by using cross-layer token importance scoring and hybrid cache compression instead of pure eviction.

Reference graph

Works this paper leans on

112 extracted references · 112 canonical work pages · cited by 17 Pith papers · 18 internal anchors

[1]

Bundle adjustment in the large

Sameer Agarwal, Noah Snavely, Steven M Seitz, and Richard Szeliski. Bundle adjustment in the large. InECCV, 2010. 3

work page 2010
[2]

Building rome in a day.ACM Communications, 2011

Sameer Agarwal, Yasutaka Furukawa, Noah Snavely, Ian Simon, Brian Curless, Steven M Seitz, and Richard Szeliski. Building rome in a day.ACM Communications, 2011. 3

work page 2011
[3]

Cross-view completion models are zero-shot correspondence estimators

Honggyu An, Jinhyeon Kim, Seonghoon Park, Jaewoo Jung, Jisang Han, Sunghwan Hong, and Seungryong Kim. Cross-view completion models are zero-shot correspondence estimators. arXiv, 2412.09072, 2024. 3

work page arXiv 2024
[4]

Speeded-up robust features (surf).Computer vision and image understanding, 2008

Herbert Bay, Andreas Ess, Tinne Tuytelaars, and Luc Van Gool. Speeded-up robust features (surf).Computer vision and image understanding, 2008. 3

work page 2008
[5]

xlstm: Extended long short-term memory

Maximilian Beck, Korbinian P ¨oppel, Markus Spanring, Andreas Auer, Oleksandra Prud- nikova, Michael Kopp, G¨unter Klambauer, Johannes Brandstetter, and Sepp Hochreiter. xlstm: Extended long short-term memory. InNeurIPS, 2024. 2, 4

work page 2024
[6]

Titans: Learning to Memorize at Test Time

Ali Behrouz, Peilin Zhong, and Vahab Mirrokni. Titans: Learning to memorize at test time. arXiv preprint arXiv:2501.00663, 2024. 2, 4, 10

work page internal anchor Pith review Pith/arXiv arXiv 2024
[7]

It’s all connected: A journey through test-time memorization, attentional bias, retention, and online optimization

Ali Behrouz, Meisam Razaviyayn, Peilin Zhong, and Vahab Mirrokni. It’s all connected: A journey through test-time memorization, attentional bias, retention, and online optimization. arXiv preprint arXiv:2504.13173, 2025. 2, 4, 10

work page arXiv 2025
[8]

Decimamba: Exploring the length extrapolation potential of mamba

Assaf Ben-Kish, Itamar Zimerman, Shady Abu-Hussein, Nadav Cohen, Amir Globerson, Lior Wolf, and Raja Giryes. Decimamba: Exploring the length extrapolation potential of mamba. arXiv preprint arXiv:2406.14528, 2024. 2

work page arXiv 2024
[9]

Birth of a transformer: A memory viewpoint

Alberto Bietti, Vivien Cabannes, Diane Bouchacourt, Herve Jegou, and Leon Bottou. Birth of a transformer: A memory viewpoint. InNeurIPS, 2023. 2, 4

work page 2023
[10]

Local learning algorithms.Neural computation, 1992

L ´eon Bottou and Vladimir Vapnik. Local learning algorithms.Neural computation, 1992. 4

work page 1992
[11]

Dsac-differentiable ransac for camera localization

Eric Brachmann, Alexander Krull, Sebastian Nowozin, Jamie Shotton, Frank Michel, Stefan Gumhold, and Carsten Rother. Dsac-differentiable ransac for camera localization. InCVPR,

work page
[12]

Scene coordinate reconstruction: Posing of image collections via incremental learning of a relocalizer

Eric Brachmann, Jamie Wynn, Shuai Chen, Tommaso Cavallari, ´Aron Monszpart, Daniyar Turmukhambetov, and Victor Adrian Prisacariu. Scene coordinate reconstruction: Posing of image collections via incremental learning of a relocalizer. InECCV, 2024. 1, 3

work page 2024
[13]

A naturalistic open source movie for optical flow evaluation

Daniel J Butler, Jonas Wulff, Garrett B Stanley, and Michael J Black. A naturalistic open source movie for optical flow evaluation. InECCV, 2012. 22, 23, 24

work page 2012
[14]

Must3r: Multi-view network for stereo 3d reconstruction

Yohann Cabon, Lucas Stoffl, Leonid Antsfeld, Gabriela Csurka, Boris Chidlovskii, Jerome Revaud, and Vincent Leroy. Must3r: Multi-view network for stereo 3d reconstruction. In CVPR, 2025. 3, 5

work page 2025
[15]

Emerging properties in self-supervised vision transformers

Mathilde Caron, Hugo Touvron, Ishan Misra, Herv´e J´egou, Julien Mairal, Piotr Bojanowski, and Armand Joulin. Emerging properties in self-supervised vision transformers. InICCV,

work page
[16]

Easi3r: Estimating disentangled motion from dust3r without training.arXiv preprint arXiv:2503.24391, 2025

Xingyu Chen, Yue Chen, Yuliang Xiu, Andreas Geiger, and Anpei Chen. Easi3r: Estimating disentangled motion from dust3r without training.arXiv preprint arXiv:2503.24391, 2025. 3, 6, 18, 22, 24 11 Published as a conference paper at ICLR 2026

work page arXiv 2025
[17]

Stuffed mamba: Oversized states lead to the inability to forget.arXiv preprint arXiv:2410.07145,

Yingfa Chen, Xinrong Zhang, Shengding Hu, Xu Han, Zhiyuan Liu, and Maosong Sun. Stuffed mamba: Oversized states lead to the inability to forget.arXiv preprint arXiv:2410.07145,

work page arXiv
[18]

Feat2gs: Probing visual foundation models with gaussian splatting.arXiv, 2412.09606, 2024

Yue Chen, Xingyu Chen, Anpei Chen, Gerard Pons-Moll, and Yuliang Xiu. Feat2gs: Probing visual foundation models with gaussian splatting.arXiv, 2412.09606, 2024. 9

work page arXiv 2024
[19]

Long3r: Long sequence streaming 3d reconstruction.arXiv preprint arXiv:2507.18255, 2025

Zhuoguang Chen, Minghui Qin, Tianyuan Yuan, Zhe Liu, and Hang Zhao. Long3r: Long sequence streaming 3d reconstruction.arXiv preprint arXiv:2507.18255, 2025. 3, 5

work page arXiv 2025
[20]

Scannet: Richly-annotated 3d reconstructions of indoor scenes

Angela Dai, Angel X Chang, Manolis Savva, Maciej Halber, Thomas Funkhouser, and Matthias Nießner. Scannet: Richly-annotated 3d reconstructions of indoor scenes. InCVPR, 2017. 7, 8, 20, 22

work page 2017
[21]

Transformers are SSMs: Generalized Models and Efficient Algorithms Through Structured State Space Duality

Tri Dao and Albert Gu. Transformers are ssms: Generalized models and efficient algorithms through structured state space duality.arXiv preprint arXiv:2405.21060, 2024. 4, 6, 17

work page internal anchor Pith review Pith/arXiv arXiv 2024
[22]

Flashattention: Fast and memory-efficient exact attention with io-awareness

Tri Dao, Dan Fu, Stefano Ermon, Atri Rudra, and Christopher R´e. Flashattention: Fast and memory-efficient exact attention with io-awareness. InNeurIPS, 2022. 1

work page 2022
[23]

Monoslam: Real-time single camera slam.PAMI, 2007

Andrew J Davison, Ian D Reid, Nicholas D Molton, and Olivier Stasse. Monoslam: Real-time single camera slam.PAMI, 2007. 3

work page 2007
[24]

VGGT-Long: Chunk it, Loop it, Align it -- Pushing VGGT's Limits on Kilometer-scale Long RGB Sequences

Kai Deng, Zexin Ti, Jiawei Xu, Jian Yang, and Jin Xie. Vggt-long: Chunk it, loop it, align it– pushing vggt’s limits on kilometer-scale long rgb sequences.arXiv preprint arXiv:2507.16443,

work page internal anchor Pith review arXiv
[25]

Superpoint: Self-supervised interest point detection and description

Daniel DeTone, Tomasz Malisiewicz, and Andrew Rabinovich. Superpoint: Self-supervised interest point detection and description. InCVPRW, 2018. 3

work page 2018
[26]

Learning without training: The implicit dynamics of in-context learning.arXiv preprint arXiv:2507.16003, 2025

Benoit Dherin, Michael Munn, Hanna Mazzawi, Michael Wunder, and Javier Gonzalvo. Learning without training: The implicit dynamics of in-context learning.arXiv preprint arXiv:2507.16003, 2025. 4

work page arXiv 2025
[27]

Reloc3r: Large-scale training of relative camera pose regression for generalizable, fast, and accurate visual localization.arXiv, 2412.08376, 2024

Siyan Dong, Shuzhe Wang, Shaohui Liu, Lulu Cai, Qingnan Fan, Juho Kannala, and Yanchao Yang. Reloc3r: Large-scale training of relative camera pose regression for generalizable, fast, and accurate visual localization.arXiv, 2412.08376, 2024. 3

work page arXiv 2024
[28]

An Image is Worth 16x16 Words: Transformers for Image Recognition at Scale

Alexey Dosovitskiy, Lucas Beyer, Alexander Kolesnikov, Dirk Weissenborn, Xiaohua Zhai, Thomas Unterthiner, Mostafa Dehghani, Matthias Minderer, Georg Heigold, Sylvain Gelly, et al. An image is worth 16x16 words: Transformers for image recognition at scale.arXiv preprint arXiv:2010.11929, 2020. 3, 4

work page internal anchor Pith review Pith/arXiv arXiv 2010
[29]

Lsd-slam: Large-scale direct monocular slam

Jakob Engel, Thomas Sch¨ops, and Daniel Cremers. Lsd-slam: Large-scale direct monocular slam. InECCV, 2014. 3

work page 2014
[30]

Direct sparse odometry.PAMI, 2017

Jakob Engel, Vladlen Koltun, and Daniel Cremers. Direct sparse odometry.PAMI, 2017. 3

work page 2017
[31]

Model-agnostic meta-learning for fast adaptation of deep networks

Chelsea Finn, Pieter Abbeel, and Sergey Levine. Model-agnostic meta-learning for fast adaptation of deep networks. InICML, 2017. 4

work page 2017
[32]

Deep Think with Confidence

Yichao Fu, Xuewei Wang, Yuandong Tian, and Jiawei Zhao. Deep think with confidence. arXiv preprint arXiv:2508.15260, 2025. 7

work page internal anchor Pith review Pith/arXiv arXiv 2025
[33]

Vision meets robotics: The kitti dataset.The international journal of robotics research, 2013

Andreas Geiger, Philip Lenz, Christoph Stiller, and Raquel Urtasun. Vision meets robotics: The kitti dataset.The international journal of robotics research, 2013. 8, 9, 18, 19, 20, 23, 24

work page 2013
[34]

Mamba: Linear-Time Sequence Modeling with Selective State Spaces

Albert Gu and Tri Dao. Mamba: Linear-time sequence modeling with selective state spaces. arXiv preprint arXiv:2312.00752, 2023. 2, 4

work page internal anchor Pith review Pith/arXiv arXiv 2023
[35]

ViPE: Video Pose Engine for 3D Geometric Perception

Jiahui Huang, Qunjie Zhou, Hesam Rabeti, Aleksandr Korovko, Huan Ling, Xuanchi Ren, Tianchang Shen, Jun Gao, Dmitry Slepichev, Chen-Hsuan Lin, et al. Vipe: Video pose engine for 3d geometric perception.arXiv preprint arXiv:2508.10934, 2025. 3 12 Published as a conference paper at ICLR 2026

work page internal anchor Pith review Pith/arXiv arXiv 2025
[36]

Repeat after me: Transformers are better than state space models at copying.arXiv preprint arXiv:2402.01032,

Samy Jelassi, David Brandfonbrener, Sham M Kakade, and Eran Malach. Repeat after me: Transformers are better than state space models at copying.arXiv preprint arXiv:2402.01032,

work page arXiv
[37]

Transformers are rnns: Fast autoregressive transformers with linear attention

Angelos Katharopoulos, Apoorv Vyas, Nikolaos Pappas, and Franc ¸ois Fleuret. Transformers are rnns: Fast autoregressive transformers with linear attention. InICML, 2020. 2, 3

work page 2020
[38]

Overcoming catastrophic forgetting in neural networks.Proceedings of the national academy of sciences, 2017

James Kirkpatrick, Razvan Pascanu, Neil Rabinowitz, Joel Veness, Guillaume Desjardins, Andrei A Rusu, Kieran Milan, John Quan, Tiago Ramalho, Agnieszka Grabska-Barwinska, et al. Overcoming catastrophic forgetting in neural networks.Proceedings of the national academy of sciences, 2017. 2

work page 2017
[39]

Robust consistent video depth estimation

Johannes Kopf, Xuejian Rong, and Jia-Bin Huang. Robust consistent video depth estimation. InCVPR, 2021. 3, 22

work page 2021
[40]

Stream3r: Scalable sequential 3d reconstruction with causal transformer

Yushi Lan, Yihang Luo, Fangzhou Hong, Shangchen Zhou, Honghua Chen, Zhaoyang Lyu, Shuai Yang, Bo Dai, Chen Change Loy, and Xingang Pan. Stream3r: Scalable sequential 3d reconstruction with causal transformer. 2025. 22, 24

work page 2025
[41]

Epnp: An accurate o(n) solution to the pnp problem

Vincent Lepetit, Francesc Moreno-Noguer, and Pascal Fua. Epnp: An accurate o(n) solution to the pnp problem. InICCV, 2009. 5

work page 2009
[42]

Grounding image matching in 3d with MASt3R

Vincent Leroy, Yohann Cabon, and Jerome Revaud. Grounding image matching in 3d with MASt3R. InECCV, 2024. 22, 24

work page 2024
[43]

MegaSaM: accurate, fast, and robust structure and motion from casual dynamic videos

Zhengqi Li, Richard Tucker, Forrester Cole, Qianqian Wang, Linyi Jin, Vickie Ye, Angjoo Kanazawa, Aleksander Holynski, and Noah Snavely. MegaSaM: accurate, fast, and robust structure and motion from casual dynamic videos. InCVPR, 2025. 3

work page 2025
[44]

DeepSeek-V2: A Strong, Economical, and Efficient Mixture-of-Experts Language Model

Aixin Liu, Bei Feng, Bin Wang, Bingxuan Wang, Bo Liu, Chenggang Zhao, Chengqi Dengr, Chong Ruan, Damai Dai, Daya Guo, et al. Deepseek-v2: A strong, economical, and efficient mixture-of-experts language model.arXiv preprint arXiv:2405.04434, 2024. 1

work page internal anchor Pith review Pith/arXiv arXiv 2024
[45]

Longhorn: State space models are amortized online learners.arXiv preprint arXiv:2407.14207, 2024

Bo Liu, Rui Wang, Lemeng Wu, Yihao Feng, Peter Stone, and Qiang Liu. Longhorn: State space models are amortized online learners.arXiv preprint arXiv:2407.14207, 2024. 1, 2, 4

work page arXiv 2024
[46]

Alligat0r: Pre-training through co- visibility segmentation for relative camera pose regression.arXiv preprint arXiv:2503.07561,

Thibaut Loiseau, Guillaume Bourmaud, and Vincent Lepetit. Alligat0r: Pre-training through co- visibility segmentation for relative camera pose regression.arXiv preprint arXiv:2503.07561,

work page arXiv
[47]

Distinctive image features from scale-invariant keypoints.International journal of computer vision, 2004

David G Lowe. Distinctive image features from scale-invariant keypoints.International journal of computer vision, 2004. 3

work page 2004
[48]

Consistent video depth estimation.ACM Trans

Xuan Luo, Jia-Bin Huang, Richard Szeliski, Kevin Matzen, and Johannes Kopf. Consistent video depth estimation.ACM Trans. on Graphics, 2020. 3, 4

work page 2020
[49]

Vggt- slam: Dense rgb slam optimized on the sl (4) manifold.arXiv preprint arXiv:2505.12549, 2025

Dominic Maggio, Hyungtae Lim, and Luca Carlone. Vggt-slam: Dense rgb slam optimized on the sl (4) manifold.arXiv preprint arXiv:2505.12549, 2025. 3

work page arXiv 2025
[50]

Orb-slam: a versatile and accurate monocular slam system.IEEE transactions on robotics, 2015

Raul Mur-Artal, Jose Maria Martinez Montiel, and Juan D Tardos. Orb-slam: a versatile and accurate monocular slam system.IEEE transactions on robotics, 2015. 3

work page 2015
[51]

Mast3r-slam: Real-time dense slam with 3d reconstruction priors

Riku Murai, Eric Dexheimer, and Andrew J Davison. Mast3r-slam: Real-time dense slam with 3d reconstruction priors. InProceedings of the Computer Vision and Pattern Recognition Conference, pp. 16695–16705, 2025. 3

work page 2025
[52]

Dtam: Dense tracking and mapping in real-time

Richard A Newcombe, Steven J Lovegrove, and Andrew J Davison. Dtam: Dense tracking and mapping in real-time. InICCV, 2011. 3

work page 2011
[53]

DINOv2: Learning Robust Visual Features without Supervision

Maxime Oquab, Timoth´ee Darcet, Th´eo Moutakanni, Huy V o, Marc Szafraniec, Vasil Khalidov, Pierre Fernandez, Daniel Haziza, Francisco Massa, Alaaeldin El-Nouby, et al. Dinov2: Learning robust visual features without supervision.arXiv preprint arXiv:2304.07193, 2023. 4 13 Published as a conference paper at ICLR 2026

work page internal anchor Pith review Pith/arXiv arXiv 2023
[54]

Resurrecting recurrent neural networks for long sequences

Antonio Orvieto, Samuel L Smith, Albert Gu, Anushan Fernando, Caglar Gulcehre, Razvan Pascanu, and Soham De. Resurrecting recurrent neural networks for long sequences. InICML,

work page
[55]

Refusion: 3d reconstruction in dynamic environments for rgb-d cameras exploiting residuals

Emanuele Palazzolo, Jens Behley, Philipp Lottes, Philippe Giguere, and Cyrill Stachniss. Refusion: 3d reconstruction in dynamic environments for rgb-d cameras exploiting residuals. InIROS, 2019. 8, 9, 20, 23, 24

work page 2019
[56]

The weiszfeld algorithm: proof, amendments, and extensions.F oundations of location analysis, 2011

Frank Plastria. The weiszfeld algorithm: proof, amendments, and extensions.F oundations of location analysis, 2011. 5

work page 2011
[57]

Self-calibration and metric reconstruction inspite of varying and unknown intrinsic camera parameters.International Journal of Computer Vision, 1999

Marc Pollefeys, Reinhard Koch, and Luc Van Gool. Self-calibration and metric reconstruction inspite of varying and unknown intrinsic camera parameters.International Journal of Computer Vision, 1999. 3

work page 1999
[58]

Visual modeling with a hand-held camera.International Journal of Computer Vision, 2004

Marc Pollefeys, Luc Van Gool, Maarten Vergauwen, Frank Verbiest, Kurt Cornelis, Jan Tops, and Reinhard Koch. Visual modeling with a hand-held camera.International Journal of Computer Vision, 2004. 3

work page 2004
[59]

Gated Attention for Large Language Models: Non-linearity, Sparsity, and Attention-Sink-Free

Zihan Qiu, Zekun Wang, Bo Zheng, Zeyu Huang, Kaiyue Wen, Songlin Yang, Rui Men, Le Yu, Fei Huang, Suozhi Huang, et al. Gated attention for large language models: Non-linearity, sparsity, and attention-sink-free.arXiv preprint arXiv:2505.06708, 2025. 7

work page internal anchor Pith review Pith/arXiv arXiv 2025
[60]

Hopfield Networks is All You Need

Hubert Ramsauer, Bernhard Sch¨afl, Johannes Lehner, Philipp Seidl, Michael Widrich, Thomas Adler, Lukas Gruber, Markus Holzleitner, Milena Pavlovi´c, Geir Kjetil Sandve, et al. Hopfield networks is all you need.arXiv preprint arXiv:2008.02217, 2020. 2, 4

work page internal anchor Pith review Pith/arXiv arXiv 2008
[61]

Vision transformers for dense predic- tion

Ren´e Ranftl, Alexey Bochkovskiy, and Vladlen Koltun. Vision transformers for dense predic- tion. InICCV, 2021. 4

work page 2021
[62]

Understanding and improving length generalization in recurrent models.arXiv preprint arXiv:2507.02782, 2025

Ricardo Buitrago Ruiz and Albert Gu. Understanding and improving length generalization in recurrent models.arXiv preprint arXiv:2507.02782, 2025. 2, 10, 20

work page arXiv 2025
[63]

Super- glue: Learning feature matching with graph neural networks

Paul-Edouard Sarlin, Daniel DeTone, Tomasz Malisiewicz, and Andrew Rabinovich. Super- glue: Learning feature matching with graph neural networks. InCVPR, 2020. 3

work page 2020
[64]

Linear transformers are secretly fast weight programmers

Imanol Schlag, Kazuki Irie, and J ¨urgen Schmidhuber. Linear transformers are secretly fast weight programmers. InICML, 2021. 2, 3, 4, 5, 6, 17

work page 2021
[65]

Learning to control fast-weight memories: An alternative to dynamic recurrent networks.Neural Computation, 1992

J¨urgen Schmidhuber. Learning to control fast-weight memories: An alternative to dynamic recurrent networks.Neural Computation, 1992. 2, 4, 5

work page 1992
[66]

Structure-from-motion revisited

Johannes Lutz Sch ¨onberger and Jan-Michael Frahm. Structure-from-motion revisited. In CVPR, 2016. 3

work page 2016
[67]

Real-time single image and video super-resolution using an efficient sub-pixel convolutional neural network

Wenzhe Shi, Jose Caballero, Ferenc Husz´ar, Johannes Totz, Andrew P Aitken, Rob Bishop, Daniel Rueckert, and Zehan Wang. Real-time single image and video super-resolution using an efficient sub-pixel convolutional neural network. InCVPR, 2016. 4

work page 2016
[68]

Scene coordinate regression forests for camera relocalization in RGB-D images

Jamie Shotton, Ben Glocker, Christopher Zach, Shahram Izadi, Antonio Criminisi, and Andrew Fitzgibbon. Scene coordinate regression forests for camera relocalization in RGB-D images. InCVPR, 2013. 3, 9

work page 2013
[69]

Photo tourism: exploring photo collec- tions in 3d

Noah Snavely, Steven M Seitz, and Richard Szeliski. Photo tourism: exploring photo collec- tions in 3d. InSIGGRAPH, 2006. 3

work page 2006
[70]

Modeling the world from internet photo collections.International journal of computer vision, 2008

Noah Snavely, Steven M Seitz, and Richard Szeliski. Modeling the world from internet photo collections.International journal of computer vision, 2008. 3

work page 2008
[71]

A benchmark for the evaluation of rgb-d slam systems

J¨urgen Sturm, Nikolas Engelhard, Felix Endres, Wolfram Burgard, and Daniel Cremers. A benchmark for the evaluation of rgb-d slam systems. In2012 IEEE/RSJ international conference on intelligent robots and systems, 2012. 7, 8, 18, 19, 20, 22

work page 2012
[72]

Test-time training with self-supervision for generalization under distribution shifts

Yu Sun, Xiaolong Wang, Zhuang Liu, John Miller, Alexei Efros, and Moritz Hardt. Test-time training with self-supervision for generalization under distribution shifts. InICML, 2020. 4 14 Published as a conference paper at ICLR 2026

work page 2020
[73]

Learning to (Learn at Test Time): RNNs with Expressive Hidden States

Yu Sun, Xinhao Li, Karan Dalal, Jiarui Xu, Arjun Vikram, Genghan Zhang, Yann Dubois, Xinlei Chen, Xiaolong Wang, Sanmi Koyejo, et al. Learning to (learn at test time): Rnns with expressive hidden states.arXiv preprint arXiv:2407.04620, 2024. 2, 4, 5, 6, 17

work page internal anchor Pith review Pith/arXiv arXiv 2024
[74]

Retentive Network: A Successor to Transformer for Large Language Models

Yutao Sun, Li Dong, Shaohan Huang, Shuming Ma, Yuqing Xia, Jilong Xue, Jianyong Wang, and Furu Wei. Retentive network: A successor to transformer for large language models.arXiv preprint arXiv:2307.08621, 2023. 2, 4, 6, 17

work page internal anchor Pith review Pith/arXiv arXiv 2023
[75]

PhD thesis, University of Toronto, 2013

Ilya Sutskever.Training recurrent neural networks. PhD thesis, University of Toronto, 2013. 2

work page 2013
[76]

Sequence to sequence learning with neural networks

Ilya Sutskever, Oriol Vinyals, and Quoc V Le. Sequence to sequence learning with neural networks. InNeurIPS, 2014. 1

work page 2014
[77]

Ba-net: Dense bundle ad- justment network.arXiv preprint arXiv:1806.04807, 2018

Chengzhou Tang and Ping Tan. Ba-net: Dense bundle adjustment network.arXiv preprint arXiv:1806.04807, 2018. 3

work page arXiv 2018
[78]

Aether: Geometric-aware unified world modeling.arXiv preprint arXiv:2503.18945, 2025

Aether Team, Haoyi Zhu, Yifan Wang, Jianjun Zhou, Wenzheng Chang, Yang Zhou, Zizun Li, Junyi Chen, Chunhua Shen, Jiangmiao Pang, et al. Aether: Geometric-aware unified world modeling.arXiv preprint arXiv:2503.18945, 2025. 22, 24

work page arXiv 2025
[79]

Droid-slam: Deep visual slam for monocular, stereo, and rgb-d cameras

Zachary Teed and Jia Deng. Droid-slam: Deep visual slam for monocular, stereo, and rgb-d cameras. InNeurIPS, 2021. 3

work page 2021
[80]

Bundle adjustment—a modern synthesis

Bill Triggs, Philip F McLauchlan, Richard I Hartley, and Andrew W Fitzgibbon. Bundle adjustment—a modern synthesis. InVision Algorithms: Theory and Practice: International Workshop on Vision Algorithms Corfu, Greece, 2000. 3

work page 2000

Showing first 80 references.