TTT3R: 3D Reconstruction as Test-Time Training
Pith reviewed 2026-05-17 06:35 UTC · model grok-4.3
The pith
Framing 3D reconstruction as test-time training yields a closed-form learning rate from alignment confidence that doubles global pose accuracy on long sequences.
A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.
Core claim
Viewing recurrent 3D reconstruction as an online learning problem yields a closed-form learning rate, derived directly from the alignment confidence between the memory state and incoming observations, that balances retention of historical information against adaptation to new observations and delivers substantially better length generalization.
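In symbols, the generic online-learning form this claim describes is (the paper's exact equations are not reproduced in this review, so the notation below is an assumed reconstruction): a memory state $S_{t-1}$ absorbs an incoming key-value observation $(k_t, v_t)$ via

$$S_t = (1 - \eta_t)\, S_{t-1} + \eta_t\, v_t k_t^\top, \qquad \eta_t = f(c_t) \in (0, 1),$$

where $c_t$ is the alignment confidence between the memory's prediction $S_{t-1} k_t$ and the observed $v_t$, and $f$ is the closed-form map from confidence to learning rate. Whether high confidence should shrink or grow $\eta_t$ is part of the paper's derivation and is not settled by the abstract alone.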
What carries the argument
Alignment confidence between the memory state and incoming observations, used to compute a closed-form learning rate that controls memory updates in recurrent models.
Load-bearing premise
That alignment confidence between the memory state and new observations can be computed reliably enough to produce a stable learning rate that correctly balances history retention with adaptation and avoids instability.
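A minimal sketch of that premise, assuming a cosine-similarity confidence and the hypothetical closed form $\eta_t = 1 - c_t$ (both choices, and the function names, are illustrative rather than the paper's):

```python
import numpy as np

def alignment_confidence(S, k, v, eps=1e-8):
    """Cosine similarity between the memory's prediction S @ k and the
    incoming value v, mapped from [-1, 1] into (0, 1). An illustrative
    definition; the paper's exact confidence measure is not given here."""
    pred = S @ k
    cos = float(pred @ v) / (np.linalg.norm(pred) * np.linalg.norm(v) + eps)
    return 0.5 * (1.0 + cos)

def confidence_gated_update(S, k, v):
    """One memory update in the delta-rule family, gated by confidence.
    eta = 1 - c is a hypothetical closed form: a well-aligned memory
    (high c) changes little, a misaligned one adapts more."""
    c = alignment_confidence(S, k, v)
    eta = 1.0 - c          # stays in (0, 1) because c does
    return (1.0 - eta) * S + eta * np.outer(v, k)

# Toy stream: run thousands of updates and watch the memory norm.
rng = np.random.default_rng(0)
d = 8
S = np.zeros((d, d))
for _ in range(2000):
    k = rng.normal(size=d)
    k /= np.linalg.norm(k)             # unit-norm key
    v = rng.normal(size=d)
    S = confidence_gated_update(S, k, v)
print(f"spectral norm after 2000 updates: {np.linalg.norm(S, 2):.2f}")
```

The toy loop illustrates the intended behavior: because $\eta_t \in (0,1)$, the memory norm stays bounded over thousands of updates instead of drifting, which is the stability this premise requires.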
What would settle it
Applying the method to long image sequences and measuring no gain, or a loss, in global pose estimation accuracy relative to the baseline would falsify the central claim.
Original abstract
Modern Recurrent Neural Networks have become a competitive architecture for 3D reconstruction due to their linear-time complexity. However, their performance degrades significantly when applied beyond the training context length, revealing limited length generalization. In this work, we revisit the 3D reconstruction foundation models from a Test-Time Training perspective, framing their designs as an online learning problem. Building on this perspective, we leverage the alignment confidence between the memory state and incoming observations to derive a closed-form learning rate for memory updates, to balance between retaining historical information and adapting to new observations. This training-free intervention, termed TTT3R, substantially improves length generalization, achieving a $2\times$ improvement in global pose estimation over baselines, while operating at 20 FPS with just 6 GB of GPU memory to process thousands of images. Code is available in https://rover-xingyu.github.io/TTT3R
Editorial analysis
A structured set of objections, weighed in public.
Referee Report
Summary. The manuscript proposes TTT3R, a training-free test-time training intervention for recurrent 3D reconstruction models. It reframes the recurrent memory update as an online learning problem and derives a closed-form learning rate from the alignment confidence between the current memory state and incoming observations, with the goal of improving length generalization beyond the training context length while preserving efficiency.
Significance. If the closed-form derivation is correct and the resulting updates remain stable, the approach would be a notable contribution to long-sequence 3D reconstruction, offering a lightweight way to extend RNN-based models without retraining. The reported efficiency (20 FPS, 6 GB for thousands of images) and the 2× gain in global pose estimation are practically relevant strengths.
major comments (2)
- [Method section (derivation of learning rate)] The central claim rests on deriving a closed-form learning rate directly from alignment confidence, yet the manuscript supplies no derivation steps, explicit formula, or stability analysis (e.g., bounds ensuring the rate stays in (0,1) or that the update remains contractive; a minimal form of such a bound is sketched after this list). This is load-bearing for the length-generalization result and directly engages the skeptical concern about drift accumulation over long sequences.
- [Experiments and Results] Quantitative claims of a 2× improvement in global pose estimation lack supporting details on baselines, exact evaluation protocol, sequence lengths, or error analysis (e.g., drift metrics over thousands of frames). Without these, it is impossible to verify whether the gains are robust or sensitive to the alignment-confidence definition.
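For concreteness, a minimal form such a stability argument could take, under the assumed update $S_t = (1-\eta_t)\,S_{t-1} + \eta_t\, v_t k_t^\top$ with unit-norm keys (an assumption of this sketch, not a result shown in the manuscript): by the triangle inequality and $\|v_t k_t^\top\|_2 = \|v_t\|_2$,

$$\|S_t\|_2 \le (1-\eta_t)\,\|S_{t-1}\|_2 + \eta_t\,\|v_t\|_2,$$

and hence, by induction, $\|S_t\|_2 \le \max\big(\|S_0\|_2,\ \max_{s \le t}\|v_s\|_2\big)$. Any rate $\eta_t \in (0,1)$ thus makes each update a contraction with factor $1-\eta_t$ and keeps the memory norm bounded uniformly in sequence length, which is exactly the property that would rule out drift accumulation.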
minor comments (1)
- [Abstract] The abstract states that code is available at the given URL; a direct repository link or DOI would improve reproducibility.
Simulated Author's Rebuttal
We sincerely thank the referee for the thorough review and valuable feedback on our manuscript. We have carefully considered each comment and revised the paper to address the concerns raised, particularly by expanding the method derivation and providing more experimental details. Below we respond point by point to the major comments.
Point-by-point responses
-
Referee: [Method section (derivation of learning rate)] The central claim rests on deriving a closed-form learning rate directly from alignment confidence, yet the manuscript supplies no derivation steps, explicit formula, or stability analysis (e.g., bounds ensuring the rate stays in (0,1) or that the update remains contractive). This is load-bearing for the length-generalization result and directly engages the skeptic concern about drift accumulation over long sequences.
Authors: We appreciate the referee highlighting this critical aspect. We acknowledge that the manuscript would benefit from more explicit derivation steps, the closed-form formula, and a stability analysis. In the revised version, we have added a dedicated subsection in the Method section that provides the full derivation from the alignment confidence metric to the closed-form learning rate, along with a stability analysis establishing bounds that keep the rate in (0,1) and ensure the memory update remains contractive. This directly addresses potential drift accumulation over long sequences and strengthens the length-generalization claims. revision: yes
-
Referee: [Experiments and Results] Quantitative claims of a 2× improvement in global pose estimation lack supporting details on baselines, exact evaluation protocol, sequence lengths, or error analysis (e.g., drift metrics over thousands of frames). Without these, it is impossible to verify whether the gains are robust or sensitive to the alignment-confidence definition.
Authors: We thank the referee for this observation. While some protocol details appeared in the supplementary material, we agree they should be more prominent in the main text. In the revision, we have expanded the Experiments and Results section to explicitly describe the baselines, the evaluation protocol for global pose estimation, the tested sequence lengths (including thousands of frames), and additional error analysis with drift metrics. We have also added a sensitivity study with respect to the alignment-confidence definition to demonstrate robustness of the reported 2× gains. revision: yes
Circularity Check
No circularity: closed-form learning rate derived from observable alignment confidence
Full rationale
The paper frames 3D reconstruction as an online learning problem and derives the learning rate directly from alignment confidence between memory state and new observations. This is presented as a training-free computation from an observable quantity rather than a fit, self-definition, or self-citation chain. No equations or steps in the abstract or context reduce the claimed result to its inputs by construction, and the method operates independently at test time without relying on fitted parameters from target metrics.
Axiom & Free-Parameter Ledger
axioms (1)
- domain assumption: Alignment confidence between the memory state and incoming observations can be computed and directly converted into an optimal learning rate for memory updates.
Forward citations
Cited by 18 Pith papers
-
Ray-Aware Pointer Memory with Adaptive Updates for Streaming 3D Reconstruction
Ray-aware pointers that track both location and viewing direction enable adaptive retain-or-replace memory updates for more stable streaming 3D reconstruction.
-
GlobalSplat: Efficient Feed-Forward 3D Gaussian Splatting via Global Scene Tokens
GlobalSplat achieves competitive novel-view synthesis on RealEstate10K and ACID using only 16K Gaussians via global scene tokens and coarse-to-fine training, with a 4MB footprint and under 78ms inference.
-
TokenGS: Decoupling 3D Gaussian Prediction from Pixels with Learnable Tokens
TokenGS uses learnable Gaussian tokens in an encoder-decoder architecture to regress 3D means directly, achieving SOTA feed-forward reconstruction on static and dynamic scenes with better robustness.
-
RobotPan: A 360$^\circ$ Surround-View Robotic Vision System for Embodied Perception
RobotPan predicts metric-scaled compact 3D Gaussians from calibrated multi-view inputs via spherical coordinates and hierarchical voxel priors for real-time 360° robotic perception and reconstruction.
-
AnyImageNav: Any-View Geometry for Precise Last-Meter Image-Goal Navigation
AnyImageNav uses a semantic-to-geometric cascade with 3D multi-view foundation models to recover precise 6-DoF poses from goal images, achieving 0.27m position error and state-of-the-art success rates on Gibson and HM...
-
Learning 3D Reconstruction with Priors in Test Time
Test-time constrained optimization incorporates priors into pre-trained multiview transformers via self-supervised losses and penalty terms to improve 3D reconstruction accuracy.
-
FrameVGGT: Geometry-Aligned Frame-Level Memory for Bounded Streaming VGGT
FrameVGGT replaces token-level KV retention with frame-level segments and prototypes to bound memory while preserving geometric coherence in streaming VGGT.
-
ZipMap: Linear-Time Stateful 3D Reconstruction via Test-Time Training
ZipMap achieves linear-time bidirectional 3D reconstruction by zipping image collections into a compact stateful representation via test-time training layers.
-
MoonSeg3R: Monocular Online Zero-Shot Segment Anything in 3D with Reconstructive Foundation Priors
MoonSeg3R is the first method for online monocular 3D instance segmentation, achieving performance competitive with RGB-D systems by using CUT3R priors for geometric consistency and temporal query memory.
-
Attention Itself Could Retrieve. RetrieveVGGT: Training-Free Long Context Streaming 3D Reconstruction via Query-Key Similarity Retrieval
RetrieveVGGT enables constant-memory long-context streaming 3D reconstruction by retrieving relevant frames via query-key similarities in VGGT's first attention layer, outperforming StreamVGGT and others.
-
Spark3R: Asymmetric Token Reduction Makes Fast Feed-Forward 3D Reconstruction
Asymmetric token reduction, with distinct merging for queries and pruning for key-values plus layer-wise adaptation, delivers up to 28x speedup on 1000-frame 3D reconstruction inputs while preserving competitive quality.
-
Ray-Aware Pointer Memory with Adaptive Updates for Streaming 3D Reconstruction
Ray-aware pointer memory with adaptive retain-or-replace updates enhances stability and accuracy in streaming 3D reconstruction.
-
Linearizing Vision Transformer with Test-Time Training
Using Test-Time Training's structural match to Softmax attention plus key normalization and locality modules allows inheriting pretrained weights and fine-tuning Stable Diffusion 3.5 in one hour to match quality while...
-
Feed-Forward 3D Scene Modeling: A Problem-Driven Perspective
The paper proposes a problem-driven taxonomy for feed-forward 3D scene modeling that groups methods by five core challenges: feature enhancement, geometry awareness, model efficiency, augmentation strategies, and temp...
-
Scal3R: Scalable Test-Time Training for Large-Scale 3D Reconstruction
Scal3R achieves better accuracy and consistency in large-scale 3D scene reconstruction by maintaining a compressed global context through test-time adaptation of lightweight neural networks on long video sequences.
-
OVGGT: O(1) Constant-Cost Streaming Visual Geometry Transformer
OVGGT achieves constant O(1) memory and compute for streaming 3D geometry reconstruction by using FFN-residual-based KV cache compression and dynamic anchor protection, matching state-of-the-art accuracy on long sequences.
-
ViT$^3$: Unlocking Test-Time Training in Vision
ViT³ is a Test-Time Training vision model that achieves linear complexity, matches or exceeds other linear models like Mamba on classification, generation, detection and segmentation, and narrows the gap to standard v...
-
StreamCacheVGGT: Streaming Visual Geometry Transformers with Robust Scoring and Hybrid Cache Compression
StreamCacheVGGT improves streaming 3D geometry reconstruction accuracy and stability under fixed memory by using cross-layer token importance scoring and hybrid cache compression instead of pure eviction.