Recognition: 2 theorem links
Mem3R: Streaming 3D Reconstruction with Hybrid Memory via Test-Time Training
Pith reviewed 2026-05-10 18:28 UTC · model grok-4.3
The pith
A hybrid memory design separates camera tracking via test-time training from geometric mapping with explicit tokens to improve consistency in streaming 3D reconstruction over long sequences.
A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.
Core claim
Mem3R maintains an implicit fast-weight memory implemented as a lightweight multi-layer perceptron updated via test-time training for camera tracking and an explicit token-based fixed-size state for geometric mapping. This hybrid design decouples the two processes to achieve better temporal consistency over long sequences, reduces overall model size, and allows integration with existing state-update strategies while preserving constant GPU memory usage and comparable inference throughput.
What carries the argument
Hybrid memory that decouples implicit test-time-trained fast-weight memory for camera tracking from explicit token-based fixed-size state for geometric mapping.
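The fast-weight half of this design can be pictured with a small sketch. Nothing here is taken from Mem3R's actual architecture: a single linear layer stands in for the lightweight MLP, and the dimensions and learning rate are invented for illustration. The pattern is the essence of test-time training, one gradient step per incoming frame on a self-supervised loss, at inference time:

```python
import numpy as np

# Hedged sketch of a TTT fast-weight memory. A linear layer stands in for
# Mem3R's MLP; shapes and learning rate are illustrative assumptions.

def ttt_step(W, x, y, lr=0.1):
    """One test-time update: gradient of 0.5 * ||W @ x - y||^2 w.r.t. W."""
    err = W @ x - y               # prediction error for this frame
    grad = np.outer(err, x)      # dL/dW
    return W - lr * grad

rng = np.random.default_rng(0)
d = 8
W = np.zeros((d, d))              # fast weights start empty
target = rng.normal(size=(d, d))  # stands in for the true frame-to-frame map

# Stream frames: the memory absorbs the mapping online, with no offline
# retraining step.
errors = []
for _ in range(200):
    x = rng.normal(size=d)
    y = target @ x
    errors.append(float(np.linalg.norm(W @ x - y)))
    W = ttt_step(W, x, y)

assert np.mean(errors[-20:]) < np.mean(errors[:20])  # memory has adapted
```

The point of the sketch is only the update rule: the "memory" is the weights themselves, so its footprint is constant regardless of how many frames have streamed past.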
If this is right
- Improves performance on long sequences of 500 to 1000 frames compared with prior recurrent designs.
- Reduces model size while maintaining or improving accuracy.
- Lowers absolute trajectory error by up to 39 percent on long sequences when paired with improved state-update methods.
- Extends accuracy gains to related tasks including video depth estimation and 3D reconstruction.
- Preserves constant GPU memory usage and comparable inference speed across varying sequence lengths.
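The constant-memory property comes from the other half of the hybrid design: an explicit token state whose size never grows with the stream. A minimal sketch, assuming an attention-style write rule and sizes that are purely illustrative (Mem3R's actual state update is not reproduced here):

```python
import numpy as np

# Hedged sketch of a fixed-size explicit token state. The state holds K
# tokens of dimension d; the write rule and all sizes are assumptions.

def softmax(z, axis=-1):
    z = z - z.max(axis=axis, keepdims=True)
    e = np.exp(z)
    return e / e.sum(axis=axis, keepdims=True)

def update_state(state, frame_tokens, rate=0.5):
    """Each state token attends over the new frame's tokens and blends in
    the attended value; the state's shape (K, d) never changes."""
    attn = softmax(state @ frame_tokens.T)   # (K, T) attention weights
    write = attn @ frame_tokens              # (K, d) attended update
    return (1.0 - rate) * state + rate * write

rng = np.random.default_rng(0)
K, d = 16, 32                                # fixed state: 16 tokens
state = rng.normal(size=(K, d)) * 0.01

for _ in range(1000):                        # a 1000-frame stream
    frame_tokens = rng.normal(size=(rng.integers(4, 64), d))
    state = update_state(state, frame_tokens)
    assert state.shape == (K, d)             # constant-size memory
```

However long the stream, the state is K tokens, which is why GPU usage can stay flat across sequence lengths.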
Where Pith is reading between the lines
- The separation of tracking and mapping memory needs could extend to other sequential perception tasks that suffer from forgetting over time.
- Test-time updates to the implicit memory might allow adaptation to new lighting or motion patterns without full offline retraining.
- Fixed-size explicit states may offer better scaling properties for extremely long streams than fully compressed recurrent states.
- The design could be combined with hardware-specific optimizations to run on resource-limited devices.
Load-bearing premise
Decoupling camera tracking into an implicit memory updated at test time and geometric mapping into explicit tokens will improve temporal consistency without introducing new accuracy or efficiency trade-offs over long sequences.
What would settle it
A comparison on sequences exceeding 1000 frames where the hybrid model shows faster error accumulation or higher memory consumption than a single unified memory baseline would challenge the central claim.
Original abstract
Streaming 3D perception is well suited to robotics and augmented reality, where long visual streams must be processed efficiently and consistently. Recent recurrent models offer a promising solution by maintaining fixed-size states and enabling linear-time inference, but they often suffer from drift accumulation and temporal forgetting over long sequences due to the limited capacity of compressed latent memories. We propose Mem3R, a streaming 3D reconstruction model with a hybrid memory design that decouples camera tracking from geometric mapping to improve temporal consistency over long sequences. For camera tracking, Mem3R employs an implicit fast-weight memory implemented as a lightweight Multi-Layer Perceptron updated via Test-Time Training. For geometric mapping, Mem3R maintains an explicit token-based fixed-size state. Compared with CUT3R, this design not only significantly improves long-sequence performance but also reduces the model size from 793M to 644M parameters. Mem3R supports existing improved plug-and-play state update strategies developed for CUT3R. Specifically, integrating it with TTT3R decreases Absolute Trajectory Error by up to 39% over the base implementation on 500 to 1000 frame sequences. The resulting improvements also extend to other downstream tasks, including video depth estimation and 3D reconstruction, while preserving constant GPU memory usage and comparable inference throughput. Project page: https://lck666666.github.io/Mem3R/
Editorial analysis
A structured set of objections, weighed in public.
Referee Report
Summary. The manuscript introduces Mem3R, a streaming 3D reconstruction model featuring a hybrid memory architecture. Camera tracking is handled by an implicit fast-weight memory implemented as an MLP updated through test-time training (TTT), while geometric mapping uses an explicit fixed-size token-based state. The approach is shown to be compatible with prior strategies like TTT3R, claiming a reduction in model size from 793M to 644M parameters compared to CUT3R, up to 39% decrease in Absolute Trajectory Error (ATE) on long sequences (500-1000 frames), and benefits to downstream tasks such as video depth estimation and 3D reconstruction, all while keeping GPU memory constant.
Significance. Should the hybrid memory design prove effective in maintaining alignment between tracking and mapping without introducing new inconsistencies, this work could offer a practical advancement in efficient streaming 3D perception for applications in robotics and augmented reality. The parameter reduction and linear-time inference are particularly valuable for long visual streams, and the plug-and-play integration with existing TTT methods enhances its applicability.
major comments (2)
- [Methods (hybrid memory design)] The decoupling of implicit TTT memory for camera tracking from explicit tokens for geometry is central to the claims of improved temporal consistency. However, no explicit mechanism (e.g., joint loss, cross-attention, or consistency regularizer) is described to ensure alignment between the pose estimates from the MLP and the mapped structure in the token state. This omission raises the possibility of accumulating drift, which could undermine the reported long-sequence ATE improvements.
- [Experiments (quantitative results)] The 39% ATE reduction and model size claims are load-bearing for the contribution. The manuscript should provide more details on the evaluation protocol, including the number of sequences tested, variance across runs, and whether the 500-1000 frame sequences were selected post-hoc, to rule out selection bias and confirm the gains are robust.
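The alignment mechanism the first major comment asks about could take the form of a consistency regularizer that penalizes disagreement between the tracked pose and the mapped geometry. A generic illustration of such a term, not a mechanism described in the Mem3R manuscript:

```python
import numpy as np

# Hedged sketch of a pose-map consistency regularizer of the kind the
# referee describes. Everything here is a generic illustration.

def pose_map_consistency(R, t, points_world, points_obs):
    """Mean squared residual between world points transformed by the
    tracked pose (R, t) and the per-frame point observations."""
    pred = (R @ points_world.T).T + t
    return float(((pred - points_obs) ** 2).sum(axis=1).mean())

rng = np.random.default_rng(0)
pts = rng.normal(size=(100, 3))
theta = 0.1
R = np.array([[np.cos(theta), -np.sin(theta), 0],
              [np.sin(theta),  np.cos(theta), 0],
              [0, 0, 1]])
t = np.array([0.2, 0.0, -0.1])

obs = (R @ pts.T).T + t             # observations consistent with the pose
assert pose_map_consistency(R, t, pts, obs) == 0.0

# A drifted pose is penalized, so minimizing this term would pull the
# tracking and mapping memories back into agreement.
assert pose_map_consistency(np.eye(3), t, pts, obs) > 0.0
```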
minor comments (2)
- [Abstract] The abstract mentions 'up to 39%' ATE reduction; specifying the exact sequences or average would improve clarity.
- [Related Work] Ensure all baselines like CUT3R are properly cited with full references.
Simulated Author's Rebuttal
We thank the referee for the constructive feedback on our manuscript. The hybrid memory design in Mem3R aims to address temporal consistency challenges in streaming 3D reconstruction, and we appreciate the opportunity to clarify the alignment mechanisms and evaluation details. We address each major comment below and will update the manuscript accordingly.
Point-by-point responses
- Referee: [Methods (hybrid memory design)] The decoupling of implicit TTT memory for camera tracking from explicit tokens for geometry is central to the claims of improved temporal consistency. However, no explicit mechanism (e.g., joint loss, cross-attention, or consistency regularizer) is described to ensure alignment between the pose estimates from the MLP and the mapped structure in the token state. This omission raises the possibility of accumulating drift, which could undermine the reported long-sequence ATE improvements.
Authors: We thank the referee for this observation. In the Mem3R architecture, alignment is achieved implicitly through the processing pipeline: the pose output by the TTT-updated MLP is directly applied to warp and fuse new observations into the explicit token-based mapping state, and shared visual features extracted from the input frames feed both components. The end-to-end reconstruction objective (including depth and pose consistency terms) provides gradient flow that couples the two memories during test-time updates. That said, we agree the original manuscript did not sufficiently articulate this information flow. In revision we will add a concise subsection under Methods describing the interaction between the implicit MLP and the explicit tokens, including a diagram of the data flow, to clarify why drift does not accumulate between the two memories. No new explicit regularizer is introduced, since the current design already yields the reported ATE gains; the revision will simply document the existing coupling more explicitly. revision: partial
- Referee: [Experiments (quantitative results)] The 39% ATE reduction and model size claims are load-bearing for the contribution. The manuscript should provide more details on the evaluation protocol, including the number of sequences tested, variance across runs, and whether the 500-1000 frame sequences were selected post-hoc, to rule out selection bias and confirm the gains are robust.
Authors: We agree that greater transparency on the evaluation protocol is warranted. The 39% ATE reduction is the largest improvement observed when integrating Mem3R with TTT3R on long sequences drawn from the same benchmark suites used by CUT3R and TTT3R; all qualifying sequences in the 500–1000 frame range were included rather than cherry-picked. In the revised manuscript we will (i) state the exact number of sequences evaluated, (ii) report per-sequence ATE values together with mean and standard deviation where multiple random seeds were feasible, and (iii) explicitly confirm that sequence selection followed the identical long-sequence protocol of the baseline papers with no post-hoc filtering. These additions will allow readers to assess robustness directly. revision: yes
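The ATE protocol this response refers to follows a standard recipe: align the estimated trajectory to ground truth with a Umeyama similarity transform, then report the RMSE of the position residuals. A minimal sketch of that recipe (the paper's exact alignment settings are not specified here):

```python
import numpy as np

# Hedged sketch of Absolute Trajectory Error: Umeyama alignment followed by
# position RMSE. Standard recipe, not the paper's exact evaluation code.

def umeyama_align(est, gt):
    """Scale s, rotation R, translation t minimizing ||gt - (s R est + t)||."""
    mu_e, mu_g = est.mean(0), gt.mean(0)
    E, G = est - mu_e, gt - mu_g
    U, S, Vt = np.linalg.svd(G.T @ E / len(est))  # cross-covariance SVD
    D = np.eye(3)
    if np.linalg.det(U) * np.linalg.det(Vt) < 0:
        D[2, 2] = -1.0                            # guard against reflections
    R = U @ D @ Vt
    s = np.trace(np.diag(S) @ D) / E.var(0).sum()
    t = mu_g - s * R @ mu_e
    return s, R, t

def ate_rmse(est, gt):
    s, R, t = umeyama_align(est, gt)
    resid = gt - (s * (R @ est.T).T + t)
    return float(np.sqrt((resid ** 2).sum(axis=1).mean()))

# Toy check: a trajectory recovered up to a similarity transform has ~0 ATE.
rng = np.random.default_rng(0)
gt = np.cumsum(rng.normal(size=(500, 3)), axis=0)  # 500-frame trajectory
theta = 0.3
Rz = np.array([[np.cos(theta), -np.sin(theta), 0],
               [np.sin(theta),  np.cos(theta), 0],
               [0, 0, 1]])
est = 0.5 * (Rz @ gt.T).T + np.array([1.0, -2.0, 3.0])
assert ate_rmse(est, gt) < 1e-6
```

Because the alignment removes any global similarity transform, ATE isolates drift, which is exactly what the long-sequence comparison is meant to probe.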
Circularity Check
No circularity: hybrid design and empirical gains are independent of inputs
full rationale
The paper proposes a novel hybrid memory architecture (implicit MLP via TTT for tracking + explicit fixed-size tokens for mapping) motivated by limitations in prior recurrent models like CUT3R. Performance numbers (e.g., 39% ATE reduction when plugged with TTT3R) are reported as experimental outcomes on 500-1000 frame sequences, not as predictions derived from fitted parameters or self-referential equations. No load-bearing step reduces by construction to the inputs; the decoupling is presented as an explicit design choice with downstream task extensions, keeping the derivation self-contained.
Axiom & Free-Parameter Ledger
free parameters (2)
- MLP architecture for fast-weight memory
- Token state size for explicit memory
axioms (2)
- domain assumption: Decoupling camera tracking from geometric mapping improves temporal consistency in long sequences
- domain assumption: Test-time training can maintain an effective implicit memory without drift accumulation
invented entities (2)
- Implicit fast-weight memory (no independent evidence)
- Explicit token-based fixed-size state (no independent evidence)
Lean theorems connected to this paper
- IndisputableMonolith/Cost/FunctionalEquation.lean · washburn_uniqueness_aczel · unclear
  Relation between the paper passage and the cited Recognition theorem is unclear.
  Passage: "hybrid memory design that decouples camera tracking from geometric mapping... implicit fast-weight memory implemented as a lightweight Multi-Layer Perceptron updated via Test-Time Training... explicit token-based fixed-size state"
- IndisputableMonolith/Foundation/ArithmeticFromLogic.lean · LogicNat · unclear
  Relation between the paper passage and the cited Recognition theorem is unclear.
  Passage: "replaces CUT3R's pose-related state tokens and decoder layers with a lightweight implicit MLP-based memory"
What do these tags mean?
- matches: The paper's claim is directly supported by a theorem in the formal canon.
- supports: The theorem supports part of the paper's argument, but the paper may add assumptions or extra steps.
- extends: The paper goes beyond the formal theorem; the theorem is a base layer rather than the whole result.
- uses: The paper appears to rely on the theorem as machinery.
- contradicts: The paper's claim conflicts with a theorem or certificate in the canon.
- unclear: Pith found a possible connection, but the passage is too broad, indirect, or ambiguous to say the theorem truly supports the claim.