Recognition: 2 theorem links
Mem3R: Streaming 3D Reconstruction with Hybrid Memory via Test-Time Training
Pith reviewed 2026-05-10 18:28 UTC · model grok-4.3
The pith
A hybrid memory design separates camera tracking via test-time training from geometric mapping with explicit tokens to improve consistency in streaming 3D reconstruction over long sequences.
A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.
Core claim
Mem3R maintains an implicit fast-weight memory implemented as a lightweight multi-layer perceptron updated via test-time training for camera tracking and an explicit token-based fixed-size state for geometric mapping. This hybrid design decouples the two processes to achieve better temporal consistency over long sequences, reduces overall model size, and allows integration with existing state-update strategies while preserving constant GPU memory usage and comparable inference throughput.
What carries the argument
Hybrid memory that decouples implicit test-time-trained fast-weight memory for camera tracking from explicit token-based fixed-size state for geometric mapping.
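The fast-weight half of this design can be pictured with a small sketch. Nothing here is taken from Mem3R's actual architecture: a single linear layer stands in for the lightweight MLP, and the dimensions and learning rate are invented for illustration. The pattern is the essence of test-time training, one gradient step per incoming frame on a self-supervised loss, at inference time:

```python
import numpy as np

# Hedged sketch of a TTT fast-weight memory. A linear layer stands in for
# Mem3R's MLP; shapes and learning rate are illustrative assumptions.

def ttt_step(W, x, y, lr=0.1):
    """One test-time update: gradient of 0.5 * ||W @ x - y||^2 w.r.t. W."""
    err = W @ x - y               # prediction error for this frame
    grad = np.outer(err, x)      # dL/dW
    return W - lr * grad

rng = np.random.default_rng(0)
d = 8
W = np.zeros((d, d))              # fast weights start empty
target = rng.normal(size=(d, d))  # stands in for the true frame-to-frame map

# Stream frames: the memory absorbs the mapping online, with no offline
# retraining step.
errors = []
for _ in range(200):
    x = rng.normal(size=d)
    y = target @ x
    errors.append(float(np.linalg.norm(W @ x - y)))
    W = ttt_step(W, x, y)

assert np.mean(errors[-20:]) < np.mean(errors[:20])  # memory has adapted
```

The point of the sketch is only the update rule: the "memory" is the weights themselves, so its footprint is constant regardless of how many frames have streamed past.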
If this is right
- Improves performance on long sequences of 500 to 1000 frames compared with prior recurrent designs.
- Reduces model size while maintaining or improving accuracy.
- Lowers absolute trajectory error by up to 39 percent on long sequences when paired with improved state-update methods.
- Extends accuracy gains to related tasks including video depth estimation and 3D reconstruction.
- Preserves constant GPU memory usage and comparable inference speed across varying sequence lengths.
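The constant-memory property comes from the other half of the hybrid design: an explicit token state whose size never grows with the stream. A minimal sketch, assuming an attention-style write rule and sizes that are purely illustrative (Mem3R's actual state update is not reproduced here):

```python
import numpy as np

# Hedged sketch of a fixed-size explicit token state. The state holds K
# tokens of dimension d; the write rule and all sizes are assumptions.

def softmax(z, axis=-1):
    z = z - z.max(axis=axis, keepdims=True)
    e = np.exp(z)
    return e / e.sum(axis=axis, keepdims=True)

def update_state(state, frame_tokens, rate=0.5):
    """Each state token attends over the new frame's tokens and blends in
    the attended value; the state's shape (K, d) never changes."""
    attn = softmax(state @ frame_tokens.T)   # (K, T) attention weights
    write = attn @ frame_tokens              # (K, d) attended update
    return (1.0 - rate) * state + rate * write

rng = np.random.default_rng(0)
K, d = 16, 32                                # fixed state: 16 tokens
state = rng.normal(size=(K, d)) * 0.01

for _ in range(1000):                        # a 1000-frame stream
    frame_tokens = rng.normal(size=(rng.integers(4, 64), d))
    state = update_state(state, frame_tokens)
    assert state.shape == (K, d)             # constant-size memory
```

However long the stream, the state is K tokens, which is why GPU usage can stay flat across sequence lengths.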
Where Pith is reading between the lines
- The separation of tracking and mapping memory needs could extend to other sequential perception tasks that suffer from forgetting over time.
- Test-time updates to the implicit memory might allow adaptation to new lighting or motion patterns without full offline retraining.
- Fixed-size explicit states may offer better scaling properties for extremely long streams than fully compressed recurrent states.
- The design could be combined with hardware-specific optimizations to run on resource-limited devices.
Load-bearing premise
Decoupling camera tracking into an implicit memory updated at test time and geometric mapping into explicit tokens will improve temporal consistency without introducing new accuracy or efficiency trade-offs over long sequences.
What would settle it
A comparison on sequences exceeding 1000 frames where the hybrid model shows faster error accumulation or higher memory consumption than a single unified memory baseline would challenge the central claim.
Original abstract
Streaming 3D perception is well suited to robotics and augmented reality, where long visual streams must be processed efficiently and consistently. Recent recurrent models offer a promising solution by maintaining fixed-size states and enabling linear-time inference, but they often suffer from drift accumulation and temporal forgetting over long sequences due to the limited capacity of compressed latent memories. We propose Mem3R, a streaming 3D reconstruction model with a hybrid memory design that decouples camera tracking from geometric mapping to improve temporal consistency over long sequences. For camera tracking, Mem3R employs an implicit fast-weight memory implemented as a lightweight Multi-Layer Perceptron updated via Test-Time Training. For geometric mapping, Mem3R maintains an explicit token-based fixed-size state. Compared with CUT3R, this design not only significantly improves long-sequence performance but also reduces the model size from 793M to 644M parameters. Mem3R supports existing improved plug-and-play state update strategies developed for CUT3R. Specifically, integrating it with TTT3R decreases Absolute Trajectory Error by up to 39% over the base implementation on 500 to 1000 frame sequences. The resulting improvements also extend to other downstream tasks, including video depth estimation and 3D reconstruction, while preserving constant GPU memory usage and comparable inference throughput. Project page: https://lck666666.github.io/Mem3R/
Editorial analysis
A structured set of objections, weighed in public.
Referee Report
Summary. The manuscript introduces Mem3R, a streaming 3D reconstruction model featuring a hybrid memory architecture. Camera tracking is handled by an implicit fast-weight memory implemented as an MLP updated through test-time training (TTT), while geometric mapping uses an explicit fixed-size token-based state. The approach is shown to be compatible with prior strategies like TTT3R, claiming a reduction in model size from 793M to 644M parameters compared to CUT3R, up to 39% decrease in Absolute Trajectory Error (ATE) on long sequences (500-1000 frames), and benefits to downstream tasks such as video depth estimation and 3D reconstruction, all while keeping GPU memory constant.
Significance. Should the hybrid memory design prove effective in maintaining alignment between tracking and mapping without introducing new inconsistencies, this work could offer a practical advancement in efficient streaming 3D perception for applications in robotics and augmented reality. The parameter reduction and linear-time inference are particularly valuable for long visual streams, and the plug-and-play integration with existing TTT methods enhances its applicability.
major comments (2)
- [Methods (hybrid memory design)] The decoupling of implicit TTT memory for camera tracking from explicit tokens for geometry is central to the claims of improved temporal consistency. However, no explicit mechanism (e.g., joint loss, cross-attention, or consistency regularizer) is described to ensure alignment between the pose estimates from the MLP and the mapped structure in the token state. This omission raises the possibility of accumulating drift, which could undermine the reported long-sequence ATE improvements.
- [Experiments (quantitative results)] The 39% ATE reduction and model size claims are load-bearing for the contribution. The manuscript should provide more details on the evaluation protocol, including the number of sequences tested, variance across runs, and whether the 500-1000 frame sequences were selected post-hoc, to rule out selection bias and confirm the gains are robust.
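The alignment mechanism the first major comment asks about could take the form of a consistency regularizer that penalizes disagreement between the tracked pose and the mapped geometry. A generic illustration of such a term, not a mechanism described in the Mem3R manuscript:

```python
import numpy as np

# Hedged sketch of a pose-map consistency regularizer of the kind the
# referee describes. Everything here is a generic illustration.

def pose_map_consistency(R, t, points_world, points_obs):
    """Mean squared residual between world points transformed by the
    tracked pose (R, t) and the per-frame point observations."""
    pred = (R @ points_world.T).T + t
    return float(((pred - points_obs) ** 2).sum(axis=1).mean())

rng = np.random.default_rng(0)
pts = rng.normal(size=(100, 3))
theta = 0.1
R = np.array([[np.cos(theta), -np.sin(theta), 0],
              [np.sin(theta),  np.cos(theta), 0],
              [0, 0, 1]])
t = np.array([0.2, 0.0, -0.1])

obs = (R @ pts.T).T + t             # observations consistent with the pose
assert pose_map_consistency(R, t, pts, obs) == 0.0

# A drifted pose is penalized, so minimizing this term would pull the
# tracking and mapping memories back into agreement.
assert pose_map_consistency(np.eye(3), t, pts, obs) > 0.0
```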
minor comments (2)
- [Abstract] The abstract mentions 'up to 39%' ATE reduction; specifying the exact sequences or average would improve clarity.
- [Related Work] Ensure all baselines like CUT3R are properly cited with full references.
Simulated Author's Rebuttal
We thank the referee for the constructive feedback on our manuscript. The hybrid memory design in Mem3R aims to address temporal consistency challenges in streaming 3D reconstruction, and we appreciate the opportunity to clarify the alignment mechanisms and evaluation details. We address each major comment below and will update the manuscript accordingly.
Point-by-point responses
- Referee: [Methods (hybrid memory design)] The decoupling of implicit TTT memory for camera tracking from explicit tokens for geometry is central to the claims of improved temporal consistency. However, no explicit mechanism (e.g., joint loss, cross-attention, or consistency regularizer) is described to ensure alignment between the pose estimates from the MLP and the mapped structure in the token state. This omission raises the possibility of accumulating drift, which could undermine the reported long-sequence ATE improvements.
Authors: We thank the referee for this observation. In the Mem3R architecture, alignment is achieved implicitly through the processing pipeline: the pose output by the TTT-updated MLP is directly applied to warp and fuse new observations into the explicit token-based mapping state, and shared visual features extracted from the input frames feed both components. The end-to-end reconstruction objective (including depth and pose consistency terms) provides gradient flow that couples the two memories during test-time updates. That said, we agree the original manuscript did not sufficiently articulate this information flow. In revision we will add a concise subsection under Methods describing the interaction between the implicit MLP and the explicit tokens, including a diagram of the data flow, to clarify why drift does not accumulate between the two memories. No new explicit regularizer is introduced, since the current design already yields the reported ATE gains; the revision will simply document the existing coupling more explicitly. revision: partial
- Referee: [Experiments (quantitative results)] The 39% ATE reduction and model size claims are load-bearing for the contribution. The manuscript should provide more details on the evaluation protocol, including the number of sequences tested, variance across runs, and whether the 500-1000 frame sequences were selected post-hoc, to rule out selection bias and confirm the gains are robust.
Authors: We agree that greater transparency on the evaluation protocol is warranted. The 39% ATE reduction is the largest improvement observed when integrating Mem3R with TTT3R on long sequences drawn from the same benchmark suites used by CUT3R and TTT3R; all qualifying sequences in the 500–1000 frame range were included rather than cherry-picked. In the revised manuscript we will (i) state the exact number of sequences evaluated, (ii) report per-sequence ATE values together with mean and standard deviation where multiple random seeds were feasible, and (iii) explicitly confirm that sequence selection followed the identical long-sequence protocol of the baseline papers with no post-hoc filtering. These additions will allow readers to assess robustness directly. revision: yes
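The ATE protocol this response refers to follows a standard recipe: align the estimated trajectory to ground truth with a Umeyama similarity transform, then report the RMSE of the position residuals. A minimal sketch of that recipe (the paper's exact alignment settings are not specified here):

```python
import numpy as np

# Hedged sketch of Absolute Trajectory Error: Umeyama alignment followed by
# position RMSE. Standard recipe, not the paper's exact evaluation code.

def umeyama_align(est, gt):
    """Scale s, rotation R, translation t minimizing ||gt - (s R est + t)||."""
    mu_e, mu_g = est.mean(0), gt.mean(0)
    E, G = est - mu_e, gt - mu_g
    U, S, Vt = np.linalg.svd(G.T @ E / len(est))  # cross-covariance SVD
    D = np.eye(3)
    if np.linalg.det(U) * np.linalg.det(Vt) < 0:
        D[2, 2] = -1.0                            # guard against reflections
    R = U @ D @ Vt
    s = np.trace(np.diag(S) @ D) / E.var(0).sum()
    t = mu_g - s * R @ mu_e
    return s, R, t

def ate_rmse(est, gt):
    s, R, t = umeyama_align(est, gt)
    resid = gt - (s * (R @ est.T).T + t)
    return float(np.sqrt((resid ** 2).sum(axis=1).mean()))

# Toy check: a trajectory recovered up to a similarity transform has ~0 ATE.
rng = np.random.default_rng(0)
gt = np.cumsum(rng.normal(size=(500, 3)), axis=0)  # 500-frame trajectory
theta = 0.3
Rz = np.array([[np.cos(theta), -np.sin(theta), 0],
               [np.sin(theta),  np.cos(theta), 0],
               [0, 0, 1]])
est = 0.5 * (Rz @ gt.T).T + np.array([1.0, -2.0, 3.0])
assert ate_rmse(est, gt) < 1e-6
```

Because the alignment removes any global similarity transform, ATE isolates drift, which is exactly what the long-sequence comparison is meant to probe.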
Circularity Check
No circularity: hybrid design and empirical gains are independent of inputs
full rationale
The paper proposes a novel hybrid memory architecture (implicit MLP via TTT for tracking + explicit fixed-size tokens for mapping) motivated by limitations in prior recurrent models like CUT3R. Performance numbers (e.g., 39% ATE reduction when plugged with TTT3R) are reported as experimental outcomes on 500-1000 frame sequences, not as predictions derived from fitted parameters or self-referential equations. No load-bearing step reduces by construction to the inputs; the decoupling is presented as an explicit design choice with downstream task extensions, keeping the derivation self-contained.
Axiom & Free-Parameter Ledger
free parameters (2)
- MLP architecture for fast-weight memory
- Token state size for explicit memory
axioms (2)
- domain assumption: Decoupling camera tracking from geometric mapping improves temporal consistency in long sequences
- domain assumption: Test-time training can maintain an effective implicit memory without drift accumulation
invented entities (2)
- Implicit fast-weight memory (no independent evidence)
- Explicit token-based fixed-size state (no independent evidence)
Lean theorems connected to this paper
- IndisputableMonolith/Cost/FunctionalEquation.lean · washburn_uniqueness_aczel · unclear
  Relation between the paper passage and the cited Recognition theorem is unclear.
  Passage: "hybrid memory design that decouples camera tracking from geometric mapping... implicit fast-weight memory implemented as a lightweight Multi-Layer Perceptron updated via Test-Time Training... explicit token-based fixed-size state"
- IndisputableMonolith/Foundation/ArithmeticFromLogic.lean · LogicNat · unclear
  Relation between the paper passage and the cited Recognition theorem is unclear.
  Passage: "replaces CUT3R's pose-related state tokens and decoder layers with a lightweight implicit MLP-based memory"
What do these tags mean?
- matches: The paper's claim is directly supported by a theorem in the formal canon.
- supports: The theorem supports part of the paper's argument, but the paper may add assumptions or extra steps.
- extends: The paper goes beyond the formal theorem; the theorem is a base layer rather than the whole result.
- uses: The paper appears to rely on the theorem as machinery.
- contradicts: The paper's claim conflicts with a theorem or certificate in the canon.
- unclear: Pith found a possible connection, but the passage is too broad, indirect, or ambiguous to say the theorem truly supports the claim.