Ray-Aware Pointer Memory with Adaptive Updates for Streaming 3D Reconstruction

Chi Zhang; Feifei Li; Qi Song; Rui Huang

arxiv: 2605.05749 · v3 · pith:Y7L6NQWSnew · submitted 2026-05-07 · 💻 cs.CV

Pith reviewed 2026-05-22 09:50 UTC · model grok-4.3

classification 💻 cs.CV

keywords ray-aware pointer memory · streaming 3D reconstruction · adaptive memory updates · dense reconstruction · loop closure · camera pose estimation · online reconstruction · viewpoint consistency

The pith

Ray-aware pointers that store position and viewing direction enable retain-or-replace memory updates for stable streaming 3D reconstruction.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper aims to fix instability and redundancy in dense 3D reconstruction from continuous image streams by replacing appearance-driven fusion with a memory system that reasons about both location and viewing direction. Traditional approaches accumulate similar observations when the camera moves, leading to drift and bloated memory. Each pointer in the new design holds a 3D position, ray direction, and feature embedding so the system can jointly check geometric closeness and viewpoint consistency. This supports an adaptive retain-or-replace rule that keeps informative data, drops duplicates, and flags potential loop revisits for pose refinement. The result is bounded memory growth together with better long-term geometry and camera accuracy during online processing.

Core claim

By storing 3D position, ray direction, and feature embedding together in each memory pointer, the system can apply a single retain-or-replace update rule that distinguishes local redundancy from novel observations and loop candidates without averaging features, thereby maintaining bounded memory while enforcing geometric consistency through triggered pose refinement.

What carries the argument

Ray-aware pointer memory that stores each entry's 3D position, associated ray direction, and feature embedding to support joint reasoning on spatial proximity and viewpoint consistency.

If this is right

Memory size stays bounded because redundant pointers are discarded instead of fused.
Detection of loop candidates automatically triggers pose refinement for global consistency.
Long-term reconstruction remains stable when the camera revisits areas under changing viewpoints.
Camera pose estimates improve because repeated observations no longer accumulate error.
Streaming inference remains efficient since only selected pointers are retained or updated.
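The first consequence above, bounded memory through discarding rather than fusing, can be illustrated with a minimal sketch. It is not the authors' implementation: the dictionary layout, the `radius` and `max_angle` thresholds, and the confidence-based tie-break (`conf`) are all assumptions made for illustration, since this page does not state the paper's actual retention criterion.

```python
import math

def angle_between(u, v):
    """Angle in radians between two unit vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    return math.acos(max(-1.0, min(1.0, dot)))

def retain_or_replace(memory, new, radius=0.05, max_angle=math.radians(15.0)):
    """Insert `new` into `memory` without averaging features.

    Each pointer is a dict with a 3D position ("pos"), a unit ray
    direction ("ray"), a feature embedding ("feat"), and a hypothetical
    confidence score ("conf") used only to choose which pointer survives.
    """
    for i, old in enumerate(memory):
        close = math.dist(old["pos"], new["pos"]) <= radius
        same_view = angle_between(old["ray"], new["ray"]) <= max_angle
        if close and same_view:
            # Redundant pair: exactly one pointer survives, so memory does not grow.
            if new["conf"] > old["conf"]:
                memory[i] = new      # replace
            return memory            # otherwise retain the old pointer
    memory.append(new)               # not redundant: store as a new pointer
    return memory

memory = []
stream = [
    {"pos": (0.0, 0.00, 1.0), "ray": (0.0, 0.0, 1.0), "feat": (0.1, 0.9), "conf": 0.6},
    {"pos": (0.0, 0.01, 1.0), "ray": (0.0, 0.0, 1.0), "feat": (0.2, 0.8), "conf": 0.9},
    {"pos": (0.0, 0.02, 1.0), "ray": (0.0, 0.0, 1.0), "feat": (0.3, 0.7), "conf": 0.3},
    {"pos": (1.0, 0.00, 1.0), "ray": (0.0, 0.0, 1.0), "feat": (0.5, 0.5), "conf": 0.5},
]
for observation in stream:
    retain_or_replace(memory, observation)
```

In this toy stream the first three observations fall within `radius` of each other and share a viewing direction, so only the most confident one is kept; the fourth is far away and is appended. Four observations therefore leave two pointers, and no feature vector is ever averaged.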

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the authors make directly.

The same pointer logic could be tested in visual SLAM pipelines to reduce drift without full bundle adjustment.
Extending the ray-direction check to handle moving objects might improve robustness in dynamic scenes.
The approach suggests a general pattern for any streaming reconstruction task where viewpoint consistency matters more than simple feature averaging.

Load-bearing premise

The retain-or-replace mechanism based on joint spatial and ray-direction reasoning can reliably distinguish local redundancy from novel observations and loop revisits without losing critical geometric information or introducing new errors.

What would settle it

A controlled test sequence containing known loop closures where the method either fails to refine pose (producing measurable drift) or incorrectly replaces a unique structure (producing visible holes or artifacts) compared with a fusion baseline.

Figures

Figures reproduced from arXiv: 2605.05749 by Chi Zhang, Feifei Li, Qi Song, Rui Huang.

**Figure 1.** Comparison of visualized results of Point3R, our proposed method, and Pseudo GT. Pseudo GT of dense 3D model is …

**Figure 2.** Overview of the proposed ray-aware pointer-based streaming reconstruction pipeline. Each incoming frame is …

**Figure 3.** Illustration of the pointer update results for a given frame using the …

**Figure 4.** Visualized results of reconstruction on datasets NRGBD and 7scenes.

**Figure 5.** Reserved Memory used by the merged method

Original abstract

Dense 3D reconstruction from continuous image streams requires both accurate geometric aggregation and stable long-term memory management. Recent feed-forward reconstruction frameworks integrate observations through persistent memory representations, yet most rely primarily on appearance-based similarity when updating memory. Such appearance-driven integration often leads to redundant accumulation of observations and unstable geometry when viewpoint changes occur. In this work, we propose a ray-aware pointer memory for streaming 3D reconstruction that explicitly models both spatial location and viewing direction within a unified memory representation. Each memory pointer stores its 3D position, associated ray direction, and feature embedding, allowing the system to reason jointly about geometric proximity and viewpoint consistency. Based on this representation, we introduce an adaptive pointer update strategy that replaces traditional fusion-based memory compression with a retain-or-replace mechanism. Instead of averaging nearby observations, the system selectively retains informative pointers while discarding redundant ones, preserving distinctive geometric structures while maintaining bounded memory growth. Furthermore, the joint reasoning over spatial distance and ray-direction discrepancy enables the system to distinguish between local redundancy, novel observations, and potential loop revisits in a unified manner. When loop candidates are detected, pose refinement is triggered to enforce global geometric consistency across the reconstruction. Extensive experiments demonstrate that the proposed ray-aware memory design significantly improves long-term reconstruction stability and camera pose accuracy while maintaining efficient streaming inference. Our approach provides a principled framework for scalable and drift-resistant online 3D reconstruction from image streams.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Desk Editor's Note · private letter to a colleague

Ray-aware pointers with retain-or-replace updates target drift in streaming reconstruction but the directional check looks vulnerable to pose noise.

The paper introduces a memory pointer that combines 3D position with ray direction for deciding updates in streaming 3D reconstruction. This replaces standard fusion with a retain-or-replace mechanism that uses both spatial and directional cues to manage what stays in memory. What is new is the explicit modeling of viewing direction to better identify when an observation is redundant, novel, or part of a loop.

The adaptive strategy keeps memory size in check without resorting to averaging, an operation that can blur details, and it links to pose refinement on detected loops. This seems like a practical step for long-term stability in continuous image streams. The work does well at identifying the limitations of appearance-based methods and proposing a geometric alternative that could reduce drift. The claims of improved reconstruction stability and pose accuracy are the kind that matter for applications like robotics.

A potential issue is whether the ray-direction discrepancy holds up under accumulating pose errors. Small drifts can make matching rays look different, risking either lost geometry or unchecked memory growth. The paper would be stronger with explicit analysis or ablations showing robustness to noise levels typical in streaming scenarios.

This paper is for people developing or extending streaming 3D systems who need better memory management. It has a clear technical contribution and experimental support, so it deserves a serious referee. I recommend sending it for peer review.

Referee Report

2 major / 2 minor

Summary. The manuscript proposes a ray-aware pointer memory for streaming 3D reconstruction from continuous image streams. Each memory pointer encodes 3D position, associated ray direction, and feature embedding. An adaptive retain-or-replace update replaces fusion-based compression; the joint spatial-plus-ray-direction criterion is used to classify local redundancy versus novel observations versus loop revisits, with pose refinement triggered on detected loops. The central claim is that this design yields improved long-term reconstruction stability and camera-pose accuracy while keeping memory bounded and inference efficient.

Significance. If the retain-or-replace rule proves robust, the approach would offer a concrete alternative to appearance-driven memory management in online reconstruction pipelines, with potential benefits for drift reduction in long sequences containing viewpoint changes and revisits. The explicit incorporation of ray direction into the memory representation is a targeted contribution that could influence subsequent work on streaming SLAM and dense mapping.

major comments (2)

[§3.2] §3.2 (Ray-aware pointer update): the retain-or-replace decision rests on a ray-direction discrepancy whose sensitivity to accumulated pose error is not bounded or analyzed; small drift can make rays from the same surface appear dissimilar (false novel) or dissimilar surfaces appear similar (false retention), directly undermining the stability and loop-revisit claims.
[§5] §5 (Experiments): quantitative results are reported without error bars, without an ablation that isolates the ray-direction term from the spatial term, and without explicit comparison against recent streaming baselines that also handle loops; this leaves the magnitude and reliability of the claimed gains on stability and pose accuracy difficult to assess.
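The sensitivity concern in the first major comment can be made concrete with a small numerical sketch. It models only a rotational pose error about one axis and ignores translation drift; the 5-degree `THRESHOLD` and all function names are hypothetical and are not taken from the paper.

```python
import math

def rotate_about_y(v, eps):
    """Rotate the 3-vector v by eps radians about the y axis."""
    x, y, z = v
    c, s = math.cos(eps), math.sin(eps)
    return (c * x + s * z, y, -s * x + c * z)

def angle_between(u, v):
    """Angle in radians between two unit vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    return math.acos(max(-1.0, min(1.0, dot)))

def discrepancy_under_drift(ray, eps):
    """Angle between a stored ray and the same ray re-observed
    through a pose whose rotation is off by eps radians."""
    return angle_between(ray, rotate_about_y(ray, eps))

THRESHOLD = math.radians(5.0)  # hypothetical redundancy threshold

def looks_novel(ray, eps, threshold=THRESHOLD):
    """True when drift alone pushes an unchanged surface point past the threshold."""
    return discrepancy_under_drift(ray, eps) > threshold

ray = (0.0, 0.0, 1.0)  # unit ray perpendicular to the drift axis (worst case)
for drift_deg in (1.0, 3.0, 8.0):
    eps = math.radians(drift_deg)
    print(drift_deg, math.degrees(discrepancy_under_drift(ray, eps)), looks_novel(ray, eps))
```

For a ray perpendicular to the rotation axis the discrepancy equals the drift angle itself, so any accumulated rotation error larger than the threshold makes an already-stored surface point look like a new viewpoint. A ray parallel to the axis is unaffected, which is why the effect depends on scene and camera geometry and would need the kind of analysis the referee requests.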

minor comments (2)

A diagram or pseudocode listing the exact retain/replace thresholds and the discrepancy metric would clarify the adaptive update procedure.
[Abstract] The abstract states performance gains but omits the datasets, metrics, and sequence lengths used; adding these would help readers gauge the scope of the evaluation.

Simulated Author's Rebuttal

2 responses · 0 unresolved

We thank the referee for the constructive feedback. We address each major comment below and indicate the revisions planned for the next manuscript version.

Referee: [§3.2] §3.2 (Ray-aware pointer update): the retain-or-replace decision rests on a ray-direction discrepancy whose sensitivity to accumulated pose error is not bounded or analyzed; small drift can make rays from the same surface appear dissimilar (false novel) or dissimilar surfaces appear similar (false retention), directly undermining the stability and loop-revisit claims.

Authors: We acknowledge that a formal sensitivity analysis of the ray-direction term under pose drift is absent from the current manuscript. The joint spatial-plus-ray criterion is intended to limit the impact of small errors by restricting comparisons to spatially proximate pointers, but we agree this does not constitute a rigorous bound. In the revised version we will add a short analysis in §3.2 deriving an upper bound on ray discrepancy given bounded pose error ε and surface normal variation, together with a brief empirical study on sequences with injected drift to quantify false-positive and false-negative rates.

revision: yes

Referee: [§5] §5 (Experiments): quantitative results are reported without error bars, without an ablation that isolates the ray-direction term from the spatial term, and without explicit comparison against recent streaming baselines that also handle loops; this leaves the magnitude and reliability of the claimed gains on stability and pose accuracy difficult to assess.

Authors: The referee correctly notes the lack of error bars, an isolating ablation, and comparisons to recent loop-aware streaming methods. We will revise §5 to include (i) error bars from five independent runs with different initialization seeds, (ii) an ablation table that removes the ray-direction term while keeping the spatial term fixed, and (iii) direct numerical comparisons against two recent streaming baselines that explicitly manage loop closures. These additions will be supported by the same evaluation protocol already used in the paper.

revision: yes

Circularity Check

0 steps flagged

No circularity: method is an algorithmic proposal validated externally

The paper proposes a ray-aware pointer memory representation that stores 3D position, ray direction, and features, together with a retain-or-replace update rule that classifies observations by joint spatial and directional discrepancy. No equations, derivations, or fitted parameters are presented that reduce the claimed stability or pose-accuracy improvements to the inputs by construction. The design choices are motivated by the limitations of appearance-only fusion and are evaluated through experiments on streaming reconstruction tasks; the central claims therefore rest on empirical demonstration rather than on self-definition or load-bearing self-citation.

Axiom & Free-Parameter Ledger

0 free parameters · 0 axioms · 1 invented entity

The central claim rests on the new ray-aware pointer representation and the adaptive retain-or-replace rule. No free parameters, standard axioms, or invented entities with independent evidence are specified in the abstract.

invented entities (1)

ray-aware pointer · no independent evidence
purpose: Unified memory unit storing 3D position, ray direction, and feature embedding for joint geometric and viewpoint reasoning
Introduced in the abstract as the core new representation enabling the adaptive update strategy.

pith-pipeline@v0.9.0 · 5787 in / 1175 out tokens · 47277 ms · 2026-05-22T09:50:55.570662+00:00 · methodology

Lean theorems connected to this paper

Citations machine-checked in the Pith Canon. Every link opens the source theorem in the public Lean library.

IndisputableMonolith/Foundation/AlexanderDuality.lean · alexander_duality_circle_linking · unclear

joint geometric distance metric … D(m_new, m_k) = λ_pos d_pos + λ_ang d_ang … Local redundancy … Loop revisit … Novel geometry

IndisputableMonolith/Foundation/Cost/FunctionalEquation.lean · washburn_uniqueness_aczel · unclear

retain-or-replace policy … stochastic … preserves informative pointers while discarding redundant ones
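The first quoted passage above gives the joint distance as D(m_new, m_k) = λ_pos d_pos + λ_ang d_ang and names three outcomes. A hedged sketch of how such a metric might drive a three-way decision follows. Only the weighted-sum form comes from the quoted passage; the definitions of d_pos (Euclidean distance) and d_ang (angle between unit rays), the default weights, the `near` and `tau` thresholds, and the order of the checks are assumptions, because the passage elides them.

```python
import math

def joint_distance(p_new, r_new, p_k, r_k, lam_pos=1.0, lam_ang=0.5):
    """Return (D, d_pos, d_ang) with D = lam_pos * d_pos + lam_ang * d_ang.

    d_pos: Euclidean distance between the two 3D positions (assumed).
    d_ang: angle in radians between the two unit ray directions (assumed).
    """
    d_pos = math.dist(p_new, p_k)
    dot = sum(a * b for a, b in zip(r_new, r_k))
    d_ang = math.acos(max(-1.0, min(1.0, dot)))
    return lam_pos * d_pos + lam_ang * d_ang, d_pos, d_ang

def classify(p_new, r_new, p_k, r_k, near=0.1, tau=0.2):
    """Hypothetical three-way rule built on the joint distance."""
    D, d_pos, _ = joint_distance(p_new, r_new, p_k, r_k)
    if d_pos > near:
        return "novel geometry"      # nothing stored nearby
    if D <= tau:
        return "local redundancy"    # nearby and seen from a similar direction
    return "loop revisit"            # nearby but seen from a different direction

stored_pos, stored_ray = (0.0, 0.0, 1.0), (0.0, 0.0, 1.0)
labels = [
    classify((0.0, 0.01, 1.0), (0.0, 0.0, 1.0), stored_pos, stored_ray),
    classify((0.0, 0.05, 1.0), (1.0, 0.0, 0.0), stored_pos, stored_ray),
    classify((2.0, 0.00, 1.0), (0.0, 0.0, 1.0), stored_pos, stored_ray),
]
```

Because d_pos and d_ang carry different units (length and radians), the weights λ_pos and λ_ang also act as unit conversions; their values would have to be tuned per dataset, and how the paper sets them is not stated on this page.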

What do these tags mean?

matches: The paper's claim is directly supported by a theorem in the formal canon.
supports: The theorem supports part of the paper's argument, but the paper may add assumptions or extra steps.
extends: The paper goes beyond the formal theorem; the theorem is a base layer rather than the whole result.
uses: The paper appears to rely on the theorem as machinery.
contradicts: The paper's claim conflicts with a theorem or certificate in the canon.
unclear: Pith found a possible connection, but the passage is too broad, indirect, or ambiguous to say the theorem truly supports the claim.

Reference graph

Works this paper leans on

47 extracted references · 47 canonical work pages · 4 internal anchors

[1] Sameer Agarwal, Yasutaka Furukawa, Noah Snavely, Ian Simon, Brian Curless, Steven M Seitz, and Richard Szeliski. 2011. Building rome in a day. Commun. ACM 54, 10 (2011), 105–112
[2] Sameer Agarwal, Noah Snavely, Steven M Seitz, and Richard Szeliski. 2010. Bundle adjustment in the large. In European conference on computer vision. Springer, 29–42
[3] Dejan Azinović, Ricardo Martin-Brualla, Dan B Goldman, Matthias Nießner, and Justus Thies. 2022. Neural rgb-d surface reconstruction. In Proceedings of the IEEE/CVF conference on computer vision and pattern recognition. 6290–6301
[4] Daniel J Butler, Jonas Wulff, Garrett B Stanley, and Michael J Black. 2012. A naturalistic open source movie for optical flow evaluation. In European conference on computer vision. Springer, 611–625
[5] Xingyu Chen, Yue Chen, Yuliang Xiu, Andreas Geiger, and Anpei Chen. 2025. Ttt3r: 3d reconstruction as test-time training. arXiv preprint arXiv:2509.26645 (2025)
[6–7] Zhuoguang Chen, Minghui Qin, Tianyuan Yuan, Zhe Liu, and Hang Zhao. Long3r: Long sequence streaming 3d reconstruction. In Proceedings of the IEEE/CVF International Conference on Computer Vision. 5273–5284
[8] Angela Dai, Angel X Chang, Manolis Savva, Maciej Halber, Thomas Funkhouser, and Matthias Nießner. 2017. Scannet: Richly-annotated 3d reconstructions of indoor scenes. In Proceedings of the IEEE conference on computer vision and pattern recognition. 5828–5839
[9] Kai Deng, Zexin Ti, Jiawei Xu, Jian Yang, and Jin Xie. 2025. VGGT-Long: Chunk it, Loop it, Align it–Pushing VGGT’s Limits on Kilometer-scale Long RGB Sequences. arXiv preprint arXiv:2507.16443 (2025)
[10] Mihai Dusmanu, Ignacio Rocco, Tomas Pajdla, Marc Pollefeys, Josef Sivic, Akihiko Torii, and Torsten Sattler. 2019. D2-net: A trainable cnn for joint description and detection of local features. In Proceedings of the IEEE/CVF conference on computer vision and pattern recognition. 8092–8101
[11] Qiancheng Fu, Qingshan Xu, Yew Soon Ong, and Wenbing Tao. 2022. Geo-neus: Geometry-consistent neural implicit surfaces learning for multi-view reconstruction. Advances in Neural Information Processing Systems 35 (2022), 3403–3416
[12] Andreas Geiger, Philip Lenz, Christoph Stiller, and Raquel Urtasun. 2013. Vision meets robotics: The kitti dataset. The international journal of robotics research 32, 11 (2013), 1231–1237
[13] Wen Jiang, Boshu Lei, and Kostas Daniilidis. 2024. Fisherrf: Active view selection and mapping with radiance fields using fisher information. In European Conference on Computer Vision. Springer, 422–440
[14–15] Bernhard Kerbl, Georgios Kopanas, Thomas Leimkühler, George Drettakis, et al. 3d gaussian splatting for real-time radiance field rendering. ACM Trans. Graph. 42, 4 (2023), 139–1
[16] Johannes Kopf, Xuejian Rong, and Jia-Bin Huang. 2021. Robust consistent video depth estimation. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition. 1611–1621
[17] Yushi Lan, Yihang Luo, Fangzhou Hong, Shangchen Zhou, Honghua Chen, Zhaoyang Lyu, Shuai Yang, Bo Dai, Chen Change Loy, and Xingang Pan. 2025. Stream3r: Scalable sequential 3d reconstruction with causal transformer. arXiv preprint arXiv:2508.10893 (2025)
[18] Vincent Leroy, Yohann Cabon, and Jérôme Revaud. 2024. Grounding image matching in 3d with mast3r. In European conference on computer vision. Springer, 71–91
[19] Feifei Li, Panwen Hu, Qi Song, and Rui Huang. 2024. Incremental 3D Reconstruction through a Hybrid Explicit-and-Implicit Representation. In 2024 IEEE International Conference on Robotics and Automation (ICRA). 15121–15127. doi:10.1109/ICRA57147.2024.10610868
[20] Philipp Lindenberger, Paul-Edouard Sarlin, and Marc Pollefeys. 2023. Lightglue: Local feature matching at light speed. In Proceedings of the IEEE/CVF international conference on computer vision. 17627–17638
[21] David G Lowe. 2004. Distinctive image features from scale-invariant keypoints. International journal of computer vision 60, 2 (2004), 91–110
[22] Dominic Maggio and Luca Carlone. 2026. VGGT-SLAM 2.0: Real time Dense Feed-forward Scene Reconstruction. arXiv preprint arXiv:2601.19887 (2026)
[23] Dominic Maggio, Hyungtae Lim, and Luca Carlone. 2025. Vggt-slam: Dense rgb slam optimized on the sl (4) manifold. arXiv preprint arXiv:2505.12549 (2025)
[24] Ben Mildenhall, Pratul P Srinivasan, Matthew Tancik, Jonathan T Barron, Ravi Ramamoorthi, and Ren Ng. 2021. Nerf: Representing scenes as neural radiance fields for view synthesis. Commun. ACM 65, 1 (2021), 99–106
[25] Emanuele Palazzolo, Jens Behley, Philipp Lottes, Philippe Giguere, and Cyrill Stachniss. 2019. ReFusion: 3D reconstruction in dynamic environments for RGB-D cameras exploiting residuals. In 2019 IEEE/RSJ International Conference on Intelligent Robots and Systems (IROS). IEEE, 7855–7862
[26] Ethan Rublee, Vincent Rabaud, Kurt Konolige, and Gary Bradski. 2011. ORB: An efficient alternative to SIFT or SURF. In 2011 International conference on computer vision. IEEE, 2564–2571
[27] Johannes L Schonberger and Jan-Michael Frahm. 2016. Structure-from-motion revisited. In Proceedings of the IEEE conference on computer vision and pattern recognition. 4104–4113
[28–29] Johannes L Schönberger, Enliang Zheng, Jan-Michael Frahm, and Marc Pollefeys. Pixelwise view selection for unstructured multi-view stereo. In European conference on computer vision. Springer, 501–518
[30] Jamie Shotton, Ben Glocker, Christopher Zach, Shahram Izadi, Antonio Criminisi, and Andrew Fitzgibbon. 2013. Scene coordinate regression forests for camera relocalization in RGB-D images. In Proceedings of the IEEE conference on computer vision and pattern recognition. 2930–2937
[31] Nathan Silberman, Derek Hoiem, Pushmeet Kohli, and Rob Fergus. 2012. Indoor segmentation and support inference from rgbd images. In European conference on computer vision. Springer, 746–760
[32] Jürgen Sturm, Nikolas Engelhard, Felix Endres, Wolfram Burgard, and Daniel Cremers. 2012. A benchmark for the evaluation of RGB-D SLAM systems. In 2012 IEEE/RSJ international conference on intelligent robots and systems. IEEE, 573–580
[33] Chris Sweeney, Torsten Sattler, Tobias Hollerer, Matthew Turk, and Marc Pollefeys. 2015. Optimizing the viewing graph for structure-from-motion. In Proceedings of the IEEE international conference on computer vision. 801–809
[34–35] Bill Triggs, Philip F McLauchlan, Richard I Hartley, and Andrew W Fitzgibbon. Bundle adjustment—a modern synthesis. In International workshop on vision algorithms. Springer, 298–372
[36] Hengyi Wang and Lourdes Agapito. 2025. 3d reconstruction with spatial memory. In 2025 International Conference on 3D Vision (3DV). IEEE, 78–89
[37] Jianyuan Wang, Minghao Chen, Nikita Karaev, Andrea Vedaldi, Christian Rupprecht, and David Novotny. 2025. Vggt: Visual geometry grounded transformer. In Proceedings of the Computer Vision and Pattern Recognition Conference. 5294–5306
[38] Qianqian Wang, Yifei Zhang, Aleksander Holynski, Alexei A Efros, and Angjoo Kanazawa. 2025. Continuous 3d perception model with persistent state. In Proceedings of the Computer Vision and Pattern Recognition Conference. 10510–10522
[39] Shuzhe Wang, Vincent Leroy, Yohann Cabon, Boris Chidlovskii, and Jerome Revaud. 2024. Dust3r: Geometric 3d vision made easy. In Proceedings of the IEEE/CVF conference on computer vision and pattern recognition. 20697–20709
[40] Yi Wei, Shaohui Liu, Yongming Rao, Wang Zhao, Jiwen Lu, and Jie Zhou. 2021. Nerfingmvs: Guided optimization of neural radiance fields for indoor multi-view stereo. In Proceedings of the IEEE/CVF international conference on computer vision. 5610–5619
[41] Changchang Wu. 2013. Towards linear-time incremental structure from motion. In 2013 International Conference on 3D Vision-3DV 2013. IEEE, 127–134
[42] Yuqi Wu, Wenzhao Zheng, Jie Zhou, and Jiwen Lu. 2025. Point3r: Streaming 3d reconstruction with explicit spatial pointer memory. arXiv preprint arXiv:2507.02863 (2025)
[43] Jianing Yang, Alexander Sax, Kevin J Liang, Mikael Henaff, Hao Tang, Ang Cao, Joyce Chai, Franziska Meier, and Matt Feiszli. 2025. Fast3r: Towards 3d reconstruction of 1000+ images in one forward pass. In Proceedings of the Computer Vision and Pattern Recognition Conference. 21924–21935
[44] Shuai Yuan, Yantai Yang, Xiaotian Yang, Xupeng Zhang, Zhonghao Zhao, Lingming Zhang, and Zhipeng Zhang. 2026. InfiniteVGGT: Visual Geometry Grounded Transformer for Endless Streams. arXiv preprint arXiv:2601.02281 (2026)
[45] Chi Zhang, Qi Song, Feifei Li, Jie Li, and Rui Huang. 2025. Improving Hierarchical Representations of Vectorized HD Maps with Perspective Clues. IEEE Robotics and Automation Letters (2025)
[46] Junyi Zhang, Charles Herrmann, Junhwa Hur, Varun Jampani, Trevor Darrell, Forrester Cole, Deqing Sun, and Ming-Hsuan Yang. 2024. Monst3r: A simple approach for estimating geometry in the presence of motion. arXiv preprint arXiv:2410.03825 (2024)
[47] Zhoutong Zhang, Forrester Cole, Zhengqi Li, Michael Rubinstein, Noah Snavely, and William T Freeman. 2022. Structure and motion from casual videos. In European Conference on Computer Vision. Springer, 20–37

[1] [1]

Sameer Agarwal, Yasutaka Furukawa, Noah Snavely, Ian Simon, Brian Curless, Steven M Seitz, and Richard Szeliski. 2011. Building rome in a day.Commun. ACM54, 10 (2011), 105–112

work page 2011

[2] [2]

Sameer Agarwal, Noah Snavely, Steven M Seitz, and Richard Szeliski. 2010. Bundle adjustment in the large. InEuropean conference on computer vision. Springer, 29– 42

work page 2010

[3] [3]

Dejan Azinović, Ricardo Martin-Brualla, Dan B Goldman, Matthias Nießner, and Justus Thies. 2022. Neural rgb-d surface reconstruction. InProceedings of the IEEE/CVF conference on computer vision and pattern recognition. 6290–6301

work page 2022

[4] [4]

Daniel J Butler, Jonas Wulff, Garrett B Stanley, and Michael J Black. 2012. A naturalistic open source movie for optical flow evaluation. InEuropean conference on computer vision. Springer, 611–625

work page 2012

[5] [5]

Xingyu Chen, Yue Chen, Yuliang Xiu, Andreas Geiger, and Anpei Chen. 2025. Ttt3r: 3d reconstruction as test-time training.arXiv preprint arXiv:2509.26645 (2025)

work page internal anchor Pith review Pith/arXiv arXiv 2025

[6] [6]

Zhuoguang Chen, Minghui Qin, Tianyuan Yuan, Zhe Liu, and Hang Zhao

work page

[7] [7]

InProceedings of the IEEE/CVF International Conference on Computer Vision

Long3r: Long sequence streaming 3d reconstruction. InProceedings of the IEEE/CVF International Conference on Computer Vision. 5273–5284

work page

[8] [8]

Angela Dai, Angel X Chang, Manolis Savva, Maciej Halber, Thomas Funkhouser, and Matthias Nießner. 2017. Scannet: Richly-annotated 3d reconstructions of indoor scenes. InProceedings of the IEEE conference on computer vision and pattern recognition. 5828–5839

work page 2017

[9] [9]

Kai Deng, Zexin Ti, Jiawei Xu, Jian Yang, and Jin Xie. 2025. VGGT-Long: Chunk it, Loop it, Align it–Pushing VGGT’s Limits on Kilometer-scale Long RGB Sequences. arXiv preprint arXiv:2507.16443(2025)

work page internal anchor Pith review Pith/arXiv arXiv 2025

[10] [10]

Mihai Dusmanu, Ignacio Rocco, Tomas Pajdla, Marc Pollefeys, Josef Sivic, Akihiko Torii, and Torsten Sattler. 2019. D2-net: A trainable cnn for joint description and detection of local features. InProceedings of the ieee/cvf conference on computer vision and pattern recognition. 8092–8101

work page 2019

[11] [11]

Qiancheng Fu, Qingshan Xu, Yew Soon Ong, and Wenbing Tao. 2022. Geo-neus: Geometry-consistent neural implicit surfaces learning for multi-view reconstruc- tion.Advances in Neural Information Processing Systems35 (2022), 3403–3416

work page 2022

[12] [12]

Andreas Geiger, Philip Lenz, Christoph Stiller, and Raquel Urtasun. 2013. Vision meets robotics: The kitti dataset.The international journal of robotics research32, 11 (2013), 1231–1237

work page 2013

[13] [13]

Wen Jiang, Boshu Lei, and Kostas Daniilidis. 2024. Fisherrf: Active view selec- tion and mapping with radiance fields using fisher information. InEuropean Conference on Computer Vision. Springer, 422–440

work page 2024

[14] [14]

Bernhard Kerbl, Georgios Kopanas, Thomas Leimkühler, George Drettakis, et al

work page

[15] [15]

Graph.42, 4 (2023), 139–1

3d gaussian splatting for real-time radiance field rendering.ACM Trans. Graph.42, 4 (2023), 139–1

work page 2023

[16] [16]

Johannes Kopf, Xuejian Rong, and Jia-Bin Huang. 2021. Robust consistent video depth estimation. InProceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition. 1611–1621

work page 2021

[17] [17]

Yushi Lan, Yihang Luo, Fangzhou Hong, Shangchen Zhou, Honghua Chen, Zhaoyang Lyu, Shuai Yang, Bo Dai, Chen Change Loy, and Xingang Pan. 2025. Stream3r: Scalable sequential 3d reconstruction with causal transformer.arXiv preprint arXiv:2508.10893(2025)

work page arXiv 2025

[18] [18]

Vincent Leroy, Yohann Cabon, and Jérôme Revaud. 2024. Grounding image matching in 3d with mast3r. InEuropean conference on computer vision. Springer, 71–91

work page 2024

[19] [19]

Feifei Li, Panwen Hu, Qi Song, and Rui Huang. 2024. Incremental 3D Re- construction through a Hybrid Explicit-and-Implicit Representation. In2024 IEEE International Conference on Robotics and Automation (ICRA). 15121–15127. doi:10.1109/ICRA57147.2024.10610868

work page doi:10.1109/icra57147.2024.10610868 2024

[20] [20]

Philipp Lindenberger, Paul-Edouard Sarlin, and Marc Pollefeys. 2023. Lightglue: Local feature matching at light speed. InProceedings of the IEEE/CVF international conference on computer vision. 17627–17638

work page 2023

[21] [21]

David G Lowe. 2004. Distinctive image features from scale-invariant keypoints. International journal of computer vision60, 2 (2004), 91–110

work page 2004

[22] [22]

Dominic Maggio and Luca Carlone. 2026. VGGT-SLAM 2.0: Real time Dense Feed-forward Scene Reconstruction.arXiv preprint arXiv:2601.19887(2026)

work page arXiv 2026

[23] [23]

Dominic Maggio, Hyungtae Lim, and Luca Carlone. 2025. Vggt-slam: Dense rgb slam optimized on the sl (4) manifold.arXiv preprint arXiv:2505.12549(2025)

work page internal anchor Pith review Pith/arXiv arXiv 2025

[24] [24]

Ben Mildenhall, Pratul P Srinivasan, Matthew Tancik, Jonathan T Barron, Ravi Ramamoorthi, and Ren Ng. 2021. Nerf: Representing scenes as neural radiance fields for view synthesis.Commun. ACM65, 1 (2021), 99–106

work page 2021

[25] [25]

Emanuele Palazzolo, Jens Behley, Philipp Lottes, Philippe Giguere, and Cyrill Stachniss. 2019. ReFusion: 3D reconstruction in dynamic environments for RGB- D cameras exploiting residuals. In2019 IEEE/RSJ International Conference on Intelligent Robots and Systems (IROS). IEEE, 7855–7862

work page 2019

[26] [26]

Ethan Rublee, Vincent Rabaud, Kurt Konolige, and Gary Bradski. 2011. ORB: An efficient alternative to SIFT or SURF. In2011 International conference on computer vision. Ieee, 2564–2571

work page 2011

[27] [27]

Johannes L Schonberger and Jan-Michael Frahm. 2016. Structure-from-motion revisited. InProceedings of the IEEE conference on computer vision and pattern recognition. 4104–4113

[28]

Johannes L Schönberger, Enliang Zheng, Jan-Michael Frahm, and Marc Pollefeys. 2016. Pixelwise view selection for unstructured multi-view stereo. In European conference on computer vision. Springer, 501–518

[30]

Jamie Shotton, Ben Glocker, Christopher Zach, Shahram Izadi, Antonio Criminisi, and Andrew Fitzgibbon. 2013. Scene coordinate regression forests for camera relocalization in RGB-D images. In Proceedings of the IEEE conference on computer vision and pattern recognition. 2930–2937

[31]

Nathan Silberman, Derek Hoiem, Pushmeet Kohli, and Rob Fergus. 2012. Indoor segmentation and support inference from rgbd images. In European conference on computer vision. Springer, 746–760

[32]

Jürgen Sturm, Nikolas Engelhard, Felix Endres, Wolfram Burgard, and Daniel Cremers. 2012. A benchmark for the evaluation of RGB-D SLAM systems. In 2012 IEEE/RSJ international conference on intelligent robots and systems. IEEE, 573–580

[33]

Chris Sweeney, Torsten Sattler, Tobias Hollerer, Matthew Turk, and Marc Pollefeys. 2015. Optimizing the viewing graph for structure-from-motion. In Proceedings of the IEEE international conference on computer vision. 801–809

[34]

Bill Triggs, Philip F McLauchlan, Richard I Hartley, and Andrew W Fitzgibbon. 1999. Bundle adjustment—a modern synthesis. In International workshop on vision algorithms. Springer, 298–372

[36]

Hengyi Wang and Lourdes Agapito. 2025. 3d reconstruction with spatial memory. In 2025 International Conference on 3D Vision (3DV). IEEE, 78–89

[37]

Jianyuan Wang, Minghao Chen, Nikita Karaev, Andrea Vedaldi, Christian Rupprecht, and David Novotny. 2025. Vggt: Visual geometry grounded transformer. In Proceedings of the Computer Vision and Pattern Recognition Conference. 5294–5306

[38]

Qianqian Wang, Yifei Zhang, Aleksander Holynski, Alexei A Efros, and Angjoo Kanazawa. 2025. Continuous 3d perception model with persistent state. In Proceedings of the Computer Vision and Pattern Recognition Conference. 10510–10522

[39]

Shuzhe Wang, Vincent Leroy, Yohann Cabon, Boris Chidlovskii, and Jerome Revaud. 2024. Dust3r: Geometric 3d vision made easy. In Proceedings of the IEEE/CVF conference on computer vision and pattern recognition. 20697–20709

[40]

Yi Wei, Shaohui Liu, Yongming Rao, Wang Zhao, Jiwen Lu, and Jie Zhou. 2021. Nerfingmvs: Guided optimization of neural radiance fields for indoor multi-view stereo. In Proceedings of the IEEE/CVF international conference on computer vision. 5610–5619

[41]

Changchang Wu. 2013. Towards linear-time incremental structure from motion. In 2013 International Conference on 3D Vision-3DV 2013. IEEE, 127–134

[42]

Yuqi Wu, Wenzhao Zheng, Jie Zhou, and Jiwen Lu. 2025. Point3r: Streaming 3d reconstruction with explicit spatial pointer memory. arXiv preprint arXiv:2507.02863 (2025)

[43]

Jianing Yang, Alexander Sax, Kevin J Liang, Mikael Henaff, Hao Tang, Ang Cao, Joyce Chai, Franziska Meier, and Matt Feiszli. 2025. Fast3r: Towards 3d reconstruction of 1000+ images in one forward pass. In Proceedings of the Computer Vision and Pattern Recognition Conference. 21924–21935

[44]

Shuai Yuan, Yantai Yang, Xiaotian Yang, Xupeng Zhang, Zhonghao Zhao, Lingming Zhang, and Zhipeng Zhang. 2026. InfiniteVGGT: Visual Geometry Grounded Transformer for Endless Streams. arXiv preprint arXiv:2601.02281 (2026)

[45]

Chi Zhang, Qi Song, Feifei Li, Jie Li, and Rui Huang. 2025. Improving Hierarchical Representations of Vectorized HD Maps with Perspective Clues. IEEE Robotics and Automation Letters (2025)

[46]

Junyi Zhang, Charles Herrmann, Junhwa Hur, Varun Jampani, Trevor Darrell, Forrester Cole, Deqing Sun, and Ming-Hsuan Yang. 2024. Monst3r: A simple approach for estimating geometry in the presence of motion. arXiv preprint arXiv:2410.03825 (2024)

[47]

Zhoutong Zhang, Forrester Cole, Zhengqi Li, Michael Rubinstein, Noah Snavely, and William T Freeman. 2022. Structure and motion from casual videos. In European Conference on Computer Vision. Springer, 20–37
