pith. machine review for the scientific record.

arxiv: 2604.15237 · v2 · submitted 2026-04-16 · 💻 cs.CV

Recognition: unknown

StreamCacheVGGT: Streaming Visual Geometry Transformers with Robust Scoring and Hybrid Cache Compression

Authors on Pith: no claims yet

Pith reviewed 2026-05-10 11:12 UTC · model grok-4.3

classification 💻 cs.CV
keywords streaming 3D reconstruction · visual geometry transformer · cache compression · cross-layer scoring · constant memory · video geometry · token merging · hybrid cache

The pith

StreamCacheVGGT maintains 3D reconstruction quality from video by scoring tokens across layers and merging rather than deleting them.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

Reconstructing dense 3D geometry from continuous video requires stable inference within a fixed memory budget. Pure eviction methods destroy information through outright token deletion guided by noisy single-layer scores. The proposed system tracks token importance trajectories across the full transformer hierarchy and applies order statistics to find tokens with lasting geometric value. It then uses a three-tier process to merge moderately important tokens into retained ones by nearest-neighbor assignment on the key-vector manifold. This keeps essential context that would otherwise be lost while staying strictly within constant-cost limits.
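The constant-budget loop this paragraph describes can be sketched as follows; the function names, the toy token representation, and the keep-top-`budget` policy are illustrative assumptions, not the paper's implementation:

```python
def stream_with_budget(frames, encode, score, compress, budget):
    """Toy constant-memory streaming loop (assumed shape, not the paper's).

    Tokens from each new frame join the cache; whenever the cache
    exceeds the fixed budget, a scoring + compression pass shrinks it
    back, so memory stays O(1) in the number of frames processed.
    """
    cache = []
    for frame in frames:
        cache.extend(encode(frame))
        if len(cache) > budget:
            cache = compress(cache, score(cache))[:budget]
    return cache

# Toy usage: integer "tokens", importance = the token value itself.
final = stream_with_budget(
    frames=range(5),
    encode=lambda f: [3 * f, 3 * f + 1, 3 * f + 2],
    score=lambda cache: cache,
    compress=lambda cache, s: [t for _, t in sorted(zip(s, cache), reverse=True)],
    budget=4,
)
# the cache never exceeds the budget after a compression pass
```

The real system replaces the placeholder `score` and `compress` with CLCES and HCC respectively; the point of the sketch is only the invariant that cache size is bounded regardless of stream length.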

Core claim

By replacing binary eviction with cross-layer consistency-enhanced scoring and hybrid cache compression that merges tokens via nearest-neighbor assignment on the key-vector manifold, the framework preserves geometric salience across long video streams and achieves higher reconstruction accuracy on five benchmarks while enforcing constant memory use.

What carries the argument

Cross-Layer Consistency-Enhanced Scoring (CLCES) that tracks sustained salience across layers combined with Hybrid Cache Compression (HCC) that performs three-tier triage and nearest-neighbor merging on the key-vector manifold.
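The abstract gives no equations for CLCES; a minimal sketch of what order-statistics scoring across layers could look like (the input shape and the choice of median are assumptions, not the paper's definition):

```python
import numpy as np

def cross_layer_scores(attn_per_layer):
    """Toy cross-layer consistency score (assumed form, not the paper's).

    attn_per_layer has shape (L, T): a per-layer importance estimate
    for each of T cached tokens, e.g. the attention mass a token
    receives in each of L transformer layers.  A robust order
    statistic -- here the median across layers -- rewards sustained
    salience and downweights tokens that spike in a single noisy layer.
    """
    return np.median(np.asarray(attn_per_layer, dtype=float), axis=0)

# Token 0 is salient in every layer; token 1 spikes only in layer 2.
layer_scores = [
    [0.9, 0.1, 0.2],  # layer 1
    [0.8, 0.9, 0.1],  # layer 2 (noisy spike on token 1)
    [0.9, 0.2, 0.2],  # layer 3
]
scores = cross_layer_scores(layer_scores)
# the median keeps token 0 on top and suppresses token 1's spike
```

A single-layer scorer would rank token 1 above token 2 here; the cross-layer statistic does not, which is the failure mode the paper attributes to localized scoring.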

If this is right

  • Higher reconstruction accuracy and long-term stability on 7-Scenes, NRGBD, ETH3D, Bonn, and KITTI.
  • Strict adherence to constant memory and compute budgets without any training.
  • Reduced information loss compared with pure eviction that deletes tokens outright.
  • Robust scores derived from order-statistical analysis across the transformer hierarchy.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

  • The same merging strategy on the key-vector manifold could be tested on other streaming transformer applications such as video object tracking or scene flow estimation.
  • Cross-layer consistency tracking may reduce sensitivity to single-layer activation noise in related vision transformers that process sequential data.
  • The three-tier triage could be adapted to trade off accuracy against memory in real-time robotics pipelines that must run indefinitely.

Load-bearing premise

Merging tokens by nearest-neighbor similarity in key space will preserve the geometric information needed for accurate 3D reconstruction without adding new distortions.
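A minimal sketch of what nearest-neighbor merging in key space could look like (Euclidean distance and mean-pooling are assumptions; the paper may weight or normalize differently):

```python
import numpy as np

def merge_into_anchors(keys, values, anchor_idx, merge_idx):
    """Toy key-space token merging (assumed form, not the paper's).

    Each token marked for merging is assigned to the retained anchor
    whose key vector is nearest, then every anchor's key/value pair is
    replaced by the mean of its group -- information is folded into
    survivors rather than deleted outright.
    """
    keys, values = np.asarray(keys, float), np.asarray(values, float)
    groups = {a: [a] for a in anchor_idx}
    for m in merge_idx:
        dists = [np.linalg.norm(keys[m] - keys[a]) for a in anchor_idx]
        groups[anchor_idx[int(np.argmin(dists))]].append(m)
    new_keys = np.stack([keys[g].mean(axis=0) for g in groups.values()])
    new_vals = np.stack([values[g].mean(axis=0) for g in groups.values()])
    return new_keys, new_vals

# Token 2 sits near anchor 0 in key space, so it merges into anchor 0.
k = [[0.0, 0.0], [10.0, 10.0], [0.5, 0.0]]
v = [[1.0], [2.0], [3.0]]
new_k, new_v = merge_into_anchors(k, v, anchor_idx=[0, 1], merge_idx=[2])
```

The premise under test is visible in the sketch: averaging in key/value space preserves information only to the extent that nearby keys really do carry interchangeable geometric content.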

What would settle it

A controlled test on long video sequences where reconstruction error rises or geometric fidelity drops when hybrid merging is enabled compared with a larger-memory baseline that simply retains all tokens.

read the original abstract

Reconstructing dense 3D geometry from continuous video streams requires stable inference under a constant memory budget. Existing $O(1)$ frameworks primarily rely on a "pure eviction" paradigm, which suffers from significant information destruction due to binary token deletion and evaluation noise from localized, single-layer scoring. To address these bottlenecks, we propose StreamCacheVGGT, a training-free framework that reimagines cache management through two synergistic modules: Cross-Layer Consistency-Enhanced Scoring (CLCES) and Hybrid Cache Compression (HCC). CLCES mitigates activation noise by tracking token importance trajectories across the Transformer hierarchy, employing order-statistical analysis to identify sustained geometric salience. Leveraging these robust scores, HCC transcends simple eviction by introducing a three-tier triage strategy that merges moderately important tokens into retained anchors via nearest-neighbor assignment on the key-vector manifold. This approach preserves essential geometric context that would otherwise be lost. Extensive evaluations on five benchmarks (7-Scenes, NRGBD, ETH3D, Bonn, and KITTI) demonstrate that StreamCacheVGGT sets a new state-of-the-art, delivering superior reconstruction accuracy and long-term stability while strictly adhering to constant-cost constraints.
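The three-tier triage the abstract describes can be sketched as a quantile split over the robust scores; the tier fractions are invented for illustration and are not the paper's values:

```python
import numpy as np

def three_tier_triage(scores, keep_frac=0.25, merge_frac=0.5):
    """Toy three-tier split (assumed fractions, not the paper's).

    The top-scoring tokens are retained as anchors, a middle band is
    marked for merging into those anchors, and the remainder is
    evicted -- only the lowest tier loses its information outright.
    """
    order = np.argsort(scores)[::-1]              # highest score first
    n_keep = max(1, int(len(scores) * keep_frac))
    n_merge = int(len(scores) * merge_frac)
    keep = order[:n_keep]
    merge = order[n_keep:n_keep + n_merge]
    evict = order[n_keep + n_merge:]
    return keep, merge, evict

keep, merge, evict = three_tier_triage([0.9, 0.2, 0.5, 0.1])
# only the weakest token is evicted; the middle band survives by merging
```

Contrast with pure eviction, which would delete both lower tiers; here the middle band's content is redirected into anchors instead.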

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Referee Report

2 major / 1 minor

Summary. The paper proposes StreamCacheVGGT, a training-free framework for constant-memory streaming 3D geometry reconstruction from video using Visual Geometry Transformers. It introduces Cross-Layer Consistency-Enhanced Scoring (CLCES) to track token importance across layers via order statistics and Hybrid Cache Compression (HCC) with a three-tier triage that merges moderately important tokens via nearest-neighbor assignment on the key-vector manifold, claiming SOTA accuracy and long-term stability on the 7-Scenes, NRGBD, ETH3D, Bonn, and KITTI benchmarks while avoiding binary eviction.

Significance. If the central claims hold, the work would offer a practical advance for online dense reconstruction under fixed memory budgets by replacing pure eviction with context-preserving compression; the training-free design and explicit focus on cross-layer geometric salience are strengths that could influence follow-on systems in robotics and AR.

major comments (2)
  1. The SOTA claim on five benchmarks is load-bearing for the contribution, yet the abstract (and by extension the evaluation section) provides no quantitative metrics, tables, error bars, or direct comparisons to prior constant-memory baselines, preventing verification of the asserted gains in reconstruction accuracy and stability.
  2. HCC section (three-tier triage and nearest-neighbor merging on the key-vector manifold): the premise that proximity in key space implies preservation of 3D geometric context for moderately important tokens is untested. No ablation, failure-mode analysis, or comparison to pure-eviction baselines shows that merging does not distort camera poses or point clouds beyond what CLCES already filters, which directly undermines the 'superior accuracy' result.
minor comments (1)
  1. Notation for CLCES order-statistical analysis and HCC manifold distance is introduced without explicit equations or pseudocode, making the heuristics hard to reproduce.

Simulated Authors' Rebuttal

2 responses · 0 unresolved

We thank the referee for the constructive feedback on our manuscript. We address each major comment point by point below and outline the revisions we will make to strengthen the presentation and evidence.

read point-by-point responses
  1. Referee: The SOTA claim on five benchmarks is load-bearing for the contribution, yet the abstract (and by extension the evaluation section) provides no quantitative metrics, tables, error bars, or direct comparisons to prior constant-memory baselines, preventing verification of the asserted gains in reconstruction accuracy and stability.

    Authors: We agree that the abstract omits specific quantitative metrics, which limits immediate verifiability of the SOTA claims. The evaluation section does contain tables reporting reconstruction accuracy, stability metrics, and comparisons against prior constant-memory baselines across the five benchmarks (7-Scenes, NRGBD, ETH3D, Bonn, KITTI), including results aggregated over multiple runs. To address the concern directly, we will revise the abstract to include key numerical results and error bars summarizing the gains. This is a presentation clarification rather than an absence of supporting data. revision: yes

  2. Referee: HCC section (three-tier triage and nearest-neighbor merging on key-vector manifold): the premise that proximity in key space equals preservation of 3D geometric context for moderately important tokens is untested; no ablation, failure-mode analysis, or comparison to pure-eviction baselines shows that merging does not distort camera poses or point clouds beyond what CLCES already filters, directly undermining the 'superior accuracy' result.

    Authors: We acknowledge that the manuscript does not provide dedicated ablations isolating the nearest-neighbor merging step on the key-vector manifold, nor explicit failure-mode analysis of potential distortions to camera poses or point clouds. The reported superior accuracy is based on end-to-end benchmark results versus pure-eviction baselines, but we agree this does not fully isolate the contribution of the merging operation. We will add targeted ablation studies and distortion analysis (including pose and point-cloud metrics) in the revised manuscript and supplementary material to directly test the premise. revision: yes

Circularity Check

0 steps flagged

No circularity: training-free heuristics with external benchmark validation

full rationale

The paper describes a training-free framework consisting of explicit heuristic modules (CLCES for cross-layer scoring via order statistics and HCC for three-tier merging on the key-vector manifold). No parameters are fitted to the target reconstruction data, no predictions reduce to fitted inputs by construction, and no load-bearing claims rely on self-citations or imported uniqueness theorems. Performance is asserted via direct evaluation on five independent external benchmarks (7-Scenes, NRGBD, ETH3D, Bonn, KITTI), keeping the derivation chain self-contained against those benchmarks rather than tautological.

Axiom & Free-Parameter Ledger

0 free parameters · 2 axioms · 0 invented entities

The approach rests on domain assumptions about geometric salience being consistent across transformer layers and that nearest-neighbor merging on key vectors preserves 3D context; no free parameters or new entities are explicitly introduced in the abstract.

axioms (2)
  • domain assumption Token importance trajectories across the Transformer hierarchy reliably indicate sustained geometric salience.
    Invoked in the description of CLCES to justify order-statistical analysis over single-layer scoring.
  • domain assumption Nearest-neighbor assignment on the key-vector manifold merges moderately important tokens without destroying essential geometric information.
    Central to the HCC three-tier triage strategy.

pith-pipeline@v0.9.0 · 5531 in / 1264 out tokens · 26056 ms · 2026-05-10T11:12:32.543758+00:00 · methodology

discussion (0)


Reference graph

Works this paper leans on

51 extracted references · 9 canonical work pages · 2 internal anchors

  1. [1]

    Grounding image matching in 3d with mast3r

    Vincent Leroy, Yohann Cabon, and Jérôme Revaud. Grounding image matching in 3d with mast3r. pages 71–91, 2024

  2. [2]

    Dust3r: Geometric 3d vision made easy

    Shuzhe Wang, Vincent Leroy, Yohann Cabon, Boris Chidlovskii, and Jerome Revaud. Dust3r: Geometric 3d vision made easy. In Proceedings of the IEEE/CVF conference on computer vision and pattern recognition, pages 20697–20709, 2024

  3. [3]

    Vggt: Visual geometry grounded transformer

    Jianyuan Wang, Minghao Chen, Nikita Karaev, Andrea Vedaldi, Christian Rupprecht, and David Novotny. Vggt: Visual geometry grounded transformer. pages 5294–5306, 2025

  4. [4]

    Fast3r: Towards 3d reconstruction of 1000+ images in one forward pass

    Jianing Yang, Alexander Sax, Kevin J Liang, Mikael Henaff, Hao Tang, Ang Cao, Joyce Chai, Franziska Meier, and Matt Feiszli. Fast3r: Towards 3d reconstruction of 1000+ images in one forward pass. pages 21924–21935, 2025

  5. [5]

    OVGGT: O(1) Constant-Cost Streaming Visual Geometry Transformer

    Si-Yu Lu, Po-Ting Chen, Hui-Che Hsu, Sin-Ye Jhong, Wen-Huang Cheng, and Yung-Yao Chen. OVGGT: O(1) constant-cost streaming visual geometry transformer. arXiv preprint arXiv:2603.05959, 2026

  6. [6]

    Streaming 4d visual geometry transformer

    Dong Zhuo, Wenzhao Zheng, Jiahe Guo, Yuqi Wu, Jie Zhou, and Jiwen Lu. Streaming 4d visual geometry transformer. arXiv preprint arXiv:2507.11539, 2025

  7. [7]

    Flashattention: Fast and memory-efficient exact attention with io-awareness

    Tri Dao, Dan Fu, Stefano Ermon, Atri Rudra, and Christopher Ré. Flashattention: Fast and memory-efficient exact attention with io-awareness. volume 35, pages 16344–16359, 2022

  8. [8]

    Feed-forward neural networks

    George Bebis and Michael Georgiopoulos. Feed-forward neural networks. IEEE Potentials, 13(4):27–31, 2002

  9. [9]

    Token merging: Your vit but faster

    Daniel Bolya, Cheng-Yang Fu, Xiaoliang Dai, Peizhao Zhang, Christoph Feichtenhofer, and Judy Hoffman. Token merging: Your vit but faster. 2022

  10. [10]

    Map-free visual relocalization: Metric pose relative to a single image

    Eduardo Arnold, Jamie Wynn, Sara Vicente, Guillermo Garcia-Hernando, Aron Monszpart, Victor Prisacariu, Daniyar Turmukhambetov, and Eric Brachmann. Map-free visual relocalization: Metric pose relative to a single image. In European Conference on Computer Vision, pages 690–708. Springer, 2022

  11. [11]

    Llafs: When large language models meet few-shot segmentation

    Lanyun Zhu, Tianrun Chen, Deyi Ji, Jieping Ye, and Jun Liu. Llafs: When large language models meet few-shot segmentation. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 3065–3075, 2024

  12. [12]

    Deepmvs: Learning multi-view stereopsis

    Po-Han Huang, Kevin Matzen, Johannes Kopf, Narendra Ahuja, and Jia-Bin Huang. Deepmvs: Learning multi-view stereopsis. In Proceedings of the IEEE conference on computer vision and pattern recognition, pages 2821–2830, 2018

  13. [13]

    Skysense-o: Towards open-world remote sensing interpretation with vision-centric visual-language modeling

    Qi Zhu, Jiangwei Lao, Deyi Ji, Junwei Luo, Kang Wu, Yingying Zhang, Lixiang Ru, Jian Wang, Jingdong Chen, Ming Yang, et al. Skysense-o: Towards open-world remote sensing interpretation with vision-centric visual-language modeling. In IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), 2025

  14. [14]

    Ibd: Alleviating hallucinations in large vision-language models via image-biased decoding

    Lanyun Zhu, Deyi Ji, Tianrun Chen, Peng Xu, Jieping Ye, and Jun Liu. Ibd: Alleviating hallucinations in large vision-language models via image-biased decoding. In IEEE/CVF Conference on Computer Vision and Pattern Recognition, 2024

  15. [15]

    Structural and statistical texture knowledge distillation and learning for segmentation

    Deyi Ji, Feng Zhao, Hongtao Lu, Feng Wu, and Jieping Ye. Structural and statistical texture knowledge distillation and learning for segmentation. IEEE Transactions on Pattern Analysis and Machine Intelligence, 47(5):3639–3656, 2025

  16. [16]

    Discrete latent perspective learning for segmentation and detection

    Deyi Ji, Feng Zhao, Lanyun Zhu, Wenwei Jin, Hongtao Lu, and Jieping Ye. Discrete latent perspective learning for segmentation and detection. In International Conference on Machine Learning, pages 21719–21730, 2024

  17. [17]

    Not every patch is needed: Towards a more efficient and effective backbone for video-based person re-identification

    Lanyun Zhu, Tianrun Chen, Deyi Ji, Jieping Ye, and Jun Liu. Not every patch is needed: Towards a more efficient and effective backbone for video-based person re-identification. IEEE Transactions on Image Processing, 2025

  18. [18]

    Fastvggt: Training-free acceleration of visual geometry transformer

    You Shen, Zhipeng Zhang, Yansong Qu, Xiawu Zheng, Jiayi Ji, Shengchuan Zhang, and Liujuan Cao. Fastvggt: Training-free acceleration of visual geometry transformer. arXiv preprint arXiv:2509.02560, 2025

  19. [19]

    Ultra-high resolution segmentation with ultra-rich context: A novel benchmark

    Deyi Ji, Feng Zhao, Hongtao Lu, Mingyuan Tao, and Jieping Ye. Ultra-high resolution segmentation with ultra-rich context: A novel benchmark. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 23621–23630, 2023

  20. [20]

    Structural and statistical texture knowledge distillation for semantic segmentation

    Deyi Ji, Haoran Wang, Mingyuan Tao, Jianqiang Huang, Xian-Sheng Hua, and Hongtao Lu. Structural and statistical texture knowledge distillation for semantic segmentation. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 16876–16885, 2022

  21. [21]

    Learning statistical texture for semantic segmentation

    Lanyu Zhu, Deyi Ji, Shiping Zhu, Weihao Gan, Wei Wu, and Junjie Yan. Learning statistical texture for semantic segmentation. In IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), 2021

  22. [22]

    Common objects in 3d: Large-scale learning and evaluation of real-life 3d category reconstruction

    Jeremy Reizenstein, Roman Shapovalov, Philipp Henzler, Luca Sbordone, Patrick Labatut, and David Novotny. Common objects in 3d: Large-scale learning and evaluation of real-life 3d category reconstruction. In Proceedings of the IEEE/CVF international conference on computer vision, pages 10901–10911, 2021

  23. [23]

    Pptformer: Pseudo multi-perspective transformer for uav segmentation

    Deyi Ji, Wenwei Jin, Hongtao Lu, and Feng Zhao. Pptformer: Pseudo multi-perspective transformer for uav segmentation. International Joint Conference on Artificial Intelligence, pages 893–901, 2024

  24. [24]

    Megadepth: Learning single-view depth prediction from internet photos

    Zhengqi Li and Noah Snavely. Megadepth: Learning single-view depth prediction from internet photos. In Proceedings of the IEEE conference on computer vision and pattern recognition, pages 2041–2050, 2018

  25. [25]

    Popen: Preference-based optimization and ensemble for lvlm-based reasoning segmentation

    Lanyun Zhu, Tianrun Chen, Qianxiong Xu, Xuanyi Liu, Deyi Ji, Haiyang Wu, De Wen Soh, and Jun Liu. Popen: Preference-based optimization and ensemble for lvlm-based reasoning segmentation. In IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), 2025

  26. [26]

    Retrv-r1: A reasoning-driven mllm framework for universal and efficient multimodal retrieval

    Lanyun Zhu, Deyi Ji, Tianrun Chen, Haiyang Wu, and Shiqi Wang. Retrv-r1: A reasoning-driven mllm framework for universal and efficient multimodal retrieval. Neural Information Processing Systems (NeurIPS), 2025

  27. [27]

    Guided patch-grouping wavelet transformer with spatial congruence for ultra-high resolution segmentation

    Deyi Ji, Feng Zhao, and Hongtao Lu. Guided patch-grouping wavelet transformer with spatial congruence for ultra-high resolution segmentation. International Joint Conference on Artificial Intelligence, pages 920–928, 2023

  28. [28]

    Replay master: Automatic sample selection and effective memory utilization for continual semantic segmentation

    Lanyun Zhu, Tianrun Chen, Jianxiong Yin, Simon See, De Wen Soh, and Jun Liu. Replay master: Automatic sample selection and effective memory utilization for continual semantic segmentation. IEEE Transactions on Pattern Analysis and Machine Intelligence, 2025

  29. [29]

    $\pi^3$: Permutation-Equivariant Visual Geometry Learning

    Yifan Wang et al. Pi3: Scalable permutation-equivariant visual geometry learning. arXiv preprint arXiv:2507.13347, 2025

  30. [30]

    Llafs++: Few-shot image segmentation with large language models

    Lanyun Zhu, Tianrun Chen, Deyi Ji, Peng Xu, Jieping Ye, and Jun Liu. Llafs++: Few-shot image segmentation with large language models. IEEE Transactions on Pattern Analysis and Machine Intelligence, 2025

  31. [31]

    Context-aware graph convolution network for target re-identification

    Deyi Ji, Haoran Wang, Hanzhe Hu, Weihao Gan, Wei Wu, and Junjie Yan. Context-aware graph convolution network for target re-identification. In Proceedings of the AAAI Conference on Artificial Intelligence, volume 35, pages 1646–1654, 2021

  32. [32]

    CPCF: A cross-prompt contrastive framework for referring multimodal large language models

    Lanyun Zhu, Deyi Ji, Tianrun Chen, Haiyang Wu, De Wen Soh, and Jun Liu. CPCF: A cross-prompt contrastive framework for referring multimodal large language models. In Forty-second International Conference on Machine Learning, 2025

  33. [33]

    Learning gabor texture features for fine-grained recognition

    Lanyun Zhu, Tianrun Chen, Jianxiong Yin, Simon See, and Jun Liu. Learning gabor texture features for fine-grained recognition. In Proceedings of the IEEE/CVF international conference on computer vision, pages 1621–1631, 2023

  34. [34]

    View-centric multi-object tracking with homographic matching in moving uav

    Deyi Ji, Lanyun Zhu, Siqi Gao, Qi Zhu, Yiru Zhao, Peng Xu, Yue Ding, Hongtao Lu, Jieping Ye, Feng Wu, et al. View-centric multi-object tracking with homographic matching in moving uav. IEEE Transactions on Geoscience and Remote Sensing, 2026

  35. [35]

    3d reconstruction with spatial memory

    Hengyi Wang and Lourdes Agapito. 3d reconstruction with spatial memory. pages 78–89, 2025

  36. [36]

    Continuous 3d perception model with persistent state

    Qianqian Wang, Yifei Zhang, Aleksander Holynski, Alexei A Efros, and Angjoo Kanazawa. Continuous 3d perception model with persistent state. pages 10510–10522, 2025

  37. [37]

    TTT3R: 3D Reconstruction as Test-Time Training

    Xingyu Chen, Yue Chen, Yuliang Xiu, Andreas Geiger, and Anpei Chen. Ttt3r: 3d reconstruction as test-time training. arXiv preprint arXiv:2509.26645, 2025

  38. [38]

    Point3R: Streaming 3D Reconstruction with Explicit Spatial Pointer Memory

    Yuqi Wu, Wenzhao Zheng, Jie Zhou, and Jiwen Lu. Point3r: Streaming 3d reconstruction with explicit spatial pointer memory. arXiv preprint arXiv:2507.02863, 2025

  39. [39]

    H2o: Heavy-hitter oracle for efficient generative inference of large language models

    Zhenyu Zhang, Ying Sheng, Tianyi Zhou, Tianlong Chen, Lianmin Zheng, Ruisi Cai, Zhao Song, Yuandong Tian, Christopher Ré, Clark Barrett, et al. H2o: Heavy-hitter oracle for efficient generative inference of large language models. Advances in Neural Information Processing Systems, 36:34661–34710, 2023

  40. [40]

    Ada-kv: Optimizing kv cache eviction by adaptive budget allocation for efficient llm inference

    Yuan Feng, Junlin Lv, Yukun Cao, Xike Xie, and S Kevin Zhou. Ada-kv: Optimizing kv cache eviction by adaptive budget allocation for efficient llm inference. arXiv preprint arXiv:2407.11550, 2024

  41. [41]

    Snapkv: Llm knows what you are looking for before generation

    Yuhong Li, Yingbing Huang, Bowen Yang, Bharat Venkitesh, Acyr Locatelli, Hanchen Ye, Tianle Cai, Patrick Lewis, and Deming Chen. Snapkv: Llm knows what you are looking for before generation. Advances in Neural Information Processing Systems, 37:22947–22970, 2024

  42. [42]

    Dynamicvit: Efficient vision transformers with dynamic token sparsification

    Yongming Rao, Wenliang Zhao, Benlin Liu, Jiwen Lu, Jie Zhou, and Cho-Jui Hsieh. Dynamicvit: Efficient vision transformers with dynamic token sparsification. Advances in neural information processing systems, 34:13937–13949, 2021

  43. [43]

    Representation shift: Unifying token compression with flashattention

    Joonmyung Choi, Sanghyeok Lee, Byungoh Ko, Eunseo Kim, Jihyung Kil, and Hyunwoo J Kim. Representation shift: Unifying token compression with flashattention. In Proceedings of the IEEE/CVF International Conference on Computer Vision, pages 20456–20466, 2025

  44. [44]

    Dinov2: Learning robust visual features without supervision

    Maxime Oquab, Timothée Darcet, Théo Moutakanni, Huy Vo, Marc Szafraniec, Vasil Khalidov, Pierre Fernandez, Daniel Haziza, Francisco Massa, Alaaeldin El-Nouby, et al. Dinov2: Learning robust visual features without supervision. 2023

  45. [45]

    Scannet: Richly-annotated 3d reconstructions of indoor scenes

    Angela Dai, Angel X Chang, Manolis Savva, Maciej Halber, Thomas Funkhouser, and Matthias Nießner. Scannet: Richly-annotated 3d reconstructions of indoor scenes. In Proceedings of the IEEE conference on computer vision and pattern recognition, pages 5828–5839, 2017

  46. [46]

    Evict3R

    Jinhui Deng, Zhili Li, Yijin Ma, Xin Yang, and Pengfei Wan. Evict3R. arXiv preprint arXiv:2507.14890, 2025

  47. [47]

    InfiniteVGGT: Visual Geometry Grounded Transformer for Endless Streams

    Shuai Yuan, Yantai Yang, Xiaotian Yang, Xupeng Zhang, Zhonghao Zhao, Lingming Zhang, and Zhipeng Zhang. InfiniteVGGT: Visual geometry grounded transformer for endless streams. arXiv preprint arXiv:2601.02281, 2026

  48. [48]

    Neural rgb-d surface reconstruction

    Dejan Azinović, Ricardo Martin-Brualla, Dan B Goldman, Matthias Nießner, and Justus Thies. Neural rgb-d surface reconstruction. In Proceedings of the IEEE/CVF conference on computer vision and pattern recognition, pages 6290–6301, 2022

  49. [49]

    A multi-view stereo benchmark with high-resolution images and multi-camera videos

    Thomas Schops, Johannes L Schonberger, Silvano Galliani, Torsten Sattler, Konrad Schindler, Marc Pollefeys, and Andreas Geiger. A multi-view stereo benchmark with high-resolution images and multi-camera videos. In Proceedings of the IEEE conference on computer vision and pattern recognition, pages 3260–3269, 2017

  50. [50]

    Refusion: 3d reconstruction in dynamic environments for rgb-d cameras exploiting residuals

    Emanuele Palazzolo, Jens Behley, Philipp Lottes, Philippe Giguere, and Cyrill Stachniss. Refusion: 3d reconstruction in dynamic environments for rgb-d cameras exploiting residuals. In 2019 IEEE/RSJ International Conference on Intelligent Robots and Systems (IROS), pages 7855–7862. IEEE, 2019

  51. [51]

    Vision meets robotics: The kitti dataset

    Andreas Geiger, Philip Lenz, Christoph Stiller, and Raquel Urtasun. Vision meets robotics: The kitti dataset. The international journal of robotics research, 32(11):1231–1237, 2013