pith. machine review for the scientific record.

arxiv: 2604.15237 · v2 · submitted 2026-04-16 · 💻 cs.CV

Recognition: unknown

StreamCacheVGGT: Streaming Visual Geometry Transformers with Robust Scoring and Hybrid Cache Compression

Authors on Pith: no claims yet

Pith reviewed 2026-05-10 11:12 UTC · model grok-4.3

classification 💻 cs.CV
keywords streaming 3D reconstruction · visual geometry transformer · cache compression · cross-layer scoring · constant memory · video geometry · token merging · hybrid cache

The pith

StreamCacheVGGT maintains 3D reconstruction quality from video by scoring tokens across layers and merging rather than deleting them.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

Reconstructing dense 3D geometry from continuous video requires stable inference within a fixed memory budget. Pure eviction methods destroy information through outright token deletion guided by noisy single-layer scores. The proposed system tracks token importance trajectories across the full transformer hierarchy and applies order statistics to find tokens with lasting geometric value. It then uses a three-tier process to merge moderately important tokens into retained ones by nearest-neighbor assignment on the key-vector manifold. This keeps essential context that would otherwise be lost while staying strictly within constant-cost limits.
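The constant-budget loop this paragraph describes can be sketched as follows; the function names, the toy token representation, and the keep-top-`budget` policy are illustrative assumptions, not the paper's implementation:

```python
def stream_with_budget(frames, encode, score, compress, budget):
    """Toy constant-memory streaming loop (assumed shape, not the paper's).

    Tokens from each new frame join the cache; whenever the cache
    exceeds the fixed budget, a scoring + compression pass shrinks it
    back, so memory stays O(1) in the number of frames processed.
    """
    cache = []
    for frame in frames:
        cache.extend(encode(frame))
        if len(cache) > budget:
            cache = compress(cache, score(cache))[:budget]
    return cache

# Toy usage: integer "tokens", importance = the token value itself.
final = stream_with_budget(
    frames=range(5),
    encode=lambda f: [3 * f, 3 * f + 1, 3 * f + 2],
    score=lambda cache: cache,
    compress=lambda cache, s: [t for _, t in sorted(zip(s, cache), reverse=True)],
    budget=4,
)
# the cache never exceeds the budget after a compression pass
```

The real system replaces the placeholder `score` and `compress` with CLCES and HCC respectively; the point of the sketch is only the invariant that cache size is bounded regardless of stream length.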

Core claim

By replacing binary eviction with cross-layer consistency-enhanced scoring and hybrid cache compression that merges tokens via nearest-neighbor assignment on the key-vector manifold, the framework preserves geometric salience across long video streams and achieves higher reconstruction accuracy on five benchmarks while enforcing constant memory use.

What carries the argument

Cross-Layer Consistency-Enhanced Scoring (CLCES) that tracks sustained salience across layers combined with Hybrid Cache Compression (HCC) that performs three-tier triage and nearest-neighbor merging on the key-vector manifold.
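The abstract gives no equations for CLCES; a minimal sketch of what order-statistics scoring across layers could look like (the input shape and the choice of median are assumptions, not the paper's definition):

```python
import numpy as np

def cross_layer_scores(attn_per_layer):
    """Toy cross-layer consistency score (assumed form, not the paper's).

    attn_per_layer has shape (L, T): a per-layer importance estimate
    for each of T cached tokens, e.g. the attention mass a token
    receives in each of L transformer layers.  A robust order
    statistic -- here the median across layers -- rewards sustained
    salience and downweights tokens that spike in a single noisy layer.
    """
    return np.median(np.asarray(attn_per_layer, dtype=float), axis=0)

# Token 0 is salient in every layer; token 1 spikes only in layer 2.
layer_scores = [
    [0.9, 0.1, 0.2],  # layer 1
    [0.8, 0.9, 0.1],  # layer 2 (noisy spike on token 1)
    [0.9, 0.2, 0.2],  # layer 3
]
scores = cross_layer_scores(layer_scores)
# the median keeps token 0 on top and suppresses token 1's spike
```

A single-layer scorer would rank token 1 above token 2 here; the cross-layer statistic does not, which is the failure mode the paper attributes to localized scoring.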

If this is right

  • Higher reconstruction accuracy and long-term stability on 7-Scenes, NRGBD, ETH3D, Bonn, and KITTI.
  • Strict adherence to constant memory and compute budgets without any training.
  • Reduced information loss compared with pure eviction that deletes tokens outright.
  • Robust scores derived from order-statistical analysis across the transformer hierarchy.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

  • The same merging strategy on the key-vector manifold could be tested on other streaming transformer applications such as video object tracking or scene flow estimation.
  • Cross-layer consistency tracking may reduce sensitivity to single-layer activation noise in related vision transformers that process sequential data.
  • The three-tier triage could be adapted to trade off accuracy against memory in real-time robotics pipelines that must run indefinitely.

Load-bearing premise

Merging tokens by nearest-neighbor similarity in key space will preserve the geometric information needed for accurate 3D reconstruction without adding new distortions.
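A minimal sketch of what nearest-neighbor merging in key space could look like (Euclidean distance and mean-pooling are assumptions; the paper may weight or normalize differently):

```python
import numpy as np

def merge_into_anchors(keys, values, anchor_idx, merge_idx):
    """Toy key-space token merging (assumed form, not the paper's).

    Each token marked for merging is assigned to the retained anchor
    whose key vector is nearest, then every anchor's key/value pair is
    replaced by the mean of its group -- information is folded into
    survivors rather than deleted outright.
    """
    keys, values = np.asarray(keys, float), np.asarray(values, float)
    groups = {a: [a] for a in anchor_idx}
    for m in merge_idx:
        dists = [np.linalg.norm(keys[m] - keys[a]) for a in anchor_idx]
        groups[anchor_idx[int(np.argmin(dists))]].append(m)
    new_keys = np.stack([keys[g].mean(axis=0) for g in groups.values()])
    new_vals = np.stack([values[g].mean(axis=0) for g in groups.values()])
    return new_keys, new_vals

# Token 2 sits near anchor 0 in key space, so it merges into anchor 0.
k = [[0.0, 0.0], [10.0, 10.0], [0.5, 0.0]]
v = [[1.0], [2.0], [3.0]]
new_k, new_v = merge_into_anchors(k, v, anchor_idx=[0, 1], merge_idx=[2])
```

The premise under test is visible in the sketch: averaging in key/value space preserves information only to the extent that nearby keys really do carry interchangeable geometric content.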

What would settle it

A controlled test on long video sequences where reconstruction error rises or geometric fidelity drops when hybrid merging is enabled compared with a larger-memory baseline that simply retains all tokens.

read the original abstract

Reconstructing dense 3D geometry from continuous video streams requires stable inference under a constant memory budget. Existing $O(1)$ frameworks primarily rely on a "pure eviction" paradigm, which suffers from significant information destruction due to binary token deletion and evaluation noise from localized, single-layer scoring. To address these bottlenecks, we propose StreamCacheVGGT, a training-free framework that reimagines cache management through two synergistic modules: Cross-Layer Consistency-Enhanced Scoring (CLCES) and Hybrid Cache Compression (HCC). CLCES mitigates activation noise by tracking token importance trajectories across the Transformer hierarchy, employing order-statistical analysis to identify sustained geometric salience. Leveraging these robust scores, HCC transcends simple eviction by introducing a three-tier triage strategy that merges moderately important tokens into retained anchors via nearest-neighbor assignment on the key-vector manifold. This approach preserves essential geometric context that would otherwise be lost. Extensive evaluations on five benchmarks (7-Scenes, NRGBD, ETH3D, Bonn, and KITTI) demonstrate that StreamCacheVGGT sets a new state-of-the-art, delivering superior reconstruction accuracy and long-term stability while strictly adhering to constant-cost constraints.
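The three-tier triage the abstract describes can be sketched as a quantile split over the robust scores; the tier fractions are invented for illustration and are not the paper's values:

```python
import numpy as np

def three_tier_triage(scores, keep_frac=0.25, merge_frac=0.5):
    """Toy three-tier split (assumed fractions, not the paper's).

    The top-scoring tokens are retained as anchors, a middle band is
    marked for merging into those anchors, and the remainder is
    evicted -- only the lowest tier loses its information outright.
    """
    order = np.argsort(scores)[::-1]              # highest score first
    n_keep = max(1, int(len(scores) * keep_frac))
    n_merge = int(len(scores) * merge_frac)
    keep = order[:n_keep]
    merge = order[n_keep:n_keep + n_merge]
    evict = order[n_keep + n_merge:]
    return keep, merge, evict

keep, merge, evict = three_tier_triage([0.9, 0.2, 0.5, 0.1])
# only the weakest token is evicted; the middle band survives by merging
```

Contrast with pure eviction, which would delete both lower tiers; here the middle band's content is redirected into anchors instead.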

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Referee Report

2 major / 1 minor

Summary. The paper proposes StreamCacheVGGT, a training-free framework for constant-memory streaming 3D geometry reconstruction from video using Visual Geometry Transformers. It introduces Cross-Layer Consistency-Enhanced Scoring (CLCES) to track token importance across layers via order statistics and Hybrid Cache Compression (HCC) with a three-tier triage that merges moderately important tokens via nearest-neighbor assignment on the key-vector manifold, claiming SOTA accuracy and long-term stability on the 7-Scenes, NRGBD, ETH3D, Bonn, and KITTI benchmarks while avoiding binary eviction.

Significance. If the central claims hold, the work would offer a practical advance for online dense reconstruction under fixed memory budgets by replacing pure eviction with context-preserving compression; the training-free design and explicit focus on cross-layer geometric salience are strengths that could influence follow-on systems in robotics and AR.

major comments (2)
  1. The SOTA claim on five benchmarks is load-bearing for the contribution, yet the abstract (and by extension the evaluation section) provides no quantitative metrics, tables, error bars, or direct comparisons to prior constant-memory baselines, preventing verification of the asserted gains in reconstruction accuracy and stability.
  2. HCC section (three-tier triage and nearest-neighbor merging on the key-vector manifold): the premise that proximity in key space implies preservation of 3D geometric context for moderately important tokens is untested. No ablation, failure-mode analysis, or comparison to pure-eviction baselines shows that merging does not distort camera poses or point clouds beyond what CLCES already filters, which directly undermines the 'superior accuracy' result.
minor comments (1)
  1. Notation for CLCES order-statistical analysis and HCC manifold distance is introduced without explicit equations or pseudocode, making the heuristics hard to reproduce.

Simulated Authors' Rebuttal

2 responses · 0 unresolved

We thank the referee for the constructive feedback on our manuscript. We address each major comment point by point below and outline the revisions we will make to strengthen the presentation and evidence.

read point-by-point responses
  1. Referee: The SOTA claim on five benchmarks is load-bearing for the contribution, yet the abstract (and by extension the evaluation section) provides no quantitative metrics, tables, error bars, or direct comparisons to prior constant-memory baselines, preventing verification of the asserted gains in reconstruction accuracy and stability.

    Authors: We agree that the abstract omits specific quantitative metrics, which limits immediate verifiability of the SOTA claims. The evaluation section does contain tables reporting reconstruction accuracy, stability metrics, and comparisons against prior constant-memory baselines across the five benchmarks (7-Scenes, NRGBD, ETH3D, Bonn, KITTI), including results aggregated over multiple runs. To address the concern directly, we will revise the abstract to include key numerical results and error bars summarizing the gains. This is a presentation clarification rather than an absence of supporting data. revision: yes

  2. Referee: HCC section (three-tier triage and nearest-neighbor merging on key-vector manifold): the premise that proximity in key space equals preservation of 3D geometric context for moderately important tokens is untested; no ablation, failure-mode analysis, or comparison to pure-eviction baselines shows that merging does not distort camera poses or point clouds beyond what CLCES already filters, directly undermining the 'superior accuracy' result.

    Authors: We acknowledge that the manuscript does not provide dedicated ablations isolating the nearest-neighbor merging step on the key-vector manifold, nor explicit failure-mode analysis of potential distortions to camera poses or point clouds. The reported superior accuracy is based on end-to-end benchmark results versus pure-eviction baselines, but we agree this does not fully isolate the contribution of the merging operation. We will add targeted ablation studies and distortion analysis (including pose and point-cloud metrics) in the revised manuscript and supplementary material to directly test the premise. revision: yes

Circularity Check

0 steps flagged

No circularity: training-free heuristics with external benchmark validation

full rationale

The paper describes a training-free framework consisting of explicit heuristic modules (CLCES for cross-layer scoring via order statistics and HCC for three-tier merging on the key-vector manifold). No parameters are fitted to the target reconstruction data, no predictions reduce to fitted inputs by construction, and no load-bearing claims rely on self-citations or imported uniqueness theorems. Performance is asserted via direct evaluation on five independent external benchmarks (7-Scenes, NRGBD, ETH3D, Bonn, KITTI), keeping the derivation chain self-contained against those benchmarks rather than tautological.

Axiom & Free-Parameter Ledger

0 free parameters · 2 axioms · 0 invented entities

The approach rests on domain assumptions about geometric salience being consistent across transformer layers and that nearest-neighbor merging on key vectors preserves 3D context; no free parameters or new entities are explicitly introduced in the abstract.

axioms (2)
  • domain assumption Token importance trajectories across the Transformer hierarchy reliably indicate sustained geometric salience.
    Invoked in the description of CLCES to justify order-statistical analysis over single-layer scoring.
  • domain assumption Nearest-neighbor assignment on the key-vector manifold merges moderately important tokens without destroying essential geometric information.
    Central to the HCC three-tier triage strategy.

pith-pipeline@v0.9.0 · 5531 in / 1264 out tokens · 26056 ms · 2026-05-10T11:12:32.543758+00:00 · methodology

discussion (0)


Reference graph

Works this paper leans on

51 extracted references · 9 canonical work pages · 2 internal anchors

  1. [1]

    Grounding image matching in 3d with mast3r

    Vincent Leroy, Yohann Cabon, and Jérôme Revaud. Grounding image matching in 3d with mast3r. pages 71–91, 2024

  2. [2]

    Dust3r: Geometric 3d vision made easy

    Shuzhe Wang, Vincent Leroy, Yohann Cabon, Boris Chidlovskii, and Jerome Revaud. Dust3r: Geometric 3d vision made easy. In Proceedings of the IEEE/CVF conference on computer vision and pattern recognition, pages 20697–20709, 2024

  3. [3]

    Vggt: Visual geometry grounded transformer

    Jianyuan Wang, Minghao Chen, Nikita Karaev, Andrea Vedaldi, Christian Rupprecht, and David Novotny. Vggt: Visual geometry grounded transformer. pages 5294–5306, 2025

  4. [4]

    Fast3r: Towards 3d reconstruction of 1000+ images in one forward pass

    Jianing Yang, Alexander Sax, Kevin J Liang, Mikael Henaff, Hao Tang, Ang Cao, Joyce Chai, Franziska Meier, and Matt Feiszli. Fast3r: Towards 3d reconstruction of 1000+ images in one forward pass. pages 21924–21935, 2025

  5. [5]

    OVGGT: O(1) Constant-Cost Streaming Visual Geometry Transformer

    Si-Yu Lu, Po-Ting Chen, Hui-Che Hsu, Sin-Ye Jhong, Wen-Huang Cheng, and Yung-Yao Chen. OVGGT: O(1) constant-cost streaming visual geometry transformer. arXiv preprint arXiv:2603.05959, 2026

  6. [6]

    Streaming 4d visual geometry transformer

    Dong Zhuo, Wenzhao Zheng, Jiahe Guo, Yuqi Wu, Jie Zhou, and Jiwen Lu. Streaming 4d visual geometry transformer. arXiv preprint arXiv:2507.11539, 2025

  7. [7]

    Flashattention: Fast and memory-efficient exact attention with io-awareness

    Tri Dao, Dan Fu, Stefano Ermon, Atri Rudra, and Christopher Ré. Flashattention: Fast and memory-efficient exact attention with io-awareness. volume 35, pages 16344–16359, 2022

  8. [8]

    Feed-forward neural networks

    George Bebis and Michael Georgiopoulos. Feed-forward neural networks. IEEE Potentials, 13(4):27–31, 2002

  9. [9]

    Token merging: Your vit but faster

    Daniel Bolya, Cheng-Yang Fu, Xiaoliang Dai, Peizhao Zhang, Christoph Feichtenhofer, and Judy Hoffman. Token merging: Your vit but faster. 2022

  10. [10]

    Map-free visual relocalization: Metric pose relative to a single image

    Eduardo Arnold, Jamie Wynn, Sara Vicente, Guillermo Garcia-Hernando, Aron Monszpart, Victor Prisacariu, Daniyar Turmukhambetov, and Eric Brachmann. Map-free visual relocalization: Metric pose relative to a single image. In European Conference on Computer Vision, pages 690–708. Springer, 2022

  11. [11]

    Llafs: When large language models meet few-shot segmentation

    Lanyun Zhu, Tianrun Chen, Deyi Ji, Jieping Ye, and Jun Liu. Llafs: When large language models meet few-shot segmentation. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 3065–3075, 2024

  12. [12]

    Deepmvs: Learning multi-view stereopsis

    Po-Han Huang, Kevin Matzen, Johannes Kopf, Narendra Ahuja, and Jia-Bin Huang. Deepmvs: Learning multi-view stereopsis. In Proceedings of the IEEE conference on computer vision and pattern recognition, pages 2821–2830, 2018

  13. [13]

    Skysense-o: Towards open-world remote sensing interpretation with vision-centric visual-language modeling

    Qi Zhu, Jiangwei Lao, Deyi Ji, Junwei Luo, Kang Wu, Yingying Zhang, Lixiang Ru, Jian Wang, Jingdong Chen, Ming Yang, et al. Skysense-o: Towards open-world remote sensing interpretation with vision-centric visual-language modeling. In IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), 2025

  14. [14]

    Ibd: Alleviating hallucinations in large vision-language models via image-biased decoding

    Lanyun Zhu, Deyi Ji, Tianrun Chen, Peng Xu, Jieping Ye, and Jun Liu. Ibd: Alleviating hallucinations in large vision-language models via image-biased decoding. In IEEE/CVF Conference on Computer Vision and Pattern Recognition, 2024

  15. [15]

    Structural and statistical texture knowledge distillation and learning for segmentation

    Deyi Ji, Feng Zhao, Hongtao Lu, Feng Wu, and Jieping Ye. Structural and statistical texture knowledge distillation and learning for segmentation. IEEE Transactions on Pattern Analysis and Machine Intelligence, 47(5):3639–3656, 2025

  16. [16]

    Discrete latent perspective learning for segmentation and detection

    Deyi Ji, Feng Zhao, Lanyun Zhu, Wenwei Jin, Hongtao Lu, and Jieping Ye. Discrete latent perspective learning for segmentation and detection. In International Conference on Machine Learning, pages 21719–21730, 2024

  17. [17]

    Not every patch is needed: Towards a more efficient and effective backbone for video-based person re-identification

    Lanyun Zhu, Tianrun Chen, Deyi Ji, Jieping Ye, and Jun Liu. Not every patch is needed: Towards a more efficient and effective backbone for video-based person re-identification. IEEE Transactions on Image Processing, 2025

  18. [18]

    Fastvggt: Training-free acceleration of visual geometry transformer

    You Shen, Zhipeng Zhang, Yansong Qu, Xiawu Zheng, Jiayi Ji, Shengchuan Zhang, and Liujuan Cao. Fastvggt: Training-free acceleration of visual geometry transformer. arXiv preprint arXiv:2509.02560, 2025

  19. [19]

    Ultra-high resolution segmentation with ultra-rich context: A novel benchmark

    Deyi Ji, Feng Zhao, Hongtao Lu, Mingyuan Tao, and Jieping Ye. Ultra-high resolution segmentation with ultra-rich context: A novel benchmark. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 23621–23630, 2023

  20. [20]

    Structural and statistical texture knowledge distillation for semantic segmentation

    Deyi Ji, Haoran Wang, Mingyuan Tao, Jianqiang Huang, Xian-Sheng Hua, and Hongtao Lu. Structural and statistical texture knowledge distillation for semantic segmentation. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 16876–16885, 2022

  21. [21]

    Learning statistical texture for semantic segmentation

    Lanyu Zhu, Deyi Ji, Shiping Zhu, Weihao Gan, Wei Wu, and Junjie Yan. Learning statistical texture for semantic segmentation. In IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), 2021

  22. [22]

    Common objects in 3d: Large-scale learning and evaluation of real-life 3d category reconstruction

    Jeremy Reizenstein, Roman Shapovalov, Philipp Henzler, Luca Sbordone, Patrick Labatut, and David Novotny. Common objects in 3d: Large-scale learning and evaluation of real-life 3d category reconstruction. In Proceedings of the IEEE/CVF international conference on computer vision, pages 10901–10911, 2021

  23. [23]

    Pptformer: Pseudo multi-perspective transformer for uav segmentation

    Deyi Ji, Wenwei Jin, Hongtao Lu, and Feng Zhao. Pptformer: Pseudo multi-perspective transformer for uav segmentation. International Joint Conference on Artificial Intelligence, pages 893–901, 2024

  24. [24]

    Megadepth: Learning single-view depth prediction from internet photos

    Zhengqi Li and Noah Snavely. Megadepth: Learning single-view depth prediction from internet photos. In Proceedings of the IEEE conference on computer vision and pattern recognition, pages 2041–2050, 2018

  25. [25]

    Popen: Preference-based optimization and ensemble for lvlm-based reasoning segmentation

    Lanyun Zhu, Tianrun Chen, Qianxiong Xu, Xuanyi Liu, Deyi Ji, Haiyang Wu, De Wen Soh, and Jun Liu. Popen: Preference-based optimization and ensemble for lvlm-based reasoning segmentation. In IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), 2025

  26. [26]

    Retrv-r1: A reasoning-driven mllm framework for universal and efficient multimodal retrieval

    Lanyun Zhu, Deyi Ji, Tianrun Chen, Haiyang Wu, and Shiqi Wang. Retrv-r1: A reasoning-driven mllm framework for universal and efficient multimodal retrieval. Neural Information Processing Systems (NeurIPS), 2025

  27. [27]

    Guided patch-grouping wavelet transformer with spatial congruence for ultra-high resolution segmentation

    Deyi Ji, Feng Zhao, and Hongtao Lu. Guided patch-grouping wavelet transformer with spatial congruence for ultra-high resolution segmentation. International Joint Conference on Artificial Intelligence, pages 920–928, 2023

  28. [28]

    Replay master: Automatic sample selection and effective memory utilization for continual semantic segmentation

    Lanyun Zhu, Tianrun Chen, Jianxiong Yin, Simon See, De Wen Soh, and Jun Liu. Replay master: Automatic sample selection and effective memory utilization for continual semantic segmentation. IEEE Transactions on Pattern Analysis and Machine Intelligence, 2025

  29. [29]

    $\pi^3$: Permutation-Equivariant Visual Geometry Learning

    Yifan Wang et al. Pi3: Scalable permutation-equivariant visual geometry learning. arXiv preprint arXiv:2507.13347, 2025

  30. [30]

    Llafs++: Few-shot image segmentation with large language models

    Lanyun Zhu, Tianrun Chen, Deyi Ji, Peng Xu, Jieping Ye, and Jun Liu. Llafs++: Few-shot image segmentation with large language models. IEEE Transactions on Pattern Analysis and Machine Intelligence, 2025

  31. [31]

    Context-aware graph convolution network for target re-identification

    Deyi Ji, Haoran Wang, Hanzhe Hu, Weihao Gan, Wei Wu, and Junjie Yan. Context-aware graph convolution network for target re-identification. In Proceedings of the AAAI Conference on Artificial Intelligence, volume 35, pages 1646–1654, 2021

  32. [32]

    CPCF: A cross-prompt contrastive framework for referring multimodal large language models

    Lanyun Zhu, Deyi Ji, Tianrun Chen, Haiyang Wu, De Wen Soh, and Jun Liu. CPCF: A cross-prompt contrastive framework for referring multimodal large language models. In Forty-second International Conference on Machine Learning, 2025

  33. [33]

    Learning gabor texture features for fine-grained recognition

    Lanyun Zhu, Tianrun Chen, Jianxiong Yin, Simon See, and Jun Liu. Learning gabor texture features for fine-grained recognition. In Proceedings of the IEEE/CVF international conference on computer vision, pages 1621–1631, 2023

  34. [34]

    View-centric multi-object tracking with homographic matching in moving uav

    Deyi Ji, Lanyun Zhu, Siqi Gao, Qi Zhu, Yiru Zhao, Peng Xu, Yue Ding, Hongtao Lu, Jieping Ye, Feng Wu, et al. View-centric multi-object tracking with homographic matching in moving uav. IEEE Transactions on Geoscience and Remote Sensing, 2026

  35. [35]

    3d reconstruction with spatial memory

    Hengyi Wang and Lourdes Agapito. 3d reconstruction with spatial memory. pages 78–89, 2025

  36. [36]

    Continuous 3d perception model with persistent state

    Qianqian Wang, Yifei Zhang, Aleksander Holynski, Alexei A Efros, and Angjoo Kanazawa. Continuous 3d perception model with persistent state. pages 10510–10522, 2025

  37. [37]

    TTT3R: 3D Reconstruction as Test-Time Training

    Xingyu Chen, Yue Chen, Yuliang Xiu, Andreas Geiger, and Anpei Chen. Ttt3r: 3d reconstruction as test-time training. arXiv preprint arXiv:2509.26645, 2025

  38. [38]

    Point3R: Streaming 3D Reconstruction with Explicit Spatial Pointer Memory

    Yuqi Wu, Wenzhao Zheng, Jie Zhou, and Jiwen Lu. Point3r: Streaming 3d reconstruction with explicit spatial pointer memory. arXiv preprint arXiv:2507.02863, 2025

  39. [39]

    H2o: Heavy-hitter oracle for efficient generative inference of large language models

    Zhenyu Zhang, Ying Sheng, Tianyi Zhou, Tianlong Chen, Lianmin Zheng, Ruisi Cai, Zhao Song, Yuandong Tian, Christopher Ré, Clark Barrett, et al. H2o: Heavy-hitter oracle for efficient generative inference of large language models. Advances in Neural Information Processing Systems, 36:34661–34710, 2023

  40. [40]

    Ada-kv: Optimizing kv cache eviction by adaptive budget allocation for efficient llm inference

    Yuan Feng, Junlin Lv, Yukun Cao, Xike Xie, and S Kevin Zhou. Ada-kv: Optimizing kv cache eviction by adaptive budget allocation for efficient llm inference. arXiv preprint arXiv:2407.11550, 2024

  41. [41]

    Snapkv: Llm knows what you are looking for before generation

    Yuhong Li, Yingbing Huang, Bowen Yang, Bharat Venkitesh, Acyr Locatelli, Hanchen Ye, Tianle Cai, Patrick Lewis, and Deming Chen. Snapkv: Llm knows what you are looking for before generation. Advances in Neural Information Processing Systems, 37:22947–22970, 2024

  42. [42]

    Dynamicvit: Efficient vision transformers with dynamic token sparsification

    Yongming Rao, Wenliang Zhao, Benlin Liu, Jiwen Lu, Jie Zhou, and Cho-Jui Hsieh. Dynamicvit: Efficient vision transformers with dynamic token sparsification. Advances in neural information processing systems, 34:13937–13949, 2021

  43. [43]

    Representation shift: Unifying token compression with flashattention

    Joonmyung Choi, Sanghyeok Lee, Byungoh Ko, Eunseo Kim, Jihyung Kil, and Hyunwoo J Kim. Representation shift: Unifying token compression with flashattention. In Proceedings of the IEEE/CVF International Conference on Computer Vision, pages 20456–20466, 2025

  44. [44]

    Dinov2: Learning robust visual features without supervision

    Maxime Oquab, Timothée Darcet, Théo Moutakanni, Huy Vo, Marc Szafraniec, Vasil Khalidov, Pierre Fernandez, Daniel Haziza, Francisco Massa, Alaaeldin El-Nouby, et al. Dinov2: Learning robust visual features without supervision. 2023

  45. [45]

    Scannet: Richly-annotated 3d reconstructions of indoor scenes

    Angela Dai, Angel X Chang, Manolis Savva, Maciej Halber, Thomas Funkhouser, and Matthias Nießner. Scannet: Richly-annotated 3d reconstructions of indoor scenes. In Proceedings of the IEEE conference on computer vision and pattern recognition, pages 5828–5839, 2017

  46. [46]

    Evict3R

    Jinhui Deng, Zhili Li, Yijin Ma, Xin Yang, and Pengfei Wan. Evict3R. arXiv preprint arXiv:2507.14890, 2025

  47. [47]

    InfiniteVGGT: Visual Geometry Grounded Transformer for Endless Streams

    Shuai Yuan, Yantai Yang, Xiaotian Yang, Xupeng Zhang, Zhonghao Zhao, Lingming Zhang, and Zhipeng Zhang. InfiniteVGGT: Visual geometry grounded transformer for endless streams. arXiv preprint arXiv:2601.02281, 2026

  48. [48]

    Neural rgb-d surface reconstruction

    Dejan Azinović, Ricardo Martin-Brualla, Dan B Goldman, Matthias Nießner, and Justus Thies. Neural rgb-d surface reconstruction. In Proceedings of the IEEE/CVF conference on computer vision and pattern recognition, pages 6290–6301, 2022

  49. [49]

    A multi-view stereo benchmark with high-resolution images and multi-camera videos

    Thomas Schops, Johannes L Schonberger, Silvano Galliani, Torsten Sattler, Konrad Schindler, Marc Pollefeys, and Andreas Geiger. A multi-view stereo benchmark with high-resolution images and multi-camera videos. In Proceedings of the IEEE conference on computer vision and pattern recognition, pages 3260–3269, 2017

  50. [50]

    Refusion: 3d reconstruction in dynamic environments for rgb-d cameras exploiting residuals

    Emanuele Palazzolo, Jens Behley, Philipp Lottes, Philippe Giguere, and Cyrill Stachniss. Refusion: 3d reconstruction in dynamic environments for rgb-d cameras exploiting residuals. In 2019 IEEE/RSJ International Conference on Intelligent Robots and Systems (IROS), pages 7855–7862. IEEE, 2019

  51. [51]

    Vision meets robotics: The kitti dataset

    Andreas Geiger, Philip Lenz, Christoph Stiller, and Raquel Urtasun. Vision meets robotics: The kitti dataset. The international journal of robotics research, 32(11):1231–1237, 2013