pith. machine review for the scientific record.

arxiv: 2604.18260 · v1 · submitted 2026-04-20 · 💻 cs.CV

Recognition: unknown

Geometry-Guided 3D Visual Token Pruning for Video-Language Models

Authors on Pith: no claims yet

Pith reviewed 2026-05-10 05:45 UTC · model grok-4.3

classification 💻 cs.CV
keywords: visual token pruning · 3D scene understanding · geometry-aware attention · video-language models · token efficiency · voxel selection · spatial video processing

The pith

A geometry-guided pruning framework for 3D spatial videos allows video-language models to discard 90 percent of visual tokens while retaining over 90 percent of their original performance.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper introduces a method to reduce the computational load of processing 3D scenes in video-language models by pruning redundant visual tokens. It uses available depth and camera pose data to identify which tokens carry unique spatial information across frames. By grouping tokens into voxels and selecting representatives both inside and across voxels, the approach removes inter-frame redundancy without losing scene completeness. This matters because current models struggle with the high token counts in spatial videos, limiting their use in real-time or long-context 3D tasks. If effective, it would let these models handle complex 3D reasoning tasks more efficiently.
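As a concrete illustration of the token-to-voxel grouping described above, here is a minimal NumPy sketch of pinhole back-projection and voxel bucketing. It is not the authors' code: the function name, the 0.2 m voxel size, and the input layout are illustrative assumptions.

```python
import numpy as np

def tokens_to_voxels(pixels, depth, K, cam_to_world, voxel_size=0.2):
    """Back-project token centers into world space and bucket them into voxels.

    pixels:       (N, 2) token-center coordinates (u, v) in the image
    depth:        (N,) depth sampled at each token center, in meters
    K:            (3, 3) camera intrinsics
    cam_to_world: (4, 4) camera-to-world pose
    voxel_size:   cubic voxel edge length in meters (illustrative value)
    """
    ones = np.ones((pixels.shape[0], 1))
    # Pinhole back-projection: x_cam = depth * K^{-1} [u, v, 1]^T
    rays = (np.linalg.inv(K) @ np.hstack([pixels, ones]).T).T
    cam_pts = rays * depth[:, None]
    # A shared world frame makes tokens from different frames that observe
    # the same surface fall into the same voxel, exposing the redundancy.
    world_pts = (cam_to_world @ np.hstack([cam_pts, ones]).T).T[:, :3]
    # Quantize positions into integer voxel indices.
    voxel_idx = np.floor(world_pts / voxel_size).astype(np.int64)
    return world_pts, voxel_idx
```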

Core claim

Geo3DPruner first computes geometry-aware global attention to model cross-frame relevance using depth maps and camera poses, then applies a two-stage pruning: intra-voxel selection of representative multi-view features within each voxel, followed by inter-voxel selection that maintains a spatially diverse set of voxels. On 3D scene understanding benchmarks, this prunes 90% of visual tokens while retaining over 90% of the original performance, outperforming prior text-guided and vision-guided pruning approaches.
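A rough sketch of what geometry-aware global attention could look like, using the world-space token positions from the back-projection sketch above. The additive distance bias and the `lam` weight are assumptions for illustration, not the paper's formulation.

```python
import numpy as np

def geometry_aware_relevance(feats, world_pts, lam=1.0):
    """Score each token's cross-frame relevance with geometry-biased attention.

    feats:     (N, d) visual token features pooled from all frames
    world_pts: (N, 3) token positions in a shared world frame
    lam:       weight on the geometric bias (illustrative)
    Returns a (N,) relevance score per token.
    """
    d = feats.shape[1]
    # Scaled dot-product attention logits across all frames at once, so
    # tokens that re-observe the same surface can attend to each other.
    logits = feats @ feats.T / np.sqrt(d)
    # Geometric bias: down-weight pairs that are far apart in 3D, so
    # attention concentrates on geometry-aligned regions. O(N^2) memory.
    dists = np.linalg.norm(world_pts[:, None] - world_pts[None, :], axis=-1)
    logits = logits - lam * dists
    attn = np.exp(logits - logits.max(axis=1, keepdims=True))
    attn /= attn.sum(axis=1, keepdims=True)
    # A token's global relevance: how much attention it receives overall.
    return attn.sum(axis=0)
```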

What carries the argument

The two-stage pruning process, which partitions the 3D space into voxels based on depth and camera poses and then selects representative tokens to balance relevance and diversity.
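Building on the `tokens_to_voxels` and `geometry_aware_relevance` sketches above, a minimal version of the two-stage selection could look like the following. The top-k-per-voxel rule and farthest point sampling are plausible stand-ins for, not reproductions of, the paper's exact intra- and inter-voxel criteria.

```python
import numpy as np

def two_stage_prune(voxel_idx, scores, world_pts, per_voxel_k=1, num_voxels=64):
    """Two-stage token pruning sketch: intra-voxel selection by relevance
    score, then inter-voxel selection for spatial diversity.

    voxel_idx: (N, 3) integer voxel index per token (see tokens_to_voxels)
    scores:    (N,) relevance score per token (see geometry_aware_relevance)
    world_pts: (N, 3) token positions in world space
    Returns sorted indices of retained tokens.
    """
    # Stage 1 (intra-voxel): within each voxel, keep the top-k tokens by
    # score as that voxel's representative multi-view features.
    buckets = {}
    for i, key in enumerate(map(tuple, voxel_idx)):
        buckets.setdefault(key, []).append(i)
    reps, centers = [], []
    for idxs in buckets.values():
        best = sorted(idxs, key=lambda i: scores[i], reverse=True)[:per_voxel_k]
        reps.append(best)
        centers.append(world_pts[best].mean(axis=0))
    centers = np.stack(centers)

    # Stage 2 (inter-voxel): keep a spatially spread subset of voxels via
    # farthest point sampling, seeded at the highest-scoring voxel.
    seed = int(np.argmax([scores[r[0]] for r in reps]))
    chosen = [seed]
    dists = np.linalg.norm(centers - centers[seed], axis=1)
    while len(chosen) < min(num_voxels, len(reps)):
        nxt = int(np.argmax(dists))  # farthest voxel from any chosen one
        chosen.append(nxt)
        dists = np.minimum(dists, np.linalg.norm(centers - centers[nxt], axis=1))
    return sorted(i for c in chosen for i in reps[c])
```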

Load-bearing premise

That depth maps and camera poses are reliably available for computing geometry-aware attention and that voxel-based representative selection will not discard task-critical spatial information.

What would settle it

Running the pruned model on a 3D benchmark where depth or pose estimates contain significant noise or are missing and measuring whether performance retention falls substantially below 90 percent.
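A hedged sketch of such a stress test: inject noise into the geometric inputs before pruning and track performance retention. The noise magnitudes and the Rodrigues-rotation perturbation are illustrative assumptions, not the paper's protocol.

```python
import numpy as np

def perturb_geometry(depth, cam_to_world, depth_noise=0.05,
                     rot_noise_deg=1.0, seed=0):
    """Inject noise into depth and pose to probe the load-bearing premise.

    depth_noise:   std of multiplicative Gaussian noise on depth (5% here)
    rot_noise_deg: std of a small random rotation about a random axis
    Both magnitudes are illustrative choices, not values from the paper.
    """
    rng = np.random.default_rng(seed)
    noisy_depth = depth * (1.0 + rng.normal(0.0, depth_noise, size=depth.shape))

    # Small random rotation via Rodrigues' formula: R = I + sin(t) K + (1 - cos(t)) K^2
    axis = rng.normal(size=3)
    axis /= np.linalg.norm(axis)
    theta = np.deg2rad(rng.normal(0.0, rot_noise_deg))
    K = np.array([[0, -axis[2], axis[1]],
                  [axis[2], 0, -axis[0]],
                  [-axis[1], axis[0], 0]])
    R = np.eye(3) + np.sin(theta) * K + (1 - np.cos(theta)) * (K @ K)
    noisy_pose = cam_to_world.copy()
    noisy_pose[:3, :3] = R @ cam_to_world[:3, :3]
    return noisy_depth, noisy_pose
```

Sweeping `depth_noise` and `rot_noise_deg` while re-running a fixed benchmark would show whether the 90 percent retention figure degrades gracefully or collapses.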

Figures

Figures reproduced from arXiv: 2604.18260 by Han Li, Jiahui Fu, Naiyan Wang, Si Liu, Zehao Huang.

Figure 1: Motivation of Geo3DPruner. 3D spatial videos essentially represent multi-view projections of the complete 3D scene. Features corresponding to the same objects (e.g., the wooden desk or the swivel chair) frequently recur across different frames. Existing pruning strategies fail to remove such redundancy due to the absence of global cross-frame relevance modeling.
Figure 2: Framework of Geo3DPruner. We adopt Video-3D LLM [51] as our base model and replace its 3D positional encodings with geometry features following VG LLM [50]. Input video frames are processed by two parallel encoders: a 2D visual encoder (e.g., SigLIP [43]) extracts image features, and a 3D geometry encoder (e.g., VGGT [29]) captures geometric features while modeling long-range cross-frame dependencies.
Figure 3: Illustration of intra-voxel view consistency pruning. Within each voxel, multi-view features from different frames are evaluated based on attention scores to identify the most representative tokens.
Figure 4: Performance-efficiency trade-off curves for 32-frame videos using different visual token pruning methods. The x-axis denotes the reduction ratio in the number of visual tokens, and the y-axis shows the corresponding performance across benchmarks. The black dotted line represents the unpruned base model.
Figure 5: (a) Retained tokens in each frame show diverse spatial coverage across different objects, preserving overall scene completeness. (b) Global attention maps from the red-box patch in the first frame primarily focus on geometry-aligned and instance-related regions in other frames.
Figure 6: Voxel-level visualization before and after pruning. (a) Visualization of the original voxelized scene, where voxels densely cover both foreground objects and background regions. (b) Visualization after applying Geo3DPruner, where redundant voxels are largely removed while object-related and structurally important voxels are preserved.
read the original abstract

Multimodal large language models have demonstrated remarkable capabilities in 2D vision, motivating their extension to 3D scene understanding. Recent studies represent 3D scenes as 3D spatial videos composed of image sequences with depth and camera pose information, enabling pre-trained video-language models to perform 3D reasoning tasks. However, the large number of visual tokens in spatial videos remains a major bottleneck for efficient inference and context management. Existing pruning methods overlook the view consistency of spatial videos and the spatial diversity of the remaining tokens, which prevents them from effectively removing inter-frame redundancy and preserving scene completeness. In this paper, we propose Geo3DPruner, a Geometry-Guided 3D Visual Token Pruning framework. Geo3DPruner first models cross-frame relevance through geometry-aware global attention, and then performs a two-stage pruning process. The intra-voxel stage selects representative multi-view features within each voxel, while the inter-voxel stage preserves spatial diversity by selecting a globally distributed subset of voxels. Extensive experiments on multiple 3D scene understanding benchmarks demonstrate that Geo3DPruner retains over 90% of the original performance while pruning 90% of visual tokens, significantly outperforming existing text-guided and vision-guided pruning methods.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Referee Report

3 major / 2 minor

Summary. The paper proposes Geo3DPruner, a geometry-guided framework for pruning visual tokens in 3D spatial videos processed by video-language models. It first computes cross-frame relevance via geometry-aware global attention using depth maps and camera poses, then applies a two-stage process: intra-voxel selection of representative multi-view features within each voxel and inter-voxel selection to maintain spatial diversity across voxels. Experiments on multiple 3D scene understanding benchmarks show that the method prunes 90% of visual tokens while retaining over 90% of original performance and outperforming text-guided and vision-guided pruning baselines.

Significance. If the empirical claims hold under scrutiny, the work offers a practical way to reduce the token bottleneck in 3D-aware VLMs by exploiting readily available geometric cues to remove inter-frame redundancy while preserving scene structure. The two-stage voxel-based selection is a concrete algorithmic contribution that could inform future efficiency techniques for long spatial video inputs.

major comments (3)
  1. [Method and Experiments] The central performance claim (90% retention at 90% pruning) rests on the assumption that depth maps and camera poses are accurate enough for reliable voxel partitioning and representative selection; however, no sensitivity analysis or noise-injection experiments are reported to test whether misalignment or estimation errors discard task-critical 3D structure.
  2. [Experiments] The reported outperformance over text- and vision-guided baselines may partly reflect the privileged geometry input rather than the pruning logic itself; the manuscript does not include an ablation or controlled comparison in which baselines are also given depth/pose information.
  3. [Experiments] The experimental section lacks error bars, multiple random seeds, or statistical significance tests for the key retention numbers, and it is unclear whether the pruning hyperparameters were tuned on held-out validation data or post-hoc on the reported benchmarks.
minor comments (2)
  1. [Method] Notation for the geometry-aware attention mechanism should be defined more explicitly (e.g., how depth and pose are encoded into the attention weights) to aid reproducibility.
  2. [Figures] Figure captions and axis labels in the pruning-ratio vs. performance plots could be clarified to distinguish intra-voxel from inter-voxel contributions.

Simulated Authors' Rebuttal

3 responses · 0 unresolved

We appreciate the referee's thoughtful feedback on our manuscript. We have carefully considered each major comment and provide point-by-point responses below. Where appropriate, we will revise the manuscript to address the concerns.

read point-by-point responses
  1. Referee: [Method and Experiments] The central performance claim (90% retention at 90% pruning) rests on the assumption that depth maps and camera poses are accurate enough for reliable voxel partitioning and representative selection; however, no sensitivity analysis or noise-injection experiments are reported to test whether misalignment or estimation errors discard task-critical 3D structure.

    Authors: We thank the referee for highlighting this important aspect. In the context of 3D spatial videos, depth maps and camera poses are provided as part of the input data, similar to how they are used in other 3D vision tasks. To address the concern, we will include a sensitivity analysis by injecting noise into the depth and pose estimates and evaluating the impact on pruning performance and downstream task accuracy in the revised manuscript. This will demonstrate the robustness of our geometry-guided approach. revision: yes

  2. Referee: [Experiments] The reported outperformance over text- and vision-guided baselines may partly reflect the privileged geometry input rather than the pruning logic itself; the manuscript does not include an ablation or controlled comparison in which baselines are also given depth/pose information.

    Authors: We agree that this is a valid point for clarification. The text-guided and vision-guided baselines are designed to operate without geometric information, as they rely on text queries or 2D visual features respectively. Providing them with depth and pose would require modifying their core mechanisms, which may not be straightforward. Nevertheless, to strengthen the comparison, we will add an experiment where we augment the baselines with geometric cues where possible and compare the results. This will help isolate the contribution of our two-stage voxel pruning strategy. revision: partial

  3. Referee: [Experiments] The experimental section lacks error bars, multiple random seeds, or statistical significance tests for the key retention numbers, and it is unclear whether the pruning hyperparameters were tuned on held-out validation data or post-hoc on the reported benchmarks.

    Authors: We acknowledge the importance of statistical rigor in experimental reporting. In the revised version, we will include error bars based on multiple random seeds (e.g., 3-5 runs) for the key metrics and perform statistical significance tests where applicable. Regarding hyperparameter tuning, the pruning ratios and voxel sizes were selected based on preliminary experiments on a held-out subset of the training data to avoid overfitting to the test benchmarks. We will clarify this in the manuscript and provide more details on the tuning process. revision: yes

Circularity Check

0 steps flagged

No significant circularity; algorithmic procedure is self-contained

full rationale

The paper introduces Geo3DPruner as a novel algorithmic framework consisting of geometry-aware global attention followed by intra- and inter-voxel pruning stages. No equations, fitted parameters, or predictions are defined in terms of the target performance metrics or outputs. The reported retention of >90% performance at 90% pruning is presented as an empirical result from benchmarks rather than a quantity derived by construction from the method's own definitions. No self-citations serve as load-bearing uniqueness theorems, and no ansatzes or renamings reduce the central claims to prior inputs within the paper. The derivation chain is the procedure itself, which stands independently of the evaluation numbers.

Axiom & Free-Parameter Ledger

0 free parameters · 1 axiom · 0 invented entities

The method rests on the domain assumption that depth and camera pose data are provided and accurate; no free parameters or new invented entities are introduced in the abstract description.

axioms (1)
  • domain assumption Depth maps and camera poses are available and sufficiently accurate to compute cross-frame geometry-aware attention.
    The entire pruning pipeline begins with geometry-aware global attention that requires these inputs.

pith-pipeline@v0.9.0 · 5531 in / 1297 out tokens · 32977 ms · 2026-05-10T05:45:00.146457+00:00 · methodology

discussion (0)


Reference graph

Works this paper leans on

53 extracted references · 11 canonical work pages

  1. [1] Daichi Azuma, Taiki Miyanishi, Shuhei Kurita, and Motoaki Kawanabe. ScanQA: 3D Question Answering for Spatial Scene Understanding. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 19129–19139, 2022.
  2. [2] Yohann Cabon, Lucas Stoffl, Leonid Antsfeld, Gabriela Csurka, Boris Chidlovskii, Jerome Revaud, and Vincent Leroy. MUSt3R: Multi-view Network for Stereo 3D Reconstruction. In Proceedings of the Computer Vision and Pattern Recognition Conference, pages 1050–1060, 2025.
  3. [3] Dave Zhenyu Chen, Angel X Chang, and Matthias Nießner. ScanRefer: 3D Object Localization in RGB-D Scans Using Natural Language. In European Conference on Computer Vision, pages 202–221, 2020.
  4. [4] Liang Chen, Haozhe Zhao, Tianyu Liu, Shuai Bai, Junyang Lin, Chang Zhou, and Baobao Chang. An Image is Worth 1/2 Tokens After Layer 2: Plug-and-Play Inference Acceleration for Large Vision-Language Models. In European Conference on Computer Vision, pages 19–35, 2024.
  5. [5] Sijin Chen, Xin Chen, Chi Zhang, Mingsheng Li, Gang Yu, Hao Fei, Hongyuan Zhu, Jiayuan Fan, and Tao Chen. LL3DA: Visual Interactive Instruction Tuning for Omni-3D Understanding, Reasoning, and Planning. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 26428–26438, 2024.
  6. [6] Zhenyu Chen, Ali Gholami, Matthias Nießner, and Angel X Chang. Scan2Cap: Context-Aware Dense Captioning in RGB-D Scans. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 3193–3203, 2021.
  7. [7] Angela Dai, Angel X Chang, Manolis Savva, Maciej Halber, Thomas Funkhouser, and Matthias Nießner. ScanNet: Richly-Annotated 3D Reconstructions of Indoor Scenes. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pages 5828–5839, 2017.
  8. [8] Jiahui Fu, Chen Gao, Zitian Wang, Lirong Yang, Xiaofei Wang, Beipeng Mu, and Si Liu. Eliminating Cross-modal Conflicts in BEV Space for LiDAR-Camera 3D Object Detection. In IEEE International Conference on Robotics and Automation, pages 16381–16387, 2024.
  9. [9] Xiaodong Gu, Zhiwen Fan, Siyu Zhu, Zuozhuo Dai, Feitong Tan, and Ping Tan. Cascade Cost Volume for High-Resolution Multi-View Stereo and Stereo Matching. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 2495–2504, 2020.
  10. [10] Yining Hong, Haoyu Zhen, Peihao Chen, Shuhong Zheng, Yilun Du, Zhenfang Chen, and Chuang Gan. 3D-LLM: Injecting the 3D World into Large Language Models. Advances in Neural Information Processing Systems, 36:20482–20494, 2023.
  11. [11] Jiangyong Huang, Silong Yong, Xiaojian Ma, Xiongkun Linghu, Puhao Li, Yan Wang, Qing Li, Song-Chun Zhu, Baoxiong Jia, and Siyuan Huang. An Embodied Generalist Agent in 3D World. In Proceedings of the 41st International Conference on Machine Learning, pages 20413–20451, 2024.
  12. [12] Xiaohu Huang, Hao Zhou, and Kai Han. PruneVid: Visual Token Pruning for Efficient Video Large Language Models. In Findings of the Association for Computational Linguistics, pages 19959–19973, 2025.
  13. [13] Bernhard Kerbl, Georgios Kopanas, Thomas Leimkühler, and George Drettakis. 3D Gaussian Splatting for Real-Time Radiance Field Rendering. ACM Transactions on Graphics, 42(4), 2023.
  14. [14] Vincent Leroy, Yohann Cabon, and Jérôme Revaud. Grounding Image Matching in 3D with MASt3R. In European Conference on Computer Vision, pages 71–91, 2024.
  15. [15] Yanwei Li, Chengyao Wang, and Jiaya Jia. LLaMA-VID: An Image is Worth 2 Tokens in Large Language Models. In European Conference on Computer Vision, pages 323–340, 2024.
  16. [16] Yudong Liu, Jingwei Sun, Yueqian Lin, Jianyi Zhang, Jingyang Zhang, Ming Yin, Qinsi Wang, Hai Li, and Yiran Chen. Keyframe-oriented Vision Token Pruning: Enhancing Efficiency of Large Vision Language Models on Long-Form Video Processing. In Proceedings of the IEEE/CVF International Conference on Computer Vision, pages 20802–20811, 2025.
  17. [17] Junyu Luo, Jiahui Fu, Xianghao Kong, Chen Gao, Haibing Ren, Hao Shen, Huaxia Xia, and Si Liu. 3D-SPS: Single-Stage 3D Visual Grounding via Referred Point Progressive Selection. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 16454–16463, 2022.
  18. [18] Xiaojian Ma, Silong Yong, Zilong Zheng, Qing Li, Yitao Liang, Song-Chun Zhu, and Siyuan Huang. SQA3D: Situated Question Answering in 3D Scenes. arXiv preprint arXiv:2210.07474, 2022.
  19. [19] Kishore Papineni, Salim Roukos, Todd Ward, and Wei-Jing Zhu. BLEU: a Method for Automatic Evaluation of Machine Translation. In Proceedings of the 40th Annual Meeting of the Association for Computational Linguistics, pages 311–318, 2002.
  20. [20] Zhangyang Qi, Zhixiong Zhang, Ye Fang, Jiaqi Wang, and Hengshuang Zhao. GPT4Scene: Understand 3D Scenes from Videos with Vision-Language Models. arXiv preprint arXiv:2501.01428, 2025.
  21. [21] Johannes L Schonberger and Jan-Michael Frahm. Structure-From-Motion Revisited. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pages 4104–4113, 2016.
  22. [22] Yuzhang Shang, Mu Cai, Bingxin Xu, Yong Jae Lee, and Yan Yan. LLaVA-PruMerge: Adaptive Token Reduction for Efficient Large Multimodal Models. In Proceedings of the IEEE/CVF International Conference on Computer Vision, pages 22857–22867, 2025.
  23. [23] Keda Tao, Can Qin, Haoxuan You, Yang Sui, and Huan Wang. DyCoke: Dynamic Compression of Tokens for Fast Video Large Language Models. In Proceedings of the Computer Vision and Pattern Recognition Conference, pages 18992–19001, 2025.
  24. [24] Qwen Team et al. Qwen2 Technical Report. arXiv preprint arXiv:2407.10671, 2024.
  25. [25] Ashish Vaswani, Noam Shazeer, Niki Parmar, Jakob Uszkoreit, Llion Jones, Aidan N Gomez, Łukasz Kaiser, and Illia Polosukhin. Attention Is All You Need. Advances in Neural Information Processing Systems, 30, 2017.
  26. [26] Ramakrishna Vedantam, C Lawrence Zitnick, and Devi Parikh. CIDEr: Consensus-Based Image Description Evaluation. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pages 4566–4575, 2015.
  27. [27] Zhongwei Wan, Ziang Wu, Che Liu, Jinfa Huang, Zhihong Zhu, Peng Jin, Longyue Wang, and Li Yuan. LOOK-M: Look-Once Optimization in KV Cache for Efficient Multimodal Long-Context Inference. arXiv preprint arXiv:2406.18139, 2024.
  28. [28] Ao Wang, Fengyuan Sun, Hui Chen, Zijia Lin, Jungong Han, and Guiguang Ding. [CLS] Token Tells Everything Needed for Training-free Efficient MLLMs. arXiv preprint arXiv:2412.05819, 2024.
  29. [29] Jianyuan Wang, Minghao Chen, Nikita Karaev, Andrea Vedaldi, Christian Rupprecht, and David Novotny. VGGT: Visual Geometry Grounded Transformer. In Proceedings of the Computer Vision and Pattern Recognition Conference, pages 5294–5306, 2025.
  30. [30] Qianqian Wang, Zhicheng Wang, Kyle Genova, Pratul P Srinivasan, Howard Zhou, Jonathan T Barron, Ricardo Martin-Brualla, Noah Snavely, and Thomas Funkhouser. IBRNet: Learning Multi-View Image-Based Rendering. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 4690–4699, 2021.
  31. [31] Qianqian Wang, Yifei Zhang, Aleksander Holynski, Alexei A Efros, and Angjoo Kanazawa. Continuous 3D Perception Model with Persistent State. In Proceedings of the Computer Vision and Pattern Recognition Conference, pages 10510–10522, 2025.
  32. [32] Shuzhe Wang, Vincent Leroy, Yohann Cabon, Boris Chidlovskii, and Jerome Revaud. DUSt3R: Geometric 3D Vision Made Easy. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 20697–20709, 2024.
  33. [33] Tai Wang, Xiaohan Mao, Chenming Zhu, Runsen Xu, Ruiyuan Lyu, Peisen Li, Xiao Chen, Wenwei Zhang, Kai Chen, Tianfan Xue, et al. EmbodiedScan: A Holistic Multi-Modal 3D Perception Suite Towards Embodied AI. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 19757–19767, 2024.
  34. [34] Long Xing, Qidong Huang, Xiaoyi Dong, Jiajie Lu, Pan Zhang, Yuhang Zang, Yuhang Cao, Conghui He, Jiaqi Wang, Feng Wu, et al. PyramidDrop: Accelerating Your Large Vision-Language Models via Pyramid Visual Redundancy Reduction. arXiv preprint arXiv:2410.17247, 2024.
  35. [35] Runsen Xu, Xiaolong Wang, Tai Wang, Yilun Chen, Jiangmiao Pang, and Dahua Lin. PointLLM: Empowering Large Language Models to Understand Point Clouds. In European Conference on Computer Vision, pages 131–147, 2024.
  36. [36] Runsen Xu, Shuai Yang, Xiaolong Wang, Tai Wang, Yilun Chen, Jiangmiao Pang, and Dahua Lin. PointLLM-V2: Empowering Large Language Models to Better Understand Point Clouds. IEEE Transactions on Pattern Analysis and Machine Intelligence, 2025.
  37. [37] Jianing Yang, Alexander Sax, Kevin J Liang, Mikael Henaff, Hao Tang, Ang Cao, Joyce Chai, Franziska Meier, and Matt Feiszli. Fast3R: Towards 3D Reconstruction of 1000+ Images in One Forward Pass. In Proceedings of the Computer Vision and Pattern Recognition Conference, pages 21924–21935, 2025.
  38. [38] Jihan Yang, Shusheng Yang, Anjali W Gupta, Rilyn Han, Li Fei-Fei, and Saining Xie. Thinking in Space: How Multimodal Large Language Models See, Remember, and Recall Spaces. In Proceedings of the Computer Vision and Pattern Recognition Conference, pages 10632–10643, 2025.
  39. [39] Senqiao Yang, Yukang Chen, Zhuotao Tian, Chengyao Wang, Jingyao Li, Bei Yu, and Jiaya Jia. VisionZip: Longer is Better but Not Necessary in Vision Language Models. In Proceedings of the Computer Vision and Pattern Recognition Conference, pages 19792–19802, 2025.
  40. [40] Yao Yao, Zixin Luo, Shiwei Li, Tian Fang, and Long Quan. MVSNet: Depth Inference for Unstructured Multi-view Stereo. In European Conference on Computer Vision, pages 767–783, 2018.
  41. [41] Weihao Ye, Qiong Wu, Wenhao Lin, and Yiyi Zhou. Fit and Prune: Fast and Training-free Visual Token Pruning for Multi-modal Large Language Models. In Proceedings of the AAAI Conference on Artificial Intelligence, pages 22128–22136, 2025.
  42. [42] Alex Yu, Vickie Ye, Matthew Tancik, and Angjoo Kanazawa. pixelNeRF: Neural Radiance Fields From One or Few Images. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 4578–4587, 2021.
  43. [43] Xiaohua Zhai, Basil Mustafa, Alexander Kolesnikov, and Lucas Beyer. Sigmoid Loss for Language Image Pre-Training. In Proceedings of the IEEE/CVF International Conference on Computer Vision, pages 11975–11986, 2023.
  44. [44] Qizhe Zhang, Aosong Cheng, Ming Lu, Renrui Zhang, Zhiyong Zhuo, Jiajun Cao, Shaobo Guo, Qi She, and Shanghang Zhang. Beyond Text-Visual Attention: Exploiting Visual Cues for Effective Token Pruning in VLMs. In Proceedings of the IEEE/CVF International Conference on Computer Vision, pages 20857–20867, 2025.
  45. [45] Shaolei Zhang, Qingkai Fang, Zhe Yang, and Yang Feng. LLaVA-Mini: Efficient Image and Video Large Multimodal Models with One Vision Token. arXiv preprint arXiv:2501.03895, 2025.
  46. [46] Yiming Zhang, ZeMing Gong, and Angel X Chang. Multi3DRefer: Grounding Text Description to Multiple 3D Objects. In Proceedings of the IEEE/CVF International Conference on Computer Vision, pages 15225–15236, 2023.
  47. [47] Yuan Zhang, Chun-Kai Fan, Junpeng Ma, Wenzhao Zheng, Tao Huang, Kuan Cheng, Denis Gudovskiy, Tomoyuki Okuno, Yohei Nakata, Kurt Keutzer, et al. SparseVLM: Visual Token Sparsification for Efficient Vision-Language Model Inference. arXiv preprint arXiv:2410.04417, 2024.
  48. [48] Yuanhan Zhang, Jinming Wu, Wei Li, Bo Li, Zejun Ma, Ziwei Liu, and Chunyuan Li. LLaVA-Video: Video Instruction Tuning With Synthetic Data. arXiv preprint arXiv:2410.02713, 2024.
  49. [49] Wangbo Zhao, Yizeng Han, Jiasheng Tang, Zhikai Li, Yibing Song, Kai Wang, Zhangyang Wang, and Yang You. A Stitch in Time Saves Nine: Small VLM is a Precise Guidance for Accelerating Large VLMs. In Proceedings of the Computer Vision and Pattern Recognition Conference, pages 19814–19824, 2025.
  50. [50] Duo Zheng, Shijia Huang, Yanyang Li, and Liwei Wang. Learning from Videos for 3D World: Enhancing MLLMs with 3D Vision Geometry Priors. arXiv preprint arXiv:2505.24625, 2025.
  51. [51] Duo Zheng, Shijia Huang, and Liwei Wang. Video-3D LLM: Learning Position-Aware Video Representation for 3D Scene Understanding. In Proceedings of the Computer Vision and Pattern Recognition Conference, pages 8995–9006, 2025.
  52. [52] Chenming Zhu, Tai Wang, Wenwei Zhang, Jiangmiao Pang, and Xihui Liu. LLaVA-3D: A Simple yet Effective Pathway to Empowering LMMs with 3D Capabilities. In Proceedings of the IEEE/CVF International Conference on Computer Vision, pages 4295–4305, 2025.
  53. [53] Dong Zhuo, Wenzhao Zheng, Jiahe Guo, Yuqi Wu, Jie Zhou, and Jiwen Lu. Streaming 4D Visual Geometry Transformer. arXiv preprint arXiv:2507.11539, 2025.