pith. machine review for the scientific record.

arxiv: 2604.18260 · v1 · submitted 2026-04-20 · 💻 cs.CV

Recognition: unknown

Geometry-Guided 3D Visual Token Pruning for Video-Language Models

Authors on Pith: no claims yet

Pith reviewed 2026-05-10 05:45 UTC · model grok-4.3

classification 💻 cs.CV
keywords: visual token pruning · 3D scene understanding · geometry-aware attention · video-language models · token efficiency · voxel selection · spatial video processing

The pith

A geometry-guided pruning framework for 3D spatial videos allows video-language models to discard 90 percent of visual tokens while retaining over 90 percent of their original performance.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper introduces a method to reduce the computational load of processing 3D scenes in video-language models by pruning redundant visual tokens. It uses available depth and camera pose data to identify which tokens carry unique spatial information across frames. By grouping tokens into voxels and selecting representatives both inside and across voxels, the approach removes inter-frame redundancy without losing scene completeness. This matters because current models struggle with the high token counts in spatial videos, limiting their use in real-time or long-context 3D tasks. If effective, it would let these models handle complex 3D reasoning tasks more efficiently.
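As a concrete illustration of the token-to-voxel grouping described above, here is a minimal NumPy sketch of pinhole back-projection and voxel bucketing. It is not the authors' code: the function name, the 0.2 m voxel size, and the input layout are illustrative assumptions.

```python
import numpy as np

def tokens_to_voxels(pixels, depth, K, cam_to_world, voxel_size=0.2):
    """Back-project token centers into world space and bucket them into voxels.

    pixels:       (N, 2) token-center coordinates (u, v) in the image
    depth:        (N,) depth sampled at each token center, in meters
    K:            (3, 3) camera intrinsics
    cam_to_world: (4, 4) camera-to-world pose
    voxel_size:   cubic voxel edge length in meters (illustrative value)
    """
    ones = np.ones((pixels.shape[0], 1))
    # Pinhole back-projection: x_cam = depth * K^{-1} [u, v, 1]^T
    rays = (np.linalg.inv(K) @ np.hstack([pixels, ones]).T).T
    cam_pts = rays * depth[:, None]
    # A shared world frame makes tokens from different frames that observe
    # the same surface fall into the same voxel, exposing the redundancy.
    world_pts = (cam_to_world @ np.hstack([cam_pts, ones]).T).T[:, :3]
    # Quantize positions into integer voxel indices.
    voxel_idx = np.floor(world_pts / voxel_size).astype(np.int64)
    return world_pts, voxel_idx
```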

Core claim

Geo3DPruner first computes geometry-aware global attention to model cross-frame relevance using depth maps and camera poses, then applies a two-stage pruning: intra-voxel selection of representative multi-view features within each voxel, followed by inter-voxel selection that maintains a spatially diverse set of voxels. On 3D scene understanding benchmarks, this prunes 90% of visual tokens while retaining over 90% of the original performance, outperforming prior text-guided and vision-guided pruning approaches.
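A rough sketch of what geometry-aware global attention could look like, using the world-space token positions from the back-projection sketch above. The additive distance bias and the `lam` weight are assumptions for illustration, not the paper's formulation.

```python
import numpy as np

def geometry_aware_relevance(feats, world_pts, lam=1.0):
    """Score each token's cross-frame relevance with geometry-biased attention.

    feats:     (N, d) visual token features pooled from all frames
    world_pts: (N, 3) token positions in a shared world frame
    lam:       weight on the geometric bias (illustrative)
    Returns a (N,) relevance score per token.
    """
    d = feats.shape[1]
    # Scaled dot-product attention logits across all frames at once, so
    # tokens that re-observe the same surface can attend to each other.
    logits = feats @ feats.T / np.sqrt(d)
    # Geometric bias: down-weight pairs that are far apart in 3D, so
    # attention concentrates on geometry-aligned regions. O(N^2) memory.
    dists = np.linalg.norm(world_pts[:, None] - world_pts[None, :], axis=-1)
    logits = logits - lam * dists
    attn = np.exp(logits - logits.max(axis=1, keepdims=True))
    attn /= attn.sum(axis=1, keepdims=True)
    # A token's global relevance: how much attention it receives overall.
    return attn.sum(axis=0)
```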

What carries the argument

The two-stage pruning process, which partitions the 3D space into voxels based on depth and camera poses and then selects representative tokens to balance relevance and diversity.
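Building on the `tokens_to_voxels` and `geometry_aware_relevance` sketches above, a minimal version of the two-stage selection could look like the following. The top-k-per-voxel rule and farthest point sampling are plausible stand-ins for, not reproductions of, the paper's exact intra- and inter-voxel criteria.

```python
import numpy as np

def two_stage_prune(voxel_idx, scores, world_pts, per_voxel_k=1, num_voxels=64):
    """Two-stage token pruning sketch: intra-voxel selection by relevance
    score, then inter-voxel selection for spatial diversity.

    voxel_idx: (N, 3) integer voxel index per token (see tokens_to_voxels)
    scores:    (N,) relevance score per token (see geometry_aware_relevance)
    world_pts: (N, 3) token positions in world space
    Returns sorted indices of retained tokens.
    """
    # Stage 1 (intra-voxel): within each voxel, keep the top-k tokens by
    # score as that voxel's representative multi-view features.
    buckets = {}
    for i, key in enumerate(map(tuple, voxel_idx)):
        buckets.setdefault(key, []).append(i)
    reps, centers = [], []
    for idxs in buckets.values():
        best = sorted(idxs, key=lambda i: scores[i], reverse=True)[:per_voxel_k]
        reps.append(best)
        centers.append(world_pts[best].mean(axis=0))
    centers = np.stack(centers)

    # Stage 2 (inter-voxel): keep a spatially spread subset of voxels via
    # farthest point sampling, seeded at the highest-scoring voxel.
    seed = int(np.argmax([scores[r[0]] for r in reps]))
    chosen = [seed]
    dists = np.linalg.norm(centers - centers[seed], axis=1)
    while len(chosen) < min(num_voxels, len(reps)):
        nxt = int(np.argmax(dists))  # farthest voxel from any chosen one
        chosen.append(nxt)
        dists = np.minimum(dists, np.linalg.norm(centers - centers[nxt], axis=1))
    return sorted(i for c in chosen for i in reps[c])
```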

Load-bearing premise

That depth maps and camera poses are reliably available for computing geometry-aware attention and that voxel-based representative selection will not discard task-critical spatial information.

What would settle it

Running the pruned model on a 3D benchmark where depth or pose estimates contain significant noise or are missing and measuring whether performance retention falls substantially below 90 percent.
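A hedged sketch of such a stress test: inject noise into the geometric inputs before pruning and track performance retention. The noise magnitudes and the Rodrigues-rotation perturbation are illustrative assumptions, not the paper's protocol.

```python
import numpy as np

def perturb_geometry(depth, cam_to_world, depth_noise=0.05,
                     rot_noise_deg=1.0, seed=0):
    """Inject noise into depth and pose to probe the load-bearing premise.

    depth_noise:   std of multiplicative Gaussian noise on depth (5% here)
    rot_noise_deg: std of a small random rotation about a random axis
    Both magnitudes are illustrative choices, not values from the paper.
    """
    rng = np.random.default_rng(seed)
    noisy_depth = depth * (1.0 + rng.normal(0.0, depth_noise, size=depth.shape))

    # Small random rotation via Rodrigues' formula: R = I + sin(t) K + (1 - cos(t)) K^2
    axis = rng.normal(size=3)
    axis /= np.linalg.norm(axis)
    theta = np.deg2rad(rng.normal(0.0, rot_noise_deg))
    K = np.array([[0, -axis[2], axis[1]],
                  [axis[2], 0, -axis[0]],
                  [-axis[1], axis[0], 0]])
    R = np.eye(3) + np.sin(theta) * K + (1 - np.cos(theta)) * (K @ K)
    noisy_pose = cam_to_world.copy()
    noisy_pose[:3, :3] = R @ cam_to_world[:3, :3]
    return noisy_depth, noisy_pose
```

Sweeping `depth_noise` and `rot_noise_deg` while re-running a fixed benchmark would show whether the 90 percent retention figure degrades gracefully or collapses.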

Figures

Figures reproduced from arXiv: 2604.18260 by Han Li, Jiahui Fu, Naiyan Wang, Si Liu, Zehao Huang.

Figure 1: Motivation of Geo3DPruner. 3D spatial videos essentially represent multi-view projections of the complete 3D scene. Features corresponding to the same objects (e.g., the wooden desk or the swivel chair) frequently recur across different frames. Existing pruning strategies fail to remove such redundancy due to the absence of global cross-frame relevance modeling.
Figure 2: Framework of Geo3DPruner. We adopt Video-3D LLM [51] as our base model and replace its 3D positional encodings with geometry features following VG LLM [50]. Input video frames are processed by two parallel encoders: a 2D visual encoder (e.g., SigLIP [43]) extracts image features, and a 3D geometry encoder (e.g., VGGT [29]) captures geometric features while modeling long-range cross-frame dependencies.
Figure 3: Illustration of intra-voxel view consistency pruning. Within each voxel, multi-view features from different frames are evaluated based on attention scores to identify the most representative tokens.
Figure 4: Performance-efficiency trade-off curves for 32-frame videos using different visual token pruning methods. The x-axis denotes the reduction ratio in the number of visual tokens, and the y-axis shows the corresponding performance across benchmarks. The black dotted line represents the unpruned base model.
Figure 5: (a) Retained tokens in each frame show diverse spatial coverage across different objects, preserving overall scene completeness. (b) Global attention maps from the red-box patch in the first frame primarily focus on geometry-aligned and instance-related regions in other frames.
Figure 6: Voxel-level visualization before and after pruning. (a) Visualization of the original voxelized scene, where voxels densely cover both foreground objects and background regions. (b) Visualization after applying Geo3DPruner, where redundant voxels are largely removed while object-related and structurally important voxels are preserved.
read the original abstract

Multimodal large language models have demonstrated remarkable capabilities in 2D vision, motivating their extension to 3D scene understanding. Recent studies represent 3D scenes as 3D spatial videos composed of image sequences with depth and camera pose information, enabling pre-trained video-language models to perform 3D reasoning tasks. However, the large number of visual tokens in spatial videos remains a major bottleneck for efficient inference and context management. Existing pruning methods overlook the view consistency of spatial videos and the spatial diversity of the remaining tokens, which prevents them from effectively removing inter-frame redundancy and preserving scene completeness. In this paper, we propose Geo3DPruner, a Geometry-Guided 3D Visual Token Pruning framework. Geo3DPruner first models cross-frame relevance through geometry-aware global attention, and then performs a two-stage pruning process. The intra-voxel stage selects representative multi-view features within each voxel, while the inter-voxel stage preserves spatial diversity by selecting a globally distributed subset of voxels. Extensive experiments on multiple 3D scene understanding benchmarks demonstrate that Geo3DPruner retains over 90% of the original performance while pruning 90% of visual tokens, significantly outperforming existing text-guided and vision-guided pruning methods.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Referee Report

3 major / 2 minor

Summary. The paper proposes Geo3DPruner, a geometry-guided framework for pruning visual tokens in 3D spatial videos processed by video-language models. It first computes cross-frame relevance via geometry-aware global attention using depth maps and camera poses, then applies a two-stage process: intra-voxel selection of representative multi-view features within each voxel and inter-voxel selection to maintain spatial diversity across voxels. Experiments on multiple 3D scene understanding benchmarks show that the method prunes 90% of visual tokens while retaining over 90% of original performance and outperforming text-guided and vision-guided pruning baselines.

Significance. If the empirical claims hold under scrutiny, the work offers a practical way to reduce the token bottleneck in 3D-aware VLMs by exploiting readily available geometric cues to remove inter-frame redundancy while preserving scene structure. The two-stage voxel-based selection is a concrete algorithmic contribution that could inform future efficiency techniques for long spatial video inputs.

major comments (3)
  1. [Method and Experiments] The central performance claim (90% retention at 90% pruning) rests on the assumption that depth maps and camera poses are accurate enough for reliable voxel partitioning and representative selection; however, no sensitivity analysis or noise-injection experiments are reported to test whether misalignment or estimation errors discard task-critical 3D structure.
  2. [Experiments] The reported outperformance over text- and vision-guided baselines may partly reflect the privileged geometry input rather than the pruning logic itself; the manuscript does not include an ablation or controlled comparison in which baselines are also given depth/pose information.
  3. [Experiments] The experimental section lacks error bars, multiple random seeds, or statistical significance tests for the key retention numbers, and it is unclear whether the pruning hyperparameters were tuned on held-out validation data or post-hoc on the reported benchmarks.
minor comments (2)
  1. [Method] Notation for the geometry-aware attention mechanism should be defined more explicitly (e.g., how depth and pose are encoded into the attention weights) to aid reproducibility.
  2. [Figures] Figure captions and axis labels in the pruning-ratio vs. performance plots could be clarified to distinguish intra-voxel from inter-voxel contributions.

Simulated Authors' Rebuttal

3 responses · 0 unresolved

We appreciate the referee's thoughtful feedback on our manuscript. We have carefully considered each major comment and provide point-by-point responses below. Where appropriate, we will revise the manuscript to address the concerns.

read point-by-point responses
  1. Referee: [Method and Experiments] The central performance claim (90% retention at 90% pruning) rests on the assumption that depth maps and camera poses are accurate enough for reliable voxel partitioning and representative selection; however, no sensitivity analysis or noise-injection experiments are reported to test whether misalignment or estimation errors discard task-critical 3D structure.

    Authors: We thank the referee for highlighting this important aspect. In the context of 3D spatial videos, depth maps and camera poses are provided as part of the input data, similar to how they are used in other 3D vision tasks. To address the concern, we will include a sensitivity analysis by injecting noise into the depth and pose estimates and evaluating the impact on pruning performance and downstream task accuracy in the revised manuscript. This will demonstrate the robustness of our geometry-guided approach. revision: yes

  2. Referee: [Experiments] The reported outperformance over text- and vision-guided baselines may partly reflect the privileged geometry input rather than the pruning logic itself; the manuscript does not include an ablation or controlled comparison in which baselines are also given depth/pose information.

    Authors: We agree that this is a valid point for clarification. The text-guided and vision-guided baselines are designed to operate without geometric information, as they rely on text queries or 2D visual features respectively. Providing them with depth and pose would require modifying their core mechanisms, which may not be straightforward. Nevertheless, to strengthen the comparison, we will add an experiment where we augment the baselines with geometric cues where possible and compare the results. This will help isolate the contribution of our two-stage voxel pruning strategy. revision: partial

  3. Referee: [Experiments] The experimental section lacks error bars, multiple random seeds, or statistical significance tests for the key retention numbers, and it is unclear whether the pruning hyperparameters were tuned on held-out validation data or post-hoc on the reported benchmarks.

    Authors: We acknowledge the importance of statistical rigor in experimental reporting. In the revised version, we will include error bars based on multiple random seeds (e.g., 3-5 runs) for the key metrics and perform statistical significance tests where applicable. Regarding hyperparameter tuning, the pruning ratios and voxel sizes were selected based on preliminary experiments on a held-out subset of the training data to avoid overfitting to the test benchmarks. We will clarify this in the manuscript and provide more details on the tuning process. revision: yes

Circularity Check

0 steps flagged

No significant circularity; algorithmic procedure is self-contained

full rationale

The paper introduces Geo3DPruner as a novel algorithmic framework consisting of geometry-aware global attention followed by intra- and inter-voxel pruning stages. No equations, fitted parameters, or predictions are defined in terms of the target performance metrics or outputs. The reported retention of >90% performance at 90% pruning is presented as an empirical result from benchmarks rather than a quantity derived by construction from the method's own definitions. No self-citations serve as load-bearing uniqueness theorems, and no ansatzes or renamings reduce the central claims to prior inputs within the paper. The derivation chain is the procedure itself, which stands independently of the evaluation numbers.

Axiom & Free-Parameter Ledger

0 free parameters · 1 axiom · 0 invented entities

The method rests on the domain assumption that depth and camera pose data are provided and accurate; no free parameters or new invented entities are introduced in the abstract description.

axioms (1)
  • domain assumption Depth maps and camera poses are available and sufficiently accurate to compute cross-frame geometry-aware attention.
    The entire pruning pipeline begins with geometry-aware global attention that requires these inputs.

pith-pipeline@v0.9.0 · 5531 in / 1297 out tokens · 32977 ms · 2026-05-10T05:45:00.146457+00:00 · methodology

discussion (0)


Reference graph

Works this paper leans on

53 extracted references · 11 canonical work pages

  1. [1] Daichi Azuma, Taiki Miyanishi, Shuhei Kurita, and Motoaki Kawanabe. ScanQA: 3D Question Answering for Spatial Scene Understanding. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 19129–19139, 2022.
  2. [2] Yohann Cabon, Lucas Stoffl, Leonid Antsfeld, Gabriela Csurka, Boris Chidlovskii, Jerome Revaud, and Vincent Leroy. MUSt3R: Multi-view Network for Stereo 3D Reconstruction. In Proceedings of the Computer Vision and Pattern Recognition Conference, pages 1050–1060, 2025.
  3. [3] Dave Zhenyu Chen, Angel X Chang, and Matthias Nießner. ScanRefer: 3D Object Localization in RGB-D Scans Using Natural Language. In European Conference on Computer Vision, pages 202–221, 2020.
  4. [4] Liang Chen, Haozhe Zhao, Tianyu Liu, Shuai Bai, Junyang Lin, Chang Zhou, and Baobao Chang. An Image is Worth 1/2 Tokens After Layer 2: Plug-and-Play Inference Acceleration for Large Vision-Language Models. In European Conference on Computer Vision, pages 19–35, 2024.
  5. [5] Sijin Chen, Xin Chen, Chi Zhang, Mingsheng Li, Gang Yu, Hao Fei, Hongyuan Zhu, Jiayuan Fan, and Tao Chen. LL3DA: Visual Interactive Instruction Tuning for Omni-3D Understanding, Reasoning, and Planning. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 26428–26438, 2024.
  6. [6] Zhenyu Chen, Ali Gholami, Matthias Nießner, and Angel X Chang. Scan2Cap: Context-Aware Dense Captioning in RGB-D Scans. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 3193–3203, 2021.
  7. [7] Angela Dai, Angel X Chang, Manolis Savva, Maciej Halber, Thomas Funkhouser, and Matthias Nießner. ScanNet: Richly-Annotated 3D Reconstructions of Indoor Scenes. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pages 5828–5839, 2017.
  8. [8] Jiahui Fu, Chen Gao, Zitian Wang, Lirong Yang, Xiaofei Wang, Beipeng Mu, and Si Liu. Eliminating Cross-modal Conflicts in BEV Space for LiDAR-Camera 3D Object Detection. In IEEE International Conference on Robotics and Automation, pages 16381–16387, 2024.
  9. [9] Xiaodong Gu, Zhiwen Fan, Siyu Zhu, Zuozhuo Dai, Feitong Tan, and Ping Tan. Cascade Cost Volume for High-Resolution Multi-View Stereo and Stereo Matching. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 2495–2504, 2020.
  10. [10] Yining Hong, Haoyu Zhen, Peihao Chen, Shuhong Zheng, Yilun Du, Zhenfang Chen, and Chuang Gan. 3D-LLM: Injecting the 3D World into Large Language Models. Advances in Neural Information Processing Systems, 36:20482–20494, 2023.
  11. [11] Jiangyong Huang, Silong Yong, Xiaojian Ma, Xiongkun Linghu, Puhao Li, Yan Wang, Qing Li, Song-Chun Zhu, Baoxiong Jia, and Siyuan Huang. An Embodied Generalist Agent in 3D World. In Proceedings of the 41st International Conference on Machine Learning, pages 20413–20451, 2024.
  12. [12] Xiaohu Huang, Hao Zhou, and Kai Han. PruneVid: Visual Token Pruning for Efficient Video Large Language Models. In Findings of the Association for Computational Linguistics, pages 19959–19973, 2025.
  13. [13] Bernhard Kerbl, Georgios Kopanas, Thomas Leimkühler, and George Drettakis. 3D Gaussian Splatting for Real-Time Radiance Field Rendering. ACM Transactions on Graphics, 42(4), 2023.
  14. [14] Vincent Leroy, Yohann Cabon, and Jérôme Revaud. Grounding Image Matching in 3D with MASt3R. In European Conference on Computer Vision, pages 71–91, 2024.
  15. [15] Yanwei Li, Chengyao Wang, and Jiaya Jia. LLaMA-VID: An Image is Worth 2 Tokens in Large Language Models. In European Conference on Computer Vision, pages 323–340, 2024.
  16. [16] Yudong Liu, Jingwei Sun, Yueqian Lin, Jianyi Zhang, Jingyang Zhang, Ming Yin, Qinsi Wang, Hai Li, and Yiran Chen. Keyframe-oriented Vision Token Pruning: Enhancing Efficiency of Large Vision Language Models on Long-Form Video Processing. In Proceedings of the IEEE/CVF International Conference on Computer Vision, pages 20802–20811, 2025.
  17. [17] Junyu Luo, Jiahui Fu, Xianghao Kong, Chen Gao, Haibing Ren, Hao Shen, Huaxia Xia, and Si Liu. 3D-SPS: Single-Stage 3D Visual Grounding via Referred Point Progressive Selection. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 16454–16463, 2022.
  18. [18] Xiaojian Ma, Silong Yong, Zilong Zheng, Qing Li, Yitao Liang, Song-Chun Zhu, and Siyuan Huang. SQA3D: Situated Question Answering in 3D Scenes. arXiv preprint arXiv:2210.07474, 2022.
  19. [19] Kishore Papineni, Salim Roukos, Todd Ward, and Wei-Jing Zhu. BLEU: a Method for Automatic Evaluation of Machine Translation. In Proceedings of the 40th Annual Meeting of the Association for Computational Linguistics, pages 311–318, 2002.
  20. [20] Zhangyang Qi, Zhixiong Zhang, Ye Fang, Jiaqi Wang, and Hengshuang Zhao. GPT4Scene: Understand 3D Scenes from Videos with Vision-Language Models. arXiv preprint arXiv:2501.01428, 2025.
  21. [21] Johannes L Schonberger and Jan-Michael Frahm. Structure-From-Motion Revisited. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pages 4104–4113, 2016.
  22. [22] Yuzhang Shang, Mu Cai, Bingxin Xu, Yong Jae Lee, and Yan Yan. LLaVA-PruMerge: Adaptive Token Reduction for Efficient Large Multimodal Models. In Proceedings of the IEEE/CVF International Conference on Computer Vision, pages 22857–22867, 2025.
  23. [23] Keda Tao, Can Qin, Haoxuan You, Yang Sui, and Huan Wang. DyCoke: Dynamic Compression of Tokens for Fast Video Large Language Models. In Proceedings of the Computer Vision and Pattern Recognition Conference, pages 18992–19001, 2025.
  24. [24] Qwen Team et al. Qwen2 Technical Report. arXiv preprint arXiv:2407.10671, 2024.
  25. [25] Ashish Vaswani, Noam Shazeer, Niki Parmar, Jakob Uszkoreit, Llion Jones, Aidan N Gomez, Łukasz Kaiser, and Illia Polosukhin. Attention Is All You Need. Advances in Neural Information Processing Systems, 30, 2017.
  26. [26] Ramakrishna Vedantam, C Lawrence Zitnick, and Devi Parikh. CIDEr: Consensus-Based Image Description Evaluation. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pages 4566–4575, 2015.
  27. [27] Zhongwei Wan, Ziang Wu, Che Liu, Jinfa Huang, Zhihong Zhu, Peng Jin, Longyue Wang, and Li Yuan. LOOK-M: Look-Once Optimization in KV Cache for Efficient Multimodal Long-Context Inference. arXiv preprint arXiv:2406.18139, 2024.
  28. [28] Ao Wang, Fengyuan Sun, Hui Chen, Zijia Lin, Jungong Han, and Guiguang Ding. [CLS] Token Tells Everything Needed for Training-free Efficient MLLMs. arXiv preprint arXiv:2412.05819, 2024.
  29. [29] Jianyuan Wang, Minghao Chen, Nikita Karaev, Andrea Vedaldi, Christian Rupprecht, and David Novotny. VGGT: Visual Geometry Grounded Transformer. In Proceedings of the Computer Vision and Pattern Recognition Conference, pages 5294–5306, 2025.
  30. [30] Qianqian Wang, Zhicheng Wang, Kyle Genova, Pratul P Srinivasan, Howard Zhou, Jonathan T Barron, Ricardo Martin-Brualla, Noah Snavely, and Thomas Funkhouser. IBRNet: Learning Multi-View Image-Based Rendering. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 4690–4699, 2021.
  31. [31] Qianqian Wang, Yifei Zhang, Aleksander Holynski, Alexei A Efros, and Angjoo Kanazawa. Continuous 3D Perception Model with Persistent State. In Proceedings of the Computer Vision and Pattern Recognition Conference, pages 10510–10522, 2025.
  32. [32] Shuzhe Wang, Vincent Leroy, Yohann Cabon, Boris Chidlovskii, and Jerome Revaud. DUSt3R: Geometric 3D Vision Made Easy. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 20697–20709, 2024.
  33. [33] Tai Wang, Xiaohan Mao, Chenming Zhu, Runsen Xu, Ruiyuan Lyu, Peisen Li, Xiao Chen, Wenwei Zhang, Kai Chen, Tianfan Xue, et al. EmbodiedScan: A Holistic Multi-Modal 3D Perception Suite Towards Embodied AI. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 19757–19767, 2024.
  34. [34] Long Xing, Qidong Huang, Xiaoyi Dong, Jiajie Lu, Pan Zhang, Yuhang Zang, Yuhang Cao, Conghui He, Jiaqi Wang, Feng Wu, et al. PyramidDrop: Accelerating Your Large Vision-Language Models via Pyramid Visual Redundancy Reduction. arXiv preprint arXiv:2410.17247, 2024.
  35. [35] Runsen Xu, Xiaolong Wang, Tai Wang, Yilun Chen, Jiangmiao Pang, and Dahua Lin. PointLLM: Empowering Large Language Models to Understand Point Clouds. In European Conference on Computer Vision, pages 131–147, 2024.
  36. [36] Runsen Xu, Shuai Yang, Xiaolong Wang, Tai Wang, Yilun Chen, Jiangmiao Pang, and Dahua Lin. PointLLM-V2: Empowering Large Language Models to Better Understand Point Clouds. IEEE Transactions on Pattern Analysis and Machine Intelligence, 2025.
  37. [37] Jianing Yang, Alexander Sax, Kevin J Liang, Mikael Henaff, Hao Tang, Ang Cao, Joyce Chai, Franziska Meier, and Matt Feiszli. Fast3R: Towards 3D Reconstruction of 1000+ Images in One Forward Pass. In Proceedings of the Computer Vision and Pattern Recognition Conference, pages 21924–21935, 2025.
  38. [38] Jihan Yang, Shusheng Yang, Anjali W Gupta, Rilyn Han, Li Fei-Fei, and Saining Xie. Thinking in Space: How Multimodal Large Language Models See, Remember, and Recall Spaces. In Proceedings of the Computer Vision and Pattern Recognition Conference, pages 10632–10643, 2025.
  39. [39] Senqiao Yang, Yukang Chen, Zhuotao Tian, Chengyao Wang, Jingyao Li, Bei Yu, and Jiaya Jia. VisionZip: Longer is Better but Not Necessary in Vision Language Models. In Proceedings of the Computer Vision and Pattern Recognition Conference, pages 19792–19802, 2025.
  40. [40] Yao Yao, Zixin Luo, Shiwei Li, Tian Fang, and Long Quan. MVSNet: Depth Inference for Unstructured Multi-view Stereo. In European Conference on Computer Vision, pages 767–783, 2018.
  41. [41] Weihao Ye, Qiong Wu, Wenhao Lin, and Yiyi Zhou. Fit and Prune: Fast and Training-free Visual Token Pruning for Multi-modal Large Language Models. In Proceedings of the AAAI Conference on Artificial Intelligence, pages 22128–22136, 2025.
  42. [42] Alex Yu, Vickie Ye, Matthew Tancik, and Angjoo Kanazawa. pixelNeRF: Neural Radiance Fields From One or Few Images. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 4578–4587, 2021.
  43. [43] Xiaohua Zhai, Basil Mustafa, Alexander Kolesnikov, and Lucas Beyer. Sigmoid Loss for Language Image Pre-Training. In Proceedings of the IEEE/CVF International Conference on Computer Vision, pages 11975–11986, 2023.
  44. [44] Qizhe Zhang, Aosong Cheng, Ming Lu, Renrui Zhang, Zhiyong Zhuo, Jiajun Cao, Shaobo Guo, Qi She, and Shanghang Zhang. Beyond Text-Visual Attention: Exploiting Visual Cues for Effective Token Pruning in VLMs. In Proceedings of the IEEE/CVF International Conference on Computer Vision, pages 20857–20867, 2025.
  45. [45] Shaolei Zhang, Qingkai Fang, Zhe Yang, and Yang Feng. LLaVA-Mini: Efficient Image and Video Large Multimodal Models with One Vision Token. arXiv preprint arXiv:2501.03895, 2025.
  46. [46] Yiming Zhang, ZeMing Gong, and Angel X Chang. Multi3DRefer: Grounding Text Description to Multiple 3D Objects. In Proceedings of the IEEE/CVF International Conference on Computer Vision, pages 15225–15236, 2023.
  47. [47] Yuan Zhang, Chun-Kai Fan, Junpeng Ma, Wenzhao Zheng, Tao Huang, Kuan Cheng, Denis Gudovskiy, Tomoyuki Okuno, Yohei Nakata, Kurt Keutzer, et al. SparseVLM: Visual Token Sparsification for Efficient Vision-Language Model Inference. arXiv preprint arXiv:2410.04417, 2024.
  48. [48] Yuanhan Zhang, Jinming Wu, Wei Li, Bo Li, Zejun Ma, Ziwei Liu, and Chunyuan Li. LLaVA-Video: Video Instruction Tuning With Synthetic Data. arXiv preprint arXiv:2410.02713, 2024.
  49. [49] Wangbo Zhao, Yizeng Han, Jiasheng Tang, Zhikai Li, Yibing Song, Kai Wang, Zhangyang Wang, and Yang You. A Stitch in Time Saves Nine: Small VLM is a Precise Guidance for Accelerating Large VLMs. In Proceedings of the Computer Vision and Pattern Recognition Conference, pages 19814–19824, 2025.
  50. [50] Duo Zheng, Shijia Huang, Yanyang Li, and Liwei Wang. Learning from Videos for 3D World: Enhancing MLLMs with 3D Vision Geometry Priors. arXiv preprint arXiv:2505.24625, 2025.
  51. [51] Duo Zheng, Shijia Huang, and Liwei Wang. Video-3D LLM: Learning Position-Aware Video Representation for 3D Scene Understanding. In Proceedings of the Computer Vision and Pattern Recognition Conference, pages 8995–9006, 2025.
  52. [52] Chenming Zhu, Tai Wang, Wenwei Zhang, Jiangmiao Pang, and Xihui Liu. LLaVA-3D: A Simple yet Effective Pathway to Empowering LMMs with 3D Capabilities. In Proceedings of the IEEE/CVF International Conference on Computer Vision, pages 4295–4305, 2025.
  53. [53] Dong Zhuo, Wenzhao Zheng, Jiahe Guo, Yuqi Wu, Jie Zhou, and Jiwen Lu. Streaming 4D Visual Geometry Transformer. arXiv preprint arXiv:2507.11539, 2025.