Geometry-Guided 3D Visual Token Pruning for Video-Language Models
Pith reviewed 2026-05-10 05:45 UTC · model grok-4.3
The pith
A geometry-guided pruning framework for 3D spatial videos allows video-language models to discard 90 percent of visual tokens while retaining over 90 percent of their original performance.
A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.
Core claim
Geo3DPruner first computes geometry-aware global attention to model cross-frame relevance using depth maps and camera poses, then applies a two-stage pruning: intra-voxel selection of representative multi-view features within each voxel, followed by inter-voxel selection to maintain a spatially diverse set of voxels. On 3D scene understanding benchmarks, this prunes 90% of tokens while retaining over 90% of original performance and beats prior text-guided or vision-only pruning approaches.
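The geometry-aware global attention step can be sketched compactly. The snippet below is a toy illustration, not the paper's implementation: it assumes each token already carries a 3D position (lifted from depth and camera pose) and simply adds a distance-based bias to standard dot-product attention; the Gaussian form and the `sigma` length scale are illustrative assumptions.

```python
import numpy as np

def geometry_aware_attention(features, positions, sigma=0.5):
    """Toy geometry-aware global attention over tokens from all frames.

    features:  (N, D) visual token features, all frames concatenated
    positions: (N, 3) 3D world coordinates per token (from depth + camera pose)
    sigma:     length scale of the geometric bias (illustrative assumption)
    """
    d = features.shape[1]
    logits = features @ features.T / np.sqrt(d)            # plain dot-product attention
    dist2 = ((positions[:, None, :] - positions[None, :, :]) ** 2).sum(-1)
    logits = logits - dist2 / (2.0 * sigma ** 2)           # penalize geometrically distant pairs
    logits -= logits.max(axis=-1, keepdims=True)           # softmax numerical stability
    weights = np.exp(logits)
    weights /= weights.sum(axis=-1, keepdims=True)
    return weights @ features                              # geometry-aware attention output
```

Tokens that are close in 3D, even across different frames, attend to each other more strongly, which is what lets the method identify cross-view redundancy.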
What carries the argument
The two-stage pruning process that partitions the 3D space into voxels based on depth and camera poses then selects representative tokens to balance relevance and diversity.
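A minimal sketch of that two-stage selection, under assumed details the paper may handle differently (per-token relevance scores, a fixed voxel grid, and farthest-point sampling as the diversity mechanism):

```python
import numpy as np

def voxel_prune(positions, scores, voxel_size=0.5, keep_voxels=4):
    """Two-stage token selection sketch: intra-voxel then inter-voxel.

    positions: (N, 3) token 3D coordinates
    scores:    (N,) relevance scores (e.g. from geometry-aware attention)
    Returns indices of the kept tokens.
    """
    voxel_ids = np.floor(positions / voxel_size).astype(int)
    # Intra-voxel stage: keep the highest-scoring token in each occupied voxel.
    reps = {}
    for i, vid in enumerate(map(tuple, voxel_ids)):
        if vid not in reps or scores[i] > scores[reps[vid]]:
            reps[vid] = i
    rep_idx = np.array(sorted(reps.values()))
    centers = positions[rep_idx]
    # Inter-voxel stage: farthest-point sampling over representatives,
    # seeded at the most relevant voxel, to keep a spatially spread subset.
    chosen = [int(np.argmax(scores[rep_idx]))]
    d = np.linalg.norm(centers - centers[chosen[0]], axis=1)
    while len(chosen) < min(keep_voxels, len(rep_idx)):
        nxt = int(np.argmax(d))
        chosen.append(nxt)
        d = np.minimum(d, np.linalg.norm(centers - centers[nxt], axis=1))
    return rep_idx[chosen]
```

The first stage removes multi-view duplicates of the same surface patch; the second trades raw relevance for spatial coverage so the surviving tokens still span the whole scene.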
Load-bearing premise
That depth maps and camera poses are reliably available for computing geometry-aware attention and that voxel-based representative selection will not discard task-critical spatial information.
What would settle it
Running the pruned model on a 3D benchmark where depth or pose estimates contain significant noise or are missing and measuring whether performance retention falls substantially below 90 percent.
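Such a stress test could start from a perturbation helper like the following sketch; the noise model (multiplicative depth noise plus a small random rotation of the pose) and its magnitudes are illustrative assumptions, not taken from the paper:

```python
import numpy as np

def perturb_geometry(depth, pose, depth_noise=0.05, rot_noise_deg=2.0, seed=0):
    """Inject synthetic noise into a depth map and camera pose.

    depth: (H, W) depth map in metres
    pose:  (4, 4) camera-to-world matrix
    Returns a noisy (depth, pose) pair for robustness evaluation.
    """
    rng = np.random.default_rng(seed)
    # Multiplicative per-pixel depth noise (e.g. 5% relative error).
    noisy_depth = depth * (1.0 + depth_noise * rng.standard_normal(depth.shape))
    # Small rotation about a random axis, applied to the pose's rotation block.
    axis = rng.standard_normal(3)
    axis /= np.linalg.norm(axis)
    theta = np.deg2rad(rot_noise_deg) * rng.standard_normal()
    K = np.array([[0, -axis[2], axis[1]],
                  [axis[2], 0, -axis[0]],
                  [-axis[1], axis[0], 0]])
    R = np.eye(3) + np.sin(theta) * K + (1 - np.cos(theta)) * (K @ K)  # Rodrigues
    noisy_pose = pose.copy()
    noisy_pose[:3, :3] = R @ pose[:3, :3]
    return noisy_depth, noisy_pose
```

Sweeping `depth_noise` and `rot_noise_deg` while re-running the pruned model would directly test whether the 90 percent retention figure survives imperfect geometry.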
Original abstract
Multimodal large language models have demonstrated remarkable capabilities in 2D vision, motivating their extension to 3D scene understanding. Recent studies represent 3D scenes as 3D spatial videos composed of image sequences with depth and camera pose information, enabling pre-trained video-language models to perform 3D reasoning tasks. However, the large number of visual tokens in spatial videos remains a major bottleneck for efficient inference and context management. Existing pruning methods overlook the view consistency of spatial videos and the spatial diversity of the remaining tokens, which prevents them from effectively removing inter-frame redundancy and preserving scene completeness. In this paper, we propose Geo3DPruner, a Geometry-Guided 3D Visual Token Pruning framework. Geo3DPruner first models cross-frame relevance through geometry-aware global attention, and then performs a two-stage pruning process. The intra-voxel stage selects representative multi-view features within each voxel, while the inter-voxel stage preserves spatial diversity by selecting a globally distributed subset of voxels. Extensive experiments on multiple 3D scene understanding benchmarks demonstrate that Geo3DPruner retains over 90% of the original performance while pruning 90% of visual tokens, significantly outperforming existing text-guided and vision-guided pruning methods.
Editorial analysis
A structured set of objections, weighed in public.
Referee Report
Summary. The paper proposes Geo3DPruner, a geometry-guided framework for pruning visual tokens in 3D spatial videos processed by video-language models. It first computes cross-frame relevance via geometry-aware global attention using depth maps and camera poses, then applies a two-stage process: intra-voxel selection of representative multi-view features within each voxel and inter-voxel selection to maintain spatial diversity across voxels. Experiments on multiple 3D scene understanding benchmarks show that the method prunes 90% of visual tokens while retaining over 90% of original performance and outperforming text-guided and vision-guided pruning baselines.
Significance. If the empirical claims hold under scrutiny, the work offers a practical way to reduce the token bottleneck in 3D-aware VLMs by exploiting readily available geometric cues to remove inter-frame redundancy while preserving scene structure. The two-stage voxel-based selection is a concrete algorithmic contribution that could inform future efficiency techniques for long spatial video inputs.
Major comments (3)
- [Method and Experiments] The central performance claim (90% retention at 90% pruning) rests on the assumption that depth maps and camera poses are accurate enough for reliable voxel partitioning and representative selection; however, no sensitivity analysis or noise-injection experiments are reported to test whether misalignment or estimation errors discard task-critical 3D structure.
- [Experiments] The reported outperformance over text- and vision-guided baselines may partly reflect the privileged geometry input rather than the pruning logic itself; the manuscript does not include an ablation or controlled comparison in which baselines are also given depth/pose information.
- [Experiments] The experimental section lacks error bars, multiple random seeds, or statistical significance tests for the key retention numbers, and it is unclear whether the pruning hyperparameters were tuned on held-out validation data or post-hoc on the reported benchmarks.
Minor comments (2)
- [Method] Notation for the geometry-aware attention mechanism should be defined more explicitly (e.g., how depth and pose are encoded into the attention weights) to aid reproducibility.
- [Figures] Figure captions and axis labels in the pruning-ratio vs. performance plots could be clarified to distinguish intra-voxel from inter-voxel contributions.
Simulated Author's Rebuttal
We appreciate the referee's thoughtful feedback on our manuscript. We have carefully considered each major comment and provide point-by-point responses below. Where appropriate, we will revise the manuscript to address the concerns.
Point-by-point responses
-
Referee: [Method and Experiments] The central performance claim (90% retention at 90% pruning) rests on the assumption that depth maps and camera poses are accurate enough for reliable voxel partitioning and representative selection; however, no sensitivity analysis or noise-injection experiments are reported to test whether misalignment or estimation errors discard task-critical 3D structure.
Authors: We thank the referee for highlighting this important aspect. In the context of 3D spatial videos, depth maps and camera poses are provided as part of the input data, similar to how they are used in other 3D vision tasks. To address the concern, we will include a sensitivity analysis by injecting noise into the depth and pose estimates and evaluating the impact on pruning performance and downstream task accuracy in the revised manuscript. This will demonstrate the robustness of our geometry-guided approach. revision: yes
-
Referee: [Experiments] The reported outperformance over text- and vision-guided baselines may partly reflect the privileged geometry input rather than the pruning logic itself; the manuscript does not include an ablation or controlled comparison in which baselines are also given depth/pose information.
Authors: We agree that this is a valid point for clarification. The text-guided and vision-guided baselines are designed to operate without geometric information, as they rely on text queries or 2D visual features respectively. Providing them with depth and pose would require modifying their core mechanisms, which may not be straightforward. Nevertheless, to strengthen the comparison, we will add an experiment where we augment the baselines with geometric cues where possible and compare the results. This will help isolate the contribution of our two-stage voxel pruning strategy. revision: partial
-
Referee: [Experiments] The experimental section lacks error bars, multiple random seeds, or statistical significance tests for the key retention numbers, and it is unclear whether the pruning hyperparameters were tuned on held-out validation data or post-hoc on the reported benchmarks.
Authors: We acknowledge the importance of statistical rigor in experimental reporting. In the revised version, we will include error bars based on multiple random seeds (e.g., 3-5 runs) for the key metrics and perform statistical significance tests where applicable. Regarding hyperparameter tuning, the pruning ratios and voxel sizes were selected based on preliminary experiments on a held-out subset of the training data to avoid overfitting to the test benchmarks. We will clarify this in the manuscript and provide more details on the tuning process. revision: yes
Circularity Check
No significant circularity; algorithmic procedure is self-contained
Full rationale
The paper introduces Geo3DPruner as a novel algorithmic framework consisting of geometry-aware global attention followed by intra- and inter-voxel pruning stages. No equations, fitted parameters, or predictions are defined in terms of the target performance metrics or outputs. The reported retention of >90% performance at 90% pruning is presented as an empirical result from benchmarks rather than a quantity derived by construction from the method's own definitions. No self-citations serve as load-bearing uniqueness theorems, and no ansatzes or renamings reduce the central claims to prior inputs within the paper. The derivation chain is the procedure itself, which stands independently of the evaluation numbers.
Axiom & Free-Parameter Ledger
Axioms (1)
- Domain assumption: Depth maps and camera poses are available and sufficiently accurate to compute cross-frame geometry-aware attention.
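Under this assumption, each visual token's patch centre can be lifted to a world-space 3D point with the standard pinhole back-projection; the helper below is a generic sketch of that construction (the paper's exact encoding may differ):

```python
import numpy as np

def backproject_token(u, v, depth, K, cam_to_world):
    """Lift a pixel (u, v) with known depth into world coordinates.

    K:            (3, 3) camera intrinsics matrix
    cam_to_world: (4, 4) camera pose (camera-to-world transform)
    Returns the (3,) world-frame position of the point.
    """
    ray = np.linalg.inv(K) @ np.array([u, v, 1.0])   # camera-frame direction at z = 1
    p_cam = np.append(ray * depth, 1.0)              # homogeneous camera-frame point
    return (cam_to_world @ p_cam)[:3]                # world-frame 3D position
```

Errors in `depth` or `cam_to_world` propagate directly into these positions, which is why voxel assignment and geometry-aware attention both inherit the quality of the geometric inputs.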
Reference graph
Works this paper leans on
-
[1]
ScanQA: 3D Question Answering for Spatial Scene Understanding
Daichi Azuma, Taiki Miyanishi, Shuhei Kurita, and Motoaki Kawanabe. ScanQA: 3D Question Answering for Spatial Scene Understanding. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 19129–19139, 2022.
2022
-
[2]
MUSt3R: Multi-view Network for Stereo 3D Reconstruction
Yohann Cabon, Lucas Stoffl, Leonid Antsfeld, Gabriela Csurka, Boris Chidlovskii, Jerome Revaud, and Vincent Leroy. MUSt3R: Multi-view Network for Stereo 3D Reconstruction. In Proceedings of the Computer Vision and Pattern Recognition Conference, pages 1050–1060, 2025.
2025
-
[3]
ScanRefer: 3D Object Localization in RGB-D Scans Using Natural Language
Dave Zhenyu Chen, Angel X Chang, and Matthias Nießner. ScanRefer: 3D Object Localization in RGB-D Scans Using Natural Language. In European Conference on Computer Vision, pages 202–221, 2020.
2020
-
[4]
An Image is Worth 1/2 Tokens After Layer 2: Plug-and-Play Inference Acceleration for Large Vision-Language Models
Liang Chen, Haozhe Zhao, Tianyu Liu, Shuai Bai, Junyang Lin, Chang Zhou, and Baobao Chang. An Image is Worth 1/2 Tokens After Layer 2: Plug-and-Play Inference Acceleration for Large Vision-Language Models. In European Conference on Computer Vision, pages 19–35, 2024.
2024
-
[5]
LL3DA: Visual Interactive Instruction Tuning for Omni-3D Understanding Reasoning and Planning
Sijin Chen, Xin Chen, Chi Zhang, Mingsheng Li, Gang Yu, Hao Fei, Hongyuan Zhu, Jiayuan Fan, and Tao Chen. LL3DA: Visual Interactive Instruction Tuning for Omni-3D Understanding Reasoning and Planning. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 26428–26438, 2024.
2024
-
[6]
Scan2Cap: Context-Aware Dense Captioning in RGB-D Scans
Zhenyu Chen, Ali Gholami, Matthias Nießner, and Angel X Chang. Scan2Cap: Context-Aware Dense Captioning in RGB-D Scans. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 3193–3203, 2021.
2021
-
[7]
ScanNet: Richly-Annotated 3D Reconstructions of Indoor Scenes
Angela Dai, Angel X Chang, Manolis Savva, Maciej Halber, Thomas Funkhouser, and Matthias Nießner. ScanNet: Richly-Annotated 3D Reconstructions of Indoor Scenes. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pages 5828–5839, 2017.
2017
-
[8]
Eliminating Cross-modal Conflicts in BEV Space for LiDAR-Camera 3D Object Detection
Jiahui Fu, Chen Gao, Zitian Wang, Lirong Yang, Xiaofei Wang, Beipeng Mu, and Si Liu. Eliminating Cross-modal Conflicts in BEV Space for LiDAR-Camera 3D Object Detection. In IEEE International Conference on Robotics and Automation, pages 16381–16387, 2024.
2024
-
[9]
Cascade Cost Volume for High-Resolution Multi-View Stereo and Stereo Matching
Xiaodong Gu, Zhiwen Fan, Siyu Zhu, Zuozhuo Dai, Feitong Tan, and Ping Tan. Cascade Cost Volume for High-Resolution Multi-View Stereo and Stereo Matching. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 2495–2504, 2020.
2020
-
[10]
3D-LLM: Injecting the 3D World into Large Language Models
Yining Hong, Haoyu Zhen, Peihao Chen, Shuhong Zheng, Yilun Du, Zhenfang Chen, and Chuang Gan. 3D-LLM: Injecting the 3D World into Large Language Models. Advances in Neural Information Processing Systems, 36:20482–20494, 2023.
-
[11]
An embodied generalist agent in 3D world
Jiangyong Huang, Silong Yong, Xiaojian Ma, Xiongkun Linghu, Puhao Li, Yan Wang, Qing Li, Song-Chun Zhu, Baoxiong Jia, and Siyuan Huang. An embodied generalist agent in 3D world. In Proceedings of the 41st International Conference on Machine Learning, pages 20413–20451, 2024.
2024
-
[12]
PruneVid: Visual Token Pruning for Efficient Video Large Language Models
Xiaohu Huang, Hao Zhou, and Kai Han. PruneVid: Visual Token Pruning for Efficient Video Large Language Models. In Findings of the Association for Computational Linguistics, pages 19959–19973, 2025.
2025
-
[13]
3D Gaussian Splatting for Real-Time Radiance Field Rendering
Bernhard Kerbl, Georgios Kopanas, Thomas Leimkühler, and George Drettakis. 3D Gaussian Splatting for Real-Time Radiance Field Rendering. ACM Transactions on Graphics, 42(4):139–1, 2023.
2023
-
[14]
Grounding Image Matching in 3D with MASt3R
Vincent Leroy, Yohann Cabon, and Jérôme Revaud. Grounding Image Matching in 3D with MASt3R. In European Conference on Computer Vision, pages 71–91, 2024.
2024
-
[15]
LLaMA-VID: An Image is Worth 2 Tokens in Large Language Models
Yanwei Li, Chengyao Wang, and Jiaya Jia. LLaMA-VID: An Image is Worth 2 Tokens in Large Language Models. In European Conference on Computer Vision, pages 323–340, 2024.
-
[16]
Keyframe-oriented Vision Token Pruning: Enhancing Efficiency of Large Vision Language Models on Long-Form Video Processing
Yudong Liu, Jingwei Sun, Yueqian Lin, Jianyi Zhang, Jingyang Zhang, Ming Yin, Qinsi Wang, Hai Li, and Yiran Chen. Keyframe-oriented Vision Token Pruning: Enhancing Efficiency of Large Vision Language Models on Long-Form Video Processing. In Proceedings of the IEEE/CVF International Conference on Computer Vision, pages 20802–20811, 2025.
-
[17]
3D-SPS: Single-Stage 3D Visual Grounding via Referred Point Progressive Selection
Junyu Luo, Jiahui Fu, Xianghao Kong, Chen Gao, Haibing Ren, Hao Shen, Huaxia Xia, and Si Liu. 3D-SPS: Single-Stage 3D Visual Grounding via Referred Point Progressive Selection. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 16454–16463, 2022.
2022
-
[18]
SQA3D: Situated Question Answering in 3D Scenes
Xiaojian Ma, Silong Yong, Zilong Zheng, Qing Li, Yitao Liang, Song-Chun Zhu, and Siyuan Huang. SQA3D: Situated Question Answering in 3D Scenes. arXiv preprint arXiv:2210.07474, 2022.
-
[19]
BLEU: a Method for Automatic Evaluation of Machine Translation
Kishore Papineni, Salim Roukos, Todd Ward, and Wei-Jing Zhu. BLEU: a Method for Automatic Evaluation of Machine Translation. In Proceedings of the 40th Annual Meeting of the Association for Computational Linguistics, pages 311–318, 2002.
-
[20]
GPT4Scene: Understand 3D Scenes from Videos with Vision-Language Models
Zhangyang Qi, Zhixiong Zhang, Ye Fang, Jiaqi Wang, and Hengshuang Zhao. GPT4Scene: Understand 3D Scenes from Videos with Vision-Language Models. arXiv preprint arXiv:2501.01428, 2025.
-
[21]
Structure-From-Motion Revisited
Johannes L Schonberger and Jan-Michael Frahm. Structure-From-Motion Revisited. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pages 4104–4113, 2016.
2016
-
[22]
LLaVA-PruMerge: Adaptive Token Reduction for Efficient Large Multimodal Models
Yuzhang Shang, Mu Cai, Bingxin Xu, Yong Jae Lee, and Yan Yan. LLaVA-PruMerge: Adaptive Token Reduction for Efficient Large Multimodal Models. In Proceedings of the IEEE/CVF International Conference on Computer Vision, pages 22857–22867, 2025.
2025
-
[23]
DyCoke: Dynamic Compression of Tokens for Fast Video Large Language Models
Keda Tao, Can Qin, Haoxuan You, Yang Sui, and Huan Wang. DyCoke: Dynamic Compression of Tokens for Fast Video Large Language Models. In Proceedings of the Computer Vision and Pattern Recognition Conference, pages 18992–19001, 2025.
2025
-
[24]
Qwen2 Technical Report
Qwen Team et al. Qwen2 Technical Report. arXiv preprint arXiv:2407.10671, 2024.
2024
-
[25]
Attention is All you Need
Ashish Vaswani, Noam Shazeer, Niki Parmar, Jakob Uszkoreit, Llion Jones, Aidan N Gomez, Łukasz Kaiser, and Illia Polosukhin. Attention is All you Need. Advances in Neural Information Processing Systems, 30, 2017.
2017
-
[26]
CIDEr: Consensus-Based Image Description Evaluation
Ramakrishna Vedantam, C Lawrence Zitnick, and Devi Parikh. CIDEr: Consensus-Based Image Description Evaluation. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pages 4566–4575, 2015.
2015
-
[27]
LOOK-M: Look-Once Optimization in KV Cache for Efficient Multimodal Long-Context Inference
Zhongwei Wan, Ziang Wu, Che Liu, Jinfa Huang, Zhihong Zhu, Peng Jin, Longyue Wang, and Li Yuan. LOOK-M: Look-Once Optimization in KV Cache for Efficient Multimodal Long-Context Inference. arXiv preprint arXiv:2406.18139, 2024.
-
[28]
[CLS] Token Tells Everything Needed for Training-free Efficient MLLMs
Ao Wang, Fengyuan Sun, Hui Chen, Zijia Lin, Jungong Han, and Guiguang Ding. [CLS] Token Tells Everything Needed for Training-free Efficient MLLMs. arXiv preprint arXiv:2412.05819, 2024.
-
[29]
VGGT: Visual Geometry Grounded Transformer
Jianyuan Wang, Minghao Chen, Nikita Karaev, Andrea Vedaldi, Christian Rupprecht, and David Novotny. VGGT: Visual Geometry Grounded Transformer. In Proceedings of the Computer Vision and Pattern Recognition Conference, pages 5294–5306, 2025.
2025
-
[30]
IBRNet: Learning Multi-View Image-Based Rendering
Qianqian Wang, Zhicheng Wang, Kyle Genova, Pratul P Srinivasan, Howard Zhou, Jonathan T Barron, Ricardo Martin-Brualla, Noah Snavely, and Thomas Funkhouser. IBRNet: Learning Multi-View Image-Based Rendering. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 4690–4699, 2021.
2021
-
[31]
Continuous 3D Perception Model with Persistent State
Qianqian Wang, Yifei Zhang, Aleksander Holynski, Alexei A Efros, and Angjoo Kanazawa. Continuous 3D Perception Model with Persistent State. In Proceedings of the Computer Vision and Pattern Recognition Conference, pages 10510–10522, 2025.
2025
-
[32]
DUSt3R: Geometric 3D Vision Made Easy
Shuzhe Wang, Vincent Leroy, Yohann Cabon, Boris Chidlovskii, and Jerome Revaud. DUSt3R: Geometric 3D Vision Made Easy. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 20697–20709, 2024.
2024
-
[33]
EmbodiedScan: A Holistic Multi-Modal 3D Perception Suite Towards Embodied AI
Tai Wang, Xiaohan Mao, Chenming Zhu, Runsen Xu, Ruiyuan Lyu, Peisen Li, Xiao Chen, Wenwei Zhang, Kai Chen, Tianfan Xue, et al. EmbodiedScan: A Holistic Multi-Modal 3D Perception Suite Towards Embodied AI. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 19757–19767, 2024.
2024
-
[34]
PyramidDrop: Accelerating Your Large Vision-Language Models via Pyramid Visual Redundancy Reduction
Long Xing, Qidong Huang, Xiaoyi Dong, Jiajie Lu, Pan Zhang, Yuhang Zang, Yuhang Cao, Conghui He, Jiaqi Wang, Feng Wu, et al. PyramidDrop: Accelerating Your Large Vision-Language Models via Pyramid Visual Redundancy Reduction. arXiv preprint arXiv:2410.17247, 2024.
2024
-
[35]
PointLLM: Empowering Large Language Models to Understand Point Clouds
Runsen Xu, Xiaolong Wang, Tai Wang, Yilun Chen, Jiangmiao Pang, and Dahua Lin. PointLLM: Empowering Large Language Models to Understand Point Clouds. In European Conference on Computer Vision, pages 131–147, 2024.
2024
-
[36]
PointLLM-V2: Empowering Large Language Models to Better Understand Point Clouds
Runsen Xu, Shuai Yang, Xiaolong Wang, Tai Wang, Yilun Chen, Jiangmiao Pang, and Dahua Lin. PointLLM-V2: Empowering Large Language Models to Better Understand Point Clouds. IEEE Transactions on Pattern Analysis and Machine Intelligence, 2025.
2025
-
[37]
Fast3R: Towards 3D Reconstruction of 1000+ Images in One Forward Pass
Jianing Yang, Alexander Sax, Kevin J Liang, Mikael Henaff, Hao Tang, Ang Cao, Joyce Chai, Franziska Meier, and Matt Feiszli. Fast3R: Towards 3D Reconstruction of 1000+ Images in One Forward Pass. In Proceedings of the Computer Vision and Pattern Recognition Conference, pages 21924–21935, 2025.
2025
-
[38]
Thinking in space: How multimodal large language models see, remember, and recall spaces
Jihan Yang, Shusheng Yang, Anjali W Gupta, Rilyn Han, Li Fei-Fei, and Saining Xie. Thinking in space: How multimodal large language models see, remember, and recall spaces. In Proceedings of the Computer Vision and Pattern Recognition Conference, pages 10632–10643, 2025.
2025
-
[39]
VisionZip: Longer is Better but Not Necessary in Vision Language Models
Senqiao Yang, Yukang Chen, Zhuotao Tian, Chengyao Wang, Jingyao Li, Bei Yu, and Jiaya Jia. VisionZip: Longer is Better but Not Necessary in Vision Language Models. In Proceedings of the Computer Vision and Pattern Recognition Conference, pages 19792–19802, 2025.
2025
-
[40]
MVSNet: Depth Inference for Unstructured Multi-view Stereo
Yao Yao, Zixin Luo, Shiwei Li, Tian Fang, and Long Quan. MVSNet: Depth Inference for Unstructured Multi-view Stereo. In European Conference on Computer Vision, pages 767–783, 2018.
2018
-
[41]
Fit and Prune: Fast and Training-free Visual Token Pruning for Multi-modal Large Language Models
Weihao Ye, Qiong Wu, Wenhao Lin, and Yiyi Zhou. Fit and Prune: Fast and Training-free Visual Token Pruning for Multi-modal Large Language Models. In Proceedings of the AAAI Conference on Artificial Intelligence, pages 22128–22136, 2025.
2025
-
[42]
pixelNeRF: Neural Radiance Fields From One or Few Images
Alex Yu, Vickie Ye, Matthew Tancik, and Angjoo Kanazawa. pixelNeRF: Neural Radiance Fields From One or Few Images. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 4578–4587, 2021.
-
[43]
Sigmoid Loss for Language Image Pre-Training
Xiaohua Zhai, Basil Mustafa, Alexander Kolesnikov, and Lucas Beyer. Sigmoid Loss for Language Image Pre-Training. In Proceedings of the IEEE/CVF International Conference on Computer Vision, pages 11975–11986, 2023.
2023
-
[44]
Beyond Text-Visual Attention: Exploiting Visual Cues for Effective Token Pruning in VLMs
Qizhe Zhang, Aosong Cheng, Ming Lu, Renrui Zhang, Zhiyong Zhuo, Jiajun Cao, Shaobo Guo, Qi She, and Shanghang Zhang. Beyond Text-Visual Attention: Exploiting Visual Cues for Effective Token Pruning in VLMs. In Proceedings of the IEEE/CVF International Conference on Computer Vision, pages 20857–20867, 2025.
2025
-
[45]
LLaVA-Mini: Efficient Image and Video Large Multimodal Models with One Vision Token
Shaolei Zhang, Qingkai Fang, Zhe Yang, and Yang Feng. LLaVA-Mini: Efficient Image and Video Large Multimodal Models with One Vision Token. arXiv preprint arXiv:2501.03895, 2025.
-
[46]
Multi3DRefer: Grounding Text Description to Multiple 3D Objects
Yiming Zhang, ZeMing Gong, and Angel X Chang. Multi3DRefer: Grounding Text Description to Multiple 3D Objects. In Proceedings of the IEEE/CVF International Conference on Computer Vision, pages 15225–15236, 2023.
2023
-
[47]
SparseVLM: Visual Token Sparsification for Efficient Vision-Language Model Inference
Yuan Zhang, Chun-Kai Fan, Junpeng Ma, Wenzhao Zheng, Tao Huang, Kuan Cheng, Denis Gudovskiy, Tomoyuki Okuno, Yohei Nakata, Kurt Keutzer, et al. SparseVLM: Visual Token Sparsification for Efficient Vision-Language Model Inference. arXiv preprint arXiv:2410.04417, 2024.
-
[48]
LLaVA-Video: Video Instruction Tuning With Synthetic Data
Yuanhan Zhang, Jinming Wu, Wei Li, Bo Li, Zejun Ma, Ziwei Liu, and Chunyuan Li. LLaVA-Video: Video Instruction Tuning With Synthetic Data. arXiv preprint arXiv:2410.02713, 2024.
2024
-
[49]
A Stitch in Time Saves Nine: Small VLM is a Precise Guidance for Accelerating Large VLMs
Wangbo Zhao, Yizeng Han, Jiasheng Tang, Zhikai Li, Yibing Song, Kai Wang, Zhangyang Wang, and Yang You. A Stitch in Time Saves Nine: Small VLM is a Precise Guidance for Accelerating Large VLMs. In Proceedings of the Computer Vision and Pattern Recognition Conference, pages 19814–19824, 2025.
2025
-
[50]
Learning from Videos for 3D World: Enhancing MLLMs with 3D Vision Geometry Priors
Duo Zheng, Shijia Huang, Yanyang Li, and Liwei Wang. Learning from Videos for 3D World: Enhancing MLLMs with 3D Vision Geometry Priors. arXiv preprint arXiv:2505.24625, 2025.
-
[51]
Video-3D LLM: Learning Position-Aware Video Representation for 3D Scene Understanding
Duo Zheng, Shijia Huang, and Liwei Wang. Video-3D LLM: Learning Position-Aware Video Representation for 3D Scene Understanding. In Proceedings of the Computer Vision and Pattern Recognition Conference, pages 8995–9006, 2025.
2025
-
[52]
LLaVA-3D: A Simple yet Effective Pathway to Empowering LMMs with 3D Capabilities
Chenming Zhu, Tai Wang, Wenwei Zhang, Jiangmiao Pang, and Xihui Liu. LLaVA-3D: A Simple yet Effective Pathway to Empowering LMMs with 3D Capabilities. In Proceedings of the IEEE/CVF International Conference on Computer Vision, pages 4295–4305, 2025.
2025
-
[53]
Streaming 4D Visual Geometry Transformer
Dong Zhuo, Wenzhao Zheng, Jiahe Guo, Yuqi Wu, Jie Zhou, and Jiwen Lu. Streaming 4D Visual Geometry Transformer. arXiv preprint arXiv:2507.11539, 2025.