4D-RGPT: Toward Region-level 4D Understanding via Perceptual Distillation
Pith reviewed 2026-05-16 21:18 UTC · model grok-4.3
The pith
4D-RGPT uses perceptual distillation from a frozen expert model to improve multimodal LLMs' region-level 4D perception in videos.
A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.
Core claim
4D-RGPT is designed to capture 4D representations from video inputs with enhanced temporal perception by using Perceptual 4D Distillation to transfer comprehensive 4D knowledge from a frozen expert model, leading to better performance on 4D VQA benchmarks and the new R4D-Bench.
What carries the argument
Perceptual 4D Distillation (P4D), the training framework that transfers 4D representations from a frozen expert model into the MLLM without retraining the expert.
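The frozen-expert idea can be illustrated with a toy example. This is a minimal sketch that assumes a plain feature-matching (mean squared error) objective. This summary does not specify the paper's actual P4D losses, expert model, or feature shapes, so `mse`, `distill_step`, the elementwise student weights, and the synthetic teacher features are all hypothetical stand-ins, not the authors' method.

```python
import random


def mse(a, b):
    """Mean squared error between two equal-length feature vectors."""
    return sum((x - y) ** 2 for x, y in zip(a, b)) / len(a)


def distill_step(student_w, inputs, teacher_feats, lr=0.1):
    """One gradient step on a hypothetical elementwise student weight vector.

    The student feature is w[i] * inputs[i]. The teacher features are treated
    as constants, so the frozen expert never receives an update.
    """
    n = len(student_w)
    new_w = []
    for w, x, t in zip(student_w, inputs, teacher_feats):
        grad = 2.0 * (w * x - t) * x / n  # derivative of the MSE w.r.t. w
        new_w.append(w - lr * grad)
    return new_w


random.seed(0)
inputs = [random.uniform(0.5, 1.5) for _ in range(8)]
teacher_feats = [2.0 * x for x in inputs]  # synthetic stand-in for expert output
teacher_snapshot = list(teacher_feats)     # used to check the teacher stays frozen
student_w = [0.0] * 8

losses = []
for _ in range(200):
    student_feats = [w * x for w, x in zip(student_w, inputs)]
    losses.append(mse(student_feats, teacher_feats))
    student_w = distill_step(student_w, inputs, teacher_feats)

print(f"first loss {losses[0]:.4f}, last loss {losses[-1]:.6f}")
```

Only the student parameters change during the loop, which mirrors the "without retraining the expert" property described above. A real setup would replace the elementwise weights with the MLLM's trainable layers and the synthetic targets with the expert's 4D features.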
If this is right
- 4D-RGPT achieves notable improvements on existing 4D VQA benchmarks.
- It also shows gains on the proposed R4D-Bench benchmark for region-level 4D understanding.
- The model enhances temporal perception and region-level reasoning in video question answering.
- The distillation allows comprehensive 4D perception to be added to MLLMs efficiently.
Where Pith is reading between the lines
- The method could extend to other perceptual tasks by distilling from different expert models in computer vision.
- Region-level 4D capabilities might enable more accurate dynamic scene analysis in fields like robotics and augmented reality.
- Future research could explore combining this with real-time video processing for interactive applications.
Load-bearing premise
That the perceptual distillation process can transfer comprehensive 4D representations from the frozen expert into the MLLM without significant information loss to support enhanced region-level and temporal perception.
What would settle it
The claim would be undermined if evaluations on R4D-Bench showed that the distilled 4D-RGPT does not outperform a standard MLLM on questions requiring precise depth and motion understanding in specific regions.
Original abstract
Despite advances in Multimodal LLMs (MLLMs), their ability to reason over 3D structures and temporal dynamics remains limited, constrained by weak 4D perception and temporal understanding. Existing 3D and 4D Video Question Answering (VQA) benchmarks also emphasize static scenes and lack region-level prompting. We tackle these issues by introducing: (a) 4D-RGPT, a specialized MLLM designed to capture 4D representations from video inputs with enhanced temporal perception; (b) Perceptual 4D Distillation (P4D), a training framework that transfers 4D representations from a frozen expert model into 4D-RGPT for comprehensive 4D perception; and (c) R4D-Bench, a benchmark for depth-aware dynamic scenes with region-level prompting, built via a hybrid automated and human-verified pipeline. Our 4D-RGPT achieves notable improvements on both existing 4D VQA benchmarks and the proposed R4D-Bench benchmark.
Editorial analysis
A structured set of objections, weighed in public.
Referee Report
Summary. The manuscript introduces 4D-RGPT, a multimodal large language model (MLLM) specialized for region-level 4D video understanding, trained via a Perceptual 4D Distillation (P4D) framework that transfers representations from a frozen expert model. It also proposes R4D-Bench, a new benchmark for depth-aware dynamic scenes supporting region-level prompting, constructed through a hybrid automated and human-verified pipeline. The central claim is that 4D-RGPT achieves notable improvements over baselines on both existing 4D VQA benchmarks and the proposed R4D-Bench.
Significance. If the reported benchmark gains hold under scrutiny, the work advances 4D perception in MLLMs by demonstrating effective transfer of temporal and spatial features via distillation without retraining the expert. The hybrid construction of R4D-Bench, with explicit human verification for region-level and depth-aware queries, fills a documented gap in prior 3D/4D VQA datasets that emphasize static scenes. The detailed description of the distillation process, expert freezing, and benchmark pipeline provides a reproducible template for similar perceptual transfer efforts.
minor comments (2)
- [Abstract] The phrase 'notable improvements' is used without any numerical deltas, baseline names, or metric values; adding one or two key quantitative results (e.g., accuracy gains on R4D-Bench) would make the summary self-contained.
- [Experiments] Section 4 (Experiments): while the distillation pipeline is described, the manuscript should explicitly state the number of ablation runs, random seeds, and statistical significance tests for the reported gains to allow readers to assess robustness.
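One way to act on the second comment is a paired test over per-seed scores. The sketch below assumes an exact sign-flip permutation test, which the manuscript does not necessarily use. The function name `paired_sign_flip_pvalue` is hypothetical, and the accuracy values are made-up placeholders, not results from the paper.

```python
from itertools import product
from statistics import mean


def paired_sign_flip_pvalue(model_scores, baseline_scores):
    """Exact two-sided sign-flip permutation test on paired per-seed scores.

    Under the null hypothesis the sign of each paired difference is arbitrary,
    so every sign assignment is enumerated and compared with the observed mean.
    """
    diffs = [m - b for m, b in zip(model_scores, baseline_scores)]
    observed = abs(mean(diffs))
    extreme = 0
    total = 0
    for signs in product((1, -1), repeat=len(diffs)):
        total += 1
        flipped = mean(s * d for s, d in zip(signs, diffs))
        # Small tolerance guards against floating-point noise in the comparison.
        if abs(flipped) >= observed - 1e-12:
            extreme += 1
    return extreme / total


# Placeholder accuracies for five seeds (illustrative only, not from the paper).
model_acc = [0.62, 0.60, 0.63, 0.61, 0.64]
baseline_acc = [0.55, 0.56, 0.54, 0.57, 0.55]

p_value = paired_sign_flip_pvalue(model_acc, baseline_acc)
print(f"two-sided p-value over {len(model_acc)} seeds: {p_value}")
```

With five seeds the smallest attainable two-sided p-value is 2/32 = 0.0625, so even a consistent gain cannot reach the 0.05 level. This is one reason why reporting the number of seeds matters.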
Simulated Author's Rebuttal
We thank the referee for the positive assessment of our manuscript and for recognizing the contributions of 4D-RGPT, the Perceptual 4D Distillation framework, and the R4D-Bench benchmark. We appreciate the recommendation for minor revision and will incorporate improvements to enhance clarity and reproducibility.
Circularity Check
No significant circularity
Full rationale
The paper introduces 4D-RGPT via perceptual distillation (P4D) from a frozen expert and evaluates it on existing 4D VQA benchmarks plus the newly constructed R4D-Bench. All load-bearing claims reduce to reported empirical improvements rather than any self-definitional loop, fitted parameter renamed as prediction, or self-citation chain. The distillation framework is presented as a standard transfer process with explicit freezing and hybrid benchmark construction; no equations or premises collapse to their own inputs by construction. This is the normal non-circular case for an empirical MLLM paper.
Axiom & Free-Parameter Ledger
axioms (1)
- domain assumption: Frozen expert models contain rich 4D representations that can be distilled into MLLMs without loss of temporal and region-level information
invented entities (3)
- 4D-RGPT: no independent evidence
- Perceptual 4D Distillation (P4D): no independent evidence
- R4D-Bench: no independent evidence
Forward citations
Cited by 2 Pith papers
- ST-$\pi$: Structured SpatioTemporal VLA for Robotic Manipulation
  ST-π structures VLA models by having a spatiotemporal VLM produce causally ordered chunk-level prompts that guide a dual-generator action expert to jointly handle spatial and temporal control in robotic manipulation.
- XEmbodied: A Foundation Model with Enhanced Geometric and Physical Cues for Large-Scale Embodied Environments
  XEmbodied is a foundation model that integrates 3D geometric and physical signals into VLMs using a 3D Adapter and Efficient Image-Embodied Adapter, plus progressive curriculum and RL post-training, to improve spatial...