arxiv: 2512.17012 · v4 · submitted 2025-12-18 · 💻 cs.CV

4D-RGPT: Toward Region-level 4D Understanding via Perceptual Distillation

Pith reviewed 2026-05-16 21:18 UTC · model grok-4.3

classification 💻 cs.CV
keywords 4D understanding · perceptual distillation · multimodal large language model · video question answering · region-level prompting · temporal perception · 4D VQA benchmark · R4D-Bench

The pith

4D-RGPT uses perceptual distillation from a frozen expert model to improve multimodal LLMs' region-level 4D perception in videos.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper develops 4D-RGPT, a multimodal LLM specialized for capturing 4D representations from video with stronger temporal perception. It proposes Perceptual 4D Distillation (P4D) to transfer these representations from a frozen expert model into the MLLM, targeting the weak 4D perception of existing models. It also introduces R4D-Bench, a benchmark for depth-aware dynamic scenes with region-level prompting, and reports improvements on both existing 4D VQA benchmarks and R4D-Bench.

Core claim

4D-RGPT is designed to capture 4D representations from video inputs with enhanced temporal perception by using Perceptual 4D Distillation to transfer comprehensive 4D knowledge from a frozen expert model, leading to better performance on 4D VQA benchmarks and the new R4D-Bench.

What carries the argument

Perceptual 4D Distillation (P4D), the training framework that transfers 4D representations from a frozen expert model into the MLLM without retraining the expert.
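
As a rough illustration of the general idea of distilling from a frozen expert, the sketch below trains a tiny scalar "student" to match the outputs of a fixed "teacher" whose parameters are never updated. Everything here is an assumption made for illustration: the function names (`frozen_teacher`, `distill_loss`, `distill_step`), the linear models, and the mean-squared-error objective are hypothetical stand-ins, not the loss or architecture used by P4D, which this page does not specify.

```python
# Illustrative only: a scalar "student" learns to match a frozen "teacher".
# None of this reflects the paper's actual architecture or loss.

def frozen_teacher(x):
    # Hypothetical stand-in for the frozen expert; its parameters never change.
    return 2.0 * x + 1.0

def distill_loss(w, b, xs):
    # Mean squared error between student outputs (w*x + b) and teacher outputs.
    return sum((w * x + b - frozen_teacher(x)) ** 2 for x in xs) / len(xs)

def distill_step(w, b, xs, lr=0.1):
    # Gradients are taken only with respect to the student parameters (w, b).
    n = len(xs)
    errs = [w * x + b - frozen_teacher(x) for x in xs]
    grad_w = 2.0 * sum(e * x for e, x in zip(errs, xs)) / n
    grad_b = 2.0 * sum(errs) / n
    return w - lr * grad_w, b - lr * grad_b

xs = [-1.0, -0.5, 0.0, 0.5, 1.0]   # made-up inputs
w, b = 0.0, 0.0                    # student starts untrained
initial_loss = distill_loss(w, b, xs)
for _ in range(200):
    w, b = distill_step(w, b, xs)
final_loss = distill_loss(w, b, xs)
```

The point of the toy is the division of labor: the teacher is only ever called, never modified, while all updates land on the student. In a real MLLM setting the distillation term would presumably sit alongside the usual language-modeling objective, but how the two are combined is not stated here.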

If this is right

  • 4D-RGPT achieves notable improvements on existing 4D VQA benchmarks.
  • It also shows gains on the proposed R4D-Bench benchmark for region-level 4D understanding.
  • The model enhances temporal perception and region-level reasoning in video question answering.
  • The distillation allows comprehensive 4D perception to be added to MLLMs efficiently.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

  • The method could extend to other perceptual tasks by distilling from different expert models in computer vision.
  • Region-level 4D capabilities might enable more accurate dynamic scene analysis in fields like robotics and augmented reality.
  • Future research could explore combining this with real-time video processing for interactive applications.

Load-bearing premise

That the perceptual distillation process can transfer comprehensive 4D representations from the frozen expert into the MLLM without significant information loss to support enhanced region-level and temporal perception.

What would settle it

The claim would be undercut if evaluations on R4D-Bench showed that the distilled 4D-RGPT does not outperform a standard MLLM on questions requiring precise depth and motion understanding in specific regions.

Original abstract

Despite advances in Multimodal LLMs (MLLMs), their ability to reason over 3D structures and temporal dynamics remains limited, constrained by weak 4D perception and temporal understanding. Existing 3D and 4D Video Question Answering (VQA) benchmarks also emphasize static scenes and lack region-level prompting. We tackle these issues by introducing: (a) 4D-RGPT, a specialized MLLM designed to capture 4D representations from video inputs with enhanced temporal perception; (b) Perceptual 4D Distillation (P4D), a training framework that transfers 4D representations from a frozen expert model into 4D-RGPT for comprehensive 4D perception; and (c) R4D-Bench, a benchmark for depth-aware dynamic scenes with region-level prompting, built via a hybrid automated and human-verified pipeline. Our 4D-RGPT achieves notable improvements on both existing 4D VQA benchmarks and the proposed R4D-Bench benchmark.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Referee Report

0 major / 2 minor

Summary. The manuscript introduces 4D-RGPT, a multimodal large language model (MLLM) specialized for region-level 4D video understanding, trained via a Perceptual 4D Distillation (P4D) framework that transfers representations from a frozen expert model. It also proposes R4D-Bench, a new benchmark for depth-aware dynamic scenes supporting region-level prompting, constructed through a hybrid automated and human-verified pipeline. The central claim is that 4D-RGPT achieves notable improvements over baselines on both existing 4D VQA benchmarks and the proposed R4D-Bench.

Significance. If the reported benchmark gains hold under scrutiny, the work advances 4D perception in MLLMs by demonstrating effective transfer of temporal and spatial features via distillation without retraining the expert. The hybrid construction of R4D-Bench, with explicit human verification for region-level and depth-aware queries, fills a documented gap in prior 3D/4D VQA datasets that emphasize static scenes. The detailed description of the distillation process, expert freezing, and benchmark pipeline provides a reproducible template for similar perceptual transfer efforts.

minor comments (2)
  1. [Abstract] Abstract: the phrase 'notable improvements' is used without any numerical deltas, baseline names, or metric values; adding one or two key quantitative results (e.g., accuracy gains on R4D-Bench) would make the summary self-contained.
  2. [Experiments] Section 4 (Experiments): while the distillation pipeline is described, the manuscript should explicitly state the number of ablation runs, random seeds, and statistical significance tests for the reported gains to allow readers to assess robustness.
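
One way to act on the second comment is a paired bootstrap over per-question correctness, which estimates how often a resampled benchmark would show no gain over the baseline. The sketch below is a minimal version under stated assumptions: the function name `paired_bootstrap`, the resample count, and the example correctness vectors are hypothetical and made up for illustration; they are not results from the paper.

```python
import random

def paired_bootstrap(baseline_correct, model_correct, n_resamples=2000, seed=0):
    """Return (observed accuracy gain, fraction of resamples with gain <= 0)."""
    assert len(baseline_correct) == len(model_correct)
    rng = random.Random(seed)  # fixed seed so the estimate is reproducible
    n = len(baseline_correct)
    # Per-question difference: +1 model-only correct, -1 baseline-only correct.
    diffs = [m - b for m, b in zip(model_correct, baseline_correct)]
    observed = sum(diffs) / n
    non_positive = 0
    for _ in range(n_resamples):
        # Resample questions with replacement, keeping the pairing intact.
        sample = [diffs[rng.randrange(n)] for _ in range(n)]
        if sum(sample) <= 0:
            non_positive += 1
    return observed, non_positive / n_resamples

# Made-up correctness vectors (1 = correct, 0 = wrong) for 100 questions.
baseline = [1, 0, 0, 1, 0, 1, 0, 0, 1, 0] * 10
model = [1, 1, 0, 1, 1, 1, 0, 1, 1, 0] * 10
gain, p_no_gain = paired_bootstrap(baseline, model)
```

Resampling the paired differences, rather than each system independently, keeps question difficulty matched between the two models, which is why it suits a shared benchmark.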

Simulated Author's Rebuttal

0 responses · 0 unresolved

We thank the referee for the positive assessment of our manuscript and for recognizing the contributions of 4D-RGPT, the Perceptual 4D Distillation framework, and the R4D-Bench benchmark. We appreciate the recommendation for minor revision and will incorporate improvements to enhance clarity and reproducibility.

Circularity Check

0 steps flagged

No significant circularity

The paper introduces 4D-RGPT via perceptual distillation (P4D) from a frozen expert and evaluates it on existing 4D VQA benchmarks plus the newly constructed R4D-Bench. All load-bearing claims reduce to reported empirical improvements rather than any self-definitional loop, fitted parameter renamed as prediction, or self-citation chain. The distillation framework is presented as a standard transfer process with explicit freezing and hybrid benchmark construction; no equations or premises collapse to their own inputs by construction. This is the normal non-circular case for an empirical MLLM paper.

Axiom & Free-Parameter Ledger

0 free parameters · 1 axiom · 3 invented entities

The central claim depends on standard machine learning assumptions about knowledge transfer via distillation and the validity of newly constructed benchmarks for measuring 4D perception.

axioms (1)
  • domain assumption: Frozen expert models contain rich 4D representations that can be distilled into MLLMs without loss of temporal and region-level information
    This underpins the entire P4D training framework described in the abstract.
invented entities (3)
  • 4D-RGPT · no independent evidence
    purpose: Specialized multimodal LLM for capturing 4D representations from video
    Newly proposed model architecture
  • Perceptual 4D Distillation (P4D) · no independent evidence
    purpose: Framework to transfer 4D knowledge from expert to target model
    Newly introduced training method
  • R4D-Bench · no independent evidence
    purpose: Benchmark for depth-aware dynamic scenes with region-level prompting
    Newly constructed evaluation dataset

pith-pipeline@v0.9.0 · 5510 in / 1358 out tokens · 39933 ms · 2026-05-16T21:18:20.916819+00:00 · methodology

Forward citations

Cited by 2 Pith papers

Reviewed papers in the Pith corpus that reference this work. Sorted by Pith novelty score.

  1. ST-$\pi$: Structured SpatioTemporal VLA for Robotic Manipulation

    cs.RO 2026-04 unverdicted novelty 6.0

    ST-π structures VLA models by having a spatiotemporal VLM produce causally ordered chunk-level prompts that guide a dual-generator action expert to jointly handle spatial and temporal control in robotic manipulation.

  2. XEmbodied: A Foundation Model with Enhanced Geometric and Physical Cues for Large-Scale Embodied Environments

    cs.CV 2026-04 unverdicted novelty 4.0

    XEmbodied is a foundation model that integrates 3D geometric and physical signals into VLMs using a 3D Adapter and Efficient Image-Embodied Adapter, plus progressive curriculum and RL post-training, to improve spatial...
