4D-RGPT uses perceptual 4D distillation to boost region-level 4D perception in multimodal LLMs and reports gains on existing and new video QA benchmarks.
3d aware region prompted vision language model
5 Pith papers cite this work. Polarity classification is still indexing.
citation-role summary
citation-polarity summary
fields
cs.CV 5roles
background 1polarities
background 1representative citing papers
POMA-3D learns self-supervised 3D scene representations from point maps and improves performance on geometric 3D tasks including navigation and scene retrieval.
A structure-aware sequential compression of KV cache during prefill reduces peak memory usage in MLLMs while keeping generative performance nearly intact.
SpaceDrive integrates 3D positional encodings derived from depth and ego-states into VLMs, replacing digit tokens to improve spatial reasoning and trajectory regression in autonomous driving.
This review organizes literature on large multimodal models and object-centric vision into four themes—understanding, referring segmentation, editing, and generation—while summarizing paradigms, strategies, and challenges like instance permanence and consistent interaction.
citing papers explorer
-
4D-RGPT: Toward Region-level 4D Understanding via Perceptual Distillation
4D-RGPT uses perceptual 4D distillation to boost region-level 4D perception in multimodal LLMs and reports gains on existing and new video QA benchmarks.
-
POMA-3D: The Point Map Way to 3D Scene Understanding
POMA-3D learns self-supervised 3D scene representations from point maps and improves performance on geometric 3D tasks including navigation and scene retrieval.
-
Reducing Peak Memory Usage for Modern Multimodal Large Language Model Pipelines
A structure-aware sequential compression of KV cache during prefill reduces peak memory usage in MLLMs while keeping generative performance nearly intact.
-
SpaceDrive: Infusing Spatial Awareness into VLM-based Autonomous Driving
SpaceDrive integrates 3D positional encodings derived from depth and ego-states into VLMs, replacing digit tokens to improve spatial reasoning and trajectory regression in autonomous driving.
-
LMMs Meet Object-Centric Vision: Understanding, Segmentation, Editing and Generation
This review organizes literature on large multimodal models and object-centric vision into four themes—understanding, referring segmentation, editing, and generation—while summarizing paradigms, strategies, and challenges like instance permanence and consistent interaction.