arxiv: 2604.11136 · v1 · submitted 2026-04-13 · 💻 cs.CV · cs.AI

Recognition: unknown

BoxTuning: Directly Injecting the Object Box for Multimodal Model Fine-Tuning

Zekun Qian , Ruize Han , Wei Feng

Authors on Pith no claims yet

Pith reviewed 2026-05-10 15:20 UTC · model grok-4.3

classification 💻 cs.CV cs.AI

keywords video question answeringmultimodal large language modelsvisual promptingbounding boxesobject groundingtrajectory trailsspatial-temporal understandingfine-tuning

0 comments

The pith

Rendering colored bounding boxes and trajectory trails directly onto video frames gives multimodal models object information more naturally and with far fewer text tokens than coordinate serialization.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

Video question answering needs precise object locations and motion over time, yet multimodal large language models encode entire frames without built-in object focus. Prior fixes turn bounding boxes into text strings, which uses many tokens, forces fewer frames per video, and creates a mismatch since location data is visual. BoxTuning draws the boxes and motion trails onto the actual video images in color, leaving only a short legend in text. This keeps every frame at full rate, encodes speed and direction in the trails, and cuts text tokens by 87 to 93 percent. Tests on five video QA benchmarks show gains on spatial tasks and removal of the usual accuracy drop on reasoning tasks.

Core claim

The central claim is that injecting object spatial-temporal information as visual overlays—colored bounding boxes plus trajectory trails rendered on video frames, paired with a minimal color-to-object text legend—resolves the modality mismatch of text-coordinate methods, reduces token cost dramatically, preserves full temporal resolution, and yields higher accuracy on spatial video QA tasks while avoiding degradation on reasoning tasks.

What carries the argument

The visual prompting mechanism that renders colored bounding boxes and inter-frame trajectory trails directly onto the input video frames.

If this is right

Text token usage for object information drops by 87-93 percent while full frame rate is retained.
Performance improves over text baselines specifically on tasks that require spatial and motion understanding.
Reasoning-only task accuracy no longer declines from the added object information.
Trajectory trails inside each keyframe recover fine-grained dynamics that text methods must discard due to downsampling.
Visual prompting becomes the default way to supply object grounding to video MLLMs instead of text serialization.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

The same overlay technique could be tested on single-image or 3D scene tasks where token budgets are tight.
Models fine-tuned this way might handle crowded scenes better because visual prompts scale with pixel count rather than token count.
If the color legend can be learned implicitly, the text component could be removed entirely for even lower overhead.
Real-time video pipelines could adopt this for object tracking without needing separate coordinate streams.

Load-bearing premise

Adding colored boxes and trails to the frames does not confuse or degrade the vision encoder's reading of the original video content, and a short color legend is sufficient for the model to link each prompt to its object.

What would settle it

Run the same five video QA benchmarks with BoxTuning but replace the colored boxes and trails with background-matched colors or remove them entirely; if accuracy falls to or below the text-coordinate baseline, the visual injection benefit is refuted.

Figures

Figures reproduced from arXiv: 2604.11136 by Ruize Han, Wei Feng, Zekun Qian.

**Figure 1.** Figure 1: Comparison of object-centric representation strategies for video MLLMs. Left: ObjectMLLM [31] serializes per-frame bounding box coordinates as text tokens (e.g., “(Object 0) adult: frame 0 [73 30 92 100] ...”), producing 1,218 tokens for a single video. The original keyframes lack intermediate dynamics, leading to action misinterpretation. Right: BoxTuning renders colored bounding boxes and trajectory tr… view at source ↗

**Figure 2.** Figure 2: Overview of the BoxTuning framework. Visually prompted frames, produced by rendering colored bounding boxes and trajectory trails from an off-the-shelf detector (YOLO-World [5]) and tracker (SAM 2 [29]), are fed into the vision encoder together with a concise text legend mapping colors to object classes and the question. The vision encoder is kept frozen, while the vision projector (STC connector) is fully… view at source ↗

**Figure 3.** Figure 3: Visual prompt examples on CLEVRER video frames. (a) Bounding boxes alone provide spatial localization of each object. (b) Adding trajectory trails encodes motion direction and speed across intermediate frames, aiding temporal reasoning. The text legend (not shown) maps each color to its object class. reducing the text cost from O(N × T × C) to O(N × C ′ ), where C ′ denotes the token requirement of D for e… view at source ↗

read the original abstract

Object-level spatial-temporal understanding is essential for video question answering, yet existing multimodal large language models (MLLMs) encode frames holistically and lack explicit mechanisms for fine-grained object grounding. Recent work addresses this by serializing bounding box coordinates as text tokens, but this text-coordinate paradigm suffers from a fundamental modality mismatch: object information is inherently visual, yet encoding it as text incurs a high token cost that forces aggressive temporal downsampling. We propose BoxTuning, which resolves this mismatch by injecting object spatial-temporal information directly into the visual modality. Colored bounding boxes and trajectory trails are rendered onto video frames as visual prompts, with only a concise color-to-object legend retained as text. This reduces the token cost significantly, achieving 87-93% text token reduction in practice. It also preserves full temporal resolution, where the trajectory trails further encode inter-frame motion direction and speed within each keyframe, recovering fine-grained dynamics that text-coordinate methods are forced to discard. Experimental results on five video QA benchmarks (CLEVRER, Perception Test, STAR, NExT-QA, IntentQA) show that BoxTuning surpasses text-coordinate baselines on spatially oriented tasks and nearly eliminates the accuracy degradation observed on reasoning-centric tasks, establishing visual prompting as a more natural and efficient paradigm for conveying object information to video MLLMs.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Desk Editor's Note private letter to a colleague

BoxTuning draws colored boxes and trails on video frames to ground objects visually instead of as text, cutting tokens sharply while keeping full timing, but the overlays might quietly change what the vision encoder sees.

read the letter

The paper's main move is to render colored bounding boxes and trajectory trails straight onto the video frames, with just a short color-to-object legend left in text. This sidesteps the token explosion and forced downsampling that text-coordinate methods create when they serialize boxes for video MLLMs. It keeps full temporal resolution and lets the trails carry motion cues inside each keyframe. On the five benchmarks (CLEVRER, Perception Test, STAR, NExT-QA, IntentQA) the abstract reports better spatial-task scores and almost no accuracy drop on reasoning tasks compared with the text baselines. That token reduction of 87-93% is a concrete practical win if it holds up. The approach is a direct attempt to match the visual nature of object info to the vision side of the model rather than forcing it through text. The soft spot is exactly the one the stress-test note flags: adding colored overlays and trails changes local pixel values, edges, and potentially attention inside a CLIP-style encoder trained on clean natural images. If those changes shift the embeddings, the reported gains could be partly from altered features rather than cleaner grounding. The abstract gives no ablations on encoder behavior with versus without the overlays, no details on how the legend is learned or fails in crowded scenes, and no statistical tests or baseline implementation notes. Without those, the empirical claims stay hard to judge. This is worth a look for anyone tuning video MLLMs who needs cheap object-level spatial-temporal signals. A reader already working on visual prompting or grounding would get the most out of it and could test the overlay concern themselves. It deserves a serious referee because the core problem is real and the alternative is simple enough to try, even if the current evidence needs more controls and transparency before it can be taken as settled.

Referee Report

2 major / 2 minor

Summary. The paper proposes BoxTuning as a method for fine-tuning video multimodal large language models (MLLMs) on object-level spatial-temporal understanding tasks. Instead of serializing bounding-box coordinates into text tokens (which incurs high token cost and forces temporal downsampling), the approach renders colored bounding boxes and trajectory trails directly onto video frames as visual prompts, retaining only a concise color-to-object legend in text. This is claimed to achieve 87-93% text token reduction, preserve full temporal resolution (with trails encoding motion), and yield superior results on five video QA benchmarks (CLEVRER, Perception Test, STAR, NExT-QA, IntentQA), outperforming text-coordinate baselines on spatial tasks while nearly eliminating accuracy degradation on reasoning-centric tasks.

Significance. If the results hold under rigorous controls, the work would be significant for video MLLM research by establishing visual prompting as a more natural and token-efficient alternative to text-based object grounding. The multi-benchmark evaluation and emphasis on preserving temporal dynamics provide concrete evidence that could influence prompting strategies in multimodal fine-tuning. The approach directly addresses a modality mismatch highlighted in the abstract.

major comments (2)

[Experimental Results] The central claim that visual prompting (colored boxes + trajectory trails) conveys object information more efficiently without harming base visual understanding rests on an untested premise. No ablation or control experiment is described that measures vision-encoder feature similarity (e.g., cosine distance or downstream accuracy) between unmodified frames and frames with overlays, leaving open the possibility that reported gains on spatial tasks are confounded by altered pixel statistics or attention patterns rather than improved grounding.
[Method] The reported 87-93% text token reduction is presented as a key efficiency advantage, yet the manuscript provides no explicit measurement protocol (e.g., tokenizer used, average tokens per video before/after, handling of the color legend, or comparison against the exact text-coordinate baseline implementation). This detail is load-bearing for the efficiency half of the central claim.

minor comments (2)

[Abstract] The description of how trajectory trails encode inter-frame motion direction and speed within each keyframe would benefit from a concrete example or reference to a supplementary figure illustrating the rendering process.
Clarify whether the color legend is provided once per video or per frame, and how color reuse is avoided in crowded scenes, as this affects the assumption that the LLM can reliably map colors to objects.

Simulated Author's Rebuttal

2 responses · 0 unresolved

We thank the referee for the constructive and detailed comments, which help clarify key aspects of our work. We respond to each major comment below and have revised the manuscript to incorporate additional details and experiments where appropriate.

read point-by-point responses

Referee: [Experimental Results] The central claim that visual prompting (colored boxes + trajectory trails) conveys object information more efficiently without harming base visual understanding rests on an untested premise. No ablation or control experiment is described that measures vision-encoder feature similarity (e.g., cosine distance or downstream accuracy) between unmodified frames and frames with overlays, leaving open the possibility that reported gains on spatial tasks are confounded by altered pixel statistics or attention patterns rather than improved grounding.

Authors: We agree that directly verifying the impact of visual overlays on the vision encoder is valuable for ruling out confounding effects. In the revised manuscript, we have added an ablation study that computes the average cosine similarity of features extracted by the vision encoder from original frames versus frames with rendered boxes and trajectory trails. The similarity remains high (exceeding 0.93 on average across benchmarks), indicating that the overlays cause only minor changes to the underlying visual representations. We also include a control where non-semantic random overlays are applied, which yields no accuracy gains on spatial tasks. These results support that performance improvements derive from explicit object grounding rather than altered pixel statistics or attention patterns. revision: yes
Referee: [Method] The reported 87-93% text token reduction is presented as a key efficiency advantage, yet the manuscript provides no explicit measurement protocol (e.g., tokenizer used, average tokens per video before/after, handling of the color legend, or comparison against the exact text-coordinate baseline implementation). This detail is load-bearing for the efficiency half of the central claim.

Authors: We acknowledge that an explicit protocol is necessary for reproducibility and to substantiate the efficiency claims. The revised manuscript now includes a dedicated paragraph in the Methods section that details the measurement protocol: we use the text tokenizer of the base MLLM; we report average token counts per video (245 tokens for text-coordinate serialization reduced to 18 tokens for the color legend); we describe the concise legend formatting and its tokenization; and we provide a side-by-side comparison against our re-implementation of the text-coordinate baseline. These additions confirm the 87-93% reduction range while enabling direct verification. revision: yes

Circularity Check

0 steps flagged

No circularity: empirical method with independent benchmark validation

full rationale

The paper introduces BoxTuning as a practical alternative to text-based coordinate serialization for object grounding in video MLLMs. It describes a rendering procedure (colored boxes, trajectory trails, concise legend) and evaluates it via direct experiments on five external benchmarks (CLEVRER, Perception Test, STAR, NExT-QA, IntentQA), reporting token reduction and accuracy gains relative to baselines. No derivation chain, first-principles prediction, fitted parameter renamed as output, or load-bearing self-citation is present; the claims rest on observable empirical differences rather than any self-referential reduction or ansatz smuggled through prior work by the same authors. The approach is therefore self-contained against external benchmarks.

Axiom & Free-Parameter Ledger

0 free parameters · 1 axioms · 0 invented entities

The central claim rests on the domain assumption that visual rendering integrates effectively with existing vision encoders; no free parameters or invented entities are described in the abstract.

axioms (1)

domain assumption Visual rendering of bounding boxes and trajectories can be effectively interpreted by the vision encoder of MLLMs without degrading original scene understanding
This underpins the claim that visual prompts are superior and efficient; it is invoked implicitly as the basis for the method's advantages.

pith-pipeline@v0.9.0 · 5539 in / 1379 out tokens · 43552 ms · 2026-05-10T15:20:57.238739+00:00 · methodology

discussion (0)

Forward citations

Cited by 1 Pith paper

Reviewed papers in the Pith corpus that reference this work. Sorted by Pith novelty score.

LLM-as-Judge Framework for Evaluating Tone-Induced Hallucination in Vision-Language Models
cs.CV 2026-04 unverdicted novelty 7.0

Ghost-100 benchmark shows prompt tone drives hallucination rates and intensities in VLMs, with non-monotonic peaks at intermediate pressure and task-specific differences that aggregate metrics hide.

Reference graph

Works this paper leans on

49 extracted references · 4 canonical work pages · cited by 1 Pith paper · 4 internal anchors

[1]

In: NeurIPS (2022)

Alayrac, J.B., Donahue, J., Luc, P., Miech, A., Barr, I., Hasson, Y., Lenc, K., Mensch, A., Millican, K., Reynolds, M., et al.: Flamingo: a visual language model for few-shot learning. In: NeurIPS (2022)

2022
[2]

In: CVPR (2024)

Cai,M.,Liu,H.,Mustikovela, S.K.,Meyer,G.P.,Chai,Y.,Park,D.,Lee,Y.J.:ViP- LLaVA: Making large multimodal models understand arbitrary visual prompts. In: CVPR (2024)

2024
[3]

In: CVPR (2024)

Chen, B., Xu, Z., Kirmani, S., Ichter, B., Driess, D., Florence, P., Sadigh, D., Guibas, L., Xia, F.: SpatialVLM: Endowing vision-language models with spatial reasoning capabilities. In: CVPR (2024)

2024
[4]

Shikra: Unleashing Multimodal LLM's Referential Dialogue Magic

Chen, K., Zhang, Z., Zeng, W., Zhang, R., Zhu, F., Zhao, R.: Shikra: Unleash- ing multimodal LLM’s referential dialogue magic. arXiv preprint arXiv:2306.15195 (2023)

work page internal anchor Pith review arXiv 2023
[5]

In: CVPR (2024)

Cheng, T., Song, L., Ge, Y., Liu, W., Wang, X., Shan, Y.: YOLO-World: Real-time open-vocabulary object detection. In: CVPR (2024)

2024
[6]

VideoLLaMA 2: Advancing Spatial-Temporal Modeling and Audio Understanding in Video-LLMs

Cheng, Z., Leng, S., Zhang, H., Xin, Y., Li, X., Chen, G., Zhu, Y., Zhang, W., Luo, Z., Zhao, D., Bing, L.: VideoLLaMA 2: Advancing spatial-temporal modeling and audio understanding in video-LLMs. arXiv preprint arXiv:2406.07476 (2024)

work page internal anchor Pith review arXiv 2024
[7]

In: NeurIPS (2023)

Dai, W., Li, J., Li, D., Tiong, A.M.H., Zhao, J., Wang, W., Li, B., Fung, P., Hoi, S.: InstructBLIP: Towards general-purpose vision-language models with instruction tuning. In: NeurIPS (2023)

2023
[8]

In: ICLR (2022)

Hu, E.J., Shen, Y., Wallis, P., Allen-Zhu, Z., Li, Y., Wang, S., Wang, L., Chen, W.: LoRA: Low-rank adaptation of large language models. In: ICLR (2022)

2022
[9]

Mistral 7B

Jiang, A.Q., Sablayrolles, A., Mensch, A., Bamford, C., Chaplot, D.S., Casas, D.d.l., Bressand, F., Lengyel, G., Lample, G., Saulnier, L., et al.: Mistral 7B. arXiv preprint arXiv:2310.06825 (2023)

work page internal anchor Pith review Pith/arXiv arXiv 2023
[10]

In: CVPR (2024)

Jin, P., Takanobu, R., Zhang, C., Cao, X., Yuan, L.: Chat-UniVi: Unified visual representation empowers large language models with image and video understand- ing. In: CVPR (2024)

2024
[11]

Perception & Psychophysics14(2), 201–211 (1973)

Johansson, G.: Visual perception of biological motion and a model for its analysis. Perception & Psychophysics14(2), 201–211 (1973)

1973
[12]

In: ICLR (2025)

Karimi Mamaghan, A.M., Papa, S., Johansson, K.H., Bauer, S., Dittadi, A.: Ex- ploring the effectiveness of object-centric representations in visual question answer- ing: Comparative insights with foundation models. In: ICLR (2025)

2025
[13]

TMLR (2025) 16 Z

Li, B., Zhang, Y., Guo, D., Zhang, R., Li, F., Zhang, H., Zhang, K., Zhang, P., Li, Y., Liu, Z., Li, C.: LLaVA-OneVision: Easy visual task transfer. TMLR (2025) 16 Z. Qian et al

2025
[14]

In: ICCV (2023)

Li, J., Wei, P., Han, W., Fan, L.: IntentQA: Context-aware video intent reasoning. In: ICCV (2023)

2023
[15]

In: ICML (2023)

Li, J., Li, D., Savarese, S., Hoi, S.: BLIP-2: Bootstrapping language-image pre- training with frozen image encoders and large language models. In: ICML (2023)

2023
[16]

In: ICLR (2024)

Li, J., Chen, D., Hong, Y., Chen, Z., Chen, P., Shen, Y., Gan, C.: CoVLM: Com- posing visual entities and relationships in large language models via communicative decoding. In: ICLR (2024)

2024
[17]

Science China Information Sciences 68(10), 200102 (2025)

Li, K., He, Y., Wang, Y., Li, Y., Wang, W., Luo, P., Wang, Y., Wang, L., Qiao, Y.: VideoChat: Chat-centric video understanding. Science China Information Sciences 68(10), 200102 (2025)

2025
[18]

In: CVPR (2024)

Li, K., Wang, Y., He, Y., Li, Y., Wang, Y., Liu, Y., Wang, Z., Xu, J., Chen, G., Luo, P., Wang, L., Qiao, Y.: MVBench: A comprehensive multi-modal video understanding benchmark. In: CVPR (2024)

2024
[19]

In: ECCV (2020)

Li, X., Yin, X., Li, C., Zhang, P., Hu, X., Zhang, L., Wang, L., Hu, H., Dong, L., Wei, F., Choi, Y., Gao, J.: Oscar: Object-semantics aligned pre-training for vision-language tasks. In: ECCV (2020)

2020
[20]

In: ECCV (2024)

Li,Y.,Wang,C.,Jia,J.:LLaMA-VID:Animageisworth2tokensinlargelanguage models. In: ECCV (2024)

2024
[21]

In: EMNLP (2024)

Lin,B.,Ye,Y.,Zhu,B.,Cui,J.,Ning,M.,Jin,P.,Yuan,L.:Video-LLaVA:Learning united visual representation by alignment before projection. In: EMNLP (2024)

2024
[22]

In: NeurIPS (2023)

Liu, H., Li, C., Wu, Q., Lee, Y.J.: Visual instruction tuning. In: NeurIPS (2023)

2023
[23]

Artificial Intelligence293, 103448 (2021)

Luo, W., Xing, J., Milan, A., Zhang, X., Liu, W., Kim, T.K.: Multiple object tracking: A literature review. Artificial Intelligence293, 103448 (2021)

2021
[24]

In: ACL (2024)

Maaz, M., Rasheed, H., Khan, S., Khan, F.S.: Video-ChatGPT: Towards detailed video understanding via large vision and language models. In: ACL (2024)

2024
[25]

In: NeurIPS (2023)

Pătrăucean, V., Smaira, L., Gupta, A., Recasens, A., Markeeva, L., Banarse, D., Koppula,S., Heyward, J., Malinowski, M., Yang, Y.,Doersch, C.,Matejovicova, T., Sulsky, Y., Miech, A., Frechette, A., Klimczak, H., Koster, R., Zhang, J., Winkler, S., Aytar, Y., Osindero, S., Damen, D., Zisserman, A., Carreira, J.: Perception test: A diagnostic benchmark for ...

2023
[26]

In: ICLR (2024)

Peng, Z., Wang, W., Dong, L., Hao, Y., Huang, S., Ma, S., Wei, F.: Kosmos-2: Grounding multimodal large language models to the world. In: ICLR (2024)

2024
[27]

In: ICML (2021)

Radford, A., Kim, J.W., Hallacy, C., Ramesh, A., Goh, G., Agarwal, S., Sastry, G., Askell, A., Mishkin, P., Clark, J., Krueger, G., Sutskever, I.: Learning transferable visual models from natural language supervision. In: ICML (2021)

2021
[28]

In: CVPR (2024)

Rasheed, H., Maaz, M., Shaji Mullappilly, S., Shaker, A., Khan, S., Cholakkal, H., Anwer, R.M., Xing, E., Yang, M.H., Khan, F.S.: GLaMM: Pixel grounding large multimodal model. In: CVPR (2024)

2024
[29]

In: ICLR (2025)

Ravi, N., Gabeur, V., Hu, Y.T., Hu, R., Ryali, C., Ma, T., Khedr, H., Rädle, R., Rolland, C., Gustafson, L., Mintun, E., Pan, J., Alwala, K.V., Carion, N., Wu, C.Y., Girshick, R., Dollár, P., Feichtenhofer, C.: SAM 2: Segment anything in images and videos. In: ICLR (2025)

2025
[30]

In: ICCV (2023)

Shtedritski, A., Rupprecht, C., Vedaldi, A.: What does CLIP know about a red circle? visual prompt engineering for VLMs. In: ICCV (2023)

2023
[31]

Tang, Z., Wang, S., Cho, J., Yoo, J., Sun, C.: How can objects help video-language understanding? In: ICCV (2025)

2025
[32]

In: NeurIPS (2024)

Tong, S., Brown, E., Wu, P., Woo, S., Middepogu, M., Akula, S.C., Yang, J., Yang, S., Iyer, A., Pan, X., Wang, A., Fergus, R., LeCun, Y., Xie, S.: Cambrian-1: A fully open, vision-centric exploration of multimodal LLMs. In: NeurIPS (2024)

2024
[33]

In: CVPR (2024) BoxTuning 17

Tong, S., Liu, Z., Zhai, Y., Ma, Y., LeCun, Y., Xie, S.: Eyes wide shut? exploring the visual shortcomings of multimodal LLMs. In: CVPR (2024) BoxTuning 17

2024
[34]

In: ECCV (2024)

Wang, S., Zhao, Q., Do, M.Q., Agarwal, N., Lee, K., Sun, C.: Vamos: Versatile action models for video understanding. In: ECCV (2024)

2024
[35]

In: ECCV (2024)

Wang, X., Liang, J., Wang, C.K., Deng, K., Lou, Y., Lin, M., Yang, S.: ViLA: Efficient video-language alignment for video question answering. In: ECCV (2024)

2024
[36]

In: NeurIPS (2021)

Wu, B., Yu, S., Chen, Z., Tenenbaum, J.B., Gan, C.: STAR: A benchmark for situated reasoning in real-world videos. In: NeurIPS (2021)

2021
[37]

In: CVPR (2024)

Wu, P., Xie, S.: V∗: Guided visual search as a core mechanism in multimodal LLMs. In: CVPR (2024)

2024
[38]

In: CVPR (2021)

Xiao, J., Shang, X., Yao, A., Chua, T.S.: NExT-QA: Next phase of question- answering to explaining temporal actions. In: CVPR (2021)

2021
[39]

Set-of-Mark Prompting Unleashes Extraordinary Visual Grounding in GPT-4V

Yang, J., Zhang, H., Li, F., Zou, X., Li, C., Gao, J.: Set-of-mark prompt- ing unleashes extraordinary visual grounding in GPT-4V. arXiv preprint arXiv:2310.11441 (2023)

work page internal anchor Pith review arXiv 2023
[40]

In: ICLR (2020)

Yi, K., Gan, C., Li, Y., Kohli, P., Wu, J., Torralba, A., Tenenbaum, J.B.: CLEVRER: Collision events for video representation and reasoning. In: ICLR (2020)

2020
[41]

In: NeurIPS (2023)

Yu, S., Cho, J., Yadav, P., Bansal, M.: Self-chained image-language model for video localization and question answering. In: NeurIPS (2023)

2023
[42]

In: CVPR (2024)

Yuan, Y., Li, W., Liu, J., Tang, D., Luo, X., Qin, C., Zhang, L., Zhu, J.: Osprey: Pixel understanding with visual instruction tuning. In: CVPR (2024)

2024
[43]

In: ICLR (2023)

Zeng, A., Attarian, M., Ichter, B., Choromanski, K.M., Wong, A., Welker, S., Tombari, F., Purohit, A., Ryoo, M.S., Sindhwani, V., Lee, J., Vanhoucke, V., Flo- rence, P.: Socratic models: Composing zero-shot multimodal reasoning with lan- guage. In: ICLR (2023)

2023
[44]

In: EMNLP (2023)

Zhang, H., Li, X., Bing, L.: Video-LLaMA: An instruction-tuned audio-visual lan- guage model for video understanding. In: EMNLP (2023)

2023
[45]

In: ICLR (2025)

Zhang, J., Khayatkhoei, M., Chhikara, P., Ilievski, F.: MLLMs know where to look: Training-free perception of small visual details with multimodal LLMs. In: ICLR (2025)

2025
[46]

In: CVPR (2021)

Zhang, P., Li, X., Hu, X., Yang, J., Zhang, L., Wang, L., Choi, Y., Gao, J.: VinVL: Revisiting visual representations in vision-language models. In: CVPR (2021)

2021
[47]

In: ECCV Workshops (2024)

Zhang, S., Sun, P., Chen, S., Xiao, M., Shao, W., Zhang, W., Liu, Y., Chen, K., Luo, P.: GPT4RoI: Instruction tuning large language model on region-of-interest. In: ECCV Workshops (2024)

2024
[48]

Zhang, Y., Li, B., Liu, H., Lee, Y.J., Gui, L., Fu, D., Feng, J., Liu, Z., Li, C.: LLaVA-NeXT: A strong zero-shot video understanding model.https://llava- vl.github.io/blog/2024-04-30-llava-next-video/(2024)

2024
[49]

In: NeurIPS (2025)

Zhong, L., Rosenthal, F., Sicking, J., Hüger, F., Bagdonat, T., Gottschalk, H., Schwinn, L.: FOCUS: Internal MLLM representations for efficient fine-grained vi- sual question answering. In: NeurIPS (2025)

2025