Visual instruction tuning. Advances in Neural Information Processing Systems, 36:34892–34916

Haotian Liu, Chunyuan Li, Qingyang Wu, Yong Jae Lee · 2023

12 Pith papers cite this work. Polarity classification is still indexing.

citation-role summary

background 2

citation-polarity summary

background 2

representative citing papers

Mechanisms of Object Localization in Vision-Language Models

cs.CV · 2026-05-19 · unverdicted · novelty 7.0

Localization in VLMs relies on a containerization mechanism driven by object-aligned tokens and a narrow set of specialized attention heads in early-to-mid or mid-to-late layers.

Prefill-Time Intervention for Mitigating Hallucination in Large Vision-Language Models

cs.CV · 2026-04-28 · conditional · novelty 7.0

Prefill-Time Intervention (PTI) reduces hallucinations in large vision-language models by applying a one-time modality-aware steering correction to the initial KV cache at the prefill stage rather than during autoregressive decoding.

Omni-NegCLIP: Enhancing CLIP with Front-Layer Contrastive Fine-Tuning for Comprehensive Negation Understanding

cs.CV · 2026-03-31 · unverdicted · novelty 7.0

Omni-NegCLIP improves CLIP's negation understanding by up to 52.65% on presence-based and 12.50% on absence-based tasks through front-layer fine-tuning with specialized contrastive losses.

Where Do Vision-Language Models Fail? World Scale Analysis for Image Geolocalization

cs.CV · 2026-04-17 · unverdicted · novelty 6.0

Vision-language models display large performance differences and clear limits in zero-shot country-level geolocalization from ground-view photos, with semantic cues helping coarse guesses but failing on fine details.

Switch-KD: Visual-Switch Knowledge Distillation for Vision-Language Models

cs.CV · 2026-04-16 · unverdicted · novelty 6.0

A 0.5B student VLM distills from a 3B teacher using visual-switch distillation and DBiLD loss to gain 3.6 points on average across 10 multimodal benchmarks without architecture changes.

Reinforce to Learn, Elect to Reason: A Dual Paradigm for Video Reasoning

cs.CV · 2026-04-06 · unverdicted · novelty 6.0

RLER trains video-reasoning models with three task-driven RL rewards for evidence production and elects the best answer from a few candidates via evidence consistency scoring, yielding 6.3% average gains on eight benchmarks.

Deeper Thought, Weaker Aim: Understanding and Mitigating Perceptual Impairment during Reasoning in Multimodal Large Language Models

cs.CV · 2026-03-15 · unverdicted · novelty 6.0

Attention dispersion during extended reasoning impairs MLLM perception on images, and a training-free VRGA framework mitigates it by selecting and reweighting visual attention heads using an entropy-focus criterion.

Skyra: AI-Generated Video Detection via Grounded Artifact Reasoning

cs.CV · 2025-12-17 · unverdicted · novelty 6.0

Skyra is an MLLM that detects AI-generated videos by identifying and reasoning over grounded visual artifacts, supported by a new annotated dataset and benchmark.

ViSurf: Visual Supervised-and-Reinforcement Fine-Tuning for Large Vision-and-Language Models

cs.CV · 2025-10-12 · unverdicted · novelty 6.0

ViSurf unifies SFT and RLVR for LVLMs in one training stage by injecting ground-truth labels into rollouts and applying novel reward controls, outperforming standalone and two-stage baselines on diverse benchmarks.

Firebolt-VL: Efficient Vision-Language Understanding with Cross-Modality Modulation

cs.CV · 2026-04-06 · unverdicted · novelty 5.0

Firebolt-VL introduces an LFM-based decoder and token-grid correlation to achieve linear-time vision-language inference with improved fine-grained grounding.

3D-IDE: 3D Implicit Depth Emergent

cs.CV · 2026-03-28 · unverdicted · novelty 5.0

3D awareness emerges implicitly in MLLMs via self-supervised geometric constraints that create an information bottleneck, removing depth and pose dependencies at inference and cutting latency by 55%.

TempR1: Improving Temporal Understanding of MLLMs via Temporal-Aware Multi-Task Reinforcement Learning

cs.CV · 2025-12-03 · unverdicted · novelty 5.0

TempR1 applies temporal-aware multi-task RL using GRPO and three types of localization rewards to achieve SOTA temporal understanding in MLLMs with synergistic gains from joint optimization.

citing papers explorer

Showing 12 of 12 citing papers.

Mechanisms of Object Localization in Vision-Language Models · cs.CV · 2026-05-19 · unverdicted · none · ref 19
Prefill-Time Intervention for Mitigating Hallucination in Large Vision-Language Models · cs.CV · 2026-04-28 · conditional · none · ref 26
Omni-NegCLIP: Enhancing CLIP with Front-Layer Contrastive Fine-Tuning for Comprehensive Negation Understanding · cs.CV · 2026-03-31 · unverdicted · none · ref 20
Where Do Vision-Language Models Fail? World Scale Analysis for Image Geolocalization · cs.CV · 2026-04-17 · unverdicted · none · ref 22
Switch-KD: Visual-Switch Knowledge Distillation for Vision-Language Models · cs.CV · 2026-04-16 · unverdicted · none · ref 30
Reinforce to Learn, Elect to Reason: A Dual Paradigm for Video Reasoning · cs.CV · 2026-04-06 · unverdicted · none · ref 25
Deeper Thought, Weaker Aim: Understanding and Mitigating Perceptual Impairment during Reasoning in Multimodal Large Language Models · cs.CV · 2026-03-15 · unverdicted · none · ref 14
Skyra: AI-Generated Video Detection via Grounded Artifact Reasoning · cs.CV · 2025-12-17 · unverdicted · none · ref 37
ViSurf: Visual Supervised-and-Reinforcement Fine-Tuning for Large Vision-and-Language Models · cs.CV · 2025-10-12 · unverdicted · none · ref 17
Firebolt-VL: Efficient Vision-Language Understanding with Cross-Modality Modulation · cs.CV · 2026-04-06 · unverdicted · none · ref 27
3D-IDE: 3D Implicit Depth Emergent · cs.CV · 2026-03-28 · unverdicted · none · ref 29
TempR1: Improving Temporal Understanding of MLLMs via Temporal-Aware Multi-Task Reinforcement Learning · cs.CV · 2025-12-03 · unverdicted · none · ref 31
