
LLaVA-NeXT: Improved reasoning, OCR, and world knowledge

Liu, Haotian, Li, Chunyuan, Li, Yuheng, Li, Bo, Zhang, Yuanhan, Shen, Sheng

12 Pith papers cite this work. Polarity classification is still indexing.



citation-role summary

background 1

citation-polarity summary

background 1

representative citing papers

Thermo-VL: Extending Vision-Language Models to Thermal Infrared Perception

cs.CV · 2026-05-21 · unverdicted · novelty 7.0

Thermo-VL augments a frozen Molmo-7B VLM with a trainable thermal encoder and prompt-conditioned dual-attention fusion to improve cross-spectrum visual reasoning.

Prefix-Adaptive Block Diffusion for Efficient Document Recognition

cs.CV · 2026-05-16 · unverdicted · novelty 6.0

PA-BDM adapts block diffusion by switching to causal intra-block denoising and dynamically committing reliable prefixes to the KV cache, yielding higher accuracy and 71.6% higher throughput than a comparable baseline on document benchmarks.

Deep Pre-Alignment for VLMs

cs.CV · 2026-05-14 · unverdicted · novelty 6.0

Deep Pre-Alignment uses a small VLM perceiver instead of a ViT to pre-align visual features with the LLM text space, yielding gains of 1.9-3.0 points on multimodal benchmarks and 32.9% less language forgetting.

Vocabulary Hijacking in LVLMs: Unveiling Critical Attention Heads by Excluding Inert Tokens to Mitigate Hallucination

cs.MM · 2026-05-11 · unverdicted · novelty 6.0

LVLMs show vocabulary hijacking by inert tokens that decode to hijacking anchors; HABI locates them, NHAR finds resilient heads, and HAVAE boosts those heads to cut hallucinations.

Large Vision-Language Models Get Lost in Attention

cs.AI · 2026-05-07 · unverdicted · novelty 6.0

In LVLMs, attention can be replaced by random Gaussian weights with little or no performance loss, indicating that current models get lost in attention rather than efficiently using visual context.

CAST: Mitigating Object Hallucination in Large Vision-Language Models via Caption-Guided Visual Attention Steering

cs.CV · 2026-05-06 · unverdicted · novelty 6.0

CAST reduces object hallucination in LVLMs by 6.03% on average across five models and five benchmarks by identifying caption-sensitive attention heads and applying optimized steering directions to their outputs, with negligible added inference cost.

Where and What: Reasoning Dynamic and Implicit Preferences in Situated Conversational Recommendation

cs.AI · 2026-04-22 · unverdicted · novelty 6.0

SiPeR improves recommendation accuracy and response quality in situated conversations by estimating scene transitions and performing Bayesian inverse inference with multimodal LLMs.

SparseVLM: Visual Token Sparsification for Efficient Vision-Language Model Inference

cs.CV · 2024-10-06 · accept · novelty 6.0

SparseVLM uses text-guided attention to prune and recycle visual tokens in VLMs, delivering 54% FLOPs reduction and 37% lower latency with 97% accuracy retention on LLaVA.

Investigating Cross-Modal Skill Injection: Scenarios, Methods, and Hyperparameters

cs.CL · 2026-05-19 · unverdicted · novelty 5.0

Systematic evaluation finds cross-modal skill injection via model merging succeeds in instruction-following and cross-lingual scenarios but fails in mathematical reasoning, with TA and DARE methods outperforming others after hyperparameter analysis.

Text-Guided Multi-Scale Frequency Representation Adaptation

cs.CV · 2026-05-05 · unverdicted · novelty 5.0

FreqAdapter adapts multimodal models by text-guided multi-scale fine-tuning in the frequency domain, claiming better performance and efficiency than signal-space PEFT methods.

RTPrune: Reading-Twice Inspired Token Pruning for Efficient DeepSeek-OCR Inference

cs.CV · 2026-05-01 · unverdicted · novelty 5.0

RTPrune introduces a reading-twice inspired two-stage pruning technique for DeepSeek-OCR that retains 84.25% of tokens while delivering 99.47% accuracy and 1.23x faster prefill on OmniDocBench.

mPLUG-Owl3: Towards Long Image-Sequence Understanding in Multi-Modal Large Language Models

cs.CV · 2024-08-09 · unverdicted · novelty 5.0

mPLUG-Owl3 introduces hyper attention blocks to integrate vision and language for long image-sequence understanding and reports SOTA results on single-image, multi-image, and video benchmarks.

citing papers explorer

Showing 12 of 12 citing papers.

Thermo-VL: Extending Vision-Language Models to Thermal Infrared Perception · cs.CV · 2026-05-21 · unverdicted · none · ref 47
Prefix-Adaptive Block Diffusion for Efficient Document Recognition · cs.CV · 2026-05-16 · unverdicted · none · ref 54
Deep Pre-Alignment for VLMs · cs.CV · 2026-05-14 · unverdicted · none · ref 142
Vocabulary Hijacking in LVLMs: Unveiling Critical Attention Heads by Excluding Inert Tokens to Mitigate Hallucination · cs.MM · 2026-05-11 · unverdicted · none · ref 80
Large Vision-Language Models Get Lost in Attention · cs.AI · 2026-05-07 · unverdicted · none · ref 58
CAST: Mitigating Object Hallucination in Large Vision-Language Models via Caption-Guided Visual Attention Steering · cs.CV · 2026-05-06 · unverdicted · none · ref 64
Where and What: Reasoning Dynamic and Implicit Preferences in Situated Conversational Recommendation · cs.AI · 2026-04-22 · unverdicted · none · ref 66
SparseVLM: Visual Token Sparsification for Efficient Vision-Language Model Inference · cs.CV · 2024-10-06 · accept · none · ref 71
Investigating Cross-Modal Skill Injection: Scenarios, Methods, and Hyperparameters · cs.CL · 2026-05-19 · unverdicted · none · ref 52
Text-Guided Multi-Scale Frequency Representation Adaptation · cs.CV · 2026-05-05 · unverdicted · none · ref 8
RTPrune: Reading-Twice Inspired Token Pruning for Efficient DeepSeek-OCR Inference · cs.CV · 2026-05-01 · unverdicted · none · ref 23
mPLUG-Owl3: Towards Long Image-Sequence Understanding in Multi-Modal Large Language Models · cs.CV · 2024-08-09 · unverdicted · none · ref 120
