hub

Advances in neural information processing systems , volume=

Visual instruction tuning , author=

43 Pith papers cite this work. Polarity classification is still indexing.

43 Pith papers citing it

browse 43 citing papers

hub tools

JSON dossier citing papers JSON

citation-role summary

background 1 method 1

citation-polarity summary

background 1 use method 1

representative citing papers

Knowledge Poisoning Attacks on Medical Multi-Modal Retrieval-Augmented Generation

cs.CR · 2026-05-11 · unverdicted · novelty 8.0

M³Att poisons medical multimodal RAG by pairing covert textual misinformation with query-agnostic visual perturbations that increase retrieval of the bad content, causing LLMs to generate clinically plausible but incorrect responses.

Thermo-VL: Extending Vision-Language Models to Thermal Infrared Perception

cs.CV · 2026-05-21 · unverdicted · novelty 7.0

Thermo-VL augments a frozen Molmo-7B VLM with a trainable thermal encoder and prompt-conditioned dual-attention fusion to improve cross-spectrum visual reasoning.

WikiVQABench: A Knowledge-Grounded Visual Question Answering Benchmark from Wikipedia and Wikidata

cs.CV · 2026-05-20 · conditional · novelty 7.0

WikiVQABench is a human-curated collection of Wikipedia-based VQA items that require both visual evidence and external knowledge from Wikidata to answer correctly.

Very Efficient Listwise Multimodal Reranking for Long Documents

cs.IR · 2026-05-12 · unverdicted · novelty 7.0

ZipRerank delivers state-of-the-art multimodal listwise reranking accuracy for long documents at up to 10x lower latency via early interaction and single-pass scoring.

Learning to Communicate Locally for Large-Scale Multi-Agent Pathfinding

cs.AI · 2026-05-08 · unverdicted · novelty 7.0 · 2 refs

LC-MAPF uses multi-round local communication between neighboring agents in a pre-trained model to outperform prior learning-based MAPF solvers on diverse unseen scenarios while preserving scalability.

AstroAlertBench: Evaluating the Accuracy, Reasoning, and Honesty of Multimodal LLMs in Astronomical Classification

astro-ph.IM · 2026-05-07 · unverdicted · novelty 7.0

AstroAlertBench evaluates multimodal LLMs on astronomical classification accuracy, reasoning, and honesty using real ZTF alerts, revealing that high accuracy often diverges from self-assessed reasoning quality.

ProjLens: Unveiling the Role of Projectors in Multimodal Model Safety

cs.CR · 2026-04-21 · unverdicted · novelty 7.0

ProjLens shows that backdoor parameters in MLLMs are encoded in low-rank subspaces of the projector and that embeddings shift toward the target direction with magnitude linear in input norm, activating only on poisoned samples.

Region-Grounded Report Generation for 3D Medical Imaging: A Fine-Grained Dataset and Graph-Enhanced Framework

cs.CV · 2026-04-20 · conditional · novelty 7.0

Introduces the first large-scale 3D PET/CT dataset with fine-grained RoI annotations for Vietnamese and a graph-enhanced HiRRA framework that achieves SOTA report generation by modeling RoI dependencies.

GaLa: Hypergraph-Guided Visual Language Models for Procedural Planning

cs.RO · 2026-04-19 · unverdicted · novelty 7.0

GaLa uses hypergraph representations of objects and a TriView encoder with contrastive learning to improve vision-language models on procedural planning benchmarks.

VoiceBench: Benchmarking LLM-Based Voice Assistants

cs.CL · 2024-10-22 · unverdicted · novelty 7.0

VoiceBench is the first benchmark for multi-faceted evaluation of LLM voice assistants using real and synthetic spoken instructions with speaker, environmental, and content variations.

OProver: A Unified Framework for Agentic Formal Theorem Proving

cs.CL · 2026-05-17 · unverdicted · novelty 6.0

OProver-32B achieves top Pass@32 scores on MiniF2F, ProverBench, and PutnamBench by combining continued pretraining with iterative agentic proving, retrieval, SFT on repairs, and RL on unresolved cases using a 6.86M-proof dataset.

Artificial Intolerance: Stigmatizing Language in Clinical Documentation Skews Large Language Model Decision-Making

cs.CL · 2026-05-17 · unverdicted · novelty 6.0

Frontier LLMs exhibit bias from stigmatizing language in clinical vignettes across four conditions, skewing decisions toward less aggressive management, with limited mitigation from Chain-of-Thought or self-debiasing prompts.

Deep Pre-Alignment for VLMs

cs.CV · 2026-05-14 · unverdicted · novelty 6.0

Deep Pre-Alignment uses a small VLM perceiver instead of ViT to pre-align visual features with LLM text space, yielding 1.9-3.0 point gains on multimodal benchmarks and 32.9% less language forgetting.

Plan in Sandbox, Navigate in Open Worlds: Learning Physics-Grounded Abstracted Experience for Embodied Navigation

cs.RO · 2026-05-11 · unverdicted · novelty 6.0

SAGE trains agents in physics-grounded semantic abstractions via RL with asymmetric clipping, achieving 53.21% LLM-Match Success on A-EQA (+9.7% over baseline) and encouraging physical robot transfer.

WeatherSyn: An Instruction Tuning MLLM For Weather Forecasting Report Generation

cs.CL · 2026-05-08 · unverdicted · novelty 6.0

WeatherSyn is the first instruction-tuned MLLM for weather forecasting report generation, outperforming closed-source models on a new dataset of 31 US cities across 8 weather aspects.

Large Vision-Language Models Get Lost in Attention

cs.AI · 2026-05-07 · unverdicted · novelty 6.0

In LVLMs, attention can be replaced by random Gaussian weights with little or no performance loss, indicating that current models get lost in attention rather than efficiently using visual context.

CAST: Mitigating Object Hallucination in Large Vision-Language Models via Caption-Guided Visual Attention Steering

cs.CV · 2026-05-06 · unverdicted · novelty 6.0

CAST reduces object hallucination in LVLMs by 6.03% on average across five models and five benchmarks by identifying caption-sensitive attention heads and applying optimized steering directions to their outputs, with negligible added inference cost.

Replacing Parameters with Preferences: Federated Alignment of Heterogeneous Vision-Language Models

cs.AI · 2026-05-05 · unverdicted · novelty 6.0

MoR lets clients train local reward models on private preferences and uses a learned Mixture-of-Rewards with GRPO on the server to align a shared base VLM without exchanging parameters, architectures, or raw data.

Injecting Distributional Awareness into MLLMs via Reinforcement Learning for Deep Imbalanced Regression

cs.CL · 2026-05-02 · unverdicted · novelty 6.0 · 2 refs

A plug-and-play RL method adds batch-level distributional supervision via CCC rewards to reduce regression-to-the-mean in MLLMs on imbalanced regression benchmarks.

MiMIC: Mitigating Visual Modality Collapse in Universal Multimodal Retrieval While Avoiding Semantic Misalignment

cs.CV · 2026-04-23 · unverdicted · novelty 6.0

MiMIC mitigates visual modality collapse and semantic misalignment in universal multimodal retrieval via fusion-in-decoder architecture and robust single-modality training.

DR-MMSearchAgent: Deepening Reasoning in Multimodal Search Agents

cs.CV · 2026-04-21 · unverdicted · novelty 6.0

DR-MMSearchAgent derives batch-wide trajectory advantages and uses differentiated Gaussian rewards to prevent premature collapse in multimodal agents, outperforming MMSearch-R1 by 8.4% on FVQA-test.

S2H-DPO: Hardness-Aware Preference Optimization for Vision-Language Models

cs.CV · 2026-04-20 · unverdicted · novelty 6.0

S2H-DPO generates hierarchical prompt-driven preference pairs to improve multi-image reasoning in VLMs while keeping single-image performance intact.

MACS: Modality-Aware Capacity Scaling for Efficient Multimodal MoE Inference

cs.LG · 2026-04-19 · unverdicted · novelty 6.0

MACS improves inference speed in multimodal MoE models by entropy-weighted balancing of visual tokens and real-time modality-adaptive expert capacity allocation.

TraceVLA: Visual Trace Prompting Enhances Spatial-Temporal Awareness for Generalist Robotic Policies

cs.RO · 2024-12-13 · conditional · novelty 6.0

Visual trace prompting improves spatial-temporal awareness in VLA models, delivering 10% gains on SimplerEnv and 3.5x on real-robot tasks.

citing papers explorer

Showing 43 of 43 citing papers.

Knowledge Poisoning Attacks on Medical Multi-Modal Retrieval-Augmented Generation cs.CR · 2026-05-11 · unverdicted · none · ref 4
M³Att poisons medical multimodal RAG by pairing covert textual misinformation with query-agnostic visual perturbations that increase retrieval of the bad content, causing LLMs to generate clinically plausible but incorrect responses.
Thermo-VL: Extending Vision-Language Models to Thermal Infrared Perception cs.CV · 2026-05-21 · unverdicted · none · ref 23
Thermo-VL augments a frozen Molmo-7B VLM with a trainable thermal encoder and prompt-conditioned dual-attention fusion to improve cross-spectrum visual reasoning.
WikiVQABench: A Knowledge-Grounded Visual Question Answering Benchmark from Wikipedia and Wikidata cs.CV · 2026-05-20 · conditional · none · ref 101
WikiVQABench is a human-curated collection of Wikipedia-based VQA items that require both visual evidence and external knowledge from Wikidata to answer correctly.
Very Efficient Listwise Multimodal Reranking for Long Documents cs.IR · 2026-05-12 · unverdicted · none · ref 38
ZipRerank delivers state-of-the-art multimodal listwise reranking accuracy for long documents at up to 10x lower latency via early interaction and single-pass scoring.
Learning to Communicate Locally for Large-Scale Multi-Agent Pathfinding cs.AI · 2026-05-08 · unverdicted · none · ref 65 · 2 links
LC-MAPF uses multi-round local communication between neighboring agents in a pre-trained model to outperform prior learning-based MAPF solvers on diverse unseen scenarios while preserving scalability.
AstroAlertBench: Evaluating the Accuracy, Reasoning, and Honesty of Multimodal LLMs in Astronomical Classification astro-ph.IM · 2026-05-07 · unverdicted · none · ref 36
AstroAlertBench evaluates multimodal LLMs on astronomical classification accuracy, reasoning, and honesty using real ZTF alerts, revealing that high accuracy often diverges from self-assessed reasoning quality.
ProjLens: Unveiling the Role of Projectors in Multimodal Model Safety cs.CR · 2026-04-21 · unverdicted · none · ref 5
ProjLens shows that backdoor parameters in MLLMs are encoded in low-rank subspaces of the projector and that embeddings shift toward the target direction with magnitude linear in input norm, activating only on poisoned samples.
Region-Grounded Report Generation for 3D Medical Imaging: A Fine-Grained Dataset and Graph-Enhanced Framework cs.CV · 2026-04-20 · conditional · none · ref 13
Introduces the first large-scale 3D PET/CT dataset with fine-grained RoI annotations for Vietnamese and a graph-enhanced HiRRA framework that achieves SOTA report generation by modeling RoI dependencies.
GaLa: Hypergraph-Guided Visual Language Models for Procedural Planning cs.RO · 2026-04-19 · unverdicted · none · ref 47
GaLa uses hypergraph representations of objects and a TriView encoder with contrastive learning to improve vision-language models on procedural planning benchmarks.
VoiceBench: Benchmarking LLM-Based Voice Assistants cs.CL · 2024-10-22 · unverdicted · none · ref 14
VoiceBench is the first benchmark for multi-faceted evaluation of LLM voice assistants using real and synthetic spoken instructions with speaker, environmental, and content variations.
OProver: A Unified Framework for Agentic Formal Theorem Proving cs.CL · 2026-05-17 · unverdicted · none · ref 5
OProver-32B achieves top Pass@32 scores on MiniF2F, ProverBench, and PutnamBench by combining continued pretraining with iterative agentic proving, retrieval, SFT on repairs, and RL on unresolved cases using a 6.86M-proof dataset.
Artificial Intolerance: Stigmatizing Language in Clinical Documentation Skews Large Language Model Decision-Making cs.CL · 2026-05-17 · unverdicted · none · ref 58
Frontier LLMs exhibit bias from stigmatizing language in clinical vignettes across four conditions, skewing decisions toward less aggressive management, with limited mitigation from Chain-of-Thought or self-debiasing prompts.
Deep Pre-Alignment for VLMs cs.CV · 2026-05-14 · unverdicted · none · ref 6
Deep Pre-Alignment uses a small VLM perceiver instead of ViT to pre-align visual features with LLM text space, yielding 1.9-3.0 point gains on multimodal benchmarks and 32.9% less language forgetting.
Plan in Sandbox, Navigate in Open Worlds: Learning Physics-Grounded Abstracted Experience for Embodied Navigation cs.RO · 2026-05-11 · unverdicted · none · ref 11
SAGE trains agents in physics-grounded semantic abstractions via RL with asymmetric clipping, achieving 53.21% LLM-Match Success on A-EQA (+9.7% over baseline) and encouraging physical robot transfer.
WeatherSyn: An Instruction Tuning MLLM For Weather Forecasting Report Generation cs.CL · 2026-05-08 · unverdicted · none · ref 6
WeatherSyn is the first instruction-tuned MLLM for weather forecasting report generation, outperforming closed-source models on a new dataset of 31 US cities across 8 weather aspects.
Large Vision-Language Models Get Lost in Attention cs.AI · 2026-05-07 · unverdicted · none · ref 5
In LVLMs, attention can be replaced by random Gaussian weights with little or no performance loss, indicating that current models get lost in attention rather than efficiently using visual context.
CAST: Mitigating Object Hallucination in Large Vision-Language Models via Caption-Guided Visual Attention Steering cs.CV · 2026-05-06 · unverdicted · none · ref 30
CAST reduces object hallucination in LVLMs by 6.03% on average across five models and five benchmarks by identifying caption-sensitive attention heads and applying optimized steering directions to their outputs, with negligible added inference cost.
Replacing Parameters with Preferences: Federated Alignment of Heterogeneous Vision-Language Models cs.AI · 2026-05-05 · unverdicted · none · ref 31
MoR lets clients train local reward models on private preferences and uses a learned Mixture-of-Rewards with GRPO on the server to align a shared base VLM without exchanging parameters, architectures, or raw data.
Injecting Distributional Awareness into MLLMs via Reinforcement Learning for Deep Imbalanced Regression cs.CL · 2026-05-02 · unverdicted · none · ref 10 · 2 links
A plug-and-play RL method adds batch-level distributional supervision via CCC rewards to reduce regression-to-the-mean in MLLMs on imbalanced regression benchmarks.
MiMIC: Mitigating Visual Modality Collapse in Universal Multimodal Retrieval While Avoiding Semantic Misalignment cs.CV · 2026-04-23 · unverdicted · none · ref 50
MiMIC mitigates visual modality collapse and semantic misalignment in universal multimodal retrieval via fusion-in-decoder architecture and robust single-modality training.
DR-MMSearchAgent: Deepening Reasoning in Multimodal Search Agents cs.CV · 2026-04-21 · unverdicted · none · ref 23
DR-MMSearchAgent derives batch-wide trajectory advantages and uses differentiated Gaussian rewards to prevent premature collapse in multimodal agents, outperforming MMSearch-R1 by 8.4% on FVQA-test.
S2H-DPO: Hardness-Aware Preference Optimization for Vision-Language Models cs.CV · 2026-04-20 · unverdicted · none · ref 71
S2H-DPO generates hierarchical prompt-driven preference pairs to improve multi-image reasoning in VLMs while keeping single-image performance intact.
MACS: Modality-Aware Capacity Scaling for Efficient Multimodal MoE Inference cs.LG · 2026-04-19 · unverdicted · none · ref 6
MACS improves inference speed in multimodal MoE models by entropy-weighted balancing of visual tokens and real-time modality-adaptive expert capacity allocation.
TraceVLA: Visual Trace Prompting Enhances Spatial-Temporal Awareness for Generalist Robotic Policies cs.RO · 2024-12-13 · conditional · none · ref 43
Visual trace prompting improves spatial-temporal awareness in VLA models, delivering 10% gains on SimplerEnv and 3.5x on real-robot tasks.
LLaVA-Video: Video Instruction Tuning With Synthetic Data cs.CV · 2024-10-03 · unverdicted · none · ref 167
LLaVA-Video-178K is a new synthetic video instruction dataset that, when combined with existing data to train LLaVA-Video, produces strong results on video understanding benchmarks.
Safety Geometry Collapse in Multimodal LLMs and Adaptive Drift Correction cs.AI · 2026-05-18 · unverdicted · none · ref 3
Multimodal LLMs suffer Safety Geometry Collapse from modality-induced drift that reduces refusal separability; ReGap corrects drift at inference time using self-rectification signals to restore safety without retraining.
Temporal Aware Pruning for Efficient Diffusion-based Video Generation cs.CV · 2026-05-18 · unverdicted · none · ref 133 · 2 links
TAPE applies temporal-aware token pruning with smoothing, reselection, and timestep scheduling to speed up video diffusion models while preserving visual fidelity and coherence.
DyGRO-VLA: Cross-Task Scaling of Vision-Language-Action Models via Dynamic Grouped Residual Optimization cs.RO · 2026-05-17 · unverdicted · none · ref 164
DyGRO-VLA is a two-stage optimization framework for cross-task scaling of Vision-Language-Action models via dynamic grouped residual optimization in RL.
Rethinking Layer Relevance in Large Language Models Beyond Cosine Similarity cs.LG · 2026-05-13 · unverdicted · none · ref 13
Cosine similarity poorly predicts performance degradation from layer removal in LLMs, making direct accuracy-drop ablation a more reliable relevance metric.
Not Blind but Silenced: Rebalancing Vision and Language via Adversarial Counter-Commonsense Equilibrium cs.CV · 2026-05-11 · unverdicted · none · ref 31
ACE uses adversarial counter-commonsense perturbations on image tokens during decoding to suppress hallucinated linguistic priors while preserving stable visual signals in MLLMs.
SynerMedGen: Synergizing Medical Multimodal Understanding with Generation via Task Alignment cs.CV · 2026-05-09 · unverdicted · none · ref 51
SynerMedGen introduces generation-aligned understanding tasks and a two-stage training strategy that enables strong zero-shot medical image synthesis performance and outperforms specialized models when generation training is added.
Text-Guided Multi-Scale Frequency Representation Adaptation cs.CV · 2026-05-05 · unverdicted · none · ref 30
FreqAdapter adapts multimodal models by text-guided multi-scale fine-tuning in the frequency domain, claiming better performance and efficiency than signal-space PEFT methods.
ARGUS: Policy-Adaptive Ad Governance via Evolving Reinforcement with Adversarial Umpiring cs.CL · 2026-05-04 · unverdicted · none · ref 16
ARGUS uses a Prosecutor-Defender-Umpire multi-agent setup plus RAG and chain-of-thought rewards to adapt ad policy enforcement to new regulations using minimal fresh labels.
RTPrune: Reading-Twice Inspired Token Pruning for Efficient DeepSeek-OCR Inference cs.CV · 2026-05-01 · unverdicted · none · ref 9
RTPrune introduces a reading-twice inspired two-stage pruning technique for DeepSeek-OCR that retains 84.25% tokens while delivering 99.47% accuracy and 1.23x faster prefill on OmniDocBench.
AFMRL: Attribute-Enhanced Fine-Grained Multi-Modal Representation Learning in E-commerce cs.CL · 2026-04-22 · unverdicted · none · ref 11
AFMRL uses MLLM-generated attributes in attribute-guided contrastive learning and retrieval-aware reinforcement to achieve SOTA fine-grained multimodal retrieval on e-commerce datasets.
Think before Go: Hierarchical Reasoning for Image-goal Navigation cs.RO · 2026-04-19 · unverdicted · none · ref 92
HRNav decomposes image-goal navigation into VLM-based short-horizon planning and RL-based execution with a wandering suppression penalty to improve performance in complex unseen settings.
Bridging Perception and Action: A Lightweight Multimodal Meta-Planner Framework for Robust Earth Observation Agents cs.MA · 2026-05-06 · unverdicted · none · ref 32
The LMMP framework improves tool-calling accuracy and task success rates for Earth observation agents by grounding plans in multimodal features and remote sensing expert knowledge via a two-stage training process.
UnAC: Adaptive Visual Prompting with Abstraction and Stepwise Checking for Complex Multimodal Reasoning cs.CV · 2026-05-05 · unverdicted · none · ref 37
UnAC improves LMM performance on visual reasoning benchmarks by combining adaptive visual prompting, image abstraction, and gradual self-checking.
SCRIPT: Scalable Diffusion Policy with Multi-stage Training for Language-driven Physics-based Humanoid Control cs.GR · 2026-05-21 · unreviewed · ref 6
Diagnosing and Correcting Concept Omission in Multimodal Diffusion Transformers cs.CV · 2026-05-14 · unreviewed · ref 51
Bad Seeing or Bad Thinking? Rewarding Perception for Multimodal Reasoning cs.AI · 2026-05-13 · unreviewed · ref 47
Sparse Tokens Suffice: Jailbreaking Audio Language Models via Token-Aware Gradient Optimization cs.CR · 2026-05-06 · unreviewed · ref 5
Dual-Anchoring: Addressing State Drift in Vision-Language Navigation cs.CV · 2026-04-19 · unreviewed · ref 80

Advances in neural information processing systems , volume=

hub tools

citation-role summary

citation-polarity summary

fields

years

verdicts

roles

polarities

representative citing papers

citing papers explorer