CrossMPI steers both visual and textual interpretations in LVLMs through image-only perturbations by optimizing in hidden-state space at selected middle layers with distance-based budget allocation.
Title resolution pending
7 Pith papers cite this work. Polarity classification is still indexing.
citation summary
years: 2026 (7)
verdicts: UNVERDICTED (7)
roles: dataset (1)
polarities: use dataset (1)
citing papers explorer
- A Cross-Modal Prompt Injection Attack against Large Vision-Language Models with Image-Only Perturbation
  CrossMPI steers both visual and textual interpretations in LVLMs through image-only perturbations by optimizing in hidden-state space at selected middle layers with distance-based budget allocation.
- Beyond Localization: A Comprehensive Diagnosis of Perspective-Conditioned Spatial Reasoning in MLLMs from Omnidirectional Images
  MLLMs display a large perception-reasoning gap on perspective-conditioned spatial reasoning tasks from omnidirectional images, with sharp accuracy drops on advanced tasks like egocentric rotation, though partial gains are possible via RL reward shaping.
- DISSECT: Diagnosing Where Vision Ends and Language Priors Begin in Scientific VLMs
  DISSECT benchmark reveals that VLMs extract visual details from scientific diagrams but frequently lose them during reasoning, with open-source models showing a larger integration gap than closed-source ones.
- Latent Denoising Improves Visual Alignment in Large Multimodal Models
  A latent denoising objective with saliency-aware corruption and contrastive distillation improves visual alignment and corruption robustness in large multimodal models.
- AIM: Asymmetric Information Masking for Visual Question Answering Continual Learning
  AIM applies modality-specific masks to balance stability and plasticity in asymmetric VLMs, achieving SOTA average performance and reduced forgetting on continual VQA v2 and GQA while preserving generalization to novel compositions.
- Decoupled Similarity for Task-Aware Token Pruning in Large Vision-Language Models
  DeSAP uses decoupled cross-modal similarity plus visual saliency to prune visual tokens in LVLMs, retaining 11.1% tokens for 10x FLOPs reduction and 98.1% performance on LLaVA-1.5-7B.
- MSAO: Adaptive Modality Sparsity-Aware Offloading with Edge-Cloud Collaboration for Efficient Multimodal LLM Inference
  MSAO cuts end-to-end latency by 30% and resource overhead by 30-65% for multimodal LLM inference through sparsity-aware edge-cloud offloading while preserving accuracy.