Title resolution pending

Wenliang Dai, Junnan Li, Dongxu Li, Anthony Tiong, Junqi Zhao, Weisheng Wang, Boyang Li, Pascale N Fung, Steven Hoi · 2023

7 Pith papers cite this work. Polarity classification is still indexing.

7 Pith papers citing it

browse 7 citing papers

Title metadata for this work has not finished resolving. The hub is built from the citation graph; the title resolver retries DOI and OpenAlex on its next pass.

citation-role summary

background 2

citation-polarity summary

background 2

representative citing papers

A Cross-Modal Prompt Injection Attack against Large Vision-Language Models with Image-Only Perturbation

cs.CR · 2026-05-15 · unverdicted · novelty 7.0

CrossMPI steers both visual and textual interpretations in LVLMs through image-only perturbations by optimizing in hidden-state space at selected middle layers with distance-based budget allocation.

PokeGym: A Visually-Driven Long-Horizon Benchmark for Vision-Language Models

cs.CV · 2026-04-09 · unverdicted · novelty 7.0

PokeGym is a new benchmark that tests VLMs on long-horizon tasks in a complex 3D game using only visual observations, identifying deadlock recovery as the primary failure mode.

When Surfaces Lie: Exploiting Wrinkle-Induced Attention Shift to Attack Vision-Language Models

cs.CV · 2026-03-29 · unverdicted · novelty 7.0

A wrinkle-field perturbation method creates photorealistic non-rigid image changes that degrade state-of-the-art VLMs on image captioning and VQA more effectively than prior baselines.

Catching the Infection Before It Spreads: Foresight-Guided Defense in Multi-Agent Systems

cs.AI · 2026-05-03 · unverdicted · novelty 6.0

FLP uses multi-persona foresight simulation to detect infections via response diversity and applies local purification to reduce maximum cumulative infection rates in multi-agent systems from over 95% to below 5.47%.

Latent Denoising Improves Visual Alignment in Large Multimodal Models

cs.CV · 2026-04-23 · unverdicted · novelty 6.0

A latent denoising objective with saliency-aware corruption and contrastive distillation improves visual alignment and corruption robustness in large multimodal models.

Learning Shared Sentiment Prototypes for Adaptive Multimodal Sentiment Analysis

cs.MM · 2026-04-07 · unverdicted · novelty 6.0

PRISM learns shared sentiment prototypes to enable structured cross-modal comparison and dynamic modality reweighting in multimodal sentiment analysis, outperforming baselines on three benchmark datasets.

STEAR: Layer-Aware Spatiotemporal Evidence Intervention for Hallucination Mitigation in Video Large Language Models

cs.CV · 2026-04-03 · unverdicted · novelty 6.0

STEAR reduces spatial and temporal hallucinations in Video-LLMs via layer-aware evidence intervention from middle decoder layers in a single-encode pass.

citing papers explorer

Showing 7 of 7 citing papers.

A Cross-Modal Prompt Injection Attack against Large Vision-Language Models with Image-Only Perturbation cs.CR · 2026-05-15 · unverdicted · none · ref 9
CrossMPI steers both visual and textual interpretations in LVLMs through image-only perturbations by optimizing in hidden-state space at selected middle layers with distance-based budget allocation.
PokeGym: A Visually-Driven Long-Horizon Benchmark for Vision-Language Models cs.CV · 2026-04-09 · unverdicted · none · ref 11
PokeGym is a new benchmark that tests VLMs on long-horizon tasks in a complex 3D game using only visual observations, identifying deadlock recovery as the primary failure mode.
When Surfaces Lie: Exploiting Wrinkle-Induced Attention Shift to Attack Vision-Language Models cs.CV · 2026-03-29 · unverdicted · none · ref 14
A wrinkle-field perturbation method creates photorealistic non-rigid image changes that degrade state-of-the-art VLMs on image captioning and VQA more effectively than prior baselines.
Catching the Infection Before It Spreads: Foresight-Guided Defense in Multi-Agent Systems cs.AI · 2026-05-03 · unverdicted · none · ref 7
FLP uses multi-persona foresight simulation to detect infections via response diversity and applies local purification to reduce maximum cumulative infection rates in multi-agent systems from over 95% to below 5.47%.
Latent Denoising Improves Visual Alignment in Large Multimodal Models cs.CV · 2026-04-23 · unverdicted · none · ref 17
A latent denoising objective with saliency-aware corruption and contrastive distillation improves visual alignment and corruption robustness in large multimodal models.
Learning Shared Sentiment Prototypes for Adaptive Multimodal Sentiment Analysis cs.MM · 2026-04-07 · unverdicted · none · ref 10
PRISM learns shared sentiment prototypes to enable structured cross-modal comparison and dynamic modality reweighting in multimodal sentiment analysis, outperforming baselines on three benchmark datasets.
STEAR: Layer-Aware Spatiotemporal Evidence Intervention for Hallucination Mitigation in Video Large Language Models cs.CV · 2026-04-03 · unverdicted · none · ref 10
STEAR reduces spatial and temporal hallucinations in Video-LLMs via layer-aware evidence intervention from middle decoder layers in a single-encode pass.

Title resolution pending

citation-role summary

citation-polarity summary

fields

years

verdicts

roles

polarities

representative citing papers

citing papers explorer