Mitigating Cross-Image Information Leakage in Multi-Image Understanding with Large Vision-Language Models

Junsuk Choe; Minyoung Lee; Sanghyuk Chun; Yeji Park

arxiv: 2508.13744 · v2 · pith:W5Q4ZGZAnew · submitted 2025-08-19 · 💻 cs.CV · cs.AI

Yeji Park, Minyoung Lee, Sanghyuk Chun, Junsuk Choe

classification 💻 cs.CV cs.AI

keywords focus, multi-image, leakage, performance, cross-image, image, images, information

Large Vision-Language Models (LVLMs) exhibit strong performance on single-image tasks. However, their performance degrades significantly when handling multi-image inputs. While this degradation has been observed in prior work, its nature remains poorly understood. We empirically observe that visual elements from different images become entangled in the model's representations and responses. We refer to this phenomenon as cross-image information leakage. To address this issue, we propose FOCUS, a training-free and architecture-agnostic method. FOCUS masks all but one image with random noise, guiding the model to focus on the single clean image. This process is repeated for each target image to obtain logits under partially masked contexts. These logits are aggregated and then refined using a noise-only reference input, which suppresses the leakage and yields more accurate outputs. FOCUS consistently improves performance on diverse multi-image benchmarks. We further show that FOCUS generalizes to video understanding, extending its applicability beyond static multi-image inputs. These results indicate that FOCUS offers a general solution for enhancing multi-image reasoning without additional training or architectural modifications.
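
The masking, aggregation, and refinement steps described in the abstract can be sketched roughly as follows. This is a minimal illustration under stated assumptions, not the authors' implementation: `logits_fn` is a hypothetical stand-in for an LVLM forward pass that returns next-token logits for a list of images, and the Gaussian noise, the mean aggregation, and the subtraction of the noise-only reference weighted by a hypothetical `alpha` are all guesses, since the abstract does not specify these details.

```python
import numpy as np


def make_noise_like(image, rng):
    """Return Gaussian noise shaped like `image` (the noise type is an assumption)."""
    return rng.standard_normal(image.shape).astype(image.dtype)


def focus_logits(images, logits_fn, alpha=1.0, seed=0):
    """Sketch of a FOCUS-style decoding step for one output position.

    images    : list of float arrays, one per input image
    logits_fn : hypothetical callable, list of images -> 1-D logits
    alpha     : hypothetical weight on the noise-only reference
    """
    rng = np.random.default_rng(seed)

    # One forward pass per image: keep that image clean, mask the rest with noise.
    per_image = []
    for keep in range(len(images)):
        context = [
            img if i == keep else make_noise_like(img, rng)
            for i, img in enumerate(images)
        ]
        per_image.append(np.asarray(logits_fn(context), dtype=np.float64))

    # Aggregate the partially masked logits (a plain mean is assumed here).
    aggregated = np.mean(per_image, axis=0)

    # Noise-only reference: every image replaced by noise.
    noise_only = [make_noise_like(img, rng) for img in images]
    reference = np.asarray(logits_fn(noise_only), dtype=np.float64)

    # Refine by subtracting the reference (contrastive form, assumed).
    return aggregated - alpha * reference
```

With N images this sketch costs N + 1 forward passes per decoding step, which is the price of staying training-free; a real implementation would batch these passes and apply the refined logits inside the decoding loop.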

Forward citations

Cited by 2 Pith papers

Reviewed papers in the Pith corpus that reference this work. Sorted by Pith novelty score.

From Senses to Decisions: The Information Flow of Auditory and Visual Perception in Multimodal LLMs
cs.AI 2026-06 unverdicted novelty 7.0

AVLLMs route audio-visual information sequentially in video tasks and via parallel streams for interleaved items, allowing early token discard with little performance loss across models and scales.

CGC: Compositional Grounded Contrast for Fine-Grained Multi-Image Understanding
cs.CV 2026-04 unverdicted novelty 7.0

CGC improves fine-grained multi-image understanding in MLLMs by constructing contrastive training instances from existing single-image annotations and adding a rule-based spatial reward, achieving SOTA on MIG-Bench an...