EchoingPixels: Aliasing-Resistant Joint Token Reduction for Audio-Visual LLMs

Chao Gong; Depeng Wang; Huijia Zhu; Jingjing Chen; Ya Guo; Zhipeng Wei

arxiv: 2512.10324 · v2 · pith:TZOMBIMZnew · submitted 2025-12-11 · 💻 cs.CV

classification 💻 cs.CV

keywords audio-visual, echoingpixels, sparse, token, joint, positional, reduction, aliasing

Audio-Visual Large Language Models (AV-LLMs) face prohibitive computational costs of processing massive, redundant audio-visual tokens. Existing unimodal compression techniques fail to capture the heterogeneous and mutually influential information density of joint audio-visual signals. Furthermore, we identify a fundamental and overlooked theoretical bottleneck in sparse token reduction: positional aliasing. We demonstrate that aggressive sparse sampling on standard position-encoded sequences violates the Nyquist limit relative to the effective token interval, causing phase-wrapping collisions that corrupt temporal monotonicity. To address this, we introduce EchoingPixels, a framework for aliasing-resistant joint token reduction. Our Cross-Modal Semantic Sieve performs extractive selection on the synergistic audio-visual stream, dynamically allocating budgets based on joint-modality saliency rather than fixed per-modality ratios. To resolve positional aliasing, we derive Sync-RoPE, a spectral low-pass filter for Rotary Positional Embeddings that adapts encoding bandwidth to the sparse sampling rate, preserving monotonic temporal relationships in the reduced stream. Experiments show that EchoingPixels achieves performance comparable to full models using only 5-20% of original tokens, validating theoretically grounded sparse learning as a robust solution for efficient AV-LLMs. Code is available at https://github.com/CharlesGong12/EchoingPixels.
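
The positional-aliasing argument in the abstract can be illustrated with a small sketch. It is not the paper's Sync-RoPE and not taken from the linked repository; it only checks the Nyquist-style condition the abstract describes, assuming the standard RoPE frequency schedule theta_i = base^(-2i/d) with base 10000 and a uniform stride between retained tokens. The function names `rope_frequencies`, `aliased_bands`, and `wrapped_phase` are hypothetical.

```python
import math


def rope_frequencies(head_dim, base=10000.0):
    """Standard RoPE angular frequencies, one per 2-D rotation pair.

    Assumption: theta_i = base ** (-2i / head_dim), the usual RoPE schedule.
    """
    return [base ** (-2.0 * i / head_dim) for i in range(head_dim // 2)]


def aliased_bands(head_dim, stride, base=10000.0):
    """Indices of bands whose phase step between adjacent kept tokens exceeds pi.

    `stride` is the gap (in original positions) between retained tokens.
    A step above pi per retained token means that band is sampled below
    its Nyquist rate, so its rotation can wrap and collide.
    """
    return [
        i
        for i, theta in enumerate(rope_frequencies(head_dim, base))
        if stride * theta > math.pi
    ]


def wrapped_phase(position, theta):
    """Rotation angle for `position`, wrapped into [-pi, pi)."""
    return (position * theta + math.pi) % (2.0 * math.pi) - math.pi


if __name__ == "__main__":
    head_dim = 64
    for stride in (1, 4, 20):  # stride 20 corresponds to keeping about 5% of tokens
        bands = aliased_bands(head_dim, stride)
        print(f"stride={stride}: {len(bands)} aliased band(s) -> {bands}")

    # In the fastest band, wrapped phases of kept tokens stop growing with position.
    theta0 = rope_frequencies(head_dim)[0]
    print([round(wrapped_phase(p, theta0), 3) for p in (0, 20, 40, 60)])
```

Under these assumptions, dense sampling (stride 1) aliases no band, while larger strides push the highest-frequency bands past the limit. How Sync-RoPE actually adapts the encoding bandwidth is specified in the paper, not here.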

Forward citations

Cited by 5 Pith papers

Reviewed papers in the Pith corpus that reference this work. Sorted by Pith novelty score.

From Senses to Decisions: The Information Flow of Auditory and Visual Perception in Multimodal LLMs
cs.AI 2026-06 unverdicted novelty 7.0

AVLLMs route audio-visual information sequentially in video tasks and via parallel streams for interleaved items, allowing early token discard with little performance loss across models and scales.

Keep What Audio Cannot Say: Context-Preserving Token Pruning for Omni-LLMs
cs.CV 2026-05 unverdicted novelty 6.0

ContextGuard prunes 55% of tokens in Qwen2.5-Omni 7B while matching full performance on five of six audio-visual benchmarks by preserving audio-irrecoverable visual context.

EchoPrune: Interpreting Redundancy as Temporal Echoes for Efficient VideoLLMs
cs.CV 2026-05 unverdicted novelty 6.0

EchoPrune prunes video tokens via query relevance and temporal reconstruction error to let VideoLLMs handle up to 20x more frames under fixed budget with reported gains in accuracy and speed.

Stage-adaptive Token Selection for Efficient Omni-modal LLMs
cs.CV 2026-05 unverdicted novelty 5.0

SEATS adaptively selects and removes non-text tokens before and inside the LLM layers of omni-modal models, yielding 9.3x FLOPs reduction and 4.8x prefill speedup at 10% token retention while keeping 96.3% performance.

OmniRefine: Alignment-Aware Cooperative Compression for Efficient Omnimodal Large Language Models
cs.AI 2026-05 unverdicted novelty 5.0

OmniRefine introduces alignment-aware chunk refinement via similarity and dynamic programming followed by modality-cooperative token compression, achieving near-baseline accuracy at 44% token retention on WorldSense.