LatentOmni proposes a latent-space cross-modal reasoning framework that uses feature-level supervision and Omni-Sync Position Embedding to align and synchronize audio-visual latents, supported by a new 35K interleaved reasoning dataset and showing gains over text CoT baselines.
When modalities conflict: How unimodal reasoning uncertainty governs preference dynamics in mllms
6 Pith papers cite this work. Polarity classification is still indexing.
citation-role summary
citation-polarity summary
years
2026 6verdicts
UNVERDICTED 6roles
background 1polarities
support 1representative citing papers
Perceval is a perception-centric PRM that detects token-level perceptual errors in VLMs, supporting token-advantage RL training and iterative test-time scaling for improved reasoning.
Artifact-Bench supplies a three-level artifact taxonomy and three evaluation tasks that show 19 MLLMs perform near or below random on AI-video realism detection and reasoning.
Vision-language models use semantic signals more than syntactic ones to bind words like 'image' to actual visual inputs, with implications for robustness in multimodal systems.
Omni-modal LLMs exhibit visual preference that emerges in mid-to-late layers, enabling hallucination detection without task-specific training.
Constraining visual token budget per observation during VLM training forces genuine active perception and delivers 5% average relative improvement without auxiliary losses or architecture changes.
citing papers explorer
-
LatentOmni: Rethinking Omni-Modal Understanding via Unified Audio-Visual Latent Reasoning
LatentOmni proposes a latent-space cross-modal reasoning framework that uses feature-level supervision and Omni-Sync Position Embedding to align and synchronize audio-visual latents, supported by a new 35K interleaved reasoning dataset and showing gains over text CoT baselines.
-
Improving Vision-language Models with Perception-centric Process Reward Models
Perceval is a perception-centric PRM that detects token-level perceptual errors in VLMs, supporting token-advantage RL training and iterative test-time scaling for improved reasoning.
-
Artifact-Bench: Evaluating MLLMs on Detecting and Assessing the Artifacts of AI-Generated Videos
Artifact-Bench supplies a three-level artifact taxonomy and three evaluation tasks that show 19 MLLMs perform near or below random on AI-video realism detection and reasoning.
-
Source-Modality Monitoring in Vision-Language Models
Vision-language models use semantic signals more than syntactic ones to bind words like 'image' to actual visual inputs, with implications for robustness in multimodal systems.
-
Beyond Text-Dominance: Understanding Modality Preference of Omni-modal Large Language Models
Omni-modal LLMs exhibit visual preference that emerges in mid-to-late layers, enabling hallucination detection without task-specific training.
-
Starve to Perceive: Taming Lazy Perception in VLMs with Constrained Visual Bandwidth
Constraining visual token budget per observation during VLM training forces genuine active perception and delivers 5% average relative improvement without auxiliary losses or architecture changes.