VAPO: End-to-end Slide-Enhanced Speech Recognition with Omni-modal Large Language Models
Pith reviewed 2026-05-18 09:36 UTC · model grok-4.3
The pith
VAPO trains omni-modal models to first anchor on slide visuals then transcribe spoken audio, cutting visual interference.
A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.
Core claim
The central claim is that a temporally decoupled policy with separate think and answer blocks, optimized by multi-objective reinforcement learning, lets the model extract visual priors as semantic anchors in the think phase before generating the audio-based transcription in the answer phase, thereby eliminating the tendency to output unspoken slide content.
What carries the argument
The temporally decoupled policy inside Visually-Anchored Policy Optimization (VAPO), which separates visual prior extraction in a think block from audio-driven transcription in an answer block and tunes both via multi-objective reinforcement learning.
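As a rough illustration of what the decoupled output could look like to downstream code, the sketch below splits a model response into its visual-prior and transcription parts. The closing tags `</think>` and `</answer>`, the strict one-block-each layout, and the names `DecoupledOutput` and `parse_decoupled_output` are assumptions made for this example; the source only states that a <think> block precedes an <answer> block.

```python
import re
from typing import NamedTuple, Optional


class DecoupledOutput(NamedTuple):
    visual_priors: str   # content of the <think> block
    transcription: str   # content of the <answer> block


# Assumed layout: one <think>...</think> block followed by one <answer>...</answer> block.
_PATTERN = re.compile(
    r"^\s*<think>(?P<think>.*?)</think>\s*<answer>(?P<answer>.*?)</answer>\s*$",
    re.DOTALL,
)


def parse_decoupled_output(text: str) -> Optional[DecoupledOutput]:
    """Return both blocks, or None if the response does not follow the assumed layout."""
    match = _PATTERN.match(text)
    if match is None:
        return None
    return DecoupledOutput(
        visual_priors=match.group("think").strip(),
        transcription=match.group("answer").strip(),
    )


if __name__ == "__main__":
    response = "<think>VAPO, SlideASR-Bench</think>\n<answer>today we introduce vapo</answer>"
    print(parse_decoupled_output(response))
```

Returning `None` for malformed responses keeps the format check separate from any scoring of the transcription itself.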
If this is right
- Reduces entity recognition errors in specialized domains by preventing hallucination of slide content.
- Reaches state-of-the-art results on SlideASR-Bench and existing public speech datasets.
- Supports practical end-to-end slide-enhanced speech recognition without separate visual-text correction steps.
- Reshapes inference to prioritize auditory signals while still using visuals as anchors.
Where Pith is reading between the lines
- Similar decoupling could mitigate modality dominance in other tasks such as video captioning with overlaid text.
- The method suggests a general pattern for reducing hallucinations by enforcing explicit anchoring steps in multimodal models.
- Real-time implementations might improve live captioning accuracy during presentations with changing slides.
- The approach could extend to noisy environments or multilingual settings to test robustness beyond the current benchmark.
Load-bearing premise
That separating visual processing into its own think phase before audio transcription will reliably stop the model from defaulting to visible text without creating new biases or lowering overall performance.
What would settle it
If models trained with VAPO still output slide text that was never spoken in the audio on the real-world portion of SlideASR-Bench at rates comparable to the baseline, the central claim would be falsified.
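One way such a test could be operationalized is a per-utterance leak rate: the share of output words that are visible on the slide but absent from the spoken reference. This is a simple proxy written for illustration, not a metric taken from the paper; the function name `unspoken_slide_word_rate` and the lowercase word tokenization are assumptions.

```python
import re
from typing import List, Set


def _tokens(text: str) -> List[str]:
    # Lowercased alphanumeric word tokens; a deliberately simple normalization.
    return re.findall(r"[a-z0-9']+", text.lower())


def unspoken_slide_word_rate(hypothesis: str, reference: str, slide_text: str) -> float:
    """Fraction of hypothesis words that appear on the slide but not in the spoken reference."""
    hyp = _tokens(hypothesis)
    if not hyp:
        return 0.0
    spoken: Set[str] = set(_tokens(reference))
    on_slide: Set[str] = set(_tokens(slide_text))
    leaked = [word for word in hyp if word in on_slide and word not in spoken]
    return len(leaked) / len(hyp)


if __name__ == "__main__":
    rate = unspoken_slide_word_rate(
        hypothesis="we use transformer encoder layers",
        reference="we use encoder layers",
        slide_text="Transformer Encoder Architecture",
    )
    print(f"leak rate: {rate:.2f}")
```

Comparing the average of this rate between a baseline and a VAPO-trained model on the real-world test set would give a direct reading on the falsification condition above.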
Original abstract
Omni-modal large language models (OLLMs) offer a promising end-to-end solution for slide-enhanced speech recognition due to their inherent multimodal capabilities. However, we found a fundamental issue faced by OLLMs: Visual Interference, where models show a bias towards visible text over auditory signals, causing them to hallucinate slide content that was never spoken. To address this, we propose Visually-Anchored Policy Optimization (VAPO), which aims to reshape models' inference process to follow the human-like "Look-then-Listen" inference chain. Specifically, we design a temporally decoupled policy: the model first extracts visual priors in a <think> block to serve as semantic anchors, then generates the transcription in an <answer> block. The policy is optimized via multi-objective reinforcement learning. Furthermore, we introduce SlideASR-Bench, a comprehensive benchmark designed to address the scarcity of entity-rich data, comprising a large-scale synthetic corpus for training and a challenging real-world test set for evaluation. We conduct extensive evaluations demonstrating that VAPO effectively eliminates visual interference and achieves state-of-the-art performance on SlideASR-Bench and public datasets, significantly reducing entity recognition errors in specialized domains.
Editorial analysis
A structured set of objections, weighed in public.
Referee Report
Summary. The paper claims that omni-modal LLMs suffer from 'visual interference' (bias toward visible slide text over spoken audio, leading to hallucinations) in slide-enhanced speech recognition. It proposes Visually-Anchored Policy Optimization (VAPO), which enforces a temporally decoupled inference policy via separate <think> (visual prior extraction) and <answer> (transcription) blocks, trained with multi-objective reinforcement learning. The authors also introduce SlideASR-Bench (synthetic training corpus + real-world test set) and report that VAPO eliminates visual interference while achieving SOTA results on the new benchmark and public datasets, with large reductions in entity recognition errors.
Significance. If the causal link between the decoupled policy/RL training and interference elimination is substantiated with ablations and error analysis, the work would address a practical limitation in multimodal ASR for presentation-heavy domains and provide a reusable benchmark. The end-to-end framing and human-like 'Look-then-Listen' motivation are timely given the rise of OLLMs; however, the current evidence level leaves the magnitude of improvement and generality unclear.
major comments (3)
- [Abstract, §3] Abstract and §3 (method overview): the central claim that the temporally decoupled <think>/<answer> policy 'eliminates visual interference' is not isolated from other factors. No ablation is described that compares the full VAPO policy against (a) standard prompting with the same OLLM, (b) the same model trained only on additional data without the RL objectives, or (c) a coupled <think>+<answer> variant. Without these controls, gains on SlideASR-Bench could be explained by dataset fitting rather than the policy structure.
- [§4, Tables 2-3] §4 (experiments) and Table 2/3: the abstract states 'SOTA results' and 'significantly reducing entity recognition errors' but the provided description contains no quantitative metrics, WER/CER deltas, entity F1 scores, or statistical significance tests. The reader cannot assess whether the reported improvements exceed the variance of the baseline OLLM or prior slide-enhanced ASR systems.
- [§3.2] §3.2 (RL formulation): the multi-objective reinforcement learning objective is described at a high level but lacks the explicit reward functions, weighting coefficients, or policy gradient details needed to reproduce the 'reshaping' of inference. It is therefore impossible to verify whether the optimization specifically anchors on auditory signals rather than simply increasing context length.
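The significance check asked for in the second major comment could take the form of a paired bootstrap over test utterances. The sketch below computes corpus-level WER from word-level edit distance and estimates how often a new system fails to beat a baseline under resampling. All function names are illustrative, the paper's actual evaluation protocol is not described in the source, and every reference is assumed to be non-empty.

```python
import random
from typing import List, Sequence, Tuple


def edit_distance(ref: Sequence[str], hyp: Sequence[str]) -> int:
    """Word-level Levenshtein distance (substitutions, insertions, deletions)."""
    prev = list(range(len(hyp) + 1))
    for i, r in enumerate(ref, start=1):
        curr = [i]
        for j, h in enumerate(hyp, start=1):
            curr.append(min(
                prev[j] + 1,               # deletion of a reference word
                curr[j - 1] + 1,           # insertion of a hypothesis word
                prev[j - 1] + (r != h),    # substitution, or match at zero cost
            ))
        prev = curr
    return prev[-1]


def corpus_wer(pairs: List[Tuple[str, str]]) -> float:
    """Total word errors divided by total reference words over (reference, hypothesis) pairs."""
    errors = sum(edit_distance(r.split(), h.split()) for r, h in pairs)
    words = sum(len(r.split()) for r, _ in pairs)
    return errors / words


def paired_bootstrap_p(refs: List[str], base_hyps: List[str], new_hyps: List[str],
                       n_resamples: int = 1000, seed: int = 0) -> float:
    """Share of resamples in which the new system is not better than the baseline."""
    rng = random.Random(seed)
    n = len(refs)
    not_better = 0
    for _ in range(n_resamples):
        idx = [rng.randrange(n) for _ in range(n)]  # resample utterances with replacement
        base = corpus_wer([(refs[i], base_hyps[i]) for i in idx])
        new = corpus_wer([(refs[i], new_hyps[i]) for i in idx])
        if new >= base:
            not_better += 1
    return not_better / n_resamples


if __name__ == "__main__":
    refs = ["a b c", "d e f"]
    print(paired_bootstrap_p(refs, ["a x c", "d e"], ["a b c", "d e f"], n_resamples=200))
```

Resampling the same utterance indices for both systems is what makes the test paired; it removes utterance difficulty as a source of variance in the comparison.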
minor comments (2)
- [§2] The term 'Visual Interference' is introduced as a new phenomenon; a short related-work paragraph contrasting it with known multimodal hallucination issues in vision-language models would help readers situate the contribution.
- [§4.1] SlideASR-Bench construction details (how synthetic slides are generated, how real-world test utterances were collected and annotated) are referenced but not fully specified; adding these to the appendix would improve reproducibility.
Simulated Author's Rebuttal
We thank the referee for the constructive and detailed feedback on our manuscript. The comments highlight important areas for strengthening the evidence and reproducibility of VAPO. We address each major comment below and will revise the paper accordingly.
Point-by-point responses
- Referee: [Abstract, §3] Abstract and §3 (method overview): the central claim that the temporally decoupled <think>/<answer> policy 'eliminates visual interference' is not isolated from other factors. No ablation is described that compares the full VAPO policy against (a) standard prompting with the same OLLM, (b) the same model trained only on additional data without the RL objectives, or (c) a coupled <think>+<answer> variant. Without these controls, gains on SlideASR-Bench could be explained by dataset fitting rather than the policy structure.
  Authors: We agree that isolating the contribution of the temporally decoupled policy and multi-objective RL is essential to substantiate the causal claims. In the revised manuscript, we will add a dedicated ablation study in §4 comparing VAPO against (a) the base OLLM with standard prompting, (b) supervised fine-tuning on SlideASR-Bench data without the RL objectives, and (c) a coupled <think>+<answer> variant. These results, along with error analysis on visual hallucinations, will be included to demonstrate that improvements arise from the proposed policy rather than data alone.
  revision: yes
- Referee: [§4, Tables 2-3] §4 (experiments) and Table 2/3: the abstract states 'SOTA results' and 'significantly reducing entity recognition errors' but the provided description contains no quantitative metrics, WER/CER deltas, entity F1 scores, or statistical significance tests. The reader cannot assess whether the reported improvements exceed the variance of the baseline OLLM or prior slide-enhanced ASR systems.
  Authors: We thank the referee for noting this presentation issue. The full manuscript contains Tables 2 and 3 with WER, CER, and entity metrics, but we will revise §4 to explicitly report deltas versus baselines, include statistical significance tests (e.g., bootstrap p-values), and add a focused error analysis on entity recognition errors to quantify the reductions and confirm they exceed baseline variance.
  revision: yes
- Referee: [§3.2] §3.2 (RL formulation): the multi-objective reinforcement learning objective is described at a high level but lacks the explicit reward functions, weighting coefficients, or policy gradient details needed to reproduce the 'reshaping' of inference. It is therefore impossible to verify whether the optimization specifically anchors on auditory signals rather than simply increasing context length.
  Authors: We acknowledge the need for greater detail to ensure reproducibility. In the revised §3.2, we will explicitly define the reward functions (transcription accuracy and visual-anchoring terms), provide the weighting coefficients, and detail the policy gradient method (including how the decoupled <think>/<answer> structure is optimized to prioritize auditory signals over visual bias).
  revision: yes
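To make concrete what "reward functions plus weighting coefficients" might look like, the sketch below combines four per-objective rewards by linear scalarization. The four reward names follow the paper passage quoted later on this page (Format, OCR, ASR, Visual Anchoring); everything else is hypothetical, including the weights, the assumption that each reward lies in [0, 1], and the choice of a weighted sum, which is one common way to combine objectives and is not confirmed as the paper's method.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class RewardWeights:
    # Hypothetical coefficients; the source does not report the actual values.
    format: float = 0.1
    ocr: float = 0.2
    asr: float = 0.5
    visual_anchoring: float = 0.2


def combined_reward(format_r: float, ocr_r: float, asr_r: float, anchoring_r: float,
                    weights: RewardWeights = RewardWeights()) -> float:
    """Weighted sum of per-objective rewards, each assumed to lie in [0, 1]."""
    return (
        weights.format * format_r
        + weights.ocr * ocr_r
        + weights.asr * asr_r
        + weights.visual_anchoring * anchoring_r
    )


if __name__ == "__main__":
    # A response with valid format and a half-correct transcription, but no usable visual priors.
    print(combined_reward(format_r=1.0, ocr_r=0.0, asr_r=0.5, anchoring_r=0.0))
```

Reporting the weights explicitly, as the referee requests, would let readers check how strongly the transcription term dominates the visual terms.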
Circularity Check
No circularity: new method and benchmark introduced with independent empirical support
Full rationale
The paper defines VAPO as a novel temporally decoupled <think>/<answer> policy trained with multi-objective RL to counter visual interference, and introduces SlideASR-Bench as a new synthetic-plus-real evaluation resource. These constructs and the reported SOTA gains on entity recognition are presented as empirical outcomes from training and testing, not as quantities that reduce by definition or fitting to prior inputs within the paper. No self-definitional loops, fitted parameters renamed as predictions, or load-bearing self-citations appear in the derivation chain; the central claim rests on external benchmark results rather than internal tautology.
Axiom & Free-Parameter Ledger
axioms (1)
- domain assumption: Multi-objective reinforcement learning can successfully optimize a temporally decoupled think-then-answer policy for multimodal inference.
invented entities (1)
- Visual Interference: no independent evidence
Lean theorems connected to this paper
- IndisputableMonolith/Foundation/ArrowOfTime.lean · arrow_from_z
  Relation between the paper passage and the cited Recognition theorem: unclear
  Paper passage: temporally decoupled policy: the model first extracts visual priors in a <think> block ... then generates the transcription in an <answer> block. The policy is optimized via multi-objective reinforcement learning.
- IndisputableMonolith/Foundation/BranchSelection.lean · branch_selection
  Relation between the paper passage and the cited Recognition theorem: unclear
  Paper passage: four distinct reward functions: Format Reward, OCR Reward, ASR Reward, Visual Anchoring Reward
What do these tags mean?
- matches: The paper's claim is directly supported by a theorem in the formal canon.
- supports: The theorem supports part of the paper's argument, but the paper may add assumptions or extra steps.
- extends: The paper goes beyond the formal theorem; the theorem is a base layer rather than the whole result.
- uses: The paper appears to rely on the theorem as machinery.
- contradicts: The paper's claim conflicts with a theorem or certificate in the canon.
- unclear: Pith found a possible connection, but the passage is too broad, indirect, or ambiguous to say the theorem truly supports the claim.