VAPO: End-to-end Slide-Enhanced Speech Recognition with Omni-modal Large Language Models

Delai Qiu; Jitao Sang; Rui Hu; Shengping Liu; Yining Wang

arXiv: 2510.08618 · v2 · submitted 2025-10-08 · 📡 eess.AS · cs.CV · cs.SD

Rui Hu, Delai Qiu, Yining Wang, Shengping Liu, Jitao Sang

Pith reviewed 2026-05-18 09:36 UTC · model grok-4.3

classification 📡 eess.AS · cs.CV · cs.SD

keywords slide-enhanced speech recognition · visual interference · omni-modal models · policy optimization · reinforcement learning · entity recognition · multimodal transcription · benchmark

The pith

VAPO trains omni-modal models to first anchor on slide visuals then transcribe spoken audio, cutting visual interference.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper establishes that omni-modal large language models exhibit visual interference in slide-enhanced speech recognition by favoring visible slide text over auditory signals and hallucinating content that was never spoken. It introduces Visually-Anchored Policy Optimization to reshape the inference process into a look-then-listen sequence using a temporally decoupled policy. If correct, this would produce more accurate end-to-end transcriptions in lectures or meetings where slides accompany speech, with particular gains on technical entities and domain terms. The approach also supplies a new benchmark with synthetic training data and real-world test cases to measure progress on this bias.

Core claim

The central claim is that a temporally decoupled policy with separate think and answer blocks, optimized by multi-objective reinforcement learning, lets the model extract visual priors as semantic anchors in the think phase before generating the audio-based transcription in the answer phase, thereby eliminating the tendency to output unspoken slide content.

What carries the argument

The temporally decoupled policy inside Visually-Anchored Policy Optimization (VAPO), which separates visual prior extraction in a think block from audio-driven transcription in an answer block and tunes both via multi-objective reinforcement learning.
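
As a minimal sketch of what the two-block output could look like in practice, the snippet below parses a model response into its visual-prior and transcription parts and scores whether the layout is respected. The closing tags `</think>` and `</answer>`, the names `parse_decoupled_output` and `format_reward`, and the binary scoring are all assumptions made for illustration; the paper's actual format check is not specified in the material summarized here.

```python
import re
from typing import NamedTuple, Optional

# Assumed layout: one <think> block followed by one <answer> block.
# The closing tags are an assumption; the source only names the blocks.
_LAYOUT = re.compile(
    r"^\s*<think>(?P<think>.*?)</think>\s*<answer>(?P<answer>.*?)</answer>\s*$",
    re.DOTALL,
)


class DecoupledOutput(NamedTuple):
    visual_priors: str   # text extracted from the slide in the think phase
    transcription: str   # audio-driven transcript from the answer phase


def parse_decoupled_output(text: str) -> Optional[DecoupledOutput]:
    """Split a response into its two phases, or return None if the layout is off."""
    match = _LAYOUT.match(text)
    if match is None:
        return None
    return DecoupledOutput(
        visual_priors=match.group("think").strip(),
        transcription=match.group("answer").strip(),
    )


def format_reward(text: str) -> float:
    """Hypothetical binary score: 1.0 if the two-block layout parses, else 0.0."""
    return 1.0 if parse_decoupled_output(text) is not None else 0.0
```

Keeping the transcription in its own field means downstream scoring can be applied to the answer block alone, so text that appears only in the think block is not counted as spoken.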

If this is right

Reduces entity recognition errors in specialized domains by preventing hallucination of slide content.
Reaches state-of-the-art results on SlideASR-Bench and existing public speech datasets.
Supports practical end-to-end slide-enhanced speech recognition without separate visual-text correction steps.
Reshapes inference to prioritize auditory signals while still using visuals as anchors.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the authors make directly.

Similar decoupling could mitigate modality dominance in other tasks such as video captioning with overlaid text.
The method suggests a general pattern for reducing hallucinations by enforcing explicit anchoring steps in multimodal models.
Real-time implementations might improve live captioning accuracy during presentations with changing slides.
The approach could extend to noisy environments or multilingual settings to test robustness beyond the current benchmark.

Load-bearing premise

That separating visual processing into its own think phase before audio transcription will reliably stop the model from defaulting to visible text without creating new biases or lowering overall performance.

What would settle it

If models trained with VAPO still output slide text that was never spoken in the audio on the real-world portion of SlideASR-Bench at rates comparable to the baseline, the central claim would be falsified.
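
One rough way to operationalize this test is to count how often a hypothesis contains tokens that appear on the slide but not in the reference transcript. The sketch below assumes each test sample provides slide text, a reference transcript, and a model hypothesis; the tokenizer, the function names, and the per-sample flagging rule are illustrative choices, not the benchmark's own metric.

```python
import re
from typing import Iterable, List, Set, Tuple


def tokenize(text: str) -> List[str]:
    """Lowercase word tokens; keeps internal hyphens and apostrophes."""
    return re.findall(r"[a-z0-9]+(?:[-'][a-z0-9]+)*", text.lower())


def unspoken_slide_tokens(slide_text: str, reference: str, hypothesis: str) -> Set[str]:
    """Tokens that are on the slide, absent from the reference, yet present in the hypothesis."""
    slide_only = set(tokenize(slide_text)) - set(tokenize(reference))
    return slide_only & set(tokenize(hypothesis))


def interference_rate(samples: Iterable[Tuple[str, str, str]]) -> float:
    """Fraction of (slide, reference, hypothesis) samples with at least one unspoken slide token."""
    samples = list(samples)
    if not samples:
        return 0.0
    flagged = sum(1 for slide, ref, hyp in samples if unspoken_slide_tokens(slide, ref, hyp))
    return flagged / len(samples)
```

Comparing this rate between a baseline model and a VAPO-trained model on the same samples would give a direct reading of the falsification condition, with the caveat that token overlap can also flag ordinary misrecognitions that happen to match slide words.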

Original abstract

Omni-modal large language models (OLLMs) offer a promising end-to-end solution for slide-enhanced speech recognition due to their inherent multimodal capabilities. However, we found a fundamental issue faced by OLLMs: Visual Interference, where models show a bias towards visible text over auditory signals, causing them to hallucinate slide content that was never spoken. To address this, we propose Visually-Anchored Policy Optimization (VAPO), which aims to reshape models' inference process to follow the human-like "Look-then-Listen" inference chain. Specifically, we design a temporally decoupled policy: the model first extracts visual priors in a <think> block to serve as semantic anchors, then generates the transcription in an <answer> block. The policy is optimized via multi-objective reinforcement learning. Furthermore, we introduce SlideASR-Bench, a comprehensive benchmark designed to address the scarcity of entity-rich data, comprising a large-scale synthetic corpus for training and a challenging real-world test set for evaluation. We conduct extensive evaluations demonstrating that VAPO effectively eliminates visual interference and achieves state-of-the-art performance on SlideASR-Bench and public datasets, significantly reducing entity recognition errors in specialized domains.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Desk Editor's Note: private letter to a colleague

VAPO uses RL on a split think-answer policy to fix visual bias in slide speech recognition, with the main uncertainty around whether the policy design is the real driver of improvements.

The one thing to know: the authors tackle visual interference in omni-modal large language models by training a policy that splits inference into a think step for the slide visuals and an answer step for the transcription, optimized with multi-objective reinforcement learning. They also release SlideASR-Bench to support this line of work.

The paper's strengths are its framing of the problem as a human-like inference chain and a benchmark that pairs a large-scale synthetic training corpus with a challenging real-world test set. This addresses the scarcity of entity-rich data in slide-enhanced speech recognition, a practical need for applications such as lecture transcription.

The soft spot is attribution. It is not yet clear whether the temporally decoupled policy itself eliminates the interference, or whether the improvements come from other factors such as the training data or the RL objectives in general. The paper would be stronger with ablations that test the policy structure against baselines that use similar compute but different designs. State-of-the-art performance on public datasets is claimed, but the size of the entity recognition improvement needs close examination to judge whether the fix is robust.

This paper is for researchers in audio-visual multimodal learning and speech recognition systems. A reader developing end-to-end models for professional or educational content would get value from the benchmark and the policy optimization approach. It deserves a serious referee because the problem is well motivated and the proposal is concrete enough to evaluate properly.

Referee Report

3 major / 2 minor

Summary. The paper claims that omni-modal LLMs suffer from 'visual interference' (bias toward visible slide text over spoken audio, leading to hallucinations) in slide-enhanced speech recognition. It proposes Visually-Anchored Policy Optimization (VAPO), which enforces a temporally decoupled inference policy via separate <think> (visual prior extraction) and <answer> (transcription) blocks, trained with multi-objective reinforcement learning. The authors also introduce SlideASR-Bench (synthetic training corpus + real-world test set) and report that VAPO eliminates visual interference while achieving SOTA results on the new benchmark and public datasets, with large reductions in entity recognition errors.

Significance. If the causal link between the decoupled policy/RL training and interference elimination is substantiated with ablations and error analysis, the work would address a practical limitation in multimodal ASR for presentation-heavy domains and provide a reusable benchmark. The end-to-end framing and human-like 'Look-then-Listen' motivation are timely given the rise of OLLMs; however, the current evidence level leaves the magnitude of improvement and generality unclear.

major comments (3)

[Abstract, §3] Abstract and §3 (method overview): the central claim that the temporally decoupled <think>/<answer> policy 'eliminates visual interference' is not isolated from other factors. No ablation is described that compares the full VAPO policy against (a) standard prompting with the same OLLM, (b) the same model trained only on additional data without the RL objectives, or (c) a coupled <think>+<answer> variant. Without these controls, gains on SlideASR-Bench could be explained by dataset fitting rather than the policy structure.
[§4, Tables 2-3] §4 (experiments) and Table 2/3: the abstract states 'SOTA results' and 'significantly reducing entity recognition errors' but the provided description contains no quantitative metrics, WER/CER deltas, entity F1 scores, or statistical significance tests. The reader cannot assess whether the reported improvements exceed the variance of the baseline OLLM or prior slide-enhanced ASR systems.
[§3.2] §3.2 (RL formulation): the multi-objective reinforcement learning objective is described at a high level but lacks the explicit reward functions, weighting coefficients, or policy gradient details needed to reproduce the 'reshaping' of inference. It is therefore impossible to verify whether the optimization specifically anchors on auditory signals rather than simply increasing context length.
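
The quantitative evidence requested in the second comment could be produced along the following lines. This is a minimal sketch: it assumes whitespace-tokenized transcripts and two systems scored on the same utterances, and the function names, resample count, and one-sided comparison are illustrative rather than the paper's evaluation protocol.

```python
import random
from typing import List, Sequence, Tuple


def edit_distance(ref: Sequence[str], hyp: Sequence[str]) -> int:
    """Word-level Levenshtein distance (substitutions, insertions, deletions)."""
    prev = list(range(len(hyp) + 1))
    for i, ref_tok in enumerate(ref, start=1):
        curr = [i]
        for j, hyp_tok in enumerate(hyp, start=1):
            curr.append(min(
                prev[j] + 1,                         # deletion
                curr[j - 1] + 1,                     # insertion
                prev[j - 1] + (ref_tok != hyp_tok),  # substitution or match
            ))
        prev = curr
    return prev[-1]


def corpus_wer(pairs: List[Tuple[str, str]]) -> float:
    """Corpus-level WER over (reference, hypothesis) pairs."""
    errors = sum(edit_distance(r.split(), h.split()) for r, h in pairs)
    words = sum(len(r.split()) for r, _ in pairs)
    return errors / words if words else 0.0


def paired_bootstrap(refs: List[str], hyps_a: List[str], hyps_b: List[str],
                     n_resamples: int = 1000, seed: int = 0) -> float:
    """Fraction of resamples in which system B makes at least as many errors as A.

    Small values suggest B's WER advantage over A is unlikely to be sampling noise.
    Both systems share the same resampled references, so comparing error counts
    is equivalent to comparing WER.
    """
    rng = random.Random(seed)
    err_a = [edit_distance(r.split(), h.split()) for r, h in zip(refs, hyps_a)]
    err_b = [edit_distance(r.split(), h.split()) for r, h in zip(refs, hyps_b)]
    n = len(refs)
    not_better = 0
    for _ in range(n_resamples):
        idx = [rng.randrange(n) for _ in range(n)]
        if sum(err_b[i] for i in idx) >= sum(err_a[i] for i in idx):
            not_better += 1
    return not_better / n_resamples
```

Resampling utterances rather than words keeps each utterance's errors together, which matters when errors cluster in a few entity-heavy segments.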

minor comments (2)

[§2] The term 'Visual Interference' is introduced as a new phenomenon; a short related-work paragraph contrasting it with known multimodal hallucination issues in vision-language models would help readers situate the contribution.
[§4.1] SlideASR-Bench construction details (how synthetic slides are generated, how real-world test utterances were collected and annotated) are referenced but not fully specified; adding these to the appendix would improve reproducibility.

Simulated Author's Rebuttal

3 responses · 0 unresolved

We thank the referee for the constructive and detailed feedback on our manuscript. The comments highlight important areas for strengthening the evidence and reproducibility of VAPO. We address each major comment below and will revise the paper accordingly.

Referee: [Abstract, §3] Abstract and §3 (method overview): the central claim that the temporally decoupled <think>/<answer> policy 'eliminates visual interference' is not isolated from other factors. No ablation is described that compares the full VAPO policy against (a) standard prompting with the same OLLM, (b) the same model trained only on additional data without the RL objectives, or (c) a coupled <think>+<answer> variant. Without these controls, gains on SlideASR-Bench could be explained by dataset fitting rather than the policy structure.

Authors: We agree that isolating the contribution of the temporally decoupled policy and multi-objective RL is essential to substantiate the causal claims. In the revised manuscript, we will add a dedicated ablation study in §4 comparing VAPO against (a) the base OLLM with standard prompting, (b) supervised fine-tuning on SlideASR-Bench data without the RL objectives, and (c) a coupled <think>+<answer> variant. These results, along with error analysis on visual hallucinations, will be included to demonstrate that improvements arise from the proposed policy rather than data alone. (revision: yes)
Referee: [§4, Tables 2-3] §4 (experiments) and Table 2/3: the abstract states 'SOTA results' and 'significantly reducing entity recognition errors' but the provided description contains no quantitative metrics, WER/CER deltas, entity F1 scores, or statistical significance tests. The reader cannot assess whether the reported improvements exceed the variance of the baseline OLLM or prior slide-enhanced ASR systems.

Authors: We thank the referee for noting this presentation issue. The full manuscript contains Tables 2 and 3 with WER, CER, and entity metrics, but we will revise §4 to explicitly report deltas versus baselines, include statistical significance tests (e.g., bootstrap p-values), and add a focused error analysis on entity recognition errors to quantify the reductions and confirm they exceed baseline variance. (revision: yes)
Referee: [§3.2] §3.2 (RL formulation): the multi-objective reinforcement learning objective is described at a high level but lacks the explicit reward functions, weighting coefficients, or policy gradient details needed to reproduce the 'reshaping' of inference. It is therefore impossible to verify whether the optimization specifically anchors on auditory signals rather than simply increasing context length.

Authors: We acknowledge the need for greater detail to ensure reproducibility. In the revised §3.2, we will explicitly define the reward functions (transcription accuracy and visual-anchoring terms), provide the weighting coefficients, and detail the policy gradient method (including how the decoupled <think>/<answer> structure is optimized to prioritize auditory signals over visual bias). (revision: yes)
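
To make the third exchange concrete, one common way to turn several reward terms into a single training signal is a weighted average (linear scalarization). The sketch below uses the four reward names quoted elsewhere on this page (Format, OCR, ASR, Visual Anchoring) as dictionary keys; the key spellings, the `scalarize` function, the weight values, and the choice of linear scalarization are assumptions, since the paper's actual reward definitions and coefficients are exactly what the referee says are missing.

```python
from typing import Mapping

# Keys are illustrative spellings of the four reward names quoted on this page.
REWARD_NAMES = ("format", "ocr", "asr", "visual_anchoring")


def scalarize(rewards: Mapping[str, float], weights: Mapping[str, float]) -> float:
    """Weighted average of the reward terms (linear scalarization, an assumed choice)."""
    missing = [n for n in REWARD_NAMES if n not in rewards or n not in weights]
    if missing:
        raise KeyError(f"missing reward or weight for: {missing}")
    total_weight = sum(weights[n] for n in REWARD_NAMES)
    if total_weight <= 0:
        raise ValueError("weights must sum to a positive value")
    return sum(weights[n] * rewards[n] for n in REWARD_NAMES) / total_weight


# Hypothetical equal weights; the paper's coefficients are not given here.
EQUAL_WEIGHTS = {name: 1.0 for name in REWARD_NAMES}
```

Publishing the weights alongside each component's definition would let a reader check whether the transcription term dominates the signal, which is the crux of the referee's question about anchoring on audio.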

Circularity Check

0 steps flagged

No circularity: new method and benchmark introduced with independent empirical support

The paper defines VAPO as a novel temporally decoupled <think>/<answer> policy trained with multi-objective RL to counter visual interference, and introduces SlideASR-Bench as a new synthetic-plus-real evaluation resource. These constructs and the reported SOTA gains on entity recognition are presented as empirical outcomes from training and testing, not as quantities that reduce by definition or fitting to prior inputs within the paper. No self-definitional loops, fitted parameters renamed as predictions, or load-bearing self-citations appear in the derivation chain; the central claim rests on external benchmark results rather than internal tautology.

Axiom & Free-Parameter Ledger

0 free parameters · 1 axiom · 1 invented entity

The central claim rests on the effectiveness of the new policy optimization and benchmark; only limited detail is available from the abstract alone.

axioms (1)

domain assumption: Multi-objective reinforcement learning can successfully optimize a temporally decoupled think-then-answer policy for multimodal inference.
Invoked in the description of how VAPO is trained.

invented entities (1)

Visual Interference (no independent evidence)
purpose: Describes the bias of OLLMs toward visible slide text over auditory signals leading to hallucinations.
Newly identified phenomenon used to motivate the method.

Lean theorems connected to this paper

Citations machine-checked in the Pith Canon. Every link opens the source theorem in the public Lean library.

IndisputableMonolith/Foundation/ArrowOfTime.lean · arrow_from_z · unclear
Relation between the paper passage and the cited Recognition theorem: unclear.
Paper passage: temporally decoupled policy: the model first extracts visual priors in a <think> block ... then generates the transcription in an <answer> block. The policy is optimized via multi-objective reinforcement learning.

IndisputableMonolith/Foundation/BranchSelection.lean · branch_selection · unclear
Relation between the paper passage and the cited Recognition theorem: unclear.
Paper passage: four distinct reward functions: Format Reward, OCR Reward, ASR Reward, Visual Anchoring Reward

What do these tags mean?

matches: The paper's claim is directly supported by a theorem in the formal canon.
supports: The theorem supports part of the paper's argument, but the paper may add assumptions or extra steps.
extends: The paper goes beyond the formal theorem; the theorem is a base layer rather than the whole result.
uses: The paper appears to rely on the theorem as machinery.
contradicts: The paper's claim conflicts with a theorem or certificate in the canon.
unclear: Pith found a possible connection, but the passage is too broad, indirect, or ambiguous to say the theorem truly supports the claim.