Diffusion Large Language Models for Visual Speech Recognition
Pith reviewed 2026-06-29 12:41 UTC · model grok-4.3
The pith
A diffusion large language model for visual speech recognition reaches 19.5 percent WER on LRS3 by decoding in flexible order.
A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.
Core claim
DLLM-VSR formulates transcription as iterative masked denoising with flexible-order decoding and confidence-based unmasking. A two-stage masked-denoising training strategy separates visual-to-text content alignment from length modeling. Length-guided candidate decoding constructs plausible transcript-length hypotheses from video duration, decodes under multiple hypotheses, and reranks the candidates, yielding 19.5 percent WER on LRS3.
What carries the argument
Iterative masked denoising inside a diffusion large language model, with confidence-based unmasking and length-guided candidate decoding from video duration.
If this is right
- Flexible-order decoding supplies committed tokens as bidirectional context before low-confidence positions are resolved.
- Separating content alignment from length modeling in the two-stage training reduces interference between the two tasks.
- Length-guided candidate decoding and reranking measurably narrows the performance difference to oracle-length transcripts.
- The same framework reaches state-of-the-art accuracy on LRS3 while using only the labeled portion of the training set.
Where Pith is reading between the lines
- The approach may extend to other visually grounded sequence tasks where early commitment of reliable tokens can improve later disambiguation.
- If length information from duration proves reliable across datasets, similar guidance could be tested in audio-only or text-only diffusion models.
- The separation of alignment and length stages suggests a general recipe for adapting diffusion language models to tasks with variable output lengths.
Load-bearing premise
Video duration supplies transcript-length hypotheses accurate enough for candidate decoding and reranking to close most of the gap to oracle-length performance without adding new selection errors.
What would settle it
An experiment in which length hypotheses drawn from video duration produce no reduction in the gap to oracle-length WER or increase overall word error rate on LRS3.
Figures
read the original abstract
Existing Visual Speech Recognition (VSR) systems commonly rely on left-to-right autoregressive decoding, which can force premature decisions on visually ambiguous tokens before sufficient context is available. We propose DLLM-VSR, to the best of our knowledge, the first Diffusion Large Language Model (DLLM)-based VSR framework, formulating transcription as iterative masked denoising with flexible-order decoding. With confidence-based unmasking, DLLM-VSR commits high-confidence positions early and uses the committed tokens as bidirectional context to refine ambiguous ones. To adapt DLLMs to VSR, we introduce a two-stage masked-denoising training strategy that separates visual-to-text content alignment from length modeling. We further observe a performance gap with oracle-length decoding, which assumes access to the true transcript length, indicating that reducing target-length uncertainty can improve DLLM-based VSR. To reduce this gap, we develop length-guided candidate decoding, which uses video duration to construct plausible transcript-length hypotheses, decodes under multiple hypotheses, and reranks candidates using length plausibility and decoding confidence. The proposed method achieves a state-of-the-art WER of 19.5\% on LRS3 using only its labeled training data.
Editorial analysis
A structured set of objections, weighed in public.
Referee Report
Summary. The manuscript proposes DLLM-VSR, the first diffusion large language model framework for visual speech recognition. Transcription is formulated as iterative masked denoising with flexible-order decoding and confidence-based unmasking to provide bidirectional context. A two-stage masked-denoising training strategy separates visual-to-text content alignment from length modeling. Length-guided candidate decoding uses video duration to generate plausible transcript-length hypotheses, performs multiple decodes, and reranks candidates by length plausibility and confidence. The central empirical claim is a state-of-the-art WER of 19.5% on LRS3 using only the dataset's labeled training data.
Significance. If the result holds under rigorous validation, the work would establish diffusion LLMs as a viable alternative to autoregressive decoding for VSR by enabling non-left-to-right refinement of ambiguous tokens. The reported SOTA on a standard benchmark without external data would be a notable empirical contribution, particularly if the two-stage training and length-guided inference prove robust.
major comments (3)
- [Abstract] Abstract: The 19.5% WER SOTA claim rests on length-guided candidate decoding closing most of the oracle-length gap, yet no quantitative breakdown is given of (a) the accuracy of video-duration to transcript-length mapping, (b) the fraction of cases where reranking selects a worse hypothesis, or (c) the net WER change when the length hypothesis is incorrect. This is load-bearing for the central performance claim.
- [Abstract] Abstract: The reported result is presented without error bars, number of evaluation runs, or statistical significance tests against prior methods, so it is impossible to determine whether 19.5% constitutes a reliable improvement or could be within run-to-run variance.
- [Abstract] The two-stage training is described as separating content alignment from length modeling, but no ablation is referenced that isolates the contribution of the second stage or quantifies how much of the final WER depends on the inference-time length heuristic versus the learned model.
minor comments (1)
- [Abstract] Notation for the masked-denoising process and the precise form of the length plausibility score used in reranking should be formalized with equations rather than prose description.
Simulated Author's Rebuttal
We thank the referee for the constructive comments. We address each major comment below and will revise the manuscript to incorporate the requested analyses.
read point-by-point responses
-
Referee: [Abstract] Abstract: The 19.5% WER SOTA claim rests on length-guided candidate decoding closing most of the oracle-length gap, yet no quantitative breakdown is given of (a) the accuracy of video-duration to transcript-length mapping, (b) the fraction of cases where reranking selects a worse hypothesis, or (c) the net WER change when the length hypothesis is incorrect. This is load-bearing for the central performance claim.
Authors: We agree that a quantitative breakdown would strengthen the central claim. The manuscript notes the oracle-length gap but does not provide the requested details on mapping accuracy, reranking failure rate, or WER impact of incorrect lengths. In the revision we will add this analysis to the experiments section. revision: yes
-
Referee: [Abstract] Abstract: The reported result is presented without error bars, number of evaluation runs, or statistical significance tests against prior methods, so it is impossible to determine whether 19.5% constitutes a reliable improvement or could be within run-to-run variance.
Authors: We acknowledge the absence of variability measures and significance tests in the current reporting. The 19.5% result reflects our primary evaluation. In the revision we will run multiple evaluations, report error bars, and include statistical significance tests against prior methods. revision: yes
-
Referee: [Abstract] The two-stage training is described as separating content alignment from length modeling, but no ablation is referenced that isolates the contribution of the second stage or quantifies how much of the final WER depends on the inference-time length heuristic versus the learned model.
Authors: We agree an ablation would clarify the contributions of each component. The manuscript motivates the two-stage strategy but does not include the requested ablation. We will add experiments isolating the second stage and the inference-time length heuristic in the revised manuscript. revision: yes
Circularity Check
No derivation chain present; result is empirical benchmark
full rationale
The paper presents a new VSR framework (DLLM-VSR) with two-stage masked-denoising training and length-guided candidate decoding at inference. No equations, first-principles derivations, or parameter-fitting steps are described that could reduce to self-definition or fitted-input-as-prediction. The central claim (19.5% WER on LRS3) is obtained by standard supervised training on the labeled LRS3 split followed by evaluation; length hypotheses are constructed from video duration as an explicit heuristic, not derived from any internal model equation. No self-citations are invoked as load-bearing uniqueness theorems. The method is therefore self-contained against external benchmarks with no circular reduction.
Axiom & Free-Parameter Ledger
axioms (1)
- domain assumption Visual speech recognition can be effectively formulated as iterative masked denoising over text tokens conditioned on video features.
Reference graph
Works this paper leans on
-
[1]
Bowen Shi, Wei-Ning Hsu, Kushal Lakhotia, and Ab- delrahman Mohamed
Transformer-based video front-ends for audio- visual speech recognition for single and multi-person video.arXiv preprint arXiv:2201.10439. Bowen Shi, Wei-Ning Hsu, Kushal Lakhotia, and Ab- delrahman Mohamed. 2022. Learning audio-visual speech representation by masked multimodal cluster prediction.arXiv preprint arXiv:2201.02184. Themos Stafylakis and Geor...
-
[2]
Diffusion llm with native variable genera- tion lengths: Let [eos] lead the way.arXiv preprint arXiv:2510.24605. Jiacheng Ye, Zhihui Xie, Lin Zheng, Jiahui Gao, Zirui Wu, Xin Jiang, Zhenguo Li, and Lingpeng Kong
-
[3]
Dream 7B: Diffusion Large Language Models
Dream 7b: Diffusion large language models. arXiv preprint arXiv:2508.15487. Jeong Hun Yeo, Minsu Kim, Jeongsoo Choi, Dae Hoe Kim, and Yong Man Ro. 2024a. Akvsr: Audio knowledge empowered visual speech recognition by compressing audio knowledge of a pretrained model. IEEE Transactions on Multimedia, 26:6462–6474. Jeong Hun Yeo, Minsu Kim, Shinji Watanabe, ...
work page internal anchor Pith review Pith/arXiv arXiv 2024
-
[4]
We decode candidate lengths within Kpred ±5 , resulting in up to 11 candidates. During denois- ing, positions whose confidence exceeds 0.9 are committed; if no position exceeds the threshold, the most confident position is committed. For length-guided decoding, all length candidates are batched and decoded in parallel. The final tran- script is selected u...
discussion (0)
Sign in with ORCID, Apple, or X to comment. Anyone can read and Pith papers without signing in.