Re-evaluating Minimum Bayes Risk Decoding for Automatic Speech Recognition

Yuu Jinnai

arxiv: 2510.19471 · v2 · pith:6LPV6YWLnew · submitted 2025-10-22 · 💻 cs.CL · cs.LG · eess.AS

Pith reviewed 2026-05-18 04:58 UTC · model grok-4.3

classification 💻 cs.CL · cs.LG · eess.AS

keywords minimum bayes risk decoding · automatic speech recognition · beam search · whisper · speech translation · decoding algorithms

The pith

Minimum Bayes Risk decoding outperforms beam search in accuracy for most ASR and speech translation experiments.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper examines whether Minimum Bayes Risk decoding, already shown to help in text generation, can also raise accuracy in speech-to-text tasks such as automatic speech recognition. The author tests it with Whisper and its derivative models on English and Japanese data for both ASR and speech translation. In most of the settings evaluated, MBR produced more accurate output than beam search, the standard decoding method. This points to MBR as a practical option when offline processing allows trading speed for accuracy.

Core claim

Sample-based Minimum Bayes Risk decoding outperforms beam search decoding in accuracy for ASR and ST tasks using Whisper models on English and Japanese datasets in most experimental settings evaluated. The results indicate that MBR decoding is a promising method for offline ASR and ST tasks that require high accuracy.

What carries the argument

Minimum Bayes Risk (MBR) decoding, which picks the hypothesis that minimizes expected risk measured by a loss function over a finite set of sampled hypotheses.
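
As a minimal sketch of this selection rule, assuming a word-error-rate loss computed between sampled hypotheses (this page does not state the paper's exact loss or sample count), the Python below picks the sample with the lowest average loss against all samples. Minimizing the average loss is the same as maximizing an average utility when the utility is the negated loss. The names `word_edit_distance`, `wer_loss`, and `mbr_decode`, and the toy sentences, are illustrative and are not taken from the paper or its repository.

```python
from typing import Callable, Sequence


def word_edit_distance(ref: Sequence[str], hyp: Sequence[str]) -> int:
    """Levenshtein distance over word tokens."""
    prev = list(range(len(hyp) + 1))
    for i, r in enumerate(ref, start=1):
        curr = [i]
        for j, h in enumerate(hyp, start=1):
            cost = 0 if r == h else 1
            curr.append(min(prev[j] + 1,          # word missing from hyp
                            curr[j - 1] + 1,      # extra word in hyp
                            prev[j - 1] + cost))  # match or substitution
        prev = curr
    return prev[-1]


def wer_loss(reference: str, hypothesis: str) -> float:
    """WER of `hypothesis` against `reference` (a pseudo-reference here)."""
    ref, hyp = reference.split(), hypothesis.split()
    if not ref:
        return 0.0 if not hyp else 1.0
    return word_edit_distance(ref, hyp) / len(ref)


def mbr_decode(samples: Sequence[str],
               loss: Callable[[str, str], float] = wer_loss) -> str:
    """Return the sample with the lowest average loss against all samples."""
    def expected_loss(candidate: str) -> float:
        # Every sample doubles as a pseudo-reference (assumption of this sketch).
        return sum(loss(other, candidate) for other in samples) / len(samples)
    return min(samples, key=expected_loss)


# Toy samples standing in for hypotheses sampled from an ASR model.
samples = [
    "the cat sat on the mat",
    "the cat sat on a mat",
    "the cat sat on the mat",
    "a cat sat on the hat",
]
best = mbr_decode(samples)  # selects "the cat sat on the mat" for these samples
```

The cost is one loss evaluation per ordered pair of samples, which is why the number of samples matters for runtime.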

If this is right

MBR decoding can serve as a drop-in replacement for beam search when accuracy is the main goal in offline ASR and ST.
Word error rates can be lowered by estimating risk across multiple sampled hypotheses rather than relying on the single highest-scoring path.
Whisper and its derivative models show improvement in most settings across both English and Japanese test sets.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

Real-time ASR would need faster approximations of the risk calculation to keep MBR practical.
The same sampling-based risk approach might transfer to related audio tasks such as diarization or enhancement.
Increasing the sample count beyond the paper's setting could produce still larger gains if compute budget permits.
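
One possible shape for a faster approximation, offered only as an illustration and not as anything the paper proposes, is to score every candidate against a random subset of the samples instead of all of them. The sketch below assumes a cheap position-wise mismatch loss as a stand-in for WER; `token_mismatch_loss`, `approx_mbr_decode`, `num_refs`, and `seed` are hypothetical names.

```python
import random
from typing import Callable, Sequence


def token_mismatch_loss(reference: str, hypothesis: str) -> float:
    """Stand-in loss: fraction of word positions that differ."""
    ref, hyp = reference.split(), hypothesis.split()
    length = max(len(ref), len(hyp))
    if length == 0:
        return 0.0
    matches = sum(r == h for r, h in zip(ref, hyp))
    return 1.0 - matches / length


def approx_mbr_decode(samples: Sequence[str],
                      loss: Callable[[str, str], float],
                      num_refs: int,
                      seed: int = 0) -> str:
    """Pick the sample with the lowest average loss against a random
    subset of `num_refs` pseudo-references drawn from `samples`."""
    rng = random.Random(seed)
    k = min(num_refs, len(samples))
    references = rng.sample(list(samples), k)

    def estimated_loss(candidate: str) -> float:
        return sum(loss(ref, candidate) for ref in references) / k

    return min(samples, key=estimated_loss)


# Toy samples: five agreeing hypotheses and one outlier.
toy_samples = ["a b c"] * 5 + ["a b d"]
choice = approx_mbr_decode(toy_samples, token_mismatch_loss, num_refs=3)
```

With N samples and k references this needs N × k loss evaluations instead of N × N, at the price of a noisier risk estimate.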

Load-bearing premise

That the finite set of samples drawn for MBR risk estimation adequately represents the true risk distribution for ASR hypotheses.

What would settle it

Repeating the experiments with a substantially larger number of samples per utterance or on an unrelated ASR model and observing no accuracy gain over beam search.

Original abstract

Recent work has shown that sample-based Minimum Bayes Risk (MBR) decoding outperforms beam search in text-to-text generation tasks, such as machine translation, text summarization, and image captioning. On the other hand, beam search is the current practice for speech-to-text tasks such as automatic speech recognition (ASR) and Speech Translation (ST). Given that MBR decoding is effective in text-to-text generation tasks, it is reasonable to expect it to also be effective for speech-to-text tasks. In this paper, we evaluate MBR decoding for ASR and ST tasks on English and Japanese using Whisper and its derivative models. We observe that the accuracy of MBR decoding outperforms that of beam search in most of the experimental settings we have evaluated. The results show that MBR decoding is a promising method for offline ASR and ST tasks that require high accuracy. The code is available at https://github.com/CyberAgentAILab/mbr-for-asr

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Referee Report

2 major / 2 minor

Summary. The paper evaluates sample-based Minimum Bayes Risk (MBR) decoding against beam search for ASR and ST tasks using Whisper and derivative models on English and Japanese data. It reports that MBR yields higher accuracy than beam search in most evaluated settings and concludes that MBR is a promising approach for offline high-accuracy speech-to-text applications, with code released.

Significance. If the empirical results hold under closer scrutiny of sampling variance, the work would usefully extend MBR's documented gains from text-to-text generation to speech-to-text, providing a practical alternative to beam search when accuracy matters more than latency. The multi-setting evaluation is a strength, but the absence of reported sample counts, risk-metric details, and significance testing limits how strongly the central observation can be generalized.

major comments (2)

[Section 3] Experimental setup (Section 3): The manuscript does not state the number of samples drawn for MBR risk estimation or the precise risk metric (e.g., expected WER over the sample set). Given quadratic scaling of pairwise WER and the large hypothesis space in ASR, this omission leaves open whether the reported gains could arise from high-variance Monte-Carlo estimates rather than from the MBR principle itself.
[Results] Results (Table 2 or equivalent): The claim that MBR outperforms beam search 'in most of the experimental settings' is presented without statistical significance tests, confidence intervals on WER differences, or the exact fraction of settings showing improvement. This makes it difficult to assess whether the central observation is robust to sampling variability in the risk estimates.
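
The confidence intervals requested in the second comment could be obtained in several ways; a paired bootstrap over utterances is one common choice. The sketch below assumes per-utterance word-error counts are available for both decoders and that every reference has at least one word. The function name `paired_bootstrap_wer_diff` and the numbers are made-up placeholders, not results from the paper.

```python
import random
from typing import List, Sequence, Tuple


def paired_bootstrap_wer_diff(errors_a: Sequence[int],
                              errors_b: Sequence[int],
                              ref_lengths: Sequence[int],
                              num_resamples: int = 1000,
                              alpha: float = 0.05,
                              seed: int = 0) -> Tuple[float, float]:
    """Percentile bootstrap interval for corpus WER(a) - WER(b).

    Utterances are resampled with replacement, and both systems are
    scored on the same resampled set, which keeps the comparison paired.
    """
    n = len(ref_lengths)
    rng = random.Random(seed)
    diffs: List[float] = []
    for _ in range(num_resamples):
        idx = [rng.randrange(n) for _ in range(n)]
        total_words = sum(ref_lengths[i] for i in idx)
        err_a = sum(errors_a[i] for i in idx)
        err_b = sum(errors_b[i] for i in idx)
        diffs.append((err_a - err_b) / total_words)
    diffs.sort()
    lower = diffs[int((alpha / 2) * len(diffs))]
    upper = diffs[min(len(diffs) - 1, int((1 - alpha / 2) * len(diffs)))]
    return lower, upper


# Placeholder per-utterance error counts (NOT data from the paper).
mbr_errors = [1, 0, 2, 1, 0, 1]
beam_errors = [2, 1, 3, 2, 1, 2]
reference_lengths = [10, 10, 10, 10, 10, 10]
lower, upper = paired_bootstrap_wer_diff(mbr_errors, beam_errors, reference_lengths)
```

An interval that lies entirely below zero would indicate that the first system's WER is lower than the second's on that test set, up to the resampling noise of the bootstrap.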

minor comments (2)

[Abstract] The abstract could quantify the magnitude of the observed WER improvements and the total number of settings evaluated.
[Section 2] Notation for the MBR objective and the beam-search baseline should be aligned more clearly with standard definitions in the ASR literature.

Simulated Author's Rebuttal

2 responses · 0 unresolved

We thank the referee for the constructive comments, which highlight opportunities to improve the clarity of our experimental details and the statistical presentation of results. We respond to each major comment below.

Point-by-point responses

Referee: [Section 3] Experimental setup (Section 3): The manuscript does not state the number of samples drawn for MBR risk estimation or the precise risk metric (e.g., expected WER over the sample set). Given quadratic scaling of pairwise WER and the large hypothesis space in ASR, this omission leaves open whether the reported gains could arise from high-variance Monte-Carlo estimates rather than from the MBR principle itself.

Authors: We agree that these implementation details are necessary for reproducibility and for readers to evaluate the stability of the Monte-Carlo risk estimates. We will revise Section 3 to explicitly state the number of samples drawn per utterance and the precise definition of the risk metric used (expected WER over the sample set). This addition will directly address concerns about potential variance in the estimates.

Revision: yes

Referee: [Results] Results (Table 2 or equivalent): The claim that MBR outperforms beam search 'in most of the experimental settings' is presented without statistical significance tests, confidence intervals on WER differences, or the exact fraction of settings showing improvement. This makes it difficult to assess whether the central observation is robust to sampling variability in the risk estimates.

Authors: We acknowledge that reporting the exact fraction of improved settings and adding measures of uncertainty would make the central claim more precise. In the revised manuscript we will state the proportion of settings in which MBR improves over beam search and will include confidence intervals on WER differences for the main tables. Formal significance testing across all configurations was not performed in the original experiments due to computational constraints; we will instead discuss robustness based on the consistency observed across languages, tasks, and model variants.

Revision: partial

Circularity Check

0 steps flagged

No circularity: purely empirical evaluation of MBR vs beam search

Full rationale

The paper conducts an empirical comparison of sample-based Minimum Bayes Risk decoding against beam search on ASR and ST tasks using Whisper models across English and Japanese. The central claim rests on observed accuracy improvements in reported experimental settings rather than any derivation, first-principles result, or fitted parameter. No equations, ansatzes, or self-citation chains are invoked to derive the performance gains; the results are presented as direct measurements on external benchmarks and are therefore falsifiable outside the paper's own fitted values. This is a standard self-contained empirical study with no load-bearing reductions to inputs by construction.

Axiom & Free-Parameter Ledger

0 free parameters · 0 axioms · 0 invented entities

The work is an empirical study relying on standard ASR benchmarks, Whisper model outputs, and established MBR sampling procedures from prior text-generation papers; no new free parameters, axioms, or invented entities are introduced.

pith-pipeline@v0.9.0 · 5692 in / 943 out tokens · 33197 ms · 2026-05-18T04:58:33.816806+00:00 · methodology

Lean theorems connected to this paper

Citations machine-checked in the Pith Canon. Every link opens the source theorem in the public Lean library.

IndisputableMonolith/Cost/FunctionalEquation.lean · washburn_uniqueness_aczel · unclear
Relation between the paper passage and the cited Recognition theorem: unclear.
Passage: MBR decoding works by sampling multiple hypotheses ... $\arg\max_{y \in H} \frac{1}{N} \sum_{y' \in H} u(y, y')$

IndisputableMonolith/Foundation/ArithmeticFromLogic.lean · embed_strictMono_of_one_lt · unclear
Relation between the paper passage and the cited Recognition theorem: unclear.
Passage: the hypothesis that lies at the center of the sampled hypotheses

What do these tags mean?

matches: The paper's claim is directly supported by a theorem in the formal canon.
supports: The theorem supports part of the paper's argument, but the paper may add assumptions or extra steps.
extends: The paper goes beyond the formal theorem; the theorem is a base layer rather than the whole result.
uses: The paper appears to rely on the theorem as machinery.
contradicts: The paper's claim conflicts with a theorem or certificate in the canon.
unclear: Pith found a possible connection, but the passage is too broad, indirect, or ambiguous to say the theorem truly supports the claim.