arxiv: 2606.29534 · v1 · pith:7R6K5ZP4new · submitted 2026-06-28 · 💻 cs.CL · eess.AS

Preference-ASR: A Preference-Aware Test Set for Benchmarking ASR in the Era of Speech LLMs

Pith reviewed 2026-06-30 07:23 UTC · model grok-4.3

classification 💻 cs.CL eess.AS
keywords ASR evaluation · preference-aware test set · speech recognition · output style preferences · normalization · disfluency handling · entity rendering

The pith

ASR model rankings shift when systems are evaluated on how well they follow different user preferences for output formatting and style.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

Standard ASR test sets fix conventions for numbers, entities, disfluencies, and casing, then apply normalizers that erase the distinctions users actually request. The paper builds PreferenceASR from seven open-source corpora using an LLM-assisted pipeline followed by human checks, then scores systems with a normalizer that skips steps to match an active preference instruction. Four categories are covered: normalization choices, entity rendering, disfluency handling, and case usage. When four models are tested, their relative order changes depending on which preference is active, showing that conventional metrics hide differences in how well systems adapt to instructions.

Core claim

PreferenceASR evaluates ASR systems on their ability to follow natural-language preference instructions across normalization, entities, disfluencies, and case; the preference-aware normalizer selectively applies only the matching normalization steps, and benchmarking four models demonstrates that rankings change across preference types in ways traditional fixed-normalizer evaluations miss.

What carries the argument

The preference-aware normalizer that selectively skips normalization steps to match the active instruction.
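The selective-skipping idea can be sketched in a few lines. This is a minimal illustration, not the paper's normalizer: the step names, the filler list, and the functions `normalize` and `preference_wer` are assumptions introduced here, and the real pipeline covers more steps (for example number and entity handling).

```python
import re

# Hypothetical filler list; the paper's actual disfluency inventory is not given here.
FILLERS = {"um", "uh", "erm"}

def remove_fillers(text):
    # Drop filler words, ignoring case and trailing punctuation.
    return " ".join(w for w in text.split() if w.lower().strip(",.?!") not in FILLERS)

def lowercase(text):
    return text.lower()

def strip_punctuation(text):
    # Keep word characters, whitespace, and apostrophes only.
    return re.sub(r"[^\w\s']", "", text)

# Each step is tagged with the (assumed) preference category it would erase.
STEPS = [
    ("disfluencies", remove_fillers),
    ("case", lowercase),
    ("punctuation", strip_punctuation),
]

def normalize(text, active_preferences=frozenset()):
    # Apply every step except those whose category is under test.
    for category, step in STEPS:
        if category in active_preferences:
            continue
        text = step(text)
    return " ".join(text.split())

def wer(reference, hypothesis):
    # Word-level Levenshtein distance divided by reference length.
    ref, hyp = reference.split(), hypothesis.split()
    prev = list(range(len(hyp) + 1))
    for i, r in enumerate(ref, 1):
        cur = [i]
        for j, h in enumerate(hyp, 1):
            cur.append(min(prev[j] + 1, cur[j - 1] + 1, prev[j - 1] + (r != h)))
        prev = cur
    return prev[-1] / max(len(ref), 1)

def preference_wer(reference, hypothesis, active_preferences=frozenset()):
    return wer(normalize(reference, active_preferences),
               normalize(hypothesis, active_preferences))

ref = "Um, the Profit is fifty"
hyp = "the profit is fifty"
standard_score = preference_wer(ref, hyp)                      # fillers erased on both sides
verbatim_score = preference_wer(ref, hyp, {"disfluencies"})    # missing filler now counts
```

With full normalization the dropped filler is invisible to the metric; once the disfluency step is skipped, the same hypothesis is charged one deletion. That gap is the kind of difference a fixed normalizer would hide.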

If this is right

  • Models that rank highest under one preference instruction can rank lower under another.
  • Fixed normalizers in existing benchmarks mask differences in how systems handle explicit user instructions.
  • The released dataset supplies a direct way to measure preference adherence in ASR outputs.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

  • Speech LLMs may require explicit preference conditioning at inference time to maintain high scores across categories.
  • Evaluation pipelines for future ASR systems could default to testing multiple preference instructions rather than one fixed style.
  • The shift in rankings suggests that preference following is an independent capability that standard word-error metrics do not capture.

Load-bearing premise

The two-stage pipeline with human verification produces examples that match real user preferences and the normalizer applies instructions without adding its own systematic bias.

What would settle it

Re-evaluate the same four models on a larger held-out set of preference instructions and find that model rankings remain stable across all four categories.
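One way to quantify "rankings remain stable" is a rank correlation between per-category orderings. The sketch below uses made-up WER values and hypothetical model names purely for illustration; none of the numbers come from the paper, and the tie handling (tied pairs are dropped) is a simplifying assumption.

```python
from itertools import combinations

# Made-up WER (%) values for illustration only; NOT results from the paper.
toy_wer = {
    "normalization": {"model_a": 8.0, "model_b": 9.5, "model_c": 11.0, "model_d": 12.5},
    "disfluencies": {"model_a": 14.0, "model_b": 9.0, "model_c": 10.0, "model_d": 16.0},
}

def ranking(scores):
    # Best (lowest WER) first.
    return sorted(scores, key=scores.get)

def kendall_tau(scores_x, scores_y):
    # Count model pairs ordered the same way (concordant) or the
    # opposite way (discordant) under the two conditions; ties are skipped.
    concordant = discordant = 0
    for a, b in combinations(sorted(scores_x), 2):
        sign = (scores_x[a] - scores_x[b]) * (scores_y[a] - scores_y[b])
        if sign > 0:
            concordant += 1
        elif sign < 0:
            discordant += 1
    pairs = concordant + discordant
    return (concordant - discordant) / pairs if pairs else 0.0

rank_norm = ranking(toy_wer["normalization"])
rank_disf = ranking(toy_wer["disfluencies"])
tau = kendall_tau(toy_wer["normalization"], toy_wer["disfluencies"])
```

A tau near 1 across all category pairs on a larger held-out instruction set would indicate stable rankings; values well below 1, as in this toy case, would indicate the reordering the paper reports.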

Figures

Figures reproduced from arXiv: 2606.29534 by Boris Ginsburg, Desh Raj, Jagadeesh Balam, Nikolay Karpov, Nithin Rao Koluguri, Piotr Zelasko, Sasha Meister.

Figure 1
Figure 1: Existing ASR references exhibit inconsistent conventions across datasets: GigaSpeech writes numbers in spoken form while Earnings-22 uses written form; AMI preserves filler words while SPGISpeech strips them. Preference-ASR resolves this by conditioning each reference on an explicit user instruction.
Figure 2
Figure 2: Two-stage construction pipeline for the Preference-ASR dataset. Audio samples and ground truth from seven corpora are first manually verified. Stage 1 classifies each sample into preference categories (normalization, entities, disfluencies, case) using an LLM. Stage 2 generates task-specific instructions and reference texts. A final round of manual verification produces the released test set with preference-a…
Figure 3
Figure 3: Examples from Preference-ASR. Each box shows ground truth (GT), preference instruction, and expected output (Pref). Bold marks transformations; underlined entities are false positives absent from audio.
Figure 4
Figure 4: Additional examples from Preference-ASR (complementary to …)
Original abstract

Popular ASR test sets adopt inconsistent conventions for numbers, disfluencies, entities, and casing, while standard normalizers erase the format distinctions users care about. Current benchmarks therefore cannot measure whether a model follows user preferences for output style. We introduce PreferenceASR, a test set evaluating ASR systems on their ability to follow natural-language preference instructions across four categories: normalization, entities, disfluencies, and case. Built from seven open-source corpora via a two-stage LLM-assisted pipeline with human verification, it is evaluated with a preference-aware normalizer that selectively skips steps matching the active instruction. Benchmarking four models shows rankings shift across preference types, exposing quality differences traditional evaluation obscures. We publicly release the dataset.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Referee Report

2 major / 0 minor

Summary. The paper introduces PreferenceASR, a new ASR test set constructed from seven open-source corpora via a two-stage LLM-assisted pipeline with human verification. It evaluates systems on their ability to follow natural-language preference instructions in four categories (normalization, entities, disfluencies, case) using a preference-aware normalizer that selectively skips matching steps. Benchmarking four models shows that rankings shift across preference types, revealing quality differences that standard evaluation obscures. The dataset is publicly released.

Significance. If the dataset construction and normalizer are validated, the work could meaningfully advance ASR benchmarking for speech LLMs by incorporating user preferences that current metrics ignore. The public release of the dataset is a clear strength that enables follow-up work.

major comments (2)
  1. Abstract: the central claim that 'benchmarking four models shows rankings shift across preference types' is presented without any quantitative results, specific WER values, error analysis, or tables, preventing assessment of whether the shifts are robust or statistically meaningful.
  2. Abstract (preference-aware normalizer description): the selective skipping logic is load-bearing for all reported comparisons, yet the abstract provides no validation (e.g., human agreement rates on skipped vs. applied steps or ablation on neutral instructions); without this, observed ranking changes could arise from normalizer artifacts rather than model differences.
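A robustness check of the kind the first comment asks for could take the form of a paired bootstrap over utterances. The sketch below is an assumption about how such a check might look, not a procedure taken from the paper; the function names, the resample count, and the percentile interval are illustrative choices, and the inputs are per-utterance word-error counts and reference lengths for two systems scored on the same test items.

```python
import random

def corpus_wer(errors, ref_lengths):
    # Corpus-level WER: total word errors over total reference words.
    return sum(errors) / sum(ref_lengths)

def bootstrap_wer_diff(errors_a, errors_b, ref_lengths, n_resamples=1000, seed=0):
    # Resample utterances with replacement, keeping the pairing between
    # systems, and collect the WER difference (A minus B) per resample.
    rng = random.Random(seed)
    n = len(ref_lengths)
    diffs = []
    for _ in range(n_resamples):
        idx = [rng.randrange(n) for _ in range(n)]
        total_words = sum(ref_lengths[i] for i in idx)
        diffs.append(sum(errors_a[i] - errors_b[i] for i in idx) / total_words)
    diffs.sort()
    # Approximate 95% percentile interval.
    lower = diffs[int(0.025 * n_resamples)]
    upper = diffs[int(0.975 * n_resamples) - 1]
    return lower, upper

# Toy per-utterance counts, invented for illustration only.
lengths = [10, 10, 10, 10, 10, 10]
errs_a = [1, 2, 0, 3, 1, 2]
errs_b = [2, 3, 1, 4, 2, 3]
interval = bootstrap_wer_diff(errs_a, errs_b, lengths)
```

If the interval for a pair of systems excludes zero under one preference instruction and flips sign under another, the ranking reversal is unlikely to be sampling noise. An interval that straddles zero would suggest the reported shift is not yet established.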

Simulated Author's Rebuttal

2 responses · 0 unresolved

We thank the referee for the constructive feedback. We address each major comment below. The full manuscript contains the requested quantitative details and validations, but we agree the abstract can be strengthened for clarity.

Point-by-point responses
  1. Referee: Abstract: the central claim that 'benchmarking four models shows rankings shift across preference types' is presented without any quantitative results, specific WER values, error analysis, or tables, preventing assessment of whether the shifts are robust or statistically meaningful.

    Authors: The experiments section of the manuscript provides detailed WER tables, ranking comparisons, and error analysis across the four models and preference categories. The abstract is a concise summary and cannot include tables. We will revise the abstract to include specific quantitative highlights (e.g., example WER deltas and ranking reversals) to make the claim more assessable while remaining within length limits. revision: yes

  2. Referee: Abstract (preference-aware normalizer description): the selective skipping logic is load-bearing for all reported comparisons, yet the abstract provides no validation (e.g., human agreement rates on skipped vs. applied steps or ablation on neutral instructions); without this, observed ranking changes could arise from normalizer artifacts rather than model differences.

    Authors: The manuscript details the normalizer's construction, human verification process, agreement rates, and ablations in the methods and experiments sections. We agree a brief reference to this validation in the abstract would address the concern and will revise the abstract accordingly to note that the normalizer was validated via human agreement and ablations. revision: yes

Circularity Check

0 steps flagged

No circularity: empirical dataset construction with no derivations or fitted predictions

full rationale

The paper is a dataset construction effort that builds PreferenceASR from existing corpora via an LLM-assisted pipeline plus human verification, then applies a custom normalizer for evaluation. No equations, parameters, or predictions appear in the provided text or abstract. The central claim (ranking shifts across preference categories) rests on direct empirical comparison of four models rather than any self-referential reduction, self-citation chain, or ansatz. The normalizer description is procedural rather than a fitted or derived quantity that could collapse to its inputs by construction. This matches the default expectation of a non-circular empirical paper.

Axiom & Free-Parameter Ledger

0 free parameters · 1 axiom · 0 invented entities

The central claim depends on the reliability of the LLM-assisted construction and human verification process; no free parameters or invented entities are introduced.

axioms (1)
  • domain assumption: Human verification of LLM-generated preference examples produces a test set that reflects genuine user preferences.
    The two-stage pipeline relies on this step to ensure quality; stated in the abstract description of dataset creation.


Reference graph

Works this paper leans on

44 extracted references · 12 canonical work pages · 7 internal anchors

  1. [1]

    Preference-ASR: A Preference-Aware Test Set for Benchmarking ASR in the Era of Speech LLMs

    Introduction. Automatic speech recognition (ASR) has evolved rapidly, from hidden Markov models through end-to-end CTC and RNN-Transducer architectures to the recent class of Speech-augmented Large Language Models (SpeechLLMs) such as SALMONN [1], Qwen-Audio [2], and Qwen2.5-Omni [3]. Unlike earlier systems, SpeechLLMs can follow natural-language instru...

  2. [2]

    A preference-annotated English test set covering four preference categories: normalization, entities, disfluencies, and case

  3. [3]

    A two-stage LLM-assisted pipeline for generating the test set, with human verification and correction

  4. [4]

    A preference-aware normalizer that selectively skips normalization steps matching the active preference instruction, enabling fair WER computation across diverse formatting requirements

  5. [5]

    We publicly release the test set and evaluation code to support reproducible benchmarking of SpeechLLMs on preference-following

  6. [6]

    22”) or spoken words (TN, e.g., “twenty-two

    Preference-ASR Dataset. 2.1. Preference Categories. We organize preferences into four categories that capture the most common points of friction in real-world transcription workflows. Each category is further divided into sub-categories that can be combined in a single instruction. Normalization. Normalization deals with non-standard words whose spoken and w...

  7. [7]

    22nd” following an ITN instruction, a standard normalizer converts it to “twenty second,

    Experiment Setup. We evaluate the Preference-ASR dataset along two dimensions: (1) how providing a preference instruction affects raw transcription accuracy under standard normalization, and (2) whether models actually comply with the requested preference when assessed with a preference-aware metric. 3.1. Preference-Aware Normalizer. Standard WER pipeli...

  8. [8]

    Standard

    Evaluation Results. Each model is evaluated under two settings: default (D), a standard ASR prompt without any preference instruction, and instructed (I), with a preference-specific instruction appended. Footnote 3: https://huggingface.co/spaces/hf-audio/open_asr_leaderboard. Table 2: WER (%) on Preference-ASR. For each model, Std = standard WER with full normalization;...

  9. [9]

    Conclusion. We introduced Preference-ASR, a test set of 3,210 samples drawn from seven open-source corpora that evaluates whether ASR systems can follow explicit human preference instructions across four categories: normalization, entities, disfluencies, and case. Together with a preference-aware normalizer that selectively skips steps matching the activ...

  10. [10]

    All content was reviewed and validated by the authors

    Generative AI Use Disclosure. Generative AI tools (LLMs) were used in two capacities: (1) as part of the dataset construction pipeline for preference classification and instruction generation, and (2) for editing and polishing the manuscript. All content was reviewed and validated by the authors

  11. [11]

    SALMONN: Towards generic hearing abilities for large language models,

    C. Tang, W. Yu, G. Sun, X. Chen, T. Tan, W. Li, L. Lu, Z. Ma, and C. Zhang, “SALMONN: Towards generic hearing abilities for large language models,” Proc. International Conference on Learning Representations (ICLR), 2024

  12. [12]

    Qwen-Audio: Advancing Universal Audio Understanding via Unified Large-Scale Audio-Language Models

    Y. Chu, J. Xu, X. Zhou, Q. Yang, S. Zhang, Z. Yan, C. Zhou, and J. Zhou, “Qwen-Audio: Advancing universal audio understanding via unified large-scale audio-language models,” arXiv preprint arXiv:2311.07919, 2023

  13. [13]

    Qwen2.5-Omni Technical Report

    J. Xu, Z. Guo, J. He, H. Hu, T. He, S. Bai, K. Chen, J. Wang, Y. Fan, K. Dang, B. Zhang, X. Wang, Y. Chu, and J. Lin, “Qwen2.5-Omni technical report,” arXiv preprint arXiv:2503.20215, 2025

  14. [14]

    Librispeech: An ASR corpus based on public domain audio books,

    V. Panayotov, G. Chen, D. Povey, and S. Khudanpur, “Librispeech: An ASR corpus based on public domain audio books,” in Proc. IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP). IEEE, 2015, pp. 5206–5210

  15. [15]

    SPGISpeech: 5,000 hours of transcribed financial audio for fully formatted end-to-end speech recognition,

    P. K. O’Neill, V. Lavrukhin, S. Majumdar, V. Noroozi, Y. Zhang, O. Kuchaiev, J. Balam, and B. Ginsburg, “SPGISpeech: 5,000 hours of transcribed financial audio for fully formatted end-to-end speech recognition,” in Proc. INTERSPEECH, 2021, pp. 1434–1438

  16. [16]

    Earnings-22: A practical benchmark for accents in the wild,

    M. Del Rio, P. Ha, Q. McNamara, C. Miller, and S. Churi, “Earnings-22: A practical benchmark for accents in the wild,” in Proc. INTERSPEECH, 2022, pp. 4833–4837

  17. [17]

    GigaSpeech: An evolving, multi-domain ASR corpus with 10,000 hours of transcribed audio,

    G. Chen, S. Chai, G. Wang, J. Du, W.-Q. Zhang, C. Weng, D. Su, D. Povey, J. Trmal, J. Zhang et al., “GigaSpeech: An evolving, multi-domain ASR corpus with 10,000 hours of transcribed audio,” in Proc. INTERSPEECH, 2021, pp. 3670–3674

  18. [18]

    The AMI meeting corpus,

    W. Kraaij, T. Hain, M. Lincoln, and W. Post, “The AMI meeting corpus,” in Proc. International Conference on Methods and Techniques in Behavioral Research, 2005, pp. 1–4

  19. [19]

    VoxPopuli: A large-scale multilingual speech corpus for representation learning, semi-supervised learning and interpretation,

    C. Wang, M. Riviere, A. Lee, A. Wu, C. Talber, J. Joshi, and J. Pino, “VoxPopuli: A large-scale multilingual speech corpus for representation learning, semi-supervised learning and interpretation,” in Proc. Annual Meeting of the Association for Computational Linguistics (ACL), 2021, pp. 993–1003

  20. [20]

    Common Voice: A massively-multilingual speech corpus,

    R. Ardila, M. Branwen, L. Davis, M. Henretty, M. Kohler, J. Meyer, R. Morais, L. Saunders, F. M. Tyers, and G. Weber, “Common Voice: A massively-multilingual speech corpus,” in Proc. Language Resources and Evaluation Conference (LREC), 2020, pp. 4218–4222

  21. [21]

    End-to-end rich transcription-style automatic speech recognition with semi-supervised learning,

    Z. Meng, S. Parthasarathy, E. Sun, Y. Gaur, N. Kanda, L. Chen, Y. Zhao, J. Huang, Y. Gong, X. Zeng et al., “End-to-end rich transcription-style automatic speech recognition with semi-supervised learning,” arXiv preprint arXiv:2107.05382, 2021

  22. [22]

    MMAU: A holistic benchmark of agent capabilities across diverse domains,

    G. Sakshi, Z. Sun, A. Alomrani et al., “MMAU: A holistic benchmark of agent capabilities across diverse domains,” arXiv preprint arXiv:2407.18961, 2024

  23. [23]

    Speech-ifeval: Evaluating instruction-following and quantifying catastrophic forgetting in speech-aware language models,

    C.-Y. K. Ke-Han Lu and H. yi Lee, “Speech-ifeval: Evaluating instruction-following and quantifying catastrophic forgetting in speech-aware language models,” 2025. [Online]. Available: https://arxiv.org/abs/2505.19037

  24. [24]

    Instruction-following speech recognition,

    C.-I. J. Lai, Z. Lu, L. Cao, and R. Pang, “Instruction-following speech recognition,”arXiv preprint arXiv:2309.09843, 2023

  25. [25]

    Open ASR leaderboard,

    Hugging Face, “Open ASR leaderboard,” https://huggingface.co/spaces/hf-audio/open_asr_leaderboard, 2024

  26. [26]

    RNN Approaches to Text Normalization: A Challenge

    R. Sproat and N. Jaitly, “RNN approaches to text normalization: A challenge,” arXiv preprint arXiv:1611.00068, 2016

  27. [27]

    Deep context: End-to-end contextual speech recognition,

    G. Pundak and T. N. Sainath, “Deep context: End-to-end contextual speech recognition,” in Proc. IEEE Spoken Language Technology Workshop (SLT). IEEE, 2018, pp. 418–425

  28. [28]

    Spontaneous speech: How people really talk and why engineers should care,

    E. Shriberg, “Spontaneous speech: How people really talk and why engineers should care,” in Proc. INTERSPEECH, 2005, pp. 1781–1784

  29. [29]

    Bidirectional recurrent neural network with attention mechanism for punctuation restoration,

    O. Tilk and T. Alumäe, “Bidirectional recurrent neural network with attention mechanism for punctuation restoration,” in Proc. INTERSPEECH, 2016, pp. 3047–3051

  30. [30]

    Longer is (not necessarily) stronger: Punctuated long-sequence training for enhanced speech recognition and translation,

    N. R. Koluguri, T. Bartley, H. Xu, O. Hrinchuk, J. Balam, B. Ginsburg, and G. Kucsko, “Longer is (not necessarily) stronger: Punctuated long-sequence training for enhanced speech recognition and translation,” in 2024 IEEE Spoken Language Technology Workshop (SLT). IEEE, 2024, pp. 255–262

  31. [31]

    Qwen3 Technical Report

    Q. Team, “Qwen3 technical report,” 2025. [Online]. Available: https://arxiv.org/abs/2505.09388

  32. [32]

    open_asr_leaderboard/normalizer,

    Hugging Face, “open_asr_leaderboard/normalizer,” https://github.com/huggingface/open_asr_leaderboard/tree/main/normalizer, 2026, commit or access date: March 3, 2026

  33. [33]

    Robust speech recognition via large-scale weak supervision,

    A. Radford, J. W. Kim, T. Xu, G. Brockman, C. McLeavey, and I. Sutskever, “Robust speech recognition via large-scale weak supervision,” Proc. International Conference on Machine Learning (ICML), pp. 28492–28518, 2023

  34. [34]

    What is lost in normalization? Exploring pitfalls in multilingual ASR model evaluations,

    K. Gupta et al., “What is lost in normalization? Exploring pitfalls in multilingual ASR model evaluations,” in Proc. Conference on Empirical Methods in Natural Language Processing (EMNLP), 2024

  35. [35]

    Canary-1b-v2 & parakeet-tdt-0.6b-v3: Efficient and high-performance models for multilingual asr and ast,

    M. Sekoyan, N. R. Koluguri, N. Tadevosyan, P. Zelasko, T. Bartley, N. Karpov, J. Balam, and B. Ginsburg, “Canary-1b-v2 & parakeet-tdt-0.6b-v3: Efficient and high-performance models for multilingual asr and ast,” arXiv preprint arXiv:2509.14128, 2025

  36. [36]

    Fast conformer with linearly scalable attention for efficient speech recognition,

    D. Rekesh, N. R. Koluguri, S. Kriman, S. Majumdar, V. Noroozi, H. Huang, O. Hrinchuk, K. Puvvada, A. Kumar, J. Balam et al., “Fast conformer with linearly scalable attention for efficient speech recognition,” in 2023 IEEE Automatic Speech Recognition and Understanding Workshop (ASRU). IEEE, 2023, pp. 1–8

  37. [37]

    Efficient sequence transduction by jointly predicting tokens and durations,

    H. Xu, F. Jia, S. Majumdar, H. Huang, S. Watanabe, and B. Ginsburg, “Efficient sequence transduction by jointly predicting tokens and durations,” in International Conference on Machine Learning. PMLR, 2023, pp. 38462–38484

  38. [38]

    Qwen2.5 technical report,

    A. Y. Qwen, B. Yang, B. Zhang, B. Hui, B. Zheng, B. Yu, C. Li, D. Liu, F. Huang, H. Wei et al., “Qwen2.5 technical report,” arXiv preprint, 2024

  39. [39]

    Lora: Low-rank adaptation of large language models

    E. J. Hu, Y. Shen, P. Wallis, Z. Allen-Zhu, Y. Li, S. Wang, L. Wang, W. Chen et al., “Lora: Low-rank adaptation of large language models.” ICLR, vol. 1, no. 2, p. 3, 2022

  40. [40]

    Phi-4-Mini Technical Report: Compact yet Powerful Multimodal Language Models via Mixture-of-LoRAs

    A. Abouelenin, A. Ashfaq, A. Atkinson, H. Awadalla, N. Bach et al., “Phi-4-Mini technical report: Compact yet powerful multimodal language models via mixture-of-LoRAs,” arXiv preprint arXiv:2503.01743, 2025

  41. [41]

    Qwen3-Omni Technical Report

    J. Xu, Z. Guo, J. He, H. Hu, Y. Chu, J. Lin et al., “Qwen3-Omni technical report,” arXiv preprint arXiv:2509.17765, 2025

  42. [42]

    NVIDIA NeMo Canary-Qwen-2.5B,

    NVIDIA, “NVIDIA NeMo Canary-Qwen-2.5B,” https://huggingface.co/nvidia/canary-qwen-2.5b, 2025.
    Supplementary Material: Preference-ASR

  43. [43]

    Um, and profit aim is fifty million Euros, which is uh

    Additional Preference Examples. Figure 3 in the main paper shows one example per preference category. Figure 4 lists a sample from each category along with the full instruction and expected preference text as used during evaluation. Normalization (AMI, ITN, numbers & symbols). GT: “Um, and profit aim is fifty million Euros, which is uh” Instr: Perform speech re...

  44. [44]

    Per-Dataset WER Breakdown Table 2 in the main paper reports aggregate WER across all seven source datasets. Tables 3–6 provide the full per-dataset breakdown for each model, reporting both standard WER (Std, full normalization) and preference-aware WER (Pref, selective normalization that skips steps matching the active preference). D denotes a default pro...