pith. machine review for the scientific record.

arxiv: 2605.00022 · v1 · submitted 2026-04-20 · 💻 cs.CL · cs.AI · cs.SD

Recognition: unknown

Putting HUMANS first: Efficient LAM Evaluation with Human Preference Alignment

Authors on Pith: no claims yet

Pith reviewed 2026-05-10 05:44 UTC · model grok-4.3

classification 💻 cs.CL cs.AI cs.SD
keywords large audio models · LAM evaluation · subset selection · human preference alignment · regression modeling · benchmark efficiency · user satisfaction · HUMANS benchmark

The pith

Regression on 50-example subsets predicts human preferences for large audio models at 0.98 correlation, outperforming full benchmarks.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper asks whether tiny fractions of existing benchmark data can stand in for costly full evaluations of large audio models. Testing 10 selection techniques across 18 models and 40 tasks shows that 50 examples can match full-benchmark rankings at over 0.93 correlation. When these subsets are used to train regression models that forecast human ratings collected from real voice assistant interactions, they reach 0.98 correlation—exceeding what the complete benchmark or random samples achieve. This suggests that strategic data curation can yield better proxies for user satisfaction than volume alone. The authors release these subsets as the HUMANS benchmark for more efficient and human-aligned LAM assessment.

Core claim

The central claim is that minimal subsets of just 50 benchmark examples, selected strategically and used to train regression models, achieve 0.98 Pearson correlation with human preference ratings from voice assistant conversations, outperforming both random subsets and models trained on the full benchmark data.

What carries the argument

Regression models trained on curated minimal subsets to predict human satisfaction, released as the HUMANS benchmark.
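
As a concrete reading of that machinery, here is a minimal sketch of the kind of regression described: per-model scores on the 50 selected examples are the features, averaged human ratings are the targets, and a ridge fit maps one to the other. The shapes, the ridge choice, and the placeholder data are assumptions for illustration, not the paper's released code.

```python
# Sketch of a HUMANS-style regression (assumed setup; placeholder data).
import numpy as np
from scipy.stats import pearsonr
from sklearn.linear_model import RidgeCV

# Assumed inputs:
#   subset_scores: (18 models, 50 selected items) per-item benchmark scores
#   human_ratings: (18,) mean human preference rating per model (776 ratings pooled)
rng = np.random.default_rng(0)
subset_scores = rng.random((18, 50))
human_ratings = rng.uniform(1, 6, size=18)   # assumed 6-point scale, per Figure 21

# Ridge regression with the regularization strength chosen by cross-validation.
model = RidgeCV(alphas=np.logspace(-3, 3, 13))
model.fit(subset_scores, human_ratings)

# In-sample Pearson correlation between fitted predictions and training targets.
r, _ = pearsonr(model.predict(subset_scores), human_ratings)
print(f"in-sample Pearson r = {r:.3f}")
```

The in-sample correlation computed at the end is the quantity the circularity audit below scrutinizes.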

If this is right

  • New LAMs can be evaluated using only 0.3% of the data while maintaining high correlation with comprehensive scores.
  • The HUMANS benchmark offers practitioners a low-cost way to assess models that better reflects real user preferences.
  • Quality of selected examples matters more than quantity for aligning evaluations with human judgments.
  • Similar methods could reduce redundancy in other large-scale AI benchmarks.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the authors make directly.

  • Adopting this method could lower the barrier for researchers to iterate on audio models quickly.
  • The moderate 0.85 correlation of the full benchmark with human ratings points to a need for better human-aligned metrics in general.
  • Testing these subsets on models from different domains or languages would reveal how broadly they apply.

Load-bearing premise

The 776 human preference ratings from voice assistant conversations represent general user satisfaction, and the regression models will generalize beyond the collected data to new models and tasks.

What would settle it

Collecting human preference ratings for a fresh set of LAMs and conversations, then verifying whether predictions from the HUMANS regression models on the 50-example subsets maintain high correlation with the new ratings.
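
In code terms, that settling experiment reduces to scoring unseen models on the same 50 items and correlating the regression's predictions with the freshly collected ratings. A hedged sketch, assuming a fitted model with a scikit-learn-style predict method and hypothetical arrays for the new data:

```python
from scipy.stats import pearsonr

def out_of_sample_check(model, new_subset_scores, new_human_ratings):
    """Correlate regression predictions with freshly collected ratings for unseen LAMs.

    `model` is a fitted regressor (e.g. the RidgeCV sketch above); the two arrays
    are hypothetical: (n_new_models, 50) subset scores and (n_new_models,) ratings.
    """
    predicted = model.predict(new_subset_scores)
    return pearsonr(predicted, new_human_ratings)
```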

Figures

Figures reproduced from arXiv: 2605.00022 by Diyi Yang, William Held, Woody Haosheng Gan.

Figure 1. Overview. Minimal subsets are selected from full benchmark pools, alignment with human preferences is validated through interactive evaluations, and regression models are trained to efficiently predict user satisfaction.
Figure 2.
Figure 3. Benchmark alignment with human preferences. Pearson correlation between subset scores (averaged over 100 random initializations) and human overall ratings. "Best" subset: Anchor Points for n ≤ 30 and Combined Embedding for n ≥ 50.
Figure 4. Results across subset sizes: principled ("best") selection methods consistently outperform random sampling, with performance peaking at n = 100 (r = 0.978).
Figure 5. Random Sampling (Pearson). AUCC = 0.891, N90 = 83, N95 = 164. Pearson correlation vs. subset size (n).
Figure 6. Random-Sampling-Learn (Pearson). AUCC = 0.854, N90 = 119, N95 = 300.
Figure 9. Difficulty-based Selection (Pearson). AUCC = 0.902, N90 = 71, N95 = 157.
Figure 10. Anchor Points (Pearson). AUCC = 0.927, N90 = 40, N95 = 155.
Figure 12. Acoustic Embedding (Pearson). AUCC = 0.850, N90 = 92, N95 = 250.
Figure 13. Combined Embedding (Pearson). AUCC = 0.943, N90 = 32, N95 = 67.
Figure 14. Task distribution for the Anchor Points method. Heatmap shows the percentage of items from each task in subsets of varying sizes (n = 10 to n = 200), averaged across 100 random seeds. Darker red indicates higher representation. Tasks are ordered by their representation at n = 10.
Figure 15. Task distribution for the Combined Embedding method. Heatmap shows the percentage of items from each task in subsets of varying sizes (n = 10 to n = 200), averaged across 100 random seeds. Darker red indicates higher representation. Tasks are ordered by their representation at n = 10.
Figure 16. Study overview and informed consent form. The page begins with "How it works" explaining the study workflow, followed by the detailed informed consent section covering purpose, procedures, risks, benefits, compensation, and participant rights.
Figure 17. Open chat scenario assignment. Example of scenario instructions for free-form conversations.
Figure 18. Goal-oriented scenario assignment. Example showing the "Weight Loss" scenario with structured context and goals.
Figure 19. Function calling scenario assignment. Example showing the "Social Media Engagement" scenario with detailed task requirements.
Figure 20. Real-time conversation interface. Shows the voice interaction screen with scenario information (left), conversation status indicators, and a function verification panel (right) for tracking task completion in function calling scenarios.
Figure 21. Overall recommendation rating interface. Expandable 6-point scale with detailed descriptions for each rating level.
Figure 22. Multi-dimensional rating interface. All five evaluation dimensions with expandable scales, a text feedback area, and optional audio feedback recording.
Figure 23. Model-specific failure mode distributions. Heatmap shows the percentage of dissatisfaction cases mentioning each failure category for each model. Cell color intensity represents the ratio to baseline (average across all models), with darker red indicating higher-than-average rates and darker green indicating lower-than-average rates. Models are ordered by overall human satisfaction.
Original abstract

The rapid proliferation of large audio models (LAMs) demands efficient approaches for model comparison, yet comprehensive benchmarks are costly. To fill this gap, we investigate whether minimal subsets can reliably evaluate LAMs while reducing costs and data redundancy. Analyzing 10 subset selection methods with 18 audio models across 40 tasks covering major LAM evaluation dimensions, we show that subsets of just 50 examples (0.3% of data) can achieve over 0.93 Pearson correlation with full benchmark scores. To understand how well these scores align with what practitioners ultimately care about, user satisfaction, we collect 776 human preference ratings from realistic voice assistant conversations, finding that both subsets and full benchmark achieve only 0.85 correlation with human. To better predict preferences, we trained regression models on these selected subsets, achieving 0.98 correlation -- outperforming regression models trained on both random subsets and the full benchmark. This demonstrates that in regression modeling, well-curated subsets outpredict the full benchmark, showing quality over quantity. We open-source these regression-weighted subsets as the HUMANS benchmark, an efficient proxy for LAM evaluation that captures both benchmark performance and user preferences.

Editorial analysis

A structured set of objections, weighed in public.

A referee report, a simulated author's rebuttal, a circularity audit, and an axiom & free-parameter ledger. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Referee Report

3 major / 2 minor

Summary. The paper claims that subsets of just 50 examples (0.3% of data) selected via 10 methods can achieve >0.93 Pearson correlation with full-benchmark scores when evaluating 18 large audio models (LAMs) across 40 tasks. It further reports that regression models trained on these subsets predict 776 collected human preference ratings from voice-assistant conversations with 0.98 correlation, outperforming both random subsets and the full benchmark, and releases the resulting regression-weighted subsets as the open-source HUMANS benchmark.

Significance. If the statistical claims survive proper validation, the work could meaningfully lower the cost of LAM evaluation while improving alignment with user satisfaction. The scale of the experiments (10 selection methods, 18 models, 40 tasks) and the decision to open-source the HUMANS subsets are concrete strengths that would aid reproducibility and follow-on research.

major comments (3)
  1. Abstract: the headline result that regression models on the 50-example subsets reach 0.98 correlation with human preferences rests on training with only n=18 models. With 50 features per model the feature-to-sample ratio is high; the abstract gives no indication of the regression method, regularization, or validation procedure (in-sample vs. cross-validated), so the reported superiority over the full benchmark cannot be assessed for overfitting.
  2. Abstract: subset selection itself was performed on the same 18 models to maximize correlation with the full benchmark, introducing selection bias. Any subsequent claim that the selected subsets outperform the full benchmark (or random subsets) in predicting human ratings therefore inherits the same limited sample and requires an independent hold-out or cross-validation step to be credible.
  3. Abstract: the 0.85 correlation between both subsets and the full benchmark with the 776 human preference ratings is presented as evidence that curated subsets are preferable, yet no details are supplied on the rating protocol, inter-rater agreement, data exclusion rules, or the exact statistical test used. Without these, the human-alignment advantage cannot be verified.
minor comments (2)
  1. The abstract introduces the HUMANS benchmark without first defining the LAM acronym or briefly characterizing the 40 tasks; adding these clarifications would improve readability for a general CL audience.
  2. The paper states that subsets achieve 'over 0.93 Pearson correlation' but does not report confidence intervals or the number of trials used to compute the figure; including these would strengthen the quantitative claims.

Simulated Author's Rebuttal

3 responses · 0 unresolved

We thank the referee for the careful and constructive review. The comments identify key areas where greater methodological transparency and validation are needed to support the statistical claims. We address each point below and will revise the manuscript to incorporate the requested details and procedures.

read point-by-point responses
  1. Referee: Abstract: the headline result that regression models on the 50-example subsets reach 0.98 correlation with human preferences rests on training with only n=18 models. With 50 features per model the feature-to-sample ratio is high; the abstract gives no indication of the regression method, regularization, or validation procedure (in-sample vs. cross-validated), so the reported superiority over the full benchmark cannot be assessed for overfitting.

    Authors: We agree that the current abstract does not provide sufficient detail on the regression analysis, making it difficult to evaluate overfitting risk with n=18 and 50 features. In the revised manuscript we will expand both the abstract and methods section to specify the regression method, any regularization applied, the procedure for selecting hyperparameters, and the validation approach (including cross-validated correlation scores; a minimal sketch of such a check follows these responses). This will allow readers to assess whether the reported 0.98 correlation and outperformance over the full benchmark hold under proper validation. revision: yes

  2. Referee: Abstract: subset selection itself was performed on the same 18 models to maximize correlation with the full benchmark, introducing selection bias. Any subsequent claim that the selected subsets outperform the full benchmark (or random subsets) in predicting human ratings therefore inherits the same limited sample and requires an independent hold-out or cross-validation step to be credible.

    Authors: We acknowledge the selection bias concern, as the subsets were chosen to maximize correlation with the full benchmark on the identical set of 18 models later used for human-preference regression. The comparison to random subsets is less affected because random selection does not optimize on the data. To strengthen credibility we will revise the paper to report a nested cross-validation in which subset selection is repeated independently within each training fold, and we will present the resulting out-of-fold performance on human ratings. We will also add explicit discussion of the limitations imposed by the small number of available models. revision: yes

  3. Referee: Abstract: the 0.85 correlation between both subsets and the full benchmark with the 776 human preference ratings is presented as evidence that curated subsets are preferable, yet no details are supplied on the rating protocol, inter-rater agreement, data exclusion rules, or the exact statistical test used. Without these, the human-alignment advantage cannot be verified.

    Authors: We agree that the abstract (and current manuscript) lacks the necessary details on the human preference data to allow independent verification. In the revision we will add a dedicated subsection describing the rating protocol, the instructions and scale provided to raters, quantitative inter-rater agreement measures, any exclusion criteria applied to the 776 ratings, and confirmation that Pearson correlation (with associated p-values) was the test used. These additions will support evaluation of the reported 0.85 correlations. revision: yes
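
A minimal sketch of the cross-validated check promised in responses 1 and 2, assuming the 18-model, 50-feature setup used above; the per-fold subset re-selection is indicated only as a comment because it depends on the selection method.

```python
# Leave-one-out cross-validation sketch for the n=18 regression (assumed setup).
# Each fold refits RidgeCV on 17 models and predicts the held-out model; the
# reported correlation is then computed only on out-of-fold predictions.
import numpy as np
from scipy.stats import pearsonr
from sklearn.linear_model import RidgeCV
from sklearn.model_selection import LeaveOneOut

def loo_cv_correlation(subset_scores, human_ratings, alphas=np.logspace(-3, 3, 13)):
    """Out-of-fold Pearson correlation; subset_scores is (n_models, n_items)."""
    preds = np.empty_like(human_ratings, dtype=float)
    for train_idx, test_idx in LeaveOneOut().split(subset_scores):
        # If subset selection is part of the pipeline, it should be re-run here
        # using only the training-fold models (per the nested-CV proposal above).
        model = RidgeCV(alphas=alphas)
        model.fit(subset_scores[train_idx], human_ratings[train_idx])
        preds[test_idx] = model.predict(subset_scores[test_idx])
    return pearsonr(preds, human_ratings)
```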

Circularity Check

1 step flagged

Regression correlation with human preferences is an in-sample fit by construction

specific steps
  1. fitted input called prediction [Abstract]
    "To better predict preferences, we trained regression models on these selected subsets, achieving 0.98 correlation -- outperforming regression models trained on both random subsets and the full benchmark. This demonstrates that in regression modeling, well-curated subsets outpredict the full benchmark, showing quality over quantity."

    Regression is trained to map subset benchmark scores (features) to the human preference ratings (targets) on the identical set of 18 models. The 0.98 Pearson correlation is the in-sample correlation between fitted outputs and training targets; it is produced by the fitting procedure itself rather than by any out-of-sample or first-principles derivation.

full rationale

The paper's central efficiency claim rests on two steps: (1) subset selection yielding 0.93 benchmark correlation and (2) regression on those subsets yielding 0.98 human-preference correlation that 'outpredicts' the full benchmark. The second step fits a regression directly to the human ratings (targets) using subset scores as features on the same 18 models; the reported correlation is therefore the training-set fit, not an independent prediction. This matches the fitted-input-called-prediction pattern exactly. No self-citation chains, self-definitional equations, or ansatz smuggling appear. The 0.93 benchmark correlation is not forced, because multiple selection methods were tested and the result is presented as an achievable upper bound rather than a definitional identity. Overall, partial circularity: the headline "0.98 outpredicts full benchmark" number reduces to comparing two in-sample fits on identical limited data.
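
To make the in-sample concern concrete, here is an editorial, synthetic illustration (not from the paper): with 18 samples and 50 features, a lightly regularized linear fit reaches a high in-sample correlation even when the targets are pure noise, while out-of-fold predictions do not.

```python
# Synthetic illustration of why an in-sample r with 18 models and 50 features
# is weak evidence: random features fit to random targets still correlate
# highly in-sample, while out-of-fold predictions do not.
import numpy as np
from scipy.stats import pearsonr
from sklearn.linear_model import Ridge
from sklearn.model_selection import LeaveOneOut

rng = np.random.default_rng(0)
X = rng.random((18, 50))        # stand-in for 50-item subset scores
y = rng.random(18)              # pure-noise "human ratings"

model = Ridge(alpha=1.0).fit(X, y)
in_sample_r, _ = pearsonr(model.predict(X), y)

oof = np.empty(18)
for tr, te in LeaveOneOut().split(X):
    oof[te] = Ridge(alpha=1.0).fit(X[tr], y[tr]).predict(X[te])
out_of_fold_r, _ = pearsonr(oof, y)

print(f"in-sample r = {in_sample_r:.2f}, out-of-fold r = {out_of_fold_r:.2f}")
```

The point is not that the paper's number must be wrong, only that, as the audit notes, an in-sample fit on 18 models cannot by itself distinguish signal from overfitting.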

Axiom & Free-Parameter Ledger

2 free parameters · 2 axioms · 1 invented entity

The central claims rest on empirical subset selection and regression fitting to newly collected human data; the main addition is the curated subsets and the comparative finding about their predictive superiority.

free parameters (2)
  • subset size of 50
    Selected as the minimal size achieving >0.93 correlation with full benchmarks
  • regression model coefficients
    Fitted to the 776 human preference ratings to maximize correlation
axioms (2)
  • domain assumption: Pearson correlation is a sufficient metric for both benchmark alignment and human preference prediction
    Used to claim 0.93 and 0.98 correlations without additional metrics or error analysis
  • domain assumption: The collected human ratings represent stable user satisfaction across models and tasks
    Central to the claim that subsets better predict preferences
invented entities (1)
  • HUMANS benchmark: no independent evidence
    purpose: Efficient proxy for LAM evaluation that captures both benchmark performance and user preferences
    Newly constructed from the selected regression-weighted subsets

pith-pipeline@v0.9.0 · 5506 in / 1520 out tokens · 68873 ms · 2026-05-10T05:44:45.214416+00:00 · methodology

discussion (0)

