A Finetuned SpeechLLM for Joint Multi-Granular L2 Assessment and Natural-Language Rationales

Aditya Kamlesh Parikh; Catia Cucchiarini; Cristian Tejedor-Garcia; Helmer Strik

arxiv: 2606.09470 · v1 · pith:7VF7EEWEnew · submitted 2026-06-08 · 💻 cs.CL · cs.AI

A Finetuned SpeechLLM for Joint Multi-Granular L2 Assessment and Natural-Language Rationales

Aditya Kamlesh Parikh , Cristian Tejedor-Garcia , Catia Cucchiarini , Helmer Strik This is my paper

Pith reviewed 2026-06-27 16:31 UTC · model grok-4.3

classification 💻 cs.CL cs.AI

keywords L2 speech assessmentSpeechLLMmulti-granular assessmentnatural language rationalesinterpretabilitySpeechOcean762Bounded Direct Preference Optimization

0 comments

The pith

A single finetuned SpeechLLM jointly scores L2 speech at sentence, word and phoneme levels while generating natural-language rationales.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper proposes a rubric-guided SpeechLLM that performs multi-aspect assessment of second-language speech in one forward pass. It predicts ordinal scores for sentence-level accuracy, fluency and prosody, plus word- and phoneme-level accuracy, and produces an accompanying natural-language rationale. Training combines supervised fine-tuning with Bounded Direct Preference Optimization to encourage both accurate labels and coherent explanations. On the SpeechOcean762 benchmark the joint model matches or beats single-granularity baselines while remaining competitive with earlier systems. Analysis of the generated rationales shows they remain plausible at the sentence level but lose faithfulness when checked against token-level ground truth.

Core claim

The rubric-guided SpeechLLM, trained with a hybrid objective of supervised fine-tuning plus Bounded Direct Preference Optimization, jointly predicts ordinal labels at the sentence level (accuracy, fluency, prosody), word/phoneme-level accuracy, and generates a natural-language rationale in the same response. On SpeechOcean762 the approach matches or outperforms single-granularity models while staying competitive with prior work. Rationales are evaluated for self-consistency with model predictions via sentiment consistency and for alignment with ground-truth labels via mention-based agreement; they prove plausible at sentence level but faithfulness degrades at word and phoneme levels because

What carries the argument

Rubric-guided SpeechLLM trained with supervised fine-tuning plus Bounded Direct Preference Optimization to produce joint multi-granular labels and natural-language rationales.

If this is right

A single model can replace separate systems tuned for sentence-level versus token-level scoring.
Sentence-level rationales remain consistent with the model's own predictions on standard data.
Word- and phoneme-level rationales show weaker alignment with ground-truth references.
Overall scoring performance stays competitive with prior single-task approaches.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

The same joint prediction-plus-rationale format could be tested on other spoken-language tasks that require both a numeric score and an explanation.
If faithfulness at the token level improves with denser reference data, the model could supply automated corrective feedback in language-learning applications.
The observed drop in faithfulness at finer granularities points to a need for training sets that contain explicit word- or phoneme-level explanations rather than only sentence-level ones.

Load-bearing premise

The hybrid training objective and rubric guidance are sufficient to produce rationales whose self-consistency and alignment with ground-truth labels can be meaningfully evaluated at multiple granularities.

What would settle it

A direct measurement on SpeechOcean762 or a similar corpus showing that mention-based agreement between generated rationales and ground-truth word/phoneme labels falls below the level needed for practical use would falsify the claim of usable multi-granular rationales.

Figures

Figures reproduced from arXiv: 2606.09470 by Aditya Kamlesh Parikh, Catia Cucchiarini, Cristian Tejedor-Garcia, Helmer Strik.

**Figure 1.** Figure 1: Abridged prompt structure (full rubrics omitted for brevity). The model predicts multi-granular labels and a natural-language rationale from speech, transcript, and target phonemes. 2.5. Training Procedure We fine-tune the model on a single NVIDIA RTX A6000 (48GB) GPU using the AdamW optimizer. To reduce gradient instability often observed in preference-based fine-tuning [36], we use a constant learning ra… view at source ↗

read the original abstract

Automated L2 speech assessment can assign proficiency labels, but often lacks interpretability. We propose a rubric-guided SpeechLLM for multi-aspect, multi-granular assessment, trained with a hybrid objective combining supervised fine-tuning and Bounded Direct Preference Optimization. The model jointly predicts ordinal labels at the sentence-level (accuracy, fluency, prosody), word/phoneme-level accuracy, and generates a natural-language rationale in the same response. On SpeechOcean762, our approach matches or outperforms single-granularity models while remaining competitive with prior approaches. We analyze rationale reliability along two axes: self-consistency with model predictions and alignment with ground-truth labels, using sentiment consistency (plausibility) and mention-based agreement (faithfulness). Rationales are plausible at the sentence level, but faithfulness degrades at the word/phoneme level: references are sparse and weakly aligned with token-level labels.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Desk Editor's Note private letter to a colleague

The paper gives a workable SpeechLLM that jointly outputs multi-granular L2 labels and rationales, but the faithfulness drop at word/phoneme level undercuts the joint claim.

read the letter

The main takeaway is a single model that predicts sentence-level accuracy, fluency and prosody, plus word- and phoneme-level accuracy, while also emitting a natural-language rationale in the same pass. It is trained with supervised fine-tuning plus bounded DPO on SpeechOcean762 and stays competitive with single-granularity baselines.

The joint multi-granular setup plus rationale generation is the concrete addition relative to prior single-task work. Sentence-level rationales show decent plausibility under their sentiment check, which is a useful signal for interpretability in language tools.

The soft spot is exactly where the stress-test flagged: mention-based faithfulness falls at the word and phoneme levels because the reference texts are sparse and poorly aligned with the token labels. That directly weakens the selling point of reliable rationales across all granularities in one response. The abstract also gives almost no experimental details, no statistical tests, and no error analysis, so the performance numbers are hard to weigh.

The work is aimed at researchers building automated L2 feedback systems who care about adding some form of explanation. A reader already working on speech assessment or rubric-guided LLMs would pick up the training recipe and the two-axis rationale evaluation.

It is worth sending to peer review so the full methods, numbers, and any fixes for the lower-level faithfulness can be examined.

Referee Report

1 major / 0 minor

Summary. The paper introduces a rubric-guided SpeechLLM trained with a hybrid objective (supervised fine-tuning plus Bounded Direct Preference Optimization) that jointly outputs sentence-level ordinal scores for accuracy/fluency/prosody, word/phoneme-level accuracy, and a natural-language rationale on the SpeechOcean762 dataset. It reports matching or outperforming single-granularity baselines while remaining competitive with prior work, and evaluates rationale quality via sentiment consistency (plausibility) and mention-based agreement (faithfulness), finding sentence-level plausibility but degraded faithfulness at finer granularities due to sparse references.

Significance. If the empirical results and joint multi-granular output hold under detailed scrutiny, the approach could meaningfully advance interpretable L2 assessment by unifying label prediction and explanation generation in a single model response. The hybrid objective and rubric guidance are presented as enabling this, but the reported faithfulness degradation at word/phoneme level directly limits the strength of the multi-granularity claim.

major comments (1)

[Abstract] Abstract: The central claim that the model 'jointly predicts ... and generates a natural-language rationale in the same response' with reliable multi-granular output is load-bearing, yet the reported degradation in mention-based agreement (faithfulness) at word/phoneme level—attributed to sparse and weakly aligned references—undermines support for the lower-granularity component of that joint output. This is not merely a presentation issue; it requires either stronger evidence (e.g., improved alignment metrics or error analysis) or a narrowed claim to maintain the multi-granular contribution.

Simulated Author's Rebuttal

1 responses · 0 unresolved

We thank the referee for the constructive feedback. We address the single major comment below, agreeing that the abstract phrasing merits refinement to better align with the reported results on rationale quality.

read point-by-point responses

Referee: [Abstract] Abstract: The central claim that the model 'jointly predicts ... and generates a natural-language rationale in the same response' with reliable multi-granular output is load-bearing, yet the reported degradation in mention-based agreement (faithfulness) at word/phoneme level—attributed to sparse and weakly aligned references—undermines support for the lower-granularity component of that joint output. This is not merely a presentation issue; it requires either stronger evidence (e.g., improved alignment metrics or error analysis) or a narrowed claim to maintain the multi-granular contribution.

Authors: We acknowledge this point. The manuscript already states in the abstract and results that faithfulness degrades at word/phoneme levels owing to sparse references, while label prediction remains competitive. The joint output is achieved via the single-response architecture and hybrid training, which produces all components together. We agree the abstract's reference to 'reliable multi-granular output' could overstate rationale quality at finer levels. We will therefore revise the abstract to clarify that the model jointly predicts multi-granular labels and generates a rationale, with sentence-level rationales showing stronger plausibility and faithfulness than word/phoneme-level ones. This narrows the claim without new experiments. revision: partial

Circularity Check

0 steps flagged

No circularity: empirical application with no derivations or load-bearing self-citations

full rationale

The paper is an empirical ML study: it fine-tunes a SpeechLLM on the public SpeechOcean762 dataset using a hybrid SFT + Bounded DPO objective, then reports joint multi-granular predictions and rationale quality metrics. No equations, first-principles derivations, or predictions appear in the abstract or described content. No self-citations are invoked to justify uniqueness theorems or ansatzes that would reduce the central claims to prior author work. The evaluation metrics (accuracy, faithfulness, plausibility) are standard and externally defined; results are compared to single-granularity baselines and prior approaches without any reduction by construction. This is a standard empirical application whose claims rest on experimental outcomes rather than any definitional or fitted-input loop.

Axiom & Free-Parameter Ledger

0 free parameters · 2 axioms · 0 invented entities

Abstract-only review; the approach rests on standard LLM fine-tuning assumptions, the existence and labeling quality of SpeechOcean762, and the premise that Bounded DPO can be applied to rationale generation without further justification.

axioms (2)

domain assumption Standard supervised fine-tuning and preference optimization objectives transfer directly to joint label-plus-rationale generation for speech assessment.
Invoked by the choice of hybrid training objective without additional derivation in the abstract.
domain assumption Ground-truth labels and human rationales in SpeechOcean762 provide a reliable external benchmark for both prediction accuracy and rationale faithfulness.
Used to evaluate self-consistency and alignment without discussion of label noise or sparsity effects beyond the reported degradation.

pith-pipeline@v0.9.1-grok · 5700 in / 1464 out tokens · 27329 ms · 2026-06-27T16:31:48.430718+00:00 · methodology

discussion (0)

Reference graph

Works this paper leans on

48 extracted references · 8 canonical work pages · 5 internal anchors

[1]

A Finetuned SpeechLLM for Joint Multi-Granular L2 Assessment and Natural-Language Rationales

Introduction The growing demand for effective second language (L2) acqui- sition has intensified interest in instructional approaches that target oral proficiency. Despite advances in pedagogy and digi- tal learning environments, spoken communication remains one of the most challenging competencies for L2 learners to de- velop [1]. Difficulties in speech ...

work page internal anchor Pith review Pith/arXiv arXiv 2026
[2]

niceness bias

Methodology 2.1. Model Architecture We use the state-of-the-art (SOTA) open source Qwen2-Audio- 7B-Instruct [24] as the backbone SpeechLLM. To minimize memory footprint and preserve pretrained acoustic representa- tions, the base model weights are frozen under 4-bit quantiza- tion [37]. Fine-tuning is performed using Low-Rank Adapta- tion (LoRA) [38], inj...
[3]

Sentence Metrics: Accuracy:[Rubrics forExcellent/Good/Average/Bad/Worst] Fluency:[Rubrics forExcellent/Good/Average/Bad/Worst] Prosody:[Rubrics forExcellent/Good/Average/Bad/Worst]
[4]

Word Accuracy:Rate each word inline (Excellent/Good/Average/Bad/Worst) based on the rubrics
[5]

Input:Transcript: ”transcript”

Phoneme Accuracy:Rate each phoneme inline (Excellent/Good/Average/Bad/Worst) based on the rubrics. Input:Transcript: ”transcript”. Target Phonemes: ”phoneme sequence”. Output Format: Accuracy: [Label] Fluency: [Label] Prosody: [Label] Words: WORD1/Label WORD2/Label . . . Phonemes: p1/Label p2/Label . . . Rationale: [Free-text justification] Figure 1:Abrid...
[6]

The word TROOPS has a slight error in the second letter

Results and Discussion We evaluate our fine-tuned SpeechLLM on the SO762 test set. To validate the proposed multi-aspect, multi-granular assess- ment setting, we structure the analysis into three parts: (i) over- all performance across sentence-, word-, and phoneme-level predictions; (ii) comparison against single-granularity models and prior SOTA systems...

1941
[7]

Conclusion In this work, we presented an E2E rubric-guided SpeechLLM for comprehensive L2 speech assessment. By combining SFT with BDPO on the Qwen2-Audio-7B-Instruct model, it jointly predicted rubric-aligned proficiency labels across sentence, word, and phoneme levels and generated a natural-language rationale in a single response. Our results demonstra...
[8]

Acknowledgments This publication is part of the project Responsible AI for V oice Diagnostics (RAIVD) with file number NGF.1607.22.013 of the research programme NGF AiNed Fellowship Grants which is financed by the Dutch Research Council (NWO)
[9]

All scientific content, ex- perimental design, analyses, results, and conclusions were de- veloped, verified, and approved by the authors

Generative AI Use Disclosure Generative AI tools were used for language editing and polish- ing, including grammar and phrasing. All scientific content, ex- perimental design, analyses, results, and conclusions were de- veloped, verified, and approved by the authors. The authors take full responsibility for the content of this paper, and no genera- tive A...
[10]

The interference of first language and second language acquisition,

A. Derakhshan and E. Karimi, “The interference of first language and second language acquisition,”Theory and Practice in lan- guage studies, vol. 5, no. 10, pp. 2112–2117, 2015

2015
[11]

Child—adult differences in second-language phonological learn- ing: The role of cross-language similarity,

W. Baker, P. Trofimovich, J. E. Flege, M. Mack, and R. Halter, “Child—adult differences in second-language phonological learn- ing: The role of cross-language similarity,”Language and Speech, vol. 51, no. 4, pp. 317–342, 2008

2008
[12]

Fostering efl learners’ motivation, anxiety, and self-efficacy through computer-assisted language learning-and mobile-assisted language learning-based instructions,

L. Dong, S. Jamal Mohammed, K. Ahmed Abdel-Al Ibrahim, and A. Rezai, “Fostering efl learners’ motivation, anxiety, and self-efficacy through computer-assisted language learning-and mobile-assisted language learning-based instructions,”Frontiers in psychology, vol. 13, p. 899557, 2022

2022
[13]

Effects of corrective feedback on second language pro- nunciation development,

K. Saito, “Effects of corrective feedback on second language pro- nunciation development,” inThe Cambridge Handbook of Cor- rective Feedback in Second Language Learning and Teaching, H. Nassaji and E. Kartchava, Eds. Cambridge University Press, 2021, pp. 407–428

2021
[14]

Putting an accent on the positive: New direc- tions for l2 pronunciation research and instruction,

T. M. Derwing, “Putting an accent on the positive: New direc- tions for l2 pronunciation research and instruction,” inInterna- tional Symposium on Applied Phonetics, 2018, pp. 19–21

2018
[15]

Phone-level pronunciation scoring and assessment for interactive language learning,

S. M. Witt and S. J. Young, “Phone-level pronunciation scoring and assessment for interactive language learning,”Speech com- munication, vol. 30, no. 2-3, pp. 95–108, 2000

2000
[16]

Quantitative assess- ment of second language learners’ fluency by means of automatic speech recognition technology,

C. Cucchiarini, H. Strik, and L. Boves, “Quantitative assess- ment of second language learners’ fluency by means of automatic speech recognition technology,”The Journal of the Acoustical So- ciety of America, vol. 107, no. 2, pp. 989–999, 2000

2000
[17]

Oral proficiency training in dutch l2: The contribution of asr-based corrective feedback,

C. Cucchiarini, A. Neri, and H. Strik, “Oral proficiency training in dutch l2: The contribution of asr-based corrective feedback,” Speech Communication, vol. 51, no. 10, pp. 853–863, 2009

2009
[18]

wav2vec 2.0: A framework for self-supervised learning of speech repre- sentations,

A. Baevski, Y . Zhou, A. Mohamed, and M. Auli, “wav2vec 2.0: A framework for self-supervised learning of speech repre- sentations,”Advances in neural information processing systems, vol. 33, pp. 12 449–12 460, 2020

2020
[19]

Hubert: Self-supervised speech rep- resentation learning by masked prediction of hidden units,

W.-N. Hsu, B. Bolte, Y .-H. H. Tsai, K. Lakhotia, R. Salakhut- dinov, and A. Mohamed, “Hubert: Self-supervised speech rep- resentation learning by masked prediction of hidden units,” IEEE/ACM Transactions on Audio, Speech, and Language Pro- cessing, vol. 29, pp. 3451–3460, 2021

2021
[20]

Explore wav2vec 2.0 for mispronunciation detection

X. Xu, Y . Kang, S. Cao, B. Lin, and L. Ma, “Explore wav2vec 2.0 for mispronunciation detection.” inInterspeech, vol. 2021, 2021, pp. 4428–4432

2021
[21]

Automatic Pronunciation Assessment using Self-Supervised Speech Representation Learn- ing,

E. Kim, J.-J. Jeon, H. Seo, and H. Kim, “Automatic Pronunciation Assessment using Self-Supervised Speech Representation Learn- ing,” inInterspeech 2022, 2022, pp. 1411–1415

2022
[22]

Enhancing GOP in CTC-Based Mispronunciation Detection with Phonological Knowledge,

A. K. Parikh, C. Tejedor-Garcia, C. Cucchiarini, and H. Strik, “Enhancing GOP in CTC-Based Mispronunciation Detection with Phonological Knowledge,” inInterspeech 2025, 2025, pp. 5068– 5072

2025
[23]

A Framework for Phoneme-Level Pronunciation Assessment Using CTC,

X. Cao, Z. Fan, T. Svendsen, and G. Salvi, “A Framework for Phoneme-Level Pronunciation Assessment Using CTC,” inInter- speech 2024, 2024, pp. 302–306

2024
[24]

Evaluating Logit-Based GOP Scores for Mispronunciation De- tection,

A. K. Parikh, C. Tejedor-Garcia, C. Cucchiarini, and H. Strik, “Evaluating Logit-Based GOP Scores for Mispronunciation De- tection,” inInterspeech 2025, 2025, pp. 2405–2409

2025
[25]

Automatic Scoring at Multi-Granularity for L2 Pronunciation,

B. Lin, L. Wang, X. Feng, and J. Zhang, “Automatic Scoring at Multi-Granularity for L2 Pronunciation,” inInterspeech 2020, 2020, pp. 3022–3026

2020
[26]

Transformer-based multi-aspect multi-granularity non-native en- glish speaker pronunciation assessment,

Y . Gong, Z. Chen, I.-H. Chu, P. Chang, and J. Glass, “Transformer-based multi-aspect multi-granularity non-native en- glish speaker pronunciation assessment,” inICASSP 2022 - 2022 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), 2022, pp. 7262–7266

2022
[27]

A Hier- archical Context-aware Modeling Approach for Multi-aspect and Multi-granular Pronunciation Assessment,

F.-A. Chao, T.-H. Lo, T.-I. Wu, Y .-T. Sung, and B. Chen, “A Hier- archical Context-aware Modeling Approach for Multi-aspect and Multi-granular Pronunciation Assessment,” inInterspeech 2023, 2023, pp. 974–978

2023
[28]

Evaluating human alignment and model faithfulness of llm rationale,

M. Fayyaz, F. Yin, J. Sun, and N. Peng, “Evaluating human alignment and model faithfulness of llm rationale,”arXiv preprint arXiv:2407.00219, 2024

work page arXiv 2024
[29]

Read to hear: A zero-shot pronunciation assessment using textual descriptions and llms,

Y .-W. Chen, M. Ma, and J. Hirschberg, “Read to hear: A zero-shot pronunciation assessment using textual descriptions and llms,” in Proceedings of the 2025 Conference on Empirical Methods in Natural Language Processing, 2025, pp. 2682–2694

2025
[30]

Exploring the potential of large multimodal models as effective alternatives for pronunciation assessment,

K. Wang, L. He, K. Liu, Y . Deng, W. Wei, and S. Zhao, “Exploring the potential of large multimodal models as effective alternatives for pronunciation assessment,”arXiv preprint arXiv:2503.11229, 2025

work page arXiv 2025
[31]

On decoder-only architecture for speech- to-text and large language model integration,

J. Wu, Y . Gaur, Z. Chen, L. Zhou, Y . Zhu, T. Wang, J. Li, S. Liu, B. Ren, L. Liuet al., “On decoder-only architecture for speech- to-text and large language model integration,” in2023 IEEE Auto- matic Speech Recognition and Understanding Workshop (ASRU). IEEE, 2023, pp. 1–8

2023
[32]

SALMONN: Towards Generic Hearing Abilities for Large Language Models

C. Tang, W. Yu, G. Sun, X. Chen, T. Tan, W. Li, L. Lu, Z. Ma, and C. Zhang, “Salmonn: Towards generic hearing abilities for large language models,”arXiv preprint arXiv:2310.13289, 2023

work page internal anchor Pith review Pith/arXiv arXiv 2023
[33]

Qwen-Audio: Advancing Universal Audio Understanding via Unified Large-Scale Audio-Language Models

Y . Chu, J. Xu, X. Zhou, Q. Yang, S. Zhang, Z. Yan, C. Zhou, and J. Zhou, “Qwen-audio: Advancing universal audio understand- ing via unified large-scale audio-language models,”arXiv preprint arXiv:2311.07919, 2023

work page internal anchor Pith review Pith/arXiv arXiv 2023
[34]

Qwen2-Audio Technical Report

Y . Chu, J. Xu, Q. Yang, H. Wei, X. Wei, Z. Guo, Y . Leng, Y . Lv, J. He, J. Linet al., “Qwen2-audio technical report,”arXiv preprint arXiv:2407.10759, 2024

work page internal anchor Pith review Pith/arXiv arXiv 2024
[35]

GAMA: A large audio-language model with advanced audio understanding and complex reasoning abilities,

S. Ghosh, S. Kumar, A. Seth, C. K. R. Evuru, U. Tyagi, S. Sakshi, O. Nieto, R. Duraiswami, and D. Manocha, “GAMA: A large audio-language model with advanced audio understanding and complex reasoning abilities,” inProceedings of the 2024 Conference on Empirical Methods in Natural Language Processing, Y . Al-Onaizan, M. Bansal, and Y .-N. Chen, Eds. Miami, ...

2024
[36]

Audio flamingo 2: An audio-language model with long-audio understanding and expert reasoning abilities,

S. Ghosh, Z. Kong, S. Kumar, S. Sakshi, J. Kim, W. Ping, R. Valle, D. Manocha, and B. Catanzaro, “Audio flamingo 2: An audio-language model with long-audio understanding and expert reasoning abilities,” inProceedings of the 42nd International Conference on Machine Learning, ser. Proceedings of Machine Learning Research, A. Singh, M. Fazel, D. Hsu, S. Laco...

2025
[37]

SpeechLLM-as-Judges: Towards General and Interpretable Speech Quality Evaluation

H. Wang, J. Zhao, Y . Yang, S. Liu, J. Chen, Y . Zhang, S. Zhao, J. Li, J. Zhou, H. Sunet al., “Speechllm-as-judges: Towards gen- eral and interpretable speech quality evaluation,”arXiv preprint arXiv:2510.14664, 2025

work page internal anchor Pith review Pith/arXiv arXiv 2025
[38]

Zero-shot speech llms for multi-aspect evaluation of l2 speech: Challenges and opportunities,

A. K. Parikh, C. Tejedor-Garcia, C. Cucchiarini, and H. Strik, “Zero-shot speech llms for multi-aspect evaluation of l2 speech: Challenges and opportunities,”Proc. SLaTE 2025, pp. 11–15, 2025

2025
[39]

Assessment of L2 Oral Proficiency using Speech Large Language Models,

R. Ma, M. Qian, S. Tang, S. Bann `o, K. M. Knill, and M. J. Gales, “Assessment of L2 Oral Proficiency using Speech Large Language Models,” inInterspeech 2025, 2025, pp. 5078–5082

2025
[40]

Rubric-guided fine-tuning of speechllms for multi-aspect, multi- rater l2 reading-speech assessment,

A. K. Parikh, C. Tejedor-Garc ´ıa, C. Cucchiarini, and H. Strik, “Rubric-guided fine-tuning of speechllms for multi-aspect, multi- rater l2 reading-speech assessment,” inProceedings of the Fif- teenth Language Resources and Evaluation Conference (LREC 2026). Palma, Mallorca, Spain: European Language Resources Association (ELRA), May 2026, pp. 10 255–10 265

2026
[41]

SimPO: Simple Preference Optimization with a Reference-Free Reward,

Y . Meng, M. Xia, and D. Chen, “SimPO: Simple Preference Optimization with a Reference-Free Reward,” inAdvances in Neural Information Processing Systems, A. Globerson, L. Mackey, D. Belgrave, A. Fan, U. Paquet, J. Tom- czak, and C. Zhang, Eds., vol. 37. Curran Asso- ciates, Inc., 2024, pp. 124 198–124 235. [Online]. Avail- able: https://proceedings.neurip...

2024
[42]

Fine-tuning large multimodal models for automatic pronunciation assessment,

K. Wang, W. Wei, Y . Deng, L. He, and S. Zhao, “Fine-tuning large multimodal models for automatic pronunciation assessment,” in ICASSP 2026 - 2026 IEEE International Conference on Acous- tics, Speech and Signal Processing (ICASSP), 2026, pp. 17 562– 17 566

2026
[43]

plausibility: On the (un) reliability of explanations from large language models , author=

C. Agarwal, S. H. Tanneru, and H. Lakkaraju, “Faithfulness vs. Plausibility: On the (Un)Reliability of Explanations from Large Language Models,” 2024. [Online]. Available: https://arxiv.org/abs/2402.04614

work page arXiv 2024
[44]

Rethinking DPO: The role of rejected responses in preference misalignment,

J. H. Cho, J. Oh, M. Kim, and B.-J. Lee, “Rethinking DPO: The role of rejected responses in preference misalignment,” inFindings of the Association for Computational Linguistics: EMNLP 2025, C. Christodoulopoulos, T. Chakraborty, C. Rose, and V . Peng, Eds. Suzhou, China: Association for Computational Linguistics, Nov. 2025, pp. 8159–8176. [Online]. Avail...

2025
[45]

Direct preference optimization: your language model is secretly a reward model,

R. Rafailov, A. Sharma, E. Mitchell, S. Ermon, C. D. Manning, and C. Finn, “Direct preference optimization: your language model is secretly a reward model,” inProceedings of the 37th International Conference on Neural Information Processing Sys- tems, ser. NIPS ’23. Red Hook, NY , USA: Curran Associates Inc., 2023

2023
[46]

Qlora: efficient finetuning of quantized llms,

T. Dettmers, A. Pagnoni, A. Holtzman, and L. Zettlemoyer, “Qlora: efficient finetuning of quantized llms,” inProceedings of the 37th International Conference on Neural Information Pro- cessing Systems, ser. NIPS ’23. Red Hook, NY , USA: Curran Associates Inc., 2023

2023
[47]

Lora: Low-rank adap- tation of large language models,

E. J. Hu, Y . Shen, P. Wallis, Z. Allen-Zhu, Y . Li, S. Wang, L. Wang, and W. Chen, “Lora: Low-rank adap- tation of large language models,” inICLR 2022, April 2022. [Online]. Available: https://www.microsoft.com/en-us/research/ publication/lora-low-rank-adaptation-of-large-language-models/

2022
[48]

speechocean762: An Open-Source Non- Native English Speech Corpus for Pronunciation Assessment,

J. Zhang, Z. Zhang, Y . Wang, Z. Yan, Q. Song, Y . Huang, K. Li, D. Povey, and Y . Wang, “speechocean762: An Open-Source Non- Native English Speech Corpus for Pronunciation Assessment,” in Interspeech 2021, 2021, pp. 3710–3714

2021

[1] [1]

A Finetuned SpeechLLM for Joint Multi-Granular L2 Assessment and Natural-Language Rationales

Introduction The growing demand for effective second language (L2) acqui- sition has intensified interest in instructional approaches that target oral proficiency. Despite advances in pedagogy and digi- tal learning environments, spoken communication remains one of the most challenging competencies for L2 learners to de- velop [1]. Difficulties in speech ...

work page internal anchor Pith review Pith/arXiv arXiv 2026

[2] [2]

niceness bias

Methodology 2.1. Model Architecture We use the state-of-the-art (SOTA) open source Qwen2-Audio- 7B-Instruct [24] as the backbone SpeechLLM. To minimize memory footprint and preserve pretrained acoustic representa- tions, the base model weights are frozen under 4-bit quantiza- tion [37]. Fine-tuning is performed using Low-Rank Adapta- tion (LoRA) [38], inj...

[3] [3]

Sentence Metrics: Accuracy:[Rubrics forExcellent/Good/Average/Bad/Worst] Fluency:[Rubrics forExcellent/Good/Average/Bad/Worst] Prosody:[Rubrics forExcellent/Good/Average/Bad/Worst]

[4] [4]

Word Accuracy:Rate each word inline (Excellent/Good/Average/Bad/Worst) based on the rubrics

[5] [5]

Input:Transcript: ”transcript”

Phoneme Accuracy:Rate each phoneme inline (Excellent/Good/Average/Bad/Worst) based on the rubrics. Input:Transcript: ”transcript”. Target Phonemes: ”phoneme sequence”. Output Format: Accuracy: [Label] Fluency: [Label] Prosody: [Label] Words: WORD1/Label WORD2/Label . . . Phonemes: p1/Label p2/Label . . . Rationale: [Free-text justification] Figure 1:Abrid...

[6] [6]

The word TROOPS has a slight error in the second letter

Results and Discussion We evaluate our fine-tuned SpeechLLM on the SO762 test set. To validate the proposed multi-aspect, multi-granular assess- ment setting, we structure the analysis into three parts: (i) over- all performance across sentence-, word-, and phoneme-level predictions; (ii) comparison against single-granularity models and prior SOTA systems...

1941

[7] [7]

Conclusion In this work, we presented an E2E rubric-guided SpeechLLM for comprehensive L2 speech assessment. By combining SFT with BDPO on the Qwen2-Audio-7B-Instruct model, it jointly predicted rubric-aligned proficiency labels across sentence, word, and phoneme levels and generated a natural-language rationale in a single response. Our results demonstra...

[8] [8]

Acknowledgments This publication is part of the project Responsible AI for V oice Diagnostics (RAIVD) with file number NGF.1607.22.013 of the research programme NGF AiNed Fellowship Grants which is financed by the Dutch Research Council (NWO)

[9] [9]

All scientific content, ex- perimental design, analyses, results, and conclusions were de- veloped, verified, and approved by the authors

Generative AI Use Disclosure Generative AI tools were used for language editing and polish- ing, including grammar and phrasing. All scientific content, ex- perimental design, analyses, results, and conclusions were de- veloped, verified, and approved by the authors. The authors take full responsibility for the content of this paper, and no genera- tive A...

[10] [10]

The interference of first language and second language acquisition,

A. Derakhshan and E. Karimi, “The interference of first language and second language acquisition,”Theory and Practice in lan- guage studies, vol. 5, no. 10, pp. 2112–2117, 2015

2015

[11] [11]

Child—adult differences in second-language phonological learn- ing: The role of cross-language similarity,

W. Baker, P. Trofimovich, J. E. Flege, M. Mack, and R. Halter, “Child—adult differences in second-language phonological learn- ing: The role of cross-language similarity,”Language and Speech, vol. 51, no. 4, pp. 317–342, 2008

2008

[12] [12]

Fostering efl learners’ motivation, anxiety, and self-efficacy through computer-assisted language learning-and mobile-assisted language learning-based instructions,

L. Dong, S. Jamal Mohammed, K. Ahmed Abdel-Al Ibrahim, and A. Rezai, “Fostering efl learners’ motivation, anxiety, and self-efficacy through computer-assisted language learning-and mobile-assisted language learning-based instructions,”Frontiers in psychology, vol. 13, p. 899557, 2022

2022

[13] [13]

Effects of corrective feedback on second language pro- nunciation development,

K. Saito, “Effects of corrective feedback on second language pro- nunciation development,” inThe Cambridge Handbook of Cor- rective Feedback in Second Language Learning and Teaching, H. Nassaji and E. Kartchava, Eds. Cambridge University Press, 2021, pp. 407–428

2021

[14] [14]

Putting an accent on the positive: New direc- tions for l2 pronunciation research and instruction,

T. M. Derwing, “Putting an accent on the positive: New direc- tions for l2 pronunciation research and instruction,” inInterna- tional Symposium on Applied Phonetics, 2018, pp. 19–21

2018

[15] [15]

Phone-level pronunciation scoring and assessment for interactive language learning,

S. M. Witt and S. J. Young, “Phone-level pronunciation scoring and assessment for interactive language learning,”Speech com- munication, vol. 30, no. 2-3, pp. 95–108, 2000

2000

[16] [16]

Quantitative assess- ment of second language learners’ fluency by means of automatic speech recognition technology,

C. Cucchiarini, H. Strik, and L. Boves, “Quantitative assess- ment of second language learners’ fluency by means of automatic speech recognition technology,”The Journal of the Acoustical So- ciety of America, vol. 107, no. 2, pp. 989–999, 2000

2000

[17] [17]

Oral proficiency training in dutch l2: The contribution of asr-based corrective feedback,

C. Cucchiarini, A. Neri, and H. Strik, “Oral proficiency training in dutch l2: The contribution of asr-based corrective feedback,” Speech Communication, vol. 51, no. 10, pp. 853–863, 2009

2009

[18] [18]

wav2vec 2.0: A framework for self-supervised learning of speech repre- sentations,

A. Baevski, Y . Zhou, A. Mohamed, and M. Auli, “wav2vec 2.0: A framework for self-supervised learning of speech repre- sentations,”Advances in neural information processing systems, vol. 33, pp. 12 449–12 460, 2020

2020

[19] [19]

Hubert: Self-supervised speech rep- resentation learning by masked prediction of hidden units,

W.-N. Hsu, B. Bolte, Y .-H. H. Tsai, K. Lakhotia, R. Salakhut- dinov, and A. Mohamed, “Hubert: Self-supervised speech rep- resentation learning by masked prediction of hidden units,” IEEE/ACM Transactions on Audio, Speech, and Language Pro- cessing, vol. 29, pp. 3451–3460, 2021

2021

[20] [20]

Explore wav2vec 2.0 for mispronunciation detection

X. Xu, Y . Kang, S. Cao, B. Lin, and L. Ma, “Explore wav2vec 2.0 for mispronunciation detection.” inInterspeech, vol. 2021, 2021, pp. 4428–4432

2021

[21] [21]

Automatic Pronunciation Assessment using Self-Supervised Speech Representation Learn- ing,

E. Kim, J.-J. Jeon, H. Seo, and H. Kim, “Automatic Pronunciation Assessment using Self-Supervised Speech Representation Learn- ing,” inInterspeech 2022, 2022, pp. 1411–1415

2022

[22] [22]

Enhancing GOP in CTC-Based Mispronunciation Detection with Phonological Knowledge,

A. K. Parikh, C. Tejedor-Garcia, C. Cucchiarini, and H. Strik, “Enhancing GOP in CTC-Based Mispronunciation Detection with Phonological Knowledge,” inInterspeech 2025, 2025, pp. 5068– 5072

2025

[23] [23]

A Framework for Phoneme-Level Pronunciation Assessment Using CTC,

X. Cao, Z. Fan, T. Svendsen, and G. Salvi, “A Framework for Phoneme-Level Pronunciation Assessment Using CTC,” inInter- speech 2024, 2024, pp. 302–306

2024

[24] [24]

Evaluating Logit-Based GOP Scores for Mispronunciation De- tection,

A. K. Parikh, C. Tejedor-Garcia, C. Cucchiarini, and H. Strik, “Evaluating Logit-Based GOP Scores for Mispronunciation De- tection,” inInterspeech 2025, 2025, pp. 2405–2409

2025

[25] [25]

Automatic Scoring at Multi-Granularity for L2 Pronunciation,

B. Lin, L. Wang, X. Feng, and J. Zhang, “Automatic Scoring at Multi-Granularity for L2 Pronunciation,” inInterspeech 2020, 2020, pp. 3022–3026

2020

[26] [26]

Transformer-based multi-aspect multi-granularity non-native en- glish speaker pronunciation assessment,

Y . Gong, Z. Chen, I.-H. Chu, P. Chang, and J. Glass, “Transformer-based multi-aspect multi-granularity non-native en- glish speaker pronunciation assessment,” inICASSP 2022 - 2022 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), 2022, pp. 7262–7266

2022

[27] [27]

A Hier- archical Context-aware Modeling Approach for Multi-aspect and Multi-granular Pronunciation Assessment,

F.-A. Chao, T.-H. Lo, T.-I. Wu, Y .-T. Sung, and B. Chen, “A Hier- archical Context-aware Modeling Approach for Multi-aspect and Multi-granular Pronunciation Assessment,” inInterspeech 2023, 2023, pp. 974–978

2023

[28] [28]

Evaluating human alignment and model faithfulness of llm rationale,

M. Fayyaz, F. Yin, J. Sun, and N. Peng, “Evaluating human alignment and model faithfulness of llm rationale,”arXiv preprint arXiv:2407.00219, 2024

work page arXiv 2024

[29] [29]

Read to hear: A zero-shot pronunciation assessment using textual descriptions and llms,

Y .-W. Chen, M. Ma, and J. Hirschberg, “Read to hear: A zero-shot pronunciation assessment using textual descriptions and llms,” in Proceedings of the 2025 Conference on Empirical Methods in Natural Language Processing, 2025, pp. 2682–2694

2025

[30] [30]

Exploring the potential of large multimodal models as effective alternatives for pronunciation assessment,

K. Wang, L. He, K. Liu, Y . Deng, W. Wei, and S. Zhao, “Exploring the potential of large multimodal models as effective alternatives for pronunciation assessment,”arXiv preprint arXiv:2503.11229, 2025

work page arXiv 2025

[31] [31]

On decoder-only architecture for speech- to-text and large language model integration,

J. Wu, Y . Gaur, Z. Chen, L. Zhou, Y . Zhu, T. Wang, J. Li, S. Liu, B. Ren, L. Liuet al., “On decoder-only architecture for speech- to-text and large language model integration,” in2023 IEEE Auto- matic Speech Recognition and Understanding Workshop (ASRU). IEEE, 2023, pp. 1–8

2023

[32] [32]

SALMONN: Towards Generic Hearing Abilities for Large Language Models

C. Tang, W. Yu, G. Sun, X. Chen, T. Tan, W. Li, L. Lu, Z. Ma, and C. Zhang, “Salmonn: Towards generic hearing abilities for large language models,”arXiv preprint arXiv:2310.13289, 2023

work page internal anchor Pith review Pith/arXiv arXiv 2023

[33] [33]

Qwen-Audio: Advancing Universal Audio Understanding via Unified Large-Scale Audio-Language Models

Y . Chu, J. Xu, X. Zhou, Q. Yang, S. Zhang, Z. Yan, C. Zhou, and J. Zhou, “Qwen-audio: Advancing universal audio understand- ing via unified large-scale audio-language models,”arXiv preprint arXiv:2311.07919, 2023

work page internal anchor Pith review Pith/arXiv arXiv 2023

[34] [34]

Qwen2-Audio Technical Report

Y . Chu, J. Xu, Q. Yang, H. Wei, X. Wei, Z. Guo, Y . Leng, Y . Lv, J. He, J. Linet al., “Qwen2-audio technical report,”arXiv preprint arXiv:2407.10759, 2024

work page internal anchor Pith review Pith/arXiv arXiv 2024

[35] [35]

GAMA: A large audio-language model with advanced audio understanding and complex reasoning abilities,

S. Ghosh, S. Kumar, A. Seth, C. K. R. Evuru, U. Tyagi, S. Sakshi, O. Nieto, R. Duraiswami, and D. Manocha, “GAMA: A large audio-language model with advanced audio understanding and complex reasoning abilities,” inProceedings of the 2024 Conference on Empirical Methods in Natural Language Processing, Y . Al-Onaizan, M. Bansal, and Y .-N. Chen, Eds. Miami, ...

2024

[36] [36]

Audio flamingo 2: An audio-language model with long-audio understanding and expert reasoning abilities,

S. Ghosh, Z. Kong, S. Kumar, S. Sakshi, J. Kim, W. Ping, R. Valle, D. Manocha, and B. Catanzaro, “Audio flamingo 2: An audio-language model with long-audio understanding and expert reasoning abilities,” inProceedings of the 42nd International Conference on Machine Learning, ser. Proceedings of Machine Learning Research, A. Singh, M. Fazel, D. Hsu, S. Laco...

2025

[37] [37]

SpeechLLM-as-Judges: Towards General and Interpretable Speech Quality Evaluation

H. Wang, J. Zhao, Y . Yang, S. Liu, J. Chen, Y . Zhang, S. Zhao, J. Li, J. Zhou, H. Sunet al., “Speechllm-as-judges: Towards gen- eral and interpretable speech quality evaluation,”arXiv preprint arXiv:2510.14664, 2025

work page internal anchor Pith review Pith/arXiv arXiv 2025

[38] [38]

Zero-shot speech llms for multi-aspect evaluation of l2 speech: Challenges and opportunities,

A. K. Parikh, C. Tejedor-Garcia, C. Cucchiarini, and H. Strik, “Zero-shot speech llms for multi-aspect evaluation of l2 speech: Challenges and opportunities,”Proc. SLaTE 2025, pp. 11–15, 2025

2025

[39] [39]

Assessment of L2 Oral Proficiency using Speech Large Language Models,

R. Ma, M. Qian, S. Tang, S. Bann `o, K. M. Knill, and M. J. Gales, “Assessment of L2 Oral Proficiency using Speech Large Language Models,” inInterspeech 2025, 2025, pp. 5078–5082

2025

[40] [40]

Rubric-guided fine-tuning of speechllms for multi-aspect, multi- rater l2 reading-speech assessment,

A. K. Parikh, C. Tejedor-Garc ´ıa, C. Cucchiarini, and H. Strik, “Rubric-guided fine-tuning of speechllms for multi-aspect, multi- rater l2 reading-speech assessment,” inProceedings of the Fif- teenth Language Resources and Evaluation Conference (LREC 2026). Palma, Mallorca, Spain: European Language Resources Association (ELRA), May 2026, pp. 10 255–10 265

2026

[41] [41]

SimPO: Simple Preference Optimization with a Reference-Free Reward,

Y . Meng, M. Xia, and D. Chen, “SimPO: Simple Preference Optimization with a Reference-Free Reward,” inAdvances in Neural Information Processing Systems, A. Globerson, L. Mackey, D. Belgrave, A. Fan, U. Paquet, J. Tom- czak, and C. Zhang, Eds., vol. 37. Curran Asso- ciates, Inc., 2024, pp. 124 198–124 235. [Online]. Avail- able: https://proceedings.neurip...

2024

[42] [42]

Fine-tuning large multimodal models for automatic pronunciation assessment,

K. Wang, W. Wei, Y . Deng, L. He, and S. Zhao, “Fine-tuning large multimodal models for automatic pronunciation assessment,” in ICASSP 2026 - 2026 IEEE International Conference on Acous- tics, Speech and Signal Processing (ICASSP), 2026, pp. 17 562– 17 566

2026

[43] [43]

plausibility: On the (un) reliability of explanations from large language models , author=

C. Agarwal, S. H. Tanneru, and H. Lakkaraju, “Faithfulness vs. Plausibility: On the (Un)Reliability of Explanations from Large Language Models,” 2024. [Online]. Available: https://arxiv.org/abs/2402.04614

work page arXiv 2024

[44] [44]

Rethinking DPO: The role of rejected responses in preference misalignment,

J. H. Cho, J. Oh, M. Kim, and B.-J. Lee, “Rethinking DPO: The role of rejected responses in preference misalignment,” inFindings of the Association for Computational Linguistics: EMNLP 2025, C. Christodoulopoulos, T. Chakraborty, C. Rose, and V . Peng, Eds. Suzhou, China: Association for Computational Linguistics, Nov. 2025, pp. 8159–8176. [Online]. Avail...

2025

[45] [45]

Direct preference optimization: your language model is secretly a reward model,

R. Rafailov, A. Sharma, E. Mitchell, S. Ermon, C. D. Manning, and C. Finn, “Direct preference optimization: your language model is secretly a reward model,” inProceedings of the 37th International Conference on Neural Information Processing Sys- tems, ser. NIPS ’23. Red Hook, NY , USA: Curran Associates Inc., 2023

2023

[46] [46]

Qlora: efficient finetuning of quantized llms,

T. Dettmers, A. Pagnoni, A. Holtzman, and L. Zettlemoyer, “Qlora: efficient finetuning of quantized llms,” inProceedings of the 37th International Conference on Neural Information Pro- cessing Systems, ser. NIPS ’23. Red Hook, NY , USA: Curran Associates Inc., 2023

2023

[47] [47]

Lora: Low-rank adap- tation of large language models,

E. J. Hu, Y . Shen, P. Wallis, Z. Allen-Zhu, Y . Li, S. Wang, L. Wang, and W. Chen, “Lora: Low-rank adap- tation of large language models,” inICLR 2022, April 2022. [Online]. Available: https://www.microsoft.com/en-us/research/ publication/lora-low-rank-adaptation-of-large-language-models/

2022

[48] [48]

speechocean762: An Open-Source Non- Native English Speech Corpus for Pronunciation Assessment,

J. Zhang, Z. Zhang, Y . Wang, Z. Yan, Q. Song, Y . Huang, K. Li, D. Povey, and Y . Wang, “speechocean762: An Open-Source Non- Native English Speech Corpus for Pronunciation Assessment,” in Interspeech 2021, 2021, pp. 3710–3714

2021