Are you speaking my languages? On spoken language adherence in multimodal LLMs

Hyungwon Kim; Kandarp Joshi; Lillian Zhou; Pavel Golik; Petar Aleksic

arxiv: 2606.17281 · v1 · pith:CGYUJ425new · submitted 2026-06-15 · 💻 cs.CL · cs.SD· eess.AS

Are you speaking my languages? On spoken language adherence in multimodal LLMs

Hyungwon Kim , Kandarp Joshi , Lillian Zhou , Pavel Golik , Petar Aleksic This is my paper

Pith reviewed 2026-06-27 03:03 UTC · model grok-4.3

classification 💻 cs.CL cs.SDeess.AS

keywords language adherencemultimodal LLMsautomatic speech recognitionsoft promptingzero-shot promptingsupervised fine-tuningchain-of-thought

0 comments

The pith

Soft prompting hints at spoken languages to cut violations in multimodal LLM transcriptions without hurting accuracy.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

Multimodal LLMs for automatic speech recognition often output text in the wrong language even when the input speech is clear. The paper defines this problem as a lack of language adherence, creates a metric to count such violations, and tests a soft prompting method that only hints at possible languages rather than forcing one. It compares this method against zero-shot prompting, supervised fine-tuning, and chain-of-thought reasoning across several languages. Results indicate the approaches lower the rate of language mismatches while ASR performance metrics stay stable. Trade-offs among the methods are discussed to match different levels of available compute.

Core claim

A soft prompting approach that hints at potential spoken languages without strictly constraining the output, together with zero-shot prompting, supervised fine-tuning, and chain-of-thought reasoning, reduces language adherence violations in multimodal LLMs for ASR while maintaining overall transcription performance across multiple languages.

What carries the argument

Soft prompting approach that hints at potential spoken languages without strictly constraining output.

If this is right

Transcriptions show higher fidelity to the language actually spoken in the audio.
Code-switching remains possible when speakers change languages mid-utterance.
Downstream applications that rely on the transcript receive cleaner input.
Method choice can be matched to available compute budgets without sacrificing adherence gains.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

The same soft-hint technique could be tested on other output constraints such as style or format adherence in multimodal models.
The new violation metric offers a template for measuring similar consistency issues in text-only or vision-language settings.
Combining the three mitigation strategies might produce additive gains when compute allows.

Load-bearing premise

Language adherence is a separable behavior that prompting and fine-tuning can control independently of the model's core transcription accuracy.

What would settle it

A test in which the proposed soft prompting and mitigation strategies produce no drop in measured language violation rate or cause a clear rise in word error rate on the same evaluation sets.

read the original abstract

While Large Language Model (LLM) based Automatic Speech Recognition (ASR) enables seamless multilingual use, models often misidentify the output language, compromising transcription fidelity and downstream application quality. To preserve flexibility and code-switching capabilities, we propose a soft prompting approach that hints at potential spoken languages without strictly constraining the output. We formally define this challenge as a lack of language adherence, introduce a novel metric to quantify violations, and evaluate three mitigation strategies: (1) zero-shot prompting for robust guidance under uncertainty, (2) supervised fine-tuning (SFT) to improve prompt adherence, and (3) Chain-of-Thought (CoT) reasoning to enforce adherence during decoding. We present a comparative analysis of these methods across multiple languages, evaluating effectiveness in reducing the language violation while maintaining overall ASR performance. Finally, we discuss trade-offs to guide strategy selection under various compute constraints.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Desk Editor's Note private letter to a colleague

The paper defines language adherence violations in LLM ASR and tests soft prompting plus SFT and CoT as fixes, but the abstract supplies zero numbers so effectiveness stays unproven.

read the letter

The core takeaway is that this work flags a real deployment headache in multilingual LLM-based ASR where the model outputs the wrong language despite clear input, then offers a soft prompting method that suggests languages without locking them in, along with a new metric to count the violations.

What is actually new is the metric for quantifying adherence failures and the framing of soft prompting as a way to retain code-switching flexibility. The authors lay out three concrete strategies—zero-shot prompting for guidance, supervised fine-tuning for better prompt following, and chain-of-thought to steer decoding—and compare them across languages while trying to hold ASR quality steady. That practical focus on trade-offs under different compute limits is useful.

The paper does a clean job stating the problem and the mitigation options without overclaiming. It targets engineers who need reliable transcription for mixed-language users rather than trying to rewrite how LLMs work.

The main soft spot is the complete absence of quantitative results, datasets, or error breakdowns in the abstract. Without those, there is no way to check whether the strategies actually cut violations or just shift the problem. The assumption that adherence can be isolated and measured independently from core transcription ability also looks optimistic; if the model’s language choice errors stem from deeper limitations, the metric itself could be noisy. The stress-test note is right that nothing internally contradictory appears, but that does not substitute for evidence.

This is for ASR practitioners and multilingual system builders who already deal with prompting and fine-tuning. A reader in that area could pick up the metric and the three-way comparison as starting points. It deserves peer review because the problem is concrete and the proposed directions are straightforward, even if the current version reads as a proposal rather than a finished result.

Referee Report

2 major / 0 minor

Summary. The paper addresses language adherence failures in LLM-based multilingual ASR, where models misidentify output languages despite input speech. It formally defines the problem, introduces a novel metric to quantify violations, proposes a soft prompting method that hints at spoken languages without hard constraints to preserve flexibility and code-switching, and evaluates three mitigation strategies—zero-shot prompting, supervised fine-tuning (SFT), and Chain-of-Thought (CoT) reasoning—claiming they reduce violations while preserving overall ASR performance across languages.

Significance. If the empirical comparisons hold, the work offers practical, low-overhead strategies for improving output language consistency in multimodal ASR without sacrificing the flexibility that makes these models attractive for real-world multilingual use, potentially informing prompt design and fine-tuning practices under varying compute budgets.

major comments (2)

[Abstract] Abstract: The abstract states that the authors 'present a comparative analysis of these methods across multiple languages, evaluating effectiveness in reducing the language violation while maintaining overall ASR performance,' yet provides no quantitative results, datasets, error analysis, or tables/figures summarizing the trade-offs; without this evidence the central claim that the three strategies succeed cannot be assessed.
The weakest assumption—that language adherence is a separable, prompt-controllable behavior independent of core transcription capability—is not tested or discussed; if the metric itself is affected by the same model limitations that cause violations, the reported reductions may be artifacts of the evaluation rather than genuine improvements.

Simulated Author's Rebuttal

2 responses · 0 unresolved

We thank the referee for the constructive comments on our work regarding language adherence in multimodal LLMs for ASR. We address each major comment below with clarifications from the manuscript and indicate planned revisions where appropriate.

read point-by-point responses

Referee: [Abstract] Abstract: The abstract states that the authors 'present a comparative analysis of these methods across multiple languages, evaluating effectiveness in reducing the language violation while maintaining overall ASR performance,' yet provides no quantitative results, datasets, error analysis, or tables/figures summarizing the trade-offs; without this evidence the central claim that the three strategies succeed cannot be assessed.

Authors: The abstract is intentionally concise and high-level, following standard conventions that reserve specific quantitative results, dataset details, and figure references for the body of the paper. The full manuscript includes these elements: Section 4 presents comparative results across languages with tables reporting language violation rates and ASR metrics (e.g., WER) for zero-shot, SFT, and CoT strategies, along with error analysis in the appendix. To strengthen the abstract's standalone readability while preserving its brevity, we will revise it to incorporate one or two key quantitative highlights summarizing the trade-offs. revision: yes
Referee: The weakest assumption—that language adherence is a separable, prompt-controllable behavior independent of core transcription capability—is not tested or discussed; if the metric itself is affected by the same model limitations that cause violations, the reported reductions may be artifacts of the evaluation rather than genuine improvements.

Authors: Our experimental design directly tests separability by jointly evaluating language adherence reductions alongside preservation of core ASR performance (WER and other transcription metrics) across all strategies and languages. This ensures reported improvements are not artifacts, as any degradation in transcription capability would be visible in the results. We acknowledge that an explicit discussion of potential metric correlations with model limitations was not included; we will add a dedicated paragraph in the Discussion section addressing this assumption, its empirical support from our joint metrics, and remaining limitations. revision: partial

Circularity Check

0 steps flagged

No significant circularity

full rationale

The paper is an empirical study of prompting, SFT, and CoT strategies for reducing language violations in multimodal ASR. It introduces a novel metric and evaluates mitigation approaches through experiments across languages, with no equations, parameter fitting, derivations, or self-citation chains that reduce claims to inputs by construction. All load-bearing steps rely on external experimental comparisons rather than self-referential definitions or renamings.

Axiom & Free-Parameter Ledger

0 free parameters · 0 axioms · 0 invented entities

Abstract-only review reveals no mathematical derivations, fitted constants, or new postulated entities; the contribution is empirical evaluation of prompting strategies on a newly named failure mode.

pith-pipeline@v0.9.1-grok · 5693 in / 1125 out tokens · 54342 ms · 2026-06-27T03:03:49.354741+00:00 · methodology

discussion (0)

Reference graph

Works this paper leans on

25 extracted references · 8 canonical work pages

[1]

Brouhaha: Multi-task training for voice activity detection, speech-to-noise ratio, and C50 room acoustics estimation,

Mingqiu Wang and Wei Han and Izhak Shafran and Zelin Wu and Chung-Cheng Chiu and Yuan Cao and Yongqiang Wang and Nanxin Chen and Yu Zhang and Hagen Soltau and Paul Rubenstein and Lukas Zilka and Dian Yu and Zhong Meng and Golan Pundak and Nikhil Siddhartha and Johan Schalkwyk and Yonghui Wu , booktitle=ASRU, title=. doi:10.1109/ASRU57964.2023.10389703 , address=

work page doi:10.1109/asru57964.2023.10389703 2023
[2]

Audiocomposer: Towards fine-grained audio generation with natural language descriptions,

Kim, Jaeyoung and Mavandadi, Sepand and Audhkhasi, Kartik and Bharadwaj, Shikhar and Farris, Brian and Chen, Tongzhou and Ramabhadran, Bhuvana and Ganapathy, Sriram , booktitle=ICASSP, title=. doi:10.1109/ICASSP49660.2025.10887912 , address=

work page doi:10.1109/icassp49660.2025.10887912 2025
[3]

Conformer: Convolution-augmented Transformer for Speech Recognition , author=
[4]

Ankur Bapna and Colin Cherry and Yu Zhang and Ye Jia and Melvin Johnson and Yong Cheng and Simran Khanuja and Jason Riesa and Alexis Conneau , year=2022, eprint=

2022
[5]

doi:10.1109/ICCV.2019.00756 , url =

Sun, Chen and Myers, Austin and Vondrick, Carl and Murphy, Kevin and Schmid, Cordelia , booktitle =. doi:10.1109/ICCV.2019.00756 , url =

work page doi:10.1109/iccv.2019.00756 2019
[6]

Fused Acoustic and Text Encoding for Multimodal Bilingual Pretraining and Speech Translation , author =
[7]

Emond, Jesse and Ramabhadran, Bhuvana and Roark, Brian and Moreno, Pedro and Ma, Min , booktitle=SLT, title=
[8]

Language ID in the Wild: Unexpected Challenges on the Path to a Thousand-Language Web Text Corpus

Caswell, Isaac and Breiner, Theresa and van Esch, Daan and Bapna, Ankur. Language ID in the Wild: Unexpected Challenges on the Path to a Thousand-Language Web Text Corpus. International Conference on Computational Linguistics (COLING). doi:10.18653/v1/2020.coling-main.579

work page doi:10.18653/v1/2020.coling-main.579 2020
[9]

doi:10.21437/Interspeech.2008 , url=

On the use of a multilingual neural network front-end , author=. doi:10.21437/Interspeech.2008 , url=

work page doi:10.21437/interspeech.2008 2008
[10]

doi:10.21437/ICSLP.2002-178 , address =

Ma, Bin and Guan, Cuntai and Li, Haizhou and Lee, Chin-Hui , booktitle = ICSLP, year = 2002, month = sep, title =. doi:10.21437/ICSLP.2002-178 , address =

work page doi:10.21437/icslp.2002-178 2002
[11]

Waters, Austin and Gaur, Neeraj and Haghani, Parisa and Moreno, Pedro and Qu, Zhongdi , booktitle=ASRU, title=
[12]

Lopez-Moreno, Ignacio and Gonzalez-Dominguez, Javier and Plchot, Oldrich and Martinez, David and Gonzalez-Rodriguez, Joaquin and Moreno, Pedro , booktitle=ICASSP, title=
[13]

and Inouye, Jon W.T

Cole, Ronald A. and Inouye, Jon W.T. and Muthusamy, Yeshwant K. and Gopalakrishnan, Murali , booktitle=. Language identification with neural networks: a feasibility study , year=1989, month=jun, pages=

1989
[14]

A Language Agnostic Multilingual Streaming On-Device

Bo Li and Tara Sainath and Ruoming Pang and Shuo-Yiin Chang and Qiumin Xu and Trevor Strohman and Vince Chen and Qiao Liang and Heguang Liu and Yanzhang He and Parisa Haghani and Sameer Bidichandani , year = 2022, booktitle = INTERSPEECH, address =. A Language Agnostic Multilingual Streaming On-Device

2022
[15]

Streaming End-to-End Multilingual Speech Recognition with Joint Language Identification , author =
[16]

, booktitle=ASRU, title=

Watanabe, Shinji and Hori, Takaaki and Hershey, John R. , booktitle=ASRU, title=
[17]

Language-agnostic Multilingual Modelling , author =
[18]

and Sim, Khe Chai and Bacchiani, Michiel and Weinstein, Eugene and Nguyen, Patrick and Chen, Zhifeng and Wu, Yanghui and Rao, Kanishka , title =

Li, Bo and Sainath, Tara N. and Sim, Khe Chai and Bacchiani, Michiel and Weinstein, Eugene and Nguyen, Patrick and Chen, Zhifeng and Wu, Yanghui and Rao, Kanishka , title =. doi:10.1109/ICASSP.2018.8461886 , booktitle = ICASSP, pages =

work page doi:10.1109/icassp.2018.8461886 2018
[19]

Comparison of Different Neural Network Architectures for Spoken Language Identification , year=2023, month=sep, address=

Bazazo, Tala and Zeineldeen, Mohammad and Plahl, Christian and Schlueter, Ralf and Ney, Hermann , booktitle=. Comparison of Different Neural Network Architectures for Spoken Language Identification , year=2023, month=sep, address=

2023
[20]

Proceedings of SDAIR-94, 3rd annual symposium on document analysis and information retrieval , volume=

N-gram-based text categorization , author=. Proceedings of SDAIR-94, 3rd annual symposium on document analysis and information retrieval , volume=
[21]

Conference on Language Modeling (COLM) , year=2024, month=oct, address=

Automata-based constraints for language model decoding , author=. Conference on Language Modeling (COLM) , year=2024, month=oct, address=

2024
[22]

Plug and Play Language Models: A Simple Approach to Controlled Text Generation , author=
[23]

Data , volume =

Qasim, Mamtimin and Silamu, Wushour , title =. Data , volume =
[24]

doi:10.48550/arXiv.2506.00087 , address=

Peng Xie and Xingyuan Liu and Tsz Wai Chan and Yequan Bie and Yangqiu Song and Yang Wang and Hao Chen and Kani Chen , booktitle=NEURIPS, title=. doi:10.48550/arXiv.2506.00087 , address=

work page doi:10.48550/arxiv.2506.00087
[25]

2025 , url =

Gemini 2.0 Flash Lite Model Card , institution =. 2025 , url =

2025

[1] [1]

Brouhaha: Multi-task training for voice activity detection, speech-to-noise ratio, and C50 room acoustics estimation,

Mingqiu Wang and Wei Han and Izhak Shafran and Zelin Wu and Chung-Cheng Chiu and Yuan Cao and Yongqiang Wang and Nanxin Chen and Yu Zhang and Hagen Soltau and Paul Rubenstein and Lukas Zilka and Dian Yu and Zhong Meng and Golan Pundak and Nikhil Siddhartha and Johan Schalkwyk and Yonghui Wu , booktitle=ASRU, title=. doi:10.1109/ASRU57964.2023.10389703 , address=

work page doi:10.1109/asru57964.2023.10389703 2023

[2] [2]

Audiocomposer: Towards fine-grained audio generation with natural language descriptions,

Kim, Jaeyoung and Mavandadi, Sepand and Audhkhasi, Kartik and Bharadwaj, Shikhar and Farris, Brian and Chen, Tongzhou and Ramabhadran, Bhuvana and Ganapathy, Sriram , booktitle=ICASSP, title=. doi:10.1109/ICASSP49660.2025.10887912 , address=

work page doi:10.1109/icassp49660.2025.10887912 2025

[3] [3]

Conformer: Convolution-augmented Transformer for Speech Recognition , author=

[4] [4]

Ankur Bapna and Colin Cherry and Yu Zhang and Ye Jia and Melvin Johnson and Yong Cheng and Simran Khanuja and Jason Riesa and Alexis Conneau , year=2022, eprint=

2022

[5] [5]

doi:10.1109/ICCV.2019.00756 , url =

Sun, Chen and Myers, Austin and Vondrick, Carl and Murphy, Kevin and Schmid, Cordelia , booktitle =. doi:10.1109/ICCV.2019.00756 , url =

work page doi:10.1109/iccv.2019.00756 2019

[6] [6]

Fused Acoustic and Text Encoding for Multimodal Bilingual Pretraining and Speech Translation , author =

[7] [7]

Emond, Jesse and Ramabhadran, Bhuvana and Roark, Brian and Moreno, Pedro and Ma, Min , booktitle=SLT, title=

[8] [8]

Language ID in the Wild: Unexpected Challenges on the Path to a Thousand-Language Web Text Corpus

Caswell, Isaac and Breiner, Theresa and van Esch, Daan and Bapna, Ankur. Language ID in the Wild: Unexpected Challenges on the Path to a Thousand-Language Web Text Corpus. International Conference on Computational Linguistics (COLING). doi:10.18653/v1/2020.coling-main.579

work page doi:10.18653/v1/2020.coling-main.579 2020

[9] [9]

doi:10.21437/Interspeech.2008 , url=

On the use of a multilingual neural network front-end , author=. doi:10.21437/Interspeech.2008 , url=

work page doi:10.21437/interspeech.2008 2008

[10] [10]

doi:10.21437/ICSLP.2002-178 , address =

Ma, Bin and Guan, Cuntai and Li, Haizhou and Lee, Chin-Hui , booktitle = ICSLP, year = 2002, month = sep, title =. doi:10.21437/ICSLP.2002-178 , address =

work page doi:10.21437/icslp.2002-178 2002

[11] [11]

Waters, Austin and Gaur, Neeraj and Haghani, Parisa and Moreno, Pedro and Qu, Zhongdi , booktitle=ASRU, title=

[12] [12]

Lopez-Moreno, Ignacio and Gonzalez-Dominguez, Javier and Plchot, Oldrich and Martinez, David and Gonzalez-Rodriguez, Joaquin and Moreno, Pedro , booktitle=ICASSP, title=

[13] [13]

and Inouye, Jon W.T

Cole, Ronald A. and Inouye, Jon W.T. and Muthusamy, Yeshwant K. and Gopalakrishnan, Murali , booktitle=. Language identification with neural networks: a feasibility study , year=1989, month=jun, pages=

1989

[14] [14]

A Language Agnostic Multilingual Streaming On-Device

Bo Li and Tara Sainath and Ruoming Pang and Shuo-Yiin Chang and Qiumin Xu and Trevor Strohman and Vince Chen and Qiao Liang and Heguang Liu and Yanzhang He and Parisa Haghani and Sameer Bidichandani , year = 2022, booktitle = INTERSPEECH, address =. A Language Agnostic Multilingual Streaming On-Device

2022

[15] [15]

Streaming End-to-End Multilingual Speech Recognition with Joint Language Identification , author =

[16] [16]

, booktitle=ASRU, title=

Watanabe, Shinji and Hori, Takaaki and Hershey, John R. , booktitle=ASRU, title=

[17] [17]

Language-agnostic Multilingual Modelling , author =

[18] [18]

and Sim, Khe Chai and Bacchiani, Michiel and Weinstein, Eugene and Nguyen, Patrick and Chen, Zhifeng and Wu, Yanghui and Rao, Kanishka , title =

Li, Bo and Sainath, Tara N. and Sim, Khe Chai and Bacchiani, Michiel and Weinstein, Eugene and Nguyen, Patrick and Chen, Zhifeng and Wu, Yanghui and Rao, Kanishka , title =. doi:10.1109/ICASSP.2018.8461886 , booktitle = ICASSP, pages =

work page doi:10.1109/icassp.2018.8461886 2018

[19] [19]

Comparison of Different Neural Network Architectures for Spoken Language Identification , year=2023, month=sep, address=

Bazazo, Tala and Zeineldeen, Mohammad and Plahl, Christian and Schlueter, Ralf and Ney, Hermann , booktitle=. Comparison of Different Neural Network Architectures for Spoken Language Identification , year=2023, month=sep, address=

2023

[20] [20]

Proceedings of SDAIR-94, 3rd annual symposium on document analysis and information retrieval , volume=

N-gram-based text categorization , author=. Proceedings of SDAIR-94, 3rd annual symposium on document analysis and information retrieval , volume=

[21] [21]

Conference on Language Modeling (COLM) , year=2024, month=oct, address=

Automata-based constraints for language model decoding , author=. Conference on Language Modeling (COLM) , year=2024, month=oct, address=

2024

[22] [22]

Plug and Play Language Models: A Simple Approach to Controlled Text Generation , author=

[23] [23]

Data , volume =

Qasim, Mamtimin and Silamu, Wushour , title =. Data , volume =

[24] [24]

doi:10.48550/arXiv.2506.00087 , address=

Peng Xie and Xingyuan Liu and Tsz Wai Chan and Yequan Bie and Yangqiu Song and Yang Wang and Hao Chen and Kani Chen , booktitle=NEURIPS, title=. doi:10.48550/arXiv.2506.00087 , address=

work page doi:10.48550/arxiv.2506.00087

[25] [25]

2025 , url =

Gemini 2.0 Flash Lite Model Card , institution =. 2025 , url =

2025