Languages in Whisper-Style Speech Encoders Align Both Phonetically and Semantically

Alessandro Vietti; Barbara Plank; Chengzhi Martin Hu; Domenico De Cristofaro; Ryan Soh-Eun Shim

arxiv: 2505.19606 · v2 · submitted 2025-05-26 · 💻 cs.CL

Ryan Soh-Eun Shim, Domenico De Cristofaro, Chengzhi Martin Hu, Alessandro Vietti, Barbara Plank

Pith reviewed 2026-05-19 13:42 UTC · model grok-4.3

keywords: cross-lingual alignment · speech encoders · semantic alignment · phonetic control · spoken translation retrieval · low-resource ASR · early exiting · Whisper-style models

The pith

Speech encoders align languages semantically in final layers even after phonetic cues are removed by pronunciation controls.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper tests whether cross-lingual alignment reported in Whisper-style speech encoders stems only from phonetic overlap or also from semantic similarity. Using spoken translation retrieval on data where equivalent utterances have been pronunciation-controlled to eliminate sound similarity, the authors find retrieval stays strongly above chance in the final layers, especially for models trained on translation tasks. This indicates the encoders capture meaning that transfers across languages. The work further shows that early-exiting the encoder yields representations less bound to language-specific semantics and improves automatic speech recognition accuracy on low-resource languages unseen in training.

Core claim

Cross-lingual alignment in Whisper-style speech encoders arises from both phonetic and semantic similarity. Pronunciation-controlled experiments show that spoken translation retrieval remains strongly above chance without phonetic cues in the final layers of encoders trained with a speech translation objective, most clearly for models additionally trained on translation. Early-exiting the encoder induces representations hypothesized to be less tied to language-specific semantics and produces performance gains in automatic speech recognition on low-resource languages.

What carries the argument

Pronunciation-controlled spoken translation retrieval using representational similarity, which isolates semantic alignment by eliminating phonetic overlap between equivalent utterances.
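
As a rough illustration of the retrieval task, the sketch below scores each query utterance against every candidate translation and reports Recall@K. It assumes mean pooling plus cosine similarity as the similarity function, which is a simpler stand-in and not the SeqSim metric the paper uses. The function names and the convention that `queries[i]` translates `candidates[i]` are hypothetical.

```python
import math


def mean_pool(frames):
    """Average a sequence of frame vectors into one utterance vector."""
    dim = len(frames[0])
    return [sum(frame[d] for frame in frames) / len(frames) for d in range(dim)]


def cosine(u, v):
    """Cosine similarity between two vectors (0.0 if either has zero norm)."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm if norm else 0.0


def recall_at_k(queries, candidates, k):
    """Fraction of queries whose true translation is ranked in the top k.

    Assumption: queries[i] and candidates[i] are translations of each other,
    and each utterance is a list of frame vectors taken from one encoder layer.
    """
    query_vecs = [mean_pool(q) for q in queries]
    cand_vecs = [mean_pool(c) for c in candidates]
    hits = 0
    for i, q in enumerate(query_vecs):
        scores = [cosine(q, c) for c in cand_vecs]
        ranked = sorted(range(len(cand_vecs)), key=lambda j: scores[j], reverse=True)
        if i in ranked[:k]:
            hits += 1
    return hits / len(queries)
```

Running `recall_at_k` once per encoder layer would give a layer-wise curve of the kind the figures describe.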

If this is right

Translation training strengthens semantic alignment visible in the deepest encoder layers.
Final-layer representations support cross-lingual retrieval based on meaning rather than sound patterns.
Early-exiting reduces ties to language-specific semantics and raises ASR accuracy on low-resource languages.
Semantic alignment in these encoders enables meaning-based transfer across languages.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

Semantic alignment may support direct cross-lingual speech tasks without needing phonetic bridges.
Early-exiting offers a practical way to adapt the same encoder to new languages with limited data.
Earlier layers may retain more phonetic alignment that still aids initial speech processing.

Load-bearing premise

The pronunciation-controlled experimental setup successfully removes phonetic similarity between equivalent utterances so that retrieval performance can be attributed to semantic factors rather than residual sound overlap.
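
One part of the control described in the paper's setup is dropping Japanese samples that contain katakana, since that script is largely reserved for loanwords. A minimal sketch of such a filter follows. The Unicode ranges and function names are assumptions made for illustration, and the paper's own filtering code is not shown in the source.

```python
def contains_katakana(text):
    """Return True if the text contains any katakana letter.

    Assumed ranges: full-width katakana letters (U+30A1 to U+30FA) and
    half-width katakana letters (U+FF66 to U+FF9D). Punctuation marks in the
    katakana block, such as the prolonged sound mark, are not counted on
    their own.
    """
    return any(
        "\u30a1" <= ch <= "\u30fa" or "\uff66" <= ch <= "\uff9d"
        for ch in text
    )


def filter_pairs(pairs):
    """Keep only (source_text, japanese_text) pairs whose Japanese side has no katakana."""
    return [(src, jpn) for src, jpn in pairs if not contains_katakana(jpn)]


pairs = [
    ("computer", "コンピューター"),       # loanword written in katakana
    ("I am a student", "私は学生です"),   # kanji and hiragana only
]
kept = filter_pairs(pairs)  # only the second pair survives
```

A script-based filter like this removes one obvious source of sound overlap, but it cannot by itself show that no phonetic similarity remains.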

What would settle it

Retrieval accuracy falling to chance levels in the final layers under the pronunciation-controlled setup would indicate that alignment depends on phonetic rather than semantic factors.

Figures

Figures reproduced from arXiv: 2505.19606 by Alessandro Vietti, Barbara Plank, Chengzhi Martin Hu, Domenico De Cristofaro, Ryan Soh-Eun Shim.

**Figure 1.** Spoken translation retrieval results in Whisper-large-v2. Plot shows R@1, R@5, and R@10, micro-averaged across language pairs. Shaded regions indicate 95% Wilson confidence intervals. Dashed line is a random baseline. We observe that even after filtering out potential pronunciation shortcuts, semantic-based retrieval remains strongly above chance towards the later layers for all R@K values.

**Figure 2.** Spoken translation retrieval results in models with and without speech translation objective. Plots show R@1, R@5, and R@10, micro-averaged across language pairs. Shaded regions indicate 95% Wilson confidence intervals. Dashed line is a random baseline. The model with an additional speech translation objective (Normal) shows stronger retrieval accuracy than the model without such an objective in the final …
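
The captions mention 95% Wilson confidence intervals and a random baseline. As a sketch of what those quantities are, the code below computes a Wilson score interval for a recall estimate and the chance level for R@K. The function names are hypothetical, and the chance level assumes one correct translation among N equally likely candidates.

```python
import math


def wilson_interval(successes, n, z=1.96):
    """Wilson score interval for a binomial proportion (z=1.96 gives about 95%)."""
    if n <= 0:
        raise ValueError("n must be positive")
    p = successes / n
    denom = 1 + z * z / n
    centre = (p + z * z / (2 * n)) / denom
    half_width = z * math.sqrt(p * (1 - p) / n + z * z / (4 * n * n)) / denom
    return max(0.0, centre - half_width), min(1.0, centre + half_width)


def random_baseline(k, n_candidates):
    """Chance level for R@K with one correct item among n_candidates."""
    return min(k, n_candidates) / n_candidates


def micro_average(per_pair_counts):
    """Pool (hits, total) counts across language pairs before dividing."""
    hits = sum(h for h, _ in per_pair_counts)
    total = sum(t for _, t in per_pair_counts)
    return hits, total
```

Micro-averaging here means pooling hits and query counts over all language pairs first, so larger pairs weigh more; the pooled counts can then be passed to `wilson_interval`.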

Original abstract

Cross-lingual alignment in pretrained language models enables knowledge transfer across languages. Similar alignment has been reported in Whisper-style speech encoders, based on spoken translation retrieval using representational similarity. However, prior work does not control for phonetic overlap between equivalent utterances, which may artificially support retrieval. We conduct pronunciation-controlled experiments to test whether cross-lingual alignment arises from semantic rather than phonetic similarity. Results show that spoken translation retrieval remains strongly above chance without phonetic cues in the final layers of encoders trained with a speech translation objective, most clearly for models additionally trained on translation. We further test early-exiting the encoder to induce representations we hypothesize to be less tied to language-specific semantics. These experiments indeed reveal performance gains in automatic speech recognition on low-resource languages unseen during training.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Referee Report

1 major / 2 minor

Summary. The paper examines cross-lingual alignment in Whisper-style speech encoders, testing whether spoken translation retrieval reflects phonetic or semantic similarity. Through pronunciation-controlled experiments that aim to eliminate phonetic overlap between translation pairs, the authors report that retrieval remains substantially above chance in the final encoder layers, particularly for models fine-tuned on translation objectives. They additionally explore early-exiting strategies and demonstrate gains in automatic speech recognition for low-resource languages unseen during training.

Significance. If the pronunciation controls prove effective, the results would strengthen evidence that semantic alignment emerges in speech encoders trained with translation objectives, separate from low-level acoustic cues. This has implications for multilingual speech representation learning and for techniques like early-exiting to improve ASR on low-resource languages. The empirical focus on controlled retrieval tasks provides a direct test of alignment hypotheses.

major comments (1)

[Methods (pronunciation-controlled setup)] Methods section on pronunciation-controlled experiments: no quantitative metrics (e.g., phoneme edit distance, MFCC cosine similarity, or acoustic embedding distances) are reported to verify that phonetic/acoustic overlap between controlled translation pairs is reduced to chance levels relative to uncontrolled pairs. This validation is load-bearing for attributing above-chance retrieval in final layers to semantic rather than residual phonetic factors.
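
To make the first suggested metric concrete, here is a minimal sketch of a phoneme edit distance between two transcribed utterances. It assumes phoneme sequences are already available as lists of symbols (for example from a grapheme-to-phoneme tool, which is not specified here). The function names are hypothetical, and this check is one the referee proposes, not one the paper reports.

```python
def phoneme_edit_distance(a, b):
    """Levenshtein distance between two phoneme sequences (lists of symbols)."""
    prev = list(range(len(b) + 1))
    for i, pa in enumerate(a, 1):
        curr = [i]
        for j, pb in enumerate(b, 1):
            cost = 0 if pa == pb else 1
            curr.append(min(prev[j] + 1,          # deletion
                            curr[j - 1] + 1,      # insertion
                            prev[j - 1] + cost))  # substitution or match
        prev = curr
    return prev[-1]


def normalized_distance(a, b):
    """Edit distance divided by the longer length: 0.0 is identical, 1.0 is fully different."""
    longest = max(len(a), len(b))
    return phoneme_edit_distance(a, b) / longest if longest else 0.0
```

Comparing the distribution of `normalized_distance` over true translation pairs with that over random non-translation pairs, in both the controlled and uncontrolled sets, would show whether the control brings phonetic overlap down to the level of unrelated utterances.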

minor comments (2)

[Results] Clarify the exact statistical tests and number of runs used to establish 'strongly above chance' performance; include confidence intervals or p-values for the retrieval results.
[Experiments] Specify the precise layer indices or ranges used for 'final layers' and 'early-exiting' across different model sizes.

Simulated Author's Rebuttal

1 response · 0 unresolved

We thank the referee for their constructive feedback, which highlights an important opportunity to strengthen the validation of our pronunciation-controlled experiments. We address the major comment below and will revise the manuscript accordingly.

Referee: Methods section on pronunciation-controlled experiments: no quantitative metrics (e.g., phoneme edit distance, MFCC cosine similarity, or acoustic embedding distances) are reported to verify that phonetic/acoustic overlap between controlled translation pairs is reduced to chance levels relative to uncontrolled pairs. This validation is load-bearing for attributing above-chance retrieval in final layers to semantic rather than residual phonetic factors.

Authors: We agree that explicit quantitative validation of the pronunciation controls is necessary to support the claim that retrieval in final layers reflects semantic rather than residual phonetic similarity. In the revised manuscript we will add a dedicated validation subsection (or table) in the Methods reporting: (i) average phoneme edit distance, (ii) MFCC cosine similarity, and (iii) cosine distance in a frozen acoustic embedding space, each computed between controlled translation pairs versus uncontrolled pairs on the same test sets. These metrics will demonstrate that phonetic/acoustic overlap in the controlled condition is reduced to levels statistically indistinguishable from chance, thereby reinforcing the attribution of above-chance retrieval to semantic alignment.

Revision: yes

Circularity Check

0 steps flagged

No circularity: purely empirical measurements with no derivation or self-referential reduction

The paper reports experimental results from pronunciation-controlled spoken translation retrieval and early-exiting tests on Whisper-style encoders. No equations, first-principles derivations, fitted parameters renamed as predictions, or load-bearing self-citations are present in the abstract or described methodology. The central claim rests on direct empirical retrieval performance above chance, not on any step that reduces by construction to its own inputs. The pronunciation control is an experimental manipulation whose effectiveness is a validity question, not a circularity issue.

Axiom & Free-Parameter Ledger

0 free parameters · 1 axiom · 0 invented entities

The central claim rests on standard assumptions in representation learning that similarity in encoder outputs reflects semantic content and that the pronunciation controls isolate semantics; no free parameters or invented entities are introduced.

axioms (1)

Domain assumption: Representational similarity analysis reliably measures semantic alignment across languages in speech encoders. Invoked when interpreting retrieval performance as evidence of semantic rather than phonetic alignment.

Reference graph

Works this paper leans on

29 extracted references · 29 canonical work pages · 2 internal anchors

[1] Introduction. In speech, a growing body of work has shown speech foundation models to exhibit emergent multilingual capabilities [1, 2], implying the existence of cross-lingual alignment in such models. Prior work probes such alignment through spoken translation retrieval [3, 4], where [4] find Whisper’s encoder to have an accuracy of up to 80% on th...

[2] Related Work. In this section, we review prior work that analyzes cross-lingual alignment in audio representations, along with work that looks into early exiting neural models. Footnote 1: We release the dataset here: https://anonymous.4open.science/r/pronunciation-challenge-set-6214/ (arXiv:2505.19606v2 [cs.CL] 4 Apr 2026) [Figure panels: R@1, R@5, R@10; axis ticks omitted] ...

[3] Experimental Setup. In this section, we describe the methods we employ, along with our dataset construction procedure.

| Langs | Original Test Set | Challenge Set | Sampled Test Set |
|---|---|---|---|
| eng–zho | 427 | 100 | 100 |
| fra–zho | 427 | 108 | 108 |
| deu–zho | 427 | 94 | 94 |
| eng–jpn | 427 | 77 | 77 |
| fra–jpn | 427 | 72 | 72 |
| deu–jpn | 427 | 67 | 67 |

Table 2: Statistics for the FLEURS dataset employed in our study (origina...

[4] ... pipeline to filter out utterances containing proper nouns. For pairs with Japanese, we further remove samples containing katakana script, which are largely reserved for loanwords in Japanese. We pair together typologically and phylogenetically distant languages to avoid the influence of cognates between related varieties, resulting in six language pairs...

[5] Methodology. 4.1. Cross-Lingual Speech Retrieval. To quantify cross-lingual alignment in the encoder of Whisper-style models, we follow prior work in employing translation retrieval as a proxy task. [4] propose SeqSim to quantify the similarity between two sequences of audio embeddings $X = \{x_1, \dots, x_m\}$ and $Y = \{y_1, \dots, y_n\}$ by measuring how well each frame ...

[6] ... show SeqSim outperforms mean pooling and dynamic time warping for spoken translation retrieval. We thus leverage this metric in our work. 4.2. DecoderLens. In this work, we follow [13] in viewing the layers of a transformer as performing incremental updates to latent predictions of the next token. This assumption implies that the hidden states can be dec...

[7] Results. In this section, we detail our results on spoken translation retrieval and early exiting the encoder. 5.1. Cross-lingual speech retrieval is possible without pronunciation cues. First, we compare the results on the full data to our challenge set with the Whisper encoder. Figure 1 shows our spoken translation retrieval results in Recall@K acr...

[8] Discussion and Conclusion. In this work, we revisit the question of cross-lingual alignment in Whisper-style speech foundation models. Through a series of controlled experiments, we demonstrate that spoken translation retrieval remains possible in Whisper-style speech encoders even without phonetic cues, such as cognates and proper nouns. Importantly, we ...

[9] Generative AI Use Disclosure. The authors acknowledge the usage of ChatGPT as an assistant tool in part of the source code’s development and in enhancing the coherence of parts of the manuscript
[10] P. Peng, B. Yan, S. Watanabe, and D. Harwath, “Prompting the hidden talent of web-scale speech models for zero-shot task generalization,” in Interspeech 2023, 2023, pp. 396–400

[11] C.-K. Yang, K.-P. Huang, K.-H. Lu, C.-Y. Kuan, C.-Y. Hsiao, and H.-Y. Lee, “Investigating zero-shot generalizability on mandarin-english code-switched asr and speech-to-text translation of recent foundation models with self-supervision and weak supervision,” in 2024 IEEE International Conference on Acoustics, Speech, and Signal Processing Workshops (IC...

[12] B. M. Abdullah, M. M. Shaik, and D. Klakow, “Wave to interlingua: Analyzing representations of multilingual speech transformers for spoken language translation,” in Interspeech 2024, 2024, pp. 362–366

[13] R. Ma, M. Qian, Y. Fathullah, S. Tang, M. Gales, and K. Knill, “Cross-lingual transfer learning for speech translation,” in Proceedings of the 2025 Conference of the Nations of the Americas Chapter of the Association for Computational Linguistics: Human Language Technologies (Volume 2: Short Papers), L. Chiruzzo, A. Ritter, and L. Wang, Eds. Albuquerque, ...

[14] J. Devlin, M.-W. Chang, K. Lee, and K. Toutanova, “BERT: Pre-training of deep bidirectional transformers for language understanding,” in Proceedings of the 2019 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, Volume 1 (Long and Short Papers), J. Burstein, C. Doran, and T. Solorio, Eds....

[15] A. Conneau and G. Lample, “Cross-lingual language model pretraining,” Advances in neural information processing systems, vol. 32, 2019

[16] K. Choi, A. Pasad, T. Nakamura, S. Fukayama, K. Livescu, and S. Watanabe, “Self-supervised speech representations are more phonetic than semantic,” in Interspeech 2024, 2024, pp. 4578–4582

[17] T. Ògúnrèmí, C. D. Manning, D. Jurafsky, and K. Livescu, “Transcribe, translate, or transliterate: An investigation of intermediate representations in spoken language models,” Proceedings of IEEE ASRU 2025, 2025

[18] Y. Kaya, S. Hong, and T. Dumitras, “Shallow-deep networks: Understanding and mitigating network overthinking,” in International conference on machine learning. PMLR, 2019, pp. 3301–3310

[19] T. Schuster, A. Fisch, J. Gupta, M. Dehghani, D. Bahri, V. Tran, Y. Tay, and D. Metzler, “Confident adaptive language modeling,” Advances in Neural Information Processing Systems, vol. 35, pp. 17456–17472, 2022

[20] D. Halawi, J.-S. Denain, and J. Steinhardt, “Overthinking the truth: Understanding how language models process false demonstrations,” in The Thirteenth International Conference on Learning Representations, 2024

[21] D. Rai, Y. Zhou, S. Feng, A. Saparov, and Z. Yao, “A practical review of mechanistic interpretability for transformer-based language models,” 2025. [Online]. Available: https://arxiv.org/abs/2407.02646

[22] nostalgebraist, “Interpreting gpt: The logit lens,” https://www.lesswrong.com/posts/AcKRB8wDpdaN6v6ru/interpreting-gpt-the-logit-lens, 2020, accessed: 2025-05-19

[23] A. Langedijk, H. Mohebbi, G. Sarti, W. Zuidema, and J. Jumelet, “DecoderLens: Layerwise interpretation of encoder-decoder transformers,” in Findings of the Association for Computational Linguistics: NAACL 2024, K. Duh, H. Gomez, and S. Bethard, Eds. Mexico City, Mexico: Association for Computational Linguistics, Jun. 2024, pp. 4764–4780. [Online]. Availabl...

[24] A. Radford, J. W. Kim, T. Xu, G. Brockman, C. Mcleavey, and I. Sutskever, “Robust speech recognition via large-scale weak supervision,” in Proceedings of the 40th International Conference on Machine Learning, ser. Proceedings of Machine Learning Research, A. Krause, E. Brunskill, K. Cho, B. Engelhardt, S. Sabato, and J. Scarlett, Eds., vol. 202. PMLR, 23– ...

[25] A. Conneau, M. Ma, S. Khanuja, Y. Zhang, V. Axelrod, S. Dalmia, J. Riesa, C. Rivera, and A. Bapna, “Fleurs: Few-shot learning evaluation of universal representations of speech,” in 2022 IEEE Spoken Language Technology Workshop (SLT), 2023, pp. 798–805

[26] M. Honnibal, I. Montani, S. Van Landeghem, and A. Boyd, “spaCy: Industrial-strength Natural Language Processing in Python,” 2020

[27] N. Belrose, Z. Furman, L. Smith, D. Halawi, I. Ostrovsky, L. McKinney, S. Biderman, and J. Steinhardt, “Eliciting latent predictions from transformers with the tuned lens,” 2023. [Online]. Available: https://arxiv.org/abs/2303.08112

[28] A. Yom Din, T. Karidi, L. Choshen, and M. Geva, “Jump to conclusions: Short-cutting transformers with linear transformations,” in Proceedings of the 2024 Joint International Conference on Computational Linguistics, Language Resources and Evaluation (LREC-COLING 2024), N. Calzolari, M.-Y. Kan, V. Hoste, A. Lenci, S. Sakti, and N. Xue, Eds. Torino, Itali...

[29] C.-J. Li, K. Chang, S. Bharadwaj, E. Yeo, K. Choi, J. Zhu, D. Mortensen, and S. Watanabe, “Powsm: A phonetic open whisper-style speech foundation model,” 2026. [Online]. Available: https://arxiv.org/abs/2510.24992
