Do Factual Recall Mechanisms Carry over from Text to Speech in Multimodal Language Models?

Filip Landin; Gabriel Skantze; Livia Qian; Luca Modica; Mehrdad Farahani; Richard Johansson

arxiv: 2605.22170 · v1 · pith:4ZEBHMDVnew · submitted 2026-05-21 · 💻 cs.CL

Luca Modica, Filip Landin, Mehrdad Farahani, Livia Qian, Gabriel Skantze, Richard Johansson

Pith reviewed 2026-05-22 06:22 UTC · model grok-4.3

classification 💻 cs.CL

keywords: factual recall · multimodal language models · speech language models · causal mediation analysis · modality differences · discrete speech tokens

The pith

Factual recall mechanisms identified in text models transfer only partially to the speech pathway in multimodal models that use discrete speech tokens.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper tests whether the internal mechanisms that let language models store and retrieve facts work the same way for spoken input as for text. It applies causal mediation analysis, a method previously used on text-only models, to SpiritLM, a multimodal model that represents speech as discrete tokens and processes them alongside text. The analysis finds discrepancies between text-to-text and speech-to-text mediation patterns. This matters because such differences bear on how reliably a speech-enabled system can answer a factual question when it is asked by voice rather than by text.

Core claim

When causal mediation analysis is run on SpiritLM, the locations and strengths of factual-association effects observed for speech inputs diverge from those seen for text inputs, indicating that the emergent mechanisms supporting factual recall are only partially shared between the two modalities.

What carries the argument

Causal mediation analysis applied to the speech pathway of a multimodal model that encodes speech as discrete tokens, used to locate and quantify factual recall effects.
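The three-run procedure behind causal mediation analysis (clean run, corrupted run, corrupted-with-restoration run) can be sketched on a toy network. This is a minimal illustration and not the authors' code: the model, its sizes, the noise scale, and the names `forward` and `indirect_effects` are all assumptions made for the example, and the toy has nothing to do with SpiritLM's actual architecture.

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical toy model: 5 token positions, 3 residual layers, 4 candidate answers.
N_TOKENS, D, N_LAYERS, N_ANSWERS = 5, 8, 3, 4
W = [rng.normal(scale=0.5, size=(D, D)) for _ in range(N_LAYERS)]
W_OUT = rng.normal(size=(D, N_ANSWERS))


def forward(x, patch=None):
    """Return answer probabilities and the hidden states after every layer.

    patch = (layer, token, vector) overwrites one hidden state once it is computed.
    """
    h = x
    states = []
    positions = np.arange(1, len(x) + 1)[:, None]
    for layer, w in enumerate(W):
        context = np.cumsum(h, axis=0) / positions  # causal running mean over earlier tokens
        h = h + np.tanh(context @ w)                # residual update
        if patch is not None and patch[0] == layer:
            h[patch[1]] = patch[2]                  # restore a single clean state
        states.append(h.copy())
    logits = h[-1] @ W_OUT                          # prediction is read from the last token
    probs = np.exp(logits - logits.max())
    return probs / probs.sum(), states


def indirect_effects(x_clean, subject_positions, answer, noise_scale=3.0):
    """Indirect effect of restoring each (layer, token) state in a corrupted run."""
    p_clean, clean_states = forward(x_clean)        # 1. clean run
    x_corrupt = x_clean.copy()
    noise = rng.normal(scale=noise_scale, size=(len(subject_positions), D))
    x_corrupt[subject_positions] += noise           # 2. corrupt the subject embeddings
    p_corrupt, _ = forward(x_corrupt)
    ie = np.zeros((N_LAYERS, N_TOKENS))
    for layer in range(N_LAYERS):                   # 3. corrupted-with-restoration runs
        for token in range(N_TOKENS):
            restored, _ = forward(x_corrupt, patch=(layer, token, clean_states[layer][token]))
            ie[layer, token] = restored[answer] - p_corrupt[answer]
    return p_clean[answer], p_corrupt[answer], ie


x = rng.normal(size=(N_TOKENS, D))
answer = int(np.argmax(forward(x)[0]))
p_clean, p_corrupt, ie = indirect_effects(x, subject_positions=[0, 1], answer=answer)
print("clean:", round(float(p_clean), 3), "corrupted:", round(float(p_corrupt), 3))
print(np.round(ie, 3))
```

In this toy, restoring the last token at the last layer recovers the clean prediction exactly, while restoring any other token at the last layer has no effect, because the output is read only from the final position. A real analysis would patch attention or MLP outputs inside the transformer rather than whole hidden states.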

If this is right

Speech models may need separate interventions to correct factual errors rather than reusing text-based fixes.
Error patterns in knowledge questions asked by voice may differ systematically from those asked by typing.
Joint training on discrete tokens does not automatically align the internal storage of facts across modalities.
Developers of voice assistants could target modality-specific layers to improve factual reliability.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the authors make directly.

Future work could check whether the partial transfer holds in models that use continuous rather than discrete speech representations.
The finding raises the possibility that scaling speech data alone will not close the gap in factual mechanisms.
Similar mediation tests on other multimodal models could reveal whether the partial carry-over is specific to discrete-token approaches.

Load-bearing premise

Causal mediation analysis, already shown to find factual-recall circuits in text models, will locate comparable circuits when applied to the speech side of a model trained on discrete speech tokens.

What would settle it

The partial-transfer claim would be refuted if the same mediation interventions, run on speech inputs, showed the same model components producing effects of the same size on factual recall as they do for text inputs.
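One way to make this test concrete is to compare the per-component effect profiles of the two modalities directly. The sketch below is an assumption about how such a comparison could be set up, not a procedure from the paper: the function name `compare_aie_profiles` and the chosen summary statistics are hypothetical, and the two profiles are synthetic placeholders rather than reported results.

```python
import numpy as np


def compare_aie_profiles(aie_text, aie_speech):
    """Summarise how closely two per-component effect profiles agree."""
    aie_text = np.asarray(aie_text, dtype=float)
    aie_speech = np.asarray(aie_speech, dtype=float)
    if aie_text.shape != aie_speech.shape:
        raise ValueError("profiles must cover the same components")
    return {
        "peak_text": int(np.argmax(aie_text)),      # component with the largest text effect
        "peak_speech": int(np.argmax(aie_speech)),  # component with the largest speech effect
        "pearson_r": float(np.corrcoef(aie_text, aie_speech)[0, 1]),
        "max_abs_gap": float(np.max(np.abs(aie_text - aie_speech))),
    }


# Synthetic placeholder profiles, one value per layer. These are NOT values from the paper.
text_profile = [0.01, 0.05, 0.20, 0.08, 0.02]
speech_profile = [0.01, 0.02, 0.04, 0.10, 0.03]

report = compare_aie_profiles(text_profile, speech_profile)
for name, value in report.items():
    print(name, value)
```

Full carry-over would show up as matching peak locations, a correlation near 1, and a gap near 0. Deciding how close is close enough would still need a noise estimate, for example from resampling prompts.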

Figures

Figures reproduced from arXiv: 2605.22170 by Filip Landin, Gabriel Skantze, Livia Qian, Luca Modica, Mehrdad Farahani, Richard Johansson.

**Figure 1.** The SpiritLM architecture. Surrounding text from Section 2.2 (The multimodal large language model under study: The SpiritLM model): Our work examines SpiritLM (Nguyen et al., 2025) as a case of a multimodal (speech) language model that can generate text and audio language content. Furthermore, SpiritLM uses discrete speech tokens and is trained on interleaved speech a…

**Figure 2.** Log-scaled AIE across different modules and modalities over 754 prompts. In each subfigure, the x-axis…
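The caption reports an average indirect effect (AIE), but this page does not reproduce the paper's equations. The following is a sketch of the definition commonly used in causal tracing of factual recall in text models, which the paper's three-run description (clean, corrupted, corrupted-with-restoration) appears to follow. The symbols $o$ (the correct answer token), $h_i^{(\ell)}$ (the state at position $i$ and layer $\ell$), and $N$ (the number of prompts, 754 in the caption) are notation assumed here. The exact log scaling used in the figure is not specified in this summary.

```latex
\begin{align}
\mathrm{IE}_{i}^{(\ell)} &= P_{\text{corrupted},\ \text{clean } h_i^{(\ell)}}(o) - P_{\text{corrupted}}(o), \\
\mathrm{AIE}_{i}^{(\ell)} &= \frac{1}{N} \sum_{n=1}^{N} \mathrm{IE}_{i,n}^{(\ell)} .
\end{align}
```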

**Figure 3.** Results of the forced alignment for a speech utterance (transcript: "The capital of Roman Republic is").
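Forced alignment matters here because corrupting the subject of a spoken prompt requires knowing which speech tokens cover the subject words. The sketch below shows one way word-level time spans could be mapped to speech-token indices. It is an illustration under stated assumptions, not the paper's implementation: the timestamps are hypothetical, the rate of 25 tokens per second is an assumed parameter, and the mapping assumes one speech token per fixed-length frame with no deduplication of repeated tokens.

```python
import math
from dataclasses import dataclass


@dataclass(frozen=True)
class WordSpan:
    word: str
    start_s: float  # start time in seconds, from a forced aligner
    end_s: float    # end time in seconds


def speech_token_range(span, tokens_per_second):
    """Half-open range [first, last) of speech-token indices overlapping a word."""
    first = math.floor(span.start_s * tokens_per_second)
    last = math.ceil(span.end_s * tokens_per_second)
    return first, max(last, first + 1)  # always cover at least one token


def subject_token_indices(spans, subject_words, tokens_per_second=25.0):
    """Indices of all speech tokens that fall inside the subject words."""
    subject = [s for s in spans if s.word in subject_words]
    if not subject:
        raise ValueError("no subject word found in the alignment")
    ranges = [speech_token_range(s, tokens_per_second) for s in subject]
    start = min(r[0] for r in ranges)
    end = max(r[1] for r in ranges)
    return list(range(start, end))


# Hypothetical alignment for the Figure 3 transcript (times are made up).
alignment = [
    WordSpan("The", 0.0, 0.25),
    WordSpan("capital", 0.25, 0.625),
    WordSpan("of", 0.625, 0.75),
    WordSpan("Roman", 0.75, 1.25),
    WordSpan("Republic", 1.25, 1.75),
    WordSpan("is", 1.75, 2.0),
]

indices = subject_token_indices(alignment, {"Roman", "Republic"})
print(indices[0], indices[-1], len(indices))
```

Rounding outward (floor for the start, ceiling for the end) errs on the side of covering the whole subject, at the cost of occasionally including a boundary token shared with a neighbouring word.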

Original abstract

In recent years, several Speech Language Models (SLMs) that represent speech and written text jointly have been presented. The question then emerges about how model-internal mechanisms are similar and different when operating in the two modalities. We focus on how these systems encode, store, and retrieve factual knowledge, which has previously been investigated for text-only models. To investigate mechanisms behind the storage and recall of factual association in SLMs, we leverage Causal Mediation Analysis, a technique previously applied to text-based models. Initial results using SpiritLM, a multimodal model integrating discrete speech tokens reveal discrepancies between text-to-text and speech-to-text results, suggesting that the emergent mechanisms for factual recall are only partially carried over from the text to the speech modality. These results advance our understanding of how internal mechanisms encode factual associations in SLMs while contributing insights for improving speech-enabled AI systems.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Desk Editor's Note (private letter to a colleague)

The paper reports partial carry-over of factual recall mechanisms from text to speech in SpiritLM via causal mediation analysis, but the differences may stem from how speech tokens are handled rather than true mechanism mismatch.

The main thing to know is that the authors run causal mediation analysis on SpiritLM and see differences between text-to-text and speech-to-text factual recall, leading them to conclude that the mechanisms only partially transfer to the speech side. This is the first time that specific comparison appears in the literature they cite, so the result itself is new even if the method is not.

They deserve credit for taking a technique that worked on text-only models and testing it on a jointly trained multimodal system with discrete speech tokens. That move is straightforward and useful for anyone thinking about how these models store facts across input types. The execution looks honest on its own terms, with no obvious circularity or invented quantities.

The soft spots sit mainly in the interpretation. SpiritLM encodes speech through a separate tokenizer and embedding space that was co-trained with text, so factual associations could sit in modality-specific subspaces from the start. If the paper does not include controls that hold token distributions fixed while changing only the input encoder, the observed discrepancies could reflect representational mismatch instead of a failure of mechanism carry-over. The abstract gives almost no numbers on effect sizes, dataset size, or statistical checks, which leaves the strength of the claim hard to gauge from the summary alone.

This work is for people already following interpretability in speech-language models. A reader who wants to know whether text circuits generalize to speech inputs will find a concrete question and an initial answer worth checking. It is not yet strong enough to change design practice, but the question is real. I would send it to peer review rather than desk reject; the core idea is worth a referee's time to see whether the paper's controls rule out the representational-mismatch explanation raised above.

Referee Report

2 major / 2 minor

Summary. The manuscript investigates whether factual recall mechanisms identified in text-only language models carry over to the speech modality in multimodal SLMs. Focusing on SpiritLM, which integrates discrete speech tokens, the authors apply Causal Mediation Analysis and report initial results showing discrepancies between text-to-text and speech-to-text factual recall, concluding that the mechanisms are only partially carried over from text to speech.

Significance. If the reported discrepancies are shown to be robust and not attributable to representational differences, the work would meaningfully extend causal analysis techniques to multimodal settings and clarify how factual associations are encoded across modalities in jointly trained models, with potential implications for designing more reliable speech-enabled systems.

major comments (2)

[Abstract] The abstract states initial results and a conclusion but supplies no details on dataset, intervention targets, statistical controls, or effect sizes, so the support for the central claim cannot be evaluated from the given text.
[Experimental setup / results] The interpretation of discrepancies as evidence of partial mechanism carry-over assumes CMA identifies comparable factual-recall circuits once the input modality changes. In SpiritLM, speech is represented via discrete tokens from a separate tokenizer and embedding space co-trained with text; without controls holding token distribution and training data fixed while varying only the input encoder, observed differences could arise from representational mismatch rather than failure of transfer. This assumption is load-bearing for the central claim.

minor comments (2)

Clarify the precise definition of 'speech-to-text' versus 'text-to-text' pathways, including whether the output is always text or whether speech output is also considered.
Provide a brief comparison table or figure caption that directly juxtaposes the key mediation effect sizes across the two modalities.

Simulated Author's Rebuttal

2 responses · 0 unresolved

We thank the referee for their constructive feedback. We address each major comment below and indicate planned revisions to improve clarity and address interpretive concerns.

Referee: [Abstract] The abstract states initial results and a conclusion but supplies no details on dataset, intervention targets, statistical controls, or effect sizes, so the support for the central claim cannot be evaluated from the given text.

Authors: We agree that the abstract would benefit from additional specifics to allow readers to assess the claims. In the revised version, we will expand the abstract to reference the factual recall dataset (adapted from text benchmarks with speech transcriptions), the CMA intervention targets (e.g., attention heads and MLP layers in the shared transformer), and report key quantitative results including average indirect effect sizes with statistical controls.

Revision: yes

Referee: [Experimental setup / results] The interpretation of discrepancies as evidence of partial mechanism carry-over assumes CMA identifies comparable factual-recall circuits once the input modality changes. In SpiritLM, speech is represented via discrete tokens from a separate tokenizer and embedding space co-trained with text; without controls holding token distribution and training data fixed while varying only the input encoder, observed differences could arise from representational mismatch rather than failure of transfer. This assumption is load-bearing for the central claim.

Authors: We acknowledge the validity of this point: SpiritLM uses distinct tokenizers and embedding spaces for speech and text despite joint training on the transformer backbone. Our CMA results reflect direct comparisons within this model architecture rather than a controlled isolation of the input encoder. We will add explicit discussion of this limitation in the revised manuscript, clarifying that the observed discrepancies demonstrate incomplete carry-over in a representative jointly trained SLM and suggesting future experiments with unified representations where feasible. The empirical findings remain informative for understanding modality effects in current multimodal models.

Revision: partial

Circularity Check

0 steps flagged

No circularity: empirical application of prior technique to new multimodal data

The paper applies causal mediation analysis (previously used on text models) to SpiritLM to compare factual recall mechanisms between text-to-text and speech-to-text pathways. No equations, fitted parameters, or self-referential definitions are present in the provided text. The central claim rests on observed discrepancies in mediation effects from this empirical comparison rather than any derivation that reduces to its own inputs by construction. The technique is treated as an external tool, and results are presented as initial findings without load-bearing self-citations or renamings that would create circularity.

Axiom & Free-Parameter Ledger

0 free parameters · 1 axiom · 0 invented entities

The central claim depends on the transferability of causal mediation analysis from text to speech inputs and on SpiritLM being a representative multimodal model; no free parameters or new entities are introduced in the abstract.

axioms (1)

Domain assumption: Causal mediation analysis identifies factual recall mechanisms in the same manner for speech-token inputs as it does for text inputs.
The paper applies the technique directly to the speech-to-text path without describing additional validation steps.

Lean theorems connected to this paper

Citations machine-checked in the Pith Canon. Every link opens the source theorem in the public Lean library.

IndisputableMonolith/Foundation/AbsoluteFloorClosure.lean · absolute_floor_iff_bare_distinguishability · relation: unclear
Paper passage: "We leverage Causal Mediation Analysis... Clean run... Corrupted run... Corrupted-with-restoration run... Indirect Effect (IE)"

IndisputableMonolith/Cost/FunctionalEquation.lean · washburn_uniqueness_aczel · relation: unclear
Paper passage: "Initial results using SpiritLM... discrepancies between text-to-text and speech-to-text results"

What do these tags mean?

matches: The paper's claim is directly supported by a theorem in the formal canon.
supports: The theorem supports part of the paper's argument, but the paper may add assumptions or extra steps.
extends: The paper goes beyond the formal theorem; the theorem is a base layer rather than the whole result.
uses: The paper appears to rely on the theorem as machinery.
contradicts: The paper's claim conflicts with a theorem or certificate in the canon.
unclear: Pith found a possible connection, but the passage is too broad, indirect, or ambiguous to say the theorem truly supports the claim.

Reference graph

Works this paper leans on

15 extracted references · 15 canonical work pages · 4 internal anchors

[1] Dissecting recall of factual associations in auto-regressive language models. In Proceedings of the 2023 Conference on Empirical Methods in Natural Language Processing, pages 12216–12235, Singapore. Association for Computational Linguistics.
[2] Transformer feed-forward layers are key-value memories. In Proceedings of the 2021 Conference on Empirical Methods in Natural Language Processing, pages 5484–5495, Online and Punta Cana, Dominican Republic. Association for Computational Linguistics.
[3] How to use and interpret activation patching. arXiv preprint arXiv:2404.15255.
[4] Mistral 7B. Preprint, arXiv:2310.06825.
[5] Vits2: Improving quality and efficiency of single-stage text-to-speech with adversarial learning and architecture design. In Proceedings of INTERSPEECH, Dublin, Ireland. Ludwig Kürzinger, Dominik Winkelbauer, Lujun Li, Tobias Watzel, and Gerhard Rigoll. 2020. CTC-Segmentation of Large Corpora for German End-to-End Speech Recognition, page 267–278. Spring...
[6] A survey on mechanistic interpretability for multi-modal foundation models. Preprint, arXiv:2502.17516. doi:10.48550/arXiv.2502.17516
[7] Language models as knowledge bases? In Proceedings of the 2019 Conference on Empirical Methods in Natural Language Processing and the 9th International Joint Conference on Natural Language Processing (EMNLP-IJCNLP), pages 2463–2473, Hong Kong, China. Association for Computational Linguistics.
[8] AudioPaLM: A large language model that can speak and listen. Preprint, arXiv:2306.12925.
[9] Encoding of lexical tone in self-supervised models of spoken language. In Proceedings of the 2024 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies (Volume 1: Long Papers), pages 4250–4261, Mexico City, Mexico. Association for Computational Linguistics.
[10] Causal mediation analysis for interpreting neural NLP: The case of gender bias. Preprint, arXiv:2004.12265.
[11] Speech discrete tokens or continuous features? A comparative analysis for spoken language understanding in SpeechLLMs. In Proceedings of the 2025 Conference on Empirical Methods in Natural Language Processing, pages 24913–24924, Suzhou, China. Association for Computational Linguistics.
[12] Understanding the modality gap: An empirical study on the speech-text alignment mechanism of large speech language models. In Proceedings of the 2025 Conference on Empirical Methods in Natural Language Processing, pages 5187–5202, Suzhou, China. Association for Computational Linguistics.
[13] Qwen3 technical report. Preprint, arXiv:2505.09388.
[14] SpeechGPT: Empowering large language models with intrinsic cross-modal conversational abilities. In Findings of the Association for Computational Linguistics: EMNLP 2023, pages 15757–15773, Singapore. Association for Computational Linguistics.
[15] MeloTTS: High-quality multi-lingual multi-accent text-to-speech. Appendix A, Forced Alignment for Cross-modal Token Mapping: Implementation Details. Text preprocessing for CTC. For the transcription to be compatible with the forced alignment, a text preprocessing is necessary to ensure all characters are included in the CTC model vocabulary. For example, digits and s... %" are converted to their written format (e.g, ...
