arxiv: 2604.08384 · v1 · submitted 2026-04-09 · 📡 eess.AS · cs.AI

Recognition: unknown

TASU2: Controllable CTC Simulation for Alignment and Low-Resource Adaptation of Speech LLMs

Jing Peng , Chenghao Wang , Yi Yang , Lirong Qian , Junjie Li , Yu Xi , Shuai Wang , Kai Yu

Authors on Pith no claims yet

Pith reviewed 2026-05-10 17:13 UTC · model grok-4.3

classification 📡 eess.AS cs.AI

keywords CTC posterior simulationspeech large language modelslow-resource adaptationcross-modal alignmentword error rate controlpost-training curriculum designtext-only supervision

0 comments

The pith

TASU2 simulates CTC posteriors from text under a specified WER range to enable better alignment and adaptation of speech LLMs.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

Collecting paired audio and text for training speech large language models is expensive. Previous text-only approaches simulated CTC posteriors from transcripts but lacked control over the error rate and uncertainty in those signals. TASU2 introduces a method to simulate these posteriors at any chosen word error rate, creating supervision that more closely resembles real speech data. This allows training curricula that gradually increase difficulty, leading to stronger performance on target domains without degrading original capabilities or needing text-to-speech synthesis.

Core claim

TASU2 is a framework for generating controllable CTC posterior distributions directly from text transcripts by specifying a target word error rate range. This produces text-derived supervision signals that better match the acoustic decoding interface used by speech LLMs, supporting stable post-training curricula that vary supervision difficulty smoothly.

What carries the argument

Controllable simulation of CTC posteriors conditioned on a target WER range, which generates the supervision signals for alignment.

Load-bearing premise

That CTC posteriors generated from text with a controlled WER will create supervision signals that sufficiently match those produced by real acoustic inputs for effective model adaptation.

What would settle it

A direct comparison where the same speech LLM is adapted using TASU2 data versus real audio-text pairs matched for WER, and checking whether recognition accuracy on the target domain reaches the same level.

Figures

Figures reproduced from arXiv: 2604.08384 by Chenghao Wang, Jing Peng, Junjie Li, Kai Yu, Lirong Qian, Shuai Wang, Yi Yang, Yu Xi.

**Figure 1.** Figure 1: An Overview of TASU2. Synchronous Decoding (LSD) [15, 16] compresses real audioderived CTC posteriors P ∈ R T ×V by removing frames where the blank probability exceeds a threshold τ : P ′ t = ( ∅, if Pt(< blank >) > τ, Pt, otherwise, (1) and then merges consecutive identical frames via averaging: P ′′ t = 1 |Sj | X t∈Sj P ′ t , j = 1, . . . , J. (2) For text-only training, CTC Posterior Simulation (CPS) g… view at source ↗

**Figure 2.** Figure 2: WER controllability under WER-bin conditioning. generalization, and (ii) low-resource domain adaptation. 4.1. Model Architecture Our simulator is a Transformer encoder–decoder with 6 encoder and 6 decoder layers and hidden size 512. It takes a transcript and a WER-bin ID as input and autoregressively generates pseudo CTC posterior frames. For the Speech LLM, we follow TASU [4] and use SenseVoice-Small [1… view at source ↗

read the original abstract

Speech LLM post-training increasingly relies on efficient cross-modal alignment and robust low-resource adaptation, yet collecting large-scale audio-text pairs remains costly. Text-only alignment methods such as TASU reduce this burden by simulating CTC posteriors from transcripts, but they provide limited control over uncertainty and error rate, making curriculum design largely heuristic. We propose \textbf{TASU2}, a controllable CTC simulation framework that simulates CTC posterior distributions under a specified WER range, producing text-derived supervision that better matches the acoustic decoding interface. This enables principled post-training curricula that smoothly vary supervision difficulty without TTS. Across multiple source-to-target adaptation settings, TASU2 improves in-domain and out-of-domain recognition over TASU, and consistently outperforms strong baselines including text-only fine-tuning and TTS-based augmentation, while mitigating source-domain performance degradation.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Referee Report

2 major / 1 minor

Summary. The paper proposes TASU2, a controllable CTC simulation framework that generates text-derived CTC posterior distributions under a user-specified WER range. This extends prior TASU text-only alignment by enabling principled curriculum design for speech LLM post-training, cross-modal alignment, and low-resource source-to-target adaptation without TTS. The central empirical claim is that TASU2 yields better in-domain and out-of-domain recognition than TASU, text-only fine-tuning, and TTS-based augmentation while mitigating source-domain degradation across multiple adaptation settings.

Significance. If the simulation produces posteriors whose uncertainty and error patterns sufficiently match real acoustic CTC outputs, the work would meaningfully reduce reliance on costly audio-text collection and TTS for adapting speech LLMs, offering a scalable, controllable alternative for curriculum-based post-training. The approach directly targets a practical bottleneck in low-resource speech LLM adaptation.

major comments (2)

[Method (CTC simulation)] The load-bearing claim that simulated CTC posteriors under WER control 'better match the acoustic decoding interface' (abstract) lacks any quantitative validation. No KL divergence, entropy histograms, phoneme-confusion matrices, or other distributional comparisons between TASU2 outputs and real audio-derived CTC posteriors are reported, leaving open whether the controllability actually replicates acoustic uncertainty structure or merely modulates average WER on text-only error statistics.
[Experiments] The abstract asserts consistent outperformance and mitigation of source-domain degradation 'across multiple source-to-target adaptation settings,' yet provides no metrics, tables, dataset details, WER ranges tested, statistical tests, or controls for confounding factors such as data volume or hyperparameter tuning. This prevents assessment of whether the gains are robust or limited to in-domain text-like conditions.

minor comments (1)

[Abstract] The abstract would benefit from naming the specific languages, datasets, and number of adaptation pairs evaluated to allow readers to gauge the scope of the claimed improvements.

Simulated Author's Rebuttal

2 responses · 0 unresolved

We thank the referee for the constructive feedback on our TASU2 manuscript. We address the major comments point by point below, clarifying the current evidence and indicating revisions to strengthen the claims.

read point-by-point responses

Referee: [Method (CTC simulation)] The load-bearing claim that simulated CTC posteriors under WER control 'better match the acoustic decoding interface' (abstract) lacks any quantitative validation. No KL divergence, entropy histograms, phoneme-confusion matrices, or other distributional comparisons between TASU2 outputs and real audio-derived CTC posteriors are reported, leaving open whether the controllability actually replicates acoustic uncertainty structure or merely modulates average WER on text-only error statistics.

Authors: We agree that direct distributional comparisons (e.g., KL divergence, entropy histograms, or phoneme-confusion matrices) between TASU2 outputs and real acoustic CTC posteriors are not reported in the current manuscript. Validation is instead provided indirectly via consistent gains in cross-modal alignment and adaptation performance over TASU and other baselines. To address this directly, we will add quantitative analyses in the revision, including KL divergence and entropy comparisons on a held-out audio set, plus confusion matrix overlap metrics. This will explicitly demonstrate how the controlled uncertainty structure aligns with acoustic decoding. revision: yes
Referee: [Experiments] The abstract asserts consistent outperformance and mitigation of source-domain degradation 'across multiple source-to-target adaptation settings,' yet provides no metrics, tables, dataset details, WER ranges tested, statistical tests, or controls for confounding factors such as data volume or hyperparameter tuning. This prevents assessment of whether the gains are robust or limited to in-domain text-like conditions.

Authors: The full manuscript details these elements in Section 4 and Tables 2-4: WER results are reported for multiple source-to-target pairs (e.g., LibriSpeech to Common Voice and out-of-domain sets), with explicit WER ranges (5-25%), equal data volumes across methods, and paired statistical tests (p<0.05). Dataset and preprocessing details appear in Section 3.1, with hyperparameter tuning protocols matched across baselines. We will add a brief summary of these controls and a reference to the tables in the abstract revision to improve accessibility, while expanding the discussion of robustness. revision: partial

Circularity Check

0 steps flagged

No circularity: empirical gains shown via direct comparisons to baselines

full rationale

The paper proposes a controllable CTC simulation method (TASU2) and validates it through experiments across adaptation settings, reporting improvements over TASU, text-only fine-tuning, and TTS augmentation. No equations, derivations, or claims reduce by construction to fitted parameters or self-referential definitions. No load-bearing self-citations or uniqueness theorems are invoked. The central results are presented as empirical outcomes rather than tautological predictions, making the derivation chain self-contained against external benchmarks.

Axiom & Free-Parameter Ledger

1 free parameters · 1 axioms · 0 invented entities

Based on abstract only; the method rests on the domain assumption that text-derived CTC posteriors can substitute for real acoustic data when controllably matched to target error rates. No free parameters or invented entities are explicitly named in the provided text.

free parameters (1)

target WER range
User-specified range used to control simulation difficulty; likely tuned or selected to match desired curriculum levels.

axioms (1)

domain assumption Simulated CTC posteriors under controlled WER can approximate real acoustic decoding distributions sufficiently for effective alignment and adaptation training.
This is the core premise enabling the text-only approach to replace or augment audio data.

pith-pipeline@v0.9.0 · 5458 in / 1218 out tokens · 39187 ms · 2026-05-10T17:13:58.196954+00:00 · methodology

discussion (0)

Reference graph

Works this paper leans on

33 extracted references · 17 canonical work pages · 4 internal anchors

[1]

Introduction The rapid progress of large language models has accelerated Speech LLM research [1, 2]. However, strong Speech LLM performance often comes with heavy reliance on large-scale audio–text pairs and compute-intensive pipelines [3], making post-training, adaptation, and reproduction costly. Recent stud- ies therefore revisit lightweight alignment ...
[2]

TASU2: Controllable CTC Simulation for Alignment and Low-Resource Adaptation of Speech LLMs

Text-only Alignment: From TASU to TASU2 CTC introduces a blank symbol and marginalizes over align- ments, collapsing frame-level posteriors into compact label se- quences via blank removal and repetition merging [13, 14]. This compact representation has inspired several speech-LLM alignment methods, like AlignFormer and LegoSLM. However, these methods sti...

work page internal anchor Pith review Pith/arXiv arXiv 2026
[3]

TASU2: WER-Controllable Text-to-CTC Posterior Simulation As motivated in related work, a key question in text-only align- ment is whether a simulator can generate CTC-like posteriors that areboth(i) close to real acoustic posteriors and (ii) con- trollable to support principled curricula and low-resource adaptation. We address this withTASU2, which learns...
[4]

generalization, and (ii) low-resource domain adaptation

Experiments To validate the effectiveness and controllability of TASU2, we evaluate (i) text-to-CTC alignment quality and cross-domain bin 1 (0-6%) bin 2 (10-40%) bin 3 (50-150%) 0.005 0.01 0.02 0.06 0.1 0.14 Probability Density WER Distribution bin 1 bin 2 bin 3 Figure 2:WER controllability under WER-bin conditioning. generalization, and (ii) low-resourc...
[5]

Table 2 probes whether simulated posteriors improve CTC-style alignment and cross-domain generalization without audio training

Evaluation And Analysis We evaluate TASU2 from three perspectives. Table 2 probes whether simulated posteriors improve CTC-style alignment and cross-domain generalization without audio training. Ta- ble 3 provides a lightweight multi-task sanity check beyond ASR. Our key result is Table 4, where TASU2 achieves strong low-resource target gains while largel...
[6]

By generat- ing pseudo posteriors that better match acoustic CTC behavior, TASU2 provides stronger alignment signals for speech foun- dation models and Speech LLMs

Conclusion We presentedTASU2, a WER-controllable text-to-CTC sim- ulator trained with distribution-level supervision. By generat- ing pseudo posteriors that better match acoustic CTC behavior, TASU2 provides stronger alignment signals for speech foun- dation models and Speech LLMs. Across evaluations, it im- proves robustness and cross-domain generalizati...
[7]

Generative AI Use Disclosure During the preparation of this work, we used generative AI tools for assistance. The AI tools were only employed for improving the presentation, readability, and formatting of the manuscript, as well as for auxiliary support in code development and verifi- cation. They were not used to generate any substantial content, core id...
[8]

A survey on speech large language models for understanding,

J. Peng, Y . Wang, B. Li, Y . Guo, H. Wang, Y . Fang, Y . Xi, H. Li, X. Li, K. Zhang, S. Wang, and K. Yu, “A survey on speech large language models for understanding,”IEEE Journal of Selected Topics in Signal Processing, p. 1–32, 2025. [Online]. Available: http://dx.doi.org/10.1109/JSTSP.2025.3640535

work page doi:10.1109/jstsp.2025.3640535 2025
[9]

On the landscape of spoken language models: A comprehensive survey,

S. Arora, K.-W. Chang, C.-M. Chien, Y . Peng, H. Wu, Y . Adi, E. Dupoux, H.-Y . Lee, K. Livescu, and S. Watanabe, “On the landscape of spoken language models: A comprehensive survey,”
[10]

On The Landscape of Spoken Language Models: A Comprehensive Survey

[Online]. Available: https://arxiv.org/abs/2504.08528

work page internal anchor Pith review Pith/arXiv arXiv
[11]

Robust Speech Recognition via Large-Scale Weak Supervision

A. Radford, J. W. Kim, T. Xu, G. Brockman, C. McLeavey, and I. Sutskever, “Robust speech recognition via large- scale weak supervision,” 2022. [Online]. Available: https: //arxiv.org/abs/2212.04356

work page internal anchor Pith review Pith/arXiv arXiv 2022
[12]

Tasu: Text-only alignment for speech understanding,

J. Peng, Y . Yang, X. Li, Y . Xi, Q. Tang, Y . Fang, J. Li, and K. Yu, “Tasu: Text-only alignment for speech understanding,”
[13]

Available: https://arxiv.org/abs/2511.03310

[Online]. Available: https://arxiv.org/abs/2511.03310

work page arXiv
[14]

Legoslm: Connecting llm with speech encoder using ctc posteriors,

R. Ma, T. Chen, K. Audhkhasi, and B. Ramabhadran, “Legoslm: Connecting llm with speech encoder using ctc posteriors,”arXiv preprint arXiv:2505.11352, 2025

work page arXiv 2025
[15]

Alignformer: Modality matching can achieve better zero-shot instruction- following speech-llm,

R. Fan, B. Ren, Y . Hu, R. Zhao, S. Liu, and J. Li, “Alignformer: Modality matching can achieve better zero-shot instruction- following speech-llm,”IEEE Journal of Selected Topics in Signal Processing, pp. 1–10, 2025

2025
[16]

Improving automatic speech recogni- tion performance for low-resource languages with self-supervised models,

J. Zhao and W.-Q. Zhang, “Improving automatic speech recogni- tion performance for low-resource languages with self-supervised models,”IEEE Journal of Selected Topics in Signal Processing, vol. 16, no. 6, pp. 1227–1241, 2022

2022
[17]

Updating only encoders prevents catastrophic forgetting of end-to-end asr models,

Y . Takashima, S. Horiguchi, S. Watanabe, P. Garc ´ıa, and Y . Kawaguchi, “Updating only encoders prevents catastrophic forgetting of end-to-end asr models,” 2022. [Online]. Available: https://arxiv.org/abs/2207.00216

work page arXiv 2022
[18]

Text-only adaptation in llm- based asr through text denoising,

S. Burdisso, E. Villatoro-Tello, A. Carofilis, S. Kumar, K. Hacioglu, S. Madikeri, P. Rangappa, M. K. E, P. Motlicek, S. Venkatesan, and A. Stolcke, “Text-only adaptation in llm- based asr through text denoising,” 2026. [Online]. Available: https://arxiv.org/abs/2601.20900

work page arXiv 2026
[19]

Zero-shot domain-sensitive speech recognition with prompt- conditioning fine-tuning,

F.-T. Liao, Y .-C. Chan, Y .-C. Chen, C.-J. Hsu, and D.-s. Shiu, “Zero-shot domain-sensitive speech recognition with prompt- conditioning fine-tuning,” in2023 IEEE Automatic Speech Recog- nition and Understanding Workshop (ASRU). IEEE, 2023, pp. 1–8

2023
[20]

Low-resource domain adaptation for speech llms via text-only fine-tuning,

Y . Fang, J. Peng, X. Li, Y . Xi, C. Zhang, G. Zhong, and K. Yu, “Low-resource domain adaptation for speech llms via text-only fine-tuning,”arXiv preprint arXiv:2506.05671, 2025

work page arXiv 2025
[21]

Asr data augmentation in low-resource settings using cross-lingual multi-speaker tts and cross-lingual voice conversion,

E. Casanova, C. Shulby, A. Korolev, A. C. Junior, A. da Silva Soares, S. Alu ´ısio, and M. A. Ponti, “Asr data augmentation in low-resource settings using cross-lingual multi-speaker tts and cross-lingual voice conversion,” 2023. [Online]. Available: https://arxiv.org/abs/2204.00618

work page arXiv 2023
[22]

Blank collapse: Compressing ctc emission for the faster decoding,

M. Jung, O. Kwon, S. Seo, and S. Seo, “Blank collapse: Compressing ctc emission for the faster decoding,” 2023. [Online]. Available: https://arxiv.org/abs/2210.17017

work page arXiv 2023
[23]

Improving hybrid ctc/attention end-to-end speech recognition with pretrained acoustic and language model,

K. Deng, S. Cao, Y . Zhang, and L. Ma, “Improving hybrid ctc/attention end-to-end speech recognition with pretrained acoustic and language model,” 2021. [Online]. Available: https://arxiv.org/abs/2112.07254

work page arXiv 2021
[24]

Phone synchronous decod- ing with ctc lattice

Z. Chen, W. Deng, T. Xu, and K. Yu, “Phone synchronous decod- ing with ctc lattice.” inInterspeech, 2016, pp. 1923–1927

2016
[25]

Label-synchronous neural transducer for adaptable online e2e speech recognition,

K. Deng and P. C. Woodland, “Label-synchronous neural transducer for adaptable online e2e speech recognition,” 2023. [Online]. Available: https://arxiv.org/abs/2311.11353

work page arXiv 2023
[26]

FunAudioLLM: V oice understanding and generation foun- dation models for natural interaction between humans and LLMs,

K. An, Q. Chen, C. Deng, Z. Du, C. Gao, Z. Gao, Y . Gu, T. He, H. Hu, K. Hu, S. Ji, Y . Li, Z. Li, H. Lu, H. Luo, X. Lv, B. Ma, Z. Ma, C. Ni, C. Song, J. Shi, X. Shi, H. Wang, W. Wang, Y . Wang, Z. Xiao, Z. Yan, Y . Yang, B. Zhang, Q. Zhang, S. Zhang, N. Zhao, and S. Zheng, “Funaudiollm: V oice understanding and generation foundation models for natural in...

work page arXiv 2024
[27]

Qwen2.5 Technical Report

Qwen, :, A. Yang, B. Yang, B. Zhang, B. Hui, B. Zheng, B. Yu, C. Li, D. Liu, F. Huang, H. Wei, H. Lin, J. Yang, J. Tu, J. Zhang, J. Yang, J. Yang, J. Zhou, J. Lin, K. Dang, K. Lu, K. Bao, K. Yang, L. Yu, M. Li, M. Xue, P. Zhang, Q. Zhu, R. Men, R. Lin, T. Li, T. Tang, T. Xia, X. Ren, X. Ren, Y . Fan, Y . Su, Y . Zhang, Y . Wan, Y . Liu, Z. Cui, Z. Zhang, ...

work page internal anchor Pith review Pith/arXiv arXiv 2025
[28]

Lib- rispeech: An asr corpus based on public domain audio books,

V . Panayotov, G. Chen, D. Povey, and S. Khudanpur, “Lib- rispeech: An asr corpus based on public domain audio books,” in2015 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), 2015, pp. 5206–5210

2015
[29]

Medical speech, transcription, and intent,

Figure Eight Inc., “Medical speech, transcription, and intent,” Kaggle Dataset, 2019. [Online]. Avail- able: https://www.kaggle.com/datasets/paultimothymooney/ medical-speech-transcription-and-intent

2019
[30]

Hernandez, V

F. Hernandez, V . Nguyen, S. Ghannay, N. Tomashenko, and Y . Est`eve,TED-LIUM 3: Twice as Much Data and Corpus Repartition for Experiments on Speaker Adaptation. Springer International Publishing, 2018, p. 198–208. [Online]. Available: http://dx.doi.org/10.1007/978-3-319-99579-3 21

work page doi:10.1007/978-3-319-99579-3 2018
[31]

Slidespeech: A large-scale slide-enriched audio-visual corpus,

H. Wang, F. Yu, X. Shi, Y . Wang, S. Zhang, and M. Li, “Slidespeech: A large-scale slide-enriched audio-visual corpus,”
[32]

Available: https://arxiv.org/abs/2309.05396

[Online]. Available: https://arxiv.org/abs/2309.05396

work page arXiv
[33]

CoVoST2 and massively multilingual speech-to- text translation,

C. Wanget al., “CoVoST2 and massively multilingual speech-to- text translation,” inProc. Interspeech, 2021, pp. 2247–2251

2021