arxiv: 2509.03526 · v2 · pith:74H4KQXSnew · submitted 2025-08-25 · 💻 cs.CL · eess.AS

Enhancing Speech Large Language Models through Reinforced Behavior Alignment

Pith reviewed 2026-05-21 22:19 UTC · model grok-4.3

classification 💻 cs.CL eess.AS
keywords Speech Large Language Models · Reinforced Behavior Alignment · Self-Synthesis · Reinforcement Learning · Instruction Following · Spoken Question Answering · Speech-to-Text Translation

The pith

Reinforced Behavior Alignment improves SpeechLMs' instruction following by aligning them to a teacher model using self-generated data and reinforcement learning.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

Speech large language models lag behind text-based ones in following instructions because of differences between speech and text inputs. The paper introduces Reinforced Behavior Alignment to close that gap without human annotations. It has a teacher LLM generate its own high-quality training examples from speech, then uses reinforcement learning to make the SpeechLM match the teacher's responses. If this works, SpeechLMs become more reliable at handling varied speech requests and the same process transfers to spoken question answering and speech-to-text translation.

Core claim

This paper claims that Reinforced Behavior Alignment (RBA) bolsters the language generation proficiency of SpeechLMs. Instead of supervised fine-tuning on human annotations, RBA uses a self-synthesis methodology in which a powerful teacher LLM generates extensive, high-fidelity alignment data. The SpeechLM is then aligned to the teacher's behavior with a reinforcement learning-based approach. Experimental results show this enhances instruction-following capabilities beyond conventional distillation baselines. The method extends to spoken question answering and speech-to-text translation, attaining state-of-the-art performance on open benchmarks with only self-generated data.

What carries the argument

Reinforced Behavior Alignment (RBA), a two-step process that first generates alignment data through self-synthesis by prompting a teacher LLM on speech inputs and then optimizes the SpeechLM via reinforcement learning to match the teacher's output behavior.
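
The two steps can be sketched in a few lines. This is a minimal illustration under stated assumptions, not the authors' implementation: the summary here does not specify the reinforcement learning objective, so the sketch assumes a DPO-style preference loss in which the teacher's response is preferred over the SpeechLM's own response. The names `synthesize_pairs`, `dpo_loss`, and `beta`, and the toy teacher and student callables, are all hypothetical.

```python
import math
from typing import Callable, Dict, List


def synthesize_pairs(
    instructions: List[str],
    teacher: Callable[[str], str],
    student: Callable[[str], str],
) -> List[Dict[str, str]]:
    """Step 1 (assumed form): build preference pairs without human labels.

    The teacher's answer is treated as the preferred response and the
    student's own answer as the rejected one. In a real pipeline the
    student would receive the spoken form of each instruction.
    """
    return [
        {"prompt": x, "chosen": teacher(x), "rejected": student(x)}
        for x in instructions
    ]


def dpo_loss(
    policy_chosen_logp: float,
    policy_rejected_logp: float,
    ref_chosen_logp: float,
    ref_rejected_logp: float,
    beta: float = 0.1,
) -> float:
    """Step 2 (assumed objective): DPO loss for a single preference pair.

    loss = -log sigmoid(beta * ((pi_c - ref_c) - (pi_r - ref_r)))
    """
    margin = beta * (
        (policy_chosen_logp - ref_chosen_logp)
        - (policy_rejected_logp - ref_rejected_logp)
    )
    # softplus(-margin) equals -log(sigmoid(margin)); this form avoids overflow.
    return max(-margin, 0.0) + math.log1p(math.exp(-abs(margin)))


# Toy usage with stand-in callables instead of real models.
pairs = synthesize_pairs(
    ["Translate 'hello' to French."],
    teacher=lambda x: "bonjour",
    student=lambda x: "hello",
)
loss_at_start = dpo_loss(0.0, 0.0, 0.0, 0.0)  # equals log(2): no preference yet
```

The loss falls as the policy assigns relatively more probability to the teacher's response than the reference model does, which is one way to read "matching the teacher's output behavior".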

If this is right

  • SpeechLMs exhibit stronger instruction-following after applying RBA.
  • The approach outperforms conventional distillation baselines on relevant tasks.
  • RBA transfers directly to spoken question answering without additional human data.
  • Speech-to-text translation reaches state-of-the-art results on open benchmarks using only self-generated data.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

  • This self-synthesis plus reinforcement pattern could scale alignment for other speech or multimodal models by cutting annotation costs.
  • The results suggest teacher LLMs can bootstrap improvements across dynamic input modalities beyond text.
  • Similar techniques might reduce the need for human verification in reinforcement learning setups for audio-language systems.

Load-bearing premise

The self-synthesis methodology, driven by a powerful teacher LLM, generates extensive, high-fidelity alignment data that is suitable for reinforcement learning alignment without human annotations or verification.

What would settle it

The central claim would be disproved by an experiment showing that, on instruction-following benchmarks, SpeechLMs trained with RBA perform no better than models trained via standard supervised fine-tuning on human-annotated speech data.
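
One way to run such a comparison is a paired bootstrap over per-example benchmark scores. The sketch below is illustrative only: the function name `paired_bootstrap`, the resample count, and the toy score lists are assumptions, and the paper's own evaluation protocol is not described here.

```python
import random
from typing import List, Tuple


def paired_bootstrap(
    scores_a: List[float],
    scores_b: List[float],
    n_resamples: int = 2000,
    seed: int = 0,
) -> Tuple[float, float]:
    """Compare two systems scored on the same benchmark examples.

    Returns (mean difference A - B, fraction of resamples in which the
    summed difference is strictly positive, i.e. A beats B).
    """
    if len(scores_a) != len(scores_b) or not scores_a:
        raise ValueError("need two non-empty score lists of equal length")
    rng = random.Random(seed)
    n = len(scores_a)
    diffs = [a - b for a, b in zip(scores_a, scores_b)]
    wins = 0
    for _ in range(n_resamples):
        # Resample examples with replacement, keeping the pairing intact.
        sample = [diffs[rng.randrange(n)] for _ in range(n)]
        if sum(sample) > 0:
            wins += 1
    return sum(diffs) / n, wins / n_resamples


# Toy usage: hypothetical per-example scores for an RBA model and an SFT model.
rba_scores = [0.8, 0.7, 0.9, 0.6]
sft_scores = [0.7, 0.6, 0.8, 0.5]
mean_diff, win_rate = paired_bootstrap(rba_scores, sft_scores)
```

A win rate near 0.5 would indicate no reliable difference between the two systems, which is the outcome that would count against the central claim.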

Figures

Figures reproduced from arXiv: 2509.03526 by Jiateng Li, Yansong Liu, Yuan Liu.

Figure 1: For an aligned LLM, the input sequence can [caption truncated]
Figure 1: Frameworks of RBA. Step 1: generate text user instruction by modifying pre-defined query template, followed by [caption truncated]
read the original abstract

The recent advancements of Large Language Models (LLMs) have spurred considerable research interest in extending their linguistic capabilities beyond text to other modalities, which leads to emergence of speech-based LLMs (SpeechLMs) with capability of processing user request in either speech or textual formats. However, owing to inter-modal discrepancies, these SpeechLMs still exhibit a significant performance gap compared to their text-based LLM counterparts in instruction-following, particularly when confronted with the dynamic and variable nature of user speech. To address this challenge, this paper introduces a framework termed Reinforced Behavior Alignment (RBA), designed to bolster the language generation proficiency of SpeechLMs. Instead of relying on supervised fine-tuning from human annotations, RBA employs a self-synthesis methodology to generate extensive, high-fidelity alignment data by a powerful teacher LLM. Then SpeechLMs is aligned its behavior with that of a teacher using a reinforcement learning-based approach. Experimental results demonstrate that this method effectively enhances the instruction-following capabilities of SpeechLMs that outperform conventional distillation baselines. Crucially, we demonstrate that RBA can be seamlessly extended to tasks such including spoken question answering and speech-to-text translation, attaining state-of-the-art performance on open benchmarks with only self-generated data.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Referee Report

2 major / 2 minor

Summary. This paper presents Reinforced Behavior Alignment (RBA), a framework for improving Speech Large Language Models (SpeechLMs) by generating alignment data through self-synthesis using a powerful teacher LLM and then applying reinforcement learning to align the model's behavior. The authors claim that RBA enhances instruction-following capabilities beyond conventional distillation methods and can be extended to spoken question answering and speech-to-text translation, achieving state-of-the-art results on benchmarks using only self-generated data without human annotations.

Significance. Should the reported experimental outcomes prove robust, this work offers a promising direction for aligning speech-based LLMs with text-based counterparts using synthetic data and RL techniques. This could lower barriers to developing high-performing multimodal models by minimizing dependence on human-annotated datasets, with potential applications in various speech processing tasks.

major comments (2)
  1. Abstract: The abstract states that experimental results demonstrate outperformance and SOTA performance but provides no quantitative metrics, baselines, error bars, dataset details, or ablation studies to support these claims.
  2. Experiments section: The description of results for spoken question answering and speech-to-text translation does not include comparisons to human-annotated data or analysis of how self-generated targets handle acoustic variability, which is central to validating the no-human-annotation premise.
minor comments (2)
  1. Abstract: Grammatical issue: 'SpeechLMs is aligned its behavior with that of a teacher' is awkward and should be revised for clarity.
  2. Abstract: Typo or phrasing: 'tasks such including spoken question answering' should read 'tasks including spoken question answering'.

Simulated Author's Rebuttal

2 responses · 0 unresolved

We thank the referee for the constructive feedback on our manuscript. We address the major comments point by point below and describe the revisions we will make.

  1. Referee: Abstract: The abstract states that experimental results demonstrate outperformance and SOTA performance but provides no quantitative metrics, baselines, error bars, dataset details, or ablation studies to support these claims.

    Authors: We agree that the abstract would be strengthened by including key quantitative results. In the revised version, we will add specific metrics such as accuracy improvements on spoken QA benchmarks and BLEU scores for speech-to-text translation, along with references to the main baselines and datasets. Detailed error bars, full ablations, and dataset statistics will continue to appear in the experiments section, as abstract length limits preclude their inclusion there. revision: yes

  2. Referee: Experiments section: The description of results for spoken question answering and speech-to-text translation does not include comparisons to human-annotated data or analysis of how self-generated targets handle acoustic variability, which is central to validating the no-human-annotation premise.

    Authors: We acknowledge the value of direct comparisons to human-annotated data for contextualizing our results. Our current experiments focus on outperforming distillation baselines (which typically use human annotations) using only self-generated data, achieving SOTA on the reported benchmarks. We will add a discussion of available human-annotated equivalents where they exist for these tasks and include an analysis of acoustic variability, examining how the RL-based alignment mitigates performance drops under varied acoustic conditions. This will be incorporated as a new subsection or expanded paragraph in the experiments. revision: partial

Circularity Check

0 steps flagged

No circularity: empirical method validated on external benchmarks

The paper introduces the RBA framework via self-synthesis of alignment data from a text teacher LLM followed by RL-based behavior alignment, then reports experimental gains on instruction-following, spoken QA, and speech-to-text translation tasks. No equations, derivations, fitted parameters renamed as predictions, or load-bearing self-citations appear in the provided text. All performance claims reference open external benchmarks and conventional distillation baselines rather than reducing internally to the method's own inputs by construction. The work is therefore self-contained as an empirical contribution.

Axiom & Free-Parameter Ledger

0 free parameters · 1 axiom · 0 invented entities

The framework depends on the unverified assumption that a text teacher LLM can produce high-quality speech-aligned training data through self-synthesis alone.

axioms (1)
  • domain assumption: Self-synthesis by a powerful teacher LLM produces extensive high-fidelity alignment data suitable for RL-based behavior alignment of SpeechLMs.
    Explicitly stated in the abstract as the data-generation step replacing human annotations.



