Enhancing Speech Large Language Models through Reinforced Behavior Alignment

Jiateng Li; Yansong Liu; Yuan Liu

arxiv: 2509.03526 · v2 · pith:74H4KQXSnew · submitted 2025-08-25 · 💻 cs.CL · eess.AS

Pith reviewed 2026-05-21 22:19 UTC · model grok-4.3

classification 💻 cs.CL eess.AS

keywords Speech Large Language Models; Reinforced Behavior Alignment; Self-Synthesis; Reinforcement Learning; Instruction Following; Spoken Question Answering; Speech-to-Text Translation

The pith

Reinforced Behavior Alignment improves SpeechLMs' instruction following by aligning them to a teacher model using self-generated data and reinforcement learning.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

Speech large language models lag behind text-based ones in following instructions because of differences between speech and text inputs. The paper introduces Reinforced Behavior Alignment to close that gap without human annotations. It has a teacher LLM generate its own high-quality training examples from speech, then uses reinforcement learning to make the SpeechLM match the teacher's responses. If this works, SpeechLMs become more reliable at handling varied speech requests and the same process transfers to spoken question answering and speech-to-text translation.

Core claim

This paper claims that Reinforced Behavior Alignment (RBA) bolsters the language generation proficiency of SpeechLMs. Instead of supervised fine-tuning on human annotations, RBA uses a self-synthesis methodology in which a powerful teacher LLM generates extensive, high-fidelity alignment data. The SpeechLM is then aligned to the teacher's behavior using a reinforcement learning-based approach. Experimental results show this enhances instruction-following capabilities beyond conventional distillation baselines. The method extends to spoken question answering and speech-to-text translation, attaining state-of-the-art performance on open benchmarks with only self-generated data.

What carries the argument

Reinforced Behavior Alignment (RBA) is a two-step process: it first generates alignment data through self-synthesis by prompting a teacher LLM on speech inputs, and then optimizes the SpeechLM via reinforcement learning to match the teacher's output behavior.
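As a rough illustration of the two-step shape of RBA, the Python sketch below shows only the data plumbing: filling a query template to obtain an instruction, then ranking student outputs by closeness to a teacher response to form a preference pair. The templates, topics, token-overlap reward, and best-versus-worst pairing are all assumptions made for illustration. The text here does not specify the paper's reward, RL algorithm, or templates, and the teacher and SpeechLM calls are replaced by hard-coded strings.

```python
import random
from collections import Counter

# Hypothetical query templates and topics; not taken from the paper.
TEMPLATES = [
    "Explain {topic} in one sentence.",
    "List two facts about {topic}.",
]
TOPICS = ["photosynthesis", "black holes", "the water cycle"]


def synthesize_instruction(rng):
    """Step 1 (sketch): fill a pre-defined template to get a text instruction."""
    return rng.choice(TEMPLATES).format(topic=rng.choice(TOPICS))


def token_f1(candidate, reference):
    """Toy reward (an assumption): token-overlap F1 against the teacher response."""
    cand, ref = candidate.lower().split(), reference.lower().split()
    overlap = sum((Counter(cand) & Counter(ref)).values())
    if overlap == 0:
        return 0.0
    precision, recall = overlap / len(cand), overlap / len(ref)
    return 2 * precision * recall / (precision + recall)


def build_preference_pair(teacher_response, student_candidates):
    """Step 2 (sketch): rank student samples by closeness to the teacher."""
    ranked = sorted(student_candidates, key=lambda c: token_f1(c, teacher_response))
    return {"chosen": ranked[-1], "rejected": ranked[0]}


# Stand-ins for real model outputs; a real pipeline would call a teacher LLM
# on the instruction and sample the SpeechLM on the spoken version of it.
rng = random.Random(0)
instruction = synthesize_instruction(rng)
teacher = "Plants convert light into chemical energy."
candidates = ["Plants convert light into energy.", "I cannot hear you."]
pair = build_preference_pair(teacher, candidates)
```

A pair of this form could feed a preference-based objective, but any such choice is a design decision outside what this page states about the paper.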

If this is right

SpeechLMs exhibit stronger instruction-following after applying RBA.
The approach outperforms conventional distillation baselines on relevant tasks.
RBA transfers directly to spoken question answering without additional human data.
Speech-to-text translation reaches state-of-the-art results on open benchmarks using only self-generated data.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

This self-synthesis plus reinforcement pattern could scale alignment for other speech or multimodal models by cutting annotation costs.
The results suggest teacher LLMs can bootstrap improvements across dynamic input modalities beyond text.
Similar techniques might reduce the need for human verification in reinforcement learning setups for audio-language systems.

Load-bearing premise

The self-synthesis methodology, driven by a powerful teacher LLM, generates extensive, high-fidelity alignment data that is suitable for reinforcement learning alignment without human annotations or verification.

What would settle it

The central claim would be disproved by an experiment showing that, on instruction-following benchmarks, SpeechLMs trained with RBA perform no better than, or worse than, models trained via standard supervised fine-tuning on human-annotated speech data.
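One way such a comparison could be scored is a paired bootstrap over per-example benchmark scores, sketched below in Python. This is a generic statistical technique, not a procedure described in the paper. The function name, the resample count, and the score lists are hypothetical placeholders, not reported results.

```python
import random


def paired_bootstrap(scores_a, scores_b, n_resamples=2000, seed=0):
    """Fraction of bootstrap resamples in which system A's mean beats system B's.

    scores_a and scores_b are per-example scores on the same test items.
    """
    if len(scores_a) != len(scores_b) or not scores_a:
        raise ValueError("need two non-empty, equally long score lists")
    rng = random.Random(seed)
    n = len(scores_a)
    diffs = [a - b for a, b in zip(scores_a, scores_b)]
    wins = 0
    for _ in range(n_resamples):
        # Resample test items with replacement, keeping each A/B pair together.
        sample = [diffs[rng.randrange(n)] for _ in range(n)]
        if sum(sample) > 0:
            wins += 1
    return wins / n_resamples


# Placeholder scores for illustration only; not numbers from the paper.
rba_scores = [0.8, 0.6, 0.9, 0.7]
sft_scores = [0.7, 0.6, 0.8, 0.75]
win_rate = paired_bootstrap(rba_scores, sft_scores)
```

A win rate near 0.5 or below on real scores would be the kind of outcome that counts against the claim. A rate near 1.0 would support it only for the benchmark and metric used.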

Figures

Figures reproduced from arXiv: 2509.03526 by Jiateng Li, Yansong Liu, Yuan Liu.

**Figure 1.** Frameworks of RBA. Step 1: generate text user instruction by modifying pre-defined query template, followed by ...

Original abstract

The recent advancements of Large Language Models (LLMs) have spurred considerable research interest in extending their linguistic capabilities beyond text to other modalities, which leads to emergence of speech-based LLMs (SpeechLMs) with capability of processing user request in either speech or textual formats. However, owing to inter-modal discrepancies, these SpeechLMs still exhibit a significant performance gap compared to their text-based LLM counterparts in instruction-following, particularly when confronted with the dynamic and variable nature of user speech. To address this challenge, this paper introduces a framework termed Reinforced Behavior Alignment (RBA), designed to bolster the language generation proficiency of SpeechLMs. Instead of relying on supervised fine-tuning from human annotations, RBA employs a self-synthesis methodology to generate extensive, high-fidelity alignment data by a powerful teacher LLM. Then SpeechLMs is aligned its behavior with that of a teacher using a reinforcement learning-based approach. Experimental results demonstrate that this method effectively enhances the instruction-following capabilities of SpeechLMs that outperform conventional distillation baselines. Crucially, we demonstrate that RBA can be seamlessly extended to tasks such including spoken question answering and speech-to-text translation, attaining state-of-the-art performance on open benchmarks with only self-generated data.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Desk Editor's Note

private letter to a colleague

The paper applies RL alignment to SpeechLMs with self-synthesized data from a text teacher LLM, but the abstract supplies no metrics or validation details to support the SOTA claims.

The main takeaway is that this work takes self-synthesis and reinforcement learning alignment, already used in text LLMs, and ports them to SpeechLMs to close the instruction-following gap caused by speech variability. They generate alignment targets from a strong text teacher, then use RL to match the speech model's outputs to those targets, and claim this beats standard distillation while hitting SOTA on spoken QA and speech-to-text translation with no human data at all.

Referee Report

2 major / 2 minor

Summary. This paper presents Reinforced Behavior Alignment (RBA), a framework for improving Speech Large Language Models (SpeechLMs) by generating alignment data through self-synthesis using a powerful teacher LLM and then applying reinforcement learning to align the model's behavior. The authors claim that RBA enhances instruction-following capabilities beyond conventional distillation methods and can be extended to spoken question answering and speech-to-text translation, achieving state-of-the-art results on benchmarks using only self-generated data without human annotations.

Significance. Should the reported experimental outcomes prove robust, this work offers a promising direction for aligning speech-based LLMs with text-based counterparts using synthetic data and RL techniques. This could lower barriers to developing high-performing multimodal models by minimizing dependence on human-annotated datasets, with potential applications in various speech processing tasks.

major comments (2)

Abstract: The abstract states that experimental results demonstrate outperformance and SOTA performance but provides no quantitative metrics, baselines, error bars, dataset details, or ablation studies to support these claims.
Experiments section: The description of results for spoken question answering and speech-to-text translation does not include comparisons to human-annotated data or analysis of how self-generated targets handle acoustic variability, which is central to validating the no-human-annotation premise.

minor comments (2)

Abstract: Grammatical issue: 'SpeechLMs is aligned its behavior with that of a teacher' is awkward and should be revised for clarity.
Abstract: Typo or phrasing: 'tasks such including spoken question answering' should read 'tasks including spoken question answering'.

Simulated Author's Rebuttal

2 responses · 0 unresolved

We thank the referee for the constructive feedback on our manuscript. We address the major comments point by point below and describe the revisions we will make.

Referee: Abstract: The abstract states that experimental results demonstrate outperformance and SOTA performance but provides no quantitative metrics, baselines, error bars, dataset details, or ablation studies to support these claims.

Authors: We agree that the abstract would be strengthened by including key quantitative results. In the revised version, we will add specific metrics such as accuracy improvements on spoken QA benchmarks and BLEU scores for speech-to-text translation, along with references to the main baselines and datasets. Detailed error bars, full ablations, and dataset statistics will continue to appear in the experiments section, as abstract length limits preclude their inclusion there.

revision: yes

Referee: Experiments section: The description of results for spoken question answering and speech-to-text translation does not include comparisons to human-annotated data or analysis of how self-generated targets handle acoustic variability, which is central to validating the no-human-annotation premise.

Authors: We acknowledge the value of direct comparisons to human-annotated data for contextualizing our results. Our current experiments focus on outperforming distillation baselines (which typically use human annotations) using only self-generated data, achieving SOTA on the reported benchmarks. We will add a discussion of available human-annotated equivalents where they exist for these tasks and include an analysis of acoustic variability, examining how the RL-based alignment mitigates performance drops under varied acoustic conditions. This will be incorporated as a new subsection or expanded paragraph in the experiments.

revision: partial

Circularity Check

0 steps flagged

No circularity: empirical method validated on external benchmarks

The paper introduces the RBA framework via self-synthesis of alignment data from a text teacher LLM followed by RL-based behavior alignment, then reports experimental gains on instruction-following, spoken QA, and speech-to-text translation tasks. No equations, derivations, fitted parameters renamed as predictions, or load-bearing self-citations appear in the provided text. All performance claims reference open external benchmarks and conventional distillation baselines rather than reducing internally to the method's own inputs by construction. The work is therefore self-contained as an empirical contribution.

Axiom & Free-Parameter Ledger

0 free parameters · 1 axiom · 0 invented entities

The framework depends on the unverified assumption that a text teacher LLM can produce high-quality speech-aligned training data through self-synthesis alone.

axioms (1)

domain assumption: Self-synthesis by a powerful teacher LLM produces extensive high-fidelity alignment data suitable for RL-based behavior alignment of SpeechLMs.
Explicitly stated in the abstract as the data-generation step replacing human annotations.

pith-pipeline@v0.9.0 · 5746 in / 1235 out tokens · 37826 ms · 2026-05-21T22:19:09.748398+00:00 · methodology

Lean theorems connected to this paper

Citations machine-checked in the Pith Canon. Every link opens the source theorem in the public Lean library.

IndisputableMonolith/Cost/FunctionalEquation.lean washburn_uniqueness_aczel unclear

Relation between the paper passage and the cited Recognition theorem.

RBA employs a self-synthesis methodology to generate extensive, high-fidelity alignment data by a powerful teacher LLM. Then SpeechLMs is aligned its behavior with that of a teacher using a reinforcement learning-based approach.

IndisputableMonolith/Foundation/RealityFromDistinction.lean reality_from_one_distinction unclear

Relation between the paper passage and the cited Recognition theorem.

We demonstrate that RBA can be seamlessly extended to tasks such including spoken question answering and speech-to-text translation, attaining state-of-the-art performance on open benchmarks with only self-generated data.

What do these tags mean?

matches: The paper's claim is directly supported by a theorem in the formal canon.
supports: The theorem supports part of the paper's argument, but the paper may add assumptions or extra steps.
extends: The paper goes beyond the formal theorem; the theorem is a base layer rather than the whole result.
uses: The paper appears to rely on the theorem as machinery.
contradicts: The paper's claim conflicts with a theorem or certificate in the canon.
unclear: Pith found a possible connection, but the passage is too broad, indirect, or ambiguous to say the theorem truly supports the claim.

Reference graph

Works this paper leans on

62 extracted references · 62 canonical work pages · 20 internal anchors

[1]

, " * write output.state after.block = add.period write newline

ENTRY address archivePrefix author booktitle chapter edition editor eid eprint howpublished institution isbn journal key month note number organization pages publisher school series title type volume year label extra.label sort.label short.list INTEGERS output.state before.all mid.sentence after.sentence after.block FUNCTION init.state.consts #0 'before.a...

work page
[2]

write newline

" write newline "" before.all 'output.state := FUNCTION n.dashify 't := "" t empty not t #1 #1 substring "-" = t #1 #2 substring "--" = not "--" * t #2 global.max substring 't := t #1 #1 substring "-" = "-" * t #2 global.max substring 't := while if t #1 #1 substring * t #2 global.max substring 't := if while FUNCTION word.in bbl.in capitalize " " * FUNCT...

work page
[3]

GPT-4 Technical Report

Achiam, J.; Adler, S.; Agarwal, S.; Ahmad, L.; Akkaya, I.; Aleman, F. L.; Almeida, D.; Altenschmidt, J.; Altman, S.; Anadkat, S.; et al. 2023. Gpt-4 technical report. arXiv preprint arXiv:2303.08774

work page internal anchor Pith review Pith/arXiv arXiv 2023
[4]

Alayrac, J.-B.; Donahue, J.; Luc, P.; Miech, A.; Barr, I.; Hasson, Y.; Lenc, K.; Mensch, A.; Millican, K.; Reynolds, M.; et al. 2022. Flamingo: a visual language model for few-shot learning. Advances in neural information processing systems, 35: 23716--23736

work page 2022
[5]

Amini, A.; Vieira, T.; and Cotterell, R. 2024. Direct Preference Optimization with an Offset. arXiv preprint arXiv:2402.10571

work page arXiv 2024
[6]

Bai, Y.; Jones, A.; Ndousse, K.; Askell, A.; Chen, A.; DasSarma, N.; Drain, D.; Fort, S.; Ganguli, D.; Henighan, T.; et al. 2022. Training a helpful and harmless assistant with reinforcement learning from human feedback. arXiv preprint arXiv:2204.05862

work page internal anchor Pith review Pith/arXiv arXiv 2022
[7]

C.; Dale, D.; Dong, N.; Duquenne, P.-A.; Elsahar, H.; Gong, H.; Heffernan, K.; Hoffman, J.; et al

Barrault, L.; Chung, Y.-A.; Meglioli, M. C.; Dale, D.; Dong, N.; Duquenne, P.-A.; Elsahar, H.; Gong, H.; Heffernan, K.; Hoffman, J.; et al. 2023. SeamlessM4T: Massively Multilingual & Multimodal Machine Translation. arXiv preprint arXiv:2308.11596

work page arXiv 2023
[8]

Berant, J.; Chou, A.; Frostig, R.; and Liang, P. 2013. Semantic parsing on freebase from question-answer pairs. In Proceedings of the 2013 conference on empirical methods in natural language processing, 1533--1544

work page 2013
[9]

o r \'e nyi, B.; Weng, P.; Cheng, W.; and H \

Busa-Fekete, R.; Sz \"o r \'e nyi, B.; Weng, P.; Cheng, W.; and H \"u llermeier, E. 2014. Preference-based reinforcement learning: evolutionary direct policy search using a preference-based racing algorithm. Machine learning, 97: 327--351

work page 2014
[10]

H.; Siniscalchi, S

Chen, C.; Hu, Y.; Yang, C.-H. H.; Siniscalchi, S. M.; Chen, P.-Y.; and Chng, E.-S. 2023. Hyporadise: An open baseline for generative speech recognition with large language models. Advances in Neural Information Processing Systems, 36: 31665--31688

work page 2023
[11]

Chen, Z.; Deng, Y.; Yuan, H.; Ji, K.; and Gu, Q. 2024. Self-play fine-tuning converts weak language models to strong language models. arXiv preprint arXiv:2401.01335

work page internal anchor Pith review Pith/arXiv arXiv 2024
[12]

Chu, Y.; Xu, J.; Yang, Q.; Wei, H.; Wei, X.; Guo, Z.; Leng, Y.; Lv, Y.; He, J.; Lin, J.; et al. 2024. Qwen2-audio technical report. arXiv preprint arXiv:2407.10759

work page internal anchor Pith review Pith/arXiv arXiv 2024
[13]

Chu, Y.; Xu, J.; Zhou, X.; Yang, Q.; Zhang, S.; Yan, Z.; Zhou, C.; and Zhou, J. 2023. Qwen-audio: Advancing universal audio understanding via unified large-scale audio-language models. arXiv preprint arXiv:2311.07919

work page internal anchor Pith review Pith/arXiv arXiv 2023
[14]

Conneau, A.; Ma, M.; Khanuja, S.; Zhang, Y.; Axelrod, V.; Dalmia, S.; Riesa, J.; Rivera, C.; and Bapna, A. 2023. Fleurs: Few-shot learning evaluation of universal representations of speech. In 2022 IEEE Spoken Language Technology Workshop (SLT), 798--805. IEEE

work page 2023
[15]

Dai, J.; Pan, X.; Sun, R.; Ji, J.; Xu, X.; Liu, M.; Wang, Y.; and Yang, Y. 2023. Safe rlhf: Safe reinforcement learning from human feedback. arXiv preprint arXiv:2310.12773

work page internal anchor Pith review Pith/arXiv arXiv 2023
[16]

Han, and Katrin Kirchhoff

Das, N.; Dingliwal, S.; Ronanki, S.; Paturi, R.; Huang, Z.; Mathur, P.; Yuan, J.; Bekal, D.; Niu, X.; Jayanthi, S. M.; et al. 2024. Speechverse: A large-scale generalizable audio language model. arXiv preprint arXiv:2405.08295

work page arXiv 2024
[17]

D \'e fossez, A.; Mazar \'e , L.; Orsini, M.; Royer, A.; P \'e rez, P.; J \'e gou, H.; Grave, E.; and Zeghidour, N. 2024. Moshi: a speech-text foundation model for real-time dialogue. arXiv preprint arXiv:2410.00037

work page internal anchor Pith review Pith/arXiv arXiv 2024
[18]

A.; Cattoni, R.; Bentivogli, L.; Negri, M.; and Turchi, M

Di Gangi, M. A.; Cattoni, R.; Bentivogli, L.; Negri, M.; and Turchi, M. 2019. Must-c: a multilingual speech translation corpus. In Proceedings of the 2019 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, Volume 1 (Long and Short Papers), 2012--2017. Association for Computational Linguistics

work page 2019
[19]

Dong, H.; Xiong, W.; Pang, B.; Wang, H.; Zhao, H.; Zhou, Y.; Jiang, N.; Sahoo, D.; Xiong, C.; and Zhang, T. 2024. Rlhf workflow: From reward modeling to online rlhf. arXiv preprint arXiv:2405.07863

work page internal anchor Pith review Pith/arXiv arXiv 2024
[20]

Du, Z.; Chen, Q.; Zhang, S.; Hu, K.; Lu, H.; Yang, Y.; Hu, H.; Zheng, S.; Gu, Y.; Ma, Z.; et al. 2024. Cosyvoice: A scalable multilingual zero-shot text-to-speech synthesizer based on supervised semantic tokens. arXiv preprint arXiv:2407.05407

work page internal anchor Pith review Pith/arXiv arXiv 2024
[21]

Dubois, Y.; Galambosi, B.; Liang, P.; and Hashimoto, T. B. ???? Length-controlled alpacaeval: A simple way to debias automatic evaluators, 2024. URL https://arxiv. org/abs/2404.04475

work page internal anchor Pith review Pith/arXiv arXiv 2024
[22]

Fang, Q.; Guo, S.; Zhou, Y.; Ma, Z.; Zhang, S.; and Feng, Y. 2024. Llama-omni: Seamless speech interaction with large language models. arXiv preprint arXiv:2409.06666

work page arXiv 2024
[23]

Fathullah, Y.; Wu, C.; Lakomkin, E.; Li, K.; Jia, J.; Shangguan, Y.; Mahadeokar, J.; Kalinli, O.; Fuegen, C.; and Seltzer, M. 2023. Audiochatllama: Towards general-purpose speech abilities for llms. arXiv preprint arXiv:2311.06753

work page arXiv 2023
[24]

Feng, X.; Jiang, Z.; Kaufmann, T.; Xu, P.; H \"u llermeier, E.; Weng, P.; and Zhu, Y. 2025. DUO: Diverse, Uncertain, On-Policy Query Generation and Selection for Reinforcement Learning from Human Feedback. In Proceedings of the AAAI Conference on Artificial Intelligence, volume 39, 16604--16612

work page 2025
[25]

A.; Gat, I.; Conneau, A.; Kreuk, F.; Copet, J.; Defossez, A.; Synnaeve, G.; Dupoux, E.; et al

Hassid, M.; Remez, T.; Nguyen, T. A.; Gat, I.; Conneau, A.; Kreuk, F.; Copet, J.; Defossez, A.; Synnaeve, G.; Dupoux, E.; et al. 2023. Textually pretrained speech language models. Advances in Neural Information Processing Systems, 36: 63483--63501

work page 2023
[26]

H.; \.Z elasko, P.; Hrinchuk, O.; Lavrukhin, V.; Balam, J.; and Ginsburg, B

Hu, K.; Chen, Z.; Yang, C.-H. H.; \.Z elasko, P.; Hrinchuk, O.; Lavrukhin, V.; Balam, J.; and Ginsburg, B. 2025. Chain-of-thought prompting for speech translation. In ICASSP 2025-2025 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), 1--5. IEEE

work page 2025
[27]

Hu, S.; Zhou, L.; Liu, S.; Chen, S.; Meng, L.; Hao, H.; Pan, J.; Liu, X.; Li, J.; Sivasankaran, S.; et al. 2024. Wavllm: Towards robust and adaptive speech large language model. arXiv preprint arXiv:2404.00656

work page arXiv 2024
[28]

Jain, A.; Wojcik, B.; Joachims, T.; and Saxena, A. 2013. Learning trajectory preferences for manipulators via iterative improvement. Advances in neural information processing systems, 26

work page 2013
[29]

Ji, S.; Jiang, Z.; Wang, W.; Chen, Y.; Fang, M.; Zuo, J.; Yang, Q.; Cheng, X.; Wang, Z.; Li, R.; et al. 2024. Wavtokenizer: an efficient acoustic discrete codec tokenizer for audio language modeling. arXiv preprint arXiv:2408.16532

work page arXiv 2024
[30]

TriviaQA: A Large Scale Distantly Supervised Challenge Dataset for Reading Comprehension

Joshi, M.; Choi, E.; Weld, D. S.; and Zettlemoyer, L. 2017. Triviaqa: A large scale distantly supervised challenge dataset for reading comprehension. arXiv preprint arXiv:1705.03551

work page internal anchor Pith review Pith/arXiv arXiv 2017
[31]

Kim, H.; Seo, S.; Jeong, K.; Kwon, O.; Kim, S.; Kim, J.; Lee, J.; Song, E.; Oh, M.; Ha, J.-W.; et al. 2024. Paralinguistics-Aware Speech-Empowered Large Language Models for Natural Conversation. arXiv preprint arXiv:2402.05706

work page arXiv 2024
[32]

Lakhotia, K.; Kharitonov, E.; Hsu, W.-N.; Adi, Y.; Polyak, A.; Bolte, B.; Nguyen, T.-A.; Copet, J.; Baevski, A.; Mohamed, A.; et al. 2021. On generative spoken language modeling from raw audio. Transactions of the Association for Computational Linguistics, 9: 1336--1354

work page 2021
[33]

Li, X.; Zhang, T.; Dubois, Y.; Taori, R.; Gulrajani, I.; Guestrin, C.; Liang, P.; and Hashimoto, T. B. 2023. Alpacaeval: An automatic evaluator of instruction-following models

work page 2023
[34]

Lin, B.; Ye, Y.; Zhu, B.; Cui, J.; Ning, M.; Jin, P.; and Yuan, L. 2023. Video-llava: Learning united visual representation by alignment before projection. arXiv preprint arXiv:2311.10122

work page internal anchor Pith review Pith/arXiv arXiv 2023
[35]

G.; Gandhe, A.; Yang, C.-H

Lin, G.-T.; Shivakumar, P. G.; Gandhe, A.; Yang, C.-H. H.; Gu, Y.; Ghosh, S.; Stolcke, A.; Lee, H.-y.; and Bulyko, I. 2024. Paralinguistics-enhanced large language modeling of spoken dialogue. In ICASSP 2024-2024 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), 10316--10320. IEEE

work page 2024
[36]

Liu, S.; Fang, W.; Hu, Z.; Zhang, J.; Zhou, Y.; Zhang, K.; Tu, R.; Lin, T.-E.; Huang, F.; Song, M.; et al. 2025. A survey of direct preference optimization. arXiv preprint arXiv:2503.11701

work page arXiv 2025
[37]

Liu, Z.; Sun, X.; and Zheng, Z. 2024. Enhancing LLM Safety via Constrained Direct Preference Optimization. arXiv preprint arXiv:2403.02475

work page arXiv 2024
[38]

Nachmani, E.; Levkovitch, A.; Hirsch, R.; Salazar, J.; Asawaroengchai, C.; Mariooryad, S.; Rivlin, E.; Skerry-Ryan, R.; and Ramanovich, M. T. 2023. Spoken question answering and speech continuation using spectrogram-powered llm. arXiv preprint arXiv:2305.15255

work page arXiv 2023
[39]

N.; Wu, Y.; Nguyen, P.; Chen, Z.; Chiu, C.-C.; and Kannan, A

Prabhavalkar, R.; Sainath, T. N.; Wu, Y.; Nguyen, P.; Chen, Z.; Chiu, C.-C.; and Kannan, A. 2018. Minimum word error rate training for attention-based sequence-to-sequence models. In 2018 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), 4839--4843. IEEE

work page 2018
[40]

W.; Xu, T.; Brockman, G.; McLeavey, C.; and Sutskever, I

Radford, A.; Kim, J. W.; Xu, T.; Brockman, G.; McLeavey, C.; and Sutskever, I. 2023. Robust speech recognition via large-scale weak supervision. In International conference on machine learning, 28492--28518. PMLR

work page 2023
[41]

D.; Ermon, S.; and Finn, C

Rafailov, R.; Sharma, A.; Mitchell, E.; Manning, C. D.; Ermon, S.; and Finn, C. 2024. Direct preference optimization: Your language model is secretly a reward model. Advances in Neural Information Processing Systems, 36

work page 2024
[42]

J.; Marcheret, E.; Mroueh, Y.; Ross, J.; and Goel, V

Rennie, S. J.; Marcheret, E.; Mroueh, Y.; Ross, J.; and Goel, V. 2017. Self-critical sequence training for image captioning. In Proceedings of the IEEE conference on computer vision and pattern recognition, 7008--7024

work page 2017
[43]

AudioPaLM: A Large Language Model That Can Speak and Listen

Rubenstein, P. K.; Asawaroengchai, C.; Nguyen, D. D.; Bapna, A.; Borsos, Z.; Quitry, F. d. C.; Chen, P.; Badawy, D. E.; Han, W.; Kharitonov, E.; et al. 2023. Audiopalm: A large language model that can speak and listen. arXiv preprint arXiv:2306.12925

work page internal anchor Pith review Pith/arXiv arXiv 2023
[44]

Shu, Y.; Dong, S.; Chen, G.; Huang, W.; Zhang, R.; Shi, D.; Xiang, Q.; and Shi, Y. 2023. Llasm: Large language and speech model. arXiv preprint arXiv:2308.15930

work page arXiv 2023
[45]

Tang, C.; Yu, W.; Sun, G.; Chen, X.; Tan, T.; Li, W.; Lu, L.; Ma, Z.; and Zhang, C. 2023. Salmonn: Towards generic hearing abilities for large language models. arXiv preprint arXiv:2310.13289

work page internal anchor Pith review Pith/arXiv arXiv 2023
[46]

Touvron, H.; Lavril, T.; Izacard, G.; Martinet, X.; Lachaux, M.-A.; Lacroix, T.; Rozi \`e re, B.; Goyal, N.; Hambro, E.; Azhar, F.; et al. 2023. Llama: Open and efficient foundation language models. arXiv preprint arXiv:2302.13971

work page internal anchor Pith review Pith/arXiv arXiv 2023
[47]

Wang, C.; Chen, S.; Wu, Y.; Zhang, Z.; Zhou, L.; Liu, S.; Chen, Z.; Liu, Y.; Wang, H.; Li, J.; et al. 2023 a . Neural codec language models are zero-shot text to speech synthesizers. arXiv preprint arXiv:2301.02111

work page internal anchor Pith review Pith/arXiv arXiv 2023
[48]

Wang, C.; Liao, M.; Huang, Z.; Lu, J.; Wu, J.; Liu, Y.; Zong, C.; and Zhang, J. 2023 b . Blsp: Bootstrapping language-speech pre-training via behavior alignment of continuation writing. arXiv preprint arXiv:2309.00916

work page arXiv 2023
[49]

Wang, C.; Wu, A.; Gu, J.; and Pino, J. 2021. CoVoST 2 and massively multilingual speech translation. In Interspeech, volume 2021, 2247--2251

work page 2021
[50]

Wu, J.; Gaur, Y.; Chen, Z.; Zhou, L.; Zhu, Y.; Wang, T.; Li, J.; Liu, S.; Ren, B.; Liu, L.; et al. 2023. On decoder-only architecture for speech-to-text and large language model integration. In 2023 IEEE Automatic Speech Recognition and Understanding Workshop (ASRU), 1--8. IEEE

work page 2023
[51]

Wu, L.; Tian, F.; Qin, T.; Lai, J.; and Liu, T.-Y. 2018. A study of reinforcement learning for neural machine translation. arXiv preprint arXiv:1808.08866

work page internal anchor Pith review Pith/arXiv arXiv 2018
[52]

Xu, Z.; Jiang, F.; Niu, L.; Deng, Y.; Poovendran, R.; Choi, Y.; and Lin, B. Y. 2024. Magpie: Alignment data synthesis from scratch by prompting aligned llms with nothing. arXiv preprint arXiv:2406.08464

work page internal anchor Pith review Pith/arXiv arXiv 2024
[53]

Yang, D.; Tian, J.; Tan, X.; Huang, R.; Liu, S.; Chang, X.; Shi, J.; Zhao, S.; Bian, J.; Wu, X.; et al. 2023. Uniaudio: An audio foundation model toward universal audio generation. arXiv preprint arXiv:2310.00704

work page arXiv 2023
[54]

Ye, Z.; Zhu, X.; Chan, C.-M.; Wang, X.; Tan, X.; Lei, J.; Peng, Y.; Liu, H.; Jin, Y.; DAI, Z.; et al. 2025. Llasa: Scaling Train-Time and Inference-Time Compute for Llama-based Speech Synthesis. arXiv preprint arXiv:2502.04128

work page arXiv 2025
[55]

Self-Rewarding Language Models

Yuan, W.; Pang, R. Y.; Cho, K.; Sukhbaatar, S.; Xu, J.; and Weston, J. 2024. Self-rewarding language models. arXiv preprint arXiv:2401.10020

work page internal anchor Pith review Pith/arXiv arXiv 2024
[56]

LibriTTS: A Corpus Derived from LibriSpeech for Text-to-Speech

Zen, H.; Dang, V.; Clark, R.; Zhang, Y.; Weiss, R. J.; Jia, Y.; Chen, Z.; and Wu, Y. 2019. Libritts: A corpus derived from librispeech for text-to-speech. arXiv preprint arXiv:1904.02882

work page internal anchor Pith review Pith/arXiv arXiv 2019
[57]

Zeng, A.; Du, Z.; Liu, M.; Zhang, L.; Jiang, S.; Dong, Y.; and Tang, J. 2024 a . Scaling speech-text pre-training with synthetic interleaved data. arXiv preprint arXiv:2411.17607

work page arXiv 2024
[58]

Zeng, Y.; Liu, G.; Ma, W.; Yang, N.; Zhang, H.; and Wang, J. 2024 b . Token-level Direct Preference Optimization. arXiv preprint arXiv:2404.11999

work page arXiv 2024
[59]

Zhang, D.; Li, S.; Zhang, X.; Zhan, J.; Wang, P.; Zhou, Y.; and Qiu, X. 2023. Speechgpt: Empowering large language models with intrinsic cross-modal conversational abilities. arXiv preprint arXiv:2305.11000

work page arXiv 2023
[60]

Zhang, J.; Huang, J.; Jin, S.; and Lu, S. 2024. Vision-language models for vision tasks: A survey. IEEE Transactions on Pattern Analysis and Machine Intelligence

work page 2024
[61]

Zhang, S.; Liu, X.; Zhang, X.; Liu, J.; Luo, Z.; Huang, S.; and Gong, Y. 2025. Process-based self-rewarding language models. arXiv preprint arXiv:2503.03746

work page arXiv 2025
[62]

Zhou, Z.; Liu, J.; Yang, C.; Shao, J.; Liu, Y.; Yue, X.; Ouyang, W.; and Qiao, Y. 2023. Beyond one-preference-for-all: Multi-objective direct preference optimization. arXiv preprint arXiv:2310.03708

work page arXiv 2023

[1] [1]

, " * write output.state after.block = add.period write newline

ENTRY address archivePrefix author booktitle chapter edition editor eid eprint howpublished institution isbn journal key month note number organization pages publisher school series title type volume year label extra.label sort.label short.list INTEGERS output.state before.all mid.sentence after.sentence after.block FUNCTION init.state.consts #0 'before.a...

work page

[2] [2]

write newline

" write newline "" before.all 'output.state := FUNCTION n.dashify 't := "" t empty not t #1 #1 substring "-" = t #1 #2 substring "--" = not "--" * t #2 global.max substring 't := t #1 #1 substring "-" = "-" * t #2 global.max substring 't := while if t #1 #1 substring * t #2 global.max substring 't := if while FUNCTION word.in bbl.in capitalize " " * FUNCT...

work page

[3] [3]

GPT-4 Technical Report

Achiam, J.; Adler, S.; Agarwal, S.; Ahmad, L.; Akkaya, I.; Aleman, F. L.; Almeida, D.; Altenschmidt, J.; Altman, S.; Anadkat, S.; et al. 2023. Gpt-4 technical report. arXiv preprint arXiv:2303.08774

work page internal anchor Pith review Pith/arXiv arXiv 2023

[4] [4]

Alayrac, J.-B.; Donahue, J.; Luc, P.; Miech, A.; Barr, I.; Hasson, Y.; Lenc, K.; Mensch, A.; Millican, K.; Reynolds, M.; et al. 2022. Flamingo: a visual language model for few-shot learning. Advances in neural information processing systems, 35: 23716--23736

work page 2022

[5] [5]

Amini, A.; Vieira, T.; and Cotterell, R. 2024. Direct Preference Optimization with an Offset. arXiv preprint arXiv:2402.10571

work page arXiv 2024

[6] [6]

Bai, Y.; Jones, A.; Ndousse, K.; Askell, A.; Chen, A.; DasSarma, N.; Drain, D.; Fort, S.; Ganguli, D.; Henighan, T.; et al. 2022. Training a helpful and harmless assistant with reinforcement learning from human feedback. arXiv preprint arXiv:2204.05862

[7]

Barrault, L.; Chung, Y.-A.; Meglioli, M. C.; Dale, D.; Dong, N.; Duquenne, P.-A.; Elsahar, H.; Gong, H.; Heffernan, K.; Hoffman, J.; et al. 2023. SeamlessM4T: Massively Multilingual & Multimodal Machine Translation. arXiv preprint arXiv:2308.11596

[8]

Berant, J.; Chou, A.; Frostig, R.; and Liang, P. 2013. Semantic parsing on freebase from question-answer pairs. In Proceedings of the 2013 conference on empirical methods in natural language processing, 1533--1544

[9]

Busa-Fekete, R.; Szörényi, B.; Weng, P.; Cheng, W.; and Hüllermeier, E. 2014. Preference-based reinforcement learning: evolutionary direct policy search using a preference-based racing algorithm. Machine learning, 97: 327--351

[10]

Chen, C.; Hu, Y.; Yang, C.-H. H.; Siniscalchi, S. M.; Chen, P.-Y.; and Chng, E.-S. 2023. Hyporadise: An open baseline for generative speech recognition with large language models. Advances in Neural Information Processing Systems, 36: 31665--31688

[11]

Chen, Z.; Deng, Y.; Yuan, H.; Ji, K.; and Gu, Q. 2024. Self-play fine-tuning converts weak language models to strong language models. arXiv preprint arXiv:2401.01335

[12]

Chu, Y.; Xu, J.; Yang, Q.; Wei, H.; Wei, X.; Guo, Z.; Leng, Y.; Lv, Y.; He, J.; Lin, J.; et al. 2024. Qwen2-audio technical report. arXiv preprint arXiv:2407.10759

[13]

Chu, Y.; Xu, J.; Zhou, X.; Yang, Q.; Zhang, S.; Yan, Z.; Zhou, C.; and Zhou, J. 2023. Qwen-audio: Advancing universal audio understanding via unified large-scale audio-language models. arXiv preprint arXiv:2311.07919

[14]

Conneau, A.; Ma, M.; Khanuja, S.; Zhang, Y.; Axelrod, V.; Dalmia, S.; Riesa, J.; Rivera, C.; and Bapna, A. 2023. Fleurs: Few-shot learning evaluation of universal representations of speech. In 2022 IEEE Spoken Language Technology Workshop (SLT), 798--805. IEEE

[15]

Dai, J.; Pan, X.; Sun, R.; Ji, J.; Xu, X.; Liu, M.; Wang, Y.; and Yang, Y. 2023. Safe rlhf: Safe reinforcement learning from human feedback. arXiv preprint arXiv:2310.12773

[16]

Das, N.; Dingliwal, S.; Ronanki, S.; Paturi, R.; Huang, Z.; Mathur, P.; Yuan, J.; Bekal, D.; Niu, X.; Jayanthi, S. M.; et al. 2024. Speechverse: A large-scale generalizable audio language model. arXiv preprint arXiv:2405.08295

[17]

Défossez, A.; Mazaré, L.; Orsini, M.; Royer, A.; Pérez, P.; Jégou, H.; Grave, E.; and Zeghidour, N. 2024. Moshi: a speech-text foundation model for real-time dialogue. arXiv preprint arXiv:2410.00037

[18]

Di Gangi, M. A.; Cattoni, R.; Bentivogli, L.; Negri, M.; and Turchi, M. 2019. Must-c: a multilingual speech translation corpus. In Proceedings of the 2019 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, Volume 1 (Long and Short Papers), 2012--2017. Association for Computational Linguistics

[19]

Dong, H.; Xiong, W.; Pang, B.; Wang, H.; Zhao, H.; Zhou, Y.; Jiang, N.; Sahoo, D.; Xiong, C.; and Zhang, T. 2024. Rlhf workflow: From reward modeling to online rlhf. arXiv preprint arXiv:2405.07863

[20]

Du, Z.; Chen, Q.; Zhang, S.; Hu, K.; Lu, H.; Yang, Y.; Hu, H.; Zheng, S.; Gu, Y.; Ma, Z.; et al. 2024. Cosyvoice: A scalable multilingual zero-shot text-to-speech synthesizer based on supervised semantic tokens. arXiv preprint arXiv:2407.05407

[21]

Dubois, Y.; Galambosi, B.; Liang, P.; and Hashimoto, T. B. 2024. Length-controlled alpacaeval: A simple way to debias automatic evaluators. URL https://arxiv.org/abs/2404.04475

[22]

Fang, Q.; Guo, S.; Zhou, Y.; Ma, Z.; Zhang, S.; and Feng, Y. 2024. Llama-omni: Seamless speech interaction with large language models. arXiv preprint arXiv:2409.06666

[23]

Fathullah, Y.; Wu, C.; Lakomkin, E.; Li, K.; Jia, J.; Shangguan, Y.; Mahadeokar, J.; Kalinli, O.; Fuegen, C.; and Seltzer, M. 2023. Audiochatllama: Towards general-purpose speech abilities for llms. arXiv preprint arXiv:2311.06753

[24]

Feng, X.; Jiang, Z.; Kaufmann, T.; Xu, P.; Hüllermeier, E.; Weng, P.; and Zhu, Y. 2025. DUO: Diverse, Uncertain, On-Policy Query Generation and Selection for Reinforcement Learning from Human Feedback. In Proceedings of the AAAI Conference on Artificial Intelligence, volume 39, 16604--16612

[25]

Hassid, M.; Remez, T.; Nguyen, T. A.; Gat, I.; Conneau, A.; Kreuk, F.; Copet, J.; Defossez, A.; Synnaeve, G.; Dupoux, E.; et al. 2023. Textually pretrained speech language models. Advances in Neural Information Processing Systems, 36: 63483--63501

[26]

Hu, K.; Chen, Z.; Yang, C.-H. H.; Żelasko, P.; Hrinchuk, O.; Lavrukhin, V.; Balam, J.; and Ginsburg, B. 2025. Chain-of-thought prompting for speech translation. In ICASSP 2025-2025 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), 1--5. IEEE

[27]

Hu, S.; Zhou, L.; Liu, S.; Chen, S.; Meng, L.; Hao, H.; Pan, J.; Liu, X.; Li, J.; Sivasankaran, S.; et al. 2024. Wavllm: Towards robust and adaptive speech large language model. arXiv preprint arXiv:2404.00656

[28]

Jain, A.; Wojcik, B.; Joachims, T.; and Saxena, A. 2013. Learning trajectory preferences for manipulators via iterative improvement. Advances in neural information processing systems, 26

[29]

Ji, S.; Jiang, Z.; Wang, W.; Chen, Y.; Fang, M.; Zuo, J.; Yang, Q.; Cheng, X.; Wang, Z.; Li, R.; et al. 2024. Wavtokenizer: an efficient acoustic discrete codec tokenizer for audio language modeling. arXiv preprint arXiv:2408.16532

[30]

Joshi, M.; Choi, E.; Weld, D. S.; and Zettlemoyer, L. 2017. Triviaqa: A large scale distantly supervised challenge dataset for reading comprehension. arXiv preprint arXiv:1705.03551

[31]

Kim, H.; Seo, S.; Jeong, K.; Kwon, O.; Kim, S.; Kim, J.; Lee, J.; Song, E.; Oh, M.; Ha, J.-W.; et al. 2024. Paralinguistics-Aware Speech-Empowered Large Language Models for Natural Conversation. arXiv preprint arXiv:2402.05706

[32]

Lakhotia, K.; Kharitonov, E.; Hsu, W.-N.; Adi, Y.; Polyak, A.; Bolte, B.; Nguyen, T.-A.; Copet, J.; Baevski, A.; Mohamed, A.; et al. 2021. On generative spoken language modeling from raw audio. Transactions of the Association for Computational Linguistics, 9: 1336--1354

[33]

Li, X.; Zhang, T.; Dubois, Y.; Taori, R.; Gulrajani, I.; Guestrin, C.; Liang, P.; and Hashimoto, T. B. 2023. Alpacaeval: An automatic evaluator of instruction-following models

[34]

Lin, B.; Ye, Y.; Zhu, B.; Cui, J.; Ning, M.; Jin, P.; and Yuan, L. 2023. Video-llava: Learning united visual representation by alignment before projection. arXiv preprint arXiv:2311.10122

[35]

Lin, G.-T.; Shivakumar, P. G.; Gandhe, A.; Yang, C.-H. H.; Gu, Y.; Ghosh, S.; Stolcke, A.; Lee, H.-y.; and Bulyko, I. 2024. Paralinguistics-enhanced large language modeling of spoken dialogue. In ICASSP 2024-2024 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), 10316--10320. IEEE

[36]

Liu, S.; Fang, W.; Hu, Z.; Zhang, J.; Zhou, Y.; Zhang, K.; Tu, R.; Lin, T.-E.; Huang, F.; Song, M.; et al. 2025. A survey of direct preference optimization. arXiv preprint arXiv:2503.11701

[37]

Liu, Z.; Sun, X.; and Zheng, Z. 2024. Enhancing LLM Safety via Constrained Direct Preference Optimization. arXiv preprint arXiv:2403.02475

[38]

Nachmani, E.; Levkovitch, A.; Hirsch, R.; Salazar, J.; Asawaroengchai, C.; Mariooryad, S.; Rivlin, E.; Skerry-Ryan, R.; and Ramanovich, M. T. 2023. Spoken question answering and speech continuation using spectrogram-powered llm. arXiv preprint arXiv:2305.15255

[39]

Prabhavalkar, R.; Sainath, T. N.; Wu, Y.; Nguyen, P.; Chen, Z.; Chiu, C.-C.; and Kannan, A. 2018. Minimum word error rate training for attention-based sequence-to-sequence models. In 2018 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), 4839--4843. IEEE

[40]

Radford, A.; Kim, J. W.; Xu, T.; Brockman, G.; McLeavey, C.; and Sutskever, I. 2023. Robust speech recognition via large-scale weak supervision. In International conference on machine learning, 28492--28518. PMLR

[41]

Rafailov, R.; Sharma, A.; Mitchell, E.; Manning, C. D.; Ermon, S.; and Finn, C. 2024. Direct preference optimization: Your language model is secretly a reward model. Advances in Neural Information Processing Systems, 36

[42]

Rennie, S. J.; Marcheret, E.; Mroueh, Y.; Ross, J.; and Goel, V. 2017. Self-critical sequence training for image captioning. In Proceedings of the IEEE conference on computer vision and pattern recognition, 7008--7024

[43]

Rubenstein, P. K.; Asawaroengchai, C.; Nguyen, D. D.; Bapna, A.; Borsos, Z.; Quitry, F. d. C.; Chen, P.; Badawy, D. E.; Han, W.; Kharitonov, E.; et al. 2023. Audiopalm: A large language model that can speak and listen. arXiv preprint arXiv:2306.12925

[44]

Shu, Y.; Dong, S.; Chen, G.; Huang, W.; Zhang, R.; Shi, D.; Xiang, Q.; and Shi, Y. 2023. Llasm: Large language and speech model. arXiv preprint arXiv:2308.15930

[45]

Tang, C.; Yu, W.; Sun, G.; Chen, X.; Tan, T.; Li, W.; Lu, L.; Ma, Z.; and Zhang, C. 2023. Salmonn: Towards generic hearing abilities for large language models. arXiv preprint arXiv:2310.13289

[46]

Touvron, H.; Lavril, T.; Izacard, G.; Martinet, X.; Lachaux, M.-A.; Lacroix, T.; Rozière, B.; Goyal, N.; Hambro, E.; Azhar, F.; et al. 2023. Llama: Open and efficient foundation language models. arXiv preprint arXiv:2302.13971

[47]

Wang, C.; Chen, S.; Wu, Y.; Zhang, Z.; Zhou, L.; Liu, S.; Chen, Z.; Liu, Y.; Wang, H.; Li, J.; et al. 2023a. Neural codec language models are zero-shot text to speech synthesizers. arXiv preprint arXiv:2301.02111

[48]

Wang, C.; Liao, M.; Huang, Z.; Lu, J.; Wu, J.; Liu, Y.; Zong, C.; and Zhang, J. 2023b. Blsp: Bootstrapping language-speech pre-training via behavior alignment of continuation writing. arXiv preprint arXiv:2309.00916

[49]

Wang, C.; Wu, A.; Gu, J.; and Pino, J. 2021. CoVoST 2 and massively multilingual speech translation. In Interspeech, volume 2021, 2247--2251

[50]

Wu, J.; Gaur, Y.; Chen, Z.; Zhou, L.; Zhu, Y.; Wang, T.; Li, J.; Liu, S.; Ren, B.; Liu, L.; et al. 2023. On decoder-only architecture for speech-to-text and large language model integration. In 2023 IEEE Automatic Speech Recognition and Understanding Workshop (ASRU), 1--8. IEEE

[51]

Wu, L.; Tian, F.; Qin, T.; Lai, J.; and Liu, T.-Y. 2018. A study of reinforcement learning for neural machine translation. arXiv preprint arXiv:1808.08866

[52]

Xu, Z.; Jiang, F.; Niu, L.; Deng, Y.; Poovendran, R.; Choi, Y.; and Lin, B. Y. 2024. Magpie: Alignment data synthesis from scratch by prompting aligned llms with nothing. arXiv preprint arXiv:2406.08464

[53]

Yang, D.; Tian, J.; Tan, X.; Huang, R.; Liu, S.; Chang, X.; Shi, J.; Zhao, S.; Bian, J.; Wu, X.; et al. 2023. Uniaudio: An audio foundation model toward universal audio generation. arXiv preprint arXiv:2310.00704

[54]

Ye, Z.; Zhu, X.; Chan, C.-M.; Wang, X.; Tan, X.; Lei, J.; Peng, Y.; Liu, H.; Jin, Y.; DAI, Z.; et al. 2025. Llasa: Scaling Train-Time and Inference-Time Compute for Llama-based Speech Synthesis. arXiv preprint arXiv:2502.04128

[55]

Yuan, W.; Pang, R. Y.; Cho, K.; Sukhbaatar, S.; Xu, J.; and Weston, J. 2024. Self-rewarding language models. arXiv preprint arXiv:2401.10020

[56]

Zen, H.; Dang, V.; Clark, R.; Zhang, Y.; Weiss, R. J.; Jia, Y.; Chen, Z.; and Wu, Y. 2019. Libritts: A corpus derived from librispeech for text-to-speech. arXiv preprint arXiv:1904.02882

[57]

Zeng, A.; Du, Z.; Liu, M.; Zhang, L.; Jiang, S.; Dong, Y.; and Tang, J. 2024a. Scaling speech-text pre-training with synthetic interleaved data. arXiv preprint arXiv:2411.17607

[58]

Zeng, Y.; Liu, G.; Ma, W.; Yang, N.; Zhang, H.; and Wang, J. 2024b. Token-level Direct Preference Optimization. arXiv preprint arXiv:2404.11999

[59]

Zhang, D.; Li, S.; Zhang, X.; Zhan, J.; Wang, P.; Zhou, Y.; and Qiu, X. 2023. Speechgpt: Empowering large language models with intrinsic cross-modal conversational abilities. arXiv preprint arXiv:2305.11000

[60]

Zhang, J.; Huang, J.; Jin, S.; and Lu, S. 2024. Vision-language models for vision tasks: A survey. IEEE Transactions on Pattern Analysis and Machine Intelligence

[61]

Zhang, S.; Liu, X.; Zhang, X.; Liu, J.; Luo, Z.; Huang, S.; and Gong, Y. 2025. Process-based self-rewarding language models. arXiv preprint arXiv:2503.03746

[62]

Zhou, Z.; Liu, J.; Yang, C.; Shao, J.; Liu, Y.; Yue, X.; Ouyang, W.; and Qiao, Y. 2023. Beyond one-preference-for-all: Multi-objective direct preference optimization. arXiv preprint arXiv:2310.03708
