Learning When to Think While Listening in Large Audio-Language Models

Cheng Zhu; Jiatao Gu; Suhao Yu; Weici Zhao; Yang Xiao; Zhiyuan Song

arxiv: 2605.27190 · v1 · pith:JQDI5RQYnew · submitted 2026-05-26 · 💻 cs.CL · cs.AI · cs.LG · cs.SD


Zhiyuan Song, Weici Zhao, Yang Xiao, Suhao Yu, Cheng Zhu, Jiatao Gu

Pith reviewed 2026-06-29 18:34 UTC · model grok-4.3

classification 💻 cs.CL · cs.AI · cs.LG · cs.SD

keywords large audio-language models · streaming spoken interaction · wait-think-answer control · policy optimization · spoken question answering · reasoning timing


The pith

A controller for audio-language models learns to decide during speech when to wait, output reasoning, or answer.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper introduces a wait-think-answer controller that operates on partial audio input to decide when to continue listening, externalize a compact reasoning step, or produce a final answer. It trains this controller first with supervised fine-tuning on aligned trajectories and then with policy optimization under a reward that scores answer correctness, action validity, update timing, latency match, reasoning quality, and chain consistency. On a six-task synthetic spoken reasoning benchmark the optimized controller raises row-weighted accuracy from 67.6 percent to 70.3 percent while shortening the post-endpoint thinking segment by 14 percent. The same family of controllers remains functional when transferred to human-recorded audio, with the six-reward variant uniquely reducing final-think length below the base model. A reader would care because the method directly targets the quality-versus-latency trade-off that limits natural spoken interaction with large audio-language models.

Core claim

The central claim is that a learnable wait-think-answer controller, optimized over complete trajectories rather than final answers alone, can simultaneously raise accuracy and shorten visible deliberation time in streaming spoken question answering.

What carries the argument

The wait-think-answer control formulation, which maps partial audio evidence to discrete actions of waiting, emitting a reasoning update, or answering.
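The control formulation above can be pictured as a loop over incoming audio chunks. The following is a minimal sketch, not the authors' implementation: the `Policy` signature, the `run_controller` helper, the `<endpoint>` marker, and the rule-based `toy_policy` are all hypothetical stand-ins for the learned controller, and text strings stand in for audio chunks.

```python
from enum import Enum
from typing import Callable, List, Tuple


class Action(Enum):
    WAIT = "wait"
    THINK = "think"
    ANSWER = "answer"


# Hypothetical policy signature: (chunks heard so far, reasoning notes so far)
# -> (action, text). The text is a reasoning update for THINK, the answer for ANSWER.
Policy = Callable[[List[str], List[str]], Tuple[Action, str]]


def run_controller(chunks: List[str], policy: Policy):
    """Feed chunks one at a time and record the action taken after each."""
    heard, notes, trace = [], [], []
    for chunk in chunks:
        heard.append(chunk)
        action, text = policy(heard, notes)
        trace.append(action)
        if action is Action.THINK:
            notes.append(text)
        elif action is Action.ANSWER:
            return text, trace
    # Speech endpoint reached without an answer. This sketch simply forces
    # an answer here; it does not model post-endpoint thinking.
    _, text = policy(heard, notes + ["<endpoint>"])
    trace.append(Action.ANSWER)
    return text, trace


def toy_policy(heard, notes):
    """Stand-in rule: think once a digit is heard, answer at the question mark."""
    if notes and notes[-1] == "<endpoint>":
        return Action.ANSWER, "unknown"
    if heard[-1].endswith("?"):
        return Action.ANSWER, notes[-1] if notes else "unknown"
    if any(ch.isdigit() for ch in heard[-1]) and not notes:
        return Action.THINK, "total so far: 5"
    return Action.WAIT, ""


answer, trace = run_controller(
    ["Tom has", "2 apples and 3 pears,", "how many fruits?"], toy_policy
)
```

With the toy rule, the trace is wait, think, answer: the reasoning update is emitted mid-utterance, so nothing remains to be computed after the final chunk. A learned controller would replace `toy_policy`.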

If this is right

Optimizing the full wait-think-answer trajectory improves row-weighted accuracy on synthetic spoken reasoning tasks from 67.6 percent to 70.3 percent.
The same optimization reduces post-endpoint final-think length by 14 percent under identical deployment conditions.
On human-recorded audio the six-reward controller is the only learned variant whose final-think length falls below the base model.
Supervised fine-tuning alone yields the highest accuracy on real audio, while DAPO adds the latency reduction.
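To make the headline numbers concrete, the sketch below contrasts a row-weighted average with a plain per-task (macro) average and works out the size of the reported gain. It assumes that "row-weighted" means each task contributes in proportion to its number of test rows, which the abstract does not define; the per-task counts in `toy` are invented for illustration and are not from the paper.

```python
def row_weighted_accuracy(per_task):
    """per_task maps task name -> (num_correct, num_rows).

    Assumed reading of "row-weighted": pool all rows across tasks.
    """
    correct = sum(c for c, _ in per_task.values())
    rows = sum(n for _, n in per_task.values())
    return correct / rows


def macro_accuracy(per_task):
    """Unweighted mean of per-task accuracies, for comparison."""
    return sum(c / n for c, n in per_task.values()) / len(per_task)


# Invented counts: a large easy task and a small hard task.
toy = {"task_a": (90, 100), "task_b": (10, 50)}
pooled = row_weighted_accuracy(toy)
macro = macro_accuracy(toy)

# Reported row-weighted accuracies (percent) from the abstract.
before, after = 67.6, 70.3
absolute_gain = round(after - before, 1)                    # percentage points
relative_gain = round(100 * (after - before) / before, 1)   # percent of baseline
```

On the toy counts the pooled figure is about 0.667 while the macro figure is 0.55, which shows how a large task can dominate a row-weighted aggregate. The reported improvement is 2.7 percentage points, roughly a 4 percent relative gain over the base controller.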

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the authors make directly.

The timing controller could be applied to other streaming modalities such as video or sensor streams where evidence arrives incrementally.
Reducing unnecessary post-endpoint reasoning steps may lower overall compute per conversation turn.
The approach suggests that future streaming models should expose explicit intermediate reasoning as a controllable output rather than an internal process only.

Load-bearing premise

The six-part reward can be jointly optimized to produce stable controller behavior without hidden trade-offs between its components.
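One way to picture the premise is a weighted sum over the six named terms. This is only a sketch under assumptions: the abstract does not say how the terms are combined, so the linear form, the `RewardTerms` container, and every weight value below are hypothetical.

```python
from dataclasses import dataclass, fields


@dataclass(frozen=True)
class RewardTerms:
    """One score per reward component named in the abstract."""
    answer_correctness: float
    action_validity: float
    update_timing: float
    latency_sync: float
    reasoning_quality: float
    chain_consistency: float


def composite_reward(terms: RewardTerms, weights: RewardTerms) -> float:
    """Assumed linear combination; the paper's actual rule may differ."""
    return sum(
        getattr(weights, f.name) * getattr(terms, f.name)
        for f in fields(RewardTerms)
    )


# Hypothetical weights and scores for a single trajectory.
weights = RewardTerms(1.0, 1.0, 1.0, 1.0, 1.0, 1.0)
scores = RewardTerms(1.0, 1.0, 0.0, 0.0, 0.5, 0.5)
total = composite_reward(scores, weights)
```

Under a linear rule, a trajectory can reach the same total by trading one term for another (here, full correctness with zero timing credit), which is exactly the kind of hidden trade-off the premise assumes away.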

What would settle it

An experiment in which increasing the weight on the latency-synchronization term causes a statistically significant drop in answer correctness on held-out spoken reasoning tasks.

Figures

Figures reproduced from arXiv: 2605.27190 by Cheng Zhu, Jiatao Gu, Suhao Yu, Weici Zhao, Yang Xiao, Zhiyuan Song.

**Figure 2.** Trajectory reward for wait-think-answer control. Rule-based terms enforce action format, [PITH_FULL_IMAGE:figures/full_fig_p006_2.png]

**Figure 3.** Per-task synthetic SRQA accuracy for the base controller and the six-reward DAPO [PITH_FULL_IMAGE:figures/full_fig_p008_3.png]

**Figure 4.** SFT training curves from the audio-only cold-start run. Training and validation metrics [PITH_FULL_IMAGE:figures/full_fig_p014_4.png]

Original abstract

Recent advances in Large Audio-Language Models (LALMs) have made real-time, streaming spoken interaction increasingly practical. In this setting, reasoning quality and responsiveness are tightly coupled: delaying reasoning until the speech endpoint can improve answer quality but moves deliberation into user-visible response delay, while answering too early risks committing before decisive evidence arrives. We introduce a learnable wait-think-answer control formulation for LALMs. Motivated by the incremental nature of human conversation, the controller decides under partial audio evidence when to wait, when to externalize a compact reasoning update, and when to answer. Using Qwen2.5-Omni-7B as the base model, we construct aligned wait-think-answer traces from spoken reasoning data, train the controller with supervised fine-tuning (SFT), and then apply Decoupled Clip and Dynamic Sampling Policy Optimization (DAPO). The reward combines answer correctness, action validity, update timing, latency synchronization, reasoning quality, and chain consistency, optimizing the complete wait-think-answer trajectory and not the final answer alone. On a six-task synthetic spoken reasoning question answering (SRQA) benchmark, the six-reward DAPO controller improves the row-weighted accuracy from 67.6% to 70.3% while reducing post-endpoint final-think length by 14% under the same Qwen deployment harness. On a 186-item human-recorded Real Audio Bench, a transfer check beyond text-to-speech (TTS)-rendered speech, the controller family remains functional: SFT achieves the strongest accuracy, while the six-reward DAPO controller is the only learned variant whose final-think length falls below the base. These results suggest that a streaming model should learn when to make intermediate reasoning explicit during the audio stream.
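For orientation, the DAPO objective as stated in the DAPO paper (Yu et al., 2025) is reproduced below from memory as a sketch. How the present paper instantiates it is an assumption: here $q$ would be the spoken prompt, $o_i$ a sampled wait-think-answer trajectory, and $R_i$ the composite trajectory reward. The dynamic-sampling constraint is written in its original correctness-based form, which may be adapted when the reward is not binary.

```latex
\mathcal{J}_{\mathrm{DAPO}}(\theta) =
\mathbb{E}_{(q,a)\sim\mathcal{D},\ \{o_i\}_{i=1}^{G}\sim\pi_{\theta_{\mathrm{old}}}(\cdot\mid q)}
\left[
  \frac{1}{\sum_{i=1}^{G}|o_i|}
  \sum_{i=1}^{G}\sum_{t=1}^{|o_i|}
  \min\!\Big(
    r_{i,t}(\theta)\,\hat{A}_{i,t},\
    \mathrm{clip}\big(r_{i,t}(\theta),\,1-\varepsilon_{\mathrm{low}},\,1+\varepsilon_{\mathrm{high}}\big)\,\hat{A}_{i,t}
  \Big)
\right]
\quad \text{s.t.}\quad
0 < \big|\{\,o_i : \mathrm{is\_equivalent}(a, o_i)\,\}\big| < G,

r_{i,t}(\theta) =
\frac{\pi_{\theta}(o_{i,t}\mid q, o_{i,<t})}{\pi_{\theta_{\mathrm{old}}}(o_{i,t}\mid q, o_{i,<t})},
\qquad
\hat{A}_{i,t} =
\frac{R_i - \mathrm{mean}\big(\{R_j\}_{j=1}^{G}\big)}{\mathrm{std}\big(\{R_j\}_{j=1}^{G}\big)}.
```

The two clip bounds $\varepsilon_{\mathrm{low}}$ and $\varepsilon_{\mathrm{high}}$ are the "decoupled clip" in the method's name, and the constraint is the "dynamic sampling" filter that discards prompt groups whose samples are all correct or all incorrect.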

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Desk Editor's Note: private letter to a colleague

The paper adds a wait-think-answer controller trained with six-reward DAPO on top of Qwen2.5-Omni-7B, producing small accuracy and latency improvements on synthetic spoken QA plus workable transfer to real audio.


The main thing here is a controller that decides during partial audio whether to keep waiting, emit a compact reasoning update, or answer. They build aligned traces, run SFT, then apply DAPO with rewards for correctness, action validity, update timing, latency sync, reasoning quality, and chain consistency. On the six-task synthetic SRQA set this raises row-weighted accuracy from 67.6% to 70.3% and cuts post-endpoint final-think length by 14%. The 186-item real-audio transfer shows the approach remains usable, though SFT leads on accuracy there.

The work is useful because it directly targets the quality-latency tension in streaming LALMs instead of just scaling the base model. Optimizing the full trajectory rather than the final answer alone fits the incremental nature of spoken input, and checking against human-recorded audio is a step beyond TTS-only tests.

The soft spots are the missing pieces around the reward design. The abstract gives no ablations, no per-component curves, no error bars, and no sensitivity checks on how the six terms are balanced. The stress-test point about possible dominance by one reward or overfitting to the synthetic tasks therefore lands; without those diagnostics the joint improvement could be fragile. The real-audio result provides partial grounding but is too small to settle the question.

This is aimed at people building real-time voice interfaces with large multimodal models. A reader working on streaming latency or incremental reasoning would get practical value from the formulation even if the gains stay modest. It deserves a serious referee because the problem is concrete and the transfer check adds some independent signal, though the methods will need close examination on reward stability and statistical support.

Referee Report

2 major / 2 minor

Summary. The manuscript introduces a wait-think-answer control formulation for Large Audio-Language Models (LALMs) to manage the trade-off between reasoning quality and responsiveness in streaming spoken interactions. Using Qwen2.5-Omni-7B as the base model, the authors construct aligned wait-think-answer traces from spoken reasoning data, apply supervised fine-tuning (SFT), and then optimize the full trajectory with Decoupled Clip and Dynamic Sampling Policy Optimization (DAPO) under a six-component reward (answer correctness, action validity, update timing, latency synchronization, reasoning quality, chain consistency). On a six-task synthetic spoken reasoning QA (SRQA) benchmark, the six-reward DAPO controller raises row-weighted accuracy from 67.6% to 70.3% while cutting post-endpoint final-think length by 14%. A 186-item human-recorded Real Audio Bench transfer check shows the controller family remains functional, with SFT strongest on accuracy and the DAPO variant the only learned model whose final-think length falls below the base.

Significance. If the results hold, the work supplies a concrete, trajectory-level method for learning when to externalize reasoning during audio streams, directly addressing the latency-quality tension in real-time spoken systems. The explicit use of a composite reward on the complete wait-think-answer trajectory and the inclusion of a non-TTS real-audio transfer evaluation are strengths that increase the practical relevance of the findings.

major comments (2)

[SRQA benchmark results] The reported gains (67.6% → 70.3% row-weighted accuracy and 14% shorter post-endpoint think length) are given without error bars, ablation tables on the six reward components, or statistical tests. This information is load-bearing for the central claim that the composite reward produces stable joint improvement rather than an artifact of weighting or task-specific scaling.
[DAPO objective and reward definition] No per-component reward curves, sensitivity sweeps on the six reward weights, or analysis of potential Pareto conflicts are supplied. Because the optimization directly tunes the controller to the composite reward on the training distribution, the absence of these diagnostics leaves the assumption that the components admit a stable optimum without hidden trade-offs or synthetic-benchmark overfitting untested.
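The Pareto-conflict analysis requested in the second major comment could start from something as simple as the sketch below, which filters a weight sweep down to its non-dominated points. The `SweepPoint` fields and every number in `sweep` are invented placeholders, not results from the paper.

```python
from typing import List, NamedTuple


class SweepPoint(NamedTuple):
    latency_weight: float    # hypothetical weight on the latency term
    accuracy: float          # higher is better
    final_think_len: float   # lower is better


def dominates(a: SweepPoint, b: SweepPoint) -> bool:
    """True if a is no worse than b on both metrics and strictly better on one."""
    no_worse = a.accuracy >= b.accuracy and a.final_think_len <= b.final_think_len
    better = a.accuracy > b.accuracy or a.final_think_len < b.final_think_len
    return no_worse and better


def pareto_front(points: List[SweepPoint]) -> List[SweepPoint]:
    """Keep only the points that no other point dominates."""
    return [p for p in points if not any(dominates(q, p) for q in points)]


# Invented sweep results for illustration only.
sweep = [
    SweepPoint(0.0, 0.70, 120.0),
    SweepPoint(0.5, 0.71, 100.0),
    SweepPoint(1.0, 0.69, 90.0),
    SweepPoint(2.0, 0.66, 95.0),
]
front = pareto_front(sweep)
```

If the front contains more than one point, as it does for these invented numbers, accuracy and final-think length cannot both be improved by reweighting alone, and the choice of weights is a genuine trade-off rather than a free win.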

minor comments (2)

[Abstract] The abstract states that the controller family 'remains functional' on the Real Audio Bench but supplies no quantitative definition of 'functional' beyond the final-think length comparison for the DAPO variant.
[Experimental setup] The description of the six-task SRQA benchmark would benefit from an explicit table listing the tasks and their individual accuracies rather than only the row-weighted aggregate.

Simulated Authors' Rebuttal

2 responses · 0 unresolved

We thank the referee for the constructive feedback and for recognizing the practical relevance of the wait-think-answer formulation. Below we respond point-by-point to the two major comments, committing to concrete additions that directly address the concerns about empirical robustness.


Referee: [SRQA benchmark results] The reported gains (67.6% → 70.3% row-weighted accuracy and 14% shorter post-endpoint think length) are given without error bars, ablation tables on the six reward components, or statistical tests. This information is load-bearing for the central claim that the composite reward produces stable joint improvement rather than an artifact of weighting or task-specific scaling.

Authors: We agree that error bars, component ablations, and statistical tests are necessary to substantiate the claim of stable joint improvement. In the revised manuscript we will (i) report mean and standard error over at least three independent training runs for both accuracy and post-endpoint length, (ii) add a full ablation table that isolates each of the six reward terms, and (iii) include paired statistical tests (McNemar for accuracy, Wilcoxon signed-rank for length) across the six SRQA tasks to confirm that the observed gains are not artifacts of particular weightings or task subsets.

Revision: yes

Referee: [DAPO objective and reward definition] No per-component reward curves, sensitivity sweeps on the six reward weights, or analysis of potential Pareto conflicts are supplied. Because the optimization directly tunes the controller to the composite reward on the training distribution, the absence of these diagnostics leaves the assumption that the components admit a stable optimum without hidden trade-offs or synthetic-benchmark overfitting untested.

Authors: We acknowledge that the lack of per-component diagnostics leaves the stability of the composite optimum unverified. The revision will add (i) training curves showing the evolution of each individual reward term throughout DAPO, (ii) a sensitivity sweep over the six reward weights centered on the values used in the main experiments, and (iii) an explicit analysis of any observed trade-offs or Pareto conflicts (e.g., correctness versus latency) together with the same diagnostics evaluated on the 186-item Real Audio Bench to test for synthetic-distribution overfitting.

Revision: yes
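Two of the statistics promised in the rebuttal are easy to sketch with the standard library: the mean and standard error over independent runs, and an exact two-sided McNemar test on paired per-item correctness. The helper names and all input values below are hypothetical; the Wilcoxon signed-rank test for length is not shown.

```python
from math import comb, sqrt
from statistics import mean, stdev


def mean_and_se(values):
    """Mean and standard error (sample std / sqrt(n)) over independent runs."""
    return mean(values), stdev(values) / sqrt(len(values))


def mcnemar_exact(base_correct, new_correct):
    """Two-sided exact McNemar p-value from paired per-item correctness flags."""
    pairs = list(zip(base_correct, new_correct))
    b = sum(1 for x, y in pairs if x and not y)   # base right, new wrong
    c = sum(1 for x, y in pairs if not x and y)   # base wrong, new right
    n = b + c
    if n == 0:
        return 1.0  # no discordant pairs, no evidence of a difference
    # Binomial(n, 0.5) tail at the smaller discordant count, doubled.
    tail = sum(comb(n, k) for k in range(min(b, c) + 1)) / 2 ** n
    return min(1.0, 2 * tail)


# Invented accuracies from three hypothetical training runs.
run_mean, run_se = mean_and_se([0.70, 0.71, 0.72])

# Invented paired outcomes: five items fixed by the new controller, none broken.
base_flags = [False] * 5 + [True] * 3
new_flags = [True] * 5 + [True] * 3
p_value = mcnemar_exact(base_flags, new_flags)
```

For the invented paired outcomes the p-value is 2/32 = 0.0625, which illustrates that even a one-sided pattern of five fixes and no regressions does not clear a 0.05 threshold; a 2.7-point gain needs enough discordant items to be convincing.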

Circularity Check

0 steps flagged

No significant circularity; empirical results are benchmark-validated


The paper's central claims consist of measured accuracy and latency improvements on a held-out six-task SRQA benchmark plus a separate 186-item real-audio transfer set after SFT + DAPO training. The composite reward (correctness, validity, timing, etc.) is applied during optimization on training trajectories, but evaluation occurs on distinct test distributions with no reduction of the reported numbers to the training rewards by construction. No self-definitional equations, fitted inputs renamed as predictions, or load-bearing self-citations appear in the derivation; the method is a standard RL pipeline whose outputs are externally falsifiable on the stated benchmarks.

Axiom & Free-Parameter Ledger

0 free parameters · 0 axioms · 0 invented entities

Abstract-only review supplies no explicit free parameters, axioms, or invented entities; full paper would be required to enumerate any reward weights, benchmark construction choices, or modeling assumptions.




[1] [1]

Audio-language models for audio- centric tasks: A systematic survey.arXiv preprint arXiv:2501.15177, 2025

Yi Su, Jisheng Bai, Qisheng Xu, Kele Xu, and Yong Dou. Audio-language models for audio- centric tasks: A systematic survey.arXiv preprint arXiv:2501.15177, 2025

work page arXiv 2025

[2] [2]

Towards Holistic Evaluation of Large Audio-Language Models: A Comprehensive Survey

Chih-Kai Yang, Neo S. Ho, and Hung-yi Lee. Towards holistic evaluation of large audio- language models: A comprehensive survey.arXiv preprint arXiv:2505.15957, 2025

work page internal anchor Pith review Pith/arXiv arXiv 2025

[3] [3]

Schegloff, and Gail Jefferson

Harvey Sacks, Emanuel A. Schegloff, and Gail Jefferson. A simplest systematics for the organization of turn-taking for conversation.Language, 50(4):696–735, 1974

1974

[4] [4]

Tanya Stivers, N. J. Enfield, Penelope Brown, Christina Englert, Makoto Hayashi, Trine Heine- mann, Gertie Hoymann, Federico Rossano, Jan Peter de Ruiter, Kyung-Eun Yoon, and Stephen C. Levinson. Universals and cultural variation in turn-taking in conversation.Proceedings of the National Academy of Sciences, 106(26):10587–10592, 2009

2009

[5] [5]

Levinson and Francisco Torreira

Stephen C. Levinson and Francisco Torreira. Timing in turn-taking and its implications for processing models of language.Frontiers in Psychology, 6:731, 2015

2015

[6] [6]

Levinson

Lilla Magyari, Jan Peter de Ruiter, and Stephen C. Levinson. Temporal preparation for speaking in question-answer sequences.Frontiers in Psychology, 8:211, 2017

2017

[7] Sara Bögels, Lilla Magyari, and Stephen C. Levinson. Neural signatures of response planning occur midway through an incoming question in conversation. Scientific Reports, 5:12881, 2015.

[8] Gregg A. Castellucci, Christopher K. Kovach, Matthew A. Howard III, Jeremy D. W. Greenlee, and Michael A. Long. A speech planning network for interactive language use. Nature, 602:117–122, 2022.

[9] Greg J. Stephens, Lauren J. Silbert, and Uri Hasson. Speaker-listener neural coupling underlies successful communication. Proceedings of the National Academy of Sciences, 107(32):14425–14430, 2010.

[10] Alexandre Defossez, Laurent Mazare, Manu Orsini, Amelie Royer, Patrick Perez, Herve Jegou, Edouard Grave, and Neil Zeghidour. Moshi: A speech-text foundation model for real-time dialogue. arXiv preprint arXiv:2410.00037, 2024.

[11] Zhifei Xie and Changqiao Wu. Mini-Omni: Language models can hear, talk while thinking in streaming. arXiv preprint arXiv:2408.16725, 2024.

[12] Xiong Wang, Yangze Li, Chaoyou Fu, Yunhang Shen, Lei Xie, Ke Li, Xing Sun, and Long Ma. Freeze-Omni: A smart and low latency speech-to-speech dialogue model with frozen LLM. arXiv preprint arXiv:2411.00774, 2024.

[13] Yunfei Chu, Jin Xu, Xiaohuan Zhou, Qian Yang, Shiliang Zhang, Zhijie Yan, Chang Zhou, and Jingren Zhou. Qwen-Audio: Advancing universal audio understanding via unified large-scale audio-language models. arXiv preprint arXiv:2311.07919, 2023.

[14] Yunfei Chu, Jin Xu, Qian Yang, Haojie Wei, Xipin Wei, Zhifang Guo, Yichong Leng, Yuanjun Lv, Jinzheng He, Junyang Lin, Chang Zhou, and Jingren Zhou. Qwen2-Audio technical report. arXiv preprint arXiv:2407.10759, 2024.

[15] Jin Xu, Zhifang Guo, Jinzheng He, Hangrui Hu, Ting He, Shuai Bai, Keqin Chen, Jialin Wang, Yang Fan, Kai Dang, Bin Zhang, Xiong Wang, Yunfei Chu, and Junyang Lin. Qwen2.5-Omni technical report. arXiv preprint arXiv:2503.20215, 2025.

[16] Jin Xu, Zhifang Guo, Hangrui Hu, Yunfei Chu, Xiong Wang, Jinzheng He, Yuxuan Wang, Xian Shi, Ting He, Xinfa Zhu, Yuanjun Lv, Yongqi Wang, Dake Guo, and others. Qwen3-Omni technical report. arXiv preprint arXiv:2509.17765, 2025.

[17] Qwen Team. Qwen3.6-35B-A3B: Agentic coding power, now open to all. Hugging Face model card, 2026. URL https://huggingface.co/Qwen/Qwen3.6-35B-A3B.

[18] Hangrui Hu, Xinfa Zhu, Ting He, Dake Guo, Bin Zhang, Xiong Wang, Zhifang Guo, Ziyue Jiang, Hongkun Hao, Zishan Guo, Xinyu Zhang, Pei Zhang, Baosong Yang, Jin Xu, Jingren Zhou, and Junyang Lin. Qwen3-TTS technical report. arXiv preprint arXiv:2601.15621, 2026.

[19] OpenAI. GPT-4o system card. arXiv preprint arXiv:2410.21276, 2024.

[20] Aohan Zeng, Zhengxiao Du, Mingdao Liu, Kedong Wang, Shengmin Jiang, Lei Zhao, Yuxiao Dong, and Jie Tang. GLM-4-Voice: Towards intelligent and human-like end-to-end spoken chatbot. arXiv preprint arXiv:2412.02612, 2024.

[21] Arushi Goel, Sreyan Ghosh, Jaehyeon Kim, Sonal Kumar, Zhifeng Kong, Sang-gil Lee, Chao-Han Huck Yang, Ramani Duraiswami, Dinesh Manocha, Rafael Valle, and Bryan Catanzaro. Audio Flamingo 3: Advancing audio intelligence with fully open large audio language models. arXiv preprint arXiv:2507.08128, 2025.

[22] Ziyang Ma, Zhuo Chen, Yuping Wang, Eng Siong Chng, and Xie Chen. Audio-CoT: Exploring chain-of-thought reasoning in large audio language model. arXiv preprint arXiv:2501.07246, 2025.

[23] Zhifei Xie, Mingbao Lin, Zihang Liu, Pengcheng Wu, Shuicheng Yan, and Chunyan Miao. Audio-Reasoner: Improving reasoning capability in large audio language models. arXiv preprint arXiv:2503.02318, 2025.

[24] Shu Wu, Chenxing Li, Wenfu Wang, Hao Zhang, Hualei Wang, Meng Yu, and Dong Yu. Audio-Thinker: Guiding audio language model when and how to think via reinforcement learning. arXiv preprint arXiv:2508.08039, 2025.

[25] Zhifeng Kong, Arushi Goel, Joao Felipe Santos, Sreyan Ghosh, Rafael Valle, Wei Ping, and Bryan Catanzaro. Audio Flamingo Sound-CoT technical report: Improving chain-of-thought reasoning in sound understanding. arXiv preprint arXiv:2508.11818, 2025.

[26] Cheng Wen, Tingwei Guo, Shuaijiang Zhao, Wei Zou, and Xiangang Li. SARI: Structured audio reasoning via curriculum-guided reinforcement learning. arXiv preprint arXiv:2504.15900, 2025.

[27] Gijs Wijngaard, Elia Formisano, Michele Esposito, and Michel Dumontier. AudSemThinker: Enhancing audio-language models through reasoning over semantics of sound. arXiv preprint arXiv:2505.14142, 2025.

[28] Gang Li, Jizhong Liu, Heinrich Dinkel, Yadong Niu, Junbo Zhang, and Jian Luan. Reinforcement learning outperforms supervised fine-tuning: A case study on audio question answering. arXiv preprint arXiv:2503.11197, 2025.

[29] Andrew Rouditchenko, Saurabhchand Bhati, Edson Araujo, Samuel Thomas, Hilde Kuehne, Rogerio Feris, and James Glass. Omni-R1: Do you really need audio to fine-tune your audio LLM? arXiv preprint arXiv:2505.09439, 2025.

[30] Fei Tian, Xiangyu Tony Zhang, Yuxin Zhang, Haoyang Zhang, Yuxin Li, Daijiao Liu, Yayue Deng, Donghang Wu, Jun Chen, Liang Zhao, Chengyuan Yao, Hexin Liu, Eng Siong Chng, Xuerui Yang, Xiangyu Zhang, Daxin Jiang, and Gang Yu. Step-Audio-R1 technical report. arXiv preprint arXiv:2511.15848, 2025.

[31] Yi-Jen Shih, Desh Raj, Chunyang Wu, Wei Zhou, SK Bong, Yashesh Gaur, Jay Mahadeokar, Ozlem Kalinli, and Mike Seltzer. Can speech LLMs think while listening? In International Conference on Learning Representations, 2026.

[32] Cheng-Han Chiang, Xiaofei Wang, Linjie Li, Chung-Ching Lin, Kevin Lin, Shujie Liu, Zhendong Wang, Zhengyuan Yang, Hung-yi Lee, and Lijuan Wang. STITCH: Simultaneous thinking and talking with chunked reasoning for spoken language models. In International Conference on Learning Representations, 2026.

[33] Cheng-Han Chiang, Xiaofei Wang, Linjie Li, Chung-Ching Lin, Kevin Lin, Shujie Liu, Zhendong Wang, Zhengyuan Yang, Hung-yi Lee, and Lijuan Wang. SHANKS: Simultaneous hearing and thinking for spoken language models. arXiv preprint arXiv:2510.06917, 2025.

[34] Junlong Tong, Yingqi Fan, Anhao Zhao, Yunpu Ma, and Xiaoyu Shen. StreamingThinker: Large language models can think while reading. arXiv preprint arXiv:2510.17238, 2025.

[35] Jason Wei, Xuezhi Wang, Dale Schuurmans, Maarten Bosma, Brian Ichter, Fei Xia, Ed Chi, Quoc Le, and Denny Zhou. Chain-of-thought prompting elicits reasoning in large language models. In Advances in Neural Information Processing Systems, 2022.

[36] Edward J. Hu, Yelong Shen, Phillip Wallis, Zeyuan Allen-Zhu, Yuanzhi Li, Shean Wang, Lu Wang, and Weizhu Chen. LoRA: Low-rank adaptation of large language models. In International Conference on Learning Representations, 2022.

[37] Yuze Zhao, Jintao Huang, Jinghan Hu, Xingjun Wang, Yunlin Mao, Daoze Zhang, Hong Zhang, Zeyinzi Jiang, Zhikai Wu, Baole Ai, Ang Wang, Wenmeng Zhou, and Yingda Chen. SWIFT: A scalable lightWeight infrastructure for fine-tuning. arXiv preprint arXiv:2408.05517, 2024.

[38] Zhihong Shao, Peiyi Wang, Qihao Zhu, Runxin Xu, Junxiao Song, Xiao Bi, Haowei Zhang, Mingchuan Zhang, Y. K. Li, Y. Wu, and Daya Guo. DeepSeekMath: Pushing the limits of mathematical reasoning in open language models. arXiv preprint arXiv:2402.03300, 2024.

[39] DeepSeek-AI, Daya Guo, Dejian Yang, Haowei Zhang, Junxiao Song, Ruoyu Zhang, Runxin Xu, Qihao Zhu, Shirong Ma, Peiyi Wang, Xiao Bi, Xiaokang Zhang, Xingkai Yu, Yu Wu, Z. F. Wu, Zhibin Gou, Zhihong Shao, Zhuoshu Li, Ziyi Gao, Aixin Liu, Bing Xue, Bingxuan Wang, Bochao Wu, Bei Feng, Chengda Lu, Chenggang Zhao, and others. DeepSeek-R1: Incentivizing reasonin...

[40] Qiying Yu, Zheng Zhang, Ruofei Zhu, Yufeng Yuan, Xiaochen Zuo, Yu Yue, Tiantian Fan, Gaohong Liu, Lingjun Liu, Xin Liu, Haibin Lin, Zhiqi Lin, Bole Ma, Guangming Sheng, Yuxuan Tong, Chi Zhang, Mofan Zhang, Wang Zhang, Hang Zhu, and others. DAPO: An open-source LLM reinforcement learning system at scale. arXiv preprint arXiv:2503.14476, 2025.

[41] Alex Graves, Santiago Fernandez, Faustino Gomez, and Jurgen Schmidhuber. Connectionist temporal classification: Labelling unsegmented sequence data with recurrent neural networks. In Proceedings of the International Conference on Machine Learning, 2006.

[42] Peter Clark, Isaac Cowhey, Oren Etzioni, Tushar Khot, Ashish Sabharwal, Carissa Schoenick, and Oyvind Tafjord. Think you have solved question answering? Try ARC, the AI2 Reasoning Challenge. arXiv preprint arXiv:1803.05457, 2018.

[43] Yonatan Bisk, Rowan Zellers, Ronan Le Bras, Jianfeng Gao, and Yejin Choi. PIQA: Reasoning about physical commonsense in natural language. In Proceedings of the AAAI Conference on Artificial Intelligence, 2020.

[44] Maarten Sap, Hannah Rashkin, Derek Chen, Ronan Le Bras, and Yejin Choi. Social IQa: Commonsense reasoning about social interactions. In Proceedings of EMNLP-IJCNLP, 2019.

[45] Karl Cobbe, Vineet Kosaraju, Mohammad Bavarian, Mark Chen, Heewoo Jun, Lukasz Kaiser, Matthias Plappert, Jerry Tworek, Jacob Hilton, Reiichiro Nakano, Christopher Hesse, and John Schulman. Training verifiers to solve math word problems. arXiv preprint arXiv:2110.14168, 2021.

[46] Eliya Nachmani, Alon Levkovitch, Roy Hirsch, Julian Salazar, Chulayuth Asawaroengchai, Soroosh Mariooryad, Ehud Rivlin, RJ Skerry-Ryan, and Michelle Tadmor Ramanovich. Spoken question answering and speech continuation using spectrogram-powered LLM. arXiv preprint arXiv:2305.15255, 2023.