pith. machine review for the scientific record.

arxiv: 2604.15037 · v3 · submitted 2026-04-16 · 💻 cs.AI · cs.CL · cs.SD

Recognition: unknown

From Reactive to Proactive: Assessing the Proactivity of Voice Agents via ProVoice-Bench

Authors on Pith: no claims yet

Pith reviewed 2026-05-10 11:41 UTC · model grok-4.3

classification 💻 cs.AI · cs.CL · cs.SD
keywords proactive voice agents · benchmark · multimodal LLMs · proactivity · voice interaction · intervention · monitoring

The pith

ProVoice-Bench is the first benchmark to test proactive intervention and monitoring in voice agents.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper shifts evaluation from reactive text responses to proactive multimodal voice interactions. It introduces ProVoice-Bench with four new tasks and 1,182 curated samples generated via multi-stage synthesis. When state-of-the-art multimodal LLMs are tested on these tasks, clear shortfalls appear in deciding when to intervene and in handling context. A sympathetic reader would care because voice agents are moving into always-listening settings where timely, appropriate proactivity determines usefulness and user trust.

Core claim

Existing benchmarks overlook proactive intervention and monitoring, so ProVoice-Bench supplies the first dedicated framework with four novel tasks and 1,182 high-quality samples. Evaluation of current multimodal LLMs on the benchmark exposes a significant performance gap, most notably in over-triggering and reasoning.

What carries the argument

ProVoice-Bench, an evaluation framework built around four tasks that measure when and how voice agents should initiate action or maintain monitoring without explicit prompts.
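To make the evaluation target concrete, here is a minimal sketch of how intervention decisions could be scored, assuming a hypothetical binary should-intervene label per sample; the paper's actual schema and metric definitions may differ.

```python
from dataclasses import dataclass
from typing import Dict, List

@dataclass
class Sample:
    """One hypothetical ProVoice-Bench-style item; field names are illustrative."""
    audio_path: str            # conversational audio clip
    digital_context: Dict      # e.g. calendar, location, device state
    should_intervene: bool     # ground truth: is proactive action warranted here?

def score_interventions(samples: List[Sample], predictions: List[bool]) -> Dict[str, float]:
    """Precision/recall over intervention decisions plus an over-trigger rate.

    predictions[i] is True when the model chose to speak or act on samples[i].
    Over-triggering means intervening on samples labelled as not warranting it.
    """
    tp = sum(p and s.should_intervene for s, p in zip(samples, predictions))
    fp = sum(p and not s.should_intervene for s, p in zip(samples, predictions))
    fn = sum(not p and s.should_intervene for s, p in zip(samples, predictions))
    negatives = sum(not s.should_intervene for s in samples)

    return {
        "precision": tp / (tp + fp) if (tp + fp) else 0.0,
        "recall": tp / (tp + fn) if (tp + fn) else 0.0,
        "over_trigger_rate": fp / negatives if negatives else 0.0,
    }
```

Under this framing, the over-trigger rate computed over negative instances is one natural way to quantify the propensity for over-triggering that the evaluation highlights.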

Load-bearing premise

The multi-stage data synthesis pipeline creates samples that faithfully reflect the timing and context demands of real proactive voice scenarios.
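For illustration, a single assembled sample might look like the following, paraphrasing the taxi-booking scenario shown in the paper's synthesis walkthrough; the field names, structure, and acoustic-condition list are assumptions, not the released schema.

```python
# A hypothetical assembled benchmark item (illustrative only).
sample = {
    "task": "Proactive Intent Capture",
    "digital_context": {
        "home_location": "Rua Augusta 45",
        "current_location": "TechStart Office, Lisbon",
    },
    "conversation": [
        {"speaker": "User2", "text": "So how did the interview go? You've been quiet since we sat down."},
        {"speaker": "User1", "text": "Not too bad. I really should head back home now. Not sure how I'm getting back though."},
    ],
    "acoustic_conditions": ["normalization", "reverberation", "far-field"],
    "expected_action": {
        "tool_call": "book_service-taxi",
        "confirmation": "I can help you book a taxi from the TechStart office to your home on Rua Augusta. Would you like me to do that for you?",
    },
    "should_intervene": True,  # a positive instance; the benchmark balances positives and negatives
}
```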

What would settle it

Human raters listening to the benchmark prompts in live settings consistently disagree with the synthesized ground-truth labels on whether intervention is warranted.
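One way to run that check is to compare human intervene/don't-intervene judgments against the synthesized labels with a chance-corrected agreement statistic; below is a minimal binary Cohen's kappa sketch, with placeholder data.

```python
def cohens_kappa(human: list[int], benchmark: list[int]) -> float:
    """Chance-corrected agreement between binary human judgments (1 = intervene)
    and the benchmark's synthesized ground-truth labels."""
    assert len(human) == len(benchmark) and human
    n = len(human)
    observed = sum(h == b for h, b in zip(human, benchmark)) / n
    p_h = sum(human) / n          # fraction of "intervene" among human raters
    p_b = sum(benchmark) / n      # fraction of "intervene" in the benchmark labels
    expected = p_h * p_b + (1 - p_h) * (1 - p_b)
    return 1.0 if expected == 1 else (observed - expected) / (1 - expected)

# Placeholder labels: a low kappa would support the objection that synthesized
# labels diverge from what listeners consider warranted intervention.
print(cohens_kappa([1, 0, 1, 1, 0], [1, 1, 1, 0, 0]))
```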

Figures

Figures reproduced from arXiv: 2604.15037 by Ke Xu, Yuhao Wang, Yu Wang.

Figure 1
Figure 1: Overview of the four designed tasks in ProVoice-Bench. Among them, Proactive Intent Capture (PIC): the model infers implicit user intentions from nuanced linguistic cues (e.g., hesitation or prospective action items discussed in dialogue) and proactively initiates tool-call requests while seeking confirmation.
Figure 2
Figure 2: ProVoice-Bench data synthesis overview. (a) Distribution of data across the four tasks in ProVoice-Bench. (b) Data synthesis pipeline: a multi-stage process for generating semantic cues and corresponding conversational audio.
Figure 3
Figure 3: Experiments comparing model performance with (w/ DC) and without (w/o DC) Digital Context on ProVoice-Bench.
Original abstract

Recent advancements in LLM agents are gradually shifting from reactive, text-based paradigms toward proactive, multimodal interaction. However, existing benchmarks primarily focus on reactive responses, overlooking the complexities of proactive intervention and monitoring. To bridge this gap, we introduce ProVoice-Bench, the first evaluation framework specifically designed for proactive voice agents, featuring four novel tasks. By leveraging a multi-stage data synthesis pipeline, we curate 1,182 high-quality samples for rigorous testing. Our evaluation of state-of-the-art Multimodal LLMs reveals a significant performance gap, particularly regarding over-triggering and reasoning capabilities. These findings highlight the limitations of current models and offer a roadmap for developing more natural, context-aware proactive agents.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, and this is the friction.

Referee Report

2 major / 1 minor

Summary. The paper introduces ProVoice-Bench as the first dedicated evaluation framework for proactive voice agents, featuring four novel tasks. It employs a multi-stage data synthesis pipeline to curate 1,182 samples and evaluates state-of-the-art multimodal LLMs, reporting significant performance gaps especially in over-triggering and reasoning capabilities.

Significance. If the benchmark samples prove faithful to real-world proactive scenarios, this would be a meaningful contribution by shifting focus from reactive to proactive multimodal agent evaluation and identifying concrete limitations in current models. The creation of new tasks and the benchmark itself could serve as a useful starting point for future work on context-aware agents.

major comments (2)
  1. [Data synthesis pipeline (methods section describing sample curation)] The central claim of a significant performance gap (particularly over-triggering and reasoning) rests on the 1,182 synthesized samples accurately representing real-world proactive intervention and monitoring. The multi-stage data synthesis pipeline is described but the manuscript supplies no quantitative validation evidence such as human expert realism ratings, inter-annotator agreement scores, or comparisons against actual voice-agent interaction logs.
  2. [Evaluation results and abstract claims] The reported performance gaps in the evaluation of SOTA multimodal LLMs lack accompanying statistical tests, error bars, or ablations that control for potential synthesis artifacts, leaving the headline empirical findings weakly supported.
minor comments (1)
  1. [Abstract] The abstract refers to 'high-quality samples' without any cross-reference to validation steps; a short clause noting quality controls would improve clarity.

Simulated Authors' Rebuttal

2 responses · 0 unresolved

We thank the referee for their detailed review and for recognizing the potential contribution of ProVoice-Bench. We will address the concerns about data validation and statistical support through revisions to the manuscript.

Point-by-point responses
  1. Referee: [Data synthesis pipeline (methods section describing sample curation)] The central claim of a significant performance gap (particularly over-triggering and reasoning) rests on the 1,182 synthesized samples accurately representing real-world proactive intervention and monitoring. The multi-stage data synthesis pipeline is described but the manuscript supplies no quantitative validation evidence such as human expert realism ratings, inter-annotator agreement scores, or comparisons against actual voice-agent interaction logs.

    Authors: We thank the referee for pointing this out. While the multi-stage pipeline incorporates expert-designed rules and iterative refinement to ensure quality, we agree that explicit quantitative validation is beneficial. In the revised manuscript, we will include results from human expert reviews, such as realism ratings on a Likert scale and inter-annotator agreement (e.g., Cohen's kappa). However, direct comparisons to proprietary voice-agent interaction logs are not feasible due to data access restrictions; we will instead provide a detailed qualitative analysis of how the synthesis aligns with documented real-world use cases from prior literature. This addresses the core concern without misrepresenting the available data. revision: partial

  2. Referee: [Evaluation results and abstract claims] The reported performance gaps in the evaluation of SOTA multimodal LLMs lack accompanying statistical tests, error bars, or ablations that control for potential synthesis artifacts, leaving the headline empirical findings weakly supported.

    Authors: We acknowledge the need for stronger statistical backing. The revised paper will report standard deviations or error bars for all metrics, include p-values from appropriate statistical tests comparing model performances, and add ablation studies varying key synthesis parameters (e.g., context length, trigger thresholds) to demonstrate robustness against potential artifacts. These additions will provide more rigorous support for the observed gaps in over-triggering and reasoning. revision: yes
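As a rough template for the promised error bars, a percentile bootstrap over per-sample correctness is one standard option; the sketch below is generic, with placeholder resample counts, and assumes nothing about the authors' actual analysis.

```python
import random

def bootstrap_ci(per_sample_scores: list[float], n_resamples: int = 1000,
                 alpha: float = 0.05, seed: int = 0) -> tuple[float, tuple[float, float]]:
    """Percentile-bootstrap confidence interval for a mean per-sample score
    (e.g. 0/1 correctness of a model's intervention decision)."""
    rng = random.Random(seed)
    n = len(per_sample_scores)
    means = sorted(
        sum(per_sample_scores[rng.randrange(n)] for _ in range(n)) / n
        for _ in range(n_resamples)
    )
    lo = means[int((alpha / 2) * n_resamples)]
    hi = means[int((1 - alpha / 2) * n_resamples) - 1]
    return sum(per_sample_scores) / n, (lo, hi)
```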

Circularity Check

0 steps flagged

No circularity: new benchmark and empirical evaluation are self-contained

full rationale

The paper introduces ProVoice-Bench as a fresh benchmark with four novel tasks and 1,182 samples generated via a described multi-stage synthesis pipeline. It then reports direct empirical results on SOTA multimodal LLMs. No equations, fitted parameters, predictions derived from inputs, or self-citations are used to justify the central performance-gap claim. The evaluation chain does not reduce to its own inputs by construction; the benchmark creation and testing are independent of prior fitted results.

Axiom & Free-Parameter Ledger

0 free parameters · 0 axioms · 0 invented entities

No free parameters, axioms, or invented entities are described in the abstract; the contribution centers on benchmark construction and empirical evaluation rather than theoretical derivations.

pith-pipeline@v0.9.0 · 5415 in / 1084 out tokens · 23798 ms · 2026-05-10T11:41:59.470234+00:00 · methodology

