Merlin's Whisper: Enabling Efficient Reasoning in Large Language Models via Black-box Persuasive Prompting

Chak Tou Leong; Cunxiao Du; Heming Xia; Rui Li; Wenjie Li; Yongqi Li

arxiv: 2510.10528 · v3 · pith:OUE7E4IRnew · submitted 2025-10-12 · 💻 cs.CL · cs.LG

Pith reviewed 2026-05-21 21:16 UTC · model grok-4.3

keywords: persuasive prompting · efficient reasoning · large language models · token reduction · overthinking · black-box optimization · iterative refinement

The pith

Persuasive prompts can make large reasoning models generate much shorter responses without losing accuracy.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

This paper proposes Whisper, an iterative framework for creating persuasive prompts that encourage large reasoning models to be more concise. By treating the models as black boxes, the method refines prompts from multiple angles to reduce unnecessary reasoning steps. The approach is tested on math benchmarks and shows substantial reductions in token usage. A sympathetic reader would care because it offers a practical way to lower the cost of running long reasoning chains using only query access, with no need for model weights or internals.

Core claim

The paper claims that an iterative refinement process can produce prompts that persuade large reasoning models to give concise yet accurate responses, yielding significant token reductions across a range of models and benchmarks while maintaining accuracy.

What carries the argument

Whisper: an iterative refinement framework that generates high-quality persuasive prompts from diverse perspectives to mitigate overthinking in large reasoning models treated as black boxes.
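
As a rough illustration of what a loop of this kind could look like, the sketch below keeps the candidate prompts that yield the shortest outputs while staying within a tolerance of baseline accuracy. Everything in it is an assumption made for illustration: the names `refine_prompts`, `propose`, and `score`, the selection rule, and the toy stand-ins are hypothetical and are not taken from the paper, whose actual procedure is not reproduced here.

```python
from typing import Callable, List, Tuple


def refine_prompts(
    seeds: List[str],
    propose: Callable[[str], List[str]],          # hypothetical: rewrites a prompt from several perspectives
    score: Callable[[str], Tuple[float, float]],  # hypothetical: (accuracy, mean output tokens) from black-box queries
    baseline_accuracy: float,
    rounds: int = 3,
    keep: int = 2,
    tolerance: float = 0.01,
) -> List[str]:
    """Keep the prompts with the shortest outputs whose accuracy stays near the baseline."""
    pool = list(seeds)
    for _ in range(rounds):
        candidates = pool + [c for p in pool for c in propose(p)]
        # dict.fromkeys removes duplicates while preserving order
        scored = [(score(c), c) for c in dict.fromkeys(candidates)]
        ok = [(s[1], c) for s, c in scored if s[0] >= baseline_accuracy - tolerance]
        if not ok:
            break
        ok.sort(key=lambda t: t[0])  # fewest tokens first
        pool = [c for _, c in ok[:keep]]
    return pool


# Toy stand-ins (assumptions): each added sentence shortens the output,
# but a prompt with more than two sentences costs accuracy.
def toy_propose(prompt: str) -> List[str]:
    return [prompt + " Be brief.", prompt + " Experts answer in one line."]


def toy_score(prompt: str) -> Tuple[float, float]:
    n = prompt.count(".")
    accuracy = 0.90 if n <= 2 else 0.80
    return accuracy, 1000.0 / (1 + n)


result = refine_prompts(["Solve the problem."], toy_propose, toy_score, baseline_accuracy=0.90)
print(result)
```

In this toy run the loop settles on the two two-sentence prompts, because every longer candidate falls below the accuracy threshold. A real scorer would issue queries to the target model, so each round has a query cost that this sketch ignores.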

If this is right

Whisper reduces average response length by a factor of 3 on simple GSM8K questions for Qwen3 models.
It delivers an average 40% token reduction across all tested benchmarks.
Token usage on MATH-500 drops by 46% for Claude-3.7 and 50% for Gemini-2.5.
The method applies broadly across different data domains, model scales, and families.
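
The figures above mix a multiplicative factor with percentages. Assuming "a factor of 3" means baseline length divided by new length, the two conventions convert as in the short sketch below; the function names are illustrative.

```python
def factor_to_percent(factor: float) -> float:
    """A k-fold shorter response uses 1/k of the tokens, i.e. a (1 - 1/k) reduction."""
    return 100.0 * (1.0 - 1.0 / factor)


def percent_to_factor(percent: float) -> float:
    """A p% reduction leaves (1 - p/100) of the tokens, i.e. a 1 / (1 - p/100) fold change."""
    return 1.0 / (1.0 - percent / 100.0)


print(round(factor_to_percent(3), 1))    # 66.7  (3x shorter, GSM8K)
print(round(percent_to_factor(40), 2))   # 1.67  (average across benchmarks)
print(round(percent_to_factor(46), 2))   # 1.85  (Claude-3.7 on MATH-500)
print(round(percent_to_factor(50), 2))   # 2.0   (Gemini-2.5 on MATH-500)
```

Under that reading, the GSM8K result corresponds to roughly a 67% reduction, noticeably larger than the ~40% average.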

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

Similar persuasive techniques might help optimize other resource-intensive LLM behaviors like verbosity in creative tasks.
Combining Whisper with other efficiency methods could lead to even greater savings in deployment costs.
If the black-box persuasion works reliably, it reduces the need for white-box optimization techniques that require model internals.

Load-bearing premise

The method assumes that iteratively generated persuasive prompts will not introduce systematic biases or cause accuracy drops on tasks beyond the tested math benchmarks when the model is used strictly as a black box.

What would settle it

Running Whisper prompts on a new set of non-math tasks, such as coding problems, and measuring whether accuracy decreases compared with standard prompting.
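
A minimal sketch of that comparison follows, assuming a hypothetical black-box callable `model` that maps a prompt to an answer and an output token count. The `evaluate` helper, the exact-match scoring, and the toy model are all assumptions; real coding tasks would more plausibly be scored by running unit tests.

```python
from typing import Callable, List, Tuple

Task = Tuple[str, str]                    # (question, reference answer)
Model = Callable[[str], Tuple[str, int]]  # hypothetical black box: prompt -> (answer, output tokens)


def evaluate(model: Model, tasks: List[Task], prefix: str = "") -> Tuple[float, float]:
    """Return (accuracy, mean output tokens) when every question is prefixed with `prefix`."""
    correct = 0
    tokens = 0
    for question, reference in tasks:
        answer, n_tokens = model(prefix + question)
        correct += int(answer.strip() == reference)  # exact match is an assumption
        tokens += n_tokens
    return correct / len(tasks), tokens / len(tasks)


# Toy stand-in (assumption): the "task" is to report the question length,
# and the model writes fewer tokens when asked to be concise.
PREFIX = "Be concise. "


def toy_model(prompt: str) -> Tuple[str, int]:
    concise = prompt.startswith(PREFIX)
    question = prompt[len(PREFIX):] if concise else prompt
    return str(len(question)), (20 if concise else 60)


questions = ["reverse a list", "parse a date", "sum a column"]
tasks = [(q, str(len(q))) for q in questions]

base_acc, base_tok = evaluate(toy_model, tasks)
short_acc, short_tok = evaluate(toy_model, tasks, prefix=PREFIX)
print(base_acc, base_tok, short_acc, short_tok)
```

The settling evidence would be the accuracy pair: if the prefixed accuracy drops on a real non-math task set while the token count falls, the load-bearing premise fails for that domain.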

Original abstract

Large reasoning models (LRMs) have demonstrated remarkable proficiency in tackling complex tasks through step-by-step thinking. However, this lengthy reasoning process incurs substantial computational and latency overheads, hindering the practical deployment of LRMs. This work presents a new approach to mitigating overthinking in LRMs via black-box persuasive prompting. By treating LRMs as black-box communicators, we investigate how to persuade them to generate concise responses without compromising accuracy. We introduce Whisper, an iterative refinement framework that generates high-quality persuasive prompts from diverse perspectives. Experiments across multiple benchmarks demonstrate that Whisper consistently reduces token usage while preserving performance. Notably, Whisper achieves a 3x reduction in average response length on simple GSM8K questions for the Qwen3 model series and delivers an average ~40% token reduction across all benchmarks. For closed-source APIs, Whisper reduces token usage on MATH-500 by 46% for Claude-3.7 and 50% for Gemini-2.5. Further analysis reveals the broad applicability of Whisper across data domains, model scales, and families, underscoring the potential of black-box persuasive prompting as a practical strategy for enhancing LRM efficiency.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Referee Report

2 major / 2 minor

Summary. The paper introduces Whisper, an iterative black-box prompting framework that generates persuasive prompts from diverse perspectives to reduce overthinking and token usage in large reasoning models (LRMs) without compromising accuracy. It reports empirical results showing substantial efficiency gains, including a 3x reduction in average response length on GSM8K for Qwen3 models and an average ~40% token reduction across benchmarks, with up to 50% on MATH-500 for closed-source models like Gemini-2.5, while claiming broad applicability across domains, scales, and model families.

Significance. If the accuracy preservation and generalization claims hold, Whisper would provide a practical, training-free method for improving LRM efficiency via black-box interaction, with clear benefits for latency and compute in deployment. The approach's strength lies in its empirical focus on measurable token savings across open- and closed-source models, but its impact is limited by the narrow scope of tested tasks.

major comments (2)

[Experiments and Results] The central accuracy-preservation claim is load-bearing for the efficiency results, yet experiments are concentrated on GSM8K and MATH-500 math benchmarks. The assertion of 'broad applicability across data domains' rests on limited additional experiments whose scope, statistical controls, and failure cases are insufficiently detailed to rule out systematic biases or subtle accuracy erosion when forcing shorter outputs on non-mathematical reasoning tasks.
[Results] Baseline comparisons and statistical tests for the reported token reductions (e.g., 3x on GSM8K, ~40% average) are not specified with sufficient rigor in the presented results, making it difficult to assess whether the persuasive prompts maintain performance parity or merely trade off accuracy in ways not captured by the current evaluation.

minor comments (2)

[Method] The iterative refinement process for generating persuasive prompts could be described with more concrete examples or pseudocode to clarify how 'diverse perspectives' are operationalized.
[Evaluation] Clarify the exact definition of 'response length' and 'token reduction' metrics, including whether they account for prompt overhead or only final outputs.
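
The distinction raised in the last comment can change the headline number. The sketch below is one possible definition, not the paper's: `token_reduction` and all token counts are hypothetical, chosen only to show how a longer persuasive prompt eats into output savings once prompt tokens are counted.

```python
def token_reduction(
    baseline_output: int,
    new_output: int,
    baseline_prompt: int = 0,
    new_prompt: int = 0,
) -> float:
    """Fractional reduction in total tokens.

    With both prompt counts left at 0, only output tokens are compared.
    """
    before = baseline_output + baseline_prompt
    after = new_output + new_prompt
    return 1.0 - after / before


# Hypothetical counts for a single question.
output_only = token_reduction(300, 100)
with_prompt = token_reduction(300, 100, baseline_prompt=50, new_prompt=120)

print(round(output_only, 3))  # 0.667
print(round(with_prompt, 3))  # 0.371
```

With these made-up counts, a 67% output-only reduction shrinks to about 37% once the extra prompt tokens are included, which is why the metric definition matters for interpreting the reported savings.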

Simulated Author's Rebuttal

2 responses · 0 unresolved

We appreciate the referee's detailed feedback on our manuscript. We have carefully considered each comment and provide point-by-point responses below. Where revisions are needed to clarify or strengthen our claims, we have indicated the changes to be made in the revised version.

Point-by-point responses

Referee: [Experiments and Results] The central accuracy-preservation claim is load-bearing for the efficiency results, yet experiments are concentrated on GSM8K and MATH-500 math benchmarks. The assertion of 'broad applicability across data domains' rests on limited additional experiments whose scope, statistical controls, and failure cases are insufficiently detailed to rule out systematic biases or subtle accuracy erosion when forcing shorter outputs on non-mathematical reasoning tasks.

Authors: We agree that the primary evaluation focuses on mathematical reasoning tasks, as these are standard benchmarks for assessing step-by-step reasoning in LRMs. However, the manuscript does include experiments on additional domains to support the broad applicability claim. To address the referee's concern regarding insufficient detail, we will expand the relevant sections to provide more comprehensive descriptions of the experimental scope, include appropriate statistical controls such as variance across multiple runs, and discuss observed failure cases or edge cases where accuracy might be affected. This will help rule out potential biases. revision: yes
Referee: [Results] Baseline comparisons and statistical tests for the reported token reductions (e.g., 3x on GSM8K, ~40% average) are not specified with sufficient rigor in the presented results, making it difficult to assess whether the persuasive prompts maintain performance parity or merely trade off accuracy in ways not captured by the current evaluation.

Authors: We acknowledge that the presentation of baseline comparisons and statistical significance could be more rigorous. In the revised manuscript, we will include detailed descriptions of the baselines used (such as standard prompting and other efficiency methods), report exact token reduction figures with confidence intervals or standard deviations where applicable, and incorporate statistical tests (e.g., paired t-tests) to demonstrate that accuracy differences are not statistically significant. This will better substantiate the performance parity claim. revision: yes
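
The rebuttal names paired t-tests as one option. Because per-question correctness is binary, an exact McNemar test on the discordant pairs is a common alternative for this kind of paired comparison; the sketch below swaps it in and uses only the standard library. The function name and the example data are hypothetical, and nothing here reflects results from the paper.

```python
from math import comb
from typing import Sequence


def mcnemar_exact(baseline_correct: Sequence[bool], whisper_correct: Sequence[bool]) -> float:
    """Two-sided exact McNemar p-value on paired per-question correctness."""
    pairs = list(zip(baseline_correct, whisper_correct))
    b = sum(1 for x, y in pairs if x and not y)  # baseline right, new prompt wrong
    c = sum(1 for x, y in pairs if y and not x)  # baseline wrong, new prompt right
    n = b + c
    if n == 0:
        return 1.0  # no discordant pairs, no evidence of a difference
    k = min(b, c)
    # Binomial(n, 0.5) tail probability for the smaller discordant count
    tail = sum(comb(n, i) for i in range(k + 1)) / 2 ** n
    return min(1.0, 2 * tail)


# Hypothetical outcomes on 10 questions: one flips each way, so no evidence of a difference.
baseline = [True] * 8 + [False] * 2
whisper = [True] * 7 + [False, True, False]
print(mcnemar_exact(baseline, whisper))  # 1.0

# Hypothetical outcomes where six questions are lost and none gained.
print(mcnemar_exact([True] * 6, [False] * 6))  # 0.03125
```

A large p-value on its own does not establish parity, since a small test set can fail to detect a real drop. Pairing the test with a confidence interval on the accuracy difference, as the authors also propose, addresses that gap.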

Circularity Check

0 steps flagged

No significant circularity: empirical prompting method validated by direct benchmarks

Full rationale

The paper presents Whisper as an iterative black-box prompting framework whose value is established entirely through measured experimental outcomes on GSM8K, MATH-500, and other benchmarks. Token reductions (e.g., 3x on simple GSM8K, ~40% average) and accuracy preservation are reported as observed results rather than derived quantities. No equations, fitted parameters, self-referential definitions, or load-bearing self-citations appear in the provided text; the central claims rest on external benchmark measurements that are independent of the method's internal construction. The derivation chain is therefore self-contained as a practical engineering proposal tested against held-out data.

Axiom & Free-Parameter Ledger

0 free parameters · 1 axiom · 0 invented entities

The central claim rests primarily on domain assumptions about LLM prompt responsiveness rather than new free parameters or invented entities.

axioms (1)

domain assumption: Large reasoning models respond to carefully crafted persuasive prompts by shortening output length without accuracy loss.
Invoked as the basis for treating models as black-box communicators that can be persuaded.

pith-pipeline@v0.9.0 · 5750 in / 1092 out tokens · 46635 ms · 2026-05-21T21:16:43.562149+00:00 · methodology

Lean theorems connected to this paper

Citations machine-checked in the Pith Canon. Every link opens the source theorem in the public Lean library.

IndisputableMonolith/Foundation/RealityFromDistinction.lean · reality_from_one_distinction · unclear
Relation between the paper passage and the cited Recognition theorem: unclear.
Paper passage: AdvPrompt, an iterative refinement framework that generates high-quality adversarial prompts from diverse perspectives... achieves a 3× reduction in average response length on simple GSM8K questions

IndisputableMonolith/Cost/FunctionalEquation.lean · washburn_uniqueness_aczel · unclear
Relation between the paper passage and the cited Recognition theorem: unclear.
Paper passage: persuasive prompts... Evidence-based persuasion... Role-Playing... Threat

What do these tags mean?

matches: The paper's claim is directly supported by a theorem in the formal canon.
supports: The theorem supports part of the paper's argument, but the paper may add assumptions or extra steps.
extends: The paper goes beyond the formal theorem; the theorem is a base layer rather than the whole result.
uses: The paper appears to rely on the theorem as machinery.
contradicts: The paper's claim conflicts with a theorem or certificate in the canon.
unclear: Pith found a possible connection, but the passage is too broad, indirect, or ambiguous to say the theorem truly supports the claim.