Logit Arithmetic Elicits Long Reasoning Capabilities Without Training

Ayoung Lee; Farima Fatahi Bayat; Lechen Zhang; Lu Wang; Muhammad Khalifa; Xinliang Frederick Zhang; Xin Liu; Yunxiang Zhang

arxiv: 2510.09354 · v2 · submitted 2025-10-10 · 💻 cs.CL

Yunxiang Zhang, Muhammad Khalifa, Lechen Zhang, Xin Liu, Ayoung Lee, Xinliang Frederick Zhang, Farima Fatahi Bayat, Lu Wang

Pith reviewed 2026-05-18 08:00 UTC · model grok-4.3

classification 💻 cs.CL

keywords logit arithmetic · chain-of-thought reasoning · decoding-time methods · reasoning transfer · large language models · preference optimization · inference-time steering

The pith

Logit arithmetic at decoding time transfers long reasoning capabilities from a small model to a large one without any training on the large model.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper asks whether large language models can exhibit extended chain-of-thought reasoning, including backtracking and self-verification, without the usual expensive post-training. It introduces ThinkLogit, a decoding-time technique that mixes logits from a much smaller reasoning guider with those of the large target model to steer generation toward the guider's strategies. A variant called ThinkLogit-DPO further trains the guider using preference optimization on mixed outputs to correct target errors. Experiments on six math, science, and coding benchmarks show relative gains of 21.5% and 24.2% over the plain target, with the method remaining effective across model families.
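The preference-optimization step behind ThinkLogit-DPO can be illustrated with the standard DPO objective. This is a minimal sketch, not the paper's training code: the function name `dpo_loss`, the default `beta`, and the toy log-probabilities are all assumptions, and this page does not say how the paper builds preference pairs from mixed outputs or which distribution plays the role of the policy.

```python
import math

def dpo_loss(policy_chosen_logp, policy_rejected_logp,
             ref_chosen_logp, ref_rejected_logp, beta=0.1):
    """Standard DPO loss for one preference pair (sequence log-probabilities).

    All argument names and the default beta are illustrative assumptions.
    """
    # Implicit reward margin: how much more the policy prefers the chosen
    # response than the reference does, relative to the rejected response.
    margin = beta * ((policy_chosen_logp - ref_chosen_logp)
                     - (policy_rejected_logp - ref_rejected_logp))
    # -log(sigmoid(margin)) written in a numerically stable form.
    if margin >= 0:
        return math.log1p(math.exp(-margin))
    return -margin + math.log1p(math.exp(margin))

# Hypothetical log-probabilities: the loss is lower when the policy already
# ranks the chosen (correct) trace above the rejected (incorrect) one.
aligned = dpo_loss(-1.0, -5.0, -2.0, -2.0)
misaligned = dpo_loss(-5.0, -1.0, -2.0, -2.0)
print(aligned < misaligned)
```

Under this reading, only the small guider would receive gradients from such a loss, which is consistent with the claim that the large target is never trained.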

Core claim

Logit arithmetic between a 1.5B reasoning guider and a 32B non-reasoning target produces long reasoning traces in the target at inference time, delivering 21.5% relative improvement with ThinkLogit and 24.2% with ThinkLogit-DPO across six benchmarks, all without gradient updates to the large model and with only minimal added inference cost when logits are computed in parallel.
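Because the headline numbers are relative rather than absolute gains, a short calculation helps keep the two apart. The sketch below uses hypothetical accuracies chosen only to reproduce a 21.5% relative gain; they are not figures from the paper, and `relative_improvement` is an illustrative helper.

```python
def relative_improvement(baseline: float, improved: float) -> float:
    """Return the gain of `improved` over `baseline` as a percentage of `baseline`."""
    if baseline <= 0:
        raise ValueError("baseline must be positive")
    return 100.0 * (improved - baseline) / baseline

# Hypothetical accuracies (in percent), NOT results reported by the paper.
baseline_acc = 40.0
guided_acc = 48.6

gain = relative_improvement(baseline_acc, guided_acc)
print(f"{gain:.1f}% relative, {guided_acc - baseline_acc:.1f} points absolute")
```

With these made-up numbers, a 21.5% relative improvement corresponds to only 8.6 absolute accuracy points, so the size of the absolute gain depends heavily on the baseline.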

What carries the argument

Logit arithmetic, the direct combination of next-token logits from the small guider and large target during decoding to bias the large model toward the guider's reasoning patterns.

If this is right

Large models acquire long reasoning strategies without any fine-tuning or gradient steps on their parameters.
ThinkLogit-DPO shows that preference training only the small guider can produce additional gains on mixed outputs.
The approach works when guider and target belong to different model families.
Inference cost stays low provided the two models' logits are computed concurrently.
Reasoning capabilities become available to any large model that was not originally trained for extended chains.
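The point about concurrent logit computation can be sketched with two stand-in model functions. Everything here is hypothetical: `toy_target_logits` and `toy_guider_logits` merely sleep to imitate forward passes, and a real deployment would more likely place the models on separate devices than use Python threads.

```python
from concurrent.futures import ThreadPoolExecutor
import time

def toy_target_logits(prefix):
    """Stand-in for the large target's forward pass (hypothetical)."""
    time.sleep(0.05)  # pretend the large model is the slow one
    return [float(len(prefix)), 0.0]

def toy_guider_logits(prefix):
    """Stand-in for the small guider's forward pass (hypothetical)."""
    time.sleep(0.01)
    return [0.0, 1.0]

def step_sequential(prefix):
    # Latency is roughly the sum of both forward passes.
    return toy_target_logits(prefix), toy_guider_logits(prefix)

def step_parallel(prefix, pool):
    # Latency is roughly that of the slower model when both run concurrently.
    target_future = pool.submit(toy_target_logits, prefix)
    guider_future = pool.submit(toy_guider_logits, prefix)
    return target_future.result(), guider_future.result()

prefix = [1, 2, 3]
with ThreadPoolExecutor(max_workers=2) as pool:
    parallel_out = step_parallel(prefix, pool)
sequential_out = step_sequential(prefix)
print(parallel_out == sequential_out)
```

Both schedules return identical logits; only the wall-clock cost per decoding step differs, which is why the overhead claim is conditional on parallel execution.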

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the authors make directly.

The same mixing operation might transfer behaviors other than reasoning, such as tool-calling sequences or multi-turn planning.
Adaptive mixing weights that change during a single generation could reduce cases where the combination harms coherence.
If small guiders continue to work at scale, the economic case for training ever-larger dedicated reasoning models weakens.
Systematic sweeps over mixing coefficients on new domains would reveal whether per-task tuning is truly minimal.

Load-bearing premise

That the arithmetic mixing of logits will reliably import complex reasoning behaviors into the target model without introducing new errors or unstable generations.

What would settle it

Running ThinkLogit on a fresh reasoning benchmark and finding that accuracy and average reasoning length are no better, or worse, than the unguided target model.

Original abstract

Large reasoning models exhibit long chain-of-thought reasoning with complex strategies such as backtracking and self-verification. Yet, these capabilities typically require resource-intensive post-training. We investigate whether such behaviors can be elicited in large models without any gradient updates. To this end, we propose a decoding-time approach, ThinkLogit, which utilizes logit arithmetic to transfer these capabilities from a substantially smaller reasoning guider to a large non-reasoning target. We further show that we can boost performance by training the guider to correct the target's errors using preference optimization over mixed model outputs, a setup we refer to as ThinkLogit-DPO. We evaluate these methods across six reasoning benchmarks spanning math, science, and coding domains using the Qwen2.5-32B guided by R1-Distill-Qwen-1.5B, a model 21x smaller. Our experiments demonstrate that ThinkLogit and ThinkLogit-DPO achieve a relative improvement of 21.5% and 24.2%, respectively, over the target model. Moreover, ThinkLogit remains effective even when the guider and target come from different model families. Crucially, our method requires zero training for the large model and would incur minimal inference overhead when logits are computed in parallel, presenting a practical solution for enabling long reasoning at scale.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Desk Editor's Note · private letter to a colleague

Logit mixing from a small guider lifts accuracy on a large target at decode time, but the results do not yet show that complex reasoning strategies actually transfer.

The main point is that this paper gets relative gains of 21.5% to 24.2% on six reasoning benchmarks by mixing logits from a 1.5B guider into a 32B target during decoding, with no training on the large model. Running DPO on the guider accounts for the step from 21.5% to 24.2%. The method also holds up when the two models come from different families. That combination is the practical contribution: a low-cost way to add reasoning behavior without post-training the big model, and with limited extra inference cost if the two models' logits are computed in parallel.

Referee Report

1 major / 2 minor

Summary. The paper introduces ThinkLogit, a decoding-time logit-arithmetic method that mixes logits from a small reasoning guider (R1-Distill-Qwen-1.5B) into a large target model (Qwen2.5-32B) to elicit long chain-of-thought behaviors such as backtracking and self-verification without any gradient updates on the target. A variant, ThinkLogit-DPO, further trains the guider via preference optimization on mixed outputs. Across six reasoning benchmarks in math, science, and coding, the methods report relative accuracy gains of 21.5% and 24.2% respectively over the target alone, with additional evidence of cross-family transfer and low inference overhead.

Significance. If the mechanistic claim holds, the work would offer a practical, training-free route to long reasoning in large models, with clear advantages in cost and scalability. The cross-family results and the DPO extension are concrete strengths that could influence inference-time steering research.

major comments (1)

[Experiments / Results] The central claim that logit arithmetic transfers specific long-reasoning strategies (backtracking, self-verification) from guider to target is load-bearing yet unsupported by direct evidence. The evaluation reports only aggregate accuracy on the six benchmarks; no section provides qualitative examples, frequency counts, or correctness analysis of these behaviors in the generated traces. Without such inspection it is possible the gains arise from generic token-level steering rather than coherent strategy transfer.
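The frequency counts requested in this comment could start from something as simple as the sketch below. The marker phrases in `MARKERS` are assumptions made for illustration, not a validated taxonomy from the paper, and surface-pattern matching can only suggest that a behavior occurred, not that it was used correctly.

```python
import re
from collections import Counter

# Hypothetical surface markers for each behavior; a real analysis would need
# a validated list and manual checks of correctness.
MARKERS = {
    "backtracking": [r"\bwait\b", r"\balternatively\b", r"\blet me try\b"],
    "self_verification": [r"\blet me check\b", r"\bdouble-check\b", r"\bverify\b"],
}

def count_markers(trace: str) -> Counter:
    """Count occurrences of each behavior's marker phrases in one trace."""
    lowered = trace.lower()
    counts = Counter()
    for behavior, patterns in MARKERS.items():
        counts[behavior] = sum(len(re.findall(p, lowered)) for p in patterns)
    return counts

# Made-up reasoning trace for demonstration.
trace = ("x = 4. Wait, that ignores the sign. Alternatively, use x = -4. "
         "Let me check: (-4)**2 = 16.")
print(count_markers(trace))
```

Comparing such counts between ThinkLogit outputs and the unguided target, normalized by trace length, would give a first check on whether the accuracy gains coincide with more frequent backtracking and verification.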

minor comments (2)

[Method] The exact logit-mixing formula, the chosen mixing coefficient, and any ablations on its value are not described with sufficient precision to allow reproduction or sensitivity analysis.
[Evaluation] An error analysis or per-benchmark breakdown of where improvements occur (or fail) would clarify the scope of the method.

Simulated Author's Rebuttal

1 response · 0 unresolved

We thank the referee for their constructive comments and the opportunity to clarify our work. We address the major comment point by point below.

Referee: [Experiments / Results] The central claim that logit arithmetic transfers specific long-reasoning strategies (backtracking, self-verification) from guider to target is load-bearing yet unsupported by direct evidence. The evaluation reports only aggregate accuracy on the six benchmarks; no section provides qualitative examples, frequency counts, or correctness analysis of these behaviors in the generated traces. Without such inspection it is possible the gains arise from generic token-level steering rather than coherent strategy transfer.

Authors: We agree that direct inspection of generated traces would strengthen the mechanistic claim. While our current results emphasize aggregate accuracy gains and cross-family transfer (which would be unlikely under purely generic token-level steering), we will add qualitative examples of backtracking and self-verification in the revised manuscript, along with frequency counts of these behaviors in ThinkLogit outputs versus the target baseline. These additions will help distinguish coherent strategy transfer from generic effects.

revision: yes

Circularity Check

0 steps flagged

No circularity: empirical method validated on external benchmarks

The paper introduces ThinkLogit as a decoding-time logit arithmetic technique to elicit long reasoning from a small guider into a large target model without any gradient updates on the target. It further describes ThinkLogit-DPO as an optional preference optimization step on the guider. All reported results consist of relative accuracy improvements measured on six independent reasoning benchmarks (math, science, coding) using Qwen2.5-32B guided by R1-Distill-Qwen-1.5B. No equations, predictions, or first-principles derivations appear in the provided text that reduce the claimed gains to quantities defined by fitted parameters or self-citations within the paper. The central claims remain externally falsifiable through benchmark performance and do not rely on any self-referential construction.

Axiom & Free-Parameter Ledger

1 free parameter · 0 axioms · 0 invented entities

The central claim rests on the empirical effectiveness of logit mixing for capability transfer and on the assumption that the small model's reasoning traces are compatible with the large model's output distribution; no new physical entities or mathematical axioms are introduced.

free parameters (1)

logit mixing coefficient
A scalar or vector controlling the strength of the guider's logits relative to the target's must be chosen or tuned to achieve the reported gains.

pith-pipeline@v0.9.0 · 5787 in / 1230 out tokens · 53872 ms · 2026-05-18T08:00:39.600270+00:00 · methodology

Lean theorems connected to this paper

Citations machine-checked in the Pith Canon. Every link opens the source theorem in the public Lean library.

IndisputableMonolith/Cost/FunctionalEquation.lean · washburn_uniqueness_aczel
Relation between the paper passage and the cited Recognition theorem: unclear

At decoding step $t+1$, the fused logits are computed as $\tilde{\ell}_{t+1} = \ell^{(L)}_{t+1} + \alpha\left(\ell^{(S)}_{t+1} - \ell^{(S_0)}_{t+1}\right)$
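The fused-logits expression quoted above can be sketched in a few lines. This is a toy illustration under stated assumptions: the superscripts are read as L for the large target, S for the small reasoning guider, and S0 for the guider's non-reasoning base model, which this page does not spell out. The function names, the 4-token vocabulary, and all logit values are hypothetical.

```python
import math

def fuse_logits(target, guider, guider_base, alpha=1.0):
    """Return target + alpha * (guider - guider_base), elementwise.

    Assumes all three vectors are next-token logits over the same vocabulary.
    """
    if not (len(target) == len(guider) == len(guider_base)):
        raise ValueError("all logit vectors must share one vocabulary")
    return [l + alpha * (s - s0)
            for l, s, s0 in zip(target, guider, guider_base)]

def softmax(logits):
    """Convert logits to probabilities (shifted by the max for stability)."""
    m = max(logits)
    exps = [math.exp(x - m) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]

# Made-up logits over a 4-token vocabulary.
target = [2.0, 1.0, 0.5, 0.0]       # large model prefers token 0
guider = [0.0, 0.5, 3.0, 0.0]       # reasoning guider prefers token 2
guider_base = [0.0, 0.5, 0.5, 0.0]  # assumed non-reasoning base of the guider

fused = fuse_logits(target, guider, guider_base, alpha=1.0)
next_token = max(range(len(fused)), key=fused.__getitem__)
print(fused, next_token, softmax(fused))
```

In this toy case the guider's shift moves the greedy choice from token 0 to token 2, while setting `alpha=0.0` recovers the unguided target exactly. The coefficient therefore acts as the single free parameter noted in the ledger.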

What do these tags mean?

matches: The paper's claim is directly supported by a theorem in the formal canon.
supports: The theorem supports part of the paper's argument, but the paper may add assumptions or extra steps.
extends: The paper goes beyond the formal theorem; the theorem is a base layer rather than the whole result.
uses: The paper appears to rely on the theorem as machinery.
contradicts: The paper's claim conflicts with a theorem or certificate in the canon.
unclear: Pith found a possible connection, but the passage is too broad, indirect, or ambiguous to say the theorem truly supports the claim.

Forward citations

Cited by 1 Pith paper

Reviewed papers in the Pith corpus that reference this work. Sorted by Pith novelty score.

Efficient Low-Resource Language Adaptation via Multi-Source Dynamic Logit Fusion
cs.CL · 2026-04 · unverdicted · novelty 7.0

TriMix dynamically fuses logits from three model sources to outperform baselines and Proxy Tuning on eight low-resource languages across four model families.