Logit Arithmetic Elicits Long Reasoning Capabilities Without Training
Pith reviewed 2026-05-18 08:00 UTC · model grok-4.3
The pith
Logit arithmetic at decoding time transfers long reasoning capabilities from a small model to a large one without any training on the large model.
A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.
Core claim
Logit arithmetic between a 1.5B reasoning guider and a 32B non-reasoning target produces long reasoning traces in the target at inference time. Across six benchmarks, ThinkLogit delivers a 21.5% relative improvement over the unguided target and ThinkLogit-DPO delivers 24.2%, with no gradient updates to the large model and only minimal added inference cost when logits are computed in parallel.
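Because the headline figures are relative rather than absolute gains, a small worked example may help. The sketch below uses made-up accuracies (the function name and both numbers are illustrative assumptions, not values from the paper) to show how a relative improvement is computed.

```python
def relative_improvement(baseline: float, improved: float) -> float:
    """Return the relative gain of `improved` over `baseline`, in percent."""
    if baseline <= 0:
        raise ValueError("baseline must be positive")
    return 100.0 * (improved - baseline) / baseline


# Hypothetical numbers for illustration only; they are not taken from the paper.
baseline_acc = 40.0  # average accuracy of the unguided target, in percent
guided_acc = 48.6    # average accuracy with guidance, in percent

gain = relative_improvement(baseline_acc, guided_acc)
print(f"relative gain: {gain:.1f}%")
print(f"absolute gain: {guided_acc - baseline_acc:.1f} points")
```

With these assumed inputs, a 21.5% relative gain corresponds to only 8.6 absolute accuracy points, which is why the baseline matters when reading the reported percentages.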
What carries the argument
Logit arithmetic, the direct combination of next-token logits from the small guider and large target during decoding to bias the large model toward the guider's reasoning patterns.
If this is right
- Large models acquire long reasoning strategies without any fine-tuning or gradient steps on their parameters.
- ThinkLogit-DPO shows that preference-training only the small guider, using mixed model outputs, can produce additional gains.
- The approach works when guider and target belong to different model families.
- Inference cost stays low provided the two models' logits are computed concurrently.
- Reasoning capabilities become available to any large model that was not originally trained for extended chains.
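The point about concurrent logit computation can be sketched as follows. This is a minimal illustration, assuming each model exposes a function that maps a token prefix to next-token logits; `logits_in_parallel`, `target_model`, and `guider_model` are hypothetical names, and the stub models return fixed values rather than running a real forward pass.

```python
from concurrent.futures import ThreadPoolExecutor
from typing import Callable, List, Sequence

# A model is represented as a function from a token prefix to next-token logits.
LogitFn = Callable[[Sequence[int]], List[float]]


def logits_in_parallel(models: Sequence[LogitFn],
                       prefix: Sequence[int]) -> List[List[float]]:
    """Submit every model's forward pass at once, then collect results in order."""
    with ThreadPoolExecutor(max_workers=len(models)) as pool:
        futures = [pool.submit(model, prefix) for model in models]
        return [future.result() for future in futures]


# Hypothetical stand-ins for real forward passes over a 3-token vocabulary.
def target_model(prefix: Sequence[int]) -> List[float]:
    return [2.0, 1.0, 0.0]


def guider_model(prefix: Sequence[int]) -> List[float]:
    return [0.0, 3.0, 0.0]


target_logits, guider_logits = logits_in_parallel(
    [target_model, guider_model], prefix=[1, 2, 3]
)
```

The thread pool here only illustrates the scheduling pattern. Whether the extra cost is actually small depends on where the models run (for example, on separate devices), which the summary above does not specify.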
Where Pith is reading between the lines
- The same mixing operation might transfer behaviors other than reasoning, such as tool-calling sequences or multi-turn planning.
- Adaptive mixing weights that change during a single generation could reduce cases where the combination harms coherence.
- If small guiders continue to work at scale, the economic case for training ever-larger dedicated reasoning models weakens.
- Systematic sweeps over mixing coefficients on new domains would reveal whether the method really needs only minimal per-task tuning.
Load-bearing premise
That the arithmetic mixing of logits will reliably import complex reasoning behaviors into the target model without introducing new errors or unstable generations.
What would settle it
Running ThinkLogit on a fresh reasoning benchmark and finding that accuracy and average reasoning length are no better, or worse, than the unguided target model.
Original abstract
Large reasoning models exhibit long chain-of-thought reasoning with complex strategies such as backtracking and self-verification. Yet, these capabilities typically require resource-intensive post-training. We investigate whether such behaviors can be elicited in large models without any gradient updates. To this end, we propose a decoding-time approach, ThinkLogit, which utilizes logit arithmetic to transfer these capabilities from a substantially smaller reasoning guider to a large non-reasoning target. We further show that we can boost performance by training the guider to correct the target's errors using preference optimization over mixed model outputs, a setup we refer to as ThinkLogit-DPO. We evaluate these methods across six reasoning benchmarks spanning math, science, and coding domains using the Qwen2.5-32B guided by R1-Distill-Qwen-1.5B, a model 21x smaller. Our experiments demonstrate that ThinkLogit and ThinkLogit-DPO achieve a relative improvement of 21.5% and 24.2%, respectively, over the target model. Moreover, ThinkLogit remains effective even when the guider and target come from different model families. Crucially, our method requires zero training for the large model and would incur minimal inference overhead when logits are computed in parallel, presenting a practical solution for enabling long reasoning at scale.
Editorial analysis
A structured set of objections, weighed in public.
Referee Report
Summary. The paper introduces ThinkLogit, a decoding-time logit-arithmetic method that mixes logits from a small reasoning guider (R1-Distill-Qwen-1.5B) into a large target model (Qwen2.5-32B) to elicit long chain-of-thought behaviors such as backtracking and self-verification without any gradient updates on the target. A variant, ThinkLogit-DPO, further trains the guider via preference optimization on mixed outputs. Across six reasoning benchmarks in math, science, and coding, the methods report relative accuracy gains of 21.5% and 24.2% respectively over the target alone, with additional evidence of cross-family transfer and low inference overhead.
Significance. If the mechanistic claim holds, the work would offer a practical, training-free route to long reasoning in large models, with clear advantages in cost and scalability. The cross-family results and the DPO extension are concrete strengths that could influence inference-time steering research.
major comments (1)
- [Experiments / Results] The central claim that logit arithmetic transfers specific long-reasoning strategies (backtracking, self-verification) from guider to target is load-bearing yet unsupported by direct evidence. The evaluation reports only aggregate accuracy on the six benchmarks; no section provides qualitative examples, frequency counts, or correctness analysis of these behaviors in the generated traces. Without such inspection it is possible the gains arise from generic token-level steering rather than coherent strategy transfer.
minor comments (2)
- [Method] The exact logit-mixing formula, the chosen mixing coefficient, and any ablations on its value are not described with sufficient precision to allow reproduction or sensitivity analysis.
- [Evaluation] An error analysis or per-benchmark breakdown of where improvements occur (or fail) would clarify the scope of the method.
Simulated Author's Rebuttal
We thank the referee for their constructive comments and the opportunity to clarify our work. We address the major comment point by point below.
- Referee: [Experiments / Results] The central claim that logit arithmetic transfers specific long-reasoning strategies (backtracking, self-verification) from guider to target is load-bearing yet unsupported by direct evidence. The evaluation reports only aggregate accuracy on the six benchmarks; no section provides qualitative examples, frequency counts, or correctness analysis of these behaviors in the generated traces. Without such inspection it is possible the gains arise from generic token-level steering rather than coherent strategy transfer.
Authors: We agree that direct inspection of generated traces would strengthen the mechanistic claim. While our current results emphasize aggregate accuracy gains and cross-family transfer (which would be unlikely under purely generic token-level steering), we will add qualitative examples of backtracking and self-verification in the revised manuscript, along with frequency counts of these behaviors in ThinkLogit outputs versus the target baseline. These additions will help distinguish coherent strategy transfer from generic effects. revision: yes
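A frequency count of the kind discussed in this exchange could be approximated with a crude keyword heuristic. The sketch below is only an illustration: the marker phrases, the `MARKERS` table, and the `count_behaviors` helper are assumptions introduced here, not the authors' analysis, and keyword matches say nothing about whether a behavior was carried out correctly.

```python
import re
from typing import Dict, List

# Hypothetical marker phrases; a real analysis would need a validated taxonomy.
MARKERS: Dict[str, List[str]] = {
    "backtracking": [r"\bwait\b", r"\bactually\b", r"\blet me try again\b"],
    "self_verification": [r"\blet me check\b", r"\bverify\b", r"\bdouble-check\b"],
}


def count_behaviors(trace: str) -> Dict[str, int]:
    """Count case-insensitive marker matches per behavior in one trace."""
    return {
        behavior: sum(
            len(re.findall(pattern, trace, flags=re.IGNORECASE))
            for pattern in patterns
        )
        for behavior, patterns in MARKERS.items()
    }


# Made-up trace for illustration.
example_trace = "Wait, that is wrong. Let me check the sum. Actually, 2 + 2 = 4."
counts = count_behaviors(example_trace)
print(counts)
```

Running the same counter over guided and unguided outputs would give a first-pass comparison, though manual inspection would still be needed to separate coherent strategy use from surface-level phrase imitation.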
Circularity Check
No circularity: empirical method validated on external benchmarks
The paper introduces ThinkLogit as a decoding-time logit arithmetic technique to elicit long reasoning from a small guider into a large target model without any gradient updates on the target. It further describes ThinkLogit-DPO as an optional preference optimization step on the guider. All reported results consist of relative accuracy improvements measured on six independent reasoning benchmarks (math, science, coding) using Qwen2.5-32B guided by R1-Distill-Qwen-1.5B. No equations, predictions, or first-principles derivations appear in the provided text that reduce the claimed gains to quantities defined by fitted parameters or self-citations within the paper. The central claims remain externally falsifiable through benchmark performance and do not rely on any self-referential construction.
Axiom & Free-Parameter Ledger
free parameters (1)
- logit mixing coefficient
Lean theorems connected to this paper
- IndisputableMonolith/Cost/FunctionalEquation.lean, washburn_uniqueness_aczel (unclear)
  Relation between the paper passage and the cited Recognition theorem: unclear.
  At decoding step $t+1$, the fused logits are computed as $\tilde{\ell}_{t+1} = \ell^{(L)}_{t+1} + \alpha\,\bigl(\ell^{(S)}_{t+1} - \ell^{(S_0)}_{t+1}\bigr)$
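The fusion rule quoted above can be sketched in a few lines of plain Python. This is a minimal illustration under stated assumptions: the page does not define the superscripts, so the sketch reads L as the large target, S as the small guider, and S0 as a reference counterpart of the guider; it also assumes all three models share one vocabulary. The function names and toy values are hypothetical.

```python
import math
from typing import List


def fuse_logits(target: List[float], guider: List[float],
                guider_base: List[float], alpha: float) -> List[float]:
    """Compute target + alpha * (guider - guider_base) per vocabulary entry."""
    if not (len(target) == len(guider) == len(guider_base)):
        raise ValueError("all logit vectors must share one vocabulary")
    return [t + alpha * (g - b) for t, g, b in zip(target, guider, guider_base)]


def softmax(logits: List[float]) -> List[float]:
    """Numerically stable softmax over a list of logits."""
    peak = max(logits)
    exps = [math.exp(x - peak) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]


# Toy 3-token vocabulary; all values are made up for illustration.
target_logits = [2.0, 1.0, 0.0]       # assumed: large target, l^(L)
guider_logits = [0.0, 3.0, 0.0]       # assumed: small guider, l^(S)
guider_base_logits = [0.0, 1.0, 0.0]  # assumed: guider reference, l^(S0)

fused = fuse_logits(target_logits, guider_logits, guider_base_logits, alpha=1.0)
probs = softmax(fused)
next_token = max(range(len(fused)), key=fused.__getitem__)  # greedy choice
```

In this toy case the target alone prefers token 0, while the fused logits prefer token 1, because the guider raises that token relative to its reference. Setting `alpha=0.0` recovers the unguided target logits.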
What do these tags mean?
- matches: The paper's claim is directly supported by a theorem in the formal canon.
- supports: The theorem supports part of the paper's argument, but the paper may add assumptions or extra steps.
- extends: The paper goes beyond the formal theorem; the theorem is a base layer rather than the whole result.
- uses: The paper appears to rely on the theorem as machinery.
- contradicts: The paper's claim conflicts with a theorem or certificate in the canon.
- unclear: Pith found a possible connection, but the passage is too broad, indirect, or ambiguous to say the theorem truly supports the claim.
Forward citations
Cited by 1 Pith paper
- Efficient Low-Resource Language Adaptation via Multi-Source Dynamic Logit Fusion
  TriMix dynamically fuses logits from three model sources to outperform baselines and Proxy Tuning on eight low-resource languages across four model families.