Recognition: no theorem link
CRANE: Constrained Reasoning Injection for Code Agents via Nullspace Editing
Pith reviewed 2026-05-15 04:42 UTC · model grok-4.3
The pith
CRANE injects selected reasoning directions from Thinking checkpoints into Instruct models by editing their parameter nullspace, raising code-agent success rates on agentic coding benchmarks while preserving tool-use efficiency.
A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.
Core claim
The central claim is that the Thinking-Instruct delta vector supplies a pool of directional reasoning edits that can be selectively transferred into the Instruct backbone through magnitude thresholding, a Conservative Taylor Gate, and Graduated Sigmoidal Projection, producing higher pass rates on repository-level tasks while preserving the original tool-use discipline and inference speed.
What carries the argument
The Thinking-Instruct delta vector, processed by magnitude thresholding to denoise it, a Conservative Taylor Gate to retain jointly beneficial edits, and Graduated Sigmoidal Projection to suppress format-critical directions.
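The excerpt gives no equation-level definitions of these stages, so the following is only a minimal sketch of how such a delta-editing pipeline could look; the sparsity level `keep_fraction`, the scaling factor `alpha`, and the per-tensor treatment are assumptions, and the two gating stages are stubbed out.

```python
# Illustrative sketch only: the excerpt publishes no equation-level definitions,
# so every threshold and scale below is an assumption, not CRANE's actual rule.
import torch

def crane_edit_sketch(w_instruct: torch.Tensor,
                      w_thinking: torch.Tensor,
                      keep_fraction: float = 0.3,  # hypothetical sparsity level
                      alpha: float = 1.0) -> torch.Tensor:
    """Merge one weight tensor from paired Instruct/Thinking checkpoints."""
    delta = w_thinking - w_instruct  # Thinking-Instruct delta: pool of candidate reasoning edits

    # 1) Magnitude thresholding: keep only the largest-magnitude delta entries (denoising).
    k = max(1, int(keep_fraction * delta.numel()))
    threshold = delta.abs().flatten().kthvalue(delta.numel() - k + 1).values
    mask = (delta.abs() >= threshold).float()

    # 2) Conservative Taylor Gate and 3) Graduated Sigmoidal Projection would further
    #    re-weight `mask * delta` using reasoning/tool-use salience and format-critical
    #    directions; those signals are not defined in the excerpt, so they are omitted here.
    return w_instruct + alpha * mask * delta
```

Applied on its own to each weight tensor, this reduces to sparse delta addition; the reported gains over generic delta addition would have to come from the two omitted gates.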
Load-bearing premise
The difference between the thinking and instruct checkpoints contains directional information that can be selectively moved into the instruct model without creating new tool-use failures or degrading existing protocols.
What would settle it
Applying the same editing procedure to additional paired checkpoints and observing either no improvement or newly introduced tool-use failures on the same benchmarks would show that the selective transfer does not hold.
Original abstract
Code agents must both reason over long-horizon repository state and obey strict tool-use protocols. In paired Instruct/Thinking checkpoints, these capabilities are complementary but misaligned. The Instruct model is concise and tool-disciplined, whereas the Thinking model offers stronger planning and recovery behavior but often over-deliberates and degrades agent performance. We present CRANE (Constrained Reasoning Injection for Code Agents via Nullspace Editing), a training-free parameter-editing method that treats the Thinking-Instruct delta as a directional pool of candidate reasoning edits for the Instruct backbone. CRANE combines magnitude thresholding to denoise the delta, a Conservative Taylor Gate to retain edits that are jointly beneficial for reasoning transfer and tool-use preservation, and Graduated Sigmoidal Projection to suppress format-critical update directions. By merging paired Instruct and Thinking checkpoints, CRANE delivers strong gains over either individual model while preserving Instruct-level efficiency: on Roo-Eval it achieves pass@1 of 66.2% (+19.5%) for Qwen3-30B-A3B and 81.5% (+8.7%) for Qwen3-Next-80B-A3B; on SWE-bench-Verified it resolves up to 14 additional instances at both scales (122/500 and 180/500); and on Terminal-Bench v2 it improves pass@1/pass@5 by up to 2.3%/7.8%, reaching 7.6%/17.9% and 14.8%/30.3%, respectively, consistently outperforming alternative merging strategies across all three benchmarks.
Editorial analysis
A structured set of objections, weighed in public.
Referee Report
Summary. The manuscript introduces CRANE, a training-free parameter-editing technique that treats the delta between paired Instruct and Thinking checkpoints as a pool of candidate reasoning edits. It applies magnitude thresholding to denoise the delta, a Conservative Taylor Gate to retain jointly beneficial updates, and Graduated Sigmoidal Projection to suppress format-critical directions. The method is claimed to improve code-agent performance on Roo-Eval (pass@1 of 66.2% and 81.5% at two scales), SWE-bench-Verified (up to 14 additional resolved instances), and Terminal-Bench v2 (pass@1/pass@5 gains up to 2.3%/7.8%) while preserving Instruct-level efficiency and outperforming alternative merging strategies.
Significance. If the attribution of gains to the three components holds after verification, the work would provide a practical, training-free route to combine complementary capabilities from existing checkpoint pairs. The consistent benchmark lifts across model scales and the explicit comparisons to alternative merging strategies constitute concrete empirical grounding. The absence of invented parameters and the focus on reproducible editing operations are strengths.
Major comments (3)
- [Abstract, §3] No equation-level definition is supplied for the Conservative Taylor Gate or the Graduated Sigmoidal Projection. Without these definitions it is impossible to verify the claim that the gates selectively retain reasoning directions while suppressing tool-use-critical directions, which is load-bearing for attributing the reported +19.5 pp and +14-instance gains to CRANE rather than to generic delta addition.
- [§5, Table 3] The results report large lifts over baselines, yet no ablation isolating magnitude thresholding alone versus the full CRANE pipeline is presented. This omission prevents confirmation that the Conservative Taylor Gate and Graduated Sigmoidal Projection are responsible for the observed improvements rather than incidental merging effects.
- [§4.3] The separability assumption (that tool-use protocol directions lie sufficiently outside the retained subspace after projection) is stated but not tested with a targeted failure-mode analysis on tool-call formatting errors. If overlap exists, the method could degrade protocol adherence on the very benchmarks where gains are claimed; a parameter-space diagnostic for this overlap is sketched after this list.
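One way to probe the separability concern in the last comment is to measure how much of the retained delta's energy lies along format-critical directions. The sketch below is an editorial addendum, not the paper's procedure: the retention mask and the `format_dirs` matrix are placeholders, since the paper does not specify how either is constructed.

```python
# Hedged overlap diagnostic; `retain_mask` and `format_dirs` are placeholders,
# not quantities defined in the paper.
import torch

def retained_overlap(delta: torch.Tensor,
                     retain_mask: torch.Tensor,
                     format_dirs: torch.Tensor) -> float:
    """Fraction of retained-delta energy that falls inside span(format_dirs).

    delta and retain_mask share a shape; format_dirs holds one format-critical
    direction per row, each with delta.numel() entries.
    """
    retained = (delta * retain_mask).flatten()
    q, _ = torch.linalg.qr(format_dirs.T)   # columns of q orthonormally span the format subspace
    projected = q @ (q.T @ retained)        # projection of the retained delta onto that subspace
    return (projected.norm() / (retained.norm() + 1e-8)).item()
```

A value near zero would support the separability assumption; a large value would flag exactly the overlap the referee warns about.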
Minor comments (2)
- [Figure 2] The diagram of the editing pipeline would benefit from explicit annotation of the three CRANE stages and the exact thresholding values used.
- [§2] The related-work discussion of model merging omits several recent nullspace-based editing papers; adding these citations would clarify the precise novelty of the Conservative Taylor Gate.
Simulated Author's Rebuttal
We thank the referee for the constructive feedback and for recognizing the empirical grounding and reproducibility aspects of CRANE. We address each major comment below and commit to revisions that strengthen the manuscript without altering its core claims.
Point-by-point responses
-
Referee: [Abstract, §3] No equation-level definition is supplied for the Conservative Taylor Gate or the Graduated Sigmoidal Projection. Without these definitions it is impossible to verify the claim that the gates selectively retain reasoning directions while suppressing tool-use-critical directions, which is load-bearing for attributing the reported +19.5 pp and +14-instance gains to CRANE rather than to generic delta addition.
Authors: We agree that explicit equation-level definitions are required for verification. In the revised manuscript we will add the full mathematical formulations in §3: the Conservative Taylor Gate is defined as a first-order Taylor approximation that retains only updates whose joint effect on the reasoning and tool-use objectives exceeds a conservative threshold, and the Graduated Sigmoidal Projection applies a per-direction sigmoid scaled by alignment with format-critical vectors identified from the Instruct checkpoint. These equations will directly support attribution of the observed gains. Revision: yes.
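Taken at face value, the authors' verbal description translates into something like the following sketch; the loss gradients `g_reason` and `g_tool`, the threshold `tau_gate`, the format-critical vector `f_dir`, and the sigmoid parameters `beta` and `c` are all placeholders, not the definitions promised for the revision.

```python
# Editorial sketch of the two gates as verbally described above; every named
# quantity (g_reason, g_tool, tau_gate, f_dir, beta, c) is a placeholder.
import torch

def conservative_taylor_gate(delta: torch.Tensor,
                             g_reason: torch.Tensor,
                             g_tool: torch.Tensor,
                             tau_gate: float = 0.0) -> torch.Tensor:
    """Keep delta entries whose first-order effect helps reasoning without hurting tool use.

    A first-order Taylor estimate of the loss change from applying one entry is
    grad_i * delta_i, so a negative product predicts improvement on that objective.
    """
    improves_reasoning = (g_reason * delta) < -tau_gate
    preserves_tool_use = (g_tool * delta) < tau_gate
    return (improves_reasoning & preserves_tool_use).float()

def graduated_sigmoidal_projection(delta: torch.Tensor,
                                   f_dir: torch.Tensor,
                                   beta: float = 10.0,
                                   c: float = 0.5) -> torch.Tensor:
    """Smoothly attenuate columns of delta that align with a format-critical vector."""
    f_hat = f_dir / (f_dir.norm() + 1e-8)
    cos = (f_hat @ delta).abs() / (delta.norm(dim=0) + 1e-8)  # |cosine| per column
    scale = 1.0 - torch.sigmoid(beta * (cos - c))             # ~1 when unaligned, ~0 when aligned
    return delta * scale                                      # broadcast over columns
```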
-
Referee: [§5, Table 3] The results report large lifts over baselines, yet no ablation isolating magnitude thresholding alone versus the full CRANE pipeline is presented. This omission prevents confirmation that the Conservative Taylor Gate and Graduated Sigmoidal Projection are responsible for the observed improvements rather than incidental merging effects.
Authors: We accept this criticism and will insert a new ablation subsection in §5. The revision will report performance for (i) magnitude thresholding alone, (ii) thresholding plus the Conservative Taylor Gate, and (iii) the complete CRANE pipeline on all three benchmarks, thereby isolating the incremental contribution of each component. Revision: yes.
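Operationally, the promised ablation amounts to applying successively longer prefixes of the pipeline to every delta tensor and re-running the benchmark harness. The scaffold below is hypothetical: the stage functions and the `evaluate` hook stand in for whatever evaluation stack the authors actually use.

```python
# Hypothetical ablation scaffold; the stage functions and `evaluate` hook are
# placeholders, not the authors' evaluation stack.
from typing import Callable, Dict, List
import torch

def run_ablation(w_instruct: Dict[str, torch.Tensor],
                 w_thinking: Dict[str, torch.Tensor],
                 stages: List[Callable[[torch.Tensor], torch.Tensor]],
                 evaluate: Callable[[Dict[str, torch.Tensor]], float]) -> float:
    """Apply a prefix of the CRANE pipeline to every delta tensor and score the merge."""
    merged = {}
    for name, w_i in w_instruct.items():
        delta = w_thinking[name] - w_i
        for stage in stages:   # e.g. [threshold], [threshold, ctg], [threshold, ctg, gsp]
            delta = stage(delta)
        merged[name] = w_i + delta
    return evaluate(merged)
```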
-
Referee: [§4.3] The separability assumption (that tool-use protocol directions lie sufficiently outside the retained subspace after projection) is stated but not tested with a targeted failure-mode analysis on tool-call formatting errors. If overlap exists, the method could degrade protocol adherence on the very benchmarks where gains are claimed.
Authors: The assumption is currently supported only by the aggregate benchmark improvements and the absence of reported protocol degradation. We will add a targeted failure-mode analysis in the revised §4.3 that measures tool-call formatting error rates on a held-out set of agent trajectories before and after projection, confirming that format-critical directions are suppressed without harming overall performance. Revision: yes.
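The promised measurement can be as simple as counting malformed tool calls in the held-out trajectories before and after the projection is applied. The sketch below assumes a `<tool_call>...</tool_call>` wire format with a JSON payload; that format is an editorial assumption, not the protocol documented in the paper.

```python
# Sketch of a tool-call formatting check; the <tool_call> wire format and the
# JSON well-formedness rule are assumptions, not the paper's protocol.
import json
import re
from typing import Iterable

TOOL_CALL_RE = re.compile(r"<tool_call>(.*?)</tool_call>", re.DOTALL)

def formatting_error_rate(messages: Iterable[str]) -> float:
    """Fraction of emitted tool calls whose payload is not valid JSON."""
    total = malformed = 0
    for message in messages:
        for payload in TOOL_CALL_RE.findall(message):
            total += 1
            try:
                json.loads(payload)
            except json.JSONDecodeError:
                malformed += 1
    return malformed / total if total else 0.0

# Compare the same held-out trajectories before and after projection, e.g.
# formatting_error_rate(instruct_messages) vs. formatting_error_rate(crane_messages).
```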
Circularity Check
No circularity: CRANE's claims are grounded in independent empirical benchmarks, not in self-referential derivation
Full rationale
The paper introduces CRANE as a training-free editing procedure on the Thinking-Instruct delta using magnitude thresholding, a Conservative Taylor Gate, and Graduated Sigmoidal Projection. All performance claims (Roo-Eval pass@1 gains, SWE-bench instance counts, Terminal-Bench pass@1/pass@5) are presented as direct outcomes of benchmark evaluation against alternative merging baselines. No equation in the provided text defines a result in terms of itself, renames a fitted quantity as a prediction, or relies on a load-bearing self-citation whose content is unverified. The separability assumptions are stated as modeling choices and are externally testable rather than derived by construction from the authors' prior work.
Axiom & Free-Parameter Ledger
Invented entities (2)
- Conservative Taylor Gate: no independent evidence
- Graduated Sigmoidal Projection: no independent evidence
Reference graph
Works this paper leans on
- [1] Ruisheng Cao, Mouxiang Chen, Jiawei Chen, Zeyu Cui, Yunlong Feng, Binyuan Hui, Yuheng Jing, Kaixin Li, Mingze Li, Junyang Lin, et al. Qwen3-Coder-Next technical report. arXiv preprint arXiv:2603.00729. Accessed 2026-04-30.
- [2] Daya Guo, Dejian Yang, Haowei Zhang, Junxiao Song, Peiyi Wang, Qihao Zhu, Runxin Xu, Ruoyu Zhang, Shirong Ma, Xiao Bi, et al. DeepSeek-R1: Incentivizing reasoning capability in LLMs via reinforcement learning. arXiv preprint arXiv:2501.12948.
- [3] Chengsong Huang, Qian Liu, Bill Yuchen Lin, Tianyu Pang, Chao Du, and Min Lin. LoraHub: Efficient cross-task generalization via dynamic LoRA composition. arXiv preprint arXiv:2307.13269.
- [4] Carlos E. Jimenez, John Yang, Alexander Wettig, Shunyu Yao, Kexin Pei, Ofir Press, and Karthik Narasimhan. SWE-bench: Can language models resolve real-world GitHub issues? arXiv preprint arXiv:2310.06770.
- [5] Ryan Liu, Jiayi Geng, Addison J. Wu, Ilia Sucholutsky, Tania Lombrozo, and Thomas L. Griffiths. Mind your step (by step): Chain-of-thought can reduce performance on tasks where thinking makes humans worse. https://arxiv.org/abs/2505.22113.
- [6] Lucie Charlotte Magister, Jonathan Mallinson, Jakub Adamek, Eric Malmi, and Aliaksei Severyn. Teaching small language models to reason. In Proceedings of the 61st Annual Meeting of the Association for Computational Linguistics (Volume 2: Short Papers), pages 1773–1781. https://arxiv.org/abs/2410.21333.
- [7] Mike A. Merrill, Alexander G. Shaw, Nicholas Carlini, Boxuan Li, Harsh Raj, Ivan Bercovich, Lin Shi, Jeong Yeon Shin, Thomas Walshe, E. Kelly Buchanan, et al. Terminal-Bench: Benchmarking agents on hard, realistic tasks in command line interfaces. arXiv preprint arXiv:2601.11868.
- [8] Roo-Code Contributors. Roo-Code: An open-source in-IDE coding agent. https://github.com/RooCodeInc/Roo-Code. Accessed 2026-04-29.
- [9] Mingjie Sun, Zhuang Liu, Anna Bair, and J. Zico Kolter. A simple and effective pruning approach for large language models. In 12th International Conference on Learning Representations, ICLR 2024.
- [10] Xingyao Wang, Boxuan Li, Yufan Song, Frank F. Xu, Xiangru Tang, Mingchen Zhuge, Jiayi Pan, Yueqi Song, Bowen Li, Jaskirat Singh, et al. OpenHands: An open platform for AI software developers as generalist agents. arXiv preprint arXiv:2407.16741.
- [11] An Yang, Anfeng Li, Baosong Yang, Beichen Zhang, Binyuan Hui, Bo Zheng, Bowen Yu, Chang Gao, Chengen Huang, Chenxu Lv, et al. Qwen3 technical report. arXiv preprint arXiv:2505.09388.
- [12] Enneng Yang, Zhenyi Wang, Li Shen, Shiwei Liu, Guibing Guo, Xingwei Wang, and Dacheng Tao. AdaMerging: Adaptive model merging for multi-task learning. arXiv preprint arXiv:2310.02575.
- [13] Shu Zhou, Rui Ling, Junan Chen, Xin Wang, Tao Fan, and Hao Wang. When more thinking hurts: Overthinking in LLM test-time compute scaling. Hugging Face dataset. Accessed 2026-05-02.
- [14] Terry Yue Zhuo, Minh Chien Vu, Jenny Chim, Han Hu, Wenhao Yu, Ratnadira Widyasari, Imam Nur Bani Yusuf, Haolan Zhan, Junda He, Indraneil Paul, et al. BigCodeBench: Benchmarking code generation with diverse function calls and complex instructions. arXiv preprint arXiv:2406.15877.