pith. machine review for the scientific record.

arxiv: 2605.14084 · v1 · submitted 2026-05-13 · 💻 cs.SE · cs.AI · cs.CL

Recognition: no theorem link

CRANE: Constrained Reasoning Injection for Code Agents via Nullspace Editing

Authors on Pith · no claims yet

Pith reviewed 2026-05-15 04:42 UTC · model grok-4.3

classification 💻 cs.SE · cs.AI · cs.CL
keywords code agents · model merging · parameter editing · reasoning injection · nullspace editing · SWE-bench · benchmark evaluation · tool use

The pith

CRANE injects selected reasoning directions from thinking checkpoints into instruct models by editing their parameter nullspace, raising code agent success rates on benchmarks while keeping tool-use efficiency.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

Code agents must plan over long repository states yet follow strict tool protocols. Paired instruct and thinking checkpoints split these strengths: the instruct version stays concise and disciplined, while the thinking version plans better but over-deliberates and hurts agent results. The paper shows that the difference vector between these checkpoints can be filtered and projected to move only beneficial reasoning updates into the instruct backbone. Three operations do the work: magnitude thresholding removes noise, a conservative gate keeps edits that help reasoning without breaking tools, and a graduated sigmoid projection shields format-critical directions. When applied, the merged models outperform either original checkpoint on Roo-Eval, SWE-bench-Verified, and Terminal-Bench v2.

Core claim

The central claim is that the Thinking-Instruct delta vector supplies a pool of directional reasoning edits that can be selectively transferred into the Instruct backbone through magnitude thresholding, a Conservative Taylor Gate, and Graduated Sigmoidal Projection, producing higher pass rates on repository-level tasks while preserving the original tool-use discipline and inference speed.

What carries the argument

The Thinking-Instruct delta vector, processed by magnitude thresholding to denoise, a Conservative Taylor Gate to retain jointly beneficial edits, and Graduated Sigmoidal Projection to suppress format-critical directions.
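The review text names these three operations but gives no equations for them. As a reading aid only, here is a minimal sketch of how such a pipeline could fit together; the function name, thresholds, gating rule, and projection rule are all assumptions, not the paper's implementation:

```python
import numpy as np

def crane_edit_sketch(w_instruct, w_thinking, g_reason, g_tool,
                      keep_frac=0.1, F=None, tau=0.03):
    """Illustrative sketch of a three-stage delta edit (NOT the paper's code).

    w_instruct, w_thinking : paired checkpoint weights, same shape
    g_reason, g_tool       : loss gradients w.r.t. the weights for a
                             reasoning objective and a tool-use objective
    F                      : rows spanning format-critical directions
                             (assumed orthonormal), or None
    """
    delta = w_thinking - w_instruct                 # directional edit pool

    # (1) Magnitude thresholding: keep only the largest-|delta| coordinates.
    cutoff = np.quantile(np.abs(delta), 1.0 - keep_frac)
    delta = np.where(np.abs(delta) >= cutoff, delta, 0.0)

    # (2) Conservative gate: to first order, adding delta changes a loss L
    # by <grad L, delta>; keep the whole block only if the edit is predicted
    # to lower the reasoning loss without raising the tool-use loss.
    helps_reason = np.vdot(g_reason, delta) < 0.0
    safe_for_tools = np.vdot(g_tool, delta) <= 0.0
    alpha = 1.0 if (helps_reason and safe_for_tools) else 0.0

    # (3) Graduated sigmoidal projection: attenuate the component of delta
    # along format-critical directions, with weight rising sigmoidally in
    # the alignment magnitude (w(tau) = 0.5 by the choice of k).
    if F is not None:
        coeffs = F @ delta.ravel()                  # alignment per direction
        k = np.log(99.0) / tau
        w = 1.0 / (1.0 + np.exp(-k * (np.abs(coeffs) - tau)))
        delta = delta - ((w * coeffs) @ F).reshape(delta.shape)

    return w_instruct + alpha * delta
```

The design intent the sketch tries to capture: stage (2) is all-or-nothing per block (conservative), while stage (3) is graded per direction, so format-critical subspaces are shielded even when a block passes the gate.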

Load-bearing premise

The difference between the thinking and instruct checkpoints contains directional information that can be selectively moved into the instruct model without creating new tool-use failures or degrading existing protocols.

What would settle it

Applying the same editing procedure to additional paired checkpoints and finding either no improvement or new tool-use failures on the same benchmarks would show that the selective transfer does not hold.

Figures

Figures reproduced from arXiv: 2605.14084 by Michele Merler, Mingzhi Zhu, Raju Pavuluri, Stacy Patterson.

Figure 1. Qualitative Roo-Eval trace illustrating the endpoint trade-off that motivates selective …
Figure 2. CRANE implementation pipeline with three stages: (1) magnitude thresholding to sparsify δ and discard low-confidence coordinates; (2) Conservative Taylor Gate that sets per-block injection strength so only directions first-order beneficial to both reasoning and tool use are retained; (3) Graduated Sigmoidal Projection that attenuates updates along format-critical subspaces (tool control). This asymmetric g…
Figure 3. TTC vs. pass rate, three benchmarks × two scales. (a–c) Qwen3-30B-A3B on Roo-Eval, SWE-bench-Verified, Terminal-Bench v2; (d–f) Qwen3-Next-80B-A3B on the same three …
Figure 4. Continuous-hyperparameter sensitivity analysis of the …
Figure 5. CTG Taylor importance S_CTG(c, l) on Qwen3-30B-A3B, derived automatically from D_R and D_A in …
Figure 6. Qwen3-Next-80B residual stack laid out as 48 mixer slots: linear-attention layers (blue) …
Figure 7. GSP sigmoid-weighting diagnostics. (a) Sigmoid weighting w_{q,r} = σ(k(a_{q,r} − τ)) with k = log(99)/τ for τ ∈ {0.003, 0.03, 0.3}; the dot marks w(τ) = 0.5 and the transition band [0, 2τ] contains all partial attenuation. (b) Energy-weighted residual mask profile along format-protected directions across residual depth for the sigmoid mask, polynomial soft masks (w = a^p), and a hard top-k mask. …
Figure 8. Roo-Eval results across both scales. Per-method pass@1 (light) and pass_all (dark) on the …
Figure 9. Per-language Roo-Eval pass@1 across methods at both scales. Rows: methods (Instruct, …
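The weighting schemes in Figure 7 are fully specified by the caption and can be checked numerically. A small sketch, with the polynomial and top-k baselines parameterized by hypothetical choices (alignments `a` are assumed normalized to [0, 1]):

```python
import numpy as np

def sigmoid_mask(a, tau):
    # Figure 7a: w = sigma(k (a - tau)) with k = log(99)/tau, so that
    # w(tau) = 0.5 and the transition band [0, 2*tau] holds essentially
    # all partial attenuation: w(0) = 0.01 and w(2*tau) = 0.99 exactly.
    k = np.log(99.0) / tau
    return 1.0 / (1.0 + np.exp(-k * (np.asarray(a) - tau)))

def poly_mask(a, p):
    # polynomial soft mask compared in Figure 7b: w = a**p
    return np.asarray(a) ** p

def topk_mask(a, k):
    # hard mask compared in Figure 7b: attenuate only the k largest alignments
    a = np.asarray(a)
    w = np.zeros_like(a, dtype=float)
    w[np.argsort(a)[-k:]] = 1.0
    return w
```

For τ = 0.03 this reproduces the dot and band in the figure: `sigmoid_mask(0.03, 0.03)` is 0.5 and `sigmoid_mask(0.06, 0.03)` is 0.99, since σ(log 99) = 99/100.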
Original abstract

Code agents must both reason over long-horizon repository state and obey strict tool-use protocols. In paired Instruct/Thinking checkpoints, these capabilities are complementary but misaligned. The Instruct model is concise and tool-disciplined, whereas the Thinking model offers stronger planning and recovery behavior but often over-deliberates and degrades agent performance. We present CRANE (Constrained Reasoning Injection for Code Agents via Nullspace Editing), a training-free parameter-editing method that treats the Thinking-Instruct delta as a directional pool of candidate reasoning edits for the Instruct backbone. CRANE combines magnitude thresholding to denoise the delta, a Conservative Taylor Gate to retain edits that are jointly beneficial for reasoning transfer and tool-use preservation, and Graduated Sigmoidal Projection to suppress format-critical update directions. By merging paired Instruct and Thinking checkpoints, CRANE delivers strong gains over either individual model while preserving Instruct-level efficiency: on Roo-Eval it achieves pass@1 of 66.2% (+19.5%) for Qwen3-30B-A3B and 81.5% (+8.7%) for Qwen3-Next-80B-A3B; on SWE-bench-Verified it resolves up to 14 additional instances at both scales (122/500 and 180/500); and on Terminal-Bench v2 it improves pass@1/pass@5 by up to 2.3%/7.8%, reaching 7.6%/17.9% and 14.8%/30.3%, respectively, consistently outperforming alternative merging strategies across all three benchmarks.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Referee Report

3 major / 2 minor

Summary. The manuscript introduces CRANE, a training-free parameter-editing technique that treats the delta between paired Instruct and Thinking checkpoints as a pool of candidate reasoning edits. It applies magnitude thresholding to denoise the delta, a Conservative Taylor Gate to retain jointly beneficial updates, and Graduated Sigmoidal Projection to suppress format-critical directions. The method is claimed to improve code-agent performance on Roo-Eval (pass@1 of 66.2% and 81.5% at two scales), SWE-bench-Verified (up to 14 additional resolved instances), and Terminal-Bench v2 (pass@1/pass@5 gains up to 2.3%/7.8%) while preserving Instruct-level efficiency and outperforming alternative merging strategies.
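For orientation, the pass metrics quoted throughout can be restated concretely; per the paper's appendix definitions (anchor [20] in the reference graph), pass@5 is best-of-5 and pass_majority requires at least 3 of 5 attempts. A minimal sketch over per-task attempt outcomes:

```python
def pass_at_5(results):
    """results: list of per-task lists of 5 boolean attempt outcomes.
    A task counts as a pass if any attempt passed (best-of-5)."""
    return sum(any(r) for r in results) / len(results)

def pass_majority(results):
    """A task passes only if at least 3 of 5 attempts passed
    (per-task success rate >= 0.60)."""
    return sum(sum(r) >= 3 for r in results) / len(results)
```

As the appendix notes, pass_majority differs from pass@3: the former counts actual majorities, while pass@k weights by the probability that a k-attempt subsample contains a pass.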

Significance. If the attribution of gains to the three components holds after verification, the work would provide a practical, training-free route to combine complementary capabilities from existing checkpoint pairs. The consistent benchmark lifts across model scales and the explicit comparisons to alternative merging strategies constitute concrete empirical grounding. The absence of invented parameters and the focus on reproducible editing operations are strengths.

major comments (3)
  1. [Abstract, §3] No equation-level definition is supplied for the Conservative Taylor Gate or Graduated Sigmoidal Projection. Without these definitions it is impossible to verify the claim that the gates selectively retain reasoning directions while suppressing tool-use-critical directions, which is load-bearing for attributing the reported +19.5 pp and +14-instance gains to CRANE rather than to generic delta addition.
  2. [§5, Table 3] The results report large lifts over baselines, yet no ablation isolating magnitude thresholding alone versus the full CRANE pipeline is presented. This omission prevents confirmation that the Conservative Taylor Gate and Graduated Sigmoidal Projection are responsible for the observed improvements rather than incidental merging effects.
  3. [§4.3] The separability assumption—that tool-use protocol directions lie sufficiently outside the retained subspace after projection—is stated but not tested with a targeted failure-mode analysis of tool-call formatting errors. If overlap exists, the method could degrade protocol adherence on the very benchmarks where gains are claimed.
minor comments (2)
  1. [Figure 2] The diagram of the editing pipeline would benefit from explicit annotation of the three CRANE stages and the exact thresholding values used.
  2. [§2] The related-work discussion on model merging omits several recent nullspace-based editing papers; adding these citations would clarify the precise novelty of the Conservative Taylor Gate.

Simulated Author's Rebuttal

3 responses · 0 unresolved

We thank the referee for the constructive feedback and for recognizing the empirical grounding and reproducibility aspects of CRANE. We address each major comment below and commit to revisions that strengthen the manuscript without altering its core claims.

Point-by-point responses
  1. Referee: [Abstract, §3] No equation-level definition is supplied for the Conservative Taylor Gate or Graduated Sigmoidal Projection. Without these definitions it is impossible to verify the claim that the gates selectively retain reasoning directions while suppressing tool-use-critical directions, which is load-bearing for attributing the reported +19.5 pp and +14-instance gains to CRANE rather than to generic delta addition.

    Authors: We agree that explicit equation-level definitions are required for verification. In the revised manuscript we will add the full mathematical formulations in §3: the Conservative Taylor Gate is defined as a first-order Taylor approximation that retains only updates whose joint effect on reasoning and tool-use objectives exceeds a conservative threshold, and the Graduated Sigmoidal Projection applies a per-direction sigmoid scaled by alignment with format-critical vectors identified from the Instruct checkpoint. These equations will directly support attribution of the observed gains. revision: yes

  2. Referee: [§5, Table 3] The results report large lifts over baselines, yet no ablation isolating magnitude thresholding alone versus the full CRANE pipeline is presented. This omission prevents confirmation that the Conservative Taylor Gate and Graduated Sigmoidal Projection are responsible for the observed improvements rather than incidental merging effects.

    Authors: We accept this criticism and will insert a new ablation subsection in §5. The revision will report performance for (i) magnitude thresholding alone, (ii) thresholding plus Conservative Taylor Gate, and (iii) the complete CRANE pipeline on all three benchmarks, thereby isolating the incremental contribution of each component. revision: yes

  3. Referee: [§4.3] The separability assumption—that tool-use protocol directions lie sufficiently outside the retained subspace after projection—is stated but not tested with a targeted failure-mode analysis of tool-call formatting errors. If overlap exists, the method could degrade protocol adherence on the very benchmarks where gains are claimed.

    Authors: The assumption is currently supported only by the aggregate benchmark improvements and absence of reported protocol degradation. We will add a targeted failure-mode analysis in the revised §4.3 that measures tool-call formatting error rates on a held-out set of agent trajectories before and after projection, confirming that format-critical directions are suppressed without harming overall performance. revision: yes
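The failure-mode audit promised in response 3 reduces to counting malformed tool calls over agent trajectories. A hypothetical sketch of such a check; the real harness's protocol is not specified in this review, so the XML-ish `<tool>…</tool>` convention below is purely illustrative:

```python
import re

def tool_call_format_error_rate(trajectories):
    """Fraction of attempted tool calls that are not well-formed.

    trajectories: list of agent runs, each a list of turn strings.
    A turn 'attempts' a tool call if it contains '<tool'; it is
    well-formed (for this sketch) if a complete <tool>...</tool>
    span parses out of it.
    """
    pattern = re.compile(r"<tool>.*?</tool>", re.DOTALL)
    errors = 0
    attempts = 0
    for turns in trajectories:
        for turn in turns:
            if "<tool" in turn:          # turn attempts a tool call
                attempts += 1
                if not pattern.search(turn):
                    errors += 1          # opened but never properly closed
    return errors / attempts if attempts else 0.0
```

Comparing this rate on the same held-out trajectories before and after projection is the shape of evidence the referee asked for: if the Graduated Sigmoidal Projection truly shields format-critical directions, the rate should not rise post-edit.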

Circularity Check

0 steps flagged

No circularity: CRANE derivation is self-contained via empirical benchmarks

full rationale

The paper introduces CRANE as a training-free editing procedure on the Thinking-Instruct delta using magnitude thresholding, a Conservative Taylor Gate, and Graduated Sigmoidal Projection. All performance claims (Roo-Eval pass@1 gains, SWE-bench instance counts, Terminal-Bench pass@1/pass@5) are presented as direct outcomes of benchmark evaluation against alternative merging baselines. No equation in the provided text defines a result in terms of itself, renames a fitted quantity as a prediction, or relies on a load-bearing self-citation whose content is unverified. The separability assumptions are stated as modeling choices and are externally tested rather than derived from prior author work by construction.

Axiom & Free-Parameter Ledger

0 free parameters · 0 axioms · 2 invented entities

The central claim rests on the effectiveness of three newly introduced editing components whose independent validation is not supplied in the abstract; no explicit free parameters or background axioms are stated.

invented entities (2)
  • Conservative Taylor Gate no independent evidence
    purpose: Retain edits jointly beneficial for reasoning transfer and tool-use preservation
    New component introduced to balance the two objectives during delta editing.
  • Graduated Sigmoidal Projection no independent evidence
    purpose: Suppress format-critical update directions
    New projection technique to protect protocol-critical directions in the Instruct model.

pith-pipeline@v0.9.0 · 5600 in / 1291 out tokens · 43162 ms · 2026-05-15T04:42:27.537662+00:00 · methodology

discussion (0)


Reference graph

Works this paper leans on

21 extracted references · 21 canonical work pages · 6 internal anchors

  1. [1]

    Qwen3-Coder-Next Technical Report

    Ruisheng Cao, Mouxiang Chen, Jiawei Chen, Zeyu Cui, Yunlong Feng, Binyuan Hui, Yuheng Jing, Kaixin Li, Mingze Li, Junyang Lin, et al. Qwen3-Coder-Next technical report. arXiv preprint arXiv:2603.00729. Accessed: 2026-04-30.

  2. [2]

    DeepSeek-R1: Incentivizing Reasoning Capability in LLMs via Reinforcement Learning

    Daya Guo, Dejian Yang, Haowei Zhang, Junxiao Song, Peiyi Wang, Qihao Zhu, Runxin Xu, Ruoyu Zhang, Shirong Ma, Xiao Bi, et al. Deepseek-r1: Incentivizing reasoning capability in llms via reinforcement learning.arXiv preprint arXiv:2501.12948,

  3. [3]

    LoraHub: Efficient Cross-Task Generalization via Dynamic LoRA Composition

    Chengsong Huang, Qian Liu, Bill Yuchen Lin, Tianyu Pang, Chao Du, and Min Lin. Lorahub: Efficient cross-task generalization via dynamic lora composition. arXiv preprint arXiv:2307.13269.

  4. [4]

    SWE-bench: Can Language Models Resolve Real-World GitHub Issues?

    Carlos E Jimenez, John Yang, Alexander Wettig, Shunyu Yao, Kexin Pei, Ofir Press, and Karthik Narasimhan. Swe-bench: Can language models resolve real-world github issues?arXiv preprint arXiv:2310.06770,

  5. [5]

    Mind Your Step (by Step): Chain-of-Thought Can Reduce Performance on Tasks Where Thinking Makes Humans Worse

    Ryan Liu, Jiayi Geng, Addison J. Wu, Ilia Sucholutsky, Tania Lombrozo, and Thomas L. Griffiths. Mind your step (by step): Chain-of-thought can reduce performance on tasks where thinking makes humans worse. URL https://arxiv.org/abs/2505.22113.

  6. [6]

    Teaching Small Language Models to Reason

    Lucie Charlotte Magister, Jonathan Mallinson, Jakub Adamek, Eric Malmi, and Aliaksei Severyn. Teaching small language models to reason. In Proceedings of the 61st Annual Meeting of the Association for Computational Linguistics (Volume 2: Short Papers), pages 1773–1781. URL https://arxiv.org/abs/2410.21333.

  7. [7]

    Terminal-Bench: Benchmarking Agents on Hard, Realistic Tasks in Command Line Interfaces

    Mike A Merrill, Alexander G Shaw, Nicholas Carlini, Boxuan Li, Harsh Raj, Ivan Bercovich, Lin Shi, Jeong Yeon Shin, Thomas Walshe, E Kelly Buchanan, et al. Terminal-bench: Benchmarking agents on hard, realistic tasks in command line interfaces.arXiv preprint arXiv:2601.11868,

  8. [8]

    Roo-Code Contributors

    Accessed: 2026-04-29. Roo-Code Contributors. Roo-code: An open-source in-ide coding agent. https://github.com/ RooCodeInc/Roo-Code,

  9. [9]

    A simple and effective pruning approach for large language models

    Mingjie Sun, Zhuang Liu, Anna Bair, and J Zico Kolter. A simple and effective pruning approach for large language models. In12th International Conference on Learning Representations, ICLR 2024,

  10. [10]

    OpenHands: An Open Platform for AI Software Developers as Generalist Agents

    Xingyao Wang, Boxuan Li, Yufan Song, Frank F Xu, Xiangru Tang, Mingchen Zhuge, Jiayi Pan, Yueqi Song, Bowen Li, Jaskirat Singh, et al. Openhands: An open platform for ai software developers as generalist agents.arXiv preprint arXiv:2407.16741,

  11. [11]

    Qwen3 Technical Report

    An Yang, Anfeng Li, Baosong Yang, Beichen Zhang, Binyuan Hui, Bo Zheng, Bowen Yu, Chang Gao, Chengen Huang, Chenxu Lv, et al. Qwen3 technical report.arXiv preprint arXiv:2505.09388,

  12. [12]

    AdaMerging: Adaptive Model Merging for Multi-Task Learning

    Enneng Yang, Zhenyi Wang, Li Shen, Shiwei Liu, Guibing Guo, Xingwei Wang, and Dacheng Tao. Adamerging: Adaptive model merging for multi-task learning. arXiv preprint arXiv:2310.02575.

  13. [13]

    When More Thinking Hurts: Overthinking in LLM Test-Time Compute Scaling

    Shu Zhou, Rui Ling, Junan Chen, Xin Wang, Tao Fan, and Hao Wang. When more thinking hurts: Overthinking in llm test-time compute scaling. Hugging Face dataset, accessed 2026-05-02.

  14. [14]

    BigCodeBench: Benchmarking Code Generation with Diverse Function Calls and Complex Instructions

    Terry Yue Zhuo, Minh Chien Vu, Jenny Chim, Han Hu, Wenhao Yu, Ratnadira Widyasari, Imam Nur Bani Yusuf, Haolan Zhan, Junda He, Indraneil Paul, et al. Bigcodebench: Benchmarking code generation with diverse function calls and complex instructions. arXiv preprint arXiv:2406.15877. URL https://arxiv.org/abs/2604.10739.

  15. [15]

    The exercise counts are Python 34, JavaScript 50, Go 36, Java 45, and Rust 30, for 195 exercises and 585 total rollouts per complete sweep

    13 A Experimental Details A.1 Roo-Eval Evaluation Each checkpoint is evaluated on five programming languages with three independent rollouts per exercise. The exercise counts are Python 34, JavaScript 50, Go 36, Java 45, and Rust 30, for 195 exercises and 585 total rollouts per complete sweep. Table 5: Roo-Eval serving, judging, and reference-cost protoco...

  16. [16]

    We adopt the Qwen3-recommendedtop_k= 20 for every row in Table 2, including endpoint references

    Item Setting Subset SWE-bench-Verified, 500 instances Agent scaffold OpenHands SDK [Wang et al., 2024] Max iterations 100 per instance Sampling temperature 0.6, top_p 0.8, top_k 20 (Qwen3 defaults) Serving vLLM [Kwon et al., 2023], bf16, TP= 4 GPU4×H100 80GB Context length131072 Container backend rootless podman [Podman contributors, 2026] Image registry ...

  17. [17]

    Table 7: Token cost for major frontier lab providers used to estimate relative weights in total tokens count, and average cost ratios of token types relative to input tokens

    Note: in all our experiments we run the models using local vLLM, therefore Total Token Count is used as a proxy to estimate the budget of running those models through providers, not actual incurred spending. Table 7: Token cost for major frontier lab providers used to estimate relative weights in total tokens count, and average cost ratios of token types ...

  18. [18]

    Should the function handle trailing spaces

    The no-self-reflection share decreases slightly from 67.7% (Instruct) to 64.7% (Thinking), but with a different mechanism: Instruct retries the same failing approach, Thinking deliberates without testing. 17 30B-CRANE audit (100 failed rollouts).Canonical run dirs 20260420_020103 (Python), 20260420_022201 (JavaScript), 20260420_025032 (Go), 20260420_03154...

  19. [19]

    Late-layer attention, mid-depth experts, and the routing gate dominate; norm and LM head receive near-zero injection

    Rows: components (Q, K, V , O, expert gate/up/down, norm, router, LM head); columns: layers 0–47. Late-layer attention, mid-depth experts, and the routing gate dominate; norm and LM head receive near-zero injection. B.3 Robustness to Calibration Set Choice We assess the robustness of the CTG Taylor salience used by CRANE to calibration-set choice. On Qwen...

  20. [20]

    Test time

    pass@5 is best-of-5: a task counts as a pass if any of 5 attempts passed. pass_majority requires ≥3/5 attempts to pass (per-task rate ≥0.60 ). pass_majority differs from pass@3: pass@3 weights by the probability of a 3-shot subsample landing a pass; pass_majority requires actual ≥3 successes. “Test time” is the end-to-end Terminal-Bench harness wall time;...

  21. [21]

    that group each ablation family by programming language and retain pass@1, pass@3, pass_all, iterative pass, reference cost, and recorded input/cached/output token totals and averages. Alpha sweep detailed per-language results.Table 41 reports per-language pass metrics, reference- cost proxy, and recorded local-vLLM token usage for each row in this ablati...