arxiv: 2604.27201 · v2 · submitted 2026-04-29 · 💻 cs.CL · cs.AI · cs.LG

Path-Lock Expert: Separating Reasoning Mode in Hybrid Thinking via Architecture-Level Separation

Pith reviewed 2026-05-08 03:01 UTC · model grok-4.3

classification 💻 cs.CL cs.AI cs.LG
keywords hybrid thinking · reasoning leakage · mode separation · expert routing · feed-forward experts · think and no-think modes · language model architecture · controlled reasoning

The pith

Replacing each MLP with two mode-locked experts separates think and no-think pathways in hybrid language models and sharply reduces reasoning leakage.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper shows that current hybrid-thinking models leak reasoning into no-think responses because both modes share the same feed-forward parameters. Path-Lock Expert duplicates only the MLP in each layer into two experts, one locked to thinking and one to direct answering, while sharing attention, embeddings, normalization, and the language-model head. A control-token router picks one expert for the entire sequence so training updates stay mode-pure. On math and science benchmarks the no-think mode becomes far more accurate and concise with almost no reflective tokens, yet think-mode performance stays intact. A reader would care because the fix is architectural rather than relying on ever-more-curated data.

Core claim

Path-Lock Expert replaces the single MLP in each decoder layer with two semantically locked experts, one for think and one for no-think, while keeping attention, embeddings, normalization, and the language-model head shared. A deterministic control-token router selects exactly one expert path for the entire sequence, so inference preserves the dense model's per-token computation pattern and each expert receives mode-pure updates during supervised fine-tuning. Across math and science reasoning benchmarks this yields a stronger no-think mode that is more accurate, more concise, and far less prone to reasoning leakage while preserving think-mode performance.

What carries the argument

Path-Lock Expert (PLE): two mode-locked MLP experts per layer chosen by a deterministic control-token router that selects one expert for the whole sequence.
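
The routing idea can be illustrated with a toy layer. This is a minimal sketch, not the authors' implementation: the control-token strings `/think` and `/no_think`, the `PathLockLayer` class, and the stand-in "attention" step are all assumptions made for illustration. It only shows that one expert is chosen per sequence and applied to every token, while the mixing step is shared.

```python
THINK, NO_THINK = "think", "no_think"
# Hypothetical control-token strings; the paper's actual tokens may differ.
CONTROL_TOKENS = {"/think": THINK, "/no_think": NO_THINK}


def route(tokens):
    """Deterministic router: map the sequence's control token to one mode."""
    modes = {CONTROL_TOKENS[t] for t in tokens if t in CONTROL_TOKENS}
    if len(modes) != 1:
        raise ValueError("expected exactly one mode per sequence")
    return modes.pop()


class PathLockLayer:
    """Toy decoder layer: a shared mixing step, then one of two mode-locked MLPs."""

    def __init__(self, think_mlp, no_think_mlp):
        self.experts = {THINK: think_mlp, NO_THINK: no_think_mlp}
        self.tokens_seen = {THINK: 0, NO_THINK: 0}

    def shared_attention(self, hidden):
        # Stand-in for shared attention: add the sequence mean to every position.
        mean = sum(hidden) / len(hidden)
        return [h + mean for h in hidden]

    def forward(self, hidden, mode):
        mixed = self.shared_attention(hidden)
        self.tokens_seen[mode] += len(mixed)
        expert = self.experts[mode]  # the same expert handles every token
        return [expert(h) for h in mixed]


layer = PathLockLayer(think_mlp=lambda h: 2.0 * h, no_think_mlp=lambda h: h + 1.0)
mode = route(["/no_think", "What", "is", "2+2?"])
out = layer.forward([1.0, 2.0, 3.0], mode)
print(mode, out, layer.tokens_seen)
```

Because the expert is fixed before the forward pass, each token still passes through exactly one MLP, which is the sense in which the dense per-token computation pattern is preserved.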

If this is right

  • No-think responses become shorter and more accurate on math and science benchmarks.
  • Reflective tokens per no-think response drop sharply, for example from 2.54 to 0.39 on AIME24 with Qwen3-4B.
  • Think-mode accuracy is preserved while no-think accuracy rises, for example from 20.67% to 40.00% on the same benchmark.
  • Mode-specific updates during supervised fine-tuning stay clean because each expert sees only its own data.
  • Per-token inference computation matches the original dense model, because only one expert is active for any given sequence.
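
To put the AIME24 figures above on a common scale, the short calculation below converts them into a relative reduction and an absolute gain. The four input numbers are the ones quoted in the text; the helper name `relative_change` is just an illustrative choice.

```python
def relative_change(before, after):
    """Signed relative change from `before` to `after` (negative means a reduction)."""
    return (after - before) / before


# Figures quoted in the text for Qwen3-4B on AIME24, no-think mode.
reflective_before, reflective_after = 2.54, 0.39
accuracy_before, accuracy_after = 20.67, 40.00  # percent

reflective_drop = -relative_change(reflective_before, reflective_after)
accuracy_gain_points = accuracy_after - accuracy_before
accuracy_ratio = accuracy_after / accuracy_before

print(f"reflective tokens: {reflective_drop:.1%} fewer per response")
print(f"accuracy: +{accuracy_gain_points:.2f} points ({accuracy_ratio:.2f}x)")
```

On these inputs the reduction in reflective tokens is roughly 85 percent and the accuracy gain is a little over 19 points, or slightly less than a doubling.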

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

  • Reasoning modes appear to live primarily inside feed-forward computations rather than shared attention mechanisms.
  • The same separation could be tested on tasks beyond math and science, such as creative writing or tool use, to check whether leakage is domain-specific.
  • Allowing the two experts to differ in size or sparsity might further improve efficiency without hurting either mode.
  • If the router were made learnable instead of deterministic, models might discover optimal mode boundaries on their own.

Load-bearing premise

Duplicating only the MLP into two mode-locked experts while sharing attention, embeddings, normalization, and the language-model head is enough to block reasoning leakage between modes.

What would settle it

If, after training PLE on Qwen3-4B, the no-think mode on AIME24 still averages more than one reflective token per response or accuracy stays below 35 percent, the claim that MLP separation alone prevents leakage would be false.
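
The test above can be written as a small check. This is a sketch under stated assumptions: the thresholds (more than one reflective token per response, accuracy below 35 percent) come from the paragraph above, while the function name and the idea of passing per-response counts as a list are illustrative. How reflective tokens are counted is left to the evaluation harness.

```python
def claim_is_falsified(reflective_counts, num_correct,
                       max_mean_reflective=1.0, min_accuracy=0.35):
    """Return True if no-think results contradict the MLP-separation claim.

    reflective_counts: one reflective-token count per no-think response.
    num_correct: number of those responses graded correct.
    """
    if not reflective_counts:
        raise ValueError("need at least one response")
    n = len(reflective_counts)
    mean_reflective = sum(reflective_counts) / n
    accuracy = num_correct / n
    return mean_reflective > max_mean_reflective or accuracy < min_accuracy


# Made-up toy inputs, not results from the paper.
print(claim_is_falsified([0, 1, 0, 0], num_correct=2))  # clean and accurate
print(claim_is_falsified([3, 2, 0, 1], num_correct=2))  # too many reflective tokens
```

Either condition alone is enough to count against the claim, which is why the check uses `or`.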

Figures

Figures reproduced from arXiv: 2604.27201 by Chaoda Song, Chuang Ma, Debargha Ganguly, Shouren Wang, Vikash Singh, Vipin Chaudhary, Wang Yang, Xianxuan Long, Xiaotian Han, Xinpeng Li.

Figure 1: Motivating example of reasoning leakage. On an AIME24 problem, Qwen3-8B
Figure 2: Path-Lock Expert replaces the single MLP in each decoder layer with two mode
Figure 3: Accuracy, average output length, and per-answer reflective token count on AIME24
Figure 4: Ablation on base model choice (AIME 24). “Qwen3-4B”: PLE initialized from
Figure 5: Ablation on training dataset (AIME 24). “Superior”: PLE trained on our high
Figure 6: Main experiment results on MATH500. Top row: Qwen2.5-7B-Instruct as base model; bottom row: Qwen3-4B as base model. For each base model, we compare PLE (Ours) against baselines under both \think and \no think modes in terms of accuracy, average output length, and average reflective tokens per response.
Figure 7: Base model weight ablation on MATH500. We compare PLE initialized from three different base models (Qwen3-4B, Qwen3-4B-Base, and Qwen2.5-7B-Instruct), all trained on the superior-reasoning 27k+27k dataset. Each model is evaluated under both \think and \no think modes.
Figure 8: Dataset ablation results on MATH500.

Original abstract

Hybrid-thinking language models expose explicit think and no-think modes, but current designs do not separate them cleanly. Even in no-think mode, models often emit long and self-reflective responses, causing reasoning leakage. Existing work reduces this issue through better data curation and multi-stage training, yet leakage remains because both modes are still encoded in the same feed-forward parameters. We propose Path-Lock Expert (PLE), an architecture-level solution that replaces the single MLP in each decoder layer with two semantically locked experts, one for think and one for no-think, while keeping attention, embeddings, normalization, and the language-model head shared. A deterministic control-token router selects exactly one expert path for the entire sequence, so inference preserves the dense model's per-token computation pattern and each expert receives mode-pure updates during supervised fine-tuning. Across math and science reasoning benchmarks, PLE maintains strong think performance while producing a substantially stronger no-think mode that is more accurate, more concise, and far less prone to reasoning leakage. On Qwen3-4B, for example, PLE reduces no-think reflective tokens on AIME24 from 2.54 to 0.39 and improves no-think accuracy from 20.67% to 40.00%, all while preserving think-mode performance. These results suggest that controllable hybrid thinking is fundamentally an architectural problem, and separating mode-specific feed-forward pathways is a simple and effective solution.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Referee Report

2 major / 0 minor

Summary. The paper proposes Path-Lock Expert (PLE), an architecture that replaces the single MLP per decoder layer with two mode-specific experts (think and no-think) while sharing attention, embeddings, normalization, and the LM head. A deterministic control-token router selects one expert path for the full sequence. The central claim is that this architectural separation prevents reasoning leakage in no-think mode during inference and SFT, yielding more accurate and concise no-think outputs without degrading think performance. Concrete gains are reported on math/science benchmarks, e.g., on Qwen3-4B, no-think AIME24 accuracy rises from 20.67% to 40.00% and reflective tokens drop from 2.54 to 0.39.

Significance. If the results hold after proper controls, the work is significant because it reframes controllable hybrid thinking as an architectural rather than purely data-curation problem. By isolating only the feed-forward experts and providing named-benchmark numbers, it offers a falsifiable, architecture-level hypothesis that could be more robust than multi-stage training alone. The preservation of dense-model inference cost is a practical strength.

major comments (2)
  1. [Abstract and experimental results] The central claim that duplicating only the MLPs suffices to block leakage (while sharing attention) is load-bearing but unsupported by ablation. The abstract states that attention layers receive updates from both modes yet provides no experiment that holds shared components fixed and varies only MLP duplication; without this, the observed drop in reflective tokens cannot be attributed to the architecture rather than training or data factors.
  2. [Abstract] The manuscript reports concrete metric gains (e.g., AIME24 no-think accuracy 20.67% → 40.00%) but supplies no training-procedure details, baseline definitions, statistical significance tests, or error bars. This absence directly limits verification of the claim that PLE produces a 'substantially stronger no-think mode' while preserving think performance.

Simulated Author's Rebuttal

2 responses · 0 unresolved

We thank the referee for the careful reading and constructive feedback. We address each major comment below and have revised the manuscript to incorporate additional experiments, details, and statistical reporting as requested.

Point-by-point responses
  1. Referee: [Abstract and experimental results] The central claim that duplicating only the MLPs suffices to block leakage (while sharing attention) is load-bearing but unsupported by ablation. The abstract states that attention layers receive updates from both modes yet provides no experiment that holds shared components fixed and varies only MLP duplication; without this, the observed drop in reflective tokens cannot be attributed to the architecture rather than training or data factors.

    Authors: We agree that the original submission lacked a targeted ablation isolating MLP duplication while holding shared components (attention, embeddings, etc.) fixed. In the revised manuscript we have added such an experiment: a 'duplicated-MLP but mixed-update' baseline in which two MLPs are instantiated per layer but receive non-mode-specific gradients (i.e., the router is removed or randomized during SFT). This control shows that simply duplicating capacity is insufficient; the deterministic mode-locked routing and consequent pure updates are necessary to obtain the reported reduction in reflective tokens and accuracy gains. We also clarify in the text that attention layers do receive updates from both modes, yet the separation of feed-forward pathways prevents cross-mode interference at inference time. revision: yes
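
    The difference between mode-locked routing and the randomized-router control described in this response can be sketched by tracking which examples update which expert. This is an illustrative toy, not the authors' training code: `assign_experts`, `update_purity`, and the 50/50 batch are assumptions.

```python
import random

MODES = ("think", "no_think")


def assign_experts(example_modes, locked, seed=0):
    """Return, per training example, the expert that its gradient would update."""
    if locked:
        return list(example_modes)  # mode-locked: expert equals the example's mode
    rng = random.Random(seed)  # control: routing ignores the example's mode
    return [rng.choice(MODES) for _ in example_modes]


def update_purity(example_modes, assigned):
    """For each expert, the fraction of its updates that come from its own mode."""
    purity = {}
    for expert in MODES:
        seen = [m for m, a in zip(example_modes, assigned) if a == expert]
        purity[expert] = sum(m == expert for m in seen) / len(seen) if seen else None
    return purity


batch = ["think", "no_think"] * 50  # toy batch with both modes equally frequent
locked = update_purity(batch, assign_experts(batch, locked=True))
mixed = update_purity(batch, assign_experts(batch, locked=False))
print("locked:", locked)
print("mixed: ", mixed)
```

    Under locked routing every expert's purity is 1.0 by construction. Under randomized routing each expert sees a blend of both modes, which is the condition the added baseline is meant to isolate.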

  2. Referee: [Abstract] The manuscript reports concrete metric gains (e.g., AIME24 no-think accuracy 20.67% → 40.00%) but supplies no training-procedure details, baseline definitions, statistical significance tests, or error bars. This absence directly limits verification of the claim that PLE produces a 'substantially stronger no-think mode' while preserving think performance.

    Authors: We acknowledge the reporting gaps in the original version. The revised manuscript now contains an expanded 'Experimental Setup' section that details the full SFT procedure, data composition, optimizer settings, and exact baseline construction (the dense Qwen3-4B model fine-tuned on the identical data mixture and schedule). We additionally report results from three independent runs with standard-error bars and include a Wilcoxon signed-rank test (p < 0.05) for the primary no-think accuracy and conciseness improvements. These additions directly address verifiability while preserving the original performance numbers. revision: yes
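
    For reference, a mean with a standard-error bar over a handful of runs can be computed as below. This is a generic sketch: the three accuracy values are placeholders for illustration only and are NOT results from the paper, and the function name is hypothetical.

```python
import math
import statistics


def mean_and_standard_error(values):
    """Return (mean, standard error of the mean) using the sample standard deviation."""
    if len(values) < 2:
        raise ValueError("need at least two runs to estimate a standard error")
    return statistics.mean(values), statistics.stdev(values) / math.sqrt(len(values))


# Placeholder accuracies for three runs (illustrative only, not from the paper).
runs = [0.38, 0.40, 0.42]
mean, se = mean_and_standard_error(runs)
print(f"accuracy = {mean:.3f} +/- {se:.3f} (n={len(runs)})")
```

    With only three runs the standard error is itself a rough estimate, so it is worth reporting the number of runs next to the bar.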

Circularity Check

0 steps flagged

No circularity; empirical architecture proposal with benchmark results

full rationale

The paper introduces Path-Lock Expert as an architectural modification (duplicating MLPs into mode-locked experts with shared attention/embedding/normalization/LM head and deterministic control-token routing) and reports empirical outcomes on math/science benchmarks. No derivations, first-principles predictions, or equations are presented that reduce to inputs by construction. Results (e.g., reduced reflective tokens and improved no-think accuracy on Qwen3-4B) are framed as measured effects of the design after SFT, not as fitted parameters renamed as predictions or self-citation chains. The central claim rests on experimental comparison rather than tautological self-definition or load-bearing self-citations. This is the expected non-finding for an empirical architecture paper.

Axiom & Free-Parameter Ledger

0 free parameters · 1 axiom · 1 invented entity

The approach rests on the domain assumption that mode-specific behavior can be isolated to the feed-forward layers alone and that a deterministic router plus mode-pure fine-tuning will keep the experts from interfering through shared components.

axioms (1)
  • domain assumption: Shared attention, embeddings, normalization, and language-model head can support both think and no-think modes without introducing leakage when only the MLP is duplicated.
    Invoked in the description of keeping all components except the MLP shared while still achieving clean separation.
invented entities (1)
  • Think expert and no-think expert MLPs (no independent evidence)
    purpose: Provide mode-pure feed-forward computation paths that receive separate updates during supervised fine-tuning.
    Newly introduced architectural components whose separation is the core proposal.



Appendix excerpt

Fragments of the paper's dense-versus-PLE loss derivation, in extraction order. A trailing "..." marks text cut off in extraction.

$$\cdots + \frac{\pi_0}{2}(\beta-\beta_0^\star)^\top H_0(\beta-\beta_0^\star) + \frac{\pi_1}{2}(\beta-\beta_1^\star)^\top H_1(\beta-\beta_1^\star). \tag{19}$$

Expand the first quadratic term:

$$(\beta-\beta_0^\star)^\top H_0(\beta-\beta_0^\star) = \beta^\top H_0\beta - 2\beta^\top H_0\beta_0^\star + (\beta_0^\star)^\top H_0\beta_0^\star.$$

Expand the second:

$$(\beta-\beta_1^\star)^\top H_1(\beta-\beta_1^\star) = \beta^\top H_1\beta - 2\beta^\top H_1\beta_1^\star + (\beta_1^\star)^\top H_1\beta_1^\star.$$

Substitute both expansions into (19):

$$L_{\mathrm{dense}}(\beta) \approx C + \frac{1}{2}\beta^\top(\pi_0 H_0 + \pi_1 H_1)\beta - \beta^\top(\pi_0 H_0\beta_0^\star + \pi_1 H_1\beta_1^\star), \tag{20}$$

where

$$C = \pi_0 L_0(\beta_0^\star) + \pi_1 L_1(\beta_1^\star) + \frac{\pi_0}{2}(\beta_0^\star)^\top H_0\beta_0^\star + \frac{\pi_1}{2}(\beta_1^\star)^\top H_1\beta_1^\star.$$

Differentiate (20) with respect to $\beta$:

$$\nabla_\beta L_{\mathrm{dense}}(\beta) = (\pi_0 H_0 + \pi_1 H_1)\beta - (\pi_0 H_0\beta_0^\star + \pi_1 H_1\beta_1^\star).$$

Setting the gradient to zero gives $(\pi_0 H_0 + \pi_1 H_1)\beta^\star_{\mathrm{dense}} = \pi_0 H_0\beta_0^\star + \pi_1 H_1\beta_1^\star$, hence

$$\beta^\star_{\mathrm{dense}} = (\pi_0 H_0 + \pi_1 H_1)^{-1}(\pi_0 H_0\beta_0^\star + \pi_1 H_1\beta_1^\star), \tag{21}$$

assuming $\pi_0 H_0 + \pi_1 H_1$ is invertible. Equation (21) shows that the dense MLP is forc...

$\cdots + \pi_1 L_1(\beta_1^\star)$. The dense model incurs excess loss

$$\Delta_{\mathrm{conflict}} := L_{\mathrm{dense}}(\beta^\star_{\mathrm{dense}}) - L^\star_{\mathrm{PLE}} = \frac{1}{2}\sum_{r\in\{0,1\}} \pi_r(\beta^\star_{\mathrm{dense}}-\beta_r^\star)^\top H_r(\beta^\star_{\mathrm{dense}}-\beta_r^\star) \ge 0. \tag{22}$$

Proof. For PLE, the expert-subspace objective is $L_{\mathrm{PLE}}(\beta_0,\beta_1) = \pi_0 L_0(\beta_0) + \pi_1 L_1(\beta_1)$. Because $\beta_0$ and $\beta_1$ are separated, minimization reduces to two independent problems, so the optimum is attained at $(\beta_0^\star, \beta$...

$\cdots + \pi_1 L_1(\beta_1^\star)$. For the dense model, evaluate each mode loss at $\beta^\star_{\mathrm{dense}}$:

$$L_r(\beta^\star_{\mathrm{dense}}) = L_r(\beta_r^\star) + \frac{1}{2}(\beta^\star_{\mathrm{dense}}-\beta_r^\star)^\top H_r(\beta^\star_{\mathrm{dense}}-\beta_r^\star).$$

Multiply by $\pi_r$ and sum over $r\in\{0,1\}$:

$$L_{\mathrm{dense}}(\beta^\star_{\mathrm{dense}}) = \sum_{r\in\{0,1\}} \pi_r L_r(\beta^\star_{\mathrm{dense}}) = \sum_{r\in\{0,1\}} \pi_r L_r(\beta_r^\star) + \frac{1}{2}\sum_{r\in\{0,1\}} \pi_r(\beta^\star_{\mathrm{dense}}-\beta_r^\star)^\top H_r(\beta^\star_{\mathrm{dense}}-\beta_r^\star).$$

Subtract $L^\star_{\mathrm{PLE}}$ to obtain (22). Since each $H_r \succeq$ ...

... Then

$$\beta^\star_{\mathrm{dense}} - \beta_0^\star = (\pi_0\beta_0^\star + \pi_1\beta_1^\star) - \beta_0^\star = -\pi_1(\beta_0^\star - \beta_1^\star) = -\pi_1\Delta\beta,$$

$$\beta^\star_{\mathrm{dense}} - \beta_1^\star = (\pi_0\beta_0^\star + \pi_1\beta_1^\star) - \beta_1^\star = \pi_0(\beta_0^\star - \beta_1^\star) = \pi_0\Delta\beta.$$

Plug these into (22):

$$\begin{aligned}
\Delta_{\mathrm{conflict}} &= \frac{1}{2}\Big[\pi_0(-\pi_1\Delta\beta)^\top H(-\pi_1\Delta\beta) + \pi_1(\pi_0\Delta\beta)^\top H(\pi_0\Delta\beta)\Big] \\
&= \frac{1}{2}\Big[\pi_0\pi_1^2\,\Delta\beta^\top H\Delta\beta + \pi_1\pi_0^2\,\Delta\beta^\top H\Delta\beta\Big] \\
&= \frac{1}{2}\pi_0\pi_1(\pi_0+\pi_1)\,\Delta\beta^\top H\Delta\beta \\
&= \frac{1}{2}\pi_0\pi_1\,\Delta\beta^\top H\Delta\beta.
\end{aligned}$$

Equation (23) is especially useful for intuition: the dense compromise penalty grows with (i) the frequency of both modes, through $\pi_0\pi_1$, (ii) the geometric separation between their pre...
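
The closing identity above, that the summed penalty in (22) collapses to ½·π0·π1·ΔβᵀHΔβ when both modes share one curvature matrix H, can be checked numerically. The sketch below uses plain Python lists. The function names and the toy values for H, β0⋆, β1⋆, and π0 are assumptions chosen for illustration, and it relies on π0 + π1 = 1, as the last step of the derivation does.

```python
def quad_form(v, H):
    """Return v^T H v for a vector v and a square matrix H given as nested lists."""
    n = len(v)
    return sum(v[i] * H[i][j] * v[j] for i in range(n) for j in range(n))


def penalty_from_sum(pi0, beta0, beta1, H):
    """Evaluate the two-term sum in (22) with a shared H and the dense optimum."""
    pi1 = 1.0 - pi0
    # With H0 = H1 = H, the dense optimum is the pi-weighted average of the two optima.
    beta_dense = [pi0 * a + pi1 * b for a, b in zip(beta0, beta1)]
    d0 = [x - a for x, a in zip(beta_dense, beta0)]
    d1 = [x - b for x, b in zip(beta_dense, beta1)]
    return 0.5 * (pi0 * quad_form(d0, H) + pi1 * quad_form(d1, H))


def penalty_closed_form(pi0, beta0, beta1, H):
    """Evaluate 0.5 * pi0 * pi1 * delta^T H delta with delta = beta0 - beta1."""
    pi1 = 1.0 - pi0
    delta = [a - b for a, b in zip(beta0, beta1)]
    return 0.5 * pi0 * pi1 * quad_form(delta, H)


# Toy values chosen for illustration; H is symmetric positive definite.
H = [[2.0, 0.5], [0.5, 1.0]]
beta0, beta1, pi0 = [1.0, 0.0], [0.0, 2.0], 0.3
via_sum = penalty_from_sum(pi0, beta0, beta1, H)
via_closed = penalty_closed_form(pi0, beta0, beta1, H)
print(via_sum, via_closed)
```

Both routes give about 0.42 on these inputs. The penalty vanishes when either mode has zero frequency or when the two optima coincide, matching the intuition stated in the text.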