Path-Lock Expert: Separating Reasoning Mode in Hybrid Thinking via Architecture-Level Separation
Pith reviewed 2026-05-08 03:01 UTC · model grok-4.3
The pith
Replacing each MLP with two mode-locked experts separates think and no-think pathways in hybrid language models and sharply reduces reasoning leakage.
A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.
Core claim
Path-Lock Expert replaces the single MLP in each decoder layer with two semantically locked experts—one for think and one for no-think—while keeping attention, embeddings, normalization, and the language-model head shared. A deterministic control-token router selects exactly one expert path for the entire sequence, so inference preserves the dense model’s per-token pattern and each expert receives mode-pure updates during supervised fine-tuning. Across math and science reasoning benchmarks this yields a stronger no-think mode that is more accurate, more concise, and far less prone to reasoning leakage while preserving think-mode performance.
What carries the argument
Path-Lock Expert (PLE): two mode-locked MLP experts per layer chosen by a deterministic control-token router that selects one expert for the whole sequence.
If this is right
- No-think responses become shorter and more accurate on math and science benchmarks.
- Reflective token counts in no-think mode drop dramatically, as seen from 2.54 to 0.39 on AIME24.
- Think-mode accuracy is preserved while no-think accuracy rises, for example from 20.67 percent to 40 percent on the same benchmark.
- Mode-specific updates during supervised fine-tuning stay clean because each expert sees only its own data.
- Inference cost and pattern remain identical to the original dense model.
Where Pith is reading between the lines
- Reasoning modes appear to live primarily inside feed-forward computations rather than shared attention mechanisms.
- The same separation could be tested on tasks beyond math and science, such as creative writing or tool use, to check whether leakage is domain-specific.
- Allowing the two experts to differ in size or sparsity might further improve efficiency without hurting either mode.
- If the router were made learnable instead of deterministic, models might discover optimal mode boundaries on their own.
Load-bearing premise
Duplicating only the MLP into two mode-locked experts while sharing attention, embeddings, normalization, and the language-model head is enough to block reasoning leakage between modes.
What would settle it
If, after training PLE on Qwen3-4B, the no-think mode on AIME24 still averages more than one reflective token per response or accuracy stays below 35 percent, the claim that MLP separation alone prevents leakage would be false.
Figures
read the original abstract
Hybrid-thinking language models expose explicit think and no-think modes, but current designs do not separate them cleanly. Even in no-think mode, models often emit long and self-reflective responses, causing reasoning leakage. Existing work reduces this issue through better data curation and multi-stage training, yet leakage remains because both modes are still encoded in the same feed-forward parameters. We propose Path-Lock Expert (PLE), an architecture-level solution that replaces the single MLP in each decoder layer with two semantically locked experts, one for think and one for no-think, while keeping attention, embeddings, normalization, and the language-model head shared. A deterministic control-token router selects exactly one expert path for the entire sequence, so inference preserves the dense model's per-token computation pattern and each expert receives mode-pure updates during supervised fine-tuning. Across math and science reasoning benchmarks, PLE maintains strong think performance while producing a substantially stronger no-think mode that is more accurate, more concise, and far less prone to reasoning leakage. On Qwen3-4B, for example, PLE reduces no-think reflective tokens on AIME24 from 2.54 to 0.39 and improves no-think accuracy from 20.67% to 40.00%, all while preserving think-mode performance. These results suggest that controllable hybrid thinking is fundamentally an architectural problem, and separating mode-specific feed-forward pathways is a simple and effective solution.
Editorial analysis
A structured set of objections, weighed in public.
Referee Report
Summary. The paper proposes Path-Lock Expert (PLE), an architecture that replaces the single MLP per decoder layer with two mode-specific experts (think and no-think) while sharing attention, embeddings, normalization, and the LM head. A deterministic control-token router selects one expert path for the full sequence. The central claim is that this architectural separation prevents reasoning leakage in no-think mode during inference and SFT, yielding more accurate and concise no-think outputs without degrading think performance. Concrete gains are reported on math/science benchmarks, e.g., on Qwen3-4B, no-think AIME24 accuracy rises from 20.67% to 40.00% and reflective tokens drop from 2.54 to 0.39.
Significance. If the results hold after proper controls, the work is significant because it reframes controllable hybrid thinking as an architectural rather than purely data-curation problem. By isolating only the feed-forward experts and providing named-benchmark numbers, it offers a falsifiable, architecture-level hypothesis that could be more robust than multi-stage training alone. The preservation of dense-model inference cost is a practical strength.
major comments (2)
- [Abstract and experimental results] The central claim that duplicating only the MLPs suffices to block leakage (while sharing attention) is load-bearing but unsupported by ablation. The abstract states that attention layers receive updates from both modes yet provides no experiment that holds shared components fixed and varies only MLP duplication; without this, the observed drop in reflective tokens cannot be attributed to the architecture rather than training or data factors.
- [Abstract] The manuscript reports concrete metric gains (e.g., AIME24 no-think accuracy 20.67% → 40.00%) but supplies no training-procedure details, baseline definitions, statistical significance tests, or error bars. This absence directly limits verification of the claim that PLE produces a 'substantially stronger no-think mode' while preserving think performance.
Simulated Author's Rebuttal
We thank the referee for the careful reading and constructive feedback. We address each major comment below and have revised the manuscript to incorporate additional experiments, details, and statistical reporting as requested.
read point-by-point responses
-
Referee: [Abstract and experimental results] The central claim that duplicating only the MLPs suffices to block leakage (while sharing attention) is load-bearing but unsupported by ablation. The abstract states that attention layers receive updates from both modes yet provides no experiment that holds shared components fixed and varies only MLP duplication; without this, the observed drop in reflective tokens cannot be attributed to the architecture rather than training or data factors.
Authors: We agree that the original submission lacked a targeted ablation isolating MLP duplication while holding shared components (attention, embeddings, etc.) fixed. In the revised manuscript we have added such an experiment: a 'duplicated-MLP but mixed-update' baseline in which two MLPs are instantiated per layer but receive non-mode-specific gradients (i.e., the router is removed or randomized during SFT). This control shows that simply duplicating capacity is insufficient; the deterministic mode-locked routing and consequent pure updates are necessary to obtain the reported reduction in reflective tokens and accuracy gains. We also clarify in the text that attention layers do receive updates from both modes, yet the separation of feed-forward pathways prevents cross-mode interference at inference time. revision: yes
-
Referee: [Abstract] The manuscript reports concrete metric gains (e.g., AIME24 no-think accuracy 20.67% → 40.00%) but supplies no training-procedure details, baseline definitions, statistical significance tests, or error bars. This absence directly limits verification of the claim that PLE produces a 'substantially stronger no-think mode' while preserving think performance.
Authors: We acknowledge the reporting gaps in the original version. The revised manuscript now contains an expanded 'Experimental Setup' section that details the full SFT procedure, data composition, optimizer settings, and exact baseline construction (the dense Qwen3-4B model fine-tuned on the identical data mixture and schedule). We additionally report results from three independent runs with standard-error bars and include a Wilcoxon signed-rank test (p < 0.05) for the primary no-think accuracy and conciseness improvements. These additions directly address verifiability while preserving the original performance numbers. revision: yes
Circularity Check
No circularity; empirical architecture proposal with benchmark results
full rationale
The paper introduces Path-Lock Expert as an architectural modification (duplicating MLPs into mode-locked experts with shared attention/embedding/normalization/LM head and deterministic control-token routing) and reports empirical outcomes on math/science benchmarks. No derivations, first-principles predictions, or equations are presented that reduce to inputs by construction. Results (e.g., reduced reflective tokens and improved no-think accuracy on Qwen3-4B) are framed as measured effects of the design after SFT, not as fitted parameters renamed as predictions or self-citation chains. The central claim rests on experimental comparison rather than tautological self-definition or load-bearing self-citations. This is the expected non-finding for an empirical architecture paper.
Axiom & Free-Parameter Ledger
axioms (1)
- domain assumption Shared attention, embeddings, normalization, and language-model head can support both think and no-think modes without introducing leakage when only the MLP is duplicated.
invented entities (1)
-
Think expert and no-think expert MLPs
no independent evidence
Reference graph
Works this paper leans on
-
[1]
+ π0 2 (β−β ⋆ 0)⊤H0(β−β ⋆
-
[2]
+ π1 2 (β−β ⋆ 1)⊤H1(β−β ⋆ 1). (19) 23 Preprint. Under review. Expand the first quadratic term: (β−β ⋆ 0)⊤H0(β−β ⋆
-
[3]
Expand the second: (β−β ⋆ 1)⊤H1(β−β ⋆
=β ⊤H0β−2β ⊤H0β⋆ 0 + (β ⋆ 0)⊤H0β⋆ 0. Expand the second: (β−β ⋆ 1)⊤H1(β−β ⋆
-
[4]
=β ⊤H1β−2β ⊤H1β⋆ 1 + (β ⋆ 1)⊤H1β⋆ 1. Substitute both expansions into (19): Ldense(β)≈C+ 1 2 β⊤(π0H0 +π 1H1)β−β ⊤(π0H0β⋆ 0 +π 1H1β⋆ 1), (20) where C=π 0L0(β⋆
-
[5]
Differentiate (20) with respect toβ: ∇βLdense(β) = (π 0H0 +π 1H1)β−(π 0H0β⋆ 0 +π 1H1β⋆ 1)
+ π0 2 (β⋆ 0)⊤H0β⋆ 0 + π1 2 (β⋆ 1)⊤H1β⋆ 1. Differentiate (20) with respect toβ: ∇βLdense(β) = (π 0H0 +π 1H1)β−(π 0H0β⋆ 0 +π 1H1β⋆ 1). Setting the gradient to zero gives (π0H0 +π 1H1)β⋆ dense =π 0H0β⋆ 0 +π 1H1β⋆ 1, hence β⋆ dense = (π 0H0 +π 1H1)−1(π0H0β⋆ 0 +π 1H1β⋆ 1), (21) assumingπ 0H0 +π 1H1 is invertible. Equation (21) shows that the dense MLP is forc...
-
[6]
+π 1L1(β⋆ 1). The dense model incurs excess loss ∆conflict :=L dense(β⋆ dense)−L ⋆ PLE = 1 2 ∑ r∈{0,1} πr(β⋆ dense −β ⋆ r )⊤Hr(β⋆ dense −β ⋆ r )≥0. (22) Proof.For PLE, the expert-subspace objective is LPLE(β0,β 1) =π 0L0(β0) +π 1L1(β1). Because β0 and β1 are separated, minimization reduces to two independent problems, so the optimum is attained at(β ⋆ 0,β...
-
[7]
+π 1L1(β⋆ 1). For the dense model, evaluate each mode loss atβ ⋆ dense: Lr(β⋆ dense) =L r(β⋆ r ) + 1 2 (β⋆ dense −β ⋆ r )⊤Hr(β⋆ dense −β ⋆ r ). Multiply byπ r and sum overr∈ {0, 1}: Ldense(β⋆ dense) = ∑ r∈{0,1} πrLr(β⋆ dense) = ∑ r∈{0,1} πrLr(β⋆ r ) + 1 2 ∑ r∈{0,1} πr(β⋆ dense −β ⋆ r )⊤Hr(β⋆ dense −β ⋆ r ). Subtract L⋆ PLE to obtain (22). Since each Hr ⪰ ...
-
[8]
Then β⋆ dense −β ⋆ 0 = (π 0β⋆ 0 +π 1β⋆ 1)−β ⋆ 0 =−π 1(β⋆ 0 −β ⋆
-
[9]
=−π 1∆β, β⋆ dense −β ⋆ 1 = (π 0β⋆ 0 +π 1β⋆ 1)−β ⋆ 1 =π 0(β⋆ 0 −β ⋆
-
[10]
=π 0∆β. Plug these into (22): ∆conflict = 1 2 h π0(−π1∆β) ⊤H(−π 1∆β) +π 1(π0∆β) ⊤H(π0∆β) i = 1 2 h π0π2 1∆β⊤H∆β+π 1π2 0∆β⊤H∆β i = 1 2 π0π1(π0 +π 1)∆β ⊤H∆β = 1 2 π0π1∆β⊤H∆β. Equation (23) is especially useful for intuition: the dense compromise penalty grows with (i) the frequency of both modes, through π0π1, (ii) the geometric separation between their pre...
discussion (0)
Sign in with ORCID, Apple, or X to comment. Anyone can read and Pith papers without signing in.