SamatNext v0.2-B: An Exploratory Study of RMS-Normalized Hybrid Decoders for Curriculum Retention in Small Code Models

Samat Zharassov

arxiv: 2606.22248 · v2 · pith:OCKOI7KPnew · submitted 2026-06-20 · 💻 cs.LG · cs.CL

Pith reviewed 2026-06-30 10:14 UTC · model grok-4.3

keywords hybrid decoder · curriculum retention · code models · catastrophic forgetting · RMS normalization · DeltaNet · Differential Attention · Python code generation

The pith

A hybrid decoder mixing attention and linear-state layers retains 98.8 percent of adjacent Stage 3 semantic behavior while reaching a 100.0 percent pass rate on the final Stage 5 holdout; a parameter-matched Transformer retains only 6.0 percent.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper tests SamatNext v0.2-B, a 356M-parameter hybrid sequence decoder that alternates Differential-Attention-style layers with DeltaNet-inspired simplified linear-state mixer layers under RMS normalization and output scale calibration. In a controlled staged Python code curriculum, the model reaches a 100.0 percent pass rate on the Stage 5 holdout, retains 98.8 percent of adjacent Stage 3 semantic behavior, and scores 12.0 percent on the Stage 2E early syntax holdout. A parameter-matched Transformer baseline reaches 97.6 percent on Stage 5 but retains only 6.0 percent of Stage 3 behavior. The work presents this as evidence of a shifted retention and plasticity tradeoff in the tested setting rather than a general fix for forgetting.
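
As a reading aid, the reported pass rates can be placed side by side to make the size of the tradeoff explicit. The sketch below only restates figures from the abstract; the dictionary layout and the helper name `gap` are illustrative assumptions, and the Transformer's Stage 2E score is left as `None` because it is not reported in the text summarized here.

```python
from typing import Optional

# Pass rates (percent) as reported in the abstract. None marks a value
# that the abstract does not report.
REPORTED = {
    "SamatNext v0.2-B": {"stage5": 100.0, "stage3": 98.8, "stage2e": 12.0},
    "Transformer baseline": {"stage5": 97.6, "stage3": 6.0, "stage2e": None},
}


def gap(holdout: str) -> Optional[float]:
    """Hybrid minus baseline in percentage points, or None if unreported."""
    hybrid = REPORTED["SamatNext v0.2-B"][holdout]
    baseline = REPORTED["Transformer baseline"][holdout]
    if hybrid is None or baseline is None:
        return None
    return round(hybrid - baseline, 1)


for name in ("stage5", "stage3", "stage2e"):
    print(name, gap(name))
```

On these figures the Stage 5 gap is 2.4 points while the Stage 3 gap is 92.8 points, which is why the summary describes a shifted tradeoff rather than a uniform improvement.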

Core claim

SamatNext v0.2-B, an experimental 356M-parameter hybrid sequence decoder that alternates Differential-Attention-style layers with DeltaNet-inspired simplified linear-state mixer layers using RMS normalization and output scale calibration, achieves a 100.0% pass rate on the controlled Stage 5 holdout while retaining 98.8% of adjacent Stage 3 semantic behavior and reaching 12.0% on the Stage 2E early syntax holdout. The strongest Transformer baseline reaches 97.6% on Stage 5 but retains only 6.0% of Stage 3 behavior. Both architectures remain weak on long-horizon early-stage retention, so the result should be interpreted as evidence of an altered retention/plasticity tradeoff in this controlled setting, not as a general solution to catastrophic forgetting.

What carries the argument

RMS-normalized hybrid decoder that alternates Differential-Attention-style layers with DeltaNet-inspired simplified linear-state mixer layers
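
The alternation can be pictured as a repeating layer schedule. The following is a minimal sketch, assuming a strict one-to-one alternation that starts with an attention layer; the layer count, the starting layer, and the names `build_schedule`, `diff_attn`, and `linear_mixer` are hypothetical, since the summary does not state the actual pattern or depth.

```python
from itertools import cycle, islice
from typing import List, Sequence


def build_schedule(
    n_layers: int,
    pattern: Sequence[str] = ("diff_attn", "linear_mixer"),
) -> List[str]:
    """Repeat `pattern` until `n_layers` layer types have been emitted."""
    if n_layers < 0:
        raise ValueError("n_layers must be non-negative")
    if not pattern:
        raise ValueError("pattern must contain at least one layer type")
    return list(islice(cycle(pattern), n_layers))


# Hypothetical 6-layer stack; the real depth is not given in this summary.
print(build_schedule(6))
```

Passing a different `pattern`, for example two attention layers per mixer layer, is one way to treat the schedule as the free parameter the ledger lists.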

If this is right

Hybrid architectures can reach full new-stage performance while preserving nearly all adjacent-stage semantic behavior under sequential fine-tuning.
Standard Transformer decoders exhibit sharp drops in retention of earlier semantic behavior even when final-stage accuracy remains high.
Both hybrid and Transformer models continue to show limited retention over longer curriculum horizons.
The observed retention/plasticity tradeoff is specific to the tested curriculum and evaluation setup.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

Similar layer alternation and normalization choices could be tested on non-code sequential tasks such as mathematical reasoning chains.
The early-stage syntax holdout scores suggest that further adjustments to the linear-state mixer components might improve distant retention without harming final performance.
Independent verification using the released code and scripts can check whether the tradeoff appears under varied random seeds or slight hyperparameter changes.

Load-bearing premise

The staged curriculum, holdout sets, and semantic behavior metrics isolate retention effects without confounding influences from data distributions, evaluation choices, or unmeasured model properties.

What would settle it

Re-running the full staged curriculum experiment with a different data ordering or an expanded set of holdouts that track retention across non-adjacent stages would show whether the reported retention advantage persists.
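
One way to organize such a re-run is a retention matrix: after training on each stage, evaluate on every holdout seen so far, so that non-adjacent retention is visible alongside adjacent retention. The sketch below assumes hypothetical `train_stage` and `evaluate` callables and made-up stage names; it is not the paper's evaluation script, and the toy model exists only so the block runs.

```python
from typing import Callable, Dict, List, Sequence, Tuple


def retention_matrix(
    stages: Sequence[str],
    train_stage: Callable[[str], None],
    evaluate: Callable[[str], float],
) -> Dict[Tuple[str, str], float]:
    """Map (trained_through, holdout) to the pass rate on that holdout."""
    results: Dict[Tuple[str, str], float] = {}
    for i, stage in enumerate(stages):
        train_stage(stage)
        # Re-check every holdout seen so far, not only the adjacent one.
        for holdout in stages[: i + 1]:
            results[(stage, holdout)] = evaluate(holdout)
    return results


class ToyModel:
    """Stand-in whose score halves for each stage trained after a holdout's stage."""

    def __init__(self) -> None:
        self.trained: List[str] = []

    def train(self, stage: str) -> None:
        self.trained.append(stage)

    def evaluate(self, holdout: str) -> float:
        age = len(self.trained) - 1 - self.trained.index(holdout)
        return 100.0 / (2 ** age)


model = ToyModel()
matrix = retention_matrix(["s1", "s2", "s3"], model.train, model.evaluate)
for key in sorted(matrix):
    print(key, matrix[key])
```

Changing the order of `stages` in the call is the simplest way to probe the data-ordering question raised above.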

Original abstract

Standard autoregressive Transformer decoders can often exhibit substantial forgetting under sequential fine-tuning on shifting curriculum distributions. This technical report evaluates SamatNext v0.2-B, an experimental 356M-parameter hybrid sequence decoder that alternates Differential-Attention-style layers with DeltaNet-inspired simplified linear-state mixer layers using RMS normalization and output scale calibration. We study the model under a controlled staged Python code curriculum and compare it with a parameter-matched Transformer baseline. In this setting, SamatNext v0.2-B achieves a 100.0% pass rate on the controlled Stage 5 holdout while retaining 98.8% of adjacent Stage 3 semantic behavior and reaching 12.0% on the Stage 2E early syntax holdout. The strongest Transformer baseline reaches 97.6% on Stage 5 but retains only 6.0% of Stage 3 behavior. Both architectures remain weak on long-horizon early-stage retention, so the result should be interpreted as evidence of an altered retention/plasticity tradeoff in this controlled setting, not as a general solution to catastrophic forgetting. Code, model specifications, evaluation scripts, and result tables are provided for independent verification.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Desk Editor's Note (private letter to a colleague)

SamatNext v0.2-B keeps far more stage-3 behavior than a matched Transformer on this staged Python curriculum while matching final-stage performance, with code and scripts supplied for checks.

The new piece is the concrete head-to-head on a 356M-parameter hybrid that alternates Differential-Attention-style layers with simplified linear-state mixers under RMS normalization. In the reported run it reaches 100.0% on the Stage 5 holdout and 98.8% retention of Stage 3 semantic behavior, against the baseline's 97.6% and 6.0%. The author is explicit that both models remain weak on long-horizon early-stage retention, so the result is framed only as an altered tradeoff in this controlled setting.

The work is useful because it ships the model specs, evaluation scripts, and result tables. That lets anyone reproduce the numbers rather than take them on trust. The abstract also avoids overclaiming, which keeps the observation usable for people studying continual learning in small code models.

The soft spot is the narrow scope: one curriculum, one size, one set of holdouts and metrics. The semantic-behavior measure is not described in the abstract, and full training details are not visible here, so it is still an observation rather than a settled mechanism. No internal contradiction appears in the reported numbers.

This is worth a referee's time and relevant to anyone working on sequential fine-tuning of code models. The claim is modest, the comparison is direct, and the artifacts make it checkable.

Referee Report

0 major / 2 minor

Summary. The manuscript introduces SamatNext v0.2-B, a 356M-parameter hybrid sequence decoder alternating Differential-Attention-style layers with DeltaNet-inspired simplified linear-state mixer layers under RMS normalization and output scale calibration. Under a controlled staged Python code curriculum, it reports a 100.0% pass rate on the Stage 5 holdout, 98.8% retention of adjacent Stage 3 semantic behavior, and 12.0% on the Stage 2E early syntax holdout; a parameter-matched Transformer baseline reaches 97.6% on Stage 5 but only 6.0% Stage 3 retention. The work is scoped as an exploratory observation of an altered retention/plasticity tradeoff, with explicit caveats on long-horizon retention, and it provides code, model specs, and evaluation scripts.

Significance. If the reported numbers hold under independent verification of the provided artifacts, the result supplies concrete empirical evidence that hybrid decoder designs can produce a measurably different retention/plasticity tradeoff than standard Transformers in one controlled curriculum setting for small code models. The explicit reproducibility artifacts and scoped interpretation (no claim of a general solution to catastrophic forgetting) are strengths that increase the utility of the observation for follow-on architecture studies.

minor comments (2)

Abstract and §1: the term 'semantic behavior' is used without a concise operational definition or reference to the precise metric computation; a one-sentence clarification would improve readability for readers outside the immediate experimental setup.
The manuscript states that code, model specifications, and evaluation scripts are provided, but the main text should include explicit file names or repository paths in a dedicated 'Reproducibility' subsection to facilitate direct verification.

Simulated Author's Rebuttal

0 responses · 0 unresolved

We thank the referee for the positive assessment and recommendation of minor revision. The report accurately captures the exploratory scope, reproducibility provisions, and limited claims of our work on retention/plasticity tradeoffs in this controlled curriculum setting.

Circularity Check

0 steps flagged

No significant circularity identified

The paper reports direct empirical measurements of pass rates and semantic retention on staged holdout sets for a hybrid decoder versus a parameter-matched Transformer baseline. No equations, derivations, fitted parameters renamed as predictions, or self-citation chains appear in the provided text; the central claims rest on controlled experimental comparisons with code and scripts supplied for verification, making the result self-contained against external benchmarks.

Axiom & Free-Parameter Ledger

2 free parameters · 1 axiom · 1 invented entity

The central claim rests on empirical measurements from a specific hybrid configuration and curriculum; the architecture combines prior layer types without new theoretical grounding.

free parameters (2)

output scale calibration
Calibration factor applied after RMS normalization in the hybrid decoder.
layer alternation schedule
Pattern used to alternate Differential-Attention and linear-state mixer layers.
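
For orientation, RMS normalization divides a vector by the root mean square of its entries, and the ledger describes the calibration factor as applied afterwards. The sketch below is a plain-Python illustration under assumptions: the scalar `output_scale`, the per-feature `gain`, the `eps` value, and the function names are hypothetical, and the paper's actual parameterization is not given in this summary.

```python
import math
from typing import List, Optional, Sequence


def rms_norm(
    x: Sequence[float],
    gain: Optional[Sequence[float]] = None,
    eps: float = 1e-6,
) -> List[float]:
    """Scale x so its root mean square is about 1, then apply a per-feature gain."""
    if len(x) == 0:
        raise ValueError("x must be non-empty")
    mean_sq = sum(v * v for v in x) / len(x)
    inv_rms = 1.0 / math.sqrt(mean_sq + eps)
    if gain is None:
        gain = [1.0] * len(x)
    return [v * inv_rms * g for v, g in zip(x, gain)]


def calibrated_output(x: Sequence[float], output_scale: float = 1.0) -> List[float]:
    """Assumed form of the calibration: one scalar applied after normalization."""
    return [output_scale * v for v in rms_norm(x)]


y = calibrated_output([3.0, 4.0], output_scale=0.5)
rms_y = math.sqrt(sum(v * v for v in y) / len(y))
print(rms_y)  # close to output_scale, up to the eps term
```

With a unit gain, the root mean square of the output is set by `output_scale` alone, which is what makes a single scalar usable as a calibration knob in this assumed form.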

axioms (1)

domain assumption: The Python code curriculum stages represent shifting distributions suitable for isolating retention effects.
Evaluation design assumes the staged holdouts measure retention independently of other factors.

invented entities (1)

RMS-Normalized Hybrid Decoder (SamatNext v0.2-B) · no independent evidence
purpose: To alternate attention and linear mixer layers for improved curriculum retention.
New experimental combination introduced for this study.

pith-pipeline@v0.9.1-grok · 5748 in / 1316 out tokens · 59037 ms · 2026-06-30T10:14:58.403616+00:00 · methodology

Reference graph

Works this paper leans on

9 extracted references · 3 canonical work pages · 2 internal anchors

[1] Binyuan Hui, Jian Yang, Zeyu Cui, Jiaxi Yang, Dayiheng Liu, Lei Zhang, Tianyu Liu, Jiajun Zhang, Bowen Yu, Keming Lu, Kai Dang, Yang Fan, Yichang Zhang, An Yang, Rui Men, Fei Huang, Bo Zheng, Yibo Miao, Shanghaoran Quan, Yunlong Feng, Xingzhang Ren, Xuancheng Ren, Jingren Zhou, and Junyang Lin. Qwen2.5-Coder technical report. arXiv preprint arXiv:2409.12186, 2024.
[2] James Kirkpatrick, Razvan Pascanu, Neil Rabinowitz, Joel Veness, Guillaume Desjardins, Andrei A Rusu, Kieran Milan, John Quan, Tiago Ramalho, Agnieszka Grabska-Barwinska, Demis Hassabis, Claudia Clopath, Dharshan Kumaran, and Raia Hadsell. Overcoming catastrophic forgetting in neural networks. Proceedings of the National Academy of Sciences, 114(13):3521–3526, 2017.
[3] Michael McCloskey and Neal J Cohen. Catastrophic interference in connectionist networks: The sequential learning problem. In Psychology of Learning and Motivation, volume 24, pages 109–165. Elsevier, 1989.
[4] Noam Shazeer. GLU variants improve Transformer. arXiv preprint arXiv:2002.05202, 2020.
[5] Jianlin Su, Yu Lu, Shengfeng Pan, Ahmed Murtadha, Bo Wen, and Yunfeng Liu. RoFormer: Enhanced transformer with rotary position embedding. Neurocomputing, 568:127063, 2024.
[6] Ashish Vaswani, Noam Shazeer, Niki Parmar, Jakob Uszkoreit, Llion Jones, Aidan N Gomez, Łukasz Kaiser, and Illia Polosukhin. Attention is all you need. In Advances in Neural Information Processing Systems, volume 30, 2017.
[7] Songlin Yang, Bailin Wang, Yu Zhang, Yikang Shen, and Yoon Kim. Parallelizing linear transformers with the delta rule over sequence length. In Advances in Neural Information Processing Systems, 2024.
[8] Tianzhu Ye, Li Dong, Yuqing Xia, Yutao Sun, Yi Zhu, Gao Huang, and Furu Wei. Differential transformer. arXiv preprint arXiv:2410.05258, 2024.
[9] Biao Zhang and Rico Sennrich. Root mean square normalization. In Advances in Neural Information Processing Systems, volume 32, 2019.

A Strict Parameter Match Specifications
Table 2 lists the exact structural specifications and parameter counts of the models compared in this study.

B GPU Memory Usage Benchmark
Table 3 reports peak forward-pass VRAM on a con...
