
arxiv: 2602.08064 · v2 · pith:ZGLDJR3Nnew · submitted 2026-02-08 · 💻 cs.LG · cs.AI · cs.CL

SiameseNorm: Breaking the Barrier to Reconciling Pre/Post-Norm

Pith reviewed 2026-05-22 11:07 UTC · model grok-4.3

classification 💻 cs.LG · cs.AI · cs.CL
keywords SiameseNorm · Pre-Norm · Post-Norm · Transformer · normalization · training stability · residual blocks · architecture design

The pith

SiameseNorm uses a two-stream design with shared residual blocks to combine Pre-Norm stability and Post-Norm capacity.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

Transformers face a persistent trade-off where Pre-Norm ensures stable training through identity gradient paths but restricts how much the residual can be transformed, while Post-Norm allows stronger transformations at the risk of unstable gradients. Single-stream attempts to blend them have not held up across different training conditions. SiameseNorm introduces a two-stream architecture in which a Pre-Norm-like stream and a Post-Norm-like stream share the same residual blocks, so each block gets training signals from both styles at the same time. This design adds almost no extra cost and works with existing Pre-Norm training methods. Tests on language models, mixture-of-experts systems, vision transformers, and diffusion models show better results without losing stability, suggesting the approach could help build more capable models more reliably.
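To make the trade-off concrete, here is a minimal sketch of the two textbook update rules, x + F(LN(x)) for Pre-Norm and LN(x + F(x)) for Post-Norm, in plain Python. The `toy_block` function and the input vector are hypothetical stand-ins chosen for illustration; nothing here is taken from the paper's code.

```python
import math

def layer_norm(x, eps=1e-5):
    """Normalize a vector to zero mean and unit variance (no learned scale or shift)."""
    mean = sum(x) / len(x)
    var = sum((v - mean) ** 2 for v in x) / len(x)
    return [(v - mean) / math.sqrt(var + eps) for v in x]

def toy_block(x):
    """Hypothetical stand-in for an attention or feed-forward sublayer."""
    return [math.tanh(2.0 * v) for v in x]

def pre_norm_step(x):
    # x_{l+1} = x_l + F(LN(x_l)): the residual stream itself is never normalized,
    # so the identity path x_l -> x_{l+1} is left untouched.
    return [a + b for a, b in zip(x, toy_block(layer_norm(x)))]

def post_norm_step(x):
    # x_{l+1} = LN(x_l + F(x_l)): the main path is renormalized after every block.
    return layer_norm([a + b for a, b in zip(x, toy_block(x))])

def rms(x):
    """Root-mean-square magnitude of a vector."""
    return math.sqrt(sum(v * v for v in x) / len(x))

x_pre = x_post = [0.5, -1.0, 2.0, 0.25]
for _ in range(8):
    x_pre = pre_norm_step(x_pre)
    x_post = post_norm_step(x_post)

print(f"Pre-Norm stream RMS after 8 blocks:  {rms(x_pre):.3f}")
print(f"Post-Norm stream RMS after 8 blocks: {rms(x_post):.3f}")
```

In this toy setting the Pre-Norm stream keeps accumulating block outputs while the Post-Norm stream stays at unit scale, which is one informal way to picture why each new block changes the Pre-Norm residual proportionally less.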

Core claim

The long-standing tension between Pre- and Post-Norm reflects a fundamental trade-off between training stability and representational capacity. Single-stream architectures struggle to reconcile Pre-Norm's stable identity-gradient propagation with Post-Norm's normalization of the main residual path. SiameseNorm addresses this by proposing a two-stream architecture that couples Pre-Norm-like and Post-Norm-like streams through shared residual blocks, allowing each residual block to receive optimization signals from both pathways with negligible overhead. Extensive experiments on 400M and 1.3B dense language models, 15B MoE models, Vision Transformers, and Diffusion Transformers show that Siames

What carries the argument

SiameseNorm's two-stream architecture with shared residual blocks that supplies optimization signals from both Pre-Norm-like and Post-Norm-like pathways.

Load-bearing premise

That a two-stream design with shared residual blocks can deliver optimization signals from both Pre-Norm-like and Post-Norm-like pathways without introducing new instabilities or conflicts that offset the reported gains.

What would settle it

A training run on a 1.3B language model using SiameseNorm that shows no performance gain or reduced stability compared to standard Pre-Norm would disprove the claim.

read the original abstract

The long-standing tension between Pre- and Post-Norm remains an open problem in Transformer architecture, reflecting a fundamental trade-off between training stability and representational capacity. Prior attempts to combine their strengths have made progress, but often show limited robustness across training settings, restricting their broader applicability. We revisit this dilemma, showing that single-stream architectures struggle to reconcile Pre-Norm's stable identity-gradient propagation with Post-Norm's normalization of the main residual path. To address this structural tension, we propose SiameseNorm, a simple yet effective two-stream architecture that remains compatible with Pre-Norm training recipes. SiameseNorm couples Pre-Norm-like and Post-Norm-like streams through shared residual blocks, allowing each residual block to receive optimization signals from both pathways with negligible overhead. Extensive experiments on 400M and 1.3B dense language models, 15B MoE models, Vision Transformers, and Diffusion Transformers show that SiameseNorm consistently improves performance while maintaining strong training stability across architectures and modalities. Code is available at https://github.com/Qwen-Applications/SiameseNorm.
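The abstract describes the coupling only in words, so the following is one possible reading rather than the authors' implementation: two streams pass through the same block, the block output is computed once, and it is added to both streams. How the streams are merged before the block is not stated above, so the simple averaging in `two_stream_step` is a placeholder assumption, and `shared_block` with its single scalar weight is a hypothetical stand-in for a real sublayer. The released repository is the authoritative reference.

```python
import math

def layer_norm(x, eps=1e-5):
    """Normalize a vector to zero mean and unit variance (no learned scale or shift)."""
    mean = sum(x) / len(x)
    var = sum((v - mean) ** 2 for v in x) / len(x)
    return [(v - mean) / math.sqrt(var + eps) for v in x]

def shared_block(h, weight):
    """Hypothetical shared sublayer; one scalar stands in for the block parameters."""
    return [math.tanh(weight * v) for v in h]

def two_stream_step(x_post, y_pre, weight):
    # ASSUMPTION: the block input averages the normalized Pre-Norm-like stream
    # with the already-normalized Post-Norm-like stream.
    h = [0.5 * (a + b) for a, b in zip(layer_norm(y_pre), x_post)]
    f = shared_block(h, weight)  # computed once, reused by both streams
    # Pre-Norm-like stream: plain residual addition keeps the identity path.
    y_next = [a + b for a, b in zip(y_pre, f)]
    # Post-Norm-like stream: the main path is normalized after the addition.
    x_next = layer_norm([a + b for a, b in zip(x_post, f)])
    return x_next, y_next

embedding = [0.5, -1.0, 2.0, 0.25]
x_post, y_pre = layer_norm(embedding), list(embedding)
for weight in [0.8, 1.1, 0.9, 1.3]:  # one hypothetical weight per layer
    x_post, y_pre = two_stream_step(x_post, y_pre, weight)

print("Post-Norm-like stream:", [round(v, 3) for v in x_post])
print("Pre-Norm-like stream: ", [round(v, 3) for v in y_pre])
```

Because `shared_block` is evaluated once per layer and its output feeds both updates, the same weights would receive gradient through both streams during training, which is the property the abstract emphasizes.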

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Referee Report

2 major / 2 minor

Summary. The manuscript proposes SiameseNorm, a two-stream Transformer architecture that couples Pre-Norm-like and Post-Norm-like streams via shared residual blocks. This design is intended to reconcile the stability of Pre-Norm's identity-gradient path with the representational benefits of Post-Norm's normalized main path. The authors report that the approach maintains compatibility with standard Pre-Norm training recipes and delivers consistent performance gains with strong stability on dense language models (400M and 1.3B), 15B MoE models, Vision Transformers, and Diffusion Transformers.

Significance. If the central claim holds, SiameseNorm would provide a practical, low-overhead architectural fix for a persistent tension in Transformer design, with potential impact on large-scale training across modalities. Strengths include the scale of experiments (up to 15B parameters), coverage of multiple architectures and modalities, and public code release. These elements support practical significance beyond incremental empirical tuning.

major comments (2)
  1. [Section 3] Section 3 (Architecture description): The coupling of streams through identical residual blocks is presented as delivering compatible optimization signals, yet no analysis of gradient magnitudes, directions, or potential averaging effects is provided. This leaves open the possibility that the stable identity path and attenuated normalized path produce conflicting updates on shared weights, which is load-bearing for the reconciliation claim and the reported stability.
  2. [Section 4] Section 4 (Experiments): Results claim consistent improvements and strong stability across scales and modalities, but lack ablations isolating the contribution of each stream or testing under varied optimizers and initializations. Without these, it remains possible that observed gains depend on the specific training recipe rather than the two-stream structure itself.
minor comments (2)
  1. [Abstract] The abstract states 'negligible overhead' without a concrete comparison of parameter count or FLOPs relative to a standard single-stream baseline.
  2. [Figures] Figure captions and method diagrams would benefit from explicit labels distinguishing the Pre-Norm-like and Post-Norm-like pathways to improve readability.
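The "stable identity path" and "attenuated normalized path" in major comment 1 can be made explicit with the layer-wise Jacobians of the two standard single-stream schemes. The notation below ($F_l$ for the residual block, $J_{\mathrm{LN}}$ for the Jacobian of layer normalization) is introduced here for illustration and is not taken from the manuscript.

```latex
\begin{aligned}
\text{Pre-Norm:}\quad
  & x_{l+1} = x_l + F_l\big(\mathrm{LN}(x_l)\big),
  & \frac{\partial x_{l+1}}{\partial x_l}
  &= I + \frac{\partial F_l\big(\mathrm{LN}(x_l)\big)}{\partial x_l}, \\[4pt]
\text{Post-Norm:}\quad
  & x_{l+1} = \mathrm{LN}\big(x_l + F_l(x_l)\big),
  & \frac{\partial x_{l+1}}{\partial x_l}
  &= J_{\mathrm{LN}}\big(x_l + F_l(x_l)\big)
     \left( I + \frac{\partial F_l(x_l)}{\partial x_l} \right).
\end{aligned}
```

In the Pre-Norm case every layer contributes an unscaled identity term, so the backward product across layers always contains a direct path. In the Post-Norm case each factor is multiplied by $J_{\mathrm{LN}}$, so no such unscaled path survives. The referee's question is whether shared weights that receive gradient through both kinds of path see compatible updates.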

Simulated Author's Rebuttal

2 responses · 0 unresolved

We thank the referee for the constructive feedback. We address each major comment below and commit to revisions that strengthen the analysis and experimental validation without altering the core claims.

read point-by-point responses
  1. Referee: [Section 3] Section 3 (Architecture description): The coupling of streams through identical residual blocks is presented as delivering compatible optimization signals, yet no analysis of gradient magnitudes, directions, or potential averaging effects is provided. This leaves open the possibility that the stable identity path and attenuated normalized path produce conflicting updates on shared weights, which is load-bearing for the reconciliation claim and the reported stability.

    Authors: We agree that explicit analysis of gradient flow would better substantiate the compatibility of optimization signals. Although the consistent stability observed across 400M–15B models and multiple modalities provides indirect evidence against severe conflicts, we will add a dedicated subsection in the revised Section 3. This will include quantitative comparisons of gradient magnitudes and directional alignment for the shared residual blocks under both streams, along with a brief discussion of any averaging effects. revision: yes

  2. Referee: [Section 4] Section 4 (Experiments): Results claim consistent improvements and strong stability across scales and modalities, but lack ablations isolating the contribution of each stream or testing under varied optimizers and initializations. Without these, it remains possible that observed gains depend on the specific training recipe rather than the two-stream structure itself.

    Authors: We acknowledge the value of these ablations for isolating the architectural contribution. In the revised manuscript we will expand Section 4 (and supplementary material) with (i) controlled ablations that disable one stream at a time while keeping the other fixed, and (ii) additional runs using alternative optimizers and varied initialization schemes. These results will be reported at the same scales to demonstrate that performance gains are attributable to the two-stream coupling rather than the specific training recipe. revision: yes
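The gradient comparison promised in the first response could take roughly the following form: flatten the gradient that a shared block receives through each stream, then report the norms and the cosine of the angle between them. This is a generic diagnostic sketched under that assumption. The function names are hypothetical and the vectors are made-up numbers, not measurements from the paper.

```python
import math

def l2_norm(g):
    """Euclidean norm of a flattened gradient vector."""
    return math.sqrt(sum(v * v for v in g))

def cosine_similarity(g_a, g_b):
    """Cosine of the angle between two flattened gradient vectors."""
    denom = l2_norm(g_a) * l2_norm(g_b)
    if denom == 0.0:
        raise ValueError("cosine similarity is undefined for a zero gradient")
    return sum(a * b for a, b in zip(g_a, g_b)) / denom

def alignment_report(g_pre, g_post):
    """Summarize how two per-stream gradients on the same weights relate."""
    cos = cosine_similarity(g_pre, g_post)
    return {
        "norm_pre": l2_norm(g_pre),
        "norm_post": l2_norm(g_post),
        "norm_ratio": l2_norm(g_post) / l2_norm(g_pre),
        "cosine": cos,
        "conflicting": cos < 0.0,  # negative cosine: the two updates partly cancel
    }

# Illustrative values only.
g_pre = [0.30, -0.10, 0.05]
g_post = [0.03, -0.02, 0.01]
report = alignment_report(g_pre, g_post)
for key, value in report.items():
    print(f"{key}: {value}")
```

A cosine near 1 would indicate that the two pathways push the shared weights in the same direction, while a persistently negative value would support the referee's concern about conflicting updates.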

Circularity Check

0 steps flagged

No circularity: new two-stream architecture validated empirically

full rationale

The paper introduces SiameseNorm as a structural proposal: a two-stream design with shared residual blocks that supplies optimization signals from both Pre-Norm-like and Post-Norm-like pathways. The central argument rests on identifying a tension in single-stream architectures and then defining the new coupling mechanism, followed by direct experimental validation on 400M–15B models, ViTs, and diffusion models. No equations, fitted parameters, or self-citations are used to derive performance claims; the reported stability and gains are presented as outcomes of the architecture itself rather than reductions to prior inputs or definitions. The derivation chain is therefore self-contained and non-circular.

Axiom & Free-Parameter Ledger

0 free parameters · 1 axioms · 1 invented entities

The central claim rests on the unproven premise that dual streams with shared blocks transmit complementary optimization signals without new failure modes; no free parameters or invented physical entities are introduced.

axioms (1)
  • domain assumption: Single-stream architectures cannot simultaneously achieve Pre-Norm gradient stability and Post-Norm main-path normalization.
    Stated directly in the abstract as the structural tension motivating the two-stream design.
invented entities (1)
  • SiameseNorm two-stream architecture (no independent evidence)
    purpose: To reconcile Pre-Norm and Post-Norm benefits through shared residual blocks
    New architectural construct introduced by the paper; independent evidence is limited to the reported experiments.

pith-pipeline@v0.9.0 · 5752 in / 1203 out tokens · 41082 ms · 2026-05-22T11:07:58.955862+00:00 · methodology



Forward citations

Cited by 5 Pith papers

Reviewed papers in the Pith corpus that reference this work. Sorted by Pith novelty score.

  1. Rethinking Cross-Layer Information Routing in Diffusion Transformers

    cs.CV 2026-05 conditional novelty 6.0

    DAR replaces residual addition in DiTs with learnable timestep-adaptive non-incremental aggregation of sublayer outputs, improving FID by 2.11 on ImageNet 256x256 and accelerating convergence by 8.75x.

  2. Fill the GAP: A Granular Alignment Paradigm for Visual Reasoning in Multimodal Large Language Models

    cs.CV 2026-05 unverdicted novelty 6.0

    GAP introduces three-level alignment for visual latent reasoning in MLLMs, achieving top aggregate perception and reasoning performance on Qwen2.5-VL 7B by addressing decoder-input norm mismatch.

  3. Fill the GAP: A Granular Alignment Paradigm for Visual Reasoning in Multimodal Large Language Models

    cs.CV 2026-05 unverdicted novelty 6.0

    GAP aligns visual latent reasoning in MLLMs at feature, context, and capacity levels, yielding best aggregate perception/reasoning scores on Qwen2.5-VL 7B among supervised variants while showing task-relevant signal i...

  4. Fill the GAP: A Granular Alignment Paradigm for Visual Reasoning in Multimodal Large Language Models

    cs.CV 2026-05 unverdicted novelty 5.0

    GAP aligns visual latent reasoning in MLLMs at feature, context, and capacity levels, yielding the best aggregate perception and reasoning scores on Qwen2.5-VL 7B among supervised variants while providing task-relevan...

  5. Attention Residuals

    cs.CL 2026-03 unverdicted novelty 5.0

    Attention Residuals replaces fixed residual summation with input-dependent softmax attention over preceding layers, and a blocked variant is shown to improve uniformity and downstream performance in a 48B-parameter mo...

Reference graph

Works this paper leans on

30 extracted references · 30 canonical work pages · cited by 3 Pith papers · 17 internal anchors

  1. [1]

    Layer Normalization

    Ba, J. L., Kiros, J. R., and Hinton, G. E. Layer normalization. arXiv preprint arXiv:1607.06450,

  2. [2]

    Longformer: The Long-Document Transformer

    Beltagy, I., Peters, M. E., and Cohan, A. Longformer: The long-document transformer. arXiv preprint arXiv:2004.05150,

  3. [3]

    Language Models are Few-Shot Learners

    Brown, T., Mann, B., Ryder, N., Subbiah, M., Kaplan, J. D., Dhariwal, P., Neelakantan, A., Shyam, P., Sastry, G., Askell, A., et al. Language models are few-shot learners. Advances in neural information processing systems, 33: 1877–1901,

  4. [4]

    MiniMax-M1: Scaling Test-Time Compute Efficiently with Lightning Attention

    Chen, A., Li, A., Gong, B., Jiang, B., Fei, B., Yang, B., Shan, B., Yu, C., Wang, C., Zhu, C., et al. Minimax-m1: Scaling test-time compute efficiently with lightning attention. arXiv preprint arXiv:2506.13585,

  5. [5]

    Generating Long Sequences with Sparse Transformers

    URL https://arxiv.org/abs/1904.10509.

    Choromanski, K., Likhosherstov, V., Dohan, D., Song, X., Gane, A., Sarlos, T., Hawkins, P., Davis, J., Mohiuddin, A., Kaiser, L., et al. Rethinking attention with performers. arXiv preprint arXiv:2009.14794,

  6. [6]

    Think you have Solved Question Answering? Try ARC, the AI2 Reasoning Challenge

    Clark, P., Cowhey, I., Etzioni, O., Khot, T., Sabharwal, A., Schoenick, C., and Tafjord, O. Think you have solved question answering? Try ARC, the AI2 reasoning challenge. arXiv preprint arXiv:1803.05457,

  7. [7]

    An Image is Worth 16x16 Words: Transformers for Image Recognition at Scale

    Dosovitskiy, A. An image is worth 16x16 words: Transformers for image recognition at scale. arXiv preprint arXiv:2010.11929,

  8. [8]

    Transformer Feed-Forward Layers Are Key-Value Memories

    Geva, M., Schuster, R., Berant, J., and Levy, O. Transformer feed-forward layers are key-value memories. In Proceedings of the 2021 Conference on Empirical Methods in Natural Language Processing, pp. 5484–5495,

  9. [9]

    Mamba: Linear-Time Sequence Modeling with Selective State Spaces

    Gu, A. and Dao, T. Mamba: Linear-time sequence modeling with selective state spaces. arXiv preprint arXiv:2312.00752,

  10. [10]

    Step by Step Network

    Han, D., Ye, T., Xia, Z., Chen, K., Wang, Y., Chen, H., and Huang, G. Step by step network. arXiv preprint arXiv:2511.14329,

  11. [11]

    Gaussian Error Linear Units (GELUs)

    Hendrycks, D. Gaussian error linear units (GELUs). arXiv preprint arXiv:1606.08415,

  12. [12]

    Query-Key Normalization for Transformers

    Henry, A., Dachapally, P. R., Pawar, S. S., and Chen, Y. Query-key normalization for transformers. In Findings of the Association for Computational Linguistics: EMNLP 2020, pp. 4246–4253,

  13. [13]

    Kim, J., Lee, B., Park, C., Oh, Y., Kim, B., Yoo, T., Shin, S., Han, D., Shin, J., and Yoo, K. M. Peri-LN: Revisiting normalization layer in the transformer architecture. arXiv preprint arXiv:2502.02732,

  14. [14]

    Reformer: The Efficient Transformer

    Kitaev, N., Kaiser, Ł., and Levskaya, A. Reformer: The efficient transformer. arXiv preprint arXiv:2001.04451,

  15. [15]

    DeepSeek-V3 Technical Report

    Liu, A., Feng, B., Xue, B., Wang, B., Wu, B., Lu, C., Zhao, C., Deng, C., Zhang, C., Ruan, C., et al. DeepSeek-V3 technical report. arXiv preprint arXiv:2412.19437,

  16. [16]

    Understanding the Difficulty of Training Transformers

    Liu, L., Liu, X., Gao, J., Chen, W., and Han, J. Understanding the difficulty of training transformers. In Proceedings of the 2020 Conference on Empirical Methods in Natural Language Processing (EMNLP), pp. 5747–5763,

  17. [17]

    Can a suit of armor conduct electricity? a new dataset for open book question answering

    Mihaylov, T., Clark, P., Khot, T., and Sabharwal, A. Can a suit of armor conduct electricity? A new dataset for open book question answering. In Proceedings of the 2018 Conference on Empirical Methods in Natural Language Processing, pp. 2381–2391,

  18. [18]

    Gated Attention for Large Language Models: Non-linearity, Sparsity, and Attention-Sink-Free

    Qiu, Z., Wang, Z., Zheng, B., Huang, Z., Wen, K., Yang, S., Men, R., Yu, L., Huang, F., Huang, S., et al. Gated attention for large language models: Non-linearity, sparsity, and attention-sink-free. arXiv preprint arXiv:2505.06708,

  19. [19]

    GLU Variants Improve Transformer

    Shazeer, N. GLU variants improve transformer. arXiv preprint arXiv:2002.05202,

  20. [20]

    Highway Networks

    Srivastava, R. K., Greff, K., and Schmidhuber, J. Highway networks. arXiv preprint arXiv:1505.00387,

  21. [21]

    The Curse of Depth in Large Language Models

    Sun, W., Song, X., Li, P., Yin, L., Zheng, Y., and Liu, S. The curse of depth in large language models. arXiv preprint arXiv:2502.05795,

  22. [22]

    Kimi Linear: An Expressive, Efficient Attention Architecture

    Team, K., Zhang, Y., Lin, Z., Yao, X., Hu, J., Meng, F., Liu, C., Men, X., Yang, S., Li, Z., et al. Kimi Linear: An expressive, efficient attention architecture. arXiv preprint arXiv:2510.26692,

  23. [23]

    LLaMA: Open and Efficient Foundation Language Models

    Touvron, H., Lavril, T., Izacard, G., Martinet, X., Lachaux, M.-A., Lacroix, T., Rozière, B., Goyal, N., Hambro, E., Azhar, F., et al. LLaMA: Open and efficient foundation language models. arXiv preprint arXiv:2302.13971, 2023a. Touvron, H., Martin, L., Stone, K., Albert, P., Almahairi, A., Babaei, Y., Bashlykov, N., Batra, S., Bhargava, P., Bhosale, S., et a...

  24. [24]

    ResiDual: Transformer with Dual Residual Connections

    Xie, S., Zhang, H., Guo, J., Tan, X., Bian, J., Awadalla, H. H., Menezes, A., Qin, T., and Yan, R. ResiDual: Transformer with dual residual connections. arXiv preprint arXiv:2304.14802,

  25. [25]

    mHC: Manifold-Constrained Hyper-Connections

    Xie, Z., Wei, Y., Cao, H., Zhao, C., Deng, C., Li, J., Dai, D., Gao, H., Chang, J., Zhao, L., et al. mHC: Manifold-constrained hyper-connections. arXiv preprint arXiv:2512.24880,

  26. [26]

    Qwen3 Technical Report

    Yang, A., Li, A., Yang, B., Zhang, B., Hui, B., Zheng, B., Yu, B., Gao, C., Huang, C., Lv, C., et al. Qwen3 technical report. arXiv preprint arXiv:2505.09388,

  27. [27]

    Hyper-connections

    Zhu, D., Huang, H., Huang, Z., Zeng, Y., Mao, Y., Wu, B., Min, Q., and Zhou, X. Hyper-connections. In The Thirteenth International Conference on Learning Representations, 2025a. Zhu, D., Huang, H., Zhou, J., Huang, Z., Zeng, Y., Wu, B., Min, Q., and Zhou, X. Frac-connections: Fractional extension of hyper-connections. arXiv preprint arXiv:2503.14125, 20...

  28. [28]

    Appendix A.1

    A. Appendix

    A.1. Comparison with Existing Multi-path Designs

    [Figure 7 | Architecture of ResiDual (Xie et al., 2023).]

    ResiDual (Xie et al., 2023). The work most structurally similar to ours is ResiDual (Xie et al., 2023), as illustrated in Fig

  29. [29]

    However, a fundamental difference lies in the topology: in ResiDual, the Pre-Norm stream (Y-stream) is not connected to the input of the residual block. This implies that the Y-stream acts as a global shortcut that aggregates the output of each residual block directly toward the final output, rather than an active participant in the iterative transformation proce...

  30. [30]

    It should be noted that the learning rate and the total number of training tokens vary across our different experimental setups.

    Table 4 | Detailed Experimental Settings for OLMo-1.3B

    Category: Model architecture
      Number of Layers: 16
      Hidden Size: 2048
      Attention Heads: 16
      Key-Value Heads: 16
      FFN Intermediate Size: 8192
      Activation Function: Swi...