Forget Attention: Importance-Aware Attention Is All You Need

Suhyeong Shin; Yeongwook Yang

arxiv: 2606.02332 · v2 · pith:RYNPUDA6new · submitted 2026-06-01 · 💻 cs.AI · cs.CL· cs.LG

Forget Attention: Importance-Aware Attention Is All You Need

Suhyeong Shin , Yeongwook Yang This is my paper

Pith reviewed 2026-06-28 14:36 UTC · model grok-4.3

classification 💻 cs.AI cs.CLcs.LG

keywords SISASSM-informed attentionscore-level fusionhybrid language modelsstate space modelsattention mechanismsLAMBADAneedle-in-a-haystack

0 comments

The pith

SISA adds an SSM-derived importance term inside attention scores and computes the full operation as one standard SDPA call on augmented query and key vectors.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper seeks to demonstrate that state space model importance signals can be merged with attention at the score computation level instead of being kept in separate blocks or heads. By deriving an importance value from an SSM and folding it directly into the attention score through vector augmentation, the approach lets global retrieval and sequential prioritization influence each other inside a single attention pass. A reader would care because prior hybrids isolate the two mechanisms, preventing them from shaping the same computation. If the integration works as described, it supplies a third route for building hybrids that still runs on unmodified attention kernels.

Core claim

SISA integrates an importance term derived from state space models directly into the softmax attention computation by augmenting the query and key vectors, allowing the entire operation to be performed as a single scaled dot-product attention call without recurrent states or custom kernels. This score-level fusion enables the model to prioritize important tokens while maintaining global retrieval capability. At 152M parameters trained on 5B tokens, SISA reaches 17.3% on LAMBADA-greedy versus 13.9% for a Transformer and 15.5% for Mamba-3, while attaining 100% needle-in-a-haystack accuracy from step 1K, seven times faster than the Transformer baseline.

What carries the argument

SISA, the direct insertion of an SSM-derived importance term into attention scores via augmentation of query and key vectors before a single SDPA operation.

If this is right

At the tested scales, SISA produces higher LAMBADA-greedy accuracy than both a standard Transformer and a Mamba-3 baseline.
SISA reaches perfect needle-in-a-haystack retrieval from the first thousand steps, converging seven times faster than a Transformer.
At 369M parameters, SISA keeps perfect needle-in-a-haystack performance while remaining compatible with stock SDPA execution.
The method requires no recurrent state and no custom kernel, preserving the execution path of ordinary scaled dot-product attention.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

Score-level fusion could be tested by deriving the importance term from sources other than SSMs to see whether the performance pattern generalizes.
The single-call property might allow direct substitution into existing Transformer training pipelines without kernel modifications.
If the importance term scales with model size, the approach could reduce reliance on block-level or head-level hybrid designs in larger models.

Load-bearing premise

An SSM-derived importance signal can be pre-computed or jointly computed so that it improves attention scores without new hyperparameters, extra passes, or changes that affect numerical stability or scaling behavior.

What would settle it

Zero out the importance term in the augmented vectors and measure whether LAMBADA and needle-in-a-haystack performance fall back to the levels of an unmodified Transformer at the same scale and training budget.

Figures

Figures reproduced from arXiv: 2606.02332 by Suhyeong Shin, Yeongwook Yang.

**Figure 2.** Figure 2: Compute graph of a SISA layer. The attention path produces Q, K, V with positional RoPE; [PITH_FULL_IMAGE:figures/full_fig_p005_2.png] view at source ↗

**Figure 3.** Figure 3: NIAH convergence at 152M. SISA reaches 100% from step 1K, while Transformer and the [PITH_FULL_IMAGE:figures/full_fig_p007_3.png] view at source ↗

**Figure 4.** Figure 4: Scaling behavior across model sizes. SISA peaks on LAMBADA at 152M; Mamba-3 [PITH_FULL_IMAGE:figures/full_fig_p008_4.png] view at source ↗

**Figure 5.** Figure 5: LAMBADA trajectory at 152M. SISA reaches Transformer’s final accuracy by step 2K and [PITH_FULL_IMAGE:figures/full_fig_p015_5.png] view at source ↗

**Figure 6.** Figure 6: ds ablation. Best is ds = 64 at 50M, ds = 16 at 152M, and (among the retrain candidates) ds = 128 at 369M—the optimum is non-monotonic in scale. These three ratio regimes manifest as distinct ablation behaviors in Section Q. Why is ds = 16 optimal at 152M but ds = 128 at 369M? FFN share alone does not explain it: 152M (55.9%) and 369M (58.8%) are similar yet have opposite optima. Two additional factors mat… view at source ↗

read the original abstract

Combining attention's global retrieval with the sequential importance signal of state space models (SSMs) is the open challenge of hybrid language modeling. Transformers see everywhere but cannot prioritize; SSMs know what matters but cannot revisit. Existing hybrids -- Jamba (block level) and Hymba (head level) -- place the two in separate compartments, so neither informs the other during the attention computation itself. We propose SISA (SSM-Informed Softmax Attention), which adds an SSM-derived importance term directly inside the attention score and realizes the full operation as a single SDPA call on augmented query/key vectors -- no recurrent state, no custom kernel. At 152M / 5B tokens, SISA reaches LAMBADA-greedy 17.3% (vs. Transformer 13.9 and Mamba-3 15.5) and attains NIAH 100% from step 1K, 7x faster than Transformer's retrieval convergence; at 369M, Mamba-3 leads LAMBADA while SISA preserves perfect NIAH and stock-SDPA execution. SISA thus defines a third design axis for SSM-attention hybrids -- score-level fusion -- beyond the block-level and head-level paradigms that have dominated the field.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Desk Editor's Note private letter to a colleague

SISA claims score-level fusion by injecting an SSM importance term into attention scores for a single SDPA call, but the abstract gives no training details or stats so the gains stay hard to judge.

read the letter

The main point is that this paper introduces SISA as a way to add an SSM-derived importance signal straight into the attention scores and run the whole thing as one standard SDPA call on augmented query and key vectors. That positions it as a third option beyond the block-level mixing in Jamba and the head-level approach in Hymba.

The construction itself is straightforward and keeps execution inside existing attention kernels without recurrent states or custom code. At the 152M scale the reported LAMBADA greedy score beats both the transformer baseline and Mamba-3, and needle-in-a-haystack retrieval reaches 100 percent from the 1K step mark, seven times faster than the transformer. Those numbers are concrete and the single-call claim is a practical advantage if it holds.

The soft spot is the missing experimental backbone. The abstract lists benchmark results but supplies no training recipe, baseline implementation details, seed variation checks, or significance tests. At 369M parameters Mamba-3 actually leads on LAMBADA while SISA holds the retrieval edge, so the trade-offs are not yet mapped clearly. It is also unclear whether pre-computing or jointly computing the importance term adds passes, hyperparameters, or numerical side effects that would affect real scaling.

This is aimed at researchers building hybrid architectures who want the two components to interact inside the attention computation rather than in separate blocks or heads. A reader already experimenting with SSM-attention combinations could test the augmented-vector idea quickly.

If the full paper supplies the training code, ablations on the importance injection, and evidence that the single SDPA call stays stable, it deserves peer review. The abstract alone is too thin to decide whether the score-level axis delivers reliable gains.

Referee Report

2 major / 1 minor

Summary. The paper proposes SISA (SSM-Informed Softmax Attention) as a score-level fusion method for SSM-attention hybrids. It augments query and key vectors with an SSM-derived importance term so that the full attention operation executes as a single standard SDPA call, without recurrent state or custom kernels. At 152M parameters trained on 5B tokens, SISA reports LAMBADA-greedy accuracy of 17.3% (vs. 13.9% Transformer, 15.5% Mamba-3) and perfect NIAH retrieval from step 1K (7x faster convergence than Transformer); at 369M it preserves perfect NIAH while Mamba-3 leads on LAMBADA. The work positions this as a third design axis beyond existing block-level and head-level hybrids.

Significance. If the reported gains are reproducible, the contribution would be meaningful: it supplies a lightweight, kernel-free mechanism for injecting sequential importance directly into attention scores. This could simplify hybrid architecture design and improve retrieval without sacrificing the global access of attention. The absence of any training protocol, baseline code, or statistical controls, however, prevents assessment of whether the numerical improvements are robust or merely artifacts of a single run.

major comments (2)

[Abstract] Abstract: the central performance claims (LAMBADA 17.3%, NIAH 100% from step 1K, 7x faster convergence) rest on benchmark numbers supplied without any description of training hyperparameters, optimizer settings, baseline re-implementations, number of random seeds, or statistical tests. These omissions are load-bearing for the empirical superiority argument.
[Method] Method description: the assertion that an SSM-derived importance signal can be obtained and injected via simple Q/K augmentation while preserving exact SDPA semantics, scaling, and numerical stability is stated without an explicit algorithm, pseudocode, or verification that no additional passes or hyper-parameters are introduced. This detail is required to substantiate the “stock-SDPA execution” claim.

minor comments (1)

[Abstract] Abstract: the parenthetical comparison omits the percent sign after the Transformer and Mamba-3 numbers; consistent formatting would improve readability.

Simulated Author's Rebuttal

2 responses · 0 unresolved

We thank the referee for highlighting the need for greater transparency in both the experimental protocol and the method description. We will revise the manuscript to incorporate the requested details, which will strengthen the reproducibility of the work.

read point-by-point responses

Referee: [Abstract] Abstract: the central performance claims (LAMBADA 17.3%, NIAH 100% from step 1K, 7x faster convergence) rest on benchmark numbers supplied without any description of training hyperparameters, optimizer settings, baseline re-implementations, number of random seeds, or statistical tests. These omissions are load-bearing for the empirical superiority argument.

Authors: We agree that the reported performance numbers require supporting experimental details to allow proper evaluation. In the revised manuscript we will add a dedicated Experimental Setup section that specifies all training hyperparameters (learning rate schedule, batch size, sequence length, optimizer), the exact re-implementations of the Transformer and Mamba-3 baselines, the number of random seeds, and any statistical tests or variance reporting used. This addition will directly address the concern that the current claims rest on insufficiently documented runs. revision: yes
Referee: [Method] Method description: the assertion that an SSM-derived importance signal can be obtained and injected via simple Q/K augmentation while preserving exact SDPA semantics, scaling, and numerical stability is stated without an explicit algorithm, pseudocode, or verification that no additional passes or hyper-parameters are introduced. This detail is required to substantiate the “stock-SDPA execution” claim.

Authors: We accept that an explicit algorithmic description is necessary to substantiate the claim of stock-SDPA execution. The revised manuscript will include a new Algorithm box (or pseudocode listing) that details (1) how the SSM importance term is computed from the input sequence, (2) the precise concatenation/augmentation of the query and key vectors, and (3) confirmation that the resulting operation remains a single call to standard scaled-dot-product attention with no extra forward passes or introduced hyperparameters. This will make the “kernel-free” and “exact SDPA semantics” properties verifiable. revision: yes

Circularity Check

0 steps flagged

No significant circularity; proposal is an empirical architectural change

full rationale

The paper presents SISA as a direct implementation: an SSM-derived importance term is added inside attention scores and realized via augmented Q/K vectors in one standard SDPA call. The abstract and description contain no equations, fitted parameters renamed as predictions, self-citations used as load-bearing uniqueness theorems, or ansatzes smuggled from prior author work. No derivation chain reduces the claimed improvement (LAMBADA/NIAH gains) to a self-referential definition or statistical forcing. The construction is presented as a vector-level fusion choice that preserves existing SDPA semantics, making the central claim self-contained against external benchmarks rather than circular by construction.

Axiom & Free-Parameter Ledger

0 free parameters · 0 axioms · 0 invented entities

Abstract supplies no information on free parameters, background axioms, or newly postulated entities.

pith-pipeline@v0.9.1-grok · 5756 in / 1137 out tokens · 23395 ms · 2026-06-28T14:36:35.534872+00:00 · methodology

discussion (0)

Reference graph

Works this paper leans on

35 extracted references · 20 canonical work pages · 8 internal anchors

[1]

Vaswani, A., et al. (2017). Attention is all you need.NeurIPS

2017
[2]

Mamba: Linear-Time Sequence Modeling with Selective State Spaces

Gu, A. & Dao, T. (2023). Mamba: Linear-time sequence modeling with selective state spaces. arXiv:2312.00752

work page internal anchor Pith review Pith/arXiv arXiv 2023
[3]

Dao, T. & Gu, A. (2024). Transformers are SSMs.ICML. arXiv:2405.21060. 9

work page internal anchor Pith review Pith/arXiv arXiv 2024
[4]

Y ., Chen, B., Wang, C., Bick, A., Kolter, J

Lahoti, A., Li, K. Y ., Chen, B., Wang, C., Bick, A., Kolter, J. Z., Dao, T., & Gu, A. (2026). Mamba-3: Improved Sequence Modeling using State Space Principles.ICLR. arXiv:2603.15569

work page arXiv 2026
[5]

Lieber, O., et al. (2024). Jamba. arXiv:2403.19887

work page internal anchor Pith review Pith/arXiv arXiv 2024
[6]

Ren, L., et al. (2024). Samba.ICLR 2025. arXiv:2406.07522

work page arXiv 2024
[7]

Dong, X., Fu, Y ., Diao, S., Byeon, W., Chen, Z., et al. (2024). Hymba: A Hybrid-head Architecture for Small Language Models.ICLR 2025. arXiv:2411.13676

work page arXiv 2024
[8]

Zuo, J., et al. (2025). Falcon-H1: A Family of Hybrid-Head Language Models Redefining Efficiency and Performance. arXiv:2507.22448

work page arXiv 2025
[9]

Griffin: Mixing Gated Linear Recurrences with Local Attention for Efficient Language Models

De, S., Smith, S. L., Fernando, A., et al. (2024). Griffin: Mixing Gated Linear Recurrences with Local Attention for Efficient Language Models. arXiv:2402.19427

work page internal anchor Pith review Pith/arXiv arXiv 2024
[10]

Train Short, Test Long: Attention with Linear Biases Enables Input Length Extrapolation

Press, O., Smith, N. A., & Lewis, M. (2022). Train Short, Test Long: Attention with Linear Biases Enables Input Length Extrapolation.ICLR. arXiv:2108.12409

work page internal anchor Pith review Pith/arXiv arXiv 2022
[11]

Raffel, C., Shazeer, N., Roberts, A., Lee, K., Narang, S., Matena, M., Zhou, Y ., Li, W., & Liu, P. J. (2020). Exploring the Limits of Transfer Learning with a Unified Text-to-Text Transformer. JMLR, 21(140), 1–67

2020
[12]

Zheng, C., Gao, Y ., Shi, H., Huang, M., Li, J., Xiong, J., Ren, X., Ng, M., Jiang, X., Li, Z., & Li, Y . (2024). DAPE: Data-Adaptive Positional Encoding for Length Extrapolation.NeurIPS. arXiv:2405.14722

work page arXiv 2024
[13]

O., & Courville, A

Lin, Z., Nikishin, E., He, X. O., & Courville, A. (2025). Forgetting Transformer: Softmax Attention with a Forget Gate.ICLR. arXiv:2503.02130

work page arXiv 2025
[14]

Ramapuram, J., Danieli, F., Dhekane, E., Weers, F., Busbridge, D., Ablin, P., Likhomanenko, T., Digani, J., Gu, Z., Shidani, A., & Webb, R. (2024). Theory, Analysis, and Best Practices for Sigmoid Self-Attention.ICLR 2025. arXiv:2409.04431

work page arXiv 2024
[15]

R., Hestness, J., & Dey, N

Soboleva, D., Al-Khateeb, F., Myers, R., Steeves, J. R., Hestness, J., & Dey, N. (2023). SlimPajama: A 627B token cleaned and deduplicated version of RedPajama.Cerebras Systems

2023
[16]

Hoffmann, J., et al. (2022). Training compute-optimal large language models. arXiv:2203.15556

work page internal anchor Pith review Pith/arXiv arXiv 2022
[17]

N., Bernardi, R., Pezzelle, S., Baroni, M., Boleda, G., & Fernández, R

Paperno, D., Kruszewski, G., Lazaridou, A., Pham, Q. N., Bernardi, R., Pezzelle, S., Baroni, M., Boleda, G., & Fernández, R. (2016). The LAMBADA Dataset: Word Prediction Requiring a Broad Discourse Context.ACL

2016
[18]

Zellers, R., Holtzman, A., Bisk, Y ., Farhadi, A., & Choi, Y . (2019). HellaSwag: Can a Machine Really Finish Your Sentence?ACL

2019
[19]

Clark, P., Cowhey, I., Etzioni, O., Khot, T., Sabharwal, A., Schoenick, C., & Tafjord, O. (2018). Think you have Solved Question Answering? Try ARC, the AI2 Reasoning Challenge. arXiv:1803.05457

work page internal anchor Pith review Pith/arXiv arXiv 2018
[20]

Sakaguchi, K., Le Bras, R., Bhagavatula, C., & Choi, Y . (2020). WinoGrande: An Adversarial Winograd Schema Challenge at Scale.AAAI

2020
[21]

Glorioso, P., Anthony, Q., Tokpanov, Y ., Whittington, J., Pilault, J., Ibrahim, A., & Millidge, B. (2024a). Zamba: A Compact 7B SSM Hybrid Model. arXiv:2405.16712

work page arXiv
[22]

Glorioso, P., Anthony, Q., Tokpanov, Y ., Golubeva, A., Shyam, V ., Whittington, J., Pilault, J., & Millidge, B. (2024b). The Zamba2 Suite: Technical Report. arXiv:2411.15242

work page arXiv
[23]

NVIDIA. (2025). Nemotron-H: A Family of Accurate and Efficient Hybrid Mamba-Transformer Models. arXiv:2504.03624

work page arXiv 2025
[24]

L., et al

Botev, A., De, S., Smith, S. L., et al. (2024). RecurrentGemma: Moving Past Transformers for Efficient Open Language Models. arXiv:2404.07839. 10

work page arXiv 2024
[25]

Waleffe, R., Byeon, W., Riach, D., Norick, B., Korthikanti, V ., et al. (2024). An Empirical Study of Mamba-based Language Models. arXiv:2406.07887

work page internal anchor Pith review Pith/arXiv arXiv 2024
[26]

IBM, Princeton, CMU, & UIUC. (2024). Bamba: Inference-Efficient Hybrid Mamba2 Model. Hugging Face release, December 18, 2024.https://huggingface.co/blog/bamba. A Per-Layer Parameter Budget Table 9: Per-layer parameter comparison at 152M. Component Transformer SISA WQ +W K +W V +W O 2,359,296 2,359,296 SSM projections (WB,W C ,w α,W θ, λ) 0 746,508 SwiGLU ...

2024
[27]

SDPA scale.Even though the augmented Q/K have dimension dh +d s, the SDPA scale must be 1/√dh,not 1/√dh +d s. Using PyTorch F.scaled_dot_product_attention’s default 1/√dh +d s breaks the equivalence in the proposition of Section 3.3, and the SISA score is no longer computed correctly: ˆQ⊤ i ˆKj√dh = q⊤ i kj√dh +λ· ¯C⊤ i ¯Bj (correct), ˆQ⊤ i ˆKj√dh +d s ̸=...
[28]

The minimax offset c= (max t gt + mint gt)/2 usually keeps |g−c| small, but during early-warmup spikes of α it may grow temporarily

bf16 numerical stability.If egi−c exceeds the bf16 max (65,504), it produces NaN. The minimax offset c= (max t gt + mint gt)/2 usually keeps |g−c| small, but during early-warmup spikes of α it may grow temporarily. We clamp g−c to [−11,11] (e11 ≈59,874<65,504 ). The clamp does not activate in normal training and has no effect on optimization
[29]

Vector bias at score level

λ in fp32. λraw must be kept in fp32 to train. In bf16, the smallest representable change near −1.0 is 0.0078, far larger than the ∼10 −5 AdamW updates, so λ never moves from its initialization. Quantitative impact is reported in Appendix M. 11 D Mamba-3 Implementation We use the official Mamba-3 implementation now included in the mamba-ssm package. Earli...

work page arXiv 2048
[30]

Intermediate values such as ds = 96 would refine the location of the 152M valley (currently ds = 16) and the 369M upward edge (currentlyd s = 128)

Intermediate ds values.The U-shape was observed for ds ∈ {16,32,64,128} . Intermediate values such as ds = 96 would refine the location of the 152M valley (currently ds = 16) and the 369M upward edge (currentlyd s = 128)
[31]

Scale-adaptive ds.Replacing the manual choice of ds with a learned mechanism (per-layer ds or sparse SSM channels)
[32]

Validation at1B+scale and ad s scaling law
[33]

Long-context training (with RoPE scaling). 19
[34]

Per-layer / per-head distribution ofλand task-specific patterns
[35]

SISA-2 (follow-up).Extending score-level fusion in two directions: (a) sigmoid-based SISA to address the softmax dilution observed in Section 6.2, and (b) deeper FFN–SSM integration toward a fully fused architecture without a separate block. 20

[1] [1]

Vaswani, A., et al. (2017). Attention is all you need.NeurIPS

2017

[2] [2]

Mamba: Linear-Time Sequence Modeling with Selective State Spaces

Gu, A. & Dao, T. (2023). Mamba: Linear-time sequence modeling with selective state spaces. arXiv:2312.00752

work page internal anchor Pith review Pith/arXiv arXiv 2023

[3] [3]

Dao, T. & Gu, A. (2024). Transformers are SSMs.ICML. arXiv:2405.21060. 9

work page internal anchor Pith review Pith/arXiv arXiv 2024

[4] [4]

Y ., Chen, B., Wang, C., Bick, A., Kolter, J

Lahoti, A., Li, K. Y ., Chen, B., Wang, C., Bick, A., Kolter, J. Z., Dao, T., & Gu, A. (2026). Mamba-3: Improved Sequence Modeling using State Space Principles.ICLR. arXiv:2603.15569

work page arXiv 2026

[5] [5]

Lieber, O., et al. (2024). Jamba. arXiv:2403.19887

work page internal anchor Pith review Pith/arXiv arXiv 2024

[6] [6]

Ren, L., et al. (2024). Samba.ICLR 2025. arXiv:2406.07522

work page arXiv 2024

[7] [7]

Dong, X., Fu, Y ., Diao, S., Byeon, W., Chen, Z., et al. (2024). Hymba: A Hybrid-head Architecture for Small Language Models.ICLR 2025. arXiv:2411.13676

work page arXiv 2024

[8] [8]

Zuo, J., et al. (2025). Falcon-H1: A Family of Hybrid-Head Language Models Redefining Efficiency and Performance. arXiv:2507.22448

work page arXiv 2025

[9] [9]

Griffin: Mixing Gated Linear Recurrences with Local Attention for Efficient Language Models

De, S., Smith, S. L., Fernando, A., et al. (2024). Griffin: Mixing Gated Linear Recurrences with Local Attention for Efficient Language Models. arXiv:2402.19427

work page internal anchor Pith review Pith/arXiv arXiv 2024

[10] [10]

Train Short, Test Long: Attention with Linear Biases Enables Input Length Extrapolation

Press, O., Smith, N. A., & Lewis, M. (2022). Train Short, Test Long: Attention with Linear Biases Enables Input Length Extrapolation.ICLR. arXiv:2108.12409

work page internal anchor Pith review Pith/arXiv arXiv 2022

[11] [11]

Raffel, C., Shazeer, N., Roberts, A., Lee, K., Narang, S., Matena, M., Zhou, Y ., Li, W., & Liu, P. J. (2020). Exploring the Limits of Transfer Learning with a Unified Text-to-Text Transformer. JMLR, 21(140), 1–67

2020

[12] [12]

Zheng, C., Gao, Y ., Shi, H., Huang, M., Li, J., Xiong, J., Ren, X., Ng, M., Jiang, X., Li, Z., & Li, Y . (2024). DAPE: Data-Adaptive Positional Encoding for Length Extrapolation.NeurIPS. arXiv:2405.14722

work page arXiv 2024

[13] [13]

O., & Courville, A

Lin, Z., Nikishin, E., He, X. O., & Courville, A. (2025). Forgetting Transformer: Softmax Attention with a Forget Gate.ICLR. arXiv:2503.02130

work page arXiv 2025

[14] [14]

Ramapuram, J., Danieli, F., Dhekane, E., Weers, F., Busbridge, D., Ablin, P., Likhomanenko, T., Digani, J., Gu, Z., Shidani, A., & Webb, R. (2024). Theory, Analysis, and Best Practices for Sigmoid Self-Attention.ICLR 2025. arXiv:2409.04431

work page arXiv 2024

[15] [15]

R., Hestness, J., & Dey, N

Soboleva, D., Al-Khateeb, F., Myers, R., Steeves, J. R., Hestness, J., & Dey, N. (2023). SlimPajama: A 627B token cleaned and deduplicated version of RedPajama.Cerebras Systems

2023

[16] [16]

Hoffmann, J., et al. (2022). Training compute-optimal large language models. arXiv:2203.15556

work page internal anchor Pith review Pith/arXiv arXiv 2022

[17] [17]

N., Bernardi, R., Pezzelle, S., Baroni, M., Boleda, G., & Fernández, R

Paperno, D., Kruszewski, G., Lazaridou, A., Pham, Q. N., Bernardi, R., Pezzelle, S., Baroni, M., Boleda, G., & Fernández, R. (2016). The LAMBADA Dataset: Word Prediction Requiring a Broad Discourse Context.ACL

2016

[18] [18]

Zellers, R., Holtzman, A., Bisk, Y ., Farhadi, A., & Choi, Y . (2019). HellaSwag: Can a Machine Really Finish Your Sentence?ACL

2019

[19] [19]

Clark, P., Cowhey, I., Etzioni, O., Khot, T., Sabharwal, A., Schoenick, C., & Tafjord, O. (2018). Think you have Solved Question Answering? Try ARC, the AI2 Reasoning Challenge. arXiv:1803.05457

work page internal anchor Pith review Pith/arXiv arXiv 2018

[20] [20]

Sakaguchi, K., Le Bras, R., Bhagavatula, C., & Choi, Y . (2020). WinoGrande: An Adversarial Winograd Schema Challenge at Scale.AAAI

2020

[21] [21]

Glorioso, P., Anthony, Q., Tokpanov, Y ., Whittington, J., Pilault, J., Ibrahim, A., & Millidge, B. (2024a). Zamba: A Compact 7B SSM Hybrid Model. arXiv:2405.16712

work page arXiv

[22] [22]

Glorioso, P., Anthony, Q., Tokpanov, Y ., Golubeva, A., Shyam, V ., Whittington, J., Pilault, J., & Millidge, B. (2024b). The Zamba2 Suite: Technical Report. arXiv:2411.15242

work page arXiv

[23] [23]

NVIDIA. (2025). Nemotron-H: A Family of Accurate and Efficient Hybrid Mamba-Transformer Models. arXiv:2504.03624

work page arXiv 2025

[24] [24]

L., et al

Botev, A., De, S., Smith, S. L., et al. (2024). RecurrentGemma: Moving Past Transformers for Efficient Open Language Models. arXiv:2404.07839. 10

work page arXiv 2024

[25] [25]

Waleffe, R., Byeon, W., Riach, D., Norick, B., Korthikanti, V ., et al. (2024). An Empirical Study of Mamba-based Language Models. arXiv:2406.07887

work page internal anchor Pith review Pith/arXiv arXiv 2024

[26] [26]

IBM, Princeton, CMU, & UIUC. (2024). Bamba: Inference-Efficient Hybrid Mamba2 Model. Hugging Face release, December 18, 2024.https://huggingface.co/blog/bamba. A Per-Layer Parameter Budget Table 9: Per-layer parameter comparison at 152M. Component Transformer SISA WQ +W K +W V +W O 2,359,296 2,359,296 SSM projections (WB,W C ,w α,W θ, λ) 0 746,508 SwiGLU ...

2024

[27] [27]

SDPA scale.Even though the augmented Q/K have dimension dh +d s, the SDPA scale must be 1/√dh,not 1/√dh +d s. Using PyTorch F.scaled_dot_product_attention’s default 1/√dh +d s breaks the equivalence in the proposition of Section 3.3, and the SISA score is no longer computed correctly: ˆQ⊤ i ˆKj√dh = q⊤ i kj√dh +λ· ¯C⊤ i ¯Bj (correct), ˆQ⊤ i ˆKj√dh +d s ̸=...

[28] [28]

The minimax offset c= (max t gt + mint gt)/2 usually keeps |g−c| small, but during early-warmup spikes of α it may grow temporarily

bf16 numerical stability.If egi−c exceeds the bf16 max (65,504), it produces NaN. The minimax offset c= (max t gt + mint gt)/2 usually keeps |g−c| small, but during early-warmup spikes of α it may grow temporarily. We clamp g−c to [−11,11] (e11 ≈59,874<65,504 ). The clamp does not activate in normal training and has no effect on optimization

[29] [29]

Vector bias at score level

λ in fp32. λraw must be kept in fp32 to train. In bf16, the smallest representable change near −1.0 is 0.0078, far larger than the ∼10 −5 AdamW updates, so λ never moves from its initialization. Quantitative impact is reported in Appendix M. 11 D Mamba-3 Implementation We use the official Mamba-3 implementation now included in the mamba-ssm package. Earli...

work page arXiv 2048

[30] [30]

Intermediate values such as ds = 96 would refine the location of the 152M valley (currently ds = 16) and the 369M upward edge (currentlyd s = 128)

Intermediate ds values.The U-shape was observed for ds ∈ {16,32,64,128} . Intermediate values such as ds = 96 would refine the location of the 152M valley (currently ds = 16) and the 369M upward edge (currentlyd s = 128)

[31] [31]

Scale-adaptive ds.Replacing the manual choice of ds with a learned mechanism (per-layer ds or sparse SSM channels)

[32] [32]

Validation at1B+scale and ad s scaling law

[33] [33]

Long-context training (with RoPE scaling). 19

[34] [34]

Per-layer / per-head distribution ofλand task-specific patterns

[35] [35]

SISA-2 (follow-up).Extending score-level fusion in two directions: (a) sigmoid-based SISA to address the softmax dilution observed in Section 6.2, and (b) deeper FFN–SSM integration toward a fully fused architecture without a separate block. 20