pith. sign in

arxiv: 2606.02332 · v2 · pith:RYNPUDA6new · submitted 2026-06-01 · 💻 cs.AI · cs.CL· cs.LG

Forget Attention: Importance-Aware Attention Is All You Need

Pith reviewed 2026-06-28 14:36 UTC · model grok-4.3

classification 💻 cs.AI cs.CLcs.LG
keywords SISASSM-informed attentionscore-level fusionhybrid language modelsstate space modelsattention mechanismsLAMBADAneedle-in-a-haystack
0
0 comments X

The pith

SISA adds an SSM-derived importance term inside attention scores and computes the full operation as one standard SDPA call on augmented query and key vectors.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper seeks to demonstrate that state space model importance signals can be merged with attention at the score computation level instead of being kept in separate blocks or heads. By deriving an importance value from an SSM and folding it directly into the attention score through vector augmentation, the approach lets global retrieval and sequential prioritization influence each other inside a single attention pass. A reader would care because prior hybrids isolate the two mechanisms, preventing them from shaping the same computation. If the integration works as described, it supplies a third route for building hybrids that still runs on unmodified attention kernels.

Core claim

SISA integrates an importance term derived from state space models directly into the softmax attention computation by augmenting the query and key vectors, allowing the entire operation to be performed as a single scaled dot-product attention call without recurrent states or custom kernels. This score-level fusion enables the model to prioritize important tokens while maintaining global retrieval capability. At 152M parameters trained on 5B tokens, SISA reaches 17.3% on LAMBADA-greedy versus 13.9% for a Transformer and 15.5% for Mamba-3, while attaining 100% needle-in-a-haystack accuracy from step 1K, seven times faster than the Transformer baseline.

What carries the argument

SISA, the direct insertion of an SSM-derived importance term into attention scores via augmentation of query and key vectors before a single SDPA operation.

If this is right

  • At the tested scales, SISA produces higher LAMBADA-greedy accuracy than both a standard Transformer and a Mamba-3 baseline.
  • SISA reaches perfect needle-in-a-haystack retrieval from the first thousand steps, converging seven times faster than a Transformer.
  • At 369M parameters, SISA keeps perfect needle-in-a-haystack performance while remaining compatible with stock SDPA execution.
  • The method requires no recurrent state and no custom kernel, preserving the execution path of ordinary scaled dot-product attention.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

  • Score-level fusion could be tested by deriving the importance term from sources other than SSMs to see whether the performance pattern generalizes.
  • The single-call property might allow direct substitution into existing Transformer training pipelines without kernel modifications.
  • If the importance term scales with model size, the approach could reduce reliance on block-level or head-level hybrid designs in larger models.

Load-bearing premise

An SSM-derived importance signal can be pre-computed or jointly computed so that it improves attention scores without new hyperparameters, extra passes, or changes that affect numerical stability or scaling behavior.

What would settle it

Zero out the importance term in the augmented vectors and measure whether LAMBADA and needle-in-a-haystack performance fall back to the levels of an unmodified Transformer at the same scale and training budget.

Figures

Figures reproduced from arXiv: 2606.02332 by Suhyeong Shin, Yeongwook Yang.

Figure 1
Figure 1. Figure 1: Three levels of SSM-attention fusion. Block- and head-level hybrids combine outputs that [PITH_FULL_IMAGE:figures/full_fig_p002_1.png] view at source ↗
Figure 2
Figure 2. Figure 2: Compute graph of a SISA layer. The attention path produces Q, K, V with positional RoPE; [PITH_FULL_IMAGE:figures/full_fig_p005_2.png] view at source ↗
Figure 3
Figure 3. Figure 3: NIAH convergence at 152M. SISA reaches 100% from step 1K, while Transformer and the [PITH_FULL_IMAGE:figures/full_fig_p007_3.png] view at source ↗
Figure 4
Figure 4. Figure 4: Scaling behavior across model sizes. SISA peaks on LAMBADA at 152M; Mamba-3 [PITH_FULL_IMAGE:figures/full_fig_p008_4.png] view at source ↗
Figure 5
Figure 5. Figure 5: LAMBADA trajectory at 152M. SISA reaches Transformer’s final accuracy by step 2K and [PITH_FULL_IMAGE:figures/full_fig_p015_5.png] view at source ↗
Figure 6
Figure 6. Figure 6: ds ablation. Best is ds = 64 at 50M, ds = 16 at 152M, and (among the retrain candidates) ds = 128 at 369M—the optimum is non-monotonic in scale. These three ratio regimes manifest as distinct ablation behaviors in Section Q. Why is ds = 16 optimal at 152M but ds = 128 at 369M? FFN share alone does not explain it: 152M (55.9%) and 369M (58.8%) are similar yet have opposite optima. Two additional factors mat… view at source ↗
read the original abstract

Combining attention's global retrieval with the sequential importance signal of state space models (SSMs) is the open challenge of hybrid language modeling. Transformers see everywhere but cannot prioritize; SSMs know what matters but cannot revisit. Existing hybrids -- Jamba (block level) and Hymba (head level) -- place the two in separate compartments, so neither informs the other during the attention computation itself. We propose SISA (SSM-Informed Softmax Attention), which adds an SSM-derived importance term directly inside the attention score and realizes the full operation as a single SDPA call on augmented query/key vectors -- no recurrent state, no custom kernel. At 152M / 5B tokens, SISA reaches LAMBADA-greedy 17.3% (vs. Transformer 13.9 and Mamba-3 15.5) and attains NIAH 100% from step 1K, 7x faster than Transformer's retrieval convergence; at 369M, Mamba-3 leads LAMBADA while SISA preserves perfect NIAH and stock-SDPA execution. SISA thus defines a third design axis for SSM-attention hybrids -- score-level fusion -- beyond the block-level and head-level paradigms that have dominated the field.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Referee Report

2 major / 1 minor

Summary. The paper proposes SISA (SSM-Informed Softmax Attention) as a score-level fusion method for SSM-attention hybrids. It augments query and key vectors with an SSM-derived importance term so that the full attention operation executes as a single standard SDPA call, without recurrent state or custom kernels. At 152M parameters trained on 5B tokens, SISA reports LAMBADA-greedy accuracy of 17.3% (vs. 13.9% Transformer, 15.5% Mamba-3) and perfect NIAH retrieval from step 1K (7x faster convergence than Transformer); at 369M it preserves perfect NIAH while Mamba-3 leads on LAMBADA. The work positions this as a third design axis beyond existing block-level and head-level hybrids.

Significance. If the reported gains are reproducible, the contribution would be meaningful: it supplies a lightweight, kernel-free mechanism for injecting sequential importance directly into attention scores. This could simplify hybrid architecture design and improve retrieval without sacrificing the global access of attention. The absence of any training protocol, baseline code, or statistical controls, however, prevents assessment of whether the numerical improvements are robust or merely artifacts of a single run.

major comments (2)
  1. [Abstract] Abstract: the central performance claims (LAMBADA 17.3%, NIAH 100% from step 1K, 7x faster convergence) rest on benchmark numbers supplied without any description of training hyperparameters, optimizer settings, baseline re-implementations, number of random seeds, or statistical tests. These omissions are load-bearing for the empirical superiority argument.
  2. [Method] Method description: the assertion that an SSM-derived importance signal can be obtained and injected via simple Q/K augmentation while preserving exact SDPA semantics, scaling, and numerical stability is stated without an explicit algorithm, pseudocode, or verification that no additional passes or hyper-parameters are introduced. This detail is required to substantiate the “stock-SDPA execution” claim.
minor comments (1)
  1. [Abstract] Abstract: the parenthetical comparison omits the percent sign after the Transformer and Mamba-3 numbers; consistent formatting would improve readability.

Simulated Author's Rebuttal

2 responses · 0 unresolved

We thank the referee for highlighting the need for greater transparency in both the experimental protocol and the method description. We will revise the manuscript to incorporate the requested details, which will strengthen the reproducibility of the work.

read point-by-point responses
  1. Referee: [Abstract] Abstract: the central performance claims (LAMBADA 17.3%, NIAH 100% from step 1K, 7x faster convergence) rest on benchmark numbers supplied without any description of training hyperparameters, optimizer settings, baseline re-implementations, number of random seeds, or statistical tests. These omissions are load-bearing for the empirical superiority argument.

    Authors: We agree that the reported performance numbers require supporting experimental details to allow proper evaluation. In the revised manuscript we will add a dedicated Experimental Setup section that specifies all training hyperparameters (learning rate schedule, batch size, sequence length, optimizer), the exact re-implementations of the Transformer and Mamba-3 baselines, the number of random seeds, and any statistical tests or variance reporting used. This addition will directly address the concern that the current claims rest on insufficiently documented runs. revision: yes

  2. Referee: [Method] Method description: the assertion that an SSM-derived importance signal can be obtained and injected via simple Q/K augmentation while preserving exact SDPA semantics, scaling, and numerical stability is stated without an explicit algorithm, pseudocode, or verification that no additional passes or hyper-parameters are introduced. This detail is required to substantiate the “stock-SDPA execution” claim.

    Authors: We accept that an explicit algorithmic description is necessary to substantiate the claim of stock-SDPA execution. The revised manuscript will include a new Algorithm box (or pseudocode listing) that details (1) how the SSM importance term is computed from the input sequence, (2) the precise concatenation/augmentation of the query and key vectors, and (3) confirmation that the resulting operation remains a single call to standard scaled-dot-product attention with no extra forward passes or introduced hyperparameters. This will make the “kernel-free” and “exact SDPA semantics” properties verifiable. revision: yes

Circularity Check

0 steps flagged

No significant circularity; proposal is an empirical architectural change

full rationale

The paper presents SISA as a direct implementation: an SSM-derived importance term is added inside attention scores and realized via augmented Q/K vectors in one standard SDPA call. The abstract and description contain no equations, fitted parameters renamed as predictions, self-citations used as load-bearing uniqueness theorems, or ansatzes smuggled from prior author work. No derivation chain reduces the claimed improvement (LAMBADA/NIAH gains) to a self-referential definition or statistical forcing. The construction is presented as a vector-level fusion choice that preserves existing SDPA semantics, making the central claim self-contained against external benchmarks rather than circular by construction.

Axiom & Free-Parameter Ledger

0 free parameters · 0 axioms · 0 invented entities

Abstract supplies no information on free parameters, background axioms, or newly postulated entities.

pith-pipeline@v0.9.1-grok · 5756 in / 1137 out tokens · 23395 ms · 2026-06-28T14:36:35.534872+00:00 · methodology

discussion (0)

Sign in with ORCID, Apple, or X to comment. Anyone can read and Pith papers without signing in.

Reference graph

Works this paper leans on

35 extracted references · 20 canonical work pages · 8 internal anchors

  1. [1]

    Vaswani, A., et al. (2017). Attention is all you need.NeurIPS

  2. [2]

    Mamba: Linear-Time Sequence Modeling with Selective State Spaces

    Gu, A. & Dao, T. (2023). Mamba: Linear-time sequence modeling with selective state spaces. arXiv:2312.00752

  3. [3]

    Dao, T. & Gu, A. (2024). Transformers are SSMs.ICML. arXiv:2405.21060. 9

  4. [4]

    Y ., Chen, B., Wang, C., Bick, A., Kolter, J

    Lahoti, A., Li, K. Y ., Chen, B., Wang, C., Bick, A., Kolter, J. Z., Dao, T., & Gu, A. (2026). Mamba-3: Improved Sequence Modeling using State Space Principles.ICLR. arXiv:2603.15569

  5. [5]

    Lieber, O., et al. (2024). Jamba. arXiv:2403.19887

  6. [6]

    Ren, L., et al. (2024). Samba.ICLR 2025. arXiv:2406.07522

  7. [7]

    Dong, X., Fu, Y ., Diao, S., Byeon, W., Chen, Z., et al. (2024). Hymba: A Hybrid-head Architecture for Small Language Models.ICLR 2025. arXiv:2411.13676

  8. [8]

    Zuo, J., et al. (2025). Falcon-H1: A Family of Hybrid-Head Language Models Redefining Efficiency and Performance. arXiv:2507.22448

  9. [9]

    Griffin: Mixing Gated Linear Recurrences with Local Attention for Efficient Language Models

    De, S., Smith, S. L., Fernando, A., et al. (2024). Griffin: Mixing Gated Linear Recurrences with Local Attention for Efficient Language Models. arXiv:2402.19427

  10. [10]

    Train Short, Test Long: Attention with Linear Biases Enables Input Length Extrapolation

    Press, O., Smith, N. A., & Lewis, M. (2022). Train Short, Test Long: Attention with Linear Biases Enables Input Length Extrapolation.ICLR. arXiv:2108.12409

  11. [11]

    Raffel, C., Shazeer, N., Roberts, A., Lee, K., Narang, S., Matena, M., Zhou, Y ., Li, W., & Liu, P. J. (2020). Exploring the Limits of Transfer Learning with a Unified Text-to-Text Transformer. JMLR, 21(140), 1–67

  12. [12]

    Zheng, C., Gao, Y ., Shi, H., Huang, M., Li, J., Xiong, J., Ren, X., Ng, M., Jiang, X., Li, Z., & Li, Y . (2024). DAPE: Data-Adaptive Positional Encoding for Length Extrapolation.NeurIPS. arXiv:2405.14722

  13. [13]

    O., & Courville, A

    Lin, Z., Nikishin, E., He, X. O., & Courville, A. (2025). Forgetting Transformer: Softmax Attention with a Forget Gate.ICLR. arXiv:2503.02130

  14. [14]

    Ramapuram, J., Danieli, F., Dhekane, E., Weers, F., Busbridge, D., Ablin, P., Likhomanenko, T., Digani, J., Gu, Z., Shidani, A., & Webb, R. (2024). Theory, Analysis, and Best Practices for Sigmoid Self-Attention.ICLR 2025. arXiv:2409.04431

  15. [15]

    R., Hestness, J., & Dey, N

    Soboleva, D., Al-Khateeb, F., Myers, R., Steeves, J. R., Hestness, J., & Dey, N. (2023). SlimPajama: A 627B token cleaned and deduplicated version of RedPajama.Cerebras Systems

  16. [16]

    Hoffmann, J., et al. (2022). Training compute-optimal large language models. arXiv:2203.15556

  17. [17]

    N., Bernardi, R., Pezzelle, S., Baroni, M., Boleda, G., & Fernández, R

    Paperno, D., Kruszewski, G., Lazaridou, A., Pham, Q. N., Bernardi, R., Pezzelle, S., Baroni, M., Boleda, G., & Fernández, R. (2016). The LAMBADA Dataset: Word Prediction Requiring a Broad Discourse Context.ACL

  18. [18]

    Zellers, R., Holtzman, A., Bisk, Y ., Farhadi, A., & Choi, Y . (2019). HellaSwag: Can a Machine Really Finish Your Sentence?ACL

  19. [19]

    Clark, P., Cowhey, I., Etzioni, O., Khot, T., Sabharwal, A., Schoenick, C., & Tafjord, O. (2018). Think you have Solved Question Answering? Try ARC, the AI2 Reasoning Challenge. arXiv:1803.05457

  20. [20]

    Sakaguchi, K., Le Bras, R., Bhagavatula, C., & Choi, Y . (2020). WinoGrande: An Adversarial Winograd Schema Challenge at Scale.AAAI

  21. [21]

    Glorioso, P., Anthony, Q., Tokpanov, Y ., Whittington, J., Pilault, J., Ibrahim, A., & Millidge, B. (2024a). Zamba: A Compact 7B SSM Hybrid Model. arXiv:2405.16712

  22. [22]

    Glorioso, P., Anthony, Q., Tokpanov, Y ., Golubeva, A., Shyam, V ., Whittington, J., Pilault, J., & Millidge, B. (2024b). The Zamba2 Suite: Technical Report. arXiv:2411.15242

  23. [23]

    NVIDIA. (2025). Nemotron-H: A Family of Accurate and Efficient Hybrid Mamba-Transformer Models. arXiv:2504.03624

  24. [24]

    L., et al

    Botev, A., De, S., Smith, S. L., et al. (2024). RecurrentGemma: Moving Past Transformers for Efficient Open Language Models. arXiv:2404.07839. 10

  25. [25]

    Waleffe, R., Byeon, W., Riach, D., Norick, B., Korthikanti, V ., et al. (2024). An Empirical Study of Mamba-based Language Models. arXiv:2406.07887

  26. [26]

    IBM, Princeton, CMU, & UIUC. (2024). Bamba: Inference-Efficient Hybrid Mamba2 Model. Hugging Face release, December 18, 2024.https://huggingface.co/blog/bamba. A Per-Layer Parameter Budget Table 9: Per-layer parameter comparison at 152M. Component Transformer SISA WQ +W K +W V +W O 2,359,296 2,359,296 SSM projections (WB,W C ,w α,W θ, λ) 0 746,508 SwiGLU ...

  27. [27]

    SDPA scale.Even though the augmented Q/K have dimension dh +d s, the SDPA scale must be 1/√dh,not 1/√dh +d s. Using PyTorch F.scaled_dot_product_attention’s default 1/√dh +d s breaks the equivalence in the proposition of Section 3.3, and the SISA score is no longer computed correctly: ˆQ⊤ i ˆKj√dh = q⊤ i kj√dh +λ· ¯C⊤ i ¯Bj (correct), ˆQ⊤ i ˆKj√dh +d s ̸=...

  28. [28]

    The minimax offset c= (max t gt + mint gt)/2 usually keeps |g−c| small, but during early-warmup spikes of α it may grow temporarily

    bf16 numerical stability.If egi−c exceeds the bf16 max (65,504), it produces NaN. The minimax offset c= (max t gt + mint gt)/2 usually keeps |g−c| small, but during early-warmup spikes of α it may grow temporarily. We clamp g−c to [−11,11] (e11 ≈59,874<65,504 ). The clamp does not activate in normal training and has no effect on optimization

  29. [29]

    Vector bias at score level

    λ in fp32. λraw must be kept in fp32 to train. In bf16, the smallest representable change near −1.0 is 0.0078, far larger than the ∼10 −5 AdamW updates, so λ never moves from its initialization. Quantitative impact is reported in Appendix M. 11 D Mamba-3 Implementation We use the official Mamba-3 implementation now included in the mamba-ssm package. Earli...

  30. [30]

    Intermediate values such as ds = 96 would refine the location of the 152M valley (currently ds = 16) and the 369M upward edge (currentlyd s = 128)

    Intermediate ds values.The U-shape was observed for ds ∈ {16,32,64,128} . Intermediate values such as ds = 96 would refine the location of the 152M valley (currently ds = 16) and the 369M upward edge (currentlyd s = 128)

  31. [31]

    Scale-adaptive ds.Replacing the manual choice of ds with a learned mechanism (per-layer ds or sparse SSM channels)

  32. [32]

    Validation at1B+scale and ad s scaling law

  33. [33]

    Long-context training (with RoPE scaling). 19

  34. [34]

    Per-layer / per-head distribution ofλand task-specific patterns

  35. [35]

    SISA-2 (follow-up).Extending score-level fusion in two directions: (a) sigmoid-based SISA to address the softmax dilution observed in Section 6.2, and (b) deeper FFN–SSM integration toward a fully fused architecture without a separate block. 20