Hidden State Poisoning Attacks against Mamba-based Language Models
Pith reviewed 2026-05-16 18:12 UTC · model grok-4.3
The pith
Short input phrases can irreversibly overwrite hidden states in Mamba models, inducing amnesia on retrieval tasks that pure Transformers resist.
A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.
Core claim
The central claim is that certain short input phrases trigger hidden state poisoning attacks (HiSPA) in Mamba models. These attacks irreversibly overwrite the hidden state representations, producing a partial amnesia effect that prevents retrieval of earlier context. Experiments on the new RoBench-25 benchmark confirm that Mamba models and hybrids such as Jamba-1.7-Mini fail under these triggers, whereas pure Transformers remain unaffected. The vulnerability also appears on the Open-Prompt-Injections benchmark and extends to Mamba-2 and Nemotron-3-Nano hybrids.
What carries the argument
The Hidden State Poisoning Attack (HiSPA) mechanism, in which short trigger phrases overwrite the selective state space model's hidden states and thereby erase prior encoded information.
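The claimed mechanism is easiest to see in a toy selective-SSM recurrence. A minimal sketch, assuming an illustrative gate function and trigger value (not the paper's construction):

```python
def selective_scan(tokens, forget_gate):
    # Minimal 1-D selective-SSM recurrence: h_t = a_t * h_{t-1} + b * x_t,
    # where the retention a_t is input-dependent (the "selective" part).
    h, history = 0.0, []
    for x in tokens:
        a = forget_gate(x)   # input-dependent retention in [0, 1]
        h = a * h + 0.1 * x  # blend the new input into the single state
        history.append(h)
    return history

# Benign tokens barely decay the state; a hypothetical trigger (x == -1)
# drives retention to ~0, erasing everything accumulated so far.
gate = lambda x: 0.01 if x == -1.0 else 0.99

benign = selective_scan([1.0] * 20, gate)
poisoned = selective_scan([1.0] * 10 + [-1.0] + [1.0] * 9, gate)
print(f"final state, benign:   {benign[-1]:.3f}")
print(f"final state, poisoned: {poisoned[-1]:.3f}")
```

Because the update is a single recurrent overwrite rather than attention over a retained context, nothing after the trigger can read the pre-trigger state back; that is the irreversibility the attack's name points at.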
If this is right
- Mamba models fail information retrieval tasks on RoBench-25 when HiSPA triggers are inserted.
- Hybrid SSM-Transformer models such as Jamba also collapse on RoBench-25 under the same triggers.
- HiSPA triggers reduce robustness on the Open-Prompt-Injections benchmark for affected models.
- The vulnerability appears in Mamba-2 and in Mamba-2-based hybrids such as Nemotron-3-Nano.
- Hidden-layer activation patterns during HiSPAs suggest possible signals for building mitigation systems.
Where Pith is reading between the lines
- SSM architectures may require robustness techniques that differ from those sufficient for Transformers.
- The effect could restrict reliable deployment of Mamba models in long-context settings exposed to adversarial inputs.
- Interpretability signals identified in the study might enable real-time monitoring for poisoning events.
- Similar hidden-state vulnerabilities could exist in other recurrent or state-space architectures.
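One way such interpretability signals could feed a real-time monitor, sketched on synthetic activations (the norm statistic, window, and threshold are illustrative assumptions, not the paper's method):

```python
import numpy as np

def poisoning_monitor(state_norms, window=8, z_threshold=4.0):
    """Flag timesteps where the hidden-state norm jumps far outside recent
    running statistics -- a crude stand-in for the layer-pattern signals
    the interpretability study points at."""
    flags = []
    for t in range(window, len(state_norms)):
        recent = state_norms[t - window:t]
        mu, sigma = np.mean(recent), np.std(recent) + 1e-8
        if abs(state_norms[t] - mu) / sigma > z_threshold:
            flags.append(t)
    return flags

# Synthetic trace: stable norms, then an abrupt collapse at t = 20.
norms = [1.0 + 0.01 * np.sin(t) for t in range(20)] + [0.05] + [1.0] * 10
print(poisoning_monitor(np.array(norms)))
```

A deployed detector would presumably watch per-layer activations rather than a single norm, but the shape of the problem (online anomaly detection over the recurrent state) is the same.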
Load-bearing premise
That the observed performance drops are caused specifically by irreversible hidden-state overwriting rather than other mechanisms such as attention disruption or output formatting changes.
What would settle it
Measuring whether Mamba model performance on RoBench-25 recovers after the trigger phrase when hidden states are explicitly reset or cleared would test whether overwriting is the direct cause.
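That proposed intervention can be sketched against a toy recurrent model with an exposed state. Everything here is illustrative: real Mamba states are per-layer tensors and this interface is hypothetical.

```python
import numpy as np

class ToySSM:
    """Toy stand-in for a recurrent model whose state can be checkpointed."""
    def __init__(self, dim=8, seed=0):
        self.W = np.random.default_rng(seed).normal(size=(dim, dim)) * 0.1
        self.dim = dim

    def initial_state(self):
        return np.zeros(self.dim)

    def step(self, state, token_vec, retain=0.95):
        # retain ~ 1: normal decay; retain ~ 0: trigger-style overwrite
        return retain * state + self.W @ token_vec

def run(model, prefix, trigger, reset=False):
    state = model.initial_state()
    for tok in prefix:
        state = model.step(state, tok)
    snapshot = state.copy()                         # checkpoint pre-trigger state
    state = model.step(state, trigger, retain=0.0)  # trigger wipes the state
    if reset:
        state = snapshot                            # intervention: undo overwrite
    return state

model = ToySSM()
prefix = [np.ones(8)] * 5
wiped = run(model, prefix, -np.ones(8), reset=False)
restored = run(model, prefix, -np.ones(8), reset=True)
print(f"||state|| wiped:    {np.linalg.norm(wiped):.3f}")
print(f"||state|| restored: {np.linalg.norm(restored):.3f}")
```

If RoBench-25 accuracy recovered under the analogous intervention in the real model, the overwriting explanation would be favored over alternatives such as output-formatting disruption.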
Original abstract
State space models (SSMs) like Mamba offer efficient alternatives to Transformer-based language models, with linear time complexity. Yet, their adversarial robustness remains critically unexplored. This paper studies the phenomenon whereby specific short input phrases induce a partial amnesia effect in such models, by irreversibly overwriting information in their hidden states, referred to as a Hidden State Poisoning Attack (HiSPA). Our benchmark RoBench-25 allows evaluating a model's information retrieval capabilities when subject to HiSPAs, and confirms the vulnerability of SSMs against such attacks. Even the recent Jamba-1.7-Mini SSM--Transformer (a 52B hybrid model) collapses on RoBench-25 under some HiSPA triggers, whereas pure Transformers do not. We also observe that HiSPA triggers significantly weaken the Jamba model on the popular Open-Prompt-Injections benchmark, unlike pure Transformers. We further show that the theoretical and empirical findings extend to Mamba-2, and also analyse a Mamba-2-based hybrid (Nemotron-3-Nano). Finally, our interpretability study reveals patterns in Mamba's hidden layers during HiSPAs that could be used to build a HiSPA mitigation system. The full code and data to reproduce the experiments can be found at https://anonymous.4open.science/r/hispa_anonymous-5DB0.
Editorial analysis
A structured set of objections, weighed in public.
Referee Report
Summary. The paper introduces Hidden State Poisoning Attacks (HiSPA) on Mamba-based state space models, claiming that specific short input phrases induce a partial amnesia effect by irreversibly overwriting hidden-state information. It presents the RoBench-25 benchmark to evaluate information retrieval under HiSPAs, reports collapse for Mamba models and hybrids such as Jamba-1.7-Mini while pure Transformers remain unaffected, shows transfer effects on the Open-Prompt-Injections benchmark, extends the findings to Mamba-2 and Nemotron-3-Nano, and provides an interpretability analysis of hidden-layer patterns for potential mitigation. Full code and data are released.
Significance. If the central empirical claims hold, the work identifies a previously unexplored vulnerability in SSM architectures that are gaining traction as efficient alternatives to Transformers, with direct implications for adversarial robustness and secure deployment. The introduction of RoBench-25, the contrast with Transformer behavior, the transfer results, and the public release of code and data constitute concrete strengths that would enable follow-up research.
Major comments (2)
- [Abstract / Results] The central claim that short triggers cause irreversible hidden-state overwriting specifically responsible for the amnesia effect on RoBench-25 lacks a direct causal test. No ablation is described that restores performance by resetting or bypassing the hidden state while the trigger remains present, leaving open alternatives such as attention disruption or output formatting changes.
- [Abstract / Experimental Setup] The reported empirical collapse on RoBench-25 and transfer to Open-Prompt-Injections provides no details on statistical controls, baseline comparisons, number of runs, or exclusion criteria, which are required to assess whether the performance drops are robust and attributable to the proposed mechanism.
Simulated Author's Rebuttal
We thank the referee for the constructive comments, which identify opportunities to strengthen the causal evidence and statistical reporting in our work. We address each point below and will incorporate the suggested additions in the revised manuscript.
Point-by-point responses
-
Referee: [Abstract / Results] The central claim that short triggers cause irreversible hidden-state overwriting specifically responsible for the amnesia effect on RoBench-25 lacks a direct causal test. No ablation is described that restores performance by resetting or bypassing the hidden state while the trigger remains present, leaving open alternatives such as attention disruption or output formatting changes.
Authors: We agree that an explicit causal intervention would strengthen the mechanistic claim. Our current evidence rests on the architecture-specific collapse (Mamba/hybrids vs. unaffected Transformers), transfer to Open-Prompt-Injections, and the hidden-layer pattern analysis. To close this gap, we will add a new ablation in the revised version: after the HiSPA trigger we manually reset the hidden state (zeroing or restoring from a clean prefix) while keeping the trigger text in the input, then re-measure RoBench-25 retrieval accuracy. We expect this to restore performance and thereby support the overwriting interpretation over alternatives such as attention disruption. revision: yes
-
Referee: [Abstract / Experimental Setup] The reported empirical collapse on RoBench-25 and transfer to Open-Prompt-Injections provides no details on statistical controls, baseline comparisons, number of runs, or exclusion criteria, which are required to assess whether the performance drops are robust and attributable to the proposed mechanism.
Authors: We apologize for the missing details. All RoBench-25 and transfer experiments were run with 5 independent random seeds for prompt sampling and generation; we compared against no-trigger and random-phrase baselines and applied no exclusion criteria beyond discarding malformed outputs. In the revision we will expand the Experimental Setup section to report these controls explicitly, include mean ± std across runs, and add paired t-test results for significance. The updated numbers and statistical tests will appear in the main text and supplementary material. revision: yes
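The reporting the authors promise can be produced with standard tools; a sketch with placeholder per-seed scores (the numbers below are illustrative, not the paper's results):

```python
import numpy as np
from scipy import stats

# Illustrative per-seed retrieval accuracies over 5 seeds; not the paper's data.
no_trigger = np.array([0.82, 0.79, 0.84, 0.81, 0.80])
with_trigger = np.array([0.31, 0.28, 0.35, 0.30, 0.29])

print(f"no trigger:   {no_trigger.mean():.3f} ± {no_trigger.std(ddof=1):.3f}")
print(f"with trigger: {with_trigger.mean():.3f} ± {with_trigger.std(ddof=1):.3f}")

# Paired t-test: each seed yields one score per condition, so pairing by seed
# removes seed-to-seed variance from the comparison.
t, p = stats.ttest_rel(no_trigger, with_trigger)
print(f"paired t-test: t = {t:.2f}, p = {p:.4f}")
```

With only 5 seeds a non-parametric check (e.g. a Wilcoxon signed-rank test) alongside the t-test would strengthen the claim further.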
Circularity Check
Empirical evaluation with no derivation chain
Full rationale
The paper reports experimental results on hidden-state poisoning in Mamba models using RoBench-25 and related benchmarks. No equations, first-principles derivations, or parameter-fitting steps are presented whose outputs are claimed to be independent predictions. All central claims rest on direct model runs, ablation-style observations, and released code/data. No self-citation is used to justify uniqueness or to close a logical loop; the work is therefore self-contained against external benchmarks and receives the default non-circularity finding.
Axiom & Free-Parameter Ledger
Reference graph
Works this paper leans on
- [1] Quentin Anthony, Yury Tokpanov, Paolo Glorioso, and Beren Millidge. BlackMamba: Mixture of experts for state-space models. arXiv preprint arXiv:2402.01771, 2024.
- [2] Aaron Blakeman, Aarti Basant, Abhinav Khattar, Adithya Renduchintala, Akhiad Bercovich, Aleksander Ficek, Alexis Bjorlin, Ali Taghibakhshi, Amala Sanjay Deshmukh, Ameya Sunil Mahabaleshwarkar, et al. Nemotron-H: A family of accurate and efficient hybrid Mamba-Transformer models. arXiv preprint arXiv:2504.03624, 2025.
- [3] Xin Dong, Yonggan Fu, Shizhe Diao, Wonmin Byeon, Zijia Chen, Ameya Sunil Mahabaleshwarkar, Shih-Yang Liu, Matthijs Van Keirsbilck, Min-Hung Chen, Yoshi Suhara, et al. Hymba: A hybrid-head architecture for small language models. arXiv preprint arXiv:2411.13676, 2024.
- [4] Danielle Ensign and Adrià Garriga-Alonso. Investigating the indirect object identification circuit in Mamba. arXiv preprint arXiv:2407.14008, 2024.
- [5] Leo Gao, Stella Biderman, Sid Black, Laurence Golding, Travis Hoppe, Charles Foster, Jason Phang, Horace He, Anish Thite, Noa Nabeshima, et al. The Pile: An 800GB dataset of diverse text for language modeling. arXiv preprint arXiv:2101.00027, 2021.
- [6] Paolo Glorioso, Quentin Anthony, Yury Tokpanov, James Whittington, Jonathan Pilault, Adam Ibrahim, and Beren Millidge. Zamba: A compact 7B SSM hybrid model. arXiv preprint arXiv:2405.16712, 2024.
- [7] Albert Gu, Karan Goel, and Christopher Ré. Efficiently modeling long sequences with structured state spaces. arXiv preprint arXiv:2111.00396, 2021.
- [8] Cheng-Ping Hsieh, Simeng Sun, Samuel Kriman, Shantanu Acharya, Dima Rekesh, Fei Jia, Yang Zhang, and Boris Ginsburg. RULER: What's the real context size of your long-context language models? arXiv preprint arXiv:2404.06654, 2024.
- [9] Janis Keuper. Prompt injection attacks on LLM generated reviews of scientific publications. arXiv preprint arXiv:2509.10248, 2025.
- [10] Mosh Levy, Alon Jacoby, and Yoav Goldberg. Same task, more tokens: The impact of input length on the reasoning performance of large language models. arXiv preprint arXiv:2402.14848, 2024.
- [11] Opher Lieber, Barak Lenz, Hofit Bata, Gal Cohen, Jhonathan Osin, Itay Dalmedigos, Erez Safahi, Shaked Meirom, Yonatan Belinkov, Shai Shalev-Shwartz, et al. Jamba: A hybrid Transformer-Mamba language model. arXiv preprint arXiv:2403.19887, 2024.
- [12] Zhicheng Lin. Hidden prompts in manuscripts exploit AI-assisted peer review. arXiv preprint arXiv:2507.06185, 2025.
- [13] David Patterson, Joseph Gonzalez, Quoc Le, Chen Liang, Lluis-Miquel Munguia, Daniel Rothchild, David So, Maud Texier, and Jeff Dean. Carbon emissions and large neural network training. arXiv preprint arXiv:2104.10350, 2021.
- [14] Liliang Ren, Yang Liu, Yadong Lu, Yelong Shen, Chen Liang, and Weizhu Chen. Samba: Simple hybrid state space models for efficient unlimited context language modeling. arXiv preprint arXiv:2406.07522, 2024.
- [15] Ruifeng Ren, Zhicong Li, and Yong Liu. Exploring the limitations of Mamba in copy and CoT reasoning. In Proceedings of the 2025 Conference on Empirical Methods in Natural Language Processing, pp. 12550-12574, 2025.
- [16] Robin M Schmidt. Recurrent neural networks (RNNs): A gentle introduction and overview. arXiv preprint arXiv:1912.05911, 2019.
- [17] Arnab Sen Sharma, David Atkinson, and David Bau. Locating and editing factual associations in Mamba. arXiv preprint arXiv:2404.03646, 2024.
- [18] Jamba Team, Barak Lenz, Alan Arazi, Amir Bergman, Avshalom Manevich, Barak Peleg, Ben Aviram, Chen Almagor, Clara Fridman, Dan Padnos, et al. Jamba-1.5: Hybrid Transformer-Mamba models at scale. arXiv preprint arXiv:2408.12570, 2024.
- [19] Arne Tillmann. Literature review of multi-agent debate for problem-solving. arXiv preprint arXiv:2506.00066, 2025.
- [20] Roger Waleffe, Wonmin Byeon, Duncan Riach, Brandon Norick, Vijay Korthikanti, Tri Dao, Albert Gu, Ali Hatamizadeh, Sudhakar Singh, Deepak Narayanan, et al. An empirical study of Mamba-based language models. arXiv preprint arXiv:2406.07887, 2024.
- [21] Chenxi Wu, Alan John Varghese, Vivek Oommen, and George Em Karniadakis. GPT vs human for scientific reviews: A dual source review on applications of ChatGPT in science. arXiv preprint arXiv:2312.03769, 2023.
- [22] Yuqi Zhou, Lin Lu, Hanchi Sun, Pan Zhou, and Lichao Sun. Virtual context: Enhancing jailbreak attacks with special token injection. arXiv preprint arXiv:2406.19845, 2024.