pith. sign in

arxiv: 2606.18114 · v1 · pith:4CD6ILV7new · submitted 2026-06-16 · 💻 cs.LG · cs.AI

Ternary Mamba: Grouped Quantization-Aware Training of W1.58A16 State Space Models

Pith reviewed 2026-06-27 00:52 UTC · model grok-4.3

classification 💻 cs.LG cs.AI
keywords state space modelsmambaquantization-aware trainingternary weightsknowledge distillationmodel compressionrecurrent modelsedge deployment
0
0 comments X

The pith

A pretrained Mamba-2 checkpoint can be turned into a ternary model with grouped QAT and distillation using only 102 million tokens.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper sets out to demonstrate that state space models do not require training from scratch to reach effective ternary quantization. Starting from an existing full-precision checkpoint and applying grouped quantization-aware training plus knowledge distillation from a frozen FP16 teacher produces a 3.61 times smaller model that retains nearly the same zero-shot performance. This matters for practical deployment because it slashes the data and compute needed by three orders of magnitude compared with prior from-scratch ternary SSM work. The same experiments uncover a training instability called zero-ratio collapse that is unique to the quantized SSM setting and show that correction methods borrowed from transformers do not transfer because of error buildup across the recurrence.

Core claim

Grouped quantization-aware training with knowledge distillation from a frozen FP16 teacher applied to a pretrained Mamba-2 1.3B checkpoint yields a W1.58A16 model that occupies 744 MB instead of 2,687 MB and reaches 48.1 percent average zero-shot accuracy on seven tasks after 102 million tokens, coming within 0.9 percentage points of Bi-Mamba while using roughly one-thousandth the marginal token budget of from-scratch ternary training.

What carries the argument

Grouped quantization-aware training that jointly optimizes quantization scales and model weights under a distillation loss from the frozen FP16 teacher, which counters zero-ratio collapse during the SSM recurrence.

If this is right

  • Model size drops from 2,687 MB to 744 MB while zero-shot accuracy on the seven-task average reaches 48.1 percent.
  • Total training cost is limited to 4 GPU-hours on one H100.
  • Post-training correction methods that work for transformers produce large accuracy drops because errors accumulate through the SSM recurrence.
  • Zero-ratio collapse appears only in the QAT-from-pretrained regime and must be managed by the grouped scale updates.
  • The marginal token requirement falls by a factor of roughly 1,000 relative to the 150 billion tokens used in earlier from-scratch ternary SSM training.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

  • The same grouped QAT recipe may allow rapid adaptation of other pretrained SSM variants to ternary weights without repeating full pretraining.
  • Because the method works from an existing checkpoint, it opens the possibility of periodic on-device fine-tuning of compressed models when new data arrives.
  • The observed failure of transformer post-hoc fixes points to the need for recurrence-aware quantization analysis tools that are not yet standard.
  • If the 0.3 percentage point gap to Bi-Mamba can be closed with modest extra tokens, the approach would make ternary SSMs competitive for latency-critical applications.

Load-bearing premise

That the recurrent dynamics of the SSM allow the teacher signal to recover performance even after the quantization scales have been learned and the zero ratio has stabilized.

What would settle it

Running the identical 102 million token budget from a random initialization instead of the pretrained checkpoint and finding that accuracy stays well below 48 percent.

Figures

Figures reproduced from arXiv: 2606.18114 by Ramprasath Ganesaraja, Sahil Dilip Panse, Swathika N.

Figure 1
Figure 1. Figure 1: Learnable scale (exp002) collapses to 90% zeros with a loss spike at 10k steps, while fixed [PITH_FULL_IMAGE:figures/full_fig_p003_1.png] view at source ↗
Figure 2
Figure 2. Figure 2: C4 PPL vs. training tokens (log scale). Monotonic decrease with no plateau observed; trajec [PITH_FULL_IMAGE:figures/full_fig_p005_2.png] view at source ↗
Figure 3
Figure 3. Figure 3: Group-size ablation. PPL is flat across g ∈ {64, 128, 256}; top-1 agreement peaks at g = 128 [PITH_FULL_IMAGE:figures/full_fig_p006_3.png] view at source ↗
Figure 4
Figure 4. Figure 4: Both Kalman amplification and James-Stein shrinkage increase PPL monotonically from base [PITH_FULL_IMAGE:figures/full_fig_p007_4.png] view at source ↗
Figure 5
Figure 5. Figure 5: TV/L1 ratio exceeds the Gaussian baseline ( [PITH_FULL_IMAGE:figures/full_fig_p007_5.png] view at source ↗
Figure 6
Figure 6. Figure 6: Layer-selective quantization Pareto frontier. PPL recovery is approximately linear in the frac [PITH_FULL_IMAGE:figures/full_fig_p008_6.png] view at source ↗
read the original abstract

State Space Models (SSMs) such as Mamba-2 offer linear-time inference but their memory footprint limits edge deployment. Prior ternary SSM work (Slender-Mamba) trains from scratch on 150B tokens; we show a pretrained checkpoint suffices, reducing the marginal token budget by 1,000x. Using grouped quantization-aware training (QAT) with knowledge distillation from a frozen FP16 teacher, we compress Mamba-2 1.3B to 3.61x (2,687 to 744 MB) and achieve 48.1% zero-shot accuracy (7-task average) in just 102M tokens (4 GPU-hours, single H100) -- approaching Bi-Mamba's 48.4% (within +/-0.9pp CI). This QAT-from-pretrained setting reveals zero-ratio collapse, a novel instability caused by learnable quantization scales that does not arise in from-scratch training. We further show that post-hoc correction strategies effective for Transformers fail for SSMs due to error accumulation through the recurrence. These results demonstrate that ternary SSMs do not require expensive from-scratch training: QAT from pretrained checkpoints with KD is a data-efficient alternative.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Referee Report

3 major / 3 minor

Summary. The manuscript claims that grouped quantization-aware training (QAT) with knowledge distillation from a frozen FP16 teacher, starting from a pretrained Mamba-2 1.3B checkpoint, enables effective W1.58A16 compression of state space models. This yields 3.61x size reduction (2687 MB to 744 MB) and 48.1% zero-shot accuracy on a 7-task average using only 102M tokens (4 GPU-hours on one H100), approaching Bi-Mamba's 48.4% within +/-0.9pp CI. The work identifies zero-ratio collapse as a novel instability arising from learnable quantization scales in the QAT-from-pretrained regime (absent in from-scratch training) and shows that post-hoc correction methods effective for Transformers fail for SSMs due to error accumulation through the recurrence. This reduces the marginal token budget by 1000x relative to prior from-scratch ternary SSM training on 150B tokens.

Significance. If the empirical results and ablations hold, the work provides a practical, data-efficient path to ternary SSM deployment that avoids the prohibitive cost of from-scratch training. The 1000x token reduction, explicit comparison to Bi-Mamba with CI, and the identification of recurrence-specific quantization instabilities (zero-ratio collapse and failure of post-hoc fixes) are load-bearing contributions that could guide quantization research on recurrent architectures. The use of grouped QAT plus KD from pretrained checkpoints is a reproducible empirical finding with clear baselines.

major comments (3)
  1. [§4.3] §4.3 (Post-hoc correction experiments): the central claim that post-hoc strategies fail for SSMs due to recurrence error accumulation is load-bearing for arguing QAT-from-pretrained is necessary, yet the manuscript provides no quantitative measurement of state error propagation (e.g., via controlled injection of quantization noise into the recurrence and tracking of hidden-state drift over sequence length).
  2. [Table 2] Table 2 and §5.1 (zero-shot results): the 48.1% vs 48.4% comparison is reported with a +/-0.9pp CI, but the text does not specify the number of evaluation runs, task-level variance, or whether the CI accounts for multiple-testing across the 7 tasks; this weakens the claim of statistical parity.
  3. [§3.2] §3.2 (zero-ratio collapse definition): the novel instability is attributed to learnable quantization scales, but the manuscript does not include an ablation isolating the effect of learnable vs. fixed scales on the observed collapse, leaving open whether the phenomenon is specific to the grouped QAT formulation or more general.
minor comments (3)
  1. The abstract and §2 reference 'Bi-Mamba' as a baseline without an explicit citation or description of its architecture/training details in the main text; this should be clarified for reproducibility.
  2. [Figure 3] Figure 3 (zero-ratio collapse visualization): the y-axis scaling and legend placement make it difficult to read the exact zero-ratio values at convergence; consider adding a table of final ratios.
  3. The token budget comparison (150B vs 102M) assumes identical model size and task distribution between Slender-Mamba and the current experiments; a short note confirming this would strengthen the 1000x claim.

Simulated Author's Rebuttal

3 responses · 0 unresolved

We thank the referee for the positive recommendation of minor revision and the constructive comments. We address each major point below and commit to revisions that strengthen the manuscript without misrepresenting the existing results.

read point-by-point responses
  1. Referee: [§4.3] §4.3 (Post-hoc correction experiments): the central claim that post-hoc strategies fail for SSMs due to recurrence error accumulation is load-bearing for arguing QAT-from-pretrained is necessary, yet the manuscript provides no quantitative measurement of state error propagation (e.g., via controlled injection of quantization noise into the recurrence and tracking of hidden-state drift over sequence length).

    Authors: We agree that a direct quantitative measurement of state error propagation would provide stronger support for the recurrence-specific claim. In the revised manuscript we will add a controlled experiment that injects synthetic quantization noise into the hidden states and tracks drift in hidden-state norms over increasing sequence lengths, with direct comparison to Transformer baselines. revision: yes

  2. Referee: [Table 2] Table 2 and §5.1 (zero-shot results): the 48.1% vs 48.4% comparison is reported with a +/-0.9pp CI, but the text does not specify the number of evaluation runs, task-level variance, or whether the CI accounts for multiple-testing across the 7 tasks; this weakens the claim of statistical parity.

    Authors: We will revise §5.1 and the caption of Table 2 to explicitly state the number of evaluation runs used to compute the CI, report task-level standard deviations, and confirm that the interval incorporates a correction for multiple testing across the seven tasks. revision: yes

  3. Referee: [§3.2] §3.2 (zero-ratio collapse definition): the novel instability is attributed to learnable quantization scales, but the manuscript does not include an ablation isolating the effect of learnable vs. fixed scales on the observed collapse, leaving open whether the phenomenon is specific to the grouped QAT formulation or more general.

    Authors: We will add a targeted ablation in §3.2 that compares zero-ratio collapse when quantization scales are learnable versus held fixed (initialized from the pretrained checkpoint) under otherwise identical QAT-from-pretrained conditions. This will isolate the contribution of learnable scales. revision: yes

Circularity Check

0 steps flagged

No significant circularity

full rationale

The paper reports empirical outcomes of grouped QAT with KD on a pretrained Mamba-2 checkpoint, including measured compression (3.61x), token count (102M), accuracy (48.1%), and comparisons to baselines such as Bi-Mamba. No equations, derivations, or parameter-fitting steps are described that reduce by construction to the inputs; the zero-ratio collapse and recurrence-error observations are presented as experimental findings rather than self-defined predictions. No load-bearing self-citations or uniqueness theorems appear in the abstract or described claims. The work is therefore self-contained as an empirical demonstration.

Axiom & Free-Parameter Ledger

1 free parameters · 1 axioms · 1 invented entities

Abstract-only; full details unavailable. Free parameters and axioms inferred from high-level description only.

free parameters (1)
  • learnable quantization scales
    Mentioned as causing zero-ratio collapse during grouped QAT.
axioms (1)
  • domain assumption Knowledge distillation from a frozen FP16 teacher improves ternary quantization outcomes for SSMs.
    Central to the proposed training procedure.
invented entities (1)
  • zero-ratio collapse no independent evidence
    purpose: Describes a novel training instability observed in QAT for SSMs but not Transformers.
    Presented as a new phenomenon revealed by the QAT-from-pretrained setting.

pith-pipeline@v0.9.1-grok · 5762 in / 1386 out tokens · 37736 ms · 2026-06-27T00:52:18.843017+00:00 · methodology

discussion (0)

Sign in with ORCID, Apple, or X to comment. Anyone can read and Pith papers without signing in.

Reference graph

Works this paper leans on

15 extracted references · 11 canonical work pages · 2 internal anchors

  1. [1]

    Estimating or Propagating Gradients Through Stochastic Neurons for Conditional Computation

    Yoshua Bengio, Nicholas Léonard, and Aaron Courville. Estimating or propagating gradients through stochastic neurons for conditional computation.arXiv preprint arXiv:1308.3432, 2013

  2. [2]

    Quamba2: Robust and efficient post-training quantization for selective state space models.arXiv preprint arXiv:2503.22879, 2025

    Hung-Yi Chiang, Hung-Yueh Guo, Zhewei Chang, Andreas Gerstlauer, and Diana Ding. Quamba2: Robust and efficient post-training quantization for selective state space models.arXiv preprint arXiv:2503.22879, 2025. 9

  3. [3]

    Transformers are SSMs: Generalized models and efficient algorithms through structured state space duality

    Tri Dao and Albert Gu. Transformers are SSMs: Generalized models and efficient algorithms through structured state space duality. InProceedings of the 41st International Conference on Machine Learning (ICML), 2024

  4. [4]

    Quamba: A post-training quantization recipe for selective state space models.arXiv preprint arXiv:2410.13229, 2024

    Hung-Yueh Guo, Zhewei Chang, Anthony Todd, Yushu Lu, Han Cheng, Andreas Gerstlauer, and Diana Ding. Quamba: A post-training quantization recipe for selective state space models.arXiv preprint arXiv:2410.13229, 2024

  5. [5]

    ParetoQ: Scaling laws in extremely low-bit LLM quantization.arXiv preprint arXiv:2502.02631, 2025

    Byeongwook Heo, Yeonwoo Oh, Dongsoo Park, and Jungwook Yoo. ParetoQ: Scaling laws in extremely low-bit LLM quantization.arXiv preprint arXiv:2502.02631, 2025

  6. [6]

    Estimation with quadratic loss

    William James and Charles Stein. Estimation with quadratic loss. InProceedings of the Fourth Berkeley Symposium on Mathematical Statistics and Probability, volume 1, pages 361–379, 1961

  7. [7]

    QEP: Quantization error projection for post-training cor- rection.arXiv preprint arXiv:2501.08789, 2025

    Zhongnan Li, Long Qian, and Ye Yuan. QEP: Quantization error projection for post-training cor- rection.arXiv preprint arXiv:2501.08789, 2025

  8. [8]

    MambaQuant: Quantizing the mamba family with variance alignment rotations.arXiv preprint arXiv:2504.16385, 2025

    Zukang Lin, Lirui Chen, Zheyu Yao, Qiang Du, and Song Han. MambaQuant: Quantizing the mamba family with variance alignment rotations.arXiv preprint arXiv:2504.16385, 2025

  9. [9]

    The Era of 1-bit LLMs: All Large Language Models are in 1.58 Bits

    Shuming Ma, Hongyu Wang, Lingxiao Ma, Lei Wang, Wenhui Wang, Shaohan Huang, Li Dong, Ruiping Wang, Jilong Xue, and Furu Wei. The era of 1-bit LLMs: All large language models are in 1.58 bits.arXiv preprint arXiv:2402.17764, 2024

  10. [10]

    Mamba- PTQ: Outlier channels in recurrent large language models.arXiv preprint arXiv:2407.12397, 2024

    Alessandro Pierro, Luca Rosenbauer, Annika Kuhn, Bernt Schiele, and Horst Possegger. Mamba- PTQ: Outlier channels in recurrent large language models.arXiv preprint arXiv:2407.12397, 2024

  11. [11]

    Colin Raffel, Noam Shazeer, Adam Roberts, Katherine Lee, Sharan Narang, Michael Matena, Yanqi Zhou, Wei Li, and Peter J. Liu. Exploring the limits of transfer learning with a unified text-to-text transformer.Journal of Machine Learning Research, 21(140):1–67, 2020

  12. [12]

    Bi-mamba: Towards accurate 1-bit state space models.Transac- tions on Machine Learning Research (TMLR), 2025

    Shengkun Tang, Yaqing Li, Caiying Leng, Xianjing Zhang, Yao Zhu, Ting Zhao, Di Niu, Mengdi Liu, Shiwei Tang, and Yubo Tian. Bi-mamba: Towards accurate 1-bit state space models.Transac- tions on Machine Learning Research (TMLR), 2025. arXiv:2411.11843

  13. [13]

    TernaryLLM: A lightweight ternary LLM with asymmetric dual learnable ternarization.arXiv preprint arXiv:2406.07177, 2025

    Shijie Yang, Zitao Guo, Zhuocheng Liu, and Meng Yang. TernaryLLM: A lightweight ternary LLM with asymmetric dual learnable ternarization.arXiv preprint arXiv:2406.07177, 2025

  14. [14]

    Slender-mamba: Fully quantized mamba in 1.58 bits from head to toe

    Zekun Yu, Takeshi Kojima, Yutaka Matsuo, and Yusuke Iwasawa. Slender-mamba: Fully quantized mamba in 1.58 bits from head to toe. InProceedings of the 31st International Conference on Computational Linguistics (COLING), pages 4715–4724, 2025

  15. [15]

    LREC: Low-rank error correction for quantized LLMs.arXiv preprint arXiv:2405.14673, 2024

    Yifei Zhang, Yifan Li, Qiang Li, and Wei Gao. LREC: Low-rank error correction for quantized LLMs.arXiv preprint arXiv:2405.14673, 2024. A Extended Results A.1 Training Hyperparameters A.2 Parameter Budget A.3 Negative Result: HG-GSQ Hessian-guided Gumbel-Softmax quantization (targeting top-30% importance weights with differen- tiable discrete optimization...