arxiv: 2606.26290 · v1 · pith:VOHAS2OQnew · submitted 2026-06-24 · 💻 cs.LG · cs.AI

SSM Adapters via Hankel Reduced-order Modeling: Injection Site Determines Task Suitability in Long-Context Fine-Tuning

Pith reviewed 2026-06-26 01:45 UTC · model grok-4.3

classification 💻 cs.LG cs.AI
keywords SSM adapters · Hankel reduced-order modeling · parameter-efficient fine-tuning · long-context tasks · MLP injection sites · state space models · LoRA comparison

The pith

Hankel-reduced SSM adapters in MLP blocks outperform LoRA on long-context tasks with matching parameter count.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper tests whether state space model adapters can serve as an alternative to attention-focused low-rank methods for fine-tuning on tasks that require accumulating sequential state over long contexts. It introduces the HRM adapter, an SSM residual module whose state matrices are initialized by balanced truncation of empirical Hankel Grammians, and shows that this choice permits an exact FFT-based parallel scan because the system matrix remains time-invariant. In controlled experiments that fix the number of trainable parameters at 8.4 million on Mistral-7B, the method records higher accuracy on QuALITY and higher ROUGE scores on QMSum than LoRA variants, and the advantage holds across synthetic state-tracking and character-level modeling suites. The work further finds that placing the adapter inside MLP blocks rather than attention projectors is decisive for realizing the gains.
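To make the 8.4 million figure concrete, here is a rough sketch of LoRA parameter arithmetic and of an iso-parametric check. The hidden size (4096) and layer count (32) are Mistral-7B's published dimensions. Everything else is an assumption for illustration: the choice of two square projections per layer at rank 16 is one hypothetical configuration that happens to land on about 8.4M, the HRM count is a made-up placeholder, and normalizing the gap by the LoRA count is a guess at how the 0.1% tolerance is defined.

```python
def lora_params(rank, d_in, d_out):
    """Trainable parameters of one LoRA pair: A (rank x d_in) plus B (d_out x rank)."""
    return rank * (d_in + d_out)


def relative_gap(p_ref, p_other):
    """Relative difference between two parameter counts, normalized by p_ref."""
    return abs(p_ref - p_other) / p_ref


HIDDEN = 4096   # Mistral-7B hidden size
LAYERS = 32     # Mistral-7B decoder layers

# Hypothetical configuration: rank 16 on two 4096 x 4096 projections per layer.
LORA_TOTAL = LAYERS * 2 * lora_params(16, HIDDEN, HIDDEN)

# Placeholder HRM count, NOT taken from the paper.
HYPOTHETICAL_HRM_TOTAL = 8_390_000

if __name__ == "__main__":
    print(f"LoRA total: {LORA_TOTAL:,}")
    gap = relative_gap(LORA_TOTAL, HYPOTHETICAL_HRM_TOTAL)
    print(f"relative gap: {gap:.4%}, within 0.1%: {gap <= 0.001}")
```

The point of the sketch is only that a tolerance of 0.1% at this scale allows a difference of roughly eight thousand parameters, so the adapter dimensions have to be chosen jointly rather than independently.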

Core claim

An SSM adapter initialized by balanced truncation of empirical Hankel Grammians and injected at MLP sites supplies a parameter-efficient residual that matches LoRA's compute cost through FFT scanning while delivering higher task performance on long-context sequence modeling benchmarks.

What carries the argument

The HRM adapter: an SSM residual module whose matrices are obtained by balanced truncation of empirical Hankel Grammians, allowing exact FFT-based parallel scan via the preserved time-invariance of the system matrix.
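For orientation, the standard identity behind an FFT-based scan of a time-invariant recurrence can be written out. This is a sketch under assumptions: a discrete-time linear recurrence with zero initial state, an input map $\bar{B}$ and an output map $\bar{C}$. The symbol $\bar{C}$ and the exact form of the adapter (gating, residual connection) are not given in this summary and are assumed here.

```latex
h_t = \bar{A}\,h_{t-1} + \bar{B}\,x_t, \qquad
y_t = \bar{C}\,h_t, \qquad h_{-1} = 0
\quad\Longrightarrow\quad
y_t = \sum_{j=0}^{t} \bar{K}_j\, x_{t-j}, \qquad
\bar{K}_j = \bar{C}\,\bar{A}^{\,j}\,\bar{B}.
```

Because $\bar{K}_j$ depends only on the lag $j$ and not on the position $t$, the output is a causal convolution of the kernel with the input. Zero-padding both to length at least $2T-1$ and multiplying their FFTs reproduces the sequential result in $O(T \log T)$ time, up to floating-point rounding. If $\bar{A}$ varied with $t$, the kernel would no longer be shift-invariant and this shortcut would not apply.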

If this is right

  • HRM shows consistent gains across 18 configurations spanning synthetic state tracking (DFA, parity) and enwik8 character-level language modeling.
  • Gate analysis indicates the adapter learns to modulate its own recurrence, supplying an architectural alternative to low-rank updates.
  • Placing the adapter in MLP blocks rather than attention projectors is required for the observed superiority on state-accumulation tasks.
  • Computational cost stays on par with LoRA at every context length because the time-invariant scan can be computed exactly with an FFT.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

  • The same Hankel truncation step could be applied to initialize SSM adapters inside other base architectures without retraining the reduction.
  • Task suitability may be predictable from whether the target problem rewards explicit state accumulation rather than attention mixing.
  • If the reduced-order model is kept fixed across tasks, the approach could lower the engineering cost of adapting new long-context models.

Load-bearing premise

Balanced truncation of empirical Hankel Grammians yields an initialization for the SSM adapter that transfers usefully to downstream fine-tuning without any task-specific re-derivation of the reduced-order model.
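As background for this premise, the textbook form of balanced truncation for a stable discrete-time system is sketched below. The assumptions are loud ones: the paper's Gramians are empirical (estimated from data), whereas these are the idealized infinite-horizon definitions; $\bar{C}$ is an assumed output map; and the relative threshold rule for choosing $\hat{d}$ is a guess at how a cutoff such as $\varepsilon = 0.01$ might be applied.

```latex
W_c = \sum_{k \ge 0} \bar{A}^{k}\,\bar{B}\bar{B}^{\top}\,(\bar{A}^{\top})^{k},
\qquad
W_o = \sum_{k \ge 0} (\bar{A}^{\top})^{k}\,\bar{C}^{\top}\bar{C}\,\bar{A}^{k},
\\[4pt]
\sigma_i = \sqrt{\lambda_i\!\left(W_c W_o\right)},
\qquad \sigma_1 \ge \sigma_2 \ge \dots \ge \sigma_n \ge 0,
\\[4pt]
\hat{d} = \#\left\{\, i : \sigma_i / \sigma_1 > \varepsilon \,\right\}.
```

The $\sigma_i$ are the Hankel singular values. A state direction with small $\sigma_i$ is hard to reach from the input, hard to observe at the output, or both, so discarding it changes the input-output map little. How well an initialization chosen this way survives subsequent gradient updates is the empirical question the premise turns on.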

What would settle it

Run the same iso-parametric comparison on Mistral-7B but replace the LongBench suite with a new long-context task whose state accumulation demands differ sharply from QuALITY or QMSum; if HRM then falls below or equal to the LoRA baseline, the claim that the Hankel initialization supplies a generally suitable adapter is falsified.

Figures

Figures reproduced from arXiv: 2606.26290 by Omanshu Thapliyal.

Figure 1. Architecture comparison. LoRA modifies weight matrices; its output at position t is a static function of x_t. The HRM adapter inserts a parallel recurrent branch whose hidden state integrates all prior representations.
… et al., 2023b), QLoRA (Dettmers et al., 2023), LoRA+ (Hayou et al., 2024): all compute h = f(x_t) with no dependence on t or prior positions. AdaLoRA adaptively allocates rank but the resulti…
Figure 2. DFA state tracking results: (left) HRM vs. HRM with balanced truncation vs. LoRA, (right) Hankel singular value decay rate for the task, with HSV cutoff threshold = 0.01.
6. MAESTRO Piano Language Modeling. MAESTRO v2 (Hawthorne et al., 2018) is a dataset of ∼200 hours of professional piano performances in symbolic MIDI format. We treat it as a character-level language modeling task: each MIDI event (note-on…
Figure 3. MAESTRO piano language modeling. (left) HRM vs. LoRA, (right) Final BPC.
Figure 4. Mistral-7B HRM: Hankel singular value decay curves for (left) QuALITY, (middle) QMSum, (right) NarrativeQA.
… conclusion. The HRM adapter provides a recurrent SSM residual that maintains a running integration of the MLP's content representations across sequence positions. The relative improvement is therefore consistent with this mechanistic prediction. QMSum requires generating a focused summary of a meeting…
Figure 5. HRM gate values per layer for Mistral-7B, all three LongBench tasks (gate init = 0.1).
… enforce sustained HRM adapter contribution, and directly test whether larger training times translate to performance gains. Impact Statement: This paper presents work whose goal is to advance the field of Machine Learning. There are many potential societal consequences of our work, none which we feel must be specifically …
Figure 6. Maximum absolute error between FFT and sequential scan outputs over 100 random input sequences at varying T.
Since FFT operations in float32 introduce rounding errors of order $10^{-7}$, we empirically compare sequential and FFT outputs across 100 random sequences. We find that the maximum absolute error is $< 5 \times 10^{-6}$, negligible for gradient computation, confirming the FFT equivalence is exact up to …
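The kind of check described in this caption can be sketched in a few lines. The example below is not the authors' code: it uses a single scalar state mode, illustrative values for a, b and c, Python's double-precision floats rather than float32, and a small hand-written radix-2 FFT so that it runs with the standard library alone.

```python
import cmath
import random


def fft(values, invert=False):
    """Recursive radix-2 FFT; len(values) must be a power of two."""
    n = len(values)
    if n == 1:
        return list(values)
    even = fft(values[0::2], invert)
    odd = fft(values[1::2], invert)
    sign = 1 if invert else -1
    out = [0j] * n
    for k in range(n // 2):
        twiddle = cmath.exp(sign * 2j * cmath.pi * k / n) * odd[k]
        out[k] = even[k] + twiddle
        out[k + n // 2] = even[k] - twiddle
    return out


def sequential_scan(a, b, c, xs):
    """y_t = c * h_t with h_t = a * h_{t-1} + b * x_t and zero initial state."""
    h, ys = 0.0, []
    for x in xs:
        h = a * h + b * x
        ys.append(c * h)
    return ys


def fft_scan(a, b, c, xs):
    """Same output via causal convolution with kernel k_j = c * a**j * b."""
    T = len(xs)
    n = 1
    while n < 2 * T:  # pad so circular convolution equals linear convolution
        n *= 2
    kernel = [c * a**j * b for j in range(T)] + [0.0] * (n - T)
    padded = list(xs) + [0.0] * (n - T)
    spectrum = [p * q for p, q in zip(fft(kernel), fft(padded))]
    full = fft(spectrum, invert=True)
    return [v.real / n for v in full[:T]]  # inverse FFT needs the 1/n factor


def max_abs_error(T, a=0.97, b=0.5, c=1.3, seed=0):
    """Largest gap between the two scans on one random Gaussian input."""
    rng = random.Random(seed)
    xs = [rng.gauss(0.0, 1.0) for _ in range(T)]
    seq = sequential_scan(a, b, c, xs)
    par = fft_scan(a, b, c, xs)
    return max(abs(u - v) for u, v in zip(seq, par))


if __name__ == "__main__":
    for T in (64, 256, 1024):
        print(T, max_abs_error(T))
```

In double precision the gap is far smaller than the float32 figure quoted in the caption, which is consistent with the discrepancy being rounding error and not an approximation in the method.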
Figure 7. HSV decay for DFA, enwik8, MAESTRO, QuALITY, and QMSum.
SVD identifies directions that are large in individual matrices. HSVs identify directions that are simultaneously reachable and observable in the complete input-output system. Due to the causal input-output relation encoded by the Hankel operator, a singular value of $\bar{B}$ alone would not be able to drive model reduction. Only the Grammian product $W_c W_o$…
Figure 8. enwik8 BPC learning curves at Tier 2 (HRM d = 32 vs. LoRA r = 16) for T ∈ {512, 1024, 2048}. The region between curves represents the BPC advantage of HRM over LoRA.
The HRM adapter's dominant state mode has a learned eigenvalue $\bar{a}_{\max}$ ($\bar{a}_{\max} \approx 0.97$–$0.99$ after training). The fraction of signal retained from a token $k$ steps ago is $\bar{a}_{\max}^{k}$. At T=512, the adapter retains $\bar{a}_{\max}^{256} \approx 0.97^{256} \approx 0.0006$ of signal f…
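The retention arithmetic in this excerpt is easy to reproduce. The short sketch below evaluates the geometric decay for the two ends of the quoted eigenvalue range; the function names are illustrative. Direct evaluation gives roughly 4.1e-4 for 0.97 at lag 256, somewhat below the 0.0006 quoted in the caption though of the same order, and about 0.076 for 0.99.

```python
import math


def retained_fraction(a_max, k):
    """Fraction of a token's contribution left in the dominant mode after k steps."""
    return a_max ** k


def steps_to_fraction(a_max, fraction):
    """Smallest integer k such that a_max**k <= fraction (requires 0 < a_max < 1)."""
    return math.ceil(math.log(fraction) / math.log(a_max))


if __name__ == "__main__":
    for a_max in (0.97, 0.99):
        print(
            f"a_max={a_max}: "
            f"retained at k=256 is {retained_fraction(a_max, 256):.2e}, "
            f"half-life is {steps_to_fraction(a_max, 0.5)} steps"
        )
```

The gap between the two ends of the range is large: moving the eigenvalue from 0.97 to 0.99 raises retention at lag 256 by more than two orders of magnitude, so small changes in the learned eigenvalue matter a great deal for effective memory length.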
Figure 9. (left) Parity accuracy at medium model capacity, 3-seed mean ± std. All adapters are near chance (0.50); parity is near-intractable for a small frozen backbone at T = 256. (middle) HRM advantage (HRM mean − LoRA mean) on DFA vs. parity. (right) Training curve.
Figure 10. BT threshold ablation on DFA: (left) $\hat{d}$ vs. $\hat{d}$ threshold (layer 0), (right) accuracy vs. ε threshold.
Figure 11. DFA T=128 and T=256: LoRA (rank) vs. HRM (state dim).
Rank vs. Accuracy. To verify that HRM's advantage is not merely a parameter count artifact, we sweep LoRA rank r ∈ {4, 8, 16, 32, 64, 128} and HRM state dim d ∈ {4, 8, 16, 32, 64} on DFA at T=128 and T=256.
Figure 12. Data efficiency for DFA and Parity at T=128: validation accuracy vs. number of training examples.
Original abstract

While parameter-efficient fine-tuning (PEFT) typically targets attention projectors, its efficacy for tasks requiring sequential state accumulation remains under-explored. We examine whether PEFT for such tasks can benefit from state space model (SSM) adapters, and whether MLP blocks are better injection sites. We introduce the Hankel Reduced-order Model (HRM) adapter, an SSM-based residual module initialized via Balanced Truncation of empirical Hankel Grammians. By leveraging the time-invariance of the system matrix $\bar{A}$, HRM enables an exact FFT-based parallel scan, achieving computational parity with LoRA across all context lengths. In iso-parametric evaluations on Mistral-7B (8.4M trainable parameters), HRM outperforms LoRA variants on LongBench tasks, including QuALITY (+34.8% relative accuracy) and QMSum (+71.6% relative ROUGE-1). HRM further demonstrates consistent superiority across 18 configurations of synthetic state-tracking (DFA, Parity) and character-level language modeling (enwik8). Gate analysis reveals that HRM adapters effectively learn to modulate recurrence, providing a robust architectural alternative to low-rank adaptation for long-context sequence modeling.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Referee Report

3 major / 2 minor

Summary. The paper introduces Hankel Reduced-order Model (HRM) adapters as an SSM-based PEFT method for long-context fine-tuning. HRM is initialized via balanced truncation of empirical Hankel Grammians, injected into MLP blocks of models like Mistral-7B, and leverages time-invariant system matrices for exact FFT-based parallel scans. In iso-parametric comparisons (8.4M trainable params), it reports outperforming LoRA variants on LongBench (e.g., +34.8% relative accuracy on QuALITY, +71.6% ROUGE-1 on QMSum) and across 18 synthetic state-tracking and language modeling configurations, with gate analysis showing learned modulation of recurrence.

Significance. If the results hold under rigorous verification, the work provides evidence that SSM adapters can outperform standard low-rank methods for tasks involving sequential state accumulation, with injection site mattering and the reduced-order initialization enabling efficient inference. The computational parity with LoRA via FFT scans is a practical strength for long contexts.

major comments (3)
  1. [Abstract, §4] Abstract and §4 (empirical evaluation): the central claim of consistent outperformance (e.g., +34.8% on QuALITY) is reported without error bars, statistical tests, number of runs, or full baseline hyperparameter details; this makes it impossible to determine whether the gains are robust or could arise from post-hoc selection of injection site or truncation order.
  2. [§3.2] §3.2 (HRM initialization): the balanced truncation procedure relies on empirical Hankel Grammians, but the manuscript does not specify or ablate the input sequences used for their estimation (random vs. task-specific data); without evidence that performance is insensitive to this choice, the transferability claim without task-specific re-derivation remains unverified and load-bearing for the method's generality.
  3. [§4.3] §4.3 (ablation studies): no ablation is presented on the reduced model order (the sole free parameter listed in the axiom ledger), which directly controls the initialization quality and computational cost; this omission leaves open whether the reported superiority holds across reasonable orders or is tuned to the evaluated tasks.
minor comments (2)
  1. [§3] Notation for the reduced system matrix $\bar{A}$ is introduced without an explicit equation linking it to the original SSM parameters; a clarifying equation would improve readability.
  2. [Tables 1-3, Figures 2-4] Table captions and figure legends should explicitly state the number of random seeds and whether results are averaged; this is standard for empirical PEFT papers.

Simulated Author's Rebuttal

3 responses · 0 unresolved

We thank the referee for the constructive feedback highlighting issues of statistical robustness, initialization transparency, and ablation completeness. We will revise the manuscript to incorporate error bars, statistical tests, full hyperparameter details, clarification on Hankel input sequences with supporting ablation, and an ablation on reduced model order.

Point-by-point responses
  1. Referee: [Abstract, §4] Abstract and §4 (empirical evaluation): the central claim of consistent outperformance (e.g., +34.8% on QuALITY) is reported without error bars, statistical tests, number of runs, or full baseline hyperparameter details; this makes it impossible to determine whether the gains are robust or could arise from post-hoc selection of injection site or truncation order.

    Authors: We agree the lack of error bars and statistical tests weakens claims of robustness. In revision we will rerun all LongBench and synthetic experiments with 5 random seeds, report means ± std, add paired statistical tests (e.g., Wilcoxon), and document the full hyperparameter grids searched for LoRA, DoRA, and other baselines in the appendix. This will also document the protocol used to select injection site and truncation order, mitigating post-hoc selection concerns. revision: yes

  2. Referee: [§3.2] §3.2 (HRM initialization): the balanced truncation procedure relies on empirical Hankel Grammians, but the manuscript does not specify or ablate the input sequences used for their estimation (random vs. task-specific data); without evidence that performance is insensitive to this choice, the transferability claim without task-specific re-derivation remains unverified and load-bearing for the method's generality.

    Authors: Hankel Grammians were estimated from random Gaussian sequences of length 1024; we will state this explicitly in §3.2. We will also add a targeted ablation comparing random inputs against task-specific sequences drawn from LongBench and synthetic data, showing that random inputs produce comparable downstream performance. This supports the transferability claim while addressing the referee's concern. revision: partial

  3. Referee: [§4.3] §4.3 (ablation studies): no ablation is presented on the reduced model order (the sole free parameter listed in the axiom ledger), which directly controls the initialization quality and computational cost; this omission leaves open whether the reported superiority holds across reasonable orders or is tuned to the evaluated tasks.

    Authors: We agree an ablation on reduced order is necessary. The order was fixed at 16 to enforce iso-parametric comparison (8.4M parameters). In revision we will add results for orders 8, 16, 24, and 32 on QuALITY, QMSum, DFA, Parity, and enwik8, together with corresponding inference-time measurements, demonstrating that superiority over LoRA holds across this range. revision: yes
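The reporting protocol promised in the first response (five seeds, mean ± std, a paired Wilcoxon test) can be sketched with the standard library. The scores below are made-up placeholders, not results from the paper, and the exact test shown assumes no zero differences and no tied absolute differences.

```python
from itertools import product
from statistics import mean, stdev

# Placeholder per-seed scores (percent); NOT results from the paper.
HRM_SCORES = [50, 53, 56, 52, 57]
LORA_SCORES = [41, 42, 40, 45, 39]


def summarize(scores):
    """Return (mean, sample standard deviation) across seeds."""
    return mean(scores), stdev(scores)


def wilcoxon_exact_p(xs, ys):
    """Exact two-sided p-value of the paired Wilcoxon signed-rank test.

    Enumerates all 2**n sign assignments, so it is only meant for small n.
    Assumes no zero differences and no ties among absolute differences.
    """
    diffs = [x - y for x, y in zip(xs, ys)]
    order = sorted(range(len(diffs)), key=lambda i: abs(diffs[i]))
    ranks = [0] * len(diffs)
    for rank, i in enumerate(order, start=1):
        ranks[i] = rank
    observed = sum(r for r, d in zip(ranks, diffs) if d > 0)
    center = sum(ranks) / 2  # mean of the statistic under the null
    extreme = 0
    for signs in product((0, 1), repeat=len(diffs)):
        w = sum(r for r, s in zip(ranks, signs) if s)
        if abs(w - center) >= abs(observed - center):
            extreme += 1
    return extreme / 2 ** len(diffs)


if __name__ == "__main__":
    for name, scores in (("HRM", HRM_SCORES), ("LoRA", LORA_SCORES)):
        m, s = summarize(scores)
        print(f"{name}: {m:.1f} ± {s:.1f}")
    print("exact two-sided p:", wilcoxon_exact_p(HRM_SCORES, LORA_SCORES))
```

One consequence worth noting: with five paired seeds, the smallest two-sided p-value this test can produce is 2/32 = 0.0625, reached when one method wins on every seed. A conventional 0.05 threshold therefore cannot be met with five seeds on a single task; pooling pairs across tasks or using more seeds would be needed.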

Circularity Check

0 steps flagged

No circularity: empirical claims rest on benchmark results, not self-referential derivations

Full rationale

The paper introduces an SSM adapter initialized by balanced truncation of empirical Hankel Grammians and reports iso-parametric gains versus LoRA on LongBench and synthetic tasks. No equations, predictions, or uniqueness theorems are presented that reduce by construction to fitted inputs or prior self-citations. All load-bearing statements are experimental comparisons (e.g., +34.8% on QuALITY), which are externally falsifiable and do not form a closed loop with the initialization procedure itself.

Axiom & Free-Parameter Ledger

1 free parameter · 1 axiom · 0 invented entities

The approach imports balanced truncation from control theory as the core initialization step; the only free parameter visible is the reduced model order chosen for the adapter. No new physical entities are postulated.

free parameters (1)
  • reduced model order
    The truncation rank in balanced truncation of the Hankel Gramian is a modeling choice that determines adapter capacity and must be selected per task or model.
axioms (1)
  • domain assumption: Balanced truncation of the empirical Hankel Gramian yields a faithful low-order approximation suitable for neural adapter initialization.
    Invoked to justify the HRM construction before fine-tuning begins.

pith-pipeline@v0.9.1-grok · 5741 in / 1328 out tokens · 26603 ms · 2026-06-26T01:45:02.241201+00:00 · methodology

Reference graph

Works this paper leans on

18 extracted references · 10 canonical work pages · 7 internal anchors

  1. [1]

    Griffin: Mixing Gated Linear Recurrences with Local Attention for Efficient Language Models

    De, S., Smith, S. L., Fernando, A., Botev, A., Cristian-Muraru, G., Gu, A., Haroun, R., Berrada, L., Chen, Y., Srinivasan, S., et al. Griffin: Mixing gated linear recurrences with local attention for efficient language models. arXiv preprint arXiv:2402.19427,

  2. [2]

    Transformer feed-forward layers are key-value memories

    Geva, M., Schuster, R., Berant, J., and Levy, O. Transformer feed-forward layers are key-value memories. In Proceedings of the 2021 Conference on Empirical Methods in Natural Language Processing, pp. 5484–5495,

  3. [3]

    Mamba: Linear-Time Sequence Modeling with Selective State Spaces

    Gu, A. and Dao, T. Mamba: Linear-time sequence modeling with selective state spaces. arXiv preprint arXiv:2312.00752,

  4. [4]

    Efficiently Modeling Long Sequences with Structured State Spaces

    Gu, A., Goel, K., and Ré, C. Efficiently modeling long sequences with structured state spaces. arXiv preprint arXiv:2111.00396,

  5. [5]

    Flora: Low-rank adapters are secretly gradient compressors

    Hao, Y., Cao, Y., and Mou, L. Flora: Low-rank adapters are secretly gradient compressors. arXiv preprint arXiv:2402.03293,

  6. [6]

    Enabling Factorized Piano Music Modeling and Generation with the MAESTRO Dataset

    Hawthorne, C., Stasyuk, A., Roberts, A., Simon, I., Huang, C.-Z. A., Dieleman, S., Elsen, E., Engel, J., and Eck, D. Enabling factorized piano music modeling and generation with the MAESTRO dataset. arXiv preprint arXiv:1810.12247,

  7. [7]

    LoRA+: Efficient low rank adaptation of large models

    Hayou, S., Ghosh, N., and Yu, B. LoRA+: Efficient low rank adaptation of large models. arXiv preprint arXiv:2402.12354,

  8. [9]

    Mistral 7B

    URL https://arxiv.org/abs/2310.06825.
    Kočiský, T., Schwarz, J., Blunsom, P., Dyer, C., Hermann, K. M., Melis, G., and Grefenstette, E. The NarrativeQA reading comprehension challenge. Transactions of the Association for Computational Linguistics, 6:317–328,

  9. [10]

    The power of scale for parameter-efficient prompt tuning

    Lester, B., Al-Rfou, R., and Constant, N. The power of scale for parameter-efficient prompt tuning. In Proceedings of the 2021 Conference on Empirical Methods in Natural Language Processing, pp. 3045–3059,

  10. [11]

    Jamba: A Hybrid Transformer-Mamba Language Model

    Lieber, O., Lenz, B., Bata, H., Cohen, G., Osin, J., Dalmedigos, I., Safahi, E., Meirom, S., Belinkov, Y., Shalev-Shwartz, S., et al. Jamba: A hybrid transformer-mamba language model. arXiv preprint arXiv:2403.19887,

  11. [12]

    Liu, H., Tam, D., Muqeeth, M., Mohta, J., Huang, T., Bansal, M., and Raffel, C. A. Few-shot parameter-efficient fine-tuning is better and cheaper than in-context learning. Advances in Neural Information Processing Systems, 35:1950–1965,

  12. [13]

    QuALITY: Question answering with long input texts, yes!

    Pang, R. Y., Parrish, A., Joshi, N., Nangia, N., Phang, J., Chen, A., Padmakumar, V., Ma, J., Thompson, J., He, H., et al. QuALITY: Question answering with long input texts, yes! In Proceedings of the 2022 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, pp. 5336–5358,

  13. [14]

    Can Mamba learn how to learn? A comparative study on in-context learning tasks

    Park, J., Park, J., Xiong, Z., Lee, N., Cho, J., Oymak, S., Lee, K., and Papailiopoulos, D. Can Mamba learn how to learn? A comparative study on in-context learning tasks. arXiv preprint arXiv:2402.04248,

  14. [15]

    Adapterhub: A framework for adapting transformers

    Pfeiffer, J., Rücklé, A., Poth, C., Kamath, A., Vulić, I., Ruder, S., Cho, K., and Gurevych, I. AdapterHub: A framework for adapting transformers. In Proceedings of the 2020 Conference on Empirical Methods in Natural Language Processing: System Demonstrations, pp. 46–54,

  15. [16]

    AdaLoRA: Adaptive Budget Allocation for Parameter-Efficient Fine-Tuning

    Zhang, M., Chen, H., Shen, C., Yang, Z., Ou, L., Yu, X., and Zhuang, B. LoRAPrune: Pruning meets low-rank parameter-efficient fine-tuning. 2023a.
    Zhang, Q., Chen, M., Bukharin, A., Karampatziakis, N., He, P., Cheng, Y., Chen, W., and Zhao, T. AdaLoRA: Adaptive budget allocation for parameter-efficient fine-tuning. arXiv preprint arXiv:2303.10512, 2023b. ...

  16. [17]

    Iso-parameter Comparison

    A. Iso-parameter Comparison. In order to compare HRM and LoRA on an equal footing, we ensure that $r$ and $\hat{d}$ are chosen such that $|P_{\mathrm{LoRA}} - P_{\mathrm{HRM}}| \le 0.1\%$. Such an iso-parametric table to choose $r$ and $\hat{d}$ is shown below. All experiments in the paper use all three tiers to demonstrate consistency, and conclusions...

  17. [18]

    The HRM adapter's dominant state mode has a learned eigenvalue $\bar{a}_{\max}$ ($\bar{a}_{\max} \approx 0.97$–$0.99$ after training)

    The region between curves represents the BPC advantage of HRM over LoRA. The HRM adapter's dominant state mode has a learned eigenvalue $\bar{a}_{\max}$ ($\bar{a}_{\max} \approx 0.97$–$0.99$ after training). The fraction of signal retained from a token $k$ steps ago is $\bar{a}_{\max}^{k}$. At T=512, the adapter retains $\bar{a}_{\max}^{256} \approx 0.97^{256} \approx 0.0006$ of signal from the midpoint of the context window. Thi...

  18. [19]

    This contrast validates the memory hypothesis: HRM helps when multi-dimensional state is required, not when the task can be solved by single-bit counting

    … show that DFA exhibits a large advantage with T while parity shows essentially zero HRM benefit. This contrast validates the memory hypothesis: HRM helps when multi-dimensional state is required, not when the task can be solved by single-bit counting. F. LongBench Tasks. Table 4. Comparison of HRM against baselines on LongBench: QuALITY, QMSum, NarrativeQ...