Bilinear Input Modulation for Mamba: Koopman Bilinear Forms for Memory Retention and Multiplicative Computation

Hiroki Fujii; Masaki Yamakita

arXiv: 2604.17221 · v2 · submitted 2026-04-19 · eess.SY · cs.LG · cs.SY · math.DS

Pith reviewed 2026-05-10 06:21 UTC · model grok-4.3

keywords: bilinear input modulation · state space models · Mamba · Koopman bilinear forms · memory retention · multiplicative computation · NARMA-10 · selective SSM

The pith

Factorized bilinear input modulation augments selective SSMs with state-input products to improve both memory retention and multiplicative computation.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper proposes adding a bilinear state-input product term to Mamba-style state space models, framed as a finite-dimensional Koopman bilinear form. After sharing states across channels, the modulation is realized in three ways: a full sequential bilinear version, a linearized gated version for parallel scanning, and a parallel bilinear version on the state transition. On a memory-heavy pendulum task and a bilinear NARMA-10 task, the gated version boosts memory but not computation, while both bilinear versions improve both capabilities. Only the bilinear versions gain from larger state dimensions, showing that the bilinear mechanism uniquely exploits expanded state spaces.

Core claim

Selective SSMs with diagonal transitions are limited in memory and bilinear capacity. Introducing a factorized bilinear input modulation that computes a state-input product and routes it through complementary paths yields SSMs that retain longer histories and perform multiplicative operations more effectively. The sequential and parallel bilinear variants both succeed on bilinear computation where gating alone fails, and they alone improve when state dimension grows, while pathway ablation shows the two routes of the bilinear signal play distinct supporting roles.

What carries the argument

The factorized bilinear input modulation, which augments the SSM recurrence with an explicit state-input product term interpretable as a Koopman bilinear form and admits parallel or sequential realizations while preserving scan efficiency where possible.
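
As a rough illustration of what an explicit state-input product adds to a diagonal recurrence, the sketch below implements a generic update `h_next = a * h + B u + U((V h) * (W u))` in plain Python. The low-rank factors `U`, `V`, `W`, the dimensions, and all function names are assumptions made for this example. The material summarized here does not give the paper's equations, so this should not be read as seq-BIM or p-BIM themselves.

```python
import random

def matvec(M, v):
    """Multiply a list-of-rows matrix by a vector."""
    return [sum(m * x for m, x in zip(row, v)) for row in M]

def bilinear_term(U, V, W, h, u):
    """Factorized state-input product U @ ((V @ h) * (W @ u)).

    Linear in h for fixed u and linear in u for fixed h, hence bilinear.
    The rank-r factorization (U, V, W) is an assumption of this sketch.
    """
    vh = matvec(V, h)  # project the state to r dimensions
    wu = matvec(W, u)  # project the input to r dimensions
    return matvec(U, [p * q for p, q in zip(vh, wu)])

def step(a, B, U, V, W, h, u):
    """One update: diagonal decay + linear input + bilinear product."""
    lin = matvec(B, u)
    bil = bilinear_term(U, V, W, h, u)
    return [ai * hi + li + bi for ai, hi, li, bi in zip(a, h, lin, bil)]

def random_matrix(rows, cols, rng, scale=0.1):
    """Small random matrix standing in for learned weights."""
    return [[rng.uniform(-scale, scale) for _ in range(cols)]
            for _ in range(rows)]

rng = random.Random(0)
d_state, d_in, rank = 4, 2, 3          # hypothetical sizes
a = [0.9] * d_state                    # diagonal transition
B = random_matrix(d_state, d_in, rng)
U = random_matrix(d_state, rank, rng)
V = random_matrix(rank, d_state, rng)
W = random_matrix(rank, d_in, rng)

h = [0.0] * d_state
for u in ([1.0, 0.5], [0.2, -0.3], [0.0, 1.0]):
    h = step(a, B, U, V, W, h, u)
```

Setting `U` to zero recovers a plain diagonal linear recurrence, which is one way to see the product term as an add-on rather than a replacement.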

If this is right

The bilinear mechanism is uniquely able to exploit increases in SSM state dimension for performance gains.
The two downstream routes of the bilinear signal play complementary roles, as confirmed by pathway ablation.
Gated modulation improves memory retention but leaves bilinear computation largely unchanged.
Parallel bilinear modulation achieves the same performance gains as the sequential version while remaining compatible with efficient parallel scan.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

The dissociation between memory and computation benefits could guide construction of hybrid SSM architectures that selectively apply bilinear terms only where needed.
Testing the same modulation on longer sequences or real-world time-series data would check whether the reported scaling with state size continues to hold.
The Koopman bilinear framing suggests the approach might transfer to other recurrent or attention-based models that currently rely on linear transitions.
If the efficiency advantage is preserved at scale, bilinear SSMs could become a drop-in replacement for gated variants in applications requiring strong multiplicative interactions.

Load-bearing premise

The chosen pendulum and NARMA-10 tasks plus the observed dissociation between memory and bilinear benefits are enough to establish general gains for arbitrary sequence modeling without hidden efficiency penalties.

What would settle it

Running the bilinear variants on a standard language-modeling benchmark and finding no consistent improvement over gated baselines or no additional benefit from larger state dimensions.

Figures

Figures reproduced from arXiv: 2604.17221 by Hiroki Fujii, Masaki Yamakita.

**Figure 1.** AR MSE distribution over 11 seeds. (a) Input…

**Figure 3.** NARMA-10: Training loss curves (11 seeds, mean…

**Figure 4.** NARMA-10: Sequence length L vs AR MSE across architectures at d_s ∈ {8, 16}. Dashed: d_s = 8, solid: d_s = 16. Both bilinear variants (seq-BIM and p-BIM) separate clearly from the non-bilinear baselines.

**Figure 5.** Parameter count vs AR MSE (median) on NARMA…

Original abstract

Selective State Space Models (SSMs), notably Mamba, employ diagonal state transitions that limit both memory retention and bilinear computational capacity. We propose a factorized bilinear input modulation that augments the SSM with a state-input product, interpretable as a finite-dimensional Koopman bilinear form. After introducing a shared state across channels (Coupled SSM), the modulation admits three implementations. Coupled Bilinear Input Modulation (seq-BIM) retains the full bilinear product on the input side at the cost of sequential computation, Coupled Gated Modulation (GM) linearizes it into a gate modulation that is compatible with the parallel scan, and Parallel Bilinear Input Modulation (p-BIM) places the same bilinear product on the state transition while remaining parallel-scannable. Experiments on a multiple input-delay pendulum (memory retention) and NARMA-10 (bilinear computation) reveal a clear dissociation. GM substantially improves memory retention but not bilinear computation, while both seq-BIM and p-BIM improve both. A pathway ablation confirms that the two downstream routes of the bilinear signal serve complementary roles. The improvement is statistically robust, with the bilinear variants consistently outperforming the other variants on bilinear computation. Furthermore, only the bilinear variants benefit from increasing the SSM state dimension, while coupling or gate modulation alone show no improvement, establishing the bilinear mechanism as uniquely capable of exploiting larger state spaces.
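
For readers unfamiliar with the second benchmark, the sketch below generates data from the commonly used form of the NARMA-10 recurrence, whose `y * sum(y)` and `u[t-9] * u[t]` terms are the multiplicative interactions a model has to reproduce. The function name, the seed handling, and the zero initial conditions are choices made for this example; the abstract does not say which exact variant or input range the authors used.

```python
import random

def narma10(length, seed=0):
    """Generate (u, y) from the standard NARMA-10 recurrence.

    y[t+1] = 0.3*y[t] + 0.05*y[t]*sum(y[t-9..t]) + 1.5*u[t-9]*u[t] + 0.1
    with u[t] drawn i.i.d. from Uniform(0, 0.5) and y[0..9] set to 0.
    """
    rng = random.Random(seed)
    u = [rng.uniform(0.0, 0.5) for _ in range(length)]
    y = [0.0] * length
    for t in range(9, length - 1):
        window = sum(y[t - 9:t + 1])  # y[t-9] ... y[t], ten terms
        y[t + 1] = (0.3 * y[t] + 0.05 * y[t] * window
                    + 1.5 * u[t - 9] * u[t] + 0.1)
    return u, y

u, y = narma10(200, seed=0)
```

The target at step `t + 1` depends on a product of two inputs ten steps apart, so the task needs both a ten-step memory and a multiplication.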

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Desk Editor's Note (private letter to a colleague)

The bilinear modulations create a useful dissociation on the two tested tasks and show state-dimension scaling only for the bilinear variants, but the evidence stays narrow.

The paper's core move is to factor a bilinear state-input product into Mamba-style SSMs after first coupling channels into a shared state. The authors give three concrete versions: seq-BIM keeps the full product on the input side (sequential), GM turns it into a gate that stays parallel-scannable, and p-BIM moves the product onto the state transition while keeping the scan. The experiments on the input-delay pendulum and NARMA-10 then separate the effects cleanly: GM lifts memory retention without helping the bilinear task, while both seq-BIM and p-BIM lift both, and only the bilinear routes gain from larger state dimension. The pathway ablation also shows the two signal routes are complementary rather than redundant.

That dissociation and the scaling observation are the clearest new pieces; they are not just claimed but shown against the non-bilinear controls on these tasks. The abstract notes consistent outperformance and statistical robustness, which is better than many SSM variants that stop at single-task wins.

The soft spots are straightforward. Everything rests on two specific dynamical tasks; there are no language-modeling runs, no standard time-series suites, and no wall-clock or FLOPs numbers to check whether the parallel versions actually keep the original efficiency edge. Without those, it is still possible that the observed advantages are tied to the input statistics or dynamics of the pendulum and NARMA-10 rather than being general properties of the bilinear mechanism. The framing as Koopman bilinear forms is reasonable but does not add new theory beyond the implementation.

This is worth a serious referee for groups working on SSM extensions or Koopman-style models in control and sequence tasks. The ablations are honest and the dissociation is informative even if the scope is limited, so it should go to review rather than desk reject.

Referee Report

3 major / 1 minor

Summary. The paper proposes a factorized bilinear input modulation for Selective State Space Models (SSMs) such as Mamba, framed as a finite-dimensional Koopman bilinear form. After introducing a Coupled SSM with shared state across channels, it defines three variants: sequential Coupled Bilinear Input Modulation (seq-BIM), Coupled Gated Modulation (GM), and Parallel Bilinear Input Modulation (p-BIM). Experiments on a multiple input-delay pendulum task (memory retention) and NARMA-10 (bilinear computation) show a dissociation where GM improves memory but not bilinear computation, seq-BIM and p-BIM improve both, a pathway ablation confirms complementary roles, and only bilinear variants benefit from increasing SSM state dimension.

Significance. If the results hold, the work provides a mechanism to augment SSMs with improved memory retention and multiplicative capacity while retaining parallel-scan compatibility in the p-BIM and GM variants. The pathway ablations and state-dimension scaling experiments that isolate the bilinear mechanism's benefits represent a strength, offering concrete evidence for how bilinear forms exploit larger state spaces in control-oriented sequence tasks.

major comments (3)

[Experimental Evaluation] The claims of statistically robust outperformance, clear dissociation between variants, and unique scaling benefits for bilinear forms lack reported details on exact baselines, error bars, statistical tests, and implementation hyperparameters, which directly weakens the evidential support for the central dissociation and scaling claims.
[Results] Results on pendulum and NARMA-10 tasks: The dissociation (GM improves memory retention but not bilinear computation; seq-BIM/p-BIM improve both) and the observation that only bilinear variants benefit from larger state dimensions are shown exclusively on these two tasks. This leaves open whether the effects are intrinsic to the bilinear mechanism or tied to the specific input statistics and dynamics of the chosen benchmarks, limiting the generality of the conclusion that the bilinear mechanism is 'uniquely capable of exploiting larger state spaces.'
[Methods] p-BIM implementation: The manuscript asserts that p-BIM preserves the parallel-scan efficiency of the baseline Mamba, yet no wall-clock timings, FLOPs counts, or runtime comparisons versus the unmodified SSM are provided, leaving the practical efficiency claim unverified despite its centrality to the proposed factorized approach.
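
The statistical tests requested in the first major comment could take a form like the following paired comparison, where each seed contributes one error score per model. This is a minimal sketch under assumptions: the function name is hypothetical, the numbers in the demo call are toy values chosen only to exercise the code, and nothing here reflects the paper's actual analysis.

```python
import math
import statistics

def paired_t(errors_a, errors_b):
    """Return (t statistic, degrees of freedom) for per-seed paired differences."""
    if len(errors_a) != len(errors_b):
        raise ValueError("both models need one score per seed")
    diffs = [x - y for x, y in zip(errors_a, errors_b)]
    n = len(diffs)
    mean = statistics.mean(diffs)
    sd = statistics.stdev(diffs)  # sample standard deviation (n - 1)
    return mean / (sd / math.sqrt(n)), n - 1

# Toy numbers purely to exercise the function; not results from the paper.
t_stat, dof = paired_t([2.0, 4.0, 6.0], [1.0, 2.0, 3.0])
```

With the 11 seeds mentioned in the figure captions, the test would have 10 degrees of freedom, for which the two-sided 5% critical value is about 2.228. Because MSE values are often skewed, running the test on log-errors or using a rank-based alternative may be the safer choice.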

minor comments (1)

[Methods] The notation distinguishing seq-BIM, GM, and p-BIM could be introduced with a single summary table or diagram early in the methods to improve readability.

Simulated Author's Rebuttal

3 responses · 0 unresolved

We thank the referee for the constructive and detailed feedback. We address each major comment point by point below, providing clarifications grounded in the manuscript and outlining specific revisions to strengthen the evidential basis and generality of the claims.

Referee: [Experimental Evaluation] The claims of statistically robust outperformance, clear dissociation between variants, and unique scaling benefits for bilinear forms lack reported details on exact baselines, error bars, statistical tests, and implementation hyperparameters, which directly weakens the evidential support for the central dissociation and scaling claims.

Authors: We agree that the current presentation would be strengthened by explicit reporting of these details. The experiments already used multiple independent runs to support the 'statistically robust' phrasing, but we did not include the supporting statistics or full hyperparameter tables. In the revised version, we will add: (i) a complete table of baselines with their exact configurations, (ii) error bars showing mean and standard deviation across runs, (iii) results of paired statistical tests (e.g., t-tests) for the key comparisons, and (iv) an expanded hyperparameter appendix. These additions will directly bolster the dissociation and scaling claims without changing the experimental outcomes.

revision: yes

Referee: [Results] Results on pendulum and NARMA-10 tasks: The dissociation (GM improves memory retention but not bilinear computation; seq-BIM/p-BIM improve both) and the observation that only bilinear variants benefit from larger state dimensions are shown exclusively on these two tasks. This leaves open whether the effects are intrinsic to the bilinear mechanism or tied to the specific input statistics and dynamics of the chosen benchmarks, limiting the generality of the conclusion that the bilinear mechanism is 'uniquely capable of exploiting larger state spaces.'

Authors: The pendulum task with multiple input delays and NARMA-10 were selected precisely because they isolate the two properties central to the paper: long-range memory retention under delayed inputs and explicit bilinear (multiplicative) computation, respectively. These are standard benchmarks in the control and sequence-modeling literature for evaluating exactly these capabilities. The observed dissociation aligns with the theoretical distinction between gated linearization (GM) and the full bilinear product (seq-BIM/p-BIM). We acknowledge that broader testing would increase generality. In the revision we will expand the discussion to justify the task choices, explicitly note the limitation to these two benchmarks, and outline how the bilinear mechanism's state-space scaling benefit is expected to transfer to other control-oriented tasks.

revision: partial

Referee: [Methods] p-BIM implementation: The manuscript asserts that p-BIM preserves the parallel-scan efficiency of the baseline Mamba, yet no wall-clock timings, FLOPs counts, or runtime comparisons versus the unmodified SSM are provided, leaving the practical efficiency claim unverified despite its centrality to the proposed factorized approach.

Authors: The p-BIM formulation places the bilinear product on the state-transition side while preserving the exact structure required for the parallel scan algorithm; the factorization ensures no additional sequential dependencies are introduced. This is a structural property, not merely an assertion. To provide empirical verification, the revised manuscript will include wall-clock timings, FLOPs counts, and runtime comparisons of p-BIM against the unmodified Mamba baseline (and GM) on representative hardware, across varying sequence lengths and state dimensions. These measurements will confirm that the parallel-scan compatibility translates to practical efficiency.

revision: yes
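
The structural property usually meant by "parallel-scan compatible" is that each step is an affine map of the state whose coefficients depend only on the input, so consecutive steps compose associatively. The sketch below checks this for a scalar recurrence `h_t = a_t * h_(t-1) + b_t` in plain Python. It illustrates the general scan argument only: the function names and the coefficient choice are assumptions, and the source does not give p-BIM's actual transition.

```python
def combine(f, g):
    """Compose affine maps: apply f = (a1, b1) first, then g = (a2, b2).

    g(f(h)) = a2*(a1*h + b1) + b2 = (a2*a1)*h + (a2*b1 + b2).
    """
    a1, b1 = f
    a2, b2 = g
    return (a2 * a1, a2 * b1 + b2)

def sequential(coeffs, h0=0.0):
    """Reference loop for h_t = a_t * h_{t-1} + b_t."""
    out, h = [], h0
    for a, b in coeffs:
        h = a * h + b
        out.append(h)
    return out

def scan(coeffs, h0=0.0):
    """Inclusive scan by recursive doubling (Hillis-Steele style).

    Each of the roughly log2(n) rounds could run in parallel over t;
    here the rounds are simulated with list comprehensions.
    """
    pref = list(coeffs)
    shift = 1
    while shift < len(pref):
        pref = [pref[t] if t < shift else combine(pref[t - shift], pref[t])
                for t in range(len(pref))]
        shift *= 2
    return [a * h0 + b for a, b in pref]

# Hypothetical input-dependent coefficients: a_t and b_t depend on u_t only.
inputs = [0.5, -1.0, 0.25, 2.0, -0.75, 1.5, 0.0]
coeffs = [(0.9 + 0.05 * u, u) for u in inputs]
h_loop = sequential(coeffs)
h_scan = scan(coeffs)
```

The same argument carries over to matrix-valued transitions by replacing the scalar products in `combine` with matrix products, although composing dense matrices costs more per step than the diagonal case. That cost difference is one reason the referee's request for measured timings is not answered by the structural argument alone.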

Circularity Check

0 steps flagged

No significant circularity: empirical claims rest on independent task benchmarks rather than definitional or self-citational reduction.

The paper proposes three factorized bilinear modulation variants (seq-BIM, GM, p-BIM) after introducing a coupled SSM, frames the approach as interpretable via Koopman bilinear forms, and validates the dissociation (GM aids retention but not bilinear computation; BIM variants aid both) plus the unique state-dimension scaling benefit exclusively through ablations and sweeps on the multiple-input-delay pendulum and NARMA-10 tasks. These results are obtained from direct comparisons against controls and do not reduce any prediction or uniqueness claim to a fitted parameter, self-defined quantity, or load-bearing self-citation chain. The mathematical formulations are presented as new constructions tested externally; no equation equates an output to its input by construction, and the central dissociation is statistically demonstrated rather than asserted via prior author theorems. The derivation chain is therefore self-contained against the reported benchmarks.

Axiom & Free-Parameter Ledger

0 free parameters · 1 axiom · 1 invented entity

The central claim rests on interpreting the state-input product as a finite-dimensional Koopman bilinear form and on the empirical results from two specific tasks; no explicit numerical free parameters are stated beyond standard model training.

axioms (1)

Domain assumption: Koopman operator theory admits finite-dimensional bilinear representations for certain nonlinear dynamical systems.
Invoked to interpret the added state-input product as a Koopman bilinear form.
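
For context, the textbook construction behind this assumption can be written as follows for a control-affine system. The symbols `f`, `g_i`, `ψ`, `A`, and `N_i` are notation chosen for this note, not necessarily the paper's.

```latex
\begin{aligned}
\dot{x} &= f(x) + \sum_{i=1}^{m} g_i(x)\,u_i, \\
z &= \psi(x) \in \mathbb{R}^{n}, \\
\dot{z} &= A z + \sum_{i=1}^{m} u_i\,N_i z .
\end{aligned}
```

The third line holds exactly only when the span of the observables in ψ is invariant under the Lie derivatives along f and each g_i; otherwise it is a finite-dimensional approximation. A separate input term B u can be absorbed by including a constant observable in ψ.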

invented entities (1)

Coupled SSM with shared state across channels (no independent evidence)
Purpose: enable factorized bilinear input modulation.
Introduced as a prerequisite for the three modulation implementations.

Reference graph

Works this paper leans on

17 extracted references · 17 canonical work pages

[1] H. Jaeger, “The ‘echo state’ approach to analysing and training recurrent neural networks - with an erratum note,” Bonn, Germany: German National Research Center for Information Technology GMD, Tech. Rep. 148, 2001.

[2] A. Gu, T. Dao, S. Ermon, A. Rudra, and C. Ré, “Hippo: Recurrent memory with optimal polynomial projections,” Advances in Neural Information Processing Systems, vol. 33, pp. 1474–1487, 2020.

[3] A. Gu, K. Goel, and C. Ré, “Efficiently modeling long sequences with structured state spaces,” in International Conference on Learning Representations, 2022.

[4] A. Gu and T. Dao, “Mamba: Linear-time sequence modeling with selective state spaces,” in First Conference on Language Modeling, 2024.

[5] T. Dao and A. Gu, “Transformers are SSMs: Generalized models and efficient algorithms through structured state space duality,” in International Conference on Machine Learning, 2024, pp. 10041–10071.

[6] Q. Wang, Y. Jin, Z. Lu, Q. Gao, X. Ge, Z. Li, and L. Hou, “State space model: a magical tool for state prediction in Nonlinear systems,” Nonlinear Dynamics, vol. 113, no. 7, pp. 6577–6603, 2025.

[7] A. Gupta, A. Gu, and J. Berant, “Diagonal state spaces are as effective as structured state spaces,” in Advances in Neural Information Processing Systems, vol. 35, 2022.

[8] A. Gu, A. Gupta, K. Goel, and C. Ré, “On the parameterization and initialization of diagonal state space models,” Advances in Neural Information Processing Systems, vol. 35, pp. 35971–35983, 2022.

[9] A. Lahoti, K. Y. Li, B. Chen, C. Wang, A. Bick, J. Z. Kolter, T. Dao, and A. Gu, “Mamba-3: Improved sequence modeling using state space principles,” in International Conference on Learning Representations, 2026.

[10] B. O. Koopman, “Hamiltonian systems and transformation in Hilbert space,” Proceedings of the National Academy of Sciences, vol. 17, no. 5, pp. 315–318, 1931.

[11] S. L. Brunton, M. Budišić, E. Kaiser, and J. N. Kutz, “Modern Koopman theory for dynamical systems,” SIAM Review, vol. 64, no. 2, pp. 229–340, 2022.

[12] J. L. Proctor, S. L. Brunton, and J. N. Kutz, “Generalizing Koopman theory to allow for inputs and control,” SIAM Journal on Applied Dynamical Systems, vol. 17, no. 1, pp. 909–930, 2018.

[13] A. Surana, “Koopman operator based observer synthesis for control-affine nonlinear systems,” in IEEE 55th Conference on Decision and Control (CDC), 2016, pp. 6492–6499.

[14] D. Bruder, X. Fu, and R. Vasudevan, “Advantages of bilinear Koopman realizations for the modeling and control of systems with unknown dynamics,” IEEE Robotics and Automation Letters, vol. 6, no. 3, pp. 4369–4376, 2021.

[15] M. Kanai and M. Yamakita, “Lifted bilinear model-based linear model predictive control with scalability,” IFAC-PapersOnLine, vol. 56, no. 2, pp. 9405–9410, 2023.

[16] X. Jiang, Y. Li, and D. Huang, “Modularized bilinear Koopman operator for modeling and predicting transients of microgrids,” IEEE Transactions on Smart Grid, vol. 15, no. 5, pp. 5219–5231, 2024.

[17] A. F. Atiya and A. G. Parlos, “New results on recurrent network training: unifying the algorithms and accelerating convergence,” IEEE Transactions on Neural Networks, vol. 11, no. 3, pp. 697–709, 2000.
