Bilinear Input Modulation for Mamba: Koopman Bilinear Forms for Memory Retention and Multiplicative Computation
Pith reviewed 2026-05-10 06:21 UTC · model grok-4.3
The pith
Factorized bilinear input modulation augments selective SSMs with state-input products to improve both memory retention and multiplicative computation.
A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.
Core claim
Selective SSMs with diagonal transitions are limited in memory and bilinear capacity. Introducing a factorized bilinear input modulation that computes a state-input product and routes it through complementary paths yields SSMs that retain longer histories and perform multiplicative operations more effectively. The sequential and parallel bilinear variants both succeed on bilinear computation where gating alone fails, and only they improve as the state dimension grows. A pathway ablation shows that the two routes of the bilinear signal play distinct, complementary roles.
What carries the argument
The factorized bilinear input modulation, which augments the SSM recurrence with an explicit state-input product term interpretable as a Koopman bilinear form and admits parallel or sequential realizations while preserving scan efficiency where possible.
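For orientation, the general discrete-time bilinear form that a "Koopman bilinear form" usually refers to can be written as below. This is a sketch under stated assumptions: the symbols z_t (lifted state), u_t (input), A, B, and N_i are illustrative and not taken from the paper, and the paper's specific factorization of the product term is not reproduced here.

```latex
% Assumed notation: z_t \in \mathbb{R}^n is the (lifted) state, u_t \in \mathbb{R}^m the input.
% The last term is the state-input product that a purely linear update lacks.
z_{t+1} = A z_t + B u_t + \sum_{i=1}^{m} u_{t,i} \, N_i z_t
```

Dropping the final sum recovers an ordinary linear state-space update, which is why the product term is the part that carries the multiplicative capacity discussed here.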
If this is right
- The bilinear mechanism is uniquely able to exploit increases in SSM state dimension for performance gains.
- The two downstream routes of the bilinear signal play complementary roles, as confirmed by pathway ablation.
- Gated modulation improves memory retention but leaves bilinear computation largely unchanged.
- Parallel bilinear modulation improves both memory retention and bilinear computation, as the sequential version does, while remaining compatible with efficient parallel scan.
Where Pith is reading between the lines
- The dissociation between memory and computation benefits could guide construction of hybrid SSM architectures that selectively apply bilinear terms only where needed.
- Testing the same modulation on longer sequences or real-world time-series data would check whether the reported scaling with state size continues to hold.
- The Koopman bilinear framing suggests the approach might transfer to other recurrent or attention-based models that currently rely on linear transitions.
- If the efficiency advantage is preserved at scale, bilinear SSMs could become a drop-in replacement for gated variants in applications requiring strong multiplicative interactions.
Load-bearing premise
The chosen pendulum and NARMA-10 tasks plus the observed dissociation between memory and bilinear benefits are enough to establish general gains for arbitrary sequence modeling without hidden efficiency penalties.
What would settle it
Running the bilinear variants on a standard language-modeling benchmark and finding no consistent improvement over gated baselines or no additional benefit from larger state dimensions.
Original abstract
Selective State Space Models (SSMs), notably Mamba, employ diagonal state transitions that limit both memory retention and bilinear computational capacity. We propose a factorized bilinear input modulation that augments the SSM with a state-input product, interpretable as a finite-dimensional Koopman bilinear form. After introducing a shared state across channels (Coupled SSM), the modulation admits three implementations. Coupled Bilinear Input Modulation (seq-BIM) retains the full bilinear product on the input side at the cost of sequential computation, Coupled Gated Modulation (GM) linearizes it into a gate modulation that is compatible with the parallel scan, and Parallel Bilinear Input Modulation (p-BIM) places the same bilinear product on the state transition while remaining parallel-scannable. Experiments on a multiple input-delay pendulum (memory retention) and NARMA-10 (bilinear computation) reveal a clear dissociation. GM substantially improves memory retention but not bilinear computation, while both seq-BIM and p-BIM improve both. A pathway ablation confirms that the two downstream routes of the bilinear signal serve complementary roles. The improvement is statistically robust, with the bilinear variants consistently outperforming the other variants on bilinear computation. Furthermore, only the bilinear variants benefit from increasing the SSM state dimension, while coupling or gate modulation alone show no improvement, establishing the bilinear mechanism as uniquely capable of exploiting larger state spaces.
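To make the "bilinear computation" label concrete, here is a minimal sketch of the commonly used NARMA-10 recurrence, whose target contains an explicit product of two inputs ten steps apart. The function name `narma10`, the seed handling, and the zero initial conditions are assumptions for illustration; the paper's exact data pipeline is not specified in this page.

```python
import random


def narma10(length, seed=0):
    """Generate (u, y) from the commonly used NARMA-10 recurrence.

    Assumed conventions (not taken from the paper): inputs are drawn
    uniformly from [0, 0.5] and the first ten outputs are zero.
    """
    rng = random.Random(seed)
    u = [rng.uniform(0.0, 0.5) for _ in range(length)]
    y = [0.0] * length
    for t in range(9, length - 1):
        window = sum(y[t - i] for i in range(10))  # y[t-9] + ... + y[t]
        y[t + 1] = (
            0.3 * y[t]
            + 0.05 * y[t] * window
            + 1.5 * u[t - 9] * u[t]  # multiplicative term between delayed inputs
            + 0.1
        )
    return u, y
```

The `1.5 * u[t - 9] * u[t]` term is the part a model must reproduce multiplicatively, which is what makes the task a probe for bilinear capacity rather than memory alone.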
Editorial analysis
A structured set of objections, weighed in public.
Referee Report
Summary. The paper proposes a factorized bilinear input modulation for Selective State Space Models (SSMs) such as Mamba, framed as a finite-dimensional Koopman bilinear form. After introducing a Coupled SSM with shared state across channels, it defines three variants: sequential Coupled Bilinear Input Modulation (seq-BIM), Coupled Gated Modulation (GM), and Parallel Bilinear Input Modulation (p-BIM). Experiments on a multiple input-delay pendulum task (memory retention) and NARMA-10 (bilinear computation) show a dissociation where GM improves memory but not bilinear computation, seq-BIM and p-BIM improve both, a pathway ablation confirms complementary roles, and only bilinear variants benefit from increasing SSM state dimension.
Significance. If the results hold, the work provides a mechanism to augment SSMs with improved memory retention and multiplicative capacity while retaining parallel-scan compatibility in the p-BIM and GM variants. The pathway ablations and state-dimension scaling experiments that isolate the bilinear mechanism's benefits represent a strength, offering concrete evidence for how bilinear forms exploit larger state spaces in control-oriented sequence tasks.
major comments (3)
- [Experimental Evaluation] Experimental Evaluation: The claims of statistically robust outperformance, clear dissociation between variants, and unique scaling benefits for bilinear forms lack reported details on exact baselines, error bars, statistical tests, and implementation hyperparameters, which directly weakens the evidential support for the central dissociation and scaling claims.
- [Results] Results on pendulum and NARMA-10 tasks: The dissociation (GM improves memory retention but not bilinear computation; seq-BIM/p-BIM improve both) and the observation that only bilinear variants benefit from larger state dimensions are shown exclusively on these two tasks. This leaves open whether the effects are intrinsic to the bilinear mechanism or tied to the specific input statistics and dynamics of the chosen benchmarks, limiting the generality of the conclusion that the bilinear mechanism is 'uniquely capable of exploiting larger state spaces.'
- [Methods] p-BIM implementation: The manuscript asserts that p-BIM preserves the parallel-scan efficiency of the baseline Mamba, yet no wall-clock timings, FLOPs counts, or runtime comparisons versus the unmodified SSM are provided, leaving the practical efficiency claim unverified despite its centrality to the proposed factorized approach.
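The paired statistical test requested in the first major comment could be computed along the following lines. This is a minimal sketch, assuming each variant is evaluated with the same set of random seeds so that per-seed errors can be paired; `paired_t_statistic` is a hypothetical helper name, and no numbers from the paper are used.

```python
import math
import statistics


def paired_t_statistic(errors_a, errors_b):
    """Return (t, degrees_of_freedom) for paired per-seed errors.

    errors_a[i] and errors_b[i] are assumed to come from the same seed.
    """
    if len(errors_a) != len(errors_b) or len(errors_a) < 2:
        raise ValueError("need two equal-length samples with at least 2 pairs")
    diffs = [a - b for a, b in zip(errors_a, errors_b)]
    mean_diff = statistics.mean(diffs)
    sd_diff = statistics.stdev(diffs)  # sample standard deviation (n - 1)
    if sd_diff == 0.0:
        raise ValueError("differences have zero variance; t is undefined")
    n = len(diffs)
    t = mean_diff / (sd_diff / math.sqrt(n))
    return t, n - 1
```

The p-value then follows from the Student t distribution with n - 1 degrees of freedom; if SciPy is available, `scipy.stats.ttest_rel` performs the same test and returns the p-value directly.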
minor comments (1)
- [Methods] The notation distinguishing seq-BIM, GM, and p-BIM could be introduced with a single summary table or diagram early in the methods to improve readability.
Simulated Author's Rebuttal
We thank the referee for the constructive and detailed feedback. We address each major comment point by point below, providing clarifications grounded in the manuscript and outlining specific revisions to strengthen the evidential basis and generality of the claims.
Point-by-point responses
- Referee: [Experimental Evaluation] Experimental Evaluation: The claims of statistically robust outperformance, clear dissociation between variants, and unique scaling benefits for bilinear forms lack reported details on exact baselines, error bars, statistical tests, and implementation hyperparameters, which directly weakens the evidential support for the central dissociation and scaling claims.
  Authors: We agree that the current presentation would be strengthened by explicit reporting of these details. We performed multiple independent runs to support the 'statistically robust' phrasing, but the manuscript did not include the supporting statistics or full hyperparameter tables. In the revised version, we will add: (i) a complete table of baselines with their exact configurations, (ii) error bars showing mean and standard deviation across runs, (iii) results of paired statistical tests (e.g., t-tests) for the key comparisons, and (iv) an expanded hyperparameter appendix. These additions will directly bolster the dissociation and scaling claims without changing the experimental outcomes.
  revision: yes
- Referee: [Results] Results on pendulum and NARMA-10 tasks: The dissociation (GM improves memory retention but not bilinear computation; seq-BIM/p-BIM improve both) and the observation that only bilinear variants benefit from larger state dimensions are shown exclusively on these two tasks. This leaves open whether the effects are intrinsic to the bilinear mechanism or tied to the specific input statistics and dynamics of the chosen benchmarks, limiting the generality of the conclusion that the bilinear mechanism is 'uniquely capable of exploiting larger state spaces.'
  Authors: The pendulum task with multiple input delays and NARMA-10 were selected precisely because they isolate the two properties central to the paper: long-range memory retention under delayed inputs and explicit bilinear (multiplicative) computation, respectively. These are standard benchmarks in the control and sequence-modeling literature for evaluating exactly these capabilities. The observed dissociation aligns with the theoretical distinction between gated linearization (GM) and the full bilinear product (seq-BIM/p-BIM). We acknowledge that broader testing would increase generality. In the revision we will expand the discussion to justify the task choices, explicitly note the limitation to these two benchmarks, and outline how the bilinear mechanism's state-space scaling benefit is expected to transfer to other control-oriented tasks.
  revision: partial
- Referee: [Methods] p-BIM implementation: The manuscript asserts that p-BIM preserves the parallel-scan efficiency of the baseline Mamba, yet no wall-clock timings, FLOPs counts, or runtime comparisons versus the unmodified SSM are provided, leaving the practical efficiency claim unverified despite its centrality to the proposed factorized approach.
  Authors: The p-BIM formulation places the bilinear product on the state-transition side while preserving the exact structure required for the parallel scan algorithm; the factorization ensures no additional sequential dependencies are introduced. This is a structural property, not merely an assertion. To provide empirical verification, the revised manuscript will include wall-clock timings, FLOPs counts, and runtime comparisons of p-BIM against the unmodified Mamba baseline (and GM) on representative hardware, across varying sequence lengths and state dimensions. These measurements will confirm that the parallel-scan compatibility translates to practical efficiency.
  revision: yes
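The structural argument in the last response can be illustrated with a small sketch. It assumes a scalar recurrence h_t = a_t * h_{t-1} + b_t in which a_t and b_t depend only on the input at step t, never on the previous state; that reading of "parallel-scannable" is an assumption here, and the function names and example coefficients are hypothetical rather than the paper's p-BIM update.

```python
def combine(left, right):
    """Compose two affine maps h -> a*h + b, applying `left` first."""
    a1, b1 = left
    a2, b2 = right
    return (a2 * a1, a2 * b1 + b2)


def sequential_final_state(steps, h0=0.0):
    """Run the recurrence one step at a time."""
    h = h0
    for a, b in steps:
        h = a * h + b
    return h


def tree_final_state(steps, h0=0.0):
    """Reduce adjacent pairs level by level, as a parallel scan would."""
    level = list(steps)
    while len(level) > 1:
        nxt = [combine(level[i], level[i + 1]) for i in range(0, len(level) - 1, 2)]
        if len(level) % 2:
            nxt.append(level[-1])  # odd element carries over unchanged
        level = nxt
    a, b = level[0]
    return a * h0 + b


# Hypothetical input-dependent coefficients (a_t, b_t), chosen only for illustration.
steps = [(0.5, 1.0), (0.25, 2.0), (2.0, -1.0)]
assert sequential_final_state(steps) == tree_final_state(steps) == 3.5
```

Because `combine` is associative, the pairs can be grouped in any order, which is what lets the work be spread across a tree of depth logarithmic in the sequence length. If a_t or b_t depended on h_{t-1}, the steps would no longer be fixed affine maps and this regrouping would not apply, which is the trade-off attributed to the sequential variant. This shows scan compatibility only; it says nothing about wall-clock cost, which is the measurement the referee asks for.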
Circularity Check
No significant circularity: empirical claims rest on independent task benchmarks rather than definitional or self-citational reduction.
The paper proposes three factorized bilinear modulation variants (seq-BIM, GM, p-BIM) after introducing a coupled SSM, frames the approach as interpretable via Koopman bilinear forms, and validates the dissociation (GM aids retention but not bilinear computation; BIM variants aid both) plus the unique state-dimension scaling benefit exclusively through ablations and sweeps on the multiple-input-delay pendulum and NARMA-10 tasks. These results are obtained from direct comparisons against controls and do not reduce any prediction or uniqueness claim to a fitted parameter, self-defined quantity, or load-bearing self-citation chain. The mathematical formulations are presented as new constructions tested externally; no equation equates an output to its input by construction, and the central dissociation is statistically demonstrated rather than asserted via prior author theorems. The derivation chain is therefore self-contained against the reported benchmarks.
Axiom & Free-Parameter Ledger
axioms (1)
- domain assumption: Koopman operator theory admits finite-dimensional bilinear representations for certain nonlinear dynamical systems
invented entities (1)
- Coupled SSM with shared state across channels (no independent evidence)