pith. machine review for the scientific record.

arxiv: 2604.14501 · v1 · submitted 2026-04-16 · 💻 cs.LG · cs.AI · cs.CC

Recognition: unknown

On the Expressive Power and Limitations of Multi-Layer SSMs

Authors on Pith: no claims yet

Pith reviewed 2026-05-10 12:33 UTC · model grok-4.3

classification 💻 cs.LG · cs.AI · cs.CC
keywords state-space models · expressive power · chain-of-thought · compositional tasks · streaming algorithms · finite precision · multi-layer models · model limitations

The pith

Multi-layer SSMs lag streaming models on compositional tasks, but online CoT makes them equivalent in power.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper maps the expressive limits of multi-layer state-space models by comparing them directly to streaming algorithms on tasks that require composing multiple operations in sequence. It establishes that SSMs face an inherent gap on these tasks that cannot be closed by adding more layers or offline chain-of-thought reasoning. Allowing online chain-of-thought, however, lets the models match the full power of streaming algorithms. The work also shows that model width and arithmetic precision cannot be traded against each other in the base SSM, yet become interchangeable once online CoT is introduced. These boundaries matter for understanding when SSM architectures can serve as efficient replacements for more general sequence processors.

Core claim

Multi-layer state-space models face fundamental limitations on compositional tasks, revealing an inherent gap between SSMs and streaming models. Offline CoT does not fundamentally increase their expressiveness, while online CoT substantially increases their power: with online CoT, multi-layer SSMs become equivalent in power to streaming algorithms. Width and precision are not interchangeable in the base model, but admit a clean equivalence once online CoT is allowed.

What carries the argument

The formal distinction between online and offline chain-of-thought together with finite-precision arithmetic constraints used to prove equivalences and separations between multi-layer SSMs and streaming algorithms.
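The offline/online distinction can be made concrete with a toy sketch. Here a single abstract `step` callable stands in for the SSM; the names and protocol shape are illustrative assumptions, not the paper's formal definitions:

```python
def offline_cot(step, xs, budget):
    """Offline CoT, schematically: the model first consumes the whole
    input, and only then emits scratch tokens that it feeds back to
    itself. Online CoT, the stronger variant in this paper, instead
    interleaves scratch tokens between input tokens. `step` maps
    (state, token) -> (state, token); names here are illustrative.
    """
    state, y = 0, 0
    for x in xs:                  # phase 1: read the entire input
        state, y = step(state, x)
    trace = []
    for _ in range(budget):       # phase 2: scratch tokens only afterwards
        state, y = step(state, y)
        trace.append(y)
    return trace
```

The key asymmetry: offline scratch tokens arrive only after the input is gone, so they cannot influence how earlier tokens were absorbed into the state, which is where the paper locates the expressiveness gap.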

If this is right

  • Multi-layer SSMs without online CoT cannot solve the full class of compositional tasks that streaming algorithms handle.
  • Offline CoT adds no fundamental expressive power to multi-layer SSMs.
  • Online CoT renders width and precision interchangeable resources inside multi-layer SSMs.
  • The overall power of SSMs is jointly determined by depth, finite precision, and the availability of online CoT.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

  • Architectures that embed online reasoning steps inside SSM forward passes could extend their reach to tasks currently reserved for more general recurrent or attention-based models.
  • The results suggest that simply increasing model depth or width will not overcome compositional limits unless paired with an online CoT mechanism.
  • Designers of efficient sequence models may need to treat online CoT as a first-class architectural choice rather than an optional post-processing step.

Load-bearing premise

The specific formal definitions of compositional tasks, online versus offline CoT, and finite-precision arithmetic used to prove the equivalences and gaps.

What would settle it

A concrete compositional task on which a multi-layer SSM with online CoT fails to match a streaming algorithm (or succeeds where one should not) under matching width, depth, and finite precision.

read the original abstract

We study the expressive power and limitations of multi-layer state-space models (SSMs). First, we show that multi-layer SSMs face fundamental limitations in compositional tasks, revealing an inherent gap between SSMs and streaming models. Then, we examine the role of chain-of-thought (CoT), showing that offline CoT does not fundamentally increase the expressiveness, while online CoT can substantially increase its power. Indeed, with online CoT, multi-layer SSMs become equivalent in power to streaming algorithms. Finally, we investigate the tradeoff between width and precision, showing that these resources are not interchangeable in the base model, but admit a clean equivalence once online CoT is allowed. Overall, our results offer a unified perspective on how depth, finite precision, and CoT shape the power and limits of SSMs.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Referee Report

3 major / 2 minor

Summary. The paper claims that multi-layer state-space models (SSMs) have inherent limitations in handling compositional tasks, creating a gap with streaming models. Offline chain-of-thought (CoT) does not substantially increase their expressiveness, whereas online CoT allows multi-layer SSMs to become equivalent in power to streaming algorithms. Additionally, while width and precision are not interchangeable in the base SSM model, they admit an equivalence under online CoT. The results are supported by proofs for the limitations, equivalences, and trade-offs.

Significance. If the formal results hold, this work provides a valuable unified perspective on the roles of depth, finite precision, and CoT in determining the capabilities of SSMs. This is significant for advancing the theoretical understanding of efficient sequence models like S4 and Mamba, and could guide practical design choices in model architecture and inference strategies. The provision of proofs for limitations, equivalences, and trade-offs is a strength.

major comments (3)
  1. [§3] §3 (limitations of multi-layer SSMs): the gap with streaming models in compositional tasks is load-bearing for the first main claim; the proof must explicitly define the class of compositional tasks and the streaming-algorithm baseline to ensure the separation is not an artifact of the chosen formalization.
  2. [§5] §5 (online CoT equivalence): the headline result that online CoT lifts multi-layer SSMs to streaming-algorithm power depends on the precise model of online CoT (token generation count, re-injection into the SSM state update, and whether the state is reset or carried over). Any deviation from standard linear-recurrence SSM semantics (as in S4/Mamba) would invalidate both the limitation and recovery claims.
  3. [§6] §6 (width-precision tradeoff): the claim that width and precision become interchangeable only under online CoT requires an explicit finite-precision arithmetic model (bit width per state entry, rounding mode, and exactness of multiplication/accumulation). Without this, the tradeoff equivalence cannot be verified.
minor comments (2)
  1. [Abstract] Abstract: the summary of results is clear, but a brief mention of the concrete models (S4, Mamba) would help readers connect the theory to practice.
  2. [Notation] Notation section: ensure symbols for hidden dimension, precision bits, and state update are defined once and used consistently in all theorems.

Simulated Authors' Rebuttal

3 responses · 0 unresolved

We thank the referee for their constructive comments, which help strengthen the clarity of our formal results. We address each major comment below and will incorporate revisions to make definitions and models explicit while preserving the core claims.

read point-by-point responses
  1. Referee: [§3] §3 (limitations of multi-layer SSMs): the gap with streaming models in compositional tasks is load-bearing for the first main claim; the proof must explicitly define the class of compositional tasks and the streaming-algorithm baseline to ensure the separation is not an artifact of the chosen formalization.

    Authors: We agree that explicit definitions strengthen the result. In the revised manuscript, we will add a new subsection at the start of §3 that formally defines compositional tasks as those requiring the sequential composition of k independent functions (e.g., iterated parity or nested modular counting) on the input stream, and defines the streaming baseline as constant-space streaming algorithms that may perform arbitrary (but finite) computation per token. The separation proof shows that fixed-depth multi-layer SSMs cannot maintain the necessary cross-composition state, while the streaming model can; this separation is robust under the stated definitions and aligns with standard automata-theoretic notions. revision: yes
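The streaming baseline the rebuttal describes can be sketched as a constant-space pass that composes modular counters, a stand-in for the "nested modular counting" example. This is an illustrative construction under the rebuttal's stated definitions, not the paper's formal task class:

```python
def nested_mod_count(stream, moduli):
    """Constant-space streaming sketch: compose k modular counters.

    Counter 0 counts input tokens mod moduli[0]; counter i increments
    (mod moduli[i]) each time counter i-1 wraps to zero. State is k
    integers regardless of stream length -- the streaming-algorithm
    baseline. (Illustrative; the paper's task class may differ.)
    """
    state = [0] * len(moduli)
    for _ in stream:
        carry = True  # each token arrival drives counter 0
        for i, m in enumerate(moduli):
            if not carry:
                break
            state[i] = (state[i] + 1) % m
            carry = (state[i] == 0)
    return state
```

The separation claim is that a fixed-depth SSM cannot maintain the cross-composition carries that this loop threads through its constant-size state.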

  2. Referee: [§5] §5 (online CoT equivalence): the headline result that online CoT lifts multi-layer SSMs to streaming-algorithm power depends on the precise model of online CoT (token generation count, re-injection into the SSM state update, and whether the state is reset or carried over). Any deviation from standard linear-recurrence SSM semantics (as in S4/Mamba) would invalidate both the limitation and recovery claims.

    Authors: We share the concern for precision. Our model of online CoT in §5 adheres to standard linear-recurrence SSM semantics: at each step the SSM produces a token that is immediately re-injected as the next input without state reset, the state is carried forward, and the number of generated tokens per original input token is bounded by a constant. We will insert a formal definition of this process (including the re-injection rule and state carry-over) at the beginning of §5 to make the equivalence to streaming algorithms fully verifiable under these semantics. revision: yes
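The re-injection protocol described above can be sketched against a diagonal linear recurrence (S4/Mamba-style). Parameter names, the diagonal parameterization, and the per-token CoT budget are illustrative assumptions:

```python
import numpy as np

def ssm_step(h, x, A, B, C):
    """One diagonal linear-recurrence step, schematically."""
    h = A * h + B * x          # elementwise decay plus input injection
    y = float(C @ h)           # scalar readout token
    return h, y

def online_cot_pass(xs, A, B, C, cot_per_token=2):
    """Hedged sketch of the rebuttal's online-CoT protocol: after each
    input token, the model emits a constant number of CoT tokens that
    are immediately re-injected as the next inputs; the state h is
    carried over throughout and never reset."""
    h = np.zeros_like(A)
    outputs = []
    for x in xs:
        h, y = ssm_step(h, x, A, B, C)
        outputs.append(y)
        for _ in range(cot_per_token):       # bounded online CoT budget
            h, y = ssm_step(h, y, A, B, C)   # re-inject emitted token
            outputs.append(y)
    return outputs
```

Note the two properties the rebuttal flags as load-bearing: state carry-over across re-injections, and a constant bound on generated tokens per input token.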

  3. Referee: [§6] §6 (width-precision tradeoff): the claim that width and precision become interchangeable only under online CoT requires an explicit finite-precision arithmetic model (bit width per state entry, rounding mode, and exactness of multiplication/accumulation). Without this, the tradeoff equivalence cannot be verified.

    Authors: We accept that an explicit arithmetic model is required. The paper employs a fixed-point model with b-bit entries per state dimension, exact multiplication and accumulation within the bit width, and round-to-nearest rounding. We will expand the opening of §6 with a formal definition of this model (specifying bit width, rounding, and exactness) and restate the tradeoff theorem under it: without online CoT, width and precision are not interchangeable, while with online CoT they trade off linearly (constant product w·b suffices for equivalent power). revision: yes
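The fixed-point arithmetic model described above can be sketched as a quantizer. The split into integer and fractional bits and the saturation behavior are illustrative assumptions; only the b-bit width and round-to-nearest rule come from the rebuttal:

```python
def quantize(x, bits, frac_bits):
    """Round-to-nearest fixed-point quantization: a signed `bits`-bit
    integer with `frac_bits` fractional bits, saturating on overflow.
    (Python's round() breaks ties to even; the paper's exact rounding
    convention may differ.)"""
    scale = 1 << frac_bits
    q = round(x * scale)                        # round to nearest
    lo, hi = -(1 << (bits - 1)), (1 << (bits - 1)) - 1
    q = max(lo, min(hi, q))                     # saturate at the bit width
    return q / scale
```

Under this reading, the claimed tradeoff is that with online CoT a constant product w·b suffices: halving the per-entry bit width b while doubling the width w preserves expressive power, whereas without online CoT no such exchange is possible.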

Circularity Check

0 steps flagged

No circularity: equivalences derived from explicit model definitions and standard streaming comparisons

full rationale

The abstract and framing present proofs of limitations for base multi-layer SSMs versus streaming models, the differential impact of offline versus online CoT, and width-precision tradeoffs, all grounded in formal definitions of the models, CoT variants, and task classes. These are compared against external benchmarks (streaming algorithms) rather than reducing to self-referential fits, self-citations, or ansatzes. No load-bearing step equates a claimed result to its own inputs by construction; the derivations remain self-contained once the chosen formalizations are accepted.

Axiom & Free-Parameter Ledger

0 free parameters · 1 axiom · 0 invented entities

The work rests on standard mathematical definitions of SSMs, streaming algorithms, compositional tasks, and finite-precision arithmetic; no new entities or fitted parameters are introduced in the abstract.

axioms (1)
  • standard math Standard definitions of multi-layer SSMs, streaming algorithms, and finite-precision computation
    The abstract builds all claims on these established formal objects without re-deriving them.

pith-pipeline@v0.9.0 · 5448 in / 1205 out tokens · 44623 ms · 2026-05-10T12:33:04.761347+00:00 · methodology


Reference graph

Works this paper leans on

4 extracted references · 2 canonical work pages · 1 internal anchor

  1. [1] Soham De, Samuel L. Smith, Anushan Fernando, Aleksandar Botev, George Cristian-Muraru, Albert Gu, Ruba Haroun, Leonard Berrada, Yutian Chen, Srivatsan Srinivasan, Guillaume Desjardins, Arnaud Doucet, David Budden, Yee Whye Teh, Razvan Pascanu, Nando de Freitas, and Caglar Gulcehre. Griffin: Mixing gated linear recurrences with local attention for efficien...

  2. [2] Xinyu Mao, Guangxu Yang, and Jiapeng Zhang. Gadgetless lifting beats round elimination: Improved lower bounds for pointer chasing. In 16th Innovations in Theoretical Computer Science Conference (ITCS 2025).

  3. [3] Bo Peng, Eric Alcaide, Quentin Anthony, Alon Albalak, Samuel Arcadinho, Stella Biderman, Huanqi Cao, Xin Cheng, Michael Chung, Leon Derczynski, et al. RWKV: Reinventing RNNs for the transformer era. In Findings of the Association for Computational Linguistics: EMNLP 2023, pp. 14048–14077.

  4. [4] Taylan Soydan, Nikola Zubić, Nico Messikommer, Siddhartha Mishra, and Davide Scaramuzza. S7: Selective and simplified state space layers for sequence modeling. arXiv preprint arXiv:2410.03464.