Trading Complexity for Expressivity Through Structured Generalized Linear Token Mixing

Alexandre Allauzen; Blaise Delattre; Erwan Fagnou; Paul Caillon

arxiv: 2605.31367 · v1 · pith:VQVIRIK3new · submitted 2026-05-29 · 💻 cs.LG · cs.CL

Trading Complexity for Expressivity Through Structured Generalized Linear Token Mixing

Erwan Fagnou , Paul Caillon , Blaise Delattre , Alexandre Allauzen This is my paper

Pith reviewed 2026-06-28 22:54 UTC · model grok-4.3

classification 💻 cs.LG cs.CL

keywords token mixingrecurrence patternsstate space modelsattention mechanismsexpressivitycomputational complexitylanguage modeling

0 comments

The pith

Structured multi-state recurrences let token mixers trade decoding speed for greater expressivity

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper develops a unified framework for token mixing layers that separates the direct influence of a new input on the output from the recurrent propagation of information through past outputs. It generalizes standard recurrence so that each state can depend on multiple prior states rather than only the immediate predecessor. By imposing structure on those multi-state dependencies, the authors construct new recurrence patterns whose computational complexity is provably bounded while their expressivity is analyzed theoretically. The result is a controlled way to exchange runtime cost for stronger long-range dependency modeling, demonstrated on synthetic tasks and language modeling.

Core claim

By introducing structure on multi-state recurrence dependencies, new recurrence patterns provably achieve the desired complexity while providing theoretical insights on their expressivity, thereby trading runtime for expressivity in a principled way.

What carries the argument

The structured generalization of recurrence equations allowing each state to depend on multiple past states rather than only the immediate predecessor

If this is right

Attention and state-space models are special cases of the generalized framework.
New token-mixer architectures can be systematically designed with explicit complexity-expressivity trade-offs.
Theoretical analysis of the structured patterns reveals how dependence structure controls expressivity.
Empirical results on synthetic tasks and language modeling confirm that the trade-offs are achievable in practice.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

The separation into direct influence and recurrent propagation could be used to hybridize other sequence architectures.
The same structuring technique might extend to higher-order or non-linear recurrence forms beyond the linear case studied here.
Models built this way could target regimes where current quadratic or linear recurrences are either too slow or too weak for very long contexts.

Load-bearing premise

The structure imposed on multi-state recurrence dependencies maintains the claimed complexity bounds and delivers the stated expressivity gains without hidden costs or post-hoc adjustments.

What would settle it

An explicit counterexample in which one of the structured recurrence patterns either exceeds its claimed complexity bound during causal generation or shows no measurable expressivity improvement over standard single-state attention or SSM baselines on long-range synthetic tasks.

Figures

Figures reproduced from arXiv: 2605.31367 by Alexandre Allauzen, Blaise Delattre, Erwan Fagnou, Paul Caillon.

**Figure 1.** Figure 1: (Top) Our general formulation of a linear token mixing layer, which combines attention coefficients (green) and recurrence coefficients (red). The output can be expressed simply using matrix notations. (Bottom) Examples of how different sparsity patterns of A and B produce standard layers (attention, linear recurrence, etc.) but also enable new behaviors. 3.1. Recurrent and matrix forms Definition 3.1 (Gen… view at source ↗

**Figure 2.** Figure 2: Visualization of different sparsity patterns for the matrices A and B. Non-zero entries are colored in red. The time and space complexities of the layer depend on the properties of the sparsity pattern. for decoding the i-th token, where: g(i) = max{k ∈ N s.t. f(k) < i}. If f can be extended to an invertible real function R → [1, ∞) then the complexity is in O(f −1 (i)). We will assume this is the case in… view at source ↗

**Figure 3.** Figure 3: OpenWebText perplexity (lower is better) versus per-token time complexity for respectively 125M, 355M and 775M parameters models. Points correspond to attention-based, recurrent, and cache-efficient variants; the x-axis annotates canonical regimes O(1), O(k), O(log2 n), O( √ n), and O(n). higher perplexities with comparatively small spread. This aligns with the congestion view: compressing the entire past … view at source ↗

read the original abstract

Token mixing layers play a key role in how language models can learn and generate long-range dependencies. Their efficiency relies on the necessary trade-off between decoding speed and the memory requirements, along with the cache size. Considering causal generation, this paper explores new trade-offs thanks to a unified framework which separates two crucial features: (i) the direct influence of inputs on outputs in one generation step; (ii) the recurrent propagation of information through past outputs. This framework encompasses major architectures such as attention and state-space models, but also generalizes the recurrence equations by allowing each state to depend on multiple past states rather than only the immediate predecessor. By introducing structure, we design new recurrence patterns that provably achieve the desired complexity, while providing theoretical insights on their expressivity -- trading runtime for expressivity in a principled way. Empirical validation is performed on synthetic tasks, along with language modeling. Together, these results provide a unified toolkit for the understanding and design of efficient and expressive token mixers across model families.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Desk Editor's Note private letter to a colleague

The paper unifies token mixing by generalizing recurrence to multi-state dependencies with added structure for complexity control, but the practical gains and proof details need closer checking.

read the letter

The main thing here is a framework that splits token mixing into direct input-to-output influence and recurrent propagation through past states. It then generalizes the recurrence so each state can depend on multiple prior states, not just the last one, and imposes structure on those dependencies to keep complexity manageable. This covers attention and SSMs while adding new recurrence patterns that come with claimed provable complexity bounds and some expressivity analysis.

It does a clean job organizing existing models under one view and showing how the multi-state extension can be structured to trade runtime for expressivity without obvious contradictions in the setup. The synthetic tasks plus language modeling experiments give a basic check on whether the patterns work in practice.

The soft spots are in the strength of the supporting evidence. The abstract and framework description make the complexity claims sound plausible, but without the full derivations visible it is not clear how tightly the structure preserves the bounds or whether hidden costs appear in implementation. The expressivity insights also read as preliminary; the empirical section would need tighter baselines and effect sizes to show the trade-off is actually useful beyond the synthetic cases.

This is for people working on efficient sequence models who want a design toolkit rather than a drop-in replacement. It shows clear thinking on the recurrence generalization and engages the literature directly, so it deserves a serious referee even if the experiments will likely need expansion.

Referee Report

0 major / 3 minor

Summary. The paper introduces a unified framework for token mixing layers in language models that separates direct input influence on outputs from recurrent propagation of information through past outputs. It generalizes attention and state-space model recurrences to allow multi-state dependencies, then imposes structure on these recurrences to design new patterns that provably achieve target complexity while yielding expressivity insights; the approach is evaluated on synthetic tasks and language modeling.

Significance. If the theoretical claims on complexity and expressivity hold, the work supplies a principled toolkit for trading runtime against expressivity in token mixers and unifies major architectures under one recurrence framework, which could guide design of efficient long-context models.

minor comments (3)

The abstract states that new recurrence patterns 'provably achieve the desired complexity' but does not name the specific complexity measures (e.g., time or space per token) or the exact structural constraints used; adding one sentence with these quantities would improve readability.
Empirical section: the description of synthetic tasks and language-modeling benchmarks would benefit from an explicit statement of the data-exclusion rules and error-analysis protocol referenced in the reader's assessment.
Notation: the separation into 'direct influence' and 'recurrent propagation' is introduced without a compact equation label; assigning an equation number to the general multi-state recurrence would aid cross-references.

Simulated Author's Rebuttal

0 responses · 0 unresolved

We thank the referee for their positive assessment of the manuscript and the recommendation for minor revision. The provided report contains no specific major comments to address.

Circularity Check

0 steps flagged

No significant circularity detected

full rationale

The paper defines a unified framework separating direct input influence from recurrent state propagation, then generalizes standard attention/SSM recurrences by allowing multi-state dependencies and imposing structure to control complexity. Theoretical claims on expressivity and complexity bounds are presented as following from this explicit construction, with separate empirical validation on synthetic tasks and language modeling. No equations reduce predictions to fitted parameters by construction, no load-bearing self-citations underpin the core separation or uniqueness, and no ansatz is smuggled via prior work. The derivation chain remains self-contained against external benchmarks.

Axiom & Free-Parameter Ledger

0 free parameters · 1 axioms · 0 invented entities

Only the abstract is available, so the ledger is populated from stated elements in the abstract; no explicit free parameters, axioms, or invented entities are detailed beyond standard assumptions of linear mixing and recurrence.

axioms (1)

domain assumption Standard assumptions of causal generation and linear token mixing operations hold.
Invoked implicitly when separating direct influence and recurrent propagation for causal models.

pith-pipeline@v0.9.1-grok · 5708 in / 1222 out tokens · 20239 ms · 2026-06-28T22:54:39.892213+00:00 · methodology

discussion (0)

Reference graph

Works this paper leans on

9 extracted references · 2 canonical work pages · 1 internal anchor

[1]

URL https://openreview.net/forum? id=Ua6zuk0WRH. Dao, T. Flashattention-2: Faster attention with better paral- lelism and work partitioning. 2024:35549–35562, 2024. Dao, T. and Gu, A. Transformers are ssms: generalized models and efficient algorithms through structured state space duality. InProceedings of the 41st International Conference on Machine Lear...

work page internal anchor Pith review Pith/arXiv arXiv doi:10.18653/v1/2024.emnlp-main 2024
[2]

emnlp-main.731/

URL https://aclanthology.org/2024. emnlp-main.731/. Feng, L., Tung, F., Ahmed, M. O., Bengio, Y ., and Ha- jimirsadeghi, H. Were rnns all we needed?arXiv preprint arXiv:2410.01201, 2024. URL https:// arxiv.org/abs/2410.01201. Fu, D. Y ., Dao, T., Saab, K. K., Thomas, A. W., Rudra, A., and Re, C. Hungry hungry hippos: Towards lan- guage modeling with state...

arXiv 2024
[3]

Gokaslan, A

URL https://openreview.net/forum? id=COZDy0WYGg. Gokaslan, A. and Cohen, V . Openwebtext cor- pus. http://Skylion007.github.io/ OpenWebTextCorpus, 2019. Gu, A. and Dao, T. Mamba: Linear-time sequence modeling with selective state spaces. InFirst Conference on Lan- guage Modeling, 2024. URL https://openreview. net/forum?id=tEYskw1VY2. Gu, A., Dao, T., Ermo...

Pith/arXiv arXiv 2019
[4]

Hush, D., Abdallah, C., and Horne, B

URL https://openreview.net/forum? id=mOJgZWkXKW. Hush, D., Abdallah, C., and Horne, B. High order recursive neural networks. In1EEE 1991 ACC Conf, 1991. Jelassi, S., Brandfonbrener, D., Kakade, S. M., and Malach, E. Repeat after me: transformers are better than state space models at copying. InProceedings of the 41st In- ternational Conference on Machine ...

1991
[5]

Lahoti, A., Marwah, T., Puduppully, R., and Gu, A

URL https://openreview.net/forum? id=n2NidsYDop. Lahoti, A., Marwah, T., Puduppully, R., and Gu, A. Chimera: State space models beyond sequences.Transac- tions on Machine Learning Research, 2025. ISSN 2835-

2025
[6]

Mavi, V ., Jangra, A., and Jatowt, A

URL https://openreview.net/forum? id=yv0TUssepk. Mavi, V ., Jangra, A., and Jatowt, A. Multi-hop question answering.Found. Trends Inf. Retr., 17(5):457–586, June
[7]

doi: 10.1561/1500000102

ISSN 1554-0669. doi: 10.1561/1500000102. URL https://doi.org/10.1561/1500000102. NVIDIA, :, Blakeman, A., Basant, A., Khattar, A., Ren- duchintala, A., Bercovich, A., Ficek, A., Bjorlin, A., Taghibakhshi, A., Deshmukh, A. S., Mahabaleshwarkar, A. S., Tao, A., Shors, A., Aithal, A., Poojary, A., Dattagupta, A., Buddharaju, B., Chen, B., Ginsburg, B., Wang,...

work page doi:10.1561/1500000102 2025
[8]

Smith, J

URL https://openreview.net/forum? id=iF7MnXnxRw. Smith, J. T., Warrington, A., and Linderman, S. Simplified state space layers for sequence modeling. InThe Eleventh International Conference on Learning Representations,
[9]

Soltani, R

URL https://openreview.net/forum? id=Ai8Hw3AXqks. Soltani, R. and Jiang, H. Higher order recurrent neural networks, 2017. URL https://openreview.net/ forum?id=ByZvfijeg. Stani´c, A., Ashley, D., Serikov, O., Kirsch, L., Faccio, F., Schmidhuber, J., Hofmann, T., and Schlag, I. The lan- guini kitchen: Enabling language modelling research at different scales...

arXiv 2017

[1] [1]

URL https://openreview.net/forum? id=Ua6zuk0WRH. Dao, T. Flashattention-2: Faster attention with better paral- lelism and work partitioning. 2024:35549–35562, 2024. Dao, T. and Gu, A. Transformers are ssms: generalized models and efficient algorithms through structured state space duality. InProceedings of the 41st International Conference on Machine Lear...

work page internal anchor Pith review Pith/arXiv arXiv doi:10.18653/v1/2024.emnlp-main 2024

[2] [2]

emnlp-main.731/

URL https://aclanthology.org/2024. emnlp-main.731/. Feng, L., Tung, F., Ahmed, M. O., Bengio, Y ., and Ha- jimirsadeghi, H. Were rnns all we needed?arXiv preprint arXiv:2410.01201, 2024. URL https:// arxiv.org/abs/2410.01201. Fu, D. Y ., Dao, T., Saab, K. K., Thomas, A. W., Rudra, A., and Re, C. Hungry hungry hippos: Towards lan- guage modeling with state...

arXiv 2024

[3] [3]

Gokaslan, A

URL https://openreview.net/forum? id=COZDy0WYGg. Gokaslan, A. and Cohen, V . Openwebtext cor- pus. http://Skylion007.github.io/ OpenWebTextCorpus, 2019. Gu, A. and Dao, T. Mamba: Linear-time sequence modeling with selective state spaces. InFirst Conference on Lan- guage Modeling, 2024. URL https://openreview. net/forum?id=tEYskw1VY2. Gu, A., Dao, T., Ermo...

Pith/arXiv arXiv 2019

[4] [4]

Hush, D., Abdallah, C., and Horne, B

URL https://openreview.net/forum? id=mOJgZWkXKW. Hush, D., Abdallah, C., and Horne, B. High order recursive neural networks. In1EEE 1991 ACC Conf, 1991. Jelassi, S., Brandfonbrener, D., Kakade, S. M., and Malach, E. Repeat after me: transformers are better than state space models at copying. InProceedings of the 41st In- ternational Conference on Machine ...

1991

[5] [5]

Lahoti, A., Marwah, T., Puduppully, R., and Gu, A

URL https://openreview.net/forum? id=n2NidsYDop. Lahoti, A., Marwah, T., Puduppully, R., and Gu, A. Chimera: State space models beyond sequences.Transac- tions on Machine Learning Research, 2025. ISSN 2835-

2025

[6] [6]

Mavi, V ., Jangra, A., and Jatowt, A

URL https://openreview.net/forum? id=yv0TUssepk. Mavi, V ., Jangra, A., and Jatowt, A. Multi-hop question answering.Found. Trends Inf. Retr., 17(5):457–586, June

[7] [7]

doi: 10.1561/1500000102

ISSN 1554-0669. doi: 10.1561/1500000102. URL https://doi.org/10.1561/1500000102. NVIDIA, :, Blakeman, A., Basant, A., Khattar, A., Ren- duchintala, A., Bercovich, A., Ficek, A., Bjorlin, A., Taghibakhshi, A., Deshmukh, A. S., Mahabaleshwarkar, A. S., Tao, A., Shors, A., Aithal, A., Poojary, A., Dattagupta, A., Buddharaju, B., Chen, B., Ginsburg, B., Wang,...

work page doi:10.1561/1500000102 2025

[8] [8]

Smith, J

URL https://openreview.net/forum? id=iF7MnXnxRw. Smith, J. T., Warrington, A., and Linderman, S. Simplified state space layers for sequence modeling. InThe Eleventh International Conference on Learning Representations,

[9] [9]

Soltani, R

URL https://openreview.net/forum? id=Ai8Hw3AXqks. Soltani, R. and Jiang, H. Higher order recurrent neural networks, 2017. URL https://openreview.net/ forum?id=ByZvfijeg. Stani´c, A., Ashley, D., Serikov, O., Kirsch, L., Faccio, F., Schmidhuber, J., Hofmann, T., and Schlag, I. The lan- guini kitchen: Enabling language modelling research at different scales...

arXiv 2017