Timesteps of Mamba Align with Human Reading Times

Shinnosuke Isono; Sho Yokoi; Yoshinobu Kawahara; Yuji Yamamoto

arxiv: 2606.29904 · v1 · pith:BJ3Y3U2Fnew · submitted 2026-06-29 · 💻 cs.CL

Timesteps of Mamba Align with Human Reading Times

Yuji Yamamoto , Shinnosuke Isono , Yoshinobu Kawahara , Sho Yokoi This is my paper

Pith reviewed 2026-06-30 06:46 UTC · model grok-4.3

classification 💻 cs.CL

keywords Mambastate-space modelsreading timeseye-trackinglanguage processingdiscretization timestepsurprisalmemory representation

0 comments

The pith

Mamba's input-dependent discretization timesteps predict human reading times even after controlling for GPT-2 surprisal.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper establishes that in the Mamba state-space model, the per-word discretization timestep, which represents the duration of recurrent state transitions at each layer, aligns with human per-word reading times from eye-tracking data. This predictor remains significant even after accounting for established factors like GPT-2 surprisal. A sympathetic reader would care because it positions Mamba as a model for real-time language processing that incorporates dynamic, input-responsive memory updates across layers. The work draws on naturalistic reading datasets and formal analysis of the model's internal dynamics to support this alignment.

Core claim

The per-word timestep from Mamba is a significant predictor of human reading times, and remains significant even when known predictors such as GPT-2 surprisal are controlled for. We further suggest, through formal analysis of Mamba's architecture and internal dynamics, that Mamba can serve as a new, valuable lens to look at human real-time language processing with ever-updated memory, because it allows us to look at how each module (layer) weighs short- and long-term information retention, and how noise may interact with dynamic, continuous memory representation.

What carries the argument

The discretization timestep Δ_t in Mamba, dynamically computed from the input to set the duration of each recurrent state transition at every layer.

If this is right

Mamba timesteps supply an additional predictor of human reading behavior that operates beyond standard surprisal measures.
Layer-wise analysis of Δ_t values can reveal differential weighting of short-term versus long-term information retention during processing.
Mamba's continuous memory representation offers a route to examine how noise affects dynamic state updates in language comprehension.
The architecture provides a concrete mechanism for modeling ever-updated memory states in real-time human language processing.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

If the alignment holds, similar timestep-to-reading-time correlations could be tested in other input-dependent recurrent or state-space architectures.
This opens the possibility that input-responsive discretization itself contributes to modeling human-like processing efficiency across different tasks.
Controlled experiments could manipulate the computation of Δ_t to check whether changes in its dynamics alter alignment with eye-tracking measures.

Load-bearing premise

The model's discretization timestep Δ_t can be interpreted as a duration of processing time that is directly comparable to human per-word reading times measured in eye-tracking data.

What would settle it

A regression showing that Mamba timesteps lose all predictive power for reading times once additional controls such as word length, frequency, and syntactic complexity are added alongside surprisal.

Figures

Figures reproduced from arXiv: 2606.29904 by Shinnosuke Isono, Sho Yokoi, Yoshinobu Kawahara, Yuji Yamamoto.

**Figure 2.** Figure 2: Discretization timesteps ∆¯ t of each of the 24 layers of Mamba-130M, in response to the first five sentences from the Natural Stories Corpus. “★” indicates the beginning of a sentence. dynamically modulates state transitions by changing the size of timesteps in response to the input, offering greater flexibility than other SSMs that assume fixed transition dynamics. As illustrated in [PITH_FULL_IMAGE:f… view at source ↗

**Figure 3.** Figure 3: Correlations between linguistic features of [PITH_FULL_IMAGE:figures/full_fig_p006_3.png] view at source ↗

**Figure 5.** Figure 5: Accuracy in the passkey retrieval task when [PITH_FULL_IMAGE:figures/full_fig_p006_5.png] view at source ↗

**Figure 6.** Figure 6: Eigenvalue distribution of the transition matrix [PITH_FULL_IMAGE:figures/full_fig_p008_6.png] view at source ↗

read the original abstract

This study demonstrates an alignment of per-word processing time in a popular state-space language model Mamba and human readers. In Mamba, the recurrent state transition at each layer conceptually takes some duration of time, the discretization timestep $\Delta_t$, determined dynamically in response to the input. Using a naturalistic reading dataset, we show that the per-word timestep from Mamba is a significant predictor of human reading times, and remains significant even when known predictors such as GPT-2 surprisal are controlled for. We further suggest, through formal analysis of Mamba's architecture and internal dynamics, that Mamba can serve as a new, valuable lens to look at human real-time language processing with ever-updated memory, because it allows us to look at how each module (layer) weighs short- and long-term information retention, and how noise may interact with dynamic, continuous memory representation. Code is available online.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Desk Editor's Note private letter to a colleague

Mamba timestep correlates with reading times beyond surprisal, but the interpretation as human processing duration lacks grounding.

read the letter

The main new result here is an empirical correlation: Mamba's per-word discretization timestep Δ_t predicts eye-tracked reading times in a naturalistic dataset, and the predictor stays significant after controlling for GPT-2 surprisal. That specific link has not been shown before.

The work does a few things cleanly. It applies a controlled regression to an existing reading corpus, adds a standard baseline predictor, and releases code. The abstract also flags a formal analysis of how Mamba layers handle short- versus long-term retention, which could be useful if it is actually carried out in the full text.

The soft spots are more substantial. The abstract gives no coefficients, no sample size, no exact test, and no exclusion rules, so the size and robustness of the effect cannot be judged from what is supplied. More importantly, the central claim treats Δ_t as a proxy for processing duration. In the Mamba architecture Δ_t is simply softplus of a linear projection of the token; it is a learned rate with no units and no training objective that ties it to milliseconds or cognitive cycles. Nothing in the reported analysis rules out the possibility that Δ_t is capturing residual token-level properties (context length, embedding magnitude, or similar) that also affect reading times. The stress-test concern therefore lands: the paper does not supply a mechanistic reason why this parameter should map to human memory updating rather than to input sensitivity.

The paper is aimed at researchers who already work on state-space models and want to connect them to psycholinguistic measures. A reader looking for a new, falsifiable link between an SSM hyperparameter and behavioral data could extract an idea from it, but anyone expecting a solid model of human timing will find the evidence thin.

I would send it to peer review so the methods and the formal analysis can be checked, but the interpretation section will need substantial tightening.

Referee Report

2 major / 2 minor

Summary. The manuscript claims that the input-dependent discretization timestep Δ_t in the Mamba state-space model is a significant predictor of human per-word reading times in naturalistic eye-tracking data, remains significant after controlling for GPT-2 surprisal, and that formal analysis of Mamba's architecture supports its use as a model for human real-time language processing with dynamic, continuous memory updating across layers.

Significance. If the reported correlation holds under full statistical reporting and the architectural interpretation is substantiated, the work would provide a concrete, architecture-specific bridge between SSMs and cognitive reading measures, with potential to examine layer-wise short- versus long-term retention and noise effects. Code availability supports reproducibility.

major comments (2)

[Abstract] Abstract: the central regression result is stated without coefficients, sample size, exact statistical test, exclusion criteria, or effect-size information, so the claim that Δ_t 'remains significant' after controlling for GPT-2 surprisal cannot be evaluated.
[Abstract] Abstract (formal analysis paragraph): the claim that Δ_t functions as a 'duration of processing time' comparable to eye-tracked reading times is load-bearing for the mechanistic interpretation, yet Δ_t is defined only as an input-dependent learned rate (softplus of a linear projection) with no units or objective tying it to milliseconds; the paper does not demonstrate that residual variance after surprisal control is not simply other token-level features.

minor comments (2)

[Abstract] The naturalistic reading dataset is referenced but neither named nor described (e.g., corpus, number of participants, preprocessing).
[Abstract] Notation for Δ_t should be introduced with its exact functional form (softplus(linear(embedding))) at first use rather than only conceptually.

Simulated Author's Rebuttal

2 responses · 0 unresolved

We thank the referee for the detailed and constructive feedback. We agree that the abstract requires more precise statistical reporting and will revise it to include coefficients, sample size, test details, exclusions, and effect sizes. We also address the interpretation of Δ_t below, providing clarification on its conceptual role while acknowledging its learned-parameter definition. Point-by-point responses follow.

read point-by-point responses

Referee: [Abstract] Abstract: the central regression result is stated without coefficients, sample size, exact statistical test, exclusion criteria, or effect-size information, so the claim that Δ_t 'remains significant' after controlling for GPT-2 surprisal cannot be evaluated.

Authors: We agree this information is necessary for evaluating the claim. The full manuscript reports linear mixed-effects models on the naturalistic eye-tracking corpus (with per-word Δ_t from Mamba layers as predictor, controlling for GPT-2 surprisal), but the abstract summarizes only the significance. In revision we will add the key coefficients (with SE), N (words and participants), exact model specification, exclusion criteria (e.g., outliers, fixation thresholds), and standardized effect sizes to the abstract while keeping it concise. revision: yes
Referee: [Abstract] Abstract (formal analysis paragraph): the claim that Δ_t functions as a 'duration of processing time' comparable to eye-tracked reading times is load-bearing for the mechanistic interpretation, yet Δ_t is defined only as an input-dependent learned rate (softplus of a linear projection) with no units or objective tying it to milliseconds; the paper does not demonstrate that residual variance after surprisal control is not simply other token-level features.

Authors: We accept that Δ_t lacks millisecond units and is a learned input-dependent rate; the manuscript frames the alignment as conceptual rather than literal equivalence, supported by the architecture's dynamic state transition and layer-wise analysis of short- versus long-term retention. The regression already controls for GPT-2 surprisal (a strong token-level predictor), and Δ_t remains significant, but we agree additional token features (length, frequency) should be tested to rule out residual confounds. We will add these controls to the regression in revision or explicitly discuss the limitation if dataset constraints apply. revision: partial

Circularity Check

0 steps flagged

No significant circularity; empirical correlation with external control

full rationale

The paper's central claim is an empirical statistical result: Mamba's per-word discretization timestep Δ_t (computed from a pretrained model) significantly predicts human reading times in eye-tracking data, and remains significant after controlling for GPT-2 surprisal. This is a post-hoc correlation between model internals and an independent external dataset; it does not reduce by the paper's equations to a quantity defined in terms of itself, nor is any 'prediction' obtained by fitting parameters to the target reading-time data. No self-citation chain, uniqueness theorem, or ansatz is invoked to force the result. The formal analysis of Mamba's architecture is presented as interpretive context rather than a derivation that tautologically produces the observed correlation. The finding is therefore self-contained against external benchmarks.

Axiom & Free-Parameter Ledger

0 free parameters · 1 axioms · 0 invented entities

The central comparison rests on treating the learned discretization timestep as a proxy for processing duration; no free parameters or invented entities are described in the abstract.

axioms (1)

domain assumption The discretization timestep Δ_t represents a duration of processing time comparable to human reading times
Invoked when the authors treat per-word Δ_t values as predictors of reading times.

pith-pipeline@v0.9.1-grok · 5688 in / 1246 out tokens · 23983 ms · 2026-06-30T06:46:27.386957+00:00 · methodology

discussion (0)

Reference graph

Works this paper leans on

58 extracted references · 27 canonical work pages · 3 internal anchors

[1]

FLA: A Triton-Based Library for Hardware-Efficient Implementations of Linear Attention Mechanism , author =
[6]

1990 , publisher =

Prefix sums and their applications , author =. 1990 , publisher =

1990
[7]

Proceedings of the 63rd Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers) , month = jul, year =

The Hidden Attention of Mamba Models , author =. Proceedings of the 63rd Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers) , month = jul, year =. doi:10.18653/v1/2025.acl-long.76 , pages =

work page doi:10.18653/v1/2025.acl-long.76 2025
[9]

First Conference on Language Modeling , year =

Locating and Editing Factual Associations in Mamba , author =. First Conference on Language Modeling , year =
[10]

Random-Access Infinite Context Length for Transformers , url =

Mohtashami, Amirkeivan and Jaggi, Martin , booktitle =. Random-Access Infinite Context Length for Transformers , url =
[11]

2020 , howpublished =

nostalgebraist , title =. 2020 , howpublished =

2020
[14]

First Conference on Language Modeling , year =

Mamba: Linear-Time Sequence Modeling with Selective State Spaces , author =. First Conference on Language Modeling , year =
[15]

GLU Variants Improve Transformer

Glu variants improve transformer , author =. arXiv preprint arXiv:2002.05202 , year =

work page internal anchor Pith review Pith/arXiv arXiv 2002
[16]

arXiv preprint arXiv:2502.01615 , year =

Large language models are human-like internally , author =. arXiv preprint arXiv:2502.01615 , year =

work page arXiv
[17]

Tily and Idan Blank and Anastasia Vishnevetsky and Steven T

Richard Futrell and Edward Gibson and Harry J. Tily and Idan Blank and Anastasia Vishnevetsky and Steven T. Piantadosi and Evelina Fedorenko , year =. The Natural Stories corpus: A reading-time corpus of English texts containing rare syntactic constructions , journal =
[18]

2018 , url =

Practical Guide to State-space Control: Graduate-level control theory for high schoolers , author =. 2018 , url =

2018
[19]

and Santorini, Beatrice and Marcinkiewicz, Mary Ann , editor =

Marcus, Mitchell P. and Santorini, Beatrice and Marcinkiewicz, Mary Ann , editor =. Building a Large Annotated Corpus of. Computational Linguistics , volume =. 1993 , address =

1993
[20]

Tjong Kim Sang, Erik F. and D. Introduction to the. Proceedings of the. 2001 , url =

2001
[21]

言語処理学会第28回年次大会発表論文集 , year =

ニューラル言語モデルの過剰な作業記憶 , author =. 言語処理学会第28回年次大会発表論文集 , year =
[22]

A Probabilistic

Hale, John , booktitle =. A Probabilistic. 2001 , url =

2001
[23]

Expectation-based syntactic comprehension , volume =

Levy, Roger , doi =. Expectation-based syntactic comprehension , volume =. Cognition , number =
[29]

Modeling working memory: an interference model of complex span

Oberauer, Klaus and Lewandowsky, Stephan and Farrell, Simon and Jarrold, Christopher and Greaves, Martin , doi =. Modeling working memory: an interference model of complex span. , volume =. Psychonomic bulletin & review , number =
[30]

Carpenter and Jacqueline D

Marcel Adam Just and Patricia A. Carpenter and Jacqueline D. Woolley , journal =. Paradigms and processes in reading comprehension , volume =. 1982 , doi =

1982
[31]

Retrieval interference in sentence comprehension , volume =

Julie A. Retrieval interference in sentence comprehension , volume =. Journal of Memory and Language , keywords =. doi:10.1016/j.jml.2006.03.007 , issn =

work page doi:10.1016/j.jml.2006.03.007 2006
[32]

When High-Capacity Readers Slow Down and Low-Capacity Readers Speed Up: Working Memory and Locality Effects , volume =

Nicenboim, Bruno and Loga. When High-Capacity Readers Slow Down and Low-Capacity Readers Speed Up: Working Memory and Locality Effects , volume =. doi:10.3389/fpsyg.2016.00280 , journal =

work page doi:10.3389/fpsyg.2016.00280 2016
[33]

Vision , year =

David Marr , publisher =. Vision , year =
[34]

Data from eye-tracking corpora as evidence for theories of syntactic processing complexity , volume = 109, year =

Vera Demberg and Frank Keller , journal =. Data from eye-tracking corpora as evidence for theories of syntactic processing complexity , volume = 109, year =
[35]

bioRxiv , doi =

Laura Gwilliams and Alec Marantz and David Poeppel and Jean-Remi King , year = 2024, title =. bioRxiv , doi =

2024
[36]

Nature communications , volume=

Neural dynamics of phoneme sequences reveal position-invariant code for content and order , author=. Nature communications , volume=. 2022 , doi =

2022
[37]

Information-theoretical Complexity Metrics , volume =

John Hale , journal =. Information-theoretical Complexity Metrics , volume =. 2016 , doi =

2016
[38]

Lossy-Context Surprisal: An Information-Theoretic Model of Memory Effects in Sentence Processing , volume =

Futrell, Richard and Gibson, Edward and Levy, Roger , doi =. Lossy-Context Surprisal: An Information-Theoretic Model of Memory Effects in Sentence Processing , volume =. Cognitive Science , keywords =
[39]

PNAS , volume = 119, number = 43, doi =

A resource-rational model of human processing of recursive linguistic structure , author =. PNAS , volume = 119, number = 43, doi =
[40]

Scientific Data , volume = 12, number = 1995, doi =

Yevgeni Berzak and Jonathan Malmaud and Omer Shubi and Yoav Meiri and Ella Lion and Roger Levy , year = 2025, title =. Scientific Data , volume = 12, number = 1995, doi =

2025
[42]

Yevgeni Berzak, Jonathan Malmaud, Omer Shubi, Yoav Meiri, Ella Lion, and Roger Levy. 2025. https://doi.org/10.1038/s41597-025-06272-2 Onestop: A 360-participant E nglish eye tracking dataset with different reading regimes . Scientific Data, 12(1995)

work page doi:10.1038/s41597-025-06272-2 2025
[43]

Aaron Blakeman, Aaron Grattafiori, Aarti Basant, Abhibha Gupta, Abhinav Khattar, Adi Renduchintala, Aditya Vavre, Akanksha Shukla, Akhiad Bercovich, Aleksander Ficek, and 1 others. 2025. Nvidia nemotron 3: Efficient and open intelligence. arXiv preprint arXiv:2512.20856

work page internal anchor Pith review Pith/arXiv arXiv 2025
[44]

Guy E Blelloch. 1990. Prefix sums and their applications

1990
[45]

Vera Demberg and Frank Keller. 2008. https://doi.org/10.1016/j.cognition.2008.07.008 Data from eye-tracking corpora as evidence for theories of syntactic processing complexity . Cognition, 109:192--210

work page doi:10.1016/j.cognition.2008.07.008 2008
[46]

Nir Endy, Idan Daniel Grosbard, Yuval Ran-Milo, Yonatan Slutzky, Itay Tshuva, and Raja Giryes. 2025. https://doi.org/10.18653/v1/2025.acl-long.1143 Mamba knockout for unraveling factual information flow . In Proceedings of the 63rd Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), pages 23457--23477, Vienna, Austria....

work page doi:10.18653/v1/2025.acl-long.1143 2025
[47]

Richard Futrell, Edward Gibson, and Roger Levy. 2020. https://doi.org/10.1111/cogs.12814 Lossy-context surprisal: An information-theoretic model of memory effects in sentence processing . Cognitive Science, 44(3):e12814

work page doi:10.1111/cogs.12814 2020
[48]

Tily, Idan Blank, Anastasia Vishnevetsky, Steven T

Richard Futrell, Edward Gibson, Harry J. Tily, Idan Blank, Anastasia Vishnevetsky, Steven T. Piantadosi, and Evelina Fedorenko. 2021. The natural stories corpus: A reading-time corpus of english texts containing rare syntactic constructions. Language Resources and Evaluation, 55(1):63--77

2021
[49]

Albert Gu and Tri Dao. 2024. https://openreview.net/forum?id=tEYskw1VY2 Mamba: Linear-time sequence modeling with selective state spaces . In First Conference on Language Modeling

2024
[50]

Laura Gwilliams, Jean-Remi King, Alec Marantz, and David Poeppel. 2022. https://doi.org/10.1038/s41467-022-34326-1 Neural dynamics of phoneme sequences reveal position-invariant code for content and order . Nature communications, 13(1):6606

work page doi:10.1038/s41467-022-34326-1 2022
[51]

Laura Gwilliams, Alec Marantz, David Poeppel, and Jean-Remi King. 2024. https://doi.org/10.1101/2024.04.19.590280 Hierarchical dynamic coding coordinates speech comprehension in the human brain . bioRxiv

work page doi:10.1101/2024.04.19.590280 2024
[52]

Michael Hahn, Richard Futrell, Roger Levy, and Edward Gibson. 2022. https://doi.org/10.1073/pnas.2122602119 A resource-rational model of human processing of recursive linguistic structure . PNAS, 119(43)

work page doi:10.1073/pnas.2122602119 2022
[53]

John Hale. 2001. https://aclanthology.org/N01-1021/ A probabilistic E arley parser as a psycholinguistic model . In Second Meeting of the North A merican Chapter of the Association for Computational Linguistics

2001
[54]

John Hale. 2016. https://doi.org/10.1111/lnc3.12196 Information-theoretical complexity metrics . Language and Linguistics Compass, 10:397--412

work page doi:10.1111/lnc3.12196 2016
[55]

Carpenter, and Jacqueline D

Marcel Adam Just, Patricia A. Carpenter, and Jacqueline D. Woolley. 1982. https://doi.org/10.1037/0096-3445.111.2.228 Paradigms and processes in reading comprehension . Journal of Experimental Psychology: General, 111(2):228--238

work page doi:10.1037/0096-3445.111.2.228 1982
[56]

Kimi Team , Yu Zhang, Zongyu Lin, Xingcheng Yao, Jiaxi Hu, Fanqing Meng, Chengyin Liu, Xin Men, Songlin Yang, Zhiyuan Li, and 1 others. 2025. Kimi linear: An expressive, efficient attention architecture. arXiv preprint arXiv:2510.26692

work page internal anchor Pith review Pith/arXiv arXiv 2025
[57]

Tatsuki Kuribayashi, Yohei Oseki, Ana Brassard, and Kentaro Inui. 2022. https://doi.org/10.18653/v1/2022.emnlp-main.712 Context limitations make neural language models more human-like . In Proceedings of the 2022 Conference on Empirical Methods in Natural Language Processing, pages 10421--10436, Abu Dhabi, United Arab Emirates. Association for Computation...

work page doi:10.18653/v1/2022.emnlp-main.712 2022
[58]

Tatsuki Kuribayashi, Yohei Oseki, Takumi Ito, Ryo Yoshida, Masayuki Asahara, and Kentaro Inui. 2021. https://doi.org/10.18653/v1/2021.acl-long.405 Lower perplexity is not always human-like . In Proceedings of the 59th Annual Meeting of the Association for Computational Linguistics and the 11th International Joint Conference on Natural Language Processing ...

work page doi:10.18653/v1/2021.acl-long.405 2021
[59]

Tatsuki Kuribayashi, Yohei Oseki, Souhaib Ben Taieb, Kentaro Inui, and Timothy Baldwin. 2025. https://doi.org/10.1162/TACL.a.58 Large language models are human-like internally . Transactions of the Association for Computational Linguistics, 13:1743--1766

work page doi:10.1162/tacl.a.58 2025
[60]

Roger Levy. 2008. https://doi.org/10.1016/j.cognition.2007.05.006 Expectation-based syntactic comprehension . Cognition, 106(3):1126--1177

work page doi:10.1016/j.cognition.2007.05.006 2008
[61]

Wei Liu and Nai Ding. 2025. https://doi.org/10.18653/v1/2025.emnlp-main.351 Information integration in large language models is gated by linguistic structural markers . In Proceedings of the 2025 Conference on Empirical Methods in Natural Language Processing, pages 6903--6915, Suzhou, China. Association for Computational Linguistics

work page doi:10.18653/v1/2025.emnlp-main.351 2025
[62]

David Marr. 1982. Vision. WH Freeman, San Fransisco, CA

1982
[63]

Amirkeivan Mohtashami and Martin Jaggi. 2023. https://proceedings.neurips.cc/paper_files/paper/2023/file/ab05dc8bf36a9f66edbff6992ec86f56-Paper-Conference.pdf Random-access infinite context length for transformers . In Advances in Neural Information Processing Systems, volume 36, pages 54567--54585. Curran Associates, Inc

2023
[64]

Byung-Doh Oh and William Schuler. 2023. https://doi.org/10.1162/tacl_a_00548 Why does surprisal from larger transformer-based language models provide a poorer fit to human reading times? Transactions of the Association for Computational Linguistics, 11:336--350

work page doi:10.1162/tacl_a_00548 2023
[65]

Preferred Networks , Kaizaburo Chubachi, Yasuhiro Fujita, Shinichi Hemmi, Yuta Hirokawa, Kentaro Imajo, Toshiki Kataoka, Goro Kobayashi, Kenichi Maehashi, Calvin Metzger, and 1 others. 2025. Plamo 2 technical report. arXiv preprint arXiv:2509.04897

work page arXiv 2025
[66]

Soo Hyun Ryu and Richard Lewis. 2021. https://doi.org/10.18653/v1/2021.cmcl-1.6 Accounting for agreement phenomena in sentence comprehension with transformer language models: Effects of similarity-based interference on surprisal and attention . In Proceedings of the Workshop on Cognitive Modeling and Computational Linguistics, pages 61--71, Online. Associ...

work page doi:10.18653/v1/2021.cmcl-1.6 2021
[67]

Arnab Sen Sharma, David Atkinson, and David Bau. 2024. https://openreview.net/forum?id=yoVRyrEgix Locating and editing factual associations in mamba . In First Conference on Language Modeling

2024
[68]

Wilcox, Tiago Pimentel, Clara Meister, Ryan Cotterell, and Roger P

Ethan G. Wilcox, Tiago Pimentel, Clara Meister, Ryan Cotterell, and Roger P. Levy. 2023. https://doi.org/10.1162/tacl_a_00612 Testing the predictions of surprisal theory in 11 languages . Transactions of the Association for Computational Linguistics, 11:1451--1470

work page doi:10.1162/tacl_a_00612 2023
[69]

Songlin Yang and Yu Zhang. 2024. https://github.com/fla-org/flash-linear-attention Fla: A triton-based library for hardware-efficient implementations of linear attention mechanism

2024
[70]

Ryo Yoshida, Shinnosuke Isono, Kohei Kajikawa, Taiga Someya, Yushi Sugimoto, and Yohei Oseki. 2025. https://doi.org/10.18653/v1/2025.acl-long.483 If attention serves as a cognitive model of human memory retrieval, what is the plausible memory representation? In Proceedings of the 63rd Annual Meeting of the Association for Computational Linguistics (Volume...

work page doi:10.18653/v1/2025.acl-long.483 2025
[71]

Jingwei Zuo, Maksim Velikanov, Dhia Eddine Rhaiem, Ilyas Chahed, Younes Belkada, Guillaume Kunsch, and Hakim Hacid. 2024. Falcon mamba: The first competitive attention-free 7b language model. arXiv preprint arXiv:2410.05355

work page arXiv 2024

[1] [1]

FLA: A Triton-Based Library for Hardware-Efficient Implementations of Linear Attention Mechanism , author =

[2] [6]

1990 , publisher =

Prefix sums and their applications , author =. 1990 , publisher =

1990

[3] [7]

Proceedings of the 63rd Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers) , month = jul, year =

The Hidden Attention of Mamba Models , author =. Proceedings of the 63rd Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers) , month = jul, year =. doi:10.18653/v1/2025.acl-long.76 , pages =

work page doi:10.18653/v1/2025.acl-long.76 2025

[4] [9]

First Conference on Language Modeling , year =

Locating and Editing Factual Associations in Mamba , author =. First Conference on Language Modeling , year =

[5] [10]

Random-Access Infinite Context Length for Transformers , url =

Mohtashami, Amirkeivan and Jaggi, Martin , booktitle =. Random-Access Infinite Context Length for Transformers , url =

[6] [11]

2020 , howpublished =

nostalgebraist , title =. 2020 , howpublished =

2020

[7] [14]

First Conference on Language Modeling , year =

Mamba: Linear-Time Sequence Modeling with Selective State Spaces , author =. First Conference on Language Modeling , year =

[8] [15]

GLU Variants Improve Transformer

Glu variants improve transformer , author =. arXiv preprint arXiv:2002.05202 , year =

work page internal anchor Pith review Pith/arXiv arXiv 2002

[9] [16]

arXiv preprint arXiv:2502.01615 , year =

Large language models are human-like internally , author =. arXiv preprint arXiv:2502.01615 , year =

work page arXiv

[10] [17]

Tily and Idan Blank and Anastasia Vishnevetsky and Steven T

Richard Futrell and Edward Gibson and Harry J. Tily and Idan Blank and Anastasia Vishnevetsky and Steven T. Piantadosi and Evelina Fedorenko , year =. The Natural Stories corpus: A reading-time corpus of English texts containing rare syntactic constructions , journal =

[11] [18]

2018 , url =

Practical Guide to State-space Control: Graduate-level control theory for high schoolers , author =. 2018 , url =

2018

[12] [19]

and Santorini, Beatrice and Marcinkiewicz, Mary Ann , editor =

Marcus, Mitchell P. and Santorini, Beatrice and Marcinkiewicz, Mary Ann , editor =. Building a Large Annotated Corpus of. Computational Linguistics , volume =. 1993 , address =

1993

[13] [20]

Tjong Kim Sang, Erik F. and D. Introduction to the. Proceedings of the. 2001 , url =

2001

[14] [21]

言語処理学会第28回年次大会発表論文集 , year =

ニューラル言語モデルの過剰な作業記憶 , author =. 言語処理学会第28回年次大会発表論文集 , year =

[15] [22]

A Probabilistic

Hale, John , booktitle =. A Probabilistic. 2001 , url =

2001

[16] [23]

Expectation-based syntactic comprehension , volume =

Levy, Roger , doi =. Expectation-based syntactic comprehension , volume =. Cognition , number =

[17] [29]

Modeling working memory: an interference model of complex span

Oberauer, Klaus and Lewandowsky, Stephan and Farrell, Simon and Jarrold, Christopher and Greaves, Martin , doi =. Modeling working memory: an interference model of complex span. , volume =. Psychonomic bulletin & review , number =

[18] [30]

Carpenter and Jacqueline D

Marcel Adam Just and Patricia A. Carpenter and Jacqueline D. Woolley , journal =. Paradigms and processes in reading comprehension , volume =. 1982 , doi =

1982

[19] [31]

Retrieval interference in sentence comprehension , volume =

Julie A. Retrieval interference in sentence comprehension , volume =. Journal of Memory and Language , keywords =. doi:10.1016/j.jml.2006.03.007 , issn =

work page doi:10.1016/j.jml.2006.03.007 2006

[20] [32]

When High-Capacity Readers Slow Down and Low-Capacity Readers Speed Up: Working Memory and Locality Effects , volume =

Nicenboim, Bruno and Loga. When High-Capacity Readers Slow Down and Low-Capacity Readers Speed Up: Working Memory and Locality Effects , volume =. doi:10.3389/fpsyg.2016.00280 , journal =

work page doi:10.3389/fpsyg.2016.00280 2016

[21] [33]

Vision , year =

David Marr , publisher =. Vision , year =

[22] [34]

Data from eye-tracking corpora as evidence for theories of syntactic processing complexity , volume = 109, year =

Vera Demberg and Frank Keller , journal =. Data from eye-tracking corpora as evidence for theories of syntactic processing complexity , volume = 109, year =

[23] [35]

bioRxiv , doi =

Laura Gwilliams and Alec Marantz and David Poeppel and Jean-Remi King , year = 2024, title =. bioRxiv , doi =

2024

[24] [36]

Nature communications , volume=

Neural dynamics of phoneme sequences reveal position-invariant code for content and order , author=. Nature communications , volume=. 2022 , doi =

2022

[25] [37]

Information-theoretical Complexity Metrics , volume =

John Hale , journal =. Information-theoretical Complexity Metrics , volume =. 2016 , doi =

2016

[26] [38]

Lossy-Context Surprisal: An Information-Theoretic Model of Memory Effects in Sentence Processing , volume =

Futrell, Richard and Gibson, Edward and Levy, Roger , doi =. Lossy-Context Surprisal: An Information-Theoretic Model of Memory Effects in Sentence Processing , volume =. Cognitive Science , keywords =

[27] [39]

PNAS , volume = 119, number = 43, doi =

A resource-rational model of human processing of recursive linguistic structure , author =. PNAS , volume = 119, number = 43, doi =

[28] [40]

Scientific Data , volume = 12, number = 1995, doi =

Yevgeni Berzak and Jonathan Malmaud and Omer Shubi and Yoav Meiri and Ella Lion and Roger Levy , year = 2025, title =. Scientific Data , volume = 12, number = 1995, doi =

2025

[29] [42]

Yevgeni Berzak, Jonathan Malmaud, Omer Shubi, Yoav Meiri, Ella Lion, and Roger Levy. 2025. https://doi.org/10.1038/s41597-025-06272-2 Onestop: A 360-participant E nglish eye tracking dataset with different reading regimes . Scientific Data, 12(1995)

work page doi:10.1038/s41597-025-06272-2 2025

[30] [43]

Aaron Blakeman, Aaron Grattafiori, Aarti Basant, Abhibha Gupta, Abhinav Khattar, Adi Renduchintala, Aditya Vavre, Akanksha Shukla, Akhiad Bercovich, Aleksander Ficek, and 1 others. 2025. Nvidia nemotron 3: Efficient and open intelligence. arXiv preprint arXiv:2512.20856

work page internal anchor Pith review Pith/arXiv arXiv 2025

[31] [44]

Guy E Blelloch. 1990. Prefix sums and their applications

1990

[32] [45]

Vera Demberg and Frank Keller. 2008. https://doi.org/10.1016/j.cognition.2008.07.008 Data from eye-tracking corpora as evidence for theories of syntactic processing complexity . Cognition, 109:192--210

work page doi:10.1016/j.cognition.2008.07.008 2008

[33] [46]

Nir Endy, Idan Daniel Grosbard, Yuval Ran-Milo, Yonatan Slutzky, Itay Tshuva, and Raja Giryes. 2025. https://doi.org/10.18653/v1/2025.acl-long.1143 Mamba knockout for unraveling factual information flow . In Proceedings of the 63rd Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), pages 23457--23477, Vienna, Austria....

work page doi:10.18653/v1/2025.acl-long.1143 2025

[34] [47]

Richard Futrell, Edward Gibson, and Roger Levy. 2020. https://doi.org/10.1111/cogs.12814 Lossy-context surprisal: An information-theoretic model of memory effects in sentence processing . Cognitive Science, 44(3):e12814

work page doi:10.1111/cogs.12814 2020

[35] [48]

Tily, Idan Blank, Anastasia Vishnevetsky, Steven T

Richard Futrell, Edward Gibson, Harry J. Tily, Idan Blank, Anastasia Vishnevetsky, Steven T. Piantadosi, and Evelina Fedorenko. 2021. The natural stories corpus: A reading-time corpus of english texts containing rare syntactic constructions. Language Resources and Evaluation, 55(1):63--77

2021

[36] [49]

Albert Gu and Tri Dao. 2024. https://openreview.net/forum?id=tEYskw1VY2 Mamba: Linear-time sequence modeling with selective state spaces . In First Conference on Language Modeling

2024

[37] [50]

Laura Gwilliams, Jean-Remi King, Alec Marantz, and David Poeppel. 2022. https://doi.org/10.1038/s41467-022-34326-1 Neural dynamics of phoneme sequences reveal position-invariant code for content and order . Nature communications, 13(1):6606

work page doi:10.1038/s41467-022-34326-1 2022

[38] [51]

Laura Gwilliams, Alec Marantz, David Poeppel, and Jean-Remi King. 2024. https://doi.org/10.1101/2024.04.19.590280 Hierarchical dynamic coding coordinates speech comprehension in the human brain . bioRxiv

work page doi:10.1101/2024.04.19.590280 2024

[39] [52]

Michael Hahn, Richard Futrell, Roger Levy, and Edward Gibson. 2022. https://doi.org/10.1073/pnas.2122602119 A resource-rational model of human processing of recursive linguistic structure . PNAS, 119(43)

work page doi:10.1073/pnas.2122602119 2022

[40] [53]

John Hale. 2001. https://aclanthology.org/N01-1021/ A probabilistic E arley parser as a psycholinguistic model . In Second Meeting of the North A merican Chapter of the Association for Computational Linguistics

2001

[41] [54]

John Hale. 2016. https://doi.org/10.1111/lnc3.12196 Information-theoretical complexity metrics . Language and Linguistics Compass, 10:397--412

work page doi:10.1111/lnc3.12196 2016

[42] [55]

Carpenter, and Jacqueline D

Marcel Adam Just, Patricia A. Carpenter, and Jacqueline D. Woolley. 1982. https://doi.org/10.1037/0096-3445.111.2.228 Paradigms and processes in reading comprehension . Journal of Experimental Psychology: General, 111(2):228--238

work page doi:10.1037/0096-3445.111.2.228 1982

[43] [56]

Kimi Team , Yu Zhang, Zongyu Lin, Xingcheng Yao, Jiaxi Hu, Fanqing Meng, Chengyin Liu, Xin Men, Songlin Yang, Zhiyuan Li, and 1 others. 2025. Kimi linear: An expressive, efficient attention architecture. arXiv preprint arXiv:2510.26692

work page internal anchor Pith review Pith/arXiv arXiv 2025

[44] [57]

Tatsuki Kuribayashi, Yohei Oseki, Ana Brassard, and Kentaro Inui. 2022. https://doi.org/10.18653/v1/2022.emnlp-main.712 Context limitations make neural language models more human-like . In Proceedings of the 2022 Conference on Empirical Methods in Natural Language Processing, pages 10421--10436, Abu Dhabi, United Arab Emirates. Association for Computation...

work page doi:10.18653/v1/2022.emnlp-main.712 2022

[45] [58]

Tatsuki Kuribayashi, Yohei Oseki, Takumi Ito, Ryo Yoshida, Masayuki Asahara, and Kentaro Inui. 2021. https://doi.org/10.18653/v1/2021.acl-long.405 Lower perplexity is not always human-like . In Proceedings of the 59th Annual Meeting of the Association for Computational Linguistics and the 11th International Joint Conference on Natural Language Processing ...

work page doi:10.18653/v1/2021.acl-long.405 2021

[46] [59]

Tatsuki Kuribayashi, Yohei Oseki, Souhaib Ben Taieb, Kentaro Inui, and Timothy Baldwin. 2025. https://doi.org/10.1162/TACL.a.58 Large language models are human-like internally . Transactions of the Association for Computational Linguistics, 13:1743--1766

work page doi:10.1162/tacl.a.58 2025

[47] [60]

Roger Levy. 2008. https://doi.org/10.1016/j.cognition.2007.05.006 Expectation-based syntactic comprehension . Cognition, 106(3):1126--1177

work page doi:10.1016/j.cognition.2007.05.006 2008

[48] [61]

Wei Liu and Nai Ding. 2025. https://doi.org/10.18653/v1/2025.emnlp-main.351 Information integration in large language models is gated by linguistic structural markers . In Proceedings of the 2025 Conference on Empirical Methods in Natural Language Processing, pages 6903--6915, Suzhou, China. Association for Computational Linguistics

work page doi:10.18653/v1/2025.emnlp-main.351 2025

[49] [62]

David Marr. 1982. Vision. WH Freeman, San Fransisco, CA

1982

[50] [63]

Amirkeivan Mohtashami and Martin Jaggi. 2023. https://proceedings.neurips.cc/paper_files/paper/2023/file/ab05dc8bf36a9f66edbff6992ec86f56-Paper-Conference.pdf Random-access infinite context length for transformers . In Advances in Neural Information Processing Systems, volume 36, pages 54567--54585. Curran Associates, Inc

2023

[51] [64]

Byung-Doh Oh and William Schuler. 2023. https://doi.org/10.1162/tacl_a_00548 Why does surprisal from larger transformer-based language models provide a poorer fit to human reading times? Transactions of the Association for Computational Linguistics, 11:336--350

work page doi:10.1162/tacl_a_00548 2023

[52] [65]

Preferred Networks , Kaizaburo Chubachi, Yasuhiro Fujita, Shinichi Hemmi, Yuta Hirokawa, Kentaro Imajo, Toshiki Kataoka, Goro Kobayashi, Kenichi Maehashi, Calvin Metzger, and 1 others. 2025. Plamo 2 technical report. arXiv preprint arXiv:2509.04897

work page arXiv 2025

[53] [66]

Soo Hyun Ryu and Richard Lewis. 2021. https://doi.org/10.18653/v1/2021.cmcl-1.6 Accounting for agreement phenomena in sentence comprehension with transformer language models: Effects of similarity-based interference on surprisal and attention . In Proceedings of the Workshop on Cognitive Modeling and Computational Linguistics, pages 61--71, Online. Associ...

work page doi:10.18653/v1/2021.cmcl-1.6 2021

[54] [67]

Arnab Sen Sharma, David Atkinson, and David Bau. 2024. https://openreview.net/forum?id=yoVRyrEgix Locating and editing factual associations in mamba . In First Conference on Language Modeling

2024

[55] [68]

Wilcox, Tiago Pimentel, Clara Meister, Ryan Cotterell, and Roger P

Ethan G. Wilcox, Tiago Pimentel, Clara Meister, Ryan Cotterell, and Roger P. Levy. 2023. https://doi.org/10.1162/tacl_a_00612 Testing the predictions of surprisal theory in 11 languages . Transactions of the Association for Computational Linguistics, 11:1451--1470

work page doi:10.1162/tacl_a_00612 2023

[56] [69]

Songlin Yang and Yu Zhang. 2024. https://github.com/fla-org/flash-linear-attention Fla: A triton-based library for hardware-efficient implementations of linear attention mechanism

2024

[57] [70]

Ryo Yoshida, Shinnosuke Isono, Kohei Kajikawa, Taiga Someya, Yushi Sugimoto, and Yohei Oseki. 2025. https://doi.org/10.18653/v1/2025.acl-long.483 If attention serves as a cognitive model of human memory retrieval, what is the plausible memory representation? In Proceedings of the 63rd Annual Meeting of the Association for Computational Linguistics (Volume...

work page doi:10.18653/v1/2025.acl-long.483 2025

[58] [71]

Jingwei Zuo, Maksim Velikanov, Dhia Eddine Rhaiem, Ilyas Chahed, Younes Belkada, Guillaume Kunsch, and Hakim Hacid. 2024. Falcon mamba: The first competitive attention-free 7b language model. arXiv preprint arXiv:2410.05355

work page arXiv 2024