Quantifying Memory Utilization with Effective State-Size

Alessandro Moro; Armin W. Thomas; Atsushi Yamashita; Michael Poli; Neehal Tumma; Qi An; Rom N. Parnichkun; Stefano Massaroli; Taiji Suzuki

arxiv: 2504.19561 · v1 · pith:CX7N4YGBnew · submitted 2025-04-28 · 💻 cs.LG

Rom N. Parnichkun, Neehal Tumma, Armin W. Thomas, Alessandro Moro, Qi An, Taiji Suzuki, Atsushi Yamashita, Michael Poli, Stefano Massaroli

classification 💻 cs.LG

keywords memory · utilization · effective · model · models · attention · design

The need to develop a general framework for architecture analysis is becoming increasingly important, given the expanding design space of sequence models. To this end, we draw insights from classical signal processing and control theory to develop a quantitative measure of memory utilization: the internal mechanisms through which a model stores past information to produce future outputs. This metric, which we call effective state-size (ESS), is tailored to the fundamental class of systems with input-invariant and input-varying linear operators, encompassing a variety of computational units such as variants of attention, convolutions, and recurrences. Unlike prior work on memory utilization, which relies either on raw operator visualizations (e.g., attention maps) or simply on the total memory capacity (i.e., cache size) of a model, our metrics provide highly interpretable and actionable measurements. In particular, we show how ESS can be leveraged to improve initialization strategies, inform novel regularizers, and advance the performance-efficiency frontier through model distillation. Furthermore, we demonstrate that the effect of context delimiters (such as end-of-speech tokens) on ESS highlights cross-architectural differences in how large language models utilize their available memory to recall information. Overall, we find that ESS provides valuable insights into the dynamics that dictate memory utilization, enabling the design of more efficient and effective sequence models.

Forward citations

Cited by 2 Pith papers

Reviewed papers in the Pith corpus that reference this work. Sorted by Pith novelty score.

Preconditioned DeltaNet: Curvature-aware Sequence Modeling for Linear Recurrences
cs.LG 2026-04 unverdicted novelty 7.0

Preconditioned delta-rule models with a diagonal curvature approximation improve upon standard DeltaNet, GDN, and KDA by better approximating the test-time regression objective.

Contribution Weights: A Geometrical Analysis of Self-Attention Transformers
cs.LG 2026-05 unverdicted novelty 6.0

Contribution Weights combine attention, value magnitude, and directional alignment to measure token influence more faithfully than attention alone, and show attention sinks actively suppress information via a convex s...