EMA Is Not All You Need: Mapping the Boundary Between Structure and Content in Recurrent Context

Arth Singh

arxiv: 2604.08556 · v1 · submitted 2026-03-17 · 💻 cs.CL · cs.AI


Pith reviewed 2026-05-15 10:48 UTC · model grok-4.3


keywords: exponential moving average, recurrent context, information dilution, language modeling, grammatical structure, fixed-coefficient accumulation, attention ablation


The pith

EMA traces capture grammatical structure without supervision but destroy token identity in language models, which the paper argues shows that fixed accumulation causes irreversible information loss.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper probes the limits of simple recurrent context using exponential moving average (EMA) traces, which have no gating and no content-based retrieval. With no labels, these traces reach 96 percent of a supervised BiGRU's score on grammatical role assignment, showing that they encode temporal structure effectively. In contrast, a 130-million-parameter language model relying solely on EMA context reaches a C4 perplexity of 260, about eight times that of GPT-2, and upgrading the predictor to full softmax attention produces no improvement, which isolates the failure to the traces themselves. Because the traces perform data-independent compression, the data processing inequality implies that downstream components cannot recover the lost token information. The core result is that fixed-coefficient accumulation, whether over time or depth, suffers permanent dilution unless learned, input-dependent selection is added.

Core claim

EMA traces encode temporal structure for tasks such as grammatical role assignment but apply lossy, data-independent compression that erases token identity; this produces poor language-model performance that persists even after the predictor is replaced by full softmax attention, confirming that the entire performance gap originates in the traces and cannot be recovered downstream.
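The irreversibility claim can be made concrete with a toy collision. The sketch below is illustrative only: the scalar token embeddings, the decay of 0.5, and the helper name `ema_trace` are assumptions chosen for the example, not values from the paper. It shows two different token sequences producing the same trace, so any predictor that reads only the trace must treat them identically.

```python
# Toy illustration (assumed values, not from the paper): two different
# token sequences collapse to the same EMA trace.

def ema_trace(values, decay=0.5):
    """Run h_t = decay * h_{t-1} + (1 - decay) * x_t starting from h_0 = 0."""
    h = 0.0
    for x in values:
        h = decay * h + (1.0 - decay) * x
    return h


# Hypothetical scalar embeddings for three tokens.
embedding = {"A": 1.0, "B": 2.0, "C": 3.0}

seq_1 = ["C", "A"]
seq_2 = ["A", "B"]

trace_1 = ema_trace([embedding[t] for t in seq_1])
trace_2 = ema_trace([embedding[t] for t in seq_2])

# Both traces equal 1.25, so no function of the trace alone can tell
# the two sequences apart.
print(trace_1, trace_2, trace_1 == trace_2)
```

Real embeddings are high-dimensional, so collisions there are less tidy than this scalar case; the example only shows the mechanism by which a fixed weighted sum can merge distinct histories.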

What carries the argument

Exponential moving average (EMA) traces as the simplest recurrent context accumulator without gating or content-based retrieval.
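A minimal sketch of such an accumulator, assuming the standard recurrence h_t = decay * h_{t-1} + (1 - decay) * x_t. The decay values, the one-hot inputs, and the function names `ema_traces` and `lag_weight` are illustrative choices, not the paper's configuration.

```python
# Minimal sketch of multi-timescale EMA traces (decay values are assumed).

def ema_traces(inputs, decays=(0.5, 0.9, 0.99)):
    """Return one trace vector per decay after consuming all inputs.

    Each trace follows h_t = decay * h_{t-1} + (1 - decay) * x_t with a
    fixed decay: no gate and no dependence on the content of x_t.
    """
    dim = len(inputs[0])
    traces = [[0.0] * dim for _ in decays]
    for x in inputs:
        for trace, decay in zip(traces, decays):
            for i in range(dim):
                trace[i] = decay * trace[i] + (1.0 - decay) * x[i]
    return traces


def lag_weight(decay, lag):
    """Weight the final trace assigns to the input seen `lag` steps ago."""
    return (1.0 - decay) * decay ** lag


# One-hot inputs for a hypothetical 3-token vocabulary.
sequence = [[1.0, 0.0, 0.0], [0.0, 1.0, 0.0], [0.0, 0.0, 1.0]]
for decay, trace in zip((0.5, 0.9, 0.99), ema_traces(sequence)):
    print(decay, [round(v, 4) for v in trace])
```

The weight on any past input depends only on its lag and the decay, never on what the input was. That is the sense in which the compression is data-independent.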

If this is right

Multi-timescale EMA traces can solve structure-dependent tasks at near-supervised accuracy without any labels.
Language modeling performance is bottlenecked by irreversible loss in the recurrent context itself.
Any fixed-coefficient accumulation across time or depth will suffer similar data-independent dilution.
Learned, input-dependent selection is required to retain content identity in recurrent models.
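The contrast between the last two points can be sketched in a few lines. The example is hypothetical: the gate is hand-written rather than learned, and the flag on each input, the decay of 0.9, and the function names are assumptions for illustration. It compares how much of one marked input survives a run of distractors under a fixed coefficient versus an input-dependent one.

```python
# Hypothetical contrast: fixed-coefficient EMA vs. an input-dependent gate.
# The gate is hand-written here; in a real model it would be learned.

def fixed_accumulate(items, decay=0.9):
    """EMA with the same coefficient for every input."""
    h = 0.0
    for value, _is_key in items:
        h = decay * h + (1.0 - decay) * value
    return h


def gated_accumulate(items):
    """Overwrite the state on flagged inputs and hold it otherwise."""
    h = 0.0
    for value, is_key in items:
        gate = 1.0 if is_key else 0.0   # depends on the input itself
        h = (1.0 - gate) * h + gate * value
    return h


# One flagged value followed by 20 unflagged distractors of value 0.
items = [(7.0, True)] + [(0.0, False)] * 20

print(fixed_accumulate(items))   # 7.0 scaled by 0.1 * 0.9**20
print(gated_accumulate(items))   # 7.0 retained
```

The gated version is a cartoon of selection, not a model of any particular architecture. It only illustrates that a coefficient which can respond to the input is able to protect content that a fixed coefficient dilutes.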

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

The same structural-versus-content boundary may appear in other fixed methods such as basic RNNs or convolutions without selection.
Hybrid designs that keep fixed accumulation for structure while adding selective mechanisms for content could improve efficiency.
The finding suggests testing whether fixed accumulation limits performance in non-text sequence domains like audio or time-series forecasting.

Load-bearing premise

That EMA traces serve as a complete representative for all fixed-coefficient accumulation and that the ablation isolates information loss exclusively to the traces.

What would settle it

A fixed-coefficient accumulator without any input-dependent selection that achieves C4 perplexity comparable to attention-based models.

Figures

Figures reproduced from arXiv: 2604.08556 by Arth Singh.

Original abstract

What exactly do efficient sequence models gain over simple temporal averaging? We use exponential moving average (EMA) traces, the simplest recurrent context (no gating, no content-based retrieval), as a controlled probe to map the boundary between what fixed-coefficient accumulation can and cannot represent. EMA traces encode temporal structure: a Hebbian architecture with multi-timescale traces achieves 96% of a supervised BiGRU on grammatical role assignment with zero labels, surpassing the supervised model on structure-dependent roles. EMA traces destroy token identity: a 130M-parameter language model using only EMA context reaches C4 perplexity 260 (8x GPT-2), and a predictor ablation (replacing the linear predictor with full softmax attention) yields identical loss, localizing the entire gap to the traces. The traces apply lossy, data-independent compression; by the data processing inequality, no downstream predictor can recover the discarded information. Fixed-coefficient accumulation, whether across time or depth, suffers irreversible information dilution that only learned, input-dependent selection can resolve.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Desk Editor's Note (private letter to a colleague)

EMA traces capture grammatical structure at 96% of supervised levels but destroy token identity for language modeling, and the ablation localizes the gap to the traces. The claim overreaches, however, in extending to all fixed accumulators and to depth.


EMA traces, used here as the simplest recurrent probe with fixed coefficients and no gating, recover most of the signal needed for grammatical role assignment, hitting 96% of a supervised BiGRU with zero labels and even outperforming it on structure-dependent roles. The same traces produce 260 perplexity on C4 language modeling, eight times worse than GPT-2, and swapping the linear predictor for full softmax attention leaves the loss unchanged. This pins the failure on irreversible loss of token identity inside the fixed accumulation itself, with the data processing inequality offered as the reason no downstream model can recover what was discarded.

Referee Report

2 major / 1 minor

Summary. The paper uses exponential moving average (EMA) traces as a controlled probe for fixed-coefficient recurrent context. It reports that a Hebbian multi-timescale EMA architecture achieves 96% of a supervised BiGRU on grammatical role assignment with zero labels (surpassing it on structure-dependent roles), while a 130M-parameter LM using only EMA context reaches C4 perplexity 260 (8x GPT-2). An ablation replacing the linear predictor with full softmax attention yields identical loss, localizing the gap to the traces. Invoking the data processing inequality, the author concludes that fixed-coefficient accumulation (across time or depth) produces irreversible information dilution resolvable only by learned, input-dependent selection.

Significance. If the central localization holds, the work offers a concrete empirical boundary between structure-preserving and content-destroying properties of simple recurrent accumulators, with the unsupervised grammatical-role result providing a falsifiable demonstration of structure extraction from fixed traces. The information-theoretic framing supplies a principled reason why learned selection is required, which could inform design of efficient sequence models beyond EMA.

major comments (2)

[Abstract] The claim that 'fixed-coefficient accumulation, whether across time or depth, suffers irreversible information dilution' generalizes beyond the tested case; all quantitative evidence is restricted to EMA, with no experiments or ablations on other fixed-coefficient operators (uniform averaging, fixed FIR filters, polynomial bases) or any depth-specific probe.
[Abstract] The predictor ablation is described only as 'replacing the linear predictor with full softmax attention' and yielding identical loss; without reported implementation details, data splits, statistical controls, or confirmation that the attention operates over the same EMA traces, it is unclear whether the localization to trace-level loss is complete.

minor comments (1)

[Abstract] The statement '8x GPT-2' should include the exact GPT-2 C4 perplexity baseline for direct comparison.

Simulated Author's Rebuttal

2 responses · 0 unresolved

We thank the referee for the constructive comments. We address each major comment point by point below.


Referee: [Abstract] The claim that 'fixed-coefficient accumulation, whether across time or depth, suffers irreversible information dilution' generalizes beyond the tested case; all quantitative evidence is restricted to EMA, with no experiments or ablations on other fixed-coefficient operators (uniform averaging, fixed FIR filters, polynomial bases) or any depth-specific probe.

Authors: EMA is used as the canonical simplest fixed-coefficient recurrent operator to isolate data-independent compression. The data processing inequality supplies a general argument that applies to any deterministic, input-independent linear accumulator (including uniform averaging and fixed FIR filters), because such operators discard information that cannot be recovered by any downstream function. We will revise the abstract to qualify the claim as applying to this class of operators, supported by the theoretical bound rather than exhaustive empirical coverage, and note that depth-specific probes lie beyond the current scope.
revision: partial

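The class of operators named in this response can be written in one common form: a weighted sum over past inputs whose weights are fixed in advance. The sketch below is an illustration under assumed values (the decay, the FIR taps, and the helper names are hypothetical), not code from the paper.

```python
# Sketch: three fixed-coefficient accumulators written as weights over lags.
# Decay, window length, and FIR taps are assumed values for illustration.

def ema_weights(decay, length):
    """Weight on the input `lag` steps back, for lag = 0..length-1."""
    return [(1.0 - decay) * decay ** lag for lag in range(length)]


def uniform_weights(length):
    """Equal weight on each of the last `length` inputs."""
    return [1.0 / length] * length


def apply_weights(weights, inputs):
    """Combine the most recent inputs using lag-indexed weights."""
    recent_first = list(reversed(inputs))
    return sum(w * x for w, x in zip(weights, recent_first))


fir_taps = [0.5, 0.3, 0.2]          # hypothetical fixed FIR filter
inputs = [3.0, 1.0, 4.0, 1.0, 5.0]

for name, weights in [
    ("ema", ema_weights(0.5, len(inputs))),
    ("uniform", uniform_weights(len(inputs))),
    ("fir", fir_taps),
]:
    # The weights are chosen before seeing `inputs`; only the sum uses them.
    print(name, apply_weights(weights, inputs))
```

In all three cases the weight vector is determined without reference to the data, which is the property the response relies on when it extends the argument beyond EMA.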
Referee: [Abstract] The predictor ablation is described only as 'replacing the linear predictor with full softmax attention' and yielding identical loss; without reported implementation details, data splits, statistical controls, or confirmation that the attention operates over the same EMA traces, it is unclear whether the localization to trace-level loss is complete.

Authors: Full implementation details appear in Section 4.3: a standard 4-head softmax attention operates directly on the EMA traces (as keys and values), trained on identical C4 splits with the same hyperparameters and seeds as the linear baseline. Perplexity remains 260 within statistical error (Table 3). We will update the abstract to reference this section and include a concise description of the ablation setup.
revision: yes
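The logic of this ablation can be sketched with a stripped-down predictor. The code below is an assumption-laden illustration, not the Section 4.3 setup: it uses a single head instead of four, no learned projections, and made-up trace and query values. It shows only that an attention predictor whose keys and values are the traces is still a function of the traces alone.

```python
import math

# Sketch only: single-head softmax attention that reads EMA traces as both
# keys and values. Query and trace values are made up for illustration.

def softmax(scores):
    peak = max(scores)
    exps = [math.exp(s - peak) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def attend(query, traces):
    """Return the attention-weighted mix of the trace vectors."""
    scale = math.sqrt(len(query))
    scores = [sum(q * k for q, k in zip(query, key)) / scale for key in traces]
    weights = softmax(scores)
    dim = len(traces[0])
    return [sum(w * value[i] for w, value in zip(weights, traces)) for i in range(dim)]


# Suppose two different contexts happened to produce these same traces.
traces_context_a = [[0.125, 0.25, 0.5], [0.081, 0.09, 0.1]]
traces_context_b = [[0.125, 0.25, 0.5], [0.081, 0.09, 0.1]]
query = [1.0, 0.0, -1.0]

out_a = attend(query, traces_context_a)
out_b = attend(query, traces_context_b)
print(out_a == out_b)  # the predictor is a function of the traces only
```

Whatever distinguished the two contexts before accumulation is invisible to the predictor, however expressive it is. This is consistent with the reported result that swapping the predictor leaves the loss unchanged.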

Circularity Check

0 steps flagged

No circularity: empirical ablations plus external DPI principle


The derivation chain rests on two experimental results (EMA achieving 96% of BiGRU on structure tasks; identical loss when swapping linear predictor for full attention) plus invocation of the data processing inequality as a pre-existing theorem. Neither result is obtained by fitting a parameter to the target quantity and relabeling it a prediction, nor does any step reduce to a self-citation or definitional loop. The generalization from EMA to all fixed-coefficient operators follows from the deterministic character of such recurrences, which is independent of the reported numbers.

Axiom & Free-Parameter Ledger

0 free parameters · 1 axiom · 0 invented entities

The central claim rests on the data processing inequality applied to fixed averaging and on the assumption that EMA is a faithful minimal model of fixed-coefficient recurrence.

axioms (1)

Data processing inequality (standard math): no function of the compressed trace can recover information discarded by the fixed averaging operation
Invoked to conclude that downstream predictors cannot recover discarded token identity.
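Stated formally, with symbols introduced here for illustration rather than taken from the paper: let X be the token history, T = f(X) the trace produced by the fixed accumulator f, and Ŷ = g(T) any downstream prediction. These form a Markov chain, and the data processing inequality bounds what the predictor can know about X.

```latex
X \;\longrightarrow\; T = f(X) \;\longrightarrow\; \hat{Y} = g(T)
\qquad\Longrightarrow\qquad
I(X; \hat{Y}) \;\le\; I(X; T) \;\le\; H(X)
```

The bound holds for any g, whether a linear map or softmax attention. The inequality itself does not say how lossy f is. It says only that whatever f discards stays discarded, so the size of the loss remains an empirical question answered by the perplexity results.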

pith-pipeline@v0.9.0 · 5473 in / 1183 out tokens · 31750 ms · 2026-05-15T10:48:36.108531+00:00 · methodology


