EMA Is Not All You Need: Mapping the Boundary Between Structure and Content in Recurrent Context
Pith reviewed 2026-05-15 10:48 UTC · model grok-4.3
The pith
EMA traces capture grammatical structure without supervision but destroy token identity in language models, which the paper takes as evidence that fixed-coefficient accumulation causes irreversible information loss.
A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.
Core claim
EMA traces encode temporal structure well enough for tasks such as grammatical role assignment, but they apply lossy, data-independent compression that erases token identity. The resulting language-model performance is poor and stays poor even after the predictor is replaced by full softmax attention, which the paper reads as confirmation that the entire gap originates in the traces and cannot be recovered downstream.
What carries the argument
Exponential moving average (EMA) traces as the simplest recurrent context accumulator without gating or content-based retrieval.
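A minimal sketch of what such an accumulator might look like, assuming the common parametrization h_t = (1 - alpha) * h_{t-1} + alpha * x_t with a zero initial state. The function name `ema_traces`, the choice of coefficients, and the array shapes are illustrative assumptions, not details taken from the paper.

```python
import numpy as np

def ema_traces(x, alphas):
    """Multi-timescale EMA traces for a sequence of input vectors.

    x:      array of shape (T, d), one embedding per time step
    alphas: K fixed smoothing coefficients in (0, 1] (assumed values)
    returns an array of shape (T, K, d)
    """
    x = np.asarray(x, dtype=float)
    alphas = np.asarray(alphas, dtype=float)
    T, d = x.shape
    traces = np.zeros((T, len(alphas), d))
    h = np.zeros((len(alphas), d))  # assumed zero initial state
    for t in range(T):
        # The same fixed coefficients are applied at every step:
        # no gate, and nothing that depends on the content of x[t].
        h = (1.0 - alphas)[:, None] * h + alphas[:, None] * x[t][None, :]
        traces[t] = h
    return traces

rng = np.random.default_rng(0)
x = rng.normal(size=(16, 4))
traces = ema_traces(x, alphas=[0.5, 0.1, 0.02])
print(traces.shape)  # (16, 3, 4)
```

Each coefficient gives one timescale: a larger alpha tracks recent inputs closely, while a smaller alpha averages over a longer history.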
If this is right
- Multi-timescale EMA traces can solve structure-dependent tasks at near-supervised accuracy without any labels.
- Language modeling performance is bottlenecked by irreversible loss in the recurrent context itself.
- Any fixed-coefficient accumulation across time or depth will suffer similar data-independent dilution.
- Learned, input-dependent selection is required to retain content identity in recurrent models.
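The dilution named in the third point can be made explicit for a single EMA trace. The following is a sketch in our own notation (h_t for the trace, x_t for the input at step t, alpha for the fixed coefficient), assuming a zero initial state; the paper may parametrize the trace differently.

```latex
h_t = (1-\alpha)\,h_{t-1} + \alpha\,x_t ,
\qquad h_0 = 0
\quad\Longrightarrow\quad
h_t = \alpha \sum_{k=0}^{t-1} (1-\alpha)^{k}\, x_{t-k} .
```

The weight on the input k steps back is \alpha(1-\alpha)^k. It depends only on the lag k, never on what the input is, which is the sense in which the compression is data-independent.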
Where Pith is reading between the lines
- The same structural-versus-content boundary may appear in other fixed methods such as basic RNNs or convolutions without selection.
- Hybrid designs that keep fixed accumulation for structure while adding selective mechanisms for content could improve efficiency.
- The finding suggests testing whether fixed accumulation limits performance in non-text sequence domains like audio or time-series forecasting.
Load-bearing premise
That EMA traces are representative of all fixed-coefficient accumulation, and that the ablation isolates the information loss exclusively to the traces.
What would settle it
A fixed-coefficient accumulator without any input-dependent selection that achieves C4 perplexity comparable to attention-based models.
Original abstract
What exactly do efficient sequence models gain over simple temporal averaging? We use exponential moving average (EMA) traces, the simplest recurrent context (no gating, no content-based retrieval), as a controlled probe to map the boundary between what fixed-coefficient accumulation can and cannot represent. EMA traces encode temporal structure: a Hebbian architecture with multi-timescale traces achieves 96% of a supervised BiGRU on grammatical role assignment with zero labels, surpassing the supervised model on structure-dependent roles. EMA traces destroy token identity: a 130M-parameter language model using only EMA context reaches C4 perplexity 260 (8x GPT-2), and a predictor ablation (replacing the linear predictor with full softmax attention) yields identical loss, localizing the entire gap to the traces. The traces apply lossy, data-independent compression; by the data processing inequality, no downstream predictor can recover the discarded information. Fixed-coefficient accumulation, whether across time or depth, suffers irreversible information dilution that only learned, input-dependent selection can resolve.
Editorial analysis
A structured set of objections, weighed in public.
Referee Report
Summary. The paper uses exponential moving average (EMA) traces as a controlled probe for fixed-coefficient recurrent context. It reports that a Hebbian multi-timescale EMA architecture achieves 96% of a supervised BiGRU on grammatical role assignment with zero labels (surpassing it on structure-dependent roles), while a 130M-parameter LM using only EMA context reaches C4 perplexity 260 (8x GPT-2). An ablation replacing the linear predictor with full softmax attention yields identical loss, which the authors take to localize the gap to the traces. Invoking the data processing inequality, they conclude that fixed-coefficient accumulation (across time or depth) produces irreversible information dilution that only learned, input-dependent selection can resolve.
Significance. If the central localization holds, the work offers a concrete empirical boundary between structure-preserving and content-destroying properties of simple recurrent accumulators, with the unsupervised grammatical-role result providing a falsifiable demonstration of structure extraction from fixed traces. The information-theoretic framing supplies a principled reason why learned selection is required, which could inform design of efficient sequence models beyond EMA.
major comments (2)
- [Abstract] The claim that 'fixed-coefficient accumulation, whether across time or depth, suffers irreversible information dilution' generalizes beyond the tested case; all quantitative evidence is restricted to EMA, with no experiments or ablations on other fixed-coefficient operators (uniform averaging, fixed FIR filters, polynomial bases) or any depth-specific probe.
- [Abstract] The predictor ablation is described only as 'replacing the linear predictor with full softmax attention' and yielding identical loss; without reported implementation details, data splits, statistical controls, or confirmation that the attention operates over the same EMA traces, it is unclear whether the localization to trace-level loss is complete.
minor comments (1)
- [Abstract] The statement '8x GPT-2' should include the exact GPT-2 C4 perplexity baseline for direct comparison.
Simulated Author's Rebuttal
We thank the referee for the constructive comments. We address each major comment point by point below.
- Referee: [Abstract] The claim that 'fixed-coefficient accumulation, whether across time or depth, suffers irreversible information dilution' generalizes beyond the tested case; all quantitative evidence is restricted to EMA, with no experiments or ablations on other fixed-coefficient operators (uniform averaging, fixed FIR filters, polynomial bases) or any depth-specific probe.
Authors: EMA is used as the canonical simplest fixed-coefficient recurrent operator to isolate data-independent compression. The data processing inequality supplies a general argument that applies to any deterministic, input-independent linear accumulator (including uniform averaging and fixed FIR filters), because such operators discard information that cannot be recovered by any downstream function. We will revise the abstract to qualify the claim as applying to this class of operators, supported by the theoretical bound rather than exhaustive empirical coverage, and note that depth-specific probes lie beyond the current scope.
Revision: partial
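For reference, the inequality invoked here can be stated as follows. This is a sketch in our own notation, not the paper's: X is the token history, T = f(X) is the trace produced by the fixed accumulator, and Y is the output of any predictor that sees only T.

```latex
X \;\to\; T \;\to\; Y
\quad\Longrightarrow\quad
I(X;Y) \;\le\; I(X;T) .
```

Whatever T fails to retain about X stays unavailable to Y. The inequality bounds what a downstream predictor can recover; how much a particular accumulator discards is a separate question that the bound alone does not answer.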
- Referee: [Abstract] The predictor ablation is described only as 'replacing the linear predictor with full softmax attention' and yielding identical loss; without reported implementation details, data splits, statistical controls, or confirmation that the attention operates over the same EMA traces, it is unclear whether the localization to trace-level loss is complete.
Authors: Full implementation details appear in Section 4.3: a standard 4-head softmax attention operates directly on the EMA traces (as keys and values), trained on identical C4 splits with the same hyperparameters and seeds as the linear baseline. Perplexity remains 260 within statistical error (Table 3). We will update the abstract to reference this section and include a concise description of the ablation setup.
Revision: yes
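To illustrate what "attention over the traces" could mean, here is a simplified sketch. It uses a single head and no learned projections, whereas the rebuttal describes a 4-head layer; the function name `attend_over_traces`, the source of the query, and the shapes are all hypothetical. Whether the traces are indexed by timescale, position, or both is not specified here, so the sketch treats them as a plain set of N vectors.

```python
import numpy as np

def attend_over_traces(query, traces):
    """Single-head softmax attention with EMA traces as keys and values.

    query:  array of shape (d,)   (hypothetical, e.g. derived from the current token)
    traces: array of shape (N, d) (the EMA trace vectors being attended over)
    returns (context, weights) with shapes (d,) and (N,)
    """
    query = np.asarray(query, dtype=float)
    traces = np.asarray(traces, dtype=float)
    scores = traces @ query / np.sqrt(traces.shape[-1])
    scores = scores - scores.max()   # subtract the max for numerical stability
    e = np.exp(scores)
    weights = e / e.sum()            # softmax over the N traces
    context = weights @ traces       # convex combination of the trace vectors
    return context, weights

rng = np.random.default_rng(0)
traces = rng.normal(size=(3, 8))
context, weights = attend_over_traces(rng.normal(size=8), traces)
print(context.shape)  # (8,)
```

The point of the sketch is that the output is a weighted mix of the trace vectors themselves. The attention can choose among traces, but it has no access to the original tokens behind them.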
Circularity Check
No circularity: empirical ablations plus the external data processing inequality (DPI)
The derivation chain rests on two experimental results (EMA achieving 96% of BiGRU on structure tasks; identical loss when swapping linear predictor for full attention) plus invocation of the data processing inequality as a pre-existing theorem. Neither result is obtained by fitting a parameter to the target quantity and relabeling it a prediction, nor does any step reduce to a self-citation or definitional loop. The generalization from EMA to all fixed-coefficient operators follows from the deterministic character of such recurrences, which is independent of the reported numbers.
Axiom & Free-Parameter Ledger
axioms (1)
- (standard math) Data processing inequality: no function of the compressed trace can recover information discarded by the fixed averaging operation