Preisach Attention: A Hysteretic Model of Sequential Memory
Pith reviewed 2026-05-25 04:40 UTC · model grok-4.3
The pith
A hysteretic attention layer using binary relays and an extrema stack achieves Turing completeness in constant depth.
A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.
Core claim
The Preisach Attention Layer replaces softmax with binary relay operators that maintain a stack of local extrema as internal state. This construction yields a single-layer PAL-Transformer that is Turing-complete by direct simulation of a two-stack pushdown automaton. The same layer computes rate-independent functionals, such as historical range statistics, in O(1) depth where transformers require O(log n) depth, while the function classes remain incomparable because transformers support random-access operations that PAL cannot perform without auxiliary state. The extremum stack is shown to be a minimal sufficient statistic for all rate-independent functionals.
What carries the argument
The Preisach Attention Layer, which applies binary relay operators with learned thresholds to maintain and update a stack of local extrema as the sole internal state.
If this is right
- A single PAL-Transformer layer suffices for Turing completeness via two-stack pushdown simulation.
- Historical range statistics become computable in O(1) layers by PAL while requiring O(log n) layers in transformers.
- Transformers retain the ability to perform random-access retrieval that PAL cannot achieve without added state.
- PAL inference runs in O(n log n) total cost for sequences where standard attention costs O(n squared).
Where Pith is reading between the lines
- Models built on rate-independent memory may handle long episodic sequences more efficiently when token timing carries little information.
- Hybrid architectures could combine PAL layers for extrema tracking with standard attention for positional retrieval.
- The minimal-statistic property suggests that PAL could serve as a drop-in memory module for any task whose target depends only on the ordered sequence of peaks and troughs.
Load-bearing premise
The rate-independence property together with the specific update rules of the binary relay operators are enough to simulate arbitrary computation.
What would settle it
An explicit two-stack pushdown automaton computation that cannot be reproduced by any assignment of thresholds and initial stack state in the binary relay operators under arbitrary precision arithmetic.
read the original abstract
We introduce the Preisach Attention Layer (PAL), a novel sequence modelling architecture grounded in the classical Preisach hysteresis operator from mathematical physics. PAL replaces the softmax attention mechanism with a binary relay operator parameterised by learned activation and deactivation thresholds, maintaining a stack of local extrema as its internal state. A single-layer PAL-Transformer with O(1) depth is Turing-complete under arbitrary precision arithmetic, achievable through simulation of a two-stack pushdown automaton -- in contrast to the O(log n) depth required by standard hard-attention transformers. Second, we prove that the function classes computable by PAL and by the transformer are incomparable: PAL computes historical range statistics in O(1) layers that require O(log n) layers for transformers, while transformers support random-access retrieval that PAL cannot perform without auxiliary state. The separating property is rate-independence -- PAL responds only to the sequence of local extrema, not to absolute token positions or temporal spacing. Third, we show that the extremum stack constitutes a minimal sufficient statistic of the input history for all rate-independent functionals, providing a formal analogue of the wiping property in classical hysteresis theory. PAL is thus an efficient architecture for tasks with long episodic memory and weak positional dependence, with O(n log n) total inference cost versus O(n^2) for standard attention.
Editorial analysis
A structured set of objections, weighed in public.
Referee Report
Summary. The paper introduces the Preisach Attention Layer (PAL) replacing softmax attention with binary relay operators from the classical Preisach hysteresis model, maintaining a stack of local extrema. It claims a single-layer PAL-Transformer achieves Turing completeness in O(1) depth by simulating a two-stack pushdown automaton (contrasting O(log n) depth for hard-attention transformers), proves the function classes are incomparable due to rate-independence (PAL computes historical range statistics efficiently while transformers support random-access retrieval that PAL cannot), and shows the extremum stack is a minimal sufficient statistic for rate-independent functionals, yielding O(n log n) inference cost for suitable tasks.
Significance. If the results hold, the work establishes a new attention mechanism with formal expressivity separations and efficiency gains for episodic memory tasks with weak positional dependence, plus a direct link to hysteresis theory via the wiping property. Explicit proofs of Turing completeness and the minimal sufficient statistic property would constitute a notable theoretical contribution in the field.
major comments (2)
- [Abstract] Abstract: the Turing-completeness claim via two-stack PDA simulation is load-bearing for the central contribution but provides no derivation showing how a single extremum stack plus binary relays (updated only at turning points) can encode and manipulate two independent stacks while strictly obeying rate-independence, which discards absolute positions, step counts, and non-extremal tokens.
- [Abstract] Abstract: the claim that the extremum stack is a minimal sufficient statistic for all rate-independent functionals (the formal analogue of the wiping property) is stated without the supporting derivation or error analysis, preventing verification of whether the parameterization preserves the required separation from transformer function classes.
minor comments (1)
- Clarify the precise parameterization of the learned activation/deactivation thresholds for the relay operators and how they interact with the stack maintenance rules.
Simulated Author's Rebuttal
We thank the referee for their thorough review and for highlighting the need for greater clarity on the central theoretical claims. We address each major comment below and will revise the manuscript accordingly.
read point-by-point responses
-
Referee: [Abstract] Abstract: the Turing-completeness claim via two-stack PDA simulation is load-bearing for the central contribution but provides no derivation showing how a single extremum stack plus binary relays (updated only at turning points) can encode and manipulate two independent stacks while strictly obeying rate-independence, which discards absolute positions, step counts, and non-extremal tokens.
Authors: We agree that the abstract statement is too brief and does not contain the required derivation. The construction is developed in Section 3, where the single extrema stack encodes one PDA stack via the ordered sequence of turning-point values while the second stack is maintained in the pattern of active/inactive relays; all updates occur exclusively at local extrema, thereby preserving rate-independence. To make the argument self-contained at the abstract level, we will insert a concise sketch of this encoding in the revised abstract. revision: yes
-
Referee: [Abstract] Abstract: the claim that the extremum stack is a minimal sufficient statistic for all rate-independent functionals (the formal analogue of the wiping property) is stated without the supporting derivation or error analysis, preventing verification of whether the parameterization preserves the required separation from transformer function classes.
Authors: We accept that the abstract does not supply the supporting derivation. The proof that the extrema stack is a minimal sufficient statistic, together with the accompanying error analysis establishing exact representation of any rate-independent functional, appears in Theorem 2 and its proof in Section 4. We will add a one-sentence outline of this result to the abstract in the revision so that the separation from transformer function classes can be verified directly from the abstract. revision: yes
Circularity Check
No circularity: claims rest on external mathematical properties of hysteresis and automata
full rationale
The derivation chain consists of proofs that a single-layer PAL simulates a two-stack PDA (hence Turing-complete) and that the extremum stack is a minimal sufficient statistic for rate-independent functionals. These rest on classical Preisach operator properties (rate-independence, wiping property) and standard automata theory rather than any self-definition, fitted-parameter renaming, or self-citation load-bearing step. No equation or claim in the abstract reduces a result to an input defined by the model itself; the separating property and incomparability results are stated as independent consequences of the operator's definition. The paper is therefore self-contained against external benchmarks with no reduction by construction.
Axiom & Free-Parameter Ledger
axioms (2)
- standard math The Preisach hysteresis operator is defined by a collection of binary relays with activation and deactivation thresholds and maintains a stack of local extrema as its state.
- domain assumption Rate-independence is the separating property between PAL and standard transformers.
invented entities (1)
-
Preisach Attention Layer (PAL)
no independent evidence
Reference graph
Works this paper leans on
-
[1]
Titans: Learning to Memorize at Test Time
doi: 10.1016/j.est.2015.06.001. 18 Preprint Ali Behrouz, Peilin Zhong, and Vahab Mirrokni. Titans: Learning to memorize at test time. InarXiv preprint arXiv:2501.00663,
work page internal anchor Pith review Pith/arXiv arXiv doi:10.1016/j.est.2015.06.001 2015
-
[2]
Titans: Learning to Memorize at Test Time
URLhttps://arxiv.org/abs/2501.00663. Iz Beltagy, Matthew E. Peters, and Arman Cohan. Longformer: The long-document transformer. InarXiv preprint arXiv:2004.05150,
work page internal anchor Pith review Pith/arXiv arXiv 2004
-
[3]
Longformer: The Long-Document Transformer
URLhttps://arxiv.org/abs/2004.05150. Martin Brokate and Jürgen Sprekels.Hysteresis and Phase Transitions, volume 121 ofApplied Mathematical Sciences. Springer, New York,
work page internal anchor Pith review Pith/arXiv arXiv 2004
-
[4]
doi: 10.1007/978-1-4612-4048-8. George Cybenko. Approximation by superpositions of a sigmoidal function.Mathematics of Control, Signals and Systems, 2(4):303–314,
-
[5]
doi: 10.1007/BF02551274. Karin Dahmen and James P. Sethna. Hysteresis loop critical exponents in6−εdimensions.Physical Review Letters, 71(20):3222–3225,
-
[6]
doi: 10.1103/PhysRevLett.71.3222. Piotr Frydrych.Modelowanie charakterystyk magnesowania amorficznych rdzeni dwuosiowych sensorów transduktorowych. PhD thesis, Politechnika Warszawska, Wydział Mechatroniki, Warsaw, Poland,
-
[7]
Mamba: Linear-Time Sequence Modeling with Selective State Spaces
doi: 10.1007/BF01744431. Albert Gu and Tri Dao. Mamba: Linear-time sequence modeling with selective state spaces. InarXiv preprint arXiv:2312.00752,
work page internal anchor Pith review Pith/arXiv arXiv doi:10.1007/bf01744431
-
[8]
Mamba: Linear-Time Sequence Modeling with Selective State Spaces
URLhttps://arxiv.org/abs/2312.00752. Michael Hahn. Theoretical limitations of self-attention in neural sequence models.Transactions of the Association for Computational Linguistics, 8:156–171,
work page internal anchor Pith review Pith/arXiv arXiv
-
[9]
Johan Håstad.Computational Limitations of Small-Depth Circuits
doi: 10.1162/tacl_a_00306. Johan Håstad.Computational Limitations of Small-Depth Circuits. MIT Press, Cambridge, MA,
-
[10]
doi: 10.1073/pnas.79.8.2554. Christopher G. Langton. Computation at the edge of chaos: Phase transitions and emergent computation. Physica D: Nonlinear Phenomena, 42(1–3):12–37,
-
[11]
doi: 10.1016/0167-2789(90)90064-V. Andrew Lucas. Ising formulations of many NP problems.Frontiers in Physics, 2:5,
-
[12]
doi: 10.3389/fphy. 2014.00005. Isaak D. Mayergoyz.Mathematical Models of Hysteresis. Springer, New York,
-
[13]
Marc Mézard and Andrea Montanari.Information, Physics, and Computation
doi: 10.1162/tacl_a_ 00562. Marc Mézard and Andrea Montanari.Information, Physics, and Computation. Oxford University Press, Oxford,
-
[14]
Oxford Univer- sity Press (2018).https://doi.org/10.1093/oso/9780198814788.001.0001
doi: 10.1093/acprof:oso/9780198570837.001.0001. Bo Peng, Eric Alcaide, Quentin Anthony, Alon Albalak, Samuel Arcadinho, Stella Biderman, et al. RWKV: Reinventing RNNs for the transformer era. InFindings of the Association for Computational Linguistics: EMNLP 2023, pages 14048–14077,
work page doi:10.1093/acprof:oso/9780198570837.001.0001 2023
-
[15]
Jorge Pérez, Javier Marinkovi ´c, and Pablo Barceló
doi: 10.18653/v1/2023.findings-emnlp.936. Jorge Pérez, Javier Marinkovi ´c, and Pablo Barceló. Attention is Turing complete.Journal of Machine Learning Research, 22(75):1–35,
-
[16]
doi: 10.1007/BF01349418. James P. Sethna, Karin A. Dahmen, and Olga Perkovic. Random-field ising models of hysteresis.The Science of Hysteresis, 2:107–179,
-
[17]
Random-Field Ising Models of Hysteresis
URLhttps://arxiv.org/abs/cond-mat/0406320. Noam Shazeer, Azalia Mirhoseini, Krzysztof Maziarz, Andy Davis, Quoc Le, Geoffrey Hinton, and Jeff Dean. Outrageously large neural networks: The sparsely-gated mixture-of-experts layer. InInterna- tional Conference on Learning Representations,
work page internal anchor Pith review Pith/arXiv arXiv
-
[18]
doi: 10.1006/jcss.1995.1013. Ashish Vaswani, Noam Shazeer, Niki Parmar, Jakob Uszkoreit, Llion Jones, Aidan N Gomez, Łukasz Kaiser, and Illia Polosukhin. Attention is all you need. InAdvances in Neural Information Processing Systems, volume
-
[19]
In the limitε→0, this detectsu i ≥cat each extremal step. Lemma A.5(Boolean combinations via MLP).Letϕ 1, ϕ2 be two EFO formulae computable by PAL heads producing outputsr 1, r2 ∈ {0,1}. Thenϕ 1 ∧ϕ 2,ϕ 1 ∨ϕ 2, and¬ϕ 1 are computable by a two-layer MLP applied to(r 1, r2). Proof.Boolean functions on{0,1} 2 are representable by two-layer networks with ReLU ...
work page 1989
-
[20]
These are exact (not approximate) representations on{0,1}. Lemma A.6(Extremum aggregation).The formulaExtAgg i[f(x i), ϕ(i)]is computable by a single MPAL head with measure: µ(α,β) =f(Dec(α, β))·1[ϕholds at(α, β)],(28) whereDec(α, β)recovers the valuex i encoded at thresholds(α, β). Proof.The MPAL output for a single head with measureµis: PALµ(u0:n) = X i...
work page 2020
-
[21]
Sinceu n+1 < M top, the wiping conditionu n+1 > M top is not satisfied
+ε < M top (sinceM top = code ∗(aj, d) +εfor somejand theεgap is smaller than the inter-level gap∆>0, providedε <∆/2). Sinceu n+1 < M top, the wiping conditionu n+1 > M top is not satisfied. A new pair is added without removing any existing pairs. B.2 MLP Width for Transition Function Lemma B.3(MLP width for finite transition functions).For a 2-PDA with|Q...
work page 1989
discussion (0)
Sign in with ORCID, Apple, or X to comment. Anyone can read and Pith papers without signing in.