pith. machine review for the scientific record.

arxiv: 2605.00662 · v1 · submitted 2026-05-01 · 💻 cs.NE · cs.LG


Spiking Sequence Machines and Transformers


Pith reviewed 2026-05-09 14:50 UTC · model grok-4.3

classification 💻 cs.NE cs.LG
keywords sequence learning · spiking neural networks · transformers · positional encoding · cosine similarity · dot-product attention · isomorphism

The pith

Spiking sequence machines and transformers share the same five functional operations, with cosine similarity as the shared retrieval primitive.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

Sequence learning fundamentally reduces to similarity-based retrieval over a temporally ordered space. This paper shows that a spiking Sparse Distributed Memory machine and the modern transformer architecture both execute the same sequence of steps: encoding inputs, maintaining context, retrieving associations via similarity, storing information, and decoding outputs. A mathematical isomorphism links the timing of spikes to the phase of sinusoidal positional encodings, with a proof that attention mechanisms remain unaffected except for a scaling factor. Experiments reveal that the sinusoidal form is not essential, as rank-based position encodings work as well or better, while compressed frequency versions fail on tasks needing precise position tracking. The insight is that time, phase, and rank all provide equivalent ordered indices that hold up under similarity-based retrieval.
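
To make the shared skeleton concrete, here is a deliberately small sketch of the five operations with cosine similarity as the retrieval step. It is an illustrative abstraction only, not the SDM machine or a transformer; the token codebook, the sinusoidal index, and the single-sequence store are placeholder choices.

    # Toy sketch of the five shared operations (illustrative only, not either
    # architecture): (1) encode tokens, (2) maintain an ordered context via a
    # positional index, (3) retrieve associations by cosine similarity,
    # (4) store (context, next-token) pairs, (5) decode back to a token.
    import numpy as np

    rng = np.random.default_rng(0)
    d = 64                                                   # placeholder width
    vocab = list("abcdefgh")
    codebook = {t: rng.standard_normal(d) for t in vocab}    # (1) encoding

    def positional_index(pos, dim):
        # any ordered index would do for the sketch; sinusoidal phases used here
        freqs = 1.0 / (10000 ** (2 * np.arange(dim // 2) / dim))
        return np.concatenate([np.sin(pos * freqs), np.cos(pos * freqs)])

    def cosine(a, b):
        return a @ b / (np.linalg.norm(a) * np.linalg.norm(b) + 1e-12)

    sequence = list("abcabcabc")
    keys, values = [], []                                    # (4) storage
    for p in range(len(sequence) - 1):
        context = codebook[sequence[p]] + positional_index(p, d)   # (2) context
        keys.append(context)
        values.append(codebook[sequence[p + 1]])

    def retrieve(query):                                     # (3) associative retrieval
        sims = [cosine(query, k) for k in keys]
        return values[int(np.argmax(sims))]

    def decode(vec):                                         # (5) decoding
        return max(vocab, key=lambda t: cosine(vec, codebook[t]))

    probe = codebook["b"] + positional_index(4, d)
    print(decode(retrieve(probe)))                           # 'c' in this toy run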

Core claim

Sequence learning reduces to similarity-based retrieval over a temporally indexed representation space, a constraint on any sequence model. A spiking Sparse Distributed Memory sequence machine and the transformer independently instantiate the same five functional operations (encoding, context maintenance, associative retrieval, storage, and decoding), with cosine similarity as the shared retrieval primitive. A Phase-Latency Isomorphism is formalised showing that sinusoidal positional phase and spike timing are linearly related, and dot product attention is proven invariant to this mapping up to a global scale factor on the positional component. Empirically, frequency-compressed positional encoding fails to converge on a positionally demanding copy task, while a learned rank-based embedding matches or exceeds sinusoidal encoding, indicating that the critical property for positional representation is distance discriminability under dot-product similarity, not sinusoidal form.

What carries the argument

The Phase-Latency Isomorphism linking sinusoidal positional phase to spike timing, which proves invariance of dot-product attention up to scaling and unifies the five shared operations.
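
A minimal numeric check of the scaling relation reported in Figure 1. It assumes, purely for illustration, that the spike-timing encoding behaves for dot-product purposes like the sinusoidal encoding with its amplitude scaled by T/L; under that assumption every pairwise dot product scales by exactly (T/L)², which is the sense in which retrieval survives the mapping up to a global factor.

    # Check of the Figure 1 relation under an assumed stand-in for STPE:
    # scaling the sinusoidal encoding's amplitude by T/L rescales every pairwise
    # dot product by (T/L)^2, i.e. the y = x/L^2 line in the figure when T = 1.
    import numpy as np

    L, d, T = 128, 128, 1.0

    def sinusoidal_pe(length, dim):
        pos = np.arange(length)[:, None]
        i = np.arange(dim // 2)[None, :]
        angles = pos / (10000 ** (2 * i / dim))
        return np.concatenate([np.sin(angles), np.cos(angles)], axis=1)

    pe = sinusoidal_pe(L, d)                 # shape (L, d)
    stpe = (T / L) * pe                      # assumed amplitude-scaled stand-in

    print(np.allclose(stpe @ stpe.T, (T / L) ** 2 * (pe @ pe.T)))   # True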

Load-bearing premise

The five operations fully capture the load-bearing computation in both architectures, and the linear phase-latency mapping preserves all relevant behavior without hidden losses when applied to real attention mechanisms.

What would settle it

A direct test in which dot-product attention on a positionally demanding copy task produces different outputs or fails to converge after replacing sinusoidal phases with linearly mapped spike timings, or where frequency-compressed positional encodings succeed on that task.
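
A sketch of how such a check might be wired up, with placeholder content embeddings, projection matrices, and a global rescaling standing in for the linearly mapped spike timings; it only marks where the comparison would be made and says nothing about what the paper's copy-task experiment would show.

    # Sketch of the settling test: compare softmax attention maps built from
    # content + sinusoidal PE against content + a globally rescaled positional
    # term (a stand-in for linearly mapped spike timings). Content, projections,
    # and the scale are placeholders.
    import numpy as np

    rng = np.random.default_rng(1)
    L, d = 64, 64
    scale = 1.0 / L                               # plays the role of T/L

    def sinusoidal_pe(length, dim):
        pos = np.arange(length)[:, None]
        i = np.arange(dim // 2)[None, :]
        angles = pos / (10000 ** (2 * i / dim))
        return np.concatenate([np.sin(angles), np.cos(angles)], axis=1)

    content = rng.standard_normal((L, d))         # placeholder token embeddings
    W_q = rng.standard_normal((d, d)) / np.sqrt(d)
    W_k = rng.standard_normal((d, d)) / np.sqrt(d)

    def attention_map(pos_enc):
        q = (content + pos_enc) @ W_q             # projections mix content and position
        k = (content + pos_enc) @ W_k
        scores = (q @ k.T) / np.sqrt(d)
        scores -= scores.max(axis=1, keepdims=True)
        weights = np.exp(scores)
        return weights / weights.sum(axis=1, keepdims=True)

    gap = np.abs(attention_map(sinusoidal_pe(L, d))
                 - attention_map(scale * sinusoidal_pe(L, d))).max()
    print(gap)   # nonzero: the rescaled positional term interacts with content after projection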

Figures

Figures reproduced from arXiv: 2605.00662 by Joy Bose.

Figure 1. Empirical validation of the Phase–Latency Isomorphism (Proposition 1). L=128, d=128. Left: scatter of ⟨PE(p), PE(q)⟩ versus ⟨STPE(p), STPE(q)⟩ across all position pairs; points lie exactly on y = x/L², confirming ⟨STPE, STPE⟩ = (T/L)²⟨PE, PE⟩. Right: pairwise dot products as a function of positional distance |pos − pos'|, comparing PE (blue) and amplitude-scaled STPE rescaled to PE units (orange). Pearson…
Figure 2. Learning curves (BPC vs training step) for three positional encodings on the copy task (L=64, d=64, …).
Original abstract

Sequence learning reduces to similarity-based retrieval over a temporally indexed representation space, a constraint on any sequence model, not a property of a specific architecture. We show that a spiking Sparse Distributed Memory sequence machine (2007) and the transformer (2017) independently instantiate the same five functional operations (encoding, context maintenance, associative retrieval, storage, and decoding), with cosine similarity as the shared retrieval primitive in both. We formalise a Phase-Latency Isomorphism showing that sinusoidal positional phase and spike timing are linearly related, and prove that dot product attention is invariant to this mapping up to a global scale factor on the positional component (Lemma 1). Empirically, frequency-compressed positional encoding fails to converge on a positionally demanding copy task, while a learned rank-based embedding matches or exceeds sinusoidal encoding, indicating that the critical property for positional representation is distance discriminability under dot-product similarity, not sinusoidal form. Time, phase, and rank are three instantiations of the same computational primitive, an ordered index whose structure survives similarity-based retrieval.
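
A cheap proxy for the distance-discriminability claim in the abstract: compare how sharply pairwise dot products separate near from far positions for three ordered indices. The frequency-compressed variant (all frequencies divided by 32) and the random rank stand-in are illustrative assumptions, not the paper's constructions.

    # Proxy probe (illustrative assumptions, not the paper's experiment): how well
    # does each ordered index separate a position from its neighbours under raw
    # dot-product similarity?
    import numpy as np

    rng = np.random.default_rng(2)
    L, d = 64, 64

    def sinusoidal_pe(length, dim, freq_scale=1.0):
        pos = np.arange(length)[:, None]
        i = np.arange(dim // 2)[None, :]
        angles = pos * freq_scale / (10000 ** (2 * i / dim))
        return np.concatenate([np.sin(angles), np.cos(angles)], axis=1)

    encodings = {
        "sinusoidal": sinusoidal_pe(L, d),
        "freq-compressed (assumed: freqs / 32)": sinusoidal_pe(L, d, freq_scale=1 / 32),
        "rank stand-in (random vectors)": rng.standard_normal((L, d)),
    }

    for name, enc in encodings.items():
        sims = enc @ enc.T
        self_sim = float(np.mean(np.diag(sims)))
        near = float(np.mean([sims[p, p + 1] for p in range(L - 1)]))
        far = float(np.mean([sims[p, p + 16] for p in range(L - 16)]))
        # an index whose near and far similarities both sit close to its
        # self-similarity gives dot-product retrieval little to discriminate with
        print(f"{name:40s} self={self_sim:7.2f} near={near:7.2f} far={far:7.2f}")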

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Referee Report

2 major / 2 minor

Summary. The paper claims that sequence learning reduces to similarity-based retrieval over a temporally indexed space, and that both a 2007 spiking Sparse Distributed Memory sequence machine and the 2017 transformer independently realize the same five operations (encoding, context maintenance, associative retrieval, storage, decoding) with cosine similarity as the shared primitive. It formalizes a Phase-Latency Isomorphism relating sinusoidal positional phase to spike latency and proves in Lemma 1 that dot-product attention is invariant to this linear mapping up to a global scale factor on the positional component. Experiments on a positionally demanding copy task show that frequency-compressed encodings fail to converge while learned rank-based embeddings match or exceed sinusoidal performance, supporting the view that time, phase, and rank are equivalent instantiations of an ordered index whose distance structure survives similarity retrieval.

Significance. If the central claims hold, the work supplies a formal bridge between neuromorphic sequence models and transformers, identifying a shared computational primitive and demonstrating that positional encoding succeeds via distance discriminability rather than sinusoidal specifics. The presence of an explicit lemma and falsifiable empirical predictions (rank-based alternatives) are strengths that elevate the contribution beyond purely interpretive unification.

major comments (2)
  1. [Lemma 1] Lemma 1: The stated invariance of dot-product attention holds only up to a global scale factor on the positional component, but the manuscript does not show that this invariance survives the learned linear projections that mix positional and content information inside each attention head (Q = (content + pos) W_Q, similarly for K). Because heads and layers can learn different effective scalings, the claimed equivalence between the SDM machine and actual transformer computation graphs is not yet established.
  2. [Five operations taxonomy] Section defining the five operations: The taxonomy is presented as independently instantiated by both architectures, yet the operations are specified at a level of abstraction that both models satisfy by construction. The paper must demonstrate that the taxonomy is motivated by sequence-learning requirements alone and is not post-hoc; otherwise the unification claim rests on circularity rather than independent convergence.
minor comments (2)
  1. [Empirical evaluation] Copy-task experiments: additional controls are needed to confirm that embedding dimension, learning rate, and other hyperparameters are matched across positional-encoding variants so that performance differences can be attributed to the encoding itself rather than confounding factors.
  2. [Phase-Latency Isomorphism] Notation: the linear phase-to-latency mapping and the precise form of the global scale factor in Lemma 1 would be clearer if written out as explicit equations rather than described in prose.

Simulated Author's Rebuttal

2 responses · 0 unresolved

We thank the referee for the constructive and precise comments, which help clarify the scope of our unification claims. We address each major point below with clarifications and indicate planned revisions to the manuscript.

Point-by-point responses
  1. Referee: [Lemma 1] Lemma 1: The stated invariance of dot-product attention holds only up to a global scale factor on the positional component, but the manuscript does not show that this invariance survives the learned linear projections that mix positional and content information inside each attention head (Q = (content + pos) W_Q, similarly for K). Because heads and layers can learn different effective scalings, the claimed equivalence between the SDM machine and actual transformer computation graphs is not yet established.

    Authors: We agree that Lemma 1 establishes the invariance only for the raw dot-product operation under the phase-latency mapping, up to a global scale on the positional component. The referee is correct that the learned projections W_Q, W_K (and similarly for V) mix content and positional vectors, and that per-head scalings can differ. However, because any global or per-head scale factor on the positional contribution can be absorbed into the learned weights without changing the functional form of the similarity retrieval, the core equivalence at the level of the cosine-similarity primitive remains intact. The SDM uses a fixed, unprojected retrieval while the transformer learns the mixing; this difference is architectural rather than computational. We will add a remark immediately after Lemma 1 that explicitly notes this absorption property and its consequence for the computation-graph equivalence, thereby addressing the gap. revision: partial

  2. Referee: [Five operations taxonomy] Section defining the five operations: The taxonomy is presented as independently instantiated by both architectures, yet the operations are specified at a level of abstraction that both models satisfy by construction. The paper must demonstrate that the taxonomy is motivated by sequence-learning requirements alone and is not post-hoc; otherwise the unification claim rests on circularity rather than independent convergence.

    Authors: The five operations are not chosen post-hoc to fit the two models. They follow directly from the problem statement in the introduction: any sequence model must (1) encode inputs, (2) maintain an ordered temporal context via an index, (3) perform associative retrieval by similarity, (4) store the resulting associations, and (5) decode to outputs. This list is derived from the general requirement that sequence learning reduces to similarity-based retrieval over a temporally indexed space, independent of any particular architecture. Both the 2007 SDM and the 2017 transformer were developed separately to meet these requirements and converge on cosine similarity as the retrieval primitive. To eliminate any appearance of circularity, we will revise the relevant section to first derive the five operations from sequence-modeling necessities alone, then map each architecture onto them. revision: yes

Circularity Check

0 steps flagged

No significant circularity; the derivation is self-contained from the general sequence-learning constraint.

full rationale

The paper begins with the general statement that sequence learning reduces to similarity-based retrieval over a temporally indexed representation space. It then maps both the 2007 SDM machine and the transformer onto the same five abstract operations (encoding, context maintenance, associative retrieval, storage, decoding) using cosine similarity as the shared primitive. The Phase-Latency Isomorphism is introduced as a formal linear relation between sinusoidal phase and spike latency, with Lemma 1 providing a mathematical proof of invariance of dot-product attention under that mapping (up to global scale). These steps are presented as independent derivations and comparisons rather than reductions of the claimed result to its own inputs by definition or fitted parameters. The empirical copy-task results supply an external check on positional encodings. No load-bearing step collapses to self-citation or self-definition of the target equivalence.

Axiom & Free-Parameter Ledger

0 free parameters · 1 axiom · 0 invented entities

The claim depends on the assumption that cosine similarity is the sole retrieval primitive and that the five operations are exhaustive; no new free parameters or invented entities are introduced in the abstract.

axioms (1)
  • domain assumption: Sequence learning reduces to similarity-based retrieval over a temporally indexed representation space.
    Stated as a general constraint applying to any sequence model.

pith-pipeline@v0.9.0 · 5467 in / 1262 out tokens · 28963 ms · 2026-05-09T14:50:10.765602+00:00 · methodology

discussion (0)

