Transformer feed-forward layers are key-value memories

8 Pith papers cite this work. Polarity classification is still indexing.

citation summary

fields: cs.LG 5, cs.CL 3
years: 2026 8
roles: method 1
polarities: use method 1

representative citing papers

Sharp Capacity Scaling of Spectral Optimizers in Learning Associative Memory

cs.LG · 2026-03-27 · unverdicted · novelty 7.0

Muon achieves higher storage capacity than SGD and matches Newton's method in one-step recovery rates for associative memory under power-law distributions, while saturating at larger critical batch sizes and showing faster initial multi-step dynamics.

Cubit: Token Mixer with Kernel Ridge Regression

cs.LG · 2026-05-07 · unverdicted · novelty 5.0 · 2 refs

Cubit replaces the Transformer's attention with a closed-form Kernel Ridge Regression token mixer and reports larger gains as training sequence length increases.
