Transformer feed-forward layers are key-value memories

Mor Geva, Roei Schuster, Jonathan Berant, Omer Levy · 2021

8 Pith papers cite this work. Polarity classification is still indexing.

citation-role summary

method: 1

citation-polarity summary

use method: 1

representative citing papers

Navigating by Old Maps: The Pitfalls of Static Mechanistic Localization in LLM Post-Training

cs.CL · 2026-05-07 · unverdicted · novelty 7.0

Transformer circuits show free evolution during SFT, rendering static mechanistic localization inadequate for future parameter updates due to inherent temporal latency.

Task Vector Geometry Underlies Dual Modes of Task Inference in Transformers

cs.LG · 2026-05-05 · unverdicted · novelty 7.0

In a controlled synthetic setting, transformers implement in-distribution task inference via convex combinations of task vectors and out-of-distribution inference via nearly orthogonal extrapolative representations.

Sharp Capacity Scaling of Spectral Optimizers in Learning Associative Memory

cs.LG · 2026-03-27 · unverdicted · novelty 7.0

Muon achieves higher storage capacity than SGD and matches Newton's method in one-step recovery rates for associative memory under power-law distributions, while saturating at larger critical batch sizes and showing faster initial multi-step dynamics.

Why Do LLMs Struggle in Strategic Play? Broken Links Between Observations, Beliefs, and Actions

cs.CL · 2026-04-30 · unverdicted · novelty 6.0

LLMs encode accurate but brittle internal beliefs about latent game states and convert them poorly into actions, creating systematic gaps that explain strategic failures.

Towards Effective Theory of LLMs: A Representation Learning Approach

cs.LG · 2026-05-10 · unverdicted · novelty 5.0

RET learns temporally consistent macrovariables from LLM activations via self-supervised learning to support interpretability, early behavioral prediction, and causal intervention.

Cubit: Token Mixer with Kernel Ridge Regression

cs.LG · 2026-05-07 · unverdicted · novelty 5.0 · 2 refs

Cubit replaces Transformer's attention with a closed-form Kernel Ridge Regression token mixer and reports larger gains as training sequence length increases.

Functional Subspace, where language models can use vector algebra to solve problems

cs.CL · 2026-02-02 · unverdicted · novelty 5.0

LLMs form functional subspaces in activation space where in-context learning tasks are solved by vector algebra operations such as addition and subtraction.

Graph Memory Transformer (GMT)

cs.LG · 2026-04-26

citing papers explorer

Showing 8 of 8 citing papers.

Navigating by Old Maps: The Pitfalls of Static Mechanistic Localization in LLM Post-Training · cs.CL · 2026-05-07 · unverdicted · none · ref 55
Task Vector Geometry Underlies Dual Modes of Task Inference in Transformers · cs.LG · 2026-05-05 · unverdicted · none · ref 15
Sharp Capacity Scaling of Spectral Optimizers in Learning Associative Memory · cs.LG · 2026-03-27 · unverdicted · none · ref 14
Why Do LLMs Struggle in Strategic Play? Broken Links Between Observations, Beliefs, and Actions · cs.CL · 2026-04-30 · unverdicted · none · ref 35
Towards Effective Theory of LLMs: A Representation Learning Approach · cs.LG · 2026-05-10 · unverdicted · none · ref 29
Cubit: Token Mixer with Kernel Ridge Regression · cs.LG · 2026-05-07 · unverdicted · none · ref 24 · 2 links
Functional Subspace, where language models can use vector algebra to solve problems · cs.CL · 2026-02-02 · unverdicted · none · ref 17
Graph Memory Transformer (GMT) · cs.LG · 2026-04-26 · unreviewed · ref 2
