Locating and Editing Factual Associations in

Meng, Kevin, Bau, David, Andonian, Alex, Belinkov, Yonatan , journal=

6 Pith papers cite this work. Polarity classification is still indexing.

6 Pith papers citing it

browse 6 citing papers

citation-role summary

background 1 method 1

citation-polarity summary

background 1 use method 1

representative citing papers

WriteSAE: Sparse Autoencoders for Recurrent State

cs.LG · 2026-05-12 · unverdicted · novelty 8.0 · 3 refs

WriteSAE introduces sparse autoencoders with rank-1 matrix atoms for recurrent state updates, allowing replacement tests that outperform deletion on 92.4% of positions and a formula predicting logit changes with R²=0.98.

Geometric Factual Recall in Transformers

cs.CL · 2026-05-12 · conditional · novelty 8.0

A single-layer transformer memorizes random subject-attribute bijections using logarithmic embedding dimension via linear superpositions in embeddings and ReLU-gated selection in the MLP, with zero-shot transfer to new facts and matching multi-hop constructions.

LinAlg-Bench: A Forensic Benchmark Revealing Structural Failure Modes in LLM Mathematical Reasoning

cs.AI · 2026-05-15 · unverdicted · novelty 7.0

LinAlg-Bench shows LLMs switch from execution errors to computational abandonment and structured fabrication at 4x4 matrix scale, indicating a working memory limit rather than knowledge gaps.

Every Component is a Lookup: Token Attribution and Composition from a Single Decomposition

cs.LG · 2026-05-22 · unverdicted · novelty 6.0

Unpack decomposes transformer credit via a unified backward recursion on the φ(S)U template, recovering known IOI circuits with mode labels and showing consistent duplicate-name suppression across Pythia scales from a single forward pass.

PARALLAX: Separating Genuine Hallucination Detection from Benchmark Construction Artifacts

cs.CL · 2026-05-16 · unverdicted · novelty 6.0

Benchmark construction artifacts in hallucination detection corpora allow naive text-similarity baselines to achieve near-perfect scores, and controlled evaluations show most methods perform near chance except SAPLMA and the new DRIFT probe.

A Single Neuron Is Sufficient to Bypass Safety Alignment in Large Language Models

cs.CL · 2026-05-08 · unverdicted · novelty 6.0

Suppressing one refusal neuron or amplifying one concept neuron bypasses safety alignment in LLMs from 1.7B to 70B parameters without training or prompt engineering.

citing papers explorer

Showing 6 of 6 citing papers.

WriteSAE: Sparse Autoencoders for Recurrent State cs.LG · 2026-05-12 · unverdicted · none · ref 42 · 3 links
WriteSAE introduces sparse autoencoders with rank-1 matrix atoms for recurrent state updates, allowing replacement tests that outperform deletion on 92.4% of positions and a formula predicting logit changes with R²=0.98.
Geometric Factual Recall in Transformers cs.CL · 2026-05-12 · conditional · none · ref 13
A single-layer transformer memorizes random subject-attribute bijections using logarithmic embedding dimension via linear superpositions in embeddings and ReLU-gated selection in the MLP, with zero-shot transfer to new facts and matching multi-hop constructions.
LinAlg-Bench: A Forensic Benchmark Revealing Structural Failure Modes in LLM Mathematical Reasoning cs.AI · 2026-05-15 · unverdicted · none · ref 18
LinAlg-Bench shows LLMs switch from execution errors to computational abandonment and structured fabrication at 4x4 matrix scale, indicating a working memory limit rather than knowledge gaps.
Every Component is a Lookup: Token Attribution and Composition from a Single Decomposition cs.LG · 2026-05-22 · unverdicted · none · ref 8
Unpack decomposes transformer credit via a unified backward recursion on the φ(S)U template, recovering known IOI circuits with mode labels and showing consistent duplicate-name suppression across Pythia scales from a single forward pass.
PARALLAX: Separating Genuine Hallucination Detection from Benchmark Construction Artifacts cs.CL · 2026-05-16 · unverdicted · none · ref 80
Benchmark construction artifacts in hallucination detection corpora allow naive text-similarity baselines to achieve near-perfect scores, and controlled evaluations show most methods perform near chance except SAPLMA and the new DRIFT probe.
A Single Neuron Is Sufficient to Bypass Safety Alignment in Large Language Models cs.CL · 2026-05-08 · unverdicted · none · ref 22
Suppressing one refusal neuron or amplifying one concept neuron bypasses safety alignment in LLMs from 1.7B to 70B parameters without training or prompt engineering.

Locating and Editing Factual Associations in

citation-role summary

citation-polarity summary

fields

years

verdicts

roles

polarities

representative citing papers

citing papers explorer