Locating and Editing Factual Associations in

Kevin Meng, David Bau, Alex J Andonian, Yonatan Belinkov · 2022

4 Pith papers cite this work. Polarity classification is still indexing.

representative citing papers

When Does a Language Model Commit? A Finite-Answer Theory of Pre-Verbalization Commitment

cs.AI · 2026-05-07 · unverdicted · novelty 7.0

Finite-answer projections of continuation probabilities stabilize before the answer is parseable, showing 17-31 token mean lead in delayed-verdict tasks with Qwen3-4B-Instruct.

Minimal, Local, Causal Explanations for Jailbreak Success in Large Language Models

cs.AI · 2026-04-30 · unverdicted · novelty 7.0

LOCA identifies an average of six minimal interpretable changes in intermediate representations that causally induce refusal on otherwise successful jailbreaks for Gemma and Llama models.

MixSD: Mixed Contextual Self-Distillation for Knowledge Injection

cs.CL · 2026-05-16 · unverdicted · novelty 6.0 · 2 refs

MixSD mixes tokens from the base model's expert and naive conditionals to create distribution-aligned supervision for knowledge injection, yielding better memorization-retention trade-offs than SFT across scales and benchmarks.

What Happens Inside Agent Memory? Circuit Analysis from Emergence to Diagnosis

cs.AI · 2026-05-05 · unverdicted · novelty 6.0

In LLM agents, memory routing circuits emerge at 0.6B scale while content circuits appear only at 4B, and write/read operations recruit a pre-existing late-layer context hub instead of creating a new one, enabling a 76% accurate unsupervised failure diagnostic.

citing papers explorer

Showing 4 of 4 citing papers.

When Does a Language Model Commit? A Finite-Answer Theory of Pre-Verbalization Commitment
cs.AI · 2026-05-07 · unverdicted · none · ref 18
Minimal, Local, Causal Explanations for Jailbreak Success in Large Language Models
cs.AI · 2026-04-30 · unverdicted · none · ref 47
MixSD: Mixed Contextual Self-Distillation for Knowledge Injection
cs.CL · 2026-05-16 · unverdicted · none · ref 45 · 2 links
What Happens Inside Agent Memory? Circuit Analysis from Emergence to Diagnosis
cs.AI · 2026-05-05 · unverdicted · none · ref 51
