Finite-answer projections of continuation probabilities stabilize before the answer is parseable, showing 17-31 token mean lead in delayed-verdict tasks with Qwen3-4B-Instruct.
Locating and Editing Factual Associations in
4 Pith papers cite this work. Polarity classification is still indexing.
years
2026 4verdicts
UNVERDICTED 4representative citing papers
LOCA identifies an average of six minimal interpretable changes in intermediate representations that causally induce refusal on otherwise successful jailbreaks for Gemma and Llama models.
MixSD mixes tokens from the base model's expert and naive conditionals to create distribution-aligned supervision for knowledge injection, yielding better memorization-retention trade-offs than SFT across scales and benchmarks.
In LLM agents, memory routing circuits emerge at 0.6B scale while content circuits appear only at 4B, and write/read operations recruit a pre-existing late-layer context hub instead of creating a new one, enabling a 76% accurate unsupervised failure diagnostic.
citing papers explorer
-
When Does a Language Model Commit? A Finite-Answer Theory of Pre-Verbalization Commitment
Finite-answer projections of continuation probabilities stabilize before the answer is parseable, showing 17-31 token mean lead in delayed-verdict tasks with Qwen3-4B-Instruct.
-
Minimal, Local, Causal Explanations for Jailbreak Success in Large Language Models
LOCA identifies an average of six minimal interpretable changes in intermediate representations that causally induce refusal on otherwise successful jailbreaks for Gemma and Llama models.
-
MixSD: Mixed Contextual Self-Distillation for Knowledge Injection
MixSD mixes tokens from the base model's expert and naive conditionals to create distribution-aligned supervision for knowledge injection, yielding better memorization-retention trade-offs than SFT across scales and benchmarks.
-
What Happens Inside Agent Memory? Circuit Analysis from Emergence to Diagnosis
In LLM agents, memory routing circuits emerge at 0.6B scale while content circuits appear only at 4B, and write/read operations recruit a pre-existing late-layer context hub instead of creating a new one, enabling a 76% accurate unsupervised failure diagnostic.