Language model circuits show high within-task consistency and necessity but substantial overlap across tasks, making them less specific than assumed.
Advances in Neural Information Processing Systems , year=
3 Pith papers cite this work. Polarity classification is still indexing.
years
2026 3verdicts
UNVERDICTED 3representative citing papers
ScoringModel raises mean Recall@10 to 52.2 on the FakeWiki provenance benchmark from 35.0 for the best baseline, winning 41 of 45 model-by-condition comparisons and gaining 15.7 points on jailbreak-style queries.
Task-aligned supervised geometric stability predicts linear steerability with high accuracy while unsupervised stability detects representational drift earlier and with lower false alarms than CKA or Procrustes.
citing papers explorer
-
How Much Do Circuits Tell Us? Measuring the Consistency and Specificity of Language Model Circuits
Language model circuits show high within-task consistency and necessity but substantial overlap across tasks, making them less specific than assumed.
-
DataDignity: Training Data Attribution for Large Language Models
ScoringModel raises mean Recall@10 to 52.2 on the FakeWiki provenance benchmark from 35.0 for the best baseline, winning 41 of 45 model-by-condition comparisons and gaining 15.7 points on jailbreak-style queries.
-
The Geometric Canary: Predicting Steerability and Detecting Drift via Representational Stability
Task-aligned supervised geometric stability predicts linear steerability with high accuracy while unsupervised stability detects representational drift earlier and with lower false alarms than CKA or Procrustes.