Fixed counterfactual explanation datasets train LMs such that generated explanations track the model's evolving behavior rather than the fixed targets, due to persistent correlation during training.
Predictive concept decoders: Training scalable end-to-end interpretability assistants, 2025
2 Pith papers cite this work. Polarity classification is still indexing.
2
Pith papers citing it
years
2026 2verdicts
UNVERDICTED 2representative citing papers
Four changes to Activation Oracle training yield marginal capability gains but better practical quality, plus an open-sourced evaluation suite AObench.
citing papers explorer
-
Building Better Activation Oracles
Four changes to Activation Oracle training yield marginal capability gains but better practical quality, plus an open-sourced evaluation suite AObench.