Toy Models of Superposition (arXiv)

2 Pith papers cite this work. Polarity classification is still indexing.

representative citing papers

Minimal, Local, Causal Explanations for Jailbreak Success in Large Language Models

cs.AI · 2026-04-30 · unverdicted · novelty 7.0

LOCA identifies an average of six minimal interpretable changes in intermediate representations that causally induce refusal on otherwise successful jailbreaks for Gemma and Llama models.

Rigorous Interpretation Is a Form of Evaluation

cs.CY · 2026-05-06 · unverdicted · novelty 5.0

Rigorous interpretability can function as a principled form of model evaluation if its claims are falsifiable, reproducible, and predictive.

citing papers explorer

Showing 2 of 2 citing papers.

Minimal, Local, Causal Explanations for Jailbreak Success in Large Language Models

cs.AI · 2026-04-30 · unverdicted · none · ref 41

Rigorous Interpretation Is a Form of Evaluation

cs.CY · 2026-05-06 · unverdicted · none · ref 29
