MIMIC is a new inversion framework that recovers visual concepts from VLM internal states using joint inversion, feature alignment, and three regularizers.
FaithLM: Towards Faithful Explanations for Large Language Models
4 Pith papers cite this work. Polarity classification is still indexing.
citing papers explorer
- When AI Persuades: Adversarial Explanation Attacks on Human Trust in AI-Assisted Decision Making
  Adversarial explanation attacks preserve nearly all human trust in wrong AI outputs by using persuasive framing, shown in a study varying reasoning, evidence, style, and format with over 200 participants.
- Characterize Then Distill: Mechanistic Reasoning in Large Output Spaces
  Reasoning in large output spaces proceeds via shortlisting then fine-grained reasoning; this characterization enables a mechanistic distillation strategy that outperforms standard distillation.
- ToxiTrace: Gradient-Aligned Training for Explainable Chinese Toxicity Detection
  ToxiTrace combines CuSA for LLM-refined toxic spans, GCLoss for gradient-focused saliency, and ARCL for contrastive toxic/non-toxic boundaries to improve Chinese toxicity classification and explainable span extraction.