FaithLM: Towards faithful explanations for large language models

Chuang, Yu-Neng, Wang, Guanchu, Chang, Chia-Yuan, Tang, Ruixiang, Zhong, Shaochen, Yang, Fan · 2024 · arXiv 2402.04678

4 Pith papers cite this work.

representative citing papers

MIMIC: Multimodal Inversion for Model Interpretation and Conceptualization

cs.CV · 2025-08-11 · unverdicted · novelty 7.0

MIMIC is a new inversion framework that recovers visual concepts from VLM internal states using joint inversion, feature alignment, and three regularizers.

When AI Persuades: Adversarial Explanation Attacks on Human Trust in AI-Assisted Decision Making

cs.AI · 2026-02-03 · unverdicted · novelty 6.0

Adversarial explanation attacks preserve nearly all human trust in wrong AI outputs by using persuasive framing, shown in a study varying reasoning, evidence, style, and format with over 200 participants.

Characterize Then Distill: Mechanistic Reasoning in Large Output Spaces

cs.CL · 2026-06-05 · unverdicted · novelty 5.0

Reasoning in large output spaces proceeds via shortlisting then fine-grained reasoning; this characterization enables a mechanistic distillation strategy that outperforms standard distillation.

ToxiTrace: Gradient-Aligned Training for Explainable Chinese Toxicity Detection

cs.CL · 2026-04-14 · unverdicted · novelty 5.0

ToxiTrace combines CuSA for LLM-refined toxic spans, GCLoss for gradient-focused saliency, and ARCL for contrastive toxic/non-toxic boundaries to improve Chinese toxicity classification and explainable span extraction.

citing papers explorer

When AI Persuades: Adversarial Explanation Attacks on Human Trust in AI-Assisted Decision Making

cs.AI · 2026-02-03 · unverdicted · none · ref 14

Characterize Then Distill: Mechanistic Reasoning in Large Output Spaces

cs.CL · 2026-06-05 · unverdicted · none · ref 81

ToxiTrace: Gradient-Aligned Training for Explainable Chinese Toxicity Detection

cs.CL · 2026-04-14 · unverdicted · none · ref 1
