Verbosity Tradeoffs and the Impact of Scale on the Faithfulness of LLM Self-Explanations

Maria Perez-Ortiz; Nicolas Heess; Noah Y. Siegel; Oana-Maria Camburu

arxiv: 2503.13445 · v3 · pith:3TAROGI2new · submitted 2025-03-17 · 💻 cs.CL · cs.AI


Noah Y. Siegel, Nicolas Heess, Maria Perez-Ortiz, Oana-Maria Camburu


keywords: metrics, explanations, faithfulness, correlational, counterfactual, faithful, models, test


When asked to explain their decisions, LLMs can often give explanations that sound plausible to humans. But are these explanations faithful, i.e., do they convey the factors actually responsible for the decision? In this work, we analyze counterfactual faithfulness across 75 models from 13 families. We examine the tradeoff between conciseness and comprehensiveness, how correlational faithfulness metrics assess this tradeoff, and the extent to which metrics can be gamed. This analysis motivates two new metrics: the phi-CCT, a simplified variant of the Correlational Counterfactual Test (CCT) that avoids the need for token probabilities while explaining most of the variance of the original test; and F-AUROC, which eliminates sensitivity to imbalanced intervention distributions and captures a model's ability to produce explanations with different levels of detail. Our findings reveal a clear scaling trend: larger and more capable models are consistently more faithful on all metrics we consider. Our code is available at https://github.com/google-deepmind/corr_faith.
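
To make the idea of a correlational test without token probabilities concrete, here is a minimal sketch. It rests on an assumption the abstract does not spell out: that the phi-CCT can be read as the phi coefficient between two binary indicators per counterfactual intervention, namely whether the intervention changed the model's prediction and whether the explanation mentioned the inserted content. The function name and inputs are hypothetical and are not taken from the authors' repository.

```python
import math


def phi_coefficient(changed, mentioned):
    """Phi coefficient between two equal-length binary sequences.

    changed[i]   -- 1 if intervention i changed the prediction (assumed indicator)
    mentioned[i] -- 1 if the explanation mentioned intervention i (assumed indicator)
    Returns NaN when either indicator is constant (correlation undefined).
    """
    if len(changed) != len(mentioned):
        raise ValueError("sequences must have the same length")

    # Build the 2x2 contingency table.
    n11 = n10 = n01 = n00 = 0
    for c, m in zip(changed, mentioned):
        if c and m:
            n11 += 1
        elif c and not m:
            n10 += 1
        elif not c and m:
            n01 += 1
        else:
            n00 += 1

    # Product of the four marginal totals.
    denom = math.sqrt((n11 + n10) * (n01 + n00) * (n11 + n01) * (n10 + n00))
    if denom == 0:
        return float("nan")
    return (n11 * n00 - n10 * n01) / denom


if __name__ == "__main__":
    # Toy data: explanations mention exactly the interventions that mattered.
    print(phi_coefficient([1, 1, 0, 0], [1, 1, 0, 0]))
```

Under this reading, a score near 1 means explanations mention interventions mostly when they changed the prediction, a score near 0 means mentions are unrelated to impact, and only hard predictions are needed as input.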


Forward citations

Cited by 1 Pith paper

Reviewed papers in the Pith corpus that reference this work. Sorted by Pith novelty score.

Faithful by Definition: Emotion Analysis via Natural Semantic Metalanguage Explications
cs.CL 2026-07 unverdicted novelty 5.0

An NSM-based explication parser with fixed semantic rules produces emotion labels for events, achieving 0.33 accuracy on held-out crowd-sourced data while shifting empirical risk to an inspectable parser.