
arxiv: 2601.03089 · v2 · pith:BCPIXLYJnew · submitted 2026-01-06 · 💻 cs.CL · cs.AI · cs.LG

Faithfulness Evaluation for Decoder-only LLM Attributions with Controlled Retained Information

classification 💻 cs.CL · cs.AI · cs.LG
keywords attribution · methods · evaluation · faithfulness · llms · retained · soft-nc · soft-ns

Large Language Models (LLMs) are increasingly evaluated with input attribution methods, yet comparing such explanations remains challenging. Existing soft-perturbation faithfulness metrics, such as Soft-NC and Soft-NS, can conflate attribution quality with the number of words retained during perturbation: attribution methods with larger average scores may keep more words and therefore obtain inflated scores. To address this issue, we propose $\pi$-Soft-NC and $\pi$-Soft-NS, an evaluation framework that compares attribution methods under the same expected retaining probability, thus controlling the number of retained words. We further introduce Grad-ELLM, a gradient-based attribution method tailored to autoregressive decoder-only LLMs, which combines gradient-derived channel importance with attention-derived token importance at each decoding step. Experiments on classification and open-generation tasks with Llama and Mistral show that Grad-ELLM achieves strong comprehensiveness-oriented faithfulness under $\pi$-Soft-NC, while there is no dominant method under $\pi$-Soft-NS. Our evaluation metric serves as a rigorous framework to compare XAI methods for LLMs, which will support progress in the field.
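The confound described in the abstract can be illustrated with a small sketch. The abstract does not specify how the expected retaining probability is equalized, so everything below is an assumption made for illustration: the function names `expected_retained` and `calibrate_retention` are hypothetical, and the power transform fitted by bisection is a stand-in technique, not the paper's $\pi$-Soft-NC or $\pi$-Soft-NS procedure. The sketch treats each attribution score as a per-word retention probability and rescales the scores so that their mean equals a target $\pi$ while preserving their ranking.

```python
import math


def expected_retained(probs):
    """Expected number of words kept when word i is retained with probability probs[i]."""
    return sum(probs)


def calibrate_retention(scores, pi, eps=1e-3, iters=100):
    """Map scores in [0, 1] to retention probabilities whose mean is pi.

    Illustrative assumption: a power transform p_i = s_i ** gamma, with gamma
    found by bisection. The transform is monotone, so the ranking of words
    given by the attribution method is unchanged.
    """
    if not 0.01 <= pi <= 0.99:
        raise ValueError("pi must lie in [0.01, 0.99] for this sketch")
    # Clip away exact 0 and 1 so the mean can move across (0, 1).
    s = [min(max(x, eps), 1.0 - eps) for x in scores]

    def mean_prob(gamma):
        return sum(x ** gamma for x in s) / len(s)

    lo, hi = 1e-6, 1e6  # mean_prob decreases as gamma grows
    for _ in range(iters):
        mid = math.sqrt(lo * hi)  # bisect on a log scale
        if mean_prob(mid) > pi:
            lo = mid  # too many words retained: sharpen
        else:
            hi = mid  # too few words retained: flatten
    gamma = math.sqrt(lo * hi)
    return [x ** gamma for x in s]


# Two hypothetical attribution methods with the same ranking but different scales.
method_a = [0.9, 0.8, 0.7, 0.2]
method_b = [0.3, 0.2, 0.1, 0.05]

# Uncalibrated, method A keeps more words in expectation than method B.
raw_gap = expected_retained(method_a) - expected_retained(method_b)

# Calibrated to the same pi, both keep the same expected number of words.
cal_a = calibrate_retention(method_a, pi=0.5)
cal_b = calibrate_retention(method_b, pi=0.5)
```

Before calibration, method A retains about 2.6 words in expectation against 0.65 for method B, so a soft-perturbation score could favor A for reasons unrelated to ranking quality. After calibration both retain 2.0 words in expectation, and any remaining difference in a downstream metric would reflect how the probability mass is distributed over words.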
