Towards faithful natural language explanations: A study using activation patching in large language models

Wei Jie Yeo, Ranjan Satapathy, Erik Cambria · 2025 · DOI 10.18653/v1/2025.emnlp-main.529

1 Pith paper cites this work. Polarity classification is still being indexed.



representative citing papers

Faithfulness Metrics Don't Measure Faithfulness: A Meta-Evaluation with Ground Truth

cs.CL · 2026-05-24 · unverdicted · novelty 8.0

Introduces the BonaFide benchmark of 3,066 ground-truth-labeled CoTs, showing that most faithfulness metrics perform near chance, exhibit biases, and scale poorly to longer chains.

citing papers explorer


Faithfulness Metrics Don't Measure Faithfulness: A Meta-Evaluation with Ground Truth

cs.CL · 2026-05-24 · unverdicted · none · ref 13
