A Systematic Comparison between Extractive Self-Explanations and Human Rationales in Text Classification
Pith reviewed 2026-05-23 19:46 UTC · model grok-4.3
The pith
Self-explanations from LLMs yield faithful token subsets aligned with human rationales in text classification, unlike post-hoc methods that emphasize structural tokens.
A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.
Core claim
Across sentiment, forced labour, and claim verification tasks, self-explanations generated by four open-weight LLMs produce token subsets whose faithfulness to correct model predictions exceeds that of post-hoc attributions; human alignment of these self-explanations depends on text length and task complexity, while post-hoc methods preferentially highlight structural and formatting tokens, indicating fundamentally different explanation strategies.
What carries the argument
Extractive self-explanations (token rationales generated directly by the LLM) evaluated for both human plausibility and faithfulness to model predictions, contrasted with post-hoc attribution methods on the same inputs.
If this is right
- Alignment of self-explanations with human rationales is not uniform but scales with input length and task complexity.
- Self-explanations remain faithful to the tokens supporting correct model outputs even when human agreement is only partial.
- Post-hoc attribution methods systematically surface formatting and structural tokens rather than content tokens.
- The pattern holds across English, Danish, and Italian versions of the sentiment task.
Where Pith is reading between the lines
- Applications that need explanations faithful to a model's actual decision process may benefit from using self-explanations rather than post-hoc attributions.
- Hybrid systems could combine self-explanations for faithfulness with post-hoc methods for coverage of structural cues.
- Standardized protocols for collecting human rationales would strengthen future comparisons of this kind.
- The length- and complexity-dependence suggests testing self-explanation quality on longer documents or multi-sentence reasoning tasks next.
Load-bearing premise
The newly collected human rationale annotations for Climate-Fever form a reliable and unbiased measure of plausibility for comparing LLM self-explanations.
What would settle it
A controlled re-annotation of Climate-Fever examples by a second independent group of annotators that reverses the observed alignment ranking between self-explanations and post-hoc methods would falsify the claim of distinct explanation strategies.
Figures
read the original abstract
Instruction-tuned LLMs are able to provide \textit{an} explanation about their output to users by generating self-explanations, without requiring the application of complex interpretability techniques. In this paper, we analyse whether this ability results in a \textit{good} explanation. We evaluate self-explanations in the form of input rationales with respect to their plausibility to humans. We study three text classification tasks: sentiment classification, forced labour detection and claim verification. We include Danish and Italian translations of the sentiment classification task and compare self-explanations to human annotations. For this, we collected human rationale annotations for Climate-Fever, a claim verification dataset. We furthermore evaluate the faithfulness of human and self-explanation rationales with respect to correct model predictions, and extend the study by incorporating post-hoc attribution-based explanations. We analyse four open-weight LLMs and find that alignment between self-explanations and human rationales highly depends on text length and task complexity. Nevertheless, self-explanations yield faithful subsets of token-level rationales, whereas post-hoc attribution methods tend to emphasize structural and formatting tokens, reflecting fundamentally different explanation strategies.
Editorial analysis
A structured set of objections, weighed in public.
Referee Report
Summary. The manuscript conducts a systematic empirical comparison of extractive self-explanations generated by four open-weight instruction-tuned LLMs against newly collected human rationales and post-hoc attribution methods across three text classification tasks (sentiment classification in English/Danish/Italian, forced labour detection, and claim verification on Climate-Fever). It evaluates plausibility through alignment with human annotations and faithfulness with respect to correct model predictions, concluding that self-explanation alignment with humans depends on text length and task complexity while self-explanations produce faithful token subsets, in contrast to post-hoc methods that emphasize structural and formatting tokens.
Significance. If the human annotations prove reliable, the study offers useful evidence distinguishing LLM self-explanation strategies from post-hoc attribution in terms of faithfulness and plausibility. Strengths include the controlled multi-task and multi-language design, direct faithfulness checks against model predictions, and the collection of new human rationales for Climate-Fever to enable the plausibility comparison.
major comments (1)
- [Annotation collection for Climate-Fever] The section describing the annotation collection process for Climate-Fever: the plausibility evaluation and claims about dependence on text length/task complexity rest on these new human rationales, yet the manuscript reports neither inter-annotator agreement, the number of annotators per instance, explicit annotation guidelines, nor bias-mitigation procedures such as blinding annotators to model outputs. Without these, it is unclear whether the annotations constitute a stable ground truth for human plausibility.
minor comments (2)
- [Abstract] The abstract and introduction could more explicitly state the number of models, tasks, and languages evaluated to improve scannability.
- [Results] Table or figure captions for the faithfulness and plausibility results should include the exact metrics used (e.g., token overlap, sufficiency) for immediate clarity.
Simulated Author's Rebuttal
We thank the referee for their detailed review and constructive feedback on our manuscript. We address the major comment below and will revise the manuscript accordingly to improve the description of our human annotation process.
read point-by-point responses
-
Referee: [Annotation collection for Climate-Fever] The section describing the annotation collection process for Climate-Fever: the plausibility evaluation and claims about dependence on text length/task complexity rest on these new human rationales, yet the manuscript reports neither inter-annotator agreement, the number of annotators per instance, explicit annotation guidelines, nor bias-mitigation procedures such as blinding annotators to model outputs. Without these, it is unclear whether the annotations constitute a stable ground truth for human plausibility.
Authors: We agree that additional details on the annotation collection process are required for the plausibility evaluation to be fully interpretable. In the revised manuscript, we will expand the relevant section to include inter-annotator agreement statistics, the number of annotators per instance, the full annotation guidelines, and bias-mitigation steps such as blinding procedures. These additions will clarify the stability of the human rationales as ground truth. revision: yes
Circularity Check
No significant circularity; purely empirical comparisons
full rationale
This is an empirical study that collects new human rationale annotations for Climate-Fever and directly compares LLM self-explanations, human rationales, and post-hoc attributions on faithfulness and plausibility metrics across several datasets and languages. No equations, derivations, fitted parameters, or predictions appear in the provided text. No self-citation chains, uniqueness theorems, or ansatzes are invoked to support central claims. The analysis rests on external annotations and model outputs rather than reducing to its own inputs by construction. This matches the default expectation for non-circular empirical work.
Axiom & Free-Parameter Ledger
axioms (1)
- domain assumption Human rationale annotations collected for Climate-Fever constitute a reliable ground truth for evaluating explanation plausibility.
Reference graph
Works this paper leans on
-
[1]
plausibility: On the (un)reliability of explanations from large language models
Faithfulness vs. plausibility: On the (un)reliability of explanations from large language models. Preprint, arXiv:2402.04614. Ameen Ali, Thomas Schnake, Oliver Eberle, Grégoire Montavon, Klaus-Robert Müller, and Lior Wolf
-
[2]
Rather a nurse than a physi- cian - contrastive explanations under investigation. In Proceedings of the 2023 Conference on Empiri- cal Methods in Natural Language Processing, pages 6907–6920, Singapore. Association for Computa- tional Linguistics. Shiyuan Huang, Siddarth Mamidanna, Shreedhar Jangam, Yilun Zhou, and Leilani H. Gilpin
work page 2023
-
[3]
Can large language models explain themselves? a study of llm-generated self-explanations. Preprint, arXiv:2310.11207. Alon Jacovi and Yoav Goldberg
-
[4]
Contrastive explanations for model interpretability. In Proceedings of the 2021 Conference on Empiri- cal Methods in Natural Language Processing, pages 1597–1611, Online and Punta Cana, Dominican Re- public. Association for Computational Linguistics. Ziwei Ji, Nayeon Lee, Rita Frieske, Tiezheng Yu, Dan Su, Yan Xu, Etsuko Ishii, Ye Jin Bang, Andrea Madotto...
work page 2021
-
[5]
Erick Mendez Guzman, Viktor Schlegel, and Riza Batista-Navarro
Are self-explanations from large language models faithful? Preprint, arXiv:2401.07927. Erick Mendez Guzman, Viktor Schlegel, and Riza Batista-Navarro
-
[6]
InFoBench: Evaluating instruction following ability in large lan- guage models. In Findings of the Association for Computational Linguistics: ACL 2024, pages 13025– 13048, Bangkok, Thailand. Association for Compu- tational Linguistics. Farnoush Rezaei Jafari, Grégoire Montavon, Klaus- Robert Müller, and Oliver Eberle
work page 2024
-
[7]
Recursive deep models for semantic compositionality over a sentiment treebank. In Proceedings of the 2013 Conference on Empiri- cal Methods in Natural Language Processing, pages 1631–1642, Seattle, Washington, USA. Association for Computational Linguistics. Terne Sasha Thorn Jakobsen, Laura Cabello, and Anders Søgaard
work page 2013
-
[8]
Llama 2: Open Foundation and Fine-Tuned Chat Models
Llama 2: Open foundation and fine- tuned chat models. Preprint, arXiv:2307.09288. Sarah Wiegreffe, Jack Hessel, Swabha Swayamdipta, Mark Riedl, and Yejin Choi
work page internal anchor Pith review Pith/arXiv arXiv
-
[9]
Reframing human-AI collaboration for generating free-text ex- planations. In Proceedings of the 2022 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, pages 632–658, Seattle, United States. Association for Computational Linguistics. Xi Ye and Greg Durrett
work page 2022
-
[10]
Interpreting lan- guage models with contrastive explanations. In Pro- ceedings of the 2022 Conference on Empirical Meth- ods in Natural Language Processing, pages 184–198, Abu Dhabi, United Arab Emirates. Association for Computational Linguistics. Zhiyuan Zeng, Jiatong Yu, Tianyu Gao, Yu Meng, Tanya Goyal, and Danqi Chen
work page 2022
-
[11]
Fine-Tuning Language Models from Human Preferences
Fine-tuning lan- guage models from human preferences. arXiv preprint arXiv:1909.08593. A Instructions A.1 SST Figure 6: Prompts in all 3 languages to solve sentiment classification. Figure 7: Follow-up prompts in all 3 languages to extract rationales. A.2 RaFoLa Figure 8: Prompts for classification and rationale extraction for the RaFoLa dataset. Figure 9...
work page internal anchor Pith review Pith/arXiv arXiv 1909
-
[12]
Figure 13 show the same POS analysis for RaFoLa
are discussed in Section 5.1. Figure 13 show the same POS analysis for RaFoLa. E Most frequent rationale tokens Table 3: List of top-8 most frequent tokens in the RaFoLa corpus (first row) together with the most frequent rationales as identified by human annotators, as well as self-generated and post-hoc explanations. #1 Abuse of vulnerability #2 Abusive ...
work page 2020
-
[13]
Both Llama3 and Mistral have a low error rate with 2% and 5% respectively
The results show that Llama2 has a lot of difficulties with respect to json syntax with syntax errors occurring in 86% of the case. Both Llama3 and Mistral have a low error rate with 2% and 5% respectively. At the same time, Mistral returns more than the maximum number of requested rationale tokens in 1 out of 3 instructions where Llama3 follows the instr...
work page 2024
-
[14]
and the minor observed performance differences for generating the correct label based on contrastive or non-contrastive approaches (Krishna et al., 2023). E.3 Entity analysis SST RaFoLa Figure 14: Comparing plausibility scores for non-contrastive and contrastive post-hoc approaches using Kappa agreement scores. Left: SST and multilingual SST for English, ...
work page 2023
discussion (0)
Sign in with ORCID, Apple, or X to comment. Anyone can read and Pith papers without signing in.