A Systematic Comparison between Extractive Self-Explanations and Human Rationales in Text Classification

Oliver Eberle; Stephanie Brandl

arxiv: 2410.03296 · v4 · pith:4AME6UQQnew · submitted 2024-10-04 · 💻 cs.CL · cs.AI

Pith reviewed 2026-05-23 19:46 UTC · model grok-4.3

keywords self-explanations · human rationales · text classification · LLM explanations · faithfulness · plausibility · post-hoc attributions · Climate-Fever

The pith

Self-explanations from LLMs yield faithful token subsets aligned with human rationales in text classification, unlike post-hoc methods that emphasize structural tokens.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper tests whether instruction-tuned LLMs can produce usable explanations for their own text classification outputs by generating extractive self-explanations as input rationales. It compares these rationales to new and existing human annotations across sentiment classification in three languages, forced labour detection, and claim verification, with fresh human labels collected for the Climate-Fever dataset. Alignment between self-explanations and human rationales varies with input length and task difficulty, yet the self-explanations remain faithful to the tokens that actually drive correct model predictions. Post-hoc attribution techniques, by contrast, tend to surface formatting and structural tokens instead. The comparison therefore isolates two distinct explanation strategies that cannot be treated as interchangeable.

Core claim

Across sentiment, forced labour, and claim verification tasks, self-explanations generated by four open-weight LLMs produce token subsets whose faithfulness to correct model predictions exceeds that of post-hoc attributions; human alignment of these self-explanations depends on text length and task complexity, while post-hoc methods preferentially highlight structural and formatting tokens, indicating fundamentally different explanation strategies.

What carries the argument

Extractive self-explanations (token rationales generated directly by the LLM) evaluated for both human plausibility and faithfulness to model predictions, contrasted with post-hoc attribution methods on the same inputs.
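The figure captions indicate that rationale pairs are compared with Cohen's Kappa and with pair-wise F1. A minimal sketch of both scores on binary token masks follows; the function names, the example sentence, and the masks are illustrative assumptions, not the authors' code or data.

```python
from typing import Sequence


def cohen_kappa(a: Sequence[int], b: Sequence[int]) -> float:
    """Cohen's kappa between two binary token masks of equal length."""
    if len(a) != len(b) or not a:
        raise ValueError("masks must be non-empty and of equal length")
    n = len(a)
    observed = sum(x == y for x, y in zip(a, b)) / n
    p_a, p_b = sum(a) / n, sum(b) / n
    # Chance agreement: both select a token, or both skip it.
    expected = p_a * p_b + (1 - p_a) * (1 - p_b)
    if expected == 1.0:
        # Both masks are constant and identical; we return 1.0 by convention.
        return 1.0
    return (observed - expected) / (1 - expected)


def token_f1(reference: Sequence[int], predicted: Sequence[int]) -> float:
    """Token-level F1 of a predicted rationale mask against a reference mask."""
    tp = sum(r == 1 and p == 1 for r, p in zip(reference, predicted))
    if tp == 0:
        return 0.0
    precision = tp / sum(predicted)
    recall = tp / sum(reference)
    return 2 * precision * recall / (precision + recall)


# Hypothetical example: 1 marks a token selected as rationale.
tokens = ["the", "movie", "was", "surprisingly", "good"]
human_mask = [0, 0, 0, 1, 1]
model_mask = [0, 1, 0, 0, 1]
print(cohen_kappa(human_mask, model_mask), token_f1(human_mask, model_mask))
```

Kappa corrects for chance agreement, which matters when rationales cover only a small share of a long input; F1 ignores the tokens that both sides leave unselected.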

If this is right

Alignment of self-explanations with human rationales is not uniform but varies with input length and task complexity.
Self-explanations remain faithful to the tokens supporting correct model outputs even when human agreement is only partial.
Post-hoc attribution methods systematically surface formatting and structural tokens rather than content tokens.
The pattern holds across English, Danish, and Italian versions of the sentiment task.
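The faithfulness claim in the points above can be illustrated with a deletion test: remove the tokens an explanation selects and measure how far the model's probability for its prediction falls. The sketch below is an assumption-laden toy, using a hypothetical bag-of-words scorer (`TOY_WEIGHTS`, `toy_positive_prob`) in place of an LLM; it is not the paper's evaluation protocol.

```python
import math
from typing import Callable, Sequence

# Hypothetical toy weights standing in for a real classifier; not from the paper.
TOY_WEIGHTS = {"good": 2.0, "surprisingly": 0.5, "bad": -2.0}


def toy_positive_prob(tokens: Sequence[str]) -> float:
    """Probability of the positive label under a toy bag-of-words logistic model."""
    score = sum(TOY_WEIGHTS.get(t, 0.0) for t in tokens)
    return 1.0 / (1.0 + math.exp(-score))


def prob_drop(
    tokens: Sequence[str],
    mask: Sequence[int],
    predict: Callable[[Sequence[str]], float],
) -> float:
    """Drop in predicted probability after deleting tokens where mask == 1."""
    kept = [t for t, m in zip(tokens, mask) if m == 0]
    return predict(tokens) - predict(kept)


tokens = ["the", "movie", "was", "surprisingly", "good"]
content_mask = [0, 0, 0, 1, 1]     # rationale made of content tokens
structural_mask = [1, 0, 1, 0, 0]  # rationale made of function words

print(prob_drop(tokens, content_mask, toy_positive_prob))
print(prob_drop(tokens, structural_mask, toy_positive_prob))
```

Under this toy model, deleting the content rationale lowers the probability while deleting the function words leaves it unchanged. A real evaluation would replace `toy_positive_prob` with the LLM's label probability and would need a stated policy for how deleted tokens are masked.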

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

Applications that need explanations faithful to a model's actual decision process may benefit from using self-explanations rather than post-hoc attributions.
Hybrid systems could combine self-explanations for faithfulness with post-hoc methods for coverage of structural cues.
Standardized protocols for collecting human rationales would strengthen future comparisons of this kind.
The length- and complexity-dependence suggests testing self-explanation quality on longer documents or multi-sentence reasoning tasks next.

Load-bearing premise

The newly collected human rationale annotations for Climate-Fever form a reliable and unbiased measure of plausibility for comparing LLM self-explanations.

What would settle it

A controlled re-annotation of Climate-Fever examples by a second independent group of annotators that reverses the observed alignment ranking between self-explanations and post-hoc methods would falsify the claim of distinct explanation strategies.

Figures

Figures reproduced from arXiv: 2410.03296 by Oliver Eberle, Stephanie Brandl.

**Figure 2.** Pair-wise comparison scores (Cohen’s Kappa) between rationales on SST and multilingual SST (…

**Figure 3.** Faithfulness evaluation for SST, mSST (Danish) and RaFoLa (articles #1 and #8). Model probability…

**Figure 4.** Distribution of POS-tags in comparison to top…

**Figure 5.** Rationale token analysis of RaFoLa. (Left) Most frequent tokens extracted from the RaFoLa corpus, human annotations, and rationales extracted from model self-explanations and post-hoc explanations for articles #1 and #8. (Right) Ranking of named entities in articles class #1 extracted from model self-explanations and post-hoc explanations for Llama3. Resulting distributions are compared to the entity distr…

**Figure 6.** Prompts in all 3 languages to solve sentiment classification.

**Figure 7.** Follow-up prompts in all 3 languages to extract rationales.

**Figure 8.** Prompts for classification and rationale extraction for the RaFoLa dataset.

**Figure 9.** Indicators defined by the International Labour Organization and published by…

**Figure 10.** Pair-wise F1 comparison scores between rationales on SST and multilingual SST (English, Danish and…

**Figure 11.** Pair-wise F1 comparison scores between rationales on RaFoLa.

**Figure 12.** Distribution of POS-tags in comparison to top-6 POS in human annotations for SST. Absolute POS tags…

**Figure 13.** Distribution of POS-tags in comparison to top-6 POS in human annotations for RaFoLa. Absolute POS…

**Figure 14.** Comparing plausibility scores for non-contrastive and contrastive post-hoc approaches using Kappa…

**Figure 15.** Faithfulness evaluation for SST and mSST (top row) and RaFoLa (bottom row). The probability after…

**Figure 16.** Entity analysis of Rafola for #1 (top) and #8 (bottom) for llama3 (left) and mistral (right), showing top-8…

Original abstract

Instruction-tuned LLMs are able to provide *an* explanation about their output to users by generating self-explanations, without requiring the application of complex interpretability techniques. In this paper, we analyse whether this ability results in a *good* explanation. We evaluate self-explanations in the form of input rationales with respect to their plausibility to humans. We study three text classification tasks: sentiment classification, forced labour detection and claim verification. We include Danish and Italian translations of the sentiment classification task and compare self-explanations to human annotations. For this, we collected human rationale annotations for Climate-Fever, a claim verification dataset. We furthermore evaluate the faithfulness of human and self-explanation rationales with respect to correct model predictions, and extend the study by incorporating post-hoc attribution-based explanations. We analyse four open-weight LLMs and find that alignment between self-explanations and human rationales highly depends on text length and task complexity. Nevertheless, self-explanations yield faithful subsets of token-level rationales, whereas post-hoc attribution methods tend to emphasize structural and formatting tokens, reflecting fundamentally different explanation strategies.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Desk Editor's Note (private letter to a colleague)

The paper adds new human rationale annotations for Climate-Fever and a multilingual comparison, but the plausibility evaluation rests on under-documented annotations.

The core contribution here is a controlled empirical comparison of LLM self-explanations against human rationales on sentiment classification, forced labour detection, and claim verification. It includes Danish and Italian translations for one task and introduces fresh human annotations on Climate-Fever, then contrasts both with post-hoc attribution methods while checking faithfulness to correct model predictions. The finding that self-explanations select faithful token subsets while post-hoc methods favor structural tokens is a clear, usable distinction across the four open-weight models tested.

Referee Report

1 major / 2 minor

Summary. The manuscript conducts a systematic empirical comparison of extractive self-explanations generated by four open-weight instruction-tuned LLMs against newly collected human rationales and post-hoc attribution methods across three text classification tasks (sentiment classification in English/Danish/Italian, forced labour detection, and claim verification on Climate-Fever). It evaluates plausibility through alignment with human annotations and faithfulness with respect to correct model predictions, concluding that self-explanation alignment with humans depends on text length and task complexity while self-explanations produce faithful token subsets, in contrast to post-hoc methods that emphasize structural and formatting tokens.

Significance. If the human annotations prove reliable, the study offers useful evidence distinguishing LLM self-explanation strategies from post-hoc attribution in terms of faithfulness and plausibility. Strengths include the controlled multi-task and multi-language design, direct faithfulness checks against model predictions, and the collection of new human rationales for Climate-Fever to enable the plausibility comparison.

major comments (1)

[Annotation collection for Climate-Fever] The section describing the annotation collection process for Climate-Fever: the plausibility evaluation and claims about dependence on text length/task complexity rest on these new human rationales, yet the manuscript reports neither inter-annotator agreement, the number of annotators per instance, explicit annotation guidelines, nor bias-mitigation procedures such as blinding annotators to model outputs. Without these, it is unclear whether the annotations constitute a stable ground truth for human plausibility.
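One way the missing agreement statistic could be reported is Fleiss' kappa over per-token rationale labels, which handles more than two annotators. The sketch below is illustrative only: the manuscript's choice of statistic is not stated in this review, and the helper names and annotator masks are hypothetical.

```python
from typing import List, Sequence


def masks_to_counts(masks: Sequence[Sequence[int]]) -> List[List[int]]:
    """Turn per-annotator binary masks into per-token [not selected, selected] counts."""
    n_annotators = len(masks)
    n_tokens = len(masks[0])
    if any(len(m) != n_tokens for m in masks):
        raise ValueError("all masks must cover the same tokens")
    counts = []
    for i in range(n_tokens):
        selected = sum(m[i] for m in masks)
        counts.append([n_annotators - selected, selected])
    return counts


def fleiss_kappa(counts: Sequence[Sequence[int]]) -> float:
    """Fleiss' kappa; counts[i][j] is how many annotators put item i in category j."""
    n_items = len(counts)
    n_raters = sum(counts[0])
    if n_raters < 2:
        raise ValueError("need at least two annotators per item")
    n_cats = len(counts[0])
    # Share of all assignments that fall into each category.
    p_cat = [sum(row[j] for row in counts) / (n_items * n_raters) for j in range(n_cats)]
    # Observed pairwise agreement per item, averaged over items.
    p_items = [
        (sum(c * c for c in row) - n_raters) / (n_raters * (n_raters - 1))
        for row in counts
    ]
    p_bar = sum(p_items) / n_items
    p_exp = sum(p * p for p in p_cat)
    if p_exp == 1.0:
        # Every annotator used the same single category; 1.0 by convention.
        return 1.0
    return (p_bar - p_exp) / (1 - p_exp)


# Hypothetical masks from three annotators over five tokens.
annotators = [
    [0, 0, 0, 1, 1],
    [0, 0, 0, 1, 1],
    [0, 1, 0, 0, 1],
]
print(fleiss_kappa(masks_to_counts(annotators)))
```

Reporting such a score per dataset, together with the number of annotators per instance, would let readers judge how stable the human rationales are as a plausibility reference.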

minor comments (2)

[Abstract] The abstract and introduction could more explicitly state the number of models, tasks, and languages evaluated to improve scannability.
[Results] Table or figure captions for the faithfulness and plausibility results should include the exact metrics used (e.g., token overlap, sufficiency) for immediate clarity.

Simulated Author's Rebuttal

1 response · 0 unresolved

We thank the referee for their detailed review and constructive feedback on our manuscript. We address the major comment below and will revise the manuscript accordingly to improve the description of our human annotation process.

Referee: [Annotation collection for Climate-Fever] The section describing the annotation collection process for Climate-Fever: the plausibility evaluation and claims about dependence on text length/task complexity rest on these new human rationales, yet the manuscript reports neither inter-annotator agreement, the number of annotators per instance, explicit annotation guidelines, nor bias-mitigation procedures such as blinding annotators to model outputs. Without these, it is unclear whether the annotations constitute a stable ground truth for human plausibility.

Authors: We agree that additional details on the annotation collection process are required for the plausibility evaluation to be fully interpretable. In the revised manuscript, we will expand the relevant section to include inter-annotator agreement statistics, the number of annotators per instance, the full annotation guidelines, and bias-mitigation steps such as blinding procedures. These additions will clarify the stability of the human rationales as ground truth.

Revision: yes

Circularity Check

0 steps flagged

No significant circularity; purely empirical comparisons

This is an empirical study that collects new human rationale annotations for Climate-Fever and directly compares LLM self-explanations, human rationales, and post-hoc attributions on faithfulness and plausibility metrics across several datasets and languages. No equations, derivations, fitted parameters, or predictions appear in the provided text. No self-citation chains, uniqueness theorems, or ansatzes are invoked to support central claims. The analysis rests on external annotations and model outputs rather than reducing to its own inputs by construction. This matches the default expectation for non-circular empirical work.

Axiom & Free-Parameter Ledger

0 free parameters · 1 axiom · 0 invented entities

The central findings rest on the assumption that human rationales collected under the study's protocol are a valid external benchmark for plausibility; no free parameters or invented entities are introduced.

axioms (1)

Domain assumption: Human rationale annotations collected for Climate-Fever constitute a reliable ground truth for evaluating explanation plausibility.
Invoked when the paper treats alignment with these annotations as the primary measure of explanation quality.

pith-pipeline@v0.9.0 · 5725 in / 1279 out tokens · 21432 ms · 2026-05-23T19:46:22.551538+00:00 · methodology

