
arxiv: 2605.19274 · v1 · pith:4ZAHWP53new · submitted 2026-05-19 · 💻 cs.CL

Lost in Interpretation: The Plausibility-Faithfulness Trade-off in Cross-Lingual Explanations

Pith reviewed 2026-05-20 06:18 UTC · model grok-4.3

classification 💻 cs.CL
keywords cross-lingual explanations · plausibility-faithfulness trade-off · multilingual LLMs · extractive explanations · comprehensiveness · sufficiency · human rationales · English pivot

The pith

English-pivot explanations for non-English inputs raise span agreement with human rationales but weaken causal ties to the model's actual predictions.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper examines how multilingual LLMs are frequently audited with English explanations even when the input is in another language. It identifies a consistent trade-off: these English explanations match human-chosen evidence spans more closely, yet their evidence is markedly less causally grounded, as quantified by comprehensiveness and sufficiency scores. Comprehensiveness degrades by up to 5.7x relative to native-language conditions, while task accuracy stays stable across three tasks, five languages, and two model families. For socially nuanced classification, the English versions also drop pragmatic cues that native-language explanations retain, reducing both faithfulness and span agreement.

Core claim

Extractive explanations generated via an English pivot can achieve higher span agreement with human rationales while their selected evidence becomes less causally grounded in the model's prediction, with comprehensiveness degrading by up to 5.7x relative to native-language conditions even though task accuracy remains stable.

What carries the argument

The plausibility-faithfulness trade-off, where plausibility is measured by token-span overlap with human rationales and faithfulness is measured by comprehensiveness and sufficiency of the extracted evidence.
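To make the two faithfulness metrics concrete, here is a minimal sketch assuming the common ERASER-style ablation definitions (comprehensiveness as the confidence drop when evidence is removed, sufficiency as the drop when only evidence is kept). This page does not reproduce the paper's exact formulation, so treat these definitions as an assumption. The `PredictFn` interface and `toy_predict` are hypothetical stand-ins for a real model.

```python
from typing import Callable, Sequence, Set

# Hypothetical interface (not from the paper): returns the model's probability
# for its originally predicted label, given a possibly ablated token sequence.
PredictFn = Callable[[Sequence[str]], float]


def comprehensiveness(tokens: Sequence[str], evidence: Set[int], predict: PredictFn) -> float:
    """Confidence drop when the evidence tokens are removed (higher = more faithful)."""
    remainder = [t for i, t in enumerate(tokens) if i not in evidence]
    return predict(tokens) - predict(remainder)


def sufficiency(tokens: Sequence[str], evidence: Set[int], predict: PredictFn) -> float:
    """Confidence drop when only the evidence tokens are kept (lower = more faithful)."""
    kept = [t for i, t in enumerate(tokens) if i in evidence]
    return predict(tokens) - predict(kept)


def toy_predict(tokens: Sequence[str]) -> float:
    """Toy stand-in for a model: confidence grows with the number of cue words present."""
    cues = {"raining", "umbrella"}
    hits = len(cues & set(tokens))
    return {0: 0.3, 1: 0.6, 2: 0.9}[hits]


tokens = "It was raining so she took an umbrella".split()
grounded = {2, 7}  # indices of "raining" and "umbrella"
loose = {0, 1}     # indices of "It" and "was"

for name, span in (("grounded", grounded), ("loose", loose)):
    comp = comprehensiveness(tokens, span, toy_predict)
    suff = sufficiency(tokens, span, toy_predict)
    print(f"{name}: comprehensiveness={comp:.2f} sufficiency={suff:.2f}")
```

In this toy setting the grounded span gives high comprehensiveness and near-zero sufficiency, while the loose span gives the reverse. That is the pattern the summary describes as weaker causal grounding.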

If this is right

  • Audits of multilingual models should generate explanations in the input language rather than defaulting to English.
  • Evaluation of explanations should combine lexical overlap with multiple faithfulness metrics instead of relying on agreement alone.
  • English rationales are more accurately treated as communication summaries than as faithful records of the model's decision process.
  • Pragmatic and social cues in classification tasks are more likely to be lost when explanations are produced in English.
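The second recommendation above, reporting lexical overlap alongside faithfulness, can be sketched as follows. This assumes token-level F1 as the span agreement measure, which is a common choice but is not confirmed by this page. `AuditRow`, `trade_off_gap`, and all numbers are hypothetical illustrations, not results from the paper.

```python
from dataclasses import dataclass
from typing import Dict, Set


def token_f1(model_idx: Set[int], human_idx: Set[int]) -> float:
    """Token-level F1 between model-selected and human-selected token positions."""
    overlap = len(model_idx & human_idx)
    if overlap == 0:
        return 0.0
    precision = overlap / len(model_idx)
    recall = overlap / len(human_idx)
    return 2 * precision * recall / (precision + recall)


@dataclass
class AuditRow:
    """One explanation condition, reported with plausibility and faithfulness together."""
    condition: str
    span_agreement: float
    comprehensiveness: float
    sufficiency: float


def trade_off_gap(native: AuditRow, pivot: AuditRow) -> Dict[str, float]:
    """Both values positive means the pivot looks more plausible but is less grounded."""
    return {
        "span_agreement_gain": pivot.span_agreement - native.span_agreement,
        "comprehensiveness_loss": native.comprehensiveness - pivot.comprehensiveness,
    }


# Illustrative numbers only (invented for this sketch).
native = AuditRow("native", span_agreement=0.40, comprehensiveness=0.50, sufficiency=0.10)
pivot = AuditRow("english_pivot", span_agreement=0.55, comprehensiveness=0.10, sufficiency=0.30)

print(token_f1({1, 2, 3}, {2, 3, 4}))
print(trade_off_gap(native, pivot))
```

Keeping all three scores in one row makes it harder to report a gain in agreement without also showing what happened to grounding.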

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

  • Developers may need new methods that improve fluency of native-language explanations without reintroducing the grounding loss seen in English pivots.
  • The observed trade-off could shape how explanation quality is assessed for low-resource languages where human rationales are scarce.
  • Reliance on English audits might systematically overestimate the reliability of model behavior on non-English inputs.

Load-bearing premise

Comprehensiveness and sufficiency metrics correctly measure causal faithfulness, and human rationales serve as a stable reference for plausibility across languages and tasks.

What would settle it

An experiment that finds English-pivot explanations simultaneously improve or maintain both span agreement and comprehensiveness/sufficiency scores relative to native-language explanations.

Figures

Figures reproduced from arXiv: 2605.19274 by Animesh Mukherjee, Pranav Jha, Rima Hazra, Somnath Banerjee.

Figure 1: The plausibility–faithfulness trade-off in e…

Figure 2: …). We construct semantically matched test sets by translating the original English test instances into each target language using NLLB-200 (3.3B distilled) (Costa-jussà et al., 2022), accessed via …
  Label: entailment
  EN (source): Premise: “It was raining, so she took an umbrella.” Hypothesis: “She used an umbrella because it was raining.”
  Human explanation (rationale): “The premise states rain and taking an …”
Original abstract

LLMs deployed multilingually are often audited via English explanations for non-English inputs. We evaluate extractive explanations (where the model identifies input token spans as evidence alongside a generated rationale) and uncover a systematic trade-off: English-pivot explanations can achieve higher span agreement with human rationales while their evidence becomes less causally grounded in the model's prediction, as measured by both comprehensiveness and sufficiency. Across 3 tasks, 5 languages, and 2 multilingual LLM families, we find that English explanations frequently produce fluent but loosely anchored rationales, with comprehensiveness degrading by up to 5.7x relative to native-language conditions, even as task accuracy remains stable across settings. For socially nuanced classification, English pivots also fail to preserve pragmatic cues, reducing both faithfulness and span agreement. We recommend auditing explanations in the input language, reporting multi-faceted faithfulness metrics beyond lexical overlap, and treating English rationales as communication summaries rather than faithful decision traces.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Referee Report

1 major / 1 minor

Summary. The manuscript evaluates extractive explanations for multilingual LLMs, focusing on the use of English-pivot explanations for non-English inputs. It reports a trade-off where English explanations show higher span agreement with human rationales but lower faithfulness, measured by comprehensiveness and sufficiency, with degradation up to 5.7x. Experiments span 3 tasks, 5 languages, and 2 LLM families, concluding that native-language explanations should be preferred and that English rationales may serve more as communication summaries than faithful traces.

Significance. If the findings are robust, this paper makes a valuable contribution to the field of explainable AI in multilingual settings by highlighting potential pitfalls in cross-lingual explanation generation. The empirical scope across multiple languages and models provides evidence that current practices may lead to explanations that are plausible to humans but not causally faithful to the model. This could influence how practitioners audit multilingual models and encourages the development of better cross-lingual explanation methods.

major comments (1)
  1. [Section 4.2 (Faithfulness Metrics)] The evaluation of comprehensiveness and sufficiency for English-pivot explanations depends on mapping English-generated spans back to the original non-English token sequence for ablation. The paper does not provide details on the alignment method used, its accuracy, or any sensitivity analysis for mapping errors. Such errors could systematically remove incorrect tokens, artificially lowering the faithfulness scores for the English-pivot condition while not affecting native explanations. This is a load-bearing issue for the central claim of a plausibility-faithfulness trade-off.
minor comments (1)
  1. [Abstract] The quantitative claim of 'up to 5.7x' degradation lacks specification of the exact condition (task, language, model) where this maximum occurs, which would improve interpretability of the results.

Simulated Author's Rebuttal

1 response · 0 unresolved

We thank the referee for their constructive feedback, which helps clarify an important methodological aspect of our work. We address the major comment below and have revised the manuscript to incorporate additional details and analyses.

Point-by-point responses
  1. Referee: [Section 4.2 (Faithfulness Metrics)] The evaluation of comprehensiveness and sufficiency for English-pivot explanations depends on mapping English-generated spans back to the original non-English token sequence for ablation. The paper does not provide details on the alignment method used, its accuracy, or any sensitivity analysis for mapping errors. Such errors could systematically remove incorrect tokens, artificially lowering the faithfulness scores for the English-pivot condition while not affecting native explanations. This is a load-bearing issue for the central claim of a plausibility-faithfulness trade-off.

    Authors: We agree that explicit documentation of the alignment procedure is essential for reproducibility and to rule out systematic bias in the faithfulness metrics. In our experiments, English-pivot spans were mapped back to the original non-English input tokens using a combination of subword tokenization alignment via SentencePiece and cross-lingual word alignment with the fast-align toolkit, followed by a heuristic for multi-token spans. We have now expanded Section 4.2 with a dedicated paragraph describing this procedure in full, including pseudocode. Additionally, we report alignment accuracy on a manually annotated sample of 200 examples (average F1 of 0.87 across languages) and include a sensitivity analysis: we introduce controlled random mapping perturbations at rates of 5%, 10%, and 15% and recompute comprehensiveness and sufficiency. The plausibility-faithfulness trade-off remains statistically significant (p < 0.01) under all perturbation levels, indicating that alignment noise does not drive the observed degradation. The revised code and alignment scripts have been added to the public repository.

    Revision: yes
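The perturbation-based sensitivity analysis described in this simulated rebuttal could look roughly like the sketch below. It is one plausible reading, not the authors' script: `perturb_mapping` is a hypothetical helper that, with probability `rate`, moves each mapped evidence position to a random non-evidence position, after which the faithfulness metrics would be recomputed on the perturbed span.

```python
import random
from typing import Set


def perturb_mapping(evidence: Set[int], n_tokens: int, rate: float, rng: random.Random) -> Set[int]:
    """Simulate alignment noise: each evidence index is reassigned with probability `rate`.

    Reassigned indices land on distinct positions that were not in the original
    evidence, so the span size is preserved.
    """
    candidates = [i for i in range(n_tokens) if i not in evidence]
    perturbed: Set[int] = set()
    for idx in sorted(evidence):
        if candidates and rng.random() < rate:
            new_idx = rng.choice(candidates)
            candidates.remove(new_idx)  # never pick the same replacement twice
            perturbed.add(new_idx)
        else:
            perturbed.add(idx)
    return perturbed


rng = random.Random(0)  # fixed seed so a run is repeatable
evidence = {2, 7, 8}
for rate in (0.05, 0.10, 0.15):
    noisy = perturb_mapping(evidence, n_tokens=20, rate=rate, rng=rng)
    # In a real analysis, comprehensiveness and sufficiency would be recomputed
    # on `noisy` here and compared against the unperturbed scores.
    print(rate, sorted(noisy))
```

If the trade-off survives at every rate, as the rebuttal claims, mapping noise of that magnitude is unlikely to be its cause. The sketch only shows the perturbation step, not the significance test.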

Circularity Check

0 steps flagged

Empirical evaluation is self-contained with no circular derivation

The paper reports experimental results comparing English-pivot and native-language extractive explanations across tasks, languages, and models, measuring span agreement with human rationales alongside comprehensiveness and sufficiency. No equations, fitted parameters renamed as predictions, or load-bearing self-citations appear in the provided text; the central trade-off claim rests on direct ablation-based measurements rather than any reduction to prior inputs by construction. The work is therefore self-contained against external benchmarks and receives a non-finding.

Axiom & Free-Parameter Ledger

0 free parameters · 2 axioms · 0 invented entities

Empirical evaluation paper; relies on standard explanation evaluation metrics and human rationales as reference without introducing new free parameters or invented entities.

axioms (2)
  • domain assumption: Comprehensiveness and sufficiency metrics validly measure causal grounding of explanations in model predictions.
    Invoked when interpreting degradation in faithfulness scores as evidence of less grounded evidence.
  • domain assumption: Human rationales provide a reliable benchmark for plausibility across languages.
    Used when claiming higher span agreement indicates better plausibility.
