Lost in Interpretation: The Plausibility-Faithfulness Trade-off in Cross-Lingual Explanations
Pith reviewed 2026-05-20 06:18 UTC · model grok-4.3
The pith
English-pivot explanations for non-English inputs raise span agreement with human rationales but weaken causal ties to the model's actual predictions.
A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.
Core claim
Extractive explanations generated via an English pivot achieve higher span agreement with human rationales while their selected evidence becomes less causally grounded in the model's prediction, with comprehensiveness falling by up to 5.7 times relative to native-language conditions even though task accuracy remains stable.
What carries the argument
The plausibility-faithfulness trade-off, where plausibility is measured by token-span overlap with human rationales and faithfulness is measured by comprehensiveness and sufficiency of the extracted evidence.
If this is right
- Audits of multilingual models should generate explanations in the input language rather than defaulting to English.
- Evaluation of explanations should combine lexical overlap with multiple faithfulness metrics instead of relying on agreement alone.
- English rationales are more accurately treated as communication summaries than as faithful records of the model's decision process.
- Pragmatic and social cues in classification tasks are more likely to be lost when explanations are produced in English.
Where Pith is reading between the lines
- Developers may need new methods that improve fluency of native-language explanations without reintroducing the grounding loss seen in English pivots.
- The observed trade-off could shape how explanation quality is assessed for low-resource languages where human rationales are scarce.
- Reliance on English audits might systematically overestimate the reliability of model behavior on non-English inputs.
Load-bearing premise
Comprehensiveness and sufficiency metrics correctly measure causal faithfulness and human rationales serve as a stable reference for plausibility across languages and tasks.
What would settle it
An experiment that finds English-pivot explanations simultaneously improve or maintain both span agreement and comprehensiveness/sufficiency scores relative to native-language explanations.
Figures
read the original abstract
LLMs deployed multilingually are often audited via English explanations for non-English inputs. We evaluate extractive explanations ''where the model identifies input token spans as evidence alongside a generated rationale'' and uncover a systematic trade-off: English-pivot explanations can achieve higher span agreement with human rationales while their evidence becomes less causally grounded in the model's prediction, as measured by both comprehensiveness and sufficiency. Across 3 tasks, 5~languages, and 2~multilingual LLM families, we find that English explanations frequently produce fluent but loosely anchored rationales, with comprehensiveness degrading by up to 5.7x relative to native-language conditions - even as task accuracy remains stable across settings. For socially nuanced classification, English pivots also fail to preserve pragmatic cues, reducing both faithfulness and span agreement. We recommend auditing explanations in the input language, reporting multi-faceted faithfulness metrics beyond lexical overlap, and treating English rationales as communication summaries rather than faithful decision traces.
Editorial analysis
A structured set of objections, weighed in public.
Referee Report
Summary. The manuscript evaluates extractive explanations for multilingual LLMs, focusing on the use of English-pivot explanations for non-English inputs. It reports a trade-off where English explanations show higher span agreement with human rationales but lower faithfulness, measured by comprehensiveness and sufficiency, with degradation up to 5.7x. Experiments span 3 tasks, 5 languages, and 2 LLM families, concluding that native-language explanations should be preferred and that English rationales may serve more as communication summaries than faithful traces.
Significance. If the findings are robust, this paper makes a valuable contribution to the field of explainable AI in multilingual settings by highlighting potential pitfalls in cross-lingual explanation generation. The empirical scope across multiple languages and models provides evidence that current practices may lead to explanations that are plausible to humans but not causally faithful to the model. This could influence how practitioners audit multilingual models and encourages the development of better cross-lingual explanation methods.
major comments (1)
- [Section 4.2 (Faithfulness Metrics)] The evaluation of comprehensiveness and sufficiency for English-pivot explanations depends on mapping English-generated spans back to the original non-English token sequence for ablation. The paper does not provide details on the alignment method used, its accuracy, or any sensitivity analysis for mapping errors. Such errors could systematically remove incorrect tokens, artificially lowering the faithfulness scores for the English-pivot condition while not affecting native explanations. This is a load-bearing issue for the central claim of a plausibility-faithfulness trade-off.
minor comments (1)
- [Abstract] The quantitative claim of 'up to 5.7x' degradation lacks specification of the exact condition (task, language, model) where this maximum occurs, which would improve interpretability of the results.
Simulated Author's Rebuttal
We thank the referee for their constructive feedback, which helps clarify an important methodological aspect of our work. We address the major comment below and have revised the manuscript to incorporate additional details and analyses.
read point-by-point responses
-
Referee: [Section 4.2 (Faithfulness Metrics)] The evaluation of comprehensiveness and sufficiency for English-pivot explanations depends on mapping English-generated spans back to the original non-English token sequence for ablation. The paper does not provide details on the alignment method used, its accuracy, or any sensitivity analysis for mapping errors. Such errors could systematically remove incorrect tokens, artificially lowering the faithfulness scores for the English-pivot condition while not affecting native explanations. This is a load-bearing issue for the central claim of a plausibility-faithfulness trade-off.
Authors: We agree that explicit documentation of the alignment procedure is essential for reproducibility and to rule out systematic bias in the faithfulness metrics. In our experiments, English-pivot spans were mapped back to the original non-English input tokens using a combination of subword tokenization alignment via SentencePiece and cross-lingual word alignment with the fast-align toolkit, followed by a heuristic for multi-token spans. We have now expanded Section 4.2 with a dedicated paragraph describing this procedure in full, including pseudocode. Additionally, we report alignment accuracy on a manually annotated sample of 200 examples (average F1 of 0.87 across languages) and include a sensitivity analysis: we introduce controlled random mapping perturbations at rates of 5%, 10%, and 15% and recompute comprehensiveness and sufficiency. The plausibility-faithfulness trade-off remains statistically significant (p < 0.01) under all perturbation levels, indicating that alignment noise does not drive the observed degradation. The revised code and alignment scripts have been added to the public repository. revision: yes
Circularity Check
Empirical evaluation is self-contained with no circular derivation
full rationale
The paper reports experimental results comparing English-pivot and native-language extractive explanations across tasks, languages, and models, measuring span agreement with human rationales alongside comprehensiveness and sufficiency. No equations, fitted parameters renamed as predictions, or load-bearing self-citations appear in the provided text; the central trade-off claim rests on direct ablation-based measurements rather than any reduction to prior inputs by construction. The work is therefore self-contained against external benchmarks and receives a non-finding.
Axiom & Free-Parameter Ledger
axioms (2)
- domain assumption Comprehensiveness and sufficiency metrics validly measure causal grounding of explanations in model predictions.
- domain assumption Human rationales provide a reliable benchmark for plausibility across languages.
Reference graph
Works this paper leans on
-
[1]
Use real-time translation of conversations for service representatives and customers. States the feature is intended to help customer service managers or supervisors enhance team performance. Amazon Web Services. 2024. Amazon translate: Ma- chine translation service. Somnath Banerjee, Pratyush Chatterjee, Shanu Kumar, Sayan Layek, Parag Agrawal, Rima Hazr...
-
[2]
Unsupervised Cross-lingual Representation Learning at Scale
e-snli: Natural language inference with natural language explanations. InNeurIPS. Alexis Conneau, Kartikay Khandelwal, Naman Goyal, Vishrav Chaudhary, Guillaume Wenzek, Francisco Guzmán, Edouard Grave, Myle Ott, Luke Zettle- moyer, and Veselin Stoyanov. 2020. Unsuper- vised cross-lingual representation learning at scale. Preprint, arXiv:1911.02116. Marta ...
work page internal anchor Pith review Pith/arXiv arXiv 2020
-
[3]
Measuring association between labels and free-text rationales. InProceedings of the 2021 Con- ference on Empirical Methods in Natural Language Processing, pages 10266–10284, Online and Punta Cana, Dominican Republic. Association for Compu- tational Linguistics. An Yang, Baosong Yang, Beichen Zhang, Binyuan Hui, Bo Zheng, Bowen Yu, Chengyuan Li, Dayi- heng...
work page internal anchor Pith review Pith/arXiv arXiv 2021
-
[6]
Important: Evidence must be exact substrings of the input (do not paraphrase)
Write a brief explanation in English. Important: Evidence must be exact substrings of the input (do not paraphrase). Input: <INPUT> Condition B: Lnative → Lnative (native input, na- tive explanation). You are given a task input in <LANG>
-
[8]
Copy 1–3 short evidence spans verbatim from the input text
-
[9]
Important: Evidence must be exact substrings of the input (do not paraphrase)
Write a brief explanation in <LANG>. Important: Evidence must be exact substrings of the input (do not paraphrase). Input: <INPUT> Condition C: L native → EN (native input, En- glish explanation; evidence stays native). You are given a task input in <LANG>
-
[12]
Important: Evidence must be exact substrings of the input (do not translate Evidence)
Write a brief explanation in English. Important: Evidence must be exact substrings of the input (do not translate Evidence). Input: <INPUT> B.2 Mini qualitative example Task: e-SNLI (NLI). Labels: {entailment, neu- tral, contradiction}. Premise (HI; romanized for pdfLATEX compati- bility): barish ho rahi thi isliye usne chhata liya. Hypothesis (HI; romani...
-
[13]
The premise states rain and taking an umbrella, which supports the hypothesis
e-SNLI: Human rationales are free-form En- glish sentences (e.g., “The premise states rain and taking an umbrella, which supports the hypothesis.”). These are not substrings of the input. We extractE h(x)by: (a) Tokenizing both the input and the ratio- nale into word-level tokens (for English) or character-level tokens (for Chinese, Hindi, Arabic, Bengali...
-
[14]
FEVER: Human rationales are gold evidence sentences drawn from Wikipedia. Since these sentences may not appear verbatim in the claim, we perform the same substring match- ing procedure as fore-SNLI, operating over the concatenation of the claim and the pro- vided context
-
[15]
These directly define Eh(x) with no alignment needed
HateXplain: Human rationales are provided as annotated token-level highlight spans over the input text. These directly define Eh(x) with no alignment needed. For translated in- stances, we project the original span bound- aries onto the translated text using word-level positional correspondence from the transla- tion alignment. Matching details.We enforce...
-
[16]
Unicode NFC normalization (to handle equiv- alent representations of composed characters, particularly important for Hindi and Bengali)
-
[17]
Whitespace collapsing (multiple spaces, tabs, and newlines reduced to single spaces)
-
[18]
Case-insensitive matching for Latin-script lan- guages (English)
-
[19]
It was raining, so she took an umbrella
No stemming or lemmatization is applied— matching is surface-level by design. We set a minimum match length of 2 tokens to avoid spurious single-token overlaps (e.g., match- ing common stop words or punctuation marks). Worked examples.We provide one alignment example per language from thee-SNLIdataset. In each case, the input consists of the concatenated ...
-
[20]
Morphological mismatch: Inflected forms in the rationale may differ from the input surface form (e.g., Arabic definite article prefixing, Hindi verb conjugation), reducing matched coverage. As seen in Example 4, Arabic“al- matar”fails to match input“tumtir”despite referring to the same concept
-
[21]
lene”(to take) fails to match“liya
Paraphrase: When the human rationale uses a synonym or rephrasing rather than the exact input term, no match is found. As seen in Example 2, Hindi“lene”(to take) fails to match“liya”(took). Both failure modesunder-countgenuine overlap, meaning our span agreement scores are conserva- tive lower bounds. This bias works against our hy- pothesis: if the true ...
-
[22]
Avoid English pivots for auditing:In high- stakes settings (e.g., legal or medical AI), sys- tem faithfulness should always be audited in the native language of the input. English ex- planations should be treated as summaries for convenience rather than faithful traces of rea- soning
-
[23]
Standardize cross-lingual faithfulness met- rics: Evaluation benchmarks should move be- yond simple span agreement and incorporate faithfulness metrics, such as comprehensive- ness and sufficiency, specifically designed for mismatched language conditions
-
[24]
Prioritize cultural context over fluency: For social tasks like hate speech detection, devel- opers must prioritize native-language expla- nation capabilities, as English pivots fail to capture the pragmatic nuances necessary for both plausibility and trust. D Prompt paraphrases and sensitivity analysis To verify that our findings are robust to surface- l...
-
[26]
Copy 1–3 short evidence spans verbatim from the input text (keep them in <LANG>)
-
[27]
Important: Evidence must be exact substrings of the input (do not translate Evidence)
Write a brief explanation in English. Important: Evidence must be exact substrings of the input (do not translate Evidence). Input: <INPUT> Variant 2. Below is a task input written in <LANG>
-
[28]
Determine the appropriate category from: {<LABELS>}
-
[29]
Extract 1–3 short text segments directly from the input as supporting evidence (keep them in the original language)
-
[30]
Important: Extracted evidence must be copied exactly from the input without translation
Provide a short justification in English. Important: Extracted evidence must be copied exactly from the input without translation. Input: <INPUT> Variant 3. You will analyze a task input in <LANG>
-
[31]
Choose the best label from: {<LABELS>}
-
[32]
Identify 1–3 key phrases from the input text and copy them exactly (retain the original <LANG>)
-
[33]
Important: Key phrases must be exact substrings of the input
Briefly explain your reasoning in English. Important: Key phrases must be exact substrings of the input. Do not paraphrase or translate them. Input: <INPUT> Variant 4. The following is a task input in <LANG>
-
[34]
Select the correct label from: {<LABELS>}
-
[35]
Highlight 1–3 relevant spans from the input by copying them exactly as they appear (in <LANG>)
-
[36]
Important: Highlighted spans must be exact copies from the input, not translations
Write a concise explanation in English. Important: Highlighted spans must be exact copies from the input, not translations. Input: <INPUT> Variant 5. Given a task input in <LANG>, perform the following:
-
[37]
Assign one label from: {<LABELS>}
-
[38]
Quote 1–3 short supporting passages from the input verbatim (keep them in <LANG>)
-
[39]
Important: Quoted passages must be exact substrings of the input without any translation
Justify your answer briefly in English. Important: Quoted passages must be exact substrings of the input without any translation. Input: <INPUT> D.2 Sensitivity results Tables 8 and 9 report mean ± standard deviation across the five prompt variants on e-SNLI. The trade-off pattern—lower comprehensiveness and higher sufficiency under Lnative →EN compared t...
-
[40]
Predict the correct label from: {<LABELS>}
-
[41]
Write a brief explanation in English
-
[42]
Important: Evidence must be exact substrings of the input (do not translate Evidence)
Copy 1–3 short evidence spans verbatim from the input text (keep them in <LANG>). Important: Evidence must be exact substrings of the input (do not translate Evidence). Input: <INPUT> The corresponding reversed output format is: Label: <one label from the label set> Explanation: <1–3 sentences in the required explanation language> Evidence: <1–3 spans cop...
work page 2020
-
[43]
The cross-lingual patterns reported in our main tables are not artifacts of the metric choice
Span agreement is a valid proxy.Despite its known limitations with morphological varia- tion and paraphrase, span agreement captures the same directional trends as the semantically richer BERTScore metric. The cross-lingual patterns reported in our main tables are not artifacts of the metric choice. Span Agr.(lexical)BERTScore F1(semantic) SettingsQwen Ll...
-
[44]
Morphological bias is conservative, not mis- leading.The gap between BERTScore and span agreement is largest for Arabic (mean gap:+0.21) and Bengali (mean gap:+0.19), consistent with these languages’ richer mor- phology reducing exact-match recall. How- ever, this biasunder-countsoverlap uniformly across conditions, preserving the relative or- dering
-
[45]
The HateXplain pattern is genuine.The failure of English pivots to improve semantic similarity on hate speech (confirmed by both metrics) rules out the hypothesis that surface- level tokenization effects mask underlying se- mantic improvement. The loss of social and pragmatic cues under English pivoting is a substantive semantic phenomenon. We note one li...
discussion (0)
Sign in with ORCID, Apple, or X to comment. Anyone can read and Pith papers without signing in.