Lost in Interpretation: The Plausibility-Faithfulness Trade-off in Cross-Lingual Explanations

Animesh Mukherjee; Pranav Jha; Rima Hazra; Somnath Banerjee

arxiv: 2605.19274 · v1 · pith:4ZAHWP53new · submitted 2026-05-19 · 💻 cs.CL


Pith reviewed 2026-05-20 06:18 UTC · model grok-4.3

classification 💻 cs.CL

keywords cross-lingual explanations · plausibility-faithfulness trade-off · multilingual LLMs · extractive explanations · comprehensiveness · sufficiency · human rationales · English pivot


The pith

English-pivot explanations for non-English inputs can raise span agreement with human rationales while weakening the causal ties between the selected evidence and the model's actual predictions.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper examines a common practice: auditing multilingual LLMs through English explanations even when the input is in another language. It identifies a consistent trade-off. English-pivot explanations can match human-chosen evidence spans more closely, yet their selected evidence is markedly less causally grounded, as quantified by comprehensiveness and sufficiency. Comprehensiveness degrades by up to 5.7x relative to native-language conditions, while task accuracy remains stable across three tasks, five languages, and two model families. For socially nuanced classification, English pivots also drop pragmatic cues that native-language explanations retain, which reduces both faithfulness and span agreement.

Core claim

Extractive explanations generated via an English pivot can achieve higher span agreement with human rationales while their selected evidence becomes less causally grounded in the model's prediction: comprehensiveness degrades by up to 5.7x relative to native-language conditions, even though task accuracy remains stable.

What carries the argument

The plausibility-faithfulness trade-off, where plausibility is measured by token-span overlap with human rationales and faithfulness is measured by comprehensiveness and sufficiency of the extracted evidence.
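The two faithfulness metrics can be illustrated with a minimal sketch. It assumes the commonly used ablation-style definitions (comprehensiveness: drop in label probability when the evidence is removed; sufficiency: drop when only the evidence is kept); the paper's exact formulation is not reproduced on this page. The `ProbFn` interface and `toy_prob` model are hypothetical stand-ins, not the authors' code.

```python
from typing import Callable, Sequence, Set

# Hypothetical interface: probability the model assigns to `label`
# given a token sequence. A real model wrapper would replace this.
ProbFn = Callable[[Sequence[str], str], float]


def comprehensiveness(prob: ProbFn, tokens: Sequence[str],
                      evidence: Set[int], label: str) -> float:
    """p(label | full input) - p(label | input with evidence removed).

    Higher values mean the evidence mattered more to the prediction.
    """
    without = [t for i, t in enumerate(tokens) if i not in evidence]
    return prob(tokens, label) - prob(without, label)


def sufficiency(prob: ProbFn, tokens: Sequence[str],
                evidence: Set[int], label: str) -> float:
    """p(label | full input) - p(label | evidence only).

    Lower values mean the evidence alone nearly recovers the prediction.
    """
    only = [t for i, t in enumerate(tokens) if i in evidence]
    return prob(tokens, label) - prob(only, label)


def toy_prob(tokens: Sequence[str], label: str) -> float:
    # Toy stand-in: confidence depends only on the word "raining".
    return 0.9 if "raining" in tokens else 0.3


tokens = "It was raining so she took an umbrella".split()
grounded = {2}   # index of "raining": evidence the toy model relies on
loose = {6, 7}   # "an umbrella": plausible-looking but unused by the toy model

for name, ev in (("grounded", grounded), ("loose", loose)):
    c = comprehensiveness(toy_prob, tokens, ev, "entailment")
    s = sufficiency(toy_prob, tokens, ev, "entailment")
    print(f"{name}: comprehensiveness={c:.2f} sufficiency={s:.2f}")
```

In this toy setting the "loose" evidence looks reasonable to a reader but scores zero comprehensiveness and poor sufficiency, which is the kind of gap between plausibility and faithfulness the paper measures.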

If this is right

Audits of multilingual models should generate explanations in the input language rather than defaulting to English.
Evaluation of explanations should combine lexical overlap with multiple faithfulness metrics instead of relying on agreement alone.
English rationales are more accurately treated as communication summaries than as faithful records of the model's decision process.
Pragmatic and social cues in classification tasks are more likely to be lost when explanations are produced in English.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

Developers may need new methods that improve fluency of native-language explanations without reintroducing the grounding loss seen in English pivots.
The observed trade-off could shape how explanation quality is assessed for low-resource languages where human rationales are scarce.
Reliance on English audits might systematically overestimate the reliability of model behavior on non-English inputs.

Load-bearing premise

Comprehensiveness and sufficiency metrics correctly measure causal faithfulness, and human rationales serve as a stable reference for plausibility across languages and tasks.

What would settle it

An experiment that finds English-pivot explanations simultaneously improve or maintain both span agreement and comprehensiveness/sufficiency scores relative to native-language explanations.
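The falsification condition above can be written as a small check. This is a sketch under stated assumptions: the `Scores` container and `pivot_no_worse` helper are hypothetical names, lower sufficiency is assumed to mean more faithful, and the numbers are illustrative placeholders rather than results from the paper.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Scores:
    span_agreement: float     # higher = more plausible
    comprehensiveness: float  # higher = more faithful
    sufficiency: float        # lower = more faithful (assumed convention)


def pivot_no_worse(native: Scores, pivot: Scores, tol: float = 0.0) -> bool:
    """True if the English pivot matches or beats native on all three metrics.

    A True result for a given setting would count against the trade-off claim.
    """
    return (
        pivot.span_agreement >= native.span_agreement - tol
        and pivot.comprehensiveness >= native.comprehensiveness - tol
        and pivot.sufficiency <= native.sufficiency + tol
    )


# Illustrative placeholder values only, not taken from the paper.
native = Scores(span_agreement=0.40, comprehensiveness=0.30, sufficiency=0.10)
pivot = Scores(span_agreement=0.50, comprehensiveness=0.10, sufficiency=0.25)
print(pivot_no_worse(native, pivot))  # False: plausibility up, faithfulness down
```

The `tol` parameter is there so that small differences within measurement noise are not counted as a degradation; how large it should be is a judgment call the page does not address.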

Figures

Figures reproduced from arXiv: 2605.19274 by Animesh Mukherjee, Pranav Jha, Rima Hazra, Somnath Banerjee.

**Figure 2.** Example instance (label: entailment). EN (source) premise: “It was raining, so she took an umbrella.” Hypothesis: “She used an umbrella because it was raining.” Human explanation (rationale): “The premise states rain and taking an …” The accompanying text notes that semantically matched test sets are constructed by translating the original English test instances into each target language using NLLB-200 (3.3B distilled) (Costa-jussà et al., 2022).

Original abstract

LLMs deployed multilingually are often audited via English explanations for non-English inputs. We evaluate extractive explanations (where the model identifies input token spans as evidence alongside a generated rationale) and uncover a systematic trade-off: English-pivot explanations can achieve higher span agreement with human rationales while their evidence becomes less causally grounded in the model's prediction, as measured by both comprehensiveness and sufficiency. Across 3 tasks, 5 languages, and 2 multilingual LLM families, we find that English explanations frequently produce fluent but loosely anchored rationales, with comprehensiveness degrading by up to 5.7x relative to native-language conditions, even as task accuracy remains stable across settings. For socially nuanced classification, English pivots also fail to preserve pragmatic cues, reducing both faithfulness and span agreement. We recommend auditing explanations in the input language, reporting multi-faceted faithfulness metrics beyond lexical overlap, and treating English rationales as communication summaries rather than faithful decision traces.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Desk Editor's Note · private letter to a colleague

English-pivot explanations match human spans better but score lower on faithfulness metrics, with a plausible mapping artifact that needs checking before the trade-off claim sticks.


The main takeaway here is that explanations generated in English for non-English inputs can look more plausible to people while being less tied to the model's actual decision process. The paper measures this with span agreement against human rationales on one side and comprehensiveness plus sufficiency on the other, across three tasks and five languages in two model families. Task accuracy stays roughly the same, which rules out the simple story that English just makes everything worse.

Referee Report

1 major / 1 minor

Summary. The manuscript evaluates extractive explanations for multilingual LLMs, focusing on the use of English-pivot explanations for non-English inputs. It reports a trade-off where English explanations show higher span agreement with human rationales but lower faithfulness, measured by comprehensiveness and sufficiency, with degradation up to 5.7x. Experiments span 3 tasks, 5 languages, and 2 LLM families, concluding that native-language explanations should be preferred and that English rationales may serve more as communication summaries than faithful traces.

Significance. If the findings are robust, this paper makes a valuable contribution to the field of explainable AI in multilingual settings by highlighting potential pitfalls in cross-lingual explanation generation. The empirical scope across multiple languages and models provides evidence that current practices may lead to explanations that are plausible to humans but not causally faithful to the model. This could influence how practitioners audit multilingual models and encourages the development of better cross-lingual explanation methods.

major comments (1)

[Section 4.2 (Faithfulness Metrics)] The evaluation of comprehensiveness and sufficiency for English-pivot explanations depends on mapping English-generated spans back to the original non-English token sequence for ablation. The paper does not provide details on the alignment method used, its accuracy, or any sensitivity analysis for mapping errors. Such errors could systematically remove incorrect tokens, artificially lowering the faithfulness scores for the English-pivot condition while not affecting native explanations. This is a load-bearing issue for the central claim of a plausibility-faithfulness trade-off.

minor comments (1)

[Abstract] The quantitative claim of 'up to 5.7x' degradation lacks specification of the exact condition (task, language, model) where this maximum occurs, which would improve interpretability of the results.

Simulated Author's Rebuttal

1 response · 0 unresolved

We thank the referee for their constructive feedback, which helps clarify an important methodological aspect of our work. We address the major comment below and have revised the manuscript to incorporate additional details and analyses.


Referee: [Section 4.2 (Faithfulness Metrics)] The evaluation of comprehensiveness and sufficiency for English-pivot explanations depends on mapping English-generated spans back to the original non-English token sequence for ablation. The paper does not provide details on the alignment method used, its accuracy, or any sensitivity analysis for mapping errors. Such errors could systematically remove incorrect tokens, artificially lowering the faithfulness scores for the English-pivot condition while not affecting native explanations. This is a load-bearing issue for the central claim of a plausibility-faithfulness trade-off.

Authors: We agree that explicit documentation of the alignment procedure is essential for reproducibility and to rule out systematic bias in the faithfulness metrics. In our experiments, English-pivot spans were mapped back to the original non-English input tokens using a combination of subword tokenization alignment via SentencePiece and cross-lingual word alignment with the fast-align toolkit, followed by a heuristic for multi-token spans. We have now expanded Section 4.2 with a dedicated paragraph describing this procedure in full, including pseudocode. Additionally, we report alignment accuracy on a manually annotated sample of 200 examples (average F1 of 0.87 across languages) and include a sensitivity analysis: we introduce controlled random mapping perturbations at rates of 5%, 10%, and 15% and recompute comprehensiveness and sufficiency. The plausibility-faithfulness trade-off remains statistically significant (p < 0.01) under all perturbation levels, indicating that alignment noise does not drive the observed degradation. The revised code and alignment scripts have been added to the public repository.

revision: yes
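The perturbation-style sensitivity analysis described in the simulated response can be sketched as follows. The response does not say how a "mapping perturbation" is defined, so this sketch assumes one plausible reading: each mapped evidence index is replaced by a uniformly random token index with probability `rate`. The helper names and the toy model are hypothetical, and the sketch only covers comprehensiveness.

```python
import random
from typing import Callable, Sequence, Set


def perturb_mapping(evidence: Set[int], n_tokens: int,
                    rate: float, rng: random.Random) -> Set[int]:
    """Swap each evidence index for a random token index with probability `rate`."""
    noisy = set()
    for idx in sorted(evidence):  # sorted for reproducible RNG consumption
        noisy.add(rng.randrange(n_tokens) if rng.random() < rate else idx)
    return noisy


def comprehensiveness(prob: Callable[[Sequence[str]], float],
                      tokens: Sequence[str], evidence: Set[int]) -> float:
    """Drop in predicted-label probability when the evidence tokens are removed."""
    kept = [t for i, t in enumerate(tokens) if i not in evidence]
    return prob(tokens) - prob(kept)


def mean_comp_under_noise(prob, tokens, evidence, rate,
                          trials: int = 200, seed: int = 0) -> float:
    """Average comprehensiveness over repeated noisy mappings at a given rate."""
    rng = random.Random(seed)
    total = 0.0
    for _ in range(trials):
        noisy = perturb_mapping(evidence, len(tokens), rate, rng)
        total += comprehensiveness(prob, tokens, noisy)
    return total / trials


def toy_prob(tokens: Sequence[str]) -> float:
    # Toy stand-in model: confidence depends only on the word "raining".
    return 0.9 if "raining" in tokens else 0.3


tokens = "It was raining so she took an umbrella".split()
evidence = {2}  # index of "raining"
for rate in (0.0, 0.05, 0.10, 0.15):
    score = mean_comp_under_noise(toy_prob, tokens, evidence, rate)
    print(f"rate={rate:.2f} mean comprehensiveness={score:.3f}")
```

Comparing how fast the score decays with `rate` against the size of the reported native-versus-pivot gap is one way to judge whether alignment noise alone could explain the gap.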

Circularity Check

0 steps flagged

Empirical evaluation is self-contained with no circular derivation


The paper reports experimental results comparing English-pivot and native-language extractive explanations across tasks, languages, and models, measuring span agreement with human rationales alongside comprehensiveness and sufficiency. No equations, fitted parameters renamed as predictions, or load-bearing self-citations appear in the provided text; the central trade-off claim rests on direct ablation-based measurements rather than any reduction to prior inputs by construction. The work is therefore self-contained against external benchmarks and receives a non-finding.

Axiom & Free-Parameter Ledger

0 free parameters · 2 axioms · 0 invented entities

Empirical evaluation paper; relies on standard explanation evaluation metrics and human rationales as reference without introducing new free parameters or invented entities.

axioms (2)

domain assumption: Comprehensiveness and sufficiency metrics validly measure causal grounding of explanations in model predictions.
Invoked when interpreting lower faithfulness scores as a sign that the selected evidence is less causally grounded.
domain assumption: Human rationales provide a reliable benchmark for plausibility across languages.
Used when claiming higher span agreement indicates better plausibility.
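The second assumption concerns span agreement as a plausibility score. As a minimal sketch, span agreement can be computed as token-level F1 between the model's evidence positions and the human rationale positions. This overlap formula is an assumption chosen for illustration; the paper's exact span-agreement definition may differ, and `span_f1` is a hypothetical helper name.

```python
from typing import Set


def span_f1(predicted: Set[int], human: Set[int]) -> float:
    """Token-level F1 between predicted evidence and human rationale positions."""
    if not predicted and not human:
        return 1.0  # both empty: treated as perfect agreement
    overlap = len(predicted & human)
    if overlap == 0:
        return 0.0
    precision = overlap / len(predicted)
    recall = overlap / len(human)
    return 2 * precision * recall / (precision + recall)


tokens = "It was raining so she took an umbrella".split()
human = {2, 7}      # "raining", "umbrella"
model = {5, 6, 7}   # "took", "an", "umbrella"

print(round(span_f1(model, human), 3))
```

A score like this only compares positions against a human reference. It says nothing about whether the model actually relied on those tokens, which is why the page treats plausibility and faithfulness as separate axes.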



