Mitigating Cross-Lingual Cultural Inconsistencies in LLMs via Consensus-Driven Preference Optimisation

Anna Korhonen; Isabelle Augenstein; Lucas Resck

arxiv: 2605.12515 · v2 · pith:4R7PKKQJnew · submitted 2026-04-02 · 💻 cs.CL

Mitigating Cross-Lingual Cultural Inconsistencies in LLMs via Consensus-Driven Preference Optimisation

Lucas Resck , Isabelle Augenstein , Anna Korhonen This is my paper

Pith reviewed 2026-05-14 21:53 UTC · model grok-4.3

classification 💻 cs.CL

keywords cross-lingual inconsistencymultilingual LLMspreference optimizationcultural consistencypersona adherencealignment methodSingleton Fleiss kappa

0 comments

The pith

Consensus-driven preference optimization raises cross-language cultural consistency in multilingual LLMs by up to 0.10 points on a new metric.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

Multilingual large language models often produce different cultural references for the same fixed persona when the prompt language changes, such as naming Shakespeare in English but Cervantes in Spanish. The paper introduces Singleton Fleiss's κ_S, a metric designed to quantify this cross-lingual cultural inconsistency while remaining robust to hallucinations. It then proposes C-3PO, a framework that creates preference data from agreements across languages and optimizes the model toward those consistent outputs. The method delivers measurable gains over unaligned baselines and over prompting or steering techniques, with larger effects in lower-resource languages. Interpretability checks show the models begin steering representations toward the prompt language's cultural stereotypes early in the forward pass.

Core claim

The central claim is that training multilingual LLMs with Cross-lingual Cultural Consistent Preference Optimisation (C-3PO) increases Singleton Fleiss's κ_S by as much as 0.10 points by aligning outputs to cross-language consensus, thereby reducing the tendency for prompt language to overwrite a fixed persona's cultural references.

What carries the argument

C-3PO, a consensus-driven preference optimisation framework that identifies shared responses across languages and uses them to create training pairs that reward cultural consistency independent of prompt language.

If this is right

The same persona produces more stable cultural references when the user switches between languages.
Lower-resource languages such as Indonesian and Persian show the largest consistency gains.
Early-layer representations become less biased toward the prompt language's stereotypical culture.
The approach outperforms both prompt engineering and representation-steering baselines on the κ_S metric.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

The same consensus mechanism could be applied to reduce other prompt-language effects, such as differences in factual precision or stylistic tone.
Intervening at the intermediate layers identified in the analysis might achieve similar consistency with less full-model retraining.
One could test whether the gains persist when the consensus set includes languages that have genuinely different cultural norms on the query topic.
The metric κ_S could serve as a general probe for other forms of output instability tied to input language.

Load-bearing premise

That the responses showing agreement across languages truly represent the right consistent behavior rather than an averaged compromise that erases legitimate cultural distinctions.

What would settle it

Apply C-3PO to a new set of ambiguous literature queries with fixed personas, then measure whether the rate of matching cultural references (such as author names) between English and Spanish prompts rises above the rate seen in the original unaligned model.

Figures

Figures reproduced from arXiv: 2605.12515 by Anna Korhonen, Isabelle Augenstein, Lucas Resck.

**Figure 2.** Figure 2: Flowchart detailing the proposed dataset con [PITH_FULL_IMAGE:figures/full_fig_p004_2.png] view at source ↗

**Figure 3.** Figure 3: Overview of C-3PO. (1) The base model generates answers for a culturally sensitive question across multiple languages. (2) A cross-lingual consensus is extracted to construct preference pairs: for consensual languages, a random incorrect option serves as the rejected response; for divergent languages, the model’s actual output is rejected. (3) The model is fine-tuned using DPO with multilingual parallel ba… view at source ↗

**Figure 5.** Figure 5: κS consistency analysis across layers and language groups for Llama-3.1-8B. IE = Indo-European. 6 Interpretability Analysis To understand the mechanisms driving Crosslingual Cultural Inconsistency, we conduct a layerwise interpretability analysis on Llama-3.1-8B using our test set. We decode layer-wise predictions by adapting the BLEnD prompt to elicit direct multiple-choice outputs (e.g., “A”, “B”, “C”… view at source ↗

**Figure 4.** Figure 4: κS variation (Qwen-2.5-3B) as languages are incrementally added by resource level. Shaded bands for Steering and Persona indicate ranges across personas. sourcefulness strictly degrades consistency across all methods, whereas the reverse order consistently improves it. Because the sets of evaluated languages are mathematically identical at the extremes, these opposing trajectories directly isolate resour… view at source ↗

**Figure 7.** Figure 7: Cultural personalisation effect across layers [PITH_FULL_IMAGE:figures/full_fig_p008_7.png] view at source ↗

**Figure 6.** Figure 6: Cultural personalisation across layers for [PITH_FULL_IMAGE:figures/full_fig_p008_6.png] view at source ↗

**Figure 9.** Figure 9: System prompt used to remove country men [PITH_FULL_IMAGE:figures/full_fig_p014_9.png] view at source ↗

**Figure 10.** Figure 10: System prompt used to translate answer options to the target language. E Hyperparameters Comprehensive hyperparameter configurations for all mitigation strategies and baselines evaluated in this work are detailed in [PITH_FULL_IMAGE:figures/full_fig_p014_10.png] view at source ↗

**Figure 11.** Figure 11: System prompt template used for the Persona [PITH_FULL_IMAGE:figures/full_fig_p015_11.png] view at source ↗

**Figure 12.** Figure 12: Format of the multiple-choice questions (MCQ) in the BLEnD benchmark without the JSON answer format. This format is used for the layer-wise early-decoding interpretability analysis in Section 6 to allow for direct extraction of the model’s choice at each layer without needing to parse a JSON structure. 15 [PITH_FULL_IMAGE:figures/full_fig_p015_12.png] view at source ↗

**Figure 13.** Figure 13: Singleton Fleiss’s κS consistency variation when adding languages in different orders for all models. Languages are ordered according to their resource level. Persona Vector Steering and Persona Prompting show ranges across the eight personas. 16 [PITH_FULL_IMAGE:figures/full_fig_p016_13.png] view at source ↗

read the original abstract

Despite their impressive capabilities, multilingual large language models (MLLMs) frequently exhibit inconsistent behaviour when the prompt's language changes. While such adaptation is generally desirable, it becomes a critical failure when a user's identity is explicitly defined. For instance, given a fixed British persona and an ambiguous everyday knowledge query about literature, the prompt's language frequently overwrites the system persona -- yielding Shakespeare in English but Cervantes in Spanish. To robustly quantify this Cross-lingual Cultural Inconsistency, we introduce Singleton Fleiss's $\kappa_S$, a metric mathematically resilient to hallucinations. For mitigation, we propose Cross-lingual Cultural Consistent Preference Optimisation (C-3PO), a consensus-driven alignment framework. C-3PO achieves up to a 0.13-point absolute increase in $\kappa_S$ over unaligned models, consistently outperforming strong prompting and representation steering baselines whilst preserving explicit user identities, cultural neutrality and intrinsic cultural knowledge. Empirical evaluations demonstrate this inconsistency disproportionately affects lower-resource languages like Indonesian and Persian. Finally, early decoding of intermediate layers reveals that MLLMs implicitly personalise outputs towards the prompt language's stereotypical culture as forward-pass representations stabilise.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Desk Editor's Note private letter to a colleague

C-3PO gives a measurable lift in cross-lingual consistency but the consensus target risks flattening real cultural differences instead of fixing language-induced drift.

read the letter

The paper's main contribution is a new metric, Singleton Fleiss's κ_S, that tries to isolate how much a model's output shifts when the prompt language changes for a fixed persona, plus a consensus-driven preference optimization method they call C-3PO. They report up to a 0.10 absolute gain in κ_S over unaligned baselines and some prompting or steering alternatives, with the effect showing up more in lower-resource languages. They also include a layer-wise check that suggests the model starts steering toward the prompt language's cultural stereotypes early in the forward pass. That combination of a tailored metric, an alignment procedure, and an interpretability angle is what stands out as new here. The work does a decent job flagging a practical deployment issue—models overwriting a British persona with Cervantes in Spanish—and the empirical pattern they describe matches what many people see in multilingual models. The interpretability piece adds some mechanistic grounding that is not just hand-waving. The soft spots are more substantial. The central claim that higher κ_S means reduced inconsistency rather than reduced diversity rests on the assumption that the chosen queries have a culture-independent ground truth and that the metric only penalizes language drift. The abstract gives no derivation or formula for κ_S, no details on the query set, and no check that the consensus they optimize toward preserves legitimate variation on topics like literature or history. If the evaluation items include cases where cultures reasonably differ, the 0.10 gain could simply reflect homogenization. Without the full experimental setup, data splits, or statistical tests, it is hard to judge whether the result is robust or just an artifact of how the consensus was constructed. This is the kind of paper that would interest people working on multilingual alignment and fairness evaluations. A reader who already cares about cross-lingual consistency would get value from the metric and the optimization framing, even if they end up disagreeing with the consensus approach. It is solid enough on the surface to deserve a serious referee rather than a desk reject, provided the full paper supplies the missing experimental details and addresses whether the target consensus erases valid differences.

Referee Report

3 major / 2 minor

Summary. The paper claims that multilingual LLMs exhibit cross-lingual cultural inconsistency, where prompt language overwrites fixed personas (e.g., literature queries yielding Shakespeare in English but Cervantes in Spanish). It introduces Singleton Fleiss's κ_S as a metric mathematically resilient to hallucinations to quantify this inconsistency, proposes C-3PO as a consensus-driven preference optimization framework for mitigation, reports up to a 0.10 absolute κ_S gain over unaligned models and baselines (with larger effects in lower-resource languages like Indonesian and Persian), and provides layer-wise interpretability analysis showing early-decoding personalization to prompt-language culture as representations stabilize.

Significance. If the results hold, the work would offer a meaningful advance in multilingual LLM alignment by supplying both a new inconsistency metric and a practical consensus-based optimization method. The emphasis on lower-resource languages and the interpretability component add value by addressing equity and mechanistic understanding. The 0.10 κ_S improvement, if robustly verified, could inform future preference optimization techniques beyond standard RLHF or prompting.

major comments (3)

[§4] §4 (evaluation protocol): the central claim that the 0.10 κ_S increase demonstrates mitigation of inconsistency rather than erasure of legitimate cultural variation requires explicit verification that the query set contains only items with culture-independent ground truth; without this check (or an ablation on divergent items such as historical framing), higher κ_S may simply reflect reduced output diversity.
[§3.1] Definition of κ_S (likely §3.1): the assertion that Singleton Fleiss's κ_S is 'mathematically resilient to hallucinations' and isolates language-induced drift is load-bearing for the metric's validity, yet the exact formula, derivation steps, and proof of resilience against confounding perspective differences are not supplied in sufficient detail to allow independent confirmation.
[Results section] Table 2 or results section: the outperformance over prompting and representation-steering baselines is reported as a 0.10 gain, but the manuscript supplies no statistical tests, confidence intervals, or data-split details; this prevents assessment of whether the improvement is reliable or merely an artifact of the chosen consensus data.

minor comments (2)

[§5.2] The layer-wise interpretability analysis would be strengthened by reporting exact layer indices where representation stabilization occurs and by including quantitative stability metrics rather than qualitative description.
[§3] Notation for κ_S should be defined with an explicit equation early in the text to avoid ambiguity when comparing across languages and models.

Simulated Author's Rebuttal

3 responses · 0 unresolved

We thank the referee for the thoughtful and constructive comments, which have helped us identify areas for clarification and strengthening. We address each major point below and will incorporate the suggested changes in the revised manuscript.

read point-by-point responses

Referee: [§4] §4 (evaluation protocol): the central claim that the 0.10 κ_S increase demonstrates mitigation of inconsistency rather than erasure of legitimate cultural variation requires explicit verification that the query set contains only items with culture-independent ground truth; without this check (or an ablation on divergent items such as historical framing), higher κ_S may simply reflect reduced output diversity.

Authors: We agree that explicitly distinguishing mitigation of inconsistency from reduced output diversity is essential. Our query set was constructed from factual, culture-independent items (e.g., everyday knowledge queries with a fixed persona) where ground truth does not legitimately vary by language. In the revision we will add an explicit verification subsection describing this selection process and include a new ablation on divergent items (such as historical framing questions) to demonstrate that the observed κ_S gains reflect reduced language-induced drift rather than loss of legitimate variation. revision: yes
Referee: [§3.1] Definition of κ_S (likely §3.1): the assertion that Singleton Fleiss's κ_S is 'mathematically resilient to hallucinations' and isolates language-induced drift is load-bearing for the metric's validity, yet the exact formula, derivation steps, and proof of resilience against confounding perspective differences are not supplied in sufficient detail to allow independent confirmation.

Authors: We accept that the current presentation lacks sufficient mathematical detail. The revised manuscript will include the precise formula for Singleton Fleiss's κ_S, the full derivation steps, and a formal argument (with supporting lemmas) demonstrating its resilience to hallucinations and its isolation of language-induced drift from other sources of variation such as perspective differences. revision: yes
Referee: [Results section] Table 2 or results section: the outperformance over prompting and representation-steering baselines is reported as a 0.10 gain, but the manuscript supplies no statistical tests, confidence intervals, or data-split details; this prevents assessment of whether the improvement is reliable or merely an artifact of the chosen consensus data.

Authors: We agree that statistical rigor is required. In the revision we will add paired statistical tests (e.g., Wilcoxon signed-rank), 95% confidence intervals for all reported κ_S gains, and explicit details on data splits, consensus data construction, and cross-validation procedures to allow readers to assess the reliability of the 0.10 improvement. revision: yes

Circularity Check

0 steps flagged

No circularity: empirical gains measured by newly introduced metric

full rationale

The paper introduces Singleton Fleiss's κ_S as a new metric for cross-lingual inconsistency and C-3PO as a consensus-driven preference optimization method. The central result (0.10-point κ_S increase) is reported as an empirical outcome from applying C-3PO to MLLMs and comparing against baselines. No equations, definitions, or steps in the provided text reduce the claimed prediction or improvement to the inputs by construction. No self-citations are invoked to establish uniqueness or load-bearing premises, and the metric is not defined circularly in terms of the optimization result. The derivation relies on external consensus data and evaluations, remaining self-contained.

Axiom & Free-Parameter Ledger

0 free parameters · 0 axioms · 0 invented entities

Abstract-only review provides no explicit free parameters, axioms, or invented entities; the metric and method are presented as novel constructs without further decomposition.

pith-pipeline@v0.9.0 · 5507 in / 1065 out tokens · 41569 ms · 2026-05-14T21:53:55.907146+00:00 · methodology

discussion (0)

Lean theorems connected to this paper

Citations machine-checked in the Pith Canon. Every link opens the source theorem in the public Lean library.

IndisputableMonolith/Foundation/RealityFromDistinction.lean reality_from_one_distinction unclear

?

unclear
Relation between the paper passage and the cited Recognition theorem.

We introduce Singleton Fleiss’s κ_S ... C-3PO, a consensus-driven alignment framework
IndisputableMonolith/Cost/FunctionalEquation.lean washburn_uniqueness_aczel unclear

?

unclear
Relation between the paper passage and the cited Recognition theorem.

layer-wise interpretability analysis ... forward-pass representations stabilise

What do these tags mean?

matches: The paper's claim is directly supported by a theorem in the formal canon.
supports: The theorem supports part of the paper's argument, but the paper may add assumptions or extra steps.
extends: The paper goes beyond the formal theorem; the theorem is a base layer rather than the whole result.
uses: The paper appears to rely on the theorem as machinery.
contradicts: The paper's claim conflicts with a theorem or certificate in the canon.
unclear: Pith found a possible connection, but the passage is too broad, indirect, or ambiguous to say the theorem truly supports the claim.