TLPO: Token-Level Policy Optimization for Mitigating Language Confusion in Large Language Models

Jimyeong Kim; Jinho Choo; JunSeung Lee; S. K. Hong; Yeeho Song; Yeong-Dae Kwon

arXiv: 2604.26553 · v1 · submitted 2026-04-29 · 💻 cs.CL · cs.AI · cs.LG

Pith reviewed 2026-05-07 10:43 UTC · model grok-4.3

keywords language confusion · token-level optimization · multilingual LLMs · policy optimization · fine-tuning · language consistency

The pith

Token-level policy optimization reduces language confusion in LLMs while preserving downstream task accuracy.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

Large language models often mix languages in their outputs despite strong multilingual training. Prior fixes adjust entire responses and tend to erode general capabilities. TLPO instead locates the specific tokens most likely to trigger the wrong language, tests alternative tokens there, and applies a targeted policy update only at those spots. This selective change improves language consistency across models and languages without measurable drops in task performance.

Core claim

TLPO identifies error-prone positions, explores alternative candidate tokens at those positions, and updates the policy using a tailored objective to suppress error-inducing outputs at a granular level, enabling effective mitigation of language confusion without compromising the model's general abilities.

What carries the argument

Token-Level Policy Optimization (TLPO), which locates error-prone token positions in responses and performs localized policy updates via a custom objective that penalizes incorrect-language tokens.
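
The localized update described above can be illustrated with a minimal sketch. It assumes a toy setting that is not taken from the paper: each position exposes an explicit candidate distribution, a hypothetical `is_target_lang` check (Hangul detection, standing in for Korean as the target language) supplies a 0/1 reward, and the surrogate loss is a plain mean-centred policy-gradient term. The paper's actual objective and selection rule are not given here, so every name and choice below is an assumption.

```python
import math

def is_target_lang(token: str) -> bool:
    """Hypothetical language check: a token passes if it contains a Hangul syllable."""
    return any("\uac00" <= ch <= "\ud7a3" for ch in token)

def token_level_loss(step_probs, chosen):
    """Toy token-level surrogate loss (illustrative, not the paper's objective).

    step_probs: one dict per generated position, mapping candidate token -> probability.
    chosen:     the token actually emitted at each position.
    Only positions whose emitted token fails the language check contribute.
    """
    loss = 0.0
    flagged = []
    for pos, (probs, tok) in enumerate(zip(step_probs, chosen)):
        if is_target_lang(tok):
            continue  # correct-language position: no update signal
        flagged.append(pos)
        # Score every candidate at this position with an assumed 0/1 reward.
        rewards = {cand: 1.0 if is_target_lang(cand) else 0.0 for cand in probs}
        mean_r = sum(rewards.values()) / len(rewards)
        for cand, p in probs.items():
            advantage = rewards[cand] - mean_r
            # Minimising this term raises the log-probability of candidates
            # with positive advantage and lowers that of the others.
            loss -= advantage * math.log(p)
    return loss, flagged

step_probs = [
    {"안녕": 0.9, "hello": 0.1},
    {"세계": 0.4, "world": 0.6},
]
loss, flagged = token_level_loss(step_probs, chosen=["안녕", "world"])
```

In this toy run only the second position is flagged, so the first position contributes nothing to the loss. That is the sense in which the update is localized: positions that already emit the target language receive no gradient signal from this term.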

If this is right

TLPO yields higher language consistency than sequence-level methods such as DPO, ORPO, and GRPO.
Downstream task accuracy stays intact across multiple multilingual LLMs and diverse languages.
Granular token-level intervention avoids the capability degradation observed with full-response fine-tuning.
The same selective-update logic can be reused on other consistency problems inside a single generation.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

Similar position-specific updates could be applied to factual or stylistic inconsistencies rather than language choice.
If token-level targeting generalizes, it offers a route to stronger alignment with fewer side effects than whole-sequence methods.
Automating the detection of error-prone positions without human labels would widen the method's practicality.

Load-bearing premise

Error-prone token positions can be reliably identified, and localized updates will suppress language errors without introducing new unintended behaviors or capability loss.

What would settle it

Running the TLPO procedure on the same models and languages but finding no gain in language-consistency metrics or a drop in downstream accuracy would falsify the central claim.

Figures

Figures reproduced from arXiv: 2604.26553 by Jimyeong Kim, Jinho Choo, JunSeung Lee, S. K. Hong, Yeeho Song, Yeong-Dae Kwon.

**Figure 1.** Overview of language confusion in recent…

**Figure 2.** Overview of the Token-Level Policy Optimiza…

**Figure 3.** An example of the advantage distribution.

**Figure 4.** Scatter plot of the average Response Pass Rate (RPR) versus accuracy for each method after fine-tuning

**Figure 5.** Scatter plot of the average Response Pass Rate (RPR) versus accuracy for each method after fine-tuning, …

**Figure 6.** The impact of TLPO on the probability dis…

**Figure 7.** Performance Comparison across Different Advantage and Token Selection Strategies. In the x-axis labels, RS and MS denote ranked token selection and multinomial sampling respectively. Additionally, R, µ and σ represent the reward, mean reward and standard deviation of the reward. These results were obtained using the Llama-3.1-8B-Instruct model with Korean as the target language.

**Figure 8.** Instruction used for GSM8K(en) and GSM8K(cross). Note that {question} represents the original English question from the GSM8K dataset, which remains untranslated. … versus those that do not. Based on these counts, we subsequently calculate the Word Pass Rate (WPR) and Response Pass Rate (RPR). For word segmentation, we utilized the jieba library for Chinese and a Python-based tagger library for Japanese. Fo…

**Figure 9.** Response Pass Rate and Accuracy across Vari…

**Figure 10.** Response length and reward dynamics during the GRPO fine-tuning process. Blue lines (left axis) and…

**Figure 11.** Response Pass Rate (RPR) and accuracy plots across models and target languages, under the setting…
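
The figure captions above mention counting words that pass a language check and deriving the Word Pass Rate (WPR) and Response Pass Rate (RPR) from those counts. A minimal sketch of one plausible reading follows. It assumes WPR is the fraction of words that pass and RPR is the fraction of responses in which every word passes. These definitions, the whitespace segmentation, and the `word_passes` rule (no Latin letters, for a non-English target) are illustrative assumptions rather than the paper's implementation.

```python
def word_passes(word: str) -> bool:
    """Assumed check: a word passes if it contains no Latin letters.
    Digits and punctuation are treated as language-neutral."""
    return not any(ch.isascii() and ch.isalpha() for ch in word)

def pass_rates(responses):
    """Return (WPR, RPR) over a list of response strings."""
    total_words = passed_words = passed_responses = 0
    for text in responses:
        words = text.split()  # whitespace segmentation (assumption)
        ok = [word_passes(w) for w in words]
        total_words += len(words)
        passed_words += sum(ok)
        passed_responses += all(ok)  # a response passes only if every word passes
    wpr = passed_words / total_words if total_words else 0.0
    rpr = passed_responses / len(responses) if responses else 0.0
    return wpr, rpr

responses = ["정답은 42 입니다", "정답은 the answer 입니다"]
wpr, rpr = pass_rates(responses)
```

Under these assumptions RPR is the stricter metric, since a single wrong-language word fails the whole response while lowering WPR only slightly. Languages without whitespace word boundaries would need a real segmenter in place of `str.split`.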

Original abstract

Large language models (LLMs) demonstrate strong multilingual capabilities, yet often fail to consistently generate responses in the intended language, exhibiting a phenomenon known as language confusion. Prior mitigation approaches based on sequence-level fine-tuning, such as DPO, ORPO, and GRPO, operate at the level of entire responses and can lead to unintended degradation of general model capabilities, motivating the need for more fine-grained alternatives. To address this, we introduce Token-Level Policy Optimization (TLPO), a fine-tuning framework designed to mitigate language confusion through localized, token-level updates. TLPO identifies error-prone positions, explores alternative candidate tokens, and updates the policy using a tailored objective to suppress error-inducing outputs at a granular level. This selective intervention enables effective mitigation of language confusion without compromising the model's general abilities. Experiments on multiple multilingual LLMs across diverse languages demonstrate that TLPO significantly outperforms baselines in improving language consistency while preserving downstream task accuracy.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Desk Editor's Note · private letter to a colleague

TLPO tries a token-level policy optimization to fix language confusion without broad capability loss, but the abstract gives no numbers or details to check if it actually works.

TLPO is a token-level policy optimization method meant to cut down on language confusion in multilingual LLMs without the capability hits that come from sequence-level approaches like DPO. The new part is shifting the optimization to individual tokens: the authors spot positions likely to cause language mix-ups, try out other tokens there, and tweak the policy only at those spots with a custom loss. This differs from the full-response fine-tuning baselines they cite, and it makes sense as a way to be more surgical.

The paper does a good job framing the practical problem (LLMs often switch languages mid-response even when prompted otherwise), and the motivation for finer-grained control is solid. If the method works as described, it could be useful for anyone deploying these models in non-English settings.

The main weakness is that the abstract gives no numbers, no dataset sizes, no ablation results, and no specifics on how error-prone positions are identified. That makes it impossible to judge whether the outperformance is real or whether the preserved accuracy holds beyond the tested cases. The assumption that localized token changes won't bleed into other capabilities is plausible but unproven here, and without causal checks on the identification step, the method could be masking the confusion rather than solving it. The stress-test note about potential side effects on untested tasks is on point given what's shown.

This is for researchers focused on multilingual alignment and efficient fine-tuning. Someone looking for ideas on granular policy optimization might find it worth reading, but it needs the full experiments to land. I would recommend sending it for peer review. The idea is distinct and the problem matters, so referees could help strengthen the evidence.

Referee Report

2 major / 0 minor

Summary. The paper introduces Token-Level Policy Optimization (TLPO) as a fine-tuning framework to mitigate language confusion in LLMs. Unlike sequence-level methods such as DPO, ORPO, and GRPO, TLPO identifies error-prone token positions, explores alternative candidate tokens at those positions, and applies a tailored token-level objective to suppress language errors locally. The central claim is that this selective intervention improves language consistency across multiple multilingual LLMs and diverse languages while preserving downstream task accuracy, avoiding the capability degradation associated with full-response updates.

Significance. If the empirical claims hold with rigorous validation, TLPO would offer a meaningful refinement to policy optimization techniques for LLMs by enabling more granular control over specific failure modes. This could be valuable for multilingual applications where consistent language output is required without broad side effects on general capabilities, potentially influencing future work on localized alignment methods.

major comments (2)

[Abstract] The central claim of significant outperformance in language consistency and preservation of downstream accuracy is asserted without any reported metrics, dataset details, model specifications, baseline implementations, or ablation results. This absence prevents verification of whether the token-level updates actually deliver the claimed benefits or merely suppress symptoms without addressing underlying mechanisms.
[Abstract] Method description (as summarized in Abstract): The approach assumes error-prone positions can be reliably identified and that localized updates at those positions will suppress language confusion without introducing new unintended behaviors or degrading capabilities on untested tasks. No details are provided on the identification procedure (e.g., probability, entropy, or other proxy), nor is there causal validation or evidence that such updates remain orthogonal to general capabilities, which is load-bearing for the no-degradation claim.

Simulated Author's Rebuttal

2 responses · 0 unresolved

We thank the referee for their constructive feedback. We address each major comment below with clarifications drawn from the full manuscript and indicate planned revisions where appropriate.

Referee: [Abstract] The central claim of significant outperformance in language consistency and preservation of downstream accuracy is asserted without any reported metrics, dataset details, model specifications, baseline implementations, or ablation results. This absence prevents verification of whether the token-level updates actually deliver the claimed benefits or merely suppress symptoms without addressing underlying mechanisms.

Authors: The abstract is a high-level summary constrained by length limits and therefore omits specific numbers. The full manuscript reports concrete metrics (language consistency improvements, downstream accuracy scores), the evaluation datasets, the multilingual LLMs tested, baseline implementations (DPO, ORPO, GRPO), and ablation studies in the Experiments section. These results support the claimed benefits of token-level updates. We will revise the abstract to include a small number of key quantitative results in the next version.

revision: partial

Referee: [Abstract] Method description (as summarized in Abstract): The approach assumes error-prone positions can be reliably identified and that localized updates at those positions will suppress language confusion without introducing new unintended behaviors or degrading capabilities on untested tasks. No details are provided on the identification procedure (e.g., probability, entropy, or other proxy), nor is there causal validation or evidence that such updates remain orthogonal to general capabilities, which is load-bearing for the no-degradation claim.

Authors: Section 3.2 of the manuscript details the identification procedure, which uses token probability and entropy thresholds to locate error-prone positions. The token-level objective is deliberately localized to minimize side effects, and the reported experiments show no degradation on downstream tasks, providing empirical support for orthogonality. We agree that a dedicated causal analysis is absent; we will add an explicit discussion of this assumption and its empirical grounding in the revised manuscript.

revision: partial
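
To make the kind of proxy under discussion concrete, here is a minimal sketch of flagging positions by chosen-token probability and distribution entropy. The thresholds `p_min` and `h_max`, the function names, and the rule that either condition alone flags a position are hypothetical; the manuscript's actual procedure is not reproduced here.

```python
import math

def entropy(probs):
    """Shannon entropy (in nats) of a discrete distribution."""
    return -sum(p * math.log(p) for p in probs if p > 0)

def flag_positions(step_dists, chosen_idx, p_min=0.5, h_max=1.0):
    """Flag positions whose chosen token is unlikely or whose distribution is diffuse.

    step_dists: one probability list per generated position.
    chosen_idx: index of the emitted token within each list.
    """
    flagged = []
    for pos, (dist, idx) in enumerate(zip(step_dists, chosen_idx)):
        if dist[idx] < p_min or entropy(dist) > h_max:
            flagged.append(pos)
    return flagged

step_dists = [
    [0.97, 0.02, 0.01],  # confident prediction
    [0.40, 0.35, 0.25],  # diffuse prediction
    [0.80, 0.15, 0.05],  # fairly confident prediction
]
flagged = flag_positions(step_dists, chosen_idx=[0, 1, 0])
```

A threshold rule like this is cheap to compute from logits the model already produces, but it is only a proxy: a position can be uncertain for reasons unrelated to language choice, which is the gap the referee's request for causal validation points at.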

Circularity Check

0 steps flagged

No circularity in derivation chain

full rationale

The paper introduces TLPO as an adaptation of policy optimization to token-level updates for language confusion mitigation, describing identification of error-prone positions and a tailored objective without presenting any equations, derivations, or self-referential definitions. No load-bearing steps reduce to fitted inputs by construction, self-citations, or renamed known results; the central claims rest on experimental comparisons rather than a closed mathematical chain. This is the expected non-finding for a methods paper lacking formal derivations.

Axiom & Free-Parameter Ledger

0 free parameters · 1 axiom · 0 invented entities

The approach rests on the assumption that token-level interventions are feasible and selective enough to avoid global degradation; no explicit free parameters or invented entities are named in the abstract.

axioms (1)

domain assumption: Localized token-level policy updates can suppress language confusion without compromising general model capabilities.
Central to the claim that TLPO preserves downstream task accuracy.

pith-pipeline@v0.9.0 · 5480 in / 992 out tokens · 32013 ms · 2026-05-07T10:43:56.079191+00:00 · methodology
