pith. sign in

arxiv: 2606.06586 · v1 · pith:5TI55IC5new · submitted 2026-06-04 · 💻 cs.CL

Improving Cross-Lingual Factual Recall via Consistency-Driven Reinforcement Learning

Pith reviewed 2026-06-28 01:47 UTC · model grok-4.3

classification 💻 cs.CL
keywords cross-lingual factual recallGRPOreinforcement learningmultilingual consistencyPolyFact datasetlanguage specializationgeneralizationLLM routing
0
0 comments X

The pith

GRPO training improves cross-lingual factual consistency and generalization in LLMs more than supervised fine-tuning or continual pretraining.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper introduces PolyFact, a parallel multilingual factual QA dataset spanning 100K Wikidata-grounded facts across 12 typologically diverse languages. It evaluates light continual pretraining, supervised fine-tuning, and Group Relative Policy Optimization on Qwen-2.5-7B and OLMo-2-1124-7B, finding that GRPO produces larger gains in cross-lingual consistency and generalization to unseen languages. Mechanistic analysis shows GRPO reduces language specialization inside MLP layers and attention heads, yielding more shared representations. A sympathetic reader would care because current LLMs encode substantial knowledge in English yet often cannot retrieve or express the same facts reliably when prompted in other languages.

Core claim

Using PolyFact, GRPO consistently outperforms SFT on cross-lingual factual recall and consistency while also generalizing to languages absent from training; CPT on parallel data adds only limited value. GRPO achieves these results by reorganizing multilingual routing, specifically by reducing language specialization in MLP layers and attention heads and thereby promoting more shared cross-lingual representations.

What carries the argument

Group Relative Policy Optimization (GRPO) applied as consistency-driven reinforcement learning on parallel factual data.

If this is right

  • GRPO improves factual recall on both training languages and unseen languages.
  • GRPO reduces language specialization in MLP layers and attention heads.
  • Continual pretraining on parallel data yields only limited additional gains over the base models.
  • The resulting models exhibit higher cross-lingual consistency on factual questions.
  • The same training approach can be applied to the two tested 7B-scale models.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

  • PolyFact could serve as a reusable benchmark for measuring cross-lingual consistency beyond the methods tested here.
  • Similar consistency-driven RL objectives might address other forms of internal inconsistency, such as in multi-step reasoning across languages.
  • If language specialization is the main bottleneck, further reductions could improve transfer to very low-resource languages not included in PolyFact.

Load-bearing premise

The observed gains in consistency and generalization are produced by the GRPO objective rather than by differences in training compute, hyperparameter choices, or properties of the PolyFact data construction.

What would settle it

A controlled experiment that equalizes total training steps, learning rate schedule, batch size, and data composition between GRPO and SFT and finds no remaining advantage for GRPO would falsify the claim.

Figures

Figures reproduced from arXiv: 2606.06586 by Eduardo Sanchez, Ektor Oikonomidis Doumpas, Eleftheria Kolokytha, George Burgess, Harry O'Donnell, Jonathan von Rad, Louis Arts, Pontus Stenetorp, Yao Lu.

Figure 1
Figure 1. Figure 1: Incentivizing cross-lingual factual con [PITH_FULL_IMAGE:figures/full_fig_p001_1.png] view at source ↗
Figure 2
Figure 2. Figure 2: Expanding language coverage from English [PITH_FULL_IMAGE:figures/full_fig_p003_2.png] view at source ↗
Figure 3
Figure 3. Figure 3: Layer-rank analysis of Base (left) and GRPO [PITH_FULL_IMAGE:figures/full_fig_p006_3.png] view at source ↗
Figure 5
Figure 5. Figure 5: Language-specific neurons across languages; GRPO increases English alignment while SFT does not [PITH_FULL_IMAGE:figures/full_fig_p007_5.png] view at source ↗
Figure 6
Figure 6. Figure 6: Concentration of specialised neurons in the [PITH_FULL_IMAGE:figures/full_fig_p007_6.png] view at source ↗
Figure 7
Figure 7. Figure 7: Pairwise head overlap across language pairs [PITH_FULL_IMAGE:figures/full_fig_p008_7.png] view at source ↗
Figure 8
Figure 8. Figure 8: Percentage of language-important atten￾tion heads in OLMo-2-1124-7B across Base, SFT and GRPO. 5.5 Mechanistic Analysis via LAHIS Cross-lingual head sharing. Head overlap be￾tween language pairs increases substantially af￾ter both finetuning methods (as visualized in Fig￾ure 7). SFT produces the strongest gains in pair￾wise overlap, particularly among Indo-European languages (e.g. DE–FR: 25% → 90%, DE–ID: … view at source ↗
Figure 9
Figure 9. Figure 9: Layer-rank analysis of Base and GRPO fine [PITH_FULL_IMAGE:figures/full_fig_p013_9.png] view at source ↗
Figure 10
Figure 10. Figure 10: KLAR improvement by relation type. Proper-noun relations account for 55% of the KLAR dataset. rics suggest. The gains GRPO actually delivers on language abstraction are partly masked in the aggregate by regressions concentrated in proper￾noun prompts where dataset bias works against us [PITH_FULL_IMAGE:figures/full_fig_p014_10.png] view at source ↗
Figure 12
Figure 12. Figure 12: Number of stable (blue) and changed (or [PITH_FULL_IMAGE:figures/full_fig_p017_12.png] view at source ↗
Figure 11
Figure 11. Figure 11: Percentage of language-important attention [PITH_FULL_IMAGE:figures/full_fig_p017_11.png] view at source ↗
Figure 13
Figure 13. Figure 13: Pairwise overlap of language-important attention heads across language pairs for Qwen-2.5-7B before [PITH_FULL_IMAGE:figures/full_fig_p018_13.png] view at source ↗
Figure 14
Figure 14. Figure 14: Layer-wise frequency of language-specific neurons for all target languages. Note the distinct English [PITH_FULL_IMAGE:figures/full_fig_p019_14.png] view at source ↗
Figure 15
Figure 15. Figure 15: Empirical Cumulative Distribution Function (ECDF) of specialised neuron discovery across layers. The [PITH_FULL_IMAGE:figures/full_fig_p020_15.png] view at source ↗
Figure 16
Figure 16. Figure 16: Delta heatmap (GRPO − base). Blue indicates weakened heads, red indicates strengthened heads [PITH_FULL_IMAGE:figures/full_fig_p020_16.png] view at source ↗
Figure 17
Figure 17. Figure 17: Delta heatmap (SFT-consistent − base). Pattern closely mirrors GRPO, suggesting both methods suppress similar heads [PITH_FULL_IMAGE:figures/full_fig_p021_17.png] view at source ↗
read the original abstract

Large language models (LLMs) trained predominantly on English data encode substantial world knowledge, yet often fail to express it reliably in other languages, a phenomenon known as cross-lingual factual inconsistency. To study and address this, we introduce PolyFact, a large-scale parallel multilingual factual QA dataset containing 100K Wikidata-grounded facts across 12 typologically diverse languages. Using PolyFact, we compare light continual pretraining (CPT), supervised fine-tuning (SFT), and reinforcement learning via Group Relative Policy Optimization (GRPO) for improving cross-lingual factual recall in Qwen-2.5-7B and OLMo-2-1124-7B. We find that GRPO consistently outperforms SFT, improving both cross-lingual consistency and generalization to unseen languages, while CPT on parallel data yields limited additional gains. Mechanistic analyses further show that GRPO reorganizes multilingual routing by reducing language specialization in MLP layers and attention heads, thereby promoting more shared cross-lingual representations. We release our code, models, and dataset.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Referee Report

2 major / 1 minor

Summary. The paper introduces PolyFact, a parallel multilingual factual QA dataset with 100K Wikidata-grounded facts across 12 languages. It evaluates light continual pretraining (CPT), supervised fine-tuning (SFT), and Group Relative Policy Optimization (GRPO) on Qwen-2.5-7B and OLMo-2-7B, claiming that GRPO consistently outperforms SFT in cross-lingual factual consistency and generalization to unseen languages while CPT yields limited gains. Mechanistic analyses indicate that GRPO reduces language specialization in MLP layers and attention heads to promote shared representations. The work releases code, models, and the dataset.

Significance. If the reported superiority of GRPO holds after controlling for training compute and data artifacts, the contribution would be significant for addressing cross-lingual factual inconsistency. The new dataset and the release of code/models/data support reproducibility and further work on multilingual routing. The mechanistic findings, if rigorously validated, could inform representation learning in multilingual models.

major comments (2)
  1. [Experimental setup] Experimental setup section: The comparisons of GRPO against SFT and CPT on Qwen-2.5-7B and OLMo-2-7B do not report matched training compute, FLOPs, gradient steps, wall-clock time, or token budgets. This is load-bearing for the central claim that performance differences are caused by the GRPO objective rather than optimization effort or PolyFact construction artifacts.
  2. [Mechanistic analyses] Mechanistic analyses section: The methods used to quantify language specialization in MLP layers and attention heads, and to demonstrate reorganization of multilingual routing, are unspecified. This undermines evaluation of the claim that GRPO promotes more shared cross-lingual representations.
minor comments (1)
  1. [Abstract] Abstract and results sections lack explicit quantitative metrics, error bars, or statistical tests for the reported outperformance, which should be added for clarity even if present in tables.

Simulated Author's Rebuttal

2 responses · 0 unresolved

We thank the referee for the detailed and constructive report. The two major comments identify important gaps in reporting that affect the strength of our central claims. We address each below and commit to revisions that will improve clarity and rigor without altering the core findings.

read point-by-point responses
  1. Referee: [Experimental setup] Experimental setup section: The comparisons of GRPO against SFT and CPT on Qwen-2.5-7B and OLMo-2-7B do not report matched training compute, FLOPs, gradient steps, wall-clock time, or token budgets. This is load-bearing for the central claim that performance differences are caused by the GRPO objective rather than optimization effort or PolyFact construction artifacts.

    Authors: We agree that matched training compute is essential for attributing gains to the GRPO objective. The current manuscript reports only epoch counts and does not provide token budgets, gradient steps, or FLOPs. In the revision we will add a dedicated table (or subsection) listing exact token counts processed, number of gradient updates, and estimated FLOPs for each method on both models. Where exact matching was not performed, we will state the compute differential and discuss its potential impact on the results. revision: yes

  2. Referee: [Mechanistic analyses] Mechanistic analyses section: The methods used to quantify language specialization in MLP layers and attention heads, and to demonstrate reorganization of multilingual routing, are unspecified. This undermines evaluation of the claim that GRPO promotes more shared cross-lingual representations.

    Authors: The manuscript describes the high-level approach (activation patching and routing entropy) but omits the precise metrics and implementation details. We will expand Section 4.2 with the exact formulas used to compute language specialization scores for MLP neurons and attention heads, the procedure for measuring cross-lingual routing reorganization, and any statistical tests applied. This will allow readers to reproduce the mechanistic claims. revision: yes

Circularity Check

0 steps flagged

No significant circularity in empirical comparisons

full rationale

The paper presents empirical results from training and evaluating GRPO, SFT, and CPT on the PolyFact dataset for cross-lingual factual recall in two base models. All load-bearing claims (outperformance of GRPO, mechanistic changes in routing) are grounded in measured performance metrics and layer analyses against external baselines, not in any paper-internal equations, fitted parameters renamed as predictions, or self-citation chains that reduce the result to its own inputs by construction. The work is therefore self-contained against external benchmarks.

Axiom & Free-Parameter Ledger

0 free parameters · 0 axioms · 0 invented entities

The central claim rests on the empirical effectiveness of GRPO applied to the newly constructed PolyFact dataset; no free parameters, axioms, or invented entities are explicitly introduced beyond standard supervised and reinforcement learning assumptions.

pith-pipeline@v0.9.1-grok · 5744 in / 1190 out tokens · 28247 ms · 2026-06-28T01:47:57.938705+00:00 · methodology

discussion (0)

Sign in with ORCID, Apple, or X to comment. Anyone can read and Pith papers without signing in.

Reference graph

Works this paper leans on

44 extracted references · 9 canonical work pages

  1. [1]

    Proceedings of ACL 2025 , year =

    Lost in Multilinguality: Dissecting Cross-lingual Factual Inconsistency in Transformer Language Models , author =. Proceedings of ACL 2025 , year =

  2. [2]

    Proceedings of ACL 2025 , year =

    Just Go Parallel: Improving the Multilingual Capabilities of Large Language Models , author =. Proceedings of ACL 2025 , year =

  3. [3]

    Proceedings of EMNLP 2025 , year =

    From Unaligned to Aligned: Scaling Multilingual LLMs with Multi-Way Parallel Corpora , author =. Proceedings of EMNLP 2025 , year =

  4. [4]

    2024 , url =

    Zhao, Jun and Zhang, Zhihao and Zhang, Qi and Gui, Tao and Huang, Xuanjing , journal=. 2024 , url =

  5. [5]

    Continual Pre-Training for Cross-Lingual

    Kazuki Fujii and Taishi Nakamura and Genta Indra Winata and others , journal =. Continual Pre-Training for Cross-Lingual. 2024 , url =

  6. [6]

    Teaching

    Kuulmets, Hele-Andra and Purason, Taido and Luhtaru, Agnes and Fishel, Mark , booktitle =. Teaching. 2024 , url =

  7. [7]

    2024 , eprint=

    The Llama 3 Herd of Models , author=. 2024 , eprint=

  8. [8]

    2023 , eprint=

    Llama 2: Open Foundation and Fine-Tuned Chat Models , author=. 2023 , eprint=

  9. [9]

    Language-Specific Neurons: The Key to Multilingual Capabilities in Large Language Models

    Tang, Tianyi and Luo, Wenyang and Huang, Haoyang and Zhang, Dongdong and Wang, Xiaolei and Zhao, Xin and Wei, Furu and Wen, Ji-Rong. Language-Specific Neurons: The Key to Multilingual Capabilities in Large Language Models. Proceedings of the 62nd Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers). 2024. doi:10.18653/v1...

  10. [10]

    Do Multilingual

    Lisa Schut and Yarin Gal and Sebastian Farquhar , booktitle=. Do Multilingual. 2025 , url=

  11. [11]

    2025 , eprint=

    Languages Still Left Behind: Toward a Better Multilingual Machine Translation Benchmark , author=. 2025 , eprint=

  12. [12]

    2022 , eprint=

    Language Models are Multilingual Chain-of-Thought Reasoners , author=. 2022 , eprint=

  13. [13]

    2025 , eprint=

    Focusing on Language: Revealing and Exploiting Language Attention Heads in Multilingual Large Language Models , author=. 2025 , eprint=

  14. [14]

    Proceedings of EMNLP 2024 (System Demonstrations) , year =

    Tufanov, Igor and Wendler, Karen and Vesel. Proceedings of EMNLP 2024 (System Demonstrations) , year =

  15. [15]

    Global MMLU : Understanding and Addressing Cultural and Linguistic Biases in Multilingual Evaluation

    Singh, Shivalika and Romanou, Angelika and Fourrier, Cl \'e mentine and Adelani, David Ifeoluwa and Ngui, Jian Gang and Vila-Suero, Daniel and Limkonchotiwat, Peerat and Marchisio, Kelly and Leong, Wei Qi and Susanto, Yosephine and Ng, Raymond and Longpre, Shayne and Ruder, Sebastian and Ko, Wei-Yin and Bosselut, Antoine and Oh, Alice and Martins, Andre a...

  16. [16]

    URLhttps://doi.org/10.1038/s41586-025-09422-z

    Guo, Daya and Yang, Dejian and Zhang, Haowei and Song, Junxiao and Wang, Peiyi and Zhu, Qihao and Xu, Runxin and Zhang, Ruoyu and Ma, Shirong , year=. DeepSeek-R1 incentivizes reasoning in LLMs through reinforcement learning , volume=. Nature , publisher=. doi:10.1038/s41586-025-09422-z , number=

  17. [17]

    2024 , eprint=

    DeepSeekMath: Pushing the Limits of Mathematical Reasoning in Open Language Models , author=. 2024 , eprint=

  18. [18]

    2023 , version =

    Habib, Nathan and Fourrier, Clémentine and Kydlíček, Hynek and Wolf, Thomas and Tunstall, Lewis , title =. 2023 , version =

  19. [19]

    Proceedings of the ACM SIGOPS 29th Symposium on Operating Systems Principles , year=

    Efficient Memory Management for Large Language Model Serving with PagedAttention , author=. Proceedings of the ACM SIGOPS 29th Symposium on Operating Systems Principles , year=

  20. [20]

    doi:10.5281/zenodo.12608602 , url =

    Gao, Leo and Tow, Jonathan and Abbasi, Baber and Biderman, Stella and Black, Sid and DiPofi, Anthony and Foster, Charles and Golding, Laurence and Hsu, Jeffrey and Le Noac'h, Alain and Li, Haonan and McDonell, Kyle and Muennighoff, Niklas and Ociepa, Chris and Phang, Jason and Reynolds, Laria and Schoelkopf, Hailey and Skowron, Aviya and Sutawika, Lintang...

  21. [21]

    2022 , eprint=

    No Language Left Behind: Scaling Human-Centered Machine Translation , author=. 2022 , eprint=

  22. [22]

    arXiv preprint arXiv:2503.19786 , year =

    Kamath, Aishwarya and Ferret, Johan and Pathak, Shreya and Vieillard, Nino and Merhej, Ramona and Perrin, Sarah and Matejovicova, Tatiana and Ram. arXiv preprint arXiv:2503.19786 , year =

  23. [23]

    2026 , eprint=

    The Role of Mixed-Language Documents for Multilingual Large Language Model Pretraining , author=. 2026 , eprint=

  24. [24]

    Inside-Out: Hidden Factual Knowledge in

    Zorik Gekhman and Eyal Ben-David and Hadas Orgad and Eran Ofek and Yonatan Belinkov and Idan Szpektor and Jonathan Herzig and Roi Reichart , booktitle=. Inside-Out: Hidden Factual Knowledge in. 2025 , url=

  25. [25]

    The Twelfth International Conference on Learning Representations , year=

    The Reasonableness Behind Unreasonable Translation Capability of Large Language Model , author=. The Twelfth International Conference on Learning Representations , year=

  26. [26]

    How Far can 100 Samples Go? Unlocking Zero-Shot Translation with Tiny Multi-Parallel Data

    Wu, Di and Tan, Shaomu and Meng, Yan and Stap, David and Monz, Christof. How Far can 100 Samples Go? Unlocking Zero-Shot Translation with Tiny Multi-Parallel Data. Findings of the Association for Computational Linguistics: ACL 2024. 2024. doi:10.18653/v1/2024.findings-acl.896

  27. [27]

    From Translation to Multilinguality: Revisit the Role of Parallel Data in Multilingual

    Haobin Lin and Yan Zhao and Wenhan Han and Ping Guo and BINBINLIU and Yifan Zhang and Bingni Zhang and Taifeng Wang and Yin Zheng , year=. From Translation to Multilinguality: Revisit the Role of Parallel Data in Multilingual

  28. [28]

    Investigating and Scaling up Code-Switching for Multilingual Language Model Pre-Training

    Wang, Zhijun and Li, Jiahuan and Zhou, Hao and Weng, Rongxiang and Wang, Jingang and Huang, Xin and Han, Xue and Feng, Junlan and Deng, Chao and Huang, Shujian. Investigating and Scaling up Code-Switching for Multilingual Language Model Pre-Training. Findings of the Association for Computational Linguistics: ACL 2025. 2025. doi:10.18653/v1/2025.findings-acl.575

  29. [29]

    Paths Not Taken: Understanding and Mending the Multilingual Factual Recall Pipeline

    Lu, Meng and Zhang, Ruochen and Eickhoff, Carsten and Pavlick, Ellie. Paths Not Taken: Understanding and Mending the Multilingual Factual Recall Pipeline. Proceedings of the 2025 Conference on Empirical Methods in Natural Language Processing. 2025. doi:10.18653/v1/2025.emnlp-main.762

  30. [30]

    Do Llamas Work in E nglish? On the Latent Language of Multilingual Transformers

    Wendler, Chris and Veselovsky, Veniamin and Monea, Giovanni and West, Robert. Do Llamas Work in E nglish? On the Latent Language of Multilingual Transformers. Proceedings of the 62nd Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers). 2024. doi:10.18653/v1/2024.acl-long.820

  31. [31]

    2024 , eprint=

    2 OLMo 2 Furious , author=. 2024 , eprint=

  32. [32]

    and Shen, Yelong and Wallis, Phillip and Allen-Zhu, Zeyuan and Li, Yuanzhi and Wang, Shean and Wang, Lu and Chen, Weizhu , booktitle =

    Hu, Edward J. and Shen, Yelong and Wallis, Phillip and Allen-Zhu, Zeyuan and Li, Yuanzhi and Wang, Shean and Wang, Lu and Chen, Weizhu , booktitle =. 2022 , url =

  33. [33]

    Hugging Face repository , howpublished =

    FineTranslations , author=. Hugging Face repository , howpublished =. 2026 , publisher =

  34. [34]

    2024 , eprint=

    Language-Specific Neurons: The Key to Multilingual Capabilities in Large Language Models , author=. 2024 , eprint=

  35. [35]

    arXiv preprint arXiv:2510.10280 , year =

    On the Entity-Level Alignment in Crosslingual Consistency , author =. arXiv preprint arXiv:2510.10280 , year =

  36. [36]

    arXiv preprint arXiv:2603.17070 , year =

    Large Reasoning Models Struggle to Transfer Parametric Knowledge Across Scripts , author =. arXiv preprint arXiv:2603.17070 , year =

  37. [37]

    2025 , url =

    Matsutani, Kohsei and Takashiro, Shota and Minegishi, Gouki and Kojima, Takeshi and Iwasawa, Yusuke and Matsuo, Yutaka , journal =. 2025 , url =

  38. [38]

    2026 , url =

    Hu, Hanxu and Wang, Yuxuan and Huan, Maggie and Vamvas, Jannis and Huang, Yinya and Guo, Zhijiang and Sennrich, Rico , journal =. 2026 , url =

  39. [39]

    2025 , url =

    Ye, Xiao and Shrivastava, Shaswat and Li, Zhaonan and Dineen, Jacob and Lu, Shijie and Ahuja, Avneet and Shen, Ming and Xu, Zhikun and Zhou, Ben , journal =. 2025 , url =

  40. [40]

    arXiv preprint arXiv:2601.14896 , year =

    Language-Coupled Reinforcement Learning for Multilingual Retrieval-Augmented Generation , author =. arXiv preprint arXiv:2601.14896 , year =

  41. [41]

    2025 , eprint=

    Qwen2.5 Technical Report , author=. 2025 , eprint=

  42. [42]

    arXiv preprint arXiv:2505.14297 , year =

    Cross-Lingual Optimization for Language Transfer in Large Language Models , author =. arXiv preprint arXiv:2505.14297 , year =

  43. [43]

    A Glimpse into Babel: An Analysis of Multilinguality in

    Kaffee, Lucie-Aim. A Glimpse into Babel: An Analysis of Multilinguality in. Proceedings of the 13th International Symposium on Open Collaboration (OpenSym '17) , year =. doi:10.1145/3125433.3125465 , publisher =

  44. [44]

    arXiv preprint arXiv:2512.22712 , year =

    Beg to Differ: Understanding Reasoning-Answer Misalignment Across Languages , author =. arXiv preprint arXiv:2512.22712 , year =