LiveCLKTBench: Towards Reliable Evaluation of Cross-Lingual Knowledge Transfer in Multilingual LLMs

Chun-Chia Hsu; Kai-Wei Chang; Kai-Xin Chen; Mi-Yen Yeh; Nanyun Peng; Pei-Fu Guo; Shou-De Lin; Ya-An Tsai; Yun-Da Tsai

arxiv: 2511.14774 · v4 · submitted 2025-11-03 · 💻 cs.CL · cs.AI


Pith reviewed 2026-05-18 00:53 UTC · model grok-4.3


keywords: cross-lingual transfer, multilingual LLMs, knowledge transfer evaluation, linguistic distance, asymmetric transfer, benchmark, model scaling effects


The pith

LiveCLKTBench isolates genuine cross-lingual knowledge transfer in multilingual LLMs from pre-training exposure.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper introduces LiveCLKTBench, an automated pipeline that selects recent real-world facts, confirms models have not seen them in training, creates questions about them, and translates the questions across languages. This setup tests whether correct answers in a new language reflect actual transfer rather than earlier memorization. Tests on multiple LLMs in five languages show transfer depends on how similar the languages are and often works better in one direction than the reverse. Larger models perform better at transfer but the benefit levels off, and results change by subject area. The benchmark aims to give a cleaner measure of multilingual knowledge movement.

Core claim

We present LiveCLKTBench, an automated generation pipeline specifically designed to isolate and measure cross-lingual knowledge transfer. Our pipeline identifies self-contained, time-sensitive knowledge entities from real-world domains, filters them based on temporal occurrence, and verifies them against the model's knowledge. The documents of these valid entities are then used to generate factual questions, which are translated into multiple languages to evaluate transferability across linguistic boundaries. Using LiveCLKTBench, we evaluate several LLMs across five languages and observe that cross-lingual transfer is strongly influenced by linguistic distance and often asymmetric across language directions.

What carries the argument

LiveCLKTBench, an automated pipeline that selects temporally recent entities, verifies they are unknown to the model, generates source questions, and translates them to isolate transfer effects.
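The stages named above can be sketched in a few lines. This is a minimal illustration, not the authors' implementation: the `Entity` type, the function names, the cutoff date, and the stubbed `model_knows`, `make_question`, and `translate` callables are all hypothetical stand-ins for components the paper implements with LLMs and real documents.

```python
from dataclasses import dataclass
from datetime import date
from typing import Callable, Dict, Iterable, List


@dataclass(frozen=True)
class Entity:
    name: str
    first_seen: date  # date the entity first appeared in the real world
    document: str     # source text later used to write questions


def select_valid_entities(
    entities: Iterable[Entity],
    cutoff: date,
    model_knows: Callable[[Entity], bool],
) -> List[Entity]:
    """Keep entities that post-date the cutoff and that the model does not recognise."""
    valid = []
    for entity in entities:
        if entity.first_seen <= cutoff:  # temporal filter
            continue
        if model_knows(entity):          # knowledge verification
            continue
        valid.append(entity)
    return valid


def build_items(
    entities: Iterable[Entity],
    make_question: Callable[[Entity], str],
    translate: Callable[[str, str], str],
    languages: List[str],
) -> List[Dict]:
    """Generate one source question per entity and translate it into each language."""
    items = []
    for entity in entities:
        question = make_question(entity)
        items.append({
            "entity": entity.name,
            "questions": {lang: translate(question, lang) for lang in languages},
        })
    return items


# Stubbed demo data (hypothetical; real runs would call an LLM and a translator).
demo_entities = [
    Entity("old-event", date(2023, 1, 10), "doc A"),
    Entity("new-but-known", date(2025, 6, 1), "doc B"),
    Entity("new-and-unknown", date(2025, 7, 15), "doc C"),
]
valid = select_valid_entities(
    demo_entities,
    cutoff=date(2024, 12, 31),
    model_knows=lambda e: e.name == "new-but-known",
)
items = build_items(
    valid,
    make_question=lambda e: f"What does {e.document} report?",
    translate=lambda q, lang: f"[{lang}] {q}",
    languages=["en", "fr"],
)
```

Only the entity that passes both the temporal filter and the knowledge check reaches question generation, which is why the reliability of `model_knows` matters so much for everything downstream.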

If this is right

Cross-lingual transfer strength tracks the linguistic distance between source and target languages.
Transfer often runs better in one language direction than the opposite.
Increasing model size raises transfer rates but the added benefit shrinks at larger scales.
Transfer success rates differ across domains such as news, science, or culture.
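One way to make the directional claim concrete is to subtract the two directions of each language pair. The sketch below assumes a transfer score is available for every ordered (train language, test language) pair; the language codes and all numbers are invented for illustration and are not results from the paper.

```python
from itertools import combinations
from typing import Dict, List, Tuple

# Maps (train_language, test_language) to a transfer score in [0, 1].
Scores = Dict[Tuple[str, str], float]


def asymmetry(scores: Scores, languages: List[str]) -> Dict[Tuple[str, str], float]:
    """Return score(a -> b) minus score(b -> a) for every unordered pair (a, b)."""
    return {
        (a, b): scores[(a, b)] - scores[(b, a)]
        for a, b in combinations(languages, 2)
    }


# Hypothetical scores, made up for this example only.
demo_scores: Scores = {
    ("en", "zh"): 0.62, ("zh", "en"): 0.48,
    ("en", "es"): 0.80, ("es", "en"): 0.78,
    ("zh", "es"): 0.41, ("es", "zh"): 0.35,
}
gaps = asymmetry(demo_scores, ["en", "zh", "es"])

# Pair with the largest directional gap in absolute value.
most_asymmetric = max(gaps, key=lambda pair: abs(gaps[pair]))
```

A gap near zero means transfer is roughly symmetric for that pair; a large positive value means the first language is the better source.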

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

The same filtering approach could be applied to test transfer for skills like reasoning or code understanding.
Training data that balances languages more evenly might reduce the observed directional asymmetry.
Running the benchmark on additional language pairs would check whether the distance and scale patterns hold more widely.

Load-bearing premise

The verification step correctly flags entities the model has never encountered and the temporal filter keeps out any facts that appeared in pre-training data.

What would settle it

Finding entities where the model answers the translated questions correctly even though the pipeline marked them as unknown, or locating training data that already contained the supposedly post-training facts.
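The first of these checks could be run as a simple audit: ask the untrained model the translated questions and list every supposedly unknown entity it already answers correctly. The sketch below assumes per-entity, per-language correctness flags have been collected; the function names and data are hypothetical, and a correct answer may also be a lucky guess rather than leakage.

```python
from typing import Dict, List

# entity -> language -> whether the model answered correctly BEFORE any training
Correctness = Dict[str, Dict[str, bool]]


def suspected_leaks(pre_train_correct: Correctness) -> Dict[str, List[str]]:
    """Map each entity to the languages in which the untrained model was already correct."""
    leaks = {}
    for entity, by_language in pre_train_correct.items():
        hits = sorted(lang for lang, ok in by_language.items() if ok)
        if hits:
            leaks[entity] = hits
    return leaks


def leak_rate(pre_train_correct: Correctness) -> float:
    """Fraction of flagged-unknown entities with at least one correct pre-training answer."""
    if not pre_train_correct:
        return 0.0
    return len(suspected_leaks(pre_train_correct)) / len(pre_train_correct)


# Invented audit data for illustration.
demo_audit: Correctness = {
    "entity-1": {"en": False, "zh": False},
    "entity-2": {"en": True, "zh": False},
    "entity-3": {"en": False, "zh": False},
    "entity-4": {"en": True, "zh": True},
}
leaks = suspected_leaks(demo_audit)
rate = leak_rate(demo_audit)
```

A non-trivial rate would suggest that some measured transfer is prior exposure, and the flagged entities are the natural place to start searching training corpora.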

Figures

Figures reproduced from arXiv: 2511.14774 by Chun-Chia Hsu, Kai-Wei Chang, Kai-Xin Chen, Mi-Yen Yeh, Nanyun Peng, Pei-Fu Guo, Shou-De Lin, Ya-An Tsai, Yun-Da Tsai.

**Figure 1.** Leakage Prevention in LiveCLKTBench. The pipeline prevents data leakage by selecting valid knowledge entities that contain facts unknown to pretrained models. Specifically, it identifies independent, time-sensitive real-world entities, filters them by temporal occurrence, and cross-checks them against model outputs to eliminate any entities already known to pretrained models.

**Figure 2.** LiveCLKTBench Pipeline. The generation process consists of four stages: (1) collecting independent, time-sensitive knowledge entities; (2) generating document-grounded question–answer pairs; (3) verifying data quality using a verifier LLM; and (4) translating verified questions into multiple languages for evaluation.

**Figure 3.** Language-level Transferability. Heatmaps show Transfer Scores for each (L_train, L_test) pair across models, sorted by average Overall score. Darker colors indicate stronger transferability. Training and Inference. All models are post-trained with a lightweight LoRA configuration (rank 16, α = 32, learning rate 5e-4, dropout 0.1, batch size 1, 5 epochs).

**Figure 4.** Effect of Model Size. Overall and Transfer Scores across model families of different parameter size, shown separately by domain.
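The LoRA settings quoted alongside Figure 3 can be written down as a small configuration object. This is a sketch only: the class and field names are hypothetical and not tied to any training library, the α/r scaling is the standard LoRA convention rather than something the excerpt states, and the step count assumes no gradient accumulation.

```python
import math
from dataclasses import dataclass


@dataclass(frozen=True)
class LoraRunConfig:
    # Values as reported in the excerpt; names are illustrative.
    rank: int = 16
    alpha: float = 32.0
    learning_rate: float = 5e-4
    dropout: float = 0.1
    batch_size: int = 1
    epochs: int = 5

    @property
    def scaling(self) -> float:
        """Standard LoRA scaling factor applied to the low-rank update (alpha / rank)."""
        return self.alpha / self.rank

    def optimizer_steps(self, num_examples: int) -> int:
        """Total updates, assuming one step per batch and no gradient accumulation."""
        return math.ceil(num_examples / self.batch_size) * self.epochs


cfg = LoraRunConfig()
```

With these values the low-rank update is scaled by 2.0, and a hypothetical training set of 200 examples would give 1000 optimizer steps.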

Original abstract

Evaluating cross-lingual knowledge transfer in large language models is challenging, as correct answers in a target language may arise either from genuine transfer or from prior exposure during pre-training. We present LiveCLKTBench, an automated generation pipeline specifically designed to isolate and measure cross-lingual knowledge transfer. Our pipeline identifies self-contained, time-sensitive knowledge entities from real-world domains, filters them based on temporal occurrence, and verifies them against the model's knowledge. The documents of these valid entities are then used to generate factual questions, which are translated into multiple languages to evaluate transferability across linguistic boundaries. Using LiveCLKTBench, we evaluate several LLMs across five languages and observe that cross-lingual transfer is strongly influenced by linguistic distance and often asymmetric across language directions. While larger models improve transfer, the gains diminish with scale and vary across domains. These findings provide new insights into multilingual transfer and demonstrate the value of LiveCLKTBench as a reliable benchmark for future research.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Desk Editor's Note (private letter to a colleague)

The paper gives a concrete pipeline for generating recent-entity questions to test cross-lingual transfer while trying to screen out pre-training exposure, and the reported patterns on linguistic distance and scale are worth a look, but the verification step is the weak link.


The main point is a pipeline that pulls time-sensitive entities from real documents, applies temporal filters to keep them recent, checks the model for prior knowledge, turns the documents into questions, and translates those questions across languages. They run this on several LLMs for five languages and note that transfer tracks linguistic closeness, shows direction asymmetry, and improves with size but with diminishing returns that also differ by domain.

The automated generation from filtered entities is the clearest new piece; earlier work flagged contamination but did not ship this exact combination of temporal cutoffs plus document-to-question steps. That makes the benchmark usable for people who want fresh test items without manual curation. The empirical observations are straightforward and give numbers to discuss, even if they rest on the pipeline working as intended.

The soft spot is the verification step itself. Prompting the model to confirm it does not know an entity can miss partial recall or produce unstable answers, and the abstract plus available details do not include a recall test on held-out known facts or error rates for the check. If some supposedly new entities were already in the training data, the transfer measurements mix genuine transfer with leakage. That does not invalidate the whole effort, but it means the claims about distance and scale need tighter controls before they can be taken as firm.

This is for groups working on multilingual model evaluation who need practical ways to reduce contamination. Readers focused on benchmark design will get the most from the pipeline description and the language-pair results. It deserves a serious referee because the evaluation problem is real, the method is reproducible enough to critique, and the observations are testable even if the isolation is not yet airtight.

Referee Report

2 major / 2 minor

Summary. The paper introduces LiveCLKTBench, an automated pipeline to generate evaluation data for cross-lingual knowledge transfer in multilingual LLMs. The pipeline selects self-contained time-sensitive knowledge entities from real-world domains, applies temporal occurrence filters, verifies that the entities are unknown to the target model, generates factual questions from the associated documents, and translates those questions across languages. Experiments on several LLMs across five languages report that transfer performance is strongly modulated by linguistic distance, frequently asymmetric by direction, improves with scale but with diminishing returns, and varies across domains.

Significance. If the verification step reliably isolates entities absent from pre-training, the benchmark would provide a valuable controlled instrument for studying genuine cross-lingual transfer mechanisms. The empirical observations on linguistic distance, directional asymmetry, and scale effects could inform multilingual model design and evaluation practices. The automated, temporally grounded construction is a methodological strength that supports reproducibility and future extensions.

major comments (2)

[Pipeline description / Verification step] The verification step (described in the pipeline overview and methods) is load-bearing for the central claim that measured transfer reflects genuine cross-lingual generalization rather than pre-training contamination. No quantitative validation—such as recall on a held-out set of known facts, precision against known memorized entities, or error analysis of false negatives—is reported. Without such evidence the reported effects of linguistic distance and asymmetry cannot be confidently attributed to transfer.
[Results and Analysis] The results section presents post-hoc observations on domain and scale effects without detailing the experimental controls used to isolate these factors from confounding variables such as question difficulty or translation quality. This weakens the claim that gains diminish with scale and vary across domains.
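The validation requested in the first major comment amounts to a confusion matrix over a labelled set. The sketch below assumes such a set exists, with each item marked as truly known or unknown to the model and as flagged known or unknown by the verifier; the function name and the counts are invented for illustration.

```python
from typing import Dict, Iterable, Tuple


def verifier_metrics(pairs: Iterable[Tuple[bool, bool]]) -> Dict[str, float]:
    """Score the verifier from (truly_known, flagged_known) pairs.

    A false negative is an entity the model knows but the verifier lets
    through as unknown, i.e. a potential source of leakage.
    """
    tp = fp = fn = tn = 0
    for truly_known, flagged_known in pairs:
        if truly_known and flagged_known:
            tp += 1
        elif truly_known:
            fn += 1
        elif flagged_known:
            fp += 1
        else:
            tn += 1
    recall = tp / (tp + fn) if (tp + fn) else 0.0
    precision = tp / (tp + fp) if (tp + fp) else 0.0
    return {
        "recall": recall,
        "precision": precision,
        "false_negatives": float(fn),
        "true_negatives": float(tn),
    }


# Invented labelled set: 10 truly known facts, 10 truly unknown facts.
demo_pairs = (
    [(True, True)] * 8      # known and correctly flagged
    + [(True, False)] * 2   # known but missed by the verifier
    + [(False, False)] * 9  # unknown and correctly passed
    + [(False, True)] * 1   # unknown but wrongly flagged
)
metrics = verifier_metrics(demo_pairs)
```

Recall on known facts is the number that bounds leakage: in this made-up example 2 of 10 known facts slip through, so a benchmark built with such a verifier would still contain contaminated items.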

minor comments (2)

[Abstract / Introduction] The abstract and introduction would benefit from a concise statement of the exact number of entities, questions, and languages used in the final benchmark to allow readers to assess scale immediately.
[Figure 1] Figure captions for the pipeline diagram should explicitly label the verification and temporal-filter stages to improve readability.

Simulated Author's Rebuttal

2 responses · 0 unresolved

We thank the referee for their constructive comments. We address each major comment below and indicate planned revisions to strengthen the manuscript.


Referee: [Pipeline description / Verification step] The verification step (described in the pipeline overview and methods) is load-bearing for the central claim that measured transfer reflects genuine cross-lingual generalization rather than pre-training contamination. No quantitative validation—such as recall on a held-out set of known facts, precision against known memorized entities, or error analysis of false negatives—is reported. Without such evidence the reported effects of linguistic distance and asymmetry cannot be confidently attributed to transfer.

Authors: We agree that quantitative validation of the verification step is necessary to support the central claims. The current manuscript describes the verification process used to confirm entities are unknown to the target model but does not report the requested metrics. In the revised version we will add a dedicated analysis subsection reporting recall on a held-out set of known facts, precision against known memorized entities, and an error analysis of false negatives. This addition will allow readers to assess the reliability of the verification and strengthen attribution of the observed linguistic-distance and asymmetry effects to cross-lingual transfer.

Revision: yes

Referee: [Results and Analysis] The results section presents post-hoc observations on domain and scale effects without detailing the experimental controls used to isolate these factors from confounding variables such as question difficulty or translation quality. This weakens the claim that gains diminish with scale and vary across domains.

Authors: We acknowledge that explicit documentation of experimental controls would improve the results section. Although the original experiments incorporated steps to mitigate confounds such as question difficulty and translation quality, these were not described in sufficient detail. In the revision we will expand the results and analysis sections to specify the controls employed, including sampling procedures for difficulty assessment and quality checks on translations, thereby providing clearer isolation of the reported scale and domain effects.

Revision: yes

Circularity Check

0 steps flagged

No circularity: empirical benchmark construction is self-contained

full rationale

The paper describes an automated pipeline for LiveCLKTBench that identifies time-sensitive knowledge entities from real-world domains, applies temporal occurrence filters, verifies entities against the target model's knowledge via prompting, generates factual questions from the documents, and translates them across languages. These steps constitute a data-generation procedure rather than a mathematical derivation or set of predictions. No equations are presented, no parameters are fitted to subsets of the evaluation data and then reused as 'predictions,' and no self-citation chain is invoked to justify a uniqueness theorem or ansatz. The reported observations on linguistic distance, asymmetry, and scale effects are direct empirical measurements obtained by running the pipeline on several LLMs; they do not reduce to the pipeline inputs by construction. The work is therefore self-contained as a benchmark-construction effort.

Axiom & Free-Parameter Ledger

0 free parameters · 1 axioms · 0 invented entities

The approach rests on the assumption that real-world time-sensitive facts can be automatically identified and verified as absent from model pre-training without introducing selection bias.

axioms (1)

domain assumption Temporal occurrence filtering combined with model verification isolates knowledge not present in pre-training data.
Invoked in the pipeline description to justify the isolation of genuine transfer.

pith-pipeline@v0.9.0 · 5729 in / 1144 out tokens · 29045 ms · 2026-05-18T00:53:18.621344+00:00 · methodology


Lean theorems connected to this paper

Citations machine-checked in the Pith Canon. Every link opens the source theorem in the public Lean library.

IndisputableMonolith/Cost/FunctionalEquation.lean washburn_uniqueness_aczel unclear


Our pipeline identifies self-contained, time-sensitive knowledge entities from real-world domains, filters them based on temporal occurrence, and verifies them against the model's knowledge.
IndisputableMonolith/Foundation/AbsoluteFloorClosure.lean reality_from_one_distinction unclear


cross-lingual transfer is strongly influenced by linguistic distance and often asymmetric across language directions

What do these tags mean?

matches: The paper's claim is directly supported by a theorem in the formal canon.
supports: The theorem supports part of the paper's argument, but the paper may add assumptions or extra steps.
extends: The paper goes beyond the formal theorem; the theorem is a base layer rather than the whole result.
uses: The paper appears to rely on the theorem as machinery.
contradicts: The paper's claim conflicts with a theorem or certificate in the canon.
unclear: Pith found a possible connection, but the passage is too broad, indirect, or ambiguous to say the theorem truly supports the claim.

Reference graph

Works this paper leans on

8 extracted references · 8 canonical work pages

[1]

Sanchit Ahuja, Varun Gumma, and Sunayana Sitaram

Mega: Multilingual evaluation of generative ai. Preprint, arXiv:2303.12528. Sanchit Ahuja, Varun Gumma, and Sunayana Sitaram

work page arXiv
[2]

arXiv preprint arXiv:2410.16186

Contamination report for multilingual benchmarks. Preprint, arXiv:2410.16186. Akari Asai, Jungo Kasai, Jonathan H. Clark, Kenton Lee, Eunsol Choi, and Hannaneh Hajishirzi. 2021. Xor qa: Cross-lingual open-retrieval question answering. Preprint, arXiv:2010.11856. Akari Asai, Sneha Kudugunta, Xinyan Velocity Yu, Terra Blevins, Hila Gonen, Machel Reid, Yul...

work page arXiv 2021
[3]

Linzheng Chai, Jian Yang, Tao Sun, Hongcheng Guo, Jiaheng Liu, Bing Wang, Xiannian Liang, Jiaqi Bai, Tongliang Li, Qiyao Peng, and Zhoujun Li

Buffet: Benchmarking large language models for few-shot cross-lingual transfer. Preprint, arXiv:2305.14857. Linzheng Chai, Jian Yang, Tao Sun, Hongcheng Guo, Jiaheng Liu, Bing Wang, Xiannian Liang, Jiaqi Bai, Tongliang Li, Qiyao Peng, and Zhoujun Li

work page arXiv
[4]

Jie Chen, Yupeng Zhang, Bingning Wang, Xin Zhao, Ji-Rong Wen, and Weipeng Chen

xcot: Cross-lingual instruction tuning for cross-lingual chain-of-thought reasoning. Preprint, arXiv:2401.07037. Jie Chen, Yupeng Zhang, Bingning Wang, Xin Zhao, Ji-Rong Wen, and Weipeng Chen. 2024. Unveiling the flaws: Exploring imperfections in synthetic data and mitigation strategies for large language models. In Findings of the Association for Computati...

work page arXiv 2024
[5]

In Proceedings of the 2020 Conference on Empirical Methods in Natural Language Processing (EMNLP), pages 2580–2591, Online

Transfer learning and distant supervision for multilingual transformer models: A study on African languages. In Proceedings of the 2020 Conference on Empirical Methods in Natural Language Processing (EMNLP), pages 2580–2591, Online. Association for Computational Linguistics. Junjie Hu, Sebastian Ruder, Aditya Siddhant, Graham Neubig, Orhan Firat, and Mel...

work page arXiv 2020
[6]

Radev, Noah A

Realtime qa: What’s the answer right now? Preprint, arXiv:2207.13332. Zixuan Ke, Yijia Shao, Haowei Lin, Tatsuya Konishi, Gyuhak Kim, and Bing Liu. 2023. Continual pre-training of language models. Preprint, arXiv:2302.03241. Anne Lauscher, Vinit Ravishankar, Ivan Vulić, and Goran Glavaš. 2020. From zero to hero: On the limitations of zero-shot langua...

work page arXiv 2023
[7]

Ercong Nie, Bo Shao, Zifeng Ding, Mingyang Wang, Helmut Schmid, and Hinrich Schütze

Beyond english: The impact of prompt translation strategies across languages and tasks in multilingual llms. Preprint, arXiv:2502.09331. Ercong Nie, Bo Shao, Zifeng Ding, Mingyang Wang, Helmut Schmid, and Hinrich Schütze

work page arXiv
[8]

In the movie: ’<title>’,

Bmike-53: Investigating cross-lingual knowledge editing with in-context learning. Preprint, arXiv:2406.17764. Sara Rajaee and Christof Monz. 2024. Analyzing the evaluation of cross-lingual knowledge transfer in multilingual language models. Preprint, arXiv:2402.02099. Uri Shaham, Jonathan Herzig, Roee Aharoni, Idan Szpektor, Reut Tsarfaty, and Matan Eya...

work page arXiv 2024
