pith. sign in

arxiv: 2606.18033 · v1 · pith:DIOGQDL5new · submitted 2026-06-16 · 💻 cs.CL · cs.AI

When English Isn't the Best Teacher: Source Language Effects in Cross-Lingual In-Context Learning

Pith reviewed 2026-06-27 00:28 UTC · model grok-4.3

classification 💻 cs.CL cs.AI
keywords cross-lingual transferin-context learningsource language selectionmultilingual NLPlanguage confusionfew-shot promptingtransfer patterns
0
0 comments X

The pith

Fine-tuning rules for source language choice do not hold in cross-lingual in-context learning.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper tests whether factors that make one language a good source for cross-lingual transfer under fine-tuning still hold when models instead receive a few in-context examples. Experiments across seven tasks, six models, and a typologically diverse language set reveal frequent mismatches in which languages perform best. This finding matters because practitioners can no longer default to English or to linguistically close languages when building prompts. The work also isolates language confusion as an extra obstacle that appears mainly in generative ICL settings.

Core claim

Conventional fine-tuning-based expectations do not consistently apply in the ICL regime. The authors show that linguistic similarity and high-resource status (such as English) do not reliably predict which source languages yield the strongest cross-lingual ICL results, and they identify language confusion as a distinct failure mode for generative tasks.

What carries the argument

Direct head-to-head comparison of source-language transfer rankings between fine-tuning and ICL, together with measurement of language confusion in generated outputs.

If this is right

  • English is frequently not the strongest source language for ICL demonstrations.
  • Linguistic similarity predicts ICL transfer less reliably than it predicts fine-tuning transfer.
  • Practitioners need new heuristics, not fine-tuning rules, to pick source languages for cross-lingual ICL.
  • Language confusion must be measured and mitigated separately when ICL is used for text generation.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

  • ICL may draw on different internal language representations than supervised fine-tuning does.
  • Prompt construction for multilingual applications could improve by using task-specific rather than fixed source-language rules.
  • The mismatch may also appear in zero-shot or chain-of-thought prompting variants.
  • Testing the same pattern on models larger than those studied would show whether scale changes the result.

Load-bearing premise

The typologically diverse language set and seven tasks chosen are representative enough that the observed mismatch between fine-tuning and ICL patterns will generalize to other models and tasks.

What would settle it

A follow-up experiment that finds the same source-language rankings in both fine-tuning and ICL across a new set of tasks or models would falsify the central mismatch claim.

Figures

Figures reproduced from arXiv: 2606.18033 by Fred Philippy, Jacques Klein, Siwen Guo, Tegawend\'e F. Bissyand\'e.

Figure 1
Figure 1. Figure 1: Correlation matrix of transfer patterns. Pear [PITH_FULL_IMAGE:figures/full_fig_p003_1.png] view at source ↗
Figure 2
Figure 2. Figure 2: Cross-lingual transfer matrix showing the mean z-score of model performance across language pairs. [PITH_FULL_IMAGE:figures/full_fig_p004_2.png] view at source ↗
Figure 3
Figure 3. Figure 3: Scatter plot of donor and recipient capabilities; [PITH_FULL_IMAGE:figures/full_fig_p004_3.png] view at source ↗
Figure 4
Figure 4. Figure 4: The four heatmaps illustrate the correla￾tion between linguistic similarity (syntactic, genetic, phonological, and featural) of source–target language pairs and cross-lingual transfer performance. Each heatmap corresponds to one lang2vec dimension and re￾ports the Pearson correlation coefficient for every bench￾mark (column) and model (row) combination. An aster￾isk (*) indicates statistical significance a… view at source ↗
Figure 6
Figure 6. Figure 6: Average Line-level Pass Rates (LPR) across models for different source-target language pairs. [PITH_FULL_IMAGE:figures/full_fig_p007_6.png] view at source ↗
Figure 7
Figure 7. Figure 7: Correlation matrix of language confusion pat [PITH_FULL_IMAGE:figures/full_fig_p007_7.png] view at source ↗
Figure 9
Figure 9. Figure 9: Prompt template used for XNLI. User Input {question} Provide the final answer in this format: ’**<final answer>**’. Do not include any other calculations or explanations. Do not use any measurement units. Assistant Prefix The final result is: **{answer}** [PITH_FULL_IMAGE:figures/full_fig_p011_9.png] view at source ↗
Figure 10
Figure 10. Figure 10: Prompt template used for MGSM. User Input Story: sentence_1 sentence_2 sentence_3 sentence_4 Which sentence is the correct ending to the story? Sentence 1: “{option_1}” Sentence 2: “{option_2}” Assistant Prefix The correct ending: Sentence {“1” / “2”} [PITH_FULL_IMAGE:figures/full_fig_p011_10.png] view at source ↗
Figure 11
Figure 11. Figure 11: Prompt template used for XStoryCloze. 11 [PITH_FULL_IMAGE:figures/full_fig_p011_11.png] view at source ↗
Figure 12
Figure 12. Figure 12: Prompt template used for SIB-200. User Input Are the following two sentences paraphrases of each other? Sentence 1: “{sentence_1}” Sentence 2: “{sentence_2}” Assistant Prefix Answer to the question whether the sentences are paraphrases: {“Yes” / “No”} [PITH_FULL_IMAGE:figures/full_fig_p012_12.png] view at source ↗
Figure 13
Figure 13. Figure 13: Prompt template used for PAWS-X. User Input What is the cause or effect of the following premise? Premise: “{premise}” Choice 1: “{choice_1}” Choice 2: “{choice_2}” Assistant Prefix The correct choice is: Choice {“1” / “2”} [PITH_FULL_IMAGE:figures/full_fig_p012_13.png] view at source ↗
Figure 14
Figure 14. Figure 14: Prompt template used for XCOPA. 12 [PITH_FULL_IMAGE:figures/full_fig_p012_14.png] view at source ↗
read the original abstract

Cross-lingual transfer in multilingual NLP has been widely explored in supervised fine-tuning contexts, where factors like data availability and linguistic similarity largely determine transfer quality. As the field shifts toward few-shot In-Context Learning (ICL), it is often presumed that insights from fine-tuning carry over unchanged. Yet this assumption has not been rigorously evaluated, leaving open the question of how to choose source languages for cross-lingual ICL. We conduct a broad empirical study of cross-lingual transfer in ICL spanning seven tasks, six models, and a typologically diverse set of languages. We further analyze language confusion, a key obstacle for generative tasks in cross-lingual ICL. Our results show that conventional fine-tuning-based expectations do not consistently apply in the ICL regime and point to alternative heuristics for selecting source languages effectively.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Referee Report

0 major / 0 minor

Summary. The manuscript conducts a broad empirical study of cross-lingual transfer in the in-context learning (ICL) regime. It spans seven tasks, six models, and a typologically diverse language set, comparing source-language effects against expectations derived from fine-tuning literature. The central finding is that conventional heuristics (e.g., those based on data availability or linguistic similarity) do not consistently predict ICL performance; the work additionally examines language confusion in generative tasks and proposes alternative selection heuristics.

Significance. If the empirical patterns hold, the result is significant because it demonstrates a divergence between supervised fine-tuning and ICL transfer behaviors in multilingual settings. The multi-task, multi-model design supplies a reasonably broad empirical base for the claim that fine-tuning-derived expectations cannot be assumed to carry over unchanged, which is timely given the growing reliance on ICL.

Simulated Author's Rebuttal

0 responses · 0 unresolved

We thank the referee for their thorough and positive evaluation of the manuscript. We are encouraged that the broad empirical design and the finding of divergence between fine-tuning heuristics and ICL behavior are considered significant and timely, and we appreciate the recommendation to accept.

Circularity Check

0 steps flagged

No significant circularity

full rationale

The paper is a direct empirical study comparing cross-lingual ICL performance across tasks, models, and languages, with results presented as observations from experiments rather than derived predictions or fitted parameters. No equations, self-definitional constructs, or load-bearing self-citations appear in the abstract or described structure; the central claim rests on observed mismatches between fine-tuning heuristics and ICL outcomes, which are externally falsifiable via replication on the reported setups. This is self-contained empirical work with no reduction of claims to inputs by construction.

Axiom & Free-Parameter Ledger

0 free parameters · 1 axioms · 0 invented entities

The central claim rests on the validity and representativeness of the empirical protocol rather than new theoretical constructs; no free parameters, invented entities, or non-standard axioms are described in the abstract.

axioms (1)
  • domain assumption The selected tasks, models, and languages are sufficiently representative to support generalization about ICL transfer behavior.
    The abstract invokes this representativeness to draw the conclusion that fine-tuning expectations do not apply in ICL.

pith-pipeline@v0.9.1-grok · 5682 in / 1155 out tokens · 38161 ms · 2026-06-27T00:28:28.690883+00:00 · methodology

discussion (0)

Sign in with ORCID, Apple, or X to comment. Anyone can read and Pith papers without signing in.

Reference graph

Works this paper leans on

38 extracted references · 20 canonical work pages · 1 internal anchor

  1. [1]

    2025 , eprint=

    Code-Switching In-Context Learning for Cross-Lingual Transfer of Large Language Models , author=. 2025 , eprint=

  2. [2]

    XAMPLER : Learning to Retrieve Cross-Lingual In-Context Examples

    Lin, Peiqin and Martins, Andre and Schuetze, Hinrich. XAMPLER : Learning to Retrieve Cross-Lingual In-Context Examples. Findings of the Association for Computational Linguistics: NAACL 2025. 2025. doi:10.18653/v1/2025.findings-naacl.221

  3. [3]

    2024 , eprint=

    Cross-lingual QA: A Key to Unlocking In-context Cross-lingual Performance , author=. 2024 , eprint=

  4. [4]

    Multilingual LLM s are Better Cross-lingual In-context Learners with Alignment

    Tanwar, Eshaan and Dutta, Subhabrata and Borthakur, Manish and Chakraborty, Tanmoy. Multilingual LLM s are Better Cross-lingual In-context Learners with Alignment. Proceedings of the 61st Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers). 2023. doi:10.18653/v1/2023.acl-long.346

  5. [5]

    Cross-lingual Few-Shot Learning on Unseen Languages

    Winata, Genta and Wu, Shijie and Kulkarni, Mayank and Solorio, Thamar and Preotiuc-Pietro, Daniel. Cross-lingual Few-Shot Learning on Unseen Languages. Proceedings of the 2nd Conference of the Asia-Pacific Chapter of the Association for Computational Linguistics and the 12th International Joint Conference on Natural Language Processing (Volume 1: Long Pap...

  6. [6]

    The Impact of Demonstrations on Multilingual In-Context Learning: A Multidimensional Analysis

    Zhang, Miaoran and Gautam, Vagrant and Wang, Mingyang and Alabi, Jesujoba and Shen, Xiaoyu and Klakow, Dietrich and Mosbach, Marius. The Impact of Demonstrations on Multilingual In-Context Learning: A Multidimensional Analysis. Findings of the Association for Computational Linguistics: ACL 2024. 2024. doi:10.18653/v1/2024.findings-acl.438

  7. [7]

    Blessing of Multilinguality: A Systematic Analysis of Multilingual In-Context Learning

    Tu, Yilei and Xue, Andrew and Shi, Freda. Blessing of Multilinguality: A Systematic Analysis of Multilingual In-Context Learning. Findings of the Association for Computational Linguistics: ACL 2025. 2025. doi:10.18653/v1/2025.findings-acl.323

  8. [8]

    LLM s Are Few-Shot In-Context Low-Resource Language Learners

    Cahyawijaya, Samuel and Lovenia, Holy and Fung, Pascale. LLM s Are Few-Shot In-Context Low-Resource Language Learners. Proceedings of the 2024 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies (Volume 1: Long Papers). 2024. doi:10.18653/v1/2024.naacl-long.24

  9. [9]

    XCOPA : A Multilingual Dataset for Causal Commonsense Reasoning

    Ponti, Edoardo Maria and Glava s , Goran and Majewska, Olga and Liu, Qianchu and Vuli \'c , Ivan and Korhonen, Anna. XCOPA : A Multilingual Dataset for Causal Commonsense Reasoning. Proceedings of the 2020 Conference on Empirical Methods in Natural Language Processing (EMNLP). 2020. doi:10.18653/v1/2020.emnlp-main.185

  10. [10]

    and Mao, Yanke and Gao, Haonan and Lee, En-Shiun Annie

    Adelani, David Ifeoluwa and Liu, Hannah and Shen, Xiaoyu and Vassilyev, Nikita and Alabi, Jesujoba O. and Mao, Yanke and Gao, Haonan and Lee, En-Shiun Annie. SIB -200: A Simple, Inclusive, and Big Evaluation Dataset for Topic Classification in 200+ Languages and Dialects. Proceedings of the 18th Conference of the European Chapter of the Association for Co...

  11. [11]

    PAWS - X : A Cross-lingual Adversarial Dataset for Paraphrase Identification

    Yang, Yinfei and Zhang, Yuan and Tar, Chris and Baldridge, Jason. PAWS - X : A Cross-lingual Adversarial Dataset for Paraphrase Identification. Proceedings of the 2019 Conference on Empirical Methods in Natural Language Processing and the 9th International Joint Conference on Natural Language Processing (EMNLP-IJCNLP). 2019. doi:10.18653/v1/D19-1382

  12. [12]

    Few-shot Learning with Multilingual Generative Language Models

    Lin, Xi Victoria and Mihaylov, Todor and Artetxe, Mikel and Wang, Tianlu and Chen, Shuohui and Simig, Daniel and Ott, Myle and Goyal, Naman and Bhosale, Shruti and Du, Jingfei and Pasunuru, Ramakanth and Shleifer, Sam and Koura, Punit Singh and Chaudhary, Vishrav and O ' Horo, Brian and Wang, Jeff and Zettlemoyer, Luke and Kozareva, Zornitsa and Diab, Mon...

  13. [13]

    URL https: //aclanthology.org/2025.acl-long.919/

    Singh, Shivalika and Romanou, Angelika and Fourrier, Cl \'e mentine and Adelani, David Ifeoluwa and Ngui, Jian Gang and Vila-Suero, Daniel and Limkonchotiwat, Peerat and Marchisio, Kelly and Leong, Wei Qi and Susanto, Yosephine and Ng, Raymond and Longpre, Shayne and Ruder, Sebastian and Ko, Wei-Yin and Bosselut, Antoine and Oh, Alice and Martins, Andre a...

  14. [14]

    findings-emnlp.630/

    Conneau, Alexis and Rinott, Ruty and Lample, Guillaume and Williams, Adina and Bowman, Samuel and Schwenk, Holger and Stoyanov, Veselin. XNLI : Evaluating Cross-lingual Sentence Representations. Proceedings of the 2018 Conference on Empirical Methods in Natural Language Processing. 2018. doi:10.18653/v1/D18-1269

  15. [15]

    2022 , eprint=

    Language Models are Multilingual Chain-of-Thought Reasoners , author=. 2022 , eprint=

  16. [16]

    First Align, then Predict: Understanding the Cross-Lingual Ability of Multilingual BERT

    Muller, Benjamin and Elazar, Yanai and Sagot, Beno \^i t and Seddah, Djam \'e. First Align, then Predict: Understanding the Cross-Lingual Ability of Multilingual BERT. Proceedings of the 16th Conference of the European Chapter of the Association for Computational Linguistics: Main Volume. 2021. doi:10.18653/v1/2021.eacl-main.189

  17. [17]

    Karthikeyan K and Zihan Wang and Stephen Mayhew and Dan Roth , title =. Proc. of the International Conference on Learning Representations , year =

  18. [18]

    Towards a Common Understanding of Contributing Factors for Cross-Lingual Transfer in Multilingual Language Models: A Review

    Philippy, Fred and Guo, Siwen and Haddadan, Shohreh. Towards a Common Understanding of Contributing Factors for Cross-Lingual Transfer in Multilingual Language Models: A Review. Proceedings of the 61st Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers). 2023. doi:10.18653/v1/2023.acl-long.323

  19. [19]

    How Multilingual is Multilingual BERT ?

    Pires, Telmo and Schlinger, Eva and Garrette, Dan. How Multilingual is Multilingual BERT ?. Proceedings of the 57th Annual Meeting of the Association for Computational Linguistics. 2019. doi:10.18653/v1/P19-1493

  20. [20]

    2025 , eprint=

    Balanced Multi-Factor In-Context Learning for Multilingual Large Language Models , author=. 2025 , eprint=

  21. [21]

    2025 , eprint=

    Qwen3 Technical Report , author=. 2025 , eprint=

  22. [22]

    2025 , eprint=

    Gemma 3 Technical Report , author=. 2025 , eprint=

  23. [23]

    Language Models are Few-Shot Learners , url =

    Brown, Tom and Mann, Benjamin and Ryder, Nick and Subbiah, Melanie and Kaplan, Jared D and Dhariwal, Prafulla and Neelakantan, Arvind and Shyam, Pranav and Sastry, Girish and Askell, Amanda and Agarwal, Sandhini and Herbert-Voss, Ariel and Krueger, Gretchen and Henighan, Tom and Child, Rewon and Ramesh, Aditya and Ziegler, Daniel and Wu, Jeffrey and Winte...

  24. [24]

    2025 , eprint=

    Donors and Recipients: On Asymmetric Transfer Across Tasks and Languages with Parameter-Efficient Fine-Tuning , author=. 2025 , eprint=

  25. [25]

    A Balanced Data Approach for Evaluating Cross-Lingual Transfer: Mapping the Linguistic Blood Bank

    Malkin, Dan and Limisiewicz, Tomasz and Stanovsky, Gabriel. A Balanced Data Approach for Evaluating Cross-Lingual Transfer: Mapping the Linguistic Blood Bank. Proceedings of the 2022 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies. 2022. doi:10.18653/v1/2022.naacl-main.361

  26. [26]

    and Lin, Ke and Kairis, Katherine and Turner, Carlisle and Levin, Lori

    Littell, Patrick and Mortensen, David R. and Lin, Ke and Kairis, Katherine and Turner, Carlisle and Levin, Lori. URIEL and lang2vec: Representing languages as typological, geographical, and phylogenetic vectors. Proceedings of the 15th Conference of the E uropean Chapter of the Association for Computational Linguistics: Volume 2, Short Papers. 2017

  27. [27]

    2022 , eprint=

    No Language Left Behind: Scaling Human-Centered Machine Translation , author=. 2022 , eprint=

  28. [28]

    International Conference on Machine Learning , pages=

    Similarity of neural network representations revisited , author=. International Conference on Machine Learning , pages=. 2019 , organization=

  29. [29]

    2023 , eprint=

    Bactrian-X: Multilingual Replicable Instruction-Following Models with Low-Rank Adaptation , author=. 2023 , eprint=

  30. [30]

    Understanding and Mitigating Language Confusion in LLM s

    Marchisio, Kelly and Ko, Wei-Yin and Berard, Alexandre and Dehaze, Th \'e o and Ruder, Sebastian. Understanding and Mitigating Language Confusion in LLM s. Proceedings of the 2024 Conference on Empirical Methods in Natural Language Processing. 2024. doi:10.18653/v1/2024.emnlp-main.380

  31. [31]

    arXiv preprint arXiv:1607.01759 , year=

    Bag of Tricks for Efficient Text Classification , author=. arXiv preprint arXiv:1607.01759 , year=

  32. [32]

    2025 , eprint=

    Phi-4-Mini Technical Report: Compact yet Powerful Multimodal Language Models via Mixture-of-LoRAs , author=. 2025 , eprint=

  33. [33]

    PAWS : Paraphrase Adversaries from Word Scrambling

    Zhang, Yuan and Baldridge, Jason and He, Luheng. PAWS : Paraphrase Adversaries from Word Scrambling. Proceedings of the 2019 Conference of the North A merican Chapter of the Association for Computational Linguistics: Human Language Technologies, Volume 1 (Long and Short Papers). 2019. doi:10.18653/v1/N19-1131

  34. [34]

    A corpus and cloze evaluation for deeper understanding of commonsense stories

    Mostafazadeh, Nasrin and Chambers, Nathanael and He, Xiaodong and Parikh, Devi and Batra, Dhruv and Vanderwende, Lucy and Kohli, Pushmeet and Allen, James. A Corpus and Cloze Evaluation for Deeper Understanding of Commonsense Stories. Proceedings of the 2016 Conference of the North A merican Chapter of the Association for Computational Linguistics: Human ...

  35. [35]

    2021 , eprint=

    Measuring Massive Multitask Language Understanding , author=. 2021 , eprint=

  36. [36]

    A broad-coverage challenge corpus for sentence understanding through inference

    Williams, Adina and Nangia, Nikita and Bowman, Samuel. A Broad-Coverage Challenge Corpus for Sentence Understanding through Inference. Proceedings of the 2018 Conference of the North A merican Chapter of the Association for Computational Linguistics: Human Language Technologies, Volume 1 (Long Papers). 2018. doi:10.18653/v1/N18-1101

  37. [37]

    2021 , eprint=

    Training Verifiers to Solve Math Word Problems , author=. 2021 , eprint=

  38. [38]

    2011 AAAI Spring Symposium Series , year=

    Choice of plausible alternatives: An evaluation of commonsense causal reasoning , author=. 2011 AAAI Spring Symposium Series , year=