arxiv: 2505.14990 · v3 · submitted 2025-05-21 · 💻 cs.CL

Language Specific Knowledge: Do Models Know Better in X than in English?

Pith reviewed 2026-05-22 14:44 UTC · model grok-4.3

classification 💻 cs.CL
keywords Language Specific Knowledge · multilingual language models · question answering · language selection · cultural norms · social norms · Gemma · Qwen

The pith

Switching the language of a query can improve a model's answers on cultural and social topics by accessing language-specific knowledge.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper demonstrates that multilingual language models sometimes perform better on certain questions when the query is written in a language other than English. It defines Language Specific Knowledge as the property that some queries have an expert language for a given model where accuracy rises. The authors test this idea on collections of cultural and behavioral norm questions and show that the best language depends on both the model and the topic. Gemma models, for example, handle China and Middle East facts more accurately in Spanish, while Qwen models do better on authority questions in Arabic or Chinese. If the pattern holds, simple language choice offers a way to raise performance without changing the underlying model.

Core claim

The central claim is that changing the language of the input query can improve the question-answering ability of language models. The authors introduce the term Language Specific Knowledge to denote queries that are best answered in an expert language for a given LLM. They also introduce the problem of language selection: for some queries, language models perform better when queried in languages other than English, sometimes even in low-resource languages, and the goal is to select the optimal language for the query.

What carries the argument

Language Specific Knowledge, the property that a given query about cultural or social norms is answered more accurately when posed in one particular language for a specific model.
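
One way to make this property precise is sketched below. This is a hedged formalization, not the paper's own notation: the symbols $M$ (a model), $\mathcal{L}$ (the set of query languages), $C$ (a group of related queries), and $\mathrm{Acc}_{M}(q,\ell) \in \{0,1\}$ (whether $M$ answers query $q$ correctly when it is posed in language $\ell$) are all assumptions introduced for illustration. Ties in the maximum are assumed to be broken in favor of English.

```latex
\[
  \ell^{*}_{M}(C) \;=\; \arg\max_{\ell \in \mathcal{L}} \; \frac{1}{|C|} \sum_{q \in C} \mathrm{Acc}_{M}(q, \ell),
  \qquad
  C \text{ exhibits LSK for } M \;\iff\; \ell^{*}_{M}(C) \neq \mathrm{en}.
\]
```

Defining the expert language over a group of queries rather than a single query keeps the accuracy term from being a bare 0 or 1, which is one reason a cluster-level view is convenient.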

If this is right

  • For some cultural questions, accuracy rises when the query is moved from English to another language.
  • Low-resource languages can outperform English for selected topics in particular models.
  • Different models exhibit different language-topic pairings, such as Gemma answering questions about China most accurately in Spanish.
  • Simple selection methods can raise question-answering scores without retraining.
  • Language choice becomes a practical lever for aligning models with the cultural contexts where they are used.
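
The "simple selection methods" point above can be illustrated with a minimal sketch: given held-out outcomes labelled by topic and query language, pick the language with the highest accuracy for each topic. This is not the paper's LSKExtractor or any of its baselines; the function names, the topic labels, and the toy records are hypothetical.

```python
from collections import defaultdict

def accuracy_table(records):
    """Aggregate (topic, language, correct) records into accuracy per (topic, language)."""
    hits = defaultdict(int)
    totals = defaultdict(int)
    for topic, lang, correct in records:
        hits[(topic, lang)] += int(correct)
        totals[(topic, lang)] += 1
    return {key: hits[key] / totals[key] for key in totals}

def select_expert_languages(records, default="en"):
    """For each topic, pick the language with the highest validation accuracy.

    Ties are broken in favour of the default language (an assumption of this sketch).
    """
    best = {}
    for (topic, lang), value in accuracy_table(records).items():
        current = best.get(topic)
        if (current is None or value > current[1]
                or (value == current[1] and lang == default)):
            best[topic] = (lang, value)
    return {topic: lang for topic, (lang, _) in best.items()}

# Hypothetical validation outcomes: (topic, query language, answered correctly?)
validation = [
    ("food", "en", True), ("food", "en", False),
    ("food", "es", True), ("food", "es", True),
    ("family", "en", True), ("family", "en", True),
    ("family", "es", True), ("family", "es", False),
]
expert = select_expert_languages(validation)
print(expert)  # {'food': 'es', 'family': 'en'}
```

The selection must be made on data separate from the test set; otherwise the reported gain partly reflects picking the maximum of noisy estimates.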

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

  • A lightweight router could be added in front of any model to pick the best language per query based on topic signals.
  • The pattern may indicate uneven coverage of the same facts across the training data of different languages.
  • Extending the test to purely factual knowledge outside norms would show whether the effect is limited to culturally loaded content.
  • If confirmed, training objectives could be adjusted to reduce the size of these language-specific gaps.
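
The router idea in the first bullet above could look roughly like the following sketch. It assumes a precomputed topic-to-language map and uses naive keyword overlap as the "topic signal"; both the keyword lists and the map are hypothetical placeholders, and translating the query into the chosen language is left out.

```python
import re

def detect_topic(query, topic_keywords):
    """Return the topic whose keyword set overlaps most with the query, or None."""
    words = set(re.findall(r"\w+", query.lower()))
    best_topic, best_overlap = None, 0
    for topic, keywords in topic_keywords.items():
        overlap = len(words & keywords)
        if overlap > best_overlap:
            best_topic, best_overlap = topic, overlap
    return best_topic

def route_language(query, topic_keywords, topic_to_language, default="en"):
    """Choose the query language for a question, falling back to the default."""
    topic = detect_topic(query, topic_keywords)
    return topic_to_language.get(topic, default)

# Hypothetical topic signals and a hypothetical per-model language map.
TOPIC_KEYWORDS = {
    "authority": {"boss", "teacher", "police"},
    "food": {"meal", "dinner", "breakfast"},
}
TOPIC_TO_LANGUAGE = {"authority": "ar", "food": "es"}

print(route_language("How should you greet your teacher?", TOPIC_KEYWORDS, TOPIC_TO_LANGUAGE))  # ar
print(route_language("What is the capital of France?", TOPIC_KEYWORDS, TOPIC_TO_LANGUAGE))      # en
```

A real router would likely replace the keyword match with an embedding-based or learned classifier, and the map would have to be estimated separately for each model.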

Load-bearing premise

The observed gains come from genuine differences in stored knowledge across languages rather than from tokenization length, prompt format, or biases in the norm datasets.

What would settle it

Measure accuracy on the same norm questions across languages while holding token count, prompt wording, and formatting fixed, then check whether the language-to-performance differences remain.
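
A minimal sketch of one such control follows, covering only the token-count part: compare two languages only within buckets of similar prompt length, so that a gap cannot be explained by one language simply producing shorter prompts. The record layout, the bucket width, and the toy data are assumptions made for illustration; prompt wording and formatting would need to be fixed separately.

```python
from collections import defaultdict

def accuracy_by_token_bucket(records, bucket_width=10):
    """Accuracy per (language, token-count bucket).

    records: iterable of (language, n_tokens, correct) tuples.
    """
    hits = defaultdict(int)
    totals = defaultdict(int)
    for lang, n_tokens, correct in records:
        bucket = n_tokens // bucket_width
        hits[(lang, bucket)] += int(correct)
        totals[(lang, bucket)] += 1
    return {key: hits[key] / totals[key] for key in totals}

def matched_gap(records, lang_a, lang_b, bucket_width=10):
    """Unweighted mean accuracy gap (lang_a minus lang_b) over buckets both languages populate.

    Returns None when the two languages share no bucket.
    """
    acc = accuracy_by_token_bucket(records, bucket_width)
    buckets_a = {b for (lang, b) in acc if lang == lang_a}
    buckets_b = {b for (lang, b) in acc if lang == lang_b}
    shared = sorted(buckets_a & buckets_b)
    if not shared:
        return None
    gaps = [acc[(lang_a, b)] - acc[(lang_b, b)] for b in shared]
    return sum(gaps) / len(gaps)

# Hypothetical outcomes: (language, prompt token count, answered correctly?)
records = [
    ("es", 12, True), ("es", 15, True),
    ("en", 11, True), ("en", 18, False),
    ("es", 34, False), ("en", 52, True),
]
print(matched_gap(records, "es", "en"))  # 0.5
```

If the gap shrinks toward zero once lengths are matched, tokenization efficiency is a plausible explanation; if it persists, the stored-knowledge reading gains support.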

Figures

Figures reproduced from arXiv: 2505.14990 by Dilek Hakkani-Tür, Ishika Agarwal, Nimet Beyza Bozdag, Nisval Patel.

Figure 1. In this toy experiment, we prompt Llama-3.1-8B-Instruct with the same question [...]
Figure 2. Illustrations of the seven different language selection methods we use in our LSK [...]
Figure 3. The main results of measuring LSK – we show the performance of our various [...]
Figure 4. Distribution of the languages selected by the [...]
Figure 5. A heatmap of the language selected by LSKExtractor for each cluster, each model, and each dataset. Accompanying cluster themes (truncated in the source):
  • Cluster 1. BLEND: Regional specialties & industries (livestock, agriculture, tourism). CultureAtlas: Eastern/Central European countries (Ukraine, Serbia, Czech Rep., etc.). SocialIQA: Basic daily activities & routine behaviors.
  • Cluster 2. BLEND: Commercial hubs & popular destinations. CultureAtlas: Western countries (France, Can…
Figure 6. Results on the effect of varying cluster size on the performance of [...]
Figure 7. An example of a reformatted CultureAtlas question. The original binary [...]
Figure 8. Prompt to the language model to perform with reasoning, in English. [...]
Figure 9. Prompt to the language model to perform with reasoning, in Turkish. [...]
Figure 10. Prompt to the language model to select the language expert for a given question, [...]
Figure 11. Prompt to GPT-4o-mini to translate the datasets into one of the 16 languages we [...]
Original abstract

Often, multilingual language models are trained with the objective to map semantically similar content (in different languages) in the same latent space. In this paper, we show a nuance in this training objective, and find that by changing the language of the input query, we can improve the question answering ability of language models. We make two main contributions. First, we introduce the term Language Specific Knowledge (LSK) to denote queries that are best answered in an "expert language" for a given LLM, thereby enhancing its question-answering ability. We introduce the problem of language selection -- for some queries, language models can perform better when queried in languages other than English, sometimes even better in low-resource languages -- and the goal is to select the optimal language for the query. Second, we introduce a variety of simple to strong baselines to empirically motivate the language selection problem (including one of our own methods called LSKExtractor). During our evaluation, we employ three datasets that contain knowledge about both cultural and social behavioral norms. Overall, the results show that principled language selection can improve the performance of a language model, and that the expected question-to-language map is not always intuitive: Gemma models know most about China and Middle East in Spanish; Qwen models know most about authority and responsibility in Arabic and Chinese. Broadly, our research contributes to the open-source development of language models that are inclusive and more aligned with the cultural and linguistic contexts in which they are deployed.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Referee Report

3 major / 2 minor

Summary. The paper claims that multilingual LLMs exhibit 'Language Specific Knowledge' (LSK), meaning that for queries about cultural and social norms, performance on question answering can improve when the input query is posed in a non-English 'expert language' rather than English. The authors introduce the language selection problem, propose baselines including their LSKExtractor method, and evaluate on three norm datasets, reporting that models such as Gemma perform best in Spanish for China/Middle East topics and Qwen in Arabic/Chinese for authority/responsibility, with overall gains from principled language choice.

Significance. If the central empirical claim holds after controlling for surface-form confounds, the work would usefully nuance the assumption that multilingual models map semantics into a shared space and would offer a practical lever for improving cultural alignment in deployed LLMs. The introduction of multiple baselines and the observation of non-intuitive language preferences constitute modest but concrete contributions to multilingual evaluation.

major comments (3)
  1. [Evaluation] The experimental setup does not report controls for prompt length, subword token count, or back-translation baselines that would isolate genuine language-specific knowledge from tokenization efficiency or translation artifacts. This directly affects the interpretation of the reported accuracy gains on the three cultural-norm datasets.
  2. [Results] No statistical significance tests, confidence intervals, or variance estimates across runs are described for the accuracy improvements, making it impossible to assess whether the differences between English and the reported expert languages (e.g., Spanish for Gemma) are reliable.
  3. [Introduction] The definition of LSK in the introduction remains informal; it is unclear how the term is operationalized to distinguish stored knowledge from dataset-specific biases in the cultural and social norm collections.
minor comments (2)
  1. [Introduction] The abstract and introduction would benefit from explicit citation of prior work on multilingual prompting and cultural bias in LLMs.
  2. [Results] Figure and table captions should clarify which model-dataset-language combinations are shown and whether results are averaged over multiple prompts.
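
Major comment 2 asks for confidence intervals on the accuracy differences. One common way to obtain them, shown here as a hedged sketch rather than anything the paper reports, is a paired percentile bootstrap over questions: resample question indices with replacement and recompute the accuracy difference each time. The function name, the resample count, and the toy outcomes are assumptions.

```python
import random

def paired_bootstrap_ci(correct_a, correct_b, n_resamples=2000, alpha=0.05, seed=0):
    """Percentile bootstrap CI for accuracy(A) - accuracy(B) on the same questions.

    correct_a, correct_b: equal-length sequences of booleans, one pair per question.
    Returns (point_estimate, lower, upper).
    """
    if len(correct_a) != len(correct_b) or not correct_a:
        raise ValueError("inputs must be non-empty and of equal length")
    rng = random.Random(seed)
    n = len(correct_a)
    # Per-question difference: +1 if only A is right, -1 if only B is right, else 0.
    diffs = [int(a) - int(b) for a, b in zip(correct_a, correct_b)]
    point = sum(diffs) / n
    stats = []
    for _ in range(n_resamples):
        sample = [diffs[rng.randrange(n)] for _ in range(n)]
        stats.append(sum(sample) / n)
    stats.sort()
    lower = stats[int((alpha / 2) * n_resamples)]
    upper = stats[int((1 - alpha / 2) * n_resamples) - 1]
    return point, lower, upper

# Hypothetical outcomes for the same 10 questions asked in Spanish and in English.
spanish = [True] * 8 + [False] * 2
english = [True] * 6 + [False] * 4
point, lower, upper = paired_bootstrap_ci(spanish, english)
print(f"difference = {point:.2f}, 95% CI = [{lower:.2f}, {upper:.2f}]")
```

Resampling pairs rather than each language independently preserves the fact that both languages were evaluated on the same questions, which usually tightens the interval.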

Simulated Author's Rebuttal

3 responses · 0 unresolved

We thank the referee for the detailed and constructive feedback. We address each major comment below and indicate where revisions will be made to improve the manuscript.

Point-by-point responses
  1. Referee: [Evaluation] The experimental setup does not report controls for prompt length, subword token count, or back-translation baselines that would isolate genuine language-specific knowledge from tokenization efficiency or translation artifacts. This directly affects the interpretation of the reported accuracy gains on the three cultural-norm datasets.

    Authors: We agree that these controls are important for isolating LSK effects. In the revised version we will add reporting of average prompt lengths and subword token counts per language-query pair. We will also introduce a back-translation baseline (translate to target language then back to English) and compare performance against the original English queries to help rule out translation artifacts. These additions will be placed in a new subsection of the experimental setup. revision: yes

  2. Referee: [Results] No statistical significance tests, confidence intervals, or variance estimates across runs are described for the accuracy improvements, making it impossible to assess whether the differences between English and the reported expert languages (e.g., Spanish for Gemma) are reliable.

    Authors: This is a fair criticism. We will rerun the main experiments with five random seeds and report mean accuracy with standard deviation, 95% confidence intervals, and p-values from McNemar’s test for pairwise comparisons between English and each expert language. These statistics will be added to Table 2 and the accompanying text. revision: yes

  3. Referee: [Introduction] The definition of LSK in the introduction remains informal; it is unclear how the term is operationalized to distinguish stored knowledge from dataset-specific biases in the cultural and social norm collections.

    Authors: We will revise the introduction to include a concise formal definition: LSK refers to the empirical observation that an LLM achieves higher accuracy on a given query when the query is expressed in one specific non-English language rather than English. We will explicitly note that operationalization occurs via controlled cross-language evaluation on three independent norm datasets, which reduces the chance that gains are driven by idiosyncrasies of any single collection. revision: partial
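
The McNemar's test mentioned in the second response can be sketched as follows. This is the exact (binomial) two-sided variant computed from the discordant pairs only; whether the authors would use the exact or the chi-square form is not stated, so treat the choice, the function name, and the toy data as assumptions.

```python
from math import comb

def mcnemar_exact(correct_a, correct_b):
    """Exact two-sided McNemar test on paired correctness outcomes.

    Returns (b, c, p_value), where b counts questions only A answers correctly
    and c counts questions only B answers correctly.
    """
    if len(correct_a) != len(correct_b):
        raise ValueError("inputs must have equal length")
    b = sum(1 for x, y in zip(correct_a, correct_b) if x and not y)
    c = sum(1 for x, y in zip(correct_a, correct_b) if y and not x)
    n = b + c
    if n == 0:
        # No discordant pairs: the two conditions are indistinguishable.
        return b, c, 1.0
    # Under the null, discordant pairs split as Binomial(n, 0.5).
    tail = sum(comb(n, k) for k in range(min(b, c) + 1)) / 2 ** n
    return b, c, min(1.0, 2 * tail)

# Hypothetical outcomes for the same 5 questions in an expert language and in English.
expert_lang = [True, True, True, True, False]
english = [False, False, False, False, True]
b, c, p = mcnemar_exact(expert_lang, english)
print(b, c, p)  # 4 1 0.375
```

With only a handful of discordant pairs the test has little power, which is one reason to report the discordant counts alongside the p-value.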

Circularity Check

0 steps flagged

No significant circularity; empirical evaluation stands on independent data

full rationale

The paper is an empirical study that introduces the LSK term and evaluates language selection on three cultural/social norm datasets using multiple baselines including a new LSKExtractor method. No equations, fitted parameters, or derivations are present that reduce any claimed improvement to a quantity defined from the same inputs. The central results rest on direct accuracy measurements across languages rather than self-referential definitions or self-citation chains that would force the outcome. Any self-citations (if present) are not load-bearing for the reported performance gains, which are externally falsifiable via the provided datasets and models.

Axiom & Free-Parameter Ledger

0 free parameters · 1 axiom · 1 invented entity

The central claim rests on the standard multilingual training assumption and the new concept of LSK; no free parameters or invented physical entities are introduced.

axioms (1)
  • domain assumption: Multilingual language models are trained with the objective to map semantically similar content in different languages to the same latent space.
    Explicitly stated as the starting training objective in the abstract.
invented entities (1)
  • Language Specific Knowledge (LSK): no independent evidence
    purpose: To label queries that are best answered in an expert language for a given LLM.
    New term coined in the paper to frame the observed language-dependent performance.

pith-pipeline@v0.9.0 · 5807 in / 1318 out tokens · 66180 ms · 2026-05-22T14:44:35.098992+00:00 · methodology


Appendix A. CultureAtlas Reformatting

The CultureAtlas dataset consists of cultural claims associated with specific countries, each annotated as either true or false. Because this binary classification setting is relatively simple and the dataset is imbalanced toward false claims, we reformatted it in...