Language Specific Knowledge: Do Models Know Better in X than in English?
Pith reviewed 2026-05-22 14:44 UTC · model grok-4.3
The pith
Switching the language of a query can improve a model's answers on cultural and social topics by accessing language-specific knowledge.
A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.
Core claim
The central claim is that by changing the language of the input query, we can improve the question answering ability of language models. We introduce the term Language Specific Knowledge to denote queries that are best answered in an expert language for a given LLM, thereby enhancing its question-answering ability. We introduce the problem of language selection: for some queries, language models can perform better when queried in languages other than English, sometimes even better in low-resource languages, and the goal is to select the optimal language for the query.
What carries the argument
Language Specific Knowledge, the property that a given query about cultural or social norms is answered more accurately when posed in one particular language for a specific model.
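One way to make this property concrete is to write the expert language as an argmax over candidate languages. The notation below is a hypothetical formalization for illustration only; the symbols are assumptions and are not taken from the paper.

```latex
% Hypothetical notation, not taken from the paper.
% M: a fixed model; \mathcal{L}: the set of candidate query languages;
% q_{\ell}: query q rendered in language \ell; a_q: the reference answer.
\[
  \ell^{*}(q, M) \;=\; \arg\max_{\ell \in \mathcal{L}}
  \Pr\bigl[\, M(q_{\ell}) = a_q \,\bigr]
\]
% Under this reading, q carries Language Specific Knowledge for M
% whenever \ell^{*}(q, M) is not English.
```

Under this reading, the language selection problem is the task of predicting the expert language for a query without access to the reference answer.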
If this is right
- For some cultural questions, accuracy rises when the query is moved from English to another language.
- Low-resource languages can outperform English for selected topics in particular models.
- Different models exhibit different language-topic pairings; for example, Gemma models answer questions about China best in Spanish.
- Simple selection methods can raise question-answering scores without retraining.
- Language choice becomes a practical lever for aligning models with the cultural contexts where they are used.
Where Pith is reading between the lines
- A lightweight router could be added in front of any model to pick the best language per query based on topic signals.
- The pattern may indicate uneven coverage of the same facts across the training data of different languages.
- Extending the test to purely factual knowledge outside norms would show whether the effect is limited to culturally loaded content.
- If confirmed, training objectives could be adjusted to reduce the size of these language-specific gaps.
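The router idea in the first bullet above can be sketched as a lookup table built from held-out results. This is a minimal illustration under stated assumptions, not the paper's LSKExtractor: the function name `build_language_router`, the `(topic, language, is_correct)` input format, and the assumption that a topic label is already available for each query are all hypothetical.

```python
from collections import defaultdict


def build_language_router(dev_results, default="en"):
    """Build a topic -> language router from held-out results.

    dev_results: iterable of (topic, language, is_correct) tuples
    (hypothetical format). Returns a function mapping a topic label to
    the language with the highest observed accuracy, falling back to
    `default` for unseen topics.
    """
    # (topic, language) -> [number correct, number seen]
    counts = defaultdict(lambda: [0, 0])
    for topic, lang, ok in dev_results:
        counts[(topic, lang)][0] += int(ok)
        counts[(topic, lang)][1] += 1

    # Keep the best language per topic; ties keep the first one seen.
    best = {}
    for (topic, lang), (correct, total) in counts.items():
        acc = correct / total
        if topic not in best or acc > best[topic][1]:
            best[topic] = (lang, acc)
    table = {topic: lang for topic, (lang, _) in best.items()}

    def route(topic):
        return table.get(topic, default)

    return route


# Toy, made-up results purely to exercise the function.
toy_dev = [
    ("authority", "en", True), ("authority", "en", False),
    ("authority", "ar", True), ("authority", "ar", True),
    ("food", "en", True),
]
route = build_language_router(toy_dev)
```

With the toy data, `route("authority")` returns `"ar"` and an unseen topic falls back to `"en"`. A real router would also need a topic detector and enough held-out data per topic for the accuracy estimates to be meaningful.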
Load-bearing premise
The observed gains come from genuine differences in stored knowledge across languages rather than from tokenization length, prompt format, or biases in the norm datasets.
What would settle it
Measure accuracy on the same norm questions across languages while holding token count, prompt wording, and formatting fixed, then check whether the language-to-performance differences remain.
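A rough version of the token-count part of this check can be sketched as follows. The code assumes per-item records with precomputed token counts (from the tokenizer of the model under test) and correctness flags; the record keys, the function name, and the 10% tolerance are hypothetical choices. It controls for length only, not for prompt wording or formatting.

```python
def length_matched_accuracy(records, tolerance=0.1):
    """Per-language accuracy on questions with comparable token counts.

    records: list of dicts with keys 'qid', 'lang', 'n_tokens', 'correct'
    (hypothetical schema). A question is kept only if its longest
    rendering is at most (1 + tolerance) times its shortest rendering.
    """
    # Group the renderings of each question across languages.
    by_question = {}
    for r in records:
        by_question.setdefault(r["qid"], []).append(r)

    totals = {}  # lang -> (number correct, number kept)
    for rows in by_question.values():
        lengths = [r["n_tokens"] for r in rows]
        if max(lengths) > (1 + tolerance) * min(lengths):
            continue  # token counts differ too much; drop this question
        for r in rows:
            correct, kept = totals.get(r["lang"], (0, 0))
            totals[r["lang"]] = (correct + int(r["correct"]), kept + 1)

    return {lang: c / n for lang, (c, n) in totals.items()}


# Toy, made-up records purely to exercise the function.
toy_records = [
    {"qid": "q1", "lang": "en", "n_tokens": 20, "correct": False},
    {"qid": "q1", "lang": "es", "n_tokens": 21, "correct": True},
    {"qid": "q2", "lang": "en", "n_tokens": 10, "correct": True},
    {"qid": "q2", "lang": "es", "n_tokens": 30, "correct": True},
]
matched = length_matched_accuracy(toy_records)
```

In the toy data, `q2` is dropped because its Spanish rendering is three times longer than the English one, so only `q1` contributes. If a language gap persists on the length-matched subset, tokenization length alone is a less likely explanation.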
Original abstract
Often, multilingual language models are trained with the objective to map semantically similar content (in different languages) in the same latent space. In this paper, we show a nuance in this training objective, and find that by changing the language of the input query, we can improve the question answering ability of language models. We make two main contributions. First, we introduce the term Language Specific Knowledge (LSK) to denote queries that are best answered in an "expert language" for a given LLM, thereby enhancing its question-answering ability. We introduce the problem of language selection -- for some queries, language models can perform better when queried in languages other than English, sometimes even better in low-resource languages -- and the goal is to select the optimal language for the query. Second, we introduce a variety of simple to strong baselines to empirically motivate the language selection problem (including one of our own methods called LSKExtractor). During our evaluation, we employ three datasets that contain knowledge about both cultural and social behavioral norms. Overall, the results show that principled language selection can improve the performance of a language model, and that the expected question-to-language map is not always intuitive: Gemma models know most about China and Middle East in Spanish; Qwen models know most about authority and responsibility in Arabic and Chinese. Broadly, our research contributes to the open-source development of language models that are inclusive and more aligned with the cultural and linguistic contexts in which they are deployed.
Editorial analysis
A structured set of objections, weighed in public.
Referee Report
Summary. The paper claims that multilingual LLMs exhibit 'Language Specific Knowledge' (LSK), meaning that for queries about cultural and social norms, performance on question answering can improve when the input query is posed in a non-English 'expert language' rather than English. The authors introduce the language selection problem, propose baselines including their LSKExtractor method, and evaluate on three norm datasets, reporting that models such as Gemma perform best in Spanish for China/Middle East topics and Qwen in Arabic/Chinese for authority/responsibility, with overall gains from principled language choice.
Significance. If the central empirical claim holds after controlling for surface-form confounds, the work would usefully nuance the assumption that multilingual models map semantics into a shared space and would offer a practical lever for improving cultural alignment in deployed LLMs. The introduction of multiple baselines and the observation of non-intuitive language preferences constitute modest but concrete contributions to multilingual evaluation.
major comments (3)
- [Evaluation] The experimental setup does not report controls for prompt length, subword token count, or back-translation baselines that would isolate genuine language-specific knowledge from tokenization efficiency or translation artifacts. This directly affects the interpretation of the reported accuracy gains on the three cultural-norm datasets.
- [Results] No statistical significance tests, confidence intervals, or variance estimates across runs are described for the accuracy improvements, making it impossible to assess whether the differences between English and the reported expert languages (e.g., Spanish for Gemma) are reliable.
- [Introduction] The definition of LSK in the introduction remains informal; it is unclear how the term is operationalized to distinguish stored knowledge from dataset-specific biases in the cultural and social norm collections.
minor comments (2)
- [Introduction] The abstract and introduction would benefit from explicit citation of prior work on multilingual prompting and cultural bias in LLMs.
- [Results] Figure and table captions should clarify which model-dataset-language combinations are shown and whether results are averaged over multiple prompts.
Simulated Author's Rebuttal
We thank the referee for the detailed and constructive feedback. We address each major comment below and indicate where revisions will be made to improve the manuscript.
Point-by-point responses
- Referee: [Evaluation] The experimental setup does not report controls for prompt length, subword token count, or back-translation baselines that would isolate genuine language-specific knowledge from tokenization efficiency or translation artifacts. This directly affects the interpretation of the reported accuracy gains on the three cultural-norm datasets.
  Authors: We agree that these controls are important for isolating LSK effects. In the revised version we will add reporting of average prompt lengths and subword token counts per language-query pair. We will also introduce a back-translation baseline (translate to target language then back to English) and compare performance against the original English queries to help rule out translation artifacts. These additions will be placed in a new subsection of the experimental setup.
  Revision: yes
- Referee: [Results] No statistical significance tests, confidence intervals, or variance estimates across runs are described for the accuracy improvements, making it impossible to assess whether the differences between English and the reported expert languages (e.g., Spanish for Gemma) are reliable.
  Authors: This is a fair criticism. We will rerun the main experiments with five random seeds and report mean accuracy with standard deviation, 95% confidence intervals, and p-values from McNemar’s test for pairwise comparisons between English and each expert language. These statistics will be added to Table 2 and the accompanying text.
  Revision: yes
- Referee: [Introduction] The definition of LSK in the introduction remains informal; it is unclear how the term is operationalized to distinguish stored knowledge from dataset-specific biases in the cultural and social norm collections.
  Authors: We will revise the introduction to include a concise formal definition: LSK refers to the empirical observation that an LLM achieves higher accuracy on a given query when the query is expressed in one specific non-English language rather than English. We will explicitly note that operationalization occurs via controlled cross-language evaluation on three independent norm datasets, which reduces the chance that gains are driven by idiosyncrasies of any single collection.
  Revision: partial
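The paired comparison mentioned in the second response (McNemar's test between English and an expert language) can be sketched with the standard library alone. This is an illustrative implementation of the exact, two-sided form of the test; the function name and input format are assumptions, and the authors' actual analysis is not described in the source.

```python
from math import comb


def mcnemar_exact(english_correct, expert_correct):
    """Exact two-sided McNemar test on paired per-question outcomes.

    Both arguments are equal-length sequences of booleans, one entry per
    question (hypothetical format). Returns (b, c, p_value), where
    b = questions only English got right and
    c = questions only the expert language got right.
    """
    pairs = list(zip(english_correct, expert_correct))
    b = sum(1 for en, ex in pairs if en and not ex)
    c = sum(1 for en, ex in pairs if ex and not en)
    n = b + c
    if n == 0:
        return b, c, 1.0  # no discordant pairs, nothing to test

    # Under the null, discordant pairs split as Binomial(n, 0.5).
    k = min(b, c)
    tail = sum(comb(n, i) for i in range(k + 1)) / 2 ** n
    return b, c, min(1.0, 2 * tail)


# Toy, made-up outcomes purely to exercise the function.
english = [True, False, False, False, False, False, True]
expert = [True, True, True, True, True, True, False]
b, c, p_value = mcnemar_exact(english, expert)
```

Only discordant pairs enter the test, so questions that both languages answer correctly (or both miss) do not affect the p-value. With several expert languages compared against English, a multiple-comparison correction would also be worth considering.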
Circularity Check
No significant circularity; empirical evaluation stands on independent data
full rationale
The paper is an empirical study that introduces the LSK term and evaluates language selection on three cultural/social norm datasets using multiple baselines including a new LSKExtractor method. No equations, fitted parameters, or derivations are present that reduce any claimed improvement to a quantity defined from the same inputs. The central results rest on direct accuracy measurements across languages rather than self-referential definitions or self-citation chains that would force the outcome. Any self-citations (if present) are not load-bearing for the reported performance gains, which are externally falsifiable via the provided datasets and models.
Axiom & Free-Parameter Ledger
axioms (1)
- Domain assumption: Multilingual language models are trained with the objective to map semantically similar content in different languages to the same latent space.
invented entities (1)
- Language Specific Knowledge (LSK): no independent evidence
Appendix A: CultureAtlas Reformatting (excerpt)
The CultureAtlas dataset consists of cultural claims associated with specific countries, each annotated as either true or false. Because this binary classification setting is relatively simple and the dataset is imbalanced toward false claims, we reformatted it in...