pith. sign in

arxiv: 2606.23164 · v1 · pith:QVAWT623new · submitted 2026-06-22 · 💻 cs.CL

Same question, different history: language, national identity, and credit in large language models

Pith reviewed 2026-06-26 08:33 UTC · model grok-4.3

classification 💻 cs.CL
keywords large language modelsnational identitycultural memorydisputed inventionslanguage effectscredit attributionbanal nationalismhistorical claims
0
0 comments X

The pith

The language of a query about a disputed invention determines which national claimant large language models credit.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

This paper tests whether the language in which a question is posed changes which historical figure large language models name as the inventor in cases of contested credit. Eleven models were prompted on twenty-one such disputes in twelve languages, generating nearly seventy-six thousand responses. The results show that query language shifts the surfaced claimant in a systematic way, with lower-status national figures appearing more often when the question matches their language, while English-associated figures stay consistent. The pattern holds after adjustments for length, model type, prominence, and commemoration levels, positioning language as the factor that selects between alternate national histories for the same event.

Core claim

Analysis of eleven large language models on twenty-one disputed inventions across twelve languages and 75,896 responses shows that while models often acknowledge the dispute, the language of the query systematically shifts which claimant is highlighted. Lower-status national figures appear more when the question is posed in their language, while prominent English-associated figures stay consistent. This pattern remains after adjustments for response length, model variations, historical importance, and national commemoration levels. The finding positions language as the mechanism that switches between different national histories for the same event, resulting in varied national memories.

What carries the argument

Query language as the switch that activates different national versions of the same disputed history in model responses.

If this is right

  • The same disputed history elicits different national claimants depending on the language used in the query.
  • Lower-status national figures receive greater visibility when questions are asked in their associated language.
  • Dominant Anglophone figures remain stable across languages even when credit is contested.
  • Large language models function as distributed systems of cultural memory where language conditions which histories become visible.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

  • The same conditioning could appear in non-invention domains such as political events or cultural achievements.
  • Multilingual users might receive different historical accounts simply by switching the language of their questions.
  • Auditing procedures for model outputs on contested facts may need to test across languages rather than in one language alone.
  • Training adjustments that balance representation of disputed claims across languages could reduce the observed effect.

Load-bearing premise

The controls for response length, model differences, historical prominence, and levels of national commemoration are sufficient to isolate the causal effect of query language on claimant selection.

What would settle it

An experiment repeating the queries on the same disputes with new models or additional controls that removes the systematic language-linked difference in claimant selection would falsify the claim.

Figures

Figures reproduced from arXiv: 2606.23164 by Jos\'e O. Gomes, Pierrick Bougault, Vitor D. de Moura, Wei Zhang, William Guey.

Figure 1
Figure 1. Figure 1: Claimant surfacing rate by query language. Each cell is the percentage of answers naming a given claimant when the dispute is asked in a given language; the red box marks the claimant’s associated language. Lower-status, non-English claimants (top rows) are named markedly more often in their associated language, while dominant English-language claimants (lower rows) are named at near-ceiling rates in every… view at source ↗
Figure 2
Figure 2. Figure 2: Inferential model of claimant naming. Odds ratios on a base-10 logarithmic scale, with 95 percent confidence intervals, from the cluster-robust logistic regression of whether a given claimant is named (n = 201,943 response-by-claimant observations across the 49 focal claimants; standard errors clustered by the 20 disputes that contain a focal claimant). Values above one favour the claimant being named. Ask… view at source ↗
Figure 3
Figure 3. Figure 3: How often models erase all rival claimants, by dispute. Bars show the percentage of coded responses, per dispute, that name a single claimant as settled fact. Erasure is rare overall but concentrated in disputes with a dominant English-language claimant and lower-status rivals. Magnetic resonance imaging is the negative control. with little. The open-form and head-to-head measures thus point the same way: … view at source ↗
Figure 4
Figure 4. Figure 4: The in-language advantage by commemoration and power. Each point is a focal claimant; the horizontal axis is the count of institutional-commemoration markers, the vertical axis is the in-language advantage in percentage points, and marker shape and color denote within-dispute power. The advantage concentrates among low-power claimants and remains positive even at zero commemoration. co-occurrence of a clai… view at source ↗
Figure 1
Figure 1. Figure 1: 30 [PITH_FULL_IMAGE:figures/full_fig_p030_1.png] view at source ↗
read the original abstract

Who invented the radio, Russia's Alexander Popov or Italy's Guglielmo Marconi? Was the telephone the achievement of Bell in the United States or Meucci in Italy? Does printing belong to China's Bi Sheng or Germany's Gutenberg? The answer depends not only on historical record but also on language and perspective. We analyse eleven widely used large language models across 21 disputed inventions and discoveries, evaluated in twelve languages and 75,896 responses. While models generally acknowledge that credit is contested, query language systematically affects which claimant is surfaced. Lower-status claimants are more likely to appear when questions are asked in their associated language, whereas dominant Anglophone figures remain stable across languages. These patterns persist after controlling for response length, model differences, historical prominence, and levels of national commemoration. Language thus acts as a switch that activates different national versions of the same history, producing systematically different national memories from the same question. We interpret this as evidence that large language models function as distributed systems of cultural memory, where language conditions which histories become visible, contributing to a computational form of banal nationalism.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Referee Report

2 major / 0 minor

Summary. The paper analyzes responses from 11 LLMs to 21 disputed inventions/discoveries posed in 12 languages (75,896 total responses). It reports that query language systematically influences which claimant is surfaced, with lower-status national figures more likely to appear in their associated language while dominant Anglophone figures remain stable; these patterns are claimed to hold after controlling for response length, model, historical prominence, and national commemoration levels. The work interprets LLMs as distributed cultural memory systems that enact a form of banal nationalism via language-conditioned history selection.

Significance. If the controls adequately isolate language's independent effect, the scale of the study (75k responses across models and languages) would provide valuable empirical evidence on how LLMs encode and surface national perspectives, with implications for AI as cultural infrastructure. The explicit framing as a measurement study of contested credit rather than a fitted model is a strength.

major comments (2)
  1. [Abstract] Abstract: the central claim that 'patterns persist after controlling for response length, model differences, historical prominence, and levels of national commemoration' is load-bearing for the causal interpretation that language acts as an independent 'switch,' yet no details are provided on the statistical methods, exact proxy variables for prominence and commemoration, data exclusion rules, or inter-rater reliability for response coding; without these the support for isolating language's effect cannot be assessed.
  2. [Abstract] The skeptic concern is warranted here: if the proxies for historical prominence and national commemoration are derived from English-centric or global sources rather than language-specific visibility metrics, they may be correlated with the independent variable (query language), leaving residual confounding that could produce the observed claimant patterns without language functioning as a causal switch.

Simulated Author's Rebuttal

2 responses · 0 unresolved

We thank the referee for the constructive comments on the abstract and the potential for residual confounding. We agree that additional detail is warranted to support the central claims and will revise the abstract accordingly while preserving its length constraints. We address each point below.

read point-by-point responses
  1. Referee: [Abstract] Abstract: the central claim that 'patterns persist after controlling for response length, model differences, historical prominence, and levels of national commemoration' is load-bearing for the causal interpretation that language acts as an independent 'switch,' yet no details are provided on the statistical methods, exact proxy variables for prominence and commemoration, data exclusion rules, or inter-rater reliability for response coding; without these the support for isolating language's effect cannot be assessed.

    Authors: We agree the abstract omits these specifics, which limits assessment of the controls. The full manuscript's Methods section specifies multilevel logistic regressions with the listed covariates, proxies drawn from language-specific Wikipedia page-view counts (for prominence) and official national commemoration records (for commemoration levels), exclusion of responses naming no claimant, and manual coding with inter-rater agreement reported. To address the concern directly, we will expand the abstract with a single sentence summarizing these elements so the load-bearing claim can be evaluated from the abstract alone. revision: yes

  2. Referee: [Abstract] The skeptic concern is warranted here: if the proxies for historical prominence and national commemoration are derived from English-centric or global sources rather than language-specific visibility metrics, they may be correlated with the independent variable (query language), leaving residual confounding that could produce the observed claimant patterns without language functioning as a causal switch.

    Authors: We share the concern about possible correlation between proxies and query language. The manuscript employs language-specific visibility metrics (per-language Wikipedia views) and country-level commemoration data wherever available; global sources were used only as supplements for low-resource languages. We will add an explicit limitations paragraph discussing residual confounding risk and report sensitivity checks that substitute alternative proxies. This strengthens rather than undermines the language-switch interpretation but acknowledges the point. revision: partial

Circularity Check

0 steps flagged

No circularity: empirical measurement study with independent data collection

full rationale

The paper reports an empirical analysis of 75,896 LLM responses to 21 questions across 12 languages, measuring how query language correlates with claimant selection after stated controls. No equations, fitted parameters renamed as predictions, or derivation steps appear in the provided text. The central claim rests on observed response patterns rather than any self-referential definition, self-citation chain, or ansatz. This is a standard measurement study whose validity depends on the quality of controls and data, not on internal reduction to inputs.

Axiom & Free-Parameter Ledger

0 free parameters · 2 axioms · 0 invented entities

The central claim rests on the empirical observation and the listed controls. No mathematical free parameters, invented entities, or formal axioms are described in the abstract.

axioms (2)
  • domain assumption The 21 selected inventions and discoveries constitute a representative sample of contested historical credit cases.
    The analysis depends on the choice of these specific cases as disputed.
  • domain assumption The 12 languages sufficiently capture national identity perspectives relevant to the claimants.
    The language-nationality linkage is treated as given for the interpretation.

pith-pipeline@v0.9.1-grok · 5742 in / 1359 out tokens · 16171 ms · 2026-06-26T08:33:18.381135+00:00 · methodology

discussion (0)

Sign in with ORCID, Apple, or X to comment. Anyone can read and Pith papers without signing in.

Reference graph

Works this paper leans on

41 extracted references · 21 canonical work pages

  1. [1]

    (2025) Language models' factuality depends on the language of inquiry

    Aggarwal T, Tanmay K, Agrawal A et al. (2025) Language models' factuality depends on the language of inquiry. In: International Conference on Learning Representations (ICLR) 2025. https://arxiv.org/abs/2502.17955

  2. [2]

    Verso, London

    Anderson B (1991) Imagined Communities: Reflections on the Origin and Spread of Nationalism, revised edn. Verso, London

  3. [3]

    New Ger Crit 65:125 to 133

    Assmann J, Czaplicka J (1995) Collective memory and cultural identity. New Ger Crit 65:125 to 133. https://doi.org/10.2307/488538

  4. [4]

    Sage, London

    Billig M (1995) Banal Nationalism. Sage, London

  5. [5]

    Language (

    Blodgett SL, Barocas S, Daum\'e III H, Wallach H (2020) Language (technology) is power: a critical survey of ``bias'' in NLP. In: Proceedings of ACL 2020, pp 5454 to 5476. https://doi.org/10.18653/v1/2020.acl-main.485

  6. [6]

    Cambridge University Press, Cambridge

    Brubaker R (1996) Nationalism Reframed: Nationhood and the National Question in the New Europe. Cambridge University Press, Cambridge

  7. [7]

    (2026) Large language models reflect the ideology of their creators

    Buyl M, Rogiers A, Noels S et al. (2026) Large language models reflect the ideology of their creators. npj Artif Intell 2:7. https://doi.org/10.1038/s44387-025-00048-0

  8. [8]

    (2025) Taxonomizing representational harms using speech act theory

    Corvi E, Washington H, Reed S et al. (2025) Taxonomizing representational harms using speech act theory. In: Findings of the Association for Computational Linguistics: ACL 2025. https://doi.org/10.18653/v1/2025.findings-acl.202

  9. [9]

    (2022) On measures of biases and harms in NLP

    Dev S, Sheng E, Zhao J et al. (2022) On measures of biases and harms in NLP. In: Findings of AACL-IJCNLP 2022, pp 246 to 267. https://doi.org/10.18653/v1/2022.findings-aacl.24

  10. [10]

    (2024) Towards measuring the representation of subjective global opinions in language models

    Durmus E, Nguyen K, Liao T et al. (2024) Towards measuring the representation of subjective global opinions in language models. In: Conference on Language Modeling (COLM) 2024. https://arxiv.org/abs/2306.16388

  11. [11]

    New Glob Stud 1(1):1 to 32

    Edgerton D (2007) The contradictions of techno-nationalism and techno-globalism: a historical perspective. New Glob Stud 1(1):1 to 32. https://doi.org/10.2202/1940-0004.1013

  12. [12]

    Oxford University Press, Oxford

    Fricker M (2007) Epistemic Injustice: Power and the Ethics of Knowing. Oxford University Press, Oxford

  13. [13]

    and Rossi, Ryan A

    Gallegos IO, Rossi RA, Barrow J et al. (2024) Bias and fairness in large language models: a survey. Comput Linguist 50(3):1097 to 1179. https://doi.org/10.1162/coli_a_00524

  14. [14]

    Princeton University Press, Princeton

    Gillis JR (ed) (1994) Commemorations: The Politics of National Identity. Princeton University Press, Princeton

  15. [15]

    University of Chicago Press, Chicago

    Halbwachs M (1992) On Collective Memory (Coser LA, ed and trans). University of Chicago Press, Chicago

  16. [16]

    Cambridge University Press, Cambridge

    Hobsbawm E, Ranger T (eds) (1983) The Invention of Tradition. Cambridge University Press, Cambridge

  17. [17]

    Routledge, London

    Hutchins RD (2016) Nationalism and History Education: Curricula and Textbooks in the United States and France. Routledge, London

  18. [18]

    BBC Books and Chatto & Windus, London

    Ignatieff M (1993) Blood and Belonging: Journeys into the New Nationalism. BBC Books and Chatto & Windus, London

  19. [19]

    (2020) The state and fate of linguistic diversity and inclusion in the NLP world

    Joshi P, Santy S, Budhiraja A et al. (2020) The state and fate of linguistic diversity and inclusion in the NLP world. In: Proceedings of ACL 2020, pp 6282 to 6293. https://doi.org/10.18653/v1/2020.acl-main.560

  20. [20]

    Nations Natl

    Kastoryano R (2025) Transnational nationalisms: reflections on nationalism and territory in globalization. Nations Natl. https://doi.org/10.1111/nana.13125

  21. [21]

    Benchmarking Cognitive Biases in Large Language Models as Evaluators

    Koo R, Lee M, Raheja V et al. (2024) Benchmarking cognitive biases in large language models as evaluators. In: Findings of the Association for Computational Linguistics: ACL 2024, pp 517 to 545. https://doi.org/10.18653/v1/2024.findings-acl.29

  22. [22]

    From Generation to Judgment: Opportunities and Challenges of LLM -as-a-judge

    Li D, Jiang B, Huang L et al. (2025) From generation to judgment: opportunities and challenges of LLM-as-a-judge. In: Proceedings of EMNLP 2025, pp 2757 to 2791. https://doi.org/10.18653/v1/2025.emnlp-main.138

  23. [23]

    G -Eval: NLG Evaluation using Gpt-4 with Better Human Alignment

    Liu Y, Iter D, Xu Y et al. (2023) G-Eval: NLG evaluation using GPT-4 with better human alignment. In: Proceedings of EMNLP 2023, pp 2511 to 2522. https://doi.org/10.18653/v1/2023.emnlp-main.153

  24. [24]

    Am Sociol Rev 22(6):635 to 659

    Merton RK (1957) Priorities in scientific discovery: a chapter in the sociology of science. Am Sociol Rev 22(6):635 to 659. https://doi.org/10.2307/2089193

  25. [25]

    Proc Am Philos Soc 105(5):470 to 486

    Merton RK (1961) Singletons and multiples in scientific discovery: a chapter in the sociology of science. Proc Am Philos Soc 105(5):470 to 486. https://www.jstor.org/stable/985546

  26. [26]

    Duke University Press, Durham

    Mignolo WD (2011) The Darker Side of Western Modernity: Global Futures, Decolonial Options. Duke University Press, Durham

  27. [27]

    (2024) BLEnD: a benchmark for LLMs on everyday knowledge in diverse cultures and languages

    Myung J, Lee N, Zhou Y et al. (2024) BLEnD: a benchmark for LLMs on everyday knowledge in diverse cultures and languages. In: Advances in Neural Information Processing Systems 37, Datasets and Benchmarks Track. https://arxiv.org/abs/2406.09948

  28. [28]

    Having Beer after Prayer? Measuring Cultural Bias in Large Language Models

    Naous T, Ryan MJ, Ritter A, Xu W (2024) Having beer after prayer? Measuring cultural bias in large language models. In: Proceedings of ACL 2024, pp 16366 to 16393. https://doi.org/10.18653/v1/2024.acl-long.862

  29. [29]

    Representations 26:7 to 24

    Nora P (1989) Between memory and history: les lieux de m\'emoire. Representations 26:7 to 24. https://doi.org/10.2307/2928520

  30. [30]

    In: Advances in Neural Information Processing Systems 37

    Panickssery A, Bowman SR, Feng S (2024) LLM evaluators recognize and favor their own generations. In: Advances in Neural Information Processing Systems 37. https://arxiv.org/abs/2404.13076

  31. [31]

    Washington, DC, 24 February 2026

    Pew Research Center (2026) How teens use and view AI. Washington, DC, 24 February 2026. https://www.pewresearch.org/internet/2026/02/24/how-teens-use-and-view-ai/

  32. [32]

    In: Proceedings of EMNLP 2023, pp 10650 to 10666

    Qi J, Fern\'andez R, Bisazza A (2023) Cross-lingual consistency of factual knowledge in multilingual language models. In: Proceedings of EMNLP 2023, pp 10650 to 10666. https://doi.org/10.18653/v1/2023.emnlp-main.658

  33. [33]

    (2023) Whose opinions do language models reflect? In: Proceedings of ICML 2023, PMLR 202

    Santurkar S, Durmus E, Ladhak F et al. (2023) Whose opinions do language models reflect? In: Proceedings of ICML 2023, PMLR 202. https://proceedings.mlr.press/v202/santurkar23a.html

  34. [34]

    Paradigm Publishers, Boulder

    de Sousa Santos B (2014) Epistemologies of the South: Justice Against Epistemicide. Paradigm Publishers, Boulder

  35. [35]

    Trans N Y Acad Sci 39(1):147 to 157

    Stigler SM (1980) Stigler's law of eponymy. Trans N Y Acad Sci 39(1):147 to 157. https://doi.org/10.1111/j.2164-0947.1980.tb02775.x

  36. [36]

    Proc IRE 50(10):2036 to 2047

    S\"usskind C (1962) Popov and the beginnings of radiotelegraphy. Proc IRE 50(10):2036 to 2047. https://doi.org/10.1109/JRPROC.1962.288232

  37. [37]

    PNAS Nexus 3(9):pgae346

    Tao Y, Viberg O, Baker RS, Kizilcec RF (2024) Cultural bias and cultural alignment of large language models. PNAS Nexus 3(9):pgae346. https://doi.org/10.1093/pnasnexus/pgae346

  38. [38]

    https://www.congress.gov/bill/107th-congress/house-resolution/269/text

    US House of Representatives (2002) H.Res.269, 107th Congress: honoring Antonio Meucci and his work in the invention of the telephone. https://www.congress.gov/bill/107th-congress/house-resolution/269/text

  39. [39]

    Large Language Models are not Fair Evaluators

    Wang P, Li L, Chen L et al. (2024) Large language models are not fair evaluators. In: Proceedings of ACL 2024, pp 9440 to 9450. https://doi.org/10.18653/v1/2024.acl-long.511

  40. [40]

    Knowledge conflicts for llms: A survey,

    Xu R, Qi Z, Guo Z et al. (2024) Knowledge conflicts for LLMs: a survey. In: Proceedings of EMNLP 2024, pp 8541 to 8565. https://doi.org/10.18653/v1/2024.emnlp-main.486

  41. [41]

    (2023) Judging LLM-as-a-judge with MT-Bench and Chatbot Arena

    Zheng L, Chiang W-L, Sheng Y et al. (2023) Judging LLM-as-a-judge with MT-Bench and Chatbot Arena. In: Advances in Neural Information Processing Systems 36, Datasets and Benchmarks Track. https://arxiv.org/abs/2306.05685