Social Policy of Large Language Models: How GPT, Claude, DeepSeek and Grok Allocate Social Budgets in Spain and Germany
Pith reviewed 2026-05-12 05:01 UTC · model grok-4.3
The pith
Large language models share a systematic bias in social budget allocation: they underfund pensions by a factor of nearly three while overfunding housing and employment.
A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.
Core claim
The four models share a systematic implicit social policy that diverges from real European spending structures: pensions are under-allocated by a factor close to three, while housing and employment are over-allocated by factors of four and two respectively. The main axis separating the models is concentration versus dispersion of the budget rather than geopolitical origin, and only Claude shows clear sensitivity to the national context supplied in the prompt. These patterns are confirmed through non-parametric statistical tests on the forty-eight independent allocations and through examination of the models' own textual justifications.
What carries the argument
Repeated identical prompting of each model to produce percentage allocations across twelve macro-areas of public expenditure, followed by direct numerical comparison to OECD reference budgets and statistical validation across models.
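The repeated-query design can be sketched as follows. This is a minimal illustration, not the paper's code: `query_model` is a hypothetical stub standing in for a chat-API call, and the reference shares are illustrative placeholders rather than the actual OECD figures.

```python
from statistics import mean

# Hypothetical stub standing in for a chat-API call. It returns a dict of
# percentage allocations over the macro-areas (values sum to 100); a real
# run would parse the model's text output instead.
def query_model(model: str, country: str) -> dict[str, float]:
    return {"pensions": 12.0, "housing": 16.0, "employment": 14.0, "other": 58.0}

# Illustrative OECD-style reference shares (NOT the paper's actual figures).
oecd_reference = {"pensions": 36.0, "housing": 4.0, "employment": 7.0, "other": 53.0}

def divergence_factors(model: str, country: str, n_replicates: int = 6) -> dict[str, float]:
    """Mean allocation over identical repeated queries, divided by the reference share."""
    replicates = [query_model(model, country) for _ in range(n_replicates)]
    factors = {}
    for area, ref in oecd_reference.items():
        avg = mean(r[area] for r in replicates)
        factors[area] = avg / ref  # <1 means under-allocated, >1 over-allocated
    return factors

factors = divergence_factors("gpt-4o", "Spain")
# With the placeholder numbers above, pensions come out under-allocated
# (12/36 = 1/3) and housing over-allocated (16/4 = 4).
```

A divergence factor is thus just the ratio of the mean generated share to the reference share, which is how "under-allocated by a factor close to three" reads here.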
If this is right
- Language models cannot be treated as neutral simulators for public budgeting without correction for their consistent deviations from observed spending.
- The shared pattern across models points to training-data influences that favor certain expenditure categories over pensions.
- Only limited sensitivity to national context appears, suggesting the implicit policy is largely uniform across countries.
- Model-to-model variation is driven more by how tightly or broadly the budget is spread than by any geopolitical alignment.
Where Pith is reading between the lines
- Training corpora may embed a preference for active labor-market and housing programs that is stronger than support for retirement systems.
- Prompt engineering or post-hoc calibration against real data could be required before LLMs are used in resource-planning tools.
- The result raises the question of how to audit other implicit policy preferences in models when they are applied to new domains.
- Extending the same repeated-query method to additional countries or spending categories could map the breadth of these embedded views.
Load-bearing premise
That identical repeated prompts produce stable allocations that reflect an underlying model social policy rather than prompt artifacts or generation variability.
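That premise can be probed directly from replicate dispersion. A minimal sketch, with illustrative synthetic replicates in place of real model output; the 5% coefficient-of-variation threshold is an arbitrary assumption for illustration, not the paper's.

```python
import numpy as np

# Six synthetic replicate allocations (%) for one model-country cell,
# standing in for parsed model outputs; values are purely illustrative.
replicates = np.array([
    [12.1, 15.8, 14.2, 57.9],
    [11.8, 16.3, 13.9, 58.0],
    [12.4, 15.9, 14.1, 57.6],
    [12.0, 16.1, 14.0, 57.9],
    [11.9, 16.0, 14.3, 57.8],
    [12.2, 15.9, 13.8, 58.1],
])  # columns: pensions, housing, employment, other

means = replicates.mean(axis=0)
stds = replicates.std(axis=0, ddof=1)   # sample standard deviation
cv = stds / means                        # coefficient of variation per macro-area

# Low CV across identical prompts supports the "stable embedded policy"
# reading; high CV would point to generation variability instead.
stable = bool((cv < 0.05).all())
```

Reporting such per-cell dispersion is also what the referee's first major comment asks for.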
What would settle it
If prompts that explicitly instruct the models to match published OECD spending shares, or that supply real budgets as worked examples, produce allocations close to actual data, the claim of a stable embedded policy would be challenged.
Original abstract
We study how four widely used large language models, namely Claude, GPT-4o, DeepSeek and Grok, distribute a fixed national social budget across twelve macro-areas of public expenditure under two European national contexts, Spain and Germany. Each combination of model and country is queried six times under identical prompts and generation parameters, producing forty-eight independent allocations that are compared against approximate Organisation for Economic Co-operation and Development (OECD) reference budgets and against each other. We formalise five hypotheses regarding geopolitical bias, housing under-allocation, structural convergence, sensitivity to national context, and under-representation of politically sensitive categories. The differences between models are then validated through Kruskal-Wallis tests on each macro-area, with post-hoc Mann-Whitney U comparisons under Bonferroni correction, and complemented by an analysis of pairwise Pearson correlations and a lexical examination of the textual justifications produced by each model. The results show that all four models share a systematic implicit social policy that diverges from real European spending structures: pensions are under-allocated by a factor close to three, while housing and employment are over-allocated by factors of four and two respectively. The principal axis of differentiation between models is not geopolitical, since Claude and DeepSeek are the most correlated pair across both countries, but rather a contrast between concentration and dispersion of the budget. Only Claude exhibits substantive sensitivity to the national context. The conclusions delimit the conditions under which language models may responsibly support, but not replace, expert deliberation in public budgeting.
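The validation pipeline described in the abstract, an omnibus Kruskal-Wallis test per macro-area followed by pairwise Mann-Whitney U comparisons under Bonferroni correction, can be sketched with scipy. The allocations below are synthetic stand-ins for the parsed model outputs.

```python
from itertools import combinations

import numpy as np
from scipy import stats

rng = np.random.default_rng(0)
models = ["claude", "gpt4o", "deepseek", "grok"]

# Synthetic replicate allocations (%) for ONE macro-area, n=6 per model,
# standing in for the paper's parsed percentage outputs.
alloc = {m: rng.normal(loc=10 + i, scale=1.0, size=6) for i, m in enumerate(models)}

# Omnibus Kruskal-Wallis test across the four models for this macro-area.
H, p = stats.kruskal(*alloc.values())

# Post-hoc pairwise Mann-Whitney U tests with Bonferroni correction.
pairs = list(combinations(models, 2))
alpha = 0.05 / len(pairs)  # 6 pairwise comparisons
results = {}
for a, b in pairs:
    U, p_pair = stats.mannwhitneyu(alloc[a], alloc[b], alternative="two-sided")
    results[(a, b)] = (U, p_pair, p_pair < alpha)
```

In the paper this loop would run once per macro-area and country; the sketch shows a single cell of that grid.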
Editorial analysis
A structured set of objections, weighed in public.
Referee Report
Summary. The manuscript examines how four LLMs (Claude, GPT-4o, DeepSeek, Grok) allocate a fixed social budget across twelve macro-areas under Spain and Germany contexts. Each model-country pair is queried six times with identical prompts, yielding 48 allocations that are compared to OECD reference budgets. Five hypotheses on geopolitical bias, housing, convergence, national sensitivity, and sensitive categories are tested via Kruskal-Wallis, post-hoc Mann-Whitney U (Bonferroni), Pearson correlations, and lexical analysis of justifications. The central result is that all models exhibit a shared implicit social policy diverging from real structures: pensions under-allocated by a factor of ~3, housing over-allocated by ~4, and employment by ~2; model differences are driven by concentration vs. dispersion rather than geopolitics, with only Claude showing national-context sensitivity.
Significance. If the allocations prove robust to prompt and sampling variation, the work supplies concrete evidence of systematic LLM biases in representing European social policy, with implications for responsible use in budgeting support. The multi-model, multi-country design, use of non-parametric tests, and correlation analysis provide a replicable template for auditing implicit preferences, though the small per-cell sample limits generalizability.
Major comments (3)
- [Abstract/Methods] The headline factors (pensions under-allocated by ~3, housing by ~4) are derived from means across n=6 identical-prompt replicates per cell, yet no standard deviations, ranges, or per-replicate tables are referenced; without these, it is impossible to determine whether the reported divergences are stable or sensitive to stochastic sampling.
- [Methods/Results] The interpretation of allocations as reflecting an 'implicit social policy' rather than prompt artifacts rests on the assumption that the fixed prompt wording and generation parameters elicit representative distributions; the manuscript contains no ablation on prompt phrasing, temperature, or macro-area framing, which directly bears on whether the factor-of-three/four claims generalize beyond the specific query template.
- [Results] With only n=6 per model-country pair, the Kruskal-Wallis and post-hoc tests have limited power to support claims of systematic convergence or the absence of geopolitical bias; the paper should report effect sizes or power calculations to substantiate that model differences (concentration vs. dispersion) are not under-powered artifacts.
Minor comments (2)
- [Results] The lexical examination of textual justifications is mentioned but not quantified or exemplified in sufficient detail to allow readers to assess how it complements the numerical allocations.
- [Methods] Clarify the exact twelve macro-areas and the precise OECD reference values used for the factor calculations, ideally in a table for direct comparison.
Simulated Author's Rebuttal
We thank the referee for the constructive comments, which identify key areas for improving the transparency and statistical rigor of our analysis. We will revise the manuscript accordingly to report variability, discuss prompt limitations, and incorporate effect sizes with power calculations.
Point-by-point responses
- Referee: [Abstract/Methods] The headline factors (pensions under-allocated by ~3, housing by ~4) are derived from means across n=6 identical-prompt replicates per cell, yet no standard deviations, ranges, or per-replicate tables are referenced; without these, it is impossible to determine whether the reported divergences are stable or sensitive to stochastic sampling.
  Authors: We agree that measures of variability are necessary to evaluate stability. In the revised manuscript we will report standard deviations and ranges alongside all mean allocations. We will also add a supplementary table presenting the six individual replicate values for each model-country-macro-area combination. (Revision: yes)
- Referee: [Methods/Results] The interpretation of allocations as reflecting an 'implicit social policy' rather than prompt artifacts rests on the assumption that the fixed prompt wording and generation parameters elicit representative distributions; the manuscript contains no ablation on prompt phrasing, temperature, or macro-area framing, which directly bears on whether the factor-of-three/four claims generalize beyond the specific query template.
  Authors: The fixed prompt was chosen to ensure comparability across models and countries. We nevertheless accept that this design leaves generalizability to other phrasings untested. The revision will expand the Methods section with a justification of the prompt template and add a Limitations subsection addressing sensitivity to wording, temperature, and framing. We will also include a brief sensitivity analysis using one alternative prompt for a subset of conditions. (Revision: partial)
- Referee: [Results] With only n=6 per model-country pair, the Kruskal-Wallis and post-hoc tests have limited power to support claims of systematic convergence or the absence of geopolitical bias; the paper should report effect sizes or power calculations to substantiate that model differences (concentration vs. dispersion) are not under-powered artifacts.
  Authors: We recognize the power limitations of n=6. The revised Results section will report effect sizes (eta-squared for Kruskal-Wallis and rank-biserial correlation for post-hoc Mann-Whitney U tests) together with the p-values. We will also add post-hoc power calculations based on the observed effects to qualify interpretations of convergence and the absence of geopolitical bias. (Revision: yes)
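The effect sizes promised in the rebuttal follow directly from the test statistics. A minimal sketch on synthetic data, using a common eta-squared estimate for Kruskal-Wallis and the rank-biserial correlation for Mann-Whitney U; the formulas shown are standard choices, not necessarily the exact ones the authors will adopt.

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(1)
# Synthetic allocations for one macro-area: k=4 models, 6 replicates each.
groups = [rng.normal(loc=10 + i, scale=1.0, size=6) for i in range(4)]
k = len(groups)                      # number of groups (models)
n = sum(len(g) for g in groups)      # total observations

H, _ = stats.kruskal(*groups)
# Common eta-squared estimate for the Kruskal-Wallis H statistic.
eta_sq = (H - k + 1) / (n - k)

# Rank-biserial correlation for one post-hoc Mann-Whitney U comparison.
U, _ = stats.mannwhitneyu(groups[0], groups[1], alternative="two-sided")
n1, n2 = len(groups[0]), len(groups[1])
rank_biserial = 1 - 2 * U / (n1 * n2)  # always within [-1, 1]
```

Both quantities are cheap to add alongside existing p-values, which is why the referee's request is easy to satisfy even at n=6.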
Circularity Check
No significant circularity
Full rationale
The paper performs direct empirical data collection by issuing identical prompts to four LLMs across two countries, obtaining 48 budget allocations, and comparing the resulting means to external OECD reference values using non-parametric tests and correlations. No parameters are fitted to the target divergences, no equations or derivations reduce the observed allocations to the inputs by construction, and no self-citations or prior author results are invoked as load-bearing premises for the central claims. The interpretation of an 'implicit social policy' follows from the collected data and external benchmarks without self-referential reduction.
Lean theorems connected to this paper
- `IndisputableMonolith/Cost/FunctionalEquation.lean` · `washburn_uniqueness_aczel` · tagged unclear
  The relation between the paper passage and the cited Recognition theorem is unclear.
  Linked passage: Kruskal–Wallis test statistics and p-values by macro-area and country.
What do these tags mean?
- matches: The paper's claim is directly supported by a theorem in the formal canon.
- supports: The theorem supports part of the paper's argument, but the paper may add assumptions or extra steps.
- extends: The paper goes beyond the formal theorem; the theorem is a base layer rather than the whole result.
- uses: The paper appears to rely on the theorem as machinery.
- contradicts: The paper's claim conflicts with a theorem or certificate in the canon.
- unclear: Pith found a possible connection, but the passage is too broad, indirect, or ambiguous to say the theorem truly supports the claim.