pith. machine review for the scientific record.

arxiv: 2605.10234 · v1 · submitted 2026-05-11 · 💻 cs.CY

Recognition: 2 theorem links · Lean Theorem

Social Policy of Large Language Models: How GPT, Claude, DeepSeek and Grok Allocate Social Budgets in Spain and Germany

Authors on Pith: no claims yet

Pith reviewed 2026-05-12 05:01 UTC · model grok-4.3

classification: 💻 cs.CY
keywords: large language models · social budget allocation · public expenditure · Spain · Germany · implicit policy · pensions · OECD comparison

The pith

Large language models share a systematic bias in social budget allocation, underfunding pensions by a factor of nearly three while overfunding housing and employment.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper tests how four major large language models divide a fixed social budget across twelve public spending categories under Spanish and German contexts. Each model is queried repeatedly with identical prompts, and the resulting allocations are compared statistically to real OECD spending data. All models show the same pattern of under-allocating to pensions and over-allocating to housing and employment programs. This matters because language models are starting to appear in policy simulation and advisory roles, so any embedded priorities they carry could shift resource decisions away from established democratic and expert benchmarks. The work also checks whether model differences track geopolitical lines or national contexts and finds limited evidence for either.

Core claim

The four models share a systematic implicit social policy that diverges from real European spending structures: pensions are under-allocated by a factor close to three, while housing and employment are over-allocated by factors of four and two respectively. The main axis separating the models is concentration versus dispersion of the budget rather than geopolitical origin, and only Claude shows clear sensitivity to the national context supplied in the prompt. These patterns are confirmed through non-parametric statistical tests on the forty-eight independent allocations and through examination of the models' own textual justifications.
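To make the headline factors concrete, here is a minimal sketch of the arithmetic in Python, using illustrative shares rather than the paper's actual OECD reference values or measured means (neither is reproduced in this review):

    # Illustrative divergence-factor arithmetic. The shares below are assumed
    # placeholders chosen to match the reported magnitudes, not figures taken
    # from the paper or from the OECD.
    OECD_SHARE = {"pensions": 0.32, "housing": 0.02, "employment": 0.03}
    MODEL_SHARE = {"pensions": 0.11, "housing": 0.08, "employment": 0.06}

    for area in OECD_SHARE:
        if MODEL_SHARE[area] < OECD_SHARE[area]:
            factor = OECD_SHARE[area] / MODEL_SHARE[area]
            print(f"{area}: under-allocated by a factor of {factor:.1f}")
        else:
            factor = MODEL_SHARE[area] / OECD_SHARE[area]
            print(f"{area}: over-allocated by a factor of {factor:.1f}")

With these placeholder shares the script prints factors of roughly three for pensions, four for housing and two for employment, matching the pattern the paper reports.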

What carries the argument

Repeated identical prompting of each model to produce percentage allocations across twelve macro-areas of public expenditure, followed by direct numerical comparison to OECD reference budgets and statistical validation across models.
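A minimal sketch of that protocol, assuming a hypothetical query_model wrapper and an illustrative prompt; the paper's exact template, generation parameters and macro-area names are not reproduced here:

    import json

    # 4 models x 2 countries x 6 runs = 48 independent allocations.
    MODELS = ["gpt-4o", "claude", "deepseek", "grok"]
    COUNTRIES = ["Spain", "Germany"]
    RUNS = 6

    PROMPT = ("Distribute a fixed national social budget for {country} across "
              "twelve macro-areas of public expenditure. Reply with a JSON "
              "object mapping each macro-area to a percentage; the "
              "percentages must sum to 100.")  # illustrative wording only

    def query_model(model: str, prompt: str) -> str:
        # Hypothetical stand-in for the relevant chat-completion API call.
        # A real run would send `prompt` to `model`; this placeholder just
        # returns a uniform dummy allocation so the sketch executes.
        areas = [f"area_{i}" for i in range(1, 13)]
        return json.dumps({a: round(100 / 12, 2) for a in areas})

    allocations = {}  # (model, country, run) -> {macro_area: percentage}
    for model in MODELS:
        for country in COUNTRIES:
            for run in range(RUNS):
                reply = query_model(model, PROMPT.format(country=country))
                allocations[(model, country, run)] = json.loads(reply)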

If this is right

  • Language models cannot be treated as neutral simulators for public budgeting without correction for their consistent deviations from observed spending.
  • The shared pattern across models points to training-data influences that favor certain expenditure categories over pensions.
  • Only limited sensitivity to national context appears, suggesting the implicit policy is largely uniform across countries.
  • Model-to-model variation is driven more by how tightly or broadly the budget is spread than by any geopolitical alignment.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

  • Training corpora may embed a preference for active labor-market and housing programs that is stronger than support for retirement systems.
  • Prompt engineering or post-hoc calibration against real data could be required before LLMs are used in resource-planning tools.
  • The result raises the question of how to audit other implicit policy preferences in models when they are applied to new domains.
  • Extending the same repeated-query method to additional countries or spending categories could map the breadth of these embedded views.

Load-bearing premise

That identical repeated prompts produce stable allocations that reflect an underlying model social policy rather than prompt artifacts or generation variability.

What would settle it

If new prompts that explicitly instruct the models to match published OECD spending shares, or that supply real budgets as worked examples, produce allocations close to the actual data, the claim of a stable embedded policy would be challenged.
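One way that check could be operationalised, sketched here with an assumed distance measure and tolerance (neither comes from the paper): score the calibrated allocations by their total variation distance from the published shares.

    # Hypothetical closeness check for the calibration experiment. The 5.0
    # percentage-point tolerance is an arbitrary illustration.
    def total_variation(alloc: dict, reference: dict) -> float:
        # Half the L1 distance between two percentage allocations.
        return 0.5 * sum(abs(alloc[a] - reference[a]) for a in reference)

    def matches_reference(alloc: dict, reference: dict, tol: float = 5.0) -> bool:
        return total_variation(alloc, reference) <= tol

If allocations produced under OECD-anchored prompts pass such a check while the original prompts do not, the divergence would look more like a correctable default than a fixed embedded policy.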

Figures

Figures reproduced from arXiv: 2605.10234 by Claudia Benavides Cantos and Eduardo C. Garrido-Merchán.

Figure 1. Heatmap of average allocations by sub-category and model. Left: Spain. Right: Germany. Each cell aggregates six independent runs per combination of model and country. The columns correspond to GPT-4o, Claude, DeepSeek and Grok.
Figure 2. Average allocation by the four language models contrasted with the approximate OECD reference structure for both Spain and Germany. Pensions are under-allocated by a factor close to three, while Housing and Employment are over-allocated by factors of four and two respectively.
Figure 3. Kruskal–Wallis test statistics and p-values by macro-area and country. The null hypothesis of equality of distributions across the four models is rejected in eleven out of eleven macro-areas in Spain and in nine out of eleven in Germany. Post-hoc pairwise Mann–Whitney U tests with Bonferroni correction localise the specific pairs of models responsible for the observed differences.
Figure 4. Distribution of the six runs by model in the macro-area of Pensions and Older People, where between-run variability is most pronounced. Claude, despite recording the highest average allocation to Pensions in Spain (25.6%), is also the most stable model in that macro-area, with a standard deviation of ±0.76 percentage points.
Figure 5. Adaptation of each model to the national context. The Spanish profile (continuous line) and the German profile (dashed line) overlap almost perfectly for GPT-4o and DeepSeek. Claude exhibits the largest inter-country difference, exchanging the leadership of Pensions and Health between scenarios. Grok shows a modest increase in Migration in Germany.
Original abstract

We study how four widely used large language models, namely Claude, GPT-4o, DeepSeek and Grok, distribute a fixed national social budget across twelve macro-areas of public expenditure under two European national contexts, Spain and Germany. Each combination of model and country is queried six times under identical prompts and generation parameters, producing forty-eight independent allocations that are compared against approximate Organisation for Economic Co-operation and Development (OECD) reference budgets and against each other. We formalise five hypotheses regarding geopolitical bias, housing under-allocation, structural convergence, sensitivity to national context, and under-representation of politically sensitive categories. The differences between models are then validated through Kruskal-Wallis tests on each macro-area, with post-hoc Mann-Whitney U comparisons under Bonferroni correction, and complemented by an analysis of pairwise Pearson correlations and a lexical examination of the textual justifications produced by each model. The results show that all four models share a systematic implicit social policy that diverges from real European spending structures: pensions are under-allocated by a factor close to three, while housing and employment are over-allocated by factors of four and two respectively. The principal axis of differentiation between models is not geopolitical, since Claude and DeepSeek are the most correlated pair across both countries, but rather a contrast between concentration and dispersion of the budget. Only Claude exhibits substantive sensitivity to the national context. The conclusions delimit the conditions under which language models may responsibly support, but not replace, expert deliberation in public budgeting.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, and this is the friction.

Referee Report

3 major / 2 minor

Summary. The manuscript examines how four LLMs (Claude, GPT-4o, DeepSeek, Grok) allocate a fixed social budget across twelve macro-areas under Spanish and German contexts. Each model-country pair is queried six times with identical prompts, yielding 48 allocations that are compared to OECD reference budgets. Five hypotheses on geopolitical bias, housing, convergence, national sensitivity, and sensitive categories are tested via Kruskal-Wallis, post-hoc Mann-Whitney U (Bonferroni), Pearson correlations, and lexical analysis of justifications. The central result is that all models exhibit a shared implicit social policy diverging from real structures: pensions under-allocated by a factor of ~3, housing over-allocated by ~4, and employment by ~2; model differences are driven by concentration vs. dispersion rather than geopolitics, with only Claude showing national-context sensitivity.

Significance. If the allocations prove robust to prompt and sampling variation, the work supplies concrete evidence of systematic LLM biases in representing European social policy, with implications for responsible use in budgeting support. The multi-model, multi-country design, use of non-parametric tests, and correlation analysis provide a replicable template for auditing implicit preferences, though the small per-cell sample limits generalizability.

major comments (3)
  1. [Abstract/Methods] The headline factors (pensions under-allocated by ~3, housing by ~4) are derived from means across n=6 identical-prompt replicates per cell, yet no standard deviations, ranges, or per-replicate tables are referenced; without these, it is impossible to determine whether the reported divergences are stable or sensitive to stochastic sampling.
  2. [Methods/Results] The interpretation of allocations as reflecting an 'implicit social policy' rather than prompt artifacts rests on the assumption that the fixed prompt wording and generation parameters elicit representative distributions; the manuscript contains no ablation on prompt phrasing, temperature, or macro-area framing, which directly bears on whether the factor-of-three/four claims generalize beyond the specific query template.
  3. [Results] With only n=6 per model-country pair, the Kruskal-Wallis and post-hoc tests have limited power to support claims of systematic convergence or the absence of geopolitical bias; the paper should report effect sizes or power calculations to substantiate that model differences (concentration vs. dispersion) are not under-powered artifacts.
minor comments (2)
  1. [Results] The lexical examination of textual justifications is mentioned but not quantified or exemplified in sufficient detail to allow readers to assess how it complements the numerical allocations.
  2. [Methods] Clarify the exact twelve macro-areas and the precise OECD reference values used for the factor calculations, ideally in a table for direct comparison.

Simulated Author's Rebuttal

3 responses · 0 unresolved

We thank the referee for the constructive comments, which identify key areas for improving the transparency and statistical rigor of our analysis. We will revise the manuscript accordingly to report variability, discuss prompt limitations, and incorporate effect sizes with power calculations.

Point-by-point responses
  1. Referee: [Abstract/Methods] The headline factors (pensions under-allocated by ~3, housing by ~4) are derived from means across n=6 identical-prompt replicates per cell, yet no standard deviations, ranges, or per-replicate tables are referenced; without these, it is impossible to determine whether the reported divergences are stable or sensitive to stochastic sampling.

    Authors: We agree that measures of variability are necessary to evaluate stability. In the revised manuscript we will report standard deviations and ranges alongside all mean allocations. We will also add a supplementary table presenting the six individual replicate values for each model-country-macro-area combination. revision: yes

  2. Referee: [Methods/Results] The interpretation of allocations as reflecting an 'implicit social policy' rather than prompt artifacts rests on the assumption that the fixed prompt wording and generation parameters elicit representative distributions; the manuscript contains no ablation on prompt phrasing, temperature, or macro-area framing, which directly bears on whether the factor-of-three/four claims generalize beyond the specific query template.

    Authors: The fixed prompt was chosen to ensure comparability across models and countries. We nevertheless accept that this design leaves generalizability to other phrasings untested. The revision will expand the Methods section with a justification of the prompt template and add a Limitations subsection addressing sensitivity to wording, temperature, and framing. We will also include a brief sensitivity analysis using one alternative prompt for a subset of conditions. revision: partial

  3. Referee: [Results] With only n=6 per model-country pair, the Kruskal-Wallis and post-hoc tests have limited power to support claims of systematic convergence or the absence of geopolitical bias; the paper should report effect sizes or power calculations to substantiate that model differences (concentration vs. dispersion) are not under-powered artifacts.

    Authors: We recognize the power limitations of n=6. The revised Results section will report effect sizes (eta-squared for Kruskal-Wallis and rank-biserial correlation for post-hoc Mann-Whitney U tests) together with the p-values. We will also add post-hoc power calculations based on the observed effects to qualify interpretations of convergence and the absence of geopolitical bias. revision: yes
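For responses 1 and 3, minimal sketches of the promised statistics, using standard textbook formulas rather than anything taken from the paper; the allocations structure is the one assumed in the collection sketch earlier on this page:

    import statistics
    from scipy.stats import kruskal, mannwhitneyu

    def cell_summary(allocations, model, country, area, runs=6):
        # Per-cell mean, sample standard deviation and range across the six
        # replicates (response 1).
        values = [allocations[(model, country, r)][area] for r in range(runs)]
        return {"mean": statistics.mean(values),
                "sd": statistics.stdev(values),
                "range": (min(values), max(values))}

    def kruskal_eta_squared(*groups):
        # Eta-squared effect size for Kruskal-Wallis: (H - k + 1) / (n - k),
        # with k groups and n total observations (response 3).
        h, _ = kruskal(*groups)
        k, n = len(groups), sum(len(g) for g in groups)
        return (h - k + 1) / (n - k)

    def rank_biserial(x, y):
        # Rank-biserial correlation from Mann-Whitney U: 1 - 2U / (n1 * n2).
        u, _ = mannwhitneyu(x, y, alternative="two-sided")
        return 1 - 2 * u / (len(x) * len(y))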

Circularity Check

0 steps flagged

No significant circularity

full rationale

The paper performs direct empirical data collection by issuing identical prompts to four LLMs across two countries, obtaining 48 budget allocations, and comparing the resulting means to external OECD reference values using non-parametric tests and correlations. No parameters are fitted to the target divergences, no equations or derivations reduce the observed allocations to the inputs by construction, and no self-citations or prior author results are invoked as load-bearing premises for the central claims. The interpretation of an 'implicit social policy' follows from the collected data and external benchmarks without self-referential reduction.

Axiom & Free-Parameter Ledger

0 free parameters · 0 axioms · 0 invented entities

The study is purely empirical and relies on external OECD data and direct model outputs; no free parameters, domain axioms, or invented entities are introduced or required by the central claim.

pith-pipeline@v0.9.0 · 5583 in / 1138 out tokens · 43106 ms · 2026-05-12T05:01:28.782554+00:00 · methodology

discussion (0)


Lean theorems connected to this paper

Citations machine-checked in the Pith Canon. Every link opens the source theorem in the public Lean library.


    doi: 10.1371/journal.pone.0306621. Shibani Santurkar, Esin Durmus, Faisal Ladhak, Cinoo Lee, Percy Liang, and Tatsunori Hashimoto. Whose opinions do language models reflect? InProceedings of the 40th Interna- tional Conference on Machine Learning (ICML),