pith. machine review for the scientific record.

arxiv: 2605.10234 · v1 · submitted 2026-05-11 · 💻 cs.CY

Recognition: 2 theorem links · Lean Theorem

Social Policy of Large Language Models: How GPT, Claude, DeepSeek and Grok Allocate Social Budgets in Spain and Germany

Authors on Pith: no claims yet

Pith reviewed 2026-05-12 05:01 UTC · model grok-4.3

classification: 💻 cs.CY
keywords: large language models · social budget allocation · public expenditure · Spain · Germany · implicit policy · pensions · OECD comparison

The pith

Large language models share a systematic bias in social budget allocation, underfunding pensions by a factor of nearly three while overfunding housing and employment.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper tests how four major large language models divide a fixed social budget across twelve public spending categories under Spanish and German contexts. Each model is queried repeatedly with identical prompts, and the resulting allocations are compared statistically to real OECD spending data. All models show the same pattern of under-allocating to pensions and over-allocating to housing and employment programs. This matters because language models are starting to appear in policy simulation and advisory roles, so any embedded priorities they carry could shift resource decisions away from established democratic and expert benchmarks. The work also checks whether model differences track geopolitical lines or national contexts and finds limited evidence for either.

Core claim

The four models share a systematic implicit social policy that diverges from real European spending structures: pensions are under-allocated by a factor close to three, while housing and employment are over-allocated by factors of four and two respectively. The main axis separating the models is concentration versus dispersion of the budget rather than geopolitical origin, and only Claude shows clear sensitivity to the national context supplied in the prompt. These patterns are confirmed through non-parametric statistical tests on the forty-eight independent allocations and through examination of the models' own textual justifications.
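To make the headline factors concrete, here is a minimal sketch of the arithmetic in Python, using illustrative shares rather than the paper's actual OECD reference values or measured means (neither is reproduced in this review):

    # Illustrative divergence-factor arithmetic. The shares below are assumed
    # placeholders chosen to match the reported magnitudes, not figures taken
    # from the paper or from the OECD.
    OECD_SHARE = {"pensions": 0.32, "housing": 0.02, "employment": 0.03}
    MODEL_SHARE = {"pensions": 0.11, "housing": 0.08, "employment": 0.06}

    for area in OECD_SHARE:
        if MODEL_SHARE[area] < OECD_SHARE[area]:
            factor = OECD_SHARE[area] / MODEL_SHARE[area]
            print(f"{area}: under-allocated by a factor of {factor:.1f}")
        else:
            factor = MODEL_SHARE[area] / OECD_SHARE[area]
            print(f"{area}: over-allocated by a factor of {factor:.1f}")

With these placeholder shares the script prints factors of roughly three for pensions, four for housing and two for employment, matching the pattern the paper reports.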

What carries the argument

Repeated identical prompting of each model to produce percentage allocations across twelve macro-areas of public expenditure, followed by direct numerical comparison to OECD reference budgets and statistical validation across models.
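A minimal sketch of that protocol, assuming a hypothetical query_model wrapper and an illustrative prompt; the paper's exact template, generation parameters and macro-area names are not reproduced here:

    import json

    # 4 models x 2 countries x 6 runs = 48 independent allocations.
    MODELS = ["gpt-4o", "claude", "deepseek", "grok"]
    COUNTRIES = ["Spain", "Germany"]
    RUNS = 6

    PROMPT = ("Distribute a fixed national social budget for {country} across "
              "twelve macro-areas of public expenditure. Reply with a JSON "
              "object mapping each macro-area to a percentage; the "
              "percentages must sum to 100.")  # illustrative wording only

    def query_model(model: str, prompt: str) -> str:
        # Hypothetical stand-in for the relevant chat-completion API call.
        # A real run would send `prompt` to `model`; this placeholder just
        # returns a uniform dummy allocation so the sketch executes.
        areas = [f"area_{i}" for i in range(1, 13)]
        return json.dumps({a: round(100 / 12, 2) for a in areas})

    allocations = {}  # (model, country, run) -> {macro_area: percentage}
    for model in MODELS:
        for country in COUNTRIES:
            for run in range(RUNS):
                reply = query_model(model, PROMPT.format(country=country))
                allocations[(model, country, run)] = json.loads(reply)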

If this is right

  • Language models cannot be treated as neutral simulators for public budgeting without correction for their consistent deviations from observed spending.
  • The shared pattern across models points to training-data influences that favor certain expenditure categories over pensions.
  • Only limited sensitivity to national context appears, suggesting the implicit policy is largely uniform across countries.
  • Model-to-model variation is driven more by how tightly or broadly the budget is spread than by any geopolitical alignment.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

  • Training corpora may embed a preference for active labor-market and housing programs that is stronger than support for retirement systems.
  • Prompt engineering or post-hoc calibration against real data could be required before LLMs are used in resource-planning tools.
  • The result raises the question of how to audit other implicit policy preferences in models when they are applied to new domains.
  • Extending the same repeated-query method to additional countries or spending categories could map the breadth of these embedded views.

Load-bearing premise

That identical repeated prompts produce stable allocations that reflect an underlying model social policy rather than prompt artifacts or generation variability.

What would settle it

If new prompts that explicitly instruct the models to match published OECD spending shares, or that supply real budgets as worked examples, produce allocations close to the actual data, the claim of a stable embedded policy would be challenged.
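One way that check could be operationalised, sketched here with an assumed distance measure and tolerance (neither comes from the paper): score the calibrated allocations by their total variation distance from the published shares.

    # Hypothetical closeness check for the calibration experiment. The 5.0
    # percentage-point tolerance is an arbitrary illustration.
    def total_variation(alloc: dict, reference: dict) -> float:
        # Half the L1 distance between two percentage allocations.
        return 0.5 * sum(abs(alloc[a] - reference[a]) for a in reference)

    def matches_reference(alloc: dict, reference: dict, tol: float = 5.0) -> bool:
        return total_variation(alloc, reference) <= tol

If allocations produced under OECD-anchored prompts pass such a check while the original prompts do not, the divergence would look more like a correctable default than a fixed embedded policy.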

Figures

Figures reproduced from arXiv: 2605.10234 by Claudia Benavides Cantos and Eduardo C. Garrido-Merchán.

Figure 1. Heatmap of average allocations by sub-category and model. Left: Spain. Right: Germany. Each cell aggregates six independent runs per combination of model and country. The columns correspond to GPT-4o, Claude, DeepSeek and Grok.
Figure 2. Average allocation by the four language models contrasted with the approximate OECD reference structure for both Spain and Germany. Pensions are under-allocated by a factor close to three, while Housing and Employment are over-allocated by factors of four and two respectively.
Figure 3. Kruskal–Wallis test statistics and p-values by macro-area and country. The null hypothesis of equality of distributions across the four models is rejected in eleven out of eleven macro-areas in Spain and in nine out of eleven in Germany. Post-hoc pairwise Mann–Whitney U tests with Bonferroni correction localise the specific pairs of models responsible for the observed differences.
Figure 4. Distribution of the six runs by model in the macro-area of Pensions and Older People, where between-run variability is most pronounced. Claude, despite recording the highest average allocation to Pensions in Spain (25.6%), is also the most stable model in that macro-area, with a standard deviation of ±0.76 percentage points.
Figure 5. Adaptation of each model to the national context. The Spanish profile (continuous line) and the German profile (dashed line) overlap almost perfectly for GPT-4o and DeepSeek. Claude exhibits the largest inter-country difference, exchanging the leadership of Pensions and Health between scenarios. Grok shows a modest increase in Migration in Germany.
Original abstract

We study how four widely used large language models, namely Claude, GPT-4o, DeepSeek and Grok, distribute a fixed national social budget across twelve macro-areas of public expenditure under two European national contexts, Spain and Germany. Each combination of model and country is queried six times under identical prompts and generation parameters, producing forty-eight independent allocations that are compared against approximate Organisation for Economic Co-operation and Development (OECD) reference budgets and against each other. We formalise five hypotheses regarding geopolitical bias, housing under-allocation, structural convergence, sensitivity to national context, and under-representation of politically sensitive categories. The differences between models are then validated through Kruskal-Wallis tests on each macro-area, with post-hoc Mann-Whitney U comparisons under Bonferroni correction, and complemented by an analysis of pairwise Pearson correlations and a lexical examination of the textual justifications produced by each model. The results show that all four models share a systematic implicit social policy that diverges from real European spending structures: pensions are under-allocated by a factor close to three, while housing and employment are over-allocated by factors of four and two respectively. The principal axis of differentiation between models is not geopolitical, since Claude and DeepSeek are the most correlated pair across both countries, but rather a contrast between concentration and dispersion of the budget. Only Claude exhibits substantive sensitivity to the national context. The conclusions delimit the conditions under which language models may responsibly support, but not replace, expert deliberation in public budgeting.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, and this is the friction.

Referee Report

3 major / 2 minor

Summary. The manuscript examines how four LLMs (Claude, GPT-4o, DeepSeek, Grok) allocate a fixed social budget across twelve macro-areas under Spanish and German contexts. Each model-country pair is queried six times with identical prompts, yielding 48 allocations that are compared to OECD reference budgets. Five hypotheses on geopolitical bias, housing, convergence, national sensitivity, and sensitive categories are tested via Kruskal-Wallis, post-hoc Mann-Whitney U (Bonferroni), Pearson correlations, and lexical analysis of justifications. The central result is that all models exhibit a shared implicit social policy diverging from real structures: pensions under-allocated by a factor of ~3, housing over-allocated by ~4, and employment by ~2; model differences are driven by concentration vs. dispersion rather than geopolitics, with only Claude showing national-context sensitivity.

Significance. If the allocations prove robust to prompt and sampling variation, the work supplies concrete evidence of systematic LLM biases in representing European social policy, with implications for responsible use in budgeting support. The multi-model, multi-country design, use of non-parametric tests, and correlation analysis provide a replicable template for auditing implicit preferences, though the small per-cell sample limits generalizability.

major comments (3)
  1. [Abstract/Methods] The headline factors (pensions under-allocated by ~3, housing by ~4) are derived from means across n=6 identical-prompt replicates per cell, yet no standard deviations, ranges, or per-replicate tables are referenced; without these, it is impossible to determine whether the reported divergences are stable or sensitive to stochastic sampling.
  2. [Methods/Results] The interpretation of allocations as reflecting an 'implicit social policy' rather than prompt artifacts rests on the assumption that the fixed prompt wording and generation parameters elicit representative distributions; the manuscript contains no ablation on prompt phrasing, temperature, or macro-area framing, which directly bears on whether the factor-of-three/four claims generalize beyond the specific query template.
  3. [Results] With only n=6 per model-country pair, the Kruskal-Wallis and post-hoc tests have limited power to support claims of systematic convergence or the absence of geopolitical bias; the paper should report effect sizes or power calculations to substantiate that model differences (concentration vs. dispersion) are not under-powered artifacts.
minor comments (2)
  1. [Results] The lexical examination of textual justifications is mentioned but not quantified or exemplified in sufficient detail to allow readers to assess how it complements the numerical allocations.
  2. [Methods] Clarify the exact twelve macro-areas and the precise OECD reference values used for the factor calculations, ideally in a table for direct comparison.

Simulated Author's Rebuttal

3 responses · 0 unresolved

We thank the referee for the constructive comments, which identify key areas for improving the transparency and statistical rigor of our analysis. We will revise the manuscript accordingly to report variability, discuss prompt limitations, and incorporate effect sizes with power calculations.

Point-by-point responses
  1. Referee: [Abstract/Methods] The headline factors (pensions under-allocated by ~3, housing by ~4) are derived from means across n=6 identical-prompt replicates per cell, yet no standard deviations, ranges, or per-replicate tables are referenced; without these, it is impossible to determine whether the reported divergences are stable or sensitive to stochastic sampling.

    Authors: We agree that measures of variability are necessary to evaluate stability. In the revised manuscript we will report standard deviations and ranges alongside all mean allocations. We will also add a supplementary table presenting the six individual replicate values for each model-country-macro-area combination. revision: yes

  2. Referee: [Methods/Results] The interpretation of allocations as reflecting an 'implicit social policy' rather than prompt artifacts rests on the assumption that the fixed prompt wording and generation parameters elicit representative distributions; the manuscript contains no ablation on prompt phrasing, temperature, or macro-area framing, which directly bears on whether the factor-of-three/four claims generalize beyond the specific query template.

    Authors: The fixed prompt was chosen to ensure comparability across models and countries. We nevertheless accept that this design leaves generalizability to other phrasings untested. The revision will expand the Methods section with a justification of the prompt template and add a Limitations subsection addressing sensitivity to wording, temperature, and framing. We will also include a brief sensitivity analysis using one alternative prompt for a subset of conditions. revision: partial

  3. Referee: [Results] With only n=6 per model-country pair, the Kruskal-Wallis and post-hoc tests have limited power to support claims of systematic convergence or the absence of geopolitical bias; the paper should report effect sizes or power calculations to substantiate that model differences (concentration vs. dispersion) are not under-powered artifacts.

    Authors: We recognize the power limitations of n=6. The revised Results section will report effect sizes (eta-squared for Kruskal-Wallis and rank-biserial correlation for post-hoc Mann-Whitney U tests) together with the p-values. We will also add post-hoc power calculations based on the observed effects to qualify interpretations of convergence and the absence of geopolitical bias. revision: yes
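For responses 1 and 3, minimal sketches of the promised statistics, using standard textbook formulas rather than anything taken from the paper; the allocations structure is the one assumed in the collection sketch earlier on this page:

    import statistics
    from scipy.stats import kruskal, mannwhitneyu

    def cell_summary(allocations, model, country, area, runs=6):
        # Per-cell mean, sample standard deviation and range across the six
        # replicates (response 1).
        values = [allocations[(model, country, r)][area] for r in range(runs)]
        return {"mean": statistics.mean(values),
                "sd": statistics.stdev(values),
                "range": (min(values), max(values))}

    def kruskal_eta_squared(*groups):
        # Eta-squared effect size for Kruskal-Wallis: (H - k + 1) / (n - k),
        # with k groups and n total observations (response 3).
        h, _ = kruskal(*groups)
        k, n = len(groups), sum(len(g) for g in groups)
        return (h - k + 1) / (n - k)

    def rank_biserial(x, y):
        # Rank-biserial correlation from Mann-Whitney U: 1 - 2U / (n1 * n2).
        u, _ = mannwhitneyu(x, y, alternative="two-sided")
        return 1 - 2 * u / (len(x) * len(y))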

Circularity Check

0 steps flagged

No significant circularity

full rationale

The paper performs direct empirical data collection by issuing identical prompts to four LLMs across two countries, obtaining 48 budget allocations, and comparing the resulting means to external OECD reference values using non-parametric tests and correlations. No parameters are fitted to the target divergences, no equations or derivations reduce the observed allocations to the inputs by construction, and no self-citations or prior author results are invoked as load-bearing premises for the central claims. The interpretation of an 'implicit social policy' follows from the collected data and external benchmarks without self-referential reduction.

Axiom & Free-Parameter Ledger

0 free parameters · 0 axioms · 0 invented entities

The study is purely empirical and relies on external OECD data and direct model outputs; no free parameters, domain axioms, or invented entities are introduced or required by the central claim.

pith-pipeline@v0.9.0 · 5583 in / 1138 out tokens · 43106 ms · 2026-05-12T05:01:28.782554+00:00 · methodology

discussion (0)


Lean theorems connected to this paper

Citations machine-checked in the Pith Canon. Every link opens the source theorem in the public Lean library.


    doi: 10.1371/journal.pone.0306621. Shibani Santurkar, Esin Durmus, Faisal Ladhak, Cinoo Lee, Percy Liang, and Tatsunori Hashimoto. Whose opinions do language models reflect? InProceedings of the 40th Interna- tional Conference on Machine Learning (ICML),