arXiv: 2509.07274 · v3 · submitted 2025-09-08 · 💻 cs.CL · cs.CY · cs.LG

LLM Analysis of 150+ years of German Parliamentary Debates on Migration Reveals Shift from Post-War Solidarity to Anti-Solidarity in the Last Decade

Pith reviewed 2026-05-18 17:24 UTC · model grok-4.3

classification: 💻 cs.CL, cs.CY, cs.LG
keywords: LLM annotation, solidarity, anti-solidarity, German parliamentary debates, migration, political discourse analysis, historical trends, bias correction

The pith

German parliamentary debates showed high postwar solidarity toward migrants but a sharp rise in anti-solidarity since 2015.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper establishes that large language models can annotate subtypes of solidarity and anti-solidarity across more than 150 years of German parliamentary debates on migration when their outputs are validated and statistically corrected for bias. A sympathetic reader would care because the resulting trends provide direct evidence of changing political language from group-based and compassionate solidarity in the postwar decades to exclusion, undeservingness, and resource-burden frames in recent years. The work overcomes the usual limit of small manually coded samples by scaling annotation while testing model size, prompting, fine-tuning, and systematic error patterns. It shows that the strongest models reach human-level agreement yet still require Design-based Supervised Learning correction to support reliable long-term inference.

Core claim

Using a theory-driven scheme, the strongest LLMs achieve human-level agreement on labeling solidarity and anti-solidarity subtypes in German parliamentary migration speech. Systematic errors persist and bias trend estimates, but combining soft-label outputs with Design-based Supervised Learning reduces that bias. The corrected annotations show relatively high levels of solidarity, especially group-based and compassionate forms, throughout the postwar period and a marked rise in anti-solidarity since 2015, expressed through exclusion, undeservingness, and resource burden.

What carries the argument

LLM annotation of solidarity and anti-solidarity subtypes, combined with Design-based Supervised Learning to correct systematic bias in long-term trend estimates.
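
The correction step can be illustrated with a small sketch. It assumes the standard design-based pseudo-outcome: the model's soft prediction plus an inverse-probability-weighted residual on the human-labeled subsample. The paper's exact estimator is not spelled out on this page, so the function name `dsl_mean`, the constant sampling probability `pi`, and the toy data are all hypothetical.

```python
from statistics import mean

def dsl_mean(soft_preds, gold, labeled, pi):
    """Bias-corrected estimate of a label's prevalence (illustrative only).

    soft_preds: model probability for the label, one per sentence
    gold:       human label (0/1) where available, else None
    labeled:    1 if the sentence was sampled for human annotation, else 0
    pi:         known sampling probability for human annotation (assumed constant)
    """
    pseudo = []
    for p, y, r in zip(soft_preds, gold, labeled):
        # Unlabeled sentences keep the model prediction; labeled ones add
        # an inverse-probability-weighted residual (y - p).
        correction = (r / pi) * (y - p) if r else 0.0
        pseudo.append(p + correction)
    return mean(pseudo)

# Toy data (hypothetical): the model over-predicts the label.
soft_preds = [0.9, 0.8, 0.7, 0.6, 0.9, 0.8, 0.7, 0.6]
gold = [1, None, 0, None, 1, None, 0, None]
labeled = [1, 0, 1, 0, 1, 0, 1, 0]

naive = mean(soft_preds)
corrected = dsl_mean(soft_preds, gold, labeled, pi=0.5)
print(f"naive={naive:.3f} corrected={corrected:.3f}")
```

In this toy case the naive average of soft labels overstates prevalence, and the residual term pulls the estimate back toward the rate seen in the human-labeled sentences. The correction relies on the sampling probability being known by design, which is why the labeled subsample must be drawn at random.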

If this is right

  • Political discourse on migrants moved from postwar solidarity toward exclusionary framing after 2015.
  • Group-based and compassionate solidarity dominated earlier debates while resource-burden arguments rose recently.
  • LLM-based annotation supports large-scale historical text analysis only when outputs are validated and bias-corrected.
  • Systematic model errors can distort downstream social-science inferences unless addressed by design-based methods.
  • The corrected labels enable tracing how solidarity subtypes changed across postwar displacement, labor migration, and recent refugee movements.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the authors make directly.

  • The same corrected-LLM pipeline could be applied to parliamentary records in other countries to test whether similar solidarity declines occurred after 2015.
  • The post-2015 rise in anti-solidarity may be linked to specific legislative events or crises that future work could align with the trend line.
  • If the framing shift holds, it suggests resource-burden arguments have become the dominant anti-solidarity register in contemporary German politics.
  • Longer time-series analysis could reveal whether solidarity levels follow cyclical patterns tied to economic conditions or demographic changes.

Load-bearing premise

LLM outputs retain enough validity for long-term historical trend inference after statistical correction despite remaining systematic errors.

What would settle it

Manual re-annotation of a large stratified sample of pre-2015 and post-2015 debate segments that finds no increase in anti-solidarity frames would falsify the reported shift.
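
One way such a check could be evaluated is a two-proportion z-test on the manually re-annotated samples. The sketch below is an assumption about how the comparison might be run, not a procedure taken from the paper, and the counts are invented for illustration.

```python
import math

def two_proportion_z(hits_a, n_a, hits_b, n_b):
    """z statistic and two-sided p-value for H0: both periods share one rate."""
    p_a, p_b = hits_a / n_a, hits_b / n_b
    pooled = (hits_a + hits_b) / (n_a + n_b)
    se = math.sqrt(pooled * (1 - pooled) * (1 / n_a + 1 / n_b))
    z = (p_b - p_a) / se
    # Two-sided p-value from the standard normal distribution via erfc.
    p_value = math.erfc(abs(z) / math.sqrt(2))
    return z, p_value

# Hypothetical counts of anti-solidarity sentences in each manual sample:
# sample A = pre-2015 segments, sample B = post-2015 segments.
z, p = two_proportion_z(hits_a=40, n_a=500, hits_b=90, n_b=500)
print(f"z={z:.2f}, p={p:.4g}")
```

A z value near zero (or a negative one) on a sufficiently large stratified sample would count against the reported shift. A stratified design would also call for weighting each stratum, which this simple pooled test leaves out.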

Figures

Figures reproduced from arXiv: 2509.07274 by Aida Kostikova, Benjamin Paassen, Ole Pütz, Olga Sabelfeld, Steffen Eger.

Figure 1: Overview of the annotation scheme adapted from Thijssen (2012). Statements are first classified at a high level as expressing solidarity, anti-solidarity, mixed, and none. Solidarity and anti-solidarity instances are then further categorized into subtypes based on the underlying rationale: group-based, exchange-based, compassionate, and empathic. The example sentence in the figure illustrates exchange-base…
Figure 2: Absolute number of instances (left) and their share relative to all DeuParl sentences (right) for the Woman and Migrant categories.
Figure 3: Distribution of instances in the human annotated dataset across time and target groups. Total indicates number of all annotated samples for the given target group, including the None label.
Figure 4: Model size vs. performance on high-level and fine-grained classification tasks in Migrant (T1). Each model is plotted with its best-performing configuration (zero- or few-shot). The x-axis is logarithmic, so spatial distances do not represent linear differences in parameter count.
Figure 5: F1 scores for high-level labels based on the best-performing configuration for each model. For the Migrant group, the two test sets (Test 1 and Test 2) are combined.
Figure 6: High-level confusion matrices showing the best-performing configuration for each model on the two migrant test sets (Test 1 and Test 2), combined. Human Annotation scores are computed by averaging the confusion matrices from four LOO comparisons, where each annotator is compared to the consensus of the others.
Figure 7: Solidarity trends for Migrant predicted by GPT-4 and Llama-3.3-70B. Values are normalized within each decade; percentages may not total 100% because instances classified as “None” are not shown (though included in the calculation). The period 1933–1949 (grey shaded area) is excluded due to limited data availability during and immediately after the Nazi dictatorship.
Figure 8: (Anti-)solidarity frames trends for Migrant predicted by GPT-4 and Llama-3.3-70B. Percentages are normalized within each decade across all eight subtypes; thus, the curves in both panels together sum to 100% per decade. The period from 1933 to 1949 (grey shaded area) is excluded due to limited data availability during and immediately after the NS dictatorship.
Figure 9: Distribution of (Anti-)solidarity mentions in Bundestag debates, 1949–1957 (raw counts). LAG refers to the Lastenausgleichsgesetz (en.: Equalisation of Burdens Law); BVFG to the Bundesvertriebenengesetz (en.: Federal Expellees Act); HAuslG to the Gesetz über die Rechtsstellung heimatloser Ausländer im Bundesgebiet (en.: Law on the Legal Status of Stateless Foreigners in the Federal Territory). Mentions of …
Figure 10: Distribution of (Anti-)solidarity frame subtypes from 1949 to 1957 (raw counts). An inset plot shows the evolving composition of solidarity subtypes, normalized within each 6-month period. Values represent the proportional share of each subtype among all solidarity mentions during that period.
Figure 11: Relative distribution of (Anti-)Solidarity subtypes across selected migrant-related keywords, 1949–1957. Bars represent the proportion of each subtype within each keyword group, thus all subtypes for a given group sum to 100%.
Figure 12: (Anti-)solidarity frames distribution across major parties in the Bundestag from 1949 to 1957, by election period. Percentages are normalized within each party, such that all subtype bars (solidarity and anti-solidarity) together sum to 100% per party. We include only those parties with a sufficient number of stanced instances (N > 100). The full label distribution across all parties is provided in …
Figure 13: Fig. 13a shows the fraction of solidarity, anti-solidarity, and mixed stance towards migrants from 2009 to 2025 (instances classified as “None” are not shown, though included in the calculation). Fig. 13b shows the fraction of solidarity (left) and anti-solidarity (right) subtypes according to Llama-3.3-70B, where percentages are normalized within each year across all eight subtypes (thus, the curves in b…
Figure 14: Raw yearly counts of parliamentary statements by selected political parties, 2009–2025.
Figure 15: Distribution of solidarity and anti-solidarity subtypes over time (2009–2025) for selected German parties. For each party, percentages are normalized within each year across all eight subtypes (thus, the curves in both panels together sum to 100% per year). Note that AfD was first represented in the Bundestag in 2017, and the FDP was not represented from 2013 to 2017. Therefore, data points are absent for…
Figure 16: Macro F1 scores over time for each model (best configuration), and average pairwise Cohen’s Kappa for human annotators. Evaluated on the full test set, including Frau and both test sets for Migrant, grouped by decade.
Figure 17: High-level confusion matrices showing the best-performing configuration for each model on the two migrant test sets (Test 1 and Test 2), combined. Human Annotation scores are computed by averaging the confusion matrices from four leave-one-out (LOO) comparisons, where each annotator is compared to the consensus of the others.
Figure 18: Fine-grained confusion matrices showing the best-performing configuration for each model on the two migrant test sets (Test 1 and Test 2), combined. Human Annotation scores are computed by averaging the confusion matrices from four leave-one-out (LOO) comparisons, where each annotator is compared to the consensus of the others.
Figure 19: Distribution of all Migrant keywords over the years, normalized per keyword. The keywords are sorted by frequency, which means that the reliability decreases towards the bottom-right.
Figure 20: Percentage of sentences showing solidarity/anti-solidarity per decade for all Migrant keywords from 1867 to 2025. The keywords are sorted by frequency, which means that the reliability decreases towards the bottom-right.
Figure 21: One-step zero-shot prompt set used for the evaluation of GPT-4 on the Migrant dataset.
Figure 22: One-step zero-shot prompt set used for the evaluation of GPT-4 on the Frau dataset.
Figure 23: (Anti-)solidarity trends in 1867-2022, restricted to the keywords Flüchtlinge and Ausländer. Figure 23a shows the fraction of solidarity (left) and anti-solidarity (right) subtypes predicted by Llama-3.3-70B. Percentages are normalized within each decade across all eight subtypes, so the curves in both panels together sum to 100% per decade. Figure 23b shows the aggregated proportions of solidarity, anti-…
Figure 24: Two-step few-shot prompt set used for the evaluation of open-weight models on the Migrant dataset and for the final large-scale annotation with Llama-3.3-70B. The model is first given the high-level classification prompt in Figure 25a. If the output is Solidarity or Anti-solidarity, the corresponding subtype prompt in Subfigure 25b or Subfigure 25c is used.
Figure 24 (continued): Two-step few-shot prompt set used for the evaluation of open-weight models on the Migrant dataset and for the final large-scale annotation with Llama-3.3-70B.
Figure 25: Two-step few-shot prompt set used for the evaluation of open-weight models on the Frau dataset. The model is first given the high-level classification prompt in Figure 25a. If the output is Solidarity or Anti-solidarity, the corresponding subtype prompt in Subfigure 25b or Subfigure 25c is used.
Figure 25 (continued): Two-step few-shot prompt set used for the evaluation of open-weight models on the Frau dataset.
Original abstract

Migration has been a core topic in German political debate, from the postwar displacement of millions of expellees to labor migration and recent refugee movements. Studying political speech across such wide-ranging phenomena in depth has traditionally required extensive manual annotation, limiting analysis to small subsets of the data. Large language models (LLMs) offer a potential way to overcome this constraint. Using a theory-driven annotation scheme, we examine how well LLMs annotate subtypes of solidarity and anti-solidarity in German parliamentary debates and whether the resulting labels support valid downstream inference. We first provide a comprehensive evaluation of multiple LLMs, analyzing the effects of model size, prompting strategies, fine-tuning, historical versus contemporary data, and systematic error patterns. We find that the strongest models, especially GPT-5 and gpt-oss-120B, achieve human-level agreement on this task, although their errors remain systematic and bias downstream results. To address this issue, we combine soft-label model outputs with Design-based Supervised Learning (DSL) to reduce bias in long-term trend estimates. Beyond the methodological evaluation, we interpret the resulting annotations from a social-scientific perspective to trace trends in solidarity and anti-solidarity toward migrants in postwar and contemporary Germany. Our approach shows relatively high levels of solidarity in the postwar period, especially in group-based and compassionate forms, and a marked rise in anti-solidarity since 2015, framed through exclusion, undeservingness, and resource burden. We argue that LLMs can support large-scale social-scientific text analysis, but only when their outputs are rigorously validated and statistically corrected.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Referee Report

2 major / 2 minor

Summary. The manuscript evaluates LLMs for annotating subtypes of solidarity and anti-solidarity in 150+ years of German parliamentary debates on migration. It reports human-level agreement for top models like GPT-5, identifies systematic errors, applies Design-based Supervised Learning (DSL) to combine soft labels and reduce bias for trend estimation, and interprets the corrected annotations as showing relatively high postwar solidarity (especially group-based and compassionate forms) with a marked rise in anti-solidarity since 2015 framed through exclusion, undeservingness, and resource burden.

Significance. If the DSL correction proves robust to historical language variation, the work would offer a scalable, theory-driven pipeline for large-scale historical discourse analysis that combines LLM annotation with statistical debiasing, enabling trend inference over periods where manual coding is impractical.

major comments (2)
  1. [Abstract / Methods] The claim that DSL mitigates bias sufficiently for valid long-term trend inference lacks quantitative details on correction performance (e.g., pre/post bias metrics, or the impact of residual error specifically on the 2015 shift). This is load-bearing for the central historical comparison.
  2. [Results / Evaluation] No era-specific validation is reported to test whether LLM error patterns (e.g., over-detection of exclusion frames in formal postwar German versus contemporary speech) are stationary. If the DSL correction is fit on mixed or recent data, it may not generalize to earlier periods and could distort the reported postwar-to-2015 solidarity shift.
minor comments (2)
  1. [Methods] Clarify the precise mathematical formulation of the soft-label + DSL combination and any assumptions about label noise stationarity.
  2. [Results] Add explicit before/after trend plots or tables showing the numerical effect of the DSL step on the key solidarity/anti-solidarity time series.

Simulated Author's Rebuttal

2 responses · 0 unresolved

We thank the referee for their constructive and detailed comments, which highlight important aspects of our methodological claims. We address each major comment point by point below and have revised the manuscript accordingly where the concerns are valid.

Point-by-point responses
  1. Referee: [Abstract / Methods] The claim that DSL mitigates bias sufficiently for valid long-term trend inference lacks quantitative details on correction performance (e.g., pre/post bias metrics, or the impact of residual error specifically on the 2015 shift). This is load-bearing for the central historical comparison.

    Authors: We agree that the manuscript would benefit from more explicit quantitative evidence on DSL performance to support the long-term inferences. In the revised version we have added a dedicated subsection reporting pre- and post-correction bias metrics (including mean bias reduction and confidence intervals) together with a direct assessment of the correction's impact on the estimated 2015 shift. These additions confirm a substantial bias reduction while preserving the direction and statistical significance of the observed change.

    Revision: yes

  2. Referee: [Results / Evaluation] No era-specific validation is reported to test whether LLM error patterns (e.g., over-detection of exclusion frames in formal postwar German versus contemporary speech) are stationary. If the DSL correction is fit on mixed or recent data, it may not generalize to earlier periods and could distort the reported postwar-to-2015 solidarity shift.

    Authors: The original manuscript already contains a comparison of model performance on historical versus contemporary subsets. To directly address stationarity, we have expanded the evaluation section with era-stratified metrics and a cross-era hold-out analysis. The results indicate that the dominant systematic error patterns remain sufficiently consistent across periods for the mixed-data DSL correction to be applied to the full series; we now discuss residual limitations of this assumption in the text.

    Revision: partial
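
Era-stratified evaluation of the kind described in this response could look like the following sketch. It is an illustration under assumed inputs (tuples of era, human label, model label), not the authors' evaluation code; the function names and the toy records are hypothetical.

```python
from collections import defaultdict

def macro_f1(pairs):
    """Macro-averaged F1 over the labels that occur in gold or predictions."""
    labels = {g for g, _ in pairs} | {p for _, p in pairs}
    scores = []
    for label in labels:
        tp = sum(1 for g, p in pairs if g == label and p == label)
        fp = sum(1 for g, p in pairs if g != label and p == label)
        fn = sum(1 for g, p in pairs if g == label and p != label)
        denom = 2 * tp + fp + fn
        scores.append(2 * tp / denom if denom else 0.0)
    return sum(scores) / len(scores)

def macro_f1_by_era(records):
    """Group (era, gold, pred) records by era and score each group separately."""
    by_era = defaultdict(list)
    for era, gold, pred in records:
        by_era[era].append((gold, pred))
    return {era: macro_f1(pairs) for era, pairs in by_era.items()}

# Hypothetical records: (era, human label, model label).
records = [
    ("postwar", "solidarity", "solidarity"),
    ("postwar", "anti-solidarity", "solidarity"),
    ("postwar", "none", "none"),
    ("recent", "solidarity", "solidarity"),
    ("recent", "anti-solidarity", "anti-solidarity"),
    ("recent", "none", "none"),
]
result = macro_f1_by_era(records)
print(result)
```

A large gap between eras in such a table would suggest that error patterns are not stationary, which is the situation the referee's second comment is concerned with.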

Circularity Check

0 steps flagged

No significant circularity; trends derived from externally validated LLM annotations plus statistical debiasing

Full rationale

The paper's derivation chain begins with LLM annotation of solidarity/anti-solidarity subtypes, followed by direct comparison to human agreement benchmarks on held-out data, identification of systematic errors, and application of Design-based Supervised Learning (DSL) for bias correction before trend estimation. None of these steps reduce by construction to the target historical trends: the human benchmarks are independent external labels, DSL is a general statistical correction procedure applied to the observed error patterns rather than a fit tuned to the postwar-to-2015 contrast, and the final social-scientific interpretation follows from the corrected labels without redefining the inputs in terms of the outputs. No self-citation is load-bearing for the central claim, no ansatz is smuggled, and no prediction is statistically forced by construction. The analysis is therefore self-contained against external benchmarks.

Axiom & Free-Parameter Ledger

0 free parameters · 2 axioms · 0 invented entities

The central claim rests on the assumption that a theory-driven annotation scheme can be reliably approximated by LLMs and that DSL sufficiently corrects systematic model errors for historical inference; no new free parameters or invented entities are introduced.

axioms (2)
  • [domain assumption] LLMs can approximate human annotations on solidarity and anti-solidarity subtypes when evaluated against human agreement.
    Invoked to justify using model outputs for downstream trend analysis after correction.
  • [domain assumption] Design-based Supervised Learning removes enough systematic bias to support valid long-term inferences.
    Central to the claim that corrected labels reveal genuine historical shifts.

pith-pipeline@v0.9.0 · 5850 in / 1375 out tokens · 57878 ms · 2026-05-18T17:24:45.422963+00:00 · methodology
