pith. sign in

arxiv: 2604.21334 · v1 · submitted 2026-04-23 · 💻 cs.AI · cs.CE· cs.CL· cs.LG· econ.GN· q-fin.EC

Ideological Bias in LLMs' Economic Causal Reasoning

Pith reviewed 2026-05-09 22:30 UTC · model grok-4.3

classification 💻 cs.AI cs.CEcs.CLcs.LGecon.GNq-fin.EC
keywords causaleconomicllmsempiricallyideologicalideology-contestedintervention-orientedmodels
0
0 comments X

The pith

LLMs show systematic directional bias favoring intervention-oriented causal judgments over market-oriented ones in ideologically contested economic scenarios.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

Large language models are now used for policy analysis and economic reporting, where getting the direction of cause and effect right matters. The authors take real causal findings from economics and finance papers and focus on the subset where one ideological lens expects a positive effect from an intervention while the other expects a negative effect. They measure whether the models can recover the empirically verified direction. The results indicate that models are less accurate overall on these contested items and, when they get the sign wrong, they more often predict the direction that aligns with pro-government intervention views. This pattern holds across most of the 20 models tested and is not fixed by simple prompting.

Core claim

across 18 of 20 models, accuracy is systematically higher when the empirically verified causal sign aligns with intervention-oriented expectations than with market-oriented ones. Moreover, when models err, their incorrect predictions disproportionately lean intervention-oriented, and this directional skew is not eliminated by one-shot in-context prompting.

Load-bearing premise

That the 1,056 ideology-contested instances are correctly and exhaustively identified as contested based on divergent intervention-oriented versus market-oriented predictions, and that the underlying empirical causal directions from the 10,490 triplets are reliable ground truth.

Figures

Figures reproduced from arXiv: 2604.21334 by Donggyu Lee, Hyeok Yun, Jihee Kim, Jungwon Kim, Junsik Min, Sangyoon Park, Sungwon Park.

Figure 1
Figure 1. Figure 1: Overview of ideological divergence in LLM economic causal reasoning [PITH_FULL_IMAGE:figures/full_fig_p002_1.png] view at source ↗
Figure 2
Figure 2. Figure 2: Accuracy on intervention-truth versus market-truth items across 20 LLMs. Most [PITH_FULL_IMAGE:figures/full_fig_p005_2.png] view at source ↗
Figure 3
Figure 3. Figure 3: In-context learning confidence distribution by target ideology. [PITH_FULL_IMAGE:figures/full_fig_p008_3.png] view at source ↗
Figure 4
Figure 4. Figure 4: Directional bias score (Bdir) by model release date. suggests that contrastive examples inflate certainty without correcting underlying judg￾ments—problematically pairing high confidence with low accuracy. Market targets (right panel) show a different pattern. The two confidence distributions largely overlap (∆µ = +0.032, KS = 0.090, p = 0.625), yet the GPT-4o’s accuracy gap remains substantial (∆example =… view at source ↗
Figure 5
Figure 5. Figure 5: Mean accuracy gap of 20 models on Economic subfields [PITH_FULL_IMAGE:figures/full_fig_p013_5.png] view at source ↗
Figure 6
Figure 6. Figure 6: Prompt used for difficulty scoring (GPT-5-mini). [PITH_FULL_IMAGE:figures/full_fig_p015_6.png] view at source ↗
Figure 7
Figure 7. Figure 7: Prompt used for main-result sign prediction. [PITH_FULL_IMAGE:figures/full_fig_p017_7.png] view at source ↗
Figure 8
Figure 8. Figure 8: Prompt used for example-augmented sign prediction. [PITH_FULL_IMAGE:figures/full_fig_p018_8.png] view at source ↗
Figure 9
Figure 9. Figure 9: Prompt used for ideological expected-sign annotation. [PITH_FULL_IMAGE:figures/full_fig_p019_9.png] view at source ↗
read the original abstract

Do large language models (LLMs) exhibit systematic ideological bias when reasoning about economic causal effects? As LLMs are increasingly used in policy analysis and economic reporting, where directionally correct causal judgments are essential, this question has direct practical stakes. We present a systematic evaluation by extending the EconCausal benchmark with ideology-contested cases - instances where intervention-oriented (pro-government) and market-oriented (pro-market) perspectives predict divergent causal signs. From 10,490 causal triplets (treatment-outcome pairs with empirically verified effect directions) derived from top-tier economics and finance journals, we identify 1,056 ideology-contested instances and evaluate 20 state-of-the-art LLMs on their ability to predict empirically supported causal directions. We find that ideology-contested items are consistently harder than non-contested ones, and that across 18 of 20 models, accuracy is systematically higher when the empirically verified causal sign aligns with intervention-oriented expectations than with market-oriented ones. Moreover, when models err, their incorrect predictions disproportionately lean intervention-oriented, and this directional skew is not eliminated by one-shot in-context prompting. These results highlight that LLMs are not only less accurate on ideologically contested economic questions, but systematically less reliable in one ideological direction than the other, underscoring the need for direction-aware evaluation in high-stakes economic and policy settings.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Referee Report

3 major / 2 minor

Summary. The manuscript extends the EconCausal benchmark with 1,056 ideology-contested causal triplets (from 10,490 total triplets drawn from top economics and finance journals) where intervention-oriented and market-oriented perspectives predict opposite causal signs. It evaluates 20 LLMs on their ability to predict the empirically verified directions, reporting that contested items are harder than non-contested ones, that accuracy is higher when the ground-truth sign aligns with intervention-oriented expectations (in 18 of 20 models), that errors disproportionately favor intervention-oriented predictions, and that this directional skew persists under one-shot in-context prompting.

Significance. If the classification of contested cases and the assignment of ideological expectations can be shown to be reproducible and independent of author labeling choices, the results would indicate a systematic directional bias in LLMs' economic causal reasoning with direct relevance to their use in policy analysis. The grounding in journal-derived empirical ground truth is a methodological strength that supports falsifiability.

major comments (3)
  1. [Methodology for contested-case identification] The section describing identification of the 1,056 ideology-contested instances does not document the criteria, decision rules, or validation procedure (e.g., multiple independent raters, inter-annotator agreement, or external review) used to map each treatment-outcome pair to intervention-oriented versus market-oriented predictions. This partitioning directly determines the accuracy and error-skew comparisons that constitute the central claim.
  2. [Results on model accuracy and error directions] The results section reporting accuracy differences across the 18/20 models provides no description of how accuracy is computed on the contested subset, what statistical tests establish the systematic gap, or controls for prompt sensitivity and sampling variability. These details are load-bearing for the claim that the observed skew is not an artifact of evaluation design.
  3. [Prompting experiments] The prompting experiment section tests only one-shot in-context examples and does not examine whether the directional tilt remains under chain-of-thought, few-shot with balanced examples, or other standard mitigation techniques. This limits the strength of the conclusion that the bias is robust to prompting.
minor comments (2)
  1. [Results tables/figures] Table or figure presenting the per-model accuracy split should include confidence intervals or p-values to allow readers to assess the magnitude and reliability of the reported gaps.
  2. [Abstract and introduction] The abstract and introduction should explicitly reference the source benchmark (EconCausal) and clarify whether the 10,490 triplets include only published, peer-reviewed causal claims.

Simulated Author's Rebuttal

3 responses · 0 unresolved

We thank the referee for their detailed and constructive comments on our manuscript. We have carefully considered each point and outline our responses below, along with the revisions we plan to implement in the next version of the paper.

read point-by-point responses
  1. Referee: [Methodology for contested-case identification] The section describing identification of the 1,056 ideology-contested instances does not document the criteria, decision rules, or validation procedure (e.g., multiple independent raters, inter-annotator agreement, or external review) used to map each treatment-outcome pair to intervention-oriented versus market-oriented predictions. This partitioning directly determines the accuracy and error-skew comparisons that constitute the central claim.

    Authors: We agree that the manuscript would benefit from more explicit documentation of the contested-case identification process. Although Section 3.2 briefly describes that contested instances are those where intervention-oriented and market-oriented perspectives predict opposite causal signs, we did not provide the full decision rules or validation details. In the revised manuscript, we will add a dedicated subsection with: explicit criteria based on established economic theories (e.g., government spending increases output per Keynesian views but crowds out per classical views); concrete examples of triplets and their ideological assignments; the annotation protocol involving two independent economics PhD raters; and inter-annotator agreement statistics. A third reviewer will validate a subsample. This will ensure reproducibility independent of author labeling. revision: yes

  2. Referee: [Results on model accuracy and error directions] The results section reporting accuracy differences across the 18/20 models provides no description of how accuracy is computed on the contested subset, what statistical tests establish the systematic gap, or controls for prompt sensitivity and sampling variability. These details are load-bearing for the claim that the observed skew is not an artifact of evaluation design.

    Authors: We acknowledge the need for greater transparency in the results presentation. Accuracy is defined as the fraction of contested triplets where the LLM's output sign matches the journal-reported empirical direction. In the revision, we will explicitly define this metric, report the results of statistical tests (e.g., two-sample t-tests on accuracy differences between intervention-aligned and market-aligned ground truths), and detail controls: all prompts used temperature 0.0 for reproducibility, fixed prompt templates, and we will add bootstrap resampling for confidence intervals on accuracy and error rates to address sampling variability. A prompt sensitivity analysis varying wording will also be included. revision: yes

  3. Referee: [Prompting experiments] The prompting experiment section tests only one-shot in-context examples and does not examine whether the directional tilt remains under chain-of-thought, few-shot with balanced examples, or other standard mitigation techniques. This limits the strength of the conclusion that the bias is robust to prompting.

    Authors: The current experiments show the bias persists under one-shot prompting, supporting our claim of robustness to basic in-context learning. To further strengthen this, we will expand the prompting experiments in the revision to include chain-of-thought reasoning prompts and few-shot examples balanced across ideological alignments. We will report the resulting accuracies and error directions under these conditions. If the bias is reduced, we will discuss implications; if not, it will reinforce the findings. These additional results will be presented in the main text or appendix as space allows. revision: yes

Circularity Check

0 steps flagged

No significant circularity detected

full rationale

The paper's core evaluation uses empirically verified causal directions from external top-tier economics and finance journals as ground truth for all 10,490 triplets. The 1,056 contested cases are identified by whether two ideological perspectives diverge on sign, but LLM accuracy and error skew are measured strictly against the journal-derived signs rather than being defined by or fitted to the ideological labels themselves. No self-citations, parameter fitting, or definitional reductions appear in the derivation chain; the results remain independently testable against the external benchmark.

Axiom & Free-Parameter Ledger

0 free parameters · 0 axioms · 0 invented entities

Abstract-only; no free parameters, axioms, or invented entities are described.

pith-pipeline@v0.9.0 · 5572 in / 1078 out tokens · 19871 ms · 2026-05-09T22:30:14.566105+00:00 · methodology

discussion (0)

Sign in with ORCID, Apple, or X to comment. Anyone can read and Pith papers without signing in.

Reference graph

Works this paper leans on

12 extracted references · 12 canonical work pages

  1. [1]

    The political ideology of conver- sational AI: Converging evidence on ChatGPT’s pro-environmental, left-libertarian orientation

    URLhttps://aclanthology.org/2025.acl-long.1529/. Stanley Feldman and Christopher Johnston. Understanding the determinants of political ideology: Implications of structural complexity.Political Psychology, 35(3):337–358, 2014. Shangbin Feng, Chan Young Park, Yuhan Liu, and Yulia Tsvetkov. From pretraining data to language models to downstream tasks: Tracki...

  2. [2]

    Domain Knowledge(1–5): 1 = common sense sufficient, 3 = intermediate policy/market knowledge needed, 5 = requires specialized theory or identification strategies

  3. [3]

    Context Dependence(1–5): 1 = sign is stable regardless of setting, 3 = context matters moderately, 5 = sign is entirely pinned down by specific institutional/temporal/population conditions

  4. [4]

    Ambiguity / Confoundability(1–5): 1 = sign is unambiguous, 3 = moderately plausible alternative, 5 = opposite sign nearly as plausible without empirical analysis

  5. [5]

    Causal Reasoning Complexity(1–5): 1 = direct one-step link, 3 = multiple mediating channels, 5 = requires reasoning through GE feedbacks, selection, and composition effects simultaneously

  6. [6]

    Evidence Sufficiency(1–5): 1 = context clearly determines the sign, 3 = reasonable infer- ence possible despite gaps, 5 = information so incomplete that None/Mixed is a serious contender

  7. [7]

    domain knowledge

    Overall Difficulty(1–5): Your holistic judgment of how hard this triplet is, considering all dimensions above. Evaluate PURELY based on economic reasoning. Do NOT let political implications influence your ratings. Respond in JSON only, no other text: {"domain knowledge":<int>, "context dependence":<int>, "ambiguity":<int>, "causal complexity":<int>, "evid...

  8. [8]

    No extra text

    Output MUST be valid JSON only. No extra text

  9. [9]

    Infer expected signs from ideological priors, NOT from the paper’s empirical result

  10. [10]

    If both ideologies would expect the same sign, assign the same sign to both

    Most causal triplets are NOT ideologically contested. If both ideologies would expect the same sign, assign the same sign to both

  11. [11]

    Use the same non-null sign only when both sides would genuinely expect the same directional effect

    Use null when no meaningful ideological prior exists for either side. Use the same non-null sign only when both sides would genuinely expect the same directional effect

  12. [12]

    ideological expected signs

    Also provide one brief reasoning string of at most 100 words explaining the sign judgment. Theoretical Framework(based on Feldman & Johnston, 2014; Alesina & Giuliano, 2011) Political ideology has at least two partially independent dimensions; here we focus ONLY on the economicdimension. The economic dimension captures attitudes toward the appropriate rol...