Ideological Bias in LLMs' Economic Causal Reasoning

Donggyu Lee; Hyeok Yun; Jihee Kim; Jungwon Kim; Junsik Min; Sangyoon Park; Sungwon Park

arXiv: 2604.21334 · v1 · submitted 2026-04-23 · cs.AI · cs.CE · cs.CL · cs.LG · econ.GN · q-fin.EC

Pith reviewed 2026-05-09 22:30 UTC · model grok-4.3

keywords: causal · economic · llms · empirically · ideological · ideology-contested · intervention-oriented · models

The pith

LLMs show systematic directional bias favoring intervention-oriented causal judgments over market-oriented ones in ideologically contested economic scenarios.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

Large language models are now used for policy analysis and economic reporting, where getting the direction of cause and effect right matters. The authors take real causal findings from economics and finance papers and focus on the subset where one ideological lens expects a positive effect from an intervention while the other expects a negative effect. They measure whether the models can recover the empirically verified direction. The results indicate that models are less accurate overall on these contested items and, when they get the sign wrong, they more often predict the direction that aligns with pro-government intervention views. This pattern holds across most of the 20 models tested and is not fixed by simple prompting.

Core claim

Across 18 of 20 models, accuracy is systematically higher when the empirically verified causal sign aligns with intervention-oriented expectations than when it aligns with market-oriented ones. Moreover, when models err, their incorrect predictions disproportionately lean intervention-oriented, and this directional skew is not eliminated by one-shot in-context prompting.
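To make the two quantities in this claim concrete, here is a minimal Python sketch. It is not the authors' code. It assumes each contested item carries four signs: the verified truth, the sign each perspective expects, and the model's prediction. The `Item` fields, the function names, and the toy data are all hypothetical, and `intervention_error_share` is only one plausible summary of error direction; it is not claimed to match the paper's Bdir score.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Item:
    truth: str         # empirically verified sign, "+" or "-"
    intervention: str  # sign expected under the intervention-oriented view
    market: str        # sign expected under the market-oriented view
    predicted: str     # sign output by the model


def accuracy(items):
    """Fraction of items whose predicted sign equals the verified sign."""
    items = list(items)
    if not items:
        return float("nan")
    return sum(i.predicted == i.truth for i in items) / len(items)


def split_accuracy(items):
    """Accuracy on intervention-truth items and on market-truth items."""
    iv_truth = [i for i in items if i.truth == i.intervention]
    mk_truth = [i for i in items if i.truth == i.market]
    return accuracy(iv_truth), accuracy(mk_truth)


def intervention_error_share(items):
    """Among wrong predictions that match one side's expected sign,
    the share that match the intervention-oriented sign."""
    wrong = [i for i in items if i.predicted != i.truth]
    to_iv = sum(i.predicted == i.intervention for i in wrong)
    to_mk = sum(i.predicted == i.market for i in wrong)
    total = to_iv + to_mk
    return to_iv / total if total else float("nan")


# Toy contested items (hypothetical): the two expected signs always differ.
items = [
    Item("+", "+", "-", "+"),  # intervention-truth, correct
    Item("+", "+", "-", "+"),  # intervention-truth, correct
    Item("-", "-", "+", "+"),  # intervention-truth, wrong, leans market
    Item("+", "-", "+", "-"),  # market-truth, wrong, leans intervention
    Item("-", "+", "-", "+"),  # market-truth, wrong, leans intervention
    Item("-", "+", "-", "-"),  # market-truth, correct
]

acc_iv, acc_mk = split_accuracy(items)
share = intervention_error_share(items)
print(f"intervention-truth acc={acc_iv:.2f}, market-truth acc={acc_mk:.2f}")
print(f"share of directional errors leaning intervention={share:.2f}")
```

In this toy data the model is more accurate on intervention-truth items and most of its errors lean intervention-oriented, which is the shape of the pattern the claim describes.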

Load-bearing premise

That the 1,056 ideology-contested instances are correctly and exhaustively identified as contested based on divergent intervention-oriented versus market-oriented predictions, and that the underlying empirical causal directions from the 10,490 triplets are reliable ground truth.
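The premise turns on a simple filter: an instance is contested when the two perspectives expect different signs. The following is a hedged sketch of that filter, assuming each annotation stores one expected sign per perspective and uses `None` when no ideological prior exists. The `Annotation` type, its field names, and the example rows are hypothetical and are not taken from the paper's data.

```python
from typing import NamedTuple, Optional


class Annotation(NamedTuple):
    treatment: str
    outcome: str
    intervention_sign: Optional[str]  # "+", "-", or None (no prior)
    market_sign: Optional[str]        # "+", "-", or None (no prior)


def is_contested(a: Annotation) -> bool:
    """Contested means both sides have a prior and the priors disagree."""
    return (
        a.intervention_sign is not None
        and a.market_sign is not None
        and a.intervention_sign != a.market_sign
    )


# Hypothetical annotations for illustration only.
annotations = [
    Annotation("government spending", "output", "+", "-"),  # priors diverge
    Annotation("treatment B", "outcome B", "+", "+"),        # priors agree
    Annotation("treatment C", "outcome C", None, None),      # no prior
]

contested = [a for a in annotations if is_contested(a)]
print(len(contested), "of", len(annotations), "annotations are contested")
```

The filter itself is mechanical; the premise is load-bearing because the expected signs fed into it are judgment calls.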

Figures

Figures reproduced from arXiv: 2604.21334 by Donggyu Lee, Hyeok Yun, Jihee Kim, Jungwon Kim, Junsik Min, Sangyoon Park, Sungwon Park.

**Figure 2.** Accuracy on intervention-truth versus market-truth items across 20 LLMs. Most…

**Figure 3.** In-context learning confidence distribution by target ideology.

**Figure 4.** Directional bias score (B_dir) by model release date.

… suggests that contrastive examples inflate certainty without correcting underlying judgments, problematically pairing high confidence with low accuracy. Market targets (right panel) show a different pattern. The two confidence distributions largely overlap (∆µ = +0.032, KS = 0.090, p = 0.625), yet GPT-4o's accuracy gap remains substantial (∆example = …

**Figure 5.** Mean accuracy gap of 20 models on economic subfields.

**Figure 6.** Prompt used for difficulty scoring (GPT-5-mini).

**Figure 7.** Prompt used for main-result sign prediction.

**Figure 8.** Prompt used for example-augmented sign prediction.

**Figure 9.** Prompt used for ideological expected-sign annotation.

Original abstract

Do large language models (LLMs) exhibit systematic ideological bias when reasoning about economic causal effects? As LLMs are increasingly used in policy analysis and economic reporting, where directionally correct causal judgments are essential, this question has direct practical stakes. We present a systematic evaluation by extending the EconCausal benchmark with ideology-contested cases - instances where intervention-oriented (pro-government) and market-oriented (pro-market) perspectives predict divergent causal signs. From 10,490 causal triplets (treatment-outcome pairs with empirically verified effect directions) derived from top-tier economics and finance journals, we identify 1,056 ideology-contested instances and evaluate 20 state-of-the-art LLMs on their ability to predict empirically supported causal directions. We find that ideology-contested items are consistently harder than non-contested ones, and that across 18 of 20 models, accuracy is systematically higher when the empirically verified causal sign aligns with intervention-oriented expectations than with market-oriented ones. Moreover, when models err, their incorrect predictions disproportionately lean intervention-oriented, and this directional skew is not eliminated by one-shot in-context prompting. These results highlight that LLMs are not only less accurate on ideologically contested economic questions, but systematically less reliable in one ideological direction than the other, underscoring the need for direction-aware evaluation in high-stakes economic and policy settings.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Desk Editor's Note · private letter to a colleague

LLMs tilt toward interventionist predictions on ideologically contested economic causal questions, but the contested case labeling needs checking.

The punchline here is that LLMs appear more accurate when the true causal direction in economic questions matches intervention-oriented expectations rather than market-oriented ones, with errors also skewing interventionist, and this holds across most of the 20 models tested even after prompting.

What the paper does is extend the EconCausal benchmark to isolate 1,056 ideology-contested causal triplets from over 10,000 drawn from top journals. It then measures how LLMs perform on predicting the empirically verified signs. This is new in its focused audit of directional bias on contested economic issues. It does well by relying on external journal findings for ground truth, which keeps the evaluation from being self-referential, and by checking a range of models plus a one-shot prompting condition.

The soft spot is the assignment of intervention versus market predictions to those contested cases. Without independent validation or documented criteria for that mapping, the accuracy differences could be influenced by how the authors defined the ideological sides. That part needs to be solid for the bias claim to land cleanly.

This is for readers who evaluate LLMs for use in policy or economic contexts. Someone building benchmarks or studying AI reliability in social domains would get concrete data points from it. It deserves a serious referee because the scale and external anchoring are there, though the classification details will be the main point of discussion. I recommend sending it for peer review.

Referee Report

3 major / 2 minor

Summary. The manuscript extends the EconCausal benchmark with 1,056 ideology-contested causal triplets (from 10,490 total triplets drawn from top economics and finance journals) where intervention-oriented and market-oriented perspectives predict opposite causal signs. It evaluates 20 LLMs on their ability to predict the empirically verified directions, reporting that contested items are harder than non-contested ones, that accuracy is higher when the ground-truth sign aligns with intervention-oriented expectations (in 18 of 20 models), that errors disproportionately favor intervention-oriented predictions, and that this directional skew persists under one-shot in-context prompting.

Significance. If the classification of contested cases and the assignment of ideological expectations can be shown to be reproducible and independent of author labeling choices, the results would indicate a systematic directional bias in LLMs' economic causal reasoning with direct relevance to their use in policy analysis. The grounding in journal-derived empirical ground truth is a methodological strength that supports falsifiability.

major comments (3)

[Methodology for contested-case identification] The section describing identification of the 1,056 ideology-contested instances does not document the criteria, decision rules, or validation procedure (e.g., multiple independent raters, inter-annotator agreement, or external review) used to map each treatment-outcome pair to intervention-oriented versus market-oriented predictions. This partitioning directly determines the accuracy and error-skew comparisons that constitute the central claim.
[Results on model accuracy and error directions] The results section reporting accuracy differences across the 18/20 models provides no description of how accuracy is computed on the contested subset, what statistical tests establish the systematic gap, or controls for prompt sensitivity and sampling variability. These details are load-bearing for the claim that the observed skew is not an artifact of evaluation design.
[Prompting experiments] The prompting experiment section tests only one-shot in-context examples and does not examine whether the directional tilt remains under chain-of-thought, few-shot with balanced examples, or other standard mitigation techniques. This limits the strength of the conclusion that the bias is robust to prompting.
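One simple check on the "18 of 20 models" count raised in the second major comment is an exact sign test. The sketch below is an illustration and not a test reported in the paper: it asks how likely 18 or more of 20 models would favor the same direction if each model were an independent fair coin. The function name `sign_test_p` is hypothetical, and the independence assumption is strong, since models often share training data and lineage.

```python
from math import comb


def sign_test_p(successes: int, n: int) -> float:
    """One-sided exact binomial tail P(X >= successes) for X ~ Bin(n, 0.5)."""
    return sum(comb(n, k) for k in range(successes, n + 1)) / 2 ** n


# 18 of 20 models show higher accuracy on intervention-truth items.
p_one_sided = sign_test_p(18, 20)   # (190 + 20 + 1) / 2**20
p_two_sided = min(1.0, 2 * p_one_sided)
print(f"one-sided p = {p_one_sided:.6f}, two-sided p = {p_two_sided:.6f}")
```

Under those assumptions the one-sided tail is 211 / 1,048,576, roughly 0.0002. Because the models are unlikely to be independent, this is best read as a rough lower bound on the p-value rather than a definitive result.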

minor comments (2)

[Results tables/figures] Table or figure presenting the per-model accuracy split should include confidence intervals or p-values to allow readers to assess the magnitude and reliability of the reported gaps.
[Abstract and introduction] The abstract and introduction should explicitly reference the source benchmark (EconCausal) and clarify whether the 10,490 triplets include only published, peer-reviewed causal claims.

Simulated Author's Rebuttal

3 responses · 0 unresolved

We thank the referee for their detailed and constructive comments on our manuscript. We have carefully considered each point and outline our responses below, along with the revisions we plan to implement in the next version of the paper.

Referee: [Methodology for contested-case identification] The section describing identification of the 1,056 ideology-contested instances does not document the criteria, decision rules, or validation procedure (e.g., multiple independent raters, inter-annotator agreement, or external review) used to map each treatment-outcome pair to intervention-oriented versus market-oriented predictions. This partitioning directly determines the accuracy and error-skew comparisons that constitute the central claim.

Authors: We agree that the manuscript would benefit from more explicit documentation of the contested-case identification process. Although Section 3.2 briefly describes that contested instances are those where intervention-oriented and market-oriented perspectives predict opposite causal signs, we did not provide the full decision rules or validation details. In the revised manuscript, we will add a dedicated subsection with: explicit criteria based on established economic theories (e.g., government spending increases output per Keynesian views but crowds out per classical views); concrete examples of triplets and their ideological assignments; the annotation protocol involving two independent economics PhD raters; and inter-annotator agreement statistics. A third reviewer will validate a subsample. This will ensure reproducibility independent of author labeling.

revision: yes

Referee: [Results on model accuracy and error directions] The results section reporting accuracy differences across the 18/20 models provides no description of how accuracy is computed on the contested subset, what statistical tests establish the systematic gap, or controls for prompt sensitivity and sampling variability. These details are load-bearing for the claim that the observed skew is not an artifact of evaluation design.

Authors: We acknowledge the need for greater transparency in the results presentation. Accuracy is defined as the fraction of contested triplets where the LLM's output sign matches the journal-reported empirical direction. In the revision, we will explicitly define this metric, report the results of statistical tests (e.g., two-sample t-tests on accuracy differences between intervention-aligned and market-aligned ground truths), and detail controls: all prompts used temperature 0.0 for reproducibility, fixed prompt templates, and we will add bootstrap resampling for confidence intervals on accuracy and error rates to address sampling variability. A prompt sensitivity analysis varying wording will also be included.

revision: yes

Referee: [Prompting experiments] The prompting experiment section tests only one-shot in-context examples and does not examine whether the directional tilt remains under chain-of-thought, few-shot with balanced examples, or other standard mitigation techniques. This limits the strength of the conclusion that the bias is robust to prompting.

Authors: The current experiments show the bias persists under one-shot prompting, supporting our claim of robustness to basic in-context learning. To further strengthen this, we will expand the prompting experiments in the revision to include chain-of-thought reasoning prompts and few-shot examples balanced across ideological alignments. We will report the resulting accuracies and error directions under these conditions. If the bias is reduced, we will discuss implications; if not, it will reinforce the findings. These additional results will be presented in the main text or appendix as space allows.

revision: yes
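The bootstrap resampling mentioned in the second response can be sketched as follows. This is a minimal percentile bootstrap for the accuracy gap between intervention-truth and market-truth items, written under assumptions that are not confirmed by the source: items are treated as independent, each group is resampled separately, and per-item correctness is stored as 0/1 flags. The function names, the default of 2,000 resamples, and the toy data are hypothetical.

```python
import random


def accuracy(flags):
    """Mean of 0/1 correctness flags."""
    return sum(flags) / len(flags)


def bootstrap_gap_ci(iv_correct, mk_correct, n_boot=2000, alpha=0.05, seed=0):
    """Percentile bootstrap CI for accuracy(iv_correct) - accuracy(mk_correct)."""
    rng = random.Random(seed)  # fixed seed so the sketch is reproducible
    gaps = []
    for _ in range(n_boot):
        # Resample each group with replacement, keeping its original size.
        iv = rng.choices(iv_correct, k=len(iv_correct))
        mk = rng.choices(mk_correct, k=len(mk_correct))
        gaps.append(accuracy(iv) - accuracy(mk))
    gaps.sort()
    lo = gaps[int((alpha / 2) * n_boot)]
    hi = gaps[int((1 - alpha / 2) * n_boot) - 1]
    return lo, hi


# Toy correctness flags (hypothetical): 70% vs 50% accuracy, 100 items each.
iv_correct = [1] * 70 + [0] * 30
mk_correct = [1] * 50 + [0] * 50

point_gap = accuracy(iv_correct) - accuracy(mk_correct)
lo, hi = bootstrap_gap_ci(iv_correct, mk_correct)
print(f"gap = {point_gap:.2f}, 95% CI = [{lo:.2f}, {hi:.2f}]")
```

An interval that excludes zero would suggest the gap is not explained by item sampling alone. It would not address prompt sensitivity or labeling choices, which the referee raises separately.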

Circularity Check

0 steps flagged

No significant circularity detected

The paper's core evaluation uses empirically verified causal directions from external top-tier economics and finance journals as ground truth for all 10,490 triplets. The 1,056 contested cases are identified by whether two ideological perspectives diverge on sign, but LLM accuracy and error skew are measured strictly against the journal-derived signs rather than being defined by or fitted to the ideological labels themselves. No self-citations, parameter fitting, or definitional reductions appear in the derivation chain; the results remain independently testable against the external benchmark.

Axiom & Free-Parameter Ledger

0 free parameters · 0 axioms · 0 invented entities

Abstract-only; no free parameters, axioms, or invented entities are described.

