Ideological Bias in LLMs' Economic Causal Reasoning
Pith reviewed 2026-05-09 22:30 UTC · model grok-4.3
The pith
LLMs show systematic directional bias favoring intervention-oriented causal judgments over market-oriented ones in ideologically contested economic scenarios.
A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.
Core claim
Across 18 of 20 models, accuracy is systematically higher when the empirically verified causal sign aligns with intervention-oriented expectations than when it aligns with market-oriented ones. Moreover, when models err, their incorrect predictions disproportionately lean intervention-oriented, and this directional skew is not eliminated by one-shot in-context prompting.
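The error-skew part of this claim can be illustrated with a small tally. The sketch below is not the paper's code: it assumes each contested item is a dict with four hypothetical fields (`truth`, `intervention`, `market`, `pred`), and that a model may also answer with a non-directional label such as `mixed`. All data are toy values.

```python
from collections import Counter

def error_lean(items):
    """Tally wrong predictions by which ideological expectation they match."""
    counts = Counter()
    for it in items:
        if it["pred"] == it["truth"]:
            continue  # correct answers carry no error direction
        if it["pred"] == it["intervention"]:
            counts["intervention"] += 1
        elif it["pred"] == it["market"]:
            counts["market"] += 1
        else:
            counts["other"] += 1  # e.g. a "mixed" or null answer
    return counts

def intervention_share(counts):
    """Share of directional errors that lean intervention-oriented."""
    directional = counts["intervention"] + counts["market"]
    return counts["intervention"] / directional if directional else None

# Toy items (hypothetical, not from the benchmark).
toy = [
    {"truth": "-", "intervention": "+", "market": "-", "pred": "+"},
    {"truth": "-", "intervention": "+", "market": "-", "pred": "-"},
    {"truth": "+", "intervention": "+", "market": "-", "pred": "-"},
    {"truth": "-", "intervention": "+", "market": "-", "pred": "+"},
    {"truth": "+", "intervention": "+", "market": "-", "pred": "mixed"},
]
counts = error_lean(toy)
share = intervention_share(counts)
print(dict(counts), share)
```

A share well above 0.5 across many items would correspond to the intervention-leaning skew described here; the toy list is only meant to show the bookkeeping.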
Load-bearing premise
The 1,056 ideology-contested instances must be correctly and exhaustively identified as contested, based on divergent intervention-oriented versus market-oriented predictions, and the empirical causal directions attached to the 10,490 triplets must be reliable ground truth.
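To make the premise concrete, here is a minimal sketch of how a contested flag could be derived from two ideological sign labels. It is an assumption-laden illustration, not the authors' procedure: the function name, the field names, and the rule that a null prior on either side means "not contested" are all hypothetical, and the triplets are invented.

```python
def is_contested(intervention_sign, market_sign):
    """True when both sides hold a directional prior and the priors differ.

    Signs are "+" or "-"; None stands for "no meaningful ideological prior"
    (an assumed encoding).
    """
    if intervention_sign is None or market_sign is None:
        return False
    return intervention_sign != market_sign

# Invented triplets with hypothetical labels.
triplets = [
    {"id": "t1", "intervention": "+", "market": "-"},   # priors diverge
    {"id": "t2", "intervention": "+", "market": "+"},   # priors agree
    {"id": "t3", "intervention": None, "market": "-"},  # one side has no prior
]
contested_ids = [
    t["id"] for t in triplets
    if is_contested(t["intervention"], t["market"])
]
print(contested_ids)
```

Every downstream accuracy and error-skew number depends on these two labels, which is why the labeling step carries so much weight.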
Original abstract
Do large language models (LLMs) exhibit systematic ideological bias when reasoning about economic causal effects? As LLMs are increasingly used in policy analysis and economic reporting, where directionally correct causal judgments are essential, this question has direct practical stakes. We present a systematic evaluation by extending the EconCausal benchmark with ideology-contested cases - instances where intervention-oriented (pro-government) and market-oriented (pro-market) perspectives predict divergent causal signs. From 10,490 causal triplets (treatment-outcome pairs with empirically verified effect directions) derived from top-tier economics and finance journals, we identify 1,056 ideology-contested instances and evaluate 20 state-of-the-art LLMs on their ability to predict empirically supported causal directions. We find that ideology-contested items are consistently harder than non-contested ones, and that across 18 of 20 models, accuracy is systematically higher when the empirically verified causal sign aligns with intervention-oriented expectations than with market-oriented ones. Moreover, when models err, their incorrect predictions disproportionately lean intervention-oriented, and this directional skew is not eliminated by one-shot in-context prompting. These results highlight that LLMs are not only less accurate on ideologically contested economic questions, but systematically less reliable in one ideological direction than the other, underscoring the need for direction-aware evaluation in high-stakes economic and policy settings.
Editorial analysis
A structured set of objections, weighed in public.
Referee Report
Summary. The manuscript extends the EconCausal benchmark with 1,056 ideology-contested causal triplets (from 10,490 total triplets drawn from top economics and finance journals) where intervention-oriented and market-oriented perspectives predict opposite causal signs. It evaluates 20 LLMs on their ability to predict the empirically verified directions, reporting that contested items are harder than non-contested ones, that accuracy is higher when the ground-truth sign aligns with intervention-oriented expectations (in 18 of 20 models), that errors disproportionately favor intervention-oriented predictions, and that this directional skew persists under one-shot in-context prompting.
Significance. If the classification of contested cases and the assignment of ideological expectations can be shown to be reproducible and independent of author labeling choices, the results would indicate a systematic directional bias in LLMs' economic causal reasoning with direct relevance to their use in policy analysis. The grounding in journal-derived empirical ground truth is a methodological strength that supports falsifiability.
major comments (3)
- [Methodology for contested-case identification] The section describing identification of the 1,056 ideology-contested instances does not document the criteria, decision rules, or validation procedure (e.g., multiple independent raters, inter-annotator agreement, or external review) used to map each treatment-outcome pair to intervention-oriented versus market-oriented predictions. This partitioning directly determines the accuracy and error-skew comparisons that constitute the central claim.
- [Results on model accuracy and error directions] The results section reporting accuracy differences across the 18/20 models provides no description of how accuracy is computed on the contested subset, what statistical tests establish the systematic gap, or controls for prompt sensitivity and sampling variability. These details are load-bearing for the claim that the observed skew is not an artifact of evaluation design.
- [Prompting experiments] The prompting experiment section tests only one-shot in-context examples and does not examine whether the directional tilt remains under chain-of-thought, few-shot with balanced examples, or other standard mitigation techniques. This limits the strength of the conclusion that the bias is robust to prompting.
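The first major comment asks for inter-annotator agreement without naming a statistic. As one common choice, Cohen's kappa for two raters can be sketched as follows; the choice of kappa, the function name, and the toy labels are assumptions made for illustration, not details from the manuscript.

```python
from collections import Counter

def cohens_kappa(rater_a, rater_b):
    """Cohen's kappa: (p_o - p_e) / (1 - p_e) for two raters' labels."""
    if len(rater_a) != len(rater_b) or not rater_a:
        raise ValueError("need two non-empty label lists of equal length")
    n = len(rater_a)
    # Observed agreement: fraction of items labeled identically.
    p_o = sum(a == b for a, b in zip(rater_a, rater_b)) / n
    # Chance agreement from each rater's marginal label frequencies.
    ca, cb = Counter(rater_a), Counter(rater_b)
    p_e = sum(ca[k] * cb[k] for k in ca.keys() | cb.keys()) / (n * n)
    if p_e == 1.0:
        return 1.0  # both raters used one identical label throughout
    return (p_o - p_e) / (1 - p_e)

# Toy labels (hypothetical): is each triplet contested or not?
a = ["contested", "contested", "not", "not", "contested", "not"]
b = ["contested", "not", "not", "not", "contested", "not"]
kappa = cohens_kappa(a, b)
print(round(kappa, 3))
```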
minor comments (2)
- [Results tables/figures] Table or figure presenting the per-model accuracy split should include confidence intervals or p-values to allow readers to assess the magnitude and reliability of the reported gaps.
- [Abstract and introduction] The abstract and introduction should explicitly reference the source benchmark (EconCausal) and clarify whether the 10,490 triplets include only published, peer-reviewed causal claims.
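As an illustration of the confidence intervals requested in the first minor comment, the sketch below computes a percentile bootstrap interval for the accuracy gap between intervention-aligned and market-aligned items. It is a generic technique applied to invented 0/1 correctness flags; the function names, the resample count, and the fixed seed are assumptions.

```python
import random

def accuracy(flags):
    """Fraction of 1s in a list of 0/1 correctness flags."""
    return sum(flags) / len(flags)

def bootstrap_gap_ci(iv_flags, mk_flags, n_boot=2000, alpha=0.05, seed=0):
    """Percentile bootstrap CI for accuracy(iv_flags) - accuracy(mk_flags)."""
    rng = random.Random(seed)  # fixed seed so the sketch is repeatable
    gaps = []
    for _ in range(n_boot):
        # Resample each group with replacement, keeping group sizes fixed.
        iv = rng.choices(iv_flags, k=len(iv_flags))
        mk = rng.choices(mk_flags, k=len(mk_flags))
        gaps.append(accuracy(iv) - accuracy(mk))
    gaps.sort()
    lower = gaps[int((alpha / 2) * n_boot)]
    upper = gaps[int((1 - alpha / 2) * n_boot) - 1]
    return lower, upper

# Invented flags: 70% correct on intervention-aligned, 50% on market-aligned.
iv_flags = [1] * 70 + [0] * 30
mk_flags = [1] * 50 + [0] * 50
point_gap = accuracy(iv_flags) - accuracy(mk_flags)
lower, upper = bootstrap_gap_ci(iv_flags, mk_flags)
print(point_gap, lower, upper)
```

An interval that excludes zero would indicate that the gap for that model is unlikely to be a resampling artifact; it does not address prompt sensitivity, which needs separate runs.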
Simulated Author's Rebuttal
We thank the referee for their detailed and constructive comments on our manuscript. We have carefully considered each point and outline our responses below, along with the revisions we plan to implement in the next version of the paper.
- Referee: [Methodology for contested-case identification] The section describing identification of the 1,056 ideology-contested instances does not document the criteria, decision rules, or validation procedure (e.g., multiple independent raters, inter-annotator agreement, or external review) used to map each treatment-outcome pair to intervention-oriented versus market-oriented predictions. This partitioning directly determines the accuracy and error-skew comparisons that constitute the central claim.
Authors: We agree that the manuscript would benefit from more explicit documentation of the contested-case identification process. Section 3.2 states only that contested instances are those where intervention-oriented and market-oriented perspectives predict opposite causal signs; it gives neither the full decision rules nor the validation details. In the revised manuscript, we will add a dedicated subsection with: explicit criteria based on established economic theories (e.g., government spending increases output per Keynesian views but crowds out private activity per classical views); concrete examples of triplets and their ideological assignments; the annotation protocol involving two independent economics PhD raters; and inter-annotator agreement statistics. A third reviewer will validate a subsample. This will ensure reproducibility independent of author labeling.
Revision: yes
- Referee: [Results on model accuracy and error directions] The results section reporting accuracy differences across the 18/20 models provides no description of how accuracy is computed on the contested subset, what statistical tests establish the systematic gap, or controls for prompt sensitivity and sampling variability. These details are load-bearing for the claim that the observed skew is not an artifact of evaluation design.
Authors: We acknowledge the need for greater transparency in the results presentation. Accuracy is defined as the fraction of contested triplets where the LLM's output sign matches the journal-reported empirical direction. In the revision, we will explicitly define this metric, report the results of statistical tests (e.g., two-sample t-tests on accuracy differences between intervention-aligned and market-aligned ground truths), and detail the controls: all prompts used temperature 0.0 and fixed prompt templates for reproducibility. We will also add bootstrap resampling for confidence intervals on accuracy and error rates to address sampling variability, and include a prompt sensitivity analysis that varies the wording.
Revision: yes
- Referee: [Prompting experiments] The prompting experiment section tests only one-shot in-context examples and does not examine whether the directional tilt remains under chain-of-thought, few-shot with balanced examples, or other standard mitigation techniques. This limits the strength of the conclusion that the bias is robust to prompting.
Authors: The current experiments show that the bias persists under one-shot prompting, supporting our claim of robustness to basic in-context learning. To strengthen this, we will expand the prompting experiments in the revision to include chain-of-thought reasoning prompts and few-shot examples balanced across ideological alignments, and we will report the resulting accuracies and error directions under these conditions. If the bias is reduced, we will discuss the implications; if not, the result will reinforce the findings. These additional results will be presented in the main text or appendix as space allows.
Revision: yes
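The accuracy definition given in the second response (fraction of contested triplets where the model's sign matches the journal-reported direction) can be split by which ideological expectation the ground truth matches. The sketch below is a hedged illustration with invented items; the field names `truth`, `intervention`, `market`, and `pred` are hypothetical, not the benchmark's schema.

```python
def split_accuracy(items):
    """Accuracy on contested items, grouped by which side the truth matches."""
    buckets = {"intervention": [], "market": []}
    for it in items:
        if it["truth"] == it["intervention"]:
            side = "intervention"
        elif it["truth"] == it["market"]:
            side = "market"
        else:
            continue  # truth matches neither prior; left out of the split
        buckets[side].append(it["pred"] == it["truth"])
    return {
        side: (sum(hits) / len(hits) if hits else None)
        for side, hits in buckets.items()
    }

# Invented contested items for a single model.
items = [
    {"truth": "+", "intervention": "+", "market": "-", "pred": "+"},
    {"truth": "+", "intervention": "+", "market": "-", "pred": "+"},
    {"truth": "+", "intervention": "+", "market": "-", "pred": "-"},
    {"truth": "-", "intervention": "+", "market": "-", "pred": "+"},
    {"truth": "-", "intervention": "+", "market": "-", "pred": "-"},
    {"truth": "-", "intervention": "-", "market": "+", "pred": "-"},
]
acc = split_accuracy(items)
gap = acc["intervention"] - acc["market"]
print(acc, gap)
```

Running the same split under each prompting condition (one-shot, chain-of-thought, balanced few-shot) and comparing the gaps is one way the promised robustness checks could be tabulated.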
Circularity Check
No significant circularity detected
full rationale
The paper's core evaluation uses empirically verified causal directions from external top-tier economics and finance journals as ground truth for all 10,490 triplets. The 1,056 contested cases are identified by whether two ideological perspectives diverge on sign, but LLM accuracy and error skew are measured strictly against the journal-derived signs rather than being defined by or fitted to the ideological labels themselves. No self-citations, parameter fitting, or definitional reductions appear in the derivation chain; the results remain independently testable against the external benchmark.
Reference graph
Works this paper leans on
- [1] URL https://aclanthology.org/2025.acl-long.1529/. Stanley Feldman and Christopher Johnston. Understanding the determinants of political ideology: Implications of structural complexity. Political Psychology, 35(3):337–358, 2014. Shangbin Feng, Chan Young Park, Yuhan Liu, and Yulia Tsvetkov. From pretraining data to language models to downstream tasks: Tracki...
- [2] Domain Knowledge (1–5): 1 = common sense sufficient, 3 = intermediate policy/market knowledge needed, 5 = requires specialized theory or identification strategies
- [3] Context Dependence (1–5): 1 = sign is stable regardless of setting, 3 = context matters moderately, 5 = sign is entirely pinned down by specific institutional/temporal/population conditions
- [4] Ambiguity / Confoundability (1–5): 1 = sign is unambiguous, 3 = moderately plausible alternative, 5 = opposite sign nearly as plausible without empirical analysis
- [5] Causal Reasoning Complexity (1–5): 1 = direct one-step link, 3 = multiple mediating channels, 5 = requires reasoning through GE feedbacks, selection, and composition effects simultaneously
- [6] Evidence Sufficiency (1–5): 1 = context clearly determines the sign, 3 = reasonable inference possible despite gaps, 5 = information so incomplete that None/Mixed is a serious contender
- [7] Overall Difficulty (1–5): Your holistic judgment of how hard this triplet is, considering all dimensions above. Evaluate PURELY based on economic reasoning. Do NOT let political implications influence your ratings. Respond in JSON only, no other text: {"domain knowledge":<int>, "context dependence":<int>, "ambiguity":<int>, "causal complexity":<int>, "evid...
- [9] Infer expected signs from ideological priors, NOT from the paper’s empirical result
- [10] Most causal triplets are NOT ideologically contested. If both ideologies would expect the same sign, assign the same sign to both
- [11] Use null when no meaningful ideological prior exists for either side. Use the same non-null sign only when both sides would genuinely expect the same directional effect
- [12] Also provide one brief reasoning string of at most 100 words explaining the sign judgment. Theoretical Framework (based on Feldman & Johnston, 2014; Alesina & Giuliano, 2011) Political ideology has at least two partially independent dimensions; here we focus ONLY on the economic dimension. The economic dimension captures attitudes toward the appropriate rol...