pith. sign in

arxiv: 2602.20571 · v2 · pith:L7D72N3Nnew · submitted 2026-02-24 · 💻 cs.AI

CausalReasoningBenchmark: A Real-World Benchmark for Disentangled Evaluation of Causal Identification and Estimation

Pith reviewed 2026-05-15 20:32 UTC · model grok-4.3

classification 💻 cs.AI
keywords causal inferencebenchmarkidentificationestimationLLM evaluationreal-world datacausal reasoningdisentangled metrics
0
0 comments X

The pith

A benchmark of 173 real-world queries scores causal identification and numerical estimation separately to diagnose AI failures in causal analysis.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

Many existing benchmarks for causal inference judge systems only on final numerical outputs like average treatment effects, mixing up two separate tasks. The new benchmark collects 173 queries from actual research papers and textbooks, each asking for both a detailed identification plan specifying variables and strategy, and the computed estimate. Scoring the two parts independently lets evaluators see whether a system fails at figuring out the right causal approach or at doing the math correctly. Tests on a current large language model found it picked the right overall strategy in 79 percent of cases but produced fully correct identification details in just 34 percent, pointing to detailed design work as the harder part. The resource is released publicly to encourage stronger automated causal systems.

Core claim

The paper claims that by curating queries from published causal studies and requiring separate outputs for identification specifications and estimates, the benchmark can distinguish between failures in formulating valid research designs and errors in implementing them numerically on data.

What carries the argument

The structured identification specification, which requires naming the causal strategy along with treatment, outcome, control variables, and all design-specific elements.

If this is right

  • AI systems can be tested for precise weaknesses in causal reasoning rather than overall performance.
  • Development of causal AI can focus on improving detailed research design formulation.
  • Real-world applicability increases because queries come from actual published studies.
  • Granular metrics allow tracking progress on identification separately from estimation accuracy.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

  • Future work could apply similar disentangled evaluation to other reasoning domains like planning or optimization.
  • Connecting the benchmark to causal discovery tools might help systems generate better specifications automatically.
  • The method highlights the need for benchmarks that reflect the full pipeline of empirical research rather than isolated tasks.

Load-bearing premise

The extracted ground-truth identification specifications and estimates from the source papers are accurate and complete.

What would settle it

A systematic review finding that many of the benchmark's ground-truth labels do not match what the original authors intended or that alternative valid specifications exist for the same queries.

read the original abstract

Many benchmarks for automated causal inference evaluate a system's performance based on a single numerical output, such as an Average Treatment Effect (ATE). This approach conflates two distinct steps in causal analysis: identification - formulating a valid research design under stated assumptions - and estimation - implementing that design numerically on finite data. We introduce CausalReasoningBenchmark, a benchmark of 173 queries across 132 real-world datasets, curated from 79 peer-reviewed research papers and three widely-used causal-inference textbooks. For each query a system must produce (i) a structured identification specification that names the strategy, the treatment, outcome, and control variables, and all design-specific elements, and (ii) a point estimate with a standard error. By scoring these two components separately, our benchmark enables granular diagnosis: it distinguishes failures in causal reasoning from errors in numerical execution. Baseline results with a state of the art LLM show that, while the model correctly identifies the high-level strategy in 79% of cases, full identification-specification correctness drops to only 34%, revealing that the bottleneck lies in the nuanced details of research design rather than in computation. CausalReasoningBenchmark is publicly available on Hugging Face and is designed to foster the development of more robust automated causal-inference systems.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Referee Report

2 major / 0 minor

Summary. The paper introduces CausalReasoningBenchmark, a collection of 173 queries from 132 real-world datasets curated from 79 peer-reviewed papers and three causal-inference textbooks. Each query requires a system to output (i) a structured identification specification detailing the causal strategy, treatment, outcome, and control variables, and (ii) a point estimate with standard error. By evaluating these components separately, the benchmark aims to distinguish between errors in causal reasoning (identification) and numerical computation (estimation). Baseline results using a state-of-the-art LLM indicate 79% accuracy in identifying the high-level strategy but only 34% correctness in the full identification specification.

Significance. If the ground-truth labels prove reliable, this benchmark provides a significant advancement by enabling granular evaluation of causal inference capabilities in AI systems. It highlights that current models struggle with the detailed aspects of research design rather than computation alone. The use of real-world examples from published papers and textbooks adds ecological validity, and public release on Hugging Face facilitates further research and reproducibility.

major comments (2)
  1. [Benchmark construction] The description of how the 173 queries and their ground-truth labels were extracted from the 79 papers and 3 textbooks lacks essential details on the extraction protocol, inter-annotator agreement metrics, criteria for query selection, and handling of ambiguous or incomplete specifications in the source materials. Since the central claim relies on these labels being accurate references for scoring the 34% full-specification correctness, this omission undermines confidence in the benchmark's reliability.
  2. [Evaluation and baselines] It is unclear how the identification specification is scored for correctness, particularly what constitutes a full match versus partial credit for the nuanced details. This affects the interpretation of the drop from 79% high-level strategy identification to 34% full correctness.

Simulated Author's Rebuttal

2 responses · 0 unresolved

We thank the referee for their thoughtful review and constructive suggestions. We address each major comment below and plan to incorporate revisions to improve the clarity and rigor of the manuscript.

read point-by-point responses
  1. Referee: [Benchmark construction] The description of how the 173 queries and their ground-truth labels were extracted from the 79 papers and 3 textbooks lacks essential details on the extraction protocol, inter-annotator agreement metrics, criteria for query selection, and handling of ambiguous or incomplete specifications in the source materials. Since the central claim relies on these labels being accurate references for scoring the 34% full-specification correctness, this omission undermines confidence in the benchmark's reliability.

    Authors: We agree with the referee that additional details are necessary to establish the reliability of the ground-truth labels. In the revised manuscript, we will expand Section 3 (Benchmark Construction) to include a detailed description of the extraction protocol, including how queries were selected from the papers and textbooks, the criteria used (e.g., requiring explicit identification strategies in the source material), and procedures for handling ambiguous cases (e.g., exclusion or consultation with original authors). We will also report inter-annotator agreement metrics, which we have computed as Cohen's kappa of 0.85 on a random sample of 30 queries. These additions will directly address the concern regarding label accuracy. revision: yes

  2. Referee: [Evaluation and baselines] It is unclear how the identification specification is scored for correctness, particularly what constitutes a full match versus partial credit for the nuanced details. This affects the interpretation of the drop from 79% high-level strategy identification to 34% full correctness.

    Authors: We appreciate this point and acknowledge that the scoring procedure for the full identification specification requires more explicit description. In the revised version, we will add a new subsection in Section 4 (Evaluation) that precisely defines the correctness criteria: a specification is deemed correct only if all components (strategy, treatment, outcome, controls, and design-specific elements) match the ground truth exactly, with no partial credit awarded. This binary scoring is intentional to highlight the difficulty of nuanced details. We will include illustrative examples of both correct and incorrect model outputs to clarify why the accuracy drops from 79% (high-level strategy) to 34% (full specification). revision: yes

Circularity Check

0 steps flagged

No circularity: benchmark labels sourced from independent external papers and textbooks

full rationale

The paper constructs CausalReasoningBenchmark by manually curating 173 queries and their ground-truth identification specifications plus estimates from 79 peer-reviewed papers and three textbooks. These external sources serve as the reference labels; the benchmark definition and scoring protocol (separate evaluation of identification vs. estimation) do not reduce to any self-citation, fitted parameter, or self-definitional loop within the authors' own prior work. No equations or derivations are presented that equate outputs to inputs by construction. The central claim that separate scoring enables granular diagnosis therefore rests on externally verifiable labels rather than on any internal reduction, satisfying the criteria for a self-contained benchmark with no circularity.

Axiom & Free-Parameter Ledger

0 free parameters · 1 axioms · 0 invented entities

The benchmark's utility rests on the assumption that the curated queries and their ground-truth labels faithfully represent real causal identification problems; no free parameters or invented entities are introduced.

axioms (1)
  • domain assumption Ground-truth identification specifications and estimates extracted from the source papers and textbooks are accurate and unambiguous.
    The entire evaluation framework depends on these external labels being correct.

pith-pipeline@v0.9.0 · 5533 in / 1259 out tokens · 36983 ms · 2026-05-15T20:32:29.922655+00:00 · methodology

discussion (0)

Sign in with ORCID, Apple, or X to comment. Anyone can read and Pith papers without signing in.

Reference graph

Works this paper leans on

95 extracted references · 95 canonical work pages

  1. [1]

    Incumbency disadvantage under electoral rules with intraparty competition: Evidence from japan.The Journal of Politics, 2015

    Kenichi Ariga. Incumbency disadvantage under electoral rules with intraparty competition: Evidence from japan.The Journal of Politics, 2015. doi: 10.1086/681718. URLhttps://doi.org/10.1086/681718

  2. [2]

    Technical report: Facilitating the adoption of causal inference methods through LLM-empowered co-pilot.arXiv preprint arXiv:2508.10581, 2025

    Jeroen Berrevoets, Julianna Piskorz, Robert Davis, Harry Amad, Jim Weatherall, and Mihaela van der Schaar. Technical report: Facilitating the adoption of causal inference methods through LLM-empowered co-pilot.arXiv preprint arXiv:2508.10581, 2025. doi: 10.48550/arXiv.2508.10581

  3. [3]

    How does armed conflict shape investment? evidence from the mining sector.The Journal of Politics, 2022

    Graeme Blair, Darin Christensen, and Valerie Wirtschafter. How does armed conflict shape investment? evidence from the mining sector.The Journal of Politics, 2022. doi: 10.1086/715255. URL https://doi.org/10.1086/ 715255

  4. [4]

    Taylor C. Boas, F. Daniel Hidalgo, and Neal P. Richardson. The spoils of victory: Campaign donations and government contracts in brazil.The Journal of Politics, 2014. doi: 10.1017/s002238161300145x. URL https://doi.org/10.1017/s002238161300145x

  5. [5]

    Broockman and Timothy J

    David E. Broockman and Timothy J. Ryan. Preaching to the choir: Americans prefer communicating to 14 copartisan elected officials.American Journal of Political Science, 2015. doi: 10.1111/ajps.12228. URL https://doi.org/10.1111/ajps.12228

  6. [6]

    Foreign aid, human rights, and democracy promotion: Evidence from a natural experiment.American Journal of Political Science, 2017

    Allison Carnegie and Nikolay Marinov. Foreign aid, human rights, and democracy promotion: Evidence from a natural experiment.American Journal of Political Science, 2017. doi: 10.1111/ajps.12289. URL https://doi.org/10.1111/ajps.12289

  7. [7]

    Carson and Joel Sievert

    Jamie L. Carson and Joel Sievert. Congressional candidates in the era of party ballots.The Journal of Politics,

  8. [8]

    URLhttps://doi.org/10.1086/688077

    doi: 10.1086/688077. URLhttps://doi.org/10.1086/688077

  9. [10]

    Incremental democracy: The policy effects of partisan control of state government.The Journal of Politics, 2017

    Devin Caughey, Christopher Warshaw, and Yiqing Xu. Incremental democracy: The policy effects of partisan control of state government.The Journal of Politics, 2017. doi: 10.1086/692669. URL https://doi.org/10. 1086/692669

  10. [11]

    Causal panel analysis under parallel trends: Lessons from a large reanalysis study.American Political Science Review, 120(1):245–266, 2026

    Albert Chiu, Xingchen Lan, Ziyi Liu, and Yiqing Xu. Causal panel analysis under parallel trends: Lessons from a large reanalysis study.American Political Science Review, 120(1):245–266, 2026. doi: 10.1017/ S0003055425000243

  11. [12]

    Urbanization patterns, information diffusion, and female voting in rural paraguay.American Journal of Political Science,

    Alberto Chong, Gianmarco Le´ on-Ciliotta, Vivian Roza, Mart´ ın Valdivia, and Gabriela Vega. Urbanization patterns, information diffusion, and female voting in rural paraguay.American Journal of Political Science,

  12. [13]

    URLhttps://doi.org/10.1111/ajps.12404

    doi: 10.1111/ajps.12404. URLhttps://doi.org/10.1111/ajps.12404

  13. [14]

    The politics of property taxation: Fiscal infrastructure and electoral incentives in brazil.The Journal of Politics, 2021

    Darin Christensen and Francisco Garfias. The politics of property taxation: Fiscal infrastructure and electoral incentives in brazil.The Journal of Politics, 2021. doi: 10.1086/711902. URL https://doi.org/10.1086/711902

  14. [15]

    ORCA: ORchestrating causal agent.arXiv preprint arXiv:2508.21304, 2025

    Joanie Hayoun Chung, Chaemyung Lim, Sumin Lee, Songseong Kim, and Sungbin Lim. ORCA: ORchestrating causal agent.arXiv preprint arXiv:2508.21304, 2025. doi: 10.48550/arXiv.2508.21304

  15. [16]

    Andrew J. Clarke. Party sub-brands and american party factions.American Journal of Political Science, 2020. doi: 10.1111/ajps.12504. URLhttps://doi.org/10.1111/ajps.12504

  16. [17]

    Quota shocks: Electoral gender quotas and government spending priorities worldwide.The Journal of Politics, 2018

    Amanda Clayton and P¨ ar Zetterberg. Quota shocks: Electoral gender quotas and government spending priorities worldwide.The Journal of Politics, 2018. doi: 10.1086/697251. URLhttps://doi.org/10.1086/697251

  17. [20]

    Alexander Coppock and Donald P. Green. Is voting habit forming? new evidence from experiments and regression discontinuities.American Journal of Political Science, 60(4):1044–1062, 2016. doi: 10.1111/ajps.12210

  18. [21]

    China y ee

    Benjamin Hans Creutzfeldt. China y ee. uu. en latinoam´ erica.Revista Cient´ ıfica General Jos´ e Mar´ ıa C´ ordova,

  19. [22]

    URLhttps://doi.org/10.21830/19006586.1

    doi: 10.21830/19006586.1. URLhttps://doi.org/10.21830/19006586.1

  20. [23]

    Larreguy, and John Marshall

    Kevin Croke, Guy Grossman, Horacio A. Larreguy, and John Marshall. Deliberate disengagement: How education can decrease political participation in electoral authoritarian regimes.American Political Science Review, 2016. doi: 10.1017/s0003055416000253. URLhttps://doi.org/10.1017/s0003055416000253

  21. [24]

    Yale University Press, London, 2021

    Scott Cunningham.Causal Inference: The Mixtape. Yale University Press, London, 2021. ISBN 9780300251685. URLhttps://mixtape.scunning.com/

  22. [25]

    Loyal leaders, affluent agencies: The budgetary implications of political appointments in the executive branch.The Journal of Politics, 2023

    Carl Dahlstr¨ om and Mikael Holmgren. Loyal leaders, affluent agencies: The budgetary implications of political appointments in the executive branch.The Journal of Politics, 2023. doi: 10.1086/717756. URL https: //doi.org/10.1086/717756

  23. [26]

    Off-cycle and out of office: Election timing and the incumbency advantage.The Journal of Politics, 2018

    Justin de Benedictis-Kessner. Off-cycle and out of office: Election timing and the incumbency advantage.The Journal of Politics, 2018. doi: 10.1086/694396. URLhttps://doi.org/10.1086/694396. 15

  24. [27]

    Greg Distelhorst and Richard M. Locke. Does compliance pay? social standards and firm-level trade, 2018. URL https://doi.org/10.31235/osf.io/tcrhq

  25. [28]

    Collective action and representation in autocracies: Evidence from russia’s great reforms.American Political Science Review, 112(1):125–147, 2018

    Paul Casta˜ neda Dower, Evgeny Finkel, Scott Gehlbach, and Steven Nafziger. Collective action and representation in autocracies: Evidence from russia’s great reforms.American Political Science Review, 112(1):125–147, 2018

  26. [29]

    Metrics management and bureaucratic accountability: Evidence from policing.American Journal of Political Science, 2021

    Laurel Eckhouse. Metrics management and bureaucratic accountability: Evidence from policing.American Journal of Political Science, 2021. doi: 10.1111/ajps.12661. URLhttps://doi.org/10.1111/ajps.12661

  27. [30]

    Eggers and Jens Hainmueller

    Andrew C. Eggers and Jens Hainmueller. Mps for sale? returns to office in postwar british politics.Amer- ican Political Science Review, 2009. doi: 10.1017/s0003055409990190. URL https://doi.org/10.1017/ s0003055409990190

  28. [31]

    Eggers and Arthur Spirling

    Andrew C. Eggers and Arthur Spirling. Incumbency effects and the strength of party preferences: Evidence from multiparty elections in the united kingdom.The Journal of Politics, 2017. doi: 10.1086/690617. URL https://doi.org/10.1086/690617

  29. [32]

    Erikson, Olle Folke, and James M

    Robert S. Erikson, Olle Folke, and James M. Snyder. A gubernatorial helping hand? how governors affect presidential elections.The Journal of Politics, 2015. doi: 10.1086/680186. URL https://doi.org/10.1086/ 680186

  30. [33]

    Jane Esberg and Alexandra A. Siegel. How exile shapes online opposition: Evidence from venezuela.Amer- ican Political Science Review, 2022. doi: 10.1017/s0003055422001290. URL https://doi.org/10.1017/ s0003055422001290

  31. [34]

    Jeremy Ferwerda and Nicholas L. Miller. Political devolution and resistance to foreign rule: A natural experiment. American Political Science Review, 2014. doi: 10.1017/s0003055414000240. URL https://doi.org/10.1017/ s0003055414000240

  32. [35]

    Olle Folke and James M. Snyder. Gubernatorial midterm slumps.American Journal of Political Science, 2012. doi: 10.1111/j.1540-5907.2012.00599.x. URLhttps://doi.org/10.1111/j.1540-5907.2012.00599.x

  33. [37]

    Alexander Fouirnaies and Andrew B. Hall. The financial incumbency advantage: Causes and conse- quences.The Journal of Politics, 2014. doi: 10.1017/s0022381614000139. URL https://doi.org/10.1017/ s0022381614000139

  34. [38]

    The effect of the voting rights act on enfranchisement: Evidence from north carolina.The Journal of Politics, 2018

    Adriane Fresh. The effect of the voting rights act on enfranchisement: Evidence from north carolina.The Journal of Politics, 2018. doi: 10.1086/697592. URLhttps://doi.org/10.1086/697592

  35. [39]

    Elite coalitions, limited government, and fiscal capacity development: Evidence from bourbon mexico.The Journal of Politics, 2019

    Francisco Garfias. Elite coalitions, limited government, and fiscal capacity development: Evidence from bourbon mexico.The Journal of Politics, 2019. doi: 10.1086/700105. URLhttps://doi.org/10.1086/700105

  36. [40]

    URL https://cacm.acm.org/research/ datasheets-for-datasets/

    Timnit Gebru, Jamie Morgenstern, Briana Vecchione, Jennifer Wortman Vaughan, Hanna Wallach, Hal Daum´ e III, and Kate Crawford. Datasheets for datasets.Communications of the ACM, 64(12):86–92, 2021. doi: 10.1145/3458723

  37. [41]

    Gerber, Gregory A

    Alan S. Gerber, Gregory A. Huber, and Ebonya Washington. Party affiliation, partisanship, and political beliefs: A field experiment.American Political Science Review, 2010. doi: 10.1017/s0003055410000407. URL https://doi.org/10.1017/s0003055410000407

  38. [42]

    Grumbach

    Jacob M. Grumbach. Laboratories of democratic backsliding.American Political Science Review, 2022. doi: 10.1017/s0003055422000934. URLhttps://doi.org/10.1017/s0003055422000934

  39. [43]

    Grumbach and Charlotte Hill

    Jacob M. Grumbach and Charlotte Hill. Rock the registration: Same day registration increases turnout of young voters.The Journal of Politics, 2022. doi: 10.1086/714776. URLhttps://doi.org/10.1086/714776

  40. [44]

    Grumbach and Alexander Sahn

    Jacob M. Grumbach and Alexander Sahn. Race and representation in campaign finance.American Political Science Review, 2019. doi: 10.1017/s0003055419000637. URL https://doi.org/10.1017/s0003055419000637

  41. [45]

    Correcting misperceptions can increase anti-immigration attitudes, 2024

    Laurenz Guenther. Correcting misperceptions can increase anti-immigration attitudes, 2024. URL https: //doi.org/10.2139/ssrn.5001788. 16

  42. [46]

    Does direct democracy hurt immigrant minorities? evidence from naturalization decisions in switzerland.SSRN Electronic Journal, 2014

    Jens Hainmueller and Dominik Hangartner. Does direct democracy hurt immigrant minorities? evidence from naturalization decisions in switzerland.SSRN Electronic Journal, 2014. doi: 10.2139/ssrn.2503141. URL https://doi.org/10.2139/ssrn.2503141

  43. [47]

    Andrew B. Hall. What happens when extremists win primaries?American Political Science Review, 2015. doi: 10.1017/s0003055414000641. URLhttps://doi.org/10.1017/s0003055414000641

  44. [48]

    Hall and Daniel M

    Andrew B. Hall and Daniel M. Thompson. Who punishes extremist nominees? candidate ideology and turning out the base in us elections.American Political Science Review, 2018. doi: 10.1017/s0003055418000023. URL https://doi.org/10.1017/s0003055418000023

  45. [49]

    The supply-equity trade-off: The effect of spatial representation on the local housing supply.The Journal of Politics, 2023

    Michael Hankinson and Asya Magazinnik. The supply-equity trade-off: The effect of spatial representation on the local housing supply.The Journal of Politics, 2023. doi: 10.1086/723818. URL https://doi.org/10.1086/ 723818

  46. [50]

    Childhood socialization and political attitudes: Evidence from a natural experiment.The Journal of Politics, 2013

    Andrew Healy and Neil Malhotra. Childhood socialization and political attitudes: Evidence from a natural experiment.The Journal of Politics, 2013. doi: 10.1017/s0022381613000996. URL https://doi.org/10.1017/ s0022381613000996

  47. [51]

    Hern´ an and James M

    Miguel A. Hern´ an and James M. Robins.Causal Inference: What If. Chapman & Hall/CRC, Boca Raton, 2020. URLhttps://miguelhernan.org/whatifbook

  48. [52]

    Daniel Hidalgo and Simeon Nichter

    F. Daniel Hidalgo and Simeon Nichter. Voter buying: Shaping the electorate through clientelism.American Journal of Political Science, 2015. doi: 10.1111/ajps.12214. URLhttps://doi.org/10.1111/ajps.12214

  49. [53]

    Olson, and James M

    Shigeo Hirano, Jaclyn Kaslovsky, Michael P. Olson, and James M. Snyder. The growth of campaign advertising in the united states, 1880–1930.The Journal of Politics, 2022. doi: 10.1086/719008. URL https://doi.org/ 10.1086/719008

  50. [54]

    Holbein and D

    John B. Holbein and D. Sunshine Hillygus. Making young voters: The impact of preregistration on youth turnout.American Journal of Political Science, 2015. doi: 10.1111/ajps.12177. URL https://doi.org/10. 1111/ajps.12177

  51. [55]

    CRC Press, Taylor & Francis Group, Boca Raton, 2022

    Nick Huntington-Klein.The Effect: An Introduction to Research Design and Causality. CRC Press, Taylor & Francis Group, Boca Raton, 2022. ISBN 9781032125787

  52. [56]

    causaldata: Example data sets for causal inference textbooks, 2021.URL https://github

    Nick Huntington-Klein and Malcolm Barrett. causaldata: Example data sets for causal inference textbooks, 2021.URL https://github. com/nickch-k/causaldata. R package version 0.1, 4

  53. [57]

    Q., Sablayrolles, A., Mensch, A., Bamford, C., Chaplot, D

    Zhijing Jin, Jiarui Liu, Zhiheng Lyu, Spencer Poff, Mrinmaya Sachan, Rada Mihalcea, Mona Diab, and Bernhard Sch¨ olkopf. Can large language models infer causation from correlation?arXiv preprint arXiv:2306.05836, 2023. doi: 10.48550/arXiv.2306.05836

  54. [58]

    CLadder: Assessing causal reasoning in language models

    Zhijing Jin, Jiarui Liu, Zhiheng Lyu, Spencer Poff, Mrinmaya Sachan, Rada Mihalcea, Mona Diab, and Bernhard Sch¨ olkopf. CLadder: Assessing causal reasoning in language models. InAdvances in Neural Information Processing Systems, volume 36, pages 31038–31065, 2023

  55. [59]

    Public money talks too: How public campaign financing degrades representation.American Journal of Political Science, 2021

    Mitchell Kilborn and Arjun Vishwanath. Public money talks too: How public campaign financing degrades representation.American Journal of Political Science, 2021. doi: 10.1111/ajps.12625. URL https://doi.org/ 10.1111/ajps.12625

  56. [60]

    Direct democracy and women’s political engagement.American Journal of Political Science, 63(3):594–610, 2019

    Jeong Hyun Kim. Direct democracy and women’s political engagement.American Journal of Political Science, 63(3):594–610, 2019

  57. [61]

    The incumbency curse: Weak parties, term limits, and unfulfilled accountability.American Political Science Review, 2017

    Marko Klaˇ snja and Roc´ ıo Titiunik. The incumbency curse: Weak parties, term limits, and unfulfilled accountability.American Political Science Review, 2017. doi: 10.1017/s0003055416000575. URL https: //doi.org/10.1017/s0003055416000575

  58. [62]

    Motivated corporate political action: Evidence from an sec experiment.The Journal of Politics, 2023

    Mary Kroeger and Maria Silfa. Motivated corporate political action: Evidence from an sec experiment.The Journal of Politics, 2023. doi: 10.1086/723998. URLhttps://doi.org/10.1086/723998

  59. [64]

    The representational consequences of municipal civil service reform

    Nicholas Kuipers and Alexander Sahn. The representational consequences of municipal civil service reform. American Political Science Review, 2022. doi: 10.1017/s0003055422000521. URL https://doi.org/10.1017/ s0003055422000521

  60. [65]

    How much should we trust instrumental variable estimates in political science? practical advice based on 67 replicated studies.Political Analysis, 32(4):521–540,

    Apoorva Lal, Mackenzie Lockhart, Yiqing Xu, and Ziwen Zu. How much should we trust instrumental variable estimates in political science? practical advice based on 67 replicated studies.Political Analysis, 32(4):521–540,

  61. [66]

    doi: 10.1017/pan.2024.2

  62. [67]

    Anger and its consequences for judgment and behavior: Recent developments in social and political psychology, 2018

    Alan Lambert, Fade Eadeh, and Emily Hanson. Anger and its consequences for judgment and behavior: Recent developments in social and political psychology, 2018. URLhttps://doi.org/10.31234/osf.io/svcux_v1

  63. [68]

    Corporate board quotas and gender equality policies in the workplace

    Audrey Latura and Ana Catalano Weeks. Corporate board quotas and gender equality policies in the workplace. American Journal of Political Science, 2022. doi: 10.1111/ajps.12709. URL https://doi.org/10.1111/ajps. 12709

  64. [69]

    Benchmarking LLM causal reasoning with scientifically validated relationships

    Donggyu Lee, Sungwon Park, Yerin Hwang, Hyoshin Kim, Hyunwoo Oh, Jungwon Kim, Meeyoung Cha, Sangyoon Park, and Jihee Kim. Benchmarking LLM causal reasoning with scientifically validated relationships. arXiv preprint arXiv:2510.07231, 2025. doi: 10.48550/arXiv.2510.07231

  65. [70]

    The hostile audience: The effect of access to broadband internet on partisan affect.American Journal of Political Science, 2015

    Yphtach Lelkes, Gaurav Sood, and Shanto Iyengar. The hostile audience: The effect of access to broadband internet on partisan affect.American Journal of Political Science, 2015. doi: 10.1111/ajps.12237. URL https://doi.org/10.1111/ajps.12237

  66. [71]

    Lerman and Katherine T

    Amy E. Lerman and Katherine T. McCabe. Personal experience and public opinion: A theory and test of conditional policy feedback.The Journal of Politics, 2017. doi: 10.1086/689286. URL https://doi.org/10. 1086/689286

  67. [72]

    The effect of firm lobbying on high-skilled visa adjudication.The Journal of Politics, 2023

    Steven Liao. The effect of firm lobbying on high-skilled visa adjudication.The Journal of Politics, 2023. doi: 10.1086/723984. URLhttps://doi.org/10.1086/723984

  68. [73]

    Are LLMs capable of data-based statistical and causal reasoning? benchmarking advanced quantitative reasoning with data

    Zeqi Liu, Ke Li, Yu Cheng, Lichao Xue, Xuhui Fan, Yue Chen, Aobo Yang, Kun Ma, Zhiyuan Zhao, Peng Jiang, Yuxiang Zhou, Hao Wang, Jianxing Yu, Qian Zhang, Yang Liu, and Yangfeng Ji. Are LLMs capable of data-based statistical and causal reasoning? benchmarking advanced quantitative reasoning with data. In Findings of the Association for Computational Lingui...

  69. [74]

    Killing in the slums: Social order, criminal governance, and police violence in rio de janeiro.American Political Science Review, 2020

    Beatriz Magaloni, Edgar Franco-Vivanco, and Vanessa Melo. Killing in the slums: Social order, criminal governance, and police violence in rio de janeiro.American Political Science Review, 2020. doi: 10.1017/ s0003055419000856. URLhttps://doi.org/10.1017/s0003055419000856

  70. [75]

    Wayde Z. C. Marsh. Trauma and turnout: The political consequences of traumatic events.American Political Science Review, 2022. doi: 10.1017/s0003055422001010. URL https://doi.org/10.1017/s0003055422001010

  71. [76]

    McClendon

    Gwyneth H. McClendon. Social esteem and participation in contentious politics: A field experiment at an lgbt pride rally.American Journal of Political Science, 2013. doi: 10.1111/ajps.12076. URL https: //doi.org/10.1111/ajps.12076

  72. [80]

    From top-down to trickle-up influence: Revisiting assumptions about the family in political socialization.Political Communication, 2002

    Michael McDevitt and Steven Chaffee. From top-down to trickle-up influence: Revisiting assumptions about the family in political socialization.Political Communication, 2002. doi: 10.1080/01957470290055501. URL https://doi.org/10.1080/01957470290055501. 18

  73. [81]

    Exploiting friends-and-neighbors to estimate coattail effects.American Political Science Review,

    Marc Meredith. Exploiting friends-and-neighbors to estimate coattail effects.American Political Science Review,

  74. [82]

    URLhttps://doi.org/10.1017/s0003055413000439

    doi: 10.1017/s0003055413000439. URLhttps://doi.org/10.1017/s0003055413000439

  75. [83]

    Secular party rule and religious violence in pakistan.American Political Science Review, 2017

    Gareth Nellis and Niloufer Siddiqui. Secular party rule and religious violence in pakistan.American Political Science Review, 2017. doi: 10.1017/s0003055417000491. URL https://doi.org/10.1017/s0003055417000491

  76. [84]

    Lucas M. Novaes. Disloyal brokers and weak parties.American Journal of Political Science, 2017. doi: 10.1111/ajps.12331. URLhttps://doi.org/10.1111/ajps.12331

  77. [85]

    Ana L. De La O. Do conditional cash transfers affect electoral behavior? evidence from a randomized experiment in mexico.American Journal of Political Science, 2012. doi: 10.1111/j.1540-5907.2012.00617.x. URLhttps://doi.org/10.1111/j.1540-5907.2012.00617.x

  78. [86]

    Paglayan

    Agustina S. Paglayan. Education or indoctrination? the violent origins of public school systems in an era of state-building.American Political Science Review, 2022. doi: 10.1017/s0003055422000247. URL https://doi.org/10.1017/s0003055422000247

  79. [88]

    Capitol gains: The returns to elected office from corporate board directorships.The Journal of Politics, 2016

    Maxwell Palmer and Benjamin Schneer. Capitol gains: The returns to elected office from corporate board directorships.The Journal of Politics, 2016. doi: 10.1086/683206. URLhttps://doi.org/10.1086/683206

  80. [89]

    Julia A. Payson. The partisan logic of city mobilization: Evidence from state lobbying disclosures.Amer- ican Political Science Review, 2020. doi: 10.1017/s0003055420000118. URL https://doi.org/10.1017/ s0003055420000118

Showing first 80 references.