Causal methods for LLM development and evaluation

Abdurahman Maarouf; Athiya Deviyani; Dennis Frauen; Haorui Ma; Jonas Schweisthal; Konstantin Hess; Maresa Schr\"oder; Marie Brockschmidt; Rahul G. Krishnan; Sonali Parbhoo

arxiv: 2605.25998 · v1 · pith:JBTZV24Nnew · submitted 2026-05-25 · 💻 cs.LG

Causal methods for LLM development and evaluation

Dennis Frauen , Marie Brockschmidt , Konstantin Hess , Haorui Ma , Yuchen Ma , Abdurahman Maarouf , Maresa Schr\"oder , Jonas Schweisthal

show 5 more authors

Yuxin Wang Athiya Deviyani Sonali Parbhoo Rahul G. Krishnan Stefan Feuerriegel

This is my paper

Pith reviewed 2026-06-29 22:49 UTC · model grok-4.3

classification 💻 cs.LG

keywords causal inferencelarge language modelsLLM developmentevaluationconfoundingdistribution shiftspretrainingalignment

0 comments

The pith

Central questions in LLM development like data domain effects and routing decisions are causal and need identification methods beyond prediction.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper argues that many core questions in building and testing large language models involve interventions whose effects must be isolated from confounding and shifts in logged data. It shows how purely predictive approaches become fragile when reward models are biased, environments change over time, or judges in evaluation carry their own biases. The authors map specific places across pretraining, alignment, routing, agent workflows, and evaluation where causal identification and estimation can be applied instead. A reader would care because this shift could replace brittle large-scale trial-and-error with more stable, interpretable design choices grounded in the actual causes of performance changes.

Core claim

The paper claims that causal methods are underutilized in the LLM pipeline even though questions about the effect of adding data domains, changes in annotator preferences under different styles, and cost-aware routing are inherently causal; these settings feature logged data subject to confounding and distribution shifts, learned but potentially biased judges, and non-stationary deployment, all of which create opportunities for principled causal identification and estimation rather than fragile predictive iteration.

What carries the argument

Causal identification and estimation applied to logged LLM data, reward models, and evaluation pipelines to isolate intervention effects despite confounding and shifts.

If this is right

Causal methods can identify the true effect of adding a data domain during pretraining despite selection biases in logged data.
Causal estimation can recover how annotator preferences shift when LLMs generate text in new styles, improving alignment.
Causal routing decisions can optimize model size choice under inference cost constraints using logged interaction data.
Causal debiasing of learned judges can improve evaluation reliability when judges themselves are influenced by style or other factors.
Causal approaches can handle non-stationary deployment by identifying time-varying effects rather than assuming stable distributions.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

Causal methods could reduce the number of full-scale retraining runs needed by allowing targeted interventions on subsets of logged data.
The same framework might extend to agentic workflows by treating tool-use decisions as interventions whose downstream effects can be identified.
A practical test would be to run parallel A/B interventions on small models and check whether causal estimates from larger logged datasets match the observed outcomes.

Load-bearing premise

The central questions in LLM development are inherently causal and the standard identification assumptions from causal inference can be met in large-scale logged LLM data without prohibitive practical barriers.

What would settle it

A controlled intervention on a data domain or routing policy whose measured outcome differs substantially from the causal estimate obtained from existing logged data would show the identification assumptions fail in practice.

Figures

Figures reproduced from arXiv: 2605.25998 by Abdurahman Maarouf, Athiya Deviyani, Dennis Frauen, Haorui Ma, Jonas Schweisthal, Konstantin Hess, Maresa Schr\"oder, Marie Brockschmidt, Rahul G. Krishnan, Sonali Parbhoo, Stefan Feuerriegel, Yuchen Ma, Yuxin Wang.

read the original abstract

Large language model (LLM) development is currently driven by large-scale empirical iteration over data mixtures, reward models, routing strategies, and evaluation pipelines. Here, we argue that many central questions in LLM development and evaluation are inherently causal: What is the effect of adding a data domain during pretraining? How do annotator preferences change when LLMs generate text in a different style? Should a prompt be routed to a larger or smaller model given inference cost constraints? In general, causal methods are well-suited to such settings where interventions change outcomes but, surprisingly, are underrepresented in LLM development. Our contribution is threefold: (1) We explain how causal methods can help develop modern LLM development and evaluation: LLM development relies heavily on logged data, which are often subject to confounding and distribution shifts; evaluation uses learned but potentially biased judges; and deployment environments are non-stationary. These conditions make purely predictive approaches fragile and create opportunities for principled identification and estimation methods from causal inference. (2) We further map opportunities for causal methods in the entire LLM development pipeline, including pretraining, alignment, routing, agentic workflows, and evaluation. (3) We discuss new research opportunities around leveraging causal methods for LLM development and evaluation. Overall, we argue that causal methods are potentially underutilized for the LLM development and evaluation pipeline, despite the fact that such methods can ensure a reliable and scientifically grounded design.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Desk Editor's Note private letter to a colleague

This is a position paper mapping causal inference ideas to LLM pipelines without new methods, examples, or evidence.

read the letter

The main point is that many decisions in LLM development involve causal questions—effects of data domains, changes in annotator preferences, or routing choices under cost constraints—and that logged data often carries confounding and shifts that break purely predictive methods. The paper lays this out clearly across pretraining, alignment, routing, agents, and evaluation.

It does a solid job framing why standard empirical iteration can be fragile here. The sections on biased judges and non-stationary environments are straightforward and point to real practical issues.

The weakness is that it stays at the level of opportunity mapping. There are no identification strategies, no small-scale examples, and no discussion of whether the usual assumptions (no unmeasured confounding, positivity, etc.) can actually be met with the data volumes and logging practices common in LLM work. The feasibility questions are left open.

This is for people already thinking about causal methods who want a quick overview of where they might intersect with current LLM practice. It could generate follow-up ideas but does not deliver results that would change how anyone builds or evaluates models today.

I would send it to peer review. The questions are legitimate and worth community discussion even if the paper itself is mostly a call for future work.

Referee Report

2 major / 1 minor

Summary. The paper argues that many central questions in LLM development and evaluation—such as the effect of adding data domains during pretraining, changes in annotator preferences, and routing decisions under cost constraints—are inherently causal. It claims that causal inference methods are surprisingly underrepresented despite their suitability for handling confounding in logged data, distribution shifts, biased evaluators, and non-stationary environments. The contribution consists of explaining these opportunities, mapping them across the full pipeline (pretraining, alignment, routing, agents, evaluation), and outlining new research directions, with the overall thesis that causal methods can support more reliable and scientifically grounded LLM design.

Significance. If the central argument holds, the paper could usefully redirect LLM research from purely empirical iteration toward identification-based approaches that explicitly address confounding and shifts. The mapping of opportunities across the pipeline is a constructive taxonomy that may help organize future work. However, the manuscript offers no new identification results, empirical demonstrations, or feasibility checks, so its significance rests on its potential to stimulate such work rather than on any demonstrated advance.

major comments (2)

[Abstract] Abstract and contribution (1): the claim that causal methods are 'well-suited' and can ensure 'reliable and scientifically grounded design' is not supported by any concrete identification strategy or worked example for the listed questions (e.g., effect of a data domain or routing decision). Without this, the argument that such methods are feasible at LLM scale remains a plausibility claim rather than a demonstrated one.
[Contribution (2)] Contribution (2): the mapping of opportunities does not discuss how standard causal assumptions (positivity, consistency, no unmeasured confounding) would be met in large-scale logged LLM data, where interventions such as data-mixture changes are rarely randomized and logs are subject to selection bias; this is load-bearing for the claim that causal methods can be directly applied without prohibitive practical barriers.

minor comments (1)

[Abstract] The abstract and introduction repeat the phrase 'underutilized' without citing any survey or quantitative evidence of current usage rates in the LLM literature.

Simulated Author's Rebuttal

2 responses · 0 unresolved

We thank the referee for the detailed and constructive report. As a position paper, our aim is to map opportunities for causal methods rather than to deliver new identification theorems or large-scale experiments. We address the two major comments below.

read point-by-point responses

Referee: [Abstract] Abstract and contribution (1): the claim that causal methods are 'well-suited' and can ensure 'reliable and scientifically grounded design' is not supported by any concrete identification strategy or worked example for the listed questions (e.g., effect of a data domain or routing decision). Without this, the argument that such methods are feasible at LLM scale remains a plausibility claim rather than a demonstrated one.

Authors: We agree that the manuscript contains no new identification strategies or worked examples; this is outside the stated scope of a position paper whose contribution is to surface structural parallels between LLM development problems and causal inference settings. The suitability claim rests on the observation that logged LLM data routinely exhibit confounding, selection bias, and non-stationarity—conditions under which purely predictive approaches are known to be fragile. We will revise the abstract and introduction to state the scope more explicitly and to avoid any implication that feasibility at LLM scale has already been shown. revision: partial
Referee: [Contribution (2)] Contribution (2): the mapping of opportunities does not discuss how standard causal assumptions (positivity, consistency, no unmeasured confounding) would be met in large-scale logged LLM data, where interventions such as data-mixture changes are rarely randomized and logs are subject to selection bias; this is load-bearing for the claim that causal methods can be directly applied without prohibitive practical barriers.

Authors: This is a fair observation. The current draft does not explicitly examine how positivity, consistency, and no unmeasured confounding would hold (or be violated) under the selection mechanisms present in production LLM logs. We will add a dedicated subsection that (a) states the assumptions, (b) illustrates likely violations arising from non-randomized data-mixture changes and selection bias, and (c) points to existing causal techniques (e.g., sensitivity analysis, proximal causal inference) that have been developed precisely for such settings. revision: yes

Circularity Check

0 steps flagged

No significant circularity

full rationale

The paper is a position/advocacy piece with no derivations, equations, fitted parameters, identification results, or theorems. Its central claim—that causal questions are common in LLM pipelines and causal methods are underutilized—rests on descriptive examples rather than any self-referential reduction, self-citation chain, or renaming of prior results. No load-bearing step reduces to the paper's own inputs by construction.

Axiom & Free-Parameter Ledger

0 free parameters · 1 axioms · 0 invented entities

The central claim rests on the domain assumption that LLM development questions satisfy the conditions for causal identification; no free parameters, invented entities, or additional axioms are introduced in the abstract.

axioms (1)

domain assumption LLM development questions (data domain effects, preference changes, routing) are inherently causal and amenable to standard identification strategies.
Invoked in the opening paragraph when the authors state that many central questions are inherently causal.

pith-pipeline@v0.9.1-grok · 5831 in / 1239 out tokens · 22837 ms · 2026-06-29T22:49:50.561715+00:00 · methodology

discussion (0)

Reference graph

Works this paper leans on

112 extracted references · 15 canonical work pages · 9 internal anchors

[1]

Angelopoulos, Stephen Bates, Clara Fannjiang, Michael I

Anastasios N. Angelopoulos, Stephen Bates, Clara Fannjiang, Michael I. Jordan, and Tijana Zrnic. 2023. Prediction-powered inference.Science382, 6671 (2023), 669–674

2023
[2]

Angelopoulos, Jacob Eisenstein, Jonathan Berant, Alekh Agar- wal, and Adam Fisch

Anastasios N. Angelopoulos, Jacob Eisenstein, Jonathan Berant, Alekh Agar- wal, and Adam Fisch. 2025. Cost-optimal active AI model evaluation. arXiv:2506.07949 (2025)

work page arXiv 2025
[3]

Susan Athey and Stefan Wager. 2021. Policy learning with observational data. Econometrica89, 1 (2021), 133–161

2021
[4]

Suhana Bedi, Hejie Cui, Miguel Fuentes, Alyssa Unell, Michael Wornow, Juan M Banda, Nikesh Kotecha, Timothy Keyes, Yifan Mai, Mert Oez, et al. 2026. Holistic evaluation of large language models for medical tasks with MedHELM.Nature Medicine32 (2026), 943–951

2026
[5]

Bender, Timnit Gebru, Angelina McMillan-Major, and Shmargaret Shmitchell

Emily M. Bender, Timnit Gebru, Angelina McMillan-Major, and Shmargaret Shmitchell. 2021. On the dangers of stochastic parrots: Can language models be too big?. InFAccT

2021
[6]

Bickel, Chris A

Peter J. Bickel, Chris A. J. Klaassen, Ya’acov Ritov, and Jon A. Wellner. 1998. Efficient and adaptive estimation for semiparametric models. Springer, New York

1998
[7]

Matthieu Bou, Nyal Patel, Arjun Jagota, Satyapriya Krishna, and Sonali Parbhoo
[8]

The alignment auditor: A Bayesian framework for verifying and refining LLM objectives. InICLR
[9]

Ralph Allan Bradley and Milton E. Terry. 1952. Rank analysis of incomplete block designs: I. The method of paired comparisons.Biometrika39, 3/4 (1952), 324–345

1952
[10]

Nicholas Carlini, Florian Tramer, Eric Wallace, Matthew Jagielski, Ariel Herbert- Voss, Katherine Lee, Adam Roberts, Tom Brown, Dawn Song, Ulfar Erlingsson, et al. 2021. Extracting training data from large language models. InUSENIX Security Symposium

2021
[11]

Ivi Chatzi, Eleni Straitouri, Suhas Thejaswi, and Manuel Gomez Rodriguez. 2024. Prediction-powered ranking of large language models. InNeurIPS

2024
[12]

Victor Chernozhukov, Denis Chetverikov, Mert Demirer, Esther Duflo, Christian Hansen, Whitney Newey, and James M. Robins. 2018. Double/debiased machine learning for treatment and structural parameters.The Econometrics Journal21, 1 (2018), 1–68

2018
[13]

Wayne Chi, Valerie Chen, Anastasios Nikolas Angelopoulos, Wei-Lin Chiang, Aditya Mittal, Naman Jain, Tianjun Zhang, Ion Stoica, Chris Donahue, and Ameet Talwalkar. 2025. Copilot Arena: A platform for code LLM evaluation in the wild. InICML

2025
[14]

Wei-Lin Chiang, Lianmin Zheng, Ying Sheng, Anastasios N Angelopoulos, Tianle Li, Dacheng Li, Banghua Zhu, Hao Zhang, Michael I Jordan, Joseph E Gonzalez, and Ion Stoica. 2024. Chatbot Arena: An open platform for evaluating LLMs by human preference. InICML

2024
[15]

Paul F Christiano, Jan Leike, Tom Brown, Miljan Martic, Shane Legg, and Dario Amodei. 2017. Deep reinforcement learning from human preferences. In NeurIPS

2017
[16]

Thomas Cook, Alan Mishler, and Aaditya Ramdas. 2024. Semiparametric efficient inference in adaptive experiments. InCLeaR

2024
[17]

Alicia Curth and Mihaela Van der Schaar. 2021. Nonparametric estimation of heterogeneous treatment effects: From theory to learning algorithms. In AISTATS

2021
[18]

Athiya Deviyani and Fernando Diaz. 2025. Contextual metric meta-evaluation by measuring local metric accuracy. InFindings of the ACL: NAACL 2025

2025
[19]

Dujian Ding, Ankur Mallick, Chi Wang, Robert Sim, Subhabrata Mukherjee, Victor Ruhle, Laks V. S. Lakshmanan, and Ahmed Hassan Awadallah. 2024. Hybrid LLM: Cost-efficient and quality-aware query routing. InICLR

2024
[20]

Lucas Dixon, John Li, Jeffrey Sorensen, Nithum Thain, and Lucy Vasserman
[21]

Measuring and mitigating unintended bias in text classification. InAIES
[22]

Jesse Dodge, Maarten Sap, Ana Marasović, William Agnew, Gabriel Ilharco, Dirk Groeneveld, Margaret Mitchell, and Matt Gardner. 2021. Documenting large webtext corpora: A case study on the colossal clean crawled corpus. InEMNLP

2021
[23]

Jacob Dorn, Kevin Guo, and Nathan Kallus. 2025. Doubly-valid/doubly-sharp sensitivity analysis for causal inference with unmeasured confounding.Journal of the American Statistical Association120, 549 (2025), 331–342

2025
[24]

Nan Du, Yanping Huang, Andrew M Dai, Simon Tong, Dmitry Lepikhin, Yuanzhong Xu, Maxim Krikun, Yanqi Zhou, Adams Wei Yu, Orhan Firat, Barret Zoph, Liam Fedus, Maarten P Bosma, Zongwei Zhou, Tao Wang, Emma Wang, Kellie Webster, Marie Pellat, Kevin Robinson, Kathleen Meier-Hellstern, Toju Duke, Lucas Dixon, Kun Zhang, Quoc Le, Yonghui Wu, Zhifeng Chen, and C...

2022
[25]

Miroslav Dudík, John Langford, and Lihong Li. 2011. Doubly robust policy evaluation and learning. InICML

2011
[26]

Amir Feder, Katherine A Keith, Emaad Manzoor, Reid Pryzant, Dhanya Sridhar, Zach Wood-Doughty, Jacob Eisenstein, Justin Grimmer, Roi Reichart, Margaret E Roberts, et al. 2022. Causal inference in natural language processing: Estimation, prediction, interpretation and beyond. InTACL

2022
[27]

Shangbin Feng, Chan Young Park, Yuhan Liu, and Yulia Tsvetkov. 2023. From pretraining data to language models to downstream tasks: Tracking the trails of political biases leading to unfair NLP models. InACL

2023
[28]

Adam Fisch, Joshua Maynez, R Alex Hofer, Bhuwan Dhingra, Amir Globerson, and William W Cohen. 2024. Stratified prediction-powered inference for hybrid language model evaluation. InNeurIPS

2024
[29]

Dennis Frauen, Athiya Deviyani, Mihaela van der Schaar, and Stefan Feuerriegel
[30]

Nonparametric LLM evaluation from preference data. InICML
[31]

Dennis Frauen, Valentyn Melnychuk, and Stefan Feuerriegel. 2023. Sharp bounds for generalized causal sensitivity analysis. InNeurIPS

2023
[32]

Dennis Frauen, Valentyn Melnychuk, Lars van der Laan, and Stefan Feuer- riegel. 2026. Machine Learning for Causal Inference. InWiley StatsRef: Statistics Reference Online. John Wiley & Sons

2026
[33]

Angelopoulos, and Ion Stoica

Evan Frick, Connor Chen, Joseph Tennyson, Tianle Li, Wei-Lin Chiang, Anas- tasios N. Angelopoulos, and Ion Stoica. 2025. Prompt-to-leaderboard: Prompt- adaptive LLM evaluations. InICML

2025
[34]

Leo Gao, Stella Biderman, Sid Black, Laurence Golding, Travis Hoppe, Charles Foster, Jason Phang, Horace He, Anish Thite, Noa Nabeshima, Shawn Presser, and Connor Leahy. 2020. The pile: An 800GB dataset of diverse text for language modeling. arXiv:2101.00027

work page internal anchor Pith review Pith/arXiv arXiv 2020
[35]

Yunfan Gao, Yun Xiong, Xinyu Gao, Kangxiang Jia, Jinliu Pan, Yuxi Bi, Yi Dai, Jiawei Sun, Meng Wang, and Haofen Wang. 2024. Retrieval-augmented generation for large language models: A survey. arXiv:2312.10997

work page internal anchor Pith review Pith/arXiv arXiv 2024
[36]

Robert Geirhos, Jörn-Henrik Jacobsen, Claudio Michaelis, Richard Zemel, Wieland Brendel, Matthias Bethge, and Felix A Wichmann. 2020. Shortcut learning in deep neural networks.Nature Machine Intelligence2, 11 (2020), 665–673

2020
[37]

Aaron Grattafiori, Abhimanyu Dubey, Abhinav Jauhri, et al. 2024. The Llama 3 herd of models. arXiv:2407.21783

work page internal anchor Pith review Pith/arXiv arXiv 2024
[38]

Luke Guerdan, Justin Whitehouse, Kimberly Truong, Ken Holstein, and Steven Wu. 2026. Doubly-robust LLM-as-a-judge: Externally valid estimation with imperfect personas. InICLR

2026
[39]

Taicheng Guo, Xiuying Chen, Yaqi Wang, Ruidi Chang, Shichao Pei, Nitesh V Chawla, Olaf Wiest, and Xiangliang Zhang. 2024. Large language model based multi-agents: A survey of progress and challenges. InIJCAI

2024
[40]

Yuan Guo, Peng Tian, Jayashree Kalpathy-Cramer, Susan Ostmo, J.Peter Camp- bell, Michael F.Chiang, Deniz Erdogmus, Jennifer Dy, and Stratis Ioannidis. 2018. Experimental design under the Bradley-Terry model. InIJCAI

2018
[41]

Qitian Jason Hu, Jacob Bieker, Xiuyu Li, Nan Jiang, Benjamin Keigwin, Gaurav Ranganath, Kurt Keutzer, and Shriyash Kaustubh Upadhyay. 2024. RouterBench: A benchmark for multi-LLM routing system. InAgentic Markets Workshop at ICML 2024

2024
[42]

Yue Huang, Zhengzhe Jiang, Yuchen Ma, Yu Jiang, Xiangqi Wang, Yujun Zhou, Yuexing Hao, Kehan Guo, Pin-Yu Chen, Stefan Feuerriegel, et al. 2026. Probellm: Automating Principled Diagnosis of LLM Failures. InICML

2026
[43]

Gautier Izacard and Edouard Grave. 2021. Leveraging passage retrieval with generative models for open domain question answering. InEACL

2021
[44]

Kevin G Jamieson and Robert Nowak. 2011. Active ranking using pairwise comparisons. InNeurIPS

2011
[45]

Zhijing Jin, Yuen Chen, Felix Leeb, Luigi Gresele, Ojasv Kamal, LYU Zhiheng, Kevin Blin, Fernando Gonzalez Adauto, Max Kleiman-Weiner, Mrinmaya Sachan, et al. 2023. CLadder: Assessing causal reasoning in language models. InNeurIPS

2023
[46]

Zhijing Jin, Jiarui Liu, Zhiheng Lyu, Spencer Poff, Mrinmaya Sachan, Rada Mihalcea, Mona Diab, and Bernhard Schölkopf. 2024. Can large language models infer causation from correlation?. InICLR

2024
[47]

Jared Joselowitz, Ritam Majumdar, Arjun Jagota, Matthieu Bou, Nyal Patel, Satyapriya Krishna, and Sonali Parbhoo. 2025. Insights from the inverse: recon- structing LLM training goals through inverse reinforcement learning. InSecond Conference on Language Modeling

2025
[48]

Kusner, and Ricardo Silva

Jean Kaddour, Aengus Lynch, Qi Liu, Matt J. Kusner, and Ricardo Silva. 2025. Causal Machine learning: A survey and open problems.Foundations and Trends in Optimization9 (2025), 1–247. Issue 1-2

2025
[49]

Jean Kaddour, Yuchen Zhu, Qi Liu, Matt J Kusner, and Ricardo Silva. 2021. Causal effect inference for structured treatments. InNeurIPS

2021
[50]

Nathan Kallus. 2025. Semiparametric preference optimization: Your language model is secretly a single-index model.arXiv preprintarXiv:2512.21917 (2025)

work page internal anchor Pith review Pith/arXiv arXiv 2025
[51]

Nikhil Kandpal, Eric Wallace, and Colin Raffel. 2022. Deduplicating training data mitigates privacy risks in language models. InICML. Causal Methods for LLM Development and Evaluation KDD ’26, August 09–13, 2026, Jeju Island, Republic of Korea

2022
[52]

Vladimir Karpukhin, Barlas Oguz, Sewon Min, Patrick Lewis, Ledell Wu, Sergey Edunov, Danqi Chen, and Wen-tau Yih. 2020. Dense passage retrieval for open- domain question answering. InEMNLP

2020
[53]

Chinmaya Kausik, Yangyi Lu, Kevin Tan, Maggie Makar, Yixin Wang, and Ambuj Tewari. 2024. Offline policy evaluation and optimization under confounding. In AISTATS

2024
[54]

Edward H. Kennedy. 2023. Semiparametric doubly robust targeted double machine learning: a review.2203.06469(2023)

work page arXiv 2023
[55]

Edward H. Kennedy. 2023. Towards optimal doubly robust estimation of hetero- geneous causal effects.Electronic Journal of Statistics17, 2 (2023), 3008–3049

2023
[56]

Edward H Kennedy, Zongming Ma, Matthew D McHugh, and Dylan S Small
[57]

Non-parametric methods for doubly robust estimation of continuous treatment effects.Journal of the Royal Statistical Society Series B: Statistical Methodology79, 4 (2017), 1229–1245

2017
[58]

Christoph Kern, Unai Fischer-Abaigar, Jonas Schweisthal, Dennis Frauen, Rayid Ghani, Stefan Feuerriegel, Mihaela van der Schaar, and Frauke Kreuter. 2025. Algorithms for reliable decision-making need causal reasoning.Nature Compu- tational Science5, 5 (2025), 356–360

2025
[59]

Emre Kıcıman, Robert Ness, Amit Sharma, and Chenhao Tan. 2023. Causal reasoning and large language models: Opening a new frontier for causality. TMLR(2023)

2023
[60]

Peter Kirgis, Sayash Kapoor, Stephan Rabanser, Nitya Nadgir, Cozmin Ududec, Magda Dubois, JJ Allaire, Conrad Stosz, Marius Hobbhahn, Jacob Steinhardt, et al. 2026. Log analysis is necessary for credible evaluation of AI agents. arXiv:2605.08545(2026)

work page internal anchor Pith review Pith/arXiv arXiv 2026
[61]

Haruka Kiyohara, Daniel Yiming Cao, Yuta Saito, and Thorsten Joachims. 2025. An Off-Policy Learning Approach for Steering Sentence Generation towards Personalization. InRecSys

2025
[62]

Kasia Kobalczyk and Mihaela van der Schaar. 2025. Preference learning for AI alignment: A causal perspective. InICML

2025
[63]

Ryan Koo, Minhwa Lee, Vipul Raheja, Jong Inn Park, Zae Myung Kim, and Dongyeop Kang. 2024. Benchmarking cognitive biases in large language models as evaluators. InFindings of the ACL: ACL 2024

2024
[64]

Patrick Kramer, Edward H Kennedy, and Isaac M Opper. 2026. Causal inference with high-dimensional treatments.arXiv:2602.21423(2026)

work page arXiv 2026
[65]

Katherine Lee, Daphne Ippolito, Andrew Nystrom, Chiyuan Zhang, Douglas Eck, Chris Callison-Burch, and Nicholas Carlini. 2022. Deduplicating training data makes language models better. InACL

2022
[66]

Cresswell

Kin Kwan Leung, Mouloud Belbahri, Yi Sui, Alex Labach, Xueying Zhang, Stephen Anthony Rose, and Jesse C. Cresswell. 2026. Classifying and addressing the diversity of errors in retrieval-augmented generation systems. InEACL

2026
[67]

Patrick Lewis, Ethan Perez, Aleksandra Piktus, Fabio Petroni, Vladimir Karpukhin, Naman Goyal, Heinrich Küttler, Mike Lewis, Wen-tau Yih, Tim Rocktäschel, Sebastian Riedel, and Douwe Kiela. 2020. Retrieval-augmented generation for knowledge-intensive NLP tasks. InNeurIPS

2020
[68]

Xiudi Li and Sijia Li. 2025. Efficient inference for covariate-adjusted Bradley- Terry model with covariate shift.arXiv:2503.18256(2025)

work page arXiv 2025
[69]

Joel Mire, Zubin Trivadi Aysola, Daniel Chechelnitsky, Nicholas Deas, Chrysoula Zerva, and Maarten Sap. 2025. Rejected dialects: Biases against African American Language in reward models. InFindings of the ACL: NAACL 2025

2025
[70]

Angelopoulos, Trevor Darrell, Narges Norouzi, and Joseph E

Mihran Miroyan, Tsung-Han Wu, Logan King, Tianle Li, Jiayi Pan, Xinyan Hu, Wei-Lin Chiang, Anastasios N. Angelopoulos, Trevor Darrell, Narges Norouzi, and Joseph E. Gonzalez. 2025. Search Arena: analyzing search-augmented LLMs. arXiv:2506.05334(2025)

work page arXiv 2025
[71]

Miruna Oprescu, Jacob Dorn, Marah Ghoummaid, Andrew Jesson, Nathan Kallus, and Uri Shalit. 2023. B-learner: Quasi-oracle bounds on heterogeneous causal effects under hidden confounding. InICML

2023
[72]

Long Ouyang, Jeff Wu, Xu Jiang, Diogo Almeida, Carroll L Wainwright, Pamela Mishkin, Chong Zhang, Sandhini Agarwal, Katarina Slama, Alex Ray, John Schulman, Jacob Hilton, Fraser Kelton, Luke Miller, Maddie Simens, Amanda Askell, Peter Welinder, Paul Christiano, Jan Leike, and Ryan Lowe. 2022. Training language models to follow instructions with human feed...

2022
[73]

Charles Packer, Vivian Fang, Shishir_G Patil, Kevin Lin, Sarah Wooders, and Joseph_E Gonzalez. 2023. MemGPT: towards LLMs as operating systems. (2023)

2023
[74]

Junsoo Park, Seungyeon Jwa, Ren Meiying, Daeyoung Kim, and Sanghyuk Choi
[75]

InFindings of the ACL: EMNLP 2024

OffsetBias: Leveraging debiased data for tuning evaluators. InFindings of the ACL: EMNLP 2024

2024
[76]

Joon Sung Park, Joseph O’Brien, Carrie Jun Cai, Meredith Ringel Morris, Percy Liang, and Michael S Bernstein. 2023. Generative agents: Interactive simulacra of human behavior. InAnnual ACM Symposium on User Interface Software and Technology

2023
[77]

2009.Causality

Judea Pearl. 2009.Causality. Cambridge University Press, New York City

2009
[78]

Rafael Rafailov, Archit Sharma, Eric Mitchell, Stefano Ermon, Christopher D Manning, and Chelsea Finn. 2023. Direct preference optimization: Your language model is secretly a reward model. InNeurIPS

2023
[79]

Robins, Andrea Rotnitzky, and Lue Ping Zhao

James M. Robins, Andrea Rotnitzky, and Lue Ping Zhao. 1994. Estimation of regression coefficients when some regressors are not always observed.Journal of the American Statistical Association89, 427 (1994), 846–866

1994
[80]

Donald B. Rubin. 1974. Estimating causal effects of treatments in randomized and nonrandomized studies.Journal of Educational Psychology66, 5 (1974), 688–701

1974

Showing first 80 references.

[1] [1]

Angelopoulos, Stephen Bates, Clara Fannjiang, Michael I

Anastasios N. Angelopoulos, Stephen Bates, Clara Fannjiang, Michael I. Jordan, and Tijana Zrnic. 2023. Prediction-powered inference.Science382, 6671 (2023), 669–674

2023

[2] [2]

Angelopoulos, Jacob Eisenstein, Jonathan Berant, Alekh Agar- wal, and Adam Fisch

Anastasios N. Angelopoulos, Jacob Eisenstein, Jonathan Berant, Alekh Agar- wal, and Adam Fisch. 2025. Cost-optimal active AI model evaluation. arXiv:2506.07949 (2025)

work page arXiv 2025

[3] [3]

Susan Athey and Stefan Wager. 2021. Policy learning with observational data. Econometrica89, 1 (2021), 133–161

2021

[4] [4]

Suhana Bedi, Hejie Cui, Miguel Fuentes, Alyssa Unell, Michael Wornow, Juan M Banda, Nikesh Kotecha, Timothy Keyes, Yifan Mai, Mert Oez, et al. 2026. Holistic evaluation of large language models for medical tasks with MedHELM.Nature Medicine32 (2026), 943–951

2026

[5] [5]

Bender, Timnit Gebru, Angelina McMillan-Major, and Shmargaret Shmitchell

Emily M. Bender, Timnit Gebru, Angelina McMillan-Major, and Shmargaret Shmitchell. 2021. On the dangers of stochastic parrots: Can language models be too big?. InFAccT

2021

[6] [6]

Bickel, Chris A

Peter J. Bickel, Chris A. J. Klaassen, Ya’acov Ritov, and Jon A. Wellner. 1998. Efficient and adaptive estimation for semiparametric models. Springer, New York

1998

[7] [7]

Matthieu Bou, Nyal Patel, Arjun Jagota, Satyapriya Krishna, and Sonali Parbhoo

[8] [8]

The alignment auditor: A Bayesian framework for verifying and refining LLM objectives. InICLR

[9] [9]

Ralph Allan Bradley and Milton E. Terry. 1952. Rank analysis of incomplete block designs: I. The method of paired comparisons.Biometrika39, 3/4 (1952), 324–345

1952

[10] [10]

Nicholas Carlini, Florian Tramer, Eric Wallace, Matthew Jagielski, Ariel Herbert- Voss, Katherine Lee, Adam Roberts, Tom Brown, Dawn Song, Ulfar Erlingsson, et al. 2021. Extracting training data from large language models. InUSENIX Security Symposium

2021

[11] [11]

Ivi Chatzi, Eleni Straitouri, Suhas Thejaswi, and Manuel Gomez Rodriguez. 2024. Prediction-powered ranking of large language models. InNeurIPS

2024

[12] [12]

Victor Chernozhukov, Denis Chetverikov, Mert Demirer, Esther Duflo, Christian Hansen, Whitney Newey, and James M. Robins. 2018. Double/debiased machine learning for treatment and structural parameters.The Econometrics Journal21, 1 (2018), 1–68

2018

[13] [13]

Wayne Chi, Valerie Chen, Anastasios Nikolas Angelopoulos, Wei-Lin Chiang, Aditya Mittal, Naman Jain, Tianjun Zhang, Ion Stoica, Chris Donahue, and Ameet Talwalkar. 2025. Copilot Arena: A platform for code LLM evaluation in the wild. InICML

2025

[14] [14]

Wei-Lin Chiang, Lianmin Zheng, Ying Sheng, Anastasios N Angelopoulos, Tianle Li, Dacheng Li, Banghua Zhu, Hao Zhang, Michael I Jordan, Joseph E Gonzalez, and Ion Stoica. 2024. Chatbot Arena: An open platform for evaluating LLMs by human preference. InICML

2024

[15] [15]

Paul F Christiano, Jan Leike, Tom Brown, Miljan Martic, Shane Legg, and Dario Amodei. 2017. Deep reinforcement learning from human preferences. In NeurIPS

2017

[16] [16]

Thomas Cook, Alan Mishler, and Aaditya Ramdas. 2024. Semiparametric efficient inference in adaptive experiments. InCLeaR

2024

[17] [17]

Alicia Curth and Mihaela Van der Schaar. 2021. Nonparametric estimation of heterogeneous treatment effects: From theory to learning algorithms. In AISTATS

2021

[18] [18]

Athiya Deviyani and Fernando Diaz. 2025. Contextual metric meta-evaluation by measuring local metric accuracy. InFindings of the ACL: NAACL 2025

2025

[19] [19]

Dujian Ding, Ankur Mallick, Chi Wang, Robert Sim, Subhabrata Mukherjee, Victor Ruhle, Laks V. S. Lakshmanan, and Ahmed Hassan Awadallah. 2024. Hybrid LLM: Cost-efficient and quality-aware query routing. InICLR

2024

[20] [20]

Lucas Dixon, John Li, Jeffrey Sorensen, Nithum Thain, and Lucy Vasserman

[21] [21]

Measuring and mitigating unintended bias in text classification. InAIES

[22] [22]

Jesse Dodge, Maarten Sap, Ana Marasović, William Agnew, Gabriel Ilharco, Dirk Groeneveld, Margaret Mitchell, and Matt Gardner. 2021. Documenting large webtext corpora: A case study on the colossal clean crawled corpus. InEMNLP

2021

[23] [23]

Jacob Dorn, Kevin Guo, and Nathan Kallus. 2025. Doubly-valid/doubly-sharp sensitivity analysis for causal inference with unmeasured confounding.Journal of the American Statistical Association120, 549 (2025), 331–342

2025

[24] [24]

Nan Du, Yanping Huang, Andrew M Dai, Simon Tong, Dmitry Lepikhin, Yuanzhong Xu, Maxim Krikun, Yanqi Zhou, Adams Wei Yu, Orhan Firat, Barret Zoph, Liam Fedus, Maarten P Bosma, Zongwei Zhou, Tao Wang, Emma Wang, Kellie Webster, Marie Pellat, Kevin Robinson, Kathleen Meier-Hellstern, Toju Duke, Lucas Dixon, Kun Zhang, Quoc Le, Yonghui Wu, Zhifeng Chen, and C...

2022

[25] [25]

Miroslav Dudík, John Langford, and Lihong Li. 2011. Doubly robust policy evaluation and learning. InICML

2011

[26] [26]

Amir Feder, Katherine A Keith, Emaad Manzoor, Reid Pryzant, Dhanya Sridhar, Zach Wood-Doughty, Jacob Eisenstein, Justin Grimmer, Roi Reichart, Margaret E Roberts, et al. 2022. Causal inference in natural language processing: Estimation, prediction, interpretation and beyond. InTACL

2022

[27] [27]

Shangbin Feng, Chan Young Park, Yuhan Liu, and Yulia Tsvetkov. 2023. From pretraining data to language models to downstream tasks: Tracking the trails of political biases leading to unfair NLP models. InACL

2023

[28] [28]

Adam Fisch, Joshua Maynez, R Alex Hofer, Bhuwan Dhingra, Amir Globerson, and William W Cohen. 2024. Stratified prediction-powered inference for hybrid language model evaluation. InNeurIPS

2024

[29] [29]

Dennis Frauen, Athiya Deviyani, Mihaela van der Schaar, and Stefan Feuerriegel

[30] [30]

Nonparametric LLM evaluation from preference data. InICML

[31] [31]

Dennis Frauen, Valentyn Melnychuk, and Stefan Feuerriegel. 2023. Sharp bounds for generalized causal sensitivity analysis. InNeurIPS

2023

[32] [32]

Dennis Frauen, Valentyn Melnychuk, Lars van der Laan, and Stefan Feuer- riegel. 2026. Machine Learning for Causal Inference. InWiley StatsRef: Statistics Reference Online. John Wiley & Sons

2026

[33] [33]

Angelopoulos, and Ion Stoica

Evan Frick, Connor Chen, Joseph Tennyson, Tianle Li, Wei-Lin Chiang, Anas- tasios N. Angelopoulos, and Ion Stoica. 2025. Prompt-to-leaderboard: Prompt- adaptive LLM evaluations. InICML

2025

[34] [34]

Leo Gao, Stella Biderman, Sid Black, Laurence Golding, Travis Hoppe, Charles Foster, Jason Phang, Horace He, Anish Thite, Noa Nabeshima, Shawn Presser, and Connor Leahy. 2020. The pile: An 800GB dataset of diverse text for language modeling. arXiv:2101.00027

work page internal anchor Pith review Pith/arXiv arXiv 2020

[35] [35]

Yunfan Gao, Yun Xiong, Xinyu Gao, Kangxiang Jia, Jinliu Pan, Yuxi Bi, Yi Dai, Jiawei Sun, Meng Wang, and Haofen Wang. 2024. Retrieval-augmented generation for large language models: A survey. arXiv:2312.10997

work page internal anchor Pith review Pith/arXiv arXiv 2024

[36] [36]

Robert Geirhos, Jörn-Henrik Jacobsen, Claudio Michaelis, Richard Zemel, Wieland Brendel, Matthias Bethge, and Felix A Wichmann. 2020. Shortcut learning in deep neural networks.Nature Machine Intelligence2, 11 (2020), 665–673

2020

[37] [37]

Aaron Grattafiori, Abhimanyu Dubey, Abhinav Jauhri, et al. 2024. The Llama 3 herd of models. arXiv:2407.21783

work page internal anchor Pith review Pith/arXiv arXiv 2024

[38] [38]

Luke Guerdan, Justin Whitehouse, Kimberly Truong, Ken Holstein, and Steven Wu. 2026. Doubly-robust LLM-as-a-judge: Externally valid estimation with imperfect personas. InICLR

2026

[39] [39]

Taicheng Guo, Xiuying Chen, Yaqi Wang, Ruidi Chang, Shichao Pei, Nitesh V Chawla, Olaf Wiest, and Xiangliang Zhang. 2024. Large language model based multi-agents: A survey of progress and challenges. InIJCAI

2024

[40] [40]

Yuan Guo, Peng Tian, Jayashree Kalpathy-Cramer, Susan Ostmo, J.Peter Camp- bell, Michael F.Chiang, Deniz Erdogmus, Jennifer Dy, and Stratis Ioannidis. 2018. Experimental design under the Bradley-Terry model. InIJCAI

2018

[41] [41]

Qitian Jason Hu, Jacob Bieker, Xiuyu Li, Nan Jiang, Benjamin Keigwin, Gaurav Ranganath, Kurt Keutzer, and Shriyash Kaustubh Upadhyay. 2024. RouterBench: A benchmark for multi-LLM routing system. InAgentic Markets Workshop at ICML 2024

2024

[42] [42]

Yue Huang, Zhengzhe Jiang, Yuchen Ma, Yu Jiang, Xiangqi Wang, Yujun Zhou, Yuexing Hao, Kehan Guo, Pin-Yu Chen, Stefan Feuerriegel, et al. 2026. Probellm: Automating Principled Diagnosis of LLM Failures. InICML

2026

[43] [43]

Gautier Izacard and Edouard Grave. 2021. Leveraging passage retrieval with generative models for open domain question answering. InEACL

2021

[44] [44]

Kevin G Jamieson and Robert Nowak. 2011. Active ranking using pairwise comparisons. InNeurIPS

2011

[45] [45]

Zhijing Jin, Yuen Chen, Felix Leeb, Luigi Gresele, Ojasv Kamal, LYU Zhiheng, Kevin Blin, Fernando Gonzalez Adauto, Max Kleiman-Weiner, Mrinmaya Sachan, et al. 2023. CLadder: Assessing causal reasoning in language models. InNeurIPS

2023

[46] [46]

Zhijing Jin, Jiarui Liu, Zhiheng Lyu, Spencer Poff, Mrinmaya Sachan, Rada Mihalcea, Mona Diab, and Bernhard Schölkopf. 2024. Can large language models infer causation from correlation?. InICLR

2024

[47] [47]

Jared Joselowitz, Ritam Majumdar, Arjun Jagota, Matthieu Bou, Nyal Patel, Satyapriya Krishna, and Sonali Parbhoo. 2025. Insights from the inverse: recon- structing LLM training goals through inverse reinforcement learning. InSecond Conference on Language Modeling

2025

[48] [48]

Kusner, and Ricardo Silva

Jean Kaddour, Aengus Lynch, Qi Liu, Matt J. Kusner, and Ricardo Silva. 2025. Causal Machine learning: A survey and open problems.Foundations and Trends in Optimization9 (2025), 1–247. Issue 1-2

2025

[49] [49]

Jean Kaddour, Yuchen Zhu, Qi Liu, Matt J Kusner, and Ricardo Silva. 2021. Causal effect inference for structured treatments. InNeurIPS

2021

[50] [50]

Nathan Kallus. 2025. Semiparametric preference optimization: Your language model is secretly a single-index model.arXiv preprintarXiv:2512.21917 (2025)

work page internal anchor Pith review Pith/arXiv arXiv 2025

[51] [51]

Nikhil Kandpal, Eric Wallace, and Colin Raffel. 2022. Deduplicating training data mitigates privacy risks in language models. InICML. Causal Methods for LLM Development and Evaluation KDD ’26, August 09–13, 2026, Jeju Island, Republic of Korea

2022

[52] [52]

Vladimir Karpukhin, Barlas Oguz, Sewon Min, Patrick Lewis, Ledell Wu, Sergey Edunov, Danqi Chen, and Wen-tau Yih. 2020. Dense passage retrieval for open- domain question answering. InEMNLP

2020

[53] [53]

Chinmaya Kausik, Yangyi Lu, Kevin Tan, Maggie Makar, Yixin Wang, and Ambuj Tewari. 2024. Offline policy evaluation and optimization under confounding. In AISTATS

2024

[54] [54]

Edward H. Kennedy. 2023. Semiparametric doubly robust targeted double machine learning: a review.2203.06469(2023)

work page arXiv 2023

[55] [55]

Edward H. Kennedy. 2023. Towards optimal doubly robust estimation of hetero- geneous causal effects.Electronic Journal of Statistics17, 2 (2023), 3008–3049

2023

[56] [56]

Edward H Kennedy, Zongming Ma, Matthew D McHugh, and Dylan S Small

[57] [57]

Non-parametric methods for doubly robust estimation of continuous treatment effects.Journal of the Royal Statistical Society Series B: Statistical Methodology79, 4 (2017), 1229–1245

2017

[58] [58]

Christoph Kern, Unai Fischer-Abaigar, Jonas Schweisthal, Dennis Frauen, Rayid Ghani, Stefan Feuerriegel, Mihaela van der Schaar, and Frauke Kreuter. 2025. Algorithms for reliable decision-making need causal reasoning.Nature Compu- tational Science5, 5 (2025), 356–360

2025

[59] [59]

Emre Kıcıman, Robert Ness, Amit Sharma, and Chenhao Tan. 2023. Causal reasoning and large language models: Opening a new frontier for causality. TMLR(2023)

2023

[60] [60]

Peter Kirgis, Sayash Kapoor, Stephan Rabanser, Nitya Nadgir, Cozmin Ududec, Magda Dubois, JJ Allaire, Conrad Stosz, Marius Hobbhahn, Jacob Steinhardt, et al. 2026. Log analysis is necessary for credible evaluation of AI agents. arXiv:2605.08545(2026)

work page internal anchor Pith review Pith/arXiv arXiv 2026

[61] [61]

Haruka Kiyohara, Daniel Yiming Cao, Yuta Saito, and Thorsten Joachims. 2025. An Off-Policy Learning Approach for Steering Sentence Generation towards Personalization. InRecSys

2025

[62] [62]

Kasia Kobalczyk and Mihaela van der Schaar. 2025. Preference learning for AI alignment: A causal perspective. InICML

2025

[63] [63]

Ryan Koo, Minhwa Lee, Vipul Raheja, Jong Inn Park, Zae Myung Kim, and Dongyeop Kang. 2024. Benchmarking cognitive biases in large language models as evaluators. InFindings of the ACL: ACL 2024

2024

[64] [64]

Patrick Kramer, Edward H Kennedy, and Isaac M Opper. 2026. Causal inference with high-dimensional treatments.arXiv:2602.21423(2026)

work page arXiv 2026

[65] [65]

Katherine Lee, Daphne Ippolito, Andrew Nystrom, Chiyuan Zhang, Douglas Eck, Chris Callison-Burch, and Nicholas Carlini. 2022. Deduplicating training data makes language models better. InACL

2022

[66] [66]

Cresswell

Kin Kwan Leung, Mouloud Belbahri, Yi Sui, Alex Labach, Xueying Zhang, Stephen Anthony Rose, and Jesse C. Cresswell. 2026. Classifying and addressing the diversity of errors in retrieval-augmented generation systems. InEACL

2026

[67] [67]

Patrick Lewis, Ethan Perez, Aleksandra Piktus, Fabio Petroni, Vladimir Karpukhin, Naman Goyal, Heinrich Küttler, Mike Lewis, Wen-tau Yih, Tim Rocktäschel, Sebastian Riedel, and Douwe Kiela. 2020. Retrieval-augmented generation for knowledge-intensive NLP tasks. InNeurIPS

2020

[68] [68]

Xiudi Li and Sijia Li. 2025. Efficient inference for covariate-adjusted Bradley- Terry model with covariate shift.arXiv:2503.18256(2025)

work page arXiv 2025

[69] [69]

Joel Mire, Zubin Trivadi Aysola, Daniel Chechelnitsky, Nicholas Deas, Chrysoula Zerva, and Maarten Sap. 2025. Rejected dialects: Biases against African American Language in reward models. InFindings of the ACL: NAACL 2025

2025

[70] [70]

Angelopoulos, Trevor Darrell, Narges Norouzi, and Joseph E

Mihran Miroyan, Tsung-Han Wu, Logan King, Tianle Li, Jiayi Pan, Xinyan Hu, Wei-Lin Chiang, Anastasios N. Angelopoulos, Trevor Darrell, Narges Norouzi, and Joseph E. Gonzalez. 2025. Search Arena: analyzing search-augmented LLMs. arXiv:2506.05334(2025)

work page arXiv 2025

[71] [71]

Miruna Oprescu, Jacob Dorn, Marah Ghoummaid, Andrew Jesson, Nathan Kallus, and Uri Shalit. 2023. B-learner: Quasi-oracle bounds on heterogeneous causal effects under hidden confounding. InICML

2023

[72] [72]

Long Ouyang, Jeff Wu, Xu Jiang, Diogo Almeida, Carroll L Wainwright, Pamela Mishkin, Chong Zhang, Sandhini Agarwal, Katarina Slama, Alex Ray, John Schulman, Jacob Hilton, Fraser Kelton, Luke Miller, Maddie Simens, Amanda Askell, Peter Welinder, Paul Christiano, Jan Leike, and Ryan Lowe. 2022. Training language models to follow instructions with human feed...

2022

[73] [73]

Charles Packer, Vivian Fang, Shishir_G Patil, Kevin Lin, Sarah Wooders, and Joseph_E Gonzalez. 2023. MemGPT: towards LLMs as operating systems. (2023)

2023

[74] [74]

Junsoo Park, Seungyeon Jwa, Ren Meiying, Daeyoung Kim, and Sanghyuk Choi

[75] [75]

InFindings of the ACL: EMNLP 2024

OffsetBias: Leveraging debiased data for tuning evaluators. InFindings of the ACL: EMNLP 2024

2024

[76] [76]

Joon Sung Park, Joseph O’Brien, Carrie Jun Cai, Meredith Ringel Morris, Percy Liang, and Michael S Bernstein. 2023. Generative agents: Interactive simulacra of human behavior. InAnnual ACM Symposium on User Interface Software and Technology

2023

[77] [77]

2009.Causality

Judea Pearl. 2009.Causality. Cambridge University Press, New York City

2009

[78] [78]

Rafael Rafailov, Archit Sharma, Eric Mitchell, Stefano Ermon, Christopher D Manning, and Chelsea Finn. 2023. Direct preference optimization: Your language model is secretly a reward model. InNeurIPS

2023

[79] [79]

Robins, Andrea Rotnitzky, and Lue Ping Zhao

James M. Robins, Andrea Rotnitzky, and Lue Ping Zhao. 1994. Estimation of regression coefficients when some regressors are not always observed.Journal of the American Statistical Association89, 427 (1994), 846–866

1994

[80] [80]

Donald B. Rubin. 1974. Estimating causal effects of treatments in randomized and nonrandomized studies.Journal of Educational Psychology66, 5 (1974), 688–701

1974