Causal methods for LLM development and evaluation
Pith reviewed 2026-06-29 22:49 UTC · model grok-4.3
The pith
Central questions in LLM development like data domain effects and routing decisions are causal and need identification methods beyond prediction.
A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.
Core claim
The paper claims that causal methods are underutilized in the LLM pipeline even though questions about the effect of adding data domains, changes in annotator preferences under different styles, and cost-aware routing are inherently causal; these settings feature logged data subject to confounding and distribution shifts, learned but potentially biased judges, and non-stationary deployment, all of which create opportunities for principled causal identification and estimation rather than fragile predictive iteration.
What carries the argument
Causal identification and estimation applied to logged LLM data, reward models, and evaluation pipelines to isolate intervention effects despite confounding and shifts.
If this is right
- Causal methods can identify the true effect of adding a data domain during pretraining despite selection biases in logged data.
- Causal estimation can recover how annotator preferences shift when LLMs generate text in new styles, improving alignment.
- Causal routing decisions can optimize model size choice under inference cost constraints using logged interaction data.
- Causal debiasing of learned judges can improve evaluation reliability when judges themselves are influenced by style or other factors.
- Causal approaches can handle non-stationary deployment by identifying time-varying effects rather than assuming stable distributions.
Where Pith is reading between the lines
- Causal methods could reduce the number of full-scale retraining runs needed by allowing targeted interventions on subsets of logged data.
- The same framework might extend to agentic workflows by treating tool-use decisions as interventions whose downstream effects can be identified.
- A practical test would be to run parallel A/B interventions on small models and check whether causal estimates from larger logged datasets match the observed outcomes.
Load-bearing premise
The central questions in LLM development are inherently causal and the standard identification assumptions from causal inference can be met in large-scale logged LLM data without prohibitive practical barriers.
What would settle it
A controlled intervention on a data domain or routing policy whose measured outcome differs substantially from the causal estimate obtained from existing logged data would show the identification assumptions fail in practice.
Figures
read the original abstract
Large language model (LLM) development is currently driven by large-scale empirical iteration over data mixtures, reward models, routing strategies, and evaluation pipelines. Here, we argue that many central questions in LLM development and evaluation are inherently causal: What is the effect of adding a data domain during pretraining? How do annotator preferences change when LLMs generate text in a different style? Should a prompt be routed to a larger or smaller model given inference cost constraints? In general, causal methods are well-suited to such settings where interventions change outcomes but, surprisingly, are underrepresented in LLM development. Our contribution is threefold: (1) We explain how causal methods can help develop modern LLM development and evaluation: LLM development relies heavily on logged data, which are often subject to confounding and distribution shifts; evaluation uses learned but potentially biased judges; and deployment environments are non-stationary. These conditions make purely predictive approaches fragile and create opportunities for principled identification and estimation methods from causal inference. (2) We further map opportunities for causal methods in the entire LLM development pipeline, including pretraining, alignment, routing, agentic workflows, and evaluation. (3) We discuss new research opportunities around leveraging causal methods for LLM development and evaluation. Overall, we argue that causal methods are potentially underutilized for the LLM development and evaluation pipeline, despite the fact that such methods can ensure a reliable and scientifically grounded design.
Editorial analysis
A structured set of objections, weighed in public.
Referee Report
Summary. The paper argues that many central questions in LLM development and evaluation—such as the effect of adding data domains during pretraining, changes in annotator preferences, and routing decisions under cost constraints—are inherently causal. It claims that causal inference methods are surprisingly underrepresented despite their suitability for handling confounding in logged data, distribution shifts, biased evaluators, and non-stationary environments. The contribution consists of explaining these opportunities, mapping them across the full pipeline (pretraining, alignment, routing, agents, evaluation), and outlining new research directions, with the overall thesis that causal methods can support more reliable and scientifically grounded LLM design.
Significance. If the central argument holds, the paper could usefully redirect LLM research from purely empirical iteration toward identification-based approaches that explicitly address confounding and shifts. The mapping of opportunities across the pipeline is a constructive taxonomy that may help organize future work. However, the manuscript offers no new identification results, empirical demonstrations, or feasibility checks, so its significance rests on its potential to stimulate such work rather than on any demonstrated advance.
major comments (2)
- [Abstract] Abstract and contribution (1): the claim that causal methods are 'well-suited' and can ensure 'reliable and scientifically grounded design' is not supported by any concrete identification strategy or worked example for the listed questions (e.g., effect of a data domain or routing decision). Without this, the argument that such methods are feasible at LLM scale remains a plausibility claim rather than a demonstrated one.
- [Contribution (2)] Contribution (2): the mapping of opportunities does not discuss how standard causal assumptions (positivity, consistency, no unmeasured confounding) would be met in large-scale logged LLM data, where interventions such as data-mixture changes are rarely randomized and logs are subject to selection bias; this is load-bearing for the claim that causal methods can be directly applied without prohibitive practical barriers.
minor comments (1)
- [Abstract] The abstract and introduction repeat the phrase 'underutilized' without citing any survey or quantitative evidence of current usage rates in the LLM literature.
Simulated Author's Rebuttal
We thank the referee for the detailed and constructive report. As a position paper, our aim is to map opportunities for causal methods rather than to deliver new identification theorems or large-scale experiments. We address the two major comments below.
read point-by-point responses
-
Referee: [Abstract] Abstract and contribution (1): the claim that causal methods are 'well-suited' and can ensure 'reliable and scientifically grounded design' is not supported by any concrete identification strategy or worked example for the listed questions (e.g., effect of a data domain or routing decision). Without this, the argument that such methods are feasible at LLM scale remains a plausibility claim rather than a demonstrated one.
Authors: We agree that the manuscript contains no new identification strategies or worked examples; this is outside the stated scope of a position paper whose contribution is to surface structural parallels between LLM development problems and causal inference settings. The suitability claim rests on the observation that logged LLM data routinely exhibit confounding, selection bias, and non-stationarity—conditions under which purely predictive approaches are known to be fragile. We will revise the abstract and introduction to state the scope more explicitly and to avoid any implication that feasibility at LLM scale has already been shown. revision: partial
-
Referee: [Contribution (2)] Contribution (2): the mapping of opportunities does not discuss how standard causal assumptions (positivity, consistency, no unmeasured confounding) would be met in large-scale logged LLM data, where interventions such as data-mixture changes are rarely randomized and logs are subject to selection bias; this is load-bearing for the claim that causal methods can be directly applied without prohibitive practical barriers.
Authors: This is a fair observation. The current draft does not explicitly examine how positivity, consistency, and no unmeasured confounding would hold (or be violated) under the selection mechanisms present in production LLM logs. We will add a dedicated subsection that (a) states the assumptions, (b) illustrates likely violations arising from non-randomized data-mixture changes and selection bias, and (c) points to existing causal techniques (e.g., sensitivity analysis, proximal causal inference) that have been developed precisely for such settings. revision: yes
Circularity Check
No significant circularity
full rationale
The paper is a position/advocacy piece with no derivations, equations, fitted parameters, identification results, or theorems. Its central claim—that causal questions are common in LLM pipelines and causal methods are underutilized—rests on descriptive examples rather than any self-referential reduction, self-citation chain, or renaming of prior results. No load-bearing step reduces to the paper's own inputs by construction.
Axiom & Free-Parameter Ledger
axioms (1)
- domain assumption LLM development questions (data domain effects, preference changes, routing) are inherently causal and amenable to standard identification strategies.
Reference graph
Works this paper leans on
-
[1]
Angelopoulos, Stephen Bates, Clara Fannjiang, Michael I
Anastasios N. Angelopoulos, Stephen Bates, Clara Fannjiang, Michael I. Jordan, and Tijana Zrnic. 2023. Prediction-powered inference.Science382, 6671 (2023), 669–674
2023
-
[2]
Angelopoulos, Jacob Eisenstein, Jonathan Berant, Alekh Agar- wal, and Adam Fisch
Anastasios N. Angelopoulos, Jacob Eisenstein, Jonathan Berant, Alekh Agar- wal, and Adam Fisch. 2025. Cost-optimal active AI model evaluation. arXiv:2506.07949 (2025)
-
[3]
Susan Athey and Stefan Wager. 2021. Policy learning with observational data. Econometrica89, 1 (2021), 133–161
2021
-
[4]
Suhana Bedi, Hejie Cui, Miguel Fuentes, Alyssa Unell, Michael Wornow, Juan M Banda, Nikesh Kotecha, Timothy Keyes, Yifan Mai, Mert Oez, et al. 2026. Holistic evaluation of large language models for medical tasks with MedHELM.Nature Medicine32 (2026), 943–951
2026
-
[5]
Bender, Timnit Gebru, Angelina McMillan-Major, and Shmargaret Shmitchell
Emily M. Bender, Timnit Gebru, Angelina McMillan-Major, and Shmargaret Shmitchell. 2021. On the dangers of stochastic parrots: Can language models be too big?. InFAccT
2021
-
[6]
Bickel, Chris A
Peter J. Bickel, Chris A. J. Klaassen, Ya’acov Ritov, and Jon A. Wellner. 1998. Efficient and adaptive estimation for semiparametric models. Springer, New York
1998
-
[7]
Matthieu Bou, Nyal Patel, Arjun Jagota, Satyapriya Krishna, and Sonali Parbhoo
-
[8]
The alignment auditor: A Bayesian framework for verifying and refining LLM objectives. InICLR
-
[9]
Ralph Allan Bradley and Milton E. Terry. 1952. Rank analysis of incomplete block designs: I. The method of paired comparisons.Biometrika39, 3/4 (1952), 324–345
1952
-
[10]
Nicholas Carlini, Florian Tramer, Eric Wallace, Matthew Jagielski, Ariel Herbert- Voss, Katherine Lee, Adam Roberts, Tom Brown, Dawn Song, Ulfar Erlingsson, et al. 2021. Extracting training data from large language models. InUSENIX Security Symposium
2021
-
[11]
Ivi Chatzi, Eleni Straitouri, Suhas Thejaswi, and Manuel Gomez Rodriguez. 2024. Prediction-powered ranking of large language models. InNeurIPS
2024
-
[12]
Victor Chernozhukov, Denis Chetverikov, Mert Demirer, Esther Duflo, Christian Hansen, Whitney Newey, and James M. Robins. 2018. Double/debiased machine learning for treatment and structural parameters.The Econometrics Journal21, 1 (2018), 1–68
2018
-
[13]
Wayne Chi, Valerie Chen, Anastasios Nikolas Angelopoulos, Wei-Lin Chiang, Aditya Mittal, Naman Jain, Tianjun Zhang, Ion Stoica, Chris Donahue, and Ameet Talwalkar. 2025. Copilot Arena: A platform for code LLM evaluation in the wild. InICML
2025
-
[14]
Wei-Lin Chiang, Lianmin Zheng, Ying Sheng, Anastasios N Angelopoulos, Tianle Li, Dacheng Li, Banghua Zhu, Hao Zhang, Michael I Jordan, Joseph E Gonzalez, and Ion Stoica. 2024. Chatbot Arena: An open platform for evaluating LLMs by human preference. InICML
2024
-
[15]
Paul F Christiano, Jan Leike, Tom Brown, Miljan Martic, Shane Legg, and Dario Amodei. 2017. Deep reinforcement learning from human preferences. In NeurIPS
2017
-
[16]
Thomas Cook, Alan Mishler, and Aaditya Ramdas. 2024. Semiparametric efficient inference in adaptive experiments. InCLeaR
2024
-
[17]
Alicia Curth and Mihaela Van der Schaar. 2021. Nonparametric estimation of heterogeneous treatment effects: From theory to learning algorithms. In AISTATS
2021
-
[18]
Athiya Deviyani and Fernando Diaz. 2025. Contextual metric meta-evaluation by measuring local metric accuracy. InFindings of the ACL: NAACL 2025
2025
-
[19]
Dujian Ding, Ankur Mallick, Chi Wang, Robert Sim, Subhabrata Mukherjee, Victor Ruhle, Laks V. S. Lakshmanan, and Ahmed Hassan Awadallah. 2024. Hybrid LLM: Cost-efficient and quality-aware query routing. InICLR
2024
-
[20]
Lucas Dixon, John Li, Jeffrey Sorensen, Nithum Thain, and Lucy Vasserman
-
[21]
Measuring and mitigating unintended bias in text classification. InAIES
-
[22]
Jesse Dodge, Maarten Sap, Ana Marasović, William Agnew, Gabriel Ilharco, Dirk Groeneveld, Margaret Mitchell, and Matt Gardner. 2021. Documenting large webtext corpora: A case study on the colossal clean crawled corpus. InEMNLP
2021
-
[23]
Jacob Dorn, Kevin Guo, and Nathan Kallus. 2025. Doubly-valid/doubly-sharp sensitivity analysis for causal inference with unmeasured confounding.Journal of the American Statistical Association120, 549 (2025), 331–342
2025
-
[24]
Nan Du, Yanping Huang, Andrew M Dai, Simon Tong, Dmitry Lepikhin, Yuanzhong Xu, Maxim Krikun, Yanqi Zhou, Adams Wei Yu, Orhan Firat, Barret Zoph, Liam Fedus, Maarten P Bosma, Zongwei Zhou, Tao Wang, Emma Wang, Kellie Webster, Marie Pellat, Kevin Robinson, Kathleen Meier-Hellstern, Toju Duke, Lucas Dixon, Kun Zhang, Quoc Le, Yonghui Wu, Zhifeng Chen, and C...
2022
-
[25]
Miroslav Dudík, John Langford, and Lihong Li. 2011. Doubly robust policy evaluation and learning. InICML
2011
-
[26]
Amir Feder, Katherine A Keith, Emaad Manzoor, Reid Pryzant, Dhanya Sridhar, Zach Wood-Doughty, Jacob Eisenstein, Justin Grimmer, Roi Reichart, Margaret E Roberts, et al. 2022. Causal inference in natural language processing: Estimation, prediction, interpretation and beyond. InTACL
2022
-
[27]
Shangbin Feng, Chan Young Park, Yuhan Liu, and Yulia Tsvetkov. 2023. From pretraining data to language models to downstream tasks: Tracking the trails of political biases leading to unfair NLP models. InACL
2023
-
[28]
Adam Fisch, Joshua Maynez, R Alex Hofer, Bhuwan Dhingra, Amir Globerson, and William W Cohen. 2024. Stratified prediction-powered inference for hybrid language model evaluation. InNeurIPS
2024
-
[29]
Dennis Frauen, Athiya Deviyani, Mihaela van der Schaar, and Stefan Feuerriegel
-
[30]
Nonparametric LLM evaluation from preference data. InICML
-
[31]
Dennis Frauen, Valentyn Melnychuk, and Stefan Feuerriegel. 2023. Sharp bounds for generalized causal sensitivity analysis. InNeurIPS
2023
-
[32]
Dennis Frauen, Valentyn Melnychuk, Lars van der Laan, and Stefan Feuer- riegel. 2026. Machine Learning for Causal Inference. InWiley StatsRef: Statistics Reference Online. John Wiley & Sons
2026
-
[33]
Angelopoulos, and Ion Stoica
Evan Frick, Connor Chen, Joseph Tennyson, Tianle Li, Wei-Lin Chiang, Anas- tasios N. Angelopoulos, and Ion Stoica. 2025. Prompt-to-leaderboard: Prompt- adaptive LLM evaluations. InICML
2025
-
[34]
Leo Gao, Stella Biderman, Sid Black, Laurence Golding, Travis Hoppe, Charles Foster, Jason Phang, Horace He, Anish Thite, Noa Nabeshima, Shawn Presser, and Connor Leahy. 2020. The pile: An 800GB dataset of diverse text for language modeling. arXiv:2101.00027
work page internal anchor Pith review Pith/arXiv arXiv 2020
-
[35]
Yunfan Gao, Yun Xiong, Xinyu Gao, Kangxiang Jia, Jinliu Pan, Yuxi Bi, Yi Dai, Jiawei Sun, Meng Wang, and Haofen Wang. 2024. Retrieval-augmented generation for large language models: A survey. arXiv:2312.10997
work page internal anchor Pith review Pith/arXiv arXiv 2024
-
[36]
Robert Geirhos, Jörn-Henrik Jacobsen, Claudio Michaelis, Richard Zemel, Wieland Brendel, Matthias Bethge, and Felix A Wichmann. 2020. Shortcut learning in deep neural networks.Nature Machine Intelligence2, 11 (2020), 665–673
2020
-
[37]
Aaron Grattafiori, Abhimanyu Dubey, Abhinav Jauhri, et al. 2024. The Llama 3 herd of models. arXiv:2407.21783
work page internal anchor Pith review Pith/arXiv arXiv 2024
-
[38]
Luke Guerdan, Justin Whitehouse, Kimberly Truong, Ken Holstein, and Steven Wu. 2026. Doubly-robust LLM-as-a-judge: Externally valid estimation with imperfect personas. InICLR
2026
-
[39]
Taicheng Guo, Xiuying Chen, Yaqi Wang, Ruidi Chang, Shichao Pei, Nitesh V Chawla, Olaf Wiest, and Xiangliang Zhang. 2024. Large language model based multi-agents: A survey of progress and challenges. InIJCAI
2024
-
[40]
Yuan Guo, Peng Tian, Jayashree Kalpathy-Cramer, Susan Ostmo, J.Peter Camp- bell, Michael F.Chiang, Deniz Erdogmus, Jennifer Dy, and Stratis Ioannidis. 2018. Experimental design under the Bradley-Terry model. InIJCAI
2018
-
[41]
Qitian Jason Hu, Jacob Bieker, Xiuyu Li, Nan Jiang, Benjamin Keigwin, Gaurav Ranganath, Kurt Keutzer, and Shriyash Kaustubh Upadhyay. 2024. RouterBench: A benchmark for multi-LLM routing system. InAgentic Markets Workshop at ICML 2024
2024
-
[42]
Yue Huang, Zhengzhe Jiang, Yuchen Ma, Yu Jiang, Xiangqi Wang, Yujun Zhou, Yuexing Hao, Kehan Guo, Pin-Yu Chen, Stefan Feuerriegel, et al. 2026. Probellm: Automating Principled Diagnosis of LLM Failures. InICML
2026
-
[43]
Gautier Izacard and Edouard Grave. 2021. Leveraging passage retrieval with generative models for open domain question answering. InEACL
2021
-
[44]
Kevin G Jamieson and Robert Nowak. 2011. Active ranking using pairwise comparisons. InNeurIPS
2011
-
[45]
Zhijing Jin, Yuen Chen, Felix Leeb, Luigi Gresele, Ojasv Kamal, LYU Zhiheng, Kevin Blin, Fernando Gonzalez Adauto, Max Kleiman-Weiner, Mrinmaya Sachan, et al. 2023. CLadder: Assessing causal reasoning in language models. InNeurIPS
2023
-
[46]
Zhijing Jin, Jiarui Liu, Zhiheng Lyu, Spencer Poff, Mrinmaya Sachan, Rada Mihalcea, Mona Diab, and Bernhard Schölkopf. 2024. Can large language models infer causation from correlation?. InICLR
2024
-
[47]
Jared Joselowitz, Ritam Majumdar, Arjun Jagota, Matthieu Bou, Nyal Patel, Satyapriya Krishna, and Sonali Parbhoo. 2025. Insights from the inverse: recon- structing LLM training goals through inverse reinforcement learning. InSecond Conference on Language Modeling
2025
-
[48]
Kusner, and Ricardo Silva
Jean Kaddour, Aengus Lynch, Qi Liu, Matt J. Kusner, and Ricardo Silva. 2025. Causal Machine learning: A survey and open problems.Foundations and Trends in Optimization9 (2025), 1–247. Issue 1-2
2025
-
[49]
Jean Kaddour, Yuchen Zhu, Qi Liu, Matt J Kusner, and Ricardo Silva. 2021. Causal effect inference for structured treatments. InNeurIPS
2021
-
[50]
Nathan Kallus. 2025. Semiparametric preference optimization: Your language model is secretly a single-index model.arXiv preprintarXiv:2512.21917 (2025)
work page internal anchor Pith review Pith/arXiv arXiv 2025
-
[51]
Nikhil Kandpal, Eric Wallace, and Colin Raffel. 2022. Deduplicating training data mitigates privacy risks in language models. InICML. Causal Methods for LLM Development and Evaluation KDD ’26, August 09–13, 2026, Jeju Island, Republic of Korea
2022
-
[52]
Vladimir Karpukhin, Barlas Oguz, Sewon Min, Patrick Lewis, Ledell Wu, Sergey Edunov, Danqi Chen, and Wen-tau Yih. 2020. Dense passage retrieval for open- domain question answering. InEMNLP
2020
-
[53]
Chinmaya Kausik, Yangyi Lu, Kevin Tan, Maggie Makar, Yixin Wang, and Ambuj Tewari. 2024. Offline policy evaluation and optimization under confounding. In AISTATS
2024
- [54]
-
[55]
Edward H. Kennedy. 2023. Towards optimal doubly robust estimation of hetero- geneous causal effects.Electronic Journal of Statistics17, 2 (2023), 3008–3049
2023
-
[56]
Edward H Kennedy, Zongming Ma, Matthew D McHugh, and Dylan S Small
-
[57]
Non-parametric methods for doubly robust estimation of continuous treatment effects.Journal of the Royal Statistical Society Series B: Statistical Methodology79, 4 (2017), 1229–1245
2017
-
[58]
Christoph Kern, Unai Fischer-Abaigar, Jonas Schweisthal, Dennis Frauen, Rayid Ghani, Stefan Feuerriegel, Mihaela van der Schaar, and Frauke Kreuter. 2025. Algorithms for reliable decision-making need causal reasoning.Nature Compu- tational Science5, 5 (2025), 356–360
2025
-
[59]
Emre Kıcıman, Robert Ness, Amit Sharma, and Chenhao Tan. 2023. Causal reasoning and large language models: Opening a new frontier for causality. TMLR(2023)
2023
-
[60]
Peter Kirgis, Sayash Kapoor, Stephan Rabanser, Nitya Nadgir, Cozmin Ududec, Magda Dubois, JJ Allaire, Conrad Stosz, Marius Hobbhahn, Jacob Steinhardt, et al. 2026. Log analysis is necessary for credible evaluation of AI agents. arXiv:2605.08545(2026)
work page internal anchor Pith review Pith/arXiv arXiv 2026
-
[61]
Haruka Kiyohara, Daniel Yiming Cao, Yuta Saito, and Thorsten Joachims. 2025. An Off-Policy Learning Approach for Steering Sentence Generation towards Personalization. InRecSys
2025
-
[62]
Kasia Kobalczyk and Mihaela van der Schaar. 2025. Preference learning for AI alignment: A causal perspective. InICML
2025
-
[63]
Ryan Koo, Minhwa Lee, Vipul Raheja, Jong Inn Park, Zae Myung Kim, and Dongyeop Kang. 2024. Benchmarking cognitive biases in large language models as evaluators. InFindings of the ACL: ACL 2024
2024
- [64]
-
[65]
Katherine Lee, Daphne Ippolito, Andrew Nystrom, Chiyuan Zhang, Douglas Eck, Chris Callison-Burch, and Nicholas Carlini. 2022. Deduplicating training data makes language models better. InACL
2022
-
[66]
Cresswell
Kin Kwan Leung, Mouloud Belbahri, Yi Sui, Alex Labach, Xueying Zhang, Stephen Anthony Rose, and Jesse C. Cresswell. 2026. Classifying and addressing the diversity of errors in retrieval-augmented generation systems. InEACL
2026
-
[67]
Patrick Lewis, Ethan Perez, Aleksandra Piktus, Fabio Petroni, Vladimir Karpukhin, Naman Goyal, Heinrich Küttler, Mike Lewis, Wen-tau Yih, Tim Rocktäschel, Sebastian Riedel, and Douwe Kiela. 2020. Retrieval-augmented generation for knowledge-intensive NLP tasks. InNeurIPS
2020
- [68]
-
[69]
Joel Mire, Zubin Trivadi Aysola, Daniel Chechelnitsky, Nicholas Deas, Chrysoula Zerva, and Maarten Sap. 2025. Rejected dialects: Biases against African American Language in reward models. InFindings of the ACL: NAACL 2025
2025
-
[70]
Angelopoulos, Trevor Darrell, Narges Norouzi, and Joseph E
Mihran Miroyan, Tsung-Han Wu, Logan King, Tianle Li, Jiayi Pan, Xinyan Hu, Wei-Lin Chiang, Anastasios N. Angelopoulos, Trevor Darrell, Narges Norouzi, and Joseph E. Gonzalez. 2025. Search Arena: analyzing search-augmented LLMs. arXiv:2506.05334(2025)
-
[71]
Miruna Oprescu, Jacob Dorn, Marah Ghoummaid, Andrew Jesson, Nathan Kallus, and Uri Shalit. 2023. B-learner: Quasi-oracle bounds on heterogeneous causal effects under hidden confounding. InICML
2023
-
[72]
Long Ouyang, Jeff Wu, Xu Jiang, Diogo Almeida, Carroll L Wainwright, Pamela Mishkin, Chong Zhang, Sandhini Agarwal, Katarina Slama, Alex Ray, John Schulman, Jacob Hilton, Fraser Kelton, Luke Miller, Maddie Simens, Amanda Askell, Peter Welinder, Paul Christiano, Jan Leike, and Ryan Lowe. 2022. Training language models to follow instructions with human feed...
2022
-
[73]
Charles Packer, Vivian Fang, Shishir_G Patil, Kevin Lin, Sarah Wooders, and Joseph_E Gonzalez. 2023. MemGPT: towards LLMs as operating systems. (2023)
2023
-
[74]
Junsoo Park, Seungyeon Jwa, Ren Meiying, Daeyoung Kim, and Sanghyuk Choi
-
[75]
InFindings of the ACL: EMNLP 2024
OffsetBias: Leveraging debiased data for tuning evaluators. InFindings of the ACL: EMNLP 2024
2024
-
[76]
Joon Sung Park, Joseph O’Brien, Carrie Jun Cai, Meredith Ringel Morris, Percy Liang, and Michael S Bernstein. 2023. Generative agents: Interactive simulacra of human behavior. InAnnual ACM Symposium on User Interface Software and Technology
2023
-
[77]
2009.Causality
Judea Pearl. 2009.Causality. Cambridge University Press, New York City
2009
-
[78]
Rafael Rafailov, Archit Sharma, Eric Mitchell, Stefano Ermon, Christopher D Manning, and Chelsea Finn. 2023. Direct preference optimization: Your language model is secretly a reward model. InNeurIPS
2023
-
[79]
Robins, Andrea Rotnitzky, and Lue Ping Zhao
James M. Robins, Andrea Rotnitzky, and Lue Ping Zhao. 1994. Estimation of regression coefficients when some regressors are not always observed.Journal of the American Statistical Association89, 427 (1994), 846–866
1994
-
[80]
Donald B. Rubin. 1974. Estimating causal effects of treatments in randomized and nonrandomized studies.Journal of Educational Psychology66, 5 (1974), 688–701
1974
discussion (0)
Sign in with ORCID, Apple, or X to comment. Anyone can read and Pith papers without signing in.