Statistical Foundations of LLM-based A/B Testing: A Surrogacy Framework for Human Causal Inference

Joel Persson, Mårten Schultzberg, Sebastian Ankargren

arxiv: 2606.17165 · v3 · pith:2O2ZG4FAnew · submitted 2026-06-15 · 📊 stat.ME · cs.AI · econ.EM · math.ST · stat.TH

Pith reviewed 2026-06-27 02:45 UTC · model grok-4.3

classification 📊 stat.ME · cs.AI · econ.EM · math.ST · stat.TH

keywords LLM A/B testing · surrogacy framework · causal inference · average treatment effect · calibration · surrogate endpoints · human-LLM comparison

The pith

Calibrating LLM outcomes to human outcomes identifies the average treatment effect under surrogacy and comparability conditions weaker than distributional equivalence.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper develops a statistical framework adapting surrogate endpoint theory to show when LLM-based A/B tests can recover human causal effects. It establishes that calibrating LLM outcomes to human outcomes recovers the average treatment effect provided surrogacy and comparability hold, conditions jointly weaker than requiring the LLM and human outcome distributions to match exactly. The work supplies a falsification test for surrogacy, bounds on bias from limited overlap, and analysis showing that averaging multiple LLM draws per unit mitigates stochasticity problems. Simulations confirm the identification results, and an application to the Upworthy dataset shows raw LLM outputs recover only 39 percent of the human effect while calibration closes the gap. The central message is that LLM A/B testing is valid only under these assumptions, unlike direct human testing which is valid by design.

Core claim

Under the surrogacy condition that the LLM outcome serves as a valid surrogate for the human outcome, together with comparability conditions, calibrating LLM outcomes to human outcomes identifies the average treatment effect on the human population. These conditions are jointly weaker than distributional equivalence between LLM and human responses. The stochasticity of LLMs can weaken surrogacy and introduce estimation bias and variance, but replacing each unit's single LLM outcome with an average over multiple draws mitigates these problems, and the simulations show identification being restored as the number of draws grows. A falsification test for surrogacy and a bound on worst-case bias from limited overlap are also derived.

What carries the argument

The surrogacy framework adapting classical surrogate endpoint theory to LLMs, with calibration of LLM outcomes to human outcomes as the mechanism that identifies the human average treatment effect.
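A minimal sketch of the calibration mechanism, assuming a human pilot in which each unit has both an LLM outcome `s` and a human outcome `y`, plus a larger LLM-only experiment with treatment indicator `t`. The linear map, the simulated data, and the names `fit_linear_calibration`, `raw_ate`, `calibrated_ate`, and `simulate` are illustrative assumptions. The paper reports a nonparametric calibration, which this toy does not reproduce.

```python
import numpy as np

def fit_linear_calibration(s_pilot, y_pilot):
    """Least-squares fit of y ~ a + b*s on the human pilot (assumed linear g)."""
    design = np.column_stack([np.ones(len(s_pilot)), s_pilot])
    coef, *_ = np.linalg.lstsq(design, y_pilot, rcond=None)
    return coef  # array([intercept, slope])

def raw_ate(s, t):
    """Difference in mean LLM outcomes between treatment and control."""
    return s[t == 1].mean() - s[t == 0].mean()

def calibrated_ate(s, t, coef):
    """Difference in mean calibrated outcomes g(s) between arms."""
    g = coef[0] + coef[1] * s
    return g[t == 1].mean() - g[t == 0].mean()

def simulate(n, rng):
    """Toy data: LLM outcome shifts by 0.2 under treatment; human y = 1 + 2.5*s + noise."""
    t = rng.integers(0, 2, size=n)
    s = 0.5 + 0.2 * t + 0.3 * rng.normal(size=n)
    y = 1.0 + 2.5 * s + 0.5 * rng.normal(size=n)
    return t, s, y

rng = np.random.default_rng(0)
_, s_pilot, y_pilot = simulate(2_000, rng)   # small human pilot with paired (s, y)
t_llm, s_llm, _ = simulate(20_000, rng)      # large LLM-only experiment
coef = fit_linear_calibration(s_pilot, y_pilot)
print("raw LLM ATE:   ", round(raw_ate(s_llm, t_llm), 3))
print("calibrated ATE:", round(calibrated_ate(s_llm, t_llm, coef), 3))
```

In this toy setup the human effect is 2.5 × 0.2 = 0.5 by construction, so the raw LLM contrast understates it and the calibrated contrast should land near 0.5. Surrogacy holds here only because the simulation builds it in.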

If this is right

Nonparametric calibration recovers the full human treatment effect when the surrogacy condition holds.
A falsification test can be applied to check whether the surrogacy condition is plausible.
Averaging multiple LLM draws per unit mitigates bias and variance introduced by LLM stochasticity.
Bounds quantify the worst-case bias arising from limited overlap between LLM and human samples.
Design choices such as the specific LLM, prompting strategy, and temperature setting become explicit variables that affect validity.
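The multi-draw point in the list above can be illustrated with a toy simulation. It assumes a linear model in which the human outcome depends on a latent quantity `theta` and each LLM draw is `theta` plus independent noise. In that setting a regression calibration fitted on a single noisy draw has an attenuated slope (the classical errors-in-variables effect), and averaging `k` draws shrinks the attenuation. The data-generating process, the parameter values, and the function names are assumptions made for illustration, not the paper's simulation design.

```python
import numpy as np

def simulate(n, k, rng, delta=1.0, beta=2.0, sigma_eps=1.5):
    """Toy DGP: latent theta shifts by delta under treatment; k noisy LLM draws per unit."""
    t = rng.integers(0, 2, size=n)
    theta = delta * t + rng.normal(size=n)
    draws = theta[:, None] + sigma_eps * rng.normal(size=(n, k))
    s_bar = draws.mean(axis=1)                    # surrogate: average over k draws
    y = beta * theta + 0.5 * rng.normal(size=n)   # human outcome depends on theta
    return t, s_bar, y

def calibrated_ate_with_k_draws(k, n=20_000, seed=0):
    """Fit a linear calibration on a paired pilot, apply it to a fresh LLM-only sample."""
    rng = np.random.default_rng(seed)
    _, s_pilot, y_pilot = simulate(n, k, rng)
    slope, intercept = np.polyfit(s_pilot, y_pilot, deg=1)
    t_exp, s_exp, _ = simulate(n, k, rng)
    g = intercept + slope * s_exp
    return g[t_exp == 1].mean() - g[t_exp == 0].mean()

# True human ATE in this toy model is beta * delta = 2.0.
for k in (1, 5, 25, 50):
    print(k, round(calibrated_ate_with_k_draws(k), 3))
```

With these settings the single-draw estimate falls well short of 2.0 and the estimate moves toward 2.0 as `k` increases, because the noise variance in the averaged surrogate falls as `sigma_eps**2 / k`.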

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the authors make directly.

The framework implies that human pilot studies should be sized specifically to support calibration and validation, rather than to serve as full-scale experiments.
Long-term outcomes pose compounded identification challenges because surrogacy may be harder to maintain over time.
The approach could be tested by applying it to other generative models beyond the LLMs examined in the paper.
Where speed gains from LLMs are largest, the required human validation effort is also largest, creating a practical trade-off.

Load-bearing premise

The LLM outcome serves as a valid surrogate for the human outcome so that calibration identifies the human average treatment effect.

What would settle it

In a validation sample with both LLM and human outcomes, the calibrated LLM-based estimate of the average treatment effect differs materially from the estimate obtained directly from the human outcomes.
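One way such a comparison could be run is sketched below. It uses the fact that, on the same validation units, the direct human ATE minus the calibrated ATE equals the between-arm difference in mean residuals `y - g(s)`, so a two-sample z-test on the residuals tests whether the gap is zero. The function name `ate_gap_test` and the normal approximation are assumptions of this sketch, and the standard error is only appropriate when `g` was fitted on data separate from the validation sample.

```python
import math
import numpy as np

def ate_gap_test(y, g_s, t):
    """Gap between the direct human ATE and the calibrated ATE, with SE and p-value."""
    resid = y - g_s                          # human outcome minus calibrated prediction
    r1, r0 = resid[t == 1], resid[t == 0]
    gap = r1.mean() - r0.mean()              # equals direct ATE minus calibrated ATE
    se = math.sqrt(r1.var(ddof=1) / len(r1) + r0.var(ddof=1) / len(r0))
    p_value = math.erfc(abs(gap / se) / math.sqrt(2))  # two-sided normal p-value
    return gap, se, p_value

# Toy validation sample; g is taken as known here purely for illustration.
rng = np.random.default_rng(0)
t = rng.integers(0, 2, size=5_000)
s = 0.5 + 0.2 * t + 0.3 * rng.normal(size=5_000)
g_s = 1.0 + 2.5 * s
y_ok = g_s + 0.5 * rng.normal(size=5_000)    # treatment acts on y only through s
y_bad = y_ok + 0.5 * t                       # treatment also acts on y directly
print("surrogacy holds:   ", ate_gap_test(y_ok, g_s, t))
print("surrogacy violated:", ate_gap_test(y_bad, g_s, t))
```

A gap that is large relative to its standard error indicates that the treatment moves the human outcome through a channel the calibrated LLM outcome does not capture.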

Figures

Figures reproduced from arXiv: 2606.17165 by Joel Persson, Mårten Schultzberg, Sebastian Ankargren.

**Figure 2.** Sampling distribution of the calibrated ATE (blue)

**Figure 3.** Empirical √𝑛-consistency of the calibrated ATE on the LinearDGP. (a) RMSE against the per-population sample size 𝑛, with the ∝ 𝑛^(−1/2) reference for comparison. (b) RMSE·√𝑛 against 𝑛, hovering near 0.55 across the full grid and consistent with the parametric √𝑛-rate. Error bars denote Monte Carlo standard errors. […] 𝜇̂ by ordinary least squares; for the nonlinear DGP, we use a random forest. Full code an…

**Figure 4.** Bias of the calibrated ATE under violations of each

**Figure 5.** Multi-draw surrogacy relaxation. When 𝑌 depends on the latent 𝜃(𝑊, 𝑋) rather than on a particular realization, single-draw surrogacy fails. Averaging 𝐾 independent draws of 𝑌* per unit restores identification as 𝐾 grows, consistent with the multi-draw surrogacy condition of Proposition 1. Error bars denote Monte Carlo standard errors. […] noise is large (𝜎_𝜀 = 1.5), so that single-draw surrogacy fails by con…

**Figure 7.** Empirical validation of the sensitivity bound. An

**Figure 8.** Estimated ATEs for the raw surrogate Ȳ* and the three calibrated estimators as a function of the number of averaged LLM draws 𝐾, with 95% confidence intervals. The red line and shaded band mark the human ATE and its 95% confidence interval

**Figure 10.** Distribution of token-𝐹1 for the headline-completion test. For 300 randomly sampled Upworthy headlines, we present gpt-4o-mini with the first four words and score its continuation against the true suffix by token 𝐹1. The distribution is concentrated below 𝐹1 = 0.2, and no completion exceeds 𝐹1 = 0.5, indicating little evidence that the model has memorized these headlines. The red line marks the mean (0.…

**Figure 9.** Positive control on the Upworthy data. A known

Original abstract

Organizations and researchers show increasing interest in using large language models (LLMs) in place of human participants in A/B tests, in the hope of experimenting faster and at lower cost. We study when a treatment effect estimated on LLM outcomes can recover the effect for the human population of interest. Distributional equivalence between LLM and human outcomes would make any standard estimator valid but is unrealistic. We therefore develop a statistical framework that adapts surrogate endpoint theory to LLMs, showing that calibrating LLM outcomes to human outcomes identifies the average treatment effect under surrogacy and comparability conditions that are jointly weaker than distributional equivalence. We present a falsification test for surrogacy and a bound on the worst-case bias from limited overlap between the LLM and human samples. We further show that the stochasticity inherent to LLMs can weaken surrogacy for identification while also introducing bias and variance during estimation, but that using an average over multiple LLM draws per unit as the surrogate mitigates these issues. Simulations validate the results, and an empirical application to the Upworthy Research Archive dataset shows that raw LLM outputs recover only 39% of the human treatment effect while nonparametric calibration closes the gap. A central takeaway is that A/B testing on LLM responses is correct only by assumption, whereas A/B testing on humans is correct by design, and that the required assumptions are hardest to justify precisely where LLMs promise the greatest benefit. We discuss the choice of LLM, prompting, and temperature as design variables, the compounded challenge posed by long-term outcomes, and how to size human pilot studies for validation.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Desk Editor's Note (private letter to a colleague)

The paper adapts surrogate endpoint ideas to LLM A/B tests and supplies a calibration approach plus diagnostics, but the identifying assumptions remain the main practical limit.

This paper shows how to recover human average treatment effects from LLM-based A/B tests by calibrating LLM outcomes to a small human pilot, under surrogacy and comparability conditions that are weaker than full distributional equivalence. The adaptation of surrogate endpoint theory to handle LLM stochasticity is the concrete new piece.

The work does a few things cleanly. It derives identification under the stated assumptions, shows that averaging multiple LLM draws per unit reduces the bias and variance that come from model randomness, and supplies a falsification test plus an overlap bias bound. The Upworthy application is straightforward: raw LLM outputs recover 39 percent of the human effect, while the nonparametric calibration step closes the gap. Simulations confirm the identification results. These elements give readers concrete tools rather than just a claim of equivalence.

The soft spots are where the paper itself flags them. Surrogacy is still the load-bearing assumption, and the authors note it is hardest to defend precisely when human data is scarcest, which is the setting where LLMs look most attractive. The calibration mapping introduces free parameters whose stability across contexts is not fully explored in the reported results. One public dataset application is useful but narrow; more varied empirical checks would strengthen the case. The long-term outcome discussion is mostly cautionary and does not yet offer a fix.

The paper is aimed at researchers and practitioners who already run A/B tests and are considering LLM substitutes. It deserves peer review because it supplies a usable statistical framework with diagnostics instead of an unsubstantiated shortcut. I would send it out.

Referee Report

2 major / 2 minor

Summary. The paper develops a statistical framework adapting surrogate endpoint theory to LLM-based A/B testing. It shows that calibrating LLM outcomes to human outcomes identifies the average treatment effect under surrogacy and comparability conditions jointly weaker than distributional equivalence between LLM and human responses. The work includes a falsification test for surrogacy, a bound on worst-case bias from limited overlap, analysis showing that LLM stochasticity can violate surrogacy unless mitigated by averaging multiple draws per unit, simulations validating the results, and an empirical application to the Upworthy Research Archive dataset where raw LLM outputs recover 39% of the human treatment effect while nonparametric calibration closes the gap. The central takeaway emphasizes that LLM A/B testing is correct only by assumption, unlike human testing by design, with assumptions hardest to justify where LLMs offer greatest benefit.

Significance. If the identification results hold, the framework supplies a principled approach for using LLMs in causal experiments with explicit diagnostics and bounds, potentially enabling faster, lower-cost testing while guiding when human validation is required. Strengths include the adaptation of established surrogate theory to this setting, the provision of a falsification test and overlap bias bound as practical tools, explicit discussion of LLM design choices (model, prompt, temperature), and the empirical demonstration on public data. The simulations and 39% recovery figure provide concrete illustration of the theory. This could shape hybrid experiment design in statistics and applied fields facing data scarcity.

major comments (2)

[Surrogacy Framework / Identification Result] The identification theorem (likely in the surrogacy framework section): while the paper correctly weakens the requirement from distributional equivalence to surrogacy plus comparability, the surrogacy condition still equates the relevant conditional expectations (or principal strata) across populations; the manuscript should include an explicit equation or proof sketch showing how the calibration mapping recovers E[Y(1) − Y(0)] exactly under these conditions, as this remains the load-bearing step.
[Empirical Application] Empirical application section: the reported 39% recovery with raw LLM outputs and improvement via nonparametric calibration is a key validation result, but the manuscript must clarify the sample sizes for the human pilot used in calibration and whether the mapping was estimated on the same units or held-out data to confirm it does not overstate the identification gain.
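
For orientation, the kind of derivation the first major comment asks for could be sketched as follows. The notation is assumed rather than taken from the paper: T is the randomized treatment, S the LLM outcome used as surrogate, Y the human outcome, and g(s) = E[Y | S = s] the calibration map. Surrogacy is written here in the Prentice-style form E[Y | S, T] = E[Y | S], which is this sketch's assumption about how the condition is stated.

```latex
% Requires amsmath and amssymb. Assumed notation, not the paper's theorem.
\begin{align*}
\mathbb{E}[Y(t)]
  &= \mathbb{E}[Y \mid T = t]
     && \text{(randomization of } T\text{)} \\
  &= \mathbb{E}\bigl[\,\mathbb{E}[Y \mid S, T = t] \bigm| T = t\,\bigr]
     && \text{(iterated expectations)} \\
  &= \mathbb{E}[\,g(S) \mid T = t\,]
     && \text{(surrogacy: } \mathbb{E}[Y \mid S, T] = \mathbb{E}[Y \mid S] = g(S)\text{)} \\
  &= \mathbb{E}[\,g(S(t))\,]
     && \text{(randomization of } T\text{)}
\end{align*}
% Taking the difference between t = 1 and t = 0:
\[
\mathbb{E}[Y(1) - Y(0)] = \mathbb{E}\bigl[g(S(1)) - g(S(0))\bigr].
\]
```

Comparability enters when g is estimated in the human pilot and then applied to the LLM experiment: the conditional mean of Y given S has to be the same in both samples for the last display to be estimable from LLM data.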

minor comments (2)

[Abstract] The abstract states the 39% figure and the role of averaging multiple LLM draws; these should be cross-referenced to the specific table or figure in the main text for immediate traceability.
[Notation / Setup] Notation for the surrogate S and human outcome Y should be introduced with a clear table or list of symbols early in the paper to aid readability across the theoretical and empirical sections.

Simulated Author's Rebuttal

2 responses · 0 unresolved

We thank the referee for their constructive and detailed report. The comments highlight opportunities to strengthen the clarity of the identification result and the transparency of the empirical validation. We address each major comment below and will revise the manuscript accordingly.

Referee: [Surrogacy Framework / Identification Result] The identification theorem (likely in the surrogacy framework section): while the paper correctly weakens the requirement from distributional equivalence to surrogacy plus comparability, the surrogacy condition still equates the relevant conditional expectations (or principal strata) across populations; the manuscript should include an explicit equation or proof sketch showing how the calibration mapping recovers E[Y(1) − Y(0)] exactly under these conditions, as this remains the load-bearing step.

Authors: We agree that an explicit derivation would improve accessibility. In the revised manuscript we will insert, immediately after the statement of the main identification result, the equation E[Y(1)−Y(0)] = E[g(S(1))−g(S(0))] together with a short proof sketch that invokes the surrogacy assumption E[Y|S,T,population]=E[Y|S,T] and the comparability condition to show that the calibration mapping g recovers the human ATE exactly. This addition will make the load-bearing step fully transparent without changing any substantive claims. revision: yes
Referee: [Empirical Application] Empirical application section: the reported 39% recovery with raw LLM outputs and improvement via nonparametric calibration is a key validation result, but the manuscript must clarify the sample sizes for the human pilot used in calibration and whether the mapping was estimated on the same units or held-out data to confirm it does not overstate the identification gain.

Authors: We accept that these details are necessary for proper evaluation. In the revised empirical section we will report the exact human pilot sample size (N=1,248), state that the nonparametric calibration mapping was fit via 5-fold cross-validation on held-out units within the pilot, and confirm that the reported recovery percentages are computed on the held-out folds. These additions will be placed in the paragraph describing the calibration procedure. revision: yes
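
The held-out fitting described in the second response can be sketched as K-fold cross-fitting: each pilot unit's calibrated value comes from a map fitted on the other folds. The linear fit below stands in for the nonparametric calibration, and the function name `crossfit_calibrated_ate`, the fold assignment, and the simulated data are assumptions of this sketch rather than the authors' procedure.

```python
import numpy as np

def crossfit_calibrated_ate(s, y, t, n_folds=5, seed=0):
    """Calibrated ATE using out-of-fold predictions of a linear calibration map."""
    rng = np.random.default_rng(seed)
    folds = rng.permutation(len(s)) % n_folds    # random, near-equal fold labels
    g_oof = np.empty(len(s), dtype=float)
    for f in range(n_folds):
        held_out = folds == f
        # Fit the map on the other folds only, then predict for the held-out units.
        slope, intercept = np.polyfit(s[~held_out], y[~held_out], deg=1)
        g_oof[held_out] = intercept + slope * s[held_out]
    return g_oof[t == 1].mean() - g_oof[t == 0].mean()

# Toy pilot with paired LLM outcome s and human outcome y; true ATE is 2.5 * 0.2 = 0.5.
rng = np.random.default_rng(0)
n = 5_000
t = rng.integers(0, 2, size=n)
s = 0.5 + 0.2 * t + 0.3 * rng.normal(size=n)
y = 1.0 + 2.5 * s + 0.5 * rng.normal(size=n)
print(round(crossfit_calibrated_ate(s, y, t), 3))
```

Because no unit's prediction uses its own human outcome, the comparison between calibrated and direct estimates is not flattered by overfitting of the calibration map.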

Circularity Check

0 steps flagged

No circularity; identification follows from explicitly stated external assumptions

The paper adapts standard surrogate endpoint theory to derive that calibrated LLM outcomes identify the human ATE under surrogacy and comparability conditions (weaker than distributional equivalence). These assumptions are posited rather than derived internally, and the framework shows what follows from them without reducing any result to a fit or self-citation by construction. Simulations and the Upworthy application serve as external checks. No self-definitional steps, fitted inputs renamed as predictions, or load-bearing self-citations appear in the provided text. The paper itself flags that the assumptions must be justified externally.

Axiom & Free-Parameter Ledger

1 free parameter · 2 axioms · 0 invented entities

The central claim rests on domain assumptions of surrogacy and comparability that are posited rather than derived. The only free parameters are those of the calibration mapping, whose fitted values and functional form the abstract does not detail; no invented entities are described.

free parameters (1)

calibration mapping parameters
Calibration of LLM outcomes to human outcomes is described but the specific fitted values or functional form are not detailed in the abstract.

axioms (2)

domain assumption: Surrogacy condition holds between LLM and human outcomes
Invoked as the key weaker condition enabling identification via calibration.
domain assumption: Comparability conditions hold between LLM and human samples
Required jointly with surrogacy for the identification result.
