How Many Human Survey Respondents is a Large Language Model Worth? An Uncertainty Quantification Perspective

Chengpiao Huang; Kaizheng Wang; Yuhang Wu

arxiv: 2502.17773 · v5 · pith:6A5KJWHTnew · submitted 2025-02-25 · 📊 stat.ME · cs.AI· cs.LG

How Many Human Survey Respondents is a Large Language Model Worth? An Uncertainty Quantification Perspective

Chengpiao Huang , Yuhang Wu , Kaizheng Wang This is my paper

Pith reviewed 2026-05-23 03:11 UTC · model grok-4.3

classification 📊 stat.ME cs.AIcs.LG

keywords large language modelssurvey simulationuncertainty quantificationconfidence setssample size selectionsimulation fidelitymisalignment

0 comments

The pith

LLM survey simulations yield reliable human confidence sets once the simulation count is chosen adaptively.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper develops a framework that converts LLM-simulated survey responses into confidence sets for human population parameters while quantifying the extra uncertainty from human-LLM misalignment. The central design choice is the simulation sample size: too large and the sets become overconfident with bad coverage, too small and they are dominated by noise. A data-driven rule selects this size to guarantee nominal average-case coverage for any LLM fidelity level and any confidence-set method. The resulting size is shown to equal the effective number of human respondents the LLM can represent.

Core claim

We propose a data-driven approach that adaptively selects the simulation sample size to achieve nominal average-case coverage, regardless of the LLM's simulation fidelity or the confidence set construction procedure. The selected sample size is further shown to reflect the effective human population size that the LLM can represent, providing a quantitative measure of its simulation fidelity.

What carries the argument

Adaptive data-driven selection of simulation sample size to guarantee nominal average-case coverage

If this is right

Excessive simulation counts produce overly narrow sets whose coverage falls below nominal due to unmodeled misalignment.
The selection rule requires no assumptions on the form of the misalignment distribution.
Experiments on real surveys show the equivalent human size varies across LLMs and question domains.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

The equivalent-size metric supplies a concrete benchmark for deciding when LLM data can substitute for additional human respondents in a given survey.
The same adaptive rule could be tested on other synthetic data sources such as agent-based models or synthetic populations.

Load-bearing premise

A data-driven rule can guarantee nominal average-case coverage for any LLM fidelity and any confidence-set procedure solely by adjusting the simulation sample size.

What would settle it

Apply the selection rule to held-out human survey responses and verify whether the resulting confidence sets attain the target coverage probability on average.

Figures

Figures reproduced from arXiv: 2502.17773 by Chengpiao Huang, Kaizheng Wang, Yuhang Wu.

**Figure 2.** Figure 2: The Coverage-Width Trade-off for the Simulation Sample Size [PITH_FULL_IMAGE:figures/full_fig_p005_2.png] view at source ↗

**Figure 3.** Figure 3: The LLM as a Mechanical Turk. The top panel illustrates the traditional survey process, where [PITH_FULL_IMAGE:figures/full_fig_p011_3.png] view at source ↗

**Figure 4.** Figure 4: Parametric Bootstrap. In contrast, our framework ( [PITH_FULL_IMAGE:figures/full_fig_p015_4.png] view at source ↗

**Figure 5.** Figure 5: Our Framework. In Appendix C.2, we further establish a more quantitative connection, where our parameters can be directly interpreted in bootstrap terms: the hidden population size κ represents the number of 15 [PITH_FULL_IMAGE:figures/full_fig_p015_5.png] view at source ↗

**Figure 6.** Figure 6: Miscoverage Probability Proxy Gte(bk) for Different LLMs and Target Miscoverage Levels α. Horizontal axis: target miscoverage level α. Vertical axis: LLM. Circles, squares, triangles and diamonds represent Gte(bk) for α = 0.05, 0.1, 0.15, 0.2, respectively. 20 [PITH_FULL_IMAGE:figures/full_fig_p020_6.png] view at source ↗

**Figure 7.** Figure 7: Histogram of the Relative Error [PITH_FULL_IMAGE:figures/full_fig_p021_7.png] view at source ↗

**Figure 8.** Figure 8: Half-widths of Confidence Intervals I syn,Bern(bk) for Different LLMs and Target Levels α. Horizontal axis: target miscoverage level α. Vertical axis: half-width of I syn,Bern(bk). The results are averaged over 100 random train-test splits. The half-width of each error bar is 1.96 times the standard error. Claude 3.5 Haiku Deepseek V3 GPT-3.5 Turbo GPT-4o mini GPT-4o GPT-5 mini Llama 3.3 70B Mistral 7B Ran… view at source ↗

**Figure 9.** Figure 9: Estimated Hidden Population Sizes κb = bk/C of Different LLMs. The results are averaged over 100 random train-test splits. The half-width of each error bar is 1.96 times the standard error. As is discussed in Section 4, the hidden population size κ reflects the size of the human population that the LLM can represent. A larger κ indicates that the LLM captures more information about the human population. Ou… view at source ↗

**Figure 10.** Figure 10: Miscoverage Probability Proxy Lte(bk) for the General Method. Horizontal axis: target miscoverage level α. Vertical axis: LLM. Circles, squares, triangles and diamonds represent Lte(bk) for α = 0.05, 0.1, 0.15, 0.2, respectively. The results are averaged over 100 random train-test splits of the questions. The half-width of each error bar is 1.96 times the standard error. Sharpness of selected sample size… view at source ↗

**Figure 11.** Figure 11: Half-widths of Confidence Intervals I syn,Bern(bk) for the General Method. Horizontal axis: target miscoverage level α. Vertical axis: half-width of I syn,Bern(bk). The results are averaged over 100 random train-test splits. The half-width of each error bar is 1.96 times the standard error. Estimated hidden population size κb = bk/C. In [PITH_FULL_IMAGE:figures/full_fig_p056_11.png] view at source ↗

**Figure 12.** Figure 12: Estimated Hidden Population Sizes κb = bk/C of Different LLMs from the General Method. The results are averaged over 100 random train-test splits. The half-width of each error bar is 1.96 times the standard error. References Aher, G. V., Arriaga, R. I. and Kalai, A. T. (2023). Using large language models to simulate multiple humans and replicate human subject studies. In Proceedings of the 40th Internatio… view at source ↗

read the original abstract

Large language models (LLMs) are increasingly used to simulate survey responses, but synthetic data can be misaligned with the human population, leading to unreliable inference. We develop a general framework that converts LLM-simulated responses into reliable confidence sets for population parameters of human responses, quantifying the uncertainty induced by the human-LLM misalignment. The key design choice is the number of simulated responses: too many produce overly narrow sets with poor coverage, while too few yield overly wide and uninformative sets dominated by stochastic noise. We propose a data-driven approach that adaptively selects the simulation sample size to achieve nominal average-case coverage, regardless of the LLM's simulation fidelity or the confidence set construction procedure. The selected sample size is further shown to reflect the effective human population size that the LLM can represent, providing a quantitative measure of its simulation fidelity. Experiments on real survey datasets reveal heterogeneous simulation fidelity across different LLMs and domains.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Desk Editor's Note private letter to a colleague

The paper gives a data-driven rule for picking LLM simulation size to hit nominal coverage on human survey parameters and reads that size as effective human sample size, but the abstract supplies no derivation or conditions.

read the letter

The central claim is that you can adaptively choose how many LLM responses to draw so the resulting confidence sets for human population parameters achieve the target average-case coverage no matter how well or poorly the LLM matches the humans, and that the chosen size directly measures the LLM's effective human population size. That framing is new relative to standard synthetic-data adjustments I've seen. The experiments on real survey data are useful for showing that fidelity varies by model and topic, which matches what practitioners already suspect. The effective-size interpretation could be handy for deciding when LLM data is worth the trouble versus collecting more humans. The main gap is that the abstract states the existence of such a rule without any sketch of the argument, the assumptions on misalignment, or how the adaptation is implemented. Coverage by tuning sample size alone typically requires the misalignment to be mean-zero or to average out in a specific way; if the paper has no conditions on that or on the confidence-set method, the guarantee may not be as general as stated. The stress-test note about possible implicit assumptions on the misalignment distribution seems worth checking against the actual proof. Overall the idea is practical for survey work, but the technical support is thin from what's visible. This is worth sending to referees who know survey sampling and synthetic data so they can verify the derivation and the experiments. If the math closes, it could be a useful tool; if not, the claim needs narrowing.

Referee Report

2 major / 0 minor

Summary. The paper develops a general framework for constructing confidence sets for human population parameters from LLM-simulated survey responses, explicitly accounting for human-LLM misalignment. It proposes a data-driven adaptive procedure that selects the LLM simulation sample size n to achieve nominal average-case coverage for any LLM fidelity level and any confidence-set construction method. The selected n is interpreted as the effective human population size represented by the LLM, providing a quantitative fidelity measure. Experiments on real survey data illustrate heterogeneous fidelity across LLMs and domains.

Significance. If the adaptive selection rule and its coverage guarantee hold under the stated conditions, the work supplies a practical, procedure-agnostic tool for quantifying when and how much LLM simulations can substitute for human respondents in survey inference, along with an interpretable effective-sample-size metric. This would be a concrete contribution to uncertainty quantification for synthetic data in statistics and social science applications.

major comments (2)

[Abstract] Abstract: The central claim that a data-driven rule exists which selects simulation size n to achieve nominal average-case coverage 'regardless of the LLM's simulation fidelity or the confidence set construction procedure' and 'without additional assumptions on the form of the human-LLM misalignment distribution' is load-bearing for the entire contribution. Standard coverage arguments depend on the structure of misalignment (e.g., mean-zero bias, independence, or correctability by larger n); if misalignment introduces irreducible bias or the procedure is non-robust, adjusting n alone cannot restore coverage. The manuscript must supply the explicit conditions, derivation, or proof sketch under which the universal guarantee holds.
[Abstract] Abstract (and any later theoretical section): No derivation, proof sketch, or validation details are supplied for the existence or properties of the adaptive procedure. Without these, the support for the coverage guarantee and the effective-size interpretation cannot be assessed from the available text.

Simulated Author's Rebuttal

2 responses · 0 unresolved

We thank the referee for their thorough review and for identifying the need for stronger theoretical support in the manuscript. We agree that the coverage claims require explicit justification and will revise the paper to include the requested derivations, conditions, and proof sketches.

read point-by-point responses

Referee: [Abstract] Abstract: The central claim that a data-driven rule exists which selects simulation size n to achieve nominal average-case coverage 'regardless of the LLM's simulation fidelity or the confidence set construction procedure' and 'without additional assumptions on the form of the human-LLM misalignment distribution' is load-bearing for the entire contribution. Standard coverage arguments depend on the structure of misalignment (e.g., mean-zero bias, independence, or correctability by larger n); if misalignment introduces irreducible bias or the procedure is non-robust, adjusting n alone cannot restore coverage. The manuscript must supply the explicit conditions, derivation, or proof sketch under which the universal guarantee holds.

Authors: We agree that the abstract claim is central and that the current text does not supply an explicit derivation or proof sketch. The adaptive procedure is presented algorithmically, but the formal argument for average-case coverage under arbitrary misalignment is not detailed. In revision we will add a dedicated theoretical subsection deriving the guarantee: the data-driven rule uses a calibration sample to select the smallest n such that the empirical coverage (averaged over the misalignment distribution) meets the nominal level by construction. This holds without parametric assumptions on the misalignment because the selection directly targets the coverage functional rather than relying on bias correction or asymptotic arguments. We will also state the precise conditions (e.g., existence of a finite effective sample size and measurability of the coverage map) under which the result applies. revision: yes
Referee: [Abstract] Abstract (and any later theoretical section): No derivation, proof sketch, or validation details are supplied for the existence or properties of the adaptive procedure. Without these, the support for the coverage guarantee and the effective-size interpretation cannot be assessed from the available text.

Authors: We acknowledge the omission. The manuscript motivates and describes the adaptive selection but does not provide a formal proof sketch or validation details for its properties. In the revised version we will expand the theoretical section with (i) a proof outline showing that the selected n converges to the minimal value achieving nominal average coverage, (ii) the link between this n and the effective human sample size via the inverse of the coverage function, and (iii) additional simulation validation confirming the procedure's behavior across fidelity levels. These additions will directly address the assessability of the claims. revision: yes

Circularity Check

0 steps flagged

No significant circularity; derivation relies on proposed adaptive rule without reduction to inputs by construction

full rationale

The paper proposes a data-driven adaptive selection of simulation sample size n to achieve nominal average-case coverage for any LLM fidelity and any confidence-set procedure, then interprets the chosen n as an effective human population size. No equations, self-citations, or steps in the abstract or described framework reduce this selection rule or the effective-size interpretation to a fitted parameter, a self-referential definition, or a prior result by the same authors. The central claim is presented as a new methodological contribution whose coverage guarantee is asserted to hold without further assumptions on misalignment; this is not shown to be tautological or forced by the paper's own inputs. The derivation chain is therefore self-contained against external benchmarks.

Axiom & Free-Parameter Ledger

0 free parameters · 1 axioms · 0 invented entities

The framework rests on the existence of an adaptive selection rule that delivers coverage for arbitrary misalignment; no free parameters, axioms, or invented entities are identifiable from the abstract alone.

axioms (1)

domain assumption Existence of a data-driven rule that selects simulation size to achieve nominal average-case coverage for any LLM fidelity and any confidence-set procedure
This is the load-bearing premise stated in the abstract but not derived or justified within the provided text.

pith-pipeline@v0.9.0 · 5696 in / 1252 out tokens · 27225 ms · 2026-05-23T03:11:50.477602+00:00 · methodology

discussion (0)

Forward citations

Cited by 3 Pith papers

Reviewed papers in the Pith corpus that reference this work. Sorted by Pith novelty score.

Adaptive Querying with AI Persona Priors
stat.ML 2026-05 unverdicted novelty 7.0

A persona-induced latent variable model with LLM-generated priors enables scalable adaptive item selection with closed-form Bayesian updates for accurate user-specific predictions.
Rectification Difficulty and Optimal Sample Allocation in LLM-Augmented Surveys
cs.AI 2026-04 unverdicted novelty 7.0

A method using predicted rectification difficulty for optimal human sample allocation in LLM-augmented surveys captures 61-79% of theoretical efficiency gains and reduces MSE by 11% on two datasets without pilot data.
Evaluating Reliability Gaps in Large Language Model Safety via Repeated Prompt Sampling
cs.AI 2026-03 conditional novelty 6.0

Repeated sampling of the same safety prompts reveals substantial differences in LLM failure probabilities across temperatures that conventional single-evaluation benchmarks miss.

Reference graph

Works this paper leans on

54 extracted references · 54 canonical work pages · cited by 3 Pith papers · 4 internal anchors

[1]

, " * write output.state after.block = add.period write newline

ENTRY address author booktitle chapter edition editor howpublished institution journal key month note number organization pages publisher school series title type url volume year label extra.label sort.label short.list INTEGERS output.state before.all mid.sentence after.sentence after.block FUNCTION init.state.consts #0 'before.all := #1 'mid.sentence := ...

work page
[2]

write newline

" write newline "" before.all 'output.state := FUNCTION n.dashify 't := "" t empty not t #1 #1 substring "-" = t #1 #2 substring "--" = not "--" * t #2 global.max substring 't := t #1 #1 substring "-" = "-" * t #2 global.max substring 't := while if t #1 #1 substring * t #2 global.max substring 't := if while FUNCTION word.in bbl.in capitalize " " * FUNCT...

work page
[3]

Aher, G. V. , Arriaga, R. I. and Kalai, A. T. (2023). Using large language models to simulate multiple humans and replicate human subject studies. In Proceedings of the 40th International Conference on Machine Learning (A. Krause, E. Brunskill, K. Cho, B. Engelhardt, S. Sabato and J. Scarlett, eds.), vol. 202 of Proceedings of Machine Learning Research. P...

work page 2023
[4]

Angelopoulos, A. N. , Bates, S. , Fannjiang, C. , Jordan, M. I. and Zrnic, T. (2023). Prediction-powered inference. Science 382 669--674. ://www.science.org/doi/abs/10.1126/science.adi6000

work page doi:10.1126/science.adi6000 2023
[5]

Angelopoulos, A. N. , Bates, S. , Fisch, A. , Lei, L. and Schuster, T. (2024). Conformal risk control. In The Twelfth International Conference on Learning Representations. ://openreview.net/forum?id=33XGfHLtZg

work page 2024
[6]

Introducing computer use, a new Claude 3.5 Sonnet , and Claude 3.5 Haiku

Anthropic (2024). Introducing computer use, a new Claude 3.5 Sonnet , and Claude 3.5 Haiku . ://www.anthropic.com/news/3-5-models-and-computer-use

work page 2024
[7]

Argyle, L. P. , Busby, E. C. , Fulda, N. , Gubler, J. R. , Rytting, C. and Wingate, D. (2023). Out of one, many: Using language models to simulate human samples. Political Analysis 31 337--351

work page 2023
[8]

Barton, R. R. , Lam, H. and Song, E. (2022). Input Uncertainty in Stochastic Simulation. Springer International Publishing, Cham, 573--620. ://doi.org/10.1007/978-3-030-96935-6_17

work page doi:10.1007/978-3-030-96935-6_17 2022
[9]

, Angelopoulos, A

Bates, S. , Angelopoulos, A. , Lei, L. , Malik, J. and Jordan, M. (2021). Distribution-free, risk-controlling prediction sets. J. ACM 68. ://doi.org/10.1145/3478535

work page doi:10.1145/3478535 2021
[10]

Billingsley, P. (2017). Probability and measure. John Wiley & Sons

work page 2017
[11]

, Clinton, J

Bisbee, J. , Clinton, J. D. , Dorff, C. , Kenkel, B. and Larson, J. M. (2024). Synthetic replacements for human survey data? T he perils of large language models. Political Analysis 32 401--416

work page 2024
[12]

Liesen, Z

Boucheron, S. , Lugosi, G. and Massart, P. (2013). Concentration Inequalities: A Nonasymptotic Theory of Independence. Oxford University Press. ://doi.org/10.1093/acprof:oso/9780199535255.001.0001

work page doi:10.1093/acprof:oso/9780199535255.001.0001 2013
[13]

, Israeli, A

Brand, J. , Israeli, A. and Ngwe, D. (2023). Using LLM s for market research. Harvard Business School Marketing Unit Working Paper

work page 2023
[14]

, Reichart, R

Calderon, N. , Reichart, R. and Dror, R. (2025). The alternative annotator test for LLM -as-a-judge: How to statistically justify replacing human annotators with LLM s. arXiv preprint arXiv:2501.10970 . ://arxiv.org/abs/2501.10970

work page arXiv 2025
[15]

and Berger, R

Casella, G. and Berger, R. (2002). Statistical inference. CRC press

work page 2002
[16]

, Liu, T

Chen, Y. , Liu, T. X. , Shan, Y. and Zhong, S. (2023). The emergence of economic rationality of GPT . Proceedings of the National Academy of Sciences 120 e2316205120. ://www.pnas.org/doi/abs/10.1073/pnas.2316205120

work page doi:10.1073/pnas.2316205120 2023
[17]

Cheng, R. C. H. and Holland, W. (2004). Calculation of confidence intervals for simulation output. ACM Trans. Model. Comput. Simul. 14 344--362. ://doi.org/10.1145/1029174.1029176

work page doi:10.1145/1029174.1029176 2004
[18]

, Hardt, M

Dominguez-Olmedo, R. , Hardt, M. and Mendler-D \"u nner, C. (2024). Questioning the survey responses of large language models. In The Thirty-eighth Annual Conference on Neural Information Processing Systems. ://openreview.net/forum?id=Oo7dlLgqQX

work page 2024
[19]

The Llama 3 Herd of Models

Dubey, A. , Jauhri, A. , Pandey, A. , Kadian, A. , Al-Dahle, A. , Letman, A. , Mathur, A. , Schelten, A. , Yang, A. , Fan, A. et al. (2024). The Llama 3 herd of models. arXiv preprint arXiv:2407.21783

work page internal anchor Pith review Pith/arXiv arXiv 2024
[20]

, Nguyen, K

Durmus, E. , Nguyen, K. , Liao, T. , Schiefer, N. , Askell, A. , Bakhtin, A. , Chen, C. , Hatfield-Dodds, Z. , Hernandez, D. , Joseph, N. , Lovitt, L. , McCandlish, S. , Sikder, O. , Tamkin, A. , Thamkul, J. , Kaplan, J. , Clark, J. and Ganguli, D. (2024). Towards measuring the representation of subjective global opinions in language models. In First Conf...

work page 2024
[21]

Eedi labs

Eedi Labs (2025). Eedi labs. Accessed November 15, 2025, https://www.eedi.com/

work page 2025
[22]

Efron, B. (1979). Bootstrap methods: Another look at the jackknife. The Annals of Statistics 7 1 -- 26. ://doi.org/10.1214/aos/1176344552

work page doi:10.1214/aos/1176344552 1979
[23]

and Hastie, T

Efron, B. and Hastie, T. (2021). Computer Age Statistical Inference, Student Edition: Algorithms, Evidence, and Data Science. Institute of Mathematical Statistics Monographs, Cambridge University Press

work page 2021
[24]

, Lee, D

Gao, Y. , Lee, D. , Burtch, G. and Fazelpour, S. (2025). Take caution in using llms as human surrogates. Proceedings of the National Academy of Sciences 122 e2501660122. ://www.pnas.org/doi/abs/10.1073/pnas.2501660122

work page doi:10.1073/pnas.2501660122 2025
[25]

and Singh, A

Goli, A. and Singh, A. (2024). Frontiers: Can large language models capture human preferences? Marketing Science 43 709--722. ://doi.org/10.1287/mksc.2023.0306

work page doi:10.1287/mksc.2023.0306 2024
[26]

and Toubia, O

Gui, G. and Toubia, O. (2023). The challenge of using LLM s to simulate human behavior: A causal inference perspective. arXiv preprint arXiv:2312.15524

work page arXiv 2023
[27]

He-Yueya, J. , Ma, W. A. , Gandhi, K. , Domingue, B. W. , Brunskill, E. and Goodman, N. D. (2024). Psychometric alignment: Capturing human knowledge distributions via language models. arXiv preprint arXiv:2407.15645

work page arXiv 2024
[28]

Horton, J. J. (2023). Large language models as simulated economic agents: What can we learn from homo silicus? Tech. rep., National Bureau of Economic Research

work page 2023
[29]

Huang, C. , Wu, Y. and Wang, K. (2025). Uncertainty quantification for LLM -based survey simulations. In Proceedings of the 42nd International Conference on Machine Learning (A. Singh, M. Fazel, D. Hsu, S. Lacoste-Julien, F. Berkenkamp, T. Maharaj, K. Wagstaff and J. Zhu, eds.), vol. 267 of Proceedings of Machine Learning Research. PMLR. ://proceedings.ml...

work page 2025
[30]

Social science meets llms: How reliable are large language models in social simulations? arXiv preprint arXiv:2410.23426, 2024

Huang, Y. , Yuan, Z. , Zhou, Y. , Guo, K. , Wang, X. , Zhuang, H. , Sun, W. , Sun, L. , Wang, J. , Ye, Y. et al. (2024). Social science meets LLM s: How reliable are large language models in social simulations? arXiv preprint arXiv:2410.23426

work page arXiv 2024
[31]

Jiang, A. Q. , Sablayrolles, A. , Mensch, A. , Bamford, C. , Chaplot, D. S. , Casas, D. d. l. , Bressand, F. , Lengyel, G. , Lample, G. , Saulnier, L. et al. (2023). Mistral 7B . arXiv preprint arXiv:2310.06825

work page internal anchor Pith review Pith/arXiv arXiv 2023
[32]

Jones, C. R. and Bergen, B. K. (2025). Large language models pass the turing test. arXiv preprint arXiv:2503.23674

work page arXiv 2025
[33]

, O'Hagan, S

Kim, J. , O'Hagan, S. and Rockova, V. (2024). Adaptive uncertainty quantification for generative AI . arXiv preprint arXiv:2408.08990

work page arXiv 2024
[34]

Lam, H. (2016). Advanced tutorial: Input uncertainty and robust analysis in stochastic simulation. In 2016 Winter Simulation Conference (WSC)

work page 2016
[35]

DeepSeek-V3 Technical Report

Liu, A. , Feng, B. , Xue, B. , Wang, B. , Wu, B. , Lu, C. , Zhao, C. , Deng, C. , Zhang, C. , Ruan, C. et al. (2024). DeepSeek-V3 technical report. arXiv preprint arXiv:2412.19437

work page internal anchor Pith review Pith/arXiv arXiv 2024
[36]

and Wang, X

Lu, X. and Wang, X. (2024). Generative students: Using LLM -simulated student profiles to support question item evaluation. In Proceedings of the Eleventh ACM Conference on Learning @ Scale. L@S '24, Association for Computing Machinery, New York, NY, USA. ://doi.org/10.1145/3657604.3662031

work page doi:10.1145/3657604.3662031 2024
[37]

Markel, J. M. , Opferman, S. G. , Landay, J. A. and Piech, C. (2023). Gpteach: Interactive ta training with gpt-based students. In Proceedings of the Tenth ACM Conference on Learning @ Scale. Association for Computing Machinery, New York, NY, USA. ://doi.org/10.1145/3573051.3593393

work page doi:10.1145/3573051.3593393 2023
[38]

and Pontil, M

Maurer, A. and Pontil, M. (2009). Empirical bernstein bounds and sample variance penalization. The 22nd Annual Conference on Learning Theory . ://www.learningtheory.org/colt2009/papers/012.pdf

work page 2009
[39]

Economic Sciences 121(9), e2313925121 (2024)

Mei, Q. , Xie, Y. , Yuan, W. and Jackson, M. O. (2024). A T uring test of whether AI chatbots are behaviorally similar to humans. Proceedings of the National Academy of Sciences 121 e2313925121. ://www.pnas.org/doi/abs/10.1073/pnas.2313925121

work page doi:10.1073/pnas.2313925121 2024
[40]

and Pei, L

Nelson, B. and Pei, L. (2021). Foundations and methods of stochastic simulation. Springer

work page 2021
[41]

GPT-3.5 Turbo

OpenAI (2022). GPT-3.5 Turbo . ://platform.openai.com/docs/models/gpt-3-5#gpt-3-5-turbo

work page 2022
[42]

GPT-4o mini: Advancing cost-efficient intelligence

OpenAI (2024 a ). GPT-4o mini: Advancing cost-efficient intelligence. ://openai.com/index/gpt-4o-mini-advancing-cost-efficient-intelligence/

work page 2024
[43]

Hello GPT-4o

OpenAI (2024 b ). Hello GPT-4o . ://openai.com/index/hello-gpt-4o/

work page 2024
[44]

Introducing GPT-5

OpenAI (2025). Introducing GPT-5 . ://openai.com/index/introducing-gpt-5/

work page 2025
[45]

Owen, A. (1990). Empirical Likelihood Ratio Confidence Regions . The Annals of Statistics 18 90 -- 120. ://doi.org/10.1214/aos/1176347494

work page doi:10.1214/aos/1176347494 1990
[46]

The A merican trends panel

Pew Research Center (2025). The A merican trends panel. Accessed November 15, 2025, https://www.pewresearch.org/the-american-trends-panel/

work page 2025
[47]

, Durmus, E

Santurkar, S. , Durmus, E. , Ladhak, F. , Lee, C. , Liang, P. and Hashimoto, T. (2023). Whose opinions do language models reflect? In Proceedings of the 40th International Conference on Machine Learning (A. Krause, E. Brunskill, K. Cho, B. Engelhardt, S. Sabato and J. Scarlett, eds.), vol. 202 of Proceedings of Machine Learning Research. PMLR. ://proceedi...

work page 2023
[48]

and Vovk, V

Shafer, G. and Vovk, V. (2008). A tutorial on conformal prediction. Journal of Machine Learning Research 9 371--421. ://jmlr.org/papers/v9/shafer08a.html

work page 2008
[49]

, Gammerman, A

Vovk, V. , Gammerman, A. and Shafer, G. (2005). Algorithmic learning in a random world, vol. 29. Springer. ://doi.org/10.1007/b106715

work page doi:10.1007/b106715 2005
[50]

Large Language Models for Market Research: A Data-augmentation Approach

Wang, M. , Zhang, D. J. and Zhang, H. (2024). Large language models for market research: A data-augmentation approach. arXiv preprint arXiv:2412.19363

work page internal anchor Pith review Pith/arXiv arXiv 2024
[51]

, Lamb, A

Wang, Z. , Lamb, A. , Saveliev, E. , Cameron, P. , Zaykov, J. , Hernandez-Lobato, J. M. , Turner, R. E. , Baraniuk, R. G. , Craig Barton, E. , Peyton Jones, S. , Woodhead, S. and Zhang, C. (2021). Results and insights from diagnostic questions: The NeurIPS 2020 education challenge. In Proceedings of the NeurIPS 2020 Competition and Demonstration Track (H....

work page 2021
[52]

Yang, K. , Li, H. , Wen, H. , Peng, T.-Q. , Tang, J. and Liu, H. (2024). Are large language models ( LLM s) good social predictors? In Findings of the Association for Computational Linguistics: EMNLP 2024 (Y. Al-Onaizan, M. Bansal and Y.-N. Chen, eds.). Association for Computational Linguistics, Miami, Florida, USA. ://aclanthology.org/2024.findings-emnlp.153/

work page 2024
[53]

Zelikman, E. , Ma, W. , Tran, J. , Yang, D. , Yeatman, J. and Haber, N. (2023). Generating and evaluating tests for K-12 students with language model simulations: A case study on sentence reading efficiency. In Proceedings of the 2023 Conference on Empirical Methods in Natural Language Processing (H. Bouamor, J. Pino and K. Bali, eds.). Association for Co...

work page 2023
[54]

Cocarascu, F

Ziems, C. , Held, W. , Shaikh, O. , Chen, J. , Zhang, Z. and Yang, D. (2024). Can large language models transform computational social science? Computational Linguistics 50 237--291. ://doi.org/10.1162/coli\_a\_00502

work page doi:10.1162/coli 2024

[1] [1]

, " * write output.state after.block = add.period write newline

ENTRY address author booktitle chapter edition editor howpublished institution journal key month note number organization pages publisher school series title type url volume year label extra.label sort.label short.list INTEGERS output.state before.all mid.sentence after.sentence after.block FUNCTION init.state.consts #0 'before.all := #1 'mid.sentence := ...

work page

[2] [2]

write newline

" write newline "" before.all 'output.state := FUNCTION n.dashify 't := "" t empty not t #1 #1 substring "-" = t #1 #2 substring "--" = not "--" * t #2 global.max substring 't := t #1 #1 substring "-" = "-" * t #2 global.max substring 't := while if t #1 #1 substring * t #2 global.max substring 't := if while FUNCTION word.in bbl.in capitalize " " * FUNCT...

work page

[3] [3]

Aher, G. V. , Arriaga, R. I. and Kalai, A. T. (2023). Using large language models to simulate multiple humans and replicate human subject studies. In Proceedings of the 40th International Conference on Machine Learning (A. Krause, E. Brunskill, K. Cho, B. Engelhardt, S. Sabato and J. Scarlett, eds.), vol. 202 of Proceedings of Machine Learning Research. P...

work page 2023

[4] [4]

Angelopoulos, A. N. , Bates, S. , Fannjiang, C. , Jordan, M. I. and Zrnic, T. (2023). Prediction-powered inference. Science 382 669--674. ://www.science.org/doi/abs/10.1126/science.adi6000

work page doi:10.1126/science.adi6000 2023

[5] [5]

Angelopoulos, A. N. , Bates, S. , Fisch, A. , Lei, L. and Schuster, T. (2024). Conformal risk control. In The Twelfth International Conference on Learning Representations. ://openreview.net/forum?id=33XGfHLtZg

work page 2024

[6] [6]

Introducing computer use, a new Claude 3.5 Sonnet , and Claude 3.5 Haiku

Anthropic (2024). Introducing computer use, a new Claude 3.5 Sonnet , and Claude 3.5 Haiku . ://www.anthropic.com/news/3-5-models-and-computer-use

work page 2024

[7] [7]

Argyle, L. P. , Busby, E. C. , Fulda, N. , Gubler, J. R. , Rytting, C. and Wingate, D. (2023). Out of one, many: Using language models to simulate human samples. Political Analysis 31 337--351

work page 2023

[8] [8]

Barton, R. R. , Lam, H. and Song, E. (2022). Input Uncertainty in Stochastic Simulation. Springer International Publishing, Cham, 573--620. ://doi.org/10.1007/978-3-030-96935-6_17

work page doi:10.1007/978-3-030-96935-6_17 2022

[9] [9]

, Angelopoulos, A

Bates, S. , Angelopoulos, A. , Lei, L. , Malik, J. and Jordan, M. (2021). Distribution-free, risk-controlling prediction sets. J. ACM 68. ://doi.org/10.1145/3478535

work page doi:10.1145/3478535 2021

[10] [10]

Billingsley, P. (2017). Probability and measure. John Wiley & Sons

work page 2017

[11] [11]

, Clinton, J

Bisbee, J. , Clinton, J. D. , Dorff, C. , Kenkel, B. and Larson, J. M. (2024). Synthetic replacements for human survey data? T he perils of large language models. Political Analysis 32 401--416

work page 2024

[12] [12]

Liesen, Z

Boucheron, S. , Lugosi, G. and Massart, P. (2013). Concentration Inequalities: A Nonasymptotic Theory of Independence. Oxford University Press. ://doi.org/10.1093/acprof:oso/9780199535255.001.0001

work page doi:10.1093/acprof:oso/9780199535255.001.0001 2013

[13] [13]

, Israeli, A

Brand, J. , Israeli, A. and Ngwe, D. (2023). Using LLM s for market research. Harvard Business School Marketing Unit Working Paper

work page 2023

[14] [14]

, Reichart, R

Calderon, N. , Reichart, R. and Dror, R. (2025). The alternative annotator test for LLM -as-a-judge: How to statistically justify replacing human annotators with LLM s. arXiv preprint arXiv:2501.10970 . ://arxiv.org/abs/2501.10970

work page arXiv 2025

[15] [15]

and Berger, R

Casella, G. and Berger, R. (2002). Statistical inference. CRC press

work page 2002

[16] [16]

, Liu, T

Chen, Y. , Liu, T. X. , Shan, Y. and Zhong, S. (2023). The emergence of economic rationality of GPT . Proceedings of the National Academy of Sciences 120 e2316205120. ://www.pnas.org/doi/abs/10.1073/pnas.2316205120

work page doi:10.1073/pnas.2316205120 2023

[17] [17]

Cheng, R. C. H. and Holland, W. (2004). Calculation of confidence intervals for simulation output. ACM Trans. Model. Comput. Simul. 14 344--362. ://doi.org/10.1145/1029174.1029176

work page doi:10.1145/1029174.1029176 2004

[18] [18]

, Hardt, M

Dominguez-Olmedo, R. , Hardt, M. and Mendler-D \"u nner, C. (2024). Questioning the survey responses of large language models. In The Thirty-eighth Annual Conference on Neural Information Processing Systems. ://openreview.net/forum?id=Oo7dlLgqQX

work page 2024

[19] [19]

The Llama 3 Herd of Models

Dubey, A. , Jauhri, A. , Pandey, A. , Kadian, A. , Al-Dahle, A. , Letman, A. , Mathur, A. , Schelten, A. , Yang, A. , Fan, A. et al. (2024). The Llama 3 herd of models. arXiv preprint arXiv:2407.21783

work page internal anchor Pith review Pith/arXiv arXiv 2024

[20] [20]

, Nguyen, K

Durmus, E. , Nguyen, K. , Liao, T. , Schiefer, N. , Askell, A. , Bakhtin, A. , Chen, C. , Hatfield-Dodds, Z. , Hernandez, D. , Joseph, N. , Lovitt, L. , McCandlish, S. , Sikder, O. , Tamkin, A. , Thamkul, J. , Kaplan, J. , Clark, J. and Ganguli, D. (2024). Towards measuring the representation of subjective global opinions in language models. In First Conf...

work page 2024

[21] [21]

Eedi labs

Eedi Labs (2025). Eedi labs. Accessed November 15, 2025, https://www.eedi.com/

work page 2025

[22] [22]

Efron, B. (1979). Bootstrap methods: Another look at the jackknife. The Annals of Statistics 7 1 -- 26. ://doi.org/10.1214/aos/1176344552

work page doi:10.1214/aos/1176344552 1979

[23] [23]

and Hastie, T

Efron, B. and Hastie, T. (2021). Computer Age Statistical Inference, Student Edition: Algorithms, Evidence, and Data Science. Institute of Mathematical Statistics Monographs, Cambridge University Press

work page 2021

[24] [24]

, Lee, D

Gao, Y. , Lee, D. , Burtch, G. and Fazelpour, S. (2025). Take caution in using llms as human surrogates. Proceedings of the National Academy of Sciences 122 e2501660122. ://www.pnas.org/doi/abs/10.1073/pnas.2501660122

work page doi:10.1073/pnas.2501660122 2025

[25] [25]

and Singh, A

Goli, A. and Singh, A. (2024). Frontiers: Can large language models capture human preferences? Marketing Science 43 709--722. ://doi.org/10.1287/mksc.2023.0306

work page doi:10.1287/mksc.2023.0306 2024

[26] [26]

and Toubia, O

Gui, G. and Toubia, O. (2023). The challenge of using LLM s to simulate human behavior: A causal inference perspective. arXiv preprint arXiv:2312.15524

work page arXiv 2023

[27] [27]

He-Yueya, J. , Ma, W. A. , Gandhi, K. , Domingue, B. W. , Brunskill, E. and Goodman, N. D. (2024). Psychometric alignment: Capturing human knowledge distributions via language models. arXiv preprint arXiv:2407.15645

work page arXiv 2024

[28] [28]

Horton, J. J. (2023). Large language models as simulated economic agents: What can we learn from homo silicus? Tech. rep., National Bureau of Economic Research

work page 2023

[29] [29]

Huang, C. , Wu, Y. and Wang, K. (2025). Uncertainty quantification for LLM -based survey simulations. In Proceedings of the 42nd International Conference on Machine Learning (A. Singh, M. Fazel, D. Hsu, S. Lacoste-Julien, F. Berkenkamp, T. Maharaj, K. Wagstaff and J. Zhu, eds.), vol. 267 of Proceedings of Machine Learning Research. PMLR. ://proceedings.ml...

work page 2025

[30] [30]

Social science meets llms: How reliable are large language models in social simulations? arXiv preprint arXiv:2410.23426, 2024

Huang, Y. , Yuan, Z. , Zhou, Y. , Guo, K. , Wang, X. , Zhuang, H. , Sun, W. , Sun, L. , Wang, J. , Ye, Y. et al. (2024). Social science meets LLM s: How reliable are large language models in social simulations? arXiv preprint arXiv:2410.23426

work page arXiv 2024

[31] [31]

Jiang, A. Q. , Sablayrolles, A. , Mensch, A. , Bamford, C. , Chaplot, D. S. , Casas, D. d. l. , Bressand, F. , Lengyel, G. , Lample, G. , Saulnier, L. et al. (2023). Mistral 7B . arXiv preprint arXiv:2310.06825

work page internal anchor Pith review Pith/arXiv arXiv 2023

[32] [32]

Jones, C. R. and Bergen, B. K. (2025). Large language models pass the turing test. arXiv preprint arXiv:2503.23674

work page arXiv 2025

[33] [33]

, O'Hagan, S

Kim, J. , O'Hagan, S. and Rockova, V. (2024). Adaptive uncertainty quantification for generative AI . arXiv preprint arXiv:2408.08990

work page arXiv 2024

[34] [34]

Lam, H. (2016). Advanced tutorial: Input uncertainty and robust analysis in stochastic simulation. In 2016 Winter Simulation Conference (WSC)

work page 2016

[35] [35]

DeepSeek-V3 Technical Report

Liu, A. , Feng, B. , Xue, B. , Wang, B. , Wu, B. , Lu, C. , Zhao, C. , Deng, C. , Zhang, C. , Ruan, C. et al. (2024). DeepSeek-V3 technical report. arXiv preprint arXiv:2412.19437

work page internal anchor Pith review Pith/arXiv arXiv 2024

[36] [36]

and Wang, X

Lu, X. and Wang, X. (2024). Generative students: Using LLM -simulated student profiles to support question item evaluation. In Proceedings of the Eleventh ACM Conference on Learning @ Scale. L@S '24, Association for Computing Machinery, New York, NY, USA. ://doi.org/10.1145/3657604.3662031

work page doi:10.1145/3657604.3662031 2024

[37] [37]

Markel, J. M. , Opferman, S. G. , Landay, J. A. and Piech, C. (2023). Gpteach: Interactive ta training with gpt-based students. In Proceedings of the Tenth ACM Conference on Learning @ Scale. Association for Computing Machinery, New York, NY, USA. ://doi.org/10.1145/3573051.3593393

work page doi:10.1145/3573051.3593393 2023

[38] [38]

and Pontil, M

Maurer, A. and Pontil, M. (2009). Empirical bernstein bounds and sample variance penalization. The 22nd Annual Conference on Learning Theory . ://www.learningtheory.org/colt2009/papers/012.pdf

work page 2009

[39] [39]

Economic Sciences 121(9), e2313925121 (2024)

Mei, Q. , Xie, Y. , Yuan, W. and Jackson, M. O. (2024). A T uring test of whether AI chatbots are behaviorally similar to humans. Proceedings of the National Academy of Sciences 121 e2313925121. ://www.pnas.org/doi/abs/10.1073/pnas.2313925121

work page doi:10.1073/pnas.2313925121 2024

[40] [40]

and Pei, L

Nelson, B. and Pei, L. (2021). Foundations and methods of stochastic simulation. Springer

work page 2021

[41] [41]

GPT-3.5 Turbo

OpenAI (2022). GPT-3.5 Turbo . ://platform.openai.com/docs/models/gpt-3-5#gpt-3-5-turbo

work page 2022

[42] [42]

GPT-4o mini: Advancing cost-efficient intelligence

OpenAI (2024 a ). GPT-4o mini: Advancing cost-efficient intelligence. ://openai.com/index/gpt-4o-mini-advancing-cost-efficient-intelligence/

work page 2024

[43] [43]

Hello GPT-4o

OpenAI (2024 b ). Hello GPT-4o . ://openai.com/index/hello-gpt-4o/

work page 2024

[44] [44]

Introducing GPT-5

OpenAI (2025). Introducing GPT-5 . ://openai.com/index/introducing-gpt-5/

work page 2025

[45] [45]

Owen, A. (1990). Empirical Likelihood Ratio Confidence Regions . The Annals of Statistics 18 90 -- 120. ://doi.org/10.1214/aos/1176347494

work page doi:10.1214/aos/1176347494 1990

[46] [46]

The A merican trends panel

Pew Research Center (2025). The A merican trends panel. Accessed November 15, 2025, https://www.pewresearch.org/the-american-trends-panel/

work page 2025

[47] [47]

, Durmus, E

Santurkar, S. , Durmus, E. , Ladhak, F. , Lee, C. , Liang, P. and Hashimoto, T. (2023). Whose opinions do language models reflect? In Proceedings of the 40th International Conference on Machine Learning (A. Krause, E. Brunskill, K. Cho, B. Engelhardt, S. Sabato and J. Scarlett, eds.), vol. 202 of Proceedings of Machine Learning Research. PMLR. ://proceedi...

work page 2023

[48] [48]

and Vovk, V

Shafer, G. and Vovk, V. (2008). A tutorial on conformal prediction. Journal of Machine Learning Research 9 371--421. ://jmlr.org/papers/v9/shafer08a.html

work page 2008

[49] [49]

, Gammerman, A

Vovk, V. , Gammerman, A. and Shafer, G. (2005). Algorithmic learning in a random world, vol. 29. Springer. ://doi.org/10.1007/b106715

work page doi:10.1007/b106715 2005

[50] [50]

Large Language Models for Market Research: A Data-augmentation Approach

Wang, M. , Zhang, D. J. and Zhang, H. (2024). Large language models for market research: A data-augmentation approach. arXiv preprint arXiv:2412.19363

work page internal anchor Pith review Pith/arXiv arXiv 2024

[51] [51]

, Lamb, A

Wang, Z. , Lamb, A. , Saveliev, E. , Cameron, P. , Zaykov, J. , Hernandez-Lobato, J. M. , Turner, R. E. , Baraniuk, R. G. , Craig Barton, E. , Peyton Jones, S. , Woodhead, S. and Zhang, C. (2021). Results and insights from diagnostic questions: The NeurIPS 2020 education challenge. In Proceedings of the NeurIPS 2020 Competition and Demonstration Track (H....

work page 2021

[52] [52]

Yang, K. , Li, H. , Wen, H. , Peng, T.-Q. , Tang, J. and Liu, H. (2024). Are large language models ( LLM s) good social predictors? In Findings of the Association for Computational Linguistics: EMNLP 2024 (Y. Al-Onaizan, M. Bansal and Y.-N. Chen, eds.). Association for Computational Linguistics, Miami, Florida, USA. ://aclanthology.org/2024.findings-emnlp.153/

work page 2024

[53] [53]

Zelikman, E. , Ma, W. , Tran, J. , Yang, D. , Yeatman, J. and Haber, N. (2023). Generating and evaluating tests for K-12 students with language model simulations: A case study on sentence reading efficiency. In Proceedings of the 2023 Conference on Empirical Methods in Natural Language Processing (H. Bouamor, J. Pino and K. Bali, eds.). Association for Co...

work page 2023

[54] [54]

Cocarascu, F

Ziems, C. , Held, W. , Shaikh, O. , Chen, J. , Zhang, Z. and Yang, D. (2024). Can large language models transform computational social science? Computational Linguistics 50 237--291. ://doi.org/10.1162/coli\_a\_00502

work page doi:10.1162/coli 2024