How Many Human Survey Respondents is a Large Language Model Worth? An Uncertainty Quantification Perspective
Pith reviewed 2026-05-23 03:11 UTC · model grok-4.3
The pith
LLM survey simulations yield reliable confidence sets for human population parameters once the number of simulated responses is chosen adaptively.
A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.
Core claim
We propose a data-driven approach that adaptively selects the simulation sample size to achieve nominal average-case coverage, regardless of the LLM's simulation fidelity or the confidence set construction procedure. The selected sample size is further shown to reflect the effective human population size that the LLM can represent, providing a quantitative measure of its simulation fidelity.
What carries the argument
Adaptive data-driven selection of simulation sample size to guarantee nominal average-case coverage
If this is right
- Excessive simulation counts produce overly narrow sets whose coverage falls below nominal due to unmodeled misalignment.
- The selection rule requires no assumptions on the form of the misalignment distribution.
- Experiments on real surveys show the equivalent human size varies across LLMs and question domains.
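The first bullet can be illustrated with a minimal sketch. It assumes a normal approximation, a standard Wald interval, and a fixed hypothetical misalignment (`bias`) between the simulated and human means; the function names and numbers are illustrative and are not taken from the paper.

```python
import math


def normal_cdf(x: float) -> float:
    """Standard normal CDF computed from the error function."""
    return 0.5 * (1.0 + math.erf(x / math.sqrt(2.0)))


def coverage_under_bias(n: int, bias: float, sigma: float = 0.5, z: float = 1.96) -> float:
    """Probability that (simulated mean +/- z * sigma / sqrt(n)) contains the human mean.

    Assumption: the simulated mean is normal with standard error sigma / sqrt(n)
    and is centred `bias` away from the human mean.
    """
    shift = math.sqrt(n) * bias / sigma
    return normal_cdf(z - shift) - normal_cdf(-z - shift)


# Coverage of a nominal 95% interval as the simulation count grows,
# for a hypothetical bias of 0.03 on a [0, 1] response scale.
for n in (10, 100, 1000, 10000):
    print(n, round(coverage_under_bias(n, bias=0.03), 3))
```

With zero bias the coverage stays at the nominal level for every n; with any fixed nonzero bias the interval shrinks around the wrong centre, so coverage decays as n grows.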
Where Pith is reading between the lines
- The equivalent-size metric supplies a concrete benchmark for deciding when LLM data can substitute for additional human respondents in a given survey.
- The same adaptive rule could be tested on other synthetic data sources such as agent-based models or synthetic populations.
Load-bearing premise
A data-driven rule can guarantee nominal average-case coverage for any LLM fidelity and any confidence-set procedure solely by adjusting the simulation sample size.
What would settle it
Apply the selection rule to held-out human survey responses and verify whether the resulting confidence sets attain the target coverage probability on average.
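A minimal sketch of such a check, run on synthetic data only. The Hoeffding interval, the rule "largest n whose calibration coverage stays at or above target", the misalignment model, and all function names are assumptions made for illustration; they are not the paper's procedure.

```python
import math
import random


def hoeffding_interval(samples, alpha):
    """Two-sided Hoeffding interval for the mean of responses in [0, 1]."""
    n = len(samples)
    mean = sum(samples) / n
    half = math.sqrt(math.log(2.0 / alpha) / (2.0 * n))
    return mean - half, mean + half


def coverage_at(n, llm_responses, human_means, alpha):
    """Fraction of questions whose human mean lies in the interval
    built from the first n simulated responses."""
    hits = 0
    for sims, truth in zip(llm_responses, human_means):
        lo, hi = hoeffding_interval(sims[:n], alpha)
        hits += lo <= truth <= hi
    return hits / len(human_means)


def select_sample_size(llm_responses, human_means, alpha, max_n):
    """Largest n such that calibration coverage is >= 1 - alpha
    for every size up to n (0 if even n = 1 fails)."""
    chosen = 0
    for n in range(1, max_n + 1):
        if coverage_at(n, llm_responses, human_means, alpha) < 1.0 - alpha:
            break
        chosen = n
    return chosen


def make_questions(rng, k, max_n, misalignment_sd=0.1):
    """Synthetic yes/no questions: human rate p, misaligned simulated rate q."""
    human, sims = [], []
    for _ in range(k):
        p = rng.uniform(0.2, 0.8)
        q = min(max(p + rng.gauss(0.0, misalignment_sd), 0.0), 1.0)
        human.append(p)
        sims.append([1 if rng.random() < q else 0 for _ in range(max_n)])
    return human, sims


rng = random.Random(0)
alpha, max_n = 0.1, 500
cal_human, cal_sims = make_questions(rng, 200, max_n)    # calibration questions
test_human, test_sims = make_questions(rng, 200, max_n)  # held-out questions

n_hat = select_sample_size(cal_sims, cal_human, alpha, max_n)
heldout_cov = coverage_at(n_hat, test_sims, test_human, alpha)
print("selected n:", n_hat, "held-out coverage:", heldout_cov)
```

The held-out coverage is the quantity to compare against the target 1 - alpha; on real surveys the synthetic generator would be replaced by actual human response rates and LLM outputs.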
Original abstract
Large language models (LLMs) are increasingly used to simulate survey responses, but synthetic data can be misaligned with the human population, leading to unreliable inference. We develop a general framework that converts LLM-simulated responses into reliable confidence sets for population parameters of human responses, quantifying the uncertainty induced by the human-LLM misalignment. The key design choice is the number of simulated responses: too many produce overly narrow sets with poor coverage, while too few yield overly wide and uninformative sets dominated by stochastic noise. We propose a data-driven approach that adaptively selects the simulation sample size to achieve nominal average-case coverage, regardless of the LLM's simulation fidelity or the confidence set construction procedure. The selected sample size is further shown to reflect the effective human population size that the LLM can represent, providing a quantitative measure of its simulation fidelity. Experiments on real survey datasets reveal heterogeneous simulation fidelity across different LLMs and domains.
Editorial analysis
A structured set of objections, weighed in public.
Referee Report
Summary. The paper develops a general framework for constructing confidence sets for human population parameters from LLM-simulated survey responses, explicitly accounting for human-LLM misalignment. It proposes a data-driven adaptive procedure that selects the LLM simulation sample size n to achieve nominal average-case coverage for any LLM fidelity level and any confidence-set construction method. The selected n is interpreted as the effective human population size represented by the LLM, providing a quantitative fidelity measure. Experiments on real survey data illustrate heterogeneous fidelity across LLMs and domains.
Significance. If the adaptive selection rule and its coverage guarantee hold under the stated conditions, the work supplies a practical, procedure-agnostic tool for quantifying when and how much LLM simulations can substitute for human respondents in survey inference, along with an interpretable effective-sample-size metric. This would be a concrete contribution to uncertainty quantification for synthetic data in statistics and social science applications.
major comments (2)
- [Abstract] Abstract: The central claim that a data-driven rule exists which selects simulation size n to achieve nominal average-case coverage 'regardless of the LLM's simulation fidelity or the confidence set construction procedure' and 'without additional assumptions on the form of the human-LLM misalignment distribution' is load-bearing for the entire contribution. Standard coverage arguments depend on the structure of misalignment (e.g., mean-zero bias, independence, or correctability by larger n); if misalignment introduces irreducible bias or the procedure is non-robust, adjusting n alone cannot restore coverage. The manuscript must supply the explicit conditions, derivation, or proof sketch under which the universal guarantee holds.
- [Abstract] Abstract (and any later theoretical section): No derivation, proof sketch, or validation details are supplied for the existence or properties of the adaptive procedure. Without these, the support for the coverage guarantee and the effective-size interpretation cannot be assessed from the available text.
Simulated Author's Rebuttal
We thank the referee for their thorough review and for identifying the need for stronger theoretical support in the manuscript. We agree that the coverage claims require explicit justification and will revise the paper to include the requested derivations, conditions, and proof sketches.
Point-by-point responses
- Referee: [Abstract] Abstract: The central claim that a data-driven rule exists which selects simulation size n to achieve nominal average-case coverage 'regardless of the LLM's simulation fidelity or the confidence set construction procedure' and 'without additional assumptions on the form of the human-LLM misalignment distribution' is load-bearing for the entire contribution. Standard coverage arguments depend on the structure of misalignment (e.g., mean-zero bias, independence, or correctability by larger n); if misalignment introduces irreducible bias or the procedure is non-robust, adjusting n alone cannot restore coverage. The manuscript must supply the explicit conditions, derivation, or proof sketch under which the universal guarantee holds.
  Authors: We agree that the abstract claim is central and that the current text does not supply an explicit derivation or proof sketch. The adaptive procedure is presented algorithmically, but the formal argument for average-case coverage under arbitrary misalignment is not detailed. In revision we will add a dedicated theoretical subsection deriving the guarantee: the data-driven rule uses a calibration sample to select the largest n such that the empirical coverage (averaged over the misalignment distribution) meets the nominal level by construction. This holds without parametric assumptions on the misalignment because the selection directly targets the coverage functional rather than relying on bias correction or asymptotic arguments. We will also state the precise conditions (e.g., existence of a finite effective sample size and measurability of the coverage map) under which the result applies.
  Revision: yes
- Referee: [Abstract] Abstract (and any later theoretical section): No derivation, proof sketch, or validation details are supplied for the existence or properties of the adaptive procedure. Without these, the support for the coverage guarantee and the effective-size interpretation cannot be assessed from the available text.
  Authors: We acknowledge the omission. The manuscript motivates and describes the adaptive selection but does not provide a formal proof sketch or validation details for its properties. In the revised version we will expand the theoretical section with (i) a proof outline showing that the selected n converges to the largest value that still achieves nominal average coverage, (ii) the link between this n and the effective human sample size via the inverse of the coverage function, and (iii) additional simulation validation confirming the procedure's behavior across fidelity levels. These additions will directly address the assessability of the claims.
  Revision: yes
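One way to write down the calibration-based rule that the responses describe, under assumed notation that does not come from the paper. The rule is stated with a maximum because, by the abstract's account, larger simulation counts narrow the sets and lower coverage, so the informative choice is the largest count that still meets the target.

```latex
% Assumed notation (illustrative, not the paper's):
%   \theta_j   human population parameter for calibration question j = 1, ..., m
%   C_j(n)     confidence set built from the first n simulated responses to question j
%   N          simulation budget;  1 - \alpha  target average-case coverage
\[
\hat{n} \;=\; \max\Bigl\{\, n \le N \;:\;
  \frac{1}{m}\sum_{j=1}^{m} \mathbf{1}\bigl\{\theta_j \notin C_j(n')\bigr\} \le \alpha
  \ \text{ for all } n' \le n \Bigr\}
\]
```

Under this reading, a larger selected value means the simulated responses can stand in for more human respondents before misalignment dominates, which is how the abstract interprets the selected sample size.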
Circularity Check
No significant circularity; derivation relies on proposed adaptive rule without reduction to inputs by construction
The paper proposes a data-driven adaptive selection of simulation sample size n to achieve nominal average-case coverage for any LLM fidelity and any confidence-set procedure, then interprets the chosen n as an effective human population size. No equations, self-citations, or steps in the abstract or described framework reduce this selection rule or the effective-size interpretation to a fitted parameter, a self-referential definition, or a prior result by the same authors. The central claim is presented as a new methodological contribution whose coverage guarantee is asserted to hold without further assumptions on misalignment; this is not shown to be tautological or forced by the paper's own inputs. The derivation chain is therefore self-contained against external benchmarks.
Axiom & Free-Parameter Ledger
axioms (1)
- Domain assumption: Existence of a data-driven rule that selects simulation size to achieve nominal average-case coverage for any LLM fidelity and any confidence-set procedure
Forward citations
Cited by 3 Pith papers
- Adaptive Querying with AI Persona Priors
  A persona-induced latent variable model with LLM-generated priors enables scalable adaptive item selection with closed-form Bayesian updates for accurate user-specific predictions.
- Rectification Difficulty and Optimal Sample Allocation in LLM-Augmented Surveys
  A method using predicted rectification difficulty for optimal human sample allocation in LLM-augmented surveys captures 61-79% of theoretical efficiency gains and reduces MSE by 11% on two datasets without pilot data.
- Evaluating Reliability Gaps in Large Language Model Safety via Repeated Prompt Sampling
  Repeated sampling of the same safety prompts reveals substantial differences in LLM failure probabilities across temperatures that conventional single-evaluation benchmarks miss.