arXiv: 2502.17773 · v5 · pith:6A5KJWHTnew · submitted 2025-02-25 · 📊 stat.ME · cs.AI · cs.LG

How Many Human Survey Respondents is a Large Language Model Worth? An Uncertainty Quantification Perspective

Pith reviewed 2026-05-23 03:11 UTC · model grok-4.3

classification 📊 stat.ME · cs.AI · cs.LG
keywords large language models · survey simulation · uncertainty quantification · confidence sets · sample size selection · simulation fidelity · misalignment

The pith

LLM survey simulations yield reliable confidence sets for human population parameters once the simulation count is chosen adaptively.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper develops a framework that converts LLM-simulated survey responses into confidence sets for human population parameters while quantifying the extra uncertainty from human-LLM misalignment. The central design choice is the simulation sample size: too large, and the sets become overly narrow and fall short of their nominal coverage; too small, and they are wide, uninformative, and dominated by sampling noise. A data-driven rule selects this size to achieve nominal average-case coverage for any LLM fidelity level and any confidence-set method. The selected size is shown to reflect the effective number of human respondents the LLM can represent.
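The trade-off described above can be illustrated with a small Monte Carlo sketch. Everything in it is an assumption made for illustration and not taken from the paper: a single yes/no question, a hypothetical human response rate `p_human`, a misaligned LLM response rate `p_llm`, and a textbook normal-approximation interval standing in for whatever confidence-set procedure is used.

```python
import math
import random


def coverage_and_halfwidth(k, p_human=0.50, p_llm=0.55, trials=1000, seed=0):
    """Estimate how often a 95% normal-approximation interval built from
    k simulated LLM answers covers the human proportion p_human.

    p_human and p_llm are hypothetical values chosen for illustration.
    Returns (empirical coverage, average half-width).
    """
    rng = random.Random(seed)
    z = 1.96  # two-sided 95% normal quantile
    hits = 0
    width_sum = 0.0
    for _ in range(trials):
        # Draw k binary answers at the (misaligned) LLM response rate.
        successes = sum(rng.random() < p_llm for _ in range(k))
        mean = successes / k
        # Guard against a zero variance estimate when all answers agree.
        half = z * math.sqrt(max(mean * (1.0 - mean), 1e-12) / k)
        hits += (mean - half <= p_human <= mean + half)
        width_sum += half
    return hits / trials, width_sum / trials


if __name__ == "__main__":
    for k in (10, 50, 200, 1000):
        cov, half = coverage_and_halfwidth(k)
        print(f"k={k:5d}  coverage={cov:.3f}  half-width={half:.3f}")
```

With a fixed gap between the two rates, the interval keeps shrinking around the LLM rate as `k` grows, so coverage of the human rate eventually collapses, while very small `k` gives intervals that cover but are too wide to be informative.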

Core claim

We propose a data-driven approach that adaptively selects the simulation sample size to achieve nominal average-case coverage, regardless of the LLM's simulation fidelity or the confidence set construction procedure. The selected sample size is further shown to reflect the effective human population size that the LLM can represent, providing a quantitative measure of its simulation fidelity.

What carries the argument

Adaptive data-driven selection of simulation sample size to guarantee nominal average-case coverage
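A minimal sketch of what such a selection rule could look like is given here. It is not the paper's algorithm: the function names, the grid search, the stopping criterion (largest grid value whose empirical miscoverage stays at or below α for it and every smaller grid value), and the use of a normal-approximation interval are all assumptions made for illustration. It assumes a calibration set of questions for which both LLM answers and a human estimate are available.

```python
import math


def normal_interval(answers, z=1.96):
    """Normal-approximation interval for a proportion from binary answers."""
    k = len(answers)
    mean = sum(answers) / k
    half = z * math.sqrt(max(mean * (1.0 - mean), 1e-12) / k)
    return mean - half, mean + half


def empirical_miscoverage(k, llm_answers, human_means):
    """Fraction of calibration questions whose interval, built from the
    first k LLM answers, misses the human estimate for that question."""
    misses = 0
    for answers, target in zip(llm_answers, human_means):
        lo, hi = normal_interval(answers[:k])
        misses += not (lo <= target <= hi)
    return misses / len(human_means)


def select_sample_size(llm_answers, human_means, alpha, grid):
    """Hypothetical rule: largest k on the grid such that miscoverage stays
    <= alpha for that k and for every smaller grid value."""
    chosen = 0  # 0 signals that even the smallest grid value failed
    for k in sorted(grid):
        if empirical_miscoverage(k, llm_answers, human_means) > alpha:
            break
        chosen = k
    return chosen


if __name__ == "__main__":
    # Toy calibration set: each answer stream has mean exactly 0.5 for even k.
    # Half the questions have human mean 0.5 (aligned), half 0.6 (misaligned).
    llm = [[1, 0] * 100 for _ in range(10)]
    human = [0.5] * 5 + [0.6] * 5
    print(select_sample_size(llm, human, alpha=0.1, grid=[10, 50, 90, 100, 200]))  # prints 90
```

In the toy data the interval half-width is 0.98/√k, so the misaligned questions stop being covered once k exceeds about 96, and the sketch stops at the last grid value below that point.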

If this is right

  • Excessive simulation counts produce overly narrow sets whose coverage falls below nominal due to unmodeled misalignment.
  • The selection rule requires no assumptions on the form of the misalignment distribution.
  • Experiments on real surveys show the equivalent human size varies across LLMs and question domains.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

  • The equivalent-size metric supplies a concrete benchmark for deciding when LLM data can substitute for additional human respondents in a given survey.
  • The same adaptive rule could be tested on other synthetic data sources such as agent-based models or synthetic populations.

Load-bearing premise

A data-driven rule can guarantee nominal average-case coverage for any LLM fidelity and any confidence-set procedure solely by adjusting the simulation sample size.
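
One way to write the premise down, if "average-case" is read as an average over survey questions. The notation is an assumption of this note and may differ from the paper's: questions are indexed by j = 1, ..., m, θ_j is the human population parameter for question j, I_j(k) is the confidence set built from k simulated responses, and k̂ is the selected size.

```latex
\frac{1}{m} \sum_{j=1}^{m} \mathbb{P}\bigl( \theta_j \in \mathcal{I}_j(\hat{k}) \bigr) \;\ge\; 1 - \alpha
```

Under this reading, some questions may be covered less often than 1 − α as long as others compensate, which is a weaker requirement than coverage for every question separately.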

What would settle it

Apply the selection rule to held-out human survey responses and verify whether the resulting confidence sets attain the target coverage probability on average.
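
A hedged sketch of such a check follows; the figure captions on this page report results averaged over random train-test splits, and this mirrors that idea for a single split. The helper names, the grid-based selection rule, and the normal-approximation interval are illustrative assumptions, not the paper's procedure.

```python
import math
import random


def wald_interval(answers, z=1.96):
    """Normal-approximation interval for a proportion from binary answers."""
    k = len(answers)
    mean = sum(answers) / k
    half = z * math.sqrt(max(mean * (1.0 - mean), 1e-12) / k)
    return mean - half, mean + half


def miscoverage(k, questions):
    """questions: list of (llm_answers, human_mean) pairs."""
    missed = 0
    for answers, target in questions:
        lo, hi = wald_interval(answers[:k])
        missed += not (lo <= target <= hi)
    return missed / len(questions)


def held_out_check(questions, alpha, grid, train_frac=0.5, seed=0):
    """Select k on a random training part of the questions, then report the
    miscoverage of that k on the held-out part.

    Returns (selected k, held-out miscoverage), or (0, None) if no grid
    value met the target on the training part.
    """
    rng = random.Random(seed)
    shuffled = questions[:]
    rng.shuffle(shuffled)
    cut = int(len(shuffled) * train_frac)
    train, test = shuffled[:cut], shuffled[cut:]
    chosen = 0
    for k in sorted(grid):
        if miscoverage(k, train) > alpha:
            break
        chosen = k
    if chosen == 0:
        return 0, None
    return chosen, miscoverage(chosen, test)
```

Repeating the call with different `seed` values and averaging the held-out miscoverage gives an estimate that can be compared against the target level α.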

Figures

Figures reproduced from arXiv: 2502.17773 by Chengpiao Huang, Kaizheng Wang, Yuhang Wu.

Figure 1: An Interpretation of an LLM as Being Made Up of…
Figure 2: The Coverage-Width Trade-off for the Simulation Sample Size
Figure 3: The LLM as a Mechanical Turk. The top panel illustrates the traditional survey process, where…
Figure 4: Parametric Bootstrap.
Figure 5: Our Framework. In Appendix C.2, we further establish a more quantitative connection, where our parameters can be directly interpreted in bootstrap terms: the hidden population size κ represents the number of…
Figure 6: Miscoverage Probability Proxy $G_{\mathrm{te}}(\hat{k})$ for Different LLMs and Target Miscoverage Levels α. Horizontal axis: target miscoverage level α. Vertical axis: LLM. Circles, squares, triangles and diamonds represent $G_{\mathrm{te}}(\hat{k})$ for α = 0.05, 0.1, 0.15, 0.2, respectively.
Figure 7: Histogram of the Relative Error
Figure 8: Half-widths of Confidence Intervals $I^{\mathrm{syn,Bern}}(\hat{k})$ for Different LLMs and Target Levels α. Horizontal axis: target miscoverage level α. Vertical axis: half-width of $I^{\mathrm{syn,Bern}}(\hat{k})$. The results are averaged over 100 random train-test splits. The half-width of each error bar is 1.96 times the standard error. Legend (LLMs): Claude 3.5 Haiku, Deepseek V3, GPT-3.5 Turbo, GPT-4o mini, GPT-4o, GPT-5 mini, Llama 3.3 70B, Mistral 7B, Ran…
Figure 9: Estimated Hidden Population Sizes $\hat{\kappa} = \hat{k}/C$ of Different LLMs. The results are averaged over 100 random train-test splits. The half-width of each error bar is 1.96 times the standard error. As is discussed in Section 4, the hidden population size κ reflects the size of the human population that the LLM can represent. A larger κ indicates that the LLM captures more information about the human population.
Figure 10: Miscoverage Probability Proxy $L_{\mathrm{te}}(\hat{k})$ for the General Method. Horizontal axis: target miscoverage level α. Vertical axis: LLM. Circles, squares, triangles and diamonds represent $L_{\mathrm{te}}(\hat{k})$ for α = 0.05, 0.1, 0.15, 0.2, respectively. The results are averaged over 100 random train-test splits of the questions. The half-width of each error bar is 1.96 times the standard error.
Figure 11: Half-widths of Confidence Intervals $I^{\mathrm{syn,Bern}}(\hat{k})$ for the General Method. Horizontal axis: target miscoverage level α. Vertical axis: half-width of $I^{\mathrm{syn,Bern}}(\hat{k})$. The results are averaged over 100 random train-test splits. The half-width of each error bar is 1.96 times the standard error.
Figure 12: Estimated Hidden Population Sizes $\hat{\kappa} = \hat{k}/C$ of Different LLMs from the General Method. The results are averaged over 100 random train-test splits. The half-width of each error bar is 1.96 times the standard error.
Original abstract

Large language models (LLMs) are increasingly used to simulate survey responses, but synthetic data can be misaligned with the human population, leading to unreliable inference. We develop a general framework that converts LLM-simulated responses into reliable confidence sets for population parameters of human responses, quantifying the uncertainty induced by the human-LLM misalignment. The key design choice is the number of simulated responses: too many produce overly narrow sets with poor coverage, while too few yield overly wide and uninformative sets dominated by stochastic noise. We propose a data-driven approach that adaptively selects the simulation sample size to achieve nominal average-case coverage, regardless of the LLM's simulation fidelity or the confidence set construction procedure. The selected sample size is further shown to reflect the effective human population size that the LLM can represent, providing a quantitative measure of its simulation fidelity. Experiments on real survey datasets reveal heterogeneous simulation fidelity across different LLMs and domains.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Referee Report

2 major / 0 minor

Summary. The paper develops a general framework for constructing confidence sets for human population parameters from LLM-simulated survey responses, explicitly accounting for human-LLM misalignment. It proposes a data-driven adaptive procedure that selects the LLM simulation sample size n to achieve nominal average-case coverage for any LLM fidelity level and any confidence-set construction method. The selected n is interpreted as the effective human population size represented by the LLM, providing a quantitative fidelity measure. Experiments on real survey data illustrate heterogeneous fidelity across LLMs and domains.

Significance. If the adaptive selection rule and its coverage guarantee hold under the stated conditions, the work supplies a practical, procedure-agnostic tool for quantifying when and how much LLM simulations can substitute for human respondents in survey inference, along with an interpretable effective-sample-size metric. This would be a concrete contribution to uncertainty quantification for synthetic data in statistics and social science applications.

major comments (2)
  1. [Abstract] Abstract: The central claim that a data-driven rule exists which selects simulation size n to achieve nominal average-case coverage 'regardless of the LLM's simulation fidelity or the confidence set construction procedure' and 'without additional assumptions on the form of the human-LLM misalignment distribution' is load-bearing for the entire contribution. Standard coverage arguments depend on the structure of misalignment (e.g., mean-zero bias, independence, or correctability by larger n); if misalignment introduces irreducible bias or the procedure is non-robust, adjusting n alone cannot restore coverage. The manuscript must supply the explicit conditions, derivation, or proof sketch under which the universal guarantee holds.
  2. [Abstract] Abstract (and any later theoretical section): No derivation, proof sketch, or validation details are supplied for the existence or properties of the adaptive procedure. Without these, the support for the coverage guarantee and the effective-size interpretation cannot be assessed from the available text.

Simulated Author's Rebuttal

2 responses · 0 unresolved

We thank the referee for their thorough review and for identifying the need for stronger theoretical support in the manuscript. We agree that the coverage claims require explicit justification and will revise the paper to include the requested derivations, conditions, and proof sketches.

Point-by-point responses
  1. Referee: [Abstract] Abstract: The central claim that a data-driven rule exists which selects simulation size n to achieve nominal average-case coverage 'regardless of the LLM's simulation fidelity or the confidence set construction procedure' and 'without additional assumptions on the form of the human-LLM misalignment distribution' is load-bearing for the entire contribution. Standard coverage arguments depend on the structure of misalignment (e.g., mean-zero bias, independence, or correctability by larger n); if misalignment introduces irreducible bias or the procedure is non-robust, adjusting n alone cannot restore coverage. The manuscript must supply the explicit conditions, derivation, or proof sketch under which the universal guarantee holds.

    Authors: We agree that the abstract claim is central and that the current text does not supply an explicit derivation or proof sketch. The adaptive procedure is presented algorithmically, but the formal argument for average-case coverage under arbitrary misalignment is not detailed. In revision we will add a dedicated theoretical subsection deriving the guarantee: the data-driven rule uses a calibration sample to select the smallest n such that the empirical coverage (averaged over the misalignment distribution) meets the nominal level by construction. This holds without parametric assumptions on the misalignment because the selection directly targets the coverage functional rather than relying on bias correction or asymptotic arguments. We will also state the precise conditions (e.g., existence of a finite effective sample size and measurability of the coverage map) under which the result applies.

    revision: yes

  2. Referee: [Abstract] Abstract (and any later theoretical section): No derivation, proof sketch, or validation details are supplied for the existence or properties of the adaptive procedure. Without these, the support for the coverage guarantee and the effective-size interpretation cannot be assessed from the available text.

    Authors: We acknowledge the omission. The manuscript motivates and describes the adaptive selection but does not provide a formal proof sketch or validation details for its properties. In the revised version we will expand the theoretical section with (i) a proof outline showing that the selected n converges to the minimal value achieving nominal average coverage, (ii) the link between this n and the effective human sample size via the inverse of the coverage function, and (iii) additional simulation validation confirming the procedure's behavior across fidelity levels. These additions will directly address the assessability of the claims.

    revision: yes

Circularity Check

0 steps flagged

No significant circularity; derivation relies on proposed adaptive rule without reduction to inputs by construction

Full rationale

The paper proposes a data-driven adaptive selection of simulation sample size n to achieve nominal average-case coverage for any LLM fidelity and any confidence-set procedure, then interprets the chosen n as an effective human population size. No equations, self-citations, or steps in the abstract or described framework reduce this selection rule or the effective-size interpretation to a fitted parameter, a self-referential definition, or a prior result by the same authors. The central claim is presented as a new methodological contribution whose coverage guarantee is asserted to hold without further assumptions on misalignment; this is not shown to be tautological or forced by the paper's own inputs. The derivation chain is therefore self-contained against external benchmarks.

Axiom & Free-Parameter Ledger

0 free parameters · 1 axiom · 0 invented entities

The framework rests on the existence of an adaptive selection rule that delivers coverage for arbitrary misalignment, which is recorded below as the single axiom; no free parameters or invented entities are identifiable from the abstract alone.

axioms (1)
  • domain assumption: Existence of a data-driven rule that selects simulation size to achieve nominal average-case coverage for any LLM fidelity and any confidence-set procedure
    This is the load-bearing premise stated in the abstract but not derived or justified within the provided text.

pith-pipeline@v0.9.0 · 5696 in / 1252 out tokens · 27225 ms · 2026-05-23T03:11:50.477602+00:00 · methodology

Forward citations

Cited by 3 Pith papers

Reviewed papers in the Pith corpus that reference this work. Sorted by Pith novelty score.

  1. Adaptive Querying with AI Persona Priors

    stat.ML · 2026-05 · unverdicted · novelty 7.0

    A persona-induced latent variable model with LLM-generated priors enables scalable adaptive item selection with closed-form Bayesian updates for accurate user-specific predictions.

  2. Rectification Difficulty and Optimal Sample Allocation in LLM-Augmented Surveys

    cs.AI · 2026-04 · unverdicted · novelty 7.0

    A method using predicted rectification difficulty for optimal human sample allocation in LLM-augmented surveys captures 61-79% of theoretical efficiency gains and reduces MSE by 11% on two datasets without pilot data.

  3. Evaluating Reliability Gaps in Large Language Model Safety via Repeated Prompt Sampling

    cs.AI · 2026-03 · conditional · novelty 6.0

    Repeated sampling of the same safety prompts reveals substantial differences in LLM failure probabilities across temperatures that conventional single-evaluation benchmarks miss.
