pith. machine review for the scientific record.

arxiv: 2604.12497 · v1 · submitted 2026-04-14 · 💻 cs.LG · stat.ML

Recognition: unknown

Adaptive Budget Allocation in LLM-Augmented Surveys

Authors on Pith: no claims yet

Pith reviewed 2026-05-10 15:08 UTC · model grok-4.3

classification 💻 cs.LG stat.ML
keywords adaptive allocation · LLM surveys · budget optimization · human verification · reliability estimation · online learning · survey methodology

The pith

An adaptive algorithm allocates limited human verification budget to LLM survey questions by learning per-question reliability in real time.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper develops an adaptive method to decide how many human responses to collect for each survey question when LLMs generate initial answers. Because LLM accuracy varies by question and is unknown in advance, the algorithm uses each human label to correct the answer for that question and to update an estimate of how reliable the LLM is on it. This allows the system to shift future budget toward questions where the LLM is less accurate. The authors show that the difference from the best possible fixed allocation disappears as the total human budget increases. Experiments on a real survey with 68 questions show that uniform allocation wastes 10 to 12 percent of the budget while the adaptive method wastes only 2 to 6 percent.

Core claim

We propose an adaptive allocation algorithm that learns which questions are hardest for the LLM while simultaneously collecting human responses. Each human label serves a dual role: it improves the estimate for that question and reveals how well the LLM predicts human responses on it. The algorithm directs more budget to questions where the LLM is least reliable, without requiring any prior knowledge of question-level LLM accuracy. We prove that the allocation gap relative to the best possible allocation vanishes as the budget grows, and validate the approach on both synthetic data and a real survey dataset with 68 questions and over 2000 respondents. On real survey data, the standard practice of allocating human labels uniformly across questions wastes 10--12% of the budget relative to the optimal; our algorithm reduces this waste to 2--6%.

What carries the argument

The online adaptive allocation rule that estimates LLM reliability from accumulating human labels and reallocates the remaining budget proportionally to current error estimates.
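The rule described here can be sketched as an epsilon-greedy loop. The following is a minimal illustrative sketch, not the paper's algorithm: the disagreement-rate estimator, the optimistic default for unseen questions, and the exploration parameter `epsilon` are all assumptions made for the example.

```python
import random

def adaptive_allocation(questions, budget, llm_answer, human_label, epsilon=0.1):
    """Dual-use allocation sketch: each human label corrects one question
    and simultaneously refines that question's LLM error estimate."""
    n = {q: 0 for q in questions}              # labels spent per question
    disagreements = {q: 0 for q in questions}  # observed LLM errors per question

    def error_estimate(q):
        # Optimistic default so every question is tried at least once.
        return 1.0 if n[q] == 0 else disagreements[q] / n[q]

    for _ in range(budget):
        if random.random() < epsilon:
            q = random.choice(questions)            # keep exploring occasionally
        else:
            q = max(questions, key=error_estimate)  # target the least reliable
        label = human_label(q)                      # spend one unit of budget
        n[q] += 1
        if label != llm_answer(q):                  # the label's second role
            disagreements[q] += 1

    return {q: (disagreements[q] / n[q] if n[q] else None) for q in questions}
```

A question the LLM handles well quickly drops out of the exploitation step, so the remaining budget drifts toward the questions with the highest current error estimates, which is the reallocation behavior the pith describes.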

Load-bearing premise

Human labels collected in real time can be used both to correct responses and to learn question-specific LLM reliability without any prior information or separate calibration phase.

What would settle it

Observing on independent survey data whether the waste remains in the 2-6% range and continues to approach zero with larger budgets would confirm or refute the performance claims.

Original abstract

Large language models (LLMs) can generate survey responses at low cost, but their reliability varies substantially across questions and is unknown before data collection. Deploying LLMs in surveys still requires costly human responses for verification and correction. How should a limited human-labeling budget be allocated across questions in real time? We propose an adaptive allocation algorithm that learns which questions are hardest for the LLM while simultaneously collecting human responses. Each human label serves a dual role: it improves the estimate for that question and reveals how well the LLM predicts human responses on it. The algorithm directs more budget to questions where the LLM is least reliable, without requiring any prior knowledge of question-level LLM accuracy. We prove that the allocation gap relative to the best possible allocation vanishes as the budget grows, and validate the approach on both synthetic data and a real survey dataset with 68 questions and over 2000 respondents. On real survey data, the standard practice of allocating human labels uniformly across questions wastes 10--12% of the budget relative to the optimal; our algorithm reduces this waste to 2--6%, and the advantage grows as questions become more heterogeneous in LLM prediction quality. The algorithm achieves the same estimation quality as traditional uniform sampling with fewer human samples, requires no pilot study, and is backed by formal performance guarantees validated on real survey data. More broadly, the framework applies whenever scarce human oversight must be allocated across tasks where LLM reliability is unknown.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Referee Report

2 major / 3 minor

Summary. The paper proposes an adaptive algorithm for allocating a fixed human-labeling budget across survey questions when LLMs generate initial responses whose per-question reliability is unknown a priori. Each human label is used simultaneously to correct the LLM output and to update an online estimate of that question's LLM accuracy; the algorithm then directs future labels toward questions where the LLM is least reliable. The central claims are (i) a proof that the allocation gap relative to the optimal (oracle) allocation vanishes asymptotically as the total budget B grows, and (ii) empirical results on a 68-question real survey showing that the method reduces budget waste from 10-12% (uniform allocation) to 2-6% while achieving equivalent estimation quality with fewer human samples and without a pilot study.

Significance. If the asymptotic guarantee and the reported waste reductions hold under the stated assumptions, the work offers a practical, theoretically grounded method for efficient hybrid LLM-human data collection. The dual-use of each label, the absence of any pre-calibration phase, and the validation on real heterogeneous survey data are concrete strengths that distinguish the contribution from purely heuristic or offline allocation schemes.

major comments (2)
  1. [§4.2, Theorem 1] Regret bound: the proof sketch relies on a uniform convergence rate for the per-question reliability estimators, but the adaptive allocation rule itself changes the sampling distribution over time; it is not immediate that the resulting martingale difference sequence still satisfies the concentration inequality used to bound the gap (the dependence between the allocation decision and the next observation needs explicit handling).
  2. [§5.3, Table 3] Real-data waste numbers: the 2-6% waste figure is presented as a point estimate without reported standard errors, bootstrap intervals, or a statistical test against the uniform baseline across the 68 questions; because the advantage is claimed to grow with heterogeneity, the absence of variability measures makes it impossible to judge whether the reported improvement is robust or sensitive to the particular respondent pool.
minor comments (3)
  1. [§2.1] The definition of the per-question reliability parameter r_q is introduced only after the algorithm description; moving the formal definition and its relation to the loss function earlier would improve readability.
  2. [Figure 4] The synthetic-data plots show cumulative waste versus B, but the x-axis is not labeled with the total budget scale used in the theorem statements, making direct comparison to the O(1/sqrt(B)) rate difficult.
  3. [§6] The discussion of limitations mentions only computational cost; a brief note on the sensitivity of the method to the choice of the exploration parameter epsilon would be useful for practitioners.

Simulated Author's Rebuttal

2 responses · 0 unresolved

Thank you for the opportunity to respond to the referee's report. We appreciate the positive assessment of our work and the specific suggestions for improvement. Below we provide point-by-point responses to the major comments, indicating the revisions we plan to incorporate.

Point-by-point responses
  1. Referee: [§4.2, Theorem 1] Regret bound: the proof sketch relies on a uniform convergence rate for the per-question reliability estimators, but the adaptive allocation rule itself changes the sampling distribution over time; it is not immediate that the resulting martingale difference sequence still satisfies the concentration inequality used to bound the gap (the dependence between the allocation decision and the next observation needs explicit handling).

    Authors: We thank the referee for highlighting this subtlety in the proof. The proof of Theorem 1 in §4.2 indeed uses a martingale concentration bound (specifically, a version of Freedman's inequality or similar) that is designed to handle adaptive sampling where the allocation at step t depends on previous observations. The key is that the difference sequence is a martingale with respect to the filtration generated by the history up to t-1, and the bounded differences property holds conditionally. However, we acknowledge that the current sketch is brief and does not spell this out explicitly. In the revised version, we will expand the proof to include a detailed explanation of how the adaptive nature is accounted for in the martingale framework, ensuring the concentration inequality applies directly to bound the allocation gap. This will make the argument fully rigorous without altering the result. revision: yes
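For completeness, the form of Freedman's inequality typically used in such adaptive-sampling arguments is the following (a standard statement; that the paper uses exactly this version is an assumption): for a martingale difference sequence $X_1,\dots,X_T$ adapted to a filtration $(\mathcal{F}_t)$ with $|X_t| \le b$,

```latex
\Pr\left( \sum_{t=1}^{T} X_t \ge a
          \;\text{ and }\;
          \sum_{t=1}^{T} \mathbb{E}\!\left[ X_t^2 \mid \mathcal{F}_{t-1} \right] \le v \right)
\le \exp\!\left( - \frac{a^2}{2\,(v + b a / 3)} \right).
```

Because the variance term is conditioned on the history $\mathcal{F}_{t-1}$, the bound tolerates allocation decisions that depend on past observations, which is the property the rebuttal invokes.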

  2. Referee: [§5.3, Table 3] Real-data waste numbers: the 2-6% waste figure is presented as a point estimate without reported standard errors, bootstrap intervals, or a statistical test against the uniform baseline across the 68 questions; because the advantage is claimed to grow with heterogeneity, the absence of variability measures makes it impossible to judge whether the reported improvement is robust or sensitive to the particular respondent pool.

    Authors: We agree that including measures of statistical variability would improve the presentation of the empirical results. The waste percentages in Table 3 are computed from the real survey data with 68 questions and over 2000 respondents, but we did not report variability across possible respondent subsamples or bootstrap replicates. In the revised manuscript, we will augment Table 3 with bootstrap standard errors or 95% confidence intervals for the waste figures under both our algorithm and the uniform allocation. Additionally, we will include a paired statistical test (e.g., Wilcoxon signed-rank test across questions) to assess the significance of the reduction. This will allow readers to better evaluate the robustness of the 2-6% improvement, particularly as heterogeneity increases. revision: yes
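The variability analysis promised here can be sketched with stdlib Python only (a stand-in for `scipy.stats.wilcoxon` plus a percentile bootstrap; the per-question waste inputs are hypothetical, since Table 3's values are not reproduced on this page):

```python
import math
import random

def paired_waste_comparison(uniform_waste, adaptive_waste, n_boot=10_000, seed=0):
    """Bootstrap CI for the mean per-question waste reduction, plus a
    Wilcoxon signed-rank test via the normal approximation (no tie
    correction, so it assumes the differences are effectively continuous)."""
    diffs = [u - a for u, a in zip(uniform_waste, adaptive_waste)]
    n = len(diffs)

    # Percentile bootstrap over questions for the mean reduction.
    rng = random.Random(seed)
    boot = sorted(sum(rng.choices(diffs, k=n)) / n for _ in range(n_boot))
    ci95 = (boot[int(0.025 * n_boot)], boot[int(0.975 * n_boot)])

    # Wilcoxon signed-rank: rank |d|, sum the ranks of positive differences.
    ranked = sorted((abs(d), d > 0) for d in diffs if d != 0)
    m = len(ranked)
    w_plus = sum(rank for rank, (_, positive) in enumerate(ranked, start=1) if positive)
    mu = m * (m + 1) / 4
    sigma = math.sqrt(m * (m + 1) * (2 * m + 1) / 24)
    p_two_sided = math.erfc(abs((w_plus - mu) / sigma) / math.sqrt(2))

    return {"mean_reduction": sum(diffs) / n, "ci95": ci95, "p": p_two_sided}
```

If the adaptive method's advantage is real, the 95% interval for the mean reduction should exclude zero and the signed-rank p-value should be small across respondent resamples.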

Circularity Check

0 steps flagged

No significant circularity

Full rationale

The paper presents an adaptive online allocation algorithm whose core guarantee is that the gap to the optimal allocation vanishes asymptotically as the human-label budget grows. This is framed as a formal property of the dual-use rule (each label corrects the response and updates the per-question LLM reliability estimate) without any prior knowledge. No equations, fitted parameters, or self-citations are exhibited in the provided text that would reduce the claimed prediction or proof to the inputs by construction. The empirical waste reduction (2-6% vs 10-12%) is reported as separate validation on real survey data. The derivation therefore remains self-contained against external benchmarks and does not trigger any of the enumerated circularity patterns.

Axiom & Free-Parameter Ledger

0 free parameters · 1 axioms · 0 invented entities

The central claim rests on the domain assumption that human labels simultaneously correct the data and reveal LLM reliability per question; no free parameters, invented entities, or additional axioms are stated in the abstract.

axioms (1)
  • Domain assumption: Human labels collected during the survey can be used both to obtain ground-truth answers and to estimate question-specific LLM prediction quality in real time.
    This dual-use property is required for the algorithm to learn reliability without a pilot study.

pith-pipeline@v0.9.0 · 5556 in / 1272 out tokens · 56431 ms · 2026-05-10T15:08:24.173549+00:00 · methodology

discussion (0)


Reference graph

Works this paper leans on

40 extracted references · 9 canonical work pages · 2 internal anchors

  1. Agrawal S, Devanur N (2016) Linear contextual bandits with knapsacks. Advances in Neural Information Processing Systems 29
  2. Angelopoulos AN, Bates S, Fannjiang C, Jordan MI, Zrnic T (2023a) Prediction-powered inference. Science 382(6671):669--674
  3. Angelopoulos AN, Duchi JC, Zrnic T (2023b) PPI++: Efficient prediction-powered inference. arXiv preprint arXiv:2311.01453
  4. Argyle LP, Busby EC, Fulda N, Gubler JR, Rytting C, Wingate D (2023) Out of one, many: Using language models to simulate human samples. Political Analysis 31(3):337--351
  5. Auer P, Cesa-Bianchi N, Fischer P (2002) Finite-time analysis of the multiarmed bandit problem. Machine Learning 47(2-3):235--256
  6. Badanidiyuru A, Kleinberg R, Slivkins A (2018) Bandits with knapsacks. Journal of the ACM 65(3):1--55
  7. Bhat S, Lyons JB, Shi C, Yang XJ (2025) Effects of learning state dependence of reward weights on trust and team performance in a human-robot sequential decision-making task. 2025 IEEE 5th International Conference on Human-Machine Systems (ICHMS), 35--40 (IEEE)
  8. Brand J, Israeli A, Ngwe D (2023) Using LLMs for market research. HBS Working Paper (23-062)
  9. Broska D, Howes M, van Loon A (2025) The mixed subjects design: Treating large language models as potentially informative observations. Sociological Methods & Research 54(3):1074--1109
  10. Brucks M, Toubia O (2025) Prompt architecture induces methodological artifacts in large language models. PLOS ONE 20(4):e0319159
  11. Carpentier A, Munos R, Antos A (2015) Adaptive strategy for stratified Monte Carlo sampling. Journal of Machine Learning Research 16(69):2231--2271
  12. Cohen MC, Miao S, Wang Y (2025) Dynamic pricing with fairness constraints. Operations Research 73(6):3027--3043
  13. Dai T, Swaminathan JM (2025) Artificial intelligence and operations: A foundational framework of emerging research and practice. Production and Operations Management
  14. DiSorbo MD, Ferreira KJ, Balakrishnan M, Tong J (2025) Warnings and endorsements: Improving human-AI collaboration in the presence of outliers. Manufacturing & Service Operations Management 27(6):1814--1831
  15. Dominguez-Olmedo R, Hardt M, Mendler-Dünner C (2024) Questioning the survey responses of large language models. Advances in Neural Information Processing Systems 37:45850--45878
  16. Fügener A, Walzner DD, Gupta A (2026) Roles of artificial intelligence in collaboration with humans: Automation, augmentation, and the future of work. Management Science 72(1):538--557
  17. Ge H, Bastani H, Bastani O (2023) Rethinking algorithmic fairness for human-AI collaboration. arXiv preprint arXiv:2310.03647
  18. Huang C, Wu Y, Wang K (2025) How many human survey respondents is a large language model worth? An uncertainty quantification perspective. International Conference on Machine Learning (ICML)
  19. Ji W, Lei L, Zrnic T (2025) Predictions as surrogates: Revisiting surrogate outcomes in the age of AI. arXiv preprint arXiv:2501.09731
  20. Krsteski S, Russo G, Chang S, West R, Gligorić K (2025) Valid survey simulations with limited human data. arXiv preprint arXiv:2510.11408
  21. Lattimore T, Szepesvári C (2020) Bandit Algorithms (Cambridge University Press)
  22. Li G, Liang A, Liu M, Lei M, Jasin S, Yang F, Baxi P (2026) Asymptotically optimal sequential testing with heterogeneous LLMs. arXiv preprint arXiv:2604.01086
  23. Li P, Castelo N, Katona Z, Sarvary M (2024) Frontiers: Determining the validity of large language models for automated perceptual analysis. Marketing Science 43(2):254--266
  24. Maurer A, Pontil M (2009) Empirical Bernstein bounds and sample variance penalization. Proceedings of the 22nd Annual Conference on Learning Theory (COLT), 115--124
  25. Motoki F, Pinho Neto V, Rodrigues V (2024) More human than human: Measuring ChatGPT political bias. Public Choice 198:3--23
  26. Mozer R (2026) PPI is the difference estimator: Recognizing the survey sampling roots of prediction-powered inference. arXiv preprint arXiv:2603.19160
  27. Neyman J (1934) On the two different aspects of the representative method. Journal of the Royal Statistical Society 97(4):558--625
  28. Peng T, Gui G, Merlau DJ, Fan GJ, Sliman MB, Brucks M, Johnson EJ, Morwitz V, et al. (2025) A mega-study of digital twins reveals strengths, weaknesses and opportunities for further improvement. arXiv preprint arXiv:2509.19088
  29. Raghunathan TE, Grizzle JE (1995) A split questionnaire survey design. Journal of the American Statistical Association 90(429):54--63
  30. Simchi-Levi D, Wang C (2025) Multi-armed bandit experimental design: Online decision-making and adaptive inference. Management Science 71(6):4828--4846
  31. Toubia O, Gui GZ, Peng T, Merlau DJ, Li A, Chen H (2025) Database report: Twin-2K-500: A data set for building digital twins of over 2,000 people based on their answers to over 500 questions. Marketing Science 44(6):1446--1455
  32. Toubia O, Simester DI, Hauser JR, Dahan E (2003) Fast polyhedral adaptive conjoint estimation. Marketing Science 22(3):273--303
  33. Vafa K, Athey S, Blei DM (2025) Estimating wage disparities using foundation models. Proceedings of the National Academy of Sciences 122(22):e2427298122
  34. Wang L, Ye Z, Zhao J (2025) Efficient inference using large language models with limited human data: Fine-tuning then rectification. arXiv preprint arXiv:2511.19486
  35. Wang M, Zhang DJ, Zhang H (2024) Large language models for market research: A data-augmentation approach. arXiv preprint arXiv:2412.19363
  36. Ye Z, Yoganarasimhan H, Zheng Y (2025) LOLA: LLM-assisted online learning algorithm for content experiments. Marketing Science 44(5):995--1016
  37. Yin QE, Xin L (2025) Synthetic but not infinite: How much LLM-generated data to use in market research. Available at SSRN 6078686
  38. Ziems C, Held W, Shaikh O, Chen J, Zhang Z, Yang D (2024) Can large language models transform computational social science? Computational Linguistics 50(1):237--291