pith. machine review for the scientific record.

arxiv: 2604.12497 · v1 · submitted 2026-04-14 · 💻 cs.LG · stat.ML

Recognition: unknown

Adaptive Budget Allocation in LLM-Augmented Surveys

Authors on Pith: no claims yet

Pith reviewed 2026-05-10 15:08 UTC · model grok-4.3

classification 💻 cs.LG stat.ML
keywords adaptive allocation · LLM surveys · budget optimization · human verification · reliability estimation · online learning · survey methodology

The pith

An adaptive algorithm allocates limited human verification budget to LLM survey questions by learning per-question reliability in real time.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper develops an adaptive method to decide how many human responses to collect for each survey question when LLMs generate initial answers. Because LLM accuracy varies by question and is unknown in advance, the algorithm uses each human label to correct the answer for that question and to update an estimate of how reliable the LLM is on it. This allows the system to shift future budget toward questions where the LLM is less accurate. The authors show that the difference from the best possible fixed allocation disappears as the total human budget increases. Experiments on a real survey with 68 questions show that uniform allocation wastes 10 to 12 percent of the budget while the adaptive method wastes only 2 to 6 percent.

Core claim

We propose an adaptive allocation algorithm that learns which questions are hardest for the LLM while simultaneously collecting human responses. Each human label serves a dual role: it improves the estimate for that question and reveals how well the LLM predicts human responses on it. The algorithm directs more budget to questions where the LLM is least reliable, without requiring any prior knowledge of question-level LLM accuracy. We prove that the allocation gap relative to the best possible allocation vanishes as the budget grows, and validate the approach on both synthetic data and a real survey dataset with 68 questions and over 2000 respondents. On real survey data, the standard practice of allocating human labels uniformly across questions wastes 10--12% of the budget relative to the optimal; our algorithm reduces this waste to 2--6%.

What carries the argument

The online adaptive allocation rule that estimates LLM reliability from accumulating human labels and reallocates the remaining budget proportionally to current error estimates.
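The rule described here can be sketched as an epsilon-greedy loop. The following is a minimal illustrative sketch, not the paper's algorithm: the disagreement-rate estimator, the optimistic default for unseen questions, and the exploration parameter `epsilon` are all assumptions made for the example.

```python
import random

def adaptive_allocation(questions, budget, llm_answer, human_label, epsilon=0.1):
    """Dual-use allocation sketch: each human label corrects one question
    and simultaneously refines that question's LLM error estimate."""
    n = {q: 0 for q in questions}              # labels spent per question
    disagreements = {q: 0 for q in questions}  # observed LLM errors per question

    def error_estimate(q):
        # Optimistic default so every question is tried at least once.
        return 1.0 if n[q] == 0 else disagreements[q] / n[q]

    for _ in range(budget):
        if random.random() < epsilon:
            q = random.choice(questions)            # keep exploring occasionally
        else:
            q = max(questions, key=error_estimate)  # target the least reliable
        label = human_label(q)                      # spend one unit of budget
        n[q] += 1
        if label != llm_answer(q):                  # the label's second role
            disagreements[q] += 1

    return {q: (disagreements[q] / n[q] if n[q] else None) for q in questions}
```

A question the LLM handles well quickly drops out of the exploitation step, so the remaining budget drifts toward the questions with the highest current error estimates, which is the reallocation behavior the pith describes.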

Load-bearing premise

Human labels collected in real time can be used both to correct responses and to learn question-specific LLM reliability without any prior information or separate calibration phase.

What would settle it

Observing on independent survey data whether the waste remains in the 2-6% range and continues to approach zero with larger budgets would confirm or refute the performance claims.

Original abstract

Large language models (LLMs) can generate survey responses at low cost, but their reliability varies substantially across questions and is unknown before data collection. Deploying LLMs in surveys still requires costly human responses for verification and correction. How should a limited human-labeling budget be allocated across questions in real time? We propose an adaptive allocation algorithm that learns which questions are hardest for the LLM while simultaneously collecting human responses. Each human label serves a dual role: it improves the estimate for that question and reveals how well the LLM predicts human responses on it. The algorithm directs more budget to questions where the LLM is least reliable, without requiring any prior knowledge of question-level LLM accuracy. We prove that the allocation gap relative to the best possible allocation vanishes as the budget grows, and validate the approach on both synthetic data and a real survey dataset with 68 questions and over 2000 respondents. On real survey data, the standard practice of allocating human labels uniformly across questions wastes 10--12% of the budget relative to the optimal; our algorithm reduces this waste to 2--6%, and the advantage grows as questions become more heterogeneous in LLM prediction quality. The algorithm achieves the same estimation quality as traditional uniform sampling with fewer human samples, requires no pilot study, and is backed by formal performance guarantees validated on real survey data. More broadly, the framework applies whenever scarce human oversight must be allocated across tasks where LLM reliability is unknown.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Referee Report

2 major / 3 minor

Summary. The paper proposes an adaptive algorithm for allocating a fixed human-labeling budget across survey questions when LLMs generate initial responses whose per-question reliability is unknown a priori. Each human label is used simultaneously to correct the LLM output and to update an online estimate of that question's LLM accuracy; the algorithm then directs future labels toward questions where the LLM is least reliable. The central claims are (i) a proof that the allocation gap relative to the optimal (oracle) allocation vanishes asymptotically as the total budget B grows, and (ii) empirical results on a 68-question real survey showing that the method reduces budget waste from 10-12% (uniform allocation) to 2-6% while achieving equivalent estimation quality with fewer human samples and without a pilot study.

Significance. If the asymptotic guarantee and the reported waste reductions hold under the stated assumptions, the work offers a practical, theoretically grounded method for efficient hybrid LLM-human data collection. The dual-use of each label, the absence of any pre-calibration phase, and the validation on real heterogeneous survey data are concrete strengths that distinguish the contribution from purely heuristic or offline allocation schemes.

major comments (2)
  1. [§4.2, Theorem 1] Regret bound: the proof sketch relies on a uniform convergence rate for the per-question reliability estimators, but the adaptive allocation rule itself changes the sampling distribution over time; it is not immediate that the resulting martingale difference sequence still satisfies the concentration inequality used to bound the gap (the dependence between the allocation decision and the next observation needs explicit handling).
  2. [§5.3, Table 3] Real-data waste numbers: the 2-6% waste figure is presented as a point estimate without reported standard errors, bootstrap intervals, or a statistical test against the uniform baseline across the 68 questions; because the advantage is claimed to grow with heterogeneity, the absence of variability measures makes it impossible to judge whether the reported improvement is robust or sensitive to the particular respondent pool.
minor comments (3)
  1. [§2.1] The definition of the per-question reliability parameter r_q is introduced only after the algorithm description; moving the formal definition and its relation to the loss function earlier would improve readability.
  2. [Figure 4] The synthetic-data plots show cumulative waste versus B, but the x-axis is not labeled with the total budget scale used in the theorem statements, making direct comparison to the O(1/sqrt(B)) rate difficult.
  3. [§6] The discussion of limitations mentions only computational cost; a brief note on the sensitivity of the method to the choice of the exploration parameter epsilon would be useful for practitioners.

Simulated Author's Rebuttal

2 responses · 0 unresolved

Thank you for the opportunity to respond to the referee's report. We appreciate the positive assessment of our work and the specific suggestions for improvement. Below we provide point-by-point responses to the major comments, indicating the revisions we plan to incorporate.

Point-by-point responses
  1. Referee: [§4.2, Theorem 1] Regret bound: the proof sketch relies on a uniform convergence rate for the per-question reliability estimators, but the adaptive allocation rule itself changes the sampling distribution over time; it is not immediate that the resulting martingale difference sequence still satisfies the concentration inequality used to bound the gap (the dependence between the allocation decision and the next observation needs explicit handling).

    Authors: We thank the referee for highlighting this subtlety in the proof. The proof of Theorem 1 in §4.2 indeed uses a martingale concentration bound (specifically, a version of Freedman's inequality or similar) that is designed to handle adaptive sampling where the allocation at step t depends on previous observations. The key is that the difference sequence is a martingale with respect to the filtration generated by the history up to t-1, and the bounded differences property holds conditionally. However, we acknowledge that the current sketch is brief and does not spell this out explicitly. In the revised version, we will expand the proof to include a detailed explanation of how the adaptive nature is accounted for in the martingale framework, ensuring the concentration inequality applies directly to bound the allocation gap. This will make the argument fully rigorous without altering the result. revision: yes
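For completeness, the form of Freedman's inequality typically used in such adaptive-sampling arguments is the following (a standard statement; that the paper uses exactly this version is an assumption): for a martingale difference sequence $X_1,\dots,X_T$ adapted to a filtration $(\mathcal{F}_t)$ with $|X_t| \le b$,

```latex
\Pr\left( \sum_{t=1}^{T} X_t \ge a
          \;\text{ and }\;
          \sum_{t=1}^{T} \mathbb{E}\!\left[ X_t^2 \mid \mathcal{F}_{t-1} \right] \le v \right)
\le \exp\!\left( - \frac{a^2}{2\,(v + b a / 3)} \right).
```

Because the variance term is conditioned on the history $\mathcal{F}_{t-1}$, the bound tolerates allocation decisions that depend on past observations, which is the property the rebuttal invokes.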

  2. Referee: [§5.3, Table 3] Real-data waste numbers: the 2-6% waste figure is presented as a point estimate without reported standard errors, bootstrap intervals, or a statistical test against the uniform baseline across the 68 questions; because the advantage is claimed to grow with heterogeneity, the absence of variability measures makes it impossible to judge whether the reported improvement is robust or sensitive to the particular respondent pool.

    Authors: We agree that including measures of statistical variability would improve the presentation of the empirical results. The waste percentages in Table 3 are computed from the real survey data with 68 questions and over 2000 respondents, but we did not report variability across possible respondent subsamples or bootstrap replicates. In the revised manuscript, we will augment Table 3 with bootstrap standard errors or 95% confidence intervals for the waste figures under both our algorithm and the uniform allocation. Additionally, we will include a paired statistical test (e.g., Wilcoxon signed-rank test across questions) to assess the significance of the reduction. This will allow readers to better evaluate the robustness of the 2-6% improvement, particularly as heterogeneity increases. revision: yes
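The variability analysis promised here can be sketched with stdlib Python only (a stand-in for `scipy.stats.wilcoxon` plus a percentile bootstrap; the per-question waste inputs are hypothetical, since Table 3's values are not reproduced on this page):

```python
import math
import random

def paired_waste_comparison(uniform_waste, adaptive_waste, n_boot=10_000, seed=0):
    """Bootstrap CI for the mean per-question waste reduction, plus a
    Wilcoxon signed-rank test via the normal approximation (no tie
    correction, so it assumes the differences are effectively continuous)."""
    diffs = [u - a for u, a in zip(uniform_waste, adaptive_waste)]
    n = len(diffs)

    # Percentile bootstrap over questions for the mean reduction.
    rng = random.Random(seed)
    boot = sorted(sum(rng.choices(diffs, k=n)) / n for _ in range(n_boot))
    ci95 = (boot[int(0.025 * n_boot)], boot[int(0.975 * n_boot)])

    # Wilcoxon signed-rank: rank |d|, sum the ranks of positive differences.
    ranked = sorted((abs(d), d > 0) for d in diffs if d != 0)
    m = len(ranked)
    w_plus = sum(rank for rank, (_, positive) in enumerate(ranked, start=1) if positive)
    mu = m * (m + 1) / 4
    sigma = math.sqrt(m * (m + 1) * (2 * m + 1) / 24)
    p_two_sided = math.erfc(abs((w_plus - mu) / sigma) / math.sqrt(2))

    return {"mean_reduction": sum(diffs) / n, "ci95": ci95, "p": p_two_sided}
```

If the adaptive method's advantage is real, the 95% interval for the mean reduction should exclude zero and the signed-rank p-value should be small across respondent resamples.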

Circularity Check

0 steps flagged

No significant circularity

Full rationale

The paper presents an adaptive online allocation algorithm whose core guarantee is that the gap to the optimal allocation vanishes asymptotically as the human-label budget grows. This is framed as a formal property of the dual-use rule (each label corrects the response and updates the per-question LLM reliability estimate) without any prior knowledge. No equations, fitted parameters, or self-citations are exhibited in the provided text that would reduce the claimed prediction or proof to the inputs by construction. The empirical waste reduction (2-6% vs 10-12%) is reported as separate validation on real survey data. The derivation therefore remains self-contained against external benchmarks and does not trigger any of the enumerated circularity patterns.

Axiom & Free-Parameter Ledger

0 free parameters · 1 axioms · 0 invented entities

The central claim rests on the domain assumption that human labels simultaneously correct the data and reveal LLM reliability per question; no free parameters, invented entities, or additional axioms are stated in the abstract.

axioms (1)
  • Domain assumption: Human labels collected during the survey can be used both to obtain ground-truth answers and to estimate question-specific LLM prediction quality in real time.
    This dual-use property is required for the algorithm to learn reliability without a pilot study.

pith-pipeline@v0.9.0 · 5556 in / 1272 out tokens · 56431 ms · 2026-05-10T15:08:24.173549+00:00 · methodology

discussion (0)


Reference graph

Works this paper leans on

40 extracted references · 9 canonical work pages · 2 internal anchors

  1. Agrawal S, Devanur N (2016) Linear contextual bandits with knapsacks. Advances in Neural Information Processing Systems 29
  2. Angelopoulos AN, Bates S, Fannjiang C, Jordan MI, Zrnic T (2023a) Prediction-powered inference. Science 382(6671):669--674
  3. Angelopoulos AN, Duchi JC, Zrnic T (2023b) PPI++: Efficient prediction-powered inference. arXiv preprint arXiv:2311.01453
  4. Argyle LP, Busby EC, Fulda N, Gubler JR, Rytting C, Wingate D (2023) Out of one, many: Using language models to simulate human samples. Political Analysis 31(3):337--351
  5. Auer P, Cesa-Bianchi N, Fischer P (2002) Finite-time analysis of the multiarmed bandit problem. Machine Learning 47(2-3):235--256
  6. Badanidiyuru A, Kleinberg R, Slivkins A (2018) Bandits with knapsacks. Journal of the ACM 65(3):1--55
  7. Bhat S, Lyons JB, Shi C, Yang XJ (2025) Effects of learning state dependence of reward weights on trust and team performance in a human-robot sequential decision-making task. 2025 IEEE 5th International Conference on Human-Machine Systems (ICHMS), 35--40 (IEEE)
  8. Brand J, Israeli A, Ngwe D (2023) Using LLMs for market research. HBS Working Paper (23-062)
  9. Broska D, Howes M, van Loon A (2025) The mixed subjects design: Treating large language models as potentially informative observations. Sociological Methods & Research 54(3):1074--1109
  10. Brucks M, Toubia O (2025) Prompt architecture induces methodological artifacts in large language models. PLOS ONE 20(4):e0319159
  11. Carpentier A, Munos R, Antos A (2015) Adaptive strategy for stratified Monte Carlo sampling. Journal of Machine Learning Research 16(69):2231--2271
  12. Cohen MC, Miao S, Wang Y (2025) Dynamic pricing with fairness constraints. Operations Research 73(6):3027--3043
  13. Dai T, Swaminathan JM (2025) Artificial intelligence and operations: A foundational framework of emerging research and practice. Production and Operations Management
  14. DiSorbo MD, Ferreira KJ, Balakrishnan M, Tong J (2025) Warnings and endorsements: Improving human-AI collaboration in the presence of outliers. Manufacturing & Service Operations Management 27(6):1814--1831
  15. Dominguez-Olmedo R, Hardt M, Mendler-Dünner C (2024) Questioning the survey responses of large language models. Advances in Neural Information Processing Systems 37:45850--45878
  16. Fügener A, Walzner DD, Gupta A (2026) Roles of artificial intelligence in collaboration with humans: Automation, augmentation, and the future of work. Management Science 72(1):538--557
  17. Ge H, Bastani H, Bastani O (2023) Rethinking algorithmic fairness for human-AI collaboration. arXiv preprint arXiv:2310.03647
  18. Huang C, Wu Y, Wang K (2025) How many human survey respondents is a large language model worth? An uncertainty quantification perspective. International Conference on Machine Learning (ICML)
  19. Ji W, Lei L, Zrnic T (2025) Predictions as surrogates: Revisiting surrogate outcomes in the age of AI. arXiv preprint arXiv:2501.09731
  20. Krsteski S, Russo G, Chang S, West R, Gligorić K (2025) Valid survey simulations with limited human data. arXiv preprint arXiv:2510.11408
  21. Lattimore T, Szepesvári C (2020) Bandit Algorithms (Cambridge University Press)
  22. Li G, Liang A, Liu M, Lei M, Jasin S, Yang F, Baxi P (2026) Asymptotically optimal sequential testing with heterogeneous LLMs. arXiv preprint arXiv:2604.01086
  23. Li P, Castelo N, Katona Z, Sarvary M (2024) Frontiers: Determining the validity of large language models for automated perceptual analysis. Marketing Science 43(2):254--266
  24. Maurer A, Pontil M (2009) Empirical Bernstein bounds and sample variance penalization. Proceedings of the 22nd Annual Conference on Learning Theory (COLT), 115--124
  25. Motoki F, Pinho Neto V, Rodrigues V (2024) More human than human: Measuring ChatGPT political bias. Public Choice 198:3--23
  26. Mozer R (2026) PPI is the difference estimator: Recognizing the survey sampling roots of prediction-powered inference. arXiv preprint arXiv:2603.19160
  27. Neyman J (1934) On the two different aspects of the representative method. Journal of the Royal Statistical Society 97(4):558--625
  28. Peng T, Gui G, Merlau DJ, Fan GJ, Sliman MB, Brucks M, Johnson EJ, Morwitz V, et al. (2025) A mega-study of digital twins reveals strengths, weaknesses and opportunities for further improvement. arXiv preprint arXiv:2509.19088
  29. Raghunathan TE, Grizzle JE (1995) A split questionnaire survey design. Journal of the American Statistical Association 90(429):54--63
  30. Simchi-Levi D, Wang C (2025) Multi-armed bandit experimental design: Online decision-making and adaptive inference. Management Science 71(6):4828--4846
  31. Toubia O, Gui GZ, Peng T, Merlau DJ, Li A, Chen H (2025) Database report: Twin-2K-500: A data set for building digital twins of over 2,000 people based on their answers to over 500 questions. Marketing Science 44(6):1446--1455
  32. Toubia O, Simester DI, Hauser JR, Dahan E (2003) Fast polyhedral adaptive conjoint estimation. Marketing Science 22(3):273--303
  33. Vafa K, Athey S, Blei DM (2025) Estimating wage disparities using foundation models. Proceedings of the National Academy of Sciences 122(22):e2427298122
  34. Wang L, Ye Z, Zhao J (2025) Efficient inference using large language models with limited human data: Fine-tuning then rectification. arXiv preprint arXiv:2511.19486
  35. Wang M, Zhang DJ, Zhang H (2024) Large language models for market research: A data-augmentation approach. arXiv preprint arXiv:2412.19363
  36. Ye Z, Yoganarasimhan H, Zheng Y (2025) LOLA: LLM-assisted online learning algorithm for content experiments. Marketing Science 44(5):995--1016
  37. Yin QE, Xin L (2025) Synthetic but not infinite: How much LLM-generated data to use in market research. Available at SSRN 6078686
  38. Ziems C, Held W, Shaikh O, Chen J, Zhang Z, Yang D (2024) Can large language models transform computational social science? Computational Linguistics 50(1):237--291