Optimal Query Allocation in Extractive QA with LLMs: A Learning-to-Defer Framework with Theoretical Guarantees
Pith reviewed 2026-05-23 18:45 UTC · model grok-4.3
The pith
A learning-to-defer framework allocates extractive QA queries to LLM experts with theoretical guarantees balancing performance and cost.
A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.
Core claim
The authors establish that a learning-to-defer decision rule, equipped with theoretical optimality guarantees, can allocate each extractive QA query to the most suitable LLM expert so that overall answer quality is preserved while computational cost is minimized under the assumed performance and cost models.
What carries the argument
The learning-to-defer allocation policy, which routes each query according to per-expert confidence and cost functions to achieve the provably optimal performance-cost tradeoff.
If this is right
- Answer reliability improves on SQuADv1, SQuADv2, and TriviaQA while computational overhead drops.
- Multiple specialized models become practical to deploy together without proportional cost increase.
- The deferral policy satisfies explicit optimality guarantees under the stated cost and performance models.
- The method supports scalable extractive QA systems that avoid running every expert on every input.
Where Pith is reading between the lines
- The same allocation logic could be tested on other structured prediction tasks that benefit from selective expert invocation.
- If the deferral policy were allowed to co-train with the experts, the fixed-model assumption could be relaxed and bounds might tighten.
- Production systems would need to replace proxy cost functions with live telemetry to keep the guarantees meaningful.
Load-bearing premise
The cost and performance functions used to derive the theoretical optimality guarantees accurately reflect real deployment conditions and the expert models remain fixed rather than co-adapted during training.
What would settle it
Measure actual end-to-end latency and accuracy when the same framework is deployed on a new dataset whose query distribution violates the assumed performance-cost relationships; if gains disappear, the optimality claim does not hold.
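One way to run such a check is sketched below in Python. The `Agent` record, the `evaluate_policy` helper, and the toy models are hypothetical and do not come from the paper. The sketch only shows how exact-match accuracy and mean wall-clock latency of a routing policy could be logged next to the always-main and always-expert baselines.

```python
import time
from dataclasses import dataclass
from typing import Callable, Dict, Sequence


@dataclass
class Agent:
    """Hypothetical wrapper around one model (main model or expert)."""
    name: str
    predict: Callable[[str], str]


def evaluate_policy(
    agents: Sequence[Agent],
    route: Callable[[str], int],
    queries: Sequence[str],
    answers: Sequence[str],
) -> Dict[str, float]:
    """Return exact-match accuracy and mean wall-clock latency per query.

    The timed region covers both the routing decision and the chosen
    agent's prediction, so routing overhead is counted as cost.
    """
    correct = 0
    elapsed = 0.0
    for query, gold in zip(queries, answers):
        start = time.perf_counter()
        prediction = agents[route(query)].predict(query)
        elapsed += time.perf_counter() - start
        correct += int(prediction == gold)
    n = len(queries)
    return {"accuracy": correct / n, "mean_latency_s": elapsed / n}


# Toy stand-ins (assumptions): the main model only handles short queries,
# the expert handles every query.
main = Agent("main", lambda q: q.upper() if len(q) <= 2 else "")
expert = Agent("expert", lambda q: q.upper())
agents = [main, expert]

queries = ["a", "bb", "cccc"]
answers = ["A", "BB", "CCCC"]

results = {
    "always_main": evaluate_policy(agents, lambda q: 0, queries, answers),
    "always_expert": evaluate_policy(agents, lambda q: 1, queries, answers),
    "routed": evaluate_policy(
        agents, lambda q: 0 if len(q) <= 2 else 1, queries, answers
    ),
}
for name, metrics in results.items():
    print(name, metrics)
```

In a real test the toy lambdas would be replaced by calls to the deployed models, and the routed policy would be the trained deferral rule; the comparison of interest is whether the routed row keeps its accuracy and latency advantage on the new query distribution.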
Original abstract
Large Language Models excel in generative tasks but exhibit inefficiencies in structured text selection, particularly in extractive question answering. This challenge is magnified in resource-constrained environments, where deploying multiple specialized models for different tasks is impractical. We propose a Learning-to-Defer framework that allocates queries to specialized experts, ensuring high-confidence predictions while optimizing computational efficiency. Our approach integrates a principled allocation strategy with theoretical guarantees on optimal deferral that balances performance and cost. Empirical evaluations on SQuADv1, SQuADv2, and TriviaQA demonstrate that our method enhances answer reliability while significantly reducing computational overhead, making it well-suited for scalable and efficient EQA deployment.
Editorial analysis
A structured set of objections, weighed in public.
Referee Report
Summary. The paper proposes a Learning-to-Defer framework for allocating queries to specialized LLM experts in extractive QA. It claims a principled allocation strategy together with theoretical guarantees on optimal deferral that balances performance and cost, and reports empirical gains in answer reliability and reduced overhead on SQuADv1, SQuADv2, and TriviaQA.
Significance. If the optimality guarantees are independent of the fitted deferrer parameters and the fixed-expert assumption holds in deployment, the framework could supply a principled method for cost-aware routing among multiple LLMs on structured selection tasks.
Major comments (2)
- [Theoretical guarantees derivation (likely §4)] The central optimality guarantees rest on the assumption that the expert models remain fixed (non-adapted) oracles whose outputs and costs are independent of the learned deferral policy. This assumption is load-bearing for the claim that the trained policy achieves the derived optimum; if experts are co-adapted during deferrer training, the guarantee does not apply to the resulting policy.
- [Cost/performance modeling (likely §3.2)] The derivation of the theoretical guarantees is tied to particular functional forms chosen for the performance and cost of each expert. Real LLM inference costs (variable token pricing, batching effects, context-length dependence) may deviate from these forms, in which case the optimality result does not transfer to the learned allocation rule.
Minor comments (2)
- Clarify in the abstract and introduction how many experts are used and whether they are fine-tuned or frozen.
- Add a short discussion of how the deferral threshold or allocation rule is obtained from the theoretical optimum (closed form vs. optimization).
Simulated Author's Rebuttal
We thank the referee for the thoughtful comments on the theoretical foundations of our work. We respond to each major comment below.
- Referee: [Theoretical guarantees derivation (likely §4)] The central optimality guarantees rest on the assumption that the expert models remain fixed (non-adapted) oracles whose outputs and costs are independent of the learned deferral policy. This assumption is load-bearing for the claim that the trained policy achieves the derived optimum; if experts are co-adapted during deferrer training, the guarantee does not apply to the resulting policy.
  Authors: Our proposed framework is developed under the assumption of fixed expert models, which are not adapted or co-trained with the deferrer. This is clearly stated in the method section, where the experts are described as pre-trained specialized LLMs. The optimality guarantees are derived specifically for this setting, ensuring that the deferral policy optimizes allocation without affecting expert behavior. Our experiments adhere to this assumption by keeping experts fixed, so the guarantees are applicable to the presented results. We do not extend claims to scenarios involving expert adaptation.
  Revision: no
- Referee: [Cost/performance modeling (likely §3.2)] The derivation of the theoretical guarantees is tied to particular functional forms chosen for the performance and cost of each expert. Real LLM inference costs (variable token pricing, batching effects, context-length dependence) may deviate from these forms, in which case the optimality result does not transfer to the learned allocation rule.
  Authors: The theoretical analysis employs specific functional forms for performance (based on answer correctness probability) and cost (tied to model inference characteristics) to enable the derivation of optimality conditions, as detailed in Section 3.2. These forms are chosen to reflect the extractive QA setting and are validated empirically. The guarantees hold for the modeled costs and performance; the framework is modular and allows substitution of alternative functions if different cost structures are desired. The manuscript acknowledges that real-world costs may include additional variables, but the core contribution is the principled allocation under the defined models.
  Revision: no
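The substitution the authors mention can be illustrated with a small Python sketch. It assumes the expert cost has the form quoted further down this page, a scaled 0-1 error plus a consultation fee, and swaps the constant fee for a token-priced one. The function names and the per-token prices are hypothetical, and whether the optimality result carries over to such a cost is exactly the open point raised by the referee.

```python
def zero_one_cost(prediction: str, gold: str) -> float:
    """0-1 loss: 1.0 if the predicted answer differs from the gold answer."""
    return float(prediction != gold)


def expert_cost(prediction: str, gold: str, alpha: float, beta: float) -> float:
    """Scaled error plus a consultation fee: alpha * c0 + beta."""
    return alpha * zero_one_cost(prediction, gold) + beta


def token_priced_beta(
    prompt_tokens: int, output_tokens: int, price_in: float, price_out: float
) -> float:
    """Hypothetical replacement for a constant fee: pay per token processed."""
    return prompt_tokens * price_in + output_tokens * price_out


# Illustrative prices only (assumptions, not real vendor pricing).
beta = token_priced_beta(1000, 20, price_in=1e-6, price_out=2e-6)
cost_wrong = expert_cost("Paris", "Lyon", alpha=1.0, beta=beta)
cost_right = expert_cost("Lyon", "Lyon", alpha=1.0, beta=beta)
print(beta, cost_wrong, cost_right)
```

Because the fee now depends on the length of each query's context, the cost of consulting an expert varies per query instead of being a fixed constant per expert.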
Circularity Check
No circularity: optimality guarantees derived from explicit performance/cost functions without reduction to fitted inputs or self-citations
The abstract describes a learning-to-defer framework with theoretical guarantees on optimal deferral balancing performance and cost. No equations or derivation steps are visible in the provided text to inspect for self-definitional reductions, fitted inputs renamed as predictions, or load-bearing self-citations. The central claim relies on defined functional forms for expert performance and cost, which is a standard modeling choice rather than a circular construction. Without specific quotes showing Eq. X reducing to a fit or prior self-work by definition, the derivation chain cannot be flagged as circular. This is the expected honest non-finding when no load-bearing steps reduce by construction.
Lean theorems connected to this paper
- IndisputableMonolith/Cost/FunctionalEquation.lean, washburn_uniqueness_aczel (tag: unclear)
  Relation between the paper passage and the cited Recognition theorem: unclear.
  Paper passage: the cost incurred when relying on the main model $g$ is defined as $c_0(g_i(x), z_i) = \mathbb{1}\{g_i(x) \neq y_i\}$. Similarly, the cost of consulting expert $j$ is given by $c_{j>0}(m_{ij}(x), z_i) = \alpha_j c_0(m_{ij}(x), z_i) + \beta_j$
- IndisputableMonolith/Foundation/RealityFromDistinction.lean, reality_from_one_distinction (tag: unclear)
  Relation between the paper passage and the cited Recognition theorem: unclear.
  Paper passage: Lemma 2 (Bayes-Rejector) … $r_{B,i}(x) = 0$ if $\inf \eta_{i0}(x) \leq \min \eta_{ij}(x)$
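Read together, the two quoted passages suggest a simple decision rule, sketched here in Python under stated assumptions. The sketch treats the first quantity in the lemma as the conditional expected cost of the main model and the second as that of expert j, drops the index i and the infimum, and assumes that a deferred query goes to the cheapest expert (the quoted lemma is truncated before that case). The function names and the error probabilities are hypothetical.

```python
from typing import List, Sequence, Tuple


def expected_costs(
    p_err_main: float,
    p_err_experts: Sequence[float],
    alphas: Sequence[float],
    betas: Sequence[float],
) -> Tuple[float, List[float]]:
    """Expected cost of each agent for one query.

    Main model: probability of a wrong answer (expected 0-1 cost).
    Expert j:   alpha_j * probability of a wrong answer + beta_j.
    """
    eta0 = p_err_main
    etas = [a * p + b for p, a, b in zip(p_err_experts, alphas, betas)]
    return eta0, etas


def bayes_rejector(eta0: float, etas: Sequence[float]) -> int:
    """Return 0 to keep the main model, else the 1-based index of an expert.

    Keeping the main model when eta0 <= min_j eta_j follows the quoted lemma;
    sending deferred queries to the argmin expert is an assumption.
    """
    best_j = min(range(len(etas)), key=etas.__getitem__)
    if eta0 <= etas[best_j]:
        return 0
    return best_j + 1


# Two hypothetical experts: a cheap one and a more accurate, pricier one.
alphas, betas = [1.0, 1.0], [0.1, 0.5]

eta0_hard, etas_hard = expected_costs(0.4, [0.1, 0.05], alphas, betas)
hard_choice = bayes_rejector(eta0_hard, etas_hard)  # main model is unreliable

eta0_easy, etas_easy = expected_costs(0.1, [0.1, 0.05], alphas, betas)
easy_choice = bayes_rejector(eta0_easy, etas_easy)  # main model is good enough

print(hard_choice, easy_choice)
```

In practice the error probabilities are not known and would have to be estimated by the learned deferral model, which is where the fixed-expert and cost-form assumptions discussed in the referee report come into play.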
What do these tags mean?
- matches: The paper's claim is directly supported by a theorem in the formal canon.
- supports: The theorem supports part of the paper's argument, but the paper may add assumptions or extra steps.
- extends: The paper goes beyond the formal theorem; the theorem is a base layer rather than the whole result.
- uses: The paper appears to rely on the theorem as machinery.
- contradicts: The paper's claim conflicts with a theorem or certificate in the canon.
- unclear: Pith found a possible connection, but the passage is too broad, indirect, or ambiguous to say the theorem truly supports the claim.
Forward citations
Cited by 2 Pith papers
- Learning When to Remember: Risk-Sensitive Contextual Bandits for Abstention-Aware Memory Retrieval in LLM-Based Coding Agents
  RSCB-MC is a risk-sensitive contextual bandit memory controller for LLM coding agents that chooses safe actions including abstention, achieving 60.5% proxy success with 0% false positives and low latency in 200-case v...
- Optimized Deferral for Imbalanced Settings
  MILD reformulates two-stage learning to defer as cost-sensitive learning over the input-expert domain and derives new margin-based losses with guarantees, yielding better performance than baselines on image classifica...