Optimal Query Allocation in Extractive QA with LLMs: A Learning-to-Defer Framework with Theoretical Guarantees

Axel Carlier; Lai Xing Ng; Shu Heng Yeo; Wei Tsang Ooi; Yannis Montreuil

arXiv: 2410.15761 · v4 · pith:RLW7ADWRnew · submitted 2024-10-21 · 💻 cs.CL · cs.LG · stat.ML

Pith reviewed 2026-05-23 18:45 UTC · model grok-4.3

classification 💻 cs.CL · cs.LG · stat.ML

keywords extractive question answering, learning to defer, large language models, query allocation, theoretical guarantees, computational efficiency, SQuAD, TriviaQA

The pith

A learning-to-defer framework allocates extractive QA queries to LLM experts with theoretical guarantees balancing performance and cost.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper introduces a Learning-to-Defer framework that decides which queries to send to specialized large language model experts in extractive question answering. The goal is to keep high-confidence answers while lowering total computation in settings where running every model on every query is too expensive. A principled allocation rule comes with theoretical guarantees that the deferral choices are optimal for given performance and cost functions. Tests on SQuADv1, SQuADv2, and TriviaQA show higher answer reliability together with lower overhead than running all experts unconditionally.

Core claim

The authors establish that a learning-to-defer decision rule, equipped with theoretical optimality guarantees, can allocate each extractive QA query to the most suitable LLM expert so that overall answer quality is preserved while computational cost is minimized under the assumed performance and cost models.

What carries the argument

The learning-to-defer allocation policy, which routes each query according to per-expert confidence and cost functions to achieve the provably optimal performance-cost tradeoff.
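The routing idea can be sketched in a few lines. This is a minimal illustration, not the paper's method: the `Agent` type, the `route` function, and the expected-cost formula `(1 - P(correct)) + consult_cost` are assumptions made here for clarity, and the confidence values are invented.

```python
from dataclasses import dataclass
from typing import Sequence


@dataclass(frozen=True)
class Agent:
    """One candidate answerer: the main model or an expert (hypothetical type)."""
    name: str
    consult_cost: float  # assumed fixed fee for invoking this agent


def route(agents: Sequence[Agent], p_correct: Sequence[float]) -> int:
    """Return the index of the agent with the lowest expected cost.

    Expected cost is taken as (1 - estimated P(correct)) + consult_cost,
    an illustrative choice rather than the paper's exact objective.
    """
    if len(agents) != len(p_correct):
        raise ValueError("need one confidence estimate per agent")
    expected = [(1.0 - p) + a.consult_cost for a, p in zip(agents, p_correct)]
    # Ties resolve to the earliest index, so the main model (index 0) wins ties.
    return min(range(len(agents)), key=expected.__getitem__)


agents = [Agent("main", 0.0), Agent("expert_a", 0.10), Agent("expert_b", 0.25)]
easy = route(agents, [0.90, 0.95, 0.97])  # expected costs: 0.10, 0.15, 0.28
hard = route(agents, [0.40, 0.80, 0.85])  # expected costs: 0.60, 0.30, 0.40
print(agents[easy].name, agents[hard].name)
```

In this toy setup the easy query stays with the main model because the experts' small accuracy gain does not cover their fee, while the hard query goes to the cheaper of the two experts.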

If this is right

Answer reliability improves on SQuADv1, SQuADv2, and TriviaQA while computational overhead drops.
Multiple specialized models become practical to deploy together without proportional cost increase.
The deferral policy satisfies explicit optimality guarantees under the stated cost and performance models.
The method supports scalable extractive QA systems that avoid running every expert on every input.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the authors make directly.

The same allocation logic could be tested on other structured prediction tasks that benefit from selective expert invocation.
If the deferral policy were allowed to co-train with the experts, the fixed-model assumption could be relaxed and bounds might tighten.
Production systems would need to replace proxy cost functions with live telemetry to keep the guarantees meaningful.

Load-bearing premise

The cost and performance functions used to derive the theoretical optimality guarantees accurately reflect real deployment conditions and the expert models remain fixed rather than co-adapted during training.

What would settle it

Measure actual end-to-end latency and accuracy when the same framework is deployed on a new dataset whose query distribution violates the assumed performance-cost relationships; if gains disappear, the optimality claim does not hold.
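A measurement of this kind could look like the sketch below. Everything in it is hypothetical: the `evaluate` helper, the stub agents, and the two-question dataset are placeholders, and the exact-match check is a simplified version (case-insensitive, whitespace-stripped) rather than the full normalization used by the official SQuAD scorer.

```python
import time
from typing import Callable, Sequence, Tuple

Answerer = Callable[[str], str]  # maps a query to an answer string
Router = Callable[[str], int]    # maps a query to an agent index


def evaluate(router: Router, agents: Sequence[Answerer],
             data: Sequence[Tuple[str, str]]) -> Tuple[float, float]:
    """Return (exact-match rate, mean seconds per query) for a routed system."""
    if not data:
        raise ValueError("need at least one (query, gold answer) pair")
    hits, elapsed = 0, 0.0
    for query, gold in data:
        start = time.perf_counter()
        answer = agents[router(query)](query)  # routing time is included
        elapsed += time.perf_counter() - start
        hits += int(answer.strip().lower() == gold.strip().lower())
    return hits / len(data), elapsed / len(data)


# Stub agents and data, purely for illustration.
def main_model(query: str) -> str:
    return "paris" if "France" in query else "unknown"


def expert(query: str) -> str:
    return {"Capital of France?": "Paris", "Capital of Peru?": "Lima"}[query]


data = [("Capital of France?", "Paris"), ("Capital of Peru?", "Lima")]
always_main = evaluate(lambda q: 0, [main_model, expert], data)
routed = evaluate(lambda q: 0 if "France" in q else 1, [main_model, expert], data)
print(always_main[0], routed[0])
```

Running the same harness on a dataset with a shifted query distribution, with real models in place of the stubs, would give the accuracy and latency pair that the test above calls for.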

Figures

Figures reproduced from arXiv: 2410.15761 by Axel Carlier, Lai Xing Ng, Shu Heng Yeo, Wei Tsang Ooi, Yannis Montreuil.

**Figure 2.** Comparison between the Exact Match metric and the E...

**Figure 3.** Combined Efficiency Comparison across benchmarks: ...

**Figure 4.** Combined Allocation Percentage across benchmark...

**Figure 5.** From left to right: Model Cascades, Query Routing, ...

**Figure 6.** Rejector Architecture: The input data is processe...

**Figure 7.** Inference Step of Our Approach: The input data is pr...

Original abstract

Large Language Models excel in generative tasks but exhibit inefficiencies in structured text selection, particularly in extractive question answering. This challenge is magnified in resource-constrained environments, where deploying multiple specialized models for different tasks is impractical. We propose a Learning-to-Defer framework that allocates queries to specialized experts, ensuring high-confidence predictions while optimizing computational efficiency. Our approach integrates a principled allocation strategy with theoretical guarantees on optimal deferral that balances performance and cost. Empirical evaluations on SQuADv1, SQuADv2, and TriviaQA demonstrate that our method enhances answer reliability while significantly reducing computational overhead, making it well-suited for scalable and efficient EQA deployment.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Desk Editor's Note (private letter to a colleague)

The paper applies learning-to-defer to route extractive QA queries among LLM experts and claims theoretical optimality bounds on the performance-cost tradeoff.

The main takeaway is a deferral framework that decides which expert handles each query in extractive QA to balance accuracy against compute cost, along with some optimality proofs for that allocation rule. It shows results on SQuADv1, SQuADv2, and TriviaQA where the approach keeps answer reliability up while cutting overhead compared with simpler baselines. That empirical piece is straightforward and could be useful for anyone running multiple models under tight resource limits. The setup follows the usual learning-to-defer template from other domains but specializes it to this task with the added bounds.

The soft spot sits in the theory. The guarantees are derived from particular functional forms for performance and cost, and they treat the experts as fixed oracles whose behavior does not depend on the deferral policy itself. Real LLM inference costs often vary with batching, token counts, and hardware, and joint training can make the experts adapt. If those conditions deviate, the optimality result does not directly apply to the trained system. The paper would benefit from checking how sensitive the bounds are to those modeling choices. No obvious internal contradictions or circular definitions appear in the abstract and stress-test description.

This work is aimed at teams building production QA systems who need a principled way to allocate queries across specialists. A reader focused on efficient inference would find the experiments and allocation logic worth testing. I would send it to peer review so referees can examine the derivation steps and the assumption robustness in detail.

Referee Report

2 major / 2 minor

Summary. The paper proposes a Learning-to-Defer framework for allocating queries to specialized LLM experts in extractive QA. It claims a principled allocation strategy together with theoretical guarantees on optimal deferral that balances performance and cost, and reports empirical gains in answer reliability and reduced overhead on SQuADv1, SQuADv2, and TriviaQA.

Significance. If the optimality guarantees are independent of the fitted deferrer parameters and the fixed-expert assumption holds in deployment, the framework could supply a principled method for cost-aware routing among multiple LLMs on structured selection tasks.

major comments (2)

[Theoretical guarantees derivation (likely §4)] The central optimality guarantees rest on the assumption that the expert models remain fixed (non-adapted) oracles whose outputs and costs are independent of the learned deferral policy. This assumption is load-bearing for the claim that the trained policy achieves the derived optimum; if experts are co-adapted during deferrer training, the guarantee does not apply to the resulting policy.
[Cost/performance modeling (likely §3.2)] The derivation of the theoretical guarantees is tied to particular functional forms chosen for the performance and cost of each expert. Real LLM inference costs (variable token pricing, batching effects, context-length dependence) may deviate from these forms, in which case the optimality result does not transfer to the learned allocation rule.

minor comments (2)

Clarify in the abstract and introduction how many experts are used and whether they are fine-tuned or frozen.
Add a short discussion of how the deferral threshold or allocation rule is obtained from the theoretical optimum (closed form vs. optimization).

Simulated Author's Rebuttal

2 responses · 0 unresolved

We thank the referee for the thoughtful comments on the theoretical foundations of our work. We respond to each major comment below.

Referee: [Theoretical guarantees derivation (likely §4)] The central optimality guarantees rest on the assumption that the expert models remain fixed (non-adapted) oracles whose outputs and costs are independent of the learned deferral policy. This assumption is load-bearing for the claim that the trained policy achieves the derived optimum; if experts are co-adapted during deferrer training, the guarantee does not apply to the resulting policy.

Authors: Our proposed framework is developed under the assumption of fixed expert models, which are not adapted or co-trained with the deferrer. This is clearly stated in the method section, where the experts are described as pre-trained specialized LLMs. The optimality guarantees are derived specifically for this setting, ensuring that the deferral policy optimizes allocation without affecting expert behavior. Our experiments adhere to this assumption by keeping experts fixed, so the guarantees are applicable to the presented results. We do not extend claims to scenarios involving expert adaptation.
Revision: no

Referee: [Cost/performance modeling (likely §3.2)] The derivation of the theoretical guarantees is tied to particular functional forms chosen for the performance and cost of each expert. Real LLM inference costs (variable token pricing, batching effects, context-length dependence) may deviate from these forms, in which case the optimality result does not transfer to the learned allocation rule.

Authors: The theoretical analysis employs specific functional forms for performance (based on answer correctness probability) and cost (tied to model inference characteristics) to enable the derivation of optimality conditions, as detailed in Section 3.2. These forms are chosen to reflect the extractive QA setting and are validated empirically. The guarantees hold for the modeled costs and performance; the framework is modular and allows substitution of alternative functions if different cost structures are desired. The manuscript acknowledges that real-world costs may include additional variables, but the core contribution is the principled allocation under the defined models.
Revision: no
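The exchange above turns on what happens when one cost model is swapped for another. The sketch below illustrates that substitution under stated assumptions: `affine_cost` mirrors the form "alpha times an error indicator plus beta" quoted elsewhere on this page, while `token_cost`, `expected_cost`, `should_defer`, and all numeric values are hypothetical and not taken from the paper.

```python
from typing import Callable

# A cost model maps (was_wrong, n_tokens) to a scalar cost.
CostFn = Callable[[bool, int], float]


def affine_cost(alpha: float, beta: float) -> CostFn:
    """Cost of the form alpha * 1{wrong} + beta, ignoring token count."""
    return lambda wrong, n_tokens: alpha * float(wrong) + beta


def token_cost(alpha: float, per_token: float) -> CostFn:
    """Hypothetical alternative: the fee grows linearly with context length."""
    return lambda wrong, n_tokens: alpha * float(wrong) + per_token * n_tokens


def expected_cost(cost: CostFn, p_wrong: float, n_tokens: int) -> float:
    """Expectation of the cost when the agent is wrong with probability p_wrong."""
    return p_wrong * cost(True, n_tokens) + (1.0 - p_wrong) * cost(False, n_tokens)


def should_defer(expert_cost: CostFn, p_wrong_main: float,
                 p_wrong_expert: float, n_tokens: int) -> bool:
    """Defer when the expert's expected cost is strictly below the main model's."""
    main = expected_cost(affine_cost(1.0, 0.0), p_wrong_main, n_tokens)
    return expected_cost(expert_cost, p_wrong_expert, n_tokens) < main


flat = affine_cost(alpha=1.0, beta=0.10)
scaled = token_cost(alpha=1.0, per_token=0.0005)
for n in (100, 600):
    print(n, should_defer(flat, 0.30, 0.10, n), should_defer(scaled, 0.30, 0.10, n))
```

With the flat fee the expert is chosen at both context lengths, whereas the token-dependent fee flips the decision for the 600-token query. That is the kind of deviation the referee's second comment is concerned with.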

Circularity Check

0 steps flagged

No circularity: optimality guarantees derived from explicit performance/cost functions without reduction to fitted inputs or self-citations

The abstract describes a learning-to-defer framework with theoretical guarantees on optimal deferral balancing performance and cost. No equations or derivation steps are visible in the provided text to inspect for self-definitional reductions, fitted inputs renamed as predictions, or load-bearing self-citations. The central claim relies on defined functional forms for expert performance and cost, which is a standard modeling choice rather than a circular construction. Without specific quotes showing Eq. X reducing to a fit or prior self-work by definition, the derivation chain cannot be flagged as circular. This is the expected honest non-finding when no load-bearing steps reduce by construction.

Axiom & Free-Parameter Ledger

0 free parameters · 0 axioms · 0 invented entities

Abstract-only; no explicit free parameters, axioms, or invented entities are stated. Full text would be required to enumerate fitted thresholds, cost functions, or modeling assumptions.

Lean theorems connected to this paper

Citations machine-checked in the Pith Canon. Every link opens the source theorem in the public Lean library.

IndisputableMonolith/Cost/FunctionalEquation.lean · washburn_uniqueness_aczel · unclear

Relation between the paper passage and the cited Recognition theorem: unclear.

Paper passage: the cost incurred when relying on the main model g is defined as $c_0(g_i(x), z_i) = \mathbb{1}\{g_i(x) \neq y_i\}$. Similarly, the cost of consulting expert j is given by $c_{j>0}(m_{ij}(x), z_i) = \alpha_j c_0(m_{ij}(x), z_i) + \beta_j$

IndisputableMonolith/Foundation/RealityFromDistinction.lean · reality_from_one_distinction · unclear

Relation between the paper passage and the cited Recognition theorem: unclear.

Paper passage: Lemma 2 (Bayes-Rejector) … $r_{B,i}(x) = 0$ if $\inf \eta_{i0}(x) \le \min \eta_{ij}(x)$
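
The two quoted passages can be connected with a short worked example. This is a sketch under assumptions the excerpt does not confirm: the index $i$ is dropped, $\eta_0(x)$ and $\eta_j(x)$ are read as the conditional expected costs of the main model and of expert $j$, the infimum and minimum in the lemma are treated as already evaluated, a rejector value of 0 is read as keeping the main model, and all numbers are invented for illustration.

```latex
\begin{align*}
\eta_0(x) &= \mathbb{E}\big[c_0(g(x), z) \mid x\big]
           = \Pr\big(g(x) \neq y \mid x\big), \\
\eta_j(x) &= \mathbb{E}\big[c_j(m_j(x), z) \mid x\big]
           = \alpha_j \Pr\big(m_j(x) \neq y \mid x\big) + \beta_j .
\end{align*}
% Illustrative values: Pr(g wrong | x) = 0.30, Pr(m_1 wrong | x) = 0.15,
% alpha_1 = 1, beta_1 = 0.10.
\begin{align*}
\eta_0(x) = 0.30, \qquad
\eta_1(x) = 1 \times 0.15 + 0.10 = 0.25 .
\end{align*}
```

Because $0.30 > 0.25$, the condition $\eta_0(x) \le \min_j \eta_j(x)$ fails, so under this reading the query would be deferred to expert 1. Raising $\beta_1$ to 0.15 or more would make the condition hold and keep the query with the main model.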

What do these tags mean?

matches: The paper's claim is directly supported by a theorem in the formal canon.
supports: The theorem supports part of the paper's argument, but the paper may add assumptions or extra steps.
extends: The paper goes beyond the formal theorem; the theorem is a base layer rather than the whole result.
uses: The paper appears to rely on the theorem as machinery.
contradicts: The paper's claim conflicts with a theorem or certificate in the canon.
unclear: Pith found a possible connection, but the passage is too broad, indirect, or ambiguous to say the theorem truly supports the claim.

Forward citations

Cited by 2 Pith papers

Reviewed papers in the Pith corpus that reference this work. Sorted by Pith novelty score.

Learning When to Remember: Risk-Sensitive Contextual Bandits for Abstention-Aware Memory Retrieval in LLM-Based Coding Agents
cs.CL · 2026-04 · unverdicted · novelty 6.0

RSCB-MC is a risk-sensitive contextual bandit memory controller for LLM coding agents that chooses safe actions including abstention, achieving 60.5% proxy success with 0% false positives and low latency in 200-case v...

Optimized Deferral for Imbalanced Settings
cs.LG · 2026-04 · unverdicted · novelty 5.0

MILD reformulates two-stage learning to defer as cost-sensitive learning over the input-expert domain and derives new margin-based losses with guarantees, yielding better performance than baselines on image classifica...

[1] [1]

Question answering systems approaches and challenges

Reem Alqifari. Question answering systems approaches and challenges. In Venelin Kovatchev, Irina Temnikova, Branislava S andrih, and Ivelina Nikolova, editors, Proceedings of the Student Research Workshop Associated with RANLP 2019, pages 69--75, Varna, Bulgaria, September 2019. INCOMA Ltd. doi:10.26615/issn.2603-2821.2019_011. URL https://aclanthology.or...

work page doi:10.26615/issn.2603-2821.2019_011 2019

[2] [2]

Multi-class h -consistency bounds

Pranjal Awasthi, Anqi Mao, Mehryar Mohri, and Yutao Zhong. Multi-class h -consistency bounds. Advances in Neural Information Processing Systems, 35: 0 782–795, December 2022. URL https://proceedings.neurips.cc/paper_files/paper/2022/hash/051f3997af1dd65da8e14397b6a72f8e-Abstract-Conference.html

work page 2022

[3] [3]

Convexity, classification, and risk bounds

Peter Bartlett, Michael Jordan, and Jon McAuliffe. Convexity, classification, and risk bounds. Journal of the American Statistical Association, 101: 0 138--156, 02 2006. doi:10.1198/016214505000000907

work page doi:10.1198/016214505000000907 2006

[4] [4]

Breiman, Bagging predictors, Machine Learning 24 (2) (1996) 123–140

Leo Breiman. Bagging predictors. Mach. Learn., 24 0 (2): 0 123–140, August 1996. ISSN 0885-6125. doi:10.1023/A:1018054314350. URL https://doi.org/10.1023/A:1018054314350

work page doi:10.1023/a:1018054314350 1996

[5] Tom B. Brown, Benjamin Mann, Nick Ryder, Melanie Subbiah, Jared Kaplan, Prafulla Dhariwal, Arvind Neelakantan, Pranav Shyam, Girish Sastry, Amanda Askell, Sandhini Agarwal, Ariel Herbert-Voss, Gretchen Krueger, Tom Henighan, Rewon Child, Aditya Ramesh, Daniel M. Ziegler, Jeffrey Wu, Clemens Winter, Christopher Hesse, Mark Chen, Eric Sigler, Mateusz Litwin...

[6] Danqi Chen, Adam Fisch, Jason Weston, and Antoine Bordes. Reading Wikipedia to answer open-domain questions. arXiv preprint arXiv:1704.00051, 2017

[7] C. Chow. On optimum recognition error and reject tradeoff. IEEE Transactions on Information Theory, 16(1):41–46, January 1970. doi:10.1109/TIT.1970.1054406

[8] Corinna Cortes, Giulia DeSalvo, and Mehryar Mohri. Learning with rejection. In Ronald Ortner, Hans Ulrich Simon, and Sandra Zilles, editors, Algorithmic Learning Theory, pages 67–82, Cham, 2016. Springer International Publishing. ISBN 978-3-319-46379-7

[9] Jacob Devlin, Ming-Wei Chang, Kenton Lee, and Kristina Toutanova. BERT: Pre-training of deep bidirectional transformers for language understanding. arXiv preprint arXiv:1810.04805, 2018

[10] Dujian Ding, Sihem Amer-Yahia, and Laks Lakshmanan. On efficient approximate queries over machine learning models. Proc. VLDB Endow., 16(4):918–931, December 2022. ISSN 2150-8097. doi:10.14778/3574245.3574273. URL https://doi.org/10.14778/3574245.3574273

[11] Dujian Ding, Ankur Mallick, Chi Wang, Robert Sim, Subhabrata Mukherjee, Victor Ruhle, Laks VS Lakshmanan, and Ahmed Hassan Awadallah. Hybrid LLM: Cost-efficient and quality-aware query routing. arXiv preprint arXiv:2404.14618, 2024

[12] Aritra Ghosh, Himanshu Kumar, and P. Shanti Sastry. Robust loss functions under label noise for deep neural networks. ArXiv, abs/1712.09482, 2017. URL https://api.semanticscholar.org/CorpusID:6546734

[13] Aaron Grattafiori, Abhimanyu Dubey, Abhinav Jauhri, Abhinav Pandey, Abhishek Kadian, Ahmad Al-Dahle, Aiesha Letman, Akhil Mathur, Alan Schelten, Alex Vaughan, Amy Yang, Angela Fan, Anirudh Goyal, Anthony Hartshorn, Aobo Yang, Archi Mitra, Archie Sravankumar, Artem Korenev, Arthur Hinsvark, Arun Rao, Aston Zhang, Aurelien Rodriguez, Austen Gregerson, Ava S...

[14] Albert Q. Jiang, Alexandre Sablayrolles, Arthur Mensch, Chris Bamford, Devendra Singh Chaplot, Diego de las Casas, Florian Bressand, Gianna Lengyel, Guillaume Lample, Lucile Saulnier, Lélio Renard Lavaud, Marie-Anne Lachaux, Pierre Stock, Teven Le Scao, Thibaut Lavril, Thomas Wang, Timothée Lacroix, and William El Sayed. Mistral 7B, 2023. URL https://arxi...

[15] Wittawat Jitkrittum, Neha Gupta, Aditya Krishna Menon, Harikrishna Narasimhan, Ankit Singh Rawat, and Sanjiv Kumar. When does confidence-based cascade deferral suffice?, 2024. URL https://arxiv.org/abs/2307.02764

[16] Mandar Joshi, Eunsol Choi, Daniel S Weld, and Luke Zettlemoyer. TriviaQA: A large scale distantly supervised challenge dataset for reading comprehension. arXiv preprint arXiv:1705.03551, 2017

[17] Anil Kag, Igor Fedorov, Aditya Gangrade, Paul Whatmough, and Venkatesh Saligrama. Efficient edge inference by selective query. In The Eleventh International Conference on Learning Representations, 2023. URL https://openreview.net/forum?id=jpR98ZdIm2q

[18] Steven Kolawole, Don Dennis, Ameet Talwalkar, and Virginia Smith. Agreement-based cascading for efficient inference, 2024. URL https://arxiv.org/abs/2407.02348

[19] Zhenzhong Lan, Mingda Chen, Sebastian Goodman, Kevin Gimpel, Piyush Sharma, and Radu Soricut. ALBERT: A lite BERT for self-supervised learning of language representations, 2020. URL https://arxiv.org/abs/1909.11942

[20] Yinhan Liu, Myle Ott, Naman Goyal, Jingfei Du, Mandar Joshi, Danqi Chen, Omer Levy, Mike Lewis, Luke Zettlemoyer, and Veselin Stoyanov. RoBERTa: A robustly optimized BERT pretraining approach. arXiv preprint arXiv:1907.11692, 2019

[21] Phil Long and Rocco Servedio. Consistency versus realizable H-consistency for multiclass classification. In Sanjoy Dasgupta and David McAllester, editors, Proceedings of the 30th International Conference on Machine Learning, number 3 in Proceedings of Machine Learning Research, pages 801–809, Atlanta, Georgia, USA, 17–19 Jun 2013. PMLR. URL https://proc...

[22] David Madras, Toniann Pitassi, and Richard Zemel. Predict responsibly: Improving fairness and accuracy by learning to defer, 2018

[23] Anqi Mao, Christopher Mohri, Mehryar Mohri, and Yutao Zhong. Two-stage learning to defer with multiple experts. In Thirty-seventh Conference on Neural Information Processing Systems, 2023a. URL https://openreview.net/forum?id=GIlsH0T4b2

[24] Anqi Mao, Mehryar Mohri, and Yutao Zhong. Cross-entropy loss functions: Theoretical analysis and applications, 2023b. URL https://arxiv.org/abs/2304.07288

[25] Anqi Mao, Mehryar Mohri, and Yutao Zhong. Regression with multi-expert deferral, 2024. URL https://arxiv.org/abs/2403.19494

[26] Massimo Merenda, Carlo Porcaro, and Demetrio Iero. Edge machine learning for AI-enabled IoT devices: A review. Sensors, 20(9):2533, 2020

[27] Mehryar Mohri, Afshin Rostamizadeh, and Ameet Talwalkar. Foundations of Machine Learning. The MIT Press, 2012. ISBN 026201825X

[28] Yannis Montreuil, Shu Heng Yeo, Axel Carlier, Lai Xing Ng, and Wei Tsang Ooi. Two-stage learning-to-defer for multi-task learning, 2024. URL https://arxiv.org/abs/2410.15729

[29] Yannis Montreuil, Axel Carlier, Lai Xing Ng, and Wei Tsang Ooi. Adversarial robustness in two-stage learning-to-defer: Algorithms and guarantees, 2025. URL https://arxiv.org/abs/2502.01027

[30] Hussein Mozannar and David Sontag. Consistent estimators for learning to defer to an expert, 2021

[31] Harikrishna Narasimhan, Wittawat Jitkrittum, Ankit Singh Rawat, Seungyeon Kim, Neha Gupta, Aditya Krishna Menon, and Sanjiv Kumar. Faster cascades via speculative decoding, 2024. URL https://arxiv.org/abs/2405.19261

[32] Isaac Ong, Amjad Almahairi, Vincent Wu, Wei-Lin Chiang, Tianhao Wu, Joseph E. Gonzalez, M Waleed Kadous, and Ion Stoica. RouteLLM: Learning to route LLMs with preference data, 2024. URL https://arxiv.org/abs/2406.18665

[33] OpenAI, Josh Achiam, Steven Adler, Sandhini Agarwal, Lama Ahmad, Ilge Akkaya, Florencia Leoni Aleman, Diogo Almeida, Janko Altenschmidt, Sam Altman, Shyamal Anadkat, Red Avila, Igor Babuschkin, Suchir Balaji, Valerie Balcom, Paul Baltescu, Haiming Bao, Mohammad Bavarian, Jeff Belgum, Irwan Bello, Jake Berdine, Gabriel Bernadett-Shapiro, Christopher Berner...

[34] Pranav Rajpurkar, Jian Zhang, Konstantin Lopyrev, and Percy Liang. SQuAD: 100,000+ questions for machine comprehension of text. arXiv preprint arXiv:1606.05250, 2016

[35] Pranav Rajpurkar, Robin Jia, and Percy Liang. Know what you don't know: Unanswerable questions for SQuAD, 2018. URL https://arxiv.org/abs/1806.03822

[36] Mohammad Saberian and Nuno Vasconcelos. Boosting algorithms for detector cascade learning. Journal of Machine Learning Research, 15(74):2569–2605, 2014. URL http://jmlr.org/papers/v15/saberian14a.html

[37] Mobashir Sadat, Zhengyu Zhou, Lukas Lange, Jun Araki, Arsalan Gundroo, Bingqing Wang, Rakesh R Menon, Md Rizwan Parvez, and Zhe Feng. DelucionQA: Detecting hallucinations in domain-specific question answering, 2023. URL https://arxiv.org/abs/2312.05200

[38] Ingo Steinwart. How to compare different loss functions and their risks. Constructive Approximation, 26:225–287, 2007. URL https://api.semanticscholar.org/CorpusID:16660598

[39] Ambuj Tewari and Peter L. Bartlett. On the consistency of multiclass classification methods. Journal of Machine Learning Research, 8(36):1007–1025, 2007. URL http://jmlr.org/papers/v8/tewari07a.html

[40] Hugo Touvron, Thibaut Lavril, Gautier Izacard, Xavier Martinet, Marie-Anne Lachaux, Timothée Lacroix, Baptiste Rozière, Naman Goyal, Eric Hambro, Faisal Azhar, Aurelien Rodriguez, Armand Joulin, Edouard Grave, and Guillaume Lample. LLaMA: Open and efficient foundation language models, 2023. URL https://arxiv.org/abs/2302.13971

[41] Fouad Trad and Ali Chehab. To ensemble or not: Assessing majority voting strategies for phishing detection with large language models, 2024. URL https://arxiv.org/abs/2412.00166

[42] Neeraj Varshney and Chitta Baral. Model cascading: Towards jointly improving efficiency and accuracy of NLP systems, 2022. URL https://arxiv.org/abs/2210.05528

[43] Rajeev Verma, Daniel Barrejon, and Eric Nalisnick. Learning to defer to multiple experts: Consistent surrogate losses, confidence calibration, and conformal ensembles. In International Conference on Artificial Intelligence and Statistics, 2022. URL https://api.semanticscholar.org/CorpusID:253237048

[44] P. Viola and M. Jones. Rapid object detection using a boosted cascade of simple features. In Proceedings of the 2001 IEEE Computer Society Conference on Computer Vision and Pattern Recognition. CVPR 2001, volume 1, pages I–I, 2001. doi:10.1109/CVPR.2001.990517

[45] Murong Yue, Jie Zhao, Min Zhang, Liang Du, and Ziyu Yao. Large language model cascades with mixture of thoughts representations for cost-efficient reasoning. arXiv preprint arXiv:2310.03094, 2023

[46] Mingyuan Zhang and Shivani Agarwal. Bayes consistency vs. H-consistency: The interplay between surrogate loss functions and the scoring function class. In H. Larochelle, M. Ranzato, R. Hadsell, M.F. Balcan, and H. Lin, editors, Advances in Neural Information Processing Systems, volume 33, pages 16927–16936. Curran Associates, Inc., 2020. URL https://proc...

[47] Tong Zhang. Statistical behavior and consistency of classification methods based on convex risk minimization. Annals of Statistics, 32, 12 2002. doi:10.1214/aos/1079120130