Optimal Query Allocation in Extractive QA with LLMs: A Learning-to-Defer Framework with Theoretical Guarantees
Pith reviewed 2026-05-23 18:45 UTC · model grok-4.3
The pith
A learning-to-defer framework allocates extractive QA queries among LLM experts, with theoretical guarantees on a deferral rule that balances performance against cost.
A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.
Core claim
The authors establish that a learning-to-defer decision rule, equipped with theoretical optimality guarantees, can allocate each extractive QA query to the most suitable LLM expert so that overall answer quality is preserved while computational cost is minimized under the assumed performance and cost models.
What carries the argument
The learning-to-defer allocation policy, which routes each query according to per-expert confidence and cost functions to achieve the provably optimal performance-cost tradeoff.
If this is right
- Answer reliability improves on SQuADv1, SQuADv2, and TriviaQA while computational overhead drops.
- Multiple specialized models become practical to deploy together without proportional cost increase.
- The deferral policy satisfies explicit optimality guarantees under the stated cost and performance models.
- The method supports scalable extractive QA systems that avoid running every expert on every input.
Where Pith is reading between the lines
- The same allocation logic could be tested on other structured prediction tasks that benefit from selective expert invocation.
- If the deferral policy were allowed to co-train with the experts, the fixed-model assumption could be relaxed and bounds might tighten.
- Production systems would need to replace proxy cost functions with live telemetry to keep the guarantees meaningful.
Load-bearing premise
The cost and performance functions used to derive the theoretical optimality guarantees accurately reflect real deployment conditions, and the expert models remain fixed rather than being co-adapted during training.
What would settle it
Measure actual end-to-end latency and accuracy when the same framework is deployed on a new dataset whose query distribution violates the assumed performance-cost relationships; if gains disappear, the optimality claim does not hold.
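One way to operationalize this check is to log per-query correctness and latency for the routed system and for a baseline on the new dataset, then compare the two. The following is a minimal Python sketch under stated assumptions: the `Record` log format, the choice of baseline (for instance, always consulting the strongest expert), the use of mean latency as the cost measure, and the `max_accuracy_drop` tolerance are all hypothetical and do not come from the paper.

```python
from dataclasses import dataclass
from typing import Sequence, Tuple


@dataclass
class Record:
    """One logged query (hypothetical format): correctness and measured latency."""
    correct: bool
    latency_ms: float


def summarize(records: Sequence[Record]) -> Tuple[float, float]:
    """Return (accuracy, mean latency in ms) over a non-empty log."""
    if not records:
        raise ValueError("need at least one record")
    n = len(records)
    accuracy = sum(r.correct for r in records) / n
    mean_latency = sum(r.latency_ms for r in records) / n
    return accuracy, mean_latency


def gains_survive(routed: Sequence[Record],
                  baseline: Sequence[Record],
                  max_accuracy_drop: float = 0.0) -> bool:
    """True if routing keeps accuracy within tolerance AND lowers mean latency."""
    routed_acc, routed_lat = summarize(routed)
    base_acc, base_lat = summarize(baseline)
    return (base_acc - routed_acc) <= max_accuracy_drop and routed_lat < base_lat
```

If `gains_survive` returns `False` on a dataset whose query distribution departs from the modeled one, that is evidence against the claimed tradeoff in that setting; a single run does not establish it either way.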
Original abstract
Large Language Models excel in generative tasks but exhibit inefficiencies in structured text selection, particularly in extractive question answering. This challenge is magnified in resource-constrained environments, where deploying multiple specialized models for different tasks is impractical. We propose a Learning-to-Defer framework that allocates queries to specialized experts, ensuring high-confidence predictions while optimizing computational efficiency. Our approach integrates a principled allocation strategy with theoretical guarantees on optimal deferral that balances performance and cost. Empirical evaluations on SQuADv1, SQuADv2, and TriviaQA demonstrate that our method enhances answer reliability while significantly reducing computational overhead, making it well-suited for scalable and efficient EQA deployment.
Editorial analysis
A structured set of objections, weighed in public.
Referee Report
Summary. The paper proposes a Learning-to-Defer framework for allocating queries to specialized LLM experts in extractive QA. It claims a principled allocation strategy together with theoretical guarantees on optimal deferral that balances performance and cost, and reports empirical gains in answer reliability and reduced overhead on SQuADv1, SQuADv2, and TriviaQA.
Significance. If the optimality guarantees are independent of the fitted deferrer parameters and the fixed-expert assumption holds in deployment, the framework could supply a principled method for cost-aware routing among multiple LLMs on structured selection tasks.
major comments (2)
- [Theoretical guarantees derivation (likely §4)] The central optimality guarantees rest on the assumption that the expert models remain fixed (non-adapted) oracles whose outputs and costs are independent of the learned deferral policy. This assumption is load-bearing for the claim that the trained policy achieves the derived optimum; if experts are co-adapted during deferrer training, the guarantee does not apply to the resulting policy.
- [Cost/performance modeling (likely §3.2)] The derivation of the theoretical guarantees is tied to particular functional forms chosen for the performance and cost of each expert. Real LLM inference costs (variable token pricing, batching effects, context-length dependence) may deviate from these forms, in which case the optimality result does not transfer to the learned allocation rule.
minor comments (2)
- Clarify in the abstract and introduction how many experts are used and whether they are fine-tuned or frozen.
- Add a short discussion of how the deferral threshold or allocation rule is obtained from the theoretical optimum (closed form vs. optimization).
Simulated Author's Rebuttal
We thank the referee for the thoughtful comments on the theoretical foundations of our work. We respond to each major comment below.
- Referee: [Theoretical guarantees derivation (likely §4)] The central optimality guarantees rest on the assumption that the expert models remain fixed (non-adapted) oracles whose outputs and costs are independent of the learned deferral policy. This assumption is load-bearing for the claim that the trained policy achieves the derived optimum; if experts are co-adapted during deferrer training, the guarantee does not apply to the resulting policy.
  Authors: Our proposed framework is developed under the assumption of fixed expert models, which are not adapted or co-trained with the deferrer. This is clearly stated in the method section, where the experts are described as pre-trained specialized LLMs. The optimality guarantees are derived specifically for this setting, ensuring that the deferral policy optimizes allocation without affecting expert behavior. Our experiments adhere to this assumption by keeping experts fixed, so the guarantees are applicable to the presented results. We do not extend claims to scenarios involving expert adaptation.
  Revision: no
- Referee: [Cost/performance modeling (likely §3.2)] The derivation of the theoretical guarantees is tied to particular functional forms chosen for the performance and cost of each expert. Real LLM inference costs (variable token pricing, batching effects, context-length dependence) may deviate from these forms, in which case the optimality result does not transfer to the learned allocation rule.
  Authors: The theoretical analysis employs specific functional forms for performance (based on answer correctness probability) and cost (tied to model inference characteristics) to enable the derivation of optimality conditions, as detailed in Section 3.2. These forms are chosen to reflect the extractive QA setting and are validated empirically. The guarantees hold for the modeled costs and performance; the framework is modular and allows substitution of alternative functions if different cost structures are desired. The manuscript acknowledges that real-world costs may include additional variables, but the core contribution is the principled allocation under the defined models.
  Revision: no
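To make the substitution point concrete, the sketch below shows one way a flat consultation fee could be swapped for a context-length-dependent one. It is a hypothetical Python illustration, not code from the paper: the `Fee` type, `flat_fee`, `per_token_fee`, and `expected_expert_cost` are invented names, and the expected cost is assumed to take the form "scale times error probability plus fee". Whether the paper's guarantees carry over to an input-dependent fee is not established by this page.

```python
from typing import Callable

# A consultation fee maps a prompt length (in tokens) to a nonnegative charge.
# Hypothetical abstraction: the paper's quoted cost uses a constant offset instead.
Fee = Callable[[int], float]


def flat_fee(beta: float) -> Fee:
    """Constant fee, independent of prompt length."""
    return lambda n_tokens: beta


def per_token_fee(base: float, per_token: float) -> Fee:
    """Fee that grows linearly with prompt length (an assumed alternative)."""
    return lambda n_tokens: base + per_token * n_tokens


def expected_expert_cost(p_wrong: float, alpha: float, fee: Fee, n_tokens: int) -> float:
    """Expected cost of consulting an expert: alpha * P(wrong) + fee(prompt length)."""
    return alpha * p_wrong + fee(n_tokens)
```

With a flat fee the cost ignores prompt length; with `per_token_fee`, longer contexts make the same expert less attractive, which is the kind of deviation the referee's comment raises.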
Circularity Check
No circularity: optimality guarantees derived from explicit performance/cost functions without reduction to fitted inputs or self-citations
The abstract describes a learning-to-defer framework with theoretical guarantees on optimal deferral balancing performance and cost. No equations or derivation steps are visible in the provided text to inspect for self-definitional reductions, fitted inputs renamed as predictions, or load-bearing self-citations. The central claim relies on defined functional forms for expert performance and cost, which is a standard modeling choice rather than a circular construction. Without specific quotes showing Eq. X reducing to a fit or prior self-work by definition, the derivation chain cannot be flagged as circular. This is the expected honest non-finding when no load-bearing steps reduce by construction.
Lean theorems connected to this paper
- IndisputableMonolith/Cost/FunctionalEquation.lean, washburn_uniqueness_aczel
  Relation between the paper passage and the cited Recognition theorem: unclear
  Paper passage: the cost incurred when relying on the main model $g$ is defined as $c_0(g_i(x), z_i) = \mathbb{1}\{g_i(x) \neq y_i\}$. Similarly, the cost of consulting expert $j$ is given by $c_{j>0}(m_{ij}(x), z_i) = \alpha_j c_0(m_{ij}(x), z_i) + \beta_j$
- IndisputableMonolith/Foundation/RealityFromDistinction.lean, reality_from_one_distinction
  Relation between the paper passage and the cited Recognition theorem: unclear
  Paper passage: Lemma 2 (Bayes-Rejector) … $r_{B,i}(x) = 0$ if $\inf \eta_{i0}(x) \le \min \eta_{ij}(x)$
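The two quoted passages can be read together as a routing rule: keep the main model when its expected cost is no larger than that of the cheapest expert, otherwise defer. The sketch below is a minimal Python illustration, not the authors' implementation. It assumes that each η term is an agent's conditional expected cost, that error-probability estimates are already available (the hypothetical `p_wrong_main` and `p_wrong_experts` arguments), and that the infimum for the main model is already folded into its estimate. Function names are invented; `alpha` and `beta` stand for the per-expert scale and offset in the quoted cost.

```python
from typing import Sequence


def agent_cost(is_wrong: bool, alpha: float = 1.0, beta: float = 0.0) -> float:
    """Realized cost alpha * 1{wrong} + beta; alpha=1, beta=0 gives the main-model cost."""
    return alpha * float(is_wrong) + beta


def expected_cost(p_wrong: float, alpha: float = 1.0, beta: float = 0.0) -> float:
    """Expectation of agent_cost given an estimated error probability."""
    return alpha * p_wrong + beta


def bayes_rejector(p_wrong_main: float,
                   p_wrong_experts: Sequence[float],
                   alphas: Sequence[float],
                   betas: Sequence[float]) -> int:
    """Return 0 to keep the main model, or j >= 1 to defer to expert j."""
    if not p_wrong_experts:
        return 0  # nothing to defer to
    main = expected_cost(p_wrong_main)
    expert_costs = [expected_cost(p, a, b)
                    for p, a, b in zip(p_wrong_experts, alphas, betas)]
    best = min(range(len(expert_costs)), key=expert_costs.__getitem__)
    # Ties go to the main model, matching the "less than or equal" in the lemma.
    return 0 if main <= expert_costs[best] else best + 1
```

For example, `bayes_rejector(0.4, [0.1, 0.05], [1.0, 1.0], [0.1, 0.4])` returns `1`: the main model's expected cost is 0.4, expert 1 costs about 0.2, and expert 2 about 0.45, so the query goes to expert 1 even though expert 2 is the more accurate of the two.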
What do these tags mean?
- matches: The paper's claim is directly supported by a theorem in the formal canon.
- supports: The theorem supports part of the paper's argument, but the paper may add assumptions or extra steps.
- extends: The paper goes beyond the formal theorem; the theorem is a base layer rather than the whole result.
- uses: The paper appears to rely on the theorem as machinery.
- contradicts: The paper's claim conflicts with a theorem or certificate in the canon.
- unclear: Pith found a possible connection, but the passage is too broad, indirect, or ambiguous to say the theorem truly supports the claim.
Forward citations
Cited by 2 Pith papers
- Learning When to Remember: Risk-Sensitive Contextual Bandits for Abstention-Aware Memory Retrieval in LLM-Based Coding Agents
  RSCB-MC is a risk-sensitive contextual bandit memory controller for LLM coding agents that chooses safe actions including abstention, achieving 60.5% proxy success with 0% false positives and low latency in 200-case v...
- Optimized Deferral for Imbalanced Settings
  MILD reformulates two-stage learning to defer as cost-sensitive learning over the input-expert domain and derives new margin-based losses with guarantees, yielding better performance than baselines on image classifica...