Optimal Query Allocation in Extractive QA with LLMs: A Learning-to-Defer Framework with Theoretical Guarantees

Axel Carlier; Lai Xing Ng; Shu Heng Yeo; Wei Tsang Ooi; Yannis Montreuil

arxiv: 2410.15761 · v5 · pith:RLW7ADWRnew · submitted 2024-10-21 · 💻 cs.CL · cs.LG · stat.ML

Pith reviewed 2026-05-23 18:45 UTC · model grok-4.3

classification 💻 cs.CL · cs.LG · stat.ML

keywords extractive question answering · learning to defer · large language models · query allocation · theoretical guarantees · computational efficiency · SQuAD · TriviaQA

The pith

A learning-to-defer framework allocates extractive QA queries to LLM experts with theoretical guarantees balancing performance and cost.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper introduces a Learning-to-Defer framework that decides which queries to send to specialized large language model experts in extractive question answering. The goal is to keep high-confidence answers while lowering total computation in settings where running every model on every query is too expensive. A principled allocation rule comes with theoretical guarantees that the deferral choices are optimal for given performance and cost functions. Tests on SQuADv1, SQuADv2, and TriviaQA show higher answer reliability together with lower overhead than running all experts unconditionally.

Core claim

The authors establish that a learning-to-defer decision rule, equipped with theoretical optimality guarantees, can allocate each extractive QA query to the most suitable LLM expert so that overall answer quality is preserved while computational cost is minimized under the assumed performance and cost models.

What carries the argument

The learning-to-defer allocation policy, which routes each query according to per-expert confidence and cost functions to achieve the provably optimal performance-cost tradeoff.
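
As a rough illustration of this routing idea, the sketch below picks, for a single query, the agent whose estimated error plus consultation fee is lowest. The function name `route_query`, the `confidences` and `fees` inputs, and the additive way they are combined are all assumptions made for illustration; the paper's policy is learned and is not reproduced here.

```python
from typing import Sequence


def route_query(confidences: Sequence[float], fees: Sequence[float]) -> int:
    """Return the index of the agent with the lowest estimated total cost.

    confidences[k]: estimated probability that agent k answers correctly
                    (hypothetical input; index 0 is the main model).
    fees[k]:        fixed cost charged for consulting agent k (hypothetical).
    """
    if len(confidences) != len(fees) or not confidences:
        raise ValueError("confidences and fees must be non-empty and aligned")
    # Estimated total cost = chance of a wrong answer + consultation fee.
    totals = [(1.0 - p) + fee for p, fee in zip(confidences, fees)]
    # min() returns the first minimum, so ties go to the lowest index.
    return min(range(len(totals)), key=totals.__getitem__)


# Main model is fairly sure: the query stays with agent 0.
print(route_query([0.90, 0.95], [0.0, 0.10]))
# Main model is unsure: the expert's fee is worth paying, so agent 1 is chosen.
print(route_query([0.40, 0.95], [0.0, 0.10]))
```

The additive form is only one possible tradeoff; any other way of combining confidence and cost could be swapped in without changing the argmin structure.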

If this is right

Answer reliability improves on SQuADv1, SQuADv2, and TriviaQA while computational overhead drops.
Multiple specialized models become practical to deploy together without proportional cost increase.
The deferral policy satisfies explicit optimality guarantees under the stated cost and performance models.
The method supports scalable extractive QA systems that avoid running every expert on every input.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

The same allocation logic could be tested on other structured prediction tasks that benefit from selective expert invocation.
If the deferral policy were allowed to co-train with the experts, the fixed-model assumption could be relaxed and bounds might tighten.
Production systems would need to replace proxy cost functions with live telemetry to keep the guarantees meaningful.

Load-bearing premise

The cost and performance functions used to derive the theoretical optimality guarantees accurately reflect real deployment conditions and the expert models remain fixed rather than co-adapted during training.

What would settle it

Measure actual end-to-end latency and accuracy when the same framework is deployed on a new dataset whose query distribution violates the assumed performance-cost relationships; if gains disappear, the optimality claim does not hold.
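
A minimal sketch of such a measurement, assuming the deployed system can be called as a plain function from question to answer. The `measure` helper, the toy data, and the case-insensitive string comparison are assumptions for illustration; the comparison is a simplified stand-in for the official Exact Match metric, and wall-clock time is only one possible notion of cost.

```python
import time
from typing import Callable, Iterable, Tuple


def measure(system: Callable[[str], str],
            examples: Iterable[Tuple[str, str]]) -> Tuple[float, float]:
    """Return (accuracy, mean wall-clock seconds per query)."""
    correct, elapsed, n = 0, 0.0, 0
    for question, gold in examples:
        start = time.perf_counter()
        answer = system(question)
        elapsed += time.perf_counter() - start
        # Simplified match: ignore case and surrounding whitespace only.
        correct += int(answer.strip().lower() == gold.strip().lower())
        n += 1
    if n == 0:
        raise ValueError("no examples to evaluate")
    return correct / n, elapsed / n


# Toy stand-in for a deployed QA system (hypothetical data).
toy_answers = {"Capital of France?": "Paris", "Capital of Peru?": "Quito"}
examples = [("Capital of France?", "paris"), ("Capital of Peru?", "Lima")]
accuracy, mean_latency = measure(toy_answers.get, examples)
print(f"accuracy={accuracy:.2f}, mean latency={mean_latency:.6f}s")
```

Running the same helper once on the deferral system and once on the run-every-expert baseline, over the shifted dataset, would give the two pairs of numbers the comparison needs.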

Figures

Figures reproduced from arXiv: 2410.15761 by Axel Carlier, Lai Xing Ng, Shu Heng Yeo, Wei Tsang Ooi, Yannis Montreuil.

**Figure 2.** Comparison between the Exact Match metric and the E…

**Figure 3.** Combined Efficiency Comparison across benchmarks: …

**Figure 4.** Combined Allocation Percentage across benchmark…

**Figure 5.** From left to right: Model Cascades, Query Routing, …

**Figure 6.** Rejector Architecture: The input data is processe…

**Figure 7.** Inference Step of Our Approach: The input data is pr…

Original abstract

Large Language Models excel in generative tasks but exhibit inefficiencies in structured text selection, particularly in extractive question answering. This challenge is magnified in resource-constrained environments, where deploying multiple specialized models for different tasks is impractical. We propose a Learning-to-Defer framework that allocates queries to specialized experts, ensuring high-confidence predictions while optimizing computational efficiency. Our approach integrates a principled allocation strategy with theoretical guarantees on optimal deferral that balances performance and cost. Empirical evaluations on SQuADv1, SQuADv2, and TriviaQA demonstrate that our method enhances answer reliability while significantly reducing computational overhead, making it well-suited for scalable and efficient EQA deployment.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Desk Editor's Note: private letter to a colleague

The paper applies learning-to-defer to route extractive QA queries among LLM experts and claims theoretical optimality bounds on the performance-cost tradeoff.

The main takeaway is a deferral framework that decides which expert handles each query in extractive QA to balance accuracy against compute cost, along with some optimality proofs for that allocation rule. It shows results on SQuADv1, SQuADv2, and TriviaQA where the approach keeps answer reliability up while cutting overhead compared with simpler baselines. That empirical piece is straightforward and could be useful for anyone running multiple models under tight resource limits. The setup follows the usual learning-to-defer template from other domains but specializes it to this task with the added bounds.

The soft spot sits in the theory. The guarantees are derived from particular functional forms for performance and cost, and they treat the experts as fixed oracles whose behavior does not depend on the deferral policy itself. Real LLM inference costs often vary with batching, token counts, and hardware, and joint training can make the experts adapt. If those conditions deviate, the optimality result does not directly apply to the trained system. The paper would benefit from checking how sensitive the bounds are to those modeling choices. No obvious internal contradictions or circular definitions appear in the abstract and stress-test description.

This work is aimed at teams building production QA systems who need a principled way to allocate queries across specialists. A reader focused on efficient inference would find the experiments and allocation logic worth testing. I would send it to peer review so referees can examine the derivation steps and the assumption robustness in detail.

Referee Report

2 major / 2 minor

Summary. The paper proposes a Learning-to-Defer framework for allocating queries to specialized LLM experts in extractive QA. It claims a principled allocation strategy together with theoretical guarantees on optimal deferral that balances performance and cost, and reports empirical gains in answer reliability and reduced overhead on SQuADv1, SQuADv2, and TriviaQA.

Significance. If the optimality guarantees are independent of the fitted deferrer parameters and the fixed-expert assumption holds in deployment, the framework could supply a principled method for cost-aware routing among multiple LLMs on structured selection tasks.

major comments (2)

[Theoretical guarantees derivation (likely §4)] The central optimality guarantees rest on the assumption that the expert models remain fixed (non-adapted) oracles whose outputs and costs are independent of the learned deferral policy. This assumption is load-bearing for the claim that the trained policy achieves the derived optimum; if experts are co-adapted during deferrer training, the guarantee does not apply to the resulting policy.
[Cost/performance modeling (likely §3.2)] The derivation of the theoretical guarantees is tied to particular functional forms chosen for the performance and cost of each expert. Real LLM inference costs (variable token pricing, batching effects, context-length dependence) may deviate from these forms, in which case the optimality result does not transfer to the learned allocation rule.

minor comments (2)

Clarify in the abstract and introduction how many experts are used and whether they are fine-tuned or frozen.
Add a short discussion of how the deferral threshold or allocation rule is obtained from the theoretical optimum (closed form vs. optimization).

Simulated Author's Rebuttal

2 responses · 0 unresolved

We thank the referee for the thoughtful comments on the theoretical foundations of our work. We respond to each major comment below.

Referee: [Theoretical guarantees derivation (likely §4)] The central optimality guarantees rest on the assumption that the expert models remain fixed (non-adapted) oracles whose outputs and costs are independent of the learned deferral policy. This assumption is load-bearing for the claim that the trained policy achieves the derived optimum; if experts are co-adapted during deferrer training, the guarantee does not apply to the resulting policy.

Authors: Our proposed framework is developed under the assumption of fixed expert models, which are not adapted or co-trained with the deferrer. This is clearly stated in the method section, where the experts are described as pre-trained specialized LLMs. The optimality guarantees are derived specifically for this setting, ensuring that the deferral policy optimizes allocation without affecting expert behavior. Our experiments adhere to this assumption by keeping experts fixed, so the guarantees are applicable to the presented results. We do not extend claims to scenarios involving expert adaptation.

Revision: no

Referee: [Cost/performance modeling (likely §3.2)] The derivation of the theoretical guarantees is tied to particular functional forms chosen for the performance and cost of each expert. Real LLM inference costs (variable token pricing, batching effects, context-length dependence) may deviate from these forms, in which case the optimality result does not transfer to the learned allocation rule.

Authors: The theoretical analysis employs specific functional forms for performance (based on answer correctness probability) and cost (tied to model inference characteristics) to enable the derivation of optimality conditions, as detailed in Section 3.2. These forms are chosen to reflect the extractive QA setting and are validated empirically. The guarantees hold for the modeled costs and performance; the framework is modular and allows substitution of alternative functions if different cost structures are desired. The manuscript acknowledges that real-world costs may include additional variables, but the core contribution is the principled allocation under the defined models.

Revision: no

Circularity Check

0 steps flagged

No circularity: optimality guarantees derived from explicit performance/cost functions without reduction to fitted inputs or self-citations

The abstract describes a learning-to-defer framework with theoretical guarantees on optimal deferral balancing performance and cost. No equations or derivation steps are visible in the provided text to inspect for self-definitional reductions, fitted inputs renamed as predictions, or load-bearing self-citations. The central claim relies on defined functional forms for expert performance and cost, which is a standard modeling choice rather than a circular construction. Without specific quotes showing Eq. X reducing to a fit or prior self-work by definition, the derivation chain cannot be flagged as circular. This is the expected honest non-finding when no load-bearing steps reduce by construction.

Axiom & Free-Parameter Ledger

0 free parameters · 0 axioms · 0 invented entities

Abstract-only; no explicit free parameters, axioms, or invented entities are stated. Full text would be required to enumerate fitted thresholds, cost functions, or modeling assumptions.

Lean theorems connected to this paper

Citations machine-checked in the Pith Canon. Every link opens the source theorem in the public Lean library.

IndisputableMonolith/Cost/FunctionalEquation.lean washburn_uniqueness_aczel unclear

Relation between the paper passage and the cited Recognition theorem: unclear

the cost incurred when relying on the main model $g$ is defined as $c_0(g_i(x), z_i) = \mathbb{1}\{g_i(x) \neq y_i\}$. Similarly, the cost of consulting expert $j$ is given by $c_{j>0}(m_{ij}(x), z_i) = \alpha_j c_0(m_{ij}(x), z_i) + \beta_j$

IndisputableMonolith/Foundation/RealityFromDistinction.lean reality_from_one_distinction unclear

Relation between the paper passage and the cited Recognition theorem: unclear

Lemma 2 (Bayes-Rejector) … $r_{B,i}(x) = 0$ if $\inf \eta_{i0}(x) \le \min \eta_{ij}(x)$
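
The two quoted passages can be read together in a small sketch. The cost functions follow the quoted definitions: a 0-1 loss for the main model, and a scaled loss plus a fee β_j for expert j. The rejector is an assumption: the quoted lemma is truncated, so the code implements only the visible condition (keep the main model when its expected cost does not exceed the smallest expert expected cost) and, as a guess at the remaining case, defers to the lowest-cost expert otherwise. All function names are hypothetical, and the expected costs are taken as given inputs rather than estimated.

```python
from typing import Sequence


def main_cost(prediction: str, label: str) -> float:
    """c_0: 1 if the main model's prediction is wrong, else 0."""
    return float(prediction != label)


def expert_cost(prediction: str, label: str, alpha: float, beta: float) -> float:
    """c_j for j > 0: alpha_j * c_0(expert prediction) + beta_j."""
    return alpha * main_cost(prediction, label) + beta


def rejector(eta_main: float, eta_experts: Sequence[float]) -> int:
    """Return 0 to keep the main model, or j >= 1 to defer to expert j.

    eta_main and eta_experts are expected costs for one input x
    (assumed to be estimated elsewhere).
    """
    if not eta_experts:
        return 0
    best = min(range(len(eta_experts)), key=eta_experts.__getitem__)
    # Visible part of the quoted rule: stay with the main model when it is
    # at least as cheap as the best expert.
    return 0 if eta_main <= eta_experts[best] else best + 1


# A correct expert answer still costs its fee beta.
print(expert_cost("Paris", "Paris", alpha=1.0, beta=0.2))  # 0.2
# Expert 2 has the lowest expected cost, so the query is deferred to it.
print(rejector(0.3, [0.5, 0.25]))  # 2
```

Under this reading, β_j is what keeps the rule from always deferring: even a perfectly accurate expert is consulted only when the main model's expected cost exceeds that expert's fee.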

What do these tags mean?

matches: The paper's claim is directly supported by a theorem in the formal canon.
supports: The theorem supports part of the paper's argument, but the paper may add assumptions or extra steps.
extends: The paper goes beyond the formal theorem; the theorem is a base layer rather than the whole result.
uses: The paper appears to rely on the theorem as machinery.
contradicts: The paper's claim conflicts with a theorem or certificate in the canon.
unclear: Pith found a possible connection, but the passage is too broad, indirect, or ambiguous to say the theorem truly supports the claim.

Forward citations

Cited by 2 Pith papers

Reviewed papers in the Pith corpus that reference this work. Sorted by Pith novelty score.

Learning When to Remember: Risk-Sensitive Contextual Bandits for Abstention-Aware Memory Retrieval in LLM-Based Coding Agents
cs.CL 2026-04 unverdicted novelty 6.0

RSCB-MC is a risk-sensitive contextual bandit memory controller for LLM coding agents that chooses safe actions including abstention, achieving 60.5% proxy success with 0% false positives and low latency in 200-case v...

Optimized Deferral for Imbalanced Settings
cs.LG 2026-04 unverdicted novelty 5.0

MILD reformulates two-stage learning to defer as cost-sensitive learning over the input-expert domain and derives new margin-based losses with guarantees, yielding better performance than baselines on image classifica...