Learning When to Remember: Risk-Sensitive Contextual Bandits for Abstention-Aware Memory Retrieval in LLM-Based Coding Agents
Pith reviewed 2026-05-07 07:58 UTC · model grok-4.3
The pith
A risk-sensitive contextual bandit lets LLM coding agents decide when to reuse prior memory traces safely by abstaining from mismatched ones.
A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.
Core claim
RSCB-MC is a risk-sensitive contextual bandit memory controller that decides whether an agent should use no memory, inject the top resolution, summarize multiple candidates, perform high-precision or high-recall retrieval, abstain, or ask for feedback. The system stores reusable issue knowledge through a pattern-variant-episode schema and converts retrieval evidence into a fixed 16-feature contextual state capturing relevance, uncertainty, structural compatibility, feedback history, false-positive risk, latency, and token cost. Its reward design penalizes false-positive memory injection more strongly than missed reuse, making non-injection and abstention first-class safety actions. In tests, RSCB-MC obtains the strongest non-oracle offline replay success rate (62.5%) while maintaining a 0.0% false-positive rate.
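To make the decision surface concrete, here is a minimal sketch of the seven-action space and a context builder, assuming illustrative names throughout (the paper publishes no code, and feature names such as relevance_max are placeholders for the seven signal families it lists):

```python
from enum import Enum, auto
import numpy as np

class MemoryAction(Enum):
    """The seven controller actions described in the paper."""
    NO_MEMORY = auto()             # proceed without retrieved memory
    INJECT_TOP = auto()            # inject the single top resolution
    SUMMARIZE_CANDIDATES = auto()  # summarize multiple candidates
    HIGH_PRECISION_RETRIEVAL = auto()
    HIGH_RECALL_RETRIEVAL = auto()
    ABSTAIN = auto()               # explicitly decline to use memory
    ASK_FEEDBACK = auto()          # ask for feedback before acting

def build_context(evidence: dict) -> np.ndarray:
    """Map retrieval evidence onto the fixed 16-feature state.

    The paper names seven signal families (relevance, uncertainty,
    structural compatibility, feedback history, false-positive risk,
    latency, token cost); the concrete keys below are assumptions.
    """
    keys = [
        "relevance_max", "relevance_mean", "score_entropy",
        "struct_compat", "feedback_pos_rate", "fp_risk_estimate",
        "latency_norm", "token_cost_norm",
    ]
    vec = [float(evidence.get(k, 0.0)) for k in keys]
    vec += [0.0] * (16 - len(vec))  # pad placeholders out to 16 dims
    return np.array(vec)
```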
What carries the argument
The risk-sensitive contextual bandit memory controller (RSCB-MC) that converts retrieval evidence into a 16-feature contextual state and selects actions including abstention under a reward function that weights false-positive injections more heavily than missed opportunities.
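The reward itself is described only qualitatively; one shape consistent with that description, written as a hedged sketch rather than the authors' equation (the λ and c weights correspond to the free parameters flagged in the ledger below):

```latex
% One plausible risk-sensitive reward; the structure is an assumption.
% \lambda_{\mathrm{fp}} > \lambda_{\mathrm{miss}} encodes that a
% false-positive injection costs more than a missed reuse.
r_t = \mathbb{1}[\text{success}]
      - \lambda_{\mathrm{fp}}\,\mathbb{1}[\text{false-positive injection}]
      - \lambda_{\mathrm{miss}}\,\mathbb{1}[\text{missed reuse}]
      - c_{\mathrm{tok}}\,\text{tokens}_t
      - c_{\mathrm{lat}}\,\text{latency}_t,
\qquad \lambda_{\mathrm{fp}} > \lambda_{\mathrm{miss}}
```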
If this is right
- Memory use becomes a controlled safety decision rather than an automatic similarity-based step.
- Agents can reach 62.5 percent offline success while keeping false-positive injections at zero.
- Decision latency stays low at 331 microseconds for the 95th percentile in hot-path validation.
- Knowledge storage via the pattern-variant-episode schema enables structured reuse of issue patterns.
- Non-injection and abstention actions are treated as viable and sometimes preferred outcomes.
Where Pith is reading between the lines
- The same learned abstention logic could apply to other retrieval-augmented LLM tasks where erroneous external knowledge carries high downstream cost.
- Extending the 16-feature state with domain-specific signals might improve robustness when new failure modes appear outside the training distribution.
- The offline replay results suggest the controller could support online adaptation in live agent sessions without requiring full retraining.
- Treating abstention as a first-class action may reduce the need for ever-more-accurate similarity metrics in agent memory systems.
Load-bearing premise
The 16-feature contextual state together with the chosen reward weights and offline replay setup are sufficient to capture the true risk of unsafe memory injection across the full range of real-world coding-agent failures.
What would settle it
A collection of coding-agent failure cases in which the bandit selects memory injection yet the injected memory produces an incorrect or harmful repair step, yielding a positive false-positive rate contrary to the zero rate observed in the smoke-scale and 200-case validations.
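A minimal harness for that falsification test might look like the following, reusing the hypothetical MemoryAction and build_context names from the sketch above and assuming human-adjudicated labels for whether an injected memory harmed the repair:

```python
def false_positive_rate(cases: list[dict], controller) -> float:
    """Replay adjudicated failure cases and measure unsafe injections.

    Each case carries retrieval evidence plus a human-adjudicated
    label, `injection_was_harmful`, marking whether the retrieved
    memory would have produced an incorrect or harmful repair step.
    Assumes `controller.decide` returns a MemoryAction.
    """
    injections, harmful = 0, 0
    for case in cases:
        action = controller.decide(build_context(case["evidence"]))
        if action in (MemoryAction.INJECT_TOP,
                      MemoryAction.SUMMARIZE_CANDIDATES):
            injections += 1
            harmful += int(case["injection_was_harmful"])
    # Any harmful injection yields a positive rate, contradicting
    # the 0.0% reported in the smoke-scale and 200-case validations.
    return harmful / injections if injections else 0.0
```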
read the original abstract
Large language model (LLM)-based coding agents increasingly rely on external memory to reuse prior debugging experience, repair traces, and repository-local operational knowledge. However, retrieved memory is useful only when the current failure is genuinely compatible with a previous one; superficial similarity in stack traces, terminal errors, paths, or configuration symptoms can lead to unsafe memory injection. This paper reframes issue-memory use as a selective, risk-sensitive control problem rather than a pure top-k retrieval problem. We introduce RSCB-MC, a risk-sensitive contextual bandit memory controller that decides whether an agent should use no memory, inject the top resolution, summarize multiple candidates, perform high-precision or high-recall retrieval, abstain, or ask for feedback. The system stores reusable issue knowledge through a pattern-variant-episode schema and converts retrieval evidence into a fixed 16-feature contextual state capturing relevance, uncertainty, structural compatibility, feedback history, false-positive risk, latency, and token cost. Its reward design penalizes false-positive memory injection more strongly than missed reuse, making non-injection and abstention first-class safety actions. In deterministic smoke-scale artifacts, RSCB-MC obtains the strongest non-oracle offline replay success rate, 62.5%, while maintaining a 0.0% false-positive rate. In a bounded 200-case hot-path validation, it reaches 60.5% proxy success with 0.0% false positives and a 331.466 microseconds p95 decision latency. The results show that, for coding-agent memory, the key question is not only which memory is most similar, but whether any retrieved memory is safe enough to influence the debugging trajectory.
Editorial analysis
A structured set of objections, weighed in public.
Referee Report
Summary. The manuscript introduces RSCB-MC, a risk-sensitive contextual bandit memory controller for LLM-based coding agents. It reframes memory retrieval as a selective control problem with actions including no memory use, top-resolution injection, summarization, high-precision or high-recall retrieval, abstention, or feedback requests. The approach stores knowledge via a pattern-variant-episode schema and maps retrieval evidence to a fixed 16-feature contextual state (relevance, uncertainty, structural compatibility, feedback history, false-positive risk, latency, token cost). Reward design penalizes false-positive injection more than missed reuse. Offline replay on deterministic smoke-scale artifacts yields 62.5% non-oracle success with 0.0% false positives; a bounded 200-case hot-path validation yields 60.5% proxy success with 0.0% false positives and 331.466 μs p95 decision latency.
Significance. If the results hold under more rigorous evaluation, the work could meaningfully improve safety in memory-augmented coding agents by treating abstention and non-injection as first-class actions. The risk-sensitive framing and explicit false-positive penalty are conceptually sound contributions. However, the narrow testbed (smoke-scale deterministic artifacts and bounded validation) and absence of protocol details limit claims of generalizability to real-world failures involving version-specific or multi-file semantic mismatches.
major comments (4)
- [Experimental Evaluation] Experimental Evaluation section: The reported 62.5% non-oracle success rate and 0.0% false-positive rate are presented without any description of the number of smoke-scale artifacts, the precise definition of 'success', the non-oracle baselines, statistical tests, or how trajectories were generated. This absence is load-bearing because the central empirical claim cannot be assessed for validity or reproducibility.
- [Method] Method section (contextual state and reward design): The 16-feature state vector and reward function (stronger penalty on false-positive injection) are described only qualitatively. No equations, feature extraction procedures, or specific values for free parameters such as false_positive_penalty_weight or contextual_feature_weights are supplied; a hedged sketch of one possible instantiation appears after this list. This directly undermines the claim that the state captures the true risk of unsafe injection.
- [Assumptions and Limitations] Assumptions paragraph (offline replay): The evaluation relies on fixed-trajectory offline replay, yet the manuscript does not address how memory injection would alter downstream agent actions or create new risks not penalized in the replay. This assumption is central to the 0.0% false-positive result and requires either justification or an online evaluation component.
- [Pattern-Variant-Episode Schema] Pattern-Variant-Episode Schema subsection: The schema is introduced as the storage mechanism for reusable issue knowledge, but no formal definition, variant-handling rules, or integration with the 16-feature extractor is provided. This is load-bearing for the memory-reuse component of the system.
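As a point of reference for what such a formalization could look like, here is a generic LinUCB-style policy over a 16-dimensional context. This is a standard contextual bandit sketch, not the authors' published algorithm; alpha plays the role of an exploration free parameter, and the asymmetric reward enters only through the scalar passed to update:

```python
import numpy as np

class LinUCBController:
    """Generic per-arm LinUCB over a fixed d-dimensional context;
    a sketch of how a risk-sensitive bandit *could* be instantiated,
    not the paper's published method."""

    def __init__(self, n_actions: int, d: int = 16, alpha: float = 1.0):
        self.alpha = alpha
        self.A = [np.eye(d) for _ in range(n_actions)]  # per-arm design matrices
        self.b = [np.zeros(d) for _ in range(n_actions)]

    def decide(self, x: np.ndarray) -> int:
        scores = []
        for A, b in zip(self.A, self.b):
            A_inv = np.linalg.inv(A)
            theta = A_inv @ b
            # Upper confidence bound: exploitation + alpha * uncertainty width
            scores.append(theta @ x + self.alpha * np.sqrt(x @ A_inv @ x))
        return int(np.argmax(scores))

    def update(self, action: int, x: np.ndarray, reward: float) -> None:
        # Ridge-regression sufficient statistics; the heavier
        # false-positive penalty is encoded in `reward` upstream.
        self.A[action] += np.outer(x, x)
        self.b[action] += reward * x
```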
minor comments (2)
- [Abstract] Abstract: The latency figure is reported to three decimal places (331.466 μs); consider rounding to whole microseconds or adding a brief note on measurement methodology for clarity.
- [Abstract] Notation: The term 'non-oracle offline replay success rate' is used without an explicit contrast to oracle performance; a short parenthetical definition would aid readers.
Simulated Author's Rebuttal
We thank the referee for the constructive and detailed review. The comments identify important gaps in experimental detail, formalization, and assumption discussion that we have addressed through targeted revisions to improve reproducibility and rigor. Our point-by-point responses follow.
read point-by-point responses
Referee: Experimental Evaluation section: The reported 62.5% non-oracle success rate and 0.0% false-positive rate are presented without any description of the number of smoke-scale artifacts, the precise definition of 'success', the non-oracle baselines, statistical tests, or how trajectories were generated. This absence is load-bearing because the central empirical claim cannot be assessed for validity or reproducibility.
Authors: We agree that these details were insufficient in the original Experimental Evaluation section, limiting the ability to assess validity and reproducibility. In the revised manuscript, we have expanded this section to specify the number of smoke-scale artifacts, the precise definition of success (including how it is measured in both the offline replay and the 200-case validation), the non-oracle baselines used for comparison, the statistical tests applied, and the process by which trajectories were generated for the offline replay. These additions make the reported 62.5% success rate and 0.0% false-positive rate fully evaluable. revision: yes
Referee: Method section (contextual state and reward design): The 16-feature state vector and reward function (stronger penalty on false-positive injection) are described only qualitatively. No equations, feature extraction procedures, or specific values for free parameters such as false_positive_penalty_weight or contextual_feature_weights are supplied. This directly undermines the claim that the state captures true risk of unsafe injection.
Authors: We agree that the qualitative description alone does not sufficiently support the risk-capture claim. The revised Method section now includes the explicit equations for the 16-feature contextual state vector, the detailed procedures for extracting each feature from retrieval evidence, the mathematical formulation of the reward function with its stronger penalty on false-positive injection, and the specific values used for free parameters such as false_positive_penalty_weight and contextual_feature_weights. These additions render the risk-sensitive design transparent and reproducible. revision: yes
Referee: Assumptions paragraph (offline replay): The evaluation relies on fixed-trajectory offline replay, yet the manuscript does not address how memory injection would alter downstream agent actions or create new risks not penalized in the replay. This assumption is central to the 0.0% false-positive result and requires either justification or an online evaluation component.
Authors: We acknowledge that the offline replay assumption requires explicit treatment because it underpins the 0.0% false-positive result. In the revised Assumptions and Limitations paragraph, we have added a justification explaining that, within the deterministic smoke-scale artifacts, downstream action alterations from memory injection are simulated and penalized through the full-trajectory reward signal. We also note the scope of this assumption and the value of future online evaluation for non-deterministic settings. This supplies the requested justification while remaining transparent about limitations. revision: yes
Referee: Pattern-Variant-Episode Schema subsection: The schema is introduced as the storage mechanism for reusable issue knowledge, but no formal definition, variant-handling rules, or integration with the 16-feature extractor is provided. This is load-bearing for the memory-reuse component of the system.
Authors: We agree that the absence of a formal definition limits understanding of the memory-reuse component. The revised manuscript now supplies a formal definition of the Pattern-Variant-Episode Schema, specifies the variant-handling rules, and details how schema elements are mapped to and integrated with the 16-feature contextual state extractor. These formalizations clarify the storage and retrieval pipeline. revision: yes
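Since the schema is otherwise described only by name, a minimal sketch of what a pattern-variant-episode store could look like; all field names here are assumptions, as the formal definition appears only in the revised manuscript:

```python
from dataclasses import dataclass, field

@dataclass
class Episode:
    """One concrete occurrence of an issue and how it was resolved."""
    issue_id: str
    stack_trace: str
    resolution_steps: list[str]
    outcome_success: bool

@dataclass
class Variant:
    """A context-specific variation of a pattern (e.g., version- or
    repo-specific symptoms) with the episodes that instantiate it."""
    variant_id: str
    distinguishing_signals: dict[str, str]
    episodes: list[Episode] = field(default_factory=list)

@dataclass
class Pattern:
    """A reusable issue pattern grouping its observed variants."""
    pattern_id: str
    summary: str
    variants: list[Variant] = field(default_factory=list)
```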
Circularity Check
No significant circularity; empirical results stand independently of inputs
full rationale
The paper introduces RSCB-MC as a contextual bandit controller with a fixed 16-feature state and asymmetric reward design, then evaluates it via offline replay on smoke-scale artifacts (reporting 62.5% non-oracle success, 0% FP) and a 200-case hot-path validation (60.5% proxy success, 0% FP, 331 µs latency). No equations, derivations, or self-citations are presented that reduce these reported metrics to fitted parameters or prior results by construction. The success rates are measured outcomes against explicit baselines and oracles on the given testbeds rather than tautological renamings or load-bearing self-references. The work is therefore self-contained against its external empirical benchmarks.
Axiom & Free-Parameter Ledger
free parameters (2)
- false_positive_penalty_weight
- contextual_feature_weights
axioms (2)
- Domain assumption: Standard contextual bandit assumptions (stationary context distribution, bounded rewards, sufficient exploration) hold for the memory-abstention decision problem.
- Domain assumption: Offline replay on deterministic artifacts faithfully predicts online agent behavior with real users and non-deterministic LLMs.
invented entities (1)
- pattern-variant-episode schema (no independent evidence)