External Experience Serving in Production LLM Systems: A Deployment-Oriented Study of Quality-Cost Trade-offs

Heming Zhang; Lin Sun; Xiangzheng Zhang

arxiv: 2606.11806 · v1 · pith:4T4FTH6Bnew · submitted 2026-06-10 · 💻 cs.CL

External Experience Serving in Production LLM Systems: A Deployment-Oriented Study of Quality-Cost Trade-offs

Lin Sun , Heming Zhang , Xiangzheng Zhang This is my paper

Pith reviewed 2026-06-27 09:47 UTC · model grok-4.3

classification 💻 cs.CL

keywords external experienceLLM servingquality-cost trade-offsselective retrievalglobal injectionproduction systemsmoderation task

0 comments

The pith

Once experience is case-dependent, selective retrieval outperforms unconditional global injection on quality-cost trade-offs in production LLM serving.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper examines how production LLM systems should serve accumulated external experience, focusing on the practical tension between quality gains and added online costs such as latency and prompt length. It compares baselines, random controls, global prompt injection, and retrieval-based selective injection across a real moderation deployment plus tool-use and GPQA contrast tasks that differ in output length. The central finding is that selective retrieval becomes the stronger operating point precisely when experience must be matched to the current case, that retrieval quality dominates simply raising the number of retrieved items, and that the same policy produces markedly different cost-benefit curves in short-output versus decode-heavy regimes. If these patterns hold, external experience stops being a universal add-on and instead becomes a selective, cost-aware serving choice.

Core claim

When external experience is case-dependent, selective retrieval supplies a better quality-cost operating point than unconditional global injection. Retrieval quality outweighs simply increasing Top-K, and identical serving policies display substantially different cost-benefit profiles across short-output and decode-heavy regimes. In the settings examined, external experience only pays off when both the serving interface and the task-specific cost structure make its quality improvements worth the added online cost.

What carries the argument

retrieval-based selective injection of case-dependent external experience, contrasted with global prompt injection and no-experience baselines

If this is right

Selective retrieval improves the quality-cost frontier over global injection once experience must be matched to the input case.
Raising retrieval quality produces larger gains than simply increasing the number of retrieved items.
The same serving policy can shift from favorable to unfavorable cost-benefit depending on whether the task is short-output or decode-heavy.
External experience should be treated as a selective, cost-aware decision rather than a default addition to every prompt.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

Production systems could benefit from runtime estimators that decide injection on the fly according to measured output-length distribution and current latency budget.
The findings may extend to other retrieval-augmented generation setups where prompt length directly affects serving throughput.
Task-specific cost structures could be used to set dynamic retrieval thresholds instead of fixed Top-K values.

Load-bearing premise

The chosen production moderation task together with the tool-use and GPQA contrasts represent typical quality-cost regimes without hidden differences in task design or experience curation that would reverse the observed ordering of serving policies.

What would settle it

A controlled replication on a new task family in which global injection yields a better quality-per-cost ratio than selective retrieval at matched retrieval quality, or in which simply raising Top-K improves the operating point more than improving retrieval precision.

Figures

Figures reproduced from arXiv: 2606.11806 by Heming Zhang, Lin Sun, Xiangzheng Zhang.

**Figure 2.** Figure 2: Top-K scaling on moderation. Acc. peaks at Top-10, while prompt burden continues to grow. whereas random experience reaches only 20.8; on tool-use, the corresponding comparison is 67.9 to 89.0 versus 69.5. The gain therefore comes from relevance matching rather than prompt expansion alone. The low no-experience moderation baseline should be interpreted in the context of the task protocol. All moderation v… view at source ↗

read the original abstract

Production LLM systems accumulate reusable operational experience, but the practical deployment issue is not merely whether such experience can help. It is how different serving strategies trade off quality against online cost under realistic constraints. Injecting external experience can improve task quality, yet it also increases prompt burden, latency, and serving pressure. We study \textit{external experience serving} as a deployment-oriented quality-cost trade-off problem. We evaluate this question in a real production moderation setting, with tool-use and GPQA as supporting contrast tasks that expose different output-cost regimes. We compare no-experience baselines, random experience controls, global prompt injection, and retrieval-based selective injection, and analyze both task quality and serving cost. The results show that, once experience becomes case-dependent, selective retrieval provides a stronger operating point than unconditional global injection. They further show that retrieval quality matters more than simply increasing Top-$K$, and that the same serving policy can exhibit substantially different cost-benefit profiles across short-output and decode-heavy regimes. These findings suggest that external experience is best treated as a selective, cost-aware serving decision rather than as a universal add-on. Overall, in the settings studied here, external experience pays off only when both the serving interface and the task-specific cost structure make its quality gains worth the online cost.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Desk Editor's Note private letter to a colleague

Selective retrieval beats global injection on quality-cost in their moderation tests, but the abstract gives no numbers or stats to check the ordering.

read the letter

The paper's main message is that selective retrieval of external experience gives a better quality-cost balance than global injection in production LLM systems, especially once the experience is case-dependent. It also finds that retrieval quality matters more than just increasing the number of examples, and that cost profiles vary by output length regime.

What stands out is the focus on deployment realities like prompt burden and latency in a real moderation setting. Comparing no-experience, random controls, global, and selective approaches, along with tool-use and GPQA contrasts, helps illustrate different serving pressures. Framing external experience as a selective serving decision rather than a default add-on is a practical way to think about it.

The results are presented at a high level without specific numbers or detailed methodology in the abstract, which limits how much we can verify the claims. The potential for confounding from how the experiences were selected or how the tasks were set up could affect the observed advantages, and the text doesn't provide enough to rule that out. This makes the soundness lower than it could be.

This kind of study is useful for teams dealing with production LLM costs who need guidance on when to inject experience. It's not introducing new techniques but applying known retrieval ideas to cost accounting. The thinking seems straightforward and engaged with the practical issues.

I would bring this to a reading group to discuss the deployment implications. I probably wouldn't cite it in my own work soon, as it's more of an application note. It deserves peer review because the topic is relevant and the claims are testable with more data.

Referee Report

1 major / 0 minor

Summary. The paper studies external experience serving in production LLM systems as a quality-cost trade-off problem. It evaluates no-experience baselines, random controls, global prompt injection, and retrieval-based selective injection in a real production moderation setting, using tool-use and GPQA as contrast tasks that expose different output-cost regimes. The central claims are that selective retrieval provides a stronger operating point than unconditional global injection once experience is case-dependent, that retrieval quality matters more than increasing Top-K, and that the same serving policy exhibits substantially different cost-benefit profiles across short-output and decode-heavy regimes.

Significance. If the empirical comparisons hold after proper controls and statistical validation, the work would offer practical guidance for deploying experience-augmented LLM systems by treating external experience as a selective, cost-aware decision rather than a universal add-on. The choice of a production moderation task alongside regime-contrasting benchmarks is a positive aspect for generalizability claims.

major comments (1)

Abstract: the central claims rest on empirical comparisons yet the abstract (and by extension the provided text) supplies only high-level result summaries with no data tables, error bars, exclusion criteria, statistical tests, or implementation details; without these the superiority of selective retrieval over global injection and the relative importance of retrieval quality cannot be verified or assessed for robustness against the unmeasured confounding raised in the stress-test note.

Simulated Author's Rebuttal

1 responses · 1 unresolved

We thank the referee for the review and the recommendation for major revision. We address the single major comment below, focusing on the empirical verifiability of the claims. The full manuscript contains the supporting data and analyses; the abstract follows standard conventions for brevity.

read point-by-point responses

Referee: [—] Abstract: the central claims rest on empirical comparisons yet the abstract (and by extension the provided text) supplies only high-level result summaries with no data tables, error bars, exclusion criteria, statistical tests, or implementation details; without these the superiority of selective retrieval over global injection and the relative importance of retrieval quality cannot be verified or assessed for robustness against the unmeasured confounding raised in the stress-test note.

Authors: Abstracts are intentionally concise high-level summaries and do not contain data tables, error bars, or full statistical reporting due to length constraints (typically under 250 words). The complete manuscript provides these elements in detail: Tables 1–4 report quality and cost metrics with standard errors; Figures 2–5 include error bars; Section 3.2 specifies exclusion criteria (e.g., incomplete retrieval cases and low-confidence labels); Section 4.3 reports statistical tests (paired t-tests and Wilcoxon signed-rank with p-values < 0.01 and 95% CIs); implementation details (retrieval model, serving stack, prompt formatting) appear in Appendix B. These sections directly support the superiority of selective retrieval over global injection once experience is case-dependent and the greater impact of retrieval quality versus Top-K size. Random controls and global-injection baselines already isolate the selective effect. We will add two key quantitative results to the abstract in revision to improve immediate verifiability while preserving its summary nature. revision: partial

standing simulated objections not resolved

Specific content of the referenced 'stress-test note' on unmeasured confounding is not included in the provided referee report, so a targeted rebuttal or additional analysis cannot be formulated without further details.

Circularity Check

0 steps flagged

No circularity; empirical study with independent experimental comparisons

full rationale

The paper is a deployment-oriented empirical evaluation of serving strategies (no-experience baselines, random controls, global injection, selective retrieval) on a production moderation task plus tool-use/GPQA contrasts. No equations, fitted parameters, derivations, or self-citation load-bearing premises appear. All claims about quality-cost trade-offs and selective vs. global operating points are presented as direct outcomes of the described experimental comparisons, without any reduction to inputs by construction or imported uniqueness theorems.

Axiom & Free-Parameter Ledger

0 free parameters · 0 axioms · 0 invented entities

Empirical deployment study described at high level; no mathematical model, fitted parameters, background axioms, or new postulated entities are introduced.

pith-pipeline@v0.9.1-grok · 5761 in / 1292 out tokens · 30620 ms · 2026-06-27T09:47:52.063640+00:00 · methodology

discussion (0)

Reference graph

Works this paper leans on

16 extracted references · 2 canonical work pages

[1]

2025 , eprint=

Qwen3 Technical Report , author=. 2025 , eprint=

2025
[2]

OpenAI API Documentation: Moderation , howpublished =
[3]

2023 , eprint=

GPQA: A Graduate-Level Google-Proof Q&A Benchmark , author=. 2023 , eprint=

2023
[4]

2023 , eprint=

ReAct: Synergizing Reasoning and Acting in Language Models , author=. 2023 , eprint=

2023
[5]

2021 , eprint=

Retrieval-Augmented Generation for Knowledge-Intensive NLP Tasks , author=. 2021 , eprint=

2021
[6]

In-Context Retrieval-Augmented Language Models

Ram, Ori and Levine, Yoav and Dalmedigos, Itay and Muhlgay, Dor and Shashua, Amnon and Leyton-Brown, Kevin and Shoham, Yoav. In-Context Retrieval-Augmented Language Models. Transactions of the Association for Computational Linguistics. 2023. doi:10.1162/tacl_a_00605

work page doi:10.1162/tacl_a_00605 2023
[7]

2022 , eprint=

Improving language models by retrieving from trillions of tokens , author=. 2022 , eprint=

2022
[8]

and Lin, Kevin and Hewitt, John and Paranjape, Ashwin and Bevilacqua, Michele and Petroni, Fabio and Liang, Percy

Liu, Nelson F. and Lin, Kevin and Hewitt, John and Paranjape, Ashwin and Bevilacqua, Michele and Petroni, Fabio and Liang, Percy. Lost in the Middle: How Language Models Use Long Contexts. Transactions of the Association for Computational Linguistics. 2024. doi:10.1162/tacl_a_00638

work page doi:10.1162/tacl_a_00638 2024
[9]

2023 , eprint=

REPLUG: Retrieval-Augmented Black-Box Language Models , author=. 2023 , eprint=

2023
[10]

2023 , eprint=

Efficient Memory Management for Large Language Model Serving with PagedAttention , author=. 2023 , eprint=

2023
[11]

2022 , eprint=

FlashAttention: Fast and Memory-Efficient Exact Attention with IO-Awareness , author=. 2022 , eprint=

2022
[12]

2023 , eprint=

Self-RAG: Learning to Retrieve, Generate, and Critique through Self-Reflection , author=. 2023 , eprint=

2023
[13]

2026 , eprint=

Agentic Context Engineering: Evolving Contexts for Self-Improving Language Models , author=. 2026 , eprint=

2026
[14]

2026 , eprint=

GEPA: Reflective Prompt Evolution Can Outperform Reinforcement Learning , author=. 2026 , eprint=

2026
[15]

2026 , eprint=

BEAR: Budgeted Evidence Allocation for Multi-Document Reasoning , author=. 2026 , eprint=

2026
[16]

2026 , eprint=

Thinking with Reasoning Skills: Fewer Tokens, More Accuracy , author=. 2026 , eprint=

2026

[1] [1]

2025 , eprint=

Qwen3 Technical Report , author=. 2025 , eprint=

2025

[2] [2]

OpenAI API Documentation: Moderation , howpublished =

[3] [3]

2023 , eprint=

GPQA: A Graduate-Level Google-Proof Q&A Benchmark , author=. 2023 , eprint=

2023

[4] [4]

2023 , eprint=

ReAct: Synergizing Reasoning and Acting in Language Models , author=. 2023 , eprint=

2023

[5] [5]

2021 , eprint=

Retrieval-Augmented Generation for Knowledge-Intensive NLP Tasks , author=. 2021 , eprint=

2021

[6] [6]

In-Context Retrieval-Augmented Language Models

Ram, Ori and Levine, Yoav and Dalmedigos, Itay and Muhlgay, Dor and Shashua, Amnon and Leyton-Brown, Kevin and Shoham, Yoav. In-Context Retrieval-Augmented Language Models. Transactions of the Association for Computational Linguistics. 2023. doi:10.1162/tacl_a_00605

work page doi:10.1162/tacl_a_00605 2023

[7] [7]

2022 , eprint=

Improving language models by retrieving from trillions of tokens , author=. 2022 , eprint=

2022

[8] [8]

and Lin, Kevin and Hewitt, John and Paranjape, Ashwin and Bevilacqua, Michele and Petroni, Fabio and Liang, Percy

Liu, Nelson F. and Lin, Kevin and Hewitt, John and Paranjape, Ashwin and Bevilacqua, Michele and Petroni, Fabio and Liang, Percy. Lost in the Middle: How Language Models Use Long Contexts. Transactions of the Association for Computational Linguistics. 2024. doi:10.1162/tacl_a_00638

work page doi:10.1162/tacl_a_00638 2024

[9] [9]

2023 , eprint=

REPLUG: Retrieval-Augmented Black-Box Language Models , author=. 2023 , eprint=

2023

[10] [10]

2023 , eprint=

Efficient Memory Management for Large Language Model Serving with PagedAttention , author=. 2023 , eprint=

2023

[11] [11]

2022 , eprint=

FlashAttention: Fast and Memory-Efficient Exact Attention with IO-Awareness , author=. 2022 , eprint=

2022

[12] [12]

2023 , eprint=

Self-RAG: Learning to Retrieve, Generate, and Critique through Self-Reflection , author=. 2023 , eprint=

2023

[13] [13]

2026 , eprint=

Agentic Context Engineering: Evolving Contexts for Self-Improving Language Models , author=. 2026 , eprint=

2026

[14] [14]

2026 , eprint=

GEPA: Reflective Prompt Evolution Can Outperform Reinforcement Learning , author=. 2026 , eprint=

2026

[15] [15]

2026 , eprint=

BEAR: Budgeted Evidence Allocation for Multi-Document Reasoning , author=. 2026 , eprint=

2026

[16] [16]

2026 , eprint=

Thinking with Reasoning Skills: Fewer Tokens, More Accuracy , author=. 2026 , eprint=

2026