Inference-Time Distillation: Cost-Efficient Agents Without Fine-Tuning or Manual Prompt Engineering

Asanshay Gupta; James Hong; Kayvon Fatahalian; Michaël Gharbi; Vishnu Sarukkai

arxiv: 2512.02543 · v3 · submitted 2025-12-02 · 💻 cs.LG

Pith reviewed 2026-05-17 02:51 UTC · model grok-4.3

keywords: inference-time distillation, cost-efficient agents, LLM agents, dynamic in-context learning, self-consistency, teacher-student cascade, ALFWorld, AppWorld

The pith

LLM agents match teacher accuracy at 2.5 times lower cost by retrieving demonstrations from a small task subset and escalating only on student disagreements.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper shows that an expensive teacher model can generate demonstrations on just a small number of tasks, after which a cheaper student model handles the remaining work by retrieving relevant examples as context and running multiple samples to check for agreement. When the student outputs agree it proceeds; when they diverge it falls back to the teacher. This produces 2.5 times lower cost while matching accuracy on ALFWorld and 3.5 times lower cost while recovering 79 percent of accuracy on AppWorld. The entire process uses only standard inference techniques, so no fine-tuning or manual prompt engineering is needed and teams can deploy or adjust the system quickly.

Core claim

The teacher runs on a small task subset to collect demonstrations, and a cheaper student is then deployed that retrieves those demonstrations as in-context examples. A self-consistency cascade accepts the student's output when its samples agree and falls back to the teacher when they disagree. The result is 2.5 times lower cost at teacher-level accuracy on ALFWorld and 3.5 times lower cost with 79 percent of teacher accuracy recovered on AppWorld, without any training or prompt engineering.

What carries the argument

A teacher-student cascade that retrieves relevant teacher demonstrations for each student query and falls back to the teacher only when multiple student samples disagree.
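
The cascade described above can be sketched in a few lines. This is a minimal illustration, not the paper's implementation: the function name `cascade_step`, the parameters `n_samples` and `min_agreement`, the exact-string vote, and the stub models are all assumptions introduced here. Real agent actions would likely need normalization before they can be compared.

```python
from collections import Counter
from typing import Callable, Tuple


def cascade_step(
    prompt: str,
    student: Callable[[str], str],
    teacher: Callable[[str], str],
    n_samples: int = 3,          # assumed default, not from the paper
    min_agreement: float = 1.0,  # assumed: 1.0 means all samples must match
) -> Tuple[str, bool]:
    """Return (action, used_teacher) for one agent step."""
    # Draw several student samples for the same prompt.
    samples = [student(prompt) for _ in range(n_samples)]
    # Find the most common answer and how many samples voted for it.
    action, votes = Counter(samples).most_common(1)[0]
    if votes / n_samples >= min_agreement:
        return action, False  # student samples agree: accept
    return teacher(prompt), True  # disagreement: fall back to the teacher


# Hypothetical stubs standing in for real LLM calls.
def make_student(outputs):
    it = iter(outputs)
    return lambda prompt: next(it)


def stub_teacher(prompt: str) -> str:
    return "go to desk 1"


agree = make_student(["open drawer 1"] * 3)
split = make_student(["go to desk 1", "go to shelf 2", "go to desk 1"])
print(cascade_step("step prompt", agree, stub_teacher))  # ('open drawer 1', False)
print(cascade_step("step prompt", split, stub_teacher))  # ('go to desk 1', True)
```

Lowering `min_agreement` (for example to 0.6) would accept a two-out-of-three majority, trading some accuracy risk for fewer teacher calls.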

Load-bearing premise

Demonstrations collected from the teacher on a small task subset remain representative for the full workload and simple retrieval plus consistency checks work without additional task-specific engineering.

What would settle it

Applying the method to a new workload where the student frequently disagrees and falls back to the teacher, resulting in accuracy well below 79 percent or cost savings far below the reported 2.5 times, would show the central claim does not hold.
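
To see why a high fallback rate would erode the savings, a simplified per-step cost model helps. The sketch below is an assumption made for illustration, not the paper's accounting: it charges every step for all student samples plus the teacher whenever a fallback occurs, and it ignores demonstration-collection cost and the extra prompt tokens from retrieved examples. All numeric values are made up.

```python
def expected_step_cost(c_student: float, c_teacher: float,
                       n_samples: int, fallback_rate: float) -> float:
    """Expected cost of one cascade step under the simplified model."""
    if not 0.0 <= fallback_rate <= 1.0:
        raise ValueError("fallback_rate must lie in [0, 1]")
    # Student samples are always paid for; the teacher only on fallback.
    return n_samples * c_student + fallback_rate * c_teacher


def savings_factor(c_student: float, c_teacher: float,
                   n_samples: int, fallback_rate: float) -> float:
    """How many times cheaper the cascade is than always calling the teacher."""
    return c_teacher / expected_step_cost(c_student, c_teacher,
                                          n_samples, fallback_rate)


def break_even_fallback_rate(c_student: float, c_teacher: float,
                             n_samples: int) -> float:
    """Fallback rate at which the cascade costs as much as the teacher alone."""
    return 1.0 - n_samples * c_student / c_teacher


# Made-up prices: the student is 20x cheaper than the teacher per call.
for rate in (0.1, 0.5, 0.85):
    print(rate, round(savings_factor(0.05, 1.0, 3, rate), 2))
```

Under these made-up prices the savings shrink quickly as the fallback rate rises, and above the break-even rate the cascade costs more than the teacher alone.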

Figures

Figures reproduced from arXiv: 2512.02543 by Asanshay Gupta, James Hong, Kayvon Fatahalian, Michaël Gharbi, Vishnu Sarukkai.

**Figure 1.** Overview of the in-context distillation pipeline. 1) Demonstration collection phase: the teacher LLM creates a dataset of exemplars to be stored in a vector database. 2) Inference phase: at each decision-making step for the agent, the most relevant in-context examples are retrieved to inject into the student LLM’s prompt. The student then produces multiple samples to be evaluated for self-consistency. If i…

**Figure 2.** Combining in-context learning with cascades optimizes cost-accuracy tradeoffs. Cost-accuracy tradeoff for a variety of different model selections and techniques. Accuracy numbers for the IC + Cascade experiments break the Pareto frontier defined by the rest of the examples, performing better than the teacher on ALFWorld and significantly above others at a similar cost on both domains. cascades vs Random Mi…

**Figure 3.** Retrieving more in-context examples can boost task accuracy in exchange for higher costs. Cost-accuracy tradeoff for varying numbers of retrieved in-context exemplars (k, labeled on each datapoint) on ALFWorld and AppWorld (Student IC, no cascade). On ALFWorld, accuracy improves rapidly from k=1 to k=4, then exhibits diminishing returns beyond k=6. On AppWorld, accuracy peaks at k=5 with more modest overa…

Original abstract

Deploying LLM agents at scale typically requires choosing between quality and cost. Existing cost-reduction approaches fail to preserve agility: the ability to iterate rapidly without human time bottlenecks. Prompt engineering is brittle and slows iteration, while fine-tuning requires multi-day training and commitment to fixed designs; both are impractical for iterative workflows and time-sensitive batch jobs. We demonstrate that established inference-time techniques--dynamic in-context learning and self-consistency cascades--can be leveraged to shift the cost-accuracy Pareto frontier while preserving agility. Practitioners run the teacher on a small task subset to collect demonstrations, then immediately deploy a cheaper student on the remainder. At each step, the system retrieves relevant teacher demonstrations as in-context examples. When multiple student samples agree, we proceed; when they diverge, we fall back to the teacher. This requires no prompt engineering or training. On ALFWorld, we match teacher accuracy at 2.5x lower cost (0.059 to 0.024 per episode). On AppWorld, we achieve 3.5x cost reduction while recovering 79% of teacher accuracy. Our empirical analyses provide guidance on key design choices: teacher database size, demonstration set size, retrieval strategy, and cascade thresholds. These analyses highlight inference-time levers for navigating cost-performance tradeoffs without sacrificing human development speed.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Referee Report

2 major / 2 minor

Summary. The paper proposes Inference-Time Distillation, an approach that collects demonstrations from a teacher LLM on a small task subset, then deploys a cheaper student model on the remainder using retrieval-based in-context learning augmented by self-consistency cascades that trigger teacher fallbacks on disagreement. No fine-tuning or manual prompt engineering is required. The central claims are a 2.5x cost reduction on ALFWorld while matching teacher accuracy (0.059 to 0.024 per episode) and a 3.5x cost reduction on AppWorld while recovering 79% of teacher accuracy, supported by empirical analyses of design choices including database size, demonstration set size, retrieval strategy, and cascade thresholds.

Significance. If the results hold, the work is significant for enabling cost-efficient LLM agent deployment while preserving agility for rapid iteration, a practical advantage over fine-tuning or brittle prompt engineering. It usefully combines established inference-time methods (dynamic ICL and consistency cascades) and supplies concrete empirical guidance on hyperparameters. Credit is due for the focus on reproducible design levers and the concrete benchmark numbers on ALFWorld and AppWorld.

major comments (2)

§4 (Experimental Setup) and abstract: the central claim that a small teacher demonstration subset remains representative for retrieval on the full workload is load-bearing for the reported cost reductions, yet no direct test of subset-to-remainder distribution shift (e.g., long-tail task coverage in ALFWorld or AppWorld) is provided; without this, higher fallback rates could invalidate the 2.5x and 3.5x savings.
Results section, ALFWorld and AppWorld tables: the concrete cost and accuracy figures lack error bars, run counts, or statistical tests, and post-hoc choices for thresholds and database size are not shown to have been pre-specified, weakening confidence in the Pareto-frontier claims.

minor comments (2)

Method section: the precise definition of the cascade threshold and retrieval scoring function would benefit from an explicit equation or pseudocode to aid reproducibility.
Figure captions and tables: axis labels and cost units could be clarified for immediate readability without reference to the main text.

Simulated Author's Rebuttal

2 responses · 0 unresolved

We thank the referee for the constructive comments. We address each major point below and indicate planned revisions to strengthen the manuscript.

Referee: §4 (Experimental Setup) and abstract: the central claim that a small teacher demonstration subset remains representative for retrieval on the full workload is load-bearing for the reported cost reductions, yet no direct test of subset-to-remainder distribution shift (e.g., long-tail task coverage in ALFWorld or AppWorld) is provided; without this, higher fallback rates could invalidate the 2.5x and 3.5x savings.

Authors: We agree this is a valid concern and that an explicit analysis of distribution shift would strengthen the claims. Our experiments show low fallback rates in practice, which indirectly supports representativeness, but we did not provide a dedicated comparison of task distributions or long-tail coverage between the demonstration subset and remainder. In revision we will add an analysis (new figure or table in §4 or appendix) reporting task-type coverage, diversity metrics, and fallback rates as a function of demonstration-set size to quantify any impact on the reported cost savings.
Revision: yes

Referee: Results section, ALFWorld and AppWorld tables: the concrete cost and accuracy figures lack error bars, run counts, or statistical tests, and post-hoc choices for thresholds and database size are not shown to have been pre-specified, weakening confidence in the Pareto-frontier claims.

Authors: We acknowledge that the current presentation lacks error bars, run counts, and explicit description of how thresholds and database sizes were selected. These choices were made via a held-out validation split, but this procedure was not stated. We will revise the Results section and tables to report standard deviations over multiple random seeds (e.g., 5 runs), add the run count, and clarify the validation-based selection process. We will also include a brief sensitivity plot for the key hyperparameters to support the Pareto claims.
Revision: yes
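
The reporting promised in the second response (a standard deviation over several seeds plus the run count) could be produced along the following lines. This is only a sketch: the helper names `summarize` and `format_result` are hypothetical, and the per-seed values are placeholders, not results from the paper.

```python
from statistics import mean, stdev
from typing import Sequence, Tuple


def summarize(runs: Sequence[float]) -> Tuple[float, float]:
    """Return (mean, sample standard deviation) over per-seed results."""
    m = mean(runs)
    # The sample standard deviation needs at least two runs.
    s = stdev(runs) if len(runs) > 1 else 0.0
    return m, s


def format_result(runs: Sequence[float]) -> str:
    """Render a table cell such as '0.500 ± 0.100 (n=5)'."""
    m, s = summarize(runs)
    return f"{m:.3f} ± {s:.3f} (n={len(runs)})"


# Placeholder per-seed accuracies, purely illustrative.
placeholder_accuracy = [0.90, 0.92, 0.88, 0.91, 0.89]
print(format_result(placeholder_accuracy))
```

Reporting the run count next to the spread lets a reader judge how much weight the standard deviation can bear.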

Circularity Check

0 steps flagged

No circularity: empirical measurements on external benchmarks

The paper describes an empirical inference-time method that collects teacher demonstrations on a small task subset, then applies retrieval-based in-context learning plus consistency-based fallbacks with a cheaper student model. Reported outcomes (2.5x cost reduction matching teacher accuracy on ALFWorld; 3.5x cost reduction recovering 79% accuracy on AppWorld) are direct experimental measurements against separate external benchmarks and an independent teacher model. No derivation chain, equation, or central claim reduces by construction to fitted inputs, self-citations, or renamed ansatzes; the method contains no mathematical prediction step whose output is definitionally identical to its inputs. The approach is therefore self-contained against external validation.

Axiom & Free-Parameter Ledger

3 free parameters · 0 axioms · 0 invented entities

The method depends on several empirical design choices whose values are not derived from first principles and must be selected or tuned for each workload.

free parameters (3)

demonstration set size: Number of teacher examples stored and retrieved; directly affects cost and accuracy but chosen empirically.
cascade threshold: Agreement level or number of samples required before accepting student output; tuned to balance cost and quality.
retrieval strategy parameters: Similarity metric and top-k count for fetching relevant demonstrations.
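
To make the retrieval parameters concrete, here is a minimal top-k retrieval sketch over precomputed embeddings. It assumes cosine similarity as the metric and a plain in-memory list in place of a vector database. The names `cosine` and `retrieve_top_k` and the toy vectors are illustrative, and this page does not specify the paper's embedding model or scoring function.

```python
import math
from typing import List, Sequence, Tuple


def cosine(a: Sequence[float], b: Sequence[float]) -> float:
    """Cosine similarity; returns 0.0 if either vector has zero norm."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


def retrieve_top_k(
    query_vec: Sequence[float],
    demos: List[Tuple[Sequence[float], str]],  # (embedding, demonstration text)
    k: int,
) -> List[str]:
    """Return the k demonstration texts most similar to the query."""
    ranked = sorted(demos, key=lambda d: cosine(query_vec, d[0]), reverse=True)
    return [text for _, text in ranked[:k]]


# Toy 2-D embeddings standing in for real ones (hypothetical).
demo_store = [
    ([1.0, 0.0], "teacher demo: heat an object"),
    ([0.0, 1.0], "teacher demo: clean an object"),
    ([1.0, 1.0], "teacher demo: heat then place"),
]
print(retrieve_top_k([1.0, 0.1], demo_store, k=2))
```

A larger `k` puts more demonstrations in the student's prompt, which matches the tradeoff in the Figure 3 caption: more retrieved examples can raise accuracy at higher cost.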

Lean theorems connected to this paper

Citations machine-checked in the Pith Canon. Every link opens the source theorem in the public Lean library.

IndisputableMonolith/Foundation/RealityFromDistinction.lean · reality_from_one_distinction · unclear
Paper passage: "Our approach retrieves relevant teacher demonstrations at each agent step and provides them to the student as in-context examples... When samples agree, we proceed; when they diverge, we fall back to the teacher."

IndisputableMonolith/Cost/FunctionalEquation.lean · washburn_uniqueness_aczel · unclear
Paper passage: "in-context distillation... self-consistency cascades"

What do these tags mean?

matches: The paper's claim is directly supported by a theorem in the formal canon.
supports: The theorem supports part of the paper's argument, but the paper may add assumptions or extra steps.
extends: The paper goes beyond the formal theorem; the theorem is a base layer rather than the whole result.
uses: The paper appears to rely on the theorem as machinery.
contradicts: The paper's claim conflicts with a theorem or certificate in the canon.
unclear: Pith found a possible connection, but the passage is too broad, indirect, or ambiguous to say the theorem truly supports the claim.
