Skills on the Fly: Test-Time Adaptive Skill Synthesis for LLM Agents

Chenyu Zhou; Jianghao Lin; Jingxing Wang; Jun Wang; Weinan Zhang; Weiwen Liu; Zhihui Fu

Synthesizing temporary task-specific skills from a few retrieved trajectories at test time raises LLM agent success rates on benchmarks without updating the model.

Reviewed by Pith at T0; open to challenge. T0 means a machine referee read the full paper against a public rubric.

T0 review · grok-4.3

2026-05-19 20:28 UTC pith:H5FEKJ65

Load-bearing objection: SkillTTA shows a workable test-time way to build temporary task-specific skills for LLM agents from a few retrieved trajectories, with decent benchmark lifts but thin details on the synthesis mechanics.

arxiv 2605.16986 v2 pith:H5FEKJ65 submitted 2026-05-16 cs.CL cs.AI

Jingxing Wang, Chenyu Zhou, Zhihui Fu, Jun Wang, Weiwen Liu, Weinan Zhang, Jianghao Lin

classification cs.CL cs.AI

keywords LLM agents, test-time adaptation, skill synthesis, trajectory retrieval, agent benchmarks, SpreadsheetBench, ALFWorld, BigCodeBench

verification ladder T0 review T1 audit T2 compute T3 formal T4 reserved

The pith

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper introduces SkillTTA to give LLM agents guidance tailored to each new test task rather than relying on a fixed library of skills. It works by pulling a small set of relevant past trajectories, including failures, and turning them into a short-lived textual skill that is fed as context to an unchanged solver model. This yields clear gains on SpreadsheetBench and BigCodeBench over static synthesis baselines and matches a memory-based method on ALFWorld while producing shorter successful paths. The adaptation occurs entirely through the generated context, and ablations show that the synthesized skills beat raw trajectory prompting and that failed examples are particularly helpful. The central premise is that a compact retrieved set carries enough signal to produce coherent, task-specific instructions.

Core claim

SkillTTA retrieves a small number of training trajectories relevant to the current task and synthesizes them into a temporary, task-specific textual skill that is supplied to the fixed solver model, producing higher Pass@1 rates than static trajectory-to-skill methods on SpreadsheetBench and BigCodeBench and competitive results with shorter trajectories on ALFWorld.

What carries the argument

SkillTTA, the test-time process that retrieves a small set of trajectories and synthesizes them into a temporary task-specific textual skill used as context.
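
The retrieve-then-synthesize loop described above can be sketched roughly as follows. This is a minimal illustration under stated assumptions, not the authors' implementation: the `Trajectory` record, the Jaccard token-overlap similarity, the prompt wording, and the `llm` callable are all hypothetical, since the retrieval metric and the synthesis prompt are not specified in the material reviewed here.

```python
from dataclasses import dataclass
from typing import Callable, List


@dataclass
class Trajectory:
    task: str      # description of the task the run was collected on
    steps: str     # flattened action/observation log
    success: bool  # failed runs are kept on purpose


def jaccard(a: str, b: str) -> float:
    """Token-overlap similarity; a stand-in for the unspecified retrieval metric."""
    ta, tb = set(a.lower().split()), set(b.lower().split())
    union = ta | tb
    return len(ta & tb) / len(union) if union else 0.0


def retrieve(task: str, pool: List[Trajectory], k: int = 3) -> List[Trajectory]:
    """Return the k most similar trajectories, successes and failures alike."""
    return sorted(pool, key=lambda t: jaccard(task, t.task), reverse=True)[:k]


def build_synthesis_prompt(task: str, retrieved: List[Trajectory]) -> str:
    """Ask the model to distill the examples into one short, task-specific skill."""
    examples = "\n\n".join(
        f"[{'SUCCEEDED' if t.success else 'FAILED'}] {t.task}\n{t.steps}"
        for t in retrieved
    )
    return (
        "Write a short skill for the target task. Reuse what worked in the "
        "examples and warn against what failed.\n\n"
        f"Examples:\n{examples}\n\nTarget task:\n{task}"
    )


def solve_with_skill(
    task: str, pool: List[Trajectory], llm: Callable[[str], str], k: int = 3
) -> str:
    """Synthesize a temporary skill, then call the unchanged solver with it as context."""
    skill = llm(build_synthesis_prompt(task, retrieve(task, pool, k)))
    return llm(f"Skill:\n{skill}\n\nTask:\n{task}")
```

The skill is rebuilt for every task and discarded afterwards, so the only thing that differs between tasks is the context passed to `llm`; no model parameters are touched in this sketch.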

Load-bearing premise

A small set of retrieved training trajectories that includes failures can be turned into a coherent task-specific textual skill that supplies better guidance than static skills or raw trajectories.

What would settle it

A controlled run on the same benchmarks in which skills synthesized from the retrieved trajectories produce no improvement or a drop relative to static skill synthesis or raw trajectory prompting.

If this is right

Task-specific skills raise SpreadsheetBench Pass@1 from 0.397 to 0.505 and BigCodeBench Pass@1 from 0.517 to 0.651 compared with static synthesis.
On ALFWorld the method matches a heavier memory-learning baseline while generating the shortest successful trajectories.
Synthesized skills outperform raw trajectory prompting on SpreadsheetBench.
Small top-k retrieval sets and inclusion of failed trajectories improve the quality of the generated skill.
Adaptation occurs solely through added context without any parameter updates to the solver.
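
To put the two Pass@1 lifts quoted above on a common footing, the short calculation below derives the absolute and relative gains from the reported figures. The helper name `lift` is illustrative; only the four Pass@1 values come from the review.

```python
from typing import Tuple


def lift(baseline: float, method: float) -> Tuple[float, float]:
    """Return (absolute Pass@1 gain, gain relative to the baseline)."""
    gain = method - baseline
    return gain, gain / baseline


reported = [
    ("SpreadsheetBench", 0.397, 0.505),
    ("BigCodeBench", 0.517, 0.651),
]
for name, base, new in reported:
    abs_gain, rel_gain = lift(base, new)
    print(f"{name}: +{abs_gain:.3f} absolute, +{rel_gain:.1%} relative")
# SpreadsheetBench: +0.108 absolute, +27.2% relative
# BigCodeBench: +0.134 absolute, +25.9% relative
```

Both benchmarks show a relative improvement of roughly a quarter over static synthesis, although without run counts or seeds the uncertainty on these deltas is unknown.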

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

The approach could let agents handle task variations that were not seen during skill-library construction without any retraining step.
Dynamic synthesis from small retrieved sets may scale more efficiently than maintaining or expanding large static skill libraries.
Combining this retrieval-plus-synthesis step with other lightweight adaptation methods could be tested on additional agent environments.
The value of failed trajectories suggests that error signals from past runs are worth preserving for future synthesis.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit.

Desk Editor's Note

SkillTTA shows a workable test-time way to build temporary task-specific skills for LLM agents from a few retrieved trajectories, with decent benchmark lifts but thin details on the synthesis mechanics.

Hey colleague, SkillTTA is basically a way to adapt LLM agents at test time by grabbing a small number of similar past trajectories and synthesizing them into a custom skill note that the agent can use for the current task. The model stays the same; only the context changes. This seems like a straightforward idea that could help in settings where static skills fall short.

The paper does a solid job showing results across SpreadsheetBench, BigCodeBench, and ALFWorld. The improvements over static trajectory-to-skill methods are clear, with Pass@1 going up noticeably on the first two. On ALFWorld it performs comparably to a heavier baseline but gets shorter trajectories, which is a nice bonus. The ablations add value by checking that the synthesized skills work better than just prompting with raw trajectories, that keeping the number of retrieved items small is preferable, and that failed trajectories contribute positively by revealing common errors.

Where it could be stronger is in the description of the synthesis step itself. The abstract doesn't give much on how they turn the trajectories into the skill text, what retrieval similarity they use, or the prompt engineering involved. That makes it harder to fully assess how general or easy to implement this is. The numbers are reported without much on statistical significance or multiple runs, so some uncertainty remains there.

Overall, this is the kind of work that would interest folks building agents for code or planning tasks who want simple adaptation tricks. It engages honestly with the literature on skills and memory in agents, and the experiments are targeted. I think it has enough substance to go to peer review, though the authors should expand on the method details in a revision.

Referee Report

0 major / 2 minor

Summary. The manuscript introduces SkillTTA, a test-time adaptive skill synthesis method for LLM agents. A small set of relevant training trajectories (including failures) is retrieved and synthesized into a temporary task-specific textual skill that guides a fixed solver model via context rather than parameter updates. On SpreadsheetBench the method raises Pass@1 from 0.397 to 0.505 and on BigCodeBench from 0.517 to 0.651 relative to static trajectory-to-skill synthesis; on ALFWorld it matches a heavier memory-learning baseline while producing the shortest successful trajectories. Ablations indicate that synthesized skills outperform raw trajectory prompting, that small top-k retrieval is preferable, and that failed trajectories add value by exposing recurring mistakes.

Significance. If the synthesis procedure proves robust and the reported gains hold under replication, the work offers a lightweight, parameter-free route to task-specific adaptation that avoids both fine-tuning and large memory stores. The explicit demonstration that failure trajectories improve skill quality is a useful and falsifiable contribution.

minor comments (2)

[Abstract and §4] The abstract and evaluation sections report numeric improvements and ablations but supply no concrete description of the retrieval similarity metric, the exact synthesis prompt template, or the statistical tests used to establish significance of the Pass@1 deltas; these details are needed for reproducibility even if they appear in an appendix.
[Evaluation tables] Figure or table captions should explicitly state the number of runs and random seeds underlying the reported Pass@1 and trajectory-length figures.
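
One way the significance of a Pass@1 delta could be reported is a paired bootstrap over per-task outcomes, sketched below. This is an assumption about a reasonable procedure, not something the paper is known to use; the function name, the resample count, and the pass/fail vectors in the usage lines are hypothetical and do not reproduce any reported result.

```python
import random
from typing import List, Tuple


def paired_bootstrap_delta(
    base: List[int],
    method: List[int],
    n_resamples: int = 2000,
    alpha: float = 0.05,
    seed: int = 0,
) -> Tuple[float, float, float]:
    """Point estimate and percentile interval for the Pass@1 difference.

    base[i] and method[i] are 0/1 outcomes of the two systems on the same task i,
    so resampling task indices keeps the pairing intact.
    """
    if len(base) != len(method) or not base:
        raise ValueError("need two non-empty outcome lists of equal length")
    rng = random.Random(seed)  # fixed seed so the interval is reproducible
    n = len(base)
    diffs = [m - b for b, m in zip(base, method)]
    point = sum(diffs) / n
    stats = sorted(
        sum(diffs[rng.randrange(n)] for _ in range(n)) / n
        for _ in range(n_resamples)
    )
    lo = stats[int((alpha / 2) * n_resamples)]
    hi = stats[int((1 - alpha / 2) * n_resamples) - 1]
    return point, lo, hi


# Hypothetical outcomes on 100 shared tasks: 40 passes vs. 50 passes.
base_outcomes = [1] * 40 + [0] * 60
method_outcomes = [1] * 50 + [0] * 50
print(paired_bootstrap_delta(base_outcomes, method_outcomes))
```

If the resulting interval excludes zero, the delta is unlikely to be a resampling artifact on that task set; variation across random seeds or repeated solver runs would still need to be reported separately.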

Simulated Author's Rebuttal

0 responses · 0 unresolved

We thank the referee for the positive and accurate summary of SkillTTA and for recommending minor revision. We are pleased that the lightweight, parameter-free adaptation approach and the utility of failure trajectories are recognized as useful contributions.

Circularity Check

0 steps flagged

No significant circularity

The paper proposes an empirical method (SkillTTA) that retrieves a small set of training trajectories for a test-time task and synthesizes them into a temporary task-specific textual skill to guide a fixed solver model. Performance is measured via direct benchmark comparisons (e.g., Pass@1 on SpreadsheetBench, ALFWorld, BigCodeBench) against baselines such as static skill synthesis and raw trajectory prompting. No equations, fitted parameters, or first-principles derivations are described whose outputs reduce by construction to the method's own inputs. Ablations address the value of failed trajectories and small top-k retrieval without creating self-referential loops. The central claims rest on external benchmark results rather than any self-definitional or self-citation chain, rendering the work self-contained.

Axiom & Free-Parameter Ledger

0 free parameters · 1 axiom · 0 invented entities

The method rests on standard assumptions about LLM prompting and retrieval utility; no free parameters, invented entities, or non-standard axioms are introduced in the abstract.

axioms (1)

domain assumption LLM agents benefit from reusable skills that can be synthesized from trajectories
Opening sentence of abstract states this as background motivation.

pith-pipeline@v0.9.0 · 5744 in / 1278 out tokens · 43472 ms · 2026-05-19T20:28:49.581786+00:00 · methodology

Original abstract

Additional test-time compute can give LLM agents access to more past experience, yet expanding the context or adding rollouts does not necessarily yield greater agent capability. We call this challenge test-time compute-to-capability conversion and propose SkillTTA, which retrieves task-relevant training trajectories and synthesizes a temporary skill conditioned on the visible target context for a solver with fixed parameters. To pursue a higher performance ceiling, SkillTTA further uses meta prompt optimization (MPO) to adapt the policy that writes these skills. MPO evaluates candidate prompts on paired tasks and emphasizes informative transitions. It also confines updates to benchmark-specific atomic slots, reducing the variance caused by observing each edit only indirectly through skill synthesis and solver rollout. Across ALFWorld, SpreadsheetBench, BigCodeBench, and WebShop, SkillTTA outperforms state-of-the-art reuse and optimization baselines. It attains a higher performance ceiling at lower compute cost than baseline reuse and sampling strategies.
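
The abstract's description of MPO (paired evaluation, emphasis on informative transitions, edits confined to atomic slots) can be read in several ways. The sketch below shows one possible reading and should not be taken as the paper's algorithm: representing the meta prompt as a dictionary of named slots, scoring a candidate by counting fail-to-pass minus pass-to-fail flips on the same tasks, and the greedy accept rule are all assumptions introduced for illustration.

```python
from typing import Callable, Dict, List

Prompt = Dict[str, str]                    # slot name -> slot text (hypothetical)
Evaluator = Callable[[Prompt, str], bool]  # True if the task is solved under the prompt


def transition_score(
    current: Prompt, candidate: Prompt, tasks: List[str], evaluate: Evaluator
) -> int:
    """Fail-to-pass flips minus pass-to-fail flips on the same tasks.

    Tasks whose outcome does not change contribute zero, so only
    informative transitions move the score.
    """
    score = 0
    for task in tasks:
        before = evaluate(current, task)
        after = evaluate(candidate, task)
        score += int(after) - int(before)
    return score


def try_slot_edit(
    current: Prompt, slot: str, new_text: str, tasks: List[str], evaluate: Evaluator
) -> Prompt:
    """Edit exactly one existing slot and keep the edit only if it helps."""
    if slot not in current:
        raise KeyError(f"unknown slot: {slot}")
    candidate = {**current, slot: new_text}  # every other slot stays untouched
    if transition_score(current, candidate, tasks, evaluate) > 0:
        return candidate
    return current
```

Restricting each edit to a single slot makes it easier to attribute a change in task outcomes to that edit, which is one plausible interpretation of the variance-reduction motivation stated in the abstract.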

Figures

Figures reproduced from arXiv: 2605.16986 by Chenyu Zhou, Jianghao Lin, Jingxing Wang, Jun Wang, Weinan Zhang, Weiwen Liu, Zhihui Fu.

**Figure 2.** Detailed pipeline of SkillTTA. Training trajectories are indexed by task metadata, retrieved at test time, …

**Figure 3.** In-depth analysis on SpreadsheetBench with GPT-5.5 skill synthesis. Unless the panel explicitly varies …

**Figure 4.** SpreadsheetBench cost-accuracy tradeoff

Lean theorems connected to this paper

Citations machine-checked in the Pith Canon. Every link opens the source theorem in the public Lean library.

IndisputableMonolith/Cost/FunctionalEquation.lean washburn_uniqueness_aczel unclear

unclear
Relation between the paper passage and the cited Recognition theorem.

SkillTTA retrieves a small set of training trajectories relevant to the current task and synthesizes them into a temporary, task-specific textual skill... adaptation happens entirely through generated context rather than parameter updates.

What do these tags mean?

matches: The paper's claim is directly supported by a theorem in the formal canon.
supports: The theorem supports part of the paper's argument, but the paper may add assumptions or extra steps.
extends: The paper goes beyond the formal theorem; the theorem is a base layer rather than the whole result.
uses: The paper appears to rely on the theorem as machinery.
contradicts: The paper's claim conflicts with a theorem or certificate in the canon.
unclear: Pith found a possible connection, but the passage is too broad, indirect, or ambiguous to say the theorem truly supports the claim.

Forward citations

Cited by 1 Pith paper

Reviewed papers in the Pith corpus that reference this work. Sorted by Pith novelty score.

Agent Skills Matter: Inferring Proprietary Skills from Execution Trajectories
cs.AI 2026-07 conditional novelty 7.0

A black-box attacker can recover a functional approximation of a hidden agent skill from paired skill-enabled and skill-disabled execution trajectories elicited by benign queries.