Recognition: 2 theorem links
Evidence Over Plans: Online Trajectory Verification for Skill Distillation
Pith reviewed 2026-05-12 02:32 UTC · model grok-4.3
The pith
Distilling agent skills from verified environment trajectories outperforms human-written plans.
A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.
Core claim
Robust skills arise when distillation is guided by the Posterior Distillation Index computed on environment-verified trajectories; this posterior approach consistently yields higher success rates and better transfer than skills drawn from prior plans or human documents, as shown across 86 runnable tasks with transfer to student models whose inference cost is up to 1,000x lower than the teacher's.
What carries the argument
The Posterior Distillation Index (PDI), a trajectory-level metric that scores how well a distilled skill aligns with empirical task-environment evidence and functions as an online diagnostic to enforce posterior formation.
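Later in this review (under "Lean theorems connected to this paper") the paper is quoted as defining PDI = z(ϕ_exec) − z(ϕ_plan) − z(ϕ_oss), with execution grounding ϕ_exec = ψ(P_E, P_s) and plan copying ϕ_plan = ψ(P_P, P_s) for a Jensen–Shannon ψ. Below is a minimal Python sketch of that decomposition, assuming z is a standardization over a batch of candidate skills; the excerpt defines neither ϕ_oss nor the z population, so both are caller-supplied stand-ins here, not the paper's actual computation.

    from scipy.spatial.distance import jensenshannon

    def psi(p, q):
        # SciPy returns the Jensen-Shannon *distance*; square it for the divergence.
        return jensenshannon(p, q) ** 2

    def pdi(P_E, P_P, P_s, phi_oss, stats):
        # P_E: environment-evidence distribution, P_P: prior-plan distribution,
        # P_s: distribution induced by the candidate skill -- all hypothetical
        # stand-ins for whatever SPARK actually records. `stats` maps each term
        # to an assumed (mean, std) pair used for the z-scores.
        z = lambda x, key: (x - stats[key][0]) / stats[key][1]
        phi_exec = psi(P_E, P_s)  # execution grounding
        phi_plan = psi(P_P, P_s)  # plan copying
        return z(phi_exec, "exec") - z(phi_plan, "plan") - z(phi_oss, "oss")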
If this is right
- PDI-guided skills raise task success rates above both no-skill baselines and human-written skills.
- The resulting skills transfer reliably to student models whose inference cost is up to 1000 times lower than the teacher.
- SPARK preserves full execution evidence so that PDI can intervene during distillation to keep skills evidence-based (a minimal gate sketch follows this list).
- The method applies uniformly across 86 diverse runnable tasks without task-specific tuning.
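The gate sketch referenced above: a minimal, hypothetical rendering of "PDI as an online diagnostic and intervention signal" as a filtering loop. Every name here (run_in_env, distill, score_pdi, tau) is a stand-in supplied by the caller, not SPARK's actual interface; the paper specifies the idea, not this API.

    def distill_with_pdi_gate(tasks, run_in_env, distill, score_pdi, tau=0.0):
        """Keep only skills whose PDI against their own verified trajectory
        clears a threshold. run_in_env(task) -> trajectory,
        distill(trajectory) -> skill, score_pdi(skill, trajectory) -> float
        are caller-supplied stand-ins for SPARK components."""
        kept = []
        for task in tasks:
            trajectory = run_in_env(task)  # environment-verified execution evidence
            skill = distill(trajectory)    # posterior skill: built from the run, not the plan
            if score_pdi(skill, trajectory) > tau:  # PDI as online diagnostic
                kept.append(skill)
            # The paper also uses PDI as an intervention signal; the concrete
            # intervention (re-run, re-distill, discard) is not specified here.
        return kept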
Where Pith is reading between the lines
- Verification during execution may matter more than initial planning for skill quality in agent systems.
- PDI-style online checks could extend to other distillation settings where grounding in real outcomes is required.
- Reducing dependence on human procedural documents could improve scalability for open-ended tasks.
Load-bearing premise
Skills grounded in actual environment interaction after execution are more robust than those built from prior plans or human documents.
What would settle it
An experiment in which skills distilled directly from human-written plans match or exceed PDI-guided skills in success rate, transferability, and cost across the same set of 86 runnable tasks.
Original abstract
Agent skills can remarkably improve task success rates by using human-written procedural documents, but their quality is difficult to assess without environment-grounded verification. Existing skill generation methods heavily rely on preference logs rather than direct environment interaction, often yielding negligible or even degraded gains. We identify that it is a fundamental timing bottleneck: robust skills should be posterior-based, distilled from empirical environment interaction rather than prior plans. In this study, we introduce the Posterior Distillation Index (PDI), a trajectory-level metric that quantifies how well a distilled skill is grounded in the task-environment evidence. To operationalize PDI, we present SPARK (Structured Pipelines for Autonomous Runnable tasKs and sKill generation) for preserving task execution evidence towards full trajectory-level analysis. SPARK generates environment-verified trajectories used to compute PDI, and it applies PDI as an online diagnostic and intervention signal to ensure posterior skill formation. Across 86 runnable tasks, SPARK-generated skills consistently surpass no-skill baselines and outperform human-written skills on student models (inference cost up to 1,000x cheaper than teacher models). These findings show that PDI-guided distillation produces efficient and transferable skills grounded in the task-environment interaction. We release our code at https://github.com/EtaYang10th/spark-skills .
Editorial analysis
A structured set of objections, weighed in public.
Referee Report
Summary. The paper claims that robust agent skills should be posterior-based and distilled from empirical environment interaction rather than prior plans. It introduces the Posterior Distillation Index (PDI), a trajectory-level metric quantifying grounding in task-environment evidence, and SPARK, a pipeline that generates environment-verified trajectories for PDI computation while using PDI as an online diagnostic and intervention signal during skill formation. Across 86 runnable tasks, SPARK-generated skills are reported to consistently outperform no-skill baselines and human-written skills when transferred to student models (with inference costs up to 1,000x lower than teacher models), supporting the claim that PDI-guided distillation yields efficient, transferable, evidence-grounded skills. Code is released at https://github.com/EtaYang10th/spark-skills.
Significance. If the central empirical claims hold and PDI can be shown to provide independent, non-circular evidence of posterior grounding, the work would offer a meaningful advance in skill distillation for autonomous agents by shifting emphasis from preference logs or prior plans to direct environment-verified trajectories. The open release of code is a clear strength that supports reproducibility and community verification of the 86-task results.
major comments (2)
- §3 (PDI and SPARK description): The manuscript does not provide the explicit mathematical formula or derivation for the Posterior Distillation Index (PDI). Without this, it is impossible to verify whether PDI is computed independently of the candidate skill or whether it incorporates quantities derived from the same SPARK-generated trajectories used both for evaluation and as the online intervention signal, directly threatening the 'evidence over plans' distinction.
- §5 (experimental results): The claim of consistent outperformance on 86 tasks lacks any reported statistical significance tests, details on baseline implementations, controls for trajectory independence, or ablation studies isolating the contribution of PDI versus other SPARK components. This undermines confidence that gains reflect genuine posterior grounding rather than improved filtering or planning.
minor comments (2)
- The abstract states 'inference cost up to 1,000x cheaper' but does not name the specific teacher and student models or provide per-task cost breakdowns; adding this would improve clarity.
- Consider including a summary table of key metrics (success rates, costs) across the 86 tasks to allow readers to assess the scale of improvements.
Simulated Author's Rebuttal
We thank the referee for the constructive feedback, which helps clarify key aspects of our work on PDI and SPARK. We address the major comments point by point below and commit to revisions that strengthen the manuscript without altering its core claims.
Point-by-point responses
- Referee: §3 (PDI and SPARK description): The manuscript does not provide the explicit mathematical formula or derivation for the Posterior Distillation Index (PDI). Without this, it is impossible to verify whether PDI is computed independently of the candidate skill or whether it incorporates quantities derived from the same SPARK-generated trajectories used both for evaluation and as the online intervention signal, directly threatening the 'evidence over plans' distinction.
Authors: We acknowledge that the explicit mathematical formula and derivation for PDI were not presented with sufficient detail in §3. In the revised manuscript, we will insert the full definition of PDI as a trajectory-level metric computed exclusively from post-execution environment evidence (e.g., success indicators, state transitions, and task-completion signals) without reference to skill parameters or prior plans. The derivation will explicitly separate the metric computation from SPARK's use of PDI as an online intervention signal, demonstrating that PDI itself remains an independent, evidence-only quantity that does not create circularity with the trajectories used for skill distillation. Revision: yes.
- Referee: §5 (experimental results): The claim of consistent outperformance on 86 tasks lacks any reported statistical significance tests, details on baseline implementations, controls for trajectory independence, or ablation studies isolating the contribution of PDI versus other SPARK components. This undermines confidence that gains reflect genuine posterior grounding rather than improved filtering or planning.
Authors: We agree that the experimental section would benefit from greater statistical rigor and controls. In the revision, we will add paired statistical significance tests (e.g., Wilcoxon signed-rank or t-tests) across the 86 tasks to support the outperformance claims. We will also expand the text with precise descriptions of baseline implementations (no-skill and human-written skills), explicit controls for trajectory independence, and ablation studies that isolate PDI by comparing full SPARK against SPARK variants that omit the PDI intervention signal. These additions will directly address whether observed gains stem from posterior grounding; a hedged sketch of such a paired test follows these responses. Revision: yes.
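To make the committed analysis concrete, here is a minimal sketch of the paired Wilcoxon signed-rank test the rebuttal proposes, over per-task success rates. The arrays are placeholder data, not results from the paper; only the 86-task pairing structure is taken from the text.

    import numpy as np
    from scipy.stats import wilcoxon

    rng = np.random.default_rng(0)
    # Placeholder per-task success rates for the 86 runnable tasks; the real
    # values would come from the paper's evaluation harness.
    pdi_guided = rng.uniform(0.5, 1.0, size=86)
    human_written = rng.uniform(0.4, 0.9, size=86)

    # Paired, non-parametric test: each task contributes one (PDI, human) pair.
    stat, p_value = wilcoxon(pdi_guided, human_written)
    print(f"Wilcoxon W={stat:.1f}, p={p_value:.3g}")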
Circularity Check
No significant circularity: PDI is an internal guidance signal but central claims rest on external task-success metrics.
full rationale
The paper defines PDI as a trajectory-level metric computed from environment-verified runs produced by SPARK and deploys it as an online intervention during skill formation. However, the load-bearing empirical claims (outperformance on 86 tasks versus no-skill baselines and human-written skills, plus transfer to student models) are measured by independent success rates rather than by PDI values themselves. No equations, self-citations, or definitional steps are shown that would make the reported gains reduce to a tautology or a fit of the same quantity used for guidance. The derivation chain therefore remains self-contained against external benchmarks.
Axiom & Free-Parameter Ledger
axioms (1)
- Domain assumption: Robust skills should be posterior-based, distilled from empirical environment interaction rather than prior plans.
invented entities (1)
- Posterior Distillation Index (PDI): no independent evidence.
Lean theorems connected to this paper
- IndisputableMonolith/Cost/FunctionalEquation.lean · washburn_uniqueness_aczel (tagged: unclear)
unclear: relation between the paper passage and the cited Recognition theorem.
Linked passage (rendered as display math after this list): "PDI = z(ϕ_exec) − z(ϕ_plan) − z(ϕ_oss) ... using Jensen–Shannon divergence ... execution grounding ϕ_exec = ψ(P_E, P_s), plan copying ϕ_plan = ψ(P_P, P_s)"
- IndisputableMonolith/Foundation/ArithmeticFromLogic.lean · reality_from_one_distinction (tagged: unclear)
unclear: relation between the paper passage and the cited Recognition theorem.
Linked passage: "SPARK generates environment-verified trajectories ... applies PDI as an online diagnostic and intervention signal"
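For readability, the PDI excerpt quoted above can be rendered as display math. This transcribes only what the excerpt states and does not fill its elisions:

    \[
    \mathrm{PDI} = z(\phi_{\mathrm{exec}}) - z(\phi_{\mathrm{plan}}) - z(\phi_{\mathrm{oss}}),
    \qquad
    \phi_{\mathrm{exec}} = \psi(P_E, P_s), \quad
    \phi_{\mathrm{plan}} = \psi(P_P, P_s),
    \]

where ψ is stated to be a Jensen–Shannon divergence; the excerpt defines neither ϕ_oss nor the z-normalization.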
What do these tags mean?
- matches: The paper's claim is directly supported by a theorem in the formal canon.
- supports: The theorem supports part of the paper's argument, but the paper may add assumptions or extra steps.
- extends: The paper goes beyond the formal theorem; the theorem is a base layer rather than the whole result.
- uses: The paper appears to rely on the theorem as machinery.
- contradicts: The paper's claim conflicts with a theorem or certificate in the canon.
- unclear: Pith found a possible connection, but the passage is too broad, indirect, or ambiguous to say the theorem truly supports the claim.
Reference graph
Works this paper leans on
- [1] Salaheddin Alzubi, Noah Provenzano, Jaydon Bingham, Weiyuan Chen, and Tu Vu. EvoSkill: Automated skill discovery for multi-agent systems. arXiv preprint arXiv:2603.02766, 2026.
- [2] Tianle Cai, Xuezhi Wang, Tengyu Ma, Xinyun Chen, and Denny Zhou. Large language models as tool makers. In International Conference on Learning Representations (ICLR), 2024.
- [3] Shiqi Chen, Jingze Gai, Ruochen Zhou, Jinghan Zhang, Tongyao Zhu, Junlong Li, Kangrui Wang, Zihan Wang, Zhengyu Chen, Klara Kaleb, Ning Miao, Siyang Gao, Cong Lu, Manling Li, Junxian He, and Yee Whye Teh. SkillCraft: Can LLM agents learn to use tools skillfully? arXiv preprint arXiv:2603.00718, 2026.
- [4] Tianyi Chen, Yinheng Li, Michael Solodko, Sen Wang, Nan Jiang, Tingyuan Cui, Junheng Hao, Jongwoo Ko, Sara Abdali, Leon Xu, Suzhen Zheng, Hao Fan, Pashmina Cameron, Justin Wagle, and Kazuhito Koishida. CUA-Skill: Develop skills for computer using agent. arXiv preprint arXiv:2601.21123, 2026.
- [5] Xiangyi Li, Wenbo Chen, Yimin Liu, Shenghan Zheng, Xiaokun Chen, Yifeng He, Yubo Li, Bingran You, Haotian Shen, Jiankai Sun, et al. SkillsBench: Benchmarking how well agent skills work across diverse tasks. arXiv preprint arXiv:2602.12670, 2026.
- [6] Yuan Liang, Ruobin Zhong, Haoming Xu, Chen Jiang, Yi Zhong, Runnan Fang, Jia-Chen Gu, Shumin Deng, Yunzhi Yao, Mengru Wang, et al. SkillNet: Create, evaluate, and connect AI skills. arXiv preprint arXiv:2603.04448, 2026.
- [7] Yujian Liu, Jiabao Ji, Li An, Tommi Jaakkola, Yang Zhang, and Shiyu Chang. How well do agentic skills work in the wild: Benchmarking LLM skill usage in realistic settings. arXiv preprint arXiv:2604.04323, 2026.
- [8] Zhengxi Lu, Zhiyuan Yao, Jinyang Wu, Chengcheng Han, Qi Gu, Xunliang Cai, Weiming Lu, Jun Xiao, Yueting Zhuang, and Yongliang Shen. SKILL0: In-context agentic reinforcement learning for skill internalization. arXiv preprint arXiv:2604.02268, 2026.
- [9] Jingwei Ni, Yihao Liu, Xinpeng Liu, Yutao Sun, Mengyu Zhou, Pengyu Cheng, Dexin Wang, Erchao Zhao, Xiaoxi Jiang, and Guanjun Jiang. Trace2Skill: Distill trajectory-local lessons into transferable agent skills. arXiv preprint arXiv:2603.25158, 2026.
- [10] Cheng Qian, Chi Han, Yi R. Fung, Yujia Qin, Zhiyuan Liu, and Heng Ji. CREATOR: Tool creation for disentangling abstract and concrete reasoning of large language models. In Findings of the Association for Computational Linguistics: EMNLP 2023, 2023.
- [11] Yujia Qin, Shihao Liang, Yining Ye, Kunlun Zhu, Lan Yan, Yaxi Lu, Yankai Lin, Xin Cong, Xiangru Tang, Bill Qian, et al. ToolLLM: Facilitating large language models to master 16000+ real-world APIs. In International Conference on Learning Representations (ICLR), 2024.
- [12] Libin Qiu, Zhirong Gao, Junfu Chen, Yuhang Ye, Weizhi Huang, Xiaobo Xue, Wenkai Qiu, and Shuo Tang. AutoRefine: From trajectories to reusable expertise for continual LLM agent refinement. arXiv preprint arXiv:2601.22758, 2026.
- [13] Timo Schick, Jane Dwivedi-Yu, Roberto Dessì, Roberta Raileanu, Maria Lomeli, Eric Hambro, Luke Zettlemoyer, Nicola Cancedda, and Thomas Scialom. Toolformer: Language models can teach themselves to use tools. In Advances in Neural Information Processing Systems (NeurIPS), 2023.
- [14] Noah Shinn, Federico Cassano, Edward Berman, Ashwin Gopinath, Karthik Narasimhan, and Shunyu Yao. Reflexion: Language agents with verbal reinforcement learning. In Advances in Neural Information Processing Systems (NeurIPS), 2023.
- [15] Mohit Shridhar, Xingdi Yuan, Marc-Alexandre Côté, Yonatan Bisk, Adam Trischler, and Matthew Hausknecht. ALFWorld: Aligning text and embodied environments for interactive learning. In International Conference on Learning Representations (ICLR), 2021.
- [16] Yifan Song, Da Yin, Xiang Yue, Jie Huang, Sujian Li, and Bill Yuchen Lin. Trial and error: Exploration-based trajectory optimization for LLM agents. In Proceedings of the 62nd Annual Meeting of the Association for Computational Linguistics (ACL), 2024.
- [17] Chenxi Wang, Zhuoyun Yu, Xin Xie, Wuguannan Yao, Runnan Fang, et al. SkillX: Automatically constructing skill knowledge bases for agents. arXiv preprint arXiv:2604.04804, 2026.
- [18] Guanzhi Wang, Yuqi Xie, Yunfan Jiang, Ajay Mandlekar, Chaowei Xiao, Yuke Zhu, Linxi Fan, and Anima Anandkumar. Voyager: An open-ended embodied agent with large language models. arXiv preprint arXiv:2305.16291, 2023.
- [19] Peng Xia, Jianwen Chen, Hanyang Wang, Jiaqi Liu, Kaide Zeng, Yu Wang, Siwei Han, Yiyang Zhou, Xujiang Zhao, Haifeng Chen, Zeyu Zheng, Cihang Xie, and Huaxiu Yao. SkillRL: Evolving agents via recursive skill-augmented reinforcement learning. arXiv preprint arXiv:2602.08234, 2026.
- [20] Yiheng Xu, Dunjie Lu, Zhennan Shen, Junli Wang, Zekun Wang, Yuchen Mao, Caiming Xiong, and Tao Yu. AgentTrek: Agent trajectory synthesis via guiding replay with web tutorials. In International Conference on Learning Representations (ICLR), 2025.
- [21] Yutao Yang, Junsong Li, Qianjun Pan, Bihao Zhan, Yuxuan Cai, Lin Du, Jie Zhou, Kai Chen, Qin Chen, Xin Li, Bo Zhang, and Liang He. AutoSkill: Experience-driven lifelong learning via skill self-evolution. arXiv preprint arXiv:2603.01145, 2026.
- [22] Shunyu Yao, Jeffrey Zhao, Dian Yu, Nan Du, Izhak Shafran, Karthik Narasimhan, and Yuan Cao. ReAct: Synergizing reasoning and acting in language models. In International Conference on Learning Representations (ICLR), 2023.
- [23] Tianzhu Ye, Li Dong, Qingxiu Dong, Xun Wu, Shaohan Huang, and Furu Wei. Online experiential learning for language models. arXiv preprint arXiv:2603.16856, 2026.
- [24] Andrew Zhao, Daniel Huang, Quentin Xu, Matthieu Lin, Yong-Jin Liu, and Gao Huang. ExpeL: LLM agents are experiential learners. In Proceedings of the AAAI Conference on Artificial Intelligence (AAAI), 2024.
- [25] Yanzhao Zheng, Zhentao Zhang, Chao Ma, Yuanqiang Yu, Jihuai Zhu, et al. SkillRouter: Skill routing for LLM agents at scale. arXiv preprint arXiv:2603.22455, 2026.
- [26] Yang Zhou, Mingyu Zhao, Zhenting Wang, Difei Gu, Bangwei Guo, Ruosong Ye, Ligong Han, Can Jin, and Dimitris N. Metaxas. M^3-Bench: Multi-modal, multi-hop, multi-threaded tool-using MLLM agent benchmark. arXiv preprint arXiv:2511.17729, 2025.
- [27] Yang Zhou, Shiyu Zhao, Yuxiao Chen, Zhenting Wang, Can Jin, and Dimitris N. Metaxas. LED: LLM enhanced open-vocabulary object detection without human curated data generation. arXiv preprint arXiv:2503.13794, 2025.