Pith. Machine review for the scientific record.

arxiv: 2605.10114 · v1 · submitted 2026-05-11 · 💻 cs.CL

Recognition: 2 Lean theorem links

SkillRAE: Agent Skill-Based Context Compilation for Retrieval-Augmented Execution

Authors on Pith: no claims yet

Pith reviewed 2026-05-12 02:50 UTC · model grok-4.3

classification 💻 cs.CL
keywords retrieval-augmented execution · skill libraries · context compilation · LLM agents · skill graphs · multi-level graphs · agent skills · rescue-aware compilation

The pith

SkillRAE compiles coarse skill retrievals into compact, grounded contexts using a multi-level graph and rescue-aware steps for better LLM agent execution.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper seeks to establish that organizing retrieved skills into task-specific contexts is a critical missing piece in Retrieval-Augmented Execution for agents that rely on growing skill libraries. Current approaches optimize retrieval and execution but leave the selected skills in a disorganized form that burdens downstream executors. SkillRAE fills the gap with an offline multi-level graph that links skill communities, individual skills, and subunits, plus an online process of ranked retrieval and rescue-aware compilation. If the method works as claimed, agents could complete complex, artifact-rich tasks more reliably without context overload or lost evidence. The reported results show an 11.7 percent gain on SkillsBench over the prior state of the art, with ablations indicating that the compilation stage itself drives the improvement rather than added prompt length.

Core claim

SkillRAE is a two-stage Retrieval-Augmented Execution method. In an offline indexing stage it builds a multi-level skill graph over communities, skills, and reusable subunits; in the online stage it performs skill-ranked retrieval with subunit evidence export, followed by rescue-aware compact compilation that converts a coarse-ranked skill set into a compact, grounded, and immediately usable task-specific context.
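The three-level index can be made concrete with a small sketch. Everything below is a hypothetical illustration, not the paper's implementation: the class, method names, and example skills are invented to show one plausible shape for a graph over communities, skills, and subunits, with the top-down (community siblings) and bottom-up (shared subunits) lookups the overview figure describes.

```python
from collections import defaultdict


class SkillGraph:
    """Hypothetical three-level index: communities -> skills -> subunits."""

    def __init__(self):
        self.community_of = {}                 # skill name -> community id
        self.skills_in = defaultdict(set)      # community id -> skill names
        self.subunits_of = {}                  # skill name -> reusable subunits

    def add_skill(self, name, community, subunits):
        self.community_of[name] = community
        self.skills_in[community].add(name)
        self.subunits_of[name] = list(subunits)

    def siblings(self, skill):
        """Other skills in the same community (top-down expansion)."""
        return self.skills_in[self.community_of[skill]] - {skill}

    def skills_with_subunit(self, subunit):
        """Skills that share a subunit (bottom-up expansion)."""
        return {s for s, units in self.subunits_of.items() if subunit in units}


# Toy library: two spreadsheet skills share the load_csv subunit.
graph = SkillGraph()
graph.add_skill("make_pivot_table", "spreadsheets", ["load_csv", "group_rows"])
graph.add_skill("plot_timeseries", "spreadsheets", ["load_csv", "render_chart"])
graph.add_skill("scrape_page", "web", ["fetch_url"])
```

Under this reading, retrieval can start from either end: a community hit expands downward to sibling skills, while a subunit hit expands upward to every skill that reuses it.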

What carries the argument

The multi-level skill graph over communities, skills, and subunits, paired with rescue-aware compact compilation that recovers key evidence from coarse retrievals.
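One plausible reading of the rescue step, sketched below under our own assumptions rather than the paper's algorithm: after the top-k ranked skills are selected, highly scored subunits whose source skills did not make the cut are "rescued" into the compiled context instead of being discarded with their parents. The function name, threshold, and example data are all invented for illustration.

```python
def compile_context(ranked_skills, subunit_scores, subunit_owner,
                    k=2, rescue_threshold=0.8):
    """Hypothetical rescue-aware compilation.

    ranked_skills: skill names sorted by retrieval score, best first.
    subunit_scores: subunit name -> relevance score in [0, 1].
    subunit_owner: subunit name -> the skill it was extracted from.
    """
    selected = ranked_skills[:k]
    # Rescue evidence stranded in skills that fell below the cutoff.
    rescued = [
        unit
        for unit, score in sorted(subunit_scores.items(), key=lambda kv: -kv[1])
        if subunit_owner[unit] not in selected and score >= rescue_threshold
    ]
    return {"skills": selected, "rescued_subunits": rescued}


ctx = compile_context(
    ranked_skills=["fill_pdf_form", "merge_documents", "ocr_scan"],
    subunit_scores={"extract_fields": 0.9, "stitch_pages": 0.4, "deskew_image": 0.85},
    subunit_owner={
        "extract_fields": "ocr_scan",
        "stitch_pages": "merge_documents",
        "deskew_image": "ocr_scan",
    },
)
```

Here "ocr_scan" is dropped by the top-2 cutoff, but its two strong subunits survive into the compiled context, which is the failure mode a rescue-aware stage is meant to prevent.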

If this is right

  • LLM agents can scale to larger skill libraries while keeping execution contexts efficient and grounded.
  • Retrieval can tolerate coarser initial ranking provided a subsequent rescue and compilation stage is present.
  • Document-centric and data-intensive workflows become more tractable once skills are organized into immediately usable forms.
  • Context compilation is shown to be a distinct and necessary component rather than a simple prompt-engineering addition.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

  • The same graph-plus-rescue pattern could be tested on non-skill retrieval-augmented tasks such as code generation or multi-hop question answering.
  • Dynamic updates to the skill graph during agent operation might allow the system to incorporate newly discovered skills without full re-indexing.
  • If the graph construction depends on initial skill quality, low-quality libraries would limit gains and point to a need for upstream skill curation.

Load-bearing premise

The multi-level skill graph accurately captures skill relationships, and the rescue-aware compilation step can recover critical evidence from coarse-ranked retrievals without significant loss.

What would settle it

Running SkillRAE on SkillsBench with the rescue-aware compilation stage removed and finding performance equal to or below the prior SOTA baseline would falsify the claim that context compilation is essential.

Figures

Figures reproduced from arXiv: 2605.10114 by Shu Wang, Xiangcheng Meng, Yixiang Fang.

Figure 1
Figure 1. An example agent skill from SkillsBench. It contains a natural-language description, … view at source ↗
Figure 2
Figure 2. Overview of SkillRAE. In the online stage, we first perform skill retrieval over the constructed graph above by combining evidence from skill communities and subunits, which are retrieved in top-down and bottom-up manners, respectively. It then compiles the retrieved skills, selected subunit evidence, rescued subunits from non-selected source skills, and task-output constraints into a task-specific contex… view at source ↗
read the original abstract

Large Language Model (LLM)-based agents (e.g., OpenClaw) increasingly rely on reusable skill libraries to solve artifact-rich tasks such as document-centric workflows and data-intensive analysis. As these libraries grow, a few works have attempted to study the Retrieval-Augmented Execution (RAE), which often first retrieves some external skills and other knowledge, then compiles the context using retrieved skills, and finally executes the task. Existing works mainly focus on optimizing skill retrieval and task execution, and they pay little attention to how to effectively organize the selected skill evidence in a form that is compact, grounded, and immediately usable for the downstream executors to complete tasks. To fill this gap, we propose SkillRAE, a two-stage RAE approach focusing on skill-based context compilation, which consists of the offline and online stages. Specifically, in the offline indexing stage, it builds a multi-level skill graph over skill communities, skills, and reusable subunits, for capturing their relationships. In the online retrieval stage, it first performs skill-ranked retrieval with selected-subunit evidence export in the graph, and then applies rescue-aware compact compilation to recover the key evidence. Together, these components compile a coarse-ranked skill set into a task-specific context that is compact, grounded, and immediately usable. Experiments on two public benchmarks show that SkillRAE achieves a significant improvement over baselines for RAE. For example, on SkillsBench, it achieves an improvement of 11.7% over the SOTA method. Ablation studies further show that our context compilation is crucial, instead of a mere prompt addition.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Referee Report

2 major / 2 minor

Summary. The manuscript introduces SkillRAE, a two-stage Retrieval-Augmented Execution (RAE) method for LLM-based agents. The offline stage constructs a multi-level skill graph over skill communities, skills, and reusable subunits to capture relationships. The online stage performs skill-ranked retrieval with selected-subunit evidence export, followed by rescue-aware compact compilation to produce compact, grounded, and immediately usable task contexts. Experiments on two public benchmarks report an 11.7% improvement over SOTA on SkillsBench, with ablations indicating that context compilation is essential rather than a simple prompt addition.

Significance. If the results hold under rigorous verification, SkillRAE addresses a clear gap in RAE literature by prioritizing effective organization of retrieved skill evidence over retrieval or execution alone. The multi-level graph and rescue-aware mechanism offer a structured way to handle expanding skill libraries for artifact-rich tasks. The reported gains and ablation emphasis on compilation provide a promising direction, though the absence of detailed experimental protocols limits immediate assessment of robustness and reproducibility.

major comments (2)
  1. [Experiments] Experiments section (as summarized in abstract): The central performance claim of an 11.7% improvement over SOTA on SkillsBench, along with the assertion that ablations demonstrate context compilation is crucial, lacks any description of experimental setup, baselines, number of runs, statistical tests, error bars, or data handling. This information is load-bearing for validating the empirical results that support the paper's main contribution.
  2. [Online stage] Online retrieval and compilation stage (as described in abstract): The rescue-aware compact compilation is presented at a high level as recovering key evidence from coarse-ranked skills without loss; however, no concrete mechanism, algorithm, or example is supplied to show how it avoids dropping critical evidence, which directly underpins the claim that the compiled context is 'immediately usable' for downstream executors.
minor comments (2)
  1. [Abstract] The abstract would benefit from a brief illustrative example of the multi-level skill graph (communities/skills/subunits) to clarify the offline indexing process for readers.
  2. [Method overview] Terminology such as 'rescue-aware' and 'selected-subunit evidence export' is introduced without prior definition or reference, which could be clarified in the method overview for better readability.

Simulated Author's Rebuttal

2 responses · 0 unresolved

We thank the referee for the constructive feedback and the recommendation for major revision. We address each major comment point by point below and will incorporate the suggested improvements to strengthen the manuscript's clarity and reproducibility.

read point-by-point responses
  1. Referee: [Experiments] Experiments section (as summarized in abstract): The central performance claim of an 11.7% improvement over SOTA on SkillsBench, along with the assertion that ablations demonstrate context compilation is crucial, lacks any description of experimental setup, baselines, number of runs, statistical tests, error bars, or data handling. This information is load-bearing for validating the empirical results that support the paper's main contribution.

    Authors: We agree that the current manuscript does not include sufficient details on the experimental protocol to allow full verification of the reported results. In the revised version, we will expand Section 4 (Experiments) to explicitly describe the full experimental setup, list all baselines with citations, specify the number of runs (including error bars), detail the statistical tests performed, and outline data handling procedures. This will directly support the 11.7% improvement claim and the ablation analysis. revision: yes

  2. Referee: [Online stage] Online retrieval and compilation stage (as described in abstract): The rescue-aware compact compilation is presented at a high level as recovering key evidence from coarse-ranked skills without loss; however, no concrete mechanism, algorithm, or example is supplied to show how it avoids dropping critical evidence, which directly underpins the claim that the compiled context is 'immediately usable' for downstream executors.

    Authors: We acknowledge that the rescue-aware compact compilation is currently described at a conceptual level without a concrete algorithm or example. In the revision, we will add a detailed algorithmic description with pseudocode for the rescue mechanism in Section 3, along with a worked example illustrating how critical evidence is identified, exported from the skill graph, and preserved during compilation to ensure no loss and immediate usability. revision: yes

Circularity Check

0 steps flagged

No significant circularity; empirical claims on public benchmarks

full rationale

The paper introduces SkillRAE as a two-stage system (offline multi-level skill graph over communities/skills/subunits, online ranked retrieval plus rescue-aware compilation) whose value is asserted via measured gains on public benchmarks (e.g., +11.7% on SkillsBench versus SOTA) and ablation studies showing context compilation is not mere prompt addition. No equations, fitted parameters, or derivations are present that could reduce to the inputs by construction. No self-citations are invoked as load-bearing uniqueness theorems or ansatzes. The evaluation therefore rests on independent external data rather than self-referential definitions or renamings, rendering the derivation chain self-contained.

Axiom & Free-Parameter Ledger

0 free parameters · 2 axioms · 2 invented entities

Review based on abstract only; specific free parameters, axioms, and entities cannot be audited in detail without full text. The approach assumes standard LLM prompting and retrieval mechanics plus the utility of graph-structured skills.

axioms (2)
  • domain assumption LLM agents benefit from reusable skill libraries for artifact-rich tasks
    Opening premise of the abstract.
  • domain assumption Organizing skills into communities, skills, and subunits captures useful relationships
    Basis for the offline indexing stage.
invented entities (2)
  • multi-level skill graph no independent evidence
    purpose: Capturing relationships across skill communities, skills, and reusable subunits for retrieval
    Core new structure introduced in offline stage; no independent evidence provided in abstract.
  • rescue-aware compact compilation no independent evidence
    purpose: Recovering key evidence to produce compact, grounded, usable context
    Key online-stage component; no independent evidence provided in abstract.

pith-pipeline@v0.9.0 · 5584 in / 1391 out tokens · 52634 ms · 2026-05-12T02:50:40.169806+00:00 · methodology

discussion (0)


Lean theorems connected to this paper

Citations machine-checked in the Pith Canon. Every link opens the source theorem in the public Lean library.

What do these tags mean?
matches
The paper's claim is directly supported by a theorem in the formal canon.
supports
The theorem supports part of the paper's argument, but the paper may add assumptions or extra steps.
extends
The paper goes beyond the formal theorem; the theorem is a base layer rather than the whole result.
uses
The paper appears to rely on the theorem as machinery.
contradicts
The paper's claim conflicts with a theorem or certificate in the canon.
unclear
Pith found a possible connection, but the passage is too broad, indirect, or ambiguous to say the theorem truly supports the claim.

Reference graph

Works this paper leans on

52 extracted references · 52 canonical work pages · 9 internal anchors

  1. [1]

    OpenClaw: Personal ai assistant

    OpenClaw. OpenClaw: Personal ai assistant. https://github.com/openclaw/openclaw, 2026. Accessed: 2026-05-05

  2. [2]

    Welcome – manus documentation

    Manus AI. Welcome – manus documentation. https://manus.im/docs/introduction/welcome,

  3. [3]

    Accessed: 2026-05-05

  4. [4]

    API-bank: A comprehensive benchmark for tool-augmented LLMs

    Minghao Li, Yingxiu Zhao, Bowen Yu, Feifan Song, Hangyu Li, Haiyang Yu, Zhoujun Li, Fei Huang, and Yongbin Li. API-bank: A comprehensive benchmark for tool-augmented LLMs. In Proceedings of the 2023 Conference on Empirical Methods in Natural Language Processing, pages 3102–3116, Singapore, 2023. Association for Computational Linguistics. doi: 10.18653/v1...

  5. [5]

    ToolLLM: Facilitating large language models to master 16000+ real-world APIs

    Yujia Qin, Shihao Liang, Yining Ye, Kunlun Zhu, Lan Yan, Yaxi Lu, Yankai Lin, Xin Cong, Xiangru Tang, Bill Qian, Sihan Zhao, Lauren Hong, Runchu Tian, Ruobing Xie, Jie Zhou, Mark Gerstein, Dahai Li, Zhiyuan Liu, and Maosong Sun. ToolLLM: Facilitating large language models to master 16000+ real-world APIs. In International Conference on Learning Represent...

  6. [6]

    ToolHop: A query-driven benchmark for evaluating large language models in multi-hop tool use

    Junjie Ye, Zhengyin Du, Xuesong Yao, Weijian Lin, Yufei Xu, Zehui Chen, Zaiyuan Wang, Sining Zhu, Zhiheng Xi, Siyu Yuan, Tao Gui, Qi Zhang, Xuanjing Huang, and Jiecao Chen. ToolHop: A query-driven benchmark for evaluating large language models in multi-hop tool use. In Proceedings of the 63rd Annual Meeting of the Association for Computational Linguistic...

  7. [7]

    ShortcutsBench: A large-scale real-world benchmark for api-based agents

    Haiyang Shen, Yue Li, Desong Meng, Dongqi Cai, Sheng Qi, Li Zhang, Mengwei Xu, and Yun Ma. ShortcutsBench: A large-scale real-world benchmark for api-based agents. In Proceedings of the International Conference on Learning Representations, 2025

  8. [8]

    Voyager: An Open-Ended Embodied Agent with Large Language Models

    Guanzhi Wang, Yuqi Xie, Yunfan Jiang, Ajay Mandlekar, Chaowei Xiao, Yuke Zhu, Linxi Fan, and Anima Anandkumar. Voyager: An open-ended embodied agent with large language models. arXiv preprint arXiv:2305.16291, 2023. doi: 10.48550/arXiv.2305.16291. URL https://arxiv.org/abs/2305.16291

  9. [9]

    SkillsBench: Benchmarking How Well Agent Skills Work Across Diverse Tasks

    Xiangyi Li, Wenbo Chen, Yimin Liu, Shenghan Zheng, Xiaokun Chen, Yifeng He, Yubo Li, Bingran You, Haotian Shen, Jiankai Sun, Shuyi Wang, Qunhong Zeng, Di Wang, Xuandong Zhao, Yuanli Wang, Roey Ben Chaim, Zonglin Di, Yipeng Gao, Junwei He, Yizhuo He, Liqiang Jing, Luyang Kong, Xin Lan, Jiachen Li, Songlin Li, Yijiang Li, Yueqian Lin, Xinyi Liu, Xuanqing ...

  10. [10]

    Organizing, orchestrating, and benchmarking agent skills at ecosystem scale, 2026

    Hao Li, Chunjiang Mu, Jianhao Chen, Siyue Ren, Zhiyao Cui, Yiqun Zhang, Lei Bai, and Shuyue Hu. Organizing, orchestrating, and benchmarking agent skills at ecosystem scale. arXiv preprint arXiv:2603.02176, 2026. doi: 10.48550/arXiv.2603.02176. URL https://arxiv.org/abs/2603.02176

  11. [11]

    SkillRouter: Skill routing for LLM agents at scale

    YanZhao Zheng, ZhenTao Zhang, Chao Ma, YuanQiang Yu, JiHuai Zhu, Yong Wu, Tianze Xu, Baohua Dong, Hangcheng Zhu, Ruohui Huang, and Gang Yu. SkillRouter: Skill routing for LLM agents at scale. arXiv preprint arXiv:2603.22455, 2026. doi: 10.48550/arXiv.2603.22

  12. [12]

    URL https://arxiv.org/abs/2603.22455

  13. [13]

    RepoCoder: Repository-level code completion through iterative retrieval and generation

    Fengji Zhang, Bei Chen, Yue Zhang, Jacky Keung, Jin Liu, Daoguang Zan, Yi Mao, Jian-Guang Lou, and Weizhu Chen. RepoCoder: Repository-level code completion through iterative retrieval and generation. In Proceedings of the 2023 Conference on Empirical Methods in Natural Language Processing, pages 2471–2484, 2023

  14. [14]

    Dataflow-guided retrieval augmentation for repository-level code completion

    Wei Cheng, Yuhan Wu, and Wei Hu. Dataflow-guided retrieval augmentation for repository-level code completion. In Proceedings of the 62nd Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), pages 7957–7977, 2024

  15. [15]

    RepoGraph: Enhancing ai software engineering with repository-level code graph

    Siru Ouyang, Wenhao Yu, Kaixin Ma, Zilin Xiao, Zhihan Zhang, Mengzhao Jia, Jiawei Han, Hongming Zhang, and Dong Yu. RepoGraph: Enhancing ai software engineering with repository-level code graph. In Proceedings of the International Conference on Learning Representations, 2025

  16. [16]

    Parth Sarthi, Salman Abdullah, Aditi Tuli, Shubh Khanna, Anna Goldie, and Christopher D. Manning. RAPTOR: Recursive abstractive processing for tree-organized retrieval. In International Conference on Learning Representations, 2024. URL https://openreview.net/forum?id=GN921JHCRw

  17. [17]

    HippoRAG: Neurobiologically inspired long-term memory for large language models

    Bernal Jiménez Gutiérrez, Yiheng Shu, Yu Gu, Michihiro Yasunaga, and Yu Su. HippoRAG: Neurobiologically inspired long-term memory for large language models. In Advances in Neural Information Processing Systems, volume 37, 2024

  18. [18]

    StructRAG: Boosting knowledge intensive reasoning of llms via inference-time hybrid information structurization

    Zhuoqun Li, Xuanang Chen, Haiyang Yu, Hongyu Lin, Yaojie Lu, Qiaoyu Tang, Fei Huang, Xianpei Han, Le Sun, and Yongbin Li. StructRAG: Boosting knowledge intensive reasoning of llms via inference-time hybrid information structurization. In Proceedings of the International Conference on Learning Representations, 2025

  19. [19]

    ArchRAG: Attributed community-based hierarchical retrieval-augmented generation

    Shu Wang, Yixiang Fang, Yingli Zhou, Xilin Liu, and Yuchi Ma. ArchRAG: Attributed community-based hierarchical retrieval-augmented generation. In Proceedings of the AAAI Conference on Artificial Intelligence, volume 40, pages 15868–15876, 2026. doi: 10.1609/aaai.v40i19.38619. URL https://ojs.aaai.org/index.php/AAAI/article/view/38619

  20. [20]

    BookRAG: A hierarchical structure-aware index-based approach for retrieval-augmented generation on complex documents

    Shu Wang, Yingli Zhou, and Yixiang Fang. BookRAG: A hierarchical structure-aware index-based approach for retrieval-augmented generation on complex documents. arXiv preprint arXiv:2512.03413, 2025. doi: 10.48550/arXiv.2512.03413. URL https://arxiv.org/abs/2512.03413

  21. [21]

    HuggingGPT: Solving AI tasks with ChatGPT and its friends in hugging face

    Yongliang Shen, Kaitao Song, Xu Tan, Dongsheng Li, Weiming Lu, and Yueting Zhuang. HuggingGPT: Solving AI tasks with ChatGPT and its friends in hugging face. In Advances in Neural Information Processing Systems, 2023. URL https://openreview.net/forum?id=yHdTscY6Ci

  22. [22]

    ToolPlanner: A tool augmented LLM for multi granularity instructions with path planning and feedback

    Qinzhuo Wu, Wei Liu, Jian Luan, and Bin Wang. ToolPlanner: A tool augmented LLM for multi granularity instructions with path planning and feedback. In Proceedings of the 2024 Conference on Empirical Methods in Natural Language Processing, pages 18315–18339, Miami, Florida, USA, November 2024. Association for Computational Linguistics. doi: 10.18653/v1/202...

  23. [23]

    Trace2Skill: Distill Trajectory-Local Lessons into Transferable Agent Skills

    Jingwei Ni, Yihao Liu, Xinpeng Liu, Yutao Sun, Mengyu Zhou, Pengyu Cheng, Dexin Wang, Xiaoxi Jiang, and Guanjun Jiang. Trace2Skill: Distill trajectory-local lessons into transferable agent skills. arXiv preprint arXiv:2603.25158, 2026

  24. [24]

    ReasoningBank: Scaling Agent Self-Evolving with Reasoning Memory

    Siru Ouyang, Jun Yan, I Hsu, Yanfei Chen, Ke Jiang, Zifeng Wang, Rujun Han, Long T Le, Samira Daruki, Xiangru Tang, et al. ReasoningBank: Scaling agent self-evolving with reasoning memory. arXiv preprint arXiv:2509.25140, 2025

  25. [25]

    Graph of Skills: Dependency-Aware Structural Retrieval for Massive Agent Skills

    Dawei Liu, Zongxia Li, Hongyang Du, Xiyang Wu, Shihang Gui, Yongbei Kuang, and Lichao Sun. Graph of skills: Dependency-aware structural retrieval for massive agent skills. arXiv preprint arXiv:2604.05333, 2026. doi: 10.48550/arXiv.2604.05333. URL https://arxiv.org/abs/2604.05333

  26. [26]

    ReAct: Synergizing reasoning and acting in language models

    Shunyu Yao, Jeffrey Zhao, Dian Yu, Nan Du, Izhak Shafran, Karthik Narasimhan, and Yuan Cao. ReAct: Synergizing reasoning and acting in language models. In International Conference on Learning Representations, 2023. URL https://openreview.net/forum?id=WE_vluYUL-X

  27. [27]

    Toolformer: Language models can teach themselves to use tools

    Timo Schick, Jane Dwivedi-Yu, Roberto Dessi, Roberta Raileanu, Maria Lomeli, Eric Hambro, Luke Zettlemoyer, Nicola Cancedda, and Thomas Scialom. Toolformer: Language models can teach themselves to use tools. In Advances in Neural Information Processing Systems, 2023. URL https://openreview.net/forum?id=Yacmpz84TH

  28. [28]

    Gorilla: Large language model connected with massive APIs

    Shishir G. Patil, Tianjun Zhang, Xin Wang, and Joseph E. Gonzalez. Gorilla: Large language model connected with massive APIs. In Advances in Neural Information Processing Systems,

  29. [29]

    URL https://proceedings.neurips.cc/paper_files/paper/2024/hash/e4c61f578ff07830f5c37378dd3ecb0d-Abstract-Conference.html

  30. [30]

    ToolkenGPT: Augmenting frozen language models with massive tools via tool embeddings

    Shibo Hao, Tianyang Liu, Zhen Wang, and Zhiting Hu. ToolkenGPT: Augmenting frozen language models with massive tools via tool embeddings. In Advances in Neural Information Processing Systems, 2023. URL https://proceedings.neurips.cc/paper_files/paper/2023/hash/8fd1a81c882cd45f64958da6284f4a3f-Abstract-Conference.html

  31. [31]

    AnyTool: Self-reflective, hierarchical agents for large-scale API calls

    Yu Du, Fangyun Wei, and Hongyang Zhang. AnyTool: Self-reflective, hierarchical agents for large-scale API calls. In Proceedings of the 41st International Conference on Machine Learning, volume 235 of Proceedings of Machine Learning Research, pages 11812–11829. PMLR, 2024. URL https://proceedings.mlr.press/v235/du24h.html

  32. [32]

    Re-invoke: Tool invocation rewriting for zero-shot tool retrieval

    Yanfei Chen, Jinsung Yoon, Devendra Singh Sachan, Qingze Wang, Vincent Cohen-Addad, Mohammadhossein Bateni, Chen-Yu Lee, and Tomas Pfister. Re-invoke: Tool invocation rewriting for zero-shot tool retrieval. In Findings of the Association for Computational Linguistics: EMNLP 2024, pages 4705–4726, Miami, Florida, USA, 2024. Association for Computation...

  33. [33]

    Retrieval-augmented generation for knowledge-intensive NLP tasks

    Patrick Lewis, Ethan Perez, Aleksandra Piktus, Fabio Petroni, Vladimir Karpukhin, Naman Goyal, Heinrich Küttler, Mike Lewis, Wen-tau Yih, Tim Rocktäschel, Sebastian Riedel, and Douwe Kiela. Retrieval-augmented generation for knowledge-intensive NLP tasks. In Advances in Neural Information Processing Systems, volume 33, 2020. URL https://proceedings.neur...

  34. [34]

    In-depth Analysis of Graph-based RAG in a Unified Framework

    Yingli Zhou, Yaodong Su, Youran Sun, Shu Wang, Taotao Wang, Runyuan He, Yongwei Zhang, Sicong Liang, Xilin Liu, Yuchi Ma, et al. In-depth analysis of graph-based rag in a unified framework. arXiv preprint arXiv:2503.04338, 2025

  35. [35]

    From Local to Global: A Graph RAG Approach to Query-Focused Summarization

    Darren Edge, Ha Trinh, Newman Cheng, Joshua Bradley, Alex Chao, Apurva Mody, Steven Truitt, Dasha Metropolitansky, Robert Osazuwa Ness, and Jonathan Larson. From local to global: A graph RAG approach to query-focused summarization. arXiv preprint arXiv:2404.16130, 2024. doi: 10.48550/arXiv.2404.16130. URL https://arxiv.org/abs/2404.16130

  36. [36]

    Pathrag: Pruning graph-based retrieval augmented generation with relational paths

    Boyu Chen, Zirui Guo, Zidan Yang, Yuluo Chen, Junze Chen, Zhenghao Liu, Chuan Shi, and Cheng Yang. Pathrag: Pruning graph-based retrieval augmented generation with relational paths. In Proceedings of the AAAI conference on artificial intelligence, volume 40, pages 30183–30191, 2026

  37. [37]

    LightRAG: Simple and Fast Retrieval-Augmented Generation

    Zirui Guo, Lianghao Xia, Yanhua Yu, Tian Ao, and Chao Huang. Lightrag: Simple and fast retrieval-augmented generation. arXiv preprint arXiv:2410.05779, 2(3), 2024

  38. [38]

    Retrieval-augmented generation with hierarchical knowledge

    Haoyu Huang, Yongfeng Huang, Junjie Yang, Zhenyu Pan, Yongqiang Chen, Kaili Ma, Hongzhi Chen, and James Cheng. Retrieval-augmented generation with hierarchical knowledge. arXiv preprint arXiv:2503.10150, 2025

  39. [39]

    Active retrieval augmented generation

    Zhengbao Jiang, Frank F Xu, Luyu Gao, Zhiqing Sun, Qian Liu, Jane Dwivedi-Yu, Yiming Yang, Jamie Callan, and Graham Neubig. Active retrieval augmented generation. In Proceedings of the 2023 conference on empirical methods in natural language processing, pages 7969–7992, 2023

  40. [40]

    Self-rag: Learning to retrieve, generate, and critique through self-reflection

    Akari Asai, Zeqiu Wu, Yizhong Wang, Avirup Sil, and Hannaneh Hajishirzi. Self-rag: Learning to retrieve, generate, and critique through self-reflection. In The Twelfth International Conference on Learning Representations, 2023

  41. [41]

    Interleaving retrieval with chain-of-thought reasoning for knowledge-intensive multi-step questions

    Harsh Trivedi, Niranjan Balasubramanian, Tushar Khot, and Ashish Sabharwal. Interleaving retrieval with chain-of-thought reasoning for knowledge-intensive multi-step questions. In Proceedings of the 61st annual meeting of the association for computational linguistics (volume 1: long papers), pages 10014–10037, 2023

  42. [42]

    Demystifying and enhancing the efficiency of large language model based search agents

    Tiannuo Yang, Zebin Yao, Bowen Jin, Lixiao Cui, Yusen Li, Gang Wang, and Xiaoguang Liu. Demystifying and enhancing the efficiency of large language model based search agents. arXiv preprint arXiv:2505.12065, 2025

  43. [43]

    Recomp: Improving retrieval-augmented lms with context compression and selective augmentation

    Fangyuan Xu, Weijia Shi, and Eunsol Choi. Recomp: Improving retrieval-augmented lms with context compression and selective augmentation. In The Twelfth International Conference on Learning Representations, 2023

  44. [44]

    Longllmlingua: Accelerating and enhancing llms in long context scenarios via prompt compression

    Huiqiang Jiang, Qianhui Wu, Xufang Luo, Dongsheng Li, Chin-Yew Lin, Yuqing Yang, and Lili Qiu. Longllmlingua: Accelerating and enhancing llms in long context scenarios via prompt compression. In Proceedings of the 62nd Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), pages 1658–1677, 2024

  45. [45]

    Trace the evidence: Constructing knowledge-grounded reasoning chains for retrieval-augmented generation

    Jinyuan Fang, Zaiqiao Meng, and Craig Macdonald. Trace the evidence: Constructing knowledge-grounded reasoning chains for retrieval-augmented generation. In Findings of the Association for Computational Linguistics: EMNLP 2024, pages 8472–8494, 2024

  46. [46]

    Chameleon: Plug-and-play compositional reasoning with large language models

    Pan Lu, Baolin Peng, Hao Cheng, Michel Galley, Kai-Wei Chang, Ying Nian Wu, Song-Chun Zhu, and Jianfeng Gao. Chameleon: Plug-and-play compositional reasoning with large language models. In Advances in Neural Information Processing Systems, 2023

  47. [48]

    URL https://arxiv.org/abs/2309.07597

  48. [49]

    Sentence-BERT: Sentence Embeddings using Siamese BERT-Networks

    Nils Reimers and Iryna Gurevych. Sentence-BERT: Sentence embeddings using siamese BERT-networks. In Proceedings of the 2019 Conference on Empirical Methods in Natural Language Processing and the 9th International Joint Conference on Natural Language Processing, pages 3982–3992. Association for Computational Linguistics, 2019. doi: 10.18653/v1/D19-1410...

  49. [50]

    Codex CLI

    OpenAI. Codex CLI. https://developers.openai.com/codex/cli, 2026. Accessed 2026-05-06

  50. [51]

    GPT-5.2 Model

    OpenAI. GPT-5.2 Model. https://developers.openai.com/api/docs/models/gpt-5.2, 2025. Accessed 2026-05-06

  51. [52]

    Gemini CLI Documentation

    Google. Gemini CLI Documentation. https://google-gemini.github.io/gemini-cli/docs/, 2026. Accessed 2026-05-06

  52. [53]

    Gemini 3 Flash is now available in Gemini CLI

    Google Developers Blog. Gemini 3 Flash is now available in Gemini CLI. https://developers.googleblog.com/gemini-3-flash-is-now-available-in-gemini-cli/, 2025. Accessed 2026-05-06