Pith. Machine review for the scientific record.

arxiv: 2605.10114 · v1 · submitted 2026-05-11 · 💻 cs.CL

Recognition: 2 Lean theorem links

SkillRAE: Agent Skill-Based Context Compilation for Retrieval-Augmented Execution

Authors on Pith: no claims yet

Pith reviewed 2026-05-12 02:50 UTC · model grok-4.3

classification 💻 cs.CL
keywords retrieval-augmented execution · skill libraries · context compilation · LLM agents · skill graphs · multi-level graphs · agent skills · rescue-aware compilation

The pith

SkillRAE compiles coarse skill retrievals into compact, grounded contexts using a multi-level graph and rescue-aware steps for better LLM agent execution.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper seeks to establish that organizing retrieved skills into task-specific contexts is a critical missing piece in Retrieval-Augmented Execution for agents that rely on growing skill libraries. Current approaches optimize retrieval and execution but leave the selected skills in a disorganized form that burdens downstream executors. SkillRAE fills the gap with an offline multi-level graph that links skill communities, individual skills, and subunits, plus an online process of ranked retrieval and rescue-aware compilation. If the method works as claimed, agents could complete complex, artifact-rich tasks more reliably without context overload or lost evidence. The reported results show an 11.7 percent gain on SkillsBench over the prior state of the art, with ablations indicating that the compilation stage itself drives the improvement rather than added prompt length.

Core claim

SkillRAE is a two-stage Retrieval-Augmented Execution method. In an offline indexing stage it builds a multi-level skill graph over communities, skills, and reusable subunits; in the online stage it performs skill-ranked retrieval with subunit evidence export, followed by rescue-aware compact compilation that converts a coarse-ranked skill set into a compact, grounded, and immediately usable task-specific context.
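The three-level index can be made concrete with a small sketch. Everything below is a hypothetical illustration, not the paper's implementation: the class, method names, and example skills are invented to show one plausible shape for a graph over communities, skills, and subunits, with the top-down (community siblings) and bottom-up (shared subunits) lookups the overview figure describes.

```python
from collections import defaultdict


class SkillGraph:
    """Hypothetical three-level index: communities -> skills -> subunits."""

    def __init__(self):
        self.community_of = {}                 # skill name -> community id
        self.skills_in = defaultdict(set)      # community id -> skill names
        self.subunits_of = {}                  # skill name -> reusable subunits

    def add_skill(self, name, community, subunits):
        self.community_of[name] = community
        self.skills_in[community].add(name)
        self.subunits_of[name] = list(subunits)

    def siblings(self, skill):
        """Other skills in the same community (top-down expansion)."""
        return self.skills_in[self.community_of[skill]] - {skill}

    def skills_with_subunit(self, subunit):
        """Skills that share a subunit (bottom-up expansion)."""
        return {s for s, units in self.subunits_of.items() if subunit in units}


# Toy library: two spreadsheet skills share the load_csv subunit.
graph = SkillGraph()
graph.add_skill("make_pivot_table", "spreadsheets", ["load_csv", "group_rows"])
graph.add_skill("plot_timeseries", "spreadsheets", ["load_csv", "render_chart"])
graph.add_skill("scrape_page", "web", ["fetch_url"])
```

Under this reading, retrieval can start from either end: a community hit expands downward to sibling skills, while a subunit hit expands upward to every skill that reuses it.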

What carries the argument

The multi-level skill graph over communities, skills, and subunits, paired with rescue-aware compact compilation that recovers key evidence from coarse retrievals.
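One plausible reading of the rescue step, sketched below under our own assumptions rather than the paper's algorithm: after the top-k ranked skills are selected, highly scored subunits whose source skills did not make the cut are "rescued" into the compiled context instead of being discarded with their parents. The function name, threshold, and example data are all invented for illustration.

```python
def compile_context(ranked_skills, subunit_scores, subunit_owner,
                    k=2, rescue_threshold=0.8):
    """Hypothetical rescue-aware compilation.

    ranked_skills: skill names sorted by retrieval score, best first.
    subunit_scores: subunit name -> relevance score in [0, 1].
    subunit_owner: subunit name -> the skill it was extracted from.
    """
    selected = ranked_skills[:k]
    # Rescue evidence stranded in skills that fell below the cutoff.
    rescued = [
        unit
        for unit, score in sorted(subunit_scores.items(), key=lambda kv: -kv[1])
        if subunit_owner[unit] not in selected and score >= rescue_threshold
    ]
    return {"skills": selected, "rescued_subunits": rescued}


ctx = compile_context(
    ranked_skills=["fill_pdf_form", "merge_documents", "ocr_scan"],
    subunit_scores={"extract_fields": 0.9, "stitch_pages": 0.4, "deskew_image": 0.85},
    subunit_owner={
        "extract_fields": "ocr_scan",
        "stitch_pages": "merge_documents",
        "deskew_image": "ocr_scan",
    },
)
```

Here "ocr_scan" is dropped by the top-2 cutoff, but its two strong subunits survive into the compiled context, which is the failure mode a rescue-aware stage is meant to prevent.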

If this is right

  • LLM agents can scale to larger skill libraries while keeping execution contexts efficient and grounded.
  • Retrieval can tolerate coarser initial ranking provided a subsequent rescue and compilation stage is present.
  • Document-centric and data-intensive workflows become more tractable once skills are organized into immediately usable forms.
  • Context compilation is shown to be a distinct and necessary component rather than a simple prompt-engineering addition.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

  • The same graph-plus-rescue pattern could be tested on non-skill retrieval-augmented tasks such as code generation or multi-hop question answering.
  • Dynamic updates to the skill graph during agent operation might allow the system to incorporate newly discovered skills without full re-indexing.
  • If the graph construction depends on initial skill quality, low-quality libraries would limit gains and point to a need for upstream skill curation.

Load-bearing premise

The multi-level skill graph accurately captures skill relationships, and the rescue-aware compilation step can recover critical evidence from coarse-ranked retrievals without significant loss.

What would settle it

Running SkillRAE on SkillsBench with the rescue-aware compilation stage removed and finding performance equal to or below the prior SOTA baseline would falsify the claim that context compilation is essential.

Figures

Figures reproduced from arXiv: 2605.10114 by Shu Wang, Xiangcheng Meng, Yixiang Fang.

Figure 1
Figure 1. An example agent skill from SkillsBench. It contains a natural-language description, … view at source ↗
Figure 2
Figure 2. Overview of SkillRAE. In the online stage, we first perform skill retrieval over the constructed graph above by combining evidence from skill communities and subunits, which are retrieved in top-down and bottom-up manners, respectively. It then compiles the retrieved skills, selected subunit evidence, rescued subunits from non-selected source skills, and task-output constraints into a task-specific contex… view at source ↗
read the original abstract

Large Language Model (LLM)-based agents (e.g., OpenClaw) increasingly rely on reusable skill libraries to solve artifact-rich tasks such as document-centric workflows and data-intensive analysis. As these libraries grow, a few works have attempted to study the Retrieval-Augmented Execution (RAE), which often first retrieves some external skills and other knowledge, then compiles the context using retrieved skills, and finally executes the task. Existing works mainly focus on optimizing skill retrieval and task execution, and they pay little attention to how to effectively organize the selected skill evidence in a form that is compact, grounded, and immediately usable for the downstream executors to complete tasks. To fill this gap, we propose SkillRAE, a two-stage RAE approach focusing on skill-based context compilation, which consists of the offline and online stages. Specifically, in the offline indexing stage, it builds a multi-level skill graph over skill communities, skills, and reusable subunits, for capturing their relationships. In the online retrieval stage, it first performs skill-ranked retrieval with selected-subunit evidence export in the graph, and then applies rescue-aware compact compilation to recover the key evidence. Together, these components compile a coarse-ranked skill set into a task-specific context that is compact, grounded, and immediately usable. Experiments on two public benchmarks show that SkillRAE achieves a significant improvement over baselines for RAE. For example, on SkillsBench, it achieves an improvement of 11.7% over the SOTA method. Ablation studies further show that our context compilation is crucial, instead of a mere prompt addition.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Referee Report

2 major / 2 minor

Summary. The manuscript introduces SkillRAE, a two-stage Retrieval-Augmented Execution (RAE) method for LLM-based agents. The offline stage constructs a multi-level skill graph over skill communities, skills, and reusable subunits to capture relationships. The online stage performs skill-ranked retrieval with selected-subunit evidence export, followed by rescue-aware compact compilation to produce compact, grounded, and immediately usable task contexts. Experiments on two public benchmarks report an 11.7% improvement over SOTA on SkillsBench, with ablations indicating that context compilation is essential rather than a simple prompt addition.

Significance. If the results hold under rigorous verification, SkillRAE addresses a clear gap in RAE literature by prioritizing effective organization of retrieved skill evidence over retrieval or execution alone. The multi-level graph and rescue-aware mechanism offer a structured way to handle expanding skill libraries for artifact-rich tasks. The reported gains and ablation emphasis on compilation provide a promising direction, though the absence of detailed experimental protocols limits immediate assessment of robustness and reproducibility.

major comments (2)
  1. [Experiments] Experiments section (as summarized in abstract): The central performance claim of an 11.7% improvement over SOTA on SkillsBench, along with the assertion that ablations demonstrate context compilation is crucial, lacks any description of experimental setup, baselines, number of runs, statistical tests, error bars, or data handling. This information is load-bearing for validating the empirical results that support the paper's main contribution.
  2. [Online stage] Online retrieval and compilation stage (as described in abstract): The rescue-aware compact compilation is presented at a high level as recovering key evidence from coarse-ranked skills without loss; however, no concrete mechanism, algorithm, or example is supplied to show how it avoids dropping critical evidence, which directly underpins the claim that the compiled context is 'immediately usable' for downstream executors.
minor comments (2)
  1. [Abstract] The abstract would benefit from a brief illustrative example of the multi-level skill graph (communities/skills/subunits) to clarify the offline indexing process for readers.
  2. [Method overview] Terminology such as 'rescue-aware' and 'selected-subunit evidence export' is introduced without prior definition or reference, which could be clarified in the method overview for better readability.

Simulated Author's Rebuttal

2 responses · 0 unresolved

We thank the referee for the constructive feedback and the recommendation for major revision. We address each major comment point by point below and will incorporate the suggested improvements to strengthen the manuscript's clarity and reproducibility.

read point-by-point responses
  1. Referee: [Experiments] Experiments section (as summarized in abstract): The central performance claim of an 11.7% improvement over SOTA on SkillsBench, along with the assertion that ablations demonstrate context compilation is crucial, lacks any description of experimental setup, baselines, number of runs, statistical tests, error bars, or data handling. This information is load-bearing for validating the empirical results that support the paper's main contribution.

    Authors: We agree that the current manuscript does not include sufficient details on the experimental protocol to allow full verification of the reported results. In the revised version, we will expand Section 4 (Experiments) to explicitly describe the full experimental setup, list all baselines with citations, specify the number of runs (including error bars), detail the statistical tests performed, and outline data handling procedures. This will directly support the 11.7% improvement claim and the ablation analysis. revision: yes

  2. Referee: [Online stage] Online retrieval and compilation stage (as described in abstract): The rescue-aware compact compilation is presented at a high level as recovering key evidence from coarse-ranked skills without loss; however, no concrete mechanism, algorithm, or example is supplied to show how it avoids dropping critical evidence, which directly underpins the claim that the compiled context is 'immediately usable' for downstream executors.

    Authors: We acknowledge that the rescue-aware compact compilation is currently described at a conceptual level without a concrete algorithm or example. In the revision, we will add a detailed algorithmic description with pseudocode for the rescue mechanism in Section 3, along with a worked example illustrating how critical evidence is identified, exported from the skill graph, and preserved during compilation to ensure no loss and immediate usability. revision: yes

Circularity Check

0 steps flagged

No significant circularity; empirical claims on public benchmarks

full rationale

The paper introduces SkillRAE as a two-stage system (offline multi-level skill graph over communities/skills/subunits, online ranked retrieval plus rescue-aware compilation) whose value is asserted via measured gains on public benchmarks (e.g., +11.7% on SkillsBench versus SOTA) and ablation studies showing context compilation is not mere prompt addition. No equations, fitted parameters, or derivations are present that could reduce to the inputs by construction. No self-citations are invoked as load-bearing uniqueness theorems or ansatzes. The evaluation therefore rests on independent external data rather than self-referential definitions or renamings, rendering the derivation chain self-contained.

Axiom & Free-Parameter Ledger

0 free parameters · 2 axioms · 2 invented entities

Review based on abstract only; specific free parameters, axioms, and entities cannot be audited in detail without full text. The approach assumes standard LLM prompting and retrieval mechanics plus the utility of graph-structured skills.

axioms (2)
  • domain assumption LLM agents benefit from reusable skill libraries for artifact-rich tasks
    Opening premise of the abstract.
  • domain assumption Organizing skills into communities, skills, and subunits captures useful relationships
    Basis for the offline indexing stage.
invented entities (2)
  • multi-level skill graph no independent evidence
    purpose: Capturing relationships across skill communities, skills, and reusable subunits for retrieval
    Core new structure introduced in offline stage; no independent evidence provided in abstract.
  • rescue-aware compact compilation no independent evidence
    purpose: Recovering key evidence to produce compact, grounded, usable context
    Key online-stage component; no independent evidence provided in abstract.

pith-pipeline@v0.9.0 · 5584 in / 1391 out tokens · 52634 ms · 2026-05-12T02:50:40.169806+00:00 · methodology

discussion (0)


Lean theorems connected to this paper

Citations machine-checked in the Pith Canon. Every link opens the source theorem in the public Lean library.

What do these tags mean?
matches
The paper's claim is directly supported by a theorem in the formal canon.
supports
The theorem supports part of the paper's argument, but the paper may add assumptions or extra steps.
extends
The paper goes beyond the formal theorem; the theorem is a base layer rather than the whole result.
uses
The paper appears to rely on the theorem as machinery.
contradicts
The paper's claim conflicts with a theorem or certificate in the canon.
unclear
Pith found a possible connection, but the passage is too broad, indirect, or ambiguous to say the theorem truly supports the claim.

Reference graph

Works this paper leans on

52 extracted references · 52 canonical work pages · 9 internal anchors

  1. [1]

    OpenClaw: Personal ai assistant

    OpenClaw. OpenClaw: Personal ai assistant. https://github.com/openclaw/openclaw, 2026. Accessed: 2026-05-05

  2. [2]

    Welcome – manus documentation

    Manus AI. Welcome – manus documentation. https://manus.im/docs/introduction/welcome,

  3. [3]

    Accessed: 2026-05-05

  4. [4]

    API-bank: A comprehensive benchmark for tool-augmented LLMs

    Minghao Li, Yingxiu Zhao, Bowen Yu, Feifan Song, Hangyu Li, Haiyang Yu, Zhoujun Li, Fei Huang, and Yongbin Li. API-bank: A comprehensive benchmark for tool-augmented LLMs. In Proceedings of the 2023 Conference on Empirical Methods in Natural Language Processing, pages 3102–3116, Singapore, 2023. Association for Computational Linguistics. doi: 10.18653/v1...

  5. [5]

    ToolLLM: Facilitating large language models to master 16000+ real-world APIs

    Yujia Qin, Shihao Liang, Yining Ye, Kunlun Zhu, Lan Yan, Yaxi Lu, Yankai Lin, Xin Cong, Xiangru Tang, Bill Qian, Sihan Zhao, Lauren Hong, Runchu Tian, Ruobing Xie, Jie Zhou, Mark Gerstein, Dahai Li, Zhiyuan Liu, and Maosong Sun. ToolLLM: Facilitating large language models to master 16000+ real-world APIs. In International Conference on Learning Represent...

  6. [6]

    ToolHop: A query-driven benchmark for evaluating large language models in multi-hop tool use

    Junjie Ye, Zhengyin Du, Xuesong Yao, Weijian Lin, Yufei Xu, Zehui Chen, Zaiyuan Wang, Sining Zhu, Zhiheng Xi, Siyu Yuan, Tao Gui, Qi Zhang, Xuanjing Huang, and Jiecao Chen. ToolHop: A query-driven benchmark for evaluating large language models in multi-hop tool use. In Proceedings of the 63rd Annual Meeting of the Association for Computational Linguistic...

  7. [7]

    ShortcutsBench: A large-scale real-world benchmark for api-based agents

    Haiyang Shen, Yue Li, Desong Meng, Dongqi Cai, Sheng Qi, Li Zhang, Mengwei Xu, and Yun Ma. ShortcutsBench: A large-scale real-world benchmark for api-based agents. In Proceedings of the International Conference on Learning Representations, 2025

  8. [8]

    Voyager: An Open-Ended Embodied Agent with Large Language Models

    Guanzhi Wang, Yuqi Xie, Yunfan Jiang, Ajay Mandlekar, Chaowei Xiao, Yuke Zhu, Linxi Fan, and Anima Anandkumar. Voyager: An open-ended embodied agent with large language models. arXiv preprint arXiv:2305.16291, 2023. doi: 10.48550/arXiv.2305.16291. URL https://arxiv.org/abs/2305.16291

  9. [9]

    SkillsBench: Benchmarking How Well Agent Skills Work Across Diverse Tasks

    Xiangyi Li, Wenbo Chen, Yimin Liu, Shenghan Zheng, Xiaokun Chen, Yifeng He, Yubo Li, Bingran You, Haotian Shen, Jiankai Sun, Shuyi Wang, Qunhong Zeng, Di Wang, Xuandong Zhao, Yuanli Wang, Roey Ben Chaim, Zonglin Di, Yipeng Gao, Junwei He, Yizhuo He, Liqiang Jing, Luyang Kong, Xin Lan, Jiachen Li, Songlin Li, Yijiang Li, Yueqian Lin, Xinyi Liu, Xuanqing ...

  10. [10]

    Organizing, orchestrating, and benchmarking agent skills at ecosystem scale, 2026

    Hao Li, Chunjiang Mu, Jianhao Chen, Siyue Ren, Zhiyao Cui, Yiqun Zhang, Lei Bai, and Shuyue Hu. Organizing, orchestrating, and benchmarking agent skills at ecosystem scale. arXiv preprint arXiv:2603.02176, 2026. doi: 10.48550/arXiv.2603.02176. URL https://arxiv.org/abs/2603.02176

  11. [11]

    SkillRouter: Skill routing for LLM agents at scale

    YanZhao Zheng, ZhenTao Zhang, Chao Ma, YuanQiang Yu, JiHuai Zhu, Yong Wu, Tianze Xu, Baohua Dong, Hangcheng Zhu, Ruohui Huang, and Gang Yu. SkillRouter: Skill routing for LLM agents at scale. arXiv preprint arXiv:2603.22455, 2026. doi: 10.48550/arXiv.2603.22

  12. [12]

    URL https://arxiv.org/abs/2603.22455

  13. [13]

    RepoCoder: Repository-level code completion through iterative retrieval and generation

    Fengji Zhang, Bei Chen, Yue Zhang, Jacky Keung, Jin Liu, Daoguang Zan, Yi Mao, Jian-Guang Lou, and Weizhu Chen. RepoCoder: Repository-level code completion through iterative retrieval and generation. In Proceedings of the 2023 Conference on Empirical Methods in Natural Language Processing, pages 2471–2484, 2023

  14. [14]

    Dataflow-guided retrieval augmentation for repository-level code completion

    Wei Cheng, Yuhan Wu, and Wei Hu. Dataflow-guided retrieval augmentation for repository-level code completion. In Proceedings of the 62nd Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), pages 7957–7977, 2024

  15. [15]

    RepoGraph: Enhancing ai software engineering with repository-level code graph

    Siru Ouyang, Wenhao Yu, Kaixin Ma, Zilin Xiao, Zhihan Zhang, Mengzhao Jia, Jiawei Han, Hongming Zhang, and Dong Yu. RepoGraph: Enhancing ai software engineering with repository-level code graph. In Proceedings of the International Conference on Learning Representations, 2025

  16. [16]

    Parth Sarthi, Salman Abdullah, Aditi Tuli, Shubh Khanna, Anna Goldie, and Christopher D. Manning. RAPTOR: Recursive abstractive processing for tree-organized retrieval. In International Conference on Learning Representations, 2024. URL https://openreview.net/forum?id=GN921JHCRw

  17. [17]

    HippoRAG: Neurobiologically inspired long-term memory for large language models

    Bernal Jiménez Gutiérrez, Yiheng Shu, Yu Gu, Michihiro Yasunaga, and Yu Su. HippoRAG: Neurobiologically inspired long-term memory for large language models. In Advances in Neural Information Processing Systems, volume 37, 2024

  18. [18]

    StructRAG: Boosting knowledge intensive reasoning of llms via inference-time hybrid information structurization

    Zhuoqun Li, Xuanang Chen, Haiyang Yu, Hongyu Lin, Yaojie Lu, Qiaoyu Tang, Fei Huang, Xianpei Han, Le Sun, and Yongbin Li. StructRAG: Boosting knowledge intensive reasoning of llms via inference-time hybrid information structurization. In Proceedings of the International Conference on Learning Representations, 2025

  19. [19]

    ArchRAG: Attributed community-based hierarchical retrieval-augmented generation

    Shu Wang, Yixiang Fang, Yingli Zhou, Xilin Liu, and Yuchi Ma. ArchRAG: Attributed community-based hierarchical retrieval-augmented generation. In Proceedings of the AAAI Conference on Artificial Intelligence, volume 40, pages 15868–15876, 2026. doi: 10.1609/aaai.v40i19.38619. URL https://ojs.aaai.org/index.php/AAAI/article/view/38619

  20. [20]

    BookRAG: A hierarchical structure-aware index-based approach for retrieval-augmented generation on complex documents

    Shu Wang, Yingli Zhou, and Yixiang Fang. BookRAG: A hierarchical structure-aware index-based approach for retrieval-augmented generation on complex documents. arXiv preprint arXiv:2512.03413, 2025. doi: 10.48550/arXiv.2512.03413. URL https://arxiv.org/abs/2512.03413

  21. [21]

    HuggingGPT: Solving AI tasks with ChatGPT and its friends in hugging face

    Yongliang Shen, Kaitao Song, Xu Tan, Dongsheng Li, Weiming Lu, and Yueting Zhuang. HuggingGPT: Solving AI tasks with ChatGPT and its friends in hugging face. In Advances in Neural Information Processing Systems, 2023. URL https://openreview.net/forum?id=yHdTscY6Ci

  22. [22]

    ToolPlanner: A tool augmented LLM for multi granularity instructions with path planning and feedback

    Qinzhuo Wu, Wei Liu, Jian Luan, and Bin Wang. ToolPlanner: A tool augmented LLM for multi granularity instructions with path planning and feedback. In Proceedings of the 2024 Conference on Empirical Methods in Natural Language Processing, pages 18315–18339, Miami, Florida, USA, November 2024. Association for Computational Linguistics. doi: 10.18653/v1/202...

  23. [23]

    Trace2Skill: Distill Trajectory-Local Lessons into Transferable Agent Skills

    Jingwei Ni, Yihao Liu, Xinpeng Liu, Yutao Sun, Mengyu Zhou, Pengyu Cheng, Dexin Wang, Xiaoxi Jiang, and Guanjun Jiang. Trace2Skill: Distill trajectory-local lessons into transferable agent skills. arXiv preprint arXiv:2603.25158, 2026

  24. [24]

    ReasoningBank: Scaling Agent Self-Evolving with Reasoning Memory

    Siru Ouyang, Jun Yan, I Hsu, Yanfei Chen, Ke Jiang, Zifeng Wang, Rujun Han, Long T Le, Samira Daruki, Xiangru Tang, et al. ReasoningBank: Scaling agent self-evolving with reasoning memory. arXiv preprint arXiv:2509.25140, 2025

  25. [25]

    Graph of Skills: Dependency-Aware Structural Retrieval for Massive Agent Skills

    Dawei Liu, Zongxia Li, Hongyang Du, Xiyang Wu, Shihang Gui, Yongbei Kuang, and Lichao Sun. Graph of skills: Dependency-aware structural retrieval for massive agent skills. arXiv preprint arXiv:2604.05333, 2026. doi: 10.48550/arXiv.2604.05333. URL https://arxiv.org/abs/2604.05333

  26. [26]

    ReAct: Synergizing reasoning and acting in language models

    Shunyu Yao, Jeffrey Zhao, Dian Yu, Nan Du, Izhak Shafran, Karthik Narasimhan, and Yuan Cao. ReAct: Synergizing reasoning and acting in language models. In International Conference on Learning Representations, 2023. URL https://openreview.net/forum?id=WE_vluYUL-X

  27. [27]

    Toolformer: Language models can teach themselves to use tools

    Timo Schick, Jane Dwivedi-Yu, Roberto Dessi, Roberta Raileanu, Maria Lomeli, Eric Hambro, Luke Zettlemoyer, Nicola Cancedda, and Thomas Scialom. Toolformer: Language models can teach themselves to use tools. In Advances in Neural Information Processing Systems, 2023. URL https://openreview.net/forum?id=Yacmpz84TH

  28. [28]

    Gorilla: Large language model connected with massive APIs

    Shishir G. Patil, Tianjun Zhang, Xin Wang, and Joseph E. Gonzalez. Gorilla: Large language model connected with massive APIs. In Advances in Neural Information Processing Systems,

  29. [29]

    URL https://proceedings.neurips.cc/paper_files/paper/2024/hash/e4c61f578ff07830f5c37378dd3ecb0d-Abstract-Conference.html

  30. [30]

    ToolkenGPT: Augmenting frozen language models with massive tools via tool embeddings

    Shibo Hao, Tianyang Liu, Zhen Wang, and Zhiting Hu. ToolkenGPT: Augmenting frozen language models with massive tools via tool embeddings. In Advances in Neural Information Processing Systems, 2023. URL https://proceedings.neurips.cc/paper_files/paper/2023/hash/8fd1a81c882cd45f64958da6284f4a3f-Abstract-Conference.html

  31. [31]

    AnyTool: Self-reflective, hierarchical agents for large-scale API calls

    Yu Du, Fangyun Wei, and Hongyang Zhang. AnyTool: Self-reflective, hierarchical agents for large-scale API calls. In Proceedings of the 41st International Conference on Machine Learning, volume 235 of Proceedings of Machine Learning Research, pages 11812–11829. PMLR, 2024. URL https://proceedings.mlr.press/v235/du24h.html

  32. [32]

    Re-invoke: Tool invocation rewriting for zero-shot tool retrieval

    Yanfei Chen, Jinsung Yoon, Devendra Singh Sachan, Qingze Wang, Vincent Cohen-Addad, Mohammadhossein Bateni, Chen-Yu Lee, and Tomas Pfister. Re-invoke: Tool invocation rewriting for zero-shot tool retrieval. In Findings of the Association for Computational Linguistics: EMNLP 2024, pages 4705–4726, Miami, Florida, USA, 2024. Association for Computation...

  33. [33]

    Retrieval-augmented generation for knowledge-intensive NLP tasks

    Patrick Lewis, Ethan Perez, Aleksandra Piktus, Fabio Petroni, Vladimir Karpukhin, Naman Goyal, Heinrich Küttler, Mike Lewis, Wen-tau Yih, Tim Rocktäschel, Sebastian Riedel, and Douwe Kiela. Retrieval-augmented generation for knowledge-intensive NLP tasks. In Advances in Neural Information Processing Systems, volume 33, 2020. URL https://proceedings.neur...

  34. [34]

    In-depth Analysis of Graph-based RAG in a Unified Framework

    Yingli Zhou, Yaodong Su, Youran Sun, Shu Wang, Taotao Wang, Runyuan He, Yongwei Zhang, Sicong Liang, Xilin Liu, Yuchi Ma, et al. In-depth analysis of graph-based rag in a unified framework. arXiv preprint arXiv:2503.04338, 2025

  35. [35]

    From Local to Global: A Graph RAG Approach to Query-Focused Summarization

    Darren Edge, Ha Trinh, Newman Cheng, Joshua Bradley, Alex Chao, Apurva Mody, Steven Truitt, Dasha Metropolitansky, Robert Osazuwa Ness, and Jonathan Larson. From local to global: A graph RAG approach to query-focused summarization. arXiv preprint arXiv:2404.16130, 2024. doi: 10.48550/arXiv.2404.16130. URL https://arxiv.org/abs/2404.16130

  36. [36]

    Pathrag: Pruning graph-based retrieval augmented generation with relational paths

    Boyu Chen, Zirui Guo, Zidan Yang, Yuluo Chen, Junze Chen, Zhenghao Liu, Chuan Shi, and Cheng Yang. Pathrag: Pruning graph-based retrieval augmented generation with relational paths. In Proceedings of the AAAI conference on artificial intelligence, volume 40, pages 30183–30191, 2026

  37. [37]

    LightRAG: Simple and Fast Retrieval-Augmented Generation

    Zirui Guo, Lianghao Xia, Yanhua Yu, Tian Ao, and Chao Huang. Lightrag: Simple and fast retrieval-augmented generation. arXiv preprint arXiv:2410.05779, 2(3), 2024

  38. [38]

    Retrieval-augmented generation with hierarchical knowledge

    Haoyu Huang, Yongfeng Huang, Junjie Yang, Zhenyu Pan, Yongqiang Chen, Kaili Ma, Hongzhi Chen, and James Cheng. Retrieval-augmented generation with hierarchical knowledge. arXiv preprint arXiv:2503.10150, 2025

  39. [39]

    Active retrieval augmented generation

    Zhengbao Jiang, Frank F Xu, Luyu Gao, Zhiqing Sun, Qian Liu, Jane Dwivedi-Yu, Yiming Yang, Jamie Callan, and Graham Neubig. Active retrieval augmented generation. In Proceedings of the 2023 conference on empirical methods in natural language processing, pages 7969–7992, 2023

  40. [40]

    Self-rag: Learning to retrieve, generate, and critique through self-reflection

    Akari Asai, Zeqiu Wu, Yizhong Wang, Avirup Sil, and Hannaneh Hajishirzi. Self-rag: Learning to retrieve, generate, and critique through self-reflection. In The Twelfth International Conference on Learning Representations, 2023

  41. [41]

    Interleaving retrieval with chain-of-thought reasoning for knowledge-intensive multi-step questions

    Harsh Trivedi, Niranjan Balasubramanian, Tushar Khot, and Ashish Sabharwal. Interleaving retrieval with chain-of-thought reasoning for knowledge-intensive multi-step questions. In Proceedings of the 61st annual meeting of the association for computational linguistics (volume 1: long papers), pages 10014–10037, 2023

  42. [42]

    Demystifying and enhancing the efficiency of large language model based search agents

    Tiannuo Yang, Zebin Yao, Bowen Jin, Lixiao Cui, Yusen Li, Gang Wang, and Xiaoguang Liu. Demystifying and enhancing the efficiency of large language model based search agents. arXiv preprint arXiv:2505.12065, 2025

  43. [43]

    Recomp: Improving retrieval-augmented lms with context compression and selective augmentation

    Fangyuan Xu, Weijia Shi, and Eunsol Choi. Recomp: Improving retrieval-augmented lms with context compression and selective augmentation. In The Twelfth International Conference on Learning Representations, 2023

  44. [44]

    Longllmlingua: Accelerating and enhancing llms in long context scenarios via prompt compression

    Huiqiang Jiang, Qianhui Wu, Xufang Luo, Dongsheng Li, Chin-Yew Lin, Yuqing Yang, and Lili Qiu. Longllmlingua: Accelerating and enhancing llms in long context scenarios via prompt compression. In Proceedings of the 62nd Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), pages 1658–1677, 2024

  45. [45]

    Trace the evidence: Constructing knowledge-grounded reasoning chains for retrieval-augmented generation

    Jinyuan Fang, Zaiqiao Meng, and Craig Macdonald. Trace the evidence: Constructing knowledge-grounded reasoning chains for retrieval-augmented generation. In Findings of the Association for Computational Linguistics: EMNLP 2024, pages 8472–8494, 2024

  46. [46]

    Chameleon: Plug-and-play compositional reasoning with large language models

    Pan Lu, Baolin Peng, Hao Cheng, Michel Galley, Kai-Wei Chang, Ying Nian Wu, Song-Chun Zhu, and Jianfeng Gao. Chameleon: Plug-and-play compositional reasoning with large language models. In Advances in Neural Information Processing Systems, 2023

  47. [48]

    URL https://arxiv.org/abs/2309.07597

  48. [49]

    Sentence-BERT: Sentence Embeddings using Siamese BERT-Networks

    Nils Reimers and Iryna Gurevych. Sentence-BERT: Sentence embeddings using siamese BERT-networks. In Proceedings of the 2019 Conference on Empirical Methods in Natural Language Processing and the 9th International Joint Conference on Natural Language Processing, pages 3982–3992. Association for Computational Linguistics, 2019. doi: 10.18653/v1/D19-1410...

  49. [50]

    Codex CLI

    OpenAI. Codex CLI. https://developers.openai.com/codex/cli, 2026. Accessed 2026-05-06

  50. [51]

    GPT-5.2 Model

    OpenAI. GPT-5.2 Model. https://developers.openai.com/api/docs/models/gpt-5.2, 2025. Accessed 2026-05-06

  51. [52]

    Gemini CLI Documentation

    Google. Gemini CLI Documentation. https://google-gemini.github.io/gemini-cli/docs/, 2026. Accessed 2026-05-06

  52. [53]

    Gemini 3 Flash is now available in Gemini CLI

    Google Developers Blog. Gemini 3 Flash is now available in Gemini CLI. https://developers.googleblog.com/gemini-3-flash-is-now-available-in-gemini-cli/, 2025. Accessed 2026-05-06