Don't Retrieve, Navigate: Distilling Enterprise Knowledge into Navigable Agent Skills for QA and RAG

Lawrence B. Hsieh; Pengfei Wei; Yiqun Sun

arXiv: 2604.14572 · v3 · pith:GEVESBSUnew · submitted 2026-04-16 · 💻 cs.IR · cs.AI · cs.CL · cs.MA

Yiqun Sun, Pengfei Wei, Lawrence B. Hsieh

Pith reviewed 2026-05-19 17:35 UTC · model grok-4.3

classification: cs.IR · cs.AI · cs.CL · cs.MA

keywords: Corpus2Skill · RAG · LLM agent navigation · hierarchical skill directory · enterprise QA · corpus distillation · knowledge grounding

The pith

Distilling a corpus into a hierarchical skill directory lets an LLM agent navigate it to improve QA and RAG on structured enterprise data.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper presents Corpus2Skill as a way to move beyond treating retrieval as a one-shot lookup in RAG systems. It first builds an offline hierarchy of skills by clustering the full corpus into layered summaries. At runtime the agent starts at the top level and drills down or backtracks through the structure until it reaches the right documents. This produces higher-quality and better-grounded answers than standard dense, hybrid, or agentic retrieval on customer-support tasks. The gains hold when the corpus has a clear topical organization but disappear on flat or open-domain collections.

Core claim

Corpus2Skill distills the document corpus offline into a hierarchical skill directory. An LLM agent then navigates this directory at inference time, beginning with broad summaries and refining its path to specific documents while backtracking from unproductive branches. The result is improved answer quality and stronger grounding compared with single-shot dense, hybrid, hierarchical-retrieval, and agentic RAG baselines on an enterprise customer-support benchmark.

What carries the argument

The hierarchical skill directory created by offline clustering, which the agent traverses by expanding summaries level by level and backtracking when a branch yields no progress.
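The traversal described here (expand summaries level by level, back out of a branch that yields nothing) can be sketched as a relevance-ordered depth-first search. This is a minimal illustration, not the authors' implementation: the `Node` class, the word-overlap `relevance` scorer, and the toy directory are all assumptions made for the example, and in the paper an LLM makes the routing decision that the scorer stands in for.

```python
from __future__ import annotations

from dataclasses import dataclass, field


@dataclass
class Node:
    """One entry in a toy skill directory (hypothetical structure)."""
    name: str
    summary: str
    children: list[Node] = field(default_factory=list)
    doc_ids: list[str] = field(default_factory=list)


def relevance(query: str, text: str) -> int:
    """Stand-in scorer: number of query words present in the summary.
    An LLM judgement would replace this in a real system."""
    return len(set(query.lower().split()) & set(text.lower().split()))


def navigate(node: Node, query: str, is_answer, trace: list[str]):
    """Descend into the most relevant child first; backtrack on dead ends."""
    trace.append(node.name)
    for doc_id in node.doc_ids:
        if is_answer(doc_id):
            return doc_id
    ranked = sorted(node.children,
                    key=lambda child: relevance(query, child.summary),
                    reverse=True)
    for child in ranked:
        found = navigate(child, query, is_answer, trace)
        if found is not None:
            return found
    trace.append(f"backtrack from {node.name}")
    return None


# Toy directory: the branch that looks most relevant does not hold the answer.
programs = Node("programs", "course settings for online programs",
                doc_ids=["doc-course-setup"])
billing = Node("billing", "currency and invoices",
               doc_ids=["doc-currency"])
root = Node("root", "all support topics", children=[programs, billing])

trace: list[str] = []
result = navigate(root, "change currency for my course",
                  lambda doc_id: doc_id == "doc-currency", trace)
print(result)
print(trace)
```

In this toy run the agent tries the `programs` branch first, records a backtrack, and then finds the document under `billing`, which is the behaviour the paragraph above attributes to the directory.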

If this is right

Answer quality and evidence grounding both rise on enterprise customer-support benchmarks.
The method outperforms single-shot dense, hybrid, hierarchical-retrieval, and agentic RAG baselines.
Cost increases remain moderate relative to the quality gains.
Consistent benefits appear on single-domain corpora that possess a recoverable topical taxonomy.
Flat retrieval remains preferable for open-domain factoid pools or homogeneous-tabular corpora.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the authors make directly.

Navigation gives the agent an explicit map of the corpus, which may reduce cases where relevant but non-retrieved documents are overlooked.
System designers should first test whether their data supports clean top-level clustering before adopting navigation over retrieval.
The same offline distillation step could be applied to other structured collections such as technical manuals or policy archives that share a natural hierarchy.

Load-bearing premise

The target corpus contains a recoverable topical taxonomy that supports effective offline hierarchical clustering into a navigable skill directory.

What would settle it

A performance comparison on a corpus lacking clear topical clusters, such as a random mix of unrelated facts, where navigation fails to outperform standard retrieval methods.

Figures

Figures reproduced from arXiv: 2604.14572 by Lawrence B. Hsieh, Pengfei Wei, Yiqun Sun.

**Figure 1.** Retrieve vs. Navigate. Traditional RAG passively feeds fixed passages to the LLM. Corpus2Skill distills the corpus into a navigable skill hierarchy that the agent actively explores, backtracks, and drills into to locate evidence.

**Figure 2.** Corpus2Skill system architecture. The compile phase (top) embeds documents, builds a multi-level cluster hierarchy with LLM summarization at each level, and materializes the result as a forest of navigable skills. Root clusters map to skill directories (SKILL.md), sub-clusters map to group directories (INDEX.md), and leaf documents are stored in a separate document store. The serve phase (bottom) uses an L…

**Figure 3.** Abridged SKILL.md (top) and leaf INDEX.md (bottom) from the WixQA compilation. The SKILL.md provides a routing overview of 1,513 documents; the INDEX.md lists individual document IDs with titles for targeted retrieval.

**Figure 4.** Trace 1: Direct navigation. The agent descends through a single skill branch to find the relevant document in 4 steps.
Query: "I want to change the currency for my course."
Step 1: view wix-commerce-monetization/SKILL.md -> 15 subgroups; "wix-online-programs" and "wix-billing-documents" relevant
Step 2: view wix-online-programs/INDEX.md -> 5 leaf groups; picks group covering program settings (18 docs)
Step…

**Figure 5.** Trace 2: Cross-branch navigation. The agent visits two subgroups within the same skill to combine evidence from different topic branches.
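The on-disk layout named in the Figure 2 caption (a SKILL.md per root cluster, an INDEX.md per sub-cluster listing document IDs with titles) can be illustrated with a short sketch. Everything concrete here is an assumption for illustration: the `materialize` helper, the file contents, and the toy skill, group, and document names are hypothetical, and the separate document store for full texts is left out.

```python
import tempfile
from pathlib import Path

# Toy hierarchy: skill -> group -> {doc_id: title}. All names are made up.
hierarchy = {
    "billing-and-payments": {
        "currency-settings": {"doc-001": "Change your store currency"},
        "invoices": {"doc-002": "Download an invoice"},
    },
}


def materialize(tree: dict, root: Path) -> list[Path]:
    """Write one SKILL.md per skill and one INDEX.md per group."""
    written = []
    for skill, groups in tree.items():
        skill_dir = root / skill
        skill_dir.mkdir(parents=True, exist_ok=True)
        # SKILL.md: routing overview of the groups under this skill.
        overview = [f"# {skill}", "", "Subgroups:"]
        overview += [f"- {name} ({len(docs)} docs)" for name, docs in groups.items()]
        skill_file = skill_dir / "SKILL.md"
        skill_file.write_text("\n".join(overview) + "\n", encoding="utf-8")
        written.append(skill_file)
        for group, docs in groups.items():
            group_dir = skill_dir / group
            group_dir.mkdir(exist_ok=True)
            # INDEX.md: document IDs with brief titles; full text lives elsewhere.
            index = [f"# {group}", ""]
            index += [f"- {doc_id}: {title}" for doc_id, title in docs.items()]
            index_file = group_dir / "INDEX.md"
            index_file.write_text("\n".join(index) + "\n", encoding="utf-8")
            written.append(index_file)
    return written


with tempfile.TemporaryDirectory() as tmp:
    paths = materialize(hierarchy, Path(tmp))
    relative = sorted(p.relative_to(tmp).as_posix() for p in paths)
    index_text = (Path(tmp) / "billing-and-payments" / "currency-settings"
                  / "INDEX.md").read_text(encoding="utf-8")

print(relative)
print(index_text)
```

Keeping only IDs and titles in INDEX.md mirrors the caption's point that leaf documents sit in a separate store, so the directory itself stays small enough to browse.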

Original abstract

Retrieval-Augmented Generation (RAG) grounds LLM responses in external evidence but treats the model as a passive consumer of search results, with no view of how the corpus is organized or what it has not yet seen. We present Corpus2Skill, which distills a document corpus offline into a hierarchical skill directory and lets an LLM agent navigate it at serve time, drilling from a bird's-eye view through progressively finer summaries down to documents, and backtracking when a branch is unproductive. On an enterprise customer-support benchmark, Corpus2Skill improves both answer quality and grounding over single-shot dense, hybrid, hierarchical-retrieval, and agentic RAG baselines at a moderate cost tradeoff. A ten-subset generalization study further shows that corpus navigation is not a universal replacement for retrieval: it consistently helps on single-domain corpora with a recoverable topical taxonomy, but flat retrieval remains preferable on open-domain factoid pools or homogeneous-tabular corpora that defeat top-level clustering. We characterize this scope distinction and discuss it as a design guideline for knowledge-grounded systems. Code is available at https://github.com/dukesun99/Corpus2Skill.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Desk Editor's Note (private letter to a colleague)

Corpus2Skill distills a corpus into a navigable hierarchical skill directory for agent traversal with backtracking, and the ten-subset study usefully shows this helps on single-domain enterprise data but not everywhere.

The main thing here is that the paper replaces standard retrieval with an agent that navigates a pre-distilled hierarchical skill directory, drilling from broad summaries down to documents and backtracking on dead ends. On their enterprise customer-support benchmark it reports better answer quality and grounding than dense, hybrid, hierarchical-retrieval, and agentic RAG baselines, at moderate extra cost. They also run a ten-subset generalization study that draws a practical line: navigation works on single-domain corpora that have a recoverable topical taxonomy, while flat retrieval stays preferable for open-domain factoid sets or homogeneous tabular data. That scope discussion is the part worth keeping in mind for system design.

Referee Report

1 major / 2 minor

Summary. The paper introduces Corpus2Skill, which distills a document corpus offline into a hierarchical skill directory that an LLM agent navigates at serve time by drilling from high-level summaries to documents and backtracking as needed. On an enterprise customer-support benchmark it reports improvements in answer quality and grounding relative to single-shot dense, hybrid, hierarchical-retrieval, and agentic RAG baselines at moderate extra cost. A ten-subset generalization study further claims that navigation outperforms retrieval on single-domain corpora possessing a recoverable topical taxonomy but that flat retrieval remains preferable on open-domain factoid or homogeneous-tabular corpora; the authors present this distinction as a design guideline and release code.

Significance. If the empirical results and scope characterization hold, the work offers a concrete alternative paradigm to retrieval-centric RAG by shifting to offline distillation and online navigation. The explicit code release and the attempt to delineate corpus conditions under which navigation is advantageous are positive contributions that could guide practitioners building enterprise QA systems. The generalization study, while preliminary, supplies a falsifiable framing that future work can test.

major comments (1)

[Generalization study] Generalization study (ten-subset evaluation): the scope guideline that navigation helps precisely on corpora with a 'recoverable topical taxonomy' is asserted after observing performance differences, yet the manuscript reports no independent clustering-quality diagnostics (silhouette score, cophenetic correlation, cluster purity against domain labels, or summary-fidelity metrics). Without these, it is impossible to confirm that the skill directory actually recovered a useful taxonomy or whether gains arise from prompting differences or other unablated factors; this directly affects the load-bearing claim that the method's benefit is tied to the stated corpus property.

minor comments (2)

[Abstract] Abstract: comparative improvements are stated without any numerical deltas, error bars, or statistical tests, leaving the central empirical claim difficult to assess from the summary alone.
[Method] The description of the offline hierarchical clustering procedure lacks sufficient detail on algorithm choice, linkage method, stopping criteria, and how summaries are generated at each level.

Simulated Authors' Rebuttal

1 response · 0 unresolved

We thank the referee for the detailed and constructive review. We address the major comment on the generalization study below and will incorporate revisions to strengthen the supporting evidence for our scope guideline.

Referee: Generalization study (ten-subset evaluation): the scope guideline that navigation helps precisely on corpora with a 'recoverable topical taxonomy' is asserted after observing performance differences, yet the manuscript reports no independent clustering-quality diagnostics (silhouette score, cophenetic correlation, cluster purity against domain labels, or summary-fidelity metrics). Without these, it is impossible to confirm that the skill directory actually recovered a useful taxonomy or whether gains arise from prompting differences or other unablated factors; this directly affects the load-bearing claim that the method's benefit is tied to the stated corpus property.

Authors: We agree that the generalization study would benefit from explicit, independent clustering-quality diagnostics to more directly link performance gains to the recovery of a useful topical taxonomy rather than unablated factors such as prompting variations. The current manuscript presents the scope distinction primarily through observed performance differentials across corpus types. In the revised version we will add silhouette scores and cophenetic correlation coefficients computed on the hierarchical clustering steps used to build the skill directories for each of the ten subsets. We will also report summary-fidelity metrics (e.g., ROUGE or embedding similarity between generated summaries and source documents) and, for the subsets that possess domain labels, cluster purity. In addition, we will include a targeted ablation that applies the same hierarchical prompting structure to a flat-retrieval baseline, helping isolate the contribution of the navigation mechanism itself. These additions should provide clearer confirmation that the reported benefits track the stated corpus property. revision: yes
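Two of the diagnostics named in this exchange, cluster purity against domain labels and the silhouette score, are simple enough to sketch in plain Python. This is an illustrative sketch under stated assumptions, not the authors' evaluation code: the function names, the 2-D toy points, and the labels are invented for the example, and Euclidean distance on toy coordinates stands in for distances between document embeddings.

```python
from collections import Counter
from math import dist


def cluster_purity(assignments, labels):
    """Share of items that carry the majority label of their own cluster."""
    by_cluster = {}
    for cluster, label in zip(assignments, labels):
        by_cluster.setdefault(cluster, []).append(label)
    majority_total = sum(Counter(members).most_common(1)[0][1]
                         for members in by_cluster.values())
    return majority_total / len(labels)


def mean_silhouette(points, assignments):
    """Mean silhouette coefficient (Euclidean); singleton clusters score 0."""
    clusters = {}
    for point, cluster in zip(points, assignments):
        clusters.setdefault(cluster, []).append(point)
    if len(clusters) < 2:
        raise ValueError("silhouette needs at least two clusters")
    scores = []
    for point, cluster in zip(points, assignments):
        own = clusters[cluster]
        if len(own) == 1:
            scores.append(0.0)
            continue
        # a: mean distance to the other members of the same cluster.
        a = sum(dist(point, other) for other in own) / (len(own) - 1)
        # b: smallest mean distance to the members of any other cluster.
        b = min(sum(dist(point, other) for other in members) / len(members)
                for name, members in clusters.items() if name != cluster)
        scores.append(0.0 if max(a, b) == 0 else (b - a) / max(a, b))
    return sum(scores) / len(scores)


# Toy data (hypothetical): two well-separated clusters, one mislabeled item.
points = [(0, 0), (0, 1), (10, 10), (10, 11)]
assignments = [0, 0, 1, 1]
labels = ["billing", "billing", "domains", "billing"]

purity = cluster_purity(assignments, labels)
silhouette = mean_silhouette(points, assignments)
print(purity, round(silhouette, 3))
```

On this toy input the purity is 0.75, because one of the four items disagrees with its cluster's majority label, while the silhouette is high because the two clusters are far apart. The two numbers answer different questions, which is why the referee asks for both geometric and label-based diagnostics.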

Circularity Check

0 steps flagged

No significant circularity; claims rest on direct empirical comparisons

The paper's central claims rest on offline distillation of a corpus into a hierarchical skill directory followed by online agent navigation, with performance improvements demonstrated via direct comparisons against dense, hybrid, hierarchical-retrieval, and agentic RAG baselines on an enterprise benchmark plus a ten-subset generalization study. No equations, fitted parameters, or predictions reduce by construction to the same evaluation data or self-defined quantities. The scope distinction (navigation helps on single-domain corpora with recoverable topical taxonomy) is presented as an observed empirical pattern rather than a self-referential definition or load-bearing self-citation. The derivation chain is self-contained through the described distillation and navigation process without invoking unverified uniqueness theorems or ansatzes from prior author work.

Axiom & Free-Parameter Ledger

0 free parameters · 1 axiom · 1 invented entity

The central claim depends on the existence of a recoverable topical taxonomy in the corpus and on the effectiveness of offline clustering to produce a useful hierarchy; no free parameters or invented entities with independent evidence are specified in the abstract.

axioms (1)

domain assumption: Document corpora in the target domain contain a recoverable topical taxonomy amenable to hierarchical clustering.
The generalization study explicitly conditions success on single-domain corpora with such a taxonomy.

invented entities (1)

Hierarchical skill directory (no independent evidence)
purpose: To serve as a navigable map that an LLM agent traverses from coarse summaries to source documents.
Introduced as the primary output of the offline distillation step.

pith-pipeline@v0.9.0 · 5743 in / 1338 out tokens · 60144 ms · 2026-05-19T17:35:23.506429+00:00 · methodology

Review history (2 revisions) →

Lean theorems connected to this paper

Citations machine-checked in the Pith Canon. Every link opens the source theorem in the public Lean library.

IndisputableMonolith/Cost/FunctionalEquation.lean · washburn_uniqueness_aczel · unclear

Paper passage: "We build a multi-level hierarchy through an iterative bottom-up process controlled by two parameters: the branching ratio p ... via K-Means ... Each resulting cluster is then summarized by an LLM"

IndisputableMonolith/Foundation/AlexanderDuality.lean · alexander_duality_circle_linking · unclear

Paper passage: "the hierarchy depth grows as O(log_p N) ... agent traverses at most L = ⌈log_p N⌉ levels"
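The depth bound in the quoted passage can be checked with a few lines of integer arithmetic. The sketch below is illustrative only: it assumes each bottom-up pass shrinks the number of nodes by a factor of p (rounded up) until a single root remains, whereas the paper stops at a forest of root clusters, so the real depth can only be smaller. N = 6221 is the first number in the navigation paths quoted in the reference excerpts on this page, used here as a plausible corpus size, and p = 10 is a hypothetical branching ratio, not a value taken from the paper.

```python
def max_levels(n_docs: int, p: int) -> int:
    """Smallest L with p**L >= n_docs, i.e. ceil(log_p(n_docs)), without floats."""
    if n_docs < 1 or p < 2:
        raise ValueError("need n_docs >= 1 and p >= 2")
    levels, capacity = 0, 1
    while capacity < n_docs:
        capacity *= p
        levels += 1
    return levels


def level_sizes(n_docs: int, p: int) -> list[int]:
    """Node counts per level when every pass groups about p nodes into one."""
    if n_docs < 1 or p < 2:
        raise ValueError("need n_docs >= 1 and p >= 2")
    sizes = [n_docs]
    while sizes[-1] > 1:
        sizes.append(-(-sizes[-1] // p))  # ceiling division
    return sizes


sizes = level_sizes(6221, 10)
bound = max_levels(6221, 10)
print(sizes, bound)
```

With these assumed values the level sizes are 6221, 623, 63, 7, 1, which is four reduction passes and matches ⌈log_10 6221⌉ = 4. Repeated ceiling division by p equals a single ceiling division by p^k, which is why the simulated count and the closed-form bound agree.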

What do these tags mean?

matches: The paper's claim is directly supported by a theorem in the formal canon.
supports: The theorem supports part of the paper's argument, but the paper may add assumptions or extra steps.
extends: The paper goes beyond the formal theorem; the theorem is a base layer rather than the whole result.
uses: The paper appears to rely on the theorem as machinery.
contradicts: The paper's claim conflicts with a theorem or certificate in the canon.
unclear: Pith found a possible connection, but the passage is too broad, indirect, or ambiguous to say the theorem truly supports the claim.

Forward citations

Cited by 2 Pith papers

Reviewed papers in the Pith corpus that reference this work. Sorted by Pith novelty score.

A Comprehensive Survey on Agent Skills: Taxonomy, Techniques, and Applications
cs.IR 2026-05 unverdicted novelty 5.0

A survey that taxonomizes agent skills for LLM-based agents across representation, acquisition, retrieval, and evolution stages while reviewing methods, resources, and open challenges.
A Comprehensive Survey on Agent Skills: Taxonomy, Techniques, and Applications
cs.IR 2026-05 unverdicted novelty 4.0

The paper surveys agent skills for LLM agents, organizing the literature into a four-stage lifecycle of representation, acquisition, retrieval, and evolution while highlighting their role in system scalability.

Reference graph

Works this paper leans on

18 extracted references · 18 canonical work pages · cited by 1 Pith paper · 1 internal anchor

[1] Agentic Retrieval-Augmented Generation: A Survey on Agentic RAG
In Proceedings of the 1st Workshop on NLP for COVID-19 at ACL 2020, 2020. Nandy, A., Sharma, S., Maddhashiya, S., Sachdeva, K., Goyal, P., and Ganguly, N. Question answering over electronic devices: A new benchmark dataset and a multi-task learning based QA framework. In Findings of the Association for Computational Linguistics: EMNLP 2021, pp. 4600–460...

[4] Name actual features, products, or processes
Key TERMS or features mentioned across documents Be specific and concrete. Name actual features, products, or processes. Documents ({N} total, showing up to 15): --- Document 1 --- {doc_text[:600]} --- Document 2 --- {doc_text[:600]} [...] Summary: Cluster-level summa...

[7]
What types of user QUESTIONS this group can answer Be specific -- name the main product areas, features, or workflows. Sub-group summaries: - Sub-group 1: {summary[:300]} - Sub-group 2: {summary[:300]} [...] Overview: Labeling prompt. Used to generate a filesystem-safe directory name from a cluster summary. The response is post-processed: lowercased, non...

[8] The common TOPIC area these documents cover

[9] The types of QUESTIONS these documents answer

[10] Name actual features, products, or processes
Key TERMS or features mentioned across documents Be specific and concrete. Name actual features, products, or processes. Documents ({N} total, showing up to 15): --- Document 1 --- {doc_text[:600]} --- Document 2 --- {doc_text[:600]} [...] Summary: Cluster-level summarization. Used at higher levels to summarize a group of sub-cluster summaries into a broad...

[11] The broad DOMAIN these sub-groups cover

[12] The range of TOPICS within this domain

[13] Sub-group summaries: - Sub-group 1: {summary[:300]} - Sub-group 2: {summary[:300]} [...] Overview: Labeling. Used to generate a filesystem-safe directory name from a cluster summary
What types of user QUESTIONS this group can answer Be specific -- name the main product areas, features, or workflows. Sub-group summaries: - Sub-group 1: {summary[:300]} - Sub-group 2: {summary[:300]} [...] Overview: Labeling. Used to generate a filesystem-safe directory name from a cluster summary. The response is post-processed: lowercased, non-alphanu...

[14] Read the SKILL.md of the 1-2 most relevant skills for your query

[15] Drill into the most relevant sub-group: Read its INDEX.md

[16] Pick the most relevant document IDs
At the leaf level, INDEX.md lists document IDs with brief titles. Pick the most relevant document IDs

[17] Call get_document with each relevant doc_id to retrieve the full text

[18] contact support
Read at least one full document before answering. ## Tools - Code execution: Use 'ls' and 'cat' to navigate the skills hierarchy. - get_document(doc_id): Retrieve the full text of a document by its ID. The doc_id values are listed in leaf-level INDEX.md files. ## Answer Format - First sentence = direct answer. No preamble. - Factual questions: 1-3 sentenc...

[19] Go to Settings in your dashboard

[20] Click Language & Region

[21] Scroll to Currency and select your desired currency

[22] LLM only
Click Save. This applies to all Wix products including Online Programs. Your currency must match your payment provider’s currency." Navigation paths: 6221 -> 1513 -> 107 -> 18 -> 1 (programs) 6221 -> 1513 -> 68 -> 18 -> 1 (billing) This trace demonstrates cross-branch navigation: the agent first explored the online-programs subgroup to understand the rela...
