{"paper":{"title":"Don't Retrieve, Navigate: Distilling Enterprise Knowledge into Navigable Agent Skills for QA and RAG","license":"http://creativecommons.org/licenses/by/4.0/","headline":"Distilling a corpus into a hierarchical skill directory lets an LLM agent navigate it to improve QA and RAG on structured enterprise data.","cross_cats":["cs.AI","cs.CL","cs.MA"],"primary_cat":"cs.IR","authors_text":"Lawrence B. Hsieh, Pengfei Wei, Yiqun Sun","submitted_at":"2026-04-16T03:05:37Z","abstract_excerpt":"Retrieval-Augmented Generation (RAG) grounds LLM responses in external evidence but treats the model as a passive consumer of search results, with no view of how the corpus is organized or what it has not yet seen. We present Corpus2Skill, which distills a document corpus offline into a hierarchical skill directory and lets an LLM agent navigate it at serve time, drilling from a bird's-eye view through progressively finer summaries down to documents, and backtracking when a branch is unproductive. On an enterprise customer-support benchmark, Corpus2Skill improves both answer quality and ground"},"claims":{"count":4,"items":[{"kind":"strongest_claim","text":"On an enterprise customer-support benchmark, Corpus2Skill improves both answer quality and grounding over single-shot dense, hybrid, hierarchical-retrieval, and agentic RAG baselines at a moderate cost tradeoff.","source":"verdict.strongest_claim","status":"machine_extracted","claim_id":"C1","attestation":"unclaimed"},{"kind":"weakest_assumption","text":"The target corpus possesses a recoverable topical taxonomy that permits effective offline hierarchical clustering into a navigable skill directory.","source":"verdict.weakest_assumption","status":"machine_extracted","claim_id":"C2","attestation":"unclaimed"},{"kind":"one_line_summary","text":"Corpus2Skill converts document corpora into navigable hierarchical skill directories for LLM agents, improving QA and RAG quality on single-domain enterprise data but not on open-domain or tabular corpora.","source":"verdict.one_line_summary","status":"machine_extracted","claim_id":"C3","attestation":"unclaimed"},{"kind":"headline","text":"Distilling a corpus into a hierarchical skill directory lets an LLM agent navigate it to improve QA and RAG on structured enterprise data.","source":"verdict.pith_extraction.headline","status":"machine_extracted","claim_id":"C4","attestation":"unclaimed"}],"snapshot_sha256":"c91a0c349890c4c1f6cd94535d31a29dca84361ff0d3fc35d2d46aef87b3fb0e"},"source":{"id":"2604.14572","kind":"arxiv","version":3},"verdict":{"id":"8741f3e3-1e47-4c99-b1bb-85502409c2c8","model_set":{"reader":"grok-4.3"},"created_at":"2026-05-19T17:35:23.506429Z","strongest_claim":"On an enterprise customer-support benchmark, Corpus2Skill improves both answer quality and grounding over single-shot dense, hybrid, hierarchical-retrieval, and agentic RAG baselines at a moderate cost tradeoff.","one_line_summary":"Corpus2Skill converts document corpora into navigable hierarchical skill directories for LLM agents, improving QA and RAG quality on single-domain enterprise data but not on open-domain or tabular corpora.","pipeline_version":"pith-pipeline@v0.9.0","weakest_assumption":"The target corpus possesses a recoverable topical taxonomy that permits effective offline hierarchical clustering into a navigable skill directory.","pith_extraction_headline":"Distilling a corpus into a hierarchical skill directory lets an LLM agent navigate it to improve QA and RAG on structured enterprise data."},"integrity":{"clean":true,"summary":{"advisory":0,"critical":0,"by_detector":{},"informational":0},"endpoint":"/pith/2604.14572/integrity.json","findings":[],"available":true,"detectors_run":[],"snapshot_sha256":"c28c3603d3b5d939e8dc4c7e95fa8dfce3d595e45f758748cecf8e644a296938"},"references":{"count":18,"sample":[{"doi":"","year":2020,"title":"Agentic Retrieval-Augmented Generation: A Survey on Agentic RAG","work_id":"6edcfe5a-e498-4a2e-aa57-dcaa5ef20f86","ref_index":1,"cited_arxiv_id":"2501.09136","is_internal_anchor":true},{"doi":"","year":null,"title":"Name actual features, products, or processes","work_id":"80bdcb83-84ec-4bf1-833d-8d726380f6bf","ref_index":4,"cited_arxiv_id":"","is_internal_anchor":false},{"doi":"","year":null,"title":"What types of user QUESTIONS this group can answer Be specific -- name the main product areas, features, or workflows. Sub-group summaries: - Sub-group 1: {summary[:300]} - Sub-group 2: {summary[:300]","work_id":"e6443fb1-2e80-4584-8068-fbdd93cd79f3","ref_index":7,"cited_arxiv_id":"","is_internal_anchor":false},{"doi":"","year":null,"title":"The common TOPIC area these documents cover","work_id":"19e84311-c340-4e42-b569-af606c8d1c4c","ref_index":8,"cited_arxiv_id":"","is_internal_anchor":false},{"doi":"","year":null,"title":"The types of QUESTIONS these documents answer","work_id":"50c99b61-97a3-4734-890e-2ac01a08c4ad","ref_index":9,"cited_arxiv_id":"","is_internal_anchor":false}],"resolved_work":18,"snapshot_sha256":"d17026f5ba36132b7f415410a32cfd7c0fd5f731d052d42a7d2767215aacfc61","internal_anchors":1},"formal_canon":{"evidence_count":2,"snapshot_sha256":"cbcf0ea62224b8beb60be7dc3c7f417e2f50b8433eaf6e5219cd35a0e0dc1011"},"author_claims":{"count":0,"strong_count":0,"snapshot_sha256":"258153158e38e3291e3d48162225fcdb2d5a3ed65a07baac614ab91432fd4f57"},"builder_version":"pith-number-builder-2026-05-17-v1"}