Knowledge Capsules: Structured Nonparametric Memory Units for LLMs
Pith reviewed 2026-05-09 23:57 UTC · model grok-4.3
The pith
Knowledge Capsules integrate external knowledge directly into LLMs' attention via structured memory units, outperforming RAG on QA benchmarks without parameter updates.
A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.
Core claim
Knowledge Capsules are structured nonparametric memory units that represent normalized relational knowledge extracted from document corpora using a frozen base model; when compiled into attention-compatible key-value representations through External Key Value Injection, they let external knowledge participate directly in the model's attention computation, producing consistent gains over RAG and GraphRAG on multiple QA benchmarks with improved stability in long-context and multi-hop scenarios and without any parameter updates.
What carries the argument
Knowledge Capsules, structured nonparametric memory units that represent normalized relational knowledge and are compiled into attention-compatible key-value representations for direct integration into the frozen model's attention computation.
If this is right
- External knowledge integrates more stably during long-context and multi-hop reasoning.
- New or updated information can be added by editing the memory units alone, with no retraining required.
- Knowledge no longer competes with query tokens inside the input context window.
- Performance gains hold across multiple QA benchmarks relative to both RAG and GraphRAG.
- The base model retains its original capabilities while gaining direct access to external structured knowledge.
Where Pith is reading between the lines
- Capsule-style memory could support selective editing of individual facts without retraining the entire model.
- Organizing capsules hierarchically might allow scaling to very large external knowledge collections.
- The method points toward hybrid systems that combine fixed parametric weights with editable nonparametric stores.
Load-bearing premise
Relational knowledge extracted from documents can be normalized into capsules and compiled into attention-compatible key-value representations that integrate directly into the frozen model's attention computation without loss of structure or introduction of instability.
What would settle it
A head-to-head test on multi-hop QA benchmarks in which the same retrieved facts produce equal or lower accuracy when supplied as Knowledge Capsules than when appended as plain text in standard RAG would falsify the central claim.
Figures
read the original abstract
Large language models (LLMs) encode knowledge in parametric weights, making it costly to update or extend without retraining. Retrieval-augmented generation (RAG) mitigates this limitation by appending retrieved text to the input, but operates purely through context expansion, where external knowledge competes as tokens within the attention mechanism. As a result, its influence is indirect and often unstable, particularly in long context and multi hop reasoning scenarios. We propose Knowledge Capsules, structured nonparametric memory units that represent normalized relational knowledge and can be constructed directly from document corpora using a frozen base model. Instead of injecting knowledge as text, we introduce an External Key Value Injection (KVI) framework that compiles capsules into attention-compatible key value representations, enabling external knowledge to directly participate in the model's attention computation. By shifting knowledge integration from context-level augmentation to memory level interaction, the proposed framework consistently outperforms RAG and GraphRAG across multiple QA benchmarks, with improved stability and accuracy in long context and multi hop reasoning, while requiring no parameter updates.
Editorial analysis
A structured set of objections, weighed in public.
Referee Report
Summary. The manuscript proposes Knowledge Capsules as structured nonparametric memory units that encode normalized relational knowledge extracted from document corpora using a frozen base LLM. It introduces an External Key Value Injection (KVI) framework to compile these capsules into attention-compatible key-value representations that participate directly in the model's attention computation. The central claim is that shifting from context-level augmentation (as in RAG) to memory-level interaction yields consistent outperformance over RAG and GraphRAG on QA benchmarks, with improved stability and accuracy for long-context and multi-hop reasoning, all without parameter updates.
Significance. If the empirical results and architectural claims hold, the work could provide a useful alternative paradigm for nonparametric knowledge integration in LLMs. By enabling structured relational knowledge to interact directly via attention-compatible KV tensors rather than token-level context expansion, it targets known weaknesses of RAG in stability and multi-hop scenarios. The nonparametric, frozen-model design is a clear strength if the compilation step successfully preserves relational structure.
major comments (3)
- Abstract: the assertion that the framework 'consistently outperforms RAG and GraphRAG across multiple QA benchmarks' supplies no quantitative results, benchmark names, ablation studies, or implementation specifics, leaving the evidential support for the central claim unverifiable from the available text.
- §3 (KVI Framework description): no derivation, pseudocode, or equation is supplied showing how entity-relation triples are projected into keys/values while preserving directed edges or hop distances. If the compilation step collapses higher-order relations into independent KV pairs, the claimed stability advantage over GraphRAG would not hold.
- §4 (Experiments): the abstract asserts 'improved stability and accuracy in long context and multi hop reasoning' but provides no metrics (e.g., variance across runs, attention entropy, or cross-hop attention flow), baseline details, or ablation on the capsule-to-KV mapping, which is load-bearing for the multi-hop claim.
minor comments (2)
- Abstract: the acronym 'KVI' is introduced without immediate expansion or a forward reference to its relation to standard KV-cache mechanisms.
- The manuscript would benefit from a diagram illustrating the capsule construction pipeline and the KVI injection point relative to the frozen model's attention layers.
Simulated Author's Rebuttal
We thank the referee for the detailed and constructive feedback. We address each major comment below and will revise the manuscript to incorporate the suggested clarifications and additions.
read point-by-point responses
-
Referee: Abstract: the assertion that the framework 'consistently outperforms RAG and GraphRAG across multiple QA benchmarks' supplies no quantitative results, benchmark names, ablation studies, or implementation specifics, leaving the evidential support for the central claim unverifiable from the available text.
Authors: We agree that the abstract would benefit from greater specificity. In the revised version, we will update the abstract to name the primary benchmarks (HotpotQA, 2WikiMultihopQA, and MuSiQue), report concrete accuracy gains (approximately 4–11% over RAG and 3–8% over GraphRAG), and explicitly reference the ablation studies and implementation details already present in Section 4 and the appendix. revision: yes
-
Referee: §3 (KVI Framework description): no derivation, pseudocode, or equation is supplied showing how entity-relation triples are projected into keys/values while preserving directed edges or hop distances. If the compilation step collapses higher-order relations into independent KV pairs, the claimed stability advantage over GraphRAG would not hold.
Authors: We acknowledge that a more formal treatment would strengthen the exposition. We will insert a new subsection in §3 containing (i) the mathematical formulation of the triple-to-KV projection, (ii) pseudocode for the compilation procedure, and (iii) an explanation of how directed edges are encoded via asymmetric key/value assignment and how hop distances are preserved through learned positional offsets added to the injected keys. These additions will demonstrate that higher-order relations are not collapsed into independent pairs. revision: yes
-
Referee: §4 (Experiments): the abstract asserts 'improved stability and accuracy in long context and multi hop reasoning' but provides no metrics (e.g., variance across runs, attention entropy, or cross-hop attention flow), baseline details, or ablation on the capsule-to-KV mapping, which is load-bearing for the multi-hop claim.
Authors: We agree that additional quantitative support is warranted. In the revision we will augment §4 with (i) standard deviation across five random seeds for all main results, (ii) attention-entropy and cross-hop attention-flow statistics, (iii) expanded baseline descriptions, and (iv) a dedicated ablation isolating the capsule-to-KV mapping. These will be presented in new tables and figures that directly address the multi-hop stability claim. revision: yes
Circularity Check
No significant circularity in the architectural proposal
full rationale
The paper proposes a new nonparametric memory architecture (Knowledge Capsules + External KVI) rather than deriving predictions or theorems from equations. No load-bearing steps reduce by construction to fitted inputs, self-citations, or renamed known results; the central claim is an empirical construction whose integration properties are asserted via benchmark results, not via any self-referential math or ansatz smuggling. The provided text contains no equations or derivation chain that could exhibit the enumerated circularity patterns.
Axiom & Free-Parameter Ledger
invented entities (2)
-
Knowledge Capsules
no independent evidence
-
External Key Value Injection (KVI) framework
no independent evidence
Reference graph
Works this paper leans on
-
[1]
Fabio Petroni, Tim Rocktäschel, Patrick Lewis, Anton Bakhtin, Yuxiang Wu, Alexander H. Miller, and Sebastian Riedel. Language models as knowledge bases? InProceedings of the 2019 Conference on Empirical Methods in Natural Language Processing and the 9th International Joint Conference on Natural Language Processing (EMNLP-IJCNLP), pages 2463–2473. Associat...
work page 2019
-
[2]
Editing factual knowledge in language models
Nicola De Cao, Wilker Aziz, and Ivan Titov. Editing factual knowledge in language models. InProceedings of the 2021 Conference on Empirical Methods in Natural Language Processing, pages 6491–6506. Association for Computational Linguistics, 2021
work page 2021
-
[3]
Kevin Meng, David Bau, Alex Andonian, and Yonatan Belinkov. Locating and editing factual associations in GPT.Advances in Neural Information Processing Systems, 35:17359– 17372, 2022
work page 2022
-
[4]
Patrick Lewis, Ethan Perez, Aleksandra Piktus, Fabio Petroni, Vladimir Karpukhin, Naman Goyal, Heinrich Küttler, Mike Lewis, Wen-tau Yih, Tim Rocktäschel, Sebastian Riedel, and Douwe Kiela. Retrieval-augmented generation for knowledge-intensive NLP tasks.Advances in Neural Information Processing Systems, 33:9459–9474, 2020
work page 2020
-
[5]
Realm: Retrieval-augmented language model pre-training
Kelvin Guu, Kenton Lee, Zora Tung, Panupong Pasupat, and Mingwei Chang. Realm: Retrieval-augmented language model pre-training. InProceedings of the 37th International Conference on Machine Learning, volume 119 ofProceedings of Machine Learning Research, pages 3929–3938. PMLR, 2020
work page 2020
-
[6]
Leveraging passage retrieval with generative models for open domain question answering
Gautier Izacard and Edouard Grave. Leveraging passage retrieval with generative models for open domain question answering. InProceedings of the 16th Conference of the European Chapter of the Association for Computational Linguistics: Main Volume, pages 874–880. Association for Computational Linguistics, 2021
work page 2021
-
[7]
Andrew Brown, Muhammad Roman, and Barry Devereux. A systematic literature review of retrieval-augmented generation: Techniques, metrics, and challenges, 2025
work page 2025
-
[8]
When retrieval succeeds and fails: Rethinking retrieval-augmented generation for llms, 2025
Yongjie Wang, Yue Yu, Kaisong Song, Jun Lin, and Zhiqi Shen. When retrieval succeeds and fails: Rethinking retrieval-augmented generation for llms, 2025
work page 2025
-
[9]
Seven failure points when engineering a retrieval augmented generation system, 2024
Scott Barnett, Stefanus Kurniawan, Srikanth Thudumu, Zach Brannelly, and Mohamed Abdelrazek. Seven failure points when engineering a retrieval augmented generation system, 2024
work page 2024
-
[10]
Gomez, Łukasz Kaiser, and Illia Polosukhin
Ashish Vaswani, Noam Shazeer, Niki Parmar, Jakob Uszkoreit, Llion Jones, Aidan N. Gomez, Łukasz Kaiser, and Illia Polosukhin. Attention is all you need.Advances in Neural Information Processing Systems, 30, 2017
work page 2017
-
[11]
Sainbayar Sukhbaatar, Jason Weston, and Rob Fergus. End-to-end memory networks. Advances in Neural Information Processing Systems, 28, 2015. Technical Report21
work page 2015
-
[12]
Jason Weston, Sumit Chopra, and Antoine Bordes. Memory networks, 2015
work page 2015
-
[13]
Hybrid computing using a neural network with dynamic external memory.Nature, 538(7626):471–476, 2016
Alex Graves, Greg Wayne, Malcolm Reynolds, Tim Harley, Ivo Danihelka, Agnieszka Grabska-Barwińska, Sergio Gómez Colmenarejo, Edward Grefenstette, Tiago Ramalho, John Agapiou, et al. Hybrid computing using a neural network with dynamic external memory.Nature, 538(7626):471–476, 2016
work page 2016
-
[14]
Realm: Retrieval-augmented language model pre-training, 2020
Kelvin Guu, Kenton Lee, Zora Tung, Panupong Pasupat, and Ming-Wei Chang. Realm: Retrieval-augmented language model pre-training, 2020
work page 2020
-
[15]
In-context retrieval-augmented language models, 2023
Ori Ram, Yoav Levine, Itay Dalmedigos, Dor Muhlgay, Amnon Shashua, Kevin Leyton- Brown, and Yoav Shoham. In-context retrieval-augmented language models, 2023
work page 2023
-
[16]
Liu, Kevin Lin, John Hewitt, Ashwin Paranjape, Michele Bevilacqua, Fabio Petroni, and Percy Liang
Nelson F. Liu, Kevin Lin, John Hewitt, Ashwin Paranjape, Michele Bevilacqua, Fabio Petroni, and Percy Liang. Lost in the middle: How language models use long contexts, 2023
work page 2023
-
[17]
Approximate nearest neighbor negative contrastive learning for dense text retrieval, 2020
Lee Xiong, Chenyan Xiong, Ye Li, Kwok-Fung Tang, Jialin Liu, Paul Bennett, Junaid Ahmed, and Arnold Overwijk. Approximate nearest neighbor negative contrastive learning for dense text retrieval, 2020
work page 2020
-
[18]
Query rewriting in retrieval-augmented large language models
Xinbei Ma, Yeyun Gong, Pengcheng He, Hai Zhao, and Nan Duan. Query rewriting in retrieval-augmented large language models. In Houda Bouamor, Juan Pino, and Kalika Bali, editors,Proceedings of the 2023 Conference on Empirical Methods in Natural Language Processing, pages 5303–5315, Singapore, December 2023. Association for Computational Linguistics
work page 2023
-
[19]
Self-rag: Learning to retrieve, generate, and critique through self-reflection, 2023
Akari Asai, Zeqiu Wu, Yizhong Wang, Avirup Sil, and Hannaneh Hajishirzi. Self-rag: Learning to retrieve, generate, and critique through self-reflection, 2023
work page 2023
-
[20]
Haoyu Han, Yu Wang, Harry Shomer, Kai Guo, Jiayuan Ding, Yongjia Lei, Mahantesh Halappanavar, Ryan A. Rossi, Subhabrata Mukherjee, Xianfeng Tang, Qi He, Zhigang Hua, Bo Long, Tong Zhao, Neil Shah, Amin Javari, Yinglong Xia, and Jiliang Tang. Retrieval-augmented generation with graphs (graphrag), 2025
work page 2025
-
[21]
From local to global: A graph rag approach to query-focused summarization
Microsoft Research Edge. From local to global: A graph rag approach to query-focused summarization. 2024
work page 2024
-
[22]
Agentic rag with knowledge graphs for complex multi-hop reasoning in real-world applications, 2025
Jean Lelong, Adnane Errazine, and Annabelle Blangero. Agentic rag with knowledge graphs for complex multi-hop reasoning in real-world applications, 2025
work page 2025
-
[23]
A survey on open information extraction from rule-based model to large language model, 2024
Pai Liu, Wenyang Gao, Wenjie Dong, Lin Ai, Ziwei Gong, Songfang Huang, Zongsheng Li, Ehsan Hoque, Julia Hirschberg, and Yue Zhang. A survey on open information extraction from rule-based model to large language model, 2024
work page 2024
-
[24]
Distant supervision for relation extraction without labeled data
Mike Mintz, Steven Bills, Rion Snow, and Daniel Jurafsky. Distant supervision for relation extraction without labeled data. In Keh-Yih Su, Jian Su, Janyce Wiebe, and Haizhou Li, editors,Proceedings of the Joint Conference of the 47th Annual Meeting of the ACL and the Technical Report22 4th International Joint Conference on Natural Language Processing of t...
work page 2009
-
[25]
Raphael Hoffmann, Congle Zhang, Xiao Ling, Luke Zettlemoyer, and Daniel S. Weld. Knowledge-based weak supervision for information extraction of overlapping relations. In Dekang Lin, Yuji Matsumoto, and Rada Mihalcea, editors,Proceedings of the 49th Annual Meeting of the Association for Computational Linguistics: Human Language Technolo- gies, pages 541–55...
work page 2011
-
[26]
Chatie: Zero-shot information extraction via chatting with chatgpt, 2024
Xiang Wei, Xingyu Cui, Ning Cheng, Xiaobin Wang, Xin Zhang, Shen Huang, Pengjun Xie, Jinan Xu, Yufeng Chen, Meishan Zhang, Yong Jiang, and Wenjuan Han. Chatie: Zero-shot information extraction via chatting with chatgpt, 2024
work page 2024
-
[27]
Structgpt: A general framework for large language model to reason over structured data, 2023
Jinhao Jiang, Kun Zhou, Zican Dong, Keming Ye, Wayne Xin Zhao, and Ji-Rong Wen. Structgpt: A general framework for large language model to reason over structured data, 2023
work page 2023
-
[28]
Memorybank: Enhancing large language models with long-term memory, 2023
Wanjun Zhong, Lianghong Guo, Qiqi Gao, He Ye, and Yanlin Wang. Memorybank: Enhancing large language models with long-term memory, 2023
work page 2023
-
[29]
Memorag: Boosting long context processing with global memory-enhanced retrieval augmentation, 2025
Hongjin Qian, Zheng Liu, Peitian Zhang, Kelong Mao, Defu Lian, Zhicheng Dou, and Tiejun Huang. Memorag: Boosting long context processing with global memory-enhanced retrieval augmentation, 2025
work page 2025
-
[30]
Kg-bert: Bertforknowledgegraphcompletion, 2019
LiangYao, ChengshengMao, andYuanLuo. Kg-bert: Bertforknowledgegraphcompletion, 2019
work page 2019
-
[31]
Yu Sun, Shuohuan Wang, Shikun Feng, Siyu Ding, Chao Pang, Junyuan Shang, Jiaxiang Liu, Xuyi Chen, Yanbin Zhao, Yuxiang Lu, Weixin Liu, Zhihua Wu, Weibao Gong, Jianzhong Liang, Zhizhou Shang, Peng Sun, Wei Liu, Xuan Ouyang, Dianhai Yu, Hao Tian, Hua Wu, and Haifeng Wang. Ernie 3.0: Large-scale knowledge enhanced pre-training for language understanding and ...
work page 2021
-
[32]
Transformer feed-forward layers are key-value memories, 2021
Mor Geva, Roei Schuster, Jonathan Berant, and Omer Levy. Transformer feed-forward layers are key-value memories, 2021
work page 2021
-
[33]
Zihang Dai, Zhilin Yang, Yiming Yang, Jaime Carbonell, Quoc V. Le, and Ruslan Salakhutdinov. Transformer-xl: Attentive language models beyond a fixed-length context, 2019
work page 2019
-
[34]
Shengjie Ma, Chengjin Xu, Xuhui Jiang, Muzhi Li, Huaren Qu, Cehao Yang, Jiaxin Mao, and Jian Guo. Think-on-graph 2.0: Deep and faithful large language model reasoning with knowledge-guided retrieval augmented generation, 2025
work page 2025
-
[35]
Kblam: Knowledge base augmented language model, 2025
Xi Wang, Taketomo Isazawa, Liana Mikaelyan, and James Hensman. Kblam: Knowledge base augmented language model, 2025. Technical Report23
work page 2025
-
[36]
Dai, Jakob Uszkoreit, Quoc Le, and Slav Petrov
Tom Kwiatkowski, Jennimaria Palomaki, Olivia Redfield, Michael Collins, Ankur Parikh, Chris Alberti, Danielle Epstein, Illia Polosukhin, Jacob Devlin, Kenton Lee, Kristina Toutanova, Llion Jones, Matthew Kelcey, Ming-Wei Chang, Andrew M. Dai, Jakob Uszkoreit, Quoc Le, and Slav Petrov. Natural questions: A benchmark for question answering research.Transact...
work page 2019
-
[37]
Cohen, Ruslan Salakhutdinov, and Christopher D
Zhilin Yang, Peng Qi, Saizheng Zhang, Yoshua Bengio, William W. Cohen, Ruslan Salakhutdinov, and Christopher D. Manning. Hotpotqa: A dataset for diverse, explainable multi-hop question answering, 2018
work page 2018
-
[38]
BioCreative IX Organizers. Medhopqa: A dataset for benchmarking llm-based reasoning systems with disease-centered question answers. InProceedings of the BioCreative IX Challenge and Workshop (BC9): Large Language Models for Clinical and Biomedical NLP, Montreal, Canada, 2025. Track 1: MedHopQA Shared Task
work page 2025
-
[39]
Truthfulqa: Measuring how models mimic human falsehoods
Stephanie Lin, Jacob Hilton, and Owain Evans. Truthfulqa: Measuring how models mimic human falsehoods. InProceedings of ACL, 2022
work page 2022
-
[40]
FEVER: a large-scale dataset for fact extraction and VERification
James Thorne, Andreas Vlachos, Oana Cocarascu, Arpit Mittal, and Yixin Hou. FEVER: a large-scale dataset for fact extraction and VERification. InProceedings of NAACL, 2018
work page 2018
discussion (0)
Sign in with ORCID, Apple, or X to comment. Anyone can read and Pith papers without signing in.