Vector RAG vs LLM-Compiled Wiki: A Preregistered Comparison on a Small Multi-Domain Research
Pith reviewed 2026-05-20 11:23 UTC · model grok-4.3
The pith
Grounded research synthesis splits into separate skills, and no single architecture wins on organization, citations, and cost together.
A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.
Core claim
In a preregistered head-to-head test, the LLM-compiled wiki produced stronger cross-paper synthesis and better claim-level citation support than single-round vector RAG, while RAG handled single-fact lookups adequately and used far fewer tokens per query. A decomposition-based RAG variant closed most of the synthesis gap at reduced cost but did not match the wiki on citation precision. The experiment therefore shows that grounded research synthesis is not one unified skill but a set of distinct requirements that different retrieval and compilation methods satisfy to different degrees.
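This summary does not specify how the decomposition-based variant works. As a rough sketch only, assuming it splits a question into sub-questions, retrieves passages for each, and generates one answer over the pooled passages, the control flow might look like this. The `decompose`, `retrieve`, and `generate` callables are hypothetical placeholders, not the authors' components.

```python
from typing import Callable, List


def decomposed_rag(
    question: str,
    decompose: Callable[[str], List[str]],
    retrieve: Callable[[str], List[str]],
    generate: Callable[[str, List[str]], str],
) -> str:
    """Answer a question by retrieving separately for each sub-question.

    All three callables are assumptions made for illustration:
    decompose(question) -> list of sub-questions,
    retrieve(sub_question) -> list of passages,
    generate(question, passages) -> answer text.
    """
    passages: List[str] = []
    seen = set()
    for sub_question in decompose(question):
        for passage in retrieve(sub_question):
            # Keep each passage once, in first-seen order.
            if passage not in seen:
                seen.add(passage)
                passages.append(passage)
    # A single generation call sees evidence gathered for every sub-question.
    return generate(question, passages)
```

Pooling passages before one generation call is only one possible design. A variant that answers each sub-question separately and then merges the answers would have a different token cost.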
What carries the argument
The direct comparison of a single-round vector RAG pipeline versus an LLM-compiled markdown wiki, run on the same 13 questions over 24 papers and scored by blinded LLM judges on organization, groundedness, and claim-specific citation accuracy.
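The blinding step can be illustrated with a minimal sketch, assuming blinding means the judge sees answers under neutral labels in shuffled order while the mapping back to system names is held separately. The function names and the label format are hypothetical, and the paper's actual judging protocol may differ.

```python
import random


def blind_answers(answers, rng):
    """Hide system names behind neutral labels in shuffled order.

    answers: dict mapping system name -> answer text.
    Returns (blinded, key). blinded maps label -> answer text and is what
    the judge sees. key maps label -> system name and is kept from the judge.
    """
    systems = list(answers)
    rng.shuffle(systems)
    labels = [f"Answer {chr(ord('A') + i)}" for i in range(len(systems))]
    blinded = {label: answers[name] for label, name in zip(labels, systems)}
    key = dict(zip(labels, systems))
    return blinded, key


def unblind_scores(scores, key):
    """Map scores given per neutral label back to system names."""
    return {key[label]: score for label, score in scores.items()}
```

Passing a seeded `random.Random` instance makes the shuffle reproducible, which helps when the same answers need to be re-scored later.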
If this is right
- Wiki compilation improves cross-document connections and exact claim citation support compared with basic retrieval.
- When a wiki-style system uses more tokens per query than RAG, as it did under the tested conditions, cheaper queries cannot recover the upfront compilation expense.
- Breaking queries into sub-questions inside a retrieval pipeline recovers much of the synthesis benefit without the full wiki cost.
- Overall groundedness scores and claim-level citation checks can point to different strengths, so both metrics are needed to evaluate systems.
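The cost point in the list above comes down to simple break-even arithmetic. The sketch below assumes a linear cost model in which the wiki pays a one-time build cost plus a per-query cost and RAG pays only a per-query cost. The function name and all token figures are hypothetical and are not taken from the paper.

```python
def break_even_queries(build_tokens, wiki_tokens_per_query, rag_tokens_per_query):
    """Smallest number of queries after which the wiki is cheaper in total.

    Assumed cost model with integer token counts:
      wiki total = build_tokens + n * wiki_tokens_per_query
      rag total  = n * rag_tokens_per_query
    Returns None when the wiki saves nothing per query, because the build
    cost can then never be recovered.
    """
    saving_per_query = rag_tokens_per_query - wiki_tokens_per_query
    if saving_per_query <= 0:
        return None
    # Need n * saving_per_query > build_tokens, with n a whole number.
    return build_tokens // saving_per_query + 1


# Hypothetical numbers: the wiki saves 500 tokens per query.
print(break_even_queries(100_000, 1_500, 2_000))  # 201
# Hypothetical numbers: the wiki costs more per query, so no break-even exists.
print(break_even_queries(100_000, 6_000, 2_000))  # None
```

The second call matches the situation the summary describes, in which the wiki used more query tokens than RAG.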
Where Pith is reading between the lines
- Systems could combine wiki-style link structures for citation reliability with decomposition-based retrieval to control token use.
- Evaluation of research-assistant tools should track organization, citation fidelity, and cost as independent dimensions rather than a single composite score.
- The pattern observed on a 24-paper corpus would generalize to larger collections only if the same separation of capabilities appears there.
- Automated citation checking may require its own calibration data because it diverged from the broader groundedness rubric in this study.
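One way to see why a claim-level check can diverge from an answer-level rubric is that it aggregates over individual claims rather than scoring the answer as a whole. The sketch below assumes each claim has already been labeled with whether it carries a citation and whether that citation supports it. The input format and function name are hypothetical and do not describe the paper's checker.

```python
def citation_support_rate(claims):
    """Fraction of cited claims whose cited source was judged to support them.

    claims: list of (has_citation, is_supported) boolean pairs, one per claim.
    Uncited claims are excluded from the denominator.
    Returns None when the answer contains no cited claims.
    """
    supported_flags = [supported for has_citation, supported in claims if has_citation]
    if not supported_flags:
        return None
    return sum(supported_flags) / len(supported_flags)
```

Excluding uncited claims from the denominator is a design choice. Counting them as unsupported would instead penalize answers that leave claims uncited.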
Load-bearing premise
The comparative results rest on the assumption that blinded LLM judges can reliably and without bias score how well answers are organized, how grounded they are, and whether the supplied citations actually support each individual claim.
What would settle it
Re-scoring the same set of generated answers with human experts instead of LLM judges and checking whether the relative ordering of the wiki and RAG systems on organization, groundedness, and citation support stays the same.
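The check described above could be run as a per-dimension comparison of system rankings. This is a minimal sketch, assuming each judge type yields one mean score per system and dimension. The function name and the example scores are made up for illustration.

```python
def ordering_preserved(llm_scores, human_scores):
    """For each dimension, report whether both judge types rank systems alike.

    Both arguments map dimension -> {system name: mean score}.
    Ties are broken by dict insertion order because sorted() is stable.
    """
    result = {}
    for dimension, llm in llm_scores.items():
        human = human_scores[dimension]
        llm_ranking = sorted(llm, key=llm.get, reverse=True)
        human_ranking = sorted(human, key=human.get, reverse=True)
        result[dimension] = llm_ranking == human_ranking
    return result


# Made-up scores, not results from the paper.
llm = {
    "organization": {"wiki": 4.1, "rag": 3.8},
    "groundedness": {"wiki": 3.5, "rag": 4.0},
}
human = {
    "organization": {"wiki": 3.9, "rag": 3.2},
    "groundedness": {"wiki": 4.2, "rag": 3.9},
}
print(ordering_preserved(llm, human))
```

With only 13 questions, a flipped ordering on a single dimension could reflect noise, so a real check would likely also report score differences with uncertainty intervals.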
Original abstract
We preregistered a comparison of two ways to help an LLM answer questions over a small research corpus: a single-round Vector RAG system and an LLM-compiled markdown wiki. Both systems answered the same 13 questions over 24 papers using the same answer-generating model, and their answers were scored by blinded LLM judges. The wiki scored much better at connecting findings across papers, but its advantage in answer organization was not strong after judge adjustment. RAG met the preregistered test for single-fact lookup questions. The clean query-side cost result went against the expected wiki advantage: under the tested setup, the wiki used far more query tokens than RAG, so it could not recover any upfront build cost through cheaper queries. Two exploratory analyses changed how we interpret the result. First, claim-level citation checking favored the wiki: its cited pages more often supported the exact claims being made, even though RAG scored better on the overall groundedness rubric. Second, a decomposition-based RAG variant recovered most of the wiki's advantage on cross-paper synthesis at lower LLM-token cost, but it did not recover the wiki advantage in claim-by-claim citation support. The main conclusion is that grounded research synthesis is not a single capability. Systems can differ in how well they organize evidence, how well their citations support each claim, and how much they cost to run. In this study, no architecture was best on all three.
Editorial analysis
A structured set of objections, weighed in public.
Referee Report
Summary. The paper reports a preregistered empirical comparison of a single-round Vector RAG system versus an LLM-compiled markdown wiki for answering 13 questions drawn from a corpus of 24 research papers. Both systems use the same answer-generating LLM; outputs are scored by blinded LLM judges on organization, overall groundedness, and claim-level citation support. The wiki shows advantages in cross-paper synthesis and claim-by-claim citation fidelity, while RAG scores higher on overall groundedness and uses fewer query-time tokens; a decomposition-based RAG variant recovers much of the synthesis benefit at lower cost but not the citation-support advantage. The central claim is that grounded research synthesis is not a unitary capability and that the three dimensions (organization, citation fidelity, cost) can be traded off independently, with no architecture dominating all three in this study.
Significance. If the metric divergences hold, the work usefully demonstrates that research-synthesis performance decomposes into separable sub-capabilities rather than being captured by any single architecture or rubric. The preregistration, explicit separation of confirmatory versus exploratory analyses, and use of blinded judges are genuine strengths that increase the credibility of the reported differences. The small multi-domain corpus permits detailed claim-level inspection but also bounds the scope of the generalization offered.
major comments (2)
- [Evaluation and Results] The primary evidence for distinct capabilities rests on divergences between the overall-groundedness rubric and the claim-by-claim citation-support scores, as well as on the post-adjustment organization results. These metrics are produced by blinded LLM judges; the manuscript does not report human-expert calibration, inter-judge agreement, or prompt-robustness checks on the 13-question set. Because the conclusion that 'no architecture was best on all three' is drawn directly from these score differences, the absence of calibration data is load-bearing for the central claim.
- [Discussion] The study is limited to 13 questions and 24 papers. While the preregistered design and exploratory decomposition analysis are clearly labeled, the modest sample size makes it difficult to assess whether the observed trade-offs between organization, citation fidelity, and cost generalize beyond this corpus or would persist under different question distributions.
minor comments (2)
- [Results] The description of the 'judge adjustment' procedure and how it affects the organization scores could be expanded with the exact adjustment formula or decision rule so that readers can reproduce the post-adjustment comparison.
- [Tables and Figures] Table or figure captions should explicitly state the number of questions and papers underlying each reported metric to avoid any ambiguity about the scope of the averages.
Simulated Author's Rebuttal
We thank the referee for the constructive review and for recognizing the preregistration, blinded judges, and explicit separation of confirmatory versus exploratory analyses as strengths. We respond to each major comment below and indicate the revisions we will make in the next version of the manuscript.
- Referee: [Evaluation and Results] The primary evidence for distinct capabilities rests on divergences between the overall-groundedness rubric and the claim-by-claim citation-support scores, as well as on the post-adjustment organization results. These metrics are produced by blinded LLM judges; the manuscript does not report human-expert calibration, inter-judge agreement, or prompt-robustness checks on the 13-question set. Because the conclusion that 'no architecture was best on all three' is drawn directly from these score differences, the absence of calibration data is load-bearing for the central claim.
  Authors: We agree that stronger validation of the LLM-as-judge metrics would increase confidence in the reported divergences. In the revised manuscript we add a prompt-robustness check: we re-evaluated all 13 questions with two alternative judge prompts and confirm that the relative ordering on citation-support and groundedness scores is stable. We also report inter-judge agreement statistics for the multiple LLM judges used per item. Human-expert calibration was outside the preregistered scope and resource limits of the study; we have added an explicit limitations paragraph acknowledging this gap and recommending it for follow-up work. The central claim is additionally supported by the exploratory decomposition-RAG analysis, which shows convergent patterns on synthesis without relying solely on the judge scores.
  Revision: partial
- Referee: [Discussion] The study is limited to 13 questions and 24 papers. While the preregistered design and exploratory decomposition analysis are clearly labeled, the modest sample size makes it difficult to assess whether the observed trade-offs between organization, citation fidelity, and cost generalize beyond this corpus or would persist under different question distributions.
  Authors: We concur that the modest corpus and question set constrain generalization, consistent with the referee summary. In the revised manuscript we expand the limitations section to more explicitly bound the scope of the findings, noting that the multi-domain but small-scale design prioritizes detailed claim-level inspection over breadth and that larger-scale replications across different question distributions would be needed to test persistence of the trade-offs. We retain the emphasis on the preregistered confirmatory results while clarifying the exploratory status of the architecture-comparison observations.
  Revision: yes
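The rebuttal above mentions inter-judge agreement statistics without naming one. As an illustration only, Cohen's kappa is a common choice for two judges assigning categorical labels to the same items. The sketch below is a plain-Python version with made-up labels and is not the authors' analysis.

```python
from collections import Counter


def cohens_kappa(labels_a, labels_b):
    """Cohen's kappa for two judges labeling the same items.

    kappa = (observed agreement - chance agreement) / (1 - chance agreement)
    """
    if len(labels_a) != len(labels_b) or not labels_a:
        raise ValueError("both judges must label the same non-empty item set")
    n = len(labels_a)
    observed = sum(a == b for a, b in zip(labels_a, labels_b)) / n
    counts_a = Counter(labels_a)
    counts_b = Counter(labels_b)
    # Chance agreement from each judge's own label frequencies.
    expected = sum(counts_a[label] * counts_b[label] for label in counts_a) / (n * n)
    if expected == 1.0:
        # Both judges used one identical label throughout.
        return 1.0
    return (observed - expected) / (1 - expected)
```

With more than two judges per item, a multi-rater statistic such as Fleiss' kappa or Krippendorff's alpha would be a more natural fit.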
Circularity Check
No circularity: direct empirical comparison with external benchmarks
The paper reports a preregistered head-to-head evaluation of Vector RAG versus an LLM-compiled wiki on 13 fixed questions drawn from 24 external papers. All metrics (organization, groundedness, claim-level citation support) are measured outcomes produced by blinded LLM judges applied to system outputs; none are defined in terms of the systems themselves, fitted to the target result, or derived via equations that reduce to prior self-citations. The central claim that grounded research synthesis decomposes into distinct capabilities follows from observed performance divergences rather than from any self-referential construction. No load-bearing step matches any of the enumerated circularity patterns.
Axiom & Free-Parameter Ledger
axioms (1)
- domain assumption: Blinded LLM judges can accurately and consistently score answer organization, groundedness, and whether cited pages support the exact claims being made.
Lean theorems connected to this paper
- IndisputableMonolith/Foundation/AbsoluteFloorClosure.lean · reality_from_one_distinction · unclear
  Tag: unclear. Relation between the paper passage and the cited Recognition theorem.
  Paper passage: "The main conclusion is that grounded research synthesis is not a single capability. Systems can differ in how well they organize evidence, how well their citations support each claim, and how much they cost to run."
What do these tags mean?
- matches: The paper's claim is directly supported by a theorem in the formal canon.
- supports: The theorem supports part of the paper's argument, but the paper may add assumptions or extra steps.
- extends: The paper goes beyond the formal theorem; the theorem is a base layer rather than the whole result.
- uses: The paper appears to rely on the theorem as machinery.
- contradicts: The paper's claim conflicts with a theorem or certificate in the canon.
- unclear: Pith found a possible connection, but the passage is too broad, indirect, or ambiguous to say the theorem truly supports the claim.