Quantifying Prior Dominance in RAG Systems

Barak Or

arxiv: 2606.23695 · v1 · pith:722VMKOVnew · submitted 2026-04-29 · 💻 cs.CL · cs.AI


Pith reviewed 2026-07-01 08:31 UTC · model grok-4.3

classification 💻 cs.CL cs.AI

keywords RAG · Normalized Context Utilization · Prior Dominance · scaling laws · small language models · contextual information gain · factual extraction · parametric recall


The pith

Small language models match or outperform much larger ones when strictly extracting facts from retrieved context.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper introduces the Normalized Context Utilization metric to measure genuine use of external context versus internal parametric knowledge in RAG systems. It applies the metric across zero-shot, oracle, and adversarial conditions using token log-probabilities to isolate contextual information gain. For tasks limited to factual extraction without reasoning, the results show extreme diminishing returns from scaling, with efficient small models performing at or above the level of models up to 72B parameters. Larger models and a commercial API exhibit stronger prior dominance, overriding provided evidence more often.

Core claim

The Normalized Context Utilization metric applied to models from 1.5B to 72B parameters and a commercial API shows that for strict factual extraction without Chain-of-Thought, traditional scaling laws exhibit extreme diminishing returns, with highly efficient small language models matching or outperforming high-capacity architectures, while prior dominance increases with model scale and proprietary alignments.

What carries the argument

Normalized Context Utilization (NCU) metric, which computes contextual information gain from differences in token log-probabilities across zero-shot, oracle, and adversarial conditions.
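
A minimal sketch of how a score of this kind could be computed. The summary gives no formula, so everything here is an assumption for illustration: the gain is taken as the oracle-minus-zero-shot log-probability of the gold answer, divided by the largest gain still available, and clipped to [0, 1]. The name `ncu_sketch` and this normalization are hypothetical and are not the paper's definition.

```python
import math


def ncu_sketch(logp_zero: float, logp_oracle: float) -> float:
    """Hypothetical normalized context gain in [0, 1].

    logp_zero / logp_oracle: natural-log probability (<= 0) that the model
    assigns to the gold answer without and with the ground-truth context.
    ASSUMPTION: the gain is divided by the headroom -logp_zero, the gap
    between the zero-shot log-probability and certainty (log 1 = 0).
    """
    headroom = -logp_zero           # nats available above the zero-shot baseline
    if headroom <= 0.0:             # already certain zero-shot: no measurable gain
        return 0.0
    gain = logp_oracle - logp_zero  # nats gained by reading the context
    return min(1.0, max(0.0, gain / headroom))


# Illustrative numbers only, not results from the paper.
helped = ncu_sketch(math.log(0.05), math.log(0.90))  # context raises confidence
hurt = ncu_sketch(math.log(0.60), math.log(0.30))    # context lowers confidence
print(helped, hurt)
```

Under this assumed normalization, a model that already knows the answer zero-shot has little headroom, so a high score can only come from probability mass that the context itself supplied.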

If this is right

Scaling produces extreme diminishing returns for strict factual extraction without reasoning steps.
Prior dominance increases with model scale and in proprietary alignments.
Commercial APIs override explicit external evidence in nearly half of adversarial conflicts.
Small language models exhibit superior contextual adherence in strict extraction workflows.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

RAG pipelines for factual tasks could achieve better cost-performance by defaulting to smaller models.
The NCU approach could be extended to measure adherence after fine-tuning or with different retrieval ranks.
Techniques that suppress parametric recall might close the gap for larger models in context-heavy settings.

Load-bearing premise

The three conditions combined with token log-probabilities isolate genuine contextual information gain from parametric recall without confounding effects from model architecture or training differences.
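
To make the premise concrete, the sketch below builds the three prompt conditions from one evaluation item. The prompt template, the `EvalTuple` fields, and the function name `build_prompts` are hypothetical; the paper's actual templates are not given in this summary.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class EvalTuple:
    """One evaluation item (field names are illustrative assumptions)."""
    question: str
    gold_context: str         # passage that contains the correct answer
    adversarial_context: str  # passage that asserts a conflicting answer


def build_prompts(item: EvalTuple) -> dict[str, str]:
    """Return one prompt per condition; only the context differs."""
    def with_context(context: str) -> str:
        return f"Context: {context}\nQuestion: {item.question}\nAnswer:"

    return {
        "zero_shot": f"Question: {item.question}\nAnswer:",  # query only
        "oracle": with_context(item.gold_context),
        "adversarial": with_context(item.adversarial_context),
    }


item = EvalTuple(
    question="Who wrote X?",
    gold_context="X was written by A.",
    adversarial_context="X was written by B.",
)
for name, prompt in build_prompts(item).items():
    print(f"--- {name} ---\n{prompt}")
```

Holding the question and answer slot fixed while swapping only the context is what lets a log-probability difference be attributed to the context. Whether that attribution survives differences in architecture or training is the open question this premise rests on.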

What would settle it

Finding that models above 7B parameters consistently achieve higher NCU scores than 1.5B models across multiple factual extraction datasets would contradict the diminishing-returns claim.

Figures

Figures reproduced from arXiv: 2606.23695 by Barak Or.

**Figure 1.** Conceptual Architecture: Standard Evaluation vs. NCU Framework. Traditional metrics (left) rely on binary text matching, failing to distinguish parametric recall from reading comprehension. The NCU framework (right) utilizes continuous log-probabilities across zero-shot and oracle conditions to isolate the informational gain of the RAG context. To empirically explore this framework, we conducted bounded i…

**Figure 2.** Prior Dominance vs. Context Adherence. A conceptual visualization of knowledge conflicts. Left: High-capacity models may override contradictory context due to heavily weighted parametric priors (P_param ≫ P_ctx). Right: Low-capacity SLMs generally exhibit higher context adherence, processing external evidence with fewer parametric constraints. 1. Continuous Evaluation Metric: We define and validate the NCU …

**Figure 3.** Context Perturbation Engine Architecture. The transformation of a single evaluation tuple into four controlled experimental conditions, aiming to isolate parametric memory from active context utilization. • Zero-Shot (c_zero): The model is prompted solely with the query, establishing baseline predictive uncertainty. • Oracle (c_oracle): The model is provided the ground-truth context containing the exact ans…

**Figure 4.** Contextual Efficacy and NCU Distribution. (A) The discrete success rates across Zero-Shot, Oracle, and Noise conditions. SLMs exhibit highly competitive context-grounded extraction. (B) The distribution of the bounded NCU scores, demonstrating the stable and high information gain achieved by the 1.5B and 7B architectures. grounding. While the commercial baseline (gpt-4o-mini) achieved the highest zero-shot…

**Figure 5.** Prior Dominance and Information Loss. (A) Prior Dominance: The percentage of inferences where the model overrode explicit adversarial context in favor of its prior parametric knowledge. (B) Negative Transfer: The frequency of instances where confusing context actively degraded the model’s predictive confidence below its zero-shot baseline. dominance is heavily influenced by parameter scale (as seen in the …

**Figure 6.** Conceptual RAG Operational Routing. Based on NCU evaluations, extractive tasks may be efficiently directed to SLMs to maximize context adherence and reduce latency. High-capacity models remain optimal for complex synthesis, provided the risk of parametric overwriting is managed. commercial baseline exhibited a measurable tendency to default to its pre-training distributions when external evidence contradic…
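
The two rates described in the Figure 5 caption can be sketched as simple counts. The record layout (`AdversarialTrial`) and the choice to measure confidence as the log-probability of the same target answer in both conditions are assumptions made for illustration; the paper's exact bookkeeping is not given here.

```python
from dataclasses import dataclass


@dataclass
class AdversarialTrial:
    """One adversarial inference (field names are hypothetical)."""
    answer: str            # model output when given the adversarial context
    prior_answer: str      # the model's zero-shot answer (its parametric prior)
    context_answer: str    # the answer asserted by the adversarial context
    logp_zero: float       # target log-probability, zero-shot
    logp_perturbed: float  # target log-probability, with the confusing context


def prior_dominance_rate(trials: list[AdversarialTrial]) -> float:
    """Share of genuine conflicts in which the model kept its prior answer."""
    conflicts = [t for t in trials if t.prior_answer != t.context_answer]
    if not conflicts:
        return 0.0
    kept_prior = sum(t.answer == t.prior_answer for t in conflicts)
    return kept_prior / len(conflicts)


def negative_transfer_rate(trials: list[AdversarialTrial]) -> float:
    """Share of trials whose confidence fell below the zero-shot baseline."""
    if not trials:
        return 0.0
    degraded = sum(t.logp_perturbed < t.logp_zero for t in trials)
    return degraded / len(trials)


# Illustrative records only, not data from the paper.
trials = [
    AdversarialTrial("Paris", "Paris", "Lyon", -0.2, -1.5),  # kept prior, confidence fell
    AdversarialTrial("Lyon", "Paris", "Lyon", -0.2, -0.1),   # followed the context
    AdversarialTrial("Rome", "Rome", "Rome", -0.3, -0.3),    # no conflict to resolve
]
print(prior_dominance_rate(trials), negative_transfer_rate(trials))
```

Counting only trials where the prior and the context actually disagree keeps cases with no conflict from diluting the prior-dominance rate.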

Abstract

Retrieval-Augmented Generation (RAG) grounds Large Language Models in external knowledge, yet current evaluations rely on discrete heuristics that suffer from "epistemic blindness", failing to distinguish genuine contextual information extraction from parametric memory recall. To address this, we introduce the Normalized Context Utilization (NCU) metric, leveraging continuous token log-probabilities across zero-shot, oracle, and adversarial conditions to strictly quantify contextual information gain. Evaluating architectures ranging from 1.5B to 72B parameters alongside a proprietary commercial API reveals that for strict factual extraction (without Chain-of-Thought reasoning), traditional scaling laws exhibit extreme diminishing returns: highly efficient Small Language Models (SLMs) match or outperform high-capacity architectures. Furthermore, we demonstrate that "Prior Dominance" correlates with model scale and proprietary alignments. The evaluated commercial API not only overrode explicit external evidence in nearly half of adversarial conflicts, but also frequently suffered from systemic confidence collapse (Negative Transfer) when its parametric priors were contradicted. Our findings highlight the structural epistemic advantage and superior contextual adherence of SLMs in strict extraction workflows.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Desk Editor's Note: private letter to a colleague

NCU gives a workable way to score how much RAG actually uses the retrieved context, but the SLM advantage and API prior-dominance claims rest on methods that need more checks for calibration and tokenizer effects.


The main thing to know is that this paper defines NCU from token log-probabilities across zero-shot, oracle, and adversarial conditions, then uses it to argue that small models match or beat larger ones on strict factual extraction while commercial APIs often override the context.

The metric itself is the clearest new piece. It tries to separate genuine context gain from parametric recall in a continuous way, which is better than the usual accuracy or hallucination counts. The scale comparison from 1.5B to 72B plus the API results on negative transfer and prior dominance in nearly half the adversarial cases are concrete enough to be worth testing in other setups.

The work is honest about the evaluation gap it targets. Anyone running factual RAG pipelines would recognize the problem it describes.

The soft spot is the missing detail on implementation. The abstract gives no equations, no dataset sizes, no error bars, and no ablations for model-specific calibration or tokenizer differences. The stress-test concern lands here: if log-prob differences are not adjusted for those factors, the reported diminishing returns and SLM edge could partly reflect metric artifacts rather than real epistemic differences. Without those checks the central pattern is harder to trust at face value.
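
The tokenizer concern can be illustrated with a toy case. Suppose two hypothetical tokenizers split the same answer string into two and four tokens, and each model is equally confident per token. The numbers are invented for illustration and do not come from the paper.

```python
def sequence_logprob(token_logprobs: list[float]) -> float:
    """Total log-probability of the answer: the sum over its tokens."""
    return sum(token_logprobs)


def mean_token_logprob(token_logprobs: list[float]) -> float:
    """Length-normalized log-probability: the average per token."""
    return sum(token_logprobs) / len(token_logprobs)


# Same answer string, same per-token confidence, different segmentation.
coarse = [-0.5, -0.5]            # tokenizer A: 2 tokens
fine = [-0.5, -0.5, -0.5, -0.5]  # tokenizer B: 4 tokens

print(sequence_logprob(coarse), sequence_logprob(fine))      # sums differ
print(mean_token_logprob(coarse), mean_token_logprob(fine))  # means agree
```

The summed scores differ by a factor of two although per-token confidence is identical, which is one way unadjusted log-probability differences could vary across model families for reasons unrelated to context use. Per-token averaging is only one possible adjustment, and it is not necessarily what the paper does.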

This is for applied researchers who pick models for extraction-heavy RAG or who want to improve evaluation beyond discrete heuristics. It is not a foundational methods paper.

I would send it to peer review so the methods and any corrections can be examined properly.

Referee Report

2 major / 3 minor

Summary. The paper introduces the Normalized Context Utilization (NCU) metric, which computes contextual information gain from token log-probabilities under zero-shot, oracle, and adversarial conditions to distinguish genuine retrieval use from parametric recall in RAG. It evaluates models from 1.5B to 72B parameters plus a commercial API on strict factual extraction (no CoT), claiming extreme diminishing returns in scaling laws such that SLMs match or outperform larger models, with prior dominance increasing with scale and the API overriding external evidence in ~50% of adversarial cases while exhibiting negative transfer.

Significance. If the NCU metric is shown to isolate contextual gain without architectural confounds, the work would provide evidence against naive scaling in RAG factual tasks and practical guidance favoring SLMs for context adherence, while documenting reliability failures in proprietary APIs.

major comments (2)

[NCU definition] (likely §3): the claim that the three conditions isolate genuine contextual gain assumes log-probability differences are comparable across models without residual effects from calibration, tokenization, or training regime. No ablations or corrections are described, so the observed diminishing returns and SLM advantage could partly reflect metric artifacts (directly matching the stress-test concern on confounding).
[Model comparison results] (likely §4 or §5): the central claim that SLMs match or outperform high-capacity models for strict extraction rests on NCU being architecture-independent; without evidence that the metric controls for the listed confounds, the scaling-law conclusion is load-bearing and at risk.

minor comments (3)

Provide the exact NCU formula, normalization procedure, and any free parameters (the abstract implies none, but this must be explicit).
Add dataset descriptions, statistical details, error bars, and reproducibility information for the API experiments, as these are absent from the abstract-level description.
Clarify how 'epistemic blindness' is operationalized and distinguished from standard RAG failure modes.
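
As one way to supply the requested error bars, a percentile bootstrap over per-example scores gives a confidence interval for a model's mean NCU. This is a generic sketch, not the paper's procedure; the function name, the resample count, and the example scores are all assumptions.

```python
import random


def bootstrap_mean_ci(scores, n_resamples=2000, alpha=0.05, seed=0):
    """Percentile-bootstrap confidence interval for the mean of `scores`."""
    rng = random.Random(seed)  # fixed seed so the interval is reproducible
    n = len(scores)
    # Resample with replacement, record each resample's mean, then sort.
    means = sorted(
        sum(rng.choices(scores, k=n)) / n for _ in range(n_resamples)
    )
    lower = means[int((alpha / 2) * n_resamples)]
    upper = means[int((1 - alpha / 2) * n_resamples) - 1]
    return lower, upper


# Made-up per-example scores for one model, for illustration only.
scores = [0.90, 0.80, 0.95, 0.70, 0.85, 0.60, 0.92, 0.88]
low, high = bootstrap_mean_ci(scores)
print(f"mean={sum(scores) / len(scores):.3f}, 95% CI=({low:.3f}, {high:.3f})")
```

If the intervals for a 1.5B model and a 72B model overlap heavily, "match" is supported but "outperform" is not, which is the distinction the scaling claim depends on.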

Simulated Author's Rebuttal

2 responses · 0 unresolved

We appreciate the referee's detailed feedback on our manuscript. The comments highlight important considerations regarding the robustness of the NCU metric. We address each major comment below and indicate planned revisions.


Referee: [NCU definition] (likely §3): the claim that the three conditions isolate genuine contextual gain assumes log-probability differences are comparable across models without residual effects from calibration, tokenization, or training regime. No ablations or corrections are described, so the observed diminishing returns and SLM advantage could partly reflect metric artifacts (directly matching the stress-test concern on confounding).

Authors: We thank the referee for pointing this out. The NCU metric is designed such that the differences in log-probabilities under the three conditions (zero-shot, oracle, adversarial) are intended to normalize for model-specific effects by focusing on relative gains. However, we acknowledge that without explicit ablations for calibration and tokenization differences, there remains a possibility of artifacts. In the revised manuscript, we will include a new subsection discussing these potential confounds and provide additional analysis on how the adversarial condition helps mitigate them. We will also report results with normalized probabilities where applicable.

revision: partial

Referee: [Model comparison results] (likely §4 or §5): the central claim that SLMs match or outperform high-capacity models for strict extraction rests on NCU being architecture-independent; without evidence that the metric controls for the listed confounds, the scaling-law conclusion is load-bearing and at risk.

Authors: The model comparisons are indeed central to our conclusions. To address this, we will expand the evaluation section to include cross-model normalization techniques and sensitivity analyses to tokenization variations. This will provide stronger evidence that the observed patterns are not solely due to metric artifacts. We maintain that the consistent trends across multiple model families support our claims, but agree that additional controls will bolster the argument.

revision: yes

Circularity Check

0 steps flagged

No circularity: NCU defined from independent conditions; empirical claims not forced by construction

full rationale

The paper defines NCU directly from token log-probabilities measured under three explicit conditions (zero-shot, oracle, adversarial) without any fitting step, parameter estimation from the target result, or self-citation chains. The scaling-law observations (SLMs matching larger models) are presented as direct empirical outputs of applying this metric across model sizes. No equations, ansatzes, or uniqueness theorems are shown that reduce the central claim to its own inputs. The derivation is therefore self-contained and externally falsifiable via the stated conditions.

Axiom & Free-Parameter Ledger

0 free parameters · 1 axiom · 0 invented entities

The central contribution is the NCU definition resting on the assumption that the three prompt conditions cleanly separate context use from priors.

axioms (1)

domain assumption: Token log-probabilities across zero-shot, oracle, and adversarial conditions isolate contextual information gain. Core to NCU construction and all reported comparisons.




[1] [1]

Retrieval- augmented generation for knowledge-intensive nlp tasks.Advances in neural information processing systems, 33:9459–9474, 2020

PatrickLewis, EthanPerez, AleksandraPiktus, FabioPetroni, VladimirKarpukhin, Naman Goyal, Heinrich Küttler, Mike Lewis, Wen-tau Yih, Tim Rocktäschel, et al. Retrieval- augmented generation for knowledge-intensive nlp tasks.Advances in neural information processing systems, 33:9459–9474, 2020

2020

[2] [2]

Improving language models by retrieving from trillions of tokens

Sebastian Borgeaud, Arthur Mensch, Jordan Hoffmann, Trevor Cai, Eliza Rutherford, Katie Millican, George Bm Van Den Driessche, Jean-Baptiste Lespiau, Bogdan Damoc, Aidan Clark, et al. Improving language models by retrieving from trillions of tokens. InInterna- tional conference on machine learning, pages 2206–2240. PMLR, 2022

2022

[3] [3]

Retrieval augmented language model pre-training

Kelvin Guu, Kenton Lee, Zora Tung, Panupong Pasupat, and Mingwei Chang. Retrieval augmented language model pre-training. InInternational conference on machine learning, pages 3929–3938. PMLR, 2020

2020

[4] Vladimir Karpukhin, Barlas Oguz, Sewon Min, Patrick Lewis, Ledell Wu, Sergey Edunov, Danqi Chen, and Wen-tau Yih. Dense passage retrieval for open-domain question answering. In Proceedings of the 2020 conference on empirical methods in natural language processing (EMNLP), pages 6769–6781, 2020.

[5] Jared Kaplan, Sam McCandlish, Tom Henighan, Tom B Brown, Benjamin Chess, Rewon Child, Scott Gray, Alec Radford, Jeffrey Wu, and Dario Amodei. Scaling laws for neural language models. arXiv preprint arXiv:2001.08361, 2020.

[6] Jordan Hoffmann, Sebastian Borgeaud, Arthur Mensch, Elena Buchatskaya, Trevor Cai, Eliza Rutherford, DDL Casas, Lisa Anne Hendricks, Johannes Welbl, Aidan Clark, et al. Training compute-optimal large language models. arXiv preprint arXiv:2203.15556, 10, 2022.

[7] Zihan Chen, Song Wang, Zhen Tan, Xingbo Fu, Zhenyu Lei, Peng Wang, Huan Liu, Cong Shen, and Jundong Li. A survey of scaling in large language model reasoning. arXiv preprint arXiv:2504.02181, 2025.

[8] Zhengyu Chen, Siqi Wang, Teng Xiao, Yudong Wang, Shiqi Chen, Xunliang Cai, Junxian He, and Jingang Wang. Revisiting scaling laws for language models: The role of data quality and training strategies. In Proceedings of the 63rd Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), pages 23881–23899, 2025.

[9] Kishore Papineni, Salim Roukos, Todd Ward, and Wei-Jing Zhu. Bleu: a method for automatic evaluation of machine translation. In Proceedings of the 40th annual meeting of the Association for Computational Linguistics, pages 311–318, 2002.

[10] Chin-Yew Lin. Rouge: A package for automatic evaluation of summaries. In Text summarization branches out, pages 74–81, 2004.

[11] Sewon Min, Kalpesh Krishna, Xinxi Lyu, Mike Lewis, Wen-tau Yih, Pang Wei Koh, Mohit Iyyer, Luke Zettlemoyer, and Hannaneh Hajishirzi. Factscore: Fine-grained atomic evaluation of factual precision in long form text generation. In Proceedings of the 2023 Conference on Empirical Methods in Natural Language Processing, pages 12076–12100, 2023.

[12] Hung-Ting Chen, Michael Zhang, and Eunsol Choi. Rich knowledge sources bring complex knowledge conflicts: Recalibrating models to reflect conflicting evidence. In Proceedings of the 2022 Conference on Empirical Methods in Natural Language Processing, pages 2292–2307, 2022.

[13] Shayne Longpre, Kartik Perisetla, Anthony Chen, Nikhil Ramesh, Chris DuBois, and Sameer Singh. Entity-based knowledge conflicts in question answering. In Proceedings of the 2021 conference on empirical methods in natural language processing, pages 7052–7063, 2021.

[14] Alex Mallen, Akari Asai, Victor Zhong, Rajarshi Das, Daniel Khashabi, and Hannaneh Hajishirzi. When not to trust language models: Investigating effectiveness of parametric and non-parametric memories. In Proceedings of the 61st Annual Meeting of the Association for Computational Linguistics, pages 9802–9822, 2023.

[15] Shahul Es, Jithin James, Luis Espinosa Anke, and Steven Schockaert. Ragas: Automated evaluation of retrieval augmented generation. In Proceedings of the 18th conference of the european chapter of the association for computational linguistics: system demonstrations, pages 150–158, 2024.

[16] Jon Saad-Falcon, Omar Khattab, Christopher Potts, and Matei Zaharia. Ares: An automated evaluation framework for retrieval-augmented generation systems. In Proceedings of the 2024 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies (Volume 1: Long Papers), pages 338–354, 2024.

[17] Lianmin Zheng, Wei-Lin Chiang, Ying Sheng, Siyuan Zhuang, Zhanghao Wu, Yonghao Zhuang, Zi Lin, Zhuohan Li, Dacheng Li, Eric Xing, et al. Judging llm-as-a-judge with mt-bench and chatbot arena. Advances in neural information processing systems, 36:46595–46623, 2023.

[18] Giovanni Trappolini, Florin Cuconasu, Simone Filice, Yoelle Maarek, and Fabrizio Silvestri. Redefining retrieval evaluation in the era of llms. In Proceedings of the 19th Conference of the European Chapter of the Association for Computational Linguistics (Volume 1: Long Papers), pages 8359–8375, 2026.

[19] Wenqing Zheng, Dmitri Kalaev, Noah Fatsi, Daniel Barcklow, Owen Reinert, Igor Melnyk, Senthil Kumar, and C Bayan Bruss. Revisiting rag retrievers: An information theoretic benchmark. arXiv preprint arXiv:2602.21553, 2026.

[20] Kunlun Zhu, Yifan Luo, Dingling Xu, Yukun Yan, Zhenghao Liu, Shi Yu, Ruobing Wang, Shuo Wang, Yishan Li, Nan Zhang, et al. Rageval: Scenario specific rag evaluation dataset generation framework. In Proceedings of the 63rd Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), pages 8520–8544, 2025.

[21] Shu Zhou, Yuxuan Ao, Yunyang Xuan, Xin Wang, Tao Fan, and Hao Wang. Inference scaling law for retrieval augmented generation. In Proceedings of the AAAI Conference on Artificial Intelligence, volume 40, pages 16522–16530, 2026.

[22] Karan Singh, Michael Yu, Varun Gangal, Zhuofu Tao, Sachin Kumar, Emmy Liu, and Steven Y Feng. To memorize or to retrieve: Scaling laws for rag-considerate pretraining. arXiv preprint arXiv:2604.00715, 2026.

[23] Hugo Touvron, Thibaut Lavril, Gautier Izacard, Xavier Martinet, Marie-Anne Lachaux, Timothée Lacroix, Baptiste Rozière, Naman Goyal, Eric Hambro, Faisal Azhar, et al. Llama: Open and efficient foundation language models. arXiv preprint arXiv:2302.13971, 2023.

[24] Suriya Gunasekar, Yi Zhang, Jyoti Aneja, Caio César Teodoro Mendes, Allie Del Giorno, Sivakanth Gopi, Mojan Javaheripi, Piero Kauffmann, Gustavo de Rosa, Olli Saarikivi, et al. Textbooks are all you need. arXiv preprint arXiv:2306.11644, 2023.

[25] Giorgos Vernikos, Arthur Bražinskas, Jakub Adamek, Jonathan Mallinson, Aliaksei Severyn, and Eric Malmi. Small language models improve giants by rewriting their outputs. In Proceedings of the 18th Conference of the European Chapter of the Association for Computational Linguistics (Volume 1: Long Papers), pages 2703–2718, 2024.

[26] Yi Chen, JiaHao Zhao, and HaoHao Han. A survey on collaborative mechanisms between large and small language models. arXiv preprint arXiv:2505.07460, 2025.

[27] Orlando Marquez Ayala, Patrice Bechard, Emily Chen, Maggie Baird, and Jingfei Chen. Fine-tune an slm or prompt an llm? the case of generating low-code workflows. arXiv preprint arXiv:2505.24189, 2025.

[28] Jerry Huang, Siddarth Madala, Risham Sidhu, Cheng Niu, Hao Peng, Julia Hockenmaier, and Tong Zhang. Rag-rl: Advancing retrieval-augmented generation via rl and curriculum learning. arXiv preprint arXiv:2503.12759, 2025.

[29] Jinhao Jiang, Jiayi Chen, Junyi Li, Ruiyang Ren, Shijie Wang, Wayne Xin Zhao, Yang Song, and Tao Zhang. Rag-star: Enhancing deliberative reasoning with retrieval augmented verification and refinement. In Proceedings of the 2025 Conference of the Nations of the Americas Chapter of the Association for Computational Linguistics: Human Language Technologies (Volume ...

[30] Ziwei Ji, Nayeon Lee, Rita Frieske, Tiezheng Yu, Dan Su, Yan Xu, Etsuko Ishii, Ye Jin Bang, Andrea Madotto, and Pascale Fung. Survey of hallucination in natural language generation. ACM computing surveys, 55(12):1–38, 2023.

[31] Nelson F Liu, Kevin Lin, John Hewitt, Ashwin Paranjape, Michele Bevilacqua, Fabio Petroni, and Percy Liang. Lost in the middle: How language models use long contexts. Transactions of the association for computational linguistics, 12:157–173, 2024.

[32] Jian Xie, Kai Zhang, Jiangjie Chen, Renze Lou, and Yu Su. Adaptive chameleon or stubborn sloth: Revealing the behavior of large language models in knowledge conflicts. In The Twelfth International Conference on Learning Representations, 2023.

[33] Tianzhe Zhao et al. Exploring knowledge conflicts for faithful llm reasoning: Benchmark and method. arXiv preprint arXiv:2604.11209, 2026.

[34] Hua Ye, Siyuan Chen, Ziqi Zhong, Canran Xiao, Haoliang Zhang, Yuhan Wu, and Fei Shen. Seeing through the conflict: Transparent knowledge conflict handling in retrieval-augmented generation. arXiv preprint arXiv:2601.06842, 2026.

[35] Aisha Alansari and Hamzah Luqman. Large language models hallucination: A comprehensive survey. Computer Science Review, 61:100970, 2026.

[36] Xingyu Zhu, Junfeng Fang, Shuo Wang, Beier Zhu, Zhicai Wang, Yonghui Yang, and Xiangnan He. Mitigating hallucinations in large vision-language models without performance degradation. arXiv preprint arXiv:2604.20366, 2026.

[37] Yunfan Gao, Yun Xiong, Xinyu Gao, Kangxiang Jia, Jinliu Pan, Yuxi Bi, Yixin Dai, Jiawei Sun, Haofen Wang, Haofen Wang, et al. Retrieval-augmented generation for large language models: A survey. arXiv preprint arXiv:2312.10997, 2(1):32, 2023.

[38] Tu Vu, Mohit Iyyer, Xuezhi Wang, Noah Constant, Jerry Wei, Jason Wei, Chris Tar, Yun-Hsuan Sung, Denny Zhou, Quoc Le, et al. Freshllms: Refreshing large language models with search engine augmentation. In Findings of the Association for Computational Linguistics: ACL 2024, pages 13697–13720, 2024.

[39] Claude Elwood Shannon. A mathematical theory of communication. The Bell system technical journal, 27(3):379–423, 1948.

[40] Thomas M Cover. Elements of information theory. John Wiley & Sons, 1999.

[41] Barak Or. Kalman-inspired runtime stability and recovery in hybrid reasoning systems. arXiv preprint arXiv:2602.15855, 2026.

[42] Zhengbao Jiang, Frank F Xu, Luyu Gao, Zhiqing Sun, Qian Liu, Jane Dwivedi-Yu, Yiming Yang, Jamie Callan, and Graham Neubig. Active retrieval augmented generation. In Proceedings of the 2023 conference on empirical methods in natural language processing, pages 7969–7992, 2023.

[43] Ori Ram, Yoav Levine, Itay Dalmedigos, Dor Muhlgay, Amnon Shashua, Kevin Leyton-Brown, and Yoav Shoham. In-context retrieval-augmented language models. Transactions of the Association for Computational Linguistics, 11:1316–1331, 2023.

[44] An Yang, Anfeng Li, Baosong Yang, Beichen Zhang, Binyuan Hui, Bo Zheng, Bowen Yu, Chang Gao, Chengen Huang, Chenxu Lv, et al. Qwen3 technical report. arXiv preprint arXiv:2505.09388, 2025.