pith. sign in

arxiv: 2606.23695 · v1 · pith:722VMKOVnew · submitted 2026-04-29 · 💻 cs.CL · cs.AI

Quantifying Prior Dominance in RAG Systems

Pith reviewed 2026-07-01 08:31 UTC · model grok-4.3

classification 💻 cs.CL cs.AI
keywords RAGNormalized Context UtilizationPrior Dominancescaling lawssmall language modelscontextual information gainfactual extractionparametric recall
0
0 comments X

The pith

Small language models match or outperform much larger ones when strictly extracting facts from retrieved context.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper introduces the Normalized Context Utilization metric to measure genuine use of external context versus internal parametric knowledge in RAG systems. It applies the metric across zero-shot, oracle, and adversarial conditions using token log-probabilities to isolate contextual information gain. For tasks limited to factual extraction without reasoning, the results show extreme diminishing returns from scaling, with efficient small models performing at or above the level of models up to 72B parameters. Larger models and a commercial API exhibit stronger prior dominance, overriding provided evidence more often.

Core claim

The Normalized Context Utilization metric applied to models from 1.5B to 72B parameters and a commercial API shows that for strict factual extraction without Chain-of-Thought, traditional scaling laws exhibit extreme diminishing returns, with highly efficient small language models matching or outperforming high-capacity architectures, while prior dominance increases with model scale and proprietary alignments.

What carries the argument

Normalized Context Utilization (NCU) metric, which computes contextual information gain from differences in token log-probabilities across zero-shot, oracle, and adversarial conditions.

If this is right

  • Scaling produces extreme diminishing returns for strict factual extraction without reasoning steps.
  • Prior dominance increases with model scale and in proprietary alignments.
  • Commercial APIs override explicit external evidence in nearly half of adversarial conflicts.
  • Small language models exhibit superior contextual adherence in strict extraction workflows.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

  • RAG pipelines for factual tasks could achieve better cost-performance by defaulting to smaller models.
  • The NCU approach could be extended to measure adherence after fine-tuning or with different retrieval ranks.
  • Techniques that suppress parametric recall might close the gap for larger models in context-heavy settings.

Load-bearing premise

The three conditions combined with token log-probabilities isolate genuine contextual information gain from parametric recall without confounding effects from model architecture or training differences.

What would settle it

Finding that models above 7B parameters consistently achieve higher NCU scores than 1.5B models across multiple factual extraction datasets would contradict the diminishing-returns claim.

Figures

Figures reproduced from arXiv: 2606.23695 by Barak Or.

Figure 1
Figure 1. Figure 1: Conceptual Architecture: Standard Evaluation vs. NCU Framework. Tra￾ditional metrics (left) rely on binary text matching, failing to distinguish parametric recall from reading comprehension. The NCU framework (right) utilizes continuous log-probabilities across zero-shot and oracle conditions to isolate the informational gain of the RAG context. To empirically explore this framework, we conducted bounded i… view at source ↗
Figure 2
Figure 2. Figure 2: Prior Dominance vs. Context Adherence. A conceptual visualization of knowl￾edge conflicts. Left: High-capacity models may override contradictory context due to heavily weighted parametric priors (Pparam ≫ Pctx). Right: Low-capacity SLMs generally exhibit higher context adherence, processing external evidence with fewer parametric constraints. 1. Continuous Evaluation Metric: We define and validate the NCU … view at source ↗
Figure 3
Figure 3. Figure 3: Context Perturbation Engine Architecture. The transformation of a single eval￾uation tuple into four controlled experimental conditions, aiming to isolate parametric memory from active context utilization. • Zero-Shot (czero): The model is prompted solely with the query, establishing baseline predictive uncertainty. • Oracle (coracle): The model is provided the ground-truth context containing the exact ans… view at source ↗
Figure 4
Figure 4. Figure 4: Contextual Efficacy and NCU Distribution. (A) The discrete success rates across Zero-Shot, Oracle, and Noise conditions. SLMs exhibit highly competitive context-grounded extraction. (B) The distribution of the bounded NCU scores, demonstrating the stable and high information gain achieved by the 1.5B and 7B architectures. grounding. While the commercial baseline (gpt-4o-mini) achieved the highest zero-shot… view at source ↗
Figure 5
Figure 5. Figure 5: Prior Dominance and Information Loss. (A) Prior Dominance: The percentage of inferences where the model overrode explicit adversarial context in favor of its prior parametric knowledge. (B) Negative Transfer: The frequency of instances where confusing context actively degraded the model’s predictive confidence below its zero-shot baseline. dominance is heavily influenced by parameter scale (as seen in the … view at source ↗
Figure 6
Figure 6. Figure 6: Conceptual RAG Operational Routing. Based on NCU evaluations, extractive tasks may be efficiently directed to SLMs to maximize context adherence and reduce latency. High-capacity models remain optimal for complex synthesis, provided the risk of parametric overwriting is managed. commercial baseline exhibited a measurable tendency to default to its pre-training distributions when external evidence contradic… view at source ↗
read the original abstract

Retrieval-Augmented Generation (RAG) grounds Large Language Models in external knowledge, yet current evaluations rely on discrete heuristics that suffer from ''epistemic blindness'' - failing to distinguish genuine contextual information extraction from parametric memory recall. To address this, we introduce the Normalized Context Utilization (NCU) metric, leveraging continuous token log-probabilities across zero-shot, oracle, and adversarial conditions to strictly quantify contextual information gain. Evaluating architectures ranging from 1.5B to 72B parameters alongside a proprietary commercial API reveals that for strict factual extraction (without Chain-of-Thought reasoning), traditional scaling laws exhibit extreme diminishing returns: highly efficient Small Language Models (SLMs) match or outperform high-capacity architectures. Furthermore, we demonstrate that ``Prior Dominance'' correlates with model scale and proprietary alignments. The evaluated commercial API not only overrode explicit external evidence in nearly half of adversarial conflicts, but also frequently suffered from systemic confidence collapse (Negative Transfer) when its parametric priors were contradicted. Our findings highlight the structural epistemic advantage and superior contextual adherence of SLMs in strict extraction workflows.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Referee Report

2 major / 3 minor

Summary. The paper introduces the Normalized Context Utilization (NCU) metric, which computes contextual information gain from token log-probabilities under zero-shot, oracle, and adversarial conditions to distinguish genuine retrieval use from parametric recall in RAG. It evaluates models from 1.5B to 72B parameters plus a commercial API on strict factual extraction (no CoT), claiming extreme diminishing returns in scaling laws such that SLMs match or outperform larger models, with prior dominance increasing with scale and the API overriding external evidence in ~50% of adversarial cases while exhibiting negative transfer.

Significance. If the NCU metric is shown to isolate contextual gain without architectural confounds, the work would provide evidence against naive scaling in RAG factual tasks and practical guidance favoring SLMs for context adherence, while documenting reliability failures in proprietary APIs.

major comments (2)
  1. [NCU definition] NCU definition (likely §3): the claim that the three conditions isolate genuine contextual gain assumes log-probability differences are comparable across models without residual effects from calibration, tokenization, or training regime. No ablations or corrections are described, so the observed diminishing returns and SLM advantage could partly reflect metric artifacts (directly matching the stress-test concern on confounding).
  2. [Model comparison results] Model comparison results (likely §4 or §5): the central claim that SLMs match or outperform high-capacity models for strict extraction rests on NCU being architecture-independent; without evidence that the metric controls for the listed confounds, the scaling-law conclusion is load-bearing and at risk.
minor comments (3)
  1. Provide the exact NCU formula, normalization procedure, and any free parameters (the abstract implies none, but this must be explicit).
  2. Add dataset descriptions, statistical details, error bars, and reproducibility information for the API experiments, as these are absent from the abstract-level description.
  3. Clarify how 'epistemic blindness' is operationalized and distinguished from standard RAG failure modes.

Simulated Author's Rebuttal

2 responses · 0 unresolved

We appreciate the referee's detailed feedback on our manuscript. The comments highlight important considerations regarding the robustness of the NCU metric. We address each major comment below and indicate planned revisions.

read point-by-point responses
  1. Referee: [NCU definition] NCU definition (likely §3): the claim that the three conditions isolate genuine contextual gain assumes log-probability differences are comparable across models without residual effects from calibration, tokenization, or training regime. No ablations or corrections are described, so the observed diminishing returns and SLM advantage could partly reflect metric artifacts (directly matching the stress-test concern on confounding).

    Authors: We thank the referee for pointing this out. The NCU metric is designed such that the differences in log-probabilities under the three conditions (zero-shot, oracle, adversarial) are intended to normalize for model-specific effects by focusing on relative gains. However, we acknowledge that without explicit ablations for calibration and tokenization differences, there remains a possibility of artifacts. In the revised manuscript, we will include a new subsection discussing these potential confounds and provide additional analysis on how the adversarial condition helps mitigate them. We will also report results with normalized probabilities where applicable. revision: partial

  2. Referee: [Model comparison results] Model comparison results (likely §4 or §5): the central claim that SLMs match or outperform high-capacity models for strict extraction rests on NCU being architecture-independent; without evidence that the metric controls for the listed confounds, the scaling-law conclusion is load-bearing and at risk.

    Authors: The model comparisons are indeed central to our conclusions. To address this, we will expand the evaluation section to include cross-model normalization techniques and sensitivity analyses to tokenization variations. This will provide stronger evidence that the observed patterns are not solely due to metric artifacts. We maintain that the consistent trends across multiple model families support our claims, but agree that additional controls will bolster the argument. revision: yes

Circularity Check

0 steps flagged

No circularity: NCU defined from independent conditions; empirical claims not forced by construction

full rationale

The paper defines NCU directly from token log-probabilities measured under three explicit conditions (zero-shot, oracle, adversarial) without any fitting step, parameter estimation from the target result, or self-citation chains. The scaling-law observations (SLMs matching larger models) are presented as direct empirical outputs of applying this metric across model sizes. No equations, ansatzes, or uniqueness theorems are shown that reduce the central claim to its own inputs. The derivation is therefore self-contained and externally falsifiable via the stated conditions.

Axiom & Free-Parameter Ledger

0 free parameters · 1 axioms · 0 invented entities

The central contribution is the NCU definition resting on the assumption that the three prompt conditions cleanly separate context use from priors.

axioms (1)
  • domain assumption Token log-probabilities across zero-shot, oracle, and adversarial conditions isolate contextual information gain.
    Core to NCU construction and all reported comparisons.

pith-pipeline@v0.9.1-grok · 5704 in / 1001 out tokens · 38707 ms · 2026-07-01T08:31:54.345344+00:00 · methodology

discussion (0)

Sign in with ORCID, Apple, or X to comment. Anyone can read and Pith papers without signing in.

Reference graph

Works this paper leans on

44 extracted references · 16 canonical work pages · 10 internal anchors

  1. [1]

    Retrieval- augmented generation for knowledge-intensive nlp tasks.Advances in neural information processing systems, 33:9459–9474, 2020

    PatrickLewis, EthanPerez, AleksandraPiktus, FabioPetroni, VladimirKarpukhin, Naman Goyal, Heinrich Küttler, Mike Lewis, Wen-tau Yih, Tim Rocktäschel, et al. Retrieval- augmented generation for knowledge-intensive nlp tasks.Advances in neural information processing systems, 33:9459–9474, 2020

  2. [2]

    Improving language models by retrieving from trillions of tokens

    Sebastian Borgeaud, Arthur Mensch, Jordan Hoffmann, Trevor Cai, Eliza Rutherford, Katie Millican, George Bm Van Den Driessche, Jean-Baptiste Lespiau, Bogdan Damoc, Aidan Clark, et al. Improving language models by retrieving from trillions of tokens. InInterna- tional conference on machine learning, pages 2206–2240. PMLR, 2022

  3. [3]

    Retrieval augmented language model pre-training

    Kelvin Guu, Kenton Lee, Zora Tung, Panupong Pasupat, and Mingwei Chang. Retrieval augmented language model pre-training. InInternational conference on machine learning, pages 3929–3938. PMLR, 2020

  4. [4]

    Densepassageretrievalforopen-domainquestionanswering

    Vladimir Karpukhin, Barlas Oguz, Sewon Min, Patrick Lewis, Ledell Wu, Sergey Edunov, DanqiChen, andWen-tauYih. Densepassageretrievalforopen-domainquestionanswering. InProceedings of the 2020 conference on empirical methods in natural language processing (EMNLP), pages 6769–6781, 2020

  5. [5]

    Scaling Laws for Neural Language Models

    Jared Kaplan, Sam McCandlish, Tom Henighan, Tom B Brown, Benjamin Chess, Rewon Child, Scott Gray, Alec Radford, Jeffrey Wu, and Dario Amodei. Scaling laws for neural language models.arXiv preprint arXiv:2001.08361, 2020

  6. [6]

    Training Compute-Optimal Large Language Models

    Jordan Hoffmann, Sebastian Borgeaud, Arthur Mensch, Elena Buchatskaya, Trevor Cai, Eliza Rutherford, DDL Casas, Lisa Anne Hendricks, Johannes Welbl, Aidan Clark, et al. Training compute-optimal large language models.arXiv preprint arXiv:2203.15556, 10, 2022

  7. [7]

    A Survey of Scaling in Large Language Model Reasoning

    Zihan Chen, Song Wang, Zhen Tan, Xingbo Fu, Zhenyu Lei, Peng Wang, Huan Liu, Cong Shen, and Jundong Li. A survey of scaling in large language model reasoning.arXiv preprint arXiv:2504.02181, 2025

  8. [8]

    Revisiting scaling laws for language models: The role of data quality and training strategies

    Zhengyu Chen, Siqi Wang, Teng Xiao, Yudong Wang, Shiqi Chen, Xunliang Cai, Junxian He, and Jingang Wang. Revisiting scaling laws for language models: The role of data quality and training strategies. InProceedings of the 63rd Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), pages 23881–23899, 2025

  9. [9]

    Bleu: a method for automatic evaluation of machine translation

    Kishore Papineni, Salim Roukos, Todd Ward, and Wei-Jing Zhu. Bleu: a method for automatic evaluation of machine translation. InProceedings of the 40th annual meeting of the Association for Computational Linguistics, pages 311–318, 2002

  10. [10]

    Rouge: A package for automatic evaluation of summaries

    Chin-Yew Lin. Rouge: A package for automatic evaluation of summaries. InText summa- rization branches out, pages 74–81, 2004

  11. [11]

    Factscore: Fine-grained atomic evalua- tion of factual precision in long form text generation

    Sewon Min, Kalpesh Krishna, Xinxi Lyu, Mike Lewis, Wen-tau Yih, Pang Wei Koh, Mohit Iyyer, Luke Zettlemoyer, and Hannaneh Hajishirzi. Factscore: Fine-grained atomic evalua- tion of factual precision in long form text generation. InProceedings of the 2023 Conference on Empirical Methods in Natural Language Processing, pages 12076–12100, 2023

  12. [12]

    Rich knowledge sources bring complex knowledge conflicts: Recalibrating models to reflect conflicting evidence

    Hung-Ting Chen, Michael Zhang, and Eunsol Choi. Rich knowledge sources bring complex knowledge conflicts: Recalibrating models to reflect conflicting evidence. InProceedings of the 2022 Conference on Empirical Methods in Natural Language Processing, pages 2292– 2307, 2022. 12 Preprint. Under review.B. Or|ArtificialGate Ltd

  13. [13]

    Entity-based knowledge conflicts in question answering

    Shayne Longpre, Kartik Perisetla, Anthony Chen, Nikhil Ramesh, Chris DuBois, and Sameer Singh. Entity-based knowledge conflicts in question answering. InProceedings of the 2021 conference on empirical methods in natural language processing, pages 7052–7063, 2021

  14. [14]

    When not to trust language models: Investigating effectiveness of parametric and non-parametric memories

    Alex Mallen, Akari Asai, Victor Zhong, Rajarshi Das, Daniel Khashabi, and Hannaneh Hajishirzi. When not to trust language models: Investigating effectiveness of parametric and non-parametric memories. InProceedings of the 61st Annual Meeting of the Association for Computational Linguistics, pages 9802–9822, 2023

  15. [15]

    Ragas: Automated evaluation of retrieval augmented generation

    Shahul Es, Jithin James, Luis Espinosa Anke, and Steven Schockaert. Ragas: Automated evaluation of retrieval augmented generation. InProceedings of the 18th conference of the european chapter of the association for computational linguistics: system demonstrations, pages 150–158, 2024

  16. [16]

    Ares: An auto- mated evaluation framework for retrieval-augmented generation systems

    Jon Saad-Falcon, Omar Khattab, Christopher Potts, and Matei Zaharia. Ares: An auto- mated evaluation framework for retrieval-augmented generation systems. InProceedings of the 2024 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies (Volume 1: Long Papers), pages 338–354, 2024

  17. [17]

    Judging llm-as-a-judge with mt-bench and chatbot arena.Advances in neural information processing systems, 36:46595– 46623, 2023

    Lianmin Zheng, Wei-Lin Chiang, Ying Sheng, Siyuan Zhuang, Zhanghao Wu, Yonghao Zhuang, Zi Lin, Zhuohan Li, Dacheng Li, Eric Xing, et al. Judging llm-as-a-judge with mt-bench and chatbot arena.Advances in neural information processing systems, 36:46595– 46623, 2023

  18. [18]

    Redefining retrieval evaluation in the era of llms

    Giovanni Trappolini, Florin Cuconasu, Simone Filice, Yoelle Maarek, and Fabrizio Silvestri. Redefining retrieval evaluation in the era of llms. InProceedings of the 19th Conference of the European Chapter of the Association for Computational Linguistics (Volume 1: Long Papers), pages 8359–8375, 2026

  19. [19]

    Revisiting rag retrievers: An information theoretic benchmark.arXiv preprint arXiv:2602.21553, 2026

    Wenqing Zheng, Dmitri Kalaev, Noah Fatsi, Daniel Barcklow, Owen Reinert, Igor Melnyk, Senthil Kumar, and C Bayan Bruss. Revisiting rag retrievers: An information theoretic benchmark.arXiv preprint arXiv:2602.21553, 2026

  20. [20]

    Rageval: Scenario specific rag evaluation dataset generation framework

    Kunlun Zhu, Yifan Luo, Dingling Xu, Yukun Yan, Zhenghao Liu, Shi Yu, Ruobing Wang, Shuo Wang, Yishan Li, Nan Zhang, et al. Rageval: Scenario specific rag evaluation dataset generation framework. InProceedings of the 63rd Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), pages 8520–8544, 2025

  21. [21]

    Inference scaling law for retrieval augmented generation

    Shu Zhou, Yuxuan Ao, Yunyang Xuan, Xin Wang, Tao Fan, and Hao Wang. Inference scaling law for retrieval augmented generation. InProceedings of the AAAI Conference on Artificial Intelligence, volume 40, pages 16522–16530, 2026

  22. [22]

    To memorize or to retrieve: Scaling laws for rag-considerate pretraining

    Karan Singh, Michael Yu, Varun Gangal, Zhuofu Tao, Sachin Kumar, Emmy Liu, and Steven Y Feng. To memorize or to retrieve: Scaling laws for rag-considerate pretraining. arXiv preprint arXiv:2604.00715, 2026

  23. [23]

    LLaMA: Open and Efficient Foundation Language Models

    Hugo Touvron, Thibaut Lavril, Gautier Izacard, Xavier Martinet, Marie-Anne Lachaux, Timothée Lacroix, Baptiste Rozière, Naman Goyal, Eric Hambro, Faisal Azhar, et al. Llama: Open and efficient foundation language models.arXiv preprint arXiv:2302.13971, 2023

  24. [24]

    Textbooks Are All You Need

    Suriya Gunasekar, Yi Zhang, Jyoti Aneja, Caio César Teodoro Mendes, Allie Del Giorno, Sivakanth Gopi, Mojan Javaheripi, Piero Kauffmann, Gustavo de Rosa, Olli Saarikivi, et al. Textbooks are all you need.arXiv preprint arXiv:2306.11644, 2023. 13 Preprint. Under review.B. Or|ArtificialGate Ltd

  25. [25]

    Small language models improve giants by rewriting their outputs

    Giorgos Vernikos, Arthur Bražinskas, Jakub Adamek, Jonathan Mallinson, Aliaksei Sev- eryn, and Eric Malmi. Small language models improve giants by rewriting their outputs. InProceedings of the 18th Conference of the European Chapter of the Association for Com- putational Linguistics (Volume 1: Long Papers), pages 2703–2718, 2024

  26. [26]

    A survey on collaborative mechanisms between large and small language models.arXiv preprint arXiv:2505.07460, 2025

    Yi Chen, JiaHao Zhao, and HaoHao Han. A survey on collaborative mechanisms between large and small language models.arXiv preprint arXiv:2505.07460, 2025

  27. [27]

    Fine-tune an slm or prompt an llm? the case of generating low-code workflows.arXiv preprint arXiv:2505.24189, 2025

    Orlando Marquez Ayala, Patrice Bechard, Emily Chen, Maggie Baird, and Jingfei Chen. Fine-tune an slm or prompt an llm? the case of generating low-code workflows.arXiv preprint arXiv:2505.24189, 2025

  28. [28]

    RAG-RL: Advancing Retrieval-Augmented Generation via RL and Curriculum Learning

    Jerry Huang, Siddarth Madala, Risham Sidhu, Cheng Niu, Hao Peng, Julia Hockenmaier, and Tong Zhang. Rag-rl: Advancing retrieval-augmented generation via rl and curriculum learning.arXiv preprint arXiv:2503.12759, 2025

  29. [29]

    Rag-star: Enhancing deliberative reasoning with retrieval augmented verifi- cation and refinement

    JinhaoJiang, JiayiChen, JunyiLi, RuiyangRen, ShijieWang, WayneXinZhao, YangSong, and Tao Zhang. Rag-star: Enhancing deliberative reasoning with retrieval augmented verifi- cation and refinement. InProceedings of the 2025 Conference of the Nations of the Americas Chapter of the Association for Computational Linguistics: Human Language Technologies (Volume ...

  30. [30]

    Survey of hallucination in natural language generation.ACM computing surveys, 55(12):1–38, 2023

    Ziwei Ji, Nayeon Lee, Rita Frieske, Tiezheng Yu, Dan Su, Yan Xu, Etsuko Ishii, Ye Jin Bang, Andrea Madotto, and Pascale Fung. Survey of hallucination in natural language generation.ACM computing surveys, 55(12):1–38, 2023

  31. [31]

    Lost in the middle: How language models use long contexts

    Nelson F Liu, Kevin Lin, John Hewitt, Ashwin Paranjape, Michele Bevilacqua, Fabio Petroni, and Percy Liang. Lost in the middle: How language models use long contexts. Transactions of the association for computational linguistics, 12:157–173, 2024

  32. [32]

    Adaptivechameleonorstubborn sloth: Revealing the behavior of large language models in knowledge conflicts

    JianXie, KaiZhang, JiangjieChen, RenzeLou, andYuSu. Adaptivechameleonorstubborn sloth: Revealing the behavior of large language models in knowledge conflicts. InThe Twelfth International Conference on Learning Representations, 2023

  33. [33]

    Exploring Knowledge Conflicts for Faithful LLM Reasoning: Benchmark and Method

    Tianzhe Zhao et al. Exploring knowledge conflicts for faithful llm reasoning: Benchmark and method.arXiv preprint arXiv:2604.11209, 2026

  34. [34]

    Seeingthroughtheconflict: Transparentknowledgeconflicthandlinginretrieval-augmented generation.arXiv preprint arXiv:2601.06842, 2026

    Hua Ye, Siyuan Chen, Ziqi Zhong, Canran Xiao, Haoliang Zhang, Yuhan Wu, and Fei Shen. Seeingthroughtheconflict: Transparentknowledgeconflicthandlinginretrieval-augmented generation.arXiv preprint arXiv:2601.06842, 2026

  35. [35]

    Large language models hallucination: A comprehen- sive survey.Computer Science Review, 61:100970, 2026

    Aisha Alansari and Hamzah Luqman. Large language models hallucination: A comprehen- sive survey.Computer Science Review, 61:100970, 2026

  36. [36]

    Mitigating Hallucinations in Large Vision-Language Models without Performance Degradation

    Xingyu Zhu, Junfeng Fang, Shuo Wang, Beier Zhu, Zhicai Wang, Yonghui Yang, and Xi- angnan He. Mitigating hallucinations in large vision-language models without performance degradation.arXiv preprint arXiv:2604.20366, 2026

  37. [37]

    Retrieval-Augmented Generation for Large Language Models: A Survey

    Yunfan Gao, Yun Xiong, Xinyu Gao, Kangxiang Jia, Jinliu Pan, Yuxi Bi, Yixin Dai, Jiawei Sun, Haofen Wang, Haofen Wang, et al. Retrieval-augmented generation for large language models: A survey.arXiv preprint arXiv:2312.10997, 2(1):32, 2023

  38. [38]

    Freshllms: Refreshing large language models with search engine augmentation

    Tu Vu, Mohit Iyyer, Xuezhi Wang, Noah Constant, Jerry Wei, Jason Wei, Chris Tar, Yun- Hsuan Sung, Denny Zhou, Quoc Le, et al. Freshllms: Refreshing large language models with search engine augmentation. InFindings of the Association for Computational Linguistics: ACL 2024, pages 13697–13720, 2024. 14 Preprint. Under review.B. Or|ArtificialGate Ltd

  39. [39]

    A mathematical theory of communication.The Bell system technical journal, 27(3):379–423, 1948

    Claude Elwood Shannon. A mathematical theory of communication.The Bell system technical journal, 27(3):379–423, 1948

  40. [40]

    John Wiley & Sons, 1999

    Thomas M Cover.Elements of information theory. John Wiley & Sons, 1999

  41. [41]

    Kalman-inspired runtime stability and recovery in hybrid reasoning systems

    Barak Or. Kalman-inspired runtime stability and recovery in hybrid reasoning systems. arXiv preprint arXiv:2602.15855, 2026

  42. [42]

    Active retrieval augmented generation

    Zhengbao Jiang, Frank F Xu, Luyu Gao, Zhiqing Sun, Qian Liu, Jane Dwivedi-Yu, Yiming Yang, Jamie Callan, and Graham Neubig. Active retrieval augmented generation. In Proceedings of the 2023 conference on empirical methods in natural language processing, pages 7969–7992, 2023

  43. [43]

    In-context retrieval-augmented language models.Transactions of the Association for Computational Linguistics, 11:1316–1331, 2023

    Ori Ram, Yoav Levine, Itay Dalmedigos, Dor Muhlgay, Amnon Shashua, Kevin Leyton- Brown, and Yoav Shoham. In-context retrieval-augmented language models.Transactions of the Association for Computational Linguistics, 11:1316–1331, 2023

  44. [44]

    Qwen3 Technical Report

    An Yang, Anfeng Li, Baosong Yang, Beichen Zhang, Binyuan Hui, Bo Zheng, Bowen Yu, Chang Gao, Chengen Huang, Chenxu Lv, et al. Qwen3 technical report.arXiv preprint arXiv:2505.09388, 2025. 15