
arxiv: 2606.28337 · v1 · pith:RTRHHNN4new · submitted 2026-05-29 · 💻 cs.IR · cs.AI

A Systems-Level Analysis of Sensitivity, Robustness, and Stability in Retrieval-Augmented Generation

Pith reviewed 2026-06-30 10:46 UTC · model grok-4.3

classification 💻 cs.IR cs.AI
keywords retrieval-augmented generation · RAG evaluation · sensitivity analysis · robustness · stability · multi-stage failure · empirical study · chunk size

The pith

RAG final answer accuracy often changes non-monotonically when chunk size or retrieval depth varies, so evaluation must track failures at each stage.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper runs 56 controlled experiments on a fixed set of 500 questions linked to 20,958 corpus contexts to test how RAG systems respond to changes in chunk size, retrieval depth, reranking, and injected noise. Retrieval metrics rise with broader settings, yet exact-match and F1 scores at the final answer stage frequently rise then fall or show high variance. Smaller chunks lose answer-bearing text before retrieval even begins, and added retrieval noise causes steady degradation. The authors conclude that measuring only the end answer hides where and why the system fails.
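The retrieval-side quantities mentioned above can be illustrated with a minimal sketch. This summary does not describe how the paper injects noise, so replacing a retrieved id with a random distractor is an assumption, and every function and variable name here is hypothetical.

```python
import random

def hit_at_k(retrieved_ids, gold_id, k):
    """Return True if the gold context id appears among the top-k retrieved ids."""
    return gold_id in retrieved_ids[:k]

def corrupt(retrieved_ids, distractor_ids, noise_rate, rng):
    """Assumed noise model: swap each retrieved id for a random distractor
    with probability noise_rate, keeping the list length unchanged."""
    return [
        rng.choice(distractor_ids) if rng.random() < noise_rate else cid
        for cid in retrieved_ids
    ]

rng = random.Random(0)                      # fixed seed for repeatable runs
ranked = ["c7", "c2", "c9"]                 # hypothetical ranked context ids
noisy = corrupt(ranked, ["d1", "d2"], noise_rate=0.3, rng=rng)
print(hit_at_k(ranked, "c2", k=2), hit_at_k(noisy, "c2", k=2))
```

Passing the random generator in explicitly keeps the corruption reproducible under a fixed seed, which is what a repeated seeded-run design would need.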

Core claim

Across the 56 runs, retrieval-oriented metrics improved under broader retrieval settings, while downstream exact-match and F1 scores often behaved non-monotonically. Preprocessing-induced answer loss appeared under smaller chunk sizes, progressive degradation occurred under retrieval corruption, and higher variance was observed in broader retrieval regimes. These patterns indicate that RAG evaluation must incorporate sensitivity, robustness, stability, and multi-stage failure analysis rather than final answer accuracy alone.
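For readers unfamiliar with the two downstream metrics, the sketch below shows one common way to compute them. It assumes SQuAD-style answer normalization and token-overlap F1; this summary does not confirm which exact definitions the paper uses, and the function names are illustrative.

```python
import re
import string
from collections import Counter

def normalize(text):
    """Lowercase, strip punctuation and English articles, collapse whitespace."""
    text = text.lower()
    text = "".join(ch for ch in text if ch not in string.punctuation)
    text = re.sub(r"\b(a|an|the)\b", " ", text)
    return " ".join(text.split())

def exact_match(prediction, gold):
    """True when the normalized prediction equals the normalized gold answer."""
    return normalize(prediction) == normalize(gold)

def token_f1(prediction, gold):
    """Harmonic mean of token precision and recall after normalization."""
    pred_tokens = normalize(prediction).split()
    gold_tokens = normalize(gold).split()
    if not pred_tokens or not gold_tokens:
        # Both empty counts as a match; one empty counts as a miss.
        return float(pred_tokens == gold_tokens)
    overlap = sum((Counter(pred_tokens) & Counter(gold_tokens)).values())
    if overlap == 0:
        return 0.0
    precision = overlap / len(pred_tokens)
    recall = overlap / len(gold_tokens)
    return 2 * precision * recall / (precision + recall)
```

Under these definitions a verbose but correct prediction scores zero on exact match while keeping partial F1, which is one reason the two metrics can move differently as more context is packed.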

What carries the argument

Multi-stage failure tracking that separately measures retrieval success, context packing, and generation under controlled changes to chunk size, top-k depth, reranking, and probabilistic noise.
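A minimal sketch of how such stage attribution could work for a single question follows. The retrieval and packing rules mirror the definitions quoted in the Figure 8 excerpt; the generation rule (evidence was packed but the answer is still wrong) is an assumption because that definition is truncated in the source, and case-insensitive substring matching stands in for the paper's actual correctness check. All names are hypothetical.

```python
from collections import Counter

def classify_failure(gold_id, answer, retrieved_context_ids, packed_context, prediction):
    """Attribute one question's outcome to a pipeline stage."""
    if answer.lower() in prediction.lower():
        return "success"                      # assumed correctness check
    if gold_id not in retrieved_context_ids:
        return "retrieval_failure"            # no chunk from the gold context
    if answer.lower() not in packed_context.lower():
        return "packing_failure"              # evidence dropped before generation
    return "generation_failure"               # evidence present, answer still wrong

def stage_counts(records):
    """Tally outcomes over an iterable of argument tuples for classify_failure."""
    return Counter(classify_failure(*record) for record in records)
```

Checking the stages in pipeline order means each failed question is charged to the earliest stage that lost the answer, so the counts sum to the number of questions.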

If this is right

  • Retrieval success rates rise when more chunks or higher top-k values are used.
  • Final exact-match and F1 scores frequently fail to follow the same upward trend.
  • Smaller chunk sizes discard answer text during preprocessing before retrieval occurs.
  • Added retrieval noise produces steady drops in end-to-end performance.
  • Variance across repeated runs grows under broader retrieval settings.
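The preprocessing-loss point in the list above can be made concrete with a small sketch that measures how often an answer string is split across chunk boundaries. It assumes word-based, non-overlapping chunks; the paper's chunking unit and overlap are not stated in this summary, and the example sentence is invented for illustration.

```python
def chunk_words(text, chunk_size):
    """Split text into consecutive, non-overlapping chunks of chunk_size words."""
    words = text.split()
    return [" ".join(words[i:i + chunk_size]) for i in range(0, len(words), chunk_size)]

def answer_survives(context, answer, chunk_size):
    """True if at least one chunk still contains the full answer string."""
    return any(answer in chunk for chunk in chunk_words(context, chunk_size))

def answer_loss_rate(examples, chunk_size):
    """Fraction of (context, answer) pairs whose answer survives in no chunk."""
    lost = sum(not answer_survives(ctx, ans, chunk_size) for ctx, ans in examples)
    return lost / len(examples)

# Invented example: the answer straddles a boundary at 6 words but not at 8.
ctx = "The treaty was signed in New York City in 1945"
print(answer_survives(ctx, "New York City", 6), answer_survives(ctx, "New York City", 8))
```

Because this loss happens before any retrieval call, no retriever setting can recover it, which is why it is worth measuring on its own.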

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

  • Evaluation suites for RAG should log per-stage success rates rather than only the final string match.
  • The same staged checks could be applied to other composite systems that combine retrieval with generation.
  • Optimal chunk and depth settings may need to be tuned per query type instead of chosen globally.
  • Repeating the sweeps on larger or more diverse corpora would test whether the non-monotonic pattern persists.

Load-bearing premise

The non-monotonic score changes and variance patterns seen on this 500-question subset and 20,958-context corpus will appear with other corpora, models, and query distributions.

What would settle it

If final answer accuracy increased monotonically with every increase in retrieval depth or chunk size across several new corpora and models, the argument for mandatory multi-stage analysis would lose force.
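The test described above amounts to a monotonicity check over a parameter sweep. The sketch below is one hedged way to express it; the function names are hypothetical, and the numbers in the example are made up for illustration rather than taken from the paper.

```python
def is_non_decreasing(values, tolerance=0.0):
    """True if each value is at least the previous one, within a tolerance."""
    return all(b >= a - tolerance for a, b in zip(values, values[1:]))

def first_drop(settings, values, tolerance=0.0):
    """Return the first setting at which the metric falls, or None if it never does."""
    for i in range(1, len(values)):
        if values[i] < values[i - 1] - tolerance:
            return settings[i]
    return None

top_k = [1, 3, 5]
f1_by_k = [0.50, 0.60, 0.55]     # illustrative values only
print(is_non_decreasing(f1_by_k), first_drop(top_k, f1_by_k))
```

Setting `tolerance` to something like the run-to-run standard deviation avoids flagging drops that are indistinguishable from seed noise.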

Figures

Figures reproduced from arXiv: 2606.28337 by Bharath Simha Reddy Muthyam.

Figure 1
Figure 1: Pipeline architecture for the controlled RAG evaluation framework. The study separates preprocessing, retrieval, packing, generation, and failure analysis so final answer errors can be traced to pipeline stages.
Figure 2
Figure 2: Preprocessing-induced QA answer loss across chunk sizes.
Figure 3
Figure 3: Baseline Hit@k increases with retrieval depth.
… representative configurations that make this tradeoff visible without listing all 32 sensitivity runs. This pattern is one of the clearest examples of non-monotonic RAG behavior in the study. The system retrieves more gold-context evidence at higher top k, but the generator does not consistently convert that additional evidence into better answers. Higher top…
Figure 5
Figure 5: Relationship between input token count and F1, supporting the context-overload interpretation.
Figure 4
Figure 4: Baseline F1 shows non-monotonic behavior as retrieval depth increases.
5.3 Configuration Tradeoffs Rather Than a Universal Optimum
The strongest EM configuration in the baseline grid was chunk size 120 with top k=3, for both rerank off and rerank on. The strongest F1 region included chunk size 120 with top k=3 and chunk size 140 with top k=3. In contrast, the highest Hit@k occurred at chunk size 140 with …
Figure 6
Figure 6: F1 degradation under increasing retrieval corruption.
5.4 Robustness Under Retrieval Noise
Retrieval corruption caused progressive degradation. Averaged across the four robustness anchors, Hit@k decreased from 0.711 at 10% noise to 0.561 at 30% noise. Mean F1 decreased from 0.548 to 0.442, while retrieval failure and packing failure increased (…
Figure 8
Figure 8: F1 standard deviation across repeated seeded runs.
5.6 Multi-Stage Failure Analysis
Failure analysis is one of the core contributions of this study. A final answer failure can arise from at least three distinct stages:
  • Retrieval failure: no retrieved chunk comes from the gold context id.
  • Packing failure: answer-bearing evidence is not present in the final packed context.
  • Generation failure: answer-be…
Original abstract

Retrieval-Augmented Generation (RAG) systems are often evaluated using final answer accuracy, even though their failures can originate from preprocessing, retrieval, context packing, or generation. This paper presents a controlled empirical study of RAG sensitivity, robustness, and stability across 56 experimental runs. We evaluate how chunk size, retrieval depth (top k), embedding-based reranking, probabilistic retrieval noise, and repeated seeded runs affect retrieval, context packing, and generation behavior. Using a fixed 500-question QA subset mapped to 20,958 unique corpus contexts, we analyze both final answer metrics and intermediate failure modes. Across these experiments, retrieval-oriented metrics improved under broader retrieval settings, while downstream exact-match and F1 scores often behaved non-monotonically. We also observe preprocessing-induced answer loss under smaller chunk sizes, progressive degradation under retrieval corruption, and higher observed variance in broader retrieval regimes. These findings suggest that RAG evaluation should include sensitivity, robustness, stability, and multi-stage failure analysis rather than relying only on final answer accuracy.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Referee Report

2 major / 2 minor

Summary. The manuscript reports a controlled empirical study of Retrieval-Augmented Generation (RAG) systems across 56 experimental runs on a fixed 500-question QA subset mapped to a 20,958-context corpus. It systematically varies chunk size, retrieval depth (top-k), embedding reranking, probabilistic retrieval noise, and repeated seeded runs, measuring effects on retrieval metrics, context packing, and downstream generation (exact-match and F1). Key observations include non-monotonic behavior in final-answer metrics despite improving retrieval scores, preprocessing-induced answer loss at small chunk sizes, progressive degradation under noise, and higher variance in broader retrieval regimes. The authors conclude that RAG evaluation should incorporate sensitivity, robustness, stability, and multi-stage failure analysis rather than relying solely on final-answer accuracy.

Significance. If the reported patterns prove robust, the work would usefully demonstrate concrete limitations of accuracy-only RAG evaluation and supply a template for multi-stage analysis that isolates preprocessing, retrieval, and generation failures. The controlled design with intermediate metrics and repeated runs is a clear strength, offering reproducible examples of where and why performance diverges. The single-corpus, single-question-set scope, however, constrains how far the prescriptive recommendation can be taken without further validation.

major comments (2)
  1. [Abstract] Abstract and conclusion: The recommendation that RAG evaluation 'should include sensitivity, robustness, stability, and multi-stage failure analysis' is grounded exclusively in results from one 500-question subset and one 20,958-context corpus. The non-monotonic downstream metrics and rising variance under broader retrieval could be specific to this question distribution, context overlap, or answer phrasing; no cross-corpus or cross-query-set experiments are reported to test whether the divergence between retrieval and generation metrics generalizes.
  2. [Results] Experimental design and results sections: The 56 runs are presented without statistical significance tests, error bars, or explicit exclusion criteria for questions or runs. This weakens the ability to assess whether the claimed non-monotonic behaviors and variance patterns are reliable or sensitive to the particular 500-question sample.
minor comments (2)
  1. A summary table listing the exact parameter settings for each of the 56 runs would improve reproducibility and allow readers to map specific configurations to the reported trends.
  2. [Results] Figures illustrating non-monotonic trends and variance would benefit from explicit variance bands or per-run scatter to make the stability claims visually clearer.

Simulated Author's Rebuttal

2 responses · 0 unresolved

We thank the referee for the detailed and constructive report. We address each major comment below, clarifying the scope of our claims and the experimental controls already present while noting where revisions can strengthen the manuscript.

Point-by-point responses
  1. Referee: [Abstract] Abstract and conclusion: The recommendation that RAG evaluation 'should include sensitivity, robustness, stability, and multi-stage failure analysis' is grounded exclusively in results from one 500-question subset and one 20,958-context corpus. The non-monotonic downstream metrics and rising variance under broader retrieval could be specific to this question distribution, context overlap, or answer phrasing; no cross-corpus or cross-query-set experiments are reported to test whether the divergence between retrieval and generation metrics generalizes.

    Authors: We agree that the empirical patterns are demonstrated on a single corpus and question set. The manuscript frames the contribution as a controlled case study that isolates specific failure modes (preprocessing loss, non-monotonicity, variance under noise) rather than claiming universality. The prescriptive recommendation follows from the observation that final-answer accuracy alone missed these behaviors in this reproducible setting; it is offered as a template for multi-stage analysis, not as a proven requirement for every RAG system. We will revise the abstract and conclusion to explicitly qualify the scope and note that broader validation across corpora would be valuable future work. revision: partial

  2. Referee: [Results] Experimental design and results sections: The 56 runs are presented without statistical significance tests, error bars, or explicit exclusion criteria for questions or runs. This weakens the ability to assess whether the claimed non-monotonic behaviors and variance patterns are reliable or sensitive to the particular 500-question sample.

    Authors: The design already incorporates repeated seeded runs (five seeds per configuration) to quantify variance, and the full 500-question set was used with no exclusions. We will add error bars derived from the repeated runs to all relevant figures and tables, and we will include a brief statement on the absence of question-level filtering. While formal hypothesis tests were not performed, the repeated-run variance already provides a direct measure of stability; we can add paired significance tests on the key non-monotonic comparisons if the editor deems it necessary. revision: yes
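The error bars discussed in this exchange could be derived from the per-seed scores roughly as sketched below. The helper name and the five example scores are hypothetical, and using the sample standard deviation (n - 1 denominator) is an assumption about how the spread would be reported.

```python
from statistics import mean, stdev

def summarize_seeds(scores):
    """Return (mean, sample standard deviation) for one configuration's seeded runs."""
    center = mean(scores)
    spread = stdev(scores) if len(scores) > 1 else 0.0   # stdev needs at least 2 points
    return center, spread

# Illustrative F1 scores for five seeds of a single configuration.
f1_per_seed = [0.54, 0.56, 0.55, 0.53, 0.57]
center, spread = summarize_seeds(f1_per_seed)
print(f"F1 = {center:.3f} ± {spread:.3f}")
```

With only five seeds per configuration the spread estimate is itself noisy, so bars computed this way are better read as a stability indicator than as a formal confidence interval.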

Circularity Check

0 steps flagged

No circularity: purely empirical measurements on fixed dataset

Full rationale

The paper conducts a controlled empirical study consisting of 56 experimental runs on one fixed 500-question QA subset and 20,958-context corpus. All reported behaviors (non-monotonic downstream scores, preprocessing loss, variance patterns, degradation under noise) are direct observations from these runs rather than quantities derived from equations, fitted parameters renamed as predictions, or self-citation chains. No derivation chain exists; the central recommendation follows from the measured divergence between retrieval and generation metrics. This is the most common honest non-finding for measurement-focused work.

Axiom & Free-Parameter Ledger

0 free parameters · 1 axioms · 0 invented entities

The recommendation for multi-stage evaluation rests on the assumption that the controlled sweeps on this fixed QA subset capture representative failure modes; no free parameters are fitted to produce the headline claim.

axioms (1)
  • domain assumption: The 500-question subset mapped to 20,958 contexts is sufficiently representative for drawing general conclusions about RAG behavior.
    The study fixes this subset as the evaluation basis for all 56 runs.

pith-pipeline@v0.9.1-grok · 5707 in / 1164 out tokens · 38957 ms · 2026-06-30T10:46:00.967851+00:00 · methodology

