arxiv: 2604.10745 · v1 · submitted 2026-04-12 · 💻 cs.CL

How You Ask Matters! Adaptive RAG Robustness to Query Variations

Pith reviewed 2026-05-10 15:33 UTC · model grok-4.3

classification 💻 cs.CL
keywords adaptive RAG · query variations · robustness · retrieval-augmented generation · benchmark · query robustness · retrieval decisions

The pith

Small surface changes in queries cause large shifts in Adaptive RAG retrieval decisions and accuracy.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper tests Adaptive RAG systems, which decide on the fly whether to retrieve external information before answering. It builds a benchmark of many rephrasings that keep the same meaning but vary in wording, using both human writers and model-generated versions. Evaluation across answer quality, computation cost, and retrieval choices shows that even tiny wording differences can flip whether retrieval happens and whether the final answer is correct. Bigger models improve raw performance yet leave the sensitivity to wording largely unchanged. The work therefore identifies that current Adaptive RAG methods remain fragile to natural query variation despite preserving intent.

Core claim

Adaptive RAG methods exhibit a critical robustness gap: semantically identical queries that differ only in surface form produce markedly different retrieval triggers and final accuracies. Larger language models raise overall performance but do not close this gap. The benchmark of human-written and model-generated rewrites makes the gap visible across three measured dimensions—answer quality, computational cost, and retrieval decisions—revealing that Adaptive RAG remains highly vulnerable to query variations that preserve identical semantics.

What carries the argument

A large-scale benchmark of diverse yet semantically identical query variations, built from human-written rewrites and model-generated rewrites, used to measure how Adaptive RAG components respond across answer quality, cost, and retrieval decisions.
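A minimal sketch of how the retrieval-decision dimension could be scored, assuming each query carries one original retrieval decision plus two rewrite decisions, as in the One-flip / Both-flip categories of the Figure 1 caption. The function names and the boolean encoding are illustrative assumptions, not the paper's code.

```python
from collections import Counter
from typing import Dict, List, Tuple

# Hypothetical record: (retrieved on original?, (retrieved on rewrite 1?, rewrite 2?))
Record = Tuple[bool, Tuple[bool, bool]]


def classify_flip(original: bool, rewrites: Tuple[bool, bool]) -> str:
    """Label a query by how many of its two rewrites change the retrieval decision."""
    flips = sum(decision != original for decision in rewrites)
    return {0: "no-flip", 1: "one-flip", 2: "both-flip"}[flips]


def flip_rates(records: List[Record]) -> Dict[str, float]:
    """Return the share of queries in each flip category."""
    if not records:
        raise ValueError("records must not be empty")
    counts = Counter(classify_flip(orig, rw) for orig, rw in records)
    labels = ("no-flip", "one-flip", "both-flip")
    return {label: counts.get(label, 0) / len(records) for label in labels}


# Toy data, invented for illustration only.
records = [
    (True, (True, True)),     # stable: retrieval kept on both rewrites
    (True, (False, True)),    # one rewrite flips the decision
    (False, (True, True)),    # both rewrites flip the decision
    (False, (False, False)),  # stable: retrieval skipped on both rewrites
]
rates = flip_rates(records)
print(rates)
```

Higher one-flip and both-flip shares would correspond to the instability the paper reports; the same routine could be run per dataset or per rewrite type.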

If this is right

  • Adaptive RAG systems must be evaluated for stability under query rephrasing in addition to raw accuracy.
  • Larger models alone will not solve robustness problems in dynamic retrieval decisions.
  • Retrieval decisions and cost controls in Adaptive RAG can be driven by surface features rather than true information need.
  • Benchmarking that includes multiple surface forms is required to expose hidden failure modes in production RAG pipelines.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

  • Designers of future Adaptive RAG pipelines may need to add explicit semantic normalization steps before deciding whether to retrieve.
  • The observed gap could be narrowed by training retrieval routers on paired examples of equivalent queries rather than single surface forms.
  • Real-world user interfaces that accept free-form questions may see higher error rates than lab tests suggest unless robustness to wording is addressed.

Load-bearing premise

The rewrites used in the benchmark truly preserve identical semantics and intent without introducing subtle factual shifts or stylistic biases that themselves affect retrieval.

What would settle it

Run the same Adaptive RAG systems on a fresh set of real-user queries that have been independently verified to carry identical meaning, then check whether retrieval decisions and accuracy still vary as widely as observed on the benchmark.

Figures

Figures reproduced from arXiv: 2604.10745 by Kyomin Jung, Meeyoung Cha, Megha Sundriyal, Yunah Jang.

Figure 1
Retrieval-decision flip rates of the Qwen-32B model, measuring how often the model changes its retrieval judgment under meaning-preserving human query rewrites. Higher rates signal instability. Flips are categorized as One-flip (only one rewrite flips) and Both-flip (both rewrites flip).
Figure 2
Example of Adaptive RAG responses under human, original, and LLM-generated query rewrites, highlighting differences in answer correctness, computation overhead, and retrieval score.
Figure 3
InAccuracy results across six perturbation types for a given dataset / model pair.
Figure 4
Under- and over-confident rates on model query variations.
Figure 5
Query analysis. From left to right: (a) change in LM loss relative to the original query (∆Loss; nats/token), (b) change in query length (∆Length; words), and (c) semantic similarity to the original query (cosine similarity).
[Neighboring plot, caption not captured: (a) Subquery Drift, Sim vs. Hop; (b) Retrieval Failure Rate, Rate vs. Hop.]
Figure 7
InAccuracy results with human rewrites.
Figure 8
[Caption not captured. Plot: Rate (%) vs. Hop for Original, Model, and Human queries.]
Figure 9
Helpfulness and harmfulness distribution on NQ using the QwQ-32B model.
Figure 10
InAccuracy (Llama-3.1-8B). Performance across datasets for four Adaptive RAG methods. Each panel reports InAccuracy on seven query variations: the original query and six model-generated perturbations.
Figure 11
InAccuracy (QwQ-32B). Performance across datasets for four Adaptive RAG methods. Each panel reports InAccuracy on seven query variations: the original query and six model-generated perturbations.
Figure 12
InAccuracy (Llama-3.1-8B) including human rewrites. Performance across datasets for four Adaptive RAG methods, evaluated on a matched subset of instances that have human rewrites. For each panel, InAccuracy is reported for eight query variations (original, human rewrite, and six model-generated perturbations), all computed on the same human-rewrite subset.
Figure 13
InAccuracy (QwQ-32B) including human rewrites. Performance across datasets for four Adaptive RAG methods, evaluated on a matched subset of instances that have human rewrites. For each panel, InAccuracy is reported for eight query variations (original, human rewrite, and six model-generated perturbations), all computed on the same human-rewrite subset.
Figure 14
LLM Call on Llama-3.1-8B on both model-generated and human queries.
[Bar charts per dataset (SQuAD, NQ, …): Call Count by query variant (Orig., Human, Decl., Impr., Read., Form., Spell., Gram.).]
Figure 15
LLM Call on QwQ-32B on both model-generated and human queries.
Figure 16
Retriever Call on Llama-3.1-8B on both model-generated and human queries.
[Bar charts per dataset (SQuAD, NQ, …): Call Count by query variant (Orig., Human, Decl., Impr., Read., Form., Spell., Gram.).]
Figure 17
Retriever Call on QwQ-32B on both model-generated and human queries.
Figure 18
Under- and over-confidence results on Llama-3.1-8B on model-generated queries.
Figure 19
Under- and over-confidence results on QwQ-32B on model-generated queries.
Figure 20
Under- and over-confidence results on Llama-3.1-8B, including human-written queries.
Figure 21
Under- and over-confidence results on Llama-3.1-8B, including human-written queries.
Figure 22
Loss score across all datasets.
Figure 23
Query length across all datasets.
Figure 24
Semantic similarity to original query across all datasets.
Original abstract

Adaptive Retrieval-Augmented Generation (RAG) promises accuracy and efficiency by dynamically triggering retrieval only when needed and is widely used in practice. However, real-world queries vary in surface form even with the same intent, and their impact on Adaptive RAG remains under-explored. We introduce the first large-scale benchmark of diverse yet semantically identical query variations, combining human-written and model-generated rewrites. Our benchmark facilitates a systematic evaluation of Adaptive RAG robustness by examining its key components across three dimensions: answer quality, computational cost, and retrieval decisions. We discover a critical robustness gap, where small surface-level changes in queries dramatically alter retrieval behavior and accuracy. Although larger models show better performance, robustness does not improve accordingly. These findings reveal that Adaptive RAG methods are highly vulnerable to query variations that preserve identical semantics, exposing a critical robustness challenge.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Referee Report

1 major / 2 minor

Summary. The manuscript introduces the first large-scale benchmark of query variations for Adaptive RAG, combining human-written and model-generated rewrites claimed to be semantically identical. It systematically evaluates Adaptive RAG methods across three dimensions—answer quality, computational cost, and retrieval decisions—reporting a critical robustness gap in which small surface-level query changes dramatically alter retrieval behavior and accuracy. Larger models are found to improve overall performance but not robustness to these variations.

Significance. If the benchmark's core assumption holds, the work identifies a practically important limitation in Adaptive RAG systems that are already deployed for efficiency. The new benchmark and multi-dimensional evaluation constitute a useful empirical contribution that could motivate more robust trigger mechanisms and query normalization techniques. The finding that scale does not confer robustness is a falsifiable observation worth testing in follow-up studies.

major comments (1)
  1. [Benchmark construction] Benchmark construction (as described in the abstract and introduction): the central claim attributes the observed robustness gap to surface-form variation alone, yet the manuscript provides no validation metrics—such as inter-annotator agreement scores, human semantic equivalence ratings, embedding cosine thresholds, or disagreement rates—for the human-written and model-generated rewrites. Without these, it is impossible to rule out that subtle factual shifts, entity substitutions, or stylistic cues are driving the reported differences in retrieval decisions and accuracy rather than surface variation.
minor comments (2)
  1. [Abstract] The abstract would benefit from reporting the total number of queries, number of rewrite pairs, and the specific Adaptive RAG baselines evaluated so readers can immediately gauge scale and coverage.
  2. [Evaluation] Clarify whether any statistical tests (e.g., paired t-tests or bootstrap confidence intervals) were applied to the differences in retrieval decisions and accuracy across rewrite types.
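One way the statistical check raised in the second minor comment might look is a percentile bootstrap over paired per-query correctness differences. This is a sketch under stated assumptions: the helper name, the 0/1 correctness encoding, the resample count, and the toy data are all hypothetical, and the paper does not specify any such procedure.

```python
import random
from typing import List, Tuple


def paired_bootstrap_ci(
    orig_correct: List[int],
    variant_correct: List[int],
    n_resamples: int = 2000,
    alpha: float = 0.05,
    seed: int = 0,
) -> Tuple[float, float, float]:
    """Percentile-bootstrap CI for mean(variant - original) over paired queries."""
    if len(orig_correct) != len(variant_correct) or not orig_correct:
        raise ValueError("inputs must be non-empty and of equal length")
    rng = random.Random(seed)
    # Pairing: each difference compares the same query under two surface forms.
    diffs = [v - o for o, v in zip(orig_correct, variant_correct)]
    n = len(diffs)
    observed = sum(diffs) / n
    means = []
    for _ in range(n_resamples):
        # Resample queries with replacement, keeping each pair intact.
        sample = [diffs[rng.randrange(n)] for _ in range(n)]
        means.append(sum(sample) / n)
    means.sort()
    low = means[int((alpha / 2) * n_resamples)]
    high = means[min(n_resamples - 1, int((1 - alpha / 2) * n_resamples))]
    return observed, low, high


# Toy correctness flags (1 = correct), invented for illustration only.
orig = [1, 1, 1, 0, 1, 1, 0, 1, 1, 1]
variant = [1, 0, 1, 0, 0, 1, 0, 1, 0, 1]
observed, low, high = paired_bootstrap_ci(orig, variant)
print(f"accuracy change: {observed:+.2f}, 95% CI [{low:+.2f}, {high:+.2f}]")
```

Resampling whole pairs rather than the two conditions separately keeps per-query difficulty fixed, which is the point of a paired comparison across rewrite types.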

Simulated Author's Rebuttal

1 response · 0 unresolved

We thank the referee for their constructive feedback on our manuscript. We address the major comment regarding the validation of semantic equivalence in the benchmark construction below, and commit to incorporating additional metrics in the revised version.

Point-by-point responses
  1. Referee: [Benchmark construction] Benchmark construction (as described in the abstract and introduction): the central claim attributes the observed robustness gap to surface-form variation alone, yet the manuscript provides no validation metrics—such as inter-annotator agreement scores, human semantic equivalence ratings, embedding cosine thresholds, or disagreement rates—for the human-written and model-generated rewrites. Without these, it is impossible to rule out that subtle factual shifts, entity substitutions, or stylistic cues are driving the reported differences in retrieval decisions and accuracy rather than surface variation.

    Authors: We appreciate the referee's point on the need for explicit validation of semantic equivalence. The human-written rewrites were produced following detailed guidelines that instructed annotators to maintain identical meaning, entities, and facts while varying only the surface form. Similarly, the model-generated rewrites used prompts that explicitly required preserving semantics without introducing new information. However, we acknowledge that quantitative validation metrics were not reported in the original submission. In the revised manuscript, we will add a dedicated section on benchmark validation, including inter-annotator agreement for human rewrites, average embedding cosine similarities between original and variant queries, and results from a human evaluation study assessing semantic equivalence ratings. These additions will provide stronger evidence that the observed robustness gap is indeed due to surface variations rather than unintended semantic changes.

    revision: yes
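The inter-annotator agreement mentioned in this exchange could, for instance, be reported as Cohen's kappa over two annotators' equivalence judgments. The sketch below assumes binary labels (1 = "same meaning", 0 = "meaning changed") and made-up ratings; neither the metric choice nor the labeling scheme is confirmed by the paper or the rebuttal.

```python
from collections import Counter
from typing import Hashable, Sequence


def cohens_kappa(a: Sequence[Hashable], b: Sequence[Hashable]) -> float:
    """Cohen's kappa: chance-corrected agreement between two annotators."""
    if len(a) != len(b) or not a:
        raise ValueError("inputs must be non-empty and of equal length")
    n = len(a)
    # Observed agreement: fraction of items given the same label.
    observed = sum(x == y for x, y in zip(a, b)) / n
    # Expected agreement if both annotators labeled independently
    # according to their own marginal label frequencies.
    count_a, count_b = Counter(a), Counter(b)
    labels = set(count_a) | set(count_b)
    expected = sum(count_a[label] * count_b[label] for label in labels) / (n * n)
    if expected == 1.0:
        return 1.0  # both annotators used a single identical label throughout
    return (observed - expected) / (1 - expected)


# Hypothetical equivalence ratings for eight rewrite pairs.
annotator_1 = [1, 1, 1, 0, 1, 0, 1, 1]
annotator_2 = [1, 1, 0, 0, 1, 0, 1, 1]
kappa = cohens_kappa(annotator_1, annotator_2)
print(f"kappa = {kappa:.3f}")
```

Raw agreement alone can look high when nearly every rewrite is judged equivalent, which is why a chance-corrected statistic is the more informative number to report here.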

Circularity Check

0 steps flagged

Empirical benchmark study with no derivation chain or self-referential fitting.

full rationale

The paper is an empirical evaluation introducing a benchmark of query rewrites and comparing Adaptive RAG methods on answer quality, cost, and retrieval decisions. No equations, fitted parameters, or mathematical derivations are present in the provided text. Claims rest on experimental observations rather than any step that reduces by construction to author-defined inputs or self-citations. The semantic-equivalence assumption for rewrites is an empirical premise open to validation but does not create circularity in any derivation.

Axiom & Free-Parameter Ledger

0 free parameters · 1 axiom · 0 invented entities

The paper is an empirical benchmark study. It relies on the domain assumption that surface-form variants can be generated while preserving exact semantics, and on standard NLP evaluation practices. No new free parameters or invented entities are introduced by the authors.

axioms (1)
  • domain assumption: Semantically identical queries can be reliably produced by human writers and LLMs without introducing new factual content or retrieval-relevant biases.
    Invoked when constructing the benchmark of 'diverse yet semantically identical query variations'.

pith-pipeline@v0.9.0 · 5446 in / 1176 out tokens · 47858 ms · 2026-05-10T15:33:08.287648+00:00 · methodology

