How You Ask Matters! Adaptive RAG Robustness to Query Variations

Kyomin Jung; Meeyoung Cha; Megha Sundriyal; Yunah Jang

arxiv: 2604.10745 · v1 · submitted 2026-04-12 · 💻 cs.CL

Pith reviewed 2026-05-10 15:33 UTC · model grok-4.3

classification 💻 cs.CL

keywords: adaptive RAG, query variations, robustness, retrieval-augmented generation, benchmark, query robustness, retrieval decisions

The pith

Small surface changes in queries cause large shifts in Adaptive RAG retrieval decisions and accuracy.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper tests Adaptive RAG systems, which decide on the fly whether to retrieve external information before answering. It builds a benchmark of many rephrasings that keep the same meaning but vary in wording, using both human writers and model-generated versions. Evaluation across answer quality, computation cost, and retrieval choices shows that even tiny wording differences can flip whether retrieval happens and whether the final answer is correct. Bigger models improve raw performance yet leave the sensitivity to wording largely unchanged. The work therefore identifies that current Adaptive RAG methods remain fragile to natural query variation despite preserving intent.

Core claim

Adaptive RAG methods exhibit a critical robustness gap: semantically identical queries that differ only in surface form produce markedly different retrieval triggers and final accuracies. Larger language models raise overall performance but do not close this gap. The benchmark of human-written and model-generated rewrites makes the gap visible across three measured dimensions—answer quality, computational cost, and retrieval decisions—revealing that Adaptive RAG remains highly vulnerable to query variations that preserve identical semantics.

What carries the argument

A large-scale benchmark of diverse yet semantically identical query variations, built from human-written rewrites and model-generated rewrites, used to measure how Adaptive RAG components respond across answer quality, cost, and retrieval decisions.
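A rough sketch of how per-variant results might be recorded and summarized along the three dimensions named above. The schema and field names (`VariantResult`, `llm_calls`, `retriever_calls`, `retrieved`) and the variant labels are hypothetical; they are not taken from the paper's code.

```python
from collections import defaultdict
from dataclasses import dataclass


@dataclass(frozen=True)
class VariantResult:
    """One Adaptive RAG run on one surface form of a question (hypothetical schema)."""
    question_id: str
    variant: str          # e.g. "original", "human", "spelling" (placeholder labels)
    correct: bool         # answer quality
    llm_calls: int        # computational cost
    retriever_calls: int  # computational cost
    retrieved: bool       # retrieval decision


def _avg(values):
    values = list(values)
    return sum(values) / len(values)


def summarize(results):
    """Aggregate accuracy, cost, and retrieval rate separately for each variant type."""
    by_variant = defaultdict(list)
    for r in results:
        by_variant[r.variant].append(r)
    return {
        variant: {
            "accuracy": _avg(r.correct for r in rows),
            "avg_llm_calls": _avg(r.llm_calls for r in rows),
            "avg_retriever_calls": _avg(r.retriever_calls for r in rows),
            "retrieval_rate": _avg(r.retrieved for r in rows),
        }
        for variant, rows in by_variant.items()
    }


# Toy data, invented purely for illustration.
demo = [
    VariantResult("q1", "original", True, 3, 1, True),
    VariantResult("q2", "original", True, 1, 0, False),
    VariantResult("q1", "human", False, 5, 2, True),
    VariantResult("q2", "human", True, 3, 1, True),
]
summary = summarize(demo)
```

Comparing rows of `summary` across variant labels is what would expose a gap: identical questions, different accuracy, cost, or retrieval rate.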

If this is right

Adaptive RAG systems must be evaluated for stability under query rephrasing in addition to raw accuracy.
Larger models alone will not solve robustness problems in dynamic retrieval decisions.
Retrieval triggers and cost controls in Adaptive RAG can be driven by surface features rather than true information need.
Benchmarking that includes multiple surface forms is required to expose hidden failure modes in production RAG pipelines.
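One way to quantify stability under rephrasing is a retrieval-decision flip rate. The sketch below follows the One-flip and Both-flip categories described in the Figure 1 caption, assuming each question has one original query and two rewrites. The function names and the boolean encoding of retrieval decisions are illustrative assumptions, not the authors' implementation.

```python
from collections import Counter


def flip_category(original, rewrite_a, rewrite_b):
    """Classify one question by how many rewrites change the retrieval decision."""
    flips = int(rewrite_a != original) + int(rewrite_b != original)
    return ("no-flip", "one-flip", "both-flip")[flips]


def flip_rates(decisions):
    """decisions: iterable of (original, rewrite_a, rewrite_b) booleans,
    where True means the system chose to retrieve."""
    counts = Counter(flip_category(*triple) for triple in decisions)
    total = sum(counts.values())
    return {name: counts[name] / total for name in ("no-flip", "one-flip", "both-flip")}


# Toy decisions, invented for illustration.
rates = flip_rates([
    (True, True, True),    # stable
    (True, False, True),   # one rewrite flips
    (False, True, True),   # both rewrites flip
    (True, True, False),   # one rewrite flips
])
```

A system that is robust to wording would put nearly all of its mass on `no-flip`.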

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the authors make directly.

Designers of future Adaptive RAG pipelines may need to add explicit semantic normalization steps before deciding whether to retrieve.
The observed gap could be narrowed by training retrieval routers on paired examples of equivalent queries rather than single surface forms.
Real-world user interfaces that accept free-form questions may see higher error rates than lab tests suggest unless robustness to wording is addressed.
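As a minimal illustration of the normalization idea in the first point, the sketch below canonicalizes a query before a retrieval router sees it. Everything here is an assumption made for illustration: `normalize_query`, `decide_retrieval`, and the deliberately brittle `toy_router` are hypothetical. This kind of normalization only removes trivial differences (case, spacing, trailing punctuation); it does not handle paraphrases or spelling errors, and it makes decisions consistent, not necessarily correct.

```python
import re
import unicodedata


def normalize_query(query):
    """Collapse trivial surface differences: Unicode form, case, spacing, trailing punctuation."""
    text = unicodedata.normalize("NFKC", query).casefold()
    text = re.sub(r"\s+", " ", text).strip()
    return text.rstrip("?!. ")


def decide_retrieval(query, router):
    """Apply a retrieval router (any callable str -> bool) to the normalized query."""
    return router(normalize_query(query))


def toy_router(query):
    """Hypothetical, deliberately brittle router: retrieve only if the query ends with '?'."""
    return query.endswith("?")


variants = ["Who wrote Hamlet?", "  who wrote   hamlet ", "WHO WROTE HAMLET??"]
raw_decisions = [toy_router(v) for v in variants]
normalized_decisions = [decide_retrieval(v, toy_router) for v in variants]
```

On the raw variants the toy router disagrees with itself; after normalization all three variants receive the same decision.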

Load-bearing premise

The rewrites used in the benchmark truly preserve identical semantics and intent without introducing subtle factual shifts or stylistic biases that themselves affect retrieval.

What would settle it

Run the same Adaptive RAG systems on a fresh set of real-user queries that have been independently verified to carry identical meaning, then check whether retrieval decisions and accuracy still vary as widely as observed on the benchmark.

Figures

Figures reproduced from arXiv: 2604.10745 by Kyomin Jung, Meeyoung Cha, Megha Sundriyal, Yunah Jang.

**Figure 1.** Retrieval-decision flip rates of the Qwen32B model, measuring how often the model changes its retrieval judgment under meaning-preserving human query rewrites. Higher rates signal instability. Flips are categorized as One-flip (only one rewrite flips) and Both-flip (both rewrites flip).

**Figure 2.** Example of Adaptive RAG responses under human, original, and LLM-generated query rewrites, highlighting differences in answer correctness, computation overhead, and retrieval score.

**Figure 3.** InAccuracy results across six perturbation types for a given dataset / model pair.

**Figure 4.** Under- and over-confident rates on model query variations.

**Figure 5.** Query analysis. From left to right: (a) change in LM loss relative to the original query (∆Loss; nats/token), (b) change in query length (∆Length; words), and (c) semantic similarity to the original query (cosine similarity).

**Figure 7.** InAccuracy results with human rewrites.

**Figure 8.** Caption not recovered from the source extraction; the plot shows Rate (%) against Hop for Original, Model, and Human queries.

**Figure 9.** Helpfulness and harmfulness distribution on NQ using the QwQ-32B model.

**Figure 10.** InAccuracy (Llama-3.1-8B). Performance across datasets for four Adaptive RAG methods. Each panel reports InAccuracy on seven query variations: the original query and six model-generated perturbations.

**Figure 11.** InAccuracy (QwQ-32B). Performance across datasets for four Adaptive RAG methods. Each panel reports InAccuracy on seven query variations: the original query and six model-generated perturbations.

**Figure 12.** InAccuracy (Llama-3.1-8B) including human rewrites. Performance across datasets for four Adaptive RAG methods, evaluated on a matched subset of instances that have human rewrites. For each panel, InAccuracy is reported for eight query variations (original, human rewrite, and six model-generated perturbations), all computed on the same human-rewrite subset.

**Figure 13.** InAccuracy (QwQ-32B) including human rewrites. Performance across datasets for four Adaptive RAG methods, evaluated on a matched subset of instances that have human rewrites. For each panel, InAccuracy is reported for eight query variations (original, human rewrite, and six model-generated perturbations), all computed on the same human-rewrite subset.

**Figure 14.** LLM Call on Llama-3.1-8B on both model-generated and human queries.

**Figure 15.** LLM Call on QwQ-32B on both model-generated and human queries.

**Figure 16.** Retriever Call on Llama-3.1-8B on both model-generated and human queries.

**Figure 17.** Retriever Call on QwQ-32B on both model-generated and human queries.

**Figure 18.** Under- and Over-confidence results on Llama-3.1-8B on model-generated queries.

**Figure 19.** Under- and Over-confidence results on QwQ-32B on model-generated queries.

**Figure 20.** Under- and Over-confidence results on Llama-3.1-8B, including human-written queries.

**Figure 21.** Under- and Over-confidence results on Llama-3.1-8B, including human-written queries.

**Figure 22.** Loss score across all datasets.

**Figure 23.** Query length across all datasets.

**Figure 24.** Semantic similarity to original query across all datasets.

Original abstract

Adaptive Retrieval-Augmented Generation (RAG) promises accuracy and efficiency by dynamically triggering retrieval only when needed and is widely used in practice. However, real-world queries vary in surface form even with the same intent, and their impact on Adaptive RAG remains under-explored. We introduce the first large-scale benchmark of diverse yet semantically identical query variations, combining human-written and model-generated rewrites. Our benchmark facilitates a systematic evaluation of Adaptive RAG robustness by examining its key components across three dimensions: answer quality, computational cost, and retrieval decisions. We discover a critical robustness gap, where small surface-level changes in queries dramatically alter retrieval behavior and accuracy. Although larger models show better performance, robustness does not improve accordingly. These findings reveal that Adaptive RAG methods are highly vulnerable to query variations that preserve identical semantics, exposing a critical robustness challenge.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Desk Editor's Note (private letter to a colleague)

Adaptive RAG flips retrieval and accuracy on minor rephrasings of the same question, shown via a new benchmark of human and model rewrites, though the rewrites' semantic fidelity is not clearly validated.

The main takeaway is that Adaptive RAG systems change what they retrieve and how accurate they are when the query is reworded slightly, even when the intent is supposed to stay identical. The authors built a benchmark of query variations at larger scale than prior work, mixing human rewrites with model-generated ones, then tested several Adaptive RAG methods on answer quality, compute cost, and the actual retrieval trigger decisions. Larger models improve raw performance but do not close the robustness gap.

This is a practical observation worth noting for anyone running these systems where users phrase things differently each time. The evaluation across three dimensions is a reasonable way to surface the issue, and the focus on adaptive retrieval rather than fixed RAG is a clear step forward from earlier robustness studies.

The soft spot sits in the benchmark construction itself. The central claim rests on the rewrites preserving identical semantics, yet the abstract gives no numbers on human equivalence ratings, embedding thresholds, inter-annotator agreement, or disagreement rates. Without those checks, some of the reported shifts in retrieval behavior could trace to small unintended meaning changes or stylistic cues that retrieval models pick up. The paper would benefit from an explicit validation section and error analysis on the cases where performance dropped.

This work is aimed at engineers and researchers building or auditing production RAG pipelines who care about reliability under real user input. Readers looking for empirical stress tests of adaptive components will find usable material. It deserves a serious referee because the problem is concrete and the benchmark could serve as a shared resource, provided the semantic-equivalence step is tightened.

Referee Report

1 major / 2 minor

Summary. The manuscript introduces the first large-scale benchmark of query variations for Adaptive RAG, combining human-written and model-generated rewrites claimed to be semantically identical. It systematically evaluates Adaptive RAG methods across three dimensions—answer quality, computational cost, and retrieval decisions—reporting a critical robustness gap in which small surface-level query changes dramatically alter retrieval behavior and accuracy. Larger models are found to improve overall performance but not robustness to these variations.

Significance. If the benchmark's core assumption holds, the work identifies a practically important limitation in Adaptive RAG systems that are already deployed for efficiency. The new benchmark and multi-dimensional evaluation constitute a useful empirical contribution that could motivate more robust trigger mechanisms and query normalization techniques. The finding that scale does not confer robustness is a falsifiable observation worth testing in follow-up studies.

major comments (1)

[Benchmark construction] Benchmark construction (as described in the abstract and introduction): the central claim attributes the observed robustness gap to surface-form variation alone, yet the manuscript provides no validation metrics—such as inter-annotator agreement scores, human semantic equivalence ratings, embedding cosine thresholds, or disagreement rates—for the human-written and model-generated rewrites. Without these, it is impossible to rule out that subtle factual shifts, entity substitutions, or stylistic cues are driving the reported differences in retrieval decisions and accuracy rather than surface variation.
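To make the embedding-threshold check mentioned in this comment concrete, here is a small sketch that flags rewrites whose embedding drifts too far from the original query. It assumes embeddings have already been computed by some sentence encoder (not shown). The 0.7 default is illustrative only, and `flag_suspect_rewrites` is a hypothetical helper, not something the manuscript describes.

```python
import math


def cosine_similarity(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    if norm_u == 0 or norm_v == 0:
        raise ValueError("cosine similarity is undefined for a zero vector")
    return dot / (norm_u * norm_v)


def flag_suspect_rewrites(embedded_pairs, threshold=0.7):
    """embedded_pairs: iterable of (pair_id, original_vector, rewrite_vector).
    Returns the ids whose similarity falls below the (assumed) threshold."""
    return [
        pair_id
        for pair_id, original_vec, rewrite_vec in embedded_pairs
        if cosine_similarity(original_vec, rewrite_vec) < threshold
    ]


# Toy 2-d vectors standing in for real sentence embeddings.
suspects = flag_suspect_rewrites([
    ("q1", [1.0, 0.0], [1.0, 0.1]),   # nearly parallel: kept
    ("q2", [1.0, 0.0], [0.1, 1.0]),   # nearly orthogonal: flagged
])
```

High embedding similarity is at best a necessary signal; entity substitutions can leave it high, so flagged and unflagged pairs would both still need human spot checks.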

minor comments (2)

[Abstract] The abstract would benefit from reporting the total number of queries, number of rewrite pairs, and the specific Adaptive RAG baselines evaluated so readers can immediately gauge scale and coverage.
[Evaluation] Clarify whether any statistical tests (e.g., paired t-tests or bootstrap confidence intervals) were applied to the differences in retrieval decisions and accuracy across rewrite types.
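A sketch of the kind of paired bootstrap interval this comment has in mind, applied to per-question correctness on original versus rewritten queries. It is a generic percentile bootstrap under assumed inputs (two aligned lists of 0/1 outcomes); the function name, resample count, and seed are arbitrary choices, and nothing here reflects an analysis the authors report.

```python
import random


def paired_bootstrap_ci(correct_a, correct_b, n_resamples=2000, alpha=0.05, seed=0):
    """Percentile bootstrap CI for mean(correct_a) - mean(correct_b) on paired items."""
    if len(correct_a) != len(correct_b) or not correct_a:
        raise ValueError("inputs must be non-empty and of equal length")
    rng = random.Random(seed)
    # Resample per-question differences so the pairing is preserved.
    diffs = [int(a) - int(b) for a, b in zip(correct_a, correct_b)]
    n = len(diffs)
    observed = sum(diffs) / n
    means = sorted(
        sum(rng.choices(diffs, k=n)) / n for _ in range(n_resamples)
    )
    lower = means[int((alpha / 2) * n_resamples)]
    upper = means[min(n_resamples - 1, int((1 - alpha / 2) * n_resamples))]
    return observed, lower, upper


# Toy outcomes, invented for illustration: 1 = correct, 0 = incorrect.
original_correct = [1, 1, 1, 1, 1, 1, 1, 1, 0, 0]
rewrite_correct = [1, 1, 1, 1, 1, 0, 0, 0, 0, 0]
observed, lower, upper = paired_bootstrap_ci(original_correct, rewrite_correct)
```

If the interval excludes zero, the accuracy drop under rewriting is unlikely to be resampling noise; with only ten toy items the interval is, as expected, wide.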

Simulated Author's Rebuttal

1 response · 0 unresolved

We thank the referee for their constructive feedback on our manuscript. We address the major comment regarding the validation of semantic equivalence in the benchmark construction below, and commit to incorporating additional metrics in the revised version.

Referee: [Benchmark construction] Benchmark construction (as described in the abstract and introduction): the central claim attributes the observed robustness gap to surface-form variation alone, yet the manuscript provides no validation metrics—such as inter-annotator agreement scores, human semantic equivalence ratings, embedding cosine thresholds, or disagreement rates—for the human-written and model-generated rewrites. Without these, it is impossible to rule out that subtle factual shifts, entity substitutions, or stylistic cues are driving the reported differences in retrieval decisions and accuracy rather than surface variation.

Authors: We appreciate the referee's point on the need for explicit validation of semantic equivalence. The human-written rewrites were produced following detailed guidelines that instructed annotators to maintain identical meaning, entities, and facts while varying only the surface form. Similarly, the model-generated rewrites used prompts that explicitly required preserving semantics without introducing new information. However, we acknowledge that quantitative validation metrics were not reported in the original submission. In the revised manuscript, we will add a dedicated section on benchmark validation, including inter-annotator agreement for human rewrites, average embedding cosine similarities between original and variant queries, and results from a human evaluation study assessing semantic equivalence ratings. These additions will provide stronger evidence that the observed robustness gap is indeed due to surface variations rather than unintended semantic changes.

Revision: yes
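For the inter-annotator agreement promised in this response, one standard statistic is Cohen's kappa, which corrects raw agreement between two annotators for agreement expected by chance. The sketch below assumes two annotators each label every rewrite as equivalent or not; the label names and the choice of kappa itself are assumptions, since the rebuttal does not say which agreement measure will be used.

```python
from collections import Counter


def cohens_kappa(labels_a, labels_b):
    """Cohen's kappa for two annotators labelling the same items."""
    if len(labels_a) != len(labels_b) or not labels_a:
        raise ValueError("inputs must be non-empty and of equal length")
    n = len(labels_a)
    observed = sum(a == b for a, b in zip(labels_a, labels_b)) / n
    freq_a, freq_b = Counter(labels_a), Counter(labels_b)
    # Chance agreement from each annotator's marginal label frequencies.
    expected = sum(
        freq_a[label] * freq_b[label] for label in freq_a.keys() | freq_b.keys()
    ) / (n * n)
    if expected == 1.0:
        return 1.0  # both annotators used one identical label throughout
    return (observed - expected) / (1 - expected)


# Toy ratings, invented for illustration.
annotator_1 = ["same", "same", "diff", "diff"]
annotator_2 = ["same", "same", "diff", "same"]
kappa = cohens_kappa(annotator_1, annotator_2)
```

Here raw agreement is 0.75 but chance agreement is 0.5, so kappa comes out at 0.5, which is why reporting raw agreement alone would overstate annotator consistency.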

Circularity Check

0 steps flagged

Empirical benchmark study with no derivation chain or self-referential fitting.

full rationale

The paper is an empirical evaluation introducing a benchmark of query rewrites and comparing Adaptive RAG methods on answer quality, cost, and retrieval decisions. No equations, fitted parameters, or mathematical derivations are present in the provided text. Claims rest on experimental observations rather than any step that reduces by construction to author-defined inputs or self-citations. The semantic-equivalence assumption for rewrites is an empirical premise open to validation but does not create circularity in any derivation.

Axiom & Free-Parameter Ledger

0 free parameters · 1 axiom · 0 invented entities

The paper is an empirical benchmark study. It relies on the domain assumption that surface-form variants can be generated while preserving exact semantics, and on standard NLP evaluation practices. No new free parameters or invented entities are introduced by the authors.

axioms (1)

Domain assumption: Semantically identical queries can be reliably produced by human writers and LLMs without introducing new factual content or retrieval-relevant biases.
Invoked when constructing the benchmark of 'diverse yet semantically identical query variations'.

pith-pipeline@v0.9.0 · 5446 in / 1176 out tokens · 47858 ms · 2026-05-10T15:33:08.287648+00:00 · methodology

Reference graph

Works this paper leans on

17 extracted references · 17 canonical work pages

[1] Classifying term variants in query formulation. In Proceedings of the 48th International ACM SIGIR Conference on Research and Development in Information Retrieval, SIGIR ’25, page 2395–2406, New York, NY, USA. Association for Computing Machinery. Akari Asai, Zeqiu Wu, Yizhong Wang, Avirup Sil, and Hannaneh Hajishirzi. 2024. Self-RAG: Learning to retriev...

[2] The power of noise: Redefining retrieval for rag systems. In Proceedings of the 47th International ACM SIGIR Conference on Research and Development in Information Retrieval, SIGIR ’24, page 719–729, New York, NY, USA. Association for Computing Machinery. Hanxing Ding, Liang Pang, Zihao Wei, Huawei Shen, and Xueqi Cheng. 2025. Rowen: Adaptive retriev...

[3] Evaluating the robustness of retrieval pipelines with query variation generators. CoRR, abs/2111.13057. Gustavo Penha, Arthur Câmara, and Claudia Hauff

[4] Evaluating the robustness of retrieval pipelines with query variation generators. In Advances in Information Retrieval: 44th European Conference on IR Research, ECIR 2022, Stavanger, Norway, April 10–14, 2022, Proceedings, Part I, page 397–412, Berlin, Heidelberg. Springer-Verlag. Sezen Perçin, Xin Su, Qutub Sha Syed, Phillip Howard, Aleksei Kuvshinov, L...

[5] Qa dataset explosion: A taxonomy of nlp resources for question answering and reading comprehension. ACM Computing Surveys, 55(10):1–45. Sanat Sharma, David Seunghyun Yoon, Franck Dernoncourt, Dewang Sultania, Karishma Bagga, Mengjiao Zhang, Trung Bui, and Varun Kotte. 2024. Retrieval augmented generation for domain-specific question answering. Preprint,...

[6] Knowing you don’t know: Learning when to continue search in multi-round rag through self-practicing. In Proceedings of the 48th International ACM SIGIR Conference on Research and Development in Information Retrieval, SIGIR ’25, page 1305–1315, New York, NY, USA. Association for Computing Machinery. Zhilin Yang, Peng Qi, Saizheng Zhang, Yoshua Bengio, W...

[7] Qe-rag: A robust retrieval-augmented generation benchmark for query entry errors. Preprint, arXiv:2504.04062. Shengyao Zhuang and Guido Zuccon. 2022. Characterbert and self-teaching for improving the robustness of dense retrievers on queries with typos. In Proceedings of the 45th International ACM SIGIR Conference on Research and Development in Info...

[8] Task Definition: You are rewriting a query to make it significantly less readable while preserving the original semantic meaning as closely as possible

[9] Constraints & Goals: - Flesch Reading Ease Score: The rewritten text must have a Flesch score below 60 (preferably below 50). - Semantic Similarity: The rewritten text must have SBERT similarity > 0.7 compared with the original query. - Length: The rewritten text must remain approximately the same length as the original query (±10%). - Preserve Domain Ter...

[10] How to Increase Complexity: - Lexical Changes: Use advanced or academic synonyms only for common words. For domain or key terms (e.g., "distance," "IRS," "tax"), keep the original term or use a very close synonym if necessary to maintain meaning. - Syntactic Complexity: Introduce passive voice, nominalizations, embedded clauses, and parenthetical or subor...

[11] You will be given a query (question) and its corresponding answer

[12] Your task is to rewrite the given query: imagine how you would ask the same question if you were speaking to a large language model (like ChatGPT). What You Should Do

[13] Understand the original query. ◦ Identify what the user is trying to find out. ◦ Grasp the intent and main topic clearly

[14] Rewrite the query in a new way while keeping the same meaning. ◦ You may change the tone, phrasing, or structure. ◦ Information loss, addition, or simplification is acceptable. ◦ The goal is to make the rewritten query similar in meaning but different in expression. ◦ Make sure that your rewritten query can still be answered appropriately with the same answe...

[15] Incorporate your personal style. ◦ You can make the question sound more casual, detailed, concise, or clear. ◦ Different writing styles and tones are encouraged

[16] Avoid copying the original query. ◦ Do not simply rephrase it with only minor surface changes. ◦ The rewritten query should feel genuinely reworded and natural

[17] Use the answer only for context. ◦ The answer helps you understand what the original query intended to ask. ◦ You should not change your rewrite based on the answer’s content. ◦ Your task is to rewrite the question, not the answer. Table 15: Human Annotation Instruction: Query Rewriting Task.
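The readability-rewrite prompt excerpted in entries [8] to [10] sets three numeric constraints: a Flesch Reading Ease score below 60, SBERT similarity above 0.7, and length within ±10% of the original. The sketch below shows how such a check could be scripted. The Flesch formula is the standard one; the vowel-group syllable counter is a crude heuristic, measuring length in whitespace-separated words is an assumption (the prompt does not name a unit), and the similarity score is taken as an input because no encoder is specified here. None of this is the authors' code.

```python
import re


def count_syllables(word):
    """Crude heuristic (assumption): one syllable per run of vowels, at least one."""
    return max(1, len(re.findall(r"[aeiouy]+", word.lower())))


def flesch_reading_ease(text):
    """Standard Flesch Reading Ease: higher scores mean easier text."""
    words = re.findall(r"[A-Za-z']+", text)
    if not words:
        raise ValueError("text contains no words")
    sentences = max(1, len(re.findall(r"[.!?]+", text)))
    syllables = sum(count_syllables(w) for w in words)
    return 206.835 - 1.015 * (len(words) / sentences) - 84.6 * (syllables / len(words))


def meets_constraints(original, rewrite, sbert_similarity):
    """Check the three numeric constraints quoted in the prompt excerpt."""
    n_original = len(original.split())
    n_rewrite = len(rewrite.split())
    return (
        flesch_reading_ease(rewrite) < 60
        and sbert_similarity > 0.7
        and abs(n_rewrite - n_original) <= 0.10 * n_original
    )


# Invented example pair; the similarity values are made up, not model output.
original = "How far is the moon from the earth?"
rewrite = ("What astronomical separation characterizes "
           "terrestrial lunar orbital configuration?")
accepted = meets_constraints(original, rewrite, sbert_similarity=0.8)
rejected = meets_constraints(original, rewrite, sbert_similarity=0.6)
```

A rewrite can pass all three numeric checks and still shift meaning, so a filter like this complements human equivalence ratings without replacing them.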
