Sifei at SemEval-2026 Task 8: Hybrid Retrieval and Query Rewriting for Multi-Turn RAG

Dmitry Ilvovsky; Sifei Meng

arxiv: 2606.28352 · v1 · pith:CN3ZXSGQnew · submitted 2026-06-05 · 💻 cs.IR · cs.CL

Pith reviewed 2026-06-30 10:45 UTC · model grok-4.3

classification 💻 cs.IR cs.CL

keywords multi-turn retrieval · hybrid retrieval · query rewriting · dense retrieval · sparse retrieval · cross-encoder reranking · RAG pipeline · information retrieval

The pith

A training-free hybrid retrieval pipeline with dense-sparse fusion and query rewriting outperforms baselines on multi-turn RAG queries.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper describes a system for multi-turn retrieval-augmented generation that fuses dense and sparse retrieval, applies controlled query rewriting to track intent shifts, and uses cross-encoder reranking to respect context limits. The approach is designed to manage conversational noise without any task-specific training. On the official test set for Task A it reaches 0.5453 nDCG@5, placing third among 38 teams and beating the strongest baseline (0.4795) by 0.0658. The same retrieved passages are then fed into a lightweight generation step for Task C, yielding a harmonic mean of relevance and faithfulness of 0.5312, which ranks 15th among 29 teams.

Core claim

The authors show that combining dense retrieval, sparse retrieval, controlled query rewriting, and cross-encoder reranking produces a retrieval pipeline that ranks third on the SemEval-2026 Task 8 test set with an nDCG@5 of 0.5453, exceeding the strongest baseline score of 0.4795, while the same passages support a generation stage that scores 0.5312 on the harmonic mean of relevance and faithfulness.

What carries the argument

Hybrid retrieval pipeline that fuses dense retrieval, sparse retrieval, controlled query rewriting, and cross-encoder reranking to track evolving user intent across conversation turns.
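
The abstract does not say how the dense and sparse result lists are fused. One common training-free option is reciprocal rank fusion (RRF). The minimal sketch below assumes RRF with the conventional constant k = 60. The function name, passage ids, and example rankings are hypothetical illustrations, not the authors' implementation.

```python
from collections import defaultdict


def reciprocal_rank_fusion(rankings, k=60, top_n=5):
    """Fuse several ranked lists of passage ids with reciprocal rank fusion.

    Assumption: RRF is used here only as an illustration of dense-sparse
    fusion; the paper's actual fusion rule is not stated in the abstract.
    """
    scores = defaultdict(float)
    for ranking in rankings:
        # Ranks are 1-based; each list contributes 1 / (k + rank) per passage.
        for rank, passage_id in enumerate(ranking, start=1):
            scores[passage_id] += 1.0 / (k + rank)
    # Sort by fused score (descending); break ties by id for determinism.
    fused = sorted(scores.items(), key=lambda item: (-item[1], item[0]))
    return [passage_id for passage_id, _ in fused[:top_n]]


# Hypothetical outputs of a dense and a sparse retriever for one query.
dense_ranking = ["p3", "p1", "p7", "p2"]
sparse_ranking = ["p1", "p9", "p3", "p4"]
fused = reciprocal_rank_fusion([dense_ranking, sparse_ranking])
```

Passages that both retrievers rank highly ("p1", "p3") rise to the top, which is the property a hybrid pipeline relies on. A cross-encoder would then rescore this short fused list.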

If this is right

Dense-plus-sparse fusion plus rewriting reduces the impact of conversational noise on retrieval quality.
Reusing the same retrieved passages for generation preserves both relevance and faithfulness without extra retrieval cost.
All retrieval components being open-source allows direct reuse on other multi-turn datasets.
The pipeline respects strict context limits by limiting the number of passages passed downstream.
Ranking third among 38 teams indicates the method is competitive with other submitted systems on this task distribution.
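
The point about context limits can be sketched as a simple selection step after reranking. The example below is an assumption-laden illustration: the passage cap, the token budget, and the whitespace token count are all hypothetical stand-ins, since the page does not give the paper's actual limits or tokenizer.

```python
def select_passages(scored_passages, max_passages=5, max_tokens=512):
    """Keep the highest-scoring passages that fit a passage cap and token budget.

    scored_passages: list of (reranker_score, text) pairs.
    Assumption: whitespace splitting approximates token counts; a real
    system would use the generator model's tokenizer.
    """
    selected = []
    used_tokens = 0
    for _, text in sorted(scored_passages, key=lambda p: p[0], reverse=True):
        if len(selected) >= max_passages:
            break
        n_tokens = len(text.split())
        if used_tokens + n_tokens > max_tokens:
            # Skip passages that would overflow the budget; try smaller ones.
            continue
        selected.append(text)
        used_tokens += n_tokens
    return selected


# Hypothetical reranked candidates.
candidates = [(0.9, "a b c"), (0.2, "d e"), (0.7, "f g h i"), (0.5, "j")]
kept = select_passages(candidates, max_passages=2, max_tokens=5)
```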

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

The training-free design suggests the same components could be swapped into other conversational search settings with minimal adaptation.
Query rewriting appears to be the main lever for handling intent drift, so isolating its contribution on additional datasets would be a direct next measurement.
Because retrieval is decoupled from generation, the approach could be paired with different language models without retraining the retriever.
The gap over the baseline implies that hybrid methods may close performance differences that single-retriever baselines leave on noisy dialogues.

Load-bearing premise

The chosen mix of dense and sparse search plus rewriting will handle shifts in user intent and conversational noise more effectively than the baseline on the SemEval-2026 Task 8 data.

What would settle it

A new test collection of multi-turn queries on which the same pipeline scores below 0.4795 nDCG@5 would show the performance gain does not hold.
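
For readers who want to run that check, nDCG@5 can be computed as in the sketch below. It assumes linear gain and a log2 rank discount, which is one standard formulation. The official Task 8 scorer may differ (for example, by using exponential gain), and the relevance labels shown are made up.

```python
import math


def dcg_at_k(relevances, k):
    """Discounted cumulative gain with linear gain and log2 discount."""
    return sum(rel / math.log2(i + 2) for i, rel in enumerate(relevances[:k]))


def ndcg_at_k(ranked_relevances, all_relevances, k=5):
    """nDCG@k: DCG of the system ranking divided by the ideal DCG.

    ranked_relevances: relevance labels of returned passages, in rank order.
    all_relevances: labels of every judged relevant passage for the query.
    """
    ideal_dcg = dcg_at_k(sorted(all_relevances, reverse=True), k)
    if ideal_dcg == 0:
        return 0.0
    return dcg_at_k(ranked_relevances, k) / ideal_dcg


# Hypothetical query: two relevant passages, retrieved at ranks 2 and 4.
score = ndcg_at_k([0, 1, 0, 1, 0], [1, 1], k=5)
```

A system-level figure such as 0.5453 would be the mean of this per-query score over the test set, under the same assumptions.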

Figures

Figures reproduced from arXiv: 2606.28352 by Dmitry Ilvovsky, Sifei Meng.

Original abstract

Multi-turn retrieval-augmented generation (RAG) is challenging due to evolving user intent, conversational noise, and strict context limits. We propose a training-free hybrid retrieval pipeline for SemEval-2026 Task 8 that combines dense and sparse retrieval with controlled query rewriting and cross-encoder reranking. On the official test set of Task A, our system achieves 0.5453 nDCG@5, ranking third among 38 teams and outperforming the strongest baseline score of 0.4795. For Task C, we reuse the documents retrieved for Task A and apply a lightweight generation pipeline guided by the official prompt, achieving 0.5312 as the harmonic mean of relevance and faithfulness and ranking 15th among 29 teams. All retrieval components are open-source, while query rewriting and answer generation rely on LLM APIs.
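
As a reading aid for the Task C score, the two-term harmonic mean is written out below. This assumes the standard definition; the symbols R (relevance) and F (faithfulness) are notation introduced here, and the abstract does not report the two component scores separately.

```latex
H = \frac{2\,R\,F}{R + F},
\qquad
\min(R, F) \;\le\; H \;\le\; \frac{R + F}{2}
\quad \text{for } R, F > 0 .
```

Because the harmonic mean never exceeds the arithmetic mean, H = 0.5312 implies that the average of the two component scores is at least 0.5312 and that the weaker component is at most 0.5312.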

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Desk Editor's Note · private letter to a colleague

A plain shared-task report that assembles standard retrieval pieces to beat the baseline on one SemEval subtask but adds no new method or analysis.

The paper reports a training-free pipeline that mixes dense and sparse retrieval, adds query rewriting, and applies cross-encoder reranking for SemEval-2026 Task 8. On Task A it scores 0.5453 nDCG@5 on the official test set, third out of 38 teams and ahead of the strongest baseline at 0.4795. For Task C the same retrieved documents feed a lightweight generation step that reaches 0.5312 harmonic mean, placing 15th out of 29.

The concrete test-set numbers and the fact that retrieval components are open-source are the useful parts. The description is direct about relying on LLM APIs for rewriting and generation, and it avoids any claim of theoretical advance.

Nothing in the approach is new. Hybrid retrieval plus rewriting and reranking are established practices; the paper simply applies them to this task. There are no ablations, no error analysis on conversational noise or intent drift, and no discussion of how the pipeline might behave outside this data distribution. The performance delta is presented as an empirical outcome rather than evidence that the combination solves the stated challenges of evolving intent or context limits.

The work is mainly relevant to other teams entering the same shared task or to engineers who need a quick recipe for similar multi-turn retrieval settings. It does not supply generalizable findings or reproducible insights that would interest a wider audience.

I would not bring this to a reading group and would not cite it. It does not merit sending out for serious peer review; the shared-task proceedings are the right home if any.

Referee Report

0 major / 2 minor

Summary. The manuscript reports Sifei's system for SemEval-2026 Task 8 on multi-turn RAG. It describes a training-free hybrid pipeline that combines dense and sparse retrieval, controlled query rewriting, and cross-encoder reranking. On the official Task A test set the system obtains 0.5453 nDCG@5 (3rd of 38 teams), exceeding the strongest baseline of 0.4795. For Task C the same retrieved documents are fed to a lightweight LLM generation stage, yielding a harmonic mean of relevance and faithfulness of 0.5312 (15th of 29 teams). Retrieval components are stated to be open-source.

Significance. If the reported scores are reproducible, the work supplies a concrete, competitive empirical result on a shared-task benchmark for a hybrid retrieval-plus-rewriting pipeline in conversational retrieval. Because no ablations are reported, the result does not separate the contribution of each component. The use of open-source retrieval modules is a modest but useful aid to reuse. Because the paper is a system description rather than a methodological advance, its longer-term significance rests on whether the combination generalizes beyond the 2026 Task 8 data distribution.

minor comments (2)

The abstract and introduction should explicitly state the exact dense and sparse retrievers (model names, indexes) and the query-rewriting prompt template so that the 0.5453 nDCG@5 figure can be reproduced from the open-source retrieval code alone.
The section describing Task C should clarify whether any additional retrieval or rewriting occurs or whether the documents are used exactly as returned for Task A; the current wording leaves this ambiguous.

Simulated Author's Rebuttal

0 responses · 0 unresolved

We thank the referee for the positive evaluation and the recommendation to accept. The report contains no major comments requiring point-by-point response.

Circularity Check

0 steps flagged

No significant circularity

The manuscript is a purely empirical shared-task report describing a training-free hybrid retrieval pipeline and its measured nDCG@5 and harmonic-mean scores on the official SemEval-2026 test sets. No equations, derivations, fitted parameters, or predictions are presented; performance figures are direct experimental outcomes rather than reductions of any internal model. No self-citations are invoked to justify uniqueness or load-bearing premises. The work is therefore self-contained against external benchmarks with no circular steps.

Axiom & Free-Parameter Ledger

0 free parameters · 0 axioms · 0 invented entities

The paper is a system description for a shared task and introduces no free parameters, axioms, or invented entities.

pith-pipeline@v0.9.1-grok · 5683 in / 1050 out tokens · 38261 ms · 2026-06-30T10:45:52.919674+00:00 · methodology

Reference graph

Works this paper leans on

15 extracted references · 2 canonical work pages

[1] Optimizing RAG with Hybrid Search and Contextual Chunking. Journal of Engineering and Applied Sciences Technology.

[2] Multi-Reranker: Maximizing performance of retrieval-augmented generation in the FinanceRAG challenge. 2024.

[3] Zhiyu Chen, Biancen Xie, Sidarth Srinivasan, Qun Liu, Manikandarajan Ramanathan, Raj Maragoud. 2025.

[4] Yannis Katsis, Sara Rosenthal, Kshitij Fadnis, Chulaka Gunasekara, Young-Suk Lee, Lucian Popa, Vraj Shah, Huaiyu Zhu, Danish Contractor, Marina Danilevsky. Transactions of the Association for Computational Linguistics, 2025. doi:10.1162/TACL.a.19

[5] Jasper-Token-Compression-600M Technical Report. 2025.

[6] HyPA-RAG: A Hybrid Parameter Adaptive Retrieval-Augmented Generation System for AI Legal and Policy Applications. 2025.

[7] Searching for Best Practices in Retrieval-Augmented Generation. 2024.

[8] Hybrid AI for Responsive Multi-Turn Online Conversations with Novel Dynamic Routing and Feedback Adaptation. 2025.

[9] Meta-Chunking: Learning Text Segmentation and Semantic Completion via Logical Perception. 2025.

[10] Reconstructing Context: Evaluating Advanced Chunking Strategies for Retrieval-Augmented Generation. 2025.

[11] Kenneth Enevoldsen, Isaac Chung, Imene Kerboua, Márton Kardos, Ashwin Mathur, David Stap, Jay Gala, Wissam Siblini, Dominik Krzemiński, Genta Indra Winata, Saba Sturua, Saiteja Utpala, Mathieu Ciancone, Marion Schaeffer, Gabriel Sequeira, Diganta Misra, Shreeya Dhakal, Jonathan Rystrøm, Roman Solomatin... arXiv preprint arXiv:2502.13595 (2025). https://doi.org/10.48550/arXiv.2502.13595

[12] SPLADE-v3: New baselines for SPLADE. 2024.

[13] M3-Embedding: Multi-Linguality, Multi-Functionality, Multi-Granularity Text Embeddings Through Self-Knowledge Distillation. 2025.

[14] MTRAG-UN: A Benchmark for Open Challenges in Multi-Turn RAG Conversations. 2026.

[15] SemEval-2026 Task 8: MTRAGEval: Evaluating Multi-Turn RAG Conversations. Proceedings of the 20th International Workshop on Semantic Evaluation (SemEval-2026). 2026.
