arxiv: 2605.00631 · v1 · submitted 2026-05-01 · 💻 cs.CL · cs.IR

H-RAG at SemEval-2026 Task 8: Hierarchical Parent-Child Retrieval for Multi-Turn RAG Conversations

Passant Elchafei, Hossam Emam, Mohamed Alansary, Monorama Swain, Markus Schedl

Pith reviewed 2026-05-09 19:19 UTC · model grok-4.3

classification 💻 cs.CL cs.IR

keywords: hierarchical RAG, parent-child retrieval, multi-turn conversations, retrieval-augmented generation, context aggregation, SemEval task

The pith

A hierarchical parent-child retrieval pipeline separates fine-grained chunk search from full-document context aggregation to handle multi-turn RAG conversations.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper presents H-RAG as a submission to SemEval-2026 Task 8 on multi-turn retrieval-augmented generation. It splits documents into overlapping sentence-level child chunks for precise retrieval via hybrid dense-sparse search and rescoring, while keeping full documents as parent units for coherent context during answer generation. The same pipeline supports both standalone retrieval evaluation and end-to-end generation that must stay faithful to the retrieved evidence. Reported scores are nDCG@5 of 0.4271 on Task A and a harmonic mean of 0.3241 on Task C. A sympathetic reader would care because conversational RAG must maintain evidence precision across turns without losing document-level coherence.

Core claim

H-RAG implements a hierarchical parent-child RAG pipeline that decouples child-level retrieval from parent-level context reconstruction. Documents are segmented into overlapping sentence-based child chunks while full documents remain as parent units. Retrieval uses hybrid dense-sparse search with tunable weighting and embedding-based rescoring over the child chunks. Retrieved evidence is aggregated at the parent level and passed to an instruction-tuned language model for response generation. This yields an nDCG@5 score of 0.4271 on Task A and a harmonic mean score of 0.3241 on Task C, underscoring the role of retrieval configuration and parent-level aggregation.
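As a rough illustration of the child-chunking step, the sketch below splits a document into overlapping sentence windows, each of which remembers its parent document. The window size, the overlap, the regex sentence splitter, and the names `ChildChunk` and `make_child_chunks` are assumptions made for illustration; the paper's actual settings are not given in this summary.

```python
import re
from dataclasses import dataclass


@dataclass
class ChildChunk:
    parent_id: str  # id of the full document (parent unit) this chunk came from
    index: int      # position of the chunk within its parent
    text: str


def split_sentences(text: str) -> list[str]:
    # Naive splitter (assumption): break after ., ! or ? followed by whitespace.
    return [s.strip() for s in re.split(r"(?<=[.!?])\s+", text) if s.strip()]


def make_child_chunks(parent_id: str, text: str,
                      window: int = 3, overlap: int = 1) -> list[ChildChunk]:
    """Slide a window of `window` sentences, sharing `overlap` sentences
    between consecutive chunks."""
    if not 0 <= overlap < window:
        raise ValueError("overlap must satisfy 0 <= overlap < window")
    sentences = split_sentences(text)
    step = window - overlap
    chunks: list[ChildChunk] = []
    for start in range(0, len(sentences), step):
        piece = sentences[start:start + window]
        chunks.append(ChildChunk(parent_id, len(chunks), " ".join(piece)))
        if start + window >= len(sentences):
            break  # the last window already reached the end of the document
    return chunks


doc = "A one. B two. C three. D four. E five."
for chunk in make_child_chunks("doc-1", doc):
    print(chunk.index, chunk.text)
```

Keeping `parent_id` on every child is what later allows retrieved chunks to be mapped back to their full documents.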

What carries the argument

The parent-child hierarchical RAG pipeline, which performs retrieval over fine-grained child chunks and reconstructs coherent context at the full-document parent level before generation.
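One way to picture the parent-level reconstruction is sketched below: child hits are grouped by parent document, each parent is scored by its best-scoring child, and the full parent text is returned as generation context. Max-pooling of child scores, the `top_k` cutoff, and the function name `aggregate_to_parents` are assumptions; the summary does not say how child scores are combined.

```python
def aggregate_to_parents(child_hits: list[tuple[str, float]],
                         parents: dict[str, str],
                         top_k: int = 2) -> list[tuple[str, str]]:
    """Map scored child hits (parent_id, score) to full parent documents.

    Each parent keeps the score of its best child (an assumed pooling rule).
    Returns up to `top_k` (parent_id, full_text) pairs, best first.
    """
    best: dict[str, float] = {}
    for parent_id, score in child_hits:
        if parent_id not in best or score > best[parent_id]:
            best[parent_id] = score
    ranked = sorted(best.items(), key=lambda kv: kv[1], reverse=True)
    return [(pid, parents[pid]) for pid, _ in ranked[:top_k]]


# Hypothetical data: two child hits come from the same parent "d1".
hits = [("d1", 0.4), ("d2", 0.9), ("d1", 0.7), ("d3", 0.1)]
docs = {"d1": "full text of d1", "d2": "full text of d2", "d3": "full text of d3"}
context = aggregate_to_parents(hits, docs)
print([pid for pid, _ in context])
```

Deduplicating at the parent level means a document that matched through several chunks is passed to the generator once, as a whole.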

If this is right

Child-level retrieval with hybrid search and rescoring supplies more precise evidence for the subsequent parent aggregation step.
Parent-level aggregation supplies the language model with coherent full-document context that supports faithful multi-turn generation.
Tunable weighting between dense and sparse components plus embedding rescoring directly affects both retrieval quality and downstream generation scores.
The reported nDCG@5 of 0.4271 and harmonic mean of 0.3241 demonstrate that the configuration choices matter for performance on both subtasks.
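The tunable dense-sparse weighting mentioned in this list can be sketched as a convex combination of normalized scores. This is a minimal sketch under stated assumptions: min-max normalization, a single weight `alpha`, and the names `min_max` and `hybrid_scores` are illustrative choices, not details confirmed by the paper.

```python
def min_max(scores: dict[str, float]) -> dict[str, float]:
    """Rescale scores to [0, 1] so dense and sparse scores are comparable."""
    if not scores:
        return {}
    lo, hi = min(scores.values()), max(scores.values())
    if hi == lo:
        return {k: 1.0 for k in scores}
    return {k: (v - lo) / (hi - lo) for k, v in scores.items()}


def hybrid_scores(dense: dict[str, float], sparse: dict[str, float],
                  alpha: float = 0.5) -> dict[str, float]:
    """Fuse per-chunk scores: alpha * dense + (1 - alpha) * sparse.

    A chunk missing from one retriever contributes 0 for that component.
    """
    if not 0.0 <= alpha <= 1.0:
        raise ValueError("alpha must lie in [0, 1]")
    d, s = min_max(dense), min_max(sparse)
    return {cid: alpha * d.get(cid, 0.0) + (1.0 - alpha) * s.get(cid, 0.0)
            for cid in set(d) | set(s)}


# Hypothetical scores: cosine similarities (dense) and BM25-like scores (sparse).
dense = {"c1": 0.9, "c2": 0.5, "c3": 0.1}
sparse = {"c2": 12.0, "c3": 4.0}
fused = hybrid_scores(dense, sparse, alpha=0.5)
print(sorted(fused, key=fused.get, reverse=True))
```

Setting `alpha` to 1.0 reduces this to dense-only ranking and 0.0 to sparse-only ranking, which is why the weight can shift both retrieval and downstream generation scores.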

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

The design implies that document coherence may be lost when all retrieval and generation operate on the same flat chunks across conversation turns.
The parent-child split could be tested on other multi-turn dialogue datasets to check whether the performance pattern holds outside this benchmark.
The approach may extend naturally to domains such as technical support or medical consultation where evidence must remain traceable to original sources over multiple exchanges.

Load-bearing premise

That separating child-level retrieval from parent-level context reconstruction and aggregation will produce better results in multi-turn conversational RAG than non-hierarchical approaches.

What would settle it

A side-by-side run of a flat non-hierarchical retrieval baseline on the identical SemEval-2026 Task 8 test sets that reaches equal or higher nDCG@5 and harmonic-mean scores would falsify the claimed benefit of the hierarchy.
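Such a comparison hinges on computing nDCG@5 identically for both systems. The sketch below uses the common linear-gain formulation, DCG@k = sum over ranks i of rel_i / log2(i + 1), normalized by the DCG of an ideal ranking. The gain variant and the function names are assumptions; the official task scorer may differ in detail.

```python
import math


def dcg_at_k(relevances: list[float], k: int) -> float:
    """Discounted cumulative gain over the first k positions (linear gain)."""
    return sum(rel / math.log2(rank + 1)
               for rank, rel in enumerate(relevances[:k], start=1))


def ndcg_at_k(ranked_ids: list[str], qrels: dict[str, float], k: int = 5) -> float:
    """nDCG@k for one query.

    ranked_ids: system ranking, best first.
    qrels: relevance judgments; unjudged documents count as 0.
    """
    gains = [qrels.get(doc_id, 0.0) for doc_id in ranked_ids]
    ideal = sorted(qrels.values(), reverse=True)
    idcg = dcg_at_k(ideal, k)
    return dcg_at_k(gains, k) / idcg if idcg > 0 else 0.0


# Hypothetical judgments: two relevant passages for one query.
qrels = {"a": 1.0, "b": 1.0}
print(ndcg_at_k(["a", "b", "c"], qrels))  # perfect ranking gives 1.0
print(ndcg_at_k(["c", "a", "b"], qrels))  # relevant passages pushed down
```

Averaging `ndcg_at_k` over all test queries for the hierarchical and the flat system would give the two numbers to compare.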

Figures

Figures reproduced from arXiv: 2605.00631 by Hossam Emam, Markus Schedl, Mohamed Alansary, Monorama Swain, Passant Elchafei.

**Figure 1.** Overview of the proposed H-RAG pipeline for Task C, illustrating hierarchical parent–child document

Original abstract

We present H-RAG, our submission to SemEval-2026 Task 8 (MTRAGEval), addressing both Task A (Retrieval) and Task C (Generation with Retrieved Passages). Task A evaluates standalone retrieval quality, while Task C assesses end-to-end retrieval-augmented generation (RAG) in multi-turn conversational settings, requiring both accurate answer generation and faithful grounding in retrieved evidence. Our approach implements a hierarchical parent-child RAG pipeline that separates fine-grained child-level retrieval from parent-level context reconstruction during generation. Documents are segmented into overlapping sentence-based child chunks, while full documents are preserved as parent units to provide coherent context. Retrieval combines hybrid dense-sparse search, tunable weighting, and embedding-based similarity rescoring over child chunks. Retrieved evidence is aggregated at the parent level and supplied to an instruction-tuned language model for response generation. H-RAG achieves an nDCG@5 score of 0.4271 on Task A and a harmonic mean score of 0.3241 on Task C (RB_agg: 0.2488, RL_F: 0.2703, RB_llm: 0.6508), underscoring the importance of retrieval configuration and parent-level aggregation in multi-turn RAG performance.
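The Task C figure can be checked against its components. Assuming the reported score is the unweighted harmonic mean of the three listed sub-metrics (the abstract does not spell out the formula), the arithmetic reproduces 0.3241:

```python
from statistics import harmonic_mean

# Sub-metric values as reported in the abstract.
scores = {"RB_agg": 0.2488, "RL_F": 0.2703, "RB_llm": 0.6508}

# Harmonic mean of n values: n / sum(1 / x_i).
task_c = harmonic_mean(list(scores.values()))
print(round(task_c, 4))  # 0.3241
```

Because the harmonic mean is pulled toward the smallest value, the overall score sits much closer to RB_agg and RL_F than to the higher RB_llm.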

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Desk Editor's Note: private letter to a colleague

H-RAG is a clear but incremental shared-task system paper that applies parent-child hierarchy to multi-turn RAG without the comparisons needed to show it matters.

This paper describes H-RAG, a submission to SemEval-2026 Task 8 on multi-turn RAG. It splits documents into overlapping child chunks for hybrid dense-sparse retrieval, then aggregates to full parent documents for the generator. The system reports 0.4271 nDCG@5 on Task A and 0.3241 harmonic mean on Task C. That is the core of it: a practical pipeline for conversational retrieval that keeps fine matching separate from context coherence.

The description of chunking, tunable weighting, rescoring, and parent aggregation is straightforward and replicable enough for someone who needs to build something similar quickly. It avoids overclaiming new theory and sticks to the engineering choices that fit the task constraints. The numbers are concrete and tied to the public evaluation, which is useful for tracking progress in this specific setting.

The main limitation is the missing evidence for the interpretive claim. The abstract says the results underscore the importance of retrieval configuration and parent-level aggregation, but there are no flat-retrieval baselines, no ablation that removes the hierarchy, and no error analysis or significance tests. Without those, the scores show what one system achieved, not why the parent-child split helped over simpler alternatives. This is typical for competition reports, but it caps how much others can learn from the work.

The paper is mainly for practitioners tuning RAG pipelines or teams entering similar shared tasks who want a working example to adapt. It will not shift how we think about multi-turn retrieval in general. It shows honest engagement with the task setup and literature on hybrid retrieval, so it deserves a serious referee for a workshop or shared-task proceedings rather than a top-tier venue. I would send it to review with the expectation that reviewers ask for the missing comparisons.

Referee Report

1 major / 2 minor

Summary. The paper describes H-RAG, a hierarchical parent-child RAG pipeline submitted to SemEval-2026 Task 8 (MTRAGEval). Documents are split into overlapping sentence-based child chunks for hybrid dense-sparse retrieval with rescoring, while full documents serve as parent units for context aggregation during generation by an instruction-tuned LLM. The system reports an nDCG@5 of 0.4271 on Task A (standalone retrieval) and a harmonic mean of 0.3241 on Task C (end-to-end generation), with component scores RB_agg=0.2488, RL_F=0.2703, and RB_llm=0.6508, claiming these results highlight the value of retrieval configuration and parent-level aggregation in multi-turn conversational RAG.

Significance. If the hierarchical separation of child retrieval from parent aggregation can be shown to drive measurable gains, the work would offer a practical design pattern for improving context coherence in multi-turn RAG systems. The concrete task metrics provide a useful reference point for the shared task, but the absence of comparative baselines means the significance remains that of a well-specified system report rather than a validated methodological advance.

major comments (1)

[Abstract] The claim that the reported scores 'underscore the importance of retrieval configuration and parent-level aggregation' is not supported by the presented evidence. Only aggregate H-RAG results are given; no flat (non-hierarchical) retrieval baseline, ablation that disables parent-level aggregation, or statistical comparison to otherwise identical pipelines is reported. Without these, the interpretive conclusion that the hierarchical design is responsible for the observed performance cannot be evaluated.

minor comments (2)

[Methods] Additional implementation details on the exact embedding models, the tunable weighting scheme for hybrid search, the precise parent-aggregation procedure (e.g., how child scores are combined into parent context), and hyperparameter selection would strengthen reproducibility.
[Results] The sub-metrics for Task C (RB_agg, RL_F, RB_llm) are listed without definitions or references to the official task evaluation protocol; a brief explanation or citation would aid readers unfamiliar with MTRAGEval.

Simulated Author's Rebuttal

1 response · 0 unresolved

We thank the referee for their review of our system description paper for SemEval-2026 Task 8. We address the major comment on the abstract below and agree that a revision is warranted.

Referee: [Abstract] The claim that the reported scores 'underscore the importance of retrieval configuration and parent-level aggregation' is not supported by the presented evidence. Only aggregate H-RAG results are given; no flat (non-hierarchical) retrieval baseline, ablation that disables parent-level aggregation, or statistical comparison to otherwise identical pipelines is reported. Without these, the interpretive conclusion that the hierarchical design is responsible for the observed performance cannot be evaluated.

Authors: We agree with the referee that the abstract's interpretive claim lacks supporting evidence from the manuscript. The paper is a system report for a shared task and presents only the aggregate performance of the full H-RAG pipeline on the official metrics, without ablations or non-hierarchical baselines. We will therefore revise the abstract to remove the unsubstantiated conclusion and instead describe the achieved scores and pipeline design in factual terms only.

Revision: yes

Circularity Check

0 steps flagged

No circularity: purely empirical system description on external shared task.

The paper presents a hierarchical parent-child RAG pipeline for SemEval-2026 Task 8, describing document segmentation, hybrid retrieval, parent-level aggregation, and LLM generation, then reports externally evaluated scores (nDCG@5 = 0.4271 on Task A; harmonic mean 0.3241 on Task C). No equations, derivations, fitted parameters, predictions, or self-citations appear in the provided text. All results derive from the shared-task benchmark rather than internal construction, satisfying the self-contained criterion with no load-bearing reductions to inputs.

Axiom & Free-Parameter Ledger

0 free parameters · 0 axioms · 0 invented entities

No free parameters, axioms, or invented entities are introduced in the abstract. The work relies on standard RAG components (hybrid retrieval, chunking, LM generation) without new postulates.
