MASS-RAG: Multi-Agent Synthesis Retrieval-Augmented Generation

Heyan Huang; Jincheng Xie; Runheng Liu; Xingchen Xiao

arxiv: 2604.18509 · v2 · submitted 2026-04-20 · 💻 cs.CL

Xingchen Xiao, Heyan Huang, Runheng Liu, Jincheng Xie

Pith reviewed 2026-05-10 04:05 UTC · model grok-4.3

classification 💻 cs.CL

keywords: multi-agent synthesis · retrieval-augmented generation · RAG · evidence processing · large language models · agent specialization · synthesis stage

The pith

Multi-agent synthesis improves RAG by reconciling evidence across specialized agents.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper proposes MASS-RAG to handle cases where retrieved contexts are noisy, incomplete, or have evidence spread out. It breaks evidence processing into separate agents for summarization, extraction, and reasoning, then uses a synthesis stage to combine their outputs before generating the final answer. This setup creates multiple intermediate views of the evidence so the model can compare and integrate complementary details. Experiments across four benchmarks show gains over standard RAG baselines, especially when relevant information is distributed. If the approach holds, it suggests a practical way to make retrieval-augmented generation more reliable without changing the underlying language model.

Core claim

MASS-RAG structures evidence processing into multiple role-specialized agents for evidence summarization, evidence extraction, and reasoning over retrieved documents, then combines their outputs through a dedicated synthesis stage to produce the final answer.

What carries the argument

The multi-agent synthesis framework with distinct agents for summarization, extraction, and reasoning plus a final synthesis stage that exposes and reconciles intermediate evidence views.
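The framework described above can be sketched in a few lines. This is a minimal illustration under stated assumptions, not the authors' implementation: the `LLM` callable, the `ROLE_PROMPTS` wording, and every function name are hypothetical, and the paper's actual prompt templates are not reproduced on this page.

```python
from dataclasses import dataclass
from typing import Callable, Dict, List

# Hypothetical type: any function mapping a prompt string to a completion string.
LLM = Callable[[str], str]

# Illustrative role instructions only; these are assumptions, not the paper's prompts.
ROLE_PROMPTS: Dict[str, str] = {
    "summarizer": "Summarize the evidence relevant to the question.",
    "extractor": "Quote the passages that directly bear on the question.",
    "reasoner": "Reason step by step over the documents toward an answer.",
}


@dataclass
class AgentView:
    """One intermediate view of the evidence, tagged with the role that produced it."""
    role: str
    output: str


def format_docs(docs: List[str]) -> str:
    """Number the retrieved documents so agents can refer to them."""
    return "\n".join(f"[{i + 1}] {d}" for i, d in enumerate(docs))


def run_agents(llm: LLM, question: str, docs: List[str]) -> List[AgentView]:
    """Call the same base model once per role over the same retrieved documents."""
    context = format_docs(docs)
    views = []
    for role, instruction in ROLE_PROMPTS.items():
        prompt = f"{instruction}\nQuestion: {question}\nDocuments:\n{context}"
        views.append(AgentView(role, llm(prompt)))
    return views


def synthesize(llm: LLM, question: str, views: List[AgentView]) -> str:
    """Final call that sees every intermediate view and produces the answer."""
    joined = "\n".join(f"{v.role}: {v.output}" for v in views)
    prompt = (
        "Compare the intermediate views below, resolve disagreements, "
        f"and answer the question.\nQuestion: {question}\n{joined}"
    )
    return llm(prompt)


def mass_rag_answer(llm: LLM, question: str, docs: List[str]) -> str:
    """Three role-specific calls followed by one synthesis call."""
    return synthesize(llm, question, run_agents(llm, question, docs))
```

In this sketch the base model is passed in as a plain callable, so the same pipeline can wrap any model without retraining it; the cost is four model calls per query instead of one.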

If this is right

Performance gains are largest when relevant evidence is distributed across retrieved contexts rather than concentrated in one passage.
The method produces multiple intermediate evidence representations that the synthesis stage can cross-check before answer generation.
It maintains the core retrieval-augmented generation pipeline while adding agent specialization and synthesis without retraining the base model.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

The design could be extended to other multi-step LLM tasks that require integrating conflicting or partial information, such as multi-document summarization.
If synthesis errors prove hard to control, future versions might add verification loops between agents to catch inconsistencies earlier.
The approach suggests a general pattern for decomposing complex reasoning into role-specific agents whose outputs are then merged.

Load-bearing premise

The outputs of the role-specialized agents are sufficiently complementary and the synthesis stage can reliably combine them without introducing new inconsistencies or hallucinations.

What would settle it

Run the same four benchmarks with MASS-RAG and single-agent RAG baselines on a new set of queries where relevant evidence is deliberately split across many contexts; if MASS-RAG shows no improvement or higher error rates, the claim weakens.
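One way to read the result of such a test is to bucket queries by how many retrieved contexts carry needed evidence and compare accuracy per bucket. The sketch below assumes per-query correctness labels and an evidence-spread count are already available; the `QueryResult` record and its field names are hypothetical and do not come from the paper.

```python
from collections import defaultdict
from dataclasses import dataclass
from typing import Dict, Iterable, Tuple


@dataclass
class QueryResult:
    """Hypothetical per-query record for the comparison."""
    n_evidence_contexts: int  # number of retrieved contexts holding needed evidence
    baseline_correct: bool    # single-agent RAG answered correctly
    mass_rag_correct: bool    # MASS-RAG answered correctly


def accuracy_by_spread(
    results: Iterable[QueryResult],
) -> Dict[int, Tuple[float, float, int]]:
    """Map evidence spread to (baseline accuracy, MASS-RAG accuracy, query count)."""
    buckets = defaultdict(list)
    for r in results:
        buckets[r.n_evidence_contexts].append(r)
    table = {}
    for spread, rows in sorted(buckets.items()):
        n = len(rows)
        base = sum(r.baseline_correct for r in rows) / n
        mass = sum(r.mass_rag_correct for r in rows) / n
        table[spread] = (base, mass, n)
    return table
```

If the claim holds, the gap between the two accuracies should widen as the spread grows; a flat or negative gap in the high-spread buckets would count against it.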

Figures

Figures reproduced from arXiv: 2604.18509 by Heyan Huang, Jincheng Xie, Runheng Liu, Xingchen Xiao.

**Figure 2.** The overall architecture of MASS-RAG, illustrating multi-agent filtering and answer synthesis when the […]

**Figure 3.** Evidence Coverage Rate (ECR) of each filter

**Figure 4.** Performance comparison of MASS-RAG with state-of-the-art baselines across four benchmark datasets.

**Figure 5.** MASS-RAG Case Study 1 - Dataset: PopQA, Model: Mistral

**Figure 6.** MASS-RAG Case Study 2 - Dataset: PopQA, Model: Mistral

**Figure 7.** MASS-RAG Case Study 3 - Dataset: PopQA, Model: Mistral

**Figure 8.** MASS-RAG Case Study 4 - Dataset: TriviaQA, Model: Llama3

**Figure 9.** MASS-RAG Case Study 5 - Dataset: TriviaQA, Model: Llama3

**Figure 10.** MASS-RAG Case Study 6 - Dataset: TriviaQA, Model: Llama3

Original abstract

Large language models (LLMs) are widely used in retrieval-augmented generation (RAG) to incorporate external knowledge at inference time. However, when retrieved contexts are noisy, incomplete, or heterogeneous, a single generation process often struggles to reconcile evidence effectively. We propose MASS-RAG, a multi-agent synthesis approach to retrieval-augmented generation that structures evidence processing into multiple role-specialized agents. MASS-RAG applies distinct agents for evidence summarization, evidence extraction, and reasoning over retrieved documents, and combines their outputs through a dedicated synthesis stage to produce the final answer. This design exposes multiple intermediate evidence views, allowing the model to compare and integrate complementary information before answer generation. Experiments on four benchmarks show that MASS-RAG consistently improves performance over strong RAG baselines, particularly in settings where relevant evidence is distributed across retrieved contexts.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Desk Editor's Note: private letter to a colleague

MASS-RAG adds explicit summarization, extraction, and reasoning agents plus a synthesis step to RAG, but the abstract leaves the performance claims and reconciliation reliability unverified.

The one or two things to know about this paper are that it structures RAG around three role-specific agents (summarization, extraction, reasoning) whose outputs feed a dedicated synthesis stage, and that the abstract reports consistent gains over strong baselines on four benchmarks, especially when evidence is distributed across contexts. This is a direct, modular extension of multi-agent ideas already in the cited literature, aimed at noisy or incomplete retrieval.

What the paper does well is lay out the problem of single-pass generation struggling with heterogeneous evidence and propose a concrete way to generate multiple intermediate views for comparison before final answer production. The architecture is straightforward to understand and could be implemented without much overhead in existing LLM pipelines.

The soft spots sit in the evaluation and the synthesis mechanism. The abstract gives no error bars, no ablation results, no statistical tests, and no breakdown of how often the synthesis stage reconciles conflicting agent outputs versus introducing new inconsistencies or dropped evidence. If the synthesis is essentially another LLM call on concatenated agent results, the claimed benefit could shrink or reverse in practice, and nothing in the provided text checks for that failure mode. There is no math or formal derivation here, just an empirical architecture. Citation patterns appear standard for the subfield.

This work is for practitioners and researchers already working on production RAG systems who want a practical pattern for better evidence integration. A reader focused on agentic methods or noisy retrieval would find the stage breakdown useful to test or adapt. It deserves a serious referee because the core idea is concrete and the reported improvements, if supported by proper details in the full manuscript, could matter for applied systems. I would send it to review but ask specifically for ablations on the synthesis stage and analysis of reconciliation errors.

Referee Report

3 major / 2 minor

Summary. The manuscript proposes MASS-RAG, a multi-agent retrieval-augmented generation framework that decomposes evidence processing into three role-specialized agents (summarization, extraction, and reasoning) whose outputs are combined via a dedicated synthesis stage to produce the final answer. The central empirical claim is that this structure yields consistent gains over strong single-pass RAG baselines on four benchmarks, with the largest benefits arising when relevant evidence is distributed across retrieved contexts.

Significance. If the performance gains are robust, the work would provide a practical, modular way to improve evidence integration in RAG systems without requiring new model training. The multi-view intermediate representations are a natural extension of existing agentic RAG ideas, but the absence of detailed ablations and error analysis leaves open whether the synthesis step reliably adds value or merely averages noise.

major comments (3)

[Experiments] Experimental results section: The abstract and reported results assert consistent improvements on four benchmarks without error bars, ablation tables, or statistical significance tests. This makes it impossible to determine whether the observed gains exceed run-to-run variance or are driven by particular hyper-parameter choices.
[Method] Synthesis stage description: The paper states that the synthesis stage combines outputs from the three specialized agents, yet provides no quantitative error analysis (e.g., rate of introduced hallucinations, dropped conflicting evidence, or reconciliation failures) comparing synthesis outputs to single-agent baselines. This directly bears on the central claim that multi-agent synthesis improves integration for distributed evidence.
[Ablation studies] Weakest-assumption validation: No experiment isolates whether the summarization, extraction, and reasoning agents produce sufficiently complementary views; if their outputs are largely redundant, the synthesis stage cannot be credited for the reported gains.
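The first major comment asks for error bars and significance tests. A minimal sketch of what that could involve is shown below, assuming per-seed aggregate scores and paired per-example scores are available for both systems. The function names are hypothetical, and the sign-flip permutation test is one common choice for paired comparisons, not something the paper or the referee specifies.

```python
import random
import statistics
from typing import Sequence, Tuple


def mean_and_std(scores: Sequence[float]) -> Tuple[float, float]:
    """Mean and sample standard deviation over per-seed scores (needs >= 2 seeds)."""
    return statistics.mean(scores), statistics.stdev(scores)


def paired_permutation_test(
    system: Sequence[float],
    baseline: Sequence[float],
    n_resamples: int = 10_000,
    seed: int = 0,
) -> float:
    """Two-sided sign-flip permutation test on paired per-example scores.

    Under the null hypothesis the sign of each paired difference is arbitrary,
    so signs are flipped at random and the resampled totals are compared with
    the observed total difference.
    """
    if len(system) != len(baseline):
        raise ValueError("paired samples must have equal length")
    diffs = [s - b for s, b in zip(system, baseline)]
    observed = abs(sum(diffs))
    rng = random.Random(seed)
    hits = 0
    for _ in range(n_resamples):
        flipped = sum(d if rng.random() < 0.5 else -d for d in diffs)
        if abs(flipped) >= observed:
            hits += 1
    # Add-one smoothing keeps the estimated p-value away from an exact zero.
    return (hits + 1) / (n_resamples + 1)
```

A small p-value here only says the per-example gain is unlikely under random sign assignment; it does not address the hyper-parameter sensitivity the referee also raises.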

minor comments (2)

[Method] Notation for agent roles and synthesis prompt templates should be made explicit in a table or pseudocode block for reproducibility.
[Experiments] The abstract refers to four benchmarks without naming them; a table listing dataset names, statistics, retrieval settings, and exact metrics would improve clarity.

Simulated Author's Rebuttal

3 responses · 0 unresolved

We thank the referee for the constructive feedback. The points raised highlight opportunities to strengthen the empirical rigor and analysis of MASS-RAG. We address each major comment below and will incorporate the necessary additions and clarifications in the revised manuscript.

Referee: [Experiments] Experimental results section: The abstract and reported results assert consistent improvements on four benchmarks without error bars, ablation tables, or statistical significance tests. This makes it impossible to determine whether the observed gains exceed run-to-run variance or are driven by particular hyper-parameter choices.

Authors: We agree that error bars and statistical tests are necessary to establish robustness. In the revision we will rerun all experiments with multiple random seeds, report means with standard deviations as error bars, and include paired statistical significance tests against the baselines. We will also add a hyperparameter sensitivity analysis to show that gains are not artifacts of specific choices.

Revision: yes

Referee: [Method] Synthesis stage description: The paper states that the synthesis stage combines outputs from the three specialized agents, yet provides no quantitative error analysis (e.g., rate of introduced hallucinations, dropped conflicting evidence, or reconciliation failures) comparing synthesis outputs to single-agent baselines. This directly bears on the central claim that multi-agent synthesis improves integration for distributed evidence.

Authors: We acknowledge this gap in the current analysis. The revised manuscript will include a new quantitative error analysis subsection. On a sampled subset of instances we will manually annotate and compare hallucination introduction, dropped conflicting evidence, and reconciliation success rates between the synthesis stage and single-agent baselines, directly supporting the claim for distributed-evidence settings.

Revision: yes

Referee: [Ablation studies] Weakest-assumption validation: No experiment isolates whether the summarization, extraction, and reasoning agents produce sufficiently complementary views; if their outputs are largely redundant, the synthesis stage cannot be credited for the reported gains.

Authors: We agree that complementarity must be demonstrated. The revision will add an ablation study that quantifies output diversity (via embedding cosine similarity and lexical overlap) across the three agents and compares end-to-end performance of the full system against reduced single- or dual-agent variants. This will show that the agents provide distinct views and that synthesis is responsible for the observed gains.

Revision: yes
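The diversity measures named in the last response can be illustrated with a small sketch. It is an assumption-laden stand-in: cosine similarity is computed over bag-of-words counts, not over learned embeddings as the authors propose, and the tokenizer and function names are hypothetical.

```python
import math
import re
from collections import Counter
from itertools import combinations
from typing import Dict, List, Tuple


def tokens(text: str) -> List[str]:
    """Lowercase alphanumeric tokens; a deliberately simple tokenizer."""
    return re.findall(r"[a-z0-9]+", text.lower())


def jaccard(a: str, b: str) -> float:
    """Lexical overlap: shared vocabulary divided by combined vocabulary."""
    sa, sb = set(tokens(a)), set(tokens(b))
    if not sa and not sb:
        return 1.0
    return len(sa & sb) / len(sa | sb)


def cosine(a: str, b: str) -> float:
    """Cosine similarity over term counts (stand-in for embedding similarity)."""
    ca, cb = Counter(tokens(a)), Counter(tokens(b))
    dot = sum(ca[t] * cb[t] for t in ca)
    norm = math.sqrt(sum(v * v for v in ca.values())) * math.sqrt(
        sum(v * v for v in cb.values())
    )
    return dot / norm if norm else 0.0


def pairwise_overlap(
    outputs: Dict[str, str],
) -> Dict[Tuple[str, str], Tuple[float, float]]:
    """(Jaccard, cosine) for every pair of agent roles in `outputs`."""
    return {
        (r1, r2): (jaccard(outputs[r1], outputs[r2]), cosine(outputs[r1], outputs[r2]))
        for r1, r2 in combinations(outputs, 2)
    }
```

Scores near 1.0 for every pair would suggest the agents are largely redundant, which is the scenario the referee's third major comment warns about; low scores alone do not show the views are useful, only that they differ.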

Circularity Check

0 steps flagged

No circularity: empirical method proposal with benchmark experiments

The paper describes an empirical architecture for multi-agent RAG (summarization, extraction, reasoning agents plus synthesis) and reports performance gains on four benchmarks. No equations, fitted parameters, or first-principles derivations appear in the abstract or method outline. The central claim rests on experimental comparison rather than any self-referential reduction of a prediction to its own inputs or to a self-citation chain. The synthesis stage is presented as an additional LLM prompt without any claim that its outputs are mathematically forced by the agent inputs. This is a standard empirical contribution whose validity can be checked externally via replication on the stated benchmarks.

Axiom & Free-Parameter Ledger

0 free parameters · 1 axiom · 0 invented entities

The approach rests on the domain assumption that LLMs can be reliably prompted to perform narrow, non-overlapping roles and that their intermediate outputs can be combined without loss of information.

axioms (1)

Domain assumption: LLMs prompted with role-specific instructions produce complementary rather than redundant or conflicting intermediate outputs.
Invoked implicitly by the design of distinct agents for summarization, extraction, and reasoning.


Reference graph

Works this paper leans on

45 extracted references · 45 canonical work pages

[1] Language Models are Few-Shot Learners. 2020.
[2] Emergent Abilities of Large Language Models. 2022.
[3] In-Context Retrieval-Augmented Language Models. 2023.
[4] RA-DIT: Retrieval-Augmented Dual Instruction Tuning. 2023.
[5] REPLUG: Retrieval-Augmented Black-Box Language Models. 2023.
[6] Locating and Editing Factual Associations in GPT. 2023.
[7] Reliable, Adaptable, and Attributable Language Models with Retrieval. 2024.
[8] A Survey on Hallucination in Large Language Models: Principles, Taxonomy, Challenges, and Open Questions. 2023.
[9] Improving language models by retrieving from trillions of tokens. 2022.
[10] Atlas: Few-shot Learning with Retrieval Augmented Language Models. 2022.
[11] InstructRetro: Instruction Tuning post Retrieval-Augmented Pretraining. 2023.
[12] Retrieval-Augmented Generation for Large Language Models: A Survey. 2024.
[13] RAG vs Fine-tuning: Pipelines, Tradeoffs, and a Case Study on Agriculture. 2024.
[14] Lost in the Middle: How Language Models Use Long Contexts. 2023.
[15] Improving Zero-shot Reader by Reducing Distractions from Irrelevant Documents in Open-Domain Question Answering. 2023.
[16] ActiveRAG: Revealing the Treasures of Knowledge via Active Learning. 2024.
[17] Self-RAG: Learning to Retrieve, Generate, and Critique through Self-Reflection. 2023.
[18] Re2G: Retrieve, Rerank, Generate. 2022.
[19] DSLR: Document Refinement with Sentence-Level Re-ranking and Reconstruction to Enhance Retrieval-Augmented Generation. 2024.
[20] Query Rewriting for Retrieval-Augmented Large Language Models. 2023.
[21] RECOMP: Improving Retrieval-Augmented LMs with Compression and Selective Augmentation. 2023.
[22] RQ-RAG: Learning to Refine Queries for Retrieval Augmented Generation. 2024.
[23] PRCA: Fitting Black-Box Large Language Models for Retrieval Question Answering via Pluggable Reward-Driven Contextual Adapter. 2023.
[24] Hallucination is Inevitable: An Innate Limitation of Large Language Models. 2024.
[25] BERT: Pre-training of Deep Bidirectional Transformers for Language Understanding. 2019.
[26] Llama 2: Open Foundation and Fine-Tuned Chat Models. 2023.
[27] Learning to Filter Context for Retrieval-Augmented Generation. 2023.
[28] GPT-4 Technical Report. 2024.
[29] Improving Retrieval-Augmented Generation through Multi-Agent Reinforcement Learning. 2025.
[30] MA-RAG: Multi-Agent Retrieval-Augmented Generation via Collaborative Chain-of-Thought Reasoning. 2025.
[31] Benchmarking Large Language Models in Retrieval-Augmented Generation. 2023.
[32] MAIN-RAG: Multi-Agent Filtering Retrieval-Augmented Generation. 2024.
[33] TriviaQA: A Large Scale Distantly Supervised Challenge Dataset for Reading Comprehension. 2017.
[34] When Not to Trust Language Models: Investigating Effectiveness of Parametric and Non-Parametric Memories. 2023.
[35] Enabling Large Language Models to Generate Text with Citations. 2023.
[36] ASQA: Factoid Questions Meet Long-Form Answers. 2023.
[37] MAUVE: Measuring the Gap Between Neural Text and Human Text using Divergence Frontiers. 2021.
[38] Think you have Solved Question Answering? Try ARC, the AI2 Reasoning Challenge. 2018.
[39] The Llama 3 Herd of Models. 2024.
[40] Mistral 7B. 2023.
[41] Qwen3 Technical Report. 2025.
[42] Are Large Language Models Good Classifiers? A Study on Edit Intent Classification in Scientific Document Revisions. 2024.
[43] Unsupervised Dense Information Retrieval with Contrastive Learning. 2022.
[44] Reading Wikipedia to Answer Open-Domain Questions. 2017.
[45] Retrieval-Augmented Generation for Knowledge-Intensive NLP Tasks. 2021.
