ELI5: Long Form Question Answering
Pith reviewed 2026-05-24 18:21 UTC · model grok-4.3
The pith
An abstractive model trained with a multi-task objective outperforms Seq2Seq, language modeling, and extractive baselines on the new ELI5 long-form QA task.
A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.
Core claim
The ELI5 corpus of 270K Reddit threads, each paired with web documents, defines a long-form question answering task that requires generating elaborate multi-sentence answers; an abstractive model trained with a multi-task objective combining QA and language modeling outperforms conventional Seq2Seq, language modeling, and strong extractive baselines on both automatic and human evaluations of this task.
What carries the argument
The multi-task abstractive model that generates answers by attending to the question and retrieved documents while also optimizing a language modeling objective.
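As a rough illustration of what a joint QA and language modeling objective can look like, the sketch below combines two token-level negative log-likelihoods into one loss. The weighted-sum form and the `lm_weight` parameter are assumptions made for illustration; the summary above does not say how the paper mixes the two objectives.

```python
import math

def nll(token_probs):
    """Mean negative log-likelihood of the reference tokens."""
    return -sum(math.log(p) for p in token_probs) / len(token_probs)

def multitask_loss(qa_token_probs, lm_token_probs, lm_weight=0.5):
    """Weighted sum of a QA generation loss and a language modeling loss.

    lm_weight is a hypothetical hyperparameter, not taken from the paper.
    """
    return nll(qa_token_probs) + lm_weight * nll(lm_token_probs)

# Toy probabilities a model assigns to each reference token.
qa_probs = [0.5, 0.25]   # answer tokens, conditioned on question and documents
lm_probs = [0.5, 0.5]    # tokens scored under the plain language modeling task
loss = multitask_loss(qa_probs, lm_probs, lm_weight=0.5)
```

Setting `lm_weight=0` recovers single-objective QA training, which is the kind of comparison the ablation requested in the referee report would need.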
If this is right
- Long-form QA benefits from abstractive generation that synthesizes information across documents rather than extraction alone.
- A joint QA and language modeling objective improves answer quality on this task compared with single-objective training.
- Current models remain substantially below human performance, leaving clear headroom for architectural or training advances.
- The ELI5 corpus and accompanying documents can serve as training data for models that produce multi-sentence explanatory answers.
Where Pith is reading between the lines
- If ELI5 threads prove non-representative, the performance gap between abstractive and extractive methods may shrink or reverse on other question distributions.
- The gap to human answers suggests that future systems will need mechanisms for deeper reasoning or knowledge integration beyond what the current multi-task setup supplies.
- The dataset format could be adapted to test whether retrieval quality or document length limits further gains in answer coherence.
Load-bearing premise
Threads from the single Reddit forum ELI5 constitute a representative and sufficiently diverse large-scale corpus for the general long-form question answering task.
What would settle it
Evaluation of the same multi-task abstractive model on a long-form QA dataset drawn from a different domain or user population where it no longer outperforms the Seq2Seq, language modeling, and extractive baselines.
Original abstract
We introduce the first large-scale corpus for long-form question answering, a task requiring elaborate and in-depth answers to open-ended questions. The dataset comprises 270K threads from the Reddit forum "Explain Like I'm Five" (ELI5) where an online community provides answers to questions which are comprehensible by five year olds. Compared to existing datasets, ELI5 comprises diverse questions requiring multi-sentence answers. We provide a large set of web documents to help answer the question. Automatic and human evaluations show that an abstractive model trained with a multi-task objective outperforms conventional Seq2Seq, language modeling, as well as a strong extractive baseline. However, our best model is still far from human performance since raters prefer gold responses in over 86% of cases, leaving ample opportunity for future improvement.
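How much weight a preference rate such as 86% can bear depends on how many comparisons it rests on, which the abstract does not state. The sketch below computes a Wilson score interval for a binomial proportion; the counts used (86 of 100) are hypothetical and chosen only to match the reported rate.

```python
import math

def wilson_interval(successes, n, z=1.96):
    """Wilson score interval for a binomial proportion (z=1.96 is roughly 95%)."""
    p_hat = successes / n
    denom = 1 + z * z / n
    center = (p_hat + z * z / (2 * n)) / denom
    half = z * math.sqrt(p_hat * (1 - p_hat) / n + z * z / (4 * n * n)) / denom
    return center - half, center + half

# Hypothetical counts: 86 of 100 pairwise comparisons favor the gold answer.
low, high = wilson_interval(86, 100)
print(f"gold preference rate 0.86, interval ({low:.3f}, {high:.3f})")
```

With more comparisons the interval narrows, which is one reason the referee report asks for the evaluation sample details.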
Editorial analysis
A structured set of objections, weighed in public.
Referee Report
Summary. The paper introduces the ELI5 dataset comprising 270K Reddit threads for long-form question answering, supplies associated web documents, and reports that an abstractive model trained with a multi-task objective outperforms Seq2Seq, language-modeling, and extractive baselines on automatic metrics and human evaluations on this corpus, while noting that raters still prefer gold answers in over 86% of cases.
Significance. If the results hold, the work provides the first large-scale benchmark specifically targeting elaborate multi-sentence answers, addressing a clear gap relative to existing short-answer QA resources. The multi-task abstractive approach and the released web-document collection constitute concrete, reusable contributions that can support retrieval-augmented generation research. The explicit gap to human performance supplies a clear, falsifiable target for subsequent work.
major comments (2)
- [§5] §5 (Experiments) and the human-evaluation protocol: the claim of outperformance and the 86% gold preference figure are load-bearing for the central result, yet the section provides no information on the number of raters, their selection criteria, inter-rater agreement, or statistical tests comparing model outputs; without these the robustness of the preference ordering cannot be verified.
- [§4] §4 (Models) and §3 (Dataset): the multi-task objective is presented as the source of improvement, but no ablation isolating the contribution of the auxiliary task versus simply training on more data is reported; this directly affects whether the stated superiority of the multi-task abstractive model is attributable to the claimed modeling choice.
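The inter-rater agreement requested in the first major comment is often reported as Fleiss' kappa when several raters label each item. The following is a minimal sketch of that statistic, assuming every item receives the same number of ratings; the table layout and the function name are illustrative choices, not details from the paper.

```python
def fleiss_kappa(counts):
    """Fleiss' kappa for a table counts[item][category] of rating counts.

    Every item must be rated by the same number of raters (at least two).
    Undefined (division by zero) if all ratings fall into a single category.
    """
    n_items = len(counts)
    n_raters = sum(counts[0])
    if any(sum(row) != n_raters for row in counts):
        raise ValueError("each item needs the same number of ratings")
    n_cats = len(counts[0])
    # Share of all ratings that fall in each category.
    p_cat = [sum(row[j] for row in counts) / (n_items * n_raters)
             for j in range(n_cats)]
    # Per-item agreement: fraction of rater pairs that agree on the item.
    p_item = [(sum(c * c for c in row) - n_raters) / (n_raters * (n_raters - 1))
              for row in counts]
    p_bar = sum(p_item) / n_items          # observed agreement
    p_exp = sum(p * p for p in p_cat)      # agreement expected by chance
    return (p_bar - p_exp) / (1 - p_exp)

# Toy table: 4 items, 3 raters, categories = [prefers model, prefers gold].
unanimous = [[3, 0], [0, 3], [3, 0], [0, 3]]
split = [[2, 1], [1, 2], [2, 1], [1, 2]]
kappa_unanimous = fleiss_kappa(unanimous)
kappa_split = fleiss_kappa(split)
```

Values near 1 indicate strong agreement, values near 0 indicate agreement no better than chance, and negative values indicate less agreement than chance would produce.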
minor comments (2)
- [Table 1] Table 1 and §3.2: the train/dev/test split sizes and any filtering criteria applied to the 270K threads should be stated explicitly so that future work can reproduce the exact evaluation setting.
- [§2] §2 (Related Work): the comparison to prior long-form QA resources would benefit from a brief quantitative table (e.g., average answer length, question type diversity) rather than qualitative description alone.
Simulated Author's Rebuttal
We thank the referee for the positive evaluation and the recommendation for minor revision. The comments highlight important points for improving the clarity and verifiability of our results. We address each major comment below and will revise the manuscript accordingly.
Point-by-point responses
- Referee: [§5] §5 (Experiments) and the human-evaluation protocol: the claim of outperformance and the 86% gold preference figure are load-bearing for the central result, yet the section provides no information on the number of raters, their selection criteria, inter-rater agreement, or statistical tests comparing model outputs; without these the robustness of the preference ordering cannot be verified.
  Authors: We agree that the human evaluation details are insufficiently described. In the revised manuscript we will expand §5 to report the number of raters, their selection criteria and qualifications, inter-rater agreement (e.g., Fleiss' kappa), and the statistical tests used to compare model outputs against each other and against gold answers. These additions will allow readers to assess the robustness of the reported preference ordering.
  Revision: yes
- Referee: [§4] §4 (Models) and §3 (Dataset): the multi-task objective is presented as the source of improvement, but no ablation isolating the contribution of the auxiliary task versus simply training on more data is reported; this directly affects whether the stated superiority of the multi-task abstractive model is attributable to the claimed modeling choice.
  Authors: The referee correctly notes the absence of a direct ablation. While the multi-task model is compared against single-task Seq2Seq and LM baselines trained on the same ELI5 data, an explicit control that adds equivalent extra data without the auxiliary objective is not present. In the revision we will add such an ablation experiment to isolate the contribution of the multi-task loss from the effect of additional training signal.
  Revision: yes
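For the pairwise preference comparisons discussed in the first response, one simple candidate for the promised statistical test is an exact two-sided sign test. The sketch below assumes each comparison yields a preference for one of two systems and that ties are discarded beforehand; the function name and the toy counts are hypothetical, and the rebuttal does not say which test the authors would use.

```python
from math import comb

def sign_test_p_value(wins_a, wins_b):
    """Two-sided exact sign test for paired preferences (ties dropped beforehand)."""
    n = wins_a + wins_b
    k = min(wins_a, wins_b)
    # Probability of a split at least this lopsided in one direction under a fair coin.
    tail = sum(comb(n, i) for i in range(k + 1)) / 2 ** n
    # Double for the two-sided test, capped at 1.
    return min(1.0, 2 * tail)

# Hypothetical counts: system A preferred 8 times, system B preferred 2 times.
p_lopsided = sign_test_p_value(8, 2)
p_even = sign_test_p_value(5, 5)
```

A small p-value suggests the observed preference is unlikely under the hypothesis that raters have no preference between the two systems.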
Circularity Check
No circularity detected; empirical results on newly introduced dataset
Full rationale
The paper's central claim is an empirical comparison: an abstractive multi-task model outperforms Seq2Seq, LM, and extractive baselines on the ELI5 corpus (270K Reddit threads). No equations, derivations, or fitted parameters are presented that reduce reported performance metrics to quantities defined inside the paper itself. The dataset definition and evaluation protocol are external to any modeling assumptions that would create self-definition or fitted-input-called-prediction circularity. Self-citations, if present, are not load-bearing for any uniqueness theorem or ansatz that forces the result. The derivation chain is self-contained against external benchmarks (human preference, automatic metrics on held-out threads).
Axiom & Free-Parameter Ledger
axioms (1)
- Domain assumption: Reddit ELI5 threads form a suitable large-scale, diverse corpus for long-form question answering that generalizes beyond the forum.