Reinforced Informativeness Optimization for Long-Form Retrieval-Augmented Generation

HaiFeng Wang; Hua Wu; Jing Liu; Ruiyang Ren; Wayne Xin Zhao; Yucheng Wang; Yuhao Wang

arxiv: 2505.20825 · v2 · submitted 2025-05-27 · 💻 cs.CL

Yuhao Wang, Ruiyang Ren, Yucheng Wang, Wayne Xin Zhao, Jing Liu, Hua Wu, HaiFeng Wang

Pith reviewed 2026-05-19 13:38 UTC · model grok-4.3

keywords reinforcement learning · retrieval-augmented generation · long-form question answering · verifiable rewards · factual recall · informativeness optimization · nugget verification · self-evolution

The pith

RioRAG uses nugget-centric verification to create externally verifiable rewards for stable reinforcement learning in long-form RAG.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper introduces RioRAG to address unstable reward signals in reinforcement learning for long-form question answering that draws on retrieval-augmented generation. It frames informativeness as a measurable objective that can be checked externally through nugget-centric verification combined with cross-source checks. This design supplies denser and more action-discriminative rewards, allowing smaller language models to improve themselves without handcrafted rules or distillation from stronger teachers. Experiments on LongFact and RAGChecker demonstrate gains in factual recall and faithfulness, positioning verifiable reward modeling as a basis for more reliable long-form systems.

Core claim

RioRAG defines informativeness as a measurable and externally verifiable objective for RL. It implements this through nugget-centric verification with cross-source checks to generate dense, action-discriminative rewards that mitigate sparsity and stabilize optimization. The method enables self-evolution of smaller LLMs and avoids handcrafted supervision or strong teacher distillation, yielding higher factual recall and faithfulness on LongFact and RAGChecker benchmarks.
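To make the sparsity point concrete, here is a minimal sketch that contrasts an exact-match reward with a nugget-coverage reward on two candidate answers. The nugget list, the example sentences, the substring matcher, and both function names are illustrative assumptions, not the paper's implementation.

```python
def exact_match_reward(response: str, target: str) -> float:
    """Sparse reward: 1.0 only if the response equals the single target string."""
    return 1.0 if response.strip().lower() == target.strip().lower() else 0.0


def nugget_coverage_reward(response: str, nuggets: list[str]) -> float:
    """Dense reward: fraction of nuggets mentioned in the response.

    Substring matching is a stand-in (assumption) for a real verifier.
    """
    if not nuggets:
        return 0.0
    text = response.lower()
    hits = sum(1 for n in nuggets if n.lower() in text)
    return hits / len(nuggets)


# Hypothetical checklist and candidate answers, made up for illustration.
nuggets = ["1889", "gustave eiffel", "world's fair"]
target = "The Eiffel Tower was completed in 1889 by Gustave Eiffel's firm for the World's Fair."
short = "The Eiffel Tower opened in 1889."
longer = "Built by Gustave Eiffel's company, the tower opened in 1889."

em = [exact_match_reward(r, target) for r in (short, longer)]
cov = [nugget_coverage_reward(r, nuggets) for r in (short, longer)]
```

Both candidates receive the same exact-match reward, so that signal cannot rank them, while the coverage reward separates the answer that mentions more nuggets. This is one reading of what "action-discriminative" could mean in practice.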

What carries the argument

Nugget-centric verification with cross-source checks, which extracts key factual units from responses and validates them across multiple retrieved sources to produce reliable, dense reward signals for the RL policy.
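The cross-source check can be pictured with a small sketch. It assumes nuggets have already been extracted per source and keeps only those attested by at least two distinct sources. The `min_sources` parameter, the normalization rule, and the sample data are hypothetical, and the paper may extract and match nuggets quite differently.

```python
from collections import defaultdict


def normalize(nugget: str) -> str:
    """Lowercase and collapse whitespace so near-identical nuggets compare equal."""
    return " ".join(nugget.lower().split())


def build_checklist(nuggets_by_source: dict[str, list[str]], min_sources: int = 2) -> list[str]:
    """Keep nuggets attested by at least `min_sources` distinct sources (assumed rule)."""
    support: dict[str, set[str]] = defaultdict(set)
    for source_id, nuggets in nuggets_by_source.items():
        for nugget in nuggets:
            support[normalize(nugget)].add(source_id)
    return sorted(n for n, srcs in support.items() if len(srcs) >= min_sources)


# Hypothetical per-source nuggets, made up for illustration.
nuggets_by_source = {
    "page_a": ["Opened in 1889", "designed for the World's Fair"],
    "page_b": ["opened in  1889", "330 metres tall"],
    "page_c": ["330 metres tall", "painted every seven years"],
}
checklist = build_checklist(nuggets_by_source)
```

In this toy run only the two nuggets that appear on more than one page survive, so a single unreliable page cannot add items to the checklist on its own.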

If this is right

Smaller LLMs can self-evolve using the generated rewards without external teacher models.
Reward sparsity is reduced, leading to more stable RL optimization trajectories.
Long-form RAG outputs exhibit improved factual grounding and faithfulness.
Verifiable reward modeling becomes a practical foundation for trustworthy long-form generation systems.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

The nugget approach could transfer to other open-ended factual generation tasks such as multi-document summarization.
External verification of this kind may reduce dependence on large teacher models across a wider range of RL setups for language models.
Combining nugget rewards with additional signals like human preference data could further densify the feedback for complex responses.

Load-bearing premise

Nugget-centric verification with cross-source checks can be performed in an externally verifiable manner that produces dense, action-discriminative rewards without any handcrafted supervision or teacher distillation.

What would settle it

If training with RioRAG rewards on LongFact fails to produce higher factual recall scores than a baseline using standard non-verifiable LLM-as-judge rewards, the benefit of the nugget-centric cross-source mechanism would be falsified.

Figures

Figures reproduced from arXiv: 2505.20825 by HaiFeng Wang, Hua Wu, Jing Liu, Ruiyang Ren, Wayne Xin Zhao, Yucheng Wang, Yuhao Wang.

**Figure 2.** In-depth analysis on scaling law and RL cold-start.

**Figure 3.** Analysis on co-evolution of generation length and reward during RL training.

Original abstract

Long-form question answering (LFQA) requires open-ended long-form responses that synthesize coherent, factually grounded content from multi-source evidence. This makes reinforcement learning (RL) reward design critical. The reward must be verifiable for faithful grounding and stable optimization. However, many standard rewards assume a unique target with an exact-match notion of correctness, which fits short-form QA and math but breaks in LFQA. As a result, current RAG systems still lack verifiable reward mechanisms, yielding unstable feedback signals and suboptimal optimization outcomes. We propose RioRAG, a framework for reinforced verifiable informativeness optimization. First, it defines informativeness as a measurable and externally verifiable objective for RL. Second, RioRAG uses nugget-centric verification with cross-source checks to enable self-evolution of smaller LLMs and to provide denser, action-discriminative rewards that mitigate reward sparsity and stabilize optimization. This formulation avoids handcrafted supervision for the policy model and strong teacher-model distillation, relying instead on externally verifiable feedback. Experiments on LongFact and RAGChecker show that RioRAG achieves higher factual recall and faithfulness, establishing verifiable reward modeling as a foundation for trustworthy long-form RAG. Our codes are available at https://github.com/RUCAIBox/RioRAG.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Desk Editor's Note (private letter to a colleague)

RioRAG proposes nugget-centric cross-source verification as a verifiable RL reward for long-form RAG and shows gains on LongFact and RAGChecker, but the external-verifiability claim rests on high-level description.

The main point is that this paper gives a concrete reward formulation for RL in long-form RAG by breaking answers into nuggets and checking them across sources. It reports higher factual recall and faithfulness than baselines, and it tries to avoid teacher distillation by leaning on external checks instead of internal fitting loops. That combination is the clearest new piece relative to prior RL-for-QA work referenced in the abstract. The code release helps, and the focus on denser, action-discriminative signals directly targets the sparsity problem that standard exact-match rewards create in open-ended settings. Those are the parts that actually move the needle on the problem they set up.

The soft spots are in the missing details. The abstract does not supply the exact equations for nugget extraction, consistency scoring, or how cross-source matches are computed, so it is hard to tell whether the verification stays fully objective or slips into prompt-based judgments that would make the reward less independent than claimed. If the implementation in the methods section uses any LLM prompts or learned scorers treated as ground truth, the circularity risk the stress-test note flags becomes real and undercuts the stability argument. Minor implementation choices like threshold tuning could also introduce hidden supervision.

This paper is for people already working on reward design for RAG or long-form QA who want a practical alternative to teacher models. A reader focused on verifiable feedback loops would get value from the idea and the benchmark results, even if they have to adapt the specifics.

I would send it to peer review. The core problem is well-motivated, the approach is coherent enough to evaluate, and the experiments give referees something concrete to test.

Referee Report

2 major / 2 minor

Summary. The manuscript proposes RioRAG, a reinforcement learning framework for long-form retrieval-augmented generation that optimizes a verifiable informativeness objective. It introduces nugget-centric verification with cross-source checks to produce dense, action-discriminative rewards, enabling self-evolution of smaller LLMs without handcrafted supervision or teacher distillation. Experiments on LongFact and RAGChecker report gains in factual recall and faithfulness over baselines.

Significance. If the central claims hold, the work supplies a concrete mechanism for stable RL optimization in LFQA by replacing sparse or unverifiable rewards with externally grounded signals. This directly targets reward sparsity and instability in long-form RAG and could support more trustworthy generation pipelines. Code release aids reproducibility.

major comments (2)

[§3.2] The nugget-centric verification procedure is presented as relying on objective cross-source checks, yet the manuscript does not supply explicit, reproducible criteria (e.g., exact string overlap, citation matching, or deterministic rules) for nugget extraction, consistency scoring, or matching. If any step uses prompt templates, auxiliary LLM calls, or learned rubrics whose outputs are treated as ground truth, the 'externally verifiable' and 'no handcrafted supervision' properties become circular, undermining the independence of the reward signal and the stability arguments.
[§4.2, Table 2] While higher factual recall and faithfulness are reported, the paper provides no ablation on verification threshold sensitivity or nugget definition variants. Without these, it is impossible to determine whether the observed gains are robust or artifacts of particular hyper-parameter choices, weakening the claim that the method yields stable optimization.

minor comments (2)

[§3.1] The reward formulation in §3.1 would benefit from an explicit equation showing how nugget-level scores aggregate into the final scalar reward used by the RL optimizer.
[Figure 3] Figure 3 caption should clarify the exact baseline configurations (model sizes, retrieval settings) to allow direct comparison with the RioRAG runs.
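
As an illustration of the kind of equation the first minor comment asks for, one possible aggregation is a weighted mean of per-nugget verification scores. This is a sketch only and is not taken from the paper. The symbols are assumptions introduced here: $C = \{c_1, \dots, c_n\}$ is the nugget checklist for a question, $v(y, c_i) \in [0, 1]$ scores how well response $y$ supports nugget $c_i$, and $w_i \ge 0$ are optional importance weights, not all zero.

```latex
R(y) \;=\; \frac{\sum_{i=1}^{n} w_i \, v(y, c_i)}{\sum_{i=1}^{n} w_i},
\qquad 0 \le R(y) \le 1 .
```

With uniform weights this reduces to the mean per-nugget score, so each additional verified nugget raises the reward by $1/n$. Whether the paper weights nuggets, penalizes unsupported content, or normalizes by length cannot be determined from this page.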

Simulated Author's Rebuttal

2 responses · 0 unresolved

We thank the referee for the thoughtful and constructive comments. We have addressed each major point below and revised the manuscript accordingly to improve clarity, reproducibility, and robustness.

Referee: [§3.2] The nugget-centric verification procedure is presented as relying on objective cross-source checks, yet the manuscript does not supply explicit, reproducible criteria (e.g., exact string overlap, citation matching, or deterministic rules) for nugget extraction, consistency scoring, or matching. If any step uses prompt templates, auxiliary LLM calls, or learned rubrics whose outputs are treated as ground truth, the 'externally verifiable' and 'no handcrafted supervision' properties become circular, undermining the independence of the reward signal and the stability arguments.

Authors: We appreciate the referee's emphasis on reproducibility and independence of the reward signal. The original manuscript described the nugget-centric verification at a conceptual level to highlight the framework. In the revision, we have added a detailed Appendix A that specifies the full procedure: nugget extraction combines deterministic sentence segmentation with entity linking using fixed rules and a pre-trained NER model (no learned rubrics); cross-source consistency scoring employs exact string overlap for surface forms combined with cosine similarity on fixed sentence embeddings (threshold 0.85) against the retrieved passages. Any auxiliary LLM calls are limited to structured JSON output for parsing and use a fixed, publicly documented prompt template provided verbatim in the appendix. These steps ground the reward directly in the external evidence sources rather than the policy model, preserving the non-circular, externally verifiable property and avoiding handcrafted supervision for the policy itself.
revision: yes
Referee: [§4.2, Table 2] While higher factual recall and faithfulness are reported, the paper provides no ablation on verification threshold sensitivity or nugget definition variants. Without these, it is impossible to determine whether the observed gains are robust or artifacts of particular hyper-parameter choices, weakening the claim that the method yields stable optimization.

Authors: We agree that explicit sensitivity analysis strengthens the stability claims. The revised manuscript includes new ablation experiments in Section 4.2 and an additional Table 3. We report results for verification thresholds ranging from 0.7 to 0.95 and for three nugget granularity variants (phrase-level, sentence-level, and entity-centric). Performance remains consistently superior to baselines across these settings, with only minor variance in absolute scores, supporting that the gains arise from the overall verifiable reward formulation rather than specific hyper-parameter choices. We have also added a brief discussion of threshold selection based on a small validation split.
revision: yes
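
The matching rule described in the simulated rebuttal (exact string overlap combined with cosine similarity against a threshold) can be sketched as follows. Several things here are assumptions: "combined" is read as a logical OR, the two-dimensional vectors stand in for real sentence embeddings, and `is_supported` is a hypothetical helper name. The sweep at the end mirrors the threshold-sensitivity ablation the referee requested.

```python
import math


def cosine(u: list[float], v: list[float]) -> float:
    """Cosine similarity of two equal-length vectors; 0.0 if either is all zeros."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm if norm else 0.0


def is_supported(nugget: str, nugget_vec: list[float],
                 passages: list[tuple[str, list[float]]],
                 threshold: float = 0.85) -> bool:
    """True if any passage contains the nugget verbatim OR is similar enough.

    Treating the two tests as alternatives is an assumption of this sketch.
    """
    for text, vec in passages:
        if nugget.lower() in text.lower():
            return True
        if cosine(nugget_vec, vec) >= threshold:
            return True
    return False


# Toy passage with a made-up embedding; real embeddings would come from a fixed encoder.
passages = [("The tower opened in 1889.", [4.0, 3.0])]

# Verbatim overlap succeeds regardless of the embedding.
exact_hit = is_supported("1889", [0.0, 1.0], passages)

# Paraphrased nugget: cosine with the passage is 0.8, so the outcome depends on the threshold.
sweep = {t: is_supported("completed in the late 1880s", [1.0, 0.0], passages, threshold=t)
         for t in (0.70, 0.85, 0.95)}
```

In this toy case the paraphrased nugget counts as supported only at the loosest threshold, which shows why a sensitivity sweep is informative: the reward a response earns can change with the cutoff even when the evidence does not.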

Circularity Check

0 steps flagged

No significant circularity in derivation chain

The paper defines informativeness as an externally verifiable objective and implements it via nugget-centric verification with cross-source checks to generate rewards for RL optimization in long-form RAG. This is presented as independent of handcrafted supervision or teacher distillation, with the central results coming from experiments on external benchmarks LongFact and RAGChecker that measure factual recall and faithfulness. No equations or steps reduce by construction to fitted parameters renamed as predictions, self-citations that bear the load of uniqueness, or ansatzes smuggled in; the framework is self-contained against those external benchmarks rather than internally forced.

Axiom & Free-Parameter Ledger

0 free parameters · 1 axiom · 1 invented entity

The central contribution rests on redefining informativeness as an externally checkable quantity and introducing nugget-centric verification as the mechanism to operationalize it; no numerical free parameters are mentioned and no new physical or mathematical entities are postulated.

axioms (1)

domain assumption: Informativeness can be defined as a measurable and externally verifiable objective suitable for RL in long-form QA.
This definition is the load-bearing premise that allows the reward to be used without handcrafted supervision.

invented entities (1)

nugget-centric verification (no independent evidence)
purpose: To produce dense, action-discriminative rewards and enable self-evolution of smaller LLMs
New operational mechanism introduced to address reward sparsity in LFQA.

Lean theorems connected to this paper

Citations machine-checked in the Pith Canon. Every link opens the source theorem in the public Lean library.

IndisputableMonolith/Cost/FunctionalEquation.lean · washburn_uniqueness_aczel · unclear

nugget-centric hierarchical reward modeling... three-stage process: extracting the nugget from every source webpage, constructing a nugget claim checklist, and computing rewards based on factual alignment

IndisputableMonolith/Foundation/RealityFromDistinction.lean · reality_from_one_distinction · unclear

Reinforced Informativeness Optimization (Rio) framework that introduces informativeness as an optimization objective during RL

What do these tags mean?

matches: The paper's claim is directly supported by a theorem in the formal canon.
supports: The theorem supports part of the paper's argument, but the paper may add assumptions or extra steps.
extends: The paper goes beyond the formal theorem; the theorem is a base layer rather than the whole result.
uses: The paper appears to rely on the theorem as machinery.
contradicts: The paper's claim conflicts with a theorem or certificate in the canon.
unclear: Pith found a possible connection, but the passage is too broad, indirect, or ambiguous to say the theorem truly supports the claim.

Forward citations

Cited by 2 Pith papers

Reviewed papers in the Pith corpus that reference this work. Sorted by Pith novelty score.

ArbGraph: Conflict-Aware Evidence Arbitration for Reliable Long-Form Retrieval-Augmented Generation
cs.CL · 2026-04 · unverdicted · novelty 7.0

ArbGraph resolves conflicts in RAG evidence by constructing a conflict-aware graph of atomic claims and applying intensity-driven iterative arbitration to suppress unreliable claims prior to generation.

RASP-Tuner: Retrieval-Augmented Soft Prompts for Context-Aware Black-Box Optimization in Non-Stationary Environments
cs.LG · 2026-04 · unverdicted · novelty 5.0

RASP-Tuner matches or beats GP-UCB and CMA-ES regret on seven of nine synthetic non-stationary tasks while running 8-12 times faster per step.


[1] [1]

Retrieval-augmented generation for knowledge-intensive nlp tasks

Patrick Lewis, Ethan Perez, Aleksandra Piktus, Fabio Petroni, Vladimir Karpukhin, Naman Goyal, Heinrich Küttler, Mike Lewis, Wen-tau Yih, Tim Rocktäschel, et al. Retrieval-augmented generation for knowledge-intensive nlp tasks. Advances in neural information processing systems, 33:9459–9474, 2020

work page 2020

[2] [2]

Enabling large language models to generate text with citations

Tianyu Gao, Howard Yen, Jiatong Yu, and Danqi Chen. Enabling large language models to generate text with citations. In Proceedings of the 2023 Conference on Empirical Methods in Natural Language Processing, pages 6465–6488, 2023

work page 2023

[3] [3]

Ginger: Grounded information nugget-based genera- tion of responses

Weronika Łajewska and Krisztian Balog. Ginger: Grounded information nugget-based genera- tion of responses. arXiv preprint arXiv:2503.18174, 2025

work page arXiv 2025

[4] [4]

Deepseek-r1: Incentivizing reasoning capability in llms via reinforcement learning

Daya Guo, Dejian Yang, Haowei Zhang, Junxiao Song, Ruoyu Zhang, Runxin Xu, Qihao Zhu, Shirong Ma, Peiyi Wang, Xiao Bi, et al. Deepseek-r1: Incentivizing reasoning capability in llms via reinforcement learning. CoRR, 2025

work page 2025

[5] [5]

Understanding retrieval augmentation for long-form question answering

Hung-Ting Chen, Fangyuan Xu, Shane Arora, and Eunsol Choi. Understanding retrieval augmentation for long-form question answering. In First Conference on Language Modeling, 2024

work page 2024

[6] [6]

Eli5: Long form question answering

Angela Fan, Yacine Jernite, Ethan Perez, David Grangier, Jason Weston, and Michael Auli. Eli5: Long form question answering. In Proceedings of the 57th Annual Meeting of the Association for Computational Linguistics, pages 3558–3567, 2019

[7]

Factually consistent summarization via reinforcement learning with textual entailment feedback

Paul Roit, Johan Ferret, Lior Shani, Roee Aharoni, Geoffrey Cideron, Robert Dadashi, Matthieu Geist, Sertan Girgin, Leonard Hussenot, Orgad Keller, et al. Factually consistent summarization via reinforcement learning with textual entailment feedback. In Proceedings of the 61st Annual Meeting of the Association for Computational Linguistics (Volume 1: Lon...

[8]

Axiomatic preference modeling for longform question answering

Corby Rosset, Guoqing Zheng, Victor Dibia, Ahmed Awadallah, and Paul Bennett. Axiomatic preference modeling for longform question answering. In Proceedings of the 2023 Conference on Empirical Methods in Natural Language Processing, pages 11445–11475, 2023

[9]

The great nugget recall: Automating fact extraction and rag evaluation with large language models

Ronak Pradeep, Nandan Thakur, Shivani Upadhyay, Daniel Campos, Nick Craswell, and Jimmy Lin. The great nugget recall: Automating fact extraction and rag evaluation with large language models. arXiv preprint arXiv:2504.15068, 2025

[10]

Teaching language models to support answers with verified quotes

Jacob Menick, Maja Trebacz, Vladimir Mikulik, John Aslanides, Francis Song, Martin Chadwick, Mia Glaese, Susannah Young, Lucy Campbell-Gillingham, Geoffrey Irving, et al. Teaching language models to support answers with verified quotes. arXiv preprint arXiv:2203.11147, 2022

[11]

Investigating answerability of llms for long-form question answering

Meghana Moorthy Bhat, Rui Meng, Ye Liu, Yingbo Zhou, and Semih Yavuz. Investigating answerability of llms for long-form question answering. arXiv preprint arXiv:2309.08210, 2023

[12]

WebGPT: Browser-assisted question-answering with human feedback

Reiichiro Nakano, Jacob Hilton, Suchir Balaji, Jeff Wu, Long Ouyang, Christina Kim, Christopher Hesse, Shantanu Jain, Vineet Kosaraju, William Saunders, et al. Webgpt: Browser-assisted question-answering with human feedback. arXiv preprint arXiv:2112.09332, 2021

[13]

Webcpm: Interactive web search for chinese long-form question answering

Yujia Qin, Zihan Cai, Dian Jin, Lan Yan, Shihao Liang, Kunlun Zhu, Yankai Lin, Xu Han, Ning Ding, Huadong Wang, et al. Webcpm: Interactive web search for chinese long-form question answering. In The 61st Annual Meeting Of The Association For Computational Linguistics, 2023

[14]

Modeling exemplification in long-form question answering via retrieval

Shufan Wang, Fangyuan Xu, Laure Thompson, Eunsol Choi, and Mohit Iyyer. Modeling exemplification in long-form question answering via retrieval. In Proceedings of the 2022 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, pages 2079–2092, 2022

[15]

Rarr: Researching and revising what language models say, using language models

Luyu Gao, Zhuyun Dai, Panupong Pasupat, Anthony Chen, Arun Tejasvi Chaganty, Yicheng Fan, Vincent Zhao, Ni Lao, Hongrae Lee, Da-Cheng Juan, et al. Rarr: Researching and revising what language models say, using language models. In Proceedings of the 61st Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), pages 16477–1...

[16]

Tapera: enhancing faithfulness and interpretability in long-form table qa by content planning and execution-based reasoning

Yilun Zhao, Lyuhao Chen, Arman Cohan, and Chen Zhao. Tapera: enhancing faithfulness and interpretability in long-form table qa by content planning and execution-based reasoning. In Proceedings of the 62nd Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), pages 12824–12840, 2024

[17]

Calibrating long-form generations from large language models

Yukun Huang, Yixin Liu, Raghuveer Thirukovalluru, Arman Cohan, and Bhuwan Dhingra. Calibrating long-form generations from large language models. In Findings of the Association for Computational Linguistics: EMNLP 2024, pages 13441–13460, 2024

[18]

Rag-qa arena: Evaluating domain robustness for long-form retrieval augmented question answering

Rujun Han, Yuhao Zhang, Peng Qi, Yumo Xu, Jenyuan Wang, Lan Liu, William Yang Wang, Bonan Min, and Vittorio Castelli. Rag-qa arena: Evaluating domain robustness for long-form retrieval augmented question answering. In Proceedings of the 2024 Conference on Empirical Methods in Natural Language Processing, pages 4354–4374, 2024

[19]

Forag: Factuality-optimized retrieval augmented generation for web-enhanced long-form question answering

Tianchi Cai, Zhiwen Tan, Xierui Song, Tao Sun, Jiyan Jiang, Yunqi Xu, Yinger Zhang, and Jinjie Gu. Forag: Factuality-optimized retrieval augmented generation for web-enhanced long-form question answering. In Proceedings of the 30th ACM SIGKDD Conference on Knowledge Discovery and Data Mining, pages 199–210, 2024

[20]

Reward-rag: Enhancing rag with reward driven supervision

Thang Nguyen, Peter Chin, and Yu-Wing Tai. Reward-rag: Enhancing rag with reward driven supervision. arXiv preprint arXiv:2410.03780, 2024

[21]

Rag-reward: Optimizing rag with reward modeling and rlhf

Hanning Zhang, Juntong Song, Juno Zhu, Yuanhao Wu, Tong Zhang, and Cheng Niu. Rag-reward: Optimizing rag with reward modeling and rlhf. arXiv preprint arXiv:2501.13264, 2025

[22]

RAG-RL: Advancing Retrieval-Augmented Generation via RL and Curriculum Learning

Jerry Huang, Siddarth Madala, Risham Sidhu, Cheng Niu, Julia Hockenmaier, and Tong Zhang. Rag-rl: Advancing retrieval-augmented generation via rl and curriculum learning. arXiv preprint arXiv:2503.12759, 2025

[23]

Improving retrieval-augmented generation through multi-agent reinforcement learning

Yiqun Chen, Lingyong Yan, Weiwei Sun, Xinyu Ma, Yi Zhang, Shuaiqiang Wang, Dawei Yin, Yiming Yang, and Jiaxin Mao. Improving retrieval-augmented generation through multi-agent reinforcement learning. arXiv preprint arXiv:2501.15228, 2025

[24]

Reinforcement learning for optimizing rag for domain chatbots

Mandar Kulkarni, Praveen Tangarajan, Kyung Kim, and Anusua Trivedi. Reinforcement learning for optimizing rag for domain chatbots. arXiv preprint arXiv:2401.06800, 2024

[25]

Search-R1: Training LLMs to Reason and Leverage Search Engines with Reinforcement Learning

Bowen Jin, Hansi Zeng, Zhenrui Yue, Jinsung Yoon, Sercan Arik, Dong Wang, Hamed Zamani, and Jiawei Han. Search-r1: Training llms to reason and leverage search engines with reinforcement learning. arXiv preprint arXiv:2503.09516, 2025

[26]

DeepSeekMath: Pushing the Limits of Mathematical Reasoning in Open Language Models

Zhihong Shao, Peiyi Wang, Qihao Zhu, Runxin Xu, Junxiao Song, Xiao Bi, Haowei Zhang, Mingchuan Zhang, YK Li, Y Wu, et al. Deepseekmath: Pushing the limits of mathematical reasoning in open language models. arXiv preprint arXiv:2402.03300, 2024

[27]

Long-form factuality in large language models

Jerry Wei, Chengrun Yang, Xinying Song, Yifeng Lu, Nathan Zixia Hu, Jie Huang, Dustin Tran, Daiyi Peng, Ruibo Liu, Da Huang, et al. Long-form factuality in large language models. In The Thirty-eighth Annual Conference on Neural Information Processing Systems, 2024

[28]

Ragchecker: A fine-grained framework for diagnosing retrieval-augmented generation

Dongyu Ru, Lin Qiu, Xiangkun Hu, Tianhang Zhang, Peng Shi, Shuaichen Chang, Cheng Jiayang, Cunxiang Wang, Shichao Sun, Huanyu Li, et al. Ragchecker: A fine-grained framework for diagnosing retrieval-augmented generation. Advances in Neural Information Processing Systems, 37:21999–22027, 2024

[29]

Chain-of-thought prompting elicits reasoning in large language models

Jason Wei, Xuezhi Wang, Dale Schuurmans, Maarten Bosma, Fei Xia, Ed Chi, Quoc V Le, Denny Zhou, et al. Chain-of-thought prompting elicits reasoning in large language models. Advances in neural information processing systems, 35:24824–24837, 2022

[30]

Chain-of-note: Enhancing robustness in retrieval-augmented language models

Wenhao Yu, Hongming Zhang, Xiaoman Pan, Peixin Cao, Kaixin Ma, Jian Li, Hongwei Wang, and Dong Yu. Chain-of-note: Enhancing robustness in retrieval-augmented language models. In Proceedings of the 2024 Conference on Empirical Methods in Natural Language Processing, pages 14672–14685, 2024

[31]

Direct preference optimization: Your language model is secretly a reward model

Rafael Rafailov, Archit Sharma, Eric Mitchell, Christopher D Manning, Stefano Ermon, and Chelsea Finn. Direct preference optimization: Your language model is secretly a reward model. Advances in Neural Information Processing Systems, 36:53728–53741, 2023

[32]

Emergent abilities of large language models

Jason Wei, Yi Tay, Rishi Bommasani, Colin Raffel, Barret Zoph, Sebastian Borgeaud, Dani Yogatama, Maarten Bosma, Denny Zhou, Donald Metzler, et al. Emergent abilities of large language models. Transactions on Machine Learning Research, 2022

[33]

Clapnq: Cohesive long-form answers from passages in natural questions for rag systems

Sara Rosenthal, Avirup Sil, Radu Florian, and Salim Roukos. Clapnq: Cohesive long-form answers from passages in natural questions for rag systems. Transactions of the Association for Computational Linguistics, 13:53–72, 2025

[34]

Novelqa: A benchmark for long-range novel question answering

Cunxiang Wang, Ruoxi Ning, Boqi Pan, Tonghui Wu, Qipeng Guo, Cheng Deng, Guangsheng Bao, Qian Wang, and Yue Zhang. Novelqa: A benchmark for long-range novel question answering. arXiv e-prints, pages arXiv–2403, 2024

[35]

Www’18 open challenge: financial opinion mining and question answering

Macedo Maia, Siegfried Handschuh, André Freitas, Brian Davis, Ross McDermott, Manel Zarrouk, and Alexandra Balahur. Www’18 open challenge: financial opinion mining and question answering. In Companion proceedings of the the web conference 2018, pages 1941–1942, 2018

[36]

Kiwi: A dataset of knowledge-intensive writing instructions for answering research questions

Fangyuan Xu, Kyle Lo, Luca Soldaini, Bailey Kuehl, Eunsol Choi, and David Wadden. Kiwi: A dataset of knowledge-intensive writing instructions for answering research questions. arXiv preprint arXiv:2403.03866, 2024
