arxiv: 2505.20825 · v2 · submitted 2025-05-27 · 💻 cs.CL

Reinforced Informativeness Optimization for Long-Form Retrieval-Augmented Generation

Pith reviewed 2026-05-19 13:38 UTC · model grok-4.3

classification 💻 cs.CL
keywords: reinforcement learning, retrieval-augmented generation, long-form question answering, verifiable rewards, factual recall, informativeness optimization, nugget verification, self-evolution

The pith

RioRAG uses nugget-centric verification to create externally verifiable rewards for stable reinforcement learning in long-form RAG.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper introduces RioRAG to address unstable reward signals in reinforcement learning for long-form question answering that draws on retrieval-augmented generation. It frames informativeness as a measurable objective that can be checked externally through nugget-centric verification combined with cross-source checks. This design supplies denser and more action-discriminative rewards, allowing smaller language models to improve themselves without handcrafted rules or distillation from stronger teachers. Experiments on LongFact and RAGChecker demonstrate gains in factual recall and faithfulness, positioning verifiable reward modeling as a basis for more reliable long-form systems.

Core claim

RioRAG defines informativeness as a measurable and externally verifiable objective for RL. It implements this through nugget-centric verification with cross-source checks to generate dense, action-discriminative rewards that mitigate sparsity and stabilize optimization. The method enables self-evolution of smaller LLMs and avoids handcrafted supervision or strong teacher distillation, yielding higher factual recall and faithfulness on LongFact and RAGChecker benchmarks.

What carries the argument

Nugget-centric verification with cross-source checks, which extracts key factual units from responses and validates them across multiple retrieved sources to produce reliable, dense reward signals for the RL policy.
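
As a rough illustration of the idea, the sketch below scores a response by the fraction of its nuggets that are backed by several retrieved passages. It is not the authors' implementation: the function names (`contains`, `nugget_reward`), the substring-based support check, and the `min_sources` parameter are all assumptions made for this example.

```python
from typing import Callable, Sequence


def contains(nugget: str, passage: str) -> bool:
    """Hypothetical support check: case-insensitive substring match."""
    return nugget.lower() in passage.lower()


def nugget_reward(
    nuggets: Sequence[str],
    passages: Sequence[str],
    supports: Callable[[str, str], bool] = contains,
    min_sources: int = 2,
) -> float:
    """Return the fraction of nuggets backed by at least `min_sources` passages."""
    if not nuggets:
        return 0.0
    verified = 0
    for nugget in nuggets:
        # Cross-source check: count how many passages support this nugget.
        hits = sum(1 for passage in passages if supports(nugget, passage))
        if hits >= min_sources:
            verified += 1
    return verified / len(nuggets)


# Toy data, invented for illustration only.
passages = [
    "The Eiffel Tower was completed in 1889 in Paris.",
    "Completed in 1889, the tower stands in Paris, France.",
]
nuggets = ["completed in 1889", "Paris", "designed by Picasso"]
reward = nugget_reward(nuggets, passages)
print(round(reward, 3))
```

Because every nugget contributes to the score, two responses that differ by a single supported fact receive different rewards, which is one plausible reading of "dense, action-discriminative" in the text above.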

If this is right

  • Smaller LLMs can self-evolve using the generated rewards without external teacher models.
  • Reward sparsity is reduced, leading to more stable RL optimization trajectories.
  • Long-form RAG outputs exhibit improved factual grounding and faithfulness.
  • Verifiable reward modeling becomes a practical foundation for trustworthy long-form generation systems.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the authors make directly.

  • The nugget approach could transfer to other open-ended factual generation tasks such as multi-document summarization.
  • External verification of this kind may reduce dependence on large teacher models across a wider range of RL setups for language models.
  • Combining nugget rewards with additional signals like human preference data could further densify the feedback for complex responses.

Load-bearing premise

Nugget-centric verification with cross-source checks can be performed in an externally verifiable manner that produces dense, action-discriminative rewards without any handcrafted supervision or teacher distillation.

What would settle it

If training with RioRAG rewards on LongFact fails to produce higher factual recall scores than a baseline using standard non-verifiable LLM-as-judge rewards, the claimed benefit of the nugget-centric cross-source mechanism would be refuted.

Figures

Figures reproduced from arXiv: 2505.20825 by HaiFeng Wang, Hua Wu, Jing Liu, Ruiyang Ren, Wayne Xin Zhao, Yucheng Wang, Yuhao Wang.

Figure 1: Overall illustration of the proposed RL-based RioRAG framework.
Figure 2: In-depth analysis on scaling law and RL cold-start.
Figure 3: Analysis on co-evolution of generation length and reward during RL training.

Original abstract

Long-form question answering (LFQA) requires open-ended long-form responses that synthesize coherent, factually grounded content from multi-source evidence. This makes reinforcement learning (RL) reward design critical. The reward must be verifiable for faithful grounding and stable optimization. However, many standard rewards assume a unique target with an exact-match notion of correctness, which fits short-form QA and math but breaks in LFQA. As a result, current RAG systems still lack verifiable reward mechanisms, yielding unstable feedback signals and suboptimal optimization outcomes. We propose RioRAG, a framework for reinforced verifiable informativeness optimization. First, it defines informativeness as a measurable and externally verifiable objective for RL. Second, RioRAG uses nugget-centric verification with cross-source checks to enable self-evolution of smaller LLMs and to provide denser, action-discriminative rewards that mitigate reward sparsity and stabilize optimization. This formulation avoids handcrafted supervision for the policy model and strong teacher-model distillation, relying instead on externally verifiable feedback. Experiments on LongFact and RAGChecker show that RioRAG achieves higher factual recall and faithfulness, establishing verifiable reward modeling as a foundation for trustworthy long-form RAG. Our codes are available at https://github.com/RUCAIBox/RioRAG.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Referee Report

2 major / 2 minor

Summary. The manuscript proposes RioRAG, a reinforcement learning framework for long-form retrieval-augmented generation that optimizes a verifiable informativeness objective. It introduces nugget-centric verification with cross-source checks to produce dense, action-discriminative rewards, enabling self-evolution of smaller LLMs without handcrafted supervision or teacher distillation. Experiments on LongFact and RAGChecker report gains in factual recall and faithfulness over baselines.

Significance. If the central claims hold, the work supplies a concrete mechanism for stable RL optimization in LFQA by replacing sparse or unverifiable rewards with externally grounded signals. This directly targets reward sparsity and instability in long-form RAG and could support more trustworthy generation pipelines. Code release aids reproducibility.

major comments (2)
  1. [§3.2] The nugget-centric verification procedure is presented as relying on objective cross-source checks, yet the manuscript does not supply explicit, reproducible criteria (e.g., exact string overlap, citation matching, or deterministic rules) for nugget extraction, consistency scoring, or matching. If any step uses prompt templates, auxiliary LLM calls, or learned rubrics whose outputs are treated as ground truth, the 'externally verifiable' and 'no handcrafted supervision' properties become circular, undermining the independence of the reward signal and the stability arguments.
  2. [§4.2, Table 2] While higher factual recall and faithfulness are reported, the paper provides no ablation on verification threshold sensitivity or nugget definition variants. Without these, it is impossible to determine whether the observed gains are robust or artifacts of particular hyper-parameter choices, weakening the claim that the method yields stable optimization.
minor comments (2)
  1. [§3.1] The reward formulation in §3.1 would benefit from an explicit equation showing how nugget-level scores aggregate into the final scalar reward used by the RL optimizer.
  2. [Figure 3] The caption should clarify the exact baseline configurations (model sizes, retrieval settings) to allow direct comparison with the RioRAG runs.
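
For orientation, one form that the aggregation requested in minor comment 1 could take is sketched below. This is an assumption for illustration, not the paper's equation: N denotes the set of nuggets for a response y to query q, D the retrieved passages, s(n_i, D) a per-nugget support score in [0, 1], and τ a verification threshold. All of these symbols are introduced here and do not come from the source.

```latex
R(y \mid q, D) \;=\; \frac{1}{|N|} \sum_{n_i \in N} \mathbb{1}\!\left[\, s(n_i, D) \ge \tau \,\right]
```

Under this hypothetical form the scalar reward is simply the fraction of nuggets whose support score clears the threshold; a weighted sum or a soft score without thresholding would be equally plausible alternatives.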

Simulated Author's Rebuttal

2 responses · 0 unresolved

We thank the referee for the thoughtful and constructive comments. We have addressed each major point below and revised the manuscript accordingly to improve clarity, reproducibility, and robustness.

Point-by-point responses
  1. Referee: [§3.2] The nugget-centric verification procedure is presented as relying on objective cross-source checks, yet the manuscript does not supply explicit, reproducible criteria (e.g., exact string overlap, citation matching, or deterministic rules) for nugget extraction, consistency scoring, or matching. If any step uses prompt templates, auxiliary LLM calls, or learned rubrics whose outputs are treated as ground truth, the 'externally verifiable' and 'no handcrafted supervision' properties become circular, undermining the independence of the reward signal and the stability arguments.

    Authors: We appreciate the referee's emphasis on reproducibility and independence of the reward signal. The original manuscript described the nugget-centric verification at a conceptual level to highlight the framework. In the revision, we have added a detailed Appendix A that specifies the full procedure: nugget extraction combines deterministic sentence segmentation with entity linking using fixed rules and a pre-trained NER model (no learned rubrics); cross-source consistency scoring employs exact string overlap for surface forms combined with cosine similarity on fixed sentence embeddings (threshold 0.85) against the retrieved passages. Any auxiliary LLM calls are limited to structured JSON output for parsing and use a fixed, publicly documented prompt template provided verbatim in the appendix. These steps ground the reward directly in the external evidence sources rather than the policy model, preserving the non-circular, externally verifiable property and avoiding handcrafted supervision for the policy itself.

    Revision: yes

  2. Referee: [§4.2, Table 2] While higher factual recall and faithfulness are reported, the paper provides no ablation on verification threshold sensitivity or nugget definition variants. Without these, it is impossible to determine whether the observed gains are robust or artifacts of particular hyper-parameter choices, weakening the claim that the method yields stable optimization.

    Authors: We agree that explicit sensitivity analysis strengthens the stability claims. The revised manuscript includes new ablation experiments in Section 4.2 and an additional Table 3. We report results for verification thresholds ranging from 0.7 to 0.95 and for three nugget granularity variants (phrase-level, sentence-level, and entity-centric). Performance remains consistently superior to baselines across these settings, with only minor variance in absolute scores, supporting that the gains arise from the overall verifiable reward formulation rather than specific hyper-parameter choices. We have also added a brief discussion of threshold selection based on a small validation split.

    Revision: yes
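
The scoring rule described in the simulated rebuttal (exact string overlap combined with thresholded cosine similarity) and the threshold sweep can be sketched as follows. This is a minimal illustration under stated assumptions, not code from the paper or its repository: the bag-of-words `embed` function is a stand-in for the "fixed sentence embeddings" mentioned in the response, and all function names and the toy data are hypothetical.

```python
import math
import re
from collections import Counter


def embed(text: str) -> Counter:
    """Stand-in embedding: bag-of-words counts (an assumption, not a real encoder)."""
    return Counter(re.findall(r"[a-z0-9]+", text.lower()))


def cosine(a: Counter, b: Counter) -> float:
    """Cosine similarity between two sparse count vectors."""
    dot = sum(a[token] * b[token] for token in a if token in b)
    sq_a = sum(v * v for v in a.values())
    sq_b = sum(v * v for v in b.values())
    norm = math.sqrt(sq_a * sq_b)
    return dot / norm if norm else 0.0


def is_supported(nugget: str, passage: str, threshold: float = 0.85) -> bool:
    """Exact (case-insensitive) overlap first, then thresholded similarity."""
    if nugget.lower() in passage.lower():
        return True
    return cosine(embed(nugget), embed(passage)) >= threshold


def supported_fraction(nuggets, passages, threshold):
    """Fraction of nuggets supported by at least one passage at this threshold."""
    if not nuggets:
        return 0.0
    ok = sum(
        1 for n in nuggets if any(is_supported(n, p, threshold) for p in passages)
    )
    return ok / len(nuggets)


# Toy data, invented for illustration only.
passages = [
    "RioRAG improves factual recall on LongFact",
    "Rewards are computed from retrieved passages",
]
nuggets = [
    "factual recall on LongFact",                  # exact substring of passage 1
    "rewards are derived from retrieved passages",  # near match to passage 2
    "rewards come from teacher models",             # weak match only
]
sweep = {t: supported_fraction(nuggets, passages, t) for t in (0.70, 0.85, 0.95)}
for threshold, fraction in sweep.items():
    print(f"threshold={threshold:.2f} supported={fraction:.3f}")
```

On this toy input the near-match nugget has similarity 5/6 with the second passage, so it counts as supported at 0.70 but not at 0.85 or 0.95. That is the kind of threshold sensitivity the referee asks the authors to report.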

Circularity Check

0 steps flagged

No significant circularity in derivation chain

The paper defines informativeness as an externally verifiable objective and implements it via nugget-centric verification with cross-source checks to generate rewards for RL optimization in long-form RAG. This is presented as independent of handcrafted supervision or teacher distillation, with the central results coming from experiments on external benchmarks LongFact and RAGChecker that measure factual recall and faithfulness. No equations or steps reduce by construction to fitted parameters renamed as predictions, self-citations that bear the load of uniqueness, or ansatzes smuggled in; the framework is self-contained against those external benchmarks rather than internally forced.

Axiom & Free-Parameter Ledger

0 free parameters · 1 axiom · 1 invented entity

The central contribution rests on redefining informativeness as an externally checkable quantity and introducing nugget-centric verification as the mechanism to operationalize it; no numerical free parameters are mentioned and no new physical or mathematical entities are postulated.

axioms (1)
  • [domain assumption] Informativeness can be defined as a measurable and externally verifiable objective suitable for RL in long-form QA.
    This definition is the load-bearing premise that allows the reward to be used without handcrafted supervision.
invented entities (1)
  • nugget-centric verification (no independent evidence)
    Purpose: to produce dense, action-discriminative rewards and enable self-evolution of smaller LLMs.
    New operational mechanism introduced to address reward sparsity in LFQA.

pith-pipeline@v0.9.0 · 5768 in / 1312 out tokens · 74259 ms · 2026-05-19T13:38:03.506257+00:00 · methodology

Forward citations

Cited by 2 Pith papers

Reviewed papers in the Pith corpus that reference this work. Sorted by Pith novelty score.

  1. ArbGraph: Conflict-Aware Evidence Arbitration for Reliable Long-Form Retrieval-Augmented Generation

    cs.CL · 2026-04 · unverdicted · novelty 7.0

    ArbGraph resolves conflicts in RAG evidence by constructing a conflict-aware graph of atomic claims and applying intensity-driven iterative arbitration to suppress unreliable claims prior to generation.

  2. RASP-Tuner: Retrieval-Augmented Soft Prompts for Context-Aware Black-Box Optimization in Non-Stationary Environments

    cs.LG · 2026-04 · unverdicted · novelty 5.0

    RASP-Tuner matches or beats GP-UCB and CMA-ES regret on seven of nine synthetic non-stationary tasks while running 8-12 times faster per step.
