pith. machine review for the scientific record.

arxiv: 2605.10168 · v1 · submitted 2026-05-11 · 💻 cs.CL · cs.IR

Recognition: 2 theorem links · Lean Theorem

ASTRA-QA: A Benchmark for Abstract Question Answering over Documents

Authors on Pith: no claims yet

Pith reviewed 2026-05-12 02:51 UTC · model grok-4.3

classification: 💻 cs.CL · cs.IR
keywords: abstract question answering · document QA benchmark · topic coverage scoring · unsupported content detection · RAG evaluation · retrieval scope robustness · hallucination diagnosis · academic and news documents

The pith

ASTRA-QA supplies explicit topic annotations so abstract document answers can be scored directly for required coverage and unsupported content.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

Abstract questions demand that answers synthesize scattered facts from long documents, yet existing benchmarks often lack stable references and fall back on coarse similarity measures or unstable head-to-head comparisons. The paper introduces ASTRA-QA with 869 QA instances drawn from academic papers and news articles, covering five abstract question types and three controlled retrieval scopes. Each instance carries answer topic sets, curated unsupported topics, and aligned evidence annotations. These annotations support direct scoring of topic coverage and unsupported content, removing the need for exhaustive pairwise comparisons. The resulting evaluations diagnose coverage, hallucination, and retrieval-scope robustness in representative RAG methods.

Core claim

ASTRA-QA is a benchmark of 869 QA instances over academic papers and news documents equipped with explicit evaluation annotations that include answer topic sets, curated unsupported topics, and aligned evidence. It assesses generated answers by directly scoring how well they cover the required key points and how much they include unsupported content, thereby enabling scalable, reference-grounded evaluation without exhaustive head-to-head comparisons.

What carries the argument

Explicit evaluation annotations consisting of answer topic sets, curated unsupported topics, and aligned evidence, which permit direct scoring of topic coverage and detection of unsupported content.
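
Concretely, once topics are extracted from an answer, the scoring reduces to set comparisons against the instance annotations. A minimal sketch, assuming topics arrive as plain string sets; the exact-match rule and all names here are illustrative stand-ins for the paper's LLM-based topic extraction and normalization, not its implementation:

    # Minimal sketch of topic-based scoring, assuming each instance carries an
    # answer topic set and a curated unsupported-topic set as in ASTRA-QA.
    # Exact string matching stands in for the paper's LLM-based topic
    # extraction and normalization; all names here are illustrative.

    def topic_scores(predicted_topics: set[str],
                     answer_topics: set[str],
                     unsupported_topics: set[str]) -> dict[str, float]:
        """Score one generated answer against the instance annotations."""
        covered = predicted_topics & answer_topics
        hallucinated = predicted_topics & unsupported_topics
        coverage = len(covered) / len(answer_topics) if answer_topics else 0.0
        unsupported_rate = (len(hallucinated) / len(predicted_topics)
                            if predicted_topics else 0.0)
        return {"coverage": coverage, "unsupported_rate": unsupported_rate}

    # Toy example: an answer covering 2 of 3 required topics while including
    # 1 curated unsupported topic.
    print(topic_scores(
        predicted_topics={"method overview", "main results", "speculated impact"},
        answer_topics={"method overview", "main results", "limitations"},
        unsupported_topics={"speculated impact"},
    ))
    # -> {'coverage': 0.666..., 'unsupported_rate': 0.333...}

Because both scores are computed per answer against fixed annotations, evaluation cost grows linearly with the number of answers rather than quadratically as in pairwise head-to-head comparison.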

Load-bearing premise

The manually curated answer topic sets, unsupported topics, and aligned evidence accurately and without bias represent what constitutes a high-quality abstract answer.

What would settle it

A controlled study in which human raters rank the quality of a sample of answers: if those rankings diverged substantially from the benchmark's topic-coverage and unsupported-content scores, the reliability of the evaluation method would be falsified.
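
One concrete shape such a study could take, sketched with toy numbers (the threshold and data are illustrative, not from the paper): rank-correlate human quality ratings with the benchmark's scores and treat low correlation as falsification.

    # Hedged sketch of the settling experiment: if human quality ratings and
    # the benchmark's scores diverge (low rank correlation), the evaluation
    # method's reliability is falsified. The threshold is illustrative.
    from scipy.stats import spearmanr

    def reliability_check(human_ratings: list[float],
                          benchmark_scores: list[float],
                          threshold: float = 0.6) -> bool:
        """Return True if benchmark scores track human judgments."""
        rho, p_value = spearmanr(human_ratings, benchmark_scores)
        print(f"Spearman rho = {rho:.3f} (p = {p_value:.3g})")
        return rho >= threshold

    # Toy data: 1-5 human ratings vs. per-answer benchmark scores.
    humans = [4.5, 2.0, 3.5, 5.0, 1.5]
    scores = [0.80, 0.35, 0.60, 0.90, 0.20]
    assert reliability_check(humans, scores)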

Figures

Figures reproduced from arXiv: 2605.10168 by Hulong Wu, Shansong Zhou, Shiwei Wang, Shu Wang, Xinyang Wang, Yixiang Fang.

Figure 1: Comparison between existing head-to-head evaluation and our topic-based evaluation for …
Figure 2: An example QA instance in the ASTRA-QA dataset.
Figure 3: Workflow for constructing the ASTRA-QA dataset.
Figure 4: Instead of relying on a single free-form reference, each question is equipped with a set of …
Figure 4: Distribution of topics in the answer and hallucination sets in our ASTRA-QA. This design supports joint evaluation of completeness and faithfulness by measuring whether a system covers the major answer topics while avoiding plausible but unsupported ones.
Figure 5: Head-to-head (bidirectional, overall) win-rate comparison, using the same method abbreviations as in …
Figure 7: Head-to-head win-rate matrix over the methods VR, LL, LH, HY, GG, GL, HR, RA, AR, KR, and HI …
Figure 8: Head-to-head win rates for the Comprehensiveness criterion. Each entry reports the row method's win rate against the column method; higher is better.
Figure 9: Head-to-head win rates for the Diversity criterion. Each entry reports the row method's win rate against the column method; higher is better.
Figure 10: Head-to-head win rates for the Empowerment criterion. Each entry reports the row method's win rate against the column method; higher is better.
Figure 11: Prompt for generating Single-Sum QA instances.
Figure 12: Prompt for generating Pair-Comp QA instances.
Figure 13: Prompt for generating Multi-Comp QA instances.
Figure 14: Prompt for generating Enum QA instances.
Figure 15: Prompt for generating Temporal QA instances.
Figure 16: Prompt for the ASTRA-QA evaluation method: extract the complete predicted topic list from a response and normalize each extracted topic against semantically equivalent entries in the Common Errors List or Ground_truth list.
Figure 17: Prompt for head-to-head evaluation.
Original abstract

Document-based question answering (QA) increasingly includes abstract questions that require synthesizing scattered information from long documents or across multiple documents into coherent answers. However, this setting is still poorly supported by existing benchmarks and evaluation methods, which often lack stable abstract references or rely on coarse similarity metrics and unstable head-to-head comparisons. To alleviate this issue, we introduce ASTRA-QA, a benchmark for AbSTRAct Question Answering over documents. ASTRA-QA contains 869 QA instances over academic papers and news documents, covering five abstract question types and three controlled retrieval scopes. Each instance is equipped with explicit evaluation annotations, including answer topic sets, curated unsupported topics, and aligned evidence. Building on these annotations, ASTRA-QA assesses whether answers cover required key points and avoid unsupported content by directly scoring topic coverage and curated unsupported content, enabling scalable evaluation without exhaustive head-to-head comparisons. Experiments with representative Retrieval-Augmented Generation (RAG) methods spanning vanilla, graph-based, and hierarchical retrieval settings show that ASTRA-QA provides reference-grounded diagnostics for coverage, hallucination, and retrieval-scope robustness. Our dataset and code are available at https://xinyangsally.github.io/astra-benchmark.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, and this is the friction.

Referee Report

2 major / 3 minor

Summary. The paper introduces ASTRA-QA, a benchmark of 869 QA instances over academic papers and news documents spanning five abstract question types and three controlled retrieval scopes. Each instance includes explicit annotations consisting of answer topic sets, curated unsupported topics, and aligned evidence. The benchmark enables direct scoring of topic coverage and avoidance of unsupported content in generated answers, supporting scalable evaluation of RAG methods without exhaustive head-to-head comparisons. Experiments with vanilla, graph-based, and hierarchical RAG approaches illustrate its use for diagnosing coverage, hallucination, and retrieval-scope robustness. The dataset and code are publicly released.

Significance. If the annotations are shown to be reliable, ASTRA-QA would address a clear gap in evaluating abstract QA over long or multi-document settings, where existing benchmarks often depend on coarse similarity metrics or unstable comparisons. The public release of the full dataset with annotations and code is a clear strength that supports reproducibility and further research. The approach could enable more stable, reference-grounded diagnostics for coverage and hallucination in RAG systems.

major comments (2)
  1. §3 (Benchmark Construction): The description of how answer topic sets and curated unsupported topics were created for the 869 instances across five question types provides no inter-annotator agreement statistics, no expert re-validation on a held-out sample, and no analysis of topic granularity control. These annotations are load-bearing for the central claim that direct scoring of coverage and unsupported content yields stable, reference-grounded evaluation without head-to-head comparisons.
  2. §4 (Experiments): The reported results with representative RAG methods demonstrate diagnostic utility but contain no validation of the automatic topic-coverage and unsupported-content scores against independent human judgments on even a small sample of outputs. This leaves open whether the metrics align with expert notions of answer quality.
minor comments (3)
  1. Abstract and §1: The three controlled retrieval scopes are mentioned but not defined until later; a brief upfront characterization would improve readability.
  2. §2 (Related Work): Ensure all cited QA benchmarks are compared on the specific dimensions of abstract synthesis and annotation stability rather than only on dataset size.
  3. Data statistics: Report the distribution of instances per question type and retrieval scope, e.g. in Table 1, to allow readers to assess balance.

Simulated Author's Rebuttal

2 responses · 0 unresolved

We thank the referee for their constructive feedback and positive assessment of ASTRA-QA's potential to address gaps in abstract QA evaluation. We address each major comment below with specific plans for revision where appropriate.

Point-by-point responses
  1. Referee: §3 (Benchmark Construction): The description of how answer topic sets and curated unsupported topics were created for the 869 instances across five question types provides no inter-annotator agreement statistics, no expert re-validation on a held-out sample, and no analysis of topic granularity control. These annotations are load-bearing for the central claim that direct scoring of coverage and unsupported content yields stable, reference-grounded evaluation without head-to-head comparisons.

    Authors: We agree that inter-annotator agreement (IAA) statistics, re-validation, and granularity analysis would strengthen the presentation of the annotations. The topic sets and unsupported topics were constructed using explicit guidelines and domain-expert curation across the 869 instances, but these supporting statistics were not included in the initial submission. In the revised manuscript, we will add IAA results computed on a held-out sample of 100 instances using a second independent annotator (reporting Cohen's kappa for topic overlap and unsupported topic identification). We will also include expert re-validation on a separate 50-instance sample and an analysis of topic granularity control, reporting average topic set sizes, variance, and distributions stratified by question type and document domain. These elements will be incorporated into Section 3; a sketch of these statistics appears after the responses below. revision: yes

  2. Referee: §4 (Experiments): The reported results with representative RAG methods demonstrate diagnostic utility but contain no validation of the automatic topic-coverage and unsupported-content scores against independent human judgments on even a small sample of outputs. This leaves open whether the metrics align with expert notions of answer quality.

    Authors: We acknowledge that direct validation of the automatic scores against human judgments would provide stronger evidence of metric reliability. The current experiments in Section 4 focus on using the benchmark to diagnose RAG behaviors across retrieval scopes, but no human correlation study was reported. In the revision, we will add a targeted human validation: two experts will independently rate a random sample of 100 generated answers (drawn from the reported RAG runs) for topic coverage and unsupported content using a 5-point scale. We will then compute and report Pearson and Spearman correlations between these human ratings and the automatic scores. This analysis will be added to Section 4. revision: yes
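
The statistics promised in these two responses are standard to compute. A minimal sketch with toy placeholder data, not results from the paper: Cohen's kappa for inter-annotator agreement on per-topic decisions, and Pearson/Spearman correlations between expert ratings and the automatic scores.

    # Sketch of the validation statistics promised in the rebuttal: Cohen's
    # kappa for inter-annotator agreement on topic labels, and Pearson and
    # Spearman correlations between expert ratings and automatic scores.
    # All data below are toy placeholders, not results from the paper.
    from sklearn.metrics import cohen_kappa_score
    from scipy.stats import pearsonr, spearmanr

    # Binary per-topic decisions (covered / not covered) from two annotators
    # over the same held-out instances.
    annotator_a = [1, 0, 1, 1, 0, 1, 0, 0, 1, 1]
    annotator_b = [1, 0, 1, 0, 0, 1, 0, 1, 1, 1]
    print("Cohen's kappa:", cohen_kappa_score(annotator_a, annotator_b))

    # 5-point expert ratings vs. automatic topic-coverage scores per answer.
    expert = [5, 3, 4, 2, 1, 4, 3, 5]
    automatic = [0.9, 0.5, 0.7, 0.4, 0.1, 0.8, 0.6, 0.95]
    print("Pearson r:", pearsonr(expert, automatic)[0])
    print("Spearman rho:", spearmanr(expert, automatic)[0])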

Circularity Check

0 steps flagged

No circularity: benchmark defined via independent new annotations

Full rationale

The paper introduces ASTRA-QA as a new benchmark consisting of 869 instances with explicitly created answer topic sets, curated unsupported topics, and aligned evidence annotations. The evaluation method directly scores topic coverage and unsupported content using these annotations by construction, which is the standard non-circular design of a reference-based benchmark rather than a derivation that reduces to prior fitted quantities or self-citations. No equations, parameter fits, uniqueness theorems, or load-bearing self-citations appear in the provided text; the central claim of scalable evaluation without head-to-head comparisons follows directly from supplying the reference annotations as new inputs. The benchmark is therefore self-contained and does not lean on external benchmarks.

Axiom & Free-Parameter Ledger

0 free parameters · 1 axiom · 0 invented entities

The benchmark relies on standard practices for creating QA datasets but introduces no free parameters, new entities, or ad-hoc axioms beyond typical domain assumptions in NLP evaluation.

axioms (1)
  • domain assumption: Standard assumptions in NLP benchmark creation, such as representative sampling of documents and questions.
    The benchmark construction implicitly relies on typical practices for creating QA datasets over documents.

pith-pipeline@v0.9.0 · 5524 in / 1191 out tokens · 64203 ms · 2026-05-12T02:51:41.665677+00:00 · methodology

discussion (0)


Lean theorems connected to this paper

Citations machine-checked in the Pith Canon. Every link opens the source theorem in the public Lean library.

What do these tags mean?
matches
The paper's claim is directly supported by a theorem in the formal canon.
supports
The theorem supports part of the paper's argument, but the paper may add assumptions or extra steps.
extends
The paper goes beyond the formal theorem; the theorem is a base layer rather than the whole result.
uses
The paper appears to rely on the theorem as machinery.
contradicts
The paper's claim conflicts with a theorem or certificate in the canon.
unclear
Pith found a possible connection, but the passage is too broad, indirect, or ambiguous to say the theorem truly supports the claim.

Reference graph

Works this paper leans on

55 extracted references · 55 canonical work pages · 6 internal anchors

  1. [1]

    Retrieval-augmented generation for knowledge-intensive NLP tasks

    Patrick Lewis, Ethan Perez, Aleksandra Piktus, Fabio Petroni, Vladimir Karpukhin, Naman Goyal, Heinrich Küttler, Mike Lewis, Wen-tau Yih, Tim Rocktäschel, et al. Retrieval-augmented generation for knowledge-intensive NLP tasks. Advances in Neural Information Processing Systems, 33:9459–9474, 2020.

  2. [2]

    From Local to Global: A Graph RAG Approach to Query-Focused Summarization

    Darren Edge, Ha Trinh, Newman Cheng, Joshua Bradley, Alex Chao, Apurva Mody, Steven Truitt, and Jonathan Larson. From local to global: A graph RAG approach to query-focused summarization. arXiv preprint arXiv:2404.16130, 2024.

  3. [3]

    LightRAG: Simple and fast retrieval-augmented generation

    Zirui Guo, Lianghao Xia, Yanhua Yu, Tu Ao, and Chao Huang. LightRAG: Simple and fast retrieval-augmented generation. arXiv e-prints, pages arXiv–2410, 2024.

  4. [4]

    ArchRAG: Attributed community-based hierarchical retrieval-augmented generation

    Shu Wang, Yixiang Fang, Yingli Zhou, Xilin Liu, and Yuchi Ma. ArchRAG: Attributed community-based hierarchical retrieval-augmented generation. In Proceedings of the AAAI Conference on Artificial Intelligence, volume 40, pages 15868–15876, 2026.

  5. [5]

    In-depth Analysis of Graph-based RAG in a Unified Framework

    Yingli Zhou, Yaodong Su, Youran Sun, Shu Wang, Taotao Wang, Runyuan He, Yongwei Zhang, Sicong Liang, Xilin Liu, Yuchi Ma, et al. In-depth analysis of graph-based RAG in a unified framework. arXiv preprint arXiv:2503.04338, 2025.

  6. [6]

    PathRAG: Pruning graph-based retrieval augmented generation with relational paths

    Boyu Chen, Zirui Guo, Zidan Yang, Yuluo Chen, Junze Chen, Zhenghao Liu, Chuan Shi, and Cheng Yang. PathRAG: Pruning graph-based retrieval augmented generation with relational paths. In Proceedings of the AAAI Conference on Artificial Intelligence, volume 40, pages 30183–30191, 2026.

  7. [7]

    Retrieval-augmented generation with hierarchical knowledge

    Haoyu Huang, Yongfeng Huang, Junjie Yang, Zhenyu Pan, Yongqiang Chen, Kaili Ma, Hongzhi Chen, and James Cheng. Retrieval-augmented generation with hierarchical knowledge. arXiv preprint arXiv:2503.10150, 2025.

  8. [8]

    FoRAG: Factuality-optimized retrieval augmented generation for web-enhanced long-form question answering

    Tianchi Cai, Zhiwen Tan, Xierui Song, Tao Sun, Jiyan Jiang, Yunqi Xu, Yinger Zhang, and Jinjie Gu. FoRAG: Factuality-optimized retrieval augmented generation for web-enhanced long-form question answering. In Proceedings of the 30th ACM SIGKDD Conference on Knowledge Discovery and Data Mining, pages 199–210, 2024.

  9. [9]

    A dataset of information-seeking questions and answers anchored in research papers

    Pradeep Dasigi, Kyle Lo, Iz Beltagy, Arman Cohan, Noah A. Smith, and Matt Gardner. A dataset of information-seeking questions and answers anchored in research papers. arXiv preprint arXiv:2105.03011, 2021.

  10. [10]

    PeerQA: A scientific question answering dataset from peer reviews

    Tim Baumgärtner, Ted Briscoe, and Iryna Gurevych. PeerQA: A scientific question answering dataset from peer reviews. In Proceedings of the 2025 Conference of the Nations of the Americas Chapter of the Association for Computational Linguistics: Human Language Technologies (Volume 1: Long Papers), pages 508–544, 2025.

  11. [11]

    ASQA: Factoid questions meet long-form answers

    Ivan Stelmakh, Yi Luan, Bhuwan Dhingra, and Ming-Wei Chang. ASQA: Factoid questions meet long-form answers. In Proceedings of the 2022 Conference on Empirical Methods in Natural Language Processing, pages 8273–8288, 2022.

  12. [12]

    HotpotQA: A Dataset for Diverse, Explainable Multi-hop Question Answering

    Zhilin Yang, Peng Qi, Saizheng Zhang, Yoshua Bengio, William W. Cohen, Ruslan Salakhutdinov, and Christopher D. Manning. HotpotQA: A dataset for diverse, explainable multi-hop question answering. arXiv preprint arXiv:1809.09600, 2018.

  13. [13]

    LongBench: A bilingual, multitask benchmark for long context understanding

    Yushi Bai, Xin Lv, Jiajie Zhang, Hongchang Lyu, Jiankai Tang, Zhidian Huang, Zhengxiao Du, Xiao Liu, Aohan Zeng, Lei Hou, et al. LongBench: A bilingual, multitask benchmark for long context understanding. In Proceedings of the 62nd Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), pages 3119–3137, 2024.

  14. [14]

    CRAG: Comprehensive RAG benchmark

    Xiao Yang, Kai Sun, Hao Xin, Yushi Sun, Nikita Bhalla, Xiangsen Chen, Sajal Choudhary, Rongze D. Gui, Ziran W. Jiang, Ziyu Jiang, et al. CRAG: Comprehensive RAG benchmark. Advances in Neural Information Processing Systems, 37:10470–10490, 2024.

  15. [15]

    RAG-QA Arena: Evaluating domain robustness for long-form retrieval augmented question answering

    Rujun Han, Yuhao Zhang, Peng Qi, Yumo Xu, Jenyuan Wang, Lan Liu, William Yang Wang, Bonan Min, and Vittorio Castelli. RAG-QA Arena: Evaluating domain robustness for long-form retrieval augmented question answering. In Proceedings of the 2024 Conference on Empirical Methods in Natural Language Processing, pages 4354–4374, 2024.

  16. [16]

    LiveRAG: A diverse Q&A dataset with varying difficulty level for RAG evaluation

    David Carmel, Simone Filice, Guy Horowitz, Yoelle Maarek, Alex Shtoff, Oren Somekh, and Ran Tavory. LiveRAG: A diverse Q&A dataset with varying difficulty level for RAG evaluation. arXiv preprint arXiv:2511.14531, 2025.

  17. [17]

    ROUGE: A package for automatic evaluation of summaries

    Chin-Yew Lin. ROUGE: A package for automatic evaluation of summaries. In Text Summarization Branches Out, pages 74–81, 2004.

  18. [18]

    BERTScore: Evaluating Text Generation with BERT

    Tianyi Zhang, Varsha Kishore, Felix Wu, Kilian Q. Weinberger, and Yoav Artzi. BERTScore: Evaluating text generation with BERT. arXiv preprint arXiv:1904.09675, 2019.

  19. [19]

    QuestEval: Summarization asks for fact-based evaluation

    Thomas Scialom, Paul-Alexis Dray, Sylvain Lamprier, Benjamin Piwowarski, Jacopo Staiano, Alex Wang, and Patrick Gallinari. QuestEval: Summarization asks for fact-based evaluation. In Proceedings of the 2021 Conference on Empirical Methods in Natural Language Processing, pages 6594–6604, 2021.

  20. [20]

    QAFactEval: Improved QA-based factual consistency evaluation for summarization

    Alexander Richard Fabbri, Chien-Sheng Wu, Wenhao Liu, and Caiming Xiong. QAFactEval: Improved QA-based factual consistency evaluation for summarization. In Proceedings of the 2022 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, pages 2587–2601, 2022.

  21. [21]

    FActScore: Fine-grained atomic evaluation of factual precision in long form text generation

    Sewon Min, Kalpesh Krishna, Xinxi Lyu, Mike Lewis, Wen-tau Yih, Pang Koh, Mohit Iyyer, Luke Zettlemoyer, and Hannaneh Hajishirzi. FActScore: Fine-grained atomic evaluation of factual precision in long form text generation. In Proceedings of the 2023 Conference on Empirical Methods in Natural Language Processing, pages 12076–12100, 2023.

  22. [22]

    G-Eval: NLG evaluation using GPT-4 with better human alignment

    Yang Liu, Dan Iter, Yichong Xu, Shuohang Wang, Ruochen Xu, and Chenguang Zhu. G-Eval: NLG evaluation using GPT-4 with better human alignment. In Proceedings of the 2023 Conference on Empirical Methods in Natural Language Processing, pages 2511–2522, 2023.

  23. [23]

    Leveraging passage retrieval with generative models for open domain question answering

    Gautier Izacard and Edouard Grave. Leveraging passage retrieval with generative models for open domain question answering. In Proceedings of the 16th Conference of the European Chapter of the Association for Computational Linguistics: Main Volume, pages 874–880, 2021.

  24. [24]

    HippoRAG: Neurobiologically inspired long-term memory for large language models

    Bernal Jiménez Gutiérrez, Yiheng Shu, Yu Gu, Michihiro Yasunaga, and Yu Su. HippoRAG: Neurobiologically inspired long-term memory for large language models. arXiv preprint arXiv:2405.14831, 2024.

  25. [25]

    RAPTOR: Recursive abstractive processing for tree-organized retrieval

    Parth Sarthi, Salman Abdullah, Aditi Tuli, Shubh Khanna, Anna Goldie, and Christopher D. Manning. RAPTOR: Recursive abstractive processing for tree-organized retrieval. arXiv preprint arXiv:2401.18059, 2024.

  26. [26]

    ELI5: Long form question answering

    Angela Fan, Yacine Jernite, Ethan Perez, David Grangier, Jason Weston, and Michael Auli. ELI5: Long form question answering. In Proceedings of the 57th Annual Meeting of the Association for Computational Linguistics, pages 3558–3567, 2019.

  27. [27]

    FEQA: A question answering evaluation framework for faithfulness assessment in abstractive summarization

    Esin Durmus, He He, and Mona Diab. FEQA: A question answering evaluation framework for faithfulness assessment in abstractive summarization. In Proceedings of the 58th Annual Meeting of the Association for Computational Linguistics, pages 5055–5070, 2020.

  28. [28]

    Judging LLM-as-a-judge with MT-Bench and Chatbot Arena

    Lianmin Zheng, Wei-Lin Chiang, Ying Sheng, Siyuan Zhuang, Zhanghao Wu, Yonghao Zhuang, Zi Lin, Zhuohan Li, Dacheng Li, Eric Xing, et al. Judging LLM-as-a-judge with MT-Bench and Chatbot Arena. Advances in Neural Information Processing Systems, 36:46595–46623, 2023.

  29. [29]

    Open scholarship and peer review: a time for experimentation

    David Soergel, Adam Saunders, and Andrew McCallum. Open scholarship and peer review: a time for experimentation. In ICML 2013 Workshop on Peer Reviewing and Publishing Models, 2013.

  30. [30]

    URL: https://openreview.net/forum?id=xf0zSBd2iufMg

  31. [31]

    Mapping and taking stock of the personal informatics literature

    Daniel A. Epstein, Clara Caldeira, Mayara Costa Figueiredo, Xi Lu, Lucas M. Silva, Lucretia Williams, Jong Ho Lee, Qingyang Li, Simran Ahuja, Qiuer Chen, et al. Mapping and taking stock of the personal informatics literature. Proceedings of the ACM on Interactive, Mobile, Wearable and Ubiquitous Technologies, 4(4):1–38, 2020.

  32. [32]

    MinerU: An open-source solution for precise document content extraction

    Bin Wang, Chao Xu, Xiaomeng Zhao, Linke Ouyang, Fan Wu, Zhiyuan Zhao, Rui Xu, Kaiwen Liu, Yuan Qu, Fukai Shang, et al. MinerU: An open-source solution for precise document content extraction. arXiv preprint arXiv:2409.18839, 2024.

  33. [33]

    GPT-4o System Card

    Aaron Hurst, Adam Lerer, Adam P. Goucher, Adam Perelman, Aditya Ramesh, Aidan Clark, AJ Ostrow, Akila Welihinda, Alan Hayes, Alec Radford, et al. GPT-4o system card. arXiv preprint arXiv:2410.21276, 2024.

  34. [34]

    KET-RAG: A cost-efficient multi-granular indexing framework for Graph-RAG

    Yiqian Huang, Shiqi Zhang, and Xiaokui Xiao. KET-RAG: A cost-efficient multi-granular indexing framework for Graph-RAG. In Proceedings of the 31st ACM SIGKDD Conference on Knowledge Discovery and Data Mining V.2, pages 1003–1012, 2025.

  35. [35]

    Qwen3 Technical Report

    An Yang, Anfeng Li, Baosong Yang, Beichen Zhang, Binyuan Hui, Bo Zheng, Bowen Yu, Chang Gao, Chengen Huang, Chenxu Lv, et al. Qwen3 technical report. arXiv preprint arXiv:2505.09388, 2025.

  36. [36]

    Nomic Embed: Training a reproducible long context text embedder

    Zach Nussbaum, John X. Morris, Brandon Duderstadt, and Andriy Mulyar. Nomic Embed: Training a reproducible long context text embedder, 2024.

  37. [37]

    GPT-5.1 Instant and GPT-5.1 Thinking system card addendum

    OpenAI. GPT-5.1 Instant and GPT-5.1 Thinking system card addendum. https://openai.com/index/gpt-5-system-card-addendum-gpt-5-1/, 2025. Accessed: 2026-04-14.

  38. [38]

    On big data benchmarking

    Rui Han, Xiaoyi Lu, and Jiangtao Xu. On big data benchmarking. In Workshop on Big Data Benchmarks, Performance Optimization, and Emerging Hardware, pages 3–18. Springer, 2014.

  39. [39]

    Predicting question-answering performance of large language models through semantic consistency

    Ella Rabinovich, Samuel Ackerman, Orna Raz, Eitan Farchi, and Ateret Anaby Tavor. Predicting question-answering performance of large language models through semantic consistency. In Proceedings of the Third Workshop on Natural Language Generation, Evaluation, and Metrics (GEM), pages 138–154, 2023.

  40. [40]

    MultiHop-RAG: Benchmarking retrieval-augmented generation for multi-hop queries

    Yixuan Tang and Yi Yang. MultiHop-RAG: Benchmarking retrieval-augmented generation for multi-hop queries. arXiv preprint arXiv:2401.15391, 2024.

  41. [41]

    MemoRAG: Boosting long context processing with global memory-enhanced retrieval augmentation

    Hongjin Qian, Zheng Liu, Peitian Zhang, Kelong Mao, Defu Lian, Zhicheng Dou, and Tiejun Huang. MemoRAG: Boosting long context processing with global memory-enhanced retrieval augmentation. In Proceedings of the ACM on Web Conference 2025, pages 2366–2377, 2025.

  42. [42]

    Standardizing the measurement of text diversity: A tool and a comparative analysis of scores

    Chantal Shaib, Venkata S. Govindarajan, Joe Barrow, Jiuding Sun, Alexa F. Siu, Byron C. Wallace, and Ani Nenkova. Standardizing the measurement of text diversity: A tool and a comparative analysis of scores. arXiv preprint arXiv:2403.00553, 2024.

  43. [43]

    Efficient memory management for large language model serving with PagedAttention

    Woosuk Kwon, Zhuohan Li, Siyuan Zhuang, Ying Sheng, Lianmin Zheng, Cody Hao Yu, Joseph E. Gonzalez, Hao Zhang, and Ion Stoica. Efficient memory management for large language model serving with PagedAttention. In Proceedings of the ACM SIGOPS 29th Symposium on Operating Systems Principles, 2023.

  44. [44]

    Ollama. https://github.com/ollama/ollama, 2024

    Ollama. https://github.com/ollama/ollama, 2024. Accessed: 2026-05-03.
