LongSumEval: Question-Answering Based Evaluation and Feedback-Driven Refinement for Long Document Summarization
Pith reviewed 2026-05-07 16:38 UTC · model grok-4.3
The pith
Question-answer pairs derived from the source document score long summaries with stronger agreement with human judgments and supply feedback for self-refinement.
A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.
Core claim
LongSumEval operationalizes summary quality as the answerability and factual alignment of question-answer pairs generated from the source document. This yields interpretable scores that identify coverage gaps and factual inconsistencies, plus structured feedback that serves as executable instructions for iteratively refining the summary.
What carries the argument
The QA-based evaluation module that generates question-answer pairs to assess answerability and factual alignment, providing both aggregate scores and detailed feedback for refinement.
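A minimal sketch of how such a QA-based scorer could work. Everything here is illustrative, not the authors' released API: the string-containment check stands in for the paper's QA model, and `QAPair` is a hypothetical container for source-derived question-answer pairs.

```python
from dataclasses import dataclass

@dataclass
class QAPair:
    question: str
    answer: str  # reference answer extracted from the source document

def answer_in_text(answer: str, text: str) -> bool:
    # Toy stand-in for a QA model: an answer counts as recoverable
    # if it appears verbatim (case-insensitive) in the summary.
    return answer.lower() in text.lower()

def evaluate_summary(summary: str, qa_pairs: list[QAPair]):
    """Score a summary by the fraction of source-derived QA pairs it answers.

    Returns an aggregate score plus per-question feedback naming the
    coverage gaps, mirroring the score-plus-feedback design described above.
    """
    feedback = []
    answered = 0
    for qa in qa_pairs:
        if answer_in_text(qa.answer, summary):
            answered += 1
        else:
            feedback.append(f"Summary does not answer {qa.question!r} "
                            f"(expected: {qa.answer!r})")
    score = answered / len(qa_pairs) if qa_pairs else 0.0
    return score, feedback

pairs = [
    QAPair("What framework is introduced?", "LongSumEval"),
    QAPair("How many benchmarks are used?", "seven"),
]
score, feedback = evaluate_summary("The paper introduces LongSumEval.", pairs)
```

The point of the sketch is the return shape: an aggregate number for benchmarking plus an itemized feedback list that a reviser can act on directly.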
If this is right
- Evaluation scores become directly usable for guiding generation improvements.
- Self-refinement can enhance summary quality iteratively without retraining models.
- Research on long document summarization gains a more reliable benchmarking tool.
- Similar feedback mechanisms could align evaluation and generation in other text tasks.
Where Pith is reading between the lines
- Adopting this could reduce the need for costly human evaluations in summarization research.
- Extensions might apply the same principle to other long-form generation tasks like report writing.
- The reliance on automatic QA generation assumes the questions capture key information, which could be tested by varying question quality.
Load-bearing premise
Automatically generated question-answer pairs from the source document will reliably surface the factual and coverage deficiencies that matter to human readers of the summary.
What would settle it
A new meta-evaluation on held-out long document datasets where the QA-based metric shows no higher correlation with human judgments than ROUGE or BERTScore, or where applying the feedback fails to improve summary quality as rated by humans.
Figures
Original abstract
Evaluating long document summaries remains the primary bottleneck in summarization research. Existing metrics correlate weakly with human judgments and produce aggregate scores without explaining deficiencies or guiding improvement, preventing effective refinement in applications requiring verifiable accuracy. We introduce LongSumEval, a unified framework bridging evaluation and generation through structured question-answering feedback. The framework operationalizes summary quality as answerability and factual alignment of question-answer pairs, generating interpretable scores and actionable feedback that identifies coverage gaps and factual inconsistencies. This resolves the misalignment where evaluation operates independently of generation objectives. Meta-evaluation of our QA-based evaluation module across seven benchmarks demonstrates substantially stronger agreement with human judgments compared to established metrics. Structured feedback enables significant quality improvements through self-refinement without retraining. By demonstrating that evaluation feedback can serve as executable instructions for generation, this work establishes a generalizable paradigm for aligning assessment with improvement, with direct implications for controllable text generation requiring verifiable accuracy and transparent quality control. All code and datasets will be released in GitHub for reproducibility.
Editorial analysis
A structured set of objections, weighed in public.
Referee Report
Summary. The paper introduces LongSumEval, a unified QA-based framework for evaluating long-document summaries by generating source-derived question-answer pairs and measuring their answerability and factual alignment in the summary. It claims this yields interpretable scores plus actionable feedback for identifying coverage gaps and inconsistencies, enabling self-refinement of summaries without retraining. Meta-evaluation across seven benchmarks is reported to show substantially stronger correlation with human judgments than existing metrics.
Significance. If the results hold after addressing the proxy-validation gap, the work would meaningfully advance summarization evaluation by aligning assessment directly with generation objectives and supplying executable feedback rather than opaque scores. The planned release of code and datasets is a clear strength for reproducibility. The approach could influence controllable generation tasks that require verifiable accuracy.
major comments (2)
- [Meta-evaluation] Meta-evaluation (abstract and corresponding section): the claim of substantially stronger agreement with human judgments rests on the unvalidated assumption that automatically generated QA pairs derived from the source document surface precisely the factual and coverage deficiencies that drive human ratings. No direct comparison between the errors flagged by the QA module and human-annotated error distributions is described; without this check the reported correlation advantage could be an artifact of the proxy rather than evidence of better alignment.
- [Framework] Framework definition (section describing answerability and factual alignment): the operationalization of summary quality via answerability scoring must be shown to be independent of modeling choices in the QA generator; if the scoring embeds the same inductive biases as the summarizer, the feedback loop risks circularity and the refinement gains may not generalize beyond the proxy.
minor comments (2)
- [Abstract] The abstract states results across 'seven benchmarks' but does not name them; an explicit list (with citations) in the abstract or early methods section would aid readers.
- [Methods] Notation for answerability and alignment scores should be introduced with a single equation or table early in the methods to avoid repeated prose definitions.
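As an illustration of what a single unified definition could look like, one hedged possibility (the symbols below are hypothetical, not the paper's notation): given QA pairs \((q_i, a_i)_{i=1}^{n}\) derived from source document \(D\), answerability and factual alignment of summary \(S\) might be written as

```latex
% Hypothetical notation, not the paper's own.
% QA(q_i, S) denotes the answer a QA model extracts from summary S,
% or \varnothing if the question is unanswerable from S.
\mathrm{Ans}(S) = \frac{1}{n}\sum_{i=1}^{n}
  \mathbb{1}\bigl[\mathrm{QA}(q_i, S) \neq \varnothing\bigr],
\qquad
\mathrm{Fact}(S) = \frac{1}{n}\sum_{i=1}^{n}
  \mathrm{sim}\bigl(\mathrm{QA}(q_i, S),\, a_i\bigr)
```

where \(\mathrm{sim}\) is some answer-overlap measure. Introducing a pair of definitions like this once, early in the methods, would remove the repeated prose definitions the comment flags.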
Simulated Author's Rebuttal
We thank the referee for the constructive and detailed feedback. We address each major comment below with clarifications and indicate where revisions will be made to strengthen the manuscript.
Point-by-point responses
-
Referee: [Meta-evaluation] Meta-evaluation (abstract and corresponding section): the claim of substantially stronger agreement with human judgments rests on the unvalidated assumption that automatically generated QA pairs derived from the source document surface precisely the factual and coverage deficiencies that drive human ratings. No direct comparison between the errors flagged by the QA module and human-annotated error distributions is described; without this check the reported correlation advantage could be an artifact of the proxy rather than evidence of better alignment.
Authors: We acknowledge that the meta-evaluation relies on correlation with human judgments rather than a direct matching of error types flagged by the QA module against human-annotated error distributions. The reported stronger correlations across seven benchmarks follow the standard protocol for validating automatic metrics and provide indirect support that the QA pairs surface deficiencies relevant to humans. However, we agree this leaves room for stronger validation of the proxy. In the revised manuscript we will add a qualitative analysis section with case studies comparing issues identified by LongSumEval to available human feedback and error annotations from the benchmarks. revision: partial
-
Referee: [Framework] Framework definition (section describing answerability and factual alignment): the operationalization of summary quality via answerability scoring must be shown to be independent of modeling choices in the QA generator; if the scoring embeds the same inductive biases as the summarizer, the feedback loop risks circularity and the refinement gains may not generalize beyond the proxy.
Authors: The QA generator produces questions and answers directly from the source document using models trained on general QA corpora, entirely separate from any summarization model being evaluated. Answerability and factual alignment are then measured by probing the summary against these source-derived pairs, typically via an independent verification step. This design avoids embedding the summarizer's inductive biases. We will expand the framework description in the revision to explicitly document the model separation, training data distinctions, and steps taken to prevent circularity, including sensitivity checks with alternate QA models. revision: yes
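The separation the authors describe can be made concrete with a small sketch of the refinement loop, where the evaluator and the reviser are independent callables. Both function names and the toy keyword-based evaluator below are placeholders, not the released API:

```python
def refine_until_good(summary, evaluate, revise, threshold=0.9, max_rounds=3):
    """Generic self-refinement loop: evaluate, then feed the structured
    feedback to an independent reviser until the score clears the
    threshold or the round budget is exhausted."""
    for _ in range(max_rounds):
        score, feedback = evaluate(summary)
        if score >= threshold or not feedback:
            return summary, score
        summary = revise(summary, feedback)
    return summary, evaluate(summary)[0]

# Toy evaluator/reviser to exercise the loop: the evaluator checks for
# required phrases; the reviser appends a sentence per missing phrase.
REQUIRED = ["LongSumEval", "seven benchmarks"]

def toy_evaluate(summary):
    missing = [k for k in REQUIRED if k not in summary]
    return 1 - len(missing) / len(REQUIRED), missing

def toy_revise(summary, feedback):
    return summary + " " + " ".join(f"It covers {k}." for k in feedback)

final, score = refine_until_good("A study of summaries.", toy_evaluate, toy_revise)
```

Because `evaluate` and `revise` are passed in separately, the circularity check the referee asks for reduces to swapping in alternate QA models for `evaluate` and verifying the refinement gains persist.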
Circularity Check
No significant circularity detected in derivation or claims.
Full rationale
The paper defines summary quality operationally via answerability and factual alignment of source-derived QA pairs, then reports an empirical meta-evaluation of this metric against human judgments across seven independent benchmarks, claiming stronger correlation than prior metrics. This comparison uses an external human standard rather than reducing to fitted parameters or self-referential inputs. Self-refinement applies the same feedback loop to improve outputs, but the primary claims of improved agreement and quality gains are presented as measured results, not derived by construction. No load-bearing self-citations, uniqueness theorems imported from prior author work, or ansatzes smuggled via citation appear in the abstract or context. The framework is self-contained against external benchmarks, consistent with the default expectation for non-circular papers.
Axiom & Free-Parameter Ledger
axioms (1)
- domain assumption: QA pairs derived from the source document capture the factual and coverage aspects that humans care about when judging summaries.
Reference graph
Works this paper leans on
- [1] M. Luo, B. Xue, and B. Niu, “A comprehensive survey for automatic text summarization: Techniques, approaches and perspectives,” Neurocomputing, vol. 603, p. 128280, 2024.
- [2] Y. Zhang, H. Jin, D. Meng, J. Wang, and J. Tan, “A comprehensive survey on automatic text summarization with exploration of llm-based methods,” Neurocomputing, p. 131928, 2025.
- [3] H. Y. Koh, J. Ju, H. Zhang, M. Liu, and S. Pan, “How far are we from robust long abstractive summarization?” in Proceedings of the 2022 Conference on Empirical Methods in Natural Language Processing, 2022, pp. 2682–2698.
- [4] E. Croxford, Y. Gao, N. Pellegrino, K. Wong, G. Wills, E. First, F. Liao, C. Goswami, B. Patterson, and M. Afshar, “Current and future state of evaluation of large language models for medical summarization tasks,” Npj Health Systems, vol. 2, no. 1, p. 6, 2025.
- [5] H. Mentzingen, N. António, and F. Bacao, “Effectiveness in retrieving legal precedents: exploring text summarization and cutting-edge language models toward a cost-efficient approach,” Artificial Intelligence and Law, pp. 1–21, 2025.
- [6] C. Hark, “The power of graphs in medicine: Introducing biographsum for effective text summarization,” Heliyon, vol. 10, no. 11, 2024.
- [7] H. Nguyen, H. Chen, L. Pobbathi, and J. Ding, “A comparative study of quality evaluation methods for text summarization,” arXiv preprint arXiv:2407.00747, 2024.
- [8] C.-Y. Lin, “Rouge: A package for automatic evaluation of summaries,” in Text Summarization Branches Out, 2004, pp. 74–81.
- [9] K. Papineni, S. Roukos, T. Ward, and W.-J. Zhu, “Bleu: a method for automatic evaluation of machine translation,” in Proceedings of the 40th Annual Meeting of the Association for Computational Linguistics, 2002, pp. 311–318.
- [10] T. Zhang, V. Kishore, F. Wu, K. Q. Weinberger, and Y. Artzi, “Bertscore: Evaluating text generation with bert,” in International Conference on Learning Representations, 2020.
- [11] A. R. Fabbri, W. Kryściński, B. McCann, C. Xiong, R. Socher, and D. Radev, “Summeval: Re-evaluating summarization evaluation,” Transactions of the Association for Computational Linguistics, vol. 9, pp. 391–409, 2021.
- [12] D. Deutsch, T. Bedrax-Weiss, and D. Roth, “Towards question-answering as an automatic metric for evaluating the content quality of a summary,” Transactions of the Association for Computational Linguistics, vol. 9, pp. 774–789, 2021.
- [13] T. Goyal, J. J. Li, and G. Durrett, “News summarization and evaluation in the era of gpt-3,” 2023.
- [14] Y. Kim, Y. Chang, M. Karpinska, A. Garimella, V. Manjunatha, K. Lo, T. Goyal, and M. Iyyer, “Fables: Evaluating faithfulness and content selection in book-length summarization,” arXiv preprint arXiv:2404.01261, 2024.
- [15] J. Ding, H. Chen, S. Kolapudi, L. Pobbathi, and H. Nguyen, “Quality evaluation of summarization models for patent documents,” in 2023 IEEE 23rd International Conference on Software Quality, Reliability, and Security (QRS), IEEE, 2023, pp. 250–259.
- [16] T. Scialom, P.-A. Dray, S. Lamprier, B. Piwowarski, J. Staiano, A. Wang, and P. Gallinari, “Questeval: Summarization asks for fact-based evaluation,” in Proceedings of the 2021 Conference on Empirical Methods in Natural Language Processing, 2021, pp. 6594–6604.
- [17] A. Wang, K. Cho, and M. Lewis, “Asking and answering questions to evaluate the factual consistency of summaries,” in Proceedings of the 58th Annual Meeting of the Association for Computational Linguistics, 2020, pp. 5008–5020.
- [18] J. Ding, H. Nguyen, and H. Chen, “Evaluation of question-answering based text summarization using llm (invited paper),” in 2024 IEEE International Conference on Artificial Intelligence Testing (AITest), IEEE, 2024, pp. 142–149.
- [19] O. Honovich, L. Choshen, R. Aharoni, E. Neeman, I. Szpektor, and O. Abend, “q2: Evaluating factual consistency in knowledge-grounded dialogues via question generation and question answering,” in Proceedings of the 2021 Conference on Empirical Methods in Natural Language Processing, 2021, pp. 7856–7870.
- [20] S. Zhang, D. Wan, A. Cattan, A. Klein, I. Dagan, and M. Bansal, “Qapyramid: Fine-grained evaluation of content selection for text summarization,” in Second Conference on Language Modeling, 2025.
- [21] E. Durmus, H. He, and M. Diab, “Feqa: A question answering evaluation framework for faithfulness assessment in abstractive summarization,” in Proceedings of the 58th Annual Meeting of the Association for Computational Linguistics, 2020, pp. 5055–5070.
- [22] J. Maynez, S. Narayan, B. Bohnet, and R. McDonald, “On faithfulness and factuality in abstractive summarization,” in Proceedings of the 58th Annual Meeting of the Association for Computational Linguistics, 2020, pp. 1906–1919.
- [23] Z. Cao, F. Wei, W. Li, and S. Li, “Faithful to the original: Fact aware neural abstractive summarization,” in Proceedings of the AAAI Conference on Artificial Intelligence, vol. 32, no. 1, 2018.
- [24] W. Kryściński, B. McCann, C. Xiong, and R. Socher, “Evaluating the factual consistency of abstractive text summarization,” in Proceedings of the 2020 Conference on Empirical Methods in Natural Language Processing (EMNLP), 2020, pp. 9332–9346.
- [25] L. Hemamou and M. Debiane, “Scaling up summarization: leveraging large language models for long text extractive summarization,” arXiv preprint arXiv:2408.15801, 2024.
- [26] A. Madaan, N. Tandon, P. Gupta, S. Hallinan, L. Gao, S. Wiegreffe, U. Alon, N. Dziri, S. Prabhumoye, Y. Yang et al., “Self-refine: Iterative refinement with self-feedback,” Advances in Neural Information Processing Systems, vol. 36, pp. 46534–46594, 2023.
- [27] X. Chen, M. Lin, N. Schärli, and D. Zhou, “Teaching large language models to self-debug,” in The Twelfth International Conference on Learning Representations, 2024.
- [28] S. Kim, J. Shin, Y. Cho, J. Jang, S. Longpre, H. Lee, S. Yun, S. Shin, S. Kim, J. Thorne, and M. Seo, “Prometheus: Inducing fine-grained evaluation capability in language models,” in The Twelfth International Conference on Learning Representations, 2024.
- [29] H. Zhang, X. Liu, and J. Zhang, “Summit: Iterative text summarization via chatgpt,” in Findings of the Association for Computational Linguistics: EMNLP 2023, 2023, pp. 10644–10657.
- [30] Y. Dong, S. Wang, Z. Gan, Y. Cheng, J. C. K. Cheung, and J. Liu, “Multi-fact correction in abstractive text summarization,” in Proceedings of the 2020 Conference on Empirical Methods in Natural Language Processing (EMNLP), 2020, pp. 9320–9331.
- [31] T. Wu, W. Yuan, O. Golovneva, J. Xu, Y. Tian, J. Jiao, J. E. Weston, and S. Sukhbaatar, “Meta-rewarding language models: Self-improving alignment with llm-as-a-meta-judge,” in Proceedings of the 2025 Conference on Empirical Methods in Natural Language Processing, 2025, pp. 11548–11565.
- [32] R. Yang, F. Ye, J. Li, S. Yuan, Y. Zhang, Z. Tu, X. Li, and D. Yang, “The lighthouse of language: Enhancing llm agents via critique-guided improvement,” arXiv preprint arXiv:2503.16024, 2025.
- [33] E. Brügge, S. Ricchizzi, M. Arenbeck, M. N. Keller, L. Schur, W. Stummer, M. Holling, M. H. Lu, and D. Darici, “Large language models improve clinical decision making of medical students through patient simulation and structured feedback: a randomized controlled trial,” BMC Medical Education, vol. 24, no. 1, p. 1391, 2024.
- [34] Z. Gou, Z. Shao, Y. Gong, Y. Shen, Y. Yang, N. Duan, and W. Chen, “CRITIC: Large language models can self-correct with tool-interactive critiquing,” in The Twelfth International Conference on Learning Representations, 2024.
- [35] T. Scialom, S. Lamprier, B. Piwowarski, and J. Staiano, “Answers unite! unsupervised metrics for reinforced summarization models,” in Proceedings of the 2019 Conference on Empirical Methods in Natural Language Processing and the 9th International Joint Conference on Natural Language Processing (EMNLP-IJCNLP), 2019, pp. 3246–3256.
- [36] K. Yang, Y. Tian, N. Peng, and D. Klein, “Re3: Generating longer stories with recursive reprompting and revision,” in Proceedings of the 2022 Conference on Empirical Methods in Natural Language Processing, 2022, pp. 4393–4479.
- [37] P. Laban, T. Schnabel, P. N. Bennett, and M. A. Hearst, “Summac: Re-visiting nli-based models for inconsistency detection in summarization,” Transactions of the Association for Computational Linguistics, vol. 10, pp. 163–177, 2022.
- [38] K. Du, W. Wang, B. Zhang, P. Wang, F. Zhang, L. Cao, and Y. Guo, “Iragkr: Iterative retrieval augmented generation with fine-grained knowledge refinement,” Neurocomputing, p. 131282, 2025.
- [39] W. Han, X. Xiao, Y. Li, J. Wang, M. Pechenizkiy, and M. Fang, “Adaptive iterative retrieval for enhanced retrieval-augmented generation,” Neurocomputing, p. 132272, 2025.
- [40] N. Stiennon, L. Ouyang, J. Wu, D. M. Ziegler, R. Lowe, C. Voss, A. Radford, D. Amodei, and P. Christiano, “Learning to summarize from human feedback,” in Proceedings of the 34th International Conference on Neural Information Processing Systems, ser. NIPS ’20, Red Hook, NY, USA: Curran Associates Inc., 2020.
- [41] E. Sharma, C. Li, and L. Wang, “Bigpatent: A large-scale dataset for abstractive and coherent summarization,” in Proceedings of the 57th Annual Meeting of the Association for Computational Linguistics, 2019, pp. 2204–2213.
- [42] A. Grattafiori, A. Dubey, A. Jauhri, A. Pandey, A. Kadian, A. Al-Dahle, A. Letman, A. Mathur, A. Schelten, A. Vaughan et al., “The llama 3 herd of models,” arXiv preprint arXiv:2407.21783, 2024.
- [43] W. Xu, D. Deutsch, M. Finkelstein, J. Juraska, B. Zhang, Z. Liu, W. Y. Wang, L. Li, and M. Freitag, “Llmrefine: Pinpointing and refining large language models via fine-grained actionable feedback,” in Findings of the Association for Computational Linguistics: NAACL 2024, 2024, pp. 1429–1445.
- [44] M. Wadhwa, X. Zhao, J. J. Li, and G. Durrett, “Learning to refine with fine-grained natural language feedback,” in Findings of the Association for Computational Linguistics: EMNLP 2024, 2024, pp. 12281–12308.
- [45] C. Yu, Y. Zhang, Z. Liu, Z. Ding, Y. Sun, and Z. Jin, “Frame: Feedback-refined agent methodology for enhancing medical research insights,” in Findings of the Association for Computational Linguistics: ACL 2025, 2025, pp. 7690–7704.
- [46] C. Samarinas, A. Krubner, A. Salemi, Y. Kim, and H. Zamani, “Beyond factual accuracy: Evaluating coverage of diverse factual information in long-form text generation,” in Findings of the Association for Computational Linguistics: ACL 2025, 2025, pp. 13468–13482.
- [47] Z. Yang, Y. Zhang, Y. Wang, Z. Xu, J. Lin, and Z. Sui, “Confidence vs critique: A decomposition of self-correction capability for llms,” in Proceedings of the 63rd Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), 2025, pp. 3998–4014.