pith. machine review for the scientific record.

arxiv: 2604.25130 · v1 · submitted 2026-04-28 · 💻 cs.CL

Recognition: unknown

LongSumEval: Question-Answering Based Evaluation and Feedback-Driven Refinement for Long Document Summarization

Authors on Pith: no claims yet

Pith reviewed 2026-05-07 16:38 UTC · model grok-4.3

classification 💻 cs.CL
keywords long document summarization · question answering evaluation · feedback driven refinement · human judgment correlation · self-refinement · factual consistency · summarization metrics

The pith

Question-answering pairs derived from source documents evaluate long summaries with stronger human agreement and support self-refinement.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper proposes that evaluation and improvement of long document summaries can be unified through a question-answering framework. By generating QA pairs from the original document, the approach measures how well a summary supports accurate answers to those questions and how closely its content aligns with the source, yielding both a score and specific feedback on gaps or errors. This addresses the limitation of current metrics, which correlate poorly with human judgments and offer no guidance for fixing issues. Experiments across seven benchmarks show stronger agreement with human judgments, and the feedback loop enables iterative improvement of summaries without any model retraining.

Core claim

LongSumEval operationalizes summary quality as the answerability and factual alignment of question-answer pairs generated from the source document. This produces interpretable scores that identify coverage gaps and factual inconsistencies, while supplying structured feedback that serves as executable instructions for iterative self-refinement of the summary.

What carries the argument

The QA-based evaluation module that generates question-answer pairs to assess answerability and factual alignment, providing both aggregate scores and detailed feedback for refinement.
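A minimal sketch of how such a module could be wired, assuming a generic `llm` callable, an `answer_similarity` scorer, and a similarity threshold; the prompt wording, function names, and default threshold are illustrative assumptions, not the paper's implementation.

```python
from dataclasses import dataclass

@dataclass
class QAPair:
    question: str
    reference_answer: str  # answered from the source document

def evaluate_summary(summary, qa_pairs, llm, answer_similarity, tau=0.7):
    """Score a summary by answering source-derived questions from the summary alone.

    Returns a coverage score, a factual-consistency score, and per-question
    feedback items that a refinement step can act on.
    """
    feedback, answered, consistent = [], 0, 0
    for qa in qa_pairs:
        # Try to answer each question using only the summary.
        summary_answer = llm(
            "Answer using ONLY this summary. If the summary does not contain "
            f"the answer, reply 'UNANSWERABLE'.\n\nSummary:\n{summary}\n\n"
            f"Question: {qa.question}"
        )
        if "UNANSWERABLE" in summary_answer:
            feedback.append(f"Coverage gap: summary does not answer '{qa.question}'")
            continue
        answered += 1
        # Compare the summary-based answer with the source-based reference answer.
        if answer_similarity(summary_answer, qa.reference_answer) >= tau:
            consistent += 1
        else:
            feedback.append(
                f"Possible inconsistency on '{qa.question}': summary says "
                f"'{summary_answer}', source says '{qa.reference_answer}'"
            )
    coverage = answered / len(qa_pairs) if qa_pairs else 0.0
    consistency = consistent / answered if answered else 0.0
    return coverage, consistency, feedback
```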

If this is right

  • Evaluation scores become directly usable for guiding generation improvements.
  • Self-refinement can enhance summary quality iteratively without retraining models (see the sketch after this list).
  • Research on long document summarization gains a more reliable benchmarking tool.
  • Similar feedback mechanisms could align evaluation and generation in other text tasks.
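A hedged sketch of the feedback-driven loop those bullets describe: evaluation feedback is fed back to the generator as revision instructions, with no weight updates anywhere. The stopping criteria, prompt wording, and the shape of `evaluate_fn` (which could be the `evaluate_summary` sketch above) are assumptions, not the authors' settings.

```python
def refine_summary(document, summary, qa_pairs, evaluate_fn, llm,
                   quality_threshold=0.9, max_rounds=3):
    """Iteratively revise a summary using QA-based feedback, without retraining.

    `evaluate_fn(summary, qa_pairs)` is expected to return
    (coverage, consistency, feedback) as in the evaluation sketch above.
    """
    for _ in range(max_rounds):
        coverage, consistency, feedback = evaluate_fn(summary, qa_pairs)
        # Stop once both scores clear the threshold or no actionable feedback remains.
        if min(coverage, consistency) >= quality_threshold or not feedback:
            break
        # The structured feedback doubles as revision instructions for the generator.
        summary = llm(
            "Revise the summary to address each issue below, staying faithful "
            f"to the source document.\n\nSource:\n{document}\n\n"
            f"Current summary:\n{summary}\n\nIssues:\n- " + "\n- ".join(feedback)
        )
    return summary
```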

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

  • Adopting this could reduce the need for costly human evaluations in summarization research.
  • Extensions might apply the same principle to other long-form generation tasks like report writing.
  • The reliance on automatic QA generation assumes the questions capture key information, which could be tested by varying question quality.

Load-bearing premise

Automatically generated question-answer pairs from the source document will reliably surface the factual and coverage deficiencies that matter to human readers of the summary.

What would settle it

A new meta-evaluation on held-out long document datasets where the QA-based metric shows no higher correlation with human judgments than ROUGE or BERTScore, or where applying the feedback fails to improve summary quality as rated by humans.
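The meta-evaluation sketched there amounts to a correlation test between metric scores and human ratings on the same set of summaries. A rough sketch, assuming `scipy` is available and that per-summary scores from each metric (e.g. LongSumEval, ROUGE, BERTScore) have already been computed; nothing here reflects the paper's actual protocol.

```python
from scipy.stats import kendalltau, spearmanr

def meta_evaluate(metric_scores, human_ratings):
    """Correlate one automatic metric with human judgments over the same summaries."""
    tau, _ = kendalltau(metric_scores, human_ratings)
    rho, _ = spearmanr(metric_scores, human_ratings)
    return {"kendall_tau": tau, "spearman_rho": rho}

# Hypothetical usage on one benchmark: compare the QA-based metric against baselines.
# results = {name: meta_evaluate(scores[name], human_ratings)
#            for name in ("longsumeval", "rouge_l", "bertscore")}
```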

Figures

Figures reproduced from arXiv: 2604.25130 by Haihua Chen, Haoxuan Zhang, Huyen Nguyen, Junhua Ding, Yang Zhang.

Figure 1. Overview of the LongSumEval framework. Evaluation Module: computes coverage and factual consistency scores via LLM-based question answering and produces structured feedback. Self-Refinement Module: uses the feedback to iteratively revise the summary until quality thresholds are met. The panel also reproduces Algorithm 1 (the evaluation module), whose inputs are the source document D, the generated summary S, and a similarity threshold τ.
Figure 2. Source document length distributions.
Figure 3. Model-generated summary length distributions.
Original abstract

Evaluating long document summaries remains the primary bottleneck in summarization research. Existing metrics correlate weakly with human judgments and produce aggregate scores without explaining deficiencies or guiding improvement, preventing effective refinement in applications requiring verifiable accuracy. We introduce LongSumEval, a unified framework bridging evaluation and generation through structured question-answering feedback. The framework operationalizes summary quality as answerability and factual alignment of question-answer pairs, generating interpretable scores and actionable feedback that identifies coverage gaps and factual inconsistencies. This resolves the misalignment where evaluation operates independently of generation objectives. Meta-evaluation of our QA-based evaluation module across seven benchmarks demonstrates substantially stronger agreement with human judgments compared to established metrics. Structured feedback enables significant quality improvements through self-refinement without retraining. By demonstrating that evaluation feedback can serve as executable instructions for generation, this work establishes a generalizable paradigm for aligning assessment with improvement, with direct implications for controllable text generation requiring verifiable accuracy and transparent quality control. All code and datasets will be released in GitHub for reproducibility.

Editorial analysis

A structured set of objections, weighed in public.

A referee report, simulated author's rebuttal, circularity audit, and axiom ledger. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Referee Report

2 major / 2 minor

Summary. The paper introduces LongSumEval, a unified QA-based framework for evaluating long-document summaries by generating source-derived question-answer pairs and measuring their answerability and factual alignment in the summary. It claims this yields interpretable scores plus actionable feedback for identifying coverage gaps and inconsistencies, enabling self-refinement of summaries without retraining. Meta-evaluation across seven benchmarks is reported to show substantially stronger correlation with human judgments than existing metrics.

Significance. If the results hold after addressing the proxy-validation gap, the work would meaningfully advance summarization evaluation by aligning assessment directly with generation objectives and supplying executable feedback rather than opaque scores. The planned release of code and datasets is a clear strength for reproducibility. The approach could influence controllable generation tasks that require verifiable accuracy.

major comments (2)
  1. [Meta-evaluation] Meta-evaluation (abstract and corresponding section): the claim of substantially stronger agreement with human judgments rests on the unvalidated assumption that automatically generated QA pairs derived from the source document surface precisely the factual and coverage deficiencies that drive human ratings. No direct comparison between the errors flagged by the QA module and human-annotated error distributions is described; without this check the reported correlation advantage could be an artifact of the proxy rather than evidence of better alignment.
  2. [Framework] Framework definition (section describing answerability and factual alignment): the operationalization of summary quality via answerability scoring must be shown to be independent of modeling choices in the QA generator; if the scoring embeds the same inductive biases as the summarizer, the feedback loop risks circularity and the refinement gains may not generalize beyond the proxy.
minor comments (2)
  1. [Abstract] The abstract states results across 'seven benchmarks' but does not name them; an explicit list (with citations) in the abstract or early methods section would aid readers.
  2. [Methods] Notation for answerability and alignment scores should be introduced with a single equation or table early in the methods to avoid repeated prose definitions.

Simulated Author's Rebuttal

2 responses · 0 unresolved

We thank the referee for the constructive and detailed feedback. We address each major comment below with clarifications and indicate where revisions will be made to strengthen the manuscript.

Point-by-point responses
  1. Referee: [Meta-evaluation] Meta-evaluation (abstract and corresponding section): the claim of substantially stronger agreement with human judgments rests on the unvalidated assumption that automatically generated QA pairs derived from the source document surface precisely the factual and coverage deficiencies that drive human ratings. No direct comparison between the errors flagged by the QA module and human-annotated error distributions is described; without this check the reported correlation advantage could be an artifact of the proxy rather than evidence of better alignment.

    Authors: We acknowledge that the meta-evaluation relies on correlation with human judgments rather than a direct matching of error types flagged by the QA module against human-annotated error distributions. The reported stronger correlations across seven benchmarks follow the standard protocol for validating automatic metrics and provide indirect support that the QA pairs surface deficiencies relevant to humans. However, we agree this leaves room for stronger validation of the proxy. In the revised manuscript we will add a qualitative analysis section with case studies comparing issues identified by LongSumEval to available human feedback and error annotations from the benchmarks. revision: partial
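The case-study analysis promised here could be made quantitative with a simple comparison of error-type distributions; a rough sketch under the assumption that both the QA module's feedback and the human annotations can be mapped to shared coarse labels (e.g. 'coverage_gap', 'inconsistency'), which the paper does not specify.

```python
from collections import Counter

def error_type_overlap(flagged_labels, human_labels):
    """Compare error types flagged by the QA module with human-annotated error types.

    Both inputs are lists of coarse labels for the same set of summaries.
    """
    flagged, human = Counter(flagged_labels), Counter(human_labels)
    labels = sorted(set(flagged) | set(human))
    # Overlap ratio: how much of the human-annotated error mass the module also flags.
    overlap = sum(min(flagged[l], human[l]) for l in labels) / max(sum(human.values()), 1)
    return {"per_label": {l: (flagged[l], human[l]) for l in labels}, "overlap": overlap}
```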

  2. Referee: [Framework] Framework definition (section describing answerability and factual alignment): the operationalization of summary quality via answerability scoring must be shown to be independent of modeling choices in the QA generator; if the scoring embeds the same inductive biases as the summarizer, the feedback loop risks circularity and the refinement gains may not generalize beyond the proxy.

    Authors: The QA generator produces questions and answers directly from the source document using models trained on general QA corpora, entirely separate from any summarization model being evaluated. Answerability and factual alignment are then measured by probing the summary against these source-derived pairs, typically via an independent verification step. This design avoids embedding the summarizer's inductive biases. We will expand the framework description in the revision to explicitly document the model separation, training data distinctions, and steps taken to prevent circularity, including sensitivity checks with alternate QA models. revision: yes
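One concrete form the promised sensitivity check could take: regenerate QA pairs with independent generators and test whether the resulting scores rank summaries the same way. A sketch under assumed interfaces (`qa_generators` mapping a document to QA pairs, `coverage_fn` returning a scalar score per summary); neither is specified in the paper.

```python
from statistics import correlation  # Pearson r, available in Python 3.10+

def qa_model_sensitivity(document, summaries, qa_generators, coverage_fn):
    """Check whether scores are stable across different QA-generation models."""
    scores_by_generator = []
    for generate_qa_pairs in qa_generators:
        qa_pairs = generate_qa_pairs(document)
        scores_by_generator.append(
            [coverage_fn(summary, qa_pairs) for summary in summaries]
        )
    # High pairwise correlation with the first generator suggests the metric
    # is not tied to one QA model's inductive biases.
    base = scores_by_generator[0]
    return [correlation(base, other) for other in scores_by_generator[1:]]
```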

Circularity Check

0 steps flagged

No significant circularity detected in derivation or claims.

full rationale

The paper defines summary quality operationally via answerability and factual alignment of source-derived QA pairs, then reports an empirical meta-evaluation of this metric against human judgments across seven independent benchmarks, claiming stronger correlation than prior metrics. This comparison uses an external human standard rather than reducing to fitted parameters or self-referential inputs. Self-refinement applies the same feedback loop to improve outputs, but the primary claims of improved agreement and quality gains are presented as measured results, not derived by construction. No load-bearing self-citations, uniqueness theorems imported from prior author work, or ansatzes smuggled via citation appear in the abstract or context. The framework is self-contained against external benchmarks, consistent with the default expectation for non-circular papers.

Axiom & Free-Parameter Ledger

0 free parameters · 1 axiom · 0 invented entities

The framework rests on the assumption that question-answer pairs generated from the source can serve as a faithful proxy for summary quality; no explicit free parameters or invented entities are named in the abstract, but the QA-generation procedure itself likely contains modeling choices and thresholds.

axioms (1)
  • domain assumption: QA pairs derived from the source document capture the factual and coverage aspects that humans care about when judging summaries
    Invoked throughout the abstract as the basis for both scoring and feedback

pith-pipeline@v0.9.0 · 5483 in / 1244 out tokens · 32972 ms · 2026-05-07T16:38:52.254771+00:00 · methodology

