Enhancing Factuality through Consensus and Consistency in Summarization Using Minimum Bayes Risk Decoding

Hidetaka Kamigaito; Jingun Kwon; Manabu Okumura; Riza Setiawan Soetedjo; Taro Watanabe; Yusuke Sakai

arxiv: 2605.29336 · v1 · pith:2RI3MAIKnew · submitted 2026-05-28 · 💻 cs.CL

Enhancing Factuality through Consensus and Consistency in Summarization Using Minimum Bayes Risk Decoding

Riza Setiawan Soetedjo , Yusuke Sakai , Hidetaka Kamigaito , Jingun Kwon , Manabu Okumura , Taro Watanabe This is my paper

Pith reviewed 2026-06-29 07:36 UTC · model grok-4.3

classification 💻 cs.CL

keywords summarizationfactualityMinimum Bayes Riskrerankingconsensusconsistencytext generation

0 comments

The pith

Reranking summaries by source consistency plus consensus among candidates improves factuality.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper aims to show that reranking multiple generated summaries by both their agreement with each other and their factual alignment to the source document produces more reliable outputs than reranking against the source alone. Current reranking methods rely only on the source as reference, which can still yield unfaithful summaries. By using Minimum Bayes Risk decoding to measure consensus across candidates and pairing it with factuality metrics against the source, the approach selects summaries that better preserve source content. Human evaluations indicate these summaries are preferred over those from prior systems, while automatic tests show competitive performance.

Core claim

ConSUM reranks candidate summaries by establishing consensus via Minimum Bayes Risk decoding over the set of generated summaries while enforcing consistency through factuality-aware metrics that compare each summary to the source document, resulting in outputs that are competitive with existing methods and preferred in human evaluations.

What carries the argument

Minimum Bayes Risk decoding applied to the set of generated candidate summaries to quantify consensus, combined with factuality metrics that score alignment to the source document.

If this is right

The selected summaries achieve higher human preference scores than prior reranking systems.
Automatic factuality metrics show performance on par with existing methods.
The method applies to any summarization model that can generate multiple candidates.
Consensus is computed without additional training or external models beyond the factuality metrics.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

The same dual-criterion reranking could be tested on tasks like question answering or dialogue generation where multiple outputs are available.
If consensus among candidates correlates with factuality, it might reduce the need for expensive reference-based metrics in some settings.
Extending the approach to longer documents would require checking whether MBR consensus remains stable as candidate length increases.

Load-bearing premise

That measuring agreement among multiple generated candidates will reliably identify summaries that are more faithful to the source than source-only selection.

What would settle it

A direct comparison where summaries chosen by the consensus-plus-consistency method receive lower factuality scores or human ratings than those chosen by source-only reranking on the same set of candidates.

Figures

Figures reproduced from arXiv: 2605.29336 by Hidetaka Kamigaito, Jingun Kwon, Manabu Okumura, Riza Setiawan Soetedjo, Taro Watanabe, Yusuke Sakai.

**Figure 2.** Figure 2: Overview of ConSUM comprising three steps: [PITH_FULL_IMAGE:figures/full_fig_p002_2.png] view at source ↗

**Figure 3.** Figure 3: Correlation between the number of facts ex [PITH_FULL_IMAGE:figures/full_fig_p008_3.png] view at source ↗

**Figure 5.** Figure 5: Rank distribution of each system. Top: The rank distribution based on the average ranking between annotators for each system. Bottom: The standard deviation of each annotator for each system. 6 Conclusion In this work, we introduced ConSUM, a novel reranking framework designed to address a key limitation in summary evaluation: the reliance on either a source document or reference summaries alone. We hypoth… view at source ↗

**Figure 6.** Figure 6: Average of normalized evaluation metrics [PITH_FULL_IMAGE:figures/full_fig_p013_6.png] view at source ↗

**Figure 7.** Figure 7: Average of normalized metrics based on the pseudo-reference type. Each chart represent the sampling [PITH_FULL_IMAGE:figures/full_fig_p015_7.png] view at source ↗

**Figure 8.** Figure 8: Comparison between SimCLS models by DBS setting summarization through Gap Sentence Generation (GSG) objective. This objective forces the model to learn a high-level understanding of the source text. Our study used the publicly available PEGASUS model, fine-tuned to the respective dataset, CNN/DM9 and XSum10 T5 T5 stands for Text-to-Text Transfer Transformer (Raffel et al., 2023). It is a model that unifi… view at source ↗

**Figure 9.** Figure 9: MTurk Page for human evaluation. 19 [PITH_FULL_IMAGE:figures/full_fig_p019_9.png] view at source ↗

read the original abstract

Improving the quality of model-generated summaries, especially factuality, the accuracy of a summary with respect to its source content, remains a challenge. While reranking could select the optimal output from multiple generated candidates, it is limited to only using the source as guidance, resulting in unreliable summaries. To address this limitation, we propose ConSUM that reranks candidate summaries by considering two factors: consistency to the source document and consensus among the other candidates. Consensus is established using Minimum Bayes Risk (MBR) decoding over the set of generated summaries, while ensuring consistency by employing factuality-aware metrics that compare the summary against the source. Rigorous testing demonstrates that our system is competitive with existing methods, with human evaluations further confirming that its generated summaries are preferred over those from other systems. Our code is available at https://github.com/naist-nlp/ConSUM .

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Desk Editor's Note private letter to a colleague

ConSUM applies MBR consensus on top of source factuality metrics for summary reranking, but the abstract gives no numbers so the added value is unclear.

read the letter

The paper's main contribution is ConSUM, a reranking system that scores candidate summaries on two axes: consistency with the source via factuality metrics, and consensus with other candidates via Minimum Bayes Risk decoding. They release the code, which is a practical plus for anyone who wants to plug this into an existing summarization pipeline.

What stands out is the straightforward pairing of an established decoding technique with existing factuality measures. The abstract positions this as addressing the limit of source-only reranking, and the human preference claim suggests they ran at least some evaluation beyond automatic metrics.

The soft spot is the complete absence of numbers, datasets, baselines, or ablations in the abstract. Without those, it's impossible to judge whether the MBR term actually supplies information beyond what the source metrics already capture. The stress-test concern about correlated hallucinations among candidates from one model is reasonable and not addressed in the provided text; if the experiments do not isolate the consensus contribution or test on diverse generators, the central claim weakens.

This is for people already working on summarization factuality or decoding/reranking methods. A reader who needs a ready-to-try post-hoc filter and is willing to run their own checks would get something usable from the code and description. It does not introduce new theory or large-scale results.

I would send it for peer review. The idea is simple enough to evaluate, the code is public, and referees can directly test whether the consensus step improves over source metrics alone.

Referee Report

2 major / 1 minor

Summary. The paper proposes ConSUM, a reranking approach for model-generated summaries that combines consistency to the source document (via factuality-aware metrics) with consensus among multiple candidate summaries (via Minimum Bayes Risk decoding). It claims this yields summaries competitive with existing methods and preferred in human evaluations, with code released at a GitHub link.

Significance. If the empirical results hold and the MBR term demonstrably adds value beyond source metrics, the method would provide a practical, reproducible technique for improving factuality in summarization without requiring new model training. The open code is a clear strength for verification.

major comments (2)

[Abstract] Abstract: the central claim that 'rigorous testing demonstrates that our system is competitive with existing methods' and that 'human evaluations further confirming' preference is unsupported by any numbers, baselines, datasets, or error analysis in the provided text, preventing assessment of whether the reported gains are real or meaningful.
Abstract / method description: the claim that combining source-consistency metrics with MBR consensus improves factuality rests on the untested assumption that the MBR term supplies information orthogonal to the source metrics. No ablations isolating the consensus contribution (or showing that MBR does not simply reinforce correlated model errors) are described, which is load-bearing for the stated advantage over source-only reranking.

minor comments (1)

[Abstract] Abstract: the description of the utility function inside MBR and the specific factuality metrics used could be stated more explicitly even at this level.

Simulated Author's Rebuttal

2 responses · 0 unresolved

We thank the referee for the detailed review and constructive feedback. We address each major comment below, indicating planned revisions where appropriate. The full manuscript contains the empirical results referenced in the abstract; we will strengthen the abstract and add requested analyses.

read point-by-point responses

Referee: [Abstract] Abstract: the central claim that 'rigorous testing demonstrates that our system is competitive with existing methods' and that 'human evaluations further confirming' preference is unsupported by any numbers, baselines, datasets, or error analysis in the provided text, preventing assessment of whether the reported gains are real or meaningful.

Authors: Abstracts are intentionally concise and do not typically embed full numerical results, baselines, or error analyses, which are instead reported in the main body (Sections 4 and 5, with tables comparing against multiple baselines on CNN/DM and XSum, plus human preference scores). The claims in the abstract are therefore summaries of those detailed findings rather than standalone assertions. To improve clarity for readers who encounter only the abstract, we will revise it to include one or two key quantitative results (e.g., factuality metric gains and human preference rates) while preserving length constraints. revision: yes
Referee: [—] Abstract / method description: the claim that combining source-consistency metrics with MBR consensus improves factuality rests on the untested assumption that the MBR term supplies information orthogonal to the source metrics. No ablations isolating the consensus contribution (or showing that MBR does not simply reinforce correlated model errors) are described, which is load-bearing for the stated advantage over source-only reranking.

Authors: The referee correctly identifies that the current manuscript does not present explicit ablations that isolate the incremental contribution of the MBR consensus term over source-consistency metrics alone, nor does it directly test for reinforcement of correlated model errors. This is a substantive gap for the central claim. We will add these ablations in the revision (new subsection comparing source-only reranking, MBR-only, and the combined ConSUM objective, plus analysis of error overlap across candidates) to demonstrate orthogonality where it exists. revision: yes

Circularity Check

0 steps flagged

No circularity: empirical reranking method with external validation

full rationale

The paper presents an empirical proposal (ConSUM) that combines existing MBR decoding for candidate consensus with standard factuality metrics against the source document, followed by experimental comparisons and human evaluation. No derivation chain, equations, or fitted parameters are described that reduce by construction to the method's own inputs. Reliance on prior MBR literature and external metrics constitutes normal citation rather than self-referential load-bearing. The central claim rests on empirical results, not on any self-definition or imported uniqueness theorem.

Axiom & Free-Parameter Ledger

0 free parameters · 0 axioms · 0 invented entities

Review based solely on abstract; no equations, parameters, or entities detailed. No free parameters, axioms, or invented entities identifiable from provided text.

pith-pipeline@v0.9.1-grok · 5700 in / 1071 out tokens · 20658 ms · 2026-06-29T07:36:55.360531+00:00 · methodology

discussion (0)

Reference graph

Works this paper leans on

45 extracted references · 35 canonical work pages · 3 internal anchors

[1]

Amanda Bertsch, Alex Xie, Graham Neubig, and Matthew Gormley. 2023. https://doi.org/10.18653/v1/2023.bigpicture-1.9 It's MBR All the Way Down : Modern Generation Techniques Through the Lens of Minimum Bayes Risk . In Proceedings of the Big Picture Workshop , pages 108--122, Singapore. Association for Computational Linguistics

work page doi:10.18653/v1/2023.bigpicture-1.9 2023
[2]

Yanran Chen and Steffen Eger. 2023. https://doi.org/10.1162/tacl_a_00576 MENLI : Robust Evaluation Metrics from Natural Language Inference . Transactions of the Association for Computational Linguistics, 11:804--825

work page doi:10.1162/tacl_a_00576 2023
[3]

Julius Cheng and Andreas Vlachos. 2023. https://doi.org/10.18653/v1/2023.emnlp-main.767 Faster Minimum Bayes Risk Decoding with Confidence-based Pruning . In Proceedings of the 2023 Conference on Empirical Methods in Natural Language Processing , pages 12473--12480, Singapore. Association for Computational Linguistics

work page doi:10.18653/v1/2023.emnlp-main.767 2023
[4]

Anshuman Chhabra, Hadi Askari, and Prasant Mohapatra. 2024. https://doi.org/10.18653/v1/2024.naacl-short.1 Revisiting zero-shot abstractive summarization in the era of large language models from the perspective of position bias . In Proceedings of the 2024 Conference of the North American Chapter of the Association for Computational Linguistics: Human Lan...

work page doi:10.18653/v1/2024.naacl-short.1 2024
[5]

Hiroyuki Deguchi, Yusuke Sakai, Hidetaka Kamigaito, and Taro Watanabe. 2024. https://doi.org/10.18653/v1/2024.emnlp-demo.37 mbrs: A library for minimum B ayes risk decoding . In Proceedings of the 2024 Conference on Empirical Methods in Natural Language Processing: System Demonstrations, pages 351--362, Miami, Florida, USA. Association for Computational L...

work page doi:10.18653/v1/2024.emnlp-demo.37 2024
[6]

Tanay Dixit, Fei Wang, and Muhao Chen. 2023. https://doi.org/10.18653/v1/2023.acl-short.78 Improving Factuality of Abstractive Summarization without Sacrificing Summary Quality . In Proceedings of the 61st Annual Meeting of the Association for Computational Linguistics ( Volume 2: Short Papers ) , pages 902--913, Toronto, Canada. Association for Computati...

work page doi:10.18653/v1/2023.acl-short.78 2023
[7]

Bryan Eikema and Wilker Aziz. 2020. https://doi.org/10.18653/v1/2020.coling-main.398 Is MAP Decoding All You Need ? The Inadequacy of the Mode in Neural Machine Translation . In Proceedings of the 28th International Conference on Computational Linguistics , pages 4506--4520, Barcelona, Spain (Online). International Committee on Computational Linguistics

work page doi:10.18653/v1/2020.coling-main.398 2020
[8]

Fabbri, Wojciech Kryściński, Bryan McCann, Caiming Xiong, Richard Socher, and Dragomir Radev

Alexander R. Fabbri, Wojciech Kry \'s ci \'n ski, Bryan McCann, Caiming Xiong, Richard Socher, and Dragomir Radev. 2021. https://doi.org/10.1162/tacl_a_00373 SummEval : Re-evaluating Summarization Evaluation . Transactions of the Association for Computational Linguistics, 9:391--409

work page doi:10.1162/tacl_a_00373 2021
[9]

Markus Freitag, Behrooz Ghorbani, and Patrick Fernandes. 2023. https://doi.org/10.18653/v1/2023.findings-emnlp.617 Epsilon Sampling Rocks : Investigating Sampling Strategies for Minimum Bayes Risk Decoding for Machine Translation . In Findings of the Association for Computational Linguistics : EMNLP 2023 , pages 9198--9209, Singapore. Association for Comp...

work page doi:10.18653/v1/2023.findings-emnlp.617 2023
[10]

Aaron Grattafiori, Abhimanyu Dubey, Abhinav Jauhri, Abhinav Pandey, Abhishek Kadian, Ahmad Al-Dahle, Aiesha Letman, Akhil Mathur, Alan Schelten, Alex Vaughan, Amy Yang, Angela Fan, Anirudh Goyal, Anthony Hartshorn, Aobo Yang, Archi Mitra, Archie Sravankumar, Artem Korenev, Arthur Hinsvark, and 542 others. 2024. https://arxiv.org/abs/2407.21783 The llama 3...

work page internal anchor Pith review Pith/arXiv arXiv 2024
[11]

Pengcheng He, Xiaodong Liu, Jianfeng Gao, and Weizhu Chen. 2021. https://openreview.net/forum?id=XPZIaotutsD Deberta: Decoding-enhanced bert with disentangled attention . In International Conference on Learning Representations

2021
[12]

John Hewitt, Christopher Manning, and Percy Liang. 2022. https://doi.org/10.18653/v1/2022.findings-emnlp.249 Truncation sampling as language model desmoothing . In Findings of the Association for Computational Linguistics: EMNLP 2022, pages 3414--3427, Abu Dhabi, United Arab Emirates. Association for Computational Linguistics

work page doi:10.18653/v1/2022.findings-emnlp.249 2022
[13]

Ari Holtzman, Jan Buys, Li Du, Maxwell Forbes, and Yejin Choi. 2020. https://openreview.net/forum?id=rygGQyrFvH The curious case of neural text degeneration . In International Conference on Learning Representations

2020
[14]

Hidetaka Kamigaito, Hiroyuki Deguchi, Yusuke Sakai, Katsuhiko Hayashi, and Taro Watanabe. 2025. https://aclanthology.org/2025.acl-long.1410/ Diversity explains inference scaling laws: Through a case study of minimum B ayes risk decoding . In Proceedings of the 63rd Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), pa...

2025
[15]

M.G. Kendall. 1948. https://books.google.co.jp/books?id=hiBMAAAAMAAJ Rank Correlation Methods . C. Griffin

1948
[16]

Philipp Koehn. 2004. https://aclanthology.org/W04-3250/ Statistical significance tests for machine translation evaluation . In Proceedings of the 2004 Conference on Empirical Methods in Natural Language Processing, pages 388--395, Barcelona, Spain. Association for Computational Linguistics

2004
[17]

Geza Kovacs, Daniel Deutsch, and Markus Freitag. 2024. https://doi.org/10.18653/v1/2024.wmt-1.109 Mitigating Metric Bias in Minimum Bayes Risk Decoding . In Proceedings of the Ninth Conference on Machine Translation , pages 1063--1094, Miami, Florida, USA. Association for Computational Linguistics

work page doi:10.18653/v1/2024.wmt-1.109 2024
[18]

Mike Lewis, Yinhan Liu, Naman Goyal, Marjan Ghazvininejad, Abdelrahman Mohamed, Omer Levy, Veselin Stoyanov, and Luke Zettlemoyer. 2020. https://doi.org/10.18653/v1/2020.acl-main.703 BART : Denoising Sequence-to-Sequence Pre-training for Natural Language Generation , Translation , and Comprehension . In Proceedings of the 58th Annual Meeting of the Associ...

work page doi:10.18653/v1/2020.acl-main.703 2020
[19]

Chin-Yew Lin. 2004. https://aclanthology.org/W04-1013/ ROUGE : A package for automatic evaluation of summaries . In Text Summarization Branches Out, pages 74--81, Barcelona, Spain. Association for Computational Linguistics

2004
[20]

Yixin Liu, Budhaditya Deb, Milagro Teruel, Aaron Halfaker, Dragomir Radev, and Ahmed Hassan Awadallah. 2023. https://doi.org/10.18653/v1/2023.acl-long.844 On Improving Summarization Factual Consistency from Natural Language Feedback . In Proceedings of the 61st Annual Meeting of the Association for Computational Linguistics ( Volume 1: Long Papers ) , pag...

work page doi:10.18653/v1/2023.acl-long.844 2023
[21]

Yixin Liu and Pengfei Liu. 2021. https://doi.org/10.18653/v1/2021.acl-short.135 SimCLS : A Simple Framework for Contrastive Learning of Abstractive Summarization . In Proceedings of the 59th Annual Meeting of the Association for Computational Linguistics and the 11th International Joint Conference on Natural Language Processing ( Volume 2: Short Papers ) ...

work page doi:10.18653/v1/2021.acl-short.135 2021
[22]

Joshua Maynez, Shashi Narayan, Bernd Bohnet, and Ryan McDonald. 2020. https://doi.org/10.18653/v1/2020.acl-main.173 On faithfulness and factuality in abstractive summarization . In Proceedings of the 58th Annual Meeting of the Association for Computational Linguistics, pages 1906--1919, Online. Association for Computational Linguistics

work page doi:10.18653/v1/2020.acl-main.173 2020
[23]

Mathias Müller and Rico Sennrich. 2021. https://doi.org/10.18653/v1/2021.acl-long.22 Understanding the Properties of Minimum Bayes Risk Decoding in Neural Machine Translation . In Proceedings of the 59th Annual Meeting of the Association for Computational Linguistics and the 11th International Joint Conference on Natural Language Processing ( Volume 1: Lo...

work page doi:10.18653/v1/2021.acl-long.22 2021
[24]

Ramesh Nallapati, Bowen Zhou, Cicero dos Santos, C a g lar Gu l c ehre, and Bing Xiang. 2016. https://doi.org/10.18653/v1/K16-1028 Abstractive text summarization using sequence-to-sequence RNN s and beyond . In Proceedings of the 20th SIGNLL Conference on Computational Natural Language Learning , pages 280--290, Berlin, Germany. Association for Computatio...

work page doi:10.18653/v1/k16-1028 2016
[25]

Hardt, M., Recht, B., and Singer, Y

Shashi Narayan, Shay B. Cohen, and Mirella Lapata. 2018. https://doi.org/10.18653/v1/D18-1206 Don`t Give Me the Details , Just the Summary ! Topic-Aware Convolutional Neural Networks for Extreme Summarization . In Proceedings of the 2018 Conference on Empirical Methods in Natural Language Processing , pages 1797--1807, Brussels, Belgium. Association for C...

work page doi:10.18653/v1/d18-1206 2018
[26]

Koki Natsumi, Hiroyuki Deguchi, Yusuke Sakai, Hidetaka Kamigaito, and Taro Watanabe. 2025. https://doi.org/10.18653/v1/2025.ijcnlp-short.39 Agreement-constrained probabilistic minimum B ayes risk decoding . In Proceedings of the 14th International Joint Conference on Natural Language Processing and the 4th Conference of the Asia-Pacific Chapter of the Ass...

work page doi:10.18653/v1/2025.ijcnlp-short.39 2025
[27]

Ani Nenkova and Rebecca Passonneau. 2004. https://aclanthology.org/N04-1019/ Evaluating Content Selection in Summarization : The Pyramid Method . In Proceedings of the Human Language Technology Conference of the North American Chapter of the Association for Computational Linguistics : HLT - NAACL 2004 , pages 145--152, Boston, Massachusetts, USA. Associat...

2004
[28]

Atsumoto Ohashi, Ukyo Honda, Tetsuro Morimura, and Yuu Jinnai. 2024. https://doi.org/10.18653/v1/2024.naacl-short.38 On the True Distribution Approximation of Minimum Bayes - Risk Decoding . In Proceedings of the 2024 Conference of the North American Chapter of the Association for Computational Linguistics : Human Language Technologies ( Volume 2: Short P...

work page doi:10.18653/v1/2024.naacl-short.38 2024
[29]

Colin Raffel, Noam Shazeer, Adam Roberts, Katherine Lee, Sharan Narang, Michael Matena, Yanqi Zhou, Wei Li, and Peter J. Liu. 2023. https://doi.org/10.48550/arXiv.1910.10683 Exploring the Limits of Transfer Learning with a Unified Text-to-Text Transformer . Preprint, arXiv:1910.10683

work page internal anchor Pith review Pith/arXiv arXiv doi:10.48550/arxiv.1910.10683 2023
[30]

Mathieu Ravaut, Shafiq Joty, and Nancy Chen. 2022. https://doi.org/10.18653/v1/2022.acl-long.309 SummaReranker : A Multi-Task Mixture-of-Experts Re-ranking Framework for Abstractive Summarization . In Proceedings of the 60th Annual Meeting of the Association for Computational Linguistics ( Volume 1: Long Papers ) , pages 4504--4524, Dublin, Ireland. Assoc...

work page doi:10.18653/v1/2022.acl-long.309 2022
[31]

Mathieu Ravaut, Shafiq Joty, and Nancy Chen. 2023. https://doi.org/10.18653/v1/2023.findings-acl.529 Unsupervised summarization re-ranking . In Findings of the Association for Computational Linguistics: ACL 2023, pages 8341--8376, Toronto, Canada. Association for Computational Linguistics

work page doi:10.18653/v1/2023.findings-acl.529 2023
[32]

Sangwon Ryu, Heejin Do, Yunsu Kim, Gary Lee, and Jungseul Ok. 2024. https://doi.org/10.18653/v1/2024.acl-long.319 Multi- Dimensional Optimization for Text Summarization via Reinforcement Learning . In Proceedings of the 62nd Annual Meeting of the Association for Computational Linguistics ( Volume 1: Long Papers ) , pages 5858--5871, Bangkok, Thailand. Ass...

work page doi:10.18653/v1/2024.acl-long.319 2024
[33]

Alessandro Scir \`e , Karim Ghonim, and Roberto Navigli. 2024. https://doi.org/10.18653/v1/2024.findings-acl.841 FENICE : Factuality Evaluation of summarization based on Natural language Inference and Claim Extraction . In Findings of the Association for Computational Linguistics : ACL 2024 , pages 14148--14161, Bangkok, Thailand. Association for Computat...

work page doi:10.18653/v1/2024.findings-acl.841 2024
[34]

Jeewoo Sul and Yong Suk Choi. 2023. https://doi.org/10.18653/v1/2023.acl-short.56 Balancing Lexical and Semantic Quality in Abstractive Summarization . In Proceedings of the 61st Annual Meeting of the Association for Computational Linguistics ( Volume 2: Short Papers ) , pages 637--647, Toronto, Canada. Association for Computational Linguistics

work page doi:10.18653/v1/2023.acl-short.56 2023
[35]

Mirac Suzgun, Luke Melas-Kyriazi, and Dan Jurafsky. 2023. https://doi.org/10.18653/v1/2023.findings-acl.262 Follow the wisdom of the crowd: Effective text generation via minimum B ayes risk decoding . In Findings of the Association for Computational Linguistics: ACL 2023, pages 4265--4293, Toronto, Canada. Association for Computational Linguistics

work page doi:10.18653/v1/2023.findings-acl.262 2023
[36]

Firas Trabelsi, David Vilar, Mara Finkelstein, and Markus Freitag. 2024. https://doi.org/10.48550/arXiv.2406.02832 Efficient Minimum Bayes Risk Decoding using Low-Rank Matrix Completion Algorithms . Preprint, arXiv:2406.02832

work page doi:10.48550/arxiv.2406.02832 2024
[37]

Diverse Beam Search: Decoding Diverse Solutions from Neural Sequence Models

Ashwin K. Vijayakumar, Michael Cogswell, Ramprasath R. Selvaraju, Qing Sun, Stefan Lee, David Crandall, and Dhruv Batra. 2018. https://doi.org/10.48550/arXiv.1610.02424 Diverse Beam Search : Decoding Diverse Solutions from Neural Sequence Models . arXiv preprint. ArXiv:1610.02424 [cs]

work page internal anchor Pith review Pith/arXiv arXiv doi:10.48550/arxiv.1610.02424 2018
[38]

Joonho Yang, Seunghyun Yoon, ByeongJeong Kim, and Hwanhee Lee. 2024. https://doi.org/10.18653/v1/2024.emnlp-main.3 FIZZ : Factual Inconsistency Detection by Zoom-in Summary and Zoom-out Document . In Proceedings of the 2024 Conference on Empirical Methods in Natural Language Processing , pages 30--45, Miami, Florida, USA. Association for Computational Linguistics

work page doi:10.18653/v1/2024.emnlp-main.3 2024
[39]

Jingqing Zhang, Yao Zhao, Mohammad Saleh, and Peter J. Liu. 2020. https://doi.org/10.48550/arXiv.1912.08777 PEGASUS : Pre-training with Extracted Gap-sentences for Abstractive Summarization . Preprint, arXiv:1912.08777

work page doi:10.48550/arxiv.1912.08777 2020
[40]

Weinberger, and Yoav Artzi

Tianyi Zhang*, Varsha Kishore*, Felix Wu*, Kilian Q. Weinberger, and Yoav Artzi. 2020. https://openreview.net/forum?id=SkeHuCVFDr Bertscore: Evaluating text generation with bert . In International Conference on Learning Representations

2020
[41]

Hashimoto

Tianyi Zhang, Faisal Ladhak, Esin Durmus, Percy Liang, Kathleen McKeown, and Tatsunori B. Hashimoto. 2024. https://doi.org/10.1162/tacl_a_00632 Benchmarking Large Language Models for News Summarization . Transactions of the Association for Computational Linguistics, 12:39--57

work page doi:10.1162/tacl_a_00632 2024
[42]

Meyer, and Steffen Eger

Wei Zhao, Maxime Peyrard, Fei Liu, Yang Gao, Christian M. Meyer, and Steffen Eger. 2019. https://doi.org/10.18653/v1/D19-1053 MoverScore : Text Generation Evaluating with Contextualized Embeddings and Earth Mover Distance . In Proceedings of the 2019 Conference on Empirical Methods in Natural Language Processing and the 9th International Joint Conference ...

work page doi:10.18653/v1/d19-1053 2019
[43]

Ming Zhong, Yang Liu, Da Yin, Yuning Mao, Yizhu Jiao, Pengfei Liu, Chenguang Zhu, Heng Ji, and Jiawei Han. 2022. https://doi.org/10.18653/v1/2022.emnlp-main.131 Towards a Unified Multi - Dimensional Evaluator for Text Generation . In Proceedings of the 2022 Conference on Empirical Methods in Natural Language Processing , pages 2023--2038, Abu Dhabi, Unite...

work page doi:10.18653/v1/2022.emnlp-main.131 2022
[44]

online" 'onlinestring :=

ENTRY address archivePrefix author booktitle chapter edition editor eid eprint eprinttype howpublished institution journal key month note number organization pages publisher school series title type volume year doi pubmed url lastchecked label extra.label sort.label short.list INTEGERS output.state before.all mid.sentence after.sentence after.block STRING...
[45]

write newline

" write newline "" before.all 'output.state := FUNCTION n.dashify 't := "" t empty not t #1 #1 substring "-" = t #1 #2 substring "--" = not "--" * t #2 global.max substring 't := t #1 #1 substring "-" = "-" * t #2 global.max substring 't := while if t #1 #1 substring * t #2 global.max substring 't := if while FUNCTION word.in bbl.in capitalize " " * FUNCT...

[1] [1]

Amanda Bertsch, Alex Xie, Graham Neubig, and Matthew Gormley. 2023. https://doi.org/10.18653/v1/2023.bigpicture-1.9 It's MBR All the Way Down : Modern Generation Techniques Through the Lens of Minimum Bayes Risk . In Proceedings of the Big Picture Workshop , pages 108--122, Singapore. Association for Computational Linguistics

work page doi:10.18653/v1/2023.bigpicture-1.9 2023

[2] [2]

Yanran Chen and Steffen Eger. 2023. https://doi.org/10.1162/tacl_a_00576 MENLI : Robust Evaluation Metrics from Natural Language Inference . Transactions of the Association for Computational Linguistics, 11:804--825

work page doi:10.1162/tacl_a_00576 2023

[3] [3]

Julius Cheng and Andreas Vlachos. 2023. https://doi.org/10.18653/v1/2023.emnlp-main.767 Faster Minimum Bayes Risk Decoding with Confidence-based Pruning . In Proceedings of the 2023 Conference on Empirical Methods in Natural Language Processing , pages 12473--12480, Singapore. Association for Computational Linguistics

work page doi:10.18653/v1/2023.emnlp-main.767 2023

[4] [4]

Anshuman Chhabra, Hadi Askari, and Prasant Mohapatra. 2024. https://doi.org/10.18653/v1/2024.naacl-short.1 Revisiting zero-shot abstractive summarization in the era of large language models from the perspective of position bias . In Proceedings of the 2024 Conference of the North American Chapter of the Association for Computational Linguistics: Human Lan...

work page doi:10.18653/v1/2024.naacl-short.1 2024

[5] [5]

Hiroyuki Deguchi, Yusuke Sakai, Hidetaka Kamigaito, and Taro Watanabe. 2024. https://doi.org/10.18653/v1/2024.emnlp-demo.37 mbrs: A library for minimum B ayes risk decoding . In Proceedings of the 2024 Conference on Empirical Methods in Natural Language Processing: System Demonstrations, pages 351--362, Miami, Florida, USA. Association for Computational L...

work page doi:10.18653/v1/2024.emnlp-demo.37 2024

[6] [6]

Tanay Dixit, Fei Wang, and Muhao Chen. 2023. https://doi.org/10.18653/v1/2023.acl-short.78 Improving Factuality of Abstractive Summarization without Sacrificing Summary Quality . In Proceedings of the 61st Annual Meeting of the Association for Computational Linguistics ( Volume 2: Short Papers ) , pages 902--913, Toronto, Canada. Association for Computati...

work page doi:10.18653/v1/2023.acl-short.78 2023

[7] [7]

Bryan Eikema and Wilker Aziz. 2020. https://doi.org/10.18653/v1/2020.coling-main.398 Is MAP Decoding All You Need ? The Inadequacy of the Mode in Neural Machine Translation . In Proceedings of the 28th International Conference on Computational Linguistics , pages 4506--4520, Barcelona, Spain (Online). International Committee on Computational Linguistics

work page doi:10.18653/v1/2020.coling-main.398 2020

[8] [8]

Fabbri, Wojciech Kryściński, Bryan McCann, Caiming Xiong, Richard Socher, and Dragomir Radev

Alexander R. Fabbri, Wojciech Kry \'s ci \'n ski, Bryan McCann, Caiming Xiong, Richard Socher, and Dragomir Radev. 2021. https://doi.org/10.1162/tacl_a_00373 SummEval : Re-evaluating Summarization Evaluation . Transactions of the Association for Computational Linguistics, 9:391--409

work page doi:10.1162/tacl_a_00373 2021

[9] [9]

Markus Freitag, Behrooz Ghorbani, and Patrick Fernandes. 2023. https://doi.org/10.18653/v1/2023.findings-emnlp.617 Epsilon Sampling Rocks : Investigating Sampling Strategies for Minimum Bayes Risk Decoding for Machine Translation . In Findings of the Association for Computational Linguistics : EMNLP 2023 , pages 9198--9209, Singapore. Association for Comp...

work page doi:10.18653/v1/2023.findings-emnlp.617 2023

[10] [10]

Aaron Grattafiori, Abhimanyu Dubey, Abhinav Jauhri, Abhinav Pandey, Abhishek Kadian, Ahmad Al-Dahle, Aiesha Letman, Akhil Mathur, Alan Schelten, Alex Vaughan, Amy Yang, Angela Fan, Anirudh Goyal, Anthony Hartshorn, Aobo Yang, Archi Mitra, Archie Sravankumar, Artem Korenev, Arthur Hinsvark, and 542 others. 2024. https://arxiv.org/abs/2407.21783 The llama 3...

work page internal anchor Pith review Pith/arXiv arXiv 2024

[11] [11]

Pengcheng He, Xiaodong Liu, Jianfeng Gao, and Weizhu Chen. 2021. https://openreview.net/forum?id=XPZIaotutsD Deberta: Decoding-enhanced bert with disentangled attention . In International Conference on Learning Representations

2021

[12] [12]

John Hewitt, Christopher Manning, and Percy Liang. 2022. https://doi.org/10.18653/v1/2022.findings-emnlp.249 Truncation sampling as language model desmoothing . In Findings of the Association for Computational Linguistics: EMNLP 2022, pages 3414--3427, Abu Dhabi, United Arab Emirates. Association for Computational Linguistics

work page doi:10.18653/v1/2022.findings-emnlp.249 2022

[13] [13]

Ari Holtzman, Jan Buys, Li Du, Maxwell Forbes, and Yejin Choi. 2020. https://openreview.net/forum?id=rygGQyrFvH The curious case of neural text degeneration . In International Conference on Learning Representations

2020

[14] [14]

Hidetaka Kamigaito, Hiroyuki Deguchi, Yusuke Sakai, Katsuhiko Hayashi, and Taro Watanabe. 2025. https://aclanthology.org/2025.acl-long.1410/ Diversity explains inference scaling laws: Through a case study of minimum B ayes risk decoding . In Proceedings of the 63rd Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), pa...

2025

[15] [15]

M.G. Kendall. 1948. https://books.google.co.jp/books?id=hiBMAAAAMAAJ Rank Correlation Methods . C. Griffin

1948

[16] [16]

Philipp Koehn. 2004. https://aclanthology.org/W04-3250/ Statistical significance tests for machine translation evaluation . In Proceedings of the 2004 Conference on Empirical Methods in Natural Language Processing, pages 388--395, Barcelona, Spain. Association for Computational Linguistics

2004

[17] [17]

Geza Kovacs, Daniel Deutsch, and Markus Freitag. 2024. https://doi.org/10.18653/v1/2024.wmt-1.109 Mitigating Metric Bias in Minimum Bayes Risk Decoding . In Proceedings of the Ninth Conference on Machine Translation , pages 1063--1094, Miami, Florida, USA. Association for Computational Linguistics

work page doi:10.18653/v1/2024.wmt-1.109 2024

[18] [18]

Mike Lewis, Yinhan Liu, Naman Goyal, Marjan Ghazvininejad, Abdelrahman Mohamed, Omer Levy, Veselin Stoyanov, and Luke Zettlemoyer. 2020. https://doi.org/10.18653/v1/2020.acl-main.703 BART : Denoising Sequence-to-Sequence Pre-training for Natural Language Generation , Translation , and Comprehension . In Proceedings of the 58th Annual Meeting of the Associ...

work page doi:10.18653/v1/2020.acl-main.703 2020

[19] [19]

Chin-Yew Lin. 2004. https://aclanthology.org/W04-1013/ ROUGE : A package for automatic evaluation of summaries . In Text Summarization Branches Out, pages 74--81, Barcelona, Spain. Association for Computational Linguistics

2004

[20] [20]

Yixin Liu, Budhaditya Deb, Milagro Teruel, Aaron Halfaker, Dragomir Radev, and Ahmed Hassan Awadallah. 2023. https://doi.org/10.18653/v1/2023.acl-long.844 On Improving Summarization Factual Consistency from Natural Language Feedback . In Proceedings of the 61st Annual Meeting of the Association for Computational Linguistics ( Volume 1: Long Papers ) , pag...

work page doi:10.18653/v1/2023.acl-long.844 2023

[21] [21]

Yixin Liu and Pengfei Liu. 2021. https://doi.org/10.18653/v1/2021.acl-short.135 SimCLS : A Simple Framework for Contrastive Learning of Abstractive Summarization . In Proceedings of the 59th Annual Meeting of the Association for Computational Linguistics and the 11th International Joint Conference on Natural Language Processing ( Volume 2: Short Papers ) ...

work page doi:10.18653/v1/2021.acl-short.135 2021

[22] [22]

Joshua Maynez, Shashi Narayan, Bernd Bohnet, and Ryan McDonald. 2020. https://doi.org/10.18653/v1/2020.acl-main.173 On faithfulness and factuality in abstractive summarization . In Proceedings of the 58th Annual Meeting of the Association for Computational Linguistics, pages 1906--1919, Online. Association for Computational Linguistics

work page doi:10.18653/v1/2020.acl-main.173 2020

[23] [23]

Mathias Müller and Rico Sennrich. 2021. https://doi.org/10.18653/v1/2021.acl-long.22 Understanding the Properties of Minimum Bayes Risk Decoding in Neural Machine Translation . In Proceedings of the 59th Annual Meeting of the Association for Computational Linguistics and the 11th International Joint Conference on Natural Language Processing ( Volume 1: Lo...

work page doi:10.18653/v1/2021.acl-long.22 2021

[24] [24]

Ramesh Nallapati, Bowen Zhou, Cicero dos Santos, C a g lar Gu l c ehre, and Bing Xiang. 2016. https://doi.org/10.18653/v1/K16-1028 Abstractive text summarization using sequence-to-sequence RNN s and beyond . In Proceedings of the 20th SIGNLL Conference on Computational Natural Language Learning , pages 280--290, Berlin, Germany. Association for Computatio...

work page doi:10.18653/v1/k16-1028 2016

[25] [25]

Hardt, M., Recht, B., and Singer, Y

Shashi Narayan, Shay B. Cohen, and Mirella Lapata. 2018. https://doi.org/10.18653/v1/D18-1206 Don`t Give Me the Details , Just the Summary ! Topic-Aware Convolutional Neural Networks for Extreme Summarization . In Proceedings of the 2018 Conference on Empirical Methods in Natural Language Processing , pages 1797--1807, Brussels, Belgium. Association for C...

work page doi:10.18653/v1/d18-1206 2018

[26] [26]

Koki Natsumi, Hiroyuki Deguchi, Yusuke Sakai, Hidetaka Kamigaito, and Taro Watanabe. 2025. https://doi.org/10.18653/v1/2025.ijcnlp-short.39 Agreement-constrained probabilistic minimum B ayes risk decoding . In Proceedings of the 14th International Joint Conference on Natural Language Processing and the 4th Conference of the Asia-Pacific Chapter of the Ass...

work page doi:10.18653/v1/2025.ijcnlp-short.39 2025

[27] [27]

Ani Nenkova and Rebecca Passonneau. 2004. https://aclanthology.org/N04-1019/ Evaluating Content Selection in Summarization : The Pyramid Method . In Proceedings of the Human Language Technology Conference of the North American Chapter of the Association for Computational Linguistics : HLT - NAACL 2004 , pages 145--152, Boston, Massachusetts, USA. Associat...

2004

[28] [28]

Atsumoto Ohashi, Ukyo Honda, Tetsuro Morimura, and Yuu Jinnai. 2024. https://doi.org/10.18653/v1/2024.naacl-short.38 On the True Distribution Approximation of Minimum Bayes - Risk Decoding . In Proceedings of the 2024 Conference of the North American Chapter of the Association for Computational Linguistics : Human Language Technologies ( Volume 2: Short P...

work page doi:10.18653/v1/2024.naacl-short.38 2024

[29] [29]

Colin Raffel, Noam Shazeer, Adam Roberts, Katherine Lee, Sharan Narang, Michael Matena, Yanqi Zhou, Wei Li, and Peter J. Liu. 2023. https://doi.org/10.48550/arXiv.1910.10683 Exploring the Limits of Transfer Learning with a Unified Text-to-Text Transformer . Preprint, arXiv:1910.10683

work page internal anchor Pith review Pith/arXiv arXiv doi:10.48550/arxiv.1910.10683 2023

[30] [30]

Mathieu Ravaut, Shafiq Joty, and Nancy Chen. 2022. https://doi.org/10.18653/v1/2022.acl-long.309 SummaReranker : A Multi-Task Mixture-of-Experts Re-ranking Framework for Abstractive Summarization . In Proceedings of the 60th Annual Meeting of the Association for Computational Linguistics ( Volume 1: Long Papers ) , pages 4504--4524, Dublin, Ireland. Assoc...

work page doi:10.18653/v1/2022.acl-long.309 2022

[31] [31]

Mathieu Ravaut, Shafiq Joty, and Nancy Chen. 2023. https://doi.org/10.18653/v1/2023.findings-acl.529 Unsupervised summarization re-ranking . In Findings of the Association for Computational Linguistics: ACL 2023, pages 8341--8376, Toronto, Canada. Association for Computational Linguistics

work page doi:10.18653/v1/2023.findings-acl.529 2023

[32] [32]

Sangwon Ryu, Heejin Do, Yunsu Kim, Gary Lee, and Jungseul Ok. 2024. https://doi.org/10.18653/v1/2024.acl-long.319 Multi- Dimensional Optimization for Text Summarization via Reinforcement Learning . In Proceedings of the 62nd Annual Meeting of the Association for Computational Linguistics ( Volume 1: Long Papers ) , pages 5858--5871, Bangkok, Thailand. Ass...

work page doi:10.18653/v1/2024.acl-long.319 2024

[33] [33]

Alessandro Scir \`e , Karim Ghonim, and Roberto Navigli. 2024. https://doi.org/10.18653/v1/2024.findings-acl.841 FENICE : Factuality Evaluation of summarization based on Natural language Inference and Claim Extraction . In Findings of the Association for Computational Linguistics : ACL 2024 , pages 14148--14161, Bangkok, Thailand. Association for Computat...

work page doi:10.18653/v1/2024.findings-acl.841 2024

[34] [34]

Jeewoo Sul and Yong Suk Choi. 2023. https://doi.org/10.18653/v1/2023.acl-short.56 Balancing Lexical and Semantic Quality in Abstractive Summarization . In Proceedings of the 61st Annual Meeting of the Association for Computational Linguistics ( Volume 2: Short Papers ) , pages 637--647, Toronto, Canada. Association for Computational Linguistics

work page doi:10.18653/v1/2023.acl-short.56 2023

[35] [35]

Mirac Suzgun, Luke Melas-Kyriazi, and Dan Jurafsky. 2023. https://doi.org/10.18653/v1/2023.findings-acl.262 Follow the wisdom of the crowd: Effective text generation via minimum B ayes risk decoding . In Findings of the Association for Computational Linguistics: ACL 2023, pages 4265--4293, Toronto, Canada. Association for Computational Linguistics

work page doi:10.18653/v1/2023.findings-acl.262 2023

[36] [36]

Firas Trabelsi, David Vilar, Mara Finkelstein, and Markus Freitag. 2024. https://doi.org/10.48550/arXiv.2406.02832 Efficient Minimum Bayes Risk Decoding using Low-Rank Matrix Completion Algorithms . Preprint, arXiv:2406.02832

work page doi:10.48550/arxiv.2406.02832 2024

[37] [37]

Diverse Beam Search: Decoding Diverse Solutions from Neural Sequence Models

Ashwin K. Vijayakumar, Michael Cogswell, Ramprasath R. Selvaraju, Qing Sun, Stefan Lee, David Crandall, and Dhruv Batra. 2018. https://doi.org/10.48550/arXiv.1610.02424 Diverse Beam Search : Decoding Diverse Solutions from Neural Sequence Models . arXiv preprint. ArXiv:1610.02424 [cs]

work page internal anchor Pith review Pith/arXiv arXiv doi:10.48550/arxiv.1610.02424 2018

[38] [38]

Joonho Yang, Seunghyun Yoon, ByeongJeong Kim, and Hwanhee Lee. 2024. https://doi.org/10.18653/v1/2024.emnlp-main.3 FIZZ : Factual Inconsistency Detection by Zoom-in Summary and Zoom-out Document . In Proceedings of the 2024 Conference on Empirical Methods in Natural Language Processing , pages 30--45, Miami, Florida, USA. Association for Computational Linguistics

work page doi:10.18653/v1/2024.emnlp-main.3 2024

[39] [39]

Jingqing Zhang, Yao Zhao, Mohammad Saleh, and Peter J. Liu. 2020. https://doi.org/10.48550/arXiv.1912.08777 PEGASUS : Pre-training with Extracted Gap-sentences for Abstractive Summarization . Preprint, arXiv:1912.08777

work page doi:10.48550/arxiv.1912.08777 2020

[40] [40]

Weinberger, and Yoav Artzi

Tianyi Zhang*, Varsha Kishore*, Felix Wu*, Kilian Q. Weinberger, and Yoav Artzi. 2020. https://openreview.net/forum?id=SkeHuCVFDr Bertscore: Evaluating text generation with bert . In International Conference on Learning Representations

2020

[41] [41]

Hashimoto

Tianyi Zhang, Faisal Ladhak, Esin Durmus, Percy Liang, Kathleen McKeown, and Tatsunori B. Hashimoto. 2024. https://doi.org/10.1162/tacl_a_00632 Benchmarking Large Language Models for News Summarization . Transactions of the Association for Computational Linguistics, 12:39--57

work page doi:10.1162/tacl_a_00632 2024

[42] [42]

Meyer, and Steffen Eger

Wei Zhao, Maxime Peyrard, Fei Liu, Yang Gao, Christian M. Meyer, and Steffen Eger. 2019. https://doi.org/10.18653/v1/D19-1053 MoverScore : Text Generation Evaluating with Contextualized Embeddings and Earth Mover Distance . In Proceedings of the 2019 Conference on Empirical Methods in Natural Language Processing and the 9th International Joint Conference ...

work page doi:10.18653/v1/d19-1053 2019

[43] [43]

Ming Zhong, Yang Liu, Da Yin, Yuning Mao, Yizhu Jiao, Pengfei Liu, Chenguang Zhu, Heng Ji, and Jiawei Han. 2022. https://doi.org/10.18653/v1/2022.emnlp-main.131 Towards a Unified Multi - Dimensional Evaluator for Text Generation . In Proceedings of the 2022 Conference on Empirical Methods in Natural Language Processing , pages 2023--2038, Abu Dhabi, Unite...

work page doi:10.18653/v1/2022.emnlp-main.131 2022

[44] [44]

online" 'onlinestring :=

ENTRY address archivePrefix author booktitle chapter edition editor eid eprint eprinttype howpublished institution journal key month note number organization pages publisher school series title type volume year doi pubmed url lastchecked label extra.label sort.label short.list INTEGERS output.state before.all mid.sentence after.sentence after.block STRING...

[45] [45]

write newline

" write newline "" before.all 'output.state := FUNCTION n.dashify 't := "" t empty not t #1 #1 substring "-" = t #1 #2 substring "--" = not "--" * t #2 global.max substring 't := t #1 #1 substring "-" = "-" * t #2 global.max substring 't := while if t #1 #1 substring * t #2 global.max substring 't := if while FUNCTION word.in bbl.in capitalize " " * FUNCT...