Stress Testing Factual Consistency Metrics for Long-Document Summarization
Pith reviewed 2026-05-17 23:06 UTC · model grok-4.3
The pith
Short-form factual consistency metrics produce inconsistent scores for equivalent long-document summaries and lose reliability on dense claims.
A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.
Core claim
Across three long-form benchmark datasets, existing short-form metrics produce inconsistent scores for semantically equivalent summaries. Their reliability also declines for information-dense claims whose content is semantically similar to many parts of the source document.
What carries the argument
Seven factuality-preserving perturbations applied to summaries to probe the robustness of factuality metrics in long-document settings.
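The probing setup can be sketched as follows: score the original summary, score each perturbed variant against the same source, and report the difference. This is a minimal illustration, not the paper's released code. The function `toy_metric` is a hypothetical token-overlap stand-in for a real factuality metric, and the example texts are invented.

```python
from typing import Callable, Dict


def toy_metric(source: str, summary: str) -> float:
    """Hypothetical stand-in metric: share of summary tokens that occur in the source."""
    source_tokens = set(source.lower().split())
    summary_tokens = summary.lower().split()
    if not summary_tokens:
        return 0.0
    hits = sum(tok in source_tokens for tok in summary_tokens)
    return hits / len(summary_tokens)


def perturbation_drift(
    metric: Callable[[str, str], float],
    source: str,
    original: str,
    perturbed: Dict[str, str],
) -> Dict[str, float]:
    """Return metric(perturbed) - metric(original) for each named perturbation."""
    base = metric(source, original)
    return {name: metric(source, text) - base for name, text in perturbed.items()}


# Invented example texts, used only to exercise the functions.
source = "the court dismissed the appeal because the filing was late"
original = "the court dismissed the appeal"
perturbed = {
    "paraphrasing": "the appeal was rejected by the court",
    "source_text_insertion": "the court dismissed the appeal because the filing was late",
}

drift = perturbation_drift(toy_metric, source, original, perturbed)
max_abs_drift = max(abs(delta) for delta in drift.values())
```

If the perturbations really preserve factual content, a robust metric should keep every drift value close to zero. A surface-overlap metric like the toy one above penalizes the paraphrase even though its meaning is unchanged, which is the kind of inconsistency the paper reports.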
Load-bearing premise
The seven perturbations preserve factual content and do not introduce new inconsistencies that the metrics should detect.
What would settle it
A controlled test showing that a perturbation such as paraphrasing or compression changes a summary's actual factual alignment with the source, while the metrics score the perturbed summary identically to the unperturbed original.
Original abstract
Evaluating the factual consistency of abstractive text summarization remains a significant challenge, particularly for long documents, where conventional metrics struggle with input length limitations and long-range dependencies. In this work, we systematically evaluate the reliability of six widely used reference-free factuality metrics, originally proposed for short-form summarization, in the long-document setting. We probe metric robustness through seven factuality-preserving perturbations applied to summaries, namely paraphrasing, simplification, synonym replacement, logically equivalent negations, vocabulary reduction, compression, and source text insertion, and further analyze their sensitivity to retrieval context and claim information density. Across three long-form benchmark datasets spanning science fiction, legal, and scientific domains, our results reveal that existing short-form metrics produce inconsistent scores for semantically equivalent summaries and exhibit declining reliability for information-dense claims whose content is semantically similar to many parts of the source document. While expanding the retrieval context improves stability in some domains, no metric consistently maintains factual alignment under long-context conditions. Finally, our results highlight concrete directions for improving factuality evaluation, including multi-span reasoning, context-aware calibration, and training on meaning-preserving variations to enhance robustness in long-form summarization. We release all code, perturbed data, and scripts required to reproduce our results at https://github.com/zainmujahid/metricEval-longSum.
Editorial analysis
A structured set of objections, weighed in public.
Referee Report
Summary. The paper evaluates six reference-free factual consistency metrics (originally designed for short-form summarization) in long-document settings across three domains (science fiction, legal, scientific). It applies seven perturbations to summaries (paraphrasing, simplification, synonym replacement, logically equivalent negations, vocabulary reduction, compression, and source text insertion) to test robustness, and analyzes sensitivity to retrieval context and claim information density. The central claim is that these metrics yield inconsistent scores on semantically equivalent summaries and decline in reliability for information-dense claims that are similar to many source spans. No metric is consistently stable under long-context conditions, and the paper suggests multi-span reasoning and context-aware calibration as directions for improvement. Code, perturbed data, and scripts are released.
Significance. If the perturbations are validated as factuality-preserving, the work provides useful empirical evidence on metric limitations in long-document summarization and concrete improvement directions. Releasing all code and perturbed data is a clear strength for reproducibility and follow-up studies.
major comments (2)
- [Abstract and §3] Abstract and §3 (Perturbations): The central claim that score variance demonstrates metric unreliability depends on the seven perturbations leaving factual content unchanged. No human ratings, entailment checks, or other verification is described to confirm that compression does not drop details, negations do not flip scope, or source insertion does not alter attribution. If any perturbation changes factuality, the observed inconsistencies may reflect correct metric behavior rather than unreliability.
- [§4] §4 (Results, information-density analysis): The claim of declining reliability for information-dense claims is load-bearing but would benefit from explicit quantification of 'information density' (e.g., via semantic similarity thresholds or span overlap counts) and statistical tests showing the correlation with metric scores across domains.
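One inexpensive way to prioritize perturbed summaries for the kind of verification requested in the first comment is a surface-level screen that flags dropped numbers or changed negation counts. The sketch below is an assumption made for illustration, not something the paper or the referee describes. The word list `NEGATION_WORDS` and the function `screen_perturbation` are hypothetical, and a flag only means "send to a human or an entailment check", not "factuality changed".

```python
import re
from typing import Dict, List, Set, Union

# Illustrative and deliberately incomplete list of negation cues (assumption).
NEGATION_WORDS = {"not", "no", "never", "without", "neither", "nor"}


def numeric_tokens(text: str) -> Set[str]:
    """Collect integer and decimal literals that appear in the text."""
    return set(re.findall(r"\d+(?:\.\d+)?", text))


def negation_count(text: str) -> int:
    """Count tokens that match the illustrative negation cue list."""
    tokens = re.findall(r"[a-z']+", text.lower())
    return sum(tok in NEGATION_WORDS for tok in tokens)


def screen_perturbation(original: str, perturbed: str) -> Dict[str, Union[List[str], bool]]:
    """Flag surface differences that deserve a closer manual look."""
    orig_nums = numeric_tokens(original)
    pert_nums = numeric_tokens(perturbed)
    return {
        "dropped_numbers": sorted(orig_nums - pert_nums),
        "added_numbers": sorted(pert_nums - orig_nums),
        "negation_count_changed": negation_count(original) != negation_count(perturbed),
    }


# Invented example: a compression that silently drops a count.
report = screen_perturbation(
    "The panel reviewed 12 appeals in 2019 and did not reverse any.",
    "The panel reviewed appeals in 2019 and reversed none.",
)
```

A legitimate logically equivalent negation can also change the negation count, so this screen over-flags by design. It narrows the sample a human rater has to inspect and does not replace that rating.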
minor comments (2)
- [Tables/Figures] Table 1 and Figure 2: axis labels and legend text are small and may be difficult to read at print size; consider increasing font size or splitting into multiple panels.
- [§2] §2 (Related Work): A brief comparison table of the six metrics' original design assumptions (short vs. long context) would help readers quickly see why they are expected to struggle with long documents.
Simulated Author's Rebuttal
We thank the referee for their constructive and detailed feedback, which highlights important aspects for strengthening the validity of our claims regarding the robustness of factual consistency metrics in long-document summarization. We address each major comment below and outline the revisions we will make.
- Referee: [Abstract and §3] Abstract and §3 (Perturbations): The central claim that score variance demonstrates metric unreliability depends on the seven perturbations leaving factual content unchanged. No human ratings, entailment checks, or other verification is described to confirm that compression does not drop details, negations do not flip scope, or source insertion does not alter attribution. If any perturbation changes factuality, the observed inconsistencies may reflect correct metric behavior rather than unreliability.
Authors: We agree that confirming the perturbations preserve factual content is essential to support our central claims about metric unreliability. Each perturbation was designed using established meaning-preserving transformations from prior work: paraphrasing and simplification retain core semantics, synonym replacement employs contextually equivalent terms, logically equivalent negations preserve truth value, vocabulary reduction and compression target non-essential elements while keeping key facts, and source text insertion incorporates verbatim source excerpts without modifying attribution. We acknowledge that the original manuscript did not include explicit human ratings or entailment verification. In the revised version, we will expand §3 with a dedicated subsection providing the generation process, concrete examples for each perturbation, and results from a targeted human evaluation on a random sample of 100 perturbed summaries per domain to verify factuality preservation. This will directly address the concern and bolster the interpretation of score variances.
Revision: partial
- Referee: [§4] §4 (Results, information-density analysis): The claim of declining reliability for information-dense claims is load-bearing but would benefit from explicit quantification of 'information density' (e.g., via semantic similarity thresholds or span overlap counts) and statistical tests showing the correlation with metric scores across domains.
Authors: We concur that providing an explicit, quantifiable definition of information density and supporting statistical evidence would enhance the robustness of this analysis. In the current work, information density is characterized by the degree of semantic similarity between a claim and multiple spans in the source document. For the revision, we will formalize this by computing the count of source spans exceeding a cosine similarity threshold (using sentence embeddings) for each claim. We will then report Pearson and Spearman correlation coefficients, along with p-values, between these density scores and metric variance measures, broken down by domain. These updates, including any necessary figures or tables, will be added to §4 to substantiate the observed decline in reliability for dense claims.
Revision: yes
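The density measure and correlation analysis proposed in the second response can be sketched in plain Python. This is an illustration under stated assumptions rather than the authors' implementation: the threshold value of 0.7, the two-dimensional toy vectors standing in for sentence embeddings, and the example density and variance values are all invented. P-values are omitted here; in practice they would come from a statistics package such as SciPy.

```python
import math
from typing import List, Sequence


def cosine(u: Sequence[float], v: Sequence[float]) -> float:
    """Cosine similarity between two vectors; 0.0 if either has zero norm."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm if norm else 0.0


def information_density(claim_vec, span_vecs, threshold: float = 0.7) -> int:
    """Count source spans whose similarity to the claim meets the threshold (assumed value)."""
    return sum(cosine(claim_vec, span) >= threshold for span in span_vecs)


def pearson(x: Sequence[float], y: Sequence[float]) -> float:
    """Pearson correlation coefficient of two equal-length samples."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sx = math.sqrt(sum((a - mx) ** 2 for a in x))
    sy = math.sqrt(sum((b - my) ** 2 for b in y))
    if sx == 0 or sy == 0:
        raise ValueError("correlation is undefined for a constant sample")
    return cov / (sx * sy)


def average_ranks(values: Sequence[float]) -> List[float]:
    """1-based ranks, with tied values sharing the mean of their positions."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        shared = (i + j) / 2 + 1
        for k in range(i, j + 1):
            ranks[order[k]] = shared
        i = j + 1
    return ranks


def spearman(x: Sequence[float], y: Sequence[float]) -> float:
    """Spearman correlation: Pearson correlation computed on average ranks."""
    return pearson(average_ranks(x), average_ranks(y))


# Toy vectors in place of sentence embeddings (assumption).
claim = [1.0, 0.0]
spans = [[1.0, 0.0], [0.9, 0.1], [0.0, 1.0]]
density = information_density(claim, spans)

# Invented per-claim densities and metric score variances.
densities = [1, 2, 3, 4]
score_variance = [0.1, 0.2, 0.4, 0.8]
r = pearson(densities, score_variance)
rho = spearman(densities, score_variance)
```

Reporting both coefficients is useful because the Spearman value captures any monotone relationship between density and score variance, while the Pearson value only reflects the linear part.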
Circularity Check
No significant circularity in empirical evaluation
The paper performs a purely empirical stress test of existing factual consistency metrics on long-document summarization benchmarks by applying seven perturbations to summaries and measuring score variance across domains. No derivations, equations, fitted parameters, or predictions are present; central claims rest on direct evaluation against external datasets with released perturbed data rather than any self-referential definitions or self-citation chains. The evaluation is self-contained against benchmarks and does not reduce any result to its inputs by construction.
Axiom & Free-Parameter Ledger
axioms (2)
- domain assumption: The seven listed perturbations preserve factual consistency with the source document.
- domain assumption: The three benchmark datasets (science fiction, legal, scientific) are representative of long-document factuality evaluation needs.