
arXiv: 2511.07689 · v2 · submitted 2025-11-10 · cs.CL · cs.AI · cs.LG

Stress Testing Factual Consistency Metrics for Long-Document Summarization

Pith reviewed 2026-05-17 23:06 UTC · model grok-4.3

classification: cs.CL · cs.AI · cs.LG
keywords: factual consistency metrics · long-document summarization · robustness testing · perturbations · information density · reference-free evaluation · metric stability

The pith

Short-form factual consistency metrics produce inconsistent scores for equivalent long-document summaries and lose reliability on dense claims.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

This paper evaluates six reference-free factuality metrics on summaries of long documents from the science fiction, legal, and scientific domains. To test whether the metrics remain stable, it applies seven factuality-preserving perturbations: paraphrasing, simplification, synonym replacement, logically equivalent negations, vocabulary reduction, compression, and source text insertion. The findings indicate that the metrics assign different scores to summaries with the same meaning and perform worse when claims are information-dense and semantically similar to many parts of the source. This matters because accurate evaluation is essential for building trustworthy summarization tools that handle books, reports, or articles without introducing errors. The work points to the need for metrics that better handle long contexts and meaning-preserving changes.

Core claim

Existing short-form metrics produce inconsistent scores for semantically equivalent summaries and exhibit declining reliability for information-dense claims whose content is semantically similar to many parts of the source document across three long-form benchmark datasets.

What carries the argument

Seven factuality-preserving perturbations applied to summaries to probe the robustness of factuality metrics in long-document settings.
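
As a minimal sketch of this kind of stability probe, the following Python computes the score change between an original summary and meaning-preserving variants. The `toy_overlap_score` function is a hypothetical stand-in for a real factuality metric (it is not one of the six metrics evaluated in the paper), and the example texts are invented for illustration.

```python
import re

def tokens(text: str) -> set[str]:
    """Lowercase word tokens; a deliberately crude tokenizer for the sketch."""
    return set(re.findall(r"[a-z0-9]+", text.lower()))

def toy_overlap_score(source: str, summary: str) -> float:
    """Hypothetical stand-in metric: fraction of summary tokens found in the source."""
    summary_tokens = tokens(summary)
    if not summary_tokens:
        return 0.0
    return len(summary_tokens & tokens(source)) / len(summary_tokens)

def score_deltas(source, original, variants, metric):
    """Return metric(variant) - metric(original) for each named variant."""
    base = metric(source, original)
    return {name: metric(source, text) - base for name, text in variants.items()}

# Invented example texts (assumption: each variant preserves the original meaning).
source = "The court dismissed the appeal because the filing deadline had passed."
original = "The court dismissed the appeal."
variants = {
    "synonym_replacement": "The court rejected the appeal.",
    "paraphrase": "The appeal was thrown out by the judges.",
}

deltas = score_deltas(source, original, variants, toy_overlap_score)
for name, delta in deltas.items():
    print(f"{name}: {delta:+.2f}")
```

The surface-overlap stand-in penalizes both variants even though their meaning is unchanged; a metric that is robust in the sense probed here would keep each delta close to zero.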

Load-bearing premise

The seven perturbations preserve factual content and do not introduce new inconsistencies that the metrics should detect.

What would settle it

A controlled test showing that a perturbation such as paraphrasing or compression changes a summary's actual factual alignment with the source while the metrics still score it identically to the unperturbed original.

Figures

Figures reproduced from arXiv: 2511.07689 by Dustin Wright, Isabelle Augenstein, Zain Muhammad Mujahid.

Figure 1: We aim to see how robust summary factual …
Figure 2: Score change under factuality-preserving perturbations. Boxplots show the difference in factuality score …
Figure 3: Relationship between claim similarity and average factuality score. Higher similarity values correspond to …
Figure 4: Prompt templates used with GPT-4o to generate meaning-preserving perturbations of the original summaries.

Metric        Original  Synonym Replaced  Summarized  Simplified  Paraphrased  Negated  Less Diverse  Added Source Text
BARTScore     0.16      0.07              0.08        0.11        0.09         0.07     0.10          0.23
MiniCheck     0.84      0.83              0.84        0.85        0.85         0.40     0.85          0.78
SummaC-Conv   0.33      0.31              0.29        0.37        0.21         0.23     0.34          0.42
SummaC-ZS     0.38      0.39              0.32        0.47        0.39         0.22     0.38          …
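
To read the scores extracted alongside the Figure 4 caption as changes relative to the unperturbed summaries, one can subtract the Original column from each perturbed column. The sketch below does this for the three metrics whose rows are complete in that excerpt; the SummaC-ZS row is left out because its last value is truncated. The numbers are copied from the excerpt, and the function and variable names are illustrative only.

```python
COLUMNS = ["Original", "Synonym Replaced", "Summarized", "Simplified",
           "Paraphrased", "Negated", "Less Diverse", "Added Source Text"]

# Values copied from the score table above (complete rows only).
SCORES = {
    "BARTScore":   [0.16, 0.07, 0.08, 0.11, 0.09, 0.07, 0.10, 0.23],
    "MiniCheck":   [0.84, 0.83, 0.84, 0.85, 0.85, 0.40, 0.85, 0.78],
    "SummaC-Conv": [0.33, 0.31, 0.29, 0.37, 0.21, 0.23, 0.34, 0.42],
}

def deltas_vs_original(row):
    """Perturbed score minus Original score, rounded to the table's precision."""
    base = row[0]
    return {col: round(val - base, 2) for col, val in zip(COLUMNS[1:], row[1:])}

def largest_shift(row):
    """Perturbation column with the largest absolute change (first one wins ties)."""
    deltas = deltas_vs_original(row)
    col = max(deltas, key=lambda c: abs(deltas[c]))
    return col, deltas[col]

for metric, row in SCORES.items():
    col, delta = largest_shift(row)
    print(f"{metric}: largest shift under '{col}' ({delta:+.2f})")
```

On these values, MiniCheck moves most under the Negated column (0.84 to 0.40) and SummaC-Conv under Paraphrased (0.33 to 0.21).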
Original abstract

Evaluating the factual consistency of abstractive text summarization remains a significant challenge, particularly for long documents, where conventional metrics struggle with input length limitations and long-range dependencies. In this work, we systematically evaluate the reliability of six widely used reference-free factuality metrics, originally proposed for short-form summarization, in the long-document setting. We probe metric robustness through seven factuality-preserving perturbations applied to summaries, namely paraphrasing, simplification, synonym replacement, logically equivalent negations, vocabulary reduction, compression, and source text insertion, and further analyze their sensitivity to retrieval context and claim information density. Across three long-form benchmark datasets spanning science fiction, legal, and scientific domains, our results reveal that existing short-form metrics produce inconsistent scores for semantically equivalent summaries and exhibit declining reliability for information-dense claims whose content is semantically similar to many parts of the source document. While expanding the retrieval context improves stability in some domains, no metric consistently maintains factual alignment under long-context conditions. Finally, our results highlight concrete directions for improving factuality evaluation, including multi-span reasoning, context-aware calibration, and training on meaning-preserving variations to enhance robustness in long-form summarization. We release all code, perturbed data, and scripts required to reproduce our results at https://github.com/zainmujahid/metricEval-longSum.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Referee Report

2 major / 2 minor

Summary. The paper evaluates six reference-free factual consistency metrics (originally for short-form summarization) on long-document settings across three domains (science fiction, legal, scientific). It applies seven perturbations to summaries—paraphrasing, simplification, synonym replacement, logically equivalent negations, vocabulary reduction, compression, and source text insertion—to test robustness, and analyzes sensitivity to retrieval context and claim information density. The central claim is that these metrics yield inconsistent scores on semantically equivalent summaries and decline in reliability for information-dense claims similar to many source spans; no metric is consistently stable under long-context conditions, with suggestions for multi-span reasoning and context-aware calibration. Code, perturbed data, and scripts are released.

Significance. If the perturbations are validated as factuality-preserving, the work provides useful empirical evidence on metric limitations in long-document summarization and concrete improvement directions. Releasing all code and perturbed data is a clear strength for reproducibility and follow-up studies.

major comments (2)
  1. [Abstract and §3] Abstract and §3 (Perturbations): The central claim that score variance demonstrates metric unreliability depends on the seven perturbations leaving factual content unchanged. No human ratings, entailment checks, or other verification is described to confirm that compression does not drop details, negations do not flip scope, or source insertion does not alter attribution. If any perturbation changes factuality, the observed inconsistencies may reflect correct metric behavior rather than unreliability.
  2. [§4] §4 (Results, information-density analysis): The claim of declining reliability for information-dense claims is load-bearing but would benefit from explicit quantification of 'information density' (e.g., via semantic similarity thresholds or span overlap counts) and statistical tests showing the correlation with metric scores across domains.
minor comments (2)
  1. [Tables/Figures] Table 1 and Figure 2: axis labels and legend text are small and may be difficult to read at print size; consider increasing font size or splitting into multiple panels.
  2. [§2] §2 (Related Work): A brief comparison table of the six metrics' original design assumptions (short vs. long context) would help readers quickly see why they are expected to struggle with long documents.

Simulated Author's Rebuttal

2 responses · 0 unresolved

We thank the referee for their constructive and detailed feedback, which highlights important aspects for strengthening the validity of our claims regarding the robustness of factual consistency metrics in long-document summarization. We address each major comment below and outline the revisions we will make.

Point-by-point responses
  1. Referee: [Abstract and §3] Abstract and §3 (Perturbations): The central claim that score variance demonstrates metric unreliability depends on the seven perturbations leaving factual content unchanged. No human ratings, entailment checks, or other verification is described to confirm that compression does not drop details, negations do not flip scope, or source insertion does not alter attribution. If any perturbation changes factuality, the observed inconsistencies may reflect correct metric behavior rather than unreliability.

    Authors: We agree that confirming the perturbations preserve factual content is essential to support our central claims about metric unreliability. Each perturbation was designed using established meaning-preserving transformations from prior work: paraphrasing and simplification retain core semantics, synonym replacement employs contextually equivalent terms, logically equivalent negations preserve truth value, vocabulary reduction and compression target non-essential elements while keeping key facts, and source text insertion incorporates verbatim source excerpts without modifying attribution. We acknowledge that the original manuscript did not include explicit human ratings or entailment verification. In the revised version, we will expand §3 with a dedicated subsection providing the generation process, concrete examples for each perturbation, and results from a targeted human evaluation on a random sample of 100 perturbed summaries per domain to verify factuality preservation. This will directly address the concern and bolster the interpretation of score variances. revision: partial

  2. Referee: [§4] §4 (Results, information-density analysis): The claim of declining reliability for information-dense claims is load-bearing but would benefit from explicit quantification of 'information density' (e.g., via semantic similarity thresholds or span overlap counts) and statistical tests showing the correlation with metric scores across domains.

    Authors: We concur that providing an explicit, quantifiable definition of information density and supporting statistical evidence would enhance the robustness of this analysis. In the current work, information density is characterized by the degree of semantic similarity between a claim and multiple spans in the source document. For the revision, we will formalize this by computing the count of source spans exceeding a cosine similarity threshold (using sentence embeddings) for each claim. We will then report Pearson and Spearman correlation coefficients, along with p-values, between these density scores and metric variance measures, broken down by domain. These updates, including any necessary figures or tables, will be added to §4 to substantiate the observed decline in reliability for dense claims. revision: yes
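
The density measure and correlation test proposed in this response could be sketched as follows. This is an illustration under stated assumptions, not the authors' implementation: the two-dimensional vectors stand in for sentence embeddings, the 0.8 threshold is arbitrary, and the `score_variance` values are invented placeholders rather than results.

```python
import math

def cosine(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm if norm else 0.0

def density(claim_vec, span_vecs, threshold=0.8):
    """Count source spans whose similarity to the claim exceeds the threshold."""
    return sum(1 for span in span_vecs if cosine(claim_vec, span) > threshold)

def ranks(values):
    """1-based ranks; tied values share the average of their positions."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    result = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        for k in range(i, j + 1):
            result[order[k]] = (i + j) / 2 + 1
        i = j + 1
    return result

def pearson(x, y):
    """Pearson correlation coefficient (assumes neither input is constant)."""
    mx, my = sum(x) / len(x), sum(y) / len(y)
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sx = math.sqrt(sum((a - mx) ** 2 for a in x))
    sy = math.sqrt(sum((b - my) ** 2 for b in y))
    return cov / (sx * sy)

def spearman(x, y):
    """Spearman correlation: Pearson correlation of the rank vectors."""
    return pearson(ranks(x), ranks(y))

# Toy stand-ins for sentence embeddings of source spans and summary claims.
spans = [(1.0, 0.0), (0.9, 0.3), (0.6, 0.6), (0.3, 0.9), (0.0, 1.0)]
claims = [(1.0, 0.0), (1.0, 1.0), (-0.5, 0.9)]
densities = [density(claim, spans) for claim in claims]

# Invented placeholder values for per-claim metric score variance.
score_variance = [0.05, 0.02, 0.01]
rho = spearman(densities, score_variance)
print(densities, round(rho, 3))
```

This sketch omits p-values; in practice `scipy.stats.spearmanr` and `scipy.stats.pearsonr` return the coefficient together with a p-value, and the computation would be repeated per domain.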

Circularity Check

0 steps flagged

No significant circularity in empirical evaluation


The paper performs a purely empirical stress test of existing factual consistency metrics on long-document summarization benchmarks by applying seven perturbations to summaries and measuring score variance across domains. No derivations, equations, fitted parameters, or predictions are present; central claims rest on direct evaluation against external datasets with released perturbed data rather than any self-referential definitions or self-citation chains. The evaluation is self-contained against benchmarks and does not reduce any result to its inputs by construction.

Axiom & Free-Parameter Ledger

0 free parameters · 2 axioms · 0 invented entities

The central findings rest on the domain assumption that the chosen perturbations preserve factual content and that the three benchmark datasets adequately represent long-document summarization challenges.

axioms (2)
  • domain assumption: The seven listed perturbations preserve factual consistency with the source document.
    Invoked when interpreting metric score changes as evidence of metric unreliability rather than perturbation-induced errors.
  • domain assumption: The three benchmark datasets (science fiction, legal, scientific) are representative of long-document factuality evaluation needs.
    Used to generalize results beyond the specific collections tested.

pith-pipeline@v0.9.0 · 5537 in / 1342 out tokens · 26139 ms · 2026-05-17T23:06:22.404967+00:00 · methodology

