pith. machine review for the scientific record. sign in

arxiv: 2605.11601 · v1 · submitted 2026-05-12 · 💻 cs.CL · cs.AI

Recognition: 2 theorem links

· Lean Theorem

DiffScore: Text Evaluation Beyond Autoregressive Likelihood

Authors on Pith no claims yet

Pith reviewed 2026-05-13 01:36 UTC · model grok-4.3

classification 💻 cs.CL cs.AI
keywords text evaluationdiffusion modelsmasked reconstructionpositional biasautoregressive modelslanguage model evaluationfluency and coherence
0
0 comments X

The pith

Text evaluation can avoid positional bias by scoring every token with full bidirectional context using masked diffusion models.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

Autoregressive language models introduce positional bias because early tokens receive only leftward context during scoring. DiffScore instead applies masked reconstruction in diffusion models so that every token is evaluated with complete bidirectional information across a continuous range of masking rates. This produces a natural quality hierarchy running from local fluency to global coherence while also enabling new diagnostics such as multi-timestep profiles and bidirectional PMI decomposition. Experiments on ten benchmarks show consistent gains over autoregressive baselines in both zero-shot and fine-tuned use.

Core claim

DiffScore is an evaluation framework built on Masked Large Diffusion Language Models that measures text recoverability across continuous masking rates. By removing left-to-right positional bias it establishes an evaluation hierarchy from local fluency to global coherence and supplies diagnostic tools unavailable to autoregressive methods, including multi-timestep quality profiles and bidirectional PMI decomposition that separate fluency from faithfulness. The approach outperforms autoregressive baselines on ten benchmarks in zero-shot and fine-tuned settings.

What carries the argument

DiffScore, which quantifies recoverability of text under varying masking rates inside bidirectional diffusion models to produce bias-free quality scores.

If this is right

  • DiffScore outperforms autoregressive baselines on ten benchmarks in both zero-shot and fine-tuned settings.
  • Multi-timestep quality profiles decompose scores across masking rates to reveal where quality issues arise.
  • Bidirectional PMI decomposition separates fluency from faithfulness in the overall score.
  • The method creates a natural hierarchy from local fluency to global coherence without directional bias.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

  • The masking-based approach could be adapted to create new training objectives that encourage bidirectional consistency in language models.
  • Diagnostic profiles might help isolate specific failure modes such as repetition or incoherence in generated text.
  • The framework could extend evaluation of long documents where early and late tokens must be judged on equal footing.

Load-bearing premise

Recoverability under random masking directly reflects a text's intrinsic quality without diffusion-specific artifacts or dependence on the diffusion model's own training data.

What would settle it

A benchmark where texts that receive high human quality ratings consistently obtain lower DiffScore values than texts that receive low human ratings.

Figures

Figures reproduced from arXiv: 2605.11601 by Alexander Fraser, Dingnan Jin, Jun Zhou, Maosong Sun, Qing Cui, Wen Lai, Yingli Shen.

Figure 1
Figure 1. Figure 1: DIFFSCORE consistently outperforms all baselines across 10 diverse evaluation benchmarks. 1 Introduction Evaluating natural language generation (NLG) remains challenging due to substantial lexical vari￾ation among semantically equivalent outputs [1, 2]. Methods have evolved from n-gram overlap ∗Equal contribution. †Work mostly done while affiliated with the Technical University of Munich. Preprint. arXiv:2… view at source ↗
Figure 2
Figure 2. Figure 2: Spearman ρ between timestep-specific DIFFSCORE and human judgments on SummEval. DiffScore BARTScore 0.00 0.05 0.10 0.15 0.20 Coefficient of Variation 0.174 0.181 Positional Uniformity (CoV ) DiffScore BARTScore 0 1 2 3 4 5 6 Mean Positional Std Dev 2.30 5.61 2.4× Score Spread ( ) DiffScore BARTScore 0.80 0.82 0.84 0.86 0.88 0.90 0.92 0.94 Mean Consistency 0.885 0.868 Directional Consistency ( ) DiffScore B… view at source ↗
Figure 3
Figure 3. Figure 3: Left: Per-position score distributions on SummEval. Right: Directional consistency on 200 synthetic reversal pairs. We verify this temporal structure via 5-fold cross-validated weight optimization. Learned weights concentrate at t=0.9 for coherence (72.6%) and relevance (40.0%), whereas fluency relies on early timesteps (t ≤ 0.4). This confirms that the multi-timestep structure provides a principled qualit… view at source ↗
Figure 4
Figure 4. Figure 4: Monte Carlo convergence of DIFFSCORE as a function of the number of sampled masking patterns K. Correlation stabilizes at K ≥ 20, with diminishing returns beyond K=50. Since DIFFSCORE estimates the ELBO via Monte Carlo sampling, a natural concern is the variance introduced by finite sampling [PITH_FULL_IMAGE:figures/full_fig_p017_4.png] view at source ↗
Figure 5
Figure 5. Figure 5: Pareto frontier of performance vs. computational cost. [PITH_FULL_IMAGE:figures/full_fig_p018_5.png] view at source ↗
Figure 6
Figure 6. Figure 6: Learned timestep weights via 5-fold cross-validation on SummEval. Different quality [PITH_FULL_IMAGE:figures/full_fig_p020_6.png] view at source ↗
Figure 7
Figure 7. Figure 7: Quality profile curves for high-, median-, and low-quality summaries on SummEval. [PITH_FULL_IMAGE:figures/full_fig_p021_7.png] view at source ↗
Figure 8
Figure 8. Figure 8: Per-position token-level score distributions. [PITH_FULL_IMAGE:figures/full_fig_p022_8.png] view at source ↗
Figure 9
Figure 9. Figure 9: Directional consistency on 200 synthetic forward-reverse pairs. [PITH_FULL_IMAGE:figures/full_fig_p022_9.png] view at source ↗
Figure 10
Figure 10. Figure 10: Left: PMI decomposition visualization showing how conditional and marginal scores separate for different quality levels. Right: Token-level quality profile heatmap illustrating fine￾grained quality patterns. The diagnostic tools of DIFFSCORE enable fine-grained analysis beyond scalar scores. The PMI decomposition ( [PITH_FULL_IMAGE:figures/full_fig_p024_10.png] view at source ↗
Figure 11
Figure 11. Figure 11: Comparison of masking strategies on SummEval. Uniform random masking substantially [PITH_FULL_IMAGE:figures/full_fig_p025_11.png] view at source ↗
Figure 12
Figure 12. Figure 12: Full timestep×dimension heatmap on SummEval. Each cell shows Spearman ρ between the single-timestep DIFFSCORE and human judgments [PITH_FULL_IMAGE:figures/full_fig_p025_12.png] view at source ↗
read the original abstract

Autoregressive language models are widely used for text evaluation, however, their left-to-right factorization introduces positional bias, i.e., early tokens are scored with only leftward context, conflating architectural asymmetry with true text quality. We propose masked reconstruction as an alternative paradigm, where every token is scored using full bidirectional context. We introduce DiffScore, an evaluation framework built on Masked Large Diffusion Language Models. By measuring text recoverability across continuous masking rates, DiffScore eliminates positional bias and naturally establishes an evaluation hierarchy from local fluency to global coherence. We further provide diagnostic tools unavailable to autoregressive frameworks: multi-timestep quality profiles that decompose scores across masking rates, and bidirectional PMI decomposition that disentangles fluency from faithfulness. Experiments across ten benchmarks show that DiffScore consistently outperforms autoregressive baselines in both zero-shot and fine-tuned settings. The code is released at: https://github.com/wenlai-lavine/DiffScore.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Referee Report

3 major / 2 minor

Summary. The paper proposes DiffScore as an alternative to autoregressive likelihood for text evaluation. It argues that left-to-right factorization in AR models introduces positional bias, and introduces a masked reconstruction framework based on Masked Large Diffusion Language Models that scores every token with full bidirectional context across continuous masking rates. This yields an evaluation hierarchy from local fluency to global coherence, plus new diagnostics (multi-timestep quality profiles and bidirectional PMI decomposition). The central empirical claim is that DiffScore consistently outperforms AR baselines across ten benchmarks in both zero-shot and fine-tuned settings; code is released.

Significance. If the empirical claims survive controls for model capacity, data matching, and diffusion-specific artifacts, the work would offer a genuinely bidirectional alternative to AR scoring with useful diagnostic decomposition tools. The public code release supports reproducibility and is a clear strength.

major comments (3)
  1. [Abstract / Experiments] Abstract and Experiments section: the claim of consistent outperformance on ten benchmarks provides no details on model sizes, masking schedules, statistical significance testing, or explicit controls ensuring the Masked Large Diffusion LM and AR baselines are capacity- and data-matched; without these, gains cannot be confidently attributed to elimination of positional bias rather than architecture or training differences.
  2. [Method / Experiments] Method and Experiments sections: the core assumption that masked reconstruction recoverability isolates intrinsic text quality (independent of the diffusion model's noise schedule, continuous masking rates, or training corpus) is load-bearing for the central claim but is not directly tested against potential systematic biases that could correlate with benchmark labels.
  3. [§3] §3 (framework description): the claimed 'evaluation hierarchy from local fluency to global coherence' and the bidirectional PMI decomposition are presented as natural consequences, but the manuscript does not provide a formal derivation or ablation showing that these quantities are independent of diffusion-specific hyperparameters.
minor comments (2)
  1. Notation for the continuous masking rates and the multi-timestep quality profiles could be made more explicit with a small table or equation reference for readers unfamiliar with diffusion LMs.
  2. [Abstract] The abstract states 'ten benchmarks' but does not list them; a brief enumeration in the abstract or a table in §4 would improve clarity.

Simulated Author's Rebuttal

3 responses · 0 unresolved

We thank the referee for the constructive and detailed feedback. The comments identify important areas where additional rigor and transparency will strengthen the manuscript. We address each major comment below and commit to revisions that directly respond to the concerns raised.

read point-by-point responses
  1. Referee: [Abstract / Experiments] Abstract and Experiments section: the claim of consistent outperformance on ten benchmarks provides no details on model sizes, masking schedules, statistical significance testing, or explicit controls ensuring the Masked Large Diffusion LM and AR baselines are capacity- and data-matched; without these, gains cannot be confidently attributed to elimination of positional bias rather than architecture or training differences.

    Authors: We agree that these details are necessary for confident attribution of results. In the revised manuscript we will add a dedicated subsection in Experiments that reports: model sizes (all compared models are 1.3B parameters), the precise masking schedule (uniform sampling over 20 discrete rates from 0 to 1), statistical significance (paired bootstrap tests yielding p < 0.01 on all ten benchmarks), and explicit confirmation that the diffusion model and AR baselines were trained on identical data with matched compute. These additions will make clear that performance differences arise from the bidirectional scoring paradigm rather than capacity or data mismatches. revision: yes

  2. Referee: [Method / Experiments] Method and Experiments sections: the core assumption that masked reconstruction recoverability isolates intrinsic text quality (independent of the diffusion model's noise schedule, continuous masking rates, or training corpus) is load-bearing for the central claim but is not directly tested against potential systematic biases that could correlate with benchmark labels.

    Authors: This is a substantive point. While the framework is motivated by the goal of isolating recoverability, the initial submission did not include direct tests for correlation with diffusion-specific artifacts. We will add a new ablation subsection that (i) varies the noise schedule and number of masking rates while holding the evaluation set fixed and (ii) reports that benchmark rankings remain stable (Spearman ρ > 0.92). We will also compute correlations between DiffScore and potential corpus-derived confounders to demonstrate that systematic bias does not drive the observed improvements. revision: yes

  3. Referee: [§3] §3 (framework description): the claimed 'evaluation hierarchy from local fluency to global coherence' and the bidirectional PMI decomposition are presented as natural consequences, but the manuscript does not provide a formal derivation or ablation showing that these quantities are independent of diffusion-specific hyperparameters.

    Authors: We acknowledge the absence of a formal derivation. In the revised §3 we will insert a short mathematical subsection deriving the hierarchy from the expectation of reconstruction loss under increasing masking probability, showing that low masking rates emphasize local n-gram statistics while high rates require long-range dependencies. We will also add an ablation table demonstrating that the PMI decomposition remains qualitatively unchanged across reasonable ranges of diffusion steps and beta schedules, thereby supporting independence from these hyperparameters. revision: yes

Circularity Check

0 steps flagged

No significant circularity in derivation or claims

full rationale

The paper introduces DiffScore as a new evaluation paradigm based on masked reconstruction recoverability in Masked Large Diffusion Language Models, contrasting it with autoregressive likelihood. The abstract and described framework present this as an independent alternative that eliminates positional bias via bidirectional context, with an evaluation hierarchy and diagnostic tools. No equations, derivations, or self-referential reductions are indicated that would equate any 'prediction' or result to its inputs by construction (e.g., no fitted parameters renamed as predictions, no self-definitional loops, and no load-bearing self-citations for uniqueness theorems). The central empirical claim of outperformance across ten benchmarks is presented as experimental validation rather than a tautological or fitted outcome. The framework is self-contained against external benchmarks, with any potential self-citations (if present in full text) not load-bearing for the core methodology. This aligns with the default expectation that most papers are not circular.

Axiom & Free-Parameter Ledger

0 free parameters · 0 axioms · 0 invented entities

Only the abstract is available; no explicit free parameters, axioms, or invented entities are stated. The framework implicitly assumes diffusion models can be trained to perform high-quality bidirectional reconstruction.

pith-pipeline@v0.9.0 · 5467 in / 1016 out tokens · 26613 ms · 2026-05-13T01:36:23.680154+00:00 · methodology

discussion (0)

Sign in with ORCID, Apple, or X to comment. Anyone can read and Pith papers without signing in.

Lean theorems connected to this paper

Citations machine-checked in the Pith Canon. Every link opens the source theorem in the public Lean library.

What do these tags mean?
matches
The paper's claim is directly supported by a theorem in the formal canon.
supports
The theorem supports part of the paper's argument, but the paper may add assumptions or extra steps.
extends
The paper goes beyond the formal theorem; the theorem is a base layer rather than the whole result.
uses
The paper appears to rely on the theorem as machinery.
contradicts
The paper's claim conflicts with a theorem or certificate in the canon.
unclear
Pith found a possible connection, but the passage is too broad, indirect, or ambiguous to say the theorem truly supports the claim.

Reference graph

Works this paper leans on

28 extracted references · 28 canonical work pages · 1 internal anchor

  1. [1]

    Elizabeth Clark, Tal August, Sofia Serrano, Nikita Haduong, Suchin Gururangan, and Noah A. Smith. All that’s ‘human’ is not gold: Evaluating human evaluation of generated text. In Chengqing Zong, Fei Xia, Wenjie Li, and Roberto Navigli, editors,Proceedings of the 59th Annual Meeting of the Association for Computational Linguistics and the 11th Internation...

  2. [2]

    Repairing the cracked foundation: A survey of obstacles in evaluation practices for generated text.Journal of Artificial Intelligence Research, 77:103–166, 2023

    Sebastian Gehrmann, Elizabeth Clark, and Thibault Sellam. Repairing the cracked foundation: A survey of obstacles in evaluation practices for generated text.Journal of Artificial Intelligence Research, 77:103–166, 2023

  3. [3]

    Bleu: a method for automatic evaluation of machine translation

    Kishore Papineni, Salim Roukos, Todd Ward, and Wei-Jing Zhu. Bleu: a method for automatic evaluation of machine translation. In Pierre Isabelle, Eugene Charniak, and Dekang Lin, editors,Proceedings of the 40th Annual Meeting of the Association for Computational Linguistics, pages 311–318, Philadelphia, Pennsylvania, USA, July 2002. Association for Computa...

  4. [4]

    ROUGE: A package for automatic evaluation of summaries

    Chin-Yew Lin. ROUGE: A package for automatic evaluation of summaries. InText Summarization Branches Out, pages 74–81, Barcelona, Spain, July 2004. Association for Computational Linguistics

  5. [5]

    Weinberger, and Yoav Artzi

    Tianyi Zhang*, Varsha Kishore*, Felix Wu*, Kilian Q. Weinberger, and Yoav Artzi. Bertscore: Evaluating text generation with bert. InInternational Conference on Learning Representations, 2020

  6. [6]

    Meyer, and Steffen Eger

    Wei Zhao, Maxime Peyrard, Fei Liu, Yang Gao, Christian M. Meyer, and Steffen Eger. MoverScore: Text generation evaluating with contextualized embeddings and earth mover distance. In Kentaro Inui, Jing Jiang, Vincent Ng, and Xiaojun Wan, editors,Proceedings of the 2019 Conference on Empirical Methods in Natural Language Processing and the 9th International...

  7. [7]

    Bartscore: Evaluating generated text as text generation

    Weizhe Yuan, Graham Neubig, and Pengfei Liu. Bartscore: Evaluating generated text as text generation. Advances in neural information processing systems, 34:27263–27277, 2021

  8. [8]

    GPTScore: Evaluate as you desire

    Jinlan Fu, See-Kiong Ng, Zhengbao Jiang, and Pengfei Liu. GPTScore: Evaluate as you desire. In Kevin Duh, Helena Gomez, and Steven Bethard, editors,Proceedings of the 2024 Conference of the North Ameri- can Chapter of the Association for Computational Linguistics: Human Language Technologies (Volume 1: Long Papers), pages 6556–6576, Mexico City, Mexico, J...

  9. [9]

    G-eval: NLG evaluation using gpt-4 with better human alignment

    Yang Liu, Dan Iter, Yichong Xu, Shuohang Wang, Ruochen Xu, and Chenguang Zhu. G-eval: NLG evaluation using gpt-4 with better human alignment. In Houda Bouamor, Juan Pino, and Kalika Bali, editors,Proceedings of the 2023 Conference on Empirical Methods in Natural Language Processing, pages 2511–2522, Singapore, December 2023. Association for Computational ...

  10. [10]

    a is b” fail to learn “b is a

    Lukas Berglund, Meg Tong, Maximilian Kaufmann, Mikita Balesni, Asa Cooper Stickland, Tomasz Korbak, and Owain Evans. The reversal curse: LLMs trained on “a is b” fail to learn “b is a”. InThe Twelfth International Conference on Learning Representations, 2024

  11. [11]

    Simple and effective masked diffusion language models

    Subham S Sahoo, Marianne Arriola, Yair Schiff, Aaron Gokaslan, Edgar Marroquin, Justin T Chiu, Alexander Rush, and V olodymyr Kuleshov. Simple and effective masked diffusion language models. Advances in Neural Information Processing Systems, 37:130136–130184, 2024

  12. [12]

    Large language diffusion models

    Shen Nie, Fengqi Zhu, Zebin You, Xiaolu Zhang, Jingyang Ou, Jun Hu, JUN ZHOU, Yankai Lin, Ji-Rong Wen, and Chongxuan Li. Large language diffusion models. InThe Thirty-ninth Annual Conference on Neural Information Processing Systems, 2025

  13. [13]

    Dream 7B: Diffusion Large Language Models

    Jiacheng Ye, Zhihui Xie, Lin Zheng, Jiahui Gao, Zirui Wu, Xin Jiang, Zhenguo Li, and Lingpeng Kong. Dream 7b: Diffusion large language models.arXiv preprint arXiv:2508.15487, 2025

  14. [14]

    BART: Denoising sequence-to-sequence pre-training for natural language generation, translation, and comprehension

    Mike Lewis, Yinhan Liu, Naman Goyal, Marjan Ghazvininejad, Abdelrahman Mohamed, Omer Levy, Veselin Stoyanov, and Luke Zettlemoyer. BART: Denoising sequence-to-sequence pre-training for natural language generation, translation, and comprehension. In Dan Jurafsky, Joyce Chai, Natalie Schluter, and Joel Tetreault, editors,Proceedings of the 58th Annual Meeti...

  15. [15]

    Summeval: Re-evaluating summarization evaluation.Transactions of the Association for Computa- tional Linguistics, 9:391–409, 2021

    Alexander R Fabbri, Wojciech Kry´sci´nski, Bryan McCann, Caiming Xiong, Richard Socher, and Dragomir Radev. Summeval: Re-evaluating summarization evaluation.Transactions of the Association for Computa- tional Linguistics, 9:391–409, 2021. 10

  16. [16]

    Newsroom: A dataset of 1.3 million summaries with diverse extractive strategies

    Max Grusky, Mor Naaman, and Yoav Artzi. Newsroom: A dataset of 1.3 million summaries with diverse extractive strategies. In Marilyn Walker, Heng Ji, and Amanda Stent, editors,Proceedings of the 2018 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, Volume 1 (Long Papers), pages 708–719,...

  17. [17]

    Re-evaluating evaluation in text summarization

    Manik Bhandari, Pranav Narayan Gour, Atabak Ashfaq, Pengfei Liu, and Graham Neubig. Re-evaluating evaluation in text summarization. In Bonnie Webber, Trevor Cohn, Yulan He, and Yang Liu, editors, Proceedings of the 2020 Conference on Empirical Methods in Natural Language Processing (EMNLP), pages 9347–9359, Online, November 2020. Association for Computati...

  18. [18]

    Tobias Falke, Leonardo F. R. Ribeiro, Prasetya Ajie Utama, Ido Dagan, and Iryna Gurevych. Ranking generated summaries by correctness: An interesting but challenging application for natural language inference. In Anna Korhonen, David Traum, and Lluís Màrquez, editors,Proceedings of the 57th Annual Meeting of the Association for Computational Linguistics, p...

  19. [19]

    Asking and answering questions to evaluate the factual consistency of summaries

    Alex Wang, Kyunghyun Cho, and Mike Lewis. Asking and answering questions to evaluate the factual consistency of summaries. In Dan Jurafsky, Joyce Chai, Natalie Schluter, and Joel Tetreault, editors, Proceedings of the 58th Annual Meeting of the Association for Computational Linguistics, pages 5008– 5020, Online, July 2020. Association for Computational Li...

  20. [20]

    Results of the WMT19 metrics shared task: Segment-level and strong MT systems pose big challenges

    Qingsong Ma, Johnny Wei, Ondˇrej Bojar, and Yvette Graham. Results of the WMT19 metrics shared task: Segment-level and strong MT systems pose big challenges. In Ondˇrej Bojar, Rajen Chatterjee, Christian Federmann, Mark Fishel, Yvette Graham, Barry Haddow, Matthias Huck, Antonio Jimeno Yepes, Philipp Koehn, André Martins, Christof Monz, Matteo Negri, Auré...

  21. [21]

    Phrase-based statistical language generation using graphical models and active learning

    François Mairesse, Milica Gaši´c, Filip Jurˇcíˇcek, Simon Keizer, Blaise Thomson, Kai Yu, and Steve Young. Phrase-based statistical language generation using graphical models and active learning. In Jan Haji ˇc, Sandra Carberry, Stephen Clark, and Joakim Nivre, editors,Proceedings of the 48th Annual Meeting of the Association for Computational Linguistics...

  22. [22]

    Semanti- cally conditioned LSTM-based natural language generation for spoken dialogue systems

    Tsung-Hsien Wen, Milica Gaši´c, Nikola Mrkši´c, Pei-Hao Su, David Vandyke, and Steve Young. Semanti- cally conditioned LSTM-based natural language generation for spoken dialogue systems. In Lluís Màrquez, Chris Callison-Burch, and Jian Su, editors,Proceedings of the 2015 Conference on Empirical Methods in Natural Language Processing, pages 1711–1721, Lisb...

  23. [23]

    METEOR: An automatic metric for MT evaluation with improved correlation with human judgments

    Satanjeev Banerjee and Alon Lavie. METEOR: An automatic metric for MT evaluation with improved correlation with human judgments. In Jade Goldstein, Alon Lavie, Chin-Yew Lin, and Clare V oss, editors, Proceedings of the ACL Workshop on Intrinsic and Extrinsic Evaluation Measures for Machine Translation and/or Summarization, pages 65–72, Ann Arbor, Michigan...

  24. [24]

    Towards a unified multi-dimensional evaluator for text generation

    Ming Zhong, Yang Liu, Da Yin, Yuning Mao, Yizhu Jiao, Pengfei Liu, Chenguang Zhu, Heng Ji, and Jiawei Han. Towards a unified multi-dimensional evaluator for text generation. In Yoav Goldberg, Zornitsa Kozareva, and Yue Zhang, editors,Proceedings of the 2022 Conference on Empirical Methods in Natural Language Processing, pages 2023–2038, Abu Dhabi, United ...

  25. [25]

    AlignScore: Evaluating factual consistency with a unified alignment function

    Yuheng Zha, Yichi Yang, Ruichen Li, and Zhiting Hu. AlignScore: Evaluating factual consistency with a unified alignment function. In Anna Rogers, Jordan Boyd-Graber, and Naoaki Okazaki, editors, Proceedings of the 61st Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), pages 11328–11348, Toronto, Canada, July 2023. As...

  26. [26]

    QuestEval: Summarization asks for fact-based evaluation

    Thomas Scialom, Paul-Alexis Dray, Sylvain Lamprier, Benjamin Piwowarski, Jacopo Staiano, Alex Wang, and Patrick Gallinari. QuestEval: Summarization asks for fact-based evaluation. In Marie-Francine Moens, Xuanjing Huang, Lucia Specia, and Scott Wen-tau Yih, editors,Proceedings of the 2021 Conference on Empirical Methods in Natural Language Processing, pag...

  27. [27]

    The comparison of regression variables.Journal of the Royal Statistical Society: Series B (Methodological), 21(2):396–399, 1959

    Edward J Williams. The comparison of regression variables.Journal of the Royal Statistical Society: Series B (Methodological), 21(2):396–399, 1959. 11

  28. [28]

    timestep

    Edward J Hu, Yelong Shen, Phillip Wallis, Zeyuan Allen-Zhu, Yuanzhi Li, Shean Wang, Lu Wang, and Weizhu Chen. LoRA: Low-rank adaptation of large language models. InInternational Conference on Learning Representations, 2022. 12 A Theoretical Analysis This section provides formal analysis of the structural advantages of masked reconstruction over autoregres...