Calibrating Model-Based Evaluation Metrics for Summarization
Pith reviewed 2026-05-10 06:32 UTC · model grok-4.3
The pith
A framework generates proxy scores for summarization quality without reference summaries or human annotations and calibrates them with group isotonic regression binning to better match ground-truth metrics.
A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.
Core claim
The authors claim that their general framework generates proxy scores without reference summaries or annotations, and that group isotonic regression binning calibrates these raw predictions to align better with ground-truth evaluation metrics, leading to more reliable assessments.
What carries the argument
Group isotonic regression binning (GIRB), a calibration technique that bins predictions and applies isotonic regression within groups to adjust scores for better alignment with ground truth.
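The review does not reproduce the paper's exact GIRB formulation, but the stated idea (fit a monotone mapping to ground truth within each group, then bin the calibrated outputs) can be sketched. Everything below is an illustrative assumption — the scikit-learn isotonic regression stand-in, the equal-width bins, and the bin-midpoint snapping are this sketch's choices, not the paper's:

```python
import numpy as np
from sklearn.isotonic import IsotonicRegression

def girb_calibrate(raw_scores, ground_truth, groups, n_bins=10):
    """Hypothetical sketch of group isotonic regression binning (GIRB):
    fit a monotone mapping from raw proxy scores to ground-truth scores
    within each group, then snap calibrated outputs to bin midpoints."""
    calibrators = {}
    for g in np.unique(groups):
        mask = groups == g
        iso = IsotonicRegression(out_of_bounds="clip")
        iso.fit(raw_scores[mask], ground_truth[mask])
        calibrators[g] = iso

    def apply_calibration(scores, score_groups):
        out = np.empty_like(scores, dtype=float)
        for g, iso in calibrators.items():
            mask = score_groups == g
            out[mask] = iso.predict(scores[mask])
        # Equal-width binning of the calibrated scores (an assumption;
        # the paper may learn bin boundaries instead).
        edges = np.linspace(out.min(), out.max(), n_bins + 1)
        idx = np.clip(np.digitize(out, edges) - 1, 0, n_bins - 1)
        return (edges[idx] + edges[idx + 1]) / 2.0

    return apply_calibration
```

On synthetic data where raw scores are a monotone distortion of ground truth, a mapping of this shape recovers most of the distortion, which is the calibration behavior the review attributes to GIRB.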
If this is right
- Proxy scores enable evaluation of average summary quality for a document without multiple references.
- The calibrated metrics provide more reliable estimates of faithfulness, completeness, and conciseness.
- The framework extends to discrete-value tasks like question answering.
- Outperformance holds consistently across seven different datasets.
Where Pith is reading between the lines
- This calibration could be applied to improve evaluation in other generation tasks beyond summarization.
- Proxy scores might allow for more scalable evaluation of large numbers of summaries without additional annotation costs.
- The method suggests a path to reducing dependence on human judgments or references in NLP evaluation pipelines.
Load-bearing premise
The generated proxy scores, despite being created without references, can be validated and calibrated against metrics that do use references or annotations.
What would settle it
A comparison on a held-out dataset of how strongly GIRB-calibrated scores and uncalibrated model predictions each correlate with human judgments: stronger correlation after calibration would support the claim, while equal or lower correlation would refute it.
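Operationally, that comparison is only a few lines. A minimal sketch, assuming aligned per-summary human judgment scores are available; using Pearson correlation as the agreement measure is this sketch's choice, not the paper's:

```python
from scipy.stats import pearsonr

def settles_it(human, raw_scores, calibrated_scores):
    """Compare correlation with human judgments before and after
    calibration; the calibration claim survives only if the
    calibrated scores agree at least as well as the raw ones."""
    r_raw, _ = pearsonr(human, raw_scores)
    r_cal, _ = pearsonr(human, calibrated_scores)
    return r_cal >= r_raw, r_raw, r_cal
```

Note that purely monotone recalibration cannot change rank-based agreement within a single group, so a value-sensitive measure such as Pearson (or mean absolute error against human scores) is the more informative check here.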
Original abstract
Recent advances in summary evaluation are based on model-based metrics to assess quality dimensions, such as completeness, conciseness, and faithfulness. However, these methods often require large language models, and predicted scores are frequently miscalibrated, limiting their reliability. Moreover, evaluating the average quality across different summaries for a single document typically requires access to multiple reference summaries. Here, we propose a general framework that generates individual and average proxy scores without relying on reference summaries, human annotations, or expensive model-based metrics. We also propose group isotonic regression binning (GIRB), a calibration method that adjusts the raw predictions to better align with ground-truth evaluation metrics. While we focus on continuous-value scenarios, such as summarization, the method is applicable to discrete-value tasks, such as question answering. Experiments on seven datasets demonstrate that our approach consistently outperforms existing baselines.
Editorial analysis
A structured set of objections, weighed in public.
Referee Report
Summary. The paper proposes a general framework for generating individual and average proxy scores for summarization quality dimensions (e.g., completeness, conciseness, faithfulness) using model-based metrics, without reference summaries, human annotations, or large expensive models. It introduces Group Isotonic Regression Binning (GIRB) as a calibration procedure to adjust raw proxy predictions for better alignment with ground-truth metrics. The method is claimed to extend to discrete tasks such as QA, and experiments on seven datasets show consistent outperformance over existing baselines.
Significance. If the full pipeline (including GIRB calibration) can be shown to operate without per-dataset ground-truth labels, the work would offer a practical route to scalable, annotation-free evaluation for summarization and related NLG tasks. The emphasis on continuous-value calibration and the explicit applicability note for discrete settings are positive features.
Major comments (2)
- [Abstract, §3.2] The central claim that the approach generates and validates proxy scores 'without relying on reference summaries, human annotations' is undercut by the description of GIRB, which fits isotonic-regression bin boundaries by minimizing loss against observed ground-truth values. The manuscript must clarify whether (a) a single transferable calibration set is used across all seven datasets or (b) GIRB parameters are fit per dataset on the evaluation data containing ground-truth scores. Absent this, the reported outperformance may reflect supervised per-dataset calibration rather than an annotation-free method.
- [§4, Experiments] The abstract states outperformance on seven datasets, yet the provided text supplies no information on baseline implementations, statistical tests (e.g., significance of differences), data splits, or whether calibration choices were made post hoc. These omissions make it impossible to assess whether the central empirical claim is robust.
Minor comments (1)
- [§3] Notation for proxy scores and GIRB binning could be made more explicit (e.g., define the monotonic mapping function and its parameters in a single equation block).
Simulated Author's Rebuttal
We thank the referee for the constructive comments, which help clarify the scope of our claims and improve the experimental reporting. We address each major comment below.
Point-by-point responses
- Referee: [Abstract, §3.2] The central claim that the approach generates and validates proxy scores 'without relying on reference summaries, human annotations' is undercut by the description of GIRB, which fits isotonic-regression bin boundaries by minimizing loss against observed ground-truth values. The manuscript must clarify whether (a) a single transferable calibration set is used across all seven datasets or (b) GIRB parameters are fit per dataset on the evaluation data containing ground-truth scores. Absent this, the reported outperformance may reflect supervised per-dataset calibration rather than an annotation-free method.
  Authors: We agree that the distinction between proxy generation and calibration requires explicit clarification. The core proxy scores are produced by model-based metrics without reference summaries or human annotations on the target instances. GIRB calibration, however, does require ground-truth labels from a separate calibration dataset to fit the isotonic regression and bin boundaries. In our experiments, we used a single calibration set drawn from one dataset and transferred the resulting parameters to the remaining six datasets; we did not perform per-dataset fitting on the evaluation splits. We will revise the abstract and §3.2 to state this distinction clearly, add a paragraph on calibration-set size and transferability, and include an ablation comparing transferred versus per-dataset calibration to show that the reported gains are not an artifact of supervised per-dataset fitting. Revision: yes.
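The transfer protocol the authors describe (fit calibration once on a single labeled dataset, then apply the frozen mapping everywhere else) can be sketched as follows, using plain isotonic regression as a hypothetical stand-in for full GIRB; the dataset names are invented for illustration:

```python
import numpy as np
from sklearn.isotonic import IsotonicRegression

def fit_transferable_calibrator(cal_raw, cal_labels):
    """Fit the calibration mapping once, on the only dataset
    whose ground-truth labels the method is allowed to see."""
    iso = IsotonicRegression(out_of_bounds="clip")
    iso.fit(cal_raw, cal_labels)
    return iso

def evaluate_transfer(calibrator, datasets):
    """Apply the frozen mapping to every other dataset's raw proxy
    scores, touching none of their ground-truth labels."""
    return {name: calibrator.predict(raw) for name, raw in datasets.items()}
```

The key property the referee asks for is visible in the code: only `fit_transferable_calibrator` ever sees labels, so any gains on the transferred datasets cannot come from per-dataset supervised fitting.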
- Referee: [§4, Experiments] The abstract states outperformance on seven datasets, yet the provided text supplies no information on baseline implementations, statistical tests (e.g., significance of differences), data splits, or whether calibration choices were made post hoc. These omissions make it impossible to assess whether the central empirical claim is robust.
  Authors: We acknowledge that the current experimental section omits several details necessary for assessing robustness and reproducibility. In the revised manuscript we will expand §4 (and add an appendix if needed) with: (i) precise descriptions of baseline implementations, including model versions, prompting strategies, and hyperparameters; (ii) statistical significance tests (paired t-tests and Wilcoxon signed-rank tests with p-values) for all reported improvements; (iii) explicit documentation of data splits, indicating which portions were used exclusively for calibration versus evaluation; and (iv) confirmation that calibration parameters were determined on the calibration set before any evaluation and were not tuned post hoc on test data. These additions will allow readers to verify the empirical claims. Revision: yes.
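The significance testing promised in point (ii) is standard. A minimal sketch, assuming aligned per-example quality scores for the two systems on the same test items (the report structure is this sketch's invention):

```python
import numpy as np
from scipy.stats import ttest_rel, wilcoxon

def significance_report(method_scores, baseline_scores, alpha=0.05):
    """Paired t-test and Wilcoxon signed-rank test over per-example
    scores; both tests pair scores item-by-item, so the two arrays
    must cover the same test examples in the same order."""
    method_scores = np.asarray(method_scores, dtype=float)
    baseline_scores = np.asarray(baseline_scores, dtype=float)
    t_stat, t_p = ttest_rel(method_scores, baseline_scores)
    w_stat, w_p = wilcoxon(method_scores, baseline_scores)
    return {
        "mean_gain": float(np.mean(method_scores - baseline_scores)),
        "t_p": float(t_p),
        "wilcoxon_p": float(w_p),
        "significant": bool(t_p < alpha and w_p < alpha),
    }
```

Reporting both tests is a reasonable belt-and-suspenders choice: the t-test assumes roughly normal score differences, while the Wilcoxon test only assumes symmetric differences and is robust to outlier items.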
Circularity Check
No significant circularity in empirical proxy generation and calibration framework
Full rationale
The paper describes an empirical method for generating proxy scores for summarization evaluation without references or annotations, followed by a separate GIRB calibration step that aligns outputs to ground-truth metrics on datasets. No derivation chain, equations, or first-principles results are presented that reduce any claimed prediction or result to its inputs by construction. The outperformance on seven datasets is reported as an experimental finding rather than a self-referential or fitted-by-definition outcome. Standard calibration procedures like isotonic regression do not constitute circularity when the paper explicitly separates raw proxy generation (annotation-free) from alignment to external ground-truth labels. The framework is validated against external benchmarks, with no load-bearing self-citations or smuggled ansatz.