pith. machine review for the scientific record.

arxiv: 2604.17200 · v1 · submitted 2026-04-19 · 💻 cs.CL

Recognition: unknown

Calibrating Model-Based Evaluation Metrics for Summarization

Authors on Pith: no claims yet

Pith reviewed 2026-05-10 06:32 UTC · model grok-4.3

classification 💻 cs.CL
keywords summarization evaluation · model-based metrics · calibration · proxy scores · group isotonic regression binning · faithfulness · completeness

The pith

A framework generates proxy scores for summarization quality without reference summaries or human annotations and calibrates them with group isotonic regression binning to better match ground-truth metrics.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper presents a general framework for creating individual and average proxy scores to evaluate summary quality dimensions like completeness, conciseness, and faithfulness. These scores are produced without needing reference summaries, human annotations, or costly large language model evaluations. The framework incorporates group isotonic regression binning as a calibration step that adjusts the raw model predictions to align more closely with ground-truth metrics. This approach applies to both continuous scores in summarization and discrete tasks such as question answering. Experiments across seven datasets indicate that the method outperforms existing baselines in calibration accuracy.

Core claim

The authors claim that their general framework generates proxy scores without reference summaries or annotations, and that group isotonic regression binning calibrates these raw predictions to align better with ground-truth evaluation metrics, leading to more reliable assessments.

What carries the argument

Group isotonic regression binning (GIRB), a calibration technique that bins predictions and applies isotonic regression within groups to adjust scores for better alignment with ground truth.
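
To make the mechanics concrete, here is a minimal sketch of how GIRB could be implemented, following the two-step description in Figure 2: cluster shared embeddings into groups, then fit a monotone (isotonic) map per group on a labeled calibration split. The class name `GIRBCalibrator`, the parameter `n_groups`, the choice of KMeans for grouping, and the scikit-learn APIs are illustrative assumptions, not the paper's implementation.

```python
# Minimal sketch of group isotonic regression binning (GIRB), following the
# two-step description in Figure 2: (i) cluster shared embeddings into groups,
# (ii) fit a monotone (isotonic) map per group on a labeled calibration set.
# KMeans and all names here are illustrative assumptions, not the paper's code.
import numpy as np
from sklearn.cluster import KMeans
from sklearn.isotonic import IsotonicRegression


class GIRBCalibrator:
    def __init__(self, n_groups: int = 8):
        self.n_groups = n_groups
        self.kmeans = KMeans(n_clusters=n_groups, n_init=10, random_state=0)
        self.models: dict[int, IsotonicRegression] = {}

    def fit(self, embeddings: np.ndarray, raw_scores: np.ndarray,
            ground_truth: np.ndarray) -> "GIRBCalibrator":
        """Fit per-group monotone maps on a calibration set with labels."""
        groups = self.kmeans.fit_predict(embeddings)
        for g in range(self.n_groups):
            mask = groups == g
            iso = IsotonicRegression(out_of_bounds="clip")
            iso.fit(raw_scores[mask], ground_truth[mask])
            self.models[g] = iso
        return self

    def predict(self, embeddings: np.ndarray, raw_scores: np.ndarray) -> np.ndarray:
        """Assign each pair to a group, then apply that group's monotone map."""
        groups = self.kmeans.predict(embeddings)
        calibrated = np.empty(len(raw_scores), dtype=float)
        for g in range(self.n_groups):
            mask = groups == g
            if mask.any():
                calibrated[mask] = self.models[g].predict(raw_scores[mask])
        return calibrated
```

Fitting once on a labeled calibration set and then calling predict on other datasets mirrors the transfer protocol described in the simulated rebuttal below (calibrate on one dataset, apply to the remaining six).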

If this is right

  • Proxy scores enable evaluation of average summary quality for a document without multiple references.
  • The calibrated metrics provide more reliable estimates of faithfulness, completeness, and conciseness.
  • The framework extends to discrete-value tasks like question answering.
  • Outperformance holds consistently across seven different datasets.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the authors make directly.

  • This calibration could be applied to improve evaluation in other generation tasks beyond summarization.
  • Proxy scores might allow for more scalable evaluation of large numbers of summaries without additional annotation costs.
  • The method suggests a path to reducing dependence on human judgments or references in NLP evaluation pipelines.

Load-bearing premise

The generated proxy scores, despite being created without references, can be validated and calibrated against metrics that do use references or annotations.

What would settle it

Demonstrating on a new dataset whether the GIRB-calibrated scores correlate with human judgments more strongly than uncalibrated model predictions: higher correlation would support the claim; lower or equal correlation would refute it.
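
A minimal sketch of that test, assuming held-out arrays of human judgments plus raw and GIRB-calibrated scores over the same summaries (names hypothetical; the paper does not prescribe this exact protocol):

```python
# Hedged sketch: does calibration improve rank correlation with humans?
# All three inputs are 1-D arrays aligned over the same summaries.
from scipy.stats import spearmanr

def calibration_helps(raw_scores, calibrated_scores, human_judgments) -> bool:
    """True if GIRB-calibrated scores track human judgments more closely."""
    rho_raw, _ = spearmanr(raw_scores, human_judgments)
    rho_cal, _ = spearmanr(calibrated_scores, human_judgments)
    return rho_cal > rho_raw  # lower or equal correlation would refute the claim
```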

Figures

Figures reproduced from arXiv: 2604.17200 by Dhanajit Brahma, Hongye Liu, Ricardo Henao.

Figure 1
Figure 1. Workflow Overview. We input a document and its summary to produce two types of scores. The average score, reflecting the mean level across multiple summarization systems, serves as a reference. Comparing an individual score with this reference indicates summary quality: above average is good, below average is poor. (Diagram labels: document & summary; group ID g; proxy score y; raw score ŷ; ground-truth score t; embedding model.) view at source ↗
Figure 2
Figure 2. Framework Overview. A unified scoring model maps each document-summary pair to raw scores ŷ via a shared embedding used for grouping. GIRB has two steps: i) embedding-space clustering to assign a group ID, and ii) post hoc calibration using the group ID, raw scores, and model-based ground truth. The framework yields proxy scores for K evaluation dimensions: the individual proxy score y_k^(i) and the average proxy score ȳ_k. view at source ↗
Figure 3
Figure 3. Results comparing calibration methods across dimensions and metrics. The left two plots show cases … [PITH_FULL_IMAGE:figures/full_fig_p007_3.png] view at source ↗
Figure 4
Figure 4. Results comparing calibration methods over all metrics in the MMLU dataset. [PITH_FULL_IMAGE:figures/full_fig_p008_4.png] view at source ↗
Figure 5
Figure 5. Bar Plot Comparison of Calibration Methods Across Dimensions and Metrics. [PITH_FULL_IMAGE:figures/full_fig_p014_5.png] view at source ↗
Figure 6
Figure 6. Bar Plot Comparison of Calibration Methods Across Settings in MathQA Dataset. [PITH_FULL_IMAGE:figures/full_fig_p026_6.png] view at source ↗
Figure 7
Figure 7. Bar Plot Comparison of Calibration Methods Across Settings in MMLU Dataset. [PITH_FULL_IMAGE:figures/full_fig_p027_7.png] view at source ↗
Figure 8
Figure 8. Bar Plot Comparison of Calibration Methods Across Settings in OpenBookQA Dataset. [PITH_FULL_IMAGE:figures/full_fig_p028_8.png] view at source ↗
Figure 9
Figure 9. Bar Plot Comparison of Calibration Methods Across Settings in SciQ Dataset. [PITH_FULL_IMAGE:figures/full_fig_p029_9.png] view at source ↗
Figure 10
Figure 10. Bar Plot Comparison of Calibration Methods Across Settings in TriviaQA Dataset. [PITH_FULL_IMAGE:figures/full_fig_p030_10.png] view at source ↗
Figure 11
Figure 11. Bar Plot Comparison of Calibration Methods Across Settings in TruthfulQA Dataset. [PITH_FULL_IMAGE:figures/full_fig_p031_11.png] view at source ↗
Original abstract

Recent advances in summary evaluation are based on model-based metrics to assess quality dimensions, such as completeness, conciseness, and faithfulness. However, these methods often require large language models, and predicted scores are frequently miscalibrated, limiting their reliability. Moreover, evaluating the average quality across different summaries for a single document typically requires access to multiple reference summaries. Here, we propose a general framework that generates individual and average proxy scores without relying on reference summaries, human annotations, or expensive model-based metrics. We also propose group isotonic regression binning (GIRB), a calibration method that adjusts the raw predictions to better align with ground-truth evaluation metrics. While we focus on continuous-value scenarios, such as summarization, the method is applicable to discrete-value tasks, such as question answering. Experiments on seven datasets demonstrate that our approach consistently outperforms existing baselines.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, and this is the friction.

Referee Report

2 major / 1 minor

Summary. The paper proposes a general framework for generating individual and average proxy scores for summarization quality dimensions (e.g., completeness, conciseness, faithfulness) using model-based metrics, without reference summaries, human annotations, or large expensive models. It introduces Group Isotonic Regression Binning (GIRB) as a calibration procedure to adjust raw proxy predictions for better alignment with ground-truth metrics. The method is claimed to extend to discrete tasks such as QA, and experiments on seven datasets show consistent outperformance over existing baselines.

Significance. If the full pipeline (including GIRB calibration) can be shown to operate without per-dataset ground-truth labels, the work would offer a practical route to scalable, annotation-free evaluation for summarization and related NLG tasks. The emphasis on continuous-value calibration and the explicit applicability note for discrete settings are positive features.

major comments (2)
  1. [Abstract, §3.2] The central claim that the approach generates and validates proxy scores 'without relying on reference summaries, human annotations' is undercut by the description of GIRB, which fits its isotonic regression and bin boundaries by minimizing loss against observed ground-truth values. The manuscript must clarify whether (a) a single transferable calibration set is used across all seven datasets or (b) GIRB parameters are fit per dataset on the evaluation data containing ground-truth scores. Absent this, the reported outperformance may reflect supervised per-dataset calibration rather than an annotation-free method.
  2. [§4] The abstract claims outperformance on seven datasets, yet the provided text supplies no information on baseline implementations, statistical tests (e.g., significance of differences), data splits, or whether calibration choices were made post hoc. These omissions make it impossible to assess whether the central empirical claim is robust.
minor comments (1)
  1. [§3] Notation for proxy scores and GIRB binning could be made more explicit (e.g., define the monotonic mapping function and its parameters in a single equation block).
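
One editorial sketch of the consolidated notation the referee requests, reusing the symbols from Figure 2 (raw score ŷ, ground truth t, group ID g); this is an assumed form, not an equation taken from the paper:

```latex
% Calibrated score: a per-group monotone map applied to the raw score,
% with each map fit by isotonic regression within its group.
\tilde{y}_k^{(i)} = f_{g(i),k}\!\left(\hat{y}_k^{(i)}\right),
\qquad
f_{g,k} = \operatorname*{arg\,min}_{f \in \mathcal{F}_{\uparrow}}
  \sum_{i:\, g(i)=g} \left( f\!\left(\hat{y}_k^{(i)}\right) - t_k^{(i)} \right)^{2}
```

Here \mathcal{F}_{\uparrow} is the class of non-decreasing step functions (the isotonic-regression hypothesis class), g(i) the group ID assigned by embedding clustering, and t_k^{(i)} the model-based ground-truth score for dimension k.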

Simulated Author's Rebuttal

2 responses · 0 unresolved

We thank the referee for the constructive comments, which help clarify the scope of our claims and improve the experimental reporting. We address each major comment below.

Point-by-point responses
  1. Referee: [Abstract, §3.2] The central claim that the approach generates and validates proxy scores 'without relying on reference summaries, human annotations' is undercut by the description of GIRB, which fits its isotonic regression and bin boundaries by minimizing loss against observed ground-truth values. The manuscript must clarify whether (a) a single transferable calibration set is used across all seven datasets or (b) GIRB parameters are fit per dataset on the evaluation data containing ground-truth scores. Absent this, the reported outperformance may reflect supervised per-dataset calibration rather than an annotation-free method.

    Authors: We agree that the distinction between proxy generation and calibration requires explicit clarification to avoid any ambiguity. The core proxy scores are produced by model-based metrics without reference summaries or human annotations on the target instances. GIRB calibration, however, does require ground-truth labels from a separate calibration dataset to fit the isotonic regression and bin boundaries. In our experiments, we used a single calibration set drawn from one dataset and transferred the resulting parameters to the remaining six datasets; we did not perform per-dataset fitting on the evaluation splits. We will revise the abstract and §3.2 to state this distinction clearly, add a paragraph on calibration-set size and transferability, and include an ablation comparing transferred versus per-dataset calibration to demonstrate that the reported gains are not solely due to supervised per-dataset fitting. revision: yes

  2. Referee: [§4] The abstract claims outperformance on seven datasets, yet the provided text supplies no information on baseline implementations, statistical tests (e.g., significance of differences), data splits, or whether calibration choices were made post hoc. These omissions make it impossible to assess whether the central empirical claim is robust.

    Authors: We acknowledge that the current experimental section omits several details necessary for assessing robustness and reproducibility. In the revised manuscript we will expand §4 (and add an appendix if needed) with: (i) precise descriptions of baseline implementations, including model versions, prompting strategies, and hyper-parameters; (ii) statistical significance tests (paired t-tests and Wilcoxon signed-rank tests with p-values) for all reported improvements; (iii) explicit documentation of data splits, indicating which portions were used exclusively for calibration versus evaluation; and (iv) confirmation that calibration parameters were determined on the calibration set before any evaluation and were not tuned post-hoc on test data. These additions will allow readers to verify the empirical claims. revision: yes
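
A minimal sketch of the promised significance testing, assuming matched per-item calibration-error arrays for GIRB and a baseline (array names hypothetical; only the choice of paired t-test and Wilcoxon signed-rank test comes from the response):

```python
# Paired tests on matched calibration errors; smaller error is better.
import numpy as np
from scipy.stats import ttest_rel, wilcoxon

def paired_significance(errors_girb: np.ndarray, errors_baseline: np.ndarray) -> dict:
    """Return p-values for the paired t-test and Wilcoxon signed-rank test."""
    _, p_t = ttest_rel(errors_girb, errors_baseline)
    _, p_w = wilcoxon(errors_girb, errors_baseline)
    return {"paired_t": p_t, "wilcoxon": p_w}
```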

Circularity Check

0 steps flagged

No significant circularity in empirical proxy generation and calibration framework

Full rationale

The paper describes an empirical method for generating proxy scores for summarization evaluation without references or annotations, followed by a separate GIRB calibration step that aligns outputs to ground-truth metrics on datasets. No derivation chain, equations, or first-principles results are presented that reduce any claimed prediction or result to its inputs by construction. The outperformance on seven datasets is reported as an experimental finding rather than a self-referential or fitted-by-definition outcome. Standard calibration procedures like isotonic regression do not constitute circularity when the paper explicitly separates raw proxy generation (annotation-free) from alignment to external ground-truth labels. The framework is checked against external benchmarks, with no load-bearing self-citations or ansatz smuggling.

Axiom & Free-Parameter Ledger

0 free parameters · 0 axioms · 0 invented entities

An abstract-only review surfaces no explicit free parameters, axioms, or invented entities; the method is described at a high level, without detailed mathematical derivations or assumptions.

pith-pipeline@v0.9.0 · 5443 in / 992 out tokens · 39991 ms · 2026-05-10T06:32:37.404690+00:00 · methodology

