PromptNCE: Pointwise Mutual Information Predictions Using Only LLMs and Contrastive Estimation Prompts

Chris Piech; Juliette Woodrow

arxiv: 2605.21776 · v1 · pith:JLAQUNZHnew · submitted 2026-05-20 · 💻 cs.CL

Pith reviewed 2026-05-22 08:42 UTC · model grok-4.3

keywords pointwise mutual information · zero-shot estimation · contrastive prompting · large language models · conditional probability · prompt engineering · information theory · natural language processing

The pith

LLMs estimate pointwise mutual information zero-shot by adding an explicit OTHER category to contrastive prompts.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper shows that large language models can estimate pointwise mutual information directly from text without any task-specific training. It frames conditional probability estimation as a contrastive task and adds an explicit OTHER option to the list of candidates in the prompt. Theory indicates this forces the model to output probabilities that reflect the true distribution over all possible answers instead of just ranking the listed ones. The resulting PromptNCE method outperforms other zero-shot estimators and reaches Spearman correlations as high as 0.82 with human-derived PMI values on three datasets. A case study demonstrates its use for scoring student knowledge summaries in low-data educational settings.

Core claim

Adding an explicit OTHER category to a contrastive prompt recovers the true conditional probability P(y|x) rather than merely a ranking over listed candidates, turning the prompt into a general-purpose zero-shot probability estimator. PromptNCE is the best zero-shot method on all three datasets, reaching Spearman correlation up to 0.82 with human-derived PMI.

What carries the argument

PromptNCE, a contrastive prompting technique that augments the candidate set with an explicit OTHER category to elicit probabilities matching the true conditional distribution P(y|x).
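
As a rough illustration of the mechanism, the sketch below builds a candidate list with an explicit OTHER option and turns elicited option scores into PMI estimates via PMI(x; y) = log P(y|x) / P(y). The prompt wording, the function names `build_prompt` and `pmi_from_scores`, and all numbers are hypothetical and chosen here for illustration; they are not the paper's templates or results. The sketch also assumes the label marginals P(y) are supplied separately.

```python
import math

def build_prompt(context, candidates):
    """Hypothetical template: list the candidates plus an explicit OTHER option."""
    options = list(candidates) + ["OTHER"]
    lines = [f"Context: {context}", "Which option best fits the context?"]
    lines += [f"({i + 1}) {opt}" for i, opt in enumerate(options)]
    lines.append("Give a probability for every option; they should sum to 1.")
    return "\n".join(lines)

def pmi_from_scores(scores, marginals):
    """Turn elicited option scores into PMI estimates.

    scores: dict option -> nonnegative score, including the key "OTHER".
    marginals: dict candidate -> P(y), assumed known (e.g. empirical base rates).
    """
    total = sum(scores.values())  # OTHER is part of the normalizer
    pmi = {}
    for y, p_y in marginals.items():
        p_y_given_x = scores[y] / total
        pmi[y] = math.log(p_y_given_x / p_y)
    return pmi

# Made-up elicited scores and marginals, for illustration only.
prompt = build_prompt("I got the job!", ["joy", "anger"])
scores = {"joy": 0.5, "anger": 0.1, "OTHER": 0.4}
marginals = {"joy": 0.25, "anger": 0.2}
pmi = pmi_from_scores(scores, marginals)
```

Because OTHER stays in the normalizer, a listed candidate's estimate does not inflate when the remaining listed candidates are weak. A score of exactly zero would make the logarithm undefined, so a real pipeline would likely need a small floor.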

If this is right

PromptNCE outperforms other zero-shot estimators and reaches Spearman correlation up to 0.82 with human-derived PMI on three datasets.
The method enables scoring of student knowledge summaries without task-specific labeled data.
Contrastive prompts become usable as general-purpose zero-shot conditional probability estimators.
Mutual information can be estimated from text in low-data regimes without training a critic model.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

The approach could extend to zero-shot estimation of other information measures such as entropy or conditional entropy.
It might support real-time uncertainty quantification in generated text for downstream applications.
Further tests on non-English text or specialized domains would check how broadly the OTHER-category recovery holds.

Load-bearing premise

Including the OTHER option forces the model's output probabilities to match the true conditional distribution over the full space instead of just renormalizing among the listed choices.
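
The premise can be written out in a few lines. This is a sketch under the proportionality assumption the referee report questions, with notation chosen here rather than copied from the paper: $y_1, \dots, y_K$ are the listed candidates, $w_i$ the elicited weight for $y_i$, and $c$ an unknown positive constant.

```latex
% Assumption (not verified in the source): elicited weights are proportional
% to the true conditionals, with OTHER absorbing the unlisted mass.
\begin{align*}
w_i &= c\,P(y_i \mid x), \qquad
w_{\mathrm{OTHER}} = c\Bigl(1 - \sum_{j=1}^{K} P(y_j \mid x)\Bigr), \qquad c > 0 \\[4pt]
\text{without OTHER:}\quad
\frac{w_i}{\sum_{j=1}^{K} w_j}
  &= \frac{P(y_i \mid x)}{\sum_{j=1}^{K} P(y_j \mid x)} \\[4pt]
\text{with OTHER:}\quad
\frac{w_i}{\sum_{j=1}^{K} w_j + w_{\mathrm{OTHER}}}
  &= \frac{c\,P(y_i \mid x)}{c} = P(y_i \mid x)
\end{align*}
```

The closed-set ratio equals $P(y_i \mid x)$ only when the listed candidates already carry all of the probability mass; otherwise it preserves the ranking but overstates every listed probability.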

What would settle it

Compare probabilities from PromptNCE prompts against empirical conditional frequencies computed from a large labeled text corpus and check for close numerical agreement.
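
One way such a check might look, as a minimal sketch: count conditional frequencies in a labeled corpus and measure the average absolute gap to the elicited probabilities. The helper names, the toy corpus, and the elicited numbers are all invented for illustration; the choice of mean absolute gap as the agreement measure is also an assumption.

```python
from collections import Counter

def empirical_conditionals(pairs):
    """Estimate P(y | x) by counting (x, y) pairs from a labeled corpus."""
    joint = Counter(pairs)
    x_totals = Counter(x for x, _ in pairs)
    return {(x, y): n / x_totals[x] for (x, y), n in joint.items()}

def mean_abs_gap(estimated, empirical):
    """Average |estimated - empirical| over the (x, y) pairs present in both."""
    shared = estimated.keys() & empirical.keys()
    if not shared:
        raise ValueError("no overlapping (x, y) pairs to compare")
    return sum(abs(estimated[k] - empirical[k]) for k in shared) / len(shared)

# Toy corpus and made-up elicited probabilities, for illustration only.
corpus = [("rain", "umbrella")] * 3 + [("rain", "coat")] * 1
empirical = empirical_conditionals(corpus)
elicited = {("rain", "umbrella"): 0.70, ("rain", "coat"): 0.20}
gap = mean_abs_gap(elicited, empirical)
```

A small gap on a large corpus would support the premise numerically, whereas a high rank correlation alone would not distinguish true probabilities from a monotone distortion of them.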

Figures

Figures reproduced from arXiv: 2605.21776 by Chris Piech, Juliette Woodrow.

**Figure 1.** Spearman ρ between estimated and true PMI. All methods use Claude Sonnet 4 unless otherwise noted in the label. The dashed line shows PromptNCE using the empirical label marginal. Error bars are standard error of the mean. […] a marginal-dominated dataset where PMI rankings are driven by label base rates, P(y). Values near zero indicate a conditional-dominated dataset where rankings are driven by how the i…

Original abstract

Estimating mutual information from text usually requires training a task-specific critic, which limits its use in low-data settings. We ask whether large language models can instead estimate pointwise mutual information zero-shot, using only prompts and elicited probabilities. We introduce a benchmark with human-derived ground-truth PMI across three publicly available datasets, and evaluate five information-theoretic prompting-based estimators. Our main method, PromptNCE, frames conditional probability estimation as a contrastive task and augments the candidate set with an explicit OTHER category. We show theoretically that adding OTHER recovers the true conditional P(y | x) rather than just a ranking over listed candidates, turning a contrastive prompt into a general-purpose zero-shot probability estimator. PromptNCE is the best zero-shot method on all three datasets, reaching Spearman correlation up to 0.82 with human-derived PMI. We also present a case study in computer science education showing how these estimators can be used to score student knowledge summaries in a low-data setting.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Desk Editor's Note: private letter to a colleague

PromptNCE adds an explicit OTHER option to contrastive LLM prompts to recover actual conditional probabilities for zero-shot PMI, backed by a new human benchmark that reaches 0.82 Spearman correlation.

The core move is framing conditional probability estimation as a contrastive prompt plus an OTHER bucket, with a theoretical argument that this recovers true P(y|x) rather than a closed-set ranking. They also release a human-derived PMI benchmark on three public datasets and test five prompting variants, with PromptNCE coming out ahead on all of them. A short case study applies the estimators to scoring student CS summaries in a low-data setting, which matches the stated motivation for avoiding task-specific training.

Referee Report

2 major / 3 minor

Summary. The paper introduces PromptNCE, a zero-shot method for estimating pointwise mutual information (PMI) from text using LLMs via contrastive prompts. It augments the candidate set with an explicit 'OTHER' category and claims theoretically that this recovers the true conditional probability P(y|x) exactly rather than a renormalized ranking over listed options. The authors create a benchmark with human-derived ground-truth PMI across three public datasets, evaluate five prompting-based estimators, and report that PromptNCE achieves the highest Spearman correlations (up to 0.82). A case study applies the method to scoring student knowledge summaries in a low-data educational setting.

Significance. If the central theoretical claim holds, PromptNCE would provide a parameter-free, training-free approach to PMI estimation that generalizes beyond closed candidate sets, with clear utility in low-resource scenarios such as the education case study. The human-annotated PMI benchmark itself is a useful resource for evaluating future zero-shot estimators. The reported correlations indicate practical promise, though the result's impact depends on validating the core assumption against LLM-specific distortions.

major comments (2)

[§3.2] §3.2 (Theoretical Derivation, around Eq. (4)–(6)): The argument that adding the OTHER category makes the normalized LLM output probabilities equal the true P(y|x) over the full space rests on the untested assumption that the model's elicited option probabilities are exactly proportional to the underlying conditional distribution without distortion. This step is load-bearing for the claim that PromptNCE is a general-purpose probability estimator rather than a ranking method; however, the derivation does not address or empirically check known LLM behaviors such as miscalibration, position bias, or prompt-dependent renormalization that could cause the output to reflect only internal prompt-set dynamics instead of true conditionals.
[§5] §5 (Empirical Evaluation): The reported Spearman correlations (up to 0.82) are presented without error bars, dataset statistics (e.g., number of examples per dataset), or details on post-hoc prompt choices. This makes it difficult to assess whether the superiority of PromptNCE over the other four estimators is robust or sensitive to specific implementation decisions, directly affecting the strength of the 'best zero-shot method' conclusion.
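
The position-bias concern in the first major comment could be probed with a permutation check along the following lines. This is a sketch only: `elicit` stands in for an LLM call and is replaced here by a toy function with a deliberate first-position bonus, and every name and number is hypothetical.

```python
import itertools
import statistics

def position_sensitivity(elicit, candidates):
    """Query every ordering of the options and summarize per-option spread.

    elicit: callable taking an ordered list of options and returning a dict
            option -> probability (a stand-in for an LLM call).
    Returns dict option -> (mean probability, population std across orderings).
    """
    options = list(candidates) + ["OTHER"]
    samples = {opt: [] for opt in options}
    for order in itertools.permutations(options):
        probs = elicit(list(order))
        for opt in options:
            samples[opt].append(probs[opt])
    return {opt: (statistics.mean(v), statistics.pstdev(v))
            for opt, v in samples.items()}

def biased_toy_model(order):
    """Toy stand-in: a fixed preference plus a bonus for the first-listed option."""
    base = {"joy": 0.5, "anger": 0.1, "OTHER": 0.4}
    raw = {opt: base[opt] + (0.2 if i == 0 else 0.0) for i, opt in enumerate(order)}
    total = sum(raw.values())
    return {opt: w / total for opt, w in raw.items()}

report = position_sensitivity(biased_toy_model, ["joy", "anger"])
```

A nonzero spread for an option signals order dependence, and averaging over orderings is one simple mitigation. Enumerating all orderings grows factorially, so larger candidate sets would need a random sample of permutations instead.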

minor comments (3)

[Abstract / §1] The abstract and §1 could more explicitly state the three datasets used and their sizes to allow readers to gauge the scale of the human-derived PMI benchmark.
[Figure 3] Figure 3 (case study results): Adding per-summary variance or confidence intervals would improve interpretability of the knowledge-scoring application.
[§3.1] Notation for the contrastive prompt template in §3.1 is clear but could include an explicit example prompt to aid reproducibility.

Simulated Author's Rebuttal

2 responses · 0 unresolved

We thank the referee for the constructive and detailed feedback. We address each major comment below and have revised the manuscript to improve the theoretical discussion and empirical reporting.

Referee: [§3.2] §3.2 (Theoretical Derivation, around Eq. (4)–(6)): The argument that adding the OTHER category makes the normalized LLM output probabilities equal the true P(y|x) over the full space rests on the untested assumption that the model's elicited option probabilities are exactly proportional to the underlying conditional distribution without distortion. This step is load-bearing for the claim that PromptNCE is a general-purpose probability estimator rather than a ranking method; however, the derivation does not address or empirically check known LLM behaviors such as miscalibration, position bias, or prompt-dependent renormalization that could cause the output to reflect only internal prompt-set dynamics instead of true conditionals.

Authors: We agree that the derivation in §3.2 relies on the assumption that LLM-elicited probabilities over the prompt set (including OTHER) are proportional to the true conditional distribution. This assumption is standard for prompting-based estimators but is indeed untested against LLM-specific distortions in the original text. In the revision we have added a dedicated limitations paragraph in §3.2 that explicitly discusses miscalibration, position bias, and prompt-dependent effects, and we have clarified that PromptNCE recovers P(y|x) under the modeling assumption of accurate elicitation rather than claiming exact recovery in all regimes. We have also included a short appendix experiment that permutes option order to quantify position bias on one dataset. While these additions do not constitute a comprehensive validation across all possible distortions, they make the scope of the theoretical claim more precise.
Revision: partial
Referee: [§5] §5 (Empirical Evaluation): The reported Spearman correlations (up to 0.82) are presented without error bars, dataset statistics (e.g., number of examples per dataset), or details on post-hoc prompt choices. This makes it difficult to assess whether the superiority of PromptNCE over the other four estimators is robust or sensitive to specific implementation decisions, directly affecting the strength of the 'best zero-shot method' conclusion.

Authors: We thank the referee for highlighting these omissions. In the revised manuscript we have added bootstrap-derived 95% confidence intervals (1,000 resamples) to all Spearman correlations in Table 2 and the main results figure. We have inserted a new Table 1 that reports, for each of the three datasets, the number of examples, mean and standard deviation of text lengths, and label cardinality. Full prompt templates, including any post-hoc phrasing decisions, are now provided in Appendix B together with a sensitivity analysis across three alternative prompt wordings. These changes allow readers to evaluate both statistical reliability and implementation sensitivity.
Revision: yes
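
For readers who want to see what a bootstrap interval on a Spearman correlation involves, here is a minimal standard-library sketch. It uses a percentile bootstrap over paired items, which is an assumption: the rebuttal text does not say which bootstrap variant is meant. The PMI values are made up, and the function names are illustrative.

```python
import random

def average_ranks(values):
    """Ranks starting at 1, with tied values sharing their average rank."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        shared = (i + j) / 2 + 1
        for k in range(i, j + 1):
            ranks[order[k]] = shared
        i = j + 1
    return ranks

def spearman(xs, ys):
    """Spearman rho: Pearson correlation of the rank vectors."""
    rx, ry = average_ranks(xs), average_ranks(ys)
    mx, my = sum(rx) / len(rx), sum(ry) / len(ry)
    cov = sum((a - mx) * (b - my) for a, b in zip(rx, ry))
    var_x = sum((a - mx) ** 2 for a in rx)
    var_y = sum((b - my) ** 2 for b in ry)
    if var_x == 0 or var_y == 0:
        return float("nan")  # undefined when one side is constant
    return cov / (var_x * var_y) ** 0.5

def bootstrap_ci(xs, ys, n_resamples=1000, alpha=0.05, seed=0):
    """Percentile bootstrap interval for Spearman rho over paired items."""
    rng = random.Random(seed)
    n = len(xs)
    stats = []
    for _ in range(n_resamples):
        idx = [rng.randrange(n) for _ in range(n)]
        rho = spearman([xs[i] for i in idx], [ys[i] for i in idx])
        if rho == rho:  # skip NaN from degenerate resamples
            stats.append(rho)
    stats.sort()
    lo = stats[int((alpha / 2) * len(stats))]
    hi = stats[min(len(stats) - 1, int((1 - alpha / 2) * len(stats)))]
    return lo, hi

# Made-up PMI values, for illustration only.
true_pmi = [0.1, 0.4, 0.35, 0.8, 1.2, 1.0, 1.5, 0.2]
est_pmi = [0.2, 0.3, 0.50, 0.7, 1.1, 1.3, 1.4, 0.1]
rho = spearman(true_pmi, est_pmi)
low, high = bootstrap_ci(true_pmi, est_pmi)
```

Resampling whole (true, estimated) pairs keeps each item's two values together, which is what makes the interval reflect uncertainty over items rather than over individual scores.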

Circularity Check

0 steps flagged

Theoretical derivation of OTHER-category probability recovery is independent of inputs and empirical results

The paper's central theoretical step shows that augmenting a contrastive prompt with an explicit OTHER category causes normalized LLM option probabilities to equal the true P(y|x) over the full space rather than a closed-set ranking. This follows directly from the modeling assumption that the LLM assigns mass proportional to the underlying conditional probabilities (with OTHER taking the complement), which is an external premise about LLM behavior rather than a self-referential definition or fitted parameter. No equations reduce the claimed recovery to the paper's own data, prior self-citations, or ansatz smuggling; the reported Spearman correlations (up to 0.82) are presented as separate empirical validation on human-derived PMI benchmarks. The derivation chain is therefore self-contained against external benchmarks with no load-bearing circular reductions.

Axiom & Free-Parameter Ledger

0 free parameters · 1 axiom · 1 invented entity

The central claim rests on standard probability normalization plus the novel assumption that the OTHER token forces the model to output a proper conditional distribution; no free parameters or new physical entities are introduced.

axioms (1)

[standard math] Standard axioms of probability: conditional probabilities sum to one over the full output space.
Invoked when claiming that OTHER recovers the true P(y|x) rather than a ranking.

invented entities (1)

[no independent evidence] OTHER category
Purpose: augment the candidate set in the contrastive prompt so that elicited probabilities match the true conditional distribution.
New element introduced by the method to convert ranking into normalized probability estimation.
