PromptNCE: Pointwise Mutual Information Predictions Using Only LLMs and Contrastive Estimation Prompts
Pith reviewed 2026-05-22 08:42 UTC · model grok-4.3
The pith
LLMs estimate pointwise mutual information zero-shot by adding an explicit OTHER category to contrastive prompts.
A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.
Core claim
Adding an explicit OTHER category to a contrastive prompt recovers the true conditional probability P(y|x) rather than merely a ranking over listed candidates, turning the prompt into a general-purpose zero-shot probability estimator. PromptNCE is the best zero-shot method on all three datasets, reaching Spearman correlation up to 0.82 with human-derived PMI.
What carries the argument
PromptNCE, a contrastive prompting technique that augments the candidate set with an explicit OTHER category to elicit probabilities matching the true conditional distribution P(y|x).
If this is right
- PromptNCE outperforms other zero-shot estimators and reaches Spearman correlation up to 0.82 with human-derived PMI on three datasets.
- The method enables scoring of student knowledge summaries without task-specific labeled data.
- Contrastive prompts become usable as general-purpose zero-shot conditional probability estimators.
- Mutual information can be estimated from text in low-data regimes without training a critic model.
Where Pith is reading between the lines
- The approach could extend to zero-shot estimation of other information measures such as entropy or conditional entropy.
- It might support real-time uncertainty quantification in generated text for downstream applications.
- Further tests on non-English text or specialized domains would check how broadly the OTHER-category recovery holds.
Load-bearing premise
Including the OTHER option forces the model's output probabilities to match the true conditional distribution over the full space instead of just renormalizing among the listed choices.
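A toy illustration of this premise, not the paper's derivation: take a hypothetical true conditional distribution over five outcomes, of which only two are listed in the prompt. Under the idealized assumption that elicited option mass is proportional to the true conditionals, a closed candidate set returns renormalized values, while a set with OTHER returns the true values. The distribution and the helper name `elicit` are invented for illustration; a real LLM may not satisfy the proportionality assumption.

```python
def elicit(true_cond, listed, with_other):
    """Idealized elicitation: option mass proportional to the true conditional.

    true_cond: dict mapping every outcome y to P(y|x) (hypothetical values).
    listed: outcomes shown in the prompt.
    with_other: if True, an OTHER option absorbs all unlisted mass.
    """
    mass = {y: true_cond[y] for y in listed}
    if with_other:
        mass["OTHER"] = sum(p for y, p in true_cond.items() if y not in listed)
    total = sum(mass.values())
    return {y: m / total for y, m in mass.items()}


# Hypothetical true P(y|x) over the full output space.
true_cond = {"a": 0.30, "b": 0.10, "c": 0.25, "d": 0.20, "e": 0.15}
listed = ["a", "b"]

closed = elicit(true_cond, listed, with_other=False)   # renormalized over {a, b}
open_set = elicit(true_cond, listed, with_other=True)  # OTHER holds the rest

print(closed)
print(open_set)
```

In the closed case, outcome `a` comes out near 0.75 (0.30 divided by 0.40), which preserves the ranking but not the probability. With OTHER, it stays near 0.30.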
What would settle it
Compare probabilities from PromptNCE prompts against empirical conditional frequencies computed from a large labeled text corpus and check for close numerical agreement.
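One way such a check could be set up is sketched below, assuming a corpus of labeled (x, y) pairs and a table of elicited probabilities are already available. The corpus, the elicited values, and the choice of mean absolute error as the agreement measure are all hypothetical; the paper may use a different comparison.

```python
from collections import Counter, defaultdict


def empirical_conditionals(pairs):
    """Estimate P(y|x) by relative frequency from (x, y) pairs."""
    counts = defaultdict(Counter)
    for x, y in pairs:
        counts[x][y] += 1
    return {
        x: {y: n / sum(ys.values()) for y, n in ys.items()}
        for x, ys in counts.items()
    }


def mean_abs_error(elicited, empirical):
    """Average |elicited - empirical| over all (x, y) present in `elicited`.

    Outcomes never observed for x count as empirical probability 0.
    """
    errors = [
        abs(p - empirical.get(x, {}).get(y, 0.0))
        for x, ys in elicited.items()
        for y, p in ys.items()
    ]
    return sum(errors) / len(errors)


# Hypothetical labeled corpus and hypothetical elicited probabilities.
corpus = [("rain", "umbrella")] * 6 + [("rain", "coat")] * 3 + [("rain", "hat")]
elicited = {"rain": {"umbrella": 0.55, "coat": 0.35}}

empirical = empirical_conditionals(corpus)
mae = mean_abs_error(elicited, empirical)
print(empirical["rain"], mae)
```

A small error across many contexts would support the premise. A large error alongside a high rank correlation would suggest the prompt recovers rankings only.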
Original abstract
Estimating mutual information from text usually requires training a task-specific critic, which limits its use in low-data settings. We ask whether large language models can instead estimate pointwise mutual information zero-shot, using only prompts and elicited probabilities. We introduce a benchmark with human-derived ground-truth PMI across three publicly available datasets, and evaluate five information-theoretic prompting-based estimators. Our main method, PromptNCE, frames conditional probability estimation as a contrastive task and augments the candidate set with an explicit OTHER category. We show theoretically that adding OTHER recovers the true conditional P(y | x) rather than just a ranking over listed candidates, turning a contrastive prompt into a general-purpose zero-shot probability estimator. PromptNCE is the best zero-shot method on all three datasets, reaching Spearman correlation up to 0.82 with human-derived PMI. We also present a case study in computer science education showing how these estimators can be used to score student knowledge summaries in a low-data setting.
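For reference, the standard definition of pointwise mutual information shows why a conditional probability estimate is the key ingredient. How the paper obtains the marginal P(y) is not stated in this summary, so \hat{P}(y) below is a placeholder for whatever marginal estimate is used.

```latex
\mathrm{PMI}(x; y)
  = \log \frac{P(x, y)}{P(x)\,P(y)}
  = \log \frac{P(y \mid x)}{P(y)},
\qquad
\widehat{\mathrm{PMI}}(x; y)
  = \log \hat{P}(y \mid x) - \log \hat{P}(y).
```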
Editorial analysis
A structured set of objections, weighed in public.
Referee Report
Summary. The paper introduces PromptNCE, a zero-shot method for estimating pointwise mutual information (PMI) from text using LLMs via contrastive prompts. It augments the candidate set with an explicit 'OTHER' category and claims theoretically that this recovers the true conditional probability P(y|x) exactly rather than a renormalized ranking over listed options. The authors create a benchmark with human-derived ground-truth PMI across three public datasets, evaluate five prompting-based estimators, and report that PromptNCE achieves the highest Spearman correlations (up to 0.82). A case study applies the method to scoring student knowledge summaries in a low-data educational setting.
Significance. If the central theoretical claim holds, PromptNCE would provide a parameter-free, training-free approach to PMI estimation that generalizes beyond closed candidate sets, with clear utility in low-resource scenarios such as the education case study. The human-annotated PMI benchmark itself is a useful resource for evaluating future zero-shot estimators. The reported correlations indicate practical promise, though the result's impact depends on validating the core assumption against LLM-specific distortions.
major comments (2)
- [§3.2] §3.2 (Theoretical Derivation, around Eq. (4)–(6)): The argument that adding the OTHER category makes the normalized LLM output probabilities equal the true P(y|x) over the full space rests on the untested assumption that the model's elicited option probabilities are exactly proportional to the underlying conditional distribution without distortion. This step is load-bearing for the claim that PromptNCE is a general-purpose probability estimator rather than a ranking method; however, the derivation does not address or empirically check known LLM behaviors such as miscalibration, position bias, or prompt-dependent renormalization that could cause the output to reflect only internal prompt-set dynamics instead of true conditionals.
- [§5] §5 (Empirical Evaluation): The reported Spearman correlations (up to 0.82) are presented without error bars, dataset statistics (e.g., number of examples per dataset), or details on post-hoc prompt choices. This makes it difficult to assess whether the superiority of PromptNCE over the other four estimators is robust or sensitive to specific implementation decisions, directly affecting the strength of the 'best zero-shot method' conclusion.
minor comments (3)
- [Abstract / §1] The abstract and §1 could more explicitly state the three datasets used and their sizes to allow readers to gauge the scale of the human-derived PMI benchmark.
- [Figure 3] Figure 3 (case study results): Adding per-summary variance or confidence intervals would improve interpretability of the knowledge-scoring application.
- [§3.1] Notation for the contrastive prompt template in §3.1 is clear but could include an explicit example prompt to aid reproducibility.
Simulated Author's Rebuttal
We thank the referee for the constructive and detailed feedback. We address each major comment below and have revised the manuscript to improve the theoretical discussion and empirical reporting.
- Referee: [§3.2] §3.2 (Theoretical Derivation, around Eq. (4)–(6)): The argument that adding the OTHER category makes the normalized LLM output probabilities equal the true P(y|x) over the full space rests on the untested assumption that the model's elicited option probabilities are exactly proportional to the underlying conditional distribution without distortion. This step is load-bearing for the claim that PromptNCE is a general-purpose probability estimator rather than a ranking method; however, the derivation does not address or empirically check known LLM behaviors such as miscalibration, position bias, or prompt-dependent renormalization that could cause the output to reflect only internal prompt-set dynamics instead of true conditionals.
  Authors: We agree that the derivation in §3.2 relies on the assumption that LLM-elicited probabilities over the prompt set (including OTHER) are proportional to the true conditional distribution. This assumption is standard for prompting-based estimators but is indeed untested against LLM-specific distortions in the original text. In the revision we have added a dedicated limitations paragraph in §3.2 that explicitly discusses miscalibration, position bias, and prompt-dependent effects, and we have clarified that PromptNCE recovers P(y|x) under the modeling assumption of accurate elicitation rather than claiming exact recovery in all regimes. We have also included a short appendix experiment that permutes option order to quantify position bias on one dataset. While these additions do not constitute a comprehensive validation across all possible distortions, they make the scope of the theoretical claim more precise.
  Revision: partial
- Referee: [§5] §5 (Empirical Evaluation): The reported Spearman correlations (up to 0.82) are presented without error bars, dataset statistics (e.g., number of examples per dataset), or details on post-hoc prompt choices. This makes it difficult to assess whether the superiority of PromptNCE over the other four estimators is robust or sensitive to specific implementation decisions, directly affecting the strength of the 'best zero-shot method' conclusion.
  Authors: We thank the referee for highlighting these omissions. In the revised manuscript we have added bootstrap-derived 95% confidence intervals (1,000 resamples) to all Spearman correlations in Table 2 and the main results figure. We have inserted a new Table 1 that reports, for each of the three datasets, the number of examples, mean and standard deviation of text lengths, and label cardinality. Full prompt templates, including any post-hoc phrasing decisions, are now provided in Appendix B together with a sensitivity analysis across three alternative prompt wordings. These changes allow readers to evaluate both statistical reliability and implementation sensitivity.
  Revision: yes
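The bootstrap intervals described in the second response could be computed along the following lines. This is a minimal standard-library sketch, not the authors' code: the percentile method, resampling of (estimate, human) pairs, the fixed seed, and the example scores are all assumptions. Only the 95% level and the 1,000 resamples come from the rebuttal.

```python
import random


def average_ranks(values):
    """Ranks starting at 1, with ties given the average of their positions."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        avg = (i + j) / 2 + 1
        for k in range(i, j + 1):
            ranks[order[k]] = avg
        i = j + 1
    return ranks


def spearman(xs, ys):
    """Spearman correlation: Pearson correlation of the rank vectors."""
    rx, ry = average_ranks(xs), average_ranks(ys)
    mx, my = sum(rx) / len(rx), sum(ry) / len(ry)
    cov = sum((a - mx) * (b - my) for a, b in zip(rx, ry))
    var_x = sum((a - mx) ** 2 for a in rx)
    var_y = sum((b - my) ** 2 for b in ry)
    if var_x == 0 or var_y == 0:
        return float("nan")  # undefined when one side is constant
    return cov / (var_x * var_y) ** 0.5


def bootstrap_ci(xs, ys, n_resamples=1000, alpha=0.05, seed=0):
    """Percentile bootstrap interval for Spearman, resampling pairs."""
    rng = random.Random(seed)
    n = len(xs)
    stats = []
    for _ in range(n_resamples):
        idx = [rng.randrange(n) for _ in range(n)]
        rho = spearman([xs[i] for i in idx], [ys[i] for i in idx])
        if rho == rho:  # skip NaN from degenerate resamples
            stats.append(rho)
    stats.sort()
    lo = stats[int((alpha / 2) * len(stats))]
    hi = stats[min(len(stats) - 1, int((1 - alpha / 2) * len(stats)))]
    return lo, hi


# Hypothetical estimator scores and human-derived PMI values.
estimated = [0.1, 0.4, 0.35, 0.8, 0.9, 1.2, 1.1, 1.5, 1.9, 2.0]
human = [0.0, 0.5, 0.3, 0.7, 1.0, 1.0, 1.3, 1.4, 2.1, 1.8]
rho = spearman(estimated, human)
low, high = bootstrap_ci(estimated, human)
print(rho, low, high)
```

Resampling pairs keeps each estimate attached to its human-derived value, which a rank correlation requires. Overlapping intervals between two estimators would weaken a "best method" claim based on point estimates alone.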
Circularity Check
Theoretical derivation of OTHER-category probability recovery is independent of inputs and empirical results
The paper's central theoretical step shows that augmenting a contrastive prompt with an explicit OTHER category causes normalized LLM option probabilities to equal the true P(y|x) over the full space rather than a closed-set ranking. This follows directly from the modeling assumption that the LLM assigns mass proportional to the underlying conditional probabilities (with OTHER taking the complement), which is an external premise about LLM behavior rather than a self-referential definition or fitted parameter. No equations reduce the claimed recovery to the paper's own data, prior self-citations, or ansatz smuggling; the reported Spearman correlations (up to 0.82) are presented as separate empirical validation on human-derived PMI benchmarks. The derivation chain is therefore self-contained against external benchmarks with no load-bearing circular reductions.
Axiom & Free-Parameter Ledger
axioms (1)
- standard math: standard axioms of probability, under which conditional probabilities sum to one over the full output space
invented entities (1)
- OTHER category: no independent evidence