
arxiv: 2605.21776 · v1 · pith:JLAQUNZHnew · submitted 2026-05-20 · 💻 cs.CL

PromptNCE: Pointwise Mutual Information Predictions Using Only LLMs and Contrastive Estimation Prompts

Pith reviewed 2026-05-22 08:42 UTC · model grok-4.3

classification 💻 cs.CL
keywords pointwise mutual information · zero-shot estimation · contrastive prompting · large language models · conditional probability · prompt engineering · information theory · natural language processing

The pith

LLMs estimate pointwise mutual information zero-shot by adding an explicit OTHER category to contrastive prompts.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper shows that large language models can estimate pointwise mutual information directly from text without any task-specific training. It frames conditional probability estimation as a contrastive task and adds an explicit OTHER option to the list of candidates in the prompt. Theory indicates this forces the model to output probabilities that reflect the true distribution over all possible answers instead of just ranking the listed ones. The resulting PromptNCE method outperforms other zero-shot estimators and reaches Spearman correlations as high as 0.82 with human-derived PMI values on three datasets. A case study demonstrates its use for scoring student knowledge summaries in low-data educational settings.

Core claim

Adding an explicit OTHER category to a contrastive prompt recovers the true conditional probability P(y|x) rather than merely a ranking over listed candidates, turning the prompt into a general-purpose zero-shot probability estimator. PromptNCE is the best zero-shot method on all three datasets, reaching Spearman correlation up to 0.82 with human-derived PMI.

What carries the argument

PromptNCE, a contrastive prompting technique that augments the candidate set with an explicit OTHER category to elicit probabilities matching the true conditional distribution P(y|x).
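
For reference, the quantity being estimated can be written out. The display below uses the standard definition of pointwise mutual information; the hatted symbols for elicited quantities are notation assumed here for illustration and may not match the paper's own.

```latex
\[
\mathrm{PMI}(x; y) \;=\; \log \frac{P(x, y)}{P(x)\,P(y)} \;=\; \log \frac{P(y \mid x)}{P(y)},
\qquad
\widehat{\mathrm{PMI}}(x; y) \;=\; \log \hat{P}(y \mid x) \;-\; \log \hat{P}(y).
\]
```

On this reading, the contrastive prompt supplies the conditional term, and the marginal has to come from somewhere else, for instance an empirical label marginal of the kind the Figure 1 caption mentions.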

If this is right

  • PromptNCE outperforms other zero-shot estimators and reaches Spearman correlation up to 0.82 with human-derived PMI on three datasets.
  • The method enables scoring of student knowledge summaries without task-specific labeled data.
  • Contrastive prompts become usable as general-purpose zero-shot conditional probability estimators.
  • Mutual information can be estimated from text in low-data regimes without training a critic model.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

  • The approach could extend to zero-shot estimation of other information measures such as entropy or conditional entropy.
  • It might support real-time uncertainty quantification in generated text for downstream applications.
  • Further tests on non-English text or specialized domains would check how broadly the OTHER-category recovery holds.

Load-bearing premise

Including the OTHER option forces the model's output probabilities to match the true conditional distribution over the full space instead of just renormalizing among the listed choices.
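
The difference between the two behaviors named in this premise is simple arithmetic, and a small sketch may make it concrete. The candidate names and probabilities below are hypothetical, and the code assumes an idealized model that reports true probabilities; it illustrates what the premise asserts and does not test whether an LLM actually behaves this way.

```python
def closed_set_probs(true_probs):
    """Renormalize over the listed candidates only (no OTHER option)."""
    total = sum(true_probs.values())
    return {y: p / total for y, p in true_probs.items()}


def with_other_probs(true_probs):
    """Keep the listed probabilities and give the leftover mass to OTHER."""
    augmented = dict(true_probs)
    augmented["OTHER"] = 1.0 - sum(true_probs.values())
    return augmented


# Hypothetical true conditionals P(y|x) for three listed candidates;
# the remaining 0.5 of the mass belongs to answers that are not listed.
listed = {"cat": 0.30, "dog": 0.15, "fish": 0.05}

closed = closed_set_probs(listed)      # "cat" is inflated from 0.30 to 0.60
augmented = with_other_probs(listed)   # "cat" stays at 0.30, OTHER takes 0.50
```

Both versions rank the listed candidates identically; only the OTHER-augmented version leaves their values on the scale of the full distribution, which is the property a PMI estimate needs.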

What would settle it

Compare probabilities from PromptNCE prompts against empirical conditional frequencies computed from a large labeled text corpus and check for close numerical agreement.
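
A minimal sketch of such a comparison follows, assuming the corpus is available as (x, y) pairs and the elicited probabilities as a nested dictionary. The function names, the toy corpus, and the elicited values are all made up for illustration, and mean absolute gap is only one of several reasonable agreement measures.

```python
from collections import Counter, defaultdict


def empirical_conditionals(pairs):
    """Estimate P(y|x) by counting (x, y) pairs in a labeled corpus."""
    counts = defaultdict(Counter)
    for x, y in pairs:
        counts[x][y] += 1
    return {
        x: {y: n / sum(ys.values()) for y, n in ys.items()}
        for x, ys in counts.items()
    }


def mean_absolute_gap(elicited, empirical):
    """Average |elicited - empirical| over every (x, y) seen in the corpus."""
    gaps = [
        abs(elicited.get(x, {}).get(y, 0.0) - p)
        for x, ys in empirical.items()
        for y, p in ys.items()
    ]
    return sum(gaps) / len(gaps)


# Toy corpus and made-up elicited probabilities, for illustration only.
corpus = [("sky", "blue"), ("sky", "blue"), ("sky", "grey"), ("grass", "green")]
elicited = {"sky": {"blue": 0.7, "grey": 0.2}, "grass": {"green": 0.9}}

empirical = empirical_conditionals(corpus)
gap = mean_absolute_gap(elicited, empirical)
```

A gap near zero would support the premise; a large gap alongside a high rank correlation would suggest the prompt is ranking candidates without recovering their probabilities.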

Figures

Figures reproduced from arXiv: 2605.21776 by Chris Piech, Juliette Woodrow.

Figure 1: Spearman ρ between estimated and true PMI. All methods are Claude Sonnet 4 unless otherwise noted in the label. The dashed line shows PROMPTNCE using the empirical label marginal. Error bars are standard error of the mean. […] a marginal-dominated dataset where PMI rankings are driven by label base rates, P(y). Values near zero indicate a conditional-dominated dataset where rankings are driven by how the i…
Abstract

Estimating mutual information from text usually requires training a task-specific critic, which limits its use in low-data settings. We ask whether large language models can instead estimate pointwise mutual information zero-shot, using only prompts and elicited probabilities. We introduce a benchmark with human-derived ground-truth PMI across three publicly available datasets, and evaluate five information-theoretic prompting-based estimators. Our main method, PromptNCE, frames conditional probability estimation as a contrastive task and augments the candidate set with an explicit OTHER category. We show theoretically that adding OTHER recovers the true conditional P(y | x) rather than just a ranking over listed candidates, turning a contrastive prompt into a general-purpose zero-shot probability estimator. PromptNCE is the best zero-shot method on all three datasets, reaching Spearman correlation up to 0.82 with human-derived PMI. We also present a case study in computer science education showing how these estimators can be used to score student knowledge summaries in a low-data setting.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Referee Report

2 major / 3 minor

Summary. The paper introduces PromptNCE, a zero-shot method for estimating pointwise mutual information (PMI) from text using LLMs via contrastive prompts. It augments the candidate set with an explicit 'OTHER' category and claims theoretically that this recovers the true conditional probability P(y|x) exactly rather than a renormalized ranking over listed options. The authors create a benchmark with human-derived ground-truth PMI across three public datasets, evaluate five prompting-based estimators, and report that PromptNCE achieves the highest Spearman correlations (up to 0.82). A case study applies the method to scoring student knowledge summaries in a low-data educational setting.

Significance. If the central theoretical claim holds, PromptNCE would provide a parameter-free, training-free approach to PMI estimation that generalizes beyond closed candidate sets, with clear utility in low-resource scenarios such as the education case study. The human-annotated PMI benchmark itself is a useful resource for evaluating future zero-shot estimators. The reported correlations indicate practical promise, though the result's impact depends on validating the core assumption against LLM-specific distortions.

major comments (2)
  1. [§3.2] §3.2 (Theoretical Derivation, around Eq. (4)–(6)): The argument that adding the OTHER category makes the normalized LLM output probabilities equal the true P(y|x) over the full space rests on the untested assumption that the model's elicited option probabilities are exactly proportional to the underlying conditional distribution without distortion. This step is load-bearing for the claim that PromptNCE is a general-purpose probability estimator rather than a ranking method; however, the derivation does not address or empirically check known LLM behaviors such as miscalibration, position bias, or prompt-dependent renormalization that could cause the output to reflect only internal prompt-set dynamics instead of true conditionals.
  2. [§5] §5 (Empirical Evaluation): The reported Spearman correlations (up to 0.82) are presented without error bars, dataset statistics (e.g., number of examples per dataset), or details on post-hoc prompt choices. This makes it difficult to assess whether the superiority of PromptNCE over the other four estimators is robust or sensitive to specific implementation decisions, directly affecting the strength of the 'best zero-shot method' conclusion.
minor comments (3)
  1. [Abstract / §1] The abstract and §1 could more explicitly state the three datasets used and their sizes to allow readers to gauge the scale of the human-derived PMI benchmark.
  2. [Figure 3] Figure 3 (case study results): Adding per-summary variance or confidence intervals would improve interpretability of the knowledge-scoring application.
  3. [§3.1] Notation for the contrastive prompt template in §3.1 is clear but could include an explicit example prompt to aid reproducibility.
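
The position-bias concern in major comment 1 can be probed with an order-permutation check. The sketch below is one hypothetical way to set it up: `elicit` stands for any function that takes an input and an ordered option list and returns a probability per option, and `toy_elicit` is a made-up stand-in with a deliberate first-position bonus, not a real model call.

```python
import itertools
import statistics


def position_spread(elicit, x, candidates):
    """For each candidate, the std. dev. of its elicited probability across
    every ordering of the option list. Zero spread means no order sensitivity."""
    seen = {c: [] for c in candidates}
    for order in itertools.permutations(candidates):
        probs = elicit(x, list(order))
        for c in candidates:
            seen[c].append(probs[c])
    return {c: statistics.pstdev(v) for c, v in seen.items()}


def toy_elicit(x, options):
    """Stand-in for a model call: a fixed score per option plus a bonus for
    whichever option is listed first, then normalized to sum to one."""
    base = {"cat": 3.0, "dog": 2.0, "OTHER": 5.0}
    scores = {o: base[o] + (1.0 if i == 0 else 0.0) for i, o in enumerate(options)}
    total = sum(scores.values())
    return {o: s / total for o, s in scores.items()}


spread = position_spread(toy_elicit, "some input", ["cat", "dog", "OTHER"])
```

Enumerating every ordering grows factorially with the number of options, so a larger candidate set would call for sampling a fixed number of random orderings instead.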

Simulated Authors' Rebuttal

2 responses · 0 unresolved

We thank the referee for the constructive and detailed feedback. We address each major comment below and have revised the manuscript to improve the theoretical discussion and empirical reporting.

  1. Referee: [§3.2] §3.2 (Theoretical Derivation, around Eq. (4)–(6)): The argument that adding the OTHER category makes the normalized LLM output probabilities equal the true P(y|x) over the full space rests on the untested assumption that the model's elicited option probabilities are exactly proportional to the underlying conditional distribution without distortion. This step is load-bearing for the claim that PromptNCE is a general-purpose probability estimator rather than a ranking method; however, the derivation does not address or empirically check known LLM behaviors such as miscalibration, position bias, or prompt-dependent renormalization that could cause the output to reflect only internal prompt-set dynamics instead of true conditionals.

    Authors: We agree that the derivation in §3.2 relies on the assumption that LLM-elicited probabilities over the prompt set (including OTHER) are proportional to the true conditional distribution. This assumption is standard for prompting-based estimators but is indeed untested against LLM-specific distortions in the original text. In the revision we have added a dedicated limitations paragraph in §3.2 that explicitly discusses miscalibration, position bias, and prompt-dependent effects, and we have clarified that PromptNCE recovers P(y|x) under the modeling assumption of accurate elicitation rather than claiming exact recovery in all regimes. We have also included a short appendix experiment that permutes option order to quantify position bias on one dataset. While these additions do not constitute a comprehensive validation across all possible distortions, they make the scope of the theoretical claim more precise. revision: partial

  2. Referee: [§5] §5 (Empirical Evaluation): The reported Spearman correlations (up to 0.82) are presented without error bars, dataset statistics (e.g., number of examples per dataset), or details on post-hoc prompt choices. This makes it difficult to assess whether the superiority of PromptNCE over the other four estimators is robust or sensitive to specific implementation decisions, directly affecting the strength of the 'best zero-shot method' conclusion.

    Authors: We thank the referee for highlighting these omissions. In the revised manuscript we have added bootstrap-derived 95% confidence intervals (1,000 resamples) to all Spearman correlations in Table 2 and the main results figure. We have inserted a new Table 1 that reports, for each of the three datasets, the number of examples, mean and standard deviation of text lengths, and label cardinality. Full prompt templates, including any post-hoc phrasing decisions, are now provided in Appendix B together with a sensitivity analysis across three alternative prompt wordings. These changes allow readers to evaluate both statistical reliability and implementation sensitivity. revision: yes
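
A bootstrap interval of the kind described in response 2 can be sketched with the standard library alone. The code below assumes a percentile bootstrap that resamples paired items with replacement; the rebuttal does not say which bootstrap variant was used, and the function names and data here are illustrative only.

```python
import random


def ranks(values):
    """Average ranks (1-based), with ties sharing the mean of their positions."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    out = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        for k in range(i, j + 1):
            out[order[k]] = (i + j) / 2 + 1
        i = j + 1
    return out


def spearman(a, b):
    """Spearman rho: Pearson correlation of the rank vectors."""
    ra, rb = ranks(a), ranks(b)
    ma, mb = sum(ra) / len(ra), sum(rb) / len(rb)
    cov = sum((p - ma) * (q - mb) for p, q in zip(ra, rb))
    va = sum((p - ma) ** 2 for p in ra)
    vb = sum((q - mb) ** 2 for q in rb)
    if va == 0 or vb == 0:
        return float("nan")  # undefined when one side is constant
    return cov / (va * vb) ** 0.5


def bootstrap_ci(a, b, n_resamples=1000, alpha=0.05, seed=0):
    """Percentile bootstrap interval for Spearman rho over paired items."""
    rng = random.Random(seed)
    n = len(a)
    stats = []
    for _ in range(n_resamples):
        idx = [rng.randrange(n) for _ in range(n)]
        rho = spearman([a[i] for i in idx], [b[i] for i in idx])
        if rho == rho:  # skip NaN from degenerate resamples
            stats.append(rho)
    stats.sort()
    lo = stats[int((alpha / 2) * len(stats))]
    hi = stats[min(len(stats) - 1, int((1 - alpha / 2) * len(stats)))]
    return lo, hi


# Made-up paired scores, for illustration only.
estimated = [0.1, 0.4, 0.35, 0.8, 0.9, 1.2, 1.1, 1.5]
reference = [0.0, 0.5, 0.30, 0.7, 1.0, 1.1, 1.3, 1.4]
rho = spearman(estimated, reference)
low, high = bootstrap_ci(estimated, reference)
```

With only a handful of items the interval is wide, which is why the dataset sizes the referee asks for matter when reading a headline correlation.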

Circularity Check

0 steps flagged

Theoretical derivation of OTHER-category probability recovery is independent of inputs and empirical results


The paper's central theoretical step shows that augmenting a contrastive prompt with an explicit OTHER category causes normalized LLM option probabilities to equal the true P(y|x) over the full space rather than a closed-set ranking. This follows directly from the modeling assumption that the LLM assigns mass proportional to the underlying conditional probabilities (with OTHER taking the complement), which is an external premise about LLM behavior rather than a self-referential definition or fitted parameter. No equations reduce the claimed recovery to the paper's own data, prior self-citations, or ansatz smuggling; the reported Spearman correlations (up to 0.82) are presented as separate empirical validation on human-derived PMI benchmarks. The derivation chain is therefore self-contained against external benchmarks with no load-bearing circular reductions.

Axiom & Free-Parameter Ledger

0 free parameters · 1 axiom · 1 invented entity

The central claim rests on standard probability normalization plus the novel assumption that the OTHER token forces the model to output a proper conditional distribution; no free parameters or new physical entities are introduced.

axioms (1)
  • [standard math] Standard axioms of probability: conditional probabilities sum to one over the full output space
    Invoked when claiming that OTHER recovers the true P(y|x) rather than a ranking
invented entities (1)
  • OTHER category [no independent evidence]
    purpose: Augment candidate set in contrastive prompt so that elicited probabilities match the true conditional distribution
    New element introduced by the method to convert ranking into normalized probability estimation


