Conjuring Semantic Similarity
Pith reviewed 2026-05-23 18:20 UTC · model grok-4.3
The pith
Semantic similarity between texts is defined as the Jeffreys divergence between the image distributions each evokes from a diffusion model.
A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.
Core claim
Semantic similarity between textual expressions is characterized as the Jeffreys divergence between the reverse-time diffusion SDEs induced by each expression, which can be computed via Monte-Carlo sampling of the conditioned generative process and aligns with human-annotated similarity scores.
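To make "computed via Monte-Carlo sampling" concrete, here is a minimal sketch on a toy problem. It is not the paper's diffusion estimator: it assumes two one-dimensional Gaussians stand in for the two prompt-conditioned distributions, so that the sampled estimate can be checked against a closed form. All function names are hypothetical.

```python
import math
import random


def log_normal_pdf(x, mean, std):
    """Log-density of a 1-D Gaussian N(mean, std^2) at x."""
    return -0.5 * math.log(2 * math.pi * std**2) - (x - mean) ** 2 / (2 * std**2)


def kl_gauss(m1, s1, m2, s2):
    """Closed-form KL(N(m1, s1^2) || N(m2, s2^2))."""
    return math.log(s2 / s1) + (s1**2 + (m1 - m2) ** 2) / (2 * s2**2) - 0.5


def jeffreys_closed_form(m1, s1, m2, s2):
    """Jeffreys divergence = KL(P || Q) + KL(Q || P)."""
    return kl_gauss(m1, s1, m2, s2) + kl_gauss(m2, s2, m1, s1)


def jeffreys_monte_carlo(m1, s1, m2, s2, n_samples=100_000, seed=0):
    """Estimate the Jeffreys divergence by averaging log-density ratios.

    Samples from P estimate KL(P || Q); samples from Q estimate KL(Q || P).
    Toy assumption: both log-densities are available in closed form.
    """
    rng = random.Random(seed)
    forward = 0.0
    backward = 0.0
    for _ in range(n_samples):
        x = rng.gauss(m1, s1)  # x ~ P
        forward += log_normal_pdf(x, m1, s1) - log_normal_pdf(x, m2, s2)
        y = rng.gauss(m2, s2)  # y ~ Q
        backward += log_normal_pdf(y, m2, s2) - log_normal_pdf(y, m1, s1)
    return (forward + backward) / n_samples


if __name__ == "__main__":
    exact = jeffreys_closed_form(0.0, 1.0, 1.0, 1.5)
    estimate = jeffreys_monte_carlo(0.0, 1.0, 1.0, 1.5)
    print(f"closed form: {exact:.4f}, Monte Carlo: {estimate:.4f}")
```

In the diffusion setting described here, the log-density ratio would presumably be replaced by a quantity computed from the model's conditional denoising outputs along sampled trajectories; that substitution is not shown in this sketch.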
What carries the argument
The Jeffreys divergence between reverse-time diffusion stochastic differential equations (SDEs) induced by conditioning a diffusion model on each textual prompt.
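For reference, the Jeffreys divergence is the symmetrized Kullback-Leibler divergence. The second line below is a sketch of what it reduces to for path measures, under assumptions not confirmed by the text on this page: both reverse-time SDEs share the same diffusion coefficient $g(t)$ and the same initial noise distribution, and differ only through the conditional score $s_\theta(x, t \mid y_i)$ for prompts $y_1, y_2$. The symbols $P_i$, $g$, $s_\theta$, and $T$ are notation introduced here for illustration.

```latex
\begin{align}
J(P_1, P_2) &= \mathrm{KL}(P_1 \,\|\, P_2) + \mathrm{KL}(P_2 \,\|\, P_1), \\
\mathrm{KL}(P_1 \,\|\, P_2) &= \frac{1}{2} \int_0^T g(t)^2 \,
  \mathbb{E}_{x_t \sim P_1}\!\left[
    \bigl\| s_\theta(x_t, t \mid y_1) - s_\theta(x_t, t \mid y_2) \bigr\|^2
  \right] dt .
\end{align}
```

The second identity follows from Girsanov's theorem: the two drifts differ by $g(t)^2 (s_\theta(\cdot \mid y_1) - s_\theta(\cdot \mid y_2))$, and dividing the squared drift difference by $g(t)^2$ leaves the integrand shown. Each expectation is taken over trajectories sampled under one of the two prompts, which is what makes a Monte-Carlo estimate natural.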
If this is right
- This measure can be used to evaluate the quality of text-to-image models by how well their distributions capture semantic relations.
- It offers better interpretability of learnt representations in generative models.
- It opens new avenues for the evaluation of text-conditioned generative models.
Where Pith is reading between the lines
- This definition implies that meaning can be probed through generative processes rather than static embeddings.
- The method could extend to measuring similarity in other modalities if similar generative models exist.
- Discrepancies between this measure and human scores might reveal biases in current diffusion models' understanding of language.
Load-bearing premise
The image distributions generated by a diffusion model conditioned on a text prompt faithfully represent the meaning of that text, so that divergence between distributions measures semantic similarity.
What would settle it
Compute the Jeffreys divergence for a dataset of text pairs with human similarity annotations, using a standard diffusion model, and check whether the scores correlate with the annotations as claimed; a significant mismatch would falsify the alignment.
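The check described above can be sketched as a rank correlation between divergence scores and human ratings. The numbers below are invented purely for illustration and are not results from the paper; the function names are hypothetical. Because a divergence grows as texts become less alike, agreement with human similarity ratings would show up as a strongly negative Spearman coefficient.

```python
import math


def average_ranks(values):
    """Return 1-based ranks, giving tied values the average of their ranks."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        # Extend j to cover the run of values tied with position i.
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        avg = (i + j) / 2 + 1
        for k in range(i, j + 1):
            ranks[order[k]] = avg
        i = j + 1
    return ranks


def pearson(xs, ys):
    """Pearson correlation coefficient of two equal-length sequences."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    vx = sum((x - mx) ** 2 for x in xs)
    vy = sum((y - my) ** 2 for y in ys)
    return cov / math.sqrt(vx * vy)


def spearman(xs, ys):
    """Spearman rank correlation: Pearson correlation of the ranks."""
    return pearson(average_ranks(xs), average_ranks(ys))


if __name__ == "__main__":
    # Made-up toy values: one divergence and one human rating per text pair.
    divergences = [0.2, 1.5, 3.1, 0.9, 4.0]
    human_similarity = [4.8, 3.0, 1.2, 3.9, 0.5]
    print(f"Spearman rho: {spearman(divergences, human_similarity):.3f}")
```

A real evaluation would also report sample sizes and a significance test, as the referee report requests.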
Original abstract
The semantic similarity between sample expressions measures the distance between their latent 'meaning'. These meanings are themselves typically represented by textual expressions. We propose a novel approach whereby the semantic similarity among textual expressions is based not on other expressions they can be rephrased as, but rather based on the imagery they evoke. While this is not possible with humans, generative models allow us to easily visualize and compare generated images, or their distribution, evoked by a textual prompt. Therefore, we characterize the semantic similarity between two textual expressions simply as the distance between image distributions they induce, or 'conjure.' We show that by choosing the Jeffreys divergence between the reverse-time diffusion stochastic differential equations (SDEs) induced by each textual expression, this can be directly computed via Monte-Carlo sampling. Our method contributes a novel perspective on semantic similarity that not only aligns with human-annotated scores, but also opens up new avenues for the evaluation of text-conditioned generative models while offering better interpretability of their learnt representations.
Editorial analysis
A structured set of objections, weighed in public.
Referee Report
Summary. The paper claims that semantic similarity between textual expressions is given by the Jeffreys divergence between the reverse-time diffusion SDEs induced by conditioning a generative diffusion model on each prompt; this quantity is computable by Monte-Carlo sampling of the SDEs and is asserted to align with human-annotated similarity scores, thereby providing a new imagery-based measure of meaning and an evaluation tool for text-conditioned models.
Significance. If the central claim holds, the work supplies a parameter-free, sampling-based similarity measure grounded in the path measures of a diffusion process rather than in textual embeddings or rephrasings. The explicit reduction to Monte-Carlo estimation of Jeffreys divergence on reverse SDEs is a concrete technical contribution that could be reproduced and extended. The approach also suggests a route to auditing what text-to-image models have internalized, which would be valuable if the divergence can be shown to track semantics rather than model-specific artifacts.
major comments (2)
- [Abstract / experimental validation] Abstract and experimental validation: the claim that the method 'aligns with human-annotated scores' is presented without any reported quantitative metrics (e.g., correlation coefficients, sample sizes, or statistical tests), dataset identifiers, or controls for prompt length, model choice, or conditioning strength. Because this alignment is the sole empirical support for the semantic interpretation, the absence of these details is load-bearing for the central claim.
- [Method definition] The reduction of semantic similarity to Jeffreys divergence on the induced reverse-time SDEs presupposes that the conditional law p(x|text) faithfully encodes latent meaning rather than training-data biases or mode-coverage gaps. No ablation (different backbone models, varying classifier-free guidance scales, or comparison against a non-diffusion generative model) is described that would test whether the divergence remains stable under changes that preserve semantics but alter the generative distribution. A concrete test would be to recompute the divergence on the same prompt pair using two independently trained diffusion models and report the rank correlation of the resulting scores.
minor comments (1)
- [Method] Notation for the reverse-time SDE and the precise form of the Jeffreys divergence (e.g., whether it is evaluated on the full path measure or only on the terminal marginal) should be stated explicitly with equation numbers in the main text.
Simulated Author's Rebuttal
We thank the referee for the detailed and constructive comments. We address each major point below and indicate the revisions we will make to strengthen the manuscript.
Point-by-point responses
- Referee: [Abstract / experimental validation] Abstract and experimental validation: the claim that the method 'aligns with human-annotated scores' is presented without any reported quantitative metrics (e.g., correlation coefficients, sample sizes, or statistical tests), dataset identifiers, or controls for prompt length, model choice, or conditioning strength. Because this alignment is the sole empirical support for the semantic interpretation, the absence of these details is load-bearing for the central claim.
  Authors: We agree that the abstract and experimental sections require explicit quantitative support. The manuscript contains experiments comparing the proposed divergence to human similarity annotations, but these lack the requested metrics and controls. In the revision we will report Pearson and Spearman correlations, sample sizes, p-values, the specific datasets used, and ablations on prompt length and classifier-free guidance scale.
  Revision: yes
- Referee: [Method definition] The reduction of semantic similarity to Jeffreys divergence on the induced reverse-time SDEs presupposes that the conditional law p(x|text) faithfully encodes latent meaning rather than training-data biases or mode-coverage gaps. No ablation (different backbone models, varying classifier-free guidance scales, or comparison against a non-diffusion generative model) is described that would test whether the divergence remains stable under changes that preserve semantics but alter the generative distribution. A concrete test would be to recompute the divergence on the same prompt pair using two independently trained diffusion models and report the rank correlation of the resulting scores.
  Authors: The current work demonstrates the measure on a single, publicly available diffusion backbone and relies on the observed alignment with human scores as initial evidence. We will add an explicit discussion of the modeling assumption and include new experiments that vary the guidance scale and compare two different publicly released diffusion checkpoints on the same prompt pairs, reporting rank correlation of the resulting scores. A broader comparison against non-diffusion generators lies outside the scope of the present study and will be noted as future work.
  Revision: partial
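A short worked note on why the guidance-scale ablation matters. The text on this page does not say whether the paper samples with classifier-free guidance, so the following is a sketch under that assumption. In the formulation of Ho and Salimans, with guidance weight $w$, conditional noise prediction $\epsilon_\theta(x_t, c)$, and unconditional prediction $\epsilon_\theta(x_t)$:

```latex
\begin{align}
\tilde{\epsilon}_\theta(x_t, c) &= (1 + w)\,\epsilon_\theta(x_t, c) - w\,\epsilon_\theta(x_t), \\
\tilde{\epsilon}_\theta(x_t, c_1) - \tilde{\epsilon}_\theta(x_t, c_2)
  &= (1 + w)\,\bigl(\epsilon_\theta(x_t, c_1) - \epsilon_\theta(x_t, c_2)\bigr).
\end{align}
```

The unconditional term cancels in the difference, so at any fixed point $x_t$ the squared gap between the two prompts' guided predictions scales by $(1 + w)^2$. The trajectories over which that gap is averaged also change with $w$, so raw divergence values should not be expected to stay fixed across guidance scales. The rank ordering of prompt pairs is the more meaningful quantity to compare.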
Circularity Check
No significant circularity; explicit definition with external validation
Full rationale
The paper explicitly defines semantic similarity as the Jeffreys divergence between reverse-time diffusion SDEs induced by each prompt (abstract: 'we characterize the semantic similarity between two textual expressions simply as the distance between image distributions they induce'). This is a first-principles proposal, not a derivation claiming to recover or predict human scores from other quantities. Alignment with annotations is presented only as empirical validation, not as part of the definitional chain. No equations, self-citations, fitted parameters, or ansatzes appear in the provided text that would reduce the central claim to its inputs by construction.
Axiom & Free-Parameter Ledger
axioms (1)
- Domain assumption: Image distributions produced by text-conditioned diffusion models capture the latent meaning of the text.