Gradients with Respect to Semantics Preserving Embeddings Tell the Uncertainty of Large Language Models

Mingda Li; Rundong Lv; Ting Liu; Weinan Zhang; Xinyu Li

arXiv: 2605.04638 · v2 · pith:DN3ZR3HKnew · submitted 2026-05-06 · cs.CL · cs.AI

Pith reviewed 2026-06-30 23:52 UTC · model grok-4.3

keywords: uncertainty quantification · large language models · gradient-based methods · semantic embeddings · free-form generation · sampling-free estimation

The pith

Gradients with respect to semantics-preserving embeddings quantify uncertainty in large language models without sampling.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper introduces SemGrad, a gradient-based method for uncertainty quantification in free-form LLM generation that requires no sampling. It starts from the observation that confident models keep their output distributions stable when inputs are altered while preserving meaning. The approach finds embeddings that best maintain semantics via a Semantic Preservation Score, then computes gradients of the output distribution with respect to those embeddings. A hybrid variant adds ordinary parameter gradients. Experiments indicate the resulting scores outperform prior methods, particularly when multiple answers are acceptable for one input.

Core claim

SemGrad treats the stability of an LLM's output distribution under semantically equivalent perturbations as gradients taken in semantic space. A Semantic Preservation Score selects the embeddings that best preserve input semantics for this gradient computation, delivering sampling-free uncertainty estimates for free-form generation that exceed the performance of existing sampling-heavy approaches.

What carries the argument

Gradients computed in semantic space with respect to embeddings selected by the Semantic Preservation Score (SPS).
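
As a rough illustration of the idea, the sketch below computes the gradient of the log-probability of a chosen output token with respect to a hidden embedding and uses its norm as an uncertainty score. Everything here is an assumption for illustration: the toy linear-softmax readout `W` stands in for an LLM, the names `grad_log_prob_wrt_embedding` and `gradient_norm_score` are hypothetical, and the paper's actual score (including how SPS picks the embedding) is not specified on this page.

```python
import math

def softmax(logits):
    """Numerically stable softmax over a list of floats."""
    m = max(logits)
    exps = [math.exp(z - m) for z in logits]
    total = sum(exps)
    return [e / total for e in exps]

def grad_log_prob_wrt_embedding(W, h, y):
    """Gradient of log p(y | h) with respect to h, for logits = W @ h.

    Toy stand-in for an LLM: W is a (vocab x dim) readout matrix and h is the
    embedding assumed to carry the input's semantics.
    """
    logits = [sum(w * x for w, x in zip(row, h)) for row in W]
    p = softmax(logits)
    # d/dh [z_y - logsumexp(z)] = W[y] - sum_k p_k * W[k]
    return [W[y][j] - sum(p[k] * W[k][j] for k in range(len(W)))
            for j in range(len(h))]

def gradient_norm_score(W, h, y):
    """Euclidean norm of the embedding gradient (hypothetical uncertainty score)."""
    g = grad_log_prob_wrt_embedding(W, h, y)
    return math.sqrt(sum(v * v for v in g))

# Illustrative readout with three output tokens and a 2-d embedding.
W = [[1.0, 0.0], [0.0, 1.0], [-1.0, -1.0]]
peaked = gradient_norm_score(W, [8.0, 0.0], 0)  # output distribution sharply favors token 0
flat = gradient_norm_score(W, [0.1, 0.0], 0)    # output distribution close to uniform
print(peaked < flat)
```

In this toy setting the peaked distribution yields a much smaller gradient norm than the near-uniform one, which is the qualitative behavior the summary describes. A real implementation would obtain the gradient by automatic differentiation through the model.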

If this is right

The method eliminates the need for multiple forward passes or sampling, lowering both compute cost and variance.
Performance gains are largest precisely in the regime where several distinct answers count as correct.
HybridGrad improves further by adding the information from ordinary parameter-space gradients.
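
This page does not say how HybridGrad combines the two signals. Purely as an assumed illustration, the sketch below standardizes a semantic-gradient score and a parameter-gradient score across a batch of prompts and takes a weighted sum. The function `hybrid_scores`, the z-score normalization, and the `weight` parameter are all hypothetical and not taken from the paper.

```python
from statistics import mean, pstdev

def zscore(values):
    """Standardize values to zero mean and unit variance (zeros if constant)."""
    mu = mean(values)
    sigma = pstdev(values)
    if sigma == 0:
        return [0.0 for _ in values]
    return [(v - mu) / sigma for v in values]

def hybrid_scores(semantic, parametric, weight=0.5):
    """Weighted sum of standardized scores (hypothetical combination rule)."""
    if len(semantic) != len(parametric):
        raise ValueError("score lists must have the same length")
    zs, zp = zscore(semantic), zscore(parametric)
    return [weight * a + (1.0 - weight) * b for a, b in zip(zs, zp)]

# Made-up scores for three prompts, on deliberately different scales.
semantic = [0.1, 0.4, 0.9]
parametric = [10.0, 30.0, 20.0]
combined = hybrid_scores(semantic, parametric)
print(combined)
```

Standardizing first keeps either signal from dominating only because of its scale; any other monotone combination would serve the same illustrative purpose.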

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the authors make directly.

If output stability under semantic change reliably signals uncertainty, the same gradient construction could be tested on non-text generative models.
Production systems might attach this uncertainty signal to every generation at negligible extra cost.
The technique could be applied to measure consistency across paraphrased prompts in other sequence-to-sequence tasks.

Load-bearing premise

A confident LLM maintains stable output distributions when its inputs undergo semantically equivalent perturbations.
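
One way to connect this premise to gradients, sketched here as an assumption rather than as the paper's own derivation, is a first-order expansion. With x the input's representation in semantic space, ∆x a small perturbation, and y∗ the generated response (the notation of Figure 1):

```latex
\log p(y^{*} \mid x + \Delta x)
  \approx \log p(y^{*} \mid x) + \nabla_{x} \log p(y^{*} \mid x)^{\top} \Delta x ,
\qquad
\bigl| \nabla_{x} \log p(y^{*} \mid x)^{\top} \Delta x \bigr|
  \le \bigl\| \nabla_{x} \log p(y^{*} \mid x) \bigr\|_{2} \, \bigl\| \Delta x \bigr\|_{2} .
```

Under this reading, a small gradient norm bounds how much the log-probability of the response can move under any small perturbation, so stability and small gradients coincide to first order. The bound is the Cauchy-Schwarz inequality and says nothing about large perturbations.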

What would settle it

A benchmark experiment in which the magnitude of these semantic gradients shows no correlation with the model's actual error rate or with human ratings of answer reliability on prompts with known multiple valid responses.
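
A check of this kind could be scored with AUROC, the metric named in the Figure 3 caption. The sketch below is a minimal pairwise implementation over made-up numbers; the function name `auroc` and the example data are illustrative assumptions, not results from the paper.

```python
def auroc(scores, is_error):
    """Probability that a random erroneous answer gets a higher uncertainty
    score than a random correct answer; ties count as half a win."""
    pos = [s for s, e in zip(scores, is_error) if e]
    neg = [s for s, e in zip(scores, is_error) if not e]
    if not pos or not neg:
        raise ValueError("need at least one erroneous and one correct example")
    wins = 0.0
    for p in pos:
        for n in neg:
            if p > n:
                wins += 1.0
            elif p == n:
                wins += 0.5
    return wins / (len(pos) * len(neg))

# Made-up uncertainty scores and error labels for five prompts.
scores = [0.9, 0.2, 0.3, 0.8, 0.5]
is_error = [True, True, False, False, False]
value = auroc(scores, is_error)
print(value)
```

An AUROC near 0.5 means the score does not separate wrong answers from right ones, which is the outcome that would count against the method; values near 1.0 mean errors consistently receive higher scores.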

Figures

Figures reproduced from arXiv: 2605.04638 by Mingda Li, Rundong Lv, Ting Liu, Weinan Zhang, Xinyu Li.

**Figure 1.** Illustration of output distribution shift under small input semantic perturbations and the semantic gradients. x represents the original input, and x + ∆x denotes a perturbed input with a small semantic change on x in the semantic space. y∗ denotes the response generated from p(y|x). For an input that the model is certain about, a small semantic perturbation should not significantly alter the output distr…

**Figure 2.** Semantic Preservation Score (SPS) of hidden states across different layers and tokens. We experiment on the last 10 input tokens, where “last #t token” denotes the last t-th token from the end of the user query (corresponding token is different for different queries). We observe that the token position carrying the most semantic information is consistent for the same model across different datasets.

**Figure 3.** Comparison of SemGrad UQ performance (AUROC) and semantic preservation capability (SPS) of different hidden states across layers and tokens. Experiments are conducted on the last 5 input tokens of Llama3.1-Instruct8B and Qwen3-Instruct4B. A strong correlation is observed: hidden states with higher semantic preservation capability yield better SemGrad performance.

**Figure 4.** Semantic Preservation Score (SPS) of hidden states across different layers and tokens. We experiment on the last 10 input tokens, where “last #t token” denotes the last t-th token from the end of the user query (corresponding token is different for different queries). We observe that the token position carrying the most semantic information is consistent for the same model across different datasets. M.I. (…

**Figure 5.** Upper: The upper panels show the histogram of the average per-token entropy ω¯ of responses generated by Llama3.1-Instruct8B on TruthfulQA, SciQ, and TriviaQA (left to right). The darker blue histogram corresponds to ω¯ for correct generations, while the lighter blue histogram corresponds to ω¯ for all generations. The two vertical dashed lines indicate the 50th and 75th percentiles of the ω¯ distribution …

Original abstract

Uncertainty quantification (UQ) is an important technique for ensuring the trustworthiness of LLMs, given their tendency to hallucinate. Existing state-of-the-art UQ approaches for free-form generation rely heavily on sampling, which incurs high computational cost and variance. In this work, we propose the first gradient-based UQ method for free-form generation, SemGrad, which is sampling-free and computationally efficient. Unlike prior gradient-based methods developed for classification tasks that operate in parameter space, we propose to consider gradients in semantic space. Our method builds on the key intuition that a confident LLM should maintain stable output distributions under semantically equivalent input perturbations. We interpret the stability as the gradients in semantic space and introduce a Semantic Preservation Score (SPS) to identify embeddings that best capture semantics, with respect to which gradients are computed. We further propose HybridGrad, which combines the strengths of SemGrad and parameter gradients. Experiments demonstrate that both of our methods provide efficient and effective uncertainty estimates, achieving superior performance than state-of-the-art methods, particularly in settings with multiple valid responses.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Desk Editor's Note (private letter to a colleague)

SemGrad proposes a sampling-free UQ method for free-form LLM generation by taking gradients in semantic space via a Semantic Preservation Score, but the abstract leaves the validation of its stability intuition and experimental claims thin.

The main new piece is moving gradient-based uncertainty from parameter space (standard in classification) to semantic space for open-ended generation. SemGrad picks embeddings that preserve semantics with the SPS, treats output stability under those perturbations as the uncertainty signal, and adds HybridGrad to blend it with ordinary parameter gradients. This targets the cost and variance of sampling methods directly.

The framing is useful. It explicitly calls out the multiple-valid-responses case as a strength, which is a realistic setting for generation. The core intuition, that confident models produce stable distributions under semantically equivalent inputs, is stated plainly and gives a coherent reason for working in embedding space rather than on tokens or parameters.

The soft spots sit in the missing checks. The description does not show that SPS embeddings actually isolate semantic equivalence better than simpler perturbations, nor that low gradient values track ground-truth uncertainty instead of embedding-space artifacts. The superiority claim over sampling baselines is asserted without visible setup, ablations, or numbers, so it is impossible to tell how much of any gain comes from the move to semantic space.

This is for people building practical UQ tools for LLMs who already know the sampling trade-offs. A reader who wants to try gradient signals in generation would get a clear starting point even if they need to fill in the validation. The idea is distinct enough from prior classification work and the problem is real enough that it deserves a serious referee to examine the implementation and results.

Recommendation: send it to review.

Referee Report

2 major / 0 minor

Summary. The paper proposes SemGrad, the first gradient-based uncertainty quantification method for free-form LLM generation. It computes gradients in semantic space with respect to embeddings selected via a Semantic Preservation Score (SPS) that identify semantics-preserving perturbations, based on the intuition that a confident LLM maintains stable output distributions under semantically equivalent inputs. It also introduces HybridGrad, which combines SemGrad with parameter-space gradients, and claims through experiments that both methods are sampling-free, computationally efficient, and outperform state-of-the-art sampling-based UQ methods, especially in settings with multiple valid responses.

Significance. If the results hold, the work would be significant for providing an efficient, sampling-free alternative to existing UQ methods for open-ended generation, where sampling incurs high cost and variance. Operating in semantic space rather than parameter space represents a novel direction that could improve trustworthiness assessments for LLMs prone to hallucination.

major comments (2)

[Abstract] The foundational intuition that 'a confident LLM should maintain stable output distributions under semantically equivalent input perturbations' is used to justify interpreting gradient magnitude (w.r.t. SPS embeddings) as a measure of uncertainty, but the manuscript provides no anchoring validation such as ablation studies, human evaluation of semantic equivalence, or correlation analysis showing that low gradient values align with ground-truth uncertainty rather than embedding-space artifacts.
[Abstract] The claim that experiments demonstrate 'superior performance than state-of-the-art methods, particularly in settings with multiple valid responses' is presented without any reported quantitative results, baselines, metrics, statistical tests, or experimental setup details, preventing assessment of whether the superiority holds or whether SPS embeddings outperform simpler perturbation strategies.

Simulated Author's Rebuttal

2 responses · 0 unresolved

We thank the referee for the detailed feedback on our submission. We address the two major comments on the abstract below. We will revise the abstract to better summarize the empirical validations and quantitative results from the full manuscript while preserving its conciseness.

Referee: [Abstract] The foundational intuition that 'a confident LLM should maintain stable output distributions under semantically equivalent input perturbations' is used to justify interpreting gradient magnitude (w.r.t. SPS embeddings) as a measure of uncertainty, but the manuscript provides no anchoring validation such as ablation studies, human evaluation of semantic equivalence, or correlation analysis showing that low gradient values align with ground-truth uncertainty rather than embedding-space artifacts.

Authors: The full manuscript validates the core intuition through systematic experiments comparing SemGrad and HybridGrad against sampling-based baselines on uncertainty quantification tasks, including settings with multiple valid responses where the methods show clear advantages. These results demonstrate that gradient magnitudes correlate with actual model uncertainty rather than artifacts, as evidenced by improved performance when combined in HybridGrad. We did not conduct human evaluations of semantic equivalence, as SPS relies on established embedding similarity metrics from prior literature. We will revise the abstract to explicitly reference the empirical validation of the intuition via these performance correlations and add a brief discussion of SPS design choices in the main text.

revision: partial

Referee: [Abstract] The claim that experiments demonstrate 'superior performance than state-of-the-art methods, particularly in settings with multiple valid responses' is presented without any reported quantitative results, baselines, metrics, statistical tests, or experimental setup details, preventing assessment of whether the superiority holds or whether SPS embeddings outperform simpler perturbation strategies.

Authors: We agree the abstract is high-level and omits specifics. The full paper reports results using standard UQ metrics (e.g., AUROC, AUPRC) against sampling-based baselines like temperature sampling and ensemble methods, with statistical significance tests, on benchmarks including those with multiple valid answers. We will revise the abstract to include key quantitative highlights (e.g., relative improvements) and mention the experimental setup and metrics to substantiate the superiority claim.

revision: yes

Circularity Check

0 steps flagged

No significant circularity; derivation rests on explicit assumption without self-referential reduction

The paper proposes SemGrad by directly stating its foundational intuition ('a confident LLM should maintain stable output distributions under semantically equivalent input perturbations') and then defining SPS and gradient computation in semantic space as the operationalization of that intuition. No equations, fitted parameters, or self-citations are shown that reduce the uncertainty score to the inputs by construction, nor is any 'prediction' statistically forced from a subset fit. The method introduces new components (SPS embeddings, HybridGrad) whose performance is evaluated externally against baselines, keeping the chain self-contained.

Axiom & Free-Parameter Ledger

0 free parameters · 1 axiom · 0 invented entities

Abstract-only review yields minimal visibility into parameters or entities; the central premise is treated as a domain assumption.

axioms (1)

domain assumption: A confident LLM maintains stable output distributions under semantically equivalent input perturbations.
This intuition is stated in the abstract as the foundation for interpreting gradients in semantic space as uncertainty.

