TokUR: Token-Level Uncertainty Estimation for Large Language Model Reasoning

2025 · cs.LG · arXiv 2505.11737

3 Pith papers cite this work. Polarity classification is still indexing.

abstract

While Large Language Models (LLMs) have demonstrated impressive capabilities, their output quality remains inconsistent across various application scenarios, making it difficult to identify trustworthy responses, especially in complex tasks requiring multi-step reasoning. In this paper, we propose a Token-level Uncertainty estimation framework for Reasoning (TokUR) that enables LLMs to self-assess and self-improve their responses in mathematical reasoning. Specifically, we introduce low-rank random weight perturbation during LLM decoding to generate predictive distributions for token-level uncertainty estimation, and we aggregate these uncertainty quantities to capture the semantic uncertainty of generated responses. Experiments on mathematical reasoning datasets of varying difficulty demonstrate that TokUR exhibits a strong correlation with answer correctness and model robustness, and the uncertainty signals produced by TokUR can be leveraged to enhance the model's reasoning performance at test time. These results highlight the effectiveness of TokUR as a principled and scalable approach for improving the reliability and interpretability of LLMs in challenging reasoning tasks.
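
The pipeline the abstract describes (perturb weights, collect predictive distributions, score each token, aggregate) can be sketched as follows. This is a toy illustration under stated assumptions, not TokUR's actual implementation: a small linear output head stands in for the LLM, the perturbation is Gaussian low-rank noise with hypothetical `rank` and `sigma` parameters, and token uncertainty uses the common entropy-based split (total = entropy of the averaged distribution, aleatoric = average entropy, epistemic = their difference). The abstract does not specify these details, and every function name here is illustrative.

```python
import math
import random


def softmax(logits):
    """Numerically stable softmax over a list of floats."""
    m = max(logits)
    exps = [math.exp(x - m) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]


def entropy(probs):
    """Shannon entropy in nats; zero-probability terms contribute nothing."""
    return -sum(p * math.log(p) for p in probs if p > 0.0)


def perturbed_logits(weight, hidden, rank, sigma, rng):
    """Logits under W + sigma * A @ B with fresh Gaussian A (vocab x rank), B (rank x dim).

    Assumption: isotropic Gaussian factors; the paper may use a different scheme.
    """
    # Compute z = B @ hidden first so the full perturbation is never materialised.
    z = [sum(rng.gauss(0.0, 1.0) * h for h in hidden) for _ in range(rank)]
    return [
        sum(w * h for w, h in zip(row, hidden))
        + sigma * sum(rng.gauss(0.0, 1.0) * zk for zk in z)
        for row in weight
    ]


def token_uncertainty(weight, hidden, num_samples=8, rank=2, sigma=0.1, seed=0):
    """Uncertainty of one next-token prediction from several perturbed forward passes."""
    rng = random.Random(seed)
    dists = [
        softmax(perturbed_logits(weight, hidden, rank, sigma, rng))
        for _ in range(num_samples)
    ]
    mean = [sum(d[i] for d in dists) / num_samples for i in range(len(weight))]
    total = entropy(mean)                                     # entropy of the mean
    aleatoric = sum(entropy(d) for d in dists) / num_samples  # mean of the entropies
    return {"total": total, "aleatoric": aleatoric, "epistemic": total - aleatoric}


def sequence_uncertainty(weight, hidden_states, **kwargs):
    """Aggregate token-level epistemic uncertainty by averaging over positions.

    Assumption: a plain mean; other aggregations are equally plausible.
    """
    scores = [token_uncertainty(weight, h, **kwargs)["epistemic"] for h in hidden_states]
    return sum(scores) / len(scores)
```

Calling `sequence_uncertainty(W, hidden_states)` with a vocabulary-by-dimension weight matrix and one hidden vector per generated token yields a single score per response, which could then be compared across candidate answers. With `sigma=0.0` every sampled distribution is identical, so the epistemic term collapses to approximately zero, which is a quick sanity check on the decomposition.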

citation-role summary

background 1

citation-polarity summary

background 1

representative citing papers

BAS: A Decision-Theoretic Approach to Evaluating Large Language Model Confidence

cs.CL · 2026-04-03 · unverdicted · novelty 7.0

BAS aggregates utility from an answer-or-abstain model across risk thresholds and is uniquely maximized by truthful confidence estimates.

Aligning LLM Uncertainty with Human Disagreement in Subjectivity Analysis

cs.CL · 2026-05-11 · unverdicted · novelty 6.0 · 2 refs

DPUA is a two-phase framework that aligns LLM uncertainty expressions with human disagreement distributions in subjectivity analysis while preserving task performance.

Scaling Reasoning Hop Exposes Weaknesses: Demystifying and Improving Hop Generalization in Large Language Models

cs.CL · 2026-01-29 · unverdicted · novelty 6.0

Erroneous processing heads in attention layers cause hop-generalization failures in LLMs; dynamically deactivating them at test time improves multi-step reasoning.

citing papers explorer

Showing 3 of 3 citing papers.

BAS: A Decision-Theoretic Approach to Evaluating Large Language Model Confidence

cs.CL · 2026-04-03 · unverdicted · none · ref 58 · internal anchor

Aligning LLM Uncertainty with Human Disagreement in Subjectivity Analysis

cs.CL · 2026-05-11 · unverdicted · none · ref 32 · 2 links · internal anchor

Scaling Reasoning Hop Exposes Weaknesses: Demystifying and Improving Hop Generalization in Large Language Models

cs.CL · 2026-01-29 · unverdicted · none · ref 6 · internal anchor
