POISE trains a lightweight probe on the actor's internal states to predict expected rewards for RLVR, matching DAPO performance on math benchmarks with lower compute by avoiding extra rollouts or critic models.
Trace length is a simple un- certainty signal in reasoning models.arXiv preprint arXiv:2510.10409
4 Pith papers cite this work. Polarity classification is still indexing.
citation-role summary
citation-polarity summary
years
2026 4verdicts
UNVERDICTED 4roles
background 1polarities
background 1representative citing papers
VERDI derives three structural confidence signals from decomposed LLM verification traces and calibrates them with Platt-scaled logistic regression to achieve AUROC 0.72-0.91 on GPT models and 0.56-0.70 on Qwen models where log-probabilities fail.
SELFDOUBT introduces the Hedge-to-Verify Ratio from reasoning traces as a single-pass uncertainty signal, with no-hedge traces correct 96% of the time and outperforming semantic entropy at 10x lower cost.
Mechanistic experiments on Gemma 3 27B, Qwen 2.5 7B and Magistral Small 24B show verbal confidence is cached at post-answer positions from answer tokens and captures richer answer-quality information beyond token log-probabilities.
citing papers explorer
-
Your Language Model is Its Own Critic: Reinforcement Learning with Value Estimation from Actor's Internal States
POISE trains a lightweight probe on the actor's internal states to predict expected rewards for RLVR, matching DAPO performance on math benchmarks with lower compute by avoiding extra rollouts or critic models.
-
VERDI: Single-Call Confidence Estimation for Verification-Based LLM Judges via Decomposed Inference
VERDI derives three structural confidence signals from decomposed LLM verification traces and calibrates them with Platt-scaled logistic regression to achieve AUROC 0.72-0.91 on GPT models and 0.56-0.70 on Qwen models where log-probabilities fail.
-
SELFDOUBT: Uncertainty Quantification for Reasoning LLMs via the Hedge-to-Verify Ratio
SELFDOUBT introduces the Hedge-to-Verify Ratio from reasoning traces as a single-pass uncertainty signal, with no-hedge traces correct 96% of the time and outperforming semantic entropy at 10x lower cost.
-
How do LLMs Compute Verbal Confidence
Mechanistic experiments on Gemma 3 27B, Qwen 2.5 7B and Magistral Small 24B show verbal confidence is cached at post-answer positions from answer tokens and captures richer answer-quality information beyond token log-probabilities.