Large language models must be taught to know what they don’t know,

· 2023 · arXiv 2406.08391

4 Pith papers cite this work. Polarity classification is still indexing.

4 Pith papers citing it

representative citing papers

WebSailor: Navigating Super-human Reasoning for Web Agent

cs.CL · 2025-07-03 · conditional · novelty 6.0

WebSailor trains open-source web agents to match proprietary performance on complex information-seeking tasks by generating high-uncertainty scenarios and using a new RL method called DUPO.

Confident in a Confidence Score: Investigating the Sensitivity of Confidence Scores to Supervised Fine-Tuning

cs.CL · 2026-04-10 · unverdicted · novelty 5.0

Supervised fine-tuning degrades the correlation between confidence scores and output quality in language models, driven by factors like training distribution similarity rather than true quality.

TokUR: Token-Level Uncertainty Estimation for Large Language Model Reasoning

cs.LG · 2025-05-16 · unverdicted · novelty 5.0

TokUR estimates token-level uncertainty via low-rank weight perturbations in LLMs, aggregates signals to correlate with correctness, and uses them to improve reasoning performance on math tasks.

A Showdown of ChatGPT vs DeepSeek in Solving Programming Tasks

cs.SE · 2025-03-16 · unverdicted · novelty 3.0

ChatGPT o3-mini achieves 54.5% success on medium Codeforces tasks versus 18.1% for DeepSeek-R1, with both models performing similarly on easy tasks and poorly on hard ones.

citing papers explorer

Showing 4 of 4 citing papers.

WebSailor: Navigating Super-human Reasoning for Web Agent cs.CL · 2025-07-03 · conditional · none · ref 12
WebSailor trains open-source web agents to match proprietary performance on complex information-seeking tasks by generating high-uncertainty scenarios and using a new RL method called DUPO.
Confident in a Confidence Score: Investigating the Sensitivity of Confidence Scores to Supervised Fine-Tuning cs.CL · 2026-04-10 · unverdicted · none · ref 11
Supervised fine-tuning degrades the correlation between confidence scores and output quality in language models, driven by factors like training distribution similarity rather than true quality.
TokUR: Token-Level Uncertainty Estimation for Large Language Model Reasoning cs.LG · 2025-05-16 · unverdicted · none · ref 20
TokUR estimates token-level uncertainty via low-rank weight perturbations in LLMs, aggregates signals to correlate with correctness, and uses them to improve reasoning performance on math tasks.
A Showdown of ChatGPT vs DeepSeek in Solving Programming Tasks cs.SE · 2025-03-16 · unverdicted · none · ref 8
ChatGPT o3-mini achieves 54.5% success on medium Codeforces tasks versus 18.1% for DeepSeek-R1, with both models performing similarly on easy tasks and poorly on hard ones.

Large language models must be taught to know what they don’t know,

fields

years

verdicts

representative citing papers

citing papers explorer