WebSailor trains open-source web agents to match proprietary performance on complex information-seeking tasks by generating high-uncertainty scenarios and using a new RL method called DUPO.
Large language models must be taught to know what they don’t know,
4 Pith papers cite this work. Polarity classification is still indexing.
representative citing papers
Supervised fine-tuning degrades the correlation between confidence scores and output quality in language models, driven by factors like training distribution similarity rather than true quality.
TokUR estimates token-level uncertainty via low-rank weight perturbations in LLMs, aggregates signals to correlate with correctness, and uses them to improve reasoning performance on math tasks.
ChatGPT o3-mini achieves 54.5% success on medium Codeforces tasks versus 18.1% for DeepSeek-R1, with both models performing similarly on easy tasks and poorly on hard ones.
citing papers explorer
-
WebSailor: Navigating Super-human Reasoning for Web Agent
WebSailor trains open-source web agents to match proprietary performance on complex information-seeking tasks by generating high-uncertainty scenarios and using a new RL method called DUPO.
-
Confident in a Confidence Score: Investigating the Sensitivity of Confidence Scores to Supervised Fine-Tuning
Supervised fine-tuning degrades the correlation between confidence scores and output quality in language models, driven by factors like training distribution similarity rather than true quality.
-
TokUR: Token-Level Uncertainty Estimation for Large Language Model Reasoning
TokUR estimates token-level uncertainty via low-rank weight perturbations in LLMs, aggregates signals to correlate with correctness, and uses them to improve reasoning performance on math tasks.
-
A Showdown of ChatGPT vs DeepSeek in Solving Programming Tasks
ChatGPT o3-mini achieves 54.5% success on medium Codeforces tasks versus 18.1% for DeepSeek-R1, with both models performing similarly on easy tasks and poorly on hard ones.