WebSailor trains open-source web agents to match proprietary performance on complex information-seeking tasks by generating high-uncertainty scenarios and using a new RL method called DUPO.
Large language models must be taught to know what they don’t know
5 Pith papers cite this work. Polarity classification is still indexing.
representative citing papers
The paper introduces an optimization framework for AI agents to strategically seek support, proving a threshold policy on support value and providing an online algorithm to control missed-support error without distributional assumptions.
Supervised fine-tuning degrades the correlation between confidence scores and output quality in language models, driven by factors like training distribution similarity rather than true quality.
TokUR estimates token-level uncertainty via low-rank weight perturbations in LLMs, aggregates signals to correlate with correctness, and uses them to improve reasoning performance on math tasks.
ChatGPT o3-mini achieves 54.5% success on medium Codeforces tasks versus 18.1% for DeepSeek-R1, with both models performing similarly on easy tasks and poorly on hard ones.
citing papers explorer
-
Strategic Decision Support for AI Agents
The paper introduces an optimization framework for AI agents to strategically seek support, proving a threshold policy on support value and providing an online algorithm to control missed-support error without distributional assumptions.
-
Confident in a Confidence Score: Investigating the Sensitivity of Confidence Scores to Supervised Fine-Tuning
Supervised fine-tuning degrades the correlation between confidence scores and output quality in language models, driven by factors like training distribution similarity rather than true quality.
-
TokUR: Token-Level Uncertainty Estimation for Large Language Model Reasoning
TokUR estimates token-level uncertainty via low-rank weight perturbations in LLMs, aggregates signals to correlate with correctness, and uses them to improve reasoning performance on math tasks.
-
A Showdown of ChatGPT vs DeepSeek in Solving Programming Tasks
ChatGPT o3-mini achieves 54.5% success on medium Codeforces tasks versus 18.1% for DeepSeek-R1, with both models performing similarly on easy tasks and poorly on hard ones.