BAS aggregates utility from an answer-or-abstain model across risk thresholds and is uniquely maximized by truthful confidence estimates.
hub
Edward Yeo, Yuxuan Tong, Morry Niu, Graham Neubig, and Xiang Yue
13 Pith papers cite this work. Polarity classification is still indexing.
hub tools
citation-role summary
roles: background 3
citation-polarity summary
polarities: background 3
representative citing papers
citing papers explorer
- BAS: A Decision-Theoretic Approach to Evaluating Large Language Model Confidence
  BAS aggregates utility from an answer-or-abstain model across risk thresholds and is uniquely maximized by truthful confidence estimates.
- Reasoning Model Is Superior LLM-Judge, Yet Suffers from Biases
  Reasoning models judge better than non-reasoning LLMs yet retain biases; generating an evaluation plan first mitigates bias without losing accuracy.
- Hallucinations Undermine Trust; Metacognition is a Way Forward
  LLMs need metacognition to align expressed uncertainty with their actual knowledge boundaries, moving beyond knowledge expansion to reduce confident errors.
- NeuReasoner: Towards Explainable, Controllable, and Unified Reasoning via Mixture-of-Neurons
  NeuReasoner detects neuron fluctuation patterns linked to reasoning failures and inserts special tokens to enable controllable self-correction, delivering up to 27% performance gains and 19-63% lower token use across multiple benchmarks and model sizes.
- Harnessing Reasoning Trajectories for Hallucination Detection via Answer-agreement Representation Shaping
  ARS shapes reasoning trace representations by clustering states that produce consistent answers and separating those that produce inconsistent ones via latent perturbations, improving plug-and-play hallucination detection without human annotations.
- Reward Hacking in the Era of Large Models: Mechanisms, Emergent Misalignment, Challenges
  The paper introduces the Proxy Compression Hypothesis as a unifying framework explaining reward hacking in RLHF as an emergent result of compressing high-dimensional human objectives into proxy reward signals under optimization pressure.
- FAITH: Factuality Alignment through Integrating Trustworthiness and Honestness
  FAITH improves LLM factual accuracy by mapping confidence and semantic entropy into natural-language knowledge-state quadrants for trustworthiness and honestness, then applying PPO with a combined reward and retrieval augmentation.
- Beyond Stochastic Exploration: What Makes Training Data Valuable for Agentic Search
  HiExp extracts hierarchical experience knowledge from reasoning trajectories via contrastive analysis and clustering to regularize RL training, turning stochastic exploration into strategic search with reported gains in performance and generalization.
- Self-Rewarding Vision-Language Model via Reasoning Decomposition
  Vision SR1 decomposes VLM reasoning into visual and language components and uses internal self-rewards to improve visual reasoning and reduce hallucinations more efficiently than external-supervision methods.
- KnowRL: Exploring Knowledgeable Reinforcement Learning for Factuality
  KnowRL integrates a knowledge-verification factuality reward into RL training to enforce fact-based reasoning steps and lower hallucination rates in LLMs.
- Evaluating Reasoning Models for Queries with Presuppositions
  Reasoning models achieve only 2-11% higher accuracy than non-reasoning models when handling queries with false presuppositions, failing to challenge 26-42% of them and remaining sensitive to presupposition strength.
- Rethinking Agentic Reinforcement Learning In Large Language Models
  The paper reviews conceptual foundations, methodological innovations, effective designs, critical challenges, and future directions for LLM-based Agentic Reinforcement Learning.
- Position: The Hidden Costs and Measurement Gaps of Reinforcement Learning with Verifiable Rewards