hub
On Calibration of Modern Neural Networks
24 Pith papers cite this work. Polarity classification is still indexing.
abstract
Confidence calibration -- the problem of predicting probability estimates representative of the true correctness likelihood -- is important for classification models in many applications. We discover that modern neural networks, unlike those from a decade ago, are poorly calibrated. Through extensive experiments, we observe that depth, width, weight decay, and Batch Normalization are important factors influencing calibration. We evaluate the performance of various post-processing calibration methods on state-of-the-art architectures with image and document classification datasets. Our analysis and experiments not only offer insights into neural network learning, but also provide a simple and straightforward recipe for practical settings: on most datasets, temperature scaling -- a single-parameter variant of Platt Scaling -- is surprisingly effective at calibrating predictions.
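The temperature scaling recipe mentioned in the abstract can be illustrated with a small sketch. It assumes the usual formulation: a single scalar T > 0 divides the logits before the softmax and is chosen to minimize negative log-likelihood on held-out validation data. The function names, the toy logits, and the use of golden-section search (swapped in here in place of a gradient-based optimizer) are illustrative assumptions, not the authors' implementation.

```python
import math

def softmax(logits, temperature=1.0):
    """Numerically stable softmax of logits / temperature."""
    scaled = [z / temperature for z in logits]
    m = max(scaled)
    exps = [math.exp(s - m) for s in scaled]
    total = sum(exps)
    return [e / total for e in exps]

def nll(logit_rows, labels, temperature):
    """Mean negative log-likelihood of the true labels at a given temperature."""
    total = 0.0
    for logits, y in zip(logit_rows, labels):
        scaled = [z / temperature for z in logits]
        m = max(scaled)
        log_norm = m + math.log(sum(math.exp(s - m) for s in scaled))
        total += log_norm - scaled[y]
    return total / len(labels)

def fit_temperature(logit_rows, labels, lo=0.05, hi=20.0, iters=80):
    """Golden-section search over log T for the T that minimizes validation NLL.

    The search bounds (lo, hi) are hypothetical defaults chosen for this sketch.
    """
    inv_phi = (math.sqrt(5) - 1) / 2
    a, b = math.log(lo), math.log(hi)
    c = b - inv_phi * (b - a)
    d = a + inv_phi * (b - a)
    fc = nll(logit_rows, labels, math.exp(c))
    fd = nll(logit_rows, labels, math.exp(d))
    for _ in range(iters):
        if fc < fd:
            # Minimum lies in [a, d]: shrink from the right.
            b, d, fd = d, c, fc
            c = b - inv_phi * (b - a)
            fc = nll(logit_rows, labels, math.exp(c))
        else:
            # Minimum lies in [c, b]: shrink from the left.
            a, c, fc = c, d, fd
            d = a + inv_phi * (b - a)
            fd = nll(logit_rows, labels, math.exp(d))
    return math.exp((a + b) / 2)

# Toy validation set (made up): every prediction is ~96% confident,
# but only 3 of the 4 predictions are correct.
val_logits = [[4.0, 0.0, 0.0], [0.0, 4.0, 0.0], [0.0, 0.0, 4.0], [4.0, 0.0, 0.0]]
val_labels = [0, 1, 2, 1]

T = fit_temperature(val_logits, val_labels)
calibrated = softmax(val_logits[0], T)
print(f"fitted T = {T:.3f}, top confidence after scaling = {max(calibrated):.3f}")
```

On this toy set the optimum can be worked out by hand as T = 4 / ln 6 ≈ 2.23, at which point the top-class confidence drops from about 0.96 to 0.75, matching the 75% accuracy. Because dividing all logits by the same positive T does not change their order, the predicted class, and hence accuracy, is unaffected.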
fields
cs.LG 9 cs.AI 4 cs.CL 3 eess.IV 2 stat.ML 2 astro-ph.GA 1 astro-ph.IM 1 cond-mat.mtrl-sci 1 cs.SE 1
roles
background 3
polarities
background 3
citing papers explorer
- Bayesian Social Deduction with Graph-Informed Language Models
Hybrid Bayesian-graph LLM agent reaches competitive performance against large models and achieves 67% win rate against humans in controlled Avalon play, outperforming baselines and human teammates.
- Pointwise Metrics Mislead: An Evaluation Protocol for Multimodal Inverse Problems
Pointwise metrics compress marginal spectra in multimodal inverse problems, and a three-part protocol using CRPS, spectrum fidelity, and calibration reverses model rankings on synthetic and particle-physics benchmarks.
- Training-Free Cultural Alignment of Large Language Models via Persona Disagreement
DISCA converts within-country disagreement among World Values Survey personas into a bounded logit correction that reduces cultural misalignment by 10-24% on MultiTP for models 3.8B and larger across 20 countries, without any weight updates.
- Photometric Redshift PDFs via Neural Network Classification for DESI Legacy Imaging Surveys and Pan-STARRS
Neural network classification with CRPS optimization produces calibrated photometric redshift PDFs for DESI Legacy and Pan-STARRS data, achieving σ_NMAD of 0.0153 on LSDR10 and outperforming regression methods.
- Training Language Models to Self-Correct via Reinforcement Learning
SCoRe uses multi-turn online RL with regularization on self-generated traces to improve LLM self-correction, achieving 15.6% and 9.1% gains on MATH and HumanEval for Gemini models.
- Diversity in Large Language Models under Supervised Fine-Tuning
TOFU loss mitigates the narrowing of generative diversity in LLMs after supervised fine-tuning by addressing neglect of low-frequency patterns and forgetting of prior knowledge.
- Pioneer Agent: Continual Improvement of Small Language Models in Production
Pioneer Agent automates the full lifecycle of adapting and continually improving small language models via diagnosis-driven data synthesis and regression-constrained retraining, delivering gains of 1.6-83.8 points on benchmarks and large lifts in production-style tasks.
- Ensemble-Based Dirichlet Modeling for Predictive Uncertainty and Selective Classification
Ensemble-based method of moments on softmax outputs produces stable Dirichlet predictive distributions that improve uncertainty-guided tasks like selective classification over evidential deep learning.
- Language Models (Mostly) Know What They Know
Language models show good calibration when asked to estimate the probability that their own answers are correct, with performance improving as models get larger.
- When to Answer and When to Defer: A Decision Framework for Reliable Code Predictions
Introduces a unified framework integrating uncertainty estimation, calibration, and tool-based abstention for reliable code predictions in language models.
- R2V Agent: Teaching SLMs When to Ask for Help
R2V-Agent combines an SLM policy trained via BC and DPO with a step-level risk-calibrated router using Brier scores and CVaR to escalate to LLM only on high residual failure risk, improving success-cost tradeoffs on HumanEval+, TextWorld, and TerminalBench.
- LiLAW: Lightweight Learnable Adaptive Weighting to Learn Sample Difficulty & Improve Noisy Training
LiLAW learns to weight samples as easy, moderate or hard using three global scalars updated by one gradient step on a validation batch to improve noisy training performance.
- Calibrated Model-Based Deep Reinforcement Learning
Augmenting model-based RL agents with calibrated predictive uncertainties improves planning, sample efficiency, and exploration on continuous control tasks.
- Decodable but Not Corrected by Fixed Residual-Stream Linear Steering: Evidence from Medical LLM Failure Regimes
Overthinking in medical QA is linearly decodable at 71.6% accuracy yet fixed residual-stream steering yields no correction across 29 configurations, while enabling selective abstention with AUROC 0.610.
- Scale-Dependent Input Representation and Confidence Estimation for LLMs in Materials Property Prediction
Larger LLMs handle detailed crystal descriptions better than small ones, and mean negative log-likelihood of predicted numbers tracks prediction error after fine-tuning.
- Calibration Collapse Under Sycophancy Fine-Tuning: How Reward Hacking Breaks Uncertainty Quantification in LLMs
Sycophantic GRPO fine-tuning degrades LLM calibration, raising ECE by 0.006 and MCE by 0.010, with a persistent residual after post-hoc scaling.
- MedFormer-UR: Uncertainty-Routed Transformer for Medical Image Classification
MedFormer-UR integrates evidential uncertainty from Dirichlet distributions and class-specific prototypes into a transformer to improve calibration and selective prediction on medical images across four modalities.
- A Gaia-linked High-purity QSO Candidate Catalog in Selected Fields with Extinction-binned Calibration and Spectrum-informed Training
The P3 selector achieves 0.9809 purity and 0.8869 completeness for QSO candidates in selected fields, outperforming Gaia's official probabilities.
- Single-bit-per-weight deep convolutional neural networks without batch-normalization layers for embedded systems
Experiments show that shifted-ReLU layers can replace batch-normalization in single-bit-weight wide residual networks on CIFAR-10/100 and ImageNet without consistent accuracy penalty.
- Uncertainty in Physics and AI: Taxonomy, Quantification, and Validation
A unified taxonomy of uncertainty in ML for physics is introduced together with validation tools such as coverage, calibration, and proper scoring rules, illustrated on regression and classification tasks.
- TRACE: A Metrologically-Grounded Engineering Framework for Trustworthy Agentic AI Systems in Operationally Critical Domains
TRACE is a metrologically-grounded four-layer engineering framework for trustworthy agentic AI that enforces an ML-LLM split, stateful policies, human supervision, and a parsimony metric across critical domains.
- Trust but Verify: Introducing DAVinCI -- A Framework for Dual Attribution and Verification in Claim Inference for Language Models
DAVinCI combines claim attribution to model internals and external sources with entailment-based verification to improve LLM factual reliability by 5-20% on fact-checking datasets.
- Calibration, Uncertainty Communication, and Deployment Readiness in CKD Risk Prediction: A Framework Evaluation Study
CKD risk prediction models achieve AUROC 1.00 internally but drop to 0.48-0.58 externally with high calibration error and low deployment scores, indicating need for external validation.
- Uncertainty-Calibrated Explainable Artificial Intelligence for Fetal Ultrasound Plane Classification: A Systematic Review