On Calibration of Modern Neural Networks

2017 · cs.LG · arXiv 1706.04599

24 Pith papers cite this work. Polarity classification is still indexing.

abstract

Confidence calibration -- the problem of predicting probability estimates representative of the true correctness likelihood -- is important for classification models in many applications. We discover that modern neural networks, unlike those from a decade ago, are poorly calibrated. Through extensive experiments, we observe that depth, width, weight decay, and Batch Normalization are important factors influencing calibration. We evaluate the performance of various post-processing calibration methods on state-of-the-art architectures with image and document classification datasets. Our analysis and experiments not only offer insights into neural network learning, but also provide a simple and straightforward recipe for practical settings: on most datasets, temperature scaling -- a single-parameter variant of Platt Scaling -- is surprisingly effective at calibrating predictions.
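The temperature scaling recipe named in the abstract can be sketched in a few lines. This is a minimal illustration rather than the authors' implementation: it divides the logits by a single scalar T chosen to minimize negative log-likelihood on held-out data, then compares expected calibration error (ECE) before and after. The grid search (instead of a gradient-based optimizer), the synthetic overconfident logits, the 15-bin ECE, and every function name here are assumptions made for the example.

```python
import numpy as np


def softmax(logits):
    # Subtract the row-wise max for numerical stability.
    z = logits - logits.max(axis=1, keepdims=True)
    e = np.exp(z)
    return e / e.sum(axis=1, keepdims=True)


def nll(logits, labels, temperature):
    # Mean negative log-likelihood of the true labels after scaling by 1/T.
    probs = softmax(logits / temperature)
    picked = probs[np.arange(len(labels)), labels]
    return float(-np.mean(np.log(picked + 1e-12)))


def fit_temperature(logits, labels, grid=None):
    # Assumption: a simple grid search stands in for a gradient-based fit.
    if grid is None:
        grid = np.linspace(0.25, 10.0, 391)
    losses = [nll(logits, labels, t) for t in grid]
    return float(grid[int(np.argmin(losses))])


def expected_calibration_error(probs, labels, n_bins=15):
    # Weighted average of |accuracy - confidence| over equal-width bins.
    confidence = probs.max(axis=1)
    correct = (probs.argmax(axis=1) == labels).astype(float)
    edges = np.linspace(0.0, 1.0, n_bins + 1)
    ece = 0.0
    for lo, hi in zip(edges[:-1], edges[1:]):
        in_bin = (confidence > lo) & (confidence <= hi)
        if in_bin.any():
            gap = abs(correct[in_bin].mean() - confidence[in_bin].mean())
            ece += in_bin.mean() * gap
    return float(ece)


def demo(seed=0, n=4000, k=5, overconfidence=3.0):
    # Synthetic setup (hypothetical): labels are sampled from softmax(true_logits)
    # via the Gumbel-max trick, while the "model" reports logits that are
    # `overconfidence` times too large.
    rng = np.random.default_rng(seed)
    true_logits = rng.normal(scale=2.0, size=(n, k))
    labels = np.argmax(true_logits + rng.gumbel(size=(n, k)), axis=1)
    logits = overconfidence * true_logits

    half = n // 2  # first half: validation split for fitting T; second half: test
    t = fit_temperature(logits[:half], labels[:half])
    before = expected_calibration_error(softmax(logits[half:]), labels[half:])
    after = expected_calibration_error(softmax(logits[half:] / t), labels[half:])
    return t, before, after


if __name__ == "__main__":
    t, before, after = demo()
    print(f"fitted T = {t:.2f}, ECE before = {before:.3f}, ECE after = {after:.3f}")
```

Because dividing every logit by the same positive T leaves the argmax unchanged, this kind of rescaling adjusts confidence without changing the predicted class, so accuracy is unaffected.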

citation-role summary

background 3

citation-polarity summary

background 3

representative citing papers

Bayesian Social Deduction with Graph-Informed Language Models

cs.AI · 2025-06-21 · unverdicted · novelty 7.0

Hybrid Bayesian-graph LLM agent reaches competitive performance against large models and achieves 67% win rate against humans in controlled Avalon play, outperforming baselines and human teammates.

Pointwise Metrics Mislead: An Evaluation Protocol for Multimodal Inverse Problems

cs.LG · 2026-05-21 · unverdicted · novelty 6.0

Pointwise metrics compress marginal spectra in multimodal inverse problems, and a three-part protocol using CRPS, spectrum fidelity, and calibration reverses model rankings on synthetic and particle-physics benchmarks.

Training-Free Cultural Alignment of Large Language Models via Persona Disagreement

cs.CL · 2026-05-11 · conditional · novelty 6.0

DISCA converts within-country disagreement among World Values Survey personas into a bounded logit correction that reduces cultural misalignment by 10-24% on MultiTP for models 3.8B and larger across 20 countries, without any weight updates.

Photometric Redshift PDFs via Neural Network Classification for DESI Legacy Imaging Surveys and Pan-STARRS

astro-ph.GA · 2026-02-02 · unverdicted · novelty 6.0

Neural network classification with CRPS optimization produces calibrated photometric redshift PDFs for DESI Legacy and Pan-STARRS data, achieving σ_NMAD of 0.0153 on LSDR10 and outperforming regression methods.

Training Language Models to Self-Correct via Reinforcement Learning

cs.LG · 2024-09-19 · unverdicted · novelty 6.0

SCoRe uses multi-turn online RL with regularization on self-generated traces to improve LLM self-correction, achieving 15.6% and 9.1% gains on MATH and HumanEval for Gemini models.

Diversity in Large Language Models under Supervised Fine-Tuning

cs.LG · 2026-04-30 · unverdicted · novelty 6.0 · 2 refs

TOFU loss mitigates the narrowing of generative diversity in LLMs after supervised fine-tuning by addressing neglect of low-frequency patterns and forgetting of prior knowledge.

Pioneer Agent: Continual Improvement of Small Language Models in Production

cs.AI · 2026-04-10 · unverdicted · novelty 6.0

Pioneer Agent automates the full lifecycle of adapting and continually improving small language models via diagnosis-driven data synthesis and regression-constrained retraining, delivering gains of 1.6-83.8 points on benchmarks and large lifts in production-style tasks.

Ensemble-Based Dirichlet Modeling for Predictive Uncertainty and Selective Classification

stat.ML · 2026-04-07 · unverdicted · novelty 6.0

Ensemble-based method of moments on softmax outputs produces stable Dirichlet predictive distributions that improve uncertainty-guided tasks like selective classification over evidential deep learning.

Language Models (Mostly) Know What They Know

cs.CL · 2022-07-11 · unverdicted · novelty 6.0

Language models show good calibration when asked to estimate the probability that their own answers are correct, with performance improving as models get larger.

When to Answer and When to Defer: A Decision Framework for Reliable Code Predictions

cs.SE · 2026-05-19 · unverdicted · novelty 5.0

Introduces a unified framework integrating uncertainty estimation, calibration, and tool-based abstention for reliable code predictions in language models.

R2V Agent: Teaching SLMs When to Ask for Help

cs.LG · 2026-05-15 · unverdicted · novelty 5.0

R2V-Agent combines an SLM policy trained via BC and DPO with a step-level risk-calibrated router using Brier scores and CVaR to escalate to LLM only on high residual failure risk, improving success-cost tradeoffs on HumanEval+, TextWorld, and TerminalBench.

LiLAW: Lightweight Learnable Adaptive Weighting to Learn Sample Difficulty & Improve Noisy Training

cs.LG · 2025-09-25 · unverdicted · novelty 5.0

LiLAW learns to weight samples as easy, moderate or hard using three global scalars updated by one gradient step on a validation batch to improve noisy training performance.

Calibrated Model-Based Deep Reinforcement Learning

cs.LG · 2019-06-19 · unverdicted · novelty 5.0

Augmenting model-based RL agents with calibrated predictive uncertainties improves planning, sample efficiency, and exploration on continuous control tasks.

Decodable but Not Corrected by Fixed Residual-Stream Linear Steering: Evidence from Medical LLM Failure Regimes

cs.AI · 2026-05-07 · unverdicted · novelty 5.0

Overthinking in medical QA is linearly decodable at 71.6% accuracy yet fixed residual-stream steering yields no correction across 29 configurations, while enabling selective abstention with AUROC 0.610.

Scale-Dependent Input Representation and Confidence Estimation for LLMs in Materials Property Prediction

cond-mat.mtrl-sci · 2026-05-05 · conditional · novelty 5.0

Larger LLMs handle detailed crystal descriptions better than small ones, and mean negative log-likelihood of predicted numbers tracks prediction error after fine-tuning.

Calibration Collapse Under Sycophancy Fine-Tuning: How Reward Hacking Breaks Uncertainty Quantification in LLMs

cs.LG · 2026-04-12 · conditional · novelty 5.0

Sycophantic GRPO fine-tuning degrades LLM calibration, raising ECE by 0.006 and MCE by 0.010, with a persistent residual after post-hoc scaling.

MedFormer-UR: Uncertainty-Routed Transformer for Medical Image Classification

eess.IV · 2026-04-10 · unverdicted · novelty 5.0

MedFormer-UR integrates evidential uncertainty from Dirichlet distributions and class-specific prototypes into a transformer to improve calibration and selective prediction on medical images across four modalities.

A Gaia-linked High-purity QSO Candidate Catalog in Selected Fields with Extinction-binned Calibration and Spectrum-informed Training

astro-ph.IM · 2026-05-22 · unverdicted · novelty 4.0

The P3 selector achieves 0.9809 purity and 0.8869 completeness for QSO candidates in selected fields, outperforming Gaia's official probabilities.

Single-bit-per-weight deep convolutional neural networks without batch-normalization layers for embedded systems

cs.LG · 2019-07-16 · unverdicted · novelty 4.0

Experiments show that shifted-ReLU layers can replace batch-normalization in single-bit-weight wide residual networks on CIFAR-10/100 and ImageNet without consistent accuracy penalty.

Uncertainty in Physics and AI: Taxonomy, Quantification, and Validation

stat.ML · 2026-05-11 · accept · novelty 4.0

A unified taxonomy of uncertainty in ML for physics is introduced together with validation tools such as coverage, calibration, and proper scoring rules, illustrated on regression and classification tasks.

TRACE: A Metrologically-Grounded Engineering Framework for Trustworthy Agentic AI Systems in Operationally Critical Domains

cs.CL · 2026-05-05 · unverdicted · novelty 4.0

TRACE is a metrologically-grounded four-layer engineering framework for trustworthy agentic AI that enforces an ML-LLM split, stateful policies, human supervision, and a parsimony metric across critical domains.

Trust but Verify: Introducing DAVinCI -- A Framework for Dual Attribution and Verification in Claim Inference for Language Models

cs.AI · 2026-04-23 · unverdicted · novelty 4.0

DAVinCI combines claim attribution to model internals and external sources with entailment-based verification to improve LLM factual reliability by 5-20% on fact-checking datasets.

Calibration, Uncertainty Communication, and Deployment Readiness in CKD Risk Prediction: A Framework Evaluation Study

cs.LG · 2026-05-20 · unverdicted · novelty 3.0

CKD risk prediction models achieve AUROC 1.00 internally but drop to 0.48-0.58 externally with high calibration error and low deployment scores, indicating need for external validation.

Uncertainty-Calibrated Explainable Artificial Intelligence for Fetal Ultrasound Plane Classification: A Systematic Review

eess.IV · 2026-01-02

citing papers explorer

Showing 24 of 24 citing papers.

Bayesian Social Deduction with Graph-Informed Language Models
cs.AI · 2025-06-21 · unverdicted · none · ref 14 · internal anchor

Pointwise Metrics Mislead: An Evaluation Protocol for Multimodal Inverse Problems
cs.LG · 2026-05-21 · unverdicted · none · ref 19 · internal anchor

Training-Free Cultural Alignment of Large Language Models via Persona Disagreement
cs.CL · 2026-05-11 · conditional · none · ref 11 · internal anchor

Photometric Redshift PDFs via Neural Network Classification for DESI Legacy Imaging Surveys and Pan-STARRS
astro-ph.GA · 2026-02-02 · unverdicted · none · ref 27 · internal anchor

Training Language Models to Self-Correct via Reinforcement Learning
cs.LG · 2024-09-19 · unverdicted · none · ref 198 · internal anchor

Diversity in Large Language Models under Supervised Fine-Tuning
cs.LG · 2026-04-30 · unverdicted · none · ref 4 · 2 links

Pioneer Agent: Continual Improvement of Small Language Models in Production
cs.AI · 2026-04-10 · unverdicted · none · ref 35

Ensemble-Based Dirichlet Modeling for Predictive Uncertainty and Selective Classification
stat.ML · 2026-04-07 · unverdicted · none · ref 1

Language Models (Mostly) Know What They Know
cs.CL · 2022-07-11 · unverdicted · none · ref 9

When to Answer and When to Defer: A Decision Framework for Reliable Code Predictions
cs.SE · 2026-05-19 · unverdicted · none · ref 3 · internal anchor

R2V Agent: Teaching SLMs When to Ask for Help
cs.LG · 2026-05-15 · unverdicted · none · ref 4 · internal anchor

LiLAW: Lightweight Learnable Adaptive Weighting to Learn Sample Difficulty & Improve Noisy Training
cs.LG · 2025-09-25 · unverdicted · none · ref 7 · internal anchor

Calibrated Model-Based Deep Reinforcement Learning
cs.LG · 2019-06-19 · unverdicted · none · ref 19 · internal anchor

Decodable but Not Corrected by Fixed Residual-Stream Linear Steering: Evidence from Medical LLM Failure Regimes
cs.AI · 2026-05-07 · unverdicted · none · ref 25

Scale-Dependent Input Representation and Confidence Estimation for LLMs in Materials Property Prediction
cond-mat.mtrl-sci · 2026-05-05 · conditional · none · ref 39

Calibration Collapse Under Sycophancy Fine-Tuning: How Reward Hacking Breaks Uncertainty Quantification in LLMs
cs.LG · 2026-04-12 · conditional · none · ref 1

MedFormer-UR: Uncertainty-Routed Transformer for Medical Image Classification
eess.IV · 2026-04-10 · unverdicted · none · ref 4

A Gaia-linked High-purity QSO Candidate Catalog in Selected Fields with Extinction-binned Calibration and Spectrum-informed Training
astro-ph.IM · 2026-05-22 · unverdicted · none · ref 25 · internal anchor

Single-bit-per-weight deep convolutional neural networks without batch-normalization layers for embedded systems
cs.LG · 2019-07-16 · unverdicted · none · ref 36 · internal anchor

Uncertainty in Physics and AI: Taxonomy, Quantification, and Validation
stat.ML · 2026-05-11 · accept · none · ref 6

TRACE: A Metrologically-Grounded Engineering Framework for Trustworthy Agentic AI Systems in Operationally Critical Domains
cs.CL · 2026-05-05 · unverdicted · none · ref 20

Trust but Verify: Introducing DAVinCI -- A Framework for Dual Attribution and Verification in Claim Inference for Language Models
cs.AI · 2026-04-23 · unverdicted · none · ref 13

Calibration, Uncertainty Communication, and Deployment Readiness in CKD Risk Prediction: A Framework Evaluation Study
cs.LG · 2026-05-20 · unverdicted · none · ref 21 · internal anchor

Uncertainty-Calibrated Explainable Artificial Intelligence for Fetal Ultrasound Plane Classification: A Systematic Review
eess.IV · 2026-01-02 · unreviewed · ref 13 · internal anchor
