On Calibration of Modern Neural Networks
Abstract
Confidence calibration -- the problem of predicting probability estimates representative of the true correctness likelihood -- is important for classification models in many applications. We discover that modern neural networks, unlike those from a decade ago, are poorly calibrated. Through extensive experiments, we observe that depth, width, weight decay, and Batch Normalization are important factors influencing calibration. We evaluate the performance of various post-processing calibration methods on state-of-the-art architectures with image and document classification datasets. Our analysis and experiments not only offer insights into neural network learning, but also provide a simple and straightforward recipe for practical settings: on most datasets, temperature scaling -- a single-parameter variant of Platt Scaling -- is surprisingly effective at calibrating predictions.
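Temperature scaling, the method the abstract recommends, is simple enough to sketch: a single scalar T > 0 is fit on held-out validation logits by minimizing negative log-likelihood, and test logits are divided by T before the softmax, which changes the confidences but never the argmax, so accuracy is untouched. A minimal PyTorch sketch under those assumptions (the function name and LBFGS settings here are illustrative, not the authors' released code):

```python
import torch
import torch.nn.functional as F

def fit_temperature(logits, labels, max_iter=50):
    """Fit a scalar temperature T > 0 by minimizing NLL on validation data.

    logits: (N, K) pre-softmax outputs; labels: (N,) integer class indices.
    """
    log_t = torch.zeros(1, requires_grad=True)  # optimize log T so T stays positive
    optimizer = torch.optim.LBFGS([log_t], lr=0.1, max_iter=max_iter)

    def closure():
        optimizer.zero_grad()
        loss = F.cross_entropy(logits / log_t.exp(), labels)
        loss.backward()
        return loss

    optimizer.step(closure)
    return log_t.exp().item()

# Usage: scale test logits by the fitted T before taking the softmax.
# T = fit_temperature(val_logits, val_labels)
# probs = F.softmax(test_logits / T, dim=1)
```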
Forward citations
Cited by 12 Pith papers
- Diversity in Large Language Models under Supervised Fine-Tuning
  TOFU loss mitigates the narrowing of generative diversity in LLMs after supervised fine-tuning by addressing neglect of low-frequency patterns and forgetting of prior knowledge.
- Pioneer Agent: Continual Improvement of Small Language Models in Production
  Pioneer Agent automates the full lifecycle of adapting and continually improving small language models via diagnosis-driven data synthesis and regression-constrained retraining, delivering gains of 1.6-83.8 points on ...
- Ensemble-Based Dirichlet Modeling for Predictive Uncertainty and Selective Classification
  Ensemble-based method of moments on softmax outputs produces stable Dirichlet predictive distributions that improve uncertainty-guided tasks like selective classification over evidential deep learning.
- Language Models (Mostly) Know What They Know
  Language models show good calibration when asked to estimate the probability that their own answers are correct, with performance improving as models get larger.
- Decodable but Not Corrected by Fixed Residual-Stream Linear Steering: Evidence from Medical LLM Failure Regimes
  Overthinking in medical QA is linearly decodable at 71.6% accuracy, yet fixed residual-stream steering yields no correction across 29 configurations while enabling selective abstention with AUROC 0.610.
- Scale-Dependent Input Representation and Confidence Estimation for LLMs in Materials Property Prediction
  Larger LLMs handle detailed crystal descriptions better than small ones, and the mean negative log-likelihood of predicted numbers tracks prediction error after fine-tuning.
- Diversity in Large Language Models under Supervised Fine-Tuning
  Supervised fine-tuning narrows LLM generative diversity through neglect of low-frequency patterns and knowledge forgetting, but the TOFU loss mitigates this effect across models and benchmarks.
- Calibration Collapse Under Sycophancy Fine-Tuning: How Reward Hacking Breaks Uncertainty Quantification in LLMs
  Sycophantic GRPO fine-tuning degrades LLM calibration, raising ECE by 0.006 and MCE by 0.010, with a persistent residual after post-hoc scaling (a minimal ECE computation is sketched after this list).
- MedFormer-UR: Uncertainty-Routed Transformer for Medical Image Classification
  MedFormer-UR integrates evidential uncertainty from Dirichlet distributions and class-specific prototypes into a transformer to improve calibration and selective prediction on medical images across four modalities.
- Uncertainty in Physics and AI: Taxonomy, Quantification, and Validation
  A unified taxonomy of uncertainty in ML for physics is introduced together with validation tools such as coverage, calibration, and proper scoring rules, illustrated on regression and classification tasks.
- TRACE: A Metrologically-Grounded Engineering Framework for Trustworthy Agentic AI Systems in Operationally Critical Domains
  TRACE is a metrologically-grounded four-layer engineering framework for trustworthy agentic AI that enforces an ML-LLM split, stateful policies, human supervision, and a parsimony metric across critical domains.
- Trust but Verify: Introducing DAVinCI -- A Framework for Dual Attribution and Verification in Claim Inference for Language Models
  DAVinCI combines claim attribution to model internals and external sources with entailment-based verification to improve LLM factual reliability by 5-20% on fact-checking datasets.
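Several of the citing papers report calibration with the Expected Calibration Error (ECE) used in this paper: predictions are grouped into equal-width confidence bins, and the gap between average confidence and accuracy in each bin is summed, weighted by the bin's share of samples. A minimal NumPy sketch, assuming (N, K) probability arrays (the function name is illustrative; the 15-bin default is a common choice):

```python
import numpy as np

def expected_calibration_error(probs, labels, n_bins=15):
    """ECE with equal-width confidence bins.

    probs: (N, K) predicted class probabilities; labels: (N,) true classes.
    """
    confidences = probs.max(axis=1)    # confidence = max predicted probability
    predictions = probs.argmax(axis=1)
    accuracies = (predictions == labels).astype(float)

    ece = 0.0
    edges = np.linspace(0.0, 1.0, n_bins + 1)
    for lo, hi in zip(edges[:-1], edges[1:]):
        in_bin = (confidences > lo) & (confidences <= hi)
        if in_bin.any():
            gap = abs(accuracies[in_bin].mean() - confidences[in_bin].mean())
            ece += in_bin.mean() * gap  # weight the gap by the bin's sample share
    return ece
```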