Concurrent Criterion Validation of a Validity Screen for LLM Confidence Signals via Selective Prediction
2 Pith papers cite this work.
abstract
The validity screen (Cacioli, 2026d, 2026e) classifies LLM confidence signals as Valid, Indeterminate, or Invalid. We test whether these classifications predict selective prediction performance. Twenty frontier LLMs from seven families were evaluated on 524 items across six cognitive tracks. Valid models show mean Type 2 AUROC = .624 (SD = .048). Invalid models show mean AUROC = .357 (SD = .231). Cohen's d = 2.81, p = .002. The tiers order monotonically: Invalid (.357) < Indeterminate (.554) < Valid (.624). Split-half cross-validation yields median d = 1.77, P(d > 0) = 1.0 across 1,000 splits. The three-tier classification accounts for 47% of the variance in AUROC. DeepSeek-R1 drops from 85.3% accuracy at full coverage to 11.3% at 10% coverage. The screen predicts the criterion. For selective prediction, the screen matters.
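The two criterion measures named in the abstract, Type 2 AUROC and accuracy at a given coverage, can be illustrated with a minimal sketch. This is not the paper's code: the function names `type2_auroc` and `accuracy_at_coverage`, the tie handling, the rounding of the retained item count, and the toy data are all assumptions made for illustration.

```python
from typing import Sequence


def type2_auroc(confidence: Sequence[float], correct: Sequence[bool]) -> float:
    """Probability that a randomly drawn correct item carries higher
    confidence than a randomly drawn incorrect item (ties count as 0.5).
    0.5 means no discrimination; below 0.5 means confidence is inverted."""
    pos = [c for c, ok in zip(confidence, correct) if ok]
    neg = [c for c, ok in zip(confidence, correct) if not ok]
    if not pos or not neg:
        raise ValueError("need at least one correct and one incorrect item")
    wins = 0.0
    for p in pos:
        for n in neg:
            if p > n:
                wins += 1.0
            elif p == n:
                wins += 0.5
    return wins / (len(pos) * len(neg))


def accuracy_at_coverage(
    confidence: Sequence[float], correct: Sequence[bool], coverage: float
) -> float:
    """Accuracy on the most-confident `coverage` fraction of items.
    Assumption: the retained count is round(coverage * n), at least 1."""
    if not 0.0 < coverage <= 1.0:
        raise ValueError("coverage must be in (0, 1]")
    k = max(1, round(coverage * len(confidence)))
    ranked = sorted(zip(confidence, correct), key=lambda t: t[0], reverse=True)
    kept = ranked[:k]
    return sum(ok for _, ok in kept) / k


# Hypothetical toy data: the model is most confident on the items it gets wrong.
toy_confidence = [0.95, 0.90, 0.60, 0.50, 0.40, 0.30]
toy_correct = [False, False, True, True, True, True]

print(type2_auroc(toy_confidence, toy_correct))
print(accuracy_at_coverage(toy_confidence, toy_correct, 1.0))
print(accuracy_at_coverage(toy_confidence, toy_correct, 0.5))
```

In the toy data, accuracy falls as coverage shrinks because the highest-confidence items are the wrong ones. That is the same qualitative pattern the abstract reports for DeepSeek-R1, where accuracy drops from 85.3% at full coverage to 11.3% at 10% coverage.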
fields
cs.CL 2
years
2026 2
representative citing papers
The validity screen classifies LLM confidence signals as Valid, Indeterminate, or Invalid, and these labels predict selective prediction AUROC with Valid models averaging 0.624 and Invalid models 0.357 across 20 LLMs.
citing papers explorer
- Verbal Confidence Saturation in 3-9B Open-Weight Instruction-Tuned LLMs: A Pre-Registered Psychometric Validity Screen
  Seven 3-9B instruction-tuned LLMs produce verbal confidence that saturates at high values and fails psychometric validity criteria for Type-2 discrimination under minimal elicitation.
- Concurrent Criterion Validation of a Validity Screen for LLM Confidence Signals via Selective Prediction
  The validity screen classifies LLM confidence signals as Valid, Indeterminate, or Invalid, and these labels predict selective prediction AUROC with Valid models averaging 0.624 and Invalid models 0.357 across 20 LLMs.