Concurrent Criterion Validation of a Validity Screen for LLM Confidence Signals via Selective Prediction

2026 · cs.CL · arXiv 2604.17716

2 Pith papers cite this work. Polarity classification is still indexing.

abstract

The validity screen (Cacioli, 2026d, 2026e) classifies LLM confidence signals as Valid, Indeterminate, or Invalid. We test whether these classifications predict selective prediction performance. Twenty frontier LLMs from seven families were evaluated on 524 items across six cognitive tracks. Valid models show mean Type 2 AUROC = .624 (SD = .048). Invalid models show mean AUROC = .357 (SD = .231). Cohen's d = 2.81, p = .002. The tiers order monotonically: Invalid (.357) < Indeterminate (.554) < Valid (.624). Split-half cross-validation yields median d = 1.77, P(d > 0) = 1.0 across 1,000 splits. The three-tier classification accounts for 47% of the variance in AUROC. DeepSeek-R1 drops from 85.3% accuracy at full coverage to 11.3% at 10% coverage. The screen predicts the criterion. For selective prediction, the screen matters.
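The two criterion measures named in the abstract, Type 2 AUROC and accuracy at reduced coverage, can be sketched as follows. This is a minimal illustration under the usual definitions (AUROC as the probability that a correct answer receives higher confidence than an incorrect one; coverage as the fraction of most-confident items retained), not the authors' code. The function names and the ten-item toy data are hypothetical and chosen only to show how inverted confidence makes accuracy fall as coverage shrinks.

```python
from typing import Sequence


def type2_auroc(confidence: Sequence[float], correct: Sequence[bool]) -> float:
    """Probability that a correct item outranks an incorrect item on confidence.

    Ties count as half a win. 0.5 means confidence carries no information
    about correctness; values below 0.5 mean confidence is inverted.
    """
    pos = [c for c, ok in zip(confidence, correct) if ok]
    neg = [c for c, ok in zip(confidence, correct) if not ok]
    if not pos or not neg:
        raise ValueError("need at least one correct and one incorrect item")
    wins = 0.0
    for p in pos:
        for n in neg:
            if p > n:
                wins += 1.0
            elif p == n:
                wins += 0.5
    return wins / (len(pos) * len(neg))


def accuracy_at_coverage(
    confidence: Sequence[float], correct: Sequence[bool], coverage: float
) -> float:
    """Accuracy on the `coverage` fraction of items with the highest confidence."""
    if not 0.0 < coverage <= 1.0:
        raise ValueError("coverage must be in (0, 1]")
    k = max(1, round(coverage * len(confidence)))
    # Rank items from most to least confident and keep the top k.
    ranked = sorted(zip(confidence, correct), key=lambda pair: pair[0], reverse=True)
    kept = ranked[:k]
    return sum(1 for _, ok in kept if ok) / k


# Hypothetical toy data: the model is most confident on the items it gets wrong.
confidence = [0.9, 0.8, 0.7, 0.6, 0.5, 0.4, 0.3, 0.2, 0.15, 0.1]
correct = [False, True, False, True, True, True, True, True, True, True]

auroc = type2_auroc(confidence, correct)
full_accuracy = accuracy_at_coverage(confidence, correct, 1.0)
top_fifth_accuracy = accuracy_at_coverage(confidence, correct, 0.2)
print(auroc, full_accuracy, top_fifth_accuracy)
```

On this toy data the AUROC is well below 0.5 and accuracy on the most-confident fifth is lower than accuracy at full coverage, which is the same qualitative pattern the abstract reports for DeepSeek-R1.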

representative citing papers

Verbal Confidence Saturation in 3-9B Open-Weight Instruction-Tuned LLMs: A Pre-Registered Psychometric Validity Screen

cs.CL · 2026-04-24 · conditional · novelty 6.0

Seven 3-9B instruction-tuned LLMs produce verbal confidence that saturates at high values and fails psychometric validity criteria for Type-2 discrimination under minimal elicitation.

Concurrent Criterion Validation of a Validity Screen for LLM Confidence Signals via Selective Prediction

cs.CL · 2026-04-20 · unverdicted · novelty 4.0

The validity screen classifies LLM confidence signals as Valid, Indeterminate, or Invalid, and these labels predict selective prediction AUROC with Valid models averaging 0.624 and Invalid models 0.357 across 20 LLMs.
