Frontier LLMs exhibit consistent domain-specific differences in metacognitive monitoring on the MMLU benchmark, with applied and professional knowledge domains showing the highest monitoring accuracy and formal reasoning and natural science the lowest.
Title resolution pending
2 Pith papers cite this work. Polarity classification is still indexing.
Fields: cs.CL (2)
Years: 2026 (2)
Representative citing papers
Seven 3-9B instruction-tuned LLMs produce verbal confidence that saturates at high values and fails psychometric validity criteria for Type-2 discrimination under minimal elicitation.
citing papers explorer
- Domain-level metacognitive monitoring in frontier LLMs: A 33-model atlas
Frontier LLMs exhibit consistent domain-specific differences in metacognitive monitoring on the MMLU benchmark, with applied and professional knowledge domains showing the highest monitoring accuracy and formal reasoning and natural science the lowest.
- Verbal Confidence Saturation in 3-9B Open-Weight Instruction-Tuned LLMs: A Pre-Registered Psychometric Validity Screen
Seven 3-9B instruction-tuned LLMs produce verbal confidence that saturates at high values and fails psychometric validity criteria for Type-2 discrimination under minimal elicitation.