MIRROR benchmark shows LLMs universally fail at compositional self-prediction and cannot translate partial self-knowledge into better agentic actions, with external metacognitive control reducing confident failures by ~70-76%.
After answering, place a bet on your answer: 1–10 points. If correct, you gain the points. If wrong, you lose them.
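The betting rule quoted above amounts to a symmetric payoff: a correct answer earns the wagered points and a wrong answer forfeits them. The sketch below illustrates that rule only; it is not the benchmark's scoring code. The names `bet_payoff` and `total_score` and the sample rounds are hypothetical.

```python
from typing import Iterable, Tuple


def bet_payoff(correct: bool, bet: int) -> int:
    """Return the points gained (positive) or lost (negative) for one answer.

    Assumes the bet must be an integer from 1 to 10, as in the quoted prompt.
    """
    if not 1 <= bet <= 10:
        raise ValueError("bet must be between 1 and 10 points")
    return bet if correct else -bet


def total_score(rounds: Iterable[Tuple[bool, int]]) -> int:
    """Sum the payoffs over (correct, bet) pairs."""
    return sum(bet_payoff(correct, bet) for correct, bet in rounds)


# Hypothetical rounds: (was the answer correct?, points wagered)
sample_rounds = [(True, 8), (False, 3), (True, 5), (False, 10)]
print(total_score(sample_rounds))  # 8 - 3 + 5 - 10 = 0
```

Under this rule a high bet on a wrong answer costs the most, so a model that knows when it is likely to be wrong should wager less in those cases.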
1 Pith paper cites this work. Polarity classification is still indexing.
Fields: cs.AI (1)
Years: 2026 (1)
Verdicts: UNVERDICTED (1)
Representative citing papers
MIRROR: A Hierarchical Benchmark for Metacognitive Calibration in Large Language Models