Journal of Applied Meteorology, volume=
3 Pith papers cite this work.
citing papers explorer
- Is Capability a Liability? More Capable Language Models Make Worse Forecasts When It Matters Most
  More capable LLMs produce worse distributional forecasts on superlinear growth time series with tail risks of regime change, with the error concentrated in the upper tail; this reverses on conventional threshold metrics.
- TRIAGE: Evaluating Prospective Metacognitive Control in LLMs under Resource Constraints
  TRIAGE evaluates LLMs on prospective metacognitive control by requiring a single plan for task selection, sequencing, and token allocation under a calibrated budget, revealing substantial gaps in current models across math, science, code, and knowledge tasks.
- The Manokhin Probability Matrix: A Diagnostic Framework for Classifier Probability Quality
  A new 2x2 diagnostic matrix classifies probabilistic classifiers into Eagles, Bulls, Sloths, and Moles by calibration and discrimination, with empirical archetype assignments and a proof that post-hoc calibration cannot add discriminatory power.