KellyBench reveals that frontier language models lose money on average when making sequential betting decisions over a full soccer season, with the best model returning -8% and scoring only 26.5% on a human expert rubric for strategy sophistication.
1 Pith paper cites this work. Polarity classification is still indexing.
Fields: cs.AI (1)
Years: 2026 (1)
Verdicts: UNVERDICTED (1)
Representative citing papers:
KellyBench: A Benchmark for Long-Horizon Sequential Decision Making