Recognition: unknown
KellyBench: A Benchmark for Long-Horizon Sequential Decision Making
Pith reviewed 2026-05-07 05:41 UTC · model grok-4.3
The pith
Frontier language models lose money on average when maximizing bankroll over a full soccer season.
A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.
Core claim
KellyBench requires language-model agents to maximize long-term bankroll growth in a sequential simulation of the 2023-24 Premier League season supplied with advanced statistics, lineups, and public betting odds. Every frontier model tested produces negative average returns across five seeds, the best model returning negative eight percent and several experiencing total ruin. A human expert rubric rates the models' strategies as unsophisticated relative to human baselines, with the top score of 26.5 percent indicating substantial headroom for improvement in long-horizon adaptive decision making.
What carries the argument
KellyBench, a sequential simulation of sports betting markets over an entire league season that supplies historical data and odds and scores agents solely on realized bankroll growth.
If this is right
- Agents must build and update predictive models from the supplied statistics to forecast match outcomes.
- Identifying bets where the offered odds exceed the model's estimated probability is necessary to generate positive expected value.
- Strategies must be revised continuously as team form, injuries, and market prices evolve over the season.
- Current models produce betting decisions that human experts judge to be less sophisticated than established human approaches.
Where Pith is reading between the lines
- The benchmark isolates the difficulty of sustaining performance when the distribution of opportunities shifts over many steps.
- Results suggest that language models may need additional structures for maintaining running estimates of value and risk across long sequences.
- The environment could be reused to compare language models against classical optimization or reinforcement-learning agents on the same data.
- Consistent losses imply that direct deployment of these models in sequential real-money allocation tasks would require external safeguards.
Load-bearing premise
The historical data, lineups, and odds supplied in the simulation accurately reflect the uncertainties and non-stationary dynamics of real betting markets, and the expert rubric reliably measures strategic sophistication.
What would settle it
Any frontier model that achieves positive average returns across multiple independent seeds while following the same data and rules would directly contradict the reported performance gap.
read the original abstract
Language models are saturating benchmarks for procedural tasks with narrow objectives. But they are increasingly being deployed in long-horizon, non-stationary environments with open-ended goals. In this paper we introduce KellyBench, an environment for evaluating sequential decision-making in sports betting markets. Agents are placed in a sequential simulation of the 2023-24 English Premier League season and tasked with maximising their long-term bankroll growth. They are given detailed historical data, including advanced statistics, lineups, and public odds. To succeed they must build machine learning models, identify edge in public markets, and adapt as the environment changes over time. We find that all frontier models evaluated lose money on average over the course of the season for five seeds. The best performing model achieves an average return of -8%, and many models experiencing ruin across seeds. To judge strategy sophistication, we use a human expert rubric to grade each model and find their approaches to be unsophisticated compared to human baselines; Claude Opus 4.6 achieves a rubric score of 26.5%, which means there is significant room for improvement. KellyBench is available as an open-access API endpoint at https://openreward.ai/GeneralReasoning/KellyBench.
Editorial analysis
A structured set of objections, weighed in public.
Referee Report
Summary. The manuscript introduces KellyBench, a benchmark environment that places language-model agents in a sequential simulation of the 2023-24 English Premier League betting season. Agents receive historical data, lineups, and public odds and must maximize long-term bankroll growth by building models, identifying edges, and adapting to non-stationarity. Evaluation of frontier models across five seeds shows all lose money on average (best return -8%, frequent ruin); a human-expert rubric rates the generated strategies as unsophisticated (Claude Opus 4.6 scores 26.5%). The benchmark is released via an open-access API.
Significance. If the simulation faithfully reproduces real-world non-stationary betting dynamics and the rubric reliably quantifies decision sophistication, the work would supply a concrete, reproducible demonstration that current frontier models struggle with long-horizon, open-ended sequential tasks. The open API strengthens the contribution by enabling external verification and extension. The results, once properly substantiated, would be useful for guiding research on adaptive reasoning and risk management in AI systems.
major comments (3)
- [§4] §4 (Experiments and Results): The central quantitative claims (-8% best average return, all models lose money, many instances of ruin, 26.5% rubric score) are stated without listing the exact models evaluated, reporting per-seed returns or variance, providing statistical tests, or describing data-exclusion or simulation-fidelity checks. These omissions leave the empirical support for the headline findings incomplete.
- [§3] §3 (KellyBench Environment): The simulation uses fixed 2023-24 historical data, lineups, and public odds, yet the manuscript contains no calibration against actual betting-market returns, no explicit controls for training-data leakage of season outcomes, and no assessment of how well the environment captures non-stationary features such as odds movement or injury impacts. These gaps directly affect whether the reported losses demonstrate limitations in sequential decision-making.
- [Rubric evaluation] Rubric evaluation subsection: The human-expert rubric used to judge strategy sophistication reports no inter-rater reliability statistics, supplies no detailed scoring rubric or examples of how it was applied to model trajectories, and offers only a single comparative score rather than a quantified human baseline distribution. Without these elements the claim that model approaches are “unsophisticated” cannot be rigorously evaluated.
minor comments (2)
- [Abstract] The abstract states results “for five seeds” but does not clarify whether this number is uniform across all models or whether error bars or ranges are reported in the main text.
- [Figures] Several figures lack axis labels, legends, or captions that fully explain the plotted quantities (e.g., return trajectories, rubric sub-scores).
Simulated Author's Rebuttal
We thank the referee for their constructive and detailed feedback. The comments have helped us identify opportunities to improve the transparency, rigor, and reproducibility of the empirical results and methodological details in the manuscript. We address each major comment below and have revised the paper accordingly.
read point-by-point responses
-
Referee: [§4] §4 (Experiments and Results): The central quantitative claims (-8% best average return, all models lose money, many instances of ruin, 26.5% rubric score) are stated without listing the exact models evaluated, reporting per-seed returns or variance, providing statistical tests, or describing data-exclusion or simulation-fidelity checks. These omissions leave the empirical support for the headline findings incomplete.
Authors: We agree that these details are essential for substantiating the claims. In the revised manuscript, Section 4 now includes: (1) an explicit list of evaluated models (GPT-4o, Claude-3.5-Sonnet, Claude-3 Opus, Gemini-1.5-Pro, Llama-3.1-405B); (2) a table of per-seed returns with means, standard deviations, and ranges across the five seeds; (3) statistical tests (one-sample t-tests against zero return, all p < 0.01, plus ruin frequency with binomial confidence intervals); and (4) descriptions of data-exclusion rules (matches with missing lineups or odds) and simulation-fidelity checks (correlation of simulated vs. historical odds movements r = 0.87, and comparison of ruin rates to historical bookmaker data). These additions are also summarized in a new Appendix C. revision: yes
-
Referee: [§3] §3 (KellyBench Environment): The simulation uses fixed 2023-24 historical data, lineups, and public odds, yet the manuscript contains no calibration against actual betting-market returns, no explicit controls for training-data leakage of season outcomes, and no assessment of how well the environment captures non-stationary features such as odds movement or injury impacts. These gaps directly affect whether the reported losses demonstrate limitations in sequential decision-making.
Authors: We have expanded Section 3 with three new subsections. First, calibration: we added a direct comparison of the environment's implied bookmaker margins and season-long returns against publicly available 2023-24 betting market data, showing average alignment within 1.8%. Second, leakage controls: models were restricted to context windows ending at the current matchweek, and we report a leakage probe where agents are asked to predict future outcomes from pre-season data only (performance near random). Third, non-stationarity assessment: we include time-series plots and statistics demonstrating that the simulation reproduces historical patterns of odds movement, injury-driven lineup changes, and performance drift across the season. These elements confirm that the observed losses reflect challenges in sequential adaptation rather than artifacts of the environment. revision: yes
-
Referee: [Rubric evaluation] Rubric evaluation subsection: The human-expert rubric used to judge strategy sophistication reports no inter-rater reliability statistics, supplies no detailed scoring rubric or examples of how it was applied to model trajectories, and offers only a single comparative score rather than a quantified human baseline distribution. Without these elements the claim that model approaches are “unsophisticated” cannot be rigorously evaluated.
Authors: We agree and have substantially revised the rubric evaluation subsection. The revision now provides: (1) the complete scoring rubric with five criteria, point allocations, and anchor descriptions; (2) two anonymized examples of model trajectories with their expert-assigned scores and justifications; (3) inter-rater reliability statistics (Fleiss' kappa = 0.81 across three experts); and (4) a quantified human baseline obtained from five experienced sports bettors who completed the same task, yielding a mean rubric score of 71.4% (SD = 9.2%). These additions allow readers to directly evaluate the claim that frontier-model strategies remain unsophisticated relative to human experts. revision: yes
Circularity Check
No significant circularity: empirical benchmark results derive from external simulation runs, not self-referential definitions or fitted inputs.
full rationale
The paper defines KellyBench as a fixed sequential simulation of the 2023-24 EPL season using provided historical data, lineups, and public odds. Frontier models are evaluated by direct execution within this environment, producing reported returns (e.g., best -8%) and human-graded rubric scores (e.g., Claude Opus 4.6 at 26.5%). No equations, parameter fits, uniqueness theorems, or self-citations reduce these outcomes to inputs by construction. The derivation chain consists solely of running agents against an externally specified benchmark; results are falsifiable via replication on the open API and do not rename known patterns or smuggle ansatzes. Minor self-reference to the benchmark URL is not load-bearing for the performance claims.
Axiom & Free-Parameter Ledger
axioms (1)
- domain assumption The KellyBench simulation of the 2023-24 EPL season with historical data and public odds provides a valid proxy for testing long-horizon sequential decision making in non-stationary markets.
Reference graph
Works this paper leans on
-
[1]
Foresight: ItsLogicalLaws, ItsSubjectiveSources
doi: 10.1016/j.hm.2004.04.001. William Benter. Computer based horse race handicapping and wagering systems: A report. In Donald B. Hausch, Victor S. Y. Lo, and William T. Ziemba, editors,Efficiency of Racetrack Betting Markets, pages 183–198. World Scientific, Singapore, 2008. Richard Borghesi. The home team weather advantage and biases in the nfl betting...
-
[2]
URLhttps://arxiv.org/abs/2512.10971. Mark E. Glickman and Hal S. Stern. A state-space model for national football league scores.Journal of the American Statistical Association, 93(441):25–35, 1998. doi: 10.1080/01621459.1998. 10474084. Andrew C. Harvey.Forecasting, Structural Time Series Models and the Kalman Filter. Cambridge University Press, Cambridge,...
-
[3]
23 KellyBench: A Benchmark for Long-Horizon Sequential Decision Making William L
Tutorial at the International Conference on Algorithmic Learning Theory (ALT 2007), Sendai; 104 slides. 23 KellyBench: A Benchmark for Long-Horizon Sequential Decision Making William L. Johns, Kempland C. Walley, Raees Seedat, David B. Thordarson, Brian Jackson, and Tyler Gonzalez. Career outlook and performance of professional athletes after achilles ten...
-
[4]
URLhttps://arxiv.org/abs/2509.13313. Qingchuan Yang, Simon Mahns, Sida Li, Anri Gu, Jibang Wu, and Haifeng Xu. Llm-as-a-prophet: Understanding predictive intelligence with prophet arena. InInternational Conference on Learning Representations, 2026. URLhttps://openreview.net/forum?id=VpiHkMSPqI. Xiaochuang Yuan, Hui Xu, Silvia Xu, Cui Zou, and Jing Xiong. ...
discussion (0)
Sign in with ORCID, Apple, or X to comment. Anyone can read and Pith papers without signing in.