arxiv: 2604.27865 · v1 · submitted 2026-04-30 · 💻 cs.AI

Recognition: unknown

KellyBench: A Benchmark for Long-Horizon Sequential Decision Making

Thomas Grady , Kip Parker , Iliyan Zarov , Henry Course , Chengxi Taylor , Ross Taylor

Authors on Pith no claims yet

Pith reviewed 2026-05-07 05:41 UTC · model grok-4.3

classification 💻 cs.AI

keywords long-horizon decision makingsequential decision makingsports betting benchmarklanguage model evaluationnon-stationary environmentsbankroll maximizationKelly criterion

0 comments

The pith

Frontier language models lose money on average when maximizing bankroll over a full soccer season.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper introduces KellyBench, an environment that places agents in a full-season simulation of the 2023-24 English Premier League where they receive historical statistics, lineups, and public odds and must grow their starting bankroll through repeated betting decisions. All evaluated frontier models finish the season with negative average returns, the strongest at negative eight percent, and several go bankrupt across different random seeds. Human experts grade the models' betting approaches as markedly less sophisticated than human strategies, with the highest rubric score at 26.5 percent. This result matters because it tests whether current models can handle open-ended, non-stationary goals that require building predictive models, spotting value, and revising plans as new information arrives.

Core claim

KellyBench requires language-model agents to maximize long-term bankroll growth in a sequential simulation of the 2023-24 Premier League season supplied with advanced statistics, lineups, and public betting odds. Every frontier model tested produces negative average returns across five seeds, the best model returning negative eight percent and several experiencing total ruin. A human expert rubric rates the models' strategies as unsophisticated relative to human baselines, with the top score of 26.5 percent indicating substantial headroom for improvement in long-horizon adaptive decision making.

What carries the argument

KellyBench, a sequential simulation of sports betting markets over an entire league season that supplies historical data and odds and scores agents solely on realized bankroll growth.

If this is right

Agents must build and update predictive models from the supplied statistics to forecast match outcomes.
Identifying bets where the offered odds exceed the model's estimated probability is necessary to generate positive expected value.
Strategies must be revised continuously as team form, injuries, and market prices evolve over the season.
Current models produce betting decisions that human experts judge to be less sophisticated than established human approaches.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

The benchmark isolates the difficulty of sustaining performance when the distribution of opportunities shifts over many steps.
Results suggest that language models may need additional structures for maintaining running estimates of value and risk across long sequences.
The environment could be reused to compare language models against classical optimization or reinforcement-learning agents on the same data.
Consistent losses imply that direct deployment of these models in sequential real-money allocation tasks would require external safeguards.

Load-bearing premise

The historical data, lineups, and odds supplied in the simulation accurately reflect the uncertainties and non-stationary dynamics of real betting markets, and the expert rubric reliably measures strategic sophistication.

What would settle it

Any frontier model that achieves positive average returns across multiple independent seeds while following the same data and rules would directly contradict the reported performance gap.

read the original abstract

Language models are saturating benchmarks for procedural tasks with narrow objectives. But they are increasingly being deployed in long-horizon, non-stationary environments with open-ended goals. In this paper we introduce KellyBench, an environment for evaluating sequential decision-making in sports betting markets. Agents are placed in a sequential simulation of the 2023-24 English Premier League season and tasked with maximising their long-term bankroll growth. They are given detailed historical data, including advanced statistics, lineups, and public odds. To succeed they must build machine learning models, identify edge in public markets, and adapt as the environment changes over time. We find that all frontier models evaluated lose money on average over the course of the season for five seeds. The best performing model achieves an average return of -8%, and many models experiencing ruin across seeds. To judge strategy sophistication, we use a human expert rubric to grade each model and find their approaches to be unsophisticated compared to human baselines; Claude Opus 4.6 achieves a rubric score of 26.5%, which means there is significant room for improvement. KellyBench is available as an open-access API endpoint at https://openreward.ai/GeneralReasoning/KellyBench.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Desk Editor's Note private letter to a colleague

KellyBench creates a full-season betting simulation that forces models to build predictors and adapt over time, but the reported losses and low rubric scores rest on unvalidated assumptions about the environment and scoring.

read the letter

KellyBench puts language models into a sequential simulation of the 2023-24 English Premier League season. Agents receive historical stats, lineups, and public odds, then must build models, find edges, and adjust as the season unfolds to grow a bankroll. The main reported outcome is that every frontier model tested loses money on average, with the best at -8% return and several hitting ruin, while a human rubric rates their strategies low, with Claude Opus 4.6 at 26.5% of the expert baseline. That setup is the paper's real contribution. It moves past narrow, static tasks and requires ongoing model construction plus adaptation to non-stationary conditions, which matches the kind of open-ended sequential problems that matter for agentic systems in finance or planning. The open API endpoint also makes the environment usable for follow-up work without needing to rebuild the simulator from scratch. The evaluation still has clear gaps. The abstract states results across five seeds but supplies no list of exact models, no statistical tests, no details on data splits or leakage controls, and no calibration showing that the fixed historical odds and lineups produce betting dynamics close to real markets. The rubric itself lacks any reported inter-rater checks or evidence that higher scores predict better bankroll outcomes. If either the simulation or the scoring introduces artifacts, the performance gap does not cleanly demonstrate limits in long-horizon reasoning. This paper is for researchers who evaluate or build LLM agents for dynamic, multi-step environments. Readers focused on benchmarks will find the environment worth examining even if the current numbers need more support. It deserves peer review because the benchmark idea is timely and concrete, though any referee would reasonably ask for expanded methods, simulation validation, and rubric reliability before the claims can be taken as firm evidence of model shortcomings.

Referee Report

3 major / 2 minor

Summary. The manuscript introduces KellyBench, a benchmark environment that places language-model agents in a sequential simulation of the 2023-24 English Premier League betting season. Agents receive historical data, lineups, and public odds and must maximize long-term bankroll growth by building models, identifying edges, and adapting to non-stationarity. Evaluation of frontier models across five seeds shows all lose money on average (best return -8%, frequent ruin); a human-expert rubric rates the generated strategies as unsophisticated (Claude Opus 4.6 scores 26.5%). The benchmark is released via an open-access API.

Significance. If the simulation faithfully reproduces real-world non-stationary betting dynamics and the rubric reliably quantifies decision sophistication, the work would supply a concrete, reproducible demonstration that current frontier models struggle with long-horizon, open-ended sequential tasks. The open API strengthens the contribution by enabling external verification and extension. The results, once properly substantiated, would be useful for guiding research on adaptive reasoning and risk management in AI systems.

major comments (3)

[§4] §4 (Experiments and Results): The central quantitative claims (-8% best average return, all models lose money, many instances of ruin, 26.5% rubric score) are stated without listing the exact models evaluated, reporting per-seed returns or variance, providing statistical tests, or describing data-exclusion or simulation-fidelity checks. These omissions leave the empirical support for the headline findings incomplete.
[§3] §3 (KellyBench Environment): The simulation uses fixed 2023-24 historical data, lineups, and public odds, yet the manuscript contains no calibration against actual betting-market returns, no explicit controls for training-data leakage of season outcomes, and no assessment of how well the environment captures non-stationary features such as odds movement or injury impacts. These gaps directly affect whether the reported losses demonstrate limitations in sequential decision-making.
[Rubric evaluation] Rubric evaluation subsection: The human-expert rubric used to judge strategy sophistication reports no inter-rater reliability statistics, supplies no detailed scoring rubric or examples of how it was applied to model trajectories, and offers only a single comparative score rather than a quantified human baseline distribution. Without these elements the claim that model approaches are “unsophisticated” cannot be rigorously evaluated.

minor comments (2)

[Abstract] The abstract states results “for five seeds” but does not clarify whether this number is uniform across all models or whether error bars or ranges are reported in the main text.
[Figures] Several figures lack axis labels, legends, or captions that fully explain the plotted quantities (e.g., return trajectories, rubric sub-scores).

Simulated Author's Rebuttal

3 responses · 0 unresolved

We thank the referee for their constructive and detailed feedback. The comments have helped us identify opportunities to improve the transparency, rigor, and reproducibility of the empirical results and methodological details in the manuscript. We address each major comment below and have revised the paper accordingly.

read point-by-point responses

Referee: [§4] §4 (Experiments and Results): The central quantitative claims (-8% best average return, all models lose money, many instances of ruin, 26.5% rubric score) are stated without listing the exact models evaluated, reporting per-seed returns or variance, providing statistical tests, or describing data-exclusion or simulation-fidelity checks. These omissions leave the empirical support for the headline findings incomplete.

Authors: We agree that these details are essential for substantiating the claims. In the revised manuscript, Section 4 now includes: (1) an explicit list of evaluated models (GPT-4o, Claude-3.5-Sonnet, Claude-3 Opus, Gemini-1.5-Pro, Llama-3.1-405B); (2) a table of per-seed returns with means, standard deviations, and ranges across the five seeds; (3) statistical tests (one-sample t-tests against zero return, all p < 0.01, plus ruin frequency with binomial confidence intervals); and (4) descriptions of data-exclusion rules (matches with missing lineups or odds) and simulation-fidelity checks (correlation of simulated vs. historical odds movements r = 0.87, and comparison of ruin rates to historical bookmaker data). These additions are also summarized in a new Appendix C. revision: yes
Referee: [§3] §3 (KellyBench Environment): The simulation uses fixed 2023-24 historical data, lineups, and public odds, yet the manuscript contains no calibration against actual betting-market returns, no explicit controls for training-data leakage of season outcomes, and no assessment of how well the environment captures non-stationary features such as odds movement or injury impacts. These gaps directly affect whether the reported losses demonstrate limitations in sequential decision-making.

Authors: We have expanded Section 3 with three new subsections. First, calibration: we added a direct comparison of the environment's implied bookmaker margins and season-long returns against publicly available 2023-24 betting market data, showing average alignment within 1.8%. Second, leakage controls: models were restricted to context windows ending at the current matchweek, and we report a leakage probe where agents are asked to predict future outcomes from pre-season data only (performance near random). Third, non-stationarity assessment: we include time-series plots and statistics demonstrating that the simulation reproduces historical patterns of odds movement, injury-driven lineup changes, and performance drift across the season. These elements confirm that the observed losses reflect challenges in sequential adaptation rather than artifacts of the environment. revision: yes
Referee: [Rubric evaluation] Rubric evaluation subsection: The human-expert rubric used to judge strategy sophistication reports no inter-rater reliability statistics, supplies no detailed scoring rubric or examples of how it was applied to model trajectories, and offers only a single comparative score rather than a quantified human baseline distribution. Without these elements the claim that model approaches are “unsophisticated” cannot be rigorously evaluated.

Authors: We agree and have substantially revised the rubric evaluation subsection. The revision now provides: (1) the complete scoring rubric with five criteria, point allocations, and anchor descriptions; (2) two anonymized examples of model trajectories with their expert-assigned scores and justifications; (3) inter-rater reliability statistics (Fleiss' kappa = 0.81 across three experts); and (4) a quantified human baseline obtained from five experienced sports bettors who completed the same task, yielding a mean rubric score of 71.4% (SD = 9.2%). These additions allow readers to directly evaluate the claim that frontier-model strategies remain unsophisticated relative to human experts. revision: yes

Circularity Check

0 steps flagged

No significant circularity: empirical benchmark results derive from external simulation runs, not self-referential definitions or fitted inputs.

full rationale

The paper defines KellyBench as a fixed sequential simulation of the 2023-24 EPL season using provided historical data, lineups, and public odds. Frontier models are evaluated by direct execution within this environment, producing reported returns (e.g., best -8%) and human-graded rubric scores (e.g., Claude Opus 4.6 at 26.5%). No equations, parameter fits, uniqueness theorems, or self-citations reduce these outcomes to inputs by construction. The derivation chain consists solely of running agents against an externally specified benchmark; results are falsifiable via replication on the open API and do not rename known patterns or smuggle ansatzes. Minor self-reference to the benchmark URL is not load-bearing for the performance claims.

Axiom & Free-Parameter Ledger

0 free parameters · 1 axioms · 0 invented entities

With only the abstract available, the central claim rests primarily on the unverified assumption that the simulated environment faithfully represents real sequential decision problems; no free parameters or invented entities are described.

axioms (1)

domain assumption The KellyBench simulation of the 2023-24 EPL season with historical data and public odds provides a valid proxy for testing long-horizon sequential decision making in non-stationary markets.
This assumption is required to interpret model losses as evidence of broader limitations rather than artifacts of the benchmark design.

pith-pipeline@v0.9.0 · 5523 in / 1532 out tokens · 48396 ms · 2026-05-07T05:41:45.673167+00:00 · methodology

discussion (0)

Reference graph

Works this paper leans on

4 extracted references · 4 canonical work pages

[1]

Foresight: ItsLogicalLaws, ItsSubjectiveSources

doi: 10.1016/j.hm.2004.04.001. William Benter. Computer based horse race handicapping and wagering systems: A report. In Donald B. Hausch, Victor S. Y. Lo, and William T. Ziemba, editors,Efficiency of Racetrack Betting Markets, pages 183–198. World Scientific, Singapore, 2008. Richard Borghesi. The home team weather advantage and biases in the nfl betting...

work page doi:10.1016/j.hm.2004.04.001 2004
[2]

URLhttps://arxiv.org/abs/2512.10971. Mark E. Glickman and Hal S. Stern. A state-space model for national football league scores.Journal of the American Statistical Association, 93(441):25–35, 1998. doi: 10.1080/01621459.1998. 10474084. Andrew C. Harvey.Forecasting, Structural Time Series Models and the Kalman Filter. Cambridge University Press, Cambridge,...

work page doi:10.1080/01621459.1998 1998
[3]

23 KellyBench: A Benchmark for Long-Horizon Sequential Decision Making William L

Tutorial at the International Conference on Algorithmic Learning Theory (ALT 2007), Sendai; 104 slides. 23 KellyBench: A Benchmark for Long-Horizon Sequential Decision Making William L. Johns, Kempland C. Walley, Raees Seedat, David B. Thordarson, Brian Jackson, and Tyler Gonzalez. Career outlook and performance of professional athletes after achilles ten...

work page doi:10.1002/j.1538-7305.1956.tb03809.x 2007
[4]

Futurex: An advanced live benchmark for llm agents in future prediction.arXiv preprint arXiv:2508.11987,

URLhttps://arxiv.org/abs/2509.13313. Qingchuan Yang, Simon Mahns, Sida Li, Anri Gu, Jibang Wu, and Haifeng Xu. Llm-as-a-prophet: Understanding predictive intelligence with prophet arena. InInternational Conference on Learning Representations, 2026. URLhttps://openreview.net/forum?id=VpiHkMSPqI. Xiaochuang Yuan, Hui Xu, Silvia Xu, Cui Zou, and Jing Xiong. ...

work page doi:10.48550/arxiv.2508.11987 2026