pith. machine review for the scientific record.

arxiv: 2604.14199 · v1 · submitted 2026-04-03 · 💱 q-fin.CP · cs.AI · cs.LG


PolyBench: Benchmarking LLM Forecasting and Trading Capabilities on Live Prediction Market Data


Pith reviewed 2026-05-13 19:13 UTC · model grok-4.3

classification 💱 q-fin.CP · cs.AI · cs.LG
keywords LLM benchmarking · prediction markets · financial forecasting · Polymarket · order book simulation · Confidence-Weighted Return · multimodal evaluation · trading performance

The pith

Only two of seven LLMs achieve positive returns on live prediction market data while five lose money despite high confidence.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper presents PolyBench, a benchmark built from 38,666 real binary prediction markets on Polymarket that pairs each market snapshot with live order-book states and news streams. It runs seven large language models through 36,165 timestamp-locked forecasts and measures performance not only by directional accuracy but by simulated trading profits using realistic order-book execution. Only MiMo-V2-Flash and Gemini-3-Flash post positive Confidence-Weighted Returns of 17.6 percent and 6.2 percent; the other five models lose money even though they express uniformly high confidence in their predictions. A sympathetic reader cares because the results expose a gap between fluent language output and the ability to combine qualitative news with quantitative market signals into profitable decisions under time pressure. If the findings hold, they establish a new, contamination-resistant standard for testing whether LLMs can reason probabilistically in live financial environments.

Core claim

PolyBench records point-in-time cross-sections of 38,666 binary prediction markets together with Central Limit Order Book states and real-time news, then evaluates seven state-of-the-art LLMs on 36,165 predictions generated under identical market conditions; the results show that only MiMo-V2-Flash and Gemini-3-Flash produce positive financial returns via the proposed Confidence-Weighted Return metric while the remaining five models incur losses despite uniformly high stated confidence.

What carries the argument

PolyBench, a multimodal benchmark that synchronously couples live prediction-market snapshots with order-book dynamics and news streams and evaluates forecasts through realistic order-book execution simulation to compute directional accuracy, Confidence-Weighted Return, APY, and Sharpe ratio.

If this is right

  • Directional accuracy alone is insufficient to certify an LLM as a capable forecaster; profitability under execution simulation provides a stricter test.
  • High self-reported confidence in LLMs does not reliably translate into positive trading outcomes in live markets.
  • Multimodal inputs that combine news and order-book data are required to expose gaps hidden by language-only benchmarks.
  • The observed performance split between two profitable models and five losing ones indicates that current LLM training leaves most systems poorly calibrated for real-time financial uncertainty.
  • PolyBench supplies a timestamp-locked, financially grounded dataset that future work can use to track progress without contamination from static training data.
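
The first bullet can be illustrated with a toy calculation. The sketch below is not taken from the paper: the function name `expected_profit_per_share` and all numbers are hypothetical, and fees and slippage are ignored. It assumes a binary share that pays 1.0 when the forecast side resolves true, so the expected profit per share is the hit rate minus the entry price.

```python
def expected_profit_per_share(accuracy: float, entry_price: float) -> float:
    """Expected profit of buying one binary share at entry_price.

    The share pays 1.0 if the forecast side resolves true and 0.0 otherwise,
    so the expected payoff equals the forecaster's hit rate.
    Fees and slippage are ignored in this toy model (assumption).
    """
    if not (0.0 < entry_price < 1.0):
        raise ValueError("entry_price must be strictly between 0 and 1")
    return accuracy - entry_price


# Hypothetical forecaster: right 80% of the time, but buys favourites at 0.85.
favourite_buyer = expected_profit_per_share(accuracy=0.80, entry_price=0.85)

# Hypothetical forecaster: right only 40% of the time, but buys long shots at 0.30.
longshot_buyer = expected_profit_per_share(accuracy=0.40, entry_price=0.30)

print(favourite_buyer < 0 < longshot_buyer)  # the more accurate forecaster loses money
```

Under these assumptions, the more accurate forecaster has negative expected profit because the market price already exceeds its hit rate, which is why accuracy and profitability can rank models differently.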

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

  • Developers may need to add explicit loss signals from simulated or real trades during fine-tuning to improve confidence calibration in uncertain environments.
  • Extending the benchmark to longer holding periods or multi-outcome markets could test whether the current performance gap persists beyond short binary resolutions.
  • Platforms that host prediction markets might incorporate similar live benchmarks before deploying LLM-assisted trading tools to limit user losses.
  • The divergence suggests that general-purpose scaling alone may not close the gap; targeted training on order-book dynamics and news integration could be necessary for most models.

Load-bearing premise

The order-book execution simulation accurately captures real-world slippage, liquidity, and fees without introducing artifacts that favor or penalize particular models.
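
To make this premise concrete, here is a minimal sketch of depth consumption against a recorded snapshot. It is an illustration under stated assumptions, not the authors' simulator: the function `simulate_market_buy`, the flat `fee_rate` parameter, and the sample ask levels are all hypothetical, and latency is ignored.

```python
from typing import List, Tuple


def simulate_market_buy(asks: List[Tuple[float, float]], budget: float,
                        fee_rate: float = 0.0) -> Tuple[float, float]:
    """Spend up to `budget` dollars walking the ask side of one snapshot.

    asks: (price, size_in_shares) levels, in any order.
    fee_rate: hypothetical proportional taker fee (assumption, default 0).
    Returns (shares_bought, average_price_paid_including_fee).
    If the book is too thin, the remaining budget stays unspent (partial fill).
    """
    spendable = budget / (1.0 + fee_rate)  # reserve part of the budget for fees
    shares = 0.0
    spent = 0.0
    for price, size in sorted(asks):  # cheapest level first
        remaining = spendable - spent
        if remaining <= 1e-12:
            break
        take = min(size, remaining / price)
        shares += take
        spent += take * price
    if shares == 0.0:
        return 0.0, 0.0
    return shares, spent * (1.0 + fee_rate) / shares


# Hypothetical snapshot: thin top of book, deeper but worse levels behind it.
asks = [(0.40, 50.0), (0.45, 100.0), (0.60, 500.0)]
small_shares, small_avg = simulate_market_buy(asks, budget=10.0)
large_shares, large_avg = simulate_market_buy(asks, budget=100.0)
print(round(small_avg, 4), round(large_avg, 4))  # 0.4 0.48
```

In this toy book a $10 order fills entirely at the best ask, while a $100 order pays about 20% more per share on average. Any choice the simulator makes about depth, fees, or partial fills moves this average entry price, and therefore the simulated return.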

What would settle it

Running the identical model predictions as actual trades with real capital on Polymarket and comparing the realized profits or losses against the benchmark's simulated CWR values would confirm or refute the performance rankings.

Figures

Figures reproduced from arXiv: 2604.14199 by Juncheng Liu, Pu Cheng, Yunshen Long.

Figure 1. Confidence-Weighted Return (CWR) timeline reflecting empirical portfolio evolution. Annotated markers denote outsized individual trade returns resulting from correct, high-confidence predictions on low-probability events. Only MiMo-V2-Flash and Gemini-3-Flash sustain positive trajectories, isolating predictive alpha from market noise.
Figure 2. The four-stage PolyBench construction pipeline: (1) market collection via the Polymarket Gamma API, (2) multi-modal fetch of news and order-book snapshots, (3) LLM batch analysis, and (4) ground-truth resolution matching.
Figure 3. Impact of base lot size (L) on Confidence-Weighted Return (CWR) for MiMo-V2-Flash and Gemini-3-Flash. As the investment budget scales from $10 to $1,000, algorithmic execution slippage against the limited top levels of the historical order book rapidly decays theoretical alpha.
Figure 4. Dual-metric radar chart comparing empirical Confidence-Weighted Return (CWR) and Average Declared Confidence across eight event domains. The disparity highlights LLM miscalibration; models uniformly maintain high confidence (c ≥ 0.8) across all domains, yielding severe negative returns in volatile sectors such as Crypto.
original abstract

Predicting real-world events from live market signals demands systems that fuse qualitative news with quantitative order-book dynamics under strict temporal discipline, a challenge existing benchmarks fail to capture. We present PolyBench, a multimodal benchmark derived from Polymarket that records point-in-time cross-sections of 38,666 binary prediction markets spanning 4,997 events, synchronously coupling each snapshot with a Central Limit Order Book (CLOB) state and a real-time news stream. Using PolyBench, we evaluate seven state-of-the-art Large Language Models, spanning open- and closed-source families, generating 36,165 predictions under identical, timestamp-locked market states collected between February 6 and 12, 2026. Our multidimensional framework assesses directional accuracy, our proposed Confidence-Weighted Return (CWR), Annualized Percentage Yield (APY), and Sharpe ratio via realistic order-book execution simulation. The results reveal a pronounced performance divergence: only two of seven models achieve positive financial returns (MiMo-V2-Flash at 17.6% CWR and Gemini-3-Flash at 6.2% CWR), while the remaining five incur losses despite uniformly high stated confidence. These findings highlight the gap between surface-level language fluency and genuine probabilistic reasoning under live market uncertainty, and establish PolyBench as a contamination-proof, financially-grounded evaluation standard for future LLM research. Our dataset and code are available at https://github.com/PolyBench/PolyBench.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Referee Report

2 major / 2 minor

Summary. The paper introduces PolyBench, a multimodal benchmark derived from 38,666 Polymarket binary prediction markets that synchronously couples timestamp-locked CLOB states and news streams. It evaluates seven LLMs on 36,165 predictions collected over one week, reporting directional accuracy, Confidence-Weighted Return (CWR), APY, and Sharpe ratio computed via order-book execution simulation. The central finding is that only MiMo-V2-Flash (17.6% CWR) and Gemini-3-Flash (6.2% CWR) achieve positive financial returns while the remaining five models incur losses despite uniformly high stated confidence; the work positions PolyBench as a contamination-proof, financially grounded evaluation standard and releases the dataset and code.

Significance. If the execution simulation is shown to be faithful to Polymarket's actual taker/maker fees, depth-based slippage, and partial-fill mechanics, the results would establish a valuable live-market benchmark that directly links LLM outputs to realizable P&L. The timestamp-locked design and public release of data/code are clear strengths that enable reproducible, contamination-resistant evaluation of forecasting and trading capabilities.

major comments (2)
  1. [Methods (order-book execution simulation)] The order-book execution simulation (described in the methods section on CWR/APY computation) is load-bearing for all financial-return claims, yet the manuscript provides no explicit algorithm, parameters, or validation for (i) mapping stated confidence to position size, (ii) depth consumption and slippage from the recorded CLOB, (iii) fee structure matching Polymarket's taker/maker schedule, or (iv) latency and partial-fill handling. Without these details the sign of the reported CWR gap (17.6% vs. negative) cannot be verified.
  2. [Results and Evaluation Framework] The one-week collection window (February 6–12, 2026) and the 36,165-prediction sample are used to support annualized metrics (APY, Sharpe); the paper does not report robustness checks for this short horizon or for potential post-hoc filtering of markets, both of which directly affect the headline claim that five of seven models lose money.
minor comments (2)
  1. [Tables/Figures] Table and figure captions should explicitly state the exact number of markets and predictions retained after any filtering, and whether the CLOB snapshots include full depth or only top-of-book.
  2. [Abstract] The abstract states 'February 6 and 12, 2026'; confirm this is the intended future-looking collection period or correct the year if it is a typographical error.

Simulated Author's Rebuttal

2 responses · 0 unresolved

We thank the referee for the careful and constructive review. The two major comments identify important gaps in methodological transparency and robustness analysis. We address each point below and will incorporate the requested details and checks in the revised manuscript.

point-by-point responses
  1. Referee: [Methods (order-book execution simulation)] The order-book execution simulation (described in the methods section on CWR/APY computation) is load-bearing for all financial-return claims, yet the manuscript provides no explicit algorithm, parameters, or validation for (i) mapping stated confidence to position size, (ii) depth consumption and slippage from the recorded CLOB, (iii) fee structure matching Polymarket's taker/maker schedule, or (iv) latency and partial-fill handling. Without these details the sign of the reported CWR gap (17.6% vs. negative) cannot be verified.

    Authors: We agree that the current description of the execution simulation is insufficient for independent verification. In the revised manuscript we will add a dedicated subsection 'Order-Book Execution Simulation' that supplies: (i) the exact linear mapping from model confidence to position size together with the scaling parameter, (ii) the depth-consumption and slippage model applied to each recorded CLOB snapshot, (iii) the precise taker/maker fee schedule used to match Polymarket's rules, and (iv) the latency (zero-latency benchmark) and partial-fill handling logic. We will also include pseudocode and a short validation example against historical trade data. These additions will allow direct reproduction and sign-checking of the reported CWR values. revision: yes

  2. Referee: [Results and Evaluation Framework] The one-week collection window (February 6–12, 2026) and the 36,165-prediction sample are used to support annualized metrics (APY, Sharpe); the paper does not report robustness checks for this short horizon or for potential post-hoc filtering of markets, both of which directly affect the headline claim that five of seven models lose money.

    Authors: We acknowledge that the short one-week horizon limits the strength of annualized claims and that explicit robustness checks are warranted. In revision we will add: (a) bootstrapped confidence intervals for APY and Sharpe ratios, (b) sensitivity tables showing how results change when the collection window is shifted or shortened, and (c) an explicit statement confirming that no post-hoc market filtering occurred beyond the pre-specified timestamp-locking rule. We will also report the raw one-week cumulative returns alongside the annualized figures so readers can assess the headline claim in context. revision: yes
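
The bootstrapped intervals promised in response 2(a) could look roughly like the following. This is a sketch under stated assumptions, not the authors' procedure: `sharpe`, `bootstrap_ci`, and the sample per-trade returns are hypothetical, the Sharpe ratio is left unannualized, and trades are resampled as if independent, which ignores correlation between markets tied to the same event.

```python
import random
import statistics
from typing import Callable, List, Tuple


def sharpe(returns: List[float]) -> float:
    """Unannualized Sharpe ratio: mean return over sample standard deviation."""
    sd = statistics.stdev(returns)
    return statistics.mean(returns) / sd if sd > 0 else 0.0


def bootstrap_ci(returns: List[float], stat: Callable[[List[float]], float],
                 n_resamples: int = 2000, alpha: float = 0.05,
                 seed: int = 0) -> Tuple[float, float]:
    """Percentile bootstrap interval for `stat`, resampling trades with replacement."""
    rng = random.Random(seed)  # fixed seed so the sketch is reproducible
    n = len(returns)
    estimates = sorted(
        stat([rng.choice(returns) for _ in range(n)])
        for _ in range(n_resamples)
    )
    low = estimates[int((alpha / 2) * n_resamples)]
    high = estimates[int((1 - alpha / 2) * n_resamples) - 1]
    return low, high


# Hypothetical per-trade returns: 80 wins of +30% and 20 total losses.
returns = [0.30] * 80 + [-1.0] * 20
point = sharpe(returns)
ci_low, ci_high = bootstrap_ci(returns, sharpe)
print(f"Sharpe {point:.3f}, 95% CI [{ci_low:.3f}, {ci_high:.3f}]")
```

An interval that includes zero would indicate that the sign of a model's return is not established by the sample, which is the concern the referee raises about a one-week window.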

Circularity Check

0 steps flagged

No significant circularity; metrics computed from external market data

full rationale

The paper computes CWR, APY, and Sharpe ratio by feeding LLM predictions into an order-book execution simulator driven by timestamp-locked Polymarket CLOB states and real resolutions. No equation in the provided text reduces these outputs to a fitted parameter defined inside the paper, nor does any self-citation chain or ansatz serve as the load-bearing justification for the headline performance gap. The simulation is an external modeling step whose fidelity is an assumption rather than a definitional identity, matching the reader's assessment of score 2.0 with no self-definitional or fitted-input reductions.

Axiom & Free-Parameter Ledger

0 free parameters · 1 axioms · 0 invented entities

The central claim rests on the assumption that the collected market snapshots are free of future leakage and that the execution simulator faithfully reproduces real trading costs; no free parameters are introduced in the abstract, no new entities are postulated, and background assumptions are standard market-microstructure facts.

axioms (1)
  • domain assumption: Market outcomes are independent of the models' predictions and can be used as ground truth.
    Invoked when converting model forecasts into simulated P&L against realized event resolutions.

pith-pipeline@v0.9.0 · 5590 in / 1276 out tokens · 43184 ms · 2026-05-13T19:13:40.313755+00:00 · methodology


Forward citations

Cited by 1 Pith paper

Reviewed papers in the Pith corpus that reference this work. Sorted by Pith novelty score.

  1. Agentic Forecasting using Sequential Bayesian Updating of Linguistic Beliefs

    cs.AI 2026-04 unverdicted novelty 6.0

    BLF achieves state-of-the-art binary forecasting on ForecastBench by using linguistic belief states updated in tool-use loops, hierarchical multi-trial logit averaging, and hierarchical Platt scaling calibration.
