LiveTradeBench: Seeking real-world alpha with large language models. arXiv preprint arXiv:2511.03628
6 Pith papers cite this work. Polarity classification is still indexing.
years: 2026 (6)
citing papers explorer
- PolyBench: Benchmarking LLM Forecasting and Trading Capabilities on Live Prediction Market Data
  Only two of seven LLMs produce positive returns on live Polymarket data, with MiMo-V2-Flash at 17.6% CWR and Gemini-3-Flash at 6.2% CWR while the other five lose money.
- CLQT: A Closed-Loop, Cost-Aware, Strategy-Consistent Benchmark for Diagnostic Evaluation of LLM Portfolio-Management Agents
  CLQT is a new closed-loop, cost-aware benchmark that diagnoses LLM trading agent capabilities through strategy-consistent metrics and hash-verifiable trails rather than outcome rankings.
- Herculean: An Agentic Benchmark for Financial Intelligence
  Herculean benchmark shows frontier agents handle trading and market insights better than hedging and auditing workflows that demand state consistency and structured verification.
- Diverse Evidence, Better Forecasts: Multi-Agent Deliberation Under Information Asymmetry
  InfoDelphi partitions evidence to induce information asymmetry in multi-agent LLM deliberation, yielding 12-18% Brier score gains and 4-8 pp accuracy gains on a 375-question benchmark.
- LATTICE: Evaluating Decision Support Utility of Crypto Agents
  LATTICE is a scalable LLM-judge benchmark for crypto agent decision support that reveals performance trade-offs among real-world copilots across dimensions and tasks.
- SysTradeBench: An Iterative Build-Test-Patch Benchmark for Strategy-to-Code Trading Systems with Drift-Aware Diagnostics
  SysTradeBench evaluates 17 LLMs on 12 trading strategies, finding over 91.7% code validity but rapid convergence in iterative fixes and a continued need for human oversight on critical strategies.