Livetradebench: Seeking real-world alpha with large language models. arXiv preprint arXiv:2511.03628

Livetradebench: Seeking real-world alpha with large language models · 2025 · arXiv 2511.03628

6 Pith papers cite this work. Polarity classification is still indexing.

citation-role summary

background: 1 · baseline: 1

citation-polarity summary

background: 1 · baseline: 1

representative citing papers

PolyBench: Benchmarking LLM Forecasting and Trading Capabilities on Live Prediction Market Data

q-fin.CP · 2026-04-03 · conditional · novelty 8.0

Only two of seven LLMs produce positive returns on live Polymarket data, with MiMo-V2-Flash at 17.6% CWR and Gemini-3-Flash at 6.2% CWR while the other five lose money.

CLQT: A Closed-Loop, Cost-Aware, Strategy-Consistent Benchmark for Diagnostic Evaluation of LLM Portfolio-Management Agents

cs.AI · 2026-06-29 · unverdicted · novelty 7.0

CLQT is a new closed-loop, cost-aware benchmark that diagnoses LLM trading agent capabilities through strategy-consistent metrics and hash-verifiable trails rather than outcome rankings.

Herculean: An Agentic Benchmark for Financial Intelligence

cs.AI · 2026-05-14 · unverdicted · novelty 7.0

The Herculean benchmark shows that frontier agents handle trading and market insights better than hedging and auditing workflows, which demand state consistency and structured verification.

Diverse Evidence, Better Forecasts: Multi-Agent Deliberation Under Information Asymmetry

cs.AI · 2026-07-02 · unverdicted · novelty 6.0

InfoDelphi partitions evidence to induce information asymmetry in multi-agent LLM deliberation, yielding 12-18% Brier score gains and 4-8 pp accuracy gains on a 375-question benchmark.

LATTICE: Evaluating Decision Support Utility of Crypto Agents

cs.CR · 2026-04-29 · unverdicted · novelty 6.0

LATTICE is a scalable LLM-judge benchmark for crypto agent decision support that reveals performance trade-offs among real-world copilots across dimensions and tasks.

SysTradeBench: An Iterative Build-Test-Patch Benchmark for Strategy-to-Code Trading Systems with Drift-Aware Diagnostics

cs.SE · 2026-04-06 · unverdicted · novelty 6.0

SysTradeBench evaluates 17 LLMs on 12 trading strategies, finding over 91.7% code validity but rapid convergence in iterative fixes and a continued need for human oversight on critical strategies.

citing papers explorer

Showing 6 of 6 citing papers.

PolyBench: Benchmarking LLM Forecasting and Trading Capabilities on Live Prediction Market Data
q-fin.CP · 2026-04-03 · conditional · none · ref 23

CLQT: A Closed-Loop, Cost-Aware, Strategy-Consistent Benchmark for Diagnostic Evaluation of LLM Portfolio-Management Agents
cs.AI · 2026-06-29 · unverdicted · none · ref 2

Herculean: An Agentic Benchmark for Financial Intelligence
cs.AI · 2026-05-14 · unverdicted · none · ref 12

Diverse Evidence, Better Forecasts: Multi-Agent Deliberation Under Information Asymmetry
cs.AI · 2026-07-02 · unverdicted · none · ref 6

LATTICE: Evaluating Decision Support Utility of Crypto Agents
cs.CR · 2026-04-29 · unverdicted · none · ref 10

SysTradeBench: An Iterative Build-Test-Patch Benchmark for Strategy-to-Code Trading Systems with Drift-Aware Diagnostics
cs.SE · 2026-04-06 · unverdicted · none · ref 29
