BacktestBench is the first large-scale benchmark for LLM-automated quantitative backtesting, with 18,246 QA pairs from real market data and a multi-agent baseline called AutoBacktest.
Stockbench: Can llm agents trade stocks profitably in real-world markets?
6 Pith papers cite this work. Polarity classification is still indexing.
citation-role summary
citation-polarity summary
years
2026 6roles
background 2polarities
background 2representative citing papers
AutoRedTrader generates synthetic financial misinformation via behavioral bias manipulation and agent feedback to red-team LLM trading agents, reaching 69% exposure and 26.67% attack success on Bitcoin data simulations.
SysTradeBench evaluates 17 LLMs on 12 trading strategies, finding over 91.7% code validity but rapid convergence in iterative fixes and a continued need for human oversight on critical strategies.
EcoGym is a new open benchmark with three economic environments that reveals no leading LLM dominates at sustained plan-and-execute decision making across scenarios.
Reported alpha from end-to-end LLM trading agents does not constitute deployment evidence until it passes structural tests for temporal integrity, frictions, robustness, calibration, execution, and disaggregation.
Strat-LLM demonstrates that LLM trading performance varies by reasoning mode and model scale, with strict alignment reducing drawdowns in downtrends and deep reasoning avoiding small-gain traps.
citing papers explorer
-
BacktestBench: Benchmarking Large Language Models for Automated Quantitative Strategy Backtesting
BacktestBench is the first large-scale benchmark for LLM-automated quantitative backtesting, with 18,246 QA pairs from real market data and a multi-agent baseline called AutoBacktest.
-
AutoRedTrader: Autonomous Red Teaming of Trading Agents through Synthetic Misinformation Injection
AutoRedTrader generates synthetic financial misinformation via behavioral bias manipulation and agent feedback to red-team LLM trading agents, reaching 69% exposure and 26.67% attack success on Bitcoin data simulations.
-
SysTradeBench: An Iterative Build-Test-Patch Benchmark for Strategy-to-Code Trading Systems with Drift-Aware Diagnostics
SysTradeBench evaluates 17 LLMs on 12 trading strategies, finding over 91.7% code validity but rapid convergence in iterative fixes and a continued need for human oversight on critical strategies.
-
EcoGym: Evaluating LLMs for Long-Horizon Plan-and-Execute in Interactive Economies
EcoGym is a new open benchmark with three economic environments that reveals no leading LLM dominates at sustained plan-and-execute decision making across scenarios.
-
The Alpha Illusion: Reported Alpha from LLM Trading Agents Should Not Be Treated as Deployment Evidence
Reported alpha from end-to-end LLM trading agents does not constitute deployment evidence until it passes structural tests for temporal integrity, frictions, robustness, calibration, execution, and disaggregation.
-
Strat-LLM: Stratified Strategy Alignment for LLM-based Stock Trading with Real-time Multi-Source Signals
Strat-LLM demonstrates that LLM trading performance varies by reasoning mode and model scale, with strict alignment reducing drawdowns in downtrends and deep reasoning avoiding small-gain traps.