CLQT is a new closed-loop, cost-aware benchmark that diagnoses LLM trading agent capabilities through strategy-consistent metrics and hash-verifiable trails rather than outcome rankings.
StockBench: Can LLM agents trade stocks profitably in real-world markets?
12 Pith papers cite this work. Polarity classification is still indexing.
citation-role summary
citation-polarity summary
years
2026 12polarities
background 3representative citing papers
Introduces BacktestBench benchmark with 18k QA pairs across four backtesting tasks and evaluates 23 LLMs via the AutoBacktest multi-agent system.
Herculean benchmark shows frontier agents handle trading and market insights better than hedging and auditing workflows that demand state consistency and structured verification.
AutoRedTrader generates synthetic financial misinformation via behavioral bias manipulation and agent feedback to red-team LLM trading agents, reaching 69% exposure and 26.67% attack success on Bitcoin data simulations.
Introduces a protocol scoring AI investment advisors on validity under constraints, stability, and agreement with a deterministic baseline, showing agreement often masks invalid actions.
LLMs produce human-like finite bids in the St. Petersburg game but shift toward rational behavior under controlled prompt changes, indicating surface-level outcome resemblance without mechanism-level alignment.
SysTradeBench evaluates 17 LLMs on 12 trading strategies, finding over 91.7% code validity but rapid convergence in iterative fixes and a continued need for human oversight on critical strategies.
EcoGym is a new open benchmark with three economic environments that reveals no leading LLM dominates at sustained plan-and-execute decision making across scenarios.
This survey categorizes agentic environments for LLMs by eight attributes and domains, introduces symbolic and neural synthesis paradigms with evaluation, and outlines four agent evolution pathways plus three environment evolution paradigms.
Empirical study of DeFi AI agents finds limited autonomous execution, negative median user returns, high gain inequality, and valuations disconnected from treasury fundamentals, with a proposed maturity framework.
Reported alpha from end-to-end LLM trading agents does not constitute deployment evidence until it passes structural tests for temporal integrity, frictions, robustness, calibration, execution, and disaggregation.
Strat-LLM demonstrates that LLM trading performance varies by reasoning mode and model scale, with strict alignment reducing drawdowns in downtrends and deep reasoning avoiding small-gain traps.
citing papers explorer
-
CLQT: A Closed-Loop, Cost-Aware, Strategy-Consistent Benchmark for Diagnostic Evaluation of LLM Portfolio-Management Agents
CLQT is a new closed-loop, cost-aware benchmark that diagnoses LLM trading agent capabilities through strategy-consistent metrics and hash-verifiable trails rather than outcome rankings.
-
Herculean: An Agentic Benchmark for Financial Intelligence
Herculean benchmark shows frontier agents handle trading and market insights better than hedging and auditing workflows that demand state consistency and structured verification.
-
Paper Agents, Paper Gains: An Empirical Analysis of DeFi Investment Agents
Empirical study of DeFi AI agents finds limited autonomous execution, negative median user returns, high gain inequality, and valuations disconnected from treasury fundamentals, with a proposed maturity framework.
-
Strat-LLM: Stratified Strategy Alignment for LLM-based Stock Trading with Real-time Multi-Source Signals
Strat-LLM demonstrates that LLM trading performance varies by reasoning mode and model scale, with strict alignment reducing drawdowns in downtrends and deep reasoning avoiding small-gain traps.