BacktestBench is the first large-scale benchmark for LLM-automated quantitative backtesting, with 18,246 QA pairs from real market data and a multi-agent baseline called AutoBacktest.
ISBN 9798400704901
4 Pith papers cite this work. Polarity classification is still indexing.
Years: 2026 (4)
Verdicts: UNVERDICTED (4)
Citing papers explorer
- BacktestBench: Benchmarking Large Language Models for Automated Quantitative Strategy Backtesting
  BacktestBench is the first large-scale benchmark for LLM-automated quantitative backtesting, with 18,246 QA pairs from real market data and a multi-agent baseline called AutoBacktest.
- Mango: Multi-Agent Web Navigation via Global-View Optimization
  Mango raises web agent success rates to 63.6% on WebVoyager and 52.5% on WebWalkerQA by bandit-based starting-point selection and memory, beating baselines by 7.3% and 26.8%.
- ContractSkill: Repairable Contract-Based Skills for Multimodal Web Agents
  ContractSkill converts draft web agent skills into explicit executable contracts that enable deterministic verification, fault localization, and minimal local repair, improving stability on benchmarks like VisualWebArena.
- DynaWeb: Model-Based Reinforcement Learning of Web Agents
  DynaWeb introduces a model-based RL framework that trains web agents via imagined rollouts in a learned web world model interleaved with real expert trajectories, yielding consistent gains on WebArena and WebVoyager benchmarks.