arxiv: 2604.04812 · v1 · submitted 2026-04-06 · 💻 cs.SE

Recognition: no theorem link

SysTradeBench: An Iterative Build-Test-Patch Benchmark for Strategy-to-Code Trading Systems with Drift-Aware Diagnostics

Yuchen Cao , Hanlin Zhang , Jacky Wai Keung , Yang Chen , Linqi Song

Authors on Pith no claims yet

Pith reviewed 2026-05-10 19:25 UTC · model grok-4.3

classification 💻 cs.SE

keywords SysTradeBenchLLM trading systemsstrategy to codeiterative benchmarkingdrift-aware diagnosticscode generation evaluationquantitative tradingsoftware governance

0 comments

The pith

SysTradeBench reveals top LLMs achieve over 91.7 percent validity in trading code generation but induce rapid code convergence through iteration.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper introduces SysTradeBench as an iterative benchmark to evaluate how large language models convert natural-language trading strategy documents into executable code with built-in audit requirements. Models generate strategy cards, code, and logs, which are then tested in a sandbox for determinism, leakage prevention, and rule consistency across iterations. Evaluations of 17 models on 12 strategies show leading systems exceed 91.7 percent validity with solid scores on fidelity, risk, reliability, and robustness. The iteration process, however, tends to produce converging code outputs by the second round, supporting the idea that such tools assist but do not supplant human oversight in developing governed trading systems.

Core claim

SysTradeBench evaluates LLM-generated trading systems through an iterative build-test-patch process with drift-aware diagnostics. Each model receives a standardized Base Strategy Doc and must produce a strategy card, executable code, and audit logs. A sandboxed harness performs checks for determinism and anti-leakage, detects drift, and supplies evidence for patches. Testing 17 models across 12 strategies finds that the best models reach validity above 91.7 percent and strong aggregate scores, yet code converges by iteration two. This indicates LLM iteration supports prototyping and fixes but human quantitative researchers are needed for diversity and robustness in critical applications.

What carries the argument

The iterative build-test-patch cycle supported by a sandboxed harness and drift-aware diagnostics that generate evidence bundles for constrained patches and multi-dimensional scorecards covering spec fidelity, risk discipline, reliability, and out-of-sample robustness.

If this is right

Top models achieve validity rates above 91.7 percent with strong performance across scorecards.
Evidence-driven iteration causes code convergence by the second iteration for most systems.
LLMs are effective for rapid prototyping and shallow bug fixes in trading code.
Human oversight remains essential for maintaining solution diversity and ensemble robustness in critical strategies.
The benchmark provides cost-effectiveness signals in addition to quality metrics.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

Applying similar iterative benchmarks with drift diagnostics could improve evaluation of code generation in other regulated domains.
Code convergence may be mitigated by incorporating diversity-promoting techniques in the prompting or evaluation process.
These results suggest that development workflows could integrate LLMs for initial versions followed by expert review to balance speed and reliability.

Load-bearing premise

The sandboxed harness and drift-aware diagnostics accurately detect rule drift, anti-leakage violations, and spec fidelity without introducing biases or false signals that skew the multi-dimensional scorecards.

What would settle it

An experiment where independently verified trading codes from top-scoring models are found to violate anti-leakage rules or exhibit undetected rule drift would falsify the benchmark's ability to provide reliable diagnostics.

Figures

Figures reproduced from arXiv: 2604.04812 by Hanlin Zhang, Jacky Wai Keung, Linqi Song, Yang Chen, Yuchen Cao.

**Figure 3.** Figure 3: RQ4: Token usage heatmap (3 models × 4 iterations). Request tokens stabilize after Iter0; response tokens shrink as models generate patches rather than full code. These findings reveal a fundamental tension in automated strategy repair workflows. Traditional Quantitative Researcher (QR) teams undergo lengthy governance cycles, typically requiring weeks of iterative refinement and cross-team validation. In… view at source ↗

**Figure 2.** Figure 2: RQ3: Learning curves for Bollinger Mean Reversion [PITH_FULL_IMAGE:figures/full_fig_p008_2.png] view at source ↗

**Figure 4.** Figure 4: Illustrative Example: Code Quality Issues Across Five LLMs [PITH_FULL_IMAGE:figures/full_fig_p016_4.png] view at source ↗

**Figure 5.** Figure 5: Iter0 cross-evaluation heatmap [PITH_FULL_IMAGE:figures/full_fig_p016_5.png] view at source ↗

read the original abstract

Large language models (LLMs) are increasingly used as quantitative research copilots to translate natural-language strategy specifications into executable trading code. Yet most existing evaluations either focus on static financial knowledge or summarize performance with a single profitability metric, leaving a gap for benchmarking strategy-to-code trading systems as governed, auditable software. We introduce SysTradeBench (SysTB), an iterative build-test-patch benchmark that evaluates LLM-generated trading systems under drift-aware diagnostics. Given a standardized Base Strategy Doc and frozen semantics, each model must produce (i) a strategy card, (ii) executable code, and (iii) mandatory audit logs. A sandboxed harness runs determinism and anti-leakage checks, detects rule drift across iterations, and returns evidence bundles to support constrained patches. SysTradeBench reports multi-dimensional scorecards for spec fidelity, risk discipline, reliability, and out-of-sample robustness indicators, together with cost-effectiveness signals. We evaluate 17 models across 12 strategies. Top models achieve validity above 91.7 percent with strong aggregate scores, but evidence-driven iteration also induces code convergence by Iter2. These findings suggest that LLM iteration complements rather than replaces human quantitative researcher governance: LLMs excel at rapid prototyping and shallow bug fixes, while human oversight remains essential for critical strategies requiring solution diversity and ensemble robustness.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Desk Editor's Note private letter to a colleague

SysTradeBench adds a structured iterative benchmark with audit logs and drift checks for LLM trading code, but thin details on strategies and harness validation limit how much the 91.7% validity numbers can be trusted.

read the letter

This paper introduces SysTradeBench as an iterative build-test-patch setup for turning natural-language trading strategies into code. Models must output a strategy card, executable code, and audit logs. A sandbox then runs determinism checks, anti-leakage tests, and drift detection across iterations before feeding evidence back for patches. It evaluates 17 models on 12 strategies and reports top validity above 91.7 percent plus convergence by the second iteration, arguing that LLMs handle prototyping and shallow fixes while humans still handle governance for diversity and robustness. The multi-dimensional scorecards on spec fidelity, risk discipline, reliability, and robustness are the clearest addition over prior single-metric evaluations. Requiring audit logs and explicit drift tracking forces more traceable outputs than most code-generation benchmarks. The convergence finding is a concrete observation worth noting if the iteration process is reproducible. The main gaps are the missing specifics on what the 12 strategies actually contain, how the exact scores are computed, and how the 17 models were selected. The abstract also gives no calibration of the harness against human expert judgments or adversarial leakage cases, so the stress-test worry about the diagnostics introducing their own biases or false signals remains open. If the full paper supplies those controls and external validation, the results strengthen; otherwise the high validity numbers rest on unexamined assumptions about the test environment. This is aimed at people working on domain-specific LLM evaluation or quant finance tooling. A reader building similar benchmarks would get usable ideas from the card-plus-logs requirement and the evidence-bundle loop. It deserves peer review because the core framework is concrete enough to get useful referee input on strategy selection and harness design, even if the current reporting needs expansion.

Referee Report

3 major / 2 minor

Summary. The paper introduces SysTradeBench (SysTB), an iterative build-test-patch benchmark for evaluating how well LLMs translate natural-language trading strategy specifications into executable code under a sandboxed harness with drift-aware diagnostics. It requires models to produce strategy cards, code, and audit logs for 12 strategies; the harness performs determinism/anti-leakage checks and returns evidence bundles for constrained patching. The work evaluates 17 models, reports that top models exceed 91.7% validity with strong aggregate multi-dimensional scores (spec fidelity, risk discipline, reliability, robustness), observes code convergence by iteration 2, and concludes that LLM iteration complements rather than replaces human quantitative researcher governance.

Significance. If the harness diagnostics prove reliable, SysTradeBench supplies a needed multi-dimensional, auditable framework for strategy-to-code generation that goes beyond single profitability metrics or static knowledge tests. The empirical finding of rapid convergence under evidence-driven iteration, together with the explicit call for retained human oversight on critical strategies, would be a useful contribution to the cs.SE and quantitative-finance literature on LLM copilots.

major comments (3)

[Abstract / Evaluation] Abstract and evaluation section: the headline validity figure of 91.7% and the claim of convergence by Iter2 are presented without any description of the 12 strategies, the precise multi-dimensional scoring formulas, error bars, or statistical tests. This absence makes it impossible to determine whether the data support the reported performance differences across the 17 models.
[Abstract / Harness description] Abstract and § on harness: the central validity and convergence results rest on the sandboxed harness correctly detecting rule drift, anti-leakage violations, and spec fidelity. The manuscript describes determinism checks and audit logs but supplies no external calibration against human-quantitative-expert ground truth or adversarial test cases, leaving open the possibility that the diagnostics introduce undetected biases that inflate scores.
[Evaluation setup] Model selection: the choice and diversity of the 17 evaluated models is not justified or characterized (e.g., size, training data, fine-tuning), which is load-bearing for the claim that the observed behavior generalizes to LLM iteration in trading-system generation.

minor comments (2)

[Abstract] The abstract states that SysTradeBench reports 'cost-effectiveness signals' but the manuscript does not clarify how these are computed or normalized across models.
[Results presentation] Figure or table captions should explicitly state the number of runs or seeds used to produce the aggregate scores.

Simulated Author's Rebuttal

3 responses · 0 unresolved

We thank the referee for their constructive and detailed feedback. We address each major comment point by point below, indicating revisions where we agree the manuscript can be strengthened without misrepresenting our results.

read point-by-point responses

Referee: [Abstract / Evaluation] Abstract and evaluation section: the headline validity figure of 91.7% and the claim of convergence by Iter2 are presented without any description of the 12 strategies, the precise multi-dimensional scoring formulas, error bars, or statistical tests. This absence makes it impossible to determine whether the data support the reported performance differences across the 17 models.

Authors: The abstract is kept concise by design. Section 4 and Table 1 describe the 12 strategies (their objectives, parameters, and Base Strategy Doc references), while Section 3.3 provides the exact multi-dimensional scoring formulas for spec fidelity, risk discipline, reliability, and robustness. We agree, however, that error bars and statistical tests were not reported. In revision we will add per-model standard errors (computed over three random seeds) and pairwise statistical comparisons (Wilcoxon signed-rank tests with Bonferroni correction) to substantiate differences among the 17 models. revision: yes
Referee: [Abstract / Harness description] Abstract and § on harness: the central validity and convergence results rest on the sandboxed harness correctly detecting rule drift, anti-leakage violations, and spec fidelity. The manuscript describes determinism checks and audit logs but supplies no external calibration against human-quantitative-expert ground truth or adversarial test cases, leaving open the possibility that the diagnostics introduce undetected biases that inflate scores.

Authors: Section 3 details the determinism, anti-leakage, and drift-detection logic together with the evidence bundles returned to models. We acknowledge the absence of external calibration. The revised manuscript will add a dedicated validation subsection reporting (i) agreement between harness verdicts and independent quantitative-expert review on a 20 % stratified sample and (ii) results on a set of adversarial test cases that deliberately inject drift or leakage. Cohen’s kappa and detection rates will be reported. revision: yes
Referee: [Evaluation setup] Model selection: the choice and diversity of the 17 evaluated models is not justified or characterized (e.g., size, training data, fine-tuning), which is load-bearing for the claim that the observed behavior generalizes to LLM iteration in trading-system generation.

Authors: Section 4.1 already contains a table listing each model’s provider, approximate parameter count, and known fine-tuning status. We will expand the accompanying text to justify the selection criteria (coverage of open- and closed-source models, size range 7 B–175 B, and inclusion of both general and code-specialized models) and will explicitly note the limits of generalizability given proprietary training data. Where model cards disclose training data or fine-tuning details, these will be summarized. revision: partial

Circularity Check

0 steps flagged

No significant circularity in empirical benchmark evaluation

full rationale

This paper introduces SysTradeBench as an empirical evaluation framework for LLM-generated trading systems, reporting direct measurements such as validity above 91.7% and code convergence by Iter2 from running 17 models on 12 strategies. No equations, derivations, fitted parameters, or self-referential reductions appear in the abstract or described methodology. The multi-dimensional scorecards derive from the sandboxed harness outputs applied to external model generations, not from any self-defined quantities or load-bearing self-citations. The evaluation chain is self-contained against the tested models and harness diagnostics without reducing claims to prior author inputs by construction.

Axiom & Free-Parameter Ledger

0 free parameters · 1 axioms · 0 invented entities

Abstract-only review means the ledger is necessarily incomplete; no free parameters, axioms, or invented entities can be fully audited without the methods section.

axioms (1)

domain assumption Base Strategy Doc has frozen semantics that remain unchanged across iterations
Stated as a core requirement of the benchmark setup.

pith-pipeline@v0.9.0 · 5549 in / 1319 out tokens · 37260 ms · 2026-05-10T19:25:59.125591+00:00 · methodology

discussion (0)

Reference graph

Works this paper leans on

56 extracted references · 27 canonical work pages · 13 internal anchors

[1]

Attention Is All You Need

Ashish Vaswani, Noam Shazeer, Niki Parmar, Jakob Uszkoreit, Llion Jones, Aidan N. Gomez, Łukasz Kaiser, and Illia Polosukhin. 2017. Attention Is All You Need. InNeurIPS 2017. arXiv:1706.03762

work page internal anchor Pith review Pith/arXiv arXiv 2017
[2]

Brown, Benjamin Mann, Nick Ryder, Melanie Subbiah, Jared Kaplan, Prafulla Dhariwal, et al

Tom B. Brown, Benjamin Mann, Nick Ryder, Melanie Subbiah, Jared Kaplan, Prafulla Dhariwal, et al. 2020. Language Models are Few-Shot Learners. InNeurIPS

2020
[3]

Training language models to follow instructions with human feedback

Long Ouyang, Jeff Wu, Xu Jiang, Diogo Almeida, Carroll L. Wainwright, Pamela Mishkin, et al. 2022. Training language models to follow instructions with human feedback. InNeurIPS 2022. arXiv:2203.02155

work page internal anchor Pith review Pith/arXiv arXiv 2022
[4]

OpenAI and Josh Achiam et al. 2023. GPT-4 Technical Report. arXiv:2303.08774

work page internal anchor Pith review Pith/arXiv arXiv 2023
[5]

Shunyu Yao, Jeffrey Zhao, Dian Yu, Nan Du, Izhak Shafran, Karthik Narasimhan, and Yuan Cao. 2022. ReAct: Synergizing Reasoning and Acting in Language Models. arXiv:2210.03629

work page internal anchor Pith review Pith/arXiv arXiv 2022
[6]

Timo Schick, Jane Dwivedi-Yu, Roberta Dessì, et al. 2023. Toolformer: Language Models Can Teach Themselves to Use Tools. arXiv:2302.04761

work page internal anchor Pith review arXiv 2023
[7]

Patrick Lewis, Ethan Perez, Aleksandra Piktus, et al. 2020. Retrieval- Augmented Generation for Knowledge-Intensive NLP Tasks. InNeurIPS 2020. arXiv:2005.11401

work page internal anchor Pith review arXiv 2020
[8]

Percy Liang, Rishi Bommasani, Tony Lee, et al. 2022. Holistic Evaluation of Language Models. arXiv:2211.09110

work page internal anchor Pith review Pith/arXiv arXiv 2022
[9]

Mark Chen, Jerry Tworek, Heewoo Jun, et al. 2021. Evaluating Large Language Models Trained on Code. arXiv:2107.03374

work page internal anchor Pith review Pith/arXiv arXiv 2021
[10]

Jacob Austin, Augustus Odena, Maxwell Nye, et al. 2021. Program Synthesis with Large Language Models. arXiv:2108.07732. (Associated with MBPP: Mostly Basic Python Problems.)

work page internal anchor Pith review Pith/arXiv arXiv 2021
[11]

Yujia Li, David Choi, Junyoung Chung, et al. 2022. Competition-Level Code Generation with AlphaCode.Science378(6624):1092–1097

2022
[12]

SWE-bench: Can Language Models Resolve Real-World GitHub Issues?

Carlos Jimenez, John Yang, Alex Y. Wang, et al. 2023. SWE-bench: Can Language Models Resolve Real-World GitHub Issues? arXiv:2310.06770

work page internal anchor Pith review Pith/arXiv arXiv 2023
[13]

Claire Le Goues, ThanhVu Nguyen, Stephanie Forrest, and Westley Weimer. 2012. GenProg: A Generic Method for Automatic Software Repair.IEEE Transactions on Software Engineering38(1):54–72

2012
[14]

Alexandru Marginean, Johannes Bader, et al. 2019. SapFix: Automated End-to-End Repair at Scale. InICSE 2019

2019
[15]

Arpad E. Elo. 1978.The Rating of Chessplayers, Past and Present. Arco Publishing

1978
[16]

Klaus Krippendorff. 2011. Computing Krippendorff’s Alpha-Reliability.Depart- mental Papers (ASC), University of Pennsylvania

2011
[17]

Dogu Araci. 2019. FinBERT: Financial Sentiment Analysis with Pre-trained Lan- guage Models. arXiv:1908.10063

work page arXiv 2019
[18]

Shijie Wu et al. 2023. BloombergGPT: A Large Language Model for Finance. arXiv:2303.17564

work page internal anchor Pith review arXiv 2023
[19]

Hongyang Yang, Xiao-Yang Liu, and Christina Dan Wang. 2023. FinGPT: Open- Source Financial Large Language Models. arXiv:2306.06031

work page arXiv 2023
[20]

Pranab Islam et al. 2023. FinanceBench: A New Benchmark for Financial Question Answering. arXiv:2311.11944

work page arXiv 2023
[21]

Qianqian Xie et al. 2024. FinBen: A Holistic Financial Benchmark for Large Language Models. arXiv:2402.12659

work page arXiv 2024
[22]

Qianqian Xie et al. 2023. PIXIU: A Large Language Model, Instruction Data and Evaluation Benchmark for Finance. arXiv:2306.05443

work page arXiv 2023
[23]

Raj Shah et al. 2022. When FLUE Meets FLANG: Benchmarks and Large Pretrained Language Model for Financial Domain. InEMNLP 2022

2022
[24]

Zhiyu Chen et al. 2021. FinQA: A Dataset of Numerical Reasoning over Financial Data. InNAACL 2021

2021
[25]

Fengbin Zhu et al. 2021. TAT-QA: A Question Answering Benchmark on a Hybrid of Tabular and Textual Content in Finance. InACL 2021

2021
[26]

Abhay Srivastava, Sam Jung, and Spencer Mateega. 2025. Market-Bench: Evalu- ating Large Language Models on Introductory Quantitative Trading and Market Dynamics. arXiv:2512.12264

work page arXiv 2025
[27]

Zhaolu Kang et al. 2026. QuantEval: A Benchmark for Financial Quantitative Tasks in Large Language Models. arXiv:2601.08689

work page arXiv 2026
[28]

Yanxu Chen, Zijun Yao, Yantao Liu, Jin Ye, Jianing Yu, Lei Hou, and Juanzi Li. 2025. StockBench: Can LLM Agents Trade Stocks Profitably In Real-world Markets? arXiv:2510.02209

work page arXiv 2025
[29]

Haofei Yu, Fenghai Li, and Jiaxuan You. 2025. LiveTradeBench: Seeking Real- World Alpha with Large Language Models. arXiv:2511.03628

work page arXiv 2025
[30]

Avi Arora and Ritesh Malpani. 2026. PredictionMarketBench: A SWE-bench- Style Framework for Backtesting Trading Agents on Prediction Markets. arXiv:2602.00133

work page arXiv 2026
[31]

Suchow, and Khaldoun Khashanah

Yangyang Yu, Haohang Li, Zhi Chen, Yuechen Jiang, Yang Li, Denghui Zhang, Rong Liu, Jordan W. Suchow, and Khaldoun Khashanah. 2023. FinMem: A Performance-Enhanced LLM Trading Agent with Layered Memory and Character Design. arXiv:2311.13743

work page arXiv 2023
[32]

Wentao Zhang et al. 2024. A Multimodal Foundation Agent for Financial Trading: Tool-Augmented, Diversified, and Generalist. InKDD 2024. arXiv:2402.18485

work page arXiv 2024
[33]

Barber and Terrance Odean

Brad M. Barber and Terrance Odean. 2000. Trading Is Hazardous to Your Wealth. The Journal of Finance55(2):773–806

2000
[34]

Keim and Ananth Madhavan

Donald B. Keim and Ananth Madhavan. 1997. Transaction Costs and Investment Style: An Inter-Exchange Analysis of Institutional Equity Trades.Journal of Financial Economics46(3):265–292

1997
[35]

Moskowitz

Andrea Frazzini, Ronen Israel, and Tobias J. Moskowitz. 2015. Trading Costs of Asset Pricing Anomalies.Journal of Financial Economics116(2):235–257

2015
[36]

Robert Almgren and Neil Chriss. 2001. Optimal Execution of Portfolio Transac- tions.Journal of Risk3(2):5–39

2001
[37]

2007.Empirical Market Microstructure: The Institutions, Economics, and Econometrics of Securities Trading

Joel Hasbrouck. 2007.Empirical Market Microstructure: The Institutions, Economics, and Econometrics of Securities Trading. Oxford University Press

2007
[38]

Harry Markowitz. 1952. Portfolio Selection.The Journal of Finance7(1):77–91

1952
[39]

2018.Advances in Financial Machine Learning

Marcos López de Prado. 2018.Advances in Financial Machine Learning. Wiley

2018
[40]

Philippe Artzner, Freddy Delbaen, Jean-Marc Eber, and David Heath. 1999. Co- herent Measures of Risk.Mathematical Finance9(3):203–228

1999
[41]

Tyrrell Rockafellar and Stanislav Uryasev

R. Tyrrell Rockafellar and Stanislav Uryasev. 2000. Optimization of Conditional Value-at-Risk.Journal of Risk2(3):21–41

2000
[42]

Andrew W. Lo. 2002. The Statistics of Sharpe Ratios.Financial Analysts Journal 58(4):36–52

2002
[43]

Ryan Sullivan, Allan Timmermann, and Halbert White. 1999. Data-Snooping, Technical Trading Rule Performance, and the Bootstrap.The Journal of Finance 54(5):1647–1691

1999
[44]

Halbert White. 2000. A Reality Check for Data Snooping.Econometrica68(5):1097– 1126

2000
[45]

Peter Reinhard Hansen. 2005. A Test for Superior Predictive Ability.Journal of Business & Economic Statistics23(4):365–380

2005
[46]

Bailey and Marcos López de Prado

David H. Bailey and Marcos López de Prado. 2014. The Deflated Sharpe Ratio: Cor- recting for Selection Bias, Backtest Overfitting and Non-Normality.The Journal of Portfolio Management40(5):94–107

2014
[47]

Bailey, Jonathan Borwein, Marcos López de Prado, and Qiji Jim Zhu

David H. Bailey, Jonathan Borwein, Marcos López de Prado, and Qiji Jim Zhu
[48]

The Probability of Backtest Overfitting.Journal of Computational Finance
[49]

Harvey, Yan Liu, and Heqing Zhu

Campbell R. Harvey, Yan Liu, and Heqing Zhu. 2016. ...and the Cross-Section of Expected Returns.The Review of Financial Studies29(1):5–68

2016
[50]

Weixian Waylon Li, Hyeonjun Kim, Mihai Cucuringu, and Tiejun Ma. 2025. Can LLM-based Financial Investing Strategies Outperform the Market in Long Run? (FINSABER). arXiv:2505.07078

work page arXiv 2025
[51]

Lianmin Zheng et al. 2023. Judging LLM-as-a-Judge with MT-Bench and Chatbot Arena. arXiv:2306.05685

work page internal anchor Pith review Pith/arXiv arXiv 2023
[52]

Wei-Lin Chiang et al. 2024. Chatbot Arena: An Open Platform for Evaluating LLMs by Human Preference. arXiv:2403.04132

work page internal anchor Pith review arXiv 2024
[53]

Philipp Winder, Christian Hildebrand, and Jochen Hartmann. 2025. Biased echoes: Large language models reinforce investment biases and increase portfolio risks. PLOS ONE. DOI:10.1371/journal.pone.0325459

work page doi:10.1371/journal.pone.0325459 2025
[54]

no drift

Daniel Kahneman and Amos Tversky. 1979. Prospect Theory: An Analysis of Decision under Risk.Econometrica47(2):263–291. A Detailed Strategy Specifications This appendix provides one representative Base Strategy Doc ex- cerpt. All 12 full specifications will be released with the benchmark. A.1 Example: Bollinger Band Mean Reversion Strategy family:Mean reve...

1979
[55]

strategy_name

strategy_card.json: {"strategy_name": "...", "parameters": {...}}
[56]

D1": 0.95,

strategy.py: class TradingStrategy with run() method ## Evaluation Dimensions - D1: Spec Fidelity, D2: Risk Discipline, D3: Reliability, D4: OOS Robustness D.2 Iter1–Iter3: Evidence-Driven Refinement Prompts grow from ∼6K (Iter1) to ∼12K (Iter3) tokens as code history accumulates. SysTradeBench: An Iterative Build–Test–Patch Benchmark for Strategy-to-Code...