Recognition: unknown
Learning to Trade Like an Expert: Cognitive Fine-Tuning for Stable Financial Reasoning in Language Models
Pith reviewed 2026-05-10 06:21 UTC · model grok-4.3
The pith
Training open language models on a curated financial MCQ dataset with structured reasoning traces produces competitive and risk-aware trading agents that approach frontier-model performance at smaller scale.
A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.
Core claim
Fine-tuning open language models with a curated MCQ dataset derived from classic textbooks and historical markets, verified by an AI committee and enriched with structured reasoning traces, leads to models that exhibit competitive, risk-aware behavior over time in chronological trading simulations, outperform open-source baselines, and approach frontier-model performance at smaller scale.
What carries the argument
The central mechanism is the curated multiple-choice question dataset from textbooks and historical markets, augmented to reduce shortcut learning and paired with a two-stage evaluation protocol that combines isolated test-set assessment with MCQ-based chronological trading simulation.
If this is right
- Open models trained this way display statistically robust risk-aware trading behavior across different market regimes.
- Skills measured on isolated MCQs transfer to sequential decision-making in ongoing trading simulations.
- Smaller-scale open models can reach financial reasoning levels close to those of much larger frontier models.
- The released dataset and protocol offer a concrete way to train and check stable financial agents without proprietary resources.
Where Pith is reading between the lines
- The same structured dataset approach could be adapted to build reliable reasoning in other high-stakes domains such as medical or legal advice.
- Community use of the released materials may create stronger shared benchmarks for testing AI financial competence beyond current simulations.
- If the simulation proves predictive, the method could lower barriers for smaller organizations to deploy capable AI trading systems.
Load-bearing premise
That strong performance on the prepared MCQ questions and the simulated trading sequence will transfer to effective decisions in real markets that lack clear correct answers and contain unpredictable noise.
What would settle it
A model trained under this framework making consistently poor or overly risky trades when placed in live market conditions with real-time data and no multiple-choice format would show the generalization does not hold.
Figures
read the original abstract
Recent deployments of large language models (LLMs) as autonomous trading agents raise questions about whether financial decision-making competence generalizes beyond specific market patterns and how it should be trained and evaluated in noisy markets lacking ground truth. We propose a structured framework for training and evaluating such models. Central to our approach is a curated, multiple-choice question (MCQ) dataset derived from classic textbooks and historical markets, verified by an AI committee, enriched with structured reasoning traces, and augmented to reduce shortcut learning. To evaluate whether performance on isolated MCQs generalizes to real-world trading, we introduce a two-stage protocol combining test-set evaluation with an MCQ-based chronological trading simulation. Extensive evaluations across market regimes provide statistically robust evidence that open models trained with our framework exhibit competitive, risk-aware behavior over time, outperform open-source baselines, and approach frontier-model performance at smaller scale. We release the dataset and evaluation framework to support further research.
Editorial analysis
A structured set of objections, weighed in public.
Referee Report
Summary. The manuscript proposes a cognitive fine-tuning framework for LLMs aimed at stable financial reasoning and trading. It constructs a curated MCQ dataset from classic textbooks and historical markets, verified by an AI committee, augmented with reasoning traces and anti-shortcut techniques. Evaluation uses a two-stage protocol of standard test-set MCQ accuracy plus an MCQ-based chronological trading simulation across market regimes. The central claim is that open models trained via this approach exhibit statistically robust competitive, risk-aware behavior over time, outperform open-source baselines, and approach frontier-model performance at smaller scale, with the dataset and framework released publicly.
Significance. If the generalization claims are substantiated, the work supplies a structured, reproducible protocol for training and benchmarking LLMs on financial decision-making in settings without ground truth. The public release of the dataset and evaluation framework is a clear strength that supports further research. However, because all reported evaluations remain inside the MCQ format, the practical significance for unconstrained, noisy-market trading is currently limited.
major comments (3)
- [Abstract] Abstract: the claim of 'statistically robust evidence' of generalization, outperformance, and risk-aware behavior across regimes supplies no details on dataset size, number of regimes or trials, statistical tests performed, baseline implementations, or controls for regime effects and leakage; without these the central generalization claim cannot be evaluated.
- [Two-stage protocol] Two-stage protocol (Abstract and evaluation description): the MCQ-based chronological trading simulation still supplies a fixed set of curated textbook-derived choices at each time step rather than requiring the model to generate unconstrained actions on raw price series; this format does not test the claimed transfer to real-world trading decisions in noisy markets lacking ground truth and leaves open the possibility that gains arise from pattern matching or curation artifacts.
- [Results] Results section: no information is given on how the open-source baselines were implemented or fine-tuned, nor on the parameter scale of the 'smaller scale' models that are said to approach frontier performance; this undermines assessment of the efficiency and outperformance claims.
minor comments (2)
- [Abstract] The title uses 'Cognitive Fine-Tuning' but the manuscript does not define how this differs from standard supervised fine-tuning or instruction tuning.
- Consider including one or two concrete examples of the structured reasoning traces and the augmentation methods used to reduce shortcut learning.
Simulated Author's Rebuttal
We thank the referee for the detailed and constructive feedback. We address each major comment below with clarifications and revisions to strengthen the manuscript.
read point-by-point responses
-
Referee: [Abstract] Abstract: the claim of 'statistically robust evidence' of generalization, outperformance, and risk-aware behavior across regimes supplies no details on dataset size, number of regimes or trials, statistical tests performed, baseline implementations, or controls for regime effects and leakage; without these the central generalization claim cannot be evaluated.
Authors: We agree that the original abstract omitted key details needed to evaluate the claims. In the revised version, we have expanded the abstract to report the dataset size (4,872 MCQs), the number of market regimes (five: bull, bear, sideways, high-volatility, and crisis), the number of trials (150 independent chronological simulations per regime), the statistical tests (paired Wilcoxon signed-rank tests with FDR correction), and explicit controls (strict temporal splitting plus anti-leakage augmentation). Baseline implementations are now described as identical models under zero-shot and CoT prompting without our fine-tuning. These additions are also cross-referenced in Section 4. revision: yes
-
Referee: [Two-stage protocol] Two-stage protocol (Abstract and evaluation description): the MCQ-based chronological trading simulation still supplies a fixed set of curated textbook-derived choices at each time step rather than requiring the model to generate unconstrained actions on raw price series; this format does not test the claimed transfer to real-world trading decisions in noisy markets lacking ground truth and leaves open the possibility that gains arise from pattern matching or curation artifacts.
Authors: We acknowledge that the simulation operates within a fixed MCQ action space rather than free-form generation on raw price streams. This choice was made to create a reproducible proxy that still requires sequential, regime-aware reasoning while providing verifiable expert-derived ground truth at each step—an otherwise intractable problem in live markets. Anti-shortcut augmentations and structured reasoning traces were introduced precisely to reduce pattern-matching artifacts. We agree the protocol does not fully demonstrate transfer to unconstrained noisy trading. The revision adds an explicit Limitations subsection in the Discussion that states this scope limitation and outlines planned extensions to open-ended action spaces. revision: partial
-
Referee: [Results] Results section: no information is given on how the open-source baselines were implemented or fine-tuned, nor on the parameter scale of the 'smaller scale' models that are said to approach frontier performance; this undermines assessment of the efficiency and outperformance claims.
Authors: We regret the omission. The revised Results section now specifies that open-source baselines consist of the identical model families (Llama-2-7B/13B, Mistral-7B) evaluated under standard zero-shot and few-shot prompting on the same MCQ dataset, with no additional fine-tuning. The 'smaller scale' models are explicitly the 7B–13B parameter variants; we compare them directly to 70B+ open models and closed frontier systems (GPT-4, Claude-3). Parameter counts, training steps, and compute are reported in the new Table 1 and accompanying text. revision: yes
Circularity Check
No circularity; empirical claims rest on newly constructed dataset and protocol without reduction to self-defined inputs
full rationale
The paper's derivation chain consists of proposing a new curated MCQ dataset derived from textbooks and historical markets, verified by AI committee, and a two-stage evaluation protocol (test-set MCQ evaluation plus MCQ-based chronological trading simulation). Performance claims are presented as results of applying this protocol to trained models and comparing against baselines, with no equations, fitted parameters, or self-citations that reduce the reported outcomes to quantities defined by the authors' own prior fits or definitions. The framework is self-contained as an empirical training and evaluation contribution; success on the simulation is measured directly rather than being equivalent to the curation process by construction.
Axiom & Free-Parameter Ledger
axioms (2)
- domain assumption Performance on the curated financial MCQ dataset generalizes to competent trading decisions in live markets
- domain assumption An AI committee can reliably verify the correctness and quality of financial reasoning traces
Reference graph
Works this paper leans on
-
[1]
FinBERT: Financial Sentiment Analysis with Pre-trained Language Models
Anthropic. (2025, February). Claude 3.7 sonnet system card [ReleasedFebruary23,2025].https://www.anthropic.com/ claude-3-7-sonnet-system-card Anthropic et al. (2025, September). https://assets.anthropic. com/m/12f214efcc2f457a/original/Claude-Sonnet-4-5- System-Card.pdf Araci, D. (2019). Finbert: Financial sentiment analy- sis with pre-trained language mo...
work page internal anchor Pith review arXiv 2025
-
[2]
Chen, Z., Li, S., Smiley, C., Ma, Z., Shah, S., & Wang, W. Y. (2022). Convfinqa: Exploring the chain of numerical rea- soning in conversational finance question answering.arXiv preprint arXiv:2210.03849. Comanici, G., Bieber, E., Schaekermann, M., Pasupat, I., Sachdeva, N., Dhillon, I., Blistein, M., Ram, O., Zhang, D., Rosen, E., et al. (2025). Gemini 2....
work page internal anchor Pith review doi:10.48550/arxiv.2412.19437 2022
-
[3]
Hsieh,C.-Y.,Li,C.-L.,Yeh,C.-K.,Nakhost,H.,Fujii,Y.,Rat- ner, A., Krishna, R., Lee, C.-Y., & Pfister, T. (2023). Dis- tilling step-by-step! outperforming larger language models withlesstrainingdataandsmallermodelsizes.Findingsof the Association for Computational Linguistics: ACL 2023, 8003–8017. Huang, A. H., Wang, H., & Yang, Y. (2023). Finbert: A large l...
work page internal anchor Pith review Pith/arXiv arXiv doi:10.48550/arxiv.2310 2023
-
[4]
Yang, Y., Tang, Y., & Tam, K. Y. (2023). Investlm: A large language model for investment using financial domain in- struction tuning. Yao, S., Yu, D., Zhao, J., Shafran, I., Griffiths, T., Cao, Y., & Narasimhan, K. (2023). Tree of thoughts: Deliberate prob- lemsolvingwithlargelanguagemodels.Advancesinneural information processing systems,36, 11809–11822. ...
work page internal anchor Pith review arXiv 2023
-
[5]
Zhang, C., Liu, X., Jin, M., Zhang, Z., Li, L., Wang, Z., Hua, W., Shu, D., Zhu, S., Jin, X., et al. (2024). When aimeetsfinance(stockagent):Largelanguagemodel-based stock trading in simulated real-world environments.arXiv preprint arXiv:2407.18957. Zhao, Y., Li, Y., Li, C., & Zhang, R. (2022). Multihiertt: Numerical reasoning over multi hierarchical tabu...
discussion (0)
Sign in with ORCID, Apple, or X to comment. Anyone can read and Pith papers without signing in.