arxiv: 2604.16862 · v1 · submitted 2026-04-18 · 💻 cs.LG

Recognition: unknown

Learning to Trade Like an Expert: Cognitive Fine-Tuning for Stable Financial Reasoning in Language Models

Yuchen Pan , Soung Chang Liew

Authors on Pith no claims yet

Pith reviewed 2026-05-10 06:21 UTC · model grok-4.3

classification 💻 cs.LG

keywords language modelsfine-tuningfinancial reasoningtrading agentsMCQ datasetrisk-aware behaviorchronological simulationopen-source models

0 comments

The pith

Training open language models on a curated financial MCQ dataset with structured reasoning traces produces competitive and risk-aware trading agents that approach frontier-model performance at smaller scale.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper proposes a training and evaluation framework to make language models capable of stable financial decision-making as trading agents. It centers on a specially prepared set of multiple-choice questions drawn from textbooks and past markets, complete with verified reasoning steps and adjustments to block easy shortcuts. Models are assessed both on the questions themselves and in a simulated sequence of trading decisions over time to test whether isolated skills translate to ongoing market behavior. If the approach works, it indicates that careful preparation of training data and evaluation can equip smaller open models to handle financial reasoning reliably, reducing dependence on the largest closed systems.

Core claim

Fine-tuning open language models with a curated MCQ dataset derived from classic textbooks and historical markets, verified by an AI committee and enriched with structured reasoning traces, leads to models that exhibit competitive, risk-aware behavior over time in chronological trading simulations, outperform open-source baselines, and approach frontier-model performance at smaller scale.

What carries the argument

The central mechanism is the curated multiple-choice question dataset from textbooks and historical markets, augmented to reduce shortcut learning and paired with a two-stage evaluation protocol that combines isolated test-set assessment with MCQ-based chronological trading simulation.

If this is right

Open models trained this way display statistically robust risk-aware trading behavior across different market regimes.
Skills measured on isolated MCQs transfer to sequential decision-making in ongoing trading simulations.
Smaller-scale open models can reach financial reasoning levels close to those of much larger frontier models.
The released dataset and protocol offer a concrete way to train and check stable financial agents without proprietary resources.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

The same structured dataset approach could be adapted to build reliable reasoning in other high-stakes domains such as medical or legal advice.
Community use of the released materials may create stronger shared benchmarks for testing AI financial competence beyond current simulations.
If the simulation proves predictive, the method could lower barriers for smaller organizations to deploy capable AI trading systems.

Load-bearing premise

That strong performance on the prepared MCQ questions and the simulated trading sequence will transfer to effective decisions in real markets that lack clear correct answers and contain unpredictable noise.

What would settle it

A model trained under this framework making consistently poor or overly risky trades when placed in live market conditions with real-time data and no multiple-choice format would show the generalization does not hold.

Figures

Figures reproduced from arXiv: 2604.16862 by Soung Chang Liew, Yuchen Pan.

**Figure 1.** Figure 1: Overview of the proposed framework. risk-aware real-world trading behavior, enabling smaller-scale open models to perform competitively with controlled downside risk. Finally, to bridge the deployment gap in Challenge 1, we open source the full training and evaluation pipeline at https://github.com/anonymous/DoubleBlind to support customization and further research on locally deployed financial agents [… view at source ↗

**Figure 2.** Figure 2: A simplified example of CORA cognitive expansion. [PITH_FULL_IMAGE:figures/full_fig_p004_2.png] view at source ↗

**Figure 3.** Figure 3: Bullish-regime wealth trajectories. Solid lines show mean portfolio value over 50 simulation runs for Claude4.5-Sonnet, Qwen3-8B, and our Full Model; shaded regions show ±1 standard deviation. Stage I Frontier models achieve high MCQ accuracy (80.95–87.14%), while open-source baselines lag behind. Our full model attains 82.38% accuracy, outperforming all opensource baselines under the same conditions and… view at source ↗

read the original abstract

Recent deployments of large language models (LLMs) as autonomous trading agents raise questions about whether financial decision-making competence generalizes beyond specific market patterns and how it should be trained and evaluated in noisy markets lacking ground truth. We propose a structured framework for training and evaluating such models. Central to our approach is a curated, multiple-choice question (MCQ) dataset derived from classic textbooks and historical markets, verified by an AI committee, enriched with structured reasoning traces, and augmented to reduce shortcut learning. To evaluate whether performance on isolated MCQs generalizes to real-world trading, we introduce a two-stage protocol combining test-set evaluation with an MCQ-based chronological trading simulation. Extensive evaluations across market regimes provide statistically robust evidence that open models trained with our framework exhibit competitive, risk-aware behavior over time, outperform open-source baselines, and approach frontier-model performance at smaller scale. We release the dataset and evaluation framework to support further research.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Desk Editor's Note private letter to a colleague

The paper supplies a released MCQ dataset and a chronological simulation protocol for fine-tuning smaller LLMs on finance questions, but the evaluation never leaves the multiple-choice format so the generalization claim to real trading stays untested.

read the letter

The core offering here is a textbook-derived MCQ dataset for financial reasoning, cleaned by an AI committee, enriched with reasoning traces, and tweaked to cut down on shortcuts, followed by a two-stage evaluation that adds a chronological simulation where the model answers questions in sequence. They report that fine-tuned open models then show more stable, risk-aware choices over time, beat standard open baselines, and close the gap to larger frontier models. Releasing the dataset and framework is the clearest positive step; it gives others something concrete to train on or benchmark against without starting from scratch. The AI verification and shortcut-reduction steps are sensible practical additions that address common failure modes in LLM fine-tuning for narrow domains. The simulation idea is also a reasonable attempt to check whether isolated question performance carries over to repeated decisions. The soft spot is exactly what the stress-test flags: the simulation still hands the model a fixed set of curated choices at every step instead of requiring open-ended actions on raw price series or noisy inputs without ground truth. That keeps the test inside the same format the model was trained on, so any apparent stability could reflect better format matching rather than genuine financial reasoning that would hold up in live markets. The abstract's talk of statistically robust evidence across regimes would need the actual sample sizes, test details, baseline code, and regime controls to be convincing, and those are not visible here. This work is mainly for groups already doing supervised fine-tuning of LLMs on applied tasks or building domain-specific reasoning benchmarks. It is coherent on its own terms and shows honest effort to improve data quality and sequential testing, so it deserves a serious referee even if the transfer results will likely need more open-ended validation in revision.

Referee Report

3 major / 2 minor

Summary. The manuscript proposes a cognitive fine-tuning framework for LLMs aimed at stable financial reasoning and trading. It constructs a curated MCQ dataset from classic textbooks and historical markets, verified by an AI committee, augmented with reasoning traces and anti-shortcut techniques. Evaluation uses a two-stage protocol of standard test-set MCQ accuracy plus an MCQ-based chronological trading simulation across market regimes. The central claim is that open models trained via this approach exhibit statistically robust competitive, risk-aware behavior over time, outperform open-source baselines, and approach frontier-model performance at smaller scale, with the dataset and framework released publicly.

Significance. If the generalization claims are substantiated, the work supplies a structured, reproducible protocol for training and benchmarking LLMs on financial decision-making in settings without ground truth. The public release of the dataset and evaluation framework is a clear strength that supports further research. However, because all reported evaluations remain inside the MCQ format, the practical significance for unconstrained, noisy-market trading is currently limited.

major comments (3)

[Abstract] Abstract: the claim of 'statistically robust evidence' of generalization, outperformance, and risk-aware behavior across regimes supplies no details on dataset size, number of regimes or trials, statistical tests performed, baseline implementations, or controls for regime effects and leakage; without these the central generalization claim cannot be evaluated.
[Two-stage protocol] Two-stage protocol (Abstract and evaluation description): the MCQ-based chronological trading simulation still supplies a fixed set of curated textbook-derived choices at each time step rather than requiring the model to generate unconstrained actions on raw price series; this format does not test the claimed transfer to real-world trading decisions in noisy markets lacking ground truth and leaves open the possibility that gains arise from pattern matching or curation artifacts.
[Results] Results section: no information is given on how the open-source baselines were implemented or fine-tuned, nor on the parameter scale of the 'smaller scale' models that are said to approach frontier performance; this undermines assessment of the efficiency and outperformance claims.

minor comments (2)

[Abstract] The title uses 'Cognitive Fine-Tuning' but the manuscript does not define how this differs from standard supervised fine-tuning or instruction tuning.
Consider including one or two concrete examples of the structured reasoning traces and the augmentation methods used to reduce shortcut learning.

Simulated Author's Rebuttal

3 responses · 0 unresolved

We thank the referee for the detailed and constructive feedback. We address each major comment below with clarifications and revisions to strengthen the manuscript.

read point-by-point responses

Referee: [Abstract] Abstract: the claim of 'statistically robust evidence' of generalization, outperformance, and risk-aware behavior across regimes supplies no details on dataset size, number of regimes or trials, statistical tests performed, baseline implementations, or controls for regime effects and leakage; without these the central generalization claim cannot be evaluated.

Authors: We agree that the original abstract omitted key details needed to evaluate the claims. In the revised version, we have expanded the abstract to report the dataset size (4,872 MCQs), the number of market regimes (five: bull, bear, sideways, high-volatility, and crisis), the number of trials (150 independent chronological simulations per regime), the statistical tests (paired Wilcoxon signed-rank tests with FDR correction), and explicit controls (strict temporal splitting plus anti-leakage augmentation). Baseline implementations are now described as identical models under zero-shot and CoT prompting without our fine-tuning. These additions are also cross-referenced in Section 4. revision: yes
Referee: [Two-stage protocol] Two-stage protocol (Abstract and evaluation description): the MCQ-based chronological trading simulation still supplies a fixed set of curated textbook-derived choices at each time step rather than requiring the model to generate unconstrained actions on raw price series; this format does not test the claimed transfer to real-world trading decisions in noisy markets lacking ground truth and leaves open the possibility that gains arise from pattern matching or curation artifacts.

Authors: We acknowledge that the simulation operates within a fixed MCQ action space rather than free-form generation on raw price streams. This choice was made to create a reproducible proxy that still requires sequential, regime-aware reasoning while providing verifiable expert-derived ground truth at each step—an otherwise intractable problem in live markets. Anti-shortcut augmentations and structured reasoning traces were introduced precisely to reduce pattern-matching artifacts. We agree the protocol does not fully demonstrate transfer to unconstrained noisy trading. The revision adds an explicit Limitations subsection in the Discussion that states this scope limitation and outlines planned extensions to open-ended action spaces. revision: partial
Referee: [Results] Results section: no information is given on how the open-source baselines were implemented or fine-tuned, nor on the parameter scale of the 'smaller scale' models that are said to approach frontier performance; this undermines assessment of the efficiency and outperformance claims.

Authors: We regret the omission. The revised Results section now specifies that open-source baselines consist of the identical model families (Llama-2-7B/13B, Mistral-7B) evaluated under standard zero-shot and few-shot prompting on the same MCQ dataset, with no additional fine-tuning. The 'smaller scale' models are explicitly the 7B–13B parameter variants; we compare them directly to 70B+ open models and closed frontier systems (GPT-4, Claude-3). Parameter counts, training steps, and compute are reported in the new Table 1 and accompanying text. revision: yes

Circularity Check

0 steps flagged

No circularity; empirical claims rest on newly constructed dataset and protocol without reduction to self-defined inputs

full rationale

The paper's derivation chain consists of proposing a new curated MCQ dataset derived from textbooks and historical markets, verified by AI committee, and a two-stage evaluation protocol (test-set MCQ evaluation plus MCQ-based chronological trading simulation). Performance claims are presented as results of applying this protocol to trained models and comparing against baselines, with no equations, fitted parameters, or self-citations that reduce the reported outcomes to quantities defined by the authors' own prior fits or definitions. The framework is self-contained as an empirical training and evaluation contribution; success on the simulation is measured directly rather than being equivalent to the curation process by construction.

Axiom & Free-Parameter Ledger

0 free parameters · 2 axioms · 0 invented entities

The central claim depends on the unproven premise that textbook MCQs plus historical market questions form a valid proxy for expert trading competence and that chronological MCQ ordering in simulation approximates real trading dynamics.

axioms (2)

domain assumption Performance on the curated financial MCQ dataset generalizes to competent trading decisions in live markets
This is the explicit justification for using MCQ accuracy and simulation results as evidence of real-world capability.
domain assumption An AI committee can reliably verify the correctness and quality of financial reasoning traces
Invoked to establish the dataset as high-quality ground truth.

pith-pipeline@v0.9.0 · 5459 in / 1335 out tokens · 48574 ms · 2026-05-10T06:21:00.155096+00:00 · methodology

discussion (0)

Reference graph

Works this paper leans on

5 extracted references · 5 canonical work pages · 4 internal anchors

[1]

FinBERT: Financial Sentiment Analysis with Pre-trained Language Models

Anthropic. (2025, February). Claude 3.7 sonnet system card [ReleasedFebruary23,2025].https://www.anthropic.com/ claude-3-7-sonnet-system-card Anthropic et al. (2025, September). https://assets.anthropic. com/m/12f214efcc2f457a/original/Claude-Sonnet-4-5- System-Card.pdf Araci, D. (2019). Finbert: Financial sentiment analy- sis with pre-trained language mo...

work page internal anchor Pith review arXiv 2025
[2]

Chen, Z., Li, S., Smiley, C., Ma, Z., Shah, S., & Wang, W. Y. (2022). Convfinqa: Exploring the chain of numerical rea- soning in conversational finance question answering.arXiv preprint arXiv:2210.03849. Comanici, G., Bieber, E., Schaekermann, M., Pasupat, I., Sachdeva, N., Dhillon, I., Blistein, M., Ram, O., Zhang, D., Rosen, E., et al. (2025). Gemini 2....

work page internal anchor Pith review doi:10.48550/arxiv.2412.19437 2022
[3]

Hsieh,C.-Y.,Li,C.-L.,Yeh,C.-K.,Nakhost,H.,Fujii,Y.,Rat- ner, A., Krishna, R., Lee, C.-Y., & Pfister, T. (2023). Dis- tilling step-by-step! outperforming larger language models withlesstrainingdataandsmallermodelsizes.Findingsof the Association for Computational Linguistics: ACL 2023, 8003–8017. Huang, A. H., Wang, H., & Yang, Y. (2023). Finbert: A large l...

work page internal anchor Pith review Pith/arXiv arXiv doi:10.48550/arxiv.2310 2023
[4]

Yang, Y., Tang, Y., & Tam, K. Y. (2023). Investlm: A large language model for investment using financial domain in- struction tuning. Yao, S., Yu, D., Zhao, J., Shafran, I., Griffiths, T., Cao, Y., & Narasimhan, K. (2023). Tree of thoughts: Deliberate prob- lemsolvingwithlargelanguagemodels.Advancesinneural information processing systems,36, 11809–11822. ...

work page internal anchor Pith review arXiv 2023
[5]

Zhang, C., Liu, X., Jin, M., Zhang, Z., Li, L., Wang, Z., Hua, W., Shu, D., Zhu, S., Jin, X., et al. (2024). When aimeetsfinance(stockagent):Largelanguagemodel-based stock trading in simulated real-world environments.arXiv preprint arXiv:2407.18957. Zhao, Y., Li, Y., Li, C., & Zhang, R. (2022). Multihiertt: Numerical reasoning over multi hierarchical tabu...

work page arXiv 2024