Recognition: no theorem link
Debiasing LLMs by Fine-tuning
Pith reviewed 2026-05-13 18:31 UTC · model grok-4.3
The pith
Supervised fine-tuning on rational forecasts reduces extrapolation bias in LLMs.
A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.
Core claim
By training off-the-shelf LLMs with supervised fine-tuning via LoRA on instruction datasets built from rational benchmark forecasts, the models learn to map observed information into forecasts without the typical extrapolation bias. Evaluation in controlled forecasting experiments and cross-sectional stock return prediction shows that this approach corrects the bias out-of-sample, unlike prompt-based methods.
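The dataset construction behind this claim can be made concrete with a minimal sketch. The prompt wording, field names, and mean-based "rational" benchmark below are illustrative assumptions, not the paper's actual construction:

```python
import json

def make_instruction_example(history, rational_forecast):
    """Turn one observation sequence and its rational benchmark forecast
    into a supervised instruction/response pair. Prompt wording and field
    names are illustrative, not taken from the paper."""
    prompt = (
        "You observe the following sequence of past values: "
        + ", ".join(f"{x:.2f}" for x in history)
        + ". Forecast the next value."
    )
    return {"instruction": prompt, "response": f"{rational_forecast:.2f}"}

# A toy "rational" benchmark: the sample mean of the history, which
# ignores the recent trend instead of extrapolating it.
history = [1.0, 2.0, 3.0, 4.0]
rational = sum(history) / len(history)
example = make_instruction_example(history, rational)
print(json.dumps(example))
```

Fine-tuning on many such pairs is what, per the abstract, shifts the model's parameter-level mapping from observations to forecasts.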
What carries the argument
Supervised fine-tuning (SFT) with Low-Rank Adaptation (LoRA) applied to instruction datasets from rational benchmark forecasts, which alters the parameter-level mapping from observations to predictions.
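The parameter-level intervention can be sketched with the LoRA update rule itself. The dimensions below are toy values chosen for illustration:

```python
import numpy as np

rng = np.random.default_rng(0)

# LoRA replaces a full-rank weight update dW with a low-rank
# factorization: W' = W + (alpha / r) * B @ A, where A is (r x d_in)
# and B is (d_out x r). Only A and B are trained; W stays frozen.
d_in, d_out, r, alpha = 8, 6, 2, 4.0

W = rng.normal(size=(d_out, d_in))   # frozen pretrained weight
A = rng.normal(size=(r, d_in))       # trainable down-projection
B = np.zeros((d_out, r))             # trainable up-projection (zero-init)

W_adapted = W + (alpha / r) * B @ A

# With B initialized to zero, the adapted layer starts identical to the
# pretrained one; fine-tuning then moves only r*(d_in + d_out) parameters.
print(np.allclose(W_adapted, W))     # identical at initialization
trainable = A.size + B.size
full = W.size
print(trainable, full)               # 28 vs 48 here; the gap grows with d
```

This is why the approach is low-cost: the number of trained parameters scales with the rank `r`, not with the full weight dimensions.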
If this is right
- The fine-tuned LLMs mitigate extrapolation bias in controlled forecasting experiments out-of-sample.
- Fine-tuning improves accuracy in cross-sectional stock return prediction tasks.
- This method provides a low-cost alternative to prompt engineering for debiasing LLMs.
- It establishes a generalizable approach applicable to various forecasting scenarios.
Where Pith is reading between the lines
- Such parameter-level interventions might address other systematic biases in LLMs, such as overconfidence or anchoring.
- Applying this fine-tuning to models used in other domains like medical diagnosis or climate prediction could yield similar improvements.
- Future work could explore combining this with other techniques for even stronger debiasing effects.
Load-bearing premise
Instruction datasets built from rational benchmark forecasts provide transferable examples of unbiased mapping that generalize to real-world forecasting tasks without introducing new biases.
What would settle it
The claim would be undermined if the fine-tuned model continues to show extrapolation bias when tested on forecasting tasks outside the distribution of the benchmark datasets used for training.
Original abstract
Prior research shows that large language models (LLMs) exhibit systematic extrapolation bias when forming predictions from both experimental and real-world data, and that prompt-based approaches appear limited in alleviating this bias. We propose a supervised fine-tuning (SFT) approach that uses Low-Rank Adaptation (LoRA) to train off-the-shelf LLMs on instruction datasets constructed from rational benchmark forecasts. By intervening at the parameter level, SFT changes how LLMs map observed information into forecasts and thereby mitigates extrapolation bias. We evaluate the fine-tuned model in two settings: controlled forecasting experiments and cross-sectional stock return prediction. In both settings, fine-tuning corrects the extrapolative bias out-of-sample, establishing a low-cost and generalizable method for debiasing LLMs.
Editorial analysis
A structured set of objections, weighed in public.
Referee Report
Summary. The paper proposes using supervised fine-tuning (SFT) with Low-Rank Adaptation (LoRA) on instruction datasets built from rational benchmark forecasts to debias LLMs from extrapolation bias. It claims that this parameter-level intervention corrects the bias out-of-sample in both controlled forecasting experiments and cross-sectional stock-return prediction tasks, offering a low-cost generalizable alternative to prompt-based methods.
Significance. If the empirical results are shown to be robust, the work would demonstrate a practical way to mitigate a documented limitation of LLMs in forecasting, which is relevant for quantitative finance applications. The shift from prompt engineering to fine-tuning is a concrete contribution, but its value hinges on demonstrating that the correction generalizes beyond the training distribution rather than reflecting dataset-specific imitation.
major comments (2)
- [Abstract / Evaluation] Abstract and evaluation description: the claim of out-of-sample bias correction in two settings is presented without any reported statistical tests, error bars, sample sizes, or controls for confounding factors (e.g., differences in input distributions between training and test regimes). This leaves the central empirical claim unverified and makes it impossible to isolate the effect of the SFT intervention from potential memorization of benchmark patterns.
- [Method] Method description (instruction dataset construction): no details are given on how the rational benchmark forecasts are generated, including the underlying data-generating processes, noise structures, or domain coverage. Without an ablation that varies the extrapolation regime while holding the training distribution fixed, the transferability assumption cannot be tested and the risk of overfitting to narrow benchmark patterns remains unaddressed.
minor comments (2)
- [Method] Clarify the precise definition and construction of 'rational benchmark forecasts' to enable replication and to distinguish modeling choices from the debiasing effect.
- [Implementation details] Provide the LoRA rank, learning rate, number of training epochs, and exact prompt templates used, as these are necessary for assessing reproducibility.
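To illustrate what such a reproducibility appendix would need to contain, a hypothetical specification might look like the following. Every value below is a placeholder standing in for what the referee requests, not a number reported in the paper:

```python
# Hypothetical reproducibility spec: all values are placeholders
# illustrating the referee's request, not the paper's settings.
lora_spec = {
    "lora_rank": 16,
    "lora_alpha": 32,
    "target_modules": ["q_proj", "v_proj"],
    "learning_rate": 2e-4,
    "epochs": 3,
    "prompt_template": "History: {history}\nForecast the next value:",
}

for key, value in lora_spec.items():
    print(f"{key}: {value}")
```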
Simulated Author's Rebuttal
We thank the referee for the constructive comments, which help clarify the presentation of our empirical results and methodological details. We address each point below and will revise the manuscript to incorporate additional statistical reporting, dataset construction specifics, and an ablation study.
Point-by-point responses
Referee: [Abstract / Evaluation] Abstract and evaluation description: the claim of out-of-sample bias correction in two settings is presented without any reported statistical tests, error bars, sample sizes, or controls for confounding factors (e.g., differences in input distributions between training and test regimes). This leaves the central empirical claim unverified and makes it impossible to isolate the effect of the SFT intervention from potential memorization of benchmark patterns.
Authors: We agree that statistical rigor is necessary to substantiate the out-of-sample claims. In the revised manuscript we will report exact sample sizes for each experiment, include error bars (standard errors across runs), and add formal statistical tests (paired t-tests and Wilcoxon signed-rank tests) comparing bias metrics before and after fine-tuning. We will also include supplementary analyses that quantify input distribution shifts (e.g., via Wasserstein distance and feature-wise Kolmogorov-Smirnov tests) between training and test regimes to help isolate the effect of the SFT intervention. revision: yes
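The proposed analyses are standard; a numpy-only sketch on synthetic paired errors shows the shape of the computation (in practice `scipy.stats.ttest_rel`, `wilcoxon`, `ks_2samp`, and `wasserstein_distance` would do this off the shelf). The error magnitudes and feature distributions below are invented for illustration:

```python
import numpy as np

rng = np.random.default_rng(1)
n = 200

# Toy paired design: absolute forecast error per held-out item,
# before (base) and after (sft) fine-tuning, on the same items.
err_base = rng.normal(1.0, 0.3, size=n)
err_sft = err_base - rng.normal(0.4, 0.2, size=n)

# Paired t statistic on the per-item error differences.
diff = err_base - err_sft
t_stat = diff.mean() / (diff.std(ddof=1) / np.sqrt(n))

# One-dimensional Wasserstein distance between training and test input
# features; for equal sample sizes it is the mean gap of sorted samples.
train_feat = rng.normal(0.0, 1.0, size=500)
test_feat = rng.normal(0.5, 1.0, size=500)
w_dist = np.abs(np.sort(train_feat) - np.sort(test_feat)).mean()

print(t_stat > 2.0)   # well beyond the ~1.97 critical value at the 5% level
print(w_dist > 0.0)   # nonzero shift between training and test inputs
```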
Referee: [Method] Method description (instruction dataset construction): no details are given on how the rational benchmark forecasts are generated, including the underlying data-generating processes, noise structures, or domain coverage. Without an ablation that varies the extrapolation regime while holding the training distribution fixed, the transferability assumption cannot be tested and the risk of overfitting to narrow benchmark patterns remains unaddressed.
Authors: We will expand the Methods section with a new subsection detailing the construction of the rational benchmark forecasts. The forecasts are produced by fitting ordinary least-squares linear models on historical sequences drawn from the respective domains, with additive Gaussian noise whose variance is calibrated to the empirical residual variance observed in each domain. Domain coverage includes both synthetic experimental settings (linear and mildly nonlinear DGPs) and real financial time series. To directly test transferability, we will add an ablation that holds the training distribution fixed while systematically increasing the extrapolation distance on the test set (measured by the distance of test inputs from the training support); results of this ablation will be reported in the revised paper. revision: yes
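The described construction and ablation can be sketched end to end. The AR(1) data-generating process and the nearest-neighbor distance measure below are assumptions for illustration; the rebuttal specifies only OLS on historical sequences with calibrated Gaussian noise:

```python
import numpy as np

rng = np.random.default_rng(2)

# Stand-in DGP: an AR(1) process with additive Gaussian noise
# (the paper's actual DGPs are not specified in the provided text).
phi, sigma = 0.6, 0.5
x = np.empty(300)
x[0] = 0.0
for t in range(1, 300):
    x[t] = phi * x[t - 1] + rng.normal(0.0, sigma)

# Rational benchmark forecast: OLS regression of x_{t+1} on x_t.
X = np.column_stack([np.ones(299), x[:-1]])
y = x[1:]
beta, *_ = np.linalg.lstsq(X, y, rcond=None)

def rational_forecast(last_obs):
    return beta[0] + beta[1] * last_obs

# Extrapolation distance of a test input from the training support,
# the quantity the proposed ablation would vary: distance to the
# nearest training observation.
def extrapolation_distance(test_obs):
    return np.min(np.abs(x[:-1] - test_obs))

far_out = 10.0 * x.std() + x.max()      # deliberately outside the support
print(extrapolation_distance(x[150]) == 0.0)   # a training point itself
print(extrapolation_distance(far_out) > 0.0)   # beyond the support
```

Sweeping test inputs from zero distance outward, while the training set stays fixed, is exactly the ablation the authors promise.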
Circularity Check
Empirical SFT method with no derivation chain or self-referential reduction
full rationale
The paper describes an empirical procedure: construct instruction datasets from rational benchmark forecasts, apply LoRA-based SFT to off-the-shelf LLMs, and evaluate out-of-sample correction of extrapolation bias in controlled experiments and cross-sectional stock-return prediction. No equations, uniqueness theorems, or ansatzes are presented that would reduce the claimed mapping change to fitted parameters or prior self-citations by construction. The central claim rests on observable performance differences between base and fine-tuned models on held-out data, so the argument is not circular provided the held-out data genuinely lie outside the training distribution. No load-bearing self-citation or renaming of known results appears in the provided text.
Axiom & Free-Parameter Ledger
axioms (2)
- domain assumption: LLMs exhibit systematic extrapolation bias when forming predictions from experimental and real-world data
- domain assumption: Prompt-based approaches are limited in alleviating extrapolation bias