pith · machine review for the scientific record

arxiv: 2604.02921 · v2 · submitted 2026-04-03 · 💱 q-fin.GN · q-fin.TR

Recognition: no theorem link

Debiasing LLMs by Fine-tuning

Authors on Pith: no claims yet

Pith reviewed 2026-05-13 18:31 UTC · model grok-4.3

classification 💱 q-fin.GN q-fin.TR
keywords large language models · extrapolation bias · supervised fine-tuning · LoRA · forecasting · stock returns · debiasing · financial prediction

The pith

Supervised fine-tuning on rational forecasts reduces extrapolation bias in LLMs.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

Large language models tend to over-extrapolate when making predictions from limited data, leading to systematic errors. Previous methods relying on prompts have shown limited success in correcting this. The authors demonstrate that fine-tuning the model's parameters using datasets of rational forecasts can change the underlying mapping from inputs to outputs. This intervention proves effective both in controlled lab settings and when predicting stock returns from real-world data. The result is a straightforward, low-cost way to debias LLMs for forecasting applications.
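The mechanics of extrapolation bias can be illustrated with a toy simulation (our illustration, not one of the paper's experiments): on a mean-reverting series, a forecaster that projects the most recent change forward incurs a much larger error than the rational conditional-mean forecast.

```python
import numpy as np

# Illustrative sketch of extrapolation bias (not the paper's setup): on an
# AR(1) series, an "extrapolative" forecaster continues the last observed
# change, while the rational forecast uses the true autoregressive slope.
rng = np.random.default_rng(1)
phi, T = 0.5, 5000

y = np.zeros(T)
for t in range(1, T):
    y[t] = phi * y[t - 1] + rng.normal()

# Rational one-step forecast: phi * y_t.
rational = phi * y[1:-1]
# Extrapolative one-step forecast: y_t + (y_t - y_{t-1}).
extrapolative = y[1:-1] + (y[1:-1] - y[:-2])
realized = y[2:]

mse_rational = np.mean((realized - rational) ** 2)
mse_extrap = np.mean((realized - extrapolative) ** 2)
print(f"MSE rational={mse_rational:.2f}, extrapolative={mse_extrap:.2f}")
```

On a mean-reverting process the trend-continuation rule systematically overshoots, which is the error pattern the paper's intervention targets.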

Core claim

By training off-the-shelf LLMs with supervised fine-tuning via LoRA on instruction datasets built from rational benchmark forecasts, the models learn to map observed information into forecasts without the typical extrapolation bias. Evaluation in controlled forecasting experiments and cross-sectional stock return prediction shows that this approach corrects the bias out-of-sample, unlike prompt-based methods.

What carries the argument

Supervised fine-tuning (SFT) with Low-Rank Adaptation (LoRA) applied to instruction datasets from rational benchmark forecasts, which alters the parameter-level mapping from observations to predictions.
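The LoRA reparametrization named here can be sketched in a few lines (an illustrative toy, not the paper's training code): a frozen weight W is augmented with a trainable low-rank product, W_eff = W + (alpha/r)·BA, so only A and B are updated.

```python
import numpy as np

# Minimal sketch of the LoRA reparametrization (illustrative only).
# W is frozen; only the low-rank factors A and B would be trained.
rng = np.random.default_rng(0)
d_out, d_in, r, alpha = 64, 128, 4, 8

W = rng.standard_normal((d_out, d_in))      # frozen pretrained weight
A = rng.standard_normal((r, d_in)) * 0.01   # trainable down-projection
B = np.zeros((d_out, r))                    # trainable up-projection, init 0

def forward(x):
    """Adapted forward pass: base output plus scaled low-rank update."""
    return W @ x + (alpha / r) * (B @ (A @ x))

# With B initialized to zero, the adapter leaves the base mapping unchanged,
# so fine-tuning starts from the off-the-shelf model's behavior.
x = rng.standard_normal(d_in)
assert np.allclose(forward(x), W @ x)

# The trainable parameters are a small fraction of the full matrix.
full_params = W.size
lora_params = A.size + B.size
print(f"trainable fraction: {lora_params / full_params:.3f}")
```

This is what makes the approach "low-cost": the parameter-level intervention updates only the small factors A and B, not the full weight matrix.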

If this is right

  • The fine-tuned LLMs mitigate extrapolation bias in controlled forecasting experiments out-of-sample.
  • Fine-tuning improves accuracy in cross-sectional stock return prediction tasks.
  • This method provides a low-cost alternative to prompt engineering for debiasing LLMs.
  • It establishes a generalizable approach applicable to various forecasting scenarios.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the authors make directly.

  • Such parameter-level interventions might address other systematic biases in LLMs, such as overconfidence or anchoring.
  • Applying this fine-tuning to models used in other domains like medical diagnosis or climate prediction could yield similar improvements.
  • Future work could explore combining this with other techniques for even stronger debiasing effects.

Load-bearing premise

Instruction datasets built from rational benchmark forecasts provide transferable examples of unbiased mapping that generalize to real-world forecasting tasks without introducing new biases.

What would settle it

Whether the fine-tuned model still shows extrapolation bias when tested on forecasting tasks outside the distribution of the benchmark datasets used for training. Persistent bias there would reduce the result to dataset-specific imitation rather than genuine debiasing.

Figures

Figures reproduced from arXiv: 2604.02921 by Wenxi Jiang, Yutong Yan, Zhenyu Gao.

Figure 1
Figure 1. Our reparametrization. We only train A and B. view at source ↗
Original abstract

Prior research shows that large language models (LLMs) exhibit systematic extrapolation bias when forming predictions from both experimental and real-world data, and that prompt-based approaches appear limited in alleviating this bias. We propose a supervised fine-tuning (SFT) approach that uses Low-Rank Adaptation (LoRA) to train off-the-shelf LLMs on instruction datasets constructed from rational benchmark forecasts. By intervening at the parameter level, SFT changes how LLMs map observed information into forecasts and thereby mitigates extrapolation bias. We evaluate the fine-tuned model in two settings: controlled forecasting experiments and cross-sectional stock return prediction. In both settings, fine-tuning corrects the extrapolative bias out-of-sample, establishing a low-cost and generalizable method for debiasing LLMs.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Referee Report

2 major / 2 minor

Summary. The paper proposes using supervised fine-tuning (SFT) with Low-Rank Adaptation (LoRA) on instruction datasets built from rational benchmark forecasts to debias LLMs from extrapolation bias. It claims that this parameter-level intervention corrects the bias out-of-sample in both controlled forecasting experiments and cross-sectional stock-return prediction tasks, offering a low-cost generalizable alternative to prompt-based methods.

Significance. If the empirical results are shown to be robust, the work would demonstrate a practical way to mitigate a documented limitation of LLMs in forecasting, which is relevant for quantitative finance applications. The shift from prompt engineering to fine-tuning is a concrete contribution, but its value hinges on demonstrating that the correction generalizes beyond the training distribution rather than reflecting dataset-specific imitation.

major comments (2)
  1. [Abstract / Evaluation] Abstract and evaluation description: the claim of out-of-sample bias correction in two settings is presented without any reported statistical tests, error bars, sample sizes, or controls for confounding factors (e.g., differences in input distributions between training and test regimes). This leaves the central empirical claim unverified and makes it impossible to isolate the effect of the SFT intervention from potential memorization of benchmark patterns.
  2. [Method] Method description (instruction dataset construction): no details are given on how the rational benchmark forecasts are generated, including the underlying data-generating processes, noise structures, or domain coverage. Without an ablation that varies the extrapolation regime while holding the training distribution fixed, the transferability assumption cannot be tested and the risk of overfitting to narrow benchmark patterns remains unaddressed.
minor comments (2)
  1. [Method] Clarify the precise definition and construction of 'rational benchmark forecasts' to enable replication and to distinguish modeling choices from the debiasing effect.
  2. [Implementation details] Provide the LoRA rank, learning rate, number of training epochs, and exact prompt templates used, as these are necessary for assessing reproducibility.
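As a concrete instance of the statistical reporting the referee requests, a paired comparison of per-task bias metrics might look like the following sketch (the numbers are synthetic, and the bias metric and sample size are our assumptions, not the paper's):

```python
import numpy as np
from scipy import stats

# Sketch of the paired comparison the referee asks for. Synthetic data:
# per-task extrapolation-bias magnitudes are placeholders, not the paper's.
rng = np.random.default_rng(42)
n_tasks = 50

bias_base = np.abs(rng.normal(0.30, 0.10, n_tasks))  # base model
bias_sft = np.abs(rng.normal(0.10, 0.08, n_tasks))   # fine-tuned model

# Paired t-test and Wilcoxon signed-rank test on per-task differences.
t_stat, t_p = stats.ttest_rel(bias_base, bias_sft)
w_stat, w_p = stats.wilcoxon(bias_base, bias_sft)

# Standard error of the mean reduction, for an error bar.
diff = bias_base - bias_sft
sem = diff.std(ddof=1) / np.sqrt(n_tasks)
print(f"mean reduction {diff.mean():.3f} ± {sem:.3f}, "
      f"t p={t_p:.1e}, Wilcoxon p={w_p:.1e}")
```

Reporting both a parametric and a rank-based test, plus the standard error, addresses the missing error bars and guards against non-normal bias distributions.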

Simulated Author's Rebuttal

2 responses · 0 unresolved

We thank the referee for the constructive comments, which help clarify the presentation of our empirical results and methodological details. We address each point below and will revise the manuscript to incorporate additional statistical reporting, dataset construction specifics, and an ablation study.

Point-by-point responses
  1. Referee: [Abstract / Evaluation] Abstract and evaluation description: the claim of out-of-sample bias correction in two settings is presented without any reported statistical tests, error bars, sample sizes, or controls for confounding factors (e.g., differences in input distributions between training and test regimes). This leaves the central empirical claim unverified and makes it impossible to isolate the effect of the SFT intervention from potential memorization of benchmark patterns.

    Authors: We agree that statistical rigor is necessary to substantiate the out-of-sample claims. In the revised manuscript we will report exact sample sizes for each experiment, include error bars (standard errors across runs), and add formal statistical tests (paired t-tests and Wilcoxon signed-rank tests) comparing bias metrics before and after fine-tuning. We will also include supplementary analyses that quantify input distribution shifts (e.g., via Wasserstein distance and feature-wise Kolmogorov-Smirnov tests) between training and test regimes to help isolate the effect of the SFT intervention. revision: yes

  2. Referee: [Method] Method description (instruction dataset construction): no details are given on how the rational benchmark forecasts are generated, including the underlying data-generating processes, noise structures, or domain coverage. Without an ablation that varies the extrapolation regime while holding the training distribution fixed, the transferability assumption cannot be tested and the risk of overfitting to narrow benchmark patterns remains unaddressed.

    Authors: We will expand the Methods section with a new subsection detailing the construction of the rational benchmark forecasts. The forecasts are produced by fitting ordinary least-squares linear models on historical sequences drawn from the respective domains, with additive Gaussian noise whose variance is calibrated to the empirical residual variance observed in each domain. Domain coverage includes both synthetic experimental settings (linear and mildly nonlinear DGPs) and real financial time series. To directly test transferability, we will add an ablation that holds the training distribution fixed while systematically increasing the extrapolation distance on the test set (measured by the distance of test inputs from the training support); results of this ablation will be reported in the revised paper. revision: yes
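The construction the rebuttal describes (OLS fits on lagged historical values, with Gaussian noise calibrated to the empirical residual variance) can be sketched as follows; the lag order and the synthetic series here are our assumptions:

```python
import numpy as np

# Sketch of building one "rational benchmark forecast": fit OLS on lagged
# values of a historical sequence, calibrate noise to residual variance,
# and forecast one step ahead. Lag order and series are illustrative.
rng = np.random.default_rng(7)
T, n_lags = 200, 3

# Synthetic historical sequence standing in for a domain series.
y = np.zeros(T)
for t in range(1, T):
    y[t] = 0.6 * y[t - 1] + rng.normal(0, 1.0)

# Design matrix of lagged observations (column k holds lag k+1).
X = np.column_stack([y[n_lags - k - 1 : T - k - 1] for k in range(n_lags)])
X = np.column_stack([np.ones(len(X)), X])  # intercept
target = y[n_lags:]

beta, *_ = np.linalg.lstsq(X, target, rcond=None)

# Residual variance calibrates the additive Gaussian noise (per rebuttal).
resid = target - X @ beta
sigma = resid.std(ddof=X.shape[1])

# One-step-ahead rational forecast from the most recent n_lags observations.
x_new = np.concatenate([[1.0], y[-1 : -n_lags - 1 : -1]])
forecast = x_new @ beta
print(f"forecast={forecast:.3f}, calibrated sigma={sigma:.3f}")
```

Pairs of (observed history, rational forecast) produced this way would then be formatted as instruction examples for the SFT stage.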

Circularity Check

0 steps flagged

Empirical SFT method with no derivation chain or self-referential reduction

Full rationale

The paper describes an empirical procedure: construct instruction datasets from rational benchmark forecasts, apply LoRA-based SFT to off-the-shelf LLMs, and evaluate out-of-sample correction of extrapolation bias in controlled experiments and cross-sectional stock-return prediction. No equations, uniqueness theorems, or ansatzes are presented that would reduce the claimed mapping change to fitted parameters or prior self-citations by construction. The central claim rests on observable performance differences between base and fine-tuned models on held-out data that standard supervised-learning practice keeps disjoint from the training data. No load-bearing self-citation or renaming of known results appears in the provided text.

Axiom & Free-Parameter Ledger

0 free parameters · 2 axioms · 0 invented entities

Based on the abstract alone; the central claim rests on the assumption that extrapolation bias is primarily a parameter-mapping issue correctable by fine-tuning on curated rational examples, with no explicit free parameters or invented entities stated.

axioms (2)
  • domain assumption LLMs exhibit systematic extrapolation bias when forming predictions from experimental and real-world data
    Stated as prior research finding that motivates the intervention
  • domain assumption Prompt-based approaches are limited in alleviating extrapolation bias
    Used to justify shifting to parameter-level fine-tuning

pith-pipeline@v0.9.0 · 5420 in / 1210 out tokens · 41486 ms · 2026-05-13T18:31:25.840865+00:00 · methodology

discussion (0)

