Recognition: no theorem link
Debiasing LLMs by Fine-tuning
Pith reviewed 2026-05-13 18:31 UTC · model grok-4.3
The pith
Supervised fine-tuning on rational forecasts reduces extrapolation bias in LLMs.
A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.
Core claim
By training off-the-shelf LLMs with supervised fine-tuning via LoRA on instruction datasets built from rational benchmark forecasts, the models learn to map observed information into forecasts without the typical extrapolation bias. Evaluation in controlled forecasting experiments and cross-sectional stock return prediction shows that this approach corrects the bias out-of-sample, unlike prompt-based methods.
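The dataset construction behind this claim can be made concrete with a minimal sketch. The prompt wording, field names, and mean-based "rational" benchmark below are illustrative assumptions, not the paper's actual construction:

```python
import json

def make_instruction_example(history, rational_forecast):
    """Turn one observation sequence and its rational benchmark forecast
    into a supervised instruction/response pair. Prompt wording and field
    names are illustrative, not taken from the paper."""
    prompt = (
        "You observe the following sequence of past values: "
        + ", ".join(f"{x:.2f}" for x in history)
        + ". Forecast the next value."
    )
    return {"instruction": prompt, "response": f"{rational_forecast:.2f}"}

# A toy "rational" benchmark: the sample mean of the history, which
# ignores the recent trend instead of extrapolating it.
history = [1.0, 2.0, 3.0, 4.0]
rational = sum(history) / len(history)
example = make_instruction_example(history, rational)
print(json.dumps(example))
```

Fine-tuning on many such pairs is what, per the abstract, shifts the model's parameter-level mapping from observations to forecasts.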
What carries the argument
Supervised fine-tuning (SFT) with Low-Rank Adaptation (LoRA) applied to instruction datasets from rational benchmark forecasts, which alters the parameter-level mapping from observations to predictions.
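The parameter-level intervention can be sketched with the LoRA update rule itself. The dimensions below are toy values chosen for illustration:

```python
import numpy as np

rng = np.random.default_rng(0)

# LoRA replaces a full-rank weight update dW with a low-rank
# factorization: W' = W + (alpha / r) * B @ A, where A is (r x d_in)
# and B is (d_out x r). Only A and B are trained; W stays frozen.
d_in, d_out, r, alpha = 8, 6, 2, 4.0

W = rng.normal(size=(d_out, d_in))   # frozen pretrained weight
A = rng.normal(size=(r, d_in))       # trainable down-projection
B = np.zeros((d_out, r))             # trainable up-projection (zero-init)

W_adapted = W + (alpha / r) * B @ A

# With B initialized to zero, the adapted layer starts identical to the
# pretrained one; fine-tuning then moves only r*(d_in + d_out) parameters.
print(np.allclose(W_adapted, W))     # identical at initialization
trainable = A.size + B.size
full = W.size
print(trainable, full)               # 28 vs 48 here; the gap grows with d
```

This is why the approach is low-cost: the number of trained parameters scales with the rank `r`, not with the full weight dimensions.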
If this is right
- The fine-tuned LLMs mitigate extrapolation bias in controlled forecasting experiments out-of-sample.
- Fine-tuning improves accuracy in cross-sectional stock return prediction tasks.
- This method provides a low-cost alternative to prompt engineering for debiasing LLMs.
- It establishes a generalizable approach applicable to various forecasting scenarios.
Where Pith is reading between the lines
- Such parameter-level interventions might address other systematic biases in LLMs, such as overconfidence or anchoring.
- Applying this fine-tuning to models used in other domains like medical diagnosis or climate prediction could yield similar improvements.
- Future work could explore combining this with other techniques for even stronger debiasing effects.
Load-bearing premise
Instruction datasets built from rational benchmark forecasts provide transferable examples of unbiased mapping that generalize to real-world forecasting tasks without introducing new biases.
What would settle it
The claim would be undermined if the fine-tuned model continues to show extrapolation bias when tested on forecasting tasks outside the distribution of the benchmark datasets used for training.
Original abstract
Prior research shows that large language models (LLMs) exhibit systematic extrapolation bias when forming predictions from both experimental and real-world data, and that prompt-based approaches appear limited in alleviating this bias. We propose a supervised fine-tuning (SFT) approach that uses Low-Rank Adaptation (LoRA) to train off-the-shelf LLMs on instruction datasets constructed from rational benchmark forecasts. By intervening at the parameter level, SFT changes how LLMs map observed information into forecasts and thereby mitigates extrapolation bias. We evaluate the fine-tuned model in two settings: controlled forecasting experiments and cross-sectional stock return prediction. In both settings, fine-tuning corrects the extrapolative bias out-of-sample, establishing a low-cost and generalizable method for debiasing LLMs.
Editorial analysis
A structured set of objections, weighed in public.
Referee Report
Summary. The paper proposes using supervised fine-tuning (SFT) with Low-Rank Adaptation (LoRA) on instruction datasets built from rational benchmark forecasts to debias LLMs from extrapolation bias. It claims that this parameter-level intervention corrects the bias out-of-sample in both controlled forecasting experiments and cross-sectional stock-return prediction tasks, offering a low-cost generalizable alternative to prompt-based methods.
Significance. If the empirical results are shown to be robust, the work would demonstrate a practical way to mitigate a documented limitation of LLMs in forecasting, which is relevant for quantitative finance applications. The shift from prompt engineering to fine-tuning is a concrete contribution, but its value hinges on demonstrating that the correction generalizes beyond the training distribution rather than reflecting dataset-specific imitation.
major comments (2)
- [Abstract / Evaluation] Abstract and evaluation description: the claim of out-of-sample bias correction in two settings is presented without any reported statistical tests, error bars, sample sizes, or controls for confounding factors (e.g., differences in input distributions between training and test regimes). This leaves the central empirical claim unverified and makes it impossible to isolate the effect of the SFT intervention from potential memorization of benchmark patterns.
- [Method] Method description (instruction dataset construction): no details are given on how the rational benchmark forecasts are generated, including the underlying data-generating processes, noise structures, or domain coverage. Without an ablation that varies the extrapolation regime while holding the training distribution fixed, the transferability assumption cannot be tested and the risk of overfitting to narrow benchmark patterns remains unaddressed.
minor comments (2)
- [Method] Clarify the precise definition and construction of 'rational benchmark forecasts' to enable replication and to distinguish modeling choices from the debiasing effect.
- [Implementation details] Provide the LoRA rank, learning rate, number of training epochs, and exact prompt templates used, as these are necessary for assessing reproducibility.
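To illustrate what such a reproducibility appendix would need to contain, a hypothetical specification might look like the following. Every value below is a placeholder standing in for what the referee requests, not a number reported in the paper:

```python
# Hypothetical reproducibility spec: all values are placeholders
# illustrating the referee's request, not the paper's settings.
lora_spec = {
    "lora_rank": 16,
    "lora_alpha": 32,
    "target_modules": ["q_proj", "v_proj"],
    "learning_rate": 2e-4,
    "epochs": 3,
    "prompt_template": "History: {history}\nForecast the next value:",
}

for key, value in lora_spec.items():
    print(f"{key}: {value}")
```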
Simulated Author's Rebuttal
We thank the referee for the constructive comments, which help clarify the presentation of our empirical results and methodological details. We address each point below and will revise the manuscript to incorporate additional statistical reporting, dataset construction specifics, and an ablation study.
Point-by-point responses
Referee: [Abstract / Evaluation] Abstract and evaluation description: the claim of out-of-sample bias correction in two settings is presented without any reported statistical tests, error bars, sample sizes, or controls for confounding factors (e.g., differences in input distributions between training and test regimes). This leaves the central empirical claim unverified and makes it impossible to isolate the effect of the SFT intervention from potential memorization of benchmark patterns.
Authors: We agree that statistical rigor is necessary to substantiate the out-of-sample claims. In the revised manuscript we will report exact sample sizes for each experiment, include error bars (standard errors across runs), and add formal statistical tests (paired t-tests and Wilcoxon signed-rank tests) comparing bias metrics before and after fine-tuning. We will also include supplementary analyses that quantify input distribution shifts (e.g., via Wasserstein distance and feature-wise Kolmogorov-Smirnov tests) between training and test regimes to help isolate the effect of the SFT intervention. revision: yes
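The proposed analyses are standard; a numpy-only sketch on synthetic paired errors shows the shape of the computation (in practice `scipy.stats.ttest_rel`, `wilcoxon`, `ks_2samp`, and `wasserstein_distance` would do this off the shelf). The error magnitudes and feature distributions below are invented for illustration:

```python
import numpy as np

rng = np.random.default_rng(1)
n = 200

# Toy paired design: absolute forecast error per held-out item,
# before (base) and after (sft) fine-tuning, on the same items.
err_base = rng.normal(1.0, 0.3, size=n)
err_sft = err_base - rng.normal(0.4, 0.2, size=n)

# Paired t statistic on the per-item error differences.
diff = err_base - err_sft
t_stat = diff.mean() / (diff.std(ddof=1) / np.sqrt(n))

# One-dimensional Wasserstein distance between training and test input
# features; for equal sample sizes it is the mean gap of sorted samples.
train_feat = rng.normal(0.0, 1.0, size=500)
test_feat = rng.normal(0.5, 1.0, size=500)
w_dist = np.abs(np.sort(train_feat) - np.sort(test_feat)).mean()

print(t_stat > 2.0)   # well beyond the ~1.97 critical value at the 5% level
print(w_dist > 0.0)   # nonzero shift between training and test inputs
```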
Referee: [Method] Method description (instruction dataset construction): no details are given on how the rational benchmark forecasts are generated, including the underlying data-generating processes, noise structures, or domain coverage. Without an ablation that varies the extrapolation regime while holding the training distribution fixed, the transferability assumption cannot be tested and the risk of overfitting to narrow benchmark patterns remains unaddressed.
Authors: We will expand the Methods section with a new subsection detailing the construction of the rational benchmark forecasts. The forecasts are produced by fitting ordinary least-squares linear models on historical sequences drawn from the respective domains, with additive Gaussian noise whose variance is calibrated to the empirical residual variance observed in each domain. Domain coverage includes both synthetic experimental settings (linear and mildly nonlinear DGPs) and real financial time series. To directly test transferability, we will add an ablation that holds the training distribution fixed while systematically increasing the extrapolation distance on the test set (measured by the distance of test inputs from the training support); results of this ablation will be reported in the revised paper. revision: yes
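The described construction and ablation can be sketched end to end. The AR(1) data-generating process and the nearest-neighbor distance measure below are assumptions for illustration; the rebuttal specifies only OLS on historical sequences with calibrated Gaussian noise:

```python
import numpy as np

rng = np.random.default_rng(2)

# Stand-in DGP: an AR(1) process with additive Gaussian noise
# (the paper's actual DGPs are not specified in the provided text).
phi, sigma = 0.6, 0.5
x = np.empty(300)
x[0] = 0.0
for t in range(1, 300):
    x[t] = phi * x[t - 1] + rng.normal(0.0, sigma)

# Rational benchmark forecast: OLS regression of x_{t+1} on x_t.
X = np.column_stack([np.ones(299), x[:-1]])
y = x[1:]
beta, *_ = np.linalg.lstsq(X, y, rcond=None)

def rational_forecast(last_obs):
    return beta[0] + beta[1] * last_obs

# Extrapolation distance of a test input from the training support,
# the quantity the proposed ablation would vary: distance to the
# nearest training observation.
def extrapolation_distance(test_obs):
    return np.min(np.abs(x[:-1] - test_obs))

far_out = 10.0 * x.std() + x.max()      # deliberately outside the support
print(extrapolation_distance(x[150]) == 0.0)   # a training point itself
print(extrapolation_distance(far_out) > 0.0)   # beyond the support
```

Sweeping test inputs from zero distance outward, while the training set stays fixed, is exactly the ablation the authors promise.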
Circularity Check
Empirical SFT method with no derivation chain or self-referential reduction
full rationale
The paper describes an empirical procedure: construct instruction datasets from rational benchmark forecasts, apply LoRA-based SFT to off-the-shelf LLMs, and evaluate out-of-sample correction of extrapolation bias in controlled experiments and cross-sectional stock-return prediction. No equations, uniqueness theorems, or ansatzes are presented that would reduce the claimed mapping change to fitted parameters or prior self-citations by construction. The central claim rests on observable performance differences between base and fine-tuned models on held-out data, so the argument is not circular provided the held-out data genuinely lie outside the training distribution. No load-bearing self-citation or renaming of known results appears in the provided text.
Axiom & Free-Parameter Ledger
axioms (2)
- domain assumption: LLMs exhibit systematic extrapolation bias when forming predictions from experimental and real-world data
- domain assumption: Prompt-based approaches are limited in alleviating extrapolation bias