Recognition: 2 theorem links
Fin-Bias: Comprehensive Evaluation for LLM Decision-Making under Human Bias in Finance Domain
Pith reviewed 2026-05-12 02:07 UTC · model grok-4.3
The pith
Large language models tend to follow explicit investment ratings from analyst reports, even when those ratings are fabricated.
A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.
Core claim
LLMs exhibit herding toward explicit biases in financial contexts, as demonstrated by shifts in their generated investment ratings when analyst ratings are added or faked in long firm reports. A new detection method for human opinions in context enables LLMs to generate more independent ratings, with some models achieving higher accuracy than humans in forecasting stock returns.
What carries the argument
The Fin-Bias benchmark of 8868 analyst reports, each presented in three conditions — without ratings, with real analyst ratings, and with fabricated investment ratings (Bullish/Neutral/Bearish) — plus a method that detects human opinions in context to promote independent LLM reasoning.
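The three conditions can be sketched as prompt templates. This is a minimal illustration, not the authors' actual prompts, which the review does not reproduce; the wording and field names below are assumptions.

```python
# Hypothetical sketch of the three Fin-Bias prompt conditions.
# The question wording and layout are illustrative assumptions,
# not the paper's actual templates.
RATINGS = ("Bullish", "Neutral", "Bearish")

def build_prompts(report_text: str, real_rating: str, fake_rating: str) -> dict:
    """Build the report-only, real-rating, and fake-rating prompts for one report."""
    question = "Give an investment rating (Bullish/Neutral/Bearish)."
    return {
        "report_only": f"Analyst report:\n{report_text}\n\n{question}",
        "real_rating": f"Analyst report:\n{report_text}\nAnalyst rating: {real_rating}\n\n{question}",
        "fake_rating": f"Analyst report:\n{report_text}\nAnalyst rating: {fake_rating}\n\n{question}",
    }
```

Holding the report text fixed across all three prompts is what lets rating shifts be attributed to the injected rating rather than to the report itself.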
If this is right
- LLMs are vulnerable to explicit bias in financial decision contexts.
- Models can be made more independent using opinion detection techniques.
- Some LLMs with this adjustment can exceed human accuracy in predicting stock returns.
- The benchmark provides a way to evaluate LLM reliability in uncertain financial scenarios.
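The claim that some models exceed human accuracy presupposes a scoring rule against realized returns. One common choice is directional accuracy; the sketch below assumes that rule and an arbitrary neutral band, since the review does not specify how the paper scores predictions.

```python
def directional_accuracy(ratings, realized_returns, neutral_band=0.01):
    """Score a rating as correct when its direction matches the realized return.

    Neutral counts as correct when the absolute return stays inside a small
    band; the band width here is an assumption, not a value from the paper.
    """
    sign = {"Bullish": 1, "Neutral": 0, "Bearish": -1}
    hits = 0
    for rating, ret in zip(ratings, realized_returns):
        s = sign[rating]
        hits += (abs(ret) < neutral_band) if s == 0 else (s * ret > 0)
    return hits / len(ratings)
```

Comparing this score for model ratings versus human analyst ratings on the same reports is what the human-baseline claim would require.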
Where Pith is reading between the lines
- Such herding could limit the use of LLMs for unbiased financial advice without additional safeguards.
- The detection method might generalize to reduce bias following in other expert-opinion heavy domains.
- Testing the benchmark on newer models could reveal if larger models are less or more susceptible to this effect.
Load-bearing premise
Differences in LLM-generated ratings are caused by the models following human bias in the context rather than by changes in prompt length, report structure, or other unrelated factors.
What would settle it
Running the same reports through LLMs with and without the analyst ratings and observing no significant difference in the models' own ratings would indicate that herding is not occurring.
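The decisive comparison reduces to two simple quantities: how often the model's rating matches the injected rating (the herding score 1/N Σ I(m_i, a_i) quoted from the paper), and how often adding the rating actually flips the model toward it. A minimal sketch, with function names of my own choosing:

```python
def herding_rate(model_ratings, injected_ratings):
    """Fraction of model ratings that match the injected analyst rating,
    i.e. the herding score 1/N * sum I(m_i, a_i) quoted from the paper."""
    pairs = list(zip(model_ratings, injected_ratings))
    return sum(m == a for m, a in pairs) / len(pairs)

def flip_rate(ratings_alone, ratings_with_bias, injected_ratings):
    """Fraction of items where adding the rating flipped the model toward it:
    the no-rating answer differed, and the with-rating answer matches."""
    flips = sum(
        m0 != m1 and m1 == a
        for m0, m1, a in zip(ratings_alone, ratings_with_bias, injected_ratings)
    )
    return flips / len(injected_ratings)
```

A flip rate near zero across conditions would indicate no herding; a high flip rate toward fake ratings, surviving the controls the referee asks for, would confirm it.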
Original abstract
Large language models (LLMs) are increasingly deployed in financial contexts, raising critical concerns about reliability, alignment, and susceptibility to adversarial manipulation. While prior finance-related benchmarks assess LLMs' capabilities in stock trading, they are often restricted to small sample and fail to demonstrate LLM susceptibility to context with potential human bias. We introduce Fin-Bias (financial herding under long and uncertain financial context), a benchmark for evaluating LLM investment decision-making when faced with uncertainty and possible human-biased opinions. Fin-Bias includes 8868 long firm-specific analyst reports, including firm aspects summarized and analyzed by sophisticated analysts with investment ratings (Bullish/Neutral/Bearish) spanning from various industries. We present large language models with firm analyst reports with/without analyst investment ratings and even with 'fake' rating, to get investment ratings generated by LLMs. Our results reveal that LLMs tend to herd the explicit bias in context. We also develop a method to detect potential human opinions, which can encourage LLMs to think independently, some models even exceed human performance in predicting future stock return.
Editorial analysis
A structured set of objections, weighed in public.
Referee Report
Summary. The manuscript introduces Fin-Bias, a benchmark of 8868 long firm-specific analyst reports spanning industries, to evaluate LLM investment decision-making under uncertainty and explicit human bias. LLMs are tested on reports alone, reports plus real analyst ratings (Bullish/Neutral/Bearish), and reports plus fake ratings; the central claim is that LLMs herd the provided bias. A secondary method is proposed to detect human opinions and encourage independent thinking, with the assertion that some models exceed human performance in predicting future stock returns.
Significance. If the attribution to bias herding can be isolated from prompt artifacts and supported by statistical evidence, the work would usefully extend existing finance LLM benchmarks by focusing on long contexts and explicit bias susceptibility, with practical implications for reliable deployment in financial decision systems.
major comments (2)
- [Abstract] The directional claims that 'LLMs tend to herd the explicit bias in context' and that 'some models even exceed human performance in predicting future stock return' are presented without metrics, statistical tests, baseline comparisons, error bars, or any description of how herding or return-prediction accuracy is quantified. This absence is load-bearing for the central empirical results.
- [Experimental Setup] Experimental design (comparison of reports-alone vs. reports-plus-real-ratings vs. reports-plus-fake-ratings): the three conditions necessarily alter prompt length, token count, lexical content, and report structure. No controls (e.g., length-matched neutral insertions or randomized non-bias text) are described, so observed rating shifts cannot be confidently attributed to susceptibility to human bias rather than attention shifts or training-data overlap with analyst phrasing. This directly undermines the herding claim.
minor comments (2)
- The detection method for human opinions is referenced but not described in sufficient detail for replication or assessment of how it promotes independent thinking.
- Notation for investment ratings (Bullish/Neutral/Bearish) and the exact prompting templates should be provided in a table or appendix to allow precise reproduction.
Simulated Author's Rebuttal
We thank the referee for the constructive feedback, which highlights important areas for strengthening the empirical presentation and controls in our work. We address each major comment below and outline the revisions we will make to the manuscript.
Point-by-point responses
Referee: [Abstract] The directional claims that 'LLMs tend to herd the explicit bias in context' and that 'some models even exceed human performance in predicting future stock return' are presented without metrics, statistical tests, baseline comparisons, error bars, or any description of how herding or return-prediction accuracy is quantified. This absence is load-bearing for the central empirical results.
Authors: We agree that the abstract should include quantitative support for the key claims to improve clarity and impact. In the revised version, we will expand the abstract to summarize the main empirical metrics, including the magnitude of rating shifts under biased conditions, results of statistical tests for significance, comparisons against human analyst baselines, and brief descriptions of the quantification methods for herding (rating divergence between conditions) and return prediction (e.g., directional accuracy or correlation with realized returns). These additions will be drawn directly from the experimental results already reported in the body of the paper. revision: yes
Referee: [Experimental Setup] Experimental design (comparison of reports-alone vs. reports-plus-real-ratings vs. reports-plus-fake-ratings): the three conditions necessarily alter prompt length, token count, lexical content, and report structure. No controls (e.g., length-matched neutral insertions or randomized non-bias text) are described, so observed rating shifts cannot be confidently attributed to susceptibility to human bias rather than attention shifts or training-data overlap with analyst phrasing. This directly undermines the herding claim.
Authors: We recognize this as a valid concern about potential confounds in attributing shifts specifically to bias herding. The reports-alone condition establishes a no-bias baseline, and the real versus fake rating conditions add text of comparable length and structure, allowing isolation of bias direction effects. To further rule out length or lexical artifacts, we will add and report results from new control conditions using length-matched neutral insertions and randomized non-bias text in the revised manuscript. These controls will be used to verify that rating changes are driven by the presence of biased opinions rather than prompt modifications alone. revision: partial
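The length-matched neutral-insertion control the authors promise can be sketched as follows. The filler sentence and character-count padding scheme are my assumptions; the revised manuscript may match on tokens rather than characters.

```python
# Sketch of a length-matched neutral-insertion control: pad opinion-free
# filler to the rating sentence's length, so the control prompt and the
# biased prompt differ only in whether an opinion is expressed.
NEUTRAL_FILLER = "This report was compiled from publicly available filings and market data. "

def control_prompt(report_text: str, rating_sentence: str) -> str:
    """Append neutral filler whose character count matches the rating sentence."""
    target = len(rating_sentence)
    repeats = target // len(NEUTRAL_FILLER) + 1
    filler = (NEUTRAL_FILLER * repeats)[:target]
    return f"{report_text}\n{filler}"
```

If model ratings shift under the biased prompt but not under this control, length and structure are ruled out as the cause of the shift.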
Circularity Check
No significant circularity in empirical benchmark evaluation
Full rationale
The paper introduces Fin-Bias as an empirical benchmark consisting of 8868 analyst reports and evaluates LLM investment ratings under three prompt conditions (reports alone, with real ratings, with fake ratings). No equations, derivations, fitted parameters, or predictive models are present; all results are direct observational comparisons of generated outputs. Claims of herding and a detection method for independent thinking rest on these comparisons without any self-referential reduction to inputs by construction. The work is self-contained against external benchmarks and falsifiable via replication on the described dataset.
Axiom & Free-Parameter Ledger
axioms (1)
- domain assumption: Analyst investment ratings (Bullish/Neutral/Bearish) represent explicit human bias that can be isolated by adding or removing them from reports.
Lean theorems connected to this paper
- IndisputableMonolith/Cost/FunctionalEquation.lean · washburn_uniqueness_aczel · unclear
  Rationale: the relation between the paper passage and the cited Recognition theorem is unclear.
  Passage: "We introduce Fin-Bias ... Herding Score = 1/N Σ I(m_i, a_i) ... filter sentences ... MPQA Subjectivity Lexicon ... DPO framework ... quantile-based portfolio classification ... 60-day cumulative abnormal return"
- IndisputableMonolith/Foundation/RealityFromDistinction.lean · reality_from_one_distinction · unclear
  Rationale: the relation between the paper passage and the cited Recognition theorem is unclear.
  Passage: "LLMs tend to herd the explicit bias in context ... some models even exceed human performance in predicting future stock return"
What do these tags mean?
- matches: the paper's claim is directly supported by a theorem in the formal canon.
- supports: the theorem supports part of the paper's argument, but the paper may add assumptions or extra steps.
- extends: the paper goes beyond the formal theorem; the theorem is a base layer rather than the whole result.
- uses: the paper appears to rely on the theorem as machinery.
- contradicts: the paper's claim conflicts with a theorem or certificate in the canon.
- unclear: Pith found a possible connection, but the passage is too broad, indirect, or ambiguous to say the theorem truly supports the claim.