pith. machine review for the scientific record.

arxiv: 2605.09106 · v1 · submitted 2026-05-09 · 💻 cs.CL

Recognition: 2 theorem links

· Lean Theorem

Fin-Bias: Comprehensive Evaluation for LLM Decision-Making under human bias in Finance Domain


Pith reviewed 2026-05-12 02:07 UTC · model grok-4.3

classification 💻 cs.CL
keywords LLM evaluation · financial bias · herding behavior · investment decisions · analyst reports · benchmark · stock return prediction

The pith

Large language models tend to follow explicit investment ratings from analyst reports, even when those ratings are fabricated.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper introduces Fin-Bias, a benchmark using thousands of detailed analyst reports on firms across industries to test LLM investment decisions under uncertainty and bias. By presenting models with reports that include real ratings, no ratings, or fake ratings, the authors show that LLMs adjust their own recommendations to match the provided bias rather than relying solely on the report content. They also create a technique to identify potential human opinions in the input, which prompts more independent analysis by the models, allowing some to better predict actual future stock returns than human analysts do.

Core claim

LLMs exhibit herding toward explicit biases in financial contexts, as demonstrated by shifts in their generated investment ratings when analyst ratings are added or faked in long firm reports. A new detection method for human opinions in context enables LLMs to generate more independent ratings, with some models achieving higher accuracy than humans in forecasting stock returns.

What carries the argument

The Fin-Bias benchmark of 8868 analyst reports, each presented in three conditions (with the real investment rating, with no rating, and with a fake rating; ratings are Bullish/Neutral/Bearish), plus a method that detects human opinions in the context to promote independent LLM reasoning.
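
A minimal sketch of how the three input conditions could be assembled for one report. The `Analyst rating:` line, the function name `build_conditions`, and the choice to draw the fake rating uniformly from the two labels that differ from the real one are assumptions for illustration; the paper's own prompt template is the one shown in its Figure 1.

```python
import random

RATINGS = ("Bullish", "Neutral", "Bearish")


def build_conditions(report_body, analyst_rating, rng):
    """Return the three contexts for one report: no rating, real rating, fake rating."""
    if analyst_rating not in RATINGS:
        raise ValueError(f"unknown rating: {analyst_rating!r}")
    # Assumption: a 'fake' rating is any label other than the analyst's real one.
    fake_rating = rng.choice([r for r in RATINGS if r != analyst_rating])
    return {
        "no_rating": report_body,
        # Hypothetical rating line; the real template may differ.
        "real_rating": f"{report_body}\nAnalyst rating: {analyst_rating}",
        "fake_rating": f"{report_body}\nAnalyst rating: {fake_rating}",
    }


contexts = build_conditions("Revenue grew 4% year over year.", "Bullish", random.Random(0))
for name, text in contexts.items():
    print(name, "->", text.splitlines()[-1])
```

Passing a seeded `random.Random` keeps the fake-rating assignment reproducible across runs, which matters when the same report is scored by several models.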

If this is right

  • LLMs are vulnerable to explicit bias in financial decision contexts.
  • Models can be made more independent using opinion detection techniques.
  • Some LLMs with this adjustment can exceed human accuracy in predicting stock returns.
  • The benchmark provides a way to evaluate LLM reliability in uncertain financial scenarios.
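
To make the comparison with human accuracy concrete, here is a rough sketch of one possible scoring rule. The page does not state how the paper scores return prediction, so everything here is an assumption: the `neutral_band` threshold, the mapping from realized return to a label, and the toy data are all hypothetical.

```python
def directional_accuracy(ratings, returns, neutral_band=0.02):
    """Share of ratings that match the label implied by the realized return."""
    if len(ratings) != len(returns) or not ratings:
        raise ValueError("ratings and returns must be non-empty and equally long")

    def realized_label(ret):
        # Assumption: returns inside +/- neutral_band count as Neutral.
        if ret > neutral_band:
            return "Bullish"
        if ret < -neutral_band:
            return "Bearish"
        return "Neutral"

    hits = sum(rating == realized_label(ret) for rating, ret in zip(ratings, returns))
    return hits / len(ratings)


# Toy data, not taken from the paper.
future_returns = [0.10, -0.05, 0.01, 0.08]
analyst = ["Bullish", "Bullish", "Neutral", "Bearish"]
model = ["Bullish", "Bearish", "Neutral", "Bearish"]
print(directional_accuracy(analyst, future_returns))  # 0.5
print(directional_accuracy(model, future_returns))    # 0.75
```

Scoring analysts and models with the same function on the same reports keeps the comparison like for like, whatever rule is finally chosen.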

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

  • Such herding could limit the use of LLMs for unbiased financial advice without additional safeguards.
  • The detection method might generalize to reduce bias following in other expert-opinion heavy domains.
  • Testing the benchmark on newer models could reveal if larger models are less or more susceptible to this effect.

Load-bearing premise

Differences in LLM-generated ratings are caused by the models following human bias in the context rather than by changes in prompt length, report structure, or other unrelated factors.

What would settle it

Running the same reports through LLMs with and without the analyst ratings and observing no significant difference in the models' own ratings would indicate that herding is not occurring.
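
One way this check could be run, sketched under assumptions: for each report, record whether the model's rating agrees with the injected rating when the rating is hidden and when it is shown, then apply an exact McNemar test to the reports whose agreement changed. The function name `herding_shift` and the choice of McNemar's test are illustrative and are not described in the paper.

```python
from math import comb


def herding_shift(injected, model_without, model_with):
    """Agreement with the injected rating, without vs. with it in the context."""
    n = len(injected)
    if n == 0 or len(model_without) != n or len(model_with) != n:
        raise ValueError("inputs must be non-empty and equally long")
    base = [m == r for m, r in zip(model_without, injected)]
    shown = [m == r for m, r in zip(model_with, injected)]
    gained = sum(s and not b for b, s in zip(base, shown))  # agreement appears
    lost = sum(b and not s for b, s in zip(base, shown))    # agreement disappears
    discordant = gained + lost
    if discordant == 0:
        p_value = 1.0
    else:
        # Exact two-sided McNemar test: binomial tail on discordant pairs, p = 0.5.
        tail = sum(comb(discordant, i) for i in range(min(gained, lost) + 1))
        p_value = min(1.0, 2 * tail / 2 ** discordant)
    return {
        "agree_without": sum(base) / n,
        "agree_with": sum(shown) / n,
        "p_value": p_value,
    }


# Toy data, not taken from the paper: the model flips to the shown rating every time.
result = herding_shift(["Bullish"] * 6, ["Neutral"] * 6, ["Bullish"] * 6)
print(result)  # {'agree_without': 0.0, 'agree_with': 1.0, 'p_value': 0.03125}
```

A large p-value together with near-equal agreement rates would be the "no significant difference" outcome described above.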

Figures

Figures reproduced from arXiv: 2605.09106 by Jinman Zhao, Xiaoyu Hu.

Figure 1: Prompt Template
Figure 2: A sample analyst report used as input to the model.
Figure 3: A sample analyst report without rating as input to the model.
Figure 4: A sample analyst report with ’fake’ rating as input to the model.
Figure 5: A sample analyst report without human opinions as input to the model.
Figure 6: A sample context as input to the model. LLM output "<|start_header_id|>assistant<|end_header_id|> Here is the analysis and investment rating in JSON format: "answer": "bullish", "reason": "Although Bank of America Corp. (BAC) faces near-term challenges due to the impact of the coronavirus on its earnings, we believe that the company’s long-term fundamentals remain solid. The recent decline in share price p…
Figure 7: A sample LLM output given context with analyst rating
Figure 8: A sample context as input to the model. LLM output "<|start_header_id|>assistant<|end_header_id|> Here is my analysis and investment rating: Answer: Neutral Reason: While Bank of America faces significant challenges in the near-term due to the impact of the coronavirus on the economy, its focus on "responsible growth" and its ability to manage its credit risk are positives. The company’s recent share price u…
Figure 9: A sample LLM output given context without analyst rating

Original abstract

Large language models (LLMs) are increasingly deployed in financial contexts, raising critical concerns about reliability, alignment, and susceptibility to adversarial manipulation. While prior finance-related benchmarks assess LLMs' capabilities in stock trading, they are often restricted to small sample and fail to demonstrate LLM susceptibility to context with potential human bias. We introduce Fin-Bias (financial herding under long and uncertain financial context), a benchmark for evaluating LLM investment decision-making when faced with uncertainty and possible human-biased opinions. Fin-Bias includes 8868 long firm-specific analyst reports, including firm aspects summarized and analyzed by sophisticated analysts with investment ratings (Bullish/Neutral/Bearish) spanning from various industries. We present large language models with firm analyst reports with/without analyst investment ratings and even with 'fake' rating, to get investment ratings generated by LLMs. Our results reveal that LLMs tend to herd the explicit bias in context. We also develop a method to detect potential human opinions, which can encourage LLMs to think independently, some models even exceed human performance in predicting future stock return.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Referee Report

2 major / 2 minor

Summary. The manuscript introduces Fin-Bias, a benchmark of 8868 long firm-specific analyst reports spanning industries, to evaluate LLM investment decision-making under uncertainty and explicit human bias. LLMs are tested on reports alone, reports plus real analyst ratings (Bullish/Neutral/Bearish), and reports plus fake ratings; the central claim is that LLMs herd the provided bias. A secondary method is proposed to detect human opinions and encourage independent thinking, with the assertion that some models exceed human performance in predicting future stock returns.

Significance. If the attribution to bias herding can be isolated from prompt artifacts and supported by statistical evidence, the work would usefully extend existing finance LLM benchmarks by focusing on long contexts and explicit bias susceptibility, with practical implications for reliable deployment in financial decision systems.

major comments (2)
  1. [Abstract] Abstract: the directional claims that 'LLMs tend to herd the explicit bias in context' and that 'some models even exceed human performance in predicting future stock return' are presented without any metrics, statistical tests, baseline comparisons, error bars, or description of how herding or return-prediction accuracy is quantified. This absence is load-bearing for the central empirical results.
  2. [Experimental Setup] Experimental design (comparison of reports-alone vs. reports-plus-real-ratings vs. reports-plus-fake-ratings): the three conditions necessarily alter prompt length, token count, lexical content, and report structure. No controls (e.g., length-matched neutral insertions or randomized non-bias text) are described, so observed rating shifts cannot be confidently attributed to susceptibility to human bias rather than attention shifts or training-data overlap with analyst phrasing. This directly undermines the herding claim.
minor comments (2)
  1. The detection method for human opinions is referenced but not described in sufficient detail for replication or assessment of how it promotes independent thinking.
  2. Notation for investment ratings (Bullish/Neutral/Bearish) and the exact prompting templates should be provided in a table or appendix to allow precise reproduction.

Simulated Author's Rebuttal

2 responses · 0 unresolved

We thank the referee for the constructive feedback, which highlights important areas for strengthening the empirical presentation and controls in our work. We address each major comment below and outline the revisions we will make to the manuscript.

  1. Referee: [Abstract] Abstract: the directional claims that 'LLMs tend to herd the explicit bias in context' and that 'some models even exceed human performance in predicting future stock return' are presented without any metrics, statistical tests, baseline comparisons, error bars, or description of how herding or return-prediction accuracy is quantified. This absence is load-bearing for the central empirical results.

    Authors: We agree that the abstract should include quantitative support for the key claims to improve clarity and impact. In the revised version, we will expand the abstract to summarize the main empirical metrics, including the magnitude of rating shifts under biased conditions, results of statistical tests for significance, comparisons against human analyst baselines, and brief descriptions of the quantification methods for herding (rating divergence between conditions) and return prediction (e.g., directional accuracy or correlation with realized returns). These additions will be drawn directly from the experimental results already reported in the body of the paper. revision: yes

  2. Referee: [Experimental Setup] Experimental design (comparison of reports-alone vs. reports-plus-real-ratings vs. reports-plus-fake-ratings): the three conditions necessarily alter prompt length, token count, lexical content, and report structure. No controls (e.g., length-matched neutral insertions or randomized non-bias text) are described, so observed rating shifts cannot be confidently attributed to susceptibility to human bias rather than attention shifts or training-data overlap with analyst phrasing. This directly undermines the herding claim.

    Authors: We recognize this as a valid concern about potential confounds in attributing shifts specifically to bias herding. The reports-alone condition establishes a no-bias baseline, and the real versus fake rating conditions add text of comparable length and structure, allowing isolation of bias direction effects. To further rule out length or lexical artifacts, we will add and report results from new control conditions using length-matched neutral insertions and randomized non-bias text in the revised manuscript. These controls will be used to verify that rating changes are driven by the presence of biased opinions rather than prompt modifications alone. revision: partial
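
The length-matched control mentioned in the second response could look roughly like the sketch below. It is a hypothetical construction, not the authors' procedure: the filler vocabulary, the function name `length_matched_control`, and the use of whitespace word count as a stand-in for token count are all assumptions.

```python
NEUTRAL_FILLER = ("report", "date", "page", "section", "reference")


def length_matched_control(report_body, rating_line, filler_words=NEUTRAL_FILLER):
    """Append opinion-free filler with as many words as the rating line it replaces."""
    # Assumption: whitespace word count approximates the tokenizer's token count.
    n_words = len(rating_line.split())
    filler = " ".join(filler_words[i % len(filler_words)] for i in range(n_words))
    return f"{report_body}\n{filler}"


control = length_matched_control("Revenue grew 4% year over year.", "Analyst rating: Bullish")
print(control.splitlines()[-1])  # report date page
```

If model ratings shift under the real or fake rating line but not under this control, prompt length alone becomes a less plausible explanation. Matching on the model's actual tokenizer rather than on words would tighten the control.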

Circularity Check

0 steps flagged

No significant circularity in empirical benchmark evaluation


The paper introduces Fin-Bias as an empirical benchmark consisting of 8868 analyst reports and evaluates LLM investment ratings under three prompt conditions (reports alone, with real ratings, with fake ratings). No equations, derivations, fitted parameters, or predictive models are present; all results are direct observational comparisons of generated outputs. Claims of herding and a detection method for independent thinking rest on these comparisons without any self-referential reduction to inputs by construction. The work is self-contained against external benchmarks and falsifiable via replication on the described dataset.

Axiom & Free-Parameter Ledger

0 free parameters · 1 axiom · 0 invented entities

The central claims rest on the domain assumption that analyst ratings constitute measurable human bias and that fake ratings isolate susceptibility. No free parameters or invented entities are introduced.

axioms (1)
  • domain assumption: Analyst investment ratings (Bullish/Neutral/Bearish) represent explicit human bias that can be isolated by adding or removing them from reports.
    Invoked in the experimental design of presenting reports with/without ratings and with fake ratings.

pith-pipeline@v0.9.0 · 5486 in / 1216 out tokens · 60117 ms · 2026-05-12T02:07:19.783116+00:00 · methodology


Lean theorems connected to this paper

Citations machine-checked in the Pith Canon. Every link opens the source theorem in the public Lean library.

Tag meanings

  • matches: The paper's claim is directly supported by a theorem in the formal canon.
  • supports: The theorem supports part of the paper's argument, but the paper may add assumptions or extra steps.
  • extends: The paper goes beyond the formal theorem; the theorem is a base layer rather than the whole result.
  • uses: The paper appears to rely on the theorem as machinery.
  • contradicts: The paper's claim conflicts with a theorem or certificate in the canon.
  • unclear: Pith found a possible connection, but the passage is too broad, indirect, or ambiguous to say the theorem truly supports the claim.
