Recognition: 2 theorem links
Fin-Bias: Comprehensive Evaluation for LLM Decision-Making under Human Bias in Finance Domain
Pith reviewed 2026-05-12 02:07 UTC · model grok-4.3
The pith
Large language models tend to follow explicit investment ratings from analyst reports, even when those ratings are fabricated.
A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.
Core claim
LLMs exhibit herding toward explicit biases in financial contexts, as demonstrated by shifts in their generated investment ratings when analyst ratings are added or faked in long firm reports. A new detection method for human opinions in context enables LLMs to generate more independent ratings, with some models achieving higher accuracy than humans in forecasting stock returns.
What carries the argument
The Fin-Bias benchmark of 8868 analyst reports, each presented in three conditions — without ratings, with real analyst ratings, and with fabricated investment ratings (Bullish/Neutral/Bearish) — plus a method that detects human opinions in context to promote independent LLM reasoning.
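The three conditions can be sketched as prompt templates. This is a minimal illustration, not the authors' actual prompts, which the review does not reproduce; the wording and field names below are assumptions.

```python
# Hypothetical sketch of the three Fin-Bias prompt conditions.
# The question wording and layout are illustrative assumptions,
# not the paper's actual templates.
RATINGS = ("Bullish", "Neutral", "Bearish")

def build_prompts(report_text: str, real_rating: str, fake_rating: str) -> dict:
    """Build the report-only, real-rating, and fake-rating prompts for one report."""
    question = "Give an investment rating (Bullish/Neutral/Bearish)."
    return {
        "report_only": f"Analyst report:\n{report_text}\n\n{question}",
        "real_rating": f"Analyst report:\n{report_text}\nAnalyst rating: {real_rating}\n\n{question}",
        "fake_rating": f"Analyst report:\n{report_text}\nAnalyst rating: {fake_rating}\n\n{question}",
    }
```

Holding the report text fixed across all three prompts is what lets rating shifts be attributed to the injected rating rather than to the report itself.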
If this is right
- LLMs are vulnerable to explicit bias in financial decision contexts.
- Models can be made more independent using opinion detection techniques.
- Some LLMs with this adjustment can exceed human accuracy in predicting stock returns.
- The benchmark provides a way to evaluate LLM reliability in uncertain financial scenarios.
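The claim that some models exceed human accuracy presupposes a scoring rule against realized returns. One common choice is directional accuracy; the sketch below assumes that rule and an arbitrary neutral band, since the review does not specify how the paper scores predictions.

```python
def directional_accuracy(ratings, realized_returns, neutral_band=0.01):
    """Score a rating as correct when its direction matches the realized return.

    Neutral counts as correct when the absolute return stays inside a small
    band; the band width here is an assumption, not a value from the paper.
    """
    sign = {"Bullish": 1, "Neutral": 0, "Bearish": -1}
    hits = 0
    for rating, ret in zip(ratings, realized_returns):
        s = sign[rating]
        hits += (abs(ret) < neutral_band) if s == 0 else (s * ret > 0)
    return hits / len(ratings)
```

Comparing this score for model ratings versus human analyst ratings on the same reports is what the human-baseline claim would require.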
Where Pith is reading between the lines
- Such herding could limit the use of LLMs for unbiased financial advice without additional safeguards.
- The detection method might generalize to reduce bias following in other expert-opinion heavy domains.
- Testing the benchmark on newer models could reveal if larger models are less or more susceptible to this effect.
Load-bearing premise
Differences in LLM-generated ratings are caused by the models following human bias in the context rather than by changes in prompt length, report structure, or other unrelated factors.
What would settle it
Running the same reports through LLMs with and without the analyst ratings and observing no significant difference in the models' own ratings would indicate that herding is not occurring.
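The decisive comparison reduces to two simple quantities: how often the model's rating matches the injected rating (the herding score 1/N Σ I(m_i, a_i) quoted from the paper), and how often adding the rating actually flips the model toward it. A minimal sketch, with function names of my own choosing:

```python
def herding_rate(model_ratings, injected_ratings):
    """Fraction of model ratings that match the injected analyst rating,
    i.e. the herding score 1/N * sum I(m_i, a_i) quoted from the paper."""
    pairs = list(zip(model_ratings, injected_ratings))
    return sum(m == a for m, a in pairs) / len(pairs)

def flip_rate(ratings_alone, ratings_with_bias, injected_ratings):
    """Fraction of items where adding the rating flipped the model toward it:
    the no-rating answer differed, and the with-rating answer matches."""
    flips = sum(
        m0 != m1 and m1 == a
        for m0, m1, a in zip(ratings_alone, ratings_with_bias, injected_ratings)
    )
    return flips / len(injected_ratings)
```

A flip rate near zero across conditions would indicate no herding; a high flip rate toward fake ratings, surviving the controls the referee asks for, would confirm it.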
Original abstract
Large language models (LLMs) are increasingly deployed in financial contexts, raising critical concerns about reliability, alignment, and susceptibility to adversarial manipulation. While prior finance-related benchmarks assess LLMs' capabilities in stock trading, they are often restricted to small sample and fail to demonstrate LLM susceptibility to context with potential human bias. We introduce Fin-Bias (financial herding under long and uncertain financial context), a benchmark for evaluating LLM investment decision-making when faced with uncertainty and possible human-biased opinions. Fin-Bias includes 8868 long firm-specific analyst reports, including firm aspects summarized and analyzed by sophisticated analysts with investment ratings (Bullish/Neutral/Bearish) spanning from various industries. We present large language models with firm analyst reports with/without analyst investment ratings and even with 'fake' rating, to get investment ratings generated by LLMs. Our results reveal that LLMs tend to herd the explicit bias in context. We also develop a method to detect potential human opinions, which can encourage LLMs to think independently, some models even exceed human performance in predicting future stock return.
Editorial analysis
A structured set of objections, weighed in public.
Referee Report
Summary. The manuscript introduces Fin-Bias, a benchmark of 8868 long firm-specific analyst reports spanning industries, to evaluate LLM investment decision-making under uncertainty and explicit human bias. LLMs are tested on reports alone, reports plus real analyst ratings (Bullish/Neutral/Bearish), and reports plus fake ratings; the central claim is that LLMs herd the provided bias. A secondary method is proposed to detect human opinions and encourage independent thinking, with the assertion that some models exceed human performance in predicting future stock returns.
Significance. If the attribution to bias herding can be isolated from prompt artifacts and supported by statistical evidence, the work would usefully extend existing finance LLM benchmarks by focusing on long contexts and explicit bias susceptibility, with practical implications for reliable deployment in financial decision systems.
major comments (2)
- [Abstract] The directional claims that 'LLMs tend to herd the explicit bias in context' and that 'some models even exceed human performance in predicting future stock return' are presented without metrics, statistical tests, baseline comparisons, error bars, or any description of how herding or return-prediction accuracy is quantified. This absence is load-bearing for the central empirical results.
- [Experimental Setup] Experimental design (comparison of reports-alone vs. reports-plus-real-ratings vs. reports-plus-fake-ratings): the three conditions necessarily alter prompt length, token count, lexical content, and report structure. No controls (e.g., length-matched neutral insertions or randomized non-bias text) are described, so observed rating shifts cannot be confidently attributed to susceptibility to human bias rather than attention shifts or training-data overlap with analyst phrasing. This directly undermines the herding claim.
minor comments (2)
- The detection method for human opinions is referenced but not described in sufficient detail for replication or assessment of how it promotes independent thinking.
- Notation for investment ratings (Bullish/Neutral/Bearish) and the exact prompting templates should be provided in a table or appendix to allow precise reproduction.
Simulated Author's Rebuttal
We thank the referee for the constructive feedback, which highlights important areas for strengthening the empirical presentation and controls in our work. We address each major comment below and outline the revisions we will make to the manuscript.
Point-by-point responses
Referee: [Abstract] The directional claims that 'LLMs tend to herd the explicit bias in context' and that 'some models even exceed human performance in predicting future stock return' are presented without metrics, statistical tests, baseline comparisons, error bars, or any description of how herding or return-prediction accuracy is quantified. This absence is load-bearing for the central empirical results.
Authors: We agree that the abstract should include quantitative support for the key claims to improve clarity and impact. In the revised version, we will expand the abstract to summarize the main empirical metrics, including the magnitude of rating shifts under biased conditions, results of statistical tests for significance, comparisons against human analyst baselines, and brief descriptions of the quantification methods for herding (rating divergence between conditions) and return prediction (e.g., directional accuracy or correlation with realized returns). These additions will be drawn directly from the experimental results already reported in the body of the paper. revision: yes
Referee: [Experimental Setup] Experimental design (comparison of reports-alone vs. reports-plus-real-ratings vs. reports-plus-fake-ratings): the three conditions necessarily alter prompt length, token count, lexical content, and report structure. No controls (e.g., length-matched neutral insertions or randomized non-bias text) are described, so observed rating shifts cannot be confidently attributed to susceptibility to human bias rather than attention shifts or training-data overlap with analyst phrasing. This directly undermines the herding claim.
Authors: We recognize this as a valid concern about potential confounds in attributing shifts specifically to bias herding. The reports-alone condition establishes a no-bias baseline, and the real versus fake rating conditions add text of comparable length and structure, allowing isolation of bias direction effects. To further rule out length or lexical artifacts, we will add and report results from new control conditions using length-matched neutral insertions and randomized non-bias text in the revised manuscript. These controls will be used to verify that rating changes are driven by the presence of biased opinions rather than prompt modifications alone. revision: partial
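The length-matched neutral-insertion control the authors promise can be sketched as follows. The filler sentence and character-count padding scheme are my assumptions; the revised manuscript may match on tokens rather than characters.

```python
# Sketch of a length-matched neutral-insertion control: pad opinion-free
# filler to the rating sentence's length, so the control prompt and the
# biased prompt differ only in whether an opinion is expressed.
NEUTRAL_FILLER = "This report was compiled from publicly available filings and market data. "

def control_prompt(report_text: str, rating_sentence: str) -> str:
    """Append neutral filler whose character count matches the rating sentence."""
    target = len(rating_sentence)
    repeats = target // len(NEUTRAL_FILLER) + 1
    filler = (NEUTRAL_FILLER * repeats)[:target]
    return f"{report_text}\n{filler}"
```

If model ratings shift under the biased prompt but not under this control, length and structure are ruled out as the cause of the shift.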
Circularity Check
No significant circularity in empirical benchmark evaluation
Full rationale
The paper introduces Fin-Bias as an empirical benchmark consisting of 8868 analyst reports and evaluates LLM investment ratings under three prompt conditions (reports alone, with real ratings, with fake ratings). No equations, derivations, fitted parameters, or predictive models are present; all results are direct observational comparisons of generated outputs. Claims of herding and a detection method for independent thinking rest on these comparisons without any self-referential reduction to inputs by construction. The work is self-contained against external benchmarks and falsifiable via replication on the described dataset.
Axiom & Free-Parameter Ledger
axioms (1)
- domain assumption: Analyst investment ratings (Bullish/Neutral/Bearish) represent explicit human bias that can be isolated by adding or removing them from reports.
Lean theorems connected to this paper
- IndisputableMonolith/Cost/FunctionalEquation.lean · washburn_uniqueness_aczel · unclear
  Rationale: the relation between the paper passage and the cited Recognition theorem is unclear.
  Passage: "We introduce Fin-Bias ... Herding Score = 1/N Σ I(m_i, a_i) ... filter sentences ... MPQA Subjectivity Lexicon ... DPO framework ... quantile-based portfolio classification ... 60-day cumulative abnormal return"
- IndisputableMonolith/Foundation/RealityFromDistinction.lean · reality_from_one_distinction · unclear
  Rationale: the relation between the paper passage and the cited Recognition theorem is unclear.
  Passage: "LLMs tend to herd the explicit bias in context ... some models even exceed human performance in predicting future stock return"
What do these tags mean?
- matches: the paper's claim is directly supported by a theorem in the formal canon.
- supports: the theorem supports part of the paper's argument, but the paper may add assumptions or extra steps.
- extends: the paper goes beyond the formal theorem; the theorem is a base layer rather than the whole result.
- uses: the paper appears to rely on the theorem as machinery.
- contradicts: the paper's claim conflicts with a theorem or certificate in the canon.
- unclear: Pith found a possible connection, but the passage is too broad, indirect, or ambiguous to say the theorem truly supports the claim.