FinTruthQA: A Benchmark for AI-Driven Financial Disclosure Quality Assessment in Investor-Firm Interactions

Bin Ke; Dading Chong; Jiageng Wu; Jie Yang; Jun Chen; Peilin Zhou; Wang Dong; Xinyu Shi; Yikang Jiang; Ziyue Xu

arxiv: 2406.12009 · v5 · submitted 2024-06-17 · 💻 cs.CL

Pith reviewed 2026-05-23 23:53 UTC · model grok-4.3

classification 💻 cs.CL

keywords financial disclosure quality · benchmark dataset · investor-firm interactions · AI evaluation · question answering · pre-trained language models · large language models · Chinese stock exchanges

The pith

The FinTruthQA benchmark shows that models score above 95% F1 on the question tasks but only about 80% on answer relevance.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper creates FinTruthQA, a dataset of 6,000 real financial Q&A pairs from Chinese investor platforms, each labeled on four criteria to test whether AI can judge disclosure quality. It reports that current models handle question identification and question relevance well but fall short when assessing whether firm answers are readable and actually relevant to the query. The work positions the benchmark as a foundation for scaling up automated checks on corporate responses. A reader would care because weak answer evaluation limits reliable oversight of how firms communicate with investors. Domain- and task-adapted pre-trained language models outperform general-purpose models and LLM prompting on the harder criteria.

Core claim

FinTruthQA is presented as the first benchmark for AI assessment of financial disclosure quality in investor-firm interactions. It contains 6,000 manually annotated real-world entries evaluated on question identification, question relevance, answer readability, and answer relevance. Benchmarking shows existing models reach F1 scores above 95% on the question criteria but only around 88% on answer readability and 80% on answer relevance, with task-adapted pre-trained language models performing best overall.

What carries the argument

FinTruthQA dataset of 6,000 annotated financial Q&A entries, labeled according to four evaluation criteria for disclosure quality.

If this is right

Regulatory oversight could apply similar automated checks to monitor firm responses across many interactions.
Investor tools could filter out low-relevance answers to improve decision-making.
Model development would shift toward better handling of answer relevance in financial domains.
Corporate disclosure practices could be influenced by scalable quality scoring systems.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

The same annotation approach could be extended to investor platforms in other countries or languages.
The gap on answer relevance points to a need for models that better capture semantic alignment in specialized financial text.
Training on this benchmark might produce systems that encourage firms to give more substantive answers.

Load-bearing premise

The four chosen annotation criteria and the manual labeling process produce reliable, representative measures of actual financial disclosure quality in investor-firm interactions.

What would settle it

A direct comparison of the dataset's labels against independent ratings by financial regulators or experienced analysts on a held-out sample, where substantial disagreement would show the criteria do not track true disclosure quality.
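One common way to quantify such a comparison is a chance-corrected agreement statistic. The sketch below computes Cohen's kappa between two label sequences. The function name `cohens_kappa` and the two toy label lists are hypothetical and are not taken from FinTruthQA; the paper summary above does not state which agreement measure, if any, the authors used.

```python
from collections import Counter

def cohens_kappa(labels_a, labels_b):
    """Chance-corrected agreement between two raters over the same items."""
    if len(labels_a) != len(labels_b) or not labels_a:
        raise ValueError("label sequences must be non-empty and equally long")
    n = len(labels_a)
    # Observed agreement: fraction of items where both raters chose the same label.
    observed = sum(a == b for a, b in zip(labels_a, labels_b)) / n
    # Expected agreement if both raters labeled independently with their own marginals.
    count_a, count_b = Counter(labels_a), Counter(labels_b)
    expected = sum(count_a[c] * count_b[c] for c in count_a) / (n * n)
    if expected == 1.0:
        return 1.0  # both raters used one identical label throughout
    return (observed - expected) / (1 - expected)

# Made-up labels for ten items (1 = relevant answer, 0 = not relevant).
dataset_labels = [1, 1, 0, 1, 0, 1, 1, 0, 1, 1]
expert_labels = [1, 1, 0, 0, 0, 1, 1, 1, 1, 1]
kappa = cohens_kappa(dataset_labels, expert_labels)
print(f"raw agreement = 0.80, kappa = {kappa:.3f}")
```

In this toy case the raters agree on 8 of 10 items, yet kappa is only about 0.52 because both label "relevant" most of the time. Raw agreement alone can therefore overstate label reliability when classes are imbalanced.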

Figures

Figures reproduced from arXiv: 2406.12009 by Bin Ke, Dading Chong, Jiageng Wu, Jie Yang, Jun Chen, Peilin Zhou, Wang Dong, Xinyu Shi, Yikang Jiang, Ziyue Xu.

**Figure 2.** Example of a Q&A pair and four quality evaluation tasks.

**Figure 3.** Length distributions of our dataset (in characters).

**Figure 4.** Confusion matrix on Task 3 and Task 4.

Original abstract

Accurate and transparent financial information disclosure is essential for market efficiency, investor decision-making, and corporate governance. Chinese stock exchanges' investor interactive platforms provide a widely used channel through which listed firms respond to investor concerns, yet these responses are often limited or non-substantive, making disclosure quality difficult to assess at scale. To address this challenge, we introduce FinTruthQA, to our knowledge the first benchmark for AI-driven assessment of financial disclosure quality in investor-firm interactions. FinTruthQA comprises 6,000 real-world financial Q&A entries, each manually annotated based on four key evaluation criteria: question identification, question relevance, answer readability, and answer relevance. We benchmark statistical machine learning models, pre-trained language models and their fine-tuned variants, as well as large language models (LLMs), on FinTruthQA. Experiments show that existing models achieve strong performance on question identification and question relevance (F1 > 95%), but remain substantially weaker on answer readability (Micro F1 approximately 88%) and especially answer relevance (Micro F1 approximately 80%), highlighting the nontrivial difficulty of fine-grained disclosure quality assessment. Domain- and task-adapted pre-trained language models consistently outperform general-purpose models and LLM-based prompting on the most challenging settings. These findings position FinTruthQA as a practical foundation for AI-driven disclosure monitoring in capital markets, with value for regulatory oversight, investor protection, and disclosure governance in real-world financial settings.
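Since the abstract reports Micro F1 for the answer tasks, a short sketch of how micro and macro averaging differ may help when reading those numbers. The helper `f1_scores` and the toy labels are hypothetical; the three classes used here are an illustrative assumption, not the label scheme of FinTruthQA.

```python
from collections import Counter

def f1_scores(y_true, y_pred):
    """Return (micro_f1, macro_f1) for single-label multi-class predictions."""
    labels = sorted(set(y_true) | set(y_pred))
    tp, fp, fn = Counter(), Counter(), Counter()
    for t, p in zip(y_true, y_pred):
        if t == p:
            tp[t] += 1
        else:
            fp[p] += 1  # class p was predicted but is wrong
            fn[t] += 1  # class t was the truth but was missed

    def f1(tp_, fp_, fn_):
        denom = 2 * tp_ + fp_ + fn_
        return 2 * tp_ / denom if denom else 0.0

    # Micro: pool the counts over all classes, then compute one F1.
    micro = f1(sum(tp.values()), sum(fp.values()), sum(fn.values()))
    # Macro: compute F1 per class, then take the unweighted mean.
    macro = sum(f1(tp[c], fp[c], fn[c]) for c in labels) / len(labels)
    return micro, macro

# Made-up, imbalanced labels with three hypothetical classes.
y_true = [1, 1, 1, 1, 1, 1, 2, 2, 3, 3]
y_pred = [1, 1, 1, 1, 1, 1, 1, 2, 1, 3]
micro, macro = f1_scores(y_true, y_pred)
print(f"micro F1 = {micro:.3f}, macro F1 = {macro:.3f}")
```

For single-label classification, micro F1 equals plain accuracy (0.80 here), so it is dominated by the majority class. Macro F1 (about 0.73 here) weights every class equally and drops when minority classes are handled poorly, which is why label distributions matter for interpreting a Micro F1 figure.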

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Desk Editor's Note (private letter to a colleague)

New 6k Chinese financial Q&A dataset with clear performance gaps on answer relevance, but annotation reliability for subjective labels is not yet secured.

The main point is that this paper releases FinTruthQA, a 6,000-entry dataset drawn from real Chinese investor-platform Q&A, with each item labeled on four criteria: question identification, question relevance, answer readability, and answer relevance. Existing models hit over 95% F1 on the first two but fall to roughly 88% on readability and 80% on answer relevance, and domain-adapted models do better than general ones or plain LLM prompting on the harder labels. That is the concrete new thing they contribute: a benchmark that did not exist before for this exact setting, plus the empirical demonstration of where current systems still lag. The work is straightforward dataset construction plus standard model comparisons, and it earns credit for focusing on a practical use case in disclosure monitoring. The soft spot is the labeling process itself. The four criteria mix objective and subjective judgments, yet the abstract supplies no inter-annotator agreement numbers, no description of annotator background or training, no sampling details, and no label distribution. If those are absent from the full paper as well, the reported difficulty gap could be inflated by label noise rather than task hardness. The scope is also narrow: one market and one platform type. This is useful for researchers working on financial NLP or regulatory tech in Chinese markets who need a starting benchmark; a reader outside that niche will not get much. It deserves peer review so the annotation protocol and data release can be checked, but it is not a load-bearing theoretical claim that would collapse without further evidence.

Referee Report

1 major / 1 minor

Summary. The paper introduces FinTruthQA, a benchmark of 6,000 real-world financial Q&A entries from Chinese stock exchange investor interactive platforms. Each entry receives manual annotation on four criteria (question identification, question relevance, answer readability, answer relevance). Benchmarking of statistical ML models, PLMs and fine-tuned variants, and LLMs shows strong performance (F1 > 95%) on the question tasks but substantially weaker results on answer readability (~88% Micro F1) and answer relevance (~80% Micro F1). The work positions the dataset as a practical foundation for AI-driven disclosure quality monitoring with applications to regulation and investor protection.

Significance. If the annotations are shown to be reliable, the benchmark supplies a needed resource for evaluating models on a high-stakes, domain-specific task with direct relevance to market efficiency and corporate governance. The use of authentic investor-firm interactions and the consistent outperformance of domain-adapted PLMs over general models and prompting baselines are concrete strengths that could guide future work.

major comments (1)

[Abstract and Dataset Construction] Abstract and (presumed) §3 Dataset Construction: the reported aggregate F1 numbers are presented without any documentation of annotation guidelines, inter-annotator agreement statistics, annotator financial expertise, sampling procedure, class balance, or adjudication process. Because answer readability and answer relevance are subjective, the absence of these details directly undermines the central claim that the ~80% Micro F1 on answer relevance reflects genuine task difficulty rather than label noise.
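The label-noise concern can be made concrete with a small simulation. The sketch below scores a hypothetical oracle, one that always outputs the true label, against annotations in which a fraction of labels has been flipped. The function name, the 20% flip rate, and the three-class setup are illustrative assumptions only; nothing here estimates the actual noise level in FinTruthQA.

```python
import random

def observed_accuracy_of_oracle(n_items, flip_rate, n_classes=3, seed=0):
    """Accuracy of a perfect classifier when scored against noisy labels."""
    rng = random.Random(seed)
    correct = 0
    for _ in range(n_items):
        true_label = rng.randrange(n_classes)
        noisy_label = true_label
        if rng.random() < flip_rate:
            # Annotation error: swap in a different class chosen uniformly.
            noisy_label = (true_label + rng.randrange(1, n_classes)) % n_classes
        # The oracle predicts true_label; it is "wrong" whenever the label was flipped.
        correct += true_label == noisy_label
    return correct / n_items

clean = observed_accuracy_of_oracle(20_000, flip_rate=0.0)
noisy = observed_accuracy_of_oracle(20_000, flip_rate=0.2)
print(f"oracle vs clean labels: {clean:.3f}")
print(f"oracle vs 20% noisy labels: {noisy:.3f}")
```

Under this simple noise model the oracle's measured accuracy, and hence its Micro F1 in the single-label case, sits near 1 minus the flip rate. A reported score of about 80% is therefore compatible with either a genuinely hard task or a noisy gold standard, which is why agreement statistics are needed to tell the two apart.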

minor comments (1)

Add a table or paragraph reporting label distributions and any error analysis to allow readers to assess balance and common failure modes.

Simulated Author's Rebuttal

1 response · 0 unresolved

We thank the referee for the constructive feedback highlighting the need for greater transparency in our dataset construction. We address the major comment below.

Referee: [Abstract and Dataset Construction] Abstract and (presumed) §3 Dataset Construction: the reported aggregate F1 numbers are presented without any documentation of annotation guidelines, inter-annotator agreement statistics, annotator financial expertise, sampling procedure, class balance, or adjudication process. Because answer readability and answer relevance are subjective, the absence of these details directly undermines the central claim that the ~80% Micro F1 on answer relevance reflects genuine task difficulty rather than label noise.

Authors: We agree that the absence of these details in the current manuscript is a limitation that weakens the interpretability of the results on the more subjective criteria. In the revised version we will expand §3 (Dataset Construction) with the requested information, including the annotation guidelines, inter-annotator agreement statistics, annotator financial expertise and training, sampling procedure, class balance per criterion, and the adjudication process. These additions will allow readers to evaluate whether the performance gap on answer relevance primarily reflects task difficulty.

Revision: yes

Circularity Check

0 steps flagged

No circularity: empirical dataset creation and held-out model evaluation

The paper constructs FinTruthQA as a manually annotated dataset of 6,000 real-world Q&A entries using four fixed criteria, then evaluates off-the-shelf and fine-tuned models on held-out splits. No equations, fitted parameters, or predictions are defined in terms of the target metrics; performance numbers are direct empirical measurements. No self-citation chains or uniqueness theorems are invoked to justify the benchmark itself. The work is self-contained against external benchmarks (real financial disclosures) and does not reduce any claimed result to its own inputs by construction.

Axiom & Free-Parameter Ledger

0 free parameters · 1 axiom · 0 invented entities

The central claim rests on the validity of the four annotation criteria and the assumption that the sampled Q&A entries are representative of real disclosure quality; no free parameters or invented entities are introduced.

axioms (1)

domain assumption: The four evaluation criteria (question identification, question relevance, answer readability, answer relevance) are sufficient and appropriate for assessing financial disclosure quality.
The paper adopts these criteria for manual annotation without providing justification or validation that they capture the full construct of disclosure quality.

pith-pipeline@v0.9.0 · 5819 in / 1344 out tokens · 24397 ms · 2026-05-23T23:53:39.674759+00:00 · methodology

Lean theorems connected to this paper

Citations machine-checked in the Pith Canon. Every link opens the source theorem in the public Lean library.

IndisputableMonolith/Foundation/AbsoluteFloorClosure.lean · reality_from_one_distinction
Relation between the paper passage and the cited Recognition theorem: unclear
Paper passage: FinTruthQA comprises 6,000 real-world financial Q&A entries, each manually annotated based on four key evaluation criteria: question identification, question relevance, answer readability, and answer relevance

IndisputableMonolith/Cost/FunctionalEquation.lean · washburn_uniqueness_aczel
Relation between the paper passage and the cited Recognition theorem: unclear
Paper passage: Experiments show that existing models achieve strong performance on question identification and question relevance (F1 > 95%), but remain substantially weaker on answer readability (~88%) and especially answer relevance (~80%)

What do these tags mean?

matches: The paper's claim is directly supported by a theorem in the formal canon.
supports: The theorem supports part of the paper's argument, but the paper may add assumptions or extra steps.
extends: The paper goes beyond the formal theorem; the theorem is a base layer rather than the whole result.
uses: The paper appears to rely on the theorem as machinery.
contradicts: The paper's claim conflicts with a theorem or certificate in the canon.
unclear: Pith found a possible connection, but the passage is too broad, indirect, or ambiguous to say the theorem truly supports the claim.

Forward citations

Cited by 1 Pith paper

Reviewed papers in the Pith corpus that reference this work. Sorted by Pith novelty score.

Bridging Language Models and Financial Analysis
q-fin.ST · 2025-03 · unverdicted · novelty 2.0

A survey synthesizing recent LLM research and assessing its applicability to financial data analysis.

[1] [1]

GPT-4 Technical Report

[Achiam et al., 2023] Josh Achiam, Steven Adler, Sandhini Agarwal, Lama Ahmad, Ilge Akkaya, Florencia Leoni Aleman, Diogo Almeida, Janko Altenschmidt, Sam Alt- man, Shyamal Anadkat, et al. Gpt-4 technical report. arXiv preprint arXiv:2303.08774,

work page internal anchor Pith review Pith/arXiv arXiv 2023

[2] [2]

Large language models as financial data annotators: A study on effectiveness and efficiency

[Aguda et al., 2024] Toyin D Aguda, Suchetha Siddagan- gappa, Elena Kochkina, Simerjot Kaur, Dongsheng Wang, and Charese Smiley. Large language models as financial data annotators: A study on effectiveness and efficiency. In Proceedings of the 2024 Joint International Conference on Computational Linguistics, Language Resources and Evaluation (LREC-COLING

work page 2024

[3] [3]

FinBERT: Financial Sentiment Analysis with Pre-trained Language Models

[Araci, 2019] Dogu Araci. Finbert: Financial sentiment analysis with pre-trained language models. arXiv preprint arXiv:1908.10063,

work page internal anchor Pith review Pith/arXiv arXiv 2019

[4] [4]

Fintral: A family of gpt-4 level multi- modal financial large language models
