What Factors Affect LLMs and RLLMs in Financial Question Answering?

Dagang Li; Jiageng Wu; Peng Wang; Qiancheng Zhang; Xuesi Hu; Yuntao Zou

arXiv: 2507.08339 · v4 · submitted 2025-07-11 · 💻 cs.CL

Peng Wang, Xuesi Hu, Jiageng Wu, Yuntao Zou, Qiancheng Zhang, Dagang Li

Pith reviewed 2026-05-19 05:03 UTC · model grok-4.3

classification 💻 cs.CL

keywords: large language models · reasoning large language models · financial question answering · chain of thought reasoning · prompting methods · agent frameworks · multilingual alignment

The pith

Prompting methods and agent frameworks boost LLMs on financial questions by simulating long reasoning, while RLLMs show little additional benefit.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper tests how prompting, agent frameworks, and multilingual alignment affect LLMs and RLLMs on financial question answering. It establishes that these techniques mainly help regular LLMs by creating extended reasoning processes akin to Long CoT. RLLMs already include these inherent capabilities, reducing the gains from standard enhancements. Multilingual methods work by lengthening reasoning chains primarily in LLMs. Readers interested in deploying language models for finance would find this useful for selecting appropriate techniques based on model type.

Core claim

Current prompting methods and agent frameworks enhance the performance of LLMs in financial question answering by simulating Long CoT. RLLMs possess inherent Long CoT capabilities, which limits the effectiveness of conventional methods in further enhancing their performance. Current advanced multilingual alignment methods primarily improve the multilingual performance of LLMs by extending the reasoning length, which yields minimal benefits for RLLMs. The authors discuss strategies for enhancing the performance of LLMs and RLLMs in financial question answering.

What carries the argument

Long Chain-of-Thought (Long CoT), the extended step-by-step reasoning process that RLLMs have built-in and that other methods try to simulate or lengthen in LLMs for better financial query handling.

If this is right

Prompting and agent methods improve accuracy for standard LLMs on financial tasks.
RLLMs require different enhancement strategies since added prompting yields limited returns.
Multilingual alignment extends reasoning to help LLMs with non-English financial questions but not RLLMs.
Future work can build on the discussed strategies to tailor improvements to each model category.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

Developers might prioritize creating domain-specific long-reasoning datasets for RLLMs instead of general prompting.
The findings point to a need for new evaluation metrics that account for inherent reasoning length in model comparisons.
Similar experiments in other high-stakes domains like law or medicine could reveal if the LLM-RLLM difference is domain-specific.

Load-bearing premise

That the particular models, methods, and financial tasks tested are representative of the wider financial question answering landscape.

What would settle it

Conducting equivalent tests using additional financial QA datasets or a wider variety of LLMs and RLLMs to determine if the performance patterns for prompting, agents, and multilingual methods remain consistent.

Figures

Figures reproduced from arXiv: 2507.08339 by Dagang Li, Jiageng Wu, Peng Wang, Qiancheng Zhang, Xuesi Hu, Yuntao Zou.

**Figure 1.** The performance of Qwen-2.5-32B and DeepSeek-R1-Distill-Qwen-32B. Extending the reasoning length significantly enhances the performance of LLMs. For most LLMs, employing the CLP method yields better performance than the Translate-en approach. Additionally, the agentic framework (Self-Refine and S3 Agent) enhances the performance of LLMs to some degree. This indicates that CLP not only aligns multiple l…

**Figure 2.** Statistics of average output token consumption for questions of different difficulty in …

**Figure 3.** Performance of different scale and thinking mode in Qwen-3. The line chart in (a) shows Direct performance of Qwen-3, while the histogram in (a) shows average output tokens across scales. The histogram in (b) shows performance in different methods. In contrast, for RLLMs, the output of Long CoT indicates that the length of the CoT is not the primary determinant of performance. For instance, DeepSeek-R1-D…

**Figure 4.** Error statistics for Qwen3-14B under different thinking modes, considering cases where thinking mode is correct but non-thinking mode fails. To elucidate the advantages conferred by Long CoT, we utilize GPT-4o-mini to systematically analyze the error types observed in Qwen-3-14B under both thinking and non-thinking modes, as illustrated in …

Original abstract

Recently, large language models (LLMs) and reasoning large language models (RLLMs) have gained considerable attention from many researchers. RLLMs enhance the reasoning capabilities of LLMs through Long Chain-of-Thought (Long CoT) processes, significantly improving the performance of LLMs in addressing complex problems. However, there are few works that systematically explore what methods can fully unlock the performance of LLMs and RLLMs within the financial domain. To investigate the impact of various methods on LLMs and RLLMs, we utilize five LLMs and four RLLMs to assess the effects of prompting methods, agentic frameworks, and multilingual alignment methods on financial question-answering tasks. Our research findings indicate: (1) Current prompting methods and agent frameworks enhance the performance of LLMs in financial question answering by simulating Long CoT; (2) RLLMs possess inherent Long CoT capabilities, which limits the effectiveness of conventional methods in further enhancing their performance; (3) Current advanced multilingual alignment methods primarily improve the multilingual performance of LLMs by extending the reasoning length, which yields minimal benefits for RLLMs. Additionally, we discuss strategies for enhancing the performance of LLMs and RLLMs in financial question answering, which may serve as a inspiration for future improvements. We hope that this study can serve as an important reference for LLMs and RLLMs in the field of financial question answering.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Desk Editor's Note: private letter to a colleague

This paper benchmarks prompting, agents, and multilingual alignment on financial QA and finds bigger lifts for ordinary LLMs than for reasoning LLMs, which already carry strong built-in Long CoT.

The main takeaway is that standard LLMs gain more from prompting methods, agent frameworks, and multilingual alignment in financial question answering than reasoning LLMs do. The authors attribute the smaller gains on RLLMs to the Long CoT capabilities those models already have. They test five LLMs and four RLLMs and close with some practical strategies for the financial domain.

The comparative angle across model types is the clearest addition here: it applies familiar techniques to a commercially relevant setting and surfaces differential effects that prior work had not directly contrasted in this way. The setup is systematic enough to give applied readers a sense of which tweaks move the needle. The discussion section stays grounded and offers usable pointers without overclaiming.

The soft spots sit mainly in the mechanistic interpretations. The paper attributes LLM gains to simulating Long CoT and multilingual benefits to extending reasoning length, yet the reported results appear to be accuracy and F1 deltas only. No output-length statistics, step counts, or trace-complexity measures are described to support those specific causal links, so alternative accounts such as better terminology handling or fewer hallucinations remain possible. Dataset sizes, statistical tests, and controls for confounding factors also receive limited detail in the available summary, which limits how far the differences can be generalized.

This work is aimed at engineers and applied researchers who tune LLMs for financial QA or similar domain tasks. Someone needing empirical guidance on current techniques will extract value; theorists looking for new mechanisms will not. It deserves peer review because the experiments are focused and the question is timely, though the authors should be asked to add direct measurements of reasoning behavior to tighten the claims.

Referee Report

2 major / 1 minor

Summary. The manuscript empirically evaluates the effects of prompting methods, agentic frameworks, and multilingual alignment techniques on financial question-answering performance using five LLMs and four RLLMs. It reports that these interventions improve LLMs by simulating or extending Long CoT reasoning while RLLMs already possess inherent Long CoT capabilities that limit further gains from conventional methods.

Significance. If the reported accuracy improvements are shown to be robust and the mechanistic attributions to Long CoT simulation are backed by direct measurements, the work could offer practical guidance for deploying LLMs and RLLMs in finance. The study is timely given growing interest in domain-specific reasoning, but its current form provides limited new insight beyond standard prompting benchmarks.

major comments (2)

[Abstract] Abstract and Results: The central claims that prompting/agent methods improve LLMs 'by simulating Long CoT' and that multilingual alignment improves performance 'by extending the reasoning length' rest solely on accuracy/F1 deltas. No tables or sections report output token counts, reasoning-step statistics, trace lengths, or complexity metrics comparing baseline vs. enhanced conditions, leaving alternative explanations (e.g., better terminology retrieval or reduced hallucination) unaddressed.
[Experimental Setup] Experimental Setup: The manuscript provides no information on financial QA dataset sizes, sources, train/test splits, statistical significance tests (e.g., paired t-tests or bootstrap), baseline selection criteria, or controls for confounders such as prompt length, model scale, or data contamination. These omissions make it impossible to verify the general conclusions about factors affecting performance.
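
As an illustration of the kind of significance check the second comment asks for, a paired bootstrap over per-question correctness could look roughly like the sketch below. It is a minimal example under stated assumptions, not the manuscript's procedure: the function name `paired_bootstrap`, the 0/1 correctness encoding, and the toy data are all hypothetical.

```python
import random


def paired_bootstrap(baseline, enhanced, n_resamples=2000, seed=0):
    """Paired bootstrap on per-question correctness flags.

    baseline, enhanced: lists of 0/1 flags for the SAME questions in the
    same order (hypothetical inputs, not data from the paper).
    Returns (observed accuracy delta, fraction of resamples in which the
    enhanced condition does not beat the baseline).
    """
    if len(baseline) != len(enhanced) or not baseline:
        raise ValueError("need two non-empty lists of equal length")
    rng = random.Random(seed)
    n = len(baseline)
    # Per-question difference: +1 enhanced-only correct, -1 baseline-only correct.
    diffs = [e - b for b, e in zip(baseline, enhanced)]
    observed = sum(diffs) / n
    not_better = 0
    for _ in range(n_resamples):
        # Resample questions with replacement, keeping each pair together.
        sample = [diffs[rng.randrange(n)] for _ in range(n)]
        if sum(sample) <= 0:
            not_better += 1
    return observed, not_better / n_resamples


# Toy data only: 100 questions, enhanced condition fixes 3 in every 10.
toy_baseline = [1, 0, 0, 1, 0, 1, 0, 0, 1, 0] * 10
toy_enhanced = [1, 1, 0, 1, 1, 1, 0, 1, 1, 0] * 10
delta, p_not_better = paired_bootstrap(toy_baseline, toy_enhanced)
print(f"accuracy delta = {delta:.3f}, share of resamples with no gain = {p_not_better:.3f}")
```

Resampling whole question pairs, rather than each condition separately, keeps question difficulty matched across the two conditions, which is the point of a paired design.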

minor comments (1)

[Abstract] Abstract: 'serve as a inspiration' should read 'serve as an inspiration'.

Simulated Author's Rebuttal

2 responses · 0 unresolved

Thank you for the constructive feedback on our manuscript. We address each major comment below and have made revisions to improve the rigor of our claims and experimental reporting.

Referee: [Abstract] Abstract and Results: The central claims that prompting/agent methods improve LLMs 'by simulating Long CoT' and that multilingual alignment improves performance 'by extending the reasoning length' rest solely on accuracy/F1 deltas. No tables or sections report output token counts, reasoning-step statistics, trace lengths, or complexity metrics comparing baseline vs. enhanced conditions, leaving alternative explanations (e.g., better terminology retrieval or reduced hallucination) unaddressed.

Authors: We agree that direct measurements of reasoning length would provide stronger support for attributing gains to Long CoT simulation. In the revised manuscript we have added a new analysis subsection with output token counts and estimated reasoning-step statistics across conditions. These show consistent increases in reasoning length for LLMs under prompting and agent methods, and for multilingual alignment, which helps differentiate our interpretation from alternatives such as terminology retrieval. revision: yes
Referee: [Experimental Setup] Experimental Setup: The manuscript provides no information on financial QA dataset sizes, sources, train/test splits, statistical significance tests (e.g., paired t-tests or bootstrap), baseline selection criteria, or controls for confounders such as prompt length, model scale, or data contamination. These omissions make it impossible to verify the general conclusions about factors affecting performance.

Authors: We acknowledge these omissions limit verifiability. The revised Experimental Setup section now includes dataset sources and sizes, explicit train/test splits, results of paired t-tests for significance, baseline selection rationale, and discussion of controls for prompt length, model scale, and potential data contamination. revision: yes
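
The length analysis discussed in the first response could be tabulated along the following lines. This is a sketch under assumptions: the record schema (`condition`, `tokens`, `correct`), the condition labels `direct` and `cot`, and both function names are hypothetical and do not come from the manuscript.

```python
from statistics import mean


def summarize_by_condition(records):
    """Mean output tokens and accuracy per condition.

    records: list of dicts with keys "condition" (str), "tokens" (int),
    and "correct" (bool). The schema is an assumption for illustration.
    """
    grouped = {}
    for rec in records:
        grouped.setdefault(rec["condition"], []).append(rec)
    return {
        cond: {
            "mean_tokens": mean(r["tokens"] for r in rows),
            "accuracy": mean(1.0 if r["correct"] else 0.0 for r in rows),
        }
        for cond, rows in grouped.items()
    }


def deltas_vs_baseline(summary, baseline="direct"):
    """Token and accuracy change of every other condition relative to the baseline."""
    base = summary[baseline]
    return {
        cond: {
            "token_delta": stats["mean_tokens"] - base["mean_tokens"],
            "accuracy_delta": stats["accuracy"] - base["accuracy"],
        }
        for cond, stats in summary.items()
        if cond != baseline
    }


# Toy records only; real inputs would come from logged model outputs.
toy_records = [
    {"condition": "direct", "tokens": 100, "correct": False},
    {"condition": "direct", "tokens": 120, "correct": True},
    {"condition": "cot", "tokens": 300, "correct": True},
    {"condition": "cot", "tokens": 340, "correct": True},
]
toy_summary = summarize_by_condition(toy_records)
print(deltas_vs_baseline(toy_summary))
```

A table like this only shows that length and accuracy move together across conditions; it would not by itself rule out the alternative explanations the referee raises.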

Circularity Check

0 steps flagged

Empirical benchmarking study with no derivations, fitted predictions, or self-referential reductions

The paper reports performance differences across LLMs and RLLMs on financial QA tasks under prompting, agent frameworks, and multilingual alignment. All central claims are framed as direct observations from accuracy/F1 metrics rather than quantities derived from equations or parameters that reduce to the inputs by construction. No mathematical derivations, self-definitional loops, fitted-input predictions, or load-bearing self-citation chains appear in the reported results. The study is self-contained as an observational comparison and does not exhibit any of the enumerated circularity patterns.

Axiom & Free-Parameter Ledger

0 free parameters · 0 axioms · 0 invented entities

The study relies on standard assumptions from LLM evaluation literature such as the validity of benchmark tasks and the comparability of model outputs, with no new free parameters, axioms beyond domain standards, or invented entities introduced.

Lean theorems connected to this paper

Citations machine-checked in the Pith Canon. Every link opens the source theorem in the public Lean library.

IndisputableMonolith/Cost/FunctionalEquation.lean · washburn_uniqueness_aczel · unclear
Paper passage: "Current prompting methods and agent frameworks enhance the performance of LLMs in financial question answering by simulating Long CoT"

IndisputableMonolith/Foundation/ArithmeticFromLogic.lean · embed_strictMono_of_one_lt · unclear
Paper passage: "Current advanced multilingual alignment methods primarily improve the multilingual performance of LLMs by extending the reasoning length"

What do these tags mean?

matches: The paper's claim is directly supported by a theorem in the formal canon.
supports: The theorem supports part of the paper's argument, but the paper may add assumptions or extra steps.
extends: The paper goes beyond the formal theorem; the theorem is a base layer rather than the whole result.
uses: The paper appears to rely on the theorem as machinery.
contradicts: The paper's claim conflicts with a theorem or certificate in the canon.
unclear: Pith found a possible connection, but the passage is too broad, indirect, or ambiguous to say the theorem truly supports the claim.
