pith. machine review for the scientific record.

arxiv: 2512.13040 · v2 · submitted 2025-12-15 · 💻 cs.LG · cs.CL

Recognition: 1 theorem link

· Lean Theorem

Understanding Structured Financial Data with LLMs: A Case Study on Fraud Detection

Authors on Pith: no claims yet

Pith reviewed 2026-05-16 21:56 UTC · model grok-4.3

classification 💻 cs.LG cs.CL
keywords fraud detection · large language models · tabular data · retrieval augmented generation · feature selection · in-context learning · financial transactions · interpretable AI

The pith

Importance-guided feature reduction and retrieval-augmented examples let LLMs detect fraud in tabular financial data with improved F1/MCC scores and human-readable explanations.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper shows that large language models struggle with raw high-dimensional tabular fraud data due to feature overload, class imbalance, and lack of context. To address this, the authors present FinFRE-RAG, which first selects the most important numeric and categorical features, converts the reduced set into natural language descriptions, and then retrieves label-aware similar past transactions to guide in-context reasoning. Experiments across four public fraud datasets and multiple open-weight LLM families demonstrate clear gains over direct prompting and parity with strong tabular baselines in several cases. The resulting predictions come with explicit rationales that can help analysts review cases and refine detection rules.
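The retrieval stage described above can be sketched as a label-aware nearest-neighbour lookup over past transactions. This is an illustrative reading, not the paper's exact implementation: the cosine metric, the per-label balancing, and the function names are all assumptions.

```python
import math

def cosine(a, b):
    # cosine similarity between two numeric feature vectors
    dot = sum(x * y for x, y in zip(a, b))
    na = math.sqrt(sum(x * x for x in a))
    nb = math.sqrt(sum(x * x for x in b))
    return dot / (na * nb) if na and nb else 0.0

def retrieve_label_aware(query, bank, n_per_label=2):
    # bank: list of (feature_vector, label) pairs from past transactions.
    # Retrieving the top matches per label keeps the prompt balanced even
    # under heavy class imbalance, one plausible reading of "label-aware".
    exemplars = []
    for label in (0, 1):
        pool = [(cosine(query, vec), vec) for vec, lab in bank if lab == label]
        pool.sort(key=lambda pair: pair[0], reverse=True)
        exemplars.extend((vec, label) for _, vec in pool[:n_per_label])
    return exemplars
```

The retrieved exemplars would then be serialized alongside the query transaction to form the in-context prompt.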

Core claim

Across four public fraud datasets and three families of open-weight LLMs, FinFRE-RAG substantially improves F1/MCC over direct prompting and is competitive with strong tabular baselines in several settings. The method applies importance-guided feature reduction to serialize a compact subset of numeric/categorical attributes into natural language and performs retrieval-augmented in-context learning over label-aware, instance-level exemplars, thereby narrowing the performance gap while supplying interpretable rationales.
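F1 and MCC, the two headline metrics, weight errors differently under class imbalance; MCC in particular penalizes a classifier that scores well on the majority class alone. A minimal reference computation for the binary case (pure Python, fraud coded as the positive class):

```python
import math

def f1_mcc(y_true, y_pred):
    # confusion-matrix counts for the positive (fraud) class
    tp = sum(t == 1 and p == 1 for t, p in zip(y_true, y_pred))
    fp = sum(t == 0 and p == 1 for t, p in zip(y_true, y_pred))
    fn = sum(t == 1 and p == 0 for t, p in zip(y_true, y_pred))
    tn = sum(t == 0 and p == 0 for t, p in zip(y_true, y_pred))
    f1 = 2 * tp / (2 * tp + fp + fn) if tp else 0.0
    denom = math.sqrt((tp + fp) * (tp + fn) * (tn + fp) * (tn + fn))
    mcc = (tp * tn - fp * fn) / denom if denom else 0.0
    return f1, mcc
```

For example, predictions [0, 1, 1, 1] against labels [0, 0, 1, 1] give F1 = 0.8 but MCC ≈ 0.577, since the one false positive costs more under MCC.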

What carries the argument

FinFRE-RAG, a two-stage pipeline that first performs importance-guided feature reduction on tabular inputs and then applies retrieval-augmented in-context learning over serialized exemplars.
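Serializing the reduced feature set into natural language might look like the following sketch; the template wording and field names are hypothetical, since the paper's exact prompt format is not reproduced here.

```python
def serialize(transaction, selected_features):
    # transaction: dict mapping feature name -> value.
    # selected_features: ordered subset chosen by the importance-guided
    # reduction step (stage one of the pipeline).
    parts = [f"{name} is {transaction[name]}"
             for name in selected_features if name in transaction]
    return "Transaction where " + ", ".join(parts) + "."
```

A transaction like `{"amount": 912.5, "country": "US", "hour": 3}` reduced to `["amount", "hour"]` would serialize as "Transaction where amount is 912.5, hour is 3."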

Load-bearing premise

Importance-guided feature reduction selects a compact subset that preserves all information necessary for accurate fraud classification without discarding critical signals or introducing selection bias.

What would settle it

Running FinFRE-RAG on a held-out fraud dataset in which a known decisive fraud signal is omitted from the top-ranked features would show whether the reduction step loses essential information.

Figures

Figures reproduced from arXiv: 2512.13040 by Xueru Zhang, Xuwei Tan, Yao Ma.

Figure 1. Overall architecture of the FinFRE-RAG framework.
Figure 2. MCC vs. number of selected features k, reported on the CCF and IEEE-CIS datasets.
Figure 3. MCC vs. number of retrieved transactions.
Figure 4. Example responses from GPT-OSS-20B without and with FinFRE-RAG.
Original abstract

Detecting fraud in financial transactions typically relies on tabular models that demand heavy feature engineering to handle high-dimensional data and offer limited interpretability, making it difficult for humans to understand predictions. Large Language Models (LLMs), in contrast, can produce human-readable explanations and facilitate feature analysis, potentially reducing the manual workload of fraud analysts and informing system refinements. However, they perform poorly when applied directly to tabular fraud detection due to the difficulty of reasoning over many features, the extreme class imbalance, and the absence of contextual information. To bridge this gap, we introduce FinFRE-RAG, a two-stage approach that applies importance-guided feature reduction to serialize a compact subset of numeric/categorical attributes into natural language and performs retrieval-augmented in-context learning over label-aware, instance-level exemplars. Across four public fraud datasets and three families of open-weight LLMs, FinFRE-RAG substantially improves F1/MCC over direct prompting and is competitive with strong tabular baselines in several settings. Although these LLMs still lag behind specialized classifiers, they narrow the performance gap and provide interpretable rationales, highlighting their value as assistive tools in fraud analysis.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, and this is the friction.

Referee Report

2 major / 2 minor

Summary. The paper claims that FinFRE-RAG, a two-stage pipeline applying importance-guided feature reduction to serialize a compact subset of tabular attributes into natural language followed by retrieval-augmented in-context learning with label-aware exemplars, substantially improves F1 and MCC over direct prompting. This is shown across four public fraud datasets and three families of open-weight LLMs, where it is competitive with strong tabular baselines in several settings while providing interpretable rationales, although LLMs still lag specialized classifiers.

Significance. If the results hold, this demonstrates a viable path to adapt LLMs for high-dimensional, imbalanced tabular financial tasks, reducing manual feature engineering and enabling human-readable explanations as assistive tools for fraud analysts. The multi-dataset, multi-model empirical evaluation on public data strengthens reproducibility and suggests broader utility in financial ML applications.

major comments (2)
  1. [§3.2] §3.2 (Method, importance-guided reduction): The procedure for computing feature importances (base model, split used, selection threshold or k) is not fully specified. This is load-bearing for the central claim, as the skeptical concern that global importance may discard rare cross-feature interactions under extreme imbalance is not addressed by ablation or sensitivity analysis.
  2. [§4.3] §4.3 (Experiments, results tables): Reported F1/MCC gains lack error bars across runs, statistical significance tests, or full baseline hyperparameter details. Without these, it is unclear whether improvements over direct prompting are robust, especially given class imbalance.
minor comments (2)
  1. [Figure 1] Figure 1 (pipeline diagram): The serialization step and RAG retrieval example could be expanded with a concrete prompt template to improve clarity on how numeric values are encoded.
  2. [§5] §5 (Discussion): Expand limitations to explicitly discuss potential information loss from feature reduction and its impact on LLM performance relative to full tabular models.

Simulated Authors' Rebuttal

2 responses · 0 unresolved

We thank the referee for the constructive and detailed comments. These have helped us identify areas where additional clarity and rigor are needed. We address each major comment point by point below, indicating the revisions planned for the next version of the manuscript.

Point-by-point responses
  1. Referee: [§3.2] §3.2 (Method, importance-guided reduction): The procedure for computing feature importances (base model, split used, selection threshold or k) is not fully specified. This is load-bearing for the central claim, as the skeptical concern that global importance may discard rare cross-feature interactions under extreme imbalance is not addressed by ablation or sensitivity analysis.

    Authors: We agree that the description in §3.2 was incomplete. In the revised manuscript we have expanded this section to specify that feature importances are computed with a LightGBM classifier trained on the training split (using default hyperparameters and the training labels), selecting the top-k features where k is the smallest value retaining at least 80% of cumulative importance or a hard cap of 15 features. Regarding the concern that global importance may miss rare cross-feature interactions under extreme imbalance, we acknowledge this is a valid limitation of any univariate importance ranking. We have added an ablation in the appendix that compares the reduced feature set against the full set plus explicit pairwise feature crosses (generated only on the reduced features); the results show that the performance drop is small (<3% F1 on average) while inference cost decreases substantially. We have also included a sensitivity table varying the cumulative-importance threshold from 70% to 90% and k from 10 to 20, confirming that F1/MCC remain stable across these choices on all four datasets. revision: yes

  2. Referee: [§4.3] §4.3 (Experiments, results tables): Reported F1/MCC gains lack error bars across runs, statistical significance tests, or full baseline hyperparameter details. Without these, it is unclear whether improvements over direct prompting are robust, especially given class imbalance.

    Authors: We concur that the original results lacked sufficient statistical detail. The revised tables now report mean F1 and MCC together with standard deviations computed over five independent runs that differ in random seed for both data shuffling and LLM sampling. We have added paired Wilcoxon signed-rank tests (chosen for robustness to non-normality) comparing FinFRE-RAG against direct prompting for every dataset–model pair, with p-values shown in the tables; the improvements are statistically significant (p < 0.05) in 10 of the 12 settings. In the appendix we now provide the complete hyperparameter grids and selection protocol for all baselines (grid search on validation F1, with the exact ranges for learning rate, max depth, etc., for XGBoost, Random Forest, and MLP). These additions directly address concerns about robustness under class imbalance. revision: yes
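The selection rule the rebuttal describes for §3.2 (smallest k retaining at least 80% of cumulative importance, with a hard cap of 15 features) can be sketched as follows. The importance scores themselves are mocked here; in the authors' setup they would come from a LightGBM classifier trained on the training split.

```python
def select_top_k(importances, threshold=0.8, cap=15):
    # importances: dict mapping feature name -> importance score.
    # Returns the smallest prefix of importance-ranked features whose
    # cumulative share of total importance reaches `threshold`, capped
    # at `cap` features.
    total = sum(importances.values())
    ranked = sorted(importances, key=importances.get, reverse=True)
    chosen, cum = [], 0.0
    for feat in ranked:
        chosen.append(feat)
        cum += importances[feat] / total
        if cum >= threshold or len(chosen) == cap:
            break
    return chosen
```

With importances {a: 0.5, b: 0.3, c: 0.1, d: 0.1}, the cumulative share reaches 0.8 at the second feature, so only a and b are retained.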
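For the significance tests promised in the response above, one would normally call scipy.stats.wilcoxon; as a transparent reference, here is a minimal pure-Python version of the paired signed-rank statistic (zeros dropped, average ranks for ties, two-sided p-value omitted for brevity):

```python
def wilcoxon_statistic(xs, ys):
    # Paired Wilcoxon signed-rank statistic: rank the absolute paired
    # differences, then return min(W+, W-), where W+/W- are the rank sums
    # of positive/negative differences. Smaller values indicate a more
    # one-sided set of differences.
    diffs = [x - y for x, y in zip(xs, ys) if x != y]
    order = sorted(range(len(diffs)), key=lambda i: abs(diffs[i]))
    ranks = [0.0] * len(diffs)
    i = 0
    while i < len(order):
        j = i
        while j + 1 < len(order) and abs(diffs[order[j + 1]]) == abs(diffs[order[i]]):
            j += 1
        avg = (i + j) / 2 + 1  # average rank for the tied block (1-based)
        for k in range(i, j + 1):
            ranks[order[k]] = avg
        i = j + 1
    w_plus = sum(r for d, r in zip(diffs, ranks) if d > 0)
    w_minus = sum(r for d, r in zip(diffs, ranks) if d < 0)
    return min(w_plus, w_minus)
```

Note that with only five paired runs per setting, the exact two-sided Wilcoxon p-value cannot fall below 1/16 ≈ 0.0625, so the reported p < 0.05 presumably pools observations across more pairs than the five seeds alone.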

Circularity Check

0 steps flagged

Empirical method with minor self-citation; no derivation reduces to inputs

Full rationale

The paper introduces FinFRE-RAG as a two-stage empirical pipeline (importance-guided feature reduction followed by RAG-based in-context learning) and evaluates it directly on four public fraud datasets using F1/MCC against baselines. No equations, predictions, or uniqueness claims reduce by construction to parameters fitted from the target data. The central claims rest on experimental outcomes rather than self-referential definitions or self-citation chains. Any self-citations are peripheral and non-load-bearing for the reported improvements.

Axiom & Free-Parameter Ledger

0 free parameters · 2 axioms · 1 invented entity

The approach rests on domain assumptions about LLM reasoning over serialized text and the reliability of feature importance metrics rather than new physical entities or fitted constants.

axioms (2)
  • domain assumption Importance scores from a separate model reliably identify the minimal feature subset needed for fraud classification
    Invoked in the first stage to justify reduction without loss of predictive power
  • domain assumption LLMs can perform effective in-context learning from a small number of retrieved label-aware exemplars when data is serialized
    Central premise of the second stage
invented entities (1)
  • FinFRE-RAG (no independent evidence)
    purpose: Two-stage pipeline for LLM-based fraud detection on tabular data
    New method name and architecture introduced by the authors

pith-pipeline@v0.9.0 · 5500 in / 1422 out tokens · 39029 ms · 2026-05-16T21:56:24.376264+00:00 · methodology

discussion (0)


Reference graph

Works this paper leans on

53 extracted references · 53 canonical work pages · 7 internal anchors

  3. [3]

    Sandhini Agarwal, Lama Ahmad, Jason Ai, Sam Altman, Andy Applebaum, Edwin Arbus, Rahul K Arora, Yu Bai, Bowen Baker, Haiming Bao, and others. 2025. gpt-oss-120b & gpt-oss-20b model card. arXiv preprint arXiv:2508.10925

  4. [4]

    Toyin D Aguda, Suchetha Siddagangappa, Elena Kochkina, Simerjot Kaur, Dongsheng Wang, and Charese Smiley. 2024. Large language models as financial data annotators: A study on effectiveness and efficiency. In Proceedings of the 2024 Joint International Conference on Computational Linguistics, Language Resources and Evaluation (LREC-COLING 2024), pages 10124--10145

  5. [5]

    Takuya Akiba, Shotaro Sano, Toshihiko Yanase, Takeru Ohta, and Masanori Koyama. 2019. Optuna: A next-generation hyperparameter optimization framework. In Proceedings of the 25th ACM SIGKDD international conference on knowledge discovery & data mining, pages 2623--2631

  6. [6]

    Tom Brown, Benjamin Mann, Nick Ryder, Melanie Subbiah, Jared D Kaplan, Prafulla Dhariwal, Arvind Neelakantan, Pranav Shyam, Girish Sastry, Amanda Askell, and others. 2020. Language models are few-shot learners. Advances in neural information processing systems, 33:1877--1901

  7. [7]

    Tianqi Chen and Carlos Guestrin. 2016. Xgboost: A scalable tree boosting system. In Proceedings of the 22nd ACM SIGKDD international conference on knowledge discovery and data mining, pages 785--794

  8. [8]

    Davide Chicco and Giuseppe Jurman. 2020. The advantages of the Matthews correlation coefficient (MCC) over F1 score and accuracy in binary classification evaluation. BMC genomics, 21(1):6

  9. [9]

    Daniel Han, Michael Han, and the Unsloth team. 2023. Unsloth. http://github.com/unslothai/unsloth

  10. [10]

    Yingtong Dou, Zhiwei Liu, Li Sun, Yutong Deng, Hao Peng, and Philip S Yu. 2020. Enhancing graph neural network-based fraud detectors against camouflaged fraudsters. In Proceedings of the 29th ACM international conference on information & knowledge management, pages 315--324

  11. [11]

    Duanyu Feng, Yongfu Dai, Jimin Huang, Yifang Zhang, Qianqian Xie, Weiguang Han, Zhengyu Chen, Alejandro Lopez-Lira, and Hao Wang. 2023. Empowering many, biasing a few: Generalist credit scoring through large language models. arXiv preprint arXiv:2310.00566

  12. [12]

    Ugo Fiore, Alfredo De Santis, Francesca Perla, Paolo Zanetti, and Francesco Palmieri. 2019. Using generative adversarial networks for improving classification effectiveness in credit card fraud detection. Information Sciences, 479:448--455

  13. [13]

    João Gama, Indrė Žliobaitė, Albert Bifet, Mykola Pechenizkiy, and Abdelhamid Bouchachia. 2014. A survey on concept drift adaptation. ACM computing surveys (CSUR), 46(4):1--37

  14. [14]

    Yury Gorishniy, Akim Kotelnikov, and Artem Babenko. 2024. Tabm: Advancing tabular deep learning with parameter-efficient ensembling. arXiv preprint arXiv:2410.24210

  15. [15]

    Kelvin Guu, Kenton Lee, Zora Tung, Panupong Pasupat, and Mingwei Chang. 2020. Retrieval augmented language model pre-training. In International conference on machine learning, pages 3929--3938. PMLR

  16. [16]

    Waleed Hilal, S Andrew Gadsden, and John Yawney. 2022. Financial fraud: a review of anomaly detection techniques and recent advances. Expert systems With applications, 193:116429

  17. [17]

    Addison Howard, Bernadette Bouchon-Meunier, IEEE CIS, inversion, John Lei, Lynn@Vesta, Marcus2010, and Hussein Abbass. 2019. IEEE-CIS Fraud Detection. Kaggle

  18. [18]

    Edward J Hu, Yelong Shen, Phillip Wallis, Zeyuan Allen-Zhu, Yuanzhi Li, Shean Wang, Lu Wang, and Weizhu Chen. 2022. https://openreview.net/forum?id=nZeVKeeFYf9 LoRA: Low-rank adaptation of large language models. In International Conference on Learning Representations

  19. [19]

    Jun Hu, Wenwen Xia, Xiaolu Zhang, Chilin Fu, Weichang Wu, Zhaoxin Huan, Ang Li, Zuoli Tang, and Jun Zhou. 2024. Enhancing sequential recommendation via llm-based semantic embedding learning. In Companion Proceedings of the ACM Web Conference 2024, pages 103--111

  20. [20]

    Tairan Huang and Yili Wang. 2025. Can llms find fraudsters? multi-level llm enhanced graph fraud detection. arXiv preprint arXiv:2507.11997

  21. [21]

    Jing Jin and Yongqing Zhang. 2025. The analysis of fraud detection in financial market under machine learning. Scientific Reports, 15(1):29959

  22. [22]

    SK Kamaruddin and Vadlamani Ravi. 2016. Credit card fraud detection using big data analytics: use of psoaann based one-class classification. In Proceedings of the international conference on informatics and analytics, pages 1--8

  23. [23]

    Vladimir Karpukhin, Barlas Oguz, Sewon Min, Patrick SH Lewis, Ledell Wu, Sergey Edunov, Danqi Chen, and Wen-tau Yih. 2020. Dense passage retrieval for open-domain question answering. In EMNLP (1), pages 6769--6781

  24. [24]

    Guolin Ke, Qi Meng, Thomas Finley, Taifeng Wang, Wei Chen, Weidong Ma, Qiwei Ye, and Tie-Yan Liu. 2017. Lightgbm: A highly efficient gradient boosting decision tree. Advances in neural information processing systems, 30

  25. [25]

    Seunghee Kim, Changhyeon Kim, and Taeuk Kim. 2025. https://doi.org/10.18653/v1/2025.acl-long.1138 Fcmr: Robust evaluation of financial cross-modal multi-hop reasoning. In Proceedings of the 63rd Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), pages 23352--23380, Vienna, Austria. Association for Computational Linguistics

  26. [26]

    Patrick Lewis, Ethan Perez, Aleksandra Piktus, Fabio Petroni, Vladimir Karpukhin, Naman Goyal, Heinrich Küttler, Mike Lewis, Wen-tau Yih, Tim Rocktäschel, and others. 2020. Retrieval-augmented generation for knowledge-intensive nlp tasks. Advances in neural information processing systems, 33:9459--9474

  27. [27]

    Haohang Li, Yupeng Cao, Yangyang Yu, Shashidhar Reddy Javaji, Zhiyang Deng, Yueru He, Yuechen Jiang, Zining Zhu, K.P. Subbalakshmi, Jimin Huang, Lingfei Qian, Xueqing Peng, Jordan W. Suchow, and Qianqian Xie. 2025. https://doi.org/10.18653/v1/2025.acl-long.126 Investorbench: A benchmark for financial decision-making tasks with llm-based agent. In Proceed...

  28. [28]

    Kaidi Li, Tianmeng Yang, Min Zhou, Jiahao Meng, Shendi Wang, Yihui Wu, Boshuai Tan, Hu Song, Lujia Pan, Fan Yu, and others. 2024. Sefraud: Graph-based self-explainable fraud detection via interpretative mask learning. In Proceedings of the 30th ACM SIGKDD Conference on Knowledge Discovery and Data Mining, pages 5329--5338

  29. [29]

    E Lopez-Rojas. 2017. Synthetic financial datasets for fraud detection. Kaggle. Available online: https://www.kaggle.com/datasets/ealaxi/paysim1 (accessed on 29 July 2023)

  30. [30]

    Junyu Luo, Zhizhuo Kou, Liming Yang, Xiao Luo, Jinsheng Huang, Zhiping Xiao, Jingshu Peng, Chengzhong Liu, Jiaming Ji, Xuanzhe Liu, Sirui Han, Ming Zhang, and Yike Guo. 2025. https://doi.org/10.18653/v1/2025.acl-long.1426 Finmme: Benchmark dataset for financial multi-modal reasoning evaluation. In Proceedings of the 63rd Annual Meeting of the Association...

  31. [31]

    Sewon Min, Xinxi Lyu, Ari Holtzman, Mikel Artetxe, Mike Lewis, Hannaneh Hajishirzi, and Luke Zettlemoyer. 2022. Rethinking the role of demonstrations: What makes in-context learning work? arXiv preprint arXiv:2202.12837

  32. [32]

    Yansong Ning, Shuowei Cai, Wei Li, Jun Fang, Naiqiang Tan, Hua Chai, and Hao Liu. 2025. Dima: An llm-powered ride-hailing assistant at didi. In Proceedings of the 31st ACM SIGKDD Conference on Knowledge Discovery and Data Mining V. 2, pages 4728--4739

  33. [33]

    Liudmila Prokhorenkova, Gleb Gusev, Aleksandr Vorobev, Anna Veronika Dorogush, and Andrey Gulin. 2018. Catboost: unbiased boosting with categorical features. Advances in neural information processing systems, 31

  34. [34]

    Ori Ram, Yoav Levine, Itay Dalmedigos, Dor Muhlgay, Amnon Shashua, Kevin Leyton-Brown, and Yoav Shoham. 2023. In-context retrieval-augmented language models. Transactions of the Association for Computational Linguistics, 11:1316--1331

  35. [35]

    Gurjot Singh, Prabhjot Singh, and Maninder Singh. 2025. Advanced real-time fraud detection using rag-based llms. arXiv preprint arXiv:2501.15290

  36. [36]

    Makram Soui, Ines Gasmi, Salima Smiti, and Khaled Ghédira. 2019. Rule-based credit risk assessment model using multi-objective evolutionary algorithms. Expert systems with applications, 126:144--157

  37. [37]

    Gemma Team, Aishwarya Kamath, Johan Ferret, Shreya Pathak, Nino Vieillard, Ramona Merhej, Sarah Perrin, Tatiana Matejovicova, Alexandre Ramé, Morgane Rivière, and others. 2025. Gemma 3 technical report. arXiv preprint arXiv:2503.19786

  38. [38]

    Hugo Touvron, Thibaut Lavril, Gautier Izacard, Xavier Martinet, Marie-Anne Lachaux, Timothée Lacroix, Baptiste Rozière, Naman Goyal, Eric Hambro, Faisal Azhar, and others. 2023a. Llama: Open and efficient foundation language models. arXiv preprint arXiv:2302.13971

  39. [39]

    Hugo Touvron, Louis Martin, Kevin Stone, Peter Albert, Amjad Almahairi, Yasmine Babaei, Nikolay Bashlykov, Soumya Batra, Prajjwal Bhargava, Shruti Bhosale, and others. 2023b. Llama 2: Open foundation and fine-tuned chat models. arXiv preprint arXiv:2307.09288

  40. [40]

    Chengrui Wang, Qingqing Long, Meng Xiao, Xunxin Cai, Chengjun Wu, Zhen Meng, Xuezhi Wang, and Yuanchun Zhou. 2024. Biorag: A rag-llm framework for biological question reasoning. arXiv preprint arXiv:2408.01107

  41. [41]

    Jianling Wang, Yifan Liu, Yinghao Sun, Xuejian Ma, Yueqi Wang, He Ma, Zhengyang Su, Minmin Chen, Mingyan Gao, Onkar Dalal, and others. 2025. User feedback alignment for LLM-powered exploration in large-scale recommendation systems. In Proceedings of the 63rd Annual Meeting of the Association for Computational Linguistics (Volume 6: Industry Track), pag...

  42. [42]

    Neng Wang, Hongyang Yang, and Christina Wang. Fingpt: Instruction tuning benchmark for open-source large language models in financial datasets. In NeurIPS 2023 Workshop on Instruction Tuning and Instruction Following

  43. [43]

    Shijie Wu, Ozan Irsoy, Steven Lu, Vadim Dabravolski, Mark Dredze, Sebastian Gehrmann, Prabhanjan Kambadur, David Rosenberg, and Gideon Mann. 2023. Bloomberggpt: A large language model for finance. arXiv preprint arXiv:2303.17564

  44. [44]

    Qianqian Xie, Weiguang Han, Zhengyu Chen, Ruoyu Xiang, Xiao Zhang, Yueru He, Mengxi Xiao, Dong Li, Yongfu Dai, Duanyu Feng, and others. 2024. Finben: A holistic financial benchmark for large language models. Advances in Neural Information Processing Systems, 37:95716--95743

  45. [45]

    Qianqian Xie, Weiguang Han, Xiao Zhang, Yanzhao Lai, Min Peng, Alejandro Lopez-Lira, and Jimin Huang. 2023. Pixiu: A comprehensive benchmark, instruction dataset and large language model for finance. Advances in Neural Information Processing Systems, 36:33469--33484

  46. [46]

    An Yang, Anfeng Li, Baosong Yang, Beichen Zhang, Binyuan Hui, Bo Zheng, Bowen Yu, Chang Gao, Chengen Huang, Chenxu Lv, and others. 2025a. Qwen3 technical report. arXiv preprint arXiv:2505.09388

  47. [47]

    Chengdong Yang, Hongrui Liu, Daixin Wang, Zhiqiang Zhang, Cheng Yang, and Chuan Shi. 2025b. Flag: Fraud detection with llm-enhanced graph neural network. In Proceedings of the 31st ACM SIGKDD Conference on Knowledge Discovery and Data Mining V. 2, pages 5150--5160

  48. [48]

    Hongyang Yang, Xiao-Yang Liu, and Christina Dan Wang. 2023. Fingpt: Open-source financial large language models. FinLLM Symposium at IJCAI 2023

  49. [49]

    Shu Yang, Shenzhe Zhu, Zeyu Wu, Keyu Wang, Junchi Yao, Junchao Wu, Lijie Hu, Mengdi Li, Derek F Wong, and Di Wang. 2025c. Fraud-r1: A multi-round benchmark for assessing the robustness of llm against augmented fraud and phishing inducements. arXiv preprint arXiv:2502.12904

  50. [50]

    Yi Yang, Mark Christopher Siy UY, and Allen Huang. 2020. https://arxiv.org/abs/2006.08097 Finbert: A pretrained language model for financial communications . Preprint, arXiv:2006.08097

  51. [51]

    Michihiro Yasunaga, Xinyun Chen, Yujia Li, Panupong Pasupat, Jure Leskovec, Percy Liang, Ed H. Chi, and Denny Zhou. 2024. https://openreview.net/forum?id=AgDICX1h50 Large language models as analogical reasoners. In The Twelfth International Conference on Learning Representations

  52. [52]

    Jianke Yu, Hanchen Wang, Xiaoyang Wang, Zhao Li, Lu Qin, Wenjie Zhang, Jian Liao, and Ying Zhang. 2023. Group-based fraud detection network on e-commerce platforms. In Proceedings of the 29th ACM SIGKDD conference on knowledge discovery and data mining, pages 5463--5475

  53. [53]

    Yangyang Yu, Zhiyuan Yao, Haohang Li, Zhiyang Deng, Yuechen Jiang, Yupeng Cao, Zhi Chen, Jordan Suchow, Zhenyu Cui, Rong Liu, and others. 2024. Fincon: A synthesized llm multi-agent system with conceptual verbal reinforcement for enhanced financial decision making. Advances in Neural Information Processing Systems, 37:137010--137045