From Accuracy to Auditability: A Survey of Determinism in Financial AI Systems

Deepayan Chakrabarti; Dengdu Jiang; Gaoyuan Du; Ruizhe Zhou; Shouxi Ren; Xiaoyang Liu; Yi Zheng

arxiv: 2605.23955 · v2 · pith:M5DI3WAQnew · submitted 2026-05-11 · 💻 cs.AI · cs.DC· cs.LG· cs.SI· q-fin.CP

From Accuracy to Auditability: A Survey of Determinism in Financial AI Systems

Ruizhe Zhou , Xiaoyang Liu , Gaoyuan Du , Yi Zheng , Shouxi Ren , Deepayan Chakrabarti , Dengdu Jiang This is my paper

Pith reviewed 2026-06-30 22:25 UTC · model grok-4.3

classification 💻 cs.AI cs.DCcs.LGcs.SIq-fin.CP

keywords determinismreproducibilityfinancial AIauditabilitynondeterminismtabular modelsgraph networksLLM workflows

0 comments

The pith

Mechanical nondeterminism from hardware and architecture undermines reproducibility in financial AI systems used for credit, fraud, and AML tasks.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper surveys how deep neural networks and generative AI introduce mechanical nondeterminism that breaks consistent outputs across three dominant financial modalities. It documents the problem through literature review and first-party tests on public datasets that measure explanation instability, prediction flips, and output divergence. The work then introduces a layered framework that ties specific metrics to audit readiness. A sympathetic reader would care because regulated finance demands reproducible decisions for compliance and liability reasons. If the central observation holds, current accuracy-focused evaluation falls short of what auditors require.

Core claim

Mechanical nondeterminism rooted in hardware and architecture in deep neural networks and Generative AI has introduced critical vulnerabilities in algorithmic reproducibility across tabular models, graph networks, and LLM-based agentic workflows in financial applications. First-party experiments quantify explanation rank instability in credit scoring, prediction flip rates in GNN-based fraud detection, and tensor-parallel-induced output divergence in LLM entity extraction. A layered evaluation framework is proposed that links modality-specific metrics (RBO, D_cos, TDI, PSD) to audit readiness and validates the complementarity of logit-level and semantic-level determinism measures.

What carries the argument

Layered evaluation framework that maps modality-specific determinism metrics (RBO, D_cos, TDI, PSD) to audit readiness across tabular, graph, and LLM financial AI systems.

If this is right

Tabular models for credit scoring exhibit unstable post-hoc explanations under nondeterminism.
Graph networks for fraud detection show measurable prediction flip rates from stochastic sampling and temporal asynchrony.
LLM-based agentic workflows display batch-dependent divergence and trajectory drift in entity extraction tasks.
Logit-level and semantic-level measures provide complementary signals for assessing determinism.
The framework supplies concrete metrics that can be applied to evaluate audit readiness in each modality.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

Regulators could incorporate determinism testing into model validation requirements for financial AI.
The same hardware-rooted issues may appear in other regulated domains that use similar model types.
Practical deployment might require hardware or software constraints to enforce reproducibility where audits are mandatory.

Load-bearing premise

The first-party experiments on public financial datasets and the proposed metrics sufficiently demonstrate the scale of nondeterminism issues and the effectiveness of the layered evaluation framework for real-world regulated financial systems.

What would settle it

An empirical study on regulated financial deployments that finds the quantified nondeterminism levels produce no measurable change in audit outcomes or regulatory compliance would falsify the claim of critical vulnerabilities.

Figures

Figures reproduced from arXiv: 2605.23955 by Deepayan Chakrabarti, Dengdu Jiang, Gaoyuan Du, Ruizhe Zhou, Shouxi Ren, Xiaoyang Liu, Yi Zheng.

**Figure 2.** Figure 2: Model ranking across 10 splits on Elliptic Bitcoin. [PITH_FULL_IMAGE:figures/full_fig_p004_2.png] view at source ↗

**Figure 3.** Figure 3: plots TDI against PSD for all model×TP configurations. The scatter reveals three distinct regimes: (1) moderate TDI, high PSD (Qwen2.5-7B at TP=4: TDI=0.77, PSD=0.99)—despite tokenlevel divergence, the model produces semantically equivalent outputs, indicating benign surface-level variation; (2) low TDI, low PSD (InternLM2.5-7B at TP=4: TDI=0.75, PSD=0.88)—both evidence and meaning diverge, signaling gen… view at source ↗

read the original abstract

Deploying machine learning in regulated financial environments -- credit risk, fraud detection, and anti-money laundering -- exposes critical vulnerabilities in algorithmic reproducibility. While early financial ML addressed statistical challenges such as backtest overfitting, deep neural networks and Generative AI have introduced mechanical nondeterminism rooted in hardware and architecture. This survey provides a systems perspective on reproducibility failures across three modalities now dominant in financial AI: tabular models (post-hoc explanation variance), graph networks (stochastic sampling and temporal asynchrony), and LLM-based agentic workflows (batch-dependent divergence and trajectory drift). We supplement the literature analysis with first-party experiments on public financial datasets -- quantifying explanation rank instability in credit scoring, prediction flip rates in GNN-based fraud detection, and tensor-parallel-induced output divergence in LLM entity extraction. We propose a layered evaluation framework linking modality-specific metrics (RBO, D_cos, TDI, PSD) to audit readiness, and empirically validate the complementarity of logit-level and semantic-level determinism measures.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Desk Editor's Note private letter to a colleague

Survey of nondeterminism in financial AI adds public-dataset experiments and an audit framework, but the evidence tying instabilities to real compliance failures stays indirect.

read the letter

The main things to know about this paper are that it surveys determinism problems in financial AI across tabular, graph, and LLM models, adds first-party experiments on public datasets measuring instabilities like explanation rank changes and prediction flips, and proposes a layered framework with metrics such as RBO, D_cos, TDI, and PSD to assess audit readiness.

What the paper does well is to distinguish mechanical nondeterminism from earlier statistical concerns and to apply it specifically to financial applications like credit risk and fraud detection. The experiments provide concrete quantifications and validate that different determinism measures complement each other, which is a useful addition to the existing literature on reproducibility.

The soft spots are that the experiments use public datasets, which do not reflect the proprietary data scales, imbalances, or exact audit thresholds in real regulated systems. The stress-test note is accurate: the link from these metric values to actual changes in decisions or compliance failures is not made explicit, leaving the claim of critical vulnerabilities on indirect evidence. Since the abstract supplies no specific results or error bars, it's also difficult to gauge how large the instabilities are in practice.

This paper is for researchers and engineers focused on deploying AI in finance or developing audit standards for it. A reader interested in systems-level reproducibility would get value from the modality breakdown and the proposed metrics.

It deserves peer review because the topic is relevant to a major sector and the framework could be built on, even if the current experiments need strengthening to support the stronger claims.

Referee Report

2 major / 1 minor

Summary. The paper surveys mechanical nondeterminism in financial AI across tabular models (post-hoc explanation variance), graph networks (stochastic sampling and temporal asynchrony), and LLM-based agentic workflows (batch-dependent divergence and trajectory drift). It supplements the literature review with first-party experiments on public financial datasets that quantify explanation rank instability (RBO), prediction flip rates (D_cos), temporal divergence (TDI), and output divergence (PSD), and proposes a layered evaluation framework that links these modality-specific metrics to audit readiness while claiming empirical validation of the complementarity between logit-level and semantic-level determinism measures.

Significance. If the layered framework and metrics prove effective at quantifying and mitigating reproducibility failures, the work would provide a valuable systems-level contribution toward auditability in regulated financial domains. The survey structure and introduction of concrete metrics (RBO, D_cos, TDI, PSD) offer a structured way to connect hardware/architecture nondeterminism to practical evaluation, which could inform compliance practices if the public-dataset results generalize.

major comments (2)

[first-party experiments on public financial datasets] First-party experiments section: the reported metric values on public datasets (RBO for credit scoring explanations, D_cos for GNN fraud detection flips, TDI/PSD for LLM divergence) are presented without explicit thresholds, decision-impact analysis, or mapping to regulatory compliance failures (e.g., changes in credit approval or AML alerts), so the central claim of 'critical vulnerabilities' in credit risk, fraud, and AML remains indirect.
[layered evaluation framework] Layered evaluation framework section: the assertion that the metrics demonstrate complementarity and link to 'audit readiness' lacks quantitative validation against real regulatory constraints or proprietary-scale data (class imbalance, audit thresholds), leaving the framework's practical utility for regulated systems unestablished.

minor comments (1)

[abstract] The abstract states that the framework is 'empirically validated' but supplies no numerical results, error bars, or dataset details; this should be expanded with concrete findings even in a survey context.

Simulated Author's Rebuttal

2 responses · 2 unresolved

We thank the referee for the constructive feedback on our survey of determinism in financial AI systems. We address the two major comments point by point below, clarifying the scope of our public-dataset experiments and the proposed framework while noting inherent limitations of a survey format.

read point-by-point responses

Referee: [first-party experiments on public financial datasets] First-party experiments section: the reported metric values on public datasets (RBO for credit scoring explanations, D_cos for GNN fraud detection flips, TDI/PSD for LLM divergence) are presented without explicit thresholds, decision-impact analysis, or mapping to regulatory compliance failures (e.g., changes in credit approval or AML alerts), so the central claim of 'critical vulnerabilities' in credit risk, fraud, and AML remains indirect.

Authors: The experiments quantify mechanical nondeterminism on standard public financial datasets to illustrate the phenomena across modalities, using metrics chosen for their direct interpretability (e.g., RBO capturing explanation rank changes that could affect post-hoc audits). We do not provide explicit thresholds or direct mappings to regulatory outcomes such as credit approval changes because these are institution-specific and would require proprietary decision logs and compliance data unavailable for a survey. The central claim of vulnerabilities is supported by the quantified instability levels on representative data, but we agree the regulatory linkage is indirect. We will add a discussion subsection on potential decision impacts and how practitioners might calibrate thresholds, constituting a partial revision. revision: partial
Referee: [layered evaluation framework] Layered evaluation framework section: the assertion that the metrics demonstrate complementarity and link to 'audit readiness' lacks quantitative validation against real regulatory constraints or proprietary-scale data (class imbalance, audit thresholds), leaving the framework's practical utility for regulated systems unestablished.

Authors: Complementarity between logit-level and semantic-level measures is empirically demonstrated via the public-dataset experiments, where the metrics capture orthogonal aspects of divergence. The layered framework provides a conceptual mapping from these metrics to audit readiness as a systems-level proposal. We acknowledge that quantitative validation on proprietary-scale data with real regulatory constraints is not feasible here. We will revise the framework section to explicitly delineate its scope as a proposed structure with public proof-of-concept results rather than a fully validated production tool, constituting a partial revision for added clarity on limitations. revision: partial

standing simulated objections not resolved

Quantitative validation of the layered framework or metrics against proprietary financial datasets that incorporate actual regulatory thresholds, class imbalance, and audit constraints.
Direct empirical mapping from observed metric values to specific compliance failures (e.g., changes in AML alerts or credit decisions) without access to internal institutional systems and logs.

Circularity Check

0 steps flagged

No circularity: survey structure with independent literature review and public-dataset experiments

full rationale

The paper is a literature survey supplemented by first-party experiments on public financial datasets. It defines new metrics (RBO, D_cos, TDI, PSD) and applies them to quantify observed nondeterminism, then links them to an audit framework. No equations or derivations reduce to self-definition, fitted parameters renamed as predictions, or load-bearing self-citations. The central claims rest on external literature plus reproducible experiments on public data, with no reduction of the 'critical vulnerabilities' claim to author-defined inputs by construction. This is the expected non-finding for a survey paper.

Axiom & Free-Parameter Ledger

0 free parameters · 0 axioms · 0 invented entities

This is a survey paper supplemented by experiments; the central claims rest on literature analysis and described empirical work rather than new mathematical axioms, free parameters, or invented entities.

pith-pipeline@v0.9.1-grok · 5730 in / 1154 out tokens · 41016 ms · 2026-06-30T22:25:28.777463+00:00 · methodology

discussion (0)

Reference graph

Works this paper leans on

48 extracted references · 20 canonical work pages · 3 internal anchors

[1]

Managing explanations: how regulators can address AI explainability in Model Risk Management

Bank for International Settlements (BIS). Managing explanations: how regulators can address AI explainability in Model Risk Management . FSI Insights on policy implementation, 2024

2024
[2]

Regulation (EU) 2024/1689 laying down har- monised rules on Artificial Intelligence (Artificial Intelligence Act)

European Parliament and Council. Regulation (EU) 2024/1689 laying down har- monised rules on Artificial Intelligence (Artificial Intelligence Act) . Official Journal of the European Union, 2024. From Accuracy to Auditability: A Survey of Determinism in Financial AI Systems Conference’17, July 2017, Washington, DC, USA

2024
[3]

Harvey, Yan Liu, and Heqing Zhu

Campbell R. Harvey, Yan Liu, and Heqing Zhu.... and the Cross-Section of Expected Returns. The Review of Financial Studies, 29(1):5–68, 2016

2016
[4]

Bailey and Marcos López de Prado.The Deflated Sharpe Ratio: Correcting for Selection Bias, Backtest Overfitting, and Non-Normality

David H. Bailey and Marcos López de Prado.The Deflated Sharpe Ratio: Correcting for Selection Bias, Backtest Overfitting, and Non-Normality. The Journal of Portfolio Management, 40(5):89–107, 2014

2014
[5]

Arian et al

H. Arian et al. Backtest overfitting in the machine learning era: A comparison of out-of-sample testing methods. Knowledge-Based Systems, 2024

2024
[6]

Bhojanapalli et al

S. Bhojanapalli et al. On the Reproducibility of Neural Network Predictions . arXiv preprint arXiv:2102.03349, 2021

work page arXiv 2021
[7]

Detlefsen et al

N. Detlefsen et al. TorchMetrics - Measuring Reproducibility in Deep Learning . Journal of Open Source Software, 7(70):4101, 2022

2022
[8]

Lundberg and Su-In Lee

Scott M. Lundberg and Su-In Lee. A Unified Approach to Interpreting Model Predictions. Advances in Neural Information Processing Systems (NeurIPS), 30, 2017

2017
[9]

Lundberg, Gabriel Erion, Hugh Chen, Alex DeGrave, Jordan M

Scott M. Lundberg, Gabriel Erion, Hugh Chen, Alex DeGrave, Jordan M. Prutkin, Bala Nair, Ronit Katz, Jonathan Himmelfarb, Nisha Bansal, and Su-In Lee. From Local Explanations to Global Understanding with Explainable AI for Trees . Nature Machine Intelligence, 2(1):56–67, 2020

2020
[10]

Intelligible Models for Classification and Regression

Yin Lou, Rich Caruana, and Johannes Gehrke. Intelligible Models for Classification and Regression. Proceedings of the 18th ACM SIGKDD International Conference on Knowledge Discovery and Data Mining (KDD), 150–158, 2012

2012
[11]

Intelligible Models for HealthCare: Predicting Pneumonia Risk and Hos- pital 30-day Readmission

Rich Caruana, Yin Lou, Johannes Gehrke, Paul Koch, Marc Sturm, and Noemie Elhadad. Intelligible Models for HealthCare: Predicting Pneumonia Risk and Hos- pital 30-day Readmission. Proceedings of the 21st ACM SIGKDD International Conference on Knowledge Discovery and Data Mining (KDD), 1721–1730, 2015

2015
[12]

Fooling LIME and SHAP: Adversarial Attacks on Post hoc Explanation Methods

Dylan Slack, Sophie Hilgard, Emily Jia, Sameer Singh, and Himabindu Lakkaraju. Fooling LIME and SHAP: Adversarial Attacks on Post hoc Explanation Methods . AAAI/ACM Conference on AI, Ethics, and Society (AIES), 2020

2020
[13]

Weber et al

M. Weber et al. Graph Neural Networks for Financial Fraud Detection: A Survey . arXiv preprint arXiv:2411.05815, 2024

work page arXiv 2024
[14]

Hamilton, Rex Ying, and Jure Leskovec

William L. Hamilton, Rex Ying, and Jure Leskovec. Inductive Representation Learning on Large Graphs . Advances in Neural Information Processing Systems (NeurIPS), 30, 2017

2017
[15]

Replayable Financial Agents: A Determinism-Faithfulness Assurance Harness for Tool-Using LLM Agents

Raffi Khatchadourian. Replayable Financial Agents: A Determinism-Faithfulness Assurance Harness for Tool-Using LLM Agents . arXiv preprint arXiv:2601.15322, 2026

work page arXiv 2026
[16]

Wang and Y

L. Wang and Y. Wang. Assessing Consistency and Reproducibility in the Outputs of Large Language Models: Evidence Across Diverse Finance and Accounting Tasks . arXiv preprint arXiv:2503.16974, 2025

work page arXiv 2025
[17]

HOPPE, M

Jiayi Yuan, Hao Li, Xinheng Ding, Wenya Xie, Yu-Jhe Li, Wentian Zhao, Kun Wan, Jing Shi, Xia Hu, and Zirui Liu. Understanding and Mitigating Numerical Sources of Nondeterminism in LLM Inference . arXiv preprint arXiv:2506.09501, 2025

work page arXiv 2025
[18]

Defeating Nondeterminism in LLM Inference

Thinking Machines Lab. Defeating Nondeterminism in LLM Inference . Technical Report, 2025. https://thinkingmachines.ai/defeating-nondeterminism

2025
[19]

En- abling Determinism in LLM Inference with Verified Speculation

Raja Gond, Aditya K Kamath, Ramachandran Ramjee, and Ashish Panwar. En- abling Determinism in LLM Inference with Verified Speculation . arXiv preprint arXiv:2601.17768, 2025

work page arXiv 2025
[20]

Why Should I Trust You?

Marco Tulio Ribeiro, Sameer Singh, and Carlos Guestrin. "Why Should I Trust You?": Explaining the Predictions of Any Classifier . Proceedings of the 22nd ACM SIGKDD International Conference on Knowledge Discovery and Data Mining (KDD), 2016

2016
[21]

Improving KernelSHAP: Practical Shapley Value Estima- tion Using Linear Regression

Ian Covert and Su-In Lee. Improving KernelSHAP: Practical Shapley Value Estima- tion Using Linear Regression . Proceedings of AISTATS (PMLR 130), 2021

2021
[22]

A Similarity Measure for Indefinite Rankings

William Webber, Alistair Moffat, and Justin Zobel. A Similarity Measure for Indefinite Rankings. ACM Transactions on Information Systems (TOIS), 28(4):1– 34, 2010

2010
[23]

Interpretation of Neural Networks is Fragile

Amirata Ghorbani, Abubakar Abid, and James Zou. Interpretation of Neural Networks is Fragile. Proceedings of the AAAI Conference on Artificial Intelligence, 33(01):3681–3688, 2019

2019
[24]

Weinberger, and Yoav Artzi

Tianyi Zhang, Varsha Kishore, Felix Wu, Kilian Q. Weinberger, and Yoav Artzi. BERTScore: Evaluating Text Generation with BERT . International Conference on Learning Representations (ICLR), 2020

2020
[25]

Kipf and Max Welling

Thomas N. Kipf and Max Welling. Semi-Supervised Classification with Graph Convolutional Networks. International Conference on Learning Representations (ICLR), 2017

2017
[26]

Pitfalls of Graph Neural Network Evaluation

Oleksandr Shchur, Maximilian Mumme, Aleksandar Bojchevski, and Stephan Günnemann. Pitfalls of Graph Neural Network Evaluation . NeurIPS Relational Representation Learning Workshop, 2018. arXiv:1811.05868

work page internal anchor Pith review Pith/arXiv arXiv 2018
[27]

Temporal Graph Networks for Deep Learning on Dynamic Graphs

Emanuele Rossi, Ben Chamberlain, Fabrizio Frasca, Davide Eynard, Federico Monti, and Michael Bronstein. Temporal Graph Networks for Deep Learning on Dynamic Graphs. ICML Workshop on Graph Representation Learning, 2020

2020
[28]

Fast Graph Representation Learning with PyTorch Geometric

Matthias Fey and Jan Eric Lenssen. Fast Graph Representation Learning with PyTorch Geometric. ICLR Workshop on Representation Learning on Graphs and Manifolds, 2019

2019
[29]

Fu, Stefano Ermon, Atri Rudra, and Christopher Ré

Tri Dao, Daniel Y. Fu, Stefano Ermon, Atri Rudra, and Christopher Ré. FlashAt- tention: Fast and Memory-Efficient Exact Attention with IO-A wareness. Advances in Neural Information Processing Systems (NeurIPS), 35, 2022

2022
[30]

Huang, Hui Wang, and Yi Yang

Allen H. Huang, Hui Wang, and Yi Yang. FinBERT: A Large Language Model for Extracting Information from Financial Text . Contemporary Accounting Research, 40(2):806–841, 2023

2023
[31]

ReAct: Synergizing Reasoning and Acting in Language Models

Shunyu Yao, Jeffrey Zhao, Dian Yu, Nan Du, Izhak Shafran, Karthik Narasimhan, and Yuan Cao. ReAct: Synergizing Reasoning and Acting in Language Models . International Conference on Learning Representations (ICLR), 2023

2023
[32]

Nemotron-Cascade: Scaling Cascaded Reinforcement Learning for General-Purpose Reasoning Models

Boxin Wang, Chankyu Lee, Nayeon Lee, Sheng-Chieh Lin, Wenliang Dai, Yang Chen, Yangyi Chen, Zhuolin Yang, Zihan Liu, Mohammad Shoeybi, Bryan Catan- zaro, and Wei Ping. Nemotron-Cascade: Scaling Cascaded Reinforcement Learning for General-Purpose Reasoning Models . arXiv preprint arXiv:2512.13607, 2025

work page arXiv 2025
[33]

LLM Output Drift: Cross-Provider Validation & Mitigation for Financial Workflows

Raffi Khatchadourian and Rolando Franco. LLM Output Drift: Cross-Provider Validation & Mitigation for Financial Workflows. arXiv preprint arXiv:2511.07585, 2025

work page arXiv 2025
[34]

Efficient Guided Generation for Large Language Models

Brandon T. Willard and Rémi Louf.Efficient Guided Generation for Large Language Models. arXiv preprint arXiv:2307.09702, 2023

work page internal anchor Pith review Pith/arXiv arXiv 2023
[35]

SHAP Stability in Credit Risk Management: A Case Study in Credit Card Default Model

Liang Lin and Yue Wang. SHAP Stability in Credit Risk Management: A Case Study in Credit Card Default Model . Risks, 2025. arXiv:2508.01851

work page arXiv 2025
[36]

Benoit.Evaluating the Stability of Model Explanations in Instance-Dependent Cost-Sensitive Credit Scoring

Matteo Ballegeer, Matthias Bogaert, and Dries F. Benoit.Evaluating the Stability of Model Explanations in Instance-Dependent Cost-Sensitive Credit Scoring . European Journal of Operational Research, 326(3), 2025

2025
[37]

Deterministic Inference across Tensor Parallel Sizes That Eliminates Training-Inference Mismatch

Ziyang Zhang, Xinheng Ding, Jiayi Yuan, Rixin Liu, Huizi Mao, Jiarong Xing, and Zirui Liu. Deterministic Inference across Tensor Parallel Sizes That Eliminates Training-Inference Mismatch. arXiv preprint arXiv:2511.17826, 2025

work page internal anchor Pith review Pith/arXiv arXiv 2025
[38]

STED and Consistency Scoring: A Framework for Evaluating LLM Structured Output Reliability

Guanchen Wang et al. STED and Consistency Scoring: A Framework for Evaluating LLM Structured Output Reliability . NeurIPS Workshop, 2025. arXiv:2512.23712

work page arXiv 2025
[39]

AEMA: Verifiable Evaluation Framework for Trustworthy and Controlled Agentic LLM Systems

Yu-Ting Lee et al. AEMA: Verifiable Evaluation Framework for Trustworthy and Controlled Agentic LLM Systems . arXiv preprint arXiv:2601.11903, 2026

work page arXiv 2026
[40]

Network Analytics for Anti-Money Laundering – A Systematic Literature Review and Experimental Evaluation

Bruno Deprez, Toon Vanderschueren, Bart Baesens, Tim Verdonck, and Wouter Verbeke. Network Analytics for Anti-Money Laundering – A Systematic Literature Review and Experimental Evaluation . INFORMS Journal on Data Science, 2024. arXiv:2405.19383

work page arXiv 2024
[41]

Normalisation and Initialisation Strategies for Graph Neural Networks in Blockchain Anomaly Detection

Dang Sy Duy, Nguyen Duy Chien, Kapil Dev, and Jeff Nijsse. Normalisation and Initialisation Strategies for Graph Neural Networks in Blockchain Anomaly Detection. arXiv preprint arXiv:2602.23599, 2026

work page arXiv 2026
[42]

Jsonschemabench: A rigorous benchmark of structured outputs for language models

Derek Tam et al. Generating Structured Outputs from Language Models . arXiv preprint arXiv:2501.10868, 2025

work page arXiv 2025
[43]

Estimating LLM Uncertainty with Logits

Shuo Chen, Tian Dong, Jiaqi Wang, and Dieter Fox. Estimating LLM Uncertainty with Logits. Proceedings of the International Conference on Machine Learning (ICML), 2025. arXiv:2502.00290

work page arXiv 2025
[44]

Towards Mod- eling Data Quality and Machine Learning Model Performance

Usman Anjum, Chris Trentman, Elrod Caden, and Justin Zhan. Towards Mod- eling Data Quality and Machine Learning Model Performance . arXiv preprint arXiv:2412.05882, 2024

work page arXiv 2024
[45]

Arik and Tomas Pfister.TabNet: Attentive Interpretable Tabular Learning

Sercan Ö. Arik and Tomas Pfister.TabNet: Attentive Interpretable Tabular Learning. Proceedings of the AAAI Conference on Artificial Intelligence, 35(8):6679–6687, 2021

2021
[46]

Hamilton, and Jure Leskovec

Rex Ying, Ruining He, Kaifeng Chen, Pong Eksombatchai, William L. Hamilton, and Jure Leskovec. Graph Convolutional Neural Networks for Web-Scale Recom- mender Systems. Proceedings of the 24th ACM SIGKDD International Conference on Knowledge Discovery and Data Mining (KDD), 974–983, 2018

2018
[47]

Effi- cient Large-Scale Language Model Training on GPU Clusters Using Megatron-LM

Deepak Narayanan, Mohammad Shoeybi, Jared Casper, Patrick LeGresley, Mostofa Patwary, Vijay Korthikanti, Dmitri Vainbrand, Prethvi Kashinkunti, Julie Bernauer, Bryan Catanzaro, Amar Phanishayee, and Matei Zaharia. Effi- cient Large-Scale Language Model Training on GPU Clusters Using Megatron-LM . Proceedings of the International Conference for High Perfor...

2021
[48]

SemScore: Automated Evaluation of Instruction-Tuned LLMs based on Semantic Textual Similarity

Ansar Aynetdinov and Alham Fikri Aji. SemScore: Automated Evaluation of Instruction-Tuned LLMs based on Semantic Textual Similarity . arXiv preprint arXiv:2401.17072, 2024. A LOGU UNCERTAINTY DECOMPOSITION LogU [43] treats the top-𝐾 logits as Dirichlet parametersDir(𝛼1, . . . , 𝛼𝐾 ). The aleatoric uncertainty (AU) captures relative entropy:AU(𝑎𝑡 ) = −Í𝐾 𝑘...

work page arXiv 2024

[1] [1]

Managing explanations: how regulators can address AI explainability in Model Risk Management

Bank for International Settlements (BIS). Managing explanations: how regulators can address AI explainability in Model Risk Management . FSI Insights on policy implementation, 2024

2024

[2] [2]

Regulation (EU) 2024/1689 laying down har- monised rules on Artificial Intelligence (Artificial Intelligence Act)

European Parliament and Council. Regulation (EU) 2024/1689 laying down har- monised rules on Artificial Intelligence (Artificial Intelligence Act) . Official Journal of the European Union, 2024. From Accuracy to Auditability: A Survey of Determinism in Financial AI Systems Conference’17, July 2017, Washington, DC, USA

2024

[3] [3]

Harvey, Yan Liu, and Heqing Zhu

Campbell R. Harvey, Yan Liu, and Heqing Zhu.... and the Cross-Section of Expected Returns. The Review of Financial Studies, 29(1):5–68, 2016

2016

[4] [4]

Bailey and Marcos López de Prado.The Deflated Sharpe Ratio: Correcting for Selection Bias, Backtest Overfitting, and Non-Normality

David H. Bailey and Marcos López de Prado.The Deflated Sharpe Ratio: Correcting for Selection Bias, Backtest Overfitting, and Non-Normality. The Journal of Portfolio Management, 40(5):89–107, 2014

2014

[5] [5]

Arian et al

H. Arian et al. Backtest overfitting in the machine learning era: A comparison of out-of-sample testing methods. Knowledge-Based Systems, 2024

2024

[6] [6]

Bhojanapalli et al

S. Bhojanapalli et al. On the Reproducibility of Neural Network Predictions . arXiv preprint arXiv:2102.03349, 2021

work page arXiv 2021

[7] [7]

Detlefsen et al

N. Detlefsen et al. TorchMetrics - Measuring Reproducibility in Deep Learning . Journal of Open Source Software, 7(70):4101, 2022

2022

[8] [8]

Lundberg and Su-In Lee

Scott M. Lundberg and Su-In Lee. A Unified Approach to Interpreting Model Predictions. Advances in Neural Information Processing Systems (NeurIPS), 30, 2017

2017

[9] [9]

Lundberg, Gabriel Erion, Hugh Chen, Alex DeGrave, Jordan M

Scott M. Lundberg, Gabriel Erion, Hugh Chen, Alex DeGrave, Jordan M. Prutkin, Bala Nair, Ronit Katz, Jonathan Himmelfarb, Nisha Bansal, and Su-In Lee. From Local Explanations to Global Understanding with Explainable AI for Trees . Nature Machine Intelligence, 2(1):56–67, 2020

2020

[10] [10]

Intelligible Models for Classification and Regression

Yin Lou, Rich Caruana, and Johannes Gehrke. Intelligible Models for Classification and Regression. Proceedings of the 18th ACM SIGKDD International Conference on Knowledge Discovery and Data Mining (KDD), 150–158, 2012

2012

[11] [11]

Intelligible Models for HealthCare: Predicting Pneumonia Risk and Hos- pital 30-day Readmission

Rich Caruana, Yin Lou, Johannes Gehrke, Paul Koch, Marc Sturm, and Noemie Elhadad. Intelligible Models for HealthCare: Predicting Pneumonia Risk and Hos- pital 30-day Readmission. Proceedings of the 21st ACM SIGKDD International Conference on Knowledge Discovery and Data Mining (KDD), 1721–1730, 2015

2015

[12] [12]

Fooling LIME and SHAP: Adversarial Attacks on Post hoc Explanation Methods

Dylan Slack, Sophie Hilgard, Emily Jia, Sameer Singh, and Himabindu Lakkaraju. Fooling LIME and SHAP: Adversarial Attacks on Post hoc Explanation Methods . AAAI/ACM Conference on AI, Ethics, and Society (AIES), 2020

2020

[13] [13]

Weber et al

M. Weber et al. Graph Neural Networks for Financial Fraud Detection: A Survey . arXiv preprint arXiv:2411.05815, 2024

work page arXiv 2024

[14] [14]

Hamilton, Rex Ying, and Jure Leskovec

William L. Hamilton, Rex Ying, and Jure Leskovec. Inductive Representation Learning on Large Graphs . Advances in Neural Information Processing Systems (NeurIPS), 30, 2017

2017

[15] [15]

Replayable Financial Agents: A Determinism-Faithfulness Assurance Harness for Tool-Using LLM Agents

Raffi Khatchadourian. Replayable Financial Agents: A Determinism-Faithfulness Assurance Harness for Tool-Using LLM Agents . arXiv preprint arXiv:2601.15322, 2026

work page arXiv 2026

[16] [16]

Wang and Y

L. Wang and Y. Wang. Assessing Consistency and Reproducibility in the Outputs of Large Language Models: Evidence Across Diverse Finance and Accounting Tasks . arXiv preprint arXiv:2503.16974, 2025

work page arXiv 2025

[17] [17]

HOPPE, M

Jiayi Yuan, Hao Li, Xinheng Ding, Wenya Xie, Yu-Jhe Li, Wentian Zhao, Kun Wan, Jing Shi, Xia Hu, and Zirui Liu. Understanding and Mitigating Numerical Sources of Nondeterminism in LLM Inference . arXiv preprint arXiv:2506.09501, 2025

work page arXiv 2025

[18] [18]

Defeating Nondeterminism in LLM Inference

Thinking Machines Lab. Defeating Nondeterminism in LLM Inference . Technical Report, 2025. https://thinkingmachines.ai/defeating-nondeterminism

2025

[19] [19]

En- abling Determinism in LLM Inference with Verified Speculation

Raja Gond, Aditya K Kamath, Ramachandran Ramjee, and Ashish Panwar. En- abling Determinism in LLM Inference with Verified Speculation . arXiv preprint arXiv:2601.17768, 2025

work page arXiv 2025

[20] [20]

Why Should I Trust You?

Marco Tulio Ribeiro, Sameer Singh, and Carlos Guestrin. "Why Should I Trust You?": Explaining the Predictions of Any Classifier . Proceedings of the 22nd ACM SIGKDD International Conference on Knowledge Discovery and Data Mining (KDD), 2016

2016

[21] [21]

Improving KernelSHAP: Practical Shapley Value Estima- tion Using Linear Regression

Ian Covert and Su-In Lee. Improving KernelSHAP: Practical Shapley Value Estima- tion Using Linear Regression . Proceedings of AISTATS (PMLR 130), 2021

2021

[22] [22]

A Similarity Measure for Indefinite Rankings

William Webber, Alistair Moffat, and Justin Zobel. A Similarity Measure for Indefinite Rankings. ACM Transactions on Information Systems (TOIS), 28(4):1– 34, 2010

2010

[23] [23]

Interpretation of Neural Networks is Fragile

Amirata Ghorbani, Abubakar Abid, and James Zou. Interpretation of Neural Networks is Fragile. Proceedings of the AAAI Conference on Artificial Intelligence, 33(01):3681–3688, 2019

2019

[24] [24]

Weinberger, and Yoav Artzi

Tianyi Zhang, Varsha Kishore, Felix Wu, Kilian Q. Weinberger, and Yoav Artzi. BERTScore: Evaluating Text Generation with BERT . International Conference on Learning Representations (ICLR), 2020

2020

[25] [25]

Kipf and Max Welling

Thomas N. Kipf and Max Welling. Semi-Supervised Classification with Graph Convolutional Networks. International Conference on Learning Representations (ICLR), 2017

2017

[26] [26]

Pitfalls of Graph Neural Network Evaluation

Oleksandr Shchur, Maximilian Mumme, Aleksandar Bojchevski, and Stephan Günnemann. Pitfalls of Graph Neural Network Evaluation . NeurIPS Relational Representation Learning Workshop, 2018. arXiv:1811.05868

work page internal anchor Pith review Pith/arXiv arXiv 2018

[27] [27]

Temporal Graph Networks for Deep Learning on Dynamic Graphs

Emanuele Rossi, Ben Chamberlain, Fabrizio Frasca, Davide Eynard, Federico Monti, and Michael Bronstein. Temporal Graph Networks for Deep Learning on Dynamic Graphs. ICML Workshop on Graph Representation Learning, 2020

2020

[28] [28]

Fast Graph Representation Learning with PyTorch Geometric

Matthias Fey and Jan Eric Lenssen. Fast Graph Representation Learning with PyTorch Geometric. ICLR Workshop on Representation Learning on Graphs and Manifolds, 2019

2019

[29] [29]

Fu, Stefano Ermon, Atri Rudra, and Christopher Ré

Tri Dao, Daniel Y. Fu, Stefano Ermon, Atri Rudra, and Christopher Ré. FlashAt- tention: Fast and Memory-Efficient Exact Attention with IO-A wareness. Advances in Neural Information Processing Systems (NeurIPS), 35, 2022

2022

[30] [30]

Huang, Hui Wang, and Yi Yang

Allen H. Huang, Hui Wang, and Yi Yang. FinBERT: A Large Language Model for Extracting Information from Financial Text . Contemporary Accounting Research, 40(2):806–841, 2023

2023

[31] [31]

ReAct: Synergizing Reasoning and Acting in Language Models

Shunyu Yao, Jeffrey Zhao, Dian Yu, Nan Du, Izhak Shafran, Karthik Narasimhan, and Yuan Cao. ReAct: Synergizing Reasoning and Acting in Language Models . International Conference on Learning Representations (ICLR), 2023

2023

[32] [32]

Nemotron-Cascade: Scaling Cascaded Reinforcement Learning for General-Purpose Reasoning Models

Boxin Wang, Chankyu Lee, Nayeon Lee, Sheng-Chieh Lin, Wenliang Dai, Yang Chen, Yangyi Chen, Zhuolin Yang, Zihan Liu, Mohammad Shoeybi, Bryan Catan- zaro, and Wei Ping. Nemotron-Cascade: Scaling Cascaded Reinforcement Learning for General-Purpose Reasoning Models . arXiv preprint arXiv:2512.13607, 2025

work page arXiv 2025

[33] [33]

LLM Output Drift: Cross-Provider Validation & Mitigation for Financial Workflows

Raffi Khatchadourian and Rolando Franco. LLM Output Drift: Cross-Provider Validation & Mitigation for Financial Workflows. arXiv preprint arXiv:2511.07585, 2025

work page arXiv 2025

[34] [34]

Efficient Guided Generation for Large Language Models

Brandon T. Willard and Rémi Louf.Efficient Guided Generation for Large Language Models. arXiv preprint arXiv:2307.09702, 2023

work page internal anchor Pith review Pith/arXiv arXiv 2023

[35] [35]

SHAP Stability in Credit Risk Management: A Case Study in Credit Card Default Model

Liang Lin and Yue Wang. SHAP Stability in Credit Risk Management: A Case Study in Credit Card Default Model . Risks, 2025. arXiv:2508.01851

work page arXiv 2025

[36] [36]

Benoit.Evaluating the Stability of Model Explanations in Instance-Dependent Cost-Sensitive Credit Scoring

Matteo Ballegeer, Matthias Bogaert, and Dries F. Benoit.Evaluating the Stability of Model Explanations in Instance-Dependent Cost-Sensitive Credit Scoring . European Journal of Operational Research, 326(3), 2025

2025

[37] [37]

Deterministic Inference across Tensor Parallel Sizes That Eliminates Training-Inference Mismatch

Ziyang Zhang, Xinheng Ding, Jiayi Yuan, Rixin Liu, Huizi Mao, Jiarong Xing, and Zirui Liu. Deterministic Inference across Tensor Parallel Sizes That Eliminates Training-Inference Mismatch. arXiv preprint arXiv:2511.17826, 2025

work page internal anchor Pith review Pith/arXiv arXiv 2025

[38] [38]

STED and Consistency Scoring: A Framework for Evaluating LLM Structured Output Reliability

Guanchen Wang et al. STED and Consistency Scoring: A Framework for Evaluating LLM Structured Output Reliability . NeurIPS Workshop, 2025. arXiv:2512.23712

work page arXiv 2025

[39] [39]

AEMA: Verifiable Evaluation Framework for Trustworthy and Controlled Agentic LLM Systems

Yu-Ting Lee et al. AEMA: Verifiable Evaluation Framework for Trustworthy and Controlled Agentic LLM Systems . arXiv preprint arXiv:2601.11903, 2026

work page arXiv 2026

[40] [40]

Network Analytics for Anti-Money Laundering – A Systematic Literature Review and Experimental Evaluation

Bruno Deprez, Toon Vanderschueren, Bart Baesens, Tim Verdonck, and Wouter Verbeke. Network Analytics for Anti-Money Laundering – A Systematic Literature Review and Experimental Evaluation . INFORMS Journal on Data Science, 2024. arXiv:2405.19383

work page arXiv 2024

[41] [41]

Normalisation and Initialisation Strategies for Graph Neural Networks in Blockchain Anomaly Detection

Dang Sy Duy, Nguyen Duy Chien, Kapil Dev, and Jeff Nijsse. Normalisation and Initialisation Strategies for Graph Neural Networks in Blockchain Anomaly Detection. arXiv preprint arXiv:2602.23599, 2026

work page arXiv 2026

[42] [42]

Jsonschemabench: A rigorous benchmark of structured outputs for language models

Derek Tam et al. Generating Structured Outputs from Language Models . arXiv preprint arXiv:2501.10868, 2025

work page arXiv 2025

[43] [43]

Estimating LLM Uncertainty with Logits

Shuo Chen, Tian Dong, Jiaqi Wang, and Dieter Fox. Estimating LLM Uncertainty with Logits. Proceedings of the International Conference on Machine Learning (ICML), 2025. arXiv:2502.00290

work page arXiv 2025

[44] [44]

Towards Mod- eling Data Quality and Machine Learning Model Performance

Usman Anjum, Chris Trentman, Elrod Caden, and Justin Zhan. Towards Mod- eling Data Quality and Machine Learning Model Performance . arXiv preprint arXiv:2412.05882, 2024

work page arXiv 2024

[45] [45]

Arik and Tomas Pfister.TabNet: Attentive Interpretable Tabular Learning

Sercan Ö. Arik and Tomas Pfister.TabNet: Attentive Interpretable Tabular Learning. Proceedings of the AAAI Conference on Artificial Intelligence, 35(8):6679–6687, 2021

2021

[46] [46]

Hamilton, and Jure Leskovec

Rex Ying, Ruining He, Kaifeng Chen, Pong Eksombatchai, William L. Hamilton, and Jure Leskovec. Graph Convolutional Neural Networks for Web-Scale Recom- mender Systems. Proceedings of the 24th ACM SIGKDD International Conference on Knowledge Discovery and Data Mining (KDD), 974–983, 2018

2018

[47] [47]

Effi- cient Large-Scale Language Model Training on GPU Clusters Using Megatron-LM

Deepak Narayanan, Mohammad Shoeybi, Jared Casper, Patrick LeGresley, Mostofa Patwary, Vijay Korthikanti, Dmitri Vainbrand, Prethvi Kashinkunti, Julie Bernauer, Bryan Catanzaro, Amar Phanishayee, and Matei Zaharia. Effi- cient Large-Scale Language Model Training on GPU Clusters Using Megatron-LM . Proceedings of the International Conference for High Perfor...

2021

[48] [48]

SemScore: Automated Evaluation of Instruction-Tuned LLMs based on Semantic Textual Similarity

Ansar Aynetdinov and Alham Fikri Aji. SemScore: Automated Evaluation of Instruction-Tuned LLMs based on Semantic Textual Similarity . arXiv preprint arXiv:2401.17072, 2024. A LOGU UNCERTAINTY DECOMPOSITION LogU [43] treats the top-𝐾 logits as Dirichlet parametersDir(𝛼1, . . . , 𝛼𝐾 ). The aleatoric uncertainty (AU) captures relative entropy:AU(𝑎𝑡 ) = −Í𝐾 𝑘...

work page arXiv 2024