pith. sign in

arxiv: 2605.23955 · v2 · pith:M5DI3WAQnew · submitted 2026-05-11 · 💻 cs.AI · cs.DC· cs.LG· cs.SI· q-fin.CP

From Accuracy to Auditability: A Survey of Determinism in Financial AI Systems

Pith reviewed 2026-06-30 22:25 UTC · model grok-4.3

classification 💻 cs.AI cs.DCcs.LGcs.SIq-fin.CP
keywords determinismreproducibilityfinancial AIauditabilitynondeterminismtabular modelsgraph networksLLM workflows
0
0 comments X

The pith

Mechanical nondeterminism from hardware and architecture undermines reproducibility in financial AI systems used for credit, fraud, and AML tasks.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper surveys how deep neural networks and generative AI introduce mechanical nondeterminism that breaks consistent outputs across three dominant financial modalities. It documents the problem through literature review and first-party tests on public datasets that measure explanation instability, prediction flips, and output divergence. The work then introduces a layered framework that ties specific metrics to audit readiness. A sympathetic reader would care because regulated finance demands reproducible decisions for compliance and liability reasons. If the central observation holds, current accuracy-focused evaluation falls short of what auditors require.

Core claim

Mechanical nondeterminism rooted in hardware and architecture in deep neural networks and Generative AI has introduced critical vulnerabilities in algorithmic reproducibility across tabular models, graph networks, and LLM-based agentic workflows in financial applications. First-party experiments quantify explanation rank instability in credit scoring, prediction flip rates in GNN-based fraud detection, and tensor-parallel-induced output divergence in LLM entity extraction. A layered evaluation framework is proposed that links modality-specific metrics (RBO, D_cos, TDI, PSD) to audit readiness and validates the complementarity of logit-level and semantic-level determinism measures.

What carries the argument

Layered evaluation framework that maps modality-specific determinism metrics (RBO, D_cos, TDI, PSD) to audit readiness across tabular, graph, and LLM financial AI systems.

If this is right

  • Tabular models for credit scoring exhibit unstable post-hoc explanations under nondeterminism.
  • Graph networks for fraud detection show measurable prediction flip rates from stochastic sampling and temporal asynchrony.
  • LLM-based agentic workflows display batch-dependent divergence and trajectory drift in entity extraction tasks.
  • Logit-level and semantic-level measures provide complementary signals for assessing determinism.
  • The framework supplies concrete metrics that can be applied to evaluate audit readiness in each modality.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

  • Regulators could incorporate determinism testing into model validation requirements for financial AI.
  • The same hardware-rooted issues may appear in other regulated domains that use similar model types.
  • Practical deployment might require hardware or software constraints to enforce reproducibility where audits are mandatory.

Load-bearing premise

The first-party experiments on public financial datasets and the proposed metrics sufficiently demonstrate the scale of nondeterminism issues and the effectiveness of the layered evaluation framework for real-world regulated financial systems.

What would settle it

An empirical study on regulated financial deployments that finds the quantified nondeterminism levels produce no measurable change in audit outcomes or regulatory compliance would falsify the claim of critical vulnerabilities.

Figures

Figures reproduced from arXiv: 2605.23955 by Deepayan Chakrabarti, Dengdu Jiang, Gaoyuan Du, Ruizhe Zhou, Shouxi Ren, Xiaoyang Liu, Yi Zheng.

Figure 1
Figure 1. Figure 1: Feature rank span across 30 KernelSHAP runs [PITH_FULL_IMAGE:figures/full_fig_p003_1.png] view at source ↗
Figure 2
Figure 2. Figure 2: Model ranking across 10 splits on Elliptic Bitcoin. [PITH_FULL_IMAGE:figures/full_fig_p004_2.png] view at source ↗
Figure 3
Figure 3. Figure 3: plots TDI against PSD for all model×TP configurations. The scatter reveals three distinct regimes: (1) moderate TDI, high PSD (Qwen2.5-7B at TP=4: TDI=0.77, PSD=0.99)—despite token￾level divergence, the model produces semantically equivalent out￾puts, indicating benign surface-level variation; (2) low TDI, low PSD (InternLM2.5-7B at TP=4: TDI=0.75, PSD=0.88)—both evidence and meaning diverge, signaling gen… view at source ↗
read the original abstract

Deploying machine learning in regulated financial environments -- credit risk, fraud detection, and anti-money laundering -- exposes critical vulnerabilities in algorithmic reproducibility. While early financial ML addressed statistical challenges such as backtest overfitting, deep neural networks and Generative AI have introduced mechanical nondeterminism rooted in hardware and architecture. This survey provides a systems perspective on reproducibility failures across three modalities now dominant in financial AI: tabular models (post-hoc explanation variance), graph networks (stochastic sampling and temporal asynchrony), and LLM-based agentic workflows (batch-dependent divergence and trajectory drift). We supplement the literature analysis with first-party experiments on public financial datasets -- quantifying explanation rank instability in credit scoring, prediction flip rates in GNN-based fraud detection, and tensor-parallel-induced output divergence in LLM entity extraction. We propose a layered evaluation framework linking modality-specific metrics (RBO, D_cos, TDI, PSD) to audit readiness, and empirically validate the complementarity of logit-level and semantic-level determinism measures.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Referee Report

2 major / 1 minor

Summary. The paper surveys mechanical nondeterminism in financial AI across tabular models (post-hoc explanation variance), graph networks (stochastic sampling and temporal asynchrony), and LLM-based agentic workflows (batch-dependent divergence and trajectory drift). It supplements the literature review with first-party experiments on public financial datasets that quantify explanation rank instability (RBO), prediction flip rates (D_cos), temporal divergence (TDI), and output divergence (PSD), and proposes a layered evaluation framework that links these modality-specific metrics to audit readiness while claiming empirical validation of the complementarity between logit-level and semantic-level determinism measures.

Significance. If the layered framework and metrics prove effective at quantifying and mitigating reproducibility failures, the work would provide a valuable systems-level contribution toward auditability in regulated financial domains. The survey structure and introduction of concrete metrics (RBO, D_cos, TDI, PSD) offer a structured way to connect hardware/architecture nondeterminism to practical evaluation, which could inform compliance practices if the public-dataset results generalize.

major comments (2)
  1. [first-party experiments on public financial datasets] First-party experiments section: the reported metric values on public datasets (RBO for credit scoring explanations, D_cos for GNN fraud detection flips, TDI/PSD for LLM divergence) are presented without explicit thresholds, decision-impact analysis, or mapping to regulatory compliance failures (e.g., changes in credit approval or AML alerts), so the central claim of 'critical vulnerabilities' in credit risk, fraud, and AML remains indirect.
  2. [layered evaluation framework] Layered evaluation framework section: the assertion that the metrics demonstrate complementarity and link to 'audit readiness' lacks quantitative validation against real regulatory constraints or proprietary-scale data (class imbalance, audit thresholds), leaving the framework's practical utility for regulated systems unestablished.
minor comments (1)
  1. [abstract] The abstract states that the framework is 'empirically validated' but supplies no numerical results, error bars, or dataset details; this should be expanded with concrete findings even in a survey context.

Simulated Author's Rebuttal

2 responses · 2 unresolved

We thank the referee for the constructive feedback on our survey of determinism in financial AI systems. We address the two major comments point by point below, clarifying the scope of our public-dataset experiments and the proposed framework while noting inherent limitations of a survey format.

read point-by-point responses
  1. Referee: [first-party experiments on public financial datasets] First-party experiments section: the reported metric values on public datasets (RBO for credit scoring explanations, D_cos for GNN fraud detection flips, TDI/PSD for LLM divergence) are presented without explicit thresholds, decision-impact analysis, or mapping to regulatory compliance failures (e.g., changes in credit approval or AML alerts), so the central claim of 'critical vulnerabilities' in credit risk, fraud, and AML remains indirect.

    Authors: The experiments quantify mechanical nondeterminism on standard public financial datasets to illustrate the phenomena across modalities, using metrics chosen for their direct interpretability (e.g., RBO capturing explanation rank changes that could affect post-hoc audits). We do not provide explicit thresholds or direct mappings to regulatory outcomes such as credit approval changes because these are institution-specific and would require proprietary decision logs and compliance data unavailable for a survey. The central claim of vulnerabilities is supported by the quantified instability levels on representative data, but we agree the regulatory linkage is indirect. We will add a discussion subsection on potential decision impacts and how practitioners might calibrate thresholds, constituting a partial revision. revision: partial

  2. Referee: [layered evaluation framework] Layered evaluation framework section: the assertion that the metrics demonstrate complementarity and link to 'audit readiness' lacks quantitative validation against real regulatory constraints or proprietary-scale data (class imbalance, audit thresholds), leaving the framework's practical utility for regulated systems unestablished.

    Authors: Complementarity between logit-level and semantic-level measures is empirically demonstrated via the public-dataset experiments, where the metrics capture orthogonal aspects of divergence. The layered framework provides a conceptual mapping from these metrics to audit readiness as a systems-level proposal. We acknowledge that quantitative validation on proprietary-scale data with real regulatory constraints is not feasible here. We will revise the framework section to explicitly delineate its scope as a proposed structure with public proof-of-concept results rather than a fully validated production tool, constituting a partial revision for added clarity on limitations. revision: partial

standing simulated objections not resolved
  • Quantitative validation of the layered framework or metrics against proprietary financial datasets that incorporate actual regulatory thresholds, class imbalance, and audit constraints.
  • Direct empirical mapping from observed metric values to specific compliance failures (e.g., changes in AML alerts or credit decisions) without access to internal institutional systems and logs.

Circularity Check

0 steps flagged

No circularity: survey structure with independent literature review and public-dataset experiments

full rationale

The paper is a literature survey supplemented by first-party experiments on public financial datasets. It defines new metrics (RBO, D_cos, TDI, PSD) and applies them to quantify observed nondeterminism, then links them to an audit framework. No equations or derivations reduce to self-definition, fitted parameters renamed as predictions, or load-bearing self-citations. The central claims rest on external literature plus reproducible experiments on public data, with no reduction of the 'critical vulnerabilities' claim to author-defined inputs by construction. This is the expected non-finding for a survey paper.

Axiom & Free-Parameter Ledger

0 free parameters · 0 axioms · 0 invented entities

This is a survey paper supplemented by experiments; the central claims rest on literature analysis and described empirical work rather than new mathematical axioms, free parameters, or invented entities.

pith-pipeline@v0.9.1-grok · 5730 in / 1154 out tokens · 41016 ms · 2026-06-30T22:25:28.777463+00:00 · methodology

discussion (0)

Sign in with ORCID, Apple, or X to comment. Anyone can read and Pith papers without signing in.

Reference graph

Works this paper leans on

48 extracted references · 20 canonical work pages · 3 internal anchors

  1. [1]

    Managing explanations: how regulators can address AI explainability in Model Risk Management

    Bank for International Settlements (BIS). Managing explanations: how regulators can address AI explainability in Model Risk Management . FSI Insights on policy implementation, 2024

  2. [2]

    Regulation (EU) 2024/1689 laying down har- monised rules on Artificial Intelligence (Artificial Intelligence Act)

    European Parliament and Council. Regulation (EU) 2024/1689 laying down har- monised rules on Artificial Intelligence (Artificial Intelligence Act) . Official Journal of the European Union, 2024. From Accuracy to Auditability: A Survey of Determinism in Financial AI Systems Conference’17, July 2017, Washington, DC, USA

  3. [3]

    Harvey, Yan Liu, and Heqing Zhu

    Campbell R. Harvey, Yan Liu, and Heqing Zhu.... and the Cross-Section of Expected Returns. The Review of Financial Studies, 29(1):5–68, 2016

  4. [4]

    Bailey and Marcos López de Prado.The Deflated Sharpe Ratio: Correcting for Selection Bias, Backtest Overfitting, and Non-Normality

    David H. Bailey and Marcos López de Prado.The Deflated Sharpe Ratio: Correcting for Selection Bias, Backtest Overfitting, and Non-Normality. The Journal of Portfolio Management, 40(5):89–107, 2014

  5. [5]

    Arian et al

    H. Arian et al. Backtest overfitting in the machine learning era: A comparison of out-of-sample testing methods. Knowledge-Based Systems, 2024

  6. [6]

    Bhojanapalli et al

    S. Bhojanapalli et al. On the Reproducibility of Neural Network Predictions . arXiv preprint arXiv:2102.03349, 2021

  7. [7]

    Detlefsen et al

    N. Detlefsen et al. TorchMetrics - Measuring Reproducibility in Deep Learning . Journal of Open Source Software, 7(70):4101, 2022

  8. [8]

    Lundberg and Su-In Lee

    Scott M. Lundberg and Su-In Lee. A Unified Approach to Interpreting Model Predictions. Advances in Neural Information Processing Systems (NeurIPS), 30, 2017

  9. [9]

    Lundberg, Gabriel Erion, Hugh Chen, Alex DeGrave, Jordan M

    Scott M. Lundberg, Gabriel Erion, Hugh Chen, Alex DeGrave, Jordan M. Prutkin, Bala Nair, Ronit Katz, Jonathan Himmelfarb, Nisha Bansal, and Su-In Lee. From Local Explanations to Global Understanding with Explainable AI for Trees . Nature Machine Intelligence, 2(1):56–67, 2020

  10. [10]

    Intelligible Models for Classification and Regression

    Yin Lou, Rich Caruana, and Johannes Gehrke. Intelligible Models for Classification and Regression. Proceedings of the 18th ACM SIGKDD International Conference on Knowledge Discovery and Data Mining (KDD), 150–158, 2012

  11. [11]

    Intelligible Models for HealthCare: Predicting Pneumonia Risk and Hos- pital 30-day Readmission

    Rich Caruana, Yin Lou, Johannes Gehrke, Paul Koch, Marc Sturm, and Noemie Elhadad. Intelligible Models for HealthCare: Predicting Pneumonia Risk and Hos- pital 30-day Readmission. Proceedings of the 21st ACM SIGKDD International Conference on Knowledge Discovery and Data Mining (KDD), 1721–1730, 2015

  12. [12]

    Fooling LIME and SHAP: Adversarial Attacks on Post hoc Explanation Methods

    Dylan Slack, Sophie Hilgard, Emily Jia, Sameer Singh, and Himabindu Lakkaraju. Fooling LIME and SHAP: Adversarial Attacks on Post hoc Explanation Methods . AAAI/ACM Conference on AI, Ethics, and Society (AIES), 2020

  13. [13]

    Weber et al

    M. Weber et al. Graph Neural Networks for Financial Fraud Detection: A Survey . arXiv preprint arXiv:2411.05815, 2024

  14. [14]

    Hamilton, Rex Ying, and Jure Leskovec

    William L. Hamilton, Rex Ying, and Jure Leskovec. Inductive Representation Learning on Large Graphs . Advances in Neural Information Processing Systems (NeurIPS), 30, 2017

  15. [15]

    Replayable Financial Agents: A Determinism-Faithfulness Assurance Harness for Tool-Using LLM Agents

    Raffi Khatchadourian. Replayable Financial Agents: A Determinism-Faithfulness Assurance Harness for Tool-Using LLM Agents . arXiv preprint arXiv:2601.15322, 2026

  16. [16]

    Wang and Y

    L. Wang and Y. Wang. Assessing Consistency and Reproducibility in the Outputs of Large Language Models: Evidence Across Diverse Finance and Accounting Tasks . arXiv preprint arXiv:2503.16974, 2025

  17. [17]

    HOPPE, M

    Jiayi Yuan, Hao Li, Xinheng Ding, Wenya Xie, Yu-Jhe Li, Wentian Zhao, Kun Wan, Jing Shi, Xia Hu, and Zirui Liu. Understanding and Mitigating Numerical Sources of Nondeterminism in LLM Inference . arXiv preprint arXiv:2506.09501, 2025

  18. [18]

    Defeating Nondeterminism in LLM Inference

    Thinking Machines Lab. Defeating Nondeterminism in LLM Inference . Technical Report, 2025. https://thinkingmachines.ai/defeating-nondeterminism

  19. [19]

    En- abling Determinism in LLM Inference with Verified Speculation

    Raja Gond, Aditya K Kamath, Ramachandran Ramjee, and Ashish Panwar. En- abling Determinism in LLM Inference with Verified Speculation . arXiv preprint arXiv:2601.17768, 2025

  20. [20]

    Why Should I Trust You?

    Marco Tulio Ribeiro, Sameer Singh, and Carlos Guestrin. "Why Should I Trust You?": Explaining the Predictions of Any Classifier . Proceedings of the 22nd ACM SIGKDD International Conference on Knowledge Discovery and Data Mining (KDD), 2016

  21. [21]

    Improving KernelSHAP: Practical Shapley Value Estima- tion Using Linear Regression

    Ian Covert and Su-In Lee. Improving KernelSHAP: Practical Shapley Value Estima- tion Using Linear Regression . Proceedings of AISTATS (PMLR 130), 2021

  22. [22]

    A Similarity Measure for Indefinite Rankings

    William Webber, Alistair Moffat, and Justin Zobel. A Similarity Measure for Indefinite Rankings. ACM Transactions on Information Systems (TOIS), 28(4):1– 34, 2010

  23. [23]

    Interpretation of Neural Networks is Fragile

    Amirata Ghorbani, Abubakar Abid, and James Zou. Interpretation of Neural Networks is Fragile. Proceedings of the AAAI Conference on Artificial Intelligence, 33(01):3681–3688, 2019

  24. [24]

    Weinberger, and Yoav Artzi

    Tianyi Zhang, Varsha Kishore, Felix Wu, Kilian Q. Weinberger, and Yoav Artzi. BERTScore: Evaluating Text Generation with BERT . International Conference on Learning Representations (ICLR), 2020

  25. [25]

    Kipf and Max Welling

    Thomas N. Kipf and Max Welling. Semi-Supervised Classification with Graph Convolutional Networks. International Conference on Learning Representations (ICLR), 2017

  26. [26]

    Pitfalls of Graph Neural Network Evaluation

    Oleksandr Shchur, Maximilian Mumme, Aleksandar Bojchevski, and Stephan Günnemann. Pitfalls of Graph Neural Network Evaluation . NeurIPS Relational Representation Learning Workshop, 2018. arXiv:1811.05868

  27. [27]

    Temporal Graph Networks for Deep Learning on Dynamic Graphs

    Emanuele Rossi, Ben Chamberlain, Fabrizio Frasca, Davide Eynard, Federico Monti, and Michael Bronstein. Temporal Graph Networks for Deep Learning on Dynamic Graphs. ICML Workshop on Graph Representation Learning, 2020

  28. [28]

    Fast Graph Representation Learning with PyTorch Geometric

    Matthias Fey and Jan Eric Lenssen. Fast Graph Representation Learning with PyTorch Geometric. ICLR Workshop on Representation Learning on Graphs and Manifolds, 2019

  29. [29]

    Fu, Stefano Ermon, Atri Rudra, and Christopher Ré

    Tri Dao, Daniel Y. Fu, Stefano Ermon, Atri Rudra, and Christopher Ré. FlashAt- tention: Fast and Memory-Efficient Exact Attention with IO-A wareness. Advances in Neural Information Processing Systems (NeurIPS), 35, 2022

  30. [30]

    Huang, Hui Wang, and Yi Yang

    Allen H. Huang, Hui Wang, and Yi Yang. FinBERT: A Large Language Model for Extracting Information from Financial Text . Contemporary Accounting Research, 40(2):806–841, 2023

  31. [31]

    ReAct: Synergizing Reasoning and Acting in Language Models

    Shunyu Yao, Jeffrey Zhao, Dian Yu, Nan Du, Izhak Shafran, Karthik Narasimhan, and Yuan Cao. ReAct: Synergizing Reasoning and Acting in Language Models . International Conference on Learning Representations (ICLR), 2023

  32. [32]

    Nemotron-Cascade: Scaling Cascaded Reinforcement Learning for General-Purpose Reasoning Models

    Boxin Wang, Chankyu Lee, Nayeon Lee, Sheng-Chieh Lin, Wenliang Dai, Yang Chen, Yangyi Chen, Zhuolin Yang, Zihan Liu, Mohammad Shoeybi, Bryan Catan- zaro, and Wei Ping. Nemotron-Cascade: Scaling Cascaded Reinforcement Learning for General-Purpose Reasoning Models . arXiv preprint arXiv:2512.13607, 2025

  33. [33]

    LLM Output Drift: Cross-Provider Validation & Mitigation for Financial Workflows

    Raffi Khatchadourian and Rolando Franco. LLM Output Drift: Cross-Provider Validation & Mitigation for Financial Workflows. arXiv preprint arXiv:2511.07585, 2025

  34. [34]

    Efficient Guided Generation for Large Language Models

    Brandon T. Willard and Rémi Louf.Efficient Guided Generation for Large Language Models. arXiv preprint arXiv:2307.09702, 2023

  35. [35]

    SHAP Stability in Credit Risk Management: A Case Study in Credit Card Default Model

    Liang Lin and Yue Wang. SHAP Stability in Credit Risk Management: A Case Study in Credit Card Default Model . Risks, 2025. arXiv:2508.01851

  36. [36]

    Benoit.Evaluating the Stability of Model Explanations in Instance-Dependent Cost-Sensitive Credit Scoring

    Matteo Ballegeer, Matthias Bogaert, and Dries F. Benoit.Evaluating the Stability of Model Explanations in Instance-Dependent Cost-Sensitive Credit Scoring . European Journal of Operational Research, 326(3), 2025

  37. [37]

    Deterministic Inference across Tensor Parallel Sizes That Eliminates Training-Inference Mismatch

    Ziyang Zhang, Xinheng Ding, Jiayi Yuan, Rixin Liu, Huizi Mao, Jiarong Xing, and Zirui Liu. Deterministic Inference across Tensor Parallel Sizes That Eliminates Training-Inference Mismatch. arXiv preprint arXiv:2511.17826, 2025

  38. [38]

    STED and Consistency Scoring: A Framework for Evaluating LLM Structured Output Reliability

    Guanchen Wang et al. STED and Consistency Scoring: A Framework for Evaluating LLM Structured Output Reliability . NeurIPS Workshop, 2025. arXiv:2512.23712

  39. [39]

    AEMA: Verifiable Evaluation Framework for Trustworthy and Controlled Agentic LLM Systems

    Yu-Ting Lee et al. AEMA: Verifiable Evaluation Framework for Trustworthy and Controlled Agentic LLM Systems . arXiv preprint arXiv:2601.11903, 2026

  40. [40]

    Network Analytics for Anti-Money Laundering – A Systematic Literature Review and Experimental Evaluation

    Bruno Deprez, Toon Vanderschueren, Bart Baesens, Tim Verdonck, and Wouter Verbeke. Network Analytics for Anti-Money Laundering – A Systematic Literature Review and Experimental Evaluation . INFORMS Journal on Data Science, 2024. arXiv:2405.19383

  41. [41]

    Normalisation and Initialisation Strategies for Graph Neural Networks in Blockchain Anomaly Detection

    Dang Sy Duy, Nguyen Duy Chien, Kapil Dev, and Jeff Nijsse. Normalisation and Initialisation Strategies for Graph Neural Networks in Blockchain Anomaly Detection. arXiv preprint arXiv:2602.23599, 2026

  42. [42]

    Jsonschemabench: A rigorous benchmark of structured outputs for language models

    Derek Tam et al. Generating Structured Outputs from Language Models . arXiv preprint arXiv:2501.10868, 2025

  43. [43]

    Estimating LLM Uncertainty with Logits

    Shuo Chen, Tian Dong, Jiaqi Wang, and Dieter Fox. Estimating LLM Uncertainty with Logits. Proceedings of the International Conference on Machine Learning (ICML), 2025. arXiv:2502.00290

  44. [44]

    Towards Mod- eling Data Quality and Machine Learning Model Performance

    Usman Anjum, Chris Trentman, Elrod Caden, and Justin Zhan. Towards Mod- eling Data Quality and Machine Learning Model Performance . arXiv preprint arXiv:2412.05882, 2024

  45. [45]

    Arik and Tomas Pfister.TabNet: Attentive Interpretable Tabular Learning

    Sercan Ö. Arik and Tomas Pfister.TabNet: Attentive Interpretable Tabular Learning. Proceedings of the AAAI Conference on Artificial Intelligence, 35(8):6679–6687, 2021

  46. [46]

    Hamilton, and Jure Leskovec

    Rex Ying, Ruining He, Kaifeng Chen, Pong Eksombatchai, William L. Hamilton, and Jure Leskovec. Graph Convolutional Neural Networks for Web-Scale Recom- mender Systems. Proceedings of the 24th ACM SIGKDD International Conference on Knowledge Discovery and Data Mining (KDD), 974–983, 2018

  47. [47]

    Effi- cient Large-Scale Language Model Training on GPU Clusters Using Megatron-LM

    Deepak Narayanan, Mohammad Shoeybi, Jared Casper, Patrick LeGresley, Mostofa Patwary, Vijay Korthikanti, Dmitri Vainbrand, Prethvi Kashinkunti, Julie Bernauer, Bryan Catanzaro, Amar Phanishayee, and Matei Zaharia. Effi- cient Large-Scale Language Model Training on GPU Clusters Using Megatron-LM . Proceedings of the International Conference for High Perfor...

  48. [48]

    SemScore: Automated Evaluation of Instruction-Tuned LLMs based on Semantic Textual Similarity

    Ansar Aynetdinov and Alham Fikri Aji. SemScore: Automated Evaluation of Instruction-Tuned LLMs based on Semantic Textual Similarity . arXiv preprint arXiv:2401.17072, 2024. A LOGU UNCERTAINTY DECOMPOSITION LogU [43] treats the top-𝐾 logits as Dirichlet parametersDir(𝛼1, . . . , 𝛼𝐾 ). The aleatoric uncertainty (AU) captures relative entropy:AU(𝑎𝑡 ) = −Í𝐾 𝑘...