pith. machine review for the scientific record.

arxiv: 2604.19755 · v1 · submitted 2026-03-22 · 💻 cs.AI · cs.LG

Recognition: 2 theorem links

· Lean Theorem

Explainable AML Triage with LLMs: Evidence Retrieval and Counterfactual Checks


Pith reviewed 2026-05-15 07:31 UTC · model grok-4.3

classification 💻 cs.AI cs.LG
keywords AML triage, explainable AI, large language models, evidence retrieval, counterfactual checks, transaction monitoring, regulatory compliance, hallucination reduction

The pith

Evidence-constrained LLMs with retrieval and counterfactual checks deliver auditable AML triage.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

This paper develops a framework for using large language models to triage anti-money laundering alerts while enforcing traceability and accuracy. It bundles evidence from policies, customer details, alerts, and transaction networks through retrieval methods, then requires the model to produce structured outputs with explicit citations and distinctions between supporting, contradicting, and missing evidence. Counterfactual checks test if small plausible changes in the data would alter the triage decision and explanation in a consistent way. The approach is tested against baselines on synthetic benchmarks, showing gains in performance and in metrics for provenance and faithfulness. A sympathetic reader would see this as a way to make AI tools usable in regulated settings where decisions must be defensible.

Core claim

The paper claims that an evidence-constrained decision process, built from retrieval-augmented evidence bundling, a structured LLM output contract requiring citations and evidence separation, and counterfactual perturbation checks, yields superior AML triage performance and explainability compared to rules-based systems, graph ML models, and unconstrained LLM variants on public synthetic benchmarks.

What carries the argument

The evidence-constrained decision process integrating retrieval-augmented bundling of policy, context, alert and subgraph data with citation-mandating contracts and counterfactual validation.
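
The citation-mandating contract can be pictured as a validation step that runs on the model's structured output before anyone acts on it. The following is a minimal sketch under stated assumptions, not the paper's implementation: the `TriageOutput` fields, the `ALLOWED_DECISIONS` labels, and the `contract_violations` helper are all hypothetical names chosen for illustration.

```python
from dataclasses import dataclass, field

# Hypothetical label set; the paper's actual triage labels are not given here.
ALLOWED_DECISIONS = {"escalate", "close"}


@dataclass
class TriageOutput:
    decision: str
    supporting: list[str] = field(default_factory=list)     # cited evidence ids
    contradicting: list[str] = field(default_factory=list)  # cited evidence ids
    missing: list[str] = field(default_factory=list)        # described gaps, no ids


def contract_violations(out: TriageOutput, bundle_ids: set[str]) -> list[str]:
    """Return readable reasons the output breaks the (assumed) contract."""
    problems = []
    if out.decision not in ALLOWED_DECISIONS:
        problems.append(f"unknown decision: {out.decision}")
    if not out.supporting:
        problems.append("decision has no supporting citation")
    # Every citation must point at an item that is actually in the bundle.
    for eid in out.supporting + out.contradicting:
        if eid not in bundle_ids:
            problems.append(f"citation not in bundle: {eid}")
    # Supporting and contradicting evidence must stay separated.
    for eid in sorted(set(out.supporting) & set(out.contradicting)):
        problems.append(f"cited as both supporting and contradicting: {eid}")
    return problems


bundle_ids = {"policy:structuring", "alert:42", "kyc:cust-7"}
good = TriageOutput(
    "escalate",
    supporting=["alert:42", "policy:structuring"],
    contradicting=["kyc:cust-7"],
    missing=["source of funds"],
)
bad = TriageOutput("escalate", supporting=["alert:99"])

print(contract_violations(good, bundle_ids))  # []
print(contract_violations(bad, bundle_ids))   # ['citation not in bundle: alert:99']
```

Rejecting an output outright when the list is non-empty is one plausible design; a real system might instead re-prompt or route the alert to a human.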

If this is right

  • Evidence grounding substantially improves auditability and reduces numerical and policy hallucination errors.
  • Counterfactual validation further increases decision-linked explainability and robustness.
  • The combined method achieves the best overall triage performance on synthetic AML benchmarks.
  • High scores are reached for citation validity, evidence support, and counterfactual faithfulness.
  • Governed LLM systems can support AML triage decisions while satisfying compliance requirements for traceability.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

  • Similar evidence and counterfactual structures could be adapted for explainable decision support in other regulated fields like sanctions screening or fraud investigation.
  • Deployment in practice would benefit from testing against real transaction data to confirm the synthetic benchmarks capture key variabilities.
  • The counterfactual mechanism might serve as a diagnostic tool to identify weaknesses in the evidence sources themselves.
  • Adoption could allow investigators to focus on complex cases by trusting the automated triage for well-supported alerts.

Load-bearing premise

Public synthetic AML benchmarks and simulators are representative enough of real-world transaction patterns, regulatory noise, and investigator processes.

What would settle it

Testing the framework on proprietary real-world AML datasets from financial institutions and finding significantly lower performance or faithfulness scores than on the synthetic benchmarks.

Figures

Figures reproduced from arXiv: 2604.19755 by Dorothy Torres, Ke Hu, Wei Cheng.

Figure 1: AML Triage Architecture.

3.2 Evidence Representation and Retrieval. We represent every retrievable artifact as an evidence item 𝑒 ∈ ℰ with a stable identifier, a source type, an effective timestamp, and an access-control label (ACL). Evidence types include (i) policy and typology guidance, (ii) customer and KYC profile attributes, (iii) alert trigger metadata (e.g., rule identifiers, thresholds, scoring outpu…
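
The evidence item described in the Section 3.2 excerpt above maps naturally onto a small record type. This sketch assumes Python dataclasses; the field names, the `text` payload, the example source-type strings, and the `visible_evidence` filter are illustrative assumptions, since the excerpt is truncated before it says how the timestamp and ACL are used at retrieval time.

```python
from dataclasses import dataclass
from datetime import datetime


@dataclass(frozen=True)
class EvidenceItem:
    evidence_id: str        # stable identifier
    source_type: str        # hypothetical values: "policy", "kyc", "alert"
    effective_at: datetime  # effective timestamp
    acl: str                # access-control label
    text: str               # hypothetical payload field, not named in the source


def visible_evidence(items, allowed_acls, as_of):
    """Keep items the caller may read and that were in effect at `as_of`."""
    return [e for e in items if e.acl in allowed_acls and e.effective_at <= as_of]


items = [
    EvidenceItem("policy:1", "policy", datetime(2024, 1, 1), "analyst", "Structuring typology"),
    EvidenceItem("kyc:7", "kyc", datetime(2025, 6, 1), "analyst", "Retail account"),
    EvidenceItem("alert:42", "alert", datetime(2025, 3, 1), "restricted", "Rule fired"),
]
bundle = visible_evidence(items, {"analyst"}, as_of=datetime(2025, 3, 15))
print([e.evidence_id for e in bundle])  # ['policy:1']
```

Filtering on the effective timestamp is one way to keep a bundle from citing evidence that did not yet exist when the alert was raised; whether the paper does exactly this is not visible in the excerpt.
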
Figure 2: Evidence bundle construction and provenance constraints.
Original abstract

Anti-money laundering (AML) transaction monitoring generates large volumes of alerts that must be rapidly triaged by investigators under strict audit and governance constraints. While large language models (LLMs) can summarize heterogeneous evidence and draft rationales, unconstrained generation is risky in regulated workflows due to hallucinations, weak provenance, and explanations that are not faithful to the underlying decision. We propose an explainable AML triage framework that treats triage as an evidence-constrained decision process. Our method combines (i) retrieval-augmented evidence bundling from policy/typology guidance, customer context, alert triggers, and transaction subgraphs, (ii) a structured LLM output contract that requires explicit citations and separates supporting from contradicting or missing evidence, and (iii) counterfactual checks that validate whether minimal, plausible perturbations lead to coherent changes in both the triage recommendation and its rationale. We evaluate on public synthetic AML benchmarks and simulators and compare against rules, tabular and graph machine-learning baselines, and LLM-only/RAG-only variants. Results show that evidence grounding substantially improves auditability and reduces numerical and policy hallucination errors, while counterfactual validation further increases decision-linked explainability and robustness, yielding the best overall triage performance (PR-AUC 0.75; Escalate F1 0.62) and strong provenance and faithfulness metrics (citation validity 0.98; evidence support 0.88; counterfactual faithfulness 0.76). These findings indicate that governed, verifiable LLM systems can provide practical decision support for AML triage without sacrificing compliance requirements for traceability and defensibility.
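
The counterfactual check in item (iii) can be illustrated with a toy example. This is a sketch under heavy assumptions: `toy_model` is a threshold rule standing in for the LLM, the alert fields and the 10,000 threshold are invented, and the coherence criterion (a decision flip must be explained by the perturbed field appearing in a rationale) is one simplified reading of "coherent changes", not the paper's definition.

```python
def toy_model(alert):
    """Stand-in for the LLM triage step: a threshold rule that cites what it used."""
    cited = []
    if alert["amount"] >= 10_000:
        cited.append("amount")
    if alert["high_risk_country"]:
        cited.append("high_risk_country")
    return {"decision": "escalate" if cited else "close", "cited": cited}


def unfaithful_model(alert):
    """Decides on the amount but always cites the country field."""
    decision = "escalate" if alert["amount"] >= 10_000 else "close"
    return {"decision": decision, "cited": ["high_risk_country"]}


def coherent_under_perturbation(model, alert, field, new_value):
    """Check that a decision flip is explained by the perturbed field."""
    before = model(alert)
    after = model({**alert, field: new_value})
    if before["decision"] == after["decision"]:
        return True  # no flip, so nothing to explain in this simplified check
    # A flip should be traceable to the perturbed field in either rationale.
    return field in before["cited"] or field in after["cited"]


alert = {"amount": 12_000, "high_risk_country": False}
print(coherent_under_perturbation(toy_model, alert, "amount", 9_000))         # True
print(coherent_under_perturbation(unfaithful_model, alert, "amount", 9_000))  # False
```

Averaging this boolean over many alerts and perturbations would give a faithfulness-style rate, though how the reported 0.76 is actually computed is not stated in the abstract.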

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Referee Report

2 major / 2 minor

Summary. The manuscript proposes an explainable AML triage framework that combines retrieval-augmented evidence bundling from policy, customer, alert, and graph sources, a structured LLM output contract enforcing explicit citations and separation of supporting/contradicting evidence, and counterfactual perturbation checks. On public synthetic AML benchmarks it reports best-in-class triage performance (PR-AUC 0.75, Escalate F1 0.62) together with high provenance metrics (citation validity 0.98, evidence support 0.88, counterfactual faithfulness 0.76) relative to rules-based, tabular/graph ML, and plain LLM/RAG baselines.

Significance. If the synthetic-benchmark results generalize, the framework supplies a concrete, auditable template for LLM use in regulated financial-crime workflows, directly addressing hallucination and traceability requirements that currently limit deployment. The explicit separation of evidence types and the counterfactual validation step are particularly valuable contributions to the growing literature on verifiable LLM decision support.

major comments (2)
  1. [Abstract] Abstract: the headline performance numbers and the claim of 'practical decision support' rest exclusively on public synthetic AML benchmarks; the manuscript provides no real-world transaction data, no discussion of how the simulators capture adversarial laundering tactics or regulatory noise, and no plan for live validation, which is load-bearing for the central utility argument.
  2. [Results] Results (implied evaluation section): no error bars, no statistical significance tests, and no ablation isolating the incremental contribution of the structured contract versus the counterfactual checks are reported, so the source of the observed gains in PR-AUC and faithfulness metrics cannot be rigorously assessed.
minor comments (2)
  1. [Abstract] Abstract: the precise definitions and computation of 'citation validity', 'evidence support', and 'counterfactual faithfulness' should be stated explicitly so that the 0.98/0.88/0.76 figures can be reproduced.
  2. [Method] The description of the counterfactual perturbation generation procedure is brief; a short algorithmic outline or pseudocode would improve clarity.
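
To show the kind of definition the first minor comment asks for, here is one possible formalization of two of the provenance metrics. These definitions are hypothetical: the paper's own are exactly what the referee finds missing. In this sketch, citation validity is the share of cited ids that exist in the bundle, and evidence support is the share of claims backed by at least one cited item according to a pluggable judge `supports`, here a naive substring match.

```python
def citation_validity(claims, bundle):
    """Share of cited evidence ids that exist in the bundle (assumed definition)."""
    cited = [eid for claim in claims for eid in claim["citations"]]
    if not cited:
        return 0.0
    return sum(eid in bundle for eid in cited) / len(cited)


def evidence_support(claims, bundle, supports):
    """Share of claims backed by at least one valid cited item (assumed definition)."""
    if not claims:
        return 0.0
    backed = sum(
        any(eid in bundle and supports(claim["text"], bundle[eid])
            for eid in claim["citations"])
        for claim in claims
    )
    return backed / len(claims)


# Toy bundle mapping evidence id -> evidence text.
bundle = {
    "alert:1": "amount 12000 USD wired to new beneficiary",
    "kyc:1": "customer is a retail student account",
}
claims = [
    {"text": "12000", "citations": ["alert:1"]},
    {"text": "offshore", "citations": ["kyc:1", "alert:9"]},
]


def substring_judge(claim_text, evidence_text):
    """Naive judge: the claim counts as supported if it appears verbatim."""
    return claim_text in evidence_text


print(citation_validity(claims, bundle))                  # 2 of 3 citations resolve
print(evidence_support(claims, bundle, substring_judge))  # 0.5
```

A realistic judge would need entailment or numeric matching rather than substring search; the point is only that each metric needs an explicit numerator and denominator to be reproducible.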

Simulated Author's Rebuttal

2 responses · 1 unresolved

We thank the referee for the constructive feedback highlighting limitations in evaluation scope and statistical rigor. We agree these points require revision and will update the manuscript to qualify claims, add statistical analyses, and discuss simulator limitations while outlining future validation plans.

Point-by-point responses
  1. Referee: [Abstract] Abstract: the headline performance numbers and the claim of 'practical decision support' rest exclusively on public synthetic AML benchmarks; the manuscript provides no real-world transaction data, no discussion of how the simulators capture adversarial laundering tactics or regulatory noise, and no plan for live validation, which is load-bearing for the central utility argument.

    Authors: We acknowledge that the evaluation relies exclusively on public synthetic AML benchmarks, which are standard in this domain due to the unavailability of labeled real-world data. We agree this constrains strong claims of practical utility. In the revised manuscript we will qualify the abstract language around 'practical decision support', add a limitations subsection discussing simulator fidelity with respect to adversarial tactics and regulatory noise, and include a forward-looking statement on planned live validation. We cannot add real-world results because we lack access to proprietary transaction data.
    revision: yes

  2. Referee: [Results] Results (implied evaluation section): no error bars, no statistical significance tests, and no ablation isolating the incremental contribution of the structured contract versus the counterfactual checks are reported, so the source of the observed gains in PR-AUC and faithfulness metrics cannot be rigorously assessed.

    Authors: We agree that the current results lack error bars, significance testing, and targeted ablations. In the revised version we will report error bars from multiple random seeds, add statistical significance tests (e.g., paired bootstrap or McNemar tests) against baselines, and include ablation experiments that separately remove the structured output contract and the counterfactual checks. These additions will clarify the incremental sources of the reported PR-AUC and faithfulness gains.
    revision: yes
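
The paired bootstrap mentioned in the second response can be sketched in a few lines. This is an illustrative version only: it compares two systems on F1 for the positive (escalate) class using the standard library, the labels and predictions are made up, and the function names and the one-sided "A does not beat B" statistic are assumptions rather than the authors' planned protocol.

```python
import random


def f1_escalate(y_true, y_pred):
    """F1 for the positive class, with labels encoded as 1 = escalate, 0 = other."""
    tp = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 1)
    fp = sum(1 for t, p in zip(y_true, y_pred) if t == 0 and p == 1)
    fn = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 0)
    denom = 2 * tp + fp + fn
    return 2 * tp / denom if denom else 0.0


def paired_bootstrap_p(y_true, pred_a, pred_b, n_resamples=2000, seed=0):
    """Fraction of resamples in which system A fails to beat system B on F1."""
    rng = random.Random(seed)
    n = len(y_true)
    not_better = 0
    for _ in range(n_resamples):
        # Resample alert indices with replacement; the same indices are used
        # for both systems, which is what makes the comparison paired.
        idx = [rng.randrange(n) for _ in range(n)]
        truth = [y_true[i] for i in idx]
        a = [pred_a[i] for i in idx]
        b = [pred_b[i] for i in idx]
        if f1_escalate(truth, a) <= f1_escalate(truth, b):
            not_better += 1
    return not_better / n_resamples


y_true = [1, 0] * 20
perfect = list(y_true)   # hypothetical system A: always right
never = [0] * 40         # hypothetical system B: never escalates
print(paired_bootstrap_p(y_true, perfect, never))    # small value: A clearly better
print(paired_bootstrap_p(y_true, perfect, perfect))  # 1.0: identical systems
```

The same resampling loop would work for PR-AUC by swapping the metric function; the small returned fraction plays the role of a one-sided p-value.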

standing simulated objections not resolved
  • We do not have access to real-world AML transaction data due to regulatory and privacy constraints and therefore cannot include live-system or proprietary validation results.

Circularity Check

0 steps flagged

No significant circularity; evaluation metrics are independent of internal definitions

full rationale

The paper presents an empirical framework for AML triage using retrieval-augmented generation and counterfactual validation. It reports performance via standard metrics (PR-AUC 0.75, Escalate F1 0.62, citation validity 0.98) computed against external baselines and public synthetic benchmarks. No equations, fitted parameters, or self-citations are shown that would reduce these results to quantities defined by construction within the method itself. The derivation chain consists of standard retrieval, structured prompting, and perturbation checks whose outputs are measured externally rather than tautologically.

Axiom & Free-Parameter Ledger

0 free parameters · 1 axiom · 0 invented entities

The central claim rests on the representativeness of synthetic benchmarks for real AML workflows and on the assumption that structured prompting plus counterfactual checks reliably reduce hallucinations without introducing new biases.

axioms (1)
  • domain assumption: Synthetic AML benchmarks accurately reflect real-world alert triage complexity and regulatory requirements.
    All reported results are obtained exclusively on public synthetic benchmarks and simulators, as stated in the abstract.

pith-pipeline@v0.9.0 · 5577 in / 1443 out tokens · 52244 ms · 2026-05-15T07:31:52.709591+00:00 · methodology


Lean theorems connected to this paper

Citations machine-checked in the Pith Canon. Every link opens the source theorem in the public Lean library.

What do these tags mean?
  • matches: The paper's claim is directly supported by a theorem in the formal canon.
  • supports: The theorem supports part of the paper's argument, but the paper may add assumptions or extra steps.
  • extends: The paper goes beyond the formal theorem; the theorem is a base layer rather than the whole result.
  • uses: The paper appears to rely on the theorem as machinery.
  • contradicts: The paper's claim conflicts with a theorem or certificate in the canon.
  • unclear: Pith found a possible connection, but the passage is too broad, indirect, or ambiguous to say the theorem truly supports the claim.
