arxiv: 2604.19755 · v1 · submitted 2026-03-22 · 💻 cs.AI · cs.LG

Explainable AML Triage with LLMs: Evidence Retrieval and Counterfactual Checks

Dorothy Torres, Wei Cheng, Ke Hu

Pith reviewed 2026-05-15 07:31 UTC · model grok-4.3

keywords: AML triage · explainable AI · large language models · evidence retrieval · counterfactual checks · transaction monitoring · regulatory compliance · hallucination reduction

The pith

Evidence-constrained LLMs with retrieval and counterfactual checks deliver auditable AML triage.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

This paper develops a framework for using large language models to triage anti-money laundering alerts while enforcing traceability and accuracy. It bundles evidence from policies, customer details, alerts, and transaction networks through retrieval methods, then requires the model to produce structured outputs with explicit citations and distinctions between supporting, contradicting, and missing evidence. Counterfactual checks test if small plausible changes in the data would alter the triage decision and explanation in a consistent way. The approach is tested against baselines on synthetic benchmarks, showing gains in performance and in metrics for provenance and faithfulness. A sympathetic reader would see this as a way to make AI tools usable in regulated settings where decisions must be defensible.

Core claim

The paper claims that an evidence-constrained decision process, built from retrieval-augmented evidence bundling, a structured LLM output contract requiring citations and evidence separation, and counterfactual perturbation checks, yields superior AML triage performance and explainability compared to rules-based systems, graph ML models, and unconstrained LLM variants on public synthetic benchmarks.

What carries the argument

The evidence-constrained decision process integrating retrieval-augmented bundling of policy, context, alert and subgraph data with citation-mandating contracts and counterfactual validation.
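One way to picture the citation-mandating contract is as a typed record plus a validator that rejects any claim whose citation does not resolve to the retrieved bundle. The sketch below is only an illustration under stated assumptions: the class `TriageOutput`, the function `validate_contract`, the two recommendation labels, and the ID format are hypothetical and are not taken from the paper.

```python
from dataclasses import dataclass, field


@dataclass
class TriageOutput:
    """Hypothetical output contract: each claim maps to the evidence IDs it cites."""
    recommendation: str  # assumed labels: "escalate" or "close"
    supporting: dict[str, list[str]] = field(default_factory=dict)
    contradicting: dict[str, list[str]] = field(default_factory=dict)
    missing: list[str] = field(default_factory=list)  # evidence the model says it lacks


def validate_contract(output: TriageOutput, bundle_ids: set[str]) -> list[str]:
    """Return a list of contract violations; an empty list means the output passes."""
    violations = []
    if output.recommendation not in {"escalate", "close"}:
        violations.append(f"unknown recommendation: {output.recommendation}")
    sections = (("supporting", output.supporting),
                ("contradicting", output.contradicting))
    for section_name, section in sections:
        for claim, cited in section.items():
            if not cited:
                violations.append(f"{section_name} claim without citation: {claim}")
            for evidence_id in cited:
                # A citation is only valid if it points into the retrieved bundle.
                if evidence_id not in bundle_ids:
                    violations.append(f"citation not in bundle: {evidence_id}")
    return violations
```

Under this reading, an output that cites an ID outside the bundle, or makes a claim with no citation at all, would be rejected before it reaches an investigator.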

If this is right

Evidence grounding substantially improves auditability and reduces numerical and policy hallucination errors.
Counterfactual validation further increases decision-linked explainability and robustness.
The combined method achieves the best overall triage performance on synthetic AML benchmarks.
High scores are reached for citation validity, evidence support, and counterfactual faithfulness.
Governed LLM systems can support AML triage decisions while satisfying compliance requirements for traceability.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the authors make directly.

Similar evidence and counterfactual structures could be adapted for explainable decision support in other regulated fields like sanctions screening or fraud investigation.
Deployment in practice would benefit from testing against real transaction data to confirm the synthetic benchmarks capture key variabilities.
The counterfactual mechanism might serve as a diagnostic tool to identify weaknesses in the evidence sources themselves.
Adoption could allow investigators to focus on complex cases by trusting the automated triage for well-supported alerts.

Load-bearing premise

Public synthetic AML benchmarks and simulators are representative enough of real-world transaction patterns, regulatory noise, and investigator processes.

What would settle it

Testing the framework on proprietary real-world AML datasets from financial institutions and finding significantly lower performance or faithfulness scores than on the synthetic benchmarks.

Figures

Figures reproduced from arXiv: 2604.19755 by Dorothy Torres, Ke Hu, Wei Cheng.

**Figure 1.** AML Triage Architecture.

3.2 Evidence Representation and Retrieval (excerpt): We represent every retrievable artifact as an evidence item 𝑒 ∈ ℰ with a stable identifier, a source type, an effective timestamp, and an access-control label (ACL). Evidence types include (i) policy and typology guidance, (ii) customer and KYC profile attributes, (iii) alert trigger metadata (e.g., rule identifiers, thresholds, scoring outpu…
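The four attributes named in the excerpt (stable identifier, source type, effective timestamp, ACL) map naturally onto a small record type. The following is a minimal sketch, assuming a hypothetical `text` payload field and a hypothetical `build_bundle` filter; the paper's actual retrieval logic is not shown in this excerpt, so the filtering rule here is an assumption.

```python
from dataclasses import dataclass
from datetime import datetime


@dataclass(frozen=True)
class EvidenceItem:
    evidence_id: str        # stable identifier
    source_type: str        # e.g. "policy", "kyc", "alert", "subgraph"
    effective_at: datetime  # effective timestamp
    acl: str                # access-control label
    text: str               # payload shown to the model (assumed field)


def build_bundle(items, allowed_acls, as_of):
    """Keep items the caller may read and that were already in effect at `as_of`.

    Assumed rule: an item is eligible if its ACL is in `allowed_acls`
    and its effective timestamp is not later than the decision time.
    """
    return [
        item for item in items
        if item.acl in allowed_acls and item.effective_at <= as_of
    ]
```

Filtering on the effective timestamp is one plausible way to keep a later policy revision from leaking into the triage of an earlier alert; filtering on the ACL keeps restricted evidence out of bundles assembled for callers who are not cleared to read it.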

**Figure 2.** Evidence bundle construction and provenance constraints.

Original abstract

Anti-money laundering (AML) transaction monitoring generates large volumes of alerts that must be rapidly triaged by investigators under strict audit and governance constraints. While large language models (LLMs) can summarize heterogeneous evidence and draft rationales, unconstrained generation is risky in regulated workflows due to hallucinations, weak provenance, and explanations that are not faithful to the underlying decision. We propose an explainable AML triage framework that treats triage as an evidence-constrained decision process. Our method combines (i) retrieval-augmented evidence bundling from policy/typology guidance, customer context, alert triggers, and transaction subgraphs, (ii) a structured LLM output contract that requires explicit citations and separates supporting from contradicting or missing evidence, and (iii) counterfactual checks that validate whether minimal, plausible perturbations lead to coherent changes in both the triage recommendation and its rationale. We evaluate on public synthetic AML benchmarks and simulators and compare against rules, tabular and graph machine-learning baselines, and LLM-only/RAG-only variants. Results show that evidence grounding substantially improves auditability and reduces numerical and policy hallucination errors, while counterfactual validation further increases decision-linked explainability and robustness, yielding the best overall triage performance (PR-AUC 0.75; Escalate F1 0.62) and strong provenance and faithfulness metrics (citation validity 0.98; evidence support 0.88; counterfactual faithfulness 0.76). These findings indicate that governed, verifiable LLM systems can provide practical decision support for AML triage without sacrificing compliance requirements for traceability and defensibility.
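For readers less familiar with the two headline triage metrics, the sketch below shows one common way to compute them in plain Python. It uses average precision as a step-wise estimate of PR-AUC; the abstract does not say whether the paper uses this estimator or a trapezoidal one, so treat that choice as an assumption. The function names are illustrative.

```python
def f1_score(y_true, y_pred, positive=1):
    """F1 for the positive class (for example, the 'escalate' label)."""
    pairs = list(zip(y_true, y_pred))
    tp = sum(1 for t, p in pairs if t == positive and p == positive)
    fp = sum(1 for t, p in pairs if t != positive and p == positive)
    fn = sum(1 for t, p in pairs if t == positive and p != positive)
    if tp == 0:
        return 0.0
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    return 2 * precision * recall / (precision + recall)


def average_precision(y_true, scores):
    """Step-wise area under the precision-recall curve.

    Alerts are ranked by descending score; precision is accumulated
    at each rank where a true positive is found. Ties keep input order.
    """
    n_pos = sum(y_true)
    if n_pos == 0:
        raise ValueError("need at least one positive label")
    ranked = sorted(zip(scores, y_true), key=lambda pair: -pair[0])
    hits, total = 0, 0.0
    for rank, (_, label) in enumerate(ranked, start=1):
        if label == 1:
            hits += 1
            total += hits / rank  # precision at this recall step
    return total / n_pos
```

Both metrics focus on the positive (escalate) class, which is why they are commonly preferred over plain accuracy when suspicious alerts are rare.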

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Desk Editor's Note

Private letter to a colleague

The paper gives a workable template for grounding LLMs in AML triage with citations and counterfactual checks, but all gains are shown only on synthetic benchmarks.

The core contribution is a three-part pipeline: bundle evidence from policies and transaction graphs, force the LLM to produce output with explicit citations that separates supporting from missing evidence, then run minimal perturbations to test whether the rationale and decision hold up. This assembly is not just another RAG setup; it is tailored to the audit requirements that currently block LLM use in compliance work. On the synthetic AML benchmarks the authors report clear lifts in PR-AUC and F1 over rules, graph ML, and plain LLM baselines, plus high scores on citation validity and evidence support. That is useful evidence that the constraints reduce obvious hallucinations in this setting.

The main limitation is that every number comes from public synthetic simulators. Those datasets lack the incomplete records, policy drift, and human triage variability that dominate live AML environments, so the reported robustness may not carry over. There are also no error bars, no statistical tests on the metric differences, and no ablation that isolates which component drives the gains.

The paper is therefore a solid first sketch of a governed LLM workflow rather than a finished demonstration. Readers who build or evaluate AML systems will find the output contract and counterfactual step worth copying. It deserves a serious referee because the problem is real and the proposed structure is concrete, even if the current experiments need real-data follow-up before anyone would deploy it.

Referee Report

2 major / 2 minor

Summary. The manuscript proposes an explainable AML triage framework that combines retrieval-augmented evidence bundling from policy, customer, alert, and graph sources, a structured LLM output contract enforcing explicit citations and separation of supporting/contradicting evidence, and counterfactual perturbation checks. On public synthetic AML benchmarks it reports best-in-class triage performance (PR-AUC 0.75, Escalate F1 0.62) together with high provenance metrics (citation validity 0.98, evidence support 0.88, counterfactual faithfulness 0.76) relative to rules-based, tabular/graph ML, and plain LLM/RAG baselines.

Significance. If the synthetic-benchmark results generalize, the framework supplies a concrete, auditable template for LLM use in regulated financial-crime workflows, directly addressing hallucination and traceability requirements that currently limit deployment. The explicit separation of evidence types and the counterfactual validation step are particularly valuable contributions to the growing literature on verifiable LLM decision support.

major comments (2)

[Abstract] The headline performance numbers and the claim of 'practical decision support' rest exclusively on public synthetic AML benchmarks; the manuscript provides no real-world transaction data, no discussion of how the simulators capture adversarial laundering tactics or regulatory noise, and no plan for live validation. This gap is load-bearing for the central utility argument.
[Results] (implied evaluation section) No error bars, no statistical significance tests, and no ablation isolating the incremental contribution of the structured contract versus the counterfactual checks are reported, so the source of the observed gains in PR-AUC and faithfulness metrics cannot be rigorously assessed.

minor comments (2)

[Abstract] The precise definitions and computation of 'citation validity', 'evidence support', and 'counterfactual faithfulness' should be stated explicitly so that the 0.98/0.88/0.76 figures can be reproduced.
[Method] The description of the counterfactual perturbation generation procedure is brief; a short algorithmic outline or pseudocode would improve clarity.
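To make the referee's request concrete, here is a hedged sketch of what a counterfactual check could look like. It is not the authors' procedure, which the review describes only as brief. The function `toy_triage` is a rule-based stand-in for the LLM, the feature names are invented, and the coherence rule (a feature whose change flips the recommendation must have been cited in the original rationale) is an assumption chosen for illustration.

```python
def toy_triage(case):
    """Stand-in for the LLM: returns (recommendation, set of features cited)."""
    cited = []
    if case["n_deposits"] >= 5 and case["amount"] < case["threshold"]:
        cited += ["n_deposits", "amount"]  # many deposits just under the threshold
    if case["high_risk_country"]:
        cited.append("high_risk_country")
    return ("escalate" if cited else "close"), set(cited)


def counterfactual_check(triage, case, perturbations):
    """Fraction of perturbations whose effect is coherent with the original rationale.

    Assumed coherence rule: if changing a feature flips the recommendation,
    that feature must have been cited in the original rationale.
    """
    if not perturbations:
        raise ValueError("need at least one perturbation")
    base_rec, base_cited = triage(case)
    coherent = 0
    for feature, new_value in perturbations:
        perturbed = {**case, feature: new_value}  # minimal single-feature change
        new_rec, _ = triage(perturbed)
        flipped = new_rec != base_rec
        if not flipped or feature in base_cited:
            coherent += 1
    return coherent / len(perturbations)
```

A model whose decision depends on a feature it never cites would score below 1.0 under this rule, which is the kind of unfaithful explanation the counterfactual step is meant to expose.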

Simulated Author's Rebuttal

2 responses · 1 unresolved

We thank the referee for the constructive feedback highlighting limitations in evaluation scope and statistical rigor. We agree these points require revision and will update the manuscript to qualify claims, add statistical analyses, and discuss simulator limitations while outlining future validation plans.

Referee: [Abstract] The headline performance numbers and the claim of 'practical decision support' rest exclusively on public synthetic AML benchmarks; the manuscript provides no real-world transaction data, no discussion of how the simulators capture adversarial laundering tactics or regulatory noise, and no plan for live validation. This gap is load-bearing for the central utility argument.

Authors: We acknowledge that the evaluation relies exclusively on public synthetic AML benchmarks, which are standard in this domain due to the unavailability of labeled real-world data. We agree this constrains strong claims of practical utility. In the revised manuscript we will qualify the abstract language around 'practical decision support', add a limitations subsection discussing simulator fidelity with respect to adversarial tactics and regulatory noise, and include a forward-looking statement on planned live validation. We cannot add real-world results because we lack access to proprietary transaction data.
Revision: yes

Referee: [Results] (implied evaluation section) No error bars, no statistical significance tests, and no ablation isolating the incremental contribution of the structured contract versus the counterfactual checks are reported, so the source of the observed gains in PR-AUC and faithfulness metrics cannot be rigorously assessed.

Authors: We agree that the current results lack error bars, significance testing, and targeted ablations. In the revised version we will report error bars from multiple random seeds, add statistical significance tests (e.g., paired bootstrap or McNemar tests) against baselines, and include ablation experiments that separately remove the structured output contract and the counterfactual checks. These additions will clarify the incremental sources of the reported PR-AUC and faithfulness gains.
Revision: yes
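As an illustration of the paired bootstrap mentioned in the response, the sketch below resamples test alerts with replacement and counts how often system A fails to beat system B. It uses per-alert correctness (accuracy) for simplicity, which is an assumption: the same resampling loop could wrap PR-AUC or F1 instead. The function name and defaults are hypothetical.

```python
import random


def paired_bootstrap(correct_a, correct_b, n_resamples=2000, seed=0):
    """Paired bootstrap comparison of two systems on the same test alerts.

    correct_a / correct_b: per-alert 0/1 correctness, aligned by alert.
    Returns (observed accuracy difference A minus B,
             fraction of resamples in which A is not better than B).
    """
    if len(correct_a) != len(correct_b) or not correct_a:
        raise ValueError("inputs must be non-empty and the same length")
    n = len(correct_a)
    rng = random.Random(seed)  # fixed seed so the sketch is reproducible
    observed = (sum(correct_a) - sum(correct_b)) / n
    not_better = 0
    for _ in range(n_resamples):
        # Resample alert indices with replacement, keeping the pairing intact.
        idx = [rng.randrange(n) for _ in range(n)]
        diff = sum(correct_a[i] - correct_b[i] for i in idx)
        if diff <= 0:
            not_better += 1
    return observed, not_better / n_resamples
```

A small second value suggests the observed gain is unlikely to be an artifact of which alerts happened to land in the test set. Keeping the pairing matters because both systems are scored on the same alerts.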

Standing simulated objections (not resolved)

We do not have access to real-world AML transaction data due to regulatory and privacy constraints and therefore cannot include live-system or proprietary validation results.

Circularity Check

0 steps flagged

No significant circularity; evaluation metrics are independent of internal definitions

The paper presents an empirical framework for AML triage using retrieval-augmented generation and counterfactual validation. It reports performance via standard metrics (PR-AUC 0.75, Escalate F1 0.62, citation validity 0.98) computed against external baselines and public synthetic benchmarks. No equations, fitted parameters, or self-citations are shown that would reduce these results to quantities defined by construction within the method itself. The derivation chain consists of standard retrieval, structured prompting, and perturbation checks whose outputs are measured externally rather than tautologically.

Axiom & Free-Parameter Ledger

0 free parameters · 1 axiom · 0 invented entities

The central claim rests on the representativeness of synthetic benchmarks for real AML workflows and on the assumption that structured prompting plus counterfactual checks reliably reduce hallucinations without introducing new biases.

axioms (1)

Domain assumption: Synthetic AML benchmarks accurately reflect real-world alert triage complexity and regulatory requirements.
Basis: All reported results are obtained exclusively on public synthetic benchmarks and simulators, as stated in the abstract.

Lean theorems connected to this paper

Citations machine-checked in the Pith Canon. Every link opens the source theorem in the public Lean library.

IndisputableMonolith/Foundation/RealityFromDistinction.lean · reality_from_one_distinction
Relation between the paper passage and the cited Recognition theorem: unclear
Paper passage: "We propose an explainable AML triage framework that treats triage as an evidence-constrained decision process... retrieval-augmented evidence bundling... counterfactual checks"

IndisputableMonolith/Cost/FunctionalEquation.lean · washburn_uniqueness_aczel
Relation between the paper passage and the cited Recognition theorem: unclear
Paper passage: "Results show that evidence grounding substantially improves auditability... PR-AUC 0.75; Escalate F1 0.62"

What do these tags mean?

matches: The paper's claim is directly supported by a theorem in the formal canon.
supports: The theorem supports part of the paper's argument, but the paper may add assumptions or extra steps.
extends: The paper goes beyond the formal theorem; the theorem is a base layer rather than the whole result.
uses: The paper appears to rely on the theorem as machinery.
contradicts: The paper's claim conflicts with a theorem or certificate in the canon.
unclear: Pith found a possible connection, but the passage is too broad, indirect, or ambiguous to say the theorem truly supports the claim.
