Helping Customers in Distress: An LLM-powered Agent that Converses, Probes, and Routes

Alankar Atreya; Cristovao Iglesias Jr; Devesh Batra; Giulio Pelosio; Greig A. Cowan; Michael McMillan; Patrick Sinclair; Raad Khraishi; Robert Hankache; Stefan Sylvius Wanger

arxiv: 2605.16268 · v1 · pith:MOXABJYMnew · submitted 2026-03-31 · 💻 cs.HC · cs.AI · cs.LG

Pith reviewed 2026-05-21 10:18 UTC · model grok-4.3

classification 💻 cs.HC · cs.AI · cs.LG

keywords conversational AI agent · customer triaging · fraud and dispute handling · LLM routing · policy-guided classification · synthetic dialogue evaluation · banking operations · multi-turn probing

The pith

An LLM-powered agent engages bank customers in multi-turn conversations, probes for details, and routes fraud or dispute cases to the right specialists with 30.6 percent higher classification accuracy.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper presents a customer-facing AI agent that uses large language models to handle reports of fraud, scams, and disputed transactions in banking. Instead of relying on slow manual triaging that stresses both customers and staff, the agent conducts natural dialogues, asks targeted questions guided by policy, and classifies each case for accurate routing to specialist teams. Synthetic customer simulations built from historical data allow safe, scalable testing across many real-world scenarios. When evaluated this way, the agent raised classification accuracy by 30.6 percent and received strong approval from subject-matter experts. The work focuses on embedding the agent directly in the customer journey while adding safety guardrails and reasoning steps to keep outputs compliant.

Core claim

The authors build and test an LLM-based triaging agent that carries out multi-turn conversations with customers reporting fraud or disputes, uses policy documents to guide probing questions, classifies the case according to banking rules, and routes it to the appropriate specialist team. Synthetic digital twins derived from historical records generate realistic, labelled dialogues that cover a wide range of scenarios. On these test cases the agent achieves a 30.6 percent increase in classification accuracy compared with prior manual handling, while subject-matter experts report high satisfaction with the outputs.

What carries the argument

The LLM-powered triaging agent, which integrates policy documents, safety guardrails, and multi-turn reasoning to probe customer reports and produce policy-guided classifications for routing.
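As a rough illustration of the probe-then-classify loop described here, the sketch below asks one question per turn and stops as soon as a label can be assigned. Only the four labels come from the paper (via the Figure 1 caption). `TriageState`, `run_triage`, the probe questions, and the keyword classifier that stands in for the LLM call are all hypothetical, and guardrails and policy retrieval are left out.

```python
from dataclasses import dataclass, field
from typing import Callable, List, Optional

# Labels as listed in the Figure 1 caption; everything else here is hypothetical.
LABELS = ("Fraud", "Scam", "Dispute", "Inconclusive")


@dataclass
class TriageState:
    transcript: List[str] = field(default_factory=list)  # customer replies so far
    label: Optional[str] = None


def keyword_classifier(transcript: List[str]) -> Optional[str]:
    """Toy stand-in for the LLM call: a label, or None if more probing is needed."""
    text = " ".join(transcript).lower()
    if "did not authorise" in text:
        return "Fraud"
    if "tricked" in text or "persuaded" in text:
        return "Scam"
    if "never arrived" in text or "charged twice" in text:
        return "Dispute"
    return None


def run_triage(
    customer_reply: Callable[[str], str],
    probes: List[str],
    classify: Callable[[List[str]], Optional[str]] = keyword_classifier,
    max_turns: int = 3,
) -> TriageState:
    """Ask up to max_turns probes, classifying after each customer reply."""
    state = TriageState()
    for probe in probes[:max_turns]:
        state.transcript.append(customer_reply(probe))
        state.label = classify(state.transcript)
        if state.label is not None:
            break
    if state.label is None:
        # Fall back to the explicit "cannot decide" class rather than guessing.
        state.label = "Inconclusive"
    return state
```

A scripted customer, for example `run_triage(lambda q: scripted_answers.get(q, "I am not sure."), list(scripted_answers))`, is enough to exercise the loop; in a real system the `classify` argument would wrap the LLM and policy prompt.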

If this is right

The agent can process high volumes of fraud and dispute reports more quickly than human-only workflows while maintaining policy compliance.
Targeted probing questions improve the quality of information collected before routing, reducing misdirection to wrong teams.
Safety guardrails and reasoning frameworks keep the agent from giving advice outside its scope or violating banking regulations.
Continuous evaluation with synthetic twins allows testing of rare edge cases and iterative improvement without exposing real customers to unverified decisions.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

Similar conversational agents could be adapted for triaging in insurance claims or healthcare appointment routing where policy compliance and multi-turn clarification matter.
The synthetic-twin method offers a way to stress-test routing logic on low-frequency but high-impact scenarios that rarely appear in live traffic.
Over time the agent could accumulate anonymized interaction data to refine its probing strategy and reduce the need for human review in routine cases.

Load-bearing premise

Synthetic digital twins created from historical data produce realistic labelled dialogues that cover the full range of real customer interactions and policy edge cases well enough for reliable evaluation of routing accuracy.

What would settle it

Running the agent on a fresh set of real customer interactions that were never used to build the synthetic twins and comparing its routing decisions directly against the actual specialist outcomes would show whether the reported accuracy gain holds outside the simulated environment.
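The comparison described above could be scored along the following lines. This is a minimal sketch, assuming each real interaction yields one agent label and one label derived from the specialist outcome; the function name `routing_report` and the list-based input format are illustrative, not taken from the paper.

```python
from collections import Counter
from typing import Dict, List, Tuple


def routing_report(
    predicted: List[str], actual: List[str]
) -> Tuple[float, Dict[Tuple[str, str], int]]:
    """Return overall accuracy and a confusion table keyed by (actual, predicted)."""
    if len(predicted) != len(actual) or not actual:
        raise ValueError("need two non-empty label lists of equal length")
    confusion = Counter(zip(actual, predicted))
    correct = sum(n for (a, p), n in confusion.items() if a == p)
    return correct / len(actual), dict(confusion)
```

The confusion table matters as much as the headline number here, since it shows which pairs of categories the agent mixes up on real traffic.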

Figures

Figures reproduced from arXiv: 2605.16268 by Alankar Atreya, Cristovao Iglesias Jr, Devesh Batra, Giulio Pelosio, Greig A. Cowan, Michael McMillan, Patrick Sinclair, Raad Khraishi, Robert Hankache, Stefan Sylvius Wanger.

**Figure 1.** Overview of our triaging framework.

2.1 TRIAGE AGENT. The Triage Agent handles multi-turn conversations to clarify customer issues and classifies cases as Fraud, Scam, Dispute, or Inconclusive. We utilised third-party LLMs: Claude, Gemini (Team et al., 2023), GPT (Achiam et al., 2023), via prompt engineering, avoiding training or fine-tuning for flexibility, policy compliance, and reduced data risks. The tr…

Original abstract

Banks receive millions of reports of fraud, scams, and disputed transactions every year, making it challenging to accurately direct customers to the appropriate specialist teams for assistance. The existing manual process driven by humans is slow and stressful for both customers and staff. To address this, we develop a customer-facing AI powered triaging agent that leverages large language models (LLMs) to conduct multi-turn conversations, ask relevant questions, and classify cases for accurate, policy-guided routing, making it embedded in the customer journey. To evaluate and continuously improve the agent, synthetic digital twins of real customers were simulated, generating realistic, labelled dialogues based on historical data to test a wide range of real-world scenarios. This work details the triage agent's modelling approach, integration with policy, safety guardrails and reasoning frameworks, the use of the synthetic agent for scalable evaluation, and findings on the AI system's accuracy, robustness, and compliance. Results show that the agent successfully improves triaging of historical cases, achieving a 30.6% increase in classification accuracy, with high satisfaction levels reported by our subject-matter experts, highlighting how targeted probing can lead to more effective triage in banking operations at scale.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Desk Editor's Note: private letter to a colleague

The paper describes a policy-guided LLM triage agent for banking fraud reports that uses synthetic customer twins for testing and claims a 30.6% accuracy gain, but the evaluation lacks checks that the synthetic data matches real interaction distributions.

This paper is about an LLM agent that handles bank customers reporting fraud or scams. It runs multi-turn conversations, probes with policy-based questions, and routes cases to the right teams. The authors test the system with synthetic digital twins built from historical data and report a 30.6% accuracy increase plus strong feedback from subject-matter experts. The work sits inside a live customer journey rather than as a standalone demo.

What stands out is the end-to-end design that ties the LLM to explicit policies and safety guardrails while using the same synthetic setup for scalable testing and iteration. That approach fits high-volume operations where real calls cannot be used freely for development. The synthetic twins let them cover a wide range of scenarios without waiting for live traffic.

The softer part is the evidence for the accuracy claim. All results come from dialogues generated by the twins derived from the same historical data the agent is meant to handle. The description does not include statistical comparisons such as intent frequency checks or blind expert reviews of real versus synthetic transcripts. Without those, the test set could be smoother or less adversarial than actual customer calls, which would make the reported lift look larger than it would be in production. There are also no details on the baseline system or any error bars or significance tests.

This paper is for teams building conversational tools in regulated customer service, especially banking or insurance. Practitioners who need to combine LLMs with policy constraints and synthetic evaluation will pick up usable architecture choices. It deserves a serious referee because it tackles a concrete, high-stakes problem with measurable outcomes, even if the evaluation section needs more grounding. I would send it to peer review and ask the authors to add validation that the synthetic dialogues reproduce real-world distributions and edge cases.

Referee Report

2 major / 2 minor

Summary. The paper presents an LLM-powered conversational agent for triaging banking customers reporting fraud, scams, or disputed transactions. The agent performs multi-turn dialogues, probes for details, integrates policy rules and safety guardrails, and routes cases to specialist teams. Evaluation relies on synthetic digital twins of customers that generate labelled dialogues from historical data; the central result is a 30.6% gain in classification accuracy over the manual process, accompanied by high satisfaction ratings from subject-matter experts.

Significance. If the synthetic evaluation proves representative of live traffic, the work offers a practical demonstration of embedding policy-aware LLMs into high-volume regulated customer-service workflows. The combination of probing strategies, guardrails, and scalable synthetic testing could reduce routing errors and operational stress in financial institutions. The approach also illustrates how synthetic data can support continuous improvement loops when real labelled interactions are scarce or sensitive.

major comments (2)

[Abstract / Evaluation] Abstract and Evaluation section: the reported 30.6% classification-accuracy gain is stated without identifying the baseline method, the precise metric (top-1 accuracy, macro-F1, etc.), error bars, or any statistical significance test. Because the improvement is the primary quantitative claim, these omissions make it impossible to judge whether the lift is reliable or merely an artifact of the test distribution.
[Evaluation] Evaluation section: synthetic dialogues are generated from the same historical data and policies the agent is designed to apply, yet no distributional checks (KL divergence on intent frequencies, n-gram overlap, or blinded expert comparison of real vs. synthetic transcripts) are reported. Without such grounding, the 30.6% figure risks reflecting an easier or less adversarial test set rather than genuine routing improvement.
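One of the distributional checks named in the second comment, KL divergence on intent frequencies, can be sketched in a few lines. The add-one smoothing (`alpha`) and the function names are assumptions made here so that intents missing from one corpus do not produce an infinite divergence; the paper does not describe such a check.

```python
import math
from collections import Counter
from typing import Dict, Iterable, List


def smoothed_distribution(
    labels: List[str], support: List[str], alpha: float = 1.0
) -> Dict[str, float]:
    """Relative frequency of each intent in `support`, with additive smoothing."""
    counts = Counter(labels)
    total = sum(counts[s] for s in support) + alpha * len(support)
    return {s: (counts[s] + alpha) / total for s in support}


def kl_divergence(p: Dict[str, float], q: Dict[str, float]) -> float:
    """KL(p || q) in nats; assumes q[k] > 0 wherever p[k] > 0."""
    return sum(p[k] * math.log(p[k] / q[k]) for k in p if p[k] > 0)


def intent_kl(real: Iterable[str], synthetic: Iterable[str], alpha: float = 1.0) -> float:
    """KL divergence from synthetic to real intent frequencies over their joint support."""
    real, synthetic = list(real), list(synthetic)
    support = sorted(set(real) | set(synthetic))
    p = smoothed_distribution(real, support, alpha)
    q = smoothed_distribution(synthetic, support, alpha)
    return kl_divergence(p, q)
```

A value near zero only says the intent mix matches; it says nothing about whether individual synthetic dialogues are as messy or adversarial as real ones, which is why the comment also asks for a blinded expert comparison.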

minor comments (2)

[Modelling approach] The description of the reasoning framework and policy integration would benefit from a short annotated example dialogue that shows how guardrails are triggered and how the final routing decision is produced.
[Results] Figure captions and axis labels for any accuracy or satisfaction plots should explicitly state the number of synthetic dialogues and the number of expert raters.

Simulated Author's Rebuttal

2 responses · 0 unresolved

We thank the referee for their detailed and constructive feedback on our manuscript. We have carefully considered each major comment and provide point-by-point responses below, along with our plans for revision.

Referee: [Abstract / Evaluation] Abstract and Evaluation section: the reported 30.6% classification-accuracy gain is stated without identifying the baseline method, the precise metric (top-1 accuracy, macro-F1, etc.), error bars, or any statistical significance test. Because the improvement is the primary quantitative claim, these omissions make it impossible to judge whether the lift is reliable or merely an artifact of the test distribution.

Authors: We agree that additional details are necessary to properly contextualize our primary result. The baseline is the existing manual triaging process performed by human agents. The metric used is classification accuracy, defined as the percentage of cases correctly classified and routed to the appropriate specialist team according to the bank's policies. To strengthen the claim, we will add error bars computed via bootstrapping over the test dialogues and include a statistical significance test (e.g., McNemar's test for paired comparisons). These details will be incorporated into both the abstract and the Evaluation section in the revised manuscript.
revision: yes
Referee: [Evaluation] Evaluation section: synthetic dialogues are generated from the same historical data and policies the agent is designed to apply, yet no distributional checks (KL divergence on intent frequencies, n-gram overlap, or blinded expert comparison of real vs. synthetic transcripts) are reported. Without such grounding, the 30.6% figure risks reflecting an easier or less adversarial test set rather than genuine routing improvement.

Authors: We recognize the value of explicitly demonstrating that the synthetic data distribution aligns with real customer interactions. Although the synthetic digital twins were constructed to mirror historical patterns, we did not report formal distributional checks in the original submission. In the revision, we will include KL divergence measures on intent and entity frequencies, n-gram overlap statistics between real and synthetic transcripts, and results from a blinded expert review comparing a subset of real and synthetic dialogues for realism and fidelity. This additional analysis will be added to the Evaluation section.
revision: yes
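The first response promises bootstrapped error bars and McNemar's test. A minimal standard-library sketch of both follows, assuming the agent and the manual baseline were scored on the same cases so that each case contributes a pair of correct/incorrect flags. The function names, the percentile bootstrap, and the exact (binomial) form of McNemar's test are choices made here for illustration, not details from the paper.

```python
import math
import random
from typing import Sequence, Tuple


def bootstrap_accuracy_gain(
    agent_ok: Sequence[bool],
    baseline_ok: Sequence[bool],
    n_resamples: int = 2000,
    seed: int = 0,
    level: float = 0.95,
) -> Tuple[float, float]:
    """Percentile bootstrap interval for (agent accuracy - baseline accuracy)."""
    if len(agent_ok) != len(baseline_ok) or not agent_ok:
        raise ValueError("need two non-empty sequences of equal length")
    rng = random.Random(seed)
    n = len(agent_ok)
    diffs = []
    for _ in range(n_resamples):
        # Resample cases with replacement, keeping each pair together.
        idx = [rng.randrange(n) for _ in range(n)]
        diffs.append(sum(agent_ok[i] - baseline_ok[i] for i in idx) / n)
    diffs.sort()
    lo = diffs[int((1 - level) / 2 * n_resamples)]
    hi = diffs[min(n_resamples - 1, int((1 + level) / 2 * n_resamples))]
    return lo, hi


def mcnemar_exact(agent_ok: Sequence[bool], baseline_ok: Sequence[bool]) -> float:
    """Two-sided exact McNemar p-value computed from the discordant pairs only."""
    b = sum(a and not m for a, m in zip(agent_ok, baseline_ok))  # agent right, baseline wrong
    c = sum(m and not a for a, m in zip(agent_ok, baseline_ok))  # baseline right, agent wrong
    n = b + c
    if n == 0:
        return 1.0
    tail = sum(math.comb(n, k) for k in range(min(b, c) + 1)) / 2 ** n
    return min(1.0, 2 * tail)
```

Both functions work on paired outcomes, which matches the paired comparison the authors mention; if the baseline was scored on a different set of cases, an unpaired test would be needed instead.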

Circularity Check

1 step flagged

Accuracy gain measured on synthetic dialogues generated from the same historical data used for agent design

specific steps

fitted input called prediction [Abstract]
"To evaluate and continuously improve the agent, synthetic digital twins of real customers were simulated, generating realistic, labelled dialogues based on historical data to test a wide range of real-world scenarios. [...] Results show that the agent successfully improves triaging of historical cases, achieving a 30.6% increase in classification accuracy"

The accuracy metric is computed on dialogues whose labels and content are generated from the identical historical data and policies that define the triaging task. Because the synthetic set is produced from the same source the agent is built to address, the measured lift is taken on a test distribution that is statistically downstream of the input data rather than on held-out real interactions.

full rationale

The paper's main empirical claim (30.6% accuracy lift) is obtained exclusively by testing on labelled dialogues produced by digital twins that are themselves constructed from the historical cases and policies the agent is intended to handle. This creates a fitted-input-called-prediction pattern: the test distribution is derived from the input data without reported independent statistical grounding (e.g., distribution divergence checks or real-transcript blind validation). The derivation chain therefore reduces the reported improvement to performance on a constructed proxy rather than an external benchmark. No mathematical self-definition or self-citation chain is present, but the evaluation step itself is not independent of the data used to motivate and tune the system.
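One way to loosen the dependence flagged here is to fix, before any twins or prompts are built, which historical cases may inform design and which are reserved for evaluation. The sketch below does this with a deterministic hash of a case identifier. The identifier format, the split names, and the 20% holdout fraction are hypothetical, and nothing in the paper indicates that such a split was used.

```python
import hashlib
from typing import Dict, Iterable, List


def assign_split(case_id: str, holdout_fraction: float = 0.2) -> str:
    """Map a case ID to 'design' or 'holdout' using a stable hash, not random state."""
    digest = hashlib.sha256(case_id.encode("utf-8")).digest()
    bucket = int.from_bytes(digest[:8], "big") / 2 ** 64  # uniform in [0, 1)
    return "holdout" if bucket < holdout_fraction else "design"


def split_cases(
    case_ids: Iterable[str], holdout_fraction: float = 0.2
) -> Dict[str, List[str]]:
    """Partition case IDs so held-out cases never feed twin construction or prompt tuning."""
    splits: Dict[str, List[str]] = {"design": [], "holdout": []}
    for cid in case_ids:
        splits[assign_split(cid, holdout_fraction)].append(cid)
    return splits
```

Because the assignment depends only on the identifier, a case keeps its split across reruns and as new cases arrive, which makes accidental leakage between design and evaluation easier to rule out.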

Axiom & Free-Parameter Ledger

0 free parameters · 1 axiom · 1 invented entity

The central claim rests on the unverified assumption that synthetic customer simulations faithfully reproduce real interaction patterns and that LLM outputs remain policy-compliant across the tested scenarios.

axioms (1)

Domain assumption: Synthetic digital twins based on historical data produce realistic and representative customer dialogues for agent evaluation.
The evaluation and accuracy claims depend directly on this premise.

invented entities (1)

Synthetic digital twins of customers (no independent evidence)
purpose: Generate labelled multi-turn dialogues to test the triage agent at scale without using live customer data.
Introduced as the primary evaluation mechanism; no independent external validation is described.

pith-pipeline@v0.9.0 · 5782 in / 1429 out tokens · 40317 ms · 2026-05-21T10:18:29.023498+00:00 · methodology

Reference graph

Works this paper leans on

12 extracted references · 12 canonical work pages · 2 internal anchors

[1] Bengio, Yoshua and LeCun, Yann. Scaling Learning Algorithms Towards
[2] Hinton, Geoffrey E. and Osindero, Simon and Teh, Yee Whye. A Fast Learning Algorithm for Deep Belief Nets.
[3] Deep learning. 2016.
[4] Annual Fraud Report 2025. 2025.
[5] LLM Applications: Current Paradigms and the Next Frontier. arXiv preprint arXiv:2503.04596.
[6] A GPT-based method of Automated Compliance Checking through prompt engineering.
[7] A survey on llm-as-a-judge. The Innovation.
[8] GPT-4 Technical Report. arXiv preprint arXiv:2303.08774.
[9] Gemini: A Family of Highly Capable Multimodal Models. arXiv preprint arXiv:2312.11805.
[10] Introduction to Amazon Bedrock. In A Practical Guide to Generative AI Using Amazon Bedrock: Building, Deploying, and Securing Generative AI Applications. 2025.
[11] A review of llm agent applications in finance and banking. Available at SSRN 5381584.
[12] Empowering digital twins with large language models for global temporal feature learning. Journal of Manufacturing Systems. 2024.
