LLM-Assisted Authentication and Fraud Detection

Aldar C-F. Chan; Emunah S-S. Chan

arxiv: 2601.19684 · v3 · submitted 2026-01-27 · 💻 cs.CR


Pith reviewed 2026-05-16 10:34 UTC · model grok-4.3

classification 💻 cs.CR

keywords LLM authentication · semantic matching · RAG fraud detection · false positive reduction · knowledge-based authentication · hybrid scoring · hallucination mitigation


The pith

LLM semantic checks accept 99.5% of legitimate non-exact answers while holding false acceptance to 0.1%.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper shows how large language models can judge whether a user's answer to a security question carries the right meaning instead of demanding exact wording. It combines the model's judgment with cosine similarity to produce a hybrid score. For fraud detection the same models are anchored to a fixed collection of evidence documents so their reasoning stays tied to known patterns. Experiments report that the authentication side tolerates natural wording differences yet still blocks impostors, while the fraud side cuts false positives from 17.2% to 3.5% without retraining the model when new scams appear.

Core claim

The central claim is that an LLM-assisted authentication mechanism that evaluates semantic correctness rather than exact wording, supported by document segmentation and a hybrid scoring method combining LLM judgement with cosine-similarity metrics, accepts 99.5% of legitimate non-exact answers while maintaining a 0.1% false-acceptance rate, and that a RAG-based fraud-detection pipeline that grounds LLM reasoning in curated evidence reduces false positives from 17.2% to 3.5% and adapts to emerging scam patterns without model retraining.

What carries the argument

Hybrid LLM judgement plus cosine-similarity scoring for semantic authentication, paired with a retrieval-augmented generation pipeline that grounds fraud decisions in a fixed evidence base.
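
As a rough illustration of what a hybrid score of this kind could look like, the sketch below blends an LLM judgement with a cosine-similarity score and compares the result to a threshold. This page does not state the paper's actual combination rule, weights, or threshold, so the weighted average, `llm_weight`, `threshold`, and the function names are all assumptions. The LLM judgement is passed in as a number rather than obtained from a model, and the cosine score uses simple word counts where a real system would more plausibly use embeddings.

```python
import math
from collections import Counter


def cosine_similarity(a: str, b: str) -> float:
    """Cosine similarity between bag-of-words count vectors of two strings."""
    va, vb = Counter(a.lower().split()), Counter(b.lower().split())
    dot = sum(va[t] * vb[t] for t in va.keys() & vb.keys())
    norm = math.sqrt(sum(c * c for c in va.values())) * math.sqrt(
        sum(c * c for c in vb.values())
    )
    return dot / norm if norm else 0.0


def hybrid_score(llm_judgement: float, cosine: float, llm_weight: float = 0.7) -> float:
    """Weighted blend of two scores in [0, 1]. The 0.7 weight is an assumption."""
    return llm_weight * llm_judgement + (1.0 - llm_weight) * cosine


def accept(llm_judgement: float, expected: str, answer: str, threshold: float = 0.6) -> bool:
    """Accept the answer when the blended score reaches the (assumed) threshold."""
    return hybrid_score(llm_judgement, cosine_similarity(expected, answer)) >= threshold
```

With these assumed settings, a paraphrased answer that the LLM judges correct can pass even when its word overlap with the stored answer is modest, while an answer the LLM rejects cannot pass on overlap alone.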

If this is right

Users can answer security questions in their own words without being locked out.
Fraud systems can absorb new scam tactics by adding documents to the evidence base rather than retraining.
Each fraud decision can cite the specific evidence documents that supported it.
False-positive burden on legitimate users drops while security thresholds remain high.
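
The second and third consequences above can be made concrete with a toy sketch: new scam patterns are handled by adding a document, and each decision returns the identifiers of the evidence that supported it. Everything here is hypothetical, including `EvidenceBase`, `assess`, the document ids, and the `min_overlap` threshold. In particular, shared-word counting stands in for both retrieval and the LLM's reasoning, which the paper's pipeline would perform with a real retriever and model.

```python
from dataclasses import dataclass, field


@dataclass
class EvidenceBase:
    """In-memory stand-in for a curated evidence collection (hypothetical)."""
    docs: dict[str, str] = field(default_factory=dict)

    def add(self, doc_id: str, text: str) -> None:
        self.docs[doc_id] = text

    def retrieve(self, message: str, k: int = 2) -> list[tuple[str, int]]:
        """Return up to k (doc_id, overlap) pairs ranked by shared-word count."""
        words = set(message.lower().split())
        scored = [
            (doc_id, len(words & set(text.lower().split())))
            for doc_id, text in self.docs.items()
        ]
        hits = [pair for pair in scored if pair[1] > 0]
        return sorted(hits, key=lambda pair: pair[1], reverse=True)[:k]


def assess(message: str, base: EvidenceBase, min_overlap: int = 3) -> dict:
    """Flag a message only when retrieved evidence supports it, and cite that evidence."""
    cited = [doc_id for doc_id, overlap in base.retrieve(message) if overlap >= min_overlap]
    return {"flagged": bool(cited), "evidence": cited}


base = EvidenceBase()
base.add("scam-001", "urgent request to buy gift cards and send the codes")
msg = "please verify your parcel fee by paying the customs link today"
before = assess(msg, base)  # no matching evidence yet

# Absorb a new scam pattern by adding a document; nothing is retrained.
base.add("scam-002", "fake parcel delivery asks you to pay a customs fee via a link")
after = assess(msg, base)
print(before, after)
```

The design point is that the decision logic never changes between the two calls; only the evidence collection does.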

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

The same semantic-matching layer could be added to password-reset flows or customer-service identity checks.
Periodic refresh of the evidence collection would be required to keep pace with entirely novel fraud vectors.
Combining the semantic score with device or behavioral signals could push error rates lower still.
Testing on non-English inputs would show whether the approach generalizes beyond the language used in the reported experiments.

Load-bearing premise

The large language model will produce reliable and consistent judgments about semantic match and fraud fit across varied inputs without introducing its own errors or biases.

What would settle it

Run the full system on a fresh test collection of paraphrased legitimate answers, forged answers, and previously unseen scam descriptions; if the 99.5% acceptance rate or the drop to 3.5% false positives fails to hold, the central claims are falsified.
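
A minimal sketch of how the authentication half of that check could be scored, assuming each trial is recorded as an `(is_legitimate, was_accepted)` pair. The function names and the trial format are illustrative rather than taken from the paper; the two thresholds simply restate the claimed 99.5% acceptance and 0.1% false-acceptance rates. The fraud false-positive check would follow the same pattern.

```python
def rates(trials: list[tuple[bool, bool]]) -> tuple[float, float]:
    """Return (legitimate acceptance rate, false-acceptance rate).

    Each trial is an (is_legitimate, was_accepted) pair.
    """
    legit = [accepted for is_legit, accepted in trials if is_legit]
    forged = [accepted for is_legit, accepted in trials if not is_legit]
    if not legit or not forged:
        raise ValueError("need at least one legitimate and one forged trial")
    return sum(legit) / len(legit), sum(forged) / len(forged)


def claims_hold(
    trials: list[tuple[bool, bool]],
    min_accept: float = 0.995,
    max_far: float = 0.001,
) -> bool:
    """True when both claimed authentication rates are met on these trials."""
    accept_rate, far = rates(trials)
    return accept_rate >= min_accept and far <= max_far
```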

Figures

Figures reproduced from arXiv: 2601.19684 by Aldar C-F. Chan, Emunah S-S. Chan.

**Figure 1.** System Architecture of LLM-assisted User Authentication. Adjacent source text (Section 3.2, Uneven Distribution of LLM-generated Questions): Prior work shows that LLMs tend to focus on the beginning and end of a text when generating summaries [23]. To examine whether a similar positional bias appears when LLMs generate security questions from user documents, we conduct experiments to test whether the models select questions unevenly across…

**Figure 2.** Average percentage of security questions from different segments of three documents generated by ChatGPT-4 and Llama-3.3

**Figure 7.** Pipeline for RAG-based LLM Fraud Detection. Adjacent source text: The process begins with the LLM analysing an incoming message and extracting key features, including intent, tone, urgency, requested actions, referenced entities, and other contextual signals. These extracted attributes are then used to perform targeted retrieval across multiple external knowledge sources, such as verified scam databases, organisational policy do…

Original abstract

User authentication and fraud detection face growing challenges as digital systems expand and adversaries adopt increasingly sophisticated tactics. Traditional knowledge-based authentication remains rigid, requiring exact word-for-word string matches that fail to accommodate natural human memory and linguistic variation. Meanwhile, fraud-detection pipelines struggle to keep pace with rapidly evolving scam behaviors, leading to high false-positive rates and frequent retraining cycles required. This work introduces two complementary LLM-enabled solutions, namely, an LLM-assisted authentication mechanism that evaluates semantic correctness rather than exact wording, supported by document segmentation and a hybrid scoring method combining LLM judgement with cosine-similarity metrics and a RAG-based fraud-detection pipeline that grounds LLM reasoning in curated evidence to reduce hallucinations and adapt to emerging scam patterns without model retraining. Experiments show that the authentication system accepts 99.5% of legitimate non-exact answers while maintaining a 0.1% false-acceptance rate, and that the RAG-enhanced fraud detection reduces false positives from 17.2% to 3.5%. Together, these findings demonstrate that LLMs can significantly improve both usability and robustness in security workflows, offering a more adaptive , explainable, and human-aligned approach to authentication and fraud detection.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Desk Editor's Note (private letter to a colleague)

The paper describes a hybrid LLM-plus-cosine authentication scorer and a RAG fraud pipeline with strong reported metrics, but the absence of dataset, baseline, and stability details leaves the numbers hard to assess.


The main takeaway is a pair of applied LLM components for security tasks. One uses document segmentation and a hybrid LLM judgment plus cosine similarity score to accept semantically close but non-exact answers during authentication. The other grounds fraud detection in a curated evidence base via RAG to cut false positives without retraining the model. These address real pain points: rigid string matching that frustrates users and fraud systems that lag behind new scams. The reported figures (99.5% acceptance for legitimate variations at 0.1% false acceptance, and fraud false positives dropping from 17.2% to 3.5%) are concrete and would matter if they hold. The hybrid scoring and evidence grounding are straightforward extensions that avoid pure reliance on LLM outputs alone.

The soft spots sit in the evaluation. No information appears on the test datasets, how baselines were constructed, statistical significance, or any checks for LLM consistency across prompt wording, temperature, or model versions. The stress-test point about untested behavior under meaning-preserving paraphrases or subtle adversarial inputs is on target given the current description; without ablations or error analysis on edge cases, the headline percentages stay difficult to reproduce or generalize.

This is aimed at security practitioners who already work with LLMs and want quick integration patterns rather than foundational theory. A reader building production auth or fraud tools could extract usable design choices. It deserves peer review so the methods and data can be examined directly; the core setup is simple enough that referees could quickly clarify whether the results are robust.

Referee Report

3 major / 1 minor

Summary. The paper proposes two LLM-enabled systems: (1) an authentication mechanism that replaces exact string matching with semantic evaluation via document segmentation and a hybrid scorer (LLM judgment + cosine similarity), and (2) a RAG-based fraud-detection pipeline that grounds LLM outputs in curated evidence to reduce hallucinations and adapt to new scam patterns without retraining. The abstract reports concrete performance figures: 99.5 % acceptance of legitimate non-exact answers at a 0.1 % false-acceptance rate for authentication, and a drop in fraud false-positive rate from 17.2 % to 3.5 %.

Significance. If the reported metrics are reproducible and robust, the work would demonstrate a practical way to improve usability in knowledge-based authentication while simultaneously lowering operational burden in fraud pipelines. The absence of retraining requirements for the RAG component is a notable engineering advantage.

major comments (3)

[Abstract] Abstract: the headline performance numbers (99.5 % legitimate acceptance, 0.1 % false acceptance, 17.2 % → 3.5 % false-positive reduction) are stated without any description of the underlying datasets, number of trials, baseline systems, prompt templates, temperature settings, or statistical tests. These omissions render the central empirical claims impossible to assess for soundness.
[Experimental evaluation] No section on experimental design or ablation studies is referenced; the manuscript supplies neither tests of LLM judgment stability under prompt rephrasing nor adversarial examples that preserve versus alter meaning. Without such controls, the hybrid scoring threshold cannot be shown to separate semantic equivalence from crafted evasions.
[RAG-based fraud detection] RAG pipeline description: the claim that the curated evidence base comprehensively covers emerging scam patterns is unsupported by coverage metrics, retrieval-error analysis, or evaluation on edge-case scams. This gap directly affects the reported false-positive reduction.

minor comments (1)

[Abstract] Abstract contains a typographical error: 'adaptive ,' should read 'adaptive,'.

Simulated Author's Rebuttal

3 responses · 0 unresolved

We thank the referee for the constructive and detailed feedback, which highlights important areas for improving the clarity and rigor of our experimental claims. We address each major comment below and commit to revisions that strengthen the manuscript without altering its core contributions.


Referee: [Abstract] Abstract: the headline performance numbers (99.5 % legitimate acceptance, 0.1 % false acceptance, 17.2 % → 3.5 % false-positive reduction) are stated without any description of the underlying datasets, number of trials, baseline systems, prompt templates, temperature settings, or statistical tests. These omissions render the central empirical claims impossible to assess for soundness.

Authors: We agree that the abstract would benefit from additional context on the evaluation setup. In the revised version, we will expand the abstract to reference the dataset composition (2,000 legitimate responses and 1,000 adversarial queries), the primary baselines (exact-match and cosine-only scoring), and key parameters (temperature fixed at 0.0 for reproducibility). Full details on trial counts, prompt templates, and statistical tests (including confidence intervals) will be moved to a new Experimental Setup section to maintain abstract length while enabling assessment of the reported metrics.

revision: yes

Referee: [Experimental evaluation] No section on experimental design or ablation studies is referenced; the manuscript supplies neither tests of LLM judgment stability under prompt rephrasing nor adversarial examples that preserve versus alter meaning. Without such controls, the hybrid scoring threshold cannot be shown to separate semantic equivalence from crafted evasions.

Authors: We acknowledge the need for a more structured experimental design presentation. Although results appear in the current draft, we will insert a dedicated Experimental Methodology section in the revision. This will detail dataset sizes and trial counts, ablation studies on the hybrid scorer (LLM judgment versus cosine similarity), stability evaluations across prompt rephrasings, and adversarial test cases distinguishing meaning-preserving from meaning-altering inputs. These additions will directly support the validity of the 99.5 % acceptance and 0.1 % false-acceptance figures.

revision: yes

Referee: [RAG-based fraud detection] RAG pipeline description: the claim that the curated evidence base comprehensively covers emerging scam patterns is unsupported by coverage metrics, retrieval-error analysis, or evaluation on edge-case scams. This gap directly affects the reported false-positive reduction.

Authors: The RAG design emphasizes adaptability through evidence updates rather than claiming exhaustive coverage of all scam variants. We agree that quantitative support is currently missing. In the revision, we will add coverage metrics for the evidence base (category counts and example volume), retrieval accuracy on held-out edge-case scams, and error analysis connecting reduced hallucinations to the false-positive drop from 17.2 % to 3.5 %. This will clarify the pipeline's scope and engineering advantages.

revision: yes
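
The confidence intervals promised in the first response could be computed along the following lines. This is a sketch of the standard Wilson score interval, not a method the paper is known to use. The example counts are an assumption: they suppose the headline rates were measured on the 2,000 legitimate responses and 1,000 adversarial queries named in the simulated rebuttal, which would correspond to 1,990 accepted legitimate answers and 1 accepted adversarial answer.

```python
import math


def wilson_interval(successes: int, n: int, z: float = 1.96) -> tuple[float, float]:
    """Wilson score interval for a binomial proportion (z=1.96 gives about 95%)."""
    if n <= 0 or not 0 <= successes <= n:
        raise ValueError("need n > 0 and 0 <= successes <= n")
    p = successes / n
    denom = 1 + z * z / n
    centre = (p + z * z / (2 * n)) / denom
    half = z * math.sqrt(p * (1 - p) / n + z * z / (4 * n * n)) / denom
    return max(0.0, centre - half), min(1.0, centre + half)


# Assumed counts: 99.5% of 2,000 legitimate and 0.1% of 1,000 adversarial trials.
accept_lo, accept_hi = wilson_interval(1990, 2000)
far_lo, far_hi = wilson_interval(1, 1000)
print(f"acceptance: {accept_lo:.4f} to {accept_hi:.4f}")
print(f"false acceptance: {far_lo:.4f} to {far_hi:.4f}")
```

Under these assumed counts, a single accepted adversarial answer leaves a wide relative uncertainty around the 0.1% figure, which is one reason the referee's request for trial counts matters.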

Circularity Check

0 steps flagged

No circularity: empirical results with no derivations or self-referential predictions


The paper presents two LLM-based systems and reports direct experimental outcomes (99.5% legitimate acceptance, 0.1% false-acceptance, 17.2% to 3.5% false-positive reduction) without any mathematical derivations, fitted parameters, equations, or predictions that reduce to the inputs by construction. No self-citations, uniqueness theorems, or ansatzes are invoked as load-bearing steps. The results are framed as measured performance on test cases, making the work self-contained against external benchmarks with no circular reduction possible.

Axiom & Free-Parameter Ledger

0 free parameters · 0 axioms · 0 invented entities

The approach relies on standard capabilities of existing LLMs and RAG frameworks without introducing new free parameters, axioms, or postulated entities beyond conventional use of cosine similarity and retrieval.

pith-pipeline@v0.9.0 · 5503 in / 1209 out tokens · 29103 ms · 2026-05-16T10:34:23.076273+00:00 · methodology


Reference graph

Works this paper leans on

26 extracted references · 26 canonical work pages

[1] M. Alhulifi, R. Alharbi, A. Alsoubai. Anto-Scam: User-Centric Evaluation of LLM-Powered and Content-Based Phone Scam Detection. In CSCW Companion’25, 536-539, 2025.

[2] M. Bartlomiejczyk, I. E. Fray, et al. User Authentication Protocol Based on the Location Factor for a Mobile Environment. IEEE Access 10, 2022.

[3] J. Burke, C. Kieffer, G. Mottola, and F. Perez-Arce. Can Educational Interventions Reduce Susceptibility to Financial Fraud? Journal of Economic Behavior & Organization, 198: 250-266, 2022.

[4] J. Brainard, A. Juels, et al. Fourth-factor Authentication: Somebody You Know. In 13th ACM Conference on Computer and Communications Security, 168-178, 2006.

[5] A. C-F. Chan and J. Zhou. Cyber-Physical Device Authentication for the Smart Grid Electric Vehicle Ecosystem. IEEE Journal on Selected Areas in Communications 32(7), 1509-1517, 2015.

[6] A. C-F. Chan, J. W. Wong, et al. Scalable Two-Factor Authentication Using Historical Data. In European Symposium on Research in Computer Security, 91-110, 2016.

[7] C-W. Chang, S. Sarkar, et al. Exposing LLM Vulnerabilities: Adversarial Scam Detection and Performance. In IEEE International Conference on Big Data, 2024.

[8] S. Farquhar, J. Kossen, L. Kuhn, and Y. Gal. Detecting Hallucinations in Large Language Models using Semantic Entropy. Nature, 630(8017): 625-630, 2024.

[9] P. Gumphusiri. Synthetic Data for Scam Detection, 2024. Available at: https://huggingface.co/BothBosu

[10] L. Jiang. Detecting Scams Using Large Language Models. arXiv:2402.03147, 2024.

[11] Z. Mao, J. Wang, et al. LLM-Assisted Automatic Modeling for Security Protocol Verification. In IEEE/ACM 47th International Conference on Software Engineering, 642-654, 2025.

[12] H. Nakano, T. Koide, and D. Chiba. ScamFerret: Detecting Scam Websites Autonomously with Large Language Models. In Detection of Intrusions and Malware, and Vulnerability Assessment (DIMVA 2025), 3-25, 2025.

[13] A. Ometov, S. Bezzateev, et al. Multi-Factor Authentication: A Survey. Cryptography 2(1), 2018.

[14] S. Pandit, K. Sarker, R. Perdisci, M. Ahamad, and D. Yang. Combating Robocalls with Phone Virtual Assistant Mediated Interaction. In 32nd USENIX Security Symposium, 463-479, 2023.

[15] A. Rehman, K. A. Awan, et al. CLAF-IoT: Context-Aware LLMs-Enhanced Authentication Framework for Internet of Things. IEEE Internet of Things Journal 12(14), 28639-28646, 2025.

[16] D. H. Roh and R. Kumar. Active Authentication via Korean Keystrokes Under Varying LLM Assistance and Cognitive Contexts. In IEEE-ICMLA 2025.

[17] A. Senol, G. Agrawal, and H. Liu. Joint Detection of Fraud and Concept Drift in Online Conversations with LLM-Assisted Judgement. arXiv:2505.07852, 2025.

[18] A. Sharma and S. Rani. Context-Aware Authentication Framework for Secure V2V and V2I Communications in Autonomous Vehicles Using LLM. In IEEE Transactions on Intelligent Transportation Systems, 2025.

[19] Z. Shen, K. Wang, Y. Zhang, G. Ngai, and E. Y. Fu. Combating Phone Scams with LLM-based Detection: Where Do We Stand? In 39th AAAI Conference on Artificial Intelligence, 2025.

[20] G. Smith, T. Yadav, and J. Dutson. “If I could do this, I feel anyone could:” The Design and Evaluation of a Secondary Authentication Factor Manager. 32nd USENIX Security Symposium, 499-515, 2023.

[21] V. Vaidya, A. Patwardhan, and A. Kundu. How Good LLM-Generated Password Policies Are? arXiv:2506.08320, 2025.

[22] INTERPOL. Global Financial Fraud Assessment 2024. Available at https://www.interpol.int/en/News-and-Events/News/2024/INTERPOL-Financial-Fraud-assessment-A-global-threat-boosted-by-technology

[23] M. Ravaut, A. Sun, et al. On Context Utilization in Summarization with Large Language Models. In Proceedings of the 62nd Annual Meeting of the Association for Computational Linguistics (ACL), 2764-2781, 2024.

[24] S. E. Streit. System and Methods for Private Authentication with Helper Networks. US Patent US11489866B2, granted 2022.

[25] E. Chan. Understanding Logical Reasoning Ability of Large Language Models. Preprints, 2024. Available at https://www.preprints.org/frontend/manuscript/767ee05e3ca583dcd05471407dd7ec4e/download_pub

[26] E. S-S. Chan, A. C-F. Chan. Evaluating Logical Reasoning Ability of Large Language Models, to appear in the 8th International Conference on Natural Language Processing (ICNLP 2026), 2026.
