pith. sign in

arxiv: 2601.19684 · v3 · submitted 2026-01-27 · 💻 cs.CR

LLM-Assisted Authentication and Fraud Detection

Pith reviewed 2026-05-16 10:34 UTC · model grok-4.3

classification 💻 cs.CR
keywords LLM authenticationsemantic matchingRAG fraud detectionfalse positive reductionknowledge-based authenticationhybrid scoringhallucination mitigation
0
0 comments X

The pith

LLM semantic checks accept 99.5% of legitimate non-exact answers while holding false acceptance to 0.1%.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper shows how large language models can judge whether a user's answer to a security question carries the right meaning instead of demanding exact wording. It combines the model's judgment with cosine similarity to produce a hybrid score. For fraud detection the same models are anchored to a fixed collection of evidence documents so their reasoning stays tied to known patterns. Experiments report that the authentication side tolerates natural wording differences yet still blocks impostors, while the fraud side cuts false positives from 17.2% to 3.5% without retraining the model when new scams appear.

Core claim

The central claim is that an LLM-assisted authentication mechanism that evaluates semantic correctness rather than exact wording, supported by document segmentation and a hybrid scoring method combining LLM judgement with cosine-similarity metrics, accepts 99.5% of legitimate non-exact answers while maintaining a 0.1% false-acceptance rate, and that a RAG-based fraud-detection pipeline that grounds LLM reasoning in curated evidence reduces false positives from 17.2% to 3.5% and adapts to emerging scam patterns without model retraining.

What carries the argument

Hybrid LLM judgement plus cosine-similarity scoring for semantic authentication, paired with a retrieval-augmented generation pipeline that grounds fraud decisions in a fixed evidence base.

If this is right

  • Users can answer security questions in their own words without being locked out.
  • Fraud systems can absorb new scam tactics by adding documents to the evidence base rather than retraining.
  • Each fraud decision can cite the specific evidence documents that supported it.
  • False-positive burden on legitimate users drops while security thresholds remain high.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

  • The same semantic-matching layer could be added to password-reset flows or customer-service identity checks.
  • Periodic refresh of the evidence collection would be required to keep pace with entirely novel fraud vectors.
  • Combining the semantic score with device or behavioral signals could push error rates lower still.
  • Testing on non-English inputs would show whether the approach generalizes beyond the language used in the reported experiments.

Load-bearing premise

The large language model will produce reliable and consistent judgments about semantic match and fraud fit across varied inputs without introducing its own errors or biases.

What would settle it

Run the full system on a fresh test collection of paraphrased legitimate answers, forged answers, and previously unseen scam descriptions; if the 99.5% acceptance rate or the drop to 3.5% false positives fails to hold, the central claims are falsified.

Figures

Figures reproduced from arXiv: 2601.19684 by Aldar C-F. Chan, Emunah S-S. Chan.

Figure 1
Figure 1. Figure 1: System Architecture of LLM-assisted User Authentication 3.2 Uneven Distribution of LLM-generated Questions Prior work shows that LLMs tend to focus on the beginning and end of a text when generating summaries [23]. To examine whether a similar positional bias appears when LLMs generate security questions from user documents, we conduct experiments to test whether the models select questions unevenly across… view at source ↗
Figure 2
Figure 2. Figure 2: Average percentage of security questions from different segments of three documents generated by ChatGPT-4 and Llama-3.3 3.3 Detailed Implementation [PITH_FULL_IMAGE:figures/full_fig_p009_2.png] view at source ↗
Figure 7
Figure 7. Figure 7: Pipeline for RAG-based LLM Fraud Detection The process begins with the LLM analysing an incoming message and extracting key features, including intent, tone, urgency, requested actions, referenced entities, and other contextual signals. These extracted attributes are then used to perform targeted retrieval across multiple external knowledge sources, such as verified scam databases, organisational policy do… view at source ↗
read the original abstract

User authentication and fraud detection face growing challenges as digital systems expand and adversaries adopt increasingly sophisticated tactics. Traditional knowledge-based authentication remains rigid, requiring exact word-for-word string matches that fail to accommodate natural human memory and linguistic variation. Meanwhile, fraud-detection pipelines struggle to keep pace with rapidly evolving scam behaviors, leading to high false-positive rates and frequent retraining cycles required. This work introduces two complementary LLM-enabled solutions, namely, an LLM-assisted authentication mechanism that evaluates semantic correctness rather than exact wording, supported by document segmentation and a hybrid scoring method combining LLM judgement with cosine-similarity metrics and a RAG-based fraud-detection pipeline that grounds LLM reasoning in curated evidence to reduce hallucinations and adapt to emerging scam patterns without model retraining. Experiments show that the authentication system accepts 99.5% of legitimate non-exact answers while maintaining a 0.1% false-acceptance rate, and that the RAG-enhanced fraud detection reduces false positives from 17.2% to 3.5%. Together, these findings demonstrate that LLMs can significantly improve both usability and robustness in security workflows, offering a more adaptive , explainable, and human-aligned approach to authentication and fraud detection.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Referee Report

3 major / 1 minor

Summary. The paper proposes two LLM-enabled systems: (1) an authentication mechanism that replaces exact string matching with semantic evaluation via document segmentation and a hybrid scorer (LLM judgment + cosine similarity), and (2) a RAG-based fraud-detection pipeline that grounds LLM outputs in curated evidence to reduce hallucinations and adapt to new scam patterns without retraining. The abstract reports concrete performance figures: 99.5 % acceptance of legitimate non-exact answers at a 0.1 % false-acceptance rate for authentication, and a drop in fraud false-positive rate from 17.2 % to 3.5 %.

Significance. If the reported metrics are reproducible and robust, the work would demonstrate a practical way to improve usability in knowledge-based authentication while simultaneously lowering operational burden in fraud pipelines. The absence of retraining requirements for the RAG component is a notable engineering advantage.

major comments (3)
  1. [Abstract] Abstract: the headline performance numbers (99.5 % legitimate acceptance, 0.1 % false acceptance, 17.2 % → 3.5 % false-positive reduction) are stated without any description of the underlying datasets, number of trials, baseline systems, prompt templates, temperature settings, or statistical tests. These omissions render the central empirical claims impossible to assess for soundness.
  2. [Experimental evaluation] No section on experimental design or ablation studies is referenced; the manuscript supplies neither tests of LLM judgment stability under prompt rephrasing nor adversarial examples that preserve versus alter meaning. Without such controls, the hybrid scoring threshold cannot be shown to separate semantic equivalence from crafted evasions.
  3. [RAG-based fraud detection] RAG pipeline description: the claim that the curated evidence base comprehensively covers emerging scam patterns is unsupported by coverage metrics, retrieval-error analysis, or evaluation on edge-case scams. This gap directly affects the reported false-positive reduction.
minor comments (1)
  1. [Abstract] Abstract contains a typographical error: 'adaptive ,' should read 'adaptive,'.

Simulated Author's Rebuttal

3 responses · 0 unresolved

We thank the referee for the constructive and detailed feedback, which highlights important areas for improving the clarity and rigor of our experimental claims. We address each major comment below and commit to revisions that strengthen the manuscript without altering its core contributions.

read point-by-point responses
  1. Referee: [Abstract] Abstract: the headline performance numbers (99.5 % legitimate acceptance, 0.1 % false acceptance, 17.2 % → 3.5 % false-positive reduction) are stated without any description of the underlying datasets, number of trials, baseline systems, prompt templates, temperature settings, or statistical tests. These omissions render the central empirical claims impossible to assess for soundness.

    Authors: We agree that the abstract would benefit from additional context on the evaluation setup. In the revised version, we will expand the abstract to reference the dataset composition (2,000 legitimate responses and 1,000 adversarial queries), the primary baselines (exact-match and cosine-only scoring), and key parameters (temperature fixed at 0.0 for reproducibility). Full details on trial counts, prompt templates, and statistical tests (including confidence intervals) will be moved to a new Experimental Setup section to maintain abstract length while enabling assessment of the reported metrics. revision: yes

  2. Referee: [Experimental evaluation] No section on experimental design or ablation studies is referenced; the manuscript supplies neither tests of LLM judgment stability under prompt rephrasing nor adversarial examples that preserve versus alter meaning. Without such controls, the hybrid scoring threshold cannot be shown to separate semantic equivalence from crafted evasions.

    Authors: We acknowledge the need for a more structured experimental design presentation. Although results appear in the current draft, we will insert a dedicated Experimental Methodology section in the revision. This will detail dataset sizes and trial counts, ablation studies on the hybrid scorer (LLM judgment versus cosine similarity), stability evaluations across prompt rephrasings, and adversarial test cases distinguishing meaning-preserving from meaning-altering inputs. These additions will directly support the validity of the 99.5 % acceptance and 0.1 % false-acceptance figures. revision: yes

  3. Referee: [RAG-based fraud detection] RAG pipeline description: the claim that the curated evidence base comprehensively covers emerging scam patterns is unsupported by coverage metrics, retrieval-error analysis, or evaluation on edge-case scams. This gap directly affects the reported false-positive reduction.

    Authors: The RAG design emphasizes adaptability through evidence updates rather than claiming exhaustive coverage of all scam variants. We agree that quantitative support is currently missing. In the revision, we will add coverage metrics for the evidence base (category counts and example volume), retrieval accuracy on held-out edge-case scams, and error analysis connecting reduced hallucinations to the false-positive drop from 17.2 % to 3.5 %. This will clarify the pipeline's scope and engineering advantages. revision: yes

Circularity Check

0 steps flagged

No circularity: empirical results with no derivations or self-referential predictions

full rationale

The paper presents two LLM-based systems and reports direct experimental outcomes (99.5% legitimate acceptance, 0.1% false-acceptance, 17.2% to 3.5% false-positive reduction) without any mathematical derivations, fitted parameters, equations, or predictions that reduce to the inputs by construction. No self-citations, uniqueness theorems, or ansatzes are invoked as load-bearing steps. The results are framed as measured performance on test cases, making the work self-contained against external benchmarks with no circular reduction possible.

Axiom & Free-Parameter Ledger

0 free parameters · 0 axioms · 0 invented entities

The approach relies on standard capabilities of existing LLMs and RAG frameworks without introducing new free parameters, axioms, or postulated entities beyond conventional use of cosine similarity and retrieval.

pith-pipeline@v0.9.0 · 5503 in / 1209 out tokens · 29103 ms · 2026-05-16T10:34:23.076273+00:00 · methodology

discussion (0)

Sign in with ORCID, Apple, or X to comment. Anyone can read and Pith papers without signing in.

Reference graph

Works this paper leans on

26 extracted references · 26 canonical work pages

  1. [1]

    Alhulifi, R

    M. Alhulifi, R. Alharbi, A. Also ubai. Anto -Scam: User-Centric Evaluation of LLM - Powered and Content -Based Phone Scam Detection. In CSCW Companion’25, 536-539, 2025

  2. [2]

    Bartlomiejczyk, I

    M. Bartlomiejczyk, I. E. Fray, et. al. User Authentication Protocol Based o n the Location Factor for a Mobile Environment. IEEE Access 10, 2022

  3. [3]

    Burke, C

    J. Burke, C. Kieffer, G. Mottola, and F . Perez-Arce. Can Educational Interventions Reduce Susceptibility to Financial Fraud? Journal of Economic Behavior & Organization, 198: 250–266, 2022

  4. [4]

    Brainard, A

    J. Brainard, A. Juels, et. al. Fourth-factor Authentication: Somebody You Know. In 13th ACM Conference on Computer and Communications Security, 168-178, 2006

  5. [5]

    A. C-F . Chan, and J. Zhou. Cyber-Physical Device Authentication for the Smart Grid Electric Vehicle Ecosystem. IEEE Journal on Selected Areas in Communications 32(7), 1509-1517, 2015. 19

  6. [6]

    A. C -F . Chan, J. W. Wong, et. al . Scalable Two -Factor Authentication U sing Historical Data. In European Symposium on Research in Computer Security, p.91- 110, 2016

  7. [7]

    Chang, S

    C-W. Chang, S. Sarkar, et. al. Exposing LLM Vulnerabilities: Adversarial S cam Detection and Performance. In IEEE International Conference on Big Data, 2024

  8. [8]

    Farquhar, J

    S. Farquhar, J. Kossen, L. Kuhn, and Y . Gal. Detecting Hallucinations in Large Language Models using Semantic Entropy. Nature, 630(8017): 625–630, 2024

  9. [9]

    Gumphusiri

    P . Gumphusiri. Synthetic Data for Scam Detection, 2024. Available at: https://huggingface.co/BothBosu

  10. [10]

    L. Jiang. Detecting Scams Using Large Language Models. arXiv:2402.03147, 2024

  11. [11]

    Z. Mao, J. Wang, et. al. LLM-Assisted Automatic Mo deling for Security Protocol Verification. In IEEE/ACM 47th International Conference on Software Engineering, 642-654, 2025

  12. [12]

    Nakano, T

    H. Nakano, T. Koide, and D. Chiba. ScamFerret: Detecting Scam Websites Autonomously with Large Language Models. In Detection of Intrusions and Malware, and Vulnerability Assessment (DIMVA 2025), 3-25, 2025

  13. [13]

    Ometov, S

    A. Ometov, S. Bezzateev, e t. al. Multi-Factor Authentication: A Survey. Cryptography 2(1), 2018

  14. [14]

    Pandit, K

    S. Pandit, K. Sarker, R. Perdisci, M. Ahamad, and D. Yang. Combating Robocalls with Phone Virtual Assistant Mediated Interaction. In 32nd USENIX Security Symposium, 463–479, 2023

  15. [15]

    Rehman, K

    A. Rehman, K. A. Awan, et. al. CLAF-IoT: Context -Aware L LMs-Enhanced Authentication Framework for Internet of Things. IEEE Internet of Things Journal 12(14), 28639-28646, 2025

  16. [16]

    D. H. Roh, and R. Kumar. Active Authentication via Korean Keystrokes Under Varying LLM Assistance and Cognitive Contexts. In IEEE-ICMLA 2025

  17. [17]

    Senol, G

    A. Senol, G. Agrawal, and H. Liu. Joint Detection of Fraud and Concept Drift in Online Conversations with LLM-Assisted Judgement. arXiV:2505.07852, 2025

  18. [18]

    Sharma, and S

    A. Sharma, and S. Rani. Context-Aware Authentication Framework for Secure V2V and V2I Communications in Autonomous Vehicles Using LLM. In IEEE Transactions on Intelligent Transportation Systems, 2025

  19. [19]

    Z. Shen, K. Wang, Y . Zhang, G. Ngai, and E. Y . Fu. Combating Phone Scams with LLM-based Detection: Where Do We Stand? In 39 th AAAI Conference on Artificial Intelligence, 2025

  20. [20]

    If I could do this, I feel anyone could:

    G. Smith, T. Yadav, and J. Dutson. “If I could do this, I feel anyone could:” The Design and Evaluation of a Secondary Authentication Factor Manager. 32 nd USENIX Security Symposium, 499-515, 2023

  21. [21]

    Vaidya, A

    V . Vaidya, A. Patwardhan, and A. Kundu. How Good LLM -Generated Password Policies Are? ArXiV:2506.08320, 2025. 20

  22. [22]

    Global Financial Fraud Assessment 2024

    INTERPOL. Global Financial Fraud Assessment 2024. Available at https://www.interpol.int/en/News-and-Events/News/2024/INTERPOL-Financial- Fraud-assessment-A-global-threat-boosted-by-technology

  23. [23]

    Ravaut, A

    M. Ravaut, A . Sun, et. al. On Context Utilization in Summarization with Large Language Models. In Proceedings of the 62nd Annual Meeting of the Association for Computational Linguistics (ACL), 2764–2781, 2024

  24. [24]

    S. E. Streit. System and Methods for Private Authentication with Helper Networks. US Patent US11489866B2, granted 2022

  25. [25]

    E. Chan. Understanding Logical Reasoning Ability of Large Language Models. Preprints, 2024. Available at https://www.preprints.org/frontend/manuscript/767ee05e3ca583dcd05471407 dd7ec4e/download_pub

  26. [26]

    E. S-S. Chan, A. C-F . Chan. Evaluating Logical Reasoning Ability of Large Language Models, to appear in the 8th International Conference on Natural Language Processing (ICNLP 2026), 2026