LLM-Assisted Authentication and Fraud Detection
Pith reviewed 2026-05-16 10:34 UTC · model grok-4.3
The pith
LLM semantic checks accept 99.5% of legitimate non-exact answers while holding false acceptance to 0.1%.
A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.
Core claim
The central claim is that an LLM-assisted authentication mechanism that evaluates semantic correctness rather than exact wording, supported by document segmentation and a hybrid scoring method combining LLM judgement with cosine-similarity metrics, accepts 99.5% of legitimate non-exact answers while maintaining a 0.1% false-acceptance rate, and that a RAG-based fraud-detection pipeline that grounds LLM reasoning in curated evidence reduces false positives from 17.2% to 3.5% and adapts to emerging scam patterns without model retraining.
What carries the argument
Hybrid LLM judgement plus cosine-similarity scoring for semantic authentication, paired with a retrieval-augmented generation pipeline that grounds fraud decisions in a fixed evidence base.
If this is right
- Users can answer security questions in their own words without being locked out.
- Fraud systems can absorb new scam tactics by adding documents to the evidence base rather than retraining.
- Each fraud decision can cite the specific evidence documents that supported it.
- False-positive burden on legitimate users drops while security thresholds remain high.
Where Pith is reading between the lines
- The same semantic-matching layer could be added to password-reset flows or customer-service identity checks.
- Periodic refresh of the evidence collection would be required to keep pace with entirely novel fraud vectors.
- Combining the semantic score with device or behavioral signals could push error rates lower still.
- Testing on non-English inputs would show whether the approach generalizes beyond the language used in the reported experiments.
Load-bearing premise
The large language model will produce reliable and consistent judgments about semantic match and fraud fit across varied inputs without introducing its own errors or biases.
What would settle it
Run the full system on a fresh test collection of paraphrased legitimate answers, forged answers, and previously unseen scam descriptions; if the 99.5% acceptance rate or the drop to 3.5% false positives fails to hold, the central claims are falsified.
Figures
read the original abstract
User authentication and fraud detection face growing challenges as digital systems expand and adversaries adopt increasingly sophisticated tactics. Traditional knowledge-based authentication remains rigid, requiring exact word-for-word string matches that fail to accommodate natural human memory and linguistic variation. Meanwhile, fraud-detection pipelines struggle to keep pace with rapidly evolving scam behaviors, leading to high false-positive rates and frequent retraining cycles required. This work introduces two complementary LLM-enabled solutions, namely, an LLM-assisted authentication mechanism that evaluates semantic correctness rather than exact wording, supported by document segmentation and a hybrid scoring method combining LLM judgement with cosine-similarity metrics and a RAG-based fraud-detection pipeline that grounds LLM reasoning in curated evidence to reduce hallucinations and adapt to emerging scam patterns without model retraining. Experiments show that the authentication system accepts 99.5% of legitimate non-exact answers while maintaining a 0.1% false-acceptance rate, and that the RAG-enhanced fraud detection reduces false positives from 17.2% to 3.5%. Together, these findings demonstrate that LLMs can significantly improve both usability and robustness in security workflows, offering a more adaptive , explainable, and human-aligned approach to authentication and fraud detection.
Editorial analysis
A structured set of objections, weighed in public.
Referee Report
Summary. The paper proposes two LLM-enabled systems: (1) an authentication mechanism that replaces exact string matching with semantic evaluation via document segmentation and a hybrid scorer (LLM judgment + cosine similarity), and (2) a RAG-based fraud-detection pipeline that grounds LLM outputs in curated evidence to reduce hallucinations and adapt to new scam patterns without retraining. The abstract reports concrete performance figures: 99.5 % acceptance of legitimate non-exact answers at a 0.1 % false-acceptance rate for authentication, and a drop in fraud false-positive rate from 17.2 % to 3.5 %.
Significance. If the reported metrics are reproducible and robust, the work would demonstrate a practical way to improve usability in knowledge-based authentication while simultaneously lowering operational burden in fraud pipelines. The absence of retraining requirements for the RAG component is a notable engineering advantage.
major comments (3)
- [Abstract] Abstract: the headline performance numbers (99.5 % legitimate acceptance, 0.1 % false acceptance, 17.2 % → 3.5 % false-positive reduction) are stated without any description of the underlying datasets, number of trials, baseline systems, prompt templates, temperature settings, or statistical tests. These omissions render the central empirical claims impossible to assess for soundness.
- [Experimental evaluation] No section on experimental design or ablation studies is referenced; the manuscript supplies neither tests of LLM judgment stability under prompt rephrasing nor adversarial examples that preserve versus alter meaning. Without such controls, the hybrid scoring threshold cannot be shown to separate semantic equivalence from crafted evasions.
- [RAG-based fraud detection] RAG pipeline description: the claim that the curated evidence base comprehensively covers emerging scam patterns is unsupported by coverage metrics, retrieval-error analysis, or evaluation on edge-case scams. This gap directly affects the reported false-positive reduction.
minor comments (1)
- [Abstract] Abstract contains a typographical error: 'adaptive ,' should read 'adaptive,'.
Simulated Author's Rebuttal
We thank the referee for the constructive and detailed feedback, which highlights important areas for improving the clarity and rigor of our experimental claims. We address each major comment below and commit to revisions that strengthen the manuscript without altering its core contributions.
read point-by-point responses
-
Referee: [Abstract] Abstract: the headline performance numbers (99.5 % legitimate acceptance, 0.1 % false acceptance, 17.2 % → 3.5 % false-positive reduction) are stated without any description of the underlying datasets, number of trials, baseline systems, prompt templates, temperature settings, or statistical tests. These omissions render the central empirical claims impossible to assess for soundness.
Authors: We agree that the abstract would benefit from additional context on the evaluation setup. In the revised version, we will expand the abstract to reference the dataset composition (2,000 legitimate responses and 1,000 adversarial queries), the primary baselines (exact-match and cosine-only scoring), and key parameters (temperature fixed at 0.0 for reproducibility). Full details on trial counts, prompt templates, and statistical tests (including confidence intervals) will be moved to a new Experimental Setup section to maintain abstract length while enabling assessment of the reported metrics. revision: yes
-
Referee: [Experimental evaluation] No section on experimental design or ablation studies is referenced; the manuscript supplies neither tests of LLM judgment stability under prompt rephrasing nor adversarial examples that preserve versus alter meaning. Without such controls, the hybrid scoring threshold cannot be shown to separate semantic equivalence from crafted evasions.
Authors: We acknowledge the need for a more structured experimental design presentation. Although results appear in the current draft, we will insert a dedicated Experimental Methodology section in the revision. This will detail dataset sizes and trial counts, ablation studies on the hybrid scorer (LLM judgment versus cosine similarity), stability evaluations across prompt rephrasings, and adversarial test cases distinguishing meaning-preserving from meaning-altering inputs. These additions will directly support the validity of the 99.5 % acceptance and 0.1 % false-acceptance figures. revision: yes
-
Referee: [RAG-based fraud detection] RAG pipeline description: the claim that the curated evidence base comprehensively covers emerging scam patterns is unsupported by coverage metrics, retrieval-error analysis, or evaluation on edge-case scams. This gap directly affects the reported false-positive reduction.
Authors: The RAG design emphasizes adaptability through evidence updates rather than claiming exhaustive coverage of all scam variants. We agree that quantitative support is currently missing. In the revision, we will add coverage metrics for the evidence base (category counts and example volume), retrieval accuracy on held-out edge-case scams, and error analysis connecting reduced hallucinations to the false-positive drop from 17.2 % to 3.5 %. This will clarify the pipeline's scope and engineering advantages. revision: yes
Circularity Check
No circularity: empirical results with no derivations or self-referential predictions
full rationale
The paper presents two LLM-based systems and reports direct experimental outcomes (99.5% legitimate acceptance, 0.1% false-acceptance, 17.2% to 3.5% false-positive reduction) without any mathematical derivations, fitted parameters, equations, or predictions that reduce to the inputs by construction. No self-citations, uniqueness theorems, or ansatzes are invoked as load-bearing steps. The results are framed as measured performance on test cases, making the work self-contained against external benchmarks with no circular reduction possible.
Axiom & Free-Parameter Ledger
Reference graph
Works this paper leans on
-
[1]
M. Alhulifi, R. Alharbi, A. Also ubai. Anto -Scam: User-Centric Evaluation of LLM - Powered and Content -Based Phone Scam Detection. In CSCW Companion’25, 536-539, 2025
work page 2025
-
[2]
M. Bartlomiejczyk, I. E. Fray, et. al. User Authentication Protocol Based o n the Location Factor for a Mobile Environment. IEEE Access 10, 2022
work page 2022
- [3]
-
[4]
J. Brainard, A. Juels, et. al. Fourth-factor Authentication: Somebody You Know. In 13th ACM Conference on Computer and Communications Security, 168-178, 2006
work page 2006
-
[5]
A. C-F . Chan, and J. Zhou. Cyber-Physical Device Authentication for the Smart Grid Electric Vehicle Ecosystem. IEEE Journal on Selected Areas in Communications 32(7), 1509-1517, 2015. 19
work page 2015
-
[6]
A. C -F . Chan, J. W. Wong, et. al . Scalable Two -Factor Authentication U sing Historical Data. In European Symposium on Research in Computer Security, p.91- 110, 2016
work page 2016
- [7]
-
[8]
S. Farquhar, J. Kossen, L. Kuhn, and Y . Gal. Detecting Hallucinations in Large Language Models using Semantic Entropy. Nature, 630(8017): 625–630, 2024
work page 2024
-
[9]
P . Gumphusiri. Synthetic Data for Scam Detection, 2024. Available at: https://huggingface.co/BothBosu
work page 2024
- [10]
-
[11]
Z. Mao, J. Wang, et. al. LLM-Assisted Automatic Mo deling for Security Protocol Verification. In IEEE/ACM 47th International Conference on Software Engineering, 642-654, 2025
work page 2025
- [12]
- [13]
- [14]
- [15]
-
[16]
D. H. Roh, and R. Kumar. Active Authentication via Korean Keystrokes Under Varying LLM Assistance and Cognitive Contexts. In IEEE-ICMLA 2025
work page 2025
- [17]
-
[18]
A. Sharma, and S. Rani. Context-Aware Authentication Framework for Secure V2V and V2I Communications in Autonomous Vehicles Using LLM. In IEEE Transactions on Intelligent Transportation Systems, 2025
work page 2025
-
[19]
Z. Shen, K. Wang, Y . Zhang, G. Ngai, and E. Y . Fu. Combating Phone Scams with LLM-based Detection: Where Do We Stand? In 39 th AAAI Conference on Artificial Intelligence, 2025
work page 2025
-
[20]
If I could do this, I feel anyone could:
G. Smith, T. Yadav, and J. Dutson. “If I could do this, I feel anyone could:” The Design and Evaluation of a Secondary Authentication Factor Manager. 32 nd USENIX Security Symposium, 499-515, 2023
work page 2023
- [21]
-
[22]
Global Financial Fraud Assessment 2024
INTERPOL. Global Financial Fraud Assessment 2024. Available at https://www.interpol.int/en/News-and-Events/News/2024/INTERPOL-Financial- Fraud-assessment-A-global-threat-boosted-by-technology
work page 2024
- [23]
-
[24]
S. E. Streit. System and Methods for Private Authentication with Helper Networks. US Patent US11489866B2, granted 2022
work page 2022
-
[25]
E. Chan. Understanding Logical Reasoning Ability of Large Language Models. Preprints, 2024. Available at https://www.preprints.org/frontend/manuscript/767ee05e3ca583dcd05471407 dd7ec4e/download_pub
work page 2024
-
[26]
E. S-S. Chan, A. C-F . Chan. Evaluating Logical Reasoning Ability of Large Language Models, to appear in the 8th International Conference on Natural Language Processing (ICNLP 2026), 2026
work page 2026
discussion (0)
Sign in with ORCID, Apple, or X to comment. Anyone can read and Pith papers without signing in.