pith. machine review for the scientific record.

arxiv: 2602.21394 · v3 · submitted 2026-02-24 · 💻 cs.CR

Recognition: no theorem link

MemoPhishAgent: Memory-Augmented Multi-Modal LLM Agent for Phishing URL Detection

Authors on Pith: no claims yet

Pith reviewed 2026-05-15 19:32 UTC · model grok-4.3

classification 💻 cs.CR
keywords phishing detection · LLM agent · episodic memory · multi-modal reasoning · URL classification · cybersecurity · memory augmentation · tool orchestration

The pith

Memory-augmented LLM agent uses past reasoning trajectories to raise phishing URL recall by 13.6 to 20 percent.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper presents MemoPhishAgent as an LLM agent that augments multi-modal reasoning with episodic memory of earlier decision paths. Traditional phishing detectors rely on fixed lists or heuristics that quickly become obsolete, while prompt-only LLM pipelines fail to reuse reasoning across similar or evolving cases. By retrieving stored trajectories to guide tool selection and final classification, the agent improves recall on public benchmarks and on live URLs scraped from social media. The memory component alone contributes up to a 27 percent recall gain at no extra compute cost. Production deployment on 60,000 weekly URLs reaches 91.44 percent recall, showing the method works at scale.

Core claim

MemoPhishAgent (MPA) dynamically orchestrates phishing-specific tools and leverages episodic memories of past reasoning trajectories to guide decisions on recurring and novel threats. On two public datasets MPA outperforms three state-of-the-art baselines and improves recall by 13.6 percent. On a benchmark of real-world suspicious URLs crawled from five social media platforms it improves recall by 20 percent. Episodic memory contributes up to a 27 percent recall gain without additional computational overhead. The ablation study confirms the necessity of the agent-based approach over prompt-based baselines. In production MPA processes 60K targeted high-risk URLs weekly and achieves 91.44 percent recall.

What carries the argument

The episodic memory store that records and retrieves full reasoning trajectories to condition the current tool calls and final phishing verdict.
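The paper does not publish the store's internals in the text reviewed here, but the described behavior (record full trajectories, retrieve the most similar ones to condition the current decision) can be sketched minimally. All names below (`TrajectoryMemory`, `add`, `retrieve`) are hypothetical; the sketch assumes embedding-based retrieval with cosine similarity, as the rebuttal suggests.

```python
import numpy as np

class TrajectoryMemory:
    """Minimal episodic memory sketch: stores (embedding, trajectory) pairs
    and retrieves past reasoning trajectories most similar to a new case."""

    def __init__(self):
        self.embeddings = []    # one embedding vector per stored trajectory
        self.trajectories = []  # recorded tool calls and final verdict

    def add(self, embedding, trajectory):
        self.embeddings.append(np.asarray(embedding, dtype=float))
        self.trajectories.append(trajectory)

    def retrieve(self, query, k=3):
        """Return up to k (similarity, trajectory) pairs, best first."""
        if not self.embeddings:
            return []
        q = np.asarray(query, dtype=float)
        q = q / np.linalg.norm(q)
        mat = np.stack([e / np.linalg.norm(e) for e in self.embeddings])
        sims = mat @ q                      # cosine similarity per memory
        order = np.argsort(sims)[::-1][:k]  # highest similarity first
        return [(float(sims[i]), self.trajectories[i]) for i in order]
```

The retrieved trajectories would then be placed in the agent's context alongside the current URL, screenshot, and tool outputs, which is where the paper's recall gains are claimed to originate.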

Load-bearing premise

Memories of earlier reasoning trajectories will remain relevant and non-misleading when the agent encounters new or rapidly changing phishing tactics.

What would settle it

Measure recall on a held-out collection of phishing URLs whose attack patterns and visual features differ from every stored trajectory; if recall falls below the non-memory baseline the central claim is falsified.
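The proposed test reduces to a single comparison: recall with memory must not fall below recall without memory on the distribution-shifted set. A minimal sketch of that check, with hypothetical function names:

```python
def recall(preds, labels):
    """Recall on the phishing (positive) class: TP / (TP + FN)."""
    tp = sum(p == 1 and y == 1 for p, y in zip(preds, labels))
    fn = sum(p == 0 and y == 1 for p, y in zip(preds, labels))
    return tp / (tp + fn) if (tp + fn) else 0.0

def memory_helps_on_novel(mem_preds, nomem_preds, labels):
    """Falsification check: on URLs whose attack patterns differ from every
    stored trajectory, the memory-augmented agent must match or beat the
    non-memory baseline; a drop below it falsifies the central claim."""
    return recall(mem_preds, labels) >= recall(nomem_preds, labels)
```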

Original abstract

Traditional phishing website detection relies on static heuristics or reference lists, which lag behind rapidly evolving attacks. While recent systems incorporate large language models (LLMs), they are still prompt-based, deterministic pipelines that underutilize reasoning capability. We present MemoPhishAgent (MPA), a memory-augmented multi-modal LLM agent that dynamically orchestrates phishing-specific tools and leverages episodic memories of past reasoning trajectories to guide decisions on recurring and novel threats. On two public datasets, MPA outperforms three state-of-the-art (SOTA) baselines, improving recall by 13.6%. To better reflect realistic, user-facing phishing detection performance, we further evaluate MPA on a benchmark of real-world suspicious URLs actively crawled from five social media platforms, where it improves recall by 20%. Detailed analysis shows episodic memory contributes up to 27% recall gain without introducing additional computational overhead. The ablation study confirms the necessity of the agent-based approach compared to prompt-based baselines and validates the effectiveness of our tool design. Finally, MPA is deployed in production, processing 60K targeted high-risk URLs weekly, and achieving 91.44% recall, providing proactive protection for millions of customers. Together, our results show that combining multi-modal reasoning with episodic memory yields robust phishing detection in realistic user-exposure settings. Our implementation is available at https://github.com/XuanChen-xc/MemoPhishAgent.git.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, and this is the friction.

Referee Report

2 major / 2 minor

Summary. The paper introduces MemoPhishAgent (MPA), a memory-augmented multi-modal LLM agent for phishing URL detection that dynamically orchestrates phishing-specific tools and leverages episodic memories of past reasoning trajectories. It reports recall improvements of 13.6% over three SOTA baselines on two public datasets, 20% on a real-world benchmark of suspicious URLs crawled from social media platforms, up to 27% gain attributable to memory, and 91.44% recall in production on 60K high-risk URLs weekly, with an ablation study validating the agent-based design.

Significance. If the results hold under scrutiny of baseline implementations and statistical tests, the work provides evidence that episodic memory augmentation can improve robustness of LLM agents for adaptive security tasks in realistic user-exposure settings, with demonstrated production-scale applicability and no added computational overhead.

major comments (2)
  1. [Abstract] Abstract: the claim that episodic memory 'guides decisions on recurring and novel threats' and contributes up to 27% recall gain lacks any described mechanism for detecting distribution shift, pruning stale memories, or overriding memory-based decisions on novel patterns; this assumption is load-bearing for the central claim that memory yields robust detection on evolving threats.
  2. [Ablation study] Ablation study: the validation of the agent-based approach versus prompt-based baselines does not include explicit tests on novel tactics or distribution-shifted samples, leaving open whether memory augmentation remains beneficial or introduces errors when past trajectories are irrelevant.
minor comments (2)
  1. The GitHub repository link should include a README with exact baseline re-implementation details and statistical significance tests to support the reported recall gains.
  2. Production metrics section would benefit from clearer description of how ground-truth labels are obtained for the 60K weekly URLs to allow independent verification of the 91.44% recall figure.

Simulated Author's Rebuttal

2 responses · 0 unresolved

We thank the referee for the constructive feedback. We address each major comment point by point below, providing clarifications from the manuscript and indicating where we will revise to strengthen the presentation.

Point-by-point responses
  1. Referee: [Abstract] Abstract: the claim that episodic memory 'guides decisions on recurring and novel threats' and contributes up to 27% recall gain lacks any described mechanism for detecting distribution shift, pruning stale memories, or overriding memory-based decisions on novel patterns; this assumption is load-bearing for the central claim that memory yields robust detection on evolving threats.

    Authors: We acknowledge that the abstract is high-level and does not spell out the internal mechanisms. Section 3.3 describes the episodic memory module, which retrieves trajectories via embedding similarity and feeds them to the LLM agent alongside current multi-modal inputs (URL text, screenshot, and tool outputs). The agent performs step-by-step reasoning, enabling it to discount or override retrieved memories when current evidence diverges (e.g., low similarity scores or contradictory tool results). The 27% gain is measured by comparing runs with and without memory on the same inputs. We agree an explicit discussion of shift detection and pruning would strengthen the central claim. We will revise the abstract to qualify the claim and add a short paragraph in Section 3.3 plus supporting analysis in Section 5.4 describing the override logic and similarity threshold used in practice. revision: partial

  2. Referee: [Ablation study] Ablation study: the validation of the agent-based approach versus prompt-based baselines does not include explicit tests on novel tactics or distribution-shifted samples, leaving open whether memory augmentation remains beneficial or introduces errors when past trajectories are irrelevant.

    Authors: The ablation (Section 5.3) isolates the contribution of the agent architecture and memory by comparing against prompt-only baselines. While we did not synthesize artificial distribution shifts, the real-world social-media benchmark consists of actively crawled suspicious URLs that exhibit novel tactics absent from the public datasets; the 20% recall lift and the memory-attributed gain of up to 27% on this benchmark already provide evidence that memory remains net beneficial rather than harmful. The production deployment on 60K weekly high-risk URLs further tests robustness under live distribution drift. We will expand the ablation discussion to include a qualitative breakdown of cases where memory was overridden on novel patterns, using examples drawn from the real-world set. revision: partial
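The override behavior the rebuttal describes, discounting retrieved memories when similarity is low or tool evidence contradicts them, amounts to a gate between retrieval and the agent's context. A minimal sketch under that reading; the threshold value and the names `select_context` and `SIM_THRESHOLD` are assumptions, not values from the paper:

```python
SIM_THRESHOLD = 0.8  # hypothetical cutoff; the paper's actual value is not stated here

def select_context(retrieved, threshold=SIM_THRESHOLD):
    """Keep only retrieved (similarity, trajectory) pairs similar enough to
    the current case. Returning None signals the agent to reason from the
    current multi-modal evidence alone, without memory."""
    kept = [traj for sim, traj in retrieved if sim >= threshold]
    return kept or None
```

Under this reading, the referee's concern translates to whether the threshold and the agent's step-by-step reasoning suffice to suppress stale memories on genuinely novel tactics, which is what the promised Section 5.4 analysis would need to show.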

Circularity Check

0 steps flagged

No circularity: empirical claims rest on held-out test sets and production data

Full rationale

The paper presents an empirical agent system evaluated on separate public datasets and live production traffic (60K URLs weekly). Performance gains (13.6% recall on public sets, 20% on real-world URLs, 91.44% in production) are measured against baselines on held-out data rather than any fitted parameters or self-referential definitions. No equations, derivations, or mathematical reductions appear in the provided text; the architecture description and ablation studies compare independent benchmarks without reducing claims to inputs by construction. This is the standard non-circular outcome for an applied ML evaluation paper.

Axiom & Free-Parameter Ledger

0 free parameters · 1 axiom · 1 invented entity

The central claim rests on the assumption that LLMs can reliably use tools and past trajectories for security decisions; no explicit free parameters or new physical entities are introduced in the abstract.

axioms (1)
  • domain assumption Large language models can perform effective multi-step reasoning about phishing indicators when augmented with tools and episodic memory
    Invoked throughout the agent design and performance claims
invented entities (1)
  • MemoPhishAgent · no independent evidence
    purpose: Orchestrate phishing-specific tools and leverage episodic memory for detection decisions
    The proposed system architecture itself

pith-pipeline@v0.9.0 · 5562 in / 1252 out tokens · 23800 ms · 2026-05-15T19:32:04.122363+00:00 · methodology

discussion (0)
