
arxiv: 2606.21353 · v2 · pith:VXPI6VFLnew · submitted 2026-06-19 · 💻 cs.CR

Beyond Classification Accuracy: An Exploration-Range Evaluation of Adaptive Crawling for Fake Shopping Sites

Pith reviewed 2026-06-26 14:00 UTC · model grok-4.3

classification 💻 cs.CR
keywords adaptive crawling · fake shopping sites · SEO poisoning · exploration-range evaluation · closed-loop crawler · unique host discovery · query generation

The pith

A closed-loop crawler feeding classifier outputs into search queries discovers about 7.6 times more unique fake shopping hosts than a fixed-keyword baseline after three cycles.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper proposes an adaptive crawler for fake shopping sites that takes the page-level results of a fastText+LightGBM classifier and uses them to build new search queries each cycle. Fixed-keyword collection stops finding new hosts quickly because attacker campaigns evolve, but the adaptive approach extracts characteristic words from positive pages and compounds them with seed terms to locate additional sites. To move beyond accuracy numbers, the work tracks per-cycle new-host counts and cumulative unique-host counts as exploration-range metrics. In side-by-side runs the baseline acquires no new hosts from cycle two onward while the proposed method keeps discovering them and reaches roughly 7.6 times the baseline's cumulative unique-host total at cycle three.

Core claim

The central claim is that a closed-loop crawler incorporating page-level classifier outputs into a seed-compound query strategy sustains discovery of new fake shopping site hosts across multiple cycles, in contrast to fixed-keyword search which stagnates completely after the first cycle, producing an average cumulative unique-host count 7.6 times higher than the baseline by cycle three.
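The closed loop described above can be sketched as follows. This is a minimal illustration, not the authors' implementation: `search`, `classify`, and `make_queries` are hypothetical callables standing in for the search API, the fastText+LightGBM classifier, and the query generator, and the assumption that only classifier-positive pages contribute hosts is ours.

```python
def run_closed_loop(initial_queries, search, classify, make_queries, cycles=3):
    """Run `cycles` rounds of search -> classify -> regenerate queries.

    All three callables are hypothetical stand-ins:
      search(query)        -> list of page dicts with at least a "host" key
      classify(page)       -> True if the page is judged a fake shopping page
      make_queries(pages)  -> list of query strings for the next cycle
    Returns the set of positive hosts observed in each cycle.
    """
    queries = list(initial_queries)
    hosts_per_cycle = []
    for _ in range(cycles):
        pages = [page for query in queries for page in search(query)]
        positives = [page for page in pages if classify(page)]
        hosts_per_cycle.append({page["host"] for page in positives})
        # Fall back to the previous queries if nothing new was generated.
        queries = make_queries(positives) or queries
    return hosts_per_cycle
```

A fixed-keyword baseline corresponds to a `make_queries` that always returns an empty list, so the same queries are reissued every cycle.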

What carries the argument

The seed-compound strategy, which extracts characteristic words from pages the classifier labels positive and combines them with fixed seed words to form the queries used in the next crawling cycle.
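One way to picture the seed-compound step is the sketch below. The scoring rule (difference in document frequency between positive and negative pages), the `top_k` cutoff, and both function names are assumptions made for illustration; the summary does not state how the paper ranks characteristic words. Pages are assumed to be tokenized already.

```python
from collections import Counter
from itertools import product


def characteristic_words(positive_docs, negative_docs, top_k=5):
    """Rank words that appear in more positive pages than negative ones.

    Assumption: each document is a list of tokens, and the score is the
    difference in document frequency. This rule is illustrative only.
    """
    pos_df = Counter(w for doc in positive_docs for w in set(doc))
    neg_df = Counter(w for doc in negative_docs for w in set(doc))
    n_pos = max(len(positive_docs), 1)
    n_neg = max(len(negative_docs), 1)
    scores = {w: pos_df[w] / n_pos - neg_df[w] / n_neg for w in pos_df}
    # Sort by descending score, breaking ties alphabetically for determinism.
    ranked = sorted(scores, key=lambda w: (-scores[w], w))
    return ranked[:top_k]


def seed_compound_queries(words, seeds):
    """Pair every characteristic word with every fixed seed word."""
    return [f"{word} {seed}" for word, seed in product(words, seeds)]
```

With seed words such as "deep discount" and "official", each extracted word yields one query per seed, so the query set grows with the product of the two lists.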

If this is right

  • Fixed-keyword search produces zero new hosts from cycle two onward.
  • The adaptive method continues to acquire new hosts through at least cycle three.
  • Cumulative unique-host count reaches approximately 7.6 times the baseline value on average at cycle three.
  • Exploration-range metrics can be used alongside accuracy to judge whether a crawler keeps pace with changing site campaigns.
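The two exploration-range metrics can be computed from the hosts observed in each cycle. The sketch below assumes hosts are already normalized strings; the function name `exploration_range` is ours, not the paper's.

```python
def exploration_range(hosts_per_cycle):
    """Return (per-cycle new-host counts, cumulative unique-host counts).

    hosts_per_cycle: one iterable of host strings per crawling cycle.
    A host counts as new only the first time it is seen in any cycle.
    """
    seen = set()
    new_counts = []
    cumulative = []
    for hosts in hosts_per_cycle:
        fresh = set(hosts) - seen
        seen |= fresh
        new_counts.append(len(fresh))
        cumulative.append(len(seen))
    return new_counts, cumulative
```

A stagnating crawler shows up as zeros in the first list and a flat second list, which is the pattern reported for the fixed-keyword baseline from cycle two onward.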

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

  • The same closed-loop pattern could be tested on other SEO-poisoned threats such as phishing or malware landing pages.
  • The introduced per-cycle and cumulative host counts offer a general way to compare any crawler that must chase evolving web content.
  • If a stronger classifier is substituted, the query-generation loop would be expected to produce even larger gaps over the baseline without other changes.

Load-bearing premise

That words taken from the classifier's positive pages can be combined with seed terms to produce queries that reliably locate previously unknown fake sites rather than repeating known ones or stalling.

What would settle it

Repeating the three-cycle experiment and finding that the proposed method's new-host acquisition rate falls to zero after cycle one or fails to exceed the baseline's cumulative total by a substantial margin.

Figures

Figures reproduced from arXiv: 2606.21353 by K. Karasawa, K. Takeshige, M. Hashimoto, M. Shimamura, S. Matsugaya.

Figure 1: Overall system architecture: a closed-loop of collection
Original abstract

In recent years, fake shopping sites targeting Japanese users have appeared in the top results of search engines through SEO poisoning, causing increasing damage. Conventional collection methods rely on fixed keywords and cannot keep up with evolving attack campaigns, delaying the discovery of new sites. We propose a closed-loop crawler that incorporates the page-level outputs of a fake-site classifier (fastText+LightGBM) into the search queries of the next cycle. Search queries are generated by a seed-compound strategy that combines characteristic words extracted from positive pages with seed words from the fake-shopping context (e.g., "deep discount," "official"). To complement evaluations that tend to focus on classifier accuracy, we also introduce per-cycle new-host counts and cumulative unique-host counts as exploration-range metrics. In a comparative experiment (n=3 for the proposed method, n=2 for the baseline), the fixed-keyword baseline yielded zero new-host acquisition from cycle 2 onward, indicating complete stagnation, whereas the proposed method continued to discover new hosts and, at cycle 3, achieved a cumulative unique-host count approximately 7.6 times that of the baseline on average.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Referee Report

1 major / 3 minor

Summary. The paper proposes a closed-loop adaptive crawler for fake shopping sites that feeds page-level outputs from a fastText+LightGBM classifier into a seed-compound query generation strategy (combining extracted characteristic words with fixed seed words such as "deep discount"). It introduces exploration-range metrics (per-cycle new-host counts and cumulative unique-host counts) to evaluate discovery performance beyond classifier accuracy. In a small comparative experiment (n=3 runs of the adaptive method, n=2 of a fixed-keyword baseline), the baseline stagnates with zero new hosts after cycle 1 while the adaptive method continues discovering hosts, reaching ~7.6× the baseline's cumulative unique-host count by cycle 3 on average.

Significance. If the reported advantage is shown to be robust, the work would demonstrate a practical way to track evolving SEO-poisoned sites and would usefully shift evaluation of discovery systems toward exploration metrics rather than accuracy alone. The seed-compound idea and the explicit baseline comparison are clear strengths, but the current evidence base is too narrow to support strong claims about reliability.

major comments (1)
  1. [Abstract / Experimental Results] Abstract and Experimental Results: the central quantitative claim of an approximately 7.6× cumulative unique-host advantage rests on averages computed from only n=3 runs of the proposed crawler and n=2 runs of the baseline, with no per-run values, standard deviations, confidence intervals, or statistical tests supplied. Given the nondeterminism of search-engine results and stochastic word extraction from the classifier, this sample size is insufficient to distinguish a genuine property of the seed-compound strategy from sampling variability.
minor comments (3)
  1. The manuscript provides no description of the host-deduplication rules used to compute the cumulative unique-host metric, which is essential for interpreting the exploration-range results.
  2. No information is given on the training data, feature extraction, or cross-validation procedure for the fastText+LightGBM classifier, nor on how positive-page outputs are filtered before characteristic-word extraction.
  3. The paper does not discuss potential selection effects or timing biases across the three cycles of the experiment.
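To make the first minor comment concrete, here is one possible host-deduplication rule. It is purely an assumption for illustration, since the manuscript's actual rule is not described: hosts are lowercased, the port and any trailing dot are dropped, and a leading `www.` is stripped. Different choices (for example, collapsing to the registrable domain) would change the cumulative unique-host count.

```python
from urllib.parse import urlsplit


def host_key(url):
    """Map a URL to a normalized host string, or "" if it has no host.

    Hypothetical rule: lowercase host (urlsplit does this), no port,
    no trailing dot, no leading "www." label.
    """
    host = urlsplit(url).hostname or ""
    host = host.rstrip(".")
    if host.startswith("www."):
        host = host[4:]
    return host


def unique_hosts(urls):
    """Collect the distinct normalized hosts from an iterable of URLs."""
    keys = (host_key(url) for url in urls)
    return {key for key in keys if key}
```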

Simulated Author's Rebuttal

1 response · 0 unresolved

We thank the referee for the constructive comments and for recognizing the value of the seed-compound strategy and exploration-range metrics. We agree that the limited number of runs weakens the reliability of the reported advantage and will revise the manuscript to address this.

Point-by-point responses
  1. Referee: [Abstract / Experimental Results] Abstract and Experimental Results: the central quantitative claim of an approximately 7.6× cumulative unique-host advantage rests on averages computed from only n=3 runs of the proposed crawler and n=2 runs of the baseline, with no per-run values, standard deviations, confidence intervals, or statistical tests supplied. Given the nondeterminism of search-engine results and stochastic word extraction from the classifier, this sample size is insufficient to distinguish a genuine property of the seed-compound strategy from sampling variability.

    Authors: We agree that the current sample sizes (n=3 and n=2) and absence of variability measures or statistical tests limit the strength of the claims. In the revised version we will run additional independent trials of both the adaptive crawler and the fixed-keyword baseline, report the per-run cumulative unique-host counts, compute standard deviations and confidence intervals, and include appropriate statistical tests (e.g., two-sample t-test or Wilcoxon rank-sum test) on the cycle-3 totals. These changes will allow readers to evaluate whether the observed advantage exceeds sampling variability.

    Revision: yes
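The per-run reporting promised in the rebuttal could look roughly like the sketch below. It uses only the Python standard library and computes summary statistics plus Welch's t statistic (the unequal-variance form of the two-sample t-test). The function names are hypothetical, and no per-run values from the paper are available, so none are shown here.

```python
from math import sqrt
from statistics import mean, variance


def summarize(counts):
    """Return n, mean, and sample standard deviation for one condition."""
    return {"n": len(counts), "mean": mean(counts), "sd": sqrt(variance(counts))}


def welch_t(a, b):
    """Welch's t statistic for two independent samples.

    Each sample needs at least two values so the sample variance exists,
    and at least one sample must have non-zero variance.
    """
    standard_error = sqrt(variance(a) / len(a) + variance(b) / len(b))
    return (mean(a) - mean(b)) / standard_error


def ratio_of_means(a, b):
    """Ratio of mean cumulative unique-host counts (proposed / baseline)."""
    return mean(a) / mean(b)
```

Called with the per-run cycle-3 totals for the two conditions, `ratio_of_means` reproduces the kind of headline multiple the abstract reports, while `summarize` and `welch_t` expose how much of it could be run-to-run variability.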

Circularity Check

0 steps flagged

No circularity: empirical comparison uses independent metrics


The paper reports an empirical result from a comparative experiment (proposed method n=3 vs. fixed-keyword baseline n=2) using per-cycle new-host counts and cumulative unique-host counts. These metrics are defined directly from observed search-engine outputs and are not derived from any fitted parameters, self-referential definitions, or load-bearing self-citations. The 7.6× cumulative-host advantage is a measured experimental outcome, not a quantity that reduces to the seed-compound strategy inputs by construction. The derivation chain is self-contained against the explicit baseline.

Axiom & Free-Parameter Ledger

1 free parameter · 1 axiom · 0 invented entities

The central claim rests on the unverified reliability of the classifier for query generation and the effectiveness of the seed-compound strategy; these are domain assumptions rather than derived results. No free parameters or invented entities are explicitly quantified in the abstract.

free parameters (1)
  • seed words
    Fixed context words such as "deep discount" and "official" chosen from the fake-shopping domain; their selection and weighting are not derived.
axioms (1)
  • domain assumption: The fastText+LightGBM classifier produces page-level outputs sufficiently accurate to extract useful characteristic words for query adaptation.
    Invoked when the closed-loop uses positive-page outputs to generate the next cycle's queries.

pith-pipeline@v0.9.1-grok · 5749 in / 1481 out tokens · 27812 ms · 2026-06-26T14:00:40.649344+00:00 · methodology

