pith. sign in

arxiv: 2605.24958 · v1 · pith:M52QF7A7new · submitted 2026-05-24 · 💻 cs.CL · cs.AI

SEP-Attack: A Simple and Effective Paradigm for Transfer-Based Textual Adversarial Attack

Pith reviewed 2026-06-30 12:17 UTC · model grok-4.3

classification 💻 cs.CL cs.AI
keywords adversarial attacktransfer-based attacktextual adversarial examplesdeterminantal point processsurrogate modelsblack-box attacknatural language processing
0
0 comments X

The pith

SEP-Attack uses DPP to assign diverse weights to surrogate models for better transferable textual attacks.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper proposes SEP-Attack to generate adversarial examples that transfer more effectively to unseen text classifiers without direct access to the victim model. It tackles prior issues of equal submodel treatment and poor importance scoring by applying the Determinantal Point Process to produce varied ensemble weights that reflect each surrogate's transferability. These weights feed into a new prediction confidence metric that drives word importance calculations and candidate generation. Candidates are then ranked by a separate transferability score before final selection. Experiments on four datasets and two real-world APIs show the method beats existing baselines.

Core claim

SEP-Attack employs the Determinantal Point Process to generate diverse surrogate ensemble weights that represent the transferability of submodels. Using these weights, a new metric evaluates prediction confidence scores, which are used to calculate word importance scores and generate adversarial candidates. A quantified transferability score is then applied to each candidate to select the final transferable adversarial examples.

What carries the argument

Determinantal Point Process (DPP) for generating diverse surrogate ensemble weights that represent submodel transferability and improve word importance scoring plus candidate selection.

If this is right

  • Word importance scores become more accurate when derived from transferability-weighted confidence values.
  • Adversarial candidates selected via the new metric achieve higher attack success on black-box targets.
  • Quantified transferability scoring enables explicit ranking of candidates before deployment against real APIs.
  • The overall pipeline scales to multiple datasets without requiring victim model gradients or architecture details.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

  • The same DPP weighting idea could be tested on vision models to check whether diversity in surrogates generalizes beyond text.
  • Replacing DPP with other diversity mechanisms might yield different transfer gains and could be compared directly.
  • Success on real APIs suggests the method could be evaluated on larger language models where API access is the only option.

Load-bearing premise

That DPP-generated weights accurately capture how well each submodel's behavior transfers to the victim model.

What would settle it

An experiment that replaces DPP weighting with uniform weights on the same surrogate set and finds no gain in transfer success rate on held-out victims would refute the core benefit.

Figures

Figures reproduced from arXiv: 2605.24958 by Fenglong Ma, Feng Zhang, Han Liu, Hong Yu, Wei Wang, Xiaoming Xu, Xiaotong Zhang, Zhi Xu.

Figure 1
Figure 1. Figure 1: The overview of our proposed SEP-Attack framework, which consists of three main components. (1) Generating [PITH_FULL_IMAGE:figures/full_fig_p002_1.png] view at source ↗
Figure 2
Figure 2. Figure 2: This figure illustrates the variation of confidence scores for the truth label of a sample 𝑋 ′ with respect to both the ensemble model and the victim model. In the figure, 𝑋 ′ 1 , 𝑋′ 2 , 𝑋′ 3 represent po￾tential adversarial examples. model. Since we have 𝐸 diverse weight vectors, we finally generate a set of candidates X = {𝑋 ′ 1 , · · · , 𝑋′ 𝐸×𝑇 }. 3.4 Selecting Transferable Adversarial Examples A naive … view at source ↗
Figure 3
Figure 3. Figure 3: Comparison on ASR(%) in different budget limits. [PITH_FULL_IMAGE:figures/full_fig_p007_3.png] view at source ↗
read the original abstract

Despite the strong performance of deep neural networks in modern Web and language applications, they remain vulnerable to adversarial attacks, especially transferable attacks that generate adversarial examples using surrogate models without accessing the victim model. Transferable attacks in the text domain are still under-explored, with only a few studies addressing this challenging issue, often with suboptimal results due to equal treatment of submodels or inaccurate estimation of importance scores. To address these challenges, we propose a simple yet effective paradigm for transfer-based textual adversarial attack, named SEP-Attack. Specifically, we employ the Determinantal Point Process (DPP) to generate diverse surrogate ensemble weights, representing the transferability of submodels. Using these weights, we introduce a new metric to evaluate prediction confidence scores, which in turn are used to calculate word importance scores and generate adversarial candidates. Finally, we quantify the transferability score for each candidate and select the top ones as the final transferable adversarial examples. Experiments conducted on four datasets and two real-world APIs validate the efficacy of SEP-Attack, significantly outperforming state-of-the-art baselines.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Referee Report

2 major / 3 minor

Summary. The paper introduces SEP-Attack, a paradigm for transfer-based textual adversarial attacks. It employs the Determinantal Point Process (DPP) to produce diverse weights for an ensemble of surrogate models that represent submodel transferability. These weights define a new metric for prediction confidence scores, which are then used to compute word importance scores and select adversarial candidates. The final step quantifies transferability of each candidate and retains the top ones. Experiments on four datasets and two real-world APIs are reported to show significant outperformance over state-of-the-art baselines.

Significance. If the empirical results hold under scrutiny, the work provides a simple, DPP-based mechanism for improving ensemble diversity and transferability estimation in textual attacks. This directly addresses the documented limitations of equal submodel weighting and inaccurate importance scoring, potentially strengthening robustness evaluations for deployed language models and APIs.

major comments (2)
  1. [§3.2] §3.2, the DPP weight generation procedure: the claim that the resulting weights 'represent the transferability of submodels' requires an explicit validation step (e.g., correlation with actual cross-model attack success rates) that is not shown; without it the downstream word-importance and candidate-selection steps rest on an unverified proxy.
  2. [§4.2, Table 3] §4.2 and Table 3: the reported attack success rates on the two real-world APIs are given as single-point estimates without standard deviations across random seeds or multiple query budgets; this makes it impossible to determine whether the claimed superiority over baselines is statistically reliable.
minor comments (3)
  1. [Eq. (5)] The notation for the new confidence metric (Eq. 5) uses the same symbol C for both the per-submodel and the weighted ensemble versions; a distinct symbol would improve readability.
  2. [Figure 2] Figure 2 caption does not state the number of DPP samples drawn or the kernel matrix construction details, which are needed to reproduce the diversity results.
  3. [§2] Related-work section omits recent DPP applications in NLP ensemble methods (e.g., 2023–2024 papers on DPP for model selection); adding 2–3 citations would better situate the contribution.

Simulated Author's Rebuttal

2 responses · 0 unresolved

We thank the referee for the constructive comments and positive recommendation. We address each major point below.

read point-by-point responses
  1. Referee: [§3.2] §3.2, the DPP weight generation procedure: the claim that the resulting weights 'represent the transferability of submodels' requires an explicit validation step (e.g., correlation with actual cross-model attack success rates) that is not shown; without it the downstream word-importance and candidate-selection steps rest on an unverified proxy.

    Authors: We agree that an explicit validation step would strengthen the claim. In the revised manuscript we will add a correlation analysis (in §3.2 or an appendix) between the DPP-derived weights and measured attack success rates on held-out surrogate-victim pairs. revision: yes

  2. Referee: [§4.2, Table 3] §4.2 and Table 3: the reported attack success rates on the two real-world APIs are given as single-point estimates without standard deviations across random seeds or multiple query budgets; this makes it impossible to determine whether the claimed superiority over baselines is statistically reliable.

    Authors: We acknowledge that single-point estimates limit statistical assessment. We will rerun the API experiments across multiple random seeds, report means and standard deviations, and update Table 3 accordingly (noting that query budgets on real APIs constrain the number of runs). revision: yes

Circularity Check

0 steps flagged

No significant circularity

full rationale

The paper presents SEP-Attack as a procedural method using DPP for ensemble weights, a new confidence metric for word importance, and transferability scoring for candidate selection. The abstract and description contain no equations, derivations, or claims that reduce by construction to fitted inputs or self-citations. The central claim is empirical (outperformance on datasets/APIs), with no load-bearing self-referential steps visible. This matches the default expectation of a non-circular empirical contribution.

Axiom & Free-Parameter Ledger

0 free parameters · 0 axioms · 0 invented entities

Abstract-only review provides no explicit free parameters, axioms, or invented entities; the method implicitly assumes DPP produces useful diversity for transferability without stating how parameters are chosen or validated.

pith-pipeline@v0.9.1-grok · 5734 in / 1077 out tokens · 21483 ms · 2026-06-30T12:17:45.637458+00:00 · methodology

discussion (0)

Sign in with ORCID, Apple, or X to comment. Anyone can read and Pith papers without signing in.

Reference graph

Works this paper leans on

48 extracted references · 2 canonical work pages · 2 internal anchors

  1. [1]

    Pir Noman Ahmad, Adnan Muhammad Shah, KangYoon Lee, and Wazir Muham- mad. 2026. Misinformation detection on online social networks using pretrained language models.Information Processing and Management63, 1 (2026), 104342

  2. [2]

    AI@Meta. 2024. Llama 3 Model Card. (2024). https://github.com/meta-llama/ llama3/blob/main/MODEL_CARD.md

  3. [3]

    Cong Chen, Wei Qu, Si Su, Yukun Feng, and Tao Li. 2025. A comprehensive review of LLM-based content moderation: advancements, challenges, and future directions.Knowledge-Based Systems330 (2025), 114689

  4. [4]

    Huanran Chen, Yichi Zhang, Yinpeng Dong, and Jun Zhu. 2024. Rethinking Model Ensemble in Transfer-based Adversarial Attacks. InInternational Conference on Learning Representations (ICLR)

  5. [5]

    Laming Chen, Guoxin Zhang, and Eric Zhou. 2018. Fast Greedy MAP Infer- ence for Determinantal Point Process to Improve Recommendation Diversity. In Conference on Neural Information Processing Systems (NeurIPS). 5627–5638

  6. [6]

    Yangyi Chen, Hongcheng Gao, Ganqu Cui, Fanchao Qi, Longtao Huang, Zhiyuan Liu, and Maosong Sun. 2022. Why Should Adversarial Perturbations be Imper- ceptible? Rethink the Research Paradigm in Adversarial NLP. InConference on Empirical Methods in Natural Language Processing (EMNLP). 11222–11237

  7. [7]

    Jacob Devlin, Ming-Wei Chang, Kenton Lee, and Kristina Toutanova. 2019. BERT: Pre-training of Deep Bidirectional Transformers for Language Understanding. In North American Chapter of the Association for Computational Linguistics (NAACL). 4171–4186

  8. [8]

    Yinpeng Dong, Fangzhou Liao, Tianyu Pang, Hang Su, Jun Zhu, Xiaolin Hu, and Jianguo Li. 2018. Boosting Adversarial Attacks With Momentum. InComputer Vision and Pattern Recognition (CVPR). 9185–9193

  9. [9]

    Javid Ebrahimi, Anyi Rao, Daniel Lowd, and Dejing Dou. 2018. HotFlip: White- Box Adversarial Examples for Text Classification. InAnnual Meeting of the Asso- ciation for Computational Linguistics (ACL). 31–36

  10. [10]

    Ji Gao, Jack Lanchantin, Mary Lou Soffa, and Yanjun Qi. 2018. Black-Box Gener- ation of Adversarial Text Sequences to Evade Deep Learning Classifiers. InIEEE Symposium on Security and Privacy (S&P). 50–56

  11. [11]

    Zhijin Ge, Xiaosen Wang, Hongying Liu, Fanhua Shang, and Yuanyuan Liu

  12. [12]

    In Conference on Neural Information Processing Systems (NeurIPS)

    Boosting Adversarial Transferability by Achieving Flat Local Maxima. In Conference on Neural Information Processing Systems (NeurIPS)

  13. [13]

    Adrián Girón, Javier Huertas-Tato, and David Camacho. 2025. LLM synthetic generation to enhance online content moderation generalization in hate speech scenarios.Computing107, 7 (2025), 164

  14. [14]

    Goodfellow, Jonathon Shlens, and Christian Szegedy

    Ian J. Goodfellow, Jonathon Shlens, and Christian Szegedy. 2015. Explaining and Harnessing Adversarial Examples. InInternational Conference on Learning Representations (ICLR)

  15. [15]

    Sepp Hochreiter and Jürgen Schmidhuber. 1997. Long Short-Term Memory. Neural Computation(1997), 1735–1780

  16. [16]

    Tao Huang. 2025. Content moderation by LLM: from accuracy to legitimacy. Artificial Intelligence Review58, 10 (2025), 320

  17. [17]

    Albert Q. Jiang, Alexandre Sablayrolles, Arthur Mensch, Chris Bamford, De- vendra Singh Chaplot, Diego de las Casas, Florian Bressand, Gianna Lengyel, Guillaume Lample, Lucile Saulnier, Lélio Renard Lavaud, Marie-Anne Lachaux, Pierre Stock, Teven Le Scao, Thibaut Lavril, Thomas Wang, Timothée Lacroix, and William El Sayed. 2023. Mistral 7B. abs/2310.06825

  18. [18]

    Di Jin, Zhijing Jin, Joey Tianyi Zhou, and Peter Szolovits. 2020. Is BERT Really Robust? A Strong Baseline for Natural Language Attack on Text Classification and Entailment. InAAAI Conference on Artificial Intelligence (AAAI). 8018–8025

  19. [19]

    Yoon Kim. 2014. Convolutional Neural Networks for Sentence Classification. InConference on Empirical Methods in Natural Language Processing (EMNLP). 1746–1751

  20. [20]

    Hyun Kwon and Sanghyun Lee. 2023. Ensemble transfer attack targeting text classification systems.Computers & Security124 (2023), 102944

  21. [21]

    Zhenzhong Lan, Mingda Chen, Sebastian Goodman, Kevin Gimpel, Piyush Sharma, and Radu Soricut. 2020. ALBERT: A Lite BERT for Self-supervised Learning of Language Representations. InInternational Conference on Learning Representations (ICLR)

  22. [22]

    Thai Le, Noseong Park, and Dongwon Lee. 2022. SHIELD: Defending Textual Neural Networks against Multiple Black-Box Adversarial Attacks with Stochastic Multi-Expert Patcher. InAnnual Meeting of the Association for Computational Linguistics (ACL). 6661–6674

  23. [23]

    Deokjae Lee, Seungyong Moon, Junhyeok Lee, and Hyun Oh Song. 2022. Query- Efficient and Scalable Black-Box Adversarial Attacks on Discrete Sequential Data via Bayesian Optimization. InInternational Conference on Machine Learning (ICML). 12478–12497

  24. [24]

    Jinfeng Li, Shouling Ji, Tianyu Du, Bo Li, and Ting Wang. 2019. TextBugger: Generating Adversarial Text Against Real-world Applications. InNetwork and Distributed System Security Symposium (NDSS)

  25. [25]

    Linyang Li, Ruotian Ma, Qipeng Guo, Xiangyang Xue, and Xipeng Qiu. 2020. BERT-ATTACK: Adversarial Attack Against BERT Using BERT. InConference on Empirical Methods in Natural Language Processing (EMNLP). 6193–6202

  26. [26]

    Han Liu, Zhi Xu, Xiaotong Zhang, Feng Zhang, Fenglong Ma, Hongyang Chen, Hong Yu, and Xianchao Zhang. 2023. HQA-Attack: Toward High Quality Black- Box Hard-Label Adversarial Attack on Text. InConference on Neural Information Processing Systems (NeurIPS)

  27. [27]

    Xiaodong Liu, Xiao Lin, Yiming Ding, Changcheng Li, Peng Jiang, and Weiran Shen. 2025. Optimizing Revenue through User Coupon Recommendations in Truthful Online Ad Auctions. InThe Web Conference (WWW). ACM, 1380–1388

  28. [28]

    Yanpei Liu, Xinyun Chen, Chang Liu, and Dawn Song. 2017. Delving into Trans- ferable Adversarial Examples and Black-box Attacks. InInternational Conference on Learning Representations (ICLR)

  29. [29]

    Yinhan Liu, Myle Ott, Naman Goyal, Jingfei Du, Mandar Joshi, Danqi Chen, Omer Levy, Mike Lewis, Luke Zettlemoyer, and Veselin Stoyanov. 2019. RoBERTa: A Robustly Optimized BERT Pretraining Approach.arXiv preprint arXiv:1907.11692 abs/1907.11692 (2019)

  30. [30]

    Zhiwei Liu, Keyi Wang, Zhuo Bao, Xin Zhang, Jiping Dong, Kailai Yang, Mohsinul Kabir, Polydoros Giannouris, Rui Xing, Seongchan Park, Jaehong Kim, Dong Li, Qianqian Xie, and Sophia Ananiadou. 2025. FinNLP-FNP-LLMFinLegal-2025 Shared Task: Financial Misinformation Detection Challenge Task. InInternational Conference on Computational Linguistics (COLING). 271–276

  31. [31]

    Maas, Raymond E

    Andrew L. Maas, Raymond E. Daly, Peter T. Pham, Dan Huang, Andrew Y. Ng, and Christopher Potts. 2011. Learning Word Vectors for Sentiment Analysis. In Annual Meeting of the Association for Computational Linguistics (ACL). 142–150

  32. [32]

    Aleksander Madry, Aleksandar Makelov, Ludwig Schmidt, Dimitris Tsipras, and Adrian Vladu. 2018. Towards Deep Learning Models Resistant to Adversarial Attacks. InInternational Conference on Learning Representations (ICLR). OpenRe- view.net

  33. [33]

    Rishabh Maheshwary, Saket Maheshwary, and Vikram Pudi. 2021. Generating Natural Language Attacks in a Hard Label Black Box Setting. InAAAI Conference on Artificial Intelligence (AAAI). 13525–13533

  34. [34]

    Zhao Meng and Roger Wattenhofer. 2020. A Geometry-Inspired Attack for Generating Natural Language Adversarial Examples. InCOLING. 6679–6689

  35. [35]

    Bo Pang and Lillian Lee. 2005. Seeing Stars: Exploiting Class Relationships for Sentiment Categorization with Respect to Rating Scales. InAnnual Meeting of the Association for Computational Linguistics (ACL). 115–124

  36. [36]

    Zeyu Qin, Yanbo Fan, Yi Liu, Li Shen, Yong Zhang, Jue Wang, and Baoyuan Wu. 2022. Boosting the Transferability of Adversarial Attacks with Reverse Adversarial Perturbation. InConference on Neural Information Processing Systems (NeurIPS)

  37. [37]

    Haoran Tang, Shiqing Wu, Zhihong Cui, Yicong Li, Guandong Xu, and Qing Li. 2025. Model-Agnostic Dual-Side Online Fairness Learning for Dynamic Recommendation.IEEE Transactions on Knowledge and Data Engineering37, 5 (2025), 2727–2742

  38. [38]

    Xiaosen Wang and Kun He. 2021. Enhancing the Transferability of Adversarial Attacks Through Variance Tuning. InComputer Vision and Pattern Recognition (CVPR). 1924–1933

  39. [39]

    Likang Wu, Zhaopeng Qiu, Zhi Zheng, Hengshu Zhu, and Enhong Chen. [n. d.]. Exploring Large Language Model for Graph Data Understanding in Online Job Recommendations. InAAAI Conference on Artificial Intelligence (AAAI), Michael J. Wooldridge, Jennifer G. Dy, and Sriraam Natarajan (Eds.). 9178–9186

  40. [40]

    Qingzheng Xu, Heming Du, Szymon Lukasik, Tianqing Zhu, Sen Wang, and Xin Yu. 2025. MDAM3: A Misinformation Detection and Analysis Framework for Multitype Multimodal Media. InThe Web Conference (WWW). ACM, 5285–5296

  41. [41]

    Muchao Ye, Jinghui Chen, Chenglin Miao, Han Liu, Ting Wang, and Fenglong Ma

  42. [42]

    InACM SIGKDD Conference on Knowledge Discovery and Data Mining (KDD)

    PAT: Geometry-Aware Hard-Label Black-Box Adversarial Attacks on Text. InACM SIGKDD Conference on Knowledge Discovery and Data Mining (KDD). 3093–3104

  43. [43]

    Muchao Ye, Jinghui Chen, Chenglin Miao, Ting Wang, and Fenglong Ma. 2022. LeapAttack: Hard-Label Adversarial Attack on Text via Gradient-Based Opti- mization. InACM SIGKDD Conference on Knowledge Discovery and Data Mining (KDD). 2307–2315

  44. [44]

    Muchao Ye, Chenglin Miao, Ting Wang, and Fenglong Ma. 2022. TextHoaxer: Bud- geted Hard-Label Adversarial Attacks on Text. InAAAI Conference on Artificial Intelligence (AAAI). 3877–3884

  45. [45]

    Kan Yuan, Di Tang, Xiaojing Liao, XiaoFeng Wang, Xuan Feng, Yi Chen, Menghan Sun, Haoran Lu, and Kehuan Zhang. 2019. Stealthy Porn: Understanding Real- World Adversarial Images for Illicit Online Promotion. InIEEE Symposium on Security and Privacy (S&P). 952–966

  46. [46]

    Liping Yuan, Xiaoqing Zheng, Yi Zhou, Cho-Jui Hsieh, and Kai-Wei Chang. 2021. On the Transferability of Adversarial Attacks against Neural Text Classifier. InConference on Empirical Methods in Natural Language Processing (EMNLP). 1612–1625

  47. [47]

    Zhiyuan Zeng and Deyi Xiong. 2021. An Empirical Study on Adversarial Attack on NMT: Languages and Positions Matter. InAnnual Meeting of the Association for Computational Linguistics (ACL). 454–460

  48. [48]

    wonderful

    Xiang Zhang, Junbo Jake Zhao, and Yann LeCun. 2015. Character-level Convo- lutional Networks for Text Classification. InConference on Neural Information Processing Systems (NeurIPS). 649–657. WWW ’26, April 13–17, 2026, Dubai, United Arab Emirates Han Liu et al. A Dataset Description The detailed dataset information is as follows: • MR[ 34] is a short mov...