pith. machine review for the scientific record.

arxiv: 2605.08280 · v1 · submitted 2026-05-08 · 💻 cs.LG · cs.AI

Recognition: no theorem link

Beyond the False Trade-off: Adaptive EWC for Stealthy and Generalizable T2I Backdoors

Authors on Pith: no claims yet

Pith reviewed 2026-05-12 00:46 UTC · model grok-4.3

classification 💻 cs.LG cs.AI
keywords text-to-image backdoors · elastic weight consolidation · adaptive regularization · model fidelity · stealthy attacks · out-of-domain robustness

The pith

Cosine-Aware Adaptive EWC eliminates the artificial trade-off between attack success and model fidelity in text-to-image backdoor attacks.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper shows that stealthy backdoor attacks on text-to-image models require preserving normal model behavior to stay undetected. Existing output-based methods like Learning without Forgetting give only weak protection of fidelity, while standard Elastic Weight Consolidation with a fixed penalty creates a false choice between high attack success rate and clean outputs, especially on weak triggers. The proposed Cosine-Aware Adaptive EWC replaces the fixed penalty with dynamic adjustment driven by cosine similarity of semantic parameter importance and adaptive scheduling. This change keeps attack rates high, protects fidelity more effectively, and improves performance on new datasets.

Core claim

Standard static EWC with a fixed regularization weight lambda and a mean-squared utility loss creates an artificial trade-off between attack success rate and fidelity, degrading performance especially on weak triggers. Cosine-Aware Adaptive EWC instead transforms EWC into a context-sensitive constraint, using a cosine-based semantic utility and adaptive scheduling to maintain high ASR while preserving model fidelity and gaining robustness on out-of-domain data.
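The static objective being critiqued can be written down from the abstract's description alone. A minimal sketch, assuming the usual EWC penalty (per-parameter importances F_i around a clean snapshot θ*) and treating the backdoor and MSE utility losses as opaque scalars; the paper's exact loss terms are not given here:

```python
def static_ewc_loss(theta, theta_star, fisher, l_bd, l_utl_mse, lam):
    """Static EWC objective with a FIXED weight `lam` (illustrative):
    L = L_bd + L_utl_mse + lam * sum_i F_i * (theta_i - theta*_i)^2.
    `fisher` holds per-parameter importance estimates F_i; `theta_star`
    is the clean (pre-attack) parameter snapshot."""
    penalty = sum(f * (t - ts) ** 2
                  for f, t, ts in zip(fisher, theta, theta_star))
    return l_bd + l_utl_mse + lam * penalty
```

Because `lam` is fixed, any value large enough to pin fidelity-critical parameters also pins the parameters a weak trigger needs to move, which is the trade-off the paper calls artificial.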

What carries the argument

Cosine-Aware Adaptive EWC: a parameter-based regularization method that replaces fixed lambda with dynamic adjustment based on cosine similarity of semantic utility to avoid over-penalizing important parameters.
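A hedged reading of what replacing the MSE utility with a cosine utility might look like. The abstract does not say where the cosine is taken (parameter importances, gradients, or semantic embeddings), so the vectors below are a stand-in; only the shape of the idea, scale-invariant directional agreement instead of squared error, comes from the paper:

```python
import math

def mse_utility(v_student, v_teacher):
    """Magnitude-sensitive utility (the static baseline's choice)."""
    return sum((a - b) ** 2 for a, b in zip(v_student, v_teacher)) / len(v_student)

def cosine_utility(v_student, v_teacher, eps=1e-12):
    """1 - cosine similarity: penalizes directional drift between the
    student and the clean teacher but ignores overall scale."""
    dot = sum(a * b for a, b in zip(v_student, v_teacher))
    norm = math.sqrt(sum(a * a for a in v_student)) * \
           math.sqrt(sum(b * b for b in v_teacher))
    return 1.0 - dot / (norm + eps)
```

A rescaled but directionally faithful vector is heavily penalized by MSE and barely by the cosine, which is one plausible mechanism for sparing weak-trigger updates.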

If this is right

  • Maintains high attack success rate even on weak triggers where static EWC fails
  • Achieves better fidelity preservation than output-based distillation methods
  • Delivers improved robustness when the model is tested on out-of-domain datasets
  • Converts EWC regularization from a rigid fixed penalty into a context-sensitive constraint

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the authors make directly.

  • Detection tools that monitor static fidelity metrics may miss attacks that use this dynamic regularization.
  • The same adaptive approach could be tested on backdoor or poisoning attacks in other generative models such as text or video.
  • If the cosine utility generalizes, similar context-sensitive constraints might resolve apparent trade-offs in other parameter-regularized learning tasks.

Load-bearing premise

The cosine-based semantic utility correctly identifies which parameters matter most for model fidelity without adding new biases or needing heavy extra tuning.

What would settle it

An experiment on a held-out T2I model and trigger set where the adaptive method produces either lower attack success rate or noticeably worse fidelity than the static EWC baseline.

Figures

Figures reproduced from arXiv: 2605.08280 by Lu Bowen, Shu-Min Leong, Xinyu Tang, Yin Yin Low.

Figure 1
Figure 1. Overview. Left: teacher–student pipeline with backdoor loss Lbd, clean utility Lutl (cosine for adaptive; MSE or cosine for fixed/ablations), optional Lcross, and an EWC penalty Lewc. Right: the adaptive regulator sets λ from the EMA-smoothed ratio Lutl-cos/(Lbd + ϵ), raising consolidation under forgetting and relaxing it when the attack underfits. Inference-time overhead is unchanged (training-only regula…
Figure 2
Figure 2. ASR–fidelity trade-off across trigger families. AEWC (green stars) achieves higher fidelity (Clean-Cos) while maintaining high ASR. Prior methods (LwF, Plain, Fixed-EWC) cluster at lower fidelity. Dashed lines show empirical Pareto frontiers. Unicode exhibits the clearest separation; Phrase maintains near-perfect ASR across all methods.
Figure 3
Figure 3. Qualitative comparison on an anime LoRA expert. Each row shows one method; column pairs alternate clean/poison outputs. Poison prompts add a text-side trigger to inject a fixed target concept. AEWC (bottom) maintains clean fidelity while achieving consistent backdoor injection, producing a coherent target concept under the trigger. Fixed-EWC preserves clean quality but suppresses the backdoor (ASR = 0). LwF…
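The right panel of Figure 1 pins down the one formula this page gives: λ follows the EMA-smoothed ratio Lutl-cos/(Lbd + ϵ). A sketch of such a regulator, with the smoothing factor, base scale, and clipping bounds as illustrative assumptions (the paper's values are not stated here):

```python
class AdaptiveLambdaRegulator:
    """Sets the EWC weight from the EMA-smoothed ratio
    L_utl_cos / (L_bd + eps), per the Figure 1 description.
    All constants are illustrative, not the paper's."""

    def __init__(self, base=1.0, beta=0.9, lam_min=0.1, lam_max=10.0, eps=1e-8):
        self.base, self.beta = base, beta
        self.lam_min, self.lam_max, self.eps = lam_min, lam_max, eps
        self.ema = None  # EMA of the utility/backdoor loss ratio

    def step(self, l_utl_cos, l_bd):
        ratio = l_utl_cos / (l_bd + self.eps)
        # Rising ratio: fidelity is being forgotten, so consolidate harder;
        # falling ratio: the attack underfits, so relax the penalty.
        self.ema = ratio if self.ema is None else \
            self.beta * self.ema + (1.0 - self.beta) * ratio
        return min(self.lam_max, max(self.lam_min, self.base * self.ema))
```

The regulator only touches training; nothing changes at inference, matching the caption's training-only claim.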
Original abstract

Preserving model fidelity is essential for stealthy text-to-image (T2I) backdoor attacks. Existing methods such as Learning without Forgetting (LwF) rely on output-based distillation, which provides limited regularization. We introduce Elastic Weight Consolidation (EWC) as a parameter-based alternative for preserving fidelity in backdoor learning. While stronger in principle, we show that standard static EWC with a fixed regularization weight lambda and mean-squared utility loss creates an artificial trade-off between attack success rate (ASR) and fidelity, particularly degrading performance on weak triggers. To address this, we propose Cosine-Aware Adaptive EWC, which dynamically adjusts EWC regularization using a cosine-based semantic utility and adaptive scheduling. This approach transforms EWC from a fixed penalty into a context-sensitive constraint, maintaining high ASR while preserving model fidelity. Experiments demonstrate improved ASR-fidelity balance and enhanced robustness on out-of-domain (OOD) datasets compared to existing baselines.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Referee Report

3 major / 1 minor

Summary. The paper claims that standard Elastic Weight Consolidation (EWC) with fixed lambda and mean-squared utility introduces an artificial trade-off between attack success rate (ASR) and model fidelity in text-to-image (T2I) backdoor attacks, particularly harming weak triggers. It proposes Cosine-Aware Adaptive EWC, which dynamically adjusts regularization via a cosine-based semantic utility and adaptive scheduling to convert EWC into a context-sensitive constraint, thereby achieving superior ASR-fidelity balance and enhanced out-of-domain (OOD) robustness relative to baselines such as Learning without Forgetting (LwF).

Significance. If the central claims hold with proper validation, the work would advance parameter-based regularization techniques for stealthy backdoors in diffusion models by addressing a key limitation of static penalties. The shift from output-based distillation to adaptive EWC is conceptually promising for maintaining fidelity while enabling effective attacks, and the reported OOD improvements could inform broader robustness studies in generative AI security if supported by rigorous controls.

major comments (3)
  1. Abstract: The motivation that 'standard static EWC with a fixed regularization weight lambda and mean-squared utility loss creates an artificial trade-off' is load-bearing for the entire contribution, yet the abstract provides no explicit EWC loss formulation, no derivation of how fixed lambda degrades weak triggers, and no quantitative illustration of the claimed trade-off; without this, the necessity of the adaptive extension cannot be assessed.
  2. Proposed method (cosine-aware component): The cosine-based semantic utility is presented as the mechanism that 'transforms EWC from a fixed penalty into a context-sensitive constraint,' but no derivation, ablation, or justification is given for why cosine similarity on semantic embeddings correctly identifies fidelity-critical parameters versus backdoor directions; this choice is central to the claim of eliminating the trade-off and must be shown not to introduce new biases in fine-grained image statistics.
  3. Experiments: The abstract asserts 'improved ASR-fidelity balance and enhanced robustness on OOD datasets' without reporting statistical significance, exact implementation details, controls for post-hoc hyperparameter selection, or baseline comparisons on the same weak-trigger settings; these omissions make it impossible to verify whether the adaptive scheduling, rather than the cosine utility, drives the gains.
minor comments (1)
  1. The abstract would be clearer if it named the specific T2I models, trigger types, and OOD datasets used, along with the precise definition of 'semantic utility' (e.g., on embeddings, gradients, or activations).

Simulated Author's Rebuttal

3 responses · 0 unresolved

We thank the referee for the thoughtful and detailed review. The comments highlight important areas for improving clarity, justification, and experimental rigor. We address each major comment below and commit to revisions that directly respond to the concerns raised.

Point-by-point responses
  1. Referee: Abstract: The motivation that 'standard static EWC with a fixed regularization weight lambda and mean-squared utility loss creates an artificial trade-off' is load-bearing for the entire contribution, yet the abstract provides no explicit EWC loss formulation, no derivation of how fixed lambda degrades weak triggers, and no quantitative illustration of the claimed trade-off; without this, the necessity of the adaptive extension cannot be assessed.

    Authors: We agree that the abstract would be strengthened by including the EWC loss formulation. In the revised manuscript we will explicitly state the standard EWC objective (including the lambda-weighted mean-squared utility term) within the abstract itself. We will also add a concise explanation of how a fixed lambda over-regularizes parameters relevant to weak triggers, creating the observed trade-off, and include a brief quantitative illustration (e.g., ASR versus fidelity curves under fixed versus adaptive lambda) either in the abstract or immediately following it in the introduction. These additions will make the motivation self-contained without lengthening the abstract excessively. revision: yes

  2. Referee: Proposed method (cosine-aware component): The cosine-based semantic utility is presented as the mechanism that 'transforms EWC from a fixed penalty into a context-sensitive constraint,' but no derivation, ablation, or justification is given for why cosine similarity on semantic embeddings correctly identifies fidelity-critical parameters versus backdoor directions; this choice is central to the claim of eliminating the trade-off and must be shown not to introduce new biases in fine-grained image statistics.

    Authors: The cosine utility is chosen because it measures directional alignment in semantic embedding space, which we hypothesize better separates parameters that preserve global image semantics (fidelity-critical) from those that can accommodate trigger-specific directions. While the manuscript presents the formulation, we acknowledge the absence of an explicit derivation and ablation. In revision we will add a short derivation in Section 3 explaining why cosine similarity is preferred over Euclidean or MSE alternatives for semantic utility, together with an ablation study that replaces cosine with other similarity measures and reports the resulting ASR-fidelity trade-offs. To address potential biases in fine-grained statistics, we will include additional analysis (e.g., per-frequency FID components and visual inspection of high-frequency details) demonstrating that the adaptive regularization does not introduce measurable artifacts beyond those of the baselines. revision: yes

  3. Referee: Experiments: The abstract asserts 'improved ASR-fidelity balance and enhanced robustness on OOD datasets' without reporting statistical significance, exact implementation details, controls for post-hoc hyperparameter selection, or baseline comparisons on the same weak-trigger settings; these omissions make it impossible to verify whether the adaptive scheduling, rather than the cosine utility, drives the gains.

    Authors: We agree that stronger statistical reporting and controls are necessary. The current experiments already evaluate weak-trigger settings and OOD datasets, but we will expand the experimental section to report means and standard deviations over at least five random seeds, include exact hyperparameter values and selection procedures (with a held-out validation protocol to avoid post-hoc tuning), and ensure all baselines are re-run under identical weak-trigger conditions. We will also add an explicit ablation that isolates the adaptive scheduling component from the cosine utility to quantify their individual contributions. These changes will be placed in the main results section and an expanded appendix. revision: yes

Circularity Check

0 steps flagged

No circularity: claims rest on proposed adaptive method and experiments, not reductions to inputs

Full rationale

The abstract introduces standard EWC limitations and proposes Cosine-Aware Adaptive EWC with cosine-based semantic utility and adaptive scheduling to balance ASR and fidelity. No equations, parameter fits, self-citations, or derivations are shown that would make any prediction equivalent to its inputs by construction. The central claim of transforming EWC into a context-sensitive constraint is presented as a novel proposal validated by experiments, with independent content from the method design rather than self-definitional or fitted-input circularity. This is the expected honest non-finding for a methods paper without load-bearing reductions.

Axiom & Free-Parameter Ledger

0 free parameters · 0 axioms · 0 invented entities

Abstract provides no information on free parameters, axioms, or invented entities; all such elements are unknown without the full text.

pith-pipeline@v0.9.0 · 5472 in / 1107 out tokens · 78630 ms · 2026-05-12T00:46:57.249358+00:00 · methodology


Reference graph

Works this paper leans on

30 extracted references · 30 canonical work pages · 1 internal anchor

  1. [1]

    Memory aware synapses: Learning what (not) to forget

    Rahaf Aljundi, Francesca Babiloni, Mohamed Elhoseiny, Marcus Rohrbach, and Tinne Tuytelaars. Memory aware synapses: Learning what (not) to forget. In ECCV, 2018.

  2. [2]

    Elijah: Eliminating backdoors injected in diffusion models via distribution shift

    Shengwei An, Sheng-Yen Chou, Kaiyuan Zhang, Qiuling Xu, Guanhong Tao, Guangyu Shen, Siyuan Cheng, Shiqing Ma, Pin-Yu Chen, Tsung-Yi Ho, and Xiangyu Zhang. Elijah: Eliminating backdoors injected in diffusion models via distribution shift. In AAAI, 2024.

  3. [3]

    Trojan source: Invisible vulnerabilities

    Nicholas Boucher and Ross Anderson. Trojan source: Invisible vulnerabilities. In USENIX Security, pages 1619–1636.

  4. [4]

    Trojdiff: Trojan attacks on diffusion models with diverse targets

    Weixin Chen, Dawn Song, and Bo Li. Trojdiff: Trojan attacks on diffusion models with diverse targets. In CVPR.

  5. [5]

    Ufid: A unified framework for black-box input-level backdoor detection on diffusion models

    Zihan Guan, Mengxuan Hu, Sheng Li, and Anil Kumar Vullikanti. Ufid: A unified framework for black-box input-level backdoor detection on diffusion models. AAAI, 39(26):27312–27320, 2025.

  6. [6]

    Glyphnet: Homoglyph domains dataset and detection using attention-based cnn

    Akshat Gupta, Laxman Singh Tomar, and Ridhima Garg. Glyphnet: Homoglyph domains dataset and detection using attention-based cnn. In AAAI AICS, 2023.

  7. [7]

    Uibdiffusion: Universal imperceptible backdoor attack for diffusion models

    Yuning Han, Bingyin Zhao, Rui Chu, Feng Luo, Biplab Sikdar, and Yingjie Lao. Uibdiffusion: Universal imperceptible backdoor attack for diffusion models. In CVPR, 2025.

  8. [8]

    LoRA: Low-Rank Adaptation of Large Language Models

    Edward J. Hu, Yelong Shen, Phillip Wallis, Zeyuan Allen-Zhu, Yuanzhi Li, Shean Wang, Lu Wang, and Weizhu Chen. Lora: Low-rank adaptation of large language models. In ICLR, 2022. arXiv:2106.09685.

  9. [9]

    Silent branding attack: Trigger-free data poisoning attack on text-to-image diffusion models

    Sangwon Jang, June Suk Choi, Jaehyeong Jo, Kimin Lee, and Sung Ju Hwang. Silent branding attack: Trigger-free data poisoning attack on text-to-image diffusion models. In CVPR, 2025.

  10. [10]

    Backdoor defense in diffusion models via spatial attention unlearning

    Abha Jha, Ashwath Vaithinathan Aravindan, Matthew Salaway, Atharva Sandeep Bhide, and Duygu Nur Yaldiz. Backdoor defense in diffusion models via spatial attention unlearning. In USENIX Security, 2025.

  11. [11]

    Diff-cleanse: Identifying and mitigating backdoor attacks in diffusion models

    Hao Jiang, Jin Xiao, Xiaoguang Hu, Tianyou Chen, and Jiajia Zhao. Diff-cleanse: Identifying and mitigating backdoor attacks in diffusion models. arXiv preprint, 2024.

  12. [12]

    Overcoming catastrophic forgetting in neural networks

    James Kirkpatrick, Razvan Pascanu, Neil Rabinowitz, Joel Veness, Guillaume Desjardins, Andrei A. Rusu, Kieran Milan, John Quan, Tiago Ramalho, Agnieszka Grabska-Barwinska, Demis Hassabis, Claudia Clopath, Dharshan Kumaran, and Raia Hadsell. Overcoming catastrophic forgetting in neural networks. PNAS, 114(13):3521–3526, 2017.

  13. [13]

    Bitabuse: A dataset of visually perturbed texts for defending phishing attacks

    Hanyong Lee, Chaelyn Lee, Yongjae Lee, and Jaesung Lee. Bitabuse: A dataset of visually perturbed texts for defending phishing attacks. In Findings of NAACL, pages 3265–3275.

  14. [14]

    Learning without forgetting

    Zhizhong Li and Derek Hoiem. Learning without forgetting. In ECCV, 2016.

  15. [15]

    Terd: A unified framework for safeguarding diffusion models against backdoors

    Yichuan Mo, Hui Huang, Mingjie Li, Ang Li, and Yisen Wang. Terd: A unified framework for safeguarding diffusion models against backdoors, 2024. arXiv:2409.05294.

  16. [16]

    Backdooring bias (B2) into stable diffusion models

    Ali Naseh, Jaechul Roh, Eugene Bagdasarian, and Amir Houmansadr. Backdooring bias (B2) into stable diffusion models. In USENIX Security, pages 977–996, 2025.

  17. [17]

    Sdxl: Improving latent diffusion models for high-resolution image synthesis

    Dustin Podell, Zion English, Kyle Lacey, Andreas Blattmann, Tim Dockhorn, Jonas Müller, Joe Penna, and Robin Rombach. Sdxl: Improving latent diffusion models for high-resolution image synthesis, 2023.

  18. [18]

    High-resolution image synthesis with latent diffusion models

    Robin Rombach, Andreas Blattmann, Dominik Lorenz, Patrick Esser, and Björn Ommer. High-resolution image synthesis with latent diffusion models. In CVPR, pages 10684–10695, 2022.

  19. [19]

    Progress & compress: A scalable framework for continual learning

    Jonathan Schwarz, Jelena Luketina, Wojciech Marian Czarnecki, Agnieszka Grabska-Barwinska, Yee Whye Teh, Razvan Pascanu, and Raia Hadsell. Progress & compress: A scalable framework for continual learning. In ICML, 2018.

  20. [20]

    Nightshade: Prompt-specific poisoning attacks on text-to-image generative models

    Shawn Shan, Wenxin Ding, Josephine Passananti, Stanley Wu, Haitao Zheng, and Ben Y. Zhao. Nightshade: Prompt-specific poisoning attacks on text-to-image generative models. In IEEE Symposium on Security and Privacy, pages 807–825, 2024.

  21. [21]

    Rickrolling the artist: Injecting backdoors into text encoders for text-to-image synthesis

    Lukas Struppek, Dominik Hintersdorf, and Kristian Kersting. Rickrolling the artist: Injecting backdoors into text encoders for text-to-image synthesis. In ICCV, 2023.

  22. [22]

    Eviledit: Backdooring text-to-image diffusion models in one second

    Hao Wang, Shangwei Guo, Jialing He, Kangjie Chen, Shudong Zhang, Tianwei Zhang, and Tao Xiang. Eviledit: Backdooring text-to-image diffusion models in one second. In ACM MM, 2024.

  23. [23]

    The stronger the diffusion model, the easier the backdoor: Data poisoning to induce copyright breaches without adjusting finetuning pipeline

    Haonan Wang, Qianli Shen, Yao Tong, Yang Zhang, and Kenji Kawaguchi. The stronger the diffusion model, the easier the backdoor: Data poisoning to induce copyright breaches without adjusting finetuning pipeline, 2024. arXiv:2401.04136.

  24. [24]

    T2ishield: Defending against backdoors on text-to-image diffusion models

    Zhongqi Wang, Jie Zhang, Shiguang Shan, and Xilin Chen. T2ishield: Defending against backdoors on text-to-image diffusion models. In ECCV, 2024.

  25. [25]

    Dynamic attention analysis for backdoor detection in text-to-image diffusion models

    Zhongqi Wang, Jie Zhang, Shiguang Shan, and Xilin Chen. Dynamic attention analysis for backdoor detection in text-to-image diffusion models, 2025. arXiv:2504.20518.

  26. [26]

    Dadet: Safeguarding image conditional diffusion models against adversarial and backdoor attacks via diffusion anomaly detection

    Hongwei Yu, Xinlong Ding, Jiawei Li, Jinlong Wang, Yudong Zhang, Rongquan Wang, Huimin Ma, and Jiansheng Chen. Dadet: Safeguarding image conditional diffusion models against adversarial and backdoor attacks via diffusion anomaly detection. In ICCV, 2025.

  27. [27]

    Continual learning through synaptic intelligence

    Friedemann Zenke, Ben Poole, and Surya Ganguli. Continual learning through synaptic intelligence. In ICML, 2017.

  28. [28]

    Text-to-image diffusion models can be easily backdoored through multimodal data poisoning

    Shengfang Zhai, Yinpeng Dong, Qingni Shen, Shi Pu, Yuejian Fang, and Hang Su. Text-to-image diffusion models can be easily backdoored through multimodal data poisoning. In ACM MM, 2023.

  29. [29]

    Efficient input-level backdoor defense on text-to-image synthesis via neuron activation variation

    Shengfang Zhai, Jiajun Li, Yue Liu, Huanran Chen, Zhihua Tian, Wenjie Qu, Qingni Shen, Ruoxi Jia, Yinpeng Dong, and Jiaheng Zhang. Efficient input-level backdoor defense on text-to-image synthesis via neuron activation variation.

  30. [30]

    Regularize, expand and compress: Nonexpansive continual learning

    Jie Zhang, Junting Zhang, Shalini Ghosh, Dawei Li, Jingwen Zhu, Heming Zhang, and Yalin Wang. Regularize, expand and compress: Nonexpansive continual learning. In WACV, pages 843–851, 2020.