arxiv: 2604.04488 · v1 · submitted 2026-04-06 · 💻 cs.CV · cs.LG

Recognition: no theorem link

A Patch-based Cross-view Regularized Framework for Backdoor Defense in Multimodal Large Language Models

Tianmeng Fang , Yong Wang , Zetai Kong , Zengzhen Su , Jun Wang , Chengjin Yu , Wei Wang

Authors on Pith no claims yet

Pith reviewed 2026-05-10 20:08 UTC · model grok-4.3

classification 💻 cs.CV cs.LG

keywords backdoor defensemultimodal large language modelspatch augmentationcross-view regularizationbackdoor attacksoutput distributionmodel security

0 comments

The pith

A patch-based cross-view regularization framework defends multimodal large language models against backdoors by suppressing trigger responses while preserving normal output.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper establishes a defense framework for multimodal large language models that uses patch-level data augmentation together with cross-view output difference regularization. It works by exploiting the fact that backdoor responses stay abnormally stable under non-semantic changes such as patch alterations, which allows the method to pull apart the output distributions of original and perturbed views and thereby reduce triggering success. At the same time, output entropy constraints prevent the defense from overly restricting benign generation. A sympathetic reader would care because these models are vulnerable to backdoor implantation during fine-tuning, and low-ratio poisoning can cause them to emit predefined harmful outputs on hidden triggers, limiting safe real-world use.

Core claim

The authors introduce a unified defense that combines patch augmentation with cross-view regularity to constrain anomalous behaviors in response to triggered patterns at both feature representation and output distribution levels. By proactively increasing the difference between outputs on original and patch-perturbed views for backdoored inputs, while adding entropy constraints on normal outputs, the method suppresses attack success without degrading the model's ability to generate high-quality text under clean commands.

What carries the argument

Patch-based cross-view output difference regularization, which augments inputs with patches and enforces distributional separation between original and perturbed views to target the invariance of backdoor responses.

If this is right

Attack success rates drop across three models, two tasks, and six different attacks while normal text generation quality stays high.
The framework supports secure deployment even when poisoning occurs at low frequency and triggers remain covert.
Anomalous behaviors are constrained simultaneously at the feature level and the output distribution level.
Entropy constraints keep the model from over-regularizing clean inputs during the defense process.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

Similar cross-view regularization could be tested with other non-semantic perturbations such as additive noise or cropping if those also expose backdoor invariance.
The method's logic suggests it might extend to other vision-language models beyond the three evaluated here, provided the same trigger invariance pattern appears.
Combining the patch regularization with existing input-filtering defenses could produce layered protection against both known and novel backdoor patterns.

Load-bearing premise

Backdoor responses remain abnormally invariant to non-semantic perturbations such as patch changes, so that cross-view regularization can suppress triggers without also flattening benign output distributions.

What would settle it

An experiment in which backdoor responses vary substantially under patch perturbations, yet the defense still fails to lower attack success rates or else degrades normal generation performance across the tested models and attacks.

read the original abstract

Multimodal large language models have become an important infrastructure for unified processing of visual and linguistic tasks. However, such models are highly susceptible to backdoor implantation during supervised fine-tuning and will steadily output the attacker's predefined harmful responses once a specific trigger pattern is activated. The core challenge of backdoor defense lies in suppressing attack success under low poisoning ratios while preserving the model's normal generation ability. These two objectives are inherently conflicting. Strong suppression often degrades benign performance, whereas weak regularization fails to mitigate backdoor behaviors. To this end, we propose a unified defense framework based on patch augmentation and cross-view regularity, which simultaneously constrains the model's anomalous behaviors in response to triggered patterns from both the feature representation and output distribution levels. Specifically, patch-level data augmentation is combined with cross-view output difference regularization to exploit the fact that backdoor responses are abnormally invariant to non-semantic perturbations and to proactively pull apart the output distributions of the original and perturbed views, thereby significantly suppressing the success rate of backdoor triggering. At the same time, we avoid over-suppression of the model during defense by imposing output entropy constraints, ensuring the quality of normal command generation. Experimental results across three models, two tasks, and six attacks show that our proposed defense method effectively reduces the attack success rate while maintaining a high level of normal text generation capability. Our work enables the secure, controlled deployment of large-scale multimodal models in realistic low-frequency poisoning and covert triggering scenarios.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Desk Editor's Note private letter to a colleague

The paper's new angle is a unified patch-augmentation plus cross-view output regularization defense for backdoors in MLLMs, but the whole thing rests on an unmeasured claim that backdoor outputs stay abnormally stable under non-semantic patches.

read the letter

The main thing to know is that this work puts forward a defense framework for backdoor attacks on multimodal LLMs. It augments inputs with patches and adds a term that pushes the output distributions of the original and patched views apart, while using entropy constraints to avoid wrecking clean generation. The setup targets low poisoning ratios and covert triggers, which lines up with realistic deployment risks.

Referee Report

2 major / 0 minor

Summary. The paper proposes a unified defense framework for backdoor attacks in multimodal large language models (MLLMs) that combines patch-level data augmentation with cross-view output-difference regularization. The method exploits the claimed property that backdoor responses are abnormally invariant to non-semantic perturbations (such as patch changes) to pull apart output distributions between original and perturbed views, while adding output-entropy constraints to avoid degrading benign generation. Experiments across three models, two tasks, and six attacks are reported to demonstrate reduced attack success rates while preserving normal text-generation quality.

Significance. If the central empirical results hold and the invariance property is selective, the work would offer a practical, low-overhead defense for MLLMs under realistic low-poisoning and covert-trigger conditions, addressing a key security barrier to safe deployment of these models. The approach is notable for directly targeting the suppression-versus-benign-performance trade-off through augmentation plus regularization rather than post-hoc detection.

major comments (2)

Abstract: the defense is built on the unverified premise that 'backdoor responses are abnormally invariant to non-semantic perturbations'; no pre-defense measurement (e.g., output variance or distribution distance under patch changes for triggered versus clean inputs) is described, yet this selectivity is required for the cross-view regularization to suppress ASR without harming benign outputs. If the invariance is attack- or model-specific rather than general, the reported trade-off does not follow.
Results section (as summarized in abstract): the headline claim of effectiveness 'across three models, two tasks, and six attacks' is stated without any quantitative metrics, baseline comparisons, ablation results on the regularization terms, or error analysis, preventing evaluation of whether the method actually achieves the claimed balance.

Simulated Author's Rebuttal

2 responses · 0 unresolved

We thank the referee for the constructive feedback on the abstract and results presentation. We address each major comment below, indicating where revisions will be made to strengthen the manuscript.

read point-by-point responses

Referee: Abstract: the defense is built on the unverified premise that 'backdoor responses are abnormally invariant to non-semantic perturbations'; no pre-defense measurement (e.g., output variance or distribution distance under patch changes for triggered versus clean inputs) is described, yet this selectivity is required for the cross-view regularization to suppress ASR without harming benign outputs. If the invariance is attack- or model-specific rather than general, the reported trade-off does not follow.

Authors: We agree that the abstract states the invariance property as a premise without explicit pre-defense quantification. The full manuscript demonstrates the property indirectly through post-defense outcomes (reduced ASR with preserved benign generation across attacks), but we acknowledge this is insufficient for verifying selectivity upfront. We will add a new table and accompanying analysis in the Experiments section reporting quantitative pre-defense metrics, such as output variance (token-level entropy differences) and distribution distances (e.g., KL divergence or cosine similarity on logit distributions) between original and patch-perturbed views, separately for clean and triggered inputs across all six attacks and three models. This will confirm the invariance is more pronounced for backdoored responses. revision: yes
Referee: Results section (as summarized in abstract): the headline claim of effectiveness 'across three models, two tasks, and six attacks' is stated without any quantitative metrics, baseline comparisons, ablation results on the regularization terms, or error analysis, preventing evaluation of whether the method actually achieves the claimed balance.

Authors: The abstract serves as a high-level summary and intentionally omits specific numbers for conciseness. The full Results section (including Tables 1-3 and Figures 2-4) contains the requested details: quantitative ASR reductions (e.g., from 85-98% to 5-15% depending on attack), direct comparisons to baseline defenses (e.g., fine-tuning and detection methods), ablations isolating the cross-view regularization and entropy terms (showing each contributes to the trade-off), and error analysis (e.g., cases of residual backdoor activation under high poisoning). We will revise the abstract to incorporate key quantitative highlights and a brief mention of ablations to better support the headline claim. revision: yes

Circularity Check

0 steps flagged

No significant circularity in empirical algorithmic framework

full rationale

The paper presents a patch-augmentation plus cross-view regularization defense whose central steps are algorithmic choices (output-difference penalty plus entropy constraint) motivated by an external assumption about backdoor invariance. No equations, self-definitional loops, or fitted parameters are shown to reduce to their own inputs by construction. The reported ASR reduction and benign preservation are validated experimentally across three models, two tasks, and six attacks, rendering the claims externally falsifiable rather than tautological. No load-bearing self-citations, uniqueness theorems, or ansatzes imported from prior author work appear in the derivation chain.

Axiom & Free-Parameter Ledger

0 free parameters · 0 axioms · 0 invented entities

Abstract describes an algorithmic defense without explicit free parameters, mathematical axioms, or newly postulated entities; the approach rests on the empirical observation of backdoor invariance to perturbations.

pith-pipeline@v0.9.0 · 5579 in / 1035 out tokens · 136542 ms · 2026-05-10T20:08:23.988994+00:00 · methodology

discussion (0)

Reference graph

Works this paper leans on

93 extracted references · 43 canonical work pages · 9 internal anchors

[1]

In: Proceedings of the IEEE/CVF Conference on Com- puter Vision and Pattern Recogni- tion, pp

Liang, S., Zhu, M., Liu, A., Wu, B., Cao, X., Chang, E.-C.: Bad- clip: Dual-embedding guided back- door attack on multimodal con- trastive learning. In: Proceedings of the IEEE/CVF Conference on Com- puter Vision and Pattern Recogni- tion, pp. 24645–24654 (2024)

2024
[2]

Interna- tional Journal of Computer Vision, 1–20 (2025)

Liang, J., Liang, S., Liu, A., Cao, X.: Vl-trojan: Multimodal instruction 18 backdoor attacks against autoregres- sive visual language models. Interna- tional Journal of Computer Vision, 1–20 (2025)

2025
[3]

Inter- national Journal of Computer Vision 133(6), 3568–3585 (2025)

Liu, A., Liu, X., Zhang, X., Xiao, Y., Zhou, Y., Liang, S., Wang, J., Cao, X., Tao, D.: Pre-trained trojan attacks for visual recognition. Inter- national Journal of Computer Vision 133(6), 3568–3585 (2025)

2025
[4]

arXiv preprint arXiv:2402.11473 (2024)

Liang, J., Liang, S., Liu, A., Jia, X., Kuang, J., Cao, X.: Poisoned forgery face: Towards backdoor attacks on face forgery detection. arXiv preprint arXiv:2402.11473 (2024)

work page arXiv 2024
[5]

Compromisingembodiedagentswithcontextualbackdoorattacks

Liu, A., Zhou, Y., Liu, X., Zhang, T., Liang, S., Wang, J., Pu, Y., Li, T., Zhang, J., Zhou, W., et al.: Com- promising embodied agents with contextual backdoor attacks. arXiv preprint arXiv:2408.02882 (2024)

work page arXiv 2024
[6]

In: Proceedings of the Computer Vision and Pattern Recognition Conference, pp

Liang, S., Liang, J., Pang, T., Du, C., Liu, A., Zhu, M., Cao, X., Tao, D.: Revisiting backdoor attacks against large vision-language models from domain shift. In: Proceedings of the Computer Vision and Pattern Recognition Conference, pp. 9477– 9486 (2025)

2025
[7]

arXiv preprint arXiv:2406.04031 (2024)

Ying, Z., Liu, A., Zhang, T., Yu, Z., Liang, S., Liu, X., Tao, D.: Jailbreak vision language models via bi-modal adversarial prompt. arXiv preprint arXiv:2406.04031 (2024)

work page arXiv 2024
[8]

arXiv preprint arXiv:2503.04833 , year=

Lu, L., Pang, S., Liang, S., Zhu, H., Zeng, X., Liu, A., Liu, Y., Zhou, Y.: Adversarial training for multi- modal large language models against jailbreak attacks. arXiv preprint arXiv:2503.04833 (2025)

work page arXiv 2025
[9]

Reasoning-augmented conversation for multi-turn jailbreak attacks on large language models.arXiv preprint arXiv:2502.11054, 1, 2025

Ying, Z., Zhang, D., Jing, Z., Xiao, Y., Zou, Q., Liu, A., Liang, S., Zhang, X., Liu, X., Tao, D.: Reasoning-augmented conversation for multi-turn jailbreak attacks on large language models. arXiv preprint arXiv:2502.11054 (2025)

work page arXiv 2025
[10]

arXiv preprint arXiv:2509.21400 (2025)

Zeng, X., Liang, S., Lu, L., Zhu, H., Liu, E., Dang, J., Zhou, Y., Pang, S.: Safesteer: Adaptive subspace steer- ing for efficient jailbreak defense in vision-language models. arXiv preprint arXiv:2509.21400 (2025)

work page arXiv 2025
[11]

International Jour- nal of Computer Vision134(1), 18 (2026)

Ying, Z., Liu, A., Liang, S., Huang, L., Guo, J., Zhou, W., Liu, X., Tao, D.: Safebench: A safety evalua- tion framework for multimodal large language models. International Jour- nal of Computer Vision134(1), 18 (2026)

2026
[12]

Advances in neural information processing sys- tems36, 34892–34916 (2023)

Liu, H., Li, C., Wu, Q., Lee, Y.J.: Visual instruction tuning. Advances in neural information processing sys- tems36, 34892–34916 (2023)

2023
[13]

GPT-4 Technical Report

Achiam, J., Adler, S., Agarwal, S., Ahmad, L., Akkaya, I., Aleman, F.L., Almeida, D., Altenschmidt, J., Altman, S., Anadkat, S., et al.: Gpt- 4 technical report. arXiv preprint arXiv:2303.08774 (2023)

work page internal anchor Pith review Pith/arXiv arXiv 2023
[14]

In: International Conference on Machine Learning, pp

Li, J., Li, D., Savarese, S., Hoi, S.: Blip-2: Bootstrapping language- image pre-training with frozen image encoders and large language mod- els. In: International Conference on Machine Learning, pp. 19730–19742 (2023). PMLR

2023
[15]

Driess, D., Xia, F., Sajjadi, M.S., Lynch, C., Chowdhery, A., Wahid, A., Tompson, J., Vuong, Q., Yu, 19 T., Huang, W., et al.: Palm-e: An embodied multimodal language model (2023)

2023
[16]

Advances in neural information pro- cessing systems36, 49250–49267 (2023)

Dai, W., Li, J., Li, D., Tiong, A., Zhao, J., Wang, W., Li, B., Fung, P.N., Hoi, S.: Instructblip: Towards general-purpose vision-language models with instruction tuning. Advances in neural information pro- cessing systems36, 49250–49267 (2023)

2023
[17]

In: Machine Learning for Health (ML4H), pp

Moor, M., Huang, Q., Wu, S., Yasunaga, M., Dalmia, Y., Leskovec, J., Zakka, C., Reis, E.P., Rajpurkar, P.: Med-flamingo: a multimodal medical few-shot learner. In: Machine Learning for Health (ML4H), pp. 353–367 (2023). PMLR

2023
[18]

In: European Conference on Computer Vision, pp

Sima, C., Renz, K., Chitta, K., Chen, L., Zhang, H., Xie, C., Beißwenger, J., Luo, P., Geiger, A., Li, H.: Drivelm: Driving with graph visual question answering. In: European Conference on Computer Vision, pp. 256–274 (2024). Springer

2024
[19]

BadNets: Identifying Vulnerabilities in the Machine Learning Model Supply Chain

Gu, T., Dolan-Gavitt, B., Garg, S.: Badnets: Identifying vulnera- bilities in the machine learning model supply chain. arXiv preprint arXiv:1708.06733 (2017)

work page internal anchor Pith review arXiv 2017
[20]

In: Findings of the Association for Computational Lin- guistics: NAACL 2024, pp

Huang, H., Zhao, Z., Backes, M., Shen, Y., Zhang, Y.: Composite backdoor attacks against large lan- guage models. In: Findings of the Association for Computational Lin- guistics: NAACL 2024, pp. 1459– 1472 (2024)

2024
[21]

Advances in Neural Information Processing Systems37, 57733–57764 (2024)

Xu, Y., Yao, J., Shu, M., Sun, Y., Wu, Z., Yu, N., Goldstein, T., Huang, F.: Shadowcast: Stealthy data poisoning attacks against vision-language models. Advances in Neural Information Processing Systems37, 57733–57764 (2024)

2024
[22]

arXiv preprint arXiv:2105.12400 (2021)

Qi, F., Li, M., Chen, Y., Zhang, Z., Liu, Z., Wang, Y., Sun, M.: Hid- den killer: Invisible textual backdoor attacks with syntactic trigger. arXiv preprint arXiv:2105.12400 (2021)

work page arXiv 2021
[23]

In: Proceedings of the 26th ACM SIGKDD Interna- tional Conference on Knowledge Dis- covery & Data Mining, pp

Tang, R., Du, M., Liu, N., Yang, F., Hu, X.: An embarrassingly simple approach for trojan attack in deep neural networks. In: Proceedings of the 26th ACM SIGKDD Interna- tional Conference on Knowledge Dis- covery & Data Mining, pp. 218–228 (2020)

2020
[24]

In: 25th Annual Network And Distributed System Security Sympo- sium (NDSS 2018) (2018)

Liu, Y., Ma, S., Aafer, Y., Lee, W.-C., Zhai, J., Wang, W., Zhang, X.: Trojaning attack on neural net- works. In: 25th Annual Network And Distributed System Security Sympo- sium (NDSS 2018) (2018). Internet Soc

2018
[25]

In: Findings of the Association for Computational Lin- guistics: ACL 2025, pp

Yin, Z., Ye, M., Cao, Y., Wang, J., Chang, A., Liu, H., Chen, J., Wang, T., Ma, F.: Shadow-activated back- door attacks on multimodal large language models. In: Findings of the Association for Computational Lin- guistics: ACL 2025, pp. 4808–4829 (2025)

2025
[26]

Backdooring vision-language models with out-of-distribution data

Lyu, W., Yao, J., Gupta, S., Pang, L., Sun, T., Yi, L., Hu, L., Ling, H., Chen, C.: Backdoor- ing vision-language models with out- of-distribution data. arXiv preprint arXiv:2410.01264 (2024) 20

work page arXiv 2024
[27]

In: Proceedings of the IEEE/CVF Conference on Com- puter Vision and Pattern Recogni- tion, pp

Ishmam, A.M., Thomas, C.: Seman- tic shield: Defending vision-language models against backdooring and poi- soning via fine-grained knowledge alignment. In: Proceedings of the IEEE/CVF Conference on Com- puter Vision and Pattern Recogni- tion, pp. 24820–24830 (2024)

2024
[28]

arXiv preprint arXiv:2412.20392 (2024)

Zhang, Z., He, S., Wang, H., Shen, B., Feng, L.: Defending multimodal backdoored models by repulsive visual prompt tuning. arXiv preprint arXiv:2412.20392 (2024)

work page arXiv 2024
[29]

Advances in neural information processing systems35, 23716–23736 (2022)

Alayrac, J.-B., Donahue, J., Luc, P., Miech, A., Barr, I., Hasson, Y., Lenc, K., Mensch, A., Millican, K., Reynolds, M.,et al.: Flamingo: a visual language model for few- shot learning. Advances in neural information processing systems35, 23716–23736 (2022)

2022
[30]

OpenFlamingo: An Open-Source Framework for Training Large Autoregressive Vision-Language Models

Awadalla, A., Gao, I., Gardner, J., Hessel, J., Hanafy, Y., Zhu, W., Marathe, K., Bitton, Y., Gadre, S., Sagawa, S., et al.: Open- flamingo: An open-source frame- work for training large autoregres- sive vision-language models. arXiv preprint arXiv:2308.01390 (2023)

work page internal anchor Pith review arXiv 2023
[31]

MiniGPT-4: Enhancing Vision-Language Understanding with Advanced Large Language Models

Zhu, D., Chen, J., Shen, X., Li, X., Elhoseiny, M.: Minigpt-4: Enhanc- ing vision-language understanding with advanced large language mod- els. arXiv preprint arXiv:2304.10592 (2023)

work page internal anchor Pith review arXiv 2023
[32]

In: International Conference on Machine Learning, pp

Radford, A., Kim, J.W., Hallacy, C., Ramesh, A., Goh, G., Agarwal, S., Sastry, G., Askell, A., Mishkin, P., Clark, J.,et al.: Learning transfer- able visual models from natural lan- guage supervision. In: International Conference on Machine Learning, pp. 8748–8763 (2021). PmLR

2021
[33]

In: International Conference on Machine Learning, pp

Jia, C., Yang, Y., Xia, Y., Chen, Y.- T., Parekh, Z., Pham, H., Le, Q., Sung, Y.-H., Li, Z., Duerig, T.: Scal- ing up visual and vision-language representation learning with noisy text supervision. In: International Conference on Machine Learning, pp. 4904–4916 (2021). PMLR

2021
[34]

Kosmos-2: Grounding Multimodal Large Language Models to the World

Peng, Z., Wang, W., Dong, L., Hao, Y., Huang, S., Ma, S., Wei, F.: Kosmos-2: Grounding multimodal large language models to the world. arXiv preprint arXiv:2306.14824 (2023)

work page internal anchor Pith review arXiv 2023
[35]

IEEE Transactions on Pat- tern Analysis and Machine Intelli- gence (2025)

Li, B., Zhang, Y., Chen, L., Wang, J., Pu, F., Cahyono, J.A., Yang, J., Li, C., Liu, Z.: Otter: A multi-modal model with in-context instruction tuning. IEEE Transactions on Pat- tern Analysis and Machine Intelli- gence (2025)

2025
[36]

Qwen-VL: A Versatile Vision-Language Model for Understanding, Localization, Text Reading, and Beyond

Bai, J., Bai, S., Yang, S., Wang, S., Tan, S., Wang, P., Lin, J., Zhou, C., Zhou, J.: Qwen-vl: A frontier large vision-language model with versatile abilities. arXiv preprint arXiv:2308.129661(2), 3 (2023)

work page internal anchor Pith review Pith/arXiv arXiv 2023
[37]

arXiv preprint arXiv:2508.02028 , year=

Zhang, T., Jin, T., Wang, L., Liu, J., Liang, S., Zhang, M., Liu, A., Liu, X.: Bench2advlm: a closed-loop benchmark for vision-language mod- els in autonomous driving. arXiv preprint arXiv:2508.02028 (2025)

work page arXiv 2025
[38]

Agentsafe: Benchmarking the safety of embodied agents on hazardous instructions

Liu, A., Ying, Z., Wang, L., Mu, J., Guo, J., Wang, J., Ma, Y., Liang, S., Zhang, M., Liu, X., et al.: Agentsafe: Benchmarking the 21 safety of embodied agents on haz- ardous instructions. arXiv preprint arXiv:2506.14697 (2025)

work page arXiv 2025
[39]

Visual Adversarial Attack on Vision-Language Models for Autonomous Driving

Zhang, T., Wang, L., Zhang, X., Zhang, Y., Jia, B., Liang, S., Hu, S., Fu, Q., Liu, A., Liu, X.: Visual adversarial attack on vision-language models for autonomous driving. arXiv preprint arXiv:2411.18275 (2024)

work page internal anchor Pith review Pith/arXiv arXiv 2024
[40]

arXiv preprint arXiv:2501.13563 , year=

Wang, L., Zhang, T., Qu, Y., Liang, S., Chen, Y., Liu, A., Liu, X., Tao, D.: Black-box adversarial attack on vision language models for autonomous driving. arXiv preprint arXiv:2501.13563 (2025)

work page arXiv 2025
[41]

Visual Intelligence2(1), 1–10 (2024)

Kong, D., Liang, S., Zhu, X., Zhong, Y., Ren, W.: Patch is enough: nat- uralistic adversarial patch against vision-language pre-training models. Visual Intelligence2(1), 1–10 (2024)

2024
[42]

arXiv preprint arXiv:2509.20196 , year=

Kong, D., Yu, S., Liang, S., Liang, J., Gan, J., Liu, A., Ren, W.: Universal camouflage attack on vision-language models for autonomous driving. arXiv preprint arXiv:2509.20196 (2025)

work page arXiv 2025
[43]

Targeted Backdoor Attacks on Deep Learning Systems Using Data Poisoning

Chen, X., Liu, C., Li, B., Lu, K., Song, D.: Targeted backdoor attacks on deep learning systems using data poisoning. arXiv preprint arXiv:1712.05526 (2017)

work page internal anchor Pith review arXiv 2017
[44]

arXiv preprint arXiv:2305.09677 (2023)

Liu, X., Tan, Y.-a., Wang, Y., Qiu, K., Li, Y.: Stealthy low- frequency backdoor attack against deep neural networks. arXiv preprint arXiv:2305.09677 (2023)

work page arXiv 2023
[45]

arXiv preprint arXiv:2209.05244 (2022)

Wang, Y., Shi, H., Min, R., Wu, R., Liang, S., Wu, Y., Liang, D., Liu, A.: Universal backdoor attacks detec- tion via adaptive adversarial probe. arXiv preprint arXiv:2209.05244 (2022)

work page arXiv 2022
[46]

arXiv preprint arXiv:2403.16257 (2024)

Liang, S., Liu, K., Gong, J., Liang, J., Xun, Y., Chang, E.- C., Cao, X.: Unlearning backdoor threats: Enhancing backdoor defense in multimodal contrastive learning via local token unlearning. arXiv preprint arXiv:2403.16257 (2024)

work page arXiv 2024
[47]

Advances in Neural Informa- tion Processing Systems37, 114928– 114964 (2024)

Zhu, M., Liang, S., Wu, B.: Breaking the false sense of security in back- door defense through re-activation attack. Advances in Neural Informa- tion Processing Systems37, 114928– 114964 (2024)

2024
[48]

arXiv preprint arXiv:2409.15968 (2024)

Kuang, J., Liang, S., Liang, J., Liu, K., Cao, X.: Adversarial back- door defense in clip. arXiv preprint arXiv:2409.15968 (2024)

work page arXiv 2024
[49]

arXiv preprint arXiv:2506.05401 (2025)

Xun, Y., Liang, S., Jia, X., Liu, X., Cao, X.: Robust anti-backdoor instruction tuning in lvlms. arXiv preprint arXiv:2506.05401 (2025)

work page arXiv 2025
[50]

arXiv preprint arXiv:2503.16872 (2025)

Wang, X., Liang, S., Liao, D., Fang, H., Liu, A., Cao, X., Lu, Y.-l., Chang, E.-C., Gao, X.: Lie detec- tor: Unified backdoor detection via cross-examination framework. arXiv preprint arXiv:2503.16872 (2025)

work page arXiv 2025
[51]

arXiv preprint arXiv:2507.01321 (2025)

Ren, Z., Liang, S., Liu, A., Tao, D.: Iclshield: Exploring and mitigating in-context learning backdoor attacks. arXiv preprint arXiv:2507.01321 (2025)

work page arXiv 2025
[52]

arXiv preprint arXiv:2502.18511 (2025)

Liu, X., Liang, S., Han, M., Luo, Y., Liu, A., Cai, X., He, Z., Tao, D.: Elba-bench: An efficient learn- ing backdoor attacks benchmark 22 for large language models. arXiv preprint arXiv:2502.18511 (2025)

work page arXiv 2025
[53]

In: 2025 IEEE/ACM 47th Interna- tional Conference on Software Engi- neering (ICSE), pp

Xiao, Y., Liu, A., Zhang, X., Zhang, T., Li, T., Liang, S., Liu, X., Liu, Y., Tao, D.: Bdefects4nn: A backdoor defect database for controlled local- ization studies in neural networks. In: 2025 IEEE/ACM 47th Interna- tional Conference on Software Engi- neering (ICSE), pp. 606–606 (2025). IEEE Computer Society

2025
[54]

IEEE Transactions on Information Foren- sics and Security (2026)

Liang, S., Gong, J., Fang, T., Liu, A., Wang, T., Cao, X., Tao, D., Ee-Chien, C.: Trapflow: Controllable website fingerprinting defense via dynamic backdoor learning. IEEE Transactions on Information Foren- sics and Security (2026)

2026
[55]

arXiv preprint arXiv:2505.06413 , year=

Liu, M., Liang, S., Howlader, K., Wang, L., Tao, D., Zhang, W.: Natural reflection backdoor attack on vision language model for autonomous driving. arXiv preprint arXiv:2505.06413 (2025)

work page arXiv 2025
[56]

Wanet–imperceptible warping-based back- door attack,

Nguyen, A., Tran, A.: Wanet– imperceptible warping-based backdoor attack. arXiv preprint arXiv:2102.10369 (2021)

work page arXiv 2021
[57]

Advances in Neural Information Processing Systems33, 3454–3464 (2020)

Nguyen, T.A., Tran, A.: Input-aware dynamic backdoor attack. Advances in Neural Information Processing Systems33, 3454–3464 (2020)

2020
[58]

In: Proceedings of the IEEE/CVF Conference on Com- puter Vision and Pattern Recogni- tion, pp

Walmer, M., Sikka, K., Sur, I., Shri- vastava, A., Jha, S.: Dual-key mul- timodal backdoors for visual ques- tion answering. In: Proceedings of the IEEE/CVF Conference on Com- puter Vision and Pattern Recogni- tion, pp. 15375–15385 (2022)

2022
[59]

Turner, A., Tsipras, D., Madry, A.: Clean-label backdoor attacks (2018)

2018
[60]

IEEE Transactions on Depend- able and Secure Computing18(5), 2088–2105 (2020)

Li, S., Xue, M., Zhao, B.Z.H., Zhu, H., Zhang, X.: Invisible backdoor attacks on deep neural networks via steganography and regulariza- tion. IEEE Transactions on Depend- able and Secure Computing18(5), 2088–2105 (2020)

2088
[61]

Sleeper Agents: Training Deceptive LLMs that Persist Through Safety Training

Hubinger, E., Denison, C., Mu, J., Lambert, M., Tong, M., MacDi- armid, M., Lanham, T., Ziegler, D.M., Maxwell, T., Cheng, N., et al.: Sleeper agents: Training deceptive llms that persist through safety training. arXiv preprint arXiv:2401.05566 (2024)

work page internal anchor Pith review arXiv 2024
[62]

In: 2019 IEEE Sympo- sium on Security and Privacy (SP), pp

Wang, B., Yao, Y., Shan, S., Li, H., Viswanath, B., Zheng, H., Zhao, B.Y.: Neural cleanse: Identifying and mitigating backdoor attacks in neu- ral networks. In: 2019 IEEE Sympo- sium on Security and Privacy (SP), pp. 707–723 (2019). IEEE

2019
[63]

In: International Sympo- sium on Research in Attacks, Intru- sions, and Defenses, pp

Liu, K., Dolan-Gavitt, B., Garg, S.: Fine-pruning: Defending against backdooring attacks on deep neural networks. In: International Sympo- sium on Research in Attacks, Intru- sions, and Defenses, pp. 273–294 (2018). Springer

2018
[64]

In: Pro- ceedings of the 35th Annual Com- puter Security Applications Confer- ence, pp

Gao, Y., Xu, C., Wang, D., Chen, S., Ranasinghe, D.C., Nepal, S.: Strip: A defence against trojan attacks on deep neural networks. In: Pro- ceedings of the 35th Annual Com- puter Security Applications Confer- ence, pp. 113–125 (2019)

2019
[65]

Advances in Neural 23 Information Processing Systems34, 16913–16925 (2021)

Wu, D., Wang, Y.: Adversarial neuron pruning purifies backdoored deep models. Advances in Neural 23 Information Processing Systems34, 16913–16925 (2021)

2021
[66]

In: Proceedings of the IEEE/CVF Conference on Com- puter Vision and Pattern Recogni- tion, pp

Pang, L., Sun, T., Ling, H., Chen, C.: Backdoor cleansing with unla- beled data. In: Proceedings of the IEEE/CVF Conference on Com- puter Vision and Pattern Recogni- tion, pp. 12218–12227 (2023)

2023
[67]

International Journal of Com- puter Vision134(4), 144 (2026)

Liang, S., Liu, J., Zhai, J., Fang, T., Tu, R., Liu, A., Cao, X., Tao, D.: T2vshield: Model-agnostic jail- break defense for text-to-video mod- els. International Journal of Com- puter Vision134(4), 144 (2026)

2026
[68]

arXiv preprint arXiv:2505.07258 (2025)

Wang, W., Liang, S., Zhang, Y., Jia, X., Lin, H., Cao, X.: No query, no access. arXiv preprint arXiv:2505.07258 (2025)

work page arXiv 2025
[69]

arXiv preprint arXiv:2506.12430 (2025)

Ying, Z., Wu, S., Hao, R., Ying, P., Sun, S., Chen, P., Chen, J., Du, H., Shen, K., Wu, S., et al.: Pushing the limits of safety: A technical report on the atlas challenge 2025. arXiv preprint arXiv:2506.12430 (2025)

work page arXiv 2025
[70]

Augmix: A simple data processing method to improve robustness and uncertainty

Hendrycks, D., Mu, N., Cubuk, E.D., Zoph, B., Gilmer, J., Laksh- minarayanan, B.: Augmix: A simple data processing method to improve robustness and uncertainty. arXiv preprint arXiv:1912.02781 (2019)

work page arXiv 1912
[71]

arXiv preprint arXiv:1811.12641 , year=

Wei, X., Liang, S., Chen, N., Cao, X.: Transferable adversar- ial attacks for image and video object detection. arXiv preprint arXiv:1811.12641 (2018)

work page arXiv 2018
[72]

arXiv preprint arXiv:2201.08970 (2022)

Liang, S., Wu, B., Fan, Y., Wei, X., Cao, X.: Parallel rectangle flip attack: A query-based black-box attack against object detection. arXiv preprint arXiv:2201.08970 (2022)

work page arXiv 2022
[73]

In: European Conference on Com- puter Vision (2022)

Liang, S., Li, L., Fan, Y., Jia, X., Li, J., Wu, B., Cao, X.: A large-scale multiple-objective method for black- box attack against object detection. In: European Conference on Com- puter Vision (2022)

2022
[74]

arXiv preprint arXiv:2311.11017 (2023)

Liu, J., Zhu, S., Liang, S., Zhang, J., Fang, H., Zhang, W., Chang, E.-C.: Improving adversarial transferability by stable diffusion. arXiv preprint arXiv:2311.11017 (2023)

work page arXiv 2023
[75]

Advances in neural information processing systems33, 596–608 (2020)

Sohn, K., Berthelot, D., Carlini, N., Zhang, Z., Zhang, H., Raffel, C.A., Cubuk, E.D., Kurakin, A., Li, C.-L.: Fixmatch: Simplifying semi- supervised learning with consistency and confidence. Advances in neural information processing systems33, 596–608 (2020)

2020
[76]

In: Proceed- ings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (2023)

Liu, A., Tang, S., Liang, S., Gong, R., Wu, B., Liu, X., Tao, D.: Explor- ing the relationship between archi- tectural design and adversarially robust generalization. In: Proceed- ings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (2023)

2023
[77]

IEEE transactions on pat- tern analysis and machine intelli- gence41(8), 1979–1993 (2018)

Miyato, T., Maeda, S.-i., Koyama, M., Ishii, S.: Virtual adversarial training: a regularization method for supervised and semi-supervised learning. IEEE transactions on pat- tern analysis and machine intelli- gence41(8), 1979–1993 (2018)

1979
[78]

arXiv preprint arXiv:2304.10136 (2023) 24

Wang, Z., Zhang, Z., Liang, S., Wang, X.: Diversifying the high- level features for better adversar- ial transferability. arXiv preprint arXiv:2304.10136 (2023) 24

work page arXiv 2023
[79]

Regularizing Neural Networks by Penalizing Confident Output Distributions

Pereyra, G., Tucker, G., Chorowski, J., Kaiser, L., Hinton, G.: Regular- izing neural networks by penalizing confident output distributions. arXiv preprint arXiv:1701.06548 (2017)

work page Pith review arXiv 2017
[80]

In: Interna- tional Conference on Machine Learn- ing, pp

Guo, C., Pleiss, G., Sun, Y., Wein- berger, K.Q.: On calibration of mod- ern neural networks. In: Interna- tional Conference on Machine Learn- ing, pp. 1321–1330 (2017). PMLR

2017

Showing first 80 references.