Pulling The REINS: Training-Free Safety Alignment of Video Diffusion Models via Representation Steering

Amit K. Roy-Chowdhury; Arindam Dutta; Athula Balachandran; Rohit Kundu; Sarosij Bose

arxiv: 2606.17257 · v1 · pith:ADYDZCI7new · submitted 2026-06-15 · 💻 cs.CV · cs.AI

Pulling The REINS: Training-Free Safety Alignment of Video Diffusion Models via Representation Steering

Rohit Kundu , Arindam Dutta , Sarosij Bose , Athula Balachandran , Amit K. Roy-Chowdhury This is my paper

Pith reviewed 2026-06-27 03:41 UTC · model grok-4.3

classification 💻 cs.CV cs.AI

keywords video diffusionsafety alignmentrepresentation steeringinference-time interventionsupervised PCAhidden-state activationstraining-free alignmentdiffusion transformers

0 comments

The pith

Video diffusion models can be made to generate safe content at inference time by adding one linear direction to their hidden-state activations.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper establishes that safety-relevant structure exists as a linear direction in the hidden activations of video diffusion transformers. This direction is recovered by applying supervised PCA to activations labeled simply as safe or unsafe. Adding the direction at an intermediate layer during generation shifts trajectories away from harmful outputs toward related safe alternatives. The approach requires no model updates and adds negligible cost, addressing the limits of fine-tuning or post-hoc filters. The finding holds across multiple model scales and both text-to-video and image-to-video tasks.

Core claim

Safety-relevant structure is linearly encoded in the hidden-state activations of video diffusion transformers, and a single direction, discovered via Supervised PCA on binary safety labels, suffices to separate safe from unsafe generation trajectories. At inference, adding this direction to hidden states at an intermediate transformer layer redirects generation from harmful content to semantically related safe alternatives, with no weight updates, no concept enumeration, and negligible computational overhead.

What carries the argument

The safety steering direction recovered by Supervised PCA on binary-labeled hidden-state activations, added to the activations at an intermediate (~50% depth) layer of the video diffusion transformer.

If this is right

The same direction steers outputs across nine different video diffusion models spanning 1.3B to 5B parameters.
Steering effectiveness is highest at intermediate transformer depth even though safety information continues to accumulate in deeper layers.
The method applies equally to text-to-video and image-to-video generation without any task-specific retraining.
No enumeration of unsafe concepts is required; the binary safe/unsafe labeling suffices to locate the direction.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

The same linear-steering recipe could be tested on other controllable attributes such as artistic style or object count by swapping the binary label set.
If the direction is found to be largely orthogonal to the main generation manifold, combining multiple such directions for simultaneous control of safety and style becomes feasible.
The observed tradeoff between information accumulation and steerability at different depths suggests a general principle that could be checked in language or image diffusion transformers.

Load-bearing premise

A direction fitted once on a fixed collection of binary safety labels will continue to steer arbitrary new prompts toward safe outputs without degrading video quality or introducing artifacts.

What would settle it

Run the steering direction on a held-out set of prompts that were never seen during direction discovery and measure whether an independent classifier or human raters judge the resulting videos as unsafe at rates comparable to the unsteered baseline.

Figures

Figures reproduced from arXiv: 2606.17257 by Amit K. Roy-Chowdhury, Arindam Dutta, Athula Balachandran, Rohit Kundu, Sarosij Bose.

**Figure 1.** Figure 1: REINS aligns video diffusion models with safety constraints by steering their internal representations at inference time. (a) Safety-steered outputs: On unsafe baseline generations from CogVideoX and Wan2.1 (left of each pair), REINS redirects the model toward semantically related safe alternatives within its own manifold (right), with no weight updates, no concept enumeration, and no prompt-level filterin… view at source ↗

**Figure 2.** Figure 2: Qualitative results: For each prompt, we show baseline generation and REINS-steered generation using identical random seeds for paired comparison. REINS redirects harmful content toward safe content. Some explicit content has been censored [PITH_FULL_IMAGE:figures/full_fig_p007_2.png] view at source ↗

**Figure 3.** Figure 3: Effective steering range across transformer layers. We apply REINS at a fixed steering strength across candidate layers and measure safety rate and VQ/MQ win rates. The shaded effective steering range highlights the band of intermediate layers (∼50% depth) where safety improves substantially while VQ and MQ remain near baseline. The operating region predicted by the information-steerability tradeoff. Outsi… view at source ↗

**Figure 5.** Figure 5: Rank ablation: A single direction (rank1) captures ∼68% of the safety-relevant variance and achieves saturation-level safety rates at sufficient λ. Higher ranks do not improve metrics like safety rate, VQ or MQ significantly, confirming that safety is low-dimensional in hidden-state space [PITH_FULL_IMAGE:figures/full_fig_p009_5.png] view at source ↗

**Figure 6.** Figure 6: Representation shift under steering, validated by an OOD classifier. We project hidden states onto the top two SPCA components and connect each unsafe baseline generation (open red) to its steered counterpart (filled green/red) under the same prompt and seed. Labels come from an in-domain classifier (left) and an OOD classifier (right) unused during direction discovery. Steering consistently moves unsafe g… view at source ↗

read the original abstract

Open-weight video diffusion models can generate photorealistic unsafe content, from violence to misinformation, yet existing defenses either require expensive safety fine-tuning that degrades general capability, or apply external filters that are trivially bypassed by adversarial prompts. We present REINS (REpresentation-space INference-time Safety steering), a training-free method that aligns video diffusion models at inference time by steering their internal representations toward safe generation. Our key finding is that safety-relevant structure is linearly encoded in the hidden-state activations of video diffusion transformers, and a single direction, discovered via Supervised PCA on binary safety labels, suffices to separate safe from unsafe generation trajectories. At inference, adding this direction to hidden states at an intermediate transformer layer redirects generation from harmful content to semantically related safe alternatives, with no weight updates, no concept enumeration, and negligible computational overhead. Through mechanistic analysis, we reveal that while safety information accumulates monotonically with transformer depth, steering effectiveness peaks at intermediate layers (~50% depth), exposing a fundamental tradeoff between information availability and downstream propagation capacity. We evaluate REINS across 9 video diffusion models, multiple parameter scales (1.3B-5B), and both text-to-video and image-to-video generation, to our knowledge, the broadest safety evaluation suite in the video generation literature.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Desk Editor's Note private letter to a colleague

REINS adapts supervised-PCA steering to video diffusion models and shows it works across nine of them at inference time, but the generalization from fitted labels to new prompts is the part that still needs checking.

read the letter

The core contribution is a training-free steering method that extracts a single safety direction via supervised PCA on hidden states and adds it at an intermediate layer to redirect unsafe video generation. They report this works on nine models from 1.3B to 5B parameters in both text-to-video and image-to-video settings, plus a depth analysis showing safety information builds up but steering peaks around 50% depth.

What stands out is the breadth of the evaluation and the mechanistic observation about layer choice. That tradeoff between information and propagation capacity is a concrete finding that prior language-model steering work did not emphasize in the same way.

The soft spot is the reliance on a direction fitted to binary safety labels. The abstract gives no numbers on how much the steered videos actually improve safety metrics, how label collection was done, or whether the test prompts were held out from the fitting set. If the direction partly captures prompt-specific features rather than a general safety concept, the reported results would not guarantee performance on fresh unsafe inputs or preservation of video quality. The stress-test note correctly flags this as the least-secured step.

This is aimed at people who need quick, no-training safety patches for open video models. The evaluation scope is wide enough that a serious referee should see the full numbers, ablations, and failure cases. I would send it to review.

Referee Report

2 major / 2 minor

Summary. The manuscript introduces REINS, a training-free inference-time safety steering method for video diffusion models. It claims that safety-relevant structure is linearly encoded in the hidden-state activations of video diffusion transformers; a single direction obtained via Supervised PCA on binary safety labels separates safe from unsafe trajectories and, when added to hidden states at an intermediate layer (~50% depth), redirects harmful generations to semantically related safe outputs. The method requires no weight updates or concept enumeration. Mechanistic analysis shows safety information accumulates monotonically with depth but steering peaks at intermediate layers. Evaluation spans 9 models (1.3B–5B parameters) in both text-to-video and image-to-video modes.

Significance. If the central claim holds, the work would be significant for providing a lightweight, training-free alternative to safety fine-tuning or external filters in open video diffusion models. The breadth of the evaluation suite (multiple scales and generation modes) and the identification of a depth-dependent tradeoff between information availability and steerability constitute concrete contributions to representation engineering for diffusion transformers. The absence of training and the single-direction simplicity are practical strengths.

major comments (2)

[Method (Supervised PCA)] Method section on Supervised PCA: the safety direction is fitted on binary labels from a fixed prompt set; the manuscript does not state whether these prompts are held out from the steering evaluation set. This detail is load-bearing for the claim that the direction encodes a prompt-independent safety concept rather than features correlated with the particular label-collection procedure.
[Experiments / Results] Evaluation and results: while the abstract states evaluation across nine models and two modes, the manuscript supplies no quantitative safety success rates, video quality metrics (e.g., FVD, CLIP similarity to safe references), or ablation tables on layer choice and steering coefficient. Without these, it is impossible to verify that steered outputs remain on-distribution or that quality is preserved, which directly affects the practical-utility claim.

minor comments (2)

Clarify how the binary safety labels were obtained (human raters, automated classifier, or prompt templates) and report inter-rater agreement if applicable.
[Mechanistic analysis] The mechanistic analysis paragraph on monotonic accumulation versus intermediate-layer peak would benefit from a figure plotting safety separability (e.g., PCA explained variance or linear probe accuracy) versus depth.

Simulated Author's Rebuttal

2 responses · 0 unresolved

We thank the referee for the detailed and constructive comments. We address each major point below and indicate the revisions we will make.

read point-by-point responses

Referee: [Method (Supervised PCA)] Method section on Supervised PCA: the safety direction is fitted on binary labels from a fixed prompt set; the manuscript does not state whether these prompts are held out from the steering evaluation set. This detail is load-bearing for the claim that the direction encodes a prompt-independent safety concept rather than features correlated with the particular label-collection procedure.

Authors: We agree this clarification is necessary. The safety-labeled activations used to fit the Supervised PCA direction were collected from a prompt set that was held out from all evaluation prompts. We will revise the Method section to explicitly document this held-out split, thereby strengthening the claim that the discovered direction reflects a general safety concept. revision: yes
Referee: [Experiments / Results] Evaluation and results: while the abstract states evaluation across nine models and two modes, the manuscript supplies no quantitative safety success rates, video quality metrics (e.g., FVD, CLIP similarity to safe references), or ablation tables on layer choice and steering coefficient. Without these, it is impossible to verify that steered outputs remain on-distribution or that quality is preserved, which directly affects the practical-utility claim.

Authors: We acknowledge that the current manuscript presents primarily qualitative results and mechanistic analysis. To address this, the revised version will add quantitative safety success rates, video quality metrics including FVD and CLIP similarity to safe references, and ablation tables on layer depth and steering coefficient across the nine models and both generation modes. revision: yes

Circularity Check

1 steps flagged

Safety direction separation reduces to the Supervised PCA fit by construction

specific steps

fitted input called prediction [Abstract]
"Our key finding is that safety-relevant structure is linearly encoded in the hidden-state activations of video diffusion transformers, and a single direction, discovered via Supervised PCA on binary safety labels, suffices to separate safe from unsafe generation trajectories."

Supervised PCA explicitly optimizes for a linear direction that best separates the two classes given by the binary safety labels. The assertion that this direction 'suffices to separate safe from unsafe generation trajectories' is therefore true by construction on the fitted label set rather than an independent result about the model's representations.

full rationale

The paper's central mechanistic claim is presented as an independent finding about linear encoding of safety in activations. However, the quoted key finding directly follows from applying Supervised PCA, whose objective is to produce a direction that separates the supplied binary labels. This makes the separation statement equivalent to the fitting step for the labeled data used, satisfying the fitted-input-called-prediction pattern. The subsequent steering application at inference time is not itself circular, but the load-bearing claim that a single such direction 'suffices to separate' trajectories is. No other patterns (self-citation chains, ansatz smuggling, or renaming) are evidenced in the provided text.

Axiom & Free-Parameter Ledger

1 free parameters · 1 axioms · 0 invented entities

The central claim rests on the empirical observation that safety is linearly separable in activation space; no new physical entities are postulated. The direction vector itself is a fitted parameter derived from labeled data.

free parameters (1)

steering coefficient and target layer index
The magnitude of the added direction and the precise transformer layer at which it is injected are chosen (implicitly via the reported peak at ~50% depth) to achieve the steering effect.

axioms (1)

domain assumption Safety information is linearly encoded and can be isolated by supervised PCA on binary labels
Invoked in the key finding sentence of the abstract; the method presupposes that a single linear direction suffices to separate safe and unsafe trajectories.

pith-pipeline@v0.9.1-grok · 5776 in / 1408 out tokens · 48548 ms · 2026-06-27T03:41:22.606762+00:00 · methodology

discussion (0)

Reference graph

Works this paper leans on

43 extracted references · 15 canonical work pages · 10 internal anchors

[1]

Safesora: Towards safety alignment of text2video generation via a human preference dataset.Advances in Neural Information Processing Systems, 37:17161–17214, 2024

Juntao Dai, Tianle Chen, Xuyao Wang, Ziran Yang, Taiye Chen, Jiaming Ji, and Yaodong Yang. Safesora: Towards safety alignment of text2video generation via a human preference dataset.Advances in Neural Information Processing Systems, 37:17161–17214, 2024

2024
[2]

Safewatch: An efficient safety-policy following video guardrail model with transparent explanations

Zhaorun Chen, Francesco Pinto, Minzhou Pan, and Bo Li. Safewatch: An efficient safety-policy following video guardrail model with transparent explanations. In13th International Conference on Learning Represen- tations, ICLR 2025, pages 95666–95708. International Conference on Learning Representations, ICLR, 2025

2025
[3]

CogVideoX: Text-to-Video Diffusion Models with An Expert Transformer

Zhuoyi Yang, Jiayan Teng, Wendi Zheng, Ming Ding, Shiyu Huang, Jiazheng Xu, Yuanming Yang, Wenyi Hong, Xiaohan Zhang, Guanyu Feng, et al. Cogvideox: Text-to-video diffusion models with an expert transformer.arXiv preprint arXiv:2408.06072, 2024

work page internal anchor Pith review Pith/arXiv arXiv 2024
[4]

Wan: Open and Advanced Large-Scale Video Generative Models

Team Wan, Ang Wang, Baole Ai, Bin Wen, Chaojie Mao, Chen-Wei Xie, Di Chen, Feiwu Yu, Haiming Zhao, Jianxiao Yang, et al. Wan: Open and advanced large-scale video generative models.arXiv preprint arXiv:2503.20314, 2025

work page internal anchor Pith review Pith/arXiv arXiv 2025
[5]

Mochi 1.https://github.com/genmoai/models, 2024

Genmo Team. Mochi 1.https://github.com/genmoai/models, 2024

2024
[6]

LTX-Video: Realtime Video Latent Diffusion

Yoav HaCohen, Nisan Chiprut, Benny Brazowski, Daniel Shalem, Dudu Moshe, Eitan Richardson, Eran Levin, Guy Shiran, Nir Zabari, Ori Gordon, et al. Ltx-video: Realtime video latent diffusion.arXiv preprint arXiv:2501.00103, 2024

work page internal anchor Pith review Pith/arXiv arXiv 2024
[7]

Allegro: Open the black box of commercial-level video generation model.arXiv preprint arXiv:2410.15458, 2024

Yuan Zhou, Qiuyue Wang, Yuxuan Cai, and Huan Yang. Allegro: Open the black box of commercial-level video generation model.arXiv preprint arXiv:2410.15458, 2024

work page arXiv 2024
[8]

Videoguard: Protecting video content from unauthorized editing.arXiv preprint arXiv:2508.03480, 2025

Junjie Cao, Kaizhou Li, Xinchun Yu, Hongxiang Li, and Xiaoping Zhang. Videoguard: Protecting video content from unauthorized editing.arXiv preprint arXiv:2508.03480, 2025

work page arXiv 2025
[9]

Sora by openai.https://openai.com/sora/, 2024

OpenAI. Sora by openai.https://openai.com/sora/, 2024

2024
[10]

Text driven video generation.https://research.runwayml.com/gen2, 2023

Runway Research. Text driven video generation.https://research.runwayml.com/gen2, 2023

2023
[11]

Kling-Omni Technical Report

Kling Team, Jialu Chen, Yuanzheng Ci, Xiangyu Du, Zipeng Feng, Kun Gai, Sainan Guo, Feng Han, Jingbin He, Kang He, et al. Kling-omni technical report.arXiv preprint arXiv:2512.16776, 2025

work page internal anchor Pith review Pith/arXiv arXiv 2025
[12]

HunyuanVideo 1.5 Technical Report

Bing Wu, Chang Zou, Changlin Li, Duojun Huang, Fang Yang, Hao Tan, Jack Peng, Jianbing Wu, Jiangfeng Xiong, Jie Jiang, et al. Hunyuanvideo 1.5 technical report.arXiv preprint arXiv:2511.18870, 2025

work page internal anchor Pith review Pith/arXiv arXiv 2025
[13]

Llama Guard: LLM-based Input-Output Safeguard for Human-AI Conversations

Hakan Inan, Kartikeya Upasani, Jianfeng Chi, Rashi Rungta, Krithika Iyer, Yuning Mao, Michael Tontchev, Qing Hu, Brian Fuller, Davide Testuggine, et al. Llama guard: Llm-based input-output safeguard for human-ai conversations.arXiv preprint arXiv:2312.06674, 2023

work page internal anchor Pith review Pith/arXiv arXiv 2023
[14]

Steerdiff: Steering towards safe text-to-image diffusion models.arXiv preprint arXiv:2410.02710, 2024

Hongxiang Zhang, Yifeng He, and Hao Chen. Steerdiff: Steering towards safe text-to-image diffusion models.arXiv preprint arXiv:2410.02710, 2024

work page arXiv 2024
[15]

Mma-diffusion: Multimodal attack on diffusion models

Yijun Yang, Ruiyuan Gao, Xiaosen Wang, Tsung-Yi Ho, Nan Xu, and Qiang Xu. Mma-diffusion: Multimodal attack on diffusion models. InProceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 7737–7746, 2024

2024
[16]

A divide-and-conquer approach to distributed attack identification

Fabio Pasqualetti, Florian Dörfler, and Francesco Bullo. A divide-and-conquer approach to distributed attack identification. In2015 54th IEEE Conference on Decision and Control (CDC), pages 5801–5807. IEEE, 2015

2015
[17]

Red-teaming the stable diffusion safety filter.arXiv preprint arXiv:2210.04610, 2022

Javier Rando, Daniel Paleka, David Lindner, Lennart Heim, and Florian Tramèr. Red-teaming the stable diffusion safety filter.arXiv preprint arXiv:2210.04610, 2022

work page arXiv 2022
[18]

Safegen: Mitigating sexually explicit content generation in text-to-image models

Xinfeng Li, Yuchen Yang, Jiangyi Deng, Chen Yan, Yanjiao Chen, Xiaoyu Ji, and Wenyuan Xu. Safegen: Mitigating sexually explicit content generation in text-to-image models. InProceedings of the 2024 on ACM SIGSAC Conference on Computer and Communications Security, pages 4807–4821, 2024

2024
[19]

Supervised principal component analysis: Visualization, classification and regression on subspaces and submanifolds.Pattern Recognition, 44(7):1357–1371, 2011

Elnaz Barshan, Ali Ghodsi, Zohreh Azimifar, and Mansoor Zolghadri Jahromi. Supervised principal component analysis: Visualization, classification and regression on subspaces and submanifolds.Pattern Recognition, 44(7):1357–1371, 2011

2011
[20]

Polyjuice makes it real: Black-box, universal red teaming for synthetic image detectors.Advances in Neural Information Processing Systems, 2025

Sepehr Dehdashtian, Mashrur M Morshed, Jacob H Seidman, Gaurav Bharaj, and Vishnu Boddeti. Polyjuice makes it real: Black-box, universal red teaming for synthetic image detectors.Advances in Neural Information Processing Systems, 2025

2025
[21]

Stable diffusion safety checker

CompVis/HuggingFace. Stable diffusion safety checker. https://huggingface.co/CompVis/ stable-diffusion-safety-checker, 2022. 10

2022
[22]

Erasing concepts from diffusion models

Rohit Gandikota, Joanna Materzynska, Jaden Fiotto-Kaufman, and David Bau. Erasing concepts from diffusion models. InProceedings of the IEEE/CVF international conference on computer vision, pages 2426–2436, 2023

2023
[23]

Unified concept editing in diffusion models

Rohit Gandikota, Hadas Orgad, Yonatan Belinkov, Joanna Materzy´nska, and David Bau. Unified concept editing in diffusion models. InProceedings of the IEEE/CVF winter conference on applications of computer vision, pages 5111–5120, 2024

2024
[24]

Safe latent diffusion: Mitigating inappropriate degeneration in diffusion models

Patrick Schramowski, Manuel Brack, Björn Deiseroth, and Kristian Kersting. Safe latent diffusion: Mitigating inappropriate degeneration in diffusion models. InProceedings of the IEEE/CVF conference on computer vision and pattern recognition, pages 22522–22531, 2023

2023
[25]

Self-discovering interpretable diffusion latent directions for responsible text-to-image generation

Hang Li, Chengzhi Shen, Philip Torr, V olker Tresp, and Jindong Gu. Self-discovering interpretable diffusion latent directions for responsible text-to-image generation. InProceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 12006–12016, 2024

2024
[26]

Classifier-Free Diffusion Guidance

Jonathan Ho and Tim Salimans. Classifier-free diffusion guidance.arXiv preprint arXiv:2207.12598, 2022

work page internal anchor Pith review Pith/arXiv arXiv 2022
[27]

Representation Engineering: A Top-Down Approach to AI Transparency

Andy Zou, Long Phan, Sarah Chen, James Campbell, Phillip Guo, Richard Ren, Alexander Pan, Xuwang Yin, Mantas Mazeika, Ann-Kathrin Dombrowski, et al. Representation engineering: A top-down approach to ai transparency.arXiv preprint arXiv:2310.01405, 2023

work page internal anchor Pith review Pith/arXiv arXiv 2023
[28]

Steering Language Models With Activation Engineering

Alexander Matt Turner, Lisa Thiergart, Gavin Leech, David Udell, Juan J Vazquez, Ulisse Mini, and Monte MacDiarmid. Steering language models with activation engineering.arXiv preprint arXiv:2308.10248, 2023

work page internal anchor Pith review Pith/arXiv arXiv 2023
[29]

Refusal in language models is mediated by a single direction.Advances in Neural Information Processing Systems, 37:136037–136083, 2024

Andy Arditi, Oscar Obeso, Aaquib Syed, Daniel Paleka, Nina Panickssery, Wes Gurnee, and Neel Nanda. Refusal in language models is mediated by a single direction.Advances in Neural Information Processing Systems, 37:136037–136083, 2024

2024
[30]

Measuring statistical dependence with hilbert-schmidt norms

Arthur Gretton, Olivier Bousquet, Alex Smola, and Bernhard Schölkopf. Measuring statistical dependence with hilbert-schmidt norms. InInternational conference on algorithmic learning theory, pages 63–77. Springer, 2005

2005
[31]

Improving video generation with human feedback

Jie Liu, Gongye Liu, Jiajun Liang, Ziyang Yuan, Xiaokun Liu, Mingwu Zheng, Xiele Wu, Qiulin Wang, Menghan Xia, Xintao Wang, et al. Improving video generation with human feedback. InThe Thirty-ninth Annual Conference on Neural Information Processing Systems, 2025

2025
[32]

LlamaFirewall: An open source guardrail system for building secure AI agents,

Sahana Chennabasappa, Cyrus Nikolaidis, Daniel Song, David Molnar, Stephanie Ding, Shengye Wan, Spencer Whitman, Lauren Deason, Nicholas Doucette, Abraham Montilla, et al. Llamafirewall: An open source guardrail system for building secure ai agents.arXiv preprint arXiv:2505.03574, 2025

work page arXiv 2025
[33]

MMA-Diffusion: MultiModal Attack on Diffusion Models

Yijun Yang, Ruiyuan Gao, Xiaosen Wang, Tsung-Yi Ho, Nan Xu, and Qiang Xu. MMA-Diffusion: MultiModal Attack on Diffusion Models. InProceedings of the IEEE Conference on Computer Vision and Pattern Recognition (CVPR), 2024

2024
[34]

Latent guard: a safety framework for text-to-image generation

Runtao Liu, Ashkan Khakzar, Jindong Gu, Qifeng Chen, Philip Torr, and Fabio Pizzati. Latent guard: a safety framework for text-to-image generation. InEuropean Conference on Computer Vision, pages 93–109. Springer, 2024

2024
[35]

Sneakyprompt: Jailbreaking text-to-image generative models

Yuchen Yang, Bo Hui, Haolin Yuan, Neil Gong, and Yinzhi Cao. Sneakyprompt: Jailbreaking text-to-image generative models. In2024 IEEE symposium on security and privacy (SP), pages 897–912. IEEE, 2024

2024
[36]

T2vsafetybench: Evaluating the safety of text-to-video generative models.Advances in Neural Information Processing Systems, 37: 63858–63872, 2024

Yibo Miao, Yifan Zhu, Lijia Yu, Jun Zhu, Xiao-Shan Gao, and Yinpeng Dong. T2vsafetybench: Evaluating the safety of text-to-video generative models.Advances in Neural Information Processing Systems, 37: 63858–63872, 2024

2024
[37]

Forget-me-not: Learning to forget in text-to-image diffusion models

Gong Zhang, Kai Wang, Xingqian Xu, Zhangyang Wang, and Humphrey Shi. Forget-me-not: Learning to forget in text-to-image diffusion models. InProceedings of the IEEE/CVF conference on computer vision and pattern recognition, pages 1755–1764, 2024

2024
[38]

Reliable and efficient concept erasure of text-to-image diffusion models

Chao Gong, Kai Chen, Zhipeng Wei, Jingjing Chen, and Yu-Gang Jiang. Reliable and efficient concept erasure of text-to-image diffusion models. InEuropean Conference on Computer Vision, pages 73–88. Springer, 2024

2024
[39]

SalUn: Empowering Machine Unlearning via Gradient-based Weight Saliency in Both Image Classification and Generation

Chongyu Fan, Jiancheng Liu, Yihua Zhang, Eric Wong, Dennis Wei, and Sijia Liu. Salun: Empowering machine unlearning via gradient-based weight saliency in both image classification and generation.arXiv preprint arXiv:2310.12508, 2023

work page internal anchor Pith review Pith/arXiv arXiv 2023
[40]

The illusion of unlearning: The unstable nature of machine unlearning in text-to-image diffusion models

Naveen George, Karthik Nandan Dasaraju, Rutheesh Reddy Chittepu, and Konda Reddy Mopuri. The illusion of unlearning: The unstable nature of machine unlearning in text-to-image diffusion models. In Proceedings of the Computer Vision and Pattern Recognition Conference, pages 13393–13402, 2025. 11

2025
[41]

To generate or not? safety-driven unlearned diffusion models are still easy to generate unsafe images

Yimeng Zhang, Jinghan Jia, Xin Chen, Aochuan Chen, Yihua Zhang, Jiancheng Liu, Ke Ding, and Sijia Liu. To generate or not? safety-driven unlearned diffusion models are still easy to generate unsafe images... for now. InEuropean Conference on Computer Vision, pages 385–403. Springer, 2024

2024
[42]

Cambridge university press, 2012

Roger A Horn and Charles R Johnson.Matrix analysis. Cambridge university press, 2012

2012
[43]

nudity” or “violence,

Xiaohua Zhai, Basil Mustafa, Alexander Kolesnikov, and Lucas Beyer. Sigmoid loss for language image pre-training. InProceedings of the IEEE/CVF international conference on computer vision, pages 11975–11986, 2023. 12 Appendix A Related Work Prompt and output filtering:The most common defenses operate outside the generative model. Prompt-level classifiers ...

2023

[1] [1]

Safesora: Towards safety alignment of text2video generation via a human preference dataset.Advances in Neural Information Processing Systems, 37:17161–17214, 2024

Juntao Dai, Tianle Chen, Xuyao Wang, Ziran Yang, Taiye Chen, Jiaming Ji, and Yaodong Yang. Safesora: Towards safety alignment of text2video generation via a human preference dataset.Advances in Neural Information Processing Systems, 37:17161–17214, 2024

2024

[2] [2]

Safewatch: An efficient safety-policy following video guardrail model with transparent explanations

Zhaorun Chen, Francesco Pinto, Minzhou Pan, and Bo Li. Safewatch: An efficient safety-policy following video guardrail model with transparent explanations. In13th International Conference on Learning Represen- tations, ICLR 2025, pages 95666–95708. International Conference on Learning Representations, ICLR, 2025

2025

[3] [3]

CogVideoX: Text-to-Video Diffusion Models with An Expert Transformer

Zhuoyi Yang, Jiayan Teng, Wendi Zheng, Ming Ding, Shiyu Huang, Jiazheng Xu, Yuanming Yang, Wenyi Hong, Xiaohan Zhang, Guanyu Feng, et al. Cogvideox: Text-to-video diffusion models with an expert transformer.arXiv preprint arXiv:2408.06072, 2024

work page internal anchor Pith review Pith/arXiv arXiv 2024

[4] [4]

Wan: Open and Advanced Large-Scale Video Generative Models

Team Wan, Ang Wang, Baole Ai, Bin Wen, Chaojie Mao, Chen-Wei Xie, Di Chen, Feiwu Yu, Haiming Zhao, Jianxiao Yang, et al. Wan: Open and advanced large-scale video generative models.arXiv preprint arXiv:2503.20314, 2025

work page internal anchor Pith review Pith/arXiv arXiv 2025

[5] [5]

Mochi 1.https://github.com/genmoai/models, 2024

Genmo Team. Mochi 1.https://github.com/genmoai/models, 2024

2024

[6] [6]

LTX-Video: Realtime Video Latent Diffusion

Yoav HaCohen, Nisan Chiprut, Benny Brazowski, Daniel Shalem, Dudu Moshe, Eitan Richardson, Eran Levin, Guy Shiran, Nir Zabari, Ori Gordon, et al. Ltx-video: Realtime video latent diffusion.arXiv preprint arXiv:2501.00103, 2024

work page internal anchor Pith review Pith/arXiv arXiv 2024

[7] [7]

Allegro: Open the black box of commercial-level video generation model.arXiv preprint arXiv:2410.15458, 2024

Yuan Zhou, Qiuyue Wang, Yuxuan Cai, and Huan Yang. Allegro: Open the black box of commercial-level video generation model.arXiv preprint arXiv:2410.15458, 2024

work page arXiv 2024

[8] [8]

Videoguard: Protecting video content from unauthorized editing.arXiv preprint arXiv:2508.03480, 2025

Junjie Cao, Kaizhou Li, Xinchun Yu, Hongxiang Li, and Xiaoping Zhang. Videoguard: Protecting video content from unauthorized editing.arXiv preprint arXiv:2508.03480, 2025

work page arXiv 2025

[9] [9]

Sora by openai.https://openai.com/sora/, 2024

OpenAI. Sora by openai.https://openai.com/sora/, 2024

2024

[10] [10]

Text driven video generation.https://research.runwayml.com/gen2, 2023

Runway Research. Text driven video generation.https://research.runwayml.com/gen2, 2023

2023

[11] [11]

Kling-Omni Technical Report

Kling Team, Jialu Chen, Yuanzheng Ci, Xiangyu Du, Zipeng Feng, Kun Gai, Sainan Guo, Feng Han, Jingbin He, Kang He, et al. Kling-omni technical report.arXiv preprint arXiv:2512.16776, 2025

work page internal anchor Pith review Pith/arXiv arXiv 2025

[12] [12]

HunyuanVideo 1.5 Technical Report

Bing Wu, Chang Zou, Changlin Li, Duojun Huang, Fang Yang, Hao Tan, Jack Peng, Jianbing Wu, Jiangfeng Xiong, Jie Jiang, et al. Hunyuanvideo 1.5 technical report.arXiv preprint arXiv:2511.18870, 2025

work page internal anchor Pith review Pith/arXiv arXiv 2025

[13] [13]

Llama Guard: LLM-based Input-Output Safeguard for Human-AI Conversations

Hakan Inan, Kartikeya Upasani, Jianfeng Chi, Rashi Rungta, Krithika Iyer, Yuning Mao, Michael Tontchev, Qing Hu, Brian Fuller, Davide Testuggine, et al. Llama guard: Llm-based input-output safeguard for human-ai conversations.arXiv preprint arXiv:2312.06674, 2023

work page internal anchor Pith review Pith/arXiv arXiv 2023

[14] [14]

Steerdiff: Steering towards safe text-to-image diffusion models.arXiv preprint arXiv:2410.02710, 2024

Hongxiang Zhang, Yifeng He, and Hao Chen. Steerdiff: Steering towards safe text-to-image diffusion models.arXiv preprint arXiv:2410.02710, 2024

work page arXiv 2024

[15] [15]

Mma-diffusion: Multimodal attack on diffusion models

Yijun Yang, Ruiyuan Gao, Xiaosen Wang, Tsung-Yi Ho, Nan Xu, and Qiang Xu. Mma-diffusion: Multimodal attack on diffusion models. InProceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 7737–7746, 2024

2024

[16] [16]

A divide-and-conquer approach to distributed attack identification

Fabio Pasqualetti, Florian Dörfler, and Francesco Bullo. A divide-and-conquer approach to distributed attack identification. In2015 54th IEEE Conference on Decision and Control (CDC), pages 5801–5807. IEEE, 2015

2015

[17] [17]

Red-teaming the stable diffusion safety filter.arXiv preprint arXiv:2210.04610, 2022

Javier Rando, Daniel Paleka, David Lindner, Lennart Heim, and Florian Tramèr. Red-teaming the stable diffusion safety filter.arXiv preprint arXiv:2210.04610, 2022

work page arXiv 2022

[18] [18]

Safegen: Mitigating sexually explicit content generation in text-to-image models

Xinfeng Li, Yuchen Yang, Jiangyi Deng, Chen Yan, Yanjiao Chen, Xiaoyu Ji, and Wenyuan Xu. Safegen: Mitigating sexually explicit content generation in text-to-image models. InProceedings of the 2024 on ACM SIGSAC Conference on Computer and Communications Security, pages 4807–4821, 2024

2024

[19] [19]

Supervised principal component analysis: Visualization, classification and regression on subspaces and submanifolds.Pattern Recognition, 44(7):1357–1371, 2011

Elnaz Barshan, Ali Ghodsi, Zohreh Azimifar, and Mansoor Zolghadri Jahromi. Supervised principal component analysis: Visualization, classification and regression on subspaces and submanifolds.Pattern Recognition, 44(7):1357–1371, 2011

2011

[20] [20]

Polyjuice makes it real: Black-box, universal red teaming for synthetic image detectors.Advances in Neural Information Processing Systems, 2025

Sepehr Dehdashtian, Mashrur M Morshed, Jacob H Seidman, Gaurav Bharaj, and Vishnu Boddeti. Polyjuice makes it real: Black-box, universal red teaming for synthetic image detectors.Advances in Neural Information Processing Systems, 2025

2025

[21] [21]

Stable diffusion safety checker

CompVis/HuggingFace. Stable diffusion safety checker. https://huggingface.co/CompVis/ stable-diffusion-safety-checker, 2022. 10

2022

[22] [22]

Erasing concepts from diffusion models

Rohit Gandikota, Joanna Materzynska, Jaden Fiotto-Kaufman, and David Bau. Erasing concepts from diffusion models. InProceedings of the IEEE/CVF international conference on computer vision, pages 2426–2436, 2023

2023

[23] [23]

Unified concept editing in diffusion models

Rohit Gandikota, Hadas Orgad, Yonatan Belinkov, Joanna Materzy´nska, and David Bau. Unified concept editing in diffusion models. InProceedings of the IEEE/CVF winter conference on applications of computer vision, pages 5111–5120, 2024

2024

[24] [24]

Safe latent diffusion: Mitigating inappropriate degeneration in diffusion models

Patrick Schramowski, Manuel Brack, Björn Deiseroth, and Kristian Kersting. Safe latent diffusion: Mitigating inappropriate degeneration in diffusion models. InProceedings of the IEEE/CVF conference on computer vision and pattern recognition, pages 22522–22531, 2023

2023

[25] [25]

Self-discovering interpretable diffusion latent directions for responsible text-to-image generation

Hang Li, Chengzhi Shen, Philip Torr, V olker Tresp, and Jindong Gu. Self-discovering interpretable diffusion latent directions for responsible text-to-image generation. InProceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 12006–12016, 2024

2024

[26] [26]

Classifier-Free Diffusion Guidance

Jonathan Ho and Tim Salimans. Classifier-free diffusion guidance.arXiv preprint arXiv:2207.12598, 2022

work page internal anchor Pith review Pith/arXiv arXiv 2022

[27] [27]

Representation Engineering: A Top-Down Approach to AI Transparency

Andy Zou, Long Phan, Sarah Chen, James Campbell, Phillip Guo, Richard Ren, Alexander Pan, Xuwang Yin, Mantas Mazeika, Ann-Kathrin Dombrowski, et al. Representation engineering: A top-down approach to ai transparency.arXiv preprint arXiv:2310.01405, 2023

work page internal anchor Pith review Pith/arXiv arXiv 2023

[28] [28]

Steering Language Models With Activation Engineering

Alexander Matt Turner, Lisa Thiergart, Gavin Leech, David Udell, Juan J Vazquez, Ulisse Mini, and Monte MacDiarmid. Steering language models with activation engineering.arXiv preprint arXiv:2308.10248, 2023

work page internal anchor Pith review Pith/arXiv arXiv 2023

[29] [29]

Refusal in language models is mediated by a single direction.Advances in Neural Information Processing Systems, 37:136037–136083, 2024

Andy Arditi, Oscar Obeso, Aaquib Syed, Daniel Paleka, Nina Panickssery, Wes Gurnee, and Neel Nanda. Refusal in language models is mediated by a single direction.Advances in Neural Information Processing Systems, 37:136037–136083, 2024

2024

[30] [30]

Measuring statistical dependence with hilbert-schmidt norms

Arthur Gretton, Olivier Bousquet, Alex Smola, and Bernhard Schölkopf. Measuring statistical dependence with hilbert-schmidt norms. InInternational conference on algorithmic learning theory, pages 63–77. Springer, 2005

2005

[31] [31]

Improving video generation with human feedback

Jie Liu, Gongye Liu, Jiajun Liang, Ziyang Yuan, Xiaokun Liu, Mingwu Zheng, Xiele Wu, Qiulin Wang, Menghan Xia, Xintao Wang, et al. Improving video generation with human feedback. InThe Thirty-ninth Annual Conference on Neural Information Processing Systems, 2025

2025

[32] [32]

LlamaFirewall: An open source guardrail system for building secure AI agents,

Sahana Chennabasappa, Cyrus Nikolaidis, Daniel Song, David Molnar, Stephanie Ding, Shengye Wan, Spencer Whitman, Lauren Deason, Nicholas Doucette, Abraham Montilla, et al. Llamafirewall: An open source guardrail system for building secure ai agents.arXiv preprint arXiv:2505.03574, 2025

work page arXiv 2025

[33] [33]

MMA-Diffusion: MultiModal Attack on Diffusion Models

Yijun Yang, Ruiyuan Gao, Xiaosen Wang, Tsung-Yi Ho, Nan Xu, and Qiang Xu. MMA-Diffusion: MultiModal Attack on Diffusion Models. InProceedings of the IEEE Conference on Computer Vision and Pattern Recognition (CVPR), 2024

2024

[34] [34]

Latent guard: a safety framework for text-to-image generation

Runtao Liu, Ashkan Khakzar, Jindong Gu, Qifeng Chen, Philip Torr, and Fabio Pizzati. Latent guard: a safety framework for text-to-image generation. InEuropean Conference on Computer Vision, pages 93–109. Springer, 2024

2024

[35] [35]

Sneakyprompt: Jailbreaking text-to-image generative models

Yuchen Yang, Bo Hui, Haolin Yuan, Neil Gong, and Yinzhi Cao. Sneakyprompt: Jailbreaking text-to-image generative models. In2024 IEEE symposium on security and privacy (SP), pages 897–912. IEEE, 2024

2024

[36] [36]

T2vsafetybench: Evaluating the safety of text-to-video generative models.Advances in Neural Information Processing Systems, 37: 63858–63872, 2024

Yibo Miao, Yifan Zhu, Lijia Yu, Jun Zhu, Xiao-Shan Gao, and Yinpeng Dong. T2vsafetybench: Evaluating the safety of text-to-video generative models.Advances in Neural Information Processing Systems, 37: 63858–63872, 2024

2024

[37] [37]

Forget-me-not: Learning to forget in text-to-image diffusion models

Gong Zhang, Kai Wang, Xingqian Xu, Zhangyang Wang, and Humphrey Shi. Forget-me-not: Learning to forget in text-to-image diffusion models. InProceedings of the IEEE/CVF conference on computer vision and pattern recognition, pages 1755–1764, 2024

2024

[38] [38]

Reliable and efficient concept erasure of text-to-image diffusion models

Chao Gong, Kai Chen, Zhipeng Wei, Jingjing Chen, and Yu-Gang Jiang. Reliable and efficient concept erasure of text-to-image diffusion models. InEuropean Conference on Computer Vision, pages 73–88. Springer, 2024

2024

[39] [39]

SalUn: Empowering Machine Unlearning via Gradient-based Weight Saliency in Both Image Classification and Generation

Chongyu Fan, Jiancheng Liu, Yihua Zhang, Eric Wong, Dennis Wei, and Sijia Liu. Salun: Empowering machine unlearning via gradient-based weight saliency in both image classification and generation.arXiv preprint arXiv:2310.12508, 2023

work page internal anchor Pith review Pith/arXiv arXiv 2023

[40] [40]

The illusion of unlearning: The unstable nature of machine unlearning in text-to-image diffusion models

Naveen George, Karthik Nandan Dasaraju, Rutheesh Reddy Chittepu, and Konda Reddy Mopuri. The illusion of unlearning: The unstable nature of machine unlearning in text-to-image diffusion models. In Proceedings of the Computer Vision and Pattern Recognition Conference, pages 13393–13402, 2025. 11

2025

[41] [41]

To generate or not? safety-driven unlearned diffusion models are still easy to generate unsafe images

Yimeng Zhang, Jinghan Jia, Xin Chen, Aochuan Chen, Yihua Zhang, Jiancheng Liu, Ke Ding, and Sijia Liu. To generate or not? safety-driven unlearned diffusion models are still easy to generate unsafe images... for now. InEuropean Conference on Computer Vision, pages 385–403. Springer, 2024

2024

[42] [42]

Cambridge university press, 2012

Roger A Horn and Charles R Johnson.Matrix analysis. Cambridge university press, 2012

2012

[43] [43]

nudity” or “violence,

Xiaohua Zhai, Basil Mustafa, Alexander Kolesnikov, and Lucas Beyer. Sigmoid loss for language image pre-training. InProceedings of the IEEE/CVF international conference on computer vision, pages 11975–11986, 2023. 12 Appendix A Related Work Prompt and output filtering:The most common defenses operate outside the generative model. Prompt-level classifiers ...

2023