
arXiv: 2606.04775 · v1 · pith:HGJHYK6Dnew · submitted 2026-06-03 · 💻 cs.LG · cs.AI · cs.CV · cs.SY · eess.SY · math.OC

Activation Steering of Video Generation Models via Reduced-Order Linear Optimal Control

Pith reviewed 2026-06-28 07:16 UTC · model grok-4.3

classification 💻 cs.LG · cs.AI · cs.CV · cs.SY · eess.SY · math.OC
keywords activation steering · text-to-video · linear quadratic regulator · latent dynamics · optimal control · model safety · generative models · closed-loop feedback

The pith

Reduced-order LQR steers video model activations to safe setpoints with minimal quality loss.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper introduces LA-LQR as a way to steer text-to-video models away from undesired content by treating generation as a dynamical system. It projects activations into a low-dimensional subspace from contrastive prompts, estimates local linear dynamics there, and solves an LQR problem to produce closed-loop feedback signals that track desired features while limiting unnecessary changes. The approach aims to deliver more precise interventions than prior coarse steering techniques. A reader would care if this yields measurable drops in unsafe outputs on benchmarks without hurting prompt match or visual fidelity.

Core claim

LA-LQR projects high-dimensional video activations onto a low-dimensional task-relevant subspace derived from contrastive prompt pairs, estimates local linear dynamics in that space, solves a latent LQR problem for timestep- and layer-specific steering signals, and supplies theoretical bounds that link latent setpoint tracking to control of the original activation-space features.
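The projection step in the claim above can be sketched as follows. This is a minimal illustration, assuming the subspace is taken from the top right singular vectors of a matrix of contrastive activation differences; the function names, shapes, and synthetic data are hypothetical and are not taken from the paper.

```python
import numpy as np

def contrastive_subspace(pos_acts, neg_acts, d_lat):
    """Return a (d_lat, D) matrix whose orthonormal rows span the top
    right-singular directions of the contrastive difference matrix."""
    diffs = pos_acts - neg_acts                    # (num_pairs, D), one row per prompt pair
    _, _, vt = np.linalg.svd(diffs, full_matrices=False)
    return vt[:d_lat]

def to_latent(P, x):
    """Project a raw activation x (length D) to latent coordinates z = P x."""
    return P @ x

def to_raw(P, z):
    """Lift latent coordinates back into activation space via P^T z."""
    return P.T @ z

# Synthetic stand-in for activations (assumption: real activations would come
# from forward passes on contrastive prompt pairs).
rng = np.random.default_rng(0)
D, n_pairs, d_lat = 32, 10, 4
neg = rng.normal(size=(n_pairs, D))
# Differences are built to lie in a 2-dimensional subspace of activation space.
pos = neg + rng.normal(size=(n_pairs, 2)) @ rng.normal(size=(2, D))
P = contrastive_subspace(pos, neg, d_lat)
```

Because the rows of `P` are orthonormal, `to_raw(P, to_latent(P, x))` is the orthogonal projection of `x` onto the subspace, so any component of `x` outside it is discarded by construction.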

What carries the argument

The LA-LQR reduced-order optimal control framework that computes closed-loop steering signals from a latent LQR problem in a contrastive-prompt-derived subspace.
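A closed-loop latent controller of this kind can be illustrated with a textbook finite-horizon LQR solved by backward Riccati recursion. The sketch below assumes a single time-invariant pair (A, B) and handles the setpoint by augmenting the error state with a constant 1; the paper's timestep- and layer-specific formulation may differ, and all names here are hypothetical.

```python
import numpy as np

def _pad(M):
    """Embed M in the top-left corner of a zero matrix one row and column larger."""
    out = np.zeros((M.shape[0] + 1, M.shape[1] + 1))
    out[:-1, :-1] = M
    return out

def lqr_tracking_gains(A, B, Q, R, Q_H, z_star, horizon):
    """Feedback gains K_k such that u_k = -K_k @ [z_k - z_star, 1]."""
    n, m = B.shape
    # In error coordinates e = z - z_star: e_next = A e + B u + c.
    c = (A - np.eye(n)) @ z_star
    A_aug = _pad(A)
    A_aug[:-1, -1] = c        # the constant-1 state feeds the drift c
    A_aug[-1, -1] = 1.0       # and stays equal to 1
    B_aug = np.vstack([B, np.zeros((1, m))])
    Q_aug, P = _pad(Q), _pad(Q_H)
    gains = []
    for _ in range(horizon):  # backward Riccati recursion
        S = R + B_aug.T @ P @ B_aug
        K = np.linalg.solve(S, B_aug.T @ P @ A_aug)
        P = Q_aug + A_aug.T @ P @ (A_aug - B_aug @ K)
        gains.append(K)
    return gains[::-1]        # reorder so gains[k] applies at step k

def rollout(A, B, z0, z_star, steps, gains=None):
    """Simulate the latent state, with feedback if gains are given."""
    z = z0.copy()
    for k in range(steps):
        u = np.zeros(B.shape[1])
        if gains is not None:
            u = -gains[k] @ np.append(z - z_star, 1.0)
        z = A @ z + B @ u
    return z

A = np.array([[0.9, 0.1], [0.0, 0.8]])
B = np.eye(2)
z0, z_star, H = np.zeros(2), np.array([1.0, -1.0]), 20
gains = lqr_tracking_gains(A, B, np.eye(2), 0.1 * np.eye(2), np.eye(2), z_star, H)
err_open = np.linalg.norm(rollout(A, B, z0, z_star, H) - z_star)
err_closed = np.linalg.norm(rollout(A, B, z0, z_star, H, gains) - z_star)
```

The feedback form means each control depends on the current latent state rather than on a precomputed open-loop schedule, which is the property that distinguishes this approach from fixed additive steering vectors.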

Load-bearing premise

The reduced latent dynamics faithfully approximate the original high-dimensional activation dynamics.
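One way to probe this premise is to fit a linear map to latent trajectories and measure one-step prediction error on held-out data. The sketch below is an assumption-laden illustration: it fits a single matrix by least squares on synthetic trajectories, whereas the paper estimates local dynamics whose exact procedure is not described on this page.

```python
import numpy as np

def fit_linear_dynamics(trajectories):
    """Least-squares A minimizing sum ||z_{k+1} - A z_k||^2 over all trajectories.
    Each trajectory is an array of shape (T + 1, d)."""
    X = np.vstack([Z[:-1] for Z in trajectories])   # current states
    Y = np.vstack([Z[1:] for Z in trajectories])    # next states
    W, *_ = np.linalg.lstsq(X, Y, rcond=None)       # solves X W ~= Y, so W = A^T
    return W.T

def one_step_mse(A, Z):
    """Mean squared one-step prediction error of A on trajectory Z."""
    return float(np.mean((Z[:-1] @ A.T - Z[1:]) ** 2))

def simulate(A, z0, steps):
    """Roll out z_{k+1} = A z_k for the given number of steps."""
    Z = [z0]
    for _ in range(steps):
        Z.append(A @ Z[-1])
    return np.array(Z)

# Synthetic, exactly linear data (assumption: real latent trajectories would be
# projected activations recorded during denoising).
rng = np.random.default_rng(0)
A_true = np.array([[0.95, 0.05, 0.0],
                   [0.0, 0.9, 0.1],
                   [0.0, 0.0, 0.85]])
train = [simulate(A_true, rng.normal(size=3), 30) for _ in range(5)]
held_out = simulate(A_true, rng.normal(size=3), 30)
A_hat = fit_linear_dynamics(train)
mse = one_step_mse(A_hat, held_out)
```

On real activations the held-out error would not vanish, and its size relative to the trajectory's own variance is what indicates whether the linear model is a usable approximation.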

What would settle it

A test in which the latent steering signals, when applied to the full model, produce no measurable shift in the targeted activation features or fail to lower unsafe generation rates on the safety benchmarks.

Figures

Figures reproduced from arXiv: 2606.04775 by Alice Chan, Glen Chou, Jihoon Hong, Julian Skifstad, Qiyue Dai.

Figure 1: Overview. Our method, LA-LQR, steers T2V models by solving an optimal control problem, producing steering signals u_s for each timestep and transformer layer. For tractability, we perform control within a task-relevant activation subspace identified by contrastive vectors.
Figure 2: (a) Proportion of energy of C_r matrices captured in the subspace spanned by the top D_lat = 64 right singular vectors, over (t, l), for the pornography feature. (b) Normalized Frobenius norm between (top) A_s computed from 20 different prompts and (bottom) random matrices, at (left) layer 5, (middle) layer 25, and (right) layer 35.
Figure 4: Safeguarding Wan against harmful prompts from T2VSafetyBench [63]. For each example,
Figure 13: Filtering- and embedding-based methods such as SAFREE [10] and [69] also struggle on
Figure 5: Text steering only. Left column: first frame; middle column: middle frame; right column: final frame. Row 1: Baseline (no steering). Row 2: λ = 1, Q = I, R = 75000I, Q_H = I. Row 3: λ = 1, Q = 1.5I, R = 75000I, Q_H = I. Row 4: λ = 1, Q = 2I, R = 75000I, Q_H = I. Row 5: λ = 1, Q = 2.5I, R = 75000I, Q_H = I. Row 6: λ = 1, Q = 3I, R = 75000I, Q_H = I.
Figure 6: Video steering only. Left column: first frame; middle column: middle frame; right column: final frame. Row 1: λ = 1, Q = 10I, R = 75000I, Q_H = I. Row 2: λ = 1, Q = 100I, R = 75000I, Q_H = I. Row 3: λ = 1, Q = 1000I, R = 75000I, Q_H = I. Row 4: λ = 1, Q = 10000I, R = 75000I, Q_H = I.
Figure 7: Text and video steering. Left column: first frame; middle column: middle frame; right column: final frame. Row 1: λ = 1, Q = 1I, R_v = 100000I, R_t = 75000I, Q_H = I. Row 2: λ = 1, Q = 10I, R_v = 1000I, R_t = 10^6 I, Q_H = I. Row 3: λ = 1, Q = 10I, R_v = 1000I, R_t = 70000I, Q_H = I. Row 4: λ = 1, Q = 10I, R_v = 1000I, R_t = 100000I, Q_H = I.
Figure 8: Hunyuan, animal abuse category on SafeSora [12].
Figure 9: Hunyuan, pornography category on SafeSora [12].
Figure 10: Hunyuan, racism category on SafeSora [12].
Figure 11: Hunyuan, terrorism category on SafeSora [12].
Figure 12: Hunyuan, violence category on SafeSora [12].
Figure 13: Baseline [9], when oversteered, generates videos with frames like this.
Abstract

Text-to-video (T2V) models trained on large-scale web data can generate undesired content, motivating interventions that reduce harmful outputs without sacrificing visual quality. Activation steering offers an attractive mechanistic alternative to finetuning and prompt filtering, but existing T2V steering methods remain limited, typically applying coarse, non-anticipative interventions that can lead to oversteering and content degradation. To close this gap, we propose Latent Activation Linear-Quadratic Regulator (LA-LQR), a reduced-order optimal control framework for minimally invasive T2V steering. LA-LQR formulates T2V inference as a dynamical system and computes closed-loop feedback interventions that steer activations toward desired feature setpoints while penalizing unnecessary perturbations. To make optimal control feasible for high-dimensional video activations, we project activations onto a low-dimensional, task-relevant subspace derived from contrastive prompt pairs, estimate local linear dynamics in this latent space, and solve a latent LQR problem to obtain timestep- and layer-specific steering signals. We provide theoretical bounds relating latent setpoint tracking to raw activation-space feature control, and empirically validate the fidelity of the reduced latent dynamics. On concept steering and video safety benchmarks, LA-LQR reduces unsafe generations relative to baselines, while preserving prompt fidelity and visual quality.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Referee Report

2 major / 1 minor

Summary. The paper introduces LA-LQR, a reduced-order optimal control method for activation steering in text-to-video models. It projects activations onto a low-dimensional subspace derived from contrastive prompt pairs, estimates linear dynamics in that subspace, and solves an LQR problem to obtain the steering signals. It also supplies theoretical bounds linking latent-space tracking to raw-space feature control and reports improvements on safety benchmarks without degrading quality.

Significance. If the reduced dynamics approximation holds as claimed, this provides a principled control-theoretic framework for minimally invasive steering in generative video models, advancing beyond non-anticipative methods. The explicit theoretical bounds and empirical validation of latent dynamics are positive aspects.

major comments (2)
  1. [empirical validation of reduced latent dynamics] The central claim depends on the reduced-order linear dynamics faithfully approximating the high-dimensional activation trajectories over the denoising process. The abstract mentions empirical validation, but without specific quantitative results (e.g., prediction error metrics across timesteps and layers) showing that the approximation captures directions relevant to unsafe content, the theoretical bounds may not fully explain the observed steering effects.
  2. [the section on theoretical bounds] The bounds relating latent setpoint tracking to raw activation-space feature control are load-bearing. If the subspace derived from contrastive prompt pairs discards nonlinear interactions important for feature control, the mapping from latent LQR solution to raw-space control could break, undermining the explanation for the safety benchmark improvements.
minor comments (1)
  1. Notation for the LQR cost matrices Q and R could be clarified with explicit definitions in the main text.
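For reference, a tracking cost of the kind this comment asks the authors to spell out can be written in the textbook discrete-time form below. This is an assumption about the general shape, not the paper's exact objective: $z_k$ denotes the latent state, $z^\star$ the setpoint, $u_k$ the steering signal, $H$ the horizon, and $Q_H$ a terminal weight. The figure captions also list a parameter λ whose role is not defined on this page, so it is left out.

```latex
J = \sum_{k=0}^{H-1} \Big[ (z_k - z^\star)^\top Q \,(z_k - z^\star) + u_k^\top R \, u_k \Big]
    + (z_H - z^\star)^\top Q_H \,(z_H - z^\star),
\qquad \text{subject to } z_{k+1} = A_k z_k + B_k u_k .
```

Here $Q \succeq 0$ weights tracking error and $R \succ 0$ weights control effort, so raising $R$ relative to $Q$ yields gentler interventions.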

Simulated Author's Rebuttal

2 responses · 0 unresolved

We thank the referee for the constructive feedback and positive assessment of the work's significance. We address the two major comments below, agreeing to strengthen the empirical validation section and to clarify the assumptions underlying the theoretical bounds.

point-by-point responses
  1. Referee: [empirical validation of reduced latent dynamics] The central claim depends on the reduced-order linear dynamics faithfully approximating the high-dimensional activation trajectories over the denoising process. The abstract mentions empirical validation, but without specific quantitative results (e.g., prediction error metrics across timesteps and layers) showing that the approximation captures directions relevant to unsafe content, the theoretical bounds may not fully explain the observed steering effects.

    Authors: We agree that more granular quantitative metrics would strengthen the presentation. While the manuscript reports empirical validation of the reduced latent dynamics, we will revise the relevant section to include explicit prediction error metrics (e.g., MSE between predicted and observed trajectories) computed across denoising timesteps, model layers, and specifically along the contrastive directions tied to unsafe content. These additions will directly link the approximation quality to the observed steering performance. revision: yes

  2. Referee: [the section on theoretical bounds] The bounds relating latent setpoint tracking to raw activation-space feature control are load-bearing. If the subspace derived from contrastive prompt pairs discards nonlinear interactions important for feature control, the mapping from latent LQR solution to raw-space control could break, undermining the explanation for the safety benchmark improvements.

    Authors: The bounds are derived under the linear dynamics assumption within the chosen subspace and rely on the projection operator preserving the relevant directions identified by the contrastive pairs. We acknowledge that highly nonlinear interactions outside this subspace are not captured by construction. In revision we will expand the discussion of assumptions and limitations, explicitly noting the linear regime and the rationale for the contrastive subspace selection, while retaining the existing bound statements. revision: partial
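The concern in the second exchange can be made concrete with an elementary decomposition. This is not the paper's bound; it is a Cauchy-Schwarz illustration under the assumptions that $P \in \mathbb{R}^{d \times D}$ has orthonormal rows, $z = P x$ is the latent state, $z^\star$ is the latent setpoint, and $c \in \mathbb{R}^{D}$ is a hypothetical raw-space feature direction.

```latex
c^\top x - c^\top P^\top z^\star
  = (P c)^\top (z - z^\star) + \big((I - P^\top P)\, c\big)^\top x ,
\qquad\text{so}\qquad
\big| c^\top x - c^\top P^\top z^\star \big|
  \le \| P c \| \, \| z - z^\star \| + \| (I - P^\top P)\, c \| \, \| x \| .
```

The first term shrinks as latent tracking improves. The second depends only on how much of the feature direction lies outside the subspace, which is the part that no latent controller can influence.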

Circularity Check

0 steps flagged

No significant circularity; method applies standard LQR to contrastive-derived latent space with independent empirical validation

full rationale

The derivation projects high-dimensional activations onto a contrastive subspace, fits local linear dynamics, solves an LQR problem in that space, and supplies theoretical bounds from the linear model to raw-space control; these steps are standard control-theoretic constructions whose outputs are not redefined as their own inputs. Empirical validation of reduced-dynamics fidelity and benchmark results on safety/fidelity metrics are measured against external data, not against the fitted parameters themselves. No self-citations appear as load-bearing premises, no uniqueness theorems are imported from the authors' prior work, and no ansatz or known empirical pattern is smuggled or renamed. The framework therefore remains self-contained against external benchmarks.

Axiom & Free-Parameter Ledger

2 free parameters · 2 axioms · 0 invented entities

The central claim rests on modeling assumptions about linear dynamics in a reduced subspace derived from contrastive prompts; these are domain assumptions without independent evidence beyond the paper's own validation.

free parameters (2)
  • reduced subspace dimension
    Dimension of the task-relevant subspace projected from contrastive prompt pairs, selected to enable feasible LQR computation.
  • LQR cost matrices Q and R
    Weighting matrices balancing setpoint tracking error against control effort in the latent LQR problem.
axioms (2)
  • domain assumption Local linear dynamics approximation holds in the latent subspace
    Invoked when estimating dynamics and solving the LQR problem for timestep- and layer-specific steering signals.
  • domain assumption Contrastive prompt pairs yield a task-relevant subspace for feature control
    Used to project high-dimensional activations for the reduced-order formulation.
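The reduced subspace dimension listed among the free parameters could be chosen by checking how much spectral energy a given dimension retains, in the spirit of the energy proportion reported in the Figure 2 caption. The sketch below assumes "energy" means the sum of squared singular values; the function names and the example matrix are hypothetical.

```python
import numpy as np

def captured_energy(C, d_lat):
    """Fraction of squared singular values of C held by the top d_lat directions."""
    s = np.linalg.svd(C, compute_uv=False)
    return float(np.sum(s[:d_lat] ** 2) / np.sum(s ** 2))

def smallest_dim(C, threshold):
    """Smallest dimension whose captured energy reaches the threshold."""
    s = np.linalg.svd(C, compute_uv=False)
    frac = np.cumsum(s ** 2) / np.sum(s ** 2)
    # searchsorted gives the first index with frac >= threshold; clamp for rounding.
    return int(min(np.searchsorted(frac, threshold) + 1, len(s)))

# Toy matrix with singular values 3, 2, 1 (energies 9, 4, 1 out of 14).
C = np.diag([3.0, 2.0, 1.0])
e1 = captured_energy(C, 1)
d90 = smallest_dim(C, 0.9)
```

A threshold-based rule like this makes the dimension an explicit, reportable choice rather than an untracked hyperparameter.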

pith-pipeline@v0.9.1-grok · 5779 in / 1442 out tokens · 40311 ms · 2026-06-28T07:16:42.212794+00:00 · methodology


Reference graph

Works this paper leans on

76 extracted references · 19 linked inside Pith

  1. [1]

    Wan: Open and advanced large-scale video generative models,

    T. Wan, A. Wang, B. Ai, B. Wen, C. Mao, C.-W. Xie, D. Chen, F. Yu, H. Zhao, J. Yang et al., “Wan: Open and advanced large-scale video generative models,” arXiv preprint arXiv:2503.20314, 2025

  2. [2]

    Editverse: Unifying image and video editing and generation with in-context learning,

    X. Ju, T. Wang, Y. Zhou, H. Zhang, Q. Liu, N. Zhao, Z. Zhang, Y. Li, Y. Cai, S. Liu et al., “Editverse: Unifying image and video editing and generation with in-context learning,” arXiv preprint arXiv:2509.20360, 2025

  3. [4]

    Cogvideox: Text-to-video diffusion models with an expert transformer,

    Z. Yang, J. Teng, W. Zheng, M. Ding, S. Huang, J. Xu, Y. Yang, W. Hong, X. Zhang, G. Feng et al., “Cogvideox: Text-to-video diffusion models with an expert transformer,” arXiv preprint arXiv:2408.06072, 2024

  4. [5]

    Video diffusion models,

    J. Ho, T. Salimans, A. Gritsenko, W. Chan, M. Norouzi, and D. J. Fleet, “Video diffusion models,” Advances in neural information processing systems, vol. 35, pp. 8633–8646, 2022

  5. [6]

    Moviebench: A hierarchical movie level dataset for long video generation,

    W. Wu, M. Liu, Z. Zhu, X. Xia, H. Feng, W. Wang, K. Q. Lin, C. Shen, and M. Z. Shou, “Moviebench: A hierarchical movie level dataset for long video generation,” in Proceedings of the Computer Vision and Pattern Recognition Conference, 2025, pp. 28 984–28 994

  6. [7]

    Pre-trained video generative models as world simulators,

    H. He, Y. Zhang, L. Lin, Z. Xu, and L. Pan, “Pre-trained video generative models as world simulators,” in Proceedings of the AAAI Conference on Artificial Intelligence, vol. 40, no. 6, 2026, pp. 4645–4653

  7. [8]

    Cosmos policy: Fine-tuning video models for visuomotor control and planning,

    M. J. Kim, Y. Gao, T.-Y. Lin, Y.-C. Lin, Y. Ge, G. Lam, P. Liang, S. Song, M.-Y. Liu, C. Finn et al., “Cosmos policy: Fine-tuning video models for visuomotor control and planning,” arXiv preprint arXiv:2601.16163, 2026

  8. [9]

    Video unlearning via low-rank refusal vector,

    S. Facchiano, S. Saravalle, M. Migliarini, E. De Matteis, A. Sampieri, A. Pilzer, E. Rodolà, I. Spinelli, L. Franco, and F. Galasso, “Video unlearning via low-rank refusal vector,” arXiv preprint arXiv:2506.07891, 2025

  9. [10]

    Safree: Training-free and adaptive guard for safe text-to-image and video generation,

    J. Yoon, S. Yu, V. Patil, H. Yao, and M. Bansal, “Safree: Training-free and adaptive guard for safe text-to-image and video generation,” arXiv preprint arXiv:2410.12761, 2024

  10. [11]

    Vpo: Aligning text-to-video generation models with prompt optimization,

    J. Cheng, R. Lyu, X. Gu, X. Liu, J. Xu, Y. Lu, J. Teng, Z. Yang, Y. Dong, J. Tang et al., “Vpo: Aligning text-to-video generation models with prompt optimization,” in Proceedings of the IEEE/CVF International Conference on Computer Vision, 2025, pp. 15 636–15 645

  11. [12]

    Safesora: Towards safety alignment of text2video generation via a human preference dataset,

    J. Dai, T. Chen, X. Wang, Z. Yang, T. Chen, J. Ji, and Y. Yang, “Safesora: Towards safety alignment of text2video generation via a human preference dataset,” Advances in Neural Information Processing Systems, vol. 37, pp. 17 161–17 214, 2024

  12. [13]

    Unified concept editing in diffusion models,

    R. Gandikota, H. Orgad, Y. Belinkov, J. Materzyńska, and D. Bau, “Unified concept editing in diffusion models,” in Proceedings of the IEEE/CVF winter conference on applications of computer vision, 2024, pp. 5111–5120

  13. [14]

    Activation addition: Steering language models without optimization,

    A. M. Turner, L. Thiergart, G. Leech, D. Udell, U. Mini, and M. MacDiarmid, “Activation addition: Steering language models without optimization,” 2024.

  14. [15]

    ODESteer: A unified ODE-based steering framework for LLM alignment,

    H. Zhao, H. Sun, J. Kong, X. Li, Q. Wang, L. Jiang, Q. Zhu, T. F. Abdelzaher, Y. Choi, M. Li, and H. Shao, “ODESteer: A unified ODE-based steering framework for LLM alignment,” in The Fourteenth International Conference on Learning Representations, 2026. [Online]. Available: https://openreview.net/forum?id=CFewUmgIIL

  15. [16]

    Local linearity of llms enables activation steering via model-based linear optimal control,

    J. Skifstad, X. A. Yang, and G. Chou, “Local linearity of llms enables activation steering via model-based linear optimal control,” arXiv preprint arXiv:2604.19018, 2026

  16. [17]

    Mechanistic interpretability for AI safety - a review,

    L. Bereska and S. Gavves, “Mechanistic interpretability for AI safety - a review,” Transactions on Machine Learning Research, 2024, Survey Certification, Expert Certification. [Online]. Available: https://openreview.net/forum?id=ePUVetPKu6

  17. [18]

    Toy models of superposition,

    N. Elhage, T. Hume, C. Olsson, N. Schiefer, T. Henighan, S. Kravec, Z. Hatfield-Dodds, R. Lasenby, D. Drain, C. Chen, R. Grosse, S. McCandlish, J. Kaplan, D. Amodei, M. Wattenberg, and C. Olah, “Toy models of superposition,” no. arXiv:2209.10652, 2022, arXiv:2209.10652. [Online]. Available: http://arxiv.org/abs/2209.10652

  18. [19]

    Linguistic regularities in continuous space word representations,

    T. Mikolov, W.-t. Yih, and G. Zweig, “Linguistic regularities in continuous space word representations,” in Proceedings of the 2013 conference of the north american chapter of the association for computational linguistics: Human language technologies, 2013, pp. 746–751

  19. [20]

    The linear representation hypothesis and the geometry of large language models,

    K. Park, Y. J. Choe, and V. Veitch, “The linear representation hypothesis and the geometry of large language models,” arXiv preprint arXiv:2311.03658, 2023

  20. [21]

    The geometry of truth: Emergent linear structure in large language model representations of true/false datasets,

    S. Marks and M. Tegmark, “The geometry of truth: Emergent linear structure in large language model representations of true/false datasets,” in First Conference on Language Modeling, 2024. [Online]. Available: https://openreview.net/forum?id=aajyHYjjsk

  21. [22]

    Emergent linear representations in world models of self-supervised sequence models,

    N. Nanda, A. Lee, and M. Wattenberg, “Emergent linear representations in world models of self-supervised sequence models,” in Proceedings of the 6th BlackboxNLP Workshop: Analyzing and Interpreting Neural Networks for NLP, 2023, pp. 16–30

  22. [23]

    Plug and play language models: A simple approach to controlled text generation,

    S. Dathathri, A. Madotto, J. Lan, J. Hung, E. Frank, P. Molino, J. Yosinski, and R. Liu, “Plug and play language models: A simple approach to controlled text generation,” in International Conference on Learning Representations, 2020. [Online]. Available: https://openreview.net/forum?id=H1edEyBKDS

  23. [24]

    Inference-time intervention: Eliciting truthful answers from a language model,

    K. Li, O. Patel, F. Viégas, H. Pfister, and M. Wattenberg, “Inference-time intervention: Eliciting truthful answers from a language model,” Advances in Neural Information Processing Systems, vol. 36, pp. 41 451–41 530, 2023

  24. [25]

    Refusal in language models is mediated by a single direction,

    A. Arditi, O. Obeso, A. Syed, D. Paleka, N. Panickssery, W. Gurnee, and N. Nanda, “Refusal in language models is mediated by a single direction,” no. arXiv:2406.11717, Oct. 2024, arXiv:2406.11717. [Online]. Available: http://arxiv.org/abs/2406.11717

  25. [26]

    Steering llama 2 via contrastive activation addition,

    N. Rimsky, N. Gabrieli, J. Schulz, M. Tong, E. Hubinger, and A. Turner, “Steering llama 2 via contrastive activation addition,” in Proceedings of the 62nd Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers). Bangkok, Thailand: Association for Computational Linguistics, 2024, pp. 15504–15522. [Online]. Available: https://a...

  26. [27]

    Inference-time intervention: Eliciting truthful answers from a language model,

    K. Li, O. Patel, F. Viégas, H. Pfister, and M. Wattenberg, “Inference-time intervention: Eliciting truthful answers from a language model,” no. arXiv:2306.03341, 2024, arXiv:2306.03341. [Online]. Available: http://arxiv.org/abs/2306.03341

  27. [28]

    Steering language models with activation engineering,

    A. M. Turner, L. Thiergart, G. Leech, D. Udell, J. J. Vazquez, U. Mini, and M. MacDiarmid, “Steering language models with activation engineering,” no. arXiv:2308.10248, Oct. 2024, arXiv:2308.10248. [Online]. Available: http://arxiv.org/abs/2308.10248

  28. [29]

    Learning to summarize with human feedback,

    N. Stiennon, L. Ouyang, J. Wu, D. Ziegler, R. Lowe, C. Voss, A. Radford, D. Amodei, and P. F. Christiano, “Learning to summarize with human feedback,” Advances in neural information processing systems, vol. 33, pp. 3008–3021, 2020

  29. [30]

    Llama: Open and efficient foundation language models,

    H. Touvron, T. Lavril, G. Izacard, X. Martinet, M.-A. Lachaux, T. Lacroix, B. Rozière, N. Goyal, E. Hambro, F. Azhar et al., “Llama: Open and efficient foundation language models,” arXiv preprint arXiv:2302.13971, 2023

  30. [31]

    Direct preference optimization: Your language model is secretly a reward model,

    R. Rafailov, A. Sharma, E. Mitchell, C. D. Manning, S. Ermon, and C. Finn, “Direct preference optimization: Your language model is secretly a reward model,” Advances in neural information processing systems, vol. 36, pp. 53 728–53 741, 2023.

  31. [32]

    Contrastive preference optimization: Pushing the boundaries of llm performance in machine translation,

    H. Xu, A. Sharaf, Y. Chen, W. Tan, L. Shen, B. Van Durme, K. Murray, and Y. J. Kim, “Contrastive preference optimization: Pushing the boundaries of llm performance in machine translation,” arXiv preprint arXiv:2401.08417, 2024

  32. [33]

    Rrhf: Rank responses to align language models with human feedback,

    H. Yuan, Z. Yuan, C. Tan, W. Wang, S. Huang, and F. Huang, “Rrhf: Rank responses to align language models with human feedback,” Advances in Neural Information Processing Systems, vol. 36, pp. 10 935–10 950, 2023

  33. [34]

    Preference ranking optimization for human alignment,

    F. Song, B. Yu, M. Li, H. Yu, F. Huang, Y. Li, and H. Wang, “Preference ranking optimization for human alignment,” in Proceedings of the AAAI Conference on Artificial Intelligence, vol. 38, no. 17, 2024, pp. 18 990–18 998

  34. [35]

    Parameter-efficient transfer learning for nlp,

    N. Houlsby, A. Giurgiu, S. Jastrzebski, B. Morrone, Q. De Laroussilhe, A. Gesmundo, M. Attariyan, and S. Gelly, “Parameter-efficient transfer learning for nlp,” in International conference on machine learning. PMLR, 2019, pp. 2790–2799

  35. [36]

    A general language assistant as a laboratory for alignment,

    A. Askell, Y. Bai, A. Chen, D. Drain, D. Ganguli, T. Henighan, A. Jones, N. Joseph, B. Mann, N. DasSarma et al., “A general language assistant as a laboratory for alignment,” arXiv preprint arXiv:2112.00861, 2021

  36. [37]

    Defending large language models against jailbreaking attacks through goal prioritization,

    Z. Zhang, J. Yang, P. Ke, F. Mi, H. Wang, and M. Huang, “Defending large language models against jailbreaking attacks through goal prioritization,” in Proceedings of the 62nd Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), 2024, pp. 8865–8887

  37. [38]

    Args: Alignment as reward-guided search,

    M. Khanov, J. Burapacheep, and Y. Li, “Args: Alignment as reward-guided search,” arXiv preprint arXiv:2402.01694, 2024

  38. [39]

    Deal: Decoding-time alignment for large language models,

    J. Y. Huang, S. Sengupta, D. Bonadiman, Y.-a. Lai, A. Gupta, N. Pappas, S. Mansour, K. Kirchhoff, and D. Roth, “Deal: Decoding-time alignment for large language models,” in Proceedings of the 63rd Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), 2025, pp. 26 280–26 300

  39. [40]

    Controlling language and diffusion models by transporting activations,

    P. Rodriguez, A. Blaas, M. Klein, L. Zappella, N. Apostoloff, M. Cuturi, and X. Suau, “Controlling language and diffusion models by transporting activations,” no. arXiv:2410.23054, Nov. 2024, arXiv:2410.23054. [Online]. Available: http://arxiv.org/abs/2410.23054

  40. [41]

    Reft: Representation finetuning for language models,

    Z. Wu, A. Arora, Z. Wang, A. Geiger, D. Jurafsky, C. D. Manning, and C. Potts, “Reft: Representation finetuning for language models,” Advances in Neural Information Processing Systems, vol. 37, pp. 63 908–63 962, 2024

  41. [42]

    Advancing parameter efficiency in fine-tuning via representation editing,

    M. Wu, W. Liu, X. Wang, T. Li, C. Lv, Z. Ling, Z. JianHao, C. Zhang, X. Zheng, and X.-J. Huang, “Advancing parameter efficiency in fine-tuning via representation editing,” in Proceedings of the 62nd Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), 2024, pp. 13 445–13 464

  42. [43]

    Angular steering: Behavior control via rotation in activation space,

    H. M. Vu and T. M. Nguyen, “Angular steering: Behavior control via rotation in activation space,” no. arXiv:2510.26243, Oct. 2025, arXiv:2510.26243. [Online]. Available: http://arxiv.org/abs/2510.26243

  43. [44]

    What’s the magic word? a control theory of llm prompting,

    A. Bhargava, C. Witkowski, S.-Z. Looi, and M. Thomson, “What’s the magic word? a control theory of llm prompting,” arXiv preprint arXiv:2310.04444, 2023

  44. [45]

    Aligning large language models with representation editing: A control perspective,

    L. Kong, H. Wang, W. Mu, Y. Du, Y. Zhuang, Y. Zhou, Y. Song, R. Zhang, K. Wang, and C. Zhang, “Aligning large language models with representation editing: A control perspective,” no. arXiv:2406.05954, Nov. 2024, arXiv:2406.05954. [Online]. Available: http://arxiv.org/abs/2406.05954

  45. [46]

    Preemptive detection and steering of llm misalignment via latent reachability,

    S. Karnik and S. Bansal, “Preemptive detection and steering of llm misalignment via latent reachability,” no. arXiv:2509.21528, Sep. 2025, arXiv:2509.21528. [Online]. Available: http://arxiv.org/abs/2509.21528

  46. [47]

    Linearly controlled language generation with performative guarantees,

    E. Cheng and C. A. Alonso, “Linearly controlled language generation with performative guarantees,” no. arXiv:2405.15454, Sep. 2025, arXiv:2405.15454. [Online]. Available: http://arxiv.org/abs/2405.15454

  47. [48]

    To steer or not to steer? mechanistic error reduction with abstention for language models,

    A. Hedström, S. I. Amoukou, T. Bewley, S. Mishra, and M. Veloso, “To steer or not to steer? mechanistic error reduction with abstention for language models,” in Proceedings of the 42nd International Conference on Machine Learning, ser. Proceedings of Machine Learning Research, vol. 267. Vancouver, Canada: PMLR, 2025.

  48. [49]

    Analysing the generalisation and reliability of steering vectors,

    D. Tan, D. Chanin, A. Lynch, B. Paige, D. Kanoulas, A. Garriga-Alonso, and R. Kirk, “Analysing the generalisation and reliability of steering vectors,” Advances in Neural Information Processing Systems, vol. 37, pp. 139 179–139 212, 2024

  49. [50]

    Multi-property steering of large language models with dynamic activation composition,

    D. Scalena, G. Sarti, and M. Nissim, “Multi-property steering of large language models with dynamic activation composition,” arXiv preprint arXiv:2406.17563, 2024

  50. [51]

    Cbf-llm: Safe control for llm alignment,

    Y. Miyaoka and M. Inoue, “Cbf-llm: Safe control for llm alignment,” arXiv preprint arXiv:2408.15625, 2024

  51. [52]

    Activation steering with a feedback controller,

    D. V. Nguyen, H. M. Vu, N. Y. Pham, L. Zhang, and T. M. Nguyen, “Activation steering with a feedback controller,” no. arXiv:2510.04309, Oct. 2025, arXiv:2510.04309. [Online]. Available: http://arxiv.org/abs/2510.04309

  52. [53]

    Zeroscope v2 576w,

    Cerspense, “Zeroscope v2 576w,” https://huggingface.co/cerspense/zeroscope_v2_576w, 2023, accessed: 2025-09-23

  53. [54]

    Cogvideo: Large-scale pretraining for text-to-video generation via transformers,

    W. Hong, M. Ding, W. Zheng, X. Liu, and J. Tang, “Cogvideo: Large-scale pretraining for text-to-video generation via transformers,” in The Eleventh International Conference on Learning Representations, 2023

  54. [55]

    Cogvideox: Text-to-video diffusion models with an expert transformer,

    Z. Yang, J. Teng, W. Zheng, M. Ding, S. Huang, J. Xu, Y. Yang, W. Hong, X. Zhang, G. Feng, D. Yin, Yuxuan.Zhang, W. Wang, Y. Cheng, B. Xu, X. Gu, Y. Dong, and J. Tang, “Cogvideox: Text-to-video diffusion models with an expert transformer,” in The Thirteenth International Conference on Learning Representations, 2025

  55. [56]

    Imagen video: High definition video generation with diffusion models,

    J. Ho, W. Chan, C. Saharia, J. Whang, R. Gao, A. Gritsenko, D. P. Kingma, B. Poole, M. Norouzi, D. J. Fleet, and T. Salimans, “Imagen video: High definition video generation with diffusion models,” 2022

  56. [57]

    Make-a-video: Text-to-video generation without text-video data,

    U. Singer, A. Polyak, T. Hayes, X. Yin, J. An, S. Zhang, Q. Hu, H. Yang, O. Ashual, O. Gafni, D. Parikh, S. Gupta, and Y. Taigman, “Make-a-video: Text-to-video generation without text-video data,” 2022

  57. [58]

    Modelscope text-to-video technical report,

    J. Wang, H. Yuan, D. Chen, Y. Zhang, X. Wang, and S. Zhang, “Modelscope text-to-video technical report,” arXiv preprint arXiv:2308.06571, 2023

  58. [59]

    Sora: A review on background, technology, limitations, and opportunities of large vision models,

    Y. Liu, K. Zhang, Y. Li, Z. Yan, C. Gao, R. Chen, Z. Yuan, Y. Huang, H. Sun, J. Gao, L. He, and L. Sun, “Sora: A review on background, technology, limitations, and opportunities of large vision models,” 2024. [Online]. Available: https://arxiv.org/abs/2402.17177

  59. [60]

    Open-sora: Democratizing efficient video production for all,

    Z. Zheng, X. Peng, T. Yang, C. Shen, S. Li, H. Liu, Y. Zhou, T. Li, and Y. You, “Open-sora: Democratizing efficient video production for all,” arXiv preprint arXiv:2412.20404, 2024

  60. [61]

    T2vattack: Adversarial attack on text-to-video diffusion models,

    C. Li, Y. Min, J. Zhang, Z. Yuan, S. Shan, and X. Chen, “T2vattack: Adversarial attack on text-to-video diffusion models,” arXiv preprint arXiv:2512.23953, 2025

  61. [62]

    T2v-optjail: Discrete prompt optimization for text-to-video jailbreak attacks,

    J. Liu, S. Liang, S. Zhao, R. Tu, W. Zhou, A. Liu, D. Tao, and S. K. Lam, “T2v-optjail: Discrete prompt optimization for text-to-video jailbreak attacks,” arXiv preprint arXiv:2505.06679, 2025

  62. [63]

    T2vsafetybench: Evaluating the safety of text-to-video generative models,

    Y. Miao, Y. Zhu, L. Yu, J. Zhu, X.-S. Gao, and Y. Dong, “T2vsafetybench: Evaluating the safety of text-to-video generative models,” Advances in Neural Information Processing Systems, vol. 37, pp. 63 858–63 872, 2024

  63. [64]

    Two frames matter: A temporal attack for text-to-video model jailbreaking,

    M. Chen, Z. Ying, W. Xu, Q. Zou, D. Zhang, D. Yang, and X. Zhang, “Two frames matter: A temporal attack for text-to-video model jailbreaking,” arXiv preprint arXiv:2603.07028, 2026

  64. [65]

    Badvideo: Stealthy backdoor attack against text-to-video generation,

    R. Wang, M. Zhu, J. Ou, R. Chen, X. Tao, P. Wan, and B. Wu, “Badvideo: Stealthy backdoor attack against text-to-video generation,” in Proceedings of the IEEE/CVF International Conference on Computer Vision (ICCV), October 2025, pp. 19 075–19 084

  65. [66]

    Unlearning concepts from text-to-video diffusion models,

    S. Liu and Y. Tan, “Unlearning concepts from text-to-video diffusion models,” arXiv preprint arXiv:2407.14209, 2024

  66. [67]

    Nullsce: Sequential concept erasure in generative video diffusion models via null-space guidance,

    Q. Yi, B. Li, C. Wu, Y. Li, X. Teng, X. Xu, Y. Tan, and C. Chen, “Nullsce: Sequential concept erasure in generative video diffusion models via null-space guidance,” Available at SSRN 5993786

  67. [68]

    Lineas: End-to-end learning of activation steering with a distributional loss,

    P. Rodriguez, M. Klein, E. Gualdoni, V. Maiorca, A. Blaas, L. Zappella, M. Cuturi, and X. Suau, “Lineas: End-to-end learning of activation steering with a distributional loss,” arXiv preprint arXiv:2503.10679, 2025.

  68. [69]

    The unreasonable effectiveness of text embedding interpolation for continuous image steering,

    Y. Ekin and Y. Gandelsman, “The unreasonable effectiveness of text embedding interpolation for continuous image steering,” arXiv preprint arXiv:2603.17998, 2026

  69. [70]

    Exploring the limits of transfer learning with a unified text-to-text transformer,

    C. Raffel, N. Shazeer, A. Roberts, K. Lee, S. Narang, M. Matena, Y. Zhou, W. Li, and P. J. Liu, “Exploring the limits of transfer learning with a unified text-to-text transformer,” Journal of machine learning research, vol. 21, no. 140, pp. 1–67, 2020

  70. [71]

    Scalable diffusion models with transformers,

    W. Peebles and S. Xie, “Scalable diffusion models with transformers,” arXiv preprint arXiv:2212.09748, 2022

  71. [72]

    Contributions to the theory of optimal control,

    R. E. Kalman et al., “Contributions to the theory of optimal control,” Bol. soc. mat. mexicana, vol. 5, no. 2, pp. 102–119, 1960

  72. [73]

    F. L. Lewis, D. L. Vrabie, and V. L. Syrmos, Optimal Control, 3rd ed. Hoboken, NJ: John Wiley & Sons, 2012, ch. 2

  73. [74]

    Finding structure with randomness: Probabilistic algorithms for constructing approximate matrix decompositions,

    N. Halko, P.-G. Martinsson, and J. A. Tropp, “Finding structure with randomness: Probabilistic algorithms for constructing approximate matrix decompositions,” SIAM review, vol. 53, no. 2, pp. 217–288, 2011

  74. [75]

    Hunyuanvideo: A systematic framework for large video generative models,

    W. Kong, Q. Tian, Z. Zhang, R. Min, Z. Dai, J. Zhou, J. Xiong, X. Li, B. Wu, J. Zhang et al., “Hunyuanvideo: A systematic framework for large video generative models,” arXiv preprint arXiv:2412.03603, 2024

  75. [76]

    Vbench: Comprehensive benchmark suite for video generative models,

    Z. Huang, Y. He, J. Yu, F. Zhang, C. Si, Y. Jiang, Y. Zhang, T. Wu, Q. Jin, N. Chanpaisit et al., “Vbench: Comprehensive benchmark suite for video generative models,” in Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, 2024, pp. 21 807–21 818

  76. [77]

    Regret bounds for the adaptive control of linear quadratic systems,

    Y. Abbasi-Yadkori and C. Szepesvári, “Regret bounds for the adaptive control of linear quadratic systems,” in Proceedings of the 24th annual conference on learning theory. JMLR Workshop and Conference Proceedings, 2011, pp. 1–26.

A Proofs

Lemma A.1 (Projection-calibrated feature setpoints). For any raw xs, the raw and latent feature strengths satisfy βx ...