Activation Steering of Video Generation Models via Reduced-Order Linear Optimal Control

Alice Chan; Glen Chou; Jihoon Hong; Julian Skifstad; Qiyue Dai

arxiv: 2606.04775 · v1 · pith:HGJHYK6Dnew · submitted 2026-06-03 · 💻 cs.LG · cs.AI· cs.CV· cs.SY· eess.SY· math.OC

Activation Steering of Video Generation Models via Reduced-Order Linear Optimal Control

Jihoon Hong , Alice Chan , Qiyue Dai , Julian Skifstad , Glen Chou This is my paper

Pith reviewed 2026-06-28 07:16 UTC · model grok-4.3

classification 💻 cs.LG cs.AIcs.CVcs.SYeess.SYmath.OC

keywords activation steeringtext-to-videolinear quadratic regulatorlatent dynamicsoptimal controlmodel safetygenerative modelsclosed-loop feedback

0 comments

The pith

Reduced-order LQR steers video model activations to safe setpoints with minimal quality loss.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper introduces LA-LQR as a way to steer text-to-video models away from undesired content by treating generation as a dynamical system. It projects activations into a low-dimensional subspace from contrastive prompts, estimates local linear dynamics there, and solves an LQR problem to produce closed-loop feedback signals that track desired features while limiting unnecessary changes. The approach aims to deliver more precise interventions than prior coarse steering techniques. A reader would care if this yields measurable drops in unsafe outputs on benchmarks without hurting prompt match or visual fidelity.

Core claim

LA-LQR projects high-dimensional video activations onto a low-dimensional task-relevant subspace derived from contrastive prompt pairs, estimates local linear dynamics in that space, solves a latent LQR problem for timestep- and layer-specific steering signals, and supplies theoretical bounds that link latent setpoint tracking to control of the original activation-space features.

What carries the argument

The LA-LQR reduced-order optimal control framework that computes closed-loop steering signals from a latent LQR problem in a contrastive-prompt-derived subspace.

Load-bearing premise

The reduced latent dynamics faithfully approximate the original high-dimensional activation dynamics.

What would settle it

A test in which the latent steering signals, when applied to the full model, produce no measurable shift in the targeted activation features or fail to lower unsafe generation rates on the safety benchmarks.

Figures

Figures reproduced from arXiv: 2606.04775 by Alice Chan, Glen Chou, Jihoon Hong, Julian Skifstad, Qiyue Dai.

**Figure 1.** Figure 1: Overview. Our method, LA-LQR, steers T2V models by solving an optimal control problem, producing steering signals us for each timestep and transformer layer. For tractability, we perform control within a task-relevant activation subspace identified by contrastive vectors. Activation Steering Prior work in mechanistic interpretability [17–22] suggests that many concepts align with directions in activation s… view at source ↗

**Figure 2.** Figure 2: (a) Proportion of energy of Cr matrices captured in the subspace spanned by the topDlat = 64 right singular vectors, over (t, l), for pornography feature. (b) Normalized Frobenius norm between (Top) As computed from 20 different prompts and (Bottom) random matrices, at (left) layer 5, (middle) layer 25, and (right) layer 35. (a) (b) (c) (d) (e) (f) [PITH_FULL_IMAGE:figures/full_fig_p008_2.png] view at source ↗

**Figure 4.** Figure 4: Safeguarding Wan against harmful prompts from T2VSafetyBench [63]. For each example, [PITH_FULL_IMAGE:figures/full_fig_p009_4.png] view at source ↗

**Figure 13.** Figure 13: Filtering- and embedding-based methods such as SAFREE [10] and [69] also struggle on [PITH_FULL_IMAGE:figures/full_fig_p009_13.png] view at source ↗

**Figure 5.** Figure 5: Text steering only. Left column: first frame; middle column: middle frame; right column: final frame. Row 1: Baseline (no steering). Row 2: λ = 1, Q = I, R = 75000I, QH = I. Row 3: λ = 1, Q = 1.5I, R = 75000I, QH = I. Row 4: λ = 1, Q = 2I, R = 75000I, QH = I. Row 5: λ = 1, Q = 2.5I, R = 75000I, QH = I. Row 6: λ = 1, Q = 3I, R = 75000I, QH = I. 18 [PITH_FULL_IMAGE:figures/full_fig_p018_5.png] view at source ↗

**Figure 6.** Figure 6: Video steering only. Left column: first frame; middle column: middle frame; right column: final frame. Row 1: λ = 1, Q = 10I, R = 75000I, QH = I. Row 2: λ = 1, Q = 100I, R = 75000I, QH = I. Row 3: λ = 1, Q = 1000I, R = 75000I, QH = I. Row 4: λ = 1, Q = 10000I, R = 75000I, QH = I. 19 [PITH_FULL_IMAGE:figures/full_fig_p019_6.png] view at source ↗

**Figure 7.** Figure 7: Text and video steering. Left column: first frame; middle column: middle frame; right column: final frame. Row 1: λ = 1, Q = 1I, Rv = 100000I, Rt = 75000I, QH = I. Row 2: λ = 1, Q = 10I, Rv = 1000I, Rt = 106 I, QH = I. Row 3: λ = 1, Q = 10I, Rv = 1000I, Rt = 70000I, QH = I. Row 4: λ = 1, Q = 10I, Rv = 1000I, Rt = 100000I, QH = I. 20 [PITH_FULL_IMAGE:figures/full_fig_p020_7.png] view at source ↗

**Figure 8.** Figure 8: Hunyuan, animal abuse category on SafeSora [12]. [PITH_FULL_IMAGE:figures/full_fig_p023_8.png] view at source ↗

**Figure 9.** Figure 9: Hunyuan, pornography category on SafeSora [12]. [PITH_FULL_IMAGE:figures/full_fig_p024_9.png] view at source ↗

**Figure 10.** Figure 10: Hunyuan, racism category on SafeSora [12]. [PITH_FULL_IMAGE:figures/full_fig_p025_10.png] view at source ↗

**Figure 11.** Figure 11: Hunyuan, terrorism category on SafeSora [12]. [PITH_FULL_IMAGE:figures/full_fig_p026_11.png] view at source ↗

**Figure 12.** Figure 12: Hunyuan, violence category on SafeSora [12]. [PITH_FULL_IMAGE:figures/full_fig_p027_12.png] view at source ↗

**Figure 13.** Figure 13: Baseline [9], when oversteered, generates videos with frames like this. [PITH_FULL_IMAGE:figures/full_fig_p028_13.png] view at source ↗

read the original abstract

Text-to-video (T2V) models trained on large-scale web data can generate undesired content, motivating interventions that reduce harmful outputs without sacrificing visual quality. Activation steering offers an attractive mechanistic alternative to finetuning and prompt filtering, but existing T2V steering methods remain limited, typically applying coarse, non-anticipative interventions that can lead to oversteering and content degradation. To close this gap, we propose Latent Activation Linear-Quadratic Regulator (LA-LQR), a reduced-order optimal control framework for minimally invasive T2V steering. LA-LQR formulates T2V inference as a dynamical system and computes closed-loop feedback interventions that steer activations toward desired feature setpoints while penalizing unnecessary perturbations. To make optimal control feasible for high-dimensional video activations, we project activations onto a low-dimensional, task-relevant subspace derived from contrastive prompt pairs, estimate local linear dynamics in this latent space, and solve a latent LQR problem to obtain timestep- and layer-specific steering signals. We provide theoretical bounds relating latent setpoint tracking to raw activation-space feature control, and empirically validate the fidelity of the reduced latent dynamics. On concept steering and video safety benchmarks, LA-LQR reduces unsafe generations relative to baselines, while preserving prompt fidelity and visual quality.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Desk Editor's Note private letter to a colleague

LA-LQR applies reduced-order LQR to T2V activations for steering but the approximation quality of the latent dynamics is the load-bearing assumption that needs checking.

read the letter

The paper introduces LA-LQR, which projects video model activations onto a low-dimensional subspace from contrastive prompts, fits local linear dynamics, and runs a latent LQR to generate closed-loop steering signals. This is a clear step past the non-anticipative methods mentioned in the abstract.

It does a solid job framing the problem in control terms and stating theoretical bounds that connect latent setpoint tracking back to raw activation control. The empirical claims on safety benchmarks and prompt fidelity are presented as direct outcomes of that setup.

The main soft spot is exactly the one in the stress-test note: the reduced-order linear model has to capture the relevant parts of the high-dimensional trajectory across layers and timesteps. If the projection drops nonlinear interactions or directions tied to unsafe content, the bounds do not automatically explain the observed reductions. The abstract says they empirically validate the latent dynamics fidelity, but without the actual derivations or ablation details it is difficult to gauge how tight that validation is.

This is for people working on activation-level interventions in generative models, particularly those who already think in terms of dynamical systems or optimal control. A reader who wants a concrete alternative to prompt filtering or fine-tuning will find a usable technique here.

It should go to peer review. The core idea is distinct, the claims are falsifiable, and the control-theoretic grounding is real even if the approximation step needs more scrutiny.

Referee Report

2 major / 1 minor

Summary. The paper claims to introduce LA-LQR, a reduced-order optimal control method for activation steering in text-to-video models. By projecting activations to a low-dimensional subspace from contrastive pairs, estimating linear dynamics, and solving LQR, it achieves steering with theoretical bounds linking latent to raw space, and shows empirical improvements in safety benchmarks without degrading quality.

Significance. If the reduced dynamics approximation holds as claimed, this provides a principled control-theoretic framework for minimally invasive steering in generative video models, advancing beyond non-anticipative methods. The explicit theoretical bounds and empirical validation of latent dynamics are positive aspects.

major comments (2)

[empirical validation of reduced latent dynamics] The central claim depends on the reduced-order linear dynamics faithfully approximating the high-dimensional activation trajectories over the denoising process. The abstract mentions empirical validation, but without specific quantitative results (e.g., prediction error metrics across timesteps and layers) showing that the approximation captures directions relevant to unsafe content, the theoretical bounds may not fully explain the observed steering effects.
[the section on theoretical bounds] The bounds relating latent setpoint tracking to raw activation-space feature control are load-bearing. If the subspace derived from contrastive prompt pairs discards nonlinear interactions important for feature control, the mapping from latent LQR solution to raw-space control could break, undermining the explanation for the safety benchmark improvements.

minor comments (1)

Notation for the LQR cost matrices Q and R could be clarified with explicit definitions in the main text.

Simulated Author's Rebuttal

2 responses · 0 unresolved

We thank the referee for the constructive feedback and positive assessment of the work's significance. We address the two major comments below, agreeing to strengthen the empirical validation section and to clarify the assumptions underlying the theoretical bounds.

read point-by-point responses

Referee: [empirical validation of reduced latent dynamics] The central claim depends on the reduced-order linear dynamics faithfully approximating the high-dimensional activation trajectories over the denoising process. The abstract mentions empirical validation, but without specific quantitative results (e.g., prediction error metrics across timesteps and layers) showing that the approximation captures directions relevant to unsafe content, the theoretical bounds may not fully explain the observed steering effects.

Authors: We agree that more granular quantitative metrics would strengthen the presentation. While the manuscript reports empirical validation of the reduced latent dynamics, we will revise the relevant section to include explicit prediction error metrics (e.g., MSE between predicted and observed trajectories) computed across denoising timesteps, model layers, and specifically along the contrastive directions tied to unsafe content. These additions will directly link the approximation quality to the observed steering performance. revision: yes
Referee: [the section on theoretical bounds] The bounds relating latent setpoint tracking to raw activation-space feature control are load-bearing. If the subspace derived from contrastive prompt pairs discards nonlinear interactions important for feature control, the mapping from latent LQR solution to raw-space control could break, undermining the explanation for the safety benchmark improvements.

Authors: The bounds are derived under the linear dynamics assumption within the chosen subspace and rely on the projection operator preserving the relevant directions identified by the contrastive pairs. We acknowledge that highly nonlinear interactions outside this subspace are not captured by construction. In revision we will expand the discussion of assumptions and limitations, explicitly noting the linear regime and the rationale for the contrastive subspace selection, while retaining the existing bound statements. revision: partial

Circularity Check

0 steps flagged

No significant circularity; method applies standard LQR to contrastive-derived latent space with independent empirical validation

full rationale

The derivation projects high-dimensional activations onto a contrastive subspace, fits local linear dynamics, solves an LQR problem in that space, and supplies theoretical bounds from the linear model to raw-space control; these steps are standard control-theoretic constructions whose outputs are not redefined as their own inputs. Empirical validation of reduced-dynamics fidelity and benchmark results on safety/fidelity metrics are measured against external data, not against the fitted parameters themselves. No self-citations appear as load-bearing premises, no uniqueness theorems are imported from the authors' prior work, and no ansatz or known empirical pattern is smuggled or renamed. The framework therefore remains self-contained against external benchmarks.

Axiom & Free-Parameter Ledger

2 free parameters · 2 axioms · 0 invented entities

The central claim rests on modeling assumptions about linear dynamics in a reduced subspace derived from contrastive prompts; these are domain assumptions without independent evidence beyond the paper's own validation.

free parameters (2)

reduced subspace dimension
Dimension of the task-relevant subspace projected from contrastive prompt pairs, selected to enable feasible LQR computation.
LQR cost matrices Q and R
Weighting matrices balancing setpoint tracking error against control effort in the latent LQR problem.

axioms (2)

domain assumption Local linear dynamics approximation holds in the latent subspace
Invoked when estimating dynamics and solving the LQR problem for timestep- and layer-specific steering signals.
domain assumption Contrastive prompt pairs yield a task-relevant subspace for feature control
Used to project high-dimensional activations for the reduced-order formulation.

pith-pipeline@v0.9.1-grok · 5779 in / 1442 out tokens · 40311 ms · 2026-06-28T07:16:42.212794+00:00 · methodology

discussion (0)

Reference graph

Works this paper leans on

76 extracted references · 19 linked inside Pith

[1]

Wan: Open and advanced large-scale video generative models,

T. Wan, A. Wang, B. Ai, B. Wen, C. Mao, C.-W. Xie, D. Chen, F. Yu, H. Zhao, J. Yang et al., “Wan: Open and advanced large-scale video generative models,”arXiv preprint arXiv:2503.20314, 2025

Pith/arXiv arXiv 2025
[2]

Editverse: Unifying image and video editing and generation with in-context learning,

X. Ju, T. Wang, Y . Zhou, H. Zhang, Q. Liu, N. Zhao, Z. Zhang, Y . Li, Y . Cai, S. Liuet al., “Editverse: Unifying image and video editing and generation with in-context learning,”arXiv preprint arXiv:2509.20360, 2025

Pith/arXiv arXiv 2025
[4]

Cogvideox: Text-to-video diffusion models with an expert transformer,

Z. Yang, J. Teng, W. Zheng, M. Ding, S. Huang, J. Xu, Y . Yang, W. Hong, X. Zhang, G. Feng et al., “Cogvideox: Text-to-video diffusion models with an expert transformer,”arXiv preprint arXiv:2408.06072, 2024

Pith/arXiv arXiv 2024
[5]

Video diffusion models,

J. Ho, T. Salimans, A. Gritsenko, W. Chan, M. Norouzi, and D. J. Fleet, “Video diffusion models,”Advances in neural information processing systems, vol. 35, pp. 8633–8646, 2022

2022
[6]

Moviebench: A hierarchical movie level dataset for long video generation,

W. Wu, M. Liu, Z. Zhu, X. Xia, H. Feng, W. Wang, K. Q. Lin, C. Shen, and M. Z. Shou, “Moviebench: A hierarchical movie level dataset for long video generation,” inProceedings of the Computer Vision and Pattern Recognition Conference, 2025, pp. 28 984–28 994

2025
[7]

Pre-trained video generative models as world simulators,

H. He, Y . Zhang, L. Lin, Z. Xu, and L. Pan, “Pre-trained video generative models as world simulators,” inProceedings of the AAAI Conference on Artificial Intelligence, vol. 40, no. 6, 2026, pp. 4645–4653

2026
[8]

Cosmos policy: Fine-tuning video models for visuomotor control and planning,

M. J. Kim, Y . Gao, T.-Y . Lin, Y .-C. Lin, Y . Ge, G. Lam, P. Liang, S. Song, M.-Y . Liu, C. Finn et al., “Cosmos policy: Fine-tuning video models for visuomotor control and planning,”arXiv preprint arXiv:2601.16163, 2026

Pith/arXiv arXiv 2026
[9]

Video unlearning via low-rank refusal vector,

S. Facchiano, S. Saravalle, M. Migliarini, E. De Matteis, A. Sampieri, A. Pilzer, E. Rodolà, I. Spinelli, L. Franco, and F. Galasso, “Video unlearning via low-rank refusal vector,”arXiv preprint arXiv:2506.07891, 2025

arXiv 2025
[10]

Safree: Training-free and adaptive guard for safe text-to-image and video generation,

J. Yoon, S. Yu, V . Patil, H. Yao, and M. Bansal, “Safree: Training-free and adaptive guard for safe text-to-image and video generation,”arXiv preprint arXiv:2410.12761, 2024

arXiv 2024
[11]

Vpo: Aligning text-to-video generation models with prompt optimization,

J. Cheng, R. Lyu, X. Gu, X. Liu, J. Xu, Y . Lu, J. Teng, Z. Yang, Y . Dong, J. Tanget al., “Vpo: Aligning text-to-video generation models with prompt optimization,” inProceedings of the IEEE/CVF International Conference on Computer Vision, 2025, pp. 15 636–15 645

2025
[12]

Safesora: Towards safety alignment of text2video generation via a human preference dataset,

J. Dai, T. Chen, X. Wang, Z. Yang, T. Chen, J. Ji, and Y . Yang, “Safesora: Towards safety alignment of text2video generation via a human preference dataset,”Advances in Neural Information Processing Systems, vol. 37, pp. 17 161–17 214, 2024

2024
[13]

Unified concept editing in diffusion models,

R. Gandikota, H. Orgad, Y . Belinkov, J. Materzy´nska, and D. Bau, “Unified concept editing in diffusion models,” inProceedings of the IEEE/CVF winter conference on applications of computer vision, 2024, pp. 5111–5120

2024
[14]

Activation addition: Steering language models without optimization,

A. M. Turner, L. Thiergart, G. Leech, D. Udell, U. Mini, and M. MacDiarmid, “Activation addition: Steering language models without optimization,” 2024. 10

2024
[15]

ODESteer: A unified ODE-based steering framework for LLM alignment,

H. Zhao, H. Sun, J. Kong, X. Li, Q. Wang, L. Jiang, Q. Zhu, T. F. Abdelzaher, Y . Choi, M. Li, and H. Shao, “ODESteer: A unified ODE-based steering framework for LLM alignment,” inThe Fourteenth International Conference on Learning Representations, 2026. [Online]. Available: https://openreview.net/forum?id=CFewUmgIIL

2026
[16]

Local linearity of llms enables activation steering via model-based linear optimal control,

J. Skifstad, X. A. Yang, and G. Chou, “Local linearity of llms enables activation steering via model-based linear optimal control,”arXiv preprint arXiv:2604.19018, 2026

Pith/arXiv arXiv 2026
[17]

Mechanistic interpretability for AI safety - a review,

L. Bereska and S. Gavves, “Mechanistic interpretability for AI safety - a review,”Transactions on Machine Learning Research, 2024, survey Certification, Expert Certification. [Online]. Available: https://openreview.net/forum?id=ePUVetPKu6

2024
[18]

Toy models of superposition,

N. Elhage, T. Hume, C. Olsson, N. Schiefer, T. Henighan, S. Kravec, Z. Hatfield-Dodds, R. Lasenby, D. Drain, C. Chen, R. Grosse, S. McCandlish, J. Kaplan, D. Amodei, M. Wattenberg, and C. Olah, “Toy models of superposition,” no. arXiv:2209.10652, 2022, arXiv:2209.10652. [Online]. Available: http://arxiv.org/abs/2209.10652

Pith/arXiv arXiv 2022
[19]

Linguistic regularities in continuous space word represen- tations,

T. Mikolov, W.-t. Yih, and G. Zweig, “Linguistic regularities in continuous space word represen- tations,” inProceedings of the 2013 conference of the north american chapter of the association for computational linguistics: Human language technologies, 2013, pp. 746–751

2013
[20]

The linear representation hypothesis and the geometry of large language models,

K. Park, Y . J. Choe, and V . Veitch, “The linear representation hypothesis and the geometry of large language models,”arXiv preprint arXiv:2311.03658, 2023

Pith/arXiv arXiv 2023
[21]

The geometry of truth: Emergent linear structure in large language model representations of true/false datasets,

S. Marks and M. Tegmark, “The geometry of truth: Emergent linear structure in large language model representations of true/false datasets,” inFirst Conference on Language Modeling, 2024. [Online]. Available: https://openreview.net/forum?id=aajyHYjjsk

2024
[22]

Emergent linear representations in world models of self-supervised sequence models,

N. Nanda, A. Lee, and M. Wattenberg, “Emergent linear representations in world models of self-supervised sequence models,” inProceedings of the 6th BlackboxNLP Workshop: Analyzing and Interpreting Neural Networks for NLP, 2023, pp. 16–30

2023
[23]

Plug and play language models: A simple approach to controlled text generation,

S. Dathathri, A. Madotto, J. Lan, J. Hung, E. Frank, P. Molino, J. Yosinski, and R. Liu, “Plug and play language models: A simple approach to controlled text generation,” inInternational Conference on Learning Representations, 2020. [Online]. Available: https://openreview.net/forum?id=H1edEyBKDS

2020
[24]

Inference-time intervention: Eliciting truthful answers from a language model,

K. Li, O. Patel, F. Viégas, H. Pfister, and M. Wattenberg, “Inference-time intervention: Eliciting truthful answers from a language model,”Advances in Neural Information Processing Systems, vol. 36, pp. 41 451–41 530, 2023

2023
[25]

Refusal in language models is mediated by a single direction,

A. Arditi, O. Obeso, A. Syed, D. Paleka, N. Panickssery, W. Gurnee, and N. Nanda, “Refusal in language models is mediated by a single direction,” no. arXiv:2406.11717, Oct. 2024, arXiv:2406.11717. [Online]. Available: http://arxiv.org/abs/2406.11717

Pith/arXiv arXiv 2024
[26]

Steering llama 2 via contrastive activation addition,

N. Rimsky, N. Gabrieli, J. Schulz, M. Tong, E. Hubinger, and A. Turner, “Steering llama 2 via contrastive activation addition,” inProceedings of the 62nd Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers). Bangkok, Thailand: Association for Computational Linguistics, 2024, p. 15504–15522. [Online]. Available: https://a...

2024
[27]

Inference-time intervention: Eliciting truthful answers from a language model,

K. Li, O. Patel, F. Viégas, H. Pfister, and M. Wattenberg, “Inference-time intervention: Eliciting truthful answers from a language model,” no. arXiv:2306.03341, 2024, arXiv:2306.03341. [Online]. Available: http://arxiv.org/abs/2306.03341

Pith/arXiv arXiv 2024
[28]

Steering language models with activation engineering,

A. M. Turner, L. Thiergart, G. Leech, D. Udell, J. J. Vazquez, U. Mini, and M. MacDiarmid, “Steering language models with activation engineering,” no. arXiv:2308.10248, Oct. 2024, arXiv:2308.10248. [Online]. Available: http://arxiv.org/abs/2308.10248

Pith/arXiv arXiv 2024
[29]

Learning to summarize with human feedback,

N. Stiennon, L. Ouyang, J. Wu, D. Ziegler, R. Lowe, C. V oss, A. Radford, D. Amodei, and P. F. Christiano, “Learning to summarize with human feedback,”Advances in neural information processing systems, vol. 33, pp. 3008–3021, 2020

2020
[30]

Llama: Open and efficient foundation language models,

H. Touvron, T. Lavril, G. Izacard, X. Martinet, M.-A. Lachaux, T. Lacroix, B. Rozière, N. Goyal, E. Hambro, F. Azharet al., “Llama: Open and efficient foundation language models,”arXiv preprint arXiv:2302.13971, 2023

Pith/arXiv arXiv 2023
[31]

Direct preference optimization: Your language model is secretly a reward model,

R. Rafailov, A. Sharma, E. Mitchell, C. D. Manning, S. Ermon, and C. Finn, “Direct preference optimization: Your language model is secretly a reward model,”Advances in neural information processing systems, vol. 36, pp. 53 728–53 741, 2023. 11

2023
[32]

Contrastive preference optimization: Pushing the boundaries of llm performance in machine translation,

H. Xu, A. Sharaf, Y . Chen, W. Tan, L. Shen, B. Van Durme, K. Murray, and Y . J. Kim, “Contrastive preference optimization: Pushing the boundaries of llm performance in machine translation,”arXiv preprint arXiv:2401.08417, 2024

arXiv 2024
[33]

Rrhf: Rank responses to align language models with human feedback,

H. Yuan, Z. Yuan, C. Tan, W. Wang, S. Huang, and F. Huang, “Rrhf: Rank responses to align language models with human feedback,”Advances in Neural Information Processing Systems, vol. 36, pp. 10 935–10 950, 2023

2023
[34]

Preference ranking optimization for human alignment,

F. Song, B. Yu, M. Li, H. Yu, F. Huang, Y . Li, and H. Wang, “Preference ranking optimization for human alignment,” inProceedings of the AAAI Conference on Artificial Intelligence, vol. 38, no. 17, 2024, pp. 18 990–18 998

2024
[35]

Parameter-efficient transfer learning for nlp,

N. Houlsby, A. Giurgiu, S. Jastrzebski, B. Morrone, Q. De Laroussilhe, A. Gesmundo, M. At- tariyan, and S. Gelly, “Parameter-efficient transfer learning for nlp,” inInternational conference on machine learning. PMLR, 2019, pp. 2790–2799

2019
[36]

A general language assistant as a laboratory for alignment,

A. Askell, Y . Bai, A. Chen, D. Drain, D. Ganguli, T. Henighan, A. Jones, N. Joseph, B. Mann, N. DasSarmaet al., “A general language assistant as a laboratory for alignment,”arXiv preprint arXiv:2112.00861, 2021

Pith/arXiv arXiv 2021
[37]

Defending large language models against jailbreaking attacks through goal prioritization,

Z. Zhang, J. Yang, P. Ke, F. Mi, H. Wang, and M. Huang, “Defending large language models against jailbreaking attacks through goal prioritization,” inProceedings of the 62nd Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), 2024, pp. 8865–8887

2024
[38]

Args: Alignment as reward-guided search,

M. Khanov, J. Burapacheep, and Y . Li, “Args: Alignment as reward-guided search,”arXiv preprint arXiv:2402.01694, 2024

arXiv 2024
[39]

Deal: Decoding-time alignment for large language models,

J. Y . Huang, S. Sengupta, D. Bonadiman, Y .-a. Lai, A. Gupta, N. Pappas, S. Mansour, K. Kirch- hoff, and D. Roth, “Deal: Decoding-time alignment for large language models,” inProceedings of the 63rd Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), 2025, pp. 26 280–26 300

2025
[40]

Controlling language and diffusion models by transporting activations,

P. Rodriguez, A. Blaas, M. Klein, L. Zappella, N. Apostoloff, M. Cuturi, and X. Suau, “Controlling language and diffusion models by transporting activations,” no. arXiv:2410.23054, Nov. 2024, arXiv:2410.23054. [Online]. Available: http://arxiv.org/abs/2410.23054

arXiv 2024
[41]

Reft: Representation finetuning for language models,

Z. Wu, A. Arora, Z. Wang, A. Geiger, D. Jurafsky, C. D. Manning, and C. Potts, “Reft: Representation finetuning for language models,”Advances in Neural Information Processing Systems, vol. 37, pp. 63 908–63 962, 2024

2024
[42]

Advancing parameter efficiency in fine-tuning via representation editing,

M. Wu, W. Liu, X. Wang, T. Li, C. Lv, Z. Ling, Z. JianHao, C. Zhang, X. Zheng, and X.-J. Huang, “Advancing parameter efficiency in fine-tuning via representation editing,” inProceedings of the 62nd Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), 2024, pp. 13 445–13 464

2024
[43]

Angular steering: Behavior control via rotation in activation space,

H. M. Vu and T. M. Nguyen, “Angular steering: Behavior control via rotation in activation space,” no. arXiv:2510.26243, Oct. 2025, arXiv:2510.26243. [Online]. Available: http://arxiv.org/abs/2510.26243

arXiv 2025
[44]

What’s the magic word? a control theory of llm prompting,

A. Bhargava, C. Witkowski, S.-Z. Looi, and M. Thomson, “What’s the magic word? a control theory of llm prompting,”arXiv preprint arXiv:2310.04444, 2023

arXiv 2023
[45]

Aligning large language models with representation editing: A control perspective,

L. Kong, H. Wang, W. Mu, Y . Du, Y . Zhuang, Y . Zhou, Y . Song, R. Zhang, K. Wang, and C. Zhang, “Aligning large language models with representation editing: A control perspective,” no. arXiv:2406.05954, Nov. 2024, arXiv:2406.05954. [Online]. Available: http://arxiv.org/abs/2406.05954

arXiv 2024
[46]

Preemptive detection and steering of llm misalignment via latent reachability,

S. Karnik and S. Bansal, “Preemptive detection and steering of llm misalignment via latent reachability,” no. arXiv:2509.21528, Sep. 2025, arXiv:2509.21528. [Online]. Available: http://arxiv.org/abs/2509.21528

arXiv 2025
[47]

Linearly controlled language generation with performative guarantees,

E. Cheng and C. A. Alonso, “Linearly controlled language generation with performative guarantees,” no. arXiv:2405.15454, Sep. 2025, arXiv:2405.15454. [Online]. Available: http://arxiv.org/abs/2405.15454

Pith/arXiv arXiv 2025
[48]

To steer or not to steer? mechanistic error reduction with abstention for language models,

A. Hedström, S. I. Amoukou, T. Bewley, S. Mishra, and M. Veloso, “To steer or not to steer? mechanistic error reduction with abstention for language models,” inProceedings of the 42nd International Conference on Machine Learning, ser. Proceedings of Machine Learning Research, vol. 267. Vancouver, Canada: PMLR, 2025. 12

2025
[49]

Analysing the generalisation and reliability of steering vectors,

D. Tan, D. Chanin, A. Lynch, B. Paige, D. Kanoulas, A. Garriga-Alonso, and R. Kirk, “Analysing the generalisation and reliability of steering vectors,”Advances in Neural Information Process- ing Systems, vol. 37, pp. 139 179–139 212, 2024

2024
[50]

Multi-property steering of large language models with dynamic activation composition,

D. Scalena, G. Sarti, and M. Nissim, “Multi-property steering of large language models with dynamic activation composition,”arXiv preprint arXiv:2406.17563, 2024

arXiv 2024
[51]

Cbf-llm: Safe control for llm alignment,

Y . Miyaoka and M. Inoue, “Cbf-llm: Safe control for llm alignment,”arXiv preprint arXiv:2408.15625, 2024

arXiv 2024
[52]

Activation steering with a feedback controller,

D. V . Nguyen, H. M. Vu, N. Y . Pham, L. Zhang, and T. M. Nguyen, “Activation steering with a feedback controller,” no. arXiv:2510.04309, Oct. 2025, arXiv:2510.04309. [Online]. Available: http://arxiv.org/abs/2510.04309

Pith/arXiv arXiv 2025
[53]

Zeroscope v2 576w,

Cerspense, “Zeroscope v2 576w,” https://huggingface.co/cerspense/zeroscope_v2_576w, 2023, accessed: 2025-09-23

2023
[54]

Cogvideo: Large-scale pretraining for text- to-video generation via transformers,

W. Hong, M. Ding, W. Zheng, X. Liu, and J. Tang, “Cogvideo: Large-scale pretraining for text- to-video generation via transformers,” inThe Eleventh International Conference on Learning Representations, 2023

2023
[55]

Cogvideox: Text-to-video diffusion models with an expert transformer,

Z. Yang, J. Teng, W. Zheng, M. Ding, S. Huang, J. Xu, Y . Yang, W. Hong, X. Zhang, G. Feng, D. Yin, Yuxuan.Zhang, W. Wang, Y . Cheng, B. Xu, X. Gu, Y . Dong, and J. Tang, “Cogvideox: Text-to-video diffusion models with an expert transformer,” inThe Thirteenth International Conference on Learning Representations, 2025

2025
[56]

Imagen video: High definition video generation with diffusion models,

J. Ho, W. Chan, C. Saharia, J. Whang, R. Gao, A. Gritsenko, D. P. Kingma, B. Poole, M. Norouzi, D. J. Fleet, and T. Salimans, “Imagen video: High definition video generation with diffusion models,” 2022

2022
[57]

Make-a-video: Text-to-video generation without text- video data,

U. Singer, A. Polyak, T. Hayes, X. Yin, J. An, S. Zhang, Q. Hu, H. Yang, O. Ashual, O. Gafni, D. Parikh, S. Gupta, and Y . Taigman, “Make-a-video: Text-to-video generation without text- video data,” 2022

2022
[58]

Modelscope text-to-video technical report,

J. Wang, H. Yuan, D. Chen, Y . Zhang, X. Wang, and S. Zhang, “Modelscope text-to-video technical report,”arXiv preprint arXiv:2308.06571, 2023

Pith/arXiv arXiv 2023
[59]

Sora: A review on background, technology, limitations, and opportunities of large vision models,

Y . Liu, K. Zhang, Y . Li, Z. Yan, C. Gao, R. Chen, Z. Yuan, Y . Huang, H. Sun, J. Gao, L. He, and L. Sun, “Sora: A review on background, technology, limitations, and opportunities of large vision models,” 2024. [Online]. Available: https://arxiv.org/abs/2402.17177

Pith/arXiv arXiv 2024
[60]

Open-sora: Democratizing efficient video production for all,

Z. Zheng, X. Peng, T. Yang, C. Shen, S. Li, H. Liu, Y . Zhou, T. Li, and Y . You, “Open-sora: Democratizing efficient video production for all,”arXiv preprint arXiv:2412.20404, 2024

Pith/arXiv arXiv 2024
[61]

T2vattack: Adversarial attack on text-to-video diffusion models,

C. Li, Y . Min, J. Zhang, Z. Yuan, S. Shan, and X. Chen, “T2vattack: Adversarial attack on text-to-video diffusion models,”arXiv preprint arXiv:2512.23953, 2025

arXiv 2025
[62]

T2v-optjail: Discrete prompt optimization for text-to-video jailbreak attacks,

J. Liu, S. Liang, S. Zhao, R. Tu, W. Zhou, A. Liu, D. Tao, and S. K. Lam, “T2v-optjail: Discrete prompt optimization for text-to-video jailbreak attacks,”arXiv preprint arXiv:2505.06679, 2025

arXiv 2025
[63]

T2vsafetybench: Evaluating the safety of text-to-video generative models,

Y . Miao, Y . Zhu, L. Yu, J. Zhu, X.-S. Gao, and Y . Dong, “T2vsafetybench: Evaluating the safety of text-to-video generative models,”Advances in Neural Information Processing Systems, vol. 37, pp. 63 858–63 872, 2024

2024
[64]

Two frames matter: A temporal attack for text-to-video model jailbreaking,

M. Chen, Z. Ying, W. Xu, Q. Zou, D. Zhang, D. Yang, and X. Zhang, “Two frames matter: A temporal attack for text-to-video model jailbreaking,”arXiv preprint arXiv:2603.07028, 2026

arXiv 2026
[65]

Badvideo: Stealthy back- door attack against text-to-video generation,

R. Wang, M. Zhu, J. Ou, R. Chen, X. Tao, P. Wan, and B. Wu, “Badvideo: Stealthy back- door attack against text-to-video generation,” inProceedings of the IEEE/CVF International Conference on Computer Vision (ICCV), October 2025, pp. 19 075–19 084

2025
[66]

Unlearning concepts from text-to-video diffusion models,

S. Liu and Y . Tan, “Unlearning concepts from text-to-video diffusion models,”arXiv preprint arXiv:2407.14209, 2024

arXiv 2024
[67]

Nullsce: Sequential concept erasure in generative video diffusion models via null-space guidance,

Q. Yi, B. Li, C. Wu, Y . Li, X. Teng, X. Xu, Y . Tan, and C. Chen, “Nullsce: Sequential concept erasure in generative video diffusion models via null-space guidance,”Available at SSRN 5993786
[68]

Lineas: End-to-end learning of activation steering with a distributional loss,

P. Rodriguez, M. Klein, E. Gualdoni, V . Maiorca, A. Blaas, L. Zappella, M. Cuturi, and X. Suau, “Lineas: End-to-end learning of activation steering with a distributional loss,”arXiv preprint arXiv:2503.10679, 2025. 13

arXiv 2025
[69]

The unreasonable effectiveness of text embedding interpolation for continuous image steering,

Y . Ekin and Y . Gandelsman, “The unreasonable effectiveness of text embedding interpolation for continuous image steering,”arXiv preprint arXiv:2603.17998, 2026

arXiv 2026
[70]

Exploring the limits of transfer learning with a unified text-to-text transformer,

C. Raffel, N. Shazeer, A. Roberts, K. Lee, S. Narang, M. Matena, Y . Zhou, W. Li, and P. J. Liu, “Exploring the limits of transfer learning with a unified text-to-text transformer,”Journal of machine learning research, vol. 21, no. 140, pp. 1–67, 2020

2020
[71]

Scalable diffusion models with transformers,

W. Peebles and S. Xie, “Scalable diffusion models with transformers,”arXiv preprint arXiv:2212.09748, 2022

Pith/arXiv arXiv 2022
[72]

Contributions to the theory of optimal control,

R. E. Kalmanet al., “Contributions to the theory of optimal control,”Bol. soc. mat. mexicana, vol. 5, no. 2, pp. 102–119, 1960

1960
[73]

F. L. Lewis, D. L. Vrabie, and V . L. Syrmos,Optimal Control, 3rd ed. Hoboken, NJ: John Wiley & Sons, 2012, ch. 2

2012
[74]

Finding structure with randomness: Probabilistic algorithms for constructing approximate matrix decompositions,

N. Halko, P.-G. Martinsson, and J. A. Tropp, “Finding structure with randomness: Probabilistic algorithms for constructing approximate matrix decompositions,”SIAM review, vol. 53, no. 2, pp. 217–288, 2011

2011
[75]

Hunyuanvideo: A systematic framework for large video generative models,

W. Kong, Q. Tian, Z. Zhang, R. Min, Z. Dai, J. Zhou, J. Xiong, X. Li, B. Wu, J. Zhanget al., “Hunyuanvideo: A systematic framework for large video generative models,”arXiv preprint arXiv:2412.03603, 2024

Pith/arXiv arXiv 2024
[76]

Vbench: Comprehensive benchmark suite for video generative models,

Z. Huang, Y . He, J. Yu, F. Zhang, C. Si, Y . Jiang, Y . Zhang, T. Wu, Q. Jin, N. Chanpaisitet al., “Vbench: Comprehensive benchmark suite for video generative models,” inProceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, 2024, pp. 21 807–21 818

2024
[77]

Regret bounds for the adaptive control of linear quadratic systems,

Y . Abbasi-Yadkori and C. Szepesvári, “Regret bounds for the adaptive control of linear quadratic systems,” inProceedings of the 24th annual conference on learning theory. JMLR Workshop and Conference Proceedings, 2011, pp. 1–26. 14 A Proofs Lemma A.1(Projection-calibrated feature setpoints).For any raw xs, the raw and latent feature strengths satisfy βx ...

2011

[1] [1]

Wan: Open and advanced large-scale video generative models,

T. Wan, A. Wang, B. Ai, B. Wen, C. Mao, C.-W. Xie, D. Chen, F. Yu, H. Zhao, J. Yang et al., “Wan: Open and advanced large-scale video generative models,”arXiv preprint arXiv:2503.20314, 2025

Pith/arXiv arXiv 2025

[2] [2]

Editverse: Unifying image and video editing and generation with in-context learning,

X. Ju, T. Wang, Y . Zhou, H. Zhang, Q. Liu, N. Zhao, Z. Zhang, Y . Li, Y . Cai, S. Liuet al., “Editverse: Unifying image and video editing and generation with in-context learning,”arXiv preprint arXiv:2509.20360, 2025

Pith/arXiv arXiv 2025

[3] [4]

Cogvideox: Text-to-video diffusion models with an expert transformer,

Z. Yang, J. Teng, W. Zheng, M. Ding, S. Huang, J. Xu, Y . Yang, W. Hong, X. Zhang, G. Feng et al., “Cogvideox: Text-to-video diffusion models with an expert transformer,”arXiv preprint arXiv:2408.06072, 2024

Pith/arXiv arXiv 2024

[4] [5]

Video diffusion models,

J. Ho, T. Salimans, A. Gritsenko, W. Chan, M. Norouzi, and D. J. Fleet, “Video diffusion models,”Advances in neural information processing systems, vol. 35, pp. 8633–8646, 2022

2022

[5] [6]

Moviebench: A hierarchical movie level dataset for long video generation,

W. Wu, M. Liu, Z. Zhu, X. Xia, H. Feng, W. Wang, K. Q. Lin, C. Shen, and M. Z. Shou, “Moviebench: A hierarchical movie level dataset for long video generation,” inProceedings of the Computer Vision and Pattern Recognition Conference, 2025, pp. 28 984–28 994

2025

[6] [7]

Pre-trained video generative models as world simulators,

H. He, Y . Zhang, L. Lin, Z. Xu, and L. Pan, “Pre-trained video generative models as world simulators,” inProceedings of the AAAI Conference on Artificial Intelligence, vol. 40, no. 6, 2026, pp. 4645–4653

2026

[7] [8]

Cosmos policy: Fine-tuning video models for visuomotor control and planning,

M. J. Kim, Y . Gao, T.-Y . Lin, Y .-C. Lin, Y . Ge, G. Lam, P. Liang, S. Song, M.-Y . Liu, C. Finn et al., “Cosmos policy: Fine-tuning video models for visuomotor control and planning,”arXiv preprint arXiv:2601.16163, 2026

Pith/arXiv arXiv 2026

[8] [9]

Video unlearning via low-rank refusal vector,

S. Facchiano, S. Saravalle, M. Migliarini, E. De Matteis, A. Sampieri, A. Pilzer, E. Rodolà, I. Spinelli, L. Franco, and F. Galasso, “Video unlearning via low-rank refusal vector,”arXiv preprint arXiv:2506.07891, 2025

arXiv 2025

[9] [10]

Safree: Training-free and adaptive guard for safe text-to-image and video generation,

J. Yoon, S. Yu, V . Patil, H. Yao, and M. Bansal, “Safree: Training-free and adaptive guard for safe text-to-image and video generation,”arXiv preprint arXiv:2410.12761, 2024

arXiv 2024

[10] [11]

Vpo: Aligning text-to-video generation models with prompt optimization,

J. Cheng, R. Lyu, X. Gu, X. Liu, J. Xu, Y . Lu, J. Teng, Z. Yang, Y . Dong, J. Tanget al., “Vpo: Aligning text-to-video generation models with prompt optimization,” inProceedings of the IEEE/CVF International Conference on Computer Vision, 2025, pp. 15 636–15 645

2025

[11] [12]

Safesora: Towards safety alignment of text2video generation via a human preference dataset,

J. Dai, T. Chen, X. Wang, Z. Yang, T. Chen, J. Ji, and Y . Yang, “Safesora: Towards safety alignment of text2video generation via a human preference dataset,”Advances in Neural Information Processing Systems, vol. 37, pp. 17 161–17 214, 2024

2024

[12] [13]

Unified concept editing in diffusion models,

R. Gandikota, H. Orgad, Y . Belinkov, J. Materzy´nska, and D. Bau, “Unified concept editing in diffusion models,” inProceedings of the IEEE/CVF winter conference on applications of computer vision, 2024, pp. 5111–5120

2024

[13] [14]

Activation addition: Steering language models without optimization,

A. M. Turner, L. Thiergart, G. Leech, D. Udell, U. Mini, and M. MacDiarmid, “Activation addition: Steering language models without optimization,” 2024. 10

2024

[14] [15]

ODESteer: A unified ODE-based steering framework for LLM alignment,

H. Zhao, H. Sun, J. Kong, X. Li, Q. Wang, L. Jiang, Q. Zhu, T. F. Abdelzaher, Y . Choi, M. Li, and H. Shao, “ODESteer: A unified ODE-based steering framework for LLM alignment,” inThe Fourteenth International Conference on Learning Representations, 2026. [Online]. Available: https://openreview.net/forum?id=CFewUmgIIL

2026

[15] [16]

Local linearity of llms enables activation steering via model-based linear optimal control,

J. Skifstad, X. A. Yang, and G. Chou, “Local linearity of llms enables activation steering via model-based linear optimal control,”arXiv preprint arXiv:2604.19018, 2026

Pith/arXiv arXiv 2026

[16] [17]

Mechanistic interpretability for AI safety - a review,

L. Bereska and S. Gavves, “Mechanistic interpretability for AI safety - a review,”Transactions on Machine Learning Research, 2024, survey Certification, Expert Certification. [Online]. Available: https://openreview.net/forum?id=ePUVetPKu6

2024

[17] [18]

Toy models of superposition,

N. Elhage, T. Hume, C. Olsson, N. Schiefer, T. Henighan, S. Kravec, Z. Hatfield-Dodds, R. Lasenby, D. Drain, C. Chen, R. Grosse, S. McCandlish, J. Kaplan, D. Amodei, M. Wattenberg, and C. Olah, “Toy models of superposition,” no. arXiv:2209.10652, 2022, arXiv:2209.10652. [Online]. Available: http://arxiv.org/abs/2209.10652

Pith/arXiv arXiv 2022

[18] [19]

Linguistic regularities in continuous space word represen- tations,

T. Mikolov, W.-t. Yih, and G. Zweig, “Linguistic regularities in continuous space word represen- tations,” inProceedings of the 2013 conference of the north american chapter of the association for computational linguistics: Human language technologies, 2013, pp. 746–751

2013

[19] [20]

The linear representation hypothesis and the geometry of large language models,

K. Park, Y . J. Choe, and V . Veitch, “The linear representation hypothesis and the geometry of large language models,”arXiv preprint arXiv:2311.03658, 2023

Pith/arXiv arXiv 2023

[20] [21]

The geometry of truth: Emergent linear structure in large language model representations of true/false datasets,

S. Marks and M. Tegmark, “The geometry of truth: Emergent linear structure in large language model representations of true/false datasets,” inFirst Conference on Language Modeling, 2024. [Online]. Available: https://openreview.net/forum?id=aajyHYjjsk

2024

[21] [22]

Emergent linear representations in world models of self-supervised sequence models,

N. Nanda, A. Lee, and M. Wattenberg, “Emergent linear representations in world models of self-supervised sequence models,” inProceedings of the 6th BlackboxNLP Workshop: Analyzing and Interpreting Neural Networks for NLP, 2023, pp. 16–30

2023

[22] [23]

Plug and play language models: A simple approach to controlled text generation,

S. Dathathri, A. Madotto, J. Lan, J. Hung, E. Frank, P. Molino, J. Yosinski, and R. Liu, “Plug and play language models: A simple approach to controlled text generation,” inInternational Conference on Learning Representations, 2020. [Online]. Available: https://openreview.net/forum?id=H1edEyBKDS

2020

[23] [24]

Inference-time intervention: Eliciting truthful answers from a language model,

K. Li, O. Patel, F. Viégas, H. Pfister, and M. Wattenberg, “Inference-time intervention: Eliciting truthful answers from a language model,”Advances in Neural Information Processing Systems, vol. 36, pp. 41 451–41 530, 2023

2023

[24] [25]

Refusal in language models is mediated by a single direction,

A. Arditi, O. Obeso, A. Syed, D. Paleka, N. Panickssery, W. Gurnee, and N. Nanda, “Refusal in language models is mediated by a single direction,” no. arXiv:2406.11717, Oct. 2024, arXiv:2406.11717. [Online]. Available: http://arxiv.org/abs/2406.11717

Pith/arXiv arXiv 2024

[25] [26]

Steering llama 2 via contrastive activation addition,

N. Rimsky, N. Gabrieli, J. Schulz, M. Tong, E. Hubinger, and A. Turner, “Steering llama 2 via contrastive activation addition,” inProceedings of the 62nd Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers). Bangkok, Thailand: Association for Computational Linguistics, 2024, p. 15504–15522. [Online]. Available: https://a...

2024

[26] [27]

Inference-time intervention: Eliciting truthful answers from a language model,

K. Li, O. Patel, F. Viégas, H. Pfister, and M. Wattenberg, “Inference-time intervention: Eliciting truthful answers from a language model,” no. arXiv:2306.03341, 2024, arXiv:2306.03341. [Online]. Available: http://arxiv.org/abs/2306.03341

Pith/arXiv arXiv 2024

[27] [28]

Steering language models with activation engineering,

A. M. Turner, L. Thiergart, G. Leech, D. Udell, J. J. Vazquez, U. Mini, and M. MacDiarmid, “Steering language models with activation engineering,” no. arXiv:2308.10248, Oct. 2024, arXiv:2308.10248. [Online]. Available: http://arxiv.org/abs/2308.10248

Pith/arXiv arXiv 2024

[28] [29]

Learning to summarize with human feedback,

N. Stiennon, L. Ouyang, J. Wu, D. Ziegler, R. Lowe, C. V oss, A. Radford, D. Amodei, and P. F. Christiano, “Learning to summarize with human feedback,”Advances in neural information processing systems, vol. 33, pp. 3008–3021, 2020

2020

[29] [30]

Llama: Open and efficient foundation language models,

H. Touvron, T. Lavril, G. Izacard, X. Martinet, M.-A. Lachaux, T. Lacroix, B. Rozière, N. Goyal, E. Hambro, F. Azharet al., “Llama: Open and efficient foundation language models,”arXiv preprint arXiv:2302.13971, 2023

Pith/arXiv arXiv 2023

[30] [31]

Direct preference optimization: Your language model is secretly a reward model,

R. Rafailov, A. Sharma, E. Mitchell, C. D. Manning, S. Ermon, and C. Finn, “Direct preference optimization: Your language model is secretly a reward model,”Advances in neural information processing systems, vol. 36, pp. 53 728–53 741, 2023. 11

2023

[31] [32]

Contrastive preference optimization: Pushing the boundaries of llm performance in machine translation,

H. Xu, A. Sharaf, Y . Chen, W. Tan, L. Shen, B. Van Durme, K. Murray, and Y . J. Kim, “Contrastive preference optimization: Pushing the boundaries of llm performance in machine translation,”arXiv preprint arXiv:2401.08417, 2024

arXiv 2024

[32] [33]

Rrhf: Rank responses to align language models with human feedback,

H. Yuan, Z. Yuan, C. Tan, W. Wang, S. Huang, and F. Huang, “Rrhf: Rank responses to align language models with human feedback,”Advances in Neural Information Processing Systems, vol. 36, pp. 10 935–10 950, 2023

2023

[33] [34]

Preference ranking optimization for human alignment,

F. Song, B. Yu, M. Li, H. Yu, F. Huang, Y . Li, and H. Wang, “Preference ranking optimization for human alignment,” inProceedings of the AAAI Conference on Artificial Intelligence, vol. 38, no. 17, 2024, pp. 18 990–18 998

2024

[34] [35]

Parameter-efficient transfer learning for nlp,

N. Houlsby, A. Giurgiu, S. Jastrzebski, B. Morrone, Q. De Laroussilhe, A. Gesmundo, M. At- tariyan, and S. Gelly, “Parameter-efficient transfer learning for nlp,” inInternational conference on machine learning. PMLR, 2019, pp. 2790–2799

2019

[35] [36]

A general language assistant as a laboratory for alignment,

A. Askell, Y . Bai, A. Chen, D. Drain, D. Ganguli, T. Henighan, A. Jones, N. Joseph, B. Mann, N. DasSarmaet al., “A general language assistant as a laboratory for alignment,”arXiv preprint arXiv:2112.00861, 2021

Pith/arXiv arXiv 2021

[36] [37]

Defending large language models against jailbreaking attacks through goal prioritization,

Z. Zhang, J. Yang, P. Ke, F. Mi, H. Wang, and M. Huang, “Defending large language models against jailbreaking attacks through goal prioritization,” inProceedings of the 62nd Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), 2024, pp. 8865–8887

2024

[37] [38]

Args: Alignment as reward-guided search,

M. Khanov, J. Burapacheep, and Y . Li, “Args: Alignment as reward-guided search,”arXiv preprint arXiv:2402.01694, 2024

arXiv 2024

[38] [39]

Deal: Decoding-time alignment for large language models,

J. Y . Huang, S. Sengupta, D. Bonadiman, Y .-a. Lai, A. Gupta, N. Pappas, S. Mansour, K. Kirch- hoff, and D. Roth, “Deal: Decoding-time alignment for large language models,” inProceedings of the 63rd Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), 2025, pp. 26 280–26 300

2025

[39] [40]

Controlling language and diffusion models by transporting activations,

P. Rodriguez, A. Blaas, M. Klein, L. Zappella, N. Apostoloff, M. Cuturi, and X. Suau, “Controlling language and diffusion models by transporting activations,” no. arXiv:2410.23054, Nov. 2024, arXiv:2410.23054. [Online]. Available: http://arxiv.org/abs/2410.23054

arXiv 2024

[40] [41]

Reft: Representation finetuning for language models,

Z. Wu, A. Arora, Z. Wang, A. Geiger, D. Jurafsky, C. D. Manning, and C. Potts, “Reft: Representation finetuning for language models,”Advances in Neural Information Processing Systems, vol. 37, pp. 63 908–63 962, 2024

2024

[41] [42]

Advancing parameter efficiency in fine-tuning via representation editing,

M. Wu, W. Liu, X. Wang, T. Li, C. Lv, Z. Ling, Z. JianHao, C. Zhang, X. Zheng, and X.-J. Huang, “Advancing parameter efficiency in fine-tuning via representation editing,” inProceedings of the 62nd Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), 2024, pp. 13 445–13 464

2024

[42] [43]

Angular steering: Behavior control via rotation in activation space,

H. M. Vu and T. M. Nguyen, “Angular steering: Behavior control via rotation in activation space,” no. arXiv:2510.26243, Oct. 2025, arXiv:2510.26243. [Online]. Available: http://arxiv.org/abs/2510.26243

arXiv 2025

[43] [44]

What’s the magic word? a control theory of llm prompting,

A. Bhargava, C. Witkowski, S.-Z. Looi, and M. Thomson, “What’s the magic word? a control theory of llm prompting,”arXiv preprint arXiv:2310.04444, 2023

arXiv 2023

[44] [45]

Aligning large language models with representation editing: A control perspective,

L. Kong, H. Wang, W. Mu, Y . Du, Y . Zhuang, Y . Zhou, Y . Song, R. Zhang, K. Wang, and C. Zhang, “Aligning large language models with representation editing: A control perspective,” no. arXiv:2406.05954, Nov. 2024, arXiv:2406.05954. [Online]. Available: http://arxiv.org/abs/2406.05954

arXiv 2024

[45] [46]

Preemptive detection and steering of llm misalignment via latent reachability,

S. Karnik and S. Bansal, “Preemptive detection and steering of llm misalignment via latent reachability,” no. arXiv:2509.21528, Sep. 2025, arXiv:2509.21528. [Online]. Available: http://arxiv.org/abs/2509.21528

arXiv 2025

[46] [47]

Linearly controlled language generation with performative guarantees,

E. Cheng and C. A. Alonso, “Linearly controlled language generation with performative guarantees,” no. arXiv:2405.15454, Sep. 2025, arXiv:2405.15454. [Online]. Available: http://arxiv.org/abs/2405.15454

Pith/arXiv arXiv 2025

[47] [48]

To steer or not to steer? mechanistic error reduction with abstention for language models,

A. Hedström, S. I. Amoukou, T. Bewley, S. Mishra, and M. Veloso, “To steer or not to steer? mechanistic error reduction with abstention for language models,” inProceedings of the 42nd International Conference on Machine Learning, ser. Proceedings of Machine Learning Research, vol. 267. Vancouver, Canada: PMLR, 2025. 12

2025

[48] [49]

Analysing the generalisation and reliability of steering vectors,

D. Tan, D. Chanin, A. Lynch, B. Paige, D. Kanoulas, A. Garriga-Alonso, and R. Kirk, “Analysing the generalisation and reliability of steering vectors,”Advances in Neural Information Process- ing Systems, vol. 37, pp. 139 179–139 212, 2024

2024

[49] [50]

Multi-property steering of large language models with dynamic activation composition,

D. Scalena, G. Sarti, and M. Nissim, “Multi-property steering of large language models with dynamic activation composition,”arXiv preprint arXiv:2406.17563, 2024

arXiv 2024

[50] [51]

Cbf-llm: Safe control for llm alignment,

Y . Miyaoka and M. Inoue, “Cbf-llm: Safe control for llm alignment,”arXiv preprint arXiv:2408.15625, 2024

arXiv 2024

[51] [52]

Activation steering with a feedback controller,

D. V . Nguyen, H. M. Vu, N. Y . Pham, L. Zhang, and T. M. Nguyen, “Activation steering with a feedback controller,” no. arXiv:2510.04309, Oct. 2025, arXiv:2510.04309. [Online]. Available: http://arxiv.org/abs/2510.04309

Pith/arXiv arXiv 2025

[52] [53]

Zeroscope v2 576w,

Cerspense, “Zeroscope v2 576w,” https://huggingface.co/cerspense/zeroscope_v2_576w, 2023, accessed: 2025-09-23

2023

[53] [54]

Cogvideo: Large-scale pretraining for text- to-video generation via transformers,

W. Hong, M. Ding, W. Zheng, X. Liu, and J. Tang, “Cogvideo: Large-scale pretraining for text- to-video generation via transformers,” inThe Eleventh International Conference on Learning Representations, 2023

2023

[54] [55]

Cogvideox: Text-to-video diffusion models with an expert transformer,

Z. Yang, J. Teng, W. Zheng, M. Ding, S. Huang, J. Xu, Y . Yang, W. Hong, X. Zhang, G. Feng, D. Yin, Yuxuan.Zhang, W. Wang, Y . Cheng, B. Xu, X. Gu, Y . Dong, and J. Tang, “Cogvideox: Text-to-video diffusion models with an expert transformer,” inThe Thirteenth International Conference on Learning Representations, 2025

2025

[55] [56]

Imagen video: High definition video generation with diffusion models,

J. Ho, W. Chan, C. Saharia, J. Whang, R. Gao, A. Gritsenko, D. P. Kingma, B. Poole, M. Norouzi, D. J. Fleet, and T. Salimans, “Imagen video: High definition video generation with diffusion models,” 2022

2022

[56] [57]

Make-a-video: Text-to-video generation without text- video data,

U. Singer, A. Polyak, T. Hayes, X. Yin, J. An, S. Zhang, Q. Hu, H. Yang, O. Ashual, O. Gafni, D. Parikh, S. Gupta, and Y . Taigman, “Make-a-video: Text-to-video generation without text- video data,” 2022

2022

[57] [58]

Modelscope text-to-video technical report,

J. Wang, H. Yuan, D. Chen, Y . Zhang, X. Wang, and S. Zhang, “Modelscope text-to-video technical report,”arXiv preprint arXiv:2308.06571, 2023

Pith/arXiv arXiv 2023

[58] [59]

Sora: A review on background, technology, limitations, and opportunities of large vision models,

Y . Liu, K. Zhang, Y . Li, Z. Yan, C. Gao, R. Chen, Z. Yuan, Y . Huang, H. Sun, J. Gao, L. He, and L. Sun, “Sora: A review on background, technology, limitations, and opportunities of large vision models,” 2024. [Online]. Available: https://arxiv.org/abs/2402.17177

Pith/arXiv arXiv 2024

[59] [60]

Open-sora: Democratizing efficient video production for all,

Z. Zheng, X. Peng, T. Yang, C. Shen, S. Li, H. Liu, Y . Zhou, T. Li, and Y . You, “Open-sora: Democratizing efficient video production for all,”arXiv preprint arXiv:2412.20404, 2024

Pith/arXiv arXiv 2024

[60] [61]

T2vattack: Adversarial attack on text-to-video diffusion models,

C. Li, Y . Min, J. Zhang, Z. Yuan, S. Shan, and X. Chen, “T2vattack: Adversarial attack on text-to-video diffusion models,”arXiv preprint arXiv:2512.23953, 2025

arXiv 2025

[61] [62]

T2v-optjail: Discrete prompt optimization for text-to-video jailbreak attacks,

J. Liu, S. Liang, S. Zhao, R. Tu, W. Zhou, A. Liu, D. Tao, and S. K. Lam, “T2v-optjail: Discrete prompt optimization for text-to-video jailbreak attacks,”arXiv preprint arXiv:2505.06679, 2025

arXiv 2025

[62] [63]

T2vsafetybench: Evaluating the safety of text-to-video generative models,

Y . Miao, Y . Zhu, L. Yu, J. Zhu, X.-S. Gao, and Y . Dong, “T2vsafetybench: Evaluating the safety of text-to-video generative models,”Advances in Neural Information Processing Systems, vol. 37, pp. 63 858–63 872, 2024

2024

[63] [64]

Two frames matter: A temporal attack for text-to-video model jailbreaking,

M. Chen, Z. Ying, W. Xu, Q. Zou, D. Zhang, D. Yang, and X. Zhang, “Two frames matter: A temporal attack for text-to-video model jailbreaking,”arXiv preprint arXiv:2603.07028, 2026

arXiv 2026

[64] [65]

Badvideo: Stealthy back- door attack against text-to-video generation,

R. Wang, M. Zhu, J. Ou, R. Chen, X. Tao, P. Wan, and B. Wu, “Badvideo: Stealthy back- door attack against text-to-video generation,” inProceedings of the IEEE/CVF International Conference on Computer Vision (ICCV), October 2025, pp. 19 075–19 084

2025

[65] [66]

Unlearning concepts from text-to-video diffusion models,

S. Liu and Y . Tan, “Unlearning concepts from text-to-video diffusion models,”arXiv preprint arXiv:2407.14209, 2024

arXiv 2024

[66] [67]

Nullsce: Sequential concept erasure in generative video diffusion models via null-space guidance,

Q. Yi, B. Li, C. Wu, Y . Li, X. Teng, X. Xu, Y . Tan, and C. Chen, “Nullsce: Sequential concept erasure in generative video diffusion models via null-space guidance,”Available at SSRN 5993786

[67] [68]

Lineas: End-to-end learning of activation steering with a distributional loss,

P. Rodriguez, M. Klein, E. Gualdoni, V . Maiorca, A. Blaas, L. Zappella, M. Cuturi, and X. Suau, “Lineas: End-to-end learning of activation steering with a distributional loss,”arXiv preprint arXiv:2503.10679, 2025. 13

arXiv 2025

[68] [69]

The unreasonable effectiveness of text embedding interpolation for continuous image steering,

Y . Ekin and Y . Gandelsman, “The unreasonable effectiveness of text embedding interpolation for continuous image steering,”arXiv preprint arXiv:2603.17998, 2026

arXiv 2026

[69] [70]

Exploring the limits of transfer learning with a unified text-to-text transformer,

C. Raffel, N. Shazeer, A. Roberts, K. Lee, S. Narang, M. Matena, Y . Zhou, W. Li, and P. J. Liu, “Exploring the limits of transfer learning with a unified text-to-text transformer,”Journal of machine learning research, vol. 21, no. 140, pp. 1–67, 2020

2020

[70] [71]

Scalable diffusion models with transformers,

W. Peebles and S. Xie, “Scalable diffusion models with transformers,”arXiv preprint arXiv:2212.09748, 2022

Pith/arXiv arXiv 2022

[71] [72]

Contributions to the theory of optimal control,

R. E. Kalmanet al., “Contributions to the theory of optimal control,”Bol. soc. mat. mexicana, vol. 5, no. 2, pp. 102–119, 1960

1960

[72] [73]

F. L. Lewis, D. L. Vrabie, and V . L. Syrmos,Optimal Control, 3rd ed. Hoboken, NJ: John Wiley & Sons, 2012, ch. 2

2012

[73] [74]

Finding structure with randomness: Probabilistic algorithms for constructing approximate matrix decompositions,

N. Halko, P.-G. Martinsson, and J. A. Tropp, “Finding structure with randomness: Probabilistic algorithms for constructing approximate matrix decompositions,”SIAM review, vol. 53, no. 2, pp. 217–288, 2011

2011

[74] [75]

Hunyuanvideo: A systematic framework for large video generative models,

W. Kong, Q. Tian, Z. Zhang, R. Min, Z. Dai, J. Zhou, J. Xiong, X. Li, B. Wu, J. Zhanget al., “Hunyuanvideo: A systematic framework for large video generative models,”arXiv preprint arXiv:2412.03603, 2024

Pith/arXiv arXiv 2024

[75] [76]

Vbench: Comprehensive benchmark suite for video generative models,

Z. Huang, Y . He, J. Yu, F. Zhang, C. Si, Y . Jiang, Y . Zhang, T. Wu, Q. Jin, N. Chanpaisitet al., “Vbench: Comprehensive benchmark suite for video generative models,” inProceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, 2024, pp. 21 807–21 818

2024

[76] [77]

Regret bounds for the adaptive control of linear quadratic systems,

Y . Abbasi-Yadkori and C. Szepesvári, “Regret bounds for the adaptive control of linear quadratic systems,” inProceedings of the 24th annual conference on learning theory. JMLR Workshop and Conference Proceedings, 2011, pp. 1–26. 14 A Proofs Lemma A.1(Projection-calibrated feature setpoints).For any raw xs, the raw and latent feature strengths satisfy βx ...

2011