Local Linearity of LLMs Enables Activation Steering via Model-Based Linear Optimal Control
Pith reviewed 2026-05-10 02:28 UTC · model grok-4.3
The pith
Local linearity of LLM layers allows closed-loop linear quadratic regulator control for activation steering.
A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.
Core claim
Despite the nonlinear structure of transformer blocks, layer-wise dynamics across multiple LLM architectures and scales are well-approximated by locally-linear models. Exploiting this property, we model LLM inference as a linear time-varying dynamical system and adapt the classical linear quadratic regulator to compute feedback controllers using layer-wise Jacobians, steering activations toward desired semantic setpoints in closed-loop with minimal computational overhead and no offline training. We also derive theoretical bounds on setpoint tracking error.
What carries the argument
The locally-linear approximation of layer-wise dynamics, which enables adapting the linear quadratic regulator to provide Jacobian-based feedback controllers for closed-loop activation steering.
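The load-bearing step here can be sketched numerically. The toy below uses a small residual MLP block as a stand-in for a transformer layer (all shapes, scales, and the finite-difference Jacobian are illustrative assumptions, not the paper's implementation): it fits a local affine model from the Jacobian at a point and measures how much of the layer's response to a nearby perturbation that model misses.

```python
import numpy as np

# Toy local-linearity check. A residual MLP block stands in for a transformer
# layer; in practice the Jacobian would come from autograd, not finite differences.
rng = np.random.default_rng(0)
d = 16
W1 = rng.standard_normal((d, d)) / np.sqrt(d)
W2 = rng.standard_normal((d, d)) / np.sqrt(d)

def layer(x):
    # Nonlinear residual block: x + W2 tanh(W1 x).
    return x + W2 @ np.tanh(W1 @ x)

def fd_jacobian(f, x, eps=1e-5):
    # Forward-difference Jacobian of f at x, column by column.
    fx = f(x)
    return np.stack([(f(x + eps * np.eye(len(x))[i]) - fx) / eps
                     for i in range(len(x))], axis=1)

x0 = rng.standard_normal(d)
A = fd_jacobian(layer, x0)
b = layer(x0) - A @ x0            # local affine model: layer(x) ~ A x + b

dx = 0.01 * rng.standard_normal(d)
true = layer(x0 + dx)
pred = A @ (x0 + dx) + b
# Fraction of the layer's response that the affine model fails to capture.
rel_residual = np.linalg.norm(true - pred) / np.linalg.norm(true - layer(x0))
```

A small `rel_residual` is exactly what the paper's empirical claim predicts for genuine layers; the claim is interesting precisely because transformer blocks are not globally linear.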
If this is right
- Steering achieves robust, fine-grained behavior control across models, scales, and tasks.
- Surpasses baseline methods in modulating toxicity, truthfulness, refusal, and arbitrary concepts.
- Provides formal performance guarantees via derived bounds on setpoint tracking error.
- Operates with minimal computational overhead and requires no offline training.
Where Pith is reading between the lines
- If local linearity holds more generally, other classical control techniques such as model predictive control could be adapted for LLM internals.
- The method suggests activation steering can be reframed as an online control problem rather than a static intervention.
- Success here raises the possibility of designing or fine-tuning models to preserve or strengthen layer-wise linearity for easier control.
Load-bearing premise
Layer-wise dynamics remain sufficiently linear over the full generation trajectory for the LQR feedback to keep tracking errors low and stable.
What would settle it
If the Jacobian-based LQR controller produces larger activation deviations from the semantic setpoints than open-loop baselines on a held-out model scale or task, then local linearity is insufficient for reliable control.
Original abstract
Inference-time LLM alignment methods, particularly activation steering, offer an alternative to fine-tuning by directly modifying activations during generation. Existing methods, however, often rely on non-anticipative interventions that ignore how perturbations propagate through transformer layers and lack online error feedback, resulting in suboptimal, open-loop control. To address this, we show empirically that, despite the nonlinear structure of transformer blocks, layer-wise dynamics across multiple LLM architectures and scales are well-approximated by locally-linear models. Exploiting this property, we model LLM inference as a linear time-varying dynamical system and adapt the classical linear quadratic regulator to compute feedback controllers using layer-wise Jacobians, steering activations toward desired semantic setpoints in closed-loop with minimal computational overhead and no offline training. We also derive theoretical bounds on setpoint tracking error, enabling formal guarantees on steering performance. Using a novel adaptive semantic feature setpoint signal, our method yields robust, fine-grained behavior control across models, scales, and tasks, including state-of-the-art modulation of toxicity, truthfulness, refusal, and arbitrary concepts, surpassing baseline steering methods. Our code is available at: https://github.com/trustworthyrobotics/lqr-activation-steering
Editorial analysis
A structured set of objections, weighed in public.
Referee Report
Summary. The paper claims that despite the nonlinear structure of transformer blocks, layer-wise dynamics in LLMs are well-approximated by locally linear models across architectures and scales; it models inference as a linear time-varying system, adapts classical LQR to compute Jacobian-based feedback controllers for closed-loop steering of activations to adaptive semantic setpoints, derives theoretical bounds on tracking error, and reports superior empirical performance on toxicity, truthfulness, refusal, and concept modulation tasks with no offline training.
Significance. If the local-linearity assumption holds over full trajectories, the work supplies a control-theoretic framework for activation steering that combines empirical validation, formal error bounds, and reproducible code; this could advance inference-time alignment by replacing heuristic open-loop interventions with stable, low-overhead feedback control and quantifiable guarantees.
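The controller the summary describes can be made concrete with a minimal sketch of finite-horizon, time-varying LQR: given per-layer Jacobians A_t (and input maps B_t), a backward Riccati recursion yields feedback gains K_t. The matrices and horizon below are illustrative placeholders, not the paper's actual layer Jacobians.

```python
import numpy as np

def ltv_lqr_gains(A_seq, B_seq, Q, R, Qf):
    # Backward Riccati recursion for x_{t+1} = A_t x_t + B_t u_t with
    # stage cost x'Qx + u'Ru and terminal cost x'Qf x.
    P = Qf
    gains = []
    for A, B in zip(reversed(A_seq), reversed(B_seq)):
        K = np.linalg.solve(R + B.T @ P @ B, B.T @ P @ A)
        P = Q + A.T @ P @ (A - B @ K)
        gains.append(K)
    return gains[::-1]          # forward-time order; control law u_t = -K_t e_t

n, T = 4, 20
A_seq = [1.1 * np.eye(n)] * T   # mildly unstable open-loop dynamics (toy)
B_seq = [np.eye(n)] * T
Ks = ltv_lqr_gains(A_seq, B_seq, np.eye(n), 0.1 * np.eye(n), np.eye(n))

e = np.ones(n)                  # initial deviation from the setpoint
for A, B, K in zip(A_seq, B_seq, Ks):
    e = A @ e + B @ (-K @ e)
# Closed loop drives the deviation toward zero; open loop would grow by 1.1^20.
```

The recursion is standard control theory; the paper's contribution is supplying the A_t from layer-wise Jacobians and the setpoint from a semantic feature signal.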
major comments (2)
- [§3] Empirical linearity checks: the reported approximation quality and Jacobian-based residuals are shown at selected points or short horizons, but the manuscript does not quantify how these residuals evolve or remain bounded over complete generation trajectories; this directly affects whether the LQR closed-loop eigenvalues stay stable and whether the derived tracking-error bounds remain tight, as required by the central claim.
- [§4.2] LQR formulation and error bounds: the tracking-error bounds are stated to scale with the linearization residual, yet the experiments do not report a direct comparison between predicted bound values and observed setpoint errors across models or tasks; without this calibration the formal guarantees' practical relevance is difficult to assess.
minor comments (2)
- [§4] The choice and sensitivity analysis of the LQR weighting matrices Q and R (free parameters) are not detailed; a brief ablation or default-setting justification would clarify reproducibility.
- [§5] Figure captions and the adaptive semantic-feature setpoint definition would benefit from an explicit equation or pseudocode block to make the online adaptation rule unambiguous.
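The Q/R ablation requested above is cheap to run even on a scalar toy system. The sketch below (system, horizon, and weight values are illustrative, not from the paper) sweeps the control weight r and records gain, tracking cost, and control effort.

```python
import numpy as np

def lqr_gain(a, q, r, iters=50):
    # Scalar Riccati iteration for x_{t+1} = a x_t + u_t, run to a steady gain.
    p = q
    for _ in range(iters):
        k = a * p / (r + p)
        p = q + a * p * (a - k)
    return k

a, q, T = 1.2, 1.0, 30
results = {}
for r in (0.01, 0.1, 1.0, 10.0):
    k = lqr_gain(a, q, r)
    x, state_cost, control_effort = 1.0, 0.0, 0.0
    for _ in range(T):
        u = -k * x
        state_cost += q * x * x      # accumulated tracking cost
        control_effort += u * u      # raw control effort, independent of r
        x = a * x + u
    results[r] = (k, state_cost, control_effort)
# Larger r => smaller gain and less control effort, at the price of worse tracking.
```

An ablation of this shape over the paper's Q and R choices would address the reproducibility concern directly.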
Simulated Author's Rebuttal
We thank the referee for their constructive and detailed feedback, which has identified important opportunities to strengthen the empirical support for our local-linearity claims and the practical calibration of our theoretical bounds. We address each major comment below and commit to revisions that directly respond to the concerns raised.
Point-by-point responses
Referee: [§3] Empirical linearity checks: the reported approximation quality and Jacobian-based residuals are shown at selected points or short horizons, but the manuscript does not quantify how these residuals evolve or remain bounded over complete generation trajectories; this directly affects whether the LQR closed-loop eigenvalues stay stable and whether the derived tracking-error bounds remain tight, as required by the central claim.
Authors: We agree that explicit quantification of residual evolution over full trajectories is needed to rigorously support stability of the closed-loop eigenvalues and tightness of the tracking bounds. Our current experiments evaluate residuals at representative points and short segments during generation, which is consistent with the local-linearity hypothesis but does not fully address long-horizon behavior. In the revised manuscript we will add new figures and tables showing the evolution of Jacobian-based residuals (||f(x) - Ax - b||) over complete generation sequences for the toxicity, truthfulness, refusal, and concept-modulation tasks across the evaluated models. We will also report whether residuals remain bounded and discuss any implications for eigenvalue stability within the LQR design. revision: yes
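The diagnostic the authors promise can be sketched on a toy layer stack: probe the linearization residual at each layer along a trajectory and check that it stays bounded. The blocks, scales, and finite-difference Jacobians below are illustrative stand-ins, not the paper's models.

```python
import numpy as np

# Track the per-layer linearization residual along a toy forward trajectory.
rng = np.random.default_rng(1)
d, L = 16, 12
Ws = [rng.standard_normal((d, d)) / np.sqrt(d) for _ in range(2 * L)]

def block(l, x):
    # Toy residual block l standing in for one transformer layer.
    return x + Ws[2 * l + 1] @ np.tanh(Ws[2 * l] @ x)

def fd_jacobian(f, x, eps=1e-5):
    fx = f(x)
    return np.stack([(f(x + eps * np.eye(d)[i]) - fx) / eps
                     for i in range(d)], axis=1)

x = rng.standard_normal(d)
residuals = []
for l in range(L):
    J = fd_jacobian(lambda z, l=l: block(l, z), x)
    dx = 0.01 * rng.standard_normal(d)      # fresh small probe at each layer
    err = np.linalg.norm(block(l, x + dx) - (block(l, x) + J @ dx))
    residuals.append(err / np.linalg.norm(dx))
    x = block(l, x)
# Bounded residuals along the whole stack are what the load-bearing premise requires.
```

For real models the same loop would run over actual layer Jacobians and full generation sequences, yielding the figures the revision commits to.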
Referee: [§4.2] LQR formulation and error bounds: the tracking-error bounds are stated to scale with the linearization residual, yet the experiments do not report a direct comparison between predicted bound values and observed setpoint errors across models or tasks; without this calibration the formal guarantees' practical relevance is difficult to assess.
Authors: We appreciate this observation. The derived bounds explicitly depend on the linearization residual, and a side-by-side comparison with observed errors would better demonstrate the tightness and practical utility of the guarantees. In the revision we will include a new table (or supplementary figure) that computes the theoretical tracking-error bounds from the measured residuals for each model and task, then directly compares these predicted values against the empirical setpoint tracking errors reported in our experiments. This calibration will be accompanied by discussion of any observed discrepancies and their relation to the local-linearity assumption. revision: yes
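The calibration table the authors promise amounts to checking a predicted bound against observed errors. A minimal version, assuming contractive closed-loop error dynamics e_{t+1} = A_cl e_t + w_t with the linearization residual bounded by eps (all values illustrative, and a simpler bound than whatever the paper derives):

```python
import numpy as np

# Geometric-series tracking bound: ||e_t|| <= rho^t ||e_0|| + eps / (1 - rho),
# where rho = ||A_cl|| < 1 and ||w_t|| <= eps is the linearization residual.
rng = np.random.default_rng(2)
n, eps, T = 6, 0.05, 200
Qmat, _ = np.linalg.qr(rng.standard_normal((n, n)))
Acl = 0.7 * Qmat                     # orthogonal scaled by 0.7: spectral norm 0.7
rho = np.linalg.norm(Acl, 2)
bound = eps / (1 - rho)              # predicted steady-state error bound

e = rng.standard_normal(n)
for _ in range(T):
    w = rng.standard_normal(n)
    w = eps * w / np.linalg.norm(w)  # residual disturbance at its worst-case norm
    e = Acl @ e + w
observed = np.linalg.norm(e)
# The observed steady-state error sits below (here, well below) the bound.
```

Reporting the analogous predicted-versus-observed pair per model and task is precisely the calibration the referee asks for; a large gap would indicate a loose bound, a violation would falsify the linearity assumption.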
Circularity Check
No significant circularity: the empirical linearity check, the standard LQR machinery, and the derived bounds remain independent of quantities fitted within the same experiment.
Full rationale
The derivation begins with an empirical claim that layer-wise transformer dynamics are locally linear (validated across models and scales), models inference as an LTV system, applies the classical LQR controller using computed Jacobians, and derives tracking-error bounds that explicitly depend on the linearization residual. None of these steps reduce by construction to quantities fitted inside the same experiment; the LQR formulation and error bounds are taken from external control theory, the linearity is presented as an empirical observation rather than a definitional assumption, and no self-citation chain is invoked to justify uniqueness or the ansatz. The central performance claims are therefore supported by external mathematics and separate empirical measurements rather than tautological renaming or self-referential fitting.
Axiom & Free-Parameter Ledger
free parameters (1)
- LQR state and control weighting matrices Q and R
axioms (1)
- Domain assumption: layer-wise LLM dynamics admit a locally linear approximation whose Jacobians can be used for control.