
arXiv: 2606.30788 · v1 · pith:OLF77JRZnew · submitted 2026-06-29 · 💻 cs.LG · cs.CL · cs.CR

Revocable Learned State via Process Sidecars

Pith reviewed 2026-07-01 06:41 UTC · model grok-4.3

classification 💻 cs.LG · cs.CL · cs.CR
keywords: process sidecars, revocable learned state, task arithmetic, safety training, language model editing, AdamW optimizer, memory revocation, counterfactual oracle

The pith

A two-coefficient sidecar edit recovers the safety-only model up to second order by correcting for how safety training transports the memory direction.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

Language models often receive a memory update followed by safety training that refuses outputs tied to the remembered content. The safety optimizer transports the original memory direction, so subtracting the memory update alone leaves first-order error relative to the counterfactual safety-only state whenever that direction has been bent. The paper introduces process sidecars, a family of edits that subtract both the memory change and an estimate of its transported direction. When the exact transported direction is used at coefficients (λ,γ)=(1,1), the edit recovers the desired safety-only oracle up to second order, with AdamW treated as a map over parameters and moments. Across three models, the paper reports that the validation-chosen two-dimensional edit improves held-out refusal closure compared with naive task arithmetic.

Core claim

The exact sidecar using the true transported direction R_{S←M} at (λ,γ)=(1,1) recovers the counterfactual safety-only oracle θ_AS up to second order. The proof treats AdamW as an augmented-state map over parameters, first moments, and second moments. Whenever future safety training bends the memory direction, every scalar task-arithmetic edit leaves first-order counterfactual error, while the process-sidecar edit remains second-order accurate. The practical implementation approximates the transported direction with a centered secant through the realized safety-training process at ε=1.
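
The first-order bookkeeping behind both claims can be sketched as follows. This is a simplified reading, not the paper's proof: it assumes the safety phase acts as a smooth map $S$ on parameters alone (the paper works with the augmented AdamW state), writes $J_S$ for the Jacobian of $S$ at $\theta_{\mathrm{A}}$, and takes $\theta_{\mathrm{AS}}=S(\theta_{\mathrm{A}})$, $\theta_{\mathrm{AMS}}=S(\theta_{\mathrm{A}}+\Delta_{\mathrm{M}})$, and $R_{\mathrm{S}\leftarrow\mathrm{M}}=J_S\Delta_{\mathrm{M}}-\Delta_{\mathrm{M}}$ by analogy with the secant definition in the abstract.

```latex
\begin{align*}
\theta_{\mathrm{AMS}}
  &= S(\theta_{\mathrm{A}}+\Delta_{\mathrm{M}})
   = \theta_{\mathrm{AS}} + J_S\,\Delta_{\mathrm{M}} + O(\|\Delta_{\mathrm{M}}\|^2),\\
\theta_{\mathrm{AMS}} - \Delta_{\mathrm{M}} - R_{\mathrm{S}\leftarrow\mathrm{M}}
  &= \theta_{\mathrm{AMS}} - J_S\,\Delta_{\mathrm{M}}
   = \theta_{\mathrm{AS}} + O(\|\Delta_{\mathrm{M}}\|^2),\\
\theta_{\mathrm{AMS}} - \lambda\,\Delta_{\mathrm{M}} - \theta_{\mathrm{AS}}
  &= (J_S - \lambda I)\,\Delta_{\mathrm{M}} + O(\|\Delta_{\mathrm{M}}\|^2).
\end{align*}
```

Under these assumptions, the first-order term in the last line can vanish for some scalar λ only when $J_S\Delta_{\mathrm{M}}$ is a multiple of $\Delta_{\mathrm{M}}$, which is one way to read "bends the memory direction".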

What carries the argument

The process sidecar edit family θ̂(λ,γ)=θ_AMS−λΔ_M−γR̂_{S←M}, where R̂_{S←M} is the centered secant estimate of the direction in which safety training transports the memory update.
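
As a minimal sketch, the edit family is an elementwise combination of three vectors. The function names, the flat-list representation of parameters, and the toy numbers below are illustrative assumptions, not the paper's code; reading scalar task arithmetic as the γ=0 slice is likewise an interpretation of the formula above.

```python
from typing import List, Sequence

Vector = Sequence[float]


def sidecar_edit(theta_ams: Vector, delta_m: Vector, r_hat: Vector,
                 lam: float, gamma: float) -> List[float]:
    """Return theta_AMS - lam * Delta_M - gamma * R_hat, elementwise."""
    if not (len(theta_ams) == len(delta_m) == len(r_hat)):
        raise ValueError("all vectors must have the same length")
    return [t - lam * d - gamma * r
            for t, d, r in zip(theta_ams, delta_m, r_hat)]


def task_arithmetic_edit(theta_ams: Vector, delta_m: Vector,
                         lam: float) -> List[float]:
    """Scalar task arithmetic, read here as the gamma = 0 slice (assumption)."""
    zeros = [0.0] * len(delta_m)
    return sidecar_edit(theta_ams, delta_m, zeros, lam, 0.0)


# Toy vectors, made up for illustration only.
theta_ams = [1.0, 2.0, 3.0]
delta_m = [0.5, 0.0, -0.5]
r_hat = [0.1, 0.2, 0.0]
edited = sidecar_edit(theta_ams, delta_m, r_hat, lam=1.0, gamma=1.0)
scalar_only = task_arithmetic_edit(theta_ams, delta_m, lam=1.0)
```

In a real model the same arithmetic would be applied per parameter tensor; the flat lists here only keep the sketch dependency-free.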

If this is right

  • The validation-selected two-dimensional edit improves held-out refusal closure over naive task arithmetic in all trials across three models.
  • The two-dimensional edit outperforms the γ=λ process-JVP subfamily in all paired trials.
  • Process information from the safety-training trajectory is required for second-order accuracy once safety training bends the memory direction.
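
The comparison in the first two bullets can be pictured as three selections from one cached grid of validation scores. The sketch below is a hedged illustration: `toy_score` is a made-up stand-in for a validation metric (the page does not define refusal closure), the grid values are arbitrary, and treating γ=0 as the task-arithmetic slice is an assumption.

```python
from typing import Callable, Dict, Iterable, Tuple

Coeffs = Tuple[float, float]


def build_grid_cache(score: Callable[[float, float], float],
                     lams: Iterable[float],
                     gammas: Iterable[float]) -> Dict[Coeffs, float]:
    """Evaluate the validation score once per (lambda, gamma) pair."""
    gamma_list = list(gammas)
    return {(lam, gam): score(lam, gam)
            for lam in lams for gam in gamma_list}


def best_on(cache: Dict[Coeffs, float],
            keep: Callable[[Coeffs], bool]) -> Coeffs:
    """Pick the highest-scoring coefficients among those passing `keep`."""
    candidates = {k: v for k, v in cache.items() if keep(k)}
    return max(candidates, key=candidates.get)


def toy_score(lam: float, gam: float) -> float:
    """Hypothetical validation score, peaked near (1.0, 0.4)."""
    return -((lam - 1.0) ** 2 + (gam - 0.4) ** 2)


grid = [0.0, 0.5, 1.0, 1.5]
cache = build_grid_cache(toy_score, grid, grid)
best_2d = best_on(cache, lambda k: True)              # full 2D family
best_diag = best_on(cache, lambda k: k[0] == k[1])    # gamma = lambda slice
best_scalar = best_on(cache, lambda k: k[1] == 0.0)   # gamma = 0 slice
```

Because the diagonal and scalar slices are subsets of the same cache, the 2D optimum can never score worse than either on the validation set; whether that advantage carries over to held-out data is the empirical question the bullets address.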

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

  • The same sidecar construction could be tested on sequential training pipelines that do not involve safety, such as continued pre-training followed by fine-tuning.
  • If the augmented-state treatment of AdamW extends to other first-order methods, similar second-order corrections might apply without requiring full trajectory replay.
  • The approach suggests a general pattern for editing states that have been altered by later optimization steps rather than treating each update as an independent vector.

Load-bearing premise

The centered secant through the realized future AdamW safety-training process at ε=1 provides a sufficiently accurate estimate of the true transported direction R_{S←M} for the second-order recovery to hold in practice.

What would settle it

A direct comparison, on held-out data, showing that the sidecar edit at (1,1) with the secant estimate deviates from the true safety-only oracle θ_AS by more than second-order terms when safety training has bent the memory direction.
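
One way to picture such a comparison is an order-of-accuracy check: shrink the memory update and watch how fast the error to the oracle falls. The sketch below uses a made-up two-dimensional quadratic map in place of the safety phase, so it only illustrates the diagnostic and says nothing about real AdamW traces; `safety_map`, `errors_at_scale`, and all constants are hypothetical.

```python
import math
from typing import Tuple

Vec2 = Tuple[float, float]


def safety_map(theta: Vec2) -> Vec2:
    """Hypothetical smooth stand-in for the whole safety phase."""
    x, y = theta
    return (x + 0.5 * y + 0.2 * x * x, 0.8 * y + 0.1 * x * y)


def norm(v) -> float:
    return math.sqrt(sum(c * c for c in v))


def errors_at_scale(scale: float,
                    theta_a: Vec2 = (0.0, 0.0),
                    direction: Vec2 = (0.1, 0.1)) -> Tuple[float, float]:
    """Return (sidecar error, scalar-edit error) for Delta_M = scale * direction."""
    delta = tuple(scale * d for d in direction)
    plus = safety_map(tuple(a + d for a, d in zip(theta_a, delta)))   # theta_AMS
    minus = safety_map(tuple(a - d for a, d in zip(theta_a, delta)))  # extra trace
    oracle = safety_map(theta_a)                                      # theta_AS
    # R_hat = (plus - minus) / 2 - delta; sidecar at (1, 1) = plus - delta - R_hat.
    sidecar = tuple(p - d - ((p - m) / 2.0 - d)
                    for p, m, d in zip(plus, minus, delta))
    scalar = tuple(p - d for p, d in zip(plus, delta))  # task arithmetic, lambda = 1
    return (norm(tuple(s - o for s, o in zip(sidecar, oracle))),
            norm(tuple(s - o for s, o in zip(scalar, oracle))))


side_full, scal_full = errors_at_scale(1.0)
side_half, scal_half = errors_at_scale(0.5)
sidecar_ratio = side_full / side_half   # close to 4 suggests second-order error
scalar_ratio = scal_full / scal_half    # close to 2 suggests first-order error
```

On a real model each rescaled update would need its own safety runs, so this check is far more expensive than the toy suggests.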

Figures

Figures reproduced from arXiv: 2606.30788 by John Sweeney.

Figure 1: Process sidecars preserve refusal at matched secret-token distinguishability. Held-out test, … [image: figures/full_fig_p002_1.png]
Figure 2: Natural coordinates for process sidecars. The process-JVP line is … [image: figures/full_fig_p004_2.png]

Original abstract

Language models are often adapted in stages: a public skill phase, a private memory phase, and a later safety phase that learns to refuse outputs tied to the remembered entities. Revoking the memory after the safety phase is not the same problem as subtracting the memory update: the later safety optimizer has transported the memory direction. We introduce process sidecars, a two-coefficient edit family $\hat{\theta}(\lambda,\gamma)=\theta_{\mathrm{AMS}}-\lambda\Delta_{\mathrm{M}}-\gamma\hat{R}_{\mathrm{S}\leftarrow\mathrm{M}}$, with $\hat{R}_{\mathrm{S}\leftarrow\mathrm{M}}=\hat{J}_{\mathrm{S},\varepsilon}(\Delta_{\mathrm{M}})-\Delta_{\mathrm{M}}$, where $\hat{J}_{\mathrm{S},\varepsilon}$ is a centered secant through the realized future AdamW safety-training process. The implementation uses $\varepsilon=1$ at the natural memory-edit scale; it reuses $\theta_{\mathrm{AMS}}$ as the positive endpoint and computes one additional safety trace at $\theta_{\mathrm{A}}-\Delta_{\mathrm{M}}$. We prove two things. First, the exact sidecar, using the true transported direction $R_{\mathrm{S}\leftarrow\mathrm{M}}$ rather than the secant estimate, at $(\lambda,\gamma)=(1,1)$ recovers the counterfactual safety-only oracle $\theta_{\mathrm{AS}}$ up to second order; the proof treats AdamW as an augmented-state map over parameters, first moments, and second moments. Second, this process information is necessary: whenever future safety training bends the memory direction, every scalar task-arithmetic edit leaves first-order counterfactual error, while the process-sidecar edit is second-order accurate. Across three models, the validation-selected 2D edit improves held-out refusal closure over naive task arithmetic in all trials, and over the $\gamma=\lambda$ process-JVP subfamily, the diagonal slice of the cached 2D grid, in all paired trials.
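
The secant construction described in the abstract can be sketched as below. Several things here are assumptions rather than statements from the paper: the centered secant is taken to have the standard central-difference form, (S(θ_A + εΔ_M) − S(θ_A − εΔ_M)) / (2ε); Δ_M is taken to be θ_AM − θ_A, so that θ_AMS is the positive endpoint at ε=1; and `run_safety` is a hypothetical callable that replays the same realized safety process from a given starting point.

```python
from typing import Callable, List, Sequence

Vector = Sequence[float]


def centered_secant_sidecar(
    theta_a: Vector,
    delta_m: Vector,
    theta_ams: Vector,
    run_safety: Callable[[List[float]], List[float]],
) -> List[float]:
    """Estimate R_hat = J_hat(Delta_M) - Delta_M with epsilon = 1.

    theta_ams is reused as the positive endpoint, so the only new safety
    run starts from theta_A - Delta_M.
    """
    negative_start = [a - d for a, d in zip(theta_a, delta_m)]
    negative_end = run_safety(negative_start)  # the one additional trace
    j_hat = [(p - n) / 2.0 for p, n in zip(theta_ams, negative_end)]
    return [j - d for j, d in zip(j_hat, delta_m)]


def toy_safety(theta: Sequence[float]) -> List[float]:
    """Made-up linear map standing in for the safety phase."""
    x, y = theta
    return [x + 0.5 * y + 1.0, 0.8 * y - 2.0]


theta_a = [0.25, -0.5]
delta_m = [0.5, 1.0]
theta_ams = toy_safety([a + d for a, d in zip(theta_a, delta_m)])
r_hat = centered_secant_sidecar(theta_a, delta_m, theta_ams, toy_safety)
recovered = [t - d - r for t, d, r in zip(theta_ams, delta_m, r_hat)]
oracle = toy_safety(theta_a)  # theta_AS for the toy map
```

For this linear toy map the (1,1) edit lands on the oracle up to floating-point rounding, since a linear map has no curvature for the secant to miss; a real safety trace would generally leave a residual.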

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Referee Report

3 major / 2 minor

Summary. The manuscript introduces process sidecars, a two-coefficient edit family θ̂(λ,γ) = θ_AMS − λ Δ_M − γ R̂_{S←M} where R̂ is a centered secant through the realized AdamW safety-training trajectory. It proves that the exact sidecar (true transported direction R_{S←M}) at (λ,γ)=(1,1) recovers the counterfactual safety-only oracle θ_AS up to second order by treating AdamW as an augmented-state map on (θ, m, v); it further proves that any scalar task-arithmetic edit necessarily incurs first-order counterfactual error whenever safety training bends the memory direction; and it reports that the practical secant implementation (ε=1, one extra safety trace) with validation-selected (λ,γ) improves held-out refusal closure over naive task arithmetic and over the γ=λ process-JVP slice on three models.

Significance. If the secant approximation preserves the second-order property in practice, the work supplies a principled, optimizer-aware editing technique that goes beyond scalar task arithmetic for post-training revocation of learned state. The formal modeling of the full AdamW process (parameters plus moments) as a map and the necessity argument that distinguishes process information from scalar edits are genuine strengths; the empirical demonstration that the 2D edit outperforms both baselines in all trials adds practical support.

major comments (3)
  1. [Proof of second-order recovery (exact sidecar)] Proof of second-order recovery: the argument establishes O(‖Δ_M‖²) accuracy only for the exact transported direction R_{S←M}; the implemented method substitutes the centered-secant estimate R̂_{S←M}, computed with ε=1 at the natural memory-edit scale, without a remainder bound or expansion showing that the secant error is o(‖Δ_M‖) rather than O(‖Δ_M‖), which is required to carry the second-order claim over to the practical edit family.
  2. [Necessity argument] Necessity argument: the claim that every scalar task-arithmetic edit leaves first-order counterfactual error relies on the safety-training map bending the memory direction; the precise condition under which the first-order term vanishes (or does not) should be stated explicitly, as it is load-bearing for the motivation that process information is required.
  3. [Empirical evaluation] Implementation and validation: the reported gains use a validation-selected (λ,γ) pair together with a single extra safety trace at θ_A − Δ_M; because the selection is performed on held-out data that also informs the refusal metric, an ablation showing performance for fixed (λ,γ)=(1,1) or for the exact sidecar (when computable) would be needed to separate the contribution of the secant from hyperparameter tuning.
minor comments (2)
  1. [Notation] Notation: the centered secant is written both as R̂_{S←M} and as Ĵ_{S,ε}(Δ_M) − Δ_M; a single consistent symbol and an explicit definition of the centering point would improve readability.
  2. [Figures] Figure clarity: the 2D grid plots of (λ,γ) performance should include contour lines or a marked validation-selected point so readers can see how far the chosen coefficients lie from the (1,1) corner.

Simulated Author's Rebuttal

3 responses · 0 unresolved

We thank the referee for the constructive comments and for acknowledging the formal contributions of the process-sidecar framework and the empirical improvements. We respond to each major comment below and will revise the manuscript accordingly to address the points raised.

Point-by-point responses
  1. Referee: Proof of second-order recovery (exact sidecar): the argument establishes O(‖Δ_M‖²) accuracy only for the exact transported direction R_{S←M}; the implemented method substitutes the centered-secant estimate R̂_{S←M}, computed with ε=1 at the natural memory-edit scale, without a remainder bound or expansion showing that the secant error is o(‖Δ_M‖) rather than O(‖Δ_M‖), which is required to carry the second-order claim over to the practical edit family.

    Authors: We agree that the O(‖Δ_M‖²) guarantee is established only for the exact transported direction. The manuscript presents the secant implementation as a practical approximation at the natural edit scale (ε=1) without claiming that the second-order property automatically transfers. We will revise the text to explicitly distinguish the exact-sidecar theorem from the practical secant method and to note that the latter is a heuristic whose error depends on the curvature of the safety-training trajectory. revision: yes

  2. Referee: Necessity argument: the claim that every scalar task-arithmetic edit leaves first-order counterfactual error relies on the safety-training map bending the memory direction; the precise condition under which the first-order term vanishes (or does not) should be stated explicitly, as it is load-bearing for the motivation that process information is required.

    Authors: The necessity claim holds whenever the safety-training map produces a nonzero transverse component in the transported direction, i.e., when the directional derivative of the safety map applied to Δ_M is not collinear with Δ_M. Equivalently, the first-order error term in scalar edits vanishes only if the Jacobian of the safety map at the relevant point maps Δ_M into a scalar multiple of itself. We will add an explicit statement of this condition (including the mathematical criterion for vanishing) in the revised manuscript. revision: yes

  3. Referee: Implementation and validation: the reported gains use a validation-selected (λ,γ) pair together with a single extra safety trace at θ_A − Δ_M; because the selection is performed on held-out data that also informs the refusal metric, an ablation showing performance for fixed (λ,γ)=(1,1) or for the exact sidecar (when computable) would be needed to separate the contribution of the secant from hyperparameter tuning.

    Authors: We acknowledge that validation-based selection on data tied to the evaluation metric limits the ability to isolate the secant contribution. In the revision we will add an ablation reporting performance for the fixed pair (λ,γ)=(1,1) on all three models, as well as for the γ=λ process-JVP slice, thereby separating the effect of the two-coefficient edit from hyperparameter tuning. revision: yes
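
The collinearity criterion stated in the second response can be checked numerically once some estimate of the transported vector J(Δ_M) is available. The helper below is a hypothetical diagnostic, not something the paper or rebuttal describes: it returns the least-squares scalar that best matches the transported vector to Δ_M and the fraction of the transported vector left over as a transverse component.

```python
import math
from typing import Sequence, Tuple


def bending_diagnostic(j_delta: Sequence[float],
                       delta: Sequence[float]) -> Tuple[float, float]:
    """Return (best_scalar, transverse_fraction).

    best_scalar is the least-squares lambda with j_delta ~ lambda * delta.
    transverse_fraction is |j_delta - best_scalar * delta| / |j_delta|,
    which is 0 when the two vectors are collinear.
    """
    dd = sum(d * d for d in delta)
    jj = sum(j * j for j in j_delta)
    if dd == 0.0 or jj == 0.0:
        raise ValueError("both vectors must be nonzero")
    best_scalar = sum(j * d for j, d in zip(j_delta, delta)) / dd
    residual = [j - best_scalar * d for j, d in zip(j_delta, delta)]
    return best_scalar, math.sqrt(sum(r * r for r in residual) / jj)


# Toy vectors, made up for illustration only.
delta_m = [1.0, 0.0]
stretched = bending_diagnostic([0.9, 0.0], delta_m)  # collinear: pure rescaling
bent = bending_diagnostic([0.9, 0.3], delta_m)       # has a transverse part
```

A transverse fraction near zero would mean a single scalar coefficient can absorb the first-order effect of safety training; a clearly nonzero value is the regime in which the rebuttal says scalar edits must leave first-order error.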

Circularity Check

0 steps flagged

No circularity: proof for exact sidecar is independent modeling of AdamW dynamics

full rationale

The core derivation consists of two explicit mathematical claims: (1) the exact sidecar with true R_{S←M} at (λ,γ)=(1,1) recovers θ_AS to second order by treating the full AdamW process as an augmented-state dynamical system on (θ,m,v), and (2) scalar task-arithmetic edits necessarily incur first-order error whenever safety training bends the memory direction. Both statements are proven from the optimizer equations rather than fitted to data or reduced to prior self-citations. The implemented secant approximation and validation-selected (λ,γ) appear only in the empirical section and are not invoked inside the proof; the paper does not rename the secant result as a prediction or smuggle an ansatz via citation. The derivation chain is therefore self-contained against external benchmarks.
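
To make "augmented-state dynamical system" concrete, the sketch below writes one AdamW update as a pure function from (θ, m, v) to the next (θ, m, v), following the commonly used decoupled-weight-decay form with bias correction. It is an illustration rather than the paper's formalization: the step counter is added here because bias correction needs it, and `grad_fn`, the hyperparameter defaults, and the quadratic toy loss are all assumptions.

```python
import math
from typing import Callable, Iterable, List, NamedTuple

GradFn = Callable[[List[float]], List[float]]


class AdamWState(NamedTuple):
    theta: List[float]  # parameters
    m: List[float]      # first-moment estimate
    v: List[float]      # second-moment estimate
    step: int           # counter used for bias correction (added for the sketch)


def adamw_step(state: AdamWState, grad_fn: GradFn, lr: float = 1e-3,
               beta1: float = 0.9, beta2: float = 0.999, eps: float = 1e-8,
               weight_decay: float = 0.01) -> AdamWState:
    """One AdamW update, written as a map on the full optimizer state."""
    g = grad_fn(state.theta)
    t = state.step + 1
    m = [beta1 * mi + (1 - beta1) * gi for mi, gi in zip(state.m, g)]
    v = [beta2 * vi + (1 - beta2) * gi * gi for vi, gi in zip(state.v, g)]
    bc1 = 1 - beta1 ** t
    bc2 = 1 - beta2 ** t
    theta = [
        th * (1 - lr * weight_decay)                      # decoupled decay
        - lr * (mi / bc1) / (math.sqrt(vi / bc2) + eps)   # Adam direction
        for th, mi, vi in zip(state.theta, m, v)
    ]
    return AdamWState(theta, m, v, t)


def run_phase(state: AdamWState, grad_fns: Iterable[GradFn],
              **hparams: float) -> AdamWState:
    """Compose steps over a fixed sequence of per-batch gradient functions."""
    for grad_fn in grad_fns:
        state = adamw_step(state, grad_fn, **hparams)
    return state


def quadratic_grad(theta: List[float]) -> List[float]:
    """Gradient of the toy loss 0.5 * |theta|^2."""
    return list(theta)


start = AdamWState(theta=[1.0, -2.0], m=[0.0, 0.0], v=[0.0, 0.0], step=0)
after = run_phase(start, [quadratic_grad], lr=0.1, weight_decay=0.0)
```

Fixing the sequence of gradient functions is what makes a training phase a deterministic map of its starting state, which is the sense in which a whole safety phase can be differentiated or probed with a secant.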

Axiom & Free-Parameter Ledger

1 free parameter · 1 axiom · 2 invented entities

The theoretical claim rests on treating the AdamW optimizer as an augmented-state dynamical system; the practical method additionally relies on the centered secant being an adequate approximation to the transported direction. No free parameters are fitted inside the proof itself, but λ and γ are selected post hoc via validation.

free parameters (1)
  • λ, γ
    Two coefficients of the edit family chosen by validation on held-out data to maximize refusal closure.
axioms (1)
  • domain assumption: AdamW training can be treated as an augmented-state map over parameters, first moments, and second moments
    Invoked to prove second-order recovery of the safety-only oracle.
invented entities (2)
  • process sidecar edit family (no independent evidence)
    purpose: Two-coefficient correction for transported memory direction
    Newly defined edit θ̂(λ,γ) that augments standard task arithmetic.
  • R̂_{S←M}, the centered secant estimate (no independent evidence)
    purpose: Approximation to the true transported memory direction after safety training
    Defined as Ĵ_{S,ε}(Δ_M) − Δ_M and computed from one additional safety trace.

pith-pipeline@v0.9.1-grok · 5879 in / 1552 out tokens · 35397 ms · 2026-07-01T06:41:36.119278+00:00 · methodology

