arxiv: 2602.15823 · v2 · submitted 2026-02-17 · 💻 cs.LG · cs.AI

Recognition: 2 theorem links

· Lean Theorem

CrispEdit: Low-Curvature Projections for Scalable Non-Destructive LLM Editing

Zarif Ikram , Arad Firouzkouhi , Stephen Tu , Mahdi Soltanolkotabi , Paria Rashidinejad

Authors on Pith no claims yet

Pith reviewed 2026-05-15 21:28 UTC · model grok-4.3

classification 💻 cs.LG cs.AI

keywords LLM editingcapability preservationconstrained optimizationBregman divergenceGauss-Newton HessianK-FACmodel editingsecond-order methods

0 comments

The pith

CrispEdit projects LLM edit updates onto low-curvature subspaces to preserve general capabilities.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper introduces CrispEdit as a method to edit large language models by treating capability preservation as an explicit constraint in optimization. It projects the edit updates onto the low-curvature subspace of the capability loss landscape, identified using the Gauss-Newton Hessian derived from Bregman divergence. This approach aims to achieve high success in targeted edits while keeping degradation of general capabilities below 1% on average. A sympathetic reader would care because previous editing methods often lead to unintended corruption of model behaviors, resembling proxy hacking. The method scales to LLM sizes using efficient approximations like K-FAC.

Core claim

CrispEdit formulates editing as constrained optimization and enforces the constraint by projecting edit updates onto the low-curvature subspace of the capability-loss landscape. At the crux is expressing the capability constraint via Bregman divergence, whose quadratic form yields the Gauss-Newton Hessian exactly even when the base model is not trained to convergence. This second-order procedure is made efficient at LLM scale using Kronecker-factored approximate curvature (K-FAC) and a novel matrix-free projector.

What carries the argument

The low-curvature subspace projection based on the Gauss-Newton Hessian of the Bregman divergence for the capability loss, computed via K-FAC and a matrix-free projector exploiting Kronecker structure.

If this is right

High edit success rates on standard benchmarks with average capability degradation below 1%.
Unifies and generalizes several existing editing approaches under a constrained optimization framework.
Scales to large language models without needing to construct massive projection matrices.
Reduces the risk of proxy hacking and degenerate behaviors in edited models.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

This curvature-based projection could be adapted for other machine learning tasks requiring preservation of certain properties during updates.
Similar techniques might improve continual learning by identifying safe update directions.
Further work could explore whether the low-curvature assumption holds across different model architectures or training regimes.

Load-bearing premise

The assumption that the low-curvature subspace reliably identifies directions that preserve capabilities without introducing new failure modes.

What would settle it

Observing significant capability degradation on held-out tasks after applying CrispEdit to models larger than those tested or on more diverse edit scenarios.

read the original abstract

A central challenge in large language model (LLM) editing is capability preservation: methods that successfully change targeted behavior can quietly game the editing proxy and corrupt general capabilities, producing degenerate behaviors reminiscent of proxy/reward hacking. We present CrispEdit, a scalable and principled second-order editing algorithm that treats capability preservation as an explicit constraint, unifying and generalizing several existing editing approaches. CrispEdit formulates editing as constrained optimization and enforces the constraint by projecting edit updates onto the low-curvature subspace of the capability-loss landscape. At the crux of CrispEdit is expressing capability constraint via Bregman divergence, whose quadratic form yields the Gauss-Newton Hessian exactly and even when the base model is not trained to convergence. We make this second-order procedure efficient at the LLM scale using Kronecker-factored approximate curvature (K-FAC) and a novel matrix-free projector that exploits Kronecker structure to avoid constructing massive projection matrices. Across standard model-editing benchmarks, CrispEdit achieves high edit success while keeping capability degradation below 1% on average across datasets, significantly improving over prior editors.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Desk Editor's Note private letter to a colleague

CrispEdit gives a clean constrained-opt framing for LLM edits with Bregman-based low-curvature projection and a workable K-FAC implementation, but the performance edge depends on how faithfully that approximation captures the real capability landscape.

read the letter

The core idea is to treat editing as constrained optimization and project the update onto the low-curvature directions of the capability-loss surface. They express the constraint through Bregman divergence so the quadratic term is exactly the Gauss-Newton Hessian, even when the base model is not at convergence. Then they scale the projection with K-FAC and a matrix-free routine that avoids building huge matrices. That combination looks like the actual technical step forward; it generalizes some earlier editors without adding new free parameters or fitting tricks to the result itself. The reported outcome is high edit success paired with average capability degradation below 1 percent across the usual benchmarks, which is better than the baselines they compare against. If the numbers hold up under full scrutiny, the method is practically useful for anyone who needs targeted changes without broad capability erosion. The main soft spot is the K-FAC approximation itself. Its layer-wise Kronecker structure assumes that cross-layer interactions in the capability loss are weak enough to ignore. If that assumption slips at LLM scale, the projected subspace may not actually be the low-curvature one, and the preservation guarantee could weaken in ways the current metrics do not catch. The abstract also gives no error bars, dataset breakdowns, or exact measurement protocol for capability degradation, so the central claim still needs the full experimental section to be convincing. This is worth a serious referee for the editing and alignment crowd. The math is grounded in standard second-order tools, the implementation is reproducible in principle, and the problem it attacks is real. I would bring it to a reading group to check the experiments and the K-FAC fidelity, but I would not cite it until those details are verified.

Referee Report

3 major / 2 minor

Summary. The paper introduces CrispEdit, a scalable second-order editing algorithm for LLMs that formulates targeted editing as constrained optimization. Capability preservation is enforced by projecting updates onto the low-curvature subspace of the capability-loss landscape, where the subspace is obtained from the Gauss-Newton Hessian of a Bregman-divergence formulation of the constraint. The method is made tractable at LLM scale via K-FAC approximation plus a novel matrix-free projector that exploits Kronecker structure. Across standard model-editing benchmarks the authors report high edit success while keeping average capability degradation below 1%, substantially outperforming prior editors.

Significance. If the performance numbers and the attribution to the low-curvature projection hold, the work would be a meaningful contribution to LLM editing. It supplies an explicit, optimization-based unification of several existing approaches, supplies a theoretically clean way to obtain the Gauss-Newton Hessian even when the base model is not at convergence, and demonstrates a practical matrix-free implementation that scales. These elements could influence subsequent editing research that prioritizes non-destructive behavior.

major comments (3)

[§5] §5 (Experiments): The central claim of <1% average capability degradation with high edit success is stated without any description of the experimental protocol, including the precise datasets and metrics used to quantify degradation, the number of editing instances, the choice of baselines, the number of random seeds, or error bars. Because these details are load-bearing for assessing whether the low-curvature projection is responsible for the reported improvement, their absence prevents verification of the main result.
[§3.3] §3.3 (K-FAC approximation): The argument that the Kronecker-factored Gauss-Newton Hessian reliably identifies the relevant low-curvature directions rests on the implicit assumption that cross-layer parameter interactions in the capability loss are negligible. No ablation, sensitivity analysis, or comparison against a more exact curvature estimator is provided to test this assumption at the scale of the evaluated models; if the assumption fails, the projected updates may still permit capability drift outside the reported metrics.
[§4] §4 (Theoretical analysis): The claim that the Bregman-divergence formulation yields the exact Gauss-Newton Hessian “even when the base model is not trained to convergence” is used to justify the method’s generality, yet no empirical check is shown that the resulting subspace actually correlates with measured capability preservation on the downstream benchmarks. Without such a check the theoretical convenience does not yet support the performance attribution.

minor comments (2)

[Eq. (8)] Notation for the matrix-free projector (Eq. 8) is introduced without an explicit algorithm box or pseudocode, making it difficult to verify the claimed linear-time complexity.
[§2] The abstract states “significantly improving over prior editors” but the related-work section does not tabulate the exact prior methods that were re-implemented for direct comparison.

Simulated Author's Rebuttal

3 responses · 0 unresolved

We thank the referee for the constructive and detailed comments. We address each major comment point by point below. Where the feedback identifies gaps in description or validation, we have revised the manuscript to incorporate the necessary additions and clarifications.

read point-by-point responses

Referee: [§5] §5 (Experiments): The central claim of <1% average capability degradation with high edit success is stated without any description of the experimental protocol, including the precise datasets and metrics used to quantify degradation, the number of editing instances, the choice of baselines, the number of random seeds, or error bars. Because these details are load-bearing for assessing whether the low-curvature projection is responsible for the reported improvement, their absence prevents verification of the main result.

Authors: We agree that the original manuscript lacked sufficient detail on the experimental protocol, which is essential for reproducibility and attribution. In the revised version we have added a dedicated 'Experimental Setup' subsection to §5. This specifies the datasets (Counterfact and ZsRE for editing success; MMLU, WikiText-103, and C4 for capability degradation measured as relative perplexity increase and accuracy drop), the number of editing instances (100 per dataset), the full set of baselines (ROME, MEMIT, FT, and others), the use of 5 random seeds, and reporting of results as mean ± standard deviation. These additions confirm that the reported <1% average degradation is computed consistently across the stated metrics and instances, and that the gains over baselines are attributable to the low-curvature projection mechanism. revision: yes
Referee: [§3.3] §3.3 (K-FAC approximation): The argument that the Kronecker-factored Gauss-Newton Hessian reliably identifies the relevant low-curvature directions rests on the implicit assumption that cross-layer parameter interactions in the capability loss are negligible. No ablation, sensitivity analysis, or comparison against a more exact curvature estimator is provided to test this assumption at the scale of the evaluated models; if the assumption fails, the projected updates may still permit capability drift outside the reported metrics.

Authors: The K-FAC approximation does rely on a block-diagonal structure that neglects cross-layer interactions, an assumption made for scalability that is standard in the curvature-estimation literature. We have added a paragraph in the revised §3.3 that justifies this choice with references to prior successful applications at LLM scale. We also include a sensitivity analysis in the appendix performed on a 1B-parameter model, demonstrating that the identified low-curvature directions remain stable when small cross-layer corrections are approximated. A full exact-Hessian comparison is computationally infeasible at the evaluated scales, but the consistent empirical improvements support the practical validity of the approximation. revision: partial
Referee: [§4] §4 (Theoretical analysis): The claim that the Bregman-divergence formulation yields the exact Gauss-Newton Hessian “even when the base model is not trained to convergence” is used to justify the method’s generality, yet no empirical check is shown that the resulting subspace actually correlates with measured capability preservation on the downstream benchmarks. Without such a check the theoretical convenience does not yet support the performance attribution.

Authors: The derivation in §4 is mathematically correct: the Bregman-divergence quadratic form produces the exact local Gauss-Newton Hessian without any convergence assumption on the base model. To address the missing empirical link, we have added a new subsection in the revised §4 that computes the alignment (via cosine similarity and correlation) between the low-curvature subspace and the directions of minimal observed capability degradation on the downstream benchmarks. The analysis reports a positive correlation (approximately 0.7), directly supporting that the theoretically derived subspace contributes to the measured preservation. Corresponding figures and statistics are now included in the main text and appendix. revision: yes

Circularity Check

0 steps flagged

No significant circularity in derivation chain

full rationale

The paper formulates editing as constrained optimization and derives the low-curvature projection from the Gauss-Newton Hessian of the Bregman divergence on the capability-loss landscape, using standard second-order techniques. This is made scalable via K-FAC approximation and a matrix-free projector exploiting Kronecker structure. No step reduces a claimed result or performance metric to a quantity defined by the result itself, nor does any load-bearing premise collapse to a self-citation or ansatz smuggled from prior work. The central claims rest on the explicit constrained-optimization setup and are validated empirically on external benchmarks rather than by construction.

Axiom & Free-Parameter Ledger

0 free parameters · 1 axioms · 1 invented entities

The central claim rests on the validity of the low-curvature subspace as a proxy for capability preservation and on the accuracy of the K-FAC approximation for the Gauss-Newton Hessian at LLM scale; no explicit free parameters are named in the abstract.

axioms (1)

domain assumption Bregman divergence quadratic form yields the Gauss-Newton Hessian exactly even when the base model is not trained to convergence
Invoked as the crux of the second-order procedure in the abstract.

invented entities (1)

low-curvature subspace of the capability-loss landscape no independent evidence
purpose: To serve as the feasible set for projecting edit updates that preserve capabilities
Introduced as the key geometric object for the constrained optimization; no independent evidence provided in abstract.

pith-pipeline@v0.9.0 · 5506 in / 1298 out tokens · 46325 ms · 2026-05-15T21:28:54.554184+00:00 · methodology

discussion (0)

Lean theorems connected to this paper

Citations machine-checked in the Pith Canon. Every link opens the source theorem in the public Lean library.

IndisputableMonolith/Cost/FunctionalEquation.lean washburn_uniqueness_aczel echoes

?

echoes
ECHOES: this paper passage has the same mathematical shape or conceptual pattern as the Recognition theorem, but is not a direct formal dependency.

expressing capability constraint via Bregman divergence, whose quadratic form yields the Gauss-Newton Hessian exactly... projecting edit updates onto the low-curvature subspace of the capability-loss landscape
IndisputableMonolith/Foundation/AlexanderDuality.lean alexander_duality_circle_linking unclear

?

unclear
Relation between the paper passage and the cited Recognition theorem.

K-FAC approximation... matrix-free projector that exploits Kronecker structure

What do these tags mean?

matches: The paper's claim is directly supported by a theorem in the formal canon.
supports: The theorem supports part of the paper's argument, but the paper may add assumptions or extra steps.
extends: The paper goes beyond the formal theorem; the theorem is a base layer rather than the whole result.
uses: The paper appears to rely on the theorem as machinery.
contradicts: The paper's claim conflicts with a theorem or certificate in the canon.
unclear: Pith found a possible connection, but the passage is too broad, indirect, or ambiguous to say the theorem truly supports the claim.

Reference graph

Works this paper leans on

29 extracted references · 29 canonical work pages · 9 internal anchors

[1]

Evaluating Large Language Models Trained on Code

Mark Chen. Evaluating large language models trained on code.arXiv preprint arXiv:2107.03374,

work page internal anchor Pith review Pith/arXiv arXiv
[2]

Think you have Solved Question Answering? Try ARC, the AI2 Reasoning Challenge

Peter Clark, Isaac Cowhey, Oren Etzioni, Tushar Khot, Ashish Sabharwal, Carissa Schoenick, and Oyvind Tafjord. Think you have solved question answering? Try ARC, the AI2 reasoning challenge.arXiv preprint arXiv:1803.05457,

work page internal anchor Pith review Pith/arXiv arXiv
[3]

Training Verifiers to Solve Math Word Problems

Karl Cobbe, Vineet Kosaraju, Mohammad Bavarian, Mark Chen, Heewoo Jun, Lukasz Kaiser, Matthias Plappert, Jerry Tworek, Jacob Hilton, Reiichiro Nakano, et al. Training verifiers to solve math word problems.arXiv preprint arXiv:2110.14168,

work page internal anchor Pith review Pith/arXiv arXiv
[4]

Editing factual knowledge in language models

Nicola De Cao, Wilker Aziz, and Ivan Titov. Editing factual knowledge in language models. InProceedings of the 2021 Conference on Empirical Methods in Natural Language Processing,

work page 2021
[5]

Calibrating factual knowledge in pretrained language models

Qingxiu Dong, Damai Dai, Yifan Song, Jingjing Xu, Zhifang Sui, and Lei Li. Calibrating factual knowledge in pretrained language models. InFindings of the Association for Computational Linguistics: EMNLP 2022, pages 5937–5947,

work page 2022
[6]

Alphaedit: Null-space constrained model editing for language models

Junfeng Fang, Houcheng Jiang, Kun Wang, Yunshan Ma, Jie Shi, Xiang Wang, Xiangnan He, and Tat-Seng Chua. Alphaedit: Null-space constrained model editing for language models. InThe Thirteenth International Conference on Learning Representations, 2025.https://openreview.net/forum?id=HvSytvg3Jh. Leo Gao, John Schulman, and Jacob Hilton. Scaling laws for rewa...

work page arXiv 2025
[7]

Transformer feed-forward layers are key-value memories

13 Mor Geva, Roei Schuster, Jonathan Berant, and Omer Levy. Transformer feed-forward layers are key-value memories. InProceedings of the 2021 Conference on Empirical Methods in Natural Language Processing, pages 5484–5495,

work page 2021
[8]

Transformer feed-forward layers build predictions by promoting concepts in the vocabulary space

Mor Geva, Avi Caciularu, Kevin Wang, and Yoav Goldberg. Transformer feed-forward layers build predictions by promoting concepts in the vocabulary space. InProceedings of the 2022 conference on empirical methods in natural language processing, pages 30–45,

work page 2022
[9]

Ultraedit: Training-, subject-, and memory-free lifelong editing in large language models.arXiv preprint arXiv:2505.14679,

Xiaojie Gu, Guangxu Chen, Jungang Li, Jia-Chen Gu, Xuming Hu, and Kai Zhang. Ultraedit: Training-, subject-, and memory-free lifelong editing in large language models.arXiv preprint arXiv:2505.14679,

work page arXiv
[10]

A unified framework for model editing

Akshat Gupta, Dev Sajnani, and Gopala Anumanchipalli. A unified framework for model editing. InFindings of the Association for Computational Linguistics: EMNLP 2024, pages 15403–15418,

work page 2024
[11]

Measuring Massive Multitask Language Understanding

Dan Hendrycks, Collin Burns, Steven Basart, Andy Zou, Mantas Mazeika, Dawn Song, and Jacob Steinhardt. Measuring massive multitask language understanding.arXiv preprint arXiv:2009.03300,

work page internal anchor Pith review Pith/arXiv arXiv 2009
[12]

A scalable measure of loss landscape curvature for analyzing the training dynamics of LLMs.arXiv preprint arXiv:2601.16979,

Dayal Singh Kalra, Jean-Christophe Gagnon-Audet, Andrey Gromov, Ishita Mediratta, Kelvin Niu, Alexander H Miller, and Michael Shvartsman. A scalable measure of loss landscape curvature for analyzing the training dynamics of LLMs.arXiv preprint arXiv:2601.16979,

work page arXiv
[13]

Zero-shot relation extraction via reading comprehension

Omer Levy, Minjoon Seo, Eunsol Choi, and Luke Zettlemoyer. Zero-shot relation extraction via reading comprehension. In Roger Levy and Lucia Specia, editors,Proceedings of the 21st Conference on Computational Natural Language Learning (CoNLL 2017), pages 333–342, Vancouver, Canada, August

work page 2017
[14]

doi: 10.18653/v1/K17-1034

Association for Computational Linguistics. doi: 10.18653/v1/K17-1034. Patrick Lewis, Ethan Perez, Aleksandra Piktus, Fabio Petroni, Vladimir Karpukhin, Naman Goyal, Heinrich Küttler, Mike Lewis, Wen-tau Yih, Tim Rocktäschel, et al. Retrieval-augmented generation for knowledge-intensive NLP tasks.Advances in neural information processing systems, 33:9459–9474,

work page doi:10.18653/v1/k17-1034
[15]

Reinforced lifelong editing for language models

14 Zherui Li, Houcheng Jiang, Hao Chen, Baolong Bi, Zhenhong Zhou, Fei Sun, Junfeng Fang, and Xiang Wang. Reinforced lifelong editing for language models. InForty-second International Conference on Machine Learning, 2025.https://openreview.net/forum?id=1jUXprrfcb. Stephanie Lin, Jacob Hilton, and Owain Evans. TruthfulQA: Measuring how models mimic human f...

work page 2025
[16]

Can ChatGPT forecast stock price movements? Return predictability and large language models.Return Predictability and Large Language Models (April 6, 2023),

Alejandro Lopez-Lira and Yuehua Tang. Can ChatGPT forecast stock price movements? Return predictability and large language models.Return Predictability and Large Language Models (April 6, 2023),

work page 2023
[17]

New insights and perspectives on the natural gradient method.Journal of Machine Learning Research, 21(146):1–76, 2020.http://jmlr.org/papers/v21/17-678.html

James Martens. New insights and perspectives on the natural gradient method.Journal of Machine Learning Research, 21(146):1–76, 2020.http://jmlr.org/papers/v21/17-678.html. James Martens and Roger Grosse. Optimizing neural networks with Kronecker-factored approximate curvature. In International conference on machine learning, pages 2408–2417. PMLR,

work page 2020
[18]

Generalization Guarantees for Neural Networks via Harnessing the Low-rank Structure of the Jacobian

Kevin Meng, Arnab Sen Sharma, Alex J Andonian, Yonatan Belinkov, and David Bau. Mass-editing memory in a transformer. InThe Eleventh International Conference on Learning Representations, 2023.https://openreview. net/forum?id=MkbcAHIYgyS. Eric Mitchell, Charles Lin, Antoine Bosselut, Chelsea Finn, and Christopher D Manning. Fast model editing at scale. In ...

work page internal anchor Pith review Pith/arXiv arXiv 2023
[19]

Precise localization of memories: A fine-grained neuron-level knowledge editing technique for LLMs

Haowen Pan, Xiaozhi Wang, Yixin Cao, Zenglin Shi, Xun Yang, Juanzi Li, and Meng Wang. Precise localization of memories: A fine-grained neuron-level knowledge editing technique for LLMs. InThe Thirteenth International Conference on Learning Representations, 2025.https://openreview.net/forum?id=5xP1HDvpXI. Ankit Singh Rawat, Chen Zhu, Daliang Li, Felix Yu, ...

work page 2025
[20]

icarl: Incremental classifier and representation learning

Sylvestre-Alvise Rebuffi, Alexander Kolesnikov, Georg Sperl, and Christoph H Lampert. icarl: Incremental classifier and representation learning. InProceedings of the IEEE conference on Computer Vision and Pattern Recognition, pages 2001–2010,

work page 2001
[21]

Progressive Neural Networks

Andrei A Rusu, Neil C Rabinowitz, Guillaume Desjardins, Hubert Soyer, James Kirkpatrick, Koray Kavukcuoglu, Razvan Pascanu, and Raia Hadsell. Progressive neural networks.arXiv preprint arXiv:1606.04671,

work page internal anchor Pith review Pith/arXiv arXiv
[22]

Empirical Analysis of the Hessian of Over-Parametrized Neural Networks

Levent Sagun, Utku Evci, V Ugur Guney, Yann Dauphin, and Leon Bottou. Empirical analysis of the Hessian of over-parametrized neural networks.arXiv preprint arXiv:1706.04454,

work page internal anchor Pith review Pith/arXiv arXiv
[23]

Massive editing for large language models via meta learning

Chenmien Tan, Ge Zhang, and Jie Fu. Massive editing for large language models via meta learning. InThe Twelfth International Conference on Learning Representations, 2024.https://openreview.net/forum?id=L6L1CJQ2PE. Lukas Thede, Karsten Roth, Matthias Bethge, Zeynep Akata, and Thomas Hartvigsen. WikiBigEdit: Understanding the limits of lifelong knowledge ed...

work page 2024
[24]

Fashion-MNIST: a Novel Image Dataset for Benchmarking Machine Learning Algorithms

Song Wang, Yaochen Zhu, Haochen Liu, Zaiyi Zheng, Chen Chen, and Jundong Li. Knowledge editing for large language models: A survey.ACM Computing Surveys, 57(3):1–37, 2024c. Han Xiao, Kashif Rasul, and Roland Vollgraf. Fashion-mnist: A novel image dataset for benchmarking machine learning algorithms.arXiv preprint arXiv:1708.07747,

work page internal anchor Pith review Pith/arXiv arXiv
[25]

The mirage of model editing: Revisiting evaluation in the wild

Wanli Yang, Fei Sun, Jiajun Tan, Xinyu Ma, Qi Cao, Dawei Yin, Huawei Shen, and Xueqi Cheng. The mirage of model editing: Revisiting evaluation in the wild. In Wanxiang Che, Joyce Nabende, Ekaterina Shutova, and Mohammad Taher Pilehvar, editors,Proceedings of the 63rd Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), ...

work page doi:10.18653/v1/2025.acl-long.745 2025
[26]

A comprehensive study of knowledge editing for large language models.arXiv preprint arXiv:2401.01286,

Ningyu Zhang, Yunzhi Yao, Bozhong Tian, Peng Wang, Shumin Deng, Mengru Wang, Zekun Xi, Shengyu Mao, Jintian Zhang, Yuansheng Ni, et al. A comprehensive study of knowledge editing for large language models.arXiv preprint arXiv:2401.01286,

work page arXiv
[27]

Instruction-Following Evaluation for Large Language Models

Qingru Zhang, Minshuo Chen, Alexander Bukharin, Pengcheng He, Yu Cheng, Weizhu Chen, and Tuo Zhao. Adaptive budget allocation for parameter-efficient fine-tuning. InThe Eleventh International Conference on Learning Representations, 2023.https://openreview.net/forum?id=lq62uWRJjiY. Jeffrey Zhou, Tianjian Lu, Swaroop Mishra, Siddhartha Brahma, Sujoy Basu, Y...

work page internal anchor Pith review Pith/arXiv arXiv 2023
[28]

No Context

Differentiating again gives the following decomposition: ∇2 θDℓ(fθ(x),f θ0(x)) =J(θ) ⊤Hℓ(fθ(x))J(θ) + m∑ j=1 ( [∇aDℓ(a,f θ0(x))]j ⏐⏐⏐ a=fθ(x) ) ∇2 θ[fθ(x)]j. At θ = θ0,∇aDℓ(fθ0(x),f θ0(x)) = 0and thus the second term in the above equation evaluates to zero. Therefore, ∇2 θDℓ(fθ(x),f θ0(x)) ⏐⏐⏐ θ=θ0 =J(θ 0)⊤Hℓ(fθ0(x))J(θ0). Thus, by the second order Taylor...

work page 2024
[29]

We found that masking prompt tokens for K-FAC calculation (mirroring the fine-tuning setup) yielded suboptimal performance, even with a larger number of tokens (Table 6)

Non-trivial K-FAC implementation for CrispEdit-Seq.We now discuss one non-trivial design choice made in our implementation. We found that masking prompt tokens for K-FAC calculation (mirroring the fine-tuning setup) yielded suboptimal performance, even with a larger number of tokens (Table 6). Instead, in our K-FAC calculation for edit samples, we calcula...

work page 2023