LoCO: Low-rank Compositional Rotation Fine-tuning

Anh Tong; An Nguyen; Jaesik Choi

arXiv: 2605.15916 · v1 · pith:AIMP2GHXnew · submitted 2026-05-15 · 💻 cs.LG · cs.AI · cs.CV

Pith reviewed 2026-05-20 19:48 UTC · model grok-4.3

classification 💻 cs.LG · cs.AI · cs.CV

keywords: parameter-efficient fine-tuning · orthogonal transformations · low-rank adaptation · rotation chains · skew-symmetric matrices · PEFT · model adaptation

The pith

LoCO constructs orthogonal fine-tuning updates from low-rank skew-symmetric matrices in rotation chains.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper presents LoCO as a parameter-efficient fine-tuning approach that builds orthogonal transformations by chaining low-rank skew-symmetric matrices into rotations. This targets the geometric distortion that occurs when standard low-rank updates like those in LoRA are applied without orthogonality constraints. An approximation scheme is introduced to compute the compositional rotations in parallel, which keeps the method usable for high-dimensional features while limiting the deviation from true orthogonality. Validation across diffusion transformer, vision transformer, and language model tasks shows results that match or surpass both orthogonal and non-orthogonal baselines. A sympathetic reader would see this as a way to adapt large models more faithfully to their original representation geometry without increasing parameter count.

Core claim

LoCO constructs orthogonal transformations through low-rank skew-symmetric matrices and compositional rotation chains. An approximation scheme enables fully parallel computation of these rotations, making the method practical for high-dimensional spaces while preserving orthogonality with controlled approximation error.

What carries the argument

Low-rank skew-symmetric matrices assembled into compositional rotation chains, combined with a parallel approximation for efficient evaluation.
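
A minimal sketch of the kind of construction described here. It assumes, as an illustration rather than as the paper's confirmed parameterization, that each skew-symmetric matrix has the form A = X Yᵀ − Y Xᵀ with X and Y of shape d × r, and that each rotation is the Cayley transform R = (I − A)⁻¹(I + A). The function names `low_rank_skew`, `cayley_rotation` and `rotation_chain` and all numeric values are hypothetical.

```python
import numpy as np


def low_rank_skew(X, Y):
    """Skew-symmetric matrix of rank at most 2r (assumed form: X Y^T - Y X^T)."""
    return X @ Y.T - Y @ X.T


def cayley_rotation(A):
    """Cayley transform R = (I - A)^{-1} (I + A); orthogonal when A is skew-symmetric."""
    I = np.eye(A.shape[0])
    return np.linalg.solve(I - A, I + A)


def rotation_chain(pairs):
    """Compose one rotation per (X, Y) pair, applied in list order."""
    d = pairs[0][0].shape[0]
    R = np.eye(d)
    for X, Y in pairs:
        R = cayley_rotation(low_rank_skew(X, Y)) @ R
    return R


rng = np.random.default_rng(0)
d, r, k = 64, 4, 3  # feature dimension, rank, chain length (illustrative values)
pairs = [(0.1 * rng.standard_normal((d, r)), 0.1 * rng.standard_normal((d, r)))
         for _ in range(k)]
R = rotation_chain(pairs)
deviation = np.linalg.norm(R.T @ R - np.eye(d))  # Frobenius norm of R^T R - I
print(f"orthogonality deviation of the exact chain: {deviation:.2e}")
```

The exact chain is orthogonal up to floating-point round-off, but it needs one d × d solve per factor and the factors must be applied in sequence, which is the cost the parallel approximation is meant to avoid.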

If this is right

The fine-tuned weights remain closer to the original geometry because each update step is an approximate rotation.
Parallel computation keeps training cost comparable to standard low-rank methods despite the chain structure.
The bounded approximation error allows deployment in high-dimensional feature spaces without sacrificing the orthogonality property.
Results on vision, diffusion, and language tasks indicate the method is at least competitive with prior orthogonal and non-orthogonal PEFT approaches.
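
The trade-off between the sequential chain and a parallel first-order scheme can be sketched numerically. The code below is an illustration under assumptions, not the authors' implementation: it takes A = X Yᵀ − Y Xᵀ as the skew-symmetric form, uses the Cayley transform for each exact rotation, and approximates the chain by I + Σ Δ_i, where each Δ_i is applied through a small 2r × 2r solve. All names (`skew_factors`, `delta_apply`, `cayley_dense`, `compare`) and sizes are hypothetical.

```python
import numpy as np


def skew_factors(X, Y):
    """Write A = X Y^T - Y X^T as U V^T with U = [X, -Y], V = [Y, X] (assumed form)."""
    return np.hstack([X, -Y]), np.hstack([Y, X])


def delta_apply(U, V, x):
    """Apply Delta = R - I = 2 U (I - V^T U)^{-1} V^T to x; solves only a 2r x 2r system."""
    M = np.eye(U.shape[1]) - V.T @ U
    return 2.0 * (U @ np.linalg.solve(M, V.T @ x))


def cayley_dense(U, V):
    """Dense reference: R = (I - A)^{-1} (I + A) with A = U V^T."""
    A = U @ V.T
    I = np.eye(A.shape[0])
    return np.linalg.solve(I - A, I + A)


def compare(eps, d=256, r=4, k=4, seed=0):
    """Return (relative norm error, relative gap to the sequential chain)."""
    rng = np.random.default_rng(seed)
    pairs = []
    for _ in range(k):
        X = rng.standard_normal((d, r))
        Y = rng.standard_normal((d, r))
        X *= eps / np.linalg.norm(X)  # Frobenius norm of X set to eps
        Y *= eps / np.linalg.norm(Y)
        pairs.append(skew_factors(X, Y))
    x = rng.standard_normal(d)

    y_seq = x.copy()
    for U, V in pairs:  # exact chain: k dependent steps
        y_seq = cayley_dense(U, V) @ y_seq

    # first-order scheme: the k terms are independent, so they can run in parallel
    y_par = x + sum(delta_apply(U, V, x) for U, V in pairs)

    nx = np.linalg.norm(x)
    return abs(nx - np.linalg.norm(y_par)) / nx, np.linalg.norm(y_par - y_seq) / nx


for eps in (1e-2, 1e-1, 1.0):
    norm_err, gap = compare(eps)
    print(f"eps={eps:g}  norm error={norm_err:.2e}  gap to exact chain={gap:.2e}")
```

In this toy setting both the norm error and the gap to the exact chain shrink as the factor norms shrink, which is the regime in which a first-order scheme can be expected to stay close to orthogonal.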

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the authors make directly.

The same low-rank rotation construction could be tested in continual learning settings to check whether orthogonality reduces interference between tasks.
Exact non-approximated versions might be feasible for medium-sized models, allowing direct measurement of how much the parallel approximation affects final performance.
Similar compositional ideas could be applied to other matrix groups beyond rotations, such as scaling or shear transformations in adaptation.

Load-bearing premise

Enforcing orthogonality through this low-rank compositional construction will preserve the geometric structure of pretrained representations enough to produce performance gains over existing PEFT methods.

What would settle it

If a side-by-side experiment on a held-out adaptation task shows LoCO achieving lower accuracy or higher error than a matched low-rank non-orthogonal baseline while using the same number of parameters, the claimed benefit of the orthogonal construction would be contradicted.

Figures

Figures reproduced from arXiv: 2605.15916 by Anh Tong, An Nguyen, Jaesik Choi.

**Figure 2.** Relative error |∥x∥ − ∥Rx∥| / ∥x∥ in first-order approximations in Equation (5). This illustrates that our first-order approximation can preserve vector magnitude under linear transformation R. In this experiment, we vary ∥X∥ = ∥Y∥ across a range of ε ∈ [10^(−6), 10^(−0.5)]. The shaded region indicates the standard deviation across multiple random initializations of X and Y.

**Figure 3.** Training efficiency comparison on DeBERTA-V3.

**Figure 4.** Effect of varying the temperature parameter…

**Figure 6.** Performance comparison on VTAB-1k benchmark across…

**Figure 7.** (a) Orthogonality deviation ∥R̃ᵀR̃ − I∥_F extracted from a trained diffusion model. (b) Distribution of perturbation norms ∥Z_i∥_F or ∥Δ_i∥_F from the same checkpoint. Here, R̃ ∈ ℝ^(3072×3072) is a high-dimensional matrix. The deviation norm remains low (mean ≈ 0.1), indicating that the approximation roughly preserves vector norms under transformation.

**Figure 8.** Training efficiency comparison on DeBERTA-V3, batch…

**Figure 9.** Training efficiency comparison on LLaMA2-7B batch size…

**Figure 10.** Efficiency comparison during training time on LLaMA2-…

**Figure 11.** Effect of varying the temperature parameter…

**Figure 12.** Effect of varying the temperature parameter…

**Figure 13.** Qualitative comparison of Canny edge-to-image generation. Columns show: input canny edges, ground truth, and outputs from…

**Figure 14.** Qualitative comparison of deblurring image generation. Columns show: blurred input, ground truth, and outputs from LoRA,…

**Figure 15.** Qualitative comparison of depth-to-image generation. Columns show: depth map input, ground truth, and outputs from LoRA,…

**Figure 16.** Qualitative comparison of inpainting (fill) image generation. Columns show: masked input, ground truth, and outputs from LoRA,…

Original abstract

Parameter-efficient fine-tuning (PEFT) has emerged as an critical technique for adapting large-scale foundation models across natural language processing and computer vision. While existing methods such as low-rank adaptations achieve parameter efficiency via low-rank weight updates, they are limited in their ability to preserve the geometric structure of pretrained representations. We introduce Low-rank Compositional Orthogonal fine-tuning (LoCO), a novel PEFT method that constructs orthogonal transformations through low-rank skew-symmetric matrices and compositional rotation chains. We propose an approximation scheme that enables fully parallel computation of compositional rotations, making the approach practical for high-dimensional feature spaces. Our method maintains low computational complexity while maintaining orthogonality with controlled approximation error. We validate LoCO across diverse domains, including diffusion transformer fine-tuning, vision transformer adaptation, and language model adaptation. Our method demonstrates superior or competitive performance compared to both existing orthogonal and non-orthogonal methods.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Desk Editor's Note (private letter to a colleague)

LoCO adds a concrete PEFT construction that builds orthogonal updates from low-rank skew-symmetric matrices in compositional chains plus a parallel approximation, but the lack of a clear error bound on that approximation is the main open question.

The main thing to know is that this paper puts forward LoCO as a way to do parameter-efficient fine-tuning with orthogonal transformations built from products of rotations generated by low-rank skew-symmetric matrices, plus an approximation that lets the composition run in parallel instead of sequentially. The authors test it on diffusion transformer fine-tuning, vision transformer adaptation, and language model tasks, where it comes out competitive or ahead of both standard low-rank methods and other orthogonal approaches. That spread of domains is useful to see.

The specific combination of low-rank skew-symmetric factors with the compositional chain and the parallel scheme is the clearest new element on top of existing orthogonal and low-rank PEFT ideas. It keeps the parameter count low while trying to preserve the geometric properties of the pretrained weights, which is a reasonable motivation if the orthogonality actually holds up. The experiments give some practical evidence that the method works at least as well as the baselines they compare against.

The soft spot is the approximation itself. The claim is that it maintains orthogonality with controlled error, yet there is no explicit bound shown on how that error scales with dimension or the number of factors in the chain. In high-dimensional spaces the accumulated deviation from Q^T Q = I could grow, and without a derivation, or at least reported measurements of the actual orthogonality error on the model sizes used, that possibility cannot be ruled out. If that gap is not closed, the geometric invariance argument weakens even if task performance looks fine.

This is aimed at people working on efficient adaptation of large models who want to try structure-preserving updates. A reader already following the orthogonal PEFT literature would see the most direct value in the construction details and the multi-domain results. It is worth sending to peer review. The mechanism is specific enough and the experiments broad enough that referees can check the approximation and the numbers directly.

Referee Report

2 major / 2 minor

Summary. The manuscript proposes LoCO, a parameter-efficient fine-tuning method that constructs orthogonal transformations as products of rotations generated by low-rank skew-symmetric matrices, arranged in compositional rotation chains. It introduces an approximation to enable fully parallel computation of these compositions, claiming this maintains orthogonality with controlled approximation error while remaining computationally efficient for high-dimensional spaces. The approach is evaluated on diffusion transformer fine-tuning, vision transformer adaptation, and language model adaptation, where it reports superior or competitive performance relative to both orthogonal and non-orthogonal PEFT baselines.

Significance. If the parallel approximation indeed preserves the orthogonality guarantee with error that remains small independently of dimension and depth, LoCO would provide a geometrically principled alternative to low-rank methods such as LoRA by better respecting the structure of pretrained representations. The multi-domain empirical validation offers preliminary evidence of practical utility, but the absence of a derived error bound weakens the theoretical foundation relative to the central claim.

major comments (2)

[Method (approximation scheme)] The approximation scheme for parallel computation of compositional rotations (described in the method section following the construction of low-rank skew-symmetric matrices) lacks an explicit error bound. The abstract and method claim that orthogonality is maintained 'with controlled approximation error,' yet no derivation is provided showing that ||Q^T Q - I|| remains below a fixed threshold (e.g., 1e-4) independently of rank r, composition depth k, and dimension d; without such a bound the geometric invariance guarantee is not established.
[Experiments] The experimental results across the three domains do not include quantitative monitoring of the approximation error (e.g., measured ||Q^T Q - I|| values during or after fine-tuning). This omission makes it impossible to verify that the reported performance gains are achieved under the claimed orthogonality control rather than despite uncontrolled drift.
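
The kind of diagnostic requested in these comments can be sketched in a few lines. This is a hedged illustration, not the authors' tooling: `orthogonality_report` is a hypothetical helper, and the two test matrices are synthetic stand-ins (an orthogonal factor from a QR decomposition and a perturbed copy), not rotations extracted from a fine-tuned model.

```python
import numpy as np


def orthogonality_report(Q):
    """Summarize how far a square matrix Q is from orthogonal."""
    G = Q.T @ Q - np.eye(Q.shape[1])
    s = np.linalg.svd(Q, compute_uv=False)  # all equal to 1 for an orthogonal Q
    return {
        "fro": float(np.linalg.norm(G, "fro")),  # ||Q^T Q - I||_F
        "spec": float(np.linalg.norm(G, 2)),     # largest singular value of Q^T Q - I
        "sv_min": float(s.min()),
        "sv_max": float(s.max()),
    }


rng = np.random.default_rng(0)
d = 64
Q_exact, _ = np.linalg.qr(rng.standard_normal((d, d)))  # orthogonal up to round-off
Q_noisy = Q_exact + 1e-2 * rng.standard_normal((d, d))  # stand-in for an approximate rotation

exact_report = orthogonality_report(Q_exact)
noisy_report = orthogonality_report(Q_noisy)
print(exact_report)
print(noisy_report)
```

Logging such a report per adapted layer at initialization and after fine-tuning would give the measured ||Q^T Q - I|| values the second comment asks for. The Frobenius norm grows with the number of perturbed directions, so reporting the spectral norm alongside it makes comparisons across dimensions easier to read.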

minor comments (2)

[Abstract] Abstract contains the grammatical error 'an critical' which should read 'a critical'.
[Method] Notation for the low-rank skew-symmetric matrices and the composition operator could be introduced more explicitly with a single consistent definition to aid readability.

Simulated Author's Rebuttal

2 responses · 0 unresolved

We thank the referee for the detailed and constructive feedback on our manuscript. The comments correctly identify areas where additional theoretical analysis and empirical diagnostics would strengthen the presentation of LoCO. We address each major comment below and commit to the corresponding revisions.

Referee: [Method (approximation scheme)] The approximation scheme for parallel computation of compositional rotations (described in the method section following the construction of low-rank skew-symmetric matrices) lacks an explicit error bound. The abstract and method claim that orthogonality is maintained 'with controlled approximation error,' yet no derivation is provided showing that ||Q^T Q - I|| remains below a fixed threshold (e.g., 1e-4) independently of rank r, composition depth k, and dimension d; without such a bound the geometric invariance guarantee is not established.

Authors: We agree that the manuscript would benefit from an explicit derivation of the approximation error. The current text motivates the parallel scheme through its construction from low-rank skew-symmetric matrices and reports that orthogonality is preserved with controlled error, but does not supply a formal bound on ||Q^T Q - I||. In the revised manuscript we will add a dedicated subsection deriving such a bound, showing that the deviation can be kept below a small constant (independent of d for fixed r and k) under standard assumptions on the step sizes used in the composition.

Revision: yes

Referee: [Experiments] The experimental results across the three domains do not include quantitative monitoring of the approximation error (e.g., measured ||Q^T Q - I|| values during or after fine-tuning). This omission makes it impossible to verify that the reported performance gains are achieved under the claimed orthogonality control rather than despite uncontrolled drift.

Authors: We concur that reporting the realized orthogonality error is necessary to substantiate the practical control of the approximation. The original experiments emphasized downstream task metrics across diffusion transformers, vision transformers, and language models. In the revision we will include additional tables and/or figures that report the measured ||Q^T Q - I|| values both at initialization and after fine-tuning for each domain, thereby confirming that the observed performance occurs under the claimed level of orthogonality preservation.

Revision: yes
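
A sketch of how a dimension-independent bound of the kind discussed in this exchange could look. It is an elementary argument under stated assumptions, not a result taken from the manuscript, and it may differ from any derivation the authors provide. The assumptions are: each factor R_i = I + Δ_i is exactly orthogonal, the parallel scheme uses Q = I + Σ_i Δ_i, and ‖Δ_i‖_F ≤ ε for every i.

```latex
\begin{align*}
R_i^{\top} R_i = I
  &\;\Longrightarrow\; \Delta_i + \Delta_i^{\top} + \Delta_i^{\top}\Delta_i = 0, \\
Q^{\top} Q - I
  &= \sum_{i=1}^{k} \bigl(\Delta_i + \Delta_i^{\top}\bigr)
     + \sum_{i=1}^{k}\sum_{j=1}^{k} \Delta_i^{\top}\Delta_j
   = \sum_{i \neq j} \Delta_i^{\top}\Delta_j, \\
\bigl\| Q^{\top} Q - I \bigr\|_F
  &\le \sum_{i \neq j} \|\Delta_i\|_F \, \|\Delta_j\|_F
   \;\le\; k(k-1)\,\varepsilon^{2}.
\end{align*}
```

Under these assumptions the deviation depends on the chain length k and the perturbation size ε but not explicitly on the dimension d. Whether the learned Δ_i actually stay small is an empirical question that only measured norms can answer.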

Circularity Check

0 steps flagged

No circularity: explicit new construction for orthogonal PEFT

The paper presents LoCO as a novel method that constructs orthogonal maps from low-rank skew-symmetric matrices and introduces a parallel approximation for compositional chains. This is an independent proposal with stated approximation error control, not a redefinition or fit of prior quantities. No self-definitional equations, fitted-input predictions, or load-bearing self-citations appear in the provided abstract or description. The derivation chain remains self-contained against external benchmarks and does not reduce to its inputs by construction.

Axiom & Free-Parameter Ledger

0 free parameters · 0 axioms · 0 invented entities

Review performed on abstract only; no explicit free parameters, axioms, or invented entities are extractable from the provided text.

pith-pipeline@v0.9.0 · 5678 in / 1142 out tokens · 37250 ms · 2026-05-20T19:48:18.060295+00:00 · methodology

Lean theorems connected to this paper

Citations machine-checked in the Pith Canon. Every link opens the source theorem in the public Lean library.

IndisputableMonolith/Cost/FunctionalEquation.lean · washburn_uniqueness_aczel · unclear

Paper passage: "We parameterize skew-symmetric matrices through a low-rank outer product form... R = (I − A)⁻¹(I + A) ... first-order approximation ... R ≈ I + 2 Σ_i X_i (I − Y_iᵀ X_i)⁻¹ Y_iᵀ"

IndisputableMonolith/Foundation/AlexanderDuality.lean · alexander_duality_circle_linking · unclear

Paper passage: "The composition of orthogonal matrices preserves orthogonality... approximation error is bounded by Theorem 1"
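
One way to read the first-order expression quoted in the passages above, offered as a reconstruction rather than the paper's own derivation. It assumes a low-rank factorization A_i = X_i Y_iᵀ of each skew-symmetric matrix (the exact shapes of X_i and Y_i are not given in the quoted text) and uses the Cayley form together with the push-through identity.

```latex
\begin{align*}
R_i &= (I - A_i)^{-1}(I + A_i)
     = (I - A_i)^{-1}\bigl((I - A_i) + 2A_i\bigr)
     = I + 2\,(I - A_i)^{-1} A_i, \\
(I - X_i Y_i^{\top})^{-1} X_i &= X_i\,(I - Y_i^{\top} X_i)^{-1}
     \quad\text{since } X_i (I - Y_i^{\top} X_i) = (I - X_i Y_i^{\top})\, X_i, \\
R_i &= I + \Delta_i, \qquad
     \Delta_i = 2\, X_i\,(I - Y_i^{\top} X_i)^{-1}\, Y_i^{\top}, \\
\prod_{i=1}^{k} (I + \Delta_i)
    &= I + \sum_{i=1}^{k} \Delta_i
       + O\!\Bigl(\max_{i \neq j} \|\Delta_i\|\,\|\Delta_j\|\Bigr)
     \;\approx\; I + 2 \sum_{i=1}^{k} X_i\,(I - Y_i^{\top} X_i)^{-1}\, Y_i^{\top}.
\end{align*}
```

The remainder is stated for a fixed chain length k. Each single factor is exact, and the only approximation in this reading is dropping the cross terms between different factors, which is what removes the sequential dependency.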

What do these tags mean?

matches: The paper's claim is directly supported by a theorem in the formal canon.
supports: The theorem supports part of the paper's argument, but the paper may add assumptions or extra steps.
extends: The paper goes beyond the formal theorem; the theorem is a base layer rather than the whole result.
uses: The paper appears to rely on the theorem as machinery.
contradicts: The paper's claim conflicts with a theorem or certificate in the canon.
unclear: Pith found a possible connection, but the passage is too broad, indirect, or ambiguous to say the theorem truly supports the claim.

Reference graph

Works this paper leans on

58 extracted references · 58 canonical work pages · 6 internal anchors

[1] [Aghajanyan et al., 2020] Armen Aghajanyan, Luke Zettlemoyer, and Sonal Gupta. Intrinsic dimensionality explains the effectiveness of language model fine-tuning. arXiv preprint arXiv:2012.13255, 2020.

[2] [Arcas et al., 2025] Alejandro Moreno Arcas, Albert Sanchis, Jorge Civera, and Alfons Juan. Hoft: Householder orthogonal fine-tuning. arXiv preprint arXiv:2505.16531, 2025.

[3] [Bałazy et al., 2024] Klaudia Bałazy, Mohammadreza Banaei, Karl Aberer, and Jacek Tabor. Lora-xs: Low-rank adaptation with extremely small number of parameters. arXiv preprint arXiv:2405.17604, 2024.

[4] [Brown et al., 2020] Tom Brown, Benjamin Mann, Nick Ryder, Melanie Subbiah, Jared D Kaplan, Prafulla Dhariwal, Arvind Neelakantan, Pranav Shyam, Girish Sastry, Amanda Askell, et al. Language models are few-shot learners. NeurIPS, 2020.

[5] [Chen et al., 2020] Beidi Chen, Weiyang Liu, Zhiding Yu, Jan Kautz, Anshumali Shrivastava, Animesh Garg, and Animashree Anandkumar. Angular visual hardness. In ICML, 2020.

[6] [Chen et al., 2022] Weize Chen, Xu Han, Yankai Lin, Hexu Zhao, Zhiyuan Liu, Peng Li, Maosong Sun, and Jie Zhou. Fully hyperbolic neural networks. In ACL, 2022.

[7] [Cobbe et al., 2021] Karl Cobbe, Vineet Kosaraju, et al. Training verifiers to solve math word problems. arXiv preprint arXiv:2110.14168, 2021.

[8] [contributors, 2025] Wikipedia contributors. Peak signal-to-noise ratio. https://en.wikipedia.org/wiki/Peak_signal-to-noise_ratio, 2025.

[9] [Dao et al., 2022] Tri Dao, Beidi Chen, et al. Monarch: Expressive structured matrices for efficient and accurate training. In International Conference on Machine Learning, 2022.

[10] [Dong et al., 2023] Wei Dong, Dawei Yan, Zhijun Lin, and Peng Wang. Efficient adaptation of large vision transformer via adapter re-composing. In NeurIPS, 2023.
[11] [Dong et al., 2024] Wei Dong, Yuan Sun, Yiting Yang, Xing Zhang, Zhijun Lin, Qingsen Yan, Haokui Zhang, Peng Wang, Yang Yang, and Hengtao Shen. Efficient adaptation of pre-trained vision transformer via householder transformation. NeurIPS, 2024.

[12] [Dosovitskiy, 2020] Alexey Dosovitskiy. An image is worth 16x16 words: Transformers for image recognition at scale. arXiv preprint arXiv:2010.11929, 2020.

[13] [Finzi et al., 2020] Marc Finzi, Samuel Stanton, Pavel Izmailov, and Andrew Gordon Wilson. Generalizing convolutional neural networks for equivariance to lie groups on arbitrary continuous data. In ICML, 2020.

[14] [Ganea et al., 2018] Octavian Ganea, Gary Bécigneul, and Thomas Hofmann. Hyperbolic entailment cones for learning hierarchical embeddings. In ICML, 2018.

[15] [Gorbunov et al., 2024] Mikhail Gorbunov, Kolya Yudin, Maxim Rakhuba, et al. Group and Shuffle: Efficient Structured Orthogonal Parametrization. In The Thirty-eighth Annual Conference on Neural Information Processing Systems, 2024.

[16] [He et al., 2021] Pengcheng He, Xiaodong Liu, Jianfeng Gao, and Weizhu Chen. Deberta: Decoding-enhanced bert with disentangled attention. In ICLR, 2021.

[17] [Hendrycks et al., 2021] Dan Hendrycks, Collin Burns, Steven Basart, Andy Zou, Mantas Mazeika, Dawn Song, and Jacob Steinhardt. Measuring massive multitask language understanding. ICLR, 2021.

[18] [Heusel et al., 2017] Martin Heusel, Hubert Ramsauer, Thomas Unterthiner, Bernhard Nessler, and Sepp Hochreiter. GANs trained by a two time-scale update rule converge to a local nash equilibrium. NeurIPS, 2017.

[19] [Houlsby et al., 2019] Neil Houlsby, Andrei Giurgiu, Stanislaw Jastrzebski, Bruna Morrone, Quentin De Laroussilhe, Andrea Gesmundo, Mona Attariyan, and Sylvain Gelly. Parameter-efficient transfer learning for nlp. In ICML, 2019.

[20] [Householder, 1958] Alston S. Householder. Unitary triangularization of a nonsymmetric matrix. J. ACM, 1958.
[21] [Hu et al., 2022] Edward J Hu, Yelong Shen, Phillip Wallis, Zeyuan Allen-Zhu, Yuanzhi Li, Shean Wang, Lu Wang, Weizhu Chen, et al. Lora: Low-rank adaptation of large language models. ICLR, 2022.

[22] [Jia et al., 2022] Menglin Jia, Luming Tang, Bor-Chun Chen, Claire Cardie, Serge Belongie, Bharath Hariharan, and Ser-Nam Lim. Visual prompt tuning. In ECCV, 2022.

[23] [Jiang et al., 2024] Ting Jiang, Shaohan Huang, Shengyue Luo, Zihan Zhang, Haizhen Huang, Furu Wei, Weiwei Deng, Feng Sun, Qi Zhang, Deqing Wang, and Fuzhen Zhuang. Mora: High-rank updating for parameter-efficient fine-tuning. arXiv preprint arXiv:2405.12130, 2024.

[24] [Kalajdzievski, 2023] Damjan Kalajdzievski. A rank stabilization scaling factor for fine-tuning with lora. arXiv preprint arXiv:2312.03732, 2023.

[25] [Ke et al., 2021] Junjie Ke, Qifei Wang, Yilin Wang, Peyman Milanfar, and Feng Yang. Musiq: Multi-scale image quality transformer. In ICCV, 2021.

[26] [Kopiczko et al., 2024] Dawid Jan Kopiczko, Tijmen Blankevoort, and Yuki M Asano. Vera: Vector-based random matrix adaptation. In ICLR, 2024.

[27] [Labs, 2024] Black Forest Labs. Flux. https://github.com/black-forest-labs/flux, 2024.

[28] [Lester et al., 2021] Brian Lester, Rami Al-Rfou, and Noah Constant. The power of scale for parameter-efficient prompt tuning. In EMNLP, 2021.

[29] [Li and Liang, 2021] Xiang Lisa Li and Percy Liang. Prefix-tuning: Optimizing continuous prompts for generation. In ACL, 2021.

[30] [Lian et al., 2022] Dongze Lian, Daquan Zhou, Jiashi Feng, and Xinchao Wang. Scaling & shifting your features: A new baseline for efficient model tuning. NeurIPS, 2022.
[31] [Liao and Monz, 2024] Baohao Liao and Christof Monz. 3-in-1: 2d rotary adaptation for efficient finetuning, efficient batching and composability. NeurIPS, 2024.

[32] [Liu et al., 2017] Weiyang Liu, Yan-Ming Zhang, Xingguo Li, Zhiding Yu, Bo Dai, Tuo Zhao, and Le Song. Deep hyperspherical learning. NeurIPS, 2017.

[33] [Liu et al., 2018] Weiyang Liu, Zhen Liu, Zhiding Yu, Bo Dai, Rongmei Lin, Yisen Wang, James M Rehg, and Le Song. Decoupled networks. In CVPR, 2018.

[34] [Ma et al., 2024] Xinyu Ma, Xu Chu, Zhibang Yang, Yang Lin, Xin Gao, and Junfeng Zhao. Parameter efficient quasi-orthogonal fine-tuning via givens rotation. arXiv preprint arXiv:2404.04316, 2024.

[35] [Max, 1950] A Woodbury Max. Inverting modified matrices. In Memorandum Rept. 42, Statistical Research Group. Princeton Univ., 1950.

[36] [Meng et al., 2024] Fanxu Meng, Zhaohui Wang, and Muhan Zhang. Pissa: Principal singular values and singular vectors adaptation of large language models. NeurIPS, 2024.

[37] [Mironenco and Forré, 2024] Mircea Mironenco and Patrick Forré. Lie group decompositions for equivariant neural networks. In ICLR, 2024.

[38] [Peebles and Xie, 2023] William Peebles and Saining Xie. Scalable diffusion models with transformers. In CVPR, 2023.

[39] [Press, 2007] William H Press. Numerical recipes 3rd edition: The art of scientific computing. Cambridge university press, 2007.

[40] [Qiu et al., 2023] Zeju Qiu, Weiyang Liu, Haiwen Feng, Yuxuan Xue, Yao Feng, Zhen Liu, Dan Zhang, Adrian Weller, and Bernhard Schölkopf. Controlling text-to-image diffusion by orthogonal finetuning. NeurIPS, 2023.
[41] [Qiu et al., 2025] Zeju Qiu, Simon Buchholz, Tim Z Xiao, Maximilian Dax, Bernhard Schölkopf, and Weiyang Liu. Reparameterized llm training via orthogonal equivalence transformation. arXiv preprint arXiv:2506.08001, 2025.

[42] [Rajpurkar et al., 2016] Pranav Rajpurkar, Jian Zhang, Konstantin Lopyrev, and Percy Liang. SQuAD: 100,000+ questions for machine comprehension of text. In EMNLP, 2016.

[43] [Rombach et al., 2022] Robin Rombach, Andreas Blattmann, Dominik Lorenz, Patrick Esser, and Björn Ommer. High-resolution image synthesis with latent diffusion models. In CVPR, 2022.

[44] [Singla and Feizi, 2021] Sahil Singla and Soheil Feizi. Skew orthogonal convolutions. In ICML, 2021.

[45] [Tan et al., 2025] Zhenxiong Tan, Songhua Liu, Xingyi Yang, Qiaochu Xue, and Xinchao Wang. Ominicontrol: Minimal and universal control for diffusion transformer. In ICCV, 2025.

[46] [Touvron et al., 2023] Hugo Touvron, Louis Martin, Kevin Stone, et al. Llama 2: Open foundation and fine-tuned chat models. arXiv preprint arXiv:2307.09288, 2023.

[47] [Vaswani et al., 2017] Ashish Vaswani, Noam Shazeer, Niki Parmar, Jakob Uszkoreit, Llion Jones, Aidan N Gomez, Łukasz Kaiser, and Illia Polosukhin. Attention is all you need. NeurIPS, 2017.

[48] [Wang et al., 2023] Jianyi Wang, Kelvin CK Chan, and Chen Change Loy. Exploring clip for assessing the look and feel of images. In AAAI, 2023.

[49] [Wang et al., 2024] Hanqing Wang, Yixia Li, Shuo Wang, Guanhua Chen, and Yun Chen. Milora: Harnessing minor singular components for parameter-efficient llm fine-tuning. arXiv preprint arXiv:2406.09044, 2024.

[50] [Yang et al., 2022] Sidi Yang, Tianhe Wu, Shuwei Shi, Shanshan Lao, Yuan Gong, Mingdeng Cao, Jiahao Wang, and Yujiu Yang. Maniqa: Multi-dimension attention network for no-reference image quality assessment. In CVPR, 2022.
[51]

MetaMath: Bootstrap Your Own Mathematical Questions for Large Language Models

[Yuet al., 2023 ] Longhui Yu, Weisen Jiang, et al. Metamath: Bootstrap your own mathematical questions for large lan- guage models.arXiv preprint arXiv:2309.12284,

work page internal anchor Pith review Pith/arXiv arXiv 2023
[52]

Bridging the gap between low-rank and orthogonal adaptation via householder reflection adaptation.NeurIPS,

[Yuanet al., 2024 ] Shen Yuan, Haotian Liu, and Hongteng Xu. Bridging the gap between low-rank and orthogonal adaptation via householder reflection adaptation.NeurIPS,

work page 2024
[53]

A Large-scale Study of Representation Learning with the Visual Task Adaptation Benchmark

[Zhaiet al., 2019 ] Xiaohua Zhai, Joan Puigcerver, Alexan- der Kolesnikov, Pierre Ruyssen, Carlos Riquelme, Mario Lucic, Josip Djolonga, Andre Susano Pinto, Maxim Neu- mann, Alexey Dosovitskiy, et al. A large-scale study of representation learning with the visual task adaptation benchmark.arXiv preprint arXiv:1910.04867,

work page internal anchor Pith review Pith/arXiv arXiv 2019
[54]

[Zhang et al., 2023] Qingru Zhang, Minshuo Chen, Alexander Bukharin, Pengcheng He, Yu Cheng, Weizhu Chen, and Tuo Zhao. Adaptive budget allocation for parameter-efficient fine-tuning. In ICLR, 2023.

[55]

A Theoretical Analysis of LoCO

A.1 Time complexity among orthogonal approaches

This section provides a comparison of orthogonal fine-tuning methods based on their structural properties and their time and space complexity. In particular, we compare LoCO with several contemporary methods, namely: OFT [Qiu et al., 2023], HRA [Yuan et al., 2024], ...

[56]

The orthogonality deviation (Figure 7a) remains relatively small across all layers, and the perturbation norm $\|\Delta_i\|_F$ (Figure 7b) confirms that learned parameters stay within the regime where the approximation is valid.

Figure 7: (a) Orthogonality deviation $\|\tilde{R}^\top \tilde{R} - I\|_F$ extracted from a trained diffusion model. (b) Distribution of perturbation nor...

[57]

Evaluation on mathematical reasoning tasks. In the fine-tuning experiments on the LLaMA2 model, we fix the maximum sequence length at 512, which is sufficient for these tasks. Following BOFT [Liu et al., 2024b], we adapt the LLaMA2-7B model [Touvron et al., 2023] on the first 512 tokens of MetaMath-40K [Yu et al., 2023]. The details are given in Table 7. For th...

[58]

HRA suffers from an out-of-memory (OOM) issue. To compare computational efficiency, we align the configurations of all methods to match the number of trainable parameters. We define two settings, light mode and heavy mode, as computational costs vary significantly depending on the specific hyperparameter set. For instance, the rank r impacts the matrix inversion c...

[1]

[Aghajanyan et al., 2020] Armen Aghajanyan, Luke Zettlemoyer, and Sonal Gupta. Intrinsic dimensionality explains the effectiveness of language model fine-tuning. arXiv preprint arXiv:2012.13255, 2020.

[2]

[Arcas et al., 2025] Alejandro Moreno Arcas, Albert Sanchis, Jorge Civera, and Alfons Juan. Hoft: Householder orthogonal fine-tuning. arXiv preprint arXiv:2505.16531, 2025.

[3]

[Bałazy et al., 2024] Klaudia Bałazy, Mohammadreza Banaei, Karl Aberer, and Jacek Tabor. Lora-xs: Low-rank adaptation with extremely small number of parameters. arXiv preprint arXiv:2405.17604, 2024.

[4]

[Brown et al., 2020] Tom Brown, Benjamin Mann, Nick Ryder, Melanie Subbiah, Jared D Kaplan, Prafulla Dhariwal, Arvind Neelakantan, Pranav Shyam, Girish Sastry, Amanda Askell, et al. Language models are few-shot learners. NeurIPS, 2020.

[5]

[Chen et al., 2020] Beidi Chen, Weiyang Liu, Zhiding Yu, Jan Kautz, Anshumali Shrivastava, Animesh Garg, and Animashree Anandkumar. Angular visual hardness. In ICML, 2020.

[6]

[Chen et al., 2022] Weize Chen, Xu Han, Yankai Lin, Hexu Zhao, Zhiyuan Liu, Peng Li, Maosong Sun, and Jie Zhou. Fully hyperbolic neural networks. In ACL, 2022.

[7]

[Cobbe et al., 2021] Karl Cobbe, Vineet Kosaraju, et al. Training verifiers to solve math word problems. arXiv preprint arXiv:2110.14168, 2021.

[8]

[contributors, 2025] Wikipedia contributors. Peak signal-to-noise ratio. https://en.wikipedia.org/wiki/Peak_signal-to-noise_ratio, 2025.

[9]

[Dao et al., 2022] Tri Dao, Beidi Chen, et al. Monarch: Expressive structured matrices for efficient and accurate training. In International Conference on Machine Learning, 2022.

[10]

[Dong et al., 2023] Wei Dong, Dawei Yan, Zhijun Lin, and Peng Wang. Efficient adaptation of large vision transformer via adapter re-composing. In NeurIPS, 2023.

[11]

[Dong et al., 2024] Wei Dong, Yuan Sun, Yiting Yang, Xing Zhang, Zhijun Lin, Qingsen Yan, Haokui Zhang, Peng Wang, Yang Yang, and Hengtao Shen. Efficient adaptation of pre-trained vision transformer via householder transformation. NeurIPS, 2024.

[12]

[Dosovitskiy, 2020] Alexey Dosovitskiy. An image is worth 16x16 words: Transformers for image recognition at scale. arXiv preprint arXiv:2010.11929, 2020.

[13]

[Finzi et al., 2020] Marc Finzi, Samuel Stanton, Pavel Izmailov, and Andrew Gordon Wilson. Generalizing convolutional neural networks for equivariance to lie groups on arbitrary continuous data. In ICML, 2020.

[14]

[Ganea et al., 2018] Octavian Ganea, Gary Bécigneul, and Thomas Hofmann. Hyperbolic entailment cones for learning hierarchical embeddings. In ICML, 2018.

[15]

[Gorbunov et al., 2024] Mikhail Gorbunov, Kolya Yudin, Maxim Rakhuba, et al. Group and Shuffle: Efficient Structured Orthogonal Parametrization. In The Thirty-eighth Annual Conference on Neural Information Processing Systems, 2024.

[16]

[He et al., 2021] Pengcheng He, Xiaodong Liu, Jianfeng Gao, and Weizhu Chen. Deberta: Decoding-enhanced bert with disentangled attention. In ICLR, 2021.

[17]

[Hendrycks et al., 2021] Dan Hendrycks, Collin Burns, Steven Basart, Andy Zou, Mantas Mazeika, Dawn Song, and Jacob Steinhardt. Measuring massive multitask language understanding. ICLR, 2021.

[18]

[Heusel et al., 2017] Martin Heusel, Hubert Ramsauer, Thomas Unterthiner, Bernhard Nessler, and Sepp Hochreiter. GANs trained by a two time-scale update rule converge to a local nash equilibrium. NeurIPS, 2017.

[19]

[Houlsby et al., 2019] Neil Houlsby, Andrei Giurgiu, Stanislaw Jastrzebski, Bruna Morrone, Quentin De Laroussilhe, Andrea Gesmundo, Mona Attariyan, and Sylvain Gelly. Parameter-efficient transfer learning for nlp. In ICML, 2019.

[20]

[Householder, 1958] Alston S. Householder. Unitary triangularization of a nonsymmetric matrix. J. ACM, 1958.

[21]

[Hu et al., 2022] Edward J Hu, Yelong Shen, Phillip Wallis, Zeyuan Allen-Zhu, Yuanzhi Li, Shean Wang, Lu Wang, Weizhu Chen, et al. Lora: Low-rank adaptation of large language models. ICLR, 2022.

[22]

[Jia et al., 2022] Menglin Jia, Luming Tang, Bor-Chun Chen, Claire Cardie, Serge Belongie, Bharath Hariharan, and Ser-Nam Lim. Visual prompt tuning. In ECCV, 2022.

[23]

[Jiang et al., 2024] Ting Jiang, Shaohan Huang, Shengyue Luo, Zihan Zhang, Haizhen Huang, Furu Wei, Weiwei Deng, Feng Sun, Qi Zhang, Deqing Wang, and Fuzhen Zhuang. Mora: High-rank updating for parameter-efficient fine-tuning. arXiv preprint arXiv:2405.12130, 2024.

[24]

[Kalajdzievski, 2023] Damjan Kalajdzievski. A rank stabilization scaling factor for fine-tuning with lora. arXiv preprint arXiv:2312.03732, 2023.

[25]

[Ke et al., 2021] Junjie Ke, Qifei Wang, Yilin Wang, Peyman Milanfar, and Feng Yang. Musiq: Multi-scale image quality transformer. In ICCV, 2021.

[26]

[Kopiczko et al., 2024] Dawid Jan Kopiczko, Tijmen Blankevoort, and Yuki M Asano. Vera: Vector-based random matrix adaptation. In ICLR, 2024.

[27]

[Labs, 2024] Black Forest Labs. Flux. https://github.com/black-forest-labs/flux, 2024.

[28]

[Lester et al., 2021] Brian Lester, Rami Al-Rfou, and Noah Constant. The power of scale for parameter-efficient prompt tuning. In EMNLP, 2021.

[29]

[Li and Liang, 2021] Xiang Lisa Li and Percy Liang. Prefix-tuning: Optimizing continuous prompts for generation. In ACL, 2021.

[30]

[Lian et al., 2022] Dongze Lian, Daquan Zhou, Jiashi Feng, and Xinchao Wang. Scaling & shifting your features: A new baseline for efficient model tuning. NeurIPS, 2022.

[31]

[Liao and Monz, 2024] Baohao Liao and Christof Monz. 3-in-1: 2d rotary adaptation for efficient finetuning, efficient batching and composability. NeurIPS, 2024.

[32]

[Liu et al., 2017] Weiyang Liu, Yan-Ming Zhang, Xingguo Li, Zhiding Yu, Bo Dai, Tuo Zhao, and Le Song. Deep hyperspherical learning. NeurIPS, 2017.

[33]

[Liu et al., 2018] Weiyang Liu, Zhen Liu, Zhiding Yu, Bo Dai, Rongmei Lin, Yisen Wang, James M Rehg, and Le Song. Decoupled networks. In CVPR, 2018.

[34]

[Ma et al., 2024] Xinyu Ma, Xu Chu, Zhibang Yang, Yang Lin, Xin Gao, and Junfeng Zhao. Parameter efficient quasi-orthogonal fine-tuning via givens rotation. arXiv preprint arXiv:2404.04316, 2024.

[35]

[Max, 1950] Max A. Woodbury. Inverting modified matrices. In Memorandum Rept. 42, Statistical Research Group. Princeton Univ., 1950.

[36]

[Meng et al., 2024] Fanxu Meng, Zhaohui Wang, and Muhan Zhang. Pissa: Principal singular values and singular vectors adaptation of large language models. NeurIPS, 2024.

[37]

[Mironenco and Forré, 2024] Mircea Mironenco and Patrick Forré. Lie group decompositions for equivariant neural networks. In ICLR, 2024.

[38]

[Peebles and Xie, 2023] William Peebles and Saining Xie. Scalable diffusion models with transformers. In CVPR, 2023.

[39]

[Press, 2007] William H Press. Numerical recipes 3rd edition: The art of scientific computing. Cambridge university press, 2007.

[40]

[Qiu et al., 2023] Zeju Qiu, Weiyang Liu, Haiwen Feng, Yuxuan Xue, Yao Feng, Zhen Liu, Dan Zhang, Adrian Weller, and Bernhard Schölkopf. Controlling text-to-image diffusion by orthogonal finetuning. NeurIPS, 2023.

[41]

[Qiu et al., 2025] Zeju Qiu, Simon Buchholz, Tim Z Xiao, Maximilian Dax, Bernhard Schölkopf, and Weiyang Liu. Reparameterized llm training via orthogonal equivalence transformation. arXiv preprint arXiv:2506.08001, 2025.

[42]

[Rajpurkar et al., 2016] Pranav Rajpurkar, Jian Zhang, Konstantin Lopyrev, and Percy Liang. SQuAD: 100,000+ questions for machine comprehension of text. In EMNLP, 2016.

[43]

[Rombach et al., 2022] Robin Rombach, Andreas Blattmann, Dominik Lorenz, Patrick Esser, and Björn Ommer. High-resolution image synthesis with latent diffusion models. In CVPR, 2022.

[44]

[Singla and Feizi, 2021] Sahil Singla and Soheil Feizi. Skew orthogonal convolutions. In ICML, 2021.

[45]

[Tan et al., 2025] Zhenxiong Tan, Songhua Liu, Xingyi Yang, Qiaochu Xue, and Xinchao Wang. Ominicontrol: Minimal and universal control for diffusion transformer. In ICCV, 2025.

[46]

[Touvron et al., 2023] Hugo Touvron, Louis Martin, Kevin Stone, et al. Llama 2: Open foundation and fine-tuned chat models. arXiv preprint arXiv:2307.09288, 2023.

[47]

[Vaswani et al., 2017] Ashish Vaswani, Noam Shazeer, Niki Parmar, Jakob Uszkoreit, Llion Jones, Aidan N Gomez, Łukasz Kaiser, and Illia Polosukhin. Attention is all you need. NeurIPS, 2017.

[48]

[Wang et al., 2023] Jianyi Wang, Kelvin CK Chan, and Chen Change Loy. Exploring clip for assessing the look and feel of images. In AAAI, 2023.
