FuRA: Full-Rank Parameter-Efficient Fine-Tuning with Spectral Preconditioning

Liyan Tan; Niall Moran; Ruijie Zhang; Tong Qin; Yequan Zhao; Zheng Zhang

arxiv: 2605.22869 · v1 · pith:RHXTDYLMnew · submitted 2026-05-19 · 💻 cs.LG

Pith reviewed 2026-05-25 05:37 UTC · model grok-4.3

classification 💻 cs.LG

keywords: full-rank adaptation, parameter-efficient fine-tuning, spectral preconditioning, singular value decomposition, LoRA, LLM fine-tuning, quantized fine-tuning

The pith

Reparameterizing weight updates through pretrained singular vectors lets full-rank adaptation beat unconstrained full fine-tuning at the same parameter budget.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper claims that both full fine-tuning and methods like LoRA overlook the spectral structure set during pretraining, so noisy gradients from small fine-tuning datasets can disturb already robust features. By reparameterizing each weight matrix with its full-rank SVD and freezing one singular basis, updates stay inside the pretrained column space and the optimization becomes preconditioned. FuRA realizes this idea through a block tensor-train factorization W = LSR in which the large core L is locked to the pretrained block-wise SVD basis while only the compact core R and the singular values S are trained. The result is full-rank expressivity, spectral preconditioning, and memory/step-time cost comparable to LoRA, with measured gains such as +1.37 on LLaMA-3-8B commonsense reasoning and better performance than QLoRA for the 4-bit quantized variant.

Core claim

FuRA reparameterizes each weight matrix through its full-rank SVD and freezes one singular basis so that updates are constrained to the pretrained column space; this yields a preconditioned optimization scheme that outperforms unconstrained full fine-tuning at identical trainable parameter count. The concrete implementation uses the block tensor-train factorization W = LSR where the large core L is fixed to the pretrained block-wise SVD basis and only the compact core R together with the block-wise singular values S are optimized, simultaneously delivering full-rank spectral preconditioning, full-rank update expressivity, and LoRA-level efficiency.
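The column-space constraint and the preconditioning effect can be made explicit with a short worked derivation. This is a sketch under assumptions that this page does not state: a single unblocked matrix, the frozen factor taken to be the left singular basis U with orthonormal columns, plain gradient descent with step size η, and G denoting the gradient of the loss with respect to W.

```latex
\begin{aligned}
W &= U S R, \qquad U^\top U = I,\quad S = \operatorname{diag}(s),\quad R = V^\top \ \text{at initialization},\\
\Delta W &= U\,(S' R' - S R) \;\Longrightarrow\; \operatorname{col}(\Delta W) \subseteq \operatorname{col}(U),\\
\nabla_R \mathcal{L} &= S\,U^\top G, \qquad \nabla_s \mathcal{L} = \operatorname{diag}\!\left(U^\top G R^\top\right),\\
\Delta W\big|_{R\text{-step}} &= U S\left(-\eta\, S\,U^\top G\right) = -\eta\, U S^{2} U^\top G .
\end{aligned}
```

Compared with the unconstrained step −ηG, a step on R alone is projected onto the column space of U and rescaled by the squared singular values. With an adaptive optimizer such as AdamW the per-coordinate normalization changes this scaling, so the last line is indicative rather than a description of the paper's training dynamics.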

What carries the argument

The block tensor-train factorization W = LSR with the large core L fixed to the pretrained block-wise SVD basis, which constrains all updates to the pretrained column space.
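A minimal NumPy sketch of a block-wise SVD factorization may help fix ideas. It assumes the weight is split column-wise into K equal blocks and that each block gets a thin SVD; the split direction, the function names `block_svd_factorize` and `reconstruct`, and the tensor shapes are illustrative assumptions, not the layout used in the paper or its released code.

```python
import numpy as np

def block_svd_factorize(W, num_blocks):
    """Split W column-wise into blocks and take the thin SVD of each block.

    Returns (L, S, R) with shapes (K, m, b), (K, b), (K, b, b).
    Hypothetical layout: the paper's exact tensor-train shapes may differ.
    """
    m, n = W.shape
    assert n % num_blocks == 0, "columns must divide evenly into blocks"
    b = n // num_blocks
    assert m >= b, "this sketch assumes tall blocks so that U_k is m x b"
    L, S, R = [], [], []
    for k in range(num_blocks):
        U, s, Vt = np.linalg.svd(W[:, k * b:(k + 1) * b], full_matrices=False)
        L.append(U)   # large core: frozen left singular vectors
        S.append(s)   # block-wise singular values: trainable
        R.append(Vt)  # compact core: trainable right factor
    return np.stack(L), np.stack(S), np.stack(R)

def reconstruct(L, S, R):
    """Rebuild W by concatenating U_k diag(s_k) R_k along the column axis."""
    blocks = [(L[k] * S[k]) @ R[k] for k in range(L.shape[0])]
    return np.concatenate(blocks, axis=1)

rng = np.random.default_rng(0)
W = rng.standard_normal((64, 64))          # stand-in for a pretrained weight
L, S, R = block_svd_factorize(W, num_blocks=8)
W_hat = reconstruct(L, S, R)

max_err = float(np.max(np.abs(W - W_hat)))  # lossless up to rounding
frozen_params = L.size                      # 8 * 64 * 8 = 4096
trainable_params = S.size + R.size          # 64 + 8 * 8 * 8 = 576
```

Under this assumed layout the frozen core holds as many entries as W itself, while the trainable part grows with the block width b rather than with the full matrix size, which is the sense in which the cost can sit near LoRA's.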

If this is right

FuRA outperforms full fine-tuning by 1.37 points on LLaMA-3-8B commonsense reasoning.
The 4-bit quantized variant QFuRA surpasses QLoRA.
The same gains appear in LLM reinforcement learning for mathematical reasoning and in visual instruction tuning for VLMs.
Full-rank updates can be made parameter-efficient and spectrally preconditioned without sacrificing expressivity.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

If the spectral constraint proves general, similar preconditioning could be applied to other gradient-based adaptation settings beyond the tasks tested.
The approach invites direct comparison of update directions in the column space versus the full ambient space on the same downstream objective.
Layer-wise variation in how strictly the SVD basis is frozen could be tested to see whether some layers benefit from more flexibility.
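The column-space comparison suggested above could start from a simple diagnostic: the share of a gradient's squared Frobenius norm that falls inside the span of a frozen basis. The sketch below uses random stand-ins for the weight block and gradients; `in_subspace_fraction` is a hypothetical helper, not something taken from the paper.

```python
import numpy as np

def in_subspace_fraction(G, U):
    """Fraction of the squared Frobenius norm of G lying in the span of U's columns.

    U must have orthonormal columns (for example the left singular vectors
    of a tall pretrained weight block).
    """
    projected = U @ (U.T @ G)
    return float(np.sum(projected ** 2) / np.sum(G ** 2))

rng = np.random.default_rng(0)
W_block = rng.standard_normal((64, 8))       # stand-in for a pretrained block
U, _, _ = np.linalg.svd(W_block, full_matrices=False)

G_inside = U @ rng.standard_normal((8, 8))   # gradient fully inside span(U)
G_random = rng.standard_normal((64, 8))      # generic gradient

frac_inside = in_subspace_fraction(G_inside, U)   # close to 1
frac_random = in_subspace_fraction(G_random, U)   # well below 1
```

A value near 1 means the constraint discards little of the gradient; a small value means most of the raw gradient points outside the frozen basis.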

Load-bearing premise

Freezing the pretrained singular vector bases will not reduce the expressivity of the updates enough to hurt performance on downstream tasks.

What would settle it

An experiment in which FuRA underperforms full fine-tuning on a standard downstream task at matched trainable parameter count would falsify the central claim.

Figures

Figures reproduced from arXiv: 2605.22869 by Liyan Tan, Niall Moran, Ruijie Zhang, Tong Qin, Yequan Zhao, Zheng Zhang.

**Figure 1.** FuRA delivers higher accuracy than Full FT with LoRA-level runtime and the lowest GPU memory. Large language models (LLMs) [10, 36] acquire rich, transferable representations during pretraining on web-scale corpora. Fine-tuning these models on a small, task-specific dataset, whether through supervised fine-tuning (SFT) or reinforcement learning with verifiable rewards (RLVR) [40, 52], can unlock strong do…

**Figure 2.** FuRA replaces each pretrained linear layer W with a lossless block tensor-train factorization W = LSR. Tensors are flattened to matrices for illustration. Each slice is initialized by the full-rank SVD of a weight block W_k and performs a spectrally preconditioned update. ∆W, whether computed directly from gradients (Full FT) or accumulated in a low-rank adapter BA (LoRA), is added to W without regard t…

**Figure 3.** (a) The gradient lies outside the singular basis of the pretrained weight, yet the weight remains stable. (b) Singular values shift selectively, while singular vectors rotate broadly. Panels (a) and (b) show the layer 15 q_proj result; the full sweeps are in Figures 6 and 7 (Appendix). (c) SVD FT outperforms Full FT on both the target domain and the source domain. 2 Background and Related Work. PEFT methods for LLMs. LoRA [14] constrains t…

**Figure 4.** (a) Effective rank of ∆W; full sweep is in …

**Figure 5.** LLaMA-3-8B Math-10K SFT, GSM8K accuracy.

**Figure 6.** Full per-layer / per-module sweep for Figure 3.

**Figure 7.** Full per-layer / per-module sweep for Figure 3.

**Figure 8.** Full per-layer / per-module sweep for Figure 4.
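Figure 4 refers to the effective rank of ∆W, but the caption text available here does not say how it is computed. One common choice, assumed in the sketch below, is the entropy-based effective rank: the exponential of the Shannon entropy of the normalized singular values. The helper name `effective_rank` and the test matrices are illustrative.

```python
import numpy as np

def effective_rank(M, eps=1e-12):
    """Entropy-based effective rank: exp(H(p)), p = normalized singular values."""
    s = np.linalg.svd(M, compute_uv=False)
    p = s / (s.sum() + eps)
    p = p[p > eps]                       # drop numerically zero directions
    return float(np.exp(-np.sum(p * np.log(p))))

rng = np.random.default_rng(0)
# A rank-2 update, as a LoRA-style adapter with r = 2 would produce.
low_rank_update = rng.standard_normal((32, 2)) @ rng.standard_normal((2, 32))
# An update with equal weight on all 32 directions.
identity_update = np.eye(32)

er_low = effective_rank(low_rank_update)    # at most 2
er_full = effective_rank(identity_update)   # 32 up to rounding
```

The measure is bounded above by the algebraic rank, so a rank-r adapter can never score above r, whereas a full-rank update can reach the full dimension.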

Original abstract

Both full fine-tuning (Full FT) and parameter-efficient fine-tuning methods such as LoRA introduce weight updates without accounting for the spectral structure established during pretraining. As a result, noisy gradients from limited fine-tuning data can perturb robust pretrained features. We identify spectral preconditioning as the missing ingredient: reparameterizing each weight matrix through its full-rank singular value decomposition (SVD) and freezing one singular basis constrains updates to the pretrained column space, yielding a preconditioned optimization scheme that outperforms unconstrained Full FT at the same trainable parameter count. Building on this insight, we propose FuRA (Full-Rank Adaptation), an efficient full-rank adaptation framework based on a block tensor-train factorization W = LSR, where the large core L is fixed to the pretrained block-wise SVD basis, while only the compact core R and the block-wise singular values S are optimized. This design simultaneously provides full-rank spectral preconditioning, preserves full-rank update expressivity, and achieves parameter, memory, and step-time efficiency comparable to LoRA. FuRA consistently outperforms Full FT across multiple settings, including LLM fine-tuning (+1.37 on LLaMA-3-8B commonsense reasoning), LLM reinforcement learning for mathematical reasoning, and visual instruction tuning for VLMs. Furthermore, the 4-bit quantized variant, QFuRA, also surpasses QLoRA. Code is available at https://github.com/olokevin/FuRA-NIPS
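The phrase "preconditioned optimization scheme" can be checked numerically on a toy matrix. The sketch below assumes a single unblocked weight W = U diag(s) R with U frozen and plain gradient descent on R; it verifies that such a step equals the raw gradient multiplied by U diag(s²) Uᵀ and that the update stays in the column space of U. All arrays are random stand-ins and the setup is an assumption, not the paper's training procedure.

```python
import numpy as np

rng = np.random.default_rng(0)
m, n, eta = 16, 8, 1e-2
W = rng.standard_normal((m, n))           # stand-in for a pretrained weight
U, s, Vt = np.linalg.svd(W, full_matrices=False)
R = Vt.copy()                             # trainable right factor
G = rng.standard_normal((m, n))           # stand-in for dLoss/dW

# Chain rule through W = U diag(s) R with U frozen.
grad_R = (U * s).T @ G                    # diag(s) U^T G
grad_s = np.sum((U.T @ G) * R, axis=1)    # diagonal of U^T G R^T

# One gradient-descent step on R only, then rebuild the weight.
W_new = (U * s) @ (R - eta * grad_R)
delta_W = W_new - W

# Closed form: the raw gradient preconditioned by U diag(s^2) U^T.
predicted = -eta * (U * s**2) @ (U.T @ G)
step_error = float(np.max(np.abs(delta_W - predicted)))

# The update has no component outside the column space of U.
outside = delta_W - U @ (U.T @ delta_W)
leak = float(np.max(np.abs(outside)))
```

Both `step_error` and `leak` should be at the level of floating-point rounding. The check covers only the R update; a step on the singular values s changes W through U diag(Δs) R and also remains inside the same column space.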

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Desk Editor's Note: private letter to a colleague

FuRA freezes pretrained SVD bases inside a full-rank tensor-train factorization to precondition updates, claiming better results than full fine-tuning at similar parameter counts, but the column-space constraint is the part that needs checking.

The core idea is to take each weight matrix, compute its block-wise SVD from pretraining, fix the left singular vectors L, and then optimize only the singular values S plus a compact right factor R so that the update stays full-rank but cheap. This is packaged as FuRA and extended to a 4-bit version, QFuRA. The abstract reports a +1.37 point lift on LLaMA-3-8B commonsense tasks, plus gains in math RL and visual instruction tuning, all while matching LoRA-level memory and speed. Code is released, which helps reproducibility claims.

That combination of full-rank expressivity with spectral preconditioning is not in the standard LoRA or DoRA literature, so the design itself is the new piece. The paper does a reasonable job laying out why ignoring pretrained singular structure can let noisy fine-tuning gradients damage stable features. The empirical story is presented across three different regimes, which is better than single-task claims.

The soft spot is the load-bearing assumption that downstream tasks will not need directions outside the pretrained column space. If that assumption fails on some data, the method could underperform unconstrained full fine-tuning even at matched parameter budgets; the abstract does not show ablations that isolate this constraint or test cases where new singular directions would help. The reported gains also need to be examined for whether the SVD basis was computed on the exact same pretraining checkpoint used in the baselines and whether the block size choices were tuned after seeing results.

This work is aimed at people already running LoRA-style experiments on LLMs and VLMs who want a drop-in full-rank option without doubling the trainable parameters. A serious referee should see it because the method is concrete, the code is public, and the central design choice is falsifiable with the right controls. I would send it to review rather than desk-reject.

Referee Report

1 major / 1 minor

Summary. The manuscript proposes FuRA, a full-rank adaptation method that reparameterizes each weight matrix via a block tensor-train factorization W = LSR, with the large core L fixed to the pretrained block-wise SVD basis while optimizing only the compact R and block-wise singular values S. This is claimed to deliver spectral preconditioning by constraining updates to the pretrained column space, yielding better performance than unconstrained full fine-tuning at identical trainable parameter counts, with reported gains of +1.37 on LLaMA-3-8B commonsense reasoning, improvements in LLM RL for mathematical reasoning, visual instruction tuning for VLMs, and QFuRA outperforming QLoRA.

Significance. If the gains are shown to be robust, the result would be significant for establishing that spectral preconditioning via frozen pretrained singular bases can improve fine-tuning without sacrificing expressivity, providing an efficient bridge between PEFT and full FT. The public code release is a positive factor for reproducibility.

major comments (1)

[Abstract] The central claim that freezing one singular basis (constraining updates to the pretrained column space) does not reduce expressivity enough to negate gains over Full FT is load-bearing but unsupported by any derivation, ablation, or counter-example in the provided text; if optimal downstream directions lie outside this space, the outperformance at matched parameter count would not hold.

minor comments (1)

The abstract reports a +1.37 gain without error bars, run counts, or confirmation that data splits and SVD choices were fixed before seeing results, which is required to assess whether the comparison to Full FT is fair.
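The reporting the referee asks for is cheap to produce once per-seed scores exist. The sketch below computes the mean, the sample standard deviation, and the standard error of the mean using only the standard library; the scores are made-up placeholders, not numbers from the paper, and `summarize_runs` is a hypothetical helper.

```python
import math
import statistics

def summarize_runs(scores):
    """Return (mean, sample std, standard error of the mean) for per-seed scores."""
    n = len(scores)
    if n < 2:
        raise ValueError("need at least two runs to estimate spread")
    mean = statistics.fmean(scores)
    std = statistics.stdev(scores)     # sample std (n - 1 in the denominator)
    sem = std / math.sqrt(n)
    return mean, std, sem

# Hypothetical accuracies for three seeds; NOT results from the paper.
example_runs = [86.1, 86.4, 85.9]
mean, std, sem = summarize_runs(example_runs)
print(f"{mean:.2f} ± {sem:.2f} (SEM, n={len(example_runs)})")
```

With n = 3 the standard error is the sample standard deviation divided by √3, roughly 0.58 times the standard deviation, so a gap of 1.37 points is only informative once the per-method spread is known.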

Simulated Author's Rebuttal

1 response · 0 unresolved

We thank the referee for the constructive feedback on the abstract. We address the concern regarding support for the central claim below and outline planned revisions.

Referee: [Abstract] The central claim that freezing one singular basis (constraining updates to the pretrained column space) does not reduce expressivity enough to negate gains over Full FT is load-bearing but unsupported by any derivation, ablation, or counter-example in the provided text; if optimal downstream directions lie outside this space, the outperformance at matched parameter count would not hold.

Authors: We agree that the abstract's claim would be strengthened by explicit support. The manuscript reports consistent empirical outperformance of FuRA over Full FT at identical trainable parameter budgets (e.g., +1.37 on LLaMA-3-8B commonsense reasoning), which provides evidence that the pretrained column-space constraint does not negate gains in practice. However, we acknowledge the absence of a dedicated derivation or ablation directly addressing expressivity within versus outside the pretrained space. In the revised manuscript we will (i) add a short paragraph deriving that the block tensor-train factorization with fixed full-rank SVD basis L preserves full column-rank updates within the pretrained space, and (ii) include an ablation replacing the pretrained L with a random orthogonal basis of equal dimension, demonstrating degraded performance and thereby illustrating that the pretrained basis is beneficial rather than overly restrictive.

revision: yes
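The random-basis ablation proposed in the rebuttal needs a matrix with orthonormal columns of the same shape as the pretrained basis. A common way to draw one, assumed here, is the QR decomposition of a Gaussian matrix. The rebuttal does not say how the pretrained weight would be represented once L is swapped out, so the sketch only builds the basis and shows that it no longer reproduces the block exactly; `random_orthonormal_like` is a hypothetical helper.

```python
import numpy as np

def random_orthonormal_like(U, seed=0):
    """Draw a random matrix with orthonormal columns and the same shape as U."""
    rng = np.random.default_rng(seed)
    Q, Rq = np.linalg.qr(rng.standard_normal(U.shape))
    # Fix column signs so the draw does not depend on the QR sign convention.
    signs = np.sign(np.diag(Rq))
    signs[signs == 0] = 1.0
    return Q * signs

rng = np.random.default_rng(1)
W_block = rng.standard_normal((64, 8))          # stand-in for a pretrained block
U_pre, s, Vt = np.linalg.svd(W_block, full_matrices=False)
U_rand = random_orthonormal_like(U_pre)

# With the pretrained basis the factorization is lossless.
err_pre = float(np.max(np.abs((U_pre * s) @ Vt - W_block)))

# With a random basis the block can only be matched inside span(U_rand).
coeffs = U_rand.T @ W_block                     # least-squares fit in that span
err_rand = float(np.linalg.norm(U_rand @ coeffs - W_block)
                 / np.linalg.norm(W_block))
```

Because a random basis cannot represent the pretrained block, any such ablation has to decide whether to keep W as a frozen additive term or to accept the approximation error, and that choice would affect how the comparison should be read.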

Circularity Check

0 steps flagged

No circularity; empirical outperformance claims are independent of any self-referential derivation

The paper proposes FuRA via block tensor-train factorization W = LSR with L fixed to pretrained SVD bases, then reports empirical gains over Full FT and QLoRA on specific benchmarks. No equations, theorems, or predictions are shown that reduce the claimed superiority to a quantity fitted from the same downstream data or to a self-citation chain. The central premise (freezing one singular basis yields beneficial spectral preconditioning) is presented as an empirical observation rather than a first-principles derivation that is tautological by construction. No self-citations, ansatzes smuggled via prior work, or renamings of known results appear in the provided text. The derivation chain is therefore self-contained against external benchmarks.

Axiom & Free-Parameter Ledger

0 free parameters · 1 axiom · 0 invented entities

The central claim rests on the unstated premise that the pretrained SVD bases remain the optimal column space after fine-tuning; no free parameters, axioms, or invented entities are explicitly listed in the abstract.

axioms (1)

domain assumption: The SVD of each pretrained weight matrix exists, and its left/right singular vectors define a useful fixed basis for updates.
Invoked when the abstract states that freezing one singular basis constrains updates to the pretrained column space.

pith-pipeline@v0.9.0 · 5804 in / 1326 out tokens · 17479 ms · 2026-05-25T05:37:59.130631+00:00 · methodology

Lean theorems connected to this paper

Citations machine-checked in the Pith Canon. Every link opens the source theorem in the public Lean library.

IndisputableMonolith/Cost/FunctionalEquation.lean · washburn_uniqueness_aczel · unclear
Paper passage: "reparameterizing each weight matrix through its full-rank singular value decomposition (SVD) and freezing one singular basis constrains updates to the pretrained column space, yielding a preconditioned optimization scheme"

IndisputableMonolith/Foundation/AbsoluteFloorClosure.lean · absolute_floor_iff_bare_distinguishability · unclear
Paper passage: "spectral preconditioning as the key missing ingredient"

What do these tags mean?

matches: The paper's claim is directly supported by a theorem in the formal canon.
supports: The theorem supports part of the paper's argument, but the paper may add assumptions or extra steps.
extends: The paper goes beyond the formal theorem; the theorem is a base layer rather than the whole result.
uses: The paper appears to rely on the theorem as machinery.
contradicts: The paper's claim conflicts with a theorem or certificate in the canon.
unclear: Pith found a possible connection, but the passage is too broad, indirect, or ambiguous to say the theorem truly supports the claim.

Reference graph

Works this paper leans on

58 extracted references · 58 canonical work pages · 10 internal anchors

[1] P. Albert, F. Z. Zhang, H. S. Rodriguez, C. Pham, E. Abbasnejad, and A. van den Hengel. RandLoRA: Full-rank parameter-efficient fine-tuning of large models. In International Conference on Learning Representations, 2025.
[2] Y. Bisk, R. Zellers, R. Le Bras, J. Gao, and Y. Choi. PIQA: Reasoning about physical commonsense in natural language. In Proceedings of the AAAI Conference on Artificial Intelligence, 2020.
[3] Z. Chen, R. R. Yang, S. Singh, and M. Soljacic. QuanTA: Efficient high-rank fine-tuning of LLMs with quantum-informed tensor adaptation. In Advances in Neural Information Processing Systems, 2024.
[4] C. Clark, K. Lee, M.-W. Chang, T. Kwiatkowski, M. Collins, and K. Toutanova. BoolQ: Exploring the surprising difficulty of natural yes/no questions. In Proceedings of the Conference of the North American Chapter of the Association for Computational Linguistics, 2019.
[5] P. Clark, I. Cowhey, O. Etzioni, T. Khot, A. Sabharwal, C. Schoenick, and O. Tafjord. Think you have solved question answering? Try ARC, the AI2 reasoning challenge. arXiv preprint arXiv:1803.05457, 2018.
[6] K. Cobbe, V. Kosaraju, M. Bavarian, M. Chen, H. Jun, L. Kaiser, M. Plappert, J. Tworek, J. Hilton, R. Nakano, C. Hesse, and J. Schulman. Training verifiers to solve math word problems. arXiv preprint arXiv:2110.14168, 2021.
[7] T. Dettmers, A. Pagnoni, A. Holtzman, and L. Zettlemoyer. QLoRA: Efficient finetuning of quantized LLMs. In Advances in Neural Information Processing Systems, 2023.
[8] Z. Gao, Q. Wang, A. Chen, Z. Liu, B. Wu, L. Chen, and J. Li. Parameter-efficient fine-tuning with discrete Fourier transform. In International Conference on Machine Learning, 2024.
[9] Y. Goyal, T. Khot, D. Summers-Stay, D. Batra, and D. Parikh. Making the V in VQA matter: Elevating the role of image understanding in Visual Question Answering. In IEEE Conference on Computer Vision and Pattern Recognition (CVPR), 2017.
[10] A. Grattafiori, A. Dubey, A. Jauhri, A. Pandey, A. Kadian, et al. The Llama 3 herd of models. arXiv preprint arXiv:2407.21783, 2024.
[11] D. Gurari, Q. Li, A. J. Stangl, A. Guo, C. Lin, K. Grauman, J. Luo, and J. P. Bigham. VizWiz grand challenge: Answering visual questions from blind people. In IEEE Conference on Computer Vision and Pattern Recognition (CVPR), 2018.
[12] N. Halko, P.-G. Martinsson, and J. A. Tropp. Finding structure with randomness: Probabilistic algorithms for constructing approximate matrix decompositions. SIAM Review, 53(2):217–288, 2011.
[13] D. Hendrycks, C. Burns, S. Kadavath, A. Arora, S. Basart, E. Tang, D. Song, and J. Steinhardt. Measuring mathematical problem solving with the MATH dataset. In Advances in Neural Information Processing Systems Datasets and Benchmarks Track, 2021.
[14] E. J. Hu, Y. Shen, P. Wallis, Z. Allen-Zhu, Y. Li, S. Wang, L. Wang, and W. Chen. LoRA: Low-rank adaptation of large language models. In International Conference on Learning Representations, 2022.
[15] I. Hu, R. Cong, A. Zhang, A. Shetty, V. Lingam, W.-L. Chou, A. G. Dimakis, and S. Sanghavi. LoRTA: Low rank tensor adaptation of large language models. arXiv preprint arXiv:2410.04060, 2024.
[16] Z. Hu, L. Wang, Y. Lan, W. Xu, E.-P. Lim, L. Bing, X. Xu, S. Poria, and R. K.-W. Lee. LLM-Adapters: An adapter family for parameter-efficient fine-tuning of large language models. arXiv preprint arXiv:2304.01933, 2023.
[17] D. A. Hudson and C. D. Manning. GQA: A new dataset for real-world visual reasoning and compositional question answering. In IEEE Conference on Computer Vision and Pattern Recognition (CVPR), 2019.
[18] T. Jiang, S. Huang, S. Luo, Z. Zhang, H. Huang, F. Wei, W. Deng, F. Sun, Q. Zhang, D. Wang, and F. Zhuang. MoRA: High-rank updating for parameter-efficient fine-tuning. In International Conference on Learning Representations, 2025.
[19] J. Kirkpatrick, R. Pascanu, N. Rabinowitz, J. Veness, G. Desjardins, A. A. Rusu, K. Milan, J. Quan, T. Ramalho, A. Grabska-Barwinska, et al. Overcoming catastrophic forgetting in neural networks. Proceedings of the National Academy of Sciences, 114(13):3521–3526, 2017.
[20] D. J. Kopiczko, T. Blankevoort, and Y. M. Asano. VeRA: Vector-based random matrix adaptation. In International Conference on Learning Representations, 2024.
[21] W. Kwon, Z. Li, S. Zhuang, Y. Sheng, L. Zheng, C. H. Yu, J. E. Gonzalez, H. Zhang, and I. Stoica. Efficient memory management for large language model serving with PagedAttention. In Proceedings of the ACM Symposium on Operating Systems Principles, 2023.
[22] Y. Li, Y. Du, K. Zhou, J. Wang, W. X. Zhao, and J.-R. Wen. Evaluating object hallucination in large vision-language models. In Conference on Empirical Methods in Natural Language Processing (EMNLP), 2023.
[23] V. Lingam, A. Tejaswi, A. Vavre, A. Shetty, G. K. Gudur, J. Ghosh, A. Dimakis, E. Choi, A. Bojchevski, and S. Sanghavi. SVFT: Parameter-efficient fine-tuning with singular vectors. In Advances in Neural Information Processing Systems, 2024.
[24] H. Liu, C. Li, Q. Wu, and Y. J. Lee. Visual instruction tuning. In Advances in Neural Information Processing Systems, 2023.
[25] S.-Y. Liu, C.-Y. Wang, H. Yin, P. Molchanov, Y.-C. F. Wang, K.-T. Cheng, and M.-H. Chen. DoRA: Weight-decomposed low-rank adaptation. In International Conference on Machine Learning, 2024.
[26] W. Liu, Z. Qiu, Y. Feng, Y. Xiu, Y. Xue, L. Yu, H. Feng, Z. Liu, J. Heo, S. Peng, Y. Wen, M. J. Black, A. Weller, and B. Schölkopf. Parameter-efficient orthogonal finetuning via butterfly factorization. In International Conference on Learning Representations, 2024.
[27] Y. Liu, H. Duan, Y. Zhang, B. Li, S. Zhang, W. Zhao, Y. Yuan, J. Wang, C. He, Z. Liu, K. Chen, and D. Lin. MMBench: Is your multi-modal model an all-around player? arXiv preprint arXiv:2307.06281, 2023.
[28] Z. Liu, T. Pang, O. Balabanov, C. Yang, T. Huang, L. Yin, Y. Yang, and S. Liu. LIFT the veil for the truth: Principal weights emerge after rank reduction for reasoning-focused supervised fine-tuning. In International Conference on Machine Learning, 2025.
[29] I. Loshchilov and F. Hutter. Decoupled weight decay regularization. In International Conference on Learning Representations, 2019.
[30] P. Lu, S. Mishra, T. Xia, L. Qiu, K.-W. Chang, S.-C. Zhu, O. Tafjord, P. Clark, and A. Kalyan. Learn to explain: Multimodal reasoning via thought chains for science question answering. In Advances in Neural Information Processing Systems, 2022.
[31] F. Meng, Z. Wang, and M. Zhang. PiSSA: Principal singular values and singular vectors adaptation of large language models. In Advances in Neural Information Processing Systems, 2024.
[32] T. Mihaylov, P. Clark, T. Khot, and A. Sabharwal. Can a suit of armor conduct electricity? A new dataset for open book question answering. In Proceedings of the Conference on Empirical Methods in Natural Language Processing, 2018.
[33] A. Novikov, D. Podoprikhin, A. Osokin, and D. Vetrov. Tensorizing neural networks. In Advances in Neural Information Processing Systems, 2015.
[34] I. V. Oseledets. Tensor-train decomposition. SIAM Journal on Scientific Computing, 33(5):2295–2317, 2011.
[35] S. Qiu, A. Potapczynski, M. Finzi, M. Goldblum, and A. G. Wilson. Compute better spent: Replacing dense layers with structured matrices. In International Conference on Machine Learning, 2024.
[36] Qwen Team. Qwen3 technical report. arXiv preprint arXiv:2505.09388, 2025.
[37] K. Sakaguchi, R. Le Bras, C. Bhagavatula, and Y. Choi. WinoGrande: An adversarial Winograd schema challenge at scale. Communications of the ACM, 64(9):99–106, 2020.
[38] M. Sap, H. Rashkin, D. Chen, R. Le Bras, and Y. Choi. Social IQa: Commonsense reasoning about social interactions. In Proceedings of the Conference on Empirical Methods in Natural Language Processing, 2019.
[39] J. Schulman and Thinking Machines Lab. LoRA without regret. https://thinkingmachines.ai/blog/lora/, 2025. Blog post.
[40] Z. Shao, P. Wang, Q. Zhu, R. Xu, J. Song, X. Bi, H. Zhang, M. Zhang, Y. K. Li, Y. Wu, and D. Guo. DeepSeekMath: Pushing the limits of mathematical reasoning in open language models. arXiv preprint arXiv:2402.03300, 2024.
[41] A. Singh, V. Natarajan, M. Shah, Y. Jiang, X. Chen, D. Batra, D. Parikh, and M. Rohrbach. Towards VQA models that can read. In IEEE Conference on Computer Vision and Pattern Recognition (CVPR), 2019.
[42] M.-A. Team. AMC 2023 dataset, 2023.
[43] M.-A. Team. American invitational mathematics examination (AIME) 2024, 2024.
[44] M.-A. Team. American invitational mathematics examination (AIME) 2024, 2025.
[45] Q. Team, A. Yang, B. Yang, B. Zhang, B. Hui, B. Zheng, B. Yu, C. Li, D. Liu, F. Huang, et al. Qwen2.5 technical report. arXiv preprint arXiv:2412.15115, 2025.
[46] H. Touvron, L. Martin, K. Stone, P. Albert, A. Almahairi, Y. Babaei, N. Bashlykov, S. Batra, P. Bhargava, S. Bhosale, et al. Llama 2: Open foundation and fine-tuned chat models. arXiv preprint arXiv:2307.09288, 2023.
[47] H. Wang, Y. Li, S. Wang, G. Chen, and Y. Chen. MiLoRA: Harnessing minor singular components for parameter-efficient LLM finetuning. arXiv preprint arXiv:2406.09044, 2024.
[48] X. Yang, J. Leng, G. Guo, J. Zhao, R. Nakada, L. Zhang, H. Yao, and B. Chen. S2FT: Efficient, scalable and generalizable LLM fine-tuning by structured sparsity. In Advances in Neural Information Processing Systems, 2024.
[49] Y. Yang, K. Zhen, E. Banijamali, A. Mouchtaris, and Z. Zhang. LoRETTA: Low-rank economic tensor-train adaptation for ultra-low-parameter fine-tuning of large language models. In Proceedings of the Conference of the North American Chapter of the Association for Computational Linguistics, 2024.
[50] Q. Yin, Y. Wu, Z. Shen, S. Li, Z. Wang, Y. Li, C. T. Leong, J. Kang, and J. Gu. Evaluating parameter efficient methods for RLVR. arXiv preprint arXiv:2512.23165, 2025.
[51] L. Yu, W. Jiang, H. Shi, J. Yu, Z. Liu, Y. Zhang, J. T. Kwok, Z. Li, A. Weller, and W. Liu. MetaMath: Bootstrap your own mathematical questions for large language models. arXiv preprint arXiv:2309.12284, 2023.
[52] Q. Yu, Z. Zhang, R. Zhu, Y. Yuan, X. Zuo, Y. Yue, et al. DAPO: An open-source LLM reinforcement learning system at scale. arXiv preprint arXiv:2503.14476, 2025.
[53] R. Zellers, A. Holtzman, Y. Bisk, A. Farhadi, and Y. Choi. HellaSwag: Can a machine really finish your sentence? In Proceedings of the Annual Meeting of the Association for Computational Linguistics, 2019.
[54] F. Zhang and M. Pilanci. Spectral adapter: Fine-tuning in spectral space. In Advances in Neural Information Processing Systems, 2024.
[55] Q. Zhang, M. Chen, A. Bukharin, N. Karampatziakis, P. He, Y. Cheng, W. Chen, and T. Zhao. AdaLoRA: Adaptive budget allocation for parameter-efficient fine-tuning. In International Conference on Learning Representations, 2023.
[56] H. Zhu, Z. Zhang, H. Huang, D. Su, Z. Liu, J. Zhao, I. Fedorov, H. Pirsiavash, Z. Sha, J. Lee, D. Z. Pan, Z. Wang, Y. Tian, and K. S. Tai. The path not taken: RLVR provably learns off the principals. arXiv preprint arXiv:2511.08567, 2025. NeurIPS 2025 Workshop on Efficient Reasoning (spotlight).
A FuRA Algorithm. Algorithm 1 FuRA: Full-Rank Adaptation Re...
[57] with FuRA substituted for DoRA; best LR selected from {2, 3, 4} × 10^-4. Hyperparameters (FuRA), LLaVA-1.5-7B: Dropout 0.05; Optimizer AdamW; LR 3 × 10^-4; LR scheduler cosine decay; Weight decay 0; Batch size (per-device × grad-accum) 4 × 4 = 16; Warmup ratio 0.03; Epochs 1; Model max length 2048; Mixed precision bf16; Gradient checkpointing on; Where Q, K, V, O, Up, Down, Gate. Table 12:...
[58] diag(S)²-weighted
for each (model, method) cell. Sample std is taken across n = 3 seeds, so the SEM ≈ 0.6 × std. AIME-24/25 are avg@8 at T = 0.6; MATH-500 and AMC23 are greedy@1 at T = 0. The Paper Table (Table 4) reports the per-seed mean from the same set of runs. Table 15: Math RL with GRPO: per-seed mean ± std across 3 seeds (42, 43, 44). The Paper Table (Table 4) reports t...

[1] [1]

Albert, F

P. Albert, F. Z. Zhang, H. S. Rodriguez, C. Pham, E. Abbasnejad, and A. van den Hengel. Rand- LoRA: Full-rank parameter-efficient fine-tuning of large models. InInternational Conference on Learning Representations, 2025

work page 2025

[2] [2]

Y . Bisk, R. Zellers, R. Le Bras, J. Gao, and Y . Choi. PIQA: Reasoning about physical common- sense in natural language. InProceedings of the AAAI Conference on Artificial Intelligence, 2020

work page 2020

[3] [3]

Z. Chen, R. R. Yang, S. Singh, and M. Soljacic. QuanTA: Efficient high-rank fine-tuning of LLMs with quantum-informed tensor adaptation. InAdvances in Neural Information Processing Systems, 2024

work page 2024

[4] [4]

Clark, K

C. Clark, K. Lee, M.-W. Chang, T. Kwiatkowski, M. Collins, and K. Toutanova. BoolQ: Exploring the surprising difficulty of natural yes/no questions. InProceedings of the Conference of the North American Chapter of the Association for Computational Linguistics, 2019

work page 2019

[5] [5]

Think you have Solved Question Answering? Try ARC, the AI2 Reasoning Challenge

P. Clark, I. Cowhey, O. Etzioni, T. Khot, A. Sabharwal, C. Schoenick, and O. Tafjord. Think you have solved question answering? Try ARC, the AI2 reasoning challenge.arXiv preprint arXiv:1803.05457, 2018

work page internal anchor Pith review Pith/arXiv arXiv 2018

[6] K. Cobbe, V. Kosaraju, M. Bavarian, M. Chen, H. Jun, L. Kaiser, M. Plappert, J. Tworek, J. Hilton, R. Nakano, C. Hesse, and J. Schulman. Training verifiers to solve math word problems. arXiv preprint arXiv:2110.14168, 2021.

[7] T. Dettmers, A. Pagnoni, A. Holtzman, and L. Zettlemoyer. QLoRA: Efficient finetuning of quantized LLMs. In Advances in Neural Information Processing Systems, 2023.

[8] Z. Gao, Q. Wang, A. Chen, Z. Liu, B. Wu, L. Chen, and J. Li. Parameter-efficient fine-tuning with discrete Fourier transform. In International Conference on Machine Learning, 2024.

[9] Y. Goyal, T. Khot, D. Summers-Stay, D. Batra, and D. Parikh. Making the V in VQA matter: Elevating the role of image understanding in Visual Question Answering. In IEEE Conference on Computer Vision and Pattern Recognition (CVPR), 2017.

[10] A. Grattafiori, A. Dubey, A. Jauhri, A. Pandey, A. Kadian, et al. The Llama 3 herd of models. arXiv preprint arXiv:2407.21783, 2024.

[11] D. Gurari, Q. Li, A. J. Stangl, A. Guo, C. Lin, K. Grauman, J. Luo, and J. P. Bigham. VizWiz grand challenge: Answering visual questions from blind people. In IEEE Conference on Computer Vision and Pattern Recognition (CVPR), 2018.

[12] N. Halko, P.-G. Martinsson, and J. A. Tropp. Finding structure with randomness: Probabilistic algorithms for constructing approximate matrix decompositions. SIAM Review, 53(2):217–288, 2011.

[13] D. Hendrycks, C. Burns, S. Kadavath, A. Arora, S. Basart, E. Tang, D. Song, and J. Steinhardt. Measuring mathematical problem solving with the MATH dataset. In Advances in Neural Information Processing Systems Datasets and Benchmarks Track, 2021.

[14] E. J. Hu, Y. Shen, P. Wallis, Z. Allen-Zhu, Y. Li, S. Wang, L. Wang, and W. Chen. LoRA: Low-rank adaptation of large language models. In International Conference on Learning Representations, 2022.

[15] I. Hu, R. Cong, A. Zhang, A. Shetty, V. Lingam, W.-L. Chou, A. G. Dimakis, and S. Sanghavi. LoRTA: Low rank tensor adaptation of large language models. arXiv preprint arXiv:2410.04060, 2024.

[16] Z. Hu, L. Wang, Y. Lan, W. Xu, E.-P. Lim, L. Bing, X. Xu, S. Poria, and R. K.-W. Lee. LLM-Adapters: An adapter family for parameter-efficient fine-tuning of large language models. arXiv preprint arXiv:2304.01933, 2023.

[17] D. A. Hudson and C. D. Manning. GQA: A new dataset for real-world visual reasoning and compositional question answering. In IEEE Conference on Computer Vision and Pattern Recognition (CVPR), 2019.

[18] T. Jiang, S. Huang, S. Luo, Z. Zhang, H. Huang, F. Wei, W. Deng, F. Sun, Q. Zhang, D. Wang, and F. Zhuang. MoRA: High-rank updating for parameter-efficient fine-tuning. In International Conference on Learning Representations, 2025.

[19] J. Kirkpatrick, R. Pascanu, N. Rabinowitz, J. Veness, G. Desjardins, A. A. Rusu, K. Milan, J. Quan, T. Ramalho, A. Grabska-Barwinska, et al. Overcoming catastrophic forgetting in neural networks. Proceedings of the National Academy of Sciences, 114(13):3521–3526, 2017.

[20] D. J. Kopiczko, T. Blankevoort, and Y. M. Asano. VeRA: Vector-based random matrix adaptation. In International Conference on Learning Representations, 2024.

[21] W. Kwon, Z. Li, S. Zhuang, Y. Sheng, L. Zheng, C. H. Yu, J. E. Gonzalez, H. Zhang, and I. Stoica. Efficient memory management for large language model serving with PagedAttention. In Proceedings of the ACM Symposium on Operating Systems Principles, 2023.

[22] Y. Li, Y. Du, K. Zhou, J. Wang, W. X. Zhao, and J.-R. Wen. Evaluating object hallucination in large vision-language models. In Conference on Empirical Methods in Natural Language Processing (EMNLP), 2023.

[23] V. Lingam, A. Tejaswi, A. Vavre, A. Shetty, G. K. Gudur, J. Ghosh, A. Dimakis, E. Choi, A. Bojchevski, and S. Sanghavi. SVFT: Parameter-efficient fine-tuning with singular vectors. In Advances in Neural Information Processing Systems, 2024.

[24] H. Liu, C. Li, Q. Wu, and Y. J. Lee. Visual instruction tuning. In Advances in Neural Information Processing Systems, 2023.

[25] S.-Y. Liu, C.-Y. Wang, H. Yin, P. Molchanov, Y.-C. F. Wang, K.-T. Cheng, and M.-H. Chen. DoRA: Weight-decomposed low-rank adaptation. In International Conference on Machine Learning, 2024.

[26] W. Liu, Z. Qiu, Y. Feng, Y. Xiu, Y. Xue, L. Yu, H. Feng, Z. Liu, J. Heo, S. Peng, Y. Wen, M. J. Black, A. Weller, and B. Schölkopf. Parameter-efficient orthogonal finetuning via butterfly factorization. In International Conference on Learning Representations, 2024.

[27] Y. Liu, H. Duan, Y. Zhang, B. Li, S. Zhang, W. Zhao, Y. Yuan, J. Wang, C. He, Z. Liu, K. Chen, and D. Lin. MMBench: Is your multi-modal model an all-around player? arXiv preprint arXiv:2307.06281, 2023.

[28] Z. Liu, T. Pang, O. Balabanov, C. Yang, T. Huang, L. Yin, Y. Yang, and S. Liu. LIFT the veil for the truth: Principal weights emerge after rank reduction for reasoning-focused supervised fine-tuning. In International Conference on Machine Learning, 2025.

[29] I. Loshchilov and F. Hutter. Decoupled weight decay regularization. In International Conference on Learning Representations, 2019.

[30] P. Lu, S. Mishra, T. Xia, L. Qiu, K.-W. Chang, S.-C. Zhu, O. Tafjord, P. Clark, and A. Kalyan. Learn to explain: Multimodal reasoning via thought chains for science question answering. In Advances in Neural Information Processing Systems, 2022.

[31] F. Meng, Z. Wang, and M. Zhang. PiSSA: Principal singular values and singular vectors adaptation of large language models. In Advances in Neural Information Processing Systems, 2024.

[32] T. Mihaylov, P. Clark, T. Khot, and A. Sabharwal. Can a suit of armor conduct electricity? A new dataset for open book question answering. In Proceedings of the Conference on Empirical Methods in Natural Language Processing, 2018.

[33] A. Novikov, D. Podoprikhin, A. Osokin, and D. Vetrov. Tensorizing neural networks. In Advances in Neural Information Processing Systems, 2015.

[34] I. V. Oseledets. Tensor-train decomposition. SIAM Journal on Scientific Computing, 33(5):2295–2317, 2011.

[35] S. Qiu, A. Potapczynski, M. Finzi, M. Goldblum, and A. G. Wilson. Compute better spent: Replacing dense layers with structured matrices. In International Conference on Machine Learning, 2024.

[36] Qwen Team. Qwen3 technical report. arXiv preprint arXiv:2505.09388, 2025.

[37] K. Sakaguchi, R. Le Bras, C. Bhagavatula, and Y. Choi. WinoGrande: An adversarial Winograd schema challenge at scale. Communications of the ACM, 64(9):99–106, 2020.

[38] M. Sap, H. Rashkin, D. Chen, R. Le Bras, and Y. Choi. Social IQa: Commonsense reasoning about social interactions. In Proceedings of the Conference on Empirical Methods in Natural Language Processing, 2019.

[39] J. Schulman and Thinking Machines Lab. LoRA without regret. https://thinkingmachines.ai/blog/lora/, 2025. Blog post.

[40] Z. Shao, P. Wang, Q. Zhu, R. Xu, J. Song, X. Bi, H. Zhang, M. Zhang, Y. K. Li, Y. Wu, and D. Guo. DeepSeekMath: Pushing the limits of mathematical reasoning in open language models. arXiv preprint arXiv:2402.03300, 2024.

[41] A. Singh, V. Natarajan, M. Shah, Y. Jiang, X. Chen, D. Batra, D. Parikh, and M. Rohrbach. Towards VQA models that can read. In IEEE Conference on Computer Vision and Pattern Recognition (CVPR), 2019.

[42] M.-A. Team. AMC 2023 dataset, 2023.

[43] M.-A. Team. American Invitational Mathematics Examination (AIME) 2024, 2024.

[44] M.-A. Team. American Invitational Mathematics Examination (AIME) 2024, 2025.

[45] Q. Team, A. Yang, B. Yang, B. Zhang, B. Hui, B. Zheng, B. Yu, C. Li, D. Liu, F. Huang, et al. Qwen2.5 technical report. arXiv preprint arXiv:2412.15115, 2025.

[46] H. Touvron, L. Martin, K. Stone, P. Albert, A. Almahairi, Y. Babaei, N. Bashlykov, S. Batra, P. Bhargava, S. Bhosale, et al. Llama 2: Open foundation and fine-tuned chat models. arXiv preprint arXiv:2307.09288, 2023.

[47] H. Wang, Y. Li, S. Wang, G. Chen, and Y. Chen. MiLoRA: Harnessing minor singular components for parameter-efficient LLM finetuning. arXiv preprint arXiv:2406.09044, 2024.

[48] X. Yang, J. Leng, G. Guo, J. Zhao, R. Nakada, L. Zhang, H. Yao, and B. Chen. S2FT: Efficient, scalable and generalizable LLM fine-tuning by structured sparsity. In Advances in Neural Information Processing Systems, 2024.

[49] Y. Yang, K. Zhen, E. Banijamali, A. Mouchtaris, and Z. Zhang. LoRETTA: Low-rank economic tensor-train adaptation for ultra-low-parameter fine-tuning of large language models. In Proceedings of the Conference of the North American Chapter of the Association for Computational Linguistics, 2024.

[50] Q. Yin, Y. Wu, Z. Shen, S. Li, Z. Wang, Y. Li, C. T. Leong, J. Kang, and J. Gu. Evaluating parameter efficient methods for RLVR. arXiv preprint arXiv:2512.23165, 2025.

[51] L. Yu, W. Jiang, H. Shi, J. Yu, Z. Liu, Y. Zhang, J. T. Kwok, Z. Li, A. Weller, and W. Liu. MetaMath: Bootstrap your own mathematical questions for large language models. arXiv preprint arXiv:2309.12284, 2023.

[52] Q. Yu, Z. Zhang, R. Zhu, Y. Yuan, X. Zuo, Y. Yue, et al. DAPO: An open-source LLM reinforcement learning system at scale. arXiv preprint arXiv:2503.14476, 2025.

[53] R. Zellers, A. Holtzman, Y. Bisk, A. Farhadi, and Y. Choi. HellaSwag: Can a machine really finish your sentence? In Proceedings of the Annual Meeting of the Association for Computational Linguistics, 2019.

[54] F. Zhang and M. Pilanci. Spectral adapter: Fine-tuning in spectral space. In Advances in Neural Information Processing Systems, 2024.

[55] Q. Zhang, M. Chen, A. Bukharin, N. Karampatziakis, P. He, Y. Cheng, W. Chen, and T. Zhao. AdaLoRA: Adaptive budget allocation for parameter-efficient fine-tuning. In International Conference on Learning Representations, 2023.

[56] H. Zhu, Z. Zhang, H. Huang, D. Su, Z. Liu, J. Zhao, I. Fedorov, H. Pirsiavash, Z. Sha, J. Lee, D. Z. Pan, Z. Wang, Y. Tian, and K. S. Tai. The path not taken: RLVR provably learns off the principals. arXiv preprint arXiv:2511.08567, 2025. NeurIPS 2025 Workshop on Efficient Reasoning (spotlight).

A FURA Algorithm

Algorithm 1 FURA: Full-Rank Adaptation Re...

with FURA substituted for DoRA; best LR selected from {2, 3, 4} × 10^−4.

Hyperparameters (FURA), LLaVA-1.5-7B:
Dropout: 0.05
Optimizer: AdamW
LR: 3 × 10^−4
LR scheduler: Cosine decay
Weight decay: 0
Batch size (per-device × grad-accum): 4 × 4 = 16
Warmup ratio: 0.03
Epochs: 1
Model max length: 2048
Mixed precision: bf16
Gradient checkpointing: on
Where: Q, K, V, O, Up, Down, Gate

Table 12:...
