FuRA: Full-Rank Parameter-Efficient Fine-Tuning with Spectral Preconditioning
Pith reviewed 2026-05-25 05:37 UTC · model grok-4.3
The pith
Reparameterizing weight updates through pretrained singular vectors lets full-rank adaptation beat unconstrained full fine-tuning at the same parameter budget.
A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.
Core claim
FuRA reparameterizes each weight matrix through its full-rank SVD and freezes one singular basis so that updates are constrained to the pretrained column space; this yields a preconditioned optimization scheme that outperforms unconstrained full fine-tuning at identical trainable parameter count. The concrete implementation uses the block tensor-train factorization W = LSR where the large core L is fixed to the pretrained block-wise SVD basis and only the compact core R together with the block-wise singular values S are optimized, simultaneously delivering full-rank spectral preconditioning, full-rank update expressivity, and LoRA-level efficiency.
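A minimal sketch of the reparameterization described above, under simplifying assumptions: it uses an ordinary (unblocked) thin SVD of a random toy matrix rather than the paper's block tensor-train structure, and the names `W0`, `compose`, `trainable`, and `frozen` are illustrative, not taken from the paper or its code release.

```python
import numpy as np

rng = np.random.default_rng(0)
m, n = 8, 6                        # toy layer shape (hypothetical)
W0 = rng.standard_normal((m, n))   # stand-in for a pretrained weight matrix

# Thin SVD: W0 = U @ diag(s) @ Vt, with k = min(m, n) singular values.
U, s, Vt = np.linalg.svd(W0, full_matrices=False)

L = U          # frozen singular basis, shape (m, k)
S = s.copy()   # trainable singular values, shape (k,)
R = Vt.copy()  # trainable core, shape (k, n)

def compose(L, S, R):
    """Rebuild the weight as L @ diag(S) @ R."""
    return (L * S) @ R   # broadcasting scales the columns of L by S

W = compose(L, S, R)
recon_error = float(np.max(np.abs(W - W0)))  # ~0 at initialization

trainable = S.size + R.size   # parameters an optimizer would update
frozen = L.size               # parameters held at their pretrained values
print(f"reconstruction error: {recon_error:.2e}")
print(f"trainable: {trainable}, frozen: {frozen}")
```

In this unblocked toy version R is not compact, so the trainable count is not at LoRA level; the summary attributes that efficiency to the block-wise factorization, which this sketch does not reproduce.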
What carries the argument
The block tensor-train factorization W = LSR with the large core L fixed to the pretrained block-wise SVD basis, which constrains all updates to the pretrained column space.
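The column-space constraint can be checked numerically. The sketch below assumes a generic orthonormal basis `U` with fewer columns than rows (k < m), so that its column space is a proper subspace; this is a stand-in, not the paper's block-wise SVD basis. For a square orthogonal basis the column space is the entire output space, and the same check would pass trivially.

```python
import numpy as np

rng = np.random.default_rng(1)
m, k, n = 10, 4, 7   # k < m, so col(U) is a proper subspace (toy sizes)

# Orthonormal columns from a reduced QR factorization, shape (m, k).
U, _ = np.linalg.qr(rng.standard_normal((m, k)))

dM = rng.standard_normal((k, n))   # arbitrary change to the trainable factors
dW = U @ dM                        # induced change in the weight matrix

P = U @ U.T                        # orthogonal projector onto col(U)
outside_norm = float(np.linalg.norm(dW - P @ dW))

# An unconstrained update generally has a component outside col(U).
dW_free = rng.standard_normal((m, n))
free_outside_norm = float(np.linalg.norm(dW_free - P @ dW_free))

print(f"constrained update outside col(U):   {outside_norm:.2e}")
print(f"unconstrained update outside col(U): {free_outside_norm:.2e}")
```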
If this is right
- FuRA outperforms full fine-tuning by 1.37 points on LLaMA-3-8B commonsense reasoning.
- The 4-bit quantized variant QFuRA surpasses QLoRA.
- The same gains appear in LLM reinforcement learning for mathematical reasoning and in visual instruction tuning for VLMs.
- Full-rank updates can be made parameter-efficient and spectrally preconditioned without sacrificing expressivity.
Where Pith is reading between the lines
- If the spectral constraint proves general, similar preconditioning could be applied to other gradient-based adaptation settings beyond the tasks tested.
- The approach invites direct comparison of update directions in the column space versus the full ambient space on the same downstream objective.
- Layer-wise variation in how strictly the SVD basis is frozen could be tested to see whether some layers benefit from more flexibility.
Load-bearing premise
Freezing the pretrained singular vector bases will not reduce the expressivity of the updates enough to hurt performance on downstream tasks.

What would settle it
An experiment in which FuRA underperforms full fine-tuning on a standard downstream task at matched trainable parameter count would falsify the central claim.
Original abstract
Both full fine-tuning (Full FT) and parameter-efficient fine-tuning methods such as LoRA introduce weight updates without accounting for the spectral structure established during pretraining. As a result, noisy gradients from limited fine-tuning data can perturb robust pretrained features. We identify spectral preconditioning as the missing ingredient: reparameterizing each weight matrix through its full-rank singular value decomposition (SVD) and freezing one singular basis constrains updates to the pretrained column space, yielding a preconditioned optimization scheme that outperforms unconstrained Full FT at the same trainable parameter count. Building on this insight, we propose FuRA (Full-Rank Adaptation), an efficient full-rank adaptation framework based on a block tensor-train factorization W = LSR, where the large core L is fixed to the pretrained block-wise SVD basis, while only the compact core R and the block-wise singular values S are optimized. This design simultaneously provides full-rank spectral preconditioning, preserves full-rank update expressivity, and achieves parameter, memory, and step-time efficiency comparable to LoRA. FuRA consistently outperforms Full FT across multiple settings, including LLM fine-tuning (+1.37 on LLaMA-3-8B commonsense reasoning), LLM reinforcement learning for mathematical reasoning, and visual instruction tuning for VLMs. Furthermore, the 4-bit quantized variant, QFuRA, also surpasses QLoRA. Code is available at https://github.com/olokevin/FuRA-NIPS
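One way to see why the abstract calls the scheme "preconditioned" is the worked step below. It is a sketch under assumptions not stated in the source: an unblocked factorization W = LSR with orthonormal columns in L, the core R initialized with orthonormal rows (R_0 R_0^T = I), S held at its pretrained value for this single step, and plain gradient descent with step size \eta on R only. The symbols G, \ell, and \eta are introduced here for illustration.

```latex
\begin{aligned}
W &= L S R, \qquad W_0 = L S R_0, \qquad L^\top L = I, \quad R_0 R_0^\top = I, \\
G &= \frac{\partial \ell}{\partial W}, \qquad
\frac{\partial \ell}{\partial R} = S L^\top G, \\
R &\leftarrow R_0 - \eta\, S L^\top G, \\
\Delta W &= L S\, \Delta R = -\eta\, L S^2 L^\top G = -\eta\, \bigl(W_0 W_0^\top\bigr)\, G .
\end{aligned}
```

Under these assumptions the raw gradient G is rescaled by W_0 W_0^T, so directions aligned with large pretrained singular values move more than directions aligned with small ones. The paper also trains S and uses a block-wise structure, so the effective preconditioner in FuRA may differ from this simplified form.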
Editorial analysis
A structured set of objections, weighed in public.
Referee Report
Summary. The manuscript proposes FuRA, a full-rank adaptation method that reparameterizes each weight matrix via a block tensor-train factorization W = LSR, with the large core L fixed to the pretrained block-wise SVD basis while optimizing only the compact R and block-wise singular values S. This is claimed to deliver spectral preconditioning by constraining updates to the pretrained column space, yielding better performance than unconstrained full fine-tuning at identical trainable parameter counts, with reported gains of +1.37 on LLaMA-3-8B commonsense reasoning, improvements in LLM RL for mathematical reasoning, visual instruction tuning for VLMs, and QFuRA outperforming QLoRA.
Significance. If the gains are shown to be robust, the result would be significant for establishing that spectral preconditioning via frozen pretrained singular bases can improve fine-tuning without sacrificing expressivity, providing an efficient bridge between PEFT and full FT. The public code release is a positive factor for reproducibility.
major comments (1)
- [Abstract] The central claim, that freezing one singular basis (constraining updates to the pretrained column space) does not reduce expressivity enough to negate the gains over Full FT, is load-bearing but unsupported by any derivation, ablation, or counter-example in the provided text. If the optimal downstream directions lie outside this space, the outperformance at matched parameter count would not hold.
minor comments (1)
- The abstract reports a +1.37 gain without error bars, run counts, or confirmation that data splits and SVD choices were fixed before seeing results; all three are needed to assess whether the comparison to Full FT is fair.
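The reporting the referee asks for is cheap to produce. The sketch below shows mean, sample standard deviation, and standard error of the mean over seeds; the three scores are hypothetical placeholders, not results from the paper.

```python
import math
import statistics

# Hypothetical per-seed accuracies for one (model, method) cell.
scores = [71.2, 70.5, 71.9]

n = len(scores)
mean = statistics.mean(scores)
std = statistics.stdev(scores)   # sample standard deviation (n - 1 denominator)
sem = std / math.sqrt(n)         # standard error of the mean

print(f"{mean:.2f} ± {std:.2f} (SEM {sem:.2f}, n={n})")
```

With n = 3 the standard error is the sample standard deviation divided by the square root of 3, roughly 0.58 times the standard deviation, so a gap between two methods should be read against that scale rather than against the raw difference in means.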
Simulated Author's Rebuttal
We thank the referee for the constructive feedback on the abstract. We address the concern regarding support for the central claim below and outline planned revisions.
Point-by-point responses
- Referee: [Abstract] The central claim, that freezing one singular basis (constraining updates to the pretrained column space) does not reduce expressivity enough to negate the gains over Full FT, is load-bearing but unsupported by any derivation, ablation, or counter-example in the provided text. If the optimal downstream directions lie outside this space, the outperformance at matched parameter count would not hold.
  Authors: We agree that the abstract's claim would be strengthened by explicit support. The manuscript reports consistent empirical outperformance of FuRA over Full FT at identical trainable parameter budgets (e.g., +1.37 on LLaMA-3-8B commonsense reasoning), which provides evidence that the pretrained column-space constraint does not negate gains in practice. However, we acknowledge the absence of a dedicated derivation or ablation directly addressing expressivity within versus outside the pretrained space. In the revised manuscript we will (i) add a short paragraph deriving that the block tensor-train factorization with a fixed full-rank SVD basis L preserves full column-rank updates within the pretrained space, and (ii) include an ablation that replaces the pretrained L with a random orthogonal basis of equal dimension, demonstrating degraded performance and thereby illustrating that the pretrained basis is beneficial rather than overly restrictive.
  Revision: yes
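The two ingredients the rebuttal promises can be prototyped on a toy matrix. The sketch below assumes an unblocked SVD and NumPy; `random_orthogonal` is a hypothetical helper, and nothing here comes from the authors' code. It draws a random orthogonal basis of equal dimension via QR and checks that multiplying an update by either basis leaves its rank unchanged.

```python
import numpy as np

def random_orthogonal(d, rng):
    """Draw a d x d orthogonal matrix from the QR factorization of a Gaussian matrix."""
    Q, R = np.linalg.qr(rng.standard_normal((d, d)))
    # Fix column signs so the result does not depend on the QR sign convention.
    signs = np.sign(np.diag(R))
    signs[signs == 0] = 1.0
    return Q * signs

rng = np.random.default_rng(2)
d, n = 6, 5                                  # toy sizes (hypothetical)
W0 = rng.standard_normal((d, n))             # stand-in for a pretrained weight
U_pre, _, _ = np.linalg.svd(W0, full_matrices=True)   # "pretrained" basis, (d, d)
U_rand = random_orthogonal(d, rng)           # ablation basis of equal dimension

dM = rng.standard_normal((d, n))             # arbitrary update to the trainable factors
rank_dM = int(np.linalg.matrix_rank(dM))
rank_pre = int(np.linalg.matrix_rank(U_pre @ dM))
rank_rand = int(np.linalg.matrix_rank(U_rand @ dM))
orth_err = float(np.max(np.abs(U_rand.T @ U_rand - np.eye(d))))

print(f"rank(dM)={rank_dM}, with pretrained basis={rank_pre}, with random basis={rank_rand}")
print(f"orthogonality error of random basis: {orth_err:.2e}")
```

In this toy setting both bases preserve rank, so rank alone would not distinguish them; any gap between the pretrained and random basis in the proposed ablation would have to show up in measured downstream performance.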
Circularity Check
No circularity: the empirical outperformance claims are independent of any self-referential derivation.
Full rationale
The paper proposes FuRA via block tensor-train factorization W = LSR with L fixed to pretrained SVD bases, then reports empirical gains over Full FT and QLoRA on specific benchmarks. No equations, theorems, or predictions are shown that reduce the claimed superiority to a quantity fitted from the same downstream data or to a self-citation chain. The central premise (freezing one singular basis yields beneficial spectral preconditioning) is presented as an empirical observation rather than a first-principles derivation that is tautological by construction. No self-citations, ansatzes smuggled via prior work, or renamings of known results appear in the provided text. The derivation chain is therefore self-contained against external benchmarks.
Axiom & Free-Parameter Ledger
axioms (1)
- domain assumption: The SVD of each pretrained weight matrix exists, and its left/right singular vectors define a useful fixed basis for updates.
Lean theorems connected to this paper
- IndisputableMonolith/Cost/FunctionalEquation.lean, washburn_uniqueness_aczel (tag: unclear)
  Relation between the paper passage and the cited Recognition theorem: unclear.
  Paper passage: "reparameterizing each weight matrix through its full-rank singular value decomposition (SVD) and freezing one singular basis constrains updates to the pretrained column space, yielding a preconditioned optimization scheme"
- IndisputableMonolith/Foundation/AbsoluteFloorClosure.lean, absolute_floor_iff_bare_distinguishability (tag: unclear)
  Relation between the paper passage and the cited Recognition theorem: unclear.
  Paper passage: "spectral preconditioning as the key missing ingredient"
What do these tags mean?
- matches: The paper's claim is directly supported by a theorem in the formal canon.
- supports: The theorem supports part of the paper's argument, but the paper may add assumptions or extra steps.
- extends: The paper goes beyond the formal theorem; the theorem is a base layer rather than the whole result.
- uses: The paper appears to rely on the theorem as machinery.
- contradicts: The paper's claim conflicts with a theorem or certificate in the canon.
- unclear: Pith found a possible connection, but the passage is too broad, indirect, or ambiguous to say the theorem truly supports the claim.