
arxiv: 2605.22869 · v1 · pith:RHXTDYLMnew · submitted 2026-05-19 · 💻 cs.LG

FuRA: Full-Rank Parameter-Efficient Fine-Tuning with Spectral Preconditioning

Pith reviewed 2026-05-25 05:37 UTC · model grok-4.3

classification 💻 cs.LG
keywords full-rank adaptation · parameter-efficient fine-tuning · spectral preconditioning · singular value decomposition · LoRA · LLM fine-tuning · quantized fine-tuning

The pith

Reparameterizing weight updates through pretrained singular vectors lets full-rank adaptation beat unconstrained full fine-tuning at the same parameter budget.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper claims that both full fine-tuning and methods like LoRA overlook the spectral structure set during pretraining, so noisy gradients from small fine-tuning datasets can disturb already robust features. By reparameterizing each weight matrix with its full-rank SVD and freezing one singular basis, updates stay inside the pretrained column space and the optimization becomes preconditioned. FuRA realizes this idea through a block tensor-train factorization W = LSR in which the large core L is locked to the pretrained block-wise SVD basis while only the compact core R and the singular values S are trained. The result is full-rank expressivity, spectral preconditioning, and memory/step-time cost comparable to LoRA, with measured gains such as +1.37 on LLaMA-3-8B commonsense reasoning and better performance than QLoRA for the 4-bit quantized variant.

Core claim

FuRA reparameterizes each weight matrix through its full-rank SVD and freezes one singular basis so that updates are constrained to the pretrained column space; this yields a preconditioned optimization scheme that outperforms unconstrained full fine-tuning at identical trainable parameter count. The concrete implementation uses the block tensor-train factorization W = LSR where the large core L is fixed to the pretrained block-wise SVD basis and only the compact core R together with the block-wise singular values S are optimized, simultaneously delivering full-rank spectral preconditioning, full-rank update expressivity, and LoRA-level efficiency.
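The column-space constraint described above can be checked numerically on a toy matrix. The following is a minimal NumPy sketch, not the authors' implementation: the shapes, the random stand-in for a pretrained weight, and the perturbation sizes are all assumptions made for illustration. It freezes the left singular basis `U`, perturbs the singular values and the right factor, and confirms that the resulting update has no component outside the column space of the original weight.

```python
import numpy as np

rng = np.random.default_rng(0)
m, n = 8, 6                        # toy layer shape (hypothetical)
W = rng.standard_normal((m, n))    # stand-in for a pretrained weight

# Thin SVD: W = U @ diag(s) @ Vt, with U (m, n), s (n,), Vt (n, n).
U, s, Vt = np.linalg.svd(W, full_matrices=False)
reconstruction_error = np.linalg.norm(U @ np.diag(s) @ Vt - W)

# Freeze U; treat s and Vt as the trainable parameters and perturb them
# to mimic one optimization step (perturbation scale is arbitrary).
s_new = s + 0.1 * rng.standard_normal(s.shape)
Vt_new = Vt + 0.1 * rng.standard_normal(Vt.shape)
delta_W = U @ np.diag(s_new) @ Vt_new - W

# Project the update onto span(U); whatever is left lies outside the
# pretrained column space and should be numerically zero.
residual_norm = np.linalg.norm(delta_W - U @ (U.T @ delta_W))
print(reconstruction_error, residual_norm)
```

Because `m > n` in this toy setup, the column space is a proper subspace of the output space, so a vanishing residual is a non-trivial check rather than a tautology.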

What carries the argument

The block tensor-train factorization W = LSR with the large core L fixed to the pretrained block-wise SVD basis, which constrains all updates to the pretrained column space.
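To make the roles of the three cores concrete, here is a hedged sketch of one possible block-wise layout. The summary does not specify how the blocks are formed, so splitting `W` into `K` column blocks is an assumption, as are all sizes and variable names. Under that assumption each block gets its own thin SVD, the left factors form the large frozen core, and the singular values plus the small right factors form the trainable part.

```python
import numpy as np

rng = np.random.default_rng(0)
m, n, K = 16, 8, 4                 # toy sizes; K column blocks (assumed layout)
b = n // K                         # width of each block
W = rng.standard_normal((m, n))    # stand-in for a pretrained weight

L_cores, S_cores, R_cores = [], [], []
for k in range(K):
    W_k = W[:, k * b:(k + 1) * b]                      # block k, shape (m, b)
    U, s, Vt = np.linalg.svd(W_k, full_matrices=False)
    L_cores.append(U)                                  # frozen, shape (m, b)
    S_cores.append(s)                                  # trainable, shape (b,)
    R_cores.append(Vt)                                 # trainable, shape (b, b)

# The factorization is lossless: concatenating L_k diag(S_k) R_k rebuilds W.
W_rebuilt = np.concatenate(
    [L_k @ np.diag(s_k) @ R_k
     for L_k, s_k, R_k in zip(L_cores, S_cores, R_cores)],
    axis=1,
)
max_abs_error = np.abs(W_rebuilt - W).max()

frozen = sum(L_k.size for L_k in L_cores)                  # m * n entries
trainable = sum(s_k.size + R_k.size
                for s_k, R_k in zip(S_cores, R_cores))     # K * (b + b * b)
print(max_abs_error, frozen, trainable)
```

In this toy configuration the frozen core holds 128 entries and the trainable cores hold 24, which illustrates why the trainable budget can stay small even though every singular direction of every block is retained.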

If this is right

  • FuRA outperforms full fine-tuning by 1.37 points on LLaMA-3-8B commonsense reasoning.
  • The 4-bit quantized variant QFuRA surpasses QLoRA.
  • The same gains appear in LLM reinforcement learning for mathematical reasoning and in visual instruction tuning for VLMs.
  • Full-rank updates can be made parameter-efficient and spectrally preconditioned without sacrificing expressivity.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

  • If the spectral constraint proves general, similar preconditioning could be applied to other gradient-based adaptation settings beyond the tasks tested.
  • The approach invites direct comparison of update directions in the column space versus the full ambient space on the same downstream objective.
  • Layer-wise variation in how strictly the SVD basis is frozen could be tested to see whether some layers benefit from more flexibility.

Load-bearing premise

Freezing the pretrained singular vector bases will not reduce the expressivity of the updates enough to hurt performance on downstream tasks.
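The premise has two sides that a small experiment can separate: the update can still reach full rank, yet each block of the update is confined to the span of its frozen basis. The sketch below illustrates both under the same assumed column-block layout (an illustration choice, not something the summary confirms); the arbitrary matrices `M_k` stand in for whatever change training makes to the trainable cores.

```python
import numpy as np

rng = np.random.default_rng(1)
m, n, K = 16, 8, 4                 # toy sizes; column-block layout is assumed
b = n // K
W = rng.standard_normal((m, n))    # stand-in for a pretrained weight

delta_blocks = []
confined = True
for k in range(K):
    W_k = W[:, k * b:(k + 1) * b]
    U, _, _ = np.linalg.svd(W_k, full_matrices=False)   # frozen basis, (m, b)
    M_k = rng.standard_normal((b, b))    # arbitrary change in the trainable cores
    dW_k = U @ M_k                       # update to block k
    # The block update must equal its own projection onto span(U).
    confined = confined and bool(np.allclose(dW_k, U @ (U.T @ dW_k)))
    delta_blocks.append(dW_k)

delta_W = np.concatenate(delta_blocks, axis=1)
update_rank = np.linalg.matrix_rank(delta_W)
print(confined, update_rank, min(m, n))
```

A full-rank update is therefore compatible with the constraint, but directions orthogonal to a block's pretrained column space remain unreachable for that block, which is exactly where the premise could fail.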

What would settle it

An experiment in which FuRA underperforms full fine-tuning on a standard downstream task at matched trainable parameter count would falsify the central claim.

Figures

Figures reproduced from arXiv: 2605.22869 by Liyan Tan, Niall Moran, Ruijie Zhang, Tong Qin, Yequan Zhao, Zheng Zhang.

Figure 1: FuRA delivers higher accuracy than Full FT with LoRA-level runtime and the lowest GPU memory.
Figure 2: FuRA replaces each pretrained linear layer W with a lossless block tensor-train factorization W = LSR. Tensors are flattened to matrices for clearer illustration. Each slice is initialized by the full-rank SVD of weight block W_k and performs a spectrally preconditioned update.
Figure 3: (a) The gradient lies outside the singular basis of the pretrained weight, but the weight remains stable. (b) Singular values shift selectively, while singular vectors rotate broadly. Panels (a) and (b) show the layer 15 q_proj result; the full sweeps are in Figures 6 and 7 (Appendix). (c) SVD FT outperforms Full FT on both the target domain and the source domain.
Figure 4: (a) Effective rank of ∆W; full sweep is in …
Figure 5: LLaMA-3-8B Math-10K SFT, GSM8K accuracy.
Figure 6: Full per-layer / per-module sweep for Figure 3.
Figure 7: Full per-layer / per-module sweep for Figure 3.
Figure 8: Full per-layer / per-module sweep for Figure 4.
Original abstract

Both full fine-tuning (Full FT) and parameter-efficient fine-tuning methods such as LoRA introduce weight updates without accounting for the spectral structure established during pretraining. As a result, noisy gradients from limited fine-tuning data can perturb robust pretrained features. We identify spectral preconditioning as the missing ingredient: reparameterizing each weight matrix through its full-rank singular value decomposition (SVD) and freezing one singular basis constrains updates to the pretrained column space, yielding a preconditioned optimization scheme that outperforms unconstrained Full FT at the same trainable parameter count. Building on this insight, we propose FuRA (Full-Rank Adaptation), an efficient full-rank adaptation framework based on a block tensor-train factorization W = LSR, where the large core L is fixed to the pretrained block-wise SVD basis, while only the compact core R and the block-wise singular values S are optimized. This design simultaneously provides full-rank spectral preconditioning, preserves full-rank update expressivity, and achieves parameter, memory, and step-time efficiency comparable to LoRA. FuRA consistently outperforms Full FT across multiple settings, including LLM fine-tuning (+1.37 on LLaMA-3-8B commonsense reasoning), LLM reinforcement learning for mathematical reasoning, and visual instruction tuning for VLMs. Furthermore, the 4-bit quantized variant, QFuRA, also surpasses QLoRA. Code is available at https://github.com/olokevin/FuRA-NIPS

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Referee Report

1 major / 1 minor

Summary. The manuscript proposes FuRA, a full-rank adaptation method that reparameterizes each weight matrix via a block tensor-train factorization W = LSR, with the large core L fixed to the pretrained block-wise SVD basis while optimizing only the compact R and block-wise singular values S. This is claimed to deliver spectral preconditioning by constraining updates to the pretrained column space, yielding better performance than unconstrained full fine-tuning at identical trainable parameter counts, with reported gains of +1.37 on LLaMA-3-8B commonsense reasoning, improvements in LLM RL for mathematical reasoning, visual instruction tuning for VLMs, and QFuRA outperforming QLoRA.

Significance. If the gains are shown to be robust, the result would be significant for establishing that spectral preconditioning via frozen pretrained singular bases can improve fine-tuning without sacrificing expressivity, providing an efficient bridge between PEFT and full FT. The public code release is a positive factor for reproducibility.

major comments (1)
  1. [Abstract] Abstract: the central claim that freezing one singular basis (constraining updates to the pretrained column space) does not reduce expressivity enough to negate gains over Full FT is load-bearing but unsupported by any derivation, ablation, or counter-example in the provided text; if optimal downstream directions lie outside this space, the outperformance at matched parameter count would not hold.
minor comments (1)
  1. The abstract reports a +1.37 gain without error bars, run counts, or confirmation that data splits and SVD choices were fixed before seeing results, which is required to assess whether the comparison to Full FT is fair.
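As an illustration of the kind of reporting this comment asks for, the sketch below computes a mean, sample standard deviation, and standard error of the mean (SEM) over a handful of seeds using only the Python standard library. The three scores are made-up placeholders, not numbers from the paper. With three seeds the SEM is the sample standard deviation divided by the square root of 3, roughly 0.58 times the standard deviation.

```python
import math
import statistics

# Placeholder per-seed scores: invented for illustration, NOT from the paper.
scores = [71.2, 72.0, 70.5]
n = len(scores)

mean = statistics.mean(scores)
std = statistics.stdev(scores)     # sample std (n - 1 in the denominator)
sem = std / math.sqrt(n)           # standard error of the mean
ratio = sem / std                  # equals 1 / sqrt(n)

print(f"mean={mean:.2f}  std={std:.2f}  sem={sem:.2f}  sem/std={ratio:.3f}")
```

Reporting the seed count alongside the mean and spread lets a reader judge whether a gap such as +1.37 is large relative to run-to-run variation.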

Simulated Author's Rebuttal

1 response · 0 unresolved

We thank the referee for the constructive feedback on the abstract. We address the concern regarding support for the central claim below and outline planned revisions.

  1. Referee: [Abstract] Abstract: the central claim that freezing one singular basis (constraining updates to the pretrained column space) does not reduce expressivity enough to negate gains over Full FT is load-bearing but unsupported by any derivation, ablation, or counter-example in the provided text; if optimal downstream directions lie outside this space, the outperformance at matched parameter count would not hold.

    Authors: We agree that the abstract's claim would be strengthened by explicit support. The manuscript reports consistent empirical outperformance of FuRA over Full FT at identical trainable parameter budgets (e.g., +1.37 on LLaMA-3-8B commonsense reasoning), which provides evidence that the pretrained column-space constraint does not negate gains in practice. However, we acknowledge the absence of a dedicated derivation or ablation directly addressing expressivity within versus outside the pretrained space. In the revised manuscript we will (i) add a short paragraph deriving that the block tensor-train factorization with fixed full-rank SVD basis L preserves full column-rank updates within the pretrained space, and (ii) include an ablation replacing the pretrained L with a random orthogonal basis of equal dimension, demonstrating degraded performance and thereby illustrating that the pretrained basis is beneficial rather than overly restrictive. revision: yes
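The ingredients of the proposed random-basis ablation can be sketched as follows. This is only a setup illustration under stated assumptions (toy block shape, NumPy, a QR-based random basis, and a shared random trainable core `M`); it does not run any training and says nothing about downstream accuracy. It builds a random basis with orthonormal columns of the same shape as the pretrained one and measures how much of an update expressed in each basis lies inside the pretrained column space.

```python
import numpy as np

rng = np.random.default_rng(0)
m, b = 16, 4                                # toy block shape (hypothetical)
W_k = rng.standard_normal((m, b))           # stand-in for one pretrained block
U, _, _ = np.linalg.svd(W_k, full_matrices=False)   # pretrained basis, (m, b)

# Random basis with orthonormal columns from a reduced QR factorization.
Q, _ = np.linalg.qr(rng.standard_normal((m, b)))    # shape (m, b)
orthonormal = bool(np.allclose(Q.T @ Q, np.eye(b)))


def fraction_in_pretrained_span(basis, M):
    """Share of the squared Frobenius norm of basis @ M lying in span(U)."""
    update = basis @ M
    projected = U @ (U.T @ update)
    return np.linalg.norm(projected) ** 2 / np.linalg.norm(update) ** 2


M = rng.standard_normal((b, b))             # same trainable core for both bases
frac_pretrained = fraction_in_pretrained_span(U, M)
frac_random = fraction_in_pretrained_span(Q, M)
print(orthonormal, frac_pretrained, frac_random)
```

With the pretrained basis the whole update stays in the pretrained column space, whereas with the random basis much of it falls outside, so the two arms of the ablation differ only in which subspace the updates are allowed to explore.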

Circularity Check

0 steps flagged

No circularity; empirical outperformance claims are independent of any self-referential derivation


The paper proposes FuRA via block tensor-train factorization W = LSR with L fixed to pretrained SVD bases, then reports empirical gains over Full FT and QLoRA on specific benchmarks. No equations, theorems, or predictions are shown that reduce the claimed superiority to a quantity fitted from the same downstream data or to a self-citation chain. The central premise (freezing one singular basis yields beneficial spectral preconditioning) is presented as an empirical observation rather than a first-principles derivation that is tautological by construction. No self-citations, ansatzes smuggled via prior work, or renamings of known results appear in the provided text. The derivation chain is therefore self-contained against external benchmarks.

Axiom & Free-Parameter Ledger

0 free parameters · 1 axiom · 0 invented entities

The central claim rests on the unstated premise that the pretrained SVD bases remain the optimal column space after fine-tuning; no free parameters, axioms, or invented entities are explicitly listed in the abstract.

axioms (1)
  • Domain assumption: the SVD of each pretrained weight matrix exists, and its left/right singular vectors define a useful fixed basis for updates.
    Invoked when the abstract states that freezing one singular basis constrains updates to the pretrained column space.

pith-pipeline@v0.9.0 · 5804 in / 1326 out tokens · 17479 ms · 2026-05-25T05:37:59.130631+00:00 · methodology


Lean theorems connected to this paper

Citations machine-checked in the Pith Canon. Every link opens the source theorem in the public Lean library.

What the tags mean:

  • matches: The paper's claim is directly supported by a theorem in the formal canon.
  • supports: The theorem supports part of the paper's argument, but the paper may add assumptions or extra steps.
  • extends: The paper goes beyond the formal theorem; the theorem is a base layer rather than the whole result.
  • uses: The paper appears to rely on the theorem as machinery.
  • contradicts: The paper's claim conflicts with a theorem or certificate in the canon.
  • unclear: Pith found a possible connection, but the passage is too broad, indirect, or ambiguous to say the theorem truly supports the claim.

Reference graph

Works this paper leans on

58 extracted references · 58 canonical work pages · 10 internal anchors

  1. [1]

    P. Albert, F. Z. Zhang, H. S. Rodriguez, C. Pham, E. Abbasnejad, and A. van den Hengel. RandLoRA: Full-rank parameter-efficient fine-tuning of large models. In International Conference on Learning Representations, 2025

  2. [2]

    Y. Bisk, R. Zellers, R. Le Bras, J. Gao, and Y. Choi. PIQA: Reasoning about physical commonsense in natural language. In Proceedings of the AAAI Conference on Artificial Intelligence, 2020

  3. [3]

    Z. Chen, R. R. Yang, S. Singh, and M. Soljacic. QuanTA: Efficient high-rank fine-tuning of LLMs with quantum-informed tensor adaptation. In Advances in Neural Information Processing Systems, 2024

  4. [4]

    C. Clark, K. Lee, M.-W. Chang, T. Kwiatkowski, M. Collins, and K. Toutanova. BoolQ: Exploring the surprising difficulty of natural yes/no questions. In Proceedings of the Conference of the North American Chapter of the Association for Computational Linguistics, 2019

  5. [5]

    Think you have Solved Question Answering? Try ARC, the AI2 Reasoning Challenge

    P. Clark, I. Cowhey, O. Etzioni, T. Khot, A. Sabharwal, C. Schoenick, and O. Tafjord. Think you have solved question answering? Try ARC, the AI2 reasoning challenge. arXiv preprint arXiv:1803.05457, 2018

  6. [6]

    Training Verifiers to Solve Math Word Problems

    K. Cobbe, V. Kosaraju, M. Bavarian, M. Chen, H. Jun, L. Kaiser, M. Plappert, J. Tworek, J. Hilton, R. Nakano, C. Hesse, and J. Schulman. Training verifiers to solve math word problems. arXiv preprint arXiv:2110.14168, 2021

  7. [7]

    T. Dettmers, A. Pagnoni, A. Holtzman, and L. Zettlemoyer. QLoRA: Efficient finetuning of quantized LLMs. In Advances in Neural Information Processing Systems, 2023

  8. [8]

    Z. Gao, Q. Wang, A. Chen, Z. Liu, B. Wu, L. Chen, and J. Li. Parameter-efficient fine-tuning with discrete Fourier transform. In International Conference on Machine Learning, 2024

  9. [9]

    Y. Goyal, T. Khot, D. Summers-Stay, D. Batra, and D. Parikh. Making the V in VQA matter: Elevating the role of image understanding in Visual Question Answering. In IEEE Conference on Computer Vision and Pattern Recognition (CVPR), 2017

  10. [10]

    The Llama 3 Herd of Models

    A. Grattafiori, A. Dubey, A. Jauhri, A. Pandey, A. Kadian, et al. The Llama 3 herd of models. arXiv preprint arXiv:2407.21783, 2024

  11. [11]

    D. Gurari, Q. Li, A. J. Stangl, A. Guo, C. Lin, K. Grauman, J. Luo, and J. P. Bigham. VizWiz grand challenge: Answering visual questions from blind people. In IEEE Conference on Computer Vision and Pattern Recognition (CVPR), 2018

  12. [12]

    N. Halko, P.-G. Martinsson, and J. A. Tropp. Finding structure with randomness: Probabilistic algorithms for constructing approximate matrix decompositions. SIAM Review, 53(2):217–288, 2011

  13. [13]

    D. Hendrycks, C. Burns, S. Kadavath, A. Arora, S. Basart, E. Tang, D. Song, and J. Steinhardt. Measuring mathematical problem solving with the MATH dataset. In Advances in Neural Information Processing Systems Datasets and Benchmarks Track, 2021

  14. [14]

    E. J. Hu, Y. Shen, P. Wallis, Z. Allen-Zhu, Y. Li, S. Wang, L. Wang, and W. Chen. LoRA: Low-rank adaptation of large language models. In International Conference on Learning Representations, 2022

  15. [15]

    I. Hu, R. Cong, A. Zhang, A. Shetty, V. Lingam, W.-L. Chou, A. G. Dimakis, and S. Sanghavi. LoRTA: Low rank tensor adaptation of large language models. arXiv preprint arXiv:2410.04060, 2024

  16. [16]

    Z. Hu, L. Wang, Y. Lan, W. Xu, E.-P. Lim, L. Bing, X. Xu, S. Poria, and R. K.-W. Lee. LLM-Adapters: An adapter family for parameter-efficient fine-tuning of large language models. arXiv preprint arXiv:2304.01933, 2023

  17. [17]

    D. A. Hudson and C. D. Manning. GQA: A new dataset for real-world visual reasoning and compositional question answering. In IEEE Conference on Computer Vision and Pattern Recognition (CVPR), 2019

  18. [18]

    T. Jiang, S. Huang, S. Luo, Z. Zhang, H. Huang, F. Wei, W. Deng, F. Sun, Q. Zhang, D. Wang, and F. Zhuang. MoRA: High-rank updating for parameter-efficient fine-tuning. In International Conference on Learning Representations, 2025

  19. [19]

    J. Kirkpatrick, R. Pascanu, N. Rabinowitz, J. Veness, G. Desjardins, A. A. Rusu, K. Milan, J. Quan, T. Ramalho, A. Grabska-Barwinska, et al. Overcoming catastrophic forgetting in neural networks. Proceedings of the National Academy of Sciences, 114(13):3521–3526, 2017

  20. [20]

    D. J. Kopiczko, T. Blankevoort, and Y. M. Asano. VeRA: Vector-based random matrix adaptation. In International Conference on Learning Representations, 2024

  21. [21]

    W. Kwon, Z. Li, S. Zhuang, Y. Sheng, L. Zheng, C. H. Yu, J. E. Gonzalez, H. Zhang, and I. Stoica. Efficient memory management for large language model serving with PagedAttention. In Proceedings of the ACM Symposium on Operating Systems Principles, 2023

  22. [22]

    Y. Li, Y. Du, K. Zhou, J. Wang, W. X. Zhao, and J.-R. Wen. Evaluating object hallucination in large vision-language models. In Conference on Empirical Methods in Natural Language Processing (EMNLP), 2023

  23. [23]

    V. Lingam, A. Tejaswi, A. Vavre, A. Shetty, G. K. Gudur, J. Ghosh, A. Dimakis, E. Choi, A. Bojchevski, and S. Sanghavi. SVFT: Parameter-efficient fine-tuning with singular vectors. In Advances in Neural Information Processing Systems, 2024

  24. [24]

    H. Liu, C. Li, Q. Wu, and Y. J. Lee. Visual instruction tuning. In Advances in Neural Information Processing Systems, 2023

  25. [25]

    S.-Y. Liu, C.-Y. Wang, H. Yin, P. Molchanov, Y.-C. F. Wang, K.-T. Cheng, and M.-H. Chen. DoRA: Weight-decomposed low-rank adaptation. In International Conference on Machine Learning, 2024

  26. [26]

    W. Liu, Z. Qiu, Y. Feng, Y. Xiu, Y. Xue, L. Yu, H. Feng, Z. Liu, J. Heo, S. Peng, Y. Wen, M. J. Black, A. Weller, and B. Schölkopf. Parameter-efficient orthogonal finetuning via butterfly factorization. In International Conference on Learning Representations, 2024

  27. [27]

    Y. Liu, H. Duan, Y. Zhang, B. Li, S. Zhang, W. Zhao, Y. Yuan, J. Wang, C. He, Z. Liu, K. Chen, and D. Lin. MMBench: Is your multi-modal model an all-around player? arXiv preprint arXiv:2307.06281, 2023

  28. [28]

    Z. Liu, T. Pang, O. Balabanov, C. Yang, T. Huang, L. Yin, Y. Yang, and S. Liu. LIFT the veil for the truth: Principal weights emerge after rank reduction for reasoning-focused supervised fine-tuning. In International Conference on Machine Learning, 2025

  29. [29]

    I. Loshchilov and F. Hutter. Decoupled weight decay regularization. In International Conference on Learning Representations, 2019

  30. [30]

    P. Lu, S. Mishra, T. Xia, L. Qiu, K.-W. Chang, S.-C. Zhu, O. Tafjord, P. Clark, and A. Kalyan. Learn to explain: Multimodal reasoning via thought chains for science question answering. In Advances in Neural Information Processing Systems, 2022

  31. [31]

    F. Meng, Z. Wang, and M. Zhang. PiSSA: Principal singular values and singular vectors adaptation of large language models. In Advances in Neural Information Processing Systems, 2024

  32. [32]

    T. Mihaylov, P. Clark, T. Khot, and A. Sabharwal. Can a suit of armor conduct electricity? A new dataset for open book question answering. In Proceedings of the Conference on Empirical Methods in Natural Language Processing, 2018

  33. [33]

    A. Novikov, D. Podoprikhin, A. Osokin, and D. Vetrov. Tensorizing neural networks. In Advances in Neural Information Processing Systems, 2015

  34. [34]

    I. V. Oseledets. Tensor-train decomposition. SIAM Journal on Scientific Computing, 33(5):2295–2317, 2011

  35. [35]

    S. Qiu, A. Potapczynski, M. Finzi, M. Goldblum, and A. G. Wilson. Compute better spent: Replacing dense layers with structured matrices. In International Conference on Machine Learning, 2024

  36. [36]

    Qwen3 Technical Report

    Qwen Team. Qwen3 technical report. arXiv preprint arXiv:2505.09388, 2025

  37. [37]

    K. Sakaguchi, R. Le Bras, C. Bhagavatula, and Y. Choi. WinoGrande: An adversarial Winograd schema challenge at scale. Communications of the ACM, 64(9):99–106, 2020

  38. [38]

    M. Sap, H. Rashkin, D. Chen, R. Le Bras, and Y. Choi. Social IQa: Commonsense reasoning about social interactions. In Proceedings of the Conference on Empirical Methods in Natural Language Processing, 2019

  39. [39]

    J. Schulman and Thinking Machines Lab. LoRA without regret. https://thinkingmachines.ai/blog/lora/, 2025. Blog post

  40. [40]

    Z. Shao, P. Wang, Q. Zhu, R. Xu, J. Song, X. Bi, H. Zhang, M. Zhang, Y. K. Li, Y. Wu, and D. Guo. DeepSeekMath: Pushing the limits of mathematical reasoning in open language models. arXiv preprint arXiv:2402.03300, 2024

  41. [41]

    A. Singh, V. Natarajan, M. Shah, Y. Jiang, X. Chen, D. Batra, D. Parikh, and M. Rohrbach. Towards VQA models that can read. In IEEE Conference on Computer Vision and Pattern Recognition (CVPR), 2019

  42. [42]

    M.-A. Team. Amc 2023 dataset, 2023

  43. [43]

    M.-A. Team. American invitational mathematics examination (aime) 2024, 2024

  44. [44]

    M.-A. Team. American invitational mathematics examination (aime) 2024, 2025

  45. [45]

    Q. Team, A. Yang, B. Yang, B. Zhang, B. Hui, B. Zheng, B. Yu, C. Li, D. Liu, F. Huang, et al. Qwen2.5 technical report. arXiv preprint arXiv:2412.15115, 2025

  46. [46]

    Llama 2: Open Foundation and Fine-Tuned Chat Models

    H. Touvron, L. Martin, K. Stone, P. Albert, A. Almahairi, Y. Babaei, N. Bashlykov, S. Batra, P. Bhargava, S. Bhosale, et al. Llama 2: Open foundation and fine-tuned chat models. arXiv preprint arXiv:2307.09288, 2023

  47. [47]

    H. Wang, Y. Li, S. Wang, G. Chen, and Y. Chen. MiLoRA: Harnessing minor singular components for parameter-efficient LLM finetuning. arXiv preprint arXiv:2406.09044, 2024

  48. [48]

    X. Yang, J. Leng, G. Guo, J. Zhao, R. Nakada, L. Zhang, H. Yao, and B. Chen. S2FT: Efficient, scalable and generalizable LLM fine-tuning by structured sparsity. In Advances in Neural Information Processing Systems, 2024

  49. [49]

    Y. Yang, K. Zhen, E. Banijamali, A. Mouchtaris, and Z. Zhang. LoRETTA: Low-rank economic tensor-train adaptation for ultra-low-parameter fine-tuning of large language models. In Proceedings of the Conference of the North American Chapter of the Association for Computational Linguistics, 2024

  50. [50]

    Q. Yin, Y. Wu, Z. Shen, S. Li, Z. Wang, Y. Li, C. T. Leong, J. Kang, and J. Gu. Evaluating parameter efficient methods for RLVR. arXiv preprint arXiv:2512.23165, 2025

  51. [51]

    L. Yu, W. Jiang, H. Shi, J. Yu, Z. Liu, Y. Zhang, J. T. Kwok, Z. Li, A. Weller, and W. Liu. MetaMath: Bootstrap your own mathematical questions for large language models. arXiv preprint arXiv:2309.12284, 2023

  52. [52]

    Q. Yu, Z. Zhang, R. Zhu, Y. Yuan, X. Zuo, Y. Yue, et al. DAPO: An open-source LLM reinforcement learning system at scale. arXiv preprint arXiv:2503.14476, 2025

  53. [53]

    R. Zellers, A. Holtzman, Y. Bisk, A. Farhadi, and Y. Choi. HellaSwag: Can a machine really finish your sentence? In Proceedings of the Annual Meeting of the Association for Computational Linguistics, 2019

  54. [54]

    F. Zhang and M. Pilanci. Spectral adapter: Fine-tuning in spectral space. In Advances in Neural Information Processing Systems, 2024

  55. [55]

    Q. Zhang, M. Chen, A. Bukharin, N. Karampatziakis, P. He, Y. Cheng, W. Chen, and T. Zhao. AdaLoRA: Adaptive budget allocation for parameter-efficient fine-tuning. In International Conference on Learning Representations, 2023

  56. [56]

    H. Zhu, Z. Zhang, H. Huang, D. Su, Z. Liu, J. Zhao, I. Fedorov, H. Pirsiavash, Z. Sha, J. Lee, D. Z. Pan, Z. Wang, Y. Tian, and K. S. Tai. The path not taken: RLVR provably learns off the principals. arXiv preprint arXiv:2511.08567, 2025. NeurIPS 2025 Workshop on Efficient Reasoning (spotlight)

  57. [57]

    with FuRA substituted for DoRA; best LR selected from {2, 3, 4} × 10^−4. Hyperparameters (FuRA), LLaVA-1.5-7B: Dropout 0.05; Optimizer AdamW; LR 3 × 10^−4; LR scheduler cosine decay; Weight decay 0; Batch size (per-device × grad-accum) 4 × 4 = 16; Warmup ratio 0.03; Epochs 1; Model max length 2048; Mixed precision bf16; Gradient checkpointing on; Where Q, K, V, O, Up, Down, Gate. Table 12:...

  58. [58]

    diag(S)2-weighted

    for each (model, method) cell. Sample std is taken across n = 3 seeds, so the SEM ≈ 0.6 × std. AIME-24/25 are avg@8 at T = 0.6; MATH-500 and AMC23 are greedy@1 at T = 0. The Paper Table (Table 4) reports the per-seed mean from the same set of runs. Table 15: Math RL with GRPO: per-seed mean ± std across 3 seeds (42, 43, 44). The Paper Table (Table 4) reports t...