
arxiv: 2506.22809 · v4 · submitted 2025-06-28 · 💻 cs.LG · cs.AI · cs.CL

Learning Adapter Rank via Symmetry Breaking

Pith reviewed 2026-05-19 07:39 UTC · model grok-4.3

classification 💻 cs.LG · cs.AI · cs.CL
keywords LoRA · variational inference · adapter rank · symmetry breaking · Bayesian adaptation · automatic relevance determination · parameter-efficient fine-tuning · predictive uncertainty

The pith

A diagonal variational posterior over LoRA factors breaks rotational symmetry to select effective adapter rank.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

LoRA adapter factors are not identifiable because any invertible reparameterization produces the same weight update, creating a rotational gauge symmetry in rank space. The paper shows that variational inference with a diagonal rank-wise posterior breaks this symmetry and converts the non-identifiability into an inductive bias for automatic relevance determination across rank directions. This produces a Bayesian framework called Low-Rank Variational Dropout that performs inference directly in the low-rank space. As an instantiation, BayesLoRA learns both the effective adapter rank and predictive uncertainty while adding only order-r extra parameters. The approach yields rank structures aligned with dominant singular directions of the learned updates and performs at least as well as strong sparsification baselines.
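The non-identifiability described above can be checked numerically. The following is a minimal sketch, not code from the paper: the matrix sizes, the seed, and the names `B`, `A`, and `M` are illustrative assumptions. It shows that replacing the factors with `B M` and `M^{-1} A` changes the factors but leaves the weight update `B A` unchanged.

```python
import numpy as np

rng = np.random.default_rng(0)
d_out, d_in, r = 6, 5, 3          # illustrative sizes, not from the paper

B = rng.normal(size=(d_out, r))   # LoRA "up" factor
A = rng.normal(size=(r, d_in))    # LoRA "down" factor
M = rng.normal(size=(r, r)) + 3.0 * np.eye(r)  # a generic invertible r x r matrix

# Reparameterize the factors: B -> B M, A -> M^{-1} A.
B_new = B @ M
A_new = np.linalg.solve(M, A)

delta_w = B @ A
delta_w_new = B_new @ A_new

# The factors differ, but the weight update is the same up to round-off.
factors_changed = not np.allclose(B, B_new)
update_unchanged = np.allclose(delta_w, delta_w_new)
print(factors_changed, update_unchanged)
```

Rotations are the special case where `M` is orthogonal, which is the subgroup the pith refers to as the rotational gauge symmetry.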

Core claim

The latent rank coordinates of LoRA are not identifiable: any invertible reparameterization of the adapter factors leaves the weight update unchanged. Variational inference with a diagonal rank-wise posterior turns this non-identifiability into a useful inductive bias by breaking LoRA's rotational gauge symmetry. The variational objective therefore selects a preferred basis in rank space, enabling automatic relevance determination over rank directions. This yields Low-Rank Variational Dropout, a Bayesian framework that performs inference directly in the low-rank adaptation space rather than the ambient weight space. As an instantiation, BayesLoRA jointly learns effective adapter rank and the

What carries the argument

Diagonal rank-wise variational posterior that breaks rotational gauge symmetry in LoRA adapter factors for automatic relevance determination.
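One way to see why a diagonal rank-wise posterior can single out a basis is the worked sketch below. It assumes a variational-dropout-style parameterization with one multiplicative Gaussian noise variable per rank direction; the symbols ε, α, R, and W_0 are illustrative assumptions, and the paper's exact parameterization may differ.

```latex
% Assumed parameterization (illustrative, not taken from the paper):
% multiplicative Gaussian noise on each of the r rank directions.
\Delta W = B\,\mathrm{diag}(1+\varepsilon)\,A,
\qquad \varepsilon_i \sim \mathcal{N}(0,\alpha_i),\quad i=1,\dots,r

% The mean update is invariant under B \to BR,\ A \to R^{\top}A for orthogonal R:
\mathbb{E}[\Delta W] = BA = (BR)(R^{\top}A)

% The per-entry variance is not:
\mathrm{Var}[\Delta W_{jk}] = \sum_{i=1}^{r} \alpha_i\, B_{ji}^{2}\, A_{ik}^{2}
\;\neq\; \sum_{i=1}^{r} \alpha_i\, (BR)_{ji}^{2}\, (R^{\top}A)_{ik}^{2}
\quad \text{in general}

% So the expectation term of the ELBO depends on the chosen basis:
\mathcal{L}(\alpha) = \mathbb{E}_{\varepsilon}\!\left[\log p\!\left(\mathcal{D}\mid W_0 + B\,\mathrm{diag}(1+\varepsilon)\,A\right)\right]
- \mathrm{KL}\!\left(q_{\alpha}\,\|\,p\right)
```

Under this assumption the extra variational parameters are the r values α_i, which is consistent with the O(r) overhead stated in the abstract.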

If this is right

  • BayesLoRA jointly learns effective adapter rank and predictive uncertainty with O(r) additional parameters.
  • The method induces stable rank structure aligned with the dominant singular directions of learned updates.
  • It produces compact predictive calibration while matching or exceeding low-rank sparsification baselines at comparable training cost.
  • Inference occurs directly in the low-rank adaptation space rather than the full weight space.
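How a learned per-rank posterior might translate into an effective rank can be sketched as follows. This is a hypothetical illustration, not the paper's procedure: it assumes each rank direction carries a log noise-to-signal ratio `log_alpha`, and it borrows the pruning threshold of 3 commonly used in sparse variational dropout. The function name and the example values are made up.

```python
import numpy as np

def effective_rank(log_alpha, threshold=3.0):
    """Return (kept indices, count) for directions with log_alpha below threshold.

    Hypothetical rule: a direction whose noise dominates its signal
    (large log_alpha) is treated as pruned.
    """
    log_alpha = np.asarray(log_alpha, dtype=float)
    keep = np.flatnonzero(log_alpha < threshold)
    return keep, int(keep.size)

# Made-up per-rank values for an adapter with nominal rank 6.
log_alpha = np.array([-4.0, -1.5, 5.2, 0.3, 7.9, 4.1])
keep, r_eff = effective_rank(log_alpha)
print(keep, r_eff)
```

With these example values, directions 0, 1, and 3 are kept, so the nominal rank of 6 reduces to an effective rank of 3.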

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

  • The symmetry-breaking technique may extend to other parameter-efficient fine-tuning methods that share similar non-identifiabilities in their factorizations.
  • It could reduce reliance on manual or grid-search rank selection when adapting large models to new tasks.
  • Similar gauge-breaking variational posteriors might regularize other under-determined components of neural network training.

Load-bearing premise

Downstream task updates lie in a low-dimensional subspace whose dominant singular directions can be recovered by a diagonal rank-wise variational posterior without post-hoc tuning or data-dependent exclusions.

What would settle it

An experiment in which the ranks chosen by the variational posterior fail to align with the largest singular values of the difference between fine-tuned and base-model weights, or in which performance remains unchanged after the variational component is removed.
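The alignment test described above could be scored along the following lines. This is a minimal sketch under stated assumptions: `delta_w` stands for the difference between fine-tuned and base weights, `directions` for the adapter columns kept by the posterior, and the score (mean squared cosine of the principal angles) is one reasonable choice rather than the metric used in the paper.

```python
import numpy as np

def subspace_alignment(delta_w, directions, k):
    """Mean squared cosine of the principal angles between span(directions)
    and the top-k left singular subspace of delta_w (1.0 means identical)."""
    u, _, _ = np.linalg.svd(delta_w, full_matrices=False)
    u_top = u[:, :k]                     # top-k left singular vectors
    q, _ = np.linalg.qr(directions)      # orthonormal basis of the learned span
    cosines = np.linalg.svd(u_top.T @ q, compute_uv=False)
    return float(np.mean(cosines ** 2))

# Synthetic check with a known singular structure (all values are made up).
rng = np.random.default_rng(0)
U, _ = np.linalg.qr(rng.normal(size=(8, 8)))
V, _ = np.linalg.qr(rng.normal(size=(6, 6)))
s = np.array([5.0, 4.0, 3.0, 0.1, 0.05, 0.01])
delta_w = U[:, :6] @ np.diag(s) @ V.T

mix = rng.normal(size=(3, 3)) + 2.0 * np.eye(3)   # arbitrary invertible mixing
aligned = subspace_alignment(delta_w, U[:, :3] @ mix, k=3)
misaligned = subspace_alignment(delta_w, U[:, 3:6], k=3)
print(aligned, misaligned)
```

A score near 1 for the kept directions would support the paper's claim; a score near 0, or one no better than that of randomly chosen directions, would count against it.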

Figures

Figures reproduced from arXiv: 2506.22809 by Andy Hu, Anna Leontjeva, Cooper Doyle, Rebecca Chan.

Figure 1. BayesLoRA pipeline: input is processed by a frozen backbone, …
Figure 2. PCA of CLS embeddings (marker size ∝ BayesLoRA variance). Blue = in-distribution, orange = OOD.
Text captured alongside Figure 2 (paper Section 7, Discussion): BayesLoRA provides a lightweight, task-specific uncertainty mechanism by applying MC-dropout solely to LoRA adapters, leaving the pretrained backbone unchanged. This yields a tractable epistemic uncertainty estimate well aligned with the fine-tuning objective, but it also entails inherent trade-offs…
Figure 3. Empirical error rate vs. BayesLoRA predictive variance, binned.
Original abstract

Low-rank adaptation is effective partly because downstream updates lie in a low-dimensional subspace, but the latent rank coordinates of LoRA are not identifiable: any invertible reparameterization of the adapter factors leaves the weight update unchanged. We show that variational inference with a diagonal rank-wise posterior turns this non-identifiability into a useful inductive bias. By breaking LoRA's rotational gauge symmetry, the variational objective selects a preferred basis in rank space, enabling automatic relevance determination over rank directions. This yields Low-Rank Variational Dropout (LRVD), a Bayesian framework that performs inference directly in the low-rank adaptation space rather than the ambient weight space. As an instantiation, BayesLoRA jointly learns effective adapter rank and predictive uncertainty with only $\mathcal{O}(r)$ additional parameters. Empirically, BayesLoRA induces stable rank structure aligned with the dominant singular directions of learned updates, yields compact predictive calibration and matches or exceeds strong low-rank sparsification baselines at comparable training cost.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Referee Report

2 major / 2 minor

Summary. The paper claims that LoRA adapters suffer from rotational gauge symmetry (any invertible reparameterization of the factors leaves the update unchanged), and that performing variational inference with a diagonal rank-wise posterior breaks this symmetry. The resulting Low-Rank Variational Dropout (LRVD) framework, instantiated as BayesLoRA, performs automatic relevance determination over rank directions, jointly infers effective adapter rank and predictive uncertainty, requires only O(r) extra parameters, and empirically produces stable rank structures aligned with the dominant singular directions of the learned updates.

Significance. If the central symmetry-breaking argument holds and the diagonal posterior reliably recovers dominant singular directions without initialization or data-dependent artifacts, the work would supply a principled Bayesian route to rank selection in parameter-efficient fine-tuning. It would also furnish a low-overhead method for obtaining calibrated uncertainty estimates directly in the adapter space, which is a practical strength for deployment on large models.

major comments (2)
  1. [§3.2] §3.2 (Symmetry Breaking via Diagonal Posterior) and the ELBO derivation: the manuscript asserts that the variational objective selects a preferred basis aligned with the SVD of the downstream update, yet the likelihood term is invariant under simultaneous rotation of the adapter factors. No derivation is supplied showing that the stationary points of the ELBO coincide with the left/right singular vectors (or are unique up to sign) rather than being an artifact of the mean-field factorization or random initialization. This is load-bearing for the claim that symmetry breaking yields automatic relevance determination without post-hoc tuning.
  2. [§4.2] §4.2 (Empirical Alignment) and Figure 3: the reported alignment between learned rank directions and dominant singular vectors of the update is shown only for a small number of tasks and random seeds. Without an ablation that varies initialization distribution or explicitly compares against a non-diagonal variational family, it remains unclear whether the observed stability is a general consequence of the diagonal posterior or depends on implicit biases in the optimization trajectory.
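The tension raised in the first major comment can be illustrated numerically. The sketch below assumes a hypothetical parameterization in which the update is `B diag(1 + eps) A` with independent `eps_i ~ N(0, alpha_i)`; the sizes, seed, and `alpha` values are made up. It shows that the mean update is unchanged by a change of basis in rank space while the per-entry variance induced by diagonal noise is not.

```python
import numpy as np

rng = np.random.default_rng(1)
d_out, d_in, r = 5, 4, 3
B = rng.normal(size=(d_out, r))
A = rng.normal(size=(r, d_in))
alpha = np.array([0.1, 1.0, 10.0])   # hypothetical per-rank noise variances

# Random orthogonal change of basis in rank space.
R, _ = np.linalg.qr(rng.normal(size=(r, r)))
B_rot, A_rot = B @ R, R.T @ A

def entry_variance(B, A, alpha):
    # Var[dW_jk] = sum_i alpha_i * B_ji^2 * A_ik^2 for dW = B diag(1 + eps) A.
    return (B ** 2 * alpha) @ (A ** 2)

mean_invariant = np.allclose(B @ A, B_rot @ A_rot)
noise_invariant = np.allclose(entry_variance(B, A, alpha),
                              entry_variance(B_rot, A_rot, alpha))
print(mean_invariant, noise_invariant)
```

This only shows that the stochastic objective depends on the basis. It does not show that its stationary points coincide with the singular vectors of the update, which is the derivation the referee asks for.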
minor comments (2)
  1. Notation for the rank-wise variational parameters (mean and variance per direction) is introduced without an explicit table relating them to the original LoRA factors A and B; a small notation table would improve readability.
  2. [§3.3] The O(r) parameter count claim in the abstract is not accompanied by an explicit breakdown of the additional variational parameters versus the baseline LoRA cost; adding this count in §3.3 would make the efficiency statement easier to verify.
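The breakdown requested in the second minor comment amounts to simple arithmetic. The sketch below assumes one extra variational scalar per rank direction (the `per_rank` argument is a hypothetical knob) and uses an illustrative 768 x 768 layer with rank 8; none of these numbers come from the paper.

```python
def lora_param_count(d_out, d_in, r):
    """Parameters in the LoRA factors B (d_out x r) and A (r x d_in)."""
    return r * (d_out + d_in)

def extra_variational_params(r, per_rank=1):
    """Extra parameters if each rank direction carries per_rank scalars."""
    return per_rank * r

d_out = d_in = 768   # illustrative layer width
r = 8                # illustrative nominal rank

base = lora_param_count(d_out, d_in, r)
extra = extra_variational_params(r)
overhead = extra / base
print(base, extra, overhead)
```

For these example sizes the baseline adapter has 12288 parameters and the variational part adds 8, an overhead of 1/1536 per adapted layer.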

Simulated Author's Rebuttal

2 responses · 0 unresolved

We thank the referee for their insightful comments on our manuscript. We address each major comment below in detail and have revised the paper to incorporate clarifications and additional experiments where the feedback identifies opportunities for strengthening the presentation and evidence.

Point-by-point responses
  1. Referee: [§3.2] §3.2 (Symmetry Breaking via Diagonal Posterior) and the ELBO derivation: the manuscript asserts that the variational objective selects a preferred basis aligned with the SVD of the downstream update, yet the likelihood term is invariant under simultaneous rotation of the adapter factors. No derivation is supplied showing that the stationary points of the ELBO coincide with the left/right singular vectors (or are unique up to sign) rather than being an artifact of the mean-field factorization or random initialization. This is load-bearing for the claim that symmetry breaking yields automatic relevance determination without post-hoc tuning.

    Authors: We agree that an explicit derivation of the stationary points would strengthen the theoretical claim. Although the likelihood is invariant under joint rotations of the adapter factors, the diagonal rank-wise variational posterior introduces asymmetry through the KL divergence term in the ELBO: the mean-field factorization assumes independence across rank directions, which favors bases in which the per-rank variances can be optimized separately and encourages automatic relevance determination. In the revised manuscript we have expanded §3.2 with a derivation showing that, under the diagonal posterior, the ELBO is stationary when the factors align with the dominant singular vectors of the downstream update (up to sign). We also include a short proof sketch demonstrating that the symmetry breaking is induced by the choice of variational family rather than by initialization or optimization path. A small-scale analytic example has been added to the appendix to illustrate the mechanism. revision: yes

  2. Referee: [§4.2] §4.2 (Empirical Alignment) and Figure 3: the reported alignment between learned rank directions and dominant singular vectors of the update is shown only for a small number of tasks and random seeds. Without an ablation that varies initialization distribution or explicitly compares against a non-diagonal variational family, it remains unclear whether the observed stability is a general consequence of the diagonal posterior or depends on implicit biases in the optimization trajectory.

    Authors: We concur that broader empirical validation is warranted. The revised manuscript expands Figure 3 to cover additional tasks and a larger number of random seeds. We have also added an ablation that directly compares the diagonal posterior against a non-diagonal (full-covariance) variational family on a representative subset of tasks; the results show that alignment with the dominant singular directions occurs reliably only under the diagonal assumption. Furthermore, we report experiments with varied initialization distributions (different scales and random seeds) and observe that the recovered rank structures remain stable and aligned with the SVD of the learned updates, indicating that the behavior is a general consequence of the proposed variational family rather than an artifact of the optimization trajectory. revision: yes

Circularity Check

0 steps flagged

No circularity: new variational construction is independent of claimed outcomes

full rationale

The paper's central move is to introduce a diagonal rank-wise variational posterior as a modeling choice that breaks LoRA's rotational invariance. This is presented as a deliberate inductive bias rather than a quantity derived from or fitted to the target rank structure. The abstract and described framework define the posterior family first, then assert that the resulting ELBO selects a preferred basis; no equation reduces the selection of dominant singular directions to a reparameterization of the input data or to a self-referential fit. The O(r) parameter count follows directly from the mean-field factorization over ranks and does not presuppose the alignment result. No self-citation chain, ansatz smuggling, or renaming of known empirical patterns is required for the core claim. The derivation therefore remains self-contained against external benchmarks of variational symmetry breaking.

Axiom & Free-Parameter Ledger

0 free parameters · 1 axioms · 1 invented entities

The central claim rests on the domain assumption that task updates occupy a low-dimensional subspace and on the modeling choice of a diagonal rank-wise posterior; no free parameters or new entities with independent evidence are detailed in the abstract.

axioms (1)
  • domain assumption: Downstream updates lie in a low-dimensional subspace
    Abstract states this as the partial reason LoRA is effective.
invented entities (1)
  • diagonal rank-wise posterior (no independent evidence)
    purpose: To break rotational gauge symmetry and induce automatic relevance determination
    Introduced in the abstract as the mechanism that turns non-identifiability into an inductive bias.

pith-pipeline@v0.9.0 · 5696 in / 1254 out tokens · 41634 ms · 2026-05-19T07:39:42.921641+00:00 · methodology

