arxiv: 2509.21637 · v2 · submitted 2025-09-25 · 💻 cs.LG

BoHA: Blockwise Hadamard Product Adaptation for Parameter-Efficient Fine-Tuning

Pith reviewed 2026-05-18 13:41 UTC · model grok-4.3

classification 💻 cs.LG
keywords: parameter-efficient fine-tuning · Hadamard product · LoRA · continual learning · large language models · blockwise adaptation · sequential adaptation · retention

The pith

BoHA partitions frozen weights into blocks and applies local low-rank Hadamard factors to retain prior-task accuracy during sequential fine-tuning.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper introduces BoHA to fill a gap in PEFT evaluation: methods such as LoRA are usually judged only on single-task accuracy, yet real use often requires adapting a model to new tasks without losing performance on earlier ones. BoHA treats spatial support as an explicit design axis by splitting each frozen weight matrix W0 into a b by b grid and training a separate low-rank Hadamard product update inside every block. This construction keeps the total rank matched to a standard LoRA adapter of the same budget while still allowing the adapter to be merged into the base weights, so inference carries no extra cost. Across Llama, Mistral and Gemma models, BoHA reports higher single-task averages than LoRA; in a commonsense-to-arithmetic continual-learning test on Llama-3.2-3B, it retains 57.66 percent first-stage accuracy and exceeds the W0-free additive-control mean by 15.23 percent under matched second-stage plasticity.

Core claim

BoHA partitions the frozen weight W0 into a b×b grid and learns an independent low-rank Hadamard product factor in each block, preserving a matched LoRA-equivalent total rank with adapter-free merged inference. On a synthetic target, BoHA at per-block rank rb=1 exactly reconstructs an update that requires rank b² under the global W0-coupled Hadamard parameterization. Across Llama-3.2-1B/3B, Mistral-7B, and Gemma-2-9B on commonsense and arithmetic reasoning tasks, BoHA outperforms LoRA across all matched-budget single-task averages and remains competitive with the strongest Hadamard baseline. On a Llama-3.2-3B commonsense to arithmetic continual-learning diagnostic, BoHA retains 57.66 percent

What carries the argument

Blockwise W0-coupled Hadamard product adapter: the frozen weight is split into a b by b grid and each block receives its own low-rank Hadamard factor, keeping total rank matched to LoRA while allowing merged inference.
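A minimal NumPy sketch of the construction described above may help make it concrete. It assumes the per-block update takes the form W0_ij ⊙ (B_ij A_ij) and that merging means adding that update to W0 once before deployment; the function name `boha_delta`, the tensor layout of `B` and `A`, and the requirement that b divides both dimensions are illustrative assumptions, not the authors' implementation.

```python
import numpy as np

def boha_delta(W0, B, A, b):
    """Blockwise Hadamard update (sketch).

    Block (i, j) of the update is W0_ij * (B[i, j] @ A[i, j]), where
    B[i, j] has shape (m/b, r_b) and A[i, j] has shape (r_b, n/b).
    """
    m, n = W0.shape
    assert m % b == 0 and n % b == 0, "sketch assumes b divides both dimensions"
    bm, bn = m // b, n // b
    dW = np.zeros_like(W0)
    for i in range(b):
        for j in range(b):
            rows = slice(i * bm, (i + 1) * bm)
            cols = slice(j * bn, (j + 1) * bn)
            # Elementwise (Hadamard) product with the frozen block of W0.
            dW[rows, cols] = W0[rows, cols] * (B[i, j] @ A[i, j])
    return dW

rng = np.random.default_rng(0)
m, n, b, r_b = 8, 12, 2, 1                      # toy sizes, chosen for illustration
W0 = rng.standard_normal((m, n))                # stands in for a frozen weight
B = rng.standard_normal((b, b, m // b, r_b))    # hypothetical trained factors
A = rng.standard_normal((b, b, r_b, n // b))

# Merge once; afterwards inference uses a single dense matrix.
W_merged = W0 + boha_delta(W0, B, A, b)

x = rng.standard_normal(n)
y_adapter = W0 @ x + boha_delta(W0, B, A, b) @ x   # unmerged path
y_merged = W_merged @ x                            # merged path
print(np.allclose(y_adapter, y_merged))
```

Because the update is an ordinary matrix of the same shape as W0, the merged and unmerged paths agree up to floating-point error, which is what "adapter-free merged inference" amounts to in this sketch.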

If this is right

  • BoHA achieves higher single-task accuracy than LoRA at the same parameter budget on commonsense and arithmetic reasoning.
  • In sequential adaptation the method retains substantially more first-stage performance than additive W0-free controls.
  • The blockwise construction still permits exact reconstruction of higher-rank updates using only rank-1 factors per block.
  • Adapter-free merged inference remains possible, matching the deployment convenience of LoRA.
  • The approach scales across model sizes from 1B to 9B parameters without changing the core design.
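The reconstruction claim in the list above can be illustrated with a toy check. The sketch below builds a Hadamard factor H whose b×b blocks are each an independent rank-1 outer product and measures its rank; if W0 has no zero entries, a global parameterization W0 ⊙ (BA) can match W0 ⊙ H only when BA equals H. The random target and the sizes are assumptions made for illustration and are not the paper's synthetic target.

```python
import numpy as np

rng = np.random.default_rng(0)
m = n = 32
b = 4
bm, bn = m // b, n // b

# Hadamard factor H: every block is an independent rank-1 outer product (r_b = 1).
H = np.zeros((m, n))
for i in range(b):
    for j in range(b):
        u = rng.standard_normal(bm)
        v = rng.standard_normal(bn)
        H[i * bm:(i + 1) * bm, j * bn:(j + 1) * bn] = np.outer(u, v)

# Generically the global rank is b**2, provided m/b >= b and n/b >= b.
rank_H = np.linalg.matrix_rank(H)

def best_rank_r_error(M, r):
    """Relative Frobenius error of the best rank-r approximation (Eckart-Young)."""
    s = np.linalg.svd(M, compute_uv=False)
    return float(np.sqrt(np.sum(s[r:] ** 2) / np.sum(s ** 2)))

blockwise_param_count = b * b * (bm + bn)   # rank-1 factors in every block
global_param_count = rank_H * (m + n)       # global factors at the rank H needs

print("rank of H:", rank_H)
print("error of best global rank-b fit:", best_rank_r_error(H, b))
print("parameters, blockwise vs global:", blockwise_param_count, global_param_count)
```

In this toy setting a global factor at the matched budget (rank b) leaves a nonzero residual, while the blockwise factors represent H exactly with b times fewer parameters than a global factor of sufficient rank.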

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

  • Spatial partitioning may reduce destructive interference between successive tasks more effectively than global low-rank updates.
  • The same blockwise idea could be tested on vision or multimodal models where weight matrices have clear spatial structure.
  • Choosing block size b as a hyperparameter might trade off between local flexibility and total parameter count in longer task sequences.

Load-bearing premise

Partitioning W0 into blocks and learning independent low-rank Hadamard factors per block preserves the effective rank of the adaptation and enables adapter-free merged inference.

What would settle it

If, on the reported Llama-3.2-3B commonsense-to-arithmetic diagnostic, BoHA fails to exceed the W0-free additive-control mean first-stage accuracy while matching second-stage plasticity, the retention advantage claim is falsified.
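The falsification condition above can be written as a small decision rule. This is a sketch only: the function name, the argument layout, and the tolerance `plasticity_tol` used to decide whether second-stage plasticity is "matched" are assumptions, since the page does not say how matching is defined.

```python
from statistics import mean

def retention_claim_survives(boha_stage1, control_stage1,
                             boha_stage2, control_stage2,
                             plasticity_tol):
    """Return True if BoHA's first-stage accuracy exceeds the control mean.

    boha_stage1 / boha_stage2: BoHA accuracy on the first / second task
        after second-stage training.
    control_stage1 / control_stage2: lists with the same two accuracies
        for each W0-free additive control.
    plasticity_tol: hypothetical tolerance for calling plasticity matched.
    """
    if abs(boha_stage2 - mean(control_stage2)) > plasticity_tol:
        # Without matched plasticity the retention comparison is uninformative.
        raise ValueError("second-stage plasticity is not matched")
    return boha_stage1 > mean(control_stage1)

# Made-up numbers for illustration only; these are not results from the paper.
print(retention_claim_survives(0.60, [0.40, 0.50], 0.30, [0.29, 0.31], 0.02))
```

Raising an error when plasticity is unmatched, rather than returning False, keeps an invalid comparison from being read as a refutation.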

Figures

Figures reproduced from arXiv: 2509.21637 by Feng Yu, Geyong Min, Jia Hu.

Figure 2. Illustration of BHRA compared with LoRA [Hu et al., 2022] and HiRA [Huang et al., 2025].
Figure 1. Average stable rank of ΔW when adapting Llama-3.2 1B to commonsense reasoning. BHRA maintains a substantially larger effective rank across identical rank budgets r. The stable rank ‖ΔW‖_F² / ‖ΔW‖_2² is a standard surrogate for effective rank.
Figure 3. Performance of BHRA (solid) and HiRA (dashed) of Llama-3.2 1B on eight commonsense reasoning datasets using different HiRA configurations.
Figure 4. Count of singular values exceeding 1% of the layer-wise maximum for FFT, LoRA, HiRA, and BHRA.
Figure 5. Layer-wise sum of squared singular values for FFT, LoRA, HiRA, and BHRA.
Figure 6. Effective rank across layers for FFT, LoRA, HiRA, and BHRA.
Figure 7. Accuracy on the eight commonsense tasks versus mean block Gini of the learned adapters (Llama-3.2-1B, r_b × b = 32). Greater block heterogeneity correlates with higher accuracy, and BHRA dominates both LoRA and HiRA in this trade-off.
Figure 8. The performance of BHRA, LoRA and HiRA under a fixed budget (r_b × b = 32) on commonsense reasoning tasks. Under this budget, the sweep covers r_b ∈ {1, 2, 4, 8, 16, 32} with b = 32/r_b.
Figure 9. Performance of BHRA, LoRA, and HiRA under a fixed budget (…
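The two rank diagnostics named in the captions above, the stable rank from Figure 1 and the count of singular values above 1% of the maximum from Figure 4, can be computed as follows. This is a sketch under one assumption: the threshold is taken relative to the largest singular value of the matrix passed in, which may differ from how the paper defines the "layer-wise maximum".

```python
import numpy as np

def stable_rank(dW):
    """Stable rank ||dW||_F^2 / ||dW||_2^2, computed from singular values."""
    s = np.linalg.svd(dW, compute_uv=False)   # sorted in descending order
    return float(np.sum(s ** 2) / s[0] ** 2)

def count_significant_singular_values(dW, rel_threshold=0.01):
    """Number of singular values above rel_threshold times the largest one."""
    s = np.linalg.svd(dW, compute_uv=False)
    return int(np.sum(s > rel_threshold * s[0]))

# Sanity checks on matrices whose spectra are known in closed form.
identity = np.eye(4)                               # four equal singular values
rank_one = np.outer([1.0, 2.0], [3.0, 4.0, 5.0])   # a single nonzero singular value
print(stable_rank(identity), stable_rank(rank_one))
print(count_significant_singular_values(np.diag([1.0, 0.5, 0.005])))
```

The stable rank never exceeds the algebraic rank and changes smoothly under small perturbations, which is why it is a convenient surrogate when comparing learned updates.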
Original abstract

Parameter-efficient fine-tuning (PEFT) of large language models trains a small task-specific parameter set while keeping the pretrained model frozen. The dominant Low-Rank Adaptation (LoRA) family makes this trade-off practical; however, evaluations under the same parameter budget assess single-task accuracy. In sequential adaptation settings, such evaluations should also measure how well performance on the first-stage task is retained after subsequent fine-tuning. To address this gap, we introduce BoHA, a blockwise $W_0$-coupled Hadamard product adapter that treats spatial support as an explicit design axis. BoHA partitions the frozen weight $W_0$ into a $b{\times}b$ grid and learns an independent low-rank Hadamard product factor in each block, preserving a matched LoRA-equivalent total rank with adapter-free merged inference. On a synthetic target, BoHA at per-block rank $r_b{=}1$ exactly reconstructs an update that requires rank $b^2$ under the global $W_0$-coupled Hadamard parameterization. Across Llama-3.2-1B/3B, Mistral-7B, and Gemma-2-9B on commonsense and arithmetic reasoning tasks, BoHA outperforms LoRA across all matched-budget single-task averages and remains competitive with the strongest Hadamard baseline. On a Llama-3.2-3B commonsense $\to$ arithmetic continual-learning diagnostic, BoHA retains $57.66\%$ first-stage accuracy and exceeds the $W_0$-free additive-control mean by $15.23\%$ under matched second-stage plasticity. These results demonstrate that blockwise $W_0$-coupled Hadamard adaptation is a competitive PEFT design choice when retention under sequential adaptation is part of the objective.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Referee Report

2 major / 2 minor

Summary. The manuscript introduces BoHA, a blockwise Hadamard product adaptation method for parameter-efficient fine-tuning of LLMs. It partitions each frozen weight matrix W0 into a b×b grid and learns an independent low-rank Hadamard product factor per block. The central claims are that this design preserves a matched LoRA-equivalent total parameter budget, enables adapter-free merged inference, outperforms LoRA on single-task averages across Llama-3.2-1B/3B, Mistral-7B and Gemma-2-9B on commonsense and arithmetic tasks, and improves retention in a Llama-3.2-3B commonsense-to-arithmetic continual-learning diagnostic (57.66% first-stage retention and +15.23% over the W0-free additive control under matched second-stage plasticity). A synthetic reconstruction experiment is presented to show that per-block rank r_b=1 suffices for an update that would require global rank b² under a non-blockwise W0-coupled Hadamard parameterization.

Significance. If the empirical claims and the rank-preservation argument hold, the work supplies a new explicit design axis (spatial partitioning) for PEFT adapters that may be useful when retention after sequential adaptation is an objective. The synthetic reconstruction result is a clear, falsifiable verification of one claimed advantage of the blockwise structure. The continual-learning diagnostic addresses a relevant evaluation gap beyond single-task accuracy that is rarely reported in the LoRA literature.

major comments (2)
  1. [§3] §3 (Method): The claim that the b×b blockwise low-rank Hadamard factors preserve a LoRA-equivalent total rank (same parameter budget) while producing the reported retention gains is load-bearing for the central contribution. Because the update is multiplicative (ΔW = W0 ⊙ (UV) per block), the realized update magnitude is scaled by the local |W0| values in each block. Pretrained LLM weights are heterogeneous across blocks, so this introduces spatially varying effective plasticity with no counterpart in additive LoRA. The synthetic target only verifies exact reconstruction for a contrived low-rank case and does not test whether the scaling distorts gradient flow or retention on the Llama-3.2-3B commonsense→arithmetic diagnostic. A quantitative comparison of effective per-block update norms or an ablation that normalizes by block magnitude would be required to attribute the 57.66% retention
  2. [§4] §4 (Experiments, continual-learning diagnostic): The reported 57.66% first-stage retention and 15.23% improvement over the W0-free additive-control mean are presented without error bars, number of random seeds, or explicit confirmation that data splits, learning rates, and second-stage training steps are identical across all compared methods. These details are necessary to establish that the gains are attributable to the blockwise Hadamard structure rather than uncontrolled differences in plasticity or optimization.
minor comments (2)
  1. [Abstract] Abstract: The phrase 'W0-free additive-control mean' is used without a parenthetical definition or forward reference; a one-sentence clarification would improve standalone readability.
  2. [§3] Notation: The relation between per-block rank r_b and the global LoRA rank r should be stated explicitly (e.g., total parameters = b · r_b · (d_in + d_out) for b² blocks of shape (d_out/b) × (d_in/b), versus LoRA's r · (d_in + d_out), so the budgets match at r = b · r_b) in a small comparison table or equation.
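The relation between r_b and r raised in the notation comment can be checked with a quick count. The sketch assumes each of the b² blocks has shape (d_out/b) × (d_in/b) and carries two factors of rank r_b; that layout and the function names are assumptions, as the page does not spell them out. Under it, the total is b · r_b · (d_in + d_out), which equals LoRA's r · (d_in + d_out) when r = b · r_b, consistent with the fixed budget r_b × b = 32 quoted in the figure captions.

```python
def lora_params(d_in, d_out, r):
    """Trainable parameters of a rank-r LoRA adapter: B is d_out x r, A is r x d_in."""
    return r * (d_in + d_out)

def blockwise_params(d_in, d_out, b, r_b):
    """Trainable parameters when each of the b*b blocks has its own rank-r_b factors."""
    assert d_in % b == 0 and d_out % b == 0, "sketch assumes b divides both dimensions"
    per_block = r_b * (d_out // b + d_in // b)
    return b * b * per_block

d_in = d_out = 4096          # illustrative layer size, not taken from the paper
budget = 32                  # r_b * b, as in the fixed-budget sweep
for r_b in (1, 2, 4, 8, 16, 32):
    b = budget // r_b
    print(r_b, b, blockwise_params(d_in, d_out, b, r_b), lora_params(d_in, d_out, budget))
```

Every row of the sweep prints the same two counts, so varying r_b at fixed r_b × b changes where the capacity sits without changing how many parameters are trained.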

Simulated Author's Rebuttal

2 responses · 0 unresolved

We thank the referee for the constructive feedback on our manuscript. We address each major comment below and indicate the revisions we will make to strengthen the presentation and analysis.

Point-by-point responses
  1. Referee: [§3] §3 (Method): The claim that the b×b blockwise low-rank Hadamard factors preserve a LoRA-equivalent total rank (same parameter budget) while producing the reported retention gains is load-bearing for the central contribution. Because the update is multiplicative (ΔW = W0 ⊙ (UV) per block), the realized update magnitude is scaled by the local |W0| values in each block. Pretrained LLM weights are heterogeneous across blocks, so this introduces spatially varying effective plasticity with no counterpart in additive LoRA. The synthetic target only verifies exact reconstruction for a contrived low-rank case and does not test whether the scaling distorts gradient flow or retention on the Llama-3.2-3B commonsense→arithmetic diagnostic. A quantitative comparison of effective per-block update norms or an ablation that normalizes by block magnitude would be required to attribute the 57.66% ret

    Authors: We agree that the multiplicative coupling introduces spatially varying scaling by local |W0| magnitudes, which differs from additive LoRA and is an intentional aspect of the W0-coupled design. This coupling is meant to leverage the pretrained weight structure for potentially better feature preservation during sequential adaptation. The synthetic experiment separately validates the rank-efficiency benefit of the blockwise parameterization. To address attribution of the retention results, we will add to the revised manuscript a quantitative comparison of effective per-block update norms across the Llama-3.2-3B diagnostic and an ablation that normalizes the Hadamard factors by block magnitude to isolate the contribution of the spatial partitioning. revision: yes

  2. Referee: [§4] §4 (Experiments, continual-learning diagnostic): The reported 57.66% first-stage retention and 15.23% improvement over the W0-free additive-control mean are presented without error bars, number of random seeds, or explicit confirmation that data splits, learning rates, and second-stage training steps are identical across all compared methods. These details are necessary to establish that the gains are attributable to the blockwise Hadamard structure rather than uncontrolled differences in plasticity or optimization.

    Authors: We acknowledge that the current reporting lacks these statistical and methodological details. In the revised version we will report error bars over multiple random seeds, state the number of seeds used, and explicitly confirm that data splits, learning rates, and second-stage training steps were held identical across all methods to support fair attribution of the observed retention differences. revision: yes
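The per-block update-norm comparison mentioned in the first response could be computed along these lines. This is a sketch, not the authors' planned analysis: the function names, the small `eps` guard, and the choice of Frobenius norms are assumptions.

```python
import numpy as np

def per_block_norms(M, b):
    """Frobenius norm of each block in a b x b partition of M; returns a (b, b) array."""
    m, n = M.shape
    assert m % b == 0 and n % b == 0, "sketch assumes b divides both dimensions"
    blocks = M.reshape(b, m // b, b, n // b)   # axes: block row, row, block col, col
    return np.sqrt((blocks ** 2).sum(axis=(1, 3)))

def relative_block_update(W0, dW, b, eps=1e-12):
    """Per-block ratio ||dW_ij||_F / ||W0_ij||_F (eps guards against empty blocks)."""
    return per_block_norms(dW, b) / (per_block_norms(W0, b) + eps)

rng = np.random.default_rng(0)
b = 2
# Toy frozen weight with heterogeneous block magnitudes.
scales = np.kron(np.array([[0.1, 1.0], [1.0, 10.0]]), np.ones((4, 4)))
W0 = scales * rng.standard_normal((8, 8))

hadamard_dW = W0 * 0.05                            # multiplicative update, constant mask
additive_dW = 0.05 * rng.standard_normal((8, 8))   # W0-free additive update

print(relative_block_update(W0, hadamard_dW, b))   # same ratio in every block
print(relative_block_update(W0, additive_dW, b))   # ratio varies with block magnitude
print(per_block_norms(hadamard_dW, b))             # absolute size follows |W0| per block
```

With a constant mask, the multiplicative update has the same relative size in every block while its absolute size tracks the local magnitude of W0; the additive update behaves the other way around. That contrast is the effect the referee asks the authors to quantify.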

Circularity Check

0 steps flagged

No significant circularity; empirical method with independent validation

full rationale

The paper introduces BoHA as a blockwise Hadamard adapter design and supports its claims through direct empirical comparisons on single-task and continual-learning benchmarks against LoRA and Hadamard baselines. The abstract states the partitioning and rank-matching property as a design feature that enables adapter-free inference, without deriving performance metrics from fitted parameters or self-referential equations. The synthetic reconstruction result is presented as a verification of the blockwise structure's expressivity rather than a prediction forced by the method's own inputs. No load-bearing step reduces a claimed outcome to a self-definition, fitted input renamed as prediction, or self-citation chain; the reported retention and plasticity gains are measured outcomes on held-out task sequences.

Axiom & Free-Parameter Ledger

1 free parameters · 0 axioms · 0 invented entities

The abstract introduces the blockwise grid and per-block Hadamard factors as the core modeling choice; no explicit free parameters beyond the per-block rank r_b are named, and no new physical entities are postulated.

free parameters (1)
  • per-block rank r_b
    The abstract states that BoHA at r_b=1 exactly reconstructs an update requiring rank b squared under global parameterization, implying r_b is a tunable design parameter matched to total budget.

pith-pipeline@v0.9.0 · 5871 in / 1262 out tokens · 36038 ms · 2026-05-18T13:41:39.142496+00:00 · methodology
