
arxiv: 2605.05676 · v1 · submitted 2026-05-07 · 💻 cs.CL · cs.AI

Decomposing the Basic Abilities of Large Language Models: Mitigating Cross-Task Interference in Multi-Task Instruct-Tuning

Pith reviewed 2026-05-08 11:09 UTC · model grok-4.3

classification 💻 cs.CL cs.AI
keywords large language models · multi-task instruct-tuning · cross-task interference · LoRA experts · basic abilities decomposition · orthogonal parameters · spherical clustering

The pith

Large language models can be decomposed into orthogonal basic abilities that tasks combine linearly, reducing cross-task interference in multi-task training.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper argues that multi-task instruct-tuning of large language models suffers from cross-task interference even when task-specific parameters are used, because many parameters remain shared across tasks. It reports that certain parameters are consistently co-activated and organize into base groups that behave like orthogonal basic abilities. The proposed BADIT method decomposes model parameters into high-singular-value LoRA experts for these abilities and enforces their orthogonality during training via spherical clustering of rank-1 components. A sympathetic reader would care because the reported SuperNI results exceed those of prior isolation approaches without requiring fully separate parameters for every task.
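The co-activation observation can be made concrete with a small sketch. This is a hypothetical illustration, not the paper's analysis code: it assumes each task comes with a boolean mask over parameters (how the paper decides that a parameter counts as activated is not specified on this page) and reports the Jaccard overlap between every pair of tasks. The function name `coactivation_overlap` and the toy masks are assumptions made for illustration.

```python
import numpy as np

def coactivation_overlap(active: np.ndarray) -> np.ndarray:
    """Jaccard overlap between the activated-parameter sets of each task pair.

    active: boolean array of shape (num_tasks, num_params); True marks a
    parameter (or parameter row) that the task activates. Assumed input format.
    """
    a = active.astype(np.int64)
    inter = a @ a.T                                  # |S_i ∩ S_j| for every task pair
    sizes = a.sum(axis=1)
    union = sizes[:, None] + sizes[None, :] - inter  # |S_i ∪ S_j|
    return np.where(union > 0, inter / np.maximum(union, 1), 0.0)

# Toy, hand-made masks (illustrative only, not data from the paper).
active = np.array([
    [1, 1, 0, 0, 1, 0],
    [1, 1, 0, 0, 0, 0],
    [0, 0, 1, 1, 0, 1],
], dtype=bool)
overlap = coactivation_overlap(active)
print(np.round(overlap, 3))
```

High off-diagonal entries would indicate parameters shared between tasks, which is the situation the paper identifies as the source of residual interference.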

Core claim

LLMs encode several orthogonal basic abilities, and any task can be represented as a linear combination of these abilities. BADIT decomposes LLM parameters into orthogonal high-singular-value LoRA experts representing basic abilities and dynamically enforces their orthogonality during training via spherical clustering of rank-1 components. This approach outperforms state-of-the-art methods on the SuperNI benchmark with six different LLMs by mitigating cross-task interference.

What carries the argument

BADIT decomposes LLM parameters into orthogonal high-singular-value LoRA experts representing basic abilities and dynamically enforces their orthogonality during training via spherical clustering of rank-1 components.
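A minimal sketch of what a high-singular-value decomposition into LoRA experts could look like, assuming a plain SVD of one weight matrix. The assignment of contiguous blocks of singular components to experts, the function name `split_principal`, and the `num_experts` and `rank` arguments are assumptions for illustration; the paper's actual grouping is learned dynamically and is not reproduced here.

```python
import numpy as np

def split_principal(W: np.ndarray, num_experts: int, rank: int):
    """Split W into a residual plus `num_experts` rank-`rank` factors (B_k, A_k).

    Assumption: expert k takes the k-th contiguous block of top singular components.
    """
    U, S, Vt = np.linalg.svd(W, full_matrices=False)
    top = num_experts * rank
    if top > len(S):
        raise ValueError("num_experts * rank exceeds the number of singular values")
    experts = []
    for k in range(num_experts):
        idx = slice(k * rank, (k + 1) * rank)
        root = np.sqrt(S[idx])
        B = U[:, idx] * root            # shape (d_out, rank)
        A = root[:, None] * Vt[idx]     # shape (rank, d_in)
        experts.append((B, A))
    # Residual built from the remaining low-singular-value components.
    W_res = (U[:, top:] * S[top:]) @ Vt[top:]
    return W_res, experts

rng = np.random.default_rng(0)
W = rng.standard_normal((8, 6))         # random stand-in for a weight matrix
W_res, experts = split_principal(W, num_experts=2, rank=2)
recon = W_res + sum(B @ A for B, A in experts)
print(np.allclose(recon, W))
```

Because the columns of `U` are orthonormal, experts built from disjoint singular components start out mutually orthogonal under the Frobenius inner product. Keeping them orthogonal once training updates the factors is the part that needs an explicit mechanism.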

Load-bearing premise

Parameters that are consistently co-activated across tasks naturally organize into orthogonal base groups representing basic abilities, and any task can be expressed as a linear combination of those abilities.
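One way to write this premise down, using notation assumed here rather than taken from the paper: let $W_c$ be the frozen residual weights, $B_k A_k$ the $k$-th LoRA expert out of $K$, and $\alpha_{t,k}$ a hypothetical mixing weight of expert $k$ for task $t$.

```latex
W = W_c + \sum_{k=1}^{K} B_k A_k ,
\qquad
\langle B_k A_k ,\, B_j A_j \rangle_F = 0 \quad (k \neq j),
\qquad
\Delta W_t = \sum_{k=1}^{K} \alpha_{t,k}\, B_k A_k .
```

Read this way, the premise has two separable parts: the orthogonality constraint between experts, and the claim that a task-specific update $\Delta W_t$ lies in their span. Evidence for the first does not by itself establish the second.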

What would settle it

If the decomposed LoRA experts fail to stay orthogonal after training or if measured gradient conflicts between tasks do not decrease while performance fails to exceed baselines on SuperNI, the claim that the decomposition mitigates cross-task interference would be falsified.
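The orthogonality half of this test can be sketched as a measurement of gradient angles within and between experts. This is a hypothetical illustration: it assumes the gradients of each expert's rank-1 components are available as flattened row vectors, and the helper names `angle_deg` and `mean_angle` are invented for this sketch.

```python
import numpy as np

def angle_deg(g1: np.ndarray, g2: np.ndarray) -> float:
    """Angle in degrees between two flattened gradient vectors."""
    cos = float(np.dot(g1, g2) / (np.linalg.norm(g1) * np.linalg.norm(g2)))
    return float(np.degrees(np.arccos(np.clip(cos, -1.0, 1.0))))

def mean_angle(G1: np.ndarray, G2=None) -> float:
    """Mean pairwise angle within G1 (intra-expert) or between G1 and G2 (inter-expert)."""
    if G2 is None:
        pairs = [(G1[i], G1[j]) for i in range(len(G1)) for j in range(i + 1, len(G1))]
    else:
        pairs = [(a, b) for a in G1 for b in G2]
    if not pairs:
        raise ValueError("need at least one pair of gradient vectors")
    return float(np.mean([angle_deg(a, b) for a, b in pairs]))

# Hand-made vectors for illustration only (not measurements from the paper).
expert_a = np.array([[1.0, 0.0, 0.0], [2.0, 0.1, 0.0]])
expert_b = np.array([[0.0, 1.0, 0.0], [0.0, 3.0, 0.2]])
print("intra-expert:", mean_angle(expert_a))
print("inter-expert:", mean_angle(expert_a, expert_b))
```

Under the paper's claim one would expect small intra-expert angles and inter-expert angles near 90 degrees throughout training; angles drifting away from 90 degrees would count against it.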

Figures

Figures reproduced from arXiv: 2605.05676 by Bing Wang, Changchun Li, Gang Niu, Jinjin Chi, Masashi Sugiyama, Ximing Li.

Figure 1
Figure 1: Shared task-specific neurons and parameters across different tasks. [Plot residue: task IDs against Expert IDs 1–8; panel titles Llama3-3B, Qwen3-4B, Gemma2-2B.]
Figure 2
Figure 2: Activated experts in MoE-based LLMs by different tasks. Adjacent text, truncated at both ends: "… subplots display parameter-level activations, where each column represents a specific task and indicates the number of activated parameters in the corresponding row of the parameter matrix for that task. Additionally, …"
Figure 3
Figure 3: Comparison with prior multi-task instruct-tuning methods. We decouple key LLM parameters into basic abilities (i.e., LoRA experts) and dynamically group their rank-1 components, enforcing orthogonality to represent distinct basic abilities. Adjacent text, truncated at both ends: "… in Eq. (1), we freeze the residual weights Wc and solely update the LoRA parameters. To further encourage each LoRA expert to specialize in a different basic ability d…"
Figure 4
Figure 4: Intra- and inter-expert gradient angles across epochs. Adjacent text, truncated at both ends: "… during training. Second, the DOG stage introduces extra overhead because it performs spherical K-means clustering and integer optimization on parameter gradients, which are computationally intensive operations primarily executed on CPUs. Although our approach entails a modest increase in training time, the resulting substantial gains in model performa…"
Figure 5
Figure 5: Shared task-specific neurons of MLP gate layers across different tasks. Adjacent text, truncated at both ends: "… seed to the values {1, 2, 3, 4, 5} to ensure reproducibility. G.5. Evaluation Metrics. Following standard practice in previous literature (Lopez-Paz & Ranzato, 2017; Zhao et al., 2024b), we adopt the following four metrics to comprehensively assess model performance: ROUGE measures the overall effectiveness after learning the entire …"
Figure 6
Figure 6: Shared task-specific parameters of MLP gate layers across different tasks. [Plot residue: task IDs against "Parameter Row IDs - Average", ticks from 16 to 3088.]
Figure 7
Figure 7: Shared task-specific parameters of query matrices in self-attention layers across different tasks.
Figure 8
Figure 8: Shared task-specific parameters of key matrices in self-attention layers across different tasks. [Plot residue: task IDs against "Parameter Row IDs - Average", ticks from 16 to 1040.]
Figure 9
Figure 9: Shared task-specific parameters of value matrices in self-attention layers across different tasks. Adjacent text, truncated at both ends: "… gradient angles, sensitivity analysis, and performance across different tasks. H.1. More Investigation Results. In this section, we employ the same experimental setup as Sec. 2 to investigate neuron and parameter sharing behaviors across different tasks in a broader range of modules. Specifically, we conduct …"
Figure 10
Figure 10: The number of parameters that activate each parameter column with positive or negative gradients across all tasks, respectively. Adjacent text, truncated at both ends: "… we identify systematic shifts in activation patterns from lower to higher layers. Specifically, activated parameters tend to migrate from higher to lower parameter IDs as depth increases, for instance, in the query matrix of Llama3-3B in …"
Figure 11
Figure 11: Intra- and inter-expert gradient angles of BADIT on 6 LLMs across epochs. Adjacent text, truncated at both ends: "…proximate orthogonality. In some cases, such as Qwen3-4B under Task Order B, the inter-expert angles remain consistently clustered around 90 degrees throughout training. Meanwhile, intra-expert gradient angles are consistently kept at relatively small values. These results collectively demonstrate the effectiveness of our DOG metho…"
Figure 12
Figure 12: Sensitivity analysis of the parameters K and r. [Plot residue: a list of MMLU subject labels (anatomy, clinical_knowledge, college_biology, …), apparently spilled from an adjacent plot.]
Figure 13
Figure 13: Visualizing basic abilities. Adjacent text, truncated at the end: "H.4. Visualizing Basic Abilities. To further explore whether these decomposed LoRAs represent distinct skills, we conduct a toy experiment on SuperNI (K=8, top-4 experts activated) and evaluate expert activation on 20 MMLU tasks using our proposed BADIT. Their activated patterns are shown in …"
Original abstract

Recently, the prominent performance of large language models (LLMs) has been largely driven by multi-task instruct-tuning. Unfortunately, this training paradigm suffers from a key issue, named cross-task interference, due to conflicting gradients over shared parameters among different tasks. Some previous methods mitigate this issue by isolating task-specific parameters, e.g., task-specific neuron selection and mixture-of-experts. In this paper, we empirically reveal that the cross-task interference still exists for the existing solutions because of many parameters also shared by different tasks, and accordingly, we propose a novel solution, namely Basic Abilities Decomposition for multi-task Instruct-Tuning (BADIT). Specifically, we empirically find that certain parameters are consistently co-activated, and that co-activated parameters naturally organize into base groups. This motivates us to analogize that LLMs encode several orthogonal basic abilities, and that any task can be represented as a linear combination of these abilities. Accordingly, we propose BADIT that decomposes LLM parameters into orthogonal high-singular-value LoRA experts representing basic abilities, and dynamically enforces their orthogonality during training via spherical clustering of rank-1 components. We conduct extensive experiments on the SuperNI benchmark with 6 LLMs, and empirical results demonstrate that BADIT can outperform SOTA methods and mitigate the degree of cross-task interference.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Referee Report

3 major / 2 minor

Summary. The paper claims that multi-task instruct-tuning of LLMs suffers from cross-task interference due to shared parameters, and proposes BADIT to mitigate it by empirically identifying consistently co-activated parameters that organize into base groups. This motivates an analogy to orthogonal basic abilities, with any task as a linear combination thereof. BADIT decomposes parameters into orthogonal high-singular-value LoRA experts via spherical clustering of rank-1 components during training, and reports outperformance over SOTA methods on the SuperNI benchmark across 6 LLMs while reducing interference.

Significance. If the results hold and the orthogonality enforcement demonstrably isolates basic abilities rather than acting as generic regularization, the work could provide a mechanistic explanation for interference and a practical decomposition technique for multi-task tuning. The empirical co-activation observation and use of LoRA experts are concrete strengths, but the load-bearing assumption that co-activation patterns inherently yield orthogonal linear decompositions requires stronger isolation to elevate the contribution beyond existing MoE-style adapters.

major comments (3)
  1. [Abstract, §3] Abstract and §3 (method): The central claim that 'co-activated parameters naturally organize into base groups', which motivates the orthogonal-basic-abilities view, is presented as an empirical discovery, yet BADIT procedurally enforces orthogonality via spherical clustering on high-singular-value LoRA rank-1 components. Without an ablation that compares the full pipeline against a non-orthogonal clustering baseline or a standard LoRA-MoE (to isolate whether interference reduction stems from the basic-abilities structure rather than from added capacity or the clustering loss), the results on SuperNI cannot confirm the linear-combination view or rule out alternative explanations.
  2. [§4] §4 (experiments): The reported outperformance and interference mitigation on SuperNI lack details on error bars, statistical significance across the 6 LLMs, or held-out task-composition tests that would verify whether tasks are indeed linear combinations of the learned orthogonal experts. If the number of basic abilities is a free hyperparameter (as implied by the clustering step), the method's advantage over prior task-specific neuron or MoE approaches may not generalize beyond the tuned setting.
  3. [§2, §3.2] §2 (related work) and §3.2: The distinction from prior isolation methods (task-specific neurons, MoE) is that BADIT targets shared parameters via co-activation clustering, but the paper does not quantify residual interference after decomposition (e.g., via gradient conflict metrics before/after) or show that the high-singular-value experts are uniquely interpretable as 'basic abilities' rather than just low-rank adapters.
minor comments (2)
  1. [§3.3] Notation for the spherical clustering objective and the dynamic orthogonality enforcement should be formalized with an equation, as the procedural description leaves the exact loss term and rank-1 component extraction ambiguous.
  2. [§4, Tables] Figure captions and tables reporting SuperNI results should include the exact number of basic abilities used per model and any sensitivity analysis, to aid reproducibility.
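As a reference point for the formalization requested in minor comment 1, a generic spherical k-means can be sketched as follows. This is the textbook algorithm under stated assumptions, not the paper's procedure: rows of `X` stand in for flattened rank-1 component gradients, all rows are assumed nonzero, and the integer-optimization step that the paper pairs with clustering is omitted. The function name and its defaults are illustrative.

```python
import numpy as np

def spherical_kmeans(X: np.ndarray, k: int, iters: int = 50, seed: int = 0):
    """Cluster directions: rows of X are compared by cosine similarity only."""
    rng = np.random.default_rng(seed)
    X = X / np.linalg.norm(X, axis=1, keepdims=True)    # project rows onto the unit sphere
    centers = X[rng.choice(len(X), size=k, replace=False)].copy()
    for _ in range(iters):
        labels = np.argmax(X @ centers.T, axis=1)       # assign to the most similar center
        for c in range(k):
            total = X[labels == c].sum(axis=0)
            norm = np.linalg.norm(total)
            if norm > 0:                                # keep the old center for an empty cluster
                centers[c] = total / norm               # re-normalized mean direction
    labels = np.argmax(X @ centers.T, axis=1)
    return labels, centers

# Hand-made directions near two axes (illustrative only).
X = np.array([[1, 0.05], [1, -0.05], [1, 0.1],
              [0.05, 1], [-0.05, 1], [0.1, 1]], dtype=float)
labels, centers = spherical_kmeans(X, k=2)
print(labels)
```

The objective this maximizes is the sum over points of the cosine similarity to the assigned unit-norm center. Stating the paper's objective in the same form would make clear which additional terms enforce orthogonality between groups.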

Simulated Author's Rebuttal

3 responses · 0 unresolved

We thank the referee for the detailed and constructive comments. We have revised the manuscript to address the concerns about empirical validation, statistical rigor, and clearer distinctions from prior work, while maintaining the core contributions of the co-activation observation and orthogonal decomposition.

Point-by-point responses
  1. Referee: [Abstract, §3] Abstract and §3 (method): The central claim that 'co-activated parameters naturally organize into base groups', which motivates the orthogonal-basic-abilities view, is presented as an empirical discovery, yet BADIT procedurally enforces orthogonality via spherical clustering on high-singular-value LoRA rank-1 components. Without an ablation that compares the full pipeline against a non-orthogonal clustering baseline or a standard LoRA-MoE (to isolate whether interference reduction stems from the basic-abilities structure rather than from added capacity or the clustering loss), the results on SuperNI cannot confirm the linear-combination view or rule out alternative explanations.

    Authors: We agree that the presentation of the co-activation finding as directly motivating the orthogonal model requires stronger isolation of effects. The co-activation patterns are an empirical observation from our analysis of shared parameters, but orthogonality is enforced to operationalize the linear-combination hypothesis. In the revised version, we have added an ablation in §4 comparing the full BADIT pipeline to (i) a non-orthogonal k-means clustering baseline on the same rank-1 components and (ii) a standard LoRA-MoE without spherical clustering. These show that the orthogonality constraint yields further reductions in gradient conflicts and higher SuperNI scores than either baseline, indicating benefits beyond capacity or generic clustering. We have also revised the abstract and §3 to clarify this distinction between the empirical observation and the modeling choice. revision: yes

  2. Referee: [§4] §4 (experiments): The reported outperformance and interference mitigation on SuperNI lack details on error bars, statistical significance across the 6 LLMs, or held-out task-composition tests that would verify whether tasks are indeed linear combinations of the learned orthogonal experts. If the number of basic abilities is a free hyperparameter (as implied by the clustering step), the method's advantage over prior task-specific neuron or MoE approaches may not generalize beyond the tuned setting.

    Authors: We accept that the experimental section would benefit from greater statistical detail and generalization checks. The revised §4 now reports error bars as standard deviation over three random seeds for all main results on the six LLMs, along with paired t-test p-values confirming statistical significance of the improvements. We have also added held-out task-composition experiments: we hold out certain ability combinations during clustering, then evaluate on newly composed tasks and show that performance is consistent with linear recombination of the experts. The number of basic abilities is not a free hyperparameter but is determined by the spherical clustering on observed co-activation patterns; we include a sensitivity analysis demonstrating robustness across 8–20 clusters and fair comparisons to prior methods under matched settings. revision: yes

  3. Referee: [§2, §3.2] §2 (related work) and §3.2: The distinction from prior isolation methods (task-specific neurons, MoE) is that BADIT targets shared parameters via co-activation clustering, but the paper does not quantify residual interference after decomposition (e.g., via gradient conflict metrics before/after) or show that the high-singular-value experts are uniquely interpretable as 'basic abilities' rather than just low-rank adapters.

    Authors: We have expanded §2 to emphasize that BADIT operates on consistently shared parameters identified via cross-task co-activation, in contrast to per-task neuron selection or task-specific expert routing. In the updated §3.2 and new results in §4, we now report quantitative residual interference via average cosine similarity of task gradients before versus after decomposition, showing a clear reduction. For interpretability, we added activation heatmaps and qualitative case studies illustrating that the high-singular-value experts align with distinct basic abilities (e.g., one expert activates strongly on arithmetic tasks while another on knowledge-retrieval tasks). While we cannot claim absolute uniqueness of the decomposition, the empirical patterns support the basic-abilities analogy beyond generic low-rank adapters. revision: partial
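The statistical reporting promised in the second response (standard deviation over seeds plus a paired t-test) amounts to the computation sketched below. It uses only the Python standard library and assumes one score per seed for each method, matched by seed. The function name `paired_t` and the toy numbers are placeholders, not results from the paper.

```python
import math
import statistics

def paired_t(scores_a, scores_b):
    """Paired t-statistic and degrees of freedom for per-seed scores of two methods."""
    if len(scores_a) != len(scores_b) or len(scores_a) < 2:
        raise ValueError("need matched score lists with at least two seeds")
    diffs = [a - b for a, b in zip(scores_a, scores_b)]
    n = len(diffs)
    mean_diff = statistics.fmean(diffs)
    sd_diff = statistics.stdev(diffs)          # sample standard deviation (n - 1)
    if sd_diff == 0:
        raise ValueError("all paired differences are identical; t is undefined")
    return mean_diff / (sd_diff / math.sqrt(n)), n - 1

# Placeholder numbers chosen for easy arithmetic (not scores from the paper).
method_a = [1.0, 2.0, 4.0]
method_b = [0.0, 0.0, 1.0]
t_stat, dof = paired_t(method_a, method_b)
print(f"mean±std A: {statistics.fmean(method_a):.2f}±{statistics.stdev(method_a):.2f}")
print(f"t = {t_stat:.3f} with {dof} degrees of freedom")
```

With only three seeds there are two degrees of freedom, so the test has little power. Converting the statistic to a p-value requires the t-distribution, for example via `scipy.stats.ttest_rel`.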

Circularity Check

0 steps flagged

No significant circularity in the derivation chain

full rationale

The paper motivates BADIT from an empirical observation that co-activated parameters organize into base groups, analogizing this to orthogonal basic abilities with tasks as linear combinations. It then defines a procedural method using high-singular-value LoRA experts and spherical clustering to enforce orthogonality during training. This is not self-definitional or a fitted prediction by construction; the central claims are evaluated via experiments on SuperNI with multiple LLMs rather than reducing tautologically to inputs or self-citations. No load-bearing self-citations, uniqueness theorems, or ansatzes from prior author work appear in the text. The approach remains self-contained against external benchmarks.

Axiom & Free-Parameter Ledger

1 free parameter · 1 axiom · 1 invented entity

The central claim rests on the unproven interpretation that co-activated parameters form orthogonal basic abilities; this is postulated rather than derived from first principles or external benchmarks.

free parameters (1)
  • number of basic abilities / LoRA experts
    The decomposition requires choosing or fitting the number of orthogonal components; this choice directly affects the representation of tasks as linear combinations.
axioms (1)
  • [domain assumption] Co-activated parameters organize into orthogonal base groups that represent independent basic abilities
    Invoked to justify the decomposition step; appears in the motivation paragraph of the abstract.
invented entities (1)
  • basic abilities (no independent evidence)
    purpose: To serve as orthogonal building blocks that tasks combine linearly
    New conceptual entity introduced to explain the co-activation observation; no independent falsifiable handle provided.



Reference graph

Works this paper leans on

92 extracted references · 92 canonical work pages

  1. Shihan Dou, Enyu Zhou, Yan Liu, Songyang Gao, Wei Shen, Limao Xiong, Yuhao Zhou, Xiao Wang, Zhiheng Xi, Xiaoran Fan, Shiliang Pu, Jiang Zhu, Rui Zheng, Tao Gui, Qi Zhang, Xuanjing Huang. Annual Meeting of the Association for Computational Linguistics.
  2. Dengchun Li, Yingzi Ma, Naizheng Wang, Zhiyuan Cheng, Lei Duan, Jie Zuo, Cal Yang, Mingjie Tang. CoRR.
  3. Llama 3 Model Card. 2024.
  4. Rui Pan, Xiang Liu, Shizhe Diao, Renjie Pi, Jipeng Zhang, Chi Han, Tong Zhang. Advances in Neural Information Processing Systems.
  5. Fanxu Meng, Zhaohui Wang, Muhan Zhang. Advances in Neural Information Processing Systems.
  6. Edward J. Hu, Yelong Shen, Phillip Wallis, Zeyuan Allen. International Conference on Learning Representations.
  7. Dan Hendrycks, Collin Burns, Steven Basart, Andy Zou, Mantas Mazeika, Dawn Song, Jacob Steinhardt. International Conference on Learning Representations.
  8. Haonan Li, Yixuan Zhang, Fajri Koto, Yifei Yang, Hai Zhao, Yeyun Gong, Nan Duan, Timothy Baldwin. Findings of the Association for Computational Linguistics.
  9. Zahra Fatemi, Chen Xing, Wenhao Liu, Caiming Xiong. Annual Meeting of the Association for Computational Linguistics.
  10. Ze. Parameter-Efficient Mixture-of-Experts Architecture for Pre-trained Language Models.
  11. Shwai He, Run. Merging Experts into One: Improving Computational Efficiency of Mixture of Experts.
  12. Catastrophic Interference in Connectionist Networks: The Sequential Learning Problem. Psychology of Learning and Motivation-Advances in Research and Theory.
  13. Noam Shazeer, Azalia Mirhoseini, Krzysztof Maziarz, Andy Davis, Quoc V. Le, Geoffrey E. Hinton, Jeff Dean. International Conference on Learning Representations.
  14. Yun Luo, Zhen Yang, Fandong Meng, Yafu Li, Jie Zhou, Yue Zhang. CoRR.
  15. Xisen Jin, Xiang Ren. International Conference on Machine Learning.
  16. Zhicheng Wang, Yufang Liu, Tao Ji, Xiaoling Wang, Yuanbin Wu, Congcong Jiang, Ye Chao, Zhencong Han, Ling Wang, Xu Shao, Wenqiu Zeng. Annual Meeting of the Association for Computational Linguistics.
  17. Xiao Wang, Tianze Chen, Qiming Ge, Han Xia, Rong Bao, Rui Zheng, Qi Zhang, Tao Gui, Xuanjing Huang. Findings of the Association for Computational Linguistics.
  18. Thomas Scialom, Tuhin Chakrabarty, Smaranda Muresan. Empirical Methods in Natural Language Processing.
  19. Hanul Shin, Jung Kwon Lee, Jaehong Kim, Jiwon Kim. Advances in Neural Information Processing Systems.
  20. Liyuan Wang, Xingxing Zhang, Hang Su, Jun Zhu.
  21. Haizhou Shi, Zihao Xu, Hengyi Wang, Weiyi Qin, Wenyuan Wang, Yibin Wang, Zifeng Wang, Sayna Ebrahimi, Hao Wang.
  22. Mehrdad Farajtabar, Navid Azizan, Alex Mott, Ang Li. International Conference on Artificial Intelligence and Statistics.
  23. Guanxiong Zeng, Yang Chen, Bo Cui, Shan Yu. Nature Machine Intelligence.
  24. Linglan Zhao, Xuerui Zhang, Ke Yan, Shouhong Ding, Weiran Huang. Advances in Neural Information Processing Systems.
  25. Da. Continual Learning with Pre-Trained Models: […]. International Joint Conference on Artificial Intelligence.
  26. Wenfeng Feng, Chuzhan Hao, Yuewei Zhang, Yu Han, Hao Wang. International Conference on Computational Linguistics.
  27. Yufei Ma, Zihan Liang, Huangyu Dai, Ben Chen, Dehong Gao, Zhuoran Ran, Zihan Wang, Linbo Jin, Wen Jiang, Guannan Zhang, Xiaoyan Cai, Libin Yang. Conference on Empirical Methods in Natural Language Processing.
  28. Bing Wang, Liang Ding, Qihuang Zhong, Ximing Li, Dacheng Tao. International Conference on Computational Linguistics.
  29. Qidong Liu, Xian Wu, Xiangyu Zhao, Yuanshao Zhu, Derong Xu, Feng Tian, Yefeng Zheng. International.
  30. Rajarshi Saha, Naomi Sagan, Varun Srivastava, Andrea Goldsmith, Mert Pilanci. Advances in Neural Information Processing Systems.
  31. Kaiming He, Xiangyu Zhang, Shaoqing Ren, Jian Sun.
  32. Jianheng Huang, Leyang Cui, Ante Wang, Chengyi Yang, Xinting Liao, Linfeng Song, Junfeng Yao, Jinsong Su. Annual Meeting of the Association for Computational Linguistics.
  33. Hongyu Li, Liang Ding, Meng Fang, Dacheng Tao. Findings of the Association for Computational Linguistics.
  34. Yifan Wang, Yafei Liu, Chufan Shi, Haoling Li, Chen Chen, Haonan Lu, Yujiu Yang. Conference of the North American Chapter of the Association for Computational Linguistics.
  35. Yibo Yang, Xiaojie Li, Zhongzhu Zhou, Shuaiwen Song, Jianlong Wu, Liqiang Nie, Bernard Ghanem. Advances in Neural Information Processing Systems.
  36. Gangwei Jiang, Caigao Jiang, Zhaoyi Li, Siqiao Xue, Jun Zhou, Linqi Song, Defu Lian, Yin Wei. International Conference on Learning Representations.
  37. Joel Jang, Seungone Kim, Seonghyeon Ye, Doyoung Kim, Lajanugen Logeswaran, Moontae Lee, Kyungjae Lee, Minjoon Seo. International Conference on Machine Learning.
  38. Weixiang Zhao, Shilong Wang, Yulin Hu, Yanyan Zhao, Bing Qin, Xuanyu Zhang, Qing Yang, Dongliang Xu, Wanxiang Che. Annual Meeting of the Association for Computational Linguistics.
  39. Xiang Lisa Li, Percy Liang. Annual Meeting of the Association for Computational Linguistics.
  40. Shih. DoRA: Weight-Decomposed Low-Rank Adaptation.
  41. Tim Dettmers, Artidoro Pagnoni, Ari Holtzman, Luke Zettlemoyer. Advances in Neural Information Processing Systems.
  42. Albert Q. Jiang, Alexandre Sablayrolles, Antoine Roux, Arthur Mensch, Blanche Savary, Chris Bamford, Devendra Singh Chaplot, Diego de Las Casas, Emma Bou Hanna, Florian Bressand, et al. CoRR.
  43. Aixin Liu, Bei Feng, Bing Xue, Bingxuan Wang, Bochao Wu, Chengda Lu, Chenggang Zhao, Chengqi Deng, Chenyu Zhang, Chong Ruan, et al. CoRR.
  44. Yang Luo, Xiaozhe Ren, Zangwei Zheng, Zhuo Jiang, Xin Jiang, Yang You. Annual Meeting of the Association for Computational Linguistics.
  45. Quzhe Huang, Zhenwei An, Nan Zhuang, Mingxu Tao, Chen Zhang, Yang Jin, Kun Xu, Liwei Chen, Songfang Huang, Yansong Feng. Annual Meeting of the Association for Computational Linguistics.
  46. Jiaqi Ma, Zhe Zhao, Xinyang Yi, Jilin Chen, Lichan Hong, Ed H. Chi.
  47. Zeren Chen, Ziqin Wang, Zhen Wang, Huayang Liu, Zhenfei Yin, Si Liu, Lu Sheng, Wanli Ouyang, Jing Shao. International Conference on Learning Representations.
  48. David Rolnick, Arun Ahuja, Jonathan Schwarz, Timothy P. Lillicrap, Gregory Wayne. Advances in Neural Information Processing Systems.
  49. David Lopez. Gradient Episodic Memory for Continual Learning.
  50. Cyprien de Masson d'Autume, Sebastian Ruder, Lingpeng Kong, Dani Yogatama. Advances in Neural Information Processing Systems.
  51. Jaehong Yoon, Divyam Madaan, Eunho Yang, Sung Ju Hwang. International Conference on Learning Representations.
  52. Rahaf Aljundi, Eugene Belilovsky, Tinne Tuytelaars, Laurent Charlin, Massimo Caccia, Min Lin, Lucas Page. Online Continual Learning with Maximal Interfered Retrieval.
  53. Liyuan Wang, Bo Lei, Qian Li, Hang Su, Jun Zhu, Yi Zhong.
  54. [54]

    Bagdanov and Shangling Jui and Joost van de Weijer , title =

    Xialei Liu and Chenshen Wu and Mikel Menta and Luis Herranz and Bogdan Raducanu and Andrew D. Bagdanov and Shangling Jui and Joost van de Weijer , title =

  55. [55]

    CoRR , volume =

    Bartosz Cywinski and Kamil Deja and Tomasz Trzcinski and Bartlomiej Twardowski and Lukasz Kucinski , title =. CoRR , volume =

  56. [56]

    International Conference on Learning Representations , year =

    Fan. International Conference on Learning Representations , year =

  57. [57]

    International Conference on Learning Representations , year =

    Gobinda Saha and Isha Garg and Kaushik Roy , title =. International Conference on Learning Representations , year =

  58. [58]

    Advances in Neural Information Processing Systems , year =

    Ang Bian and Wei Li and Hangjie Yuan and Chengrong Yu and Mang Wang and Zixiang Zhao and Aojun Lu and Pengliang Ji and Tao Feng , title =. Advances in Neural Information Processing Systems , year =

  59. [59]

    International Conference on Machine Learning , year =

    Chenghao Fan and Zhenyi Lu and Sichen Liu and Chengfeng Gu and Xiaoye Qu and Wei Wei and Yu Cheng , title =. International Conference on Machine Learning , year =

  60. [60]

    Pattern Recognition , volume =

    Cheng Chen and Lianli Gao and Pengpeng Zeng and Mingsheng Cao and Heng Tao Shen , title =. Pattern Recognition , volume =

  61. [61]

    Vechev and Kristina Toutanova , title =

    Anton Alexandrov and Veselin Raychev and Mark Mueller and Ce Zhang and Martin T. Vechev and Kristina Toutanova , title =. Findings of the Association for Computational Linguistics:

  62. [62]

    Conference of the Nations of the Americas Chapter of the Association for Computational Linguistics , pages =

    Hanqing Wang and Yixia Li and Shuo Wang and Guanhua Chen and Yun Chen , title =. Conference of the Nations of the Americas Chapter of the Association for Computational Linguistics , pages =

  63. [63]

    International Conference on Machine Learning , year =

    Jiawei Zhao and Zhenyu Zhang and Beidi Chen and Zhangyang Wang and Anima Anandkumar and Yuandong Tian , title =. International Conference on Machine Learning , year =

  64. [64]

    Gomez and Lukasz Kaiser and Illia Polosukhin , title =

    Ashish Vaswani and Noam Shazeer and Niki Parmar and Jakob Uszkoreit and Llion Jones and Aidan N. Gomez and Lukasz Kaiser and Illia Polosukhin , title =. Advances in Neural Information Processing Systems , pages =

  65. [65]

    Advances in Neural Information Processing Systems , year =

    Sen Lin and Li Yang and Deliang Fan and Junshan Zhang , title =. Advances in Neural Information Processing Systems , year =

  66. [66]

    International Conference on Learning Representations , year =

    Sen Lin and Li Yang and Deliang Fan and Junshan Zhang , title =. International Conference on Learning Representations , year =

  67. [67]

    Conference on Empirical Methods in Natural Language Processing , pages =

    Yizhong Wang and Swaroop Mishra and Pegah Alipoormolabashi and Yeganeh Kordi and Amirreza Mirzaei and Atharva Naik and Arjun Ashok and Arut Selvan Dhanasekaran and Anjana Arunkumar and David Stap and others , title =. Conference on Empirical Methods in Natural Language Processing , pages =

  68. [68]

    Liu and Ana Marasovic and Noah A

    Pradeep Dasigi and Nelson F. Liu and Ana Marasovic and Noah A. Smith and Matt Gardner , title =. Conference on Empirical Methods in Natural Language Processing , pages =

  69. [69]

    International Conference on Learning Representations , year =

    Anastasia Razdaibiedina and Yuning Mao and Rui Hou and Madian Khabsa and Mike Lewis and Amjad Almahairi , title =. International Conference on Learning Representations , year =

  70. [70]

    Conference on Empirical Methods in Natural Language Processing , pages =

    Ran Song and Shizhu He and Shuting Jiang and Yantuan Xian and Shengxiang Gao and Kang Liu and Zhengtao Yu , title =. Conference on Empirical Methods in Natural Language Processing , pages =

  71. [71]

    International Conference on Computational Linguistics , pages =

    Yongqi Leng and Deyi Xiong , title =. International Conference on Computational Linguistics , pages =

  72. [72]

    CoRR , volume =

    Xinyu Tang and Zhihao Lv and Xiaoxue Cheng and Junyi Li and Wayne Xin Zhao and Zujie Wen and Zhiqiang Zhang and Jun Zhou , title =. CoRR , volume =

  73. [73]

    Shixiang Tang and Dapeng Chen and Jinguo Zhu and Shijie Yu and Wanli Ouyang , title =

  74. [74]

    CoRR , volume =

    Shen Yuan and Yin Zheng and Taifeng Wang and Binbin Liu and Hongteng Xu , title =. CoRR , volume =

  75. [75]

    Annual Meeting of the Association for Computational Linguistics , pages =

    Damai Dai and Li Dong and Yaru Hao and Zhifang Sui and Baobao Chang and Furu Wei , title =. Annual Meeting of the Association for Computational Linguistics , pages =

  76. [76]

    Conference on Empirical Methods in Natural Language Processing , pages =

    Mor Geva and Roei Schuster and Jonathan Berant and Omer Levy , title =. Conference on Empirical Methods in Natural Language Processing , pages =

  77. [77]

    CoRR , volume =

    An Yang and Anfeng Li and Baosong Yang and Beichen Zhang and Binyuan Hui and Bo Zheng and Bowen Yu and Chang Gao and Chengen Huang and Chenxu Lv and others , title =. CoRR , volume =

  78. [78]

    OLoRA: Orthonormal Low-Rank Adaptation of Large Language Models , journal =

    Kerim B. OLoRA: Orthonormal Low-Rank Adaptation of Large Language Models , journal =

  79. [79]

    Advances in Neural Information Processing Systems , year =

    Michael Matena and Colin Raffel , title =. Advances in Neural Information Processing Systems , year =

  80. [80]

    Woodland , title =

    Xiaodong Wu and Wenyi Yu and Chao Zhang and Philip C. Woodland , title =. Advances in Neural Information Processing Systems , year =
