Dynamic Model Merging Made Slim
Pith reviewed 2026-05-20 15:12 UTC · model grok-4.3
The pith
DiDi-Merging achieves dynamic model merging results with just 1.24 times the parameters of a single fine-tuned model by optimizing ranks inside low-rank modules.
A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.
Core claim
DiDi-Merging formulates parameter budgeting as differentiable rank optimization in low-rank modules and introduces a data-free refinement step to recover task fidelity, allowing the merged model to match prior dynamic baselines at only 1.24 times the parameters of one fine-tuned model and surpass them at 1.4 times while remaining far more compact than approaches that require over 2 times storage.
What carries the argument
Differentiable rank optimization inside low-rank modules that allocates capacity between shared and expert parameters.
If this is right
- Matches the accuracy of prior dynamic merging methods at 1.24 times the storage of one fine-tuned model
- Exceeds those methods at 1.4 times single-model size
- Uses substantially less total storage than dynamic methods that exceed twice the size
- Operates without original training data and across vision, language, and multimodal domains
Where Pith is reading between the lines
- The reduced footprint could make merging many more tasks feasible on memory-limited hardware
- The data-free refinement property opens use in settings where training data must stay private
- The rank-allocation idea might combine with quantization or pruning for further size cuts
Load-bearing premise
Differentiable rank optimization can find an allocation of shared versus expert parameters such that a later data-free refinement step will restore full task performance.
What would settle it
A controlled test on held-out tasks where the 1.24-times-size merged model falls measurably below baseline accuracy and the data-free refinement step fails to close the gap.
Figures
read the original abstract
Model merging enables the reuse of fine-tuned models without joint training or access to original data. Dynamic merging further improves flexibility by selectively activating task-relevant parameters and efficiently composing experts across multiple tasks. However, existing dynamic methods either maintain a full shared model with tiny experts or allocate excessive capacity to experts, leading to suboptimal accuracy--efficiency trade-offs. To address this, we propose DiDi-Merging, a slim dynamic merging framework that leverages differentiable rank allocation to balance shared and expert parameters. By formulating parameter budgeting as differentiable rank optimization in low-rank modules and introducing a data-free refinement step to recover task fidelity, DiDi-Merging matches prior dynamic baselines at only 1.24x the parameters of a single fine-tuned model and surpasses them at 1.4x, substantially more compact than methods requiring > 2x storage. DiDi-Merging applies across vision, language, and multimodal tasks.
Editorial analysis
A structured set of objections, weighed in public.
Referee Report
Summary. The paper claims to introduce DiDi-Merging, a dynamic model merging method that uses differentiable rank allocation in low-rank modules to balance shared and expert parameters more efficiently than existing approaches. It adds a data-free refinement step to recover performance. The key result is matching prior methods at 1.24 times the parameters of one fine-tuned model and surpassing at 1.4 times, with applications in vision, language, and multimodal domains.
Significance. Should the empirical claims be substantiated, this would be a notable contribution to model merging literature by demonstrating a slimmer dynamic merging strategy that reduces storage requirements while maintaining or improving performance. The differentiable budgeting and data-free aspect address practical constraints in multi-task learning scenarios.
major comments (2)
- Abstract: Performance numbers are stated without any mention of experimental setup, datasets, number of runs, or error bars. This is a load-bearing issue for assessing whether the differentiable rank optimization and data-free refinement deliver the claimed efficiency gains.
- Data-free refinement procedure: The refinement step is presented as recovering task fidelity using proxy signals without original data. However, there is no analysis or benchmarks quantifying the fidelity recovery gap or identifying when it fails (e.g., dissimilar tasks), which directly impacts the validity of the accuracy claims at 1.24x and 1.4x parameters.
minor comments (2)
- Consider adding a table comparing parameter counts and performance metrics across all baselines for clarity.
- Ensure consistent use of symbols for rank variables and allocation parameters throughout the manuscript.
Simulated Author's Rebuttal
We thank the referee for the constructive and detailed feedback. We address each major comment below and have revised the manuscript accordingly to improve clarity and strengthen the presentation of our results.
read point-by-point responses
-
Referee: Abstract: Performance numbers are stated without any mention of experimental setup, datasets, number of runs, or error bars. This is a load-bearing issue for assessing whether the differentiable rank optimization and data-free refinement deliver the claimed efficiency gains.
Authors: We agree that the abstract would benefit from additional context on the experimental setup. In the revised manuscript, we have expanded the abstract to note that results are reported on standard vision (e.g., CIFAR-100, ImageNet), language (GLUE), and multimodal (VQAv2) benchmarks, averaged over 3 independent runs with standard deviations, and that the 1.24x and 1.4x parameter budgets are measured relative to a single fine-tuned model. This provides the necessary details to evaluate the efficiency claims without exceeding abstract length limits. revision: yes
-
Referee: Data-free refinement procedure: The refinement step is presented as recovering task fidelity using proxy signals without original data. However, there is no analysis or benchmarks quantifying the fidelity recovery gap or identifying when it fails (e.g., dissimilar tasks), which directly impacts the validity of the accuracy claims at 1.24x and 1.4x parameters.
Authors: We acknowledge the value of a more explicit analysis of the data-free refinement's robustness. While the original manuscript shows empirical recovery on the primary task sets, it does not include a dedicated quantification of the fidelity gap or failure cases for dissimilar tasks. In the revision, we have added an appendix section with new benchmarks that measure the performance gap before and after refinement across task similarity levels, including highly dissimilar combinations, along with discussion of proxy signal limitations. This directly supports the validity of the reported accuracy at the stated parameter budgets. revision: yes
Circularity Check
No circularity: empirical method with independent optimization objective
full rationale
The paper introduces DiDi-Merging as a new framework that formulates parameter budgeting via differentiable rank optimization inside low-rank modules, followed by a data-free refinement step. These are presented as algorithmic contributions rather than derivations that reduce to prior inputs by construction. Performance numbers (1.24x and 1.4x parameter efficiency) are reported as experimental outcomes on vision, language, and multimodal tasks, not as predictions forced by fitting or self-citation. No load-bearing step equates the claimed result to its own fitted parameters or renames a known pattern; the central premise relies on the proposed optimization and refinement procedure, which is externally falsifiable via accuracy measurements.
Axiom & Free-Parameter Ledger
free parameters (1)
- rank allocation variables
axioms (1)
- domain assumption Low-rank decomposition can approximate task-specific parameter updates without significant loss of expressivity
Lean theorems connected to this paper
-
IndisputableMonolith/Cost/FunctionalEquation.leanwashburn_uniqueness_aczel unclear?
unclearRelation between the paper passage and the cited Recognition theorem.
By formulating parameter budgeting as differentiable rank optimization in low-rank modules and introducing a data-free refinement step to recover task fidelity
What do these tags mean?
- matches
- The paper's claim is directly supported by a theorem in the formal canon.
- supports
- The theorem supports part of the paper's argument, but the paper may add assumptions or extra steps.
- extends
- The paper goes beyond the formal theorem; the theorem is a base layer rather than the whole result.
- uses
- The paper appears to rely on the theorem as machinery.
- contradicts
- The paper's claim conflicts with a theorem or certificate in the canon.
- unclear
- Pith found a possible connection, but the passage is too broad, indirect, or ambiguous to say the theorem truly supports the claim.
Reference graph
Works this paper leans on
-
[1]
Program Synthesis with Large Language Models
J. Austin, A. Odena, M. Nye, M. Bosma, H. Michalewski, D. Dohan, E. Jiang, C. Cai, M. Terry, Q. Le, et al. Program synthesis with large language models.arXiv preprint arXiv:2108.07732, 2021
work page internal anchor Pith review Pith/arXiv arXiv 2021
-
[2]
J. Bai, S. Bai, Y . Chu, Z. Cui, K. Dang, X. Deng, Y . Fan, W. Ge, Y . Han, F. Huang, et al. Qwen technical report.arXiv preprint arXiv:2309.16609, 2023
work page internal anchor Pith review Pith/arXiv arXiv 2023
-
[3]
C. Chen, Y . Du, Z. Fang, Z. Wang, F. Luo, P. Li, M. Yan, J. Zhang, F. Huang, M. Sun, and Y . Liu. Model composition for multimodal large language models. InProceedings of the Annual Meeting of the Association for Computational Linguistics (ACL), 2024
work page 2024
-
[4]
D. L. Chen and W. B. Dolan. Collecting highly parallel data for paraphrase evaluation. In Proceedings of the Annual Meeting of the Association for Computational Linguistics (ACL), pages 190–200, 2011
work page 2011
-
[5]
M. Chen, J. Tworek, H. Jun, Q. Yuan, H. P. D. O. Pinto, J. Kaplan, H. Edwards, Y . Burda, N. Joseph, G. Brockman, et al. Evaluating large language models trained on code.arXiv preprint arXiv:2107.03374, 2021
work page internal anchor Pith review Pith/arXiv arXiv 2021
-
[6]
S. Chen, Y . Wu, C. Wang, S. Liu, D. Tompkins, Z. Chen, W. Che, X. Yu, and F. Wei. Beats: Audio pre-training with acoustic tokenizers. InProceedings of the International Conference on Machine Learning (ICML), volume 202, pages 5178–5193, 2023
work page 2023
-
[7]
S. Chen, Y . Zhang, and Q. Yang. Multi-task learning in natural language processing: An overview.ACM Computing Surveys, 56(12):1–32, 2024
work page 2024
-
[8]
Y . Chen, J. Li, W. Yao, X. Ma, G. Du, W. Wang, and J. Li. V ocabulary hijacking in lvlms: Unveiling critical attention heads by excluding inert tokens to mitigate hallucination.arXiv preprint arXiv:2605.10622, 2026
work page internal anchor Pith review Pith/arXiv arXiv 2026
- [9]
- [10]
- [11]
-
[12]
Training Verifiers to Solve Math Word Problems
K. Cobbe, V . Kosaraju, M. Bavarian, M. Chen, H. Jun, L. Kaiser, M. Plappert, J. Tworek, J. Hilton, R. Nakano, C. Hesse, and J. Schulman. Training verifiers to solve math word problems.arXiv preprint arXiv:2110.14168, 2021
work page internal anchor Pith review Pith/arXiv arXiv 2021
-
[13]
M. Deitke, D. Schwenk, J. Salvador, L. Weihs, O. Michel, E. VanderBilt, L. Schmidt, K. Ehsani, A. Kembhavi, and A. Farhadi. Objaverse: A universe of annotated 3d objects. InProceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), pages 13142–13153, 2023
work page 2023
-
[14]
N. Ding, Y . Qin, G. Yang, F. Wei, Z. Yang, Y . Su, S. Hu, Y . Chen, C.-M. Chan, W. Chen, et al. Parameter-efficient fine-tuning of large-scale pre-trained language models.Nature Machine Intelligence, 5(3):220–235, 2023
work page 2023
-
[15]
A. Dosovitskiy, L. Beyer, A. Kolesnikov, D. Weissenborn, J. Uszkoreit, and N. Houlsby. An image is worth 16x16 words: Transformers for image recognition at scale. InProceedings of the International Conference on Learning Representations (ICLR), 2021
work page 2021
-
[16]
K. Drossos, S. Lipping, and T. Virtanen. Clotho: an audio captioning dataset. InProceedings of the IEEE International Conference on Acoustics, Speech, and Signal Processing (ICASSP), pages 736–740, 2020
work page 2020
- [17]
-
[18]
G. Du, R. Jiang, S. Yang, H. Li, W. Chen, K. Li, S. K. Goh, and H.-K. Tang. Impacts of darwinian evolution on pre-trained deep neural networks. In2024 IEEE International Conference on Systems, Man, and Cybernetics (SMC), pages 1907–1912. IEEE, 2024
work page 1907
-
[19]
G. Du, J. Lee, J. Li, R. Jiang, Y . Guo, S. Yu, H. Liu, S. K. Goh, H.-K. Tang, D. He, and M. Zhang. Parameter competition balancing for model merging. InThe Thirty-eighth Annual Conference on Neural Information Processing Systems (NeurIPS), 2024
work page 2024
-
[20]
G. Du, J. Li, H. Liu, R. Jiang, S. Yu, Y . Guo, S. K. Goh, and H.-K. Tang. Knowledge fusion by evolving weights of language models. InFindings of the Association for Computational Linguistics ACL 2024, pages 11727–11742, 2024
work page 2024
-
[21]
G. Du, Z. Fang, J. Li, J. Li, R. Jiang, S. Yu, Y . Guo, Y . Chen, S. K. Goh, H.-K. Tang, D. He, H. Liu, and M. Zhang. Neural parameter search for slimmer fine-tuned models and better transfer. InProceedings of the 63rd Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), 2025. doi: 10.18653/v1/2025.acl-long.1570. URL ht...
-
[22]
G. Du, Z. Li, X. Zhou, J. Li, Z. Shi, W. Lin, H.-K. Tang, X. Li, F. Liu, W. Wang, M. Zhang, and J. Li. Knowledge fusion of large language models via modular skillpacks. InProceedings of the International Conference on Learning Representations (ICLR), 2026
work page 2026
-
[23]
A. Dubey, A. Jauhri, A. Pandey, A. Kadian, A. Al-Dahle, A. Letman, A. Mathur, A. Schelten, A. Yang, A. Fan, et al. The llama 3 herd of models.arXiv preprint arXiv:2407.21783, 2024
work page internal anchor Pith review Pith/arXiv arXiv 2024
-
[24]
Length-Controlled AlpacaEval: A Simple Way to Debias Automatic Evaluators
Y . Dubois, B. Galambosi, P. Liang, and T. B. Hashimoto. Length-controlled alpacaeval: A simple way to debias automatic evaluators.arXiv preprint arXiv:2404.04475, 2024
work page internal anchor Pith review Pith/arXiv arXiv 2024
- [26]
-
[27]
GPTQ: Accurate Post-Training Quantization for Generative Pre-trained Transformers
E. Frantar, S. Ashkboos, T. Hoefler, and D. Alistarh. Gptq: Accurate post-training quantization for generative pre-trained transformers.arXiv preprint arXiv:2210.17323, 2022
work page internal anchor Pith review Pith/arXiv arXiv 2022
-
[28]
A. A. Gargiulo, D. Crisostomi, M. S. Bucarelli, S. Scardapane, F. Silvestri, and E. Rodola. Task singular vectors: Reducing task interference in model merging. InProceedings of the Computer Vision and Pattern Recognition Conference, pages 18695–18705, 2025
work page 2025
- [30]
-
[31]
Y . Gong, J. Yu, and J. R. Glass. V ocalsound: A dataset for improving human vocal sounds recognition. InProceedings of the IEEE International Conference on Acoustics, Speech, and Signal Processing (ICASSP), pages 151–155, 2022
work page 2022
-
[32]
Y . Gong, H. Luo, A. H. Liu, L. Karlinsky, and J. R. Glass. Listen, think, and understand. In Proceedings of the International Conference on Learning Representations (ICLR), 2024
work page 2024
-
[33]
Y . Goyal, T. Khot, D. Summers-Stay, D. Batra, and D. Parikh. Making the V in VQA matter: Elevating the role of image understanding in visual question answering. InProceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), pages 6325–6334, 2017. 11
work page 2017
- [34]
- [35]
-
[36]
Y . He, Y . Hu, Y . Lin, T. Zhang, and H. Zhao. Localize-and-stitch: Efficient model merging via sparse task arithmetic.Transactions on Machine Learning Research (TMLR), 2024
work page 2024
- [37]
-
[38]
D. Hendrycks, C. Burns, S. Kadavath, A. Arora, S. Basart, E. Tang, D. Song, and J. Steinhardt. Measuring mathematical problem solving with the MATH dataset. InThirty-fifth Conference on Neural Information Processing Systems Datasets and Benchmarks Track (Round 2), 2021
work page 2021
-
[39]
E. J. Hu, P. Wallis, Z. Allen-Zhu, Y . Li, S. Wang, L. Wang, W. Chen, et al. Lora: Low-rank adaptation of large language models. InInternational Conference on Learning Representations, 2022
work page 2022
- [40]
-
[41]
D. A. Hudson and C. D. Manning. Gqa: A new dataset for real-world visual reasoning and compositional question answering. InProceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), pages 6700–6709, 2019
work page 2019
-
[42]
Editing Models with Task Arithmetic
G. Ilharco, M. T. Ribeiro, M. Wortsman, S. Gururangan, L. Schmidt, H. Hajishirzi, and A. Farhadi. Editing models with task arithmetic.arXiv preprint arXiv:2212.04089, 2022
work page internal anchor Pith review Pith/arXiv arXiv 2022
- [43]
- [44]
-
[45]
Y . LeCun. The mnist database of handwritten digits.http://yann. lecun. com/exdb/mnist/, 1998
work page 1998
-
[46]
J. Li, G. Du, J. Li, S. K. Goh, W. Wang, Y . Wang, F. Liu, H.-K. Tang, S. Alharbi, D. He, et al. Multi-modality expansion and retention for llms through parameter merging and decoupling. InProceedings of the 63rd Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), 2025. doi: 10.18653/v1/2025.acl-long.1491. URL https://...
-
[47]
J. Li, S. Song, G. Du, N. Wong, X. Liu, Y . Li, M. Zhang, J. Li, and X. Li. D-qrelo: Training-and data-free delta compression for large language models via quantization and residual low-rank approximation.arXiv preprint arXiv:2604.16940, 2026
work page internal anchor Pith review Pith/arXiv arXiv 2026
-
[48]
From Crowdsourced Data to High-Quality Benchmarks: Arena-Hard and BenchBuilder Pipeline
T. Li, W.-L. Chiang, E. Frick, L. Dunlap, T. Wu, B. Zhu, J. E. Gonzalez, and I. Stoica. From crowdsourced data to high-quality benchmarks: Arena-hard and benchbuilder pipeline.arXiv preprint arXiv:2406.11939, 2024
work page internal anchor Pith review Pith/arXiv arXiv 2024
-
[49]
X. Li, T. Zhang, Y . Dubois, R. Taori, I. Gulrajani, C. Guestrin, P. Liang, and T. B. Hashimoto. AlpacaEval: An automatic evaluator of instruction-following models, 2023
work page 2023
-
[50]
Y . Li, Y . Du, K. Zhou, J. Wang, W. X. Zhao, and J. Wen. Evaluating object hallucination in large vision-language models. InProceedings of the Conference on Empirical Methods in Natural Language Processing (EMNLP), pages 292–305, 2023. 12
work page 2023
- [51]
-
[52]
B. Lin, B. Zhu, Y . Ye, M. Ning, P. Jin, and L. Yuan. Video-llava: Learning united visual representation by alignment before projection.arXiv preprint arXiv:2311.10122, 2023
work page internal anchor Pith review Pith/arXiv arXiv 2023
-
[53]
J. Lin, J. Tang, H. Tang, S. Yang, W.-M. Chen, W.-C. Wang, G. Xiao, X. Dang, C. Gan, and S. Han. Awq: Activation-aware weight quantization for on-device llm compression and acceleration.Proceedings of Machine Learning and Systems, 6:87–100, 2024
work page 2024
- [54]
-
[55]
H. Liu, C. Li, Y . Li, and Y . J. Lee. Improved baselines with visual instruction tuning. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), pages 26296–26306, 2024
work page 2024
-
[56]
J. Liu, G. Xiao, K. Li, J. D. Lee, S. Han, T. Dao, and T. Cai. Bitdelta: Your fine-tune may only be worth one bit.Advances in Neural Information Processing Systems (NeurIPS), 37: 13579–13600, 2024
work page 2024
-
[57]
P. Lu, S. Mishra, T. Xia, L. Qiu, K.-W. Chang, S.-C. Zhu, O. Tafjord, P. Clark, and A. Kalyan. Learn to explain: Multimodal reasoning via thought chains for science question answering. In Proceedings of the Advances in Neural Information Processing Systems (NeurIPS), volume 35, pages 2507–2521, 2022
work page 2022
- [58]
- [59]
-
[60]
M. Maaz, H. A. Rasheed, S. Khan, and F. Khan. Video-chatgpt: Towards detailed video understanding via large vision and language models. InProceedings of the Annual Meeting of the Association for Computational Linguistics (ACL), 2024
work page 2024
-
[61]
D. Marczak, S. Magistri, S. Cygert, B. Twardowski, A. D. Bagdanov, and J. van de Weijer. No task left behind: Isotropic model merging with common and task-specific subspaces. Proceedings of the International Conference on Machine Learning (ICML), 2025
work page 2025
- [62]
-
[63]
X. Mei, C. Meng, H. Liu, Q. Kong, T. Ko, C. Zhao, M. D. Plumbley, Y . Zou, and W. Wang. Wavcaps: A chatgpt-assisted weakly-labelled audio captioning dataset for audio-language multimodal research.IEEE ACM Transactions on Audio, Speech, and Language Processing (TASLP), 32:3339–3354, 2024
work page 2024
-
[64]
A. Mesaros, T. Heittola, A. Diment, B. Elizalde, A. Shah, E. Vincent, B. Raj, and T. Virtanen. DCASE2017 challenge setup: Tasks, datasets and baseline system. InProceedings of the Workshop on Detection and Classification of Acoustic Scenes and Events (DCASE), pages 85–92, 2017
work page 2017
-
[65]
A. Panagopoulou, L. Xue, N. Yu, J. Li, D. Li, S. Joty, R. Xu, S. Savarese, C. Xiong, and J. C. Niebles. X-instructblip: A framework for aligning x-modal instruction-aware representations to llms and emergent cross-modal reasoning. InProceedings of the European Conference on Computer Vision (ECCV), 2024
work page 2024
-
[66]
A. Panigrahi, N. Saunshi, H. Zhao, and S. Arora. Task-specific skill localization in fine-tuned language models.arXiv preprint arXiv:2302.06600, 2023. 13
-
[67]
B. Ping, S. Wang, H. Wang, X. Han, Y . Xu, Y . Yan, Y . Chen, B. Chang, Z. Liu, and M. Sun. Delta-come: Training-free delta-compression with mixed-precision for large language models. InThe Thirty-eighth Annual Conference on Neural Information Processing Systems (NeurIPS), 2024
work page 2024
-
[68]
A. Radford, J. W. Kim, C. Hallacy, A. Ramesh, G. Goh, S. Agarwal, G. Sastry, A. Askell, P. Mishkin, J. Clark, et al. Learning transferable visual models from natural language supervi- sion. InInternational conference on machine learning, pages 8748–8763. PMLR, 2021
work page 2021
-
[69]
D. Rein, B. L. Hou, A. C. Stickland, J. Petty, R. Y . Pang, J. Dirani, J. Michael, and S. R. Bow- man. Gpqa: A graduate-level google-proof q&a benchmark.arXiv preprint arXiv:2311.12022, 2023
work page internal anchor Pith review Pith/arXiv arXiv 2023
-
[70]
V . Sanh, A. Webson, C. Raffel, S. Bach, L. Sutawika, Z. Alyafeai, A. Chaffin, A. Stiegler, A. Raja, M. Dey, et al. Multitask prompted training enables zero-shot task generalization. In International Conference on Learning Representations, 2022
work page 2022
- [71]
-
[72]
J. Stallkamp, M. Schlipsing, J. Salmen, and C. Igel. The german traffic sign recognition benchmark: a multi-class classification competition. InIJCNN, pages 1453–1460. IEEE, 2011
work page 2011
- [73]
-
[74]
W. Sun, Q. Li, Y .-a. Geng, and B. Li. Cat merging: A training-free approach for resolving conflicts in model merging.Proceedings of the International Conference on Machine Learning (ICML), 2025
work page 2025
-
[75]
A. Tang, L. Shen, Y . Luo, N. Yin, L. Zhang, and D. Tao. Merging multi-task models via weight-ensembling mixture of experts. InICML, 2024
work page 2024
-
[76]
A. Tang, L. Shen, Y . Luo, S. Xie, H. Hu, L. Zhang, B. Du, and D. Tao. Zero-shot sparse mixture of low-rank experts construction from pre-trained foundation models.IEEE Transactions on Pattern Analysis and Machine Intelligence, pages 1–12, 2025. doi: 10.1109/TPAMI.2025. 3612480
-
[77]
LLaMA: Open and Efficient Foundation Language Models
H. Touvron, T. Lavril, G. Izacard, X. Martinet, M.-A. Lachaux, T. Lacroix, B. Rozière, N. Goyal, E. Hambro, F. Azhar, et al. Llama: Open and efficient foundation language models. arXiv preprint arXiv:2302.13971, 2023
work page internal anchor Pith review Pith/arXiv arXiv 2023
-
[78]
M. Valipour, M. Rezagholizadeh, I. Kobyzev, and A. Ghodsi. DyLoRA: Parameter-efficient tuning of pre-trained models using dynamic search-free low-rank adaptation. InProceedings of the 17th Conference of the European Chapter of the Association for Computational Linguistics, 2023
work page 2023
- [79]
- [80]
-
[81]
Q. Wang, J. Ke, M. Tomizuka, K. Keutzer, and C. Xu. Dobi-SVD: Differentiable SVD for LLM compression and some new perspectives. InThe Thirteenth International Conference on Learning Representations (ICLR), 2025
work page 2025
- [82]
discussion (0)
Sign in with ORCID, Apple, or X to comment. Anyone can read and Pith papers without signing in.