
arxiv: 2501.07237 · v5 · submitted 2025-01-13 · 💻 cs.LG · cs.AI

GWT: Scalable Optimizer State Compression for Large Language Model Training

Pith reviewed 2026-05-23 05:54 UTC · model grok-4.3

classification 💻 cs.LG cs.AI
keywords: optimizer state compression · gradient wavelet transform · memory efficient training · large language models · Adam optimizer · wavelet subspace projection · training memory reduction

The pith

Gradient Wavelet Transform compresses optimizer states for large language models by projecting gradients into wavelet subspaces while matching full-rank performance.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper introduces Gradient Wavelet Transform as a way to shrink the memory footprint of stateful optimizers like Adam during LLM training. It does this by moving gradients into wavelet subspaces instead of relying on low-rank approximations such as SVD that often hurt final model quality. The method is shown to plug directly into existing training loops and to deliver the same results as uncompressed updates across both pre-training and fine-tuning tasks. If the approach holds, it would let practitioners run larger models or longer schedules on fixed hardware budgets without the usual accuracy trade-offs seen in other compression schemes.

Core claim

The Gradient Wavelet Transform projects gradients into wavelet subspaces, which compacts the stored optimizer states while retaining the information required for effective parameter updates; theoretical arguments and large-scale experiments indicate that this projection can be inserted into standard optimizers without measurable loss in model fidelity relative to full-rank baselines.
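
The projection step can be illustrated with a minimal sketch. It assumes a Haar wavelet applied along the last axis of a gradient matrix and assumes that only the approximation coefficients are kept; the function name `haar_approx` and the toy gradient are illustrative and not taken from the paper.

```python
import numpy as np

def haar_approx(grad, levels):
    """Approximation coefficients of a `levels`-level Haar transform.

    The transform runs along the last axis, whose length must be divisible
    by 2**levels. Detail coefficients are discarded (an assumption of this
    sketch, not a statement about the paper's exact procedure).
    """
    a = np.asarray(grad, dtype=np.float64)
    for _ in range(levels):
        # Orthonormal Haar low-pass step: scaled sum of neighbouring pairs.
        a = (a[..., 0::2] + a[..., 1::2]) / np.sqrt(2.0)
    return a

g = np.arange(32, dtype=np.float64).reshape(4, 8)  # toy "gradient"
a = haar_approx(g, levels=2)
# a has shape (4, 2): each row keeps 2 of its 8 entries, i.e. 25% of the values.
```

Each extra level halves the number of stored values again, which is where the compression of the optimizer state would come from under these assumptions.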

What carries the argument

Gradient Wavelet Transform, which projects gradients into wavelet subspaces to compact optimizer states while preserving update information.
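
As a rough sketch of how such a projection might sit inside an Adam-style update, the class below keeps both moment buffers in the Haar approximation subspace and maps the normalized update back to the parameter shape. Everything here is an assumption made for illustration: the class name `SubspaceAdam`, the choice to zero the detail coefficients on the way back, and the hyperparameter defaults. The paper's actual update rule may differ.

```python
import numpy as np

def haar_down(x):
    # One orthonormal Haar low-pass step along the last axis.
    return (x[..., 0::2] + x[..., 1::2]) / np.sqrt(2.0)

def haar_up_zero_detail(a):
    # Inverse Haar step with all detail coefficients set to zero.
    return np.repeat(a, 2, axis=-1) / np.sqrt(2.0)

class SubspaceAdam:
    """Adam whose moment buffers live in a Haar approximation subspace (sketch)."""

    def __init__(self, shape, levels=2, lr=1e-3, betas=(0.9, 0.999), eps=1e-8):
        rows, cols = shape
        assert cols % (2 ** levels) == 0, "last dim must be divisible by 2**levels"
        self.levels, self.lr, self.betas, self.eps = levels, lr, betas, eps
        # Both moments are stored at 1 / 2**levels of the parameter size.
        self.m = np.zeros((rows, cols // 2 ** levels))
        self.v = np.zeros_like(self.m)
        self.t = 0

    def step(self, weights, grad):
        b1, b2 = self.betas
        a = grad
        for _ in range(self.levels):
            a = haar_down(a)
        self.t += 1
        self.m = b1 * self.m + (1 - b1) * a
        self.v = b2 * self.v + (1 - b2) * a * a
        m_hat = self.m / (1 - b1 ** self.t)
        v_hat = self.v / (1 - b2 ** self.t)
        update = m_hat / (np.sqrt(v_hat) + self.eps)
        for _ in range(self.levels):
            update = haar_up_zero_detail(update)
        return weights - self.lr * update

opt = SubspaceAdam((4, 8), levels=2)
w_new = opt.step(np.zeros((4, 8)), np.ones((4, 8)))
```

The training loop only sees `step(weights, grad)`, which is one way to read the claim that the projection can be added without changing the rest of the training code.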

If this is right

  • Optimizer states can be stored at reduced size while the training dynamics remain equivalent to full-rank methods.
  • Existing Adam-style optimizers can incorporate the projection step with no change to the rest of the training code.
  • Both large-scale pre-training and task-specific fine-tuning reach the same final performance as memory-heavy baselines.
  • Memory savings scale with model size, supporting longer training runs or larger batch sizes under fixed hardware limits.
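
The memory arithmetic behind these points can be worked through in a few lines. The sketch assumes that Adam stores two moment buffers per parameter, that an l-level transform keeps a 1/2^l fraction of each row, and that states are held in 4-byte floats; the 4096 x 4096 matrix is a hypothetical example, not a layer from the paper.

```python
def adam_state_bytes(rows, cols, levels=0, bytes_per_value=4):
    """Bytes for Adam's two moment buffers when each row keeps cols / 2**levels values."""
    kept_cols = cols // (2 ** levels)
    return 2 * rows * kept_cols * bytes_per_value

full = adam_state_bytes(4096, 4096)
compressed = adam_state_bytes(4096, 4096, levels=2)
saving = 1 - compressed / full
print(f"full: {full / 2**20:.0f} MiB, 2-level: {compressed / 2**20:.0f} MiB, saving: {saving:.0%}")
# prints "full: 128 MiB, 2-level: 32 MiB, saving: 75%"
```

Under these assumptions a 2-level transform gives a 75% reduction in optimizer state, which matches the figure quoted in the Figure 1 caption.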

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

  • The same wavelet projection could be tested on other first-order methods such as momentum or RMSprop to check generality.
  • If wavelet basis choice proves robust across domains, the technique might extend to non-LLM settings like vision transformers.
  • Hardware implementations could exploit fast wavelet transforms to reduce the added compute overhead during the projection step.

Load-bearing premise

That projecting gradients into wavelet subspaces preserves essential update information without incurring non-negligible performance degradation relative to full-rank updates.

What would settle it

A controlled experiment in which an LLM trained with GWT shows statistically significant higher validation loss or lower downstream accuracy than the identical model trained with standard Adam on the same data and schedule would falsify the performance-parity claim.
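
One way such a comparison could be scored is an exact paired sign-flip test over matched seeds. The sketch below is only an illustration: the function name is hypothetical, the loss values are placeholders rather than results from the paper, and a real study would need to choose its own test and number of runs.

```python
from itertools import product

def paired_sign_flip_pvalue(loss_a, loss_b):
    """One-sided exact sign-flip test for mean(loss_a - loss_b) > 0.

    Enumerates all 2**n sign assignments, so it is only practical for a
    small number of paired runs.
    """
    diffs = [a - b for a, b in zip(loss_a, loss_b)]
    observed = sum(diffs)
    hits = 0
    total = 0
    for signs in product((1, -1), repeat=len(diffs)):
        total += 1
        # Small tolerance guards against floating-point ties.
        if sum(s * d for s, d in zip(signs, diffs)) >= observed - 1e-12:
            hits += 1
    return hits / total

# Placeholder validation losses for five matched seeds (NOT from the paper).
gwt_loss = [3.41, 3.39, 3.42, 3.40, 3.43]
adam_loss = [3.40, 3.40, 3.41, 3.41, 3.42]
p_value = paired_sign_flip_pvalue(gwt_loss, adam_loss)
```

A small p-value would indicate that the compressed optimizer is reliably worse than the baseline on the same seeds, which is the outcome that would contradict the parity claim.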

Figures

Figures reproduced from arXiv: 2501.07237 by Dongsheng Li, Jiahuan Wang, Kun Yuan, Ping Luo, Tao Sun, Ziqing Wen.

Figure 1
Figure 1: Visualization of memory usage for Adam optimizer states. Compared with full-rank Adam, the 2-level wavelet transform can reduce the optimizer state up to 75%. Given the high costs of training LLMs, methods that reduce the batch size or use larger GPUs quickly become unsustainable. To mitigate these issues, many studies have focused on optimizing memory usage in LLM training. A promising solution is low-ran…
Figure 2
Figure 2: Visualization of the approximation coefficients of a 2-Level DHT on an image (rescaled for visual clarity). Even after reducing the image to 25% of its original size (right), the approximation coefficients successfully preserve the key structural features of the image. 3.2 Adam with GWT The Adam optimizer [22] adaptively adjusts the learning rates for each parameter based on its past gradient states, while…
Figure 3
Figure 3: Training loss comparison between Haar-2 with/without NL on LLaMA 60M and 130M. In summary, we integrate the proposed GWT method into the Adam optimizer and observe that it can theoretically reduce the memory footprint of optimizer states by up to 75%. GWT achieves this by projecting gradients into a lower-dimensional subspace and performing updates to the optimizer states within this subspace. This memory-…
Figure 4
Figure 4: Applying GWT to Adam/Adam-mini/MUON optimizers for pre-training various LLaMA models. GWT is scalable to optimizers beyond Adam and consistently achieves lower or comparable PPL than full-rank methods across all tested optimizers. cosine annealing schedule is applied to the learning rate. We use a sequence length of 256 and a batch size of 512, resulting in a total of 131K tokens per batch, unless stated o…
Figure 5
Figure 5: Applying GWT to Adam for pre-training LLaMA models with reduced memory usage (higher l). GWT maintains superior performance over Adam under high wavelet transform l levels.
Figure 6
Figure 6: Study the effects of α on pre-training LLaMA 60M and 130M models. GWT level l. We investigate the effects of the GWT level hyperparameter l, with the corresponding validation perplexity curves shown in …
Figure 7
Figure 7: Applying module-wise learning rate to Adam optimizer. We can observe that applying a module-wise split learning rate in Adam leads to better convergence speed and final results compared to using a uniform learning rate across all modules. E Discussion Our proposed GWT and its implementation draw inspiration from GaLore [54]. Both GWT and GaLore are closely related to Projected Gradient Descent (PGD) [4, 7]…
Figure 8
Figure 8: Applying same learning rate to all modules for GWT optimizer. Similar to Adam, the module-wise learning rate yields better experimental performance compared to using a unified learning rate. onto a subspace before performing the update. However, there are key distinctions between GWT and GaLore. The primary difference lies in the projection operation: GWT not only projects the gradient onto the subspace bu…
Abstract

Large Language Models (LLMs) have demonstrated exceptional capabilities across diverse natural language processing benchmarks. However, the escalating scale of model parameters imposes prohibitive memory overheads during training, especially when employing stateful optimizers such as Adam. Conventional memory-efficient strategies, typically involving singular value decomposition (SVD) or weight freezing, often incur non-negligible performance degradation relative to full-rank updates. To address these limitations, this paper explores memory-efficient optimization beyond low-rank constraints and proposes the Gradient Wavelet Transform (GWT). GWT characterizes a novel compression framework that projects gradients into wavelet subspaces, effectively compacting optimizer states while preserving essential update information. We theoretically and empirically demonstrate that GWT can be seamlessly integrated into existing optimization protocols, facilitating resource-efficient training without compromising model fidelity. Rigorous evaluations encompassing both large-scale pre-training and task-specific fine-tuning reveal that GWT yields performance parity with advanced memory-efficient optimizers and full-rank updates. Furthermore, GWT provides a scalable and robust solution for managing the memory-intensive pipelines inherent in modern large-scale data engineering and knowledge discovery systems.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Referee Report

2 major / 0 minor

Summary. The paper proposes Gradient Wavelet Transform (GWT), a compression framework that projects gradients into wavelet subspaces to reduce optimizer state memory for LLM training with Adam and similar methods. It claims seamless integration into existing protocols, theoretical and empirical support for performance parity with full-rank updates and advanced memory-efficient optimizers (without non-negligible degradation), and evaluations on large-scale pre-training plus task-specific fine-tuning.

Significance. If the performance-parity claim holds with rigorous evidence, GWT could extend memory-efficient optimization beyond SVD or low-rank approaches, offering a scalable tool for resource-constrained LLM training pipelines.

major comments (2)
  1. [Abstract] The claim of 'theoretical and empirical demonstrations' is unsupported by any derivations, proofs, error bars, dataset details, or quantitative results, rendering the central performance-parity claim impossible to assess.
  2. [Abstract] The weakest assumption—that wavelet-subspace projection preserves essential update information without non-negligible degradation relative to full-rank updates—is stated but not tested with concrete metrics or ablation in the visible text.

Simulated Author's Rebuttal

2 responses · 0 unresolved

Thank you for the detailed review. We address the concerns about the abstract's claims by clarifying that the full paper provides the supporting material, and we will revise the abstract accordingly to make the claims more concrete and reference the specific sections and results.

  1. Referee: [Abstract] The claim of 'theoretical and empirical demonstrations' is unsupported by any derivations, proofs, error bars, dataset details, or quantitative results, rendering the central performance-parity claim impossible to assess.

    Authors: The abstract serves as a concise overview, while the detailed theoretical derivations (including proofs of information preservation in wavelet subspaces) are presented in Section 3. Empirical evaluations with error bars, specific dataset details (e.g., C4 for pre-training, GLUE for fine-tuning), and quantitative results comparing to full Adam and other baselines are in Sections 4 and 5. We will revise the abstract to include key quantitative findings and explicit references to these sections to better support the claims. revision: yes

  2. Referee: [Abstract] The weakest assumption—that wavelet-subspace projection preserves essential update information without non-negligible degradation relative to full-rank updates—is stated but not tested with concrete metrics or ablation in the visible text.

    Authors: We acknowledge that the abstract does not detail the tests. The manuscript includes concrete metrics such as gradient cosine similarity and update norm preservation in Section 4.1, along with ablation studies on wavelet levels and compression ratios in Section 4.3, demonstrating no non-negligible degradation. We will update the abstract to mention these empirical validations briefly. revision: yes
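
The two metrics named in this response can be sketched as follows. This is an illustrative reading only: it assumes a Haar transform along the last axis with the detail coefficients dropped, and the function names are hypothetical rather than drawn from the manuscript.

```python
import numpy as np

def haar_down(x):
    # One orthonormal Haar low-pass step along the last axis.
    return (x[..., 0::2] + x[..., 1::2]) / np.sqrt(2.0)

def haar_up_zero_detail(a):
    # Inverse Haar step with all detail coefficients set to zero.
    return np.repeat(a, 2, axis=-1) / np.sqrt(2.0)

def projection_metrics(grad, levels=2):
    """Cosine similarity and norm ratio between a gradient and its
    reconstruction from Haar approximation coefficients only."""
    a = np.asarray(grad, dtype=np.float64)
    for _ in range(levels):
        a = haar_down(a)
    rec = a
    for _ in range(levels):
        rec = haar_up_zero_detail(rec)
    g_norm = np.linalg.norm(grad)
    r_norm = np.linalg.norm(rec)
    if g_norm == 0 or r_norm == 0:
        return 0.0, 0.0
    cosine = float(np.sum(grad * rec) / (g_norm * r_norm))
    return cosine, float(r_norm / g_norm)

smooth = np.tile([1.0, 1.0, 1.0, 1.0, 2.0, 2.0, 2.0, 2.0], (3, 1))
rough = np.tile([1.0, -1.0] * 4, (3, 1))
smooth_metrics = projection_metrics(smooth)  # piecewise-constant rows survive intact
rough_metrics = projection_metrics(rough)    # alternating rows vanish in the low-pass band
```

Because dropping the detail coefficients of an orthonormal transform is an orthogonal projection, the cosine similarity and the norm ratio coincide in this sketch; both fall toward zero as more of the gradient's energy sits in the discarded detail bands.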

Circularity Check

0 steps flagged

No significant circularity detected

The provided abstract and context describe GWT as a novel projection-based compression framework whose performance parity is asserted via theoretical and empirical demonstration. No equations, fitted parameters, self-citations, or derivation steps are visible that reduce any claimed prediction or uniqueness result to an input by construction. The central claim (wavelet-subspace projection preserves update information) is presented as an independent modeling choice rather than a renaming or self-referential fit, satisfying the self-contained criterion.

Axiom & Free-Parameter Ledger

0 free parameters · 0 axioms · 0 invented entities

Abstract supplies no explicit free parameters, axioms, or invented entities; insufficient detail to populate the ledger.


Reference graph

Works this paper leans on

57 extracted references · 57 canonical work pages · 15 internal anchors

  1. [1]

    D. A. Boiko, R. MacKnight, B. Kline, and G. Gomes. Autonomous chemical research with large language models. Nature, 624(7992):570–578, 2023

  2. [2]

    T. B. Brown, B. Mann, N. Ryder, M. Subbiah, J. Kaplan, P. Dhariwal, A. Neelakantan, P. Shyam, G. Sastry, A. Askell, S. Agarwal, A. Herbert-Voss, G. Krueger, T. Henighan, R. Child, A. Ramesh, D. M. Ziegler, J. Wu, C. Winter, C. Hesse, M. Chen, E. Sigler, M. Litwin, S. Gray, B. Chess, J. Clark, C. Berner, S. McCandlish, A. Radford, I. Sutskever, and ...

  3. [3]

    D. M. Cer, M. T. Diab, E. Agirre, I. Lopez-Gazpio, and L. Specia. Semeval-2017 task 1: Semantic textual similarity multilingual and crosslingual focused evaluation. In International Workshop on Semantic Evaluation, 2017

  4. [4]

    H. Chen, G. Raskutti, and M. Yuan. Non-convex projected gradient descent for generalized low-rank tensor regression. J. Mach. Learn. Res., 20:5:1–5:37, 2016

  5. [5]

    T. Chen, B. Xu, C. Zhang, and C. Guestrin. Training deep nets with sublinear memory cost. arXiv preprint arXiv:1604.06174, 2016

  6. [6]

    X. Chen, K. Feng, C. Li, X. Lai, X. Yue, Y. Yuan, and G. Wang. Fira: Can we achieve full-rank training of llms under low-rank constraint? ArXiv, abs/2410.01623, 2024

  7. [7]

    Fast low-rank estimation by projected gradient descent: General statistical and algorithmic guarantees

    Y. Chen and M. J. Wainwright. Fast low-rank estimation by projected gradient descent: General statistical and algorithmic guarantees. ArXiv, abs/1509.03025, 2015

  8. [8]

    T. Dettmers, M. Lewis, S. Shleifer, and L. Zettlemoyer. 8-bit optimizers via block-wise quantization. arXiv preprint arXiv:2110.02861, 2021

  9. [9]

    T. Dettmers, A. Pagnoni, A. Holtzman, and L. Zettlemoyer. Qlora: Efficient finetuning of quantized llms. Advances in Neural Information Processing Systems, 36, 2024

  10. [10]

    W. B. Dolan and C. Brockett. Automatically constructing a corpus of sentential paraphrases. In International Joint Conference on Natural Language Processing, 2005

  11. [11]

    A. Dosovitskiy, L. Beyer, A. Kolesnikov, D. Weissenborn, X. Zhai, T. Unterthiner, M. Dehghani, M. Minderer, G. Heigold, S. Gelly, J. Uszkoreit, and N. Houlsby. An image is worth 16x16 words: Transformers for image recognition at scale. ICLR, 2021

  12. [12]

    K. Everett, L. Xiao, M. Wortsman, A. Alemi, R. Novak, P. J. Liu, I. Gur, J. N. Sohl-Dickstein, L. P. Kaelbling, J. Lee, and J. Pennington. Scaling exponents across parameterizations and optimizers. ArXiv, abs/2407.05872, 2024

  13. [13]

    D. Ferber, G. Wölflein, I. C. Wiest, M. Ligero, S. Sainath, N. Ghaffari Laleh, O. S. El Nahhas, G. Müller-Franzes, D. Jäger, D. Truhn, et al. In-context learning enables multimodal large language models to classify cancer pathology images. Nature Communications, 15(1):10104, 2024

  14. [14]

    R. Genet and H. Inzirillo. Caadam: Improving adam optimizer using connection aware methods. ArXiv, abs/2410.24216, 2024

  15. [15]

    The Llama 3 Herd of Models

    A. Grattafiori, A. Dubey, A. Jauhri, A. Pandey, A. Kadian, A. Al-Dahle, A. Letman, A. Mathur, A. Schelten, A. Vaughan, et al. The llama 3 herd of models. arXiv preprint arXiv:2407.21783, 2024

  16. [16]

    Y. Hao, Y. Cao, and L. Mou. Flora: Low-rank adapters are secretly gradient compressors. arXiv preprint arXiv:2402.03293, 2024

  17. [17]

    Y. He, P. Li, Y. Hu, C. Chen, and K. Yuan. Subspace optimization for large language models with convergence guarantees. arXiv preprint arXiv:2410.11289, 2024

  18. [18]

    Measuring Massive Multitask Language Understanding

    D. Hendrycks, C. Burns, S. Basart, A. Zou, M. Mazeika, D. Song, and J. Steinhardt. Measuring massive multitask language understanding. arXiv preprint arXiv:2009.03300, 2020

  19. [19]

    J. Ho, A. Jain, and P. Abbeel. Denoising diffusion probabilistic models. Advances in neural information processing systems, 33:6840–6851, 2020

  20. [20]

    V. Hofmann, P. R. Kalluri, D. Jurafsky, and S. King. Ai generates covertly racist decisions about people based on their dialect. Nature, 633(8028):147–154, 2024

  21. [21]

    E. J. Hu, Y. Shen, P. Wallis, Z. Allen-Zhu, Y. Li, S. Wang, L. Wang, and W. Chen. Lora: Low-rank adaptation of large language models. In International Conference on Learning Representations, 2022

  22. [22]

    D. P. Kingma and J. Ba. Adam: A method for stochastic optimization. CoRR, abs/1412.6980, 2014

  23. [23]

    V. Lialin, N. Shivagunde, S. Muckatira, and A. Rumshisky. Relora: High-rank training through low-rank updates. In International Conference on Learning Representations, 2023

  24. [24]

    J. Liu, J. Su, X. Yao, Z. Jiang, G. Lai, Y. Du, Y. Qin, W. Xu, E. Lu, J. Yan, et al. Muon is scalable for llm training. arXiv preprint arXiv:2502.16982, 2025

  25. [25]

    Y. Liu. Roberta: A robustly optimized bert pretraining approach. arXiv preprint arXiv:1907.11692, 364, 2019

  26. [26]

    Z. Liu, Y. Lin, Y. Cao, H. Hu, Y. Wei, Z. Zhang, S. Lin, and B. Guo. Swin transformer: Hierarchical vision transformer using shifted windows. In Proceedings of the IEEE/CVF international conference on computer vision, pages 10012–10022, 2021

  27. [27]

    D. Matthews, M. Spence, A. Mater, J. Nichols, S. Pulsford, M. Sandhu, J. Kaczmarski, C. Miton, N. Tokuriki, and C. Jackson. Leveraging ancestral sequence reconstruction for protein representation learning. Nature Machine Intelligence, 6(12):1542–1555, 2024

  28. [28]

    O. Méndez-Lucio, C. A. Nicolaou, and B. Earnshaw. Mole: a foundation model for molecular graphs using disentangled attention. Nature Communications, 15(1):9431, 2024

  29. [29]

    G. Mischler, Y. A. Li, S. Bickel, A. D. Mehta, and N. Mesgarani. Contextual feature extraction hierarchies converge in large language models and the brain. Nature Machine Intelligence, pages 1–11, 2024

  30. [30]

    I. Molybog, P. Albert, M. Chen, Z. DeVito, D. Esiobu, N. Goyal, P. S. Koura, S. Narang, A. Poulton, R. Silva, B. Tang, D. Liskovich, P. Xu, Y. Zhang, M. H. M. Kambadur, S. Roller, and S. Zhang. A theory on adam instability in large-scale machine learning. ArXiv, abs/2304.09871, 2023

  31. [31]

    A. Paszke, S. Gross, S. Chintala, G. Chanan, E. Yang, Z. DeVito, Z. Lin, A. Desmaison, L. Antiga, and A. Lerer. Automatic differentiation in pytorch. 2017

  32. [32]

    P. Porwik and A. Lisowska. The haar-wavelet transform in digital image processing: Its status and achievements. Machine graphics & vision, 13:79–98, 2004

  33. [33]

    P. Qiu, C. Wu, X. Zhang, W. Lin, H. Wang, Y. Zhang, Y. Wang, and W. Xie. Towards building multilingual language model for medicine. Nature Communications, 15(1):8384, 2024

  34. [34]

    C. Raffel, N. M. Shazeer, A. Roberts, K. Lee, S. Narang, M. Matena, Y. Zhou, W. Li, and P. J. Liu. Exploring the limits of transfer learning with a unified text-to-text transformer. J. Mach. Learn. Res., 21:140:1–140:67, 2019

  35. [35]

    S. Rajbhandari, J. Rasley, O. Ruwase, and Y. He. Zero: Memory optimizations toward training trillion parameter models. In SC20: International Conference for High Performance Computing, Networking, Storage and Analysis, pages 1–16. IEEE, 2020

  36. [36]

    Know What You Don't Know: Unanswerable Questions for SQuAD

    P. Rajpurkar, R. Jia, and P. Liang. Know what you don’t know: Unanswerable questions for squad. ArXiv, abs/1806.03822, 2018

  37. [37]

    O. Rioul and M. Vetterli. Wavelets and signal processing. IEEE Signal Processing Magazine, 8(4):14–38, 1991

  38. [38]

    R. Socher, A. Perelygin, J. Wu, J. Chuang, C. D. Manning, A. Ng, and C. Potts. Recursive deep models for semantic compositionality over a sentiment treebank. In Conference on Empirical Methods in Natural Language Processing, 2013

  39. [39]

    Y. Song, J. Sohl-Dickstein, D. P. Kingma, A. Kumar, S. Ermon, and B. Poole. Score-based generative modeling through stochastic differential equations. In International Conference on Learning Representations, 2021

  40. [40]

    R. Taori, I. Gulrajani, T. Zhang, Y. Dubois, X. Li, C. Guestrin, P. Liang, and T. B. Hashimoto. Stanford alpaca: An instruction-following llama model. https://github.com/tatsu-lab/stanford_alpaca, 2023

  41. [41]

    G. Team, A. Kamath, J. Ferret, S. Pathak, N. Vieillard, R. Merhej, S. Perrin, T. Matejovicova, A. Ramé, M. Rivière, et al. Gemma 3 technical report. arXiv preprint arXiv:2503.19786, 2025

  42. [42]

    LLaMA: Open and Efficient Foundation Language Models

    H. Touvron, T. Lavril, G. Izacard, X. Martinet, M.-A. Lachaux, T. Lacroix, B. Rozière, N. Goyal, E. Hambro, F. Azhar, A. Rodriguez, A. Joulin, E. Grave, and G. Lample. Llama: Open and efficient foundation language models. ArXiv, abs/2302.13971, 2023

  43. [43]

    Llama 2: Open Foundation and Fine-Tuned Chat Models

    H. Touvron, L. Martin, K. R. Stone, P. Albert, A. Almahairi, et al. Llama 2: Open foundation and fine-tuned chat models. ArXiv, abs/2307.09288, 2023

  44. [44]

    A. Vaswani, N. M. Shazeer, N. Parmar, J. Uszkoreit, L. Jones, A. N. Gomez, L. Kaiser, and I. Polosukhin. Attention is all you need. In Neural Information Processing Systems, 2017

  45. [45]

    A. Wang, A. Singh, J. Michael, F. Hill, O. Levy, and S. R. Bowman. Glue: A multi-task benchmark and analysis platform for natural language understanding. In BlackboxNLP@EMNLP, 2018

  46. [46]

    A. Warstadt, A. Singh, and S. R. Bowman. Neural network acceptability judgments. Transactions of the Association for Computational Linguistics, 7:625–641, 2018

  47. [47]

    A. Williams, N. Nangia, and S. R. Bowman. A broad-coverage challenge corpus for sentence understanding through inference. In North American Chapter of the Association for Computational Linguistics, 2017

  48. [48]

    M. Wolter, F. Blanke, J. Garcke, and C. T. Hoyt. ptwt - the pytorch wavelet toolbox. Journal of Machine Learning Research, 25(80):1–7, 2024

  49. [49]

    Z. Wu, O. Zhang, X. Wang, L. Fu, H. Zhao, J. Wang, H. Du, D. Jiang, Y. Deng, D. Cao, et al. Leveraging language model for advanced multiproperty molecular optimization via prompt engineering. Nature Machine Intelligence, pages 1–11, 2024

  50. [50]

    W. Xia, C. Qin, and E. Hazan. Chain of lora: Efficient fine-tuning of language models via residual learning. arXiv preprint arXiv:2401.04151, 2024

  51. [51]

    A. Yang, B. Yang, B. Zhang, B. Hui, B. Zheng, B. Yu, C. Li, D. Liu, F. Huang, H. Wei, et al. Qwen2.5 technical report. arXiv preprint arXiv:2412.15115, 2024

  52. [52]

    LoRA-FA: Efficient and Effective Low Rank Representation Fine-tuning

    L. Zhang, L. Zhang, S. Shi, X. Chu, and B. Li. Lora-fa: Memory-efficient low-rank adaptation for large language models fine-tuning. arXiv preprint arXiv:2308.03303, 2023

  53. [53]

    Y. Zhang, C. Chen, Z. Li, T. Ding, C. Wu, D. P. Kingma, Y. Ye, Z.-Q. Luo, and R. Sun. Adam-mini: Use fewer learning rates to gain more. arXiv preprint arXiv:2406.16793, 2024

  54. [54]

    J. Zhao, Z. Zhang, B. Chen, Z. Wang, A. Anandkumar, and Y. Tian. Galore: Memory-efficient llm training by gradient low-rank projection. arXiv preprint arXiv:2403.03507, 2024

  55. [55]

    Y. Zheng, R. Zhang, J. Zhang, Y. Ye, Z. Luo, Z. Feng, and Y. Ma. Llamafactory: Unified efficient fine-tuning of 100+ language models. In Proceedings of the 62nd Annual Meeting of the Association for Computational Linguistics (Volume 3: System Demonstrations), Bangkok, Thailand, 2024. Association for Computational Linguistics

  56. [56]

    H. Zhu, Z. Zhang, W. Cong, X. Liu, S. Park, V. Chandra, B. Long, D. Z. Pan, Z. Wang, and J. Lee. Apollo: Sgd-like memory, adamw-level performance, 2024

  57. [57]

    optimizers for LLaMA pre-training tasks, and present the experimental results in Figure 4. G Benchmark Details In this section, we provide a detailed description of the benchmarks used in the paper, serving as a supplement to the results section 4. C4 benchmark. We adopt the Colossal Clean Crawled Corpus (C4) [34] benchmark of the English version for all ...