arxiv: 2605.21177 · v1 · pith:EH4GZG7Xnew · submitted 2026-05-20 · 💻 cs.LG · cs.CL

ChunkFT: Byte-Streamed Optimization for Memory-Efficient Full Fine-Tuning

Pith reviewed 2026-05-21 05:42 UTC · model grok-4.3

classification 💻 cs.LG cs.CL
keywords memory-efficient fine-tuning · full-parameter fine-tuning · sub-tensor gradients · large language models · convergence analysis · Llama models · gradient computation

The pith

ChunkFT performs full fine-tuning by computing gradients only on dynamically activated sub-tensors.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper introduces ChunkFT to reformulate full-parameter fine-tuning around a working set of sub-tensors that activate one at a time. This lets the method compute gradients for arbitrary parts of the model without any changes to the network architecture or the usual requirement to store every gradient simultaneously. A convergence analysis is given for the deterministic case, and experiments demonstrate that an 8B model fine-tunes on a single 24 GB GPU while a 70B model fits on two 80 GB GPUs. Readers would care because the approach keeps the optimization close to standard full fine-tuning yet slashes memory use enough to make complete updates practical on modest hardware. Downstream results on language understanding, math reasoning, and MT-Bench match or exceed both full fine-tuning and other memory-efficient baselines.

Core claim

ChunkFT reformulates full-parameter fine-tuning around a dynamically activated working set. It enables gradient computation for arbitrary sub-tensors without modifying the network architecture, providing an algorithmic foundation for optimizing arbitrary sub-networks while avoiding standard dense gradient computation. The authors supply a theoretical convergence analysis in the deterministic setting and report that full-parameter fine-tuning of a 7B model with 1K input length requires only 13.72 GB of GPU memory.

What carries the argument

The dynamically activated working set that processes the model in byte-streamed chunks to compute gradients only for selected sub-tensors.
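
To make the working-set idea concrete, here is a minimal sketch on a toy least-squares model. It is not the ChunkFT implementation: the function names, the flat parameter list, and the contiguous-slice chunking are all assumptions made for illustration. The point it shows is that the gradient buffer is only ever as large as the active chunk.

```python
# Illustrative sketch only: a toy least-squares model, not the ChunkFT code.
# Assumption: a "chunk" is a contiguous slice of a flat parameter list.

def predict(w, x):
    return sum(wi * xi for wi, xi in zip(w, x))

def loss(w, data):
    return 0.5 * sum((predict(w, x) - y) ** 2 for x, y in data) / len(data)

def chunk_gradient(w, data, start, stop):
    """Gradient of the loss with respect to w[start:stop] only."""
    grad = [0.0] * (stop - start)  # buffer sized to the chunk, not the model
    for x, y in data:
        residual = predict(w, x) - y
        for j in range(start, stop):
            grad[j - start] += residual * x[j] / len(data)
    return grad

def chunked_sweep(w, data, num_chunks, lr):
    """Activate and update one chunk at a time; return the largest gradient buffer used."""
    n = len(w)
    size = -(-n // num_chunks)  # ceiling division
    peak = 0
    for start in range(0, n, size):
        stop = min(start + size, n)
        grad = chunk_gradient(w, data, start, stop)
        peak = max(peak, len(grad))
        for j in range(start, stop):
            w[j] -= lr * grad[j - start]
    return peak

data = [([1.0, 0.0, 2.0, 1.0], 3.0),
        ([0.0, 1.0, 1.0, 2.0], 2.0),
        ([1.0, 1.0, 0.0, 1.0], 1.0)]
w = [0.0, 0.0, 0.0, 0.0]
initial = loss(w, data)
for _ in range(200):
    peak_grad_entries = chunked_sweep(w, data, num_chunks=2, lr=0.1)
final = loss(w, data)
print(initial, final, peak_grad_entries)
```

Every parameter is updated in each sweep, yet no gradient list longer than two entries is ever allocated. In a real network the saving would depend on how the backward pass is restricted to the active sub-tensor, which this toy does not model.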

If this is right

  • Full fine-tuning of a 7B model with 1K input length requires only 13.72 GB GPU memory.
  • Llama 3-8B can be fully fine-tuned on a single RTX 4090-24GB GPU.
  • Llama 3-70B can be fully fine-tuned on two H800-80GB GPUs.
  • Downstream performance on language understanding, mathematical reasoning, and MT-Bench matches or exceeds standard full-parameter fine-tuning.
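
The memory claims above can be put in context with a back-of-envelope estimate. The sketch below uses the common mixed-precision AdamW accounting of 16 bytes per parameter (2 for 16-bit weights, 2 for 16-bit gradients, 12 for an fp32 master copy plus two Adam moments). These byte counts, the chunk count of 64, and the assumption that gradients and optimizer state exist only for the active chunk are illustrative choices, not figures taken from the paper, and activations are ignored entirely.

```python
# Back-of-envelope estimate. All byte counts are assumptions, not paper figures.
GIB = 1024 ** 3

def dense_training_bytes(num_params, weight_bytes=2, grad_bytes=2, optimizer_bytes=12):
    """Weights, full gradients, and full optimizer state resident at once."""
    return num_params * (weight_bytes + grad_bytes + optimizer_bytes)

def chunked_training_bytes(num_params, num_chunks,
                           weight_bytes=2, grad_bytes=2, optimizer_bytes=12):
    """All weights resident; gradients and optimizer state for one chunk only (assumed)."""
    active = -(-num_params // num_chunks)  # ceiling division
    return num_params * weight_bytes + active * (grad_bytes + optimizer_bytes)

params_7b = 7_000_000_000
dense_gib = dense_training_bytes(params_7b) / GIB
chunked_gib = chunked_training_bytes(params_7b, num_chunks=64) / GIB
weights_only_gib = params_7b * 2 / GIB

print(f"dense:        {dense_gib:.1f} GiB")
print(f"chunked K=64: {chunked_gib:.1f} GiB")
print(f"weights only: {weights_only_gib:.1f} GiB")
```

Under these assumptions the dense footprint is far beyond a 24 GB card, while the chunked footprint sits only slightly above the cost of holding the 16-bit weights. The reported 13.72 GB is of the same order as the weights-only figure, but the paper's precision and offloading settings are not stated in this summary, so the comparison is for orientation only.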

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

  • The chunking approach might extend to training from scratch on memory-limited hardware by applying the same sub-tensor schedule.
  • It could support selective parameter updates in continual learning scenarios without requiring full gradient storage.
  • Integration with quantization or sparse activation might further reduce memory while preserving the full-update guarantee.

Load-bearing premise

Sequentially updating dynamically activated sub-tensors produces the same optimization trajectory and final performance as simultaneous dense gradient updates across the entire model.

What would settle it

Train the same small model that fits in memory under both ChunkFT and standard full fine-tuning with identical data and hyperparameters, then compare the final weights and downstream accuracy to check for equivalence.
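
A scaled-down version of this check can be sketched on a convex toy problem. Everything here is hypothetical scaffolding (names, data, learning rate, one-parameter chunks) and uses full-batch plain gradient descent rather than the stochastic AdamW setting of the paper. It shows how the comparison could be instrumented, not what the outcome would be for an LLM.

```python
# Toy protocol sketch: hypothetical names, full-batch gradients, plain gradient descent.
X = [[1, 0, 0], [0, 1, 0], [0, 0, 1], [1, 1, 0], [0, 1, 1]]
Y = [1.0, 2.0, 3.0, 3.0, 5.0]  # generated exactly by w* = [1, 2, 3]

def grad_entry(w, j):
    """Partial derivative of the halved mean squared error with respect to w[j]."""
    total = 0.0
    for x, y in zip(X, Y):
        residual = sum(wi * xi for wi, xi in zip(w, x)) - y
        total += residual * x[j]
    return total / len(X)

def dense_step(w, lr):
    # All partial derivatives are evaluated at the same point, then applied together.
    g = [grad_entry(w, j) for j in range(len(w))]
    return [wj - lr * gj for wj, gj in zip(w, g)]

def chunked_step(w, lr, chunks):
    # Each chunk sees the weights already updated by the chunks before it.
    w = list(w)
    for chunk in chunks:
        g = [grad_entry(w, j) for j in chunk]
        for j, gj in zip(chunk, g):
            w[j] -= lr * gj
    return w

def max_gap(a, b):
    return max(abs(ai - bi) for ai, bi in zip(a, b))

def run(steps, lr=0.5, chunks=((0,), (1,), (2,))):
    dense = [0.0, 0.0, 0.0]
    chunked = [0.0, 0.0, 0.0]
    for _ in range(steps):
        dense = dense_step(dense, lr)
        chunked = chunked_step(chunked, lr, chunks)
    return dense, chunked

gap_after_one = max_gap(*run(1))
gap_after_many = max_gap(*run(500))
print(gap_after_one, gap_after_many)
```

On this toy the two procedures already differ after a single step, because later chunks see updated weights, but both reach the same minimizer. That agreement is a property of the strongly convex objective. For a non-convex model trained with mini-batches, the same instrumentation would report whether final weights and downstream accuracy actually coincide.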

Figures

Figures reproduced from arXiv: 2605.21177 by Daling Wang, Ercong Nie, Feiliang Ren, Hinrich Schütze, Mengjie Zhao, Mingyang Wang, Qian Li, Shi Feng, Yongkang Liu, Zijing Wang.

Figure 1: Loss convergence behavior of ChunkFT (Llama 2-7B). Left: training loss curves of ChunkFT on six natural language understanding tasks. Right: loss curves of training a single block with different update intervals T on BoolQ.
Figure 2: Effect of the chunk number K on memory and performance. Left: peak GPU memory under different chunk numbers K. Right: BoolQ accuracy under different chunk numbers K.
Original abstract

This work presents ChunkFT, a memory-efficient fine-tuning framework that reformulates full-parameter fine-tuning around a dynamically activated working set. ChunkFT enables gradient computation for arbitrary sub-tensors without modifying the network architecture, providing an algorithmic foundation for optimizing arbitrary sub-networks while avoiding standard dense gradient computation. We provide a theoretical convergence analysis of ChunkFT in the deterministic setting. Empirically, we apply ChunkFT to fine-tune Llama 3-8B and Llama 3-70B using a single RTX 4090-24GB GPU and 2× H800-80GB GPUs, respectively. Full-parameter fine-tuning of a 7B model with a 1K input length requires only 13.72GB of GPU memory. The results demonstrate the effectiveness of ChunkFT in memory usage, running time, and optimization quality. Moreover, downstream evaluations on language understanding, mathematical reasoning, and MT-Bench show that ChunkFT consistently outperforms existing memory-efficient baselines. Notably, ChunkFT achieves performance comparable to, and in some cases exceeding, full-parameter fine-tuning. Our repository is on https://github.com/misonsky/chunk.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Referee Report

2 major / 1 minor

Summary. The manuscript presents ChunkFT, a memory-efficient full fine-tuning framework that reformulates optimization around dynamically activated working sets to enable gradient computation on arbitrary sub-tensors without architecture changes. It supplies a deterministic convergence analysis and reports concrete results: full-parameter fine-tuning of a 7B model with 1K input length uses 13.72 GB of GPU memory, Llama 3-8B is fine-tuned on a single RTX 4090-24GB GPU, and Llama 3-70B runs on 2× H800-80GB GPUs. Downstream performance on language understanding, mathematical reasoning, and MT-Bench exceeds existing memory-efficient baselines and is comparable to, and in some cases better than, standard full fine-tuning.

Significance. If the central equivalence claim holds, the work would be significant for enabling practical full-parameter fine-tuning of 70B-scale models on modest hardware while preserving optimization quality superior to low-rank methods. The deterministic convergence analysis, open repository, and explicit memory/accuracy numbers on Llama 3 models constitute clear strengths that support reproducibility and practical impact.

major comments (2)
  1. [Theoretical Analysis] Theoretical Analysis section: the convergence result is stated only for the deterministic setting. The empirical protocol uses stochastic mini-batch gradients, and the core mechanism of sequential sub-tensor activation and update can change the effective gradient direction relative to a simultaneous dense update; no extension or bias analysis for the stochastic regime is provided, leaving the claimed equivalence to standard full fine-tuning unverified under the noise and ordering effects present in practice.
  2. [§5 (Experiments)] §5 (Experiments) and associated tables: while memory footprints and downstream metrics are reported for Llama 3-8B/70B, the manuscript does not detail how optimizer states (momentum, variance) are maintained or reset across sequentially activated chunks. This information is load-bearing for reproducing the claimed memory savings and for confirming that the optimization trajectory remains comparable to dense full fine-tuning.
minor comments (1)
  1. [Abstract / Introduction] The abstract and introduction introduce 'byte-streamed optimization' without an early, self-contained definition; a short clarifying sentence would improve accessibility.

Simulated Author's Rebuttal

2 responses · 0 unresolved

We thank the referee for the positive assessment of our work and the constructive major comments. We address each point below, providing clarifications and indicating planned revisions to the manuscript.

  1. Referee: [Theoretical Analysis] Theoretical Analysis section: the convergence result is stated only for the deterministic setting. The empirical protocol uses stochastic mini-batch gradients, and the core mechanism of sequential sub-tensor activation and update can change the effective gradient direction relative to a simultaneous dense update; no extension or bias analysis for the stochastic regime is provided, leaving the claimed equivalence to standard full fine-tuning unverified under the noise and ordering effects present in practice.

    Authors: We appreciate the referee's observation regarding the scope of our theoretical analysis. The convergence result is indeed derived under the deterministic setting to establish a foundational guarantee for the ChunkFT framework. In practice, our experiments utilize standard stochastic optimization with mini-batches, as is conventional for large language model fine-tuning. The sequential activation of sub-tensors is implemented such that the gradient for each chunk is computed based on the current model state, and updates are applied sequentially within the same optimization step to closely approximate the dense gradient update. While a complete bias analysis for the stochastic regime is not included, the empirical results on Llama 3 models show performance comparable to or better than standard full fine-tuning, indicating that any deviations due to ordering or noise are not detrimental. In the revised manuscript, we will expand the Theoretical Analysis section to include a discussion on the stochastic extension and potential ordering effects, along with additional empirical validation of the equivalence. revision: partial

  2. Referee: [§5 (Experiments)] §5 (Experiments) and associated tables: while memory footprints and downstream metrics are reported for Llama 3-8B/70B, the manuscript does not detail how optimizer states (momentum, variance) are maintained or reset across sequentially activated chunks. This information is load-bearing for reproducing the claimed memory savings and for confirming that the optimization trajectory remains comparable to dense full fine-tuning.

    Authors: We agree that explicit details on optimizer state management are essential for reproducibility and for verifying the equivalence to dense full fine-tuning. We will revise the Experiments section (§5) to include a clear description of how momentum and variance are handled across chunks, ensuring readers can reproduce the memory savings and confirm that the optimization trajectory remains comparable to dense full fine-tuning. revision: yes
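
The question of optimizer state can be made concrete with a small sketch. The paper's actual policy is not described in this summary, so the class below is a hypothetical design that exposes the two obvious options: keep each chunk's Adam moments between activations, or rebuild them from zero each time a chunk becomes active. The class name, the `reset_on_activation` flag, and the per-chunk dictionary are all assumptions.

```python
import math

class ChunkAdam:
    """Adam with moment buffers stored per chunk (hypothetical, not from the paper)."""

    def __init__(self, lr=0.1, beta1=0.9, beta2=0.999, eps=1e-8,
                 reset_on_activation=False):
        self.lr, self.beta1, self.beta2, self.eps = lr, beta1, beta2, eps
        self.reset_on_activation = reset_on_activation
        self.state = {}  # chunk id -> {"m": [...], "v": [...], "t": int}

    def activate(self, chunk_id, size):
        # Persist policy: reuse the moments from the chunk's previous activation.
        # Reset policy: discard them and start again from zeros.
        if self.reset_on_activation or chunk_id not in self.state:
            self.state[chunk_id] = {"m": [0.0] * size, "v": [0.0] * size, "t": 0}
        return self.state[chunk_id]

    def step(self, params, chunk_id, indices, grads):
        s = self.activate(chunk_id, len(indices))
        s["t"] += 1
        for k, (j, g) in enumerate(zip(indices, grads)):
            s["m"][k] = self.beta1 * s["m"][k] + (1 - self.beta1) * g
            s["v"][k] = self.beta2 * s["v"][k] + (1 - self.beta2) * g * g
            m_hat = s["m"][k] / (1 - self.beta1 ** s["t"])
            v_hat = s["v"][k] / (1 - self.beta2 ** s["t"])
            params[j] -= self.lr * m_hat / (math.sqrt(v_hat) + self.eps)

# The same two gradients (+1, then -1) applied to one chunk under each policy.
persist, reset = ChunkAdam(), ChunkAdam(reset_on_activation=True)
p_persist, p_reset = [0.0], [0.0]
for g in (1.0, -1.0):
    persist.step(p_persist, chunk_id=0, indices=[0], grads=[g])
    reset.step(p_reset, chunk_id=0, indices=[0], grads=[g])
print(p_persist, p_reset)
```

The two policies produce different parameters from identical gradients, which is why the referee's request matters for reproducing the trajectory. Persisting state costs memory or offload bandwidth for every chunk, whereas resetting is cheap but discards the momentum history.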

Circularity Check

0 steps flagged

No circularity: algorithmic definition with separate theoretical analysis

The paper defines ChunkFT algorithmically as a reformulation around dynamically activated working sets for gradient computation on sub-tensors. It supplies an independent theoretical convergence analysis in the deterministic setting, distinct from the empirical performance numbers. No derivation step reduces a claimed result to a fitted parameter, self-citation chain, or input by construction. Results are validated against external full fine-tuning and baselines on downstream tasks, keeping the chain self-contained.

Axiom & Free-Parameter Ledger

1 free parameter · 1 axiom · 0 invented entities

The central claim rests on the premise that sub-tensor gradient computation can be performed without architectural modification and that sequential chunk updates preserve convergence behavior.

free parameters (1)
  • chunk size / working-set size
    Chosen dynamically to fit available GPU memory; exact selection rule not visible in abstract.
axioms (1)
  • domain assumption: Gradient computation on an arbitrary sub-tensor yields a valid update direction for the full model when chunks are cycled through.
    Invoked in the reformulation of full-parameter fine-tuning around a dynamically activated working set.
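
For intuition about why this axiom is plausible in the deterministic case, the standard block-coordinate descent argument is sketched below. It is not the paper's theorem. It assumes an $L$-smooth objective $f$, a plain gradient step with step size $\eta$ on the active chunk $k$, and all other coordinates held fixed; the symbols $U_k$ (the embedding of chunk $k$ into the full parameter space) and $\nabla_k f$ (the gradient restricted to chunk $k$) are notation introduced here.

```latex
% Update on the active chunk k only:
\theta^{+} = \theta - \eta\, U_k \nabla_k f(\theta)

% L-smoothness gives, for any direction d,
f(\theta + d) \le f(\theta) + \langle \nabla f(\theta), d \rangle + \frac{L}{2}\,\lVert d \rVert^2

% Substituting d = -\eta\, U_k \nabla_k f(\theta):
f(\theta^{+}) \le f(\theta) - \eta \left(1 - \frac{L\eta}{2}\right) \lVert \nabla_k f(\theta) \rVert^2
```

For $0 < \eta < 2/L$ the right-hand side is strictly smaller than $f(\theta)$ whenever $\nabla_k f(\theta) \neq 0$, so each chunk update is a descent step on the full objective. The argument says nothing about stochastic gradients or adaptive optimizers, which is the gap the referee report points to.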

pith-pipeline@v0.9.0 · 5797 in / 1207 out tokens · 31887 ms · 2026-05-21T05:42:57.216495+00:00 · methodology
