QWHA: Quantization-Aware Walsh-Hadamard Adaptation for Parameter-Efficient Fine-Tuning on Large Language Models

Beomseok Kang; Hyesung Jeon; Jae-Joon Kim; Seojune Lee; Yulhwa Kim

arxiv: 2509.17428 · v4 · pith:VFMDO24Fnew · submitted 2025-09-22 · 💻 cs.CL

Hyesung Jeon, Seojune Lee, Beomseok Kang, Yulhwa Kim, Jae-Joon Kim

Pith reviewed 2026-05-21 21:46 UTC · model grok-4.3

classification 💻 cs.CL

keywords: quantization-aware fine-tuning · parameter-efficient fine-tuning · Walsh-Hadamard transform · large language models · low-bit quantization · adapters · model compression

The pith

QWHA uses Walsh-Hadamard transforms and adaptive initialization to reduce quantization errors in fine-tuned language models while lowering training costs.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

Large language models require both quantization to cut inference costs and parameter-efficient fine-tuning to limit training overhead. Low-rank adapters often lack enough capacity for accurate results after quantization, and earlier Fourier-based adapters add overhead without fully correcting errors. QWHA solves this by adopting the Walsh-Hadamard Transform as the core kernel together with a new initialization method that selects and refines parameters. The approach mitigates quantization errors, supports fine-tuning, and cuts computational demands. Experiments show higher accuracy at low bit widths and faster training than prior adapters.
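The capacity argument above can be illustrated numerically. The sketch below compares the rank of a low-rank update with the rank of an update built from the same number of nonzero coefficients in the Walsh-Hadamard domain. The sizes, the random placement of coefficients, and the helper `hadamard` are illustrative assumptions, not the paper's configuration or its selection rule.

```python
import numpy as np

def hadamard(n):
    """Dense Sylvester-type Hadamard matrix with +1/-1 entries (n a power of two)."""
    H = np.array([[1.0]])
    while H.shape[0] < n:
        H = np.block([[H, H], [H, -H]])
    return H

rng = np.random.default_rng(0)
d, r = 64, 4
budget = 2 * d * r  # trainable parameters of a rank-r adapter on a d x d weight

# Low-rank update: product of a d x r and an r x d factor, so rank <= r.
A = rng.standard_normal((d, r))
B = rng.standard_normal((r, d))
delta_lowrank = A @ B

# Transform-domain update: `budget` nonzero coefficients at random positions
# (random placement is an assumption made for this sketch only).
F = np.zeros(d * d)
positions = rng.choice(d * d, size=budget, replace=False)
F[positions] = rng.standard_normal(budget)
F = F.reshape(d, d)

# The Sylvester matrix is symmetric with H @ H = d * I, so H^{-1} = H / d.
H_inv = hadamard(d) / d
delta_wht = F @ H_inv

rank_lowrank = np.linalg.matrix_rank(delta_lowrank)
rank_wht = np.linalg.matrix_rank(delta_wht)
print("low-rank update rank:", rank_lowrank)
print("sparse WHT-domain update rank:", rank_wht)
```

Because the Hadamard matrix is invertible, the rank of the transform-domain update equals the rank of the sparse coefficient matrix, which is not capped at `r` the way the factorized update is.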

Core claim

QWHA integrates Fourier-related adapters into quantized models by using the Walsh-Hadamard Transform as the kernel and a novel initialization scheme with adaptive parameter selection and value refinement, which mitigates quantization errors, facilitates fine-tuning, and substantially reduces computational cost compared with existing methods.

What carries the argument

Walsh-Hadamard Transform kernel combined with adaptive parameter selection and value refinement for adapter initialization
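For readers unfamiliar with the kernel, here is a minimal sketch of a fast Walsh-Hadamard transform in NumPy, checked against a dense Hadamard matrix product. It is a generic textbook butterfly, not the paper's implementation; the function names `fwht` and `hadamard` are chosen here for illustration.

```python
import numpy as np

def hadamard(n):
    """Dense Sylvester-type Hadamard matrix with +1/-1 entries (n a power of two)."""
    H = np.array([[1.0]])
    while H.shape[0] < n:
        H = np.block([[H, H], [H, -H]])
    return H

def fwht(x):
    """Unnormalized fast Walsh-Hadamard transform along the last axis.

    Uses only additions and subtractions, O(n log n) per length-n vector.
    """
    x = np.array(x, dtype=float)
    n = x.shape[-1]
    if n == 0 or n & (n - 1):
        raise ValueError("length must be a power of two")
    lead = x.shape[:-1]
    h = 1
    while h < n:
        # Split into blocks of size 2h and butterfly the two halves of each block.
        y = x.reshape(*lead, n // (2 * h), 2, h)
        top = y[..., 0, :] + y[..., 1, :]
        bottom = y[..., 0, :] - y[..., 1, :]
        x = np.stack([top, bottom], axis=-2).reshape(*lead, n)
        h *= 2
    return x

rng = np.random.default_rng(0)
W = rng.standard_normal((4, 16))

dense = W @ hadamard(16)       # O(n^2) per row
fast = fwht(W)                 # O(n log n) per row
max_diff = np.abs(dense - fast).max()

# Applying the unnormalized transform twice scales the input by n.
roundtrip = fwht(fast) / 16
print("max difference between dense and fast transform:", max_diff)
```

The kernel has only +1 and -1 entries, so the transform needs no multiplications, which is one reason it is cheaper than a dense matrix product.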

Load-bearing premise

Prior Fourier-related transform adapters suffer from ineffective error reduction and added overhead when used directly in quantized models, and the Walsh-Hadamard kernel plus adaptive initialization overcomes this limitation.

What would settle it

Repeating the reported experiments on the same low-bit quantized models and finding no accuracy gain over baselines or no training speedup would show the method does not deliver the claimed benefits.

Figures

Figures reproduced from arXiv: 2509.17428 by Beomseok Kang, Hyesung Jeon, Jae-Joon Kim, Seojune Lee, Yulhwa Kim.

**Figure 2.** (a) Comparison of rank in weight updates between low-rank and FT-based adapters across …

**Figure 3.** (a) Average coverage of outlier components within the selected parameters. (b) …

**Figure 4.** Rank of adapter weights for each parameter selection method.

**Figure 5.** Effect of refinement on average layer output error. This allows the selected basis vectors to account for the impact of unselected vectors, yielding a more accurate approximation. Without this step, interactions among basis vectors are ignored, leading to suboptimal error reduction. Note that the refinement is applicable regardless of the parameter selection strategy.

**Figure 6.** Accuracy of CLoQ and QWHA …

**Figure 7.** (a) Weight quantization error distribution and (b) its channel-wise similarity to the pre…

**Figure 8.** Singular value and coefficient magnitude (squared) distributions with the Pareto hill index …

**Figure 9.** Parameter selection patterns and two example zoomed-in results of each method in the …

Original abstract

The demand for efficient deployment of large language models (LLMs) has driven interest in quantization, which reduces inference cost, and parameter-efficient fine-tuning (PEFT), which lowers training overhead. This motivated the development of quantization-aware PEFT to produce accurate yet efficient quantized models. In this setting, reducing quantization error prior to fine-tuning is crucial for achieving high model accuracy. However, existing methods that rely on low-rank adaptation suffer from limited representational capacity. Recent Fourier-related transform (FT)-based adapters offer greater representational power than low-rank adapters, but their direct integration into quantized models often results in ineffective error reduction and increased computational overhead. To overcome these limitations, we propose QWHA, a method that integrates FT-based adapters into quantized models by employing the Walsh-Hadamard Transform (WHT) as the transform kernel, together with a novel adapter initialization scheme incorporating adaptive parameter selection and value refinement. We demonstrate that QWHA effectively mitigates quantization errors while facilitating fine-tuning, and that its design substantially reduces computational cost. Experimental results show that QWHA consistently outperforms baselines in low-bit quantization accuracy and achieves significant training speedups over existing FT-based adapters. The code is available at https://github.com/vantaa89/qwha.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Desk Editor's Note (private letter to a colleague)

QWHA replaces the Fourier-related transform kernel in quantized PEFT adapters with the Walsh-Hadamard Transform and adds a custom initialization, delivering accuracy and speed gains, but the error-mitigation story still rests on downstream metrics rather than direct error measurements.

The main point is that this paper takes recent Fourier-transform adapters for PEFT and swaps in the Walsh-Hadamard kernel with an adaptive initialization scheme to make them play nicer with quantized LLMs. The result is lower training overhead and better accuracy at low bit widths compared with the FT baselines they test against. They also ship the code, which is useful for anyone who wants to try it directly.

Referee Report

1 major / 1 minor

Summary. The manuscript proposes QWHA, a quantization-aware parameter-efficient fine-tuning method for large language models. It integrates Walsh-Hadamard Transform (WHT) kernels into Fourier-related transform adapters together with an adaptive initialization scheme (parameter selection and value refinement) to mitigate quantization errors, enable effective fine-tuning of quantized models, reduce computational overhead relative to prior FT-based adapters, and achieve higher low-bit quantization accuracy along with training speedups.

Significance. If the central claims hold, QWHA would provide a concrete advance in quantization-aware PEFT by addressing representational and overhead limitations of both low-rank adapters and existing FT-based methods, with direct relevance to efficient LLM deployment. The public code release at the cited GitHub repository is a clear strength for reproducibility.

major comments (1)

Experimental Results section: The paper's core claim that QWHA 'effectively mitigates quantization errors' via the WHT kernel plus adaptive initialization lacks direct empirical support. Downstream task accuracies and speedups are reported as outperforming FT-based baselines, yet no pre-/post-adapter quantization error metrics (e.g., Frobenius norm, element-wise error, or reconstruction error between original and quantized weights) are provided to isolate the claimed error-reduction mechanism from general PEFT or fine-tuning effects. This gap is load-bearing because the motivation explicitly contrasts QWHA against prior FT adapters on the basis of ineffective error reduction.

minor comments (1)

Abstract: The statement that QWHA 'consistently outperforms baselines in low-bit quantization accuracy' would be strengthened by including at least one concrete quantitative example (e.g., average accuracy delta or specific bit-width results) rather than remaining purely qualitative.

Simulated Author's Rebuttal

1 response · 0 unresolved

We thank the referee for the constructive feedback and for recognizing the potential significance of QWHA for quantization-aware PEFT. We address the single major comment below and will incorporate the suggested improvements in the revised manuscript.

Referee: The paper's core claim that QWHA 'effectively mitigates quantization errors' via the WHT kernel plus adaptive initialization lacks direct empirical support. Downstream task accuracies and speedups are reported as outperforming FT-based baselines, yet no pre-/post-adapter quantization error metrics (e.g., Frobenius norm, element-wise error, or reconstruction error between original and quantized weights) are provided to isolate the claimed error-reduction mechanism from general PEFT or fine-tuning effects. This gap is load-bearing because the motivation explicitly contrasts QWHA against prior FT adapters on the basis of ineffective error reduction.

Authors: We agree that direct quantification of quantization error reduction would more rigorously isolate the contribution of the WHT kernel and adaptive initialization from general fine-tuning effects. Our current experiments focus on end-to-end downstream accuracy and training speed, which provide indirect evidence of effective error mitigation through consistent outperformance over FT-based baselines. To address this, we will add new experiments in the revised manuscript that report pre- and post-adaptation quantization error metrics (including Frobenius norm and mean squared reconstruction error) on selected layers across the evaluated models and bit-widths. These additions will directly support the motivation section's contrast with prior FT adapters.

Revision: yes
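The error metrics named in this exchange (Frobenius norm and mean squared reconstruction error, measured before and after adding an adapter) can be computed as in the sketch below. The round-to-nearest quantizer, the random weight matrix, and the rank-8 SVD correction used as a stand-in adapter are all assumptions for illustration; none of them is the paper's quantizer or initialization.

```python
import numpy as np

def quantize_rtn(W, bits=4):
    """Per-row asymmetric round-to-nearest uniform quantization (illustrative only)."""
    levels = 2 ** bits - 1
    w_min = W.min(axis=1, keepdims=True)
    w_max = W.max(axis=1, keepdims=True)
    scale = (w_max - w_min) / levels
    scale = np.where(scale == 0, 1.0, scale)  # guard against constant rows
    q = np.clip(np.round((W - w_min) / scale), 0, levels)
    return q * scale + w_min

def frobenius_error(W, W_hat):
    """Frobenius norm of the reconstruction error."""
    return np.linalg.norm(W - W_hat, ord="fro")

def mse(W, W_hat):
    """Mean squared element-wise reconstruction error."""
    return np.mean((W - W_hat) ** 2)

rng = np.random.default_rng(0)
W = rng.standard_normal((64, 64))   # hypothetical full-precision weight
W_q = quantize_rtn(W, bits=2)       # hypothetical 2-bit quantized weight

# Stand-in adapter: best rank-8 approximation of the residual W - W_q.
U, S, Vt = np.linalg.svd(W - W_q, full_matrices=False)
k = 8
adapter = (U[:, :k] * S[:k]) @ Vt[:k]

err_before = frobenius_error(W, W_q)
err_after = frobenius_error(W, W_q + adapter)
print("Frobenius error before adapter:", err_before)
print("Frobenius error after adapter: ", err_after)
print("MSE before / after:", mse(W, W_q), mse(W, W_q + adapter))
```

Reporting both numbers per layer and per bit-width, for each adapter type, is the kind of direct measurement the referee asks for; a layer-output version would compare `X @ W.T` against `X @ (W_q + adapter).T` on calibration inputs `X`.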

Circularity Check

0 steps flagged

No significant circularity; method is empirical design validated externally

The paper introduces QWHA as a practical combination of Walsh-Hadamard Transform kernel and adaptive initialization for quantization-aware PEFT. No equations, derivations, or first-principles predictions appear in the provided text that reduce the claimed error mitigation or speedups to fitted parameters, self-definitions, or self-citation chains. Claims rest on experimental comparisons to baselines rather than internal reductions, rendering the approach self-contained against external benchmarks.

Axiom & Free-Parameter Ledger

0 free parameters · 1 axiom · 0 invented entities

Review is based solely on the abstract; therefore the ledger is necessarily incomplete. The method introduces a novel initialization scheme whose internal parameters are not specified here. No new physical entities are postulated.

axioms (1)

Domain assumption: Reducing quantization error prior to fine-tuning is crucial for achieving high model accuracy.
Explicitly stated in the abstract as the key motivation for the work.

Lean theorems connected to this paper

Citations machine-checked in the Pith Canon. Every link opens the source theorem in the public Lean library.

IndisputableMonolith/Cost/FunctionalEquation.lean · washburn_uniqueness_aczel

Relation between the paper passage and the cited Recognition theorem: unclear

arg min_{c,E} ∥ΔW_Q R - F H^{-1} R∥_F^2 ... AdaAlloc ... Value Refinement
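One plausible reading of the quoted objective is a weighted least-squares fit: choose sparse transform-domain coefficients F so that F H^{-1} R matches ΔW_Q R. The sketch below works on a single row and compares two ways of filling a fixed support: copying the transform coefficients directly, or re-solving for the values by least squares. The meanings assigned to the symbols (ΔW_Q as the quantization error, R as a weighting matrix), the magnitude-based support, and all sizes are assumptions, since the excerpt does not define them; this is not the paper's AdaAlloc or Value Refinement procedure.

```python
import numpy as np

def hadamard(n):
    """Dense Sylvester-type Hadamard matrix with +1/-1 entries (n a power of two)."""
    H = np.array([[1.0]])
    while H.shape[0] < n:
        H = np.block([[H, H], [H, -H]])
    return H

rng = np.random.default_rng(0)
n, m, k = 32, 48, 8                 # row length, columns of R, kept coefficients
H = hadamard(n)
H_inv = H / n                       # H is symmetric and H @ H = n * I

e = rng.standard_normal(n)          # one row of a hypothetical error matrix
R = rng.standard_normal((n, m))     # hypothetical weighting matrix

coeffs = e @ H                      # full coefficients, so e == coeffs @ H_inv
support = np.argsort(-np.abs(coeffs))[:k]  # keep the k largest (an assumption)

# Option 1: copy the selected coefficients unchanged.
f_naive = np.zeros(n)
f_naive[support] = coeffs[support]
err_naive = np.linalg.norm(e @ R - f_naive @ H_inv @ R)

# Option 2: re-solve the values on the same support by least squares,
# minimizing || e R - f_S (H^{-1})_S R ||.
basis = H_inv[support] @ R          # shape (k, m)
values, *_ = np.linalg.lstsq(basis.T, e @ R, rcond=None)
f_refined = np.zeros(n)
f_refined[support] = values
err_refined = np.linalg.norm(e @ R - f_refined @ H_inv @ R)

print("weighted error, copied coefficients:  ", err_naive)
print("weighted error, re-solved coefficients:", err_refined)
```

Once R is applied, the rows of H^{-1} R are generally no longer orthogonal, so the copied coefficients are not optimal and the re-solved values can only match or lower the weighted error on the same support. Because the squared Frobenius norm is a sum over rows, the same per-row fit extends to a whole matrix.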

What do these tags mean?

matches: The paper's claim is directly supported by a theorem in the formal canon.
supports: The theorem supports part of the paper's argument, but the paper may add assumptions or extra steps.
extends: The paper goes beyond the formal theorem; the theorem is a base layer rather than the whole result.
uses: The paper appears to rely on the theorem as machinery.
contradicts: The paper's claim conflicts with a theorem or certificate in the canon.
unclear: Pith found a possible connection, but the passage is too broad, indirect, or ambiguous to say the theorem truly supports the claim.


[1] [1]

Systematic outliers in large language models, 2025

Yongqi An, Xu Zhao, Tao Yu, Ming Tang, and Jinqiao Wang. Systematic outliers in large language models, 2025

work page 2025

[2] [2]

Barry C. Arnold. Pareto Distributions. International Co-operative Publishing House, 1983. ISBN 9780429169410. doi:https://doi.org/10.1201/b18141

work page doi:10.1201/b18141 1983

[3] [3]

Quarot: Outlier-free 4-bit inference in rotated llms

Saleh Ashkboos, Amirkeivan Mohtashami, Maximilian Croci, Bo Li, Pashmina Cameron, Martin Jaggi, Dan Alistarh, Torsten Hoefler, and James Hensman. Quarot: Outlier-free 4-bit inference in rotated llms. Proceedings of the 37th International Conference on Neural Information Processing Systems (NeurIPS), 37: 0 100213--100240, 2024

work page 2024

[4] [4]

Sparse high rank adapters

Kartikeya Bhardwaj, Nilesh Prasad Pandey, Sweta Priyadarshi, Viswanath Ganapathy, Shreya Kadambi, Rafael Esteves, Shubhankar Borse, Paul Whatmough, Risheek Garrepalli, Mart Van Baalen, Harris Teague, and Markus Nagel. Sparse high rank adapters. In Proceedings of the 38th International Conference on Neural Information Processing Systems, NeurIPS '24, 2024

work page 2024

[5] [5]

Piqa: Reasoning about physical commonsense in natural language

Yonatan Bisk, Rowan Zellers, Jianfeng Gao, Yejin Choi, et al. Piqa: Reasoning about physical commonsense in natural language. In Proceedings of the AAAI conference on artificial intelligence, volume 34, pp.\ 7432--7439, 2020

work page 2020

[6] [6]

BoolQ: Exploring the Surprising Difficulty of Natural Yes/No Questions

Christopher Clark, Kenton Lee, Ming-Wei Chang, Tom Kwiatkowski, Michael Collins, and Kristina Toutanova. Boolq: Exploring the surprising difficulty of natural yes/no questions. arXiv preprint arXiv:1905.10044, 2019

work page internal anchor Pith review Pith/arXiv arXiv 1905

[7] [7]

Think you have Solved Question Answering? Try ARC, the AI2 Reasoning Challenge

Peter Clark, Isaac Cowhey, Oren Etzioni, Tushar Khot, Ashish Sabharwal, Carissa Schoenick, and Oyvind Tafjord. Think you have solved question answering? try arc, the ai2 reasoning challenge. arXiv preprint arXiv:1803.05457, 2018

work page internal anchor Pith review Pith/arXiv arXiv 2018

[8] [8]

Training Verifiers to Solve Math Word Problems

Karl Cobbe, Vineet Kosaraju, Mohammad Bavarian, Mark Chen, Heewoo Jun, Lukasz Kaiser, Matthias Plappert, Jerry Tworek, Jacob Hilton, Reiichiro Nakano, Christopher Hesse, and John Schulman. Training verifiers to solve math word problems, 2021. URL https://arxiv.org/abs/2110.14168

[9]

The rank of sparse random matrices

Amin Coja-Oghlan, Alperen A. Ergur, Pu Gao, Samuel Hetterich, and Maurice Rolvien. The rank of sparse random matrices. The Proceedings of the 31st ACM-SIAM Symposium on Discrete Algorithms (SODA), pp. 579--591, 2020

[10]

fast-hadamard-transform

Dao-AILab. fast-hadamard-transform. https://github.com/Dao-AILab/fast-hadamard-transform, 2024. Accessed: 2025-05-17

[11]

DeepSeek-AI, Daya Guo, Dejian Yang, Haowei Zhang, Junxiao Song, Ruoyu Zhang, Runxin Xu, Qihao Zhu, Shirong Ma, Peiyi Wang, Xiao Bi, Xiaokang Zhang, Xingkai Yu, Yu Wu, Z. F. Wu, Zhibin Gou, Zhihong Shao, Zhuoshu Li, Ziyi Gao, Aixin Liu, Bing Xue, Bingxuan Wang, Bochao Wu, Bei Feng, Chengda Lu, Chenggang Zhao, Chengqi Deng, Chenyu Zhang, Chong Ruan, Damai D...

[12]

Cloq: Enhancing fine-tuning of quantized llms via calibrated lora initialization

Yanxia Deng, Aozhong Zhang, Naigang Wang, Selcuk Gurses, Zi Yang, and Penghang Yin. Cloq: Enhancing fine-tuning of quantized llms via calibrated lora initialization. Transactions on Machine Learning Research (TMLR), 2025

[13]

Qlora: Efficient finetuning of quantized llms

Tim Dettmers, Artidoro Pagnoni, Ari Holtzman, and Luke Zettlemoyer. Qlora: Efficient finetuning of quantized llms. Proceedings of the 36th International Conference on Neural Information Processing Systems (NeurIPS), 36:10088--10115, 2023

[14]

Spqr: A sparse-quantized representation for near-lossless llm weight compression

Tim Dettmers, Ruslan Svirschevski, Vage Egiazarian, Denis Kuznedelev, Elias Frantar, Saleh Ashkboos, Alexander Borzunov, Torsten Hoefler, and Dan Alistarh. Spqr: A sparse-quantized representation for near-lossless llm weight compression, 2024

[15]

Loca: Location-aware cosine adaptation for parameter-efficient fine-tuning

Zhekai Du, Yinjie Min, Jingjing Li, Ke Lu, Changliang Zou, Liuhua Peng, Tingjin Chu, and Mingming Gong. Loca: Location-aware cosine adaptation for parameter-efficient fine-tuning. 13th International Conference on Learning Representations (ICLR), 2025

[16]

Gptq: Accurate post-training quantization for generative pre-trained transformers

Elias Frantar, Saleh Ashkboos, Torsten Hoefler, and Dan-Adrian Alistarh. Gptq: Accurate post-training quantization for generative pre-trained transformers. In 11th International Conference on Learning Representations (ICLR), 2023

[17]

Leo Gao, Jonathan Tow, Baber Abbasi, Stella Biderman, Sid Black, Anthony DiPofi, Charles Foster, Laurence Golding, Jeffrey Hsu, Alain Le Noac'h, Haonan Li, Kyle McDonell, Niklas Muennighoff, Chris Ociepa, Jason Phang, Laria Reynolds, Hailey Schoelkopf, Aviya Skowron, Lintang Sutawika, Eric Tang, Anish Thite, Ben Wang, Kevin Wang, and Andy Zou. The languag...

[18]

Parameter-efficient fine-tuning with discrete fourier transform

Ziqi Gao, Qichao Wang, Aochuan Chen, Zijing Liu, Bingzhe Wu, Liang Chen, and Jia Li. Parameter-efficient fine-tuning with discrete fourier transform. Proceedings of the 41st International Conference on Machine Learning (ICML), 2024b

[19]

System and method for generating orthogonal codes

Diakoumis P. Gerakoulis and Saeed S. Ghassemzadeh. System and method for generating orthogonal codes, Mar 2004

[20]

The Llama 3 Herd of Models

Aaron Grattafiori, Abhimanyu Dubey, Abhinav Jauhri, Abhinav Pandey, Abhishek Kadian, Ahmad Al-Dahle, Aiesha Letman, Akhil Mathur, Alan Schelten, Alex Vaughan, et al. The llama 3 herd of models. arXiv preprint arXiv:2407.21783, 2024

[21]

Lq-lora: Low-rank plus quantized matrix decomposition for efficient language model finetuning

Han Guo, Philip Greengard, Eric P. Xing, and Yoon Kim. Lq-lora: Low-rank plus quantized matrix decomposition for efficient language model finetuning, 2024

[22]

A. Hedayat. Hadamard matrices and their applications. The Annals of Statistics, 6, 11 1978. doi:10.1214/aos/1176344370

[23]

Orthogonal Arrays: Theory and Applications

Ashok S. Hedayat, Neil J. A. Sloane, and John Stufken. Orthogonal Arrays: Theory and Applications. Springer Series in Statistics. Springer, 1999. ISBN 978-0-387-98766-8

[24]

Lora: Low-rank adaptation of large language models

Edward J Hu, Yelong Shen, Phillip Wallis, Zeyuan Allen-Zhu, Yuanzhi Li, Shean Wang, Lu Wang, Weizhu Chen, et al. Lora: Low-rank adaptation of large language models. 10th International Conference on Learning Representations (ICLR), 2022

[25]

Ra-lora: Rank-adaptive parameter-efficient fine-tuning for accurate 2-bit quantized large language models

Minsoo Kim, Sihwa Lee, Wonyong Sung, and Jungwook Choi. Ra-lora: Rank-adaptive parameter-efficient fine-tuning for accurate 2-bit quantized large language models. In Findings of the Association for Computational Linguistics 2024 (ACL), pp. 15773--15786, 2024a

[26]

Squeezellm: Dense-and-sparse quantization

Sehoon Kim, Coleman Hooper, Amir Gholami, Zhen Dong, Xiuyu Li, Sheng Shen, Michael W. Mahoney, and Kurt Keutzer. Squeezellm: Dense-and-sparse quantization, 2024b

[27]

Vera: Vector-based random matrix adaptation

Dawid J. Kopiczko, Tijmen Blankevoort, and Yuki M. Asano. Vera: Vector-based random matrix adaptation, 2024

[28]

Henry O. Kunz. On the equivalence between one-dimensional discrete walsh-hadamard and multidimensional discrete fourier transforms. IEEE Transactions on Computers, C-28(3):267--268, 1979. doi:10.1109/TC.1979.1675334

[29]

Prefix-tuning: Optimizing continuous prompts for generation

Xiang Lisa Li and Percy Liang. Prefix-tuning: Optimizing continuous prompts for generation. Proceedings of the 59th Annual Meeting of the Association for Computational Linguistics (ACL), 2021

[30]

Loftq: Lora-fine-tuning-aware quantization for large language models

Yixiao Li, Yifan Yu, Chen Liang, Nikos Karampatziakis, Pengcheng He, Weizhu Chen, and Tuo Zhao. Loftq: Lora-fine-tuning-aware quantization for large language models. In 12th International Conference on Learning Representations (ICLR), 2024

[31]

Apiq: Finetuning of 2-bit quantized large language model

Baohao Liao, Christian Herold, Shahram Khadivi, and Christof Monz. Apiq: Finetuning of 2-bit quantized large language model. In Proceedings of the 2024 Conference on Empirical Methods in Natural Language Processing (EMNLP), pp. 20996--21020, 2024

[32]

Awq: Activation-aware weight quantization for llm compression and acceleration

Ji Lin, Jiaming Tang, Haotian Tang, Shang Yang, Wei-Ming Chen, Wei-Chen Wang, Guangxuan Xiao, Xingyu Dang, Chuang Gan, and Song Han. Awq: Activation-aware weight quantization for llm compression and acceleration, 2024

[33]

Haokun Liu, Derek Tam, Mohammed Muqeeth, Jay Mohta, Tenghao Huang, Mohit Bansal, and Colin A. Raffel. Few-shot parameter-efficient fine-tuning is better and cheaper than in-context learning. In The 35th Annual Conference on Neural Information Processing Systems (NeurIPS), 2022

[34]

Visual instruction tuning

Haotian Liu, Chunyuan Li, Qingyang Wu, and Yong Jae Lee. Visual instruction tuning, 2023

[35]

Dora: Weight-decomposed low-rank adaptation

Shih-Yang Liu, Chien-Yi Wang, Hongxu Yin, Pavlo Molchanov, Yu-Chiang Frank Wang, Kwang-Ting Cheng, and Min-Hung Chen. Dora: Weight-decomposed low-rank adaptation, 2024

[36]

Spinquant: Llm quantization with learned rotations

Zechun Liu, Changsheng Zhao, Igor Fedorov, Bilge Soran, Dhruv Choudhary, Raghuraman Krishnamoorthi, Vikas Chandra, Yuandong Tian, and Tijmen Blankevoort. Spinquant: Llm quantization with learned rotations, 2025

[37]

Decoupled Weight Decay Regularization

Ilya Loshchilov and Frank Hutter. Decoupled weight decay regularization. arXiv preprint arXiv:1711.05101, 2017

[38]

Pointer Sentinel Mixture Models

Stephen Merity, Caiming Xiong, James Bradbury, and Richard Socher. Pointer sentinel mixture models. CoRR, abs/1609.07843, 2016. URL http://arxiv.org/abs/1609.07843

[39]

Mistral 7b v0.3

Mistral AI. Mistral 7b v0.3. https://huggingface.co/mistralai/Mistral-7B-v0.3, 2024. Model card, Apache 2.0 license, released 2024/11/30

[40]

B. K. Natarajan. Sparse approximate solutions to linear systems. SIAM Journal on Computing, 24(2):227--234, 1995. doi:10.1137/S0097539792240406. URL https://doi.org/10.1137/S0097539792240406

[41]

Toolllm: Facilitating large language models to master 16000+ real-world apis

Yujia Qin, Shihao Liang, Yining Ye, Kunlun Zhu, Lan Yan, Yaxi Lu, Yankai Lin, Xin Cong, Xiangru Tang, Bill Qian, Sihan Zhao, Lauren Hong, Runchu Tian, Ruobing Xie, Jie Zhou, Mark Gerstein, Dahai Li, Zhiyuan Liu, and Maosong Sun. Toolllm: Facilitating large language models to master 16000+ real-world apis, 2024

[42]

Winogrande: An adversarial winograd schema challenge at scale

Keisuke Sakaguchi, Ronan Le Bras, Chandra Bhagavatula, and Yejin Choi. Winogrande: An adversarial winograd schema challenge at scale. Communications of the ACM, 64(9):99--106, 2021

[43]

Hadamard matrices, sequences, and block designs

Jennifer R. Seberry and Mieko Yamada. Hadamard matrices, sequences, and block designs. In Jeffrey H. Dinitz and Douglas R. Stinson (eds.), Contemporary Design Theory: A Collection of Surveys, pp. 431--560. Wiley, 1992

[44]

Omniquant: Omnidirectionally calibrated quantization for large language models

Wenqi Shao, Mengzhao Chen, Zhaoyang Zhang, Peng Xu, Lirui Zhao, Zhiqian Li, Kaipeng Zhang, Peng Gao, Yu Qiao, and Ping Luo. Omniquant: Omnidirectionally calibrated quantization for large language models, 2024

[45]

Ssh: Sparse spectrum adaptation via discrete hartley transformation

Yixian Shen, Qi Bi, Jia-Hong Huang, Hongyi Zhu, Andy D Pimentel, and Anuj Pathania. Ssh: Sparse spectrum adaptation via discrete hartley transformation. Proceedings of the 2025 Conference of the Nations of the Americas Chapter of the Association for Computational Linguistics (NAACL), 2025

[46]

N. J. A. Sloane. A library of hadamard matrices. http://neilsloane.com/hadamard/, 2004. Accessed: 2025-05-16

[47]

Commonsenseqa: A question answering challenge targeting commonsense knowledge

Alon Talmor, Jonathan Herzig, Nicholas Lourie, and Jonathan Berant. Commonsenseqa: A question answering challenge targeting commonsense knowledge. In North American Chapter of the Association for Computational Linguistics (NAACL), 2019

[48]

Stanford alpaca: An instruction-following llama model

Rohan Taori, Ishaan Gulrajani, Tianyi Zhang, Yann Dubois, Xuechen Li, Carlos Guestrin, Percy Liang, and Tatsunori B. Hashimoto. Stanford alpaca: An instruction-following llama model. https://github.com/tatsu-lab/stanford_alpaca, 2023

[49]

Quip#: Even better llm quantization with hadamard incoherence and lattice codebooks

Albert Tseng, Jerry Chee, Qingyao Sun, Volodymyr Kuleshov, and Christopher De Sa. Quip#: Even better llm quantization with hadamard incoherence and lattice codebooks, 2024

[50]

Finetuned language models are zero-shot learners

Jason Wei, Maarten Bosma, Vincent Y. Zhao, Kelvin Guu, Adams Wei Yu, Brian Lester, Nan Du, Andrew M. Dai, and Quoc V. Le. Finetuned language models are zero-shot learners, 2022

[51]

HellaSwag: Can a Machine Really Finish Your Sentence?

Rowan Zellers, Ari Holtzman, Yonatan Bisk, Ali Farhadi, and Yejin Choi. Hellaswag: Can a machine really finish your sentence? arXiv preprint arXiv:1905.07830, 2019

[52]

Magr: Weight magnitude reduction for enhancing post-training quantization

Aozhong Zhang, Naigang Wang, Yanxia Deng, Xin Li, Zi Yang, and Penghang Yin. Magr: Weight magnitude reduction for enhancing post-training quantization, 2024

[53]

Apollo: Sgd-like memory, adamw-level performance

Hanqing Zhu, Zhenyu Zhang, Wenyan Cong, Xi Liu, Sem Park, Vikas Chandra, Bo Long, David Z. Pan, Zhangyang Wang, and Jinwon Lee. Apollo: Sgd-like memory, adamw-level performance, 2025
