TAH-QUANT: Effective Activation Quantization in Pipeline Parallelism over Slow Network

Binhang Yuan; Guangxin He; Kai Chen; Kun Yuan; Tianyi Bai; Yuan Cao; Yutong He

arxiv: 2506.01352 · v2 · submitted 2025-06-02 · 💻 cs.LG

Guangxin He, Yuan Cao, Yutong He, Tianyi Bai, Kai Chen, Kun Yuan, Binhang Yuan

Pith reviewed 2026-05-19 11:20 UTC · model grok-4.3

classification 💻 cs.LG

keywords activation quantization · pipeline parallelism · distributed training · large language models · convergence rate · quantization error · Hadamard transform · adaptive bit allocation

The pith

TAH-Quant enables 3-4 bit activation quantization in pipeline parallelism while preserving the convergence rate of standard SGD.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

Large language models trained across distributed devices face communication bottlenecks when sending activations between pipeline stages over slow networks. TAH-Quant addresses this by quantizing those activations to a few bits using tile-wise adaptive bit allocation guided by entropy, combined with a Hadamard transformation and pivot swapping to handle outliers. The framework proves that this quantization does not degrade the overall training convergence beyond the standard rate of vanilla stochastic gradient descent. If effective, it allows much higher throughput in resource-pooled decentralized training without extra caching costs or loss of model quality.

Core claim

Pipeline parallel training equipped with TAH-Quant maintains a convergence rate of O(1/sqrt(T)), matching that of vanilla stochastic gradient descent, while achieving an aggressive activation quantization ratio of 3-4 bits and up to 4.3x throughput speedup.
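For orientation, a rate of O(1/sqrt(T)) for non-convex stochastic gradient descent is usually stated in the shape below. This is the textbook form of such a guarantee, offered as an assumption about how to read the claim; the symbols f (training objective), x_t (iterate at step t), and T (number of steps) are introduced here for illustration, and the paper's exact theorem, constants, and assumptions are not reproduced.

```latex
\frac{1}{T} \sum_{t=0}^{T-1} \mathbb{E}\,\bigl\| \nabla f(x_t) \bigr\|^{2}
\;=\; O\!\left( \frac{1}{\sqrt{T}} \right)
```

Read this way, the claim is that activation quantization may change the hidden constant but not the dependence on T.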

What carries the argument

Tile-wise Adaptive Hadamard Quantization, which applies fine-grained quantization to small channel windows per token with entropy-guided bit allocation and outlier suppression through Hadamard-based transformation and pivot swapping.
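The outlier-suppression idea can be illustrated with a plain Walsh-Hadamard rotation. The following is a minimal sketch, not the paper's implementation: the function name `fwht`, the tile values, and the tile length of 8 are assumptions made for illustration, and pivot swapping is left out because the text above does not describe it in enough detail.

```python
import math

def fwht(values):
    """Orthonormal fast Walsh-Hadamard transform (length must be a power of two)."""
    n = len(values)
    if n == 0 or n & (n - 1):
        raise ValueError("length must be a power of two")
    out = list(values)
    h = 1
    while h < n:
        # Butterfly step: combine elements that are h apart.
        for start in range(0, n, 2 * h):
            for i in range(start, start + h):
                a, b = out[i], out[i + h]
                out[i], out[i + h] = a + b, a - b
        h *= 2
    scale = 1.0 / math.sqrt(n)  # makes the transform orthonormal
    return [v * scale for v in out]

# Hypothetical tile with a single outlier channel.
tile = [0.1, -0.2, 0.05, 8.0, 0.0, 0.15, -0.1, 0.2]
rotated = fwht(tile)

peak_before = max(abs(v) for v in tile)
peak_after = max(abs(v) for v in rotated)
energy_before = sum(v * v for v in tile)
energy_after = sum(v * v for v in rotated)
restored = fwht(rotated)  # the orthonormal transform is its own inverse

print(f"peak magnitude: {peak_before:.3f} -> {peak_after:.3f}")
print(f"energy: {energy_before:.4f} -> {energy_after:.4f}")
```

The rotation spreads the outlier's magnitude across all channels of the tile, so a uniform quantizer applied afterwards needs a smaller range, while the receiver can undo the rotation exactly.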

If this is right

Pipeline parallel training can achieve up to 4.3 times higher throughput compared to full-precision FP32.
Convergence rate remains equivalent to vanilla stochastic gradient descent.
The method avoids activation-cache overhead present in prior quantization approaches.
Communication volume for intermediate activations drops significantly under limited network bandwidth.
Performance holds across various training scenarios and model scales.
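To make the communication-volume point concrete, here is a back-of-the-envelope calculation. It is a sketch under stated assumptions: the tensor shape, the tile size of 64, and the 16-bit per-tile scale are hypothetical values chosen for illustration and are not taken from the paper.

```python
def activation_bytes(tokens, hidden, bits_per_value, tile_size=None, scale_bits=0):
    """Bytes needed to send one activation tensor of shape (tokens, hidden)."""
    payload_bits = tokens * hidden * bits_per_value
    metadata_bits = 0
    if tile_size is not None:
        tiles_per_token = -(-hidden // tile_size)  # ceiling division
        # One scale value per tile (hypothetical metadata layout).
        metadata_bits = tokens * tiles_per_token * scale_bits
    return (payload_bits + metadata_bits) / 8

# Hypothetical sizes, chosen only for illustration.
TOKENS, HIDDEN = 4096, 4096
fp32 = activation_bytes(TOKENS, HIDDEN, 32)
q4 = activation_bytes(TOKENS, HIDDEN, 4, tile_size=64, scale_bits=16)
ratio = fp32 / q4

print(f"FP32: {fp32 / 2**20:.1f} MiB, 4-bit tiles: {q4 / 2**20:.2f} MiB, ratio {ratio:.2f}x")
```

The ratio computed here is bytes on the wire only. The reported 4.3x figure is an end-to-end throughput speedup, which also depends on compute time and quantization overhead, so the two numbers are not expected to match.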

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

Similar tile-wise adaptive quantization could reduce communication in other distributed setups such as data or tensor parallelism.
Combining the approach with gradient compression techniques might yield additional speedups in fully decentralized systems.
The outlier suppression mechanism may prove useful for quantizing other tensors like weights in bandwidth-constrained environments.

Load-bearing premise

The combination of entropy-guided tile-wise bit allocation, Hadamard transformation, and pivot swapping keeps quantization errors small enough that they do not introduce bias or increased variance capable of breaking the standard SGD convergence bound.

What would settle it

A training run on a standard language model benchmark where loss fails to decrease at the expected rate or diverges under TAH-Quant would show the convergence guarantee does not hold.

Figures

Figures reproduced from arXiv: 2506.01352 by Binhang Yuan, Guangxin He, Kai Chen, Kun Yuan, Tianyi Bai, Yuan Cao, Yutong He.

**Figure 2.** The training convergence for each task (loss vs. steps).

**Figure 3.** End-to-end training performance over different networks.

**Figure 5.** Ablation study for Hadamard transform.

Abstract

Decentralized training of large language models offers the opportunity to pool computational resources across geographically distributed participants, but is often bottlenecked by network communication, particularly under pipeline parallel settings. While pipeline parallelism partitions model layers across devices to handle large-scale models, it necessitates frequent communication of intermediate activations, creating challenges when network bandwidth is limited. To address these issues, we propose TAH-Quant (Tile-wise Adaptive Hadamard Quantization), a novel activation quantization framework for pipeline parallelism. TAH-Quant integrates fine-grained tile-wise quantization, entropy-guided tile-wise adaptive bit allocation for optimal bit usage, and a Hadamard-based transformation with pivot swapping to effectively suppress outliers. Compared with token-level allocation, the tile-wise allocator assigns precision at the granularity of small channel windows within each token, reducing quantization error under the same bit budget. We prove that pipeline parallel training equipped with TAH-Quant maintains a convergence rate of O(1/sqrt(T)), matching that of vanilla stochastic gradient descent. Extensive experiments demonstrate that TAH-Quant achieves an aggressive activation quantization ratio of 3-4 bits, providing up to 4.3x throughput speedup over uncompressed FP32 and up to 1.33x wall-clock speedup over AQ-SGD, while preserving training convergence, avoiding AQ-SGD's activation-cache overhead, and generalizing well across various training scenarios.
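The entropy-guided allocation described in the abstract can be sketched as follows. This is an illustrative guess at the mechanism, not the paper's allocator: the histogram-based entropy estimate, the rule "higher entropy gets more bits", the function names `tile_entropy` and `allocate_bits`, and the 3/4-bit split with a 3.5-bit average are all assumptions.

```python
import math

def tile_entropy(tile, bins=16):
    """Shannon entropy (in bits) of a histogram of the tile's values."""
    lo, hi = min(tile), max(tile)
    if hi == lo:
        return 0.0  # a constant tile carries no information
    counts = [0] * bins
    for v in tile:
        idx = min(int((v - lo) / (hi - lo) * bins), bins - 1)
        counts[idx] += 1
    n = len(tile)
    return -sum(c / n * math.log2(c / n) for c in counts if c)

def allocate_bits(tiles, low_bits=3, high_bits=4, avg_budget=3.5):
    """Give high_bits to the highest-entropy tiles; keep the mean at or below avg_budget."""
    n_high = int(len(tiles) * (avg_budget - low_bits) / (high_bits - low_bits))
    order = sorted(range(len(tiles)), key=lambda i: tile_entropy(tiles[i]), reverse=True)
    bits = [low_bits] * len(tiles)
    for i in order[:n_high]:
        bits[i] = high_bits
    return bits

# Hypothetical tiles of eight channels each.
tiles = [
    [0.0] * 8,                                  # constant
    [0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 1.0],   # mostly one value
    [0.1 * k for k in range(8)],                # evenly spread
    [0.0, 1.0, 0.0, 1.0, 0.0, 1.0, 0.0, 1.0],   # two values
]
bits = allocate_bits(tiles)
print(bits, "mean bits:", sum(bits) / len(bits))
```

The point of the sketch is the budget constraint: precision moves between tiles, but the average number of bits per value stays fixed.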

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Desk Editor's Note (private letter to a colleague)

TAH-Quant adds tile-wise entropy-guided quantization and Hadamard pivot swapping to pipeline parallelism, with a claimed O(1/sqrt(T)) convergence rate that still needs to be checked against the data-dependent bit allocation.

The paper's core contribution is a quantization scheme for activations that moves from token-level to tile-level granularity inside each token, adds entropy-based bit allocation per tile, and layers on a Hadamard transform with pivot swapping to control outliers. This is packaged for pipeline-parallel training where activations cross slow links between stages. The authors report 3-4 bit effective precision, up to 4.3x throughput gains over FP32, and 1.33x wall-clock improvement over AQ-SGD while claiming the same convergence rate as vanilla SGD.
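The move from token-level to tile-level granularity can be illustrated with a toy comparison. This is a sketch under assumptions, not the paper's quantizer: it uses plain round-to-nearest symmetric quantization, and the token values, the outlier position, the 4-bit width, and the tile size of 8 are hypothetical.

```python
def quantize_uniform(values, bits):
    """Round-to-nearest symmetric uniform quantization with one shared scale."""
    levels = 2 ** (bits - 1) - 1
    peak = max(abs(v) for v in values)
    if peak == 0:
        return list(values)
    scale = peak / levels
    return [round(v / scale) * scale for v in values]

def quantize_tiled(values, bits, tile_size):
    """Quantize each window of tile_size channels with its own scale."""
    out = []
    for start in range(0, len(values), tile_size):
        out.extend(quantize_uniform(values[start:start + tile_size], bits))
    return out

def mse(a, b):
    """Mean squared error between two equal-length sequences."""
    return sum((x - y) ** 2 for x, y in zip(a, b)) / len(a)

# Hypothetical token: small values plus one outlier channel in the last tile.
token = [0.01 * ((i % 7) - 3) for i in range(32)]
token[29] = 5.0

err_token = mse(token, quantize_uniform(token, 4))
err_tiled = mse(token, quantize_tiled(token, 4, tile_size=8))
print(f"per-token scale MSE: {err_token:.3e}")
print(f"per-tile scale MSE:  {err_tiled:.3e}")
```

With a single scale per token, the outlier sets the step size for every channel and the small values collapse to zero. With per-tile scales, only the tile containing the outlier pays that cost.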

Referee Report

2 major / 2 minor

Summary. The paper proposes TAH-Quant, a tile-wise adaptive Hadamard quantization method for activations in pipeline-parallel LLM training over slow networks. It integrates fine-grained tile-wise quantization, entropy-guided adaptive bit allocation, and Hadamard transformation with pivot swapping to suppress outliers. The central claims are a convergence rate of O(1/sqrt(T)) matching vanilla SGD, 3-4 bit quantization, up to 4.3x throughput speedup over FP32, and 1.33x wall-clock speedup over AQ-SGD while avoiding activation-cache overhead.

Significance. If the convergence analysis holds for the adaptive components, the work addresses a key bottleneck in decentralized pipeline-parallel training by reducing activation communication volume without degrading the SGD rate. The explicit convergence guarantee and comparison to AQ-SGD are positive features; empirical generalization across scenarios strengthens the practical contribution.

major comments (2)

[Convergence analysis / Theorem 1] Convergence analysis (Theorem on O(1/sqrt(T)) rate): The proof must explicitly bound the second moment of quantization noise under entropy-guided tile-wise adaptive allocation. If the analysis invokes a generic fixed-bit or non-adaptive quantization lemma, the data-dependent bit decisions could correlate noise with activation statistics across pipeline stages, potentially inflating variance or introducing bias that violates the stated rate. The manuscript should derive or cite the precise assumption (e.g., unbiasedness or Lipschitz bound) that remains valid after pivot swapping and Hadamard transform.
[Experiments / Figure 3] Experimental validation of convergence (Section 5 / Figure on loss curves): The reported preservation of convergence is shown only for selected models and bit budgets; it is unclear whether the adaptive allocator was ablated against fixed-bit baselines to isolate its effect on the observed O(1/sqrt(T)) behavior. Without variance estimates or multiple random seeds for the adaptive entropy estimator, it is difficult to confirm that the speedup does not trade off against hidden convergence degradation.
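The first major comment turns on two properties of the quantization noise: unbiasedness and a bounded second moment. The sketch below shows what those properties look like for plain stochastic rounding on a uniform grid. It is an illustration only: whether TAH-Quant uses stochastic rounding is not stated in the text above, and the function name, the input value, the step size, and the seed are assumptions.

```python
import math
import random

def stochastic_round(x, step, rng):
    """Round x to a multiple of step, rounding up with probability equal to the fractional part."""
    lower = math.floor(x / step)
    frac = x / step - lower
    return (lower + (1 if rng.random() < frac else 0)) * step

rng = random.Random(0)  # fixed seed so the run is repeatable
x, step, trials = 0.30, 0.25, 100_000
samples = [stochastic_round(x, step, rng) for _ in range(trials)]

mean = sum(samples) / trials                                # close to x if unbiased
second_moment = sum((s - x) ** 2 for s in samples) / trials  # bounded by step**2 / 4

print(f"x = {x}, empirical mean = {mean:.4f}")
print(f"empirical second moment = {second_moment:.5f}, bound = {step ** 2 / 4:.5f}")
```

For this quantizer the bound depends only on the step size, which is fixed in advance. The referee's concern is the adaptive case, where the step size is chosen from the data being quantized, so the same argument does not carry over automatically.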

minor comments (2)

[Method / §3.2] Notation for tile size and entropy estimator: Define the tile dimensions (channel window size) and the exact entropy computation formula early in Section 3; current description leaves ambiguity whether entropy is computed per-token or per-layer.
[Experiments / §5.1] Baseline implementation details: Clarify whether AQ-SGD's activation-cache overhead is measured under identical pipeline stage counts and network bandwidth; the 1.33x wall-clock claim would be stronger with explicit bandwidth numbers.

Simulated Author's Rebuttal

2 responses · 0 unresolved

We thank the referee for the thoughtful and constructive comments. We address each major point below, indicating where revisions will be made to clarify the analysis and strengthen the experiments.

Referee: [Convergence analysis / Theorem 1] Convergence analysis (Theorem on O(1/sqrt(T)) rate): The proof must explicitly bound the second moment of quantization noise under entropy-guided tile-wise adaptive allocation. If the analysis invokes a generic fixed-bit or non-adaptive quantization lemma, the data-dependent bit decisions could correlate noise with activation statistics across pipeline stages, potentially inflating variance or introducing bias that violates the stated rate. The manuscript should derive or cite the precise assumption (e.g., unbiasedness or Lipschitz bound) that remains valid after pivot swapping and Hadamard transform.

Authors: We thank the referee for this precise observation on the proof structure. Theorem 1 establishes the O(1/sqrt(T)) rate by first applying the Hadamard transform with pivot swapping, which renders the per-tile quantization noise unbiased with second-moment bounded proportionally to the local bit width; the entropy-guided allocator then distributes the global bit budget to minimize aggregate variance while preserving the unbiasedness property. The Lipschitz continuity of the loss (assumed in the standard SGD analysis) continues to hold after the orthogonal transform. To make the adaptive case fully explicit and rule out correlation-induced bias across pipeline stages, we will insert a dedicated supporting lemma in the revised Section 4 that directly bounds the second moment under entropy-guided allocation. revision: yes
Referee: [Experiments / Figure 3] Experimental validation of convergence (Section 5 / Figure on loss curves): The reported preservation of convergence is shown only for selected models and bit budgets; it is unclear whether the adaptive allocator was ablated against fixed-bit baselines to isolate its effect on the observed O(1/sqrt(T)) behavior. Without variance estimates or multiple random seeds for the adaptive entropy estimator, it is difficult to confirm that the speedup does not trade off against hidden convergence degradation.

Authors: We agree that isolating the contribution of the adaptive allocator and providing statistical robustness would improve clarity. The main experiments already compare TAH-Quant against fixed-bit quantization baselines (Section 5.2), but we will add an explicit ablation subsection that varies only the allocator while holding total bits constant. In addition, we will rerun the entropy-estimation experiments with three independent random seeds, report mean and standard deviation on the loss curves, and extend Figure 3 plus the appendix with results for two further model scales and bit budgets. revision: yes

Circularity Check

0 steps flagged

No significant circularity; convergence claim references external SGD benchmark independently

The paper introduces TAH-Quant with novel components including tile-wise quantization, entropy-guided adaptive bit allocation, Hadamard transformation, and pivot swapping. It explicitly states a proof that pipeline parallel training equipped with TAH-Quant maintains the O(1/sqrt(T)) convergence rate of vanilla stochastic gradient descent. This is framed as an analysis of the impact of quantization error rather than as a self-definitional loop, a fitted parameter renamed as a prediction, or a load-bearing self-citation. The derivation chain remains self-contained against the external vanilla SGD benchmark, with no quoted reductions showing the claimed rate or error bounds collapsing to the method's own inputs by construction.

Axiom & Free-Parameter Ledger

0 free parameters · 0 axioms · 0 invented entities

Based solely on the abstract, no explicit free parameters, axioms, or invented entities are described; the convergence claim likely rests on standard stochastic optimization assumptions not detailed here.

Forward citations

Cited by 3 Pith papers

Reviewed papers in the Pith corpus that reference this work. Sorted by Pith novelty score.

NCCLZ: Compression-Enabled GPU Collectives with Decoupled Quantization and Entropy Coding
cs.DC 2026-05 unverdicted novelty 7.0

NCCLZ decouples quantization and entropy coding across NCCL stack layers to enable overlapped compression, delivering up to 9.65x speedup over plain NCCL on scientific and training workloads.

TACO: Efficient Communication Compression of Intermediate Tensors for Scalable Tensor-Parallel LLM Training
cs.DC 2026-04 unverdicted novelty 5.0

TACO compresses tensor-parallel intermediate tensors with an adaptive FP8 scheme and fused kernels, yielding up to 1.87x throughput gains on GPT and Qwen models with near-lossless accuracy.

Evolution of Optimization Methods: Algorithms, Scenarios, and Evaluations
cs.LG 2026-04 unverdicted novelty 3.0

A retrospective survey and empirical evaluation of deep learning optimization algorithms that identifies trends, design trade-offs, and future directions.

Reference graph

Works this paper leans on

53 extracted references · 53 canonical work pages · cited by 3 Pith papers · 3 internal anchors

[1]

Towards crowdsourced training of large neural networks using decentral- ized mixture-of-experts

Max Ryabinin and Anton Gusev. Towards crowdsourced training of large neural networks using decentral- ized mixture-of-experts. Advances in Neural Information Processing Systems, 33:3659–3672, 2020

work page 2020
[2]

Decentralized training of foundation models in heterogeneous environments.Advances in Neural Information Processing Systems, 35:25464–25477, 2022

Binhang Yuan, Yongjun He, Jared Davis, Tianyi Zhang, Tri Dao, Beidi Chen, Percy S Liang, Christopher Re, and Ce Zhang. Decentralized training of foundation models in heterogeneous environments.Advances in Neural Information Processing Systems, 35:25464–25477, 2022

work page 2022
[3]

Improving training time and gpu utilization in geo-distributed language model training.arXiv preprint arXiv:2411.14458, 2024

Rohan Gandhi, Karan Tandon, Debopam Bhattacherjee, Venkata N Padmanabhan, et al. Improving training time and gpu utilization in geo-distributed language model training.arXiv preprint arXiv:2411.14458, 2024

work page arXiv 2024
[4]

Fine-tuning language models over slow networks using activation quantization with guarantees.Advances in Neural Information Processing Systems, 35:19215–19230, 2022

Jue Wang, Binhang Yuan, Luka Rimanic, Yongjun He, Tri Dao, Beidi Chen, Christopher R´e, and Ce Zhang. Fine-tuning language models over slow networks using activation quantization with guarantees.Advances in Neural Information Processing Systems, 35:19215–19230, 2022

work page 2022
[5]

Cocktailsgd: Fine-tuning foundation models over 500mbps networks

Jue Wang, Yucheng Lu, Binhang Yuan, Beidi Chen, Percy Liang, Christopher De Sa, Christopher Re, and Ce Zhang. Cocktailsgd: Fine-tuning foundation models over 500mbps networks. InInternational Conference on Machine Learning, pages 36058–36076. PMLR, 2023

work page 2023
[6]

Gpipe: Efficient training of giant neural networks using pipeline parallelism

Yanping Huang, Youlong Cheng, Ankur Bapna, Orhan Firat, Dehao Chen, Mia Chen, HyoukJoong Lee, Jiquan Ngiam, Quoc V Le, Yonghui Wu, et al. Gpipe: Efficient training of giant neural networks using pipeline parallelism. Advances in neural information processing systems, 32, 2019

work page 2019
[7]

Pipedream: generalized pipeline parallelism for dnn training

Deepak Narayanan, Aaron Harlap, Amar Phanishayee, Vivek Seshadri, Nikhil R Devanur, Gregory R Ganger, Phillip B Gibbons, and Matei Zaharia. Pipedream: generalized pipeline parallelism for dnn training. In Proceedings of the 27th ACM Symposium on Operating Systems Principles, pages 1–15, 2019

work page 2019
[8]

Memory-efficient pipeline-parallel dnn training

Deepak Narayanan, Amar Phanishayee, Kaiyu Shi, Xie Chen, and Matei Zaharia. Memory-efficient pipeline-parallel dnn training. In International Conference on Machine Learning, pages 7937–7947. PMLR, 2021. 12

work page 2021
[9]

Streaming diloco with overlapping communication: Towards a distributed free lunch.arXiv preprint arXiv:2501.18512, 2025

Arthur Douillard, Yanislav Donchev, Keith Rush, Satyen Kale, Zachary Charles, Zachary Garrett, Gabriel Teston, Dave Lacey , Ross McIlroy , Jiajun Shen, et al. Streaming diloco with overlapping communication: Towards a distributed free lunch.arXiv preprint arXiv:2501.18512, 2025

work page arXiv 2025
[10]

Deep compression: Compressing deep neural networks with pruning, trained quantization and huffman coding

Song Han, Huizi Mao, and William J Dally . Deep compression: Compressing deep neural networks with pruning, trained quantization and huffman coding. 2016

work page 2016
[11]

Quantized neural networks: Training neural networks with low precision weights and activations.The Journal of Machine Learning Research, 18(1):6869–6898, 2017

Itay Hubara, Matthieu Courbariaux, Daniel Soudry , Ran El-Yaniv, and Yoshua Bengio. Quantized neural networks: Training neural networks with low precision weights and activations.The Journal of Machine Learning Research, 18(1):6869–6898, 2017

work page 2017
[12]

Ac-gc: Lossy activation compression with guaranteed convergence

R David Evans and Tor Aamodt. Ac-gc: Lossy activation compression with guaranteed convergence. Advances in Neural Information Processing Systems, 34, 2021

work page 2021
[13]

Backprop with approximate activations for memory-efficient network training

Ayan Chakrabarti and Benjamin Moseley. Backprop with approximate activations for memory-efficient network training. Advances in Neural Information Processing Systems, 32, 2019

work page 2019
[14]

Training transformers together

Alexander Borzunov, Max Ryabinin, Tim Dettmers, Quentin Lhoest, Lucile Saulnier, Michael Diskin, and Yacine Jernite. Training transformers together. InNeurIPS 2021 Competitions and Demonstrations Track, pages 335–342. PMLR, 2022

work page 2021
[15]

Distributed inference and fine-tuning of large language models over the internet

Alexander Borzunov, Max Ryabinin, Artem Chumachenko, Dmitry Baranchuk, Tim Dettmers, Younes Belkada, Pavel Samygin, and Colin A Raffel. Distributed inference and fine-tuning of large language models over the internet. Advances in neural information processing systems, 36:12312–12331, 2023

work page 2023
[16]

Skippipe: Partial and reordered pipelining framework for training llms in heterogeneous networks.arXiv preprint arXiv:2502.19913, 2025

Nikolay Blagoev, Lydia Yiyu Chen, and O ˘guzhan Ersoy. Skippipe: Partial and reordered pipelining framework for training llms in heterogeneous networks.arXiv preprint arXiv:2502.19913, 2025

work page arXiv 2025
[17]

Distributed deep learning in open collaborations

Michael Diskin, Alexey Bukhtiyarov, Max Ryabinin, Lucile Saulnier, Anton Sinitsin, Dmitry Popov, Dmitry V Pyrkin, Maxim Kashirin, Alexander Borzunov, Albert Villanova del Moral, et al. Distributed deep learning in open collaborations. Advances in Neural Information Processing Systems, 34:7879–7897, 2021

work page 2021
[18]

Swarm parallelism: Training large models can be surprisingly communication-efficient

Max Ryabinin, Tim Dettmers, Michael Diskin, and Alexander Borzunov. Swarm parallelism: Training large models can be surprisingly communication-efficient. InInternational Conference on Machine Learning, pages 29416–29440. PMLR, 2023

work page 2023
[19]

Position: exploring the robustness of pipeline-parallelism-based decentralized training

Lin Lu, Chenxi Dai, Wangcheng Tao, Binhang Yuan, Yanan Sun, and Pan Zhou. Position: exploring the robustness of pipeline-parallelism-based decentralized training. InForty-first International Conference on Machine Learning, 2024

work page 2024
[20]

Ml training with cloud gpu shortages: Is cross-region the answer? In Proceedings of the 4th Workshop on Machine Learning and Systems, pages 107–116, 2024

Foteini Strati, Paul Elvinger, Tolga Kerimoglu, and Ana Klimovic. Ml training with cloud gpu shortages: Is cross-region the answer? In Proceedings of the 4th Workshop on Machine Learning and Systems, pages 107–116, 2024

work page 2024
[21]

Exact: Scalable graph neural networks training via extreme activation compression

Zirui Liu, Kaixiong Zhou, Fan Yang, Li Li, Rui Chen, and Xia Hu. Exact: Scalable graph neural networks training via extreme activation compression. In International Conference on Learning Representations, 2021

work page 2021
[22]

Neural network weight compression with nnw-bdi

Andrei Bersatti, Nima Shoghi Ghalehshahi, and Hyesoon Kim. Neural network weight compression with nnw-bdi. In The International Symposium on Memory Systems, pages 335–340, 2020

work page 2020
[23]

Accelerating convolutional neural networks via activation map compression

Georgios Georgiadis. Accelerating convolutional neural networks via activation map compression. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 7085–7095, 2019

work page 2019
[24]

Don’t waste your bits! squeeze activations and gradients for deep neural networks via tinyscript

Fangcheng Fu, Yuzheng Hu, Yihan He, Jiawei Jiang, Yingxia Shao, Ce Zhang, and Bin Cui. Don’t waste your bits! squeeze activations and gradients for deep neural networks via tinyscript. In International Conference on Machine Learning, pages 3304–3314. PMLR, 2020. 13

work page 2020
[25]

Gact: Activation compressed training for generic network architectures

Xiaoxuan Liu, Lianmin Zheng, Dequan Wang, Yukuo Cen, Weize Chen, Xu Han, Jianfei Chen, Zhiyuan Liu, Jie Tang, Joey Gonzalez, et al. Gact: Activation compressed training for generic network architectures. In International Conference on Machine Learning, pages 14139–14152. PMLR, 2022

work page 2022
[26]

Dropit: Dropping intermediate tensors for memory-efficient dnn training

Joya Chen, Kai Xu, Yuhui Wang, Yifei Cheng, and Angela Yao. Dropit: Dropping intermediate tensors for memory-efficient dnn training. InThe Eleventh International Conference on Learning Representations

work page
[27]

Does compressing activations help model parallel training? Proceedings of Machine Learning and Systems, 6:239–252, 2024

Song Bian, Dacheng Li, Hongyi Wang, Eric Xing, and Shivaram Venkataraman. Does compressing activations help model parallel training? Proceedings of Machine Learning and Systems, 6:239–252, 2024

work page 2024
[28]

Exploring the benefit of activation sparsity in pre-training

Zhengyan Zhang, Chaojun Xiao, Qiujieli Qin, Yankai Lin, Zhiyuan Zeng, Xu Han, Zhiyuan Liu, Ruobing Xie, Maosong Sun, and Jie Zhou. Exploring the benefit of activation sparsity in pre-training. InInternational Conference on Machine Learning, pages 60040–60056. PMLR, 2024

work page 2024
[29]

Compressing dma engine: Leveraging activation sparsity for training deep neural networks

Minsoo Rhu, Mike O’Connor, Niladrish Chatterjee, Jeff Pool, Youngeun Kwon, and Stephen W Keckler. Compressing dma engine: Leveraging activation sparsity for training deep neural networks. In2018 IEEE International Symposium on High Performance Computer Architecture (HPCA), pages 78–91. IEEE, 2018

work page 2018
[30]

Back razor: Memory-efficient transfer learning by self-sparsified backpropagation.Advances in neural information processing systems, 35:29248–29261, 2022

Ziyu Jiang, Xuxi Chen, Xueqin Huang, Xianzhi Du, Denny Zhou, and Zhangyang Wang. Back razor: Memory-efficient transfer learning by self-sparsified backpropagation.Advances in neural information processing systems, 35:29248–29261, 2022

work page 2022
[31]

The lazy neuron phenomenon: On emergence of activation sparsity in transformers

Zonglin Li, Chong You, Srinadh Bhojanapalli, Daliang Li, Ankit Singh Rawat, Sashank J Reddi, Ke Ye, Felix Chern, Felix Yu, Ruiqi Guo, et al. The lazy neuron phenomenon: On emergence of activation sparsity in transformers. In The Eleventh International Conference on Learning Representations

work page
[32]

Jpeg-act: accelerating deep learning via transform-based lossy compression

R David Evans, Lufei Liu, and Tor M Aamodt. Jpeg-act: accelerating deep learning via transform-based lossy compression. In 2020 ACM/IEEE 47th Annual International Symposium on Computer Architecture (ISCA), pages 860–873. IEEE, 2020

work page 2020
[33]

Division: memory efficient training via dual activation precision

Guanchu Wang, Zirui Liu, Zhimeng Jiang, Ninghao Liu, Na Zou, and Xia Hu. Division: memory efficient training via dual activation precision. In International Conference on Machine Learning, pages 36036– 36057. PMLR, 2023

work page 2023
[34]

Actnn: Reducing training memory footprint via 2-bit activation compressed training

Jianfei Chen, Lianmin Zheng, Zhewei Yao, Dequan Wang, Ion Stoica, Michael Mahoney, and Joseph Gonzalez. Actnn: Reducing training memory footprint via 2-bit activation compressed training. In International Conference on Machine Learning, pages 1803–1813. PMLR, 2021

work page 2021
[35]

Awq: Activation-aware weight quantization for on-device llm compression and acceleration

Ji Lin, Jiaming Tang, Haotian Tang, Shang Yang, Wei-Ming Chen, Wei-Chen Wang, Guangxuan Xiao, Xingyu Dang, Chuang Gan, and Song Han. Awq: Activation-aware weight quantization for on-device llm compression and acceleration. Proceedings of Machine Learning and Systems, 6:87–100, 2024

work page 2024
[36]

GPTQ: Accurate Post-Training Quantization for Generative Pre-trained Transformers

Elias Frantar, Saleh Ashkboos, Torsten Hoefler, and Dan Alistarh. Gptq: Accurate post-training quantization for generative pre-trained transformers. arXiv preprint arXiv:2210.17323, 2022

work page internal anchor Pith review Pith/arXiv arXiv 2022
[37]

Smoothquant: Accurate and efficient post-training quantization for large language models

Guangxuan Xiao, Ji Lin, Mickael Seznec, Hao Wu, Julien Demouth, and Song Han. Smoothquant: Accurate and efficient post-training quantization for large language models. InInternational Conference on Machine Learning, pages 38087–38099. PMLR, 2023

work page 2023
[38]

Kivi: A tuning-free asymmetric 2bit quantization for kv cache

Zirui Liu, Jiayi Yuan, Hongye Jin, Shaochen Zhong, Zhaozhuo Xu, Vladimir Braverman, Beidi Chen, and Xia Hu. Kivi: A tuning-free asymmetric 2bit quantization for kv cache. In International Conference on Machine Learning, pages 32332–32344. PMLR, 2024

work page 2024
[39]

Qlora: Efficient finetuning of quantized llms

Tim Dettmers, Artidoro Pagnoni, Ari Holtzman, and Luke Zettlemoyer. Qlora: Efficient finetuning of quantized llms. Advances in neural information processing systems, 36:10088–10115, 2023. 14

work page 2023
[40]

How to param- eterize asymmetric quantization ranges for quantization-aware training.arXiv preprint arXiv:2404.16898, 2024

Jaeseong You, Minseop Park, Kyunggeun Lee, Seokjun An, Chirag Patel, and Markus Nage. How to param- eterize asymmetric quantization ranges for quantization-aware training.arXiv preprint arXiv:2404.16898, 2024

work page arXiv 2024
[41]

Llm-qat: Data-free quantization aware training for large language models

Zechun Liu, Barlas Oguz, Changsheng Zhao, Ernie Chang, Pierre Stock, Yashar Mehdad, Yangyang Shi, Raghuraman Krishnamoorthi, and Vikas Chandra. Llm-qat: Data-free quantization aware training for large language models. In Findings of the Association for Computational Linguistics ACL 2024 , pages 467–484, 2024

work page 2024
[42]

Duquant: Distributing outliers via dual transformation makes stronger quantized llms

Haokun Lin, Haobo Xu, Yichen Wu, Jingzhi Cui, Yingtao Zhang, Linzhan Mou, Linqi Song, Zhenan Sun, and Ying Wei. Duquant: Distributing outliers via dual transformation makes stronger quantized llms. Advances in Neural Information Processing Systems, 37:87766–87800, 2024

work page 2024
[43]

arXiv preprint arXiv:2501.13987

Xing Hu, Yuan Cheng, Dawei Yang, Zukang Xu, Zhihang Yuan, Jiangyong Yu, Chen Xu, Zhe Jiang, and Sifan Zhou. Ostquant: Refining large language model quantization with orthogonal and scaling transformations for better distribution fitting.arXiv preprint arXiv:2501.13987, 2025

work page arXiv 2025
[44]

Feature generation i: data transformation and dimensionality reduction

Sergios Theodoridis and Konstantinos Koutroumbas. Feature generation i: data transformation and dimensionality reduction. Pattern recognition, pages 323–409, 2009

work page 2009
[45]

https://www.ucloud.cn/en/

Ucloud. https://www.ucloud.cn/en/

work page
[46]

Measuring massive multitask language understanding

Dan Hendrycks, Collin Burns, Steven Basart, Andy Zou, Mantas Mazeika, Dawn Song, and Jacob Steinhardt. Measuring massive multitask language understanding. Proceedings of the International Conference on Learning Representations (ICLR), 2021

work page 2021
[47]

Think you have Solved Question Answering? Try ARC, the AI2 Reasoning Challenge

Peter Clark, Isaac Cowhey , Oren Etzioni, Tushar Khot, Ashish Sabharwal, Carissa Schoenick, and Oyvind Tafjord. Think you have solved question answering? try arc, the AI2 reasoning challenge. CoRR, abs/1803.05457, 2018

work page internal anchor Pith review Pith/arXiv arXiv 2018
[48]

Hellaswag: Can a machine really finish your sentence? In Proceedings of the 57th Annual Meeting of the Association for Computational Linguistics, 2019

Rowan Zellers, Ari Holtzman, Yonatan Bisk, Ali Farhadi, and Yejin Choi. Hellaswag: Can a machine really finish your sentence? In Proceedings of the 57th Annual Meeting of the Association for Computational Linguistics, 2019

work page 2019
[49]

TruthfulQA: Measuring how models mimic human falsehoods

Stephanie Lin, Jacob Hilton, and Owain Evans. TruthfulQA: Measuring how models mimic human falsehoods. In Smaranda Muresan, Preslav Nakov, and Aline Villavicencio, editors,Proceedings of the 60th Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), pages 3214–3252, Dublin, Ireland, May 2022. Association for Computationa...

[50]

Keisuke Sakaguchi, Ronan Le Bras, Chandra Bhagavatula, and Yejin Choi. Winogrande: An adversarial winograd schema challenge at scale. Communications of the ACM, 64(9):99–106, 2021

[51]

Mark Chen, Jerry Tworek, Heewoo Jun, Qiming Yuan, Henrique Ponde de Oliveira Pinto, Jared Kaplan, Harri Edwards, Yuri Burda, Nicholas Joseph, Greg Brockman, Alex Ray, Raul Puri, Gretchen Krueger, Michael Petrov, Heidy Khlaaf, Girish Sastry, Pamela Mishkin, Brooke Chan, Scott Gray, Nick Ryder, Mikhail Pavlov, Alethea Power, Lukasz Kaiser, Mohammad Bavarian...

[52]

Jacob Austin, Augustus Odena, Maxwell I. Nye, Maarten Bosma, Henryk Michalewski, David Dohan, Ellen Jiang, Carrie J. Cai, Michael Terry, Quoc V. Le, and Charles Sutton. Program synthesis with large language models. CoRR, abs/2108.07732, 2021

[53]

Leo Gao, Jonathan Tow, Baber Abbasi, Stella Biderman, Sid Black, Anthony DiPofi, Charles Foster, Laurence Golding, Jeffrey Hsu, Alain Le Noac’h, Haonan Li, Kyle McDonell, Niklas Muennighoff, Chris Ociepa, Jason Phang, Laria Reynolds, Hailey Schoelkopf, Aviya Skowron, Lintang Sutawika, Eric Tang, Anish Thite, Ben Wang, Kevin Wang, and Andy Zou. The languag...

