You Had One Job: Per-Task Quantization Using LLMs' Hidden Representations

Amit Levi; Avi Mendelson; Chaim Baskin; Ravid Shwartz Ziv; Raz Lapid; Rom Himelstein

arxiv: 2511.06516 · v3 · pith:OGZ2ETE6new · submitted 2025-11-09 · 💻 cs.CL

Amit LeVi, Raz Lapid, Rom Himelstein, Chaim Baskin, Ravid Shwartz Ziv, Avi Mendelson

Pith reviewed 2026-05-21 18:40 UTC · model grok-4.3

classification 💻 cs.CL

keywords: post-training quantization · mixed-precision · large language models · task-aware compression · hidden representations · layer importance scoring

The pith

Task-aware quantization allocates higher precision to LLM layers that matter most for a given task using hidden-representation statistics from unlabeled prompts.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper introduces Task-Aware Quantization (TAQ), a training-free mixed-precision post-training method that scores transformer layers by importance for a specific task and assigns more bits to the critical ones under a fixed total bit budget. Importance is estimated from hidden activations and output sensitivity on a small set of unlabeled task calibration prompts, with three concrete scoring rules provided. This produces better accuracy per memory unit than standard task-agnostic quantization on several benchmarks, and the efficiency gains appear in measured hardware throughput and latency. A sympathetic reader would care because many real LLM deployments target narrow capabilities, so uniform bit allocation wastes resources on irrelevant layers.

Core claim

TAQ estimates layer importance from hidden representations and output sensitivity using a small set of unlabeled task calibration prompts, and allocates higher precision to task-relevant layers in a mixed-precision post-training quantization framework, outperforming task-agnostic baselines especially in accuracy-memory ratio, with validation on hardware throughput and latency.

What carries the argument

Task-Aware Quantization (TAQ) framework that computes layer importance scores from hidden-representation statistics or output-distribution sensitivity under a quantization-noise proxy, then assigns mixed precisions accordingly.
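
A minimal sketch of how per-layer importance scores could be turned into a mixed-precision assignment under a fixed average-bit budget. The paper's actual allocation rule is not reproduced on this page, so the greedy rule, the function name `allocate_bits`, the candidate bit-widths (2, 4, 8), and the example scores and layer sizes are all assumptions for illustration.

```python
def allocate_bits(scores, sizes, avg_bits, levels=(2, 4, 8)):
    """Greedy mixed-precision allocation (illustrative, not the paper's rule).

    scores   : per-layer importance, higher means more task-relevant
    sizes    : per-layer weight counts
    avg_bits : target average bits per weight (the fixed budget)
    levels   : allowed bit-widths (assumed values)
    """
    levels = sorted(levels)
    total = sum(sizes)
    budget = avg_bits * total               # total bits we may spend
    bits = [levels[0]] * len(scores)        # start every layer at the floor
    spent = levels[0] * total
    if spent > budget:
        raise ValueError("budget is below the lowest bit-width")

    # Visit layers from most to least important; give each the widest
    # bit-width that still fits in the remaining budget.
    order = sorted(range(len(scores)), key=lambda i: scores[i], reverse=True)
    for i in order:
        for level in reversed(levels[1:]):
            cost = (level - bits[i]) * sizes[i]
            if spent + cost <= budget:
                bits[i] = level
                spent += cost
                break
    return bits, spent / total


# Hypothetical scores for four equally sized layers, 4-bit average budget.
bits, avg = allocate_bits([0.9, 0.1, 0.5, 0.3], [100, 100, 100, 100], avg_bits=4)
```

A greedy pass is only one option; an exact solver over the same constraint would be a drop-in replacement if the layer count is small.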

If this is right

Higher precision on task-critical layers improves downstream accuracy under a fixed total bit budget.
Gains in accuracy-memory ratio appear as concrete improvements in hardware throughput and latency.
Unlabeled calibration prompts suffice, removing the need for task labels or additional fine-tuning.
Residual-stream error analysis shows where quantization noise accumulates most harmfully for the task.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

The same hidden-representation scoring approach could be applied to other compression methods such as structured pruning or knowledge distillation to make them task-conditioned.
Combining TAQ with hardware-specific cost models might further close the gap between theoretical bit savings and actual inference speed on edge devices.
Extending calibration sets with synthetic prompts generated by the model itself could improve robustness when real task data is scarce.

Load-bearing premise

Layer importance scores derived from hidden-representation statistics or output-sensitivity proxies on a small set of unlabeled task calibration prompts reliably identify the layers whose precision most affects downstream task performance.

What would settle it

A task and model where any of the TAQ scoring rules produces equal or lower accuracy-memory ratio than uniform or task-agnostic quantization at the same bit budget, as measured on the target hardware.

Figures

Figures reproduced from arXiv: 2511.06516 by Amit Levi, Avi Mendelson, Chaim Baskin, Ravid Shwartz Ziv, Raz Lapid, Rom Himelstein.

**Figure 1.** Layer relevance scores per task. Motivation: different tasks stress different parts of a Transformer; some layers are indispensable for capturing semantic diversity, while others can be aggressively quantized with little effect.

Original abstract

Many LLM applications require only narrow capabilities, yet standard post-training quantization (PTQ) methods allocate precision without considering the target task. This can waste bits on layers that are less relevant to the task signal while over-compressing layers that are critical for downstream behavior. We propose Task-Aware Quantization (TAQ), a training-free, weight-only mixed-precision PTQ framework that uses a small set of unlabeled task calibration prompts to allocate higher precision to task-relevant transformer layers under a fixed bit budget. TAQ estimates layer importance from hidden representations and output sensitivity, and we instantiate it with three scoring rules: TAQ-IS, based on activation information and stability; TAQ-KL, based on output-distribution sensitivity under a quantization-noise proxy; and TAQ-O, a label-informed oracle diagnostic for analyzing layer sensitivity. Across several benchmarks, TAQ outperforms task-agnostic baselines such in most settings, with especially strong gains in the accuracy--memory ratio. We further validate that these gains translate to real deployment behavior through hardware throughput and latency measurements, and analyze calibration robustness and residual-stream error propagation. Overall, TAQ turns mixed-precision PTQ from a model-centric compression step into a task-conditioned precision-allocation problem. A reference implementation is available at https://anonymous.4open.science/r/TAQ-9217/README.md.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Desk Editor's Note (private letter to a colleague)

The paper gives a training-free way to make LLM quantization task-specific by scoring layers from hidden representations on unlabeled calibration prompts.

The one thing to know is that this paper gives a training-free way to allocate bits in LLM quantization according to task relevance, using hidden representations from a small set of unlabeled calibration prompts. The authors take standard PTQ and make it task-conditioned without any training. The scores come from activation statistics or from sensitivity to a quantization-noise proxy, and on the tested benchmarks the method beats the usual uniform or task-agnostic baselines in most settings, especially in accuracy per unit of memory. Hardware tests show the benefits carry over to measured throughput and latency. Having the oracle version helps show the gap to perfect layer selection.

The method is clean and the experiments are straightforward. Credit for including hardware validation and some error analysis. The weaker part is the link between those scores and actual task impact. The scores are computed independently per layer from a limited set of prompts, so residual-stream interactions or a poor choice of prompts could throw off which layers get the bits. The authors contrast their scores with the oracle but do not fully test whether the selected layers are the ones driving the gains.

This is useful for anyone compressing models for specific tasks on limited hardware. A reader looking for practical quantization tweaks would find it worth reading. The thinking is clear, and the evidence is empirical and reproducible enough that it deserves a serious referee. I'd recommend putting it through peer review.

Referee Report

2 major / 2 minor

Summary. The paper introduces Task-Aware Quantization (TAQ), a training-free weight-only mixed-precision post-training quantization framework for LLMs. It allocates higher bit precision to task-relevant transformer layers under a fixed bit budget by estimating layer importance from hidden-representation statistics and output-sensitivity proxies computed on a small set of unlabeled task calibration prompts. Three instantiations are presented: TAQ-IS (activation information and stability), TAQ-KL (KL-divergence under a quantization-noise proxy), and TAQ-O (label-informed oracle). Experiments across benchmarks show outperformance over task-agnostic baselines in most settings, with notable gains in the accuracy-memory ratio; these are supported by hardware throughput/latency measurements and analyses of calibration robustness and residual-stream error propagation.

Significance. If the per-layer importance scores derived from unlabeled prompts prove to be reliable proxies for task-specific quantization sensitivity, TAQ could meaningfully advance PTQ from model-centric to task-conditioned precision allocation, improving efficiency for narrow-domain LLM applications. The hardware validation and reference implementation are concrete strengths that would increase the work's practical impact if the core proxy assumption holds.

major comments (2)

[§4.3 and §5.2] The central claim that TAQ-IS and TAQ-KL scores correctly rank layers by their marginal impact on downstream task accuracy rests on correlation with the TAQ-O oracle and robustness checks, but the manuscript does not include a direct ablation that measures end-to-end benchmark accuracy when precision is allocated exclusively to the highest- versus lowest-scoring layers (independent of the joint optimization). This leaves the reliability of the proxy under residual-stream interactions untested.
[Table 3, or equivalent results table] While accuracy-memory ratio gains are reported, the number of calibration prompts and their selection procedure are not stated in the main experimental setup, making it difficult to reproduce or assess sensitivity of the reported outperformance to this choice despite the later robustness section.

minor comments (2)

[Abstract] The clause 'outperforms task-agnostic baselines such in most settings' contains a clear typographical omission and should be rephrased for readability.
[§3.1] The normalization step that converts continuous importance scores into discrete bit assignments under the fixed budget could be stated as an explicit equation to improve clarity of the allocation procedure.

Simulated Author's Rebuttal

2 responses · 0 unresolved

We thank the referee for the constructive and detailed feedback on our manuscript. We address each major comment below and describe the revisions we will make to strengthen the paper.

Referee: [§4.3 and §5.2] The central claim that TAQ-IS and TAQ-KL scores correctly rank layers by their marginal impact on downstream task accuracy rests on correlation with the TAQ-O oracle and robustness checks, but the manuscript does not include a direct ablation that measures end-to-end benchmark accuracy when precision is allocated exclusively to the highest- versus lowest-scoring layers (independent of the joint optimization). This leaves the reliability of the proxy under residual-stream interactions untested.

Authors: We agree that a direct ablation isolating the ranking effect would provide stronger validation of the proxy scores. Our current results show high correlation between TAQ-IS/TAQ-KL and the TAQ-O oracle along with robustness to calibration variations, but these do not fully isolate the impact of selecting highest- versus lowest-ranked layers under residual-stream interactions. We will add this ablation experiment in the revised manuscript, reporting end-to-end accuracy for precision allocation based solely on top-k versus bottom-k layers according to each scoring method. revision: yes
Referee: [Table 3, or equivalent results table] While accuracy-memory ratio gains are reported, the number of calibration prompts and their selection procedure are not stated in the main experimental setup, making it difficult to reproduce or assess sensitivity of the reported outperformance to this choice despite the later robustness section.

Authors: We thank the referee for highlighting this clarity issue. The main experiments use 32 calibration prompts randomly sampled from the unlabeled task data, with details provided in the robustness section. To improve reproducibility, we will explicitly state the number of prompts and the random sampling procedure in the primary experimental setup description in the revised manuscript. revision: yes
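
The top-k versus bottom-k ablation discussed in the first response can be set up as two allocation arms that each upgrade the same number of layers, one arm on the highest-scoring layers and one on the lowest-scoring. The sketch below is hypothetical: `ablation_arms`, the bit-widths 4 and 8, and the example scores are assumptions, and the two arms only match in memory when the layers have equal weight counts.

```python
def ablation_arms(scores, k, low_bits=4, high_bits=8):
    """Build per-layer bit-widths for a top-k arm and a bottom-k arm.

    Each arm keeps every layer at low_bits except k layers raised to
    high_bits: the k highest-scoring layers in one arm, the k
    lowest-scoring in the other. Bit-widths are assumed values.
    """
    n = len(scores)
    if not 0 < k <= n // 2:
        raise ValueError("k must be positive and at most half the layer count")
    ranked = sorted(range(n), key=lambda i: scores[i], reverse=True)
    top, bottom = set(ranked[:k]), set(ranked[-k:])
    top_arm = [high_bits if i in top else low_bits for i in range(n)]
    bottom_arm = [high_bits if i in bottom else low_bits for i in range(n)]
    return top_arm, bottom_arm


# Hypothetical scores for six layers; upgrade two layers per arm.
top_arm, bottom_arm = ablation_arms([0.9, 0.1, 0.5, 0.3, 0.7, 0.2], k=2)
```

If the scores are a reliable proxy, the top-k arm should reach higher end-to-end accuracy than the bottom-k arm at the same number of upgraded layers; comparable accuracy in both arms would point to the weakness the referee raises.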

Circularity Check

0 steps flagged

No significant circularity in TAQ derivation chain

The paper computes layer importance scores (TAQ-IS, TAQ-KL) directly from forward-pass observables on unlabeled calibration prompts: activation statistics, stability measures, and output-distribution sensitivity under a quantization-noise proxy. These quantities are independent of the final task accuracy metric and are not fitted to it; they serve as proxies for precision allocation under a fixed bit budget. The TAQ-O oracle is presented as a diagnostic contrast rather than a load-bearing input. No equations reduce the claimed predictions to self-definitions, fitted inputs renamed as outputs, or self-citation chains. The method remains self-contained against external benchmarks and hardware measurements.

Axiom & Free-Parameter Ledger

1 free parameter · 1 axiom · 0 invented entities

The approach rests on the assumption that task relevance can be read out from a small number of forward passes on unlabeled prompts and that the chosen sensitivity proxies (activation statistics or output KL) are faithful proxies for downstream accuracy impact. No new physical entities or ad-hoc constants are introduced beyond the standard quantization bit-width choices.

free parameters (1)

bit budget allocation
Total bit budget is fixed by the user; the method decides per-layer distribution but the overall average bits per weight is a user-chosen constraint.

axioms (1)

domain assumption: Quantization noise can be modeled as a proxy for output distribution shift without retraining
Used to define the TAQ-KL scoring rule; invoked when estimating layer sensitivity via output KL under simulated quantization noise.
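
As a rough illustration of the kind of score this axiom supports, the sketch below perturbs one layer at a time in a toy residual network and measures the mean KL divergence between the unperturbed and perturbed output distributions on calibration inputs. Everything here is an assumption for illustration: the toy model, the use of round-to-nearest fake quantization as the noise proxy, the 3-bit default, and the names `fake_quant`, `forward`, and `kl_layer_scores`. It is not the paper's TAQ-KL implementation.

```python
import math
import random


def fake_quant(W, bits):
    """Symmetric round-to-nearest quantization of a matrix; needs bits >= 2."""
    qmax = 2 ** (bits - 1) - 1
    scale = max(abs(w) for row in W for w in row) / qmax or 1.0
    return [[round(w / scale) * scale for w in row] for row in W]


def matvec(W, x):
    return [sum(w * v for w, v in zip(row, x)) for row in W]


def forward(layers, head, x):
    """Toy residual stack: h <- h + tanh(W h), then softmax(head @ h)."""
    h = list(x)
    for W in layers:
        h = [a + math.tanh(b) for a, b in zip(h, matvec(W, h))]
    logits = matvec(head, h)
    m = max(logits)                          # subtract max for stability
    exps = [math.exp(z - m) for z in logits]
    s = sum(exps)
    return [e / s for e in exps]


def kl(p, q, eps=1e-12):
    """KL(p || q) with a small epsilon to avoid log(0)."""
    return sum(a * math.log((a + eps) / (b + eps)) for a, b in zip(p, q))


def kl_layer_scores(layers, head, calib, bits=3):
    """Per-layer mean KL(reference || layer-l-quantized) over calib inputs."""
    ref = [forward(layers, head, x) for x in calib]
    scores = []
    for l, W in enumerate(layers):
        # Quantize only layer l; all other layers keep their original weights.
        perturbed = layers[:l] + [fake_quant(W, bits)] + layers[l + 1:]
        out = [forward(perturbed, head, x) for x in calib]
        scores.append(sum(kl(p, q) for p, q in zip(ref, out)) / len(calib))
    return scores


# Hypothetical toy setup: 3 layers of width 4, 5 output classes, 8 inputs.
random.seed(0)
layers = [[[random.gauss(0, 0.5) for _ in range(4)] for _ in range(4)]
          for _ in range(3)]
head = [[random.gauss(0, 1.0) for _ in range(4)] for _ in range(5)]
calib = [[random.gauss(0, 1.0) for _ in range(4)] for _ in range(8)]
scores = kl_layer_scores(layers, head, calib)
```

Because each layer is perturbed in isolation, a score of this form does not capture how errors from several quantized layers combine in the residual stream, which is the interaction the referee report asks the authors to test.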


work page 2022
[65]

A comprehensive study on post-training quantization for large language models

Yao, Z., Li, C., Wu, X., Youn, S., and He, Y. A comprehensive study on post-training quantization for large language models. arXiv preprint arXiv:2303.08302, 2023. URL https://arxiv.org/abs/2303.08302

work page arXiv 2023
[66]

RPTQ: reorder-based post-training quantization for large language models

Yuan, Z., Niu, L., Liu, J., Liu, W., Wang, X., Shang, Y., Sun, G., Wu, Q., Wu, J., and Wu, B. Rptq: Reorder-based post-training quantization for large language models. arXiv preprint arXiv:2304.01089, 2023. URL https://arxiv.org/abs/2304.01089

work page arXiv 2023
[67]

Efficient model editing with task vector bases

Zeng, H., Liu, S., Zhang, X., Zhu, C., Chen, X., Rivera, C., van Schijndel, M., Saffari, A., Poliak, A., Tsvetkov, Y., and Sedoc, J. Efficient model editing with task vector bases. arXiv preprint arXiv:2501.09248, 2025. URL https://arxiv.org/abs/2501.09248

work page arXiv 2025
[68]

Investigating layer importance in large language models

Zhang, Y., Dong, Y., and Kawaguchi, K. Investigating layer importance in large language models. arXiv preprint arXiv:2409.14381, 2024. URL https://arxiv.org/abs/2409.14381

work page arXiv 2024
[69]

M., Walters, R., and Yu, R

Zhao, B., Gower, R. M., Walters, R., and Yu, R. Improving convergence and generalization using parameter symmetries. arXiv preprint arXiv:2305.13404, 2023

work page arXiv 2023
[70]

Knowledge distillation of large language models: A survey

Zhou, L., Gao, R., Song, G., Zhou, L., Zha, H., Li, H., and et al. Knowledge distillation of large language models: A survey. arXiv preprint arXiv:2405.12396, 2024. URL https://arxiv.org/abs/2405.12396

work page arXiv 2024
[71]

arXiv:2308.07633 [cs] 23

Zhu, X., Li, J., Liu, Y., Ma, C., and Wang, W. A survey on model compression for large language models. arXiv preprint arXiv:2308.07633, 2023. URL https://arxiv.org/abs/2308.07633

work page arXiv 2023
[73]

Representation Engineering: A Top-Down Approach to AI Transparency

Zou, A., Wang, Z., Carlini, N., Nasr, M., Kolter, J. Z., Fredrikson, M., et al. Representation engineering: A top-down approach to AI transparency. arXiv preprint arXiv:2310.01405, 2023 b . URL https://arxiv.org/abs/2310.01405

work page internal anchor Pith review Pith/arXiv arXiv 2023


