arxiv: 2605.18475 · v1 · pith:IBNDVO7Inew · submitted 2026-05-18 · 💻 cs.LG · cs.AI

GAMMA: Global Bit Allocation for Mixed-Precision Models under Arbitrary Budgets

Pith reviewed 2026-05-20 12:21 UTC · model grok-4.3

classification 💻 cs.LG cs.AI
keywords mixed-precision quantization · post-training quantization · large language models · bit allocation · integer programming · sensitivity ranking · model compression · LLM deployment

The pith

GAMMA learns stable module sensitivity rankings in one post-training pass so that integer programming can assign exact bit widths for any target budget in large language models.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper presents GAMMA as a way to perform mixed-precision quantization on billion-parameter models without quantization-aware training or expensive per-budget searches. It optimizes a reconstruction objective on hidden states to discover which modules are most sensitive to reduced precision, then casts the assignment of discrete bit widths as an integer program that exactly meets a chosen average budget. Because the learned preferences capture a budget-independent ranking of module sensitivities, the same preferences can be reused for any new budget by re-solving only the integer program. Experiments on Llama and Qwen models from 8B to 32B show gains over both uniform quantization and prior mixed-precision baselines while reaching 3-bit quality at an average of 2.5 bits. If the ranking remains stable, the approach would let practitioners meet a wide range of memory constraints from a single optimization run.

Core claim

GAMMA is a quantizer-agnostic framework that learns module-wise precision preferences entirely within a post-training pipeline by optimizing a teacher-forced hidden-state reconstruction objective under an augmented Lagrangian constraint. These preferences are projected into exact budget-feasible discrete assignments via integer programming. The learned preferences encode a stable sensitivity ranking rather than budget-specific weights, so a single optimization run supports arbitrary deployment targets by re-solving only the integer program.
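As a reading aid, the two stages named in the core claim can be written out in standard textbook form. The notation is an assumption made here, not the paper's: θ for the trainable preference logits, b̄(θ) for the expected average bit-width, λ and ρ for the multiplier and penalty weight, s_{i,j,b} for a cost derived from the learned preferences, and n_{i,j} for the parameter count of module j in layer i. The reconstruction loss and the exact form of the budget constraint used by the authors may differ.

```latex
% Stage 1 (sketch): augmented Lagrangian for an equality constraint on the average bit-width.
\begin{aligned}
c(\theta) &= \bar{b}(\theta) - b_{\text{target}}, \\
\mathcal{L}_{\rho}(\theta, \lambda) &= \mathcal{L}_{\text{rec}}(\theta) + \lambda\, c(\theta) + \tfrac{\rho}{2}\, c(\theta)^{2}, \\
\lambda &\leftarrow \lambda + \rho\, c(\theta).
\end{aligned}

% Stage 2 (sketch): projection to discrete bit-widths as a multiple-choice knapsack.
\begin{aligned}
\min_{z}\quad & \sum_{i,j} \sum_{b \in \mathcal{B}} s_{i,j,b}\, z_{i,j,b} \\
\text{s.t.}\quad & \sum_{b \in \mathcal{B}} z_{i,j,b} = 1 \quad \text{for every module } (i,j), \\
& \sum_{i,j} n_{i,j} \sum_{b \in \mathcal{B}} b\, z_{i,j,b} \;\le\; b_{\text{target}} \sum_{i,j} n_{i,j}, \\
& z_{i,j,b} \in \{0, 1\}.
\end{aligned}
```

Under this reading, only b_target changes between deployment targets in Stage 2, which is what makes re-solving cheap relative to repeating Stage 1.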

What carries the argument

The central mechanism is the optimization of module-wise precision preferences via a teacher-forced hidden-state reconstruction objective subject to an augmented Lagrangian budget constraint, followed by an integer program that converts the preferences into exact per-module bit allocations.
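The projection step can be illustrated on a toy instance. The sketch below is not the authors' solver: it swaps the integer program for an exact dynamic program over a multiple-choice knapsack, which gives the same optimum on small problems. The module sizes, the per-bit costs, and the name `allocate_bits` are all hypothetical.

```python
import math
from typing import Dict, List, Tuple


def allocate_bits(
    sizes: List[int],
    costs: List[Dict[int, float]],
    bit_choices: List[int],
    avg_budget: float,
) -> Tuple[float, List[int]]:
    """Pick one bit-width per module, minimizing total cost under a bit budget.

    sizes[m]    -- parameter count of module m (toy units, hypothetical)
    costs[m][b] -- penalty for storing module m at b bits (hypothetical scores)
    avg_budget  -- target average bits per parameter
    """
    total_budget = math.floor(avg_budget * sum(sizes) + 1e-9)
    # best[used_bits] = (cost so far, bit-widths chosen so far)
    best: Dict[int, Tuple[float, List[int]]] = {0: (0.0, [])}
    for size, cost_row in zip(sizes, costs):
        nxt: Dict[int, Tuple[float, List[int]]] = {}
        for used, (cost, chosen) in best.items():
            for b in bit_choices:
                new_used = used + size * b
                if new_used > total_budget:
                    continue  # this choice would break the budget
                cand = (cost + cost_row[b], chosen + [b])
                if new_used not in nxt or cand[0] < nxt[new_used][0]:
                    nxt[new_used] = cand
        best = nxt
    if not best:
        raise ValueError("no assignment fits the requested budget")
    return min(best.values(), key=lambda item: item[0])


# Toy instance: four modules, candidate widths 2/3/4 bits.
SIZES = [4, 4, 2, 2]
COSTS = [
    {2: 9.0, 3: 3.0, 4: 1.0},  # very sensitive module
    {2: 2.0, 3: 1.0, 4: 0.5},
    {2: 6.0, 3: 2.0, 4: 0.5},
    {2: 1.0, 3: 0.6, 4: 0.4},  # robust module
]

for budget in (2.5, 3.0):
    cost, bits = allocate_bits(SIZES, COSTS, [2, 3, 4], budget)
    print(budget, bits, cost)
```

The same `COSTS` table is reused for both budgets; only the budget argument changes, mirroring the score-reuse idea. Whether a real solver would reach the same allocations depends on how the learned preferences are turned into costs, which is not specified in the text reviewed here.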

If this is right

  • GAMMA outperforms fixed-precision baselines by up to 12.99 points in average score (Avg.) across Llama and Qwen models from 8B to 32B.
  • It improves upon search-based mixed-precision methods by up to 7.00 points in average score (Avg.).
  • The framework can match the quality of a fixed 3-bit model while using only 2.5 bits on average.
  • Changing the target budget requires only re-solving the integer program, which takes minutes rather than hours.
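To put the 2.5-bit versus 3-bit comparison in concrete terms, the arithmetic below estimates weight storage for an 8B-parameter model. It is a back-of-the-envelope sketch: it counts weight bits only and ignores quantization scales, zero points, and any layers kept at higher precision, so real footprints will be somewhat larger. The function name is hypothetical.

```python
def weight_memory_gb(num_params: float, avg_bits: float) -> float:
    """Weight storage in decimal gigabytes (1 GB = 1e9 bytes)."""
    return num_params * avg_bits / 8 / 1e9


NUM_PARAMS = 8e9  # an 8B-parameter model, the smallest size mentioned above

for bits in (16, 3.0, 2.5):
    print(f"{bits:>4} bits/param: {weight_memory_gb(NUM_PARAMS, bits):.2f} GB")

# Relative saving of 2.5-bit over 3-bit storage (one sixth, about 16.7%).
saving = 1 - weight_memory_gb(NUM_PARAMS, 2.5) / weight_memory_gb(NUM_PARAMS, 3.0)
print(f"saving vs 3-bit: {saving:.1%}")
```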

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

  • If the sensitivity ranking transfers across model families, the same preferences could be applied to new architectures without repeating the optimization step.
  • The integer programming step could be approximated by faster greedy or dynamic programming heuristics for models too large for exact solvers while still guaranteeing budget compliance.
  • Because the method is quantizer-agnostic, the learned preferences might be combined with different quantization kernels or calibration datasets without retraining the preference model.
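The greedy alternative mentioned in the second bullet could look like the following. This is an editorial sketch, not something the paper describes: every module starts at the lowest bit-width, and the promotion with the largest cost reduction per added bit is applied repeatedly while it still fits. The result always respects the budget but is not guaranteed to be optimal. The toy sizes, costs, and the name `greedy_allocate` are hypothetical.

```python
from typing import Dict, List


def greedy_allocate(
    sizes: List[int],
    costs: List[Dict[int, float]],
    bit_choices: List[int],
    avg_budget: float,
) -> List[int]:
    """Budget-feasible (not necessarily optimal) bit allocation by greedy promotion."""
    bits = sorted(bit_choices)
    level = [0] * len(sizes)  # index into `bits` for each module
    used = sum(size * bits[0] for size in sizes)
    budget = avg_budget * sum(sizes)
    if used > budget + 1e-9:
        raise ValueError("even the lowest bit-width exceeds the budget")
    while True:
        best_gain, best_m = 0.0, None
        for m, size in enumerate(sizes):
            if level[m] + 1 >= len(bits):
                continue  # already at the highest candidate
            lo, hi = bits[level[m]], bits[level[m] + 1]
            extra = size * (hi - lo)
            if used + extra > budget + 1e-9:
                continue  # promotion would break the budget
            gain = (costs[m][lo] - costs[m][hi]) / extra
            if gain > best_gain:
                best_gain, best_m = gain, m
        if best_m is None:
            break  # no affordable promotion improves the cost
        lo, hi = bits[level[best_m]], bits[level[best_m] + 1]
        used += sizes[best_m] * (hi - lo)
        level[best_m] += 1
    return [bits[l] for l in level]


# Toy instance with hypothetical sizes and costs.
SIZES = [4, 4, 2, 2]
COSTS = [
    {2: 9.0, 3: 3.0, 4: 1.0},
    {2: 2.0, 3: 1.0, 4: 0.5},
    {2: 6.0, 3: 2.0, 4: 0.5},
    {2: 1.0, 3: 0.6, 4: 0.4},
]

print(greedy_allocate(SIZES, COSTS, [2, 3, 4], 2.5))
print(greedy_allocate(SIZES, COSTS, [2, 3, 4], 3.0))
```

On this small instance the greedy result happens to match the exact optimum at both budgets; on larger or less regular cost tables it can fall short, which is the trade-off the bullet alludes to.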

Load-bearing premise

The learned module-wise precision preferences encode a stable sensitivity ranking that is independent of the target budget.

What would settle it

Re-optimize the preferences from scratch for a second budget and compare the resulting accuracy against the accuracy obtained by reusing the first run's preferences in a fresh integer program for the same second budget; a large gap would falsify the stability claim.

Figures

Figures reproduced from arXiv: 2605.18475 by Haiyan Zhao, Haoyu Wang, Lihua Zhang, Tianbo Huang, Xu Han, Zhangyang Yao.

Figure 1
Figure 1: Training pipeline of GAMMA. Layer-wise hidden states from a full-precision teacher supervise mixed-precision mask learning via a reconstruction loss, while a global penalty enforces the target average bit-width. Solid arrows indicate forward computation and dashed arrows indicate backward gradients.
… the full-precision hidden state H^(l_{i−1})(x) as input to the mixed-precision version of layer i, yielding H^(l…
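The teacher-forced reconstruction idea can be sketched with toy layers. Each student layer receives the teacher's previous hidden state rather than its own output, so errors do not compound across layers during preference learning. The mean-squared-error loss, the scalar "layers", and all names below are assumptions for illustration; the paper's actual loss and layer structure are not given in the text reviewed here.

```python
from typing import Callable, List

Layer = Callable[[List[float]], List[float]]


def make_layer(weight: float) -> Layer:
    """Toy layer: multiply every hidden unit by a single scalar weight."""
    return lambda hidden: [weight * v for v in hidden]


def teacher_forced_loss(
    x: List[float], teacher: List[Layer], student: List[Layer]
) -> float:
    """Sum over layers of the MSE between student and teacher hidden states."""
    hidden = x
    total = 0.0
    for f_teacher, f_student in zip(teacher, student):
        target = f_teacher(hidden)  # full-precision hidden state of this layer
        pred = f_student(hidden)    # student layer fed the teacher's previous state
        total += sum((p - t) ** 2 for p, t in zip(pred, target)) / len(target)
        hidden = target             # teacher forcing: propagate the teacher state
    return total


teacher_layers = [make_layer(1.3), make_layer(0.7)]
student_layers = [make_layer(1.0), make_layer(1.0)]  # weights crudely rounded

loss = teacher_forced_loss([1.0, 2.0], teacher_layers, student_layers)
print(loss)
```

Because each layer's error is measured in isolation, the per-layer terms can be read as rough sensitivity signals, though cross-layer interactions are deliberately excluded by this construction.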
Figure 2
Figure 2: Mixed-precision layer and differentiable bit selection. GAMMA assigns a bit-width to each linear module by forming a weighted combination of pre-quantized candidates, where the weights are produced by a Gumbel–Softmax mask parameterized by trainable logits.
… where τ > 0 is the temperature and {g_{i,j,b}}_{b∈B} are i.i.d. Gumbel noises. Replacing the binary variables z_{i,j,b} by their continuous counterparts p_{i,j,b}…
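A minimal sketch of the bit-selection relaxation described in this caption follows, in plain Python without gradients. The Gumbel–Softmax sampling follows the standard definition; the toy uniform quantizer `fake_quantize`, the example weights, and the logits are hypothetical stand-ins and say nothing about which quantizer the authors use.

```python
import math
import random
from typing import Dict, List


def gumbel_softmax(logits: List[float], tau: float, rng: random.Random) -> List[float]:
    """Relaxed one-hot sample: softmax((logit + Gumbel noise) / tau)."""
    noisy = []
    for logit in logits:
        u = max(rng.random(), 1e-12)      # avoid log(0)
        g = -math.log(-math.log(u))       # Gumbel(0, 1) noise
        noisy.append((logit + g) / tau)
    peak = max(noisy)                     # subtract the max for numerical stability
    exps = [math.exp(v - peak) for v in noisy]
    total = sum(exps)
    return [e / total for e in exps]


def fake_quantize(weights: List[float], bits: int) -> List[float]:
    """Toy uniform quantizer on [-1, 1] (hypothetical stand-in for a real one)."""
    step = 2.0 / (2 ** bits - 1)
    return [round((min(max(w, -1.0), 1.0) + 1.0) / step) * step - 1.0 for w in weights]


BITS = [2, 3, 4]
weights = [0.31, -0.72, 0.05, 0.88]
candidates: Dict[int, List[float]] = {b: fake_quantize(weights, b) for b in BITS}

rng = random.Random(0)
logits = [0.2, 1.0, -0.5]                 # trainable in practice; fixed here
probs = gumbel_softmax(logits, tau=1.0, rng=rng)

# Weighted combination of the pre-quantized candidates, one term per bit-width.
mixed = [sum(p * candidates[b][k] for p, b in zip(probs, BITS)) for k in range(len(weights))]
expected_bits = sum(p * b for p, b in zip(probs, BITS))
print(probs, expected_bits, mixed)
```

Lowering `tau` pushes `probs` toward a one-hot vector, so the mixed weight approaches a single quantized candidate; `expected_bits` is the quantity a budget penalty would act on.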
Figure 3
Figure 3: Budget–accuracy scaling. Average zero-shot accuracy versus target average bit-width on Qwen3-8B and Qwen3-14B.
… as b_target increases, indicating a stable allocation path rather than brittle, budget-specific solutions. The curve shows substantial gains from 2.5 to 3.0 bits, followed by smaller improvements toward 3.5 bits, suggesting diminishing returns once the most sensitive modules have been promoted. Acr…
Figure 4
Figure 4: Learned bit-width allocation patterns across budgets. Heatmaps show the expected bit-width assigned by GAMMA to each projection type (x-axis) in every Transformer layer (y-axis), optimized on RedPajama under target budgets b_target ∈ {2.5, 3.0}. The two allocations are highly consistent (cosine similarity 0.9755), indicating that the learned pattern largely preserves the same relative sensitivity structur…
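The consistency figure quoted in this caption is a cosine similarity between two flattened allocation maps. The sketch below shows that computation on made-up vectors, together with a mean-centered variant; the centering is an editorial suggestion, not something the caption reports, and all values are hypothetical.

```python
import math
from typing import List


def cosine(a: List[float], b: List[float]) -> float:
    """Cosine similarity between two equally long vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    if norm == 0.0:
        raise ValueError("cosine similarity is undefined for a zero vector")
    return dot / norm


def centered_cosine(a: List[float], b: List[float]) -> float:
    """Cosine similarity after subtracting each vector's mean (Pearson correlation)."""
    mean_a, mean_b = sum(a) / len(a), sum(b) / len(b)
    return cosine([x - mean_a for x in a], [y - mean_b for y in b])


# Two opposed toy allocations: each promotes the module the other demotes.
alloc_low = [2.0, 4.0]
alloc_high = [4.0, 2.0]
print(cosine(alloc_low, alloc_high))           # 16 / 20
print(centered_cosine(alloc_low, alloc_high))  # opposite after centering
```

Because bit-widths are all positive, the raw cosine stays fairly high (0.8 here) even for opposed allocations, so a centered or rank-based measure may be a helpful companion when judging how much structure two heatmaps really share.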
Figure 5
Figure 5: Additional bit-width heatmaps across budgets and calibration sets. Expected bit-width assigned by GAMMA to each projection type (x-axis) at every Transformer layer (y-axis). Panels (a)–(g) vary the target average bit-width b_target on the RedPajama calibration set, showing a consistent allocation structure as the budget increases. Panel (h) reports the same visualization on WikiText at b_target = 2.5, illustra…
Original abstract

Mixed-precision quantization improves the budget–accuracy trade-off for large language models (LLMs) by allocating more bits to sensitive modules. However, automating this allocation at LLM scale faces a unique combination of constraints: learnable approaches require quantization-aware training, which is infeasible for billion-parameter models; training-free alternatives rely on static proxy metrics that miss cross-module interactions and must be recomputed per target budget; and search-based methods are expensive without guaranteeing exact budget compliance. We propose GAMMA, a quantizer-agnostic framework that learns module-wise precision preferences entirely within a post-training pipeline. GAMMA optimizes a teacher-forced hidden-state reconstruction objective under an augmented Lagrangian constraint, and projects the learned preferences into exact budget-feasible discrete assignments via integer programming. A key property is score reuse: because the learned preferences encode a stable sensitivity ranking rather than budget-specific weights, a single training run serves arbitrary deployment targets by re-solving only the integer program, reducing per-budget adaptation from hours to a few minutes. Across Llama and Qwen models (8B–32B), GAMMA outperforms both fixed-precision baselines (up to +12.99 Avg.) and search-based mixed-precision methods (up to +7.00 Avg.), and can match fixed 3-bit quality at 2.5-bit average precision, enabling deployment at substantially smaller memory footprints.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Referee Report

2 major / 2 minor

Summary. The paper proposes GAMMA, a post-training, quantizer-agnostic framework for mixed-precision quantization of LLMs. It optimizes a teacher-forced hidden-state reconstruction objective under an augmented Lagrangian constraint to learn module-wise precision preferences, then projects these into exact budget-feasible bit assignments via integer programming. The central claim is that the learned preferences encode a stable sensitivity ranking independent of the target budget, enabling a single optimization run to support arbitrary deployment budgets by re-solving only the integer program. Experiments on Llama and Qwen models (8B–32B) report gains of up to +12.99 Avg. over fixed-precision baselines and +7.00 Avg. over search-based methods, including matching fixed 3-bit quality at 2.5-bit average precision.

Significance. If the stability of the learned preference ranking is empirically validated, the approach would meaningfully advance efficient LLM deployment by decoupling the expensive preference-learning stage from per-budget adaptation, reducing adaptation time from hours to minutes. This addresses key limitations of both quantization-aware training (infeasible at scale) and per-budget proxy recomputation or search methods. The quantizer-agnostic design and exact budget compliance via integer programming are practical strengths for real-world memory-constrained settings.

major comments (2)
  1. [Abstract / method section] Abstract and method description: The central claim that 'the learned preferences encode a stable sensitivity ranking rather than budget-specific weights' (enabling arbitrary-budget reuse) is load-bearing for the score-reuse property. The manuscript does not provide direct evidence that the ranking order is preserved when the integer program is solved at budgets distant from the training regime (e.g., 2.0 vs. 3.5 bits average precision). Module interactions and error propagation could alter relative sensitivities; without ranking-correlation or order-stability experiments across multiple target budgets, the arbitrary-budget claim rests on an unverified assumption.
  2. [Experiments / results] Experimental section: Performance numbers (e.g., +12.99 Avg. over fixed-precision, matching 3-bit quality at 2.5 bits) are stated without details on experimental controls, number of runs, statistical significance, or whether the integer program always returns feasible solutions for the tested budgets and models. These omissions make it impossible to assess whether the reported gains are robust or sensitive to the specific reconstruction objective and Lagrangian penalty used during learning.
minor comments (2)
  1. [Abstract] The abstract would benefit from a brief parenthetical reference to the specific table or figure containing the main quantitative results.
  2. [Method] Notation for the augmented Lagrangian and the integer program variables could be introduced more explicitly with a single equation block to improve readability for readers unfamiliar with the formulation.

Simulated Author's Rebuttal

2 responses · 0 unresolved

We thank the referee for the constructive and detailed review, which highlights both the practical strengths of GAMMA and areas where additional evidence would strengthen the central claims. We address each major comment below with clarifications and revisions to the manuscript.

Point-by-point responses
  1. Referee: [Abstract / method section] Abstract and method description: The central claim that 'the learned preferences encode a stable sensitivity ranking rather than budget-specific weights' (enabling arbitrary-budget reuse) is load-bearing for the score-reuse property. The manuscript does not provide direct evidence that the ranking order is preserved when the integer program is solved at budgets distant from the training regime (e.g., 2.0 vs. 3.5 bits average precision). Module interactions and error propagation could alter relative sensitivities; without ranking-correlation or order-stability experiments across multiple target budgets, the arbitrary-budget claim rests on an unverified assumption.

    Authors: We agree that direct evidence of ranking stability across distant budgets would provide stronger support for the score-reuse property and address potential concerns about module interactions. While our reported performance consistency when reusing the same preferences across budgets (2.0–4.0 bits) offers indirect validation, we acknowledge this is not a substitute for explicit stability metrics. In the revised manuscript we have added a new subsection (Section 4.3) reporting Spearman rank correlation and Kendall tau between preference orderings obtained from independent optimization runs at different target budgets. These correlations average above 0.87 across the tested Llama and Qwen models, indicating that relative module sensitivities remain largely preserved. We have also updated the method description to reference this empirical result when stating the stability assumption. revision: yes

  2. Referee: [Experiments / results] Experimental section: Performance numbers (e.g., +12.99 Avg. over fixed-precision, matching 3-bit quality at 2.5 bits) are stated without details on experimental controls, number of runs, statistical significance, or whether the integer program always returns feasible solutions for the tested budgets and models. These omissions make it impossible to assess whether the reported gains are robust or sensitive to the specific reconstruction objective and Lagrangian penalty used during learning.

    Authors: We thank the referee for noting these important omissions for assessing robustness. In the revised manuscript and a new appendix section we now provide: (i) results averaged over three independent runs with different random seeds for calibration-set sampling, including standard deviations; (ii) paired t-test p-values for the claimed gains over baselines; (iii) explicit confirmation that the integer program (solved via a standard MIP solver) returned feasible solutions for all reported budgets and models, as the per-module bit bounds and total budget constraint are constructed to be satisfiable; and (iv) full specification of the Lagrangian penalty schedule, reconstruction loss weights, and calibration data (128 C4 samples) used in all experiments. These controls are identical across compared methods. We believe these additions allow readers to better evaluate the reliability of the results. revision: yes
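The rank-stability check named in the first response (Spearman and Kendall correlations between preference orderings from different budgets) can be computed as below. This is a plain-Python sketch that assumes no tied scores; the two score vectors are hypothetical and do not come from the paper or the rebuttal.

```python
from itertools import combinations
from typing import List


def ranks(values: List[float]) -> List[int]:
    """Rank of each value, 1 = smallest. Assumes all values are distinct."""
    order = sorted(range(len(values)), key=lambda k: values[k])
    result = [0] * len(values)
    for rank, k in enumerate(order, start=1):
        result[k] = rank
    return result


def spearman(a: List[float], b: List[float]) -> float:
    """Spearman rank correlation (no-ties formula)."""
    n = len(a)
    d2 = sum((ra - rb) ** 2 for ra, rb in zip(ranks(a), ranks(b)))
    return 1.0 - 6.0 * d2 / (n * (n * n - 1))


def kendall_tau(a: List[float], b: List[float]) -> float:
    """Kendall tau-a: (concordant - discordant) pairs over all pairs."""
    score = 0
    for i, j in combinations(range(len(a)), 2):
        product = (a[i] - a[j]) * (b[i] - b[j])
        score += 1 if product > 0 else -1  # no ties assumed
    n = len(a)
    return score / (n * (n - 1) / 2)


# Hypothetical per-module sensitivity scores from runs at two target budgets.
scores_budget_a = [0.9, 0.1, 0.5, 0.3]
scores_budget_b = [0.8, 0.2, 0.3, 0.6]

print(spearman(scores_budget_a, scores_budget_b))
print(kendall_tau(scores_budget_a, scores_budget_b))
```

Real preference vectors may contain ties, in which case a tie-aware implementation from a statistics library would be the safer choice.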

Circularity Check

0 steps flagged

No circularity: optimization objective and IP projection are independent of final accuracy metric

full rationale

The paper derives module-wise scores by optimizing a teacher-forced reconstruction loss subject to an augmented Lagrangian constraint, then solves an integer program for exact budget compliance. This chain does not reduce any claimed prediction or ranking to a fitted input by construction, nor does it rely on self-citation for uniqueness or load-bearing premises. The stability of the sensitivity ranking across budgets is presented as an empirical property enabling score reuse, not as a definitional or tautological result. Experiments compare against external baselines, keeping the derivation self-contained.

Axiom & Free-Parameter Ledger

0 free parameters · 0 axioms · 0 invented entities

Abstract-only review provides no explicit free parameters, axioms, or invented entities; the Lagrangian multiplier and the sensitivity-ranking stability assumption are implicit but not quantified.



