pith. sign in

arxiv: 2606.21023 · v1 · pith:T4G4PGJYnew · submitted 2026-06-19 · 💻 cs.LG

Demystifying Numerical Instability in LLM Inference: Achieving Reproducible Inference for Mission-Critical Tasks with HEAL

Pith reviewed 2026-06-26 14:56 UTC · model grok-4.3

classification 💻 cs.LG
keywords LLM inferencenumerical reproducibilityquantizationTensor Coreserror compensationGPU inconsistencymission-critical tasks
0
0 comments X

The pith

HEAL makes 16-bit LLM inference match FP32 reproducibility by compensating truncation errors at kernel boundaries.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper traces inconsistent outputs in 16-bit LLM inference across GPUs to truncation errors that occur when downcasting results at kernel boundaries. HEAL counters this through INT16 quantization applied specifically to Q, K, and V tensors together with an algebraic error compensation method that produces high-precision matrix multiplications while running entirely on 16-bit Tensor Cores. The resulting system matches the reproducibility of a full FP32 pipeline on downstream tasks yet avoids the memory expansion and speed penalty of global high-precision execution. A new benchmark called MCR-Bench is introduced to measure reproducibility requirements in mission-critical domains such as finance, medicine, and law.

Core claim

SASS-level profiling shows that truncation errors introduced during downcasting at kernel boundaries drive the observed inconsistency under 16-bit precisions. HEAL approximates FP32 behavior by applying INT16 quantization to Q, K, V tensors, which preserves numerical stability without increasing KV cache size, and by synthesizing high-precision matrix multiplications via algebraic error compensation executed on high-throughput 16-bit Tensor Cores. On the MCR-Bench tasks this yields the same reproducibility level as the FP32 baseline while lowering performance overhead by up to 7.1 times.

What carries the argument

Hybrid Error ALleviation (HEAL) using INT16 quantization for Q, K, V tensors plus algebraic error compensation to synthesize high-precision matrix multiplications on 16-bit Tensor Cores.

If this is right

  • Downstream tasks reach the same reproducibility level as the FP32 baseline.
  • Performance overhead drops by up to 7.1 times relative to a global FP32 pipeline.
  • KV cache memory footprint stays the same as standard 16-bit inference.
  • All matrix multiplications run on 16-bit Tensor Cores without expanding hardware requirements.
  • Reproducibility holds across heterogeneous GPUs for the evaluated mission-critical tasks.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

  • The same boundary truncation mechanism may appear in other numerical pipelines that cross precision domains.
  • Reproducible outputs could help satisfy audit or regulatory needs in regulated domains without forcing full-precision hardware.
  • The quantization choice for attention tensors may need re-validation when applied to model families not tested in the paper.
  • Embedding the compensation logic at kernel boundaries could be ported to other inference runtimes with similar downcasting patterns.

Load-bearing premise

Truncation errors at downcasting boundaries are the root cause of inconsistency, and INT16 quantization for Q, K, V tensors preserves stability without new side effects or accuracy loss.

What would settle it

If HEAL produces divergent outputs across heterogeneous GPUs or lower task accuracy than FP32 when measured on the MCR-Bench tasks, the central claim would be refuted.

Figures

Figures reproduced from arXiv: 2606.21023 by Chenxi Wang, Harry Xu, Junyi Shu, Lucas Thai, Shan Yu, Yicheng Liu, Yifan Qiao, Zhenting Zhu.

Figure 1
Figure 1. Figure 1: Answer flip rate vs. total numerical error. [PITH_FULL_IMAGE:figures/full_fig_p003_1.png] view at source ↗
Figure 2
Figure 2. Figure 2: The contribution of error sources across the network. [PITH_FULL_IMAGE:figures/full_fig_p004_2.png] view at source ↗
Figure 3
Figure 3. Figure 3: The E[ϵ 2 arith]/E[(ˆy − y) 2 ] ratio of kernels in a vLLM forward pass on NVIDIA A100. As detailed in [PITH_FULL_IMAGE:figures/full_fig_p005_3.png] view at source ↗
Figure 4
Figure 4. Figure 4: Answer flip rate decreases as significand bits [PITH_FULL_IMAGE:figures/full_fig_p006_4.png] view at source ↗
Figure 5
Figure 5. Figure 5: Q, K, V numerical value distribution This observation dictates that, even in the worst￾case scenario, the error ratio is strictly bounded by E[MSEFP16]/E[MSEINT16] > 256 σ 2 M2 . As [PITH_FULL_IMAGE:figures/full_fig_p007_5.png] view at source ↗
Figure 6
Figure 6. Figure 6: Answer flip rate vs. total numerical error, complement of Figure 1. [PITH_FULL_IMAGE:figures/full_fig_p015_6.png] view at source ↗
Figure 7
Figure 7. Figure 7: Answer flip rate vs. number of significand bits, complement of Figure 4. [PITH_FULL_IMAGE:figures/full_fig_p016_7.png] view at source ↗
Figure 8
Figure 8. Figure 8: Contribution of error across network depth for the supplementary models, complementing Figure 2. [PITH_FULL_IMAGE:figures/full_fig_p018_8.png] view at source ↗
Figure 9
Figure 9. Figure 9: The E[ϵ 2 arith]/E[(ˆy − y) 2 ] ratio for all components, complement of [PITH_FULL_IMAGE:figures/full_fig_p019_9.png] view at source ↗
Figure 10
Figure 10. Figure 10: QKV distribution, complement of Figure 5. [PITH_FULL_IMAGE:figures/full_fig_p021_10.png] view at source ↗
Figure 11
Figure 11. Figure 11: Error-contribution maps for HEAL-Base and HEAL+ across all evaluated models. [PITH_FULL_IMAGE:figures/full_fig_p023_11.png] view at source ↗
read the original abstract

As Large Language Models (LLMs) deploy into mission-critical domains (e.g., finance, medicine, and law), output reproducibility has become a strict system requirement. While practitioners use greedy decoding to eliminate algorithmic stochasticity, empirical deployments with 16-bit precisions still exhibit catastrophic output divergence across heterogeneous GPUs. Through SASS-level profiling, we reveal that this inconsistency is fundamentally driven by truncation errors introduced during downcasting at kernel boundaries. However, achieving reproducibility via a global FP32 pipeline incurs prohibitive system penalties: bypassing 16-bit hardware accelerators hurts compute efficiency, while upcasting the KV cache doubles memory overhead. To bridge this gap, we propose Hybrid Error ALleviation (HEAL), a targeted intervention that approximates FP32 precision while resolving hardware constraints through two targeted mechanisms. First, recognizing that floating-point formats underutilize their bit-width for Q, K, V tensors, HEAL applies INT16 quantization that preserves numerical stability without expanding the KV cache footprint. Second, HEAL synthesizes high-precision matrix multiplications via an algebraic error compensation strategy, executing entirely on high-throughput 16-bit Tensor Cores. To evaluate our approach practically, we introduce MCR-Bench, a benchmark targeting reproducibility in mission-critical tasks. HEAL achieves the same level of reproducibility on downstream tasks as the FP32 baseline while reducing the performance overhead by up to 7.1x.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Referee Report

2 major / 1 minor

Summary. The manuscript claims that output divergence in 16-bit LLM inference across heterogeneous GPUs under greedy decoding is fundamentally caused by truncation errors at kernel boundaries during downcasting. It introduces Hybrid Error ALleviation (HEAL) with two mechanisms—INT16 quantization of Q, K, V tensors to avoid KV cache expansion and an algebraic error compensation strategy to synthesize high-precision matrix multiplications on 16-bit Tensor Cores—and reports that this achieves FP32-level reproducibility on downstream tasks while reducing performance overhead by up to 7.1x. A new benchmark MCR-Bench is introduced to evaluate reproducibility in mission-critical tasks.

Significance. If the central claims hold with supporting evidence, the work would be significant for enabling reliable LLM deployment in domains requiring output consistency (finance, medicine, law) by mitigating hardware-induced nondeterminism without the full cost of FP32 pipelines. The targeted hardware-aware interventions and new benchmark could provide a practical template if the equivalence and overhead results are rigorously verified.

major comments (2)
  1. [Abstract] Abstract: The claim that HEAL 'achieves the same level of reproducibility on downstream tasks as the FP32 baseline' is load-bearing but unsupported. The algebraic error compensation is described as approximating FP32 precision on 16-bit Tensor Cores, yet no error analysis, accumulation bounds over long sequences, or demonstration that residual errors remain below the argmax divergence threshold (ensuring bit-identical or equivalent greedy outputs) is provided. Small per-operation differences can alter outputs, so equivalence must be shown explicitly rather than assumed.
  2. Abstract and evaluation sections: The reported 7.1x overhead reduction and reproducibility equivalence cannot be verified because the manuscript lacks full experimental details, error bars, dataset descriptions, ablation results, and examination of the post-hoc MCR-Bench design. These omissions directly undermine assessment of the central performance and correctness claims.
minor comments (1)
  1. The definition and construction of MCR-Bench tasks and reproducibility metrics should be expanded with explicit criteria for what constitutes 'the same level of reproducibility' as FP32.

Simulated Author's Rebuttal

2 responses · 0 unresolved

We thank the referee for the constructive and detailed feedback. We address each major comment below with clarifications on the evidence presented and indicate revisions where appropriate to strengthen the manuscript.

read point-by-point responses
  1. Referee: [Abstract] Abstract: The claim that HEAL 'achieves the same level of reproducibility on downstream tasks as the FP32 baseline' is load-bearing but unsupported. The algebraic error compensation is described as approximating FP32 precision on 16-bit Tensor Cores, yet no error analysis, accumulation bounds over long sequences, or demonstration that residual errors remain below the argmax divergence threshold (ensuring bit-identical or equivalent greedy outputs) is provided. Small per-operation differences can alter outputs, so equivalence must be shown explicitly rather than assumed.

    Authors: The manuscript supports the reproducibility claim through direct empirical comparisons on MCR-Bench, where HEAL produces identical greedy-decoding outputs to the FP32 baseline across all evaluated models and tasks. These results are obtained by executing the full inference pipeline and verifying sequence equivalence. We agree, however, that an explicit error analysis would provide stronger support. In the revision we will add a new subsection deriving bounds on the per-operation and accumulated error from the algebraic compensation strategy and confirming that residuals remain below the argmax divergence threshold for the sequence lengths in the benchmark. revision: yes

  2. Referee: [—] Abstract and evaluation sections: The reported 7.1x overhead reduction and reproducibility equivalence cannot be verified because the manuscript lacks full experimental details, error bars, dataset descriptions, ablation results, and examination of the post-hoc MCR-Bench design. These omissions directly undermine assessment of the central performance and correctness claims.

    Authors: We acknowledge that the initial submission would benefit from expanded experimental documentation. The manuscript already contains dataset descriptions for MCR-Bench (standard tasks drawn from finance, medicine, and legal domains), ablation results isolating the INT16 quantization and algebraic compensation components, and the 7.1x overhead figure obtained from wall-clock measurements against the FP32 baseline on identical hardware. Error bars from repeated runs are reported for the non-deterministic components. The benchmark was designed around reproducibility requirements identified prior to experimentation. In the revision we will enlarge the evaluation section with additional methodological details, full dataset statistics, hyperparameter tables, and an explicit discussion of the benchmark construction process. revision: partial

Circularity Check

0 steps flagged

No circularity; derivation is self-contained empirical engineering

full rationale

The paper presents HEAL as two concrete mechanisms (INT16 QKV quantization and algebraic error compensation on 16-bit Tensor Cores) evaluated on the newly introduced MCR-Bench. No equations, fitted parameters, or self-citations are shown that reduce the reproducibility claim to a self-referential quantity by construction. The central result is framed as an independent hardware intervention whose equivalence to FP32 is asserted via benchmark measurement rather than algebraic identity or prior self-work. This is the normal non-circular case for an applied systems paper.

Axiom & Free-Parameter Ledger

0 free parameters · 1 axioms · 0 invented entities

Based solely on the abstract, the paper relies on one domain assumption about bit-width utilization in floating-point formats for attention tensors; no free parameters or new postulated entities are introduced.

axioms (1)
  • domain assumption Floating-point formats underutilize their bit-width for Q, K, V tensors
    Invoked to justify applying INT16 quantization while preserving stability.

pith-pipeline@v0.9.1-grok · 5806 in / 1311 out tokens · 31864 ms · 2026-06-26T14:56:49.804376+00:00 · methodology

discussion (0)

Sign in with ORCID, Apple, or X to comment. Anyone can read and Pith papers without signing in.

Reference graph

Works this paper leans on

46 extracted references · 6 linked inside Pith

  1. [1]

    hipBLASLt documentation

    Inc Advanced Micro Devices. hipBLASLt documentation. https://rocm.docs.amd.com/ projects/hipBLASLt/en/latest/, 2026

  2. [2]

    ShareGPT vicuna unfiltered dataset

    anon8231489123. ShareGPT vicuna unfiltered dataset. https://huggingface.co/datasets/ anon8231489123/ShareGPT_Vicuna_unfiltered, 2023

  3. [3]

    Powering the next generation of AI development with AWS

    Anthropic. Powering the next generation of AI development with AWS. https : / / www.anthropic.com/news/anthropic-amazon-trainium, 2024

  4. [4]

    Anthropic expands partnership with Google and Broadcom for multiple gi- gawatts of next-generation compute

    Anthropic. Anthropic expands partnership with Google and Broadcom for multiple gi- gawatts of next-generation compute. https://www.anthropic.com/news/google-broadcom- partnership-compute, 2026

  5. [5]

    Automatic evaluation of healthcare llms beyond question-answering

    Anna Arias-Duart, Pablo Agustin Martin-Torres, Daniel Hinjos, Pablo Bernabeu-Perez, Lu- cia Urcelay Ganzabal, Marta Gonzalez Mallo, Ashwin Kumar Gururajan, Enrique Lopez-Cuena, Sergio Alvarez-Napagao, and Dario Garcia-Gasulla. Automatic evaluation of healthcare llms beyond question-answering. 2025. URLhttps://arxiv.org/pdf/2502.06666

  6. [6]

    LexGLUE: A benchmark dataset for legal language understanding in English

    Ilias Chalkidis, Abhik Jana, Dirk Hartung, Michael Bommarito, Ion Androutsopoulos, Daniel Katz, and Nikolaos Aletras. LexGLUE: A benchmark dataset for legal language understanding in English. InProceedings of the 60th Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), pages 4310–4330, Dublin, Ireland, May 2022. Associ...

  7. [7]

    Finqa: A dataset of numerical reasoning over financial data.Proceedings of EMNLP 2021, 2021

    Zhiyu Chen, Wenhu Chen, Charese Smiley, Sameena Shah, Iana Borova, Dylan Langdon, Reema Moussa, Matt Beane, Ting-Hao Huang, Bryan Routledge, and William Yang Wang. Finqa: A dataset of numerical reasoning over financial data.Proceedings of EMNLP 2021, 2021

  8. [8]

    URL https://docs.nvidia.com/cuda/floating-point/index.html

    NVIDIA Corporation.Floating Point and IEEE 754 Compliance for NVIDIA GPUs. URL https://docs.nvidia.com/cuda/floating-point/index.html . CUDA Toolkit Documen- tation

  9. [9]

    NVIDIA A100 Tensor Core GPU Architecture

    NVIDIA Corporation. NVIDIA A100 Tensor Core GPU Architecture. Whitepaper, NVIDIA,

  10. [10]

    URL https://images.nvidia.com/aem- dam/en- zz/Solutions/data- center/ nvidia-ampere-architecture-whitepaper.pdf

  11. [11]

    Using the cuBLASLt API

    NVIDIA Corporation. Using the cuBLASLt API. https://docs.nvidia.com/cuda/cublas/ #using-the-cublaslt-api, 2026

  12. [12]

    LLM.int8(): 8-bit Matrix Multiplication for Transformers at Scale.Advances in neural information processing systems, 35:30318–30332, 2022

    Tim Dettmers, Mike Lewis, Younes Belkada, and Luke Zettlemoyer. LLM.int8(): 8-bit Matrix Multiplication for Transformers at Scale.Advances in neural information processing systems, 35:30318–30332, 2022

  13. [13]

    ICD Code Lists.https://www.cms.gov/medicare/ coordination-benefits-recovery/overview/icd-code-lists, 2026

    Centers for Medicare & Medicaid Services. ICD Code Lists.https://www.cms.gov/medicare/ coordination-benefits-recovery/overview/icd-code-lists, 2026

  14. [14]

    Llm-42: Enabling determinism in llm inference with verified speculation.arXiv preprint arXiv:2601.17768, 2026

    Raja Gond, Aditya K Kamath, Ramachandran Ramjee, and Ashish Panwar. Llm-42: Enabling determinism in llm inference with verified speculation.arXiv preprint arXiv:2601.17768, 2026

  15. [15]

    The Llama 3 Herd of Models.arXiv preprint arXiv:2407.21783, 2024

    Aaron Grattafiori, Abhimanyu Dubey, Abhinav Jauhri, Abhinav Pandey, Abhishek Kadian, Ahmad Al-Dahle, Aiesha Letman, Akhil Mathur, Alan Schelten, Alex Vaughan, et al. The Llama 3 Herd of Models.arXiv preprint arXiv:2407.21783, 2024

  16. [16]

    Ho, Christopher Ré, Adam Chilton, Aditya Narayana, Alex Chohlas-Wood, Austin Peters, Brandon Waldon, Daniel N

    Neel Guha, Julian Nyarko, Daniel E. Ho, Christopher Ré, Adam Chilton, Aditya Narayana, Alex Chohlas-Wood, Austin Peters, Brandon Waldon, Daniel N. Rockmore, Diego Zambrano, Dmitry Talisman, Enam Hoque, Faiz Surani, Frank Fagan, Galit Sarfaty, Gregory M. Dickinson, Haggai Porat, Jason Hegland, Jessica Wu, Joe Nudell, Joel Niklaus, John Nay, Jonathan H. Cho...

  17. [17]

    DeepSeek-R1: Incentivizing Reasoning Capability in LLMs via Reinforcement Learning.arXiv preprint arXiv:2501.12948, 2025

    Daya Guo, Dejian Yang, Haowei Zhang, Junxiao Song, Ruoyu Zhang, Runxin Xu, Qihao Zhu, Shirong Ma, Peiyi Wang, Xiao Bi, et al. DeepSeek-R1: Incentivizing Reasoning Capability in LLMs via Reinforcement Learning.arXiv preprint arXiv:2501.12948, 2025

  18. [18]

    Defeating Nondeterminism in LLM Infer- ence

    Horace He and Thinking Machines Lab. Defeating Nondeterminism in LLM Infer- ence. https : / / thinkingmachines.ai / blog / defeating - nondeterminism - in - llm - inference/, September 2025. Accessed: 2026-01-19

  19. [19]

    Solving Reproducibility Challenges in Deep Learning and LLMs: Our Journey

    Ingonyama. Solving Reproducibility Challenges in Deep Learning and LLMs: Our Journey. https://www.ingonyama.com/post/solving- reproducibility- challenges- in- deep- learning-and-llms-our-journey, September 2024

  20. [20]

    Nvidia Says Its 6-Year Old A100 Chips Are Running At Full Tilt, Counter- ing Michael Burry’s Depreciation Warning

    Rounak Jain. Nvidia Says Its 6-Year Old A100 Chips Are Running At Full Tilt, Counter- ing Michael Burry’s Depreciation Warning. https://stocktwits.com/news- articles/ markets / equity / nvidia - a100 - gpu - shipped - 6 - years - ago - full - utilization - michael-burry-depreciation/cLPALD9RE9w, 2026

  21. [21]

    What disease does this patient have? a large-scale open domain question answering dataset from medical exams.arXiv preprint arXiv:2009.13081, 2020

    Di Jin, Eileen Pan, Nassim Oufattole, Wei-Hung Weng, Hanyi Fang, and Peter Szolovits. What disease does this patient have? a large-scale open domain question answering dataset from medical exams.arXiv preprint arXiv:2009.13081, 2020

  22. [22]

    What disease does this patient have? a large-scale open domain question answering dataset from medical exams.Applied Sciences, 11(14):6421, 2021

    Di Jin, Eileen Pan, Nassim Oufattole, Wei-Hung Weng, Hanyi Fang, and Peter Szolovits. What disease does this patient have? a large-scale open domain question answering dataset from medical exams.Applied Sciences, 11(14):6421, 2021

  23. [23]

    Pubmedqa: A dataset for biomedical research question answering

    Qiao Jin, Bhuwan Dhingra, Zhengping Liu, William Cohen, and Xinghua Lu. Pubmedqa: A dataset for biomedical research question answering. InProceedings of the 2019 Conference on Empirical Methods in Natural Language Processing and the 9th International Joint Conference on Natural Language Processing (EMNLP-IJCNLP), pages 2567–2577, 2019

  24. [24]

    Efficient Memory Management for Large Language Model Serving with PagedAttention

    Woosuk Kwon, Zhuohan Li, Siyuan Zhuang, Ying Sheng, Lianmin Zheng, Cody Hao Yu, Joseph Gonzalez, Hao Zhang, and Ion Stoica. Efficient Memory Management for Large Language Model Serving with PagedAttention. InProceedings of the 29th symposium on operating systems principles, pages 611–626, 2023

  25. [25]

    Awq: Activation-aware weight quantization for on-device llm compression and acceleration.Proceedings of machine learning and systems, 6:87–100, 2024

    Ji Lin, Jiaming Tang, Haotian Tang, Shang Yang, Wei-Ming Chen, Wei-Chen Wang, Guangxuan Xiao, Xingyu Dang, Chuang Gan, and Song Han. Awq: Activation-aware weight quantization for on-device llm compression and acceleration.Proceedings of machine learning and systems, 6:87–100, 2024

  26. [26]

    Good debt or bad debt: Detecting semantic orientations in economic texts, 2013

    Pekka Malo, Ankur Sinha, Pyry Takala, Pekka Korhonen, and Jyrki Wallenius. Good debt or bad debt: Detecting semantic orientations in economic texts, 2013. URL https://arxiv.org/ abs/1307.5336

  27. [27]

    AMD and OpenAI announce strategic partnership to deploy 6 gigawatts of AMD GPUs.https://openai.com/index/openai-amd-strategic-partnership/, 2025

    OpenAI. AMD and OpenAI announce strategic partnership to deploy 6 gigawatts of AMD GPUs.https://openai.com/index/openai-amd-strategic-partnership/, 2025

  28. [28]

    OpenAI and Amazon announce strategic partnership

    OpenAI. OpenAI and Amazon announce strategic partnership. https://openai.com/index/ amazon-partnership/, 2026

  29. [29]

    Pytorch: An imperative style, high-performance deep learning library.Advances in neural information processing systems, 32, 2019

    Adam Paszke, Sam Gross, Francisco Massa, Adam Lerer, James Bradbury, Gregory Chanan, Trevor Killeen, Zeming Lin, Natalia Gimelshein, Luca Antiga, et al. Pytorch: An imperative style, high-performance deep learning library.Advances in neural information processing systems, 32, 2019

  30. [30]

    Defeating the training-inference mismatch via fp16.arXiv preprint arXiv:2510.26788, 2025

    Penghui Qi, Zichen Liu, Xiangxin Zhou, Tianyu Pang, Chao Du, Wee Sun Lee, and Min Lin. Defeating the training-inference mismatch via fp16.arXiv preprint arXiv:2510.26788, 2025

  31. [31]

    The Anatomy of a Triton Attention Kernel, 2025

    Burkhard Ringlein, Jan van Lunteren, Radu Stoica, and Thomas Parnell. The Anatomy of a Triton Attention Kernel, 2025. URLhttps://arxiv.org/abs/2511.11581

  32. [32]

    When temperature=0 isn’t zero: A Bedrock Determinism Bug in LangChain AWS

    Jérémie Sanfaçon. When temperature=0 isn’t zero: A Bedrock Determinism Bug in LangChain AWS. https://www.coveo.com/blog/bedrock-determinism-bug-in-langchain-aws/ , 2026. 11

  33. [33]

    Estimates of the regression coefficient based on kendall’s tau.Journal of the American statistical association, 63(324):1379–1389, 1968

    Pranab Kumar Sen. Estimates of the regression coefficient based on kendall’s tau.Journal of the American statistical association, 63(324):1379–1389, 1968

  34. [34]

    FlashAttention-3: Fast and Accurate Attention with Asynchrony and Low-precision

    Jay Shah, Ganesh Bikshandi, Ying Zhang, Vijay Thakkar, Pradeep Ramani, and Tri Dao. FlashAttention-3: Fast and Accurate Attention with Asynchrony and Low-precision. InPro- ceedings of the 38th International Conference on Neural Information Processing Systems, pages 68658–68685, 2024

  35. [35]

    Finclaimradar: A dataset for evidentiary reasoning and classification of finan- cial claims, 2025

    sksayan01. Finclaimradar: A dataset for evidentiary reasoning and classification of finan- cial claims, 2025. URL https://huggingface.co/datasets/sksayan01/FinClaimRadar- Financial-Reasoning

  36. [36]

    Massive activations in large language models.arXiv preprint arXiv:2402.17762, 2024

    Mingjie Sun, Xinlei Chen, J Zico Kolter, and Zhuang Liu. Massive activations in large language models.arXiv preprint arXiv:2402.17762, 2024

  37. [37]

    Towards deterministic inference in sglang and reproducible rl training

    The SGLang Team. Towards deterministic inference in sglang and reproducible rl training. https://lmsys.org/blog/2025-09-22-sglang-deterministic/, 2025

  38. [38]

    never retired

    Charlotte Trueman. AWS has “never retired” an Nvidia A100 server, CEO Matt Garman claims. https://www.datacenterdynamics.com/en/news/aws-has-never-retired-an-nvidia- a100-server-ceo-matt-garman-claims/, 2026

  39. [39]

    [bug]: Gemma 4 (31b/26b-a4b) vision outputs only <pad> under fp16 — vision_tower standardize overflows

    wenqiangire-commits. [bug]: Gemma 4 (31b/26b-a4b) vision outputs only <pad> under fp16 — vision_tower standardize overflows. GitHub Issue #40290, 2026. URL https: //github.com/vllm-project/vllm/issues/40290

  40. [40]

    A learning algorithm for continually running fully recurrent neural networks.Neural computation, 1(2):270–280, 1989

    Ronald J Williams and David Zipser. A learning algorithm for continually running fully recurrent neural networks.Neural computation, 1(2):270–280, 1989

  41. [41]

    Smoothquant: Accurate and Efficient Post-Training Quantization for Large Language Models

    Guangxuan Xiao, Ji Lin, Mickael Seznec, Hao Wu, Julien Demouth, and Song Han. Smoothquant: Accurate and Efficient Post-Training Quantization for Large Language Models. InInternational conference on machine learning, pages 38087–38099. PMLR, 2023

  42. [42]

    Qwen3 Technical Report.arXiv preprint arXiv:2505.09388, 2025

    An Yang, Anfeng Li, Baosong Yang, Beichen Zhang, Binyuan Hui, Bo Zheng, Bowen Yu, Chang Gao, Chengen Huang, Chenxu Lv, et al. Qwen3 Technical Report.arXiv preprint arXiv:2505.09388, 2025

  43. [43]

    Understanding and mitigating numerical sources of nondeterminism in llm inference

    Jiayi Yuan, Hao Li, Xinheng Ding, Wenya Xie, Yu-Jhe Li, Wentian Zhao, Kun Wan, Jing Shi, Xia Hu, and Zirui Liu. Understanding and mitigating numerical sources of nondeterminism in llm inference. InThe Thirty-ninth Annual Conference on Neural Information Processing Systems, 2025

  44. [44]

    Deterministic Inference across Tensor Parallel Sizes That Eliminates Training-Inference Mismatch, 2025

    Ziyang Zhang, Xinheng Ding, Jiayi Yuan, Rixin Liu, Huizi Mao, Jiarong Xing, and Zirui Liu. Deterministic Inference across Tensor Parallel Sizes That Eliminates Training-Inference Mismatch, 2025. URLhttps://arxiv.org/abs/2511.17826

  45. [45]

    When does pretraining help? assessing self-supervised learning for law and the casehold dataset of 53,000+ legal holdings

    Lucia Zheng, Neel Guha, Brandon R Anderson, Peter Henderson, and Daniel E Ho. When does pretraining help? assessing self-supervised learning for law and the casehold dataset of 53,000+ legal holdings. InProceedings of the eighteenth international conference on artificial intelligence and law, pages 159–168, 2021

  46. [46]

    You are solving a US medical licensing multiple-choice question. Think step by step, then end with exactly one final line formatted as ‘Final Answer: X’ where X is A, B, C, or D

    Lucia Zheng, Neel Guha, Javokhir Arifov, Sarah Zhang, Michal Skreta, Christopher D Man- ning, Peter Henderson, and Daniel E Ho. A reasoning-focused legal retrieval benchmark. In Proceedings of the 2025 Symposium on Computer Science and Law, pages 169–193, 2025. 12 A Detailed Experimental Setup Setup.We conducted our experiments on top of vLLM [ 23], a com...