
arxiv: 2605.19537 · v2 · pith:3U5PB7E2new · submitted 2026-05-19 · 💻 cs.LG

The Silent Hyperparameter: Quantifying the Impact of Inference Backends on LLM Reproducibility

Pith reviewed 2026-05-21 08:08 UTC · model grok-4.3

classification 💻 cs.LG
keywords LLM inference, reproducibility, benchmarks, inference engines, hyperparameters, output disagreement, benchmark scores, inference backends

The pith

The choice of inference backend can shift LLM benchmark scores by up to 16.6 percentage points even with fixed model weights and hardware.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

This paper shows that the software used to run an LLM at inference time functions as a silent variable in evaluation. When model weights, decoding settings, and hardware remain unchanged, switching among engines such as vLLM, SGLang, and llama.cpp produces benchmark score differences as large as 16.6 percentage points together with frequent disagreements in the actual text generated. The root causes trace to engine-specific choices in prefix caching, CUDA graph capture, custom kernels, and logit handling. A review of 35,000 papers reveals that the inference stack is almost never documented despite the existence of roughly 200 distinct engines. If these effects are real, many reported gains measured in fractions of a point rest on an uncontrolled factor that current practices leave invisible.

Core claim

Holding model weights, decoding parameters, and hardware constant, the choice of inference backend alone can shift benchmark scores by up to 16.6 percentage points and induce high rates of output disagreement. The divergence arises from system-level optimizations such as prefix caching, CUDA graphs, custom kernels, and engine-specific defaults in logit processing.
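One way a kernel or graph-capture change can alter generated text even under greedy decoding is through floating-point reduction order. The toy sketch below is not taken from the paper: the two "kernels" and their logits are hypothetical, and the only fact it relies on is that floating-point addition is not associative.

```python
def argmax(values):
    """Index of the largest value; ties resolve to the lowest index."""
    return max(range(len(values)), key=lambda i: values[i])

# The same three terms summed in two different orders give two different doubles.
left_to_right = (0.1 + 0.2) + 0.3
right_to_left = 0.1 + (0.2 + 0.3)

# Hypothetical logits for two candidate tokens. Token 0 has a fixed logit of 0.6;
# token 1's logit comes from a reduction whose order depends on the kernel.
logits_kernel_a = [0.6, left_to_right]
logits_kernel_b = [0.6, right_to_left]

print("sums equal:", left_to_right == right_to_left)
print("greedy token, kernel A:", argmax(logits_kernel_a))
print("greedy token, kernel B:", argmax(logits_kernel_b))
```

Once a single greedy token flips, every later decoding step is conditioned on a different prefix, which is how a last-bit difference can cascade into a different answer.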

What carries the argument

The inference backend: the software layer that executes a trained model at inference time and applies optimizations such as prefix caching and custom CUDA kernels.

If this is right

  • Benchmark comparisons between models become unreliable when the underlying papers or evaluations used different inference engines.
  • Small reported improvements in scores may disappear or reverse when the same model is evaluated on another backend.
  • Reproducibility of published LLM results requires explicit documentation of the full inference stack.
  • Standardized reporting of inference engines would make cross-paper benchmark claims more interpretable.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

  • Benchmark protocols could designate a reference backend to reduce hidden variance across studies.
  • The same backend sensitivity may affect non-benchmark uses such as production serving or fine-tuning loops.
  • Extending measurements to additional engines or closed-source models would test how general the effect is.

Load-bearing premise

The observed benchmark differences and output disagreements are caused solely by the inference backends once model weights, decoding parameters, and hardware are held constant.

What would settle it

A controlled run of the same models and prompts on two backends with all caching and graph optimizations disabled, identical floating-point precision enforced, and logit processing matched exactly, showing whether score gaps and disagreements vanish.
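To make the two quantities in this test concrete, here is a minimal sketch of a score gap in percentage points and an output disagreement rate for two backends. The answer lists and the names `backend_x` and `backend_y` are made up for illustration; the paper's own scoring harness is not shown on this page.

```python
def accuracy_pct(predictions, gold):
    """Benchmark accuracy as a percentage of exactly matching answers."""
    correct = sum(p == g for p, g in zip(predictions, gold))
    return 100.0 * correct / len(gold)

def disagreement_rate(preds_a, preds_b):
    """Fraction of inputs on which two backends give different answers."""
    differing = sum(a != b for a, b in zip(preds_a, preds_b))
    return differing / len(preds_a)

# Hypothetical multiple-choice answers for five prompts.
gold      = ["A", "C", "B", "D", "A"]
backend_x = ["A", "C", "B", "B", "A"]
backend_y = ["A", "B", "C", "D", "A"]

gap_pp = accuracy_pct(backend_x, gold) - accuracy_pct(backend_y, gold)
rate = disagreement_rate(backend_x, backend_y)

print(f"score gap: {gap_pp:.1f} percentage points")
print(f"disagreement rate: {rate:.2f}")
```

The two numbers measure different things: backends can disagree on many individual prompts while their aggregate scores stay close, so a vanishing score gap alone would not show that the outputs match.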

Figures

Figures reproduced from arXiv: 2605.19537 by David Pape, Jonathan Evertz, Lea Schönherr.

Figure 1: Landscape of Inference Engines. (a) Distribution of 200 surveyed inference engines across the three categories, colored to distinguish between open-source and proprietary systems. (b) The distribution of primary programming languages used across the open-source engines.
    Adjacent body text (Section 3.1, Survey Methodology and Scope): We define an inference engine as standalone software capable of loading a transformer-based model and gen…

Figure 2: Prevalence of Reproducibility Artifacts in ML Research. A breakdown of 9,018 relevant papers categorized by their reproducibility tier.
    Adjacent body text: … named the backend. This subset is dominated by transformers (322; 39%) and vLLM (150; 18%), followed by custom PyTorch implementations (98; 12%). We extended this analysis to the 460 papers with empty or deleted repositories, and found a similar trend: only 180 (39.1%)…

Figure 3: Output Disagreement Rates. The frequency with which each backend’s prediction differs from the transformers reference implementation for the same input. Higher values indicate a larger disagreement between the two backends.
    Adjacent body text (Section B.2, Length Error): Beyond final accuracy, we analyze structural deviations in the generated responses by measuring Output Length consistency (…

Figure 4: Analysis of Output Length Consistency against the transformers Reference. This scatter plot visualizes the deviation in generation length for various backends. The X-axis (Bias) represents the average Signed Difference, where negative values indicate the backend generated fewer tokens than the reference (shorter), and positive values indicate more tokens (longer). The Y-axis (Magnitude) represents the aver…

Figure 5: Token-Level Divergence Analysis (GPQA). Normalized divergence scores relative to the transformers reference. Larger values indicate high similarity (divergence happens late), while smaller values indicate early divergence. The labels above the bars indicate the average token position at which the generation first differs from the reference sequence.

Figure 6: Token-Level Divergence Analysis (GSM8K). Normalized divergence scores relative to the transformers reference. Larger values indicate high similarity (divergence happens late), while smaller values indicate early divergence. The labels above the bars indicate the average token position at which the generation first differs from the reference sequence.

Figure 7: Token-Level Divergence Analysis (SimpleQA). Normalized divergence scores relative to the transformers reference. Larger values indicate high similarity (divergence happens late), while smaller values indicate early divergence. The labels above the bars indicate the average token position at which the generation first differs from the reference sequence.

Figure 8: Token-Level Divergence Analysis (LiveCodeBench). Normalized divergence scores relative to the transformers reference. Larger values indicate high similarity (divergence happens late), while smaller values indicate early divergence. The labels above the bars indicate the average token position at which the generation first differs from the reference sequence.

Figure 9: Numerical Precision (LogProb RMSE). Root Mean Squared Error (RMSE) of the top-1 token log-probabilities compared to the transformers reference. In an ideal, deterministic setting, we expect an RMSE of exactly 0.0, indicating identical confidence in token selection. Higher values demonstrate numerical drift caused by the backend. This indicates that the underlying probability distribution is shifting, which…

Figure 10: Distribution Stability (Top-5 Jaccard Similarity). The overlap of the top-5 most probable tokens between the backend and the transformers reference. We expect a similarity score of 1.0, meaning the set of the top-5 token candidates is perfectly identical across both implementations. Lower values indicate that the backend’s numerical deviations fundamentally alter the model’s candidate pool, bringing entir…
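The captions for Figures 5 to 10 describe three per-sample comparisons against the transformers reference: the first token position at which a generation diverges, the RMSE of top-1 log-probabilities, and the Jaccard overlap of the top-5 candidate tokens. The sketch below shows one plausible way to compute each. The function names, token IDs, and log-probabilities are invented for illustration, and the normalization applied to the divergence scores in Figures 5 to 8 is not specified on this page, so it is not reproduced here.

```python
import math

def first_divergence(reference_ids, backend_ids):
    """Index of the first differing token, or None if the sequences are identical."""
    for pos, (ref_tok, be_tok) in enumerate(zip(reference_ids, backend_ids)):
        if ref_tok != be_tok:
            return pos
    if len(reference_ids) != len(backend_ids):
        # One sequence is a strict prefix of the other.
        return min(len(reference_ids), len(backend_ids))
    return None

def logprob_rmse(reference_logprobs, backend_logprobs):
    """Root mean squared error between aligned top-1 log-probabilities."""
    squared = [(r - b) ** 2 for r, b in zip(reference_logprobs, backend_logprobs)]
    return math.sqrt(sum(squared) / len(squared))

def top_k_jaccard(reference_top_k, backend_top_k):
    """Jaccard similarity of two candidate-token sets (1.0 means identical sets)."""
    ref, be = set(reference_top_k), set(backend_top_k)
    return len(ref & be) / len(ref | be)

# Hypothetical data for one prompt.
print(first_divergence([5, 9, 2, 7], [5, 9, 4, 7]))
print(logprob_rmse([-0.10, -0.50, -1.20], [-0.10, -0.52, -1.15]))
print(top_k_jaccard([1, 2, 3, 4, 5], [1, 2, 3, 4, 6]))
```

An RMSE above zero combined with a Jaccard score of 1.0 would mean the probabilities drifted without changing the candidate set; a Jaccard score below 1.0 means different tokens entered the top-5 pool.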
Abstract

Progress in LLMs is increasingly measured through standardized benchmarks, where state-of-the-art improvements are often separated by fractions of a percentage point. At the same time, the computational cost of evaluating modern LLMs has driven widespread adoption of specialized inference backends, software systems that execute trained models efficiently at inference time. While critical for scalability, system-level optimizations, such as custom CUDA kernels and reduced-precision arithmetic, can alter token probabilities and introduce non-determinism, possibly cascading into divergent generation. In this work, we first survey the inference landscape, identifying 200 distinct engines, and analyze 35,000 ML publications, finding that the specific inference stack is rarely reported despite this widespread diversity. We then present a systematic empirical study of how inference backends affect LLM benchmark results. Holding model weights, decoding parameters, and hardware constant, we evaluate five widely used inference engines, including vLLM, SGLang, and llama$.$cpp, across multiple open-weight models and established benchmarks. We show that the choice of backend alone can shift benchmark scores by up to 16.6 percentage points and induce high rates of output disagreement. By isolating backend optimizations and tracing the execution pipeline, we find this divergence is driven by system-level optimizations like prefix caching and CUDA graphs, custom kernels, and engine-specific defaults in logit processing. Our findings identify the inference backend as a previously unreported but consequential hyperparameter in the evaluation of LLM and advocate standardized reporting of inference stacks to improve the reproducibility and interpretability of benchmark comparisons.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Referee Report

1 major / 2 minor

Summary. The manuscript surveys the inference engine landscape, identifying 200 distinct engines and finding that inference stack details are rarely reported in an analysis of 35,000 ML publications. It then presents a controlled empirical study comparing five widely used engines (including vLLM, SGLang, and llama.cpp) across open-weight models and benchmarks. Holding model weights, decoding parameters, and hardware fixed, the study reports that backend choice alone shifts benchmark scores by up to 16.6 percentage points and produces high rates of output disagreement. The divergence is traced to system-level optimizations such as prefix caching, CUDA graphs, custom kernels, and engine-specific defaults in logit processing. The authors conclude that the inference backend is a consequential unreported hyperparameter and advocate standardized reporting to improve reproducibility.

Significance. If the controlled comparisons hold, the work identifies a previously overlooked source of non-reproducibility in LLM evaluations. The combination of a broad survey of publication practices with targeted, multi-model empirical measurements provides concrete evidence that small implementation differences at the inference layer can produce benchmark shifts larger than many claimed state-of-the-art gains. This has direct implications for how the community designs, reports, and interprets standardized evaluations.

major comments (1)
  1. The central empirical claim rests on isolating backend effects while holding decoding parameters and logit processing identical across engines. The abstract explicitly lists 'engine-specific defaults in logit processing' as one driver of divergence. Without explicit confirmation that every engine received the exact same numerical configuration (temperature, top-p, repetition penalty, logit bias handling) via a common interface and that no hidden per-engine transformations occurred, the attribution of the 16.6 pp shifts solely to optimizations such as prefix caching or CUDA graphs remains incomplete. The methodology section should provide the precise configuration commands or code used for each backend to verify synchronization.
minor comments (2)
  1. Abstract: 'llama$.$cpp' is a typesetting artifact and should read 'llama.cpp'.
  2. Abstract, final sentence: 'evaluation of LLM' should read 'evaluation of LLMs' for grammatical consistency.

Simulated Author's Rebuttal

1 response · 0 unresolved

We thank the referee for the constructive feedback, which highlights an important aspect of methodological transparency. We have revised the manuscript to strengthen the description of our experimental controls and provide the requested configuration details.

Point-by-point responses
  1. Referee: The central empirical claim rests on isolating backend effects while holding decoding parameters and logit processing identical across engines. The abstract explicitly lists 'engine-specific defaults in logit processing' as one driver of divergence. Without explicit confirmation that every engine received the exact same numerical configuration (temperature, top-p, repetition penalty, logit bias handling) via a common interface and that no hidden per-engine transformations occurred, the attribution of the 16.6 pp shifts solely to optimizations such as prefix caching or CUDA graphs remains incomplete. The methodology section should provide the precise configuration commands or code used for each backend to verify synchronization.

    Authors: We agree that explicit documentation of the configuration interface is necessary to fully substantiate the isolation of backend effects. In the original experiments, we employed a common Python interface (built on the Hugging Face transformers generation config where possible, with engine-specific adapters) to enforce identical values: temperature=0.0, top_p=1.0, top_k=0, repetition_penalty=1.0, and no logit bias. Engine-specific logit processing defaults were explicitly disabled or overridden where the API permitted (e.g., via do_sample=False and explicit logit processor lists). However, certain engines apply internal transformations (such as implicit normalization or custom softmax implementations) that cannot be fully disabled through the public API. To address the referee's concern, we have added a new subsection (Section 4.2) containing the exact configuration code snippets and command-line flags used for vLLM, SGLang, llama.cpp, and the other engines. This revision clarifies which parameters were synchronized and which residual differences arise from unavoidable engine internals, thereby reinforcing the attribution of the observed shifts to the listed optimizations while acknowledging the role of logit-processing defaults. revision: yes
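The synchronization described in this exchange can be checked mechanically. The sketch below is a hypothetical illustration, not the authors' code: `DecodingConfig`, `config_mismatches`, the engine names, and the reported values are all assumptions. It takes the decoding values named in the rebuttal as the reference and compares them with the effective settings each engine reports, so that a hidden default such as a non-neutral repetition penalty surfaces before any benchmark is run.

```python
from dataclasses import dataclass, asdict

@dataclass(frozen=True)
class DecodingConfig:
    """Reference decoding values, as listed in the rebuttal."""
    temperature: float = 0.0
    top_p: float = 1.0
    top_k: int = 0
    repetition_penalty: float = 1.0

def config_mismatches(reference, effective_by_engine):
    """Map each engine to {parameter: (expected, actual)} for every deviation."""
    expected = asdict(reference)
    mismatches = {}
    for engine, effective in effective_by_engine.items():
        diffs = {
            name: (value, effective.get(name))
            for name, value in expected.items()
            if effective.get(name) != value
        }
        if diffs:
            mismatches[engine] = diffs
    return mismatches

# Hypothetical effective settings, as if read back from two engines after setup.
effective = {
    "engine_a": {"temperature": 0.0, "top_p": 1.0, "top_k": 0, "repetition_penalty": 1.0},
    "engine_b": {"temperature": 0.0, "top_p": 1.0, "top_k": 0, "repetition_penalty": 1.1},
}

report = config_mismatches(DecodingConfig(), effective)
print(report)
```

A check like this only covers parameters an engine exposes. Internal transformations that cannot be read back through a public API, which the rebuttal acknowledges, would still need to be documented separately.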

Circularity Check

0 steps flagged

No circularity: direct empirical measurements with no derivations or self-referential fits

full rationale

The paper conducts a survey of 200 inference engines and a controlled empirical comparison of five backends on benchmark scores while holding model weights, decoding parameters, and hardware fixed. No equations, fitted parameters, or mathematical derivations appear in the provided text or abstract. The central claim rests on observed output differences and disagreement rates rather than any chain that reduces a prediction to its own inputs by construction. Self-citations are not invoked as load-bearing uniqueness theorems or ansatzes. The study is self-contained as a measurement exercise.

Axiom & Free-Parameter Ledger

0 free parameters · 1 axiom · 0 invented entities

The paper introduces no new free parameters, invented entities, or non-standard axioms; it relies on standard empirical methods for comparing existing inference systems under controlled conditions.

axioms (1)
  • domain assumption: Benchmark scores reflect model performance when all factors except the inference backend are held constant.
    The study design assumes differences in results are due to backend variations after controlling for model weights, decoding parameters, and hardware.

pith-pipeline@v0.9.0 · 5813 in / 1273 out tokens · 46765 ms · 2026-05-21T08:08:56.424168+00:00 · methodology



Reference graph

Works this paper leans on

62 extracted references · 62 canonical work pages · 6 internal anchors

  1. Woosuk Kwon, Zhuohan Li, Siyuan Zhuang, Ying Sheng, Lianmin Zheng, Cody Hao Yu, Joseph Gonzalez, Hao Zhang, and Ion Stoica. Efficient memory management for large language model serving with pagedattention. In Proceedings of the Symposium on Operating Systems Principles, 2023.

  2. Lianmin Zheng, Liangsheng Yin, Zhiqiang Xie, Chuyue Sun, Jeff Huang, Cody Hao Yu, Shiyi Cao, Christos Kozyrakis, Ion Stoica, Joseph E. Gonzalez, Clark Barrett, and Ying Sheng. SGLang: efficient execution of structured language model programs. In Advances in Neural Information Processing Systems (NeurIPS), 2024.

  3. ggml org. llama.cpp. https://github.com/ggml-org/llama.cpp, 2023.

  4. Thomas Wolf et al. Transformers: State-of-the-art natural language processing. In Proceedings of the 2020 Conference on Empirical Methods in Natural Language Processing: System Demonstrations. Association for Computational Linguistics (ACL), 2020.

  5. Amirhossein Razavi, Mina Soltangheis, Negar Arabzadeh, Sara Salamat, Morteza Zihayat, and Ebrahim Bagheri. Benchmarking prompt sensitivity in large language models. In Advances in Information Retrieval: European Conference on Information Retrieval (ECIR), 2025.

  6. Eldar Kurtic, Alexandre Noll Marques, Shubhra Pandit, Mark Kurtz, and Dan Alistarh. “give me BF16 or give me death”? accuracy-performance trade-offs in LLM quantization. In Association for Computational Linguistics (ACL), 2025.

  7. Chufan Shi, Haoran Yang, Deng Cai, Zhisong Zhang, Yifan Wang, Yujiu Yang, and Wai Lam. A thorough examination of decoding methods in the era of LLMs. In Conference on Empirical Methods in Natural Language Processing (EMNLP), 2024.

  8. Jiayi Yuan, Hao Li, Xinheng Ding, Wenya Xie, Yu-Jhe Li, Wentian Zhao, Kun Wan, Jing Shi, Xia Hu, and Zirui Liu. Understanding and mitigating numerical sources of nondeterminism in LLM inference. In The Thirty-ninth Annual Conference on Neural Information Processing Systems, 2025.

  9. Stella Biderman, Hailey Schoelkopf, Lintang Sutawika, Leo Gao, Jonathan Tow, Baber Abbasi, Alham Fikri Aji, Pawan Sasanka Ammanamanchi, Sid Black, Jordan Clive, Anthony DiPofi, Julen Etxaniz, Benjamin Fattori, Jessica Zosa Forde, Charles Foster, Mimansa Jaiswal, Wilson Y. Lee, Haonan Li, Charles Lovering, Niklas Muennighoff, Ellie Pavlick, Jason Phang, A... Lessons from the Trenches on Reproducible Evaluation of Language Models.

  10. Jonathan Evertz, Niklas Risse, Nicolai Neuer, Andreas Müller, Philipp Normann, Gaetano Sapia, Srishti Gupta, David Pape, Soumya Shaw, Devansh Srivastav, Christian Wressnegger, Erwin Quiring, Thorsten Eisenhofer, Daniel Arp, and Lea Schönherr. Chasing shadows: Pitfalls in llm security research. In Symposium on Network and Distributed System Security (NDSS), 2026.

  11. Krishna Teja Chitty-Venkata, Siddhisanket Raskar, Bharat Kale, Farah Ferdaus, Aditya Tanikanti, Ken Raffenetti, Valerie Taylor, Murali Emani, and Venkatram Vishwanath. Llm-inference-bench: Inference benchmarking of large language models on ai accelerators. In Workshops of the International Conference for High Performance Computing, Networking, Storage an...

  12. Jinhao Li, Jiaming Xu, Shan Huang, Yonghua Chen, Wen Li, Jun Liu, Yaoxiu Lian, Jiayi Pan, Li Ding, Hao Zhou, et al. Large language model inference acceleration: A comprehensive hardware perspective. arXiv preprint arXiv:2410.04466, 2024.

  13. Sihyeong Park, Sungryeol Jeon, Chaelyn Lee, Seokhun Jeon, Byung-Soo Kim, and Jemin Lee. A survey on inference engines for large language models: Perspectives on optimization and efficiency. arXiv preprint arXiv:2505.01658, 2025.

  14. Xupeng Miao, Gabriele Oliaro, Zhihao Zhang, Xinhao Cheng, Hongyi Jin, Tianqi Chen, and Zhihao Jia. Towards efficient generative large language model serving: A survey from algorithms to systems. ACM Comput. Surv., 2025.

  15. Bin Xu, Ayan Banerjee, and Sandeep Gupta. Hardware acceleration for neural networks: A comprehensive survey. arXiv preprint arXiv:2512.23914, 2026.

  16. Yifei Wang, Tianlin Li, Xiaohan Zhang, Xiaoyu Zhang, Wei Ma, Mingfei Cheng, and Li Pan. Hidden reliability risks in large language models: Systematic identification of precision-induced output disagreements. arXiv preprint arXiv:2604.19790, 2026.

  17. Lukas Heumos, Philipp Ehmele, Luis Kuhn Cuellar, Kevin Menden, Edmund Miller, Steffen Lemke, Gisela Gabernet, and Sven Nahnsen. mlf-core: a framework for deterministic machine learning. Bioinformatics, 2023.

  18. Robert E Blackwell, Jon Barry, and Anthony G. Cohn. Towards reproducible llm evaluation: Quantifying uncertainty in llm benchmark scores. arXiv preprint arXiv:2410.03492, 2024.

  19. LMDeploy Contributors. Lmdeploy: A toolkit for compressing, deploying, and serving llm. https://github.com/InternLM/lmdeploy, 2023.

  20. Li Zhang, Youhe Jiang, Guoliang He, Xin Chen, Han Lv, Qian Yao, Fangcheng Fu, and Kai Chen. Efficient mixed-precision large language model inference with turbomind. arXiv preprint arXiv:2508.15601, 2025.

  21. Ollama. ollama. https://ollama.com/, 2023.

  22. Aaron Grattafiori et al. The llama 3 herd of models. arXiv preprint arXiv:2407.21783, 2024.

  23. An Yang, Anfeng Li, Baosong Yang, Beichen Zhang, Binyuan Hui, Bo Zheng, Bowen Yu, Chang Gao, Chengen Huang, Chenxu Lv, Chujie Zheng, Dayiheng Liu, Fan Zhou, Fei Huang, Feng Hu, Hao Ge, Haoran Wei, Huan Lin, Jialong Tang, Jian Yang, Jianhong Tu, Jianwei Zhang, Jianxin Yang, Jiaxi Yang, Jing Zhou, Jingren Zhou, Junyang Lin, Kai Dang, Keqin Bao, Kexin Yang, ... Qwen3 Technical Report.

  24. Daya Guo et al. Deepseek-r1 incentivizes reasoning in llms through reinforcement learning. Nature 645, 2025.

  25. Karl Cobbe, Vineet Kosaraju, Mo Bavarian, Mark Chen, Heewoo Jun, Lukasz Kaiser, Matthias Plappert, Jerry Tworek, Jacob Hilton, Reiichiro Nakano, Christopher Hesse, and John Schulman. Training verifiers to solve math word problems. arXiv preprint arXiv:2110.14168, 2021.

  26. David Rein, Betty Li Hou, Asa Cooper Stickland, Jackson Petty, Richard Yuanzhe Pang, Julien Dirani, Julian Michael, and Samuel R. Bowman. GPQA: A graduate-level google-proof q&a benchmark. In Conference on Language Modeling, 2024.

  27. Lukas Haas, Gal Yona, Giovanni D’Antonio, Sasha Goldshtein, and Dipanjan Das. Simpleqa verified: A reliable factuality benchmark to measure parametric knowledge. arXiv preprint arXiv:2509.07968, 2025.

  28. Naman Jain, King Han, Alex Gu, Wen-Ding Li, Fanjia Yan, Tianjun Zhang, Sida Wang, Armando Solar-Lezama, Koushik Sen, and Ion Stoica. Livecodebench: Holistic and contamination free evaluation of large language models for code. In The Thirteenth International Conference on Learning Representations, 2025.

  29. Horace He. Defeating nondeterminism in llm inference, 2025. URL https://thinkingmachines.ai/blog/defeating-nondeterminism-in-llm-inference/

  30. Patrick Chao, Edoardo Debenedetti, Alexander Robey, Maksym Andriushchenko, Francesco Croce, Vikash Sehwag, Edgar Dobriban, Nicolas Flammarion, George J. Pappas, Florian Tramèr, Hamed Hassani, and Eric Wong. Jailbreakbench: An open robustness benchmark for jailbreaking large language models. In NeurIPS Workshop Datasets and Benchmarks Track, 2024.

  31. AIDXteam. Qwen3-235b-a22b-instruct-2507-awq, 2025. URL https://huggingface.co/AIDXteam/Qwen3-235B-A22B-Instruct-2507-AWQ

  32. OpenAI. Gpt-4o-mini. https://developers.openai.com/api/docs/models/gpt-4o-mini, 2024.

    A Backend Versions. Table 2 details the specific versions of the inference backends and reference libraries utilized throughout all controlled experiments in Section 5. We enforced these fixed versions across all evaluation runs to ensure that any observed numerical...

  33. Systematic Engine Defaults: These are correctable, engine-specific configurations applied prior to generation. As shown in Table 7, hidden prompt mutations (such as forceful BOS token injection) and hidden default repetition penalties fundamentally alter the prompt structure and token distributions. Correcting these defaults yields massive performance reco...

  34. Optimization-Induced Numerical Variance: Even after aligning all generation parameters and prompt templates, subtle numerical drift persists due to the underlying mathematical execution. Features essential for high-throughput serving, such as Prefix Caching, CUDA Graphs, and custom kernels for greedy decoding, alter floating-point accumulation. While these...

  35. [35]

    We scanned the extracted raw text (usingpymupdf Python library) of all PDFs for specific terms related to open-weight models and local execution (see Section E.2)

    Keyword Pre-Filtering:We first applied a heuristic pre-filter, since running an LLM judge over the full corpus was computationally expensive. We scanned the extracted raw text (usingpymupdf Python library) of all PDFs for specific terms related to open-weight models and local execution (see Section E.2). Only papers containing at least one of these keywor...

  36. [36]

    large language model

    Structured Output:To allow for automated parsing of the LLM’s decisions, the judge was strictly prompted to return responses in a valid JSON format. This allowed our evaluation scripts to programmatically route papers through the subsequentCode ExtractionandEngine Extraction stages based on the boolean flags generated during theRelevance Filteringstage. E...

  37. [37]

    * **Diffusion/Generative image models are excluded** * *Excluded Models*: LLaVA, Qwen-VL, GPT-4V, Phi-Vision, CLIP, MiniCPM-V, BakLLaVA, Yi-VL

    **Multimodal Inputs (Vision/Audio)**: * **The Paper uses Images, Video, or Audio as input.** * **VLMs are EXCLUDED**, even if they use a Llama/Qwen backbone. * **Diffusion/Generative image models are excluded** * *Excluded Models*: LLaVA, Qwen-VL, GPT-4V, Phi-Vision, CLIP, MiniCPM-V, BakLLaVA, Yi-VL. * *Reasoning*: The inference stack for VLMs involves vi...

  38. [38]

    * **Embeddings Only**: Papers that only use the model to generate vector embeddings (hidden states) for retrieval/search, without decoding text

    **Non-Generative Architectures**: * **Topic Models / Clustering**: Papers focusing on extracting topics (LDA, BERTopic, Autoencoders) without autoregressive generation. * **Embeddings Only**: Papers that only use the model to generate vector embeddings (hidden states) for retrieval/search, without decoding text. * **Encoder-Only / Autoencoders**: BERT, Ro...

  39. [39]

    * *Exclusion List*: GPT-3.5, GPT-4, GPT-4o, o1, GPT-5, OpenAI, Claude (Sonnet/Opus/Haiku), Gemini (Pro/Ultra), PaLM, Grok (proprietary versions) etc

    **Purely Proprietary/Black-Box**: The paper ONLY uses closed-source models without comparing them to local models. * *Exclusion List*: GPT-3.5, GPT-4, GPT-4o, o1, GPT-5, OpenAI, Claude (Sonnet/Opus/Haiku), Gemini (Pro/Ultra), PaLM, Grok (proprietary versions) etc. 22 * *Exception*: If the paper compares GPT-4 vs. Llama 2, it is RELEVANT

  40. [40]

    We analyze the *EMTeC corpus* (Smith et al.), which contains text generated by Llama-2

    **Secondary Analysis of Pre-Generated Data (PASSIVE USAGE)**: * **CRITICAL EXCLUSION**: If the authors use an *existing dataset* (e.g., a corpus, a benchmark, or human-eval data) where the text was generated by LLMs in a *previous study*, this paper is **IRRELEVANT**. * *Example of Exclusion*: "We analyze the *EMTeC corpus* (Smith et al.), which contains ...

  41. [41]

    * The mechanism must be next-token prediction (Transformer Decoder)

    **Task = Autoregressive Text Generation**: * The model must receive **Text** as input and generate **Text/Code** (or logits for text tokens) as output. * The mechanism must be next-token prediction (Transformer Decoder)

  42. [42]

    **Model = Open-Weights / Local**: * The authors must utilize models where weights are publicly available or can be hosted locally. * *Examples*: Llama (1, 2, 3), Mistral, Mixtral, Qwen (Text-only), DeepSeek (Text-only), Gemma, Phi, Yi, Falcon, OPT, Dolphin, Kimi, Vicuna, Alpaca, Pythia, BLOOM, OLMo, Solar, StarCoder

  43. [43]

    LLM-as-a-judge

    **Action = Running Inference**: * The authors must **actively execute** the model themselves during the course of the study. * This includes: * Running the model to generate *new* responses. * Running the model to calculate perplexity/logits on a dataset. * Running the model to benchmark speed/latency. * *Note*: Papers that Fine-Tune (SFT/RLHF/GRPO etc.) ...

  44. [44]

    **Robustness to Artifacts**: The input text is extracted from PDFs and may contain OCR errors, headers/footers, broken lines, or merged words (e.g., "Lla ma-2", "Hugging Face", "Q wen"). Look past these structural issues to understand the semantic content
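
The artifact rule above asks the reader to look past merged or split words. As a rough illustration only (not the authors' pipeline), a pre-processing step could rejoin names that PDF extraction split with stray whitespace. The name list `KNOWN_NAMES` and the function `repair_split_names` are hypothetical and introduced here purely for the sketch.

```python
import re

# Hypothetical, illustrative list of canonical names; not taken from the paper.
KNOWN_NAMES = ["Llama-2", "Qwen", "Mistral", "vLLM"]


def _gap_tolerant_pattern(name: str) -> "re.Pattern[str]":
    # Allow at most one whitespace character between adjacent characters,
    # and require that the match is not embedded inside a longer word.
    body = r"\s?".join(re.escape(ch) for ch in name)
    return re.compile(r"(?<!\w)" + body + r"(?!\w)", flags=re.IGNORECASE)


def repair_split_names(text: str, names=KNOWN_NAMES) -> str:
    """Rejoin known names that extraction broke apart (e.g. 'Lla ma-2')."""
    for name in names:
        # The replacement also restores the canonical casing of the name.
        text = _gap_tolerant_pattern(name).sub(lambda _m, n=name: n, text)
    return text


if __name__ == "__main__":
    print(repair_split_names("We ran Lla ma-2 and Q wen with v LLM."))
```

A fixed name list keeps the repair conservative: text that does not resemble a listed name is left untouched, which matters because over-eager merging can itself introduce errors.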

  45. [45]

    **Model Family Inheritance**: Use the model’s name to infer its nature. - If a model is unknown to you (e.g., "Llama-4" or "Mistral-Next") but shares a name with a known open-source family (Llama, Mistral, Qwen, etc.), **assume it is open-source**. - Conversely, if it shares a name with a proprietary family (e.g., "GPT-5", "Claude-Next"), assume it is excluded
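
The inheritance rule above amounts to a prefix lookup on the model name. The sketch below is one possible reading of it, assuming that a case-insensitive prefix match is an adequate test of "shares a name with"; the family tuples contain only the families named in the rule, and `infer_openness` is a hypothetical helper.

```python
from typing import Optional

# Families named in the rule above; real lists would be longer.
OPEN_FAMILIES = ("llama", "mistral", "qwen")
PROPRIETARY_FAMILIES = ("gpt", "claude")


def infer_openness(model_name: str) -> Optional[bool]:
    """Return True (assume open), False (assume excluded), or None (unknown)."""
    key = model_name.strip().lower()
    if key.startswith(OPEN_FAMILIES):
        return True
    if key.startswith(PROPRIETARY_FAMILIES):
        return False
    # No family match: the name alone is not evidence either way.
    return None


if __name__ == "__main__":
    for name in ["Llama-4", "Mistral-Next", "GPT-5", "Claude-Next", "XVerse"]:
        print(name, infer_openness(name))
```

Returning `None` for unmatched names leaves the decision to the other criteria instead of guessing.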

  46. [46]

    **Inference Engine Agnosticism**: - **Do not look for specific engine names** (like vLLM, llama.cpp, SGLang) to determine relevance. - Many authors fail to report their backend. If the paper *uses* a relevant model (e.g., Llama 2) for inference, it is **RELEVANT**, regardless of whether they mention the software stack used to run it

  47. [47]

    **Non-Exclusive Examples**: The inclusion/exclusion model lists provided above are **representative samples**, not exhaustive lists. If a paper uses a model not listed (e.g., "MiniCPM" or "XVerse"), use your judgment: if it is an open-weights generative transformer, include it

  48. [48]

    **Knowledge Cutoff & New Models**: You may encounter models released after your training data cutoff. **Do not hallucinate**. Instead, look for context clues in the text to classify them. - *Clues for Relevance*: "weights released," "available on GitHub/HuggingFace," "reproduced locally," "7B parameters." - *Clues for Exclusion*: "proprietary model," "ima...

  49. [49]

    **Indirect Citations (Reference Lookup)**: If the authors refer to a model only by citation (e.g., "We utilize the model proposed by Touvron et al. [15]" or "the model from [1]"), you **MUST** look at the References/Bibliography section at the end of the text to identify the model. If citation [15] is the "Llama 2" paper, then the paper is RELEVANT

  50. [50]

    **Burden of Proof (Uncertainty = Reject)**: You must find **positive evidence** of the criteria above. If the text is too vague, lacks sufficient information, or you are unsure, mark it as **"relevant": false**
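
The same "uncertainty = reject" default can be mirrored on the consuming side when the classifier's reply is parsed. The sketch below assumes the reply is a JSON object with a boolean `relevant` field, as the rule's wording suggests; the function name `parse_relevance` is hypothetical, and the full output schema is not shown in this excerpt.

```python
import json


def parse_relevance(raw_reply) -> bool:
    """Treat anything other than an explicit JSON `true` as not relevant."""
    try:
        data = json.loads(raw_reply)
    except (json.JSONDecodeError, TypeError):
        # Malformed or missing output counts as "no positive evidence".
        return False
    return isinstance(data, dict) and data.get("relevant") is True


if __name__ == "__main__":
    print(parse_relevance('{"relevant": true}'))
    print(parse_relevance("I think this paper is probably relevant."))
```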

  51. [51]

    **Dataset Origin vs. Experimentation (The "Created By" Check)**: - Pay close attention to grammar. If the text says: *"We use Data X (Author, Year), which was created using Model Y"*, the paper is **NOT RELEVANT** (unless they *also* run Model Y separately). - If the text says: *"We used Model Y to create Data X"*, the paper is **RELEVANT**. --- ### INPUT...

  52. [52]

    **Self-Hosted Libraries**: Software running on the user’s hardware (e.g., `vLLM`, `llama.cpp`, `SGLang`, `HuggingFace Transformers`, `TGI`, `LMDeploy`, `TensorRT-LLM`)

  53. [53]

    **Managed Inference Platforms**: APIs serving open-weight models (e.g., `Together AI`, `Fireworks AI`, `RunPod Serverless`)

  54. [54]

    We generated responses using **vLLM**

    **Aggregators**: Routers that sit in front of providers (e.g., `OpenRouter`, `LiteLLM`). ### 3. KNOWN ENGINE LIST (Reference Only) Use this list to help identify potential candidates, but **do not limit yourself to it**. Context matters more than the list. <known_engines> {known_engines_list} </known_engines> ### 4. CRITICAL LOGIC: ACTIVE EXECUTION vs. PA...

  55. [55]

    **Do Not Over-Normalize**: Many libraries have similar names. Do not merge them unless they are aliases. * *Example:* If the text says `FastTransformer`, do NOT map it to `transformers`. Report `FastTransformer`. * *Rule:* Only map generic terms like "HuggingFace", "HF", or "AutoModel" to `transformers`. If a specific, distinct library name is used (even ...
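
The normalization rule above can be expressed as an exact-match alias table, which is what prevents look-alike names from being merged. This is a minimal sketch, assuming the three generic terms quoted in the rule are the only aliases; `GENERIC_ALIASES` and `normalize_engine` are hypothetical names.

```python
# Generic terms that the rule above maps to `transformers`.
GENERIC_ALIASES = {"huggingface", "hf", "automodel"}


def normalize_engine(name: str) -> str:
    """Map generic HuggingFace aliases to 'transformers'; keep anything else."""
    cleaned = name.strip()
    key = cleaned.lower().replace(" ", "")
    if key in GENERIC_ALIASES:
        return "transformers"
    # Distinct library names are reported as written, even if they look similar.
    return cleaned


if __name__ == "__main__":
    for raw in ["HF", "Hugging Face", "AutoModel", "FastTransformer", "vLLM"]:
        print(raw, "->", normalize_engine(raw))
```

Matching on the whole name, not on a substring, is the design choice that keeps `FastTransformer` from collapsing into `transformers`.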

  56. [56]

    PyTorch" /

    **Unknown/New Libraries**: The authors may use a library not in your known list or one released after your knowledge cutoff. * *Rule:* If the text explicitly states a software tool was used for inference/execution, **extract it**, even if you have never heard of it. Trust the text. ### 6. ROBUSTNESS & NORMALIZATION * **OCR Artifacts**: Fix broken text. ‘v...

  57. [57]

    **Abstract**: specifically the very last sentence

  58. [58]

    **Introduction**: specifically in the "Contributions" list or the final paragraph

  59. [59]

    **Footnotes**: Look for text like "See footnote 1" or "[1]" near the mention of code

  60. [60]

    **Methodology header**: Sometimes listed as "Implementation Details"

  61. [61]

    **Conclusion**: A section named "Reproducibility" or "Data Availability".

  62. [62]

    **References/Bibliography**: Rarely, authors cite their own code as a bibliography entry (e.g., "Source Code [25]"). ### CRITICAL DECISION LOGIC **1. Verification of Ownership (The "Author" Check)** You must distinguish between **Own Work** and **Prior Work**. * **RELEVANT (True)**: "We release our code at...", "The official implementation is available at...