The Silent Hyperparameter: Quantifying the Impact of Inference Backends on LLM Reproducibility
Pith reviewed 2026-05-21 08:08 UTC · model grok-4.3
The pith
The choice of inference backend can shift LLM benchmark scores by up to 16.6 percentage points even with fixed model weights and hardware.
A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.
Core claim
Holding model weights, decoding parameters, and hardware constant, the choice of inference backend alone can shift benchmark scores by up to 16.6 percentage points and induce high rates of output disagreement. The divergence arises from system-level optimizations such as prefix caching, CUDA graphs, custom kernels, and engine-specific defaults in logit processing.
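One way such optimizations can change outputs is by altering the order of floating-point accumulation. The sketch below is illustrative only and is not taken from the paper: it shows that reordering a sum changes its result, and that a tiny difference between two near-tied logits changes the token picked by greedy decoding. The numbers and the helper name `greedy_token` are assumptions made for the example.

```python
# Illustrative only: toy numbers, not measurements from the paper.

# Floating-point addition is not associative, so kernels that
# accumulate in a different order can produce different sums.
left_to_right = (0.1 + 0.2) + 0.3
right_to_left = 0.1 + (0.2 + 0.3)
order_matters = left_to_right != right_to_left


def greedy_token(logits):
    """Return the index of the largest logit (greedy decoding)."""
    return max(range(len(logits)), key=lambda i: logits[i])


# Two near-tied logits: a very small perturbation is enough to
# change which token greedy decoding selects.
logits_a = [1.0, 2.5000001, 2.5]
logits_b = [1.0, 2.5, 2.5000001]
token_a = greedy_token(logits_a)
token_b = greedy_token(logits_b)
```

Once a single token differs, every later token is conditioned on a different prefix, which is how a small numerical difference can grow into divergent generations.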
What carries the argument
The inference backend: the software layer that executes a trained model at inference time, applying optimizations such as prefix caching and custom CUDA kernels.
If this is right
- Benchmark comparisons between models become unreliable if the papers or evaluations used different inference engines.
- Small reported improvements in scores may disappear or reverse when the same model is evaluated on another backend.
- Reproducibility of published LLM results requires explicit documentation of the full inference stack.
- Standardized reporting of inference engines would make cross-paper benchmark claims more interpretable.
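As a rough illustration of what standardized reporting could look like, the sketch below records an inference stack as a small, serializable structure. The class name `InferenceStackReport`, its fields, and the placeholder values are all hypothetical; the paper does not prescribe this schema.

```python
import json
from dataclasses import dataclass, asdict, field


@dataclass
class InferenceStackReport:
    """Hypothetical record of the inference stack behind one evaluation run."""
    engine: str           # e.g. "vLLM", "SGLang", "llama.cpp"
    engine_version: str   # exact release or commit used
    precision: str        # numeric precision, e.g. "bf16"
    hardware: str         # accelerator model
    decoding: dict = field(default_factory=dict)       # temperature, top_p, ...
    optimizations: dict = field(default_factory=dict)  # prefix caching, CUDA graphs, ...


# Placeholder values for illustration only.
report = InferenceStackReport(
    engine="example-engine",
    engine_version="0.0.0",
    precision="bf16",
    hardware="example-gpu",
    decoding={"temperature": 0.0, "top_p": 1.0},
    optimizations={"prefix_caching": False, "cuda_graphs": False},
)

# Sorted keys make two reports easy to diff line by line.
report_json = json.dumps(asdict(report), indent=2, sort_keys=True)
```

Publishing a record like this alongside benchmark scores would let readers check whether two results were produced on comparable stacks.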
Where Pith is reading between the lines
- Benchmark protocols could designate a reference backend to reduce hidden variance across studies.
- The same backend sensitivity may affect non-benchmark uses such as production serving or fine-tuning loops.
- Extending measurements to additional engines or closed-source models would test how general the effect is.
Load-bearing premise
The observed benchmark differences and output disagreements are caused solely by the inference backends once model weights, decoding parameters, and hardware are held constant.
What would settle it
A controlled run of the same models and prompts on two backends with all caching and graph optimizations disabled, identical floating-point precision enforced, and logit processing matched exactly, showing whether score gaps and disagreements vanish.
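The two quantities such a run would compare can be computed as in the sketch below. It assumes exact string match for both disagreement and scoring, which is a simplification; the function names and toy data are hypothetical and do not come from the paper.

```python
def disagreement_rate(outputs_a, outputs_b):
    """Fraction of prompts on which two backends produced different text."""
    if len(outputs_a) != len(outputs_b):
        raise ValueError("both backends must be run on the same prompts")
    if not outputs_a:
        raise ValueError("need at least one prompt")
    differing = sum(a != b for a, b in zip(outputs_a, outputs_b))
    return differing / len(outputs_a)


def accuracy(outputs, references):
    """Exact-match accuracy in percent."""
    correct = sum(o == r for o, r in zip(outputs, references))
    return 100.0 * correct / len(references)


# Toy data standing in for real generations (illustrative only).
references = ["4", "9", "16", "25"]
backend_a = ["4", "9", "16", "25"]
backend_b = ["4", "9", "15", "24"]

rate = disagreement_rate(backend_a, backend_b)
gap_pp = accuracy(backend_a, references) - accuracy(backend_b, references)
```

If both `rate` and `gap_pp` drop to zero once caching, graph capture, precision, and logit processing are matched, the backend-only attribution is supported; a residual gap would point to engine internals that the public configuration cannot control.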
Original abstract
Progress in LLMs is increasingly measured through standardized benchmarks, where state-of-the-art improvements are often separated by fractions of a percentage point. At the same time, the computational cost of evaluating modern LLMs has driven widespread adoption of specialized inference backends, software systems that execute trained models efficiently at inference time. While critical for scalability, system-level optimizations, such as custom CUDA kernels and reduced-precision arithmetic, can alter token probabilities and introduce non-determinism, possibly cascading into divergent generation. In this work, we first survey the inference landscape, identifying 200 distinct engines, and analyze 35,000 ML publications, finding that the specific inference stack is rarely reported despite this widespread diversity. We then present a systematic empirical study of how inference backends affect LLM benchmark results. Holding model weights, decoding parameters, and hardware constant, we evaluate five widely used inference engines, including vLLM, SGLang, and llama$.$cpp, across multiple open-weight models and established benchmarks. We show that the choice of backend alone can shift benchmark scores by up to 16.6 percentage points and induce high rates of output disagreement. By isolating backend optimizations and tracing the execution pipeline, we find this divergence is driven by system-level optimizations like prefix caching and CUDA graphs, custom kernels, and engine-specific defaults in logit processing. Our findings identify the inference backend as a previously unreported but consequential hyperparameter in the evaluation of LLM and advocate standardized reporting of inference stacks to improve the reproducibility and interpretability of benchmark comparisons.
Editorial analysis
A structured set of objections, weighed in public.
Referee Report
Summary. The manuscript surveys the inference engine landscape, identifying 200 distinct engines and finding that inference stack details are rarely reported in an analysis of 35,000 ML publications. It then presents a controlled empirical study comparing five widely used engines (including vLLM, SGLang, and llama.cpp) across open-weight models and benchmarks. Holding model weights, decoding parameters, and hardware fixed, the study reports that backend choice alone shifts benchmark scores by up to 16.6 percentage points and produces high rates of output disagreement. The divergence is traced to system-level optimizations such as prefix caching, CUDA graphs, custom kernels, and engine-specific defaults in logit processing. The authors conclude that the inference backend is a consequential unreported hyperparameter and advocate standardized reporting to improve reproducibility.
Significance. If the controlled comparisons hold, the work identifies a previously overlooked source of non-reproducibility in LLM evaluations. The combination of a broad survey of publication practices with targeted, multi-model empirical measurements provides concrete evidence that small implementation differences at the inference layer can produce benchmark shifts larger than many claimed state-of-the-art gains. This has direct implications for how the community designs, reports, and interprets standardized evaluations.
major comments (1)
- The central empirical claim rests on isolating backend effects while holding decoding parameters and logit processing identical across engines. The abstract explicitly lists 'engine-specific defaults in logit processing' as one driver of divergence. Without explicit confirmation that every engine received the exact same numerical configuration (temperature, top-p, repetition penalty, logit bias handling) via a common interface and that no hidden per-engine transformations occurred, the attribution of the 16.6 pp shifts solely to optimizations such as prefix caching or CUDA graphs remains incomplete. The methodology section should provide the precise configuration commands or code used for each backend to verify synchronization.
minor comments (2)
- Abstract: 'llama$.$cpp' is a typesetting artifact and should read 'llama.cpp'.
- Abstract, final sentence: 'evaluation of LLM and advocate' should read 'evaluation of LLMs and advocate' for grammatical consistency.
Simulated Authors' Rebuttal
We thank the referee for the constructive feedback, which highlights an important aspect of methodological transparency. We have revised the manuscript to strengthen the description of our experimental controls and provide the requested configuration details.
Referee: The central empirical claim rests on isolating backend effects while holding decoding parameters and logit processing identical across engines. The abstract explicitly lists 'engine-specific defaults in logit processing' as one driver of divergence. Without explicit confirmation that every engine received the exact same numerical configuration (temperature, top-p, repetition penalty, logit bias handling) via a common interface and that no hidden per-engine transformations occurred, the attribution of the 16.6 pp shifts solely to optimizations such as prefix caching or CUDA graphs remains incomplete. The methodology section should provide the precise configuration commands or code used for each backend to verify synchronization.
Authors: We agree that explicit documentation of the configuration interface is necessary to fully substantiate the isolation of backend effects. In the original experiments, we employed a common Python interface (built on the Hugging Face transformers generation config where possible, with engine-specific adapters) to enforce identical values: temperature=0.0, top_p=1.0, top_k=0, repetition_penalty=1.0, and no logit bias. Engine-specific logit processing defaults were explicitly disabled or overridden where the API permitted (e.g., via do_sample=False and explicit logit processor lists). However, certain engines apply internal transformations (such as implicit normalization or custom softmax implementations) that cannot be fully disabled through the public API. To address the referee's concern, we have added a new subsection (Section 4.2) containing the exact configuration code snippets and command-line flags used for vLLM, SGLang, llama.cpp, and the other engines. This revision clarifies which parameters were synchronized and which residual differences arise from unavoidable engine internals, thereby reinforcing the attribution of the observed shifts to the listed optimizations while acknowledging the role of logit-processing defaults.
Revision: yes
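The kind of synchronization check the referee asks for can be sketched as a comparison between one shared configuration and the settings each engine actually ran with. The parameter values below are the ones stated in the rebuttal; the function `config_mismatches`, the dictionary layout, and the example engine are hypothetical and are not the authors' code.

```python
# Shared decoding settings, using the values stated in the rebuttal.
SHARED_DECODING = {
    "temperature": 0.0,
    "top_p": 1.0,
    "top_k": 0,
    "repetition_penalty": 1.0,
}


def config_mismatches(effective, shared=SHARED_DECODING):
    """Return {param: (expected, actual)} for every setting that differs.

    `effective` is the configuration an engine actually ran with, after
    its own defaults were applied. A missing key is reported as None.
    """
    return {
        key: (expected, effective.get(key))
        for key, expected in shared.items()
        if effective.get(key) != expected
    }


# Hypothetical engine whose default repetition penalty was not overridden.
engine_effective = {
    "temperature": 0.0,
    "top_p": 1.0,
    "top_k": 0,
    "repetition_penalty": 1.1,
}
mismatches = config_mismatches(engine_effective)
```

Engines may encode "disabled" differently for the same parameter, so a real adapter would likely need a per-engine translation step before this comparison. The check also cannot see internal transformations that an engine does not expose, which is the residual the authors acknowledge.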
Circularity Check
No circularity: direct empirical measurements with no derivations or self-referential fits
The paper conducts a survey of 200 inference engines and a controlled empirical comparison of five backends on benchmark scores while holding model weights, decoding parameters, and hardware fixed. No equations, fitted parameters, or mathematical derivations appear in the provided text or abstract. The central claim rests on observed output differences and disagreement rates rather than any chain that reduces a prediction to its own inputs by construction. Self-citations are not invoked as load-bearing uniqueness theorems or ansatzes. The study is self-contained as a measurement exercise.
Axiom & Free-Parameter Ledger
axioms (1)
- Domain assumption: Benchmark scores reflect model performance when all factors except the inference backend are held constant.
Lean theorems connected to this paper
- IndisputableMonolith/Foundation/RealityFromDistinction.lean: reality_from_one_distinction (tag: unclear)
  Relation between the paper passage and the cited Recognition theorem: unclear.
  Paper passage: "Holding model weights, decoding parameters, and hardware constant, the choice of backend alone can shift benchmark scores by up to 16.6 percentage points"
What do these tags mean?
- matches: The paper's claim is directly supported by a theorem in the formal canon.
- supports: The theorem supports part of the paper's argument, but the paper may add assumptions or extra steps.
- extends: The paper goes beyond the formal theorem; the theorem is a base layer rather than the whole result.
- uses: The paper appears to rely on the theorem as machinery.
- contradicts: The paper's claim conflicts with a theorem or certificate in the canon.
- unclear: Pith found a possible connection, but the passage is too broad, indirect, or ambiguous to say the theorem truly supports the claim.
Reference graph
Works this paper leans on
- [1] Woosuk Kwon, Zhuohan Li, Siyuan Zhuang, Ying Sheng, Lianmin Zheng, Cody Hao Yu, Joseph Gonzalez, Hao Zhang, and Ion Stoica. Efficient memory management for large language model serving with pagedattention. In Proceedings of the Symposium on Operating Systems Principles, 2023.
- [2] Lianmin Zheng, Liangsheng Yin, Zhiqiang Xie, Chuyue Sun, Jeff Huang, Cody Hao Yu, Shiyi Cao, Christos Kozyrakis, Ion Stoica, Joseph E. Gonzalez, Clark Barrett, and Ying Sheng. SGLang: efficient execution of structured language model programs. In Advances in Neural Information Processing Systems (NeurIPS), 2024.
- [3] ggml org. llama.cpp. https://github.com/ggml-org/llama.cpp, 2023.
- [4] Thomas Wolf et al. Transformers: State-of-the-art natural language processing. In Proceedings of the 2020 Conference on Empirical Methods in Natural Language Processing: System Demonstrations. Association for Computational Linguistics (ACL), 2020.
- [5] Amirhossein Razavi, Mina Soltangheis, Negar Arabzadeh, Sara Salamat, Morteza Zihayat, and Ebrahim Bagheri. Benchmarking prompt sensitivity in large language models. In Advances in Information Retrieval: European Conference on Information Retrieval (ECIR), 2025.
- [6] Eldar Kurtic, Alexandre Noll Marques, Shubhra Pandit, Mark Kurtz, and Dan Alistarh. “give me BF16 or give me death”? accuracy-performance trade-offs in LLM quantization. In Association for Computational Linguistics (ACL), 2025.
- [7] Chufan Shi, Haoran Yang, Deng Cai, Zhisong Zhang, Yifan Wang, Yujiu Yang, and Wai Lam. A thorough examination of decoding methods in the era of LLMs. In Conference on Empirical Methods in Natural Language Processing (EMNLP), 2024.
- [8] Jiayi Yuan, Hao Li, Xinheng Ding, Wenya Xie, Yu-Jhe Li, Wentian Zhao, Kun Wan, Jing Shi, Xia Hu, and Zirui Liu. Understanding and mitigating numerical sources of nondeterminism in LLM inference. In The Thirty-ninth Annual Conference on Neural Information Processing Systems, 2025.
- [9] Lessons from the Trenches on Reproducible Evaluation of Language Models (arXiv, 2024)
  Stella Biderman, Hailey Schoelkopf, Lintang Sutawika, Leo Gao, Jonathan Tow, Baber Abbasi, Alham Fikri Aji, Pawan Sasanka Ammanamanchi, Sid Black, Jordan Clive, Anthony DiPofi, Julen Etxaniz, Benjamin Fattori, Jessica Zosa Forde, Charles Foster, Mimansa Jaiswal, Wilson Y. Lee, Haonan Li, Charles Lovering, Niklas Muennighoff, Ellie Pavlick, Jason Phang, A...
- [10] Jonathan Evertz, Niklas Risse, Nicolai Neuer, Andreas Müller, Philipp Normann, Gaetano Sapia, Srishti Gupta, David Pape, Soumya Shaw, Devansh Srivastav, Christian Wressnegger, Erwin Quiring, Thorsten Eisenhofer, Daniel Arp, and Lea Schönherr. Chasing shadows: Pitfalls in llm security research. In Symposium on Network and Distributed System Security (NDSS), 2026.
- [11] Krishna Teja Chitty-Venkata, Siddhisanket Raskar, Bharat Kale, Farah Ferdaus, Aditya Tanikanti, Ken Raffenetti, Valerie Taylor, Murali Emani, and Venkatram Vishwanath. Llm-inference-bench: Inference benchmarking of large language models on ai accelerators. In Workshops of the International Conference for High Performance Computing, Networking, Storage an...
- [12] Jinhao Li, Jiaming Xu, Shan Huang, Yonghua Chen, Wen Li, Jun Liu, Yaoxiu Lian, Jiayi Pan, Li Ding, Hao Zhou, et al. Large language model inference acceleration: A comprehensive hardware perspective. arXiv preprint arXiv:2410.04466, 2024.
- [13] Sihyeong Park, Sungryeol Jeon, Chaelyn Lee, Seokhun Jeon, Byung-Soo Kim, and Jemin Lee. A survey on inference engines for large language models: Perspectives on optimization and efficiency. arXiv preprint arXiv:2505.01658, 2025.
- [14] Xupeng Miao, Gabriele Oliaro, Zhihao Zhang, Xinhao Cheng, Hongyi Jin, Tianqi Chen, and Zhihao Jia. Towards efficient generative large language model serving: A survey from algorithms to systems. ACM Comput. Surv., 2025.
- [15] Bin Xu, Ayan Banerjee, and Sandeep Gupta. Hardware acceleration for neural networks: A comprehensive survey. arXiv preprint arXiv:2512.23914, 2026.
- [16] Yifei Wang, Tianlin Li, Xiaohan Zhang, Xiaoyu Zhang, Wei Ma, Mingfei Cheng, and Li Pan. Hidden reliability risks in large language models: Systematic identification of precision-induced output disagreements. arXiv preprint arXiv:2604.19790, 2026.
- [17] Lukas Heumos, Philipp Ehmele, Luis Kuhn Cuellar, Kevin Menden, Edmund Miller, Steffen Lemke, Gisela Gabernet, and Sven Nahnsen. mlf-core: a framework for deterministic machine learning. Bioinformatics, 2023.
- [18]
- [19] LMDeploy Contributors. Lmdeploy: A toolkit for compressing, deploying, and serving llm. https://github.com/InternLM/lmdeploy, 2023.
- [20] LMDeploy Accelerates Mixed-Precision LLM Inference with TurboMind
  Li Zhang, Youhe Jiang, Guoliang He, Xin Chen, Han Lv, Qian Yao, Fangcheng Fu, and Kai Chen. Efficient mixed-precision large language model inference with turbomind. arXiv preprint arXiv:2508.15601, 2025.
- [21]
- [22] Aaron Grattafiori et al. The llama 3 herd of models. arXiv preprint arXiv:2407.21783, 2024.
- [23] An Yang, Anfeng Li, Baosong Yang, Beichen Zhang, Binyuan Hui, Bo Zheng, Bowen Yu, Chang Gao, Chengen Huang, Chenxu Lv, Chujie Zheng, Dayiheng Liu, Fan Zhou, Fei Huang, Feng Hu, Hao Ge, Haoran Wei, Huan Lin, Jialong Tang, Jian Yang, Jianhong Tu, Jianwei Zhang, Jianxin Yang, Jiaxi Yang, Jing Zhou, Jingren Zhou, Junyang Lin, Kai Dang, Keqin Bao, Kexin Yang, ... (arXiv, 2025)
- [24] Daya Guo et al. Deepseek-r1 incentivizes reasoning in llms through reinforcement learning. Nature 645, 2025.
- [25] Karl Cobbe, Vineet Kosaraju, Mo Bavarian, Mark Chen, Heewoo Jun, Lukasz Kaiser, Matthias Plappert, Jerry Tworek, Jacob Hilton, Reiichiro Nakano, Christopher Hesse, and John Schulman. Training verifiers to solve math word problems. arXiv preprint arXiv:2110.14168, 2021.
- [26] David Rein, Betty Li Hou, Asa Cooper Stickland, Jackson Petty, Richard Yuanzhe Pang, Julien Dirani, Julian Michael, and Samuel R. Bowman. GPQA: A graduate-level google-proof q&a benchmark. In Conference on Language Modeling, 2024.
- [27] Lukas Haas, Gal Yona, Giovanni D’Antonio, Sasha Goldshtein, and Dipanjan Das. Simpleqa verified: A reliable factuality benchmark to measure parametric knowledge. arXiv preprint arXiv:2509.07968, 2025.
- [28] Naman Jain, King Han, Alex Gu, Wen-Ding Li, Fanjia Yan, Tianjun Zhang, Sida Wang, Armando Solar-Lezama, Koushik Sen, and Ion Stoica. Livecodebench: Holistic and contamination free evaluation of large language models for code. In The Thirteenth International Conference on Learning Representations, 2025.
- [29] Horace He. Defeating nondeterminism in llm inference, 2025. URL https://thinkingmachines.ai/blog/defeating-nondeterminism-in-llm-inference/
- [30] Patrick Chao, Edoardo Debenedetti, Alexander Robey, Maksym Andriushchenko, Francesco Croce, Vikash Sehwag, Edgar Dobriban, Nicolas Flammarion, George J. Pappas, Florian Tramèr, Hamed Hassani, and Eric Wong. Jailbreakbench: An open robustness benchmark for jailbreaking large language models. In NeurIPS Workshop Datasets and Benchmarks Track, 2024.
- [31] AIDXteam. Qwen3-235b-a22b-instruct-2507-awq, 2025. URL https://huggingface.co/AIDXteam/Qwen3-235B-A22B-Instruct-2507-AWQ
- [32] OpenAI. Gpt-4o-mini. https://developers.openai.com/api/docs/models/gpt-4o-mini, 2024.
  Appendix A: Backend Versions
  Table 2 details the specific versions of the inference backends and reference libraries utilized throughout all controlled experiments in Section 5. We enforced these fixed versions across all evaluation runs to ensure that any observed numerical...
- [33] Systematic Engine Defaults: These are correctable, engine-specific configurations applied prior to generation. As shown in Table 7, hidden prompt mutations (such as forceful BOS token injection) and hidden default repetition penalties fundamentally alter the prompt structure and token distributions. Correcting these defaults yields massive performance reco...
- [34] Optimization-Induced Numerical Variance: Even after aligning all generation parameters and prompt templates, subtle numerical drift persists due to the underlying mathematical execution. Features essential for high-throughput serving, such as Prefix Caching, CUDA Graphs, and custom kernels for greedy decoding, alter floating-point accumulation. While these...
- [35] Keyword Pre-Filtering: We first applied a heuristic pre-filter, since running an LLM judge over the full corpus was computationally expensive. We scanned the extracted raw text (using the pymupdf Python library) of all PDFs for specific terms related to open-weight models and local execution (see Section E.2). Only papers containing at least one of these keywor...
- [36] Structured Output: To allow for automated parsing of the LLM’s decisions, the judge was strictly prompted to return responses in a valid JSON format. This allowed our evaluation scripts to programmatically route papers through the subsequent Code Extraction and Engine Extraction stages based on the boolean flags generated during the Relevance Filtering stage. E...
- [37] **Multimodal Inputs (Vision/Audio)**:
  * **The Paper uses Images, Video, or Audio as input.**
  * **VLMs are EXCLUDED**, even if they use a Llama/Qwen backbone.
  * **Diffusion/Generative image models are excluded**
  * *Excluded Models*: LLaVA, Qwen-VL, GPT-4V, Phi-Vision, CLIP, MiniCPM-V, BakLLaVA, Yi-VL.
  * *Reasoning*: The inference stack for VLMs involves vi...
- [38] **Non-Generative Architectures**:
  * **Topic Models / Clustering**: Papers focusing on extracting topics (LDA, BERTopic, Autoencoders) without autoregressive generation.
  * **Embeddings Only**: Papers that only use the model to generate vector embeddings (hidden states) for retrieval/search, without decoding text.
  * **Encoder-Only / Autoencoders**: BERT, Ro...
- [39] **Purely Proprietary/Black-Box**: The paper ONLY uses closed-source models without comparing them to local models.
  * *Exclusion List*: GPT-3.5, GPT-4, GPT-4o, o1, GPT-5, OpenAI, Claude (Sonnet/Opus/Haiku), Gemini (Pro/Ultra), PaLM, Grok (proprietary versions) etc.
  * *Exception*: If the paper compares GPT-4 vs. Llama 2, it is RELEVANT
- [40] **Secondary Analysis of Pre-Generated Data (PASSIVE USAGE)**:
  * **CRITICAL EXCLUSION**: If the authors use an *existing dataset* (e.g., a corpus, a benchmark, or human-eval data) where the text was generated by LLMs in a *previous study*, this paper is **IRRELEVANT**.
  * *Example of Exclusion*: "We analyze the *EMTeC corpus* (Smith et al.), which contains text generated by Llama-2...
- [41] **Task = Autoregressive Text Generation**:
  * The model must receive **Text** as input and generate **Text/Code** (or logits for text tokens) as output.
  * The mechanism must be next-token prediction (Transformer Decoder)
- [42] **Model = Open-Weights / Local**:
  * The authors must utilize models where weights are publicly available or can be hosted locally.
  * *Examples*: Llama (1, 2, 3), Mistral, Mixtral, Qwen (Text-only), DeepSeek (Text-only), Gemma, Phi, Yi, Falcon, OPT, Dolphin, Kimi, Vicuna, Alpaca, Pythia, BLOOM, OLMo, Solar, StarCoder
- [43] **Action = Running Inference**:
  * The authors must **actively execute** the model themselves during the course of the study.
  * This includes:
    * Running the model to generate *new* responses.
    * Running the model to calculate perplexity/logits on a dataset.
    * Running the model to benchmark speed/latency.
  * *Note*: Papers that Fine-Tune (SFT/RLHF/GRPO etc.) ...
- [44]
- [45] **Model Family Inheritance**: Use the model’s name to infer its nature.
  - If a model is unknown to you (e.g., "Llama-4" or "Mistral-Next") but shares a name with a known open-source family (Llama, Mistral, Qwen, etc.), **assume it is open-source**.
  - Conversely, if it shares a name with a proprietary family (e.g., "GPT-5", "Claude-Next"), assume it is excluded
- [46] **Inference Engine Agnosticism**:
  - **Do not look for specific engine names** (like vLLM, llama.cpp, SGLang) to determine relevance.
  - Many authors fail to report their backend. If the paper *uses* a relevant model (e.g., Llama 2) for inference, it is **RELEVANT**, regardless of whether they mention the software stack used to run it
- [47] **Non-Exclusive Examples**: The inclusion/exclusion model lists provided above are **representative samples**, not exhaustive lists. If a paper uses a model not listed (e.g., "MiniCPM" or "XVerse"), use your judgment: if it is an open-weights generative transformer, include it
- [48] **Knowledge Cutoff & New Models**: You may encounter models released after your training data cutoff. **Do not hallucinate**. Instead, look for context clues in the text to classify them.
  - *Clues for Relevance*: "weights released," "available on GitHub/HuggingFace," "reproduced locally," "7B parameters."
  - *Clues for Exclusion*: "proprietary model," "ima...
- [49] **Indirect Citations (Reference Lookup)**: If the authors refer to a model only by citation (e.g., "We utilize the model proposed by Touvron et al. [15]" or "the model from [1]"), you **MUST** look at the References/Bibliography section at the end of the text to identify the model. If citation [15] is the "Llama 2" paper, then the paper is RELEVANT
- [50]
- [51] **Dataset Origin vs. Experimentation (The "Created By" Check)**:
  - Pay close attention to grammar. If the text says: *"We use Data X (Author, Year), which was created using Model Y"*, the paper is **NOT RELEVANT** (unless they *also* run Model Y separately).
  - If the text says: *"We used Model Y to create Data X"*, the paper is **RELEVANT**.
  ---
  ### INPUT...
- [52] **Self-Hosted Libraries**: Software running on the user’s hardware (e.g., `vLLM`, `llama.cpp`, `SGLang`, `HuggingFace Transformers`, `TGI`, `LMDeploy`, `TensorRT-LLM`)
- [53] **Managed Inference Platforms**: APIs serving open-weight models (e.g., `Together AI`, `Fireworks AI`, `RunPod Serverless`)
- [54] We generated responses using **vLLM**
  **Aggregators**: Routers that sit in front of providers (e.g., `OpenRouter`, `LiteLLM`).
  ### 3. KNOWN ENGINE LIST (Reference Only)
  Use this list to help identify potential candidates, but **do not limit yourself to it**. Context matters more than the list.
  <known_engines> {known_engines_list} </known_engines>
  ### 4. CRITICAL LOGIC: ACTIVE EXECUTION vs. PA...
- [55] **Do Not Over-Normalize**: Many libraries have similar names. Do not merge them unless they are aliases.
  * *Example:* If the text says `FastTransformer`, do NOT map it to `transformers`. Report `FastTransformer`.
  * *Rule:* Only map generic terms like "HuggingFace", "HF", or "AutoModel" to `transformers`. If a specific, distinct library name is used (even ...
- [56] **Unknown/New Libraries**: The authors may use a library not in your known list or one released after your knowledge cutoff.
  * *Rule:* If the text explicitly states a software tool was used for inference/execution, **extract it**, even if you have never heard of it. Trust the text.
  ### 6. ROBUSTNESS & NORMALIZATION
  * **OCR Artifacts**: Fix broken text. `v...
- [57] **Abstract**: specifically the very last sentence
- [58] **Introduction**: specifically in the "Contributions" list or the final paragraph
- [59] **Footnotes**: Look for text like "See footnote 1" or "[1]" near the mention of code
- [60] **Methodology header**: Sometimes listed as "Implementation Details"
- [61] **Conclusion**: A section named "Reproducibility" or "Data Availability".
- [62] **References/Bibliography**: Rarely, authors cite their own code as a bibliography entry (e.g., "Source Code [25]").
  ### CRITICAL DECISION LOGIC
  **1. Verification of Ownership (The "Author" Check)**
  You must distinguish between **Own Work** and **Prior Work**.
  * **RELEVANT (True)**: "We release our code at...", "The official implementation is available at...
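The normalization rule quoted in entry [55] (map only generic terms such as "HuggingFace", "HF", or "AutoModel" to `transformers`, and keep distinct names such as `FastTransformer` unchanged) can be sketched as follows. This is an illustrative reading of the prompt text, not the authors' extraction code; `GENERIC_ALIASES` and `normalize_engine` are hypothetical names.

```python
# Generic aliases named in the extraction prompt (entry [55]).
GENERIC_ALIASES = {"huggingface", "hf", "automodel"}


def normalize_engine(name):
    """Map generic Hugging Face terms to 'transformers'; keep others as-is."""
    cleaned = name.strip()
    if cleaned.lower() in GENERIC_ALIASES:
        return "transformers"
    # Distinct library names are reported unchanged, even if they look similar.
    return cleaned


normalized = [
    normalize_engine(n)
    for n in ["HF", "AutoModel", "FastTransformer", " vLLM "]
]
```

Matching on the whole cleaned name rather than on substrings is what keeps `FastTransformer` from being merged into `transformers`.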