Unified Deployment-Aware Evaluation of Open Reasoning Language Models

Ge Wang; Md Motaleb Hossen Manik

arxiv: 2604.07035 · v2 · pith:OV6P2JTVnew · submitted 2026-04-08 · 💻 cs.CL

Pith reviewed 2026-05-21 09:38 UTC · model grok-4.3

classification 💻 cs.CL

keywords open reasoning language models · model evaluation · deployment-aware evaluation · multi-objective optimization · Pareto operating points · prompting strategies · benchmark comparison · weighted aggregate performance

The pith

Open reasoning language model evaluation should be framed as a deployment-aware multi-objective operating-point problem rather than a single-score leaderboard exercise.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper evaluates seven model configurations on four benchmarks under zero-shot, chain-of-thought, and few-shot chain-of-thought prompting, scoring every one of the resulting 84 conditions on the same 238-example subset. It records accuracy with Wilson intervals, latency, peak VRAM, weighted aggregate scores, Pareto-efficient points, prompt sensitivity, and compatibility diagnostics. Gemma-4-26B-A4B with zero-shot prompting has the highest weighted score (0.794), while Gemma-4-E4B stays close to the top at substantially lower latency and memory. Rankings shift with prompting method, and an oracle task-aware selector raises the weighted score to 0.825. These patterns lead to the claim that practical selection requires weighing multiple operating points instead of relying on a single accuracy ranking.
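
To make the Wilson intervals mentioned above concrete, here is a minimal Python sketch of the standard Wilson score interval for one condition's accuracy. The function name `wilson_interval`, the 95% level (z = 1.96), and the count of 180 correct answers are illustrative assumptions; the page does not state the paper's confidence level or any per-condition counts.

```python
import math


def wilson_interval(correct: int, n: int, z: float = 1.96) -> tuple[float, float]:
    """Wilson score interval for a binomial proportion.

    z = 1.96 corresponds to roughly 95% coverage (an assumed level).
    """
    if n <= 0:
        raise ValueError("n must be positive")
    p_hat = correct / n
    denom = 1.0 + z * z / n
    center = (p_hat + z * z / (2 * n)) / denom
    half_width = z * math.sqrt(p_hat * (1 - p_hat) / n + z * z / (4 * n * n)) / denom
    # Clamp to [0, 1] to absorb floating-point noise at the extremes.
    return max(0.0, center - half_width), min(1.0, center + half_width)


# Hypothetical count: 180 correct out of the 238 examples in one condition.
low, high = wilson_interval(180, 238)
print(f"accuracy={180 / 238:.3f}, Wilson interval=({low:.3f}, {high:.3f})")
```

With only 238 examples per condition the interval spans roughly ten percentage points, which is one reason close configurations are hard to separate on accuracy alone.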

Core claim

By enforcing identical test conditions across every model, dataset, and prompting strategy, the evaluation reveals that leading configurations sit close enough that deployment tradeoffs in latency and memory become decisive. Prompting changes relative model order rather than lifting all models equally. Benchmark complementarity creates headroom for routing, and some apparent failures trace to robustness or interface problems under the shared pipeline. The work therefore treats evaluation as the identification of practical multi-objective operating points.

What carries the argument

The shared evaluation pipeline, run uniformly on the fixed 238-example subset for every model-dataset-strategy triple, produces directly comparable accuracy, latency, memory, weighted scores, and Pareto operating points.
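
As an illustration of what a Pareto operating point means here, the following Python sketch filters a set of configurations down to those not dominated on weighted score (higher is better), latency, and peak VRAM (lower is better). The `OperatingPoint` class, the choice of exactly these three axes, and all numbers are placeholders for illustration, not values or code from the paper.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class OperatingPoint:
    name: str
    score: float      # weighted score, higher is better
    latency_s: float  # seconds per example, lower is better
    vram_gb: float    # peak VRAM in GB, lower is better


def dominates(a: OperatingPoint, b: OperatingPoint) -> bool:
    """True if a is at least as good as b on every axis and strictly better on one."""
    no_worse = (a.score >= b.score and a.latency_s <= b.latency_s
                and a.vram_gb <= b.vram_gb)
    strictly_better = (a.score > b.score or a.latency_s < b.latency_s
                       or a.vram_gb < b.vram_gb)
    return no_worse and strictly_better


def pareto_front(points: list[OperatingPoint]) -> list[OperatingPoint]:
    """Keep every point that no other point dominates."""
    return [p for p in points if not any(dominates(q, p) for q in points)]


# Placeholder measurements, not taken from the paper.
points = [
    OperatingPoint("config-A", 0.79, 4.0, 48.0),
    OperatingPoint("config-B", 0.78, 1.5, 12.0),
    OperatingPoint("config-C", 0.70, 2.0, 16.0),
    OperatingPoint("config-D", 0.60, 0.8, 6.0),
]
front = pareto_front(points)
print([p.name for p in front])
```

In this toy data config-C drops out because config-B is better on all three axes, while the most accurate and the cheapest configurations both stay on the front.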

If this is right

Prompting strategy alters model rankings rather than producing uniform shifts across all models.
Benchmark-specific strengths allow an oracle task-aware selector to reach a higher weighted score than any single configuration.
Some low scores, such as on GSM8K, reflect interface-adherence or robustness issues under the shared pipeline rather than core capability limits.
The closeness of leading configurations means deployment decisions must balance accuracy against latency and memory usage.
Reporting multiple metrics and Pareto points supplies more actionable information than accuracy-centered leaderboards alone.
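
The routing-headroom point can be sketched as follows: an oracle task-aware selector picks the best configuration separately for each benchmark, so its aggregate can never fall below that of the best single configuration. The sketch assumes the aggregate is an equal-weight mean over benchmarks (the scheme described in the simulated rebuttal on this page); the configuration names and accuracies are placeholders, not the paper's results.

```python
# Placeholder per-benchmark accuracies; NOT the paper's numbers.
accuracy = {
    "config-A": {"ARC-Challenge": 0.90, "GSM8K": 0.80, "MATH-L1-3": 0.70, "TruthfulQA-MC1": 0.60},
    "config-B": {"ARC-Challenge": 0.85, "GSM8K": 0.88, "MATH-L1-3": 0.60, "TruthfulQA-MC1": 0.65},
    "config-C": {"ARC-Challenge": 0.80, "GSM8K": 0.70, "MATH-L1-3": 0.75, "TruthfulQA-MC1": 0.70},
}


def weighted_score(per_benchmark, weights=None):
    """Weighted aggregate; defaults to equal weights (an assumption)."""
    names = list(per_benchmark)
    if weights is None:
        weights = {name: 1.0 / len(names) for name in names}
    return sum(weights[name] * per_benchmark[name] for name in names)


def best_single(acc):
    """Best configuration when one model must serve every benchmark."""
    name = max(acc, key=lambda cfg: weighted_score(acc[cfg]))
    return name, weighted_score(acc[name])


def oracle_score(acc):
    """Aggregate when the best configuration is chosen per benchmark."""
    benchmarks = next(iter(acc.values())).keys()
    per_benchmark_best = {b: max(cfg[b] for cfg in acc.values()) for b in benchmarks}
    return weighted_score(per_benchmark_best)


name, single = best_single(accuracy)
oracle = oracle_score(accuracy)
print(f"best single: {name} ({single:.4f}); oracle: {oracle:.4f}; headroom: {oracle - single:.4f}")
```

The oracle is an upper bound: a deployed router would have to infer the task type from the query and would recover only part of this headroom.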

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the authors make directly.

Adding energy or monetary cost per query as an extra axis would let the same method identify operating points for specific hardware budgets.
A production router could use the prompt-sensitivity and compatibility diagnostics to pick the best model-prompt pair for each incoming query.
Repeating the experiment with the complete benchmark sets instead of the fixed subset would test whether the current operating points are stable under broader sampling.
Publishing compatibility diagnostics alongside accuracy could reduce cases where interface mismatches are misread as capability shortfalls.

Load-bearing premise

The fixed 238-example subset and chosen benchmarks, when processed through one shared pipeline, produce representative comparisons that generalize to practical deployment without large selection or interface bias.

What would settle it

Re-running the full design on a different or larger example set that reverses the top weighted scores or changes which configurations sit on the Pareto front would show the reported operating points do not generalize.

Figures

Figures reproduced from arXiv: 2604.07035 by Ge Wang, Md Motaleb Hossen Manik.

**Figure 2.** It is notable because it identifies a single configuration family as the most reliable overall ...

**Figure 2.** Weighted accuracy versus efficiency tradeoffs across model configurations under the three prompting ...

**Figure 3.** Dataset-specific performance across models and prompting strategies. These panels also make ...

Original abstract

Open reasoning language models are often compared under mixed sample sizes, partially standardized prompts, and accuracy-centered summaries, which makes practical model selection difficult to interpret. We present a unified evaluation of seven open reasoning language model configurations across four benchmarks: ARC-Challenge, GSM8K, MATH levels 1 to 3, and TruthfulQA MC1. We test zero-shot, chain-of-thought (CoT), and few-shot CoT prompting on the same 238-example subset for every model--dataset--strategy condition, yielding a complete 7 x 4 x 3 design with 84 conditions and 19,992 evaluated examples. Beyond accuracy, we report Wilson confidence intervals, latency, peak video random access memory (VRAM), weighted aggregate performance, Pareto-efficient operating points, prompt-sensitivity metrics, and compatibility diagnostics. Gemma-4-26B-A4B with zero-shot prompting achieves the highest weighted score at 0.794. Gemma-4-E4B remains close to the top across prompting settings while using substantially lower latency and memory, making it a strong practical operating point. Bootstrap and paired-permutation analyses show that the leading configurations are close enough that deployment tradeoffs remain important. We also find that prompting strategy changes model rankings rather than shifting all models uniformly. Benchmark-specific complementarity creates routing headroom, with an oracle task-aware selector reaching a weighted score of 0.825. Compatibility diagnostics show that some apparent failures, especially Phi-4-Reasoning on GSM8K, reflect robustness and interface-adherence problems under the shared evaluation pipeline. These results support a central claim: open-model evaluation should be framed as a deployment-aware, multi-objective operating-point problem rather than as a single-score leaderboard exercise.
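
For readers unfamiliar with the paired-permutation analysis named in the abstract, here is a minimal Python sketch of one common form: a sign-flip test on per-example correctness differences between two configurations scored on the same examples. The abstract does not specify the exact variant, statistic, or number of resamples used, so the function `paired_permutation_test`, its defaults, and the synthetic correctness vectors are all assumptions.

```python
import random


def paired_permutation_test(a, b, n_resamples=10_000, seed=0):
    """Two-sided sign-flip permutation test on paired 0/1 correctness vectors.

    Returns the observed absolute accuracy difference and a Monte Carlo p-value.
    """
    if len(a) != len(b):
        raise ValueError("paired vectors must have equal length")
    diffs = [x - y for x, y in zip(a, b)]
    n = len(diffs)
    observed = abs(sum(diffs)) / n
    rng = random.Random(seed)
    hits = 0
    for _ in range(n_resamples):
        # Under the null, swapping the two configurations on any example
        # is equally likely, which flips the sign of that difference.
        total = sum(d if rng.random() < 0.5 else -d for d in diffs)
        if abs(total) / n >= observed - 1e-12:
            hits += 1
    # Add-one correction keeps the estimated p-value strictly positive.
    return observed, (hits + 1) / (n_resamples + 1)


# Synthetic example on 238 paired items (not the paper's data):
# 170 both correct, 12 only A correct, 8 only B correct, 48 both wrong.
a = [1] * 170 + [1] * 12 + [0] * 8 + [0] * 48
b = [1] * 170 + [0] * 12 + [1] * 8 + [0] * 48
diff, p_value = paired_permutation_test(a, b)
print(f"accuracy difference={diff:.4f}, p={p_value:.3f}")
```

In this synthetic case a gap of four examples out of 238 is far from significant, which mirrors the abstract's point that leading configurations can be too close to rank on accuracy alone.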

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Desk Editor's Note private letter to a colleague

The paper's controlled setup across models and prompts shows real ranking shifts and deployment tradeoffs worth considering, but the unvalidated 238-example subset undercuts how much we can trust the generalizations.

The main thing to know is that this work runs seven open models through four benchmarks with three prompting styles on one fixed 238-example slice, then layers in latency, VRAM, and Pareto points to argue that single-score leaderboards miss practical choices. The design is consistent enough to make the non-uniform ranking changes and the oracle gain from 0.794 to 0.825 look like actual signals rather than noise from mismatched test sets.

Referee Report

2 major / 1 minor

Summary. The manuscript evaluates seven open reasoning language models across four benchmarks (ARC-Challenge, GSM8K, MATH levels 1-3, TruthfulQA MC1) on a shared 238-example subset under zero-shot, CoT, and few-shot CoT prompting, forming a complete 7x4x3 design. It reports accuracy with Wilson intervals, latency, VRAM, weighted aggregates, Pareto operating points, prompt-sensitivity, and compatibility diagnostics, finding non-uniform ranking shifts with prompting, an oracle gain from 0.794 to 0.825, and interface issues (e.g., Phi-4-Reasoning on GSM8K). The central claim is that open-model evaluation should be reframed as a deployment-aware, multi-objective operating-point problem rather than as a single-score leaderboard exercise.

Significance. If the subset proves representative, the work provides concrete empirical grounding for moving beyond accuracy-centric leaderboards toward practical trade-off analysis in open LLM deployment. The full factorial design, bootstrap/paired-permutation tests, and multi-metric reporting (including efficiency and compatibility) are strengths that could inform more actionable benchmarking practices.

major comments (2)

[Experimental design] Experimental design (description of the 238-example subset): The paper fixes one 238-example slice for all 84 conditions but provides no sampling details (random, stratified by difficulty, etc.) and no correlation checks between subset accuracies and full benchmark test-set results for any model. This directly affects generalizability of the reported ranking shifts, prompt-sensitivity findings, and Pareto points, as the central claim requires these to reflect deployment realities rather than subset artifacts.
[Results and methods] Weighted aggregate performance (results and methods): The highest weighted score of 0.794 (Gemma-4-26B-A4B zero-shot) and the oracle 0.825 are central to the multi-objective argument, yet the exact weighting scheme across benchmarks or metrics is not specified. Without this, the aggregate comparisons and operating-point claims cannot be fully interpreted or reproduced.

minor comments (1)

[Abstract] Abstract: The mention of 'full diagnostic procedures' for compatibility issues could be expanded with one sentence on the shared pipeline to improve immediate clarity for readers.

Simulated Author's Rebuttal

2 responses · 0 unresolved

Thank you for reviewing our manuscript and providing these valuable comments. We appreciate the emphasis on experimental rigor and reproducibility. We respond to each major comment in turn and describe the revisions planned for the next version of the paper.

Referee: [Experimental design] Experimental design (description of the 238-example subset): The paper fixes one 238-example slice for all 84 conditions but provides no sampling details (random, stratified by difficulty, etc.) and no correlation checks between subset accuracies and full benchmark test-set results for any model. This directly affects generalizability of the reported ranking shifts, prompt-sensitivity findings, and Pareto points, as the central claim requires these to reflect deployment realities rather than subset artifacts.

Authors: We concur that documenting the subset construction and providing correlation checks would enhance the manuscript's transparency and support the generalizability of our conclusions. In the revised manuscript, we will expand the Experimental Setup section to detail the sampling process for the 238-example subset, specifying that it was randomly sampled with stratification by benchmark and difficulty where possible to ensure balanced representation. Furthermore, we will include an analysis of the correlation between subset accuracies and full benchmark results for each of the seven models. These changes will directly mitigate concerns regarding subset artifacts and strengthen the evidence for our central claim regarding deployment-aware evaluation.
Revision: yes

Referee: [Results and methods] Weighted aggregate performance (results and methods): The highest weighted score of 0.794 (Gemma-4-26B-A4B zero-shot) and the oracle 0.825 are central to the multi-objective argument, yet the exact weighting scheme across benchmarks or metrics is not specified. Without this, the aggregate comparisons and operating-point claims cannot be fully interpreted or reproduced.

Authors: We apologize for the lack of specificity regarding the weighted aggregate. The weighting scheme assigns equal weight to each of the four benchmarks, computing the aggregate as the arithmetic mean of the per-benchmark accuracies. We will revise the Methods and Results sections to explicitly describe this scheme, including the formula and normalization steps if any. This will enable full interpretation and reproduction of the 0.794 score and the oracle improvement to 0.825 without altering the underlying results.
Revision: yes
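
The subset-versus-full-benchmark check discussed in the first exchange could take the form of a rank correlation across models. The Python sketch below computes Spearman's rho with the standard library only; the choice of Spearman correlation, the function names, and the seven accuracy pairs are assumptions for illustration, since neither the referee nor the simulated authors specify the statistic and no full-benchmark results appear on this page.

```python
import math
from statistics import mean


def average_ranks(values):
    """Ranks starting at 1, with tied values sharing their average rank."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        shared = (i + j) / 2 + 1
        for k in range(i, j + 1):
            ranks[order[k]] = shared
        i = j + 1
    return ranks


def pearson(x, y):
    mx, my = mean(x), mean(y)
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    vx = sum((a - mx) ** 2 for a in x)
    vy = sum((b - my) ** 2 for b in y)
    if vx == 0 or vy == 0:
        raise ValueError("correlation is undefined for constant input")
    return cov / math.sqrt(vx * vy)


def spearman(x, y):
    """Spearman's rho: Pearson correlation of the rank vectors."""
    return pearson(average_ranks(x), average_ranks(y))


# Hypothetical accuracies for seven models (placeholders, not measured values).
subset_acc = [0.79, 0.78, 0.70, 0.66, 0.61, 0.55, 0.40]
full_acc = [0.77, 0.79, 0.71, 0.64, 0.62, 0.50, 0.42]
rho = spearman(subset_acc, full_acc)
print(f"Spearman rho between subset and full-benchmark accuracy: {rho:.3f}")
```

A rho near 1 would indicate that the subset preserves model ordering; with only seven models, a single swap among the leaders already moves the value noticeably, so it is best read alongside the per-model intervals.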

Circularity Check

0 steps flagged

No circularity: purely empirical measurements with no derivations or self-referential reductions

The paper conducts a direct experimental evaluation across 84 conditions on a shared 238-example subset, reporting observed accuracies, Wilson intervals, latency, VRAM, weighted aggregates, Pareto points, prompt-sensitivity, and compatibility diagnostics from the runs themselves. No mathematical derivations, fitted parameters renamed as predictions, or load-bearing self-citations appear in the abstract or described results; claims such as prompting-induced ranking changes and oracle complementarity follow immediately from the measured values rather than reducing to inputs by construction. The central framing of deployment-aware multi-objective evaluation is therefore an interpretation of independent experimental data, not a self-definitional or fitted-input loop.

Axiom & Free-Parameter Ledger

0 free parameters · 0 axioms · 0 invented entities

This is an empirical benchmarking study with no mathematical derivations, fitted model parameters, or new theoretical postulates; it relies on standard statistical reporting and evaluation practices.

Review history (2 revisions) →

Lean theorems connected to this paper

Citations machine-checked in the Pith Canon. Every link opens the source theorem in the public Lean library.

IndisputableMonolith/Foundation/RealityFromDistinction.lean · reality_from_one_distinction · unclear

Relation between the paper passage and the cited Recognition theorem: unclear.

Paper passage: We present a unified evaluation of seven open reasoning language model configurations across four benchmarks... yielding a complete 7 x 4 x 3 design with 84 conditions... Pareto-efficient operating points, prompt-sensitivity metrics...

IndisputableMonolith/Cost/FunctionalEquation.lean · washburn_uniqueness_aczel · unclear

Relation between the paper passage and the cited Recognition theorem: unclear.

Paper passage: Gemma-4-E4B with few-shot CoT achieved the best overall result, reaching weighted accuracy 0.675... prompt-conditioned Pareto frontier over accuracy, latency, memory footprint...

What do these tags mean?

matches: The paper's claim is directly supported by a theorem in the formal canon.
supports: The theorem supports part of the paper's argument, but the paper may add assumptions or extra steps.
extends: The paper goes beyond the formal theorem; the theorem is a base layer rather than the whole result.
uses: The paper appears to rely on the theorem as machinery.
contradicts: The paper's claim conflicts with a theorem or certificate in the canon.
unclear: Pith found a possible connection, but the passage is too broad, indirect, or ambiguous to say the theorem truly supports the claim.

Forward citations

Cited by 2 Pith papers

Reviewed papers in the Pith corpus that reference this work. Sorted by Pith novelty score.

Simple Self-Conditioning Adaptation for Masked Diffusion Models
cs.LG 2026-04 unverdicted novelty 6.0

SCMDM adapts trained masked diffusion models to condition denoising steps on their own prior clean predictions, cutting generative perplexity nearly in half on open-web text while improving discretized image, molecule...

From Natural Language to Verified Code: Toward AI Assisted Problem-to-Code Generation with Dafny-Based Formal Verification
cs.SE 2026-04 unverdicted novelty 6.0

Open-weight LLMs reach 81-91% success generating formally verified Dafny code for complex algorithmic problems when given structural signatures and self-healing verifier feedback.

Reference graph

Works this paper leans on

20 extracted references · 20 canonical work pages · cited by 2 Pith papers · 10 internal anchors

[1] Marah Abdin, Jyoti Aneja, Harkirat Behl, Sébastien Bubeck, Ronen Eldan, Suriya Gunasekar, Michael Harrison, Russell J Hewett, Mojan Javaheripi, Piero Kauffmann, et al. Phi-4 technical report. arXiv preprint arXiv:2412.08905, 2024. URL https://arxiv.org/abs/2412.08905

[2] Marah Abdin, Sahaj Agarwal, Ahmed Awadallah, Vidhisha Balachandran, Harkirat Behl, Lingjiao Chen, Gustavo de Rosa, Suriya Gunasekar, Mojan Javaheripi, Neel Joshi, et al. Phi-4-reasoning technical report. arXiv preprint arXiv:2504.21318, 2025. URL https://arxiv.org/abs/2504.21318

[3] Stella Biderman, Hailey Schoelkopf, Lintang Sutawika, Leo Gao, Jonathan Tow, Baber Abbasi, Alham Fikri Aji, Pawan Sasanka Ammanamanchi, Sidney Black, Jordan Clive, et al. Lessons from the trenches on reproducible evaluation of language models. arXiv preprint arXiv:2405.14782, 2024. URL https://arxiv.org/abs/2405.14782

[4] Weilin Cai, Juyong Jiang, Fan Wang, Jing Tang, Sunghun Kim, and Jiayi Huang. A survey on mixture of experts in large language models. IEEE Transactions on Knowledge and Data Engineering, 2025.

[5] Hyung Won Chung, Le Hou, Shayne Longpre, Barret Zoph, Yi Tay, William Fedus, Yunxuan Li, Xuezhi Wang, Mostafa Dehghani, Siddhartha Brahma, et al. Scaling instruction-finetuned language models. Journal of Machine Learning Research, 25(70):1–53, 2024. URL https://jmlr.org/papers/v25/23-0870.html

[6] Peter Clark, Isaac Cowhey, Oren Etzioni, Tushar Khot, Ashish Sabharwal, Carissa Schoenick, and Oyvind Tafjord. Think you have solved question answering? Try ARC, the AI2 reasoning challenge. arXiv preprint arXiv:1803.05457, 2018. URL https://arxiv.org/abs/1803.05457

[7] Karl Cobbe, Vineet Kosaraju, Mohammad Bavarian, Mark Chen, Heewoo Jun, Lukasz Kaiser, Matthias Plappert, Jerry Tworek, Jacob Hilton, Reiichiro Nakano, et al. Training verifiers to solve math word problems. arXiv preprint arXiv:2110.14168, 2021. URL https://arxiv.org/abs/2110.14168

[8] William Fedus, Barret Zoph, and Noam Shazeer. Switch transformers: Scaling to trillion parameter models with simple and efficient sparsity. Journal of Machine Learning Research, 23(120):1–39, 2022. URL https://jmlr.org/papers/v23/21-0998.html

[9] Dan Hendrycks, Collin Burns, Saurav Kadavath, Akul Arora, Steven Basart, Eric Tang, Dawn Song, and Jacob Steinhardt. Measuring mathematical problem solving with the MATH dataset. In Thirty-fifth Conference on Neural Information Processing Systems Datasets and Benchmarks Track (Round 2), 2021. URL https://openreview.net/forum?id=7Bywt2mQsCe

[10] Jordan Hoffmann, Sebastian Borgeaud, Arthur Mensch, Elena Buchatskaya, Trevor Cai, Eliza Rutherford, Diego de Las Casas, Lisa Anne Hendricks, Johannes Welbl, Aidan Clark, Tom Hennigan, Eric Noland, Katie Millican, George van den Driessche, Bogdan Damoc, Aurelia Guy, Simon Osindero, Karen Simonyan, Erich Elsen, Oriol Vinyals, Jack W. Rae, and Laurent Sifre...

[11] Jared Kaplan, Sam McCandlish, Tom Henighan, Tom B Brown, Benjamin Chess, Rewon Child, Scott Gray, Alec Radford, Jeffrey Wu, and Dario Amodei. Scaling laws for neural language models. arXiv preprint arXiv:2001.08361, 2020. URL https://arxiv.org/abs/2001.08361

[12] Percy Liang, Rishi Bommasani, Tony Lee, Dimitris Tsipras, Dilara Soylu, Michihiro Yasunaga, Yian Zhang, Deepak Narayanan, Yuhuai Wu, Ananya Kumar, Benjamin Newman, Binhang Yuan, Bobby Yan, Ce Zhang, Christian Cosgrove, Christopher D Manning, Christopher Re, Diana Acosta-Navas, Drew A. Hudson, Eric Zelikman, Esin Durmus, Faisal Ladhak, Frieda Rong, Hongyu ...

[13] Stephanie Lin, Jacob Hilton, and Owain Evans. TruthfulQA: Measuring how models mimic human falsehoods. In Proceedings of the 60th Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), pages 3214–3252, 2022. URL https://aclanthology.org/2022.acl-long.229/

[14] Noam Shazeer, *Azalia Mirhoseini, *Krzysztof Maziarz, Andy Davis, Quoc Le, Geoffrey Hinton, and Jeff Dean. Outrageously large neural networks: The sparsely-gated mixture-of-experts layer. In International Conference on Learning Representations, 2017. URL https://openreview.net/forum?id=B1ckMDqlg

[16] URL https://arxiv.org/abs/2408.00118

[17] Zhongwei Wan, Xin Wang, Che Liu, Samiul Alam, Yu Zheng, Jiachen Liu, Zhongnan Qu, Shen Yan, Yi Zhu, Quanlu Zhang, et al. Efficient large language models: A survey. arXiv preprint arXiv:2312.03863, 2023. URL https://arxiv.org/abs/2312.03863

[18] Jason Wei, Xuezhi Wang, Dale Schuurmans, Maarten Bosma, Fei Xia, Ed Chi, Quoc V Le, Denny Zhou, et al. Chain-of-thought prompting elicits reasoning in large language models. Advances in Neural Information Processing Systems, 35:24824–24837, 2022.

[19] An Yang, Anfeng Li, Baosong Yang, Beichen Zhang, Binyuan Hui, Bo Zheng, Bowen Yu, Chang Gao, Chengen Huang, Chenxu Lv, et al. Qwen3 technical report. arXiv preprint arXiv:2505.09388, 2025. URL https://arxiv.org/abs/2505.09388

[20] Zixuan Zhou, Xuefei Ning, Ke Hong, Tianyu Fu, Jiaming Xu, Shiyao Li, Yuming Lou, Luning Wang, Zhihang Yuan, Xiuhong Li, et al. A survey on efficient inference for large language models. arXiv preprint arXiv:2404.14294, 2024. URL https://arxiv.org/abs/2404.14294

[21] Barret Zoph, Irwan Bello, Sameer Kumar, Nan Du, Yanping Huang, Jeff Dean, Noam Shazeer, and William Fedus. ST-MoE: Designing stable and transferable sparse expert models. arXiv preprint arXiv:2202.08906, 2022. URL https://arxiv.org/abs/2202.08906

A Reproducibility Package

To support reproducibility, we release the complete evaluation package at https://...
