arxiv: 2510.18245 · v3 · pith:MNKAKQ75new · submitted 2025-10-21 · 💻 cs.LG · cs.AI

Scaling Laws Meet Model Architecture: Toward Inference-Efficient LLMs

Song Bian, Tao Yu, Shivaram Venkataraman, Youngsuk Park

Pith reviewed 2026-05-18 05:25 UTC · model grok-4.3

keywords scaling laws · large language models · model architecture · inference efficiency · grouped-query attention · Chinchilla scaling law · LLM optimization · training budget

The pith

A conditional scaling law that adds architectural details to Chinchilla predicts LLM designs with higher accuracy and faster inference than standard baselines.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper sets out to show that key architectural choices in large language models, including hidden size, the balance of parameters between MLP and attention layers, and the use of grouped-query attention, can be folded into scaling laws to guide the creation of models that are accurate yet cheap to run at inference time. A reader would care because inference now dominates the cost of deploying these models, so any reliable way to pick better designs without extra training compute or data offers immediate practical value. The authors define a conditional scaling law that augments the Chinchilla framework with these architectural terms, pair it with a search procedure, and validate both on more than two hundred models ranging from 80 million to 3 billion parameters trained on 8 billion to 100 billion tokens. The fitted law is shown to forecast strong architectures, and the resulting models exceed open-source baselines in both accuracy and speed under matched training budgets.

Core claim

The central claim is that a conditional scaling law formed by augmenting the Chinchilla framework with information on hidden size, the MLP-to-attention ratio, and grouped-query attention can reliably predict architectural choices that are simultaneously accurate and inference-efficient. This is shown by training more than 200 models across 80M to 3B parameters and 8B to 100B tokens, fitting the law, and using it to select designs that, under the same training budget, reach up to 2.1 percent higher accuracy and 42 percent greater inference throughput than LLaMA-3.2.

What carries the argument

The conditional scaling law that augments the Chinchilla scaling law with terms for hidden size, MLP-to-attention ratio, and grouped-query attention to jointly model loss and inference latency.
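
As a rough illustration of the shape of this law, the sketch below evaluates the multiplicative loss expression that this page quotes from the paper further down: L(d/√N, r | N, D) = (a0 + a1 log(d/√N) + a2 √N/d) · (b0 + b1 log r + b2/r) · L_opt. The function names, the coefficient values, and the treatment of L_opt as a number supplied by the caller are assumptions made for illustration; they are not the paper's fitted values, and the sketch covers the loss side only, not latency.

```python
import math


def u_term(x, c0, c1, c2):
    """Evaluate c0 + c1*log(x) + c2/x, the factor used for each architectural variable."""
    return c0 + c1 * math.log(x) + c2 / x


def conditional_loss(d_model, n_params, r, l_opt, a, b):
    """Scale a reference loss l_opt by a hidden-size factor and an MLP-to-attention factor."""
    x = d_model / math.sqrt(n_params)  # hidden size normalized by sqrt(N)
    return u_term(x, *a) * u_term(r, *b) * l_opt


# Hypothetical coefficients, chosen only so that each factor has an interior minimum.
A_COEFFS = (1.18, 0.1, 0.006)  # (a0, a1, a2); minimum near d/sqrt(N) = a2/a1 = 0.06
B_COEFFS = (0.9, 0.1, 0.1)     # (b0, b1, b2); minimum at r = b2/b1 = 1.0

N = 1.0e9    # non-embedding parameter count (illustrative)
L_OPT = 2.5  # reference loss at this (N, D), supplied by the caller (illustrative)

for r in (0.3, 1.0, 4.0):
    print(r, round(conditional_loss(2048, N, r, L_OPT, A_COEFFS, B_COEFFS), 4))
```

With these made-up coefficients the predicted loss is lowest at the middle ratio and rises on either side, which mirrors the U-shaped curves described in the figure captions below.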

If this is right

Architectural choices such as hidden size and GQA can be selected systematically using the fitted conditional scaling law rather than exhaustive trial.
Models produced this way deliver up to 2.1 percent higher accuracy than LLaMA-3.2 when trained on the same budget.
The same models also achieve up to 42 percent higher inference throughput than the baseline.
The conditional scaling law provides accurate forecasts of both loss and latency across the studied range of model sizes and data volumes.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

If the low-dimensional augmentation continues to work, the same approach could guide architecture search for models larger than the 3B scale examined here.
Treating inference cost as an explicit term inside scaling laws may encourage future scaling studies to optimize for deployment cost from the outset.
Analogous conditional laws could be derived for other architectural families or training objectives to extend the efficiency gains.

Load-bearing premise

The effects of hidden size, MLP-to-attention ratio, and GQA on both loss and inference latency can be captured by a low-dimensional additive or multiplicative augmentation to the Chinchilla scaling law without large unmodeled interactions or regime shifts outside the 80M-3B parameter range studied.

What would settle it

Training a new model with the architecture selected by the conditional scaling law and finding that it fails to exceed LLaMA-3.2 in accuracy or inference throughput under the same training budget would falsify the central claim.

Figures

Figures reproduced from arXiv: 2510.18245 by Shivaram Venkataraman, Song Bian, Tao Yu, Youngsuk Park.

**Figure 1.** Although larger models generally achieve lower inference throughput than smaller ones, Qwen2.5-1.5B outperforms Qwen3-0.6B. Despite having the same number of layers, Qwen2.5-1.5B benefits from a higher hidden size, GQA, and mlp-to-attention ratio. In this work, we fix the number of layers and study the effect of other architectural factors, including GQA, hidden size, and the mlp-to-attention ratio. This…

**Figure 2.** Inference throughput vs. (left) hidden size $d = d_{\text{model}}$ and (right) mlp-to-attention ratio $r = r_{\text{mlp/attn}}$ on the 8B model. Under a fixed parameter budget $N_{\text{non-embed}}$, larger hidden sizes and higher mlp-to-attention ratios improve inference throughput for varying batch sizes. Chinchilla addresses the following question to determine optimal allocation: $\arg\min_{N,D} L(N, D)$ s.t. $\mathrm{FLOPs}(N, D) = C$ (2), where $C$ denotes…

**Figure 3.** Loss vs. hidden size: (Left) 80M model variants; (Center) 145M model variants; (Right) 297M model variants. Across model sizes, the relationship between training loss and $d_{\text{model}}/\sqrt{N}$ exhibits a consistent U-shaped curve when architectural factors such as GQA and the MLP-to-attention ratio are held fixed. The legend denotes the MLP-to-attention ratio $r = r_{\text{mlp/attn}}$ for each model.

**Figure 4.** Loss vs. MLP-to-attention ratio: (Left) 80M model variants; (Center) 145M model variants; (Right) 297M model variants. Across model sizes, the relationship between training loss and $r_{\text{mlp/attn}}$ exhibits a consistent U-shaped curve when architectural factors such as GQA and hidden size are held fixed. The legend denotes the hidden size $d = d_{\text{model}}$ for each model.

**Figure 5.** Predictive performances of the fitted conditional scaling law on: (left) Task 1: Fit on 80M, evaluate on 145M; (center) Task 2: Fit on 80, 145M, evaluate on 297M; (right) Task 3: Fit on 80, 145, 297M, evaluate on 1B. Orange dots denote fitting data points, and purple crosses indicate the test data points. We compare scaling-law predicted loss with actual pretraining loss of architectures and observed a con…

**Figure 6.** Results for 1B and 3B models: (left) Panda-1B closely follows the scaling law predictions for minimizing training loss. (center) Inference throughput comparison between LLaMA-3.2-1B and Surefire-1B, showing that Surefire-1B consistently achieves higher efficiency across batch sizes. (right) Inference throughput comparison between LLaMA-3.2-3B and Surefire-3B, demonstrating that Surefire-3B consistently del…

**Figure 7.** Effect of the Fitting Dataset on Predictive Performance vs (left) Fit on 80, 145, 297M, 1B, evaluate on 3B; (right) Fit on 1B, evaluate on 3B. Orange dots denote fitting data points, and purple crosses indicate the test data points. We compare scaling-law predicted loss with actual pretraining loss of architectures and we observe that fitting the scaling laws with only 1B model data yields lower MSE and hi…

**Figure 8.** Hidden size on Inference Throughput: (left) 1B model variants; (center) 3B model variants; (right) 8B model variants. Across varying batch sizes and model scales, larger hidden sizes yield higher inference throughput under a fixed parameter budget. The legend indicates the hidden size of the models, where $d = d_{\text{model}}$.

**Figure 9.** MLP-to-Attention ratio on Inference Throughput: (left) 1B model variants; (center) 3B model variants; (right) 8B model variants. Across varying batch sizes and model scales, a larger MLP-to-Attention ratio increases inference throughput under a fixed parameter budget. The legend indicates the MLP-to-Attention ratio of the models, where $r = r_{\text{mlp/attn}}$.

**Figure 10.** [PITH_FULL_IMAGE:figures/full_fig_p021_10.png]

**Figure 11.** Hidden size on Inference Throughput (Qwen3): (left) Qwen3-0.6B model variants; (center) Qwen3-1.7B model variants; (right) Qwen3-4B model variants. Across varying batch sizes and model scales, larger hidden sizes yield higher inference throughput under a fixed parameter budget. The legend indicates the hidden size of the models, where $d = d_{\text{model}}$. All evaluations are performed using the vLLM framework Kwon…

**Figure 12.** MLP-to-Attention ratio on Inference Throughput (Qwen3): (left) Qwen3-0.6B model variants; (center) Qwen3-1.7B model variants; (right) Qwen3-4B model variants. Across varying batch sizes and model scales, a larger MLP-to-Attention ratio increases inference throughput under a fixed parameter budget. The legend indicates the MLP-to-Attention ratio of the models, where $r = r_{\text{mlp/attn}}$. All evaluations are perfo…

**Figure 13.** [PITH_FULL_IMAGE:figures/full_fig_p022_13.png]

**Figure 14.** Loss vs. GQA: (left) 80M model variants; (center) 145M model variants; (right) 297M model variants. Across different model sizes, the relationship between training loss and GQA varies substantially when hidden size and the mlp-to-attention ratio are fixed. The legend denotes the hidden size of each trained model. [PITH_FULL_IMAGE:figures/full_fig_p023_14.png]

**Figure 15.** (left) and [PITH_FULL_IMAGE:figures/full_fig_p024_15.png]

**Figure 16.** Joint and non-separable calibrations: (left) use multiplicative calibrations; (right) use joint and non-separable calibrations. We observe that joint and non-separable calibrations yield higher MSE and lower Spearman scores than multiplicative calibrations, indicating inferior performance. Dots denote the data points used for fitting, while crosses indicate the test data points. [PITH_FULL_IMAGE:figu…

**Figure 17.** Active-Experts-to-Attn on Inference Throughput: (left) 3B-A1.1B model variants; (center) 5.3B-A1.7B model variants; (right) 8.3B-A1.5B model variants. We study the effect of the Active-Experts-to-Attention ratio on inference throughput by fixing the total number of active parameters, setting GQA to 4, and using a batch size of 2048 to reduce MoE inference variance in this figure. All evaluations are perfo…

Original abstract

Scaling the number of parameters and the size of training data has proven to be an effective strategy for improving large language model (LLM) performance. Yet, as these models grow increasingly powerful and widely deployed, the cost of inference has become a pressing concern. Despite its importance, the trade-off between model accuracy and inference efficiency remains underexplored. In this work, we examine how key architectural factors, hidden size, the allocation of parameters between MLP and attention (mlp-to-attention ratio), and grouped-query attention (GQA), influence both inference cost and accuracy. We introduce a conditional scaling law that augments the Chinchilla framework with architectural information, along with a search framework for identifying architectures that are simultaneously inference-efficient and accurate. To validate our approach, we train more than 200 models spanning 80M to 3B parameters and 8B to 100B training tokens, and fit the proposed conditional scaling law. Our results show that the conditional scaling law reliably predicts optimal architectural choices and that the resulting models outperform existing open-source baselines. Under the same training budget, optimized architectures achieve up to 2.1% higher accuracy and 42% greater inference throughput compared to LLaMA-3.2.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Desk Editor's Note: private letter to a colleague

They fit a conditional scaling law to hidden size, MLP ratio and GQA after training 200+ models and report inference gains over LLaMA-3.2, but the fit and search stay inside the same data.

The main point is that this paper augments the Chinchilla law with architectural variables and uses the resulting surface to pick models that look better on both accuracy and inference speed. They ran more than 200 models from 80M to 3B parameters and 8B to 100B tokens, varying hidden size, MLP-to-attention ratio, and GQA, then searched the fitted law for good trade-offs. The headline numbers are up to 2.1% higher accuracy and 42% higher throughput than LLaMA-3.2 at the same training budget.

Referee Report

3 major / 2 minor

Summary. The paper proposes a conditional scaling law that augments the Chinchilla framework with architectural factors (hidden size, MLP-to-attention ratio, and GQA) to jointly model cross-entropy loss and inference latency. A search framework is introduced to identify architectures that optimize accuracy under inference constraints. The approach is validated by training over 200 models spanning 80M–3B parameters and 8B–100B tokens; the authors claim that architectures selected via the fitted law outperform LLaMA-3.2 by up to 2.1% accuracy and 42% inference throughput under equivalent training budgets.

Significance. If the conditional law generalizes and the reported gains hold under independent validation, the work supplies a practical, data-driven method for co-optimizing model architecture and inference efficiency within the Chinchilla scaling regime. The scale of the experimental campaign—more than 200 models across a useful parameter and token range—provides concrete empirical grounding that strengthens the contribution relative to purely theoretical or small-scale studies.

major comments (3)

[Experiments] Experiments section: the conditional scaling law is fitted directly to the full set of >200 experimental runs, after which the same fitted surface is used to select the “optimal” architectures whose performance is then reported. This circularity means the 2.1% accuracy and 42% throughput claims are not supported by held-out validation or independent test runs; a cross-validation split or separate confirmation experiments would be required to substantiate the predictive reliability.
[Conditional Scaling Law] Section describing the conditional scaling law: the augmentation to the Chinchilla law is presented without an explicit functional form (additive, multiplicative, or with interaction terms) or any analysis of whether optimal MLP-to-attention ratio or GQA group size varies with scale. If unmodeled higher-order interactions or regime shifts exist outside the 80M–3B / 8B–100B token window, the search framework will systematically select architectures whose predicted gains do not materialize, directly undermining the central claim.
[Inference Evaluation] Inference evaluation subsection: no description is given of the hardware, batch size, sequence length, or measurement protocol used to obtain the latency/throughput numbers that underpin the 42% improvement claim. Without these details it is impossible to assess whether the reported throughput advantage is reproducible or sensitive to implementation choices.

minor comments (2)

[Abstract] Abstract and results: the fit of the conditional law is reported without error bars, R² values, or residual analysis, making it difficult to judge how well the low-dimensional augmentation actually captures the observed data.
[Results] Results tables: additional baselines beyond LLaMA-3.2 (e.g., recent efficient variants such as Mistral or Phi-3 derivatives) would help contextualize whether the gains are specific to the proposed search or more generally available.

Simulated Author's Rebuttal

3 responses · 0 unresolved

We thank the referee for the detailed and constructive comments on our manuscript. We address each major comment below and indicate the revisions we will make to improve the paper.

Referee: [Experiments] Experiments section: the conditional scaling law is fitted directly to the full set of >200 experimental runs, after which the same fitted surface is used to select the “optimal” architectures whose performance is then reported. This circularity means the 2.1% accuracy and 42% throughput claims are not supported by held-out validation or independent test runs; a cross-validation split or separate confirmation experiments would be required to substantiate the predictive reliability.

Authors: We appreciate this important point about potential circularity in our evaluation. While the scaling law was fitted on the full set of experiments to maximize data utilization for the fit, we recognize that this does not provide an independent test of the law's predictive power for architecture selection. To address this, we will include a cross-validation analysis in the revised manuscript, where the law is fitted on a random subset of 80% of the models and used to predict performance on the held-out 20%. Additionally, we will report results from training a small number of confirmation models selected by the law but not included in the original fitting process. revision: yes
Referee: [Conditional Scaling Law] Section describing the conditional scaling law: the augmentation to the Chinchilla law is presented without an explicit functional form (additive, multiplicative, or with interaction terms) or any analysis of whether optimal MLP-to-attention ratio or GQA group size varies with scale. If unmodeled higher-order interactions or regime shifts exist outside the 80M–3B / 8B–100B token window, the search framework will systematically select architectures whose predicted gains do not materialize, directly undermining the central claim.

Authors: We agree that the functional form should be made explicit. The conditional scaling law augments the Chinchilla loss with multiplicative factors for each architectural parameter: L(N, D, h, r, g) = E + A/N^α + B/D^β * f(h, r, g), where f incorporates the hidden size h, MLP-to-attention ratio r, and GQA group size g. We will add the full equation and a subsection analyzing how the optimal r and g change across different scales within our experimental range. Regarding extrapolation beyond the studied regime, we will add a discussion of the limitations and suggest that the law is intended for the 80M-3B parameter range. revision: yes
Referee: [Inference Evaluation] Inference evaluation subsection: no description is given of the hardware, batch size, sequence length, or measurement protocol used to obtain the latency/throughput numbers that underpin the 42% improvement claim. Without these details it is impossible to assess whether the reported throughput advantage is reproducible or sensitive to implementation choices.

Authors: We apologize for the omission of these critical details. In the revised manuscript, we will add a dedicated paragraph in the Inference Evaluation subsection specifying the hardware (NVIDIA H100 GPUs), batch size (1 for latency, 32 for throughput), sequence length (2048 tokens), and the measurement protocol (using PyTorch with CUDA events for timing, averaged over 100 runs after warmup). This will allow readers to reproduce and evaluate the sensitivity of the 42% throughput improvement. revision: yes
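
The measurement protocol named in the last response (a warmup phase, then an average over repeated timed runs) can be sketched without a GPU. The version below is a minimal stand-in that uses `time.perf_counter` from Python's standard library instead of CUDA event timing; `measure_throughput`, `step_fn`, `tokens_per_step`, and the dummy workload are hypothetical names and values introduced for illustration. On a GPU, the timed region would also need the device to be synchronized before the clock is read, otherwise asynchronous kernel launches make the numbers meaningless.

```python
import time


def measure_throughput(step_fn, tokens_per_step, warmup=10, runs=100):
    """Return (tokens per second, mean seconds per step) after discarding warmup steps."""
    for _ in range(warmup):
        step_fn()  # warmup iterations are executed but not timed
    elapsed = []
    for _ in range(runs):
        start = time.perf_counter()
        step_fn()
        elapsed.append(time.perf_counter() - start)
    mean_seconds = sum(elapsed) / len(elapsed)
    return tokens_per_step / mean_seconds, mean_seconds


def dummy_step():
    """Placeholder workload standing in for one decoding step of a model."""
    sum(i * i for i in range(10_000))


# Assumes one generated token per sequence per step at a batch size of 32.
tokens_per_second, mean_seconds = measure_throughput(
    dummy_step, tokens_per_step=32, warmup=3, runs=20
)
print(f"{tokens_per_second:.1f} tokens/s, {mean_seconds * 1e3:.3f} ms/step")
```

Reporting the per-step mean alongside the throughput makes it easier to see whether a difference between two architectures comes from the model or from run-to-run noise.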

Circularity Check

1 step flagged

Conditional scaling law 'predictions' of optimal architectures reduce to optimization over the fitted surface from the same 200-model runs

specific steps

fitted input called prediction [Abstract]
"we introduce a conditional scaling law that augments the Chinchilla framework with architectural information, along with a search framework for identifying architectures that are simultaneously inference-efficient and accurate. To validate our approach, we train more than 200 models spanning 80M to 3B parameters and 8B to 100B training tokens, and fit the proposed conditional scaling law. Our results show that the conditional scaling law reliably predicts optimal architectural choices and that the resulting models outperform existing open-source baselines."

The law is fitted to the exact experimental runs whose architectural variants are later declared 'optimal' by the search framework. Selecting the architecture that minimizes the fitted conditional scaling law is tautological; the 'prediction' of optimality is the direct output of the fit rather than an independent forecast or derivation.

full rationale

The paper fits its conditional scaling law directly to loss and latency measurements from the >200 models spanning 80M-3B parameters. It then uses this fitted surface, via a search framework, to identify 'optimal' architectural choices (hidden size, MLP-to-attention ratio, GQA). The claim that the law 'reliably predicts optimal architectural choices' therefore reduces to selecting the argmin of the fitted function rather than an out-of-sample derivation or held-out validation. The subsequent training of those selected architectures and reported gains (2.1% accuracy, 42% throughput) inherit this dependence; no independent first-principles derivation or external benchmark is shown to break the loop.
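
The kind of held-out check this rationale points to can be sketched in a few lines: split the runs, fit on one part, and score predictions on the other with the two metrics the figure captions mention (MSE and Spearman rank correlation). The helper names, the 20% default, and the toy loss values are illustrative assumptions, and the fitting step itself is left out.

```python
import random


def split_runs(runs, held_out_fraction=0.2, seed=0):
    """Shuffle the runs reproducibly and return (fit_set, held_out_set)."""
    shuffled = list(runs)
    random.Random(seed).shuffle(shuffled)
    n_test = max(1, round(len(shuffled) * held_out_fraction))
    return shuffled[n_test:], shuffled[:n_test]


def mse(predicted, actual):
    """Mean squared error between predicted and measured losses."""
    return sum((p - a) ** 2 for p, a in zip(predicted, actual)) / len(actual)


def ranks(values):
    """1-based ranks, with tied values sharing their average rank."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    result = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        for k in range(i, j + 1):
            result[order[k]] = (i + j) / 2 + 1
        i = j + 1
    return result


def spearman(predicted, actual):
    """Spearman rank correlation: Pearson correlation computed on the ranks."""
    rp, ra = ranks(predicted), ranks(actual)
    mp, ma = sum(rp) / len(rp), sum(ra) / len(ra)
    cov = sum((p - mp) * (a - ma) for p, a in zip(rp, ra))
    var_p = sum((p - mp) ** 2 for p in rp)
    var_a = sum((a - ma) ** 2 for a in ra)
    if var_p == 0 or var_a == 0:
        raise ValueError("rank correlation is undefined for constant inputs")
    return cov / (var_p * var_a) ** 0.5


# Toy example: ten run identifiers and made-up losses for four held-out runs.
fit_set, held_out = split_runs(list(range(10)))
predicted = [2.91, 2.80, 2.75, 2.60]
actual = [2.95, 2.78, 2.70, 2.66]
print(len(fit_set), len(held_out), mse(predicted, actual), spearman(predicted, actual))
```

A high rank correlation on the held-out runs matters more for architecture selection than a low MSE, because the search only needs the law to order candidate designs correctly.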

Axiom & Free-Parameter Ledger

1 free parameter · 1 axiom · 0 invented entities

The central claim rests on the assumption that a modest number of architectural scalars can be inserted into the Chinchilla functional form and that the resulting surface will generalize to new model sizes and token budgets; the free parameters are the fitted coefficients of that augmented law.

free parameters (1)

coefficients of the conditional scaling law
The parameters that multiply or add the architectural variables (hidden size, mlp-to-attention ratio, GQA) to the base Chinchilla loss equation are fitted to the 200+ training runs.

axioms (1)

domain assumption: Architectural factors can be treated as continuous variables whose effects on loss and latency are smooth and low-order within the studied regime.
Invoked when the authors augment the Chinchilla framework rather than treating architecture as a discrete search space.

Lean theorems connected to this paper

Citations machine-checked in the Pith Canon. Every link opens the source theorem in the public Lean library.

IndisputableMonolith/Cost/FunctionalEquation.lean · washburn_uniqueness_aczel · unclear

We introduce a conditional scaling law that augments the Chinchilla framework with architectural information... $L(d/\sqrt{N},\, r \mid N, D) = \left(a_0 + a_1 \log(d/\sqrt{N}) + a_2 \sqrt{N}/d\right) \cdot \left(b_0 + b_1 \log r + b_2/r\right) \cdot L_{\text{opt}}$

IndisputableMonolith/Foundation/RealityFromDistinction.lean · reality_from_one_distinction · unclear

U-shaped curves $L(d/\sqrt{N} \mid r, N, D)$ ... fit the function $c_0 + c_1 \log x + c_2/x$ separately for $x = r_{\text{mlp/attn}}$ and $d_{\text{model}}/\sqrt{N_{\text{non-embed}}}$
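
The fit named in the second passage is linear in its coefficients, so it can be done with ordinary least squares on the features 1, log x, and 1/x. The sketch below is a minimal standard-library version under that assumption; the function names and the synthetic data are hypothetical, and nothing here reproduces the paper's fitted values. It also uses the fact that setting the derivative c1/x − c2/x² to zero gives the minimizer x* = c2/c1, which is an interior minimum only when c1 and c2 are both positive.

```python
import math


def solve_linear(matrix, rhs):
    """Solve a small dense linear system by Gaussian elimination with partial pivoting."""
    n = len(rhs)
    aug = [row[:] + [value] for row, value in zip(matrix, rhs)]
    for col in range(n):
        pivot = max(range(col, n), key=lambda i: abs(aug[i][col]))
        aug[col], aug[pivot] = aug[pivot], aug[col]
        for row in range(col + 1, n):
            factor = aug[row][col] / aug[col][col]
            for k in range(col, n + 1):
                aug[row][k] -= factor * aug[col][k]
    solution = [0.0] * n
    for row in range(n - 1, -1, -1):
        tail = sum(aug[row][k] * solution[k] for k in range(row + 1, n))
        solution[row] = (aug[row][n] - tail) / aug[row][row]
    return solution


def fit_u_curve(xs, losses):
    """Least-squares fit of loss = c0 + c1*log(x) + c2/x via the normal equations."""
    feats = [[1.0, math.log(x), 1.0 / x] for x in xs]
    gram = [[sum(f[i] * f[j] for f in feats) for j in range(3)] for i in range(3)]
    moment = [sum(f[i] * y for f, y in zip(feats, losses)) for i in range(3)]
    return solve_linear(gram, moment)


def u_curve_minimizer(c1, c2):
    """Return x* = c2/c1, the stationary point of c0 + c1*log(x) + c2/x."""
    if c1 <= 0 or c2 <= 0:
        raise ValueError("an interior minimum requires c1 > 0 and c2 > 0")
    return c2 / c1


# Synthetic, noise-free data generated from known coefficients (not from the paper).
TRUE_C = (2.0, 0.3, 0.6)
xs = [0.5 * k for k in range(1, 13)]
losses = [TRUE_C[0] + TRUE_C[1] * math.log(x) + TRUE_C[2] / x for x in xs]

c0, c1, c2 = fit_u_curve(xs, losses)
x_star = u_curve_minimizer(c1, c2)
print(c0, c1, c2, x_star)
```

On real, noisy losses the normal equations can be poorly conditioned when the sampled x values span a narrow range, in which case a library least-squares routine would be the safer choice.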

What do these tags mean?

matches: The paper's claim is directly supported by a theorem in the formal canon.
supports: The theorem supports part of the paper's argument, but the paper may add assumptions or extra steps.
extends: The paper goes beyond the formal theorem; the theorem is a base layer rather than the whole result.
uses: The paper appears to rely on the theorem as machinery.
contradicts: The paper's claim conflicts with a theorem or certificate in the canon.
unclear: Pith found a possible connection, but the passage is too broad, indirect, or ambiguous to say the theorem truly supports the claim.

Reference graph

Works this paper leans on

55 extracted references · 55 canonical work pages · 33 internal anchors

[1]

Phi-4 Technical Report

Marah Abdin, Jyoti Aneja, Harkirat Behl, S ´ebastien Bubeck, Ronen Eldan, Suriya Gunasekar, Michael Harrison, Russell J Hewett, Mojan Javaheripi, Piero Kauffmann, et al. Phi-4 techni- cal report.arXiv preprint arXiv:2412.08905,

work page internal anchor Pith review Pith/arXiv arXiv
[2]

Parameters vs flops: Scaling laws for optimal sparsity for mixture-of-experts language models.arXiv preprint arXiv:2501.12370,

Samira Abnar, Harshay Shah, Dan Busbridge, Alaaeldin Mohamed Elnouby Ali, Josh Susskind, and Vimal Thilak. Parameters vs flops: Scaling laws for optimal sparsity for mixture-of-experts language models.arXiv preprint arXiv:2501.12370,

work page arXiv
[3]

GPT-4 Technical Report

Josh Achiam, Steven Adler, Sandhini Agarwal, Lama Ahmad, Ilge Akkaya, Florencia Leoni Ale- man, Diogo Almeida, Janko Altenschmidt, Sam Altman, Shyamal Anadkat, et al. Gpt-4 technical report.arXiv preprint arXiv:2303.08774,

work page internal anchor Pith review Pith/arXiv arXiv
[4]

GQA: Training Generalized Multi-Query Transformer Models from Multi-Head Checkpoints

Joshua Ainslie, James Lee-Thorp, Michiel de Jong, Yury Zemlyanskiy, Federico Lebr´on, and Sumit Sanghai. Gqa: Training generalized multi-query transformer models from multi-head check- points.arXiv preprint arXiv:2305.13245,

work page internal anchor Pith review Pith/arXiv arXiv
[5]

Program Synthesis with Large Language Models

Jacob Austin, Augustus Odena, Maxwell Nye, Maarten Bosma, Henryk Michalewski, David Dohan, Ellen Jiang, Carrie Cai, Michael Terry, Quoc Le, et al. Program synthesis with large language models.arXiv preprint arXiv:2108.07732,

work page internal anchor Pith review Pith/arXiv arXiv
[6]

Scaling inference-efficient language mod- els.arXiv preprint arXiv:2501.18107,

Song Bian, Minghao Yan, and Shivaram Venkataraman. Scaling inference-efficient language mod- els.arXiv preprint arXiv:2501.18107,

work page arXiv
[7]

Large Language Monkeys: Scaling Inference Compute with Repeated Sampling

Bradley Brown, Jordan Juravsky, Ryan Ehrlich, Ronald Clark, Quoc V Le, Christopher R ´e, and Azalia Mirhoseini. Large language monkeys: Scaling inference compute with repeated sampling. arXiv preprint arXiv:2407.21787,

work page internal anchor Pith review Pith/arXiv arXiv
[8]

Language models are few-shot learners.Advances in neural information processing systems, 33:1877–1901,

Tom Brown, Benjamin Mann, Nick Ryder, Melanie Subbiah, Jared D Kaplan, Prafulla Dhariwal, Arvind Neelakantan, Pranav Shyam, Girish Sastry, Amanda Askell, et al. Language models are few-shot learners.Advances in neural information processing systems, 33:1877–1901,

work page 1901
[9]

Exploring diffusion transformer designs via grafting.arXiv preprint arXiv:2506.05340,

Keshigeyan Chandrasegaran, Michael Poli, Daniel Y Fu, Dongjun Kim, Lea M Hadzic, Manling Li, Agrim Gupta, Stefano Massaroli, Azalia Mirhoseini, Juan Carlos Niebles, et al. Exploring diffusion transformer designs via grafting.arXiv preprint arXiv:2506.05340,

work page arXiv
[10]

Evaluating Large Language Models Trained on Code

Mark Chen, Jerry Tworek, Heewoo Jun, Qiming Yuan, Henrique Ponde De Oliveira Pinto, Jared Kaplan, Harri Edwards, Yuri Burda, Nicholas Joseph, Greg Brockman, et al. Evaluating large language models trained on code.arXiv preprint arXiv:2107.03374,

work page internal anchor Pith review Pith/arXiv arXiv
[11]

Scaling law for quantization-aware training.arXiv preprint arXiv:2505.14302,

11 Preprint Mengzhao Chen, Chaoyi Zhang, Jing Liu, Yutao Zeng, Zeyue Xue, Zhiheng Liu, Yunshui Li, Jin Ma, Jie Huang, Xun Zhou, et al. Scaling law for quantization-aware training.arXiv preprint arXiv:2505.14302,

work page arXiv
[12]

Reducing the carbon impact of generative ai inference (today and in 2035)

Andrew A Chien, Liuzixuan Lin, Hai Nguyen, Varsha Rao, Tristan Sharma, and Rajini Wijayawar- dana. Reducing the carbon impact of generative ai inference (today and in 2035). InProceedings of the 2nd workshop on sustainable computer systems, pp. 1–7,

work page 2035
[13]

Think you have Solved Question Answering? Try ARC, the AI2 Reasoning Challenge

Peter Clark, Isaac Cowhey, Oren Etzioni, Tushar Khot, Ashish Sabharwal, Carissa Schoenick, and Oyvind Tafjord. Think you have solved question answering? try arc, the ai2 reasoning challenge. arXiv preprint arXiv:1803.05457,

work page internal anchor Pith review Pith/arXiv arXiv
[14]

Training Verifiers to Solve Math Word Problems

Karl Cobbe, Vineet Kosaraju, Mohammad Bavarian, Mark Chen, Heewoo Jun, Lukasz Kaiser, Matthias Plappert, Jerry Tworek, Jacob Hilton, Reiichiro Nakano, et al. Training verifiers to solve math word problems.arXiv preprint arXiv:2110.14168,

work page internal anchor Pith review Pith/arXiv arXiv
[15]

Transformers are SSMs: Generalized Models and Efficient Algorithms Through Structured State Space Duality

Tri Dao and Albert Gu. Transformers are ssms: Generalized models and efficient algorithms through structured state space duality.arXiv preprint arXiv:2405.21060,

work page internal anchor Pith review Pith/arXiv arXiv
[16]

The Llama 3 Herd of Models

Abhimanyu Dubey, Abhinav Jauhri, Abhinav Pandey, Abhishek Kadian, Ahmad Al-Dahle, Aiesha Letman, Akhil Mathur, Alan Schelten, Amy Yang, Angela Fan, et al. The llama 3 herd of models. arXiv preprint arXiv:2407.21783,

work page internal anchor Pith review Pith/arXiv arXiv
[17]

Language models scale reliably with over-training and on downstream tasks.arXiv preprint arXiv:2403.08540,

Samir Yitzhak Gadre, Georgios Smyrnis, Vaishaal Shankar, Suchin Gururangan, Mitchell Worts- man, Rulin Shao, Jean Mercat, Alex Fang, Jeffrey Li, Sedrick Keh, et al. Language models scale reliably with over-training and on downstream tasks.arXiv preprint arXiv:2403.08540,

work page arXiv
[18]

Truthfulqa: Measuring how models mimic human falsehoods,

Leo Gao, Jonathan Tow, Baber Abbasi, Stella Biderman, Sid Black, Anthony DiPofi, Charles Fos- ter, Laurence Golding, Jeffrey Hsu, Alain Le Noac’h, Haonan Li, Kyle McDonell, Niklas Muen- nighoff, Chris Ociepa, Jason Phang, Laria Reynolds, Hailey Schoelkopf, Aviya Skowron, Lintang Sutawika, Eric Tang, Anish Thite, Ben Wang, Kevin Wang, and Andy Zou. The lan...

work page arXiv
[19]

rStar-Math: Small LLMs Can Master Math Reasoning with Self-Evolved Deep Thinking

Xinyu Guan, Li Lyna Zhang, Yifei Liu, Ning Shang, Youran Sun, Yi Zhu, Fan Yang, and Mao Yang. rstar-math: Small llms can master math reasoning with self-evolved deep thinking.arXiv preprint arXiv:2501.04519,

work page internal anchor Pith review Pith/arXiv arXiv
[20]

DeepSeek-R1: Incentivizing Reasoning Capability in LLMs via Reinforcement Learning

Daya Guo, Dejian Yang, Haowei Zhang, Junxiao Song, Ruoyu Zhang, Runxin Xu, Qihao Zhu, Shirong Ma, Peiyi Wang, Xiao Bi, et al. Deepseek-r1: Incentivizing reasoning capability in llms via reinforcement learning.arXiv preprint arXiv:2501.12948,

work page internal anchor Pith review Pith/arXiv arXiv
[21]

Measuring Mathematical Problem Solving With the MATH Dataset

Dan Hendrycks, Collin Burns, Saurav Kadavath, Akul Arora, Steven Basart, Eric Tang, Dawn Song, and Jacob Steinhardt. Measuring mathematical problem solving with the math dataset.arXiv preprint arXiv:2103.03874,

work page internal anchor Pith review Pith/arXiv arXiv
[22]

Training Compute-Optimal Large Language Models

Jordan Hoffmann, Sebastian Borgeaud, Arthur Mensch, Elena Buchatskaya, Trevor Cai, Eliza Rutherford, Diego de Las Casas, Lisa Anne Hendricks, Johannes Welbl, Aidan Clark, et al. Train- ing compute-optimal large language models.arXiv preprint arXiv:2203.15556,

[23]

LiveCodeBench: Holistic and Contamination Free Evaluation of Large Language Models for Code

Naman Jain, King Han, Alex Gu, Wen-Ding Li, Fanjia Yan, Tianjun Zhang, Sida Wang, Armando Solar-Lezama, Koushik Sen, and Ion Stoica. LiveCodeBench: Holistic and contamination free evaluation of large language models for code. arXiv preprint arXiv:2403.07974,

[24]

MInference 1.0: Accelerating pre-filling for long-context LLMs via dynamic sparse attention

Huiqiang Jiang, Yucheng Li, Chengruidong Zhang, Qianhui Wu, Xufang Luo, Surin Ahn, Zhenhua Han, Amir H Abdi, Dongsheng Li, Chin-Yew Lin, et al. MInference 1.0: Accelerating pre-filling for long-context LLMs via dynamic sparse attention. arXiv preprint arXiv:2407.02490,

[25]

Scaling Laws for Neural Language Models

Jared Kaplan, Sam McCandlish, Tom Henighan, Tom B Brown, Benjamin Chess, Rewon Child, Scott Gray, Alec Radford, Jeffrey Wu, and Dario Amodei. Scaling laws for neural language models. arXiv preprint arXiv:2001.08361,

[26]

Scaling laws for fine-grained mixture of experts

Jakub Krajewski, Jan Ludziejewski, Kamil Adamczewski, Maciej Pióro, Michał Krutul, Szymon Antoniak, Kamil Ciebiera, Krystian Król, Tomasz Odrzygóźdź, Piotr Sankowski, et al. Scaling laws for fine-grained mixture of experts. arXiv preprint arXiv:2402.07871,

[27]

Scaling laws for precision

Tanishq Kumar, Zachary Ankner, Benjamin F Spector, Blake Bordelon, Niklas Muennighoff, Mansheej Paul, Cengiz Pehlevan, Christopher Ré, and Aditi Raghunathan. Scaling laws for precision. arXiv preprint arXiv:2411.04330,

[28]

DeepSeek-V3 Technical Report

Aixin Liu, Bei Feng, Bing Xue, Bingxuan Wang, Bochao Wu, Chengda Lu, Chenggang Zhao, Chengqi Deng, Chenyu Zhang, Chong Ruan, et al. DeepSeek-V3 technical report. arXiv preprint arXiv:2412.19437, 2024a.

Aixin Liu, Bei Feng, Bing Xue, Bingxuan Wang, Bochao Wu, Chengda Lu, Chenggang Zhao, Chengqi Deng, Chenyu Zhang, Chong Ruan, et al. Deepseek-v3 technical re...

[29]

Can a Suit of Armor Conduct Electricity? A New Dataset for Open Book Question Answering

Todor Mihaylov, Peter Clark, Tushar Khot, and Ashish Sabharwal. Can a suit of armor conduct electricity? A new dataset for open book question answering. arXiv preprint arXiv:1809.02789,

[30]

The LAMBADA dataset: Word prediction requiring a broad discourse context

Denis Paperno, Germán Kruszewski, Angeliki Lazaridou, Quan Ngoc Pham, Raffaella Bernardi, Sandro Pezzelle, Marco Baroni, Gemma Boleda, and Raquel Fernández. The LAMBADA dataset: Word prediction requiring a broad discourse context. arXiv preprint arXiv:1606.06031,

[31]

The impact of depth on compositional generalization in transformer language models

Jackson Petty, Sjoerd van Steenkiste, Ishita Dasgupta, Fei Sha, Dan Garrette, and Tal Linzen. The impact of depth on compositional generalization in transformer language models. arXiv preprint arXiv:2310.19956,

[32]

Mutual reasoning makes smaller LLMs stronger problem-solvers

Zhenting Qi, Mingyuan Ma, Jiahang Xu, Li Lyna Zhang, Fan Yang, and Mao Yang. Mutual reasoning makes smaller LLMs stronger problem-solvers. arXiv preprint arXiv:2408.06195,

[33]

Observational scaling laws and the predictability of language model performance

Yangjun Ruan, Chris J Maddison, and Tatsunori Hashimoto. Observational scaling laws and the predictability of language model performance. arXiv preprint arXiv:2405.10938,

[34]

Beyond Chinchilla-optimal: Accounting for inference in language model scaling laws

Nikhil Sardana, Jacob Portes, Sasha Doubov, and Jonathan Frankle. Beyond Chinchilla-optimal: Accounting for inference in language model scaling laws. arXiv preprint arXiv:2401.00448,

[35]

Outrageously Large Neural Networks: The Sparsely-Gated Mixture-of-Experts Layer

Noam Shazeer, Azalia Mirhoseini, Krzysztof Maziarz, Andy Davis, Quoc Le, Geoffrey Hinton, and Jeff Dean. Outrageously large neural networks: The sparsely-gated mixture-of-experts layer. arXiv preprint arXiv:1701.06538,

[36]

Megatron-LM: Training Multi-Billion Parameter Language Models Using Model Parallelism

Mohammad Shoeybi, Mostofa Patwary, Raul Puri, Patrick LeGresley, Jared Casper, and Bryan Catanzaro. Megatron-LM: Training multi-billion parameter language models using model parallelism. arXiv preprint arXiv:1909.08053,

[37]

Scaling LLM Test-Time Compute Optimally can be More Effective than Scaling Model Parameters

Charlie Snell, Jaehoon Lee, Kelvin Xu, and Aviral Kumar. Scaling LLM test-time compute optimally can be more effective than scaling model parameters. arXiv preprint arXiv:2408.03314,

[38]

Dolma: An open corpus of three trillion tokens for language model pretraining research

Luca Soldaini, Rodney Kinney, Akshita Bhagia, Dustin Schwenk, David Atkinson, Russell Authur, Ben Bogin, Khyathi Chandu, Jennifer Dumas, Yanai Elazar, et al. Dolma: An open corpus of three trillion tokens for language model pretraining research. arXiv preprint arXiv:2402.00159,

[39]

Scale efficiently: Insights from pre-training and fine-tuning transformers

Yi Tay, Mostafa Dehghani, Jinfeng Rao, William Fedus, Samira Abnar, Hyung Won Chung, Sharan Narang, Dani Yogatama, Ashish Vaswani, and Donald Metzler. Scale efficiently: Insights from pre-training and fine-tuning transformers. arXiv preprint arXiv:2109.10686,

[40]

Gemma: Open Models Based on Gemini Research and Technology

Gemma Team, Thomas Mesnard, Cassidy Hardin, Robert Dadashi, Surya Bhupatiraju, Shreya Pathak, Laurent Sifre, Morgane Rivière, Mihir Sanjay Kale, Juliette Love, et al. Gemma: Open models based on Gemini research and technology. arXiv preprint arXiv:2403.08295, 2024a.

Gemma Team, Morgane Riviere, Shreya Pathak, Pier Giuseppe Sessa, Cassidy Hardin, Surya Bh...

[41]

LLaMA: Open and Efficient Foundation Language Models

Hugo Touvron, Thibaut Lavril, Gautier Izacard, Xavier Martinet, Marie-Anne Lachaux, Timothée Lacroix, Baptiste Rozière, Naman Goyal, Eric Hambro, Faisal Azhar, et al. LLaMA: Open and efficient foundation language models. arXiv preprint arXiv:2302.13971,

[42]

GLUE: A Multi-Task Benchmark and Analysis Platform for Natural Language Understanding

Alex Wang. GLUE: A multi-task benchmark and analysis platform for natural language understanding. arXiv preprint arXiv:1804.07461,

[43]

Emergent Abilities of Large Language Models

Jason Wei, Yi Tay, Rishi Bommasani, Colin Raffel, Barret Zoph, Sebastian Borgeaud, Dani Yogatama, Maarten Bosma, Denny Zhou, Donald Metzler, et al. Emergent abilities of large language models. arXiv preprint arXiv:2206.07682,

[44]

Crowdsourcing Multiple Choice Science Questions

Johannes Welbl, Nelson F Liu, and Matt Gardner. Crowdsourcing multiple choice science questions. arXiv preprint arXiv:1707.06209,

[45]

Efficient Streaming Language Models with Attention Sinks

Guangxuan Xiao, Yuandong Tian, Beidi Chen, Song Han, and Mike Lewis. Efficient streaming language models with attention sinks. arXiv preprint arXiv:2309.17453,

[46]

DuoAttention: Efficient long-context LLM inference with retrieval and streaming heads

Guangxuan Xiao, Jiaming Tang, Jingwei Zuo, Junxian Guo, Shang Yang, Haotian Tang, Yao Fu, and Song Han. DuoAttention: Efficient long-context LLM inference with retrieval and streaming heads. arXiv preprint arXiv:2410.10819,

[47]

Qwen3 Technical Report

An Yang, Anfeng Li, Baosong Yang, Beichen Zhang, Binyuan Hui, Bo Zheng, Bowen Yu, Chang Gao, Chengen Huang, Chenxu Lv, et al. Qwen3 technical report. arXiv preprint arXiv:2505.09388,

[48]

FlashInfer: Efficient and Customizable Attention Engine for LLM Inference Serving

Zihao Ye, Lequn Chen, Ruihang Lai, Wuwei Lin, Yineng Zhang, Stephanie Wang, Tianqi Chen, Baris Kasikci, Vinod Grover, Arvind Krishnamurthy, et al. FlashInfer: Efficient and customizable attention engine for LLM inference serving. arXiv preprint arXiv:2501.01005,

[49]

Native Sparse Attention: Hardware-Aligned and Natively Trainable Sparse Attention

Jingyang Yuan, Huazuo Gao, Damai Dai, Junyu Luo, Liang Zhao, Zhengyan Zhang, Zhenda Xie, YX Wei, Lean Wang, Zhiping Xiao, et al. Native sparse attention: Hardware-aligned and natively trainable sparse attention. arXiv preprint arXiv:2502.11089,

[50]

HellaSwag: Can a Machine Really Finish Your Sentence?

Rowan Zellers, Ari Holtzman, Yonatan Bisk, Ali Farhadi, and Yejin Choi. HellaSwag: Can a machine really finish your sentence? arXiv preprint arXiv:1905.07830,

[51]

TinyLlama: An Open-Source Small Language Model

Peiyuan Zhang, Guangtao Zeng, Tianduo Wang, and Wei Lu. TinyLlama: An open-source small language model. arXiv preprint arXiv:2401.02385,

[52]

It was not used to generate research ideas

A LLM USAGE

We used an LLM to improve the writing by correcting grammar in our draft. It was not used to generate research ideas.

B OPEN-WEIGHTED MODEL ARCHITECTURES

Table 3 presents an overview of the open-weight model architectures utilized in this paper.

Table 3: Open-Weighted Model Architectures: We list the architectural configurations of all...

[53]

Across varying batch sizes and model scales, larger hidden sizes yield higher inference throughput under a fixed parameter budget

[Figure 8 plots: throughput (tokens/s) versus batch size in three panels, with hidden sizes d = 1024, 2048, 4096; d = 1536, 3072, 6144; and d = 2048, 4096, 8192.]

Figure 8: Hidden size on Inference Throughput: (left) 1B model variants; (center) 3B mode...

[54]

Moreover, while multiplicative and additive calibrations differ in formulation, their MSE and Spearman values remain nearly identical

We observe that outlier data points harm the scaling law fit. Moreover, while multiplicative and additive calibrations differ in formulation, their MSE and Spearman values remain nearly identical. Dots denote the data points used for fitting, while crosses indicate the test data points.

[Plot: predicted loss versus actual loss, both axes spanning 2.6 to 3.6.] ...
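
The caption above judges scaling-law fits by MSE and Spearman rank correlation between predicted and actual loss. A minimal sketch of both metrics follows. The loss values are hypothetical placeholders, not results from the paper, and the helper names (`mse`, `average_ranks`, `spearman`) are illustrative rather than the authors' code.

```python
from statistics import mean


def mse(actual, predicted):
    """Mean squared error between two equal-length sequences."""
    return mean((a - p) ** 2 for a, p in zip(actual, predicted))


def average_ranks(values):
    """1-based ranks; tied values share the average of their positions."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        # Extend j over the run of values tied with position i.
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        shared = (i + j) / 2 + 1
        for k in range(i, j + 1):
            ranks[order[k]] = shared
        i = j + 1
    return ranks


def pearson(x, y):
    """Pearson correlation coefficient of two equal-length sequences."""
    mx, my = mean(x), mean(y)
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sx = sum((a - mx) ** 2 for a in x) ** 0.5
    sy = sum((b - my) ** 2 for b in y) ** 0.5
    return cov / (sx * sy)


def spearman(actual, predicted):
    """Spearman correlation: Pearson correlation of the rank vectors."""
    return pearson(average_ranks(actual), average_ranks(predicted))


# Hypothetical held-out losses and predictions (assumed for illustration).
actual_loss = [2.6, 2.8, 3.0, 3.2, 3.4]
predicted_loss = [2.65, 2.78, 3.05, 3.15, 3.45]

print("MSE:", mse(actual_loss, predicted_loss))
print("Spearman:", spearman(actual_loss, predicted_loss))
```

MSE measures how far the predicted losses are from the observed ones, while Spearman only checks whether the law ranks architectures in the right order, which is the property that matters when the law is used to choose among candidate designs.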

[55]

Table 6: Detailed Results on Downstream Tasks for 1B Models: In this table, we show detailed results of 1B models over 9 downstream tasks.

Downstream Tasks   LLaMA-3.2-1B   Panda-1B   Surefire-1B
Arc-Easy           58.8           60.9       59.7
Arc-Challenge      29.8           28.9       30.2
LAMBADA            52.8           55.1       52.0
HellaSwag          56.9           58.4       56.6
OpenBookQA         32.0           33.2       32.0
PIQA               73.6           75.2       73.0
SciQ               84.8           87.2       84.9
Wi...
