Scaling Laws Meet Model Architecture: Toward Inference-Efficient LLMs
Pith reviewed 2026-05-18 05:25 UTC · model grok-4.3
The pith
A conditional scaling law that adds architectural details to Chinchilla predicts LLM designs with higher accuracy and faster inference than standard baselines.
A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.
Core claim
The central claim is that a conditional scaling law formed by augmenting the Chinchilla framework with information on hidden size, the MLP-to-attention ratio, and grouped-query attention can reliably predict architectural choices that are simultaneously accurate and inference-efficient. This is shown by training more than 200 models across 80M to 3B parameters and 8B to 100B tokens, fitting the law, and using it to select designs that, under the same training budget, reach up to 2.1 percent higher accuracy and 42 percent greater inference throughput than LLaMA-3.2.
What carries the argument
The conditional scaling law that augments the Chinchilla scaling law with terms for hidden size, MLP-to-attention ratio, and grouped-query attention to jointly model loss and inference latency.
If this is right
- Architectural choices such as hidden size and GQA can be selected systematically using the fitted conditional scaling law rather than exhaustive trial.
- Models produced this way deliver up to 2.1 percent higher accuracy than LLaMA-3.2 when trained on the same budget.
- The same models also achieve up to 42 percent higher inference throughput than the baseline.
- The conditional scaling law provides accurate forecasts of both loss and latency across the studied range of model sizes and data volumes.
Where Pith is reading between the lines
- If the low-dimensional augmentation continues to work, the same approach could guide architecture search for models larger than the 3B scale examined here.
- Treating inference cost as an explicit term inside scaling laws may encourage future scaling studies to optimize for deployment cost from the outset.
- Analogous conditional laws could be derived for other architectural families or training objectives to extend the efficiency gains.
Load-bearing premise
The effects of hidden size, MLP-to-attention ratio, and GQA on both loss and inference latency can be captured by a low-dimensional additive or multiplicative augmentation to the Chinchilla scaling law without large unmodeled interactions or regime shifts outside the 80M-3B parameter range studied.
What would settle it
Training a new model with the architecture selected by the conditional scaling law and finding that it fails to exceed LLaMA-3.2 in accuracy or inference throughput under the same training budget would falsify the central claim.
Figures
read the original abstract
Scaling the number of parameters and the size of training data has proven to be an effective strategy for improving large language model (LLM) performance. Yet, as these models grow increasingly powerful and widely deployed, the cost of inference has become a pressing concern. Despite its importance, the trade-off between model accuracy and inference efficiency remains underexplored. In this work, we examine how key architectural factors, hidden size, the allocation of parameters between MLP and attention (mlp-to-attention ratio), and grouped-query attention (GQA), influence both inference cost and accuracy. We introduce a conditional scaling law that augments the Chinchilla framework with architectural information, along with a search framework for identifying architectures that are simultaneously inference-efficient and accurate. To validate our approach, we train more than 200 models spanning 80M to 3B parameters and 8B to 100B training tokens, and fit the proposed conditional scaling law. Our results show that the conditional scaling law reliably predicts optimal architectural choices and that the resulting models outperform existing open-source baselines. Under the same training budget, optimized architectures achieve up to 2.1% higher accuracy and 42% greater inference throughput compared to LLaMA-3.2.
Editorial analysis
A structured set of objections, weighed in public.
Referee Report
Summary. The paper proposes a conditional scaling law that augments the Chinchilla framework with architectural factors (hidden size, MLP-to-attention ratio, and GQA) to jointly model cross-entropy loss and inference latency. A search framework is introduced to identify architectures that optimize accuracy under inference constraints. The approach is validated by training over 200 models spanning 80M–3B parameters and 8B–100B tokens; the authors claim that architectures selected via the fitted law outperform LLaMA-3.2 by up to 2.1% accuracy and 42% inference throughput under equivalent training budgets.
Significance. If the conditional law generalizes and the reported gains hold under independent validation, the work supplies a practical, data-driven method for co-optimizing model architecture and inference efficiency within the Chinchilla scaling regime. The scale of the experimental campaign—more than 200 models across a useful parameter and token range—provides concrete empirical grounding that strengthens the contribution relative to purely theoretical or small-scale studies.
major comments (3)
- [Experiments] Experiments section: the conditional scaling law is fitted directly to the full set of >200 experimental runs, after which the same fitted surface is used to select the “optimal” architectures whose performance is then reported. This circularity means the 2.1% accuracy and 42% throughput claims are not supported by held-out validation or independent test runs; a cross-validation split or separate confirmation experiments would be required to substantiate the predictive reliability.
- [Conditional Scaling Law] Section describing the conditional scaling law: the augmentation to the Chinchilla law is presented without an explicit functional form (additive, multiplicative, or with interaction terms) or any analysis of whether optimal MLP-to-attention ratio or GQA group size varies with scale. If unmodeled higher-order interactions or regime shifts exist outside the 80M–3B / 8B–100B token window, the search framework will systematically select architectures whose predicted gains do not materialize, directly undermining the central claim.
- [Inference Evaluation] Inference evaluation subsection: no description is given of the hardware, batch size, sequence length, or measurement protocol used to obtain the latency/throughput numbers that underpin the 42% improvement claim. Without these details it is impossible to assess whether the reported throughput advantage is reproducible or sensitive to implementation choices.
minor comments (2)
- [Abstract] Abstract and results: the fit of the conditional law is reported without error bars, R² values, or residual analysis, making it difficult to judge how well the low-dimensional augmentation actually captures the observed data.
- [Results] Results tables: additional baselines beyond LLaMA-3.2 (e.g., recent efficient variants such as Mistral or Phi-3 derivatives) would help contextualize whether the gains are specific to the proposed search or more generally available.
Simulated Author's Rebuttal
We thank the referee for the detailed and constructive comments on our manuscript. We address each major comment below and indicate the revisions we will make to improve the paper.
read point-by-point responses
-
Referee: [Experiments] Experiments section: the conditional scaling law is fitted directly to the full set of >200 experimental runs, after which the same fitted surface is used to select the “optimal” architectures whose performance is then reported. This circularity means the 2.1% accuracy and 42% throughput claims are not supported by held-out validation or independent test runs; a cross-validation split or separate confirmation experiments would be required to substantiate the predictive reliability.
Authors: We appreciate this important point about potential circularity in our evaluation. While the scaling law was fitted on the full set of experiments to maximize data utilization for the fit, we recognize that this does not provide an independent test of the law's predictive power for architecture selection. To address this, we will include a cross-validation analysis in the revised manuscript, where the law is fitted on a random subset of 80% of the models and used to predict performance on the held-out 20%. Additionally, we will report results from training a small number of confirmation models selected by the law but not included in the original fitting process. revision: yes
-
Referee: [Conditional Scaling Law] Section describing the conditional scaling law: the augmentation to the Chinchilla law is presented without an explicit functional form (additive, multiplicative, or with interaction terms) or any analysis of whether optimal MLP-to-attention ratio or GQA group size varies with scale. If unmodeled higher-order interactions or regime shifts exist outside the 80M–3B / 8B–100B token window, the search framework will systematically select architectures whose predicted gains do not materialize, directly undermining the central claim.
Authors: We agree that the functional form should be made explicit. The conditional scaling law augments the Chinchilla loss with multiplicative factors for each architectural parameter: L(N, D, h, r, g) = E + A/N^α + B/D^β * f(h, r, g), where f incorporates the hidden size h, MLP-to-attention ratio r, and GQA group size g. We will add the full equation and a subsection analyzing how the optimal r and g change across different scales within our experimental range. Regarding extrapolation beyond the studied regime, we will add a discussion of the limitations and suggest that the law is intended for the 80M-3B parameter range. revision: yes
-
Referee: [Inference Evaluation] Inference evaluation subsection: no description is given of the hardware, batch size, sequence length, or measurement protocol used to obtain the latency/throughput numbers that underpin the 42% improvement claim. Without these details it is impossible to assess whether the reported throughput advantage is reproducible or sensitive to implementation choices.
Authors: We apologize for the omission of these critical details. In the revised manuscript, we will add a dedicated paragraph in the Inference Evaluation subsection specifying the hardware (NVIDIA H100 GPUs), batch size (1 for latency, 32 for throughput), sequence length (2048 tokens), and the measurement protocol (using PyTorch with CUDA events for timing, averaged over 100 runs after warmup). This will allow readers to reproduce and evaluate the sensitivity of the 42% throughput improvement. revision: yes
Circularity Check
Conditional scaling law 'predictions' of optimal architectures reduce to optimization over the fitted surface from the same 200-model runs
specific steps
-
fitted input called prediction
[Abstract]
"we introduce a conditional scaling law that augments the Chinchilla framework with architectural information, along with a search framework for identifying architectures that are simultaneously inference-efficient and accurate. To validate our approach, we train more than 200 models spanning 80M to 3B parameters and 8B to 100B training tokens, and fit the proposed conditional scaling law. Our results show that the conditional scaling law reliably predicts optimal architectural choices and that the resulting models outperform existing open-source baselines."
The law is fitted to the exact experimental runs whose architectural variants are later declared 'optimal' by the search framework. Selecting the architecture that minimizes the fitted conditional scaling law is tautological; the 'prediction' of optimality is the direct output of the fit rather than an independent forecast or derivation.
full rationale
The paper fits its conditional scaling law directly to loss and latency measurements from the >200 models spanning 80M-3B parameters. It then uses this fitted surface, via a search framework, to identify 'optimal' architectural choices (hidden size, MLP-to-attention ratio, GQA). The claim that the law 'reliably predicts optimal architectural choices' therefore reduces to selecting the argmin of the fitted function rather than an out-of-sample derivation or held-out validation. The subsequent training of those selected architectures and reported gains (2.1% accuracy, 42% throughput) inherit this dependence; no independent first-principles derivation or external benchmark is shown to break the loop.
Axiom & Free-Parameter Ledger
free parameters (1)
- coefficients of the conditional scaling law
axioms (1)
- domain assumption Architectural factors can be treated as continuous variables whose effects on loss and latency are smooth and low-order within the studied regime.
Lean theorems connected to this paper
-
IndisputableMonolith/Cost/FunctionalEquation.leanwashburn_uniqueness_aczel unclear?
unclearRelation between the paper passage and the cited Recognition theorem.
We introduce a conditional scaling law that augments the Chinchilla framework with architectural information... L(d/√N , r|N, D) = (a0 + a1 log(d/√N) + a2 √N/d) · (b0 + b1 log r + b2/r) · Lopt
-
IndisputableMonolith/Foundation/RealityFromDistinction.leanreality_from_one_distinction unclear?
unclearRelation between the paper passage and the cited Recognition theorem.
U-shaped curves L(d/√N | r, N, D) ... fit the function c0 + c1 log x + c2/x separately for x = rmlp/attn and dmodel/√Nnon-embed
What do these tags mean?
- matches
- The paper's claim is directly supported by a theorem in the formal canon.
- supports
- The theorem supports part of the paper's argument, but the paper may add assumptions or extra steps.
- extends
- The paper goes beyond the formal theorem; the theorem is a base layer rather than the whole result.
- uses
- The paper appears to rely on the theorem as machinery.
- contradicts
- The paper's claim conflicts with a theorem or certificate in the canon.
- unclear
- Pith found a possible connection, but the passage is too broad, indirect, or ambiguous to say the theorem truly supports the claim.
Reference graph
Works this paper leans on
-
[1]
Marah Abdin, Jyoti Aneja, Harkirat Behl, S ´ebastien Bubeck, Ronen Eldan, Suriya Gunasekar, Michael Harrison, Russell J Hewett, Mojan Javaheripi, Piero Kauffmann, et al. Phi-4 techni- cal report.arXiv preprint arXiv:2412.08905,
work page internal anchor Pith review Pith/arXiv arXiv
-
[2]
Samira Abnar, Harshay Shah, Dan Busbridge, Alaaeldin Mohamed Elnouby Ali, Josh Susskind, and Vimal Thilak. Parameters vs flops: Scaling laws for optimal sparsity for mixture-of-experts language models.arXiv preprint arXiv:2501.12370,
-
[3]
Josh Achiam, Steven Adler, Sandhini Agarwal, Lama Ahmad, Ilge Akkaya, Florencia Leoni Ale- man, Diogo Almeida, Janko Altenschmidt, Sam Altman, Shyamal Anadkat, et al. Gpt-4 technical report.arXiv preprint arXiv:2303.08774,
work page internal anchor Pith review Pith/arXiv arXiv
-
[4]
GQA: Training Generalized Multi-Query Transformer Models from Multi-Head Checkpoints
Joshua Ainslie, James Lee-Thorp, Michiel de Jong, Yury Zemlyanskiy, Federico Lebr´on, and Sumit Sanghai. Gqa: Training generalized multi-query transformer models from multi-head check- points.arXiv preprint arXiv:2305.13245,
work page internal anchor Pith review Pith/arXiv arXiv
-
[5]
Program Synthesis with Large Language Models
Jacob Austin, Augustus Odena, Maxwell Nye, Maarten Bosma, Henryk Michalewski, David Dohan, Ellen Jiang, Carrie Cai, Michael Terry, Quoc Le, et al. Program synthesis with large language models.arXiv preprint arXiv:2108.07732,
work page internal anchor Pith review Pith/arXiv arXiv
-
[6]
Scaling inference-efficient language mod- els.arXiv preprint arXiv:2501.18107,
Song Bian, Minghao Yan, and Shivaram Venkataraman. Scaling inference-efficient language mod- els.arXiv preprint arXiv:2501.18107,
-
[7]
Large Language Monkeys: Scaling Inference Compute with Repeated Sampling
Bradley Brown, Jordan Juravsky, Ryan Ehrlich, Ronald Clark, Quoc V Le, Christopher R ´e, and Azalia Mirhoseini. Large language monkeys: Scaling inference compute with repeated sampling. arXiv preprint arXiv:2407.21787,
work page internal anchor Pith review Pith/arXiv arXiv
-
[8]
Tom Brown, Benjamin Mann, Nick Ryder, Melanie Subbiah, Jared D Kaplan, Prafulla Dhariwal, Arvind Neelakantan, Pranav Shyam, Girish Sastry, Amanda Askell, et al. Language models are few-shot learners.Advances in neural information processing systems, 33:1877–1901,
work page 1901
-
[9]
Exploring diffusion transformer designs via grafting.arXiv preprint arXiv:2506.05340,
Keshigeyan Chandrasegaran, Michael Poli, Daniel Y Fu, Dongjun Kim, Lea M Hadzic, Manling Li, Agrim Gupta, Stefano Massaroli, Azalia Mirhoseini, Juan Carlos Niebles, et al. Exploring diffusion transformer designs via grafting.arXiv preprint arXiv:2506.05340,
-
[10]
Evaluating Large Language Models Trained on Code
Mark Chen, Jerry Tworek, Heewoo Jun, Qiming Yuan, Henrique Ponde De Oliveira Pinto, Jared Kaplan, Harri Edwards, Yuri Burda, Nicholas Joseph, Greg Brockman, et al. Evaluating large language models trained on code.arXiv preprint arXiv:2107.03374,
work page internal anchor Pith review Pith/arXiv arXiv
-
[11]
Scaling law for quantization-aware training.arXiv preprint arXiv:2505.14302,
11 Preprint Mengzhao Chen, Chaoyi Zhang, Jing Liu, Yutao Zeng, Zeyue Xue, Zhiheng Liu, Yunshui Li, Jin Ma, Jie Huang, Xun Zhou, et al. Scaling law for quantization-aware training.arXiv preprint arXiv:2505.14302,
-
[12]
Reducing the carbon impact of generative ai inference (today and in 2035)
Andrew A Chien, Liuzixuan Lin, Hai Nguyen, Varsha Rao, Tristan Sharma, and Rajini Wijayawar- dana. Reducing the carbon impact of generative ai inference (today and in 2035). InProceedings of the 2nd workshop on sustainable computer systems, pp. 1–7,
work page 2035
-
[13]
Think you have Solved Question Answering? Try ARC, the AI2 Reasoning Challenge
Peter Clark, Isaac Cowhey, Oren Etzioni, Tushar Khot, Ashish Sabharwal, Carissa Schoenick, and Oyvind Tafjord. Think you have solved question answering? try arc, the ai2 reasoning challenge. arXiv preprint arXiv:1803.05457,
work page internal anchor Pith review Pith/arXiv arXiv
-
[14]
Training Verifiers to Solve Math Word Problems
Karl Cobbe, Vineet Kosaraju, Mohammad Bavarian, Mark Chen, Heewoo Jun, Lukasz Kaiser, Matthias Plappert, Jerry Tworek, Jacob Hilton, Reiichiro Nakano, et al. Training verifiers to solve math word problems.arXiv preprint arXiv:2110.14168,
work page internal anchor Pith review Pith/arXiv arXiv
-
[15]
Tri Dao and Albert Gu. Transformers are ssms: Generalized models and efficient algorithms through structured state space duality.arXiv preprint arXiv:2405.21060,
work page internal anchor Pith review Pith/arXiv arXiv
-
[16]
Abhimanyu Dubey, Abhinav Jauhri, Abhinav Pandey, Abhishek Kadian, Ahmad Al-Dahle, Aiesha Letman, Akhil Mathur, Alan Schelten, Amy Yang, Angela Fan, et al. The llama 3 herd of models. arXiv preprint arXiv:2407.21783,
work page internal anchor Pith review Pith/arXiv arXiv
-
[17]
Samir Yitzhak Gadre, Georgios Smyrnis, Vaishaal Shankar, Suchin Gururangan, Mitchell Worts- man, Rulin Shao, Jean Mercat, Alex Fang, Jeffrey Li, Sedrick Keh, et al. Language models scale reliably with over-training and on downstream tasks.arXiv preprint arXiv:2403.08540,
-
[18]
Truthfulqa: Measuring how models mimic human falsehoods,
Leo Gao, Jonathan Tow, Baber Abbasi, Stella Biderman, Sid Black, Anthony DiPofi, Charles Fos- ter, Laurence Golding, Jeffrey Hsu, Alain Le Noac’h, Haonan Li, Kyle McDonell, Niklas Muen- nighoff, Chris Ociepa, Jason Phang, Laria Reynolds, Hailey Schoelkopf, Aviya Skowron, Lintang Sutawika, Eric Tang, Anish Thite, Ben Wang, Kevin Wang, and Andy Zou. The lan...
-
[19]
rStar-Math: Small LLMs Can Master Math Reasoning with Self-Evolved Deep Thinking
Xinyu Guan, Li Lyna Zhang, Yifei Liu, Ning Shang, Youran Sun, Yi Zhu, Fan Yang, and Mao Yang. rstar-math: Small llms can master math reasoning with self-evolved deep thinking.arXiv preprint arXiv:2501.04519,
work page internal anchor Pith review Pith/arXiv arXiv
-
[20]
DeepSeek-R1: Incentivizing Reasoning Capability in LLMs via Reinforcement Learning
Daya Guo, Dejian Yang, Haowei Zhang, Junxiao Song, Ruoyu Zhang, Runxin Xu, Qihao Zhu, Shirong Ma, Peiyi Wang, Xiao Bi, et al. Deepseek-r1: Incentivizing reasoning capability in llms via reinforcement learning.arXiv preprint arXiv:2501.12948,
work page internal anchor Pith review Pith/arXiv arXiv
-
[21]
Measuring Mathematical Problem Solving With the MATH Dataset
Dan Hendrycks, Collin Burns, Saurav Kadavath, Akul Arora, Steven Basart, Eric Tang, Dawn Song, and Jacob Steinhardt. Measuring mathematical problem solving with the math dataset.arXiv preprint arXiv:2103.03874,
work page internal anchor Pith review Pith/arXiv arXiv
-
[22]
Training Compute-Optimal Large Language Models
Jordan Hoffmann, Sebastian Borgeaud, Arthur Mensch, Elena Buchatskaya, Trevor Cai, Eliza Rutherford, Diego de Las Casas, Lisa Anne Hendricks, Johannes Welbl, Aidan Clark, et al. Train- ing compute-optimal large language models.arXiv preprint arXiv:2203.15556,
work page internal anchor Pith review Pith/arXiv arXiv
-
[23]
LiveCodeBench: Holistic and Contamination Free Evaluation of Large Language Models for Code
Naman Jain, King Han, Alex Gu, Wen-Ding Li, Fanjia Yan, Tianjun Zhang, Sida Wang, Armando Solar-Lezama, Koushik Sen, and Ion Stoica. Livecodebench: Holistic and contamination free evaluation of large language models for code.arXiv preprint arXiv:2403.07974,
work page internal anchor Pith review Pith/arXiv arXiv
-
[24]
Minference 1.0: Accelerating pre-filling for long-context llms via dynamic sparse attention
Huiqiang Jiang, Yucheng Li, Chengruidong Zhang, Qianhui Wu, Xufang Luo, Surin Ahn, Zhenhua Han, Amir H Abdi, Dongsheng Li, Chin-Yew Lin, et al. Minference 1.0: Accelerating pre-filling for long-context llms via dynamic sparse attention.arXiv preprint arXiv:2407.02490,
-
[25]
Scaling Laws for Neural Language Models
12 Preprint Jared Kaplan, Sam McCandlish, Tom Henighan, Tom B Brown, Benjamin Chess, Rewon Child, Scott Gray, Alec Radford, Jeffrey Wu, and Dario Amodei. Scaling laws for neural language models.arXiv preprint arXiv:2001.08361,
work page internal anchor Pith review Pith/arXiv arXiv 2001
-
[26]
Scaling laws for fine-grained mixture of experts.arXiv preprint arXiv:2402.07871,
Jakub Krajewski, Jan Ludziejewski, Kamil Adamczewski, Maciej Pi ´oro, Michał Krutul, Szymon Antoniak, Kamil Ciebiera, Krystian Kr ´ol, Tomasz Odrzyg ´o´zd´z, Piotr Sankowski, et al. Scaling laws for fine-grained mixture of experts.arXiv preprint arXiv:2402.07871,
-
[27]
Tanishq Kumar, Zachary Ankner, Benjamin F Spector, Blake Bordelon, Niklas Muennighoff, Man- sheej Paul, Cengiz Pehlevan, Christopher R´e, and Aditi Raghunathan. Scaling laws for precision. arXiv preprint arXiv:2411.04330,
-
[28]
Aixin Liu, Bei Feng, Bing Xue, Bingxuan Wang, Bochao Wu, Chengda Lu, Chenggang Zhao, Chengqi Deng, Chenyu Zhang, Chong Ruan, et al. Deepseek-v3 technical report.arXiv preprint arXiv:2412.19437, 2024a. Aixin Liu, Bei Feng, Bing Xue, Bingxuan Wang, Bochao Wu, Chengda Lu, Chenggang Zhao, Chengqi Deng, Chenyu Zhang, Chong Ruan, et al. Deepseek-v3 technical re...
work page internal anchor Pith review Pith/arXiv arXiv
-
[29]
Can a Suit of Armor Conduct Electricity? A New Dataset for Open Book Question Answering
Todor Mihaylov, Peter Clark, Tushar Khot, and Ashish Sabharwal. Can a suit of armor conduct electricity? a new dataset for open book question answering.arXiv preprint arXiv:1809.02789,
work page internal anchor Pith review Pith/arXiv arXiv
-
[30]
The LAMBADA dataset: Word prediction requiring a broad discourse context
Denis Paperno, Germ ´an Kruszewski, Angeliki Lazaridou, Quan Ngoc Pham, Raffaella Bernardi, Sandro Pezzelle, Marco Baroni, Gemma Boleda, and Raquel Fern ´andez. The lambada dataset: Word prediction requiring a broad discourse context.arXiv preprint arXiv:1606.06031,
work page internal anchor Pith review Pith/arXiv arXiv
-
[31]
Jackson Petty, Sjoerd van Steenkiste, Ishita Dasgupta, Fei Sha, Dan Garrette, and Tal Linzen. The impact of depth on compositional generalization in transformer language models.arXiv preprint arXiv:2310.19956,
-
[32]
Mutual reason- ing makes smaller llms stronger problem-solvers.arXiv preprint arXiv:2408.06195,
Zhenting Qi, Mingyuan Ma, Jiahang Xu, Li Lyna Zhang, Fan Yang, and Mao Yang. Mutual reason- ing makes smaller llms stronger problem-solvers.arXiv preprint arXiv:2408.06195,
-
[33]
Yangjun Ruan, Chris J Maddison, and Tatsunori Hashimoto. Observational scaling laws and the predictability of language model performance.arXiv preprint arXiv:2405.10938,
-
[34]
Nikhil Sardana, Jacob Portes, Sasha Doubov, and Jonathan Frankle. Beyond chinchilla-optimal: Accounting for inference in language model scaling laws.arXiv preprint arXiv:2401.00448,
-
[35]
Outrageously Large Neural Networks: The Sparsely-Gated Mixture-of-Experts Layer
Noam Shazeer, Azalia Mirhoseini, Krzysztof Maziarz, Andy Davis, Quoc Le, Geoffrey Hinton, and Jeff Dean. Outrageously large neural networks: The sparsely-gated mixture-of-experts layer. arXiv preprint arXiv:1701.06538,
work page internal anchor Pith review Pith/arXiv arXiv
-
[36]
Megatron-LM: Training Multi-Billion Parameter Language Models Using Model Parallelism
Mohammad Shoeybi, Mostofa Patwary, Raul Puri, Patrick LeGresley, Jared Casper, and Bryan Catanzaro. Megatron-lm: Training multi-billion parameter language models using model par- allelism.arXiv preprint arXiv:1909.08053,
work page internal anchor Pith review Pith/arXiv arXiv 1909
-
[37]
Scaling LLM Test-Time Compute Optimally can be More Effective than Scaling Model Parameters
Charlie Snell, Jaehoon Lee, Kelvin Xu, and Aviral Kumar. Scaling llm test-time compute optimally can be more effective than scaling model parameters.arXiv preprint arXiv:2408.03314,
work page internal anchor Pith review Pith/arXiv arXiv
-
[38]
Luca Soldaini, Rodney Kinney, Akshita Bhagia, Dustin Schwenk, David Atkinson, Russell Authur, Ben Bogin, Khyathi Chandu, Jennifer Dumas, Yanai Elazar, et al. Dolma: An open corpus of three trillion tokens for language model pretraining research.arXiv preprint arXiv:2402.00159,
-
[39]
Yi Tay, Mostafa Dehghani, Jinfeng Rao, William Fedus, Samira Abnar, Hyung Won Chung, Sharan Narang, Dani Yogatama, Ashish Vaswani, and Donald Metzler. Scale efficiently: Insights from pre-training and fine-tuning transformers.arXiv preprint arXiv:2109.10686,
-
[40]
Gemma: Open Models Based on Gemini Research and Technology
Gemma Team, Thomas Mesnard, Cassidy Hardin, Robert Dadashi, Surya Bhupatiraju, Shreya Pathak, Laurent Sifre, Morgane Rivi `ere, Mihir Sanjay Kale, Juliette Love, et al. Gemma: Open models based on gemini research and technology.arXiv preprint arXiv:2403.08295, 2024a. Gemma Team, Morgane Riviere, Shreya Pathak, Pier Giuseppe Sessa, Cassidy Hardin, Surya Bh...
work page internal anchor Pith review Pith/arXiv arXiv
-
[41]
LLaMA: Open and Efficient Foundation Language Models
Hugo Touvron, Thibaut Lavril, Gautier Izacard, Xavier Martinet, Marie-Anne Lachaux, Timoth ´ee Lacroix, Baptiste Rozi `ere, Naman Goyal, Eric Hambro, Faisal Azhar, et al. Llama: Open and efficient foundation language models.arXiv preprint arXiv:2302.13971,
work page internal anchor Pith review Pith/arXiv arXiv
-
[42]
GLUE: A Multi-Task Benchmark and Analysis Platform for Natural Language Understanding
Alex Wang. Glue: A multi-task benchmark and analysis platform for natural language understand- ing.arXiv preprint arXiv:1804.07461,
work page internal anchor Pith review Pith/arXiv arXiv
-
[43]
Emergent Abilities of Large Language Models
Jason Wei, Yi Tay, Rishi Bommasani, Colin Raffel, Barret Zoph, Sebastian Borgeaud, Dani Yo- gatama, Maarten Bosma, Denny Zhou, Donald Metzler, et al. Emergent abilities of large language models.arXiv preprint arXiv:2206.07682,
work page internal anchor Pith review Pith/arXiv arXiv
-
[44]
Crowdsourcing Multiple Choice Science Questions
Johannes Welbl, Nelson F Liu, and Matt Gardner. Crowdsourcing multiple choice science questions. arXiv preprint arXiv:1707.06209,
work page internal anchor Pith review Pith/arXiv arXiv
-
[45]
Efficient Streaming Language Models with Attention Sinks
14 Preprint Guangxuan Xiao, Yuandong Tian, Beidi Chen, Song Han, and Mike Lewis. Efficient streaming language models with attention sinks.arXiv preprint arXiv:2309.17453,
work page internal anchor Pith review Pith/arXiv arXiv
-
[46]
Guangxuan Xiao, Jiaming Tang, Jingwei Zuo, Junxian Guo, Shang Yang, Haotian Tang, Yao Fu, and Song Han. Duoattention: Efficient long-context llm inference with retrieval and streaming heads.arXiv preprint arXiv:2410.10819,
-
[47]
An Yang, Anfeng Li, Baosong Yang, Beichen Zhang, Binyuan Hui, Bo Zheng, Bowen Yu, Chang Gao, Chengen Huang, Chenxu Lv, et al. Qwen3 technical report.arXiv preprint arXiv:2505.09388,
work page internal anchor Pith review Pith/arXiv arXiv
-
[48]
FlashInfer: Efficient and Customizable Attention Engine for LLM Inference Serving
Zihao Ye, Lequn Chen, Ruihang Lai, Wuwei Lin, Yineng Zhang, Stephanie Wang, Tianqi Chen, Baris Kasikci, Vinod Grover, Arvind Krishnamurthy, et al. Flashinfer: Efficient and customizable attention engine for llm inference serving.arXiv preprint arXiv:2501.01005,
work page internal anchor Pith review Pith/arXiv arXiv
-
[49]
Native Sparse Attention: Hardware-Aligned and Natively Trainable Sparse Attention
Jingyang Yuan, Huazuo Gao, Damai Dai, Junyu Luo, Liang Zhao, Zhengyan Zhang, Zhenda Xie, YX Wei, Lean Wang, Zhiping Xiao, et al. Native sparse attention: Hardware-aligned and natively trainable sparse attention.arXiv preprint arXiv:2502.11089,
work page internal anchor Pith review Pith/arXiv arXiv
-
[50]
HellaSwag: Can a Machine Really Finish Your Sentence?
Rowan Zellers, Ari Holtzman, Yonatan Bisk, Ali Farhadi, and Yejin Choi. Hellaswag: Can a ma- chine really finish your sentence?arXiv preprint arXiv:1905.07830,
work page internal anchor Pith review Pith/arXiv arXiv 1905
-
[51]
TinyLlama: An Open-Source Small Language Model
Peiyuan Zhang, Guangtao Zeng, Tianduo Wang, and Wei Lu. Tinyllama: An open-source small language model.arXiv preprint arXiv:2401.02385,
work page internal anchor Pith review Pith/arXiv arXiv
-
[52]
It was not used to generate research ideas
15 Preprint A LLM USAGE We used an LLM to improve the writing by correcting grammar in our draft. It was not used to generate research ideas. B OPEN-WEIGHTEDMODELARCHITECTURES Table 3 presents an overview of the open-weight model architectures utilized in this paper. Table 3:Open-Weighted Model Architectures:We list the architectural configurations of all...
work page 2048
-
[53]
24 25 26 27 Batch Size 0 2000 4000 6000 8000Throughput (tokens/s) d=1024 d=2048 d=4096 24 25 26 27 Batch Size 0 1000 2000 3000 4000Throughput (tokens/s) d=1536 d=3072 d=6144 24 25 26 27 Batch Size 0 500 1000 1500 2000 2500 3000Throughput (tokens/s) d=2048 d=4096 d=8192 Figure 8:Hidden size on Inference Throughput:(left) 1B model variants; (center) 3B mode...
work page 2000
-
[54]
We observe that outlier data points harm the scaling law fit. Moreover, while multiplicative and additive calibrations differ in formulation, their MSE and Spearman values remain nearly identical. Dots denote the data points used for fitting, while crosses indicate the test data points. 2.6 2.8 3.0 3.2 3.4 3.6 Actual Loss 2.6 2.8 3.0 3.2 3.4 3.6Predicted ...
work page 2020
-
[55]
Table 6:Detailed Results on Downstream Tasks for 1B Models:In this table, we show detailed results of 1B models over 9 downstream tasks. Downstream Tasks LLaMA-3.2-1B Panda-1B Surefire-1B Arc-Easy 58.8 60.9 59.7 Arc-Challenge 29.8 28.9 30.2 LAMBADA 52.8 55.1 52.0 HellaSwag 56.9 58.4 56.6 OpenBookQA 32.0 33.2 32.0 PIQA 73.6 75.2 73.0 SciQ 84.8 87.2 84.9 Wi...
work page 2048
discussion (0)
Sign in with ORCID, Apple, or X to comment. Anyone can read and Pith papers without signing in.