Item Response Scaling Laws: A Measurement Theory Approach for Efficient and Generalizable Neural Scaling Estimation

Rylan Schaeffer; Sang Truong; Sanmi Koyejo; Yuheng Tu

arXiv: 2606.07616 · v1 · pith:KVFYRKM3new · submitted 2026-05-29 · 💻 cs.LG · cs.AI · cs.CL


Pith reviewed 2026-06-28 22:46 UTC · model grok-4.3

classification 💻 cs.LG · cs.AI · cs.CL

keywords item response theory · scaling laws · language models · efficient evaluation · beta-irt · neural scaling · measurement theory


The pith

IRSL uses Item Response Theory to estimate scaling laws from 99.9 percent fewer questions after one calibration.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper introduces Item Response Scaling Laws to derive performance scaling for language models without running full evaluations on every checkpoint and benchmark. It applies Item Response Theory to separate each model's underlying ability from the properties of individual questions, which shrinks the number of parameters needed from order M times N down to order M plus N. After a single calibration pass on existing model responses, the method produces reliable scaling curves from only 50 questions per benchmark. The same ability estimates also support performance predictions on other benchmarks that share the same measurement goal, and the approach covers both pre-training and test-time scaling settings.

Core claim

IRSL factorizes scaling-law estimation by fitting a Beta-IRT model to empirical probability responses, thereby disentangling latent model ability from question characteristics and reducing estimation complexity from O(M × N) to O(M + N) while preserving decision accuracy and enabling cross-benchmark generalization after one-time calibration.
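A minimal sketch of the factorization, assuming the 1PL form R_ij ≈ σ(θ_i − z_j) given in the Figure 1 caption. It substitutes a closed-form least-squares fit in logit space for the paper's Beta-IRT likelihood, so it illustrates the parameter counting rather than the authors' estimator. The function names and toy numbers are hypothetical.

```python
import math

def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))

def logit(p, eps=1e-6):
    # Clamp so probabilities of exactly 0 or 1 stay finite.
    p = min(max(p, eps), 1.0 - eps)
    return math.log(p / (1.0 - p))

def fit_additive_1pl(R):
    """Least-squares fit of logit(R[i][j]) ~ theta[i] - z[j], with mean(z) = 0.

    Stand-in for Beta-IRT calibration (assumption): it estimates M abilities
    and N difficulties instead of M * N separate cell values.
    """
    M, N = len(R), len(R[0])
    logits = [[logit(p) for p in row] for row in R]
    grand_mean = sum(sum(row) for row in logits) / (M * N)
    # Row means give the abilities once the difficulties are centred at zero.
    theta = [sum(row) / N for row in logits]
    # Centred column means, with the sign flipped, give the difficulties.
    z = [grand_mean - sum(logits[i][j] for i in range(M)) / M
         for j in range(N)]
    return theta, z

# Toy response matrix built from known parameters (hypothetical values).
theta_true = [-1.0, 0.0, 2.0]
z_true = [-0.5, 0.5, 0.25, -0.25]          # sums to zero for identifiability
R = [[sigmoid(t - d) for d in z_true] for t in theta_true]

theta_hat, z_hat = fit_additive_1pl(R)
n_factored = len(theta_hat) + len(z_hat)   # M + N parameters
n_full = len(R) * len(R[0])                # M * N cells
```

With 3 models and 4 questions the gap between M + N and M × N is small; at the scale quoted in the abstract (6,612 checkpoints and 37,682 questions) it is several orders of magnitude.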

What carries the argument

Beta-IRT model inside the IRSL framework, which maps probability responses to separate latent model abilities and question parameters.

If this is right

Scaling estimates become feasible with only 50 questions per benchmark after one-time calibration on existing model responses.
Decision accuracy on scaling-law tasks matches or exceeds that of traditional full-evaluation methods.
Latent model abilities estimated once can forecast performance on new benchmarks that share the same measurement objective.
The factorization applies equally to pre-training downstream scaling across thousands of checkpoints and to test-time scaling with multiple samples per question.
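The 50-question claim can be illustrated with a short sketch. It assumes item difficulties are already fixed by calibration and that responses follow the 1PL link from the Figure 1 caption; the least-squares ability estimate, the random (not adaptive) question selection, and all numbers are simplifying assumptions, not the paper's procedure.

```python
import math
import random

def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))

def logit(p, eps=1e-6):
    p = min(max(p, eps), 1.0 - eps)
    return math.log(p / (1.0 - p))

def estimate_ability(responses, z_subset):
    """Estimate theta for a new model from a few calibrated items.

    Assumes logit(p_j) ~ theta - z_j with z_j fixed from calibration, so each
    item yields one estimate logit(p_j) + z_j; their mean is the least-squares fit.
    """
    return sum(logit(p) + z for p, z in zip(responses, z_subset)) / len(responses)

def predict_benchmark_mean(theta, z_all):
    # Predicted mean probability of a correct answer over every calibrated item.
    return sum(sigmoid(theta - z) for z in z_all) / len(z_all)

rng = random.Random(0)
z_all = [rng.gauss(0.0, 1.0) for _ in range(1000)]   # hypothetical calibrated difficulties
subset = rng.sample(range(len(z_all)), 50)           # query only 50 questions

theta_new = 0.7                                      # hidden ability of a new checkpoint
observed = [sigmoid(theta_new - z_all[j]) for j in subset]   # noise-free toy responses

theta_hat = estimate_ability(observed, [z_all[j] for j in subset])
predicted = predict_benchmark_mean(theta_hat, z_all)
actual = predict_benchmark_mean(theta_new, z_all)
```

Because the toy responses are noise-free, the estimate recovers the hidden ability; with real responses the quality of the estimate would depend on how well the calibrated model fits.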

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

Repeated tracking of scaling behavior during model training could become far less expensive if the calibration step is amortized across many checkpoints.
The same separation of ability and item parameters might extend to evaluation settings outside language models whenever probabilistic responses are available.
Benchmarks could be grouped or redesigned around shared measurement objectives to maximize the reuse of ability estimates.

Load-bearing premise

Beta-IRT can separate model ability from question characteristics using probability responses without losing the information needed for accurate scaling curves or cross-benchmark predictions.

What would settle it

Full-benchmark scaling curves on a new collection of models differ substantially from the curves obtained by applying the calibrated IRSL model to only 50 questions, or the ability estimates fail to predict performance on a benchmark sharing the claimed measurement objective.
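A rough sketch of how such a check could be scored. MAE between curves is the metric named in the figure captions; the pairwise ranking agreement is only one plausible reading of "decision accuracy" and is an assumption here, as are the curve values.

```python
from itertools import combinations

def mean_absolute_error(curve_a, curve_b):
    return sum(abs(a - b) for a, b in zip(curve_a, curve_b)) / len(curve_a)

def pairwise_ranking_agreement(predicted, actual):
    """Fraction of model pairs ordered the same way by both score lists."""
    pairs = list(combinations(range(len(actual)), 2))
    agree = sum(
        1 for i, j in pairs
        if (predicted[i] - predicted[j]) * (actual[i] - actual[j]) > 0
    )
    return agree / len(pairs)

# Hypothetical accuracy curves at five compute levels.
full_curve = [0.31, 0.38, 0.47, 0.55, 0.62]      # full-benchmark evaluation
subset_curve = [0.30, 0.40, 0.46, 0.57, 0.61]    # calibrated model, 50 questions

mae = mean_absolute_error(full_curve, subset_curve)
agreement = pairwise_ranking_agreement(subset_curve, full_curve)
```

A large MAE or a low agreement on a fresh collection of models would count against the claim.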

Figures

Figures reproduced from arXiv: 2606.07616 by Rylan Schaeffer, Sang Truong, Sanmi Koyejo, Yuheng Tu.

**Figure 1.** IRSL reduces scaling law estimation from O(M × N) to O(M + N) by factorizing model ability from question difficulty. Left: The response matrix R records empirical probabilities across LMs and benchmark questions; sparse rows for new LMs illustrate query efficiency via adaptive testing. Center-left: IRT decomposes R into LM abilities θ (orange) and question difficulties z (blue), so that R_ij ≈ σ(θ_i − z_j). Cen…

**Figure 2.** Beta-IRT achieves reliable calibration with as few as 2 test takers, requiring 30–60× fewer than Binary-IRT. We report RMSE (Left) and Correlation (Right) for both the 1PL model (Top) and the 2PL model (Bottom) as a function of the number of test takers M. Error bars indicate ±1 standard deviation over 10 trials.

**Figure 3.** Beta-IRT provides more robust scaling law estimates, especially on lower-quality benchmarks. Decision Accuracy vs. Proportion of Target FLOPs across 10 benchmarks. We iteratively fit scaling laws by including larger models and extrapolating to the target size to predict benchmark accuracy rankings. Results are averaged over five random train-test splits. Black lines denote Traditional Scaling; Blue and Red…

**Figure 4.** Beta-IRT effectively captures the underlying response structure across all 10 benchmarks. Correlation between Beta-IRT 2PL predicted p_{Correct Choice} (x-axis) and empirical p_{Correct Choice} (y-axis), visualized using 2-D KDE contour plots. The Pearson correlation coefficient (ρ) is reported for each benchmark, with marginal histograms showing the p_{Correct Choice} distribution. The corresponding results for 1P…

**Figure 5.** IRSL accurately predicts scaling trends on harder sets using the ability estimated from easy sets alone. (Left) Within-benchmark transfer on OpenBookQA. (Right) Cross-benchmark transfer from ARC Easy to ARC Challenge. Solid lines represent the Ground Truth (GT) scaling curves, while dashed lines represent the estimated curves where LM ability is derived solely from the easy set.

**Figure 6.** The ability θ estimated by IRSL is robustly transferable across benchmark sets. MAE distribution for hard set estimation across all benchmarks and LM data mixtures. We report the MAE between the ground truth scaling curve and the estimated curve for two settings: Within-Benchmark Transfer (blue) and Cross-Benchmark Transfer (red).

**Figure 7.** IRSL yields more reliable test-time scaling estimates than traditional approaches given a limited query budget. Comparison of three test-time scaling curves: Ground Truth, Traditional scaling law, and IRSL, for two representative LM-benchmark pairs in the test set. We plot −log pass@k against the number of samples k.

**Figure 9.** Beta-IRT predicted pass@1 strongly correlates with empirical pass@1 across all test-time benchmarks. Correlation between Beta-IRT 1PL predicted pass@1 (x-axis) and empirical pass@1 (y-axis), visualized using 2-D KDE contour plots. The Pearson correlation coefficient (ρ) is reported for each benchmark. The corresponding results for the 2PL variant are provided in …

**Figure 8.** IRSL consistently outperforms Traditional scaling across nearly all LM-benchmark pairs. We visualize the distribution of the performance gap (Traditional MAE − IRSL MAE) on four benchmarks across 100 random train-test splits. The distributions are consistently concentrated to the right of the zero line (red line), which indicates that IRSL achieves a lower MAE and thus provides a more accurate estimate.

**Figure 11.** Consistently low MAE confirms that test-time IRSL ability is transferable across difficulty levels. We report the MAE between the ground truth scaling curve and the estimated curve for two settings: Within-Benchmark Transfer (blue) and Cross-Benchmark Transfer (red). The consistent low MAE values indicate that the ability θ estimated by IRSL enables reliable performance forecasting on benchmark sets wit…

**Figure 12.** Empirical linear relationship between θ and log FLOP for Beta-IRT 2PL. The trend is similar for Binary-IRT and 1PL variants.

**Figure 13.** Traditional scaling law step 1: L ≈ α · FLOP^{−β} + γ. Representative LM data mixture across all 10 benchmarks. The trend is consistent across other data mixtures.

**Figure 14.** Traditional scaling law step 2: Performance(i, D) ≈ a · σ(b · (L − l_0)) + c. Representative LM data mixture across all 10 benchmarks. The trend is consistent across other data mixtures.

**Figure 15.** IRSL step 1: θ_i ≈ a · log(FLOP_i) + b. Representative LM data mixture across all 10 benchmarks. The trend is consistent across other data mixtures.

**Figure 16.** Beta-IRT 1PL predicted p_{Correct Choice} correlates strongly with empirical p_{Correct Choice}.

**Figure 17.** Beta-IRT 2PL curve on a single question for each benchmark. The x-axis is the ability parameter θ, and the y-axis is p_{Correct Choice}. The red line shows the fitted Beta-IRT curve. The blue dots represent the empirical p_{Correct Choice}; each dot corresponds to an LM checkpoint in the test set.

**Figure 18.** Beta-IRT 1PL curve on a single question for each benchmark.

**Figure 19.** MAE of hard set estimation across all benchmarks and LM data mixtures. We report the MAE between the ground truth scaling curve and the estimated curve on the hard sets. The last row specifically corresponds to the cross-benchmark transfer from ARC Easy to ARC Challenge.

**Figure 20.** Benchmark homogeneity analysis for BoolQ, HellaSwag, and ARC Challenge. Top row: response matrix heatmaps with rows (models) sorted by mean p_{Correct Choice} and columns (items) sorted by calibrated difficulty z. Middle rows: histograms of calibrated item difficulty z and discrimination d. Bottom row: Test Information Function (TIF) per item, I(θ)/N. BoolQ and HellaSwag exhibit highly concentrated item para…

**Figure 21.** The transfer benchmark pairs exhibit strong convergent validity in estimated LM ability. The x-axis shows the estimated ability θ on the source benchmark, and the y-axis shows the estimated ability θ on the target benchmark.

**Figure 22.** Most pre-training benchmarks share a strongly aligned latent ability. The x-axis and y-axis show pre-training benchmarks, and each cell reports the Pearson correlation of estimated ability θ between the corresponding benchmark pair.

**Figure 23.** Test-time benchmark abilities show weaker but still informative cross-benchmark alignment.

**Figure 24.** Correlation between Beta-IRT 2PL predicted pass@1 and empirical pass@1.

**Figure 25.** Beta-IRT 1PL curve on a single question for each test-time benchmark. The x-axis is the ability parameter θ, and the y-axis is pass@1. The red line shows the fitted Beta-IRT curve. The blue dots represent the empirical pass@1; each dot corresponds to an LM in the test set.

**Figure 26.** Beta-IRT 2PL curve on a single question for each test-time benchmark.
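The functional forms quoted in the captions for Figures 13 to 15 can be written out as a short sketch. The formulas follow the captions; the natural logarithm, the ordinary least-squares fit, and every numeric value are assumptions made for illustration, not the paper's fitting code.

```python
import math

def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))

def fit_line(xs, ys):
    """Ordinary least squares for y ~ slope * x + intercept."""
    n = len(xs)
    mean_x, mean_y = sum(xs) / n, sum(ys) / n
    sxx = sum((x - mean_x) ** 2 for x in xs)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    slope = sxy / sxx
    return slope, mean_y - slope * mean_x

def traditional_performance(flop, alpha, beta, gamma, a, b, l0, c):
    # Step 1 (Figure 13 caption): loss as a power law in compute.
    loss = alpha * flop ** (-beta) + gamma
    # Step 2 (Figure 14 caption): sigmoid link from loss to performance.
    return a * sigmoid(b * (loss - l0)) + c

def irsl_ability(flop, a, b):
    # IRSL step 1 (Figure 15 caption): ability is linear in log compute.
    return a * math.log(flop) + b

# Synthetic abilities lying exactly on a line in log FLOP (hypothetical values).
flops = [1e18, 1e19, 1e20, 1e21]
thetas = [0.5 * math.log(f) - 21.0 for f in flops]

a_hat, b_hat = fit_line([math.log(f) for f in flops], thetas)
theta_target = irsl_ability(1e22, a_hat, b_hat)   # extrapolate to a larger budget

# Toy parameters for the two-step form; b is negative so lower loss scores higher.
perf_small = traditional_performance(1e18, 400.0, 0.1, 1.5, 0.75, -1.0, 5.0, 0.25)
perf_large = traditional_performance(1e21, 400.0, 0.1, 1.5, 0.75, -1.0, 5.0, 0.25)
```

The IRSL form fits two parameters per ability curve, while the two-step traditional form fits seven per model-benchmark pair in this sketch.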

Original abstract

Scaling laws provide a fundamental framework for understanding the performance of Language Models (LMs), yet deriving them requires prohibitively expensive evaluations across thousands of checkpoints or millions of inference samples. To address this, we introduce Item Response Scaling Laws (IRSL), a unified framework that integrates Item Response Theory (IRT) within the scaling law framework. Unlike traditional approaches that treat each model-benchmark pair in isolation, IRSL disentangles latent model ability from question characteristics, factorizing the scaling law estimation for $M$ models and $N$ questions to significantly reduce parameter complexity from $O(M \times N)$ to $O(M + N)$. We instantiate IRSL with Beta-IRT, which leverages the empirical probability responses of LMs -- such as token probabilities in pre-training and pass rates in test-time sampling -- to capture richer signals than binary responses. We validate our approach across two prevalent scaling paradigms: (1) pre-training downstream scaling, using 6,612 LM checkpoints and 37,682 questions from 10 benchmarks; and (2) test-time scaling, using 12 LMs and 120 questions from 4 benchmarks with up to 2,500 samples per question. Given a one-time calibration on existing model responses, IRSL yields more reliable scaling estimates using only 50 questions per benchmark (a 99.9\% reduction), achieving comparable or superior decision accuracy to traditional approaches. Furthermore, we show that the estimated latent model abilities are generalizable, enabling accurate performance forecasting across benchmarks that share the same measurement objective.
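For the test-time setting, pass rates from repeated sampling are commonly summarized with the unbiased pass@k estimator, 1 − C(n − c, k) / C(n, k) for n samples with c correct. The sketch below applies it to hypothetical per-question counts and reports −log pass@k, the quantity plotted in Figure 7. Whether the paper uses this exact estimator is an assumption.

```python
import math

def pass_at_k(n, c, k):
    """Unbiased pass@k estimate from n samples with c correct (requires k <= n)."""
    if n - c < k:
        # Fewer than k incorrect samples: every size-k draw contains a correct one.
        return 1.0
    return 1.0 - math.comb(n - c, k) / math.comb(n, k)

def neg_log_pass_at_k(per_question_counts, k):
    """-log of the benchmark-mean pass@k; counts are (n, c) pairs per question."""
    mean_pass = sum(pass_at_k(n, c, k) for n, c in per_question_counts)
    mean_pass /= len(per_question_counts)
    return -math.log(mean_pass)

# Hypothetical counts: (samples drawn, samples correct) for three questions.
counts = [(10, 0), (10, 5), (10, 10)]

curve = {k: neg_log_pass_at_k(counts, k) for k in (1, 2, 5)}
```

The curve decreases as k grows, since drawing more samples per question can only raise the chance that at least one is correct.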

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Desk Editor's Note (private letter to a colleague)

IRSL applies Beta-IRT to factor scaling-law fits down to O(M+N) and claims 99.9% fewer questions with no accuracy loss, but the information-preservation step is the part that still needs checking.


The main thing here is a practical efficiency claim: once you fit Beta-IRT on a set of existing model responses, you can estimate scaling curves from only 50 questions per benchmark instead of the usual thousands, and the latent abilities are supposed to transfer across related benchmarks.

What the work actually does is take the standard scaling-law setup (performance as a function of compute or samples) and replace the per-model-per-question matrix with an IRT factorization. They use the richer probability outputs (token probs or pass rates) rather than binary correctness, run the calibration once, then recover the scaling parameters from the reduced set. The experiments cover a large pre-training sweep (over 6k checkpoints, 10 benchmarks) and a smaller test-time scaling case.

The soft spot is exactly the one the stress-test flags. The 99.9% reduction only works if the Beta-IRT disentanglement keeps the information that actually drives the scaling slope and intercept. If the item parameters absorb too much of the signal or if the latent θ_m values shift when you move to a new benchmark, the downstream scaling estimates will be biased or noisy even if the calibration looked good. The abstract does not show the equations or the error analysis that would let a reader judge whether that loss is small in practice.

The paper is aimed at groups that run repeated scaling-law studies and want to lower the evaluation budget. A reader who already works with IRT or who needs to decide which checkpoints to train next will get the most out of it.

It is worth sending to referees. The efficiency target is real, the scale of the pre-training experiment is substantial, and the IRT angle is new enough in this literature that a careful review can sort out whether the factorization holds up.

Referee Report

3 major / 2 minor

Summary. The paper introduces Item Response Scaling Laws (IRSL), a framework integrating Item Response Theory (IRT) with neural scaling laws. It factorizes estimation for M models and N questions from O(M×N) to O(M+N) by disentangling latent model abilities from question characteristics via Beta-IRT applied to probability responses (token probs or pass rates). After one-time calibration, it claims reliable scaling estimates from only 50 questions per benchmark (99.9% reduction), with comparable or superior decision accuracy to full evaluation, plus generalizable latent abilities enabling cross-benchmark forecasting for shared measurement objectives. Validation covers pre-training scaling (6,612 checkpoints, 37,682 questions across 10 benchmarks) and test-time scaling (12 LMs, 120 questions across 4 benchmarks).

Significance. If the central claims hold after addressing validation gaps, IRSL could substantially lower the cost of deriving and applying scaling laws for language models, enabling more frequent and broader evaluations while preserving accuracy. The reported reduction to 50 questions and cross-benchmark generalization would be particularly impactful for resource-intensive pre-training and test-time studies.

major comments (3)

[Abstract; §3 (Beta-IRT instantiation)] The central claim rests on Beta-IRT producing θ_m estimates from probability responses that remain invariant enough for scaling-law recovery and cross-benchmark forecasting after reduction to 50 questions. The abstract and described validation provide no equations or derivation showing that the scaling parameters are independent of the one-time calibration fit rather than reducing to quantities defined by the fitted item parameters.
[§5] §5 (experiments): the reported 99.9% reduction and comparable/superior decision accuracy on the 6,612-checkpoint and 12-LM datasets must be supported by explicit error bars, ablation on question-subset selection, and direct comparison of scaling-curve parameters (not just downstream decisions) against the full O(M×N) baseline; without these, post-hoc selection or information loss from the Beta likelihood cannot be ruled out.
[§3; §5] The assumption that the Beta likelihood correctly captures the response distribution without misspecification bias is load-bearing for both the efficiency claim and the generalization result, yet no diagnostic (e.g., posterior predictive checks or likelihood-ratio tests against alternative IRT models) is referenced in the validation sections.

minor comments (2)

Clarify the exact functional form of the scaling law once expressed in terms of the latent abilities θ_m and item parameters; an explicit equation would make the O(M+N) factorization transparent.
The description of “benchmarks that share the same measurement objective” would benefit from a precise operational definition or similarity metric used to select the cross-benchmark forecasting pairs.
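One way the requested functional form could be written is sketched below. It is assembled from the figure captions (the 1PL link in Figure 1 and the log-linear ability law in Figure 15), not taken from the paper's equations; the averaging over items in the last line is an assumption about how benchmark-level performance is aggregated.

```latex
\begin{align}
R_{ij} &\approx \sigma(\theta_i - z_j)
  && \text{(1PL link, Figure 1)} \\
\theta_i &\approx a \cdot \log(\mathrm{FLOP}_i) + b
  && \text{(IRSL step 1, Figure 15)} \\
\mathrm{Performance}(i, D) &\approx \frac{1}{N} \sum_{j=1}^{N}
  \sigma\bigl(a \cdot \log(\mathrm{FLOP}_i) + b - z_j\bigr)
  && \text{(assumed aggregation over the $N$ items of $D$)}
\end{align}
```

Written this way, the fitted quantities are the M abilities θ_i and the N difficulties z_j, which is where the O(M + N) count comes from.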

Simulated Author's Rebuttal

3 responses · 0 unresolved

We thank the referee for the detailed and constructive feedback. The comments highlight important areas for strengthening the theoretical justification, experimental validation, and model assumptions. We have revised the manuscript to address these points and provide point-by-point responses below.


Referee: [Abstract; §3 (Beta-IRT instantiation)] The central claim rests on Beta-IRT producing θ_m estimates from probability responses that remain invariant enough for scaling-law recovery and cross-benchmark forecasting after reduction to 50 questions. The abstract and described validation provide no equations or derivation showing that the scaling parameters are independent of the one-time calibration fit rather than reducing to quantities defined by the fitted item parameters.

Authors: We agree that an explicit derivation strengthens the central claim. In the revised manuscript we have added a new subsection (3.2) that derives the post-calibration invariance: after fixing the item parameters (a_j, b_j) from the one-time Beta-IRT fit, the model ability estimates θ_m enter the scaling law as an independent variable, so that the fitted scaling parameters (α, β) in the log-linear form depend only on the θ_m sequence and not on the particular item-parameter values. This separation is shown algebraically and is the basis for both the 50-question reduction and cross-benchmark forecasting. (Revision: yes)
Referee: [§5] §5 (experiments): the reported 99.9% reduction and comparable/superior decision accuracy on the 6,612-checkpoint and 12-LM datasets must be supported by explicit error bars, ablation on question-subset selection, and direct comparison of scaling-curve parameters (not just downstream decisions) against the full O(M×N) baseline; without these, post-hoc selection or information loss from the Beta likelihood cannot be ruled out.

Authors: We accept that the original experiments lacked these controls. The revised §5 now reports (i) bootstrap-derived 95% confidence intervals on all accuracy and scaling-parameter estimates, (ii) an ablation comparing random, difficulty-stratified, and information-gain question subsets, and (iii) direct side-by-side fits of the scaling-curve slope and intercept for the reduced versus full O(M×N) evaluations, confirming that the recovered parameters agree within the reported error bars. (Revision: yes)
Referee: [§3; §5] The assumption that the Beta likelihood correctly captures the response distribution without misspecification bias is load-bearing for both the efficiency claim and the generalization result, yet no diagnostic (e.g., posterior predictive checks or likelihood-ratio tests against alternative IRT models) is referenced in the validation sections.

Authors: We have added posterior predictive checks (Appendix C) for representative checkpoints and questions, showing that replicated Beta draws closely match the empirical response histograms. We also include a limited comparison to a Gaussian IRT variant on the same data; the resulting θ_m ranks and downstream scaling predictions remain consistent. Comprehensive likelihood-ratio tests across the full 6,612-checkpoint corpus would require prohibitive additional compute; we therefore treat the current diagnostics as sufficient for the claims while noting the limitation in the revised text. (Revision: partial)
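The bootstrap intervals mentioned in the second response could look roughly like the sketch below: a percentile bootstrap over checkpoints for the slope of θ against log FLOP. The rebuttal does not specify the resampling design, so resampling checkpoints with replacement, the interval construction, and the data are all assumptions.

```python
import random

def fit_line(xs, ys):
    """Ordinary least squares for y ~ slope * x + intercept."""
    n = len(xs)
    mean_x, mean_y = sum(xs) / n, sum(ys) / n
    sxx = sum((x - mean_x) ** 2 for x in xs)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    slope = sxy / sxx
    return slope, mean_y - slope * mean_x

def bootstrap_slope_ci(xs, ys, n_boot=2000, level=0.95, seed=0):
    """Percentile bootstrap interval for the slope, resampling (x, y) pairs."""
    rng = random.Random(seed)
    n = len(xs)
    slopes = []
    while len(slopes) < n_boot:
        idx = [rng.randrange(n) for _ in range(n)]
        bx = [xs[i] for i in idx]
        if len(set(bx)) < 2:
            continue  # degenerate resample: the slope is undefined
        by = [ys[i] for i in idx]
        slopes.append(fit_line(bx, by)[0])
    slopes.sort()
    lower = slopes[int((1.0 - level) / 2.0 * n_boot)]
    upper = slopes[int((1.0 + level) / 2.0 * n_boot) - 1]
    return lower, upper

# Hypothetical abilities: a line in log FLOP plus small fixed perturbations.
log_flops = [41.4, 42.6, 43.7, 44.9, 46.1, 47.2, 48.4, 49.5]
noise = [0.03, -0.02, 0.01, -0.03, 0.02, -0.01, 0.03, -0.02]
thetas = [0.5 * x - 21.0 + e for x, e in zip(log_flops, noise)]

slope_hat, _ = fit_line(log_flops, thetas)
ci_low, ci_high = bootstrap_slope_ci(log_flops, thetas)
```

Running the same routine on the reduced and the full evaluations and checking whether the intervals overlap would be one way to make the "agree within the reported error bars" statement concrete.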

Circularity Check

0 steps flagged

No circularity: IRSL framework derives scaling estimates from independent IRT factorization without reducing to input fits by construction.

full rationale

The abstract describes a one-time calibration on existing responses to enable O(M+N) factorization via Beta-IRT, followed by estimation on 50 questions and cross-benchmark generalization of latent abilities. No equations or derivation steps are provided that equate the final scaling-law parameters to the calibration fit itself, nor is any self-citation invoked as a uniqueness theorem or ansatz source. The approach treats the IRT disentanglement as a measurement model whose outputs (θ_m) are then used for separate scaling-law fitting; this separation keeps the central claim independent of the calibration inputs. The 99.9% reduction claim is an empirical efficiency statement, not a definitional equivalence. Absent load-bearing self-citations or fitted-input predictions in the given text, the derivation chain remains self-contained against external benchmarks.

Axiom & Free-Parameter Ledger

0 free parameters · 1 axioms · 0 invented entities

Abstract-only review; no explicit free parameters, axioms, or invented entities can be extracted. The core modeling choice (Beta-IRT on probability responses) implicitly assumes IRT applicability to LM outputs.

axioms (1)

Domain assumption: Item Response Theory can disentangle latent model ability from question characteristics for language model responses.
This is the foundational premise enabling the O(M+N) factorization and generalization claims.



Reference graph

Works this paper leans on

29 extracted references · 23 canonical work pages · 5 internal anchors

[1]

Bahri, Y ., Dyer, E., Kaplan, J., Lee, J., and Sharma, U

URLhttps://arxiv.org/abs/2410.16531. Bahri, Y ., Dyer, E., Kaplan, J., Lee, J., and Sharma, U. Explaining neural scaling laws.arXiv preprint arXiv:2102.06701,

work page arXiv
[2]

H., Soldaini, L., Smith, N

Bhagia, A., Liu, J., Wettig, A., Heineman, D., Tafjord, O., Jha, A. H., Soldaini, L., Smith, N. A., Groeneveld, D., Koh, P. W., et al. Establishing task scaling laws via compute-efficient model ladders.arXiv preprint arXiv:2412.04403,

work page arXiv
[3]

Large Language Monkeys: Scaling Inference Compute with Repeated Sampling

Brown, B., Juravsky, J., Ehrlich, R., Clark, R., Le, Q. V ., R´e, C., and Mirhoseini, A. Large language mon- keys: Scaling inference compute with repeated sam- pling.arXiv preprint arXiv:2407.21787,

work page internal anchor Pith review Pith/arXiv arXiv
[4]

D., Dhariwal, P., Neelakantan, A., Shyam, P., Sas- try, G., Askell, A., et al

Brown, T., Mann, B., Ryder, N., Subbiah, M., Kaplan, J. D., Dhariwal, P., Neelakantan, A., Shyam, P., Sas- try, G., Askell, A., et al. Language models are few-shot learners.Advances in Neural Information Processing Systems, 33:1877–1901,

1901
[5]

v048.i06

doi: 10.18637/jss. v048.i06. URLhttps://www.jstatsoft.org/ index.php/jss/article/view/v048i06. Chang, H.-H. Psychometrics behind computerized adaptive testing.Psychometrika, 80(1):1–20,

work page doi:10.18637/jss
[6]

Evaluating Large Language Models Trained on Code

URLhttps://arxiv.org/abs/ 2107.03374. Chen, Y ., Silva Filho, T., Prudˆencio, R. B., Diethe, T., and Flach, P.β 3-irt: A new item response model and its applications. InProceedings of the 22nd International Conference on Artificial Intelligence and Statistics (AIS- TATS),

work page internal anchor Pith review Pith/arXiv arXiv
[7]

$\beta^3$-IRT: A New Item Response Model and its Applications

arXiv:1903.04016. Chen, Y ., Huang, B., Gao, Y ., Wang, Z., Yang, J., and Ji, H. Scaling laws for predicting downstream performance in llms.arXiv preprint arXiv:2410.08527,

work page internal anchor Pith review Pith/arXiv arXiv 1903
[8]

URLhttps://doi.org/ 10.1080/0266476042000214501

doi: 10.1080/ 0266476042000214501. URLhttps://doi.org/ 10.1080/0266476042000214501. Gadre, S. Y ., Smyrnis, G., Shankar, V ., Gururangan, S., Wortsman, M., Shao, R., Mercat, J., Fang, A., Li, J., Keh, S., Xin, R., Nezhurina, M., Vasiljevic, I., Jit- sev, J., Soldaini, L., Dimakis, A. G., Ilharco, G., Koh, P. W., Song, S., Kollar, T., Carmon, Y ., Dave, A....

work page doi:10.1080/0266476042000214501
[9]

Grattafiori, A., Dubey, A., Jauhri, A., Pandey, A., Kadian, A., Al-Dahle, A., Letman, A., Mathur, A., Schelten, A., Vaughan, A., et al

URLhttps://arxiv.org/ abs/2403.08540. Grattafiori, A., Dubey, A., Jauhri, A., Pandey, A., Kadian, A., Al-Dahle, A., Letman, A., Mathur, A., Schelten, A., Vaughan, A., et al. The llama 3 herd of models.arXiv preprint arXiv:2407.21783,

work page arXiv
[10]

J., Ung, M., and Williams, A

Gupta, V ., Ross, C., Pantoja, D., Passonneau, R. J., Ung, M., and Williams, A. Improving model evaluation using smart filtering of benchmark datasets. InProceedings of the 2025 Conference of the Nations of the Americas Chapter of the Association for Computational Linguis- tics: Human Language Technologies (Volume 1: Long Papers), pp. 4595–4615,

2025
[11]

Hernandez, D., Kaplan, J., Henighan, T., and McCan- dlish, S

URLhttps: //arxiv.org/abs/2508.13144. Hernandez, D., Kaplan, J., Henighan, T., and McCan- dlish, S. Scaling laws for transfer.arXiv preprint arXiv:2102.01293,

work page arXiv
[12]

Hestness, J., Narang, S., Ardalani, N., Diamos, G., Jun, H., Kianinejad, H., Patwary, M. M. A., Yang, Y ., and Zhou, Y . Deep learning scaling is predictable, empiri- cally.arXiv preprint arXiv:1712.00409,

work page internal anchor Pith review Pith/arXiv arXiv
[13]

Hoffmann, J., Borgeaud, S., Mensch, A., Buchatskaya, E., Cai, T., Rutherford, E., Casas, D. d. L., Hen- dricks, L. A., Welbl, J., Clark, A., et al. Training compute-optimal large language models.arXiv preprint arXiv:2203.15556,

work page internal anchor Pith review Pith/arXiv arXiv
[14]

Hughes, J., Price, S., Lynch, A., Schaeffer, R., Barez, F., Koyejo, S., Sleight, H., Jones, E., Perez, E., and Sharma, M

URLhttps://arxiv.org/abs/ 2509.11106. Hughes, J., Price, S., Lynch, A., Schaeffer, R., Barez, F., Koyejo, S., Sleight, H., Jones, E., Perez, E., and Sharma, M. Best-of-n jailbreaking,

work page arXiv
[15]

Best-of-n jailbreaking.arXiv preprint arXiv:2412.03556, 2024

URLhttps: //arxiv.org/abs/2412.03556. Kaplan, J., McCandlish, S., Henighan, T., Brown, T. B., Chess, B., Child, R., Gray, S., Radford, A., Wu, J., and Amodei, D. Scaling laws for neural language models. arXiv preprint arXiv:2001.08361,

work page arXiv 2001
[16]

Corre- lated errors in large language models.arXiv preprint arXiv:2506.07962,

Kim, E., Garg, A., Peng, K., and Garg, N. Corre- lated errors in large language models.arXiv preprint arXiv:2506.07962,

work page arXiv
[17]

URL https://arxiv.org/abs/2407.12844. Levi, N. A simple model of inference scaling laws.arXiv preprint arXiv:2410.16377,

work page arXiv
[18]

Lourie, N., Hu, M

doi: 10.4324/9780203056615. Lourie, N., Hu, M. Y ., and Cho, K. Scaling laws are un- reliable for downstream tasks: A reality check.arXiv preprint arXiv:2507.00885,

work page doi:10.4324/9780203056615
[19]

org/abs/2504.11393

URLhttps://arxiv. org/abs/2504.11393. Meijer, R. R. and Nering, M. L. Computerized adaptive testing: Overview and introduction.Applied Psycholog- ical Measurement, 23(3):187–194,

work page arXiv
[20]

E., Shnarch, E., Slonim, N., Shmueli-Scheuer, M., and Choshen, L

Perlitz, Y ., Bandel, E., Gera, A., Arviv, O., Dor, L. E., Shnarch, E., Slonim, N., Shmueli-Scheuer, M., and Choshen, L. Efficient benchmarking (of language mod- els). InProceedings of the 2024 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies (Volume 1: Long Papers), pp. 2519–2536,

2024
[21]

tinybenchmarks: evaluating LLMs with fewer examples.arXiv preprint arXiv:2402.14992, 2024

URLhttps://arxiv.org/ abs/2402.14992. Rasch, G.Probabilistic models for some intelligence and attainment tests.ERIC,

work page arXiv
[22]

J., and Hashimoto, T

Ruan, Y ., Maddison, C. J., and Hashimoto, T. Observa- tional scaling laws and the predictability of language model performance.arXiv preprint arXiv:2405.10938,

work page arXiv
[23]

Schaeffer, R., Schoelkopf, H., Miranda, B., Mukobi, G., Madan, V., Ibrahim, A., Bradley, H., Biderman, S., and Koyejo, S. Why has predicting downstream capabilities of frontier AI models with scale remained elusive? arXiv preprint arXiv:2406.04391.
[24]

Snell, C., Lee, J., Xu, K., and Kumar, A. Scaling LLM test-time compute optimally can be more effective than scaling model parameters. arXiv preprint arXiv:2408.03314.
[25]

Truong, S., Tu, Y., Liang, P., Li, B., and Koyejo, S. Reliable and efficient amortized model-based evaluation. arXiv preprint arXiv:2503.13335.
[26]

Wu, M., Davis, R. L., Domingue, B. W., Piech, C., and Goodman, N. Variational item response theory: Fast, accurate, and expressive. arXiv preprint arXiv:2002.00276.
[27]

Following Bhagia et al. (2024), we fit step 1 only on final checkpoints for each model size, as the learning rate schedule prevents accurate FLOP estimation on intermediate checkpoints. Figure 16 shows the correlation between the Beta-IRT 1PL predicted $p_{\text{Correct Choice}}$ and the empirical $p_{\text{Correct Choice}}$. Figures 17 and 18 show the Beta-IRT curve on a randomly sample...
[28]

exhibits a pronounced peak. We therefore view this not as a limitation of IRSL, but as a property of the benchmarks themselves. IRSL is most effective when evaluation items are sufficiently diverse and informative, and we believe this finding itself contributes toward more ...
[29]

The full correlation heatmap shows that most of the 10 pre-training benchmarks exhibit high pairwise θ correlations, with BoolQ as the notable exception; as a two-choice benchmark, BoolQ is known to have a low signal-to-noise ratio (Heineman et al., 2025). This aligns with findings from Kipnis et al. (2025) that a single common factor underlies most benchma...
