LLM Evaluation as Tensor Completion: Low Rank Structure and Semiparametric Efficiency
Pith reviewed 2026-05-10 19:35 UTC · model grok-4.3
The pith
Pairwise LLM judgments modeled as low-rank tensor observations admit semiparametrically efficient estimators for ability gaps and win probabilities.
A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.
Core claim
For a low-rank latent score tensor observed through sparse pairwise comparisons under Bradley-Terry-Luce-type models, the semiparametric efficiency bound for any smooth functional can be attained by a one-step estimator built from the efficient influence function on the low-rank tangent space, once a score-whitening step compensates for the fact that the information operator does not commute with the tangent-space projection.
What carries the argument
efficient influence function on the low-rank tangent space of the score tensor, together with the score-whitening transformation that equalizes anisotropic Fisher information
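The observation model this machinery acts on can be sketched in a few lines. Everything below (the rank-one factorization, the function names, the dimensions) is an illustrative assumption, not the paper's notation:

```python
import numpy as np

def btl_win_prob(theta_i, theta_j):
    """Bradley-Terry-Luce probability that item i beats item j."""
    return 1.0 / (1.0 + np.exp(-(theta_i - theta_j)))

def btl_score(y, theta_i, theta_j):
    """Score (gradient of the log-likelihood in theta_i) for one
    observed comparison y in {0, 1}: the residual y - p_ij."""
    return y - btl_win_prob(theta_i, theta_j)

# A latent score matrix with low-rank structure: abilities vary by
# model and by task, T = u v^T (rank one in this toy case).
rng = np.random.default_rng(0)
u = rng.normal(size=5)   # model-specific factors (assumed)
v = rng.normal(size=4)   # task-specific factors (assumed)
T = np.outer(u, v)       # latent score tensor (matrix case)

# One observed comparison on task t between models i and j.
i, j, t = 0, 1, 2
p = btl_win_prob(T[i, t], T[j, t])
y = rng.binomial(1, p)
residual = btl_score(y, T[i, t], T[j, t])  # mean zero at the truth
```

The residual y - p_ij is the raw score that the efficient influence function reweights; the whitening step discussed here acts on exactly these scores before projecting onto the low-rank tangent space.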
If this is right
- Ability gaps and win probabilities between models admit asymptotically normal estimators with valid confidence intervals.
- The procedure remains valid under the sparse and non-uniform sampling patterns typical of real LLM evaluation platforms.
- Uncertainty quantification becomes feasible for leaderboard rankings without assuming uniform observation probabilities.
- The same efficiency framework applies to any pairwise comparison data whose latent scores admit low-rank structure.
Where Pith is reading between the lines
- The approach could supply confidence intervals for rankings in sports or product recommendation settings that rely on pairwise outcomes.
- LLM platforms could replace point-estimate leaderboards with intervals that reflect the actual information in sparse human judgments.
- Empirical checks on large judgment datasets could verify whether real score tensors are close enough to low-rank for the efficiency gains to materialize.
Load-bearing premise
The latent score tensor has low-rank structure and the pairwise comparisons follow Bradley-Terry-Luce-type models with sparse non-uniform observations.
What would settle it
On data simulated from a known low-rank tensor under the Bradley-Terry-Luce model, the empirical variance of the one-step estimator would have to match the derived semiparametric efficiency bound at large sample sizes; persistent mismatch would falsify the efficiency claim.
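In the degenerate special case of two items with no tensor structure, the proposed check reduces to a few lines of Monte Carlo; `simulate_gap_mle` and all constants below are illustrative assumptions, not the paper's experiment:

```python
import numpy as np

def simulate_gap_mle(delta=0.5, n=500, reps=2000, seed=0):
    """Compare the empirical variance of the MLE of an ability gap
    with the Cramer-Rao bound 1 / (n * p * (1 - p)) implied by the
    Bernoulli Fisher information under the BTL model."""
    rng = np.random.default_rng(seed)
    p = 1.0 / (1.0 + np.exp(-delta))        # BTL win probability
    wins = rng.binomial(n, p, size=reps)    # wins per replication
    p_hat = wins / n
    delta_hat = np.log(p_hat / (1 - p_hat)) # MLE of the gap (logit)
    empirical = delta_hat.var()
    bound = 1.0 / (n * p * (1 - p))         # efficiency bound
    return empirical, bound

emp, bound = simulate_gap_mle()
ratio = emp / bound  # should approach 1 as n grows
```

The full test in the paper's setting would replace the scalar logit MLE with the one-step estimator and the scalar bound with the derived semiparametric efficiency bound; a ratio that fails to converge to 1 would falsify the claim.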
Original abstract
Large language model (LLM) evaluation platforms increasingly rely on pairwise human judgments. These data are noisy, sparse, and non-uniform, yet leaderboards are reported with limited uncertainty quantification. We study this as semiparametric inference for a low-rank latent score tensor observed through pairwise comparisons under Bradley-Terry-Luce-type models. This places LLM evaluation in a new tensor completion setting with structured observations, non-uniform sampling, and pairwise contrasts. Our target is a smooth functional $\psi(T^\star)$, including linear estimands such as ability gaps and nonlinear ones such as win probabilities. We derive the information operator on the low-rank tangent space, the efficient influence function, and the semiparametric efficiency bound, then construct a one-step debiased estimator with asymptotic normality. A central challenge is that the information operator is anisotropic and does not commute with the tangent-space projection, creating a bottleneck absent from isotropic models. We introduce a score-whitening method that equalizes local Fisher information and restores stable inference at the optimal sample-complexity scale. Our results provide a principled framework for uncertainty quantification in LLM evaluation and more broadly for inference on low-rank structures from pairwise data.
Editorial analysis
A structured set of objections, weighed in public.
Referee Report
Summary. The manuscript frames LLM evaluation from noisy, sparse, non-uniform pairwise human judgments as semiparametric inference on a low-rank latent score tensor under Bradley-Terry-Luce-type models. It derives the information operator restricted to the low-rank tangent space, the efficient influence function, and the semiparametric efficiency bound for smooth functionals such as ability gaps and win probabilities. A one-step debiased estimator is constructed and shown to achieve asymptotic normality; a score-whitening procedure is introduced to equalize local Fisher information and restore stable inference under the anisotropic information operator that arises from non-uniform sampling.
Significance. If the derivations hold, the work supplies the first rigorous efficiency theory and uncertainty quantification for LLM leaderboards, replacing ad-hoc ranking with statistically grounded inference. The low-rank tensor completion setting with structured pairwise observations is novel, and the score-whitening technique addresses a genuine technical obstacle (non-commuting anisotropic operator and tangent-space projection) that is absent from isotropic models. The framework is extensible to other pairwise-data problems and supplies falsifiable asymptotic predictions once the estimator is implemented.
major comments (2)
- [Section 4 (estimator construction) and Theorem on asymptotic normality] The central asymptotic normality claim for the one-step estimator (presumably Theorem 4 or 5) rests on the score-whitening step restoring the efficient influence function after projection onto the low-rank tangent space. The manuscript should explicitly verify that the whitened score remains orthogonal to the nuisance tangent space under the stated sparsity and non-uniformity conditions; otherwise the efficiency bound may not be attained at the optimal sample-complexity rate.
- [Section 3 (information operator) and Assumption on sampling design] The low-rank tangent space projection is used to derive the information operator, but the manuscript must confirm that the resulting operator remains invertible on the identifiable subspace when the sampling probabilities are highly non-uniform (as is typical in LLM platforms). If the minimal eigenvalue bound depends on the unknown low-rank factors, the efficiency claim becomes conditional rather than uniform.
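The construction at issue in the first comment has the standard semiparametric one-step form (a generic sketch, not the manuscript's exact display):

```latex
\hat\psi_{\mathrm{1s}} \;=\; \psi(\hat T) \;+\; \frac{1}{n}\sum_{k=1}^{n}\hat\varphi_{\mathrm{eff}}(Z_k;\hat T),
\qquad
\sqrt{n}\bigl(\hat\psi_{\mathrm{1s}}-\psi(T^\star)\bigr)\;\rightsquigarrow\;\mathcal N\bigl(0,\;\mathbb E[\varphi_{\mathrm{eff}}^2]\bigr),
```

where $\hat\varphi_{\mathrm{eff}}$ estimates the efficient influence function on the low-rank tangent space. Orthogonality of $\varphi_{\mathrm{eff}}$ to the nuisance tangent space is precisely what makes the first-order bias of the plug-in $\psi(\hat T)$ cancel, which is why the referee asks for that property to be verified after whitening.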
minor comments (3)
- [Section 2 (model)] Notation for the latent score tensor T* and the observed comparison tensor should be introduced with a single consistent symbol set in the model section to avoid confusion between the full tensor and its low-rank factorization.
- [Introduction and Section 3] The abstract states that the information operator 'does not commute with the tangent-space projection'; this should be illustrated with a small numerical example or a low-dimensional analytic counter-example in the main text so readers can see the anisotropy concretely.
- [Related work] References to prior tensor-completion and semiparametric efficiency literature (e.g., on pairwise ranking models) are present but could be expanded with one or two additional citations on anisotropic information operators in structured models.
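The kind of low-dimensional example the second minor comment requests is easy to exhibit; the following is a hypothetical two-dimensional sketch, not taken from the manuscript:

```python
import numpy as np

# An anisotropic (diagonal) information operator and an orthogonal
# projection onto a one-dimensional "tangent space" do not commute
# unless the tangent direction is an eigenvector of the operator.
I_op = np.diag([1.0, 4.0])                # unequal Fisher information
u = np.array([1.0, 1.0]) / np.sqrt(2.0)   # tangent direction
P = np.outer(u, u)                        # projection onto span{u}

commutator = P @ I_op - I_op @ P
noncommuting = np.linalg.norm(commutator) # nonzero

# Whitening by I^{-1/2} removes the anisotropy: in whitened
# coordinates the operator is the identity, which commutes with
# every projection.
W = np.diag([1.0, 0.5])                   # I_op ** (-1/2)
I_white = W @ I_op @ W                    # equals the identity
```

Adding a display of exactly this flavor to the main text would make the anisotropy bottleneck, and the role of whitening in removing it, concrete for readers.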
Simulated Author's Rebuttal
We thank the referee for the careful review and constructive comments on the technical details of our asymptotic results. We address each major comment below and indicate the revisions planned for the next version of the manuscript.
Point-by-point responses
- Referee: [Section 4 (estimator construction) and Theorem on asymptotic normality] The central asymptotic normality claim for the one-step estimator (presumably Theorem 4 or 5) rests on the score-whitening step restoring the efficient influence function after projection onto the low-rank tangent space. The manuscript should explicitly verify that the whitened score remains orthogonal to the nuisance tangent space under the stated sparsity and non-uniformity conditions; otherwise the efficiency bound may not be attained at the optimal sample-complexity rate.
  Authors: We thank the referee for this observation. The proof of Theorem 4 establishes that the whitened score is orthogonal to the nuisance tangent space by exploiting the fact that the whitening operator is constructed to preserve the range of the low-rank projection while equalizing the local information; the argument relies on the sparsity and non-uniform sampling conditions in Assumptions 3.1 and 3.3 together with the boundedness of the latent factors. To make this verification more transparent, we will insert a dedicated lemma immediately preceding Theorem 4 that isolates the orthogonality property and summarizes the key algebraic steps from the appendix. revision: yes
- Referee: [Section 3 (information operator) and Assumption on sampling design] The low-rank tangent space projection is used to derive the information operator, but the manuscript must confirm that the resulting operator remains invertible on the identifiable subspace when the sampling probabilities are highly non-uniform (as is typical in LLM platforms). If the minimal eigenvalue bound depends on the unknown low-rank factors, the efficiency claim becomes conditional rather than uniform.
  Authors: Assumption 3.2 already imposes a uniform lower bound on the minimal eigenvalue of the restricted information operator that is independent of the particular low-rank factors; the bound is expressed solely in terms of the sampling probabilities and the uniform boundedness of the latent scores (Assumption 2.1). Under the non-uniform designs typical of LLM platforms, this bound remains positive and uniform over the parameter space. We will revise the wording of Assumption 3.2 and add a short remark in Section 3 that explicitly states the uniformity of the eigenvalue bound and its implications for the efficiency claim. revision: yes
Circularity Check
No significant circularity identified
Full rationale
The paper applies standard semiparametric efficiency theory to derive the information operator on the low-rank tangent space, the efficient influence function, the semiparametric efficiency bound, and a one-step debiased estimator for the functional ψ(T★) under the stated low-rank latent tensor and BTL pairwise observation model. The score-whitening step is introduced explicitly to address the acknowledged anisotropy of the information operator. No derivation step reduces by construction to its inputs, no parameter is fitted on a subset and renamed as a prediction, and no load-bearing self-citation or imported uniqueness theorem is invoked in the provided text. The central results follow directly from the model assumptions without self-referential definitions or renaming of known empirical patterns.
Axiom & Free-Parameter Ledger
axioms (2)
- domain assumption The latent score tensor has low-rank structure
- domain assumption Pairwise comparisons follow Bradley-Terry-Luce-type models
Forward citations
Cited by 1 Pith paper
- Perturbation is All You Need for Extrapolating Language Models: Perturbing prefixes to semantic neighbors during training creates a hierarchical noise model that improves language model predictions on token sequences outside the training corpus support.
Reference graph
Works this paper leans on
- [1] Bose, A., Xiong, Z., Chi, Y., Du, S. S., Xiao, L. & Fazel, M. (2025). LoRe: Personalizing LLMs via Low-Rank Reward Modeling. arXiv preprint arXiv:2504.14439.
- [2] Bradley, R. A. & Terry, M. E. (1952). Rank analysis of incomplete block designs: I. The method of paired comparisons. Biometrika 39, 324–345.
- [3] Cai, C., Li, G., Poor, H. V. & Chen, Y. (2022). Nonconvex low-rank tensor completion from noisy data. Operations Research 70, 1219–1237.
- [4] Candès, E. J. & Recht, B. (2009). Exact matrix completion via convex optimization. Foundations of Computational Mathematics 9, 717–772.
- [5] Chao, Z., Huang, L. & Needell, D. (2021). HOSVD-based algorithm for weighted tensor completion. Journal of Imaging 7, 110.
- [6] Chiang, W.-L., Zheng, L., Sheng, Y., Angelopoulos, A. N., Li, T., Li, D., Zhang, H., Zhu, B., Jordan, M., Gonzalez, J. E. & Stoica, I. (2024). Chatbot Arena: An Open Platform for Evaluating LLMs by Human Preference. arXiv preprint.
- [7] Dong, Z., Zhang, Z., Zhou, Y., Jin, C., Wu, R. & Zhang, L. (2026). Evaluating LLMs When They Do Not Know the Answer: Statistical Evaluation of Mathematical Reasoning via Comparative Signals. arXiv preprint arXiv:2602.03061.
- [8] Duan, C., Ma, W., Xia, D. & Xu, K. (2025). Statistical Inference for Matching Decisions via Matrix Completion under Dependent Missingness. arXiv preprint arXiv:2510.26478.
- [9] Fan, J., Hou, J. & Yu, M. (2024). Uncertainty quantification of MLE for entity ranking with covariates. Journal of Machine Learning Research 25, 1–83.
- [10] Fan, J., Kwon, H. & Zhu, X. (2025). Uncertainty Quantification for Ranking with Heterogeneous Preferences. arXiv preprint arXiv:2509.01847.
- [11] Fan, J., Lou, Z., Wang, W. & Yu, M. (2026). Spectral ranking inferences based on general multiway comparisons. Operations Research 74, 161–180.
- [12] Gao, C., Shen, Y. & Zhang, A. Y. (2023). Uncertainty quantification in the Bradley–Terry–Luce model. Information and Inference: A Journal of the IMA 12, 1073–1140.
- [13] Keshavan, R. H., Montanari, A. & Oh, S. (2010). Matrix completion from a few entries. IEEE Transactions on Information Theory 56, 2980–2998.
- [14] Kolda, T. G. & Bader, B. W. (2009). Tensor decompositions and applications. SIAM Review 51, 455–500.
- [15] Koltchinskii, V., Lounici, K. & Tsybakov, A. B. (2011). Nuclear-norm penalization and optimal rates for noisy low-rank matrix completion. The Annals of Statistics 39, 2302–2329.
- [16] Li, T., Chiang, W.-L., Frick, E., Dunlap, L., Zhu, B., Gonzalez, J. E. & Stoica, I. (2024). The Arena-Hard Pipeline. Arena blog, April 19, 2024.
- [17] LMSYS (2025). arena-human-preference-140k.
- [18] Luce, R. D. (1959). Individual Choice Behavior: A Theoretical Analysis. Wiley.
- [19] Ma, W. & Xia, D. (2024). Statistical inference in tensor completion: Optimal uncertainty quantification and statistical-to-computational gaps. arXiv preprint arXiv:2410.11225.
- [20] Mao, X., Chen, S. X. & Wong, R. K. W. (2019). Matrix completion with covariate information. Journal of the American Statistical Association 114, 198–210.
- [21] Mao, X., Wang, Z. & Yang, S. (2023). Matrix completion under complex survey sampling. Annals of the Institute of Statistical Mathematics 75, 463–492.
- [22] Negahban, S. & Wainwright, M. J. (2012). Restricted strong convexity and weighted matrix completion: Optimal bounds with noise. Journal of Machine Learning Research 13, 1665–1697.
- [23] Ouyang, L., Wu, J., Jiang, X., Almeida, D., Wainwright, C., Mishkin, P., Zhang, C., Agarwal, S., Slama, K., Ray, A. et al. (2022). Training language models to follow instructions with human feedback. Advances in Neural Information Processing Systems 35.
- [24] Petrova, N., Gordon, A. & Blindow, E. (2026). Unpacking Human Preference for LLMs: Demographically Aware Evaluation with the HUMAINE Framework. arXiv preprint arXiv:2603.04409.
- [25] Singh, S., Nan, Y., Wang, A., D'souza, D., Kapoor, S., Üstün, A., Koyejo, S., Deng, Y., Longpre, S., Smith, N. A. et al. (2025). The leaderboard illusion. arXiv preprint arXiv:2504.20879.
- [26] Su, W. (2026). Do large language models (really) need statistical foundations? The Annals of Applied Statistics 20, 724–743.
- [27] Arena Team (2025). Arena-Rank: Open Sourcing the Leaderboard Methodology. Arena blog, December 18, 2025.
- [28] Xu, E., Ye, K., Zhou, H., Zhu, L., Quinzan, F. & Shi, C. (2025). Doubly robust alignment for large language models. arXiv preprint arXiv:2506.01183.
- [29] Zhang, M., Cai, B., Sun, W. W. & Zhang, J. (2025). Generalized tensor completion with non-random missingness. arXiv preprint arXiv:2509.06225.