Scales++: Compute Efficient Evaluation Subset Selection with Cognitive Scales Embeddings
Pith reviewed 2026-05-21 20:44 UTC · model grok-4.3
The pith
Benchmark subsets selected by the cognitive demands of test items alone can predict full LLM scores with low error from tiny samples.
A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.
Core claim
Scales++ selects evaluation items by embedding their cognitive demands rather than by analyzing collective model errors. The resulting tiny subsets (0.25 percent of the Open LLM Leaderboard, 2 percent of Humanity's Last Exam) predict full-benchmark scores with mean absolute errors of 3.2 percent and 2.9 percent respectively, cut upfront selection cost by more than 18x, and allow cold-start use on fresh benchmarks.
What carries the argument
Cognitive scales embeddings that represent the intrinsic cognitive demands of each benchmark item and drive item-centric subset selection.
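To make the item-centric idea concrete, here is a minimal sketch of selecting a subset from item embeddings alone, with no model results involved. It is not the paper's procedure: the greedy k-center rule, the coverage weighting, and the names `select_subset`, `coverage_weights`, and `estimate_full_score` are all assumptions chosen for illustration, and the toy two-dimensional vectors stand in for whatever cognitive scales embeddings the authors actually compute.

```python
import math


def select_subset(embeddings, k):
    """Greedy k-center selection over item embeddings (illustrative assumption).

    Starts from the item nearest the centroid, then repeatedly adds the item
    farthest from everything already chosen, so the subset spreads across the
    embedding space. No model outputs are used.
    """
    n = len(embeddings)
    if not 0 < k <= n:
        raise ValueError("k must be between 1 and the number of items")
    dim = len(embeddings[0])
    centroid = [sum(e[d] for e in embeddings) / n for d in range(dim)]
    first = min(range(n), key=lambda i: math.dist(embeddings[i], centroid))
    chosen = [first]
    # Distance from every item to its nearest chosen item so far.
    min_dist = [math.dist(e, embeddings[first]) for e in embeddings]
    while len(chosen) < k:
        nxt = max(range(n), key=lambda i: min_dist[i])
        if min_dist[nxt] == 0.0:
            break  # every remaining item duplicates a chosen one
        chosen.append(nxt)
        for i in range(n):
            d = math.dist(embeddings[i], embeddings[nxt])
            if d < min_dist[i]:
                min_dist[i] = d
    return chosen


def coverage_weights(embeddings, chosen):
    """Weight each chosen item by the share of all items nearest to it."""
    counts = [0] * len(chosen)
    for e in embeddings:
        nearest = min(
            range(len(chosen)),
            key=lambda c: math.dist(e, embeddings[chosen[c]]),
        )
        counts[nearest] += 1
    return [c / len(embeddings) for c in counts]


def estimate_full_score(correct_on_chosen, weights):
    """Weighted accuracy on the subset, used as a full-benchmark estimate."""
    return sum(w * c for w, c in zip(weights, correct_on_chosen))


if __name__ == "__main__":
    # Toy embeddings: two tight clusters of two items each (made-up values).
    toy = [[0.0, 0.0], [0.0, 1.0], [10.0, 10.0], [10.0, 11.0]]
    picked = select_subset(toy, 2)
    weights = coverage_weights(toy, picked)
    # Suppose a model answers the first picked item correctly, the second not.
    print(picked, weights, estimate_full_score([1, 0], weights))
```

The point of the sketch is only that every quantity used for selection comes from the items themselves, which is what allows cold-start use on a benchmark no model has been run on yet.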
If this is right
- Model evaluation becomes feasible with far fewer compute hours because no prior model runs are needed to build the subset.
- New benchmarks can be used immediately without waiting for a large pool of existing model results.
- The selected items remain interpretable because their cognitive properties are explicit and human-readable.
- Predictive error stays competitive with model-centric methods while using dramatically smaller data fractions.
Where Pith is reading between the lines
- The same cognitive-embedding approach could be applied to dynamic or continually updated benchmarks that add items over time.
- If cognitive demand categories prove stable across domains, the method might transfer to vision, multimodal, or agent benchmarks without retraining selectors.
- Standardized cognitive taxonomies derived from the embeddings could serve as a shared language for describing what different benchmarks actually measure.
Load-bearing premise
That the cognitive demands of benchmark items are stable, intrinsic properties that can be embedded reliably enough for selected subsets to generalize to future models.
What would settle it
A new model whose rank order or absolute scores on the cognitively selected subset deviate substantially from its scores on the full benchmark would falsify the claim of predictive fidelity.
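The check described above can be sketched as two numbers computed over a set of held-out models: the mean absolute error between subset-based and full-benchmark scores, and the Spearman rank correlation between the two orderings. This is a hedged illustration, not the paper's evaluation code; the function names are hypothetical and the scores in the demo are placeholder values, not results from the paper.

```python
import math


def mean_absolute_error(predicted, actual):
    """Average absolute gap between subset-based and full-benchmark scores."""
    return sum(abs(p - a) for p, a in zip(predicted, actual)) / len(actual)


def average_ranks(values):
    """Rank values from 1 upward, giving tied values their average rank."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        avg = (i + j) / 2 + 1
        for t in range(i, j + 1):
            ranks[order[t]] = avg
        i = j + 1
    return ranks


def spearman(x, y):
    """Spearman rank correlation: Pearson correlation of the rank vectors.

    Raises ZeroDivisionError if either input is constant, since the rank
    variance is then zero and the correlation is undefined.
    """
    rx, ry = average_ranks(x), average_ranks(y)
    mx, my = sum(rx) / len(rx), sum(ry) / len(ry)
    cov = sum((a - mx) * (b - my) for a, b in zip(rx, ry))
    vx = sum((a - mx) ** 2 for a in rx)
    vy = sum((b - my) ** 2 for b in ry)
    return cov / math.sqrt(vx * vy)


if __name__ == "__main__":
    # Placeholder scores for three hypothetical new models.
    subset_scores = [0.50, 0.62, 0.71]
    full_scores = [0.52, 0.60, 0.75]
    print("MAE:", mean_absolute_error(subset_scores, full_scores))
    print("Spearman:", spearman(subset_scores, full_scores))
```

A large MAE or a low rank correlation on models released after the subset was fixed would be the kind of deviation the paragraph above describes.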
Original abstract
The prohibitive cost of evaluating large language models (LLMs) on comprehensive benchmarks necessitates the creation of small yet representative data subsets (i.e., tiny benchmarks) that enable efficient assessment while retaining predictive fidelity. Current methods for this task operate under a model-centric paradigm, selecting benchmarking items based on the collective performance of existing models. Such approaches are limited by large upfront costs, an inability to immediately handle new benchmarks ("cold-start"), and the fragile assumption that future models will share the failure patterns of their predecessors. In this work, we propose a new item-centric approach to benchmark subset selection, arguing that selection should be based on the intrinsic properties of the task items themselves, rather than on model-specific failure patterns. We instantiate this item-centric efficient benchmarking approach via a novel method, Scales++, where data selection is based on the cognitive demands of the benchmark samples. Empirically, we show Scales++ reduces the upfront selection cost by over 18x while achieving competitive predictive fidelity. On the Open LLM Leaderboard, using just a 0.25% data subset, we predict full benchmark scores with a 3.2% mean absolute error, and on Humanity's Last Exam we predict full scores with 2.9% mean absolute error using a 2.0% sample. We demonstrate that this item-centric approach enables more efficient model evaluation without significant fidelity degradation, while also providing better cold-start performance and more interpretable benchmarking.
Editorial analysis
A structured set of objections, weighed in public.
Referee Report
Summary. The manuscript presents Scales++, an item-centric approach to selecting small representative subsets of benchmark items for efficient LLM evaluation. Instead of relying on model performance data, it uses embeddings of the intrinsic cognitive demands of individual items. The authors report that subsets as small as 0.25% of the Open LLM Leaderboard yield full-score predictions with 3.2% mean absolute error, while a 2.0% sample on Humanity's Last Exam achieves 2.9% MAE; they further claim an 18x reduction in upfront selection cost, improved cold-start performance for new benchmarks, and greater interpretability compared to model-centric baselines.
Significance. If the cognitive-demand embeddings prove stable across architectures and training regimes, the work could meaningfully lower the barrier to rigorous LLM evaluation by enabling accurate predictions from tiny subsets without large-scale model runs. The reported error rates at extreme compression ratios are competitive with existing methods, and the explicit focus on cold-start and interpretability addresses practical pain points in the field.
major comments (2)
- §5.1 (Open LLM Leaderboard results): The 3.2% MAE is demonstrated only on models whose failure patterns likely overlap with those used to derive or validate the cognitive embeddings; this does not directly test the central claim that the selected subsets will generalize to future models exhibiting qualitatively different capability profiles or reasoning shortcuts, which is required for the model-agnostic and cold-start guarantees stated in the abstract.
- §3 (Method): The construction of the cognitive scales embeddings is presented as purely item-centric, yet the paper does not explicitly rule out any indirect dependence on model outputs or human judgments calibrated to current models; if such dependence exists, it would undermine the claimed independence from model-specific data and the 18x cost reduction relative to model-centric baselines.
minor comments (2)
- §2.3: The precise mathematical formulation of the embedding similarity metric used for subset selection is introduced without an accompanying equation or pseudocode, making it difficult to reproduce the exact selection procedure.
- Table 1: The baseline comparisons would benefit from reporting the number of models used in each method's upfront cost calculation to allow direct verification of the 18x reduction claim.
Simulated Author's Rebuttal
We thank the referee for the detailed and constructive review. The two major comments raise important questions about generalization to future models and the precise independence of our item-centric embeddings. We respond to each point below and will revise the manuscript accordingly to improve clarity and transparency.
Point-by-point responses
- Referee: §5.1 (Open LLM Leaderboard results): The 3.2% MAE is demonstrated only on models whose failure patterns likely overlap with those used to derive or validate the cognitive embeddings; this does not directly test the central claim that the selected subsets will generalize to future models exhibiting qualitatively different capability profiles or reasoning shortcuts, which is required for the model-agnostic and cold-start guarantees stated in the abstract.
  Authors: We agree that direct empirical testing on future models with qualitatively different profiles is not possible, as such models do not yet exist. Our current experiments evaluate the selected subsets across a broad range of existing models with varying architectures and training regimes on the Open LLM Leaderboard. The cognitive scales are constructed from intrinsic item properties using a fixed expert-defined taxonomy, without incorporating model performance data. In the revised manuscript we will add an expanded limitations discussion in §5 that explicitly addresses this point, including theoretical arguments for improved generalization under the item-centric paradigm and potential risks if future models introduce entirely novel reasoning mechanisms.
  Revision: partial
- Referee: §3 (Method): The construction of the cognitive scales embeddings is presented as purely item-centric, yet the paper does not explicitly rule out any indirect dependence on model outputs or human judgments calibrated to current models; if such dependence exists, it would undermine the claimed independence from model-specific data and the 18x cost reduction relative to model-centric baselines.
  Authors: The referee correctly notes that the current text does not provide sufficient detail to fully exclude indirect dependencies. The embeddings are produced via expert annotation of each item against a predefined cognitive taxonomy (reasoning steps, knowledge prerequisites, and complexity levels) with no reference to model outputs or performance-calibrated judgments. We will revise §3 to include a detailed description of the annotation protocol, explicit statements confirming the absence of model data, and supporting information on the annotation process. These changes will strengthen the justification for the reported cost reduction and model independence.
  Revision: yes
- Direct empirical validation on future LLMs exhibiting qualitatively different capability profiles cannot be performed because such models do not currently exist.
Circularity Check
No significant circularity; the derivation is a self-contained empirical method.
Full rationale
The paper's core derivation selects subsets via cognitive-demand embeddings treated as intrinsic item properties, independent of model performance data. This is validated by direct empirical comparison to full benchmark scores on existing leaderboards, yielding reported MAE values. No equations or steps reduce by construction to fitted parameters, self-citations, or prior model outputs. The item-centric framing and cold-start claims rest on the new embedding construction rather than tautological renaming or imported uniqueness. This is the standard non-circular outcome for an empirical selection technique with external validation.
Axiom & Free-Parameter Ledger
Lean theorems connected to this paper
- IndisputableMonolith/Foundation/RealityFromDistinction.lean · reality_from_one_distinction · unclear
  Relation between the paper passage and the cited Recognition theorem: unclear.
  Paper passage: "selection should be guided by the intrinsic properties of the task items themselves, rather than by model-specific failure patterns"
What do these tags mean?
- matches: The paper's claim is directly supported by a theorem in the formal canon.
- supports: The theorem supports part of the paper's argument, but the paper may add assumptions or extra steps.
- extends: The paper goes beyond the formal theorem; the theorem is a base layer rather than the whole result.
- uses: The paper appears to rely on the theorem as machinery.
- contradicts: The paper's claim conflicts with a theorem or certificate in the canon.
- unclear: Pith found a possible connection, but the passage is too broad, indirect, or ambiguous to say the theorem truly supports the claim.
Forward citations
Cited by 2 Pith papers
- CapTrack: Multifaceted Evaluation of Forgetting in LLM Post-Training
CapTrack shows post-training causes drift beyond facts, with instruction fine-tuning producing stronger behavioral changes than preference optimization across model families.
- Query-efficient model evaluation using cached responses
DKPS-based methods leverage cached model responses to achieve equivalent benchmark prediction accuracy with substantially fewer queries than standard evaluation.