Scales++: Compute Efficient Evaluation Subset Selection with Cognitive Scales Embeddings

Andrew M. Bean; Jonathan Richard Schwarz; Nabeel Seedat; Shengzhuang Chen

arxiv: 2510.26384 · v2 · pith:UDUU2MOVnew · submitted 2025-10-30 · 💻 cs.AI · cs.LG

Scales++: Compute Efficient Evaluation Subset Selection with Cognitive Scales Embeddings

Andrew M. Bean , Nabeel Seedat , Shengzhuang Chen , Jonathan Richard Schwarz This is my paper

Pith reviewed 2026-05-21 20:44 UTC · model grok-4.3

classification 💻 cs.AI cs.LG

keywords benchmark subset selectionLLM evaluationcognitive embeddingsefficient benchmarkingitem-centric selectionOpen LLM LeaderboardHumanity's Last Exam

0 comments

The pith

Benchmark subsets selected by the cognitive demands of test items alone can predict full LLM scores with low error from tiny samples.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper introduces an item-centric method for choosing small, representative subsets from large LLM benchmarks. Rather than basing selection on how past models have failed, the approach embeds the intrinsic cognitive demands of each question or task and picks items whose demands cover the full set. This yields subsets as small as 0.25 percent that still allow prediction of complete benchmark scores within roughly 3 percent mean absolute error. The method also avoids the large upfront cost of running many models to build the selector and works immediately on new benchmarks. The core result is that performance patterns observed on these cognitively balanced micro-benchmarks generalize to unseen models without reference to model-specific failure data.

Core claim

Scales++ selects evaluation items by embedding their cognitive demands rather than by analyzing collective model errors; the resulting tiny subsets (0.25 percent on the Open LLM Leaderboard, 2 percent on Humanity's Last Exam) produce full-benchmark score predictions with mean absolute errors of 3.2 percent and 2.9 percent respectively, while cutting selection cost by more than 18 times and enabling cold-start use on fresh benchmarks.

What carries the argument

Cognitive scales embeddings that represent the intrinsic cognitive demands of each benchmark item and drive item-centric subset selection.

If this is right

Model evaluation becomes feasible with far fewer compute hours because no prior model runs are needed to build the subset.
New benchmarks can be used immediately without waiting for a large pool of existing model results.
The selected items remain interpretable because their cognitive properties are explicit and human-readable.
Predictive error stays competitive with model-centric methods while using dramatically smaller data fractions.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

The same cognitive-embedding approach could be applied to dynamic or continually updated benchmarks that add items over time.
If cognitive demand categories prove stable across domains, the method might transfer to vision, multimodal, or agent benchmarks without retraining selectors.
Standardized cognitive taxonomies derived from the embeddings could serve as a shared language for describing what different benchmarks actually measure.

Load-bearing premise

That the cognitive demands of benchmark items are stable, intrinsic properties that can be embedded reliably enough for selected subsets to generalize to future models.

What would settle it

A new model whose rank order or absolute scores on the cognitively selected subset deviate substantially from its scores on the full benchmark would falsify the claim of predictive fidelity.

read the original abstract

The prohibitive cost of evaluating large language models (LLMs) on comprehensive benchmarks necessitates the creation of small yet representative data subsets (i.e., tiny benchmarks) that enable efficient assessment while retaining predictive fidelity. Current methods for this task operate under a model-centric paradigm, selecting benchmarking items based on the collective performance of existing models. Such approaches are limited by large upfront costs, an inability to immediately handle new benchmarks ("cold-start"), and the fragile assumption that future models will share the failure patterns of their predecessors. In this work, we propose a new item-centric approach to benchmark subset selection, arguing that selection should be based on the intrinsic properties of the task items themselves, rather than on model-specific failure patterns. We instantiate this item-centric efficient benchmarking approach via a novel method, Scales++, where data selection is based on the cognitive demands of the benchmark samples. Empirically, we show Scales++ reduces the upfront selection cost by over 18x while achieving competitive predictive fidelity. On the Open LLM Leaderboard, using just a 0.25% data subset, we predict full benchmark scores with a 3.2% mean absolute error, and on Humanity's Last Exam we predict full scores with 2.9% mean absolute error using a 2.0% sample. We demonstrate that this item-centric approach enables more efficient model evaluation without significant fidelity degradation, while also providing better cold-start performance and more interpretable benchmarking.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Desk Editor's Note private letter to a colleague

Scales++ shifts to item-centric selection via cognitive embeddings to cut eval costs and handle cold starts, but the transfer to future models is the part that needs checking.

read the letter

The core idea is to pick small benchmark subsets by embedding the cognitive demands of the items themselves instead of relying on how existing models perform. This sidesteps the upfront cost of running many models and the assumption that new models will fail in the same places. They report an 18x drop in selection cost, 3.2% MAE on the Open LLM Leaderboard with a 0.25% subset, and 2.9% MAE on Humanity's Last Exam with 2% of the data. That level of compression while keeping predictive error low is the practical win if it holds.

Referee Report

2 major / 2 minor

Summary. The manuscript presents Scales++, an item-centric approach to selecting small representative subsets of benchmark items for efficient LLM evaluation. Instead of relying on model performance data, it uses embeddings of the intrinsic cognitive demands of individual items. The authors report that subsets as small as 0.25% of the Open LLM Leaderboard yield full-score predictions with 3.2% mean absolute error, while a 2.0% sample on Humanity's Last Exam achieves 2.9% MAE; they further claim an 18x reduction in upfront selection cost, improved cold-start performance for new benchmarks, and greater interpretability compared to model-centric baselines.

Significance. If the cognitive-demand embeddings prove stable across architectures and training regimes, the work could meaningfully lower the barrier to rigorous LLM evaluation by enabling accurate predictions from tiny subsets without large-scale model runs. The reported error rates at extreme compression ratios are competitive with existing methods, and the explicit focus on cold-start and interpretability addresses practical pain points in the field.

major comments (2)

§5.1 (Open LLM Leaderboard results): The 3.2% MAE is demonstrated only on models whose failure patterns likely overlap with those used to derive or validate the cognitive embeddings; this does not directly test the central claim that the selected subsets will generalize to future models exhibiting qualitatively different capability profiles or reasoning shortcuts, which is required for the model-agnostic and cold-start guarantees stated in the abstract.
§3 (Method): The construction of the cognitive scales embeddings is presented as purely item-centric, yet the paper does not explicitly rule out any indirect dependence on model outputs or human judgments calibrated to current models; if such dependence exists, it would undermine the claimed independence from model-specific data and the 18x cost reduction relative to model-centric baselines.

minor comments (2)

§2.3: The precise mathematical formulation of the embedding similarity metric used for subset selection is introduced without an accompanying equation or pseudocode, making it difficult to reproduce the exact selection procedure.
Table 1: The baseline comparisons would benefit from reporting the number of models used in each method's upfront cost calculation to allow direct verification of the 18x reduction claim.

Simulated Author's Rebuttal

2 responses · 1 unresolved

We thank the referee for the detailed and constructive review. The two major comments raise important questions about generalization to future models and the precise independence of our item-centric embeddings. We respond to each point below and will revise the manuscript accordingly to improve clarity and transparency.

read point-by-point responses

Referee: §5.1 (Open LLM Leaderboard results): The 3.2% MAE is demonstrated only on models whose failure patterns likely overlap with those used to derive or validate the cognitive embeddings; this does not directly test the central claim that the selected subsets will generalize to future models exhibiting qualitatively different capability profiles or reasoning shortcuts, which is required for the model-agnostic and cold-start guarantees stated in the abstract.

Authors: We agree that direct empirical testing on future models with qualitatively different profiles is not possible, as such models do not yet exist. Our current experiments evaluate the selected subsets across a broad range of existing models with varying architectures and training regimes on the Open LLM Leaderboard. The cognitive scales are constructed from intrinsic item properties using a fixed expert-defined taxonomy, without incorporating model performance data. In the revised manuscript we will add an expanded limitations discussion in §5 that explicitly addresses this point, including theoretical arguments for improved generalization under the item-centric paradigm and potential risks if future models introduce entirely novel reasoning mechanisms. revision: partial
Referee: §3 (Method): The construction of the cognitive scales embeddings is presented as purely item-centric, yet the paper does not explicitly rule out any indirect dependence on model outputs or human judgments calibrated to current models; if such dependence exists, it would undermine the claimed independence from model-specific data and the 18x cost reduction relative to model-centric baselines.

Authors: The referee correctly notes that the current text does not provide sufficient detail to fully exclude indirect dependencies. The embeddings are produced via expert annotation of each item against a predefined cognitive taxonomy (reasoning steps, knowledge prerequisites, and complexity levels) with no reference to model outputs or performance-calibrated judgments. We will revise §3 to include a detailed description of the annotation protocol, explicit statements confirming the absence of model data, and supporting information on the annotation process. These changes will strengthen the justification for the reported cost reduction and model independence. revision: yes

standing simulated objections not resolved

Direct empirical validation on future LLMs exhibiting qualitatively different capability profiles cannot be performed because such models do not currently exist.

Circularity Check

0 steps flagged

No significant circularity; derivation is self-contained empirical method

full rationale

The paper's core derivation selects subsets via cognitive-demand embeddings treated as intrinsic item properties, independent of model performance data. This is validated by direct empirical comparison to full benchmark scores on existing leaderboards, yielding reported MAE values. No equations or steps reduce by construction to fitted parameters, self-citations, or prior model outputs. The item-centric framing and cold-start claims rest on the new embedding construction rather than tautological renaming or imported uniqueness. This is the standard non-circular outcome for an empirical selection technique with external validation.

Axiom & Free-Parameter Ledger

0 free parameters · 0 axioms · 0 invented entities

Based on abstract only; no explicit free parameters, axioms, or invented entities are detailed. The core contribution is the cognitive scales embedding technique whose construction and validation details are not provided.

pith-pipeline@v0.9.0 · 5798 in / 1048 out tokens · 39629 ms · 2026-05-21T20:44:45.154940+00:00 · methodology

discussion (0)

Lean theorems connected to this paper

Citations machine-checked in the Pith Canon. Every link opens the source theorem in the public Lean library.

IndisputableMonolith/Foundation/RealityFromDistinction.lean reality_from_one_distinction unclear

?

unclear
Relation between the paper passage and the cited Recognition theorem.

selection should be guided by the intrinsic properties of the task items themselves, rather than by model-specific failure patterns

What do these tags mean?

matches: The paper's claim is directly supported by a theorem in the formal canon.
supports: The theorem supports part of the paper's argument, but the paper may add assumptions or extra steps.
extends: The paper goes beyond the formal theorem; the theorem is a base layer rather than the whole result.
uses: The paper appears to rely on the theorem as machinery.
contradicts: The paper's claim conflicts with a theorem or certificate in the canon.
unclear: Pith found a possible connection, but the passage is too broad, indirect, or ambiguous to say the theorem truly supports the claim.

Forward citations

Cited by 2 Pith papers

Reviewed papers in the Pith corpus that reference this work. Sorted by Pith novelty score.

CapTrack: Multifaceted Evaluation of Forgetting in LLM Post-Training
cs.LG 2026-02 unverdicted novelty 7.0

CapTrack shows post-training causes drift beyond facts, with instruction fine-tuning producing stronger behavioral changes than preference optimization across model families.
Query-efficient model evaluation using cached responses
cs.LG 2026-05 unverdicted novelty 6.0

DKPS-based methods leverage cached model responses to achieve equivalent benchmark prediction accuracy with substantially fewer queries than standard evaluation.