
arxiv: 2602.18911 · v2 · submitted 2026-02-21 · 💻 cs.LG

From Human-Level AI Tales to AI Leveling Human Scales

Pith reviewed 2026-05-15 20:05 UTC · model grok-4.3

classification 💻 cs.LG
keywords AI evaluation · benchmark calibration · human performance scales · logarithmic scales · world population · LLM extrapolation · standardization

The pith

AI performance can be measured on scales calibrated to the entire world population.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper aims to fix misleading comparisons of AI to human-level performance by creating standardized scales based on the whole world population. It develops multi-level scales for various capabilities where each level indicates the probability of success for the global population, using a logarithmic scale with an adjustable base. Human data from large-scale tests like PISA and TIMSS are used to set the levels, while LLMs help estimate the scale's base by extrapolating information from two different demographic profiles. This method seeks to make AI evaluations more comparable and grounded in broad human capabilities rather than narrow samples.

Core claim

The paper introduces a calibration framework that builds capability-specific scales representing logarithmic probabilities of success for the whole world population. These scales are populated using publicly available human benchmark data across education and reasoning tests, with the logarithmic base estimated through LLM-based extrapolation between two demographic profiles to condense population information.

What carries the argument

Multi-level logarithmic scales calibrated to world population success probabilities, with the base estimated via LLM extrapolation from two demographic profiles.
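
As an illustration of what such a scale implies, the sketch below converts between a level and the implied world-population success probability. It assumes the relation p = B^(-level), which is how the Figure 1 caption reads (level 5 ≈ 1-in-B^5 people succeeding). The function names are hypothetical and are not taken from the paper.

```python
import math


def success_probability(level: float, base: float) -> float:
    """Fraction of the population expected to succeed at `level`.

    Assumes p = base ** (-level); level 0 then means near-universal success.
    """
    if base <= 1.0:
        raise ValueError("base must be greater than 1")
    return base ** (-level)


def level_from_probability(p: float, base: float) -> float:
    """Inverse mapping: the level whose assumed success probability is `p`."""
    if base <= 1.0:
        raise ValueError("base must be greater than 1")
    if not 0.0 < p <= 1.0:
        raise ValueError("p must be in (0, 1]")
    return -math.log(p) / math.log(base)
```

For example, with B = 10, level 3 corresponds to a success probability of 0.001, that is, roughly one person in a thousand. A smaller base compresses the scale, so the same level describes a less rare ability.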

If this is right

  • AI models receive performance scores on a common scale relative to global human probabilities.
  • Benchmarks for different capabilities become directly comparable through shared population anchoring.
  • Recalibration of existing AI results allows consistent tracking of progress against worldwide human standards.
  • Group slicing and post-stratification can validate the quality of the population mappings.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

  • This calibration could show that current AI systems lag further behind when measured against global rather than elite populations.
  • Extending the method to new benchmarks would require only additional human data and LLM prompts for base estimation.
  • Policy discussions on AI regulation might benefit from these standardized human-relative metrics.

Load-bearing premise

That large language models can reliably extrapolate detailed information about human populations from just two demographic profiles to set the correct base for the logarithmic scales.

What would settle it

A large-scale survey collecting actual success rates on sample items from a globally representative population and comparing them to the probabilities predicted by the calibrated scales.
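
One way such a comparison could be scored is in level units: convert each observed success rate to a level and measure how far it sits from the level the calibrated scale assigned. The sketch below assumes the relation p = B^(-level) and uses hypothetical function names and input layout; it is not a procedure described in the paper.

```python
import math


def level_gap(predicted_level: float, observed_rate: float, base: float) -> float:
    """Observed level minus predicted level, under the assumed p = base ** (-level).

    A positive gap means the item was harder for the surveyed population
    than the calibrated scale predicted.
    """
    if base <= 1.0:
        raise ValueError("base must be greater than 1")
    if not 0.0 < observed_rate <= 1.0:
        raise ValueError("observed_rate must be in (0, 1]")
    observed_level = -math.log(observed_rate) / math.log(base)
    return observed_level - predicted_level


def mean_absolute_level_gap(items, base: float) -> float:
    """Average absolute gap over (predicted_level, observed_rate) pairs."""
    gaps = [abs(level_gap(level, rate, base)) for level, rate in items]
    if not gaps:
        raise ValueError("items must not be empty")
    return sum(gaps) / len(gaps)
```

Working in level units rather than raw probabilities keeps rare-success items from being swamped by easy ones, which seems in keeping with the logarithmic design of the scales.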

Figures

Figures reproduced from arXiv: 2602.18911 by Álvaro David Gómez Antón, Daniel Romero-Alvarado, Félix Martí Pérez, Fernando Martínez-Plumed, José Hernández-Orallo, Kevin Wei, Lexin Zhou, Luning Sun, Manuel Cebrian, Matthieu Téhénan, Peter Romero, Sipeng Chen, Yael Moros Daval, Zachary R. Tidler.

Figure 1: Calibrated annotations of benchmarks can be used to generate profiles of AI systems on human-referenced scales (top). In this paper we calibrate 18 dimensions of capability and knowledge, going from level 0 (near-universal success) to level 5 ≈ 1-in-$B^5$ people succeeding, with $B$ being normalized according to the human distribution taken from several tests with human results (bottom). The calibration uses …
Figure 2: Human-calibrated bases $B$ for groups of dimensions. For each plot we only use the examples for which any of the dimensions of that group is dominant, and using harmonic mean when the group contains more than one dimension. The x-axis shows the level, as annotated following the ADeLe rubrics, and the y-axis shows the corresponding levels as coming from the LLM estimate. By fitting a linear function using the…
Figure 3: Mapping of dominant ADeLe dimension from extrapolated calibrations (what model thinks about how the world would perform) against real world outcomes of test takers from ICAR Logical Reasoning.
Figure 4: Mapping of dominant ADeLe dimension from extrapolated calibrations (what model thinks about how the world would perform) against real world outcomes of test takers from ICAR Verbal Reasoning.
Figure 5: Mapping of dominant ADeLe dimension from extrapolated calibrations (what model thinks about how the world would perform) against real world outcomes of test takers from PISA results.
Figure 6: Mapping of dominant ADeLe dimension from extrapolated calibrations (what model thinks about how the world would perform) against real world outcomes of test takers from UK BioBank.
Figure 7: Mapping of dominant ADeLe dimension from extrapolated calibrations (what model thinks about how the world would perform) against real world outcomes of test takers from ReliabilityBench.
Figure 8: Mapping of dominant ADeLe dimension from extrapolated calibrations (what model thinks about how the world would perform) against real world outcomes of test takers from TIMSS.
Figure 9: Cluster Analysis of TIMSS outcomes by country. The “High Alignment” cluster (Green) includes Anglosphere nations where models successfully rank item difficulty ($r \approx 0.59$), while the “Low Alignment” cluster (Red) includes distinct curricula (e.g., Singapore) where alignment collapses ($r \approx 0.02$).
Figure 10 illustrates the three-step pipeline that converts raw, theory-driven complexity annotations into empirically calibrated difficulty scales via a means-based regression to estimate each dimension’s slope $m$ and the corresponding optimal base $B_{\text{opt}} = 10^m$. (a) Stage 1: Uncalibrated ($B = 10$). When using the default ADeLe base ($B = 10$), empirical difficulty (y-axis) systematically lags behind theoretical complexi…
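
The Figure 10 caption describes estimating a slope m by a means-based regression and then setting the base to 10 raised to m. The sketch below is one plausible reading of that step, not the authors' code: it averages empirical difficulty (taken here to be -log10 of the observed success rate) within each annotated level, fits an ordinary least-squares line through those means, and returns the slope with the implied base. Fitting an intercept, the input layout, and the function name are all assumptions.

```python
from collections import defaultdict


def estimate_base(levels, empirical_difficulty):
    """Return (m, 10 ** m) from a line fitted through per-level means.

    levels: annotated level of each item (x-axis).
    empirical_difficulty: assumed to be -log10(success rate) per item (y-axis).
    """
    by_level = defaultdict(list)
    for level, difficulty in zip(levels, empirical_difficulty):
        by_level[level].append(difficulty)

    xs = sorted(by_level)
    if len(xs) < 2:
        raise ValueError("need at least two distinct levels to fit a slope")
    # Means-based step: one averaged point per annotated level.
    ys = [sum(by_level[x]) / len(by_level[x]) for x in xs]

    x_bar = sum(xs) / len(xs)
    y_bar = sum(ys) / len(ys)
    sxx = sum((x - x_bar) ** 2 for x in xs)
    sxy = sum((x - x_bar) * (y - y_bar) for x, y in zip(xs, ys))
    m = sxy / sxx  # ordinary least-squares slope
    return m, 10.0 ** m
```

With per-level mean difficulties of 0, 0.5, and 1.0 at levels 0, 1, and 2, the slope is 0.5 and the implied base is about 3.16. A slope below 1 matches the caption's description of empirical difficulty lagging behind theoretical complexity under the default base of 10.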
Original abstract

Comparing AI models to "human level" is often misleading when benchmark scores are incommensurate or human baselines are drawn from a narrow population. To address this, we propose a framework that calibrates items against the 'world population' and reports performance on a common, human-anchored scale. Concretely, we build on a set of multi-level scales for different capabilities where each level should represent a probability of success of the whole world population on a logarithmic scale with a base $B$. We calibrate each scale for each capability (reasoning, comprehension, knowledge, volume, etc.) by compiling publicly released human test data spanning education and reasoning benchmarks (PISA, TIMSS, ICAR, UKBioBank, and ReliabilityBench). The base $B$ is estimated by extrapolating between samples with two demographic profiles using LLMs, with the hypothesis that they condense rich information about human populations. We evaluate the quality of different mappings using group slicing and post-stratification. The new techniques allow for the recalibration and standardization of scales relative to the whole-world population.
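
Post-stratification, mentioned in the abstract, generally means reweighting group-level results by each group's share of the target population. A minimal sketch of that general technique follows; the group labels, shares, and function name are hypothetical, and the paper's actual strata and weights are not given in the text reproduced here.

```python
def post_stratify(group_rates, population_shares):
    """Population-level success rate as a share-weighted mean of group rates.

    group_rates: mapping from group label to the success rate observed in that group.
    population_shares: mapping from group label to its weight in the target
        population; weights are normalised here, so they need not sum to 1.
    """
    missing = set(population_shares) - set(group_rates)
    if missing:
        raise ValueError(f"no observed rate for groups: {sorted(missing)}")
    total = sum(population_shares.values())
    if total <= 0:
        raise ValueError("population shares must sum to a positive number")
    return sum(
        group_rates[group] * share / total
        for group, share in population_shares.items()
    )
```

If a sample over-represents a high-scoring group, the raw sample mean overstates the population rate. Weighting a group with rate 0.8 at share 1 and a group with rate 0.2 at share 3 gives 0.35, whereas an unweighted average of the two groups would give 0.5.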

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Referee Report

3 major / 1 minor

Summary. The paper proposes a framework to calibrate AI performance on capabilities such as reasoning, comprehension, knowledge, and volume against the whole-world human population. It defines multi-level logarithmic probability scales with base B estimated via LLM extrapolation from two demographic profiles, calibrates the scales using public human test data from benchmarks including PISA, TIMSS, ICAR, UKBioBank, and ReliabilityBench, and evaluates mappings with group slicing and post-stratification to enable recalibration and standardization relative to the global population.

Significance. If the LLM-based extrapolation for base B can be shown to be accurate and independent, the framework would offer a useful method for creating standardized, human-anchored scales that address the problem of narrow or incommensurate human baselines in AI evaluation. The grounding in established datasets like PISA and TIMSS is a strength, but the current absence of any reported calibration outcomes or validation metrics renders the practical significance speculative.

major comments (3)
  1. [Abstract] Abstract: The central claim that the new techniques allow recalibration and standardization relative to the whole-world population rests on the untested hypothesis that LLMs condense rich information to extrapolate base B from exactly two demographic profiles; no derivation, error bounds, cross-validation against independent global statistics, or concrete calibration results are supplied.
  2. [Abstract] The estimation of base B: Because B serves as the global anchor for the logarithmic scales, any systematic bias in the LLM extrapolation (e.g., from training-data skew toward high-education or English-speaking subpopulations) directly undermines the claimed human-anchored standardization; the manuscript provides no quantitative assessment of this extrapolation step.
  3. [Evaluation] Evaluation section: Group slicing and post-stratification are described for assessing mapping quality, yet no numerical results, metrics, or error analysis are reported, leaving the reader unable to evaluate whether the scales achieve the intended representativeness.
minor comments (1)
  1. The notation for the logarithmic probability scale and the precise definition of base B could be clarified with an explicit equation or example computation to aid reproducibility.

Simulated Author's Rebuttal

3 responses · 0 unresolved

We thank the referee for the detailed and constructive comments on our proposed calibration framework. The manuscript introduces a methodological approach for human-anchored AI scales using public datasets and LLM-based extrapolation, and we address each concern below by clarifying the current scope while committing to enhancements that strengthen the presentation of results and validation.

point-by-point responses
  1. Referee: [Abstract] Abstract: The central claim that the new techniques allow recalibration and standardization relative to the whole-world population rests on the untested hypothesis that LLMs condense rich information to extrapolate base B from exactly two demographic profiles; no derivation, error bounds, cross-validation against independent global statistics, or concrete calibration results are supplied.

    Authors: We agree that the central claim depends on the LLM extrapolation hypothesis for base B, which is presented as such in the manuscript. The abstract and methods describe the extrapolation from two demographic profiles and the subsequent calibration against PISA, TIMSS, ICAR, UKBioBank, and ReliabilityBench data, but we acknowledge the absence of explicit derivations, error bounds, cross-validation, and numerical calibration outcomes. We will expand the abstract and add a new subsection with the mathematical derivation of the extrapolation, sensitivity checks, and initial concrete calibration results in the revision. revision: yes

  2. Referee: [Abstract] The estimation of base B: Because B serves as the global anchor for the logarithmic scales, any systematic bias in the LLM extrapolation (e.g., from training-data skew toward high-education or English-speaking subpopulations) directly undermines the claimed human-anchored standardization; the manuscript provides no quantitative assessment of this extrapolation step.

    Authors: The potential for systematic bias in the LLM extrapolation step is a substantive concern we take seriously, as B anchors the entire scale. The manuscript states the hypothesis that LLMs condense rich population information but does not include quantitative bias assessment. We will add a quantitative evaluation of the extrapolation, including discussion of training-data skew and robustness checks against available global statistics, to address this directly in the revised version. revision: yes

  3. Referee: [Evaluation] Evaluation section: Group slicing and post-stratification are described for assessing mapping quality, yet no numerical results, metrics, or error analysis are reported, leaving the reader unable to evaluate whether the scales achieve the intended representativeness.

    Authors: The evaluation section details the application of group slicing and post-stratification to assess mapping quality and enable recalibration. While the current text emphasizes the methodological framework, we recognize that the lack of reported numerical metrics and error analysis limits immediate evaluation of representativeness. We will incorporate specific quantitative results, metrics (e.g., stratification error rates), and analysis from these techniques applied to the calibrated scales in the revised manuscript. revision: yes

Circularity Check

0 steps flagged

No significant circularity in derivation chain

full rationale

The paper compiles publicly released human test data from benchmarks (PISA, TIMSS, ICAR, UKBioBank, ReliabilityBench) to calibrate multi-level scales, then estimates base B via LLM extrapolation from two demographic profiles under the stated hypothesis that LLMs condense population information. This produces a logarithmic human-anchored scale for reporting AI performance. No equation or step reduces the final calibrated scale or standardization claim to the LLM extrapolation by construction; the human data compilation supplies independent grounding, the LLM step is an explicit estimation procedure rather than a fitted parameter renamed as a prediction, and no self-citation or uniqueness theorem is invoked to close the chain. The framework is therefore self-contained against external human benchmarks.

Axiom & Free-Parameter Ledger

1 free parameter · 2 axioms · 0 invented entities

The framework depends on the assumption that logarithmic levels accurately represent world-population success probabilities and that LLMs can serve as reliable extrapolators for demographic adjustment; no new entities are postulated.

free parameters (1)
  • base B
    Estimated via LLM extrapolation between two demographic profiles; no specific fitted value is given in the abstract.
axioms (2)
  • domain assumption Each level on the capability scales represents a probability of success for the whole world population arranged on a logarithmic scale with base B.
    Stated directly as the foundation for the multi-level scales in the abstract.
  • domain assumption Public human test data from PISA, TIMSS, ICAR, UKBioBank, and ReliabilityBench can be compiled to calibrate scales for reasoning, comprehension, knowledge, and related capabilities.
    Assumes these benchmarks are sufficiently representative for world-population anchoring.



