From Human-Level AI Tales to AI Leveling Human Scales
Pith reviewed 2026-05-15 20:05 UTC · model grok-4.3
The pith
AI performance can be measured on scales calibrated to the entire world population.
A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.
Core claim
The paper introduces a calibration framework that builds capability-specific scales in which each level represents a probability of success for the whole world population, spaced logarithmically. These scales are populated using publicly available human benchmark data across education and reasoning tests. The logarithmic base is estimated by LLM-based extrapolation between two demographic profiles, on the hypothesis that LLMs condense rich information about human populations.
What carries the argument
Multi-level logarithmic scales calibrated to world population success probabilities, with the base estimated via LLM extrapolation from two demographic profiles.
If this is right
- AI models receive performance scores on a common scale relative to global human probabilities.
- Benchmarks for different capabilities become directly comparable through shared population anchoring.
- Recalibration of existing AI results allows consistent tracking of progress against worldwide human standards.
- Group slicing and post-stratification can validate the quality of the population mappings.
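As a minimal sketch of what shared population anchoring could look like, the snippet below converts whole-population success probabilities into scale levels using the relation the paper states, L_i = -log_B(p_W_i). The benchmark names, the item probabilities, and the choice B = 10 are illustrative assumptions, not values taken from the paper.

```python
import math


def level_from_probability(p_world: float, base: float) -> float:
    """Scale level L = -log_B(p) for a world-population success probability p."""
    if not 0.0 < p_world <= 1.0:
        raise ValueError("p_world must be in (0, 1]")
    if base <= 1.0:
        raise ValueError("base must be greater than 1")
    return -math.log(p_world, base)


def probability_from_level(level: float, base: float) -> float:
    """Inverse mapping: p = B ** (-L)."""
    return base ** (-level)


# Hypothetical items from two different benchmarks (assumed numbers).
items = {
    "benchmark_a/item_1": 0.10,
    "benchmark_b/item_7": 0.001,
}
BASE = 10.0  # assumed base, for illustration only

# Both items land on the same scale, whatever benchmark they came from.
levels = {name: level_from_probability(p, BASE) for name, p in items.items()}
```

Under these assumed numbers, an item that one in ten people worldwide would solve sits two levels below an item that one in a thousand would solve, regardless of which benchmark each item came from.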
Where Pith is reading between the lines
- This calibration could show that current AI systems lag further behind when measured against global rather than elite populations.
- Extending the method to new benchmarks would require only additional human data and LLM prompts for base estimation.
- Policy discussions on AI regulation might benefit from these standardized human-relative metrics.
Load-bearing premise
That large language models can reliably extrapolate detailed information about human populations from just two demographic profiles to set the correct base for the logarithmic scales.
What would settle it
A large-scale survey collecting actual success rates on sample items from a globally representative population and comparing them to the probabilities predicted by the calibrated scales.
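A hedged sketch of how such a comparison might be scored: each item has a level assigned by the calibrated scale and a success rate observed in a representative survey, and the gap is measured in scale levels. The survey figures, the base B = 10, and the choice of mean absolute error as the metric are all assumptions made for illustration; the paper may use a different criterion.

```python
import math


def mean_abs_level_error(items, base):
    """Mean absolute gap, in scale levels, between assigned and observed levels.

    `items` is a list of (assigned_level, observed_success_rate) pairs.
    """
    errors = []
    for assigned_level, observed_rate in items:
        # Level implied by the observed rate: L = -log_B(p).
        observed_level = -math.log(observed_rate, base)
        errors.append(abs(assigned_level - observed_level))
    return sum(errors) / len(errors)


# Hypothetical (assigned level, observed survey success rate) pairs.
survey_items = [(1.0, 0.10), (2.0, 0.02), (3.0, 0.001)]
error = mean_abs_level_error(survey_items, base=10.0)
```

In this made-up data only the second item disagrees with its assigned level, so the average gap is small; a large gap across many items would count against the calibration.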
Original abstract
Comparing AI models to "human level" is often misleading when benchmark scores are incommensurate or human baselines are drawn from a narrow population. To address this, we propose a framework that calibrates items against the 'world population' and report performance on a common, human-anchored scale. Concretely, we build on a set of multi-level scales for different capabilities where each level should represent a probability of success of the whole world population on a logarithmic scale with a base $B$. We calibrate each scale for each capability (reasoning, comprehension, knowledge, volume, etc.) by compiling publicly released human test data spanning education and reasoning benchmarks (PISA, TIMSS, ICAR, UKBioBank, and ReliabilityBench). The base $B$ is estimated by extrapolating between samples with two demographic profiles using LLMs, with the hypothesis that they condense rich information about human populations. We evaluate the quality of different mappings using group slicing and post-stratification. The new techniques allow for the recalibration and standardization of scales relative to the whole-world population.
Editorial analysis
A structured set of objections, weighed in public.
Referee Report
Summary. The paper proposes a framework to calibrate AI performance on capabilities such as reasoning, comprehension, knowledge, and volume against the whole-world human population. It defines multi-level logarithmic probability scales with base B estimated via LLM extrapolation from two demographic profiles, calibrates the scales using public human test data from benchmarks including PISA, TIMSS, ICAR, UKBioBank, and ReliabilityBench, and evaluates mappings with group slicing and post-stratification to enable recalibration and standardization relative to the global population.
Significance. If the LLM-based extrapolation for base B can be shown to be accurate and independent, the framework would offer a useful method for creating standardized, human-anchored scales that address the problem of narrow or incommensurate human baselines in AI evaluation. The grounding in established datasets like PISA and TIMSS is a strength, but the current absence of any reported calibration outcomes or validation metrics renders the practical significance speculative.
major comments (3)
- [Abstract] Abstract: The central claim that the new techniques allow recalibration and standardization relative to the whole-world population rests on the untested hypothesis that LLMs condense rich information to extrapolate base B from exactly two demographic profiles; no derivation, error bounds, cross-validation against independent global statistics, or concrete calibration results are supplied.
- [Abstract] The estimation of base B: Because B serves as the global anchor for the logarithmic scales, any systematic bias in the LLM extrapolation (e.g., from training-data skew toward high-education or English-speaking subpopulations) directly undermines the claimed human-anchored standardization; the manuscript provides no quantitative assessment of this extrapolation step.
- [Evaluation] Evaluation section: Group slicing and post-stratification are described for assessing mapping quality, yet no numerical results, metrics, or error analysis are reported, leaving the reader unable to evaluate whether the scales achieve the intended representativeness.
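For readers unfamiliar with the technique named in the last comment, the following is a minimal sketch of standard post-stratification: per-group success rates from a sample are reweighted by each group's share of the target population. The group names, rates, and shares are hypothetical, and the paper's actual procedure is not described in enough detail here to confirm that it matches.

```python
def post_stratify(group_rates, population_shares):
    """Weighted average of per-group success rates using target-population shares."""
    if set(group_rates) != set(population_shares):
        raise ValueError("group_rates and population_shares must cover the same groups")
    total = sum(population_shares.values())
    return sum(
        group_rates[group] * population_shares[group] / total
        for group in group_rates
    )


# Hypothetical sample: group_a is over-represented among test takers,
# but makes up only a quarter of the target population.
sample_rates = {"group_a": 0.60, "group_b": 0.20}
world_shares = {"group_a": 0.25, "group_b": 0.75}

estimate = post_stratify(sample_rates, world_shares)
```

With these assumed inputs the reweighted estimate is 0.30, lower than the success rate a sample dominated by group_a would suggest.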
minor comments (1)
- The notation for the logarithmic probability scale and the precise definition of base B could be clarified with an explicit equation or example computation to aid reproducibility.
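One possible form of the requested equation, written out from the relation the paper states (L_i = -log_B(p_W_i)). The grouping of the subscript as $p_{W_i}$, the example values, and the alternative base $B'$ are assumptions made here for illustration.

```latex
% Level of item i from the world-population success probability, and its inverse
L_i = -\log_B\!\left(p_{W_i}\right)
\quad\Longleftrightarrow\quad
p_{W_i} = B^{-L_i}

% Worked example (assumed values): B = 10, p_{W_i} = 10^{-3}
L_i = -\log_{10}\!\left(10^{-3}\right) = 3

% Change of base: the same probability expressed on a scale with base B'
L_i^{(B')} = -\frac{\ln p_{W_i}}{\ln B'} = L_i^{(B)} \cdot \frac{\ln B}{\ln B'}
```

The last line shows why the estimate of $B$ matters: a different base rescales every level by the constant factor $\ln B / \ln B'$, so an error in $B$ shifts the whole scale rather than individual items.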
Simulated Author's Rebuttal
We thank the referee for the detailed and constructive comments on our proposed calibration framework. The manuscript introduces a methodological approach for human-anchored AI scales using public datasets and LLM-based extrapolation, and we address each concern below by clarifying the current scope while committing to enhancements that strengthen the presentation of results and validation.
- Referee: [Abstract] Abstract: The central claim that the new techniques allow recalibration and standardization relative to the whole-world population rests on the untested hypothesis that LLMs condense rich information to extrapolate base B from exactly two demographic profiles; no derivation, error bounds, cross-validation against independent global statistics, or concrete calibration results are supplied.
  Authors: We agree that the central claim depends on the LLM extrapolation hypothesis for base B, which is presented as such in the manuscript. The abstract and methods describe the extrapolation from two demographic profiles and the subsequent calibration against PISA, TIMSS, ICAR, UKBioBank, and ReliabilityBench data, but we acknowledge the absence of explicit derivations, error bounds, cross-validation, and numerical calibration outcomes. We will expand the abstract and add a new subsection with the mathematical derivation of the extrapolation, sensitivity checks, and initial concrete calibration results in the revision.
  Revision: yes
- Referee: [Abstract] The estimation of base B: Because B serves as the global anchor for the logarithmic scales, any systematic bias in the LLM extrapolation (e.g., from training-data skew toward high-education or English-speaking subpopulations) directly undermines the claimed human-anchored standardization; the manuscript provides no quantitative assessment of this extrapolation step.
  Authors: The potential for systematic bias in the LLM extrapolation step is a substantive concern we take seriously, as B anchors the entire scale. The manuscript states the hypothesis that LLMs condense rich population information but does not include quantitative bias assessment. We will add a quantitative evaluation of the extrapolation, including discussion of training-data skew and robustness checks against available global statistics, to address this directly in the revised version.
  Revision: yes
- Referee: [Evaluation] Evaluation section: Group slicing and post-stratification are described for assessing mapping quality, yet no numerical results, metrics, or error analysis are reported, leaving the reader unable to evaluate whether the scales achieve the intended representativeness.
  Authors: The evaluation section details the application of group slicing and post-stratification to assess mapping quality and enable recalibration. While the current text emphasizes the methodological framework, we recognize that the lack of reported numerical metrics and error analysis limits immediate evaluation of representativeness. We will incorporate specific quantitative results, metrics (e.g., stratification error rates), and analysis from these techniques applied to the calibrated scales in the revised manuscript.
  Revision: yes
Circularity Check
No significant circularity in derivation chain
The paper compiles publicly released human test data from benchmarks (PISA, TIMSS, ICAR, UKBioBank, ReliabilityBench) to calibrate multi-level scales, then estimates base B via LLM extrapolation from two demographic profiles under the stated hypothesis that LLMs condense population information. This produces a logarithmic human-anchored scale for reporting AI performance. No equation or step reduces the final calibrated scale or standardization claim to the LLM extrapolation by construction; the human data compilation supplies independent grounding, the LLM step is an explicit estimation procedure rather than a fitted parameter renamed as a prediction, and no self-citation or uniqueness theorem is invoked to close the chain. The framework is therefore self-contained against external human benchmarks.
Axiom & Free-Parameter Ledger
free parameters (1)
- base B
axioms (2)
- domain assumption: Each level on the capability scales represents a probability of success for the whole world population arranged on a logarithmic scale with base B.
- domain assumption: Public human test data from PISA, TIMSS, ICAR, UKBioBank, and ReliabilityBench can be compiled to calibrate scales for reasoning, comprehension, knowledge, and related capabilities.
Lean theorems connected to this paper
- IndisputableMonolith/Cost/FunctionalEquation.lean, theorem washburn_uniqueness_aczel
  Relation between the paper passage and the cited Recognition theorem: unclear
  Paper passage: each level should represent a probability of success of the whole world population on a logarithmic scale with a base B... L_i = -log_B(p_W_i)
- IndisputableMonolith/Foundation/AbsoluteFloorClosure.lean, theorem reality_from_one_distinction
  Relation between the paper passage and the cited Recognition theorem: unclear
  Paper passage: calibrate the base of the logarithmic scales per dimension... optimal base B=10^m
What do these tags mean?
- matches: The paper's claim is directly supported by a theorem in the formal canon.
- supports: The theorem supports part of the paper's argument, but the paper may add assumptions or extra steps.
- extends: The paper goes beyond the formal theorem; the theorem is a base layer rather than the whole result.
- uses: The paper appears to rely on the theorem as machinery.
- contradicts: The paper's claim conflicts with a theorem or certificate in the canon.
- unclear: Pith found a possible connection, but the passage is too broad, indirect, or ambiguous to say the theorem truly supports the claim.