Evaluating general-purpose AI with psychometrics

Xiting Wang, Liming Jiang, Jose Hernandez-Orallo, David Stillwell, Luning Sun, Fang Luo, Xing Xie · 2023 · arXiv 2310.16379

4 Pith papers cite this work. Polarity classification is still indexing.


representative citing papers

FairTree: Subgroup Fairness Auditing of Machine Learning Models with Bias-Variance Decomposition

cs.LG · 2026-04-21 · unverdicted · novelty 7.0

FairTree audits ML models for subgroup fairness by decomposing performance disparities into systematic bias and variance using permutation-based and fluctuation tests adapted from psychometric methods.

An Interpretable and Scalable Framework for Evaluating Large Language Models

stat.ML · 2026-05-07 · unverdicted · novelty 6.0

A majorization-minimization framework turns IRT into scalable matrix factorization subproblems for LLM evaluation, delivering orders-of-magnitude speedups with identifiability guarantees.

Position: AI Evaluations Should be Grounded on a Theory of Capability

cs.AI · 2025-09-23 · conditional · novelty 5.0

AI evaluations should be reframed as inference tasks grounded in an explicit theory of capability, with an empirical demonstration that results depend on modeling assumptions and a proposed Evaluation Card for transparency.

Designing Psychometric Bias Measures for ChatBots: An Application to Racial Bias Measurement

cs.HC · 2025-08-17 · unverdicted · novelty 5.0

STAMP-LLM is a two-phase psychometric protocol for designing and applying bias measures to LLMs, illustrated with one explicit and two implicit racial bias tests.

citing papers explorer

Showing 4 of 4 citing papers.

FairTree: Subgroup Fairness Auditing of Machine Learning Models with Bias-Variance Decomposition
cs.LG · 2026-04-21 · unverdicted · none · ref 17
An Interpretable and Scalable Framework for Evaluating Large Language Models
stat.ML · 2026-05-07 · unverdicted · none · ref 54
Position: AI Evaluations Should be Grounded on a Theory of Capability
cs.AI · 2025-09-23 · conditional · none · ref 52
Designing Psychometric Bias Measures for ChatBots: An Application to Racial Bias Measurement
cs.HC · 2025-08-17 · unverdicted · none · ref 24
