From static benchmarks to adaptive testing: Psychometrics in AI evaluation

Yan Zhuang, Qi Liu, Yuting Ning, Weizhe Huang, Zachary A Pardos, Patrick C Kyllonen, Jiyun Zu, Qingyang Mao, Rui Lv, Zhenya Huang, et al · 2023 · arXiv 2306.10512

3 Pith papers cite this work. Polarity classification is still indexing.

representative citing papers

FairTree: Subgroup Fairness Auditing of Machine Learning Models with Bias-Variance Decomposition

cs.LG · 2026-04-21 · unverdicted · novelty 7.0

FairTree audits ML models for subgroup fairness by decomposing performance disparities into systematic bias and variance using permutation-based and fluctuation tests adapted from psychometric methods.

Beyond the Mean: Within-Model Reliable Change Detection for LLM Evaluation

cs.CL · 2026-04-30 · unverdicted · novelty 6.0

Item-level Reliable Change Index analysis shows that LLM version upgrades result in bidirectional performance shifts on individual questions, making aggregate accuracy gains the net residual of improvements and deteriorations.

Position: AI Evaluations Should be Grounded on a Theory of Capability

cs.AI · 2025-09-23 · conditional · novelty 5.0

AI evaluations should be reframed as inference tasks grounded in an explicit theory of capability, with an empirical demonstration that results depend on modeling assumptions and a proposed Evaluation Card for transparency.

citing papers explorer

FairTree: Subgroup Fairness Auditing of Machine Learning Models with Bias-Variance Decomposition
cs.LG · 2026-04-21 · unverdicted · none · ref 22

Beyond the Mean: Within-Model Reliable Change Detection for LLM Evaluation
cs.CL · 2026-04-30 · unverdicted · none · ref 4

Position: AI Evaluations Should be Grounded on a Theory of Capability
cs.AI · 2025-09-23 · conditional · none · ref 57
