From static benchmarks to adaptive testing: Psychometrics in AI evaluation
3 Pith papers cite this work.
citing papers explorer
- FairTree: Subgroup Fairness Auditing of Machine Learning Models with Bias-Variance Decomposition
  FairTree audits ML models for subgroup fairness by decomposing performance disparities into systematic bias and variance using permutation-based and fluctuation tests adapted from psychometric methods.
- Beyond the Mean: Within-Model Reliable Change Detection for LLM Evaluation
  Item-level Reliable Change Index analysis shows that LLM version upgrades result in bidirectional performance shifts on individual questions, making aggregate accuracy gains the net residual of improvements and deteriorations.
- Position: AI Evaluations Should be Grounded on a Theory of Capability
  AI evaluations should be reframed as inference tasks grounded in an explicit theory of capability, with an empirical demonstration that results depend on modeling assumptions and a proposed Evaluation Card for transparency.