Evaluating general-purpose AI with psychometrics
4 Pith papers cite this work.
citing papers explorer
- FairTree: Subgroup Fairness Auditing of Machine Learning Models with Bias-Variance Decomposition
  FairTree audits ML models for subgroup fairness by decomposing performance disparities into systematic bias and variance using permutation-based and fluctuation tests adapted from psychometric methods.
- An Interpretable and Scalable Framework for Evaluating Large Language Models
  A majorization-minimization framework turns IRT into scalable matrix factorization subproblems for LLM evaluation, delivering orders-of-magnitude speedups with identifiability guarantees.
- Position: AI Evaluations Should be Grounded on a Theory of Capability
  AI evaluations should be reframed as inference tasks grounded in an explicit theory of capability, with an empirical demonstration that results depend on modeling assumptions and a proposed Evaluation Card for transparency.
- Designing Psychometric Bias Measures for ChatBots: An Application to Racial Bias Measurement
  STAMP-LLM is a two-phase psychometric protocol for designing and applying bias measures to LLMs, illustrated with one explicit and two implicit racial bias tests.