Proposes SJTs and MIRT to measure consistent latent behavioral tendencies in LLMs, showing stability and predictive validity on external benchmarks.
Specifically, we used PGMs created from US Census data and the names data of Rosenman et al
1 Pith paper cites this work. Polarity classification is still indexing.
Fields: cs.AI (1)
Years: 2025 (1)
Verdicts: UNVERDICTED (1)
Representative citing papers
- Measure what Matters: Psychometric Evaluation of AI with Situational Judgment Tests