The SimEval-IR toolkit and benchmarks show that human-likeness classifiers have negligible pooled predictive power (r = +0.09) for the validity of simulator-based system rankings, whereas marginal click-depth distance and Fréchet distance on session embeddings give stronger signals (r = 0.43 and r = 0.40, respectively).
1 Pith paper cites this work. Polarity classification is still being indexed.
Fields: cs.IR (1)
Years: 2026 (1)
Verdicts: UNVERDICTED (1)
Representative citing papers:
SimEval-IR: A Unified Toolkit and Benchmark Suite for Evaluating User Simulators and Search Sessions