3 Pith papers cite this work. Polarity classification is still indexing.
citation-role summary
citation-polarity summary
fields: cs.IR (3)
years: 2026 (3)
roles: background (2)
polarities: background (2)
citing papers explorer
- SimEval-IR: A Unified Toolkit and Benchmark Suite for Evaluating User Simulators and Search Sessions
  SimEval-IR toolkit and benchmarks demonstrate that human-likeness classifiers have negligible pooled predictive power (r=+0.09) for simulator-based system ranking validity, whereas marginal click-depth distance and Fréchet distance on session embeddings show stronger signals (r=0.43 and 0.40).
- Hypencoder Revisited: Reproducibility and Analysis of Non-Linear Scoring for First-Stage Retrieval
  Reproducibility study confirms Hypencoder's non-linear query-specific scoring improves retrieval over bi-encoders on standard benchmarks but standard methods remain faster and hard-task results are mixed due to implementation issues.
- Reproducing Adaptive Reranking for Reasoning-Intensive IR
  Reproducing GAR on BRIGHT shows it boosts reasoning-intensive retrieval effectiveness with low overhead when the reranker's signal quality is strong.