3 Pith papers cite this work. Polarity classification is still indexing.
fields: cs.IR (3)
years: 2026 (3)
verdicts: UNVERDICTED (3)
representative citing papers
Semantic Recall is a new evaluation metric for approximate nearest neighbor search that focuses only on semantically relevant results, with Tolerant Recall as a proxy when relevance labels are unavailable.
A unified evaluation finds LLM query reformulation gains are strongly conditioned on retrieval paradigm, do not consistently transfer to neural retrievers, and are not uniformly improved by larger LLMs.
citing papers explorer
- SimEval-IR: A Unified Toolkit and Benchmark Suite for Evaluating User Simulators and Search Sessions
The SimEval-IR toolkit and benchmarks show that human-likeness classifiers have negligible pooled predictive power (r=+0.09) for the validity of simulator-based system rankings, whereas marginal click-depth distance and Fréchet distance on session embeddings show stronger signals (r=+0.43 and r=+0.40, respectively).
- Semantic Recall for Vector Search
Semantic Recall is a new evaluation metric for approximate nearest neighbor search that focuses only on semantically relevant results, with Tolerant Recall as a proxy when relevance labels are unavailable.
- A Reproducibility Study of LLM-Based Query Reformulation
A unified evaluation finds LLM query reformulation gains are strongly conditioned on retrieval paradigm, do not consistently transfer to neural retrievers, and are not uniformly improved by larger LLMs.