TwinRouterBench supplies 970 execution-verified router prefixes across five datasets plus a live harness for 100 held-out SWE-bench cases, scoring routers on tier accuracy, trajectory success, and realized token cost without LLM judges.
Transactions of the Association for Computational Linguistics
5 Pith papers cite this work.
years: 2026 (5)
representative citing papers
A pipeline with LoRA-fine-tuned query rewriting, BM25+dense hybrid retrieval via RRF, and cross-encoder reranking reaches nDCG@5 of 0.531 on multi-turn retrieval across four domains.
H-RAG uses hierarchical parent-child document segmentation with hybrid retrieval and parent-level aggregation to achieve 0.4271 nDCG@5 on retrieval and 0.3241 harmonic mean on generation in a multi-turn RAG shared task.
A hybrid dense-sparse retrieval pipeline with query rewriting and cross-encoder reranking achieves 0.5453 nDCG@5 (third place) on SemEval-2026 Task 8 Task A and 0.5312 harmonic mean on Task C.
5ting achieves nDCG@5 of 0.4719 on Task A and harmonic score 0.5597 with RL_F 0.7692 on Task C for multi-turn RAG via standard dense retrieval plus LLM reranking and faithfulness constraints.
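Several of the summaries above mention fusing BM25 and dense rankings via RRF (Reciprocal Rank Fusion) and report nDCG@5. A minimal sketch of both follows, for orientation only. It is not taken from any of the cited systems: the function names, the toy document IDs, the smoothing constant k=60 (a commonly used default), and the linear-gain form of nDCG are all assumptions, and the shared task may compute its metric differently.

```python
import math
from collections import defaultdict


def rrf_fuse(rankings, k=60):
    """Reciprocal Rank Fusion: score(d) = sum over rankings of 1 / (k + rank_d).

    `rankings` is a list of ranked doc-id lists (best first).
    k=60 is an assumed, commonly used default.
    """
    scores = defaultdict(float)
    for ranking in rankings:
        for rank, doc_id in enumerate(ranking, start=1):
            scores[doc_id] += 1.0 / (k + rank)
    # Sort by fused score (descending); break ties by doc id for determinism.
    return sorted(scores, key=lambda d: (-scores[d], d))


def ndcg_at_k(ranked_ids, relevance, k=5):
    """nDCG@k with linear gain (an assumption; some setups use 2^rel - 1)."""
    gains = [relevance.get(d, 0) for d in ranked_ids[:k]]
    dcg = sum(g / math.log2(i + 2) for i, g in enumerate(gains))
    ideal = sorted(relevance.values(), reverse=True)[:k]
    idcg = sum(g / math.log2(i + 2) for i, g in enumerate(ideal))
    return dcg / idcg if idcg > 0 else 0.0


# Hypothetical toy rankings from a sparse and a dense retriever.
bm25_ranking = ["d1", "d2", "d3"]
dense_ranking = ["d3", "d1", "d4"]
fused = rrf_fuse([bm25_ranking, dense_ranking])

# Hypothetical relevance judgments: d1 and d3 are relevant.
score = ndcg_at_k(fused, {"d1": 1, "d3": 1}, k=5)
```

In this toy case both relevant documents appear in both input rankings, so fusion places them at the top. A cross-encoder reranker, as several of the pipelines describe, would then rescore the fused list.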
citing papers explorer
- TwinRouterBench: Fast Static and Live Dynamic Evaluation for Realistic Agentic LLM Routing