Across 21 methods and 5 benchmarks, LLM routers converge to similar accuracy below the oracle because they learn global performance trends rather than fine-grained query signals.
IRT-Router: Effective and Interpretable Multi-LLM Routing via Item Response Theory
2 Pith papers cite this work. Polarity classification is still indexing.
representative citing papers
MedIRT applies Item Response Theory to medical LLM benchmarks to separate latent competency from item difficulty and discrimination, producing more stable rankings and revealing more domain heterogeneity than accuracy alone.
citing papers explorer
Measuring Competency, Not Performance: Item-Aware Evaluation Across Medical Benchmarks