{"paper":{"title":"RouterWise: Joint Resource Allocation and Routing for Latency-Aware Multi-Model LLM Serving","license":"http://arxiv.org/licenses/nonexclusive-distrib/1.0/","headline":"Jointly tuning GPU shares and routing fractions across models raises output quality by up to 87 percent while meeting a fixed latency target.","cross_cats":["cs.DC"],"primary_cat":"cs.NI","authors_text":"Adel N. Toosi, Christopher Leckie, Gholamreza Haffari, Hossein Hosseini Kasnavieh","submitted_at":"2026-04-13T02:13:13Z","abstract_excerpt":"Multi-model LLM routing has emerged as an effective approach for reducing serving cost and latency while maintaining output quality by assigning each prompt to an appropriate model. However, prior routing methods typically assume that each model has a fixed latency. In real deployments, this assumption is inaccurate: multiple models often share limited GPU resources, and a model's latency depends strongly on both its allocated resources and the request load induced by the routing policy. Consequently, routing and resource allocation are tightly coupled.\n  In this work, we study joint resource "},"claims":{"count":4,"items":[{"kind":"strongest_claim","text":"even on the same GPU cluster, achievable output-quality score can vary by up to 87% across retained setups, highlighting that resource allocation is a key determinant of routing performance.","source":"verdict.strongest_claim","status":"machine_extracted","claim_id":"C1","attestation":"unclaimed"},{"kind":"weakest_assumption","text":"The latency models obtained from system profiling accurately predict end-to-end latency when the routing policy induces a particular load on each model under a chosen resource allocation.","source":"verdict.weakest_assumption","status":"machine_extracted","claim_id":"C2","attestation":"unclaimed"},{"kind":"one_line_summary","text":"Joint resource allocation and routing for multi-model LLM serving can produce up to 87% variation in achievable output quality across setups on the same GPU cluster.","source":"verdict.one_line_summary","status":"machine_extracted","claim_id":"C3","attestation":"unclaimed"},{"kind":"headline","text":"Jointly tuning GPU shares and routing fractions across models raises output quality by up to 87 percent while meeting a fixed latency target.","source":"verdict.pith_extraction.headline","status":"machine_extracted","claim_id":"C4","attestation":"unclaimed"}],"snapshot_sha256":"af33a0197e7946db602b04cb2cc44de659f6466c2515a23c383ced71f3e72bdc"},"source":{"id":"2604.10907","kind":"arxiv","version":2},"verdict":{"id":"b6169b62-907f-4325-ad8a-86327998ba7a","model_set":{"reader":"grok-4.3"},"created_at":"2026-05-10T16:33:24.658836Z","strongest_claim":"even on the same GPU cluster, achievable output-quality score can vary by up to 87% across retained setups, highlighting that resource allocation is a key determinant of routing performance.","one_line_summary":"Joint resource allocation and routing for multi-model LLM serving can produce up to 87% variation in achievable output quality across setups on the same GPU cluster.","pipeline_version":"pith-pipeline@v0.9.0","weakest_assumption":"The latency models obtained from system profiling accurately predict end-to-end latency when the routing policy induces a particular load on each model under a chosen resource allocation.","pith_extraction_headline":"Jointly tuning GPU shares and routing fractions across models raises output quality by up to 87 percent while meeting a fixed latency target."},"integrity":{"clean":true,"summary":{"advisory":0,"critical":0,"by_detector":{},"informational":0},"endpoint":"/pith/2604.10907/integrity.json","findings":[],"available":true,"detectors_run":[],"snapshot_sha256":"c28c3603d3b5d939e8dc4c7e95fa8dfce3d595e45f758748cecf8e644a296938"},"references":{"count":0,"sample":[],"resolved_work":0,"snapshot_sha256":"258153158e38e3291e3d48162225fcdb2d5a3ed65a07baac614ab91432fd4f57","internal_anchors":0},"formal_canon":{"evidence_count":0,"snapshot_sha256":"258153158e38e3291e3d48162225fcdb2d5a3ed65a07baac614ab91432fd4f57"},"author_claims":{"count":0,"strong_count":0,"snapshot_sha256":"258153158e38e3291e3d48162225fcdb2d5a3ed65a07baac614ab91432fd4f57"},"builder_version":"pith-number-builder-2026-05-17-v1"}