vLLM Semantic Router: Signal Driven Decision Routing for Mixture-of-Modality Models

Abdallah Samara; Asaad Balum; Avinash Changrani; Avishek Goswami; Baofa Fan; Bishen Yu; Bowei He; Brent Salisbury; Fang Han; Guohong Wen

arxiv: 2603.04444 · v4 · pith:BLIAR7FRnew · submitted 2026-02-23 · 💻 cs.NI · cs.AI


Xunzhuo Liu, Huamin Chen, Samzong Lu, Yossi Ovadia, Guohong Wen, Hao Wu, Zhengda Tan, Jintao Zhang, Senan Zedan, Yehudit Kerido, Liav Weiss, Haichen Zhang, Bishen Yu, Asaad Balum, Noa Limoy, Abdallah Samara, Baofa Fan, Brent Salisbury, Ryan Cook, Zhijie Wang, Qiping Pan, Rehan Khan, Avishek Goswami, Houston H. Zhang, Shuyi Wang, Ziang Tang, Fang Han, Zohaib Hassan, Jianqiao Zheng, Avinash Changrani, Xue (Steve) Liu, Bowei He



keywords: routing, signal, decision, model, policies, safety, semantic, architecture


As large language models (LLMs) diversify across modalities, capabilities, and cost profiles, the problem of intelligent request routing, that is, selecting the right model for each query at inference time, has become a critical systems challenge. We present vLLM Semantic Router, a signal-driven decision routing framework for Mixture-of-Modality (MoM) model deployments. The architecture follows two complementary Shannon-inspired views. In the information-theoretic regime, signal extraction reduces the entropy of "which model?" by distilling routing-relevant information from raw queries. In the Boolean-algebraic regime, the decision engine composes functionally complete routing policies from signal conditions. The central innovation is composable signal orchestration: thirteen heterogeneous signal types, spanning sub-millisecond heuristics and neural classifiers for semantics, safety, and modality, are composed through configurable Boolean decision rules into deployment-specific routing policies, so that fundamentally different scenarios (multi-cloud enterprise, privacy-regulated, cost-optimized) are expressed as different configurations over the same architecture. Matched decisions drive semantic model routing via thirteen selection algorithms, while per-decision plugin chains enforce safety constraints, including a three-stage HaluGate hallucination detection pipeline and a lightweight episodic memory system with ReflectionGate for personalized multi-turn context. A typed neural-symbolic DSL specifies these routing policies and compiles them to multiple deployment targets, enabling configuration-first adaptation without code changes. Together, these components show that composable signal orchestration enables a single framework to serve diverse deployment scenarios with differentiated cost, privacy, and safety policies.
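To make the Boolean-algebraic view concrete, the following is a minimal Python sketch of how signal conditions might be composed into routing decisions. It is an illustration under stated assumptions and not the paper's DSL or the project's API: the signal names (`contains_pii`, `domain_code`, `is_image_request`), the model names, the priority-based tie-breaking, and every function and class name are hypothetical.

```python
from dataclasses import dataclass
from typing import Callable, Dict, List

# Hypothetical representation: each extracted signal is a named boolean.
Signals = Dict[str, bool]
Condition = Callable[[Signals], bool]


def signal(name: str) -> Condition:
    """Condition that holds when the named signal fired (missing means False)."""
    return lambda s: s.get(name, False)


def all_of(*conds: Condition) -> Condition:
    """Boolean AND over conditions."""
    return lambda s: all(c(s) for c in conds)


def any_of(*conds: Condition) -> Condition:
    """Boolean OR over conditions."""
    return lambda s: any(c(s) for c in conds)


def negate(cond: Condition) -> Condition:
    """Boolean NOT of a condition."""
    return lambda s: not cond(s)


@dataclass
class Decision:
    name: str
    condition: Condition
    model: str          # hypothetical target model name
    priority: int = 0   # assumption: higher priority wins among matches


def route(signals: Signals, decisions: List[Decision], default_model: str) -> str:
    """Return the model of the highest-priority matching decision, else the default."""
    matched = [d for d in decisions if d.condition(signals)]
    if not matched:
        return default_model
    return max(matched, key=lambda d: d.priority).model


# Hypothetical policy: a privacy rule outranks a domain rule.
DECISIONS = [
    Decision("private", signal("contains_pii"), "local-private-model", priority=10),
    Decision(
        "code",
        all_of(signal("domain_code"), negate(signal("is_image_request"))),
        "code-model",
        priority=5,
    ),
]

if __name__ == "__main__":
    print(route({"contains_pii": True, "domain_code": True}, DECISIONS, "general-model"))
    print(route({"domain_code": True}, DECISIONS, "general-model"))
    print(route({}, DECISIONS, "general-model"))
```

Because AND, OR, and NOT together are functionally complete, any Boolean policy over the extracted signals can be written with these three combinators. In this sketch, a different deployment scenario corresponds to a different `DECISIONS` list over the same `route` function.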


Forward citations

Cited by 3 Pith papers

Reviewed papers in the Pith corpus that reference this work. Sorted by Pith novelty score.

TwinRouterBench: Fast Static and Live Dynamic Evaluation for Realistic Agentic LLM Routing
cs.LG 2026-05 accept novelty 7.0

TwinRouterBench supplies step-level static evaluation with 970 prefixes and verified tiers plus a dynamic harness for live SWE-bench agent runs, enabling deterministic scoring for agentic LLM routing.

TwinRouterBench: Fast Static and Live Dynamic Evaluation for Realistic Agentic LLM Routing
cs.LG 2026-05 accept novelty 7.0

TwinRouterBench supplies 970 execution-verified router prefixes across five datasets plus a live harness for 100 held-out SWE-bench cases, scoring routers on tier accuracy, trajectory success, and realized token cost ...

The Workload-Router-Pool Architecture for LLM Inference Optimization: A Vision Paper from the vLLM Semantic Router Project
cs.LG 2026-03 unverdicted novelty 5.0

The Workload-Router-Pool architecture is a 3D framework for LLM inference optimization that synthesizes prior vLLM work into a 3x3 interaction matrix and proposes 21 research directions at the intersections.