Communication-Efficient Hybrid Language Model via Uncertainty-Aware Opportunistic and Compressed Transmission
To support emerging language-based applications using dispersed and heterogeneous computing resources, the hybrid language model (HLM) offers a promising architecture, where an on-device small language model (SLM) generates draft tokens that are validated and corrected by a remote large language model (LLM). However, the original HLM suffers from substantial communication overhead, as the LLM requires the SLM to upload the full vocabulary distribution for each token. Moreover, both communication and computation resources are wasted when the LLM validates tokens that are highly likely to be accepted. To overcome these limitations, we propose a communication-efficient and uncertainty-aware HLM (CU-HLM). In CU-HLM, the SLM transmits truncated vocabulary distributions only when its output uncertainty is high. We validate the feasibility of this opportunistic transmission by discovering a strong correlation between the SLM's uncertainty and the LLM's rejection probability. Furthermore, we theoretically derive optimal uncertainty thresholds and optimal vocabulary truncation strategies. Simulation results show that, compared to standard HLM, CU-HLM achieves up to 206$\times$ higher token throughput by skipping 74.8% of transmissions with 97.4% vocabulary compression, while maintaining 97.4% accuracy.
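The device-side decision described in the abstract (skip the upload when the SLM is confident, otherwise send a truncated distribution) can be illustrated with a minimal sketch. Everything below is an assumption made for illustration and is not taken from the paper: Shannon entropy stands in for the uncertainty measure, top-k truncation with renormalization stands in for the vocabulary truncation strategy, the draft token is chosen greedily, and the names `slm_step`, `threshold`, and `k` are hypothetical. The threshold and truncation size here are hand-picked, not the optimal values the paper derives.

```python
import math

def softmax(logits):
    """Convert raw logits to a probability distribution (numerically stable)."""
    m = max(logits)
    exps = [math.exp(x - m) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]

def entropy(probs):
    """Shannon entropy in nats; used here as an assumed uncertainty measure."""
    return -sum(p * math.log(p) for p in probs if p > 0.0)

def truncate_top_k(probs, k):
    """Keep the k most likely token ids and renormalize their probabilities."""
    top = sorted(range(len(probs)), key=lambda i: probs[i], reverse=True)[:k]
    mass = sum(probs[i] for i in top)
    return {i: probs[i] / mass for i in top}

def slm_step(logits, threshold, k):
    """Return (draft_token, payload).

    payload is None when uncertainty is at or below the threshold, meaning
    the upload is skipped; otherwise it is the truncated distribution that
    would be sent to the LLM for validation.
    """
    probs = softmax(logits)
    # Greedy draft for simplicity (assumption); a real system may sample.
    draft = max(range(len(probs)), key=lambda i: probs[i])
    if entropy(probs) <= threshold:
        return draft, None
    return draft, truncate_top_k(probs, k)

# A peaked distribution is kept on-device; a flat one is uploaded in truncated form.
confident = slm_step([10.0, 0.0, 0.0, 0.0], threshold=0.5, k=2)
uncertain = slm_step([1.0, 1.0, 1.0, 1.0], threshold=0.5, k=2)
```

In this sketch the communication saving comes from two places: steps where `payload` is `None` send nothing, and steps that do transmit send `k` entries instead of the full vocabulary.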
Forward citations
Cited by 2 Pith papers
- Context-Aware Wireless Token Communication via Joint Token Masking and Detection
  A joint token masking and detection scheme with masked language models improves token reconstruction over noisy wireless channels by up to 1.77x on Europarl and 1.63x on WikiText-103 compared to conventional methods.
- GELATO: Generative Entropy- and Lyapunov-based Adaptive Token Offloading for Device-Edge Speculative LLM Inference
  GELATO combines drift-plus-penalty Lyapunov control with generative entropy early exiting to adaptively offload tokens in device-edge speculative decoding, delivering higher throughput and lower energy use than prior ...