Communication-Efficient Hybrid Language Model via Uncertainty-Aware Opportunistic and Compressed Transmission

Jihong Park; Jinho Choi; Jinhyuk Kim; Seong-Lyun Kim; Seungeun Oh; Seung-Woo Ko; Tony Q. S. Quek

arxiv: 2505.11788 · v2 · pith:QMGZ3L3Nnew · submitted 2025-05-17 · cs.DC · cs.IT · cs.LG · cs.NI · eess.SP · math.IT

Seungeun Oh, Jinhyuk Kim, Jihong Park, Seung-Woo Ko, Jinho Choi, Tony Q. S. Quek, Seong-Lyun Kim

classification: cs.DC, cs.IT, cs.LG, cs.NI, eess.SP, math.IT

keywords: language, model, vocabulary, cu-hlm, uncertainty, communication, communication-efficient, hybrid

To support emerging language-based applications using dispersed and heterogeneous computing resources, the hybrid language model (HLM) offers a promising architecture, where an on-device small language model (SLM) generates draft tokens that are validated and corrected by a remote large language model (LLM). However, the original HLM suffers from substantial communication overhead, as the LLM requires the SLM to upload the full vocabulary distribution for each token. Moreover, both communication and computation resources are wasted when the LLM validates tokens that are highly likely to be accepted. To overcome these limitations, we propose a communication-efficient and uncertainty-aware HLM (CU-HLM). In CU-HLM, the SLM transmits truncated vocabulary distributions only when its output uncertainty is high. We validate the feasibility of this opportunistic transmission by discovering a strong correlation between the SLM's uncertainty and the LLM's rejection probability. Furthermore, we theoretically derive optimal uncertainty thresholds and optimal vocabulary truncation strategies. Simulation results show that, compared to standard HLM, CU-HLM achieves up to 206$\times$ higher token throughput by skipping 74.8% of transmissions with 97.4% vocabulary compression, while maintaining 97.4% accuracy.
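
The device-side decision described in the abstract (upload a truncated distribution only when the SLM is uncertain) can be sketched roughly as follows. This is a minimal illustration under stated assumptions, not the authors' implementation: Shannon entropy stands in for the uncertainty measure, top-k truncation with renormalization stands in for the vocabulary truncation strategy, the draft token is taken as the argmax for determinism, and the names `slm_step`, `threshold`, and `k` are hypothetical. The values used below are not the optimal thresholds or truncation sizes derived in the paper.

```python
import math


def entropy(probs):
    """Shannon entropy (in nats) of a probability distribution."""
    return -sum(p * math.log(p) for p in probs if p > 0.0)


def truncate_top_k(probs, k):
    """Keep the k most probable token ids and renormalize them to sum to 1."""
    top = sorted(range(len(probs)), key=lambda i: probs[i], reverse=True)[:k]
    mass = sum(probs[i] for i in top)
    return {i: probs[i] / mass for i in top}


def slm_step(probs, threshold, k):
    """Return (draft_token, payload) for one SLM output distribution.

    payload is None when the upload is skipped (low uncertainty); otherwise
    it is the truncated distribution that would be sent to the LLM.
    Assumption: entropy is used as the uncertainty measure.
    """
    draft = max(range(len(probs)), key=lambda i: probs[i])
    if entropy(probs) <= threshold:
        return draft, None  # confident: skip transmission and validation
    return draft, truncate_top_k(probs, k)  # uncertain: send compressed payload


# Hypothetical 4-token vocabulary and hand-picked threshold, for illustration.
confident = [0.97, 0.01, 0.01, 0.01]
uncertain = [0.4, 0.3, 0.2, 0.1]

print(slm_step(confident, threshold=0.5, k=2))
print(slm_step(uncertain, threshold=0.5, k=2))
```

In this toy setting the confident distribution is never uploaded, while the uncertain one is sent with only its two most probable entries, which is the combination of skipping and compression that the abstract credits for the throughput gain.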

Forward citations

Cited by 2 Pith papers

Reviewed papers in the Pith corpus that reference this work. Sorted by Pith novelty score.

Context-Aware Wireless Token Communication via Joint Token Masking and Detection
eess.SP 2026-05 unverdicted novelty 6.0

A joint token masking and detection scheme with masked language models improves token reconstruction over noisy wireless channels by up to 1.77x on Europarl and 1.63x on WikiText-103 compared to conventional methods.

GELATO: Generative Entropy- and Lyapunov-based Adaptive Token Offloading for Device-Edge Speculative LLM Inference
cs.NI 2026-05 unverdicted novelty 5.0

GELATO combines drift-plus-penalty Lyapunov control with generative entropy early exiting to adaptively offload tokens in device-edge speculative decoding, delivering higher throughput and lower energy use than prior ...