arxiv: 2505.11788 · v2 · pith:QMGZ3L3Nnew · submitted 2025-05-17 · 💻 cs.DC · cs.IT · cs.LG · cs.NI · eess.SP · math.IT

Communication-Efficient Hybrid Language Model via Uncertainty-Aware Opportunistic and Compressed Transmission

classification 💻 cs.DC cs.IT cs.LG cs.NI eess.SP math.IT
keywords language, model, vocabulary, cu-hlm, uncertainty, communication, communication-efficient, hybrid
To support emerging language-based applications using dispersed and heterogeneous computing resources, the hybrid language model (HLM) offers a promising architecture, where an on-device small language model (SLM) generates draft tokens that are validated and corrected by a remote large language model (LLM). However, the original HLM suffers from substantial communication overhead, as the LLM requires the SLM to upload the full vocabulary distribution for each token. Moreover, both communication and computation resources are wasted when the LLM validates tokens that are highly likely to be accepted. To overcome these limitations, we propose communication-efficient and uncertainty-aware HLM (CU-HLM). In CU-HLM, the SLM transmits truncated vocabulary distributions only when its output uncertainty is high. We validate the feasibility of this opportunistic transmission by discovering a strong correlation between the SLM's uncertainty and the LLM's rejection probability. Furthermore, we theoretically derive optimal uncertainty thresholds and optimal vocabulary truncation strategies. Simulation results show that, compared to standard HLM, CU-HLM achieves up to 206$\times$ higher token throughput by skipping 74.8% of transmissions with 97.4% vocabulary compression, while maintaining 97.4% accuracy.
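
The transmission rule described in the abstract (upload a truncated vocabulary distribution only when the SLM is uncertain) can be illustrated with a minimal sketch. This is not the paper's implementation: Shannon entropy is used here as a stand-in uncertainty measure, truncation is a plain top-k cut with renormalization, and the names `entropy`, `truncate_top_k`, `maybe_transmit`, as well as the threshold and `k` values, are hypothetical. The abstract states that optimal thresholds and truncation strategies are derived theoretically, which this sketch does not attempt.

```python
import math


def entropy(probs):
    """Shannon entropy (in nats) of a probability vector.

    Assumption: entropy stands in for the SLM's uncertainty measure.
    """
    return -sum(p * math.log(p) for p in probs if p > 0.0)


def truncate_top_k(probs, k):
    """Keep the k most probable token ids and renormalize their mass to 1."""
    top = sorted(range(len(probs)), key=lambda i: probs[i], reverse=True)[:k]
    mass = sum(probs[i] for i in top)
    return {i: probs[i] / mass for i in top}


def maybe_transmit(probs, threshold, k):
    """Return a truncated distribution to upload, or None to skip the upload.

    The upload is skipped when uncertainty is at or below the (hypothetical)
    threshold, i.e. when the draft token is likely to be accepted anyway.
    """
    if entropy(probs) <= threshold:
        return None
    return truncate_top_k(probs, k)


# Toy 4-token vocabulary: one confident draft, one uncertain draft.
confident = [0.97, 0.01, 0.01, 0.01]
uncertain = [0.40, 0.30, 0.20, 0.10]

skipped = maybe_transmit(confident, threshold=0.5, k=2)
payload = maybe_transmit(uncertain, threshold=0.5, k=2)
```

With these toy inputs the confident draft is not uploaded, while the uncertain draft is reduced to its two most probable token ids before upload. A real system would also have to decide how the LLM treats the probability mass removed by truncation, which is outside the scope of this sketch.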

This paper has not been read by Pith yet.

Forward citations

Cited by 2 Pith papers

Reviewed papers in the Pith corpus that reference this work. Sorted by Pith novelty score.

  1. Context-Aware Wireless Token Communication via Joint Token Masking and Detection

    eess.SP 2026-05 unverdicted novelty 6.0

    A joint token masking and detection scheme with masked language models improves token reconstruction over noisy wireless channels by up to 1.77x on Europarl and 1.63x on WikiText-103 compared to conventional methods.

  2. GELATO: Generative Entropy- and Lyapunov-based Adaptive Token Offloading for Device-Edge Speculative LLM Inference

    cs.NI 2026-05 unverdicted novelty 5.0

    GELATO combines drift-plus-penalty Lyapunov control with generative entropy early exiting to adaptively offload tokens in device-edge speculative decoding, delivering higher throughput and lower energy use than prior ...