Citer: Collaborative inference for efficient large language model decoding with token-level routing

Zheng, W · 2025 · arXiv 2502.01976

4 Pith papers cite this work. Polarity classification is still indexing.

representative citing papers

NI Sampling: Accelerating Discrete Diffusion Sampling by Token Order Optimization

cs.LG · 2026-04-20 · unverdicted · novelty 7.0

NI Sampling accelerates discrete diffusion language models up to 14.3 times by training a neural indicator to select which tokens to sample at each step using a trajectory-preserving objective.

Sampling from Your Language Model One Byte at a Time

cs.CL · 2025-06-17 · unverdicted · novelty 7.0

An inference-time technique turns BPE-based LMs into byte- or character-level models, solving the prompt boundary problem while unifying vocabularies across different tokenizers.

Rethinking LLM Ensembling from the Perspective of Mixture Models

cs.LG · 2026-05-01 · unverdicted · novelty 6.0 · 2 refs

ME reinterprets LLM ensembling as token-level sampling from a mixture model, enabling single-model invocation per token with claimed mathematical equivalence to full ensembling and measured speedups of 1.78x-2.68x.

Harnessing Multiple Large Language Models: A Survey on LLM Ensemble

cs.CL · 2025-02-25 · unverdicted · novelty 2.0

A systematic survey of LLM ensemble methods organized into a taxonomy of ensemble-before-inference, ensemble-during-inference, and ensemble-after-inference stages, with review of benchmarks, applications, and future directions.

citing papers explorer

Showing 4 of 4 citing papers.

NI Sampling: Accelerating Discrete Diffusion Sampling by Token Order Optimization
cs.LG · 2026-04-20 · unverdicted · none · ref 33
Sampling from Your Language Model One Byte at a Time
cs.CL · 2025-06-17 · unverdicted · none · ref 84
Rethinking LLM Ensembling from the Perspective of Mixture Models
cs.LG · 2026-05-01 · unverdicted · none · ref 18 · 2 links
Harnessing Multiple Large Language Models: A Survey on LLM Ensemble
cs.CL · 2025-02-25 · unverdicted · none · ref 70
