SpecFuse: Ensembling Large Language Models via Next-Segment Prediction

Bo Lv; Chen Tang; Nayu Liu; Ping Luo; Xin Liu; Yue Yu

arxiv: 2412.07380 · v3 · pith:QYXF3TD2new · submitted 2024-12-10 · 💻 cs.CL · cs.AI

Bo Lv, Nayu Liu, Chen Tang, Xin Liu, Yue Yu, Ping Luo

keywords models · ensemble · model · during · performance · specem · ensembling · language

Ensembles of generative large language models (LLMs) are a promising way to compensate for individual model limitations by integrating the strengths of different LLMs. Existing LLM ensemble methods, however, face limitations such as first-token delay and challenges in long-range semantic collaboration between models. Moreover, they typically assume equal voting weights for all models during ensembling, ignoring task-specific performance differences among models. In this work, we propose SpecEM, a training-free, plug-and-play LLM ensemble framework that dynamically adjusts each model's contribution in real time based on task performance. Inspired by speculative decoding, SpecEM iteratively performs drafting and verification, allowing models to collaborate semantically at the segment level for integrated output. Furthermore, we introduce an online feedback mechanism with multiplicative weight updates, where each model's voting weight is adjusted on-the-fly according to how often it outperforms others during the verification stage, ensuring that stronger models exert greater influence during ensembling. Experimental results on five LLM families (ranging from 7B to 72B parameters) and six benchmark datasets, spanning open-domain instruction following, reasoning, and commonsense, demonstrate consistent performance improvements compared to state-of-the-art LLM ensemble methods. Our code is available at https://github.com/lvbotenbest/SpecEM.
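
The abstract describes segment-level verification by weighted voting and an online multiplicative weight update, but gives no formulas. The sketch below is only an illustration of that general idea, not the authors' implementation: the function names `select_segment` and `update_weights`, the score matrix layout, the step size `eta`, and the choice to boost only the model whose draft won are all assumptions made here.

```python
from typing import List, Sequence


def select_segment(scores: Sequence[Sequence[float]], weights: Sequence[float]) -> int:
    """Pick the winning candidate segment by weighted vote.

    scores[v][c] is the (hypothetical) score that verifier model v assigns
    to the candidate segment drafted by model c.
    """
    n_candidates = len(scores[0])
    totals = [
        sum(weights[v] * scores[v][c] for v in range(len(weights)))
        for c in range(n_candidates)
    ]
    return max(range(n_candidates), key=totals.__getitem__)


def update_weights(weights: Sequence[float], winner: int, eta: float = 0.5) -> List[float]:
    """Multiplicative update: scale the winner's weight by (1 + eta), then renormalize."""
    boosted = [w * (1.0 + eta) if i == winner else w for i, w in enumerate(weights)]
    total = sum(boosted)
    return [w / total for w in boosted]


# One drafting/verification round with three models and uniform initial weights.
weights = [1 / 3, 1 / 3, 1 / 3]
scores = [
    [0.2, 0.9, 0.4],  # scores given by model 0 to the drafts of models 0, 1, 2
    [0.8, 0.1, 0.3],  # scores given by model 1
    [0.3, 0.6, 0.5],  # scores given by model 2
]
winner = select_segment(scores, weights)
weights = update_weights(weights, winner)
```

Under these assumptions, a model whose drafts win repeatedly accumulates weight geometrically, so its votes count for more in later rounds, which matches the abstract's stated goal that stronger models exert greater influence during ensembling.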

Forward citations

Cited by 2 Pith papers

Reviewed papers in the Pith corpus that reference this work. Sorted by Pith novelty score.

Sampling from Your Language Model One Byte at a Time
cs.CL 2025-06 unverdicted novelty 7.0

An inference-time technique turns BPE-based LMs into byte- or character-level models, solving the prompt boundary problem while unifying vocabularies across different tokenizers.

Harnessing Multiple Large Language Models: A Survey on LLM Ensemble
cs.CL 2025-02 unverdicted novelty 2.0

A systematic survey of LLM ensemble methods organized into a taxonomy of ensemble-before-inference, ensemble-during-inference, and ensemble-after-inference stages, with review of benchmarks, applications, and future d...