pith. machine review for the scientific record.

arxiv: 2605.00419 · v1 · submitted 2026-05-01 · 💻 cs.LG · cs.CL

Recognition: unknown

Rethinking LLM Ensembling from the Perspective of Mixture Models

Authors on Pith: no claims yet

Pith reviewed 2026-05-09 20:01 UTC · model grok-4.3

classification 💻 cs.LG cs.CL
keywords LLM ensembling · mixture models · efficient inference · token-level routing · stochastic sampling · generative models · model selection

The pith

LLM ensembles can sample their averaged distribution by stochastically selecting one model per token.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

Conventional ensembling of large language models requires averaging the output probabilities from all models at every token generation step, which demands a separate forward pass through each model. The paper proposes to view the ensemble instead as a mixture model and to generate each next token by first sampling which model to use according to the mixture weights and then using only that model's prediction. This process is shown to be mathematically equivalent to always using the averaged probabilities. It therefore delivers the same ensemble performance while invoking only one model per step and achieving speedups between 1.78x and 2.68x. The mixture perspective also shows that standard ensembling is a fixed-weight special case of token-level routing methods.

Core claim

By reinterpreting the ensemble as a mixture model, ME stochastically selects a single model at each step to generate the next token, thereby avoiding the need to explicitly compute the full ensemble distribution. ME is mathematically equivalent to sampling from the ensemble distribution, but requires invoking only one model, making it 1.78x-2.68x faster than conventional ensemble. Furthermore, this perspective connects LLM ensembling and token-level routing methods, suggesting that LLM ensembling is a special case of routing methods.

What carries the argument

Mixture-model-like Ensemble (ME): stochastic selection of one component model per token generation step according to the mixture weights, which yields samples from the same distribution as explicit averaging.
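To make the mechanism concrete, the sketch below contrasts the two decoding loops on toy next-token distributions. The vocabulary, toy models, and function names are illustrative stand-ins of ours, not the paper's released implementation; real models would return softmax outputs from a forward pass.

```python
import random

# Toy stand-ins for two LLMs: each maps a prefix to a next-token
# distribution over a tiny vocabulary (these fixed tables replace a
# real forward pass and are purely illustrative).
VOCAB = ["a", "b", "<eos>"]

def model_p(prefix):
    return [0.6, 0.3, 0.1]  # slightly prefers "a"

def model_q(prefix):
    return [0.2, 0.6, 0.2]  # slightly prefers "b"

def me_generate(models, weights, max_len=20):
    """Mixture-model-like Ensemble (ME): at every step, pick ONE model
    according to the fixed mixture weights, then sample the next token
    from that model alone. Only the selected model is invoked."""
    prefix = []
    for _ in range(max_len):
        model = random.choices(models, weights=weights, k=1)[0]
        probs = model(prefix)                      # one forward pass
        token = random.choices(VOCAB, weights=probs, k=1)[0]
        if token == "<eos>":
            break
        prefix.append(token)
    return prefix

def averaged_generate(models, weights, max_len=20):
    """Conventional ensemble: invoke every model, average the
    distributions, then sample. Same per-step marginal as ME."""
    prefix = []
    for _ in range(max_len):
        dists = [m(prefix) for m in models]        # one pass per model
        avg = [sum(w * d[i] for w, d in zip(weights, dists))
               for i in range(len(VOCAB))]
        token = random.choices(VOCAB, weights=avg, k=1)[0]
        if token == "<eos>":
            break
        prefix.append(token)
    return prefix

print(me_generate([model_p, model_q], weights=[0.5, 0.5]))
print(averaged_generate([model_p, model_q], weights=[0.5, 0.5]))
```

The only difference between the two loops is where the averaging happens: ME moves the explicit sum over per-model forward passes into the random choice of which model to call, which is what buys the reported speedup.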

If this is right

  • LLM ensembles achieve equivalent performance without computing the averaged distribution explicitly at each step.
  • The computational cost of ensembling scales with only one model invocation rather than the number of models.
  • LLM ensembling is positioned as a special case of token-level routing, enabling integration with routing techniques.
  • More models can be included in ensembles without proportional increases in inference time.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the authors make directly.

  • If mixture weights were allowed to vary with the input or context, ensembles could become adaptive rather than fixed (see the sketch after this list).
  • Similar stochastic selection could be applied to other multi-model generation setups to reduce costs.
  • The equivalence suggests that improvements in routing methods could directly benefit traditional ensembling performance.
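On the first point in the list, a minimal sketch of what context-dependent weights could look like, assuming the same toy setup as the decoding sketch above; the length-based gate is a hypothetical placeholder for a learned router and is not something the paper proposes.

```python
import random

VOCAB = ["a", "b", "<eos>"]
model_p = lambda prefix: [0.6, 0.3, 0.1]  # toy stand-ins, as above
model_q = lambda prefix: [0.2, 0.6, 0.2]

def gated_weights(prefix):
    # Hypothetical gate: shift probability mass toward model_q as the
    # prefix grows; a learned router would replace this heuristic.
    w_q = min(1.0, 0.2 + 0.1 * len(prefix))
    return [1.0 - w_q, w_q]

def routed_generate(models, gate, max_len=20):
    """Token-level routing: like ME, but the selection weights are
    recomputed from the current prefix at every step. A constant gate
    recovers the fixed-weight ensemble discussed in the paper."""
    prefix = []
    for _ in range(max_len):
        model = random.choices(models, weights=gate(prefix), k=1)[0]
        token = random.choices(VOCAB, weights=model(prefix), k=1)[0]
        if token == "<eos>":
            break
        prefix.append(token)
    return prefix

print(routed_generate([model_p, model_q], gated_weights))
```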

Load-bearing premise

The independent stochastic choice of model at each token step according to the mixture weights yields generated sequences whose distribution is identical to that from averaging the model probabilities at every step.
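Spelled out, the premise is a short calculation; the notation below (component models p_i, fixed weights w_i summing to one, per-step model index z_t) is ours and not taken from the paper.

```latex
% Per step: marginalizing over the independently sampled model index z_t
% recovers the averaged conditional.
P_{\mathrm{ME}}(x_t \mid x_{<t})
  = \sum_{i=1}^{M} \Pr(z_t = i)\, p_i(x_t \mid x_{<t})
  = \sum_{i=1}^{M} w_i\, p_i(x_t \mid x_{<t})

% Over a whole sequence: because each z_t is drawn afresh at every step,
% the joint factorizes into exactly these averaged conditionals.
P_{\mathrm{ME}}(x_{1:T})
  = \prod_{t=1}^{T} \sum_{i=1}^{M} w_i\, p_i(x_t \mid x_{<t})
  = P_{\mathrm{avg}}(x_{1:T})
```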

What would settle it

A direct comparison of the probability distributions over generated sequences or of downstream task accuracies between the conventional ensemble method and the ME method on held-out prompts would reveal whether the claimed equivalence holds in practice.
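A cheap, illustrative version of such a check, using single-step toy distributions rather than real held-out prompts (the distributions, weights, and sample count are arbitrary; a real test would use the released code and actual model pairs):

```python
import random
from collections import Counter

VOCAB = ["a", "b", "c"]
DISTS = [[0.7, 0.2, 0.1],   # toy model 1's next-token distribution
         [0.1, 0.3, 0.6]]   # toy model 2's next-token distribution
WEIGHTS = [0.5, 0.5]
N = 200_000

def sample_me():
    # ME: pick a model by weight, then sample a token from it.
    d = DISTS[random.choices([0, 1], weights=WEIGHTS, k=1)[0]]
    return random.choices(VOCAB, weights=d, k=1)[0]

def sample_avg():
    # Conventional ensemble: average the distributions, then sample.
    avg = [sum(w * d[i] for w, d in zip(WEIGHTS, DISTS))
           for i in range(len(VOCAB))]
    return random.choices(VOCAB, weights=avg, k=1)[0]

me_counts = Counter(sample_me() for _ in range(N))
avg_counts = Counter(sample_avg() for _ in range(N))

# Total variation distance between the two empirical distributions;
# it should shrink toward zero as N grows if the equivalence holds.
tv = 0.5 * sum(abs(me_counts[t] - avg_counts[t]) / N for t in VOCAB)
print({t: round(me_counts[t] / N, 3) for t in VOCAB})
print({t: round(avg_counts[t] / N, 3) for t in VOCAB})
print("total variation distance:", round(tv, 4))
```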

Figures

Figures reproduced from arXiv: 2605.00419 by Chonghan Liu, Jiale Fu, Joey Tianyi Zhou, Peijun Wu, Xu Yang, Yuchu Jiang.

Figure 1
Figure 1: Comparison of (a) conventional ensemble and (b) mixture-model-like ensemble. Mp and Mq denote two distinct LLMs employed in the ensemble, with p(x) and q(x) indicating their respective output distributions.
Figure 2
Figure 2: When ensembling models with different sizes, the trend of ME's performance and speed changing with λ. Here, λ = 0 indicates that only the smaller model is used for inference, while λ = 1 indicates that only the larger model is used.
Figure 3
Figure 3: Speed comparison of ME and other baselines on three other common device types, using three model pairs of varying sizes.
Figure 4
Figure 4: Ablation study on λ. k is set to 5. Each point represents the mean of five independent runs, with the shaded bands showing the 95% confidence intervals.
read the original abstract

Model ensembling is a well-established technique for improving the performance of machine learning models. Conventionally, this involves averaging the output distributions of multiple models and selecting the most probable label. This idea has been naturally extended to large language models (LLMs), yielding improved performance but incurring substantial computational cost. This inefficiency stems from directly applying conventional ensemble implementation to LLMs, which require a separate forward pass for each model to explicitly compute the ensemble distribution. In this paper, we propose the Mixture-model-like Ensemble (ME). By reinterpreting the ensemble as a mixture model, ME stochastically selects a single model at each step to generate the next token, thereby avoiding the need to explicitly compute the full ensemble distribution. ME is mathematically equivalent to sampling from the ensemble distribution, but requires invoking only one model, making it 1.78x-2.68x faster than conventional ensemble. Furthermore, this perspective connects LLM ensembling and token-level routing methods, suggesting that LLM ensembling is a special case of routing methods. Our findings open new avenues for efficient LLM ensembling and motivate further exploration of token-level routing strategies for LLMs. Our code is available at https://github.com/jialefu/Mixture-model-like-Ensemble/.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Referee Report

1 major / 1 minor

Summary. The paper proposes the Mixture-model-like Ensemble (ME) for LLMs by reinterpreting conventional output averaging as sampling from a mixture model. At each autoregressive step, ME stochastically selects one constituent model according to the fixed mixture weights to generate the next token. It claims this procedure is mathematically equivalent to sampling from the ensemble distribution (i.e., the per-token conditional is exactly the weighted average of the individual model distributions) while requiring only a single model invocation per token, yielding reported speedups of 1.78x–2.68x. The work further positions LLM ensembling as a special case of token-level routing and releases code for reproducibility.

Significance. If the equivalence is rigorously established, the contribution offers a lightweight, distribution-preserving method to obtain ensemble benefits at reduced cost, which is practically relevant for deploying multiple LLMs. The explicit link between ensembling and routing methods supplies a useful conceptual bridge that could stimulate hybrid algorithms. The open-source code is a clear strength supporting verification and extension.

major comments (1)
  1. The central claim of exact distributional equivalence rests on the per-step conditional P(token_t | prefix) = sum_i w_i p_i(token_t | prefix). While this equality holds by definition of the mixture, the manuscript should supply an explicit derivation (e.g., in the method section) showing that the joint probability of an entire sequence under the autoregressive ME process equals that under step-wise ensemble averaging, including confirmation that no additional normalization or bias correction is required across tokens.
minor comments (1)
  1. The reported speedup range (1.78x–2.68x) is given in the abstract without reference to the specific models, hardware, batch sizes, or implementation details used; these should be stated clearly in the experimental section or a table.

Simulated Author's Rebuttal

1 response · 0 unresolved

We thank the referee for the positive assessment of our work and for the constructive suggestion regarding the distributional equivalence. We address the comment below and will revise the manuscript accordingly.

read point-by-point responses
  1. Referee: The central claim of exact distributional equivalence rests on the per-step conditional P(token_t | prefix) = sum_i w_i p_i(token_t | prefix). While this equality holds by definition of the mixture, the manuscript should supply an explicit derivation (e.g., in the method section) showing that the joint probability of an entire sequence under the autoregressive ME process equals that under step-wise ensemble averaging, including confirmation that no additional normalization or bias correction is required across tokens.

    Authors: We agree that an explicit derivation of the joint sequence probability would strengthen the presentation. In the revised manuscript we will add a short subsection (or appendix) in the Method section that proceeds by induction: the per-token mixture conditional is exactly the weighted average by definition, and the joint probability of any full sequence is therefore the product over tokens of these mixture conditionals. Because the mixture weights are fixed and sum to one at every step independently, the product requires no additional normalization or bias correction. We will also note that this holds for any prefix length, confirming exact equivalence to step-wise ensemble averaging. revision: yes

Circularity Check

0 steps flagged

No significant circularity in the derivation chain

full rationale

The paper's core equivalence—that stochastic per-token selection from a mixture of LLMs yields the identical autoregressive sequence distribution as explicit step-wise averaging of output probabilities—follows directly from the law of total probability applied to conditional token distributions. For any prefix, the next-token marginal under ME is exactly the weighted sum of individual model conditionals, so the joint sequence probability is the product of identical terms with no additional normalization or bias terms required. This is a definitional identity under the mixture model, not a fitted prediction or result smuggled in via self-citation. No load-bearing steps reduce to self-referential definitions, fitted inputs renamed as predictions, or uniqueness theorems imported from the authors' prior work. The derivation is self-contained against the probabilistic definition of mixtures and requires no external verification to hold.

Axiom & Free-Parameter Ledger

0 free parameters · 1 axioms · 0 invented entities

The central claim rests on the standard assumption that a mixture model can be sampled by selecting components according to their weights; no new free parameters, axioms beyond domain conventions, or invented entities are introduced in the abstract.

axioms (1)
  • domain assumption Ensemble output distribution can be realized by sampling a single component model according to mixture weights at each generation step
    This is the core reinterpretation that enables the single-model-per-step procedure.

pith-pipeline@v0.9.0 · 5529 in / 1262 out tokens · 46115 ms · 2026-05-09T20:01:44.961132+00:00 · methodology

discussion (0)


Reference graph

Works this paper leans on

18 extracted references · 16 canonical work pages · 8 internal anchors

  1. [1] Belofsky, J. Token-level adaptation of LoRA adapters for downstream task generalization. In Proceedings of the 2023 6th Artificial Intelligence and Cloud Computing Conference, pp. 168–172.
  2. [2] Chen, Z., Li, J., Chen, P., Li, Z., Sun, K., Luo, Y., Mao, Q., Yang, D., Sun, H., and Yu, P. S. Harnessing multiple large language models: A survey on LLM ensemble. arXiv preprint arXiv:2502.18036.
  3. [3] Clark, P., Cowhey, I., Etzioni, O., Khot, T., Sabharwal, A., Schoenick, C., and Tafjord, O. Think you have solved question answering? Try ARC, the AI2 Reasoning Challenge. arXiv preprint arXiv:1803.05457.
  4. [4] Cobbe, K., Kosaraju, V., Bavarian, M., Chen, M., Jun, H., Kaiser, L., Plappert, M., Tworek, J., Hilton, J., Nakano, R., et al. Training verifiers to solve math word problems. arXiv preprint arXiv:2110.14168.
  5. [5] Dai, D., Deng, C., Zhao, C., Xu, R., Gao, H., Chen, D., Li, J., Zeng, W., Yu, X., Wu, Y., et al. DeepSeekMoE: Towards ultimate expert specialization in mixture-of-experts language models. arXiv preprint arXiv:2401.06066.
  6. [6] DeepSeek-AI. DeepSeek LLM: Scaling open-source language models with longtermism. arXiv preprint arXiv:2401.02954.
  7. [7] Hoang, H., Khayrallah, H., and Junczys-Dowmunt, M. On-the-fly fusion of large language models and machine translation. arXiv preprint arXiv:2311.08306.
  8. [8] Mistral 7B. URL https://arxiv.org/abs/2310.06825. Jiang, A. Q., Sablayrolles, A., Roux, A., Mensch, A., Savary, B., Bamford, C., Chaplot, D. S., Casas, D. d. l., Hanna, E. B., Bressand, F., et al. Mixtral of experts. arXiv preprint arXiv:2401.04088.
  9. [9] Li, T., Liu, Q., Pang, T., Du, C., Guo, Q., Liu, Y., and Lin, M. Purifying large language models by ensembling a small language model. arXiv preprint arXiv:2402.14845.
  10. [10] Lu, J., Pang, Z., Xiao, M., Zhu, Y., Xia, R., and Zhang, J. Merge, ensemble, and cooperate! A survey on collaborative strategies in the era of large language models. arXiv preprint arXiv:2407.06089.
  11. [11] Mavromatis, C., Karypis, P., and Karypis, G. Pack of LLMs: Model fusion at test-time via perplexity optimization. arXiv preprint arXiv:2404.11531.
  12. [12] Phan, B., Amos, B., Gat, I., Havasi, M., Muckley, M., and Ullrich, K. Exact byte-level probabilities from tokenized language models for FIM-tasks and model ensembles. arXiv preprint arXiv:2410.09303.
  13. [13] Suzgun, M., Scales, N., Schärli, N., Gehrmann, S., Tay, Y., Chung, H. W., Chowdhery, A., Le, Q. V., Chi, E. H., Zhou, D., and Wei, J. Challenging BIG-Bench tasks and whether chain-of-thought can solve them. arXiv preprint arXiv:2210.09261.
  14. [14] URL https://huggingface.co/NousResearch/Nous-Hermes-2-Mistral-7B-DPO. Wang, G., Cheng, S., Zhan, X., Li, X., Song, S., and Liu, Y. OpenChat: Advancing open-source language models with mixed-quality data. arXiv preprint arXiv:2309.11235.
  15. [15] Xu, Y., Lu, J., and Zhang, J. Bridging the gap between different vocabularies for LLM ensemble. arXiv preprint arXiv:2404.09492.
  16. [16] Yang, A., Yang, B., Zhang, B., Hui, B., Zheng, B., Yu, B., Li, C., Liu, D., Huang, F., Wei, H., et al. Qwen2.5 technical report. arXiv preprint arXiv:2412.15115, 2024. Yang, A., Zhang, B., Hui, B., Gao, B., Yu, B., Li, C., Liu, D., Tu, J., Zhou, J., Lin, J., et al. Qwen2.5-Math technical report: Toward mathematical expert model via self-improvement.
  17. [17] Yu, Y.-C., Kuo, C. C., Ziqi, Y., Yucheng, C., and Li, Y.-S. Breaking the ceiling of the LLM community by treating token generation as a classification for ensembling. In Findings of the Association for Computational Linguistics: EMNLP 2024, pp. 1826–1839.
  18. [18] Zheng, W., Chen, Y., Zhang, W., Kundu, S., Li, Y., Liu, Z., Xing, E. P., Wang, H., and Yao, H. CITER: Collaborative inference for efficient large language model decoding with token-level routing. arXiv preprint arXiv:2502.01976.