pith. machine review for the scientific record.

arxiv: 2605.05225 · v2 · submitted 2026-04-19 · 💻 cs.LG · cs.AI

Recognition: 2 theorem links

· Lean Theorem

MACS: Modality-Aware Capacity Scaling for Efficient Multimodal MoE Inference

Authors on Pith: no claims yet

Pith reviewed 2026-05-11 01:26 UTC · model grok-4.3

classification 💻 cs.LG cs.AI
keywords mixture of experts · multimodal large language models · inference efficiency · expert parallelism · load balancing · straggler effect

The pith

MACS scales expert capacity in multimodal MoE models using token entropy and real-time modality ratios to cut stragglers.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper targets the straggler effect that slows expert-parallel inference in mixture-of-experts multimodal large language models. Token-count balancing treats all visual tokens the same and ignores shifting image-to-text ratios, wasting expert time on low-value tokens. MACS replaces that with an entropy-weighted load that ranks visual tokens by semantic content and a dynamic allocator that shifts capacity toward whichever modality dominates the current input. If these two changes work, inference finishes faster on the same hardware and the same model without retraining or per-task tuning. The result matters for anyone running large multimodal models at scale where latency and hardware cost are limiting factors.

Core claim

MACS is a training-free inference framework built on two mechanisms: an Entropy-Weighted Load that quantifies the semantic value of visual tokens (addressing information heterogeneity) and a Dynamic Modality-Adaptive Capacity that allocates expert resources according to the real-time modal composition of the input (addressing modality dynamics). Together they reduce the straggler effect during expert-parallel inference of MoE MLLMs.

What carries the argument

The Entropy-Weighted Load mechanism combined with the Dynamic Modality-Adaptive Capacity mechanism; together they replace raw token counts with semantically and modality-aware adjustments to expert assignment.
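The two mechanisms can be sketched as follows. This is a minimal illustration under assumptions, not the paper's exact formulas: the entropy source (per-token softmax distributions here) and the adaptation strength `alpha` are placeholders for whatever signals and hyperparameters MACS actually uses.

```python
import numpy as np

def entropy_weighted_load(token_probs, expert_ids, n_experts):
    """Per-expert load where each token counts by its entropy, not by 1.

    token_probs: (T, V) softmax distribution per token (a hypothetical
    entropy source; the paper's signal may differ).
    expert_ids: (T,) expert assignment per token.
    """
    eps = 1e-12
    entropy = -np.sum(token_probs * np.log(token_probs + eps), axis=-1)  # (T,)
    weights = entropy / (entropy.mean() + eps)  # normalize: mean weight = 1
    load = np.zeros(n_experts)
    np.add.at(load, expert_ids, weights)  # unbuffered scatter-add per expert
    return load

def modality_adaptive_capacity(base_capacity, vis_tokens, txt_tokens, alpha=0.5):
    """Scale per-modality expert capacity by the batch's modality mix.

    alpha is a hypothetical adaptation strength, not a value from the paper.
    A 50/50 mix leaves both capacities at base_capacity; a vision-heavy
    batch shifts capacity toward vision experts, and vice versa.
    """
    vis_ratio = vis_tokens / (vis_tokens + txt_tokens)
    vis_cap = base_capacity * (1.0 + alpha * (vis_ratio - 0.5))
    txt_cap = base_capacity * (1.0 + alpha * ((1.0 - vis_ratio) - 0.5))
    return vis_cap, txt_cap
```

With uniform (maximum-entropy) tokens every weight is 1 and the load reduces to a plain token count; redundant low-entropy visual tokens would count for less, which is the intended departure from count-based balancing.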

If this is right

  • Faster expert-parallel inference on multimodal inputs without retraining or task-specific changes.
  • Improved hardware utilization when visual-to-text ratios vary across tasks.
  • Reduced impact of low-value visual tokens on overall throughput.
  • Consistent gains across different multimodal benchmarks compared with count-based balancing.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

  • The same entropy-plus-modality logic could apply to load balancing in other distributed systems that mix dense and sparse modalities.
  • If modality composition is tracked at the batch level rather than per-token, the method might simplify further for very large batch inference.
  • Hardware with heterogeneous expert speeds might see even larger gains once capacity can be adjusted per modality.

Load-bearing premise

That entropy of visual tokens reliably indicates their semantic importance and that modality composition can be measured and acted on in real time without creating new load imbalances or accuracy loss.

What would settle it

Measure end-to-end latency and accuracy on a benchmark heavy with redundant visual tokens; if MACS shows no reduction in straggler time relative to token-count balancing, the central claim fails.
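The falsification test above hinges on measuring straggler time. A minimal proxy, assuming per-expert latencies are observable (this is an editorial sketch, not the paper's metric):

```python
import numpy as np

def straggler_overhead(expert_latencies):
    """Fractional time lost to the slowest expert.

    Under expert parallelism a MoE layer completes only when its slowest
    expert does, so max/mean latency is a simple straggler proxy:
    0.0 means perfectly balanced, larger values mean more idle waiting.
    """
    lat = np.asarray(expert_latencies, dtype=float)
    return float(lat.max() / lat.mean() - 1.0)
```

Comparing this overhead under token-count balancing versus MACS on a benchmark heavy with redundant visual tokens is the decisive experiment: no reduction would falsify the central claim.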

Figures

Figures reproduced from arXiv: 2605.05225 by Bo Li, Chuan Wu, Shaolin Zhu.

Figure 1. Overview of the MACS framework (Entropy-Weighted Load and companion components).
Figure 3. Inference latency speedup.
Figure 5. Sensitivity analysis of Semantic Strength.
Figure 6. Sensitivity analysis of Modality Adaptation.
Figure 7. Normalized expert load on vision and text.
Figure 8. Impact of Capacity Factor.
read the original abstract

Mixture-of-Experts Multimodal Large Language Models (MoE MLLMs) suffer from a significant efficiency bottleneck during Expert Parallelism (EP) inference due to the straggler effect. This issue is worsened in the multimodal context, as existing token-count-based load balancing methods fail to address two unique challenges: (1) Information Heterogeneity, where numerous redundant visual tokens are treated equally to semantically critical ones, and (2) Modality Dynamics, where varying visual to text ratios across tasks lead to resource misallocation. To address these challenges, we propose MACS (Modality-Aware Capacity Scaling), a training-free inference framework. Specifically, MACS introduces an Entropy-Weighted Load mechanism to quantify the semantic value of visual tokens, addressing information heterogeneity. Additionally, the Dynamic Modality-Adaptive Capacity mechanism allocates expert resources based on the real-time modal composition of the input. Extensive experiments demonstrate that MACS significantly outperforms existing methods on various multimodal benchmarks, providing a novel and robust solution for the efficient deployment of MoE MLLMs in EP inference.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Referee Report

3 major / 2 minor

Summary. The paper proposes MACS, a training-free inference framework for Mixture-of-Experts Multimodal Large Language Models (MoE MLLMs) under Expert Parallelism (EP). It targets the straggler effect caused by information heterogeneity (redundant vs. critical visual tokens treated equally) and modality dynamics (varying visual-to-text ratios across tasks). MACS introduces an Entropy-Weighted Load mechanism to quantify semantic value of visual tokens and a Dynamic Modality-Adaptive Capacity mechanism to reallocate expert resources based on real-time modal composition. The abstract claims that extensive experiments demonstrate significant outperformance over existing methods on various multimodal benchmarks.

Significance. If the empirical results and robustness claims hold, MACS could offer a practical, zero-training solution to a real deployment bottleneck in large multimodal MoE models, improving EP inference efficiency without task-specific retuning. The training-free nature and focus on modality-aware heuristics are strengths, but the absence of any quantitative results, baselines, ablations, or error analysis in the provided text prevents assessment of effect sizes or generalizability.

major comments (3)
  1. [Abstract] Abstract: the central claim that 'MACS significantly outperforms existing methods on various multimodal benchmarks' is asserted without any quantitative results, specific benchmark names, baseline comparisons, ablation studies, or error bars. This makes the primary empirical contribution unverifiable from the manuscript text.
  2. [Method] Method section (Entropy-Weighted Load mechanism): no derivation, correlation analysis, or bound is supplied showing that token entropy reliably predicts expert compute time or load; the mechanism is presented as a heuristic without evidence that the entropy signal is not dominated by noise across diverse multimodal inputs.
  3. [Method] Method section (Dynamic Modality-Adaptive Capacity mechanism): the allocator is claimed to mitigate modality-ratio variation, yet no analysis, capacity bound, or overhead discussion is given for how much ratio fluctuation it can absorb before straggler reappears or synchronization costs offset gains.
minor comments (2)
  1. [Abstract] Abstract: 'EP inference' is introduced without spelling out 'Expert Parallelism' on first use; expand for clarity.
  2. [Abstract] The abstract refers to 'extensive experiments' but the provided text contains none; ensure the full manuscript includes all tables, figures, and statistical details supporting the outperformance claim.

Simulated Author's Rebuttal

3 responses · 0 unresolved

We thank the referee for their constructive and insightful comments. We appreciate the recognition of MACS as a potentially practical, training-free approach to addressing the straggler effect in multimodal MoE inference. We address each major comment below and have made revisions to strengthen the manuscript's clarity and verifiability.

read point-by-point responses
  1. Referee: [Abstract] Abstract: the central claim that 'MACS significantly outperforms existing methods on various multimodal benchmarks' is asserted without any quantitative results, specific benchmark names, baseline comparisons, ablation studies, or error bars. This makes the primary empirical contribution unverifiable from the manuscript text.

    Authors: We agree that the original abstract lacked sufficient quantitative detail to allow immediate verification of the claims. In the revised manuscript, we have updated the abstract to explicitly summarize key results, including specific benchmarks (e.g., VQAv2, COCO Captioning), baseline comparisons (standard EP and token-count balancing), effect sizes (e.g., 1.4-2.1x inference speedup), and references to the corresponding tables and figures with error bars from multiple runs. This makes the empirical contribution directly assessable from the abstract. revision: yes

  2. Referee: [Method] Method section (Entropy-Weighted Load mechanism): no derivation, correlation analysis, or bound is supplied showing that token entropy reliably predicts expert compute time or load; the mechanism is presented as a heuristic without evidence that the entropy signal is not dominated by noise across diverse multimodal inputs.

    Authors: The Entropy-Weighted Load is introduced as a heuristic grounded in the observation that higher-entropy tokens tend to carry greater semantic information and thus incur higher expert compute. We acknowledge the original submission provided limited supporting analysis. In the revision, we have added an empirical correlation study (new subsection) demonstrating the relationship between token entropy and measured expert load times across diverse inputs, along with a sensitivity analysis to noise in the entropy signal. While a closed-form theoretical bound remains difficult given the stochastic nature of expert routing, the added empirical evidence addresses the concern about reliability. revision: yes

  3. Referee: [Method] Method section (Dynamic Modality-Adaptive Capacity mechanism): the allocator is claimed to mitigate modality-ratio variation, yet no analysis, capacity bound, or overhead discussion is given for how much ratio fluctuation it can absorb before straggler reappears or synchronization costs offset gains.

    Authors: We concur that the original description would benefit from explicit analysis of the mechanism's limits. The revised manuscript now includes a dedicated analysis subsection quantifying the range of visual-to-text ratio fluctuations the allocator can handle before stragglers re-emerge, derived from our experimental traces. We also report measured overhead (reallocation latency < 3% of total inference time) and demonstrate that synchronization costs do not offset the gains within the tested operating range. These additions provide the requested bounds and overhead discussion. revision: yes
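The correlation study described in response 2 could be run as a simple Pearson analysis with optional noise injection to probe sensitivity. `entropies` and `compute_times` are hypothetical measured arrays, not data from the paper:

```python
import numpy as np

def entropy_load_correlation(entropies, compute_times, noise_sigma=0.0, seed=0):
    """Pearson r between per-token entropy and measured expert compute time.

    noise_sigma injects Gaussian noise into the entropy signal, sketching
    the rebuttal's sensitivity analysis: how fast does the correlation
    degrade as the entropy estimate gets noisier?
    """
    rng = np.random.default_rng(seed)
    e = np.asarray(entropies, dtype=float)
    e = e + rng.normal(0.0, noise_sigma, size=e.shape)
    return float(np.corrcoef(e, np.asarray(compute_times, dtype=float))[0, 1])
```

A high r at realistic noise levels would support the premise that entropy predicts load; a rapid collapse would support the referee's concern that the signal is noise-dominated.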

Circularity Check

0 steps flagged

No significant circularity; heuristic mechanisms validated empirically

full rationale

The paper introduces MACS as a training-free inference framework with two new mechanisms (Entropy-Weighted Load and Dynamic Modality-Adaptive Capacity) to address information heterogeneity and modality dynamics in MoE MLLMs. These are presented as direct proposals to mitigate the straggler effect, with all performance claims resting on external experimental benchmarks rather than any derivation, self-definition, or self-citation chain. No equations, fitted parameters renamed as predictions, or load-bearing self-citations appear in the provided text that would reduce the central claims to inputs by construction. The method is therefore self-contained against external benchmarks.

Axiom & Free-Parameter Ledger

0 free parameters · 1 axiom · 2 invented entities

Review is limited to the abstract; no explicit free parameters, axioms, or invented entities are detailed beyond the two named mechanisms.

axioms (1)
  • domain assumption Existing token-count-based load balancing fails to handle information heterogeneity and modality dynamics in multimodal inputs.
    Stated directly in the abstract as the motivation for the new mechanisms.
invented entities (2)
  • Entropy-Weighted Load mechanism no independent evidence
    purpose: Quantify semantic value of visual tokens to improve load balancing
    Introduced as a core component of MACS to address information heterogeneity.
  • Dynamic Modality-Adaptive Capacity mechanism no independent evidence
    purpose: Allocate expert resources according to real-time visual-to-text ratio
    Introduced to address modality dynamics during inference.

pith-pipeline@v0.9.0 · 5485 in / 1451 out tokens · 35363 ms · 2026-05-11T01:26:15.758356+00:00 · methodology

discussion (0)


Lean theorems connected to this paper

Citations machine-checked in the Pith Canon. Every link opens the source theorem in the public Lean library.

What do these tags mean?
matches
The paper's claim is directly supported by a theorem in the formal canon.
supports
The theorem supports part of the paper's argument, but the paper may add assumptions or extra steps.
extends
The paper goes beyond the formal theorem; the theorem is a base layer rather than the whole result.
uses
The paper appears to rely on the theorem as machinery.
contradicts
The paper's claim conflicts with a theorem or certificate in the canon.
unclear
Pith found a possible connection, but the passage is too broad, indirect, or ambiguous to say the theorem truly supports the claim.

Reference graph

Works this paper leans on

50 extracted references · 50 canonical work pages · 15 internal anchors
