MODE decomposes expert selection frequency by modality, filters redundant vision tokens, adds per-modality sensitivity, and uses ILP to assign bit-widths, limiting average loss to 2.9% at W3A16 on MoE-MLLMs.
A is B” fail to learn “B is A
2 Pith papers cite this work. Polarity classification is still indexing.
2
Pith papers citing it
citation-role summary
background 1
citation-polarity summary
fields
cs.LG 2years
2026 2verdicts
UNVERDICTED 2roles
background 1polarities
unclear 1representative citing papers
Diffusion language models develop early-layer collapse around an indispensable super-outlier due to overtraining, resulting in higher compressibility and reversed optimal sparsity patterns versus autoregressive models.
citing papers explorer
-
Layer Collapse in Diffusion Language Models
Diffusion language models develop early-layer collapse around an indispensable super-outlier due to overtraining, resulting in higher compressibility and reversed optimal sparsity patterns versus autoregressive models.