Training matryoshka mixture-of-experts for elastic inference-time expert utilization

Yaoxiang Wang, Qingguo Hu, Yucheng Ding, Ruizhe Wang, Yeyun Gong, Jian Jiao, Yelong Shen, Peng Cheng, Jinsong Su · 2025 · arXiv 2509.26520

4 Pith papers cite this work. Polarity classification is still being indexed.

citation-role summary

background 1

citation-polarity summary

background 1

representative citing papers

FlexMoE: One-for-All Nested Intra-Expert Pruning for MoE Language Models

cs.LG · 2026-06-26 · unverdicted · novelty 6.0

FlexMoE produces nested pruned subnetworks for MoE LLMs across budgets via channel importance ranking and discrete action learning, plus one mid-budget recovery fine-tune, retaining 99.8% performance at 50% expert parameter pruning.

PARCEL: Pool-Anchored Resampling with Conditioned Elastic Queries for Efficient Vision-Language Understanding

cs.CV · 2026-05-28 · unverdicted · novelty 6.0

PARCEL is a new visual tokenization architecture combining pool-anchored resampling with conditioned elastic queries to enhance performance-efficiency tradeoffs in LVLMs over prior matryoshka methods.

Elastic Attention Cores for Scalable Vision Transformers

cs.CV · 2026-05-12 · unverdicted · novelty 6.0

VECA learns effective visual representations using core-periphery attention where patches interact exclusively via a resolution-invariant set of learned core embeddings, achieving linear O(N) complexity while maintaining competitive performance.

m3BERT: A Modern, Multi-lingual, Matryoshka Bidirectional Encoder

cs.CL · 2026-05-19 · unverdicted · novelty 4.0

m3BERT uses a three-stage Matryoshka pretraining approach on a bidirectional encoder to support variable embedding sizes while outperforming prior models on large-scale retrieval tasks.

citing papers explorer

FlexMoE: One-for-All Nested Intra-Expert Pruning for MoE Language Models

cs.LG · 2026-06-26 · unverdicted · none · ref 16
